Yahoo! Bigthinker India Series

I attended the Yahoo! Big Thinker series session today. This was the last event of the series for 2009. The topic of today’s talk was:

“Building Knowledge bases from the Web”

The Web is a vast repository of human knowledge. A grand challenge is to mine the Web to build comprehensive databases of entities (e.g., people, places, things), relationships and facts. Building such knowledge bases involves four key steps:

1. Content Acquisition.

2. Information extraction from Web pages.

3. De-duplication of extracted information.

4. Integration of information for the same entity.
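Steps 3 and 4 can be illustrated with a small sketch. This is not the method presented in the talk, only a minimal example under stated assumptions: records are plain dictionaries, two records refer to the same entity when their normalized names are equal, and the first non-empty value wins for each field. The function names (`normalize`, `integrate`) and the sample records are made up for illustration.

```python
from collections import defaultdict


def normalize(name):
    """Hypothetical matching key: lowercase, drop punctuation, collapse spaces."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def integrate(records):
    """Group records by normalized name, then merge each group into one record."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize(rec["name"])].append(rec)

    merged = []
    for recs in groups.values():
        combined = {}
        for rec in recs:
            for field, value in rec.items():
                # Assumption: keep the first non-empty value seen for a field.
                if value and not combined.get(field):
                    combined[field] = value
        merged.append(combined)
    return merged


# Made-up sample data: the first two records describe the same hotel.
records = [
    {"name": "Taj Mahal Hotel", "city": "Mumbai", "phone": ""},
    {"name": "taj mahal hotel.", "city": "", "phone": "022-5550100"},
    {"name": "Leela Palace", "city": "Bangalore", "phone": ""},
]

result = integrate(records)
for rec in result:
    print(rec)
```

Real systems need much fuzzier matching than exact equality on a normalized name, but the shape is the same: first decide which records belong together, then combine their fields.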

I learnt about some techniques for implementing the above from this talk. Some of my notes:

The Deep Web is about 500 times the size of the Surface Web, and its content is embedded in billions of pages. The structure of most company websites changes frequently, for promotions, a new look and feel, and so on. There are also tail sites, which have diverse structures.

Information extraction depends on various factors: the properties of the content, wrapper induction, page signatures, vectors of shingles, and the pages annotated per site. Wrapper induction has a few limitations. If the page layout (structure) changes, the induced wrapper no longer works. Most sites today are also noisy, so the values matched by a wrapper can include noisy matches. By a noisy match I mean content such as ‘people who visited site X have also visited site Y’, or the many links to hotels and restaurants shown while one is booking a ticket on a travel site.

During the integration phase, multiple records for an entity are merged into a single record.
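To make the idea of a page signature built from shingles a bit more concrete, here is a minimal sketch. I did not note down the exact method used in the talk, so everything here is an assumption: the signature is the set of k-shingles (runs of k consecutive tags) over the page's opening tags, and two pages are compared with Jaccard similarity. A set is used instead of a vector for simplicity. The names `TagCollector`, `shingles` and `jaccard`, and the sample pages, are made up for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the sequence of opening tag names, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def shingles(html, k=3):
    """Return the set of k-shingles over the page's tag sequence."""
    collector = TagCollector()
    collector.feed(html)
    tags = collector.tags
    return {tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Made-up pages: A and B share a template, C uses a different layout.
page_a = "<html><body><div><h1>Hotel A</h1><p>Pune</p><p>Rs 2000</p></div></body></html>"
page_b = "<html><body><div><h1>Hotel B</h1><p>Goa</p><p>Rs 3500</p></div></body></html>"
page_c = "<html><body><table><tr><td>Hotel C</td><td>Delhi</td></tr></table></body></html>"

sim_ab = jaccard(shingles(page_a), shingles(page_b))
sim_ac = jaccard(shingles(page_a), shingles(page_c))
print(sim_ab, sim_ac)
```

Pages with the same template get a high similarity even though their text differs, so one wrapper could plausibly serve the whole group. A low similarity against a site's older pages is one possible signal that the layout has changed and the wrapper needs to be learned again.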

