2004/09/07
Detecting and Partitioning of Data Objects in Complex Web Pages
Shiren Ye and Tat-Seng Chua
- In Web Intelligence '04: available at http://www.comp.nus.edu.sg/~yesr/webmining/Wr2290_ye_s.pdf
- Read (4 page version)
- Related to: information extraction, PARCELS, web page cleaning
Uses a tree-based kernel (?) to calculate the similarity of a page to a corpus of web pages, working over the tree structure of each page's DOM. They define a novelty value to distinguish the "data" portion of a web page from the "non-data" portion. Further processing delimits the data portion into records, but I will not focus on that aspect of the work in this summary.
Their similarity metric uses both attributes (think HTML tags) and the text of the HTML node to calculate the similarity of DOM tree nodes. I'm not sure I am deciphering the formulas for novelty and repeatability correctly, or exactly how they are calculated; better to ask Shiren or Tat-Seng about this...
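To make the "attributes plus text" idea concrete for myself, here is a minimal sketch of a node-level similarity. It is not the paper's formula (which I could not fully decipher): the `Node` class, the Jaccard overlap, and the `weight` parameter are all my own assumptions, chosen only to show how a structural signal and a textual signal might be blended.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node: a tag, its attributes, its text."""
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def node_similarity(x: Node, y: Node, weight: float = 0.5) -> float:
    """Blend structural and textual overlap (weight is an assumed parameter)."""
    # Structural signal: the tag name plus every (attribute, value) pair.
    struct_x = {x.tag} | set(x.attrs.items())
    struct_y = {y.tag} | set(y.attrs.items())
    structural = jaccard(struct_x, struct_y)
    # Textual signal: lowercased whitespace-separated tokens of the node text.
    textual = jaccard(set(x.text.lower().split()), set(y.text.lower().split()))
    return weight * structural + (1 - weight) * textual
```

Two table cells with the same tag and class but different prices score high on the structural part and lower on the textual part, which is roughly the behaviour one would want when looking for repeated record templates.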
Food for thought: I wonder how this work relates to the R measure introduced at SIGIR '03 by Khmelev and Teahan. It looks like a sort of tree-kernel-based R similarity (but without the efficiency gains?).
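For comparison, a brute-force sketch of a repetition-based score over plain strings, in the spirit of the R measure as I remember it: for each suffix of the text, take the length of its longest prefix that also occurs somewhere in the collection, then normalize by the largest possible total. The function names and the exact normalization are my assumptions and should be checked against the SIGIR paper; this version also has none of the suffix-structure efficiency of the original.

```python
def longest_repeated_prefix(suffix: str, corpus: list[str]) -> int:
    """Length of the longest prefix of `suffix` found in any corpus document."""
    best = 0
    for doc in corpus:
        # Only try to beat the best length seen so far: if a longer prefix
        # occurs in this document, all of its shorter prefixes do too.
        k = best
        while k < len(suffix) and suffix[: k + 1] in doc:
            k += 1
        best = max(best, k)
    return best


def repetition_score(text: str, corpus: list[str]) -> float:
    """Normalized sum of repeated-prefix lengths over all suffixes of `text`."""
    n = len(text)
    if n == 0:
        return 0.0
    total = sum(longest_repeated_prefix(text[i:], corpus) for i in range(n))
    # The maximum total is n + (n - 1) + ... + 1 = n(n + 1) / 2.
    return 2.0 * total / (n * (n + 1))
```

A text that appears verbatim in the collection scores 1.0 and a text sharing no characters with it scores 0.0. A tree analogue would presumably replace "suffix" and "prefix" with subtrees of the DOM, which is where the comparison with this paper comes in.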
