Wrapper induction based on nested pattern discovery

Authors Wang, Jiying
Lochovsky, Frederick H.
Issue Date 2002
Summary One of the most difficult issues in information extraction from the World Wide Web is the automatic generation of wrappers that can extract relevant data objects embedded in semi-structured HTML pages. Wrapper generation requires that (a) the relevant data for extraction (the data-rich section) in a web page be identified and (b) a pattern be constructed that represents the structure of the data objects in the data-rich section. This pattern can then be used to extract the data objects in a web page for subsequent querying. To address the first problem, a novel algorithm, called data-rich section extraction (DSE), is employed to identify the data-rich section of HTML pages. The DSE algorithm acts as a pre-processing "clean-up" step that improves the accuracy of the generated wrappers. To address the second problem, a new concept, the C-repeated pattern, is utilized to identify plain or repeated nested data structures in HTML pages. A web page is treated as a token sequence, and repeated substrings are discovered from it by building a token suffix-tree from the token sequence. By iteratively applying the extraction process, we build a pattern-tree, which represents the hierarchical relationship between discovered patterns, and obtain a (regular expression) wrapper that is used to extract both plain- and nested-structured data from HTML pages. Our approach is fast and fully automatic, with no human involvement, and experiments show that the discovered patterns achieve high accuracy and retrieval rates when matched against web pages to extract data object instances.
Language English
Format Technical report
Files in this item:
File Description Size Format
tr02-27.pdf 308840 B Adobe PDF
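
The summary's idea of treating a web page as a token sequence and looking for repeated substrings can be illustrated with a minimal sketch. This is not the report's implementation: it swaps the token suffix-tree for naive n-gram counting (simpler, but slower on long pages), and it does not reproduce the DSE step, the C-repeated pattern definition, or the pattern-tree. The function names `tokenize`, `repeated_substrings`, and `count_matches`, and the choice to collapse all text into a single `TEXT` token, are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Splits HTML into tags and the text runs between them.
PIECE_RE = re.compile(r"<[^>]+>|[^<]+")
# Captures an optional closing slash and the tag name.
TAG_NAME_RE = re.compile(r"<\s*(/?)\s*([A-Za-z0-9]+)")


def tokenize(html):
    """Map HTML to a token sequence: one token per tag, 'TEXT' per text run."""
    tokens = []
    for piece in PIECE_RE.findall(html):
        if piece.startswith("<"):
            m = TAG_NAME_RE.match(piece)
            if m:  # skips comments and doctype declarations
                tokens.append("<" + m.group(1) + m.group(2).lower() + ">")
        elif piece.strip():  # ignore whitespace-only text
            tokens.append("TEXT")
    return tokens


def repeated_substrings(tokens, min_len=2, min_count=2):
    """Return {token tuple: start positions} for substrings seen min_count+ times.

    Naive n-gram counting stands in for the suffix-tree used in the report.
    """
    found = {}
    for n in range(min_len, len(tokens) // min_count + 1):
        positions = defaultdict(list)
        for i in range(len(tokens) - n + 1):
            positions[tuple(tokens[i:i + n])].append(i)
        for gram, starts in positions.items():
            if len(starts) >= min_count:
                found[gram] = starts
    return found


def count_matches(tokens, pattern):
    """Use a discovered pattern as a simple regular-expression wrapper.

    The token sequence is joined with spaces and the pattern is matched
    literally; the result is the number of non-overlapping occurrences.
    """
    joined = " ".join(tokens)
    regex = re.compile(re.escape(" ".join(pattern)))
    return len(regex.findall(joined))
```

For a page such as `<ul><li><b>A</b>1</li><li><b>B</b>2</li></ul>`, calling `repeated_substrings(tokenize(page))` returns, among shorter fragments, the six-token list-item structure running from `<li>` to `</li>`, which is the kind of repeated record a wrapper would then be built around.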