
Please use this identifier to cite or link to this item: http://hdl.handle.net/1783.1/3172
Title: Domain-based data integration for Web databases
Authors: Su, Weifeng
Issue Date: 2007
Abstract: Web databases are an important part of today's Web, and 80% of them are structured databases. To help users retrieve relevant records from different Web databases simultaneously, we propose a simultaneous querying system, called SIM-querying, which comprises three components: a query interface integrator, a data extractor and a result integrator. For each component, a novel method is presented that performs its function automatically. In the query interface integrator, a holistic schema matching method, HSM, takes advantage of the attribute occurrence patterns in multiple query interfaces to find the attributes that match across different interfaces within a domain. In the data extractor, a domain-based data extraction method, ODE, is presented. In ODE, a domain ontology is first learned from the information overlap and schema matching in the query results and query interfaces of different Web databases within the domain; the ontology is then used to automatically extract the data encoded in the result HTML pages. In the result integrator, a new duplicate detection method, UDD, identifies the duplicates that exist in the query results from different Web databases. In UDD, a set of negative records is first constructed based on two observations about the query results of Web databases; then, starting from the negative records, an iterative algorithm identifies the duplicates from different Web databases. Experimental results show that each of these novel methods can achieve very high precision and outperform existing methods in the context of Web databases.
Description: Thesis (Ph.D.)--Hong Kong University of Science and Technology, 2007
xii, 138 leaves : ill. ; 30 cm
HKUST Call Number: Thesis CSED 2007 Su
URI: http://hdl.handle.net/1783.1/3172
Appears in Collections: CSE Doctoral Theses
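
Illustrative note: the abstract above says HSM uses attribute occurrence patterns across query interfaces, but it does not give the algorithm. As a rough, toy illustration only of what such a pattern might look like, the Python sketch below flags attribute pairs that are each common across a domain's query interfaces yet never appear together in the same interface. The heuristic itself, the function name candidate_matches, the min_support parameter and the sample interfaces are all assumptions made for illustration and are not taken from the thesis.

```python
from collections import Counter
from itertools import combinations


def candidate_matches(interfaces, min_support=2):
    """Toy heuristic (an assumption, not the thesis's HSM method).

    Return attribute pairs where each attribute appears in at least
    `min_support` interfaces but the two never appear in the same one.
    """
    occurs = Counter()    # attribute -> number of interfaces containing it
    cooccurs = Counter()  # (a, b) with a < b -> number of interfaces containing both
    for attrs in interfaces:
        unique = sorted(set(attrs))
        occurs.update(unique)
        cooccurs.update(combinations(unique, 2))

    frequent = sorted(a for a, n in occurs.items() if n >= min_support)
    return [(a, b) for a, b in combinations(frequent, 2) if cooccurs[(a, b)] == 0]


# Hypothetical query interfaces from a book-search domain.
interfaces = [
    ["title", "author", "isbn", "publisher"],
    ["title", "writer", "isbn"],
    ["title", "author", "publisher"],
    ["title", "writer", "publisher", "isbn"],
]
print(candidate_matches(interfaces))
```

On this made-up input, "author" and "writer" are the only frequent attributes that never share an interface, so they are the only pair reported. A real matcher would need a scoring scheme that tolerates noise rather than a strict zero co-occurrence test.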


All items in this Repository are protected by copyright, with all rights reserved.