|Title: ||Multi-schema entity resolution|
|Authors: ||Huang, Qiong|
|Issue Date: ||2008 |
|Abstract: ||Entity resolution (ER) is the problem of identifying and merging records judged to represent the same real-world entity. Most previous ER approaches assume a unified (or global) schema under which all records are compared and merged on a field-by-field basis. We consider the multi-schema ER problem, in which records come from multiple sources with different schemas. A prime example of multi-schema ER is information integration over the deep web, where the goal is to integrate data from heterogeneous sources.
In this thesis, we formalize the multi-schema ER problem, investigate properties that hold in a unified-schema setting but not in a multi-schema setting, and identify the resolution conflicts that can occur when previous ER approaches are applied in a multi-schema setting. We then propose the validity-ensured and order-sensitive (VEOS) algorithm, which is free from such conflicts and can also take advantage of order scheduling to improve accuracy.
We identify schema-level and data-level criteria for distinguishing the more reliable comparisons, so that performing those comparisons first yields a more accurate result. To leverage this information, we construct a confidence graph on which our scheduling algorithm is built. Our experiments on real online shopping data show that (1) our scheduling algorithm is very effective in improving accuracy, and (2) VEOS with scheduling outperforms other methods in both accuracy and efficiency.|
|Description: ||Thesis (M.Phil.)--Hong Kong University of Science and Technology, 2008|
ix, 56 leaves : ill. ; 30 cm
HKUST Call Number: Thesis CSED 2008 Huang
|Appears in Collections:||CSE Master Theses |
All items in this Repository are protected by copyright, with all rights reserved.