Please use this identifier to cite or link to this item: http://hdl.handle.net/1783.1/4232

Multi-schema entity resolution

Authors Huang, Qiong
Issue Date 2008
Summary Entity resolution (ER) is the problem of identifying and merging records judged to represent the same real-world entity. Most previous ER approaches assume a unified (or global) schema under which all records are compared and merged on a field-by-field basis. We consider the multi-schema ER problem, in which records come from multiple sources with different schemas. A prime example of multi-schema ER is information integration over the deep web, where the goal is to integrate data from heterogeneous sources. In this thesis, we formalize the multi-schema ER problem, investigate properties that hold in a unified-schema setting but not in a multi-schema setting, and identify the resolution conflicts that can arise when previous ER approaches are applied in a multi-schema setting. We then propose the validity-ensured and order-sensitive (VEOS) algorithm, which is free from such conflicts and, at the same time, can take advantage of order scheduling to improve accuracy. We identify schema-level and data-level criteria that distinguish the more reliable comparisons, so that comparing them first yields a more accurate result. To leverage this information, we construct a confidence graph on which our scheduling algorithm is built. Our experiments, using real online shopping data, show that (1) our scheduling algorithm is very effective in improving accuracy, and (2) VEOS with scheduling outperforms other methods in both accuracy and efficiency.
Note Thesis (M.Phil.)--Hong Kong University of Science and Technology, 2008
Language English
Format Thesis
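
The summary's idea of comparing the more reliable record pairs first can be illustrated with a small sketch. This is not the thesis's VEOS algorithm or its confidence graph; it is a simplified greedy stand-in under stated assumptions. The source names (`shopA`, `shopB`), the field mapping, and the confidence heuristic (the number of attributes two records share) are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical mapping from each source's own field names to shared attribute names.
FIELD_MAP = {
    "shopA": {"title": "name", "maker": "brand", "cost": "price"},
    "shopB": {"product": "name", "brand": "brand"},
}


def normalize(source, record):
    """Rename a record's fields to shared attribute names and normalize values."""
    mapping = FIELD_MAP[source]
    return {mapping[k]: str(v).strip().lower() for k, v in record.items() if k in mapping}


def shared(a, b):
    """Attributes present in both records."""
    return set(a) & set(b)


def is_match(a, b):
    """Assumed match rule: at least one shared attribute, and all shared values agree."""
    common = shared(a, b)
    return bool(common) and all(a[k] == b[k] for k in common)


def resolve(records):
    """Greedily merge matching records, most shared attributes first.

    records: list of (source, dict) pairs.
    Returns a sorted list of clusters, each a sorted list of record indices.
    """
    clusters = [(normalize(s, r), [i]) for i, (s, r) in enumerate(records)]
    while True:
        # Score every currently matching pair by its assumed confidence.
        candidates = [
            (len(shared(a[0], b[0])), x, y)
            for (x, a), (y, b) in combinations(enumerate(clusters), 2)
            if is_match(a[0], b[0])
        ]
        if not candidates:
            break
        # Pick the highest-confidence pair; ties go to the lowest indices.
        _, x, y = max(candidates, key=lambda c: (c[0], -c[1], -c[2]))
        merged = ({**clusters[x][0], **clusters[y][0]}, clusters[x][1] + clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return sorted(sorted(ids) for _, ids in clusters)
```

Because each merge produces a record with the union of both records' attributes, later comparisons see more fields than the original records had, which is why the order of comparisons can affect the outcome in a setting like this.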