Please use this identifier to cite or link to this item: http://hdl.handle.net/1783.1/6073

Duplicate detection in XML Web data

Authors Huang, Yuzhou
Issue Date 2009
Summary Duplicate entities are common on the Web, where structured XML data are increasingly prevalent. Duplicate detection, an important data cleaning task, consists of detecting different representations of the same real-world object. Detecting and resolving duplicate entities benefits Web users, so algorithms for detecting duplicates are needed to improve Web data quality. In this thesis, we present a feature-dependent algorithm that efficiently identifies duplicates in XML Web data. First, we generate features related to the targeted duplicates. Then, based on the generated features, we create a function for measuring similarity. A threshold is used to decide whether candidate duplicates are real duplicates. We also introduce a further step, similarity function learning, to improve the duplicate detection results. To show that this methodology can be broadly applied, we apply the algorithm to different kinds of XML Web data that are readily available on websites. We also use various entity types as the duplicates in the experiments, such as CD name entities and author entities. Moreover, we manually generate some dirty data to show that our algorithm works well even when the datasets contain errors or missing information.
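The pipeline outlined in the summary (feature generation, a feature-based similarity function, and a threshold) might be sketched roughly as follows. This is an illustrative sketch only, not the thesis's algorithm: the element names (`cd`, `title`, `artist`, `year`), the feature weights, the use of `difflib.SequenceMatcher` as the string measure, and the 0.8 threshold are all assumptions made for the example, and the similarity function learning step is omitted.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample data; the thesis's real datasets are not reproduced here.
SAMPLE = """<catalog>
  <cd><title>Abbey Road</title><artist>The Beatles</artist><year>1969</year></cd>
  <cd><title>Abbey Rd.</title><artist>Beatles</artist><year>1969</year></cd>
  <cd><title>Kind of Blue</title><artist>Miles Davis</artist><year>1959</year></cd>
</catalog>"""

# Assumed features and hand-picked weights (a learning step could tune these).
FEATURE_WEIGHTS = {"title": 0.5, "artist": 0.3, "year": 0.2}


def extract_features(element):
    """Map each assumed feature name to the normalized text of that child element."""
    return {
        name: (element.findtext(name) or "").strip().lower()
        for name in FEATURE_WEIGHTS
    }


def similarity(a, b):
    """Weighted average of per-feature string similarity.

    Features missing on either side are skipped and the remaining
    weights are renormalized, so incomplete records can still match.
    """
    total = 0.0
    weight_sum = 0.0
    for name, weight in FEATURE_WEIGHTS.items():
        if not a[name] or not b[name]:
            continue
        total += weight * SequenceMatcher(None, a[name], b[name]).ratio()
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def find_duplicates(xml_text, tag, threshold=0.8):
    """Return index pairs of `tag` elements whose similarity reaches the threshold."""
    records = [extract_features(e) for e in ET.fromstring(xml_text).iter(tag)]
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similarity(a, b) >= threshold
    ]


if __name__ == "__main__":
    print(find_duplicates(SAMPLE, "cd"))
```

With the sample records above, only the first two `cd` entries are flagged as a duplicate pair. The comparison is pairwise over all records, which is quadratic in the number of entities; a practical system would likely add a blocking or candidate-selection step first.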
Note Thesis (M.Phil.)--Hong Kong University of Science and Technology, 2009
Language English
Format Thesis
Access View full-text via DOI
Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.