Please use this identifier to cite or link to this item: http://hdl.handle.net/1783.1/3443

An adaptive framework for searching XML documents

Authors Lau, Ho Lam
Issue Date 2007
Summary The evolution of computing technology suggests that it has become more feasible to offer access to Web information in a ubiquitous way, through various kinds of interaction devices such as PCs, laptops, palmtops, and so on. As XML has become a de facto standard for exchanging Web data, an interesting and practical research problem is the development of models and techniques to satisfy various needs and preferences in searching XML data. In this thesis, we employ a list of simple XML tagged keywords as a vehicle for searching XML fragments in a collection of XML documents. In order to deal with the diversified nature of XML documents as well as user preferences, we propose a novel Multi-Ranker Model (MRM), which is able to abstract a spectrum of important XML properties and adapt the features to different XML search needs. The MRM is composed of three ranking levels. The lowest level consists of two categories of features: similarity and granularity. At the intermediate level, we define four tailored XML Rankers (XRs), which consist of different lower level features and have different strengths in searching XML fragments. The XRs are trained via a learning mechanism called the Ranking Support Vector Machine in a voting Spy Naïve Bayes Framework (RSSF). The RSSF takes as input a set of labeled fragments and feature vectors and generates as output Adaptive Rankers (ARs) in the learning process. The ARs are defined over the XRs and generated at the top level of the MRM. We show empirically that the RSSF is able to improve the MRM significantly in the learning process and needs only a small set of training XML fragments. We demonstrate that the trained MRM is able to bring out the strengths of the XRs in order to adapt to different preferences and queries. We also present the Adaptive Information Merging Approach (AIM) to merge the XML fragments returned from the ranked result list.
We incorporate the users’ feedback in order to further improve the coverage and specificity of the merged results, which are measured in terms of two formal notions: Information Completeness (IC) and Data Complexity (DC). IC represents source coverage and computes the “completeness” of the involved information sources, while DC represents the “richness” of data and computes the complexity of the retrieved data items.
Note Thesis (Ph.D.)--Hong Kong University of Science and Technology, 2007
Language English
Format Thesis
Access View full-text via DOI
Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.