|Title: ||Automatically merging lexicons that have heterogeneous part-of-speech categories|
|Authors: ||Chan, Daniel Ka-Leung|
|Issue Date: ||1999 |
|Abstract: ||Merging lexical resources from different sources is a way to cope with the limited availability of such resources, but incompatibilities in lexicon design have hindered research in this direction. The most frequently encountered problem is part-of-speech tagset inconsistency, where the set of tag symbols used for the part-of-speech (POS) category in one lexicon differs from the set used in another lexicon.
To attack this problem, we present a new method to automatically merge lexicons that employ heterogeneous part-of-speech categories. Given an "original lexicon", our method merges lexemes from an "additional lexicon" into the original lexicon, converting lexemes from the additional lexicon with more than 89% precision. This level of precision is achieved with the aid of a device we introduce, called an anti-lexicon, which concisely summarizes all the essential information we need about the co-occurrence of POS tags and lemmas.
Based on this co-occurrence information, we propose a set of lexicon algorithms that learn mapping rules between the POS tagsets. With the enhancement from the anti-lexicon, these mapping rules produce a merged lexicon with high precision.
To test the viability of our approach, we conducted experiments using four machine-readable dictionaries and compared the accuracy of the automatically generated lexicons with "oracle" lexicons in which the lexemes were manually converted by two linguists. The merged lexicons produced with our anti-lexicon method reach more than 89% precision on average, outperforming the method that uses perfect POS mapping rules created by linguists by more than 53%.
Our model is intuitive, easy to implement, and requires neither heavy computational resources nor a training corpus.|
|Description: ||Thesis (M.Phil.)--Hong Kong University of Science and Technology, 1999|
xviii, 113 leaves ; 30 cm
HKUST Call Number: Thesis COMP 1999 Chan
|Appears in Collections:||CSE Master Theses |