Please use this identifier to cite or link to this item: http://hdl.handle.net/1783.1/5670

Automatically merging lexicons that have heterogeneous part-of-speech categories

Authors Chan, Daniel Ka-Leung
Issue Date 1999
Summary Merging lexical resources from different sources is one way to cope with the limited availability of resources, but incompatibilities in lexicon design have hindered research in this direction. The most frequently encountered problem is part-of-speech tagset inconsistency, where the set of tag symbols in the part-of-speech (POS) category of one lexicon differs from the set used by another lexicon. To attack this problem, we present a new method to automatically merge lexicons that employ heterogeneous part-of-speech categories. Given an "original lexicon", our method merges lexemes from an "additional lexicon" into the original lexicon, converting lexemes from the additional lexicon with more than 89% precision. This level of precision is achieved with the aid of a device we introduce, called an anti-lexicon, which neatly summarizes all the essential information we need about the co-occurrence of POS tags and lemmas. Based on this co-occurrence information, we propose a set of algorithms that learn mapping rules between the POS tagsets. Enhanced by the anti-lexicon, these mapping rules produce a merged lexicon with high precision. To test the viability of our approach, we conducted experiments using four machine-readable dictionaries and compared the accuracy of the automatically generated lexicons with that of "oracle" lexicons in which the lexemes were manually converted by two linguists. Precision in the merged lexicons produced by our anti-lexicon method exceeds 89% on average, which outperforms the method using perfect POS mapping rules created by linguists by more than 53%. Our model is intuitive, easy to implement, and requires neither heavy computational resources nor a training corpus.
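The summary describes learning tag mapping rules from the co-occurrence of POS tags and lemmas, but it does not give the algorithms themselves. The following Python sketch is only an illustration of the general idea and should not be read as the thesis's method. It assumes, hypothetically, that each lexicon is a mapping from lemma to a set of tags, that the anti-lexicon is the set of (lemma, tag) pairs a lexicon does not list for lemmas it does contain, and that a rule is accepted when the share of supporting lemmas reaches a threshold. All names (`build_anti_lexicon`, `learn_mapping_rules`, `merge_lexicons`, `min_score`) and the toy data are invented for this example.

```python
from collections import Counter, defaultdict


def build_anti_lexicon(lexicon, tagset):
    """Assumed definition: (lemma, tag) pairs NOT listed for lemmas the lexicon contains."""
    return {(lemma, tag) for lemma, tags in lexicon.items() for tag in tagset - tags}


def learn_mapping_rules(original, additional, min_score=0.6):
    """Learn additional-tag -> original-tag rules from lemmas shared by both lexicons."""
    orig_tagset = set().union(*original.values())
    anti = build_anti_lexicon(original, orig_tagset)
    support = defaultdict(Counter)  # evidence from the original lexicon
    against = defaultdict(Counter)  # evidence from the anti-lexicon
    for lemma in original.keys() & additional.keys():
        for add_tag in additional[lemma]:
            for orig_tag in orig_tagset:
                if (lemma, orig_tag) in anti:
                    against[add_tag][orig_tag] += 1
                else:
                    support[add_tag][orig_tag] += 1
    rules = defaultdict(set)
    for add_tag, counts in support.items():
        for orig_tag, pos in counts.items():
            neg = against[add_tag][orig_tag]  # Counter yields 0 when absent
            if pos / (pos + neg) >= min_score:  # hypothetical acceptance threshold
                rules[add_tag].add(orig_tag)
    return dict(rules)


def merge_lexicons(original, additional, rules):
    """Add lemmas found only in the additional lexicon, with their tags converted."""
    merged = {lemma: set(tags) for lemma, tags in original.items()}
    for lemma, tags in additional.items():
        if lemma in merged:
            continue  # assumption: the original lexicon's entry takes priority
        converted = set()
        for tag in tags:
            converted |= rules.get(tag, set())
        if converted:
            merged[lemma] = converted
    return merged


# Toy data, invented for illustration only.
original = {"run": {"V", "N"}, "dog": {"N"}, "eat": {"V"}, "quick": {"ADJ"}}
additional = {"run": {"verb", "noun"}, "dog": {"noun"}, "eat": {"verb"},
              "cat": {"noun"}, "sing": {"verb"}}

rules = learn_mapping_rules(original, additional)
merged = merge_lexicons(original, additional, rules)
```

In this toy setup the ambiguous lemma "run" alone would suggest that "verb" maps to both V and N; the anti-lexicon evidence from "eat" (known not to be N) is what pushes the spurious rule below the threshold.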
Note Thesis (M.Phil.)--Hong Kong University of Science and Technology, 1999
Language English
Format Thesis