Please use this identifier to cite or link to this item: http://hdl.handle.net/1783.1/5677
Title: Learning a lightweight robust deterministic parser
Authors: Wong, Wah Hon
Issue Date: 1999
Abstract: We describe a method for automatically learning a parser from labeled, bracketed corpora that results in a fast, robust, lightweight parser suitable for real-time natural language systems and similar applications. Unlike ordinary parsers, all grammatical knowledge is captured in the learned decision trees, so no explicit phrase-structure grammar is needed. Another characteristic of the architecture is robustness, since the input need not fit pre-specified productions. The runtime architecture is very slim and references two learned decision trees that allow the parser to operate in a "strictly deterministic" manner in Marcus' (1977) sense. The basis is a shift-reduce parser (Aho, Sethi, & Ullman 1986) consisting of a stack, an input stream and a decision control mechanism. The core part of our work is to learn the decision control mechanism, for which we employ a novel Shift/Reduce decision algorithm and a novel Constituent Labeling decision algorithm. The features used for both the Shift/Reduce and Constituent Labeling decision tasks are restricted to the constituent labels in the stack and the part-of-speech tags of the words in the input. Even without using specific lexical features, we have achieved respectable labeled bracket accuracies of about 81% precision and 82% recall on the Penn Treebank corpus. Processing speed on a Sparc Ultra I machine is more than 500 words per CPU second. This high processing speed makes our parser suitable for applications such as online language understanding and machine translation. Without any optimization, the decision trees consume only 6M of memory, making it possible to run the parser on platforms with limited memory. Since the only resource needed to train our parser is a labeled and bracketed corpus, we believe the learning method is readily applicable to other languages. Preliminary experiments on a Chinese corpus (which contains about 3000 sentences from Chinese primary school text) have yielded results comparable to those for English.
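The architecture described in the abstract (a stack, an input stream, and a decision control mechanism split into a Shift/Reduce decision and a Constituent Labeling decision) can be illustrated with a minimal sketch. This is not the thesis implementation: the two learned decision trees are replaced here by small hand-written rules, and every name (`decide_action`, `decide_label`, `parse`, `bracket`) is hypothetical. As in the abstract, the decisions see only constituent labels on the stack and part-of-speech tags in the input, never the words themselves.

```python
from collections import deque

# A node is (label, body): body is a word (str) for a leaf,
# or a list of child nodes for a constituent.

def decide_action(stack_labels, input_tags):
    """Hand-written stand-in (assumption) for the learned Shift/Reduce
    decision tree. Returns 0 to shift, or k > 0 to reduce the top k items."""
    top2 = stack_labels[-2:]
    if top2 == ["DT", "NN"]:
        return 2
    if top2 in (["VBD", "NP"], ["NP", "VP"]) and not input_tags:
        return 2
    return 0

def decide_label(child_labels):
    """Hand-written stand-in (assumption) for the learned Constituent
    Labeling decision tree."""
    table = {("DT", "NN"): "NP", ("VBD", "NP"): "VP", ("NP", "VP"): "S"}
    return table.get(tuple(child_labels), "X")

def parse(tagged_words):
    """Deterministic shift-reduce loop: one decision per step, no backtracking.
    Returns the final stack, which may hold several fragments when the input
    does not reduce to a single tree."""
    stack = []
    buffer = deque(tagged_words)  # items are (word, tag) pairs
    while True:
        k = decide_action([n[0] for n in stack], [t for _, t in buffer])
        if k > 0:
            children = stack[-k:]
            del stack[-k:]
            stack.append((decide_label([c[0] for c in children]), children))
        elif buffer:
            word, tag = buffer.popleft()
            stack.append((tag, word))
        else:
            return stack

def bracket(node):
    """Render a node as a labeled bracketing."""
    label, body = node
    if isinstance(body, str):
        return f"({label} {body})"
    return f"({label} " + " ".join(bracket(c) for c in body) + ")"

sentence = [("the", "DT"), ("dog", "NN"), ("saw", "VBD"), ("a", "DT"), ("cat", "NN")]
result = parse(sentence)
print(" ".join(bracket(n) for n in result))
```

Because the loop never fails on an unexpected tag sequence, input that matches none of the toy rules is simply left on the stack as fragments, which mirrors the robustness property the abstract mentions.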
Description: Thesis (M.Phil.)--Hong Kong University of Science and Technology, 1999
xi, 82 leaves : ill. ; 30 cm
HKUST Call Number: Thesis COMP 1999 WongWH
URI: http://hdl.handle.net/1783.1/5677
Appears in Collections: CSE Master Theses

Files in This Item:

File              Description  Size  Format
th_redirect.html               0Kb   HTML
