A maximum-entropy Chinese parser augmented by transformation-based learning
Fung, Pascale N.
|Source||ACM Transactions on Asian Language Information Processing, v. 3, (2), 2004, p. 159-168|
|Summary||Parsing, the task of identifying syntactic components, e.g., noun and verb phrases, in a sentence, is one of the fundamental tasks in natural language processing. Many natural language applications, such as spoken-language understanding, machine translation, and information extraction, would benefit from, or even require, high-accuracy parsing as a preprocessing step. Even though most state-of-the-art statistical parsers were initially constructed for parsing English, most of them are not language-specific, in that they do not rely on properties that are specific to English. Therefore, constructing a parser for a given language becomes a matter of retraining the statistical parameters with a treebank in the corresponding language. The development of the Chinese treebank [Xia et al. 2000] spurred the construction of parsers for Chinese. However, Chinese poses some unique problems for the development of a statistical parser, the most apparent being word segmentation. Since words in written Chinese are not delimited in the same way as in Western languages, the first problem that needs to be solved before an existing statistical method can be applied to Chinese is identifying the word boundaries. This step is neglected by most pre-existing Chinese parsers, which assume that the input data has already been segmented. This article describes a character-based statistical parser, which gives the best performance to date on the Chinese treebank data. We augment an existing maximum entropy parser with transformation-based learning, creating a parser that can operate at the character level. We present experiments showing that our parser achieves results that are close to those achievable under perfect word segmentation conditions.|