Title: Large-scale automatic extraction of an English-Chinese translation lexicon
Authors: Wu, Dekai
Xia, Xuanyin
Keywords: English-Chinese machine translation
Lexical acquisition
Translation lexicon
Parallel corpus
Issue Date: 1995
Citation: Machine Translation, v. 9, no. 3-4, September 1994, p. 285-313
Abstract: We report experimental results on automatic extraction of an English-Chinese translation lexicon, by statistical analysis of a large parallel corpus, using limited amounts of linguistic knowledge. To our knowledge, these are the first empirical results of the kind between an Indo-European and non-Indo-European language for any significant vocabulary and corpus size. The learned vocabulary size is about 6,500 English words, achieving translation precision in the 86-96% range, with alignment proceeding at paragraph, sentence, and word levels. Specifically, we report (1) progress on the HKUST English-Chinese Parallel Bilingual Corpus, (2) experiments supporting the usefulness of restricted lexical cues for statistical paragraph and sentence alignment, and (3) experiments that question the role of hand-derived monolingual lexicons for automatic word translation acquisition. Using a hand-derived monolingual lexicon, the learned translation lexicon averages 2.33 Chinese translations per English entry, with a manually-filtered precision of 95.1%, and an automatically-filtered weighted precision of 86.0%. We then introduce a fully automatic two-stage statistical methodology that is able to learn translations for collocations. A statistically-learned monolingual Chinese lexicon is first used to segment the Chinese text, before applying bilingual training to produce 6,429 English entries with 2.25 Chinese translations per entry. This method improves the manually-filtered precision to 96.0% and the automatically-filtered weighted precision to 91.0%, an error rate reduction of 35.7% from using a hand-derived monolingual lexicon.
Rights: Machine Translation © copyright (1994) Springer. The original publication is available at
Appears in Collections: CSE Journal/Magazine Articles

Files in This Item:

File: large.pdf
Description: pre-published version
Size: 2647Kb
Format: Adobe PDF