|Title: ||Building phrase based language model from large corpus|
|Authors: ||Tang, Haijiang|
|Issue Date: ||2002 |
|Abstract: ||Statistical language models (SLMs) encode linguistic information as estimates of the probability distribution of natural language, and have been successfully applied in a variety of language processing applications.
Currently, most SLMs are based on words: a language model is trained on text using a pre-defined lexicon. However, even language models trained on huge bodies of text with very large lexicons yield a significant number of unreliable estimates, due to the lack of linguistic information and the inaccurate independence assumption of language modeling. In fact, the word-based n-gram language model, the most commonly used SLM, uses so little linguistic knowledge that it may be applied to a sequence of any symbols with no deep structure or meaning behind them. One solution is to encode linguistic information into lexical units that have longer context, that is, to include phrases as linguistic units for language modeling. The research presented in this thesis focuses on using automatically extracted phrases for language model training.
In this thesis, we investigate techniques for building phrase-based language models. We compare phrase extraction approaches that use different statistical information obtained from the training data. The experimental results show that the phrase-based language model addresses the main problems of the word-based n-gram model, and hence systematically and significantly improves perplexity and recognition accuracy. We also propose a new approach that outperforms the previous methods.
Another contribution of this thesis is robust Chinese syllable-to-word decoding. Syllable-to-word decoding is very important for Chinese keyboard input, and is also a main part of Chinese CSR. However, the inherent ambiguities of the Chinese language hamper accurate decoding. We present a multi-path search algorithm that addresses this problem and significantly improves recognition accuracy.|
|Description: ||Thesis (M.Phil.)--Hong Kong University of Science and Technology, 2002|
x, 79 leaves : ill. ; 30 cm
HKUST Call Number: Thesis ELEC 2002 Tang
|Appears in Collections:||ECE Master Theses|