Please use this identifier to cite or link to this item: http://hdl.handle.net/1783.1/4683

The use of prosodic features in Chinese speech recognition and spoken language processing

Authors Wong, Jimmy Pui Fung
Issue Date 2003
Summary Prosody can be defined as the aspects of a sentence's pronunciation not described by the phone sequence of the words. Examples of prosodic information include pitch contour, loudness and duration. Prosodic information plays an important role in human communication. In tonal languages such as Chinese, words are often differentiated by lexical tones, which are related to the pitch contour. In addition, intonation patterns are used in both Chinese and English to express emotion or emphasis. The goal of this thesis is to investigate different approaches to applying prosodic information in speech processing, such as using tone information for Chinese speech recognition and classifying English intonation patterns with an application to computer-aided language learning. The classification of pitch contour patterns involves four components: i) pitch detection, ii) pitch contour representation, iii) pitch pattern classification, and iv) integration of prosody for Chinese speech recognition. While pitch can be estimated frame by frame, a pitch pattern is a suprasegmental feature, so it is more effective to represent it by a continuous curve, such as a polynomial. Representing the pitch contour as a polynomial, we experimented with different classifiers, such as Decision Tree, Neural Network and Hidden Markov Model, and evaluated their effectiveness on tone classification of isolated and continuous Chinese syllables and on English intonation pattern classification. Finally, prosodic information is integrated into a Chinese speech recognition system. In our study, we found that the Cepstrum (CEP) pitch detection algorithm could detect pitch values with good accuracy (97.2%). Since cepstral coefficients are typically used in recognition, the CEP method has the additional advantage that it can be implemented in the speech recognition front-end.
In pitch pattern classification, the combination of polynomial representation and decision tree gave the best performance in both English intonation pattern classification and isolated Chinese syllable tone recognition. An analysis of the decision tree structure indicated that the slope of the pitch contour was the most important feature, which is consistent with our knowledge of Chinese tones. In continuous Chinese speech, because of co-articulation effects and tone sandhi, tone classification degraded significantly, with confusions at tone 3 or the neutral tone. By integrating the prosodic information into Chinese speech recognition, syllable accuracy improved from 65.4% to 70.5%.
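As an illustration of the polynomial pitch contour representation described in the summary, the following is a minimal sketch and not the thesis's actual implementation. It assumes a frame-level pitch track in Hz with unvoiced frames marked as 0, time normalized to [0, 1], a log-frequency scale and a second-order polynomial. The function names `fit_pitch_contour` and `mean_slope` are hypothetical.

```python
import numpy as np


def fit_pitch_contour(f0_hz, degree=2):
    """Fit a polynomial to the voiced frames of a frame-level pitch track.

    Assumptions (not taken from the thesis): unvoiced frames are marked
    with 0, time is normalized to [0, 1], and pitch is fitted in log-Hz.
    Returns coefficients ordered from the constant term upward.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0 > 0
    if voiced.sum() <= degree:
        raise ValueError("not enough voiced frames to fit the polynomial")
    t = np.linspace(0.0, 1.0, len(f0))
    return np.polynomial.polynomial.polyfit(t[voiced], np.log(f0[voiced]), degree)


def mean_slope(coeffs):
    """Average slope of the fitted contour over [0, 1], i.e. p(1) - p(0)."""
    p = np.polynomial.polynomial.Polynomial(coeffs)
    return float(p(1.0) - p(0.0))


# Synthetic contours: a rising pattern and a falling pattern.
rising = fit_pitch_contour(np.linspace(120.0, 180.0, 20))
falling = fit_pitch_contour(np.linspace(220.0, 140.0, 20))
print(mean_slope(rising) > 0, mean_slope(falling) < 0)
```

The coefficients, or a summary such as the average slope, could then serve as input features to a classifier such as a decision tree. The choice of features here is illustrative only.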
Note Thesis (M.Phil.)--Hong Kong University of Science and Technology, 2003
Language English
Format Thesis