Polynomial segment model for large vocabulary continuous speech recognition

Authors Au Yeung, Siu-Kei
Issue Date 2005
Summary While hidden Markov models (HMMs) are the most widely used models in automatic speech recognition, their weaknesses, including poor duration modeling, the piecewise-constant-region property, and the mixture hopping problem, can limit their potential. The Polynomial Segment Model (PSM) is one of the alternative models proposed to solve some of these problems. Because PSM is segmental, with all frames within a segment evaluated jointly, it can better model the correlations between frames and enhance recognition performance. PSM has been shown to perform well in small-vocabulary tasks such as phone recognition. However, it is not clear whether and how it can be applied to large vocabulary continuous speech recognition (LVCSR). Competing with HMM in LVCSR using this novel model is often difficult because many LVCSR-related issues, such as mixture growing and parameter tying, have been thoroughly studied only under the HMM framework. In this thesis, we propose a set of PSM-specific algorithms that address some of these LVCSR issues and take advantage of the structure and flexibility of this parametric time-varying model. We show that, by carefully optimizing these algorithms, PSM can outperform HMM on the Wall Street Journal 5000-word LVCSR task by more than 10%. We also show that PSM-specific solutions to these LVCSR issues can account for a large portion of this gain. Recognition performance can be further improved by 19% after applying PSM-based Maximum Likelihood Linear Regression (MLLR) adaptation. Furthermore, to evaluate PSM performance under mismatched conditions, the Aurora 4 corpus was used to compare HMM and PSM. Although PSM is shown to be a sharper model, the PSM system still outperforms the HMM system under mismatched conditions. We also show that PSM-based MLLR further enhances performance when only one adaptation utterance is available.
Note Thesis (M.Phil.)--Hong Kong University of Science and Technology, 2005
Language English
Format Thesis
Access View full-text via DOI