Please use this identifier to cite or link to this item:

Kernel eigenvoice speaker adaptation

Authors Ho, Ka-Lung
Issue Date 2003
Summary Speech recognition is a powerful and widely used technology. However, its performance is not robust to variations in speech introduced by the operating environment, noise (its type and energy) and inter-speaker differences. Speaker adaptation is an important technique for fine-tuning either the features or the speech models to compensate for the mismatch caused by inter-speaker variation. In the last decade, eigenvoice (EV) speaker adaptation has been developed. It makes use of prior knowledge of the training speakers to provide a fast adaptation algorithm (in other words, only a small amount of adaptation data is needed). Inspired by the kernel eigenface idea in face recognition, kernel eigenvoice (KEV) is proposed. KEV is a non-linear generalization of EV. It incorporates Kernel Principal Component Analysis (KPCA), a non-linear version of Principal Component Analysis (PCA), to capture higher-order correlations in order to further explore the speaker space and enhance recognition performance. The major difficulty is that, with KEV adaptation, the adapted speaker model is estimated in the kernel feature space and may not have an exact pre-image in the input speaker-supervector space, yet observation likelihoods are computed in the acoustic observation space for both adaptation and recognition. A composite kernel is proposed to solve this problem. Experimental investigation on the TIDIGITS corpus, an English continuous digit recognition task, using 4 seconds of adaptation data shows that KEV adaptation gives a 21% relative improvement (RI) over the speaker-independent (SI) model, a 25% RI over MLLR adaptation, a 32% RI over MAP adaptation and a 32% RI over EV adaptation. When the speaker-adapted models from KEV are interpolated with the SI model, the RI increases to 32% over the SI model, 35% over MLLR adaptation, 41% over MAP adaptation and 32% over similarly interpolated EV adaptation.
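As an illustration of the KPCA step mentioned in the summary, the following is a minimal sketch of kernel PCA applied to a matrix of speaker supervectors. It is not the thesis's method: the Gaussian (RBF) kernel, the `gamma` value, the function names `rbf_kernel` and `kernel_pca`, and the randomly generated "supervectors" are all assumptions made for this example, and the composite kernel proposed in the thesis is not implemented here.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian kernel between rows of X and rows of Y (assumed kernel choice)."""
    sq_dist = (X**2).sum(axis=1)[:, None] + (Y**2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def kernel_pca(X, n_components, gamma):
    """Return training projections, expansion coefficients and eigenvalues."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    # Centre the kernel matrix, i.e. subtract the mean in feature space.
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigendecomposition of the symmetric centred kernel matrix.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Rescale so that each feature-space eigenvector has unit norm.
    alphas = eigvecs / np.sqrt(eigvals)
    # Coordinates of each training speaker along the kernel components.
    projections = Kc @ alphas
    return projections, alphas, eigvals

# Synthetic stand-in for speaker supervectors: 20 speakers, 50 dimensions.
rng = np.random.default_rng(0)
supervectors = rng.normal(size=(20, 50))
proj, alphas, eigvals = kernel_pca(supervectors, n_components=5, gamma=0.01)
```

Each column of `proj` gives the training speakers' coordinates along one kernel component. Because these components live in the kernel feature space, a point expressed in them generally has no exact counterpart in supervector space, which is the pre-image difficulty the summary describes.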
Note Thesis (M.Phil.)--Hong Kong University of Science and Technology, 2003
Language English
Format Thesis
Access View full-text via DOI
Files in this item:
File Description Size Format
th_redirect.html 339 B HTML
Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.