113 research outputs found
Maximum a posteriori adaptation of subspace Gaussian mixture models for cross-lingual speech recognition
This paper concerns cross-lingual acoustic modeling in the case when there are limited target language resources. We build on an approach in which a subspace Gaussian mixture model (SGMM) is adapted to the target language by reusing the globally shared parameters estimated from out-of-language training data. In current cross-lingual systems, these parameters are fixed when training the target system, which can give rise to a mismatch between the source and target systems. We investigate a maximum a posteriori (MAP) adaptation approach to alleviate the potential mismatch. In particular, we focus on the adaptation of phonetic subspace parameters using a matrix variate Gaussian prior distribution. Experiments on the GlobalPhone corpus using the MAP adaptation approach result in word error rate reductions, compared with the cross-lingual baseline systems and systems updated using maximum likelihood, for training conditions with 1 hour and 5 hours of target language data.
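The MAP update described above can be illustrated by a count-smoothed interpolation between a prior mean and a maximum-likelihood estimate. This is a deliberately simplified sketch, not the paper's full matrix-variate Gaussian formulation; `map_update`, its arguments, and the toy values are all hypothetical.

```python
import numpy as np

def map_update(M_ml, M_prior, gamma, tau):
    """Count-smoothed MAP interpolation between an ML estimate M_ml and a
    prior mean M_prior. gamma is the amount of target-language data
    (occupation count); tau controls the strength of the prior. With
    little data the update stays near the prior; with lots of data it
    approaches the ML estimate. Hypothetical simplification of the
    matrix-variate Gaussian MAP update."""
    return (gamma * M_ml + tau * M_prior) / (gamma + tau)
```

With `gamma = 0` the update returns the prior unchanged, and as `gamma` grows it converges to the ML estimate, which is the qualitative behaviour MAP adaptation is meant to provide.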
Cross-Lingual Subspace Gaussian Mixture Models for Low-Resource Speech Recognition
This paper studies cross-lingual acoustic modelling in the context of subspace Gaussian mixture models (SGMMs). SGMMs factorize the acoustic model parameters into a set that is globally shared between all the states of a hidden Markov model (HMM) and another that is specific to the HMM states. We demonstrate that the SGMM global parameters are transferable between languages, particularly when the parameters are trained multilingually. As a result, acoustic models may be trained using limited amounts of transcribed audio by borrowing the SGMM global parameters from one or more source languages, and only training the state-specific parameters on the target language audio. Model regularization using an ℓ1-norm penalty is shown to be particularly effective at avoiding overtraining, leading to lower word error rates. We investigate maximum a posteriori (MAP) adaptation of subspace parameters in order to reduce the mismatch between the SGMM global parameters of the source and target languages. In addition, monolingual and cross-lingual speaker adaptive training is used to reduce the model variance introduced by speakers. We have systematically evaluated these techniques by experiments on the GlobalPhone corpus.
Subspace Gaussian mixture models for automatic speech recognition
In most state-of-the-art speech recognition systems, Gaussian mixture models (GMMs) are used to model the density of the emitting states in the hidden Markov models (HMMs). In a conventional system, the model parameters of each GMM are estimated directly and independently given the alignment. This results in a large number of model parameters to be estimated and, consequently, a large amount of training data is required to fit the model. In addition, the different sources of acoustic variability that impact the accuracy of a recogniser, such as pronunciation variation, accent, speaker factors and environmental noise, are only weakly modelled and factorized by adaptation techniques such as maximum likelihood linear regression (MLLR), maximum a posteriori (MAP) adaptation and vocal tract length normalisation (VTLN). In this thesis, we discuss an alternative acoustic modelling approach, the subspace Gaussian mixture model (SGMM), which is expected to deal with both of these issues better. In an SGMM, the model parameters are derived from low-dimensional model and speaker subspaces that can capture phonetic and speaker correlations. Given these subspaces, only a small number of state-dependent parameters are required to derive the corresponding GMMs. Hence, the total number of model parameters can be reduced, which allows acoustic modelling with a limited amount of training data. In addition, the SGMM-based acoustic model factorizes the phonetic and speaker factors, and within this framework other sources of acoustic variability may also be explored.
In this thesis, we propose regularised model estimation for SGMMs, which avoids overtraining when the training data is sparse. We also take advantage of the structure of SGMMs to explore cross-lingual acoustic modelling for low-resource speech recognition. Here, the model subspace is estimated from out-of-domain data and ported to the target language system. In this case, only the state-dependent parameters need to be estimated, which relaxes the requirement on the amount of training data. To improve the robustness of SGMMs against environmental noise, we propose to apply the joint uncertainty decoding (JUD) technique, which is shown to be efficient and effective. We report experimental results on the Wall Street Journal (WSJ) database and the GlobalPhone corpus to evaluate the regularisation and cross-lingual modelling of SGMMs. Noise compensation using JUD for SGMM acoustic models is evaluated on the Aurora 4 database.
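The parameter derivation described in the abstract above can be sketched in a few lines of NumPy. Everything here is illustrative: the sizes and random values are toys, and the variable names merely mirror the common SGMM notation in which a state's Gaussian means are derived as `M_i v_j` and its mixture weights come from a softmax over `w_i' v_j`.

```python
import numpy as np

# Toy sketch of deriving a state's GMM parameters from shared SGMM
# subspace parameters (dimensions and values are illustrative only).
rng = np.random.default_rng(0)
I, D, S = 8, 13, 5                    # Gaussians, feature dim, subspace dim
M = rng.standard_normal((I, D, S))    # phonetic subspace matrices (shared)
w = rng.standard_normal((I, S))       # weight projection vectors (shared)
v = rng.standard_normal(S)            # state-specific vector (per HMM state)

means = np.einsum('ids,s->id', M, v)  # state-dependent means: mu_i = M_i v
logits = w @ v                        # mixture weights via a softmax
weights = np.exp(logits - logits.max())
weights /= weights.sum()
```

The point of the factorization is visible in the shapes: each state contributes only the `S`-dimensional vector `v`, while the large matrices `M` and `w` are shared across all states (and, in the cross-lingual setting, across languages).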
Regularized subspace Gaussian mixture models for cross-lingual speech recognition
We investigate cross-lingual acoustic modelling for low-resource languages using the subspace Gaussian mixture model (SGMM). We assume the presence of acoustic models trained on multiple source languages, and use the global subspace parameters from those models for improved modelling in a target language with limited amounts of transcribed speech. Experiments on the GlobalPhone corpus using Spanish, Portuguese, and Swedish as source languages and German as the target language (with 1 hour and 5 hours of transcribed audio) show that multilingually trained SGMM shared parameters result in lower word error rates (WERs) than using those from a single source language. We also show that regularizing the estimation of the SGMM state vectors by penalizing their ℓ1-norm helps to overcome numerical instabilities and leads to lower WER.
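The ℓ1-penalized state-vector estimation mentioned above amounts to a lasso-type problem, which can be solved with iterative soft-thresholding (ISTA). The quadratic-plus-linear objective below is a hypothetical stand-in for the SGMM auxiliary function; `H` and `g` play the role of accumulated statistics but are not the paper's actual quantities.

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t * ||x||_1: shrink each coordinate toward zero.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def l1_state_vector(H, g, lam, n_iter=200):
    """ISTA for min_v 0.5 v'Hv - g'v + lam * ||v||_1, a stand-in for
    l1-regularized SGMM state-vector estimation. H must be positive
    definite; lam controls sparsity."""
    step = 1.0 / np.linalg.norm(H, 2)      # 1 / largest eigenvalue of H
    v = np.zeros_like(g)
    for _ in range(n_iter):
        v = soft_threshold(v - step * (H @ v - g), step * lam)
    return v
```

The shrinkage zeroes out weakly supported coordinates of `v`, which is how the penalty curbs overtraining when only an hour or so of target-language data is available.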
Using out-of-language data to improve an under-resourced speech recognizer
Under-resourced speech recognizers may benefit from data in languages other than the target language. In this paper, we report how to boost the performance of an Afrikaans automatic speech recognition system by using already available Dutch data. We successfully exploit available multilingual resources through 1) posterior features, estimated by multilayer perceptrons (MLPs), and 2) subspace Gaussian mixture models (SGMMs). Both the MLPs and the SGMMs can be trained on out-of-language data. We use three different acoustic modeling techniques, namely Tandem, Kullback-Leibler divergence based HMMs (KL-HMM) and SGMMs, and show that the proposed multilingual systems yield a 12% relative improvement compared to a conventional monolingual HMM/GMM system trained only on Afrikaans. We also show that KL-HMMs are extremely powerful for under-resourced languages: using only six minutes of Afrikaans data (in combination with out-of-language data), KL-HMM yields about a 30% relative improvement compared to conventional maximum likelihood linear regression and maximum a posteriori based acoustic model adaptation.
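As a rough illustration of the KL-HMM idea mentioned above, each HMM state is modelled by a categorical distribution over MLP posterior features, and a frame is scored by a KL divergence between that state distribution and the frame's posterior. The divergence direction and the function name here are assumptions; the original work studies several variants.

```python
import numpy as np

def kl_hmm_state_cost(y_state, z_frame, eps=1e-10):
    """Local KL-HMM score for one frame: KL(y_state || z_frame), where
    y_state is a state's categorical distribution over phone classes and
    z_frame is the MLP posterior for the frame. Lower cost means the
    frame matches the state better. One common variant, shown as a sketch."""
    y = y_state + eps
    z = z_frame + eps
    return float(np.sum(y * np.log(y / z)))
```

Because the state model is just a categorical distribution, very little target-language data is needed to estimate it, which is consistent with the six-minute Afrikaans result reported above.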
Physiologically-Motivated Feature Extraction Methods for Speaker Recognition
Speaker recognition has received a great deal of attention from the speech community, and significant gains in robustness and accuracy have been obtained over the past decade. However, the features used for identification are still primarily representations of overall spectral characteristics, and thus the models are primarily phonetic in nature, differentiating speakers based on overall pronunciation patterns. This creates difficulties in terms of the amount of enrollment data and the complexity of the models required to cover the phonetic space, especially in tasks such as identification where enrollment and testing data may not have similar phonetic coverage. This dissertation introduces new features based on vocal source characteristics intended to capture physiological information related to the laryngeal excitation energy of a speaker. These features, including RPCC, GLFCC and TPCC, represent the unique characteristics of speech production not represented in current state-of-the-art speaker identification systems. The proposed features are evaluated through three experimental paradigms: cross-lingual speaker identification, cross-song-type avian speaker identification and monolingual speaker identification. The experimental results show that the proposed features provide information about speaker characteristics that is significantly different in nature from the phonetically focused information present in traditional spectral features. The incorporation of the proposed glottal source features offers a significant overall improvement to the robustness and accuracy of speaker identification tasks.
Low-rank and Sparse Soft Targets to Learn Better DNN Acoustic Models
Conventional deep neural networks (DNNs) for speech acoustic modeling rely on Gaussian mixture models (GMMs) and hidden Markov models (HMMs) to obtain binary class labels as the targets for DNN training. Subword classes in speech recognition systems correspond to context-dependent tied states, or senones. The present work addresses some limitations of GMM-HMM senone alignments for DNN training. We hypothesize that the senone probabilities obtained from a DNN trained with binary labels can provide more accurate targets to learn better acoustic models. However, DNN outputs bear inaccuracies which are exhibited as high-dimensional unstructured noise, whereas the informative components are structured and low-dimensional. We exploit principal component analysis (PCA) and sparse coding to characterize the senone subspaces. Enhanced probabilities obtained from low-rank and sparse reconstructions are used as soft targets for DNN acoustic modeling, which also enables training with untranscribed data. Experiments conducted on the AMI corpus show a 4.6% relative reduction in word error rate.
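The low-rank enhancement step described above can be sketched with a plain PCA reconstruction of the posterior matrix (the sparse-coding part is omitted). The function name, the clipping, and the row renormalization are assumptions made for this illustration, not the paper's exact procedure.

```python
import numpy as np

def low_rank_soft_targets(P, rank):
    """Project DNN senone posteriors P (frames x senones) onto their top
    principal components and renormalize each row to a valid distribution.
    Sketch of the low-rank enhancement step only; sparse coding omitted."""
    mean = P.mean(axis=0)
    U, s, Vt = np.linalg.svd(P - mean, full_matrices=False)
    P_lr = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mean   # rank-r reconstruction
    P_lr = np.clip(P_lr, 1e-8, None)                     # keep probabilities positive
    return P_lr / P_lr.sum(axis=1, keepdims=True)        # rows sum to one again
```

Truncating to the top components discards the high-dimensional unstructured noise in the raw DNN outputs while keeping the structured, low-dimensional senone information, which is then reused as soft targets.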
- …