24 research outputs found
Joint Uncertainty Decoding with Unscented Transform for Noise Robust Subspace Gaussian Mixture Models
Common noise compensation techniques use vector Taylor series (VTS) to approximate the mismatch function. Recent work shows that the approximation accuracy may be improved by sampling. One such sampling technique is the unscented transform (UT), which draws samples deterministically from clean speech and noise model to derive the noise corrupted speech parameters. This paper applies UT to noise compensation of the subspace Gaussian mixture model (SGMM). Since UT requires relatively smaller number of samples for accurate estimation, it has significantly lower computational cost compared to other random sampling techniques. However, the number of surface Gaussians in an SGMM is typically very large, making the direct application of UT, for compensating individual Gaussian components, computationally impractical. In this paper, we avoid the computational burden by employing UT in the framework of joint uncertainty decoding (JUD), which groups all the Gaussian components into small number of classes, sharing the compensation parameters by class. We evaluate the JUD-UT technique for an SGMM system using the Aurora 4 corpus. Experimental results indicate that UT can lead to increased accuracy compared to VTS approximation if the JUD phase factor is untuned, and to similar accuracy if the phase factor is tuned empirically. 1
Statistical models for noise-robust speech recognition
A standard way of improving the robustness of speech recognition systems to noise is model compensation. This replaces a speech recogniser's distributions over clean speech by ones over noise-corrupted speech. For each clean speech component, model compensation techniques usually approximate the corrupted speech distribution with a diagonal-covariance Gaussian distribution. This thesis looks into improving on this approximation in two ways: firstly, by estimating full-covariance Gaussian distributions; secondly, by approximating corrupted-speech likelihoods without any parameterised distribution.
The first part of this work is about compensating for within-component feature correlations under noise. For this, the covariance matrices of the computed Gaussians should be full instead of diagonal. The estimation of off-diagonal covariance elements turns out to be sensitive to approximations. A popular approximation is the one that state-of-the-art compensation schemes, like VTS compensation, use for dynamic coefficients: the continuous-time approximation. Standard speech recognisers contain both per-time slice, static, coefficients, and dynamic coefficients, which represent signal changes over time, and are normally computed from a window of static coefficients. To remove the need for the continuous-time approximation, this thesis introduces a new technique. It first compensates a distribution over the window of statics, and then applies the same linear projection that extracts dynamic coefficients. It introduces a number of methods that address the correlation changes that occur in noise within this framework. The next problem is decoding speed with full covariances. This thesis re-analyses the previously-introduced predictive linear transformations, and shows how they can model feature correlations at low and tunable computational cost.
The second part of this work removes the Gaussian assumption completely. It introduces a sampling method that, given speech and noise distributions and a mismatch function, in the limit calculates the corrupted speech likelihood exactly. For this, it transforms the integral in the likelihood expression, and then applies sequential importance resampling. Though it is too slow to use for recognition, it enables a more fine-grained assessment of compensation techniques, based on the KL divergence to the ideal compensation for one component. The KL divergence proves to predict the word error rate well. This technique also makes it possible to evaluate the impact of approximations that standard compensation schemes make.This work was supported by Toshiba Research Europe Ltd., Cambridge Research Laboratory
Subspace Gaussian mixture models for automatic speech recognition
In most of state-of-the-art speech recognition systems, Gaussian mixture models (GMMs)
are used to model the density of the emitting states in the hidden Markov models
(HMMs). In a conventional system, the model parameters of each GMM are estimated
directly and independently given the alignment. This results a large number of
model parameters to be estimated, and consequently, a large amount of training data
is required to fit the model. In addition, different sources of acoustic variability that
impact the accuracy of a recogniser such as pronunciation variation, accent, speaker
factor and environmental noise are only weakly modelled and factorized by adaptation
techniques such as maximum likelihood linear regression (MLLR), maximum a posteriori
adaptation (MAP) and vocal tract length normalisation (VTLN). In this thesis,
we will discuss an alternative acoustic modelling approach — the subspace Gaussian
mixture model (SGMM), which is expected to deal with these two issues better. In an
SGMM, the model parameters are derived from low-dimensional model and speaker
subspaces that can capture phonetic and speaker correlations. Given these subspaces,
only a small number of state-dependent parameters are required to derive the corresponding
GMMs. Hence, the total number of model parameters can be reduced, which
allows acoustic modelling with a limited amount of training data. In addition, the
SGMM-based acoustic model factorizes the phonetic and speaker factors and within
this framework, other source of acoustic variability may also be explored.
In this thesis, we propose a regularised model estimation for SGMMs, which avoids
overtraining in case that the training data is sparse. We will also take advantage of
the structure of SGMMs to explore cross-lingual acoustic modelling for low-resource
speech recognition. Here, the model subspace is estimated from out-domain data and
ported to the target language system. In this case, only the state-dependent parameters
need to be estimated which relaxes the requirement of the amount of training data. To
improve the robustness of SGMMs against environmental noise, we propose to apply
the joint uncertainty decoding (JUD) technique that is shown to be efficient and effective.
We will report experimental results on the Wall Street Journal (WSJ) database
and GlobalPhone corpora to evaluate the regularisation and cross-lingual modelling of
SGMMs. Noise compensation using JUD for SGMM acoustic models is evaluated on
the Aurora 4 database
Temporally Varying Weight Regression for Speech Recognition
Ph.DDOCTOR OF PHILOSOPH
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes