5 research outputs found
Probabilistic Linear Discriminant Analysis for Acoustic Modeling
Acoustic models using probabilistic linear discriminant analysis (PLDA)
capture the correlations within feature vectors using subspaces which do not
vastly expand the model. This allows high-dimensional and correlated feature
spaces to be used without requiring the estimation of multiple high-dimensional
covariance matrices. In this letter we extend the recently presented PLDA
mixture model for speech recognition through a tied PLDA approach, which is
better able to control the model size to avoid overfitting. We carried out
experiments using the Switchboard corpus, with both mel frequency cepstral
coefficient features and bottleneck features derived from a deep neural network.
Reductions in word error rate were obtained by using tied PLDA, compared with
the PLDA mixture model, subspace Gaussian mixture models, and deep neural
networks.
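As a rough illustration of the structure such models exploit (not the letter's exact parameterization; the dimensions, names, and single-subspace form below are illustrative assumptions), a PLDA-style component models a full covariance as low-rank-plus-diagonal, capturing correlations with far fewer parameters than a free covariance matrix:

    import numpy as np
    from scipy.stats import multivariate_normal

    d, q = 60, 10                     # feature dim, subspace dim (illustrative)
    rng = np.random.default_rng(0)

    mu  = rng.normal(size=d)          # component mean
    U   = rng.normal(size=(d, q))     # low-rank loading matrix (shared subspace)
    psi = np.abs(rng.normal(size=d))  # diagonal residual variances

    # Low-rank + diagonal covariance: d*q + d parameters (here 660)
    # instead of d*(d+1)/2 (here 1830) for a free full covariance.
    Sigma = U @ U.T + np.diag(psi)

    x  = rng.normal(size=d)
    ll = multivariate_normal(mean=mu, cov=Sigma).logpdf(x)

Tying, as in the letter's tied PLDA approach, would share U (and psi) across components, so that adding mixture components grows the model only slowly.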
Full Covariance Modelling for Speech Recognition
HMM-based systems for Automatic Speech Recognition typically model
the acoustic features using mixtures of multivariate Gaussians. In this
thesis, we consider the problem of learning a suitable covariance matrix
for each Gaussian. A variety of schemes have been proposed for
controlling the number of covariance parameters per Gaussian, and
studies have shown that in general, the greater the number of parameters
used in the models, the better the recognition performance. We
therefore investigate systems with full covariance Gaussians. However,
in this case, the obvious choice of parameters – given by the sample
covariance matrix – leads to matrices that are poorly-conditioned, and
do not generalise well to unseen test data. The problem is particularly
acute when the amount of training data is limited.
We propose two solutions to this problem: firstly, we impose the requirement
that each matrix should take the form of a Gaussian graphical
model, and introduce a method for learning the parameters and
the model structure simultaneously. Secondly, we explain how an
alternative estimator, the shrinkage estimator, is preferable to the
standard maximum likelihood estimator, and derive formulae for the
optimal shrinkage intensity within the context of a Gaussian mixture
model. We show how this relates to the use of a diagonal covariance
smoothing prior.
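As a rough sketch of the two ideas (using standard stand-ins, not the thesis's own algorithms): the graphical lasso is one common way to learn a sparse Gaussian graphical model, estimating the precision-matrix structure and its parameters simultaneously, and the shrinkage estimator is a convex combination of the sample covariance with a diagonal target. The fixed intensity rho below is a hand-picked assumption; the thesis derives the optimal value analytically.

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 20))   # few samples relative to the dimension

    # Solution 1: the L1 penalty (alpha) zeroes out precision entries,
    # selecting the graphical-model structure and fitting it in one step.
    ggm = GraphicalLasso(alpha=0.1).fit(X)
    P = ggm.precision_              # sparse inverse covariance

    # Solution 2: shrink the poorly-conditioned sample covariance toward
    # its own diagonal, which acts like a diagonal smoothing prior.
    S = np.cov(X, rowvar=False)
    rho = 0.3                       # hand-picked here; derived optimally in the thesis
    Sigma_shrunk = (1 - rho) * S + rho * np.diag(np.diag(S))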
We compare the effectiveness of these techniques to standard methods
on a phone recognition task where the quantity of training data is
artificially constrained. We then investigate the performance of the
shrinkage estimator on a large-vocabulary conversational telephone
speech recognition task.
Discriminative training techniques can be used to compensate for the
invalidity of the model correctness assumption underpinning maximum
likelihood estimation. On the large-vocabulary task, we use discriminative
training of the full covariance models and diagonal priors
to yield improved recognition performance.
An application of sparse representation in Gaussian mixture models used in speech recognition task
This thesis proposes a model which approximates full covariance matrices in Gaussian mixture models (GMMs) with a reduced number of parameters and computations required for likelihood evaluations. In the proposed model, inverse covariance (precision) matrices are approximated using sparsely represented eigenvectors. A maximum likelihood algorithm for parameter estimation and its practical implementation are presented. Experimental results on a speech recognition task show that, while keeping the word error rate close to the one obtained by GMMs with full covariance matrices, the proposed model can reduce the number of parameters by 45%.
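A toy numpy sketch of the idea (the hard thresholding below is a crude, assumed stand-in for the thesis's maximum likelihood estimation of the sparse representation; dimensions and the threshold are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 30))
    P = np.linalg.inv(np.cov(X, rowvar=False))   # full precision matrix

    # Eigendecompose the precision, then sparsify its eigenvectors;
    # the approximation stores only the surviving entries plus eigenvalues.
    lam, V = np.linalg.eigh(P)
    V_sparse = np.where(np.abs(V) > 0.15, V, 0.0)
    P_approx = V_sparse @ np.diag(lam) @ V_sparse.T

    kept = np.count_nonzero(V_sparse)
    print(f"nonzero eigenvector entries: {kept}/{V.size}")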
Large vocabulary conversational speech recognition with the extended maximum likelihood linear transformation (EMLLT) model
This paper applies the recently proposed Extended Maximum Likelihood Linear Transformation (EMLLT) model in a Speaker Adaptive Training (SAT) context on the Switchboard database. Adaptation is carried out with maximum likelihood estimation of linear transforms for the means, precisions (inverse covariances) and the feature-space under the EMLLT model. This paper shows the first experimental evidence that significant word-error-rate improvements can be achieved with the EMLLT model (in both VTL and VTL+SAT training contexts) over a state-of-the-art diagonal covariance model in a difficult large-vocabulary conversational speech recognition task. The improvements were of the order of 1% absolute in multiple scenarios.
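A minimal numpy sketch of the EMLLT parameterization (the sizes and the positivity restriction on the weights are assumptions made to keep the toy example valid): each Gaussian's precision matrix is a weighted sum of rank-one matrices built from a shared, over-complete set of directions.

    import numpy as np

    d, K, J = 20, 40, 3         # feature dim, basis size (K >= d), Gaussians
    rng = np.random.default_rng(0)
    A = rng.normal(size=(K, d)) # shared directions a_k

    # Precision of Gaussian j: P_j = sum_k Lam[j, k] * a_k a_k^T.
    # MLLT is the special case K == d; positive weights guarantee
    # positive definiteness here, though EMLLT also allows negative ones.
    Lam = np.abs(rng.normal(size=(J, K)))
    precisions = [sum(Lam[j, k] * np.outer(A[k], A[k]) for k in range(K))
                  for j in range(J)]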
Large vocabulary conversational speech recognition with a subspace constraint on inverse covariance matrices
This paper applies the recently proposed SPAM models for acoustic modeling in a Speaker Adaptive Training (SAT) context on large vocabulary conversational speech databases, including the Switchboard database. SPAM models are Gaussian mixture models in which a subspace constraint is placed on the precision and mean matrices (although this paper focuses on the case of unconstrained means). They include diagonal covariance, full covariance, MLLT, and EMLLT models as special cases. Adaptation is carried out with maximum likelihood estimation of the means and feature-space under the SPAM model. This paper shows the first experimental evidence that the SPAM models can achieve significant word-error-rate improvements over state-of-the-art diagonal covariance models, even when those diagonal models are given the benefit of choosing the optimal number of Gaussians (according to the Bayesian Information Criterion). It is also the first to apply SPAM models in a SAT context. All experiments are performed on the IBM “Superhuman” speech corpus, a challenging and diverse conversational speech test set that includes the Switchboard portion of the 1998 Hub5e evaluation data set.
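SPAM generalizes the rank-one EMLLT basis above to arbitrary symmetric matrices, so each precision lies in a shared affine subspace. A minimal sketch under the same caveats (the scaled identity added below is an assumption that keeps this toy example positive definite; a real trainer constrains the weights instead):

    import numpy as np

    d, K, J = 20, 5, 3
    rng = np.random.default_rng(0)

    # Shared basis of general symmetric matrices (the rank-one a_k a_k^T
    # matrices of EMLLT are the special case).
    B = [(M + M.T) / 2 for M in rng.normal(size=(K, d, d))]

    # Precision of Gaussian j: P_j = c * I + sum_k Lam[j, k] * B_k.
    Lam = 0.05 * rng.normal(size=(J, K))
    precisions = [10.0 * np.eye(d) + sum(L[k] * B[k] for k in range(K))
                  for L in Lam]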