201 research outputs found

    Exploiting Low-dimensional Structures to Enhance DNN Based Acoustic Modeling in Speech Recognition

    We propose to model the acoustic space of deep neural network (DNN) class-conditional posterior probabilities as a union of low-dimensional subspaces. To that end, the training posteriors are used for dictionary learning and sparse coding. Sparse representation of the test posteriors using this dictionary enables projection onto the space of the training data. Relying on the fact that the intrinsic dimension of each posterior subspace is very small and the matrix of all posteriors belonging to a class has very low rank, we demonstrate how low-dimensional structures enable further enhancement of the posteriors and rectify spurious errors due to mismatched conditions. The enhanced acoustic modeling method leads to improvements in a continuous speech recognition task using the hybrid DNN-HMM (hidden Markov model) framework in both clean and noisy conditions, where up to a 15.4% relative reduction in word error rate (WER) is achieved.
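
    A minimal Python sketch of this recipe follows, using scikit-learn's dictionary learning and sparse coding; the data, dimensions, and hyperparameters are placeholders rather than the paper's actual configuration.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

# Placeholder posteriors: rows are frames, columns are DNN output classes.
rng = np.random.default_rng(0)
train_posteriors = rng.dirichlet(np.ones(40), size=1000)
test_posteriors = rng.dirichlet(np.ones(40), size=200)

# Learn a dictionary whose atoms span the low-dimensional posterior subspaces.
dico = DictionaryLearning(n_components=100, alpha=0.1, max_iter=20,
                          transform_algorithm='lasso_lars', random_state=0)
dico.fit(train_posteriors)

# Sparse-code the test posteriors over the training dictionary, then
# reconstruct: the reconstruction projects them onto the training-data space.
codes = sparse_encode(test_posteriors, dico.components_,
                      algorithm='lasso_lars', alpha=0.1)
enhanced = codes @ dico.components_

# Renormalize so every frame is again a valid probability distribution.
enhanced = np.clip(enhanced, 1e-10, None)
enhanced /= enhanced.sum(axis=1, keepdims=True)
```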

    Posterior-based Sparse Representation for Automatic Speech Recognition

    Posterior features have been shown to yield very good performance in multiple contexts, including speech recognition, spoken term detection, and template matching. Nowadays, posterior features are usually estimated at the output of a neural network. More recently, sparse representation has also been shown to provide additional advantages in discrimination and robustness. One possible instance of this is referred to as exemplar-based sparse representation. The present work investigates how to exploit sparse modelling together with posterior-space properties to further improve speech recognition features. In that context, we leverage exemplar-based sparse representation and propose a novel approach to project phone posterior features into a new, high-dimensional, sparse feature space. Exploiting the properties of posterior spaces, we generate new, high-dimensional, linguistically inspired (sub-phone and word) posterior distributions. Validation experiments performed on the Phonebook (isolated words) and HIWIRE (continuous speech) databases support the effectiveness of the proposed approach for speech recognition tasks.
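
    A rough illustration of the exemplar-based idea, under our own assumptions (hypothetical shapes, labels, and pooling rule rather than the paper's setup): the dictionary consists of labelled training posterior exemplars, and the sparse activation weights are pooled per word label to form a new, higher-level word posterior.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

n_phones, n_exemplars, n_words = 40, 500, 75
rng = np.random.default_rng(0)

# Placeholder exemplar dictionary: one phone-posterior vector per exemplar,
# each exemplar carrying a word label.
exemplars = rng.dirichlet(np.ones(n_phones), size=n_exemplars)
labels = rng.integers(0, n_words, size=n_exemplars)

test_frame = rng.dirichlet(np.ones(n_phones), size=1)

# Sparse activation weights of the test frame over the training exemplars.
w = sparse_encode(test_frame, exemplars, algorithm='lasso_lars', alpha=0.05)[0]

# Pool positive weights per word label into a word-level posterior.
word_scores = np.zeros(n_words)
np.add.at(word_scores, labels, np.maximum(w, 0.0))
word_posterior = word_scores / max(word_scores.sum(), 1e-10)
```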

    Integrating Posterior Features and Self-Organizing Maps for Isolated Word Recognition without Dynamic Programming

    In recent work, using phone class-conditional posterior probabilities (posterior features) directly as features has provided successful results in template-based ASR systems. Moreover, it has been shown that these features tend to be sparse and orthogonal. Given such properties, new types of ASR can be investigated. In this work, we investigate the use of Self-Organizing Maps to transform sequences of posterior feature vectors representing words into sparse, fixed-size patterns. We evaluate the ability of these patterns to discriminate between words in the context of template-based ASR using a simple histogram-matching technique (i.e., without the use of dynamic programming). We present experiments on a 75-word, speaker- and task-independent isolated word recognition task.
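
    A minimal sketch of such a pipeline, assuming the third-party MiniSom library and placeholder data (the paper does not prescribe this particular implementation): each posterior frame of a word activates its best-matching SOM unit, the activations accumulate into a fixed-size histogram, and words are compared by a simple histogram distance, with no dynamic programming.

```python
import numpy as np
from minisom import MiniSom

n_phones = 40
rng = np.random.default_rng(0)
frames = rng.dirichlet(np.ones(n_phones), size=5000)  # placeholder training frames

# Train a 10x10 SOM on posterior feature vectors.
som = MiniSom(10, 10, n_phones, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(frames, 10000)

def word_pattern(word_frames):
    """Fixed-size (10x10) histogram of winning SOM units for one word."""
    hist = np.zeros((10, 10))
    for f in word_frames:
        hist[som.winner(f)] += 1.0
    return (hist / max(hist.sum(), 1e-10)).ravel()

def histogram_distance(p, q):
    """Simple L1 histogram-matching score between two word patterns."""
    return float(np.abs(p - q).sum())
```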

    Posterior Features for Template-based ASR

    This paper investigates the use of phoneme class-conditional probabilities as features (posterior features) for template-based ASR. Using 75-word and 600-word task-independent and speaker-independent setups on the Phonebook database, we investigate different posterior distribution estimators, different distance measures better suited to posterior distributions, and different training data. The reported experiments clearly demonstrate that posterior features are consistently superior to, and generalize better than, other classical acoustic features (at the cost of training a posterior distribution estimator).
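
    For context, template-based ASR typically scores a test utterance against word templates via dynamic time warping (DTW) with a local distance suited to distributions. The sketch below uses a KL-divergence local cost as one illustrative choice; the paper compares several such measures, and this is not necessarily its exact configuration.

```python
import numpy as np

def kl(p, q, eps=1e-10):
    """KL divergence between two posterior vectors, used as a local cost."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def dtw_distance(template, test, local=kl):
    """Classic DTW alignment cost between two posterior-feature sequences."""
    n, m = len(template), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = local(template[i - 1], test[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Recognition: pick the word whose template has the lowest DTW cost, e.g.
# best_word = min(templates, key=lambda w: dtw_distance(templates[w], test_seq))
```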

    On MLP-based Posterior Features for Template-based ASR

    We investigate the invariance of posterior features, estimated using an MLP trained on an auxiliary corpus, across different data conditions and different distance measures for matching posterior features in the context of template-based ASR. Through ASR studies on an isolated word recognition task, we show that posterior features estimated using an MLP trained on an auxiliary corpus, without any kind of adaptation, can achieve comparable or better performance than the case where the MLP is trained on the same corpus as the test set. We also show that, as local scores, weighted symmetric KL-divergence and Bhattacharyya distance yield better systems than Hellinger distance, cosine angle, L1-norm, L2-norm, dot product, and cross entropy.
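
    Hedged NumPy sketches of several of the compared distance measures follow; the exact weighting scheme behind the paper's weighted symmetric KL-divergence is not reproduced here, so the plain symmetric form is shown instead.

```python
import numpy as np

EPS = 1e-10

def sym_kl(p, q):
    """Symmetric KL divergence between two posterior vectors."""
    p, q = np.clip(p, EPS, None), np.clip(q, EPS, None)
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def bhattacharyya(p, q):
    """Bhattacharyya distance, well suited to probability distributions."""
    return float(-np.log(np.sum(np.sqrt(np.clip(p * q, EPS, None)))))

def hellinger(p, q):
    """Hellinger distance between two distributions."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def cosine_angle(p, q):
    """One minus cosine similarity of the two vectors."""
    return float(1.0 - p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + EPS))

def cross_entropy(p, q):
    """Cross entropy of q under p."""
    return float(-np.sum(p * np.log(np.clip(q, EPS, None))))
```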

    Low-Rank Representation For Enhanced Deep Neural Network Acoustic Models

    Automatic speech recognition (ASR) is a fascinating area of research towards realizing human-machine interaction. After more than 30 years of exploiting Gaussian Mixture Models (GMMs), state-of-the-art systems now rely on Deep Neural Networks (DNNs) to estimate class-conditional posterior probabilities. The posterior probabilities are used for acoustic modeling in hidden Markov models (HMMs), forming the hybrid DNN-HMM approach that is now the leading edge in solving ASR problems. The present work builds upon the hypothesis that the optimal acoustic models are sparse and lie on multiple low-rank probability subspaces. Hence, the main goal of this Master's project was to investigate different ways to restructure the DNN outputs using low-rank representation. Exploiting a large number of training posterior vectors, the underlying low-dimensional subspace can be identified, and low-rank decomposition enables separation of the “optimal” posteriors from the spurious (unstructured) uncertainties at the DNN output. Experiments demonstrate that low-rank representation can enhance posterior probability estimation and lead to higher ASR accuracy. The posteriors are grouped according to their subspace similarities and structured through low-rank decomposition. Furthermore, a novel hashing technique exploiting the low-rank property of posterior subspaces is proposed, enabling fast search in the space of posterior exemplars.
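
    A minimal sketch of the core idea, under our own assumptions (argmax-based grouping and truncated SVD; the thesis may use different grouping and decomposition choices): posteriors grouped by class are projected onto their top principal subspace, suppressing the unstructured uncertainties.

```python
import numpy as np

def low_rank_enhance(posteriors, rank=3):
    """Project a matrix of same-class posterior vectors onto its top-`rank`
    subspace and renormalize each row to a probability distribution."""
    U, s, Vt = np.linalg.svd(posteriors, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    approx = np.clip(approx, 1e-10, None)
    return approx / approx.sum(axis=1, keepdims=True)

# Placeholder posteriors; group frames by their dominant class, then
# enhance each group through low-rank decomposition.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(40), size=1000)
labels = posteriors.argmax(axis=1)

enhanced = posteriors.copy()
for c in np.unique(labels):
    idx = np.where(labels == c)[0]
    if len(idx) > 3:  # need enough frames for a meaningful rank-3 subspace
        enhanced[idx] = low_rank_enhance(posteriors[idx], rank=3)
```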