
    Porting concepts from DNNs back to GMMs

    Deep neural networks (DNNs) have been shown to outperform Gaussian mixture models (GMMs) on a variety of speech recognition benchmarks. In this paper we analyze the differences between the DNN and GMM modeling techniques and port the best ideas from the DNN-based modeling to a GMM-based system. By going both deep (multiple layers) and wide (multiple parallel sub-models) and by sharing model parameters, we are able to close the gap between the two modeling techniques on the TIMIT database. Since the 'deep' GMMs retain the maximum-likelihood trained Gaussians as their first layer, advanced techniques such as speaker adaptation and model-based noise robustness can be readily incorporated. Despite their similarities, the DNNs and the deep GMMs still show enough complementarity to allow effective system combination.
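
    The paper's deep/wide GMM architecture is not detailed in the abstract, but the "first layer" it refers to is the standard computation of frame likelihoods and component posteriors under maximum-likelihood trained Gaussians. A minimal NumPy sketch of that first-layer step, assuming diagonal covariances (the array shapes and function name are illustrative, not taken from the paper):

```python
import numpy as np

def gmm_first_layer(X, weights, means, covars):
    """Per-frame log-likelihoods and component posteriors of a diagonal-covariance GMM.

    X:       (T, D) feature frames
    weights: (K,)   mixture weights
    means:   (K, D) component means
    covars:  (K, D) diagonal covariances
    """
    diff = X[:, None, :] - means[None, :, :]                      # (T, K, D)
    log_norm = -0.5 * np.log(2 * np.pi * covars).sum(axis=1)      # (K,)
    log_comp = log_norm - 0.5 * (diff ** 2 / covars).sum(axis=2)  # (T, K)
    log_joint = np.log(weights) + log_comp                        # (T, K)
    # frame log-likelihood log p(x_t) via log-sum-exp over components
    log_px = np.logaddexp.reduce(log_joint, axis=1)               # (T,)
    # component posteriors p(k | x_t), i.e. the "first layer" outputs
    posteriors = np.exp(log_joint - log_px[:, None])
    return log_px, posteriors
```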

    From Projection Pursuit and CART to Adaptive Discriminant Analysis

    While many efforts have been put into the development of nonlinear approximation theory and its applications to signal and image compression, encoding and denoising, there seem to be very few theoretical developments of adaptive discriminant representations in the area of feature extraction, selection and signal classification. In this paper, we try to advocate the idea that such developments and efforts are worthwhile, based on the theoretical study of a data-driven discriminant analysis method on a simple yet instructive example. We consider the problem of classifying a signal drawn from a mixture of two classes, using its projections onto low-dimensional subspaces. Unlike the linear discriminant analysis (LDA) strategy, which selects subspaces that do not depend on the observed signal, we consider an adaptive sequential selection of projections, in the spirit of nonlinear approximation and classification and regression trees (CART): at each step, the subspace is enlarged in a direction that maximizes the mutual information with the unknown class. We derive explicit characterizations of this adaptive discriminant analysis (ADA) strategy in two situations. When the two classes are Gaussian with the same covariance matrix but different means, the adaptive subspaces are actually nonadaptive and can be computed with an algorithm similar to orthonormal matching pursuit. When the classes are centered Gaussians with different covariances, the adaptive subspaces are spanned by eigenvectors of an operator given by the covariance matrices (just as could be predicted by regular LDA); however, we prove that the order in which the components along these eigenvectors are observed actually depends on the observed signal. Numerical experiments on synthetic data illustrate how data-dependent features can be used to outperform LDA on a classification task, and we discuss how our results could be applied in practice. Index Terms: classification and regression trees (CART), classification tree, discriminant analysis, mutual information, nonlinear approximation, projection pursuit, sequential testing.
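
    As a rough illustration of the equal-covariance case mentioned above, the sketch below greedily selects, from a hypothetical candidate dictionary of directions, the one with the largest standardized mean separation and then orthogonalizes the remaining candidates, in the spirit of matching-pursuit-style selection. It is an assumption-laden toy version, not the paper's ADA algorithm: the separation ratio is used as a monotone proxy for the mutual information in the one-dimensional equal-covariance setting.

```python
import numpy as np

def greedy_discriminant_directions(mu0, mu1, Sigma, candidates, n_select=3):
    """Greedy selection of projection directions for two equal-covariance Gaussians.

    At each step, pick the candidate direction w with the largest standardized
    mean separation |w^T (mu1 - mu0)| / sqrt(w^T Sigma w), then orthogonalize
    the remaining candidates against it (Gram-Schmidt) so later picks add
    new information.
    """
    delta = mu1 - mu0
    cands = [c / np.linalg.norm(c) for c in candidates]
    selected = []
    for _ in range(n_select):
        scores = [abs(w @ delta) / np.sqrt(w @ Sigma @ w) for w in cands]
        w = cands.pop(int(np.argmax(scores)))
        selected.append(w)
        # remove the chosen component from the remaining candidates
        pruned = []
        for c in cands:
            c = c - (c @ w) * w
            norm = np.linalg.norm(c)
            if norm > 1e-8:
                pruned.append(c / norm)
        cands = pruned
        if not cands:
            break
    return np.array(selected)
```

    For example, `candidates` could be the rows of `np.eye(D)` (coordinate projections) or a random dictionary of unit vectors.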

    Speech Recognition Using Augmented Conditional Random Fields

    Acoustic modeling based on hidden Markov models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ability to model spectral phenomena well. In this paper, a new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed. This paradigm addresses some limitations of HMMs while maintaining many of the aspects which have made them successful. In particular, the acoustic modeling problem is reformulated in a data-driven, sparse, augmented space to increase discrimination. Acoustic context modeling is explicitly integrated to handle the sequential phenomena of the speech signal. We present an efficient framework for estimating these models that ensures scalability and generality. On the TIMIT phone recognition task, a phone error rate of 23.0% was recorded on the full test set, a significant improvement over comparable HMM-based systems.
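
    The augmented CRF formulation itself is not reproduced in the abstract, but the conditional sequence modeling it builds on can be illustrated with a plain linear-chain CRF. A minimal sketch of the log-likelihood of a label sequence given per-frame scores; the emission and transition scores are assumed inputs here, not the paper's augmented feature space:

```python
import numpy as np

def crf_log_likelihood(emissions, transitions, labels):
    """Log-likelihood of a label sequence under a linear-chain CRF.

    emissions:   (T, S) per-frame label scores (e.g. derived from acoustic features)
    transitions: (S, S) label-to-label scores
    labels:      (T,)   reference label sequence
    """
    T, S = emissions.shape
    # score of the reference label path
    path_score = emissions[0, labels[0]]
    for t in range(1, T):
        path_score += transitions[labels[t - 1], labels[t]] + emissions[t, labels[t]]
    # log partition function via the forward recursion in log space
    alpha = emissions[0]                                   # (S,)
    for t in range(1, T):
        alpha = emissions[t] + np.logaddexp.reduce(
            alpha[:, None] + transitions, axis=0)          # (S,)
    log_Z = np.logaddexp.reduce(alpha)
    return path_score - log_Z
```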

    Speaker-independent emotion recognition exploiting a psychologically-inspired binary cascade classification schema

    In this paper, a psychologically-inspired binary cascade classification schema is proposed for speech emotion recognition. Performance is enhanced because commonly confused pairs of emotions are distinguishable from one another. Extracted features are related to statistics of pitch, formants, and energy contours, as well as spectrum, cepstrum, perceptual and temporal features, autocorrelation, MPEG-7 descriptors, Fujisaki's model parameters, voice quality, jitter, and shimmer. Selected features are fed as input to a k-nearest-neighbor classifier and to support vector machines. Two kernels are tested for the latter: linear and Gaussian radial basis function. The recently proposed speaker-independent experimental protocol is tested on the Berlin emotional speech database for each gender separately. The best emotion recognition accuracy, achieved by support vector machines with a linear kernel, equals 87.7%, outperforming state-of-the-art approaches. Statistical analysis is first carried out with respect to the classifiers' error rates and then to evaluate the information expressed by the classifiers' confusion matrices. © Springer Science+Business Media, LLC 2011
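
    A hedged sketch of the general binary-cascade idea (each node separates two groups of emotions with an SVM and a sample descends until a single emotion remains), assuming precomputed feature vectors; the grouping tree and emotion names are placeholders, not the psychologically derived splits or the feature set used in the paper:

```python
import numpy as np
from sklearn.svm import SVC

class BinaryCascade:
    """Cascade of binary SVMs over a nested pair structure of emotion labels.

    Example tree: (("anger", "happiness"), ("sadness", "neutral")) means the
    root SVM separates {anger, happiness} from {sadness, neutral}, and each
    child SVM separates the two emotions inside its group.
    """

    def __init__(self, tree, kernel="linear"):
        self.tree = tree
        self.kernel = kernel

    def _leaves(self, node):
        return [node] if isinstance(node, str) else [
            leaf for child in node for leaf in self._leaves(child)]

    def _fit_node(self, node, X, y):
        if isinstance(node, str):
            return node                       # leaf: a single emotion
        left, right = node
        left_leaves = self._leaves(left)
        mask = np.isin(y, left_leaves + self._leaves(right))
        yn = np.isin(y[mask], left_leaves).astype(int)
        clf = SVC(kernel=self.kernel).fit(X[mask], yn)
        return (clf, self._fit_node(left, X, y), self._fit_node(right, X, y))

    def fit(self, X, y):
        self.root_ = self._fit_node(self.tree, np.asarray(X), np.asarray(y))
        return self

    def _predict_one(self, node, x):
        if isinstance(node, str):
            return node
        clf, left, right = node
        return self._predict_one(left if clf.predict(x[None])[0] == 1 else right, x)

    def predict(self, X):
        return [self._predict_one(self.root_, x) for x in np.asarray(X)]
```

    Usage would look like `BinaryCascade((("anger", "happiness"), ("sadness", "neutral"))).fit(X_train, y_train).predict(X_test)`, with `X_train` containing whatever acoustic features have been extracted upstream.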

    Robustness issues in a data-driven spoken language understanding system

    Robustness is a key requirement in spoken language understanding (SLU) systems. Human speech is often ungrammatical and ill-formed, and there will frequently be a mismatch between training and test data. This paper discusses robustness and adaptation issues in a statistically-based SLU system which is entirely data-driven. To test robustness, the system has been evaluated on data from the Air Travel Information Service (ATIS) domain which has been artificially corrupted with varying levels of additive noise. Although the speech recognition performance degraded steadily, the system did not fail catastrophically. Indeed, the end-to-end performance of the complete system degraded significantly more slowly than that of the recognition component itself. In a second set of experiments, the ability to rapidly adapt the core understanding component of the system to a different application within the same broad domain has been tested. Using only a small amount of training data, experiments have shown that a semantic parser based on the Hidden Vector State (HVS) model, originally trained on the ATIS corpus, can be straightforwardly adapted to the somewhat different DARPA Communicator task using standard adaptation algorithms. The paper concludes by suggesting that the results presented provide initial support for the claim that an SLU system which is statistically based and trained entirely from data is intrinsically robust and can be readily adapted to new applications.
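
    The abstract does not say how the additive-noise corruption was generated; the sketch below shows one common way to produce such test conditions, mixing a noise signal into speech at a chosen signal-to-noise ratio (assuming 1-D sample arrays, not the paper's actual setup):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix a noise signal into a speech signal at a target SNR (in dB).

    The noise is tiled or truncated to the speech length and scaled so that
    10 * log10(P_speech / P_noise) equals snr_db.
    """
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```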