
    Very Deep Convolutional Neural Networks for Robust Speech Recognition

    This paper describes the extension and optimization of our previous work on very deep convolutional neural networks (CNNs) for effective recognition of noisy speech in the Aurora 4 task. The appropriate number of convolutional layers, the sizes of the filters, pooling operations and input feature maps are all modified: the filter and pooling sizes are reduced and the dimensions of the input feature maps are extended to allow adding more convolutional layers. Furthermore, appropriate input padding and input feature map selection strategies are developed. In addition, an adaptation framework using joint training of the very deep CNN with auxiliary i-vector and fMLLR features is developed. These modifications give substantial word error rate (WER) reductions over the standard CNN used as baseline. Finally, the very deep CNN is combined with an LSTM-RNN acoustic model, and it is shown that state-level weighted log-likelihood score combination in a joint acoustic model decoding scheme is very effective. On the Aurora 4 task, the very deep CNN achieves a WER of 8.81%, which improves to 7.99% with auxiliary-feature joint training and to 7.09% with LSTM-RNN joint decoding. (Comment: accepted by SLT 2016)
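
    As a minimal sketch of the state-level score combination described above (the interpolation weight, array shapes, and function name are illustrative assumptions, not details from the paper):

        import numpy as np

        def combine_state_scores(logp_cnn, logp_lstm, weight=0.5):
            # Both inputs: (frames, states) arrays of acoustic log-likelihoods
            # from the two acoustic models; `weight` would be tuned on a dev set.
            return weight * logp_cnn + (1.0 - weight) * logp_lstm

        # Example: combine scores for 3 frames over 4 HMM states.
        scores = combine_state_scores(np.log(np.random.rand(3, 4)),
                                      np.log(np.random.rand(3, 4)),
                                      weight=0.7)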

    Design of reservoir computing systems for the recognition of noise corrupted speech and handwriting


    Discriminative connectionist approaches for automatic speech recognition in cars

    The first part of this thesis is devoted to the evaluation of approaches which exploit the inherent redundancy of the speech signal to improve noise robustness. On the basis of this evaluation on the AURORA 2000 database, we further study two of the evaluated approaches in detail. The first is the hybrid RBF/HMM approach, which attempts to combine the superior classification performance of radial basis functions (RBFs) with the ability of HMMs to model time variation. The second uses neural networks to nonlinearly reduce the dimensionality of large feature vectors that include context frames, and we propose different MLP topologies for that purpose. Experiments on the AURORA 2000 database reveal that the performance of the first approach is similar to that of systems based on semi-continuous HMMs (SCHMMs). The second approach cannot outperform linear discriminant analysis (LDA) on a database recorded in real car environments, but it is on average significantly better than LDA on the AURORA 2000 database.
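
    A minimal sketch of the second approach, nonlinear dimensionality reduction of context-expanded feature vectors with an MLP bottleneck (the layer sizes, tanh nonlinearity, and helper names here are assumptions for illustration, not the thesis's exact topology):

        import numpy as np

        def splice(frames, context=4):
            # Stack each frame with +/-`context` neighbours into one long vector.
            padded = np.pad(frames, ((context, context), (0, 0)), mode='edge')
            return np.hstack([padded[i:i + len(frames)] for i in range(2 * context + 1)])

        def bottleneck_forward(x, W1, b1, Wb, bb):
            # Hidden tanh layer followed by a narrow linear bottleneck; the
            # bottleneck activations replace LDA-reduced features downstream.
            h = np.tanh(x @ W1 + b1)
            return h @ Wb + bb

        rng = np.random.default_rng(0)
        spliced = splice(rng.normal(size=(100, 13)))           # 100 frames, 13-dim features
        W1, b1 = rng.normal(size=(117, 500)) * 0.01, np.zeros(500)
        Wb, bb = rng.normal(size=(500, 32)) * 0.01, np.zeros(32)
        reduced = bottleneck_forward(spliced, W1, b1, Wb, bb)  # (100, 32)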

    On Use of Task Independent Training Data in Tandem Feature Extraction

    The problem we address in this paper is whether a feature extraction module trained on large amounts of task-independent data can improve the performance of stochastic models. We show that when only a small amount of task-specific training data is available, tandem features trained on task-independent data give considerable improvements over Perceptual Linear Prediction (PLP) cepstral features in Hidden Markov Model (HMM) based speech recognition systems.
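
    For reference, a minimal sketch of the usual tandem feature pipeline, log-compressed MLP phone posteriors followed by PCA decorrelation (function names are mine, and the PCA statistics are assumed to be estimated beforehand):

        import numpy as np

        def softmax(a):
            e = np.exp(a - a.max(axis=1, keepdims=True))
            return e / e.sum(axis=1, keepdims=True)

        def tandem_features(mlp_logits, pca_mean, pca_matrix):
            # Log-compress the phone posteriors, then decorrelate with a PCA
            # transform so they suit diagonal-covariance GMM-HMM systems.
            log_post = np.log(softmax(mlp_logits) + 1e-10)
            return (log_post - pca_mean) @ pca_matrix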

    Towards using hierarchical posteriors for flexible automatic speech recognition systems

    Local state (or phone) posterior probabilities are often investigated as local classifiers (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., "Tandem") towards improved speech recognition systems. In this paper, we present initial results towards boosting these approaches by improving the local state, phone, or word posterior estimates, using all possible acoustic information (as available in the whole utterance) as well as possible prior information (such as topological constraints). Furthermore, this approach results in a family of new HMM-based systems, where only (local and global) posterior probabilities are used, while also providing a new, principled approach towards a hierarchical use and integration of these posteriors, from the frame level up to the sentence level. Initial experiments on several speech (as well as other multimodal) tasks yielded significant improvements. In this paper, we present recognition results on Numbers'95 and on a reduced-vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task.
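
    A minimal sketch of how local posteriors enter a hybrid HMM/ANN decoder, assuming the standard posterior-to-scaled-likelihood conversion (the function name is hypothetical):

        import numpy as np

        def scaled_log_likelihoods(log_posteriors, log_priors):
            # Hybrid HMM/ANN conversion: p(x|q) is proportional to p(q|x)/p(q),
            # so subtracting log state priors turns ANN posteriors into scaled
            # likelihoods that a standard HMM decoder can consume in place of
            # GMM scores.
            return log_posteriors - log_priors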

    Noise-Robust Speech Recognition Using Deep Neural Network

    Ph.D. (Doctor of Philosophy) thesis.

    Recent advances in the multi-stream HMM/ANN hybrid approach to noise robust ASR

    In this article we review several successful extensions to the standard Hidden Markov Model / Artificial Neural Network (HMM/ANN) hybrid, which have recently made important contributions to the field of noise-robust automatic speech recognition. The first extension to the standard hybrid was the "multi-band hybrid", in which a separate ANN is trained on each frequency subband, followed by some form of weighted combination of ANN state posterior probability outputs prior to decoding. However, due to the inaccurate assumption of subband independence, this system usually gives degraded performance, except in the case of narrow-band noise. All of the systems which we review overcome this independence assumption and give improved performance in noise, while also improving, or not significantly degrading, performance on clean speech. The "all-combinations multi-band" hybrid trains a separate ANN for each subband combination; this, however, typically requires a large number of ANNs. The "all-combinations multi-stream" hybrid trains an ANN expert for every combination of just a small number of complementary data streams. Combining multiple ANN posteriors with maximum a posteriori (MAP) weighting gives rise to the further successful strategy of hypothesis-level combination by MAP selection. An alternative strategy for exploiting the classification capacity of ANNs is the "tandem hybrid" approach, in which one or more ANN classifiers are trained with multi-condition data to generate discriminative and noise-robust features for input to a standard ASR system. The "multi-stream tandem hybrid" trains an ANN for a number of complementary feature streams, permitting multi-stream data fusion. The "narrow-band tandem hybrid" trains an ANN for a number of particularly narrow frequency subbands, which gives improved robustness to noises not seen during training. Of the systems presented, all of the multi-stream systems provide generic models for multi-modal data fusion. Test results for each system are presented and discussed.
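
    A minimal sketch of one instance of the multi-stream posterior combination described above, weighted product-rule fusion in the log domain (the uniform-weight default and function name are assumptions; in the reviewed systems the weights would come from, e.g., MAP estimation):

        import numpy as np

        def combine_streams(stream_log_posts, weights):
            # Weighted product-rule fusion of per-stream ANN log posteriors.
            combined = sum(w * lp for w, lp in zip(weights, stream_log_posts))
            # Renormalise so each frame's posteriors again sum to one.
            return combined - np.logaddexp.reduce(combined, axis=1, keepdims=True)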

    Combining Evidence from a Generative and a Discriminative Model in Phoneme Recognition

    We investigate the use of the log-likelihood of features obtained from a generative Gaussian mixture model, and the posterior probability of phonemes from a discriminative multilayer perceptron, in multi-stream combination for phoneme recognition. Two multi-stream combination techniques, early integration and late integration, are used to combine the evidence from these models. By using multi-stream combination, we obtain a phoneme recognition accuracy of 74% on the standard TIMIT database, an absolute improvement of 2.5% over the single best stream.
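
    A minimal sketch of the two combination styles named above (feature shapes, the linear score interpolation, and the weight `w` are illustrative assumptions):

        import numpy as np

        def early_integration(stream_a, stream_b):
            # Early integration: concatenate evidence at the feature level so
            # a single classifier sees both streams at once.
            return np.hstack([stream_a, stream_b])

        def late_integration(gmm_loglik, mlp_log_post, w=0.5):
            # Late integration: linearly combine per-phoneme scores from the
            # generative (GMM) and discriminative (MLP) models; `w` is tuned.
            return w * gmm_loglik + (1.0 - w) * mlp_log_post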

    Learning representations for speech recognition using artificial neural networks

    Learning representations is a central challenge in machine learning. For speech recognition, we are interested in learning robust representations that are stable across different acoustic environments, recording equipment, and irrelevant inter- and intra-speaker variabilities. This thesis is concerned with representation learning for acoustic model adaptation to speakers and environments, construction of acoustic models in low-resource settings, and learning representations from multiple acoustic channels. The investigations are primarily focused on the hybrid approach to acoustic modelling based on hidden Markov models and artificial neural networks (ANNs).

    The first contribution concerns acoustic model adaptation. It comprises two new adaptation transforms operating in ANN parameter space. Both operate at the level of activation functions and treat a trained ANN acoustic model as a canonical set of fixed basis functions, from which one can later derive variants tailored to the specific distribution present in the adaptation data. The first technique, termed Learning Hidden Unit Contributions (LHUC), learns distribution-dependent linear combination coefficients for hidden units. This technique is then extended to altering groups of hidden units with parametric and differentiable pooling operators. The proposed adaptation techniques have many desirable properties: they are relatively low-dimensional, do not overfit, and can work in both a supervised and an unsupervised manner. For LHUC we also present extensions to speaker adaptive training and environment factorisation. On average, depending on the characteristics of the test set, 5-25% relative word error rate reductions (WERRs) are obtained in an unsupervised two-pass adaptation setting.

    The second contribution concerns building acoustic models in low-resource data scenarios. In particular, we consider settings with insufficient transcribed acoustic material for estimating acoustic models in the target language, while assuming that resources such as lexicons and texts for estimating language models are available. First, we propose an ANN with a structured output layer which models both context-dependent and context-independent speech units, with the context-independent predictions used at runtime to aid the prediction of context-dependent states. We also propose to perform multi-task adaptation with a structured output layer. We obtain consistent WERRs of up to 6.4% in low-resource speaker-independent acoustic modelling. Adapting those models in a multi-task manner with LHUC brings an additional 13.6% WERR, compared to 12.7% for non-multi-task LHUC. We then demonstrate that one can build better acoustic models with unsupervised multi- and cross-lingual initialisation, and find that pre-training is largely language-independent. Up to 14.4% WERRs are observed, depending on the amount of transcribed acoustic data available in the target language.

    The third contribution concerns building acoustic models from multi-channel acoustic data. For this purpose we investigate various ways of integrating and learning multi-channel representations. In particular, we investigate channel concatenation and the applicability of convolutional layers for this purpose. We propose a multi-channel convolutional layer with cross-channel pooling, which can be seen as a data-driven, non-parametric auditory attention mechanism. We find that for unconstrained microphone arrays, our approach matches the performance of comparable models trained on beamform-enhanced signals.
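
    A minimal sketch of the LHUC idea from the first contribution (the ReLU hidden layer and the 2*sigmoid amplitude re-parameterisation are common choices in published LHUC work, assumed here rather than taken from this abstract):

        import numpy as np

        def lhuc_forward(x, W, b, r):
            # Speaker-independent hidden layer (a ReLU is assumed here);
            # W and b stay fixed at their speaker-independent values.
            h = np.maximum(0.0, x @ W + b)
            # Per-unit amplitudes constrained to (0, 2) via 2*sigmoid(r);
            # during adaptation only the small vector `r` is re-estimated,
            # which keeps the transform low-dimensional and hard to overfit.
            amplitude = 2.0 / (1.0 + np.exp(-r))
            return amplitude * h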