
    Deep Scattering Power Spectrum Features for Robust Speech Recognition


    Very Deep Convolutional Neural Networks for Robust Speech Recognition

    This paper describes the extension and optimization of our previous work on very deep convolutional neural networks (CNNs) for effective recognition of noisy speech in the Aurora 4 task. The appropriate number of convolutional layers, the sizes of the filters, pooling operations, and input feature maps are all modified: the filter and pooling sizes are reduced and the dimensions of the input feature maps are extended to allow adding more convolutional layers. Furthermore, appropriate input padding and input feature map selection strategies are developed. In addition, an adaptation framework using joint training of the very deep CNN with auxiliary i-vector and fMLLR features is developed. These modifications give substantial word error rate reductions over the standard CNN used as baseline. Finally, the very deep CNN is combined with an LSTM-RNN acoustic model, and it is shown that state-level weighted log-likelihood score combination in a joint acoustic model decoding scheme is very effective. On the Aurora 4 task, the very deep CNN achieves a WER of 8.81%, 7.99% with auxiliary feature joint training, and 7.09% with LSTM-RNN joint decoding. Comment: accepted by SLT 201
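    A minimal sketch of the state-level weighted log-likelihood score combination described above, assuming both acoustic models emit per-frame, per-state log-likelihood matrices and that the interpolation weight and function names are hypothetical:

import numpy as np

def combine_state_loglikes(loglik_cnn, loglik_lstm, weight=0.5):
    """Frame-by-frame, state-level weighted log-likelihood combination (sketch).

    loglik_cnn, loglik_lstm: arrays of shape (num_frames, num_states) holding
    per-state log-likelihoods from the two acoustic models.
    weight: interpolation weight for the CNN stream (illustrative value only).
    """
    assert loglik_cnn.shape == loglik_lstm.shape
    # Linear interpolation in the log domain; the combined scores would then be
    # passed to the usual Viterbi/WFST decoder in place of a single model.
    return weight * loglik_cnn + (1.0 - weight) * loglik_lstm

# Toy usage with random scores for 3 frames and 4 HMM states.
rng = np.random.default_rng(0)
combined = combine_state_loglikes(rng.normal(size=(3, 4)),
                                  rng.normal(size=(3, 4)),
                                  weight=0.6)
print(combined.shape)  # (3, 4)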

    AN EFFICIENT AND ROBUST MULTI-STREAM FRAMEWORK FOR END-TO-END SPEECH RECOGNITION

    In voice-enabled domestic or meeting environments, distributed microphone arrays aim to transcribe distant-speech interaction into text with high accuracy. However, with dynamic corruption from noise, reverberation, or human movement, there is no guarantee that any microphone array (stream) is constantly informative. In these cases, an appropriate strategy to dynamically fuse streams is necessary. The multi-stream paradigm in Automatic Speech Recognition (ASR) considers scenarios where parallel streams carry diverse or complementary task-related knowledge. Such streams could be microphone arrays, frequency bands, various modalities, etc. Hence, robust stream fusion is crucial to emphasize more informative streams over corrupted ones, especially under unseen conditions. This thesis focuses on improving the performance and robustness of speech recognition in multi-stream scenarios. With the increasing use of Deep Neural Networks (DNNs) in ASR, End-to-End (E2E) approaches, which directly transcribe human speech into text, have received greater attention. In this thesis, a multi-stream framework is presented based on the joint Connectionist Temporal Classification/ATTention (CTC/ATT) E2E model, where parallel streams are represented by separate encoders. On top of the regular attention networks, a secondary stream-fusion network steers the decoder toward the most informative streams. The MEM-Array model aims at improving far-field ASR robustness using microphone arrays, each handled by a separate encoder. Since an increasing number of streams (encoders) requires substantial memory and massive amounts of parallel data, a practical two-stage training strategy is designed to address these issues. Furthermore, a two-stage augmentation scheme is presented to improve the robustness of the multi-stream model. In MEM-Res, two heterogeneous encoders with different architectures, temporal resolutions and separate CTC networks work in parallel to extract complementary information from the same acoustics. Compared with the best single-stream performance, both models achieve substantial improvements, outperforming alternative fusion strategies. While the proposed framework optimizes information in multi-stream scenarios, this thesis also studies Performance Monitoring (PM) measures to predict whether the recognition results of an E2E model are reliable without ground-truth knowledge. Four PM techniques are investigated, suggesting that PM measures on attention distributions and decoder posteriors are well correlated with true performance
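    A minimal sketch of the secondary stream-fusion step, assuming each stream's regular attention has already produced a per-stream context vector; the scoring parametrisation (a single bilinear matrix W) and all names are assumptions, not the thesis's exact fusion network:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def stream_fusion(contexts, decoder_state, W):
    """Secondary stream-level attention over per-stream context vectors.

    contexts:      list of per-stream context vectors, each of shape (d,).
    decoder_state: current decoder hidden state, shape (d,).
    W:             learned scoring matrix, shape (d, d) -- a stand-in for the
                   real fusion network's parametrisation.
    """
    # Score each stream against the decoder state, then softmax over streams.
    scores = np.array([decoder_state @ W @ c for c in contexts])
    stream_weights = softmax(scores)
    # The fused context is a weighted sum of the per-stream contexts.
    fused = sum(w * c for w, c in zip(stream_weights, contexts))
    return fused, stream_weights

d = 8
rng = np.random.default_rng(1)
ctx = [rng.normal(size=d) for _ in range(2)]  # e.g. two microphone-array streams
fused, weights = stream_fusion(ctx, rng.normal(size=d), rng.normal(size=(d, d)))
print(weights)  # per-stream attention weights, summing to 1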

    Learning Hidden Unit Contributions for Unsupervised Acoustic Model Adaptation

    This work presents a broad study on the adaptation of neural network acoustic models by means of learning hidden unit contributions (LHUC) -- a method that linearly re-combines hidden units in a speaker- or environment-dependent manner using small amounts of unsupervised adaptation data. We also extend LHUC to a speaker adaptive training (SAT) framework that leads to a more adaptable DNN acoustic model, working both in a speaker-dependent and a speaker-independent manner, without the requirement to maintain auxiliary speaker-dependent feature extractors or to introduce significant speaker-dependent changes to the DNN structure. Through a series of experiments on four different speech recognition benchmarks (TED talks, Switchboard, AMI meetings, and Aurora4) comprising 270 test speakers, we show that LHUC in both its test-only and SAT variants results in consistent word error rate reductions ranging from 5% to 23% relative depending on the task and the degree of mismatch between training and test data. In addition, we have investigated the effect of the amount of adaptation data per speaker, the quality of unsupervised adaptation targets, the complementarity to other adaptation techniques, one-shot adaptation, and an extension to adapting DNNs trained in a sequence discriminative manner. Comment: 14 pages, 9 Tables, 11 Figures in IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 24, Num. 8, 201
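    A minimal sketch of the LHUC re-weighting of one hidden layer, assuming the commonly used 2*sigmoid re-parametrisation of the speaker-dependent amplitudes; the function and variable names are illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lhuc_forward(hidden, speaker_params):
    """Apply LHUC scaling to a layer's hidden activations (illustrative only).

    hidden:         (batch, num_units) activations of one hidden layer.
    speaker_params: (num_units,) speaker-dependent parameters learned on the
                    adaptation data; 2*sigmoid(a) keeps each amplitude in
                    (0, 2), with a = 0 reproducing the unadapted network.
    """
    amplitudes = 2.0 * sigmoid(speaker_params)
    return hidden * amplitudes  # element-wise re-weighting of hidden units

# Sanity check: zero parameters leave the activations unchanged.
h = np.random.default_rng(2).normal(size=(4, 16))
assert np.allclose(lhuc_forward(h, np.zeros(16)), h)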

    Deep Maxout Networks applied to Noise-Robust Speech Recognition

    Proceedings of IberSPEECH 2014, "VIII Jornadas en Tecnologías del Habla" and "IV Iberian SLTech Workshop", Las Palmas de Gran Canaria, Spain, November 19-21, 2014. Deep Neural Networks (DNN) have become very popular for acoustic modeling due to the improvements found over traditional Gaussian Mixture Models (GMM). However, not many works have addressed the robustness of these systems under noisy conditions. Recently, the machine learning community has proposed new methods to improve the accuracy of DNNs by using techniques such as dropout and maxout. In this paper, we investigate Deep Maxout Networks (DMN) for acoustic modeling in a noisy automatic speech recognition environment. Experiments show that DMNs substantially improve recognition accuracy over DNNs and other traditional techniques in both clean and noisy conditions on the TIMIT dataset. This contribution has been supported by an Airbus Defense and Space Grant (Open Innovation - SAVIER) and Spanish Government-CICYT project 2011-26807/TEC.
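    A minimal sketch of the maxout activation that distinguishes a DMN layer from a standard DNN layer: each output unit takes the maximum over a small group of linear pieces. Shapes and names here are assumptions for illustration:

import numpy as np

def maxout(x, num_pieces):
    """Maxout activation (a minimal sketch).

    x: pre-activations of shape (batch, num_units * num_pieces), i.e. each
       output unit owns `num_pieces` linear pieces.
    Returns (batch, num_units): the max over each group of pieces.
    """
    batch, total = x.shape
    assert total % num_pieces == 0
    grouped = x.reshape(batch, total // num_pieces, num_pieces)
    return grouped.max(axis=-1)

# With enough pieces a maxout unit can approximate ReLU, |x|, and other shapes.
z = np.array([[-1.0, 3.0, 0.5, -0.2]])  # 2 output units, 2 pieces each
print(maxout(z, num_pieces=2))          # [[3.  0.5]]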

    Towards End-to-End Speech Recognition

    Standard automatic speech recognition (ASR) systems follow a divide and conquer approach to convert speech into text. That is, the end goal is achieved by a combination of sub-tasks, namely feature extraction, acoustic modeling and sequence decoding, which are optimized independently. More recently, deep learning approaches have emerged in the machine learning community which allow systems to be trained in an end-to-end manner. Such approaches have found success in natural language processing and computer vision, and have consequently piqued interest in the speech community. The present thesis builds on these recent advances to investigate approaches to developing speech recognition systems in an end-to-end manner. In that respect, the thesis follows two main axes of research. The first axis focuses on joint learning of features and classifiers for acoustic modeling. The second axis focuses on joint modeling of the acoustic model and the decoder. Along the first axis, in the framework of hybrid hidden Markov model/artificial neural network (HMM/ANN) based ASR, we develop a convolutional neural network (CNN) based acoustic modeling approach that takes the raw speech signal as input and estimates phone class conditional probabilities. Specifically, the CNN has several convolution layers (feature stage) followed by a multilayer perceptron (classifier stage), which are jointly optimized during training. Through ASR studies on multiple languages and extensive analysis of the approach, we show that the proposed approach, with minimal prior knowledge, is able to learn automatically the relevant features from the raw speech signal. This approach yields systems that have fewer parameters and achieve better performance when compared to the conventional approach of cepstral feature extraction followed by classifier training. As the features are automatically learned from the signal, a natural question that arises is: are such systems robust to noise? Towards that, we propose a robust CNN approach, referred to as the normalized CNN approach, which yields systems that are as robust as or better than conventional ASR systems using cepstral features (with feature-level normalizations). The second axis of research focuses on end-to-end sequence-to-sequence conversion. We first propose an end-to-end phoneme recognition system in which the relevant features, the classifier and the decoder (based on conditional random fields) are jointly modeled during training. We demonstrate the viability of the approach on the TIMIT phoneme recognition task. Building on top of that, we investigate a ``weakly supervised'' training that alleviates the necessity for frame-level alignments. Finally, we extend the weakly supervised approach to propose a novel keyword spotting technique. In this technique, a CNN first processes the input observation sequence to output word-level scores, which are subsequently aggregated to detect or spot words. We demonstrate the potential of the approach through a comparative study on LibriSpeech with the standard approach of keyword spotting based on lattice indexing using an ASR system
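    A toy sketch of the feature stage plus classifier stage idea: one learnable convolution over the raw samples followed by pooling and a linear/softmax classifier producing phone class conditional probabilities. The single conv layer, all shapes and names are assumptions; the thesis work uses several convolution layers followed by a multilayer perceptron:

import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def raw_waveform_cnn(signal, conv_filters, clf_weights):
    """Estimate phone posteriors from raw samples (single-layer sketch).

    signal:       1-D array of raw samples for one analysis window.
    conv_filters: (num_filters, filter_len) learnable feature-stage filters.
    clf_weights:  (num_filters, num_phones) classifier-stage weights.
    """
    num_filters, flen = conv_filters.shape
    # Valid 1-D convolution of every filter with the signal.
    windows = np.lib.stride_tricks.sliding_window_view(signal, flen)
    conv_out = windows @ conv_filters.T              # (num_positions, num_filters)
    pooled = np.maximum(conv_out, 0).max(axis=0)     # ReLU + max pooling over time
    return softmax(pooled @ clf_weights)             # phone class conditional probs

rng = np.random.default_rng(3)
probs = raw_waveform_cnn(rng.normal(size=400),       # e.g. 25 ms at 16 kHz
                         rng.normal(size=(8, 30)),
                         rng.normal(size=(8, 10)))
print(probs.sum())  # ~1.0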

    Environmentally robust ASR front-end for deep neural network acoustic models

    This paper examines the individual and combined impacts of various front-end approaches on the performance of deep neural network (DNN) based speech recognition systems in distant-talking situations, where acoustic environmental distortion degrades recognition performance. Training of a DNN-based acoustic model consists of generating state alignments followed by learning the network parameters. This paper first shows that the network parameters are more sensitive to speech quality than the alignments, and thus this stage requires improvement. Then, various front-end robustness approaches to addressing this problem are categorised based on functionality. The degree to which each class of approaches impacts the performance of DNN-based acoustic models is examined experimentally. Based on the results, a front-end processing pipeline is proposed for efficiently combining different classes of approaches. Using this front-end, the combined effects of different classes of approaches are further evaluated in a single distant microphone-based meeting transcription task with both speaker independent (SI) and speaker adaptive training (SAT) set-ups. By combining multiple speech enhancement results, multiple types of features, and feature transformation, the front-end shows relative performance gains of 7.24% and 9.83% in the SI and SAT scenarios, respectively, over competitive DNN-based systems using log mel-filter bank features. This is the final version of the article. It first appeared from Elsevier via http://dx.doi.org/10.1016/j.csl.2014.11.00
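    A minimal sketch of how such a pipeline might combine multiple enhancement results, append a complementary feature type, and apply a feature transformation before DNN training or decoding. The averaging fusion, the choice of extra features and all names are assumptions, not the paper's exact pipeline:

import numpy as np

def frontend_combine(enhanced_feats, extra_feats, transform):
    """Combine multiple enhancement outputs and feature types (a sketch).

    enhanced_feats: list of (num_frames, d1) log-mel feature matrices, one per
                    speech-enhancement result for the same utterance.
    extra_feats:    (num_frames, d2) matrix of a complementary feature type
                    (placeholder for e.g. pitch or other auxiliary features).
    transform:      (d1 + d2, d_out) feature-transformation matrix, standing in
                    for whatever linear transform the real pipeline estimates.
    """
    # 1) Fuse the enhancement results by frame-wise averaging.
    fused = np.mean(np.stack(enhanced_feats, axis=0), axis=0)
    # 2) Append the complementary feature stream.
    stacked = np.concatenate([fused, extra_feats], axis=1)
    # 3) Apply the feature transformation before DNN training/decoding.
    return stacked @ transform

rng = np.random.default_rng(4)
out = frontend_combine([rng.normal(size=(100, 40)) for _ in range(2)],
                       rng.normal(size=(100, 3)),
                       rng.normal(size=(43, 40)))
print(out.shape)  # (100, 40)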