End-to-end Phoneme Sequence Recognition using Convolutional Neural Networks
Most state-of-the-art phoneme recognition systems rely on classical neural
network classifiers fed with highly tuned features, such as MFCC or PLP
features. Recent advances in ``deep learning'' approaches have called such
systems into question, but while some attempts have been made with simpler
features such as spectrograms, state-of-the-art systems still rely on MFCCs.
This might be viewed as a kind of failure of deep learning approaches, which
are often claimed to be able to train from raw signals, alleviating the need
for hand-crafted features. In this paper, we investigate a convolutional
neural network approach for raw speech signals. While convolutional
architectures have had tremendous success in computer vision and text
processing, they seem to have been left aside in recent years in the speech
processing field. We show that it is possible to learn an end-to-end phoneme
sequence classifier directly from the raw signal, with performance on the
TIMIT and WSJ datasets similar to that of existing MFCC-based systems,
questioning the need for complex hand-crafted features on large datasets.

Comment: NIPS Deep Learning Workshop, 2013
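As a rough illustration of the kind of model the abstract describes, the following PyTorch sketch applies 1-D convolutions directly to waveform samples and produces per-frame phone class scores. The layer sizes, kernel widths and 39-phone output are illustrative assumptions, not the paper's actual configuration.

    # Sketch of a CNN mapping raw speech samples to per-frame phone class
    # scores. All hyper-parameters are illustrative assumptions.
    import torch
    import torch.nn as nn

    class RawSpeechCNN(nn.Module):
        def __init__(self, n_phones=39):
            super().__init__()
            # Feature stage: 1-D convolutions applied directly to the waveform.
            self.features = nn.Sequential(
                nn.Conv1d(1, 64, kernel_size=400, stride=160),  # ~25 ms windows at 16 kHz
                nn.ReLU(),
                nn.Conv1d(64, 64, kernel_size=5),
                nn.ReLU(),
            )
            # Classifier stage: per-frame linear layer over the learned features.
            self.classifier = nn.Linear(64, n_phones)

        def forward(self, wav):                         # wav: (batch, samples)
            x = self.features(wav.unsqueeze(1))         # (batch, 64, frames)
            return self.classifier(x.transpose(1, 2))   # (batch, frames, n_phones)

    scores = RawSpeechCNN()(torch.randn(2, 16000))  # one second of 16 kHz audio
    print(scores.shape)                             # torch.Size([2, 94, 39])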
Towards End-to-End Speech Recognition
Standard automatic speech recognition (ASR) systems follow a divide-and-conquer approach to convert speech into text: the end goal is achieved by a combination of sub-tasks, namely feature extraction, acoustic modeling and sequence decoding, which are optimized independently. More recently, deep learning approaches have emerged in the machine learning community that allow systems to be trained in an end-to-end manner. Such approaches have found success in natural language processing and computer vision, and have consequently piqued interest in the speech community. The present thesis builds on these recent advances to investigate approaches to developing speech recognition systems in an end-to-end manner. In that respect, the thesis follows two main axes of research: the first focuses on joint learning of features and classifiers for acoustic modeling; the second focuses on joint modeling of the acoustic model and the decoder.

Along the first axis of research, in the framework of hybrid hidden Markov model/artificial neural network (HMM/ANN) based ASR, we develop a convolutional neural network (CNN) based acoustic modeling approach that takes the raw speech signal as input and estimates phone class conditional probabilities. Specifically, the CNN has several convolution layers (feature stage) followed by a multilayer perceptron (classifier stage), which are jointly optimized during training. Through ASR studies on multiple languages and extensive analysis of the approach, we show that the proposed approach, with minimal prior knowledge, is able to automatically learn the relevant features from the raw speech signal. This approach yields systems that have fewer parameters and achieve better performance than the conventional approach of cepstral feature extraction followed by classifier training. As the features are automatically learned from the signal, a natural question arises: are such systems robust to noise? Towards that end, we propose a robust CNN approach, referred to as the normalized CNN approach, which yields systems that are as robust as or better than conventional ASR systems using cepstral features (with feature-level normalizations).

The second axis of research focuses on end-to-end sequence-to-sequence conversion. We first propose an end-to-end phoneme recognition system in which the relevant features, the classifier and the decoder (based on conditional random fields) are jointly modeled during training. We demonstrate the viability of the approach on the TIMIT phoneme recognition task. Building on top of that, we investigate a ``weakly supervised'' training that alleviates the need for frame-level alignments. Finally, we extend the weakly supervised approach to propose a novel keyword spotting technique: a CNN first processes the input observation sequence to output word-level scores, which are subsequently aggregated to detect or spot words. We demonstrate the potential of the approach through a comparative study on LibriSpeech against the standard approach of keyword spotting based on lattice indexing using an ASR system.
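The keyword spotting technique mentioned at the end of the abstract aggregates frame-level word scores into a single detection score per keyword. The sketch below illustrates that aggregation step only; max-pooling over time and the score shapes are assumed choices for illustration, not necessarily those used in the thesis.

    # Illustrative aggregation step for CNN-based keyword spotting:
    # per-frame word scores are pooled over time into one detection
    # score per keyword. Max-pooling is an assumed choice here.
    import torch

    def spot_keywords(frame_scores, threshold=0.5):
        """frame_scores: (frames, n_keywords) tensor of per-frame word scores."""
        utterance_scores = frame_scores.max(dim=0).values  # strongest evidence over time
        return utterance_scores, utterance_scores > threshold

    frame_scores = torch.sigmoid(torch.randn(120, 10))  # 120 frames, 10 keywords
    per_word, detected = spot_keywords(frame_scores)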
Sparse stereo image coding with learned dictionaries
This paper proposes a framework for stereo image coding with an effective representation of the geometry in 3D scenes. We propose a joint sparse approximation framework for pairs of perspective images, which are represented as linear expansions of atoms selected from a dictionary of geometric functions learned on a database of stereo perspective images. We then present a coding solution where atoms are selected iteratively as a trade-off between distortion and consistency of the geometry information. Experimental results on stereo images from the Middlebury database show that the new coder achieves better rate-distortion performance than the MPEG-4 Part 10 scheme at all rates. In addition to good rate-distortion performance, our flexible framework makes it possible to build consistent image representations that capture the geometry of the scene. It thus represents a promising solution towards the design of multi-view coding algorithms whose compressed streams inherently contain rich information about 3D geometry.
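The iterative atom selection described above resembles a matching pursuit whose selection criterion is penalized by a geometry-consistency term. The sketch below shows that generic idea; the consistency vector and the lam weighting are placeholders for the paper's actual joint criterion over the stereo pair.

    # Generic penalized matching pursuit over a learned dictionary: at each
    # iteration, pick the atom that maximizes correlation with the residual
    # minus a weighted consistency penalty. The penalty is a stand-in for
    # the paper's geometry-consistency measure.
    import numpy as np

    def penalized_matching_pursuit(signal, dictionary, consistency, lam=0.1, n_iter=10):
        """dictionary: (n_atoms, dim) array with unit-norm rows."""
        residual = signal.copy()
        chosen = []
        for _ in range(n_iter):
            correlations = dictionary @ residual              # projection onto each atom
            score = np.abs(correlations) - lam * consistency  # distortion vs. consistency
            k = int(np.argmax(score))
            chosen.append((k, correlations[k]))
            residual = residual - correlations[k] * dictionary[k]  # greedy update
        return chosen, residual

    dim = 64
    D = np.random.randn(256, dim)
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    atoms, res = penalized_matching_pursuit(np.random.randn(dim), D,
                                            consistency=np.zeros(256))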
Joint Phoneme Segmentation Inference and Classification using CRFs
State-of-the-art phoneme sequence recognition systems are based on the hybrid hidden Markov model/artificial neural network (HMM/ANN) framework. In this framework, the local classifier, the ANN, is typically trained using the Viterbi expectation-maximization algorithm, which involves two separate steps: phoneme sequence segmentation and training of the ANN. In this paper, we propose a CRF-based phoneme sequence recognition approach that simultaneously infers the phoneme segmentation and classifies the phoneme sequence. More specifically, the phoneme sequence recognition system consists of a local ANN classifier followed by a conditional random field (CRF) whose parameters are trained jointly, using a cost function that discriminates the true phoneme sequence against all competing sequences. In order to train such a system efficiently, we introduce a novel CRF-based segmentation using an acyclic graph. We study the viability of the proposed approach on the TIMIT phoneme recognition task. Our studies show that the proposed approach is capable of achieving performance similar to standard hybrid HMM/ANN and ANN/CRF systems where the ANN is trained with manual segmentation.
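The training criterion described above, discriminating the true phoneme sequence against all competing sequences, is in CRF terms a negative log-likelihood computed with the forward algorithm. The following minimal frame-level version shows that criterion in PyTorch; it omits the paper's acyclic segmentation graph, which is its actual contribution.

    # Minimal frame-level CRF negative log-likelihood: the score of the
    # true label path against the log-sum-exp over all competing paths,
    # computed with the forward recursion.
    import torch

    def crf_nll(emissions, transitions, labels):
        """emissions: (T, n_labels); transitions: (n_labels, n_labels); labels: (T,)."""
        T, n = emissions.shape
        # Score of the true path: emission terms plus transition terms.
        true_score = emissions[torch.arange(T), labels].sum()
        true_score = true_score + transitions[labels[:-1], labels[1:]].sum()
        # Forward recursion: log partition over all label sequences.
        alpha = emissions[0]
        for t in range(1, T):
            alpha = emissions[t] + torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0)
        return torch.logsumexp(alpha, dim=0) - true_score

    emissions = torch.randn(50, 40, requires_grad=True)  # 50 frames, 40 phone classes
    loss = crf_nll(emissions, torch.randn(40, 40), torch.randint(40, (50,)))
    loss.backward()  # gradients flow back to the local classifier's outputs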
Analysis of CNN-based Speech Recognition System using Raw Speech as Input
Automatic speech recognition systems typically model the relationship between the acoustic speech signal and the phones in two separate steps: feature extraction and classifier training. In our recent works, we have shown that, in the framework of convolutional neural networks (CNNs), the relationship between the raw speech signal and the phones can be directly modeled and ASR systems competitive with the standard approach can be built. In this paper, we first analyze and show that, between the first two convolutional layers, the CNN learns (in part) and models the phone-specific spectral envelope information of 2-4 ms of speech. Building on that finding, we show that the CNN-based approach yields ASR trends similar to those of standard short-term spectral feature based ASR systems under mismatched (noisy) conditions, with the CNN-based approach being more robust.

Index Terms: automatic speech recognition, convolutional neural networks, raw signal, robust speech recognition
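A common way to perform the kind of first-layer analysis the abstract refers to is to inspect the magnitude frequency response of each learned filter. The snippet below is an illustrative version of such an analysis, not the paper's exact procedure; the random filter array stands in for trained weights.

    # Illustrative filter analysis: compute the magnitude frequency response
    # of each first-layer convolution kernel and sort the filters by their
    # peak frequency to see how the learned bank tiles the spectrum.
    import numpy as np

    def filter_responses(filters, sample_rate=16000, n_fft=512):
        """filters: (n_filters, kernel_len) array of learned 1-D kernels."""
        spectra = np.abs(np.fft.rfft(filters, n=n_fft, axis=1))  # (n_filters, n_fft//2 + 1)
        freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
        peak_freqs = freqs[spectra.argmax(axis=1)]               # dominant frequency per filter
        order = np.argsort(peak_freqs)
        return spectra[order], peak_freqs[order]

    # Stand-in for trained first-layer weights of shape (64 filters, 400 taps).
    spectra, peaks = filter_responses(np.random.randn(64, 400))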