Search CORE

3,296 research outputs found

Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition

Author: Bengio Yoshua
De Mori Renato
Linarès Georges
Morchid Mohamed
Parcollet Titouan
Trabelsi Chiheb
Zhang Ying
Publication venue
Publication date: 20/06/2018
Field of study

Recently, the connectionist temporal classification (CTC) model coupled with recurrent (RNN) or convolutional neural networks (CNN), made it easier to train speech recognition systems in an end-to-end fashion. However in real-valued models, time frame components such as mel-filter-bank energies and the cepstral coefficients obtained from them, together with their first and second order derivatives, are processed as individual elements, while a natural alternative is to process such components as composed entities. We propose to group such elements in the form of quaternions and to process these quaternions using the established quaternion algebra. Quaternion numbers and quaternion neural networks have shown their efficiency to process multidimensional inputs as entities, to encode internal dependencies, and to solve many tasks with less learning parameters than real-valued models. This paper proposes to integrate multiple feature views in quaternion-valued convolutional neural network (QCNN), to be used for sequence-to-sequence mapping with the CTC model. Promising results are reported using simple QCNNs in phoneme recognition experiments with the TIMIT corpus. More precisely, QCNNs obtain a lower phoneme error rate (PER) with less learning parameters than a competing model based on real-valued CNNs.Comment: Accepted at INTERSPEECH 201

arXiv.org e-Print Archive

Crossref

End-to-end Phoneme Sequence Recognition using Convolutional Neural Networks

Author: -Doss Mathew Magimai.
Collobert Ronan
Palaz Dimitri
Publication venue
Publication date: 07/12/2013
Field of study

Most phoneme recognition state-of-the-art systems rely on a classical neural network classifiers, fed with highly tuned features, such as MFCC or PLP features. Recent advances in ``deep learning'' approaches questioned such systems, but while some attempts were made with simpler features such as spectrograms, state-of-the-art systems still rely on MFCCs. This might be viewed as a kind of failure from deep learning approaches, which are often claimed to have the ability to train with raw signals, alleviating the need of hand-crafted features. In this paper, we investigate a convolutional neural network approach for raw speech signals. While convolutional architectures got tremendous success in computer vision or text processing, they seem to have been let down in the past recent years in the speech processing field. We show that it is possible to learn an end-to-end phoneme sequence classifier system directly from raw signal, with similar performance on the TIMIT and WSJ datasets than existing systems based on MFCC, questioning the need of complex hand-crafted features on large datasets.Comment: NIPS Deep Learning Workshop, 201

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems

Author: Dong Ren (135842)
Hong Du (117108)
Libing Song (203669)
Qing Yang (67856)
Wei Guo (86150)
Xinsheng Peng (350750)
Yuhu Dai (436301)
Publication venue
Publication date: 01/07/2017
Field of study

Neural models have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones. We evaluate representations from different layers of the deep model and compare their quality for predicting phone labels. Our experiments shed light on important aspects of the end-to-end model such as layer depth, model complexity, and other design choices.Comment: NIPS 201

arXiv.org e-Print Archive

Directory of Open Access Journals

FigShare