659 research outputs found
A Study of All-Convolutional Encoders for Connectionist Temporal Classification
Connectionist temporal classification (CTC) is a popular sequence prediction
approach for automatic speech recognition that is typically used with models
based on recurrent neural networks (RNNs). We explore whether deep
convolutional neural networks (CNNs) can be used effectively instead of RNNs as
the "encoder" in CTC. CNNs lack an explicit representation of the entire
sequence, but have the advantage that they are much faster to train. We present
an exploration of CNNs as encoders for CTC models, in the context of
character-based (lexicon-free) automatic speech recognition. In particular, we
explore a range of one-dimensional convolutional layers, which are particularly
efficient. We compare the performance of our CNN-based models against typical
RNNbased models in terms of training time, decoding time, model size and word
error rate (WER) on the Switchboard Eval2000 corpus. We find that our CNN-based
models are close in performance to LSTMs, while not matching them, and are much
faster to train and decode.Comment: Accepted to ICASSP-201
Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition
The success of self-attention in NLP has led to recent applications in
end-to-end encoder-decoder architectures for speech recognition. Separately,
connectionist temporal classification (CTC) has matured as an alignment-free,
non-autoregressive approach to sequence transduction, either by itself or in
various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully
self-attentional network for CTC, and show it is tractable and competitive for
end-to-end speech recognition. SAN-CTC trains quickly and outperforms existing
CTC models and most encoder-decoder models, with character error rates (CERs)
of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean,
with a fixed architecture and one GPU. Similar improvements hold for WERs after
LM decoding. We motivate the architecture for speech, evaluate position and
downsampling approaches, and explore how label alphabets (character, phoneme,
subword) affect attention heads and performance.Comment: Accepted to ICASSP 201
Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition
Recently, the connectionist temporal classification (CTC) model coupled with
recurrent (RNN) or convolutional neural networks (CNN), made it easier to train
speech recognition systems in an end-to-end fashion. However in real-valued
models, time frame components such as mel-filter-bank energies and the cepstral
coefficients obtained from them, together with their first and second order
derivatives, are processed as individual elements, while a natural alternative
is to process such components as composed entities. We propose to group such
elements in the form of quaternions and to process these quaternions using the
established quaternion algebra. Quaternion numbers and quaternion neural
networks have shown their efficiency to process multidimensional inputs as
entities, to encode internal dependencies, and to solve many tasks with less
learning parameters than real-valued models. This paper proposes to integrate
multiple feature views in quaternion-valued convolutional neural network
(QCNN), to be used for sequence-to-sequence mapping with the CTC model.
Promising results are reported using simple QCNNs in phoneme recognition
experiments with the TIMIT corpus. More precisely, QCNNs obtain a lower phoneme
error rate (PER) with less learning parameters than a competing model based on
real-valued CNNs.Comment: Accepted at INTERSPEECH 201
- …