Subword and Crossword Units for CTC Acoustic Models
This paper proposes a novel approach to creating a unit set for CTC-based
speech recognition systems. Using Byte Pair Encoding, we learn a unit set of
arbitrary size from a given training text. In contrast to using characters or
words as units, this allows us to find a good trade-off between the size of
the unit set and the available training data. We evaluate both Crossword
units, which may span multiple words, and Subword units. By combining this
approach with decoding methods that use a separate language model, we achieve
state-of-the-art results for grapheme-based CTC systems.
Comment: Current version accepted at Interspeech 2018.
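To make the unit-learning step concrete, below is a minimal Python sketch of
standard BPE vocabulary learning, assuming a whitespace-tokenized training
text and a word-end marker "_"; the toy corpus and merge count are
illustrative, not the authors' setup. Crossword units would correspond to
additionally allowing merges across the word-end marker.

from collections import Counter

def learn_bpe_units(corpus, num_merges):
    # Represent each word as a character sequence plus an end-of-word marker,
    # weighted by its corpus frequency.
    vocab = Counter()
    for word in corpus.split():
        vocab[tuple(word) + ("_",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the weighted vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    # The unit set is the base characters plus one new unit per merge.
    return merges

print(learn_bpe_units("low low low lower lowest new newer", 10))

Because each merge adds exactly one unit to the base character inventory, the
final unit-set size is controlled directly by the number of merges, which is
the trade-off knob the abstract refers to.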
Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition
The success of self-attention in NLP has led to recent applications in
end-to-end encoder-decoder architectures for speech recognition. Separately,
connectionist temporal classification (CTC) has matured as an alignment-free,
non-autoregressive approach to sequence transduction, either by itself or in
various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully
self-attentional network for CTC, and show it is tractable and competitive for
end-to-end speech recognition. SAN-CTC trains quickly and outperforms existing
CTC models and most encoder-decoder models, with character error rates (CERs)
of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean,
with a fixed architecture and one GPU. Similar improvements hold for WERs after
LM decoding. We motivate the architecture for speech, evaluate position and
downsampling approaches, and explore how label alphabets (character, phoneme,
subword) affect attention heads and performance.
Comment: Accepted to ICASSP 2019.
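As a rough illustration of this architecture class, here is a minimal PyTorch
sketch that stacks self-attention layers over downsampled frames and trains
them with CTC loss. The dimensions, downsampling-by-frame-concatenation, and
the omission of explicit positional encodings are simplifying assumptions for
brevity, not the paper's configuration.

import torch
import torch.nn as nn

class SelfAttentionCTC(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, nhead=4, num_layers=6,
                 vocab_size=32, downsample=2):
        super().__init__()
        # Concatenate neighbouring frames as a simple downsampling step.
        self.downsample = downsample
        self.proj = nn.Linear(feat_dim * downsample, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=1024,
                                           batch_first=True)
        # Note: explicit positional encodings are omitted here for brevity;
        # the paper evaluates several position approaches.
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # +1 output for the CTC blank label.
        self.out = nn.Linear(d_model, vocab_size + 1)

    def forward(self, feats):                     # feats: (B, T, feat_dim)
        B, T, F = feats.shape
        T = T - T % self.downsample
        x = feats[:, :T].reshape(B, T // self.downsample, F * self.downsample)
        x = self.encoder(self.proj(x))
        return self.out(x).log_softmax(-1)        # (B, T', vocab+1)

model = SelfAttentionCTC()
feats = torch.randn(2, 100, 80)                   # two toy utterances
log_probs = model(feats).transpose(0, 1)          # CTC loss wants (T, B, C)
targets = torch.randint(1, 33, (2, 20))           # label 0 is the blank
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((2,), 50), torch.full((2,), 20))
loss.backward()

Since CTC makes per-frame predictions without autoregressive dependencies, the
whole label posterior is computed in one parallel pass, which is what makes
training on a single GPU fast relative to encoder-decoder models.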
Hierarchical Multi Task Learning With CTC
In Automatic Speech Recognition it is still challenging to learn useful
intermediate representations when using high-level (or abstract) target units
such as words. For that reason, character- or phoneme-based systems tend to
outperform word-based systems when only a few hundred hours of training data
are used. In this paper, we first show how hierarchical multi-task
training can encourage the formation of useful intermediate representations. We
achieve this by performing Connectionist Temporal Classification at different
levels of the network with targets of different granularity. Our model thus
makes predictions at multiple scales for the same input. On the standard
300h Switchboard training setup, our hierarchical multi-task architecture
exhibits improvements over single-task architectures with the same number of
parameters. Our model obtains 14.0% Word Error Rate on the Eval2000 Switchboard
subset without any decoder or language model, outperforming the current
state-of-the-art on acoustic-to-word models.
Comment: In Proceedings of SLT 2018.
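A minimal PyTorch sketch of the hierarchical multi-task idea follows:
auxiliary CTC supervision with fine-grained targets (e.g. phonemes) is
attached at a lower layer, and coarse targets (e.g. words) at the top. The
encoder choice, layer placement, vocabulary sizes, and loss weighting are
illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class HierarchicalCTC(nn.Module):
    def __init__(self, feat_dim=80, hidden=320,
                 phone_vocab=45, word_vocab=10000):
        super().__init__()
        # Lower stack supervised with phoneme targets, upper with word targets.
        self.lower = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.upper = nn.LSTM(2 * hidden, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.phone_out = nn.Linear(2 * hidden, phone_vocab + 1)  # +1 blank
        self.word_out = nn.Linear(2 * hidden, word_vocab + 1)

    def forward(self, feats):                     # feats: (B, T, feat_dim)
        low, _ = self.lower(feats)
        high, _ = self.upper(low)
        return (self.phone_out(low).log_softmax(-1),
                self.word_out(high).log_softmax(-1))

ctc = nn.CTCLoss(blank=0)
model = HierarchicalCTC()
feats = torch.randn(2, 200, 80)                   # two toy utterances
phone_lp, word_lp = (lp.transpose(0, 1) for lp in model(feats))
phones = torch.randint(1, 46, (2, 40))
words = torch.randint(1, 10001, (2, 12))
in_len = torch.full((2,), 200)
# Total loss: a weighted sum of CTC at each granularity, so the network must
# form representations consistent across scales for the same input.
loss = ctc(word_lp, words, in_len, torch.full((2,), 12)) \
     + 0.3 * ctc(phone_lp, phones, in_len, torch.full((2,), 40))
loss.backward()

Gradients from the phoneme-level loss shape the lower layers directly, which
is how the auxiliary task encourages useful intermediate representations even
when the primary targets are high-level units such as words.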