Subword and Crossword Units for CTC Acoustic Models
This paper proposes a novel approach to creating a unit set for CTC-based
speech recognition systems. Using Byte Pair Encoding, we learn a unit set of
arbitrary size on a given training text. In contrast to using characters or
words as units, this allows us to find a good trade-off between the size of our
unit set and the available training data. We evaluate both Crossword units,
which may span multiple words, and Subword units. By combining this approach
with decoding methods that use a separate language model, we are able to
achieve state-of-the-art results for grapheme-based CTC systems.
Comment: Current version accepted at Interspeech 201
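To make the unit-learning step concrete, below is a minimal sketch of Byte Pair Encoding as it is commonly formulated: starting from characters, the most frequent adjacent symbol pair is merged repeatedly until the desired unit-set size is reached. The toy corpus and merge count are made-up values, and the paper's crossword units (which would also allow merges across word boundaries) are not reproduced here.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge operations from whitespace-tokenized training text."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with its concatenation.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

corpus = "low lower lowest newer newest"  # toy training text
merges = learn_bpe_merges(corpus, num_merges=10)
# The unit set: all characters plus one new unit per learned merge,
# so its size is controlled directly by the number of merges.
units = {c for w in corpus.split() for c in w} | {a + b for a, b in merges}
print(sorted(units))
```

The number of merges is the single knob that trades off unit-set size against how much training data supports each unit.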
Hierarchical Multi Task Learning With CTC
In Automatic Speech Recognition, it is still challenging to learn useful
intermediate representations when using high-level (or abstract) target units
such as words. For that reason, character- or phoneme-based systems tend to
outperform word-based systems when only a few hundred hours of training data
are used. In this paper, we first show how hierarchical multi-task
training can encourage the formation of useful intermediate representations. We
achieve this by performing Connectionist Temporal Classification at different
levels of the network with targets of different granularity. Our model thus
performs predictions at multiple scales for the same input. On the standard
300h Switchboard training setup, our hierarchical multi-task architecture
exhibits improvements over single-task architectures with the same number of
parameters. Our model obtains 14.0% Word Error Rate on the Eval2000 Switchboard
subset without any decoder or language model, outperforming the current
state-of-the-art on acoustic-to-word models.
Comment: In Proceedings at SLT 201
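For a concrete picture of the architecture, here is a minimal PyTorch sketch of one way to realize hierarchical multi-task CTC: a character-level CTC loss attached to a lower encoder layer and a subword-level CTC loss on top, summed with a fixed weight. All layer sizes, vocabulary sizes, and the 0.5 loss weight are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class HierarchicalCTC(nn.Module):
    def __init__(self, feat_dim=40, hidden=320, n_chars=30, n_subwords=300):
        super().__init__()
        self.lower = nn.LSTM(feat_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.upper = nn.LSTM(2 * hidden, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.char_head = nn.Linear(2 * hidden, n_chars)        # fine-grained targets
        self.subword_head = nn.Linear(2 * hidden, n_subwords)  # coarse-grained targets
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, feat_lens, chars, char_lens, subs, sub_lens):
        low, _ = self.lower(feats)   # intermediate representation
        high, _ = self.upper(low)    # higher-level representation
        # nn.CTCLoss expects (T, B, C) log-probabilities.
        char_logp = self.char_head(low).log_softmax(-1).transpose(0, 1)
        sub_logp = self.subword_head(high).log_softmax(-1).transpose(0, 1)
        # One CTC loss per granularity, applied at different network depths.
        loss_char = self.ctc(char_logp, chars, feat_lens, char_lens)
        loss_sub = self.ctc(sub_logp, subs, feat_lens, sub_lens)
        return loss_sub + 0.5 * loss_char  # assumed weighting

model = HierarchicalCTC()
feats, feat_lens = torch.randn(2, 100, 40), torch.tensor([100, 80])
chars, char_lens = torch.randint(1, 30, (2, 20)), torch.tensor([20, 15])
subs, sub_lens = torch.randint(1, 300, (2, 8)), torch.tensor([8, 6])
loss = model(feats, feat_lens, chars, char_lens, subs, sub_lens)
loss.backward()
```

The key design point is that the auxiliary character loss supervises the lower layers directly, encouraging them to form representations that the word- or subword-level head can build on.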
Multimodal Grounding for Sequence-to-Sequence Speech Recognition
Humans are capable of processing speech by making use of multiple sensory
modalities. For example, the environment where a conversation takes place
generally provides semantic and/or acoustic context that helps us to resolve
ambiguities or to recall named entities. Motivated by this, there have been
many works studying the integration of visual information into the speech
recognition pipeline. Specifically, in our previous work we proposed a
multistep visual adaptive training approach which improves the accuracy of an
audio-based Automatic Speech Recognition (ASR) system. This approach, however,
is not end-to-end, as it requires fine-tuning the whole model with an
adaptation layer. In this paper, we propose novel end-to-end multimodal ASR
systems and
compare them to the adaptive approach by using a range of visual
representations obtained from state-of-the-art convolutional neural networks.
We show that adaptive training is effective for S2S models, leading to an
absolute improvement of 1.4% in word error rate. As for the end-to-end systems,
although they perform better than the baseline, the improvements are slightly
smaller than with adaptive training: a 0.8% absolute WER reduction in
single-best models. Using ensemble decoding, end-to-end models reach a WER of
15%, which is the lowest score among all systems.
Comment: ICASSP 201
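As an illustration of how a visual embedding can condition a speech encoder, the sketch below adds a projection of a CNN image feature to the acoustic features before encoding, loosely in the spirit of the adaptive approach described above. The additive fusion and all dimensions are assumptions made for the example, not the paper's exact design.

```python
import torch
import torch.nn as nn

class VisuallyAdaptedEncoder(nn.Module):
    """Speech encoder whose input features are shifted by a visual embedding."""
    def __init__(self, feat_dim=40, vis_dim=2048, hidden=320):
        super().__init__()
        # Adaptation layer: project the CNN image embedding into feature space.
        self.vis_proj = nn.Linear(vis_dim, feat_dim)
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)

    def forward(self, feats, vis_feat):
        # feats: (B, T, feat_dim) acoustic features
        # vis_feat: (B, vis_dim) image embedding of the utterance's visual context
        shift = self.vis_proj(vis_feat).unsqueeze(1)  # (B, 1, feat_dim)
        states, _ = self.encoder(feats + shift)       # broadcast over time steps
        return states                                 # fed to an S2S decoder

enc = VisuallyAdaptedEncoder()
states = enc(torch.randn(2, 100, 40), torch.randn(2, 2048))
```

Because the visual projection is trained jointly with the encoder, this style of fusion is end-to-end, in contrast to a separate fine-tuning stage with an adaptation layer.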