Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM
We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR)
model. We learn to listen and write characters with a joint Connectionist
Temporal Classification (CTC) and attention-based encoder-decoder network. The
encoder is a deep Convolutional Neural Network (CNN) based on the VGG network.
The CTC network sits on top of the encoder and is jointly trained with the
attention-based decoder. During the beam search process, we combine the CTC
predictions, the attention-based decoder predictions and a separately trained
LSTM language model. We achieve a 5-10% error reduction compared to prior
systems on spontaneous Japanese and Chinese speech, and our end-to-end model
beats out traditional hybrid ASR systems.
Comment: Accepted for INTERSPEECH 201
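The abstract above says that beam search combines CTC predictions, attention-decoder predictions and a separately trained LSTM language model. A minimal sketch of one common way to do this is a log-linear interpolation of the three log-probabilities. The function names, the weights `ctc_weight` and `lm_weight`, and the hypothesis dictionary layout are assumptions for illustration, not values or interfaces taken from the paper.

```python
from typing import Dict, List


def joint_score(logp_ctc: float, logp_att: float, logp_lm: float,
                ctc_weight: float = 0.3, lm_weight: float = 0.1) -> float:
    """Interpolate CTC and attention log-probabilities, then add a weighted LM term.

    ctc_weight and lm_weight are hypothetical tuning parameters.
    """
    return (ctc_weight * logp_ctc
            + (1.0 - ctc_weight) * logp_att
            + lm_weight * logp_lm)


def rank_hypotheses(hyps: List[Dict], ctc_weight: float = 0.3,
                    lm_weight: float = 0.1) -> List[Dict]:
    """Sort partial hypotheses by combined score, best first.

    Each hypothesis is assumed to carry the keys 'ctc', 'att' and 'lm'
    holding the log-probability assigned by each model.
    """
    return sorted(
        hyps,
        key=lambda h: joint_score(h["ctc"], h["att"], h["lm"],
                                  ctc_weight, lm_weight),
        reverse=True,
    )
```

In a beam search, a ranking step of this kind would be applied after each expansion, keeping only the top few hypotheses.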
Direct Acoustics-to-Word Models for English Conversational Speech Recognition
Recent work on end-to-end automatic speech recognition (ASR) has shown that
the connectionist temporal classification (CTC) loss can be used to convert
acoustics to phone or character sequences. Such systems are used with a
dictionary and separately-trained Language Model (LM) to produce word
sequences. However, they are not truly end-to-end in the sense of mapping
acoustics directly to words without an intermediate phone representation. In
this paper, we present the first results employing direct acoustics-to-word CTC
models on two well-known public benchmark tasks: Switchboard and CallHome.
These models do not require an LM or even a decoder at run-time and hence
recognize speech with minimal complexity. However, due to the large number of
word output units, CTC word models require orders of magnitude more data to
train reliably compared to traditional systems. We present some techniques to
mitigate this issue. Our CTC word model achieves a word error rate of
13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or
decoder compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also
present rescoring results on CTC word model lattices to quantify the
performance benefits of an LM, and contrast the performance of word and phone
CTC models.
Comment: Submitted to Interspeech-201
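The abstract above notes that word-level CTC models need no LM or decoder at run-time. One way to read this is as best-path (greedy) CTC decoding: take the most likely output unit per frame, merge consecutive repeats, and drop blanks. The sketch below assumes per-frame posteriors are given as plain lists and that index 0 is the blank symbol; both are illustrative assumptions, and the paper's actual procedure may differ.

```python
from typing import List, Sequence


def ctc_greedy_decode(frame_posteriors: Sequence[Sequence[float]],
                      vocab: Sequence[str], blank: int = 0) -> List[str]:
    """Best-path CTC decoding over a word vocabulary.

    frame_posteriors[t][k] is the posterior of output unit k at frame t.
    The blank index is an assumption of this sketch.
    """
    # Most likely unit at each frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_posteriors]

    words = []
    previous = None
    for idx in best_path:
        # Emit a unit only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != previous and idx != blank:
            words.append(vocab[idx])
        previous = idx
    return words
```

For example, a frame-level path such as hello, hello, blank, world collapses to the two-word sequence "hello world", while hello, blank, hello yields the word twice because the blank separates the repeats.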
Sampling from Stochastic Finite Automata with Applications to CTC Decoding
Stochastic finite automata arise naturally in many language and speech
processing tasks. They include stochastic acceptors, which represent certain
probability distributions over random strings. We consider the problem of
efficient sampling: drawing random string variates from the probability
distribution represented by stochastic automata and transformations of those.
We show that path-sampling is effective and can be efficient if the
epsilon-graph of a finite automaton is acyclic. We provide an algorithm that
ensures this by conflating epsilon-cycles within strongly connected components.
Sampling is also effective in the presence of non-injective transformations of
strings. We illustrate this in the context of decoding for Connectionist
Temporal Classification (CTC), where the predictive probabilities yield
auxiliary sequences which are transformed into shorter labeling strings. We can
sample efficiently from the transformed labeling distribution and use this in
two different strategies for finding the most probable CTC labeling
- …
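The idea of sampling from the transformed labeling distribution can be illustrated with a simple Monte Carlo sketch: draw frame-level paths from the per-frame CTC posteriors, collapse each path into a labeling (merge repeats, drop blanks), and report the labeling seen most often. This is not one of the two strategies from the paper, whose description is cut off above. The function names, the blank index and the sample count are assumptions for illustration.

```python
import random
from collections import Counter
from typing import Sequence, Tuple


def collapse(path: Sequence[int], blank: int = 0) -> Tuple[int, ...]:
    """Map a frame-level CTC path to its labeling: merge repeats, drop blanks."""
    labels = []
    previous = None
    for symbol in path:
        if symbol != previous and symbol != blank:
            labels.append(symbol)
        previous = symbol
    return tuple(labels)


def sample_labeling(posteriors: Sequence[Sequence[float]],
                    rng: random.Random, blank: int = 0) -> Tuple[int, ...]:
    """Draw one path frame by frame, then collapse it into a labeling."""
    path = [rng.choices(range(len(p)), weights=p, k=1)[0] for p in posteriors]
    return collapse(path, blank)


def most_frequent_labeling(posteriors: Sequence[Sequence[float]],
                           n_samples: int = 1000, seed: int = 0,
                           blank: int = 0) -> Tuple[int, ...]:
    """Estimate the most probable labeling as the most frequently sampled one."""
    rng = random.Random(seed)
    counts = Counter(sample_labeling(posteriors, rng, blank)
                     for _ in range(n_samples))
    return counts.most_common(1)[0][0]
```

Because many distinct paths collapse to the same labeling, the labeling found this way can differ from the collapse of the single most probable path.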