684 research outputs found
Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM
We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR)
model. We learn to listen and write characters with a joint Connectionist
Temporal Classification (CTC) and attention-based encoder-decoder network. The
encoder is a deep Convolutional Neural Network (CNN) based on the VGG network.
The CTC network sits on top of the encoder and is jointly trained with the
attention-based decoder. During the beam search process, we combine the CTC
predictions, the attention-based decoder predictions and a separately trained
LSTM language model. We achieve a 5-10\% error reduction compared to prior
systems on spontaneous Japanese and Chinese speech, and our end-to-end model
beats out traditional hybrid ASR systems.Comment: Accepted for INTERSPEECH 201
Building competitive direct acoustics-to-word models for English conversational speech recognition
Direct acoustics-to-word (A2W) models in the end-to-end paradigm have
received increasing attention compared to conventional sub-word based automatic
speech recognition models using phones, characters, or context-dependent hidden
Markov model states. This is because A2W models recognize words from speech
without any decoder, pronunciation lexicon, or externally-trained language
model, making training and decoding with such models simple. Prior work has
shown that A2W models require orders of magnitude more training data in order
to perform comparably to conventional models. Our work also showed this
accuracy gap when using the English Switchboard-Fisher data set. This paper
describes a recipe to train an A2W model that closes this gap and is at-par
with state-of-the-art sub-word based models. We achieve a word error rate of
8.8%/13.9% on the Hub5-2000 Switchboard/CallHome test sets without any decoder
or language model. We find that model initialization, training data order, and
regularization have the most impact on the A2W model performance. Next, we
present a joint word-character A2W model that learns to first spell the word
and then recognize it. This model provides a rich output to the user instead of
simple word hypotheses, making it especially useful in the case of words unseen
or rarely-seen during training.Comment: Submitted to IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), 201
- …