71 research outputs found
Direct Acoustics-to-Word Models for English Conversational Speech Recognition
Recent work on end-to-end automatic speech recognition (ASR) has shown that
the connectionist temporal classification (CTC) loss can be used to convert
acoustics to phone or character sequences. Such systems are used with a
dictionary and separately-trained Language Model (LM) to produce word
sequences. However, they are not truly end-to-end in the sense of mapping
acoustics directly to words without an intermediate phone representation. In
this paper, we present the first results employing direct acoustics-to-word CTC
models on two well-known public benchmark tasks: Switchboard and CallHome.
These models do not require an LM or even a decoder at run-time and hence
recognize speech with minimal complexity. However, due to the large number of
word output units, CTC word models require orders of magnitude more data to
train reliably compared to traditional systems. We present some techniques to
mitigate this issue. Our CTC word model achieves a word error rate of
13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or
decoder compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also
present rescoring results on CTC word model lattices to quantify the
performance benefits of an LM, and contrast the performance of word and phone
CTC models.
Comment: Submitted to Interspeech-2017
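As a rough illustration of the setup this abstract describes, the sketch below trains a word-level CTC model and recognizes speech with a simple best-path collapse, so no language model or decoder is involved at run time. The PyTorch framework, the BiLSTM encoder, the 10,000-word vocabulary, and all layer sizes are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class AcousticsToWordCTC(nn.Module):
    def __init__(self, feat_dim=40, hidden=320, vocab_size=10000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4,
                               bidirectional=True, batch_first=True)
        # Index 0 is the CTC blank; word IDs occupy 1..vocab_size.
        self.output = nn.Linear(2 * hidden, vocab_size + 1)

    def forward(self, feats):                        # feats: (batch, time, feat_dim)
        enc, _ = self.encoder(feats)
        return self.output(enc).log_softmax(dim=-1)  # (batch, time, vocab_size + 1)

def greedy_decode(frame_ids, blank=0):
    """Best-path CTC decoding: collapse repeated labels, then drop blanks."""
    out, prev = [], blank
    for t in frame_ids.tolist():
        if t != blank and t != prev:
            out.append(t)
        prev = t
    return out

model = AcousticsToWordCTC()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# Toy batch: two utterances of 100 frames with word targets of lengths 5 and 3.
feats = torch.randn(2, 100, 40)
targets = torch.randint(1, 10001, (8,))          # concatenated word IDs
input_lengths = torch.tensor([100, 100])
target_lengths = torch.tensor([5, 3])

log_probs = model(feats).transpose(0, 1)         # CTCLoss expects (time, batch, classes)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()

# At run time, recognition is just a frame-wise argmax followed by collapsing:
# no LM, no lexicon, no decoder.
best = log_probs.argmax(dim=-1).transpose(0, 1)  # (batch, time)
hyp_word_ids = greedy_decode(best[0])
```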
Building competitive direct acoustics-to-word models for English conversational speech recognition
Direct acoustics-to-word (A2W) models in the end-to-end paradigm have
received increasing attention compared to conventional sub-word based automatic
speech recognition models using phones, characters, or context-dependent hidden
Markov model states. This is because A2W models recognize words from speech
without any decoder, pronunciation lexicon, or externally-trained language
model, making training and decoding with such models simple. Prior work has
shown that A2W models require orders of magnitude more training data in order
to perform comparably to conventional models. Our work also showed this
accuracy gap when using the English Switchboard-Fisher data set. This paper
describes a recipe to train an A2W model that closes this gap and is on par
with state-of-the-art sub-word based models. We achieve a word error rate of
8.8%/13.9% on the Hub5-2000 Switchboard/CallHome test sets without any decoder
or language model. We find that model initialization, training data order, and
regularization have the most impact on the A2W model performance. Next, we
present a joint word-character A2W model that learns to first spell the word
and then recognize it. This model provides a rich output to the user instead of
simple word hypotheses, making it especially useful in the case of words unseen
or rarely seen during training.
Comment: Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018
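The sketch below is one simplified reading of such a joint word-character model: a shared acoustic encoder feeding two CTC heads, one over a word vocabulary and one over characters, trained with an interpolated loss so the character head supplies spellings for words the word head handles poorly. The two-head multi-task formulation, the `char_weight` factor, and all sizes are assumptions for illustration; the paper's exact spell-and-recognize architecture may differ.

```python
import torch
import torch.nn as nn

class JointWordCharCTC(nn.Module):
    def __init__(self, feat_dim=40, hidden=320, n_words=10000, n_chars=30):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4,
                               bidirectional=True, batch_first=True)
        self.word_head = nn.Linear(2 * hidden, n_words + 1)  # +1 for the CTC blank
        self.char_head = nn.Linear(2 * hidden, n_chars + 1)

    def forward(self, feats):                        # feats: (batch, time, feat_dim)
        enc, _ = self.encoder(feats)
        return (self.word_head(enc).log_softmax(-1),
                self.char_head(enc).log_softmax(-1))

def joint_ctc_loss(model, feats, word_tgts, char_tgts, in_lens,
                   word_lens, char_lens, char_weight=0.5):
    """Multi-task CTC objective: the character head learns spellings, which can
    back off for words unseen or rarely seen in the word vocabulary."""
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    word_lp, char_lp = model(feats)
    word_loss = ctc(word_lp.transpose(0, 1), word_tgts, in_lens, word_lens)
    char_loss = ctc(char_lp.transpose(0, 1), char_tgts, in_lens, char_lens)
    return word_loss + char_weight * char_loss

# Example forward pass on a toy batch (shapes only; data is random).
word_lp, char_lp = JointWordCharCTC()(torch.randn(2, 100, 40))
```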
Speaking clearly for the hard of hearing
Thesis (Sc.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1981. Microfiche copy available in Archives and Engineering. Vita. Bibliography: leaves 46-48. By Michael Alan Picheny, Sc.D.
A Comparison between Deep Neural Nets and Kernel Acoustic Models for Speech Recognition
We study large-scale kernel methods for acoustic modeling and compare them to
DNNs on performance metrics related to both acoustic modeling and recognition.
When measured by perplexity and frame-level classification accuracy,
kernel-based acoustic models are as effective as their DNN counterparts.
However, in terms of token error rates, DNN models can be significantly better.
We find that this might be attributed to the DNNs' unique strength in reducing
both the perplexity and the entropy of the predicted posterior probabilities.
Motivated
by our findings, we propose a new technique, entropy regularized perplexity,
for model selection. This technique can noticeably improve the recognition
performance of both types of models and reduce the gap between them. While
effective on Broadcast News, the technique could also be applicable to other
tasks.
Comment: arXiv admin note: text overlap with arXiv:1411.400
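The abstract does not give the exact definition of entropy regularized perplexity, so the sketch below shows only one plausible form of such a model-selection criterion: it combines the average cross-entropy (log-perplexity) of the reference frame labels with the average entropy of the predicted posteriors, with `lam` as an assumed trade-off weight. This is an illustrative reading, not the paper's formula.

```python
import numpy as np

def entropy_regularized_perplexity(posteriors, labels, lam=1.0, eps=1e-12):
    """posteriors: (n_frames, n_states) array with rows summing to 1;
    labels: (n_frames,) integer array of reference states."""
    p = np.clip(posteriors, eps, 1.0)
    cross_entropy = -np.mean(np.log(p[np.arange(len(labels)), labels]))
    entropy = -np.mean(np.sum(p * np.log(p), axis=1))
    # Lower is better for both terms; exponentiating keeps the perplexity scale.
    return np.exp(cross_entropy + lam * entropy)

# Model selection: prefer the acoustic model with the lowest score on a
# held-out set, rather than plain frame accuracy or perplexity alone.
```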