Guiding CTC Posterior Spike Timings for Improved Posterior Fusion and Knowledge Distillation
Conventional automatic speech recognition (ASR) systems trained from
frame-level alignments can easily leverage posterior fusion to improve ASR
accuracy and build a better single model with knowledge distillation.
End-to-end ASR systems trained using the Connectionist Temporal Classification
(CTC) loss do not require frame-level alignment and hence simplify model
training. However, sparse and arbitrary posterior spike timings from CTC models
pose a new set of challenges in posterior fusion from multiple models and
knowledge distillation between CTC models. We propose a method to train a CTC
model so that its spike timings are guided to align with those of a pre-trained
guiding CTC model. As a result, all models that share the same guiding model
have aligned spike timings. We show the advantage of our method in various
scenarios including posterior fusion of CTC models and knowledge distillation
between CTC models with different architectures. With the 300-hour Switchboard
training data, a single word CTC model distilled from multiple models reduced
the word error rates from 14.9%/24.1% to 13.7%/23.1% on the Hub5 2000
Switchboard/CallHome test sets without using any data augmentation, language
model, or complex decoder.
Comment: Accepted to Interspeech 201
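The abstract does not spell out how the guidance is applied; a common way to realize this kind of spike-timing alignment is to add a frame-level distillation term between the student's posteriors and those of the frozen guiding CTC model on top of the usual CTC loss. The sketch below assumes that formulation; the function and argument names (guided_ctc_loss, kl_weight, and so on) are illustrative, not taken from the paper.

    import torch
    import torch.nn.functional as F

    def guided_ctc_loss(student_logits, guide_logits, targets,
                        input_lengths, target_lengths, kl_weight=0.5):
        # student_logits, guide_logits: (T, N, C) unnormalized scores.
        log_probs = F.log_softmax(student_logits, dim=-1)
        # Standard CTC term: no frame-level alignment is required.
        ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                         blank=0, zero_infinity=True)
        # Guidance term: pull the student's per-frame posteriors (and hence its
        # spike timings) toward those of the frozen, pre-trained guiding model.
        with torch.no_grad():
            guide_probs = F.softmax(guide_logits, dim=-1)
        kl = F.kl_div(log_probs, guide_probs, reduction="batchmean")
        return ctc + kl_weight * kl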
Direct Acoustics-to-Word Models for English Conversational Speech Recognition
Recent work on end-to-end automatic speech recognition (ASR) has shown that
the connectionist temporal classification (CTC) loss can be used to convert
acoustics to phone or character sequences. Such systems are used with a
dictionary and separately-trained Language Model (LM) to produce word
sequences. However, they are not truly end-to-end in the sense of mapping
acoustics directly to words without an intermediate phone representation. In
this paper, we present the first results employing direct acoustics-to-word CTC
models on two well-known public benchmark tasks: Switchboard and CallHome.
These models do not require an LM or even a decoder at run-time and hence
recognize speech with minimal complexity. However, due to the large number of
word output units, CTC word models require orders of magnitude more data to
train reliably compared to traditional systems. We present some techniques to
mitigate this issue. Our CTC word model achieves a word error rate of
13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or
decoder compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also
present rescoring results on CTC word model lattices to quantify the
performance benefits of an LM, and contrast the performance of word and phone
CTC models.
Comment: Submitted to Interspeech-201
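For orientation, the essential difference from phone- or character-based CTC is that each output unit is a whole word, so recognition can be a frame-wise argmax followed by collapsing repeats and removing blanks, with no decoder, lexicon, or LM. The PyTorch sketch below illustrates that idea; the encoder configuration, vocabulary size, and names are assumptions for the example, not the paper's exact setup.

    import torch
    import torch.nn as nn

    class WordCTCModel(nn.Module):
        # Minimal acoustics-to-word sketch: BLSTM encoder + word-sized softmax layer.
        def __init__(self, feat_dim=40, hidden=320, num_words=10000):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4,
                                   bidirectional=True, batch_first=True)
            self.output = nn.Linear(2 * hidden, num_words + 1)  # +1 for the CTC blank

        def forward(self, feats):                  # feats: (N, T, feat_dim)
            enc, _ = self.encoder(feats)
            return self.output(enc)                # (N, T, num_words + 1)

    def greedy_decode(logits, blank=0):
        # Decoder-free recognition: argmax per frame, collapse repeats, drop blanks.
        hyps = []
        for seq in logits.argmax(dim=-1):          # (N, T) -> per-utterance loop
            prev, words = blank, []
            for idx in seq.tolist():
                if idx != blank and idx != prev:
                    words.append(idx)
                prev = idx
            hyps.append(words)
        return hyps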
Building competitive direct acoustics-to-word models for English conversational speech recognition
Direct acoustics-to-word (A2W) models in the end-to-end paradigm have
received increasing attention compared to conventional sub-word based automatic
speech recognition models using phones, characters, or context-dependent hidden
Markov model states. This is because A2W models recognize words from speech
without any decoder, pronunciation lexicon, or externally-trained language
model, making training and decoding with such models simple. Prior work has
shown that A2W models require orders of magnitude more training data in order
to perform comparably to conventional models. Our work also showed this
accuracy gap when using the English Switchboard-Fisher data set. This paper
describes a recipe to train an A2W model that closes this gap and is on par
with state-of-the-art sub-word based models. We achieve a word error rate of
8.8%/13.9% on the Hub5-2000 Switchboard/CallHome test sets without any decoder
or language model. We find that model initialization, training data order, and
regularization have the most impact on the A2W model performance. Next, we
present a joint word-character A2W model that learns to first spell the word
and then recognize it. This model provides a rich output to the user instead of
simple word hypotheses, making it especially useful for words that are unseen
or rarely seen during training.
Comment: Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 201
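The abstract does not detail the joint word-character architecture; one plausible reading is a shared encoder with two CTC heads, a character head that spells the word and a word head that recognizes it, trained with a weighted sum of the two CTC losses. The sketch below assumes that reading; the loss weight and all names are illustrative.

    import torch.nn.functional as F

    def joint_word_char_ctc_loss(word_logits, char_logits,
                                 word_targets, char_targets,
                                 input_lengths, word_lengths, char_lengths,
                                 char_weight=0.5):
        # Both heads are fed by the same encoder; logits are (T, N, vocab + 1).
        word_lp = F.log_softmax(word_logits, dim=-1)
        char_lp = F.log_softmax(char_logits, dim=-1)
        word_loss = F.ctc_loss(word_lp, word_targets, input_lengths, word_lengths,
                               blank=0, zero_infinity=True)
        # The character head "spells" the utterance, providing a useful fallback
        # output for words that were unseen or rarely seen in training.
        char_loss = F.ctc_loss(char_lp, char_targets, input_lengths, char_lengths,
                               blank=0, zero_infinity=True)
        return word_loss + char_weight * char_loss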
O-1: Self-training with Oracle and 1-best Hypothesis
We introduce O-1, a new self-training objective to reduce training bias and
unify training and evaluation metrics for speech recognition. O-1 is a faster
variant of Expected Minimum Bayes Risk (EMBR) that boosts the oracle
hypothesis and can accommodate both supervised and unsupervised data. We
demonstrate the effectiveness of our approach in terms of recognition accuracy
on the publicly available SpeechStew datasets and a large-scale, in-house data set. On
SpeechStew, the O-1 objective closes the gap between the actual and oracle
performance by 80% relative, compared to EMBR, which bridges the gap by 43%
relative. O-1 achieves a 13% to 25% relative improvement over EMBR on the
various datasets that make up SpeechStew, and a 12% relative gap reduction
with respect to the oracle WER over EMBR training on the in-house dataset.
Overall, O-1 yields a 9% relative improvement in WER over EMBR, demonstrating
the scalability of the proposed objective on large-scale datasets.
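The abstract does not state the objective itself; for context, the EMBR criterion that O-1 accelerates minimizes the expected number of word errors over an N-best list under renormalized hypothesis probabilities, and O-1 is described as reweighting training toward the oracle (lowest-WER) hypothesis in that list. A minimal LaTeX statement of the EMBR baseline, with notation chosen here rather than taken from the paper, is:

    \mathcal{L}_{\mathrm{EMBR}}
      = \sum_{i=1}^{N} \hat{P}(y_i \mid x)\, W(y_i, y^{*}),
    \qquad
    \hat{P}(y_i \mid x)
      = \frac{P(y_i \mid x)}{\sum_{j=1}^{N} P(y_j \mid x)}

    % \{y_1, \dots, y_N\}: N-best hypotheses for input x
    % y^{*}: reference transcript; W(y_i, y^{*}): word-error count
    % Per the abstract, O-1 boosts the oracle hypothesis
    % y_o = \arg\min_i W(y_i, y^{*}) within this framework.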