16 research outputs found
On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
In conventional speech recognition, phoneme-based models outperform
grapheme-based models for non-phonetic languages such as English. The
performance gap between the two typically narrows as the amount of training
data is increased. In this work, we examine the impact of the choice of
modeling unit for attention-based encoder-decoder models. We conduct
experiments on the LibriSpeech 100hr, 460hr, and 960hr tasks, using various
target units (phoneme, grapheme, and word-piece); across all tasks, we find
that grapheme or word-piece models consistently outperform phoneme-based
models, even though they are evaluated without a lexicon or an external
language model. We also investigate model complementarity: we find that we can
improve WERs by up to 9% relative by rescoring N-best lists generated from a
strong word-piece based baseline with either the phoneme or the grapheme model.
Rescoring an N-best list generated by the phonemic system, however, provides
limited improvements. Further analysis shows that the word-piece-based models
produce more diverse N-best hypotheses, and thus lower oracle WERs, than
phonemic models.
Comment: To appear in the proceedings of INTERSPEECH 2019.
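As a rough illustration of the rescoring step described above, the sketch below interpolates the baseline word-piece score of each N-best hypothesis with a score from a second (phoneme- or grapheme-based) model. The function names and the interpolation weight are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of N-best rescoring with a complementary model.
# `second_model_score` and `lam` are illustrative assumptions.

def rescore_nbest(nbest, baseline_scores, second_model_score, lam=0.5):
    """Return the hypothesis with the best interpolated log-probability.

    nbest              -- hypothesis strings from the word-piece baseline
    baseline_scores    -- per-hypothesis log-probabilities from the baseline
    second_model_score -- callable: hypothesis -> log-probability under the
                          complementary (phoneme or grapheme) model
    lam                -- interpolation weight between the two models
    """
    best_hyp, best_score = None, float("-inf")
    for hyp, base in zip(nbest, baseline_scores):
        combined = (1.0 - lam) * base + lam * second_model_score(hyp)
        if combined > best_score:
            best_hyp, best_score = hyp, combined
    return best_hyp
```

In practice the interpolation weight would be tuned on a held-out set rather than fixed as above.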
Improving the Performance of Online Neural Transducer Models
Having a sequence-to-sequence model which can operate in an online fashion is
important for streaming applications such as Voice Search. The neural transducer
(NT) is a streaming sequence-to-sequence model, but it has shown a significant
degradation in performance compared to non-streaming models such as Listen,
Attend and Spell (LAS). In this paper, we present various improvements to NT.
Specifically, we look at increasing the window over which NT computes
attention, mainly by looking backwards in time so the model still remains
online. In addition, we explore initializing an NT model from a LAS-trained
model so that it is guided by a better alignment. Finally, we explore
including stronger language models, such as using word-piece models and
applying an external LM during the beam search. On a Voice Search task, we find
that with these improvements NT can match the performance of LAS.
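One way to picture the widened, backward-looking attention window mentioned above is the sketch below: at each step the model may only attend to encoder frames up to the current chunk boundary, extended a fixed number of frames into the past, so no future context is needed. Shapes, names, and the plain dot-product scoring are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def windowed_attention(query, enc_frames, chunk_end, window):
    """Attend over a backward-looking window so decoding stays online.

    query      -- decoder state, shape (d,)
    enc_frames -- encoder outputs seen so far, shape (T, d)
    chunk_end  -- index of the current chunk boundary (no look-ahead past it)
    window     -- how many frames before the boundary remain visible
    """
    start = max(0, chunk_end - window)
    keys = enc_frames[start:chunk_end]        # frames the model may attend to
    scores = keys @ query                     # dot-product attention scores
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ keys                     # context vector for this step
```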
A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition
Attention-based recurrent neural encoder-decoder models present an elegant
solution to the automatic speech recognition problem. This approach folds the
acoustic model, pronunciation model, and language model into a single network
and requires only a parallel corpus of speech and text for training. However,
unlike in conventional approaches that combine separate acoustic and language
models, it is not clear how to use additional (unpaired) text. While there has
been previous work on methods addressing this problem, a thorough comparison
among methods is still lacking. In this paper, we compare a suite of past
methods and some of our own proposed methods for using unpaired text data to
improve encoder-decoder models. For evaluation, we use the medium-sized
Switchboard data set and the large-scale Google voice search and dictation data
sets. Our results confirm the benefits of using unpaired text across a range of
methods and data sets. Surprisingly, for first-pass decoding, the rather simple
approach of shallow fusion performs best across data sets. However, for Google
data sets we find that cold fusion has a lower oracle error rate and
outperforms other approaches after second-pass rescoring on the Google voice
search data set.
Comment: Accepted in SLT 2018.
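Shallow fusion, the first-pass method that fared best here, simply adds a weighted external-LM log-probability to the decoder's log-probability when scoring each candidate token during beam search. The sketch below shows that score combination for a single expansion step; the dictionary interfaces and the LM weight are assumptions made for the example, not the paper's setup.

```python
import math

def shallow_fusion_scores(decoder_logprobs, lm_logprobs, lm_weight=0.3):
    """Fuse per-token scores from the seq2seq decoder and an external LM.

    decoder_logprobs -- dict token -> log p(token | speech, prefix)
    lm_logprobs      -- dict token -> log p_LM(token | prefix)
    lm_weight        -- scalar weight on the external LM (tuned in practice)
    """
    return {
        tok: score + lm_weight * lm_logprobs.get(tok, -math.inf)
        for tok, score in decoder_logprobs.items()
    }
```

The fused scores would then be used to rank and extend the beam-search hypotheses in place of the decoder scores alone.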
No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models
For decades, context-dependent phonemes have been the dominant sub-word unit
for conventional acoustic modeling systems. This status quo has begun to be
challenged recently by end-to-end models which seek to combine acoustic,
pronunciation, and language model components into a single neural network. Such
systems, which typically predict graphemes or words, simplify the recognition
process since they remove the need for a separate expert-curated pronunciation
lexicon to map from phoneme-based units to words. However, there has been
little previous work comparing phoneme-based versus grapheme-based sub-word
units in the end-to-end modeling framework, to determine whether the gains from
such approaches are primarily due to the new probabilistic model, or from the
joint learning of the various components with grapheme-based units.
In this work, we conduct detailed experiments which are aimed at quantifying
the value of phoneme-based pronunciation lexica in the context of end-to-end
models. We examine phoneme-based end-to-end models, which are contrasted
against grapheme-based ones on a large-vocabulary English Voice Search task,
where we find that graphemes do indeed outperform phonemes. We also compare
grapheme- and phoneme-based approaches on a multi-dialect English task, which
once again confirms the superiority of graphemes, greatly simplifying the
system for recognizing multiple dialects.
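To make the role of the pronunciation lexicon concrete, the toy sketch below contrasts a phoneme-based system, which must map its phoneme output back to words through a lexicon, with a grapheme-based system, whose output already spells the words. The lexicon entries and the space symbol are invented for the example and do not come from the paper.

```python
# Toy lexicon mapping phoneme sequences to words (invented entries).
LEXICON = {("k", "ae", "t"): "cat", ("s", "ae", "t"): "sat"}

def phonemes_to_words(phonemes, lexicon=LEXICON):
    """Greedy longest-match lookup of phoneme spans in the lexicon."""
    words, i = [], 0
    while i < len(phonemes):
        for j in range(len(phonemes), i, -1):   # try the longest span first
            word = lexicon.get(tuple(phonemes[i:j]))
            if word is not None:
                words.append(word)
                i = j
                break
        else:
            i += 1                              # no entry: skip the phoneme
    return words

def graphemes_to_words(graphemes):
    """A grapheme model already emits spelling; just join on the space symbol."""
    return "".join(graphemes).split(" ")
```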
State-of-the-art Speech Recognition With Sequence-to-Sequence Models
Attention-based encoder-decoder architectures such as Listen, Attend, and
Spell (LAS) subsume the acoustic, pronunciation, and language model components
of a traditional automatic speech recognition (ASR) system into a single neural
network. In previous work, we have shown that such architectures are comparable
to state-of-the-art ASR systems on dictation tasks, but it was not clear if such
architectures would be practical for more challenging tasks such as voice
search. In this work, we explore a variety of structural and optimization
improvements to our LAS model which significantly improve performance. On the
structural side, we show that word piece models can be used instead of
graphemes. We also introduce a multi-head attention architecture, which offers
improvements over the commonly-used single-head attention. On the optimization
side, we explore synchronous training, scheduled sampling, label smoothing, and
minimum word error rate optimization, which are all shown to improve accuracy.
We present results with a unidirectional LSTM encoder for streaming
recognition. On a 12,500-hour voice search task, we find that the proposed
changes improve the WER from 9.2% to 5.6%, while the best conventional system
achieves 6.7%; on a dictation task, our model achieves a WER of 4.1% compared
to 5% for the conventional system.
Comment: ICASSP camera-ready version.
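As a very rough picture of the multi-head attention mentioned above, the sketch below gives each head its own projections of the query, keys, and values, computes an attention context per head, and concatenates the results; with four heads, for example, each projection list would hold four (d, d//4) matrices. The projection shapes and interfaces are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def multi_head_attention(query, enc, proj_q, proj_k, proj_v):
    """Concatenate per-head attention contexts over the encoder outputs.

    query  -- decoder state, shape (d,)
    enc    -- encoder outputs, shape (T, d)
    proj_* -- lists of per-head projection matrices, each of shape (d, d_head)
    """
    contexts = []
    for wq, wk, wv in zip(proj_q, proj_k, proj_v):
        q = query @ wq                           # project the query for this head
        k = enc @ wk                             # project keys
        v = enc @ wv                             # project values
        scores = k @ q / np.sqrt(q.shape[0])     # scaled dot-product scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                 # softmax over encoder frames
        contexts.append(weights @ v)             # this head's context vector
    return np.concatenate(contexts)
```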