Word Embeddings: A Survey
This work lists and describes the main recent strategies for building
fixed-length, dense and distributed representations for words, based on the
distributional hypothesis. These representations are now commonly called word
embeddings and, in addition to encoding surprisingly good syntactic and
semantic information, have been proven useful as extra features in many
downstream NLP tasks. Comment: 10 pages, 2 tables, 1 image
Improved training for online end-to-end speech recognition systems
Achieving high accuracy with end-to-end speech recognizers requires careful
parameter initialization prior to training. Otherwise, the networks may fail to
find a good local optimum. This is particularly true for online networks, such
as unidirectional LSTMs. Currently, the best strategy to train such systems is
to bootstrap the training from a tied-triphone system. However, this is time
consuming, and more importantly, is impossible for languages without a
high-quality pronunciation lexicon. In this work, we propose an initialization
strategy that uses teacher-student learning to transfer knowledge from a large,
well-trained, offline end-to-end speech recognition model to an online
end-to-end model, eliminating the need for a lexicon or any other linguistic
resources. We also explore curriculum learning and label smoothing and show how
they can be combined with the proposed teacher-student learning for further
improvements. We evaluate our methods on a Microsoft Cortana personal assistant
task and show that the proposed method results in a 19% relative improvement
in word error rate compared to a randomly-initialized baseline system. Comment: Interspeech 201
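The abstract above does not state the training loss, so the following is only a minimal sketch of how label smoothing and teacher-student learning are commonly combined: the student is trained against an interpolation of the teacher's soft output distribution and a smoothed version of the hard label. The function names, the interpolation `weight`, the smoothing `epsilon`, and the toy distributions are all assumptions made for illustration, not details taken from the paper.

```python
import math

def smooth_labels(target_index, num_classes, epsilon=0.1):
    """One-hot target with epsilon of the mass spread uniformly over all classes."""
    uniform = epsilon / num_classes
    return [(1.0 - epsilon if i == target_index else 0.0) + uniform
            for i in range(num_classes)]

def cross_entropy(target_dist, predicted_dist):
    """Cross-entropy H(target, predicted) in nats; zero-mass targets are skipped."""
    return -sum(t * math.log(p)
                for t, p in zip(target_dist, predicted_dist) if t > 0.0)

def teacher_student_loss(teacher_probs, student_probs, target_index,
                         weight=0.5, epsilon=0.1):
    """Interpolate a soft-target (teacher) loss with a smoothed hard-label loss.

    `weight` and `epsilon` are hypothetical hyperparameters for this sketch.
    """
    hard = smooth_labels(target_index, len(student_probs), epsilon)
    return (weight * cross_entropy(teacher_probs, student_probs)
            + (1.0 - weight) * cross_entropy(hard, student_probs))

# Toy per-frame output distributions over three labels (made-up numbers).
teacher = [0.7, 0.2, 0.1]   # offline model, assumed already well trained
student = [0.5, 0.3, 0.2]   # online model being trained
loss = teacher_student_loss(teacher, student, target_index=0)
```

With `weight=1.0` the student only imitates the teacher, and with `weight=0.0` it falls back to ordinary label-smoothed training, so the same function covers both extremes of the combination described in the abstract.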
State-of-the-art Speech Recognition With Sequence-to-Sequence Models
Attention-based encoder-decoder architectures such as Listen, Attend, and
Spell (LAS) subsume the acoustic, pronunciation and language model components
of a traditional automatic speech recognition (ASR) system into a single neural
network. In previous work, we have shown that such architectures are comparable
to state-of-the-art ASR systems on dictation tasks, but it was not clear if such
architectures would be practical for more challenging tasks such as voice
search. In this work, we explore a variety of structural and optimization
improvements to our LAS model which significantly improve performance. On the
structural side, we show that word piece models can be used instead of
graphemes. We also introduce a multi-head attention architecture, which offers
improvements over the commonly-used single-head attention. On the optimization
side, we explore synchronous training, scheduled sampling, label smoothing, and
minimum word error rate optimization, which are all shown to improve accuracy.
We present results with a unidirectional LSTM encoder for streaming
recognition. On a 12,500-hour voice search task, we find that the proposed
changes improve the WER from 9.2% to 5.6%, while the best conventional system
achieves 6.7%; on a dictation task our model achieves a WER of 4.1% compared to
5% for the conventional system. Comment: ICASSP camera-ready version
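The word error rates quoted in this abstract are absolute figures, while speech recognition papers often report relative improvements. A short sketch of the conversion, using only the numbers given above (the function name is an assumption for illustration):

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: how much of the baseline error was removed."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Voice search: proposed changes take the LAS model from 9.2% to 5.6% WER.
voice_search_vs_initial = relative_wer_reduction(9.2, 5.6)

# Voice search: final LAS model (5.6%) against the best conventional system (6.7%).
voice_search_vs_conventional = relative_wer_reduction(6.7, 5.6)

# Dictation: LAS model (4.1%) against the conventional system (5%).
dictation_vs_conventional = relative_wer_reduction(5.0, 4.1)
```

These work out to roughly 39%, 16%, and 18% relative reductions respectively.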