6,385 research outputs found
Regularizing Neural Machine Translation by Target-bidirectional Agreement
Although Neural Machine Translation (NMT) has achieved remarkable progress in
the past several years, most NMT systems still suffer from a fundamental
shortcoming as in other sequence generation tasks: errors made early in
generation process are fed as inputs to the model and can be quickly amplified,
harming subsequent sequence generation. To address this issue, we propose a
novel model regularization method for NMT training, which aims to improve the
agreement between translations generated by left-to-right (L2R) and
right-to-left (R2L) NMT decoders. This goal is achieved by introducing two
Kullback-Leibler divergence regularization terms into the NMT training
objective to reduce the mismatch between output probabilities of L2R and R2L
models. In addition, we also employ a joint training strategy to allow L2R and
R2L models to improve each other in an interactive update process. Experimental
results show that our proposed method significantly outperforms
state-of-the-art baselines on Chinese-English and English-German translation
tasks.Comment: Accepted by AAAI 201
Improved training of end-to-end attention models for speech recognition
Sequence-to-sequence attention-based models on subword units allow simple
open-vocabulary end-to-end speech recognition. In this work, we show that such
models can achieve competitive results on the Switchboard 300h and LibriSpeech
1000h tasks. In particular, we report the state-of-the-art word error rates
(WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets
of LibriSpeech. We introduce a new pretraining scheme by starting with a high
time reduction factor and lowering it during training, which is crucial both
for convergence and final performance. In some experiments, we also use an
auxiliary CTC loss function to help the convergence. In addition, we train long
short-term memory (LSTM) language models on subword units. By shallow fusion,
we report up to 27% relative improvements in WER over the attention baseline
without a language model.Comment: submitted to Interspeech 201
- …