Regularizing Neural Machine Translation by Target-bidirectional Agreement
Although Neural Machine Translation (NMT) has achieved remarkable progress in
the past several years, most NMT systems still suffer from a fundamental
shortcoming shared with other sequence generation tasks: errors made early in
the generation process are fed back as inputs to the model and can be quickly amplified,
harming subsequent sequence generation. To address this issue, we propose a
novel model regularization method for NMT training, which aims to improve the
agreement between translations generated by left-to-right (L2R) and
right-to-left (R2L) NMT decoders. This goal is achieved by introducing two
Kullback-Leibler divergence regularization terms into the NMT training
objective to reduce the mismatch between output probabilities of L2R and R2L
models. In addition, we also employ a joint training strategy to allow L2R and
R2L models to improve each other in an interactive update process. Experimental
results show that our proposed method significantly outperforms
state-of-the-art baselines on Chinese-English and English-German translation
tasks.

Comment: Accepted by AAAI 201
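The agreement objective above can be sketched numerically. This is a minimal numpy illustration, not the paper's implementation: the function names are invented, a symmetric KL form is assumed, and aligning the R2L decoder's reversed positions with the L2R ones is assumed to happen before the call.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Per-position KL(p || q), summed over the vocabulary axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def agreement_loss(p_l2r, p_r2l):
    """Symmetric KL agreement penalty between the L2R and R2L decoders'
    per-token output distributions, averaged over target positions.
    Assumes position i in both arrays refers to the same target token."""
    return float(np.mean(kl(p_l2r, p_r2l) + kl(p_r2l, p_l2r)))

# Toy distributions over a 4-word vocabulary for a 3-token target.
rng = np.random.default_rng(0)
p_l2r = rng.dirichlet(np.ones(4), size=3)
p_r2l = rng.dirichlet(np.ones(4), size=3)
penalty = agreement_loss(p_l2r, p_r2l)  # added to the NMT training loss
```

In training, a term like `penalty` would be weighted and added to the usual cross-entropy objective of each decoder, pushing the two models' output probabilities toward each other.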
Accelerating Transducers through Adjacent Token Merging
Recent end-to-end automatic speech recognition (ASR) systems often utilize a
Transformer-based acoustic encoder that generates embeddings at a high frame
rate. However, this design is inefficient, particularly for long speech
signals, due to the quadratic computation of self-attention. To address this, we propose
a new method, Adjacent Token Merging (A-ToMe), which gradually combines
adjacent tokens with high similarity scores between their key values. In this
way, the total number of time steps is reduced, and the inference of both the
encoder and joint network is accelerated. Experiments on LibriSpeech show that
our method can reduce 57% of tokens and improve the inference speed on GPU by
70% without any notable loss of accuracy. Additionally, we demonstrate that
A-ToMe is also an effective solution to reduce tokens in long-form ASR, where
the input speech consists of multiple utterances.

Comment: Interspeech 202
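The merging idea can be sketched as a single greedy pass over the sequence. This is an illustrative simplification with invented names: the paper merges tokens gradually across layers, whereas the sketch below does one pass, uses a fixed cosine-similarity threshold, and averages merged tokens.

```python
import numpy as np

def merge_adjacent(tokens, keys, threshold=0.9):
    """Greedy single pass: fold each token into the previous output slot
    when the cosine similarity of their key vectors exceeds `threshold`;
    merged tokens and keys are averaged."""
    out = [tokens[0].astype(float)]
    out_keys = [keys[0].astype(float)]
    counts = [1]
    for tok, key in zip(tokens[1:], keys[1:]):
        prev_key = out_keys[-1] / counts[-1]
        denom = np.linalg.norm(prev_key) * np.linalg.norm(key) + 1e-12
        if float(prev_key @ key) / denom > threshold:
            out[-1] = out[-1] + tok
            out_keys[-1] = out_keys[-1] + key
            counts[-1] += 1
        else:
            out.append(tok.astype(float))
            out_keys.append(key.astype(float))
            counts.append(1)
    return np.stack([t / c for t, c in zip(out, counts)])

# Three embeddings; the first two are identical, so they merge: 3 -> 2 tokens.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
merged = merge_adjacent(emb, emb, threshold=0.9)
```

Shortening the token sequence this way reduces the work done by every subsequent self-attention layer and by the transducer's joint network, which is where the reported speedups come from.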
Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation
End-to-end speech translation, a hot topic in recent years, aims to translate
a segment of audio into a specific language with an end-to-end model.
Conventional approaches employ multi-task learning and pre-training methods for
this task, but they suffer from the huge gap between pre-training and
fine-tuning. To address this issue, we propose a Tandem Connectionist
Encoding Network (TCEN) which bridges the gap by reusing all subnets in
fine-tuning, keeping the roles of subnets consistent, and pre-training the
attention module. Furthermore, we propose two simple but effective methods to
guarantee the speech encoder outputs and the MT encoder inputs are consistent
in terms of semantic representation and sequence length. Experimental results
show that our model outperforms baselines by 2.2 BLEU on a large benchmark
dataset.

Comment: AAAI 202
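The abstract does not spell out the two consistency methods, so the sketch below only illustrates the underlying length-mismatch problem: acoustic encoders emit far more frames than an MT encoder expects tokens. Average pooling over fixed-size frame groups is an assumed stand-in, not the paper's technique.

```python
import numpy as np

def shrink_frames(frames, stride):
    """Average each group of `stride` consecutive frames, padding the tail
    by repeating the last frame, so the acoustic sequence length drops
    toward text-like lengths before feeding an MT-style encoder."""
    n, d = frames.shape
    pad = (-n) % stride
    if pad:
        frames = np.concatenate([frames, np.repeat(frames[-1:], pad, axis=0)])
    return frames.reshape(-1, stride, d).mean(axis=1)

# 6 two-dimensional frames pooled with stride 3 -> 2 "subword-rate" vectors.
frames = np.arange(12, dtype=float).reshape(6, 2)
pooled = shrink_frames(frames, stride=3)
```

Any such length adapter sits between the pre-trained speech encoder and the pre-trained MT encoder, so both subnets can be reused in fine-tuning with their roles unchanged.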