Greedy Search with Probabilistic N-gram Matching for Neural Machine Translation
Neural machine translation (NMT) models are usually trained with the
word-level loss using the teacher forcing algorithm, which not only evaluates
the translation improperly but also suffers from exposure bias. Sequence-level
training under the reinforcement framework can mitigate the problems of the
word-level loss, but its performance is unstable due to the high variance of
the gradient estimation. On these grounds, we present a method with a
differentiable sequence-level training objective based on probabilistic n-gram
matching, which avoids the reinforcement framework. In addition, this method
performs greedy search during training, using the predicted words as context
just as at inference, which alleviates the problem of exposure bias.
Experimental results on the NIST Chinese-to-English translation tasks show
that our method significantly outperforms the reinforcement-based algorithms
and achieves an average improvement of 1.5 BLEU points over a strong baseline
system.
Comment: 7 pages, accepted by EMNLP 2018
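To make the differentiable objective concrete, below is a minimal PyTorch sketch of probabilistic n-gram matching: the expected count of each reference n-gram is computed from the decoder's per-position token probabilities, so gradients flow without a reinforcement estimator. The function name, tensor shapes, and BLEU-style clipping are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def probabilistic_ngram_match(probs, reference, n=2):
    """Differentiable expected n-gram match between the decoder's output
    distributions and a reference sentence (illustrative sketch only).

    probs:     (T, V) tensor of per-position token probabilities,
               collected while greedily decoding T steps.
    reference: list of token ids of the ground-truth translation.
    """
    T = probs.size(0)
    # Count each n-gram in the reference (for BLEU-style clipping).
    ref_counts = {}
    for i in range(len(reference) - n + 1):
        g = tuple(reference[i:i + n])
        ref_counts[g] = ref_counts.get(g, 0) + 1
    matched = probs.new_zeros(())
    for g, ref_count in ref_counts.items():
        # Expected count of g: sum over start positions of the product
        # of its n token probabilities.
        expected = probs.new_zeros(())
        for t in range(T - n + 1):
            p = probs.new_ones(())
            for i, tok in enumerate(g):
                p = p * probs[t + i, tok]
            expected = expected + p
        # An n-gram cannot match more often than it occurs in the reference.
        matched = matched + torch.clamp(expected, max=float(ref_count))
    # Using loss = -matched trains the model end to end, with no
    # reinforcement-learning gradient estimator involved.
    return matched
```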
Bridging the Gap between Training and Inference for Neural Machine Translation
Neural Machine Translation (NMT) generates target words sequentially in the
way of predicting the next word conditioned on the context words. At training
time, it predicts with the ground truth words as context while at inference it
has to generate the entire sequence from scratch. This discrepancy between the
fed contexts leads to error accumulation along the way. Furthermore, word-level
training requires strict matching between the generated sequence and the
ground-truth sequence, which leads to over-correction of different but
reasonable
translations. In this paper, we address these issues by sampling context words
not only from the ground truth sequence but also from the predicted sequence by
the model during training, where the predicted sequence is selected with a
sentence-level optimum. Experimental results on Chinese->English and WMT'14
English->German translation tasks demonstrate that our approach achieves
significant improvements on multiple datasets.
Comment: 10 pages, 7 figures
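As a rough illustration, the context sampling during training might look like the sketch below, assuming a decay schedule of the shape p = mu / (mu + exp(epoch / mu)); the helper names and token-level mixing are our simplifications, and the paper additionally selects the predicted sequence with a sentence-level oracle.

```python
import math
import random

def mix_training_context(gold_tokens, oracle_tokens, epoch, mu=12.0):
    """Sketch of sampling the decoder's context words during training.

    With probability p (decaying as training progresses) the
    ground-truth token is fed as context; otherwise the model's own
    oracle prediction is fed, so the training-time context gradually
    approaches what the model sees at inference. `mu` is an
    illustrative hyper-parameter controlling the decay speed.
    """
    p = mu / (mu + math.exp(epoch / mu))  # starts near 1, decays toward 0
    return [g if random.random() < p else o
            for g, o in zip(gold_tokens, oracle_tokens)]
```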
Retrieving Sequential Information for Non-Autoregressive Neural Machine Translation
Non-Autoregressive Transformer (NAT) aims to accelerate the Transformer model
through discarding the autoregressive mechanism and generating target words
independently, which fails to exploit the target sequential information.
As a result, over-translation and under-translation errors often occur,
especially for long sentences. In this paper, we propose
two approaches to retrieve the target sequential information for NAT to enhance
its translation ability while preserving the fast-decoding property. Firstly,
we propose a sequence-level training method based on a novel reinforcement
algorithm for NAT (Reinforce-NAT) to reduce the variance and stabilize the
training procedure. Secondly, we propose an innovative Transformer decoder
named FS-decoder to fuse the target sequential information into the top layer
of the decoder. Experimental results on three translation tasks show that
Reinforce-NAT surpasses the baseline NAT system by a significant margin in BLEU
without slowing down decoding, and that the FS-decoder achieves translation
performance comparable to the autoregressive Transformer with considerable
speedup.
Comment: 12 pages, 4 figures, ACL 2019 long paper
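For intuition, a bare-bones sequence-level policy-gradient loss for a NAT decoder is sketched below. This is a generic REINFORCE-with-baseline form, not the paper's exact variance-reduced Reinforce-NAT estimator, and all names are illustrative.

```python
import torch

def sequence_level_nat_loss(log_probs, sampled, reward, baseline):
    """Generic REINFORCE-with-baseline loss for a NAT decoder (sketch).

    log_probs: (T, V) per-position log-probabilities; NAT factorizes
               the sequence probability as their product.
    sampled:   (T,) token ids sampled independently at each position.
    reward:    scalar sentence-level reward of the sample, e.g. BLEU.
    baseline:  scalar baseline subtracted to reduce gradient variance.
    """
    picked = log_probs.gather(1, sampled.unsqueeze(1)).squeeze(1)  # (T,)
    # Policy gradient: minimizing -(r - b) * log p(sample) pushes
    # probability mass toward higher-reward sequences.
    return -(reward - baseline) * picked.sum()
```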
Guiding Teacher Forcing with Seer Forcing for Neural Machine Translation
Although teacher forcing has become the main training paradigm for neural
machine translation, it usually makes predictions only conditioned on past
information, and hence lacks global planning for the future. To address this
problem, we introduce a second decoder, called the seer decoder, into the
encoder-decoder framework during training, which incorporates future
information into target predictions. Meanwhile, we force the conventional
decoder to simulate
the behaviors of the seer decoder via knowledge distillation. In this way, at
test time the conventional decoder can perform like the seer decoder without
needing it. Experimental results on the Chinese-English, English-German and
English-Romanian translation tasks show that our method significantly
outperforms competitive baselines and achieves greater improvements on larger
data sets. Moreover, the experiments show that knowledge distillation is the
best way to transfer knowledge from the seer decoder to the conventional
decoder, compared with adversarial learning and L2 regularization.
Comment: Accepted by the ACL-IJCNLP 2021 main conference
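As a sketch of the distillation step, the conventional decoder can be pulled toward the seer decoder with a KL term over the output distributions at each target position; the temperature and function names below are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def seer_distillation_loss(student_logits, seer_logits, temperature=1.0):
    """Sketch of distilling the seer decoder into the conventional one.

    student_logits, seer_logits: (positions, vocab) decoder outputs.
    The seer decoder, which sees future target information during
    training, acts as the teacher; it is detached so that only the
    conventional decoder is updated by this term, and it is discarded
    entirely at test time.
    """
    teacher = F.softmax(seer_logits.detach() / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over target positions.
    return F.kl_div(log_student, teacher, reduction="batchmean")
```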
Sequence-Level Training for Non-Autoregressive Neural Machine Translation
In recent years, Neural Machine Translation (NMT) has achieved notable
results in various translation tasks. However, the word-by-word generation
manner determined by the autoregressive mechanism leads to high translation
latency for NMT and restricts its use in low-latency applications.
Non-Autoregressive Neural Machine Translation (NAT) removes the autoregressive
mechanism and achieves significant decoding speedup through generating target
words independently and simultaneously. Nevertheless, NAT still takes the
word-level cross-entropy loss as the training objective, which is not optimal
because the output of NAT cannot be properly evaluated due to the multimodality
problem. In this paper, we propose using sequence-level training objectives to
train NAT models, which evaluate the NAT outputs as a whole and correlate well
with the real translation quality. Firstly, we propose training NAT models to
optimize sequence-level evaluation metrics (e.g., BLEU) based on several novel
reinforcement algorithms customized for NAT, which outperform the conventional
method by reducing the variance of gradient estimation. Secondly, we introduce
a novel training objective for NAT models, which aims to minimize the
Bag-of-Ngrams (BoN) difference between the model output and the reference
sentence. The BoN training objective is differentiable and can be calculated
efficiently without any approximation. Finally, we apply a three-stage
training strategy to combine these two methods to train the NAT model. We
validate our approach on four translation tasks (WMT14 En<->De, WMT16
En<->Ro), and the results show that our approach largely outperforms NAT
baselines and achieves remarkable performance on all of them.
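To illustrate why the BoN objective needs no approximation: since a NAT model emits every position independently, the expected count of any n-gram is an exact sum of products of token probabilities, and the BoN-L1 distance follows from the identity L1 = |BoN_model| + |BoN_ref| - 2 * match. The naive loop below sketches that computation under our own naming and shape assumptions; the paper's implementation is more efficient.

```python
import torch

def bon_l1_loss(probs, reference, n=2):
    """Sketch of the Bag-of-Ngrams (BoN) objective for a NAT model.

    probs:     (T, V) per-position token probabilities, independent in NAT.
    reference: list of token ids of the reference sentence.

    Uses |BoN_model - BoN_ref|_1 = |BoN_model| + |BoN_ref| - 2 * match,
    where only reference n-grams can contribute to the match term.
    """
    T = probs.size(0)
    ref_bon = {}
    for i in range(len(reference) - n + 1):
        g = tuple(reference[i:i + n])
        ref_bon[g] = ref_bon.get(g, 0) + 1
    match = probs.new_zeros(())
    for g, count in ref_bon.items():
        # Exact expected count of g under the independent per-position
        # distributions: sum over start positions of probability products.
        expected = probs.new_zeros(())
        for t in range(T - n + 1):
            p = probs.new_ones(())
            for i, tok in enumerate(g):
                p = p * probs[t + i, tok]
            expected = expected + p
        match = match + torch.clamp(expected, max=float(count))
    model_total = T - n + 1              # total expected n-gram mass
    ref_total = len(reference) - n + 1   # total reference n-grams
    return model_total + ref_total - 2 * match
```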