On the Inference Calibration of Neural Machine Translation
Confidence calibration, which aims to align a model's predicted probabilities with the true likelihood of correctness, is important for neural machine translation (NMT) because it offers useful indicators of translation errors in the generated output. While prior studies have shown that NMT models trained with label smoothing are well calibrated on the ground-truth training data, we find that miscalibration remains a severe challenge for NMT during inference, owing to the discrepancy between training and inference. Through carefully designed experiments on three language pairs, our work provides in-depth analyses of the correlation between calibration and translation performance, as well as the linguistic properties of miscalibration, and reports a number of findings that may help researchers better analyze, understand, and improve NMT models. Based on these observations, we further propose a new graduated label smoothing method that improves both inference calibration and translation performance.
Comment: Accepted by ACL 2020
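For intuition, below is a minimal PyTorch sketch of a confidence-dependent ("graduated") smoothing weight in the spirit of the method described above; the thresholds (0.3/0.7) and smoothing weights (0.0/0.1/0.3), along with the choice to read confidence off the current model's own predictions, are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def graduated_label_smoothing_loss(logits, targets):
    """Cross-entropy with a confidence-dependent smoothing weight (sketch).

    Assumption: high-confidence predictions receive a larger smoothing
    penalty, low-confidence ones receive none; the thresholds (0.3/0.7)
    and weights (0.0/0.1/0.3) are illustrative, not the paper's values.
    """
    probs = F.softmax(logits, dim=-1)
    confidence = probs.max(dim=-1).values        # per-token confidence
    eps = torch.full_like(confidence, 0.1)       # medium confidence
    eps[confidence > 0.7] = 0.3                  # penalize over-confidence more
    eps[confidence < 0.3] = 0.0                  # leave low confidence alone
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    smooth = -log_probs.mean(dim=-1)             # uniform-distribution term
    return ((1.0 - eps) * nll + eps * smooth).mean()

# Toy usage: a batch of 8 token positions over a 100-word vocabulary.
logits = torch.randn(8, 100)
targets = torch.randint(0, 100, (8,))
print(graduated_label_smoothing_loss(logits, targets))
```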
Decoding and Diversity in Machine Translation
Neural Machine Translation (NMT) systems are typically evaluated using
automated metrics that assess the agreement between generated translations and
ground truth candidates. To improve systems with respect to these metrics, NLP
researchers employ a variety of heuristic techniques, including searching for
the conditional mode (vs. sampling) and incorporating various training
heuristics (e.g., label smoothing). While search strategies significantly
improve BLEU scores, they yield deterministic outputs that lack the diversity of
human translations. Moreover, search tends to bias the distribution of
translated gender pronouns. This makes human-level BLEU a misleading benchmark
in that modern MT systems cannot approach human-level BLEU while simultaneously
maintaining human-level translation diversity. In this paper, we characterize
distributional differences between generated and real translations, examining
the cost in diversity paid for the BLEU scores enjoyed by NMT. Moreover, our
study implicates search as a salient source of known bias when translating
gender pronouns.
Comment: Presented at the Resistance AI Workshop, 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada
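To make the diversity cost concrete, here is a small self-contained Python sketch of a generic distinct-n diversity measure; the metric, the toy sentences, and the beam-vs.-sampling comparison are illustrative assumptions, not the distributional analyses actually run in the paper.

```python
from collections import Counter

def distinct_n(translations, n=2):
    """Fraction of unique n-grams across a set of translations.

    Generic diversity measure, used here only to illustrate the kind
    of distributional comparison described above.
    """
    ngrams = Counter()
    for sent in translations:
        tokens = sent.split()
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Deterministic search collapses to one candidate; sampling varies.
beam_outputs = ["the cat sat on the mat"] * 5
sampled_outputs = [
    "the cat sat on the mat",
    "a cat was sitting on the mat",
    "the cat rested on a mat",
    "on the mat , the cat sat",
    "the cat is sitting on the mat",
]
print(distinct_n(beam_outputs))     # low: identical outputs, few unique n-grams
print(distinct_n(sampled_outputs))  # higher: sampling spreads the distribution
```

Because mode-seeking search returns the same candidate every time, its distinct-n stays low, while sampling spreads mass over many n-grams; this is the qualitative trade-off the abstract describes.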
Language Models are Good Translators
Recent years have witnessed rapid advances in neural machine translation (NMT), the core of which lies in the encoder-decoder architecture. Inspired by recent progress of large-scale pre-trained language models on machine translation in limited scenarios, we first demonstrate that a single language model (LM4MT) can achieve performance comparable to strong encoder-decoder NMT models on standard machine translation benchmarks, using the same training data and a similar number of model parameters. LM4MT can also easily utilize source-side texts as additional supervision. By modeling the source- and target-language texts with the same mechanism, LM4MT provides unified representations for both source and target sentences, which can better transfer knowledge across languages. Extensive experiments on pivot-based and zero-shot translation tasks show that LM4MT can outperform the encoder-decoder NMT model by a large margin.
Comment: 12 pages. Work in progress. An earlier version of this manuscript is under review.
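As a rough illustration of the decoder-only formulation, the sketch below models a concatenated "[source] <sep> [target]" sequence with a single causal Transformer and applies the LM loss over every position, so source tokens also contribute supervision; the layer sizes, separator token, and concatenation scheme are assumptions for demonstration, not the exact LM4MT architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderOnlyMT(nn.Module):
    """Single causal LM over concatenated source + target (sketch).

    The concatenation scheme, separator token, and layer sizes are
    illustrative assumptions, not the exact LM4MT architecture.
    """
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.body = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.body(self.embed(tokens), mask=causal)
        return self.lm_head(hidden)

# Build "[src] <sep> [tgt]" sequences; the LM loss covers every position,
# so source-side tokens contribute supervision as well.
vocab, sep = 1000, 0
src = torch.randint(1, vocab, (2, 7))
tgt = torch.randint(1, vocab, (2, 9))
seq = torch.cat([src, torch.full((2, 1), sep), tgt], dim=1)
model = DecoderOnlyMT(vocab)
logits = model(seq[:, :-1])                       # predict the next token
loss = F.cross_entropy(logits.reshape(-1, vocab), seq[:, 1:].reshape(-1))
```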
Towards Understanding Label Smoothing
Label smoothing regularization (LSR) has seen great success in training deep neural networks with stochastic algorithms such as stochastic gradient descent (SGD) and its variants. However, a theoretical understanding of its power from the optimization perspective is still scarce. This study opens the door to a deeper understanding of LSR by initiating such an analysis. In this paper, we analyze the convergence behavior of stochastic gradient descent with label smoothing regularization for solving non-convex problems and show that an appropriate LSR can speed up convergence by reducing the variance. More interestingly, we propose a simple yet effective strategy, the Two-Stage LAbel smoothing algorithm (TSLA), which uses LSR in the early training epochs and drops it in the later epochs. The improved convergence result for TSLA shows that it benefits from LSR in the first stage and essentially converges faster in the second. To the best of our knowledge, this is the first work to explain the power of LSR by establishing the convergence complexity of stochastic methods with LSR in non-convex optimization. We empirically demonstrate the effectiveness of the proposed method against baselines by training ResNet models on benchmark datasets.
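A minimal PyTorch sketch of the two-stage schedule follows: keep label smoothing in the early epochs, then drop it. The switch epoch and smoothing weight here are illustrative assumptions; the paper's convergence analysis is what motivates the switch.

```python
import torch
import torch.nn.functional as F

def tsla_loss(logits, targets, epoch, switch_epoch=50, eps=0.1):
    """Two-stage label smoothing (sketch): smooth early, drop late.

    The switch epoch and smoothing weight are illustrative assumptions;
    the paper's convergence analysis determines when the switch pays off.
    """
    if epoch >= switch_epoch:
        eps = 0.0                      # stage two: plain cross-entropy
    return F.cross_entropy(logits, targets, label_smoothing=eps)

# Toy usage at an early and a late epoch:
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
print(tsla_loss(logits, targets, epoch=10))   # stage one: LSR active
print(tsla_loss(logits, targets, epoch=80))   # stage two: LSR dropped
```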