BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training
Automatic metrics play a crucial role in machine translation. Despite the
widespread use of n-gram-based metrics, there has been a recent surge in the
development of pre-trained model-based metrics that focus on measuring sentence
semantics. However, these neural metrics, while achieving higher correlations
with human evaluations, are often considered to be black boxes with potential
biases that are difficult to detect. In this study, we systematically analyze
and compare various mainstream and cutting-edge automatic metrics from the
perspective of their guidance for training machine translation systems. Through
Minimum Risk Training (MRT), we find that certain metrics exhibit robustness
defects, such as the presence of universal adversarial translations in BLEURT
and BARTScore. In-depth analysis suggests two main causes of these robustness
deficits: distribution biases in the training datasets, and the tendency of the
metric paradigm. By incorporating token-level constraints, we enhance the
robustness of evaluation metrics, which in turn leads to an improvement in the
performance of machine translation systems. Code is available at
https://github.com/powerpuffpomelo/fairseq_mrt. Comment: Accepted to ACL 2023 main conference
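To make the training objective named in this abstract concrete, here is a minimal sketch of an expected-risk computation in the style of Minimum Risk Training. It is not taken from the linked repository. The function name `mrt_expected_risk`, the sharpness parameter `alpha`, and the convention that cost equals one minus a metric score are all assumptions made for illustration.

```python
import math
from typing import Sequence


def mrt_expected_risk(
    log_probs: Sequence[float],
    costs: Sequence[float],
    alpha: float = 1.0,  # hypothetical sharpness value, chosen for illustration
) -> float:
    """Expected cost of a candidate set under a renormalised model distribution.

    log_probs: model log-probabilities of each sampled candidate translation.
    costs: per-candidate cost, e.g. 1 - metric score (an assumed convention).
    """
    if not log_probs or len(log_probs) != len(costs):
        raise ValueError("log_probs and costs must be non-empty and equal length")
    # Q(y) is proportional to p(y)^alpha, computed in log space for stability.
    scaled = [alpha * lp for lp in log_probs]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    # The risk is the cost averaged under Q over the candidate set.
    return sum(w / total * c for w, c in zip(weights, costs))
```

Because the loss is driven entirely by `costs`, a candidate that a metric scores highly for the wrong reasons lowers the risk, which is how training in this style can surface the robustness defects the abstract describes.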
Domain adaptation for neural machine translation
The development of deep learning techniques has allowed Neural Machine Translation (NMT) models to become extremely powerful, given sufficient training data and training time. However, such translation models struggle when translating text of a specific domain. A domain may consist of text on a well-defined topic, or text of unknown provenance with an identifiable vocabulary distribution, or language with some other stylometric feature. While NMT models can achieve good translation performance on domain-specific data via simple tuning on a representative training corpus, such data-centric approaches have negative side-effects. These include over-fitting, brittleness, and 'catastrophic forgetting' of previous training examples.
In this thesis we instead explore more robust approaches to domain adaptation for NMT. We consider the case where a system is adapted to a specified domain of interest, but may also need to accommodate new language, or domain-mismatched sentences. We explore techniques relating to data selection and curriculum, model parameter adaptation procedure, and inference procedure. We show that iterative fine-tuning can achieve strong performance over multiple related domains, and that Elastic Weight Consolidation can be used to mitigate catastrophic forgetting in NMT domain adaptation across multiple sequential domains. We develop a robust variant of Minimum Risk Training which allows more beneficial use of small, highly domain-specific tuning sets than simple cross-entropy fine-tuning, and can mitigate exposure bias resulting from domain over-fitting. We extend Bayesian Interpolation inference schemes to Neural Machine Translation, allowing adaptive weighting of NMT ensembles to translate text from an unknown domain.
Finally, we demonstrate the benefit of multi-domain adaptation approaches for other lines of NMT research. We show that NMT systems using multiple forms of data representation can benefit from multi-domain inference approaches. We also demonstrate a series of domain adaptation approaches to mitigating the effects of gender bias in machine translation.
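The Elastic Weight Consolidation idea mentioned in this abstract can be sketched as a quadratic penalty that discourages parameters from drifting away from values that mattered for an earlier domain. This is a minimal sketch of the commonly used form of that penalty, not the thesis's implementation. The function name `ewc_penalty`, the flat-list parameter layout, and the weight `lam` are assumptions made for illustration.

```python
from typing import Sequence


def ewc_penalty(
    params: Sequence[float],
    old_params: Sequence[float],
    fisher: Sequence[float],
    lam: float = 1.0,  # hypothetical regularisation strength
) -> float:
    """Penalty (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2.

    params: current parameter values while tuning on the new domain.
    old_params: parameter values after training on the earlier domain.
    fisher: per-parameter importance estimates (diagonal Fisher information).
    """
    if not (len(params) == len(old_params) == len(fisher)):
        raise ValueError("all inputs must have the same length")
    return 0.5 * lam * sum(
        f * (p - o) ** 2 for p, o, f in zip(params, old_params, fisher)
    )
```

In use, this value would be added to the fine-tuning loss on the new domain, so that parameters with a high importance estimate stay close to their earlier values while less important ones remain free to adapt.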
Identifying Weaknesses in Machine Translation Metrics Through Minimum Bayes Risk Decoding: A Case Study for COMET
Neural metrics have achieved impressive correlation with human judgements in the evaluation of machine translation systems, but before we can safely optimise towards such metrics, we should be aware of (and ideally eliminate) biases toward bad translations that receive high scores. Our experiments show that sample-based Minimum Bayes Risk decoding can be used to explore and quantify such weaknesses. When applying this strategy to COMET for en-de and de-en, we find that COMET models are not sensitive enough to discrepancies in numbers and named entities. We further show that these biases are hard to fully remove by simply training on additional synthetic data, and we release our code and data to facilitate further experiments.
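Sample-based Minimum Bayes Risk decoding, as described in this abstract, selects the candidate that scores best on average against the other samples. The sketch below illustrates that selection rule under stated assumptions. It uses a simple unigram F1 as a stand-in utility rather than COMET, and the names `unigram_f1` and `mbr_decode` are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def unigram_f1(hyp: str, ref: str) -> float:
    """Stand-in utility: F1 over whitespace tokens (an assumption, not COMET)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(
    samples: Sequence[str],
    utility: Callable[[str, str], float] = unigram_f1,
) -> str:
    """Return the sample with the highest mean utility against the others."""
    if not samples:
        raise ValueError("need at least one sample")
    if len(samples) == 1:
        return samples[0]

    def expected_utility(i: int) -> float:
        # Each other sample acts as a pseudo-reference for candidate i.
        others = [s for j, s in enumerate(samples) if j != i]
        return sum(utility(samples[i], o) for o in others) / len(others)

    best = max(range(len(samples)), key=expected_utility)
    return samples[best]
```

Since the chosen output is whatever the utility prefers, inspecting the winners for errors such as wrong numbers or named entities gives a way to probe the blind spots of the utility function itself.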
On Exposure Bias, Hallucination and Domain Shift in Neural Machine Translation
The standard training algorithm in neural machine translation (NMT) suffers
from exposure bias, and alternative algorithms have been proposed to mitigate
this. However, the practical impact of exposure bias is under debate. In this
paper, we link exposure bias to another well-known problem in NMT, namely the
tendency to generate hallucinations under domain shift. In experiments on three
datasets with multiple test domains, we show that exposure bias is partially to
blame for hallucinations, and that training with Minimum Risk Training, which
avoids exposure bias, can mitigate this. Our analysis explains why exposure
bias is more problematic under domain shift, and also links exposure bias to
the beam search problem, i.e. performance deterioration with increasing beam
size. Our results provide a new justification for methods that reduce exposure
bias: even if they do not increase performance on in-domain test sets, they can
increase model robustness to domain shift. Comment: ACL 202
Classical Structured Prediction Losses for Sequence to Sequence Learning
There has been much recent work on training neural attention models at the
sequence-level using either reinforcement learning-style methods or by
optimizing the beam. In this paper, we survey a range of classical objective
functions that have been widely used to train linear models for structured
prediction and apply them to neural sequence to sequence models. Our
experiments show that these losses can perform surprisingly well, slightly
outperforming beam search optimization in a like-for-like setup. We also report
new state-of-the-art results on both IWSLT'14 German-English translation as
well as Gigaword abstractive summarization. On the larger WMT'14 English-French
translation task, sequence-level training achieves 41.5 BLEU which is on par
with the state of the art. Comment: 10 pages, NAACL 201