164 research outputs found
Results of the WMT15 Metrics Shared Task
This paper presents the results of the WMT15 Metrics Shared Task. We asked
participants of this task to score the outputs of the MT systems involved in
the WMT15 Shared Translation Task. We collected scores of 46 metrics from 11
research groups. In addition to that, we computed scores of 7 standard metrics
(BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were
evaluated in terms of system level correlation (how well each metric's scores
correlate with WMT15 official manual ranking of systems) and in terms of segment
level correlation (how often a metric agrees with humans in comparing two
translations of a particular sentence).
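The two evaluation levels described above can be sketched as follows. This is only an illustration: the abstract does not say which correlation coefficients were used, so Pearson's r for the system level and a Kendall's-tau-like agreement score for the segment level are assumptions, as are the function names and the decision to skip metric ties.

```python
from math import sqrt

def pearson(xs, ys):
    """System level (assumed coefficient): Pearson correlation between
    metric scores and human scores, one value per MT system."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def segment_agreement(pairs):
    """Segment level (assumed formulation): each pair holds the metric
    scores of two translations of the same sentence, ordered so that
    humans preferred the first. Returns (concordant - discordant) /
    (concordant + discordant); metric ties are skipped in this sketch."""
    concordant = sum(1 for better, worse in pairs if better > worse)
    discordant = sum(1 for better, worse in pairs if better < worse)
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```

A score of 1.0 from `segment_agreement` would mean the metric always ranks the two translations the same way the human judges did, and -1.0 would mean it never does.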
Integrating meaning into quality evaluation of machine translation
Machine translation (MT) quality is evaluated through comparisons between MT outputs and human translations (HT). Traditionally, this evaluation relies on form-related features (e.g. lexicon and syntax) and ignores the transfer of meaning reflected in HT outputs. Instead, we evaluate the quality of MT outputs through meaning-related features (e.g. polarity, subjectivity) in two experiments. In the first experiment, the meaning-related features are compared to human rankings individually. In the second experiment, combinations of meaning-related features and other quality metrics are used to predict the same human rankings. The results of our experiments confirm the benefit of these features in predicting human evaluation of translation quality in addition to traditional metrics, which focus mainly on form.
Results of the WMT16 Tuning Shared Task
This paper presents the results of the WMT16 Tuning Shared Task. We provided the
participants of this task with a complete machine translation system and asked
them to tune its internal parameters (feature weights). The tuned systems were
used to translate the test set and the outputs were manually ranked for
translation quality. We received 4 submissions in the Czech-English and 8 in
the English-Czech translation direction. In addition, we ran 2 baseline setups,
tuning the parameters with standard optimizers for BLEU score. In contrast to
previous years, the tuned systems in 2016 rely on large data.
Results of the WMT15 Tuning Shared Task
This paper presents the results of the WMT15 Tuning Shared Task. We provided the
participants of this task with a complete machine translation system and asked them to tune its
internal parameters (feature weights). The tuned systems were used to translate the test set and
the outputs were manually ranked for translation quality. We received 4 submissions in the
English-Czech and 6 in the Czech-English translation direction. In addition, we ran
3 baseline setups, tuning the parameters with standard optimizers for BLEU score.
Reproducibility Issues for BERT-based Evaluation Metrics
Reproducibility is of utmost concern in machine learning and natural language
processing (NLP). In the field of natural language generation (especially
machine translation), the seminal paper of Post (2018) has pointed out problems
of reproducibility of the dominant metric, BLEU, at the time of publication.
Nowadays, BERT-based evaluation metrics considerably outperform BLEU. In this
paper, we ask whether results and claims from four recent BERT-based metrics
can be reproduced. We find that reproduction of claims and results often fails
because of (i) heavy undocumented preprocessing involved in the metrics, (ii)
missing code and (iii) reporting weaker results for the baseline metrics. (iv)
In one case, the problem stems from correlating not to human scores but to a
wrong column in the csv file, inflating scores by 5 points. Motivated by the
impact of preprocessing, we then conduct a second study where we examine its
effects more closely (for one of the metrics). We find that preprocessing can
have large effects, especially for highly inflectional languages. In this case,
the effect of preprocessing may be larger than the effect of the aggregation
mechanism (e.g., greedy alignment vs. Word Mover Distance).
Comment: EMNLP 2022 Camera-Ready (captions fixed)
Aligning Neural Machine Translation Models: Human Feedback in Training and Inference
Reinforcement learning from human feedback (RLHF) is a recent technique to
improve the quality of the text generated by a language model, making it closer
to what humans would generate. A core ingredient in RLHF's success in aligning
and improving large language models (LLMs) is its reward model, trained using
human feedback on model outputs. In machine translation (MT), where metrics
trained from human annotations can readily be used as reward models, recent
methods using minimum Bayes risk decoding and reranking have succeeded in
improving the final quality of translation. In this study, we comprehensively
explore and compare techniques for integrating quality metrics as reward models
into the MT pipeline. This includes using the reward model for data filtering,
during the training phase through RL, and at inference time by employing
reranking techniques, and we assess the effects of combining these in a unified
approach. Our experimental results, conducted across multiple translation
tasks, underscore the crucial role of effective data filtering, based on
estimated quality, in harnessing the full potential of RL in enhancing MT
quality. Furthermore, our findings demonstrate the effectiveness of combining
RL training with reranking techniques, showcasing substantial improvements in
translation quality.
Comment: 14 pages, work-in-progress
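The two inference-time techniques named in the abstract, reranking with a reward model and minimum Bayes risk (MBR) decoding, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' pipeline: `token_f1` is a hypothetical stand-in for a learned quality metric, and the candidate list stands in for samples drawn from an MT model.

```python
def token_f1(hyp, ref):
    """Hypothetical stand-in utility: F1 over the sets of tokens.
    A real setup would use a learned quality metric here."""
    h, r = set(hyp.split()), set(ref.split())
    common = len(h & r)
    if common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)

def rerank(candidates, reward):
    """Reranking: return the candidate with the highest reward,
    where reward(candidate) needs no reference translation."""
    return max(candidates, key=reward)

def mbr_decode(candidates, utility):
    """MBR decoding: return the candidate with the highest total
    utility when every other candidate is treated as a pseudo-reference."""
    def total_utility(i):
        return sum(utility(candidates[i], candidates[j])
                   for j in range(len(candidates)) if j != i)
    return candidates[max(range(len(candidates)), key=total_utility)]

candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat is on the mat",
    "a dog ran away",
]
best = mbr_decode(candidates, token_f1)
```

Reranking scores each candidate on its own, while MBR favours the candidate that agrees most with the rest of the sample, which is why it needs a pairwise utility rather than a single-argument reward.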
Context Consistency between Training and Testing in Simultaneous Machine Translation
Simultaneous Machine Translation (SiMT) aims to yield a real-time partial
translation with a monotonically growing source-side context. However,
there is a counterintuitive phenomenon about the context usage between training
and testing: e.g., the wait-k testing model consistently trained with wait-k is
much worse than the model inconsistently trained with wait-k' (k' is not equal
to k) in terms of translation quality. To this end, we first investigate the
underlying reasons behind this phenomenon and uncover the following two
factors: 1) the limited correlation between translation quality and training
(cross-entropy) loss; 2) exposure bias between training and testing. Based on
both reasons, we then propose an effective training approach called context
consistency training, which makes the context usage consistent between
training and testing by optimizing translation quality and latency as
bi-objectives and exposing the model to its own predictions during training.
The experiments on three language pairs support our intuition: with the help
of our context consistency training approach, our system encouraging context
consistency outperforms existing systems with context inconsistency for the
first time.
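For readers unfamiliar with the wait-k policy mentioned above, the following sketch shows how much source context is visible at each target step. The abstract does not define the policy; the code uses the standard formulation in which the model first reads k source tokens and then alternates between writing one target token and reading one more source token. The function name and the 0-indexed target step are choices made for this example.

```python
def wait_k_context(src_len, tgt_len, k):
    """Number of source tokens visible when each target token is produced
    under wait-k: min(k + t, src_len) for 0-indexed target step t."""
    return [min(k + t, src_len) for t in range(tgt_len)]

# Training with wait-k' and testing with wait-k exposes the model to
# different amounts of source context at the same target step.
train_context = wait_k_context(src_len=5, tgt_len=6, k=4)  # [4, 5, 5, 5, 5, 5]
test_context = wait_k_context(src_len=5, tgt_len=6, k=2)   # [2, 3, 4, 5, 5, 5]
```

Comparing the two lists makes the train/test context mismatch discussed in the abstract concrete: at the first target step the model trained with k' = 4 has seen four source tokens, while at test time with k = 2 it sees only two.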
Findings of the 2015 Workshop on Statistical Machine Translation
This paper presents the results of the WMT15 shared tasks, which included a
standard news translation task, a metrics task, a tuning task, a task for
run-time estimation of machine translation quality, and an automatic
post-editing task. This year, 68 machine translation systems from 24
institutions were submitted to the ten translation directions in the standard
translation task. An additional 7 anonymized systems were included, and were
then evaluated both automatically and manually. The quality estimation task had
three subtasks, with a total of 10 teams, submitting 34 entries. The pilot
automatic post-editing task had a total of 4 teams, submitting 7 entries.