132 research outputs found
Results of the WMT15 Tuning Shared Task
This paper presents the results of the WMT15 Tuning Shared Task. We provided the
participants of this task with a complete machine translation system and asked them to tune its
internal parameters (feature weights). The tuned systems were used to translate the test set and
the outputs were manually ranked for translation quality. We received 4 submissions in the
English-Czech and 6 in the Czech-English translation direction. In addition, we ran
3 baseline setups, tuning the
parameters with standard optimizers for BLEU score
Results of the WMT16 Tuning Shared Task
This paper presents the results of the
WMT16 Tuning Shared Task. We provided
the participants of this task with a
complete machine translation system and
asked them to tune its internal parameters
(feature weights). The tuned systems were
used to translate the test set and the outputs
were manually ranked for translation
quality. We received 4 submissions in the
Czech-English and 8 in the English-Czech
translation direction. In addition, we ran
2 baseline setups, tuning the parameters
with standard optimizers for BLEU score.
In contrast to previous years, the tuned
systems in 2016 rely on large data
Findings of the 2015 Workshop on Statistical Machine Translation
This paper presents the results of the
WMT15 shared tasks, which included a
standard news translation task, a metrics
task, a tuning task, a task for run-time
estimation of machine translation quality,
and an automatic post-editing task. This
year, 68 machine translation systems from
24 institutions were submitted to the ten
translation directions in the standard translation
task. An additional 7 anonymized
systems were included, and were then
evaluated both automatically and manually.
The quality estimation task had three
subtasks, with a total of 10 teams, submitting
34 entries. The pilot automatic postediting
task had a total of 4 teams, submitting
7 entries
Aligning Neural Machine Translation Models: Human Feedback in Training and Inference
Reinforcement learning from human feedback (RLHF) is a recent technique to
improve the quality of the text generated by a language model, making it closer
to what humans would generate. A core ingredient in RLHF's success in aligning
and improving large language models (LLMs) is its reward model, trained using
human feedback on model outputs. In machine translation (MT), where metrics
trained from human annotations can readily be used as reward models, recent
methods using minimum Bayes risk decoding and reranking have succeeded in
improving the final quality of translation. In this study, we comprehensively
explore and compare techniques for integrating quality metrics as reward models
into the MT pipeline. This includes using the reward model for data filtering,
during the training phase through RL, and at inference time by employing
reranking techniques, and we assess the effects of combining these in a unified
approach. Our experimental results, conducted across multiple translation
tasks, underscore the crucial role of effective data filtering, based on
estimated quality, in harnessing the full potential of RL in enhancing MT
quality. Furthermore, our findings demonstrate the effectiveness of combining
RL training with reranking techniques, showcasing substantial improvements in
translation quality.Comment: 14 pages, work-in-progres
Integrating meaning into quality evaluation of machine translation
Machine translation (MT) quality is evaluated through comparisons between MT outputs and the human translations (HT). Traditionally, this evaluation relies on form related features (e.g. lexicon and syntax) and ignores the transfer of meaning reflected in HT outputs. Instead, we evaluate the quality of MT outputs through meaning related features (e.g. polarity, subjectivity) with two experiments. In the first experiment, the meaning related features are compared to human rankings individually. In the second experiment, combinations of meaning related features and other quality metrics are utilized to predict the same human rankings. The results of our experiments confirm the benefit of these features in predicting human evaluation of translation quality in addition to traditional metrics which focus mainly on form
Results of the WMT16 Metrics Shared Task
This paper presents the results of the
WMT16 Metrics Shared Task. We asked
participants of this task to score the outputs
of the MT systems involved in the
WMT16 Shared Translation Task. We
collected scores of 16 metrics from 9 research
groups. In addition to that, we computed
scores of 9 standard metrics (BLEU,
SentBLEU, NIST, WER, PER, TER and
CDER) as baselines. The collected scores
were evaluated in terms of system-level
correlation (how well each metric’s scores
correlate with WMT16 official manual
ranking of systems) and in terms of segment
level correlation (how often a metric
agrees with humans in comparing two
translations of a particular sentence).
This year there are several additions to
the setup: large number of language pairs
(18 in total), datasets from different domains
(news, IT and medical), and different
kinds of judgments: relative ranking
(RR), direct assessment (DA) and HUME
manual semantic judgments. Finally, generation
of large number of hybrid systems
was trialed for provision of more conclusive
system-level metric rankings
- …