Results of the WMT15 Metrics Shared Task
This paper presents the results of the WMT15 Metrics Shared Task. We asked
participants of this task to score the outputs of the MT systems involved in
the WMT15 Shared Translation Task. We collected scores of 46 metrics from 11
research groups. In addition to that, we computed scores of 7 standard metrics
(BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were
evaluated in terms of system-level correlation (how well each metric's scores
correlate with the WMT15 official manual ranking of systems) and in terms of
segment-level correlation (how often a metric agrees with humans in comparing two
translations of a particular sentence).
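The two evaluation modes described in this abstract can be illustrated with a minimal sketch. It is not the task's official scoring code: the function names are hypothetical, system-level agreement is shown as a plain Pearson correlation, and segment-level agreement as a Kendall's-tau-like ratio over human pairwise preferences. Skipping pairs on which the metric ties is an assumption made here for simplicity; other tie conventions exist.

```python
from math import sqrt


def pearson(xs, ys):
    """System level: linear correlation between metric scores and human scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_agreement(human_pairs, metric_scores):
    """Segment level: (concordant - discordant) / (concordant + discordant).

    human_pairs: list of (better_id, worse_id) judgments from annotators.
    metric_scores: dict mapping translation id to metric score (higher is better).
    Pairs the metric scores equally are skipped (an assumption of this sketch).
    """
    concordant = discordant = 0
    for better, worse in human_pairs:
        if metric_scores[better] > metric_scores[worse]:
            concordant += 1
        elif metric_scores[better] < metric_scores[worse]:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


# Toy data, invented for illustration only.
system_corr = pearson([0.21, 0.25, 0.30], [-0.4, 0.1, 0.5])
segment_corr = pairwise_agreement(
    [("a", "b"), ("b", "c"), ("a", "c")],
    {"a": 0.9, "b": 0.5, "c": 0.7},
)
```

In the toy call, the metric agrees with the human preference on two of the three pairs and disagrees on one, so `segment_corr` is (2 - 1) / 3.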
Results of the WMT16 Metrics Shared Task
This paper presents the results of the
WMT16 Metrics Shared Task. We asked
participants of this task to score the outputs
of the MT systems involved in the
WMT16 Shared Translation Task. We
collected scores of 16 metrics from 9 research
groups. In addition to that, we computed
scores of 9 standard metrics (BLEU,
SentBLEU, NIST, WER, PER, TER and
CDER) as baselines. The collected scores
were evaluated in terms of system-level
correlation (how well each metric’s scores
correlate with WMT16 official manual
ranking of systems) and in terms of segment-level
correlation (how often a metric
agrees with humans in comparing two
translations of a particular sentence).
This year there are several additions to
the setup: a large number of language pairs
(18 in total), datasets from different domains
(news, IT and medical), and different
kinds of judgments: relative ranking
(RR), direct assessment (DA) and HUME
manual semantic judgments. Finally, generation
of a large number of hybrid systems
was trialed to provide more conclusive
system-level metric rankings.
Multi-hypothesis machine translation evaluation
Reliably evaluating Machine Translation (MT) through automated metrics is a long-standing problem. One of the main challenges is the fact that multiple outputs can be equally valid. Attempts to minimise this issue include metrics that relax the matching of MT output and reference strings, and the use of multiple references. The latter has been shown to significantly improve the performance of evaluation metrics. However, collecting multiple references is expensive and in practice a single reference is generally used. In this paper, we propose an alternative approach: instead of modelling linguistic variation in human references, we exploit the MT model uncertainty to generate multiple diverse translations and use these: (i) as surrogates to reference translations; (ii) to obtain a quantification of translation variability to either complement existing metric scores or (iii) replace references altogether. We show that for a number of popular evaluation metrics our variability estimates lead to substantial improvements in correlation with human judgements of quality, by up to 15%.
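A rough sketch of the idea in this abstract, under stated assumptions: the extra hypotheses are taken as given (how they are drawn from the MT model is outside the sketch), and a simple unigram F1 stands in for whichever evaluation metric is actually used. The function names are hypothetical and this is not the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def unigram_f1(candidate, reference):
    """Stand-in similarity metric: F1 over whitespace-token overlap."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def surrogate_score(mt_output, hypotheses):
    """Use the extra hypotheses as surrogate references for the MT output."""
    return sum(unigram_f1(mt_output, h) for h in hypotheses) / len(hypotheses)


def hypothesis_agreement(hypotheses):
    """Mean pairwise similarity among hypotheses; low values mean high variability."""
    pairs = list(combinations(hypotheses, 2))
    return sum(unigram_f1(a, b) for a, b in pairs) / len(pairs)


# Invented example hypotheses, for illustration only.
hyps = ["the cat sat on the mat", "a cat sat on the mat", "the cat is on the mat"]
score = surrogate_score("the cat sat on the mat", hyps)
agreement = hypothesis_agreement(hyps)
```

Either quantity could then be combined with a reference-based metric score or used without a reference, which corresponds loosely to options (ii) and (iii) in the abstract.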
Taking MT evaluation metrics to extremes: beyond correlation with human judgments
Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to the improvement of evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than the traditional statistical MT systems. Finally, we show that the difference in the evaluation accuracy for different metrics is maintained even if the gold standard scores are based on different criteria.
UPF-cobalt submission to WMT15 metrics task
Paper presented at: 10th Workshop on Statistical Machine Translation, held in Lisbon, Portugal, 17-18 September 2015. An important limitation of automatic evaluation metrics is that, when comparing Machine Translation (MT) to a human reference, they are often unable to discriminate between acceptable variation and the differences that are indicative of MT errors. In this paper we present the UPF-Cobalt evaluation system, which addresses this issue by penalizing the differences in the syntactic contexts of aligned candidate and reference words. We evaluate our metric using the data from WMT workshops of recent years and show that it performs competitively at both the segment and system levels. This work was supported by IULA (UPF) and the FIDGR grant program of the Generalitat de Catalunya.