7 research outputs found

Results of the WMT15 Metrics Shared Task

Author: Bojar Ondřej
Kamran Amir
Koehn Philipp
Stanojević Miloš
Publication date: 01/01/2015

This paper presents the results of the WMT15 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT15 Shared Translation Task. We collected scores of 46 metrics from 11 research groups. In addition, we computed scores of 7 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric's scores correlate with the WMT15 official manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in comparing two translations of a particular sentence).
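The two evaluation levels named in this abstract can be illustrated with a minimal sketch. It assumes Pearson's r for the system-level correlation and a Kendall's-tau-like ratio of concordant to discordant pairs for the segment-level agreement; the abstract does not spell out the exact formulas, so both choices, the function names, and the toy data are illustrative assumptions rather than the task's official scoring code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_agreement(human_pairs, metric_scores):
    """Kendall's-tau-like score: (concordant - discordant) / all pairs.

    human_pairs: (better_id, worse_id) tuples from human comparisons.
    metric_scores: dict mapping a translation id to its metric score.
    Assumption: a metric tie counts as a disagreement.
    """
    concordant = sum(
        1 for better, worse in human_pairs
        if metric_scores[better] > metric_scores[worse]
    )
    discordant = len(human_pairs) - concordant
    return (concordant - discordant) / len(human_pairs)


# System level: one human score and one metric score per MT system (toy data).
human_system_scores = [0.10, 0.35, 0.60, 0.80]
metric_system_scores = [21.0, 24.5, 27.0, 31.5]
system_level = pearson(human_system_scores, metric_system_scores)

# Segment level: human preferences between translations of one sentence (toy data).
human_pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")]
metric_scores = {"a": 0.7, "b": 0.5, "c": 0.6, "d": 0.4}
segment_level = pairwise_agreement(human_pairs, metric_scores)

print(f"system-level r = {system_level:.3f}")
print(f"segment-level agreement = {segment_level:.2f}")
```

In this toy data the metric agrees with three of the four human preferences, so the segment-level score is (3 - 1) / 4 = 0.5.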

Crossref

Edinburgh Research Explorer

Publikationsserver der RWTH Aachen University

Biblio at Institute of Formal and Applied Linguistics

Results of the WMT16 Metrics Shared Task

Author: Bojar Ondřej
Graham Yvette
Kamran Amir
Stanojević Miloš
Publication date: 01/01/2016

This paper presents the results of the WMT16 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT16 Shared Translation Task. We collected scores of 16 metrics from 9 research groups. In addition, we computed scores of 9 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric’s scores correlate with the WMT16 official manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in comparing two translations of a particular sentence). This year there are several additions to the setup: a large number of language pairs (18 in total), datasets from different domains (news, IT and medical), and different kinds of judgments: relative ranking (RR), direct assessment (DA) and HUME manual semantic judgments. Finally, the generation of a large number of hybrid systems was trialed to provide more conclusive system-level metric rankings.

Crossref

Biblio at Institute of Formal and Applied Linguistics

Multi-hypothesis machine translation evaluation

Author: Fomicheva M.
Guzmán F.
Specia L.
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2020

Reliably evaluating Machine Translation (MT) through automated metrics is a long-standing problem. One of the main challenges is the fact that multiple outputs can be equally valid. Attempts to minimise this issue include metrics that relax the matching of MT output and reference strings, and the use of multiple references. The latter has been shown to significantly improve the performance of evaluation metrics. However, collecting multiple references is expensive and in practice a single reference is generally used. In this paper, we propose an alternative approach: instead of modelling linguistic variation in human references, we exploit the MT model uncertainty to generate multiple diverse translations and use these: (i) as surrogates for reference translations; (ii) to obtain a quantification of translation variability to complement existing metric scores; or (iii) to replace references altogether. We show that for a number of popular evaluation metrics our variability estimates lead to substantial improvements in correlation with human judgements of quality, by up to 15%.

Crossref

White Rose Research Online

Taking MT evaluation metrics to extremes : beyond correlation with human judgments

Author: Fomicheva Marina
Specia Lucia
Publication venue: MIT Press - Journals
Publication date: 12/06/2019

Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to the improvement of evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics, focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than traditional statistical MT systems. Finally, we show that the difference in the evaluation accuracy for different metrics is maintained even if the gold standard scores are based on different criteria.

Crossref

Spiral - Imperial College Digital Repository

White Rose Research Online

UPF-cobalt submission to WMT15 metrics task

Author: Bel Rafecas Núria
da Cunha Fanego Iria
Fomicheva Marina
Malinovskiy Anton
Publication venue: ACL (Association for Computational Linguistics)
Publication date: 01/01/2015

Paper presented at: 10th Workshop on Statistical Machine Translation, held in Lisbon, Portugal, 17-18 September 2015. An important limitation of automatic evaluation metrics is that, when comparing Machine Translation (MT) to a human reference, they are often unable to discriminate between acceptable variation and the differences that are indicative of MT errors. In this paper we present the UPF-Cobalt evaluation system, which addresses this issue by penalizing differences in the syntactic contexts of aligned candidate and reference words. We evaluate our metric using data from recent WMT workshops and show that it performs competitively at both the segment and system levels. This work was supported by IULA (UPF) and the FIDGR grant program of the Generalitat de Catalunya.

Crossref

Publikationsserver der RWTH Aachen University

UPF Digital Repository

UPF-cobalt submission to WMT15 metrics task

Author: Bel Rafecas Núria
Da Cunha Fanego Iria
Fomicheva Marina
Malinovskiy Anton
Publication venue: ACL (Association for Computational Linguistics)

RECERCAT