258 research outputs found
A Grain of Salt for the WMT Manual Evaluation
The Workshop on Statistical Machine Translation (WMT) has become one of ACL's
flagship workshops, held annually since 2006. In addition to soliciting papers
from the research community, WMT also features a shared translation task for
evaluating MT systems. This shared task is notable for having manual evaluation
as its cornerstone.
The Workshop's overview paper, playing a descriptive and administrative role, reports
the main results of the evaluation without delving deep into analyzing those results.
The aim of this paper is to investigate and explain some interesting idiosyncrasies
in the reported results, which only become apparent when performing a more thorough
analysis of the collected annotations. Our analysis sheds some light on how the
reported results should (and should not) be interpreted, and also gives rise to some
helpful recommendations for the organizers of WMT.
Evaluation of Automatic Video Captioning Using Direct Assessment
We present Direct Assessment, a method for manually assessing the quality of
automatically-generated captions for video. Evaluating the accuracy of video
captions is particularly difficult because for any given video clip there is no
definitive ground truth or correct answer against which to measure. Automatic
metrics such as BLEU and METEOR, drawn from techniques used in evaluating
machine translation, compare automatic video captions against a manual caption;
these were used in the TRECVid video captioning task in 2016 but are shown to
have weaknesses. The work presented here brings human assessment into the
evaluation by crowdsourcing how well a caption describes a video. We
automatically degrade the quality of some sample captions which are assessed
manually and from this we are able to rate the quality of the human assessors,
a factor we take into account in the evaluation. Using data from the TRECVid
video-to-text task in 2016, we show how our direct assessment method is
replicable and robust and should scale to settings where there are many
caption-generation techniques to be evaluated.
Comment: 26 pages, 8 figures
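The quality-control idea in the abstract above — degrading some sample captions and checking whether an assessor reliably scores originals higher than degraded variants — can be sketched as follows. This is a minimal illustration of the principle, not the authors' implementation; the function name, score scale (0-100 direct assessment) and threshold are assumptions.

```python
from statistics import mean

def assessor_passes(orig_scores, degraded_scores, min_gap=10.0):
    """Accept an assessor only if their mean score for original captions
    exceeds their mean score for deliberately degraded captions by at
    least min_gap (scores assumed on a 0-100 direct-assessment scale)."""
    return mean(orig_scores) - mean(degraded_scores) >= min_gap

# An attentive assessor separates originals from degraded captions;
# a random clicker does not.
attentive = assessor_passes([80, 75, 90], [30, 40, 35])
random_clicker = assessor_passes([55, 60, 50], [52, 58, 49])
```

Assessors who fail such a check can be discarded or down-weighted before their judgments enter the evaluation.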
Fine-grained human evaluation of neural versus phrase-based machine translation
We compare three approaches to statistical machine translation (pure
phrase-based, factored phrase-based and neural) by performing a fine-grained
manual evaluation via error annotation of the systems' outputs. The error types
in our annotation are compliant with the multidimensional quality metrics
(MQM), and the annotation is performed by two annotators. Inter-annotator
agreement is high for such a task, and results show that the best performing
system (neural) reduces the errors produced by the worst system (phrase-based)
by 54%.
Comment: 12 pages, 2 figures, The Prague Bulletin of Mathematical Linguistics
Results of the WMT16 Metrics Shared Task
This paper presents the results of the WMT16 Metrics Shared Task. We asked
participants of this task to score the outputs of the MT systems involved in
the WMT16 Shared Translation Task. We collected scores of 16 metrics from 9
research groups. In addition to that, we computed scores of 9 standard metrics
(BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected
scores were evaluated in terms of system-level correlation (how well each
metric’s scores correlate with the WMT16 official manual ranking of systems)
and in terms of segment-level correlation (how often a metric agrees with
humans in comparing two translations of a particular sentence).
This year there are several additions to the setup: a large number of language
pairs (18 in total), datasets from different domains (news, IT and medical),
and different kinds of judgments: relative ranking (RR), direct assessment
(DA) and HUME manual semantic judgments. Finally, the generation of a large
number of hybrid systems was trialed to provide more conclusive system-level
metric rankings.
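The two correlation views described in the abstract above can be sketched concretely: system-level evaluation computes Pearson correlation between a metric's per-system scores and the human scores, while segment-level evaluation counts how often the metric agrees with a human relative-ranking judgment on a pair of translations. The data and function names below are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def pairwise_agreement(metric_scores, human_pairs):
    """human_pairs: (better_id, worse_id) pairs from relative ranking.
    Returns the fraction where the metric also scores `better` higher
    (a Kendall-tau-like agreement statistic)."""
    agree = sum(metric_scores[b] > metric_scores[w] for b, w in human_pairs)
    return agree / len(human_pairs)

# Toy example: three systems, metric scores vs. human scores.
r = pearson([0.31, 0.28, 0.35], [0.6, 0.4, 0.7])
acc = pairwise_agreement({"sysA": 0.7, "sysB": 0.4}, [("sysA", "sysB")])
```

The actual shared task applies these at scale across all language pairs and submitted metrics; this sketch only shows the shape of each computation.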
Online Versus Offline NMT Quality: An In-depth Analysis on English-German and German-English
In this work we conduct an evaluation study comparing offline and online
neural machine translation architectures. Two sequence-to-sequence models are
considered: the convolutional Pervasive Attention model (Elbayad et al. 2018)
and the attention-based Transformer (Vaswani et al. 2017). We investigate, for
both architectures, the impact of online decoding constraints on translation
quality through a carefully designed human evaluation on the English-German
and German-English language pairs, the latter being particularly sensitive to
latency constraints. The evaluation results allow us to identify the strengths
and shortcomings of each model when we shift to the online setup.
Comment: Accepted at COLING 202
Results of the WMT17 metrics shared task
This paper presents the results of the WMT17 Metrics Shared Task. We asked
participants of this task to score the outputs of the MT systems involved in
the WMT17 news translation task and the Neural MT training task. We collected
scores of 14 metrics from 8 research groups. In addition to that, we computed
scores of 7 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER)
as baselines. The collected scores were evaluated in terms of system-level
correlation (how well each metric’s scores correlate with the WMT17 official
manual ranking of systems) and in terms of segment-level correlation (how
often a metric agrees with humans in judging the quality of a particular
sentence).
This year, we build upon two types of manual judgements: direct assessment
(DA) and HUME manual semantic judgements.
Quantitative Fine-grained Human Evaluation of Machine Translation Systems: a Case Study on English to Croatian
This paper presents a quantitative fine-grained manual evaluation approach to comparing the performance of different machine translation (MT) systems. We build upon the well-established Multidimensional Quality Metrics (MQM) error taxonomy and implement a novel method that assesses whether the differences in performance for MQM error types between different MT systems are statistically significant. We conduct a case study for English-to-Croatian, a language direction that involves translating into a morphologically rich language, for which we compare three MT systems belonging to different paradigms: pure phrase-based, factored phrase-based and neural. First, we design an MQM-compliant error taxonomy tailored to the relevant linguistic phenomena of Slavic languages, which made the annotation process feasible and accurate. Errors in MT outputs were then annotated by two annotators following this taxonomy. Subsequently, we carried out a statistical analysis which showed that the best-performing system (neural) reduces the errors produced by the worst system (pure phrase-based) by more than half (54%). Moreover, we conducted an additional analysis of agreement errors in which we distinguished between short (phrase-level) and long distance (sentence-level) errors. We discovered that phrase-based MT approaches are of limited use for long distance agreement phenomena, for which neural MT was found to be especially effective.
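One common way to test whether a difference in per-sentence error counts between two systems is statistically significant is approximate randomization: repeatedly shuffle which system each sentence's annotation is attributed to, and measure how often a difference at least as large as the observed one arises by chance. The sketch below illustrates that general technique; it is not the paper's exact procedure, and all data and names are invented.

```python
import random

def significance(errors_a, errors_b, trials=10000, seed=0):
    """Approximate-randomization p-value for the difference in total
    error counts between two systems annotated on the same sentences."""
    rng = random.Random(seed)
    observed = abs(sum(errors_a) - sum(errors_b))
    pooled = list(zip(errors_a, errors_b))
    hits = 0
    for _ in range(trials):
        # Randomly swap each sentence's pair of error counts and
        # recompute the difference under the null hypothesis.
        diff = sum(a - b if rng.random() < 0.5 else b - a
                   for a, b in pooled)
        if abs(diff) >= observed:
            hits += 1
    return hits / trials  # approximate p-value

# Invented per-sentence error counts for two hypothetical systems.
neural_errors = [1, 0, 2, 1, 0, 1]
phrase_based_errors = [3, 2, 4, 3, 2, 3]
p_value = significance(neural_errors, phrase_based_errors)
```

A small p-value indicates the error-count gap is unlikely to be an artifact of which sentences happened to be annotated.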