258 research outputs found
A Grain of Salt for the WMT Manual Evaluation
The Workshop on Statistical Machine Translation (WMT) has become one of ACL's
flagship workshops, held annually since 2006. In addition to soliciting papers
from the research community, WMT also features a shared translation task for
evaluating MT systems. This shared task is notable for having manual evaluation
as its cornerstone.
The Workshop's overview paper, playing a descriptive and administrative role, reports
the main results of the evaluation without delving deep into analyzing those results.
The aim of this paper is to investigate and explain some interesting idiosyncrasies
in the reported results, which only become apparent when performing a more thorough
analysis of the collected annotations. Our analysis sheds some light on how the
reported results should (and should not) be interpreted, and also gives rise to some
helpful recommendations for the organizers of WMT.
Evaluation of Automatic Video Captioning Using Direct Assessment
We present Direct Assessment, a method for manually assessing the quality of
automatically-generated captions for video. Evaluating the accuracy of video
captions is particularly difficult because for any given video clip there is no
definitive ground truth or correct answer against which to measure. Automatic
metrics such as BLEU and METEOR, drawn from techniques used in evaluating
machine translation, compare automatic video captions against a manual caption;
these were used in the TRECVid video captioning task in 2016 but are shown to
have weaknesses. The work presented here brings human assessment into the
evaluation by crowdsourcing how well a caption describes a video. We
automatically degrade the quality of some sample captions which are assessed
manually and from this we are able to rate the quality of the human assessors,
a factor we take into account in the evaluation. Using data from the TRECVid
video-to-text task in 2016, we show how our direct assessment method is
replicable and robust and should scale to settings where there are many
caption-generation techniques to be evaluated.
Comment: 26 pages, 8 figures
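The quality-control idea in the abstract above — degrading some sample captions and checking whether an assessor reliably scores originals higher than degraded variants — can be sketched as follows. This is a minimal illustration of the principle, not the authors' implementation; the function name, score scale (0-100 direct assessment) and threshold are assumptions.

```python
from statistics import mean

def assessor_passes(orig_scores, degraded_scores, min_gap=10.0):
    """Accept an assessor only if their mean score for original captions
    exceeds their mean score for deliberately degraded captions by at
    least min_gap (scores assumed on a 0-100 direct-assessment scale)."""
    return mean(orig_scores) - mean(degraded_scores) >= min_gap

# An attentive assessor separates originals from degraded captions;
# a random clicker does not.
attentive = assessor_passes([80, 75, 90], [30, 40, 35])
random_clicker = assessor_passes([55, 60, 50], [52, 58, 49])
```

Assessors who fail such a check can be discarded or down-weighted before their judgments enter the evaluation.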
Fine-grained human evaluation of neural versus phrase-based machine translation
We compare three approaches to statistical machine translation (pure
phrase-based, factored phrase-based and neural) by performing a fine-grained
manual evaluation via error annotation of the systems' outputs. The error types
in our annotation are compliant with the multidimensional quality metrics
(MQM), and the annotation is performed by two annotators. Inter-annotator
agreement is high for such a task, and results show that the best performing
system (neural) reduces the errors produced by the worst system (phrase-based)
by 54%.
Comment: 12 pages, 2 figures, The Prague Bulletin of Mathematical Linguistics
Results of the WMT16 Metrics Shared Task
This paper presents the results of the WMT16 Metrics Shared Task. We asked
participants of this task to score the outputs of the MT systems involved in
the WMT16 Shared Translation Task. We collected scores of 16 metrics from 9
research groups. In addition to that, we computed scores of 9 standard metrics
(BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected
scores were evaluated in terms of system-level correlation (how well each
metric’s scores correlate with the WMT16 official manual ranking of systems)
and in terms of segment-level correlation (how often a metric agrees with
humans in comparing two translations of a particular sentence).
This year there are several additions to the setup: a large number of language
pairs (18 in total), datasets from different domains (news, IT and medical),
and different kinds of judgments: relative ranking (RR), direct assessment
(DA) and HUME manual semantic judgments. Finally, the generation of a large
number of hybrid systems was trialed to provide more conclusive system-level
metric rankings.
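The two correlation views described in the abstract above can be sketched concretely: system-level evaluation computes Pearson correlation between a metric's per-system scores and the human scores, while segment-level evaluation counts how often the metric agrees with a human relative-ranking judgment on a pair of translations. The data and function names below are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def pairwise_agreement(metric_scores, human_pairs):
    """human_pairs: (better_id, worse_id) pairs from relative ranking.
    Returns the fraction where the metric also scores `better` higher
    (a Kendall-tau-like agreement statistic)."""
    agree = sum(metric_scores[b] > metric_scores[w] for b, w in human_pairs)
    return agree / len(human_pairs)

# Toy example: three systems, metric scores vs. human scores.
r = pearson([0.31, 0.28, 0.35], [0.6, 0.4, 0.7])
acc = pairwise_agreement({"sysA": 0.7, "sysB": 0.4}, [("sysA", "sysB")])
```

The actual shared task applies these at scale across all language pairs and submitted metrics; this sketch only shows the shape of each computation.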
Online Versus Offline NMT Quality: An In-depth Analysis on English-German and German-English
In this work we conduct an evaluation study comparing offline and online
neural machine translation architectures. Two sequence-to-sequence models are
considered: the convolutional Pervasive Attention model (Elbayad et al. 2018)
and the attention-based Transformer (Vaswani et al. 2017). We investigate, for
both architectures, the impact of online decoding constraints on translation
quality through a carefully designed human evaluation on the English-German
and German-English language pairs, the latter being particularly sensitive to
latency constraints. The evaluation results allow us to identify the strengths
and shortcomings of each model when we shift to the online setup.
Comment: Accepted at COLING 202
Results of the WMT17 metrics shared task
This paper presents the results of the WMT17 Metrics Shared Task. We asked
participants of this task to score the outputs of the MT systems involved in
the WMT17 news translation task and the Neural MT training task. We collected
scores of 14 metrics from 8 research groups. In addition to that, we computed
scores of 7 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER)
as baselines. The collected scores were evaluated in terms of system-level
correlation (how well each metric’s scores correlate with the WMT17 official
manual ranking of systems) and in terms of segment-level correlation (how
often a metric agrees with humans in judging the quality of a particular
sentence).
This year, we build upon two types of manual judgements: direct assessment
(DA) and HUME manual semantic judgements.
Quantitative Fine-grained Human Evaluation of Machine Translation Systems: a Case Study on English to Croatian
This paper presents a quantitative fine-grained manual evaluation approach to comparing the performance of different machine translation (MT) systems. We build upon the well-established Multidimensional Quality Metrics (MQM) error taxonomy and implement a novel method that assesses whether the differences in performance for MQM error types between different MT systems are statistically significant. We conduct a case study for English-to-Croatian, a language direction that involves translating into a morphologically rich language, for which we compare three MT systems belonging to different paradigms: pure phrase-based, factored phrase-based and neural. First, we design an MQM-compliant error taxonomy tailored to the relevant linguistic phenomena of Slavic languages, which made the annotation process feasible and accurate. Errors in MT outputs were then annotated by two annotators following this taxonomy. Subsequently, we carried out a statistical analysis which showed that the best-performing system (neural) reduces the errors produced by the worst system (pure phrase-based) by more than half (54%). Moreover, we conducted an additional analysis of agreement errors in which we distinguished between short (phrase-level) and long distance (sentence-level) errors. We discovered that phrase-based MT approaches are of limited use for long distance agreement phenomena, for which neural MT was found to be especially effective.
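One common way to test whether a difference in per-sentence error counts between two systems is statistically significant is approximate randomization: repeatedly shuffle which system each sentence's annotation is attributed to, and measure how often a difference at least as large as the observed one arises by chance. The sketch below illustrates that general technique; it is not the paper's exact procedure, and all data and names are invented.

```python
import random

def significance(errors_a, errors_b, trials=10000, seed=0):
    """Approximate-randomization p-value for the difference in total
    error counts between two systems annotated on the same sentences."""
    rng = random.Random(seed)
    observed = abs(sum(errors_a) - sum(errors_b))
    pooled = list(zip(errors_a, errors_b))
    hits = 0
    for _ in range(trials):
        # Randomly swap each sentence's pair of error counts and
        # recompute the difference under the null hypothesis.
        diff = sum(a - b if rng.random() < 0.5 else b - a
                   for a, b in pooled)
        if abs(diff) >= observed:
            hits += 1
    return hits / trials  # approximate p-value

# Invented per-sentence error counts for two hypothetical systems.
neural_errors = [1, 0, 2, 1, 0, 1]
phrase_based_errors = [3, 2, 4, 3, 2, 3]
p_value = significance(neural_errors, phrase_based_errors)
```

A small p-value indicates the error-count gap is unlikely to be an artifact of which sentences happened to be annotated.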