Findings of the WMT 2018 shared task on quality estimation
© 2018 The Authors. Published by Association for Computational Linguistics. This is an open access article available under a Creative Commons licence.
The published version can be accessed at the following link on the publisher’s website: http://dx.doi.org/10.18653/v1/W18-6451
We report the results of the WMT18 shared task on Quality Estimation, i.e. the task of predicting the quality of the output of machine translation systems at various granularity levels: word, phrase, sentence and document. This year we include four language pairs, three text domains, and translations produced by both statistical and neural machine translation systems. Participating teams from ten institutions submitted a variety of systems to different task variants and language pairs.
The data and annotations collected for Tasks 1, 2 and 3 were supported by the EC H2020 QT21 project (grant agreement no. 645452). The shared task organisation was also supported by the QT21 project, national funds through Fundacao para a Ciencia e Tecnologia (FCT), with references UID/CEC/50021/2013 and UID/EEA/50008/2013, and by the European Research Council (ERC StG DeepSPIN 758969). We would also like to thank Julie Beliao and the Unbabel Quality Team for coordinating the annotation of the dataset used in Task 4.
Results of the WMT19 metrics shared task: segment-level and strong MT systems pose big challenges
This paper presents the results of the WMT19 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT19 News Translation Task with automatic metrics. 13 research groups submitted 24 metrics, 10 of which are reference-less "metrics" and constitute submissions to the joint task with WMT19 Quality Estimation Task, "QE as a Metric". In addition, we computed 11 baseline metrics, with 8 commonly applied baselines (BLEU, SentBLEU, NIST, WER, PER, TER, CDER, and chrF) and 3 reimplementations (chrF+, sacreBLEU-BLEU, and sacreBLEU-chrF). Metrics were evaluated at the system level (how well a given metric correlates with the WMT19 official manual ranking) and at the segment level (how well the metric correlates with human judgements of segment quality). This year, we use direct assessment (DA) as our only form of manual evaluation.
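As a rough illustration of the system-level evaluation described above, the sketch below correlates one metric's per-system scores with human scores. The use of the Pearson coefficient is an assumption here (the abstract only says "correlates"), and the function name and all scores are hypothetical, not data from the task.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one metric score and one human score per MT system.
metric_scores = [0.31, 0.28, 0.35, 0.22]
human_scores = [0.10, 0.02, 0.21, -0.15]

r = pearson(metric_scores, human_scores)
print(f"system-level correlation: {r:.3f}")
```

A value of r close to 1 would mean the metric orders and spaces the systems much as the human scores do; segment-level evaluation would need a different, per-segment comparison that is not sketched here.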
Findings of the E2E NLG Challenge
This paper summarises the experimental setup and results of the first shared
task on end-to-end (E2E) natural language generation (NLG) in spoken dialogue
systems. Recent end-to-end generation systems are promising since they reduce
the need for data annotation. However, they are currently limited to small,
delexicalised datasets. The E2E NLG shared task aims to assess whether these
novel approaches can generate better-quality output by learning from a dataset
containing higher lexical richness, syntactic complexity and diverse discourse
phenomena. We compare 62 systems submitted by 17 institutions, covering a wide
range of approaches, including machine learning architectures -- with the
majority implementing sequence-to-sequence models (seq2seq) -- as well as
systems based on grammatical rules and templates.
Comment: Accepted to INLG 201
"Bilingual Expert" Can Find Translation Errors
Recent advances in statistical machine translation via the adoption of neural
sequence-to-sequence models empower end-to-end systems to achieve
state-of-the-art results on many WMT benchmarks. The performance of such machine
translation (MT) systems is usually evaluated by the automatic metric BLEU when
golden references are provided for validation. However, for model inference or
production deployment, golden references are rarely available or
require expensive human annotation with bilingual expertise. In order to
address the issue of quality evaluation (QE) without reference, we propose a
general framework for automatic evaluation of translation output for most WMT
quality evaluation tasks. We first build a conditional target language model
with a novel bidirectional transformer, named neural bilingual expert model,
which is pre-trained on large parallel corpora for feature extraction. For QE
inference, the bilingual expert model can simultaneously produce the joint
latent representation between the source and the translation, and real-valued
measurements of possible erroneous tokens based on the prior knowledge learned
from parallel data. Subsequently, the features will further be fed into a
simple Bi-LSTM predictive model for quality evaluation. The experimental
results show that our approach achieves state-of-the-art performance in the
quality estimation track of WMT 2017/2018.
Comment: Accepted to AAAI 201
Findings of the 2019 Conference on Machine Translation (WMT19)
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019.
Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.
RankME: Reliable Human Ratings for Natural Language Generation
Human evaluation for natural language generation (NLG) often suffers from
inconsistent user ratings. While previous research tends to attribute this
problem to individual user preferences, we show that the quality of human
judgements can also be improved by experimental design. We present a novel
rank-based magnitude estimation method (RankME), which combines the use of
continuous scales and relative assessments. We show that RankME significantly
improves the reliability and consistency of human ratings compared to
traditional evaluation methods. In addition, we show that it is possible to
evaluate NLG systems according to multiple, distinct criteria, which is
important for error analysis. Finally, we demonstrate that RankME, in
combination with Bayesian estimation of system quality, is a cost-effective
alternative for ranking multiple NLG systems.
Comment: Accepted to NAACL 2018 (The 2018 Conference of the North American
Chapter of the Association for Computational Linguistics