Findings of the 2019 Conference on Machine Translation (WMT19)
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019.
Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.
On context span needed for machine translation evaluation
Despite increasing efforts to improve evaluation of machine translation (MT) by going beyond the sentence level to the document level, the definition of what exactly constitutes a "document level" is still not clear. This work deals with the context span necessary for a more reliable MT evaluation. We report results from a series of surveys involving three domains and 18 target languages designed to identify the necessary context span as well as issues related to it. Our findings indicate that, despite the fact that some issues and spans are strongly dependent on domain and on the target language, a number of common patterns can be observed so that general guidelines for context-aware MT evaluation can be drawn.
"Bilingual Expert" Can Find Translation Errors
Recent advances in statistical machine translation via the adoption of neural sequence-to-sequence models empower the end-to-end system to achieve state-of-the-art results in many WMT benchmarks. The performance of such a machine translation (MT) system is usually evaluated by the automatic metric BLEU when golden references are provided for validation. However, for model inference or production deployment, golden references are often unavailable or require expensive human annotation with bilingual expertise. To address the issue of quality estimation (QE) without references, we propose a general framework for automatic evaluation of translation output for most WMT quality estimation tasks. We first build a conditional target language model with a novel bidirectional transformer, named the neural bilingual expert model, which is pre-trained on large parallel corpora for feature extraction. For QE inference, the bilingual expert model can simultaneously produce the joint latent representation between the source and the translation, and real-valued measurements of possible erroneous tokens based on the prior knowledge learned from parallel data. Subsequently, these features are fed into a simple Bi-LSTM predictive model for quality estimation. Experimental results show that our approach achieves state-of-the-art performance in the quality estimation track of WMT 2017/2018.
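
To make the two-stage pipeline concrete, here is a minimal sketch in PyTorch of the second stage only, assuming per-token features have already been produced by the pre-trained expert model; the class name, dimensions, pooling choice, and target score (e.g., HTER) are illustrative assumptions, not the authors' implementation.

# Sketch of the second stage: per-token features from a pre-trained
# "bilingual expert" model are regressed to a sentence-level quality
# score by a Bi-LSTM. All dimensions here are illustrative.
import torch
import torch.nn as nn

class BiLSTMQualityEstimator(nn.Module):
    def __init__(self, feature_dim=512, hidden_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(feature_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.regressor = nn.Linear(2 * hidden_dim, 1)

    def forward(self, features):
        # features: (batch, target_len, feature_dim) -- joint source/target
        # representations plus token-level mismatch signals.
        outputs, _ = self.bilstm(features)
        pooled = outputs.mean(dim=1)  # mean-pool over target tokens
        return torch.sigmoid(self.regressor(pooled)).squeeze(-1)

# Hypothetical batch: 8 sentences, 30 target tokens, 512-dim features.
scores = BiLSTMQualityEstimator()(torch.randn(8, 30, 512))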
A Set of Recommendations for Assessing Human–Machine Parity in Language Translation
The quality of machine translation has increased remarkably over the past years, to the degree that it was found to be indistinguishable from professional human translation in a number of empirical investigations. We reassess Hassan et al.'s 2018 investigation into Chinese-to-English news translation, showing that the finding of human–machine parity was owed to weaknesses in the evaluation design, which is currently considered best practice in the field. We show that the professional human translations contained significantly fewer errors, and that perceived quality in human evaluation depends on the choice of raters, the availability of linguistic context, and the creation of reference translations. Our results call for revisiting current best practices to assess strong machine translation systems in general and human–machine parity in particular, for which we offer a set of recommendations based on our empirical findings.
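
As a concrete illustration of the kind of significance testing such a reassessment involves, the following minimal sketch runs a paired bootstrap over per-segment error counts; the function and the data are invented for illustration and do not reproduce the paper's actual protocol.

import random

def paired_bootstrap(errors_a, errors_b, n_samples=10000, seed=0):
    # Resample segments with replacement and count how often system A
    # ends up with strictly fewer total errors than system B.
    rng = random.Random(seed)
    n = len(errors_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(errors_a[i] for i in idx) < sum(errors_b[i] for i in idx):
            wins += 1
    return wins / n_samples

# Invented per-segment error counts for a human and a machine translation.
human = [0, 1, 0, 0, 2, 0, 1, 0, 0, 1]
machine = [1, 2, 0, 1, 2, 1, 1, 0, 2, 1]
print(paired_bootstrap(human, machine))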
Attaining the unattainable? Reassessing claims of human parity in neural machine translation
We reassess a recent study (Hassan et al., 2018) that claimed that machine translation (MT) has reached human parity for the translation of news from Chinese into English, using pairwise ranking and considering three variables that were not taken into account in that previous study: the language in which the source side of the test set was originally written, the translation proficiency of the evaluators, and the provision of inter-sentential context. If we consider only original source text (i.e. not translated from another language, or translationese), then we find evidence showing that human parity has not been achieved. We compare the judgments of professional translators against those of non-experts and discover that those of the experts result in higher inter-annotator agreement and better discrimination between human and machine translations. In addition, we analyse the human translations of the test set and identify important translation issues. Finally, based on these findings, we provide a set of recommendations for future human evaluations of MT.
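
To make the inter-annotator agreement comparison concrete, here is a minimal sketch that computes Cohen's kappa over pairwise preference judgments; the rater labels are invented, and the paper's exact agreement statistic and annotation scheme may differ.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected for
    # the agreement expected by chance from each rater's label marginals.
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Invented pairwise preferences ('H' = human better, 'M' = machine better)
# from an expert and a non-expert rater over ten sentence pairs.
expert = ['H', 'H', 'M', 'H', 'H', 'M', 'H', 'H', 'H', 'M']
non_expert = ['H', 'M', 'M', 'H', 'M', 'M', 'H', 'M', 'H', 'M']
print(round(cohens_kappa(expert, non_expert), 3))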