Findings of the 2019 Conference on Machine Translation (WMT19)
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019.
Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.
Why don't people use character-level machine translation?
We present a literature and empirical survey that critically assesses the
state of the art in character-level modeling for machine translation (MT).
Despite evidence in the literature that character-level systems are comparable
with subword systems, they are virtually never used in competitive setups in
WMT competitions. We empirically show that even with recent modeling
innovations in character-level natural language processing, character-level MT
systems still struggle to match their subword-based counterparts.
Character-level MT systems show neither better domain robustness, nor better
morphological generalization, despite being often so motivated. However, we are
able to show robustness towards source side noise and that translation quality
does not degrade with increasing beam size at decoding time. Comment: 16 pages, 4 figures; Findings of ACL 2022, camera-ready
What's the Difference Between Professional Human and Machine Translation? A Blind Multi-language Study on Domain-specific MT
Machine translation (MT) has been shown to produce a number of errors that
require human post-editing, but the extent to which professional human
translation (HT) contains such errors has not yet been compared to MT. We
compile pre-translated documents in which MT and HT are interleaved, and ask
professional translators to flag errors and post-edit these documents in a
blind evaluation. We find that the post-editing effort for MT segments is higher
in only two out of three language pairs, and that the number of segments with
wrong terminology, omissions, and typographical problems is similar in HT. Comment: EAMT 2020 (Research Track)
Survey of Low-Resource Machine Translation
We present a survey covering the state of the art in low-resource machine translation (MT) research. There are currently around 7,000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available. We present a summary of this topical research field and provide a description of the techniques evaluated by researchers in several recent shared tasks in low-resource MT.
BlonDe: An Automatic Evaluation Metric for Document-level Machine Translation
Standard automatic metrics, e.g. BLEU, are not reliable for document-level MT evaluation. They can neither distinguish document-level improvements in translation quality from sentence-level ones, nor identify the discourse phenomena that cause context-agnostic translations. This paper introduces a novel automatic metric, BlonDe, to widen the scope of automatic MT evaluation from the sentence to the document level. BlonDe takes discourse coherence into consideration by categorizing discourse-related spans and calculating the similarity-based F1 measure of categorized spans. We conduct extensive comparisons on a newly constructed dataset, BWB. The experimental results show that BlonDe possesses better selectivity and interpretability at the document level, and is more sensitive to document-level nuances. In a large-scale human study, BlonDe also achieves significantly higher Pearson’s r correlation with human judgments than previous metrics.