
    Towards Explainable Evaluation Metrics for Machine Translation

    Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for machine translation (for example, COMET or BERTScore) are based on black-box large language models. They often achieve strong correlations with human judgments, but recent research indicates that the lower-quality classical metrics remain dominant, one of the potential reasons being that their decision processes are more transparent. To foster more widespread acceptance of novel high-quality metrics, explainability thus becomes crucial. In this concept paper, we identify key properties as well as key goals of explainable machine translation metrics and provide a comprehensive synthesis of recent techniques, relating them to our established goals and properties. In this context, we also discuss the latest state-of-the-art approaches to explainable metrics based on generative models such as ChatGPT and GPT-4. Finally, we contribute a vision of next-generation approaches, including natural language explanations. We hope that our work can help catalyze and guide future research on explainable evaluation metrics and, in turn, also contribute to better and more transparent machine translation systems.
    Comment: Preprint. We published an earlier version of this paper (arXiv:2203.11131) under a different title. Both versions consider the conceptualization of explainable metrics and are overall similar. However, the new version puts a stronger emphasis on the survey of approaches for the explanation of MT metrics, including the latest LLM-based approaches.
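    To make the contrast drawn above concrete, here is a minimal sketch (not from the paper) that scores the same hypothesis with both metric families; it assumes the sacrebleu and bert-score packages and uses toy sentences.

```python
# Minimal sketch (assumption: sacrebleu and bert-score are installed).
# BLEU is a transparent n-gram overlap measure; BERTScore relies on a
# pretrained language model whose inner workings are opaque to the user.
import sacrebleu
from bert_score import score as bert_score

hypothesis = ["The cat sat on the mat."]
reference = ["A cat was sitting on the mat."]

bleu = sacrebleu.corpus_bleu(hypothesis, [reference])
print(f"BLEU: {bleu.score:.1f}")  # driven by visible n-gram matches

P, R, F1 = bert_score(hypothesis, reference, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")  # driven by contextual embeddings
```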

    Towards Explainable Evaluation Metrics for Machine Translation

    Acknowledgments and Disclosure of Funding: Since November 2022, Christoph Leiter has been financed by the BMBF project “Metrics4NLG”. Piyawat Lertvittayakumjorn was financially supported by the Anandamahidol Foundation, Thailand, from 2015 to 2021; he contributed to this work mainly until September 2022, while affiliated with Imperial College London, before joining Google as a research scientist. Marina Fomicheva mainly contributed to this work until April 2022. Wei Zhao was supported by the Klaus Tschira Foundation and a Young Marsilius Fellowship, Heidelberg, until December 2023. Yang Gao mainly contributed to this work before he joined Google Research in December 2021. Steffen Eger is financed by DFG Heisenberg grant EG 375/5-1 and by the BMBF project “Metrics4NLG”. Peer reviewed.

    Análisis del tratamiento de la terminología en la traducción automática: implicaciones para la evaluación [Analysis of terminology treatment in machine translation: implications for evaluation]

    Get PDF
    This paper presents a methodology for the comparative analysis of human and machine translation at the lexical-terminological level. The proposal is applied to an English-Spanish parallel corpus of specialized texts in the medical domain. The main aim of the study is to explore systematic linguistic differences between machine and human translation in light of the problem of automatic system evaluation. The specific objectives are: a) to detect differences in the distribution of terminological units between human and machine translation; b) to identify the conditions under which such differences occur, considering the original texts and the strategies used in human and machine translation. The methodology involves, on the one hand, the use of stylometry techniques to characterize the language of machine translation versus that of human translation and, on the other hand, the classification of translation shifts performed by human translators and of modifications made by machine translation systems to the original text. The results indicate that the differences between machine translation and human translation related to optional translation shifts performed by translators and the differences related to the lack of obligatory changes in machine translation are not equally important for assessing the quality of the latter.
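    As an illustration of the kind of lexical-terminological comparison described above, here is a minimal sketch (my own, not the paper's code) that contrasts the density of domain terms in a human translation and a machine translation against a small hypothetical term list.

```python
# Minimal sketch of a lexical-terminological comparison (assumptions: the term
# list, the example sentences and the density measure are all illustrative).
import re

MEDICAL_TERMS = {"myocardial infarction", "hypertension", "angioplasty"}  # hypothetical term list

def term_density(text: str, terms: set[str]) -> float:
    """Share of running words covered by terms from the list."""
    text_lower = text.lower()
    tokens = re.findall(r"\w+", text_lower)
    covered = sum(len(t.split()) for t in terms if t in text_lower)
    return covered / max(len(tokens), 1)

human_tt = "The patient developed hypertension after the myocardial infarction."
machine_tt = "The patient developed high blood pressure after the heart attack."

print("HT term density:", round(term_density(human_tt, MEDICAL_TERMS), 3))
print("MT term density:", round(term_density(machine_tt, MEDICAL_TERMS), 3))
```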

    Pushing the right buttons: adversarial evaluation of quality estimation

    © (2021) The Authors. Published by Association for Computational Linguistics. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: https://aclanthology.org/2021.wmt-1.67
    Current Machine Translation (MT) systems achieve very good results on a growing variety of language pairs and datasets. However, they are known to produce fluent translation outputs that can contain important meaning errors, thus undermining their reliability in practice. Quality Estimation (QE) is the task of automatically assessing the performance of MT systems at test time. Thus, in order to be useful, QE systems should be able to detect such errors. However, this ability is yet to be tested in the current evaluation practices, where QE systems are assessed only in terms of their correlation with human judgements. In this work, we bridge this gap by proposing a general methodology for adversarial testing of QE for MT. First, we show that despite a high correlation with human judgements achieved by the recent SOTA, certain types of meaning errors are still problematic for QE to detect. Second, we show that on average, the ability of a given model to discriminate between meaning-preserving and meaning-altering perturbations is predictive of its overall performance, thus potentially allowing for comparing QE systems without relying on manual quality annotation.
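    The core of the adversarial test can be sketched as follows (an illustrative reimplementation, not the authors' code): apply a meaning-altering perturbation to a translation, score both versions with a QE model, and check whether the score drops; `qe_score` here is a hypothetical stand-in for any sentence-level QE model.

```python
# Illustrative sketch of adversarial testing for QE (assumption: `qe_score` is a
# placeholder for any sentence-level QE model mapping (source, translation) to a
# quality score; the perturbation below is a simple negation insertion).
from typing import Callable

def negate(translation: str) -> str:
    """Meaning-altering perturbation: flip 'is' to 'is not' (toy example)."""
    return translation.replace(" is ", " is not ", 1)

def adversarial_check(source: str, translation: str,
                      qe_score: Callable[[str, str], float],
                      margin: float = 0.05) -> bool:
    """Return True if the QE model penalises the meaning-altering perturbation."""
    original = qe_score(source, translation)
    perturbed = qe_score(source, negate(translation))
    return perturbed < original - margin

# Example with a dummy scorer that ignores meaning (it fails the check):
dummy_qe = lambda src, hyp: 0.9
print(adversarial_check("Das Haus ist rot.", "The house is red.", dummy_qe))
```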

    deepQuest-py: large and distilled models for quality estimation

    © (2021) The Authors. Published by Association for Computational Linguistics. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: https://aclanthology.org/2021.emnlp-demo.42/
    We introduce deepQuest-py, a framework for training and evaluation of large and lightweight models for Quality Estimation (QE). deepQuest-py provides access to (1) state-of-the-art models based on pre-trained Transformers for sentence-level and word-level QE; (2) lightweight and efficient sentence-level models implemented via knowledge distillation; and (3) a web interface for testing models and visualising their predictions. deepQuest-py is available at https://github.com/sheffieldnlp/deepQuest-py under a CC BY-NC-SA licence.
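    The knowledge-distillation idea behind the lightweight models can be illustrated with a minimal sketch (generic, not deepQuest-py's actual training code): a small regressor is trained to reproduce the sentence-level scores produced by a large teacher QE model.

```python
# Generic sketch of sentence-level QE distillation (assumptions: the tiny student
# architecture, feature extractor and training data are all illustrative; the
# teacher scores would come from a large pretrained QE model).
import torch
import torch.nn as nn

def featurize(src: str, hyp: str) -> torch.Tensor:
    """Toy sentence-pair features: source length, hypothesis length, length ratio."""
    ls, lh = len(src.split()), len(hyp.split())
    return torch.tensor([ls, lh, lh / max(ls, 1)], dtype=torch.float32)

student = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# (source, hypothesis, teacher_score) triples; the scores here are made up.
data = [("Das Haus ist rot.", "The house is red.", 0.92),
        ("Er kam spät an.", "He arrived lately.", 0.55)]

for epoch in range(200):  # fit the student to the teacher's scores
    for src, hyp, teacher_score in data:
        pred = student(featurize(src, hyp)).squeeze()
        loss = loss_fn(pred, torch.tensor(teacher_score))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

print(student(featurize("Das Haus ist rot.", "The house is red.")).item())
```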

    Findings of the WMT 2021 shared task on quality estimation

    © (2021) The Authors. Published by ACL. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: http://www.statmt.org/wmt21/pdf/2021.wmt-1.71.pdf
    We report the results of the WMT 2021 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels. This edition focused on two main novel additions: (i) prediction for unseen languages, i.e. zero-shot settings, and (ii) prediction of sentences with catastrophic errors. In addition, new data was released for a number of languages, especially post-edited data. Participating teams from 19 institutions submitted altogether 1263 systems to different task variants and language pairs.

    USFD at SemEval-2016 task 1: putting different state-of-the-arts into a box

    © 2016 The Authors. Published by Association for Computational Linguistics. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: http://dx.doi.org/10.18653/v1/S16-1092
    Aker, A., Blain, F., Duque, A., Fomicheva, M. et al. (2016) USFD at SemEval-2016 task 1: putting different state-of-the-arts into a box. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), Bethard, S., Carpuat, M., Cer, D., Jurgens, D. et al. (eds.). Stroudsburg, PA: Association for Computational Linguistics, pp. 609-613.
    The research leading to these results has received funding from the EU Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 610916 (SENSEI).

    Findings of the WMT 2020 shared task on quality estimation

    © 2020 The Authors. Published by Association for Computational Linguistics. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: https://www.aclweb.org/anthology/2020.wmt-1.79
    We report the results of the WMT20 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word, sentence and document levels. This edition included new data with open domain texts, direct assessment annotations, and multiple language pairs: English-German, English-Chinese, Russian-English, Romanian-English, Estonian-English, Sinhala-English and Nepali-English data for the sentence-level subtasks, English-German and English-Chinese for the word-level subtask, and English-French data for the document-level subtask. In addition, we made neural machine translation models available to participants. 19 participating teams from 27 institutions submitted altogether 1374 systems to different task variants and language pairs.

    BERGAMOT-LATTE submissions for the WMT20 quality estimation shared task

    © 2020 The Authors. Published by Association for Computational Linguistics. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: https://www.aclweb.org/anthology/2020.wmt-1.116/
    This paper presents our submission to the WMT2020 Shared Task on Quality Estimation (QE). We participate in Task 1 and Task 2, focusing on sentence-level prediction. We explore (a) a black-box approach to QE based on pre-trained representations; and (b) glass-box approaches that leverage various indicators that can be extracted from the neural MT systems. In addition to training a feature-based regression model using glass-box quality indicators, we also test whether they can be used to predict MT quality directly with no supervision. We assess our systems in a multi-lingual setting and show that both types of approaches generalise well across languages. Our black-box QE models tied for the winning submission in four out of seven language pairs in Task 1, thus demonstrating very strong performance. The glass-box approaches also performed competitively, representing a lightweight alternative to the neural-based models.
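    The glass-box idea can be illustrated with a minimal sketch (not the paper's exact feature set): the average log-probability that the MT model itself assigns to a candidate translation serves as an unsupervised quality indicator. The model name, scoring details, and example sentences below are illustrative assumptions.

```python
# Illustrative glass-box quality indicator (assumptions: model checkpoint is an
# example; requires a recent version of transformers that supports `text_target`).
import torch
from transformers import MarianMTModel, MarianTokenizer

MODEL = "Helsinki-NLP/opus-mt-de-en"  # any seq2seq MT checkpoint with this interface
tok = MarianTokenizer.from_pretrained(MODEL)
model = MarianMTModel.from_pretrained(MODEL).eval()

@torch.no_grad()
def glassbox_score(source: str, translation: str) -> float:
    """Average log-probability the MT model assigns to the translation
    (higher means the model is more confident in the output)."""
    batch = tok(source, text_target=translation, return_tensors="pt")
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"])
    return -out.loss.item()  # loss is the mean token-level negative log-likelihood

print(glassbox_score("Das Haus ist rot.", "The house is red."))
```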