
    On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation

    Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale comparison of MT systems. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations, namely, (a) a semantic mismatch between representations of mutual translations and, more prominently, (b) the inability to punish "translationese", i.e., low-quality literal translations. We propose two partial remedies: (1) post-hoc re-alignment of the vector spaces and (2) coupling of semantic-similarity based metrics with target-side language modeling. In segment-level MT evaluation, our best metric surpasses reference-based BLEU by 5.7 correlation points. Comment: ACL 2020 camera-ready (v3: several small fixes, e.g., Unicode errors).
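    The two proposed remedies lend themselves to a compact sketch. Below is a minimal, hedged illustration in Python/NumPy: an orthogonal-Procrustes re-alignment of the two embedding spaces, and a score that couples cross-lingual cosine similarity with a target-side language-model score. The lambda weight, the toy data, and the embedding/LM interfaces are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    def procrustes_realign(X, Y):
        # Learn an orthogonal map W minimizing ||XW - Y||_F, where the rows of
        # X and Y are embeddings of mutual translations (e.g., from M-BERT or
        # LASER). The closed-form solution is W = U V^T from the SVD of X^T Y.
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt

    def reference_free_score(src_vec, hyp_vec, lm_logprob, lam=0.7):
        # Couple cross-lingual cosine similarity with a target-side LM score.
        # lm_logprob is assumed to be a length-normalized log-probability of
        # the hypothesis under a target-language LM; lam is an illustrative
        # interpolation weight, not the paper's tuned value.
        cos = src_vec @ hyp_vec / (np.linalg.norm(src_vec) * np.linalg.norm(hyp_vec))
        return lam * cos + (1.0 - lam) * lm_logprob

    # Toy demo: random vectors stand in for real encoder outputs.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 32))                  # source-side embeddings
    Q = np.linalg.qr(rng.normal(size=(32, 32)))[0]  # an arbitrary rotation
    Y = X @ Q                                       # rotated "target-side" embeddings
    W = procrustes_realign(X, Y)
    print(np.allclose(X @ W, Y, atol=1e-6))         # True: the rotation is recovered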

    Machine translation evaluation metrics benchmarking: from traditional MT to LLMs

    Final project of the Master in Fundamentals of Data Science, Faculty of Mathematics, Universitat de Barcelona. Academic year: 2022-2023. Advisor: Jordi Vitrià i Marca. This thesis casts a spotlight on the evolution and applicability of machine translation (MT) evaluation metrics and models, chiefly contrasting statistical methods with more contemporary neural-based ones, and giving special attention to modern Large Language Models (LLMs). MT, a significant area of Natural Language Processing (NLP), has undergone a vast metamorphosis over the years, making a thorough exploration of these evolving systems all the more necessary. Our research is anchored in the Digital Corpus of the European Parliament (DCEP), whose comprehensive and diversified multilingual data make it an ideal testbed for benchmarking MT models. Using this extensive corpus, we present a benchmark of selected MT models, capturing not only their evolution but also their performance across different tasks and contexts. A vital facet of our study is evaluating the relevance and reliability of MT metrics, from the traditional BLEU, METEOR, and chrF to newer neural-based metrics that promise to capture semantics more effectively. We aim to uncover the inherent strengths and limitations of these metrics, thereby guiding future practitioners and researchers in choosing appropriate metrics for specific MT contexts. In this holistic examination, we also analyze the interplay between model selection, evaluation metric, and translation quality, offering a lens on the idiosyncrasies of popular MT models and evaluation metrics, and ultimately contributing to more effective and nuanced applications of MT. In sum, this exploration furnishes a new perspective on MT evaluation, honing our understanding of the evolutionary paths of both models and metrics, providing insights into their contextual performance on the DCEP corpus, and creating a benchmark that can serve the broader MT community. All the code used for text pre- and post-processing and for evaluating the models and metrics at play, along with other intermediate matters, is published publicly in our GitHub repository.
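    As a concrete illustration of the traditional side of this comparison, the sketch below computes BLEU and chrF with sacrebleu and METEOR with NLTK on placeholder sentences. It is a minimal example of the metric APIs only, not the thesis's actual DCEP pipeline (which lives in the authors' GitHub repository).

    import sacrebleu
    from nltk.translate.meteor_score import meteor_score

    # Placeholder sentences; the thesis evaluates on DCEP data instead.
    hyps = ["The committee adopted the report on Tuesday."]
    refs = ["The committee approved the report on Tuesday."]

    bleu = sacrebleu.corpus_bleu(hyps, [refs])   # surface n-gram overlap
    chrf = sacrebleu.corpus_chrf(hyps, [refs])   # character n-gram F-score
    # NLTK >= 3.6 expects pre-tokenized input, and WordNet must be downloaded
    # once (nltk.download("wordnet")) before METEOR can run.
    meteor = meteor_score([refs[0].split()], hyps[0].split())

    print(f"BLEU {bleu.score:.1f} | chrF {chrf.score:.1f} | METEOR {meteor:.3f}")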

    Sentence Similarity and Machine Translation

    Neural machine translation (NMT) systems encode an input sentence into an intermediate representation and then decode that representation into the output sentence. Translation requires deep understanding of language; as a result, NMT models trained on large amounts of data develop a semantically rich intermediate representation. We leverage this rich intermediate representation of NMT systems—in particular, multilingual NMT systems, which learn to map many languages into and out of a joint space—for bitext curation, paraphrasing, and automatic machine translation (MT) evaluation. At a high level, all of these tasks are rooted in similarity: sentence and document alignment requires measuring similarity of sentences and documents, respectively; paraphrasing requires producing output which is similar to an input; and automatic MT evaluation requires measuring the similarity between MT system outputs and corresponding human reference translations. We use multilingual NMT for similarity in two ways: First, we use a multilingual NMT model with a fixed-size intermediate representation (Artetxe and Schwenk, 2018) to produce multilingual sentence embeddings, which we use in both sentence and document alignment. Second, we train a multilingual NMT model and show that it generalizes to the task of generative paraphrasing (i.e., “translating” from Russian to Russian), when used in conjunction with a simple generation algorithm to discourage copying from the input to the output. We also use this model for automatic MT evaluation, to force-decode and score MT system outputs conditioned on their respective human reference translations. Since we leverage multilingual NMT models, each method works in many languages using a single model. We show that simple methods, which leverage the intermediate representation of multilingual NMT models trained on large amounts of bitext, outperform prior work in paraphrasing, sentence alignment, document alignment, and automatic MT evaluation. This finding is consistent with recent trends in the natural language processing community, where large language models trained on huge amounts of unlabeled text have achieved state-of-the-art results on tasks such as question answering, named entity recognition, and parsing.
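    Sentence alignment with fixed-size embeddings is often scored not by raw cosine similarity but by a margin criterion that corrects for "hub" sentences close to everything (in the spirit of Artetxe and Schwenk's margin-based mining). The sketch below is a minimal ratio-margin scorer over L2-normalized embeddings; the random vectors and the k value are illustrative placeholders, and this is not necessarily the exact alignment method used in the thesis.

    import numpy as np

    def margin_scores(X, Y, k=4):
        # Ratio-margin scoring: each candidate pair's cosine similarity is
        # normalized by the average similarity of its k nearest neighbours on
        # both sides, penalizing hub sentences. X, Y: L2-normalized sentence
        # embeddings of shape (n_src, d) and (n_tgt, d).
        sim = X @ Y.T                                      # cosine, rows are unit-norm
        knn_x = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # avg sim of each x to its k NNs in Y
        knn_y = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # avg sim of each y to its k NNs in X
        return sim / ((knn_x[:, None] + knn_y[None, :]) / 2.0)

    # Toy demo: random unit vectors stand in for multilingual NMT embeddings.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    Y = X + 0.05 * rng.normal(size=X.shape)                # noisy "translations"
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)
    print(margin_scores(X, Y, k=2).argmax(axis=1))         # ideally [0 1 2 3 4]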