On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation
Evaluation of cross-lingual encoders is usually performed either via
zero-shot cross-lingual transfer in supervised downstream tasks or via
unsupervised cross-lingual textual similarity. In this paper, we concern
ourselves with reference-free machine translation (MT) evaluation where we
directly compare source texts to (sometimes low-quality) system translations,
which represents a natural adversarial setup for multilingual encoders.
Reference-free evaluation holds the promise of web-scale comparison of MT
systems. We systematically investigate a range of metrics based on
state-of-the-art cross-lingual semantic representations obtained with
pretrained M-BERT and LASER. We find that they perform poorly as semantic
encoders for reference-free MT evaluation and identify their two key
limitations, namely, (a) a semantic mismatch between representations of mutual
translations and, more prominently, (b) the inability to punish
"translationese", i.e., low-quality literal translations. We propose two
partial remedies: (1) post-hoc re-alignment of the vector spaces and (2)
coupling of semantic-similarity based metrics with target-side language
modeling. In segment-level MT evaluation, our best metric surpasses
reference-based BLEU by 5.7 correlation points. (Comment: ACL 2020 camera-ready; v3 includes several small fixes, e.g., Unicode errors.)
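The second remedy above can be sketched concretely. The snippet below is a minimal, illustrative combination of a cross-lingual embedding similarity with a target-side language-model score; the embedding vectors and the LM log-probability are placeholder inputs (in practice they would come from encoders such as LASER or M-BERT and a trained target-side LM), and the interpolation weight `alpha` is a hypothetical parameter, not a value from the paper.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def reference_free_score(src_emb, hyp_emb, lm_log_prob, n_tokens, alpha=0.5):
    """Toy reference-free MT score: interpolate cross-lingual semantic
    similarity (adequacy) with a length-normalized target-side LM
    log-probability (fluency). All inputs are placeholders."""
    sim = cosine(src_emb, hyp_emb)            # semantic adequacy term
    fluency = lm_log_prob / max(n_tokens, 1)  # per-token LM fluency term
    return alpha * sim + (1 - alpha) * fluency
```

The LM term is what lets such a metric penalize fluent-looking but literal "translationese" less leniently than similarity alone would: a low target-side log-probability drags the combined score down even when the embeddings of source and hypothesis sit close together.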
Machine translation evaluation metrics benchmarking: from traditional MT to LLMs
Final thesis of the Màster de Fonaments de Ciència de Dades (Master's in Fundamentals of Data Science), Facultat de Matemàtiques, Universitat de Barcelona. Academic year: 2022-2023. Advisor: Jordi Vitrià i Marca.
This thesis endeavors to cast a spotlight on the evolution and applicability of machine translation (MT) evaluation metrics and models, mainly contrasting statistical methods against more contemporary neural-based ones, with special attention to modern Large Language Models (LLMs). MT, a significant area in Natural Language Processing (NLP), has undergone a vast metamorphosis over the years, bringing into focus the critical need for a thorough exploration of these evolving systems.
Our research is anchored on the Digital Corpus of the European Parliament (DCEP), a complex and multilingual corpus that makes it an ideal testbed to benchmark MT models given its comprehensive and diversified linguistic data. Through the use of this extensive corpus, we aim to present a comprehensive benchmarking of various selected MT models, encapsulating not just their evolution but also their performance dynamics across different tasks and contexts.
A vital facet of our study is evaluating the relevance and reliability of various MT metrics, from traditional ones such as BLEU, METEOR, and chrF to newer neural-based metrics that promise to capture semantics more effectively. We aim to uncover the inherent strengths and limitations of these metrics, thereby guiding future practitioners and researchers toward appropriate metrics for specific MT contexts.
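To make the traditional side of this comparison concrete, here is a simplified, illustrative chrF-style character n-gram F-score in plain Python. It is not the official chrF definition (which differs in how per-order statistics are aggregated; sacreBLEU provides the reference implementation), but it conveys the core idea of matching character n-grams between hypothesis and reference.

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams of a string, with whitespace collapsed.
    s = " ".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf_like(hypothesis, reference, max_n=3, beta=2.0):
    """Simplified chrF-style score: average character n-gram precision
    and recall over orders 1..max_n, then combine into an F-beta score
    (beta=2 weights recall, as chrF does). Illustrative only."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p = sum(precisions) / max_n
    r = sum(recalls) / max_n
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Character-level matching is what makes chrF-family metrics comparatively robust to morphological variation, one of the reasons they remain a common baseline against which neural metrics are benchmarked.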
In this holistic examination, we also analyze the interplay between model selection, evaluation metrics, and translation quality. This thesis provides a novel lens for understanding the idiosyncrasies of various popular MT models and evaluation metrics, ultimately contributing to more effective and nuanced applications of MT.
In sum, this exploration promises a new perspective on MT evaluation, honing our understanding of both the models' and the metrics' evolutionary paths, providing insights into their contextual performance on the DCEP corpus, and creating a benchmark intended to serve the broader MT community.
The reader can find all the code used for the text pre- and post-processing and for the evaluation of the models and metrics at play, along with other intermediate artifacts, published publicly in our GitHub repository.
Sentence Similarity and Machine Translation
Neural machine translation (NMT) systems encode an input sentence into an intermediate representation and then decode that representation into the output sentence. Translation requires deep understanding of language; as a result, NMT models trained on large amounts of data develop a semantically rich intermediate representation.
We leverage this rich intermediate representation of NMT systems, in particular of multilingual NMT systems, which learn to map many languages into and out of a joint space, for bitext curation, paraphrasing, and automatic machine translation (MT) evaluation. At a high level, all of these tasks are rooted in similarity: sentence and document alignment requires measuring similarity of sentences and documents, respectively; paraphrasing requires producing output which is similar to an input; and automatic MT evaluation requires measuring the similarity between MT system outputs and corresponding human reference translations.
We use multilingual NMT for similarity in two ways: First, we use a multilingual NMT model with a fixed-size intermediate representation (Artetxe and Schwenk, 2018) to produce multilingual sentence embeddings, which we use in both sentence and document alignment. Second, we train a multilingual NMT model and show that it generalizes to the task of generative paraphrasing (i.e., "translating" from Russian to Russian), when used in conjunction with a simple generation algorithm to discourage copying from the input to the output. We also use this model for automatic MT evaluation, to force decode and score MT system outputs conditioned on their respective human reference translations. Since we leverage multilingual NMT models, each method works in many languages using a single model.
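The first use, embedding-based sentence alignment, can be sketched as a greedy nearest-neighbor pairing over sentence embeddings. The embeddings below are placeholders; in the work described they would come from the fixed-size multilingual NMT encoder (Artetxe and Schwenk, 2018), and production aligners typically use stronger margin-based scoring rather than this raw-cosine sketch. The `threshold` parameter is a hypothetical cutoff for discarding weak pairs.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def align_sentences(src_embs, tgt_embs, threshold=0.5):
    """Greedy one-to-one alignment: pair each source sentence with its
    most similar unused target sentence, dropping pairs whose cosine
    similarity does not exceed the threshold. Returns (src_index,
    tgt_index, similarity) triples."""
    pairs, used = [], set()
    for i, s in enumerate(src_embs):
        best_j, best_sim = None, threshold
        for j, t in enumerate(tgt_embs):
            if j in used:
                continue
            sim = cosine(s, t)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j, best_sim))
    return pairs
```

Because both sides are embedded into the same joint space by a single multilingual model, the same procedure applies unchanged to any language pair the encoder covers, which is the practical appeal the abstract highlights.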
We show that simple methods, which leverage the intermediate representation of multilingual NMT models trained on large amounts of bitext, outperform prior work in paraphrasing, sentence alignment, document alignment, and automatic MT evaluation. This finding is consistent with recent trends in the natural language processing community, where large language models trained on huge amounts of unlabeled text have achieved state-of-the-art results on tasks such as question answering, named entity recognition, and parsing.