On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation
Evaluation of cross-lingual encoders is usually performed either via
zero-shot cross-lingual transfer in supervised downstream tasks or via
unsupervised cross-lingual textual similarity. In this paper, we concern
ourselves with reference-free machine translation (MT) evaluation where we
directly compare source texts to (sometimes low-quality) system translations,
which represents a natural adversarial setup for multilingual encoders.
Reference-free evaluation holds the promise of web-scale comparison of MT
systems. We systematically investigate a range of metrics based on
state-of-the-art cross-lingual semantic representations obtained with
pretrained M-BERT and LASER. We find that they perform poorly as semantic
encoders for reference-free MT evaluation and identify their two key
limitations, namely, (a) a semantic mismatch between representations of mutual
translations and, more prominently, (b) the inability to punish
"translationese", i.e., low-quality literal translations. We propose two
partial remedies: (1) post-hoc re-alignment of the vector spaces and (2)
coupling of semantic-similarity based metrics with target-side language
modeling. In segment-level MT evaluation, our best metric surpasses
reference-based BLEU by 5.7 correlation points.
Comment: ACL2020 Camera Ready (v3: several small fixes, e.g., Unicode errors)
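The setup described above, scoring a system translation directly against its source in a shared cross-lingual embedding space, can be sketched in a few lines. In this minimal pure-Python sketch, `toy_embed` is a hypothetical stand-in for a real multilingual encoder such as M-BERT or LASER; only the scoring structure is illustrative, not the paper's actual metrics.

```python
import math

def toy_embed(sentence, dim=64):
    """Hypothetical stand-in for a multilingual encoder (M-BERT, LASER):
    hashes character trigrams into a fixed-size vector."""
    vec = [0.0] * dim
    text = f"  {sentence.lower()}  "
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def reference_free_score(source, hypothesis):
    """Score a system translation directly against the source text,
    with no human reference, via embedding similarity."""
    return cosine(toy_embed(source), toy_embed(hypothesis))
```

With a real encoder in place of `toy_embed`, source and hypothesis live in one vector space, which is exactly where the semantic-mismatch and "translationese" failure modes discussed above show up.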
Inducing Language-Agnostic Multilingual Representations
Cross-lingual representations have the potential to make NLP techniques
available to the vast majority of languages in the world. However, they
currently require large pretraining corpora or access to typologically similar
languages. In this work, we address these obstacles by removing language
identity signals from multilingual embeddings. We examine three approaches for
this: (i) re-aligning the vector spaces of target languages (all together) to a
pivot source language; (ii) removing language-specific means and variances,
which yields better discriminativeness of embeddings as a by-product; and (iii)
increasing input similarity across languages by removing morphological
contractions and sentence reordering. We evaluate on XNLI and reference-free MT
across 19 typologically diverse languages. Our findings expose the limitations
of these approaches -- unlike vector normalization, vector space re-alignment
and text normalization do not achieve consistent gains across encoders and
languages. Due to the approaches' additive effects, however, their combination
decreases the cross-lingual transfer gap by 8.9 points (m-BERT) and 18.2
points (XLM-R) on average across all tasks and languages. Our code and models
are publicly available.
Comment: *SEM2021 Camera Ready
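Approach (ii) above, removing language-specific means and variances, amounts to per-language standardization of the embedding space. A minimal pure-Python sketch, assuming embeddings arrive grouped by language as plain lists of vectors:

```python
def normalize_by_language(embeddings_by_lang, eps=1e-8):
    """For each language, subtract the per-dimension mean and divide by the
    per-dimension standard deviation of that language's embeddings,
    removing language-identity signal from the shared space."""
    normalized = {}
    for lang, vecs in embeddings_by_lang.items():
        dim, n = len(vecs[0]), len(vecs)
        mean = [sum(v[d] for v in vecs) / n for d in range(dim)]
        var = [sum((v[d] - mean[d]) ** 2 for v in vecs) / n for d in range(dim)]
        std = [(x + eps) ** 0.5 for x in var]
        normalized[lang] = [[(v[d] - mean[d]) / std[d] for d in range(dim)]
                            for v in vecs]
    return normalized
```

After this step every language's embeddings are centered at the origin with unit variance, so a classifier can no longer separate languages by their mean offset, which is the "vector normalization" the findings above single out as the consistent winner.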
Constrained Density Matching and Modeling for Cross-lingual Alignment of Contextualized Representations
Multilingual representations pre-trained with monolingual data exhibit
considerably unequal task performances across languages. Previous studies
address this challenge with resource-intensive contextualized alignment, which
assumes the availability of large parallel data, thereby leaving
under-represented language communities behind. In this work, we attribute the
data hungriness of previous alignment techniques to two limitations: (i) the
inability to sufficiently leverage data and (ii) the lack of a proper
training procedure. To address these issues, we introduce supervised and
unsupervised density-based approaches named Real-NVP and GAN-Real-NVP, driven
by Normalizing Flow, to perform alignment, both dissecting the alignment of
multilingual subspaces into density matching and density modeling. We
complement these approaches with our validation criteria in order to guide the
training process. Our experiments encompass 16 alignments, including our
approaches, evaluated across 6 language pairs, synthetic data and 5 NLP tasks.
We demonstrate the effectiveness of our approaches in the scenarios of limited
and no parallel data. First, our supervised approach trained on 20k parallel
data (sentences) mostly surpasses Joint-Align and InfoXLM trained on over 100k
parallel sentences. Second, parallel data can be removed without sacrificing
performance when integrating our unsupervised approach in our bootstrapping
procedure, which is theoretically motivated to enforce equality of multilingual
subspaces. Moreover, we demonstrate the advantages of validation criteria over
validation data for guiding supervised training.
Comment: ACML2022 Camera Ready
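The Real-NVP building block behind the approaches above is the affine coupling layer: it is exactly invertible and has a cheap log-determinant, which is what makes both density matching and density modeling tractable. A minimal pure-Python sketch, with the learned scale and translation networks replaced by caller-supplied placeholder functions (`s_fn`, `t_fn` are assumptions for illustration, not the paper's architecture):

```python
import math

def coupling_forward(x, s_fn, t_fn):
    """One Real-NVP affine coupling layer: the first half of the vector
    passes through unchanged and conditions an affine transform of the
    second half. Returns the output and the log-determinant of the
    Jacobian, enabling exact density evaluation."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s = s_fn(x1)  # log-scales, one per dimension of x2
    t = t_fn(x1)  # translations
    y2 = [x2[i] * math.exp(s[i]) + t[i] for i in range(len(x2))]
    return x1 + y2, sum(s)

def coupling_inverse(y, s_fn, t_fn):
    """Exact inverse of the coupling layer above."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s = s_fn(y1)
    t = t_fn(y1)
    x2 = [(y2[i] - t[i]) * math.exp(-s[i]) for i in range(len(y2))]
    return y1 + x2
```

Stacking such layers (with the halves swapped between layers) gives a flow that can either match one language's embedding density to another's or model a density directly, the two views the abstract dissects.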
Can we do that simpler? Simple, Efficient, High-Quality Evaluation Metrics for NLG
We explore efficient evaluation metrics for Natural Language Generation
(NLG). To implement efficient metrics, we replace (i) computation-heavy
transformers in metrics such as BERTScore, MoverScore, BARTScore, XMoverScore,
etc. with lighter versions (such as distilled ones) and (ii) cubic inference
time alignment algorithms such as Word Mover Distance with linear and quadratic
approximations. We consider six evaluation metrics (both monolingual and
multilingual), assessed on three different machine translation datasets, and 16
light-weight transformers as replacement. We find, among others, that (a)
TinyBERT shows best quality-efficiency tradeoff for semantic similarity metrics
of the BERTScore family, retaining 97% quality and being 5x faster at
inference time on average, (b) there is a large difference in speed-ups on CPU
vs. GPU (much higher speed-ups on CPU), and (c) WMD approximations yield no
efficiency gains but lead to a substantial drop in quality on 2 out of 3
datasets we examine.
Comment: Work in progress
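One standard quadratic-time relaxation of Word Mover Distance, of the kind the abstract evaluates, drops the flow constraints so that each token sends all of its mass to its nearest neighbor in the other document; this yields a lower bound on the exact (cubic-time) distance. A pure-Python sketch of the generic relaxed-WMD idea (not necessarily the exact approximation used in the paper), with documents given as lists of word vectors:

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(doc_a, doc_b):
    """Quadratic-time lower bound on Word Mover Distance: each token in
    one document moves all of its (uniform) mass to its nearest token in
    the other document; taking the max of both directions tightens the
    bound."""
    def one_sided(src, tgt):
        return sum(min(euclid(u, v) for v in tgt) for u in src) / len(src)
    return max(one_sided(doc_a, doc_b), one_sided(doc_b, doc_a))
```

The exact WMD solves an optimal-transport problem over all token pairs; this relaxation only scans the pairwise distance matrix, which is why it trades quality for speed, the tradeoff whose downside the findings above quantify.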
Probing Multilingual BERT for Genetic and Typological Signals
We probe the layers in multilingual BERT (mBERT) for phylogenetic and
geographic language signals across 100 languages and compute language distances
based on the mBERT representations. We 1) employ the language distances to
infer and evaluate language trees, finding that they are close to the reference
family tree in terms of quartet tree distance, 2) perform distance matrix
regression analysis, finding that the language distances can be best explained
by phylogenetic and worst by structural factors and 3) present a novel measure
for measuring diachronic meaning stability (based on cross-lingual
representation variability) which correlates significantly with published
ranked lists based on linguistic approaches. Our results contribute to the
nascent field of typological interpretability of cross-lingual text
representations.
Comment: COLING 2020
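Computing language distances from representations, as in step 1 above, reduces to building a pairwise distance matrix over per-language vectors. A pure-Python sketch in which each language is summarized by a single hypothetical vector (e.g., the mean mBERT representation of its sentences in a parallel corpus); tree inference from the resulting matrix is then left to standard distance-based methods such as neighbor joining:

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm else 0.0)

def language_distance_matrix(lang_vectors):
    """Pairwise cosine distances between languages, each represented by
    one summary vector. The matrix can feed distance-based tree inference
    or the regression analysis described above."""
    langs = sorted(lang_vectors)
    return {(a, b): cosine_distance(lang_vectors[a], lang_vectors[b])
            for a in langs for b in langs}
```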
On the Principles of Evaluation for Natural Language Generation
Natural language processing is concerned with the ability of computers to understand natural language texts, which is, arguably, one of the major bottlenecks on the way to the holy grail of general Artificial Intelligence. Given the unprecedented success of deep learning technology, the natural language processing community has been almost entirely in favor of practical applications, with state-of-the-art systems emerging and competing for human-parity performance at an ever-increasing pace. For that reason, fair and adequate evaluation and comparison, responsible for ensuring trustworthy, reproducible and unbiased results, have long fascinated the scientific community, not only in natural language processing but also in other fields. A popular example is the ISO-9126 evaluation standard for software products, which outlines a wide range of evaluation concerns, such as cost, reliability, scalability, security, and so forth. The European project EAGLES-1996, an acclaimed extension of ISO-9126, laid out the fundamental principles specifically for evaluating natural language technologies, which underpin succeeding methodologies in the evaluation of natural language.
Natural language processing encompasses an enormous range of applications, each with its own evaluation concerns, criteria and measures. This thesis cannot hope to be comprehensive but particularly addresses evaluation in natural language generation (NLG), arguably one of the most human-like natural language applications. In this context, research on quantifying day-to-day progress with evaluation metrics lays the foundation of the fast-growing NLG community. However, previous work has failed to deliver high-quality metrics for several scenarios, such as evaluating long texts or evaluating when human references are not available; more prominently, these studies are limited in scope, lacking a holistic view of principled NLG evaluation.
In this thesis, we aim for a holistic view of NLG evaluation from three complementary perspectives, driven by the evaluation principles in EAGLES-1996: (i) high-quality evaluation metrics, (ii) rigorous comparison of NLG systems for properly tracking progress, and (iii) understanding evaluation metrics. To this end, we identify the current challenges arising from the inherent characteristics of these perspectives, and then present novel metrics, rigorous comparison approaches, and explainability techniques for metrics to address the identified issues.
We hope that our work on evaluation metrics, system comparison and explainability for metrics inspires more research towards principled NLG evaluation, and contributes to fair and adequate evaluation and comparison in natural language processing.
Better quality estimation for low resource corpus mining
First author draft: https://aclanthology.org/2022.findings-acl.45/
SeamlessM4T-Massively Multilingual & Multimodal Machine Translation
What does it take to create the Babel Fish, a tool that can help individuals
translate speech between any two languages? While recent breakthroughs in
text-based models have pushed machine translation coverage beyond 200
languages, unified speech-to-speech translation models have yet to achieve
similar strides. More specifically, conventional speech-to-speech translation
systems rely on cascaded systems that perform translation progressively,
putting high-performing unified systems out of reach. To address these gaps, we
introduce SeamlessM4T, a single model that supports speech-to-speech
translation, speech-to-text translation, text-to-speech translation,
text-to-text translation, and automatic speech recognition for up to 100
languages. To build this, we used 1 million hours of open speech audio data to
learn self-supervised speech representations with w2v-BERT 2.0. Subsequently,
we created a multimodal corpus of automatically aligned speech translations.
Filtered and combined with human-labeled and pseudo-labeled data, we developed
the first multilingual system capable of translating from and into English for
both speech and text. On FLEURS, SeamlessM4T sets a new standard for
translations into multiple target languages, achieving an improvement of 20%
BLEU over the previous SOTA in direct speech-to-text translation. Compared to
strong cascaded models, SeamlessM4T improves the quality of into-English
translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in
speech-to-speech. Tested for robustness, our system performs better against
background noises and speaker variations in speech-to-text tasks compared to
the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and
added toxicity to assess translation safety. Finally, all contributions in this
work are open-sourced and accessible at
https://github.com/facebookresearch/seamless_communication
Pretrained Transformers for Text Ranking: BERT and Beyond
The goal of text ranking is to generate an ordered list of texts retrieved
from a corpus in response to a query. Although the most common formulation of
text ranking is search, instances of the task can also be found in many natural
language processing applications. This survey provides an overview of text
ranking with neural network architectures known as transformers, of which BERT
is the best-known example. The combination of transformers and self-supervised
pretraining has been responsible for a paradigm shift in natural language
processing (NLP), information retrieval (IR), and beyond. In this survey, we
provide a synthesis of existing work as a single point of entry for
practitioners who wish to gain a better understanding of how to apply
transformers to text ranking problems and researchers who wish to pursue work
in this area. We cover a wide range of modern techniques, grouped into two
high-level categories: transformer models that perform reranking in multi-stage
architectures and dense retrieval techniques that perform ranking directly.
There are two themes that pervade our survey: techniques for handling long
documents, beyond typical sentence-by-sentence processing in NLP, and
techniques for addressing the tradeoff between effectiveness (i.e., result
quality) and efficiency (e.g., query latency, model and index size). Although
transformer architectures and pretraining techniques are recent innovations,
many aspects of how they are applied to text ranking are relatively well
understood and represent mature techniques. However, there remain many open
research questions, and thus in addition to laying out the foundations of
pretrained transformers for text ranking, this survey also attempts to
prognosticate where the field is heading.
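The multi-stage architecture that dominates the survey, cheap candidate retrieval followed by expensive transformer reranking, can be sketched independently of any particular model. In this pure-Python sketch both scorers are toy placeholders: `term_overlap` stands in for a lexical first-stage ranker such as BM25, and the reranker passed in would in practice be a cross-encoder such as BERT.

```python
def retrieve_then_rerank(query, corpus, first_stage_score, rerank_score, k=100):
    """Two-stage ranking: a cheap first-stage scorer selects the top-k
    candidates from the corpus, then an expensive scorer reranks only
    those k candidates, keeping query latency manageable."""
    candidates = sorted(corpus, key=lambda d: first_stage_score(query, d),
                        reverse=True)[:k]
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)

def term_overlap(query, doc):
    """Toy lexical first-stage scorer: count of shared terms."""
    return len(set(query.split()) & set(doc.split()))
```

The effectiveness/efficiency tradeoff the survey emphasizes lives in `k`: a larger candidate pool raises the ceiling on result quality but multiplies the number of expensive reranker invocations per query.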