On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation
Evaluation of cross-lingual encoders is usually performed either via
zero-shot cross-lingual transfer in supervised downstream tasks or via
unsupervised cross-lingual textual similarity. In this paper, we concern
ourselves with reference-free machine translation (MT) evaluation where we
directly compare source texts to (sometimes low-quality) system translations,
which represents a natural adversarial setup for multilingual encoders.
Reference-free evaluation holds the promise of web-scale comparison of MT
systems. We systematically investigate a range of metrics based on
state-of-the-art cross-lingual semantic representations obtained with
pretrained M-BERT and LASER. We find that they perform poorly as semantic
encoders for reference-free MT evaluation and identify their two key
limitations, namely, (a) a semantic mismatch between representations of mutual
translations and, more prominently, (b) the inability to punish
"translationese", i.e., low-quality literal translations. We propose two
partial remedies: (1) post-hoc re-alignment of the vector spaces and (2)
coupling of semantic-similarity based metrics with target-side language
modeling. In segment-level MT evaluation, our best metric surpasses
reference-based BLEU by 5.7 correlation points.
Comment: ACL2020 Camera Ready (v3: several small fixes, e.g., Unicode errors)
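The setup described above, scoring a system translation directly against its source in a shared cross-lingual embedding space, can be sketched in a few lines. In this minimal pure-Python sketch, `toy_embed` is a hypothetical stand-in for a real multilingual encoder such as M-BERT or LASER; only the scoring structure is illustrative, not the paper's actual metrics.

```python
import math

def toy_embed(sentence, dim=64):
    """Hypothetical stand-in for a multilingual encoder (M-BERT, LASER):
    hashes character trigrams into a fixed-size vector."""
    vec = [0.0] * dim
    text = f"  {sentence.lower()}  "
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def reference_free_score(source, hypothesis):
    """Score a system translation directly against the source text,
    with no human reference, via embedding similarity."""
    return cosine(toy_embed(source), toy_embed(hypothesis))
```

With a real encoder in place of `toy_embed`, source and hypothesis live in one vector space, which is exactly where the semantic-mismatch and "translationese" failure modes discussed above show up.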
Inducing Language-Agnostic Multilingual Representations
Cross-lingual representations have the potential to make NLP techniques
available to the vast majority of languages in the world. However, they
currently require large pretraining corpora or access to typologically similar
languages. In this work, we address these obstacles by removing language
identity signals from multilingual embeddings. We examine three approaches for
this: (i) re-aligning the vector spaces of target languages (all together) to a
pivot source language; (ii) removing language-specific means and variances,
which yields better discriminativeness of embeddings as a by-product; and (iii)
increasing input similarity across languages by removing morphological
contractions and sentence reordering. We evaluate on XNLI and reference-free MT
across 19 typologically diverse languages. Our findings expose the limitations
of these approaches -- unlike vector normalization, vector space re-alignment
and text normalization do not achieve consistent gains across encoders and
languages. Due to the approaches' additive effects, however, their combination
decreases the cross-lingual transfer gap by 8.9 points (m-BERT) and 18.2
points (XLM-R) on average across all tasks and languages. Our code and models
are publicly available.
Comment: *SEM2021 Camera Ready
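Approach (ii) above, removing language-specific means and variances, amounts to per-language standardization of the embedding space. A minimal pure-Python sketch, assuming embeddings arrive grouped by language as plain lists of vectors:

```python
def normalize_by_language(embeddings_by_lang, eps=1e-8):
    """For each language, subtract the per-dimension mean and divide by the
    per-dimension standard deviation of that language's embeddings,
    removing language-identity signal from the shared space."""
    normalized = {}
    for lang, vecs in embeddings_by_lang.items():
        dim, n = len(vecs[0]), len(vecs)
        mean = [sum(v[d] for v in vecs) / n for d in range(dim)]
        var = [sum((v[d] - mean[d]) ** 2 for v in vecs) / n for d in range(dim)]
        std = [(x + eps) ** 0.5 for x in var]
        normalized[lang] = [[(v[d] - mean[d]) / std[d] for d in range(dim)]
                            for v in vecs]
    return normalized
```

After this step every language's embeddings are centered at the origin with unit variance, so a classifier can no longer separate languages by their mean offset, which is the "vector normalization" the findings above single out as the consistent winner.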
Constrained Density Matching and Modeling for Cross-lingual Alignment of Contextualized Representations
Multilingual representations pre-trained with monolingual data exhibit
considerably unequal task performances across languages. Previous studies
address this challenge with resource-intensive contextualized alignment, which
assumes the availability of large parallel data, thereby leaving
under-represented language communities behind. In this work, we attribute the
data hungriness of previous alignment techniques to two limitations: (i) the
inability to sufficiently leverage data and (ii) the lack of a proper
training procedure. To address these issues, we introduce supervised and
unsupervised density-based approaches named Real-NVP and GAN-Real-NVP, driven
by Normalizing Flow, to perform alignment, both dissecting the alignment of
multilingual subspaces into density matching and density modeling. We
complement these approaches with our validation criteria in order to guide the
training process. Our experiments encompass 16 alignments, including our
approaches, evaluated across 6 language pairs, synthetic data and 5 NLP tasks.
We demonstrate the effectiveness of our approaches in the scenarios of limited
and no parallel data. First, our supervised approach trained on 20k parallel
data (sentences) mostly surpasses Joint-Align and InfoXLM trained on over 100k
parallel sentences. Second, parallel data can be removed without sacrificing
performance when integrating our unsupervised approach in our bootstrapping
procedure, which is theoretically motivated to enforce equality of multilingual
subspaces. Moreover, we demonstrate the advantages of validation criteria over
validation data for guiding supervised training.
Comment: ACML2022 Camera Ready
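The Real-NVP building block behind the approaches above is the affine coupling layer: it is exactly invertible and has a cheap log-determinant, which is what makes both density matching and density modeling tractable. A minimal pure-Python sketch, with the learned scale and translation networks replaced by caller-supplied placeholder functions (`s_fn`, `t_fn` are assumptions for illustration, not the paper's architecture):

```python
import math

def coupling_forward(x, s_fn, t_fn):
    """One Real-NVP affine coupling layer: the first half of the vector
    passes through unchanged and conditions an affine transform of the
    second half. Returns the output and the log-determinant of the
    Jacobian, enabling exact density evaluation."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s = s_fn(x1)  # log-scales, one per dimension of x2
    t = t_fn(x1)  # translations
    y2 = [x2[i] * math.exp(s[i]) + t[i] for i in range(len(x2))]
    return x1 + y2, sum(s)

def coupling_inverse(y, s_fn, t_fn):
    """Exact inverse of the coupling layer above."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s = s_fn(y1)
    t = t_fn(y1)
    x2 = [(y2[i] - t[i]) * math.exp(-s[i]) for i in range(len(y2))]
    return y1 + x2
```

Stacking such layers (with the halves swapped between layers) gives a flow that can either match one language's embedding density to another's or model a density directly, the two views the abstract dissects.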
Can we do that simpler? Simple, Efficient, High-Quality Evaluation Metrics for NLG
We explore efficient evaluation metrics for Natural Language Generation
(NLG). To implement efficient metrics, we replace (i) computation-heavy
transformers in metrics such as BERTScore, MoverScore, BARTScore, XMoverScore,
etc. with lighter versions (such as distilled ones) and (ii) cubic inference
time alignment algorithms such as Word Mover Distance with linear and quadratic
approximations. We consider six evaluation metrics (both monolingual and
multilingual), assessed on three different machine translation datasets, and 16
light-weight transformers as replacement. We find, among others, that (a)
TinyBERT shows best quality-efficiency tradeoff for semantic similarity metrics
of the BERTScore family, retaining 97% quality and being 5x faster at
inference time on average, (b) there is a large difference in speed-ups on CPU
vs. GPU (much higher speed-ups on CPU), and (c) WMD approximations yield no
efficiency gains but lead to a substantial drop in quality on 2 out of 3
datasets we examine.
Comment: Work in progress
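One standard quadratic-time relaxation of Word Mover Distance, of the kind the abstract evaluates, drops the flow constraints so that each token sends all of its mass to its nearest neighbor in the other document; this yields a lower bound on the exact (cubic-time) distance. A pure-Python sketch of the generic relaxed-WMD idea (not necessarily the exact approximation used in the paper), with documents given as lists of word vectors:

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(doc_a, doc_b):
    """Quadratic-time lower bound on Word Mover Distance: each token in
    one document moves all of its (uniform) mass to its nearest token in
    the other document; taking the max of both directions tightens the
    bound."""
    def one_sided(src, tgt):
        return sum(min(euclid(u, v) for v in tgt) for u in src) / len(src)
    return max(one_sided(doc_a, doc_b), one_sided(doc_b, doc_a))
```

The exact WMD solves an optimal-transport problem over all token pairs; this relaxation only scans the pairwise distance matrix, which is why it trades quality for speed, the tradeoff whose downside the findings above quantify.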
Probing Multilingual BERT for Genetic and Typological Signals
We probe the layers in multilingual BERT (mBERT) for phylogenetic and
geographic language signals across 100 languages and compute language distances
based on the mBERT representations. We 1) employ the language distances to
infer and evaluate language trees, finding that they are close to the reference
family tree in terms of quartet tree distance, 2) perform distance matrix
regression analysis, finding that the language distances can be best explained
by phylogenetic and worst by structural factors and 3) present a novel measure
for measuring diachronic meaning stability (based on cross-lingual
representation variability) which correlates significantly with published
ranked lists based on linguistic approaches. Our results contribute to the
nascent field of typological interpretability of cross-lingual text
representations.
Comment: COLING 2020
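Computing language distances from representations, as in step 1 above, reduces to building a pairwise distance matrix over per-language vectors. A pure-Python sketch in which each language is summarized by a single hypothetical vector (e.g., the mean mBERT representation of its sentences in a parallel corpus); tree inference from the resulting matrix is then left to standard distance-based methods such as neighbor joining:

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm else 0.0)

def language_distance_matrix(lang_vectors):
    """Pairwise cosine distances between languages, each represented by
    one summary vector. The matrix can feed distance-based tree inference
    or the regression analysis described above."""
    langs = sorted(lang_vectors)
    return {(a, b): cosine_distance(lang_vectors[a], lang_vectors[b])
            for a in langs for b in langs}
```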
On the Principles of Evaluation for Natural Language Generation
Natural language processing is concerned with the ability of computers to understand natural language texts, which is, arguably, one of the major bottlenecks on the way to the holy grail of general Artificial Intelligence. Given the unprecedented success of deep learning technology, the natural language processing community has been almost entirely in favor of practical applications, with state-of-the-art systems emerging and competing for human-parity performance at an ever-increasing pace. For that reason, fair and adequate evaluation and comparison, responsible for ensuring trustworthy, reproducible and unbiased results, have long fascinated the scientific community, not only in natural language processing but also in other fields. A popular example is the ISO-9126 evaluation standard for software products, which outlines a wide range of evaluation concerns, such as cost, reliability, scalability, security, and so forth. The European project EAGLES-1996, an acclaimed extension of ISO-9126, laid out the fundamental principles specifically for evaluating natural language technologies, which underpin succeeding methodologies in the evaluation of natural language.
Natural language processing encompasses an enormous range of applications, each with its own evaluation concerns, criteria and measures. This thesis cannot hope to be comprehensive but particularly addresses evaluation in natural language generation (NLG), arguably one of the most human-like natural language applications. In this context, research on quantifying day-to-day progress with evaluation metrics lays the foundation of the fast-growing NLG community. However, previous work has failed to deliver high-quality metrics for several scenarios, such as evaluating long texts or evaluating when human references are not available; more prominently, these studies are limited in scope, lacking a holistic view of principled NLG evaluation.
In this thesis, we aim for a holistic view of NLG evaluation from three complementary perspectives, driven by the evaluation principles in EAGLES-1996: (i) high-quality evaluation metrics, (ii) rigorous comparison of NLG systems for properly tracking progress, and (iii) understanding evaluation metrics. To this end, we identify the current challenges arising from the inherent characteristics of these perspectives, and then present novel metrics, rigorous comparison approaches, and explainability techniques for metrics to address the identified issues.
We hope that our work on evaluation metrics, system comparison and explainability for metrics inspires more research towards principled NLG evaluation, and contributes to fair and adequate evaluation and comparison in natural language processing.
Better quality estimation for low resource corpus mining
First author draft: https://aclanthology.org/2022.findings-acl.45/
SeamlessM4T-Massively Multilingual & Multimodal Machine Translation
What does it take to create the Babel Fish, a tool that can help individuals
translate speech between any two languages? While recent breakthroughs in
text-based models have pushed machine translation coverage beyond 200
languages, unified speech-to-speech translation models have yet to achieve
similar strides. More specifically, conventional speech-to-speech translation
systems rely on cascaded systems that perform translation progressively,
putting high-performing unified systems out of reach. To address these gaps, we
introduce SeamlessM4T, a single model that supports speech-to-speech
translation, speech-to-text translation, text-to-speech translation,
text-to-text translation, and automatic speech recognition for up to 100
languages. To build this, we used 1 million hours of open speech audio data to
learn self-supervised speech representations with w2v-BERT 2.0. Subsequently,
we created a multimodal corpus of automatically aligned speech translations.
Filtered and combined with human-labeled and pseudo-labeled data, we developed
the first multilingual system capable of translating from and into English for
both speech and text. On FLEURS, SeamlessM4T sets a new standard for
translations into multiple target languages, achieving an improvement of 20%
BLEU over the previous SOTA in direct speech-to-text translation. Compared to
strong cascaded models, SeamlessM4T improves the quality of into-English
translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in
speech-to-speech. Tested for robustness, our system performs better against
background noises and speaker variations in speech-to-text tasks compared to
the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and
added toxicity to assess translation safety. Finally, all contributions in this
work are open-sourced and accessible at
https://github.com/facebookresearch/seamless_communication
Pretrained Transformers for Text Ranking: BERT and Beyond
The goal of text ranking is to generate an ordered list of texts retrieved
from a corpus in response to a query. Although the most common formulation of
text ranking is search, instances of the task can also be found in many natural
language processing applications. This survey provides an overview of text
ranking with neural network architectures known as transformers, of which BERT
is the best-known example. The combination of transformers and self-supervised
pretraining has been responsible for a paradigm shift in natural language
processing (NLP), information retrieval (IR), and beyond. In this survey, we
provide a synthesis of existing work as a single point of entry for
practitioners who wish to gain a better understanding of how to apply
transformers to text ranking problems and researchers who wish to pursue work
in this area. We cover a wide range of modern techniques, grouped into two
high-level categories: transformer models that perform reranking in multi-stage
architectures and dense retrieval techniques that perform ranking directly.
There are two themes that pervade our survey: techniques for handling long
documents, beyond typical sentence-by-sentence processing in NLP, and
techniques for addressing the tradeoff between effectiveness (i.e., result
quality) and efficiency (e.g., query latency, model and index size). Although
transformer architectures and pretraining techniques are recent innovations,
many aspects of how they are applied to text ranking are relatively well
understood and represent mature techniques. However, there remain many open
research questions, and thus in addition to laying out the foundations of
pretrained transformers for text ranking, this survey also attempts to
prognosticate where the field is heading.
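The multi-stage architecture that dominates the survey, cheap candidate retrieval followed by expensive transformer reranking, can be sketched independently of any particular model. In this pure-Python sketch both scorers are toy placeholders: `term_overlap` stands in for a lexical first-stage ranker such as BM25, and the reranker passed in would in practice be a cross-encoder such as BERT.

```python
def retrieve_then_rerank(query, corpus, first_stage_score, rerank_score, k=100):
    """Two-stage ranking: a cheap first-stage scorer selects the top-k
    candidates from the corpus, then an expensive scorer reranks only
    those k candidates, keeping query latency manageable."""
    candidates = sorted(corpus, key=lambda d: first_stage_score(query, d),
                        reverse=True)[:k]
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)

def term_overlap(query, doc):
    """Toy lexical first-stage scorer: count of shared terms."""
    return len(set(query.split()) & set(doc.split()))
```

The effectiveness/efficiency tradeoff the survey emphasizes lives in `k`: a larger candidate pool raises the ceiling on result quality but multiplies the number of expensive reranker invocations per query.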