On the Limits of Minimal Pairs in Contrastive Evaluation
Minimal sentence pairs are frequently used to analyze the behavior of language models. It is often assumed that model behavior on contrastive pairs is predictive of model behavior at large. We argue that two conditions are necessary for this assumption to hold: First, a tested hypothesis should be well-motivated, since experiments show that contrastive evaluation can lead to false positives. Second, test data should be chosen so as to minimize distributional discrepancy between evaluation time and deployment time. For a good approximation of deployment-time decoding, we recommend that minimal pairs be created based on machine-generated text, as opposed to human-written references. We present a contrastive evaluation suite for English–German MT that implements this recommendation.
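Concretely, contrastive evaluation scores both sides of a minimal pair under the model and checks which one it prefers. The sketch below shows the basic scoring step under stated assumptions: the checkpoint name and the example pair are illustrative choices of ours, not taken from the paper, and the suite described above additionally requires the pairs to be built from machine-generated text.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative checkpoint; any seq2seq MT model would do.
NAME = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(NAME).eval()

@torch.no_grad()
def sequence_log_prob(source: str, target: str) -> float:
    """log P(target | source), summed over target tokens."""
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(text_target=target, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # mean per-token NLL
    return -loss.item() * labels.shape[1]

# A minimal pair probing word sense disambiguation (example is ours):
src = "She plays bass in a jazz band."
good = "Sie spielt Bass in einer Jazzband."
bad = "Sie spielt Barsch in einer Jazzband."  # "Barsch" = bass, the fish
print(sequence_log_prob(src, good) > sequence_log_prob(src, bad))
```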
As Little as Possible, as Much as Necessary: Detecting Over- and Undertranslations with Contrastive Conditioning
Omission and addition of content are typical issues in neural machine translation. We propose a method for detecting such phenomena with off-the-shelf translation models. Using contrastive conditioning, we compare the likelihood of a full sequence under a translation model to the likelihood of its parts, given the corresponding source or target sequence. This allows us to pinpoint superfluous words in the translation and untranslated words in the source even in the absence of a reference translation. The accuracy of our method is comparable to that of a supervised method that requires a custom quality estimation model.
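As we read the abstract, one direction of the method can be sketched as follows: delete one source word at a time and re-score the translation; if the translation does not become less likely, that word was plausibly never translated. This is a hedged sketch, not the authors' implementation; `log_prob` stands in for a log P(target | source) helper such as the one sketched above.

```python
def untranslated_source_words(source_tokens, translation, log_prob):
    """Flag source words whose deletion does not lower the likelihood of the
    translation: a sign the model never translated them (undertranslation).
    Superfluous target words can be found symmetrically with a reverse model.
    """
    full_score = log_prob(" ".join(source_tokens), translation)
    flagged = []
    for i, word in enumerate(source_tokens):
        reduced = source_tokens[:i] + source_tokens[i + 1:]
        if log_prob(" ".join(reduced), translation) >= full_score:
            flagged.append((i, word))
    return flagged
```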
Contrastive Conditioning for Assessing Disambiguation in MT: A Case Study of Distilled Bias
Lexical disambiguation is a major challenge for machine translation systems, especially if some senses of a word are trained less often than others. Identifying patterns of overgeneralization requires evaluation methods that are both reliable and scalable. We propose contrastive conditioning as a reference-free black-box method for detecting disambiguation errors. Specifically, we score the quality of a translation by conditioning on variants of the source that provide contrastive disambiguation cues. After validating our method, we apply it in a case study to perform a targeted evaluation of sequence-level knowledge distillation. By probing word sense disambiguation and translation of gendered occupation names, we show that distillation-trained models tend to overgeneralize more than other models with a comparable BLEU score. Contrastive conditioning thus highlights a side effect of distillation that is not fully captured by standard evaluation metrics. Code and data to reproduce our findings are publicly available.
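The scoring idea lends itself to a very small sketch: compare the translation's likelihood under source variants that cue different senses. The function name, cue sentences, and `log_prob` helper below are ours, not the paper's.

```python
def is_disambiguation_error(translation, source_gold_cue, source_other_cue, log_prob):
    """Contrastive conditioning: a correct translation should be more likely
    given the source variant cueing the intended sense. If the wrong-sense
    variant wins, we count a disambiguation error.
    """
    return log_prob(source_other_cue, translation) >= log_prob(source_gold_cue, translation)

# Hypothetical probe for a gendered occupation name: the gold context implies
# a woman, so the masculine output "Der Arzt laechelte." should score higher
# under the male-cued variant only if the system overgeneralized.
# is_disambiguation_error("Der Arzt laechelte.",
#                         "The doctor smiled. She was pleased.",  # gold cue
#                         "The doctor smiled. He was pleased.",   # contrast cue
#                         log_prob)
```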
Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents
Automatically highlighting words that cause semantic differences between two documents could be useful for a wide range of applications. We formulate recognizing semantic differences (RSD) as a token-level regression task and study three unsupervised approaches that rely on a masked language model. To assess the approaches, we begin with basic English sentences and gradually move to more complex, cross-lingual document pairs. Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation with gold labels. However, all unsupervised approaches still leave a large margin for improvement.
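One plausible instantiation of an alignment-style approach (our sketch, not the exact method from the paper): embed both documents with a masked language model and score each token of one document by how poorly it aligns to any token of the other.

```python
import torch
from transformers import AutoModel, AutoTokenizer

ENC = "xlm-roberta-base"  # any multilingual MLM encoder; illustrative choice
tok = AutoTokenizer.from_pretrained(ENC)
enc = AutoModel.from_pretrained(ENC).eval()

@torch.no_grad()
def token_difference_scores(doc_a: str, doc_b: str):
    """Score each token of doc_a as 1 - max cosine similarity to doc_b's
    tokens. High scores suggest semantic differences. Special tokens are
    kept for brevity; a real system would mask them out.
    """
    def embed(text):
        batch = tok(text, return_tensors="pt", truncation=True)
        hidden = enc(**batch).last_hidden_state[0]           # (seq, dim)
        return batch, torch.nn.functional.normalize(hidden, dim=-1)

    batch_a, vec_a = embed(doc_a)
    _, vec_b = embed(doc_b)
    scores = 1 - (vec_a @ vec_b.T).max(dim=1).values         # per-token score
    tokens = tok.convert_ids_to_tokens(batch_a.input_ids[0].tolist())
    return list(zip(tokens, scores.tolist()))
```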
NMTScore: A Multilingual Analysis of Translation-based Text Similarity Measures
Being able to rank the similarity of short text segments is an interesting bonus feature of neural machine translation. Translation-based similarity measures include direct and pivot translation probability, as well as translation cross-likelihood, which has not been studied so far. We analyze these measures in the common framework of multilingual NMT, releasing the NMTScore library. Compared to baselines such as sentence embeddings, translation-based measures prove competitive in paraphrase identification and are more robust against adversarial or multilingual input, especially if proper normalization is applied. When used for reference-based evaluation of data-to-text generation in 2 tasks and 17 languages, translation-based measures show a relatively high correlation to human judgments.
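The three measures can be paraphrased roughly as follows; this is our reading of the abstract, and the exact normalization and symmetrization live in the released NMTScore library. `log_prob(src, tgt)` and `translate(text)` are hypothetical helpers over a multilingual NMT model.

```python
import math

def direct_similarity(a: str, b: str, log_prob) -> float:
    """Direct translation probability, symmetrized: geometric mean of
    P(b | a) and P(a | b) under the translation model."""
    return math.exp((log_prob(a, b) + log_prob(b, a)) / 2)

def pivot_similarity(a: str, b: str, translate, log_prob) -> float:
    """Pivot translation probability: translate each segment into a pivot
    language first, then score the other segment against that pivot."""
    return math.exp((log_prob(translate(a), b) + log_prob(translate(b), a)) / 2)

def cross_likelihood(a: str, b: str, translate, log_prob) -> float:
    """Translation cross-likelihood: how probable is a's translation when
    conditioning on b instead of a, and vice versa."""
    return math.exp((log_prob(b, translate(a)) + log_prob(a, translate(b))) / 2)
```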
X-stance: A Multilingual Multi-Target Dataset for Stance Detection
We extract a large-scale stance detection dataset from comments written by candidates in Swiss elections. The dataset consists of German, French and Italian text, allowing for a cross-lingual evaluation of stance detection. It contains 67,000 comments on more than 150 political issues (targets). Unlike stance detection models that are tied to specific target issues, we use the dataset to train a single model on all the issues. To make learning across targets possible, we prepend to each instance a natural question that represents the target (e.g. "Do you support X?"). Baseline results from multilingual BERT show that zero-shot cross-lingual and cross-target transfer of stance detection is moderately successful with this approach. (SwissText + KONVENS 2020.) Data and code are available at https://github.com/ZurichNLP/xstance
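The question-prepending setup maps directly onto BERT's standard sentence-pair input. A minimal sketch, assuming the multilingual BERT checkpoint named in the abstract and an invented example comment (the real data is in the repository above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def encode_instance(target_question: str, comment: str):
    """Encode (question, comment) as a BERT sentence pair, so a single
    stance classifier can condition on any target, including targets
    unseen at training time (cross-target transfer)."""
    return tokenizer(target_question, comment, truncation=True,
                     return_tensors="pt")

# Invented instance, for illustration only:
batch = encode_instance("Do you support a higher retirement age?",
                        "Our pension system will not stay solvent otherwise.")
```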
Mitigating Hallucinations and Off-target Machine Translation with Source-Contrastive and Language-Contrastive Decoding
Hallucinations and off-target translation remain unsolved problems in machine translation, especially for low-resource languages and massively multilingual models. In this paper, we introduce methods to mitigate both failure cases with a modified decoding objective, without requiring retraining or external models. In source-contrastive decoding, we search for a translation that is probable given the correct input, but improbable given a random input segment, hypothesising that hallucinations will be similarly probable given either. In language-contrastive decoding, we search for a translation that is probable, but improbable given the wrong language indicator token. In experiments on M2M-100 (418M) and SMaLL-100, we find that these methods effectively suppress hallucinations and off-target translations, improving chrF2 by 1.7 and 1.4 points on average across 57 tested translation directions. In a proof of concept on English–German, we also show that we can suppress off-target translations with the Llama 2 chat models, demonstrating the applicability of the method to machine translation with LLMs. We release our source code at https://github.com/ZurichNLP/ContraDecode
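At the level of a single decoding step, the modified objective can be sketched as a weighted difference of log-probabilities. This is a sketch of the idea as stated in the abstract; the hyperparameter name is ours, and the exact formulation is in the released code linked above.

```python
import torch

def contrastive_next_token_scores(logp_given_source: torch.Tensor,
                                  logp_given_contrast: torch.Tensor,
                                  weight: float = 0.5) -> torch.Tensor:
    """Rescore next-token log-probabilities: reward tokens probable given the
    correct input and penalize tokens that are also probable given the
    contrastive input (a random source segment, or the same source with a
    wrong language indicator token). Hallucinated continuations score about
    equally under both inputs, so the difference demotes them in beam search.
    Both arguments are (vocab,)-shaped tensors of log-probabilities.
    """
    return logp_given_source - weight * logp_given_contrast
```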
SwissBERT: The Multilingual Language Model for Switzerland
We present SwissBERT, a masked language model created specifically for processing Switzerland-related text. SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland: German, French, Italian, and Romansh. We evaluate SwissBERT on natural language understanding tasks related to Switzerland and find that it tends to outperform previous models on these tasks, especially when processing contemporary news and/or Romansh Grischun. Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work. The model and our open-source code are publicly released at https://github.com/ZurichNLP/swissbert
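SwissBERT's language adapters are switched at run time, in the style of the X-MOD architecture. A minimal usage sketch, assuming the Hugging Face checkpoint name and the X-MOD-style language codes; the linked repository has the authoritative usage.

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ZurichNLP/swissbert")
tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swissbert")

# Activate the adapter for one of the four national languages; the X-MOD
# API also accepts per-example language IDs in the forward pass.
model.set_default_language("de_CH")

inputs = tokenizer("Die Schweiz hat vier Landessprachen.", return_tensors="pt")
embeddings = model(**inputs).last_hidden_state
```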