Parameter-Efficient Neural Reranking for Cross-Lingual and Multilingual Retrieval
State-of-the-art neural (re)rankers are notoriously data-hungry, which, given
the lack of large-scale training data in languages other than English, makes
them rarely used in multilingual and cross-lingual retrieval settings. Current
approaches therefore typically transfer rankers trained on English data to
other languages and cross-lingual setups by means of multilingual encoders:
they fine-tune all the parameters of a pretrained massively multilingual
Transformer (MMT, e.g., multilingual BERT) on English relevance judgments and
then deploy it in the target language. In this work, we show that two
parameter-efficient approaches to cross-lingual transfer, namely Sparse
Fine-Tuning Masks (SFTMs) and Adapters, allow for a more lightweight and more
effective zero-shot transfer to multilingual and cross-lingual retrieval tasks.
We first train language adapters (or SFTMs) via Masked Language Modelling and
then train retrieval (i.e., reranking) adapters (SFTMs) on top while keeping
all other parameters fixed. At inference, this modular design allows us to
compose the ranker by applying the task adapter (or SFTM) trained with source
language data together with the language adapter (or SFTM) of a target
language. Besides improved transfer performance, these two approaches offer
faster ranker training, with only a fraction of parameters being updated
compared to full MMT fine-tuning. We evaluate our models on the CLEF-2003
benchmark, showing that our parameter-efficient methods outperform standard
zero-shot transfer with full MMT fine-tuning, while enabling modularity and
reducing training times. Further, using Swahili and Somali as examples, we show
that, for low(er)-resource languages, our parameter-efficient neural re-rankers
can improve the ranking of the competitive machine translation-based ranker.
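The efficiency claim above can be made concrete with a back-of-the-envelope calculation. The sketch below counts the trainable parameters of one bottleneck adapter per Transformer layer against full fine-tuning of the MMT body; the hidden size, bottleneck width, and total parameter count are illustrative assumptions at mBERT scale, not figures from the paper.

```python
# Hypothetical sizes loosely based on an mBERT-scale model; the exact
# numbers are illustrative assumptions, not figures from the paper.
HIDDEN = 768
LAYERS = 12
FULL_MODEL_PARAMS = 178_000_000  # rough mBERT size, embeddings included

def adapter_params(hidden: int, bottleneck: int, layers: int) -> int:
    """Trainable parameters when only bottleneck adapters are updated.

    One adapter per layer: a down-projection and an up-projection,
    each with weights and biases; the rest of the MMT stays frozen.
    """
    per_layer = (hidden * bottleneck + bottleneck) + (bottleneck * hidden + hidden)
    return per_layer * layers

trainable = adapter_params(HIDDEN, 64, LAYERS)
print(f"trainable adapter params: {trainable:,}")
print(f"fraction of full fine-tuning: {trainable / FULL_MODEL_PARAMS:.2%}")
```

With these assumed sizes, the adapters amount to well under 1% of the parameters touched by full MMT fine-tuning, which is what makes per-language modules cheap to train and swap at inference.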
Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?
Instruction-tuning has become an integral part of training pipelines for
Large Language Models (LLMs) and has been shown to yield strong performance
gains. In an orthogonal line of research, Annotation Error Detection (AED) has
emerged as a tool for detecting quality issues of gold-standard labels. But so
far, the application of AED methods is limited to discriminative settings. It
is an open question how well AED methods generalize to generative settings
which are becoming widespread via generative LLMs. In this work, we present a
first and new benchmark for AED on instruction-tuning data: Donkii. It
encompasses three instruction-tuning datasets enriched with annotations by
experts and semi-automatic methods. We find that all three datasets contain
clear-cut errors that sometimes directly propagate into instruction-tuned LLMs.
We propose four AED baselines for the generative setting and evaluate them
comprehensively on the newly introduced dataset. Our results demonstrate that
choosing the right AED method and model size is indeed crucial, and we derive
practical recommendations from them. To gain further insight, we provide a first
case study examining how the quality of instruction-tuning datasets
influences downstream performance.
Unsupervised Cross-Lingual Information Retrieval Using Monolingual Data Only
We propose a fully unsupervised framework for ad-hoc cross-lingual information retrieval (CLIR) which requires no bilingual data at all. The framework leverages shared cross-lingual word embedding spaces in which terms, queries, and documents can be represented, irrespective of their actual language. The shared embedding spaces are induced solely on the basis of monolingual corpora in two languages through an iterative process based on adversarial neural networks. Our experiments on the standard CLEF CLIR collections for three language pairs of varying degrees of language similarity (English-Dutch/Italian/Finnish) demonstrate the usefulness of the proposed fully unsupervised approach. Our CLIR models with unsupervised cross-lingual embeddings outperform baselines that utilize cross-lingual embeddings induced relying on word-level and document-level alignments. We then demonstrate that further improvements can be achieved by unsupervised ensemble CLIR models. We believe that the proposed framework is the first step towards the development of effective CLIR models for language pairs and domains where parallel data are scarce or non-existent.
Establishing Trustworthiness: Rethinking Tasks and Model Evaluation
Language understanding is a multi-faceted cognitive capability, which the
Natural Language Processing (NLP) community has striven to model
computationally for decades. Traditionally, facets of linguistic intelligence
have been compartmentalized into tasks with specialized model architectures and
corresponding evaluation protocols. With the advent of large language models
(LLMs) the community has witnessed a dramatic shift towards general purpose,
task-agnostic approaches powered by generative models. As a consequence, the
traditional compartmentalized notion of language tasks is breaking down,
followed by an increasing challenge for evaluation and analysis. At the same
time, LLMs are being deployed in more real-world scenarios, including
previously unforeseen zero-shot setups, increasing the need for trustworthy and
reliable systems. Therefore, we argue that it is time to rethink what
constitutes tasks and model evaluation in NLP, and pursue a more holistic view
on language, placing trustworthiness at the center. Towards this goal, we
review existing compartmentalized approaches for understanding the origins of a
model's functional capacity, and provide recommendations for more multi-faceted
evaluation protocols.
Comment: Accepted at EMNLP 2023 (Main Conference), camera-ready version.
ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual Open-retrieval Question Answering System
This paper introduces our proposed system for the MIA Shared Task on Cross-lingual Open-retrieval Question Answering (COQA). In this challenging scenario, given an input question, the system has to gather evidence documents from a multilingual pool and generate from them an answer in the language of the question. We devised several approaches combining different model variants for three main components: Data Augmentation, Passage Retrieval, and Answer Generation. For passage retrieval, we evaluated the monolingual BM25 ranker against an ensemble of re-rankers based on multilingual pretrained language models (PLMs), as well as variants of the shared task baseline, re-training it from scratch using a recently introduced contrastive loss that maintains a strong gradient signal throughout training by means of mixed negative samples. For answer generation, we focused on language- and domain-specialization by means of continued language model (LM) pretraining of existing multilingual encoders. Additionally, for both passage retrieval and answer generation, we augmented the training data provided by the task organizers with automatically generated question-answer pairs created from Wikipedia passages to mitigate the issue of data scarcity, particularly for the low-resource languages for which no training data were provided. Our results show that language- and domain-specialization as well as data augmentation help, especially for low-resource languages.
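The BM25 ranker used as the lexical baseline above is a standard term-weighting scheme. The following minimal sketch implements the common Okapi BM25 scoring formula over a toy corpus; it is a textbook version for illustration, not the shared-task configuration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document for a query with the standard Okapi BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for d in tokenized:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the capital of finland is helsinki",
    "helsinki hosts the university of helsinki",
    "rome is the capital of italy",
]
scores = bm25_scores("capital finland", docs)
print(scores)  # highest score for the first document
```

Note that BM25 is purely lexical: it only works monolingually (query and documents in the same language), which is exactly why the system pairs it with multilingual PLM-based re-rankers in the cross-lingual setting.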
CLEF 2000-2003 Query Translations (Uyghur, Kyrgyz, Turkish)
This dataset contains query translations from the Cross-Language Evaluation Forum (CLEF) 2000-2003 campaign for bilingual ad-hoc retrieval tracks. We translated English queries into Uyghur, Kyrgyz, and Turkish with Google Translate and had native speakers post-edit translation errors.
Data for paper: "Parameter-Efficient Neural Reranking for Cross-Lingual and Multilingual Retrieval"
This repository contains resources for the arXiv preprint abs/2204.02292.
Data for paper: "Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval"
Pretrained multilingual text encoders based on neural Transformer architectures, such as multilingual BERT (mBERT) and XLM, have achieved strong performance on a myriad of language understanding tasks. Consequently, they have been adopted as a go-to paradigm for multilingual and cross-lingual representation learning and transfer, rendering cross-lingual word embeddings (CLWEs) effectively obsolete. However, questions remain as to what extent this finding generalizes (1) to unsupervised settings and (2) to ad-hoc cross-lingual IR (CLIR) tasks. Therefore, in this work we present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a large number of language pairs. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR, a setup with no relevance judgments for IR-specific fine-tuning, pretrained encoders fail to significantly outperform models based on CLWEs. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved. However, the peak performance is not met using the general-purpose multilingual text encoders "off-the-shelf", but rather relying on their variants that have been further specialized for sentence understanding tasks.