17 research outputs found
Pretrained Transformers for Text Ranking: BERT and Beyond
The goal of text ranking is to generate an ordered list of texts retrieved
from a corpus in response to a query. Although the most common formulation of
text ranking is search, instances of the task can also be found in many natural
language processing applications. This survey provides an overview of text
ranking with neural network architectures known as transformers, of which BERT
is the best-known example. The combination of transformers and self-supervised
pretraining has been responsible for a paradigm shift in natural language
processing (NLP), information retrieval (IR), and beyond. In this survey, we
provide a synthesis of existing work as a single point of entry for
practitioners who wish to gain a better understanding of how to apply
transformers to text ranking problems and researchers who wish to pursue work
in this area. We cover a wide range of modern techniques, grouped into two
high-level categories: transformer models that perform reranking in multi-stage
architectures and dense retrieval techniques that perform ranking directly.
There are two themes that pervade our survey: techniques for handling long
documents, beyond typical sentence-by-sentence processing in NLP, and
techniques for addressing the tradeoff between effectiveness (i.e., result
quality) and efficiency (e.g., query latency, model and index size). Although
transformer architectures and pretraining techniques are recent innovations,
many aspects of how they are applied to text ranking are relatively well
understood and represent mature techniques. However, there remain many open
research questions, and thus in addition to laying out the foundations of
pretrained transformers for text ranking, this survey also attempts to
prognosticate where the field is heading
Query-as-context Pre-training for Dense Passage Retrieval
Recently, methods have been developed to improve the performance of dense
passage retrieval by using context-supervised pre-training. These methods
simply consider two passages from the same document to be relevant, without
taking into account the possibility of weakly correlated pairs. Thus, this
paper proposes query-as-context pre-training, a simple yet effective
pre-training technique to alleviate the issue. Query-as-context pre-training
assumes that the query derived from a passage is more likely to be relevant to
that passage and forms a passage-query pair. These passage-query pairs are then
used in contrastive or generative context-supervised pre-training. The
pre-trained models are evaluated on large-scale passage retrieval benchmarks
and out-of-domain zero-shot benchmarks. Experimental results show that
query-as-context pre-training brings considerable gains and meanwhile speeds up
training, demonstrating its effectiveness and efficiency. Our code will be
available at https://github.com/caskcsg/ir/tree/main/cotmae-qc .Comment: EMNLP 2023 Main Conferenc
SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval
Traditionally, sparse retrieval systems relied on lexical representations to
retrieve documents, such as BM25, dominated information retrieval tasks. With
the onset of pre-trained transformer models such as BERT, neural sparse
retrieval has led to a new paradigm within retrieval. Despite the success,
there has been limited software supporting different sparse retrievers running
in a unified, common environment. This hinders practitioners from fairly
comparing different sparse models and obtaining realistic evaluation results.
Another missing piece is, that a majority of prior work evaluates sparse
retrieval models on in-domain retrieval, i.e. on a single dataset: MS MARCO.
However, a key requirement in practical retrieval systems requires models that
can generalize well to unseen out-of-domain, i.e. zero-shot retrieval tasks. In
this work, we provide SPRINT, a unified Python toolkit based on Pyserini and
Lucene, supporting a common interface for evaluating neural sparse retrieval.
The toolkit currently includes five built-in models: uniCOIL, DeepImpact,
SPARTA, TILDEv2 and SPLADEv2. Users can also easily add customized models by
defining their term weighting method. Using our toolkit, we establish strong
and reproducible zero-shot sparse retrieval baselines across the
well-acknowledged benchmark, BEIR. Our results demonstrate that SPLADEv2
achieves the best average score of 0.470 nDCG@10 on BEIR amongst all neural
sparse retrievers. In this work, we further uncover the reasons behind its
performance gain. We show that SPLADEv2 produces sparse representations with a
majority of tokens outside of the original query and document which is often
crucial for its performance gains, i.e. a limitation among its other sparse
counterparts. We provide our SPRINT toolkit, models, and data used in our
experiments publicly here at https://github.com/thakur-nandan/sprint.Comment: Accepted at SIGIR 2023 (Resource Track
Noisy Self-Training with Synthetic Queries for Dense Retrieval
Although existing neural retrieval models reveal promising results when
training data is abundant and the performance keeps improving as training data
increases, collecting high-quality annotated data is prohibitively costly. To
this end, we introduce a novel noisy self-training framework combined with
synthetic queries, showing that neural retrievers can be improved in a
self-evolution manner with no reliance on any external models. Experimental
results show that our method improves consistently over existing methods on
both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval
benchmarks. Extra analysis on low-resource settings reveals that our method is
data efficient and outperforms competitive baselines, with as little as 30% of
labelled training data. Further extending the framework for reranker training
demonstrates that the proposed method is general and yields additional gains on
tasks of diverse domains.\footnote{Source code is available at
\url{https://github.com/Fantabulous-J/Self-Training-DPR}}Comment: Accepted by EMNLP 2023 Finding
Information Retrieval: Recent Advances and Beyond
In this paper, we provide a detailed overview of the models used for
information retrieval in the first and second stages of the typical processing
chain. We discuss the current state-of-the-art models, including methods based
on terms, semantic retrieval, and neural. Additionally, we delve into the key
topics related to the learning process of these models. This way, this survey
offers a comprehensive understanding of the field and is of interest for for
researchers and practitioners entering/working in the information retrieval
domain
Synergistic Interplay between Search and Large Language Models for Information Retrieval
Information retrieval (IR) plays a crucial role in locating relevant
resources from vast amounts of data, and its applications have evolved from
traditional knowledge bases to modern retrieval models (RMs). The emergence of
large language models (LLMs) has further revolutionized the IR field by
enabling users to interact with search systems in natural languages. In this
paper, we explore the advantages and disadvantages of LLMs and RMs,
highlighting their respective strengths in understanding user-issued queries
and retrieving up-to-date information. To leverage the benefits of both
paradigms while circumventing their limitations, we propose InteR, a novel
framework that facilitates information refinement through synergy between RMs
and LLMs. InteR allows RMs to expand knowledge in queries using LLM-generated
knowledge collections and enables LLMs to enhance prompt formulation using
retrieved documents. This iterative refinement process augments the inputs of
RMs and LLMs, leading to more accurate retrieval. Experiments on large-scale
retrieval benchmarks involving web search and low-resource retrieval tasks
demonstrate that InteR achieves overall superior zero-shot retrieval
performance compared to state-of-the-art methods, even those using relevance
judgment. Source code is available at https://github.com/Cyril-JZ/InteRComment: Pre-print. Work in progres