Topic-DPR: Topic-based Prompts for Dense Passage Retrieval
Prompt-based learning's efficacy across numerous natural language processing
tasks has led to its integration into dense passage retrieval. Prior research
has mainly focused on enhancing the semantic understanding of pre-trained
language models by optimizing a single vector as a continuous prompt. This
approach, however, leads to a semantic space collapse; identical semantic
information seeps into all representations, causing their distributions to
converge in a restricted region. This hinders differentiation between relevant
and irrelevant passages during dense retrieval. To tackle this issue, we
present Topic-DPR, a dense passage retrieval model that uses topic-based
prompts. Unlike the single prompt method, multiple topic-based prompts are
established over a probabilistic simplex and optimized simultaneously through
contrastive learning. This encourages representations to align with their topic
distributions, improving space uniformity. Furthermore, we introduce a novel
positive and negative sampling strategy, leveraging semi-structured data to
boost dense retrieval efficiency. Experimental results from two datasets affirm
that our method surpasses previous state-of-the-art retrieval techniques. Comment: Findings of EMNLP 202
ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual Open-retrieval Question Answering System
This paper introduces our proposed system for the MIA Shared Task on Cross-lingual Open-retrieval Question Answering (COQA). In this challenging scenario, given an input question, the system has to gather evidence documents from a multilingual pool and generate from them an answer in the language of the question. We devised several approaches combining different model variants for three main components: Data Augmentation, Passage Retrieval, and Answer Generation. For passage retrieval, we evaluated the monolingual BM25 ranker against an ensemble of re-rankers based on multilingual pretrained language models (PLMs) and also against variants of the shared task baseline, re-training it from scratch using a recently introduced contrastive loss that maintains a strong gradient signal throughout training by means of mixed negative samples. For answer generation, we focused on language- and domain-specialization by means of continued language model (LM) pretraining of existing multilingual encoders. Additionally, for both passage retrieval and answer generation, we augmented the training data provided by the task organizers with automatically generated question-answer pairs created from Wikipedia passages to mitigate the issue of data scarcity, particularly for the low-resource languages for which no training data were provided. Our results show that language- and domain-specialization as well as data augmentation help, especially for low-resource languages.
Expansion via Prediction of Importance with Contextualization
The identification of relevance with little textual context is a primary
challenge in passage retrieval. We address this problem with a
representation-based ranking approach that: (1) explicitly models the
importance of each term using a contextualized language model; (2) performs
passage expansion by propagating the importance to similar terms; and (3)
grounds the representations in the lexicon, making them interpretable. Passage
representations can be pre-computed at index time to reduce query-time latency.
We call our approach EPIC (Expansion via Prediction of Importance with
Contextualization). We show that EPIC significantly outperforms prior
importance-modeling and document expansion approaches. We also observe that the
performance is additive with the current leading first-stage retrieval methods,
further narrowing the gap between inexpensive and cost-prohibitive passage
ranking approaches. Specifically, EPIC achieves an MRR@10 of 0.304 on the
MS-MARCO passage ranking dataset with 78ms average query latency on commodity
hardware. We also find that the latency is further reduced to 68ms by pruning
document representations, with virtually no difference in effectiveness. Comment: Accepted at SIGIR 2020 (short
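For readers unfamiliar with the MRR@10 figure quoted above, the metric can be sketched as follows. This is a minimal illustration and not the official MS-MARCO evaluation script; the function name `mrr_at_k` and the toy query and passage ids are assumptions made for the example.

```python
from typing import Dict, List, Set


def mrr_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k.

    A query with no relevant passage in its top k contributes 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked in rankings.items():
        rel = relevant.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


# Toy data (hypothetical ids): q1 hits at rank 2, q2 hits at rank 3.
toy_rankings = {"q1": ["p3", "p1"], "q2": ["p9", "p8", "p7"]}
toy_relevant = {"q1": {"p1"}, "q2": {"p7"}}
print(mrr_at_k(toy_rankings, toy_relevant))  # (1/2 + 1/3) / 2, about 0.417
```

Because only the first relevant passage per query is counted, the metric rewards putting one good answer near the top rather than retrieving many relevant passages.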
I^3 Retriever: Incorporating Implicit Interaction in Pre-trained Language Models for Passage Retrieval
Passage retrieval is a fundamental task in many information systems, such as
web search and question answering, where both efficiency and effectiveness are
critical concerns. In recent years, neural retrievers based on pre-trained
language models (PLM), such as dual-encoders, have achieved huge success. Yet,
studies have found that the performance of dual-encoders is often limited
because they neglect the interaction information between queries and candidate
passages. Therefore, various interaction paradigms have been proposed to
improve the performance of vanilla dual-encoders. Particularly, recent
state-of-the-art methods often introduce late-interaction during the model
inference process. However, such late-interaction-based methods usually incur
substantial computation and storage costs on large corpora. Despite their
effectiveness, the concern of efficiency and space footprint is still an
important factor that limits the application of interaction-based neural
retrieval models. To tackle this issue, we incorporate implicit interaction
into dual-encoders, and propose I^3 retriever. In particular, our implicit
interaction paradigm leverages generated pseudo-queries to simulate
query-passage interaction, which is jointly optimized with the query and passage
encoders in an end-to-end manner. It can be fully pre-computed and cached, and
its inference process involves only a simple dot product of the query
vector and passage vector, which makes it as efficient as vanilla dual
encoders. We conduct comprehensive experiments on the MSMARCO and TREC 2019 Deep
Learning datasets, demonstrating the I^3 retriever's superiority in terms of
both effectiveness and efficiency. Moreover, the proposed implicit interaction
is compatible with special pre-training and knowledge distillation for passage
retrieval, which yields new state-of-the-art performance. Comment: 10 page
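The inference step described above, a dot product between a query vector and pre-computed passage vectors, can be sketched as follows. This is a generic dual-encoder scoring illustration and does not implement the pseudo-query interaction of the I^3 retriever; the names `dot` and `retrieve`, the toy two-dimensional vectors, and the passage ids are all assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def retrieve(
    query_vec: Sequence[float],
    passage_index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every cached passage vector against the query and keep the top k."""
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_index.items()]
    # Sort by descending score; break ties by passage id for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Passage vectors would be computed once at index time and cached (toy values).
index = {"p1": [1.0, 0.0], "p2": [0.5, 0.5], "p3": [0.0, 1.0]}
for pid, score in retrieve([1.0, 0.2], index, k=2):
    print(pid, round(score, 3))
```

In practice the exhaustive loop would be replaced by an approximate nearest-neighbour index, but the scoring function stays the same, which is why query-time cost matches that of a vanilla dual-encoder.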
Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval
Pseudo-relevance feedback mechanisms, from Rocchio to the relevance models,
have shown the usefulness of expanding and reweighting the users' initial
queries using information occurring in an initial set of retrieved documents,
known as the pseudo-relevant set. Recently, dense retrieval -- through the use
of neural contextual language models such as BERT for analysing the documents'
and queries' contents and computing their relevance scores -- has shown
promising performance on several information retrieval tasks still relying on
the traditional inverted index for identifying documents relevant to a query.
Two different dense retrieval families have emerged: the use of single embedded
representations for each passage and query (e.g. using BERT's [CLS] token), or
via multiple representations (e.g. using an embedding for each token of the
query and document). In this work, we conduct the first study into the
potential for multiple representation dense retrieval to be enhanced using
pseudo-relevance feedback. In particular, based on the pseudo-relevant set of
documents identified using a first-pass dense retrieval, we extract
representative feedback embeddings (using KMeans clustering) -- while ensuring
that these embeddings discriminate among passages (based on IDF) -- which are
then added to the query representation. These additional feedback embeddings
are shown to enhance the effectiveness of both a reranking and an
additional dense retrieval operation. Indeed, experiments on the MSMARCO
passage ranking dataset show that MAP can be improved by up to 26% on the TREC
2019 query set and 10% on the TREC 2020 query set by the application of our
proposed ColBERT-PRF method on a ColBERT dense retrieval approach. Comment: 10 page
KGI: An Integrated Framework for Knowledge Intensive Language Tasks
In a recent work, we presented a novel state-of-the-art approach to zero-shot
slot filling that extends dense passage retrieval with hard negatives and
robust training procedures for retrieval augmented generation models. In this
paper, we propose a system based on an enhanced version of this approach where
we train task specific models for other knowledge intensive language tasks,
such as open domain question answering (QA), dialogue and fact checking. Our
system achieves results comparable to the best models in the KILT leaderboards.
Moreover, given a user query, we show how the output from these different
models can be combined to cross-examine each other. Particularly, we show how
accuracy in dialogue can be improved using the QA model. A short video
demonstrating the system is available at
https://ibm.box.com/v/kgi-interactive-demo
GAR-meets-RAG Paradigm for Zero-Shot Information Retrieval
Given a query and a document corpus, the information retrieval (IR) task is
to output a ranked list of relevant documents. Combining large language models
(LLMs) with embedding-based retrieval models, recent work shows promising
results on the zero-shot retrieval problem, i.e., no access to labeled data
from the target domain. Two such popular paradigms are generation-augmented
retrieval or GAR (generate additional context for the query and then retrieve),
and retrieval-augmented generation or RAG (retrieve relevant documents as
context and then generate answers). The success of these paradigms hinges on
(i) high-recall retrieval models, which are difficult to obtain in the
zero-shot setting, and (ii) high-precision (re-)ranking models which typically
need a good initialization. In this work, we propose a novel GAR-meets-RAG
recurrence formulation that overcomes the challenges of existing paradigms. Our
method iteratively improves retrieval (via GAR) and rewrite (via RAG) stages in
the zero-shot setting. A key design principle is that the rewrite-retrieval
stages improve the recall of the system and a final re-ranking stage improves
the precision. We conduct extensive experiments on zero-shot passage retrieval
benchmarks, BEIR and TREC-DL. Our method establishes a new state-of-the-art in
the BEIR benchmark, outperforming previous best results in Recall@100 and
nDCG@10 metrics on 6 out of 8 datasets, with up to 17% relative gains over the
previous best. Comment: preprin
ANSWER SIMILARITY GROUPING AND DIVERSIFICATION IN QUESTION ANSWERING SYSTEMS
The rise in popularity of mobile and voice search has led to a shift in IR from document to passage retrieval for non-factoid questions. Various datasets such as MSMarco, as well as efficient retrieval models have been developed to identify single best answer passages for this task. However, such models do not specifically address questions which could have multiple or alternative answers. In this dissertation, we focus on this new research area that involves studying answer passage relationships and how this could be applied to passage retrieval tasks.
We first create a high-quality dataset for the answer passage similarity task in the context of question answering. Manual annotation of passage pairs is performed to set the similarity labels, from which answer group information is automatically generated. We next investigate different types of representations that could be used to create effective clusters. We experiment with various unsupervised representations and show that distributional representations outperform term-based representations for this task. Next, weak supervision is leveraged to further improve the cluster modeling performance. We use BERT as the underlying model for training and show the relative performance of various weak signals such as GloVe and term-based Language Modeling for this task. In order to apply these clusters to the answer passage retrieval task for multi-answer questions, we use a modified version of the Maximal Marginal Relevance (MMR) diversification model. We demonstrate that answers retrieved using this model are more diverse, i.e., they cover more answer types with low redundancy while maximizing relevance, relative to the baselines. So far, we used passage clustering as a means to identify answer groups corresponding to a question and apply them in a question answering task. We extend this a step further by looking at related questions within a conversation. For this purpose, we expand the definition of Reciprocal Rank Fusion (RRF) and use this to identify pertinent history passages for such questions. Updated question rewrites generated using these passages are then used to improve the conversational search task.
In addition to being the first work that looks at answer relationships, our specific contributions can be summarized as follows: (1) Creation of new datasets with passage similarity and answer type information; (2) Effective passage similarity clustering models using unsupervised representations and weak supervision methods; (3) Applying the passage similarity/clustering information to a diversification framework; (4) Identifying good response history candidates using answer passage clustering for the conversational search task.
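The abstract above builds on two standard components, Maximal Marginal Relevance (MMR) and Reciprocal Rank Fusion (RRF). The sketch below shows the textbook forms of both, not the modified MMR or the expanded RRF definition used in the dissertation. The function names, the trade-off weight `lam`, the smoothing constant `k=60`, and the toy answer groups are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple


def mmr(
    relevance: Dict[str, float],
    similarity: Callable[[str, str], float],
    k: int,
    lam: float = 0.5,
) -> List[str]:
    """Greedy MMR: trade relevance off against similarity to already-picked items."""
    selected: List[str] = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def score(pid: str) -> float:
            redundancy = max((similarity(pid, s) for s in selected), default=0.0)
            return lam * relevance[pid] - (1.0 - lam) * redundancy
        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        candidates.remove(best)
    return selected


def rrf(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Standard RRF: sum 1 / (k + rank) over every ranking an item appears in."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] = scores.get(pid, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy example: a1 and a2 belong to the same answer group, b1 to another.
groups = {"a1": "A", "a2": "A", "b1": "B"}
same_group = lambda x, y: 1.0 if groups[x] == groups[y] else 0.0
print(mmr({"a1": 0.9, "a2": 0.85, "b1": 0.6}, same_group, k=2))
print([pid for pid, _ in rrf([["x", "y", "z"], ["y", "x"], ["y", "z"]])])
```

With answer-group membership as the similarity signal, MMR skips the near-duplicate `a2` in favour of `b1`, which is the kind of diversification across answer types that the abstract describes.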