
    Topic-DPR: Topic-based Prompts for Dense Passage Retrieval

    Prompt-based learning's efficacy across numerous natural language processing tasks has led to its integration into dense passage retrieval. Prior research has mainly focused on enhancing the semantic understanding of pre-trained language models by optimizing a single vector as a continuous prompt. This approach, however, leads to a semantic space collapse; identical semantic information seeps into all representations, causing their distributions to converge in a restricted region. This hinders differentiation between relevant and irrelevant passages during dense retrieval. To tackle this issue, we present Topic-DPR, a dense passage retrieval model that uses topic-based prompts. Unlike the single prompt method, multiple topic-based prompts are established over a probabilistic simplex and optimized simultaneously through contrastive learning. This encourages representations to align with their topic distributions, improving space uniformity. Furthermore, we introduce a novel positive and negative sampling strategy, leveraging semi-structured data to boost dense retrieval efficiency. Experimental results from two datasets affirm that our method surpasses previous state-of-the-art retrieval techniques. Comment: Findings of EMNLP 202

    ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual Open-retrieval Question Answering System

    This paper introduces our proposed system for the MIA Shared Task on Cross-lingual Open-retrieval Question Answering (COQA). In this challenging scenario, given an input question, the system has to gather evidence documents from a multilingual pool and generate from them an answer in the language of the question. We devised several approaches combining different model variants for three main components: Data Augmentation, Passage Retrieval, and Answer Generation. For passage retrieval, we evaluated the monolingual BM25 ranker against the ensemble of re-rankers based on multilingual pretrained language models (PLMs) and also variants of the shared task baseline, re-training it from scratch using a recently introduced contrastive loss that maintains a strong gradient signal throughout training by means of mixed negative samples. For answer generation, we focused on language- and domain-specialization by means of continued language model (LM) pretraining of existing multilingual encoders. Additionally, for both passage retrieval and answer generation, we augmented the training data provided by the task organizers with automatically generated question-answer pairs created from Wikipedia passages to mitigate the issue of data scarcity, particularly for the low-resource languages for which no training data were provided. Our results show that language- and domain-specialization as well as data augmentation help, especially for low-resource languages.

    Expansion via Prediction of Importance with Contextualization

    The identification of relevance with little textual context is a primary challenge in passage retrieval. We address this problem with a representation-based ranking approach that: (1) explicitly models the importance of each term using a contextualized language model; (2) performs passage expansion by propagating the importance to similar terms; and (3) grounds the representations in the lexicon, making them interpretable. Passage representations can be pre-computed at index time to reduce query-time latency. We call our approach EPIC (Expansion via Prediction of Importance with Contextualization). We show that EPIC significantly outperforms prior importance-modeling and document expansion approaches. We also observe that the performance is additive with the current leading first-stage retrieval methods, further narrowing the gap between inexpensive and cost-prohibitive passage ranking approaches. Specifically, EPIC achieves an MRR@10 of 0.304 on the MS-MARCO passage ranking dataset with 78ms average query latency on commodity hardware. We also find that the latency is further reduced to 68ms by pruning document representations, with virtually no difference in effectiveness. Comment: Accepted at SIGIR 2020 (short
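
    For readers unfamiliar with the MRR@10 figure quoted above, the following is a minimal sketch of how the metric is commonly computed: the reciprocal rank of the first relevant passage within the top 10 results, averaged over queries. The function name `mrr_at_k` and the toy passage ids are illustrative assumptions and do not come from the EPIC paper or MS-MARCO.

```python
def mrr_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_ids_per_query)


# Toy data (hypothetical passage ids, not taken from MS-MARCO).
rankings = [["p3", "p1", "p7"], ["p2", "p9", "p4"], ["p5", "p6", "p8"]]
relevant = [{"p1"}, {"p2"}, {"p0"}]

# Query 1 hits at rank 2, query 2 at rank 1, query 3 has no hit.
score = mrr_at_k(rankings, relevant)
print(score)
```

    A query with no relevant passage in the top k contributes zero, which is why the metric rewards systems that place a correct passage near the top of the list.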

    I^3 Retriever: Incorporating Implicit Interaction in Pre-trained Language Models for Passage Retrieval

    Passage retrieval is a fundamental task in many information systems, such as web search and question answering, where both efficiency and effectiveness are critical concerns. In recent years, neural retrievers based on pre-trained language models (PLM), such as dual-encoders, have achieved huge success. Yet, studies have found that the performance of dual-encoders is often limited because they neglect the interaction information between queries and candidate passages. Therefore, various interaction paradigms have been proposed to improve the performance of vanilla dual-encoders. Particularly, recent state-of-the-art methods often introduce late-interaction during the model inference process. However, such late-interaction based methods usually bring extensive computation and storage cost on large corpora. Despite their effectiveness, the concern of efficiency and space footprint is still an important factor that limits the application of interaction-based neural retrieval models. To tackle this issue, we incorporate implicit interaction into dual-encoders, and propose I^3 retriever. In particular, our implicit interaction paradigm leverages generated pseudo-queries to simulate query-passage interaction, which is jointly optimized with the query and passage encoders in an end-to-end manner. It can be fully pre-computed and cached, and its inference process only involves a simple dot product of the query vector and passage vector, which makes it as efficient as the vanilla dual encoders. We conduct comprehensive experiments on MSMARCO and TREC2019 Deep Learning Datasets, demonstrating the I^3 retriever's superiority in terms of both effectiveness and efficiency. Moreover, the proposed implicit interaction is compatible with special pre-training and knowledge distillation for passage retrieval, which brings a new state-of-the-art performance. Comment: 10 page
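
    The inference step described above, a dot product between a query vector and pre-computed passage vectors, can be sketched as follows. This is only an illustration of generic dual-encoder scoring under assumed toy vectors; it does not reproduce the I^3 retriever's encoders or its pseudo-query mechanism, and the names `retrieve` and `passage_index` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec, passage_index, top_k=2):
    """Score every cached passage vector against the query and return the top k."""
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Passage vectors are assumed to be computed once, offline, and cached.
passage_index = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.1],
    "p3": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.2]  # assumed output of a query encoder

top = retrieve(query_vec, passage_index)
print(top)
```

    Because the passage side is fixed at index time, query-time cost is one encoder call plus vector scoring, which is the efficiency property the abstract attributes to vanilla dual-encoders.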

    Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval

    Pseudo-relevance feedback mechanisms, from Rocchio to the relevance models, have shown the usefulness of expanding and reweighting the users' initial queries using information occurring in an initial set of retrieved documents, known as the pseudo-relevant set. Recently, dense retrieval -- through the use of neural contextual language models such as BERT for analysing the documents' and queries' contents and computing their relevance scores -- has shown a promising performance on several information retrieval tasks still relying on the traditional inverted index for identifying documents relevant to a query. Two different dense retrieval families have emerged: the use of single embedded representations for each passage and query (e.g. using BERT's [CLS] token), or via multiple representations (e.g. using an embedding for each token of the query and document). In this work, we conduct the first study into the potential for multiple representation dense retrieval to be enhanced using pseudo-relevance feedback. In particular, based on the pseudo-relevant set of documents identified using a first-pass dense retrieval, we extract representative feedback embeddings (using KMeans clustering) -- while ensuring that these embeddings discriminate among passages (based on IDF) -- which are then added to the query representation. These additional feedback embeddings are shown to enhance the effectiveness of both a reranking and an additional dense retrieval operation. Indeed, experiments on the MSMARCO passage ranking dataset show that MAP can be improved by up to 26% on the TREC 2019 query set and 10% on the TREC 2020 query set by the application of our proposed ColBERT-PRF method on a ColBERT dense retrieval approach. Comment: 10 page

    KGI: An Integrated Framework for Knowledge Intensive Language Tasks

    In a recent work, we presented a novel state-of-the-art approach to zero-shot slot filling that extends dense passage retrieval with hard negatives and robust training procedures for retrieval augmented generation models. In this paper, we propose a system based on an enhanced version of this approach where we train task specific models for other knowledge intensive language tasks, such as open domain question answering (QA), dialogue and fact checking. Our system achieves results comparable to the best models in the KILT leaderboards. Moreover, given a user query, we show how the output from these different models can be combined to cross-examine each other. Particularly, we show how accuracy in dialogue can be improved using the QA model. A short video demonstrating the system is available at https://ibm.box.com/v/kgi-interactive-demo

    GAR-meets-RAG Paradigm for Zero-Shot Information Retrieval

    Given a query and a document corpus, the information retrieval (IR) task is to output a ranked list of relevant documents. Combining large language models (LLMs) with embedding-based retrieval models, recent work shows promising results on the zero-shot retrieval problem, i.e., no access to labeled data from the target domain. Two such popular paradigms are generation-augmented retrieval or GAR (generate additional context for the query and then retrieve), and retrieval-augmented generation or RAG (retrieve relevant documents as context and then generate answers). The success of these paradigms hinges on (i) high-recall retrieval models, which are difficult to obtain in the zero-shot setting, and (ii) high-precision (re-)ranking models which typically need a good initialization. In this work, we propose a novel GAR-meets-RAG recurrence formulation that overcomes the challenges of existing paradigms. Our method iteratively improves retrieval (via GAR) and rewrite (via RAG) stages in the zero-shot setting. A key design principle is that the rewrite-retrieval stages improve the recall of the system and a final re-ranking stage improves the precision. We conduct extensive experiments on zero-shot passage retrieval benchmarks, BEIR and TREC-DL. Our method establishes a new state-of-the-art in the BEIR benchmark, outperforming previous best results in Recall@100 and nDCG@10 metrics on 6 out of 8 datasets, with up to 17% relative gains over the previous best. Comment: preprin