
    Pretrained Transformers for Text Ranking: BERT and Beyond

    The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage architectures and dense retrieval techniques that perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.
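
    The survey's two high-level categories can be pictured with a short sketch. The snippet below is illustrative only and is not taken from the survey: the embed and rerank_score functions are toy stand-ins (a hashed bag-of-words encoder and plain token overlap) for a learned bi-encoder and a transformer cross-encoder, and in a real multi-stage system the first stage might instead be BM25.

```python
import zlib
import numpy as np

def embed(text, dim=256):
    """Toy stand-in for a learned bi-encoder: hashed bag-of-words embedding."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode()) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def rerank_score(query, doc):
    """Toy stand-in for a cross-encoder reranker: plain token overlap."""
    q_toks, d_toks = set(query.lower().split()), set(doc.lower().split())
    return len(q_toks & d_toks) / max(len(q_toks), 1)

corpus = [
    "transformers for text ranking",
    "dense retrieval with dual encoders",
    "classical bm25 baselines for web search",
]
query = "neural text ranking"

# Dense retrieval: score every document directly by embedding similarity.
doc_vecs = np.stack([embed(d) for d in corpus])
dense_scores = doc_vecs @ embed(query)
dense_order = list(np.argsort(-dense_scores))

# Multi-stage reranking: keep the top-k first-stage candidates, then rescore
# them with a (nominally more expensive, more accurate) second-stage model.
top_k = dense_order[:2]
reranked = sorted(top_k, key=lambda i: rerank_score(query, corpus[i]), reverse=True)
print("dense order:", dense_order, "| reranked top-k:", reranked)
```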

    Answering Consumer Health Questions on the Web

    Question answering is an important subtask in the field of information retrieval. Question answering has typically drawn on reliable sources of information such as Wikipedia. In this work, we look at answering health questions using the web. The web offers the means to answer general medical questions on a variety of topics but comes with the downside of being rife with misinformation and contradictory information. We develop our techniques using the TREC Health Misinformation tracks, which use consumer health questions as topics and web crawls as their document collections. In this work, we implement a document filtering technique based on topic-sensitive PageRank that uses a web graph of the hosts in Common Crawl. We develop a new passage extraction technique that performs query-based contextualized sentence selection. We test this technique on a multi-span extractive question answering dataset. We also develop an answer aggregation technique that combines language features and manual features to predict answers to these consumer health questions. We test all of these approaches on the TREC Health Misinformation Track and show that, in the majority of cases, these techniques provide an uplift in performance.
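
    The document filtering step mentioned above builds on topic-sensitive PageRank over a host graph. The sketch below shows only the generic technique (personalized PageRank via power iteration with a teleport vector concentrated on on-topic hosts); the host names, link structure, and damping factor are made up for illustration and are not the authors' actual Common Crawl graph or implementation.

```python
import numpy as np

# Toy host graph: adjacency[i, j] == 1 means hosts[i] links to hosts[j].
hosts = ["trusted-health.org", "blog.example.com", "misinfo.example.net", "news.example.com"]
adjacency = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
], dtype=float)

# Column-stochastic transition matrix: each host spreads its rank uniformly
# over its outgoing links.
out_degree = adjacency.sum(axis=1, keepdims=True)
transition = (adjacency / np.where(out_degree == 0, 1, out_degree)).T

# Topic-sensitive teleport vector: random-jump mass only on hosts judged
# on-topic / trustworthy for the health domain.
teleport = np.array([1.0, 0.0, 0.0, 1.0])
teleport /= teleport.sum()

def topic_sensitive_pagerank(transition, teleport, damping=0.85, iters=100):
    """Power iteration for personalized (topic-sensitive) PageRank."""
    rank = np.full(len(teleport), 1.0 / len(teleport))
    for _ in range(iters):
        rank = damping * transition @ rank + (1 - damping) * teleport
    return rank

scores = topic_sensitive_pagerank(transition, teleport)
for host, score in sorted(zip(hosts, scores), key=lambda x: -x[1]):
    print(f"{host}: {score:.3f}")
```

    Hosts with low scores under the topic-biased teleport vector can then be filtered out of the candidate document set before passage extraction.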

    EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval

    Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). In this paper, we aim to improve distillation methods that pave the way for the resource-efficient deployment of such models in practice. Inspired by our theoretical analysis of the teacher-student generalization gap for IR models, we propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model. Unlike existing teacher score-based distillation methods, our proposed approach employs embedding matching tasks to provide a stronger signal to align the representations of the teacher and student models. In addition, it utilizes query generation to explore the data manifold and reduce the discrepancies between the student and the teacher where training data is sparse. Furthermore, our analysis also motivates novel asymmetric architectures for student models that realize better embedding alignment without increasing online inference cost. On standard benchmarks like MSMARCO, we show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th-size asymmetric students that can retain 95-97% of the teacher performance.
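
    The core idea of combining score-based distillation with embedding matching can be sketched as follows. This is a toy illustration under assumptions that are not from the paper: the loss is a simple sum of a score-matching term and mean-squared embedding-matching terms weighted by lam, and it assumes the student embedding dimension equals the teacher's (in practice a projection layer would typically bridge differing dimensions).

```python
import numpy as np

def embedding_matching_distillation_loss(teacher_q, teacher_d, student_q, student_d, lam=1.0):
    """Toy distillation objective: score matching plus embedding matching.

    teacher_q, student_q: (batch, dim) query embeddings
    teacher_d, student_d: (batch, dim) document embeddings
    """
    # Score-based KD: match teacher and student query-document dot-product scores.
    teacher_scores = np.sum(teacher_q * teacher_d, axis=1)
    student_scores = np.sum(student_q * student_d, axis=1)
    score_loss = np.mean((teacher_scores - student_scores) ** 2)

    # Embedding matching: pull student representations toward the teacher's,
    # so the relative geometry of queries and documents is preserved.
    embed_loss = (np.mean((teacher_q - student_q) ** 2)
                  + np.mean((teacher_d - student_d) ** 2))

    return score_loss + lam * embed_loss

# Synthetic example: the student starts as a noisy copy of the teacher.
rng = np.random.default_rng(0)
tq, td = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
sq, sd = tq + 0.1 * rng.normal(size=(4, 8)), td + 0.1 * rng.normal(size=(4, 8))
print(embedding_matching_distillation_loss(tq, td, sq, sd))
```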