The Potential of Learned Index Structures for Index Compression
Inverted indexes are vital in providing fast keyword-based search. For every
term in the document collection, a list of identifiers of documents in which
the term appears is stored, along with auxiliary information such as term
frequencies and position offsets. While very effective, inverted indexes have
large memory requirements for web-sized collections. Recently, the concept of
learned index structures was introduced, where machine-learned models replace
common index structures such as B-tree indexes, hash indexes, and
Bloom filters. These learned index structures require less memory, and can be
computationally much faster than their traditional counterparts. In this paper,
we consider whether such models may be applied to conjunctive Boolean querying.
First, we investigate how a learned model can replace document postings of an
inverted index, and then evaluate the compromises such an approach might have.
Second, we evaluate the potential gains that can be achieved in terms of memory
requirements. Our work shows that learned models have great potential in
inverted indexing, and this direction seems to be a promising area for future
research.
Comment: Will appear in the proceedings of ADCS'1
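The abstract's core idea, a learned model standing in for the document postings of an inverted index, can be illustrated with a toy sketch. This is not the authors' method: the linear "model" fitted from the endpoints and the explicit residual array are stand-ins for a trained model and a compressed correction structure.

```python
class LearnedPostings:
    """Approximate a sorted postings list with a linear model plus a
    per-entry correction, instead of storing raw doc IDs directly.
    Illustrative sketch only: a real learned index would train a model
    and store the (small) corrections in compressed form."""

    def __init__(self, doc_ids):
        n = len(doc_ids)
        # Crude "model": fit doc_id ~ slope * rank + intercept from endpoints.
        self.intercept = doc_ids[0]
        self.slope = (doc_ids[-1] - doc_ids[0]) / max(n - 1, 1)
        # Residuals capture what the model gets wrong; the smaller they
        # are, the better the list compresses.
        self.residuals = [d - round(self.intercept + self.slope * i)
                          for i, d in enumerate(doc_ids)]
        self.n = n

    def get(self, rank):
        """Reconstruct the doc ID at a given position in the list."""
        return round(self.intercept + self.slope * rank) + self.residuals[rank]

postings = [3, 7, 9, 15, 21, 22, 30]
model = LearnedPostings(postings)
assert [model.get(i) for i in range(model.n)] == postings
```

By construction the reconstruction is exact; the memory question the paper studies is whether the model-plus-residuals representation is smaller than conventional postings compression.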
Pretrained Transformers for Text Ranking: BERT and Beyond
The goal of text ranking is to generate an ordered list of texts retrieved
from a corpus in response to a query. Although the most common formulation of
text ranking is search, instances of the task can also be found in many natural
language processing applications. This survey provides an overview of text
ranking with neural network architectures known as transformers, of which BERT
is the best-known example. The combination of transformers and self-supervised
pretraining has been responsible for a paradigm shift in natural language
processing (NLP), information retrieval (IR), and beyond. In this survey, we
provide a synthesis of existing work as a single point of entry for
practitioners who wish to gain a better understanding of how to apply
transformers to text ranking problems and researchers who wish to pursue work
in this area. We cover a wide range of modern techniques, grouped into two
high-level categories: transformer models that perform reranking in multi-stage
architectures and dense retrieval techniques that perform ranking directly.
There are two themes that pervade our survey: techniques for handling long
documents, beyond typical sentence-by-sentence processing in NLP, and
techniques for addressing the tradeoff between effectiveness (i.e., result
quality) and efficiency (e.g., query latency, model and index size). Although
transformer architectures and pretraining techniques are recent innovations,
many aspects of how they are applied to text ranking are relatively well
understood and represent mature techniques. However, there remain many open
research questions, and thus in addition to laying out the foundations of
pretrained transformers for text ranking, this survey also attempts to
prognosticate where the field is heading.
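The survey's two high-level categories can be made concrete with a toy multi-stage pipeline: a cheap first-stage retriever narrows the corpus to k candidates, which a more expensive model then reranks. The scorers below are placeholders (term overlap and a length penalty), not BM25 or a BERT cross-encoder.

```python
def multistage_rank(query, corpus, first_stage, reranker, k=100):
    """Two-stage ranking: a cheap first-stage scorer selects k
    candidates, then an expensive reranker reorders only those k.
    The expensive model never sees the rest of the corpus."""
    candidates = sorted(corpus, key=lambda d: first_stage(query, d),
                        reverse=True)[:k]
    return sorted(candidates, key=lambda d: reranker(query, d),
                  reverse=True)

# Toy scorers standing in for BM25 and a neural cross-encoder.
docs = ["neural ranking with BERT", "inverted index basics",
        "BERT for passage reranking"]
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
# Pretend the "reranker" also prefers shorter, more focused passages.
rerank = lambda q, d: overlap(q, d) - 0.01 * len(d)

results = multistage_rank("BERT reranking", docs, overlap, rerank, k=2)
```

The efficiency/effectiveness trade-off the survey discusses lives largely in k: a larger candidate pool gives the reranker more chances to recover relevant documents, at higher query latency.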
Joint Upper & Lower Bound Normalization for IR Evaluation
In this paper, we present a novel perspective towards IR evaluation by
proposing a new family of evaluation metrics where the existing popular metrics
(e.g., nDCG, MAP) are customized by introducing a query-specific lower-bound
(LB) normalization term. While original nDCG, MAP etc. metrics are normalized
in terms of their upper bounds based on an ideal ranked list, a corresponding
LB normalization for them has not yet been studied. Specifically, we introduce
two different variants of the proposed LB normalization, where the lower bound
is estimated from a randomized ranking of the corresponding documents present
in the evaluation set. We next conduct two case studies by instantiating the
new framework for two popular IR evaluation metrics (with two variants each,
i.e., DCG_UL_V1,2 and MSP_UL_V1,2) and then comparing against the traditional
metrics without the proposed LB normalization. Experiments on two different datasets
with eight Learning-to-Rank (LETOR) methods demonstrate the following
properties of the new LB-normalized metrics: 1) Statistically significant
differences (between two methods) in terms of the original metric no longer
remain statistically significant in terms of the Upper-Lower (UL) bound
normalized version, and vice versa, especially for uninformative query sets.
2) When compared against the original metrics, our proposed UL-normalized
metrics demonstrate higher Discriminatory Power and better Consistency across
different datasets.
These findings suggest that the IR community should take UL normalization
seriously when computing nDCG and MAP, and that a more in-depth study of UL
normalization for general IR evaluation is warranted.
Comment: 26 pages, 3 figure
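The joint normalization can be sketched for DCG: scale the raw metric by both an upper bound from the ideal ranking and a lower bound estimated from random shufflings of the same documents. The estimator below is an illustrative assumption; the paper's two variants (DCG_UL_V1,2) may define the lower bound differently.

```python
import random
from math import log2

def dcg(gains):
    """Discounted cumulative gain of a ranked list of relevance gains."""
    return sum(g / log2(i + 2) for i, g in enumerate(gains))

def ul_normalized_dcg(gains, trials=1000, seed=0):
    """Normalize DCG jointly by an upper bound (ideal ordering) and a
    lower bound estimated as the mean DCG over random shufflings of the
    same documents. Sketch of the paper's idea, not its exact estimator."""
    ub = dcg(sorted(gains, reverse=True))        # ideal ranking
    rng = random.Random(seed)
    lb = 0.0
    for _ in range(trials):
        shuffled = gains[:]
        rng.shuffle(shuffled)
        lb += dcg(shuffled)
    lb /= trials                                  # expected DCG at random
    return (dcg(gains) - lb) / (ub - lb) if ub > lb else 0.0
```

Under this sketch an ideal ranking scores 1.0, a ranking at the level of random shuffling scores near 0, and a worse-than-random ranking can go negative, which is exactly the extra discrimination a lower bound adds over upper-bound-only normalization.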
LAW SEARCH IN THE AGE OF THE ALGORITHM
The process of searching for relevant legal materials is
fundamental to legal reasoning. However, despite its enormous
practical and theoretical importance, law search has not been given
significant attention by scholars. In this Article, we define the problem
of law search and examine the consequences of new technologies
capable of automating this core lawyerly task. We introduce a theory
of law search in which legal relevance is a sociological phenomenon
that leads to convergence over a shared set of legal materials and
explore the normative stakes of law search. We examine ways in which
law scholars can understand empirically the phenomenon of law
search, argue that computational modeling is a valuable epistemic
tool in this domain, and report the results from a multi-year,
interdisciplinary effort to develop an advanced law search algorithm
based on human-generated data. Finally, we explore how
policymakers can manage the challenges posed by new machine
learning-based search technologies.
Index ordering by query-independent measures
Conventional approaches to information retrieval search through all applicable entries in an inverted file for a particular collection in order to find those documents with the highest scores. For particularly large collections this may be extremely time-consuming.
A solution to this problem is to search only a limited amount of the collection at query time, in order to speed up the retrieval process. In doing this we can also limit the loss in retrieval efficacy (in terms of accuracy of results). We achieve this by first identifying the most “important” documents within the collection, and sorting documents within inverted file lists in order of this “importance”. In this way we limit the amount of information to be searched at query time by eliminating documents of lesser importance, which not only makes the search more efficient, but also limits the loss in retrieval accuracy.

Our experiments, carried out on the TREC Terabyte collection, show significant savings in the number of postings examined, without significant loss of effectiveness, for several measures of importance used both in isolation and in combination. Our results point to several ways in which the computational cost of searching large collections of documents can be significantly reduced.
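The idea of importance-ordered postings with early termination can be sketched as follows. The importance scores, the additive scoring rule, and the fixed per-term budget are illustrative stand-ins for the query-independent measures and stopping conditions studied in the abstract.

```python
def impact_ordered_search(query_terms, index, importance, budget):
    """Process each term's postings in decreasing order of a
    query-independent importance score, examining at most `budget`
    postings per term. Sketch only: real systems precompute this
    ordering in the index and use proper term weighting."""
    scores = {}
    for term in query_terms:
        postings = sorted(index.get(term, []),
                          key=lambda doc: importance[doc], reverse=True)
        for doc in postings[:budget]:  # only the "important" prefix
            scores[doc] = scores.get(doc, 0.0) + importance[doc]
    return sorted(scores, key=scores.get, reverse=True)

# Toy collection: d3 is low-importance and gets pruned for term "a".
importance = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
index = {"a": ["d1", "d2", "d3"], "b": ["d2", "d3"]}
ranked = impact_ordered_search(["a", "b"], index, importance, budget=2)
```

Because postings are pre-sorted by importance, truncating each list removes the documents least likely to rank highly, which is why the postings savings come with little effectiveness loss.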
Managing tail latency in large scale information retrieval systems
As both the availability of internet access and the prominence of smart devices continue to increase, data is being generated at a rate faster than ever before. This massive increase in data production brings many challenges, including efficiency concerns for the storage and retrieval of such large-scale data. At the same time, users have grown to expect the sub-second response times that are common in modern search engines, creating a problem: how can such large amounts of data continue to be served efficiently enough to satisfy end users?

This dissertation investigates several issues regarding tail latency in large-scale information retrieval systems. Tail latency refers to the high-percentile latency observed from a system; in the case of search, this latency typically corresponds to how long it takes for a query to be processed. Keeping tail latency as low as possible translates to a good experience for all users, as tail latency is directly related to the worst-case latency and hence the worst possible user experience. The key idea in targeting tail latency is to move from questions such as "what is the median latency of our search engine?" to questions that more accurately capture user experience, such as "how many queries take more than 200ms to return answers?" or "what is the worst-case latency that a user may be subject to, and how often might it occur?"

While various strategies exist for efficiently processing queries over large textual corpora, prior research has focused almost entirely on improvements to the average processing time or cost of search systems. As a first contribution, we examine some state-of-the-art retrieval algorithms for two popular index organizations and discuss the trade-offs between them, paying special attention to the notion of tail latency. This research uncovers a number of observations that are subsequently leveraged for improved search efficiency and effectiveness.
We then propose and solve a new problem, which involves processing a number of related queries together, known as multi-queries, to yield higher-quality search results. We experiment with a number of algorithmic approaches to efficiently process these multi-queries, and report on the cost, efficiency, and effectiveness trade-offs present with each. Ultimately, we find that some solutions yield a low tail latency and are hence suitable for use in real-time search environments.

Finally, we examine how predictive models can be used to improve the tail latency and end-to-end cost of a commonly used multi-stage retrieval architecture without impacting result effectiveness. By combining ideas from numerous areas of information retrieval, we propose a prediction framework which can be used for training and evaluating several efficiency/effectiveness trade-off parameters, resulting in improved trade-offs between cost, result quality, and tail latency.
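The shift the dissertation advocates, from median latency to tail percentiles, can be made concrete with a nearest-rank percentile over observed per-query latencies (a sketch of the measurement, not of the dissertation's methods):

```python
def percentile(latencies, p):
    """Nearest-rank p-th percentile of observed latencies (ms).
    Tail latency is typically reported at p95/p99; p100 is the
    worst case any user experienced."""
    ordered = sorted(latencies)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100), 1-based
    return ordered[rank - 1]

# 100 queries: most are fast, but a few stragglers dominate the tail.
lat = [10] * 90 + [50] * 9 + [900]
assert percentile(lat, 50) == 10    # the median looks great...
assert percentile(lat, 99) == 50    # ...the tail tells another story
assert percentile(lat, 100) == 900  # worst-case latency
```

This is why a system can have an excellent median yet a poor user experience: the 900ms straggler is invisible to the median but defines the worst case the abstract is concerned with.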