12 research outputs found

    Anytime Ranking for Impact-Ordered Indexes

    Full text link
    The ability for a ranking function to control its own execution time is useful for managing load, reigning in outliers, and adapting to different types of queries. We propose a simple yet effective anytime algorithm for impact-ordered indexes that builds on a score-at-a-time query evaluation strategy. In our approach, postings segments are processed in decreasing order of their impact scores, and the algorithm early terminates when a specified number of postings have been processed. With a simple linear model and a few training topics, we can determine this threshold given a time budget in milliseconds. Experiments on two web test collections show that our approach can accurately control query evaluation latency and that aggressive limits on execution time lead to minimal decreases in effectiveness

    The Lucene for Information Access and Retrieval Research (LIARR) Workshop at SIGIR 2017

    Get PDF
    As an empirical discipline, information access and retrieval research requires substantial software infrastructure to index and search large collections. This workshop is motivated by the desire to better align information retrieval research with the practice of building search applications from the perspective of open-source information retrieval systems. Our goal is to promote the use of Lucene for information access and retrieval research

    Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge

    Get PDF
    The Open-Source IR Reproducibility Challenge brought together developers of open-source search engines to provide reproducible baselines of their systems in a common environment on Amazon EC2. The product is a repository that contains all code necessary to generate competitive ad hoc retrieval baselines, such that with a single script, anyone with a copy of the collection can reproduce the submitted runs. Our vision is that these results would serve as widely accessible points of comparison in future IR research. This project represents an ongoing effort, but we describe the first phase of the challenge that was organized as part of a workshop at SIGIR 2015. We have succeeded modestly so far, achieving our main goals on the Gov2 collection with seven opensource search engines. In this paper, we describe our methodology, share experimental results, and discuss lessons learned as well as next steps

    MWAND: A New Early Termination Algorithm for Fast and Efficient Query Evaluation

    Get PDF
    Nowadays, current information systems are so large and maintain huge amount of data. At every time, they process millions of documents and millions of queries. In order to choose the most important responses from this amount of data, it is well to apply what is so called early termination algorithms. These ones attempt to extract the Top-K documents according to a specified increasing monotone function. The principal idea behind is to reach and score the most significant less number of documents. So, they avoid fully processing the whole documents. WAND algorithm is at the state of the art in this area. Despite it is efficient, it is missing effectiveness and precision. In this paper, we propose two contributions, the principal proposal is a new early termination algorithm based on WAND approach, we call it MWAND (Modified WAND). This one is faster and more precise than the first. It has the ability to avoid unnecessary WAND steps. In this work, we integrate a tree structure as an index into WAND and we add new levels in query processing. In the second contribution, we define new fine metrics to ameliorate the evaluation of the retrieved information. The experimental results on real datasets show that MWAND is more efficient than the WAND approach

    Ранжирование документов при полнотекстовом поиске с учетом расстояния с использованием индексов с многокомпонентными ключами

    Full text link
    The problem of proximity full-text search is considered. If a search query contains high-frequently occurring words, then multi-component key indexes deliver improvement of the search speed in comparison with ordinary inverted indexes. It was shown that we can increase the search speed up to 130 times in cases when queries consist of high-frequently occurring words. In this paper, we are investigating how the multi-component key indexes architecture affects the quality of the search. We consider several well-known methods of relevance ranking; these methods are of different authors. Using these methods we perform the search in the ordinary inverted index and then in the index that is enhanced with multi-component key indexes. The results show that with multi-component key indexes we obtain search results that are very near in terms of relevance ranking to the search results that are obtained by means of ordinary inverted indexes. © 2021 Udmurt State University. All rights reserved

    Bridging Dense and Sparse Maximum Inner Product Search

    Full text link
    Maximum inner product search (MIPS) over dense and sparse vectors have progressed independently in a bifurcated literature for decades; the latter is better known as top-kk retrieval in Information Retrieval. This duality exists because sparse and dense vectors serve different end goals. That is despite the fact that they are manifestations of the same mathematical problem. In this work, we ask if algorithms for dense vectors could be applied effectively to sparse vectors, particularly those that violate the assumptions underlying top-kk retrieval methods. We study IVF-based retrieval where vectors are partitioned into clusters and only a fraction of clusters are searched during retrieval. We conduct a comprehensive analysis of dimensionality reduction for sparse vectors, and examine standard and spherical KMeans for partitioning. Our experiments demonstrate that IVF serves as an efficient solution for sparse MIPS. As byproducts, we identify two research opportunities and demonstrate their potential. First, we cast the IVF paradigm as a dynamic pruning technique and turn that insight into a novel organization of the inverted index for approximate MIPS for general sparse vectors. Second, we offer a unified regime for MIPS over vectors that have dense and sparse subspaces, and show its robustness to query distributions

    An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors

    Full text link
    Maximum Inner Product Search or top-k retrieval on sparse vectors is well-understood in information retrieval, with a number of mature algorithms that solve it exactly. However, all existing algorithms are tailored to text and frequency-based similarity measures. To achieve optimal memory footprint and query latency, they rely on the near stationarity of documents and on laws governing natural languages. We consider, instead, a setup in which collections are streaming -- necessitating dynamic indexing -- and where indexing and retrieval must work with arbitrarily distributed real-valued vectors. As we show, existing algorithms are no longer competitive in this setup, even against naive solutions. We investigate this gap and present a novel approximate solution, called Sinnamon, that can efficiently retrieve the top-k results for sparse real valued vectors drawn from arbitrary distributions. Notably, Sinnamon offers levers to trade-off memory consumption, latency, and accuracy, making the algorithm suitable for constrained applications and systems. We give theoretical results on the error introduced by the approximate nature of the algorithm, and present an empirical evaluation of its performance on two hardware platforms and synthetic and real-valued datasets. We conclude by laying out concrete directions for future research on this general top-k retrieval problem over sparse vectors

    Efficient query processing for scalable web search

    Get PDF
    Search engines are exceptionally important tools for accessing information in today’s world. In satisfying the information needs of millions of users, the effectiveness (the quality of the search results) and the efficiency (the speed at which the results are returned to the users) of a search engine are two goals that form a natural trade-off, as techniques that improve the effectiveness of the search engine can also make it less efficient. Meanwhile, search engines continue to rapidly evolve, with larger indexes, more complex retrieval strategies and growing query volumes. Hence, there is a need for the development of efficient query processing infrastructures that make appropriate sacrifices in effectiveness in order to make gains in efficiency. This survey comprehensively reviews the foundations of search engines, from index layouts to basic term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing strategies, while also providing the latest trends in the literature in efficient query processing, including the coherent and systematic reviews of techniques such as dynamic pruning and impact-sorted posting lists as well as their variants and optimisations. Our explanations of query processing strategies, for instance the WAND and BMW dynamic pruning algorithms, are presented with illustrative figures showing how the processing state changes as the algorithms progress. Moreover, acknowledging the recent trends in applying a cascading infrastructure within search systems, this survey describes techniques for efficiently integrating effective learned models, such as those obtained from learning-to-rank techniques. The survey also covers the selective application of query processing techniques, often achieved by predicting the response times of the search engine (known as query efficiency prediction), and making per-query tradeoffs between efficiency and effectiveness to ensure that the required retrieval speed targets can be met. Finally, the survey concludes with a summary of open directions in efficient search infrastructures, namely the use of signatures, real-time, energy-efficient and modern hardware and software architectures
    corecore