
    The Potential of Learned Index Structures for Index Compression

    Inverted indexes are vital for fast keyword-based search. For every term in the document collection, a list of identifiers of the documents in which the term appears is stored, along with auxiliary information such as term frequencies and position offsets. While very effective, inverted indexes have large memory requirements for web-sized collections. Recently, the concept of learned index structures was introduced, in which machine-learned models replace common index structures such as B-tree indexes, hash indexes, and Bloom filters. These learned index structures require less memory and can be computationally much faster than their traditional counterparts. In this paper, we consider whether such models may be applied to conjunctive Boolean querying. First, we investigate how a learned model can replace the document postings of an inverted index, and we evaluate the compromises such an approach entails. Second, we evaluate the potential gains that can be achieved in terms of memory requirements. Our work shows that learned models have great potential in inverted indexing, and this direction seems to be a promising area for future research. Comment: Will appear in the proceedings of ADCS'1
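
    As a rough illustration of the idea above, the Python sketch below replaces binary search over a sorted postings list with a learned linear model that predicts each document identifier's position; membership tests (the building block of conjunctive Boolean querying) then only search within the model's recorded error bound. All names and toy data are illustrative, not taken from the paper.

        import bisect

        import numpy as np

        # A sorted postings list of document IDs for one term (toy data).
        postings = np.array([3, 17, 19, 42, 57, 91, 120, 121, 180, 201])

        # Learn position ~ a * doc_id + b by least squares.
        positions = np.arange(len(postings))
        a, b = np.polyfit(postings, positions, deg=1)

        # Record the worst-case prediction error so lookups stay exact.
        pred = np.round(a * postings + b).astype(int)
        eps = int(np.max(np.abs(pred - positions)))

        def contains(doc_id: int) -> bool:
            """Exact membership test: model prediction plus bounded local search."""
            guess = int(round(a * doc_id + b))
            lo = max(0, guess - eps)
            hi = min(len(postings), guess + eps + 1)
            i = bisect.bisect_left(postings, doc_id, lo, hi)
            return bool(i < len(postings) and postings[i] == doc_id)

        assert contains(42) and not contains(43)

    Only the slope, intercept, and error bound need to be stored, which is where the potential memory savings over an explicit postings list come from.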

    Column Stores as an IR Prototyping Tool

    We suggest that instead of implementing custom index structures and query evaluation algorithms, IR researchers can simply store document representations in a column-oriented relational database and write ranking models in SQL. For rapid prototyping this is particularly advantageous, since researchers can explore new ranking functions and features by simply issuing SQL queries, without needing to write imperative code. We demonstrate the feasibility of this approach with an implementation of conjunctive BM25 using MonetDB on part of the ClueWeb12 collection.
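
    As a hedged sketch of this prototyping style, the Python snippet below expresses conjunctive BM25 as a single SQL query, using SQLite in place of MonetDB so it runs anywhere; the schema, toy data, and BM25 constants (k1 = 1.2, b = 0.75) are all invented for illustration and will differ from the paper's MonetDB implementation.

        import math
        import sqlite3

        con = sqlite3.connect(":memory:")
        con.executescript("""
            CREATE TABLE postings (term TEXT, docid INT, tf INT);
            CREATE TABLE docs     (docid INT PRIMARY KEY, len INT);
            CREATE TABLE qterms   (term TEXT, idf REAL);
        """)
        con.executemany("INSERT INTO postings VALUES (?, ?, ?)",
                        [("learned", 1, 3), ("index", 1, 2),
                         ("learned", 2, 1), ("index", 3, 4)])
        con.executemany("INSERT INTO docs VALUES (?, ?)",
                        [(1, 120), (2, 90), (3, 200)])

        n_docs = con.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
        avgdl = con.execute("SELECT AVG(len) FROM docs").fetchone()[0]

        # Per-term IDF for the (hardcoded) query is computed once in Python
        # and stored as a tiny query table.
        for term, df in con.execute(
                "SELECT term, COUNT(*) FROM postings "
                "WHERE term IN ('learned', 'index') GROUP BY term").fetchall():
            con.execute("INSERT INTO qterms VALUES (?, ?)",
                        (term, math.log(1 + (n_docs - df + 0.5) / (df + 0.5))))

        # The whole ranking model is one declarative query; HAVING enforces
        # conjunctive semantics (a document must match every query term).
        sql = """
        SELECT p.docid,
               SUM(q.idf * p.tf * (1.2 + 1.0)
                   / (p.tf + 1.2 * (1.0 - 0.75 + 0.75 * d.len / ?))) AS bm25
        FROM postings p
        JOIN qterms  q ON q.term  = p.term
        JOIN docs    d ON d.docid = p.docid
        GROUP BY p.docid
        HAVING COUNT(*) = (SELECT COUNT(*) FROM qterms)
        ORDER BY bm25 DESC
        """
        for docid, score in con.execute(sql, (avgdl,)):
            print(docid, round(score, 3))

    Swapping in a different ranking function means editing the SELECT expression, not rewriting index traversal code, which is the prototyping advantage the abstract argues for.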

    The Feasibility of Brute Force Scans for Real-Time Tweet Search

    The real-time search problem requires making ingested documents immediately searchable, which presents architectural challenges for systems built around inverted indexing. In this paper, we explore a radical proposition: What if we abandon document inversion and instead adopt an architecture based on brute force scans of document representations? In such a design, “indexing” simply involves appending the parsed representation of an ingested document to an existing buffer, which is simple and fast. Quite surprisingly, experiments with TREC Microblog test collections show that query evaluation with brute force scans is feasible and performance compares favorably to a traditional search architecture based on an inverted index, especially if we take advantage of vectorized SIMD instructions and multiple cores in modern processor architectures. We believe that such a novel design is worth further exploration by IR researchers and practitioners.
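
    The sketch below caricatures this design in Python, assuming documents are parsed into fixed-width term-frequency vectors; NumPy's vectorized kernels stand in for the SIMD instructions and multi-core scanning discussed in the abstract, and the simple dot-product scorer is our own choice, not the paper's.

        import numpy as np

        VOCAB = 1000                                     # fixed toy vocabulary size
        buffer = np.empty((0, VOCAB), dtype=np.float32)  # parsed document vectors

        def index(doc_vector: np.ndarray) -> None:
            """'Indexing' is just appending the parsed representation to the
            buffer (a real system would append into preallocated memory)."""
            global buffer
            buffer = np.vstack([buffer, doc_vector.astype(np.float32)])

        def search(query_vector: np.ndarray, k: int = 10) -> np.ndarray:
            """Query evaluation is one vectorized scan over every document."""
            scores = buffer @ query_vector.astype(np.float32)  # brute force scan
            top = np.argsort(-scores)[:k]
            return top[scores[top] > 0]                        # drop non-matches

        rng = np.random.default_rng(0)
        for _ in range(100):                 # ingest 100 random "tweets"
            index(rng.poisson(0.05, VOCAB))
        print(search(rng.poisson(0.1, VOCAB)))

    Because newly indexed rows are visible to the very next scan, there is no refresh or merge step, which is what makes the design attractive for real-time search.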

    Anytime Ranking for Impact-Ordered Indexes

    The ability of a ranking function to control its own execution time is useful for managing load, reining in outliers, and adapting to different types of queries. We propose a simple yet effective anytime algorithm for impact-ordered indexes that builds on a score-at-a-time query evaluation strategy. In our approach, postings segments are processed in decreasing order of their impact scores, and the algorithm terminates early once a specified number of postings have been processed. With a simple linear model and a few training topics, we can determine this threshold for a given time budget in milliseconds. Experiments on two web test collections show that our approach can accurately control query evaluation latency and that aggressive limits on execution time lead to minimal decreases in effectiveness.
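
    A minimal sketch of the anytime strategy in Python, assuming postings are stored as impact-ordered segments; the paper fits a linear model to translate a millisecond budget into a postings count, which is reduced here to a fixed budget parameter.

        from collections import defaultdict

        def anytime_rank(segments, budget, k=10):
            """Score-at-a-time evaluation over impact-ordered postings segments,
            terminating early once `budget` postings have been processed."""
            scores = defaultdict(int)
            processed = 0
            for impact, docids in sorted(segments, reverse=True):
                for d in docids:
                    scores[d] += impact          # accumulate quantized impacts
                    processed += 1
                    if processed >= budget:      # anytime early termination
                        return sorted(scores, key=scores.get, reverse=True)[:k]
            return sorted(scores, key=scores.get, reverse=True)[:k]

        # Segments for a two-term query: (quantized impact, postings list).
        segs = [(5, [2, 3]), (7, [2, 9, 14]), (3, [9, 14, 20, 31]), (1, [50, 51])]
        print(anytime_rank(segs, budget=6))      # stops after six postings

    Because the highest-impact postings are consumed first, whatever ranking exists at the cutoff is already the most informative one reachable within the budget, which is why aggressive cutoffs cost little effectiveness.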

    A unified posterior regularized topic model with maximum margin for learning-to-rank

    While most methods for learning-to-rank documents consider only relevance scores as features, better results can often be obtained by taking the latent topic structure of the document collection into account. Existing approaches that consider latent topics follow a two-stage approach, in which topics are first discovered in an unsupervised way, as usual, and then used as features for the learning-to-rank task. In contrast, we propose a learning-to-rank framework that integrates the supervised learning of a maximum margin classifier with the discovery of a suitable probabilistic topic model. In this way, the labelled data available for the learning-to-rank task can be exploited to identify the most appropriate topics. To this end, we use a unified constrained optimization framework, which can dynamically compute the latent topic similarity score between the query and the document. Our experimental results show a consistent improvement over state-of-the-art learning-to-rank models.
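
    The snippet below is a deliberately loose Python illustration of only the max-margin ingredient: it assumes query and document topic distributions are already given and shows a pairwise hinge loss over a feature vector that includes a query-document topic similarity score. The paper's actual contribution, learning the topic model jointly under posterior regularization, is not reproduced here.

        import numpy as np

        def topic_similarity(q_topics, d_topics):
            """Latent topic score between query and document (cosine)."""
            return q_topics @ d_topics / (
                np.linalg.norm(q_topics) * np.linalg.norm(d_topics))

        def hinge_rank_loss(w, feats_pos, feats_neg, margin=1.0):
            """Pairwise max-margin objective: the relevant document must
            outscore the irrelevant one by at least `margin`."""
            return max(0.0, margin - w @ (feats_pos - feats_neg))

        q = np.array([0.7, 0.2, 0.1])                    # query topic mixture
        d_rel = np.array([0.6, 0.3, 0.1])                # relevant document
        d_irr = np.array([0.1, 0.1, 0.8])                # irrelevant document

        # Feature vector = (relevance score, topic similarity); w is learned.
        feats = lambda d, rel_score: np.array([rel_score,
                                               topic_similarity(q, d)])
        w = np.array([0.5, 0.5])
        print(hinge_rank_loss(w, feats(d_rel, 0.9), feats(d_irr, 0.4)))

    In the two-stage baselines the topic distributions above would be fixed before training; the paper's framework instead lets this ranking loss feed back into which topics are discovered.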