The Potential of Learned Index Structures for Index Compression
Inverted indexes are vital in providing fast keyword-based search. For every
term in the document collection, a list of identifiers of documents in which
the term appears is stored, along with auxiliary information such as term
frequency and position offsets. While very effective, inverted indexes have
large memory requirements for web-sized collections. Recently, the concept of
learned index structures was introduced, where machine-learned models replace
common index structures such as B-tree indexes, hash indexes, and
Bloom filters. These learned index structures require less memory, and can be
computationally much faster than their traditional counterparts. In this paper,
we consider whether such models may be applied to conjunctive Boolean querying.
First, we investigate how a learned model can replace document postings of an
inverted index, and then evaluate the compromises such an approach might have.
Second, we evaluate the potential gains that can be achieved in terms of memory
requirements. Our work shows that learned models have great potential in
inverted indexing, and this direction seems to be a promising area for future
research. Comment: Will appear in the proceedings of ADCS'1
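For orientation, the conventional structure that the abstract proposes to replace can be sketched as follows. This is a minimal illustration of a document-level inverted index with conjunctive Boolean querying, not the learned model from the paper; the function names and the whitespace tokenizer are assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Assumed tokenizer: lowercase and split on whitespace.
        for term in set(text.lower().split()):
            index[term].append(doc_id)  # doc IDs arrive in increasing order
    return dict(index)


def intersect(a, b):
    """Merge-intersect two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def conjunctive_query(index, terms):
    """Return IDs of the documents that contain every query term."""
    if not terms:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result


docs = [
    "learned index structures",
    "inverted index compression",
    "learned models compress inverted index",
]
index = build_inverted_index(docs)
print(conjunctive_query(index, ["learned", "index"]))  # [0, 2]
```

The posting lists are what dominate memory in this layout, which is the part a learned model would have to stand in for.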
Reconfigurable Inverted Index
Existing approximate nearest neighbor search systems suffer from two
fundamental problems that are of practical importance but have not received
sufficient attention from the research community. First, although existing
systems perform well for the whole database, it is difficult to run a search
over a subset of the database. Second, there has been no discussion concerning
the performance degradation after many items have been newly added to a system.
We develop a reconfigurable inverted index (Rii) to resolve these two issues.
Based on the standard IVFADC system, we design a data layout such that items
are stored linearly. This enables us to efficiently run a subset search by
switching the search method to a linear PQ scan if the size of a subset is
small. Owing to the linear layout, the data structure can be dynamically
adjusted after new items are added, maintaining the fast speed of the system.
Extensive comparisons show that Rii achieves performance comparable to
state-of-the-art systems such as Faiss. Comment: ACMMM 2018 (oral). Code: https://github.com/matsui528/ri
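The switching idea described in this abstract can be sketched in a few lines. This is a simplified illustration, not the Rii implementation: exact squared Euclidean distances stand in for PQ distances, and the class name `TinyIndex` and the `threshold` and `nprobe` parameters are hypothetical choices for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (stand-in for a PQ distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class TinyIndex:
    def __init__(self, vectors, centroids):
        self.vectors = list(vectors)          # linear layout: item ID == position
        self.centroids = list(centroids)
        self.lists = [[] for _ in centroids]  # inverted lists: centroid -> item IDs
        for item_id, v in enumerate(self.vectors):
            self.lists[self._nearest_centroid(v)].append(item_id)

    def _nearest_centroid(self, v):
        return min(range(len(self.centroids)),
                   key=lambda c: sq_dist(v, self.centroids[c]))

    def search(self, query, subset, topk=1, threshold=4, nprobe=1):
        if len(subset) <= threshold:
            # Small subset: scan its items directly in the linear layout.
            candidates = list(subset)
        else:
            # Large subset: probe the closest inverted lists, keep allowed IDs.
            allowed = set(subset)
            order = sorted(range(len(self.centroids)),
                           key=lambda c: sq_dist(query, self.centroids[c]))
            candidates = [i for c in order[:nprobe]
                          for i in self.lists[c] if i in allowed]
        return sorted(candidates,
                      key=lambda i: sq_dist(query, self.vectors[i]))[:topk]


vectors = [(0, 0), (0, 1), (10, 10), (10, 11), (0, 2), (11, 10)]
index = TinyIndex(vectors, centroids=[(0, 0), (10, 10)])
print(index.search((0, 0.9), subset=[2, 4]))                   # linear scan path
print(index.search((10, 10.2), subset=[0, 1, 2, 3, 5], topk=2))  # inverted path
```

Because items are addressed by position, a subset given as a list of IDs can be scanned without touching the inverted lists at all, which is what makes the small-subset path cheap.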
On inverted index compression for search engine efficiency
Efficient access to the inverted index data structure is a key aspect for a search engine to achieve fast response times to users’ queries. While the performance of an information retrieval (IR) system can be enhanced through the compression of its posting lists, there is little recent work in the literature that thoroughly compares and analyses the performance of modern integer compression schemes across different types of posting information (document ids, frequencies, positions). In this paper, we experiment with different modern integer compression algorithms, integrating these into a modern IR system. Through comprehensive experiments conducted on two large, widely used document corpora and large query sets, our results show the benefit of compression for different types of posting information to the space- and time-efficiency of the search engine. Overall, we find that the simple Frame of Reference compression scheme results in the best query response times for all types of posting information. Moreover, we observe that the frequency and position posting information in Web corpora that have large volumes of anchor text are more challenging to compress, yet compression is beneficial in reducing average query response times.
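A minimal sketch of the Frame of Reference idea may help here: each block of integers is stored as a base value plus fixed-width offsets from that base. The block size, the tuple layout, and the use of a Python integer as the bit container are assumptions for illustration; the system evaluated in the paper may differ in all of these.

```python
def for_encode(values, block_size=128):
    """Encode non-negative integers block by block as (base, width, count, packed)."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        base = min(block)
        # Bits needed for the largest offset from the base in this block.
        width = max(v - base for v in block).bit_length()
        packed = 0
        for k, v in enumerate(block):
            packed |= (v - base) << (k * width)
        blocks.append((base, width, len(block), packed))
    return blocks


def for_decode(blocks):
    """Invert for_encode and return the original list of integers."""
    out = []
    for base, width, count, packed in blocks:
        mask = (1 << width) - 1
        for k in range(count):
            out.append(base + ((packed >> (k * width)) & mask))
    return out


values = [1000, 1003, 1001, 1010, 5000, 5002]
blocks = for_encode(values, block_size=4)
print([b[:3] for b in blocks])  # [(1000, 4, 4), (5000, 2, 2)]
print(for_decode(blocks) == values)  # True
```

Every value in a block uses the same width, so decoding is a fixed shift and mask per element with no per-value branching, which is one plausible reason such a simple scheme decodes quickly.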
Inverted index compression based on term and document identifier reassignment
Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2008. Thesis (Master's) -- Bilkent University, 2008. Includes bibliographical references leaves 43-46.
Compression of inverted indexes has received great attention in recent years. An
inverted index consists of lists of document identifiers, also referred to as posting
lists, for each term. Compressing an inverted index reduces the size of the index,
which also improves the query performance due to the reduction in disk access
times.
Recent studies have shown that reassigning document identifiers has a great
effect on the compression of an inverted index. In this work, we propose a novel
technique that reassigns both term and document identifiers of an inverted index
by transforming the matrix representation of the index into a block-diagonal
form, which improves the compression ratio dramatically. We adapt the row-net
hypergraph-partitioning model for the transformation into block-diagonal form,
which improves the compression ratio by as much as 50%. To the best of our
knowledge, this method performs more effectively than previous inverted index
compression techniques.
Baykan, İzzet Çağrı. M.S
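Why identifier reassignment affects compression can be seen with a small example. The sketch below is not the hypergraph-partitioning method from the thesis; it only shows that a posting list whose document IDs are close together produces smaller gaps, which a gap-based codec stores in fewer bytes. Variable-byte coding is an assumed codec here, chosen for simplicity, and the example lists are made up.

```python
def d_gaps(postings):
    """Turn a sorted posting list into its first ID followed by successive gaps."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def vbyte_len(n):
    """Number of bytes a variable-byte code uses for n (7 payload bits per byte)."""
    size = 1
    while n >= 128:
        n >>= 7
        size += 1
    return size


def compressed_size(postings):
    """Total bytes needed to store the list as variable-byte coded gaps."""
    return sum(vbyte_len(g) for g in d_gaps(postings))


scattered = [3, 5000, 90000, 250000]  # IDs of one term before reassignment
clustered = [100, 101, 102, 103]      # the same four documents, renumbered
print(compressed_size(scattered))  # 9
print(compressed_size(clustered))  # 4
```

A reassignment that places documents sharing many terms next to each other pushes many lists toward the clustered case at once.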
Efficient Update of Indexes for Dynamically Changing Web Documents
The original publication is available at www.springerlink.com
Recent work on incremental crawling has enabled the indexed document collection of a
search engine to be more synchronized with the changing World Wide Web. However, this
synchronized collection is not immediately searchable, because the keyword index is rebuilt
from scratch less frequently than the collection can be refreshed. An inverted index is usually
used to index documents crawled from the web. Complete index rebuild at high frequency is
expensive. Previous work on incremental inverted index updates has been restricted to adding
and removing documents. Updating the inverted index for previously indexed documents that
have changed has not been addressed.
In this paper, we propose an efficient method to update the inverted index for previously
indexed documents whose contents have changed. Our method uses the idea of landmarks
together with the diff algorithm to significantly reduce the number of postings in the inverted
index that need to be updated. Our experiments verify that our landmark-diff method results
in significant savings in the number of update operations on the inverted index.