Document Retrieval on Repetitive Collections
Document retrieval aims at finding the most important documents where a
pattern appears in a collection of strings. Traditional pattern-matching
techniques yield brute-force document retrieval solutions, which has motivated
the research on tailored indexes that offer near-optimal performance. However,
an experimental study establishing which alternatives are actually better than
brute force, and which perform best depending on the collection
characteristics, has not been carried out. In this paper we address this
shortcoming by exploring the relationship between the nature of the underlying
collection and the performance of current methods. Via extensive experiments we
show that established solutions are often beaten in practice by brute-force
alternatives. We also design new methods that offer superior time/space
trade-offs, particularly on repetitive collections.
Comment: Accepted to ESA 2014. Implementation and experiments at
http://www.cs.helsinki.fi/group/suds/rlcsa
From Theory to Practice: Plug and Play with Succinct Data Structures
Engineering efficient implementations of compact and succinct structures is a
time-consuming and challenging task, since there is no standard library of
easy-to-use, highly optimized, and composable components. One consequence is
that measuring the practical impact of new theoretical proposals is a difficult
task, since older baseline implementations may not rely on the same basic
components, and reimplementing from scratch can be very time-consuming. In this
paper we present a framework for experimentation with succinct data structures,
providing a large set of configurable components, together with tests,
benchmarks, and tools to analyze resource requirements. We demonstrate the
functionality of the framework by recomposing succinct solutions for document
retrieval.
Comment: 10 pages, 4 figures, 3 table
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are
universal, that is, they do not need to know the versioning structure of the
collection, nor that a clear versioning structure even exists.
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower.
Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Skłodowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
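To make the contrast between gap-encoding and run-length compression of differential lists concrete, here is a minimal Python sketch. It is not the authors' implementation; the function names and the sample posting list are illustrative assumptions. It shows why near-copy collections help: a term that occurs in many consecutive versions produces long runs of gap 1, and each run collapses into a single (gap, count) pair.

```python
from typing import List, Tuple


def to_gaps(postings: List[int]) -> List[int]:
    """Differential form of a sorted posting list: first id, then differences."""
    if not postings:
        return []
    gaps = [postings[0]]
    for prev, cur in zip(postings, postings[1:]):
        gaps.append(cur - prev)
    return gaps


def run_length_encode(gaps: List[int]) -> List[Tuple[int, int]]:
    """Collapse each maximal run of equal gaps into one (gap, count) pair."""
    runs: List[Tuple[int, int]] = []
    for gap in gaps:
        if runs and runs[-1][0] == gap:
            runs[-1] = (gap, runs[-1][1] + 1)
        else:
            runs.append((gap, 1))
    return runs


def decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Rebuild the original posting list from the (gap, count) pairs."""
    postings: List[int] = []
    current = 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Hypothetical list: the term occurs in versions 3..7 and 20..23.
example_postings = [3, 4, 5, 6, 7, 20, 21, 22, 23]
example_runs = run_length_encode(to_gaps(example_postings))
```

In this toy list, nine document identifiers become four pairs. A real index would still have to encode the pairs compactly and support fast access, which this sketch does not attempt.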
The Potential of Learned Index Structures for Index Compression
Inverted indexes are vital in providing fast keyword-based search. For every
term in the document collection, a list of identifiers of documents in which
the term appears is stored, along with auxiliary information such as term
frequency and position offsets. While very effective, inverted indexes have
large memory requirements for web-sized collections. Recently, the concept of
learned index structures was introduced, where machine-learned models replace
common index structures such as B-tree indexes, hash indexes, and
Bloom filters. These learned index structures require less memory, and can be
computationally much faster than their traditional counterparts. In this paper,
we consider whether such models may be applied to conjunctive Boolean querying.
First, we investigate how a learned model can replace document postings of an
inverted index, and then evaluate the compromises such an approach might have.
Second, we evaluate the potential gains that can be achieved in terms of memory
requirements. Our work shows that learned models have great potential in
inverted indexing, and this direction seems to be a promising area for future
research.
Comment: Will appear in the proceedings of ADCS'1
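For reference, conjunctive Boolean querying over a conventional inverted index amounts to intersecting the sorted posting lists of the query terms. The Python sketch below shows only that baseline operation, not the learned model the abstract proposes. The dictionary-based index, the function names, and the sample data are assumptions made for illustration.

```python
from typing import Dict, List


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Merge-style intersection of two sorted posting lists."""
    i, j = 0, 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def conjunctive_query(index: Dict[str, List[int]], terms: List[str]) -> List[int]:
    """Return ids of documents that contain every query term."""
    # Shortest lists first, so intermediate results stay small.
    lists = sorted((index.get(term, []) for term in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
        if not result:
            break  # no document can match once the intersection is empty
    return result


# Hypothetical toy index: term -> sorted list of document ids.
toy_index = {
    "succinct": [1, 3, 5, 8],
    "index": [2, 3, 8, 9],
    "repair": [3, 4, 8],
}
```

With this toy index, conjunctive_query(toy_index, ["succinct", "index", "repair"]) keeps only the documents present in all three lists.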
Re-Pair Compression of Inverted Lists
Compression of inverted lists with methods that support fast intersection
operations is an active research topic. Most compression schemes rely on
encoding differences between consecutive positions with techniques that favor
small numbers. In this paper we explore a completely different alternative: We
use Re-Pair compression of those differences. While Re-Pair by itself offers
fast decompression at arbitrary positions in main and secondary memory, we
introduce variants that in addition speed up the operations required for
inverted list intersection. We compare the resulting data structures with
several recent proposals under various list intersection algorithms, to
conclude that our Re-Pair variants offer an interesting time/space tradeoff for
this problem, yet further improvements are required for them to improve upon the
state of the art.
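As a rough illustration of applying Re-Pair to the differences of an inverted list, the Python sketch below repeatedly replaces the most frequent adjacent pair of symbols with a fresh nonterminal. It is a simplified quadratic-time version written for clarity, not the authors' data structure, and it omits the variants that speed up intersection. The function names, the integer numbering of nonterminals, and the sample gap sequence are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def repair_compress(gaps: List[int]) -> Tuple[List[int], Dict[int, Tuple[int, int]]]:
    """Simplified Re-Pair: replace the most frequent adjacent pair until none repeats."""
    seq = list(gaps)
    rules: Dict[int, Tuple[int, int]] = {}
    # Assumption: nonterminals are integers larger than every gap value.
    next_symbol = max(seq, default=0) + 1
    while True:
        # Overlapping occurrences (e.g. in 1,1,1) are counted naively here.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        rules[next_symbol] = pair
        out: List[int] = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
    return seq, rules


def expand(symbol: int, rules: Dict[int, Tuple[int, int]]) -> List[int]:
    """Recursively expand one symbol back into the gaps it stands for."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def repair_decompress(seq: List[int], rules: Dict[int, Tuple[int, int]]) -> List[int]:
    """Expand every symbol of the compressed sequence, in order."""
    out: List[int] = []
    for symbol in seq:
        out.extend(expand(symbol, rules))
    return out


# Hypothetical gap sequence with a repeated pattern.
sample_gaps = [1, 2, 1, 2, 1, 2, 5, 1, 2]
compressed, grammar = repair_compress(sample_gaps)
```

Because each symbol expands independently, decompression can start at any position of the compressed sequence, which is the property the abstract relies on for fast access.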