11 research outputs found
Document Retrieval on Repetitive Collections
Document retrieval aims at finding the most important documents where a
pattern appears in a collection of strings. Traditional pattern-matching
techniques yield brute-force document retrieval solutions, which has motivated
the research on tailored indexes that offer near-optimal performance. However,
an experimental study establishing which alternatives are actually better than
brute force, and which perform best depending on the collection
characteristics, has not been carried out. In this paper we address this
shortcoming by exploring the relationship between the nature of the underlying
collection and the performance of current methods. Via extensive experiments we
show that established solutions are often beaten in practice by brute-force
alternatives. We also design new methods that offer superior time/space
trade-offs, particularly on repetitive collections.Comment: Accepted to ESA 2014. Implementation and experiments at
http://www.cs.helsinki.fi/group/suds/rlcsa
Rpair: Rescaling RePair with Rsync
Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built on the original dataset reasonably quickly while keeping the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing such that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice
Document retrieval hacks
Publisher Copyright: © Simon J. Puglisi and Bella Zhukova; licensed under Creative Commons License CC-BY 4.0 19th International Symposium on Experimental Algorithms (SEA 2021).Given a collection of strings, document listing refers to the problem of finding all the strings (or documents) where a given query string (or pattern) appears. Index data structures that support efficient document listing for string collections have been the focus of intense research in the last decade, with dozens of papers published describing exotic and elegant compressed data structures. The problem is now quite well understood in theory and many of the solutions have been implemented and evaluated experimentally. A particular recent focus has been on highly repetitive document collections, which have become prevalent in many areas (such as version control systems and genomics - to name just two very different sources). The aim of this paper is to describe simple and efficient document listing algorithms that can be used in combination with more sophisticated techniques, or as baselines against which the performance of new document listing indexes can be measured. Our approaches are based on simple combinations of scanning and hashing, which we show to combine very well with dictionary compression to achieve small space usage. Our experiments show these methods to be often much faster and less space consuming than the best specialized indexes for the problem.Peer reviewe
Document Listing on Repetitive Collections with Guaranteed Performance
We consider document listing on string collections, that is, finding in which strings a given pattern appears. In particular, we focus on repetitive collections: a collection of size N over alphabet [1,a] is composed of D copies of a string of size n, and s single-character edits are applied on the copies. We introduce the first document listing index with size O~(n + s), precisely O((n lg a + s lg^2 N) lg D) bits, and with useful worst-case time guarantees: Given a pattern of length m, the index reports the ndoc strings where it appears in time O(m^2 + m lg N (lg D + lg^e N) ndoc), for any constant e > 0
Tailoring r-index for Document Listing Towards Metagenomics Applications
A basic problem in metagenomics is to assign a sequenced read to the correct species in the reference collection. In typical applications in genomic epidemiology and viral metagenomics the reference collection consists of a set of species with each species represented by its highly similar strains. It has been recently shown that accurate read assignment can be achieved with k-mer hashing-based pseudoalignment: a read is assigned to species A if each of its k-mer hits to a reference collection is located only on strains of A. We study the underlying primitives required in pseudoalignment and related tasks. We propose three space-efficient solutions building upon the document listing with frequencies problem. All the solutions use an r-index (Gagie et al., SODA 2018) as an underlying index structure for the text obtained as concatenation of the set of species, as well as for each species. Given t species whose concatenation length is n, and whose Burrows-Wheeler transform contains r runs, our first solution, based on a grammar-compressed document array with precomputed queries at non terminal symbols, reports the frequencies for the distinct documents in which the pattern of length m occurs in time. Our second solution is also based on a grammar-compressed document array, but enhanced with bitvectors and reports the frequencies in time, over a machine with wordsize w. Our third solution, based on the interleaved LCP array, answers the same query in time. We implemented our solutions and tested them on real-world and synthetic datasets. The results show that all the solutions are fast on highly-repetitive data, and the size overhead introduced by the indexes are comparable with the size of the r-index.Peer reviewe
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are
universal, that is, they do not need to know the versioning structure of the
collection, nor that a clear versioning structure even exists.
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower.Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Sk{\l}odowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
Document retrieval on repetitive string collections
Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.Peer reviewe
Indexing Highly Repetitive String Collections
Two decades ago, a breakthrough in indexing string collections made it
possible to represent them within their compressed space while at the same time
offering indexed search functionalities. As this new technology permeated
through applications like bioinformatics, the string collections experienced a
growth that outperforms Moore's Law and challenges our ability of handling them
even in compressed form. It turns out, fortunately, that many of these rapidly
growing string collections are highly repetitive, so that their information
content is orders of magnitude lower than their plain size. The statistical
compression methods used for classical collections, however, are blind to this
repetitiveness, and therefore a new set of techniques has been developed in
order to properly exploit it. The resulting indexes form a new generation of
data structures able to handle the huge repetitive string collections that we
are facing.
In this survey we cover the algorithmic developments that have led to these
data structures. We describe the distinct compression paradigms that have been
used to exploit repetitiveness, the fundamental algorithmic ideas that form the
base of all the existing indexes, and the various structures that have been
proposed, comparing them both in theoretical and practical aspects. We conclude
with the current challenges in this fascinating field