Document Retrieval on Repetitive Collections
Document retrieval aims at finding the most important documents where a
pattern appears in a collection of strings. Traditional pattern-matching
techniques yield brute-force document retrieval solutions, which has motivated
the research on tailored indexes that offer near-optimal performance. However,
an experimental study establishing which alternatives are actually better than
brute force, and which perform best depending on the collection
characteristics, has not been carried out. In this paper we address this
shortcoming by exploring the relationship between the nature of the underlying
collection and the performance of current methods. Via extensive experiments we
show that established solutions are often beaten in practice by brute-force
alternatives. We also design new methods that offer superior time/space
trade-offs, particularly on repetitive collections.
Comment: Accepted to ESA 2014. Implementation and experiments at
http://www.cs.helsinki.fi/group/suds/rlcsa
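As a rough illustration of the brute-force baseline the abstract refers to, the sketch below finds every occurrence of a pattern in a concatenated collection, maps each occurrence to its document, and then either deduplicates (document listing) or ranks by occurrence count (top-k). It is not the paper's implementation: the function names, the `\x00` separator, and the use of term frequency as the importance measure are assumptions made for this example, and a real system would locate occurrences through a compressed index rather than `str.find`.

```python
from bisect import bisect_right
from collections import Counter


def build_collection(docs, sep="\x00"):
    """Concatenate documents (assumed not to contain sep) and record each start offset."""
    starts, parts, pos = [], [], 0
    for doc in docs:
        starts.append(pos)
        parts.append(doc + sep)
        pos += len(doc) + len(sep)
    return "".join(parts), starts


def occurrences_per_document(text, starts, pattern):
    """Brute force: locate every occurrence and map it to the document containing it."""
    counts = Counter()
    i = text.find(pattern)
    while i != -1:
        counts[bisect_right(starts, i) - 1] += 1
        i = text.find(pattern, i + 1)  # step by one so overlapping matches are found
    return counts


def list_documents(text, starts, pattern):
    """Document listing: identifiers of all documents where the pattern occurs."""
    return sorted(occurrences_per_document(text, starts, pattern))


def top_k_documents(text, starts, pattern, k):
    """Top-k retrieval, using occurrence count as an assumed importance score."""
    return occurrences_per_document(text, starts, pattern).most_common(k)


text, starts = build_collection(["abracadabra", "cadabra", "banana"])
print(list_documents(text, starts, "abra"))
print(top_k_documents(text, starts, "abra", 1))
```

The cost of this approach grows with the total number of occurrences rather than with the number of distinct documents, which is the inefficiency that tailored document-retrieval indexes try to avoid.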
Indexing Finite Language Representation of Population Genotypes
With the recent advances in DNA sequencing, it is now possible to have
complete genomes of individuals sequenced and assembled. This rich and focused
genotype information can be used for various population-wide studies, now for
the first time directly at the whole-genome level. We propose a way to index population
genotype information together with the complete genome sequence, so that one
can use the index to efficiently align a given sequence to the genome with all
plausible genotype recombinations taken into account. This is achieved through
converting a multiple alignment of individual genomes into a finite automaton
recognizing all strings that can be read from the alignment by switching the
sequence at any time. The finite automaton is indexed with an extension of the
Burrows-Wheeler transform to allow pattern search inside the plausible
recombinant sequences. The size of the index stays limited, because of the high
similarity of individual genomes. The index finds applications in variation
calling and in primer design. On a variation calling experiment, we found about
1.0% of matches to novel recombinants just with exact matching, and up to 2.4%
with approximate matching.
Comment: This is the full version of the paper that was presented at WABI
2011. The implementation is available at
http://www.cs.helsinki.fi/group/suds/gcsa
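The conversion from a multiple alignment to an automaton can be sketched as follows. This is only a toy model under stated assumptions, not the indexed structure from the paper: states are (column, character) pairs, an edge joins consecutive non-gap symbols of the same row, and rows that share a state allow a switch from one sequence to another. The function names and the `-` gap symbol are hypothetical, and pattern search is done by direct simulation rather than through a Burrows-Wheeler-based index.

```python
def alignment_to_automaton(rows, gap="-"):
    """Build edges between (column, char) states from equal-length alignment rows."""
    edges = {}
    for row in rows:
        prev = None
        for col, ch in enumerate(row):
            if ch == gap:
                continue  # gaps contribute no state; the row jumps to its next symbol
            node = (col, ch)
            edges.setdefault(node, set())
            if prev is not None:
                edges[prev].add(node)
            prev = node
    return edges


def occurs(edges, pattern):
    """Return True if the pattern labels some path, i.e. matches a plausible recombinant."""
    if not pattern:
        return True
    current = {node for node in edges if node[1] == pattern[0]}
    for ch in pattern[1:]:
        current = {nxt for node in current for nxt in edges[node] if nxt[1] == ch}
        if not current:
            return False
    return True


automaton = alignment_to_automaton(["ACGT", "ATGA"])
print(occurs(automaton, "CGA"))  # starts in the first row, ends in the second
print(occurs(automaton, "TC"))
```

Here "CGA" is a substring of neither input row, yet it is accepted because both rows pass through the shared state for G in the third column; that is the kind of recombinant match the abstract describes.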
Universal Compressed Text Indexing
The rise of repetitive datasets has lately generated a lot of interest in
compressed self-indexes based on dictionary compression, a rich and
heterogeneous family that exploits text repetitions in different ways. For each
such compression scheme, several different indexing solutions have been
proposed in the last two decades. To date, the fastest indexes for repetitive
texts are based on the run-length compressed Burrows-Wheeler transform and on
the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on
the other hand, are based on the Lempel-Ziv parsing and on grammar compression.
Indexes for more universal schemes such as collage systems and macro schemes
have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed
that all dictionary compressors can be interpreted as approximation algorithms
for the smallest string attractor, that is, a set of text positions capturing
all distinct substrings. Starting from this observation, in this paper we
develop the first universal compressed self-index, that is, the first indexing
data structure based on string attractors, which can therefore be built on top
of any dictionary-compressed text representation. Let $\gamma$ be the size of a
string attractor for a text of length $n$. Our index takes
$O(\gamma\log(n/\gamma))$ words of space and supports locating the $occ$
occurrences of any pattern of length $m$ in $O(m\log n + occ\log^{\epsilon} n)$
time, for any constant $\epsilon > 0$. This is, in particular, the first index
for general macro schemes and collage systems. Our result shows that the
relation between indexing and compression is much deeper than what was
previously thought: the simple property standing at the core of all dictionary
compressors is sufficient to support fast indexed queries.
Comment: Fixed with reviewer's comment
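To make the notion of a string attractor concrete, the following brute-force check tests whether a set of positions is an attractor under the standard definition: every distinct substring must have at least one occurrence that covers one of the chosen positions. This is a minimal sketch for small inputs only; the function name and the 0-based position convention are assumptions of this example, and it has nothing to do with how the index itself is built.

```python
def is_string_attractor(text, positions):
    """Check the attractor property by enumerating all substrings (small inputs only)."""
    n = len(text)
    gamma = set(positions)
    for length in range(1, n + 1):
        covered = {}
        for start in range(n - length + 1):
            sub = text[start:start + length]
            # Does this occurrence span at least one attractor position?
            hit = any(start <= p < start + length for p in gamma)
            covered[sub] = covered.get(sub, False) or hit
        if not all(covered.values()):
            return False
    return True


print(is_string_attractor("abab", {0, 1}))
print(is_string_attractor("abab", {0}))
```

On "abab" the set {0, 1} works because each distinct substring has an occurrence touching position 0 or 1, while {0} fails because no occurrence of "b" covers position 0.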