Universal Compressed Text Indexing
The rise of repetitive datasets has lately generated a lot of interest in
compressed self-indexes based on dictionary compression, a rich and
heterogeneous family that exploits text repetitions in different ways. For each
such compression scheme, several different indexing solutions have been
proposed in the last two decades. To date, the fastest indexes for repetitive
texts are based on the run-length compressed Burrows-Wheeler transform and on
the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on
the other hand, are based on the Lempel-Ziv parsing and on grammar compression.
Indexes for more universal schemes such as collage systems and macro schemes
have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed
that all dictionary compressors can be interpreted as approximation algorithms
for the smallest string attractor, that is, a set of text positions capturing
all distinct substrings. Starting from this observation, in this paper we
develop the first universal compressed self-index, that is, the first indexing
data structure based on string attractors, which can therefore be built on top
of any dictionary-compressed text representation. Let γ be the size of a
string attractor for a text of length n. Our index takes O(γ log(n/γ))
words of space and supports locating the occ
occurrences of any pattern of length m in O(m log n + occ log^ε n)
time, for any constant ε > 0. This is, in particular, the first index
for general macro schemes and collage systems. Our result shows that the
relation between indexing and compression is much deeper than what was
previously thought: the simple property standing at the core of all dictionary
compressors is sufficient to support fast indexed queries.
Comment: Fixed following reviewer's comments
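The attractor property is simple enough to verify by brute force. The sketch below (a hypothetical helper, not code from the paper) checks whether a set of positions is a string attractor of a text, i.e. whether every distinct substring has at least one occurrence crossing an attractor position:

```python
def is_attractor(text: str, gamma: set) -> bool:
    """Return True iff every distinct substring of `text` has at least one
    occurrence text[i:i+L] containing some position p in `gamma`
    (i.e. i <= p < i+L). Brute force, for illustration only."""
    n = len(text)
    for length in range(1, n + 1):
        for sub in {text[i:i + length] for i in range(n - length + 1)}:
            covered = any(
                any(i <= p < i + length for p in gamma)
                for i in range(n - length + 1)
                if text[i:i + length] == sub
            )
            if not covered:
                return False
    return True

# The full position set is trivially an attractor; a single position
# usually is not, since some substring never crosses it.
print(is_attractor("abracadabra", set(range(11))))  # True
print(is_attractor("ab", {0}))                      # False: "b" never crosses 0
```

The interesting question, which the paper builds on, is how small such a set can be made for repetitive texts.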
Optimal Substring-Equality Queries with Applications to Sparse Text Indexing
We consider the problem of encoding a string S of length n from an integer
alphabet of size σ so that access and substring equality queries (that
is, determining the equality of any two substrings) can be answered
efficiently. Any uniquely-decodable encoding supporting access must take
n log σ + Θ(log n) bits. We describe a new data
structure matching this lower bound when n log σ = ω(log² n) while supporting
both queries in optimal O(1) time. Furthermore, we show that the string can
be overwritten in-place with this structure. The redundancy of Θ(log n)
bits and the constant query time break exponentially a lower bound that is
known to hold in the read-only model. Using our new string representation, we
obtain the first in-place subquadratic (indeed, even sublinear in some cases)
algorithms for several string-processing problems in the restore model: the
input string is rewritable and must be restored before the computation
terminates. In particular, we describe the first in-place subquadratic Monte
Carlo solutions to the sparse suffix sorting, sparse LCP array construction,
and suffix selection problems. With the sole exception of suffix selection, our
algorithms are also the first running in sublinear time for small enough sets
of input suffixes. Combining these solutions, we obtain the first
sublinear-time Monte Carlo algorithm for building the sparse suffix tree in
compact space. We also show how to derandomize our algorithms using small
space. This leads to the first Las Vegas in-place algorithm computing the full
LCP array in time and to the first Las Vegas in-place algorithms
solving the sparse suffix sorting and sparse LCP array construction problems in
time. Running times of these Las Vegas
algorithms hold in the worst case with high probability.Comment: Refactored according to TALG's reviews. New w.h.p. bounds and Las
Vegas algorithm
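The query interface can be illustrated with standard Karp-Rabin prefix fingerprints. The sketch below is not the paper's succinct encoding (it uses O(n) words rather than n log σ + Θ(log n) bits); it only shows how constant-time Monte Carlo substring-equality queries work:

```python
import random

class SubstringEquality:
    """Monte Carlo substring-equality queries via Karp-Rabin prefix
    fingerprints. One-sided error: equal substrings always compare equal;
    unequal ones collide only with negligible probability."""

    def __init__(self, s: str, q: int = (1 << 61) - 1):
        self.q = q
        self.base = random.randrange(256, q)
        n = len(s)
        self.h = [0] * (n + 1)    # h[i] = fingerprint of prefix s[:i]
        self.pw = [1] * (n + 1)   # pw[i] = base^i mod q
        for i, c in enumerate(s):
            self.h[i + 1] = (self.h[i] * self.base + ord(c)) % q
            self.pw[i + 1] = self.pw[i] * self.base % q

    def fingerprint(self, i: int, j: int) -> int:
        """Fingerprint of s[i:j], computed in O(1) time."""
        return (self.h[j] - self.h[i] * self.pw[j - i]) % self.q

    def equal(self, i: int, j: int, length: int) -> bool:
        """s[i:i+length] == s[j:j+length]?"""
        return self.fingerprint(i, i + length) == self.fingerprint(j, j + length)

se = SubstringEquality("abracadabra")
print(se.equal(0, 7, 4))  # "abra" vs "abra" -> True
print(se.equal(0, 1, 3))  # "abr" vs "bra" -> False (w.h.p.)
```

Such O(1)-time equality tests are exactly what make Monte Carlo sparse suffix sorting subquadratic: suffix comparisons reduce to a binary search over fingerprinted prefixes.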
Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space
Indexing highly repetitive texts - such as genomic databases, software
repositories and versioned text collections - has become an important problem
since the turn of the millennium. A relevant compressibility measure for
repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms
(BWTs). One of the earliest indexes for repetitive collections, the Run-Length
FM-index, used O(r) space and was able to efficiently count the number of
occurrences of a pattern of length m in the text (in loglogarithmic time per
pattern symbol, with current techniques). However, it was unable to locate the
positions of those occurrences efficiently within a space bounded in terms of
r. In this paper we close this long-standing problem, showing how to extend the
Run-Length FM-index so that it can locate the occ occurrences efficiently
within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m
+ occ), within O(r log log_w(σ + n/r)) space, for a text of length n
over an alphabet of size σ on a RAM machine with words of w =
Ω(log n) bits. Within that space, our index can also count in optimal
time, O(m). Multiplying the space by O(w/log σ), we support count and
locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which
is optimal in the packed setting and had not been obtained before in compressed
space. We also describe a structure using O(r log(n/r)) space that replaces the
text and extracts any text substring of length ℓ in almost-optimal time
O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide direct
access to suffix array, inverse suffix array, and longest common prefix array
cells, and extend these capabilities to full suffix tree functionality,
typically in O(log(n/r)) time per operation.
Comment: submitted version; optimal count and locate in smaller space: O(r log log_w(n/r + σ))
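The measure r can be computed straight from the definition. A toy sketch (quadratic rotation sort; production indexes derive the BWT from the suffix array instead) showing that a repetitive text has far fewer BWT runs than characters:

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations of text + terminator.
    O(n^2 log n) toy; real constructions go through the suffix array."""
    t = text + "\x00"  # unique smallest terminator
    return "".join(rot[-1] for rot in sorted(t[i:] + t[:i] for i in range(len(t))))

def runs(s: str) -> int:
    """r = number of maximal equal-letter runs."""
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b) if s else 0

print(bwt("banana"))  # 'annb\x00aa': equal contexts cluster equal characters
repetitive = "banana" * 50
print(runs(bwt(repetitive)), "runs for a text of length", len(repetitive) + 1)
```

The run count stays essentially flat as the text is repeated, which is why O(r) space can be dramatically smaller than n for versioned or genomic collections.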
Optimal-Time Dictionary-Compressed Indexes
We describe the first self-indexes able to count and locate pattern
occurrences in optimal time within a space bounded by the size of the most
popular dictionary compressors. To achieve this result we combine several
recent findings, including string attractors (new combinatorial
objects encompassing most known compressibility measures for highly repetitive
texts) and grammars based on locally-consistent parsing.
More in detail, let γ be the size of the smallest attractor for a text
of length n. The measure γ is an (asymptotic) lower bound to the
size of dictionary compressors based on Lempel-Ziv, context-free grammars, and
many others. The smallest known text representations in terms of attractors use
O(γ log(n/γ)) space, and our lightest indexes work within the same
asymptotic space. Let ε > 0 be a suitably small constant fixed at
construction time, m be the pattern length, and occ be the number of its
text occurrences. Our index counts pattern occurrences in O(m + log^{2+ε} n)
time, and locates them in O(m + (occ + 1) log^ε n) time. These times already outperform those of most dictionary-compressed
indexes, while obtaining the least asymptotic space for any index searching
within (m + occ) polylog(n) time. Further, by increasing the space
to O(γ log(n/γ) log^ε n), we reduce the locating time to the
optimal O(m + occ), and within O(γ log(n/γ) log n) space we can
also count in optimal O(m) time. No dictionary-compressed index had obtained
this time before. All our indexes can be constructed in O(n) space and
O(n log n) expected time.
As a byproduct of independent interest..
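The link between dictionary compressors and attractors can be made concrete with LZ77: Kempa and Prezza showed that the last position of each LZ77 phrase forms an attractor, so the number of phrases z upper-bounds γ. A greedy self-referential parse can be sketched as follows (toy quadratic version; the function name is ours):

```python
def lz77_phrases(s: str) -> list:
    """Greedy self-referential LZ77 parse: each phrase is the longest prefix
    of the remaining suffix that occurs starting at an earlier position,
    extended by one fresh character. Toy O(n^2)-ish version."""
    phrases, i = [], 0
    while i < len(s):
        l = 0
        # extend the match while s[i:i+l+1] occurs starting before position i
        # (searching within s[0:i+l] forces the occurrence to start at < i,
        # while still allowing self-referential overlap into the phrase)
        while i + l < len(s) and s.find(s[i:i + l + 1], 0, i + l) >= 0:
            l += 1
        phrase = s[i:i + l + 1]  # match plus one mismatching character
        phrases.append(phrase)
        i += len(phrase)
    return phrases

print(lz77_phrases("abababab"))  # ['a', 'b', 'ababab']: z = 3 phrases
```

On highly repetitive inputs z grows very slowly, which is the space regime these indexes target.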
Locally Consistent Parsing for Text Indexing in Small Space
We consider two closely related problems of text indexing in a sub-linear
working space. The first problem is the Sparse Suffix Tree (SST) construction
of a set of b suffixes using only O(b) words of space. The second problem
is the Longest Common Extension (LCE) problem, where for some parameter
τ, the goal is to construct a data structure that uses O(n/τ) words of space and can compute the longest common prefix length of
any pair of suffixes. We show how to use ideas based on the Locally Consistent
Parsing technique, that was introduced by Sahinalp and Vishkin [STOC '94], in
some non-trivial ways in order to improve the known results for the above
problems. We introduce new Las-Vegas and deterministic algorithms for both
problems.
We introduce the first Las-Vegas SST construction algorithm that takes O(n)
time. This is an improvement over the last result of Gawrychowski and Kociumaka
[SODA '17] who obtained O(n) time for a Monte-Carlo algorithm, and
O(n √(log b)) time for a Las-Vegas algorithm. In addition, we introduce a
randomized Las-Vegas construction for an LCE data structure that can be
constructed in linear time and answers queries in O(τ) time.
For the deterministic algorithms, we introduce an SST construction algorithm
that takes time (for ). This is
the first almost linear time, , deterministic SST
construction algorithm, where all previous algorithms take at least
time. For the LCE problem, we
introduce a data structure that answers LCE queries in
time, with construction time (for ).
This data structure improves both query time and construction time upon the
results of Tanimura et al. [CPM '16].
Comment: Extended abstract to appear in SODA 202
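The core idea of locally consistent parsing can be conveyed with a toy sketch, far simpler than the deterministic machinery in the paper: cut the text before every position whose (randomly permuted) symbol rank is a strict local minimum. Two equal substrings then receive identical cuts, except within a small fringe at their endpoints:

```python
import random

def lc_blocks(s: str, seed: int = 7) -> list:
    """Toy locally consistent blocking: assign each distinct symbol a random
    rank and cut before every strict local minimum. Boundaries depend only on
    a symbol's immediate neighborhood, so equal substrings are cut alike."""
    rng = random.Random(seed)
    alphabet = sorted(set(s))
    rank = {c: r for c, r in zip(alphabet, rng.sample(range(len(alphabet)), len(alphabet)))}
    cuts = [0] + [i for i in range(1, len(s) - 1)
                  if rank[s[i]] < rank[s[i - 1]] and rank[s[i]] < rank[s[i + 1]]]
    return [s[a:b] for a, b in zip(cuts, cuts[1:] + [len(s)])]

text = "abracadabra" * 3
blocks = lc_blocks(text)
print(blocks)
print("".join(blocks) == text)  # the blocks partition the text
```

Because the blocking of a substring is determined locally, comparing two suffixes reduces to comparing their (much shorter) block sequences, which is what enables LCE structures and sparse suffix sorting in small space.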