Universal Compressed Text Indexing
The rise of repetitive datasets has lately generated a lot of interest in
compressed self-indexes based on dictionary compression, a rich and
heterogeneous family that exploits text repetitions in different ways. For each
such compression scheme, several different indexing solutions have been
proposed in the last two decades. To date, the fastest indexes for repetitive
texts are based on the run-length compressed Burrows-Wheeler transform and on
the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on
the other hand, are based on the Lempel-Ziv parsing and on grammar compression.
Indexes for more universal schemes such as collage systems and macro schemes
have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed
that all dictionary compressors can be interpreted as approximation algorithms
for the smallest string attractor, that is, a set of text positions capturing
all distinct substrings. Starting from this observation, in this paper we
develop the first universal compressed self-index, that is, the first indexing
data structure based on string attractors, which can therefore be built on top
of any dictionary-compressed text representation. Let be the size of a
string attractor for a text of length . Our index takes
words of space and supports locating the
occurrences of any pattern of length in
time, for any constant . This is, in particular, the first index
for general macro schemes and collage systems. Our result shows that the
relation between indexing and compression is much deeper than what was
previously thought: the simple property standing at the core of all dictionary
compressors is sufficient to support fast indexed queries.
Comment: Fixed with reviewer's comment
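As an illustration of the string attractor notion used in the abstract above, the following is a minimal brute-force sketch, not the paper's index. It assumes the usual reading of the definition: a set of positions is an attractor if every distinct substring has at least one occurrence that covers one of those positions. The function name `is_string_attractor` and the 0-based position convention are assumptions made for this example.

```python
def is_string_attractor(text, positions):
    """Brute-force check (roughly cubic time) of the attractor property.

    `positions` holds 0-based indices into `text`. Returns True if every
    distinct substring of `text` has some occurrence text[i:j] with
    i <= p < j for at least one p in `positions`.
    """
    pos = set(positions)
    all_substrings = set()
    covered = set()
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = text[i:j]
            all_substrings.add(sub)
            # This occurrence counts if it spans a marked position.
            if any(i <= p < j for p in pos):
                covered.add(sub)
    return covered == all_substrings


print(is_string_attractor("aaaa", [0]))     # one position suffices here
print(is_string_attractor("abab", [0]))     # "b" never covers position 0
print(is_string_attractor("abab", [0, 1]))
```

Highly repetitive texts such as "aaaa" admit very small attractors, which is the intuition behind using the attractor size as a compressibility measure.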
Decompressing Lempel-Ziv Compressed Text
We consider the problem of decompressing the Lempel-Ziv 77 (LZ77) representation of
a string of length using a working space as close as possible to the
size of the input. The folklore solution for the problem runs in
time but requires random access to the whole decompressed text. Another
folklore solution is to convert LZ77 into a grammar of size and
then stream in linear time. In this paper, we show that time and
working space can be achieved for constant-size alphabets. On general
alphabets of size , we describe (i) a trade-off achieving
time and space for any
, and (ii) a solution achieving time and
space. The latter solution, in particular, dominates both
folklore algorithms for the problem. Our solutions can, more generally, extract
any specified subsequence of with little overhead on top of the linear
running time and working space. As an immediate corollary, we show that our
techniques yield improved results for pattern matching problems on
LZ77-compressed text.
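The first folklore solution mentioned in the abstract above, which decodes phrase by phrase while keeping the whole decompressed text available for random access, can be sketched as follows. The phrase encoding is an assumption made for this example: each phrase is either a single literal character or a `(source, length)` pair that copies from the already decoded text. This is the simple baseline, not the space-efficient algorithm of the paper.

```python
def lz77_decode(phrases):
    """Decode a list of LZ77-style phrases.

    Each phrase is either a one-character string (a literal) or a tuple
    (source, length) meaning: copy `length` characters starting at
    0-based position `source` of the text decoded so far.
    """
    out = []
    for phrase in phrases:
        if isinstance(phrase, str):
            out.append(phrase)
        else:
            source, length = phrase
            # Copy one character at a time so that a phrase may overlap
            # the text it is producing (self-referential copies).
            for k in range(length):
                out.append(out[source + k])
    return "".join(out)


print(lz77_decode(["a", "b", (0, 4)]))
```

The list `out` is exactly the random-access copy of the decompressed text that this approach requires, which is the working-space cost the paper aims to avoid.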
Approximating Edit Distance in the Fully Dynamic Model
The edit distance is a fundamental measure of sequence similarity, defined as
the minimum number of character insertions, deletions, and substitutions needed
to transform one string into the other. Given two strings of length at most
, simple dynamic programming computes their edit distance exactly in
time, which is also the best possible (up to subpolynomial factors)
assuming the Strong Exponential Time Hypothesis (SETH). The last few decades
have seen tremendous progress in edit distance approximation, where the runtime
has been brought down to subquadratic, near-linear, and even sublinear at the
cost of approximation.
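The simple dynamic program referred to above can be sketched as follows. This is the textbook quadratic-time static algorithm, kept to two rows of the table; it is not the dynamic algorithm of the paper, and the function name `edit_distance` is chosen for this example.

```python
def edit_distance(a, b):
    """Textbook dynamic program for the edit distance of strings a and b.

    prev[j] holds the distance between the current prefix of `a` and
    b[:j]; only two rows of the table are kept in memory.
    """
    prev = list(range(len(b) + 1))  # distance from "" to each b[:j]
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```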
In this paper, we study the dynamic edit distance problem, where the strings
change dynamically as the characters are substituted, inserted, or deleted over
time. Each change may happen at any location of either of the two strings. The
goal is to maintain the (exact or approximate) edit distance of such dynamic
strings while minimizing the update time. The exact edit distance can be
maintained in time per update (Charalampopoulos, Kociumaka,
Mozes; 2020), which is again tight assuming SETH. Unfortunately, even with the
unprecedented progress in edit distance approximation in the static setting,
strikingly little is known regarding dynamic edit distance approximation.
Using off-the-shelf tools, it is possible to achieve an
-approximation in update time for any constant . Improving upon
this trade-off remains open.
The contribution of this work is a dynamic -approximation algorithm
with amortized expected update time of . In other words, we bring the
approximation-ratio and update-time product down to . Our solution
builds on the elegant precision sampling tree framework for edit distance
approximation (Andoni, Krauthgamer, Onak; 2010).
Comment: Accepted to FOCS 202
Breaking the -Barrier in the Construction of Compressed Suffix Arrays
The suffix array, describing the lexicographic order of suffixes of a given
text, is the central data structure in string algorithms. The suffix array of a
length- text uses bits, which is prohibitive in many
applications. To address this, Grossi and Vitter [STOC 2000] and,
independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient
versions of the suffix array, known as the compressed suffix array (CSA) and
the FM-index. For a length- text over an alphabet of size , these
data structures use only bits. Immediately after their
discovery, they almost completely replaced plain suffix arrays in practical
applications, and a race started to develop efficient construction procedures.
Yet, after more than 20 years, even for , the fastest algorithm
remains stuck at time [Hon et al., FOCS 2003], which is slower by a
factor than the lower bound of (following
simply from the necessity to read the entire input). We break this
long-standing barrier with a new data structure that takes
bits, answers suffix array queries in time, and can be
constructed in time using
bits of space. Our result is based on several new insights into the recently
developed notion of string synchronizing sets [STOC 2019]. In particular,
compared to their previous applications, we eliminate orthogonal range queries,
replacing them with new queries that we dub prefix rank and prefix selection
queries. As a further demonstration of our techniques, we present a new
pattern-matching index that simultaneously minimizes the construction time and
the query time among all known compact indexes (i.e., those using bits).
Comment: 41 pages
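To make the object in the abstract above concrete, here is a naive sketch that builds a plain suffix array by sorting suffix start positions. It only illustrates the definition (the lexicographic order of the suffixes of a text); it is far slower and larger than the compressed structures the paper constructs, and the function name `naive_suffix_array` is an assumption for this example.

```python
def naive_suffix_array(text):
    """Return the start positions of all suffixes of `text`,
    sorted by the lexicographic order of the suffixes.

    Sorting materialises each suffix as a key, so this takes far more
    time and space than dedicated construction algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = naive_suffix_array("banana")
for i in sa:
    print(i, "banana"[i:])
```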
Small space and streaming pattern matching with k edits
In this work, we revisit the fundamental and well-studied problem of
approximate pattern matching under edit distance. Given an integer , a
pattern of length , and a text of length , the task is to
find substrings of that are within edit distance from . Our main
result is a streaming algorithm that solves the problem in
space and amortised time per character of the text, providing
answers correct with high probability. (Hereafter, hides a
factor.) This answers a decade-old question: since the
discovery of a -space streaming algorithm for pattern
matching under Hamming distance by Porat and Porat [FOCS 2009], the existence
of an analogous result for edit distance remained open. Up to this work, no
-space algorithm was known even in the simpler
semi-streaming model, where comes as a stream but is available for
read-only access. In this model, we give a deterministic algorithm that
achieves slightly better complexity.
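For reference, the problem stated above can be solved offline with a classical column-by-column dynamic program (often attributed to Sellers). The sketch below is that baseline, not the streaming algorithm of the paper: it keeps a full column of size proportional to the pattern length and needs the whole pattern in memory. The function name and the convention of reporting end positions as prefix lengths of the text are assumptions made for this example.

```python
def k_edit_occurrence_ends(pattern, text, k):
    """Report every j >= 1 such that some substring of `text` ending
    just before index j is within edit distance k of `pattern`.

    col[i] is the smallest edit distance between pattern[:i] and any
    substring of the text read so far that ends at the current position.
    """
    m = len(pattern)
    col = list(range(m + 1))  # before reading any text character
    ends = []
    for j, tc in enumerate(text, start=1):
        new = [0]  # an occurrence may start at any text position
        for i in range(1, m + 1):
            new.append(min(
                col[i] + 1,                           # extra text character
                new[i - 1] + 1,                       # skipped pattern character
                col[i - 1] + (pattern[i - 1] != tc),  # match or substitute
            ))
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends


print(k_edit_occurrence_ends("abc", "xxabcxx", 1))
```

Each text character is processed in time proportional to the pattern length, and the column is the only state carried between characters.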
In order to develop the fully streaming algorithm, we introduce a new edit
distance sketch parametrised by integers . For any string of length at
most , the sketch is of size and it can be computed with an
-space streaming algorithm. Given the sketches of two strings,
in time we can compute their edit distance or certify that it
is larger than . This result improves upon -size sketches of
Belazzougui and Zhu [FOCS 2016] and very recent -size sketches
of Jin, Nelson, and Wu [STACS 2021].