Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation
Given a static reference string R and a source string S, a relative
compression of S with respect to R is an encoding of S as a sequence of
references to substrings of R. Relative compression schemes are a classic
model of compression and have recently proved very successful for compressing
highly-repetitive massive data sets such as genomes and web-data. We initiate
the study of relative compression in a dynamic setting where the compressed
source string is subject to edit operations. The goal is to maintain the
compressed representation compactly, while supporting edits and allowing
efficient random access to the (uncompressed) source string. We present new
data structures that achieve optimal time for updates and queries while using
space linear in the size of the optimal relative compression, for nearly all
combinations of parameters. We also present solutions for restricted and
extended sets of updates. To achieve these results, we revisit the dynamic
partial sums problem and the substring concatenation problem. We present new
optimal or near-optimal bounds for these problems. Plugging in our new results,
we also immediately obtain new bounds for the string indexing for patterns with
wildcards problem and the dynamic text and static pattern matching problem.
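As a rough illustration of the static model described in this abstract, the sketch below encodes a source string as (start, length) references into a reference string and answers random-access queries by binary search over the cumulative factor lengths. It is not the dynamic data structure from the paper: the function names `relative_compress`, `factor_ends`, and `access` are hypothetical, the greedy parse uses naive substring search, and no edit operations are supported.

```python
from bisect import bisect_right
from itertools import accumulate


def relative_compress(reference, source):
    """Greedy parse: each factor is the longest prefix of the remaining
    source that occurs somewhere in the reference (naive search)."""
    factors = []
    i = 0
    while i < len(source):
        start = reference.find(source[i])
        if start < 0:
            raise ValueError(f"character {source[i]!r} does not occur in the reference")
        length = 1
        # Extend the factor while the longer substring still occurs.
        while i + length < len(source):
            pos = reference.find(source[i:i + length + 1])
            if pos < 0:
                break
            start, length = pos, length + 1
        factors.append((start, length))
        i += length
    return factors


def factor_ends(factors):
    """Cumulative end positions of the factors in the source string."""
    return list(accumulate(length for _, length in factors))


def access(reference, factors, ends, i):
    """Return source[i] using only the reference and the factor list."""
    if not 0 <= i < ends[-1]:
        raise IndexError(i)
    k = bisect_right(ends, i)          # first factor whose end exceeds i
    start, length = factors[k]
    offset = i - (ends[k] - length)    # position of i inside factor k
    return reference[start + offset]
```

With a static prefix-sum array, an edit to the source would force the array to be rebuilt, which is one way to see why the abstract revisits the dynamic partial sums problem.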
Random Access to Grammar Compressed Strings
Grammar based compression, where one replaces a long string by a small
context-free grammar that generates the string, is a simple and powerful
paradigm that captures many popular compression schemes. In this paper, we
present a novel grammar representation that allows efficient random access to
any character or substring without decompressing the string.
Let be a string of length compressed into a context-free grammar
of size . We present two representations of
achieving random access time, and either
construction time and space on the pointer machine model, or
construction time and space on the RAM. Here, is the inverse of
the row of Ackermann's function. Our representations also efficiently
support decompression of any substring in : we can decompress any substring
of length in the same complexity as a single random access query and
additional time. Combining these results with fast algorithms for
uncompressed approximate string matching leads to several efficient algorithms
for approximate string matching on grammar-compressed strings without
decompression. For instance, we can find all approximate occurrences of a
pattern with at most errors in time , where is the number of occurrences of in . Finally, we
generalize our results to navigation and other operations on grammar-compressed
ordered trees.
All of the above bounds significantly improve the currently best known
results. To achieve these bounds, we introduce several new techniques and data
structures of independent interest, including a predecessor data structure, two
"biased" weighted ancestor data structures, and a compact representation of
heavy paths in grammars. Comment: Preliminary version in SODA 201
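To make the idea of random access without decompression concrete, the following sketch stores the expansion length of every nonterminal of a straight-line grammar and walks down from the start symbol. This is the simple baseline whose query time is proportional to the grammar height, not the representation from the paper; the rule encoding (a dict mapping each symbol to either a single character or a pair of symbols) and the names `expansion_lengths` and `access` are assumptions made for illustration.

```python
def expansion_lengths(rules):
    """Length of the string generated by each symbol of the grammar.

    rules maps a symbol to a single character (terminal rule) or to a
    pair (left, right) of symbols (binary rule)."""
    lengths = {}

    def length(sym):
        if sym not in lengths:
            rhs = rules[sym]
            if isinstance(rhs, str):
                lengths[sym] = 1
            else:
                lengths[sym] = length(rhs[0]) + length(rhs[1])
        return lengths[sym]

    for sym in rules:
        length(sym)
    return lengths


def access(rules, lengths, root, i):
    """Return character i of the string generated by root, descending
    one rule per step instead of expanding the whole string."""
    if not 0 <= i < lengths[root]:
        raise IndexError(i)
    sym = root
    while True:
        rhs = rules[sym]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if i < lengths[left]:
            sym = left
        else:
            i -= lengths[left]
            sym = right


# Hypothetical example grammar generating the Fibonacci string "abaababa".
fib_rules = {
    "F1": "b",
    "F2": "a",
    "F3": ("F2", "F1"),
    "F4": ("F3", "F2"),
    "F5": ("F4", "F3"),
    "F6": ("F5", "F4"),
}
```

On an unbalanced grammar this descent can take as many steps as the grammar is deep, which is the cost the representations described in the abstract are designed to avoid.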
Faster subsequence recognition in compressed strings
Computation on compressed strings is one of the key approaches to processing
massive data sets. We consider local subsequence recognition problems on
strings compressed by straight-line programs (SLPs), a model closely related to
Lempel-Ziv compression. For an SLP-compressed text of length , and an
uncompressed pattern of length , Cégielski et al. gave an algorithm for
local subsequence recognition running in time . We improve
the running time to . Our algorithm can also be used to
compute the longest common subsequence between a compressed text and an
uncompressed pattern in time ; the same problem with a
compressed pattern is known to be NP-hard.
Wavelet Trees Meet Suffix Trees
We present an improved wavelet tree construction algorithm and discuss its
applications to a number of rank/select problems for integer keys and strings.
Given a string of length n over an alphabet of size , our
method builds the wavelet tree in time,
improving upon the state-of-the-art algorithm by a factor of .
As a consequence, given an array of n integers we can construct in time a data structure consisting of machine words and
capable of answering rank/select queries for the subranges of the array in
time. This is a -factor improvement in
query time compared to Chan and Pătraşcu and a -factor
improvement in construction time compared to Brodal et al.
Next, we switch to stringological context and propose a novel notion of
wavelet suffix trees. For a string w of length n, this data structure occupies
words, takes time to construct, and simultaneously
captures the combinatorial structure of substrings of w while enabling
efficient top-down traversal and binary search. In particular, with a wavelet
suffix tree we are able to answer in time the following two
natural analogues of rank/select queries for suffixes of substrings: for
substrings x and y of w count the number of suffixes of x that are
lexicographically smaller than y, and for a substring x of w and an integer k,
find the k-th lexicographically smallest suffix of x.
We further show that wavelet suffix trees make it possible to compute a
run-length-encoded Burrows-Wheeler transform of a substring x of w in time, where s denotes the length of the resulting run-length encoding.
This answers a question by Cormode and Muthukrishnan, who considered an
analogous problem for Lempel-Ziv compression. Comment: 33 pages, 5 figures; preliminary version published at SODA 201
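For readers unfamiliar with the underlying structure, here is a minimal sketch of a textbook pointer-based wavelet tree over integers with a rank query. It uses plain Python lists of prefix counts in place of succinct bitvectors and the standard level-by-level construction, so it does not reflect the faster construction or the wavelet suffix trees from the paper; the class name `WaveletTree` and its layout are assumptions for illustration, and the input is assumed to be non-empty.

```python
class WaveletTree:
    """Textbook wavelet tree over a non-empty sequence of integers."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return  # leaf node, or an empty subtree
        mid = (lo + hi) // 2
        # prefix[i] = how many of the first i elements go to the left child
        self.prefix = [0]
        for x in seq:
            self.prefix.append(self.prefix[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def rank(self, value, i):
        """Number of occurrences of value among the first i elements."""
        if i == 0 or value < self.lo or value > self.hi:
            return 0
        if self.lo == self.hi:
            return i
        mid = (self.lo + self.hi) // 2
        if value <= mid:
            return self.left.rank(value, self.prefix[i])
        return self.right.rank(value, i - self.prefix[i])
```

Each rank query follows a single root-to-leaf path, so its cost is proportional to the logarithm of the value range rather than to the length of the sequence.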
One-variable word equations in linear time
In this paper we consider word equations with one variable (and arbitrarily
many appearances of it). A recent technique of recompression, which is
applicable to general word equations, is shown to be suitable also in this
case. While in the general case it is non-deterministic, it becomes
deterministic in the case of one variable, and the obtained running time is
O(n + #_X log n), where #_X is the number of appearances of the variable in
the equation. This matches the previously best algorithm, due to Dąbrowski and
Plandowski. Then, using a couple of heuristics as well as a more detailed time
analysis, the running time is lowered to O(n) in the RAM model. Unfortunately,
no new properties of solutions are shown. Comment: submitted to a journal, general overhaul over the previous versio