Pattern matching in Lempel-Ziv compressed strings: fast, simple, and deterministic
Countless variants of Lempel-Ziv compression are widely used in many
real-life applications. This paper is concerned with a natural modification of
the classical pattern matching problem inspired by the popularity of such
compression methods: given an uncompressed pattern s[1..m] and a Lempel-Ziv
representation of a string t[1..N], does s occur in t? Farach and Thorup gave a
randomized O(n log^2(N/n) + m) time solution for this problem, where n is the
size of the compressed representation of t. We improve their result by developing a
faster and fully deterministic O(n log(N/n) + m) time algorithm with the same
space complexity. Note that for highly compressible texts, log(N/n) might be of
order n, so for such inputs the improvement is very significant. A (tiny)
fragment of our method can be used to give an asymptotically optimal solution
for the substring hashing problem considered by Farach and Muthukrishnan.
Comment: submitte
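To make the problem statement concrete, here is a minimal sketch of the naive baseline: decompress the text fully, then search it. This is not the algorithm of the paper, which avoids decompression; it takes time proportional to N rather than n. The factor format `(distance, length, next_char)` and the function names are assumptions chosen for illustration.

```python
def lz_decompress(factors):
    """Expand LZ77-style factors into the text they encode.

    Each factor is (distance, length, next_char): copy `length` characters
    starting `distance` positions back, then append `next_char`.
    This triple format is an illustrative assumption, not the paper's.
    """
    out = []
    for distance, length, next_char in factors:
        start = len(out) - distance
        for i in range(length):
            # Copy one character at a time so a factor may overlap itself.
            out.append(out[start + i])
        if next_char:
            out.append(next_char)
    return "".join(out)


def occurs(pattern, factors):
    """Naive baseline: decompress everything, then search the plain text."""
    return pattern in lz_decompress(factors)


# Hypothetical compressed text: "a", "b", then a self-overlapping copy.
factors = [(0, 0, "a"), (0, 0, "b"), (2, 4, "c")]
print(lz_decompress(factors))
print(occurs("babc", factors))
```

The third factor copies four characters from only two positions back, which is the self-referential case that lets highly repetitive texts compress so well.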
Efficient LZ78 factorization of grammar compressed text
We present an efficient algorithm for computing the LZ78 factorization of a
text, where the text is represented as a straight-line program (SLP), which is
a context-free grammar in Chomsky normal form that generates a single
string. Given an SLP of size n representing a text T of length N, our
algorithm computes the LZ78 factorization of T in O(n√N + m log N) time
and O(n√N + m) space, where m is the number of resulting LZ78 factors.
We also show how to improve the algorithm so that the n√N term in the
time and space complexities becomes either nL, where L is the length of the
longest LZ78 factor, or (N - α), where α ≥ 0 is a quantity
which depends on the amount of redundancy that the SLP captures with respect to
substrings of T of a certain length. Since m = O(N / log_σ N), where σ
is the alphabet size, the latter is asymptotically at least as fast as
a linear time algorithm which runs on the uncompressed string when σ is
constant, and can be more efficient when the text is compressible, i.e. when
m and n are small.
Comment: SPIRE 201
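The two objects named in this abstract can be sketched in a few lines. The code below expands a small SLP and then computes the LZ78 factorization of the expanded string with a trie. It is the uncompressed baseline only, not the algorithm of the paper, which works on the grammar directly. The rule encoding (a single character, or a pair of symbol names) and all identifiers are assumptions made for illustration.

```python
def expand_slp(rules, symbol):
    """Expand an SLP in Chomsky normal form into the string it generates.

    `rules` maps each symbol to either a single character (terminal rule)
    or a tuple (left, right) of two symbols. Encoding is illustrative.
    """
    rhs = rules[symbol]
    if isinstance(rhs, str):
        return rhs
    left, right = rhs
    return expand_slp(rules, left) + expand_slp(rules, right)


def lz78_factorize(text):
    """Return the LZ78 factors of `text`.

    Each factor is the longest previous factor plus one new character;
    the final factor may be incomplete (equal to an earlier one).
    """
    trie = {}       # (parent factor id, char) -> factor id; 0 is the empty factor
    factors = []
    node = 0        # current position in the trie
    start = 0       # start index of the factor being built
    for i, ch in enumerate(text):
        nxt = trie.get((node, ch))
        if nxt is not None:
            node = nxt                      # keep extending the current factor
        else:
            trie[(node, ch)] = len(factors) + 1
            factors.append(text[start:i + 1])
            node = 0
            start = i + 1
    if start < len(text):
        factors.append(text[start:])        # trailing, possibly repeated factor
    return factors


# Hypothetical SLP generating "abababa".
rules = {
    "X1": "a",
    "X2": "b",
    "X3": ("X1", "X2"),   # ab
    "X4": ("X3", "X3"),   # abab
    "X5": ("X4", "X3"),   # ababab
    "X6": ("X5", "X1"),   # abababa
}
text = expand_slp(rules, "X6")
print(lz78_factorize(text))
```

Expanding the grammar first costs time proportional to the text length N, which is exactly the cost that working on the SLP is meant to avoid when the text is compressible.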
Faster subsequence recognition in compressed strings
Computation on compressed strings is one of the key approaches to processing
massive data sets. We consider local subsequence recognition problems on
strings compressed by straight-line programs (SLPs), a scheme closely related to
Lempel-Ziv compression. For an SLP-compressed text of length m̄, and an
uncompressed pattern of length n, Cégielski et al. gave an algorithm for
local subsequence recognition running in time O(m̄ n^2 log n). We improve
the running time to O(m̄ n^1.5). Our algorithm can also be used to
compute the longest common subsequence between a compressed text and an
uncompressed pattern in time O(m̄ n^1.5); the same problem with a
compressed pattern is known to be NP-hard.
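For reference, the two underlying problems are easy to state on plain strings. The sketch below shows a subsequence test and the textbook dynamic program for the length of the longest common subsequence. Both assume the text has already been decompressed, so they are baselines rather than the compressed-domain algorithm described above; the function names are illustrative.

```python
def is_subsequence(pattern, text):
    """True if `pattern` can be obtained from `text` by deleting characters."""
    it = iter(text)
    # `ch in it` advances the iterator, so matches must occur left to right.
    return all(ch in it for ch in pattern)


def lcs_length(a, b):
    """Length of the longest common subsequence of `a` and `b`.

    Standard row-by-row dynamic program: O(len(a) * len(b)) time,
    O(len(b)) space.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(is_subsequence("ace", "abcde"))
print(lcs_length("abcbdab", "bdcaba"))
```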
Computing LZ77 in Run-Compressed Space
In this paper, we show that the LZ77 factorization of a text T ∈ Σ^n
can be computed in O(R log n) bits of working space and O(n log R) time, R
being the number of runs in the Burrows-Wheeler transform of T reversed. For
extremely repetitive inputs, the working space can be as low as O(log n) bits:
exponentially smaller than the text itself. As a direct consequence of our
result, we show that a class of repetition-aware self-indexes based on a
combination of run-length encoded BWT and LZ77 can be built in asymptotically
optimal O(R + z) words of working space, z being the size of the LZ77 parsing
- …
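The parameter R in this abstract can be illustrated with a small sketch that builds the Burrows-Wheeler transform of the reversed text by sorting rotations and then counts its runs of equal characters. This quadratic construction is for illustration only and uses far more space than the paper's method; the `$` sentinel and the function names are assumptions.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (illustration only).

    Assumes `sentinel` does not occur in `text` and sorts before every
    character of it.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s):
    """Number of maximal runs of equal characters in `s`."""
    return sum(1 for i, ch in enumerate(s) if i == 0 or ch != s[i - 1])


text = "banana"
transformed = bwt(text[::-1])      # BWT of the reversed text
print(transformed, count_runs(transformed))
```

Repetitive texts tend to produce long runs in the transform, which is why R can be much smaller than the text length n.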