176 research outputs found
A really simple approximation of smallest grammar
In this paper we present a really simple linear-time algorithm constructing a
context-free grammar of size O(g log (N/g)) for the input string, where N is
the size of the input string and g the size of the optimal grammar generating
this string. The algorithm works for arbitrary size alphabets, but the running
time is linear assuming that the alphabet Sigma of the input string can be
identified with numbers from 1,ldots, N^c for some constant c. Algorithms with
such an approximation guarantee and running time are known, however all of them
were non-trivial and their analyses were involved. The here presented algorithm
computes the LZ77 factorisation and transforms it in phases to a grammar. In
each phase it maintains an LZ77-like factorisation of the word with at most l
factors as well as additional O(l) letters, where l was the size of the
original LZ77 factorisation. In one phase in a greedy way (by a left-to-right
sweep and a help of the factorisation) we choose a set of pairs of consecutive
letters to be replaced with new symbols, i.e. nonterminals of the constructed
grammar. We choose at least 2/3 of the letters in the word and there are O(l)
many different pairs among them. Hence there are O(log N) phases, each of them
introduces O(l) nonterminals to a grammar. A more precise analysis yields a
bound O(l log(N/l)). As l \leq g, this yields the desired bound O(g log(N/g)).Comment: Accepted for CPM 201
Universal Compressed Text Indexing
The rise of repetitive datasets has lately generated a lot of interest in
compressed self-indexes based on dictionary compression, a rich and
heterogeneous family that exploits text repetitions in different ways. For each
such compression scheme, several different indexing solutions have been
proposed in the last two decades. To date, the fastest indexes for repetitive
texts are based on the run-length compressed Burrows-Wheeler transform and on
the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on
the other hand, are based on the Lempel-Ziv parsing and on grammar compression.
Indexes for more universal schemes such as collage systems and macro schemes
have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed
that all dictionary compressors can be interpreted as approximation algorithms
for the smallest string attractor, that is, a set of text positions capturing
all distinct substrings. Starting from this observation, in this paper we
develop the first universal compressed self-index, that is, the first indexing
data structure based on string attractors, which can therefore be built on top
of any dictionary-compressed text representation. Let be the size of a
string attractor for a text of length . Our index takes
words of space and supports locating the
occurrences of any pattern of length in
time, for any constant . This is, in particular, the first index
for general macro schemes and collage systems. Our result shows that the
relation between indexing and compression is much deeper than what was
previously thought: the simple property standing at the core of all dictionary
compressors is sufficient to support fast indexed queries.Comment: Fixed with reviewer's comment
Computing LZ77 in Run-Compressed Space
In this paper, we show that the LZ77 factorization of a text T {\in\Sigma^n}
can be computed in O(R log n) bits of working space and O(n log R) time, R
being the number of runs in the Burrows-Wheeler transform of T reversed. For
extremely repetitive inputs, the working space can be as low as O(log n) bits:
exponentially smaller than the text itself. As a direct consequence of our
result, we show that a class of repetition-aware self-indexes based on a
combination of run-length encoded BWT and LZ77 can be built in asymptotically
optimal O(R + z) words of working space, z being the size of the LZ77 parsing
Efficient LZ78 factorization of grammar compressed text
We present an efficient algorithm for computing the LZ78 factorization of a
text, where the text is represented as a straight line program (SLP), which is
a context free grammar in the Chomsky normal form that generates a single
string. Given an SLP of size representing a text of length , our
algorithm computes the LZ78 factorization of in time
and space, where is the number of resulting LZ78 factors.
We also show how to improve the algorithm so that the term in the
time and space complexities becomes either , where is the length of the
longest LZ78 factor, or where is a quantity
which depends on the amount of redundancy that the SLP captures with respect to
substrings of of a certain length. Since where
is the alphabet size, the latter is asymptotically at least as fast as
a linear time algorithm which runs on the uncompressed string when is
constant, and can be more efficient when the text is compressible, i.e. when
and are small.Comment: SPIRE 201
- …