39 research outputs found
Approximation of grammar-based compression via recompression
In this paper we present a simple linear-time algorithm constructing a
context-free grammar of size O(g log(N/g)) for the input string, where N is the
size of the input string and g the size of the optimal grammar generating this
string. The algorithm works for arbitrary size alphabets, but the running time
is linear assuming that the alphabet \Sigma of the input string can be
identified with numbers from {1, ..., N^c} for some constant c. Otherwise,
additional cost of O(n log|\Sigma|) is needed.
Algorithms with such approximation guarantees and running time are known, the
novelty of this paper is a particular simplicity of the algorithm as well as
the analysis of the algorithm, which uses a general technique of recompression
recently introduced by the author. Furthermore, contrary to the previous
results, this work does not use the LZ representation of the input string in
the construction, nor in the analysis.Comment: 22 pages, some many small improvements, to be submited to a journa
Comparison of LZ77-type Parsings
We investigate the relations between different variants of the LZ77 parsing
existing in the literature. All of them are defined as greedily constructed
parsings encoding each phrase by reference to a string occurring earlier in the
input. They differ by the phrase encodings: encoded by pairs (length + position
of an earlier occurrence) or by triples (length + position of an earlier
occurrence + the letter following the earlier occurring part); and they differ
by allowing or not allowing overlaps between the phrase and its earlier
occurrence. For a given string of length over an alphabet of size ,
denote the numbers of phrases in the parsings allowing (resp., not allowing)
overlaps by (resp., ) for "pairs", and by (resp.,
) for "triples". We prove the following bounds and provide series of
examples showing that these bounds are tight:
and
;
and .Comment: 6 page
A really simple approximation of smallest grammar
In this paper we present a really simple linear-time algorithm constructing a
context-free grammar of size O(g log (N/g)) for the input string, where N is
the size of the input string and g the size of the optimal grammar generating
this string. The algorithm works for arbitrary size alphabets, but the running
time is linear assuming that the alphabet Sigma of the input string can be
identified with numbers from 1,ldots, N^c for some constant c. Algorithms with
such an approximation guarantee and running time are known, however all of them
were non-trivial and their analyses were involved. The here presented algorithm
computes the LZ77 factorisation and transforms it in phases to a grammar. In
each phase it maintains an LZ77-like factorisation of the word with at most l
factors as well as additional O(l) letters, where l was the size of the
original LZ77 factorisation. In one phase in a greedy way (by a left-to-right
sweep and a help of the factorisation) we choose a set of pairs of consecutive
letters to be replaced with new symbols, i.e. nonterminals of the constructed
grammar. We choose at least 2/3 of the letters in the word and there are O(l)
many different pairs among them. Hence there are O(log N) phases, each of them
introduces O(l) nonterminals to a grammar. A more precise analysis yields a
bound O(l log(N/l)). As l \leq g, this yields the desired bound O(g log(N/g)).Comment: Accepted for CPM 201
Rpair: Rescaling RePair with Rsync
Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built on the original dataset reasonably quickly while keeping the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing such that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice
Universal Compressed Text Indexing
The rise of repetitive datasets has lately generated a lot of interest in
compressed self-indexes based on dictionary compression, a rich and
heterogeneous family that exploits text repetitions in different ways. For each
such compression scheme, several different indexing solutions have been
proposed in the last two decades. To date, the fastest indexes for repetitive
texts are based on the run-length compressed Burrows-Wheeler transform and on
the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on
the other hand, are based on the Lempel-Ziv parsing and on grammar compression.
Indexes for more universal schemes such as collage systems and macro schemes
have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed
that all dictionary compressors can be interpreted as approximation algorithms
for the smallest string attractor, that is, a set of text positions capturing
all distinct substrings. Starting from this observation, in this paper we
develop the first universal compressed self-index, that is, the first indexing
data structure based on string attractors, which can therefore be built on top
of any dictionary-compressed text representation. Let be the size of a
string attractor for a text of length . Our index takes
words of space and supports locating the
occurrences of any pattern of length in
time, for any constant . This is, in particular, the first index
for general macro schemes and collage systems. Our result shows that the
relation between indexing and compression is much deeper than what was
previously thought: the simple property standing at the core of all dictionary
compressors is sufficient to support fast indexed queries.Comment: Fixed with reviewer's comment
Accepted for CPM 2014
In this paper we present a really simple linear-time algorithm constructing a context-free grammar of size O(g log (N/g)) for the input string, where N is the size of the input string and g the size of the optimal grammar generating this string. The algorithm works for arbitrary size alphabets, but the running time is linear assuming that the alphabet Sigma of the input string can be identified with numbers from 1,ldots, N^c for some constant c. Algorithms with such an approximation guarantee and running time are known, however all of them were non-trivial and their analyses were involved. The here presented algorithm computes the LZ77 factorisation and transforms it in phases to a grammar. In each phase it maintains an LZ77-like factorisation of the word with at most l factors as well as additional O(l) letters, where l was the size of the original LZ77 factorisation. In one phase in a greedy way (by a left-to-right sweep and a help of the factorisation) we choose a set of pairs of consecutive letters to be replaced with new symbols, i.e. nonterminals of the constructed grammar. We choose at least 2/3 of the letters in the word and there are O(l) many different pairs among them. Hence there are O(log N) phases, each of them introduces O(l) nonterminals to a grammar. A more precise analysis yields a bound O(l log(N/l)). As l \leq g, this yields the desired bound O(g log(N/g))