134 research outputs found
Sublinear Algorithms for Approximating String Compressibility
We raise the question of approximating the compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and a variant of Lempel-Ziv (LZ77), and present sublinear algorithms for approximating compressibility with respect to both schemes. We also give several lower bounds that show that our algorithms for both schemes cannot be improved significantly.
Our investigation of LZ77 yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate the compressibility of a string with respect to LZ77 to the number of distinct short substrings contained in it (its ℓth subword complexity , for small ℓ). In addition, we show that approximating the compressibility with respect to LZ77 is related to approximating the support size of a distribution.National Science Foundation (U.S.) (Award CCF-1065125)National Science Foundation (U.S.) (Award CCF-0728645)Marie Curie International Reintegration Grant PIRG03-GA-2008-231077Israel Science Foundation (Grant 1147/09)Israel Science Foundation (Grant 1675/09
Fast and Space-Efficient Construction of AVL Grammars from the LZ77 Parsing
Grammar compression is, next to Lempel-Ziv (LZ77) and run-length Burrows-Wheeler transform (RLBWT), one of the most flexible approaches to representing and processing highly compressible strings. The main idea is to represent a text as a context-free grammar whose language is precisely the input string. This is called a straight-line grammar (SLG). An AVL grammar, proposed by Rytter [Theor. Comput. Sci., 2003] is a type of SLG that additionally satisfies the AVL property: the heights of parse trees for children of every nonterminal differ by at most one. In contrast to other SLG constructions, AVL grammars can be constructed from the LZ77 parsing in compressed time: ?(z log n) where z is the size of the LZ77 parsing and n is the length of the input text. Despite these advantages, AVL grammars are thought to be too large to be practical.
We present a new technique for rapidly constructing a small AVL grammar from an LZ77 or LZ77-like parse. Our algorithm produces grammars that are always at least five times smaller than those produced by the original algorithm, and usually not more than double the size of grammars produced by the practical Re-Pair compressor [Larsson and Moffat, Proc. IEEE, 2000]. Our algorithm also achieves low peak RAM usage. By combining this algorithm with recent advances in approximating the LZ77 parsing, we show that our method has the potential to construct a run-length BWT in about one third of the time and peak RAM required by other approaches. Overall, we show that AVL grammars are surprisingly practical, opening the door to much faster construction of key compressed data structures
Image Characterization and Classification by Physical Complexity
We present a method for estimating the complexity of an image based on
Bennett's concept of logical depth. Bennett identified logical depth as the
appropriate measure of organized complexity, and hence as being better suited
to the evaluation of the complexity of objects in the physical world. Its use
results in a different, and in some sense a finer characterization than is
obtained through the application of the concept of Kolmogorov complexity alone.
We use this measure to classify images by their information content. The method
provides a means for classifying and evaluating the complexity of objects by
way of their visual representations. To the authors' knowledge, the method and
application inspired by the concept of logical depth presented herein are being
proposed and implemented for the first time.Comment: 30 pages, 21 figure
Decompressing Lempel-Ziv Compressed Text
We consider the problem of decompressing the Lempel--Ziv 77 representation of
a string of length using a working space as close as possible to the
size of the input. The folklore solution for the problem runs in
time but requires random access to the whole decompressed text. Another
folklore solution is to convert LZ77 into a grammar of size and
then stream in linear time. In this paper, we show that time and
working space can be achieved for constant-size alphabets. On general
alphabets of size , we describe (i) a trade-off achieving
time and space for any
, and (ii) a solution achieving time and
space. The latter solution, in particular, dominates both
folklore algorithms for the problem. Our solutions can, more generally, extract
any specified subsequence of with little overheads on top of the linear
running time and working space. As an immediate corollary, we show that our
techniques yield improved results for pattern matching problems on
LZ77-compressed text
- …