
    Lempel-Ziv Factorization May Be Harder Than Computing All Runs

    The complexity of computing the Lempel-Ziv factorization and the set of all runs (= maximal repetitions) is studied in the decision tree model of computation over an ordered alphabet. It is known that both these problems can be solved by RAM algorithms in $O(n\log\sigma)$ time, where $n$ is the length of the input string and $\sigma$ is the number of distinct letters in it. We prove an $\Omega(n\log\sigma)$ lower bound on the number of comparisons required to construct the Lempel-Ziv factorization and thereby conclude that a popular technique of computing runs via the Lempel-Ziv factorization cannot achieve an $o(n\log\sigma)$ time bound. In contrast with this, we exhibit an $O(n)$ decision tree algorithm finding all runs in a string. Therefore, in the decision tree model the runs problem is easier than the Lempel-Ziv factorization. Thus we support the conjecture that there is a linear RAM algorithm finding all runs. Comment: 12 pages, 3 figures, submitted
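
    To make the first object concrete, here is a brute-force Python sketch (not the decision-tree or RAM algorithms discussed above) of one common variant of the Lempel-Ziv factorization: each factor is either a fresh letter or the longest prefix of the remaining suffix that also occurs starting at an earlier position.

        def lz_factorization(s):
            """Greedy LZ factorization: at position i, take the longest prefix of
            s[i:] that also starts at some position j < i; if none, take s[i]."""
            n = len(s)
            factors, i = [], 0
            while i < n:
                best = 0
                for j in range(i):
                    l = 0
                    while i + l < n and s[j + l] == s[i + l]:
                        l += 1
                    best = max(best, l)
                factors.append(s[i:i + best] if best > 0 else s[i])
                i += max(best, 1)
            return factors

        print(lz_factorization("abbaabbbaaabab"))
        # ['a', 'b', 'b', 'a', 'abb', 'baa', 'ab', 'ab']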

    Computing Runs on a General Alphabet

    We describe a RAM algorithm computing all runs (maximal repetitions) of a given string of length $n$ over a general ordered alphabet in $O(n\log^{2/3} n)$ time and linear space. Our algorithm outperforms all known solutions working in $\Theta(n\log\sigma)$ time provided $\sigma = n^{\Omega(1)}$, where $\sigma$ is the alphabet size. We conjecture that there exists a linear time RAM algorithm finding all runs. Comment: 4 pages, 2 figures
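
    For reference, a brute-force sketch of what "all runs" means (quadratic or worse, not the paper's algorithm): a run is a maximal periodic substring, i.e. a triple (i, j, p) such that s[i:j] has smallest period p, spans at least two periods, and period p breaks when the substring is extended by one letter to the left or to the right.

        def smallest_period(t):
            return next(p for p in range(1, len(t) + 1)
                        if all(t[k] == t[k + p] for k in range(len(t) - p)))

        def all_runs(s):
            n, res = len(s), []
            for i in range(n):
                for j in range(i + 2, n + 1):
                    p = smallest_period(s[i:j])
                    if 2 * p > j - i:
                        continue                              # not periodic enough
                    if i > 0 and s[i - 1] == s[i - 1 + p]:
                        continue                              # period extends left
                    if j < n and s[j] == s[j - p]:
                        continue                              # period extends right
                    res.append((i, j, p))
            return res

        print(all_runs("mississippi"))
        # [(1, 8, 3), (2, 4, 1), (5, 7, 1), (8, 10, 1)]  i.e. "ississi", "ss", "ss", "pp"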

    Finding the Leftmost Critical Factorization on Unordered Alphabet

    We present a linear time and space algorithm computing the leftmost critical factorization of a given string on an unordered alphabet. Comment: 13 pages, 13 figures (accepted to Theor. Comp. Sci.)
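
    For reference, a brute-force sketch of the notions involved (the paper's algorithm runs in linear time; this quadratic-time check is for illustration only): the local period at a cut is the length of the shortest square centred at that cut which agrees with the string wherever both copies fall inside it, and a cut is critical when its local period equals the global period of the string.

        def global_period(w):
            n = len(w)
            return next(p for p in range(1, n + 1)
                        if all(w[k] == w[k + p] for k in range(n - p)))

        def local_period(w, i):
            # smallest r >= 1 such that w[k] == w[k + r] whenever i - r <= k < i
            # and both k and k + r are valid positions (the shortest "square"
            # centred at the cut w[:i] | w[i:], possibly sticking out of w)
            n = len(w)
            return next(r for r in range(1, n + 1)
                        if all(w[k] == w[k + r]
                               for k in range(max(0, i - r), i) if k + r < n))

        def leftmost_critical_cut(w):
            # assumes len(w) >= 2; a critical cut exists by the Critical Factorization Theorem
            p = global_period(w)
            return next(i for i in range(1, len(w)) if local_period(w, i) == p)

        w = "abaab"                      # global period 3
        print(leftmost_critical_cut(w))  # 2, i.e. the factorization ("ab", "aab")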

    Relations Between Greedy and Bit-Optimal LZ77 Encodings

    This paper investigates the size in bits of the LZ77 encoding, which is the most popular and efficient variant of the Lempel-Ziv encodings used in data compression. We prove that, for a wide natural class of variable-length encoders for LZ77 phrases, the size of the greedily constructed LZ77 encoding on constant alphabets is within a factor $O(\frac{\log n}{\log\log\log n})$ of the optimal LZ77 encoding, where $n$ is the length of the processed string. We describe a series of examples showing that, surprisingly, this bound is tight, thus improving both the previously known upper and lower bounds. Further, we obtain a more detailed bound $O(\min\{z, \frac{\log n}{\log\log z}\})$, which uses the number $z$ of phrases in the greedy LZ77 encoding as a parameter, and construct a series of examples showing that this bound is tight even for a binary alphabet. We then investigate the problem on non-constant alphabets: we show that the known $O(\log n)$ bound is tight even for alphabets of logarithmic size, and provide tight bounds for some other important cases. Peer reviewed.
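
    A toy illustration of what "size in bits of a greedy LZ77 encoding" means (the paper treats a general class of variable-length phrase encoders; the Elias gamma codes and the flag-plus-literal scheme below are just one concrete choice made here for illustration):

        def elias_gamma_bits(x):
            # length in bits of the Elias gamma code of an integer x >= 1
            return 2 * x.bit_length() - 1

        def greedy_lz77_bit_size(s):
            n, i, bits = len(s), 0, 0
            while i < n:
                best_len, best_dist = 0, 0
                for j in range(i):
                    l = 0
                    while i + l < n and s[j + l] == s[i + l]:
                        l += 1                       # overlapping matches allowed
                    if l > best_len:
                        best_len, best_dist = l, i - j
                if best_len == 0:
                    bits += 1 + 8                    # flag bit + literal byte
                    i += 1
                else:
                    bits += 1 + elias_gamma_bits(best_dist) + elias_gamma_bits(best_len)
                    i += best_len
            return bits

        print(greedy_lz77_bit_size("abracadabra" * 4))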

    Comparison of LZ77-type Parsings

    We investigate the relations between different variants of the LZ77 parsing existing in the literature. All of them are defined as greedily constructed parsings encoding each phrase by reference to a string occurring earlier in the input. They differ in the phrase encodings: phrases are encoded either by pairs (length + position of an earlier occurrence) or by triples (length + position of an earlier occurrence + the letter following the earlier occurring part); and they differ by allowing or not allowing overlaps between a phrase and its earlier occurrence. For a given string of length $n$ over an alphabet of size $\sigma$, denote the numbers of phrases in the parsings allowing (resp., not allowing) overlaps by $z$ (resp., $\hat{z}$) for "pairs", and by $z_3$ (resp., $\hat{z}_3$) for "triples". We prove the following bounds and provide series of examples showing that these bounds are tight: • $z \le \hat{z} \le z \cdot O(\log\frac{n}{z\log_\sigma z})$ and $z_3 \le \hat{z}_3 \le z_3 \cdot O(\log\frac{n}{z_3\log_\sigma z_3})$; • $\frac{1}{2}\hat{z} < \hat{z}_3 \le \hat{z}$ and $\frac{1}{2}z < z_3 \le z$. Comment: 6 pages
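
    A brute-force sketch of the four parsings being compared (greedy longest match at each step; the toy string below is chosen only for illustration): pairs versus triples, with and without overlaps.

        def lz77_phrase_count(s, allow_overlap, triples):
            n, i, count = len(s), 0, 0
            while i < n:
                best = 0
                for j in range(i):
                    limit = (n - i) if allow_overlap else (i - j)
                    l = 0
                    while l < limit and i + l < n and s[j + l] == s[i + l]:
                        l += 1
                    best = max(best, l)
                if triples:
                    best = min(best, n - i - 1)   # keep the explicit letter inside s
                    i += best + 1                 # copied part + one explicit letter
                else:
                    i += max(best, 1)             # a fresh letter is its own phrase
                count += 1
            return count

        s = "abababababab"
        z      = lz77_phrase_count(s, allow_overlap=True,  triples=False)
        z_hat  = lz77_phrase_count(s, allow_overlap=False, triples=False)
        z3     = lz77_phrase_count(s, allow_overlap=True,  triples=True)
        z3_hat = lz77_phrase_count(s, allow_overlap=False, triples=True)
        print(z, z_hat, z3, z3_hat)   # -> 3 5 3 5 for this toy string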

    LZ-End Parsing in Linear Time

    Peer reviewed.

    Lempel-Ziv Parsing for Sequences of Blocks

    The Lempel-Ziv parsing (LZ77) is a widely popular construction lying at the heart of many compression algorithms. These algorithms usually treat the data as a sequence of bytes, i.e., blocks of fixed length 8. Another common option is to view the data as a sequence of bits. We investigate the following natural question: what is the relationship between the LZ77 parsings of the same data interpreted as a sequence of fixed-length blocks and as a sequence of bits (or other “elementary” letters)? In this paper, we prove that, for any integer $b>1$, the number $z$ of phrases in the LZ77 parsing of a string of length $n$ and the number $z_b$ of phrases in the LZ77 parsing of the same string in which blocks of length $b$ are interpreted as separate letters (e.g., $b=8$ in the case of bytes) are related as $z_b = O(bz\log\frac{n}{z})$. The bound holds for both “overlapping” and “non-overlapping” versions of LZ77. Further, we establish a tight bound $z_b = O(bz)$ for the special case when each phrase in the LZ77 parsing of the string has a “phrase-aligned” earlier occurrence (an occurrence equal to the concatenation of consecutive phrases). The latter is an important particular case of parsings produced, for instance, by grammar-based compression methods.
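
    A small experiment illustrating the question above (a sketch with arbitrary test data, not the paper's construction): count greedy LZ77 phrases when a bit string is parsed bit by bit and when the same bits are grouped into blocks of length b that are treated as single letters.

        def lz77_phrases(seq):
            # greedy LZ77 phrase count (overlaps allowed); works for any sequence
            # of comparable letters, e.g. a string of bits or a list of blocks
            n, i, count = len(seq), 0, 0
            while i < n:
                best = 0
                for j in range(i):
                    l = 0
                    while i + l < n and seq[j + l] == seq[i + l]:
                        l += 1
                    best = max(best, l)
                i += max(best, 1)
                count += 1
            return count

        bits = "01101000010010110110100101101011" * 4   # arbitrary test data
        b = 8
        blocks = [bits[k:k + b] for k in range(0, len(bits), b)]
        print(lz77_phrases(bits), lz77_phrases(blocks))  # z versus z_b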

    Tight lower bounds for the longest common extension problem

    The longest common extension problem is to preprocess a given string of length $n$ into a data structure that uses $S(n)$ bits on top of the input and answers in $T(n)$ time the queries $LCE(i, j)$ computing the length of the longest string that occurs at both positions $i$ and $j$ in the input. We prove that the trade-off $S(n)T(n) = \Omega(n\log n)$ holds in the non-uniform cell-probe model provided that the input string is read-only, each letter occupies a separate memory cell, $S(n) = \Omega(n)$, and the size of the input alphabet is at least $2^{8\lceil S(n)/n\rceil}$. It is known that this trade-off is tight. Peer reviewed.
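
    The query in question, in its most naive form (a sketch with a toy string; the paper studies space/time trade-offs for data structures answering such queries, while this linear scan represents the no-extra-space end of the spectrum):

        def lce_naive(s, i, j):
            # length of the longest common prefix of the suffixes s[i:] and s[j:]
            l = 0
            while i + l < len(s) and j + l < len(s) and s[i + l] == s[j + l]:
                l += 1
            return l

        s = "abracadabra"
        print(lce_naive(s, 0, 7))   # "abracadabra" vs "abra" share "abra" -> 4
        print(lce_naive(s, 1, 3))   # "bracadabra" vs "acadabra"           -> 0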