Search CORE

27 research outputs found

Lempel-Ziv Factorization May Be Harder Than Computing All Runs

Author: Kosolobov Dmitry
Publication venue
Publication date: 19/09/2014
Field of study

The complexity of computing the Lempel-Ziv factorization and the set of all runs (= maximal repetitions) is studied in the decision tree model of computation over ordered alphabet. It is known that both these problems can be solved by RAM algorithms in

O(n\log\sigma)

time, where

n

is the length of the input string and

\sigma

is the number of distinct letters in it. We prove an

\Omega(n\log\sigma)

lower bound on the number of comparisons required to construct the Lempel-Ziv factorization and thereby conclude that a popular technique of computation of runs using the Lempel-Ziv factorization cannot achieve an

o(n\log\sigma)

time bound. In contrast with this, we exhibit an

O(n)

decision tree algorithm finding all runs in a string. Therefore, in the decision tree model the runs problem is easier than the Lempel-Ziv factorization. Thus we support the conjecture that there is a linear RAM algorithm finding all runs.Comment: 12 pages, 3 figures, submitte

arXiv.org e-Print Archive

CiteSeerX

DROPS Dagstuhl Research Online Publication Server

Computing Runs on a General Alphabet

Author: Kosolobov Dmitry
Publication venue
Publication date: 22/11/2015
Field of study

We describe a RAM algorithm computing all runs (maximal repetitions) of a given string of length

n

over a general ordered alphabet in

O(n\log^{\frac{2}3} n)

time and linear space. Our algorithm outperforms all known solutions working in

\Theta(n\log\sigma)

time provided

\sigma = n^{\Omega(1)}

, where

\sigma

is the alphabet size. We conjecture that there exists a linear time RAM algorithm finding all runs.Comment: 4 pages, 2 figure

arXiv.org e-Print Archive

Crossref

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Finding the Leftmost Critical Factorization on Unordered Alphabet

Author: Kosolobov Dmitry
Publication venue
Publication date: 01/01/2016
Field of study

We present a linear time and space algorithm computing the leftmost critical factorization of a given string on an unordered alphabet.Comment: 13 pages, 13 figures (accepted to Theor. Comp. Sci.

arXiv.org e-Print Archive

Crossref

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Relations Between Greedy and Bit-Optimal LZ77 Encodings

Author: Kosolobov Dmitry
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2018
Field of study

This paper investigates the size in bits of the LZ77 encoding, which is the most popular and efficient variant of the Lempel--Ziv encodings used in data compression. We prove that, for a wide natural class of variable-length encoders for LZ77 phrases, the size of the greedily constructed LZ77 encoding on constant alphabets is within a factor

O(\frac{\log n}{\log\log\log n})

of the optimal LZ77 encoding, where

n

is the length of the processed string. We describe a series of examples showing that, surprisingly, this bound is tight, thus improving both the previously known upper and lower bounds. Further, we obtain a more detailed bound

O(\min\{z, \frac{\log n}{\log\log z}\})

, which uses the number

z

of phrases in the greedy LZ77 encoding as a parameter, and construct a series of examples showing that this bound is tight even for binary alphabet. We then investigate the problem on non-constant alphabets: we show that the known

O(\log n)

bound is tight even for alphabets of logarithmic size, and provide tight bounds for some other important cases.Peer reviewe

arXiv.org e-Print Archive

DROPS Dagstuhl Research Online Publication Server

Helsingin yliopiston digitaalinen arkisto

Comparison of LZ77-type Parsings

Author: Kosolobov Dmitry
Shur Arseny M.
Publication venue
Publication date: 23/05/2018
Field of study

We investigate the relations between different variants of the LZ77 parsing existing in the literature. All of them are defined as greedily constructed parsings encoding each phrase by reference to a string occurring earlier in the input. They differ by the phrase encodings: encoded by pairs (length + position of an earlier occurrence) or by triples (length + position of an earlier occurrence + the letter following the earlier occurring part); and they differ by allowing or not allowing overlaps between the phrase and its earlier occurrence. For a given string of length

n

over an alphabet of size

\sigma

, denote the numbers of phrases in the parsings allowing (resp., not allowing) overlaps by

z

(resp.,

\hat{z}

) for "pairs", and by

z_3

(resp.,

\hat{z}_3

) for "triples". We prove the following bounds and provide series of examples showing that these bounds are tight:

\bullet

z \le \hat{z} \le z \cdot O(\log\frac{n}{z\log_\sigma z})

and

z_3 \le \hat{z}_3 \le z_3 \cdot O(\log\frac{n}{z_3\log_\sigma z_3})

;

\bullet

\frac{1}2\hat{z} < \hat{z}_3 \le \hat{z}

and

\frac{1}2 z < z_3 \le z

.Comment: 6 page

arXiv.org e-Print Archive

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

LZ-End Parsing in Linear Time

Author: Kempa Dominik
Kosolobov Dmitry
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2017
Field of study

Peer reviewe

DROPS Dagstuhl Research Online Publication Server

Helsingin yliopiston digitaalinen arkisto

Lempel-Ziv Parsing for Sequences of Blocks

Author: Kosolobov Dmitry
Valenzuela Daniel
Publication venue: Multidisciplinary Digital Publishing Institute
Publication date: 10/12/2021
Field of study

The Lempel-Ziv parsing (LZ77) is a widely popular construction lying at the heart of many compression algorithms. These algorithms usually treat the data as a sequence of bytes, i.e., blocks of fixed length 8. Another common option is to view the data as a sequence of bits. We investigate the following natural question: what is the relationship between the LZ77 parsings of the same data interpreted as a sequence of fixed-length blocks and as a sequence of bits (or other “elementary” letters)? In this paper, we prove that, for any integer b>1, the number z of phrases in the LZ77 parsing of a string of length n and the number zb of phrases in the LZ77 parsing of the same string in which blocks of length b are interpreted as separate letters (e.g., b=8 in case of bytes) are related as zb=O(bzlognz). The bound holds for both “overlapping” and “non-overlapping” versions of LZ77. Further, we establish a tight bound zb=O(bz) for the special case when each phrase in the LZ77 parsing of the string has a “phrase-aligned” earlier occurrence (an occurrence equal to the concatenation of consecutive phrases). The latter is an important particular case of parsing produced, for instance, by grammar-based compression methods

Helsingin yliopiston digitaalinen arkisto

Lempel-Ziv Parsing for Sequences of Blocks

Author: Kosolobov Dmitry
Valenzuela Daniel
Publication venue: Multidisciplinary Digital Publishing Institute
Publication date: 01/12/2021
Field of study

Helsingin yliopiston digitaalinen arkisto

Tight lower bounds for the longest common extension problem

Author: Kosolobov Dmitry
Publication venue
Publication date: 11/05/2017
Field of study

The longest common extension problem is to preprocess a given string of length n into a data structure that uses S(n) bits on top of the input and answers in T(n) time the queries LCE(i, j) computing the length of the longest string that occurs at both positions i and j in the input. We prove that the trade-off S (n)T (n) = (it logn) holds in the non-uniform cell-probe model provided that the input string is read-only, each letter occupies a separate memory cell, S(n) = Omega(n), and the size of the input alphabet is at least 2(8inverted right perpendicularS(n)/ninverted left perpendicular). It is known that this trade-off is tight. (C) 2017 Elsevier B.V. All rights reserved.Peer reviewe

arXiv.org e-Print Archive

Crossref

Helsingin yliopiston digitaalinen arkisto