Search CORE

24 research outputs found

Comparison of LZ77-type Parsings

Author: Kosolobov Dmitry
Shur Arseny M.
Publication venue
Publication date: 23/05/2018
Field of study

We investigate the relations between different variants of the LZ77 parsing existing in the literature. All of them are defined as greedily constructed parsings encoding each phrase by reference to a string occurring earlier in the input. They differ by the phrase encodings: encoded by pairs (length + position of an earlier occurrence) or by triples (length + position of an earlier occurrence + the letter following the earlier occurring part); and they differ by allowing or not allowing overlaps between the phrase and its earlier occurrence. For a given string of length

n

over an alphabet of size

\sigma

, denote the numbers of phrases in the parsings allowing (resp., not allowing) overlaps by

z

(resp.,

\hat{z}

) for "pairs", and by

z_3

(resp.,

\hat{z}_3

) for "triples". We prove the following bounds and provide series of examples showing that these bounds are tight:

\bullet

z \le \hat{z} \le z \cdot O(\log\frac{n}{z\log_\sigma z})

and

z_3 \le \hat{z}_3 \le z_3 \cdot O(\log\frac{n}{z_3\log_\sigma z_3})

;

\bullet

\frac{1}2\hat{z} < \hat{z}_3 \le \hat{z}

and

\frac{1}2 z < z_3 \le z

.Comment: 6 page

arXiv.org e-Print Archive

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Bicriteria data compression

Author: Farruggia Andrea
Ferragina Paolo
Frangioni Antonio
Venturini Rossano
Publication venue
Publication date: 15/07/2013
Field of study

The advent of massive datasets (and the consequent design of high-performing distributed storage systems) have reignited the interest of the scientific and engineering community towards the design of lossless data compressors which achieve effective compression ratio and very efficient decompression speed. Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because of its decompression speed and its flexibility in trading decompression speed versus compressed-space efficiency. Each of the existing implementations offers a trade-off between space occupancy and decompression speed, so software engineers have to content themselves by picking the one which comes closer to the requirements of the application in their hands. Starting from these premises, and for the first time in the literature, we address in this paper the problem of trading optimally, and in a principled way, the consumption of these two resources by introducing the Bicriteria LZ77-Parsing problem, which formalizes in a principled way what data-compressors have traditionally approached by means of heuristics. The goal is to determine an LZ77 parsing which minimizes the space occupancy in bits of the compressed file, provided that the decompression time is bounded by a fixed amount (or vice-versa). This way, the software engineer can set its space (or time) requirements and then derive the LZ77 parsing which optimizes the decompression speed (or the space occupancy, respectively). We solve this problem efficiently in O(n log^2 n) time and optimal linear space within a small, additive approximation, by proving and deploying some specific structural properties of the weighted graph derived from the possible LZ77-parsings of the input file. The preliminary set of experiments shows that our novel proposal dominates all the highly engineered competitors, hence offering a win-win situation in theory&practice

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Pisa

Lempel-Ziv Parsing for Sequences of Blocks

Author: Kosolobov Dmitry
Valenzuela Daniel
Publication venue: Multidisciplinary Digital Publishing Institute
Publication date: 10/12/2021
Field of study

The Lempel-Ziv parsing (LZ77) is a widely popular construction lying at the heart of many compression algorithms. These algorithms usually treat the data as a sequence of bytes, i.e., blocks of fixed length 8. Another common option is to view the data as a sequence of bits. We investigate the following natural question: what is the relationship between the LZ77 parsings of the same data interpreted as a sequence of fixed-length blocks and as a sequence of bits (or other “elementary” letters)? In this paper, we prove that, for any integer b>1, the number z of phrases in the LZ77 parsing of a string of length n and the number zb of phrases in the LZ77 parsing of the same string in which blocks of length b are interpreted as separate letters (e.g., b=8 in case of bytes) are related as zb=O(bzlognz). The bound holds for both “overlapping” and “non-overlapping” versions of LZ77. Further, we establish a tight bound zb=O(bz) for the special case when each phrase in the LZ77 parsing of the string has a “phrase-aligned” earlier occurrence (an occurrence equal to the concatenation of consecutive phrases). The latter is an important particular case of parsing produced, for instance, by grammar-based compression methods

Helsingin yliopiston digitaalinen arkisto

Lempel-Ziv Parsing for Sequences of Blocks

Author: Kosolobov Dmitry
Valenzuela Daniel
Publication venue: Multidisciplinary Digital Publishing Institute
Publication date: 01/12/2021
Field of study

Helsingin yliopiston digitaalinen arkisto

Lempel-Ziv Parsing for Sequences of Blocks

Author: Kosolobov D.
Valenzuela D.
Publication venue: 'MDPI AG'
Publication date: 01/01/2021
Field of study

The Lempel-Ziv parsing (LZ77) is a widely popular construction lying at the heart of many compression algorithms. These algorithms usually treat the data as a sequence of bytes, i.e., blocks of fixed length 8. Another common option is to view the data as a sequence of bits. We investigate the following natural question: what is the relationship between the LZ77 parsings of the same data interpreted as a sequence of fixed-length blocks and as a sequence of bits (or other “elementary” letters)? In this paper, we prove that, for any integer b > 1, the number z of phrases in the LZ77 parsing of a string of length n and the number zb of phrases in the LZ77 parsing of the same string in which blocks of length b are interpreted as separate letters (e.g., b = 8 in case of bytes) are related as zb = O(bz lognz ). The bound holds for both “overlapping” and “non-overlapping” versions of LZ77. Further, we establish a tight bound zb = O(bz) for the special case when each phrase in the LZ77 parsing of the string has a “phrase-aligned” earlier occurrence (an occurrence equal to the concatenation of consecutive phrases). The latter is an important particular case of parsing produced, for instance, by grammar-based compression methods. © 2021 by the authors. Licensee MDPI, Basel, Switzerland.Funding: This research was funded by the Ministry of Science and Higher Education of the Russian Federation (Ural Mathematical Center project No. 075-02-2021-1387)

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Brotli: A General-Purpose Data Compressor

Author: Andrea Farruggia
Eugene Kliuchnikov
Jyrki Alakuijala
Lode Vandevenne
Paolo Ferragina
Robert Obryk
Zoltan Szabadka
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2019
Field of study

Brotli is an open source general-purpose data compressor introduced by Google in late 2013 and now adopted in most known browsers and Web servers. It is publicly available on GitHub and its data format was submitted as RFC 7932 in July 2016. Brotli is based on the Lempel-Ziv compression scheme and planned as a generic replacement of Gzip and ZLib. The main goal in its design was to compress data on the Internet, which meant optimizing the resources used at decoding time, while achieving maximal compression density. This article is intended to provide the first thorough, systematic description of the Brotli format as well as a detailed computational and experimental analysis of the main algorithmic blocks underlying the current encoder implementation, together with a comparison against compressors of different families constituting the state-of-the-art either in practice or in theory. This treatment will allow us to raise a set of new algorithmic and software engineering problems that deserve further attention from the scientific community

Archivio della Ricerca - Università di Pisa