84 research outputs found
Comparison of LZ77-type Parsings
We investigate the relations between different variants of the LZ77 parsing
existing in the literature. All of them are defined as greedily constructed
parsings encoding each phrase by reference to a string occurring earlier in the
input. They differ by the phrase encodings: encoded by pairs (length + position
of an earlier occurrence) or by triples (length + position of an earlier
occurrence + the letter following the earlier occurring part); and they differ
by allowing or not allowing overlaps between the phrase and its earlier
occurrence. For a given string of length over an alphabet of size ,
denote the numbers of phrases in the parsings allowing (resp., not allowing)
overlaps by (resp., ) for "pairs", and by (resp.,
) for "triples". We prove the following bounds and provide series of
examples showing that these bounds are tight:
and
;
and .Comment: 6 page
Bicriteria data compression
The advent of massive datasets (and the consequent design of high-performing
distributed storage systems) have reignited the interest of the scientific and
engineering community towards the design of lossless data compressors which
achieve effective compression ratio and very efficient decompression speed.
Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because of
its decompression speed and its flexibility in trading decompression speed
versus compressed-space efficiency. Each of the existing implementations offers
a trade-off between space occupancy and decompression speed, so software
engineers have to content themselves by picking the one which comes closer to
the requirements of the application in their hands. Starting from these
premises, and for the first time in the literature, we address in this paper
the problem of trading optimally, and in a principled way, the consumption of
these two resources by introducing the Bicriteria LZ77-Parsing problem, which
formalizes in a principled way what data-compressors have traditionally
approached by means of heuristics. The goal is to determine an LZ77 parsing
which minimizes the space occupancy in bits of the compressed file, provided
that the decompression time is bounded by a fixed amount (or vice-versa). This
way, the software engineer can set its space (or time) requirements and then
derive the LZ77 parsing which optimizes the decompression speed (or the space
occupancy, respectively). We solve this problem efficiently in O(n log^2 n)
time and optimal linear space within a small, additive approximation, by
proving and deploying some specific structural properties of the weighted graph
derived from the possible LZ77-parsings of the input file. The preliminary set
of experiments shows that our novel proposal dominates all the highly
engineered competitors, hence offering a win-win situation in theory&practice
L-Systems for Measuring Repetitiveness
In order to use them for compression, we extend L-systems (without ?-rules) with two parameters d and n, and also a coding ?, which determines unambiguously a string w = ?(?^d(s))[1:n], where ? is the morphism of the system, and s is its axiom. The length of the shortest description of an L-system generating w is known as ?, and it is arguably a relevant measure of repetitiveness that builds on the self-similarities that arise in the sequence.
In this paper, we deepen the study of the measure ? and its relation with a better-established measure called ?, which builds on substring complexity. Our results show that ? and ? are largely orthogonal, in the sense that one can be much larger than the other, depending on the case. This suggests that both mechanisms capture different kinds of regularities related to repetitiveness.
We then show that the recently introduced NU-systems, which combine the capabilities of L-systems with bidirectional macro schemes, can be asymptotically strictly smaller than both mechanisms for the same fixed string family, which makes the size ? of the smallest NU-system the unique smallest reachable repetitiveness measure to date. We conclude that in order to achieve better compression, we should combine morphism substitution with copy-paste mechanisms
Optimal-Time Text Indexing in BWT-runs Bounded Space
Indexing highly repetitive texts --- such as genomic databases, software
repositories and versioned text collections --- has become an important problem
since the turn of the millennium. A relevant compressibility measure for
repetitive texts is , the number of runs in their Burrows-Wheeler Transform
(BWT). One of the earliest indexes for repetitive collections, the Run-Length
FM-index, used space and was able to efficiently count the number of
occurrences of a pattern of length in the text (in loglogarithmic time per
pattern symbol, with current techniques). However, it was unable to locate the
positions of those occurrences efficiently within a space bounded in terms of
. Since then, a number of other indexes with space bounded by other measures
of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size
of the smallest grammar generating the text, the size of the smallest automaton
recognizing the text factors --- have been proposed for efficiently locating,
but not directly counting, the occurrences of a pattern. In this paper we close
this long-standing problem, showing how to extend the Run-Length FM-index so
that it can locate the occurrences efficiently within space (in
loglogarithmic time each), and reaching optimal time within
space, on a RAM machine of bits. Within
space, our index can also count in optimal time .
Raising the space to , we support count and locate in
and time, which is optimal in the
packed setting and had not been obtained before in compressed space. We also
describe a structure using space that replaces the text and
extracts any text substring of length in almost-optimal time
. (...continues...
LZ-Compressed String Dictionaries
We show how to compress string dictionaries using the Lempel-Ziv (LZ78) data
compression algorithm. Our approach is validated experimentally on dictionaries
of up to 1.5 GB of uncompressed text. We achieve compression ratios often
outperforming the existing alternatives, especially on dictionaries containing
many repeated substrings. Our query times remain competitive
Brotli: A General-Purpose Data Compressor
Brotli is an open source general-purpose data compressor introduced by Google in late 2013 and now adopted in most known browsers and Web servers. It is publicly available on GitHub and its data format was submitted as RFC 7932 in July 2016. Brotli is based on the Lempel-Ziv compression scheme and planned as a generic replacement of Gzip and ZLib. The main goal in its design was to compress data on the Internet, which meant optimizing the resources used at decoding time, while achieving maximal compression density.
This article is intended to provide the first thorough, systematic description of the Brotli format as well as a detailed computational and experimental analysis of the main algorithmic blocks underlying the current encoder implementation, together with a comparison against compressors of different families constituting the state-of-the-art either in practice or in theory. This treatment will allow us to raise a set of new algorithmic and software engineering problems that deserve further attention from the scientific community
- …