On the Approximation Ratio of Lempel-Ziv Parsing
Shannon’s entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based
compression. A plausible lower bound is b, the least number of phrases
of a general bidirectional parse of a text, where phrases can be copied
from anywhere else in the text. Since computing b is NP-complete, a
popular gold standard is z, the number of phrases in the Lempel-Ziv
parse of the text, where phrases can be copied only from the left. While
z can be computed in linear time, almost nothing has been known for
decades about its approximation ratio with respect to b. In this paper
we prove that z = O(b log(n/b)), where n is the text length. We also
show that the bound is tight as a function of n, by exhibiting a string
family where z = Ω(b log n). Our upper bound is obtained by building a
run-length context-free grammar based on a locally consistent parsing of
the text. Our lower bound is obtained by relating b with r, the number of
equal-letter runs in the Burrows-Wheeler transform of the text. Along the
way, we prove other relevant bounds between compressibility measures.
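For intuition about the measure z, the greedy left-to-right parse is easy to sketch. The toy Python function below counts LZ77-style phrases by brute force; the linear-time algorithms the abstract refers to use suffix structures instead, and phrase definitions vary slightly across the literature (here a phrase is the longest prefix of the remainder occurring earlier, or a single fresh symbol).

```python
def lz77_phrase_count(text: str) -> int:
    """Count phrases z of a greedy LZ77-style parse, where each phrase is
    either the longest prefix of the remainder that also occurs starting
    strictly to the left, or a single fresh symbol. Illustrative O(n^2)
    scan, not the linear-time algorithm the abstract mentions."""
    n = len(text)
    i = 0          # start of the next phrase
    z = 0          # number of phrases so far
    while i < n:
        longest = 0
        # try every earlier starting position as a copy source
        for j in range(i):
            length = 0
            # self-referential copying: the source may overlap the phrase
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            longest = max(longest, length)
        # advance by the copy length, or by one fresh symbol
        i += max(longest, 1)
        z += 1
    return z

print(lz77_phrase_count("abababab"))  # phrases: a, b, ababab -> 3
```

Note how repetitive inputs collapse into very few phrases, which is why z is used as a repetitiveness measure.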
Bicriteria data compression
The advent of massive datasets (and the consequent design of high-performing
distributed storage systems) has reignited the interest of the scientific and
engineering community towards the design of lossless data compressors which
achieve effective compression ratio and very efficient decompression speed.
Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because of
its decompression speed and its flexibility in trading decompression speed
versus compressed-space efficiency. Each of the existing implementations offers
a trade-off between space occupancy and decompression speed, so software
engineers have to content themselves with picking the one that comes closest to
the requirements of the application at hand. Starting from these
premises, and for the first time in the literature, we address in this paper
the problem of trading optimally, and in a principled way, the consumption of
these two resources by introducing the Bicriteria LZ77-Parsing problem, which
formalizes in a principled way what data-compressors have traditionally
approached by means of heuristics. The goal is to determine an LZ77 parsing
which minimizes the space occupancy in bits of the compressed file, provided
that the decompression time is bounded by a fixed amount (or vice-versa). This
way, the software engineer can set their space (or time) requirements and then
derive the LZ77 parsing which optimizes the decompression speed (or the space
occupancy, respectively). We solve this problem efficiently in O(n log^2 n)
time and optimal linear space within a small, additive approximation, by
proving and deploying some specific structural properties of the weighted graph
derived from the possible LZ77-parsings of the input file. The preliminary set
of experiments shows that our novel proposal dominates all the highly
engineered competitors, hence offering a win-win situation in theory and in practice.
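The weighted-graph view of the problem can be made concrete with a toy resource-constrained shortest-path DP. This is a sketch only: the paper's O(n log^2 n) algorithm exploits structural properties of the parsing graph rather than this pseudo-polynomial table, and all edge costs below are hypothetical.

```python
import math

def bicriteria_parse(n, edges, time_budget):
    """Minimum total bits of a parsing path from position 0 to n whose
    total decompression-time cost stays within time_budget, via a DP
    over (text position, time spent). Pseudo-polynomial toy version."""
    INF = math.inf
    # best[i][t] = min bits to reach position i having spent time t
    best = [[INF] * (time_budget + 1) for _ in range(n + 1)]
    best[0][0] = 0
    adj = {}
    for i, j, bits, t in edges:
        adj.setdefault(i, []).append((j, bits, t))
    for i in range(n):
        for t in range(time_budget + 1):
            if best[i][t] == INF:
                continue
            for j, bits, dt in adj.get(i, []):
                if t + dt <= time_budget and best[i][t] + bits < best[j][t + dt]:
                    best[j][t + dt] = best[i][t] + bits
    return min(best[n])

# toy instance over positions 0..4: four literal phrases (9 bits each,
# fast to decode) versus one long far copy (12 bits total, slow to decode)
edges = [(k, k + 1, 9, 1) for k in range(4)] + [(0, 4, 12, 6)]
print(bicriteria_parse(4, edges, 6))   # copy fits the budget -> 12 bits
print(bicriteria_parse(4, edges, 4))   # tight budget forces literals -> 36 bits
```

The two calls show the trade-off the abstract formalizes: relaxing the time bound lets the parser pick phrases that are cheaper in bits but slower to decompress.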
Lempel-Ziv-like Parsing in Small Space
Lempel-Ziv (LZ77 or, briefly, LZ) is one of the most effective and
widely-used compressors for repetitive texts. However, the existing efficient
methods computing the exact LZ parsing have to use linear or close to linear
space to index the input text during the construction of the parsing, which is
prohibitive for long inputs. An alternative is Relative Lempel-Ziv (RLZ), which
indexes only a fixed reference sequence, whose size can be controlled. Deriving
the reference sequence by sampling the text yields reasonable compression
ratios for RLZ, but performance is not always competitive with that of LZ and
depends heavily on the similarity of the reference to the text. In this paper
we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate
the LZ parsing using little memory. RLZ is first used to produce a sequence of
phrases, and these are regarded as metasymbols that are input to LZ for a
second-level parsing on a (most often) drastically shorter sequence. This
parsing is finally translated into one on the original sequence.
We analyze the new scheme and prove that, like LZ, it achieves the k-th
order empirical entropy compression nH_k + o(n log σ) for k = o(log_σ n),
where n is the input length and σ is the alphabet size. In fact, we prove
this entropy bound not only for ReLZ but for a wide class of LZ-like
encodings. Then, we establish a lower bound on the ReLZ approximation
ratio, showing that the number of phrases in it can be Ω(log n) times
larger than the number of phrases in LZ. Our experiments show that ReLZ is
faster than existing alternatives to compute the (exact or approximate) LZ
parsing, at the reasonable price of a small approximation factor, with
respect to the size of LZ, in all tested scenarios.
Comment: 21 pages, 6 figures, 2 tables
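The two-level idea can be sketched in a few lines of Python: a naive greedy RLZ parse against a fixed reference produces phrases, which are then renamed to integer metasymbols; a second-level LZ-style parse (not shown) would run on this much shorter sequence and be translated back. The reference and input strings are illustrative toy values.

```python
def rlz_parse(text, ref):
    """Greedy Relative Lempel-Ziv: each phrase is the longest prefix of
    the remaining text that occurs somewhere in the fixed reference, or
    a single literal symbol. Naive quadratic scan, for illustration."""
    phrases, i = [], 0
    while i < len(text):
        best = 0
        for j in range(len(ref)):
            l = 0
            while i + l < len(text) and j + l < len(ref) and ref[j + l] == text[i + l]:
                l += 1
            best = max(best, l)
        if best == 0:
            phrases.append(text[i]); i += 1     # literal fallback
        else:
            phrases.append(text[i:i + best]); i += best
    return phrases

def relz_sketch(text, ref):
    """ReLZ first level: RLZ phrases become integer metasymbols, giving
    a (usually much shorter) sequence for the second-level LZ parse."""
    first = rlz_parse(text, ref)
    alphabet = {p: k for k, p in enumerate(dict.fromkeys(first))}
    meta = [alphabet[p] for p in first]          # metasymbol sequence
    return first, meta

first, meta = relz_sketch("abcabcabcxy", "abc")
print(first)   # ['abc', 'abc', 'abc', 'x', 'y']
print(meta)    # [0, 0, 0, 1, 2]
```

The repeated metasymbol 0 is exactly what the second-level LZ pass would compress further before the parse is mapped back onto the original text.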
Optimal-Time Text Indexing in BWT-runs Bounded Space
Indexing highly repetitive texts --- such as genomic databases, software
repositories and versioned text collections --- has become an important problem
since the turn of the millennium. A relevant compressibility measure for
repetitive texts is r, the number of runs in their Burrows-Wheeler Transform
(BWT). One of the earliest indexes for repetitive collections, the Run-Length
FM-index, used O(r) space and was able to efficiently count the number of
occurrences of a pattern of length m in the text (in loglogarithmic time per
pattern symbol, with current techniques). However, it was unable to locate the
positions of those occurrences efficiently within a space bounded in terms of
r. Since then, a number of other indexes with space bounded by other measures
of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size
of the smallest grammar generating the text, the size of the smallest automaton
recognizing the text factors --- have been proposed for efficiently locating,
but not directly counting, the occurrences of a pattern. In this paper we close
this long-standing problem, showing how to extend the Run-Length FM-index so
that it can locate the occurrences efficiently within O(r) space (in
loglogarithmic time each), and reaching optimal time O(m + occ) within
O(r log(n/r)) space, on a RAM machine of w = Ω(log n) bits. Within
O(r log(n/r)) space, our index can also count in optimal time O(m).
Raising the space to O(r w log_σ(n/r)), we support count and locate in
O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the
packed setting and had not been obtained before in compressed space. We also
describe a structure using O(r log(n/r)) space that replaces the text and
extracts any text substring of length ℓ in almost-optimal time
O(log(n/r) + ℓ log(σ)/w). (…)
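For intuition about the measure r, the BWT can be built naively by sorting all rotations of the text plus a sentinel, and a highly repetitive text then collapses into very few equal-letter runs. A minimal sketch (production indexes build the BWT via suffix arrays rather than this quadratic rotation sort):

```python
def bwt_runs(text: str) -> int:
    """Number r of equal-letter runs in the Burrows-Wheeler Transform,
    built naively from all rotations of text plus a sentinel that sorts
    before every other symbol. Fine for short strings only."""
    s = text + "\0"                       # sentinel smaller than any symbol
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    bwt = "".join(rot[-1] for rot in rotations)
    # count boundaries between adjacent unequal symbols, plus one
    return 1 + sum(1 for a, b in zip(bwt, bwt[1:]) if a != b)

# highly repetitive text: 16 symbols collapse into just 3 BWT runs
print(bwt_runs("abababababababab"))  # -> 3
```

The gap between the text length n and r on inputs like this is exactly what makes r a useful space bound for the indexes discussed above.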
Real-time and distributed applications for dictionary-based data compression
The greedy approach to dictionary-based static text compression can be executed by a finite state machine.
When it is applied in parallel and independently to different blocks of data, it remains robust
even on standard large-scale distributed systems with input files of arbitrary size. Beyond the
standard large scale, however, the very small size of the data blocks degrades compression effectiveness.
A robust approach for extreme distributed systems is presented in this paper, where this problem is fixed by
overlapping adjacent blocks and preprocessing the neighborhoods of the boundaries.
Moreover, we introduce the notion of pseudo-prefix dictionary, which allows optimal compression by means
of a real-time semi-greedy procedure and a slight improvement on the compression ratio obtained by the
distributed implementations.
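The per-block step, a greedy longest-match parse against a static dictionary (the operation a finite state machine can execute), can be sketched as follows; the dictionary and blocks below are hypothetical toy values, and the boundary-overlap fix from the abstract is only noted, not implemented.

```python
def greedy_parse(dictionary, block):
    """Greedy longest-match static-dictionary parse of one data block:
    at each position take the longest dictionary word that matches, or
    emit a single literal symbol. Each block is parsed independently,
    which is what makes the scheme trivially parallel."""
    maxw = max(map(len, dictionary))
    phrases, i = [], 0
    while i < len(block):
        for l in range(min(maxw, len(block) - i), 0, -1):
            if block[i:i + l] in dictionary:
                break
        else:
            l = 1                      # literal fallback for unseen symbols
        phrases.append(block[i:i + l])
        i += l
    return phrases

D = {"a", "b", "ab", "abb"}            # toy static dictionary
blocks = ["ababb", "babab"]            # adjacent blocks of a larger input
print([greedy_parse(D, blk) for blk in blocks])
# [['ab', 'abb'], ['b', 'ab', 'ab']]
```

Because each block is parsed in isolation, a match straddling a block boundary is lost; overlapping adjacent blocks by roughly the longest dictionary-word length and preprocessing the boundary neighborhoods, as the paper proposes, recovers such matches.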
Universal Compressed Text Indexing
The rise of repetitive datasets has lately generated a lot of interest in
compressed self-indexes based on dictionary compression, a rich and
heterogeneous family that exploits text repetitions in different ways. For each
such compression scheme, several different indexing solutions have been
proposed in the last two decades. To date, the fastest indexes for repetitive
texts are based on the run-length compressed Burrows-Wheeler transform and on
the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on
the other hand, are based on the Lempel-Ziv parsing and on grammar compression.
Indexes for more universal schemes such as collage systems and macro schemes
have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed
that all dictionary compressors can be interpreted as approximation algorithms
for the smallest string attractor, that is, a set of text positions capturing
all distinct substrings. Starting from this observation, in this paper we
develop the first universal compressed self-index, that is, the first indexing
data structure based on string attractors, which can therefore be built on top
of any dictionary-compressed text representation. Let γ be the size of a
string attractor for a text of length n. Our index takes O(γ log(n/γ))
words of space and supports locating the occ occurrences of any pattern of
length m in O(m log n + occ log^ε n) time, for any constant ε > 0. This is,
in particular, the first index
for general macro schemes and collage systems. Our result shows that the
relation between indexing and compression is much deeper than what was
previously thought: the simple property standing at the core of all dictionary
compressors is sufficient to support fast indexed queries.
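The attractor property at the core of this result is simple to state and to check by brute force: every distinct substring must have at least one occurrence crossing a position of the set. A naive sketch (positions are 0-indexed and the example sets are chosen by hand):

```python
def is_attractor(text, positions):
    """Brute-force check of the string-attractor property: every
    distinct substring of text must have at least one occurrence
    spanning some position in `positions`. Fine for tiny strings."""
    n = len(text)
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            sub = text[start:start + length]
            # is some occurrence [i, i+length) of sub covered by the set?
            covered = any(
                any(i <= p < i + length for p in positions)
                for i in range(n - length + 1)
                if text[i:i + length] == sub
            )
            if not covered:
                return False
    return True

print(is_attractor("ababa", {1, 2}))   # True: every substring crosses 1 or 2
print(is_attractor("ababa", {0}))      # False: no occurrence of "b" covers 0
```

Any dictionary compressor implicitly produces such a set of positions, which is why an index built on attractors applies to all of them.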