104 research outputs found
Real-time and distributed applications for dictionary-based data compression
The greedy approach to dictionary-based static text compression can be executed by a finite state machine.
When it is applied in parallel to different blocks of data independently, there is no lack of robustness
even on standard large scale distributed systems with input files of arbitrary size. Beyond standard large
scale, a negative effect on the compression effectiveness is caused by the very small size of the data blocks.
A robust approach for extreme distributed systems is presented in this paper, where this problem is fixed by
overlapping adjacent blocks and preprocessing the neighborhoods of the boundaries.
Moreover, we introduce the notion of pseudo-prefix dictionary, which allows optimal compression by means
of a real-time semi-greedy procedure and a slight improvement on the compression ratio obtained by the
distributed implementations
Optimal Parsing for Dictionary Text Compression
Dictionary-based compression algorithms include a parsing strategy to
transform the input text into a sequence of dictionary phrases. Given a text,
such process usually is not unique and, for compression purpose, it makes
sense to find one of the possible parsing that minimize the final compression
ratio. This is the parsing problem. An optimal parsing is a parsing strategy
or a parsing algorithm that solve the parsing problem taking account of
all the constraints of a compression algorithm or of a class of homogeneous
compression algorithms. Compression algorithm constrains are, for instance,
the dictionary itself, i.e. the dynamic set of available phrases, and how much
a phrase weights on the compressed text, i.e. the number of bits of which
the codeword representing such phrase is composed, also denoted as the
encoding cost of a dictionary pointer.
In more than 30th years of history of dictionary based text compression,
while plenty of algorithms, variants and extensions appeared and while dictionary
approach to text compression became one of the most appreciated
and utilized in almost all the storage and communication processes, only few
optimal parsing algorithms were presented. Many compression algorithms
still leaks optimality of their parsing or, at least, proof of optimality. This
happens because there is not a general model of the parsing problem that includes
all the dictionary based algorithms and because the existing optimal
parsing algorithms work under too restrictive hypothesis.
This work focus on the parsing problem and presents both a general
model for dictionary based text compression called Dictionary-Symbolwise
Text Compression theory and a general parsing algorithm that is proved
to be optimal under some realistic hypothesis. This algorithm is called
iii
Dictionary-Symbolwise Flexible Parsing and it covers almost all of the known
cases of dictionary based text compression algorithms together with the large
class of their variants where the text is decomposed in a sequence of symbols
and dictionary phrases.
In this work we further consider the case of a free mixture of a dictionary
compressor and a symbolwise compressor. Our Dictionary-Symbolwise
Flexible Parsing covers also this case. We have indeed an optimal parsing
algorithm in the case of dictionary-symbolwise compression where the dictionary
is prefix closed and the cost of encoding dictionary pointer is variable.
The symbolwise compressor is any classical one that works in linear time, as
many common variable-length encoders do. Our algorithm works under the
assumption that a special graph that will be described in the following, is
well defined. Even if this condition is not satisfied, it is possible to use the
same method to obtain almost optimal parses. In detail, when the dictionary
is LZ78-like, we show how to implement our algorithm in linear time.
When the dictionary is LZ77-like our algorithm can be implemented in time
O(n log n). Both have O(n) space complexity.
Even if the main aim of this work is of theoretical nature, some experimental
results will be introduced to underline some practical effects of
the parsing optimality in terms of compression performance and to show
how to improve the compression ratio by building extensions Dictionary-
Symbolwise of known algorithms. Finally, some more detailed experiments
are hosted in a devoted appendix
Bicriteria data compression
The advent of massive datasets (and the consequent design of high-performing
distributed storage systems) have reignited the interest of the scientific and
engineering community towards the design of lossless data compressors which
achieve effective compression ratio and very efficient decompression speed.
Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because of
its decompression speed and its flexibility in trading decompression speed
versus compressed-space efficiency. Each of the existing implementations offers
a trade-off between space occupancy and decompression speed, so software
engineers have to content themselves by picking the one which comes closer to
the requirements of the application in their hands. Starting from these
premises, and for the first time in the literature, we address in this paper
the problem of trading optimally, and in a principled way, the consumption of
these two resources by introducing the Bicriteria LZ77-Parsing problem, which
formalizes in a principled way what data-compressors have traditionally
approached by means of heuristics. The goal is to determine an LZ77 parsing
which minimizes the space occupancy in bits of the compressed file, provided
that the decompression time is bounded by a fixed amount (or vice-versa). This
way, the software engineer can set its space (or time) requirements and then
derive the LZ77 parsing which optimizes the decompression speed (or the space
occupancy, respectively). We solve this problem efficiently in O(n log^2 n)
time and optimal linear space within a small, additive approximation, by
proving and deploying some specific structural properties of the weighted graph
derived from the possible LZ77-parsings of the input file. The preliminary set
of experiments shows that our novel proposal dominates all the highly
engineered competitors, hence offering a win-win situation in theory&practice
Lempel-Ziv Compression in a Sliding Window
We present new algorithms for the sliding window Lempel-Ziv (LZ77) problem and the approximate rightmost LZ77 parsing problem.
Our main result is a new and surprisingly simple algorithm that computes the sliding window LZ77 parse in O(w) space and either O(n) expected time or O(n log log w+z log log s) deterministic time. Here, w is the window size, n is the size of the input string, z is the number of phrases in the parse, and s is the size of the alphabet. This matches the space and time bounds of previous results while removing constant size restrictions on the alphabet size.
To achieve our result, we combine a simple modification and augmentation of the suffix tree with periodicity properties of sliding windows. We also apply this new technique to obtain an algorithm for the approximate rightmost LZ77 problem that uses O(n(log z + log log n)) time and O(n) space and produces a (1+e)-approximation of the rightmost parsing (any constant e>0). While this does not improve the best known time-space trade-offs for exact rightmost parsing, our algorithm is significantly simpler and exposes a direct connection between sliding window parsing and the approximate rightmost matching problem
Efficient Text Compression Algorithm Based on an Existing Dictionary
This research article presents a new efficient lossless text compression algorithm based on an existing dictionary. The proposed algorithm represents the target texts to be compressed in a bit form, and the vocabularies are stored in the existing dictionary. Regarding to the results, the time complexity only takes O(n) time of both cases of encoding and decoding scenarios. The space complexity is O(d) bit(s) per 2d words where d=1,2,3,…The theoretical results showed bits per words and maximum spaces to be saved. These results indicated that the maximum original texts could be compressed more than 99 %
Fine-Grained Complexity of Analyzing Compressed Data: Quantifying Improvements over Decompress-And-Solve
Can we analyze data without decompressing it? As our data keeps growing, understanding the time complexity of problems on compressed inputs, rather than in convenient uncompressed forms, becomes more and more relevant. Suppose we are given a compression of size of data that originally has size , and we want to solve a problem with time complexity . The naive strategy of "decompress-and-solve" gives time , whereas "the gold standard" is time : to analyze the compression as efficiently as if the original data was small. We restrict our attention to data in the form of a string (text, files, genomes, etc.) and study the most ubiquitous tasks. While the challenge might seem to depend heavily on the specific compression scheme, most methods of practical relevance (Lempel-Ziv-family, dictionary methods, and others) can be unified under the elegant notion of Grammar Compressions. A vast literature, across many disciplines, established this as an influential notion for Algorithm design. We introduce a framework for proving (conditional) lower bounds in this field, allowing us to assess whether decompress-and-solve can be improved, and by how much. Our main results are: - The bound for LCS and the bound for Pattern Matching with Wildcards are optimal up to factors, under the Strong Exponential Time Hypothesis. (Here, denotes the uncompressed length of the compressed pattern.) - Decompress-and-solve is essentially optimal for Context-Free Grammar Parsing and RNA Folding, under the -Clique conjecture. - We give an algorithm showing that decompress-and-solve is not optimal for Disjointness
Fine-Grained Complexity of Analyzing Compressed Data: Quantifying Improvements over Decompress-And-Solve
Can we analyze data without decompressing it? As our data keeps growing,
understanding the time complexity of problems on compressed inputs, rather than
in convenient uncompressed forms, becomes more and more relevant. Suppose we
are given a compression of size of data that originally has size , and
we want to solve a problem with time complexity . The naive strategy
of "decompress-and-solve" gives time , whereas "the gold standard" is
time : to analyze the compression as efficiently as if the original data
was small.
We restrict our attention to data in the form of a string (text, files,
genomes, etc.) and study the most ubiquitous tasks. While the challenge might
seem to depend heavily on the specific compression scheme, most methods of
practical relevance (Lempel-Ziv-family, dictionary methods, and others) can be
unified under the elegant notion of Grammar Compressions. A vast literature,
across many disciplines, established this as an influential notion for
Algorithm design.
We introduce a framework for proving (conditional) lower bounds in this
field, allowing us to assess whether decompress-and-solve can be improved, and
by how much. Our main results are:
- The bound for LCS and the
bound for Pattern Matching with Wildcards are optimal up to factors,
under the Strong Exponential Time Hypothesis. (Here, denotes the
uncompressed length of the compressed pattern.)
- Decompress-and-solve is essentially optimal for Context-Free Grammar
Parsing and RNA Folding, under the -Clique conjecture.
- We give an algorithm showing that decompress-and-solve is not optimal for
Disjointness.Comment: Presented at FOCS'17. Full version. 63 page
- …