Optimal Parsing for Dictionary Text Compression
Dictionary-based compression algorithms include a parsing strategy that
transforms the input text into a sequence of dictionary phrases. For a given
text, such a parsing is usually not unique and, for compression purposes, it
makes sense to find one of the possible parsings that minimizes the final
compression ratio. This is the parsing problem. An optimal parsing is a parsing
strategy or a parsing algorithm that solves the parsing problem, taking into
account all the constraints of a compression algorithm or of a class of
homogeneous compression algorithms. Compression algorithm constraints are, for
instance, the dictionary itself, i.e. the dynamic set of available phrases, and
how much a phrase weighs on the compressed text, i.e. the number of bits in
the codeword representing that phrase, also called the
encoding cost of a dictionary pointer.
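As a minimal illustration of the parsing problem stated above, the following sketch compares a greedy (longest-match) parse with a minimum-cost parse found by dynamic programming. It assumes a static dictionary given as a hypothetical `cost` mapping from phrase to codeword length in bits; the function names and the toy dictionary are illustrative only and are not taken from the thesis, which deals with dynamic dictionaries.

```python
from math import inf

def optimal_parse(text, cost):
    """Minimum-cost parse of `text` into phrases of `cost` (phrase -> bits)."""
    n = len(text)
    max_len = max(map(len, cost))
    best = [0] + [inf] * n      # best[i]: cheapest encoding of text[:i]
    back = [None] * (n + 1)     # back[j]: start of the last phrase ending at j
    for i in range(n):
        if best[i] == inf:
            continue            # position i is not reachable by any parse
        for j in range(i + 1, min(n, i + max_len) + 1):
            c = cost.get(text[i:j])
            if c is not None and best[i] + c < best[j]:
                best[j], back[j] = best[i] + c, i
    if best[n] == inf:
        raise ValueError("text cannot be parsed with this dictionary")
    phrases, j = [], n
    while j > 0:                # walk the back-pointers from the end
        phrases.append(text[back[j]:j])
        j = back[j]
    return phrases[::-1], best[n]

def greedy_parse(text, cost):
    """Longest-match-first parse, for comparison."""
    phrases, total, i = [], 0, 0
    max_len = max(map(len, cost))
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in cost:
                phrases.append(text[i:j])
                total += cost[text[i:j]]
                i = j
                break
        else:
            raise ValueError("no phrase matches at position %d" % i)
    return phrases, total

# Toy dictionary (an assumption): every pointer costs 8 bits.
toy = {"a": 8, "b": 8, "ab": 8, "bbb": 8}
print(greedy_parse("abbb", toy))    # (['ab', 'b', 'b'], 24)
print(optimal_parse("abbb", toy))   # (['a', 'bbb'], 16)
```

On this toy input the greedy choice of the longest first phrase forces two extra phrases, while the dynamic program finds the two-phrase parse.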
In more than 30 years of history of dictionary-based text compression,
plenty of algorithms, variants and extensions have appeared, and the dictionary
approach to text compression has become one of the most appreciated
and widely used in almost all storage and communication processes, yet only a
few optimal parsing algorithms have been presented. Many compression algorithms
still lack optimality of their parsing or, at least, a proof of optimality. This
happens because there is no general model of the parsing problem that includes
all dictionary-based algorithms, and because the existing optimal
parsing algorithms work under overly restrictive hypotheses.
This work focuses on the parsing problem and presents both a general
model for dictionary-based text compression, called Dictionary-Symbolwise
Text Compression theory, and a general parsing algorithm that is proved
to be optimal under some realistic hypotheses. This algorithm is called
Dictionary-Symbolwise Flexible Parsing and it covers almost all of the known
cases of dictionary-based text compression algorithms, together with the large
class of their variants where the text is decomposed into a sequence of symbols
and dictionary phrases.
In this work we further consider the case of a free mixture of a dictionary
compressor and a symbolwise compressor. Our Dictionary-Symbolwise
Flexible Parsing also covers this case. We indeed have an optimal parsing
algorithm for dictionary-symbolwise compression where the dictionary
is prefix closed and the cost of encoding a dictionary pointer is variable.
The symbolwise compressor is any classical one that works in linear time, as
many common variable-length encoders do. Our algorithm works under the
assumption that a special graph, described later in this work, is
well defined. Even if this condition is not satisfied, it is possible to use the
same method to obtain almost optimal parses. In detail, when the dictionary
is LZ78-like, we show how to implement our algorithm in linear time.
When the dictionary is LZ77-like, our algorithm can be implemented in time
O(n log n). Both have O(n) space complexity.
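The mixture of dictionary and symbolwise coding can be pictured as a shortest-path problem on a graph whose nodes are text positions. The sketch below is a brute-force illustration under stated assumptions, not the Dictionary-Symbolwise Flexible Parsing algorithm and not its linear or O(n log n) implementation. The assumptions are: an LZ77-like dictionary made of all earlier factors, an Elias-gamma-style cost for offsets and lengths, a fixed 8-bit literal as the symbolwise cost, and a 1-bit flag per phrase. All names (`gamma_bits`, `ds_parse`, `decode`) are hypothetical.

```python
def gamma_bits(x):
    """Length in bits of the Elias gamma code of x >= 1 (assumed cost model)."""
    return 2 * x.bit_length() - 1

def ds_parse(text, literal_bits=8, flag_bits=1, max_len=32):
    """Cheapest mix of literals and LZ77-like copies, by shortest path on a DAG."""
    n = len(text)
    best = [0] + [float("inf")] * n   # best[j]: cheapest encoding of text[:j]
    step = [None] * (n + 1)           # step[j]: (i, label) of the edge i -> j
    for i in range(n):
        # Symbolwise edge i -> i+1: emit one literal.
        c = best[i] + flag_bits + literal_bits
        if c < best[i + 1]:
            best[i + 1], step[i + 1] = c, (i, ("lit", text[i]))
        # Dictionary edges i -> j: copy text[i:j] from an earlier occurrence.
        for j in range(i + 1, min(n, i + max_len) + 1):
            p = text.rfind(text[i:j], 0, j - 1)  # rightmost start p < i
            if p < 0:
                break  # prefix closure: no longer factor can match either
            c = best[i] + flag_bits + gamma_bits(i - p) + gamma_bits(j - i)
            if c < best[j]:
                best[j], step[j] = c, (i, ("copy", i - p, j - i))
    parse, j = [], n
    while j > 0:
        j, label = step[j]
        parse.append(label)
    return parse[::-1], best[n]

def decode(parse):
    """Rebuild the text from a parse, copying symbol by symbol (overlaps allowed)."""
    out = []
    for item in parse:
        if item[0] == "lit":
            out.append(item[1])
        else:
            _, offset, length = item
            for _ in range(length):
                out.append(out[-offset])
    return "".join(out)

parse, total_bits = ds_parse("abababab")
print(parse, total_bits)  # [('lit', 'a'), ('lit', 'b'), ('copy', 2, 6)] 27
```

Because the rightmost earlier occurrence is used for every copy, each dictionary edge gets the cheapest offset available under this cost model.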
Even though the main aim of this work is theoretical, some experimental
results are presented to highlight some practical effects of
parsing optimality on compression performance and to show
how to improve the compression ratio by building dictionary-symbolwise
extensions of known algorithms. Finally, some more detailed experiments
are reported in a dedicated appendix.
The Rightmost Equal-Cost Position Problem
LZ77-based compression schemes compress the input text by replacing factors
in the text with an encoded reference to a previous occurrence, formed by the
pair (length, offset). For a given factor, the smaller the offset, the
smaller the resulting compression ratio. This is optimally achieved by
using the rightmost occurrence of a factor in the previous text. Given a cost
function, for instance the minimum number of bits used to represent an integer,
we define the Rightmost Equal-Cost Position (REP) problem as the problem of
finding one of the occurrences of a factor whose cost is equal to the cost of
the rightmost one. We present the Multi-Layer Suffix Tree data structure that,
for a text of length n, at any time i, provides REP(LPF) in constant time,
where LPF is the longest previous factor, i.e. the greedy phrase, a reference
to the list of REP({set of prefixes of LPF}) in constant time, and REP(p) in
time O(|p| log log n) for any given pattern p.
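To make the REP definition concrete, here is a brute-force sketch that lists every earlier occurrence of a factor whose offset costs as much as the offset of the rightmost occurrence. It assumes the cost function is the bit length of the offset, and it scans the text directly; it is not the Multi-Layer Suffix Tree and has none of its time bounds. The function names are hypothetical.

```python
def bit_cost(offset):
    """Assumed cost function: bits in the binary representation of offset >= 1."""
    return offset.bit_length()

def rep_candidates(text, i, length):
    """Starts of the earlier occurrences of text[i:i+length] whose offset has
    the same cost as the rightmost one, listed from right to left."""
    factor = text[i:i + length]
    # Rightmost occurrence starting before i (it may overlap position i).
    p = text.rfind(factor, 0, i + length - 1)
    if p < 0:
        return []
    target = bit_cost(i - p)
    answers = []
    # Offsets grow as we move left, so costs never decrease: stop at the
    # first occurrence that is more expensive than the rightmost one.
    while p >= 0 and bit_cost(i - p) == target:
        answers.append(p)
        p = text.rfind(factor, 0, p + length - 1)
    return answers

print(rep_candidates("xxababxxab", 8, 2))  # [4, 2]: offsets 4 and 6, 3 bits each
print(rep_candidates("abababab", 6, 2))    # [4]: offset 2 needs 2 bits, offset 4 needs 3
```

Any position in the returned list is a valid answer to REP for that factor under the assumed cost function.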
Relations Between Greedy and Bit-Optimal LZ77 Encodings
This paper investigates the size in bits of the LZ77 encoding, which is the most popular and efficient variant of the Lempel-Ziv encodings used in data compression. We prove that, for a wide natural class of variable-length encoders for LZ77 phrases, the size of the greedily constructed LZ77 encoding on constant alphabets is within a factor of the optimal LZ77 encoding, where is the length of the processed string. We describe a series of examples showing that, surprisingly, this bound is tight, thus improving both the previously known upper and lower bounds. Further, we obtain a more detailed bound , which uses the number of phrases in the greedy LZ77 encoding as a parameter, and construct a series of examples showing that this bound is tight even for binary alphabet. We then investigate the problem on non-constant alphabets: we show that the known bound is tight even for alphabets of logarithmic size, and provide tight bounds for some other important cases.
Content-aware compression for big textual data analysis
A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC makes a distinction between informational and functional content in which only the informational content is compressed. Thus, the compressed data is made transparent to existing software libraries which often rely on functional content to work. Secondly, a context-free bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of real-world datasets. 
In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements.
Segmentation based coding of depth Information for 3D video
Growing interest in 3D artifacts, and the need to transmit, broadcast and store
all the information that represents the 3D view, have been a hot topic in recent years.
Knowing that adding the depth information to the views considerably increases the
encoding bitrate, we decided to find a new approach to encode and decode the depth
information for 3D video.
In this project, different approaches to encoding and decoding the depth information are
explored, and a new method is implemented whose results are compared with those of the best
previously developed method in terms of both bitrate and quality (PSNR).
Dictionary-Symbolwise Flexible Parsing
Linear-time optimal parsing algorithms are very rare in the dictionary-based branch of data compression theory. The most recent is the Flexible Parsing algorithm of Matias and Sahinalp, which works when the dictionary is prefix closed and the encoding of dictionary pointers has a constant cost. We present the Dictionary-Symbolwise Flexible Parsing algorithm, which is optimal for prefix-closed dictionaries and any symbolwise compressor under some natural hypotheses. In the case of LZ78-like algorithms with variable costs and any symbolwise compressor that, as usual, runs in linear time, it can be implemented in linear time. In the case of LZ77-like dictionaries and any symbolwise compressor, it can be implemented in O(n log n) time. We further present some experimental results that show the effectiveness of the dictionary-symbolwise approach.