135 research outputs found
The Rightmost Equal-Cost Position Problem
LZ77-based compression schemes compress the input text by replacing factors
in the text with an encoded reference to a previous occurrence formed by the
couple (length, offset). For a given factor, the smaller the offset, the
smaller the resulting compression ratio. This is optimally achieved by
using the rightmost occurrence of a factor in the previous text. Given a cost
function, for instance the minimum number of bits used to represent an integer,
we define the Rightmost Equal-Cost Position (REP) problem as the problem of
finding one of the occurrences of a factor whose cost is equal to the cost of
the rightmost one. We present the Multi-Layer Suffix Tree data structure that,
for a text of length n, at any time i, provides REP(LPF) in constant time,
where LPF is the longest previous factor, i.e. the greedy phrase; a reference
to the list of REP({set of prefixes of LPF}) in constant time; and REP(p) in
time O(|p| log log n) for any given pattern p.
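To make the REP definition concrete, here is a minimal brute-force sketch in Python. It is not the Multi-Layer Suffix Tree from the abstract; it simply enumerates occurrences. The function names and the choice of `int.bit_length` as the cost function are assumptions made for illustration.

```python
def bit_cost(offset: int) -> int:
    """Assumed cost function: bits needed to write a positive offset in binary."""
    return offset.bit_length()


def occurrences(text: str, i: int, factor: str) -> list[int]:
    """Start positions j < i at which `factor` occurs in `text`."""
    return [j for j in range(i) if text.startswith(factor, j)]


def rightmost_equal_cost_positions(text, i, factor, cost=bit_cost):
    """All occurrences whose offset (i - j) costs the same as the rightmost one.

    Brute force for illustration only; any element of the returned list is a
    valid answer to the REP problem for this factor at time i.
    """
    occ = occurrences(text, i, factor)
    if not occ:
        return []
    target = cost(i - max(occ))  # cost of the rightmost occurrence
    return [j for j in occ if cost(i - j) == target]


# "ab" occurs at positions 0, 6 and 8 before i = 12.
# Offsets 12, 6 and 4 cost 4, 3 and 3 bits, so 6 and 8 are interchangeable.
print(rightmost_equal_cost_positions("abccccababccab", 12, "ab"))
```

The example shows why the problem is weaker than finding the rightmost occurrence: position 6 is not rightmost, but it is encoded with the same number of bits as position 8.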
Real-time and distributed applications for dictionary-based data compression
The greedy approach to dictionary-based static text compression can be executed by a finite state machine.
When it is applied in parallel to different blocks of data independently, there is no lack of robustness
even on standard large scale distributed systems with input files of arbitrary size. Beyond standard large
scale, a negative effect on the compression effectiveness is caused by the very small size of the data blocks.
A robust approach for extreme distributed systems is presented in this paper, where this problem is fixed by
overlapping adjacent blocks and preprocessing the neighborhoods of the boundaries.
Moreover, we introduce the notion of pseudo-prefix dictionary, which allows optimal compression by means
of a real-time semi-greedy procedure and a slight improvement on the compression ratio obtained by the
distributed implementations.
Hierarchical Relative Lempel-Ziv Compression
Relative Lempel-Ziv (RLZ) parsing is a dictionary compression method in which a string S is compressed relative to a second string R (called the reference) by parsing S into a sequence of substrings that occur in R. RLZ is particularly effective at compressing sets of strings that have a high degree of similarity to the reference string, such as a set of genomes of individuals from the same species. With the now low cost of DNA sequencing, such datasets have become extremely abundant and are rapidly growing. In this paper, instead of using a single reference string for the entire collection, we investigate the use of different reference strings for subsets of the collection, with the aim of improving compression. In particular, we propose a new compression scheme, hierarchical relative Lempel-Ziv (HRLZ), which forms a rooted tree (or hierarchy) on the strings and then compresses each string using RLZ with its parent as reference, storing only the root of the tree in plain text. To decompress, we traverse the tree in BFS order starting at the root, decompressing children with respect to their parent. We show that this approach leads to a twofold improvement in compression on bacterial genome datasets, with negligible effect on decompression time compared to the standard single reference approach. We show that an effective hierarchy for a given set of strings can be constructed by computing the optimal arborescence of a complete weighted digraph of the strings, with weights given by the number of phrases in the RLZ parsing of the source and destination vertices. We further show that instead of computing the complete graph, a sparse graph derived using locality-sensitive hashing can significantly reduce the cost of computing a good hierarchy, without adversely affecting compression performance.
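The parse-against-parent and BFS decompression steps described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the greedy matcher uses `str.find` instead of an index, characters absent from the reference are emitted as literals, and the hierarchy is supplied by hand rather than computed as an optimal arborescence. All function and variable names are hypothetical.

```python
from collections import deque


def rlz_parse(s: str, ref: str) -> list:
    """Greedy RLZ parse of s against ref.

    Each phrase is (start_in_ref, length), or a single literal character
    when that character does not occur in ref (an assumed fallback).
    """
    phrases, pos = [], 0
    while pos < len(s):
        start, length = -1, 0
        # Extend the match while the longer prefix still occurs in ref.
        while pos + length < len(s):
            found = ref.find(s[pos:pos + length + 1])
            if found < 0:
                break
            start, length = found, length + 1
        if length == 0:
            phrases.append(s[pos])
            pos += 1
        else:
            phrases.append((start, length))
            pos += length
    return phrases


def rlz_decode(phrases: list, ref: str) -> str:
    """Rebuild a string from its RLZ phrases and the reference."""
    return "".join(
        p if isinstance(p, str) else ref[p[0]:p[0] + p[1]] for p in phrases
    )


def decompress_hierarchy(root, root_text, children, encoded):
    """BFS from the root: each child is decoded relative to its parent."""
    texts = {root: root_text}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in children.get(parent, []):
            texts[child] = rlz_decode(encoded[child], texts[parent])
            queue.append(child)
    return texts


# Toy hierarchy: root -> a -> b. Only the root is stored in plain text.
root_text, a, b = "ACGTACGT", "ACGTTCGT", "ACGTTCGA"
children = {"root": ["a"], "a": ["b"]}
encoded = {"a": rlz_parse(a, root_text), "b": rlz_parse(b, a)}
print(encoded["a"])
print(decompress_hierarchy("root", root_text, children, encoded))
```

BFS order matters because a child can only be decoded once its parent's text has been reconstructed.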
Lightweight Lempel-Ziv Parsing
We introduce a new approach to LZ77 factorization that uses O(n/d) words of
working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet
sizes). We also describe carefully engineered implementations of alternative
approaches to lightweight LZ77 factorization. Extensive experiments show that
the new algorithm is superior in most cases, particularly at the lowest memory
levels and for highly repetitive data. As a part of the algorithm, we describe
new methods for computing matching statistics which may be of independent
interest. Comment: 12 pages
Less redundant codes for variable size dictionaries
We report on a family of variable-length codes with less redundancy than the flat code used in most variable-size dictionary-based compression methods. The length of the codes belonging to this family is still bounded above by $\lceil \log_2 |D| \rceil$, where |D| denotes the dictionary size. We describe three of these codes, namely, the balanced code, the phase-in-binary code (PB), and the depth-span code (DS). As the name implies, the balanced code is constructed from a height-balanced tree, so it has the shortest average codeword length. The corresponding coding tree for the PB code has the interesting property that it is made of full binary phases, and thus the code can be computed efficiently using simple binary shifting operations. The DS coding tree is maintained in such a way that the coder always finds the longest extendable codeword and extends it until it reaches the maximum length. It is optimal with respect to the code-length contrast. The PB and balanced codes give similar improvements, around 3% to 7%, which is very close to the relative redundancy in the flat code. The DS code is particularly good at dealing with files with a large amount of redundancy, such as a running sequence of one symbol. We also carried out an empirical study of the codeword distribution in the LZW dictionary and propose a scheme called dynamic block shifting (DBS) to further improve the codes' performance. Experiments suggest that DBS is helpful in compressing random sequences. From an application point of view, the PB code with DBS is recommended for general practical usage.
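As an illustration of how a code can beat the flat code while staying within the same length bound, the sketch below implements the classic phased-in (truncated) binary code for a dictionary of size n. This is an assumption about the general idea only: the paper's PB code may assign codewords differently, and the function names are hypothetical.

```python
def phased_in_encode(x: int, n: int) -> str:
    """Encode index x in [0, n) as a bit string of length k or k + 1,
    where k = floor(log2 n). A flat code would always use ceil(log2 n) bits."""
    if not 0 <= x < n:
        raise ValueError("index out of range")
    if n == 1:
        return ""                      # a single entry needs no bits
    k = n.bit_length() - 1             # floor(log2 n)
    short = (1 << (k + 1)) - n         # how many indices get the short length
    if x < short:
        return format(x, f"0{k}b")
    return format(x + short, f"0{k + 1}b")


def phased_in_decode(bits: str, n: int) -> tuple[int, int]:
    """Decode one index from the front of `bits`.

    Returns (index, number_of_bits_consumed).
    """
    if n == 1:
        return 0, 0
    k = n.bit_length() - 1
    short = (1 << (k + 1)) - n
    value = int(bits[:k], 2)
    if value < short:
        return value, k
    value = (value << 1) | int(bits[k])  # one shift brings in the extra bit
    return value - short, k + 1


# With n = 5 entries, three codewords use 2 bits and two use 3 bits.
print([phased_in_encode(x, 5) for x in range(5)])
```

When n is a power of two the code degenerates to the flat code, so the saving only appears while the dictionary size lies between two powers of two, which is the usual situation for a growing LZW dictionary.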
Lempel-Ziv-like Parsing in Small Space
Lempel-Ziv (LZ77 or, briefly, LZ) is one of the most effective and
widely-used compressors for repetitive texts. However, the existing efficient
methods computing the exact LZ parsing have to use linear or close to linear
space to index the input text during the construction of the parsing, which is
prohibitive for long inputs. An alternative is Relative Lempel-Ziv (RLZ), which
indexes only a fixed reference sequence, whose size can be controlled. Deriving
the reference sequence by sampling the text yields reasonable compression
ratios for RLZ, but performance is not always competitive with that of LZ and
depends heavily on the similarity of the reference to the text. In this paper
we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate
the LZ parsing using little memory. RLZ is first used to produce a sequence of
phrases, and these are regarded as metasymbols that are input to LZ for a
second-level parsing on a (most often) drastically shorter sequence. This
parsing is finally translated into one on the original sequence.
We analyze the new scheme and prove that, like LZ, it achieves the $k$th
order empirical entropy compression $n H_k + o(n \log \sigma)$ with
$k = o(\log_\sigma n)$, where $n$ is the input length and $\sigma$ is the alphabet
size. In fact, we prove this entropy bound not only for ReLZ but for a wide
class of LZ-like encodings. Then, we establish a lower bound on the ReLZ
approximation ratio showing that the number of phrases in it can be
$\Omega(\log n)$ times larger than the number of phrases in LZ. Our experiments
show that ReLZ is faster than existing alternatives to compute the (exact or
approximate) LZ parsing, at the reasonable price of an approximation factor
below $2.0$ in all tested scenarios, and sometimes below $1.05$, to the size of
LZ. Comment: 21 pages, 6 figures, 2 tables
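The two-level idea in the ReLZ abstract (RLZ phrases treated as metasymbols, then LZ over the metasymbol sequence) can be sketched as below. This is a toy illustration under several assumptions: the reference is taken to be a prefix of the text, both parsers are quadratic brute force, and the final step that translates the second-level parsing back into phrases over the original text is omitted. All names are hypothetical.

```python
def rlz_parse(text: str, ref: str) -> list:
    """Greedy RLZ: phrases are (start_in_ref, length) or a literal character."""
    phrases, pos = [], 0
    while pos < len(text):
        start, length = -1, 0
        while pos + length < len(text):
            found = ref.find(text[pos:pos + length + 1])
            if found < 0:
                break
            start, length = found, length + 1
        if length == 0:
            phrases.append(text[pos])
            pos += 1
        else:
            phrases.append((start, length))
            pos += length
    return phrases


def lz_parse(seq: list) -> list:
    """Brute-force LZ77-style parse of a sequence of metasymbols.

    Emits ("lit", symbol) or ("copy", source_index, length); copies may
    overlap the position being parsed.
    """
    out, i = [], 0
    while i < len(seq):
        best_len, best_src = 0, -1
        for j in range(i):
            l = 0
            while i + l < len(seq) and seq[j + l] == seq[i + l]:
                l += 1
            if l > best_len:
                best_len, best_src = l, j
        if best_len == 0:
            out.append(("lit", seq[i]))
            i += 1
        else:
            out.append(("copy", best_src, best_len))
            i += best_len
    return out


def lz_decode(phrases: list) -> list:
    """Invert lz_parse, copying one element at a time so overlaps work."""
    seq = []
    for p in phrases:
        if p[0] == "lit":
            seq.append(p[1])
        else:
            _, src, length = p
            for t in range(length):
                seq.append(seq[src + t])
    return seq


def relz_decode(ref: str, second_level: list) -> str:
    """Metasymbols back to text via the reference."""
    meta = lz_decode(second_level)
    return "".join(m if isinstance(m, str) else ref[m[0]:m[0] + m[1]] for m in meta)


text = "abcabcabcabcxabcabcabcabc"
ref = text[:3]                       # assumed: reference is a short prefix
meta = rlz_parse(text, ref)          # first level: 9 metasymbols
second = lz_parse(meta)              # second level: parse of the metasymbols
print(len(meta), len(second))
print(relz_decode(ref, second) == text)
```

The second-level parser only ever indexes the short metasymbol sequence and the reference, which is the source of the memory saving the abstract describes.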