14 research outputs found
Online LZ77 Parsing and Matching Statistics with RLBWTs
Lempel-Ziv 1977 (LZ77) parsing, matching statistics and the Burrows-Wheeler Transform (BWT) are all fundamental elements of stringology. In a series of recent papers, Policriti and Prezza (DCC 2016 and Algorithmica, CPM 2017) showed how we can use an augmented run-length compressed BWT (RLBWT) of the reverse T^R of a text T, to compute offline the LZ77 parse of T in O(n log r) time and O(r) space, where n is the length of T and r is the number of runs in the BWT of T^R. In this paper we first extend a well-known technique for updating an unaugmented RLBWT when a character is prepended to a text, to work with Policriti and Prezza\u27s augmented RLBWT. This immediately implies that we can build online the LZ77 parse of T while still using O(n log r) time and O(r) space; it also seems likely to be of independent interest. Our experiments, using an extension of Ohno, Takabatake, I and Sakamoto\u27s (IWOCA 2017) implementation of updating, show our approach is both time- and space-efficient for repetitive strings. We then show how to augment the RLBWT further - albeit making it static again and increasing its space by a factor proportional to the size of the alphabet - such that later, given another string S and O(log log n)-time random access to T, we can compute the matching statistics of S with respect to T in O(|S| log log n) time
Rpair: Rescaling RePair with Rsync
Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built on the original dataset reasonably quickly while keeping the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing such that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice
Lempel-Ziv-like Parsing in Small Space
Lempel-Ziv (LZ77 or, briefly, LZ) is one of the most effective and
widely-used compressors for repetitive texts. However, the existing efficient
methods computing the exact LZ parsing have to use linear or close to linear
space to index the input text during the construction of the parsing, which is
prohibitive for long inputs. An alternative is Relative Lempel-Ziv (RLZ), which
indexes only a fixed reference sequence, whose size can be controlled. Deriving
the reference sequence by sampling the text yields reasonable compression
ratios for RLZ, but performance is not always competitive with that of LZ and
depends heavily on the similarity of the reference to the text. In this paper
we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate
the LZ parsing using little memory. RLZ is first used to produce a sequence of
phrases, and these are regarded as metasymbols that are input to LZ for a
second-level parsing on a (most often) drastically shorter sequence. This
parsing is finally translated into one on the original sequence.
We analyze the new scheme and prove that, like LZ, it achieves the th
order empirical entropy compression with , where is the input length and is the alphabet
size. In fact, we prove this entropy bound not only for ReLZ but for a wide
class of LZ-like encodings. Then, we establish a lower bound on ReLZ
approximation ratio showing that the number of phrases in it can be
times larger than the number of phrases in LZ. Our experiments
show that ReLZ is faster than existing alternatives to compute the (exact or
approximate) LZ parsing, at the reasonable price of an approximation factor
below in all tested scenarios, and sometimes below , to the size of
LZ.Comment: 21 pages, 6 figures, 2 table
Novel Results on the Number of Runs of the Burrows-Wheeler-Transform
The Burrows-Wheeler-Transform (BWT), a reversible string transformation, is
one of the fundamental components of many current data structures in string
processing. It is central in data compression, as well as in efficient query
algorithms for sequence data, such as webpages, genomic and other biological
sequences, or indeed any textual data. The BWT lends itself well to compression
because its number of equal-letter-runs (usually referred to as ) is often
considerably lower than that of the original string; in particular, it is well
suited for strings with many repeated factors. In fact, much attention has been
paid to the parameter as measure of repetitiveness, especially to evaluate
the performance in terms of both space and time of compressed indexing data
structures.
In this paper, we investigate , the ratio of and of the number
of runs of the BWT of the reverse of . Kempa and Kociumaka [FOCS 2020] gave
the first non-trivial upper bound as , for any string
of length . However, nothing is known about the tightness of this upper
bound. We present infinite families of binary strings for which holds, thus giving the first non-trivial lower bound on
, the maximum over all strings of length .
Our results suggest that is not an ideal measure of the repetitiveness of
the string, since the number of repeated factors is invariant between the
string and its reverse. We believe that there is a more intricate relationship
between the number of runs of the BWT and the string's combinatorial
properties.Comment: 14 pages, 2 figue
Lempel–Ziv-Like Parsing in Small Space
Lempel–Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the input text during the construction of the parsing, which is prohibitive for long inputs. An alternative is Relative Lempel–Ziv (RLZ), which indexes only a fixed reference sequence, whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the kth order empirical entropy compression nHk+ o(nlog σ) with k= o(log σn) , where n is the input length and σ is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on ReLZ approximation ratio showing that the number of phrases in it can be Ω (log n) times larger than the number of phrases in LZ. Our experiments show that ReLZ is faster than existing alternatives to compute the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below 2.0 in all tested scenarios, and sometimes below 1.05, to the size of LZ. © 2020, Springer Science+Business Media, LLC, part of Springer Nature.D. Kosolobov supported by the Russian Science Foundation (RSF), Project 18-71-00002 (for the upper bound analysis and a part of lower bound analysis). D. Valenzuela supported by the Academy of Finland (Grant 309048). G. Navarro funded by Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile. S.J. Puglisi supported by the Academy of Finland (Grant 319454). This work started during Shonan Meeting 126 “Computation over Compressed Structured Data”. Funded in part by EU’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie Grant Agreement No. 690941 (project BIRDS)