14 research outputs found

    Online LZ77 Parsing and Matching Statistics with RLBWTs

    Get PDF
    Lempel-Ziv 1977 (LZ77) parsing, matching statistics and the Burrows-Wheeler Transform (BWT) are all fundamental elements of stringology. In a series of recent papers, Policriti and Prezza (DCC 2016 and Algorithmica, CPM 2017) showed how we can use an augmented run-length compressed BWT (RLBWT) of the reverse T^R of a text T, to compute offline the LZ77 parse of T in O(n log r) time and O(r) space, where n is the length of T and r is the number of runs in the BWT of T^R. In this paper we first extend a well-known technique for updating an unaugmented RLBWT when a character is prepended to a text, to work with Policriti and Prezza\u27s augmented RLBWT. This immediately implies that we can build online the LZ77 parse of T while still using O(n log r) time and O(r) space; it also seems likely to be of independent interest. Our experiments, using an extension of Ohno, Takabatake, I and Sakamoto\u27s (IWOCA 2017) implementation of updating, show our approach is both time- and space-efficient for repetitive strings. We then show how to augment the RLBWT further - albeit making it static again and increasing its space by a factor proportional to the size of the alphabet - such that later, given another string S and O(log log n)-time random access to T, we can compute the matching statistics of S with respect to T in O(|S| log log n) time

    Rpair: Rescaling RePair with Rsync

    Get PDF
    Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built on the original dataset reasonably quickly while keeping the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing such that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice

    Lempel-Ziv-like Parsing in Small Space

    Full text link
    Lempel-Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the input text during the construction of the parsing, which is prohibitive for long inputs. An alternative is Relative Lempel-Ziv (RLZ), which indexes only a fixed reference sequence, whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the kkth order empirical entropy compression nHk+o(nlogσ)n H_k + o(n\log\sigma) with k=o(logσn)k = o(\log_\sigma n), where nn is the input length and σ\sigma is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on ReLZ approximation ratio showing that the number of phrases in it can be Ω(logn)\Omega(\log n) times larger than the number of phrases in LZ. Our experiments show that ReLZ is faster than existing alternatives to compute the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below 2.02.0 in all tested scenarios, and sometimes below 1.051.05, to the size of LZ.Comment: 21 pages, 6 figures, 2 table

    Novel Results on the Number of Runs of the Burrows-Wheeler-Transform

    Full text link
    The Burrows-Wheeler-Transform (BWT), a reversible string transformation, is one of the fundamental components of many current data structures in string processing. It is central in data compression, as well as in efficient query algorithms for sequence data, such as webpages, genomic and other biological sequences, or indeed any textual data. The BWT lends itself well to compression because its number of equal-letter-runs (usually referred to as rr) is often considerably lower than that of the original string; in particular, it is well suited for strings with many repeated factors. In fact, much attention has been paid to the rr parameter as measure of repetitiveness, especially to evaluate the performance in terms of both space and time of compressed indexing data structures. In this paper, we investigate ρ(v)\rho(v), the ratio of rr and of the number of runs of the BWT of the reverse of vv. Kempa and Kociumaka [FOCS 2020] gave the first non-trivial upper bound as ρ(v)=O(log2(n))\rho(v) = O(\log^2(n)), for any string vv of length nn. However, nothing is known about the tightness of this upper bound. We present infinite families of binary strings for which ρ(v)=Θ(logn)\rho(v) = \Theta(\log n) holds, thus giving the first non-trivial lower bound on ρ(n)\rho(n), the maximum over all strings of length nn. Our results suggest that rr is not an ideal measure of the repetitiveness of the string, since the number of repeated factors is invariant between the string and its reverse. We believe that there is a more intricate relationship between the number of runs of the BWT and the string's combinatorial properties.Comment: 14 pages, 2 figue

    Lempel–Ziv-Like Parsing in Small Space

    Full text link
    Lempel–Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the input text during the construction of the parsing, which is prohibitive for long inputs. An alternative is Relative Lempel–Ziv (RLZ), which indexes only a fixed reference sequence, whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the kth order empirical entropy compression nHk+ o(nlog σ) with k= o(log σn) , where n is the input length and σ is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on ReLZ approximation ratio showing that the number of phrases in it can be Ω (log n) times larger than the number of phrases in LZ. Our experiments show that ReLZ is faster than existing alternatives to compute the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below 2.0 in all tested scenarios, and sometimes below 1.05, to the size of LZ. © 2020, Springer Science+Business Media, LLC, part of Springer Nature.D. Kosolobov supported by the Russian Science Foundation (RSF), Project 18-71-00002 (for the upper bound analysis and a part of lower bound analysis). D. Valenzuela supported by the Academy of Finland (Grant 309048). G. Navarro funded by Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile. S.J. Puglisi supported by the Academy of Finland (Grant 319454). This work started during Shonan Meeting 126 “Computation over Compressed Structured Data”. Funded in part by EU’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie Grant Agreement No. 690941 (project BIRDS)
    corecore