14 research outputs found

Online LZ77 Parsing and Matching Statistics with RLBWTs

Author: Bannai Hideo
Gagie Travis
I Tomohiro
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. Annual Symposium on Combinatorial Pattern Matching (CPM 2018)
Publication date: 01/01/2018
Lempel-Ziv 1977 (LZ77) parsing, matching statistics and the Burrows-Wheeler Transform (BWT) are all fundamental elements of stringology. In a series of recent papers, Policriti and Prezza (DCC 2016 and Algorithmica, CPM 2017) showed how we can use an augmented run-length compressed BWT (RLBWT) of the reverse T^R of a text T, to compute offline the LZ77 parse of T in O(n log r) time and O(r) space, where n is the length of T and r is the number of runs in the BWT of T^R. In this paper we first extend a well-known technique for updating an unaugmented RLBWT when a character is prepended to a text, to work with Policriti and Prezza's augmented RLBWT. This immediately implies that we can build online the LZ77 parse of T while still using O(n log r) time and O(r) space; it also seems likely to be of independent interest. Our experiments, using an extension of Ohno, Takabatake, I and Sakamoto's (IWOCA 2017) implementation of updating, show our approach is both time- and space-efficient for repetitive strings. We then show how to augment the RLBWT further - albeit making it static again and increasing its space by a factor proportional to the size of the alphabet - such that later, given another string S and O(log log n)-time random access to T, we can compute the matching statistics of S with respect to T in O(|S| log log n) time.
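To make the quantities in this abstract concrete, here is a minimal brute-force sketch in Python of the three objects it names: the number of BWT runs r, the LZ77 parse, and matching statistics. It is not the RLBWT-based method of the paper and has none of its time or space guarantees. The function names, the `$` end marker, and the LZ77 variant (each phrase is the longest previously occurring factor, overlaps allowed, or else a single new character) are assumptions made for illustration.

```python
def bwt(text, sentinel="$"):
    """Naive BWT via sorted rotations.

    Assumes `sentinel` does not occur in `text` and sorts below every
    character of `text` (true for '$' versus ASCII letters).
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s):
    """Number of maximal equal-letter runs in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def lz77_phrases(text):
    """Quadratic-time LZ77 parse (assumed variant, see lead-in)."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best = 0
        for j in range(i):  # candidate earlier starting positions
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)  # a fresh character forms a phrase of length 1
        phrases.append(text[i:i + step])
        i += step
    return phrases


def matching_statistics(s, t):
    """For each position i of s, the length of the longest prefix of s[i:]
    that occurs somewhere in t."""
    ms = []
    for i in range(len(s)):
        length = 0
        while i + length < len(s) and s[i:i + length + 1] in t:
            length += 1
        ms.append(length)
    return ms
```

For a text `T`, the parameter r of the abstract corresponds to `count_runs(bwt(T[::-1]))` under the end-marker assumption above.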

Dagstuhl Research Online Publication Server

Rpair: Rescaling RePair with Rsync

Author: A Abeliuk
A Jeż
A Lempel
AR Christiansen
CG Nevill-Manning
F Claude
G Navarro
H Sakamoto
J Larsson
J Ziv
JA Storer
JC Kieffer
JD Kornblum
M Charikar
N Brisaboa
P Bille
T Gagie
W Rytter
Y Peng
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019
Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built on the original dataset reasonably quickly while keeping the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing such that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice.
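The general idea of context-triggered piecewise hashing can be sketched as follows: a chunk boundary is declared wherever a hash of the last few bytes hits a chosen residue, so boundaries depend only on local content and repeated regions tend to be cut the same way. The sketch below is a generic illustration, not the Rpair preprocessing itself. The function name, the window size, the modulus, and the byte-sum hash are all assumptions, and the hash is recomputed per window rather than rolled, for clarity.

```python
def ctph_chunks(data, window=4, modulus=16):
    """Split `data` (bytes) into content-defined chunks.

    A boundary is placed after position `end` whenever the hash of the
    preceding `window` bytes is 0 modulo `modulus`. The byte-sum hash and
    the default parameters are illustrative assumptions only.
    """
    chunks = []
    start = 0
    for end in range(window, len(data) + 1):
        h = sum(data[end - window:end])  # toy hash of the current window
        if h % modulus == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks
```

Because each boundary depends only on the `window` bytes before it, prepending a byte to the input shifts the boundaries without changing where they fall relative to the content, so later chunks come out identical. That resynchronization property is what makes such chunking a useful preprocessing step for repetitive data.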

Crossref

Archivio della Ricerca - Università di Pisa

Archivio Istituzionale della Ricerca- Università del Piemonte Orientale

Lempel-Ziv-like Parsing in Small Space

Author: Kosolobov Dmitry
Navarro Gonzalo
Puglisi Simon J.
Valenzuela Daniel
Publication date: 01/01/2020
Lempel-Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the input text during the construction of the parsing, which is prohibitive for long inputs. An alternative is Relative Lempel-Ziv (RLZ), which indexes only a fixed reference sequence, whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the $k$th order empirical entropy compression $n H_k + o(n\log\sigma)$ with $k = o(\log_\sigma n)$, where $n$ is the input length and $\sigma$ is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on ReLZ approximation ratio showing that the number of phrases in it can be $\Omega(\log n)$ times larger than the number of phrases in LZ. Our experiments show that ReLZ is faster than existing alternatives to compute the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below 2.0 in all tested scenarios, and sometimes below 1.05, to the size of LZ.

Comment: 21 pages, 6 figures, 2 tables
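The two-level pipeline described in this abstract (RLZ phrases as metasymbols, an LZ parse over the metasymbols, then translation back to the text) can be sketched with brute-force Python as below. This is a simplified illustration rather than the ReLZ implementation. Taking the reference to be a prefix of the text, keeping the prefix characters as single metasymbols, and all function names are assumptions, and the quadratic searches stand in for the small-space indexes a real implementation would use.

```python
def rlz_parse(text, reference):
    """Greedy RLZ: each phrase is the longest prefix of the remaining text
    that occurs in `reference`, or a single literal character otherwise."""
    phrases = []
    i = 0
    while i < len(text):
        length = 0
        while i + length < len(text) and text[i:i + length + 1] in reference:
            length += 1
        length = max(length, 1)
        phrases.append(text[i:i + length])
        i += length
    return phrases


def lz_parse(seq):
    """Quadratic LZ parse of a sequence of symbols (overlaps allowed)."""
    phrases = []
    i, n = 0, len(seq)
    while i < n:
        best = 0
        for j in range(i):
            length = 0
            while i + length < n and seq[j + length] == seq[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)
        phrases.append(seq[i:i + step])
        i += step
    return phrases


def relz_sketch(text, ref_len):
    """Two-level parse: the first `ref_len` characters act as the reference
    (an assumption), the rest is RLZ-parsed against it, and the resulting
    metasymbol sequence is LZ-parsed and translated back to substrings."""
    reference = text[:ref_len]
    meta = list(reference) + rlz_parse(text[ref_len:], reference)
    return ["".join(phrase) for phrase in lz_parse(meta)]
```

On `"abcabcabcabc"` with `ref_len=3`, the sketch yields five phrases (`a`, `b`, `c`, `abc`, `abcabc`), whereas a direct overlap-allowing LZ parse of the same text needs four. This is a small instance of the approximation gap the abstract discusses.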

arXiv.org e-Print Archive

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Repositorio Académico de la Universidad de Chile

Novel Results on the Number of Runs of the Burrows-Wheeler-Transform

Author: A Blumer
A de Luca
A Lempel
A Luca
D Knuth
G Castiglione
G Castiglione
J Berstel
J Borel
JA Storer
JC Kieffer
M Lothaire
S Mantaci
T Gagie
T Ohno
Publication date: 19/08/2020
The Burrows-Wheeler-Transform (BWT), a reversible string transformation, is one of the fundamental components of many current data structures in string processing. It is central in data compression, as well as in efficient query algorithms for sequence data, such as webpages, genomic and other biological sequences, or indeed any textual data. The BWT lends itself well to compression because its number of equal-letter-runs (usually referred to as $r$) is often considerably lower than that of the original string; in particular, it is well suited for strings with many repeated factors. In fact, much attention has been paid to the $r$ parameter as a measure of repetitiveness, especially to evaluate the performance in terms of both space and time of compressed indexing data structures. In this paper, we investigate $\rho(v)$, the ratio of $r$ and of the number of runs of the BWT of the reverse of $v$. Kempa and Kociumaka [FOCS 2020] gave the first non-trivial upper bound as $\rho(v) = O(\log^2(n))$, for any string $v$ of length $n$. However, nothing is known about the tightness of this upper bound. We present infinite families of binary strings for which $\rho(v) = \Theta(\log n)$ holds, thus giving the first non-trivial lower bound on $\rho(n)$, the maximum over all strings of length $n$. Our results suggest that $r$ is not an ideal measure of the repetitiveness of the string, since the number of repeated factors is invariant between the string and its reverse. We believe that there is a more intricate relationship between the number of runs of the BWT and the string's combinatorial properties.

Comment: 14 pages, 2 figures
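The ratio studied here can be computed by brute force for small strings. The sketch below follows the wording of the abstract (runs of the BWT of v divided by runs of the BWT of the reverse of v). It assumes a rotation-based BWT without an end marker, which may differ from the exact definition used in the paper, and all function names are illustrative.

```python
from fractions import Fraction


def bwt_rotations(v):
    """BWT as the last column of the sorted rotations of v (no end marker;
    this choice of definition is an assumption)."""
    rotations = sorted(v[i:] + v[:i] for i in range(len(v)))
    return "".join(rot[-1] for rot in rotations)


def runs(s):
    """Number of maximal equal-letter runs in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def rho(v):
    """Ratio r(v) / r(reverse(v)) for a non-empty string v, kept exact."""
    return Fraction(runs(bwt_rotations(v)), runs(bwt_rotations(v[::-1])))
```

For `"banana"` the reverse `"ananab"` is itself a rotation of the original, so both transforms coincide and `rho("banana")` is exactly 1. Strings witnessing a large ratio, such as the families the abstract mentions, have to be constructed much more carefully.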

arXiv.org e-Print Archive

Crossref

Catalogo dei prodotti della ricerca

Lempel–Ziv-Like Parsing in Small Space

Author: Kosolobov D.
Navarro G.
Puglisi S. J.
Valenzuela D.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020
Lempel–Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the input text during the construction of the parsing, which is prohibitive for long inputs. An alternative is Relative Lempel–Ziv (RLZ), which indexes only a fixed reference sequence, whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the kth order empirical entropy compression n H_k + o(n log σ) with k = o(log_σ n), where n is the input length and σ is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on ReLZ approximation ratio showing that the number of phrases in it can be Ω(log n) times larger than the number of phrases in LZ. Our experiments show that ReLZ is faster than existing alternatives to compute the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below 2.0 in all tested scenarios, and sometimes below 1.05, to the size of LZ. © 2020, Springer Science+Business Media, LLC, part of Springer Nature. D. Kosolobov supported by the Russian Science Foundation (RSF), Project 18-71-00002 (for the upper bound analysis and a part of lower bound analysis). D. Valenzuela supported by the Academy of Finland (Grant 309048). G. Navarro funded by Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile. S.J. Puglisi supported by the Academy of Finland (Grant 319454). This work started during Shonan Meeting 126 “Computation over Compressed Structured Data”. Funded in part by EU’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie Grant Agreement No. 690941 (project BIRDS).

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Repositorio Académico de la Universidad de Chile