8 research outputs found
Optimal rank and select queries on dictionary-compressed text
We study the problem of supporting queries on a string S of length n within a space bounded by the size γ of a string attractor for S. In the paper introducing string attractors it was shown that random access on S can be supported in optimal O(log(n/γ)/log log n) time within O(γ polylog n) space. In this paper, we extend this result to rank and select queries and provide lower bounds matching our upper bounds on alphabets of polylogarithmic size. Our solutions are given in the form of a space-time trade-off that is more general than the one previously known for grammars and that improves existing bounds on LZ77-compressed text by a log log n time factor in select queries. We also provide matching lower and upper bounds for partial sum and predecessor queries within attractor-bounded space, and extend our lower bounds to encompass navigation of dictionary-compressed tree representations.
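To make the two query types concrete, here is a naive, uncompressed baseline for rank and select over a plain Python string (just the definitions in code; the paper's point is supporting these within O(γ polylog n) space rather than storing the string verbatim):

```python
def rank(s, c, i):
    """rank(s, c, i): number of occurrences of character c in the prefix s[:i]."""
    return s[:i].count(c)

def select(s, c, j):
    """select(s, c, j): length of the smallest prefix of s containing j
    occurrences of c (i.e. the 1-based position of the j-th c), or None."""
    count = 0
    for pos, ch in enumerate(s):
        if ch == c:
            count += 1
            if count == j:
                return pos + 1
    return None
```

For example, on "abracadabra", rank("abracadabra", "a", 4) is 2 and select("abracadabra", "a", 3) is 6.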
Towards a Definitive Measure of Repetitiveness
Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel–Ziv parse are frequently used to estimate repetitiveness. Recently, a more principled measure, the size γ of the smallest string attractor, was introduced. The measure γ lower bounds all the previous relevant ones (including z), yet length-n strings can be represented and efficiently indexed within space O(γ log(n/γ)), which also upper bounds most measures (including z). While γ is certainly a better measure of repetitiveness than z, it is NP-complete to compute, and no o(γ log n)-space representation of strings is known. In this paper, we study a smaller measure, δ ≤ γ, which can be computed in linear time. We show that δ better captures the compressibility of repetitive strings. For every length n and every value δ ≥ 2, we construct a string such that γ = Ω(δ log(n/δ)). Still, we show a representation of any string S in O(δ log(n/δ)) space that supports direct access to any character S[i] in time O(log(n/δ)) and finds the occ occurrences of any pattern P[1..m] in time O(m log n + occ log^ε n) for any constant ε > 0. Further, we prove that no o(δ log n)-space representation exists: for every length n and every value 2 ≤ δ ≤ n^{1-ε}, we exhibit a string family whose elements can only be encoded in Ω(δ log(n/δ)) space. We complete our characterization of δ by showing that, although γ, z, and other repetitiveness measures are always O(δ log(n/δ)), for strings of any length n the smallest context-free grammar can be of size Ω(δ log^2 n / log log n). No such separation is known for γ.
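The measure δ has a short direct definition: δ = max over k of d_k/k, where d_k is the number of distinct length-k substrings. The linear-time computation claimed above relies on suffix-tree machinery; the following sketch instead evaluates the definition directly, which is quadratic but makes the quantity concrete:

```python
def delta(s):
    """Repetitiveness measure delta = max over k of d_k / k, where d_k is the
    number of distinct length-k substrings of s. Direct O(n^2)-time sketch of
    the definition (the paper computes it in O(n) time)."""
    n = len(s)
    return max(len({s[i:i + k] for i in range(n - k + 1)}) / k
               for k in range(1, n + 1))
```

For instance, delta("aaaa") is 1.0 (every length has a single distinct substring), while delta("abab") is 2.0 (two distinct characters at k = 1).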
Towards a Definitive Compressibility Measure for Repetitive Sequences
Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel--Ziv parse are frequently used to estimate it. The size b of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute and it is not monotonic upon symbol appends. Recently, a more principled measure, the size γ of the smallest string \emph{attractor}, was introduced. The measure γ lower bounds all the previous relevant ones, yet length-n strings can be represented and efficiently indexed within space O(γ log(n/γ)), which also upper bounds most measures. While γ is certainly a better measure of repetitiveness than b, it is also NP-complete to compute and not monotonic, and it is unknown if one can always represent a string in o(γ log n) space.
In this paper, we study an even smaller measure, δ ≤ γ, which can be computed in linear time, is monotonic, and allows encoding every string in O(δ log(n/δ)) space because z = O(δ log(n/δ)). We show that δ better captures the compressibility of repetitive strings. Concretely, we show that (1) δ can be strictly smaller than γ, by up to a logarithmic factor; (2) there are string families needing Ω(δ log(n/δ)) space to be encoded, so this space is optimal for every n and δ; (3) one can build run-length context-free grammars of size O(δ log(n/δ)), whereas the smallest (non-run-length) grammar can be up to Θ(log n / log log n) times larger; and (4) within O(δ log(n/δ)) space we can not only..
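The string attractors central to both abstracts above are easy to verify by brute force: a set Γ of positions is an attractor of S if every substring of S has at least one occurrence that spans a position of Γ. A minimal checker (0-based positions; exponentially slower than anything discussed in these papers, purely to pin down the definition):

```python
def is_attractor(s, gamma):
    """Brute-force check that the 0-based position set gamma is a string
    attractor of s: every substring s[i:i+k] must have some occurrence
    s[i2:i2+k] with an attractor position p satisfying i2 <= p < i2 + k."""
    n = len(s)
    for k in range(1, n + 1):
        for i in range(n - k + 1):
            sub = s[i:i + k]
            if not any(any(i2 <= p < i2 + k for p in gamma)
                       for i2 in range(n - k + 1) if s[i2:i2 + k] == sub):
                return False
    return True
```

For example, {1, 2} is an attractor of "aba", but {1} alone is not: the substring "a" occurs only at positions 0 and 2, so no occurrence of it crosses position 1.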
Faster Block Tree Construction
The block tree [Belazzougui et al., J. Comput. Syst. Sci. '21] is a compressed text index that can answer access (extract a character at a position), rank (number of occurrences of a specified character in a prefix of the text), and select (size of the smallest prefix such that a specified character has a specified rank) queries. It requires O(z log(n/z)) words of space, where z is the number of Lempel-Ziv factors of the text. For some highly repetitive inputs, a block tree can require as little as 0.015 bits per character of the text. Small values of z make the block tree a space-efficient alternative to the wavelet tree, which is another index for these three types of queries. While wavelet trees can be constructed fast in practice, so far compressed versions of the wavelet tree only leverage statistical compression, meaning that they are blind to spaced repetitions.
To make block trees usable in practice, a first step is to construct them efficiently. We address this problem by presenting a practically efficient construction algorithm for block trees, which is up to an order of magnitude faster than previous implementations. Additionally, we parallelize our implementation, making it the first block tree construction implementation that works in parallel in shared memory.
Block trees
Let string S[1..n] be parsed into z phrases by the Lempel-Ziv algorithm. The corresponding compression algorithm encodes S in O(z) space, but it does not support random access to S. We introduce a data structure, the block tree, that represents S in O(z log(n/z)) space and extracts any symbol of S in time O(log(n/z)), among other space-time tradeoffs. The structure also supports other queries that are useful for building compressed data structures on top of S. Further, block trees can be built in linear time and in a scalable manner. Our experiments show that block trees offer relevant space-time tradeoffs compared to other compressed string representations for highly repetitive strings.
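The core block-tree idea can be sketched in a few lines: a block whose content already occurred strictly earlier in the string is replaced by a pointer to that earlier occurrence, and only first-occurrence blocks are split further. The following is a simplified illustration (binary midpoint splits, pointers resolved by restarting at the root), not the exact structure of the paper with its rank/select support:

```python
def build(s, lo, hi, leaf_size=4):
    """Node for s[lo:hi). If the block's content already occurs strictly
    earlier in s, store only a pointer to that first occurrence; otherwise
    split in half and recurse (small blocks are stored verbatim)."""
    if hi - lo <= leaf_size:
        return ('leaf', lo, hi, s[lo:hi])
    first = s.find(s[lo:hi])
    if first < lo:                       # earlier occurrence exists: prune this subtree
        return ('ptr', lo, hi, first)
    mid = (lo + hi) // 2
    return ('node', lo, hi, build(s, lo, mid, leaf_size), build(s, mid, hi, leaf_size))

def access(root, i):
    """Extract character i using only the tree (the original string is gone)."""
    node = root
    while True:
        if node[0] == 'leaf':
            _, lo, _, content = node
            return content[i - lo]
        if node[0] == 'ptr':
            _, lo, _, src = node
            i = src + (i - lo)           # redirect; i strictly decreases, so this terminates
            node = root
            continue
        _, _, _, left, right = node
        node = left if i < left[2] else right
```

Because a pointer always targets an occurrence starting before the block, each redirect strictly decreases the query position, which is what guarantees that access terminates.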
Optimal-Time Dictionary-Compressed Indexes
We describe the first self-indexes able to count and locate pattern occurrences in optimal time within a space bounded by the size of the most popular dictionary compressors. To achieve this result we combine several recent findings, including \emph{string attractors} --- new combinatorial objects encompassing most known compressibility measures for highly repetitive texts ---, and grammars based on \emph{locally-consistent parsing}.
More in detail, let γ be the size of the smallest attractor for a text of length n. The measure γ is an (asymptotic) lower bound to the size of dictionary compressors based on Lempel--Ziv, context-free grammars, and many others. The smallest known text representations in terms of attractors use O(γ log(n/γ)) space, and our lightest indexes work within the same asymptotic space. Let ε > 0 be a suitably small constant fixed at construction time, m be the pattern length, and occ be the number of its text occurrences. Our index counts pattern occurrences in O(m + log^{2+ε} n) time, and locates them in O(m + (occ + 1) log^ε n) time. These times already outperform those of most dictionary-compressed indexes, while obtaining the least asymptotic space for any index searching within O((m + occ) polylog n) time. Further, by increasing the space to O(γ log(n/γ) log^ε n), we reduce the locating time to the optimal O(m + occ), and within O(γ log(n/γ) log n) space we can also count in optimal O(m) time. No dictionary-compressed index had obtained this time before. All our indexes can be constructed in O(n) space and O(n log n) expected time.
As a byproduct of independent interest..
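For intuition about the two query types, here is a plain, uncompressed suffix-array baseline: counting by binary search in O(m log n) time and locating by listing the matching suffix-array interval. The paper's contribution is achieving the optimal O(m) and O(m + occ) within attractor-bounded space, which this sketch does not attempt:

```python
import bisect

def suffix_array(s):
    """Naive O(n^2 log n) suffix-array construction; fine for a sketch."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def count_and_locate(s, sa, p):
    """Return (occ, sorted list of occurrence positions) of pattern p in s."""
    suffixes = [s[i:] for i in sa]       # a real index never materializes these
    lo = bisect.bisect_left(suffixes, p)
    hi = bisect.bisect_right(suffixes, p + "\U0010FFFF")  # every suffix starting with p
    return hi - lo, sorted(sa[lo:hi])
```

For example, on "abracadabra" the pattern "abra" gives occ = 2 with occurrences at positions 0 and 7.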
Grammar Boosting: A New Technique for Proving Lower Bounds for Computation over Compressed Data
Grammar compression is a general compression framework in which a string T of length n is represented as a context-free grammar of size g whose language contains only T. In this paper, we focus on studying the limitations of algorithms and data structures operating on strings in grammar-compressed form. Previous work focused on proving lower bounds for grammars constructed using algorithms that achieve the approximation ratio ρ = O(polylog n). Unfortunately, for the majority of grammar compressors, ρ is either unknown or satisfies ρ = ω(polylog n). In their seminal paper, Charikar et al. [IEEE Trans. Inf. Theory 2005] studied seven popular grammar compression algorithms: RePair, Greedy, LongestMatch, Sequential, Bisection, LZ78, and α-Balanced. Only one of them (α-Balanced) is known to achieve ρ = O(polylog n).
We develop the first technique for proving lower bounds for data structures and algorithms on grammars that is fully general and does not depend on the approximation ratio ρ of the used grammar compressor. Using this technique, we first prove that Ω(log n / log log n) time is required for random access on RePair, Greedy, LongestMatch, Sequential, and Bisection, while Ω(log log n) time is required for random access to LZ78. All these lower bounds hold within space O(g polylog n) and match the existing upper bounds. We also generalize this technique to prove
several conditional lower bounds for compressed computation. For example, we prove that unless the Combinatorial k-Clique Conjecture fails, there is no combinatorial algorithm for CFG parsing on Bisection (for which it holds ρ = Θ((n/log n)^{1/2})) that runs in O(g^c · n^{3-ε}) time for all constants c and ε > 0. Previously, this was known only for..
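The grammar-compression framework itself is easy to demonstrate: a Bisection-style compressor recursively halves the string and reuses a single rule for every distinct substring it encounters, so a highly repetitive string yields a grammar far smaller than the text. A simplified sketch (midpoint splits and string-keyed memoization are illustrative choices here, not the exact algorithm analyzed by Charikar et al.):

```python
def build_grammar(s, rules, memo):
    """Return a symbol that derives s; every distinct substring encountered
    gets exactly one rule, so repeated halves are shared."""
    if len(s) == 1:
        return s                         # terminals represent themselves
    if s in memo:
        return memo[s]
    half = len(s) // 2
    left = build_grammar(s[:half], rules, memo)
    right = build_grammar(s[half:], rules, memo)
    sym = f"R{len(rules)}"
    rules[sym] = (left, right)
    memo[s] = sym
    return sym

def expand(sym, rules):
    """Decompress: derive the string generated by sym."""
    if sym not in rules:
        return sym
    left, right = rules[sym]
    return expand(left, rules) + expand(right, rules)
```

On s = "ab" * 32 (64 characters) this produces a grammar of only 6 rules, one per distinct substring length on the halving path, whereas a string with no repetition would need roughly n rules.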
LIPIcs, Volume 274, ESA 2023, Complete Volume