8 research outputs found
Optimal rank and select queries on dictionary-compressed text
We study the problem of supporting queries on a string S of length n within a space bounded by the size γ of a string attractor for S. In the paper introducing string attractors it was shown that random access on S can be supported in optimal O(log(n/γ)/log log n) time within O(γ polylog n) space. In this paper, we extend this result to rank and select queries and provide lower bounds matching our upper bounds on alphabets of polylogarithmic size. Our solutions are given in the form of a space-time trade-off that is more general than the one previously known for grammars and that improves existing bounds on LZ77-compressed text by a log log n time factor in select queries. We also provide matching lower and upper bounds for partial sum and predecessor queries within attractor-bounded space, and extend our lower bounds to encompass navigation of dictionary-compressed tree representations.
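To make the two query types concrete, here is a naive, uncompressed baseline for rank and select over a plain Python string (just the definitions in code; the paper's point is supporting these within O(γ polylog n) space rather than storing the string verbatim):

```python
def rank(s, c, i):
    """rank(s, c, i): number of occurrences of character c in the prefix s[:i]."""
    return s[:i].count(c)

def select(s, c, j):
    """select(s, c, j): length of the smallest prefix of s containing j
    occurrences of c (i.e. the 1-based position of the j-th c), or None."""
    count = 0
    for pos, ch in enumerate(s):
        if ch == c:
            count += 1
            if count == j:
                return pos + 1
    return None
```

For example, on "abracadabra", rank("abracadabra", "a", 4) is 2 and select("abracadabra", "a", 3) is 6.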
Towards a Definitive Measure of Repetitiveness
Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel–Ziv parse are frequently used to estimate repetitiveness. Recently, a more principled measure, the size γ of the smallest string attractor, was introduced. The measure γ lower bounds all the previous relevant ones (including z), yet length-n strings can be represented and efficiently indexed within space O(γ log(n/γ)), which also upper bounds most measures (including z). While γ is certainly a better measure of repetitiveness than z, it is NP-complete to compute, and no o(γ log n)-space representation of strings is known. In this paper, we study a smaller measure, δ ≤ γ, which can be computed in linear time. We show that δ better captures the compressibility of repetitive strings. For every length n and every value δ ≥ 2, we construct a string such that γ = Ω(δ log(n/δ)). Still, we show a representation of any string S in O(δ log(n/δ)) space that supports direct access to any character S[i] in time O(log(n/δ)) and finds the occ occurrences of any pattern P[1..m] in time O(m log n + occ log^ε n) for any constant ε > 0. Further, we prove that no o(δ log n)-space representation exists: for every length n and every value 2 ≤ δ ≤ n^{1-ε}, we exhibit a string family whose elements can only be encoded in Ω(δ log(n/δ)) space. We complete our characterization of δ by showing that, although γ, z, and other repetitiveness measures are always O(δ log(n/δ)), for strings of any length n the smallest context-free grammar can be of size Ω(δ log^2 n / log log n). No such separation is known for γ.
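The measure δ has a short direct definition: δ = max over k of d_k/k, where d_k is the number of distinct length-k substrings. The linear-time computation claimed above relies on suffix-tree machinery; the following sketch instead evaluates the definition directly, which is quadratic but makes the quantity concrete:

```python
def delta(s):
    """Repetitiveness measure delta = max over k of d_k / k, where d_k is the
    number of distinct length-k substrings of s. Direct O(n^2)-time sketch of
    the definition (the paper computes it in O(n) time)."""
    n = len(s)
    return max(len({s[i:i + k] for i in range(n - k + 1)}) / k
               for k in range(1, n + 1))
```

For instance, delta("aaaa") is 1.0 (every length has a single distinct substring), while delta("abab") is 2.0 (two distinct characters at k = 1).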
Towards a Definitive Compressibility Measure for Repetitive Sequences
Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel--Ziv parse are frequently used to estimate it. The size b of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute and it is not monotonic upon symbol appends. Recently, a more principled measure, the size γ of the smallest string \emph{attractor}, was introduced. The measure γ lower bounds all the previous relevant ones, yet length-n strings can be represented and efficiently indexed within space O(γ log(n/γ)), which also upper bounds most measures. While γ is certainly a better measure of repetitiveness than b, it is also NP-complete to compute and not monotonic, and it is unknown if one can always represent a string in o(γ log n) space.
In this paper, we study an even smaller measure, δ ≤ γ, which can be computed in linear time, is monotonic, and allows encoding every string in O(δ log(n/δ)) space because z = O(δ log(n/δ)). We show that δ better captures the compressibility of repetitive strings. Concretely, we show that (1) δ can be strictly smaller than γ, by up to a logarithmic factor; (2) there are string families needing Ω(δ log(n/δ)) space to be encoded, so this space is optimal for every n and δ; (3) one can build run-length context-free grammars of size O(δ log(n/δ)), whereas the smallest (non-run-length) grammar can be up to Θ(log n / log log n) times larger; and (4) within O(δ log(n/δ)) space we can not only..
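The string attractors central to both abstracts above are easy to verify by brute force: a set Γ of positions is an attractor of S if every substring of S has at least one occurrence that spans a position of Γ. A minimal checker (0-based positions; exponentially slower than anything discussed in these papers, purely to pin down the definition):

```python
def is_attractor(s, gamma):
    """Brute-force check that the 0-based position set gamma is a string
    attractor of s: every substring s[i:i+k] must have some occurrence
    s[i2:i2+k] with an attractor position p satisfying i2 <= p < i2 + k."""
    n = len(s)
    for k in range(1, n + 1):
        for i in range(n - k + 1):
            sub = s[i:i + k]
            if not any(any(i2 <= p < i2 + k for p in gamma)
                       for i2 in range(n - k + 1) if s[i2:i2 + k] == sub):
                return False
    return True
```

For example, {1, 2} is an attractor of "aba", but {1} alone is not: the substring "a" occurs only at positions 0 and 2, so no occurrence of it crosses position 1.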
Faster Block Tree Construction
The block tree [Belazzougui et al., J. Comput. Syst. Sci. '21] is a compressed text index that can answer access (extract a character at a position), rank (number of occurrences of a specified character in a prefix of the text), and select (size of the smallest prefix such that a specified character has a specified rank) queries. It requires O(z log(n/z)) words of space, where z is the number of Lempel-Ziv factors of the text. For some highly repetitive inputs, a block tree can require as little as 0.015 bits per character of the text. Small values of z make the block tree a space-efficient alternative to the wavelet tree, which is another index for these three types of queries. While wavelet trees can be constructed fast in practice, so far compressed versions of the wavelet tree only leverage statistical compression, meaning that they are blind to spaced repetitions.
To make block trees usable in practice, a first step is to construct them efficiently. We address this problem by presenting a practically efficient construction algorithm for block trees, which is up to an order of magnitude faster than previous implementations. Additionally, we parallelize our implementation, making it the first block tree construction implementation that works in parallel in shared memory.
Block trees
Let string S[1..n] be parsed into z phrases by the Lempel-Ziv algorithm. The corresponding compression algorithm encodes S in O(z) space, but it does not support random access to S. We introduce a data structure, the block tree, that represents S in O(z log(n/z)) space and extracts any symbol of S in time O(log(n/z)), among other space-time tradeoffs. The structure also supports other queries that are useful for building compressed data structures on top of S. Further, block trees can be built in linear time and in a scalable manner. Our experiments show that block trees offer relevant space-time tradeoffs compared to other compressed string representations for highly repetitive strings.
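The core block-tree idea can be sketched in a few lines: a block whose content already occurred strictly earlier in the string is replaced by a pointer to that earlier occurrence, and only first-occurrence blocks are split further. The following is a simplified illustration (binary midpoint splits, pointers resolved by restarting at the root), not the exact structure of the paper with its rank/select support:

```python
def build(s, lo, hi, leaf_size=4):
    """Node for s[lo:hi). If the block's content already occurs strictly
    earlier in s, store only a pointer to that first occurrence; otherwise
    split in half and recurse (small blocks are stored verbatim)."""
    if hi - lo <= leaf_size:
        return ('leaf', lo, hi, s[lo:hi])
    first = s.find(s[lo:hi])
    if first < lo:                       # earlier occurrence exists: prune this subtree
        return ('ptr', lo, hi, first)
    mid = (lo + hi) // 2
    return ('node', lo, hi, build(s, lo, mid, leaf_size), build(s, mid, hi, leaf_size))

def access(root, i):
    """Extract character i using only the tree (the original string is gone)."""
    node = root
    while True:
        if node[0] == 'leaf':
            _, lo, _, content = node
            return content[i - lo]
        if node[0] == 'ptr':
            _, lo, _, src = node
            i = src + (i - lo)           # redirect; i strictly decreases, so this terminates
            node = root
            continue
        _, _, _, left, right = node
        node = left if i < left[2] else right
```

Because a pointer always targets an occurrence starting before the block, each redirect strictly decreases the query position, which is what guarantees that access terminates.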
Optimal-Time Dictionary-Compressed Indexes
We describe the first self-indexes able to count and locate pattern occurrences in optimal time within a space bounded by the size of the most popular dictionary compressors. To achieve this result we combine several recent findings, including \emph{string attractors} --- new combinatorial objects encompassing most known compressibility measures for highly repetitive texts ---, and grammars based on \emph{locally-consistent parsing}.
More in detail, let γ be the size of the smallest attractor for a text of length n. The measure γ is an (asymptotic) lower bound to the size of dictionary compressors based on Lempel--Ziv, context-free grammars, and many others. The smallest known text representations in terms of attractors use O(γ log(n/γ)) space, and our lightest indexes work within the same asymptotic space. Let ε > 0 be a suitably small constant fixed at construction time, m be the pattern length, and occ be the number of its text occurrences. Our index counts pattern occurrences in O(m + log^{2+ε} n) time, and locates them in O(m + (occ + 1) log^ε n) time. These times already outperform those of most dictionary-compressed indexes, while obtaining the least asymptotic space for any index searching within O((m + occ) polylog n) time. Further, by increasing the space to O(γ log(n/γ) log^ε n), we reduce the locating time to the optimal O(m + occ), and within O(γ log(n/γ) log n) space we can also count in optimal O(m) time. No dictionary-compressed index had obtained this time before. All our indexes can be constructed in O(n) space and O(n log n) expected time.
As a byproduct of independent interest..
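For intuition about the two query types, here is a plain, uncompressed suffix-array baseline: counting by binary search in O(m log n) time and locating by listing the matching suffix-array interval. The paper's contribution is achieving the optimal O(m) and O(m + occ) within attractor-bounded space, which this sketch does not attempt:

```python
import bisect

def suffix_array(s):
    """Naive O(n^2 log n) suffix-array construction; fine for a sketch."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def count_and_locate(s, sa, p):
    """Return (occ, sorted list of occurrence positions) of pattern p in s."""
    suffixes = [s[i:] for i in sa]       # a real index never materializes these
    lo = bisect.bisect_left(suffixes, p)
    hi = bisect.bisect_right(suffixes, p + "\U0010FFFF")  # every suffix starting with p
    return hi - lo, sorted(sa[lo:hi])
```

For example, on "abracadabra" the pattern "abra" gives occ = 2 with occurrences at positions 0 and 7.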
Grammar Boosting: A New Technique for Proving Lower Bounds for Computation over Compressed Data
Grammar compression is a general compression framework in which a string T of length n is represented as a context-free grammar of size g whose language contains only T. In this paper, we focus on studying the limitations of algorithms and data structures operating on strings in grammar-compressed form. Previous work focused on proving lower bounds for grammars constructed using algorithms that achieve the approximation ratio ρ = O(polylog n). Unfortunately, for the majority of grammar compressors, ρ is either unknown or satisfies ρ = ω(polylog n). In their seminal paper, Charikar et al. [IEEE Trans. Inf. Theory 2005] studied seven popular grammar compression algorithms: RePair, Greedy, LongestMatch, Sequential, Bisection, LZ78, and α-Balanced. Only one of them (α-Balanced) is known to achieve ρ = O(polylog n).
We develop the first technique for proving lower bounds for data structures and algorithms on grammars that is fully general and does not depend on the approximation ratio ρ of the used grammar compressor. Using this technique, we first prove that Ω(log n / log log n) time is required for random access on RePair, Greedy, LongestMatch, Sequential, and Bisection, while Ω(log log n) time is required for random access to LZ78. All these lower bounds hold within space O(g polylog n) and match the existing upper bounds. We also generalize this technique to prove
several conditional lower bounds for compressed computation. For example, we prove that unless the Combinatorial k-Clique Conjecture fails, there is no combinatorial algorithm for CFG parsing on Bisection (for which it holds ρ = Θ((n/log n)^{1/2})) that runs in O(g^c · n^{3-ε}) time for all constants c and ε > 0. Previously, this was known only for..
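The grammar-compression framework itself is easy to demonstrate: a Bisection-style compressor recursively halves the string and reuses a single rule for every distinct substring it encounters, so a highly repetitive string yields a grammar far smaller than the text. A simplified sketch (midpoint splits and string-keyed memoization are illustrative choices here, not the exact algorithm analyzed by Charikar et al.):

```python
def build_grammar(s, rules, memo):
    """Return a symbol that derives s; every distinct substring encountered
    gets exactly one rule, so repeated halves are shared."""
    if len(s) == 1:
        return s                         # terminals represent themselves
    if s in memo:
        return memo[s]
    half = len(s) // 2
    left = build_grammar(s[:half], rules, memo)
    right = build_grammar(s[half:], rules, memo)
    sym = f"R{len(rules)}"
    rules[sym] = (left, right)
    memo[s] = sym
    return sym

def expand(sym, rules):
    """Decompress: derive the string generated by sym."""
    if sym not in rules:
        return sym
    left, right = rules[sym]
    return expand(left, rules) + expand(right, rules)
```

On s = "ab" * 32 (64 characters) this produces a grammar of only 6 rules, one per distinct substring length on the halving path, whereas a string with no repetition would need roughly n rules.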
LIPIcs, Volume 274, ESA 2023, Complete Volume