Large Alphabets and Incompressibility
We briefly survey some concepts related to empirical entropy -- normal
numbers, de Bruijn sequences and Markov processes -- and investigate how well
it approximates Kolmogorov complexity. Our results suggest that $k$th-order
empirical entropy stops being a reasonable complexity metric for almost all
strings of length $m$ over alphabets of size $n$ about when $n^k$ surpasses $m$.
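The $k$th-order empirical entropy discussed in this abstract has a simple direct definition; a minimal sketch (function name ours), computed naively from context statistics rather than via any compressed representation:

```python
from collections import Counter
from math import log2

def empirical_entropy_k(s, k):
    """k-th order empirical entropy H_k(s) in bits per symbol.

    For k = 0 this is plain H_0; for k > 0 it averages the zero-order
    entropy of the symbols following each length-k context, weighted by
    how often that context occurs in s.
    """
    n = len(s)
    if n <= k:
        return 0.0
    contexts = {}  # context string -> Counter of following symbols
    for i in range(n - k):
        contexts.setdefault(s[i:i + k], Counter())[s[i + k]] += 1
    total = n - k
    h = 0.0
    for succ in contexts.values():
        m = sum(succ.values())
        h += (m / total) * -sum(c / m * log2(c / m) for c in succ.values())
    return h
```

For example, `"abab"` has $H_0 = 1$ bit per symbol but $H_1 = 0$, since each length-1 context determines its successor, which is exactly the kind of gap between orders the abstract's comparison with Kolmogorov complexity probes.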
Overcoming the compression limit of the individual sequence (zero-order empirical entropy) using the Set Shaping Theory
Given the importance of the claim, we begin with the following
consideration: this claim comes more than a year after the article
"Practical applications of Set Shaping Theory in Huffman coding", which reports
the program that carried out a data compression experiment in which the
coding limit NH0(S) of a single sequence was called into question. We waited so
long because, before making a claim of this kind, we wanted to be sure the
result was consistent. Throughout this time the program has been public;
anyone could download it, modify it, and independently reproduce the reported
results. During this period many information theory experts have tested the
program and agreed to help us; we thank these people for the time they
dedicated to us and for their valuable advice. Given a sequence S of i.i.d.
random variables with symbols belonging to an alphabet A, the parameter NH0(S)
(the zero-order empirical entropy multiplied by the length of the sequence) is
considered the average coding limit of the symbols of the sequence S under a
uniquely decodable, instantaneous code. Our experiment that calls this limit
into question is the following: a sequence S is generated uniformly at random,
the value NH0(S) is computed, S is transformed into a new sequence f(S),
longer but with symbols from the same alphabet, and finally f(S) is encoded
with Huffman coding. Over a statistically significant number of generated
sequences, the average length of the encoded f(S) turns out to be less than
the average value of NH0(S). This result is incompatible with the meaning
given to NH0(S).
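The two quantities the experiment compares, NH0(S) and the bit length of a Huffman encoding, can be reproduced directly. A minimal sketch (function names ours) of the baseline measurement only, since the transformation f defined by Set Shaping Theory is not specified in this abstract:

```python
import heapq
from collections import Counter
from math import log2

def nh0(s):
    """NH0(S): zero-order empirical entropy times the sequence length."""
    n = len(s)
    return -sum(c * log2(c / n) for c in Counter(s).values())

def huffman_code_lengths(s):
    """Code length (in bits) that Huffman coding assigns to each symbol of s."""
    freq = Counter(s)
    if len(freq) == 1:  # degenerate alphabet: one symbol, one-bit code
        return {next(iter(freq)): 1}
    # Heap entries: (count, tiebreaker, {symbol: depth-so-far}).
    heap = [(c, i, {sym: 0}) for i, (sym, c) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {sym: d + 1 for sym, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

def huffman_encoded_bits(s):
    """Total bit length of s under its own Huffman code."""
    lengths = huffman_code_lengths(s)
    freq = Counter(s)
    return sum(freq[sym] * lengths[sym] for sym in freq)
```

For a single sequence coded symbol by symbol, the Huffman bit count is never below NH0(S), by the source coding bound for prefix codes; the abstract's claim therefore rests entirely on the unspecified transform f(S).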
Computing LZ77 in Run-Compressed Space
In this paper, we show that the LZ77 factorization of a text $T \in \Sigma^n$
can be computed in O(R log n) bits of working space and O(n log R) time, R
being the number of runs in the Burrows-Wheeler transform of T reversed. For
extremely repetitive inputs, the working space can be as low as O(log n) bits:
exponentially smaller than the text itself. As a direct consequence of our
result, we show that a class of repetition-aware self-indexes based on a
combination of run-length encoded BWT and LZ77 can be built in asymptotically
optimal O(R + z) words of working space, z being the size of the LZ77 parsing.
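The parameters R and z in this abstract can be computed naively for small inputs, which helps make them concrete. A quadratic-time sketch (function names ours), nothing like the paper's compressed-space algorithm:

```python
def bwt(t):
    """Burrows-Wheeler transform of t (naive: sort all rotations).

    A unique, lexicographically smallest sentinel is appended so the
    transform is well defined and invertible.
    """
    t = t + "\x00"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)

def runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])

def lz77_parse(t):
    """Greedy LZ77 parse: each phrase is the longest prefix of the
    remaining text with an earlier occurrence, plus one fresh character
    (naive, quadratic; overlapping sources allowed)."""
    phrases = []
    i, n = 0, len(t)
    while i < n:
        l = 0
        # Extend while t[i:i+l+1] occurs starting strictly before i.
        while i + l + 1 <= n and t.find(t[i:i + l + 1], 0, i + l) != -1:
            l += 1
        phrases.append(t[i:i + l + 1])
        i += l + 1
    return phrases
```

With these definitions, R = `runs(bwt(t[::-1]))` (the abstract's R counts runs in the BWT of the reversed text) and z = `len(lz77_parse(t))`; note that LZ77 variants differ in details such as whether phrases carry an explicit trailing character.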
On Empirical Entropy
We propose a compression-based version of the empirical entropy of a finite
string over a finite alphabet. Whereas previous approaches consider the naked
entropy of (possibly higher-order) Markov processes, we consider the sum of the
description of the random variable involved and the entropy it induces. We
assume only that the distribution involved is computable. To test the new
notion we compare the Normalized Information Distance (the similarity metric)
with a related measure based on Mutual Information in Shannon's framework. This
way the similarities and differences of the last two concepts are exposed. Comment: 14 pages, LaTeX
Fast Label Extraction in the CDAWG
The compact directed acyclic word graph (CDAWG) of a string $T$ of length $n$
takes space proportional just to the number $e$ of right extensions of the
maximal repeats of $T$, and it is thus an appealing index for highly repetitive
datasets, like collections of genomes from similar species, in which $e$ grows
significantly more slowly than $n$. We reduce from $O(m \log\log n)$ to $O(m)$
the time needed to count the number of occurrences of a pattern of
length $m$, using an existing data structure that takes an amount of space
proportional to the size of the CDAWG. This implies a reduction from
$O(m \log\log n + \mathrm{occ})$ to $O(m + \mathrm{occ})$ in the time needed to
locate all the occurrences of the pattern. We also reduce from
$O(k \log\log n)$ to $O(k)$ the time needed to read the $k$ characters of the
label of an edge of the suffix tree of $T$, and we reduce from
$O(m \log\log n)$ to $O(m)$ the time needed to compute the matching
statistics between a query of length $m$ and $T$, using an existing
representation of the suffix tree based on the CDAWG. All such improvements
derive from extracting the label of a vertex or of an arc of the CDAWG using a
straight-line program induced by the reversed CDAWG. Comment: 16 pages, 1 figure. In proceedings of the 24th International
Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv
admin note: text overlap with arXiv:1705.0864
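Matching statistics, one of the queries this abstract speeds up, have a simple specification: for each position of the query, the length of its longest prefix occurring anywhere in the text. A naive quadratic sketch (function name ours), in contrast to the CDAWG-based method of the paper:

```python
def matching_statistics(query, t):
    """MS[i] = length of the longest prefix of query[i:] that occurs
    somewhere in t as a substring (brute force, for illustration)."""
    ms = []
    for i in range(len(query)):
        l = 0
        # Grow the match one character at a time while it still occurs in t.
        while i + l < len(query) and query[i:i + l + 1] in t:
            l += 1
        ms.append(l)
    return ms
```

For instance, against the text "ananas", the query "banan" has MS = [0, 4, 3, 2, 1]: "b" never occurs, while "anan" matches in full.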
Universal Compressed Text Indexing
The rise of repetitive datasets has lately generated a lot of interest in
compressed self-indexes based on dictionary compression, a rich and
heterogeneous family that exploits text repetitions in different ways. For each
such compression scheme, several different indexing solutions have been
proposed in the last two decades. To date, the fastest indexes for repetitive
texts are based on the run-length compressed Burrows-Wheeler transform and on
the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on
the other hand, are based on the Lempel-Ziv parsing and on grammar compression.
Indexes for more universal schemes such as collage systems and macro schemes
have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed
that all dictionary compressors can be interpreted as approximation algorithms
for the smallest string attractor, that is, a set of text positions capturing
all distinct substrings. Starting from this observation, in this paper we
develop the first universal compressed self-index, that is, the first indexing
data structure based on string attractors, which can therefore be built on top
of any dictionary-compressed text representation. Let $\gamma$ be the size of a
string attractor for a text of length $n$. Our index takes
$O(\gamma \log(n/\gamma))$ words of space and supports locating the $occ$
occurrences of any pattern of length $m$ in
$O(m \log n + occ\, \log^{\epsilon} n)$ time, for any constant
$\epsilon > 0$. This is, in particular, the first index
for general macro schemes and collage systems. Our result shows that the
relation between indexing and compression is much deeper than what was
previously thought: the simple property standing at the core of all dictionary
compressors is sufficient to support fast indexed queries. Comment: Fixed with reviewer's comments
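The string attractor property at the core of this index is easy to check directly: a set of text positions is an attractor when every distinct substring has at least one occurrence crossing one of those positions. A brute-force verifier (function name ours), usable only on tiny inputs:

```python
def is_string_attractor(t, gamma):
    """Return True iff gamma (a set of 0-based positions) is a string
    attractor of t: every distinct substring of t has some occurrence
    [a, a + len) containing a position of gamma. Brute force."""
    gamma = set(gamma)
    n = len(t)
    seen = set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = t[i:j]
            if sub in seen:
                continue
            seen.add(sub)
            # Scan all occurrences of sub, looking for one that crosses gamma.
            covered = False
            a = t.find(sub)
            while a != -1 and not covered:
                if any(a <= p < a + len(sub) for p in gamma):
                    covered = True
                a = t.find(sub, a + 1)
            if not covered:
                return False
    return True
```

For example, {1, 2} is an attractor of "abab" (every distinct substring touches position 1 or 2 in some occurrence), while {0} is not, since no occurrence of "b" crosses position 0; dictionary compressors, per the cited STOC 2018 result, implicitly approximate the smallest such set.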