27 research outputs found

    Large Alphabets and Incompressibility

    Full text link
    We briefly survey some concepts related to empirical entropy -- normal numbers, de Bruijn sequences and Markov processes -- and investigate how well it approximates Kolmogorov complexity. Our results suggest \ellth-order empirical entropy stops being a reasonable complexity metric for almost all strings of length mm over alphabets of size nn about when nn^\ell surpasses mm

    Overcoming the compression limit of the individualsequence (zero order empirical entropy) using the Set Shaping Theory

    Full text link
    Given the importance of the claim, we want to start by exposing the following consideration: this claim comes out more than a year after the article "Practical applications of Set Shaping Theory in Huffman coding" which reports the program that carried out an experiment of data compression in which the coding limit NH0(S) of a single sequence was questioned. We waited so long because, before making a claim of this type, we wanted to be sure of the consistency of the result. All this time the program has always been public; anyone could download it, modify it and independently obtain the reported results. In this period there have been many information theory experts who have tested the program and agreed to help us, we thank these people for the time dedicated to us and their precious advice. Given a sequence S of random variables i.i.d. with symbols belonging to an alphabet A; the parameter NH0(S) (the zero-order empirical entropy multiplied by the length of the sequence) is considered the average coding limit of the symbols of the sequence S through a uniquely decipherable and instantaneous code. Our experiment that calls into question this limit is the following: a sequence S is generated in a random and uniform way, the value NH0(S) is calculated, the sequence S is transformed into a new sequence f(S), longer but with the symbols belonging to the same alphabet, finally we code f(S) using Huffman coding. By generating a statistically significant number of sequences we obtain that the average value of the length of the encoded sequence f(S) is less than the average value of NH0(S). In this way, a result is obtained which is incompatible with the meaning given to NH0(S)

    Computing LZ77 in Run-Compressed Space

    Get PDF
    In this paper, we show that the LZ77 factorization of a text T {\in\Sigma^n} can be computed in O(R log n) bits of working space and O(n log R) time, R being the number of runs in the Burrows-Wheeler transform of T reversed. For extremely repetitive inputs, the working space can be as low as O(log n) bits: exponentially smaller than the text itself. As a direct consequence of our result, we show that a class of repetition-aware self-indexes based on a combination of run-length encoded BWT and LZ77 can be built in asymptotically optimal O(R + z) words of working space, z being the size of the LZ77 parsing

    On Empirical Entropy

    Get PDF
    We propose a compression-based version of the empirical entropy of a finite string over a finite alphabet. Whereas previously one considers the naked entropy of (possibly higher order) Markov processes, we consider the sum of the description of the random variable involved plus the entropy it induces. We assume only that the distribution involved is computable. To test the new notion we compare the Normalized Information Distance (the similarity metric) with a related measure based on Mutual Information in Shannon's framework. This way the similarities and differences of the last two concepts are exposed.Comment: 14 pages, LaTe

    Fast Label Extraction in the CDAWG

    Full text link
    The compact directed acyclic word graph (CDAWG) of a string TT of length nn takes space proportional just to the number ee of right extensions of the maximal repeats of TT, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which ee grows significantly more slowly than nn. We reduce from O(mloglogn)O(m\log{\log{n}}) to O(m)O(m) the time needed to count the number of occurrences of a pattern of length mm, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from O(mloglogn+occ)O(m\log{\log{n}}+\mathtt{occ}) to O(m+occ)O(m+\mathtt{occ}) in the time needed to locate all the occ\mathtt{occ} occurrences of the pattern. We also reduce from O(kloglogn)O(k\log{\log{n}}) to O(k)O(k) the time needed to read the kk characters of the label of an edge of the suffix tree of TT, and we reduce from O(mloglogn)O(m\log{\log{n}}) to O(m)O(m) the time needed to compute the matching statistics between a query of length mm and TT, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.Comment: 16 pages, 1 figure. In proceedings of the 24th International Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv admin note: text overlap with arXiv:1705.0864

    Universal Compressed Text Indexing

    Get PDF
    The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let γ\gamma be the size of a string attractor for a text of length nn. Our index takes O(γlog(n/γ))O(\gamma\log(n/\gamma)) words of space and supports locating the occocc occurrences of any pattern of length mm in O(mlogn+occlogϵn)O(m\log n + occ\log^{\epsilon}n) time, for any constant ϵ>0\epsilon>0. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.Comment: Fixed with reviewer's comment

    On Empirical Entropy

    Get PDF
    corecore