Search CORE

58 research outputs found

Entropy Lower Bounds for Dictionary Compression

Author: Ganczorz Michal
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)
Publication date: 01/01/2019
Field of study

We show that a wide class of dictionary compression methods (including LZ77, LZ78, grammar compressors as well as parsing-based structures) require |S|H_k(S) + Omega (|S|k log sigma/log_sigma |S|) bits to encode their output. This matches known upper bounds and improves the information-theoretic lower bound of |S|H_k(S). To this end, we abstract the crucial properties of parsings created by those methods, construct a certain family of strings and analyze the parsings of those strings. We also show that for k = alpha log_sigma |S|, where 0 < alpha < 1 is a constant, the aforementioned methods produce an output of size at least 1/(1-alpha)|S|H_k(S) bits. Thus our results separate dictionary compressors from context-based one (such as PPM) and BWT-based ones, as the those include methods achieving |S|H_k(S) + O(sigma^k log sigma) bits, i.e. the redundancy depends on k and sigma but not on |S|

Dagstuhl Research Online Publication Server

Compressed Text Indexes:From Theory to Practice!

Author: Ferragina Paolo
Gonzalez Rodrigo
Navarro Gonzalo
Venturini Rossano
Publication venue
Publication date: 01/01/2007
Field of study

A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications. The goal of this paper is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective testbeds and scripts for their automatic validation and test. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel and exciting technology

arXiv.org e-Print Archive

CiteSeerX

Archivio della Ricerca - Università di Pisa

Optimal Parsing for Dictionary Text Compression

Author: LANGIU Alessio
Publication venue
Publication date: 03/04/2012
Field of study

Dictionary-based compression algorithms include a parsing strategy to transform the input text into a sequence of dictionary phrases. Given a text, such process usually is not unique and, for compression purpose, it makes sense to find one of the possible parsing that minimize the final compression ratio. This is the parsing problem. An optimal parsing is a parsing strategy or a parsing algorithm that solve the parsing problem taking account of all the constraints of a compression algorithm or of a class of homogeneous compression algorithms. Compression algorithm constrains are, for instance, the dictionary itself, i.e. the dynamic set of available phrases, and how much a phrase weights on the compressed text, i.e. the number of bits of which the codeword representing such phrase is composed, also denoted as the encoding cost of a dictionary pointer. In more than 30th years of history of dictionary based text compression, while plenty of algorithms, variants and extensions appeared and while dictionary approach to text compression became one of the most appreciated and utilized in almost all the storage and communication processes, only few optimal parsing algorithms were presented. Many compression algorithms still leaks optimality of their parsing or, at least, proof of optimality. This happens because there is not a general model of the parsing problem that includes all the dictionary based algorithms and because the existing optimal parsing algorithms work under too restrictive hypothesis. This work focus on the parsing problem and presents both a general model for dictionary based text compression called Dictionary-Symbolwise Text Compression theory and a general parsing algorithm that is proved to be optimal under some realistic hypothesis. This algorithm is called iii Dictionary-Symbolwise Flexible Parsing and it covers almost all of the known cases of dictionary based text compression algorithms together with the large class of their variants where the text is decomposed in a sequence of symbols and dictionary phrases. In this work we further consider the case of a free mixture of a dictionary compressor and a symbolwise compressor. Our Dictionary-Symbolwise Flexible Parsing covers also this case. We have indeed an optimal parsing algorithm in the case of dictionary-symbolwise compression where the dictionary is prefix closed and the cost of encoding dictionary pointer is variable. The symbolwise compressor is any classical one that works in linear time, as many common variable-length encoders do. Our algorithm works under the assumption that a special graph that will be described in the following, is well defined. Even if this condition is not satisfied, it is possible to use the same method to obtain almost optimal parses. In detail, when the dictionary is LZ78-like, we show how to implement our algorithm in linear time. When the dictionary is LZ77-like our algorithm can be implemented in time O(n log n). Both have O(n) space complexity. Even if the main aim of this work is of theoretical nature, some experimental results will be introduced to underline some practical effects of the parsing optimality in terms of compression performance and to show how to improve the compression ratio by building extensions Dictionary- Symbolwise of known algorithms. Finally, some more detailed experiments are hosted in a devoted appendix

Archivio istituzionale della ricerca - Università di Palermo

Searching and Indexing Genomic Databases via Kernelization

Author: Gagie Travis
Puglisi Simon J.
Publication venue
Publication date: 04/12/2014
Field of study

The rapid advance of DNA sequencing technologies has yielded databases of thousands of genomes. To search and index these databases effectively, it is important that we take advantage of the similarity between those genomes. Several authors have recently suggested searching or indexing only one reference genome and the parts of the other genomes where they differ. In this paper we survey the twenty-year history of this idea and discuss its relation to kernelization in parameterized complexity

arXiv.org e-Print Archive

Frontiers - Publisher Connector

Grammar Boosting: A New Technique for Proving Lower Bounds for Computation over Compressed Data

Author: De Rajat
Kempa Dominik
Publication venue
Publication date: 17/07/2023
Field of study

Grammar compression is a general compression framework in which a string

T

of length

N

is represented as a context-free grammar of size

n

whose language contains only

T

. In this paper, we focus on studying the limitations of algorithms and data structures operating on strings in grammar-compressed form. Previous work focused on proving lower bounds for grammars constructed using algorithms that achieve the approximation ratio

\rho=\mathcal{O}(\text{polylog }N)

. Unfortunately, for the majority of grammar compressors,

\rho

is either unknown or satisfies

\rho=\omega(\text{polylog }N)

. In their seminal paper, Charikar et al. [IEEE Trans. Inf. Theory 2005] studied seven popular grammar compression algorithms: RePair, Greedy, LongestMatch, Sequential, Bisection, LZ78, and

\alpha

-Balanced. Only one of them (

\alpha

-Balanced) is known to achieve

\rho=\mathcal{O}(\text{polylog }N)

. We develop the first technique for proving lower bounds for data structures and algorithms on grammars that is fully general and does not depend on the approximation ratio

\rho

of the used grammar compressor. Using this technique, we first prove that

\Omega(\log N/\log \log N)

time is required for random access on RePair, Greedy, LongestMatch, Sequential, and Bisection, while

\Omega(\log\log N)

time is required for random access to LZ78. All these lower bounds hold within space

\mathcal{O}(n\text{ polylog }N)

and match the existing upper bounds. We also generalize this technique to prove several conditional lower bounds for compressed computation. For example, we prove that unless the Combinatorial

k

-Clique Conjecture fails, there is no combinatorial algorithm for CFG parsing on Bisection (for which it holds

\rho=\tilde{\Theta}(N^{1/2})

) that runs in

\mathcal{O}(n^c\cdot N^{3-\epsilon})

time for all constants

c>0

and

\epsilon>0

. Previously, this was known only for

c<2\epsilon

arXiv.org e-Print Archive