Bicriteria data compression
The advent of massive datasets (and the consequent design of high-performing
distributed storage systems) has reignited the interest of the scientific and
engineering community towards the design of lossless data compressors which
achieve effective compression ratio and very efficient decompression speed.
Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because of
its decompression speed and its flexibility in trading decompression speed
versus compressed-space efficiency. Each of the existing implementations offers
a trade-off between space occupancy and decompression speed, so software
engineers have to content themselves with picking the one that comes closest to
the requirements of the application at hand. Starting from these premises, and
for the first time in the literature, we address in this paper the problem of
trading the consumption of these two resources optimally by introducing the
Bicriteria LZ77-Parsing problem, which formalizes in a principled way what
data compressors have traditionally approached by means of heuristics. The goal
is to determine an LZ77 parsing
which minimizes the space occupancy in bits of the compressed file, provided
that the decompression time is bounded by a fixed amount (or vice-versa). This
way, the software engineer can set their space (or time) requirements and then
derive the LZ77 parsing which optimizes the decompression speed (or the space
occupancy, respectively). We solve this problem efficiently in O(n log^2 n)
time and optimal linear space within a small, additive approximation, by
proving and deploying some specific structural properties of the weighted graph
derived from the possible LZ77-parsings of the input file. The preliminary set
of experiments shows that our novel proposal dominates all the highly
engineered competitors, hence offering a win-win situation in theory & practice.
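To make the optimization concrete, here is a minimal sketch of the underlying problem, under illustrative assumptions: candidate phrases form a DAG over text positions, each edge carrying a bit cost and a decompression-time cost, and we want the cheapest-in-bits parsing whose total time stays within a budget. The phrase lists and costs below are invented placeholders, and the pseudo-polynomial DP is not the paper's O(n log^2 n) approximation algorithm, only a statement of the problem it solves.

```python
def bicriteria_parse(n, phrases, time_budget):
    """Minimum-bit parsing of a length-n text under a decompression-time
    budget. phrases[i] lists candidate phrases starting at position i as
    (end, bits, time) triples (illustrative cost model). Pseudo-polynomial
    DP over (position, time spent), for exposition only."""
    INF = float("inf")
    # best[i][t]: fewest bits needed to parse the length-i prefix
    # while spending exactly t units of decompression time
    best = [[INF] * (time_budget + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(n):
        for t in range(time_budget + 1):
            if best[i][t] == INF:
                continue
            for end, bits, time in phrases[i]:
                if t + time <= time_budget and best[i][t] + bits < best[end][t + time]:
                    best[end][t + time] = best[i][t] + bits
    return min(best[n])  # fewest bits over all time-feasible parsings

# Toy instance: one long phrase is cheap in bits but slow to decompress.
phrases = {0: [(1, 8, 1), (3, 5, 4)], 1: [(2, 8, 1)], 2: [(3, 8, 1)]}
print(bicriteria_parse(3, phrases, time_budget=3))  # -> 24: the slow phrase is pruned
```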
A Grammar Compression Algorithm based on Induced Suffix Sorting
We introduce GCIS, a grammar compression algorithm based on the induced
suffix sorting algorithm SAIS, introduced by Nong et al. in 2009. Our solution
builds on the factorization performed by SAIS during suffix sorting. We
construct a context-free grammar on the input string which can be further
reduced into a shorter string by substituting each substring with its
corresponding factor. The resulting grammar is encoded by exploiting
redundancies, such as common prefixes between suffix rules, which are sorted
according to the SAIS framework. When compared to well-known compression tools
such as Re-Pair and 7-zip, our algorithm is competitive and very effective at
handling repetitive strings, in terms of compression ratio as well as
compression and decompression running time.
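As a rough illustration of the factorization GCIS builds on, one round of SAIS-style LMS factorization can be sketched as below. This is deliberately simplified: no sentinel symbol, a single round only (GCIS recurses while the string keeps shrinking), and factors are named in plain lexicographic order rather than by the induced sorting that SAIS actually performs.

```python
def lms_factorize(s):
    """One simplified round of SAIS-style LMS factorization: returns the
    reduced string and the grammar rules (factors) created this round."""
    n = len(s)
    # suffix types: t[i] is True iff suffix s[i:] is S-type (smaller than s[i+1:])
    t = [False] * n
    t[n - 1] = True
    for i in range(n - 2, -1, -1):
        t[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and t[i + 1])
    # LMS positions: S-type positions whose left neighbor is L-type
    lms = [i for i in range(1, n) if t[i] and not t[i - 1]]
    # cut the string into LMS substrings (one LMS position to the next, inclusive)
    factors = [s[lms[j]:lms[j + 1] + 1] for j in range(len(lms) - 1)]
    # name distinct factors; real SAIS/GCIS would use induced sorting here
    names = {f: k for k, f in enumerate(sorted(set(factors)))}
    return [names[f] for f in factors], sorted(names, key=names.get)

print(lms_factorize("banana"))  # -> ([0, 0], ['ana']): both 'ana' factors share rule 0
```

Shared rules such as the repeated 'ana' above are where the compression comes from; GCIS additionally encodes the sorted rules by exploiting their common prefixes.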
Finite-State Dimension and Lossy Decompressors
This paper examines information-theoretic questions regarding the difficulty
of compressing data versus the difficulty of decompressing data and the role
that information loss plays in this interaction. Finite-state compression and
decompression are shown to be of equivalent difficulty, even when the
decompressors are allowed to be lossy.
Inspired by Kolmogorov complexity, this paper defines the optimal
*decompression ratio* achievable on an infinite sequence by finite-state
decompressors (that is, finite-state transducers outputting the sequence in
question). It is shown that the optimal compression ratio achievable on a
sequence S by any *information lossless* finite state compressor, known as the
finite-state dimension of S, is equal to the optimal decompression ratio
achievable on S by any finite-state decompressor. This result implies a new
decompression characterization of finite-state dimension in terms of lossy
finite-state transducers.
Comment: We found that Theorem 3.11, which was basically the motive for this
paper, was already proven by Sheinwald, Ziv, and Lempel in 1991 and 1995
papers.
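For context, the two quantities the paper proves equal can be sketched as follows; these are standard formulations paraphrased from the finite-state dimension literature, with ad hoc notation for the decompression side, not equations quoted from this paper. Writing S↾n for the first n symbols of S over alphabet Σ:

```latex
% Compression side: finite-state dimension, the best ratio achievable by
% information-lossless finite-state compressors C
\dim_{FS}(S) \;=\; \inf_{C}\, \liminf_{n \to \infty}
    \frac{|C(S \upharpoonright n)|}{n \log_2 |\Sigma|}

% Decompression side: best ratio over finite-state decompressors D,
% measured by the shortest input that makes D output the prefix
\rho_{\mathrm{dec}}(S) \;=\; \inf_{D}\, \liminf_{n \to \infty}
    \frac{\min\{\,|p| : D(p) = S \upharpoonright n\,\}}{n \log_2 |\Sigma|}
```

The result described above is that these two quantities coincide for every sequence S.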
Efficient LZ78 factorization of grammar compressed text
We present an efficient algorithm for computing the LZ78 factorization of a
text, where the text is represented as a straight line program (SLP), which is
a context free grammar in the Chomsky normal form that generates a single
string. Given an SLP of size n representing a text T of length N, our
algorithm computes the LZ78 factorization of T in O(n√N + m log N) time
and O(n√N + m) space, where m is the number of resulting LZ78 factors.
We also show how to improve the algorithm so that the n√N term in the
time and space complexities becomes either nL, where L is the length of the
longest LZ78 factor, or (N - α), where α ≥ 0 is a quantity
which depends on the amount of redundancy that the SLP captures with respect to
substrings of T of a certain length. Since m = O(N / log_σ N), where σ
is the alphabet size, the latter is asymptotically at least as fast as
a linear time algorithm which runs on the uncompressed string when σ is
constant, and can be more efficient when the text is compressible, i.e. when
m and n are small.
Comment: SPIRE 201
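For reference, the LZ78 factorization being computed is, on the uncompressed string, a single left-to-right pass over a trie of previously created factors; the paper's contribution is obtaining the same factorization directly from the SLP without expanding the text. A plain (non-SLP) sketch:

```python
def lz78_factorize(s):
    """Plain LZ78 on an uncompressed string: each factor is the longest
    previously created factor extended by one fresh character."""
    trie = {}            # (node id, char) -> child node id; the root is 0
    next_id = 1
    factors = []         # (id of the referenced factor, extension char)
    i = 0
    while i < len(s):
        node = 0
        # walk the trie while the text continues an existing factor
        while i < len(s) and (node, s[i]) in trie:
            node = trie[(node, s[i])]
            i += 1
        if i < len(s):
            trie[(node, s[i])] = next_id   # register the new factor
            factors.append((node, s[i]))
            next_id += 1
            i += 1
        else:
            factors.append((node, ""))     # text ended inside a known factor
    return factors

print(lz78_factorize("ababab"))  # -> [(0, 'a'), (0, 'b'), (1, 'b'), (3, '')]
```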
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are
universal, that is, they do not need to know the versioning structure of the
collection, nor that a clear versioning structure even exists.
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower.
Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Skłodowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
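The intuition behind compressing the differential lists rather than gap-encoding them can be shown with an invented posting list: in a versioned collection, a term present in one version tends to be present in the next, so document ids come in consecutive runs and the gap list degenerates into runs of 1s. Run-length coding stands in below for the run-length/Lempel-Ziv/grammar alternatives the abstract mentions.

```python
def gaps(postings):
    """Differential encoding of a sorted posting list of document ids."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def run_length(xs):
    """Run-length encode a list into (value, run length) pairs."""
    runs = []
    for x in xs:
        if runs and runs[-1][0] == x:
            runs[-1] = (x, runs[-1][1] + 1)
        else:
            runs.append((x, 1))
    return runs

# A term occurring in versions 4..9 and 20..24 of two documents:
postings = [4, 5, 6, 7, 8, 9, 20, 21, 22, 23, 24]
print(run_length(gaps(postings)))
# -> [(4, 1), (1, 5), (11, 1), (1, 4)]: eleven gaps collapse to four pairs
```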
Practical Evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch Tries
We present the first thorough practical study of the Lempel-Ziv-78 and the
Lempel-Ziv-Welch computation based on trie data structures. With a careful
selection of trie representations we can beat well-tuned popular trie data
structures like Judy, m-Bonsai, or Cedar.
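For what these tries compute: a textbook LZW encoder maintains exactly the dictionary that such trie data structures represent. The version below keys a hash map on whole strings, the kind of naive baseline that carefully engineered trie representations (the subject of the study) are designed to beat in space and time; it is a generic sketch, not code from the paper.

```python
def lzw_encode(s):
    """Textbook LZW over a byte alphabet: the dictionary starts with all
    single characters; each step emits the code of the longest known prefix
    and inserts that prefix extended by one character (a new trie leaf)."""
    dic = {chr(c): c for c in range(256)}
    w, out = "", []
    for ch in s:
        if w + ch in dic:
            w += ch                      # keep extending the current match
        else:
            out.append(dic[w])           # emit longest known prefix
            dic[w + ch] = len(dic)       # new dictionary entry
            w = ch
    if w:
        out.append(dic[w])
    return out

print(lzw_encode("ababab"))  # -> [97, 98, 256, 256]
```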