    Entropy Lower Bounds for Dictionary Compression

    We show that a wide class of dictionary compression methods (including LZ77, LZ78, grammar compressors, as well as parsing-based structures) require |S|H_k(S) + Omega(|S| k log sigma / log_sigma |S|) bits to encode their output. This matches known upper bounds and improves the information-theoretic lower bound of |S|H_k(S). To this end, we abstract the crucial properties of the parsings created by those methods, construct a certain family of strings, and analyze the parsings of those strings. We also show that for k = alpha log_sigma |S|, where 0 < alpha < 1 is a constant, the aforementioned methods produce an output of size at least 1/(1-alpha) |S|H_k(S) bits. Thus our results separate dictionary compressors from context-based compressors (such as PPM) and BWT-based ones, since the latter include methods achieving |S|H_k(S) + O(sigma^k log sigma) bits, i.e., a redundancy that depends on k and sigma but not on |S|.
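
    For reference, the sketch below computes the quantity |S|H_k(S) that the bound above is stated in, using the standard context-based definition of the k-th order empirical entropy (the sum over k-length contexts w of |S_w| H_0(S_w), where S_w lists the symbols following occurrences of w in S). This is a minimal Python illustration; the function names are ours and nothing here is taken from the paper's construction.

        import math
        from collections import Counter, defaultdict

        def zeroth_order_entropy(counts):
            # H_0 in bits per symbol, from a Counter of symbol frequencies.
            n = sum(counts.values())
            return -sum((c / n) * math.log2(c / n) for c in counts.values())

        def kth_order_entropy_bits(s, k):
            # |S| * H_k(S) in bits: sum over k-contexts w of |S_w| * H_0(S_w).
            if k == 0:
                return len(s) * zeroth_order_entropy(Counter(s))
            followers = defaultdict(Counter)
            for i in range(len(s) - k):
                followers[s[i:i + k]][s[i + k]] += 1
            return sum(sum(c.values()) * zeroth_order_entropy(c)
                       for c in followers.values())

        print(kth_order_entropy_bits("abracadabra" * 50, 2))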

    Generalized Deduplication: Bounds, Convergence, and Asymptotic Properties

    We study a generalization of deduplication that enables lossless deduplication of highly similar data, and show that standard deduplication with a fixed chunk length is a special case. We provide bounds on the expected length of coded sequences for generalized deduplication and show that the coding has asymptotic near-entropy cost under the proposed source model. More importantly, we show that generalized deduplication allows for multiple orders of magnitude faster convergence than standard deduplication. This means that generalized deduplication can provide compression benefits much earlier than standard deduplication, which is key in practical systems. Numerical examples demonstrate our results, showing that our lower bounds are achievable and illustrating the potential gain of using the generalization over standard deduplication. In fact, we show that even for a simple case of generalized deduplication, the gain in convergence speed is linear in the size of the data chunks. Comment: 15 pages, 4 figures. This is the full version of a paper accepted for GLOBECOM 201
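
    As a rough illustration of the mechanism described above, and only under our own assumptions, the sketch below splits each chunk into a deduplicated "base" and a small per-chunk "deviation"; the particular base/deviation transform (masking the low bits of every byte) is a placeholder, not the paper's construction. Setting low_bits = 0 recovers standard deduplication of identical fixed-length chunks.

        def split_chunk(chunk, low_bits=2):
            # High bits form the deduplicated "base"; the low bits of every byte
            # form the per-chunk "deviation". low_bits = 0 keeps whole chunks.
            high_mask = (0xFF << low_bits) & 0xFF
            base = bytes(b & high_mask for b in chunk)
            deviation = bytes(b & ~high_mask & 0xFF for b in chunk) if low_bits else b""
            return base, deviation

        def generalized_dedup(data, chunk_len=8, low_bits=2):
            bases, refs = {}, []              # base store and (base_id, deviation) per chunk
            for i in range(0, len(data), chunk_len):
                base, dev = split_chunk(data[i:i + chunk_len], low_bits)
                refs.append((bases.setdefault(base, len(bases)), dev))
            return bases, refs

        stored, refs = generalized_dedup(bytes(range(16)) * 100)
        print(len(stored), len(refs))         # few stored bases, many chunk references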

    New Algorithms and Lower Bounds for Sequential-Access Data Compression

    This thesis concerns sequential-access data compression, i.e., compression by algorithms that read the input one or more times from beginning to end. In one chapter we consider adaptive prefix coding, for which we must read the input character by character, outputting each character's self-delimiting codeword before reading the next one. We show how to encode and decode each character in constant worst-case time while producing an encoding whose length is worst-case optimal. In another chapter we consider one-pass compression with memory bounded in terms of the alphabet size and context length, and prove a nearly tight tradeoff between the amount of memory we can use and the quality of the compression we can achieve. In a third chapter we consider compression in the read/write streams model, which allows a number of passes and an amount of memory that are both polylogarithmic in the size of the input. We first show how to achieve universal compression using only one pass over one stream. We then show that one stream is not sufficient for achieving good grammar-based compression. Finally, we show that two streams are necessary and sufficient for achieving entropy-only bounds. Comment: draft of PhD thesis.
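
    To illustrate what adaptive prefix coding with self-delimiting codewords looks like, here is a toy Python coder that emits, for each character, the Elias gamma code of the character's current frequency rank and then updates its counts; the decoder mirrors the updates exactly. This is only a sketch for intuition: it is not the thesis's algorithm and makes no claim to constant worst-case time per character or to worst-case optimal encoding length.

        def elias_gamma(n):                       # self-delimiting code for n >= 1
            b = bin(n)[2:]
            return "0" * (len(b) - 1) + b

        def gamma_decode(bits, pos):              # returns (value, next position)
            zeros = 0
            while bits[pos + zeros] == "0":
                zeros += 1
            end = pos + 2 * zeros + 1
            return int(bits[pos + zeros:end], 2), end

        class AdaptiveCoder:
            def __init__(self, alphabet):
                self.counts = {a: 0 for a in alphabet}

            def _order(self):                     # symbols by decreasing count
                return sorted(self.counts, key=lambda a: (-self.counts[a], a))

            def encode(self, text):
                out = []
                for ch in text:
                    out.append(elias_gamma(self._order().index(ch) + 1))
                    self.counts[ch] += 1
                return "".join(out)

            def decode(self, bits, n):
                out, pos = [], 0
                for _ in range(n):
                    rank, pos = gamma_decode(bits, pos)
                    ch = self._order()[rank - 1]
                    out.append(ch)
                    self.counts[ch] += 1
                return "".join(out)

        enc, dec = AdaptiveCoder("abcdr"), AdaptiveCoder("abcdr")
        bits = enc.encode("abracadabra")
        assert dec.decode(bits, len("abracadabra")) == "abracadabra"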

    Universal quantum information compression and degrees of prior knowledge

    We describe a universal information compression scheme that compresses any pure quantum i.i.d. source asymptotically to its von Neumann entropy, with no prior knowledge of the structure of the source. We introduce a diagonalisation procedure that enables any classical compression algorithm to be utilised in a quantum context. Our scheme is then based on the corresponding quantum translation of the classical Lempel-Ziv algorithm. Our methods lead to a conceptually simple way of estimating the entropy of a source in terms of the measurement of an associated length parameter, while maintaining high fidelity for long blocks. As a by-product we also estimate the eigenbasis of the source. Since our scheme is based on the Lempel-Ziv method, it can also be applied to target sequences that are not i.i.d. Comment: 17 pages, no figures. A preliminary version of this work was presented at EQIS '02, Tokyo, September 2002.
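
    The compression target mentioned above, the von Neumann entropy, is easy to compute numerically from the source's density operator; the short numpy sketch below does so for a made-up qubit ensemble. The example source and names are ours, and nothing here reflects the paper's diagonalisation procedure or its quantum Lempel-Ziv construction.

        import numpy as np

        def von_neumann_entropy(rho):
            # S(rho) = -Tr(rho log2 rho), computed from the eigenvalues of rho.
            evals = np.linalg.eigvalsh(rho)
            evals = evals[evals > 1e-12]          # drop numerically zero eigenvalues
            return float(-np.sum(evals * np.log2(evals)))

        # Illustrative qubit source: |0> with probability 0.9, |+> with probability 0.1.
        ket0 = np.array([1.0, 0.0])
        plus = np.array([1.0, 1.0]) / np.sqrt(2)
        rho = 0.9 * np.outer(ket0, ket0) + 0.1 * np.outer(plus, plus)
        print(von_neumann_entropy(rho))           # asymptotic qubits per source symbol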

    On empirical cumulant generating functions of code lengths for individual sequences

    We consider the problem of lossless compression of individual sequences using finite-state (FS) machines, from the perspective of the best achievable empirical cumulant generating function (CGF) of the code length, i.e., the normalized logarithm of the empirical average of the exponentiated code length. Since the probabilistic CGF is minimized in terms of the Rényi entropy of the source, one of the motivations of this study is to derive an individual-sequence analogue of the Rényi entropy, in the same way that the FS compressibility is the individual-sequence counterpart of the Shannon entropy. We consider the CGF of the code length both from the perspective of fixed-to-variable (F-V) length coding and that of variable-to-variable (V-V) length coding, where the latter turns out to yield a better result that coincides with the FS compressibility. We also extend our results to compression with side information, available at both the encoder and the decoder. In this case, the V-V version no longer coincides with the FS compressibility, but results in a different complexity measure. Comment: 15 pages; submitted for publication.
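
    To make the object of study concrete, the sketch below evaluates an empirical CGF of code lengths and compares it with a Rényi entropy, using Campbell's classical result for a memoryless source (code lengths matched to the escort distribution make the exponential-moment criterion approach the Rényi entropy of order 1/(1+t)). The 1/t scaling and base 2 are our assumptions, and this classical, probabilistic check is not the paper's finite-state, individual-sequence analysis.

        import numpy as np

        def empirical_cgf(lengths, t):
            # Normalized logarithm of the empirical average of the exponentiated
            # code lengths; the 1/t scaling and base 2 follow Campbell's criterion.
            lengths = np.asarray(lengths, dtype=float)
            return np.log2(np.mean(2.0 ** (t * lengths))) / t

        def renyi_entropy(p, alpha):
            p = np.asarray(p, dtype=float)
            return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

        # Sanity check: code lengths matched to the escort distribution of p bring
        # the empirical CGF close to the Renyi entropy of order 1/(1+t).
        p, t = np.array([0.5, 0.25, 0.125, 0.125]), 1.0
        alpha = 1.0 / (1.0 + t)
        lengths_per_symbol = -np.log2(p ** alpha / np.sum(p ** alpha))
        rng = np.random.default_rng(0)
        draws = rng.choice(len(p), size=200_000, p=p)
        print(empirical_cgf(lengths_per_symbol[draws], t), renyi_entropy(p, alpha))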