    Universal lossless source coding with the Burrows Wheeler transform

    The Burrows Wheeler transform (BWT, 1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
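    As a concrete illustration of the pipeline the abstract describes (the BWT followed by a code that exploits the ordering of its output), here is a minimal Python sketch of the forward transform with move-to-front coding. It is illustrative only: the rotation sort is a naive O(n^2 log n) construction, and it is not one of the coding schemes analyzed in the paper.

```python
# Minimal sketch: forward Burrows-Wheeler transform + move-to-front coding.
# Illustrative only; not one of the coding schemes analyzed in the paper.

def bwt(s: str, sentinel: str = "\0") -> str:
    """Return the BWT of s: the last column of the sorted rotation matrix."""
    s += sentinel                      # unique end marker makes the BWT invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def move_to_front(s: str) -> list[int]:
    """Encode each symbol by its index in a recency list, then move it to front.
    Runs of equal symbols in the BWT output become runs of small integers,
    which a simple entropy coder can then exploit."""
    alphabet = sorted(set(s))
    out = []
    for ch in s:
        i = alphabet.index(ch)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))
    return out

if __name__ == "__main__":
    t = bwt("banana")
    print(t)                  # 'annb\x00aa'
    print(move_to_front(t))   # [1, 3, 0, 3, 3, 3, 0]
```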

    Optimal modeling for complex system design

    The article begins with a brief introduction to the theory describing optimal data compression systems and their performance. A brief outline is then given of a representative algorithm that employs these lessons for optimal data compression system design. The implications of rate-distortion theory for practical data compression system design are then described, followed by a description of the tensions between theoretical optimality and system practicality and a discussion of common tools used in current algorithms to resolve these tensions. Next, the generalization of rate-distortion principles to the design of optimal collections of models is presented. The discussion focuses initially on data compression systems, but later widens to describe how rate-distortion theory principles generalize to model design for a wide variety of modeling applications. The article ends with a discussion of the performance benefits to be achieved using the multiple-model design algorithms.
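    For reference, the rate-distortion function the abstract alludes to has the following standard textbook definition (not an equation quoted from the article): the minimum rate achievable at expected distortion at most D, minimized over test channels p(x̂|x).

```latex
% Standard definition of the rate-distortion function:
% minimize mutual information over all test channels meeting the
% distortion constraint E[d(X, X_hat)] <= D.
\[
  R(D) \;=\; \min_{p(\hat{x}\mid x)\,:\;\mathbb{E}[d(X,\hat{X})]\le D} I(X;\hat{X})
\]
```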

    Implementation issues in source coding

    An edge-preserving image coding scheme that can be operated in both a lossy and a lossless manner was developed. The technique is an extension of the lossless encoding algorithm developed for the Mars Observer spectral data. It can also be viewed as a modification of the DPCM algorithm. A packet video simulator was also developed from an existing modified packet network simulator. The coding scheme for this system is a modification of the mixture block coding (MBC) scheme described in the last report. Coding algorithms for packet video were also investigated.
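    Since the scheme is described as a modification of DPCM, a minimal baseline may help fix ideas. The Python sketch below shows plain one-dimensional DPCM with a previous-sample predictor; the edge-preserving extension in the report differs, so treat this as a generic baseline rather than the reported algorithm.

```python
import numpy as np

# Plain 1-D DPCM with a previous-sample predictor: a generic baseline,
# not the edge-preserving scheme described in the report.

def dpcm_encode(x: np.ndarray) -> np.ndarray:
    """Residuals e[n] = x[n] - x[n-1]; the first sample is sent verbatim.
    For smooth signals the residuals cluster near zero, so an entropy
    coder applied to them compresses well, and decoding is lossless."""
    e = np.empty_like(x)
    e[0] = x[0]
    e[1:] = x[1:] - x[:-1]
    return e

def dpcm_decode(e: np.ndarray) -> np.ndarray:
    """Invert exactly with a running (cumulative) sum of the residuals."""
    return np.cumsum(e)

if __name__ == "__main__":
    x = np.array([100, 102, 101, 150, 151], dtype=np.int64)
    assert np.array_equal(dpcm_decode(dpcm_encode(x)), x)
```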

    Malleable coding for updatable cloud caching

    In software-as-a-service applications provisioned through cloud computing, locally cached data are often modified with updates from new versions. In some cases, with each edit, one may want to preserve both the original and new versions. In this paper, we focus on cases in which only the latest version must be preserved. Furthermore, it is desirable for the data to not only be compressed but to also be easily modified during updates, since representing information and modifying the representation both incur cost. We examine whether it is possible to have both compression efficiency and ease of alteration, in order to promote codeword reuse. In other words, we study the feasibility of a malleable and efficient coding scheme. The tradeoff between compression efficiency and malleability cost (the difficulty of synchronizing compressed versions) is measured via the length of a reused prefix portion. The region of achievable rates and malleability is found. Drawing from prior work on common information problems, we show that efficient data compression may not be the best engineering design principle when storing software-as-a-service data. In the general case, the goals of efficiency and malleability are fundamentally in conflict. This work was supported in part by an NSF Graduate Research Fellowship (LRV) and by Grants CCR-0325774 and CCF-0729069. This work was presented at the 2011 IEEE International Symposium on Information Theory [1] and the 2014 IEEE International Conference on Cloud Engineering [2]. The associate editor coordinating the review of this paper and approving it for publication was R. Thobaben. Accepted manuscript.
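    The tension is easy to observe with an off-the-shelf compressor. The toy Python sketch below (not the paper's coding scheme) compresses two versions of a document with zlib and measures how long a prefix of the compressed output is reused; a single early edit typically destroys almost all codeword reuse.

```python
import zlib

# Toy illustration of the efficiency-vs-malleability tension: an edit at
# the start of the plaintext leaves almost no reusable compressed prefix.

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the longest shared prefix of two byte strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

old = b"the quick brown fox jumps over the lazy dog " * 200
new = b"The" + old[3:]          # a single edit at the very start

c_old, c_new = zlib.compress(old), zlib.compress(new)
# Prints the two compressed sizes and the (tiny) reused prefix length.
print(len(c_old), len(c_new), common_prefix_len(c_old, c_new))
```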

    New Algorithms and Lower Bounds for Sequential-Access Data Compression

    This thesis concerns sequential-access data compression, i.e., compression by algorithms that read the input one or more times from beginning to end. In one chapter we consider adaptive prefix coding, for which we must read the input character by character, outputting each character's self-delimiting codeword before reading the next one. We show how to encode and decode each character in constant worst-case time while producing an encoding whose length is worst-case optimal. In another chapter we consider one-pass compression with memory bounded in terms of the alphabet size and context length, and prove a nearly tight tradeoff between the amount of memory we can use and the quality of the compression we can achieve. In a third chapter we consider compression in the read/write streams model, which allows a number of passes and an amount of memory that are both polylogarithmic in the size of the input. We first show how to achieve universal compression using only one pass over one stream. We then show that one stream is not sufficient for achieving good grammar-based compression. Finally, we show that two streams are necessary and sufficient for achieving entropy-only bounds. Comment: draft of PhD thesis.
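    To make "adaptive prefix coding with self-delimiting codewords" concrete, here is a toy Python encoder of my own devising (not the thesis's constant-worst-case-time construction): each character is emitted as the Elias gamma code of its current move-to-front rank, so every codeword is self-delimiting and the code adapts to recently seen symbols.

```python
# Toy adaptive prefix coder: Elias gamma codes over move-to-front ranks.
# Not the thesis's construction; just illustrates self-delimiting,
# adaptively assigned codewords. Assumes 8-bit (ASCII) input characters.

def elias_gamma(n: int) -> str:
    """Self-delimiting code for n >= 1: (len(bin(n))-1) zeros, then bin(n)."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def encode(text: str) -> str:
    recency = []                      # move-to-front list, most recent first
    bits = []
    for ch in text:
        if ch in recency:
            r = recency.index(ch)
            bits.append(elias_gamma(r + 1))              # ranks 1..len(recency)
        else:                                            # escape for a new symbol:
            bits.append(elias_gamma(len(recency) + 1))   # rank past the end,
            bits.append(format(ord(ch), "08b"))          # then a raw byte
        if ch in recency:
            recency.remove(ch)
        recency.insert(0, ch)
        # The decoder can mirror this list exactly, so no table is transmitted.
    return "".join(bits)

print(encode("abracadabra"))
```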

    An implementation of Deflate in Coq

    The widely used compression format "Deflate" is defined in RFC 1951 and is based on prefix-free codings and backreferences. There are unclear points about the way these codings are specified, and several sources of confusion in the standard. We address this problem by giving a rigorous mathematical specification, which we formalized in Coq, and we produced a verified implementation in Coq that achieves competitive performance on inputs of several megabytes. In this paper we present the several parts of our implementation: a fully verified implementation of canonical prefix-free codings, which can be used in other compression formats as well, and an elegant formalism for specifying sophisticated formats, which we used to implement both a compression and a decompression algorithm in Coq that we formally prove inverse to each other -- the first time this has been achieved to our knowledge. Compatibility with other Deflate implementations is shown empirically. We furthermore discuss some of the difficulties, specifically regarding memory and runtime requirements, and our approaches to overcoming them.
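    The canonical prefix-free codes at the heart of Deflate can be reconstructed from codeword lengths alone. The Python sketch below follows the next_code recurrence from RFC 1951 section 3.2.2 (the example lengths are the RFC's own); it shows only the underlying idea, not the verified Coq development.

```python
from collections import Counter

# Canonical prefix-code assignment from codeword lengths, following the
# next_code recurrence of RFC 1951, section 3.2.2.

def canonical_codes(lengths: dict[str, int]) -> dict[str, str]:
    """Assign codewords so shorter codes sort first and codes of equal
    length are ordered lexicographically by symbol."""
    bl_count = Counter(lengths.values())          # how many codes of each length
    code, next_code = 0, {}
    for length in range(1, max(lengths.values()) + 1):
        code = (code + bl_count.get(length - 1, 0)) << 1
        next_code[length] = code                  # smallest code of this length
    out = {}
    for sym in sorted(lengths):                   # symbol order breaks ties
        L = lengths[sym]
        out[sym] = format(next_code[L], f"0{L}b")
        next_code[L] += 1
    return out

# RFC 1951's example: yields F:00, A:010, B:011, C:100, D:101, E:110,
# G:1110, H:1111.
print(canonical_codes({"A": 3, "B": 3, "C": 3, "D": 3,
                       "E": 3, "F": 2, "G": 4, "H": 4}))
```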