    Universal lossless source coding with the Burrows Wheeler transform

    The Burrows-Wheeler transform (BWT, 1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
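
    As an illustration of the transform itself, the Python sketch below implements a naive BWT and its inverse. The O(n^2 log n) rotation sorting, the sentinel convention, and the function names are our own choices for clarity; practical BWT-based coders build the transform from a suffix array and follow it with move-to-front and entropy-coding stages to exploit the clustered output.

        # Minimal sketch of the Burrows-Wheeler transform and its inverse.
        # Assumes the sentinel character does not occur in the input.

        def bwt(s: str, sentinel: str = "\0") -> str:
            """Return the BWT of s: last column of the sorted rotations."""
            s += sentinel
            rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
            return "".join(rot[-1] for rot in rotations)

        def ibwt(t: str, sentinel: str = "\0") -> str:
            """Invert the BWT by repeatedly prepending and re-sorting columns."""
            table = [""] * len(t)
            for _ in range(len(t)):
                table = sorted(t[i] + table[i] for i in range(len(t)))
            row = next(r for r in table if r.endswith(sentinel))
            return row.rstrip(sentinel)

        original = "banana"
        transformed = bwt(original)
        assert ibwt(transformed) == original
        print(repr(transformed))  # symbols grouped by their following context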

    New Algorithms and Lower Bounds for Sequential-Access Data Compression

    This thesis concerns sequential-access data compression, i.e., compression by algorithms that read the input one or more times from beginning to end. In one chapter we consider adaptive prefix coding, for which we must read the input character by character, outputting each character's self-delimiting codeword before reading the next one. We show how to encode and decode each character in constant worst-case time while producing an encoding whose length is worst-case optimal. In another chapter we consider one-pass compression with memory bounded in terms of the alphabet size and context length, and prove a nearly tight tradeoff between the amount of memory we can use and the quality of the compression we can achieve. In a third chapter we consider compression in the read/write streams model, which allows both the number of passes and the memory to be polylogarithmic in the size of the input. We first show how to achieve universal compression using only one pass over one stream. We then show that one stream is not sufficient for achieving good grammar-based compression. Finally, we show that two streams are necessary and sufficient for achieving entropy-only bounds. (Comment: draft of PhD thesis)
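
    The thesis's constant-worst-case-time coder is involved; as a hedged stand-in, the Python sketch below shows the general shape of adaptive prefix coding with a much simpler scheme of our own choosing (not the thesis's algorithm): each character is emitted as a self-delimiting Elias-gamma codeword for its move-to-front rank, so codeword lengths adapt to recently seen characters in a single pass.

        # Illustrative adaptive prefix coder: MTF ranks + Elias-gamma codewords.
        # Encoder and decoder maintain identical state, so no table is sent.

        def elias_gamma(n: int) -> str:
            """Self-delimiting code for n >= 1: unary length, then binary."""
            b = bin(n)[2:]
            return "0" * (len(b) - 1) + b

        def encode(text: str, alphabet: list) -> str:
            mtf, out = list(alphabet), []
            for c in text:
                rank = mtf.index(c)                # 0 = most recently seen
                out.append(elias_gamma(rank + 1))  # gamma needs n >= 1
                mtf.insert(0, mtf.pop(rank))       # adapt: move to front
            return "".join(out)

        def decode(bits: str, alphabet: list) -> str:
            mtf, out, i = list(alphabet), [], 0
            while i < len(bits):
                zeros = 0
                while bits[i] == "0":              # unary part gives the length
                    zeros += 1
                    i += 1
                rank = int(bits[i : i + zeros + 1], 2) - 1
                i += zeros + 1
                out.append(mtf[rank])
                mtf.insert(0, mtf.pop(rank))
            return "".join(out)

        msg = "abracadabra"
        assert decode(encode(msg, sorted(set(msg))), sorted(set(msg))) == msg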

    Parameterized Strings: Algorithms and Data Structures

    A parameterized string (p-string) T = T[1] T[2]...T[n] is a sophisticated string of length n composed of symbols from a constant alphabet Σ and a parameter alphabet π. Given a pair of p-strings S and T, the parameterized pattern matching (p-match) problem is to verify whether the individual constant symbols match and whether there exists a bijection between the parameter symbols of S and T. If the two conditions are met, S is said to be a p-match of T. A significant breakthrough in the p-match area is the prev encoding, which is proven to identify a p-match between S and T if and only if prev(S) == prev(T). In order to utilize suffix data structures in terms of p-matching, we must account for the dynamic nature of the parameterized suffixes (p-suffixes) of T, namely prev(T[i...n]) ∀ i, 1 ≤ i ≤ n.

    In this work, we propose transformative approaches to the direct parameterized suffix sorting (p-suffix sorting) problem by generating and sorting lexicographically numeric fingerprints and arithmetic codes that correspond to individual p-suffixes. Our algorithm to p-suffix sort via fingerprints is the first theoretical linear-time algorithm for p-suffix sorting for non-binary parameter alphabets, under the assumption that each code is represented by a practical integer. We eliminate the key problems of fingerprints by introducing an algorithm that exploits the ordering of arithmetic codes to sort p-suffixes in linear time on average.

    The longest previous factor (LPF) problem is defined for traditional strings exclusively from the constant alphabet Σ. We generalize the LPF problem to the parameterized longest previous factor (pLPF) problem defined for p-strings. Subsequently, we present a linear-time solution to construct the pLPF array. Given our pLPF algorithm, we show how to construct the pLCP (parameterized longest common prefix) array in linear time. Our algorithm is further exploited to construct the standard LPF and LCP arrays, all in linear time.

    We then study the structural string (s-string), a variant of the p-string that extends the p-string alphabets to include complementary parameters that correspond to one another. The s-string problem involves the new encoding schemes sencode and compl in order to identify a structural match (s-match). Current s-match solutions use a structural suffix tree (s-suffix tree) to study structural matches in RNA sequences. We introduce the suffix array, LCP, and LPF data structures for the s-string encoding schemes. Using our new data structures, we identify the first suffix array solution to the s-match problem. Our algorithms and data structures are shown to apply to s-strings as well as to p-strings and traditional strings.
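
    A minimal sketch of the prev encoding described above, in our own Python rendering (the parameter alphabet is passed in as a set): constant symbols pass through unchanged, and each parameter symbol is replaced by the distance back to its previous occurrence, or 0 at a first occurrence. Two p-strings p-match exactly when their prev encodings are equal.

        def prev_encode(t: str, params: set) -> tuple:
            """prev encoding: constants kept, parameters -> backward distances."""
            last = {}  # parameter symbol -> index of its previous occurrence
            out = []
            for i, c in enumerate(t):
                if c in params:
                    out.append(i - last[c] if c in last else 0)
                    last[c] = i
                else:
                    out.append(c)  # constant symbols pass through unchanged
            return tuple(out)

        # "aXbXa" and "aYbYa" p-match: the bijection X <-> Y preserves structure.
        P = {"X", "Y"}
        assert prev_encode("aXbXa", P) == prev_encode("aYbYa", P)  # ('a',0,'b',2,'a')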

    Asymptotic Optimality of Antidictionary Codes

    An antidictionary code is a lossless compression algorithm using an antidictionary, which is a set of minimal words that do not occur as substrings in an input string. The code was proposed by Crochemore et al. in 2000, and its asymptotic optimality has been proved with respect to only a specific information source, called a balanced binary source, that is, a binary Markov source in which a state transition occurs with probability 1/2 or 1. In this paper, we prove the optimality of both static and dynamic antidictionary codes with respect to a stationary ergodic Markov source on a finite alphabet such that a state transition occurs with probability p (0 < p ≤ 1). (Comment: 5 pages, to appear in the proceedings of the 2010 IEEE International Symposium on Information Theory (ISIT 2010).)
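
    The brute-force Python sketch below (our own exponential-time illustration, not the suffix-automaton construction of Crochemore et al.) builds an antidictionary as the set of minimal absent words: a word w is minimal absent exactly when w itself never occurs in the input although both w[:-1] and w[1:] do, since every proper factor of w is a factor of one of those two.

        from itertools import product

        def antidictionary(s: str, alphabet: str, max_len: int) -> list:
            """Enumerate minimal absent words of s up to length max_len."""
            result = []
            for length in range(1, max_len + 1):
                for w in map("".join, product(alphabet, repeat=length)):
                    # w absent, but its longest proper prefix and suffix occur
                    if w not in s and w[:-1] in s and w[1:] in s:
                        result.append(w)
            return result

        # Minimal absent words of "11010011" up to length 4 include "000" and "111".
        print(antidictionary("11010011", "01", 4))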

    Burrows Wheeler Compression Algorithm (BWCA) in Lossless Image Compression

    The present paper discusses the implementation of BWCA in lossless image compression. BWCA uses the Burrows-Wheeler transform (BWT) as its main transform. As a combinatorial compression algorithm that reorders symbols according to their following context, the BWT has become a promising approach to context-modeling compression. The BWT was initially created for text compression, and here we study the impact of the BWCA method and its improvements when applied to image compression. Since this application is quite different from the method's original aim, we analyze the influence of pre- and post-processing on the BWT.
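
    As a concrete example of the kind of pre-processing under study, the Python sketch below linearizes an image with a row-major scan and a horizontal delta filter before the BWT stage, so that smooth regions become runs of small values the transform can group. The specific scan order and filter are our assumptions for illustration, not necessarily the pipeline evaluated in the paper.

        def preprocess(image: list) -> list:
            """Row-major scan followed by horizontal differencing (mod 256)."""
            stream = [p for row in image for p in row]  # linearize the 2D grid
            return [stream[0]] + [
                (stream[i] - stream[i - 1]) % 256 for i in range(1, len(stream))
            ]

        def postprocess(deltas: list, width: int) -> list:
            """Undo the delta filter and re-fold the stream into rows."""
            stream = [deltas[0]]
            for d in deltas[1:]:
                stream.append((stream[-1] + d) % 256)
            return [stream[i : i + width] for i in range(0, len(stream), width)]

        img = [[10, 11, 12], [12, 12, 12]]
        assert postprocess(preprocess(img), width=3) == img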