546 research outputs found
Universal lossless source coding with the Burrows Wheeler transform
The Burrows Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
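As an illustration of the reversible transformation the abstract refers to, here is a minimal sketch of a forward and inverse BWT in Python. It is not taken from the paper: the function names `bwt` and `inverse_bwt` are hypothetical, the `$` end-of-string sentinel is an assumption, and the forward pass sorts all rotations naively rather than using a suffix array.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the last column of the sorted rotations of text + sentinel."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # Naive O(n^2 log n) construction; fine for a demonstration only.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Invert bwt() by following the first-column to last-column mapping."""
    n = len(last)
    # A stable sort of positions by character pairs the k-th occurrence of a
    # symbol in the first column with its k-th occurrence in the last column.
    order = sorted(range(n), key=lambda i: last[i])
    row = last.index(sentinel)  # row holding the unrotated string
    out = []
    for _ in range(n):
        row = order[row]
        out.append(last[row])
    return "".join(out[:-1])  # drop the trailing sentinel


if __name__ == "__main__":
    encoded = bwt("banana")
    print(encoded)
    print(inverse_bwt(encoded))
```

The output tends to group equal symbols together, which is the ordering that the subsequent lossless code tries to exploit.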
Burrows–Wheeler compression: Principles and reflections
After a general description of the Burrows–Wheeler transform and a brief survey of recent work on processing its output, the paper examines the coding of the zero-runs from the MTF recoding stage, an aspect with little prior treatment. It is concluded that the original scheme proposed by Wheeler is extremely efficient and unlikely to be much improved. The paper then proposes some new interpretations and uses of the Burrows–Wheeler transform, with new insights and approaches to lossless compression, perhaps including techniques from error correction.
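To make the "zero-runs from the MTF recoding stage" concrete, the following sketch shows move-to-front recoding followed by a plain collapsing of zero runs. The names `mtf_encode` and `collapse_zero_runs` and the `("Z", length)` / `("S", symbol)` token format are assumptions chosen for illustration; this is a simplified stand-in and not the scheme by Wheeler that the abstract evaluates.

```python
def mtf_encode(data: str, alphabet: str) -> list:
    """Move-to-front: emit each symbol's current rank, then move it to rank 0."""
    table = list(alphabet)
    ranks = []
    for ch in data:
        idx = table.index(ch)
        ranks.append(idx)
        table.insert(0, table.pop(idx))
    return ranks


def collapse_zero_runs(ranks: list) -> list:
    """Replace each maximal run of zeros with a single ("Z", run_length) token.

    Non-zero ranks are passed through as ("S", rank) tokens.
    """
    tokens = []
    run = 0
    for r in ranks:
        if r == 0:
            run += 1
            continue
        if run:
            tokens.append(("Z", run))
            run = 0
        tokens.append(("S", r))
    if run:
        tokens.append(("Z", run))
    return tokens


if __name__ == "__main__":
    ranks = mtf_encode("aaabbb", "ab")
    print(ranks)
    print(collapse_zero_runs(ranks))
```

Because BWT output contains long runs of repeated symbols, the MTF stage produces many zeros, which is why the coding of these runs matters for the overall rate.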
Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees
Efficient methods for storing and querying are critical for scaling
high-order n-gram language models to large corpora. We propose a language model
based on compressed suffix trees, a representation that is highly compact and
can be easily held in memory, while supporting queries needed in computing
language model probabilities on-the-fly. We present several optimisations which
improve query runtimes up to 2500x, despite only incurring a modest increase in
construction time and memory usage. For large corpora and high Markov orders,
our method is highly competitive with the state-of-the-art KenLM package. It
imposes much lower memory requirements, often by orders of magnitude, and has
runtimes that are either similar (for training) or comparable (for querying).
Comment: 14 pages in Transactions of the Association for Computational
Linguistics (TACL) 201
Multiresolution source coding using entropy constrained dithered scalar quantization
In this paper, we build multiresolution source codes using entropy constrained dithered scalar quantizers. We demonstrate that for n-dimensional random vectors, dithering followed by uniform scalar quantization and then by entropy coding achieves performance close to the n-dimensional optimum for a multiresolution source code. Based on this result, we propose a practical code design algorithm and compare its performance with that of the set partitioning in hierarchical trees (SPIHT) algorithm on natural images
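The pipeline named in the abstract (dithering, uniform scalar quantization, entropy coding) can be sketched as below. This assumes subtractive dither shared between encoder and decoder, which the abstract does not specify, and it reports the empirical entropy of the quantizer indices as a rough proxy for the rate of an entropy coder rather than implementing one. The names `dithered_quantize` and `empirical_entropy_bits` are hypothetical.

```python
import math
import random
from collections import Counter


def dithered_quantize(xs, step, rng):
    """Uniform scalar quantization with subtractive dither.

    Returns the integer indices (what would be entropy coded) and the
    reconstructions obtained by subtracting the same dither again.
    """
    indices, recon = [], []
    for x in xs:
        u = rng.uniform(-step / 2, step / 2)       # dither known to both sides
        k = math.floor((x + u) / step + 0.5)       # nearest quantizer cell
        indices.append(k)
        recon.append(k * step - u)
    return indices, recon


def empirical_entropy_bits(symbols):
    """Empirical first-order entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
    idx, rec = dithered_quantize(samples, step=0.5, rng=rng)
    mse = sum((x - r) ** 2 for x, r in zip(samples, rec)) / len(samples)
    print("rate estimate (bits/sample):", empirical_entropy_bits(idx))
    print("mean squared error:", mse)
```

With subtractive dither the reconstruction error never exceeds half the step size in magnitude, and a smaller `step` trades a higher rate for lower distortion.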
Asymptotic Optimality of Antidictionary Codes
An antidictionary code is a lossless compression algorithm using an
antidictionary which is a set of minimal words that do not occur as substrings
in an input string. The code was proposed by Crochemore et al. in 2000, and its
asymptotic optimality has been proved with respect to only a specific
information source, called balanced binary source that is a binary Markov
source in which a state transition occurs with probability 1/2 or 1. In this
paper, we prove the optimality of both static and dynamic antidictionary codes
with respect to a stationary ergodic Markov source on finite alphabet such that
a state transition occurs with probability .
Comment: 5 pages, to appear in the proceedings of 2010 IEEE International
Symposium on Information Theory (ISIT2010)
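The antidictionary defined in the abstract, the set of minimal words that do not occur as substrings of the input, can be computed by brute force for short strings. The sketch below uses the standard characterization that a word is minimal absent when it does not occur but both its longest proper prefix and its longest proper suffix do. The function name `minimal_absent_words` is hypothetical, and the code builds only the antidictionary, not the static or dynamic codes analysed in the paper.

```python
def minimal_absent_words(text: str, alphabet: str = "01") -> list:
    """Return the minimal absent words of text, shortest first.

    Brute force over all substrings, so suitable only for short inputs.
    """
    n = len(text)
    factors = {text[i:j] for i in range(n) for j in range(i, n + 1)}
    factors.add("")  # the empty word always occurs
    result = set()
    for u in factors:          # u is the longest proper prefix, so it occurs
        for c in alphabet:
            w = u + c
            # w is absent, and its longest proper suffix w[1:] occurs.
            if w not in factors and w[1:] in factors:
                result.add(w)
    return sorted(result, key=lambda w: (len(w), w))


if __name__ == "__main__":
    print(minimal_absent_words("0110"))
```

For the input `0110` this yields `00`, `010`, `101` and `111`: each is missing from the string, while every proper substring of each one appears in it.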