
    Optimal Prefix Codes with Fewer Distinct Codeword Lengths are Faster to Construct

    A new method for constructing minimum-redundancy binary prefix codes is described. Our method does not explicitly build a Huffman tree; instead it uses a property of optimal prefix codes to compute the codeword lengths corresponding to the input weights. Let $n$ be the number of weights and $k$ be the number of distinct codeword lengths in the optimal code produced by the algorithm. The running time of our algorithm is $O(k \cdot n)$. Following our previous work in \cite{be}, no algorithm can construct optimal prefix codes in $o(k \cdot n)$ time. When the given weights are presorted, our algorithm performs $O(9^k \cdot \log^{2k} n)$ comparisons.
    Comment: 23 pages, a preliminary version appeared in STACS 200
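
    The abstract does not spell out the paper's $O(k \cdot n)$ procedure, so the following is only a point of reference: a minimal sketch of the classical heap-based construction that computes optimal codeword lengths from input weights, i.e. the explicit tree-building step the paper avoids. The function name, tie-breaking, and example weights are illustrative assumptions, not taken from the paper.

```python
import heapq

def huffman_code_lengths(weights):
    """Return the optimal (Huffman) codeword length for each input weight.

    Standard O(n log n) heap-based construction, shown only as a baseline;
    the paper's algorithm achieves O(k*n) without building the tree explicitly.
    """
    n = len(weights)
    if n == 1:
        return [1]  # a single symbol conventionally gets a one-bit codeword
    # Heap entries are (weight, node id); internal nodes get ids >= n.
    heap = [(w, i) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    parent = {}          # node id -> parent id
    next_id = n
    while len(heap) > 1:
        w1, a = heapq.heappop(heap)
        w2, b = heapq.heappop(heap)
        parent[a] = parent[b] = next_id
        heapq.heappush(heap, (w1 + w2, next_id))
        next_id += 1
    root = heap[0][1]
    # Codeword length of a leaf = its depth in the code tree.
    lengths = []
    for leaf in range(n):
        depth, node = 0, leaf
        while node != root:
            node = parent[node]
            depth += 1
        lengths.append(depth)
    return lengths

# Example with k = 4 distinct lengths among n = 5 weights.
print(huffman_code_lengths([10, 6, 2, 1, 1]))  # -> [1, 2, 3, 4, 4]
```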

    Design and application of variable-to-variable length codes

    This work addresses the design of minimum-redundancy variable-to-variable length (V2V) codes and studies their suitability for use in the probability interval partitioning entropy (PIPE) coding concept as an alternative to binary arithmetic coding. Several properties of V2V codes and new concepts are discussed, and a polynomial-based principle for designing V2V codes is proposed. Various minimum-redundancy V2V codes are derived and combined with the PIPE coding concept, and their redundancy is compared to that of the binary arithmetic coder of the video compression standard H.265/HEVC.
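
    As a hedged illustration of the V2V idea only (not one of the minimum-redundancy codes derived in this work), the sketch below encodes a binary source by parsing it into variable-length source words and replacing each with a variable-length codeword. The table assumes a source with P(0) ≈ 0.8 and is chosen purely for demonstration; names such as V2V_TABLE are hypothetical.

```python
# Toy V2V code for a binary source with P(0) = 0.8 (illustrative values only).
# The source words form a complete prefix-free parsing of {0,1}*, and the
# codewords form a prefix-free binary code, so both directions parse greedily.
V2V_TABLE = {
    "000": "0",
    "001": "111",
    "01":  "110",
    "1":   "10",
}
INVERSE_TABLE = {cw: sw for sw, cw in V2V_TABLE.items()}

def v2v_encode(bits, table=V2V_TABLE):
    """Greedily parse the source string into source words and emit codewords."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in table:
            out.append(table[buf])
            buf = ""
    # A real coder needs a termination rule for a leftover partial word;
    # here we simply require the input to end on a source-word boundary.
    assert buf == "", "input did not end on a source-word boundary"
    return "".join(out)

def v2v_decode(code, table=INVERSE_TABLE):
    """Greedily parse the code string into codewords and emit source words."""
    out, buf = [], ""
    for b in code:
        buf += b
        if buf in table:
            out.append(table[buf])
            buf = ""
    return "".join(out)

src = "0001000101"
enc = v2v_encode(src)
assert v2v_decode(enc) == src
print(src, "->", enc)  # 10 source bits -> 9 code bits for this input
```

    For this table the expected rate is about 0.728 bits per source bit against a source entropy of roughly 0.722, which is the kind of small redundancy gap that motivates searching for minimum-redundancy V2V codes.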

    Efficient compression of large repetitive strings

    When it comes to managing large volumes of data, general-purpose compressors such as gzip are ubiquitous. They are fast, practical, and available on every modern platform, from standard desktops to mobile devices. These tools exploit local redundancy in a text using a fixed-size sliding window. This window is usually very small relative to the text, although in principle it can be as large as available memory. The window acts as a dictionary: compression is achieved by replacing substrings with pointers to previous occurrences found in the dictionary. This type of algorithm becomes problematic when dealing with collections that are larger than physical memory, as it fails to capture any non-local redundancy, that is, repetition that occurs outside its search window. With rapid growth in the already enormous amount of data we store and process, there is a pressing need to improve compression effectiveness, reducing both storage requirements and decompression costs. However, many systems still use general-purpose compression tools on large, highly repetitive data collections. In this thesis we focus on addressing this issue. We explore compression in a variety of domains where large volumes of data need to be stored and accessed and where general-purpose compression tools are the norm. First we discuss our work on web corpus compression; then we discuss the implementation of a practical index for repetitive texts that gives strong theoretical bounds in terms of size and access; and finally, we discuss our work on compression of high-throughput sequencing reads. We show that in all cases our new methods improve on current techniques in both run-time and compression effectiveness, and provide important functionality such as fast decoding and random access.
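
    To make the window limitation concrete, here is a naive sketch of a greedy LZ77-style parse with a bounded sliding window; when the window is smaller than the distance between repeats, the repeated material cannot be matched and is re-emitted literally. The function name, window sizes, and input string are illustrative, and this is not the compressor developed in the thesis.

```python
def lz77_parse(text, window=32):
    """Greedy LZ77-style factorisation with a bounded sliding window.

    Emits (offset, length, next_char) triples. A window much smaller than the
    distance between repeats cannot reach earlier copies, so repetition is
    re-encoded from scratch -- the non-local redundancy problem.
    """
    factors, i, n = [], 0, len(text)
    while i < n:
        start = max(0, i - window)     # dictionary = last `window` characters
        best_off, best_len = 0, 0
        for j in range(start, i):      # naive O(window * match length) search
            length = 0
            while i + length < n - 1 and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        nxt = text[i + best_len]
        factors.append((best_off, best_len, nxt))
        i += best_len + 1
    return factors

doc = "the quick brown fox. " * 3     # highly repetitive input, period 21
small = lz77_parse(doc, window=8)     # window shorter than the repeat distance
large = lz77_parse(doc, window=64)    # window covers all earlier text
print(len(small), "factors vs", len(large), "factors")  # far fewer factors with the large window
```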