322 research outputs found

    An implementation of Deflate in Coq

    The widely used compression format "Deflate" is defined in RFC 1951 and is based on prefix-free codings and backreferences. The way these codings are specified leaves several points unclear and is a source of confusion in the standard. We address this problem by giving a rigorous mathematical specification, which we formalized in Coq, and we produced a verified implementation in Coq that achieves competitive performance on inputs of several megabytes. In this paper we present the parts of our implementation: a fully verified implementation of canonical prefix-free codings, which can also be used in other compression formats, and an elegant formalism for specifying sophisticated formats, which we used to implement both a compression and a decompression algorithm in Coq that we formally prove to be inverses of each other -- to our knowledge, the first time this has been achieved. Compatibility with other Deflate implementations is shown empirically. We furthermore discuss some of the difficulties, specifically regarding memory and runtime requirements, and our approaches to overcoming them.
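
    As a concrete illustration of the canonical prefix-free codings the paper verifies, the following Python sketch assigns canonical codes from a list of code lengths in the manner of RFC 1951, section 3.2.2. It only illustrates the coding scheme itself, not the authors' verified Coq development.

        # Canonical code assignment from code lengths (RFC 1951, sec. 3.2.2).
        def canonical_codes(lengths):
            """Map each symbol index to its canonical code, as a bit string."""
            max_len = max(lengths)
            # Count how many codes there are of each length.
            bl_count = [0] * (max_len + 1)
            for l in lengths:
                if l > 0:
                    bl_count[l] += 1
            # Compute the smallest code value for each length.
            next_code = [0] * (max_len + 1)
            code = 0
            for bits in range(1, max_len + 1):
                code = (code + bl_count[bits - 1]) << 1
                next_code[bits] = code
            # Assign consecutive code values to the symbols in order.
            codes = {}
            for sym, l in enumerate(lengths):
                if l > 0:
                    codes[sym] = format(next_code[l], "0{}b".format(l))
                    next_code[l] += 1
            return codes

        # The worked example from RFC 1951 (symbols A..H):
        print(canonical_codes([3, 3, 3, 3, 3, 2, 4, 4]))
        # {0: '010', 1: '011', 2: '100', 3: '101', 4: '110', 5: '00', 6: '1110', 7: '1111'}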

    Postings List Compression with Run-length and Zombit Encodings

    An inverted index is a core index structure for systems such as search engines and databases. It stores a mapping from terms, numbers, etc. to lists of locations in documents, sets of documents, database tables, and so on, and it allows efficient full-text searches over the indexed structure. The list of locations mapped to in an inverted index is usually called a postings list. In real-life applications, inverted indices can grow huge, so a representation is needed that is space-efficient yet still supports efficient queries. This thesis explores ways to represent postings lists efficiently while allowing efficient nextGEQ queries on the underlying set; efficient nextGEQ queries are needed to implement inverted indices. First, we convert the postings lists into one bitvector that concatenates each postings list's characteristic bitvector. Representing an integer set efficiently then reduces to representing this bitvector efficiently, and the bitvector is expected to have long runs of 0s and 1s. Run-length encoding of bitvectors has recently led to promising results, so in this thesis we experiment with two encoding methods (Top-k Hybrid coder, RLZ) that encode postings lists via run-length encodings of the bitvector. We also investigate another new bitvector compression method (Zombit-vector), which encodes bitvectors by finding redundancies in the runs of 0s and 1s. We compare all encodings to the current state-of-the-art Partitioned Elias-Fano (PEF) coding. All of the encodings compressed more efficiently than PEF. Zombit-vector's nextGEQ queries were slightly more efficient than PEF's, which makes it attractive for bitvectors with long runs of 0s and 1s. More work is needed on the Top-k Hybrid coder and RLZ so that their nextGEQ performance can be compared to Zombit-vector and PEF.
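
    The objects involved can be made concrete with a small Python sketch: a postings list is turned into its characteristic bitvector, and nextGEQ(x) returns the smallest element >= x. This naive, uncompressed version only illustrates the interface that the compressed encodings compared in the thesis (Top-k Hybrid coder, RLZ, Zombit-vector, PEF) must support efficiently.

        # A postings list as a characteristic bitvector, plus a naive nextGEQ.
        def characteristic_bitvector(postings, universe):
            """Bit i is 1 iff document id i occurs in the postings list."""
            bits = [0] * universe
            for doc_id in postings:
                bits[doc_id] = 1
            return bits

        def next_geq(bits, x):
            """Smallest set element >= x, or None. This linear scan is the
            query that compressed representations must answer much faster."""
            for i in range(x, len(bits)):
                if bits[i]:
                    return i
            return None

        bv = characteristic_bitvector([2, 3, 4, 5, 11], universe=16)
        print(next_geq(bv, 6))  # 11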

    Constrained Optimal Querying: Huffman Coding and Beyond

    Huffman coding is well known to be useful in certain decision problems that involve minimizing the average number of (freely chosen) queries needed to determine an unknown random variable. However, in problems where the queries are more constrained, the original Huffman coding no longer works. In this paper, we propose a general model to describe such problems, together with two coding schemes: one Huffman-based, and the other called GBSC (Greedy Binary Separation Coding). We prove the optimality of GBSC by induction on a binary decision tree, which shows that GBSC is at least as good as Shannon coding. We then compare the two algorithms based on these codes by testing them on two problems, DNA detection and 1-player Battleship, and find both to be decent approximation algorithms: the Huffman-based algorithm gives an expected length 1.1 times the true optimum in the DNA detection problem, and GBSC yields an average number of queries 1.4 times the theoretical optimum in 1-player Battleship.
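
    The abstract does not spell out the GBSC construction, so the Python sketch below shows one natural reading of greedy binary separation: each query splits the remaining (sorted) outcomes into two contiguous groups whose probability masses are as balanced as possible. The splitting rule and the function name are assumptions for illustration, not the paper's exact algorithm. Being "at least as good as Shannon coding" means the expected number of queries stays below H(X) + 1, where H(X) is the Shannon entropy of the unknown variable.

        # Hypothetical greedy binary-separation sketch: recursively split the
        # outcomes where the two halves' probabilities are most balanced, and
        # charge one query per split.
        def greedy_separation_cost(probs):
            """Expected number of queries under the greedy splitting rule."""
            def cost(lo, hi, depth):
                if hi - lo == 1:
                    return depth * probs[lo]
                total = sum(probs[lo:hi])
                best_split, best_gap, run = lo + 1, float("inf"), 0.0
                for i in range(lo, hi - 1):
                    run += probs[i]
                    gap = abs(total - 2 * run)  # |left mass - right mass|
                    if gap < best_gap:
                        best_gap, best_split = gap, i + 1
                return (cost(lo, best_split, depth + 1)
                        + cost(best_split, hi, depth + 1))
            return cost(0, len(probs), 0)

        print(greedy_separation_cost([0.25, 0.25, 0.25, 0.25]))  # 2.0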

    Lower Bound on Expected Communication Cost of Quantum Huffman Coding

    Data compression is a fundamental problem in quantum and classical information theory. A typical version of the problem is that the sender Alice receives a (classical or quantum) state from some known ensemble and needs to transmit it to the receiver Bob with average error below some specified bound. We consider the case in which the message can have variable length and the goal is to minimize its expected length. For classical messages this problem has a well-known solution given by Huffman coding. In this scheme, the expected length of the message is equal to the Shannon entropy of the source (up to a constant additive factor) and the scheme succeeds with zero error. This is a single-shot result which implies the asymptotic result, viz. Shannon's source coding theorem, by encoding each state sequentially. For the quantum case, the asymptotic compression rate is given by the von Neumann entropy. However, we show that there is no one-shot scheme that is able to match this rate, even if interactive communication is allowed. This is a relatively rare case in quantum information theory in which the cost of a quantum task is significantly different from that of its classical analogue. Our result has implications for direct sum theorems in quantum communication complexity and for one-shot formulations of the Quantum Reverse Shannon theorem.
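
    The classical baseline that the abstract contrasts against can be stated precisely; the LaTeX lines below give the standard definitions (textbook facts, not details taken from the paper's proofs).

        % Huffman coding, single-shot and zero-error: the expected code
        % length is within one bit of the source's Shannon entropy.
        H(X) \le \mathbb{E}[\ell(X)] < H(X) + 1,
        \qquad H(X) = -\sum_x p(x) \log_2 p(x)

        % Quantum case: the asymptotic compression rate is the
        % von Neumann entropy of the source state \rho.
        S(\rho) = -\operatorname{Tr}(\rho \log_2 \rho)

    The paper's result is that, unlike in the classical case, no one-shot quantum scheme attains this rate, even when interactive communication is allowed.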