
    Postings List Compression with Run-length and Zombit Encodings

    An inverted index is a core data structure in systems such as search engines and databases. It stores a mapping from terms, numbers, etc. to the locations where they occur (documents, sets of documents, database tables, and so on) and enables efficient full-text search over the indexed data. The list of locations for one term is usually called a postings list. In real-world applications an inverted index can grow very large, so it must be represented compactly while still supporting efficient queries; in particular, implementing an inverted index requires efficient nextGEQ queries (finding the smallest element greater than or equal to a given value). This thesis explores ways to represent postings lists compactly while supporting efficient nextGEQ queries. First, the postings lists are converted into a single bitvector that concatenates each list's characteristic bitvector. Representing the integer sets efficiently then reduces to representing this bitvector efficiently, and the bitvector is expected to contain long runs of 0s and 1s. Run-length encoding of bitvectors has recently led to promising results, so we experiment with two encoding methods (Top-k Hybrid coder, RLZ) that compress postings lists via run-length encodings of the bitvector. We also investigate another new bitvector compression method (Zombit-vector), which encodes bitvectors by finding redundancies in the runs of 0s and 1s. We compare all encodings to the current state-of-the-art Partitioned Elias-Fano (PEF) coding. All of the encodings compressed better than PEF, and Zombit-vector answered nextGEQ queries slightly faster than PEF, which makes it attractive for bitvectors with long runs of 0s and 1s. More work is needed on the Top-k Hybrid coder and RLZ before their nextGEQ performance can be compared to Zombit-vector and PEF.
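    To make the setup concrete, the following is a minimal Python sketch (hypothetical names, not taken from the thesis) of a postings list stored as a characteristic bitvector, a naive run-length encoding of that bitvector, and a naive nextGEQ query; the coders actually studied (Top-k Hybrid, RLZ, Zombit-vector, PEF) are far more elaborate.

```python
# Minimal sketch: a postings list as a characteristic bitvector,
# a naive run-length encoding, and a naive nextGEQ query by scanning.

def characteristic_bitvector(postings, universe):
    """Bit i is 1 iff document id i occurs in the postings list."""
    bits = [0] * universe
    for doc_id in postings:
        bits[doc_id] = 1
    return bits

def run_length_encode(bits):
    """Collapse the bitvector into (bit, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return [tuple(r) for r in runs]

def next_geq(bits, x):
    """Smallest set element >= x, or None (linear scan for clarity)."""
    for i in range(x, len(bits)):
        if bits[i]:
            return i
    return None

postings = [3, 4, 5, 6, 20, 21, 22]           # clustered ids -> long runs
bits = characteristic_bitvector(postings, 32)
print(run_length_encode(bits))  # [(0, 3), (1, 4), (0, 13), (1, 3), (0, 9)]
print(next_geq(bits, 7))        # 20
```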

    Hu-Tucker algorithm for building optimal alphabetic binary search trees

    The purpose of this thesis is to study the behavior of the Hu-Tucker algorithm for building Optimal Alphabetic Binary Search Trees (OABST), to design an efficient implementation, and to evaluate the performance of both the algorithm and the implementation. The three phases of the algorithm are described and their time complexities evaluated. Two separate implementations of the most expensive phase, Combination, are presented, achieving O(n²) and O(n log n) time with O(n) space complexity. The break-even point between them is established experimentally, and the measured running times are compared against the theoretical time complexities. The electronic version of The Complete Works of William Shakespeare is compressed using the Hu-Tucker algorithm and other popular compression algorithms to compare the performance of the different techniques. The experiments justify the price that has to be paid to implement the Hu-Tucker algorithm: an efficient implementation can process extremely large data sets relatively fast and achieves optimality close to that of the Optimal Binary Tree (OBT) built with the Huffman algorithm. Moreover, the OABST can be used in both the encoding and decoding processes, unlike the OBT, which needs an additional mapping mechanism for the decoding phase.
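    For reference, the optimal weighted path length that Hu-Tucker computes in O(n log n) can also be obtained with a simple O(n³) dynamic program over the fixed leaf order. The sketch below (hypothetical names, not from the thesis) is not the Hu-Tucker algorithm itself, but it computes the same optimum and can serve as a correctness oracle for a Hu-Tucker implementation.

```python
# Optimal alphabetic binary tree cost by O(n^3) dynamic programming
# over the fixed leaf order. NOT Hu-Tucker (which runs in O(n log n)),
# but it yields the same optimal weighted path length.

from functools import lru_cache

def oabst_cost(weights):
    """Minimum weighted external path length with leaves kept in order."""
    n = len(weights)
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    @lru_cache(maxsize=None)
    def cost(i, j):
        if i == j:                            # a single leaf is its own subtree
            return 0
        total = prefix[j + 1] - prefix[i]     # merging pushes every leaf 1 deeper
        return total + min(cost(i, k) + cost(k + 1, j) for k in range(i, j))

    return cost(0, n - 1)

# Alphabetic order must be preserved, unlike in Huffman coding: for
# these weights Huffman (free reordering) achieves cost 30, while the
# best alphabetic tree costs 36.
print(oabst_cost([1, 8, 8, 1]))   # 36
```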

    An implementation of Deflate in Coq

    The widely used compression format "Deflate" is defined in RFC 1951 and is based on prefix-free codings and backreferences. There are unclear points about the way these codings are specified, and several sources of confusion in the standard. We tried to fix this problem by giving a rigorous mathematical specification, which we formalized in Coq, and we produced a verified implementation in Coq that achieves competitive performance on inputs of several megabytes. In this paper we present the parts of our implementation: a fully verified implementation of canonical prefix-free codings, which can be used in other compression formats as well, and an elegant formalism for specifying sophisticated formats, which we used to implement both a compression and a decompression algorithm in Coq that we formally prove to be inverse to each other -- the first time this has been achieved, to our knowledge. Compatibility with other Deflate implementations is shown empirically. We furthermore discuss some of the difficulties, specifically regarding memory and runtime requirements, and our approaches to overcoming them.
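    For context, the canonical prefix-free codes at the heart of Deflate can be reconstructed from code lengths alone, which is why a Deflate stream only needs to transmit lengths. The Python sketch below follows the recipe in RFC 1951, section 3.2.2; it mirrors the standard, not the paper's Coq development.

```python
# Canonical prefix-free code assignment per RFC 1951, section 3.2.2:
# given only the code length of each symbol, the codes themselves are
# reconstructed deterministically.

def canonical_codes(lengths):
    """lengths[sym] = code length in bits (0 = symbol unused)."""
    max_len = max(lengths)
    bl_count = [0] * (max_len + 1)
    for l in lengths:
        if l:
            bl_count[l] += 1

    # Smallest code for each length, per the RFC 1951 recipe.
    next_code = [0] * (max_len + 1)
    code = 0
    for l in range(1, max_len + 1):
        code = (code + bl_count[l - 1]) << 1
        next_code[l] = code

    codes = {}
    for sym, l in enumerate(lengths):
        if l:
            codes[sym] = format(next_code[l], f"0{l}b")
            next_code[l] += 1
    return codes

# The worked example from RFC 1951: lengths (3,3,3,3,3,2,4,4) for A..H.
print(canonical_codes([3, 3, 3, 3, 3, 2, 4, 4]))
# {0:'010', 1:'011', 2:'100', 3:'101', 4:'110', 5:'00', 6:'1110', 7:'1111'}
```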

    On the design of fast and efficient wavelet image coders with reduced memory usage

    Image compression is of great importance in multimedia systems and applications because it drastically reduces bandwidth requirements for transmission and memory requirements for storage. Although earlier standards for image compression were based on the Discrete Cosine Transform (DCT), a more recently developed mathematical technique, the Discrete Wavelet Transform (DWT), has been found to be more efficient for image coding. Despite improvements in compression efficiency, wavelet image coders significantly increase memory usage and complexity compared with DCT-based coders. A major reason for the high memory requirements is that the usual algorithm to compute the wavelet transform requires the entire image to be in memory. Although some proposals reduce the memory usage, they present problems that hinder their implementation. In addition, some wavelet image coders, like SPIHT (which has become a benchmark for wavelet coding), always need to hold the entire image in memory. Regarding the complexity of the coders, SPIHT can be considered quite complex because it performs bit-plane coding with multiple image scans. The wavelet-based JPEG 2000 standard is more complex still because it improves coding efficiency through time-consuming methods, such as an iterative optimization algorithm based on the Lagrange multiplier method, and high-order context modeling. In this thesis, we aim to reduce memory usage and complexity in wavelet-based image coding, while preserving compression efficiency. To this end, a run-length encoder and a tree-based wavelet encoder are proposed. In addition, a new algorithm to efficiently compute the wavelet transform is presented. This algorithm achieves low memory consumption using line-by-line processing, and it employs recursion to automatically determine the order in which the wavelet transform is computed, solving some synchronization problems that have not been tackled by previous proposals.

    Oliver Gil, JS. (2006). On the design of fast and efficient wavelet image coders with reduced memory usage [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/1826
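    As background on why line-by-line computation is possible: each output coefficient of a lifting-based wavelet transform depends only on a few neighboring samples. The Python sketch below shows one level of the reversible LeGall 5/3 transform used in JPEG 2000 applied to a single line; it illustrates the general technique, not the thesis's recursive line-based algorithm.

```python
# One level of the reversible LeGall 5/3 wavelet transform (the
# integer-to-integer filter of JPEG 2000), computed by lifting on a
# single line of samples.

def dwt53_line(x):
    """Return (lowpass, highpass) halves for an even-length line."""
    n = len(x)
    assert n >= 2 and n % 2 == 0
    half = n // 2

    def xs(i):                        # whole-sample symmetric extension
        return x[-i] if i < 0 else (x[2 * (n - 1) - i] if i >= n else x[i])

    # Predict step: odd samples become highpass details.
    d = [x[2*i + 1] - (xs(2*i) + xs(2*i + 2)) // 2 for i in range(half)]
    # Update step: even samples become the lowpass approximation
    # (d[-1] mirrors to d[0] under symmetric extension).
    s = [x[2*i] + (d[max(i - 1, 0)] + d[i] + 2) // 4 for i in range(half)]
    return s, d

low, high = dwt53_line([10, 12, 14, 16, 14, 12, 10, 8])
print(low, high)   # smooth input -> small details: [10, 15, 15, 10] [0, 2, 0, -2]
```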

    Soft decoding and synchronization of arithmetic codes: application to image transmission over noisy channels
