Postings List Compression with Run-length and Zombit Encodings
An inverted index is a core data structure in systems such as search engines and databases.
It stores a mapping from terms, numbers, etc. to the locations where they occur (documents, sets of documents, database tables, etc.) and allows efficient full-text searches over the indexed data.
The list of locations for a term in an inverted index is usually called a postings list.
In real-world applications, the size of an inverted index can grow huge.
Therefore an efficient representation is needed, while efficient queries must still be supported.
This thesis explores ways to represent postings lists compactly while supporting efficient nextGEQ queries on the set.
Efficient nextGEQ queries are needed to implement query processing over inverted indices.
First we convert the postings lists into one bitvector that concatenates each postings list's characteristic bitvector.
Representing an integer set efficiently then reduces to representing this bitvector efficiently, and the bitvector is expected to have long runs of 0s and 1s.
Run-length encoding of bitvectors has recently led to promising results.
Therefore, in this thesis we experiment with two encoding methods (Top-k Hybrid coder, RLZ) that encode postings lists via run-length encodings of the bitvector.
We also investigate another new bitvector compression method (Zombit-vector), which encodes bitvectors by finding redundancies in the runs of 0s and 1s.
We compare all encodings to the current state-of-the-art Partitioned Elias-Fano (PEF) coding.
All of the encodings compressed more efficiently than the current state-of-the-art PEF coding.
Zombit-vector nextGEQ queries were slightly faster than PEF's, which makes it attractive for bitvectors that have long runs of 0s and 1s.
More work is needed on the Top-k Hybrid coder and RLZ before their nextGEQ performance can be compared to Zombit-vector and PEF.
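The conversion to a characteristic bitvector can be sketched as follows (a minimal illustration; the function names are hypothetical, not taken from the thesis):

```python
def characteristic_bitvector(postings, universe):
    """Bit i is 1 iff document id i occurs in the postings list."""
    bits = [0] * universe
    for doc_id in postings:
        bits[doc_id] = 1
    return bits

def concatenate_lists(lists, universe):
    """Concatenate the characteristic bitvectors of several postings lists."""
    result = []
    for postings in lists:
        result.extend(characteristic_bitvector(postings, universe))
    return result
```

For example, `characteristic_bitvector([1, 2, 3, 7], 8)` gives `[0, 1, 1, 1, 0, 0, 0, 1]`; dense or clustered postings produce the long runs of identical bits that run-length methods exploit.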
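A nextGEQ(x) query returns the smallest element of the set that is greater than or equal to x. On an uncompressed characteristic bitvector it can be sketched as a naive linear scan (for clarity only; the encodings studied here aim to answer this without decompressing the whole vector):

```python
def next_geq(bits, x):
    """Return the smallest set element >= x, or None if there is none.
    The set is given as a characteristic bitvector (bit i set iff i is in the set)."""
    for i in range(x, len(bits)):
        if bits[i]:
            return i
    return None
```

With the bitvector for the set {3, 9, 12}, `next_geq(bits, 4)` returns 9 and `next_geq(bits, 13)` returns None.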
Hu-Tucker algorithm for building optimal alphabetic binary search trees
The purpose of this thesis is to study the behavior of the Hu-Tucker algorithm for building Optimal Alphabetic Binary Search Trees (OABST), to design an efficient implementation, and to evaluate the performance of the algorithm and the implementation. The three phases of the algorithm are described and their time complexities evaluated. Two separate implementations for the most expensive phase, Combination, are presented, achieving O(n²) and O(n log n) time with O(n) space complexity. The break-even point between them is experimentally established, and the complexities of the implementations are compared against their theoretical time complexities. The electronic version of The Complete Works of William Shakespeare is compressed using the Hu-Tucker algorithm and other popular compression algorithms to compare the performance of the different techniques. The experiments justified the price that has to be paid to implement the Hu-Tucker algorithm. It is shown that an efficient implementation can process extremely large data sets relatively fast and can achieve cost close to that of the Optimal Binary Tree (OBT) built using the Huffman algorithm; however, the OABST can be used in both encoding and decoding processes, unlike the OBT, where an additional mapping mechanism is needed for the decoding phase.
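The cost the Hu-Tucker algorithm minimizes is the weighted path length of a binary tree whose leaves must keep their given left-to-right order. As a reference point (not the thesis's implementation), the same optimal cost can be computed by a straightforward O(n³) dynamic program:

```python
def optimal_alphabetic_cost(weights):
    """Minimum weighted path length of a binary tree whose leaves carry
    `weights` in the given left-to-right order (the cost Hu-Tucker achieves)."""
    n = len(weights)
    # prefix sums so that W(i, j) = sum of weights[i..j] is O(1)
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    # cost[i][j]: optimal cost for leaves i..j (a single leaf costs 0)
    cost = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = min(cost[i][k] + cost[k + 1][j] for k in range(i, j))
            cost[i][j] = best + (prefix[j + 1] - prefix[i])
    return cost[0][n - 1]
```

For weights [1, 2, 3, 4] the optimal alphabetic tree is (((1,2),3),4) with cost 3·1 + 3·2 + 2·3 + 1·4 = 19; the Hu-Tucker algorithm reaches the same cost in O(n log n) rather than cubic time.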
An implementation of Deflate in Coq
The widely-used compression format "Deflate" is defined in RFC 1951 and is
based on prefix-free codings and backreferences. There are unclear points about
the way these codings are specified, and several sources for confusion in the
standard. We tried to fix this problem by giving a rigorous mathematical
specification, which we formalized in Coq. We produced a verified
implementation in Coq which achieves competitive performance on inputs of
several megabytes. In this paper we present the several parts of our
implementation: a fully verified implementation of canonical prefix-free
codings, which can be used in other compression formats as well, and an elegant
formalism for specifying sophisticated formats, which we used to implement both
a compression and decompression algorithm in Coq which we formally prove
inverse to each other -- the first time this has been achieved to our
knowledge. Compatibility with other Deflate implementations is shown
empirically. We furthermore discuss some of the difficulties, specifically
regarding memory and runtime requirements, and our approaches to overcoming them.
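Canonical prefix-free codings, as used in Deflate, are determined by the code lengths alone. A minimal Python sketch of the assignment procedure described in RFC 1951, section 3.2.2 (the verified Coq development itself is not reproduced here):

```python
def canonical_codes(lengths):
    """Assign canonical prefix-free codes (as bit strings) from code lengths,
    following the procedure in RFC 1951, section 3.2.2."""
    max_len = max(lengths)
    # count how many codes there are of each length
    bl_count = [0] * (max_len + 1)
    for length in lengths:
        if length:
            bl_count[length] += 1
    # smallest code value for each code length
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    # assign consecutive codes to symbols of each length, in symbol order
    codes = {}
    for symbol, length in enumerate(lengths):
        if length:
            codes[symbol] = format(next_code[length], "0%db" % length)
            next_code[length] += 1
    return codes
```

With the RFC's example lengths (3, 3, 3, 3, 3, 2, 4, 4), the eight symbols receive the codes 010, 011, 100, 101, 110, 00, 1110, 1111.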
On the design of fast and efficient wavelet image coders with reduced memory usage
Image compression is of great importance in multimedia systems and
applications because it drastically reduces bandwidth requirements for
transmission and memory requirements for storage. Although earlier
standards for image compression were based on the Discrete Cosine
Transform (DCT), a recently developed mathematical technique, called
Discrete Wavelet Transform (DWT), has been found to be more efficient
for image coding.
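To illustrate the DWT mentioned above: one level of the simplest wavelet transform, the Haar transform, splits a signal into low-pass averages and high-pass differences (a 1D sketch for illustration only; image coders apply such filters along both rows and columns):

```python
def haar_level(signal):
    """One level of the (unnormalized) Haar wavelet transform:
    averages approximate the signal, differences capture the detail."""
    half = len(signal) // 2
    approx = [(signal[2 * i] + signal[2 * i + 1]) / 2 for i in range(half)]
    detail = [(signal[2 * i] - signal[2 * i + 1]) / 2 for i in range(half)]
    return approx, detail
```

For example, `haar_level([4, 2, 6, 6])` yields `([3.0, 6.0], [1.0, 0.0])`; smooth regions produce near-zero detail coefficients, which is what makes wavelet representations compress well.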
Despite improvements in compression efficiency, wavelet image coders
significantly increase memory usage and complexity when compared with
DCT-based coders. A major reason for the high memory requirements is
that the usual algorithm to compute the wavelet transform requires the
entire image to be in memory. Although some proposals reduce the memory
usage, they present problems that hinder their implementation. In
addition, some wavelet image coders, like SPIHT (which has become a
benchmark for wavelet coding), always need to hold the entire image in
memory.
Regarding the complexity of the coders, SPIHT can be considered quite
complex because it performs bit-plane coding with multiple image scans.
The wavelet-based JPEG 2000 standard is still more complex because it
improves coding efficiency through time-consuming methods, such as an
iterative optimization algorithm based on the Lagrange multiplier
method, and high-order context modeling.
In this thesis, we aim to reduce memory usage and complexity in
wavelet-based image coding, while preserving compression efficiency. To
this end, a run-length encoder and a tree-based wavelet encoder are
proposed. In addition, a new algorithm to efficiently compute the
wavelet transform is presented. This algorithm achieves low memory
consumption using line-by-line processing, and it employs recursion to
automatically establish the order in which the wavelet transform is
computed, solving some synchronization problems that have not been
tackled by previous proposals. The proposed encode…
Oliver Gil, JS. (2006). On the design of fast and efficient wavelet image coders with reduced memory usage [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/1826
- …