
    Stream VByte: Faster Byte-Oriented Integer Compression

    Arrays of integers are often compressed in search engines. Though there are many ways to compress integers, we are interested in the popular byte-oriented integer compression techniques (e.g., VByte or Google's Varint-GB), which are appealing due to their simplicity and engineering convenience. Amazon's varint-G8IU is one of the fastest byte-oriented compression techniques published so far. It makes judicious use of the powerful single-instruction-multiple-data (SIMD) instructions available in commodity processors. To surpass varint-G8IU, we present Stream VByte, a novel byte-oriented compression technique that separates the control stream from the encoded data. Like varint-G8IU, Stream VByte is well suited to SIMD instructions. We show that Stream VByte decoding can be up to twice as fast as varint-G8IU decoding over real data sets. In this sense, Stream VByte establishes new speed records for byte-oriented integer compression, at times exceeding the speed of the memcpy function. On a 3.4 GHz Haswell processor, it decodes more than 4 billion differentially-coded integers per second from RAM to L1 cache.
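
    The control/data separation is easy to sketch in scalar C. Below is a minimal sketch of the format, assuming a little-endian host and hypothetical function names (svb_encode, svb_decode); the published implementation gets its speed by decoding four integers at a time with SIMD shuffle instructions, which this scalar version does not attempt.

    ```c
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* One control byte packs four 2-bit length codes (length-1 in bytes),
       one per integer; control bytes and payload bytes go to two separate
       streams, which is the core idea of the format. */
    static void svb_encode(const uint32_t *in, size_t n,
                           uint8_t *control, uint8_t *data) {
        size_t c = 0;
        int shift = 0;
        uint8_t cb = 0;
        for (size_t i = 0; i < n; i++) {
            uint32_t v = in[i];
            uint8_t len = v < (1u << 8) ? 1 : v < (1u << 16) ? 2
                        : v < (1u << 24) ? 3 : 4;
            memcpy(data, &v, len);          /* low-order bytes, little-endian */
            data += len;
            cb |= (uint8_t)((len - 1) << shift);
            shift += 2;
            if (shift == 8) { control[c++] = cb; cb = 0; shift = 0; }
        }
        if (shift) control[c] = cb;         /* flush a partial control byte */
    }

    static void svb_decode(const uint8_t *control, const uint8_t *data,
                           uint32_t *out, size_t n) {
        for (size_t i = 0; i < n; i++) {
            /* the 2-bit length code of integer i sits in control byte i/4 */
            uint8_t len = (uint8_t)(((control[i >> 2] >> ((i & 3) * 2)) & 3) + 1);
            uint32_t v = 0;
            memcpy(&v, data, len);          /* assumes little-endian host */
            data += len;
            out[i] = v;
        }
    }
    ```

    Because all length codes for a group sit in one predictable byte, a SIMD decoder can look up a shuffle mask from the control byte and move four integers per instruction, avoiding the per-byte branches of classic VByte.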

    On Optimally Partitioning Variable-Byte Codes

    The ubiquitous Variable-Byte encoding is one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with that of more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2x by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization comes almost for free: we introduce an optimal partitioning algorithm that does not affect indexing time because of its linear-time complexity, and we show through an extensive experimental analysis and comparison with several other state-of-the-art encoders that the query processing speed of Variable-Byte is preserved.
    Comment: Published in IEEE Transactions on Knowledge and Data Engineering (TKDE), 15 April 201
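
    To make the optimization concrete, here is an illustrative C sketch of the recurrence such a partitioning algorithm optimizes: dp[i] is the cheapest encoding of the first i integers given a per-block cost function. This naive version is quadratic and the cost callback is hypothetical; the paper's contribution is computing optimal partitions in linear time.

    ```c
    #include <stddef.h>

    #define MAX_N 4096   /* toy bound on the list length */

    /* dp[i] = min over j < i of dp[j] + cost(j, i), where cost(j, i) is
       the encoded size in bytes of elements j..i-1 as one partition
       (e.g., under Variable-Byte plus a small block header). */
    size_t partition_dp(size_t n, size_t (*cost)(size_t j, size_t i)) {
        static size_t dp[MAX_N + 1];   /* assumes n <= MAX_N */
        dp[0] = 0;
        for (size_t i = 1; i <= n; i++) {
            size_t best = (size_t)-1;
            for (size_t j = 0; j < i; j++) {
                size_t c = dp[j] + cost(j, i);
                if (c < best) best = c;
            }
            dp[i] = best;
        }
        return dp[n];   /* size of the cheapest partitioned encoding */
    }
    ```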

    Fast and Compact Set Intersection through Recursive Universe Partitioning

    We present a data structure that encodes a sorted integer sequence in small space while allowing, at the same time, fast intersection operations. The data layout is carefully designed to exploit word-level parallelism and SIMD instructions, hence providing good practical performance. The core algorithmic idea is that of recursively partitioning the universe of representation: a markedly different paradigm from the widespread strategy of partitioning the sequence based on its length. Extensive experimentation and comparison against several competitive techniques show that the proposed solution embodies an improved space/time trade-off for the set intersection problem.
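
    As an illustration of why universe partitioning helps intersection, consider a toy layout where a universe of 2^16 values is cut into 1024 slices of 64 values, each slice stored as one 64-bit word; the names and parameters below are invented for the example, not taken from the paper.

    ```c
    #include <stdint.h>
    #include <stddef.h>

    enum { SLICES = 1024 };   /* 1024 x 64 = a universe of 65536 values */

    /* Intersect two sets stored as aligned slice bitmaps: one AND compares
       64 universe positions at once (word-level parallelism); `out` must
       have room for the full intersection. __builtin_ctzll is GCC/Clang. */
    size_t intersect_slices(const uint64_t *a, const uint64_t *b,
                            uint32_t *out) {
        size_t k = 0;
        for (size_t s = 0; s < SLICES; s++) {
            uint64_t w = a[s] & b[s];
            while (w) {
                unsigned bit = (unsigned)__builtin_ctzll(w); /* lowest set bit */
                out[k++] = (uint32_t)(s * 64 + bit);
                w &= w - 1;                                  /* clear it */
            }
        }
        return k;   /* number of common elements */
    }
    ```

    The actual data structure applies the partitioning recursively and varies the slice representation by density; the sketch only shows the word-parallel inner step.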

    Fast dictionary-based compression for inverted indexes

    Dictionary-based compression schemes provide fast decoding, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments using the document-level inverted index data for two large text collections, with a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.
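
    A minimal sketch of the flavor of dictionary method involved, assuming (hypothetically) a dictionary of frequent patterns of four consecutive gaps and a 16-bit codeword per block; real schemes index the dictionary with hashing rather than the linear scan used here.

    ```c
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define PAT_LEN 4
    #define ESCAPE  0xFFFFu   /* codeword reserved for literal blocks */

    typedef struct { uint32_t gaps[PAT_LEN]; } pattern_t;

    /* Encode one block of 4 gaps: a 2-byte codeword if the block matches a
       dictionary pattern, otherwise an escape followed by the raw gaps.
       Returns the number of output bytes. */
    size_t encode_block(const uint32_t *gaps, const pattern_t *dict,
                        size_t dict_len, uint8_t *out) {
        for (size_t k = 0; k < dict_len && k < ESCAPE; k++) {
            if (memcmp(gaps, dict[k].gaps, sizeof(pattern_t)) == 0) {
                uint16_t code = (uint16_t)k;
                memcpy(out, &code, 2);
                return 2;                 /* 16 bytes of gaps -> 2 bytes */
            }
        }
        uint16_t esc = ESCAPE;
        memcpy(out, &esc, 2);
        memcpy(out + 2, gaps, sizeof(pattern_t));
        return 2 + sizeof(pattern_t);
    }
    ```

    The regularity the abstract mentions, such as runs of identical small gaps in inverted lists, is exactly what makes such pattern dictionaries hit often.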

    From a Comprehensive Experimental Survey to a Cost-based Selection Strategy for Lightweight Integer Compression Algorithms

    Lightweight integer compression algorithms are frequently applied in in-memory database systems to tackle the growing gap between processor speed and main memory bandwidth. In recent years, the vectorization of basic techniques such as delta coding and null suppression has considerably enlarged the corpus of available algorithms. As a result, there is today a large number of algorithms to choose from, with different algorithms tailored to different data characteristics. However, a comparative evaluation of these algorithms across different data and hardware characteristics has never been sufficiently conducted in the literature. To close this gap, we conducted an exhaustive experimental survey, evaluating several state-of-the-art lightweight integer compression algorithms as well as cascades of basic techniques. We systematically investigated the influence of data and hardware properties on performance and compression rates. The evaluated algorithms are based on publicly available implementations as well as our own vectorized reimplementations. We summarize our experimental findings, which lead to several new insights and to the conclusion that there is no single best algorithm. Moreover, in this article we also introduce and evaluate a novel cost model for selecting a suitable lightweight integer compression algorithm for a given dataset.
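
    As an example of a cascade of two such basic techniques, the sketch below combines delta coding with byte-level null suppression and returns the resulting size in bytes; expressing encoded size as a function of the data like this is the kind of quantity a selection cost model can estimate. The function name and layout are assumptions for illustration, not taken from the article.

    ```c
    #include <stdint.h>
    #include <stddef.h>

    /* Cascade: (1) delta coding turns a sorted sequence into small gaps,
       (2) null suppression stores each gap in its minimal number of bytes,
       with the byte lengths kept as packed 2-bit codes. */
    size_t delta_ns_size(const uint32_t *in, size_t n) {
        size_t payload = 0;
        uint32_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            uint32_t gap = in[i] - prev;            /* delta coding */
            prev = in[i];
            payload += gap < (1u << 8)  ? 1
                     : gap < (1u << 16) ? 2
                     : gap < (1u << 24) ? 3 : 4;    /* null suppression */
        }
        return payload + (n + 3) / 4;   /* plus packed 2-bit length codes */
    }
    ```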

    Select-based random access to variable-byte encodings

    Enormous datasets are a common occurrence today, and compressing them is often beneficial. Fast direct access to any element of the compressed data is a requirement in the field of compressed data structures, yet it is not easily supported by traditional compression methods. Variable-byte encoding is a method for compressing integers of different byte lengths: it removes unused leading bytes and adds a continuation bit to each remaining byte to denote whether the compressed integer continues into the next byte. An existing solution using a rank data structure performs well at this task. This thesis introduces an alternative solution using a select data structure and compares the two implementations. Experiments are also conducted on retrieving a subarray from the compressed data structure. The rank implementation performs better on data containing mostly small integers, while the select implementation benefits on larger integers. The select implementation also has significant advantages in subarray fetching due to how the data is compressed.
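
    The select-based idea can be sketched briefly: mark each byte that ends an integer in a bitvector B; the i-th integer then starts right after the (i-1)-th set bit, so random access reduces to a select query. The sketch below, with invented names and a byte-per-flag B for simplicity, uses a naive linear select; a real implementation would use a succinct select structure over a packed bitvector.

    ```c
    #include <stdint.h>
    #include <stddef.h>

    /* B[j] = 1 iff byte j is the last byte of some encoded integer
       (the complement of the continuation bit). */
    size_t select1_naive(const uint8_t *B, size_t nbytes, size_t i) {
        size_t seen = 0;
        for (size_t j = 0; j < nbytes; j++)
            if (B[j] && seen++ == i) return j;   /* i-th set bit, 0-indexed */
        return (size_t)-1;                       /* fewer than i+1 set bits */
    }

    /* Byte offset where the i-th encoded integer begins. */
    size_t vbyte_start_of(const uint8_t *B, size_t nbytes, size_t i) {
        return i == 0 ? 0 : select1_naive(B, nbytes, i - 1) + 1;
    }
    ```

    This also suggests one reading of the subarray advantage: a single select query locates the start of the range, after which the bytes decode sequentially.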

    Multi-Compression of Large Integer Lists for Search Systems (Multicompresión de grandes listas de enteros para sistemas de búsquedas)

    Searching over large document repositories (such as the web) requires systems to operate under strict performance constraints. Nowadays, given the number of documents a system manages, it is indispensable to apply techniques such as compressing the data structures. In particular, this work addresses the problem of compressing an inverted index through a "multi-compression" scheme that processes different portions of a list with different codecs. Preliminary results show that it is possible to offset the overhead required to maintain this scheme while improving decompression time.
    Sociedad Argentina de Informática
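
    A hedged sketch of what such a multi-compression layout could look like (the names and the one-byte tag are assumptions, not the paper's actual format): each block of a posting list carries a codec tag, and the decoder dispatches on it.

    ```c
    #include <stdint.h>
    #include <stddef.h>

    enum codec_id { CODEC_VBYTE = 0, CODEC_BITPACK = 1, CODEC_BITMAP = 2 };

    /* A codec decodes n integers from `in` into `out` and returns the
       number of input bytes it consumed. */
    typedef size_t (*decode_fn)(const uint8_t *in, uint32_t *out, size_t n);

    /* Decode one block: a 1-byte tag selects the codec used for this
       portion of the list; returns total bytes consumed including the
       tag, which is the per-block overhead the scheme must offset. */
    size_t decode_block(const uint8_t *in, uint32_t *out, size_t n,
                        const decode_fn codecs[]) {
        uint8_t tag = in[0];
        return 1 + codecs[tag](in + 1, out, n);
    }
    ```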