
    Stream VByte: Faster Byte-Oriented Integer Compression

    Arrays of integers are often compressed in search engines. Though there are many ways to compress integers, we are interested in the popular byte-oriented integer compression techniques (e.g., VByte or Google's Varint-GB), which are appealing due to their simplicity and engineering convenience. Amazon's varint-G8IU is one of the fastest byte-oriented compression techniques published so far. It makes judicious use of the powerful single-instruction-multiple-data (SIMD) instructions available in commodity processors. To surpass varint-G8IU, we present Stream VByte, a novel byte-oriented compression technique that separates the control stream from the encoded data. Like varint-G8IU, Stream VByte is well suited to SIMD instructions. We show that Stream VByte decoding can be up to twice as fast as varint-G8IU decoding over real data sets. In this sense, Stream VByte establishes new speed records for byte-oriented integer compression, at times exceeding the speed of the memcpy function. On a 3.4 GHz Haswell processor, it decodes more than 4 billion differentially-coded integers per second from RAM to L1 cache.
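
    The control/data separation is easy to sketch in scalar C. Below is a minimal sketch of the format, assuming a little-endian host and hypothetical function names (svb_encode, svb_decode); the published implementation gets its speed by decoding four integers at a time with SIMD shuffle instructions, which this scalar version does not attempt.

    ```c
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* One control byte packs four 2-bit length codes (length-1 in bytes),
       one per integer; control bytes and payload bytes go to two separate
       streams, which is the core idea of the format. */
    static void svb_encode(const uint32_t *in, size_t n,
                           uint8_t *control, uint8_t *data) {
        size_t c = 0;
        int shift = 0;
        uint8_t cb = 0;
        for (size_t i = 0; i < n; i++) {
            uint32_t v = in[i];
            uint8_t len = v < (1u << 8) ? 1 : v < (1u << 16) ? 2
                        : v < (1u << 24) ? 3 : 4;
            memcpy(data, &v, len);          /* low-order bytes, little-endian */
            data += len;
            cb |= (uint8_t)((len - 1) << shift);
            shift += 2;
            if (shift == 8) { control[c++] = cb; cb = 0; shift = 0; }
        }
        if (shift) control[c] = cb;         /* flush a partial control byte */
    }

    static void svb_decode(const uint8_t *control, const uint8_t *data,
                           uint32_t *out, size_t n) {
        for (size_t i = 0; i < n; i++) {
            /* the 2-bit length code of integer i sits in control byte i/4 */
            uint8_t len = (uint8_t)(((control[i >> 2] >> ((i & 3) * 2)) & 3) + 1);
            uint32_t v = 0;
            memcpy(&v, data, len);          /* assumes little-endian host */
            data += len;
            out[i] = v;
        }
    }
    ```

    Because all length codes for a group sit in one predictable byte, a SIMD decoder can look up a shuffle mask from the control byte and move four integers per instruction, avoiding the per-byte branches of classic VByte.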

    On Optimally Partitioning Variable-Byte Codes

    The ubiquitous Variable-Byte encoding is one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with that of more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2x by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization comes almost for free: we introduce an optimal partitioning algorithm that does not affect indexing time because of its linear-time complexity, and we show through an extensive experimental analysis and comparison with several other state-of-the-art encoders that the query processing speed of Variable-Byte is preserved.
    Comment: Published in IEEE Transactions on Knowledge and Data Engineering (TKDE), 15 April 201
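
    To make the optimization concrete, here is an illustrative C sketch of the recurrence such a partitioning algorithm optimizes: dp[i] is the cheapest encoding of the first i integers given a per-block cost function. This naive version is quadratic and the cost callback is hypothetical; the paper's contribution is computing optimal partitions in linear time.

    ```c
    #include <stddef.h>

    #define MAX_N 4096   /* toy bound on the list length */

    /* dp[i] = min over j < i of dp[j] + cost(j, i), where cost(j, i) is
       the encoded size in bytes of elements j..i-1 as one partition
       (e.g., under Variable-Byte plus a small block header). */
    size_t partition_dp(size_t n, size_t (*cost)(size_t j, size_t i)) {
        static size_t dp[MAX_N + 1];   /* assumes n <= MAX_N */
        dp[0] = 0;
        for (size_t i = 1; i <= n; i++) {
            size_t best = (size_t)-1;
            for (size_t j = 0; j < i; j++) {
                size_t c = dp[j] + cost(j, i);
                if (c < best) best = c;
            }
            dp[i] = best;
        }
        return dp[n];   /* size of the cheapest partitioned encoding */
    }
    ```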

    Fast and Compact Set Intersection through Recursive Universe Partitioning

    We present a data structure that encodes a sorted integer sequence in small space while allowing, at the same time, fast intersection operations. The data layout is carefully designed to exploit word-level parallelism and SIMD instructions, hence providing good practical performance. The core algorithmic idea is that of recursively partitioning the universe of representation: a markedly different paradigm from the widespread strategy of partitioning the sequence based on its length. Extensive experimentation and comparison against several competitive techniques show that the proposed solution embodies an improved space/time trade-off for the set intersection problem.
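
    As an illustration of why universe partitioning helps intersection, consider a toy layout where a universe of 2^16 values is cut into 1024 slices of 64 values, each slice stored as one 64-bit word; the names and parameters below are invented for the example, not taken from the paper.

    ```c
    #include <stdint.h>
    #include <stddef.h>

    enum { SLICES = 1024 };   /* 1024 x 64 = a universe of 65536 values */

    /* Intersect two sets stored as aligned slice bitmaps: one AND compares
       64 universe positions at once (word-level parallelism); `out` must
       have room for the full intersection. __builtin_ctzll is GCC/Clang. */
    size_t intersect_slices(const uint64_t *a, const uint64_t *b,
                            uint32_t *out) {
        size_t k = 0;
        for (size_t s = 0; s < SLICES; s++) {
            uint64_t w = a[s] & b[s];
            while (w) {
                unsigned bit = (unsigned)__builtin_ctzll(w); /* lowest set bit */
                out[k++] = (uint32_t)(s * 64 + bit);
                w &= w - 1;                                  /* clear it */
            }
        }
        return k;   /* number of common elements */
    }
    ```

    The actual data structure applies the partitioning recursively and varies the slice representation by density; the sketch only shows the word-parallel inner step.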

    Fast dictionary-based compression for inverted indexes

    Dictionary-based compression schemes provide fast decoding, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments using the document-level inverted index data for two large text collections, with a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.
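
    A minimal sketch of the flavor of dictionary method involved, assuming (hypothetically) a dictionary of frequent patterns of four consecutive gaps and a 16-bit codeword per block; real schemes index the dictionary with hashing rather than the linear scan used here.

    ```c
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define PAT_LEN 4
    #define ESCAPE  0xFFFFu   /* codeword reserved for literal blocks */

    typedef struct { uint32_t gaps[PAT_LEN]; } pattern_t;

    /* Encode one block of 4 gaps: a 2-byte codeword if the block matches a
       dictionary pattern, otherwise an escape followed by the raw gaps.
       Returns the number of output bytes. */
    size_t encode_block(const uint32_t *gaps, const pattern_t *dict,
                        size_t dict_len, uint8_t *out) {
        for (size_t k = 0; k < dict_len && k < ESCAPE; k++) {
            if (memcmp(gaps, dict[k].gaps, sizeof(pattern_t)) == 0) {
                uint16_t code = (uint16_t)k;
                memcpy(out, &code, 2);
                return 2;                 /* 16 bytes of gaps -> 2 bytes */
            }
        }
        uint16_t esc = ESCAPE;
        memcpy(out, &esc, 2);
        memcpy(out + 2, gaps, sizeof(pattern_t));
        return 2 + sizeof(pattern_t);
    }
    ```

    The regularity the abstract mentions, such as runs of identical small gaps in inverted lists, is exactly what makes such pattern dictionaries hit often.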

    From a Comprehensive Experimental Survey to a Cost-based Selection Strategy for Lightweight Integer Compression Algorithms

    Lightweight integer compression algorithms are frequently applied in in-memory database systems to tackle the growing gap between processor speed and main memory bandwidth. In recent years, the vectorization of basic techniques such as delta coding and null suppression has considerably enlarged the corpus of available algorithms. As a result, there is today a large number of algorithms to choose from, with different algorithms tailored to different data characteristics. However, a comparative evaluation of these algorithms across different data and hardware characteristics has never been sufficiently conducted in the literature. To close this gap, we conducted an exhaustive experimental survey, evaluating several state-of-the-art lightweight integer compression algorithms as well as cascades of basic techniques. We systematically investigated the influence of data and hardware properties on performance and compression rates. The evaluated algorithms are based on publicly available implementations as well as our own vectorized reimplementations. We summarize our experimental findings, which lead to several new insights and to the conclusion that there is no single best algorithm. Moreover, in this article we also introduce and evaluate a novel cost model for selecting a suitable lightweight integer compression algorithm for a given dataset.
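
    As an example of a cascade of two such basic techniques, the sketch below combines delta coding with byte-level null suppression and returns the resulting size in bytes; expressing encoded size as a function of the data like this is the kind of quantity a selection cost model can estimate. The function name and layout are assumptions for illustration, not taken from the article.

    ```c
    #include <stdint.h>
    #include <stddef.h>

    /* Cascade: (1) delta coding turns a sorted sequence into small gaps,
       (2) null suppression stores each gap in its minimal number of bytes,
       with the byte lengths kept as packed 2-bit codes. */
    size_t delta_ns_size(const uint32_t *in, size_t n) {
        size_t payload = 0;
        uint32_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            uint32_t gap = in[i] - prev;            /* delta coding */
            prev = in[i];
            payload += gap < (1u << 8)  ? 1
                     : gap < (1u << 16) ? 2
                     : gap < (1u << 24) ? 3 : 4;    /* null suppression */
        }
        return payload + (n + 3) / 4;   /* plus packed 2-bit length codes */
    }
    ```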

    Select-based random access to variable-byte encodings

    Enormous datasets are a common occurrence today, and compressing them is often beneficial. Fast direct access to any element of the compressed data is a requirement in the field of compressed data structures, yet it is not easily supported by traditional compression methods. Variable-byte encoding is a method for compressing integers of different byte lengths: it removes unused leading bytes and adds a continuation bit to each remaining byte to denote whether the compressed integer continues into the next byte. An existing solution using a rank data structure performs well at this task. This thesis introduces an alternative solution using a select data structure and compares the two implementations. Experiments are also conducted on retrieving a subarray from the compressed data structure. The rank implementation performs better on data containing mostly small integers, while the select implementation benefits on larger integers. The select implementation also has significant advantages in subarray fetching due to how the data is compressed.
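
    The select-based idea can be sketched briefly: mark each byte that ends an integer in a bitvector B; the i-th integer then starts right after the (i-1)-th set bit, so random access reduces to a select query. The sketch below, with invented names and a byte-per-flag B for simplicity, uses a naive linear select; a real implementation would use a succinct select structure over a packed bitvector.

    ```c
    #include <stdint.h>
    #include <stddef.h>

    /* B[j] = 1 iff byte j is the last byte of some encoded integer
       (the complement of the continuation bit). */
    size_t select1_naive(const uint8_t *B, size_t nbytes, size_t i) {
        size_t seen = 0;
        for (size_t j = 0; j < nbytes; j++)
            if (B[j] && seen++ == i) return j;   /* i-th set bit, 0-indexed */
        return (size_t)-1;                       /* fewer than i+1 set bits */
    }

    /* Byte offset where the i-th encoded integer begins. */
    size_t vbyte_start_of(const uint8_t *B, size_t nbytes, size_t i) {
        return i == 0 ? 0 : select1_naive(B, nbytes, i - 1) + 1;
    }
    ```

    This also suggests one reading of the subarray advantage: a single select query locates the start of the range, after which the bytes decode sequentially.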

    Multi-Compression of Large Integer Lists for Search Systems (Multicompresión de grandes listas de enteros para sistemas de búsquedas)

    Searching over large document repositories (such as the web) requires systems to operate under strict performance constraints. Nowadays, given the number of documents a system manages, it is indispensable to apply techniques such as compressing the data structures. In particular, this work addresses the problem of compressing an inverted index through a "multi-compression" scheme that processes different portions of a list with different codecs. Preliminary results show that it is possible to offset the overhead required to maintain this scheme while improving decompression time.
    Sociedad Argentina de Informática
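
    A hedged sketch of what such a multi-compression layout could look like (the names and the one-byte tag are assumptions, not the paper's actual format): each block of a posting list carries a codec tag, and the decoder dispatches on it.

    ```c
    #include <stdint.h>
    #include <stddef.h>

    enum codec_id { CODEC_VBYTE = 0, CODEC_BITPACK = 1, CODEC_BITMAP = 2 };

    /* A codec decodes n integers from `in` into `out` and returns the
       number of input bytes it consumed. */
    typedef size_t (*decode_fn)(const uint8_t *in, uint32_t *out, size_t n);

    /* Decode one block: a 1-byte tag selects the codec used for this
       portion of the list; returns total bytes consumed including the
       tag, which is the per-block overhead the scheme must offset. */
    size_t decode_block(const uint8_t *in, uint32_t *out, size_t n,
                        const decode_fn codecs[]) {
        uint8_t tag = in[0];
        return 1 + codecs[tag](in + 1, out, n);
    }
    ```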