110 research outputs found
Stream VByte: Faster Byte-Oriented Integer Compression
Arrays of integers are often compressed in search engines. Though there are
many ways to compress integers, we are interested in the popular byte-oriented
integer compression techniques (e.g., VByte or Google's Varint-GB). They are
appealing due to their simplicity and engineering convenience. Amazon's
varint-G8IU is one of the fastest byte-oriented compression technique published
so far. It makes judicious use of the powerful single-instruction-multiple-data
(SIMD) instructions available in commodity processors. To surpass varint-G8IU,
we present Stream VByte, a novel byte-oriented compression technique that
separates the control stream from the encoded data. Like varint-G8IU, Stream
VByte is well suited for SIMD instructions. We show that Stream VByte decoding
can be up to twice as fast as varint-G8IU decoding over real data sets. In this
sense, Stream VByte establishes new speed records for byte-oriented integer
compression, at times exceeding the speed of the memcpy function. On a 3.4GHz
Haswell processor, it decodes more than 4 billion differentially-coded integers
per second from RAM to L1 cache
Decoding billions of integers per second through vectorization
In many important applications -- such as search engines and relational
database systems -- data is stored in the form of arrays of integers. Encoding
and, most importantly, decoding of these arrays consumes considerable CPU time.
Therefore, substantial effort has been made to reduce costs associated with
compression and decompression. In particular, researchers have exploited the
superscalar nature of modern processors and SIMD instructions. Nevertheless, we
introduce a novel vectorized scheme called SIMD-BP128 that improves over
previously proposed vectorized approaches. It is nearly twice as fast as the
previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the
same time, SIMD-BP128 saves up to 2 bits per integer. For even better
compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has
a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while
being two times faster during decoding.Comment: For software, see https://github.com/lemire/FastPFor, For data, see
http://boytsov.info/datasets/clueweb09gap
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are
universal, that is, they do not need to know the versioning structure of the
collection, nor that a clear versioning structure even exists.
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower.Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Sk{\l}odowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
A General SIMD-based Approach to Accelerating Compression Algorithms
Compression algorithms are important for data oriented tasks, especially in
the era of Big Data. Modern processors equipped with powerful SIMD instruction
sets, provide us an opportunity for achieving better compression performance.
Previous research has shown that SIMD-based optimizations can multiply decoding
speeds. Following these pioneering studies, we propose a general approach to
accelerate compression algorithms. By instantiating the approach, we have
developed several novel integer compression algorithms, called Group-Simple,
Group-Scheme, Group-AFOR, and Group-PFD, and implemented their corresponding
vectorized versions. We evaluate the proposed algorithms on two public TREC
datasets, a Wikipedia dataset and a Twitter dataset. With competitive
compression ratios and encoding speeds, our SIMD-based algorithms outperform
state-of-the-art non-vectorized algorithms with respect to decoding speeds
On Optimally Partitioning Variable-Byte Codes
The ubiquitous Variable-Byte encoding is one of the fastest compressed
representation for integer sequences. However, its compression ratio is usually
not competitive with other more sophisticated encoders, especially when the
integers to be compressed are small that is the typical case for inverted
indexes. This paper shows that the compression ratio of Variable-Byte can be
improved by 2x by adopting a partitioned representation of the inverted lists.
This makes Variable-Byte surprisingly competitive in space with the best
bit-aligned encoders, hence disproving the folklore belief that Variable-Byte
is space-inefficient for inverted index compression. Despite the significant
space savings, we show that our optimization almost comes for free, given that:
we introduce an optimal partitioning algorithm that does not affect indexing
time because of its linear-time complexity; we show that the query processing
speed of Variable-Byte is preserved, with an extensive experimental analysis
and comparison with several other state-of-the-art encoders.Comment: Published in IEEE Transactions on Knowledge and Data Engineering
(TKDE), 15 April 201
Fast dictionary-based compression for inverted indexes
Dictionary-based compression schemes provide fast decoding operation, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity that these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off balance between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments using the document-level inverted index data for two large text collections, and a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed
Make Larger Vector Register Sizes New Challenges?: Lessons Learned from the Area of Vectorized Lightweight Compression Algorithms
The exploitation of data as well as hardware properties is a core aspect for efficient data management. This holds in particular for the field of in-memory data processing. Aside from increasing main memory capacities, in-memory data processing also benefits from novel processing concepts based on lightweight compressed data. To speed up compression as well as decompression, an active research field deals with the specialization of these algorithms to hardware features such as vectorization using SIMD instructions. Most of the vectorized implementations have been proposed for 128 bit vector registers. However, hardware vendors still increase the vector register sizes, whereby a straightforward transformation to these wider vector sizes is possible in most-cases. Thus, we systematically investigated the impact of different SIMD instruction set extensions with wider vector sizes on the behavior of straightforward transformed implementations. In this paper, we will describe our evaluation methodology and present selective results of our exhaustive evaluation. In particular, we will highlight some challenges and present first approaches to tackle them
- …