Decoding billions of integers per second through vectorization
In many important applications -- such as search engines and relational
database systems -- data is stored in the form of arrays of integers. Encoding
and, most importantly, decoding of these arrays consumes considerable CPU time.
Therefore, substantial effort has been made to reduce costs associated with
compression and decompression. In particular, researchers have exploited the
superscalar nature of modern processors and SIMD instructions. Nevertheless, we
introduce a novel vectorized scheme called SIMD-BP128 that improves over
previously proposed vectorized approaches. It is nearly twice as fast as the
previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the
same time, SIMD-BP128 saves up to 2 bits per integer. For even better
compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has
a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while
being two times faster during decoding.
Comment: For software, see https://github.com/lemire/FastPFor; for data, see http://boytsov.info/datasets/clueweb09gap
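The binary packing at the heart of schemes like SIMD-BP128 can be illustrated with a scalar sketch: choose a per-block bit width from the block's largest value and pack every integer at that width. This is a minimal Python illustration (the real scheme packs 128-integer blocks using SIMD lanes; all function names here are hypothetical):

```python
def bits_needed(values):
    """Smallest bit width that can represent every value in the block."""
    return max(v.bit_length() for v in values) or 1

def pack_block(values, bit_width):
    """Pack integers into one big int at a fixed bit width (scalar stand-in
    for the per-lane packing a vectorized scheme would do)."""
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * bit_width)
    return packed

def unpack_block(packed, bit_width, count):
    """Recover the original integers by masking out each fixed-width slot."""
    mask = (1 << bit_width) - 1
    return [(packed >> (i * bit_width)) & mask for i in range(count)]

block = [3, 7, 1, 6, 2, 5, 0, 4]
width = bits_needed(block)            # 3 bits suffice for values 0..7
packed = pack_block(block, width)
assert unpack_block(packed, width, len(block)) == block
```

In a real decoder the bit width is stored once per block, so a block of small deltas costs only a few bits per integer; this is where the "up to 2 bits per integer" savings come from.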
A General SIMD-based Approach to Accelerating Compression Algorithms
Compression algorithms are important for data-oriented tasks, especially in
the era of Big Data. Modern processors equipped with powerful SIMD instruction
sets provide an opportunity to achieve better compression performance.
Previous research has shown that SIMD-based optimizations can multiply decoding
speeds. Following these pioneering studies, we propose a general approach to
accelerate compression algorithms. By instantiating the approach, we have
developed several novel integer compression algorithms, called Group-Simple,
Group-Scheme, Group-AFOR, and Group-PFD, and implemented their corresponding
vectorized versions. We evaluate the proposed algorithms on two public TREC
datasets, a Wikipedia dataset and a Twitter dataset. With competitive
compression ratios and encoding speeds, our SIMD-based algorithms outperform
state-of-the-art non-vectorized algorithms with respect to decoding speeds.
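The group-based, byte-aligned encodings this family builds on can be sketched with a simplified Group Varint layout: one control byte stores four 2-bit byte lengths, followed by the variable-length bodies. This is an illustrative sketch, not the exact format of any of the schemes named above:

```python
def group_varint_encode(nums):
    """Encode groups of four uint32s: one control byte holding four 2-bit
    (length - 1) fields, then 1-4 little-endian bytes per integer."""
    assert len(nums) % 4 == 0
    out = bytearray()
    for i in range(0, len(nums), 4):
        group = nums[i:i + 4]
        lengths = [max(1, (n.bit_length() + 7) // 8) for n in group]
        control = 0
        for j, length in enumerate(lengths):
            control |= (length - 1) << (2 * j)
        out.append(control)
        for n, length in zip(group, lengths):
            out += n.to_bytes(length, "little")
    return bytes(out)

def group_varint_decode(data, count):
    """Decode by reading each control byte, then the bodies it describes."""
    nums, pos = [], 0
    while len(nums) < count:
        control, pos = data[pos], pos + 1
        for j in range(4):
            length = ((control >> (2 * j)) & 0b11) + 1
            nums.append(int.from_bytes(data[pos:pos + length], "little"))
            pos += length
    return nums

values = [5, 300, 70000, 1, 2, 2**31, 9, 65535]
blob = group_varint_encode(values)
assert group_varint_decode(blob, len(values)) == values
```

Because all lengths for a group are known from a single control byte, a decoder (vectorized or not) avoids the per-byte branch of classic varint, which is the main source of the decoding speedups these papers report.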
LeCo: Lightweight Compression via Learning Serial Correlations
Lightweight data compression is a key technique that allows column stores to
exhibit superior performance for analytical queries. Despite a comprehensive
study on dictionary-based encodings to approach Shannon's entropy, few prior
works have systematically exploited the serial correlation in a column for
compression. In this paper, we propose LeCo (i.e., Learned Compression), a
framework that uses machine learning to remove the serial redundancy in a value
sequence automatically to achieve an outstanding compression ratio and
decompression performance simultaneously. LeCo presents a general approach to
this end, making existing (ad-hoc) algorithms such as Frame-of-Reference (FOR),
Delta Encoding, and Run-Length Encoding (RLE) special cases under our
framework. Our microbenchmark with three synthetic and six real-world data sets
shows that a prototype of LeCo achieves a Pareto improvement on both
compression ratio and random access speed over the existing solutions. When
integrating LeCo into widely used applications, we observe up to a 3.9x speedup
in filter-scanning a Parquet file and a 16% increase in RocksDB's throughput.
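The core idea of learned compression, fitting a model to the value sequence and storing only the small residuals, can be sketched as follows. A least-squares line stands in for LeCo's learned models here; all names are illustrative:

```python
def leco_style_encode(seq):
    """Fit a line to the sequence and keep only the prediction residuals
    (illustrative stand-in for a learned model)."""
    n = len(seq)
    xs = range(n)
    mean_x, mean_y = (n - 1) / 2, sum(seq) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, seq)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    residuals = [y - round(slope * x + intercept) for x, y in zip(xs, seq)]
    return slope, intercept, residuals

def leco_style_decode(slope, intercept, residuals):
    """Rebuild each value from the model prediction plus its residual."""
    return [round(slope * x + intercept) + r for x, r in enumerate(residuals)]

# A nearly linear sequence (e.g. sorted keys) leaves only tiny residuals.
seq = [100, 203, 298, 405, 500, 601, 699, 802]
s, c, res = leco_style_encode(seq)
assert leco_style_decode(s, c, res) == seq
assert max(abs(r) for r in res) < 8   # residuals need far fewer bits than values
```

Storing the model parameters once plus bit-packed residuals is also what makes random access cheap: any element can be reconstructed from its index and residual alone, without decoding a prefix.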
Compression methods in a distributed time series database
The rise of microservices and distributed applications in containerized deployments is placing an increasing burden on monitoring systems, pushing up storage requirements to provide suitable performance for large queries.
In this paper we present the changes we made to our distributed time series database, Hawkular-Metrics, so that it stores data more effectively in Cassandra. We show that our methods provide significant space savings, ranging from a 50% to 90% reduction in storage usage, while reducing query times by over 90% compared to the nominal approach when using Cassandra.
We also present our own algorithm, modified from the Gorilla compression algorithm, which we use in our solution and which provides almost three times the compression throughput at an equal compression ratio.
The spread of distributed systems has increased the amount of data in monitoring systems, as the number of time series has grown and data is stored into them more often. This has placed a growing load on disk systems, which struggle to serve the growing queries.
In this paper we present changes to our distributed time series database, Hawkular-Metrics, exploiting more efficient data compression and layout when storing data to Cassandra. We sped up queries almost tenfold and at the same time reduced disk space requirements by 50-95%, depending on the dataset.
We also present our modifications to the Gorilla compression algorithm, which we exploit to achieve these results. Our modifications speed up compression almost threefold compared to the original algorithm, with no loss of compression ratio.
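The XOR-based value compression that Gorilla popularized, and which the modified algorithm above builds on, works roughly as follows. This is a simplified word-level sketch in Python, without Gorilla's leading/trailing-zero bit encoding; all names are illustrative:

```python
import struct

def float_to_bits(f):
    """Reinterpret a double as its 64-bit integer representation."""
    return struct.unpack(">Q", struct.pack(">d", f))[0]

def bits_to_float(b):
    """Inverse of float_to_bits."""
    return struct.unpack(">d", struct.pack(">Q", b))[0]

def xor_compress(values):
    """Store the first value raw, then each value XORed with its
    predecessor; repeated readings become all-zero words."""
    prev = float_to_bits(values[0])
    out = [prev]
    for v in values[1:]:
        cur = float_to_bits(v)
        out.append(prev ^ cur)   # mostly-zero for slowly changing series
        prev = cur
    return out

def xor_decompress(words):
    """Rebuild the series by XOR-accumulating the stored words."""
    prev = words[0]
    vals = [bits_to_float(prev)]
    for w in words[1:]:
        prev ^= w
        vals.append(bits_to_float(prev))
    return vals

series = [21.0, 21.0, 21.5, 22.0, 22.0]
words = xor_compress(series)
assert words[1] == 0                 # a repeated value XORs to zero
assert xor_decompress(words) == series
```

The real algorithm then entropy-codes each XOR word (one bit for zero, otherwise the position and payload of its meaningful bits), which is where most of the space savings for slowly changing metrics come from.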
Managing tail latency in large scale information retrieval systems
As both the availability of internet access and the prominence of smart devices continue to increase, data is being generated at a rate faster than ever before. This massive increase in data production comes with many challenges, including efficiency concerns for the storage and retrieval of such large-scale data. However, users have grown to expect the sub-second response times that are common in most modern search engines, creating a problem - how can such large amounts of data continue to be served efficiently enough to satisfy end users? This dissertation investigates several issues regarding tail latency in large-scale information retrieval systems. Tail latency corresponds to the high percentile latency that is observed from a system - in the case of search, this latency typically corresponds to how long it takes for a query to be processed. In particular, keeping tail latency as low as possible translates to a good experience for all users, as tail latency is directly related to the worst-case latency and hence, the worst possible user experience. The key idea in targeting tail latency is to move from questions such as "what is the median latency of our search engine?" to questions which more accurately capture user experience such as "how many queries take more than 200ms to return answers?" or "what is the worst case latency that a user may be subject to, and how often might it occur?" While various strategies exist for efficiently processing queries over large textual corpora, prior research has focused almost entirely on improvements to the average processing time or cost of search systems. As a first contribution, we examine some state-of-the-art retrieval algorithms for two popular index organizations, and discuss the trade-offs between them, paying special attention to the notion of tail latency. This research uncovers a number of observations that are subsequently leveraged for improved search efficiency and effectiveness. 
We then propose and solve a new problem, which involves processing a number of related queries together, known as multi-queries, to yield higher quality search results. We experiment with a number of algorithmic approaches to efficiently process these multi-queries, and report on the cost, efficiency, and effectiveness trade-offs present with each. Ultimately, we find that some solutions yield a low tail latency, and are hence suitable for use in real-time search environments. Finally, we examine how predictive models can be used to improve the tail latency and end-to-end cost of a commonly used multi-stage retrieval architecture without impacting result effectiveness. By combining ideas from numerous areas of information retrieval, we propose a prediction framework which can be used for training and evaluating several efficiency/effectiveness trade-off parameters, resulting in improved trade-offs between cost, result quality, and tail latency.
Compression algorithms for biomedical signals and nanopore sequencing data
The massive generation of biological digital information creates various computing
challenges such as its storage and transmission. For example, biomedical
signals, such as electroencephalograms (EEG), are recorded by multiple sensors over
long periods of time, resulting in large volumes of data. Another example is genome
DNA sequencing data, where the amount of data generated globally is seeing explosive
growth, leading to increasing needs for processing, storage, and transmission
resources. In this thesis we investigate the use of data compression techniques for
this problem, in two different scenarios where computational efficiency is crucial.
First we study the compression of multi-channel biomedical signals. We present
a new lossless data compressor for multi-channel signals, GSC, which achieves compression
performance similar to the state of the art, while being more computationally
efficient than other available alternatives. The compressor uses two novel
integer-based implementations of the predictive coding and expert advice schemes
for multi-channel signals. We also develop a version of GSC optimized for EEG
data. This version manages to significantly lower compression times while attaining
similar compression performance for that specific type of signal.
In a second scenario we study the compression of DNA sequencing data produced
by nanopore sequencing technologies. We present two novel lossless compression algorithms
specifically tailored to nanopore FASTQ files. ENANO is a reference-free
compressor, which mainly focuses on the compression of quality scores. It achieves
state of the art compression performance, while being fast and with low memory
consumption when compared to other popular FASTQ compression tools. On the
other hand, RENANO is a reference-based compressor, which improves on ENANO,
by providing a more efficient base call sequence compression component. For RENANO
two algorithms are introduced, corresponding to the following scenarios: a
reference genome is available without cost to both the compressor and the decompressor;
and the reference genome is available only on the compressor side, and a
compacted version of the reference is included in the compressed file. Both algorithms
of RENANO significantly improve the compression performance of ENANO,
with similar compression times, and higher memory requirements.
The massive generation of biological digital information gives rise to multiple computing challenges, such as its storage and transmission. For example, biomedical signals, such as electroencephalograms (EEG), are generated by multiple sensors recording measurements simultaneously over long periods of time, producing large volumes of data. Another example is DNA sequencing data, where the amount of data worldwide is growing explosively, leading to a great need for processing, storage, and transmission resources. In this thesis we investigate how to apply data compression techniques to attack this problem, in two different scenarios where computational efficiency plays an important role.
First we study the compression of multi-channel biomedical signals. We begin by presenting a new lossless data compressor for multi-channel signals, GSC, which achieves state-of-the-art compression levels while at the same time being more computationally efficient than other available alternatives. The compressor uses two new integer-arithmetic implementations of the predictive coding and expert advice schemes for multi-channel signals. We also present a version of GSC optimized for EEG data. This version manages to significantly reduce compression times without significantly degrading compression levels for EEG data.
In a second scenario we study the compression of DNA sequencing data generated by nanopore sequencing technologies. In this regard, we present two new lossless compression algorithms specifically designed for FASTQ files generated by nanopore technology. ENANO is a reference-free compressor, focused mainly on the compression of the base quality scores. ENANO achieves state-of-the-art compression levels while at the same time being more computationally efficient than other popular FASTQ compression tools. On the other hand, RENANO is a reference-based compressor that improves on the performance of ENANO through a new compression scheme for the base sequences. We present two variants of RENANO, corresponding to the following scenarios: (i) a reference genome is available to both the compressor and the decompressor, and (ii) a reference genome is available only on the compressor side, and a compact version of the reference is included in the compressed file. Both variants of RENANO significantly improve on the compression levels of ENANO, achieving similar compression times and higher memory consumption.
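The predictive coding idea underlying compressors like GSC can be sketched with a minimal single-channel illustration: predict each sample as the previous one and store only the (typically small) residuals. This is not GSC's actual multi-channel scheme; all names are illustrative:

```python
def predictive_encode(samples):
    """Order-1 predictive coding: residual = sample - previous sample."""
    residuals, prev = [], 0
    for s in samples:
        residuals.append(s - prev)
        prev = s
    return residuals

def predictive_decode(residuals):
    """Invert the predictor by accumulating residuals."""
    samples, prev = [], 0
    for r in residuals:
        prev += r
        samples.append(prev)
    return samples

signal = [1000, 1002, 1001, 1005, 1004, 1006]
res = predictive_encode(signal)
assert predictive_decode(res) == signal
assert max(abs(r) for r in res[1:]) <= 4   # smooth signals leave tiny residuals
```

An entropy coder applied to the residuals then spends few bits on each small value; richer predictors (or a mixture of predictors chosen by expert advice, as in GSC) shrink the residuals further.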
Efficient query processing for scalable web search
Search engines are exceptionally important tools for accessing information in today’s world. In satisfying the information needs of millions of users, the effectiveness (the quality of the search results) and the efficiency (the speed at which the results are returned to the users) of a search engine are two goals that form a natural trade-off, as techniques that improve the effectiveness of the search engine can also make it less efficient. Meanwhile, search engines continue to rapidly evolve, with larger indexes, more complex retrieval strategies and growing query volumes. Hence, there is a need for the development of efficient query processing infrastructures that make appropriate sacrifices in effectiveness in order to make gains in efficiency. This survey comprehensively reviews the foundations of search engines, from index layouts to basic term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing strategies, while also providing the latest trends in the literature in efficient query processing, including the coherent and systematic reviews of techniques such as dynamic pruning and impact-sorted posting lists as well as their variants and optimisations. Our explanations of query processing strategies, for instance the WAND and BMW dynamic pruning algorithms, are presented with illustrative figures showing how the processing state changes as the algorithms progress. Moreover, acknowledging the recent trends in applying a cascading infrastructure within search systems, this survey describes techniques for efficiently integrating effective learned models, such as those obtained from learning-to-rank techniques. The survey also covers the selective application of query processing techniques, often achieved by predicting the response times of the search engine (known as query efficiency prediction), and making per-query tradeoffs between efficiency and effectiveness to ensure that the required retrieval speed targets can be met. 
Finally, the survey concludes with a summary of open directions in efficient search infrastructures, namely the use of signatures, real-time search, energy efficiency, and modern hardware and software architectures.
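The upper-bound pruning idea behind dynamic pruning algorithms such as WAND can be sketched as follows: score a candidate document only when the sum of its terms' precomputed maximum scores can beat the current top-k threshold. This is a simplified illustration, not the full pointer-movement algorithm; the data layout and names are assumptions:

```python
import heapq

def wand_topk(postings, upper_bounds, k):
    """Simplified WAND-style top-k: skip any document whose summed
    per-term score upper bounds cannot exceed the heap threshold.
    `postings` maps term -> {docid: score}."""
    docs = sorted({d for plist in postings.values() for d in plist})
    heap, threshold = [], 0.0
    for d in docs:
        # Upper-bound this document's score from per-term maxima.
        ub = sum(upper_bounds[t] for t, plist in postings.items() if d in plist)
        if ub <= threshold:
            continue                      # pruned without full scoring
        score = sum(plist.get(d, 0.0) for plist in postings.values())
        if len(heap) < k:
            heapq.heappush(heap, score)
        elif score > heap[0]:
            heapq.heapreplace(heap, score)
        if len(heap) == k:
            threshold = heap[0]           # k-th best score seen so far
    return sorted(heap, reverse=True)

postings = {
    "web":    {1: 1.25, 3: 0.5, 7: 2.0},
    "search": {1: 0.75, 2: 1.5, 7: 1.75},
}
upper_bounds = {t: max(plist.values()) for t, plist in postings.items()}
assert wand_topk(postings, upper_bounds, 2) == [3.75, 2.0]
```

The production algorithms operate on sorted posting-list cursors and pivot selection rather than a global document scan, but the invariant is the same: documents whose score upper bound cannot enter the top-k are never fully evaluated, which is precisely what reduces tail latencies.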
Analytical Query Processing Using Heterogeneous SIMD Instruction Sets
Numerous applications gather increasing amounts of data, which have to be managed and queried. Different hardware developments help to meet this challenge. The growing capacity of main memory enables database systems to keep all their data in memory. Additionally, the hardware landscape is becoming more diverse. A plethora of homogeneous and heterogeneous co-processors is available, where heterogeneity refers not only to different computing power, but also to different instruction set architectures. For instance, modern Intel® CPUs offer different instruction sets supporting the Single Instruction Multiple Data (SIMD) paradigm, e.g. SSE, AVX, and AVX-512.
Database systems have started to exploit SIMD to increase performance. However, this is still a challenging task, because existing algorithms were mainly developed for scalar processing and because there is a huge variety of different instruction sets, which were never standardized and have no unified interface. Porting a system to another hardware architecture therefore requires completely rewriting the source code, even if those architectures are not fundamentally different and are designed by the same company. Moreover, operations on large registers, which are the core principle of SIMD processing, behave counter-intuitively in several cases. This is especially true for analytical query processing, where different memory access patterns and data dependencies caused by the compression of data challenge the limits of the SIMD principle. Finally, there are physical constraints on the use of such instructions affecting CPU frequency scaling, which is further influenced by the use of multiple cores. This is because the supply power of a CPU is limited, such that not all transistors can be powered at the same time. Hence, there is a complex relationship between performance and power, and therefore also between performance and energy consumption.
This thesis addresses the specific challenges introduced by the application of SIMD in general, and by the heterogeneity of SIMD ISAs in particular. Hence, the goal of this thesis is to exploit the potential of heterogeneous SIMD ISAs to increase both performance and energy efficiency.