69 research outputs found
Good Predictions Are Worth a Few Comparisons
Most modern processors are heavily parallelized and use predictors to guess the outcome of conditional branches, in order to avoid costly stalls in their pipelines. We propose predictor-friendly versions of two classical algorithms: exponentiation by squaring and binary search in a sorted array. These variants result in less mispredictions on average, at the cost of an increased number of operations. These theoretical results are supported by experimentations that show that our algorithms perform significantly better than the standard ones, for primitive data types
Simple and Fast BlockQuicksort using Lomuto's Partitioning Scheme
This paper presents simple variants of the BlockQuicksort algorithm described
by Edelkamp and Weiss (ESA 2016). The simplification is achieved by using
Lomuto's partitioning scheme instead of Hoare's crossing pointer technique to
partition the input. To achieve a robust sorting algorithm that works well on
many different input types, the paper introduces a novel two-pivot variant of
Lomuto's partitioning scheme. A surprisingly simple twist to the generic
two-pivot quicksort approach makes the algorithm robust. The paper provides an
analysis of the theoretical properties of the proposed algorithms and compares
them to their competitors. The analysis shows that Lomuto-based approaches
incur a higher average sorting cost than the Hoare-based approach of
BlockQuicksort. Moreover, the analysis is particularly useful to reason about
pivot choices that suit the two-pivot approach. An extensive experimental study
shows that, despite their worse theoretical behavior, the simpler variants
perform as well as the original version of BlockQuicksort.Comment: Accepted at ALENEX 201
Analysis of Branch Misses in Quicksort
The analysis of algorithms mostly relies on counting classic elementary
operations like additions, multiplications, comparisons, swaps etc. This
approach is often sufficient to quantify an algorithm's efficiency. In some
cases, however, features of modern processor architectures like pipelined
execution and memory hierarchies have significant impact on running time and
need to be taken into account to get a reliable picture. One such example is
Quicksort: It has been demonstrated experimentally that under certain
conditions on the hardware the classically optimal balanced choice of the pivot
as median of a sample gets harmful. The reason lies in mispredicted branches
whose rollback costs become dominating.
In this paper, we give the first precise analytical investigation of the
influence of pipelining and the resulting branch mispredictions on the
efficiency of (classic) Quicksort and Yaroslavskiy's dual-pivot Quicksort as
implemented in Oracle's Java 7 library. For the latter it is still not fully
understood why experiments prove it 10% faster than a highly engineered
implementation of a classic single-pivot version. For different branch
prediction strategies, we give precise asymptotics for the expected number of
branch misses caused by the aforementioned Quicksort variants when their pivots
are chosen from a sample of the input. We conclude that the difference in
branch misses is too small to explain the superiority of the dual-pivot
algorithm.Comment: to be presented at ANALCO 201
Embedded Processor Selection/Performance Estimation using FPGA-based Profiling
In embedded systems, modeling the performance of the candidate processor architectures is very important to enable the designer to estimate the capability of each architecture against the target application. Considering the large number of available embedded processors, the need has increased for building an infrastructure by which it is possible to estimate the performance of a given application on a given processor with a minimum of time and resources. This dissertation presents a framework that employs the softcore MicroBlaze processor as a reference architecture where FPGA-based profiling is implemented to extract the functional statistics that characterize the target application. Linear regression analysis is implemented for mapping the functional statistics of the target application to the performance of the candidate processor architecture. Hence, this approach does not require running the target application on each candidate processor; instead, it is run only on the reference processor which allows testing many processor architectures in very short time
Recommended from our members
Analytical Query Execution Optimized for all Layers of Modern Hardware
Analytical database queries are at the core of business intelligence and decision support. To analyze the vast amounts of data available today, query execution needs to be orders of magnitude faster. Hardware advances have made a profound impact on database design and implementation. The large main memory capacity allows queries to execute exclusively in memory and shifts the bottleneck from disk access to memory bandwidth. In the new setting, to optimize query performance, databases must be aware of an unprecedented multitude of complicated hardware features. This thesis focuses on the design and implementation of highly efficient database systems by optimizing analytical query execution for all layers of modern hardware. The hardware layers include the network across multiple machines, main memory and the NUMA interconnection across multiple processors, the multiple levels of caches across multiple processor cores, and the execution pipeline within each core. For the network layer, we introduce a distributed join algorithm that minimizes the network traffic. For the memory hierarchy, we describe partitioning variants aware to the dynamics of the CPU caches and the NUMA interconnection. To improve the memory access rate of linear scans, we optimize lightweight compression variants and evaluate their trade-offs. To accelerate query execution within the core pipeline, we introduce advanced SIMD vectorization techniques generalizable across multiple operators. We evaluate our algorithms and techniques on both mainstream hardware and on many-integrated-core platforms, and combine our techniques in a new query engine design that can better utilize the features of many-core CPUs. In the era of hardware becoming increasingly parallel and datasets consistently growing in size, this thesis can serve as a compass for developing hardware-conscious databases with truly high-performance analytical query execution
On Optimally Partitioning Variable-Byte Codes
The ubiquitous Variable-Byte encoding is one of the fastest compressed
representation for integer sequences. However, its compression ratio is usually
not competitive with other more sophisticated encoders, especially when the
integers to be compressed are small that is the typical case for inverted
indexes. This paper shows that the compression ratio of Variable-Byte can be
improved by 2x by adopting a partitioned representation of the inverted lists.
This makes Variable-Byte surprisingly competitive in space with the best
bit-aligned encoders, hence disproving the folklore belief that Variable-Byte
is space-inefficient for inverted index compression. Despite the significant
space savings, we show that our optimization almost comes for free, given that:
we introduce an optimal partitioning algorithm that does not affect indexing
time because of its linear-time complexity; we show that the query processing
speed of Variable-Byte is preserved, with an extensive experimental analysis
and comparison with several other state-of-the-art encoders.Comment: Published in IEEE Transactions on Knowledge and Data Engineering
(TKDE), 15 April 201
Building Efficient Query Engines in a High-Level Language
Abstraction without regret refers to the vision of using high-level
programming languages for systems development without experiencing a negative
impact on performance. A database system designed according to this vision
offers both increased productivity and high performance, instead of sacrificing
the former for the latter as is the case with existing, monolithic
implementations that are hard to maintain and extend. In this article, we
realize this vision in the domain of analytical query processing. We present
LegoBase, a query engine written in the high-level language Scala. The key
technique to regain efficiency is to apply generative programming: LegoBase
performs source-to-source compilation and optimizes the entire query engine by
converting the high-level Scala code to specialized, low-level C code. We show
how generative programming allows to easily implement a wide spectrum of
optimizations, such as introducing data partitioning or switching from a row to
a column data layout, which are difficult to achieve with existing low-level
query compilers that handle only queries. We demonstrate that sufficiently
powerful abstractions are essential for dealing with the complexity of the
optimization effort, shielding developers from compiler internals and
decoupling individual optimizations from each other. We evaluate our approach
with the TPC-H benchmark and show that: (a) With all optimizations enabled,
LegoBase significantly outperforms a commercial database and an existing query
compiler. (b) Programmers need to provide just a few hundred lines of
high-level code for implementing the optimizations, instead of complicated
low-level code that is required by existing query compilation approaches. (c)
The compilation overhead is low compared to the overall execution time, thus
making our approach usable in practice for compiling query engines
Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools
This dissertation focuses on two fundamental sorting problems: string sorting
and suffix sorting. The first part considers parallel string sorting on
shared-memory multi-core machines, the second part external memory suffix
sorting using the induced sorting principle, and the third part distributed
external memory suffix sorting with a new distributed algorithmic big data
framework named Thrill.Comment: 396 pages, dissertation, Karlsruher Instituts f\"ur Technologie
(2018). arXiv admin note: text overlap with arXiv:1101.3448 by other author
- …