Hardware-conscious Query Processing in GPU-accelerated Analytical Engines by Chrysogelos, Periklis et al.
[git] • Branch: master @ 3821b16 • Release: (2019-01-25)
Hardware-conscious Query Processing in
GPU-accelerated Analytical Engines
Periklis Chrysogelos* Panagiotis Sioulas* Anastasia Ailamaki*‡
*École Polytechnique Fédérale de Lausanne ‡RAW Labs SA
firstname.lastname@epfl.ch
ABSTRACT
Over the last years, modern servers have been adopting hardware
accelerators, such as GPUs, in order to improve their power effi-
ciency and computational capacity. Modern analytical query pro-
cessing engines have been highly optimized for multi-core multi-
CPU query execution, but lack the necessary abstractions to sup-
port concurrent hardware-conscious query execution over multiple
heterogeneous devices and exploit the available accelerators.
This work presents a Heterogeneity-conscious Analytical query
Processing Engine (HAPE), a blueprint for hardware-conscious an-
alytical engines for efficient and concurrent multi-CPU multi-GPU
query execution. HAPE decomposes query execution on hetero-
geneous hardware into: 1) efficient single-device and 2) concur-
rent multi-device query execution. It uses hardware-conscious al-
gorithms designed for single-device execution and combines them
into efficient intra-device hardware-conscious execution modules,
via code generation. HAPE combines these modules to achieve
multi-device execution by handling data and control transfers.
We validate our design by building a prototype and evaluate its
performance using radix-join co-processing and the TPC-H bench-
mark. We show that it achieves up to 10x and 3.5x speed-up on the
radix-join against CPU and GPU alternatives and 1.6x-8x against
state-of-the-art CPU- and GPU-based commercial DBMSs on the
selected TPC-H queries.
1. INTRODUCTION
Traditionally, analytical query engines have relied on the expo-
nential increase of CPU performance in order to keep up with the
data growth, which is also exponential. Initially, CPUs relied on
Dennard scaling, improving their performance by increasing their
clock frequency. However, after 2005, this was no longer feasible
due to the power wall. As a response, CPU vendors started in-
creasing the core count, which signaled the beginning of the multi-
core era. Now, due to the power wall, the power inefficiency of
general-purpose hardware is causing modern servers to change.
The increased performance per watt of specialized hardware, such
as GPUs, has resulted in their adoption in emerging servers, which
can be seen by the almost linear increase over the past decade of
accelerator-enabled servers in the TOP500 list. In addition, architects
explore designs that go beyond the classical system-wide
cache-coherence in favor of increased core scalability.
This article is published under a Creative Commons Attribution License
(http://creativecommons.org/licenses/by/3.0/), which permits distribution
and reproduction in any medium as well as allowing derivative works,
provided that you attribute the original work to the author(s) and CIDR 2019.
9th Biennial Conference on Innovative Data Systems Research (CIDR ’19)
January 13-16, 2019, Asilomar, California, USA.
In order for analytical query engines to scale over time with
hardware improvements, they have to efficiently use the hetero-
geneous hardware of emerging servers. On the CPU front, state-
of-the-art engines are using algorithms [28, 29, 6, 26] that match
the CPU micro-architecture. Techniques like vector-at-a-time ex-
ecution [7] and just-in-time code generation [20, 19] are used to
reduce the query execution overheads, while the Exchange [12] op-
erator and HyPer’s Morsels [21] are used to parallelize query ex-
ecution in multi-core and multi-CPU configurations. On the GPU
front, recent work has explored optimized algorithms for GPU ex-
ecution [17, 27, 14, 18, 30] as well as GPU query execution mod-
els [32, 13, 23, 8]. The majority of these works do not consider
query execution over heterogeneous devices, for example multiple
GPUs, and many of them ignore the processing power available in
the server’s CPUs. Works that support both use a high-level framework
and/or hardware-oblivious algorithms and thus achieve suboptimal
per-device execution. Lastly, works that support heterogeneous
hardware either consider only a single device type per query [24],
due to the lack of abstractions and algorithms for multi-device
execution, or rely on wasteful full materialization [32, 15, 8].
In this work, we describe a new analytical engine design for ef-
ficient analytical query execution on a heterogeneous multi-CPU
multi-GPU server node that combines hardware-conscious algo-
rithms with efficient intra- and inter-device execution models.
Contributions. The contributions of this work are the following:
• We make the case for heterogeneity- and hardware-conscious
analytical engines and present HAPE, an engine design for
concurrent execution on heterogeneous hardware.
• We show that decoupling inter- from intra-device operator
design can decrease the design space as well as achieve state-
of-the-art performance in each device and allow scaling ex-
isting algorithms to heterogeneous hardware.
• We evaluate our design by extending Proteus [19, 10] with a
GPU join [30] to show the importance of hardware-conscious
algorithms during hybrid execution. Our engine achieves
up to 10x and 3.5x speed-up on equi-joins and 1.6x-8x speed-up on
TPC-H queries, against state-of-the-art CPU and GPU DBMSs.
Our design allows combining hardware-conscious device-specific
algorithms to achieve efficient execution across all the compute
units of a multi-CPU, multi-GPU server. HAPE achieves near-optimal
co-processing performance by combining algorithms optimized
for homogeneous hardware, effectively avoiding the development
cost of algorithms specialized for heterogeneous hardware.
2. BACKGROUND
In this section, we discuss hardware-conscious operator algo-
rithms and parallel execution of query plans. In the rest of the paper
we will use these components as our building blocks for the hetero-
geneous hardware-conscious analytical engines.
2.1 Hardware-conscious Operators
While hardware-oblivious algorithms simplify the optimization
process and the execution over heterogeneous hardware, tuning al-
gorithms for the underlying hardware can produce significant per-
formance benefits. For modern CPUs, most previous studies take
three architectural characteristics into account: cache hierarchy,
TLBs and SIMD instructions. These dimensions are analyzed in
conjunction with the available memory bandwidth and latency.
Prior work has introduced hardware-conscious variants of several
operators, including scan-like operators, sort-based operations
and index scans [33, 25, 16]. As a heavyweight operator, the join
has been studied and tuned extensively for modern CPUs, result-
ing in multiple variants of the radix hash-join [29, 6, 3, 2, 28].
Specifically, Shatdal et al. [29] proposed a cache-conscious vari-
ant that introduces a partitioning step. The two input tables are
co-partitioned such that for each partition pair the hash table fits in
cache. Then, all hash-table accesses during the probing phase are
in cache and cache misses are averted. Boncz et al. [6] observed
that for high number of output partitions the performance is im-
pacted by TLB misses. As a solution, they advocate for the use of
multiple partitioning passes, each producing a smaller number of
partitions, reducing TLB misses at the expense of extra passes over
the input. Schuh et al. [28] argue that the common denominator is
that these works try to minimize the effects of random memory accesses
by minimizing cache and TLB misses. Still, Blanas et al. [5]
argue in favor of hardware-oblivious hash-joins, as they require
less parameter tuning and can outperform hardware-conscious
implementations in some scenarios.
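The partitioning idea above can be illustrated with a minimal sketch. It is not the implementation of any of the cited works: the function names, the use of integer keys as their own hash and the default of two 2-bit passes are assumptions made for illustration. Several low-fanout passes stand in for the multi-pass scheme, and each co-partition pair is joined with a small hash table.

```python
from collections import defaultdict

def radix_partition(tuples, key, bits, shift):
    """One partitioning pass: split on `bits` bits of the key, starting at bit `shift`."""
    fanout = 1 << bits
    parts = [[] for _ in range(fanout)]
    for t in tuples:
        parts[(key(t) >> shift) & (fanout - 1)].append(t)
    return parts

def multipass_partition(tuples, key, bits_per_pass):
    """Apply several low-fanout passes instead of a single high-fanout pass."""
    parts, shift = [list(tuples)], 0
    for bits in bits_per_pass:
        parts = [sub for p in parts for sub in radix_partition(p, key, bits, shift)]
        shift += bits
    return parts

def radix_join(r, s, bits_per_pass=(2, 2)):
    """Co-partition both inputs, then build and probe one small table per pair."""
    key = lambda t: t[0]  # assumption: tuples are (integer key, payload)
    out = []
    for rp, sp in zip(multipass_partition(r, key, bits_per_pass),
                      multipass_partition(s, key, bits_per_pass)):
        table = defaultdict(list)
        for k, payload in rp:      # build phase on the R partition
            table[k].append(payload)
        for k, payload in sp:      # probe phase with the matching S partition
            out.extend((k, rv, payload) for rv in table.get(k, ()))
    return out
```

Because both inputs go through the same sequence of passes, partitions at the same index hold the same key ranges, so each pair can be joined independently.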
In contrast to CPUs, modern GPUs have a significantly different
micro-architecture, including for all three of the aforementioned
characteristics. First of all, GPUs depart from the linear memory
hierarchy of CPUs and adopt a fatter cache hierarchy, with a
hardware-managed L1 cache, an L1-like software-managed scratchpad
called shared memory, and other more specialized caches,
like a constant cache. In addition, GPUs target different workloads
and thus size their caches and TLBs differently from CPUs. Karnagel
et al. [18] experimentally showed that GPU TLBs have 2MB pages
to support the high number of threads and pack more addressable
space per TLB entry. Finally, in the GPU SIMT model, each GPU
thread has an independent register file but, in contrast with the
SIMD model, thread divergence is handled in hardware. As with
CPUs, hardware-conscious algorithms that consider the GPU hardware
improve performance. Karnagel et al. [18] take the TLBs into
consideration in order to improve hash-based group-by operations,
while partitioned hash-join [27, 17] implementations use
shared memory to store histograms and per-partition hash-tables.
A limiting factor for GPU algorithms is GPU memory size. Prior
works make simplifying assumptions about the types of workloads
handled; [27] only addresses the case in which at least one of the
tables fits in GPU memory. Kaldewey et al. [17] use Unified Virtual
Addressing (UVA) to join arbitrarily large data by accessing data
over the interconnect. Still, interconnect bandwidth is an order of
magnitude lower than GPU memory bandwidth and this greatly
impacts multi-pass algorithms such as radix joins.
Inter-device co-processing can reduce unnecessary interconnect
traffic. Stehle and Jacobsen [31] present an efficient sorting algorithm
that consists of two steps: generating sorted runs on the GPU and
merging them on the CPU. Merging on the CPU side allows for a single
pass, per direction, over the scarcest resource, the interconnect.
Sioulas et al. [30] exploited the CPU memory bandwidth to partition
the inputs of a partition-based hash-join before sending them
to the GPU. The initial partitioning breaks down big relations into
partitions that fit in the GPU memory, while its small fan-out allows
for a high throughput on the CPU side. On the GPU side, they further
partition the inputs to fit the final partitions in the scratchpad and
minimize the effect of random accesses.
Figure 1: CPU and GPU hierarchy of data caches. (CPU side: register
file, L1, L2, L3, CPU DRAM. GPU side: register file, constant cache,
texture cache, shared memory, L1, L2, GPU SDRAM. The two sides are
connected over PCIe.)
In Section 4.1, we use this join as a representative example to
discuss a hash-join optimized for GPU hardware with respect to
the memory hierarchy and compare it with a hardware-oblivious
GPU implementation as well as CPU algorithms. In Section 5 we
show how their out-of-GPU execution strategy can be generalized
in order to mix different intra-device algorithms to attain efficient
multi-device execution.
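To make the two-level partitioning concrete, the following back-of-the-envelope sketch estimates how many partitions and GPU passes such a scheme would need. It is an illustration under strong assumptions, not the sizing logic of [30]: it considers only one input, assumes uniformly distributed keys, ignores hash-table overheads, and all names and sizes are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def partitioning_plan(input_bytes, gpu_mem_bytes, scratchpad_bytes, gpu_fanout_per_pass):
    """Return (CPU-side partitions, GPU-side partitions per CPU partition, GPU passes)."""
    # CPU side: split the input until each partition fits in GPU memory.
    cpu_partitions = ceil_div(input_bytes, gpu_mem_bytes)
    per_cpu_partition = ceil_div(input_bytes, cpu_partitions)
    # GPU side: split each partition until it fits in the scratchpad.
    gpu_partitions = ceil_div(per_cpu_partition, scratchpad_bytes)
    # Each GPU pass multiplies the number of partitions by the per-pass fanout.
    passes, reach = 0, 1
    while reach < gpu_partitions:
        reach *= gpu_fanout_per_pass
        passes += 1
    return cpu_partitions, gpu_partitions, passes
```

For a hypothetical 64 GiB input, 8 GiB of GPU memory, a 32 KiB scratchpad and a per-pass fanout of 256, this yields 8 CPU-side partitions, 262144 scratchpad-sized partitions per CPU partition and 3 GPU passes.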
2.2 Query execution models
In-memory analytical query execution engines traditionally used
either a tuple-at-a-time or an operator-at-a-time execution model
and thus suffered from high interpretation overheads or material-
ization costs, respectively. To amortize these costs, vector-at-a-
time [7] execution and just-in-time (JIT) code generation [20] en-
gines emerged. The vector-at-a-time model communicates a block
of data at a time between operators and is based on the trade-off
between interpretation and materialization costs. This model is
usually coupled with using vectorized code (SIMD instructions)
and tuned for cache locality. JIT-based engines generate specialized
code for each query, consisting of tight loops. Intermediate
results are passed across operators via the processor’s registers until
an operator forces a materialization point. Unlike previous models,
overheads are less dependent on the size of intermediate results.
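The difference between these models can be sketched on a toy query that sums 2*x over all x greater than a threshold. This is an illustrative sketch only: Python generators stand in for per-tuple operator calls, list comprehensions stand in for vectorized primitives, and the hand-written loop stands in for what a JIT engine would generate. The function names and block size are assumptions.

```python
def tuple_at_a_time(data, threshold):
    """Each operator pulls one tuple at a time from its child."""
    scan = iter(data)
    selected = (x for x in scan if x > threshold)   # selection operator
    projected = (2 * x for x in selected)           # projection operator
    return sum(projected)                           # aggregation operator

def vector_at_a_time(data, threshold, block=1024):
    """Operators exchange blocks, materializing one small vector per step."""
    total = 0
    for i in range(0, len(data), block):
        vec = data[i:i + block]
        vec = [x for x in vec if x > threshold]     # selection primitive
        vec = [2 * x for x in vec]                  # projection primitive
        total += sum(vec)                           # aggregation primitive
    return total

def fused_loop(data, threshold):
    """Shape of JIT-generated code: one tight loop, no intermediate vectors."""
    total = 0
    for x in data:
        if x > threshold:
            total += 2 * x
    return total
```

All three variants compute the same result; they differ in how many calls are made per tuple and in how much intermediate data is materialized between operators.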
GPU analytical query execution has similar challenges and techniques.
Several GPU systems have used the operator-at-a-time execution
model [32, 15, 8]. This model is restricted by the GPU
memory size and thus is often combined with transferring intermediate
results to CPU memory [15, 8]. However, the latter causes
excessive interconnect traffic, as all the results have to pass over the
interconnect. In order to reduce the materialization overhead, Paul
et al. [23] pipeline data between operators running as separate kernels
through OpenCL’s communication channels. HorseQC [11]
uses a block-at-a-time approach and materializes intermediate results
in GPU memory to significantly reduce the execution time.
In addition, HorseQC and MapD [22] use just-in-time code generation
to fuse multiple operators in a single kernel to reduce result
materialization and the number of required passes.
The emergence of systems with multiple processors has motivated
parallel query execution. On the one hand, the Exchange operator [12]
has been used to encapsulate parallelism and allow parallel
execution using the existing, single-threaded, operators. On
the other hand, HyPer [21] exposes the operators to parallelism,
propagating the responsibility of maintaining shared data structures
to the operators; for example, its hash-join has to guarantee that the
hash-table is correctly built using multiple threads. In the heterogeneous
context, Voodoo [24] allowed query execution on CPUs and
GPUs in MonetDB, but without support for concurrent CPU-GPU
execution, load balancing or data structures. Similarly, TVM [9]
focused on deep learning workloads and targeted multiple types of
devices but considered execution on a single device at a time.
The architecture of modern servers introduces new challenges
for targeting multiple types of devices at the same time. Both
the Exchange and HyPer’s approach rely on low-latency system-wide
cache-coherent memory for synchronization and atomic primitives
as well as shared data structures, which is generally lacking
in heterogeneous servers. In addition, different devices may
have different access rights for different regions of the aggregate
memory of the system, based on the system topology as well as
the type of devices. To avoid complicating the relational operators
and to increase the applicability of our design to future architectures,
HAPE decouples the development of relational operators
from the complexities of heterogeneous servers. Our parallelization
strategy builds upon the ideas of HetExchange [10], a
framework that allows multi-CPU, multi-GPU query execution by
encapsulating the heterogeneous parallelism of the server. While
HetExchange provides a framework for hardware-oblivious operators,
HAPE provides server-wide hardware-conscious execution by
composing per-device hardware-conscious algorithms.
TVM automated the optimization of low-level programs to dif-
ferent hardware via an iterative process: a scheduler proposes opti-
mized versions of the input program and the measured performance
is used to refine a machine learning model that predicts the perfor-
mance of the device. TVM can be incorporated in our system to
tune the query optimizer as well as the compiler optimizations used
by the different device back-ends.
In addition, different devices are fit for different workloads and
can be leveraged synergistically. Appuswamy et al. [1] propose the
archipelago abstraction which encapsulates a set of devices and a
target workload as a means to partition resources per functionality.
Our work focuses on the design of a hardware-conscious analytical
multi-CPU, multi-GPU archipelago.
3. THE CASE FOR HAPE
Heterogeneity-conscious Analytical query Processing Engines
(HAPE) allow DBMSs to take advantage of heterogeneous hard-
ware present in modern servers by 1) encapsulating heterogene-
ity and multi-device parallelism, 2) providing a unified execution
model and 3) embracing single-device hardware-conscious opera-
tors. These operators are then composed together to provide server-
wide hardware-conscious execution, while the encapsulation han-
dles their communication and synchronization.
Decoupling heterogeneity from execution. HAPE exploits the
observation that by encapsulating inter-device functionality, the
remaining system is composed of single-device, and thus homogeneous,
subsystems. HAPE minimizes the effect of heterogeneity
and allows the rest of the system to be built by combining existing
work on homogeneous systems. Operators specialized to each
micro-architecture may be used for each type of device. Just-in-time
code generation provides a unified interface that allows operators
to be used on multiple device types. In addition, JIT allows
the execution model to be adapted to each device, providing enough
flexibility for efficient inter-operator execution. HAPE encapsulates
the heterogeneity by handling execution and data transfers between
devices using the HetExchange operators.
Figure 2: HAPE architecture. (Labelled components: Query Optimization
with a Het-agnostic QO and a Het-aware QO; Heterogeneity-aware Plan;
Code Generation with Hardware-conscious Operators and CPU and GPU
backends; Specialized Binary Code; Runtime with a Static Resource
Allocator, a Dynamic Resource Allocator, CPUs, GPUs and their memories.)
Traits in heterogeneous systems. In a heterogeneous server
there are four simple traits [10] that characterize execution: target
devices, parallelism, data locality and data packing. The first two
traits concern the flow of execution, or control flow, inside the het-
erogeneous system. More specifically, for each operation, the first
one defines the execution device type, while the second one defines
the number of concurrently used devices. The last two traits con-
cern the data flow in the system. Data locality is concerned with the
distance of the data from their consumer. Transition between dif-
ferent values of any of the control-flow traits requires inter-thread
or inter-device task assignment, while increasing data locality
requires data transfers. All these operations are usually costly, thus it
is common practice to amortize their overhead by performing them
at the granularity of packets [12]. Unfortunately, decisions often
depend on the actual values of each tuple. In such cases we can operate
on units of packets only if the property on which the decision
depends is common among all the tuples of the packet. Thus, the
data packing trait specifies whether the operations operate on tuples
or packets and, in the case of the latter, the properties that are common
between all the tuples of each packet. For example, routing
packets based on a hash-value implies that for every packet, all its
tuples have the same hash. This allows the system to route packets
without actually accessing their content.
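The data packing trait can be sketched as follows. This is a hypothetical illustration, not HetExchange’s interface: the `Packet` class, `pack_by_hash` and `route` are names introduced here. The producer groups tuples into packets whose tuples all share one hash value, so that a routing decision needs only the packet’s metadata.

```python
from dataclasses import dataclass, field

@dataclass
class Packet:
    hash_value: int                              # property shared by all tuples inside
    tuples: list = field(default_factory=list)

def pack_by_hash(tuples, key, num_buckets, capacity):
    """Producer side: emit packets that are homogeneous in their hash value."""
    open_packets, closed = {}, []
    for t in tuples:
        h = hash(key(t)) % num_buckets
        pkt = open_packets.get(h)
        if pkt is None:
            pkt = open_packets[h] = Packet(h)
        pkt.tuples.append(t)
        if len(pkt.tuples) == capacity:          # packet is full: hand it over
            closed.append(open_packets.pop(h))
    closed.extend(open_packets.values())         # flush partially filled packets
    return closed

def route(packet, num_consumers):
    """Router side: decide from metadata only, without touching packet.tuples."""
    return packet.hash_value % num_consumers
```

Since every tuple in a packet carries the same hash value, routing the packet as a unit is equivalent to routing each of its tuples individually.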
HAPE architecture. HAPE is composed of three main parts,
as shown in Figure 2. The first part is the query optimizer, which
is responsible for translating the query into a heterogeneity-aware
physical plan, a physical plan augmented with information regard-
ing which devices will be used for each part of the tree. By en-
capsulating conversions of the aforementioned traits in the four
HetExchange operators, all relational operators are heterogeneity-
oblivious. The heterogeneity-aware plan can explicitly specify the
degree of parallelism and target devices of each operator by placing
these four operators. Combining the operators with a representation
of the plan as a directed acyclic graph, instead of a tree, permits
the plan to use different paths for each device. As a consequence,
i) each node of the plan is mapped into a specific device, with the
exception of nodes representing a target device conversion, ii) the
plan is expressive enough to represent the selection of different al-
gorithms optimized for each target device.
The heterogeneity-aware plan is then broken down into pipelines
each targeting a single device. For each pipeline, the code genera-
tor produces code optimized for the pipeline’s target device through
device-specific back-ends, named device providers. The generated
code is executed on the available devices and is responsible for
transferring control and data between the devices. In addition, by
coordinating with the scheduler and resource managers, it load bal-
ances based on the runtime load.
HAPE benefits. The HAPE architecture provides several benefits.
First, by encapsulating inter-device operations, HAPE allows re-
lational operators to be heterogeneity-oblivious but also hardware-
aware. Relational operators ignore the complexities of remote data,
multi-device execution and coordination between devices and focus
on using the microarchitecture of their specific target devices as ef-
ficiently as possible. At the same time, HAPE provides the methods
through four meta-operators to enable co-processing across a mix
of CPUs and GPUs. Second, by providing a unified code genera-
tion interface, HAPE allows operators to be used for a variety of
device types, depending on the needs and the degree of specializa-
tion. Third, by embracing control-flow and data-flow operations, it
allows load-balancing and data-transfers between the different de-
vices. As a result, HAPE supports query execution both over CPU-
and GPU-resident data as well as data scattered over the server’s
memories. Last but not least, extracting and handling heterogene-
ity traits through explicit converters makes HAPE compatible with
existing query optimizers [4].
HAPE challenges. HAPE has to overcome three challenges to
effectively use the underlying hardware. First, it needs efficient
operators for single-device execution. As HAPE builds on top of
single-device operators, its overall effectiveness relies on the efficiency
of the underlying single-device operators. For CPU query
execution there has been debate [28, 3, 2, 5] regarding hardware-
oblivious versus hardware-aware algorithms which generally con-
cludes that the more appropriate option depends on the workload.
To support multiple devices, the first challenge is to identify how
algorithm specialization and selection differs from CPUs to GPUs.
Second, even if optimal algorithms are used, the inter-operator
efficiency can significantly impact performance and in the case of
HAPE, the engine should have a common, yet efficient, execution
model to allow hybrid execution. Prior work [7, 20] on CPU
query execution has shown the impact of inefficient execution models
and tried to minimize it. Recent work [24] has shown that
portability can be achieved by expressing the operators in high-
level frameworks like OpenCL and/or using vector primitives, but
Funke et al. [11] showed that such strategies can incur a high num-
ber of passes and thus waste memory (and cache) bandwidth, even
when optimized for the GPU-only case. Thus, the second challenge
is to identify an execution model efficient both on CPUs and GPUs.
Last but not least, in heterogeneous servers there are multiple
devices, cache-coherence is limited, globally shared memory may
either not exist or incur high access latencies and inter-device band-
width is one of the scarcest resources. In order to take advantage of
the efficient per-device execution, the engine should be capable of
efficiently handling the multiple devices. This requires that the
engine has 1) the necessary mechanisms to efficiently handle transfers
and packet routing, and 2) the necessary policies and algorithms to
decide on the required transfers and routing. So, the third challenge is
achieving concurrent multi-device execution and mapping parallel
algorithms in such a system.
HAPE extensibility. While we focus on the multi-CPU multi-
GPU case, HAPE is extensible to other accelerators as well. To
support a new device type, the engine needs a pair of new device-
crossing operators and a device provider. In our prototype the de-
vice provider translates the code generation directives to LLVM IR
and the generated code contains control flow statements, such as
branches and loops. Thus, HAPE is generalizable to such devices.
Figure 3: Block diagram for a GPU join over partitioned data. (Labelled
steps: build partition, compute hash, compute offsets, build HT, probe
partition, compute hash, probe HT, buffer output, output tuples, placed
across shared memory, the streaming multiprocessor and GPU device
memory, with atomics marked on two of the steps.)
For devices without control-flow support, but with gather and
scatter capabilities, HAPE can be applied by restricting the de-
vice providers to such a subset of instructions and allow only the
CPU operators to generate more complex code, in order to support
routers and allow HAPE to maintain its load balancing capabilities
and apply multi-device algorithms.
4. EFFICIENT PARALLEL PROCESSING
4.1 Efficiently parallelizing operators
The abstractions of Proteus’s infrastructure allow the execution
engine to be composed of homogeneous subsystems. This property
empowers the optimizer to opt for hardware-conscious operators
tuned for the specific target device alongside the range of supported
hardware-oblivious operators. As discussed in Section 2.1, this
brings about the potential for significant performance benefits over
generic hardware-oblivious operators.
Tuning operators for devices. Specializing to the target devices
has the potential to boost performance. Prior work has optimized
data movement and access patterns with respect to the device’s
caches, including TLBs, and their characteristics. Other works
have considered properties and functionalities of processing units
such as the instruction level parallelism (ILP), branch predictors,
SIMD instructions for CPUs, and warp-wide execution and shuf-
fles in GPUs. Operator implementations need to exploit properties
of the underlying hardware and explore the available opportunities
within the design space to achieve high performance.
Common design, different specialization. Despite the micro-
architectural differences, the exploration of hardware-conscious op-
erator designs is not uncorrelated across different devices. Parallels
can be drawn between the optimization demands and consequently
the design choices across devices. The hardware-conscious join is
an indicative case: independently of CPU or GPU execution, ran-
dom accesses are the main bottleneck of a non-partitioned hash-
join, as they waste memory bandwidth due to over-fetching. In
both CPUs and GPUs, similar algorithmic approaches can be used
to mitigate the problem, for example, partitioning the input to
fit the per-partition hash-tables in a memory (cache) with a higher
bandwidth. On the CPU side, the partitioning fanout is restricted
by the TLB size and, on the GPU side, by the size of the cache that
contains write offsets and consolidates stores. The end result, in
both cases, is a multi-pass partitioned hash-join design.
In GPUs, it is possible to do some further optimizations. Random
accesses to L1 waste bandwidth as a whole cache line has to
be fetched per access. To make the probing step GPU-friendly, we
load the smaller partition to the scratchpad, build the hash-table using
atomic operations and probe with the tuples of the corresponding
partition. The scratchpad is organized into banks and is capable
of serving a different word from each bank per (warp-)request,
independently of its location in the bank. Thus, the scratchpad only
penalizes accesses to the same bank, but does not waste bandwidth
by over-fetching. We show a block diagram of the build & probe
sequence within the memory hierarchy in Figure 3.
Figure 4: Block diagram for a GPU partitioning pass. (Labelled steps:
input tuples, compute partitions, compute offsets, shuffle tuples,
output tuples, placed across shared memory, the streaming
multiprocessor and GPU device memory, with atomics marked on one step.)
The scratchpad is of limited size, in the order of L1 size, and the
produced partitions of the two inputs should be small enough to fit
in it. Therefore, the number of produced partitions should be suffi-
ciently high. In the CPU case, the partitioning is optimized with the
goal to reduce the TLB misses and improve the cache locality of the
output. Similarly, in the GPU case, we aim to reduce the sparsity
of the stores but the fanout is restricted by the memory available
for consolidating the stores. To consolidate the stores, we read a
chunk of the input (at a time) in the scratchpad and reorganize it
in such a way that elements belonging to the same output partition
are located in consecutive scratchpad entries. Then, we scan the
scratchpad and write each tuple to its corresponding output parti-
tion, before moving to the next chunk. By controlling the number
of output partitions, we control the average number of elements
mapped to each partition for each step. The reordering gathers el-
ements of the same partition together and thus increases the coa-
lescing of the stores, which allows for better utilization of the GPU
memory bandwidth and improves the effective throughput. The
fewer the output partitions the higher the average run length of ele-
ments in the same partition, and thus the bandwidth utilization, but
the more passes over the data are required to achieve scratchpad-
resident co-partitions. In our implementation, contrary to the GPU
hash-join of [27], in each partition pass we scan the data once and
write them to a linked list of buffers managed with atomic opera-
tions. This technique avoids performing an extra scan to determine
the output offsets. We illustrate the block diagram for the steps of
a partitioning pass within the memory hierarchy in Figure 4.
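The trade-off between fanout and store coalescing can be illustrated with a small CPU-side simulation. It is a sketch under assumptions, not the GPU kernel: a stable sort stands in for the scratchpad reordering, plain Python lists stand in for the atomically managed linked lists of buffers, and the partition function (key modulo fanout) and all names are hypothetical.

```python
from itertools import groupby

def consolidate_chunk(chunk, partition_of):
    """Reorder one chunk so that tuples of the same output partition are adjacent."""
    return sorted(chunk, key=partition_of)       # stable; stands in for the shuffle

def partition_pass(data, fanout, chunk_size):
    """Partition `data` chunk by chunk; return the outputs and the average run length."""
    def partition_of(x):
        return x % fanout
    outputs = [[] for _ in range(fanout)]
    runs = 0
    for i in range(0, len(data), chunk_size):
        chunk = consolidate_chunk(data[i:i + chunk_size], partition_of)
        for p, run in groupby(chunk, key=partition_of):
            outputs[p].extend(run)               # one contiguous write per run
            runs += 1
    return outputs, len(data) / runs
```

With a fixed chunk size, a lower fanout produces longer runs per output partition, which corresponds to better-coalesced stores, at the cost of needing more passes to reach scratchpad-sized partitions.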
The GPU hardware-conscious join is tuned for the specific mem-
ory hierarchy of the GPUs. However, the skeleton of the algorithm
remains the same for both CPUs and GPUs. The main observa-
tion is that the design of hardware-conscious operators has two
components: the algorithmic skeleton and the hardware-specific
finer-grained building blocks, such as caching the hash table in
the scratchpad, that change between different device types. This
allows us to re-use the algorithms across devices and argue for separating
hardware-consciousness from device-consciousness: algorithms
may be capable of solving different hardware-specific device-invariant
problems (e.g., random accesses through multiple partitioning
steps) but the exact mappings to the hardware may differ
per device (e.g., fanout based on TLB versus scratchpad capacity).
4.2 Efficiently parallelizing query plans
To achieve efficient inter-operator CPU and GPU query execu-
tion, we use code generation to produce efficient code for each
target device and we parallelize the execution to multiple homo-
geneous devices by scheduling execution of the generated code as
well as any necessary data transfers. In Section 5 we discuss how
these techniques extend concurrent execution to multiple heteroge-
neous devices.
Code generation. We use code generation to achieve two goals,
i) a unified interface for operators to target multiple devices and ii)
enough flexibility to provide a hardware optimized implementation.
We provide the implementation of the code generation interface
with a different back-end per device. Each back-end is responsi-
ble for producing code tailored to the underlying device. Starting
from the lower level, back-ends are responsible for translating code
generation directives received by the operator to the instruction set
of their target device. At a higher level, they specialize common
functions, like worker-scoped atomics, reductions and synchroniza-
tions to the underlying device. For example, a back-end for single-
threaded CPU execution would optimize-out worker-scoped atom-
ics to simple load-apply-store operations.
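The specialization of worker-scoped atomics can be sketched as follows. This is an analogy rather than the prototype’s code generator: it uses run-time dispatch in Python instead of emitting device code, a lock stands in for a hardware atomic, and the class and method names are hypothetical. The operator is written once against the back-end interface, and each back-end decides how much synchronization the operation needs.

```python
import threading

class SingleThreadedCpuBackend:
    """A worker with a single thread: the atomic degenerates to load-apply-store."""
    def worker_scoped_add(self, state, key, value):
        state[key] = state.get(key, 0) + value

class MultiThreadedCpuBackend:
    """A worker spanning several threads: updates must be synchronized."""
    def __init__(self):
        self._lock = threading.Lock()

    def worker_scoped_add(self, state, key, value):
        with self._lock:                         # stands in for a hardware atomic
            state[key] = state.get(key, 0) + value

def sum_operator(backend, state, values):
    """Device-independent operator logic, expressed through the back-end."""
    for v in values:
        backend.worker_scoped_add(state, "sum", v)
```

The operator body is identical for both back-ends; only the cost and semantics of the worker-scoped update change with the target.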
Homogeneous inter-device parallelism. In order to achieve
inter-device parallelism over a set of homogeneous devices, we extend
the traditional Exchange [12]. Similar to the Exchange, we
instantiate both the producer and consumer code on multiple devices
to achieve the desired input and output degree of parallelism.
In contrast with the traditional Exchange, we separate control flow
(control transfers between producers and consumers) from data flow
(data transfers over the interconnects). Separating them allows for
taking data-dependent decisions to transfer control without access
to the data at the point of decision.
As control flow operations are inherently more CPU-friendly
than GPU-friendly, we delegate task assignment and load balancing
to the CPU and perform them through a CPU-side operator,
HetExchange’s router. This operator is a parallelism trait converter. It
receives tasks from producers and routes them to consumers, based
on policies. Our implementation has load-aware, locality-aware
and hash-based policies. We push down to the producers the
responsibility of packing data, such that the router can take routing
decisions at the granularity of packets, based only on packet
metadata and without accessing the packet contents.
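To illustrate routing on metadata alone, the sketch below implements the three policy families named above. The metadata fields (`hash_bucket`, `memory_node`, `num_tuples`) and the consumer bookkeeping are assumptions made for the example; the text does not specify what the packet metadata contains.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PacketMeta:
    """Metadata a producer attaches to a packet (all fields are illustrative)."""
    hash_bucket: int   # producer-computed partition id for the whole packet
    memory_node: int   # memory node where the packet data currently resides
    num_tuples: int


@dataclass
class Consumer:
    name: str
    memory_node: int
    queued_tuples: int = 0


Policy = Callable[[PacketMeta, List[Consumer]], Consumer]


def hash_policy(meta: PacketMeta, consumers: List[Consumer]) -> Consumer:
    return consumers[meta.hash_bucket % len(consumers)]


def load_aware_policy(meta: PacketMeta, consumers: List[Consumer]) -> Consumer:
    return min(consumers, key=lambda c: c.queued_tuples)


def locality_aware_policy(meta: PacketMeta, consumers: List[Consumer]) -> Consumer:
    # Prefer consumers on the packet's memory node; fall back to all of them.
    local = [c for c in consumers if c.memory_node == meta.memory_node]
    return load_aware_policy(meta, local or consumers)


def route(meta: PacketMeta, consumers: List[Consumer], policy: Policy) -> Consumer:
    """Pick a consumer using only the metadata; the packet contents are never read."""
    target = policy(meta, consumers)
    target.queued_tuples += meta.num_tuples
    return target
```

Because `route` only reads `PacketMeta`, the decision can be taken on the CPU even when the packet data lives in GPU memory.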
Depending on the routing policy, a packet may be routed to a
consumer that does not have access to its content. To handle such
cases, we represent data transfers as an operator and place them
in the plan. A variant of the same operator takes the memory
topology into consideration and performs packet multi-casts and
sharing, so that broadcasts require a minimal number of copies and
data transfers. In addition, decoupling the data transfers from the
relational operators allows our design to operate over both initially
CPU-resident and GPU-resident datasets, as well as datasets that are
distributed over all the CPU and GPU memory nodes.
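One way to read the topology-aware broadcast is that a packet is copied once per distinct memory node and then shared by all consumers attached to that node. The sketch below encodes this reading; it is an assumption about the mechanism, and `plan_broadcast` and its arguments are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def plan_broadcast(
    source_node: int, consumer_nodes: Dict[str, int]
) -> Tuple[List[Tuple[int, int]], Dict[int, List[str]]]:
    """Plan a broadcast with one copy per distinct remote memory node.

    `consumer_nodes` maps a consumer name to the memory node it can access.
    Returns the (source, destination) copies to perform and, for every node,
    the consumers that share the single copy residing there.
    """
    by_node: Dict[int, List[str]] = defaultdict(list)
    for consumer, node in consumer_nodes.items():
        by_node[node].append(consumer)
    # Consumers on the source node read the original packet; no copy is needed.
    copies = [(source_node, node) for node in sorted(by_node) if node != source_node]
    return copies, dict(by_node)
```

With five consumers spread over three memory nodes, this plan issues two copies instead of the four that a per-consumer transfer would need.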
5. EFFICIENT CO-PROCESSING
Supporting multiple types of devices, but only one device-type
at a time (homogeneous parallel execution) allows executing the
query on the most appropriate device type, but leaves the rest of the
devices underutilized. The rest of this section expands our design
to concurrently use all the available heterogeneous devices.
Similarly to the homogeneous case [12], there are three ways to
parallelize a query over heterogeneous devices. First, the engine
can vertically partition the query plan and execute each part on the
most appropriate type of devices, pipelining execution between het-
erogeneous devices. Second, different subtrees of the query plan
can be distributed to different devices, thus partitioning the plan
horizontally. Third, we can design efficient operators, the execu-
tion of which spans across multiple heterogeneous devices.
Vertical co-processing. We achieve pipelined execution across
devices by exploiting the fact that the two vertical partitions of the
plan are independent. This independence allows us to select the most
appropriate algorithms for each part separately, as well as to generate
efficient code for the target device of each part. We encode the
transition between device targets using HetExchange’s device
crossing operator. It changes the back-end used during code
generation and handles the transition of execution between devices,
effectively hiding from the rest of the operators, at both code
generation and execution time, that their input may come from
another device.
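The effect of a device crossing on code generation can be sketched as a walk over a pipeline in which the crossing switches the active back-end. The list-based pipeline encoding and the `("cross", device)` marker are assumptions made for the example.

```python
from typing import List, Tuple, Union

PipelineEntry = Union[str, Tuple[str, str]]


def assign_backends(
    pipeline: List[PipelineEntry], initial_device: str = "cpu"
) -> List[Tuple[str, str]]:
    """Tag every operator of a bottom-up pipeline with its target device.

    A ("cross", device) entry models a device crossing: it switches the
    back-end used for every operator that follows it. The operators
    themselves are never told that a crossing took place.
    """
    device = initial_device
    tagged = []
    for entry in pipeline:
        if isinstance(entry, tuple) and entry[0] == "cross":
            device = entry[1]
            continue
        tagged.append((entry, device))
    return tagged
```

For instance, `["scan", "filter", ("cross", "gpu"), "join", "aggregate"]` places the scan and filter on the CPU and the join and aggregation on the GPU.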
Horizontal co-processing. The design supports horizontal
parallelism across multiple heterogeneous devices by allowing routers
to have multiple distinct parents. With this relaxation, each parent of
a router can target a different type of device. For example, in order to
split an aggregation across 48 CPU cores and 2 GPUs, the plan has
a router operator that runs on the CPU and has two parents. The first
parent is an aggregation, while the second one is a CPU-to-GPU
device crossing operator followed by an aggregation. In this
example, the first parent is instantiated 48 times, resulting in
the parallel execution of the aggregation on the CPU cores. The
device crossing operator and its aggregation are instantiated 2
times by the router, and each instance transfers the execution to
its GPU and computes the aggregation. The router does not
differentiate between parents based on their execution target; the
device crossing operators handle that. Data partitioning for horizontal
heterogeneous parallelism is handled as described in Section 4.2.
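The 48-core, 2-GPU example can be written down as a small plan fragment. The `Router` and `Parent` classes and the operator names are hypothetical; the sketch only shows that the router treats both parents uniformly and that the device crossing is just another operator in the second parent's chain.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Parent:
    operators: List[str]  # operator chain fed by the router, in order
    instances: int        # degree of parallelism requested for this parent


@dataclass
class Router:
    parents: List[Parent] = field(default_factory=list)

    def instantiate(self) -> List[Tuple[int, Tuple[str, ...]]]:
        """Create one worker per requested instance, regardless of target."""
        workers = []
        for parent in self.parents:
            for i in range(parent.instances):
                workers.append((i, tuple(parent.operators)))
        return workers


router = Router([
    Parent(["aggregate"], instances=48),            # runs on the CPU cores
    Parent(["cpu2gpu", "aggregate"], instances=2),  # crosses to one GPU each
])
workers = router.instantiate()
```

Here `workers` holds 50 operator chains: 48 CPU-side aggregations and 2 chains that start with the device crossing.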
Intra-operator co-processing. In order to provide efficient al-
gorithms for co-processing, it is possible to combine, without mod-
ification, algorithms tailored to each device via data partitioning
and scheduling policies, reducing the design effort in this manner.
Continuing on the radix-join algorithm of Section 4.1, Sioulas et
al. [30] propose using the CPU in order to perform a first partition-
ing step locally to the input, which enables us to perform the join
with a single pass over the slow interconnect. The two join inputs
are co-partitioned in their initial location in such a way that each co-
partition can fit in GPU memory. Then, for each pair, we transfer it
to the GPU and execute the more fine grained partitioning steps and
the probing as described in Section 4.1. By controlling the number
of partitions, we fit each co-partition in the GPU memory and thus
only a single pass over the interconnect is required, as long as there
is no single key for which the corresponding tuples do not fit in
GPU memory. As the partitions generated on the CPU side only need
to be small enough to fit in the GPU memory, the CPU-side
partitioning requires a small number of output partitions compared to
the final number of partitions required by the radix join, and thus it
can be optimized to achieve very high throughput even for datasets
in the order of tens or hundreds of gigabytes.
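The CPU-side step can be sketched as follows. We assume, for illustration only, a power-of-two fanout, partitioning on the low key bits and uniformly sized partitions; the function names and the byte budget are hypothetical and do not come from [30].

```python
import math
from typing import List, Tuple


def cpu_side_fanout(build_bytes: int, probe_bytes: int, gpu_budget_bytes: int) -> int:
    """Smallest power-of-two fanout so that an average co-partition fits the budget."""
    needed = math.ceil((build_bytes + probe_bytes) / gpu_budget_bytes)
    return 1 << max(0, (needed - 1).bit_length())


def radix_partition(keys: List[int], fanout: int) -> List[List[int]]:
    """Partition keys on their low log2(fanout) bits."""
    parts: List[List[int]] = [[] for _ in range(fanout)]
    mask = fanout - 1
    for key in keys:
        parts[key & mask].append(key)
    return parts


def co_partitions(
    build_keys: List[int], probe_keys: List[int], fanout: int
) -> List[Tuple[List[int], List[int]]]:
    """Pair up the i-th partition of each input; equal keys share a pair."""
    return list(zip(radix_partition(build_keys, fanout),
                    radix_partition(probe_keys, fanout)))
```

Since both inputs are partitioned on the same bits, every pair can be joined independently after a single transfer to the GPU. A heavily skewed key would break the uniform-size assumption, which matches the caveat about a single key whose tuples do not fit in GPU memory.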
The task of selecting the above server-wide algorithm is
delegated to the query optimizer. The query optimizer places an initial
CPU-side partitioning operator on each of the two inputs. These
operators are followed by a zip operator that matches the
corresponding partitions from each side into co-partitions and pushes
them to the next operator. The zip is followed by a split operator,
which drives each of the two partitions to a (different) sequence
of a mem-move, a CPU-to-GPU device crossing and another
partitioning operator. The co-partitions produced by the latter are then
zipped once more, unpacked and propagated to the actual in-GPU
join operator.
Based on the above plan, the query optimizer can produce other
more complex plans using its optimization rules. For example, in-
stead of sending all the co-partitions to a single GPU, it can add a
router to send some co-partitions into a second GPU, or even keep
some of them for joining on the CPU-side.
[Figure 5 plot: execution time (ms) versus partition size (#elements, 0 to 4k); series: SM, SM+L1, L1]
Figure 5: Scratchpad (SM) vs L1 during GPU radix’s probing phase
6. EVALUATION
In this section, we present an evaluation of the performance for
the system described above.
6.1 Experimental Setup
The following experiments run on a machine provisioned with
two 12-core Intel Xeon E5–2650L v3 running at 1.8 GHz, with
64KB of L1 and 256KB of L2 cache memory per core, 30MB of
shared L3 cache and 256 GB of main memory. The machine is also
equipped with two NVIDIA GeForce GTX 1080 GPUs, each with 8
GB of local memory and connected to one of the two CPU sockets,
through a dedicated PCIe 3 x16 interconnect. We compare our im-
plementation with two state-of-the-art commercial systems, DBMS
C and DBMS G. DBMS C is a CPU-based columnar DBMS that is
based on MonetDB/X100 [7], uses SIMD vector-at-a-time execu-
tion and supports multi-CPU execution. DBMS G is a GPU-based
DBMS that supports multi-GPU execution and uses just-in-time
code generation for the in-GPU kernels.
6.2 Hardware-conscious Join
This section evaluates the GPU partitioned hash-join against other
algorithms and implementations. The first two experiments use two
equally-sized tables, each with two 4-byte columns: a key and a
payload. We measure the performance of an equi-join over the key
columns that is followed by a sum/count aggregation over each pay-
load column. Both tables have exactly the same keys and thus the
join output has as many tuples as any of the inputs.
Figure 5 assesses the importance of using the GPU scratchpad
for the build and probe phase of the join, rather than following the
CPU convention of using the L1. For this experiment, we use 32
million tuples for each table, load them in GPU memory and mea-
sure the execution time of the in-GPU partitioned join for varying
number of partitions. Thus, the input size is constant, while the
number of elements per partition varies. To filter-out the effect of
handling over-sized partitions and focus on the impact of selecting
the correct memory, we select the keys so that all produced parti-
tions have exactly the same size. Each co-partition is assigned to
a GPU block of threads and thus this defines the memory require-
ments for the intermediate structures per block. We compare three
variants: L1, which optimistically stores all the corresponding data
in L1, SM that stores all the data of the build table in the scratch-
pad and SM+L1 that stores the offsets of the heads of the hash table
chains in scratchpad and the rest in L1.
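To make the SM+L1 split concrete, the sketch below builds a chained hash table as two arrays: the chain heads, which are the part the SM+L1 variant keeps in the scratchpad, and the per-tuple links, which it would leave to L1. The array layout and the modulo hash function are assumptions made for the example, and the code is a sequential illustration rather than a GPU kernel.

```python
from typing import List, Tuple


def build_chained_table(
    build_keys: List[int], num_buckets: int
) -> Tuple[List[int], List[int]]:
    """Build a chained hash table over one build-side partition.

    heads[b] is the index of the most recently inserted tuple of bucket b
    (-1 if the bucket is empty); next_idx[i] links tuple i to the previously
    inserted tuple of the same bucket.
    """
    heads = [-1] * num_buckets
    next_idx = [-1] * len(build_keys)
    for i, key in enumerate(build_keys):
        bucket = key % num_buckets
        next_idx[i] = heads[bucket]
        heads[bucket] = i
    return heads, next_idx


def probe(
    probe_keys: List[int], build_keys: List[int],
    heads: List[int], next_idx: List[int],
) -> List[Tuple[int, int]]:
    """Return (probe_index, build_index) pairs with equal keys."""
    num_buckets = len(heads)
    matches = []
    for p, key in enumerate(probe_keys):
        i = heads[key % num_buckets]   # first access: the chain head
        while i != -1:                 # later accesses: follow the chain
            if build_keys[i] == key:
                matches.append((p, i))
            i = next_idx[i]
    return matches
```

Every probe starts with exactly one access to `heads`, which is why placing that array in the scratchpad guarantees that the first access of each probe is served from software-managed memory.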
The more we rely on the scratchpad to store join’s intermediate
structures, the better the performance, as the scratchpad, in con-
trast with L1, is not over-fetching. In addition, L1 is affected by
the scanning of the co-partitions, which causes cache pollution and
decreases the hit rate proportionally to the input size, as multiple
[Figure 6 plot: execution time (s, logarithmic axis) versus table size (Mtuples, 1 to 128); series: Partitioned CPU, Partitioned GPU, Non-partitioned CPU, Non-partitioned GPU, DBMS C, DBMS G]
Figure 6: Comparison of parallel CPU and (single) GPU joins
GPU blocks run on the same streaming multiprocessor and
share its L1 cache. In contrast, the scratchpad is software
managed and thus it is not affected by the same problem. As a result, the
performance of the scratchpad is almost constant while the perfor-
mance of L1-based solutions decreases as the number of partitions
increases. SM+L1 has the advantage that the first probe is in the
SM, but it is also affected by the drawbacks of the L1-based so-
lution. It is also worth mentioning that SM+L1 has an increased
capacity, compared to the other two solutions. The small perfor-
mance degradation from 1024 to 512 elements per partition is due
to hardware underutilization: the small partition size reduces the
opportunities for useful overlapping.
Figure 6 compares the in-CPU and in-GPU performance of the
partitioned and non-partitioned CPU and GPU joins implemented in our
system with the join implementations of DBMS C and
DBMS G. In this experiment we plot the execution time for varying
table sizes from 1 million to 128 million tuples, at which point
the datasets stop fitting in the GPU memory. In each case, the data
are pre-loaded to the local memory of the corresponding compute
unit. Due to its improved hardware utilization, the GPU
hardware-conscious algorithm outperforms all the alternatives, with a
speed-up of over 3x against the non-partitioned variant for the highest
supported in-GPU size and of over an order of magnitude against
the other implementations for the 128 million tuple datasets.
6.3 Operator-level co-processing
The next experiment evaluates the join co-processing technique
designed for scaling up the join to cases when the GPU memory is
insufficient for storing the input tables and intermediate join
structures. To that end, we scale to datasets bigger than the ones in
Figure 6, from 256 million tuples to 2 billion tuples and operate
over CPU-resident data. Figure 7 shows the execution time of the
co-processing technique for the case of 1 and 2-GPUs compared
against the joins of the two commercial systems. DBMS G is not
designed for out-of-GPU datasets, and thus performs poorly even
after 512 million tuples. DBMS C scales linearly with the number
of keys, but despite the fact that it has access to the data at
DRAM bandwidth, the random accesses force CPU implementations
to either suffer from high latencies or reduce the latencies
[Figure 7 plot: execution time (s) versus table size (Mtuples, 256 to 2048); series: 1 GPU, 2 GPUs, DBMS C, DBMS G]
Figure 7: Comparison of join co-processing using 1 and 2 GPUs
at the expense of multiple passes, which causes DBMS C to
achieve a throughput significantly lower than the PCIe throughput.
In contrast, the co-processing approach achieves the best of both
worlds. It partitions the data on the CPU side, where it has the DRAM
bandwidth at its disposal for scanning the data. On top of that,
as it requires a relatively low fanout, the partition materialization
can also take advantage of the high DRAM bandwidth. The
partitioning allows for a single pass over the slow PCIe and, on the
GPU side, the 280GBps memory bandwidth of each GPU in
combination with the optimizations to use the scratchpads allows for a
partition-and-join throughput that is also higher than that of the PCIe.
As a result, in the single-GPU co-processing the join is bottlenecked
by the PCIe throughput, which is still higher than the CPU-only join
throughput. In addition, adding an extra GPU on a dedicated PCIe bus
almost doubles (1.7x) the total throughput, as the GPU-side join
throughput is also doubled, while the near-DRAM low-fanout CPU-side
partitioning can sustain feeding both PCIe links. Overall, the
co-processing achieves 12.5x and 4.4x speedup over DBMS G and
DBMS C, respectively, for the largest dataset size each supports.
6.4 Query Plan-level co-processing
The rest of the experiments focus on evaluating the end-to-end
performance of the presented query engine. More specifically, they
evaluate its efficiency in achieving state-of-the-art performance on
each device and, in hybrid mode, its efficiency in achieving the
aggregate throughput of the individual devices. We use four TPC-H
queries, at scale factor 100, to evaluate our system: Q1 and Q6,
which are simple aggregations and thus stress the interconnect and
memory bandwidth utilization of the system, and Q5 and Q9, which are
join-heavy. As we currently have no support for LIKE conditions,
we run Q9 without the LIKE condition and the join to the corre-
sponding filtered table. We use a binary columnar format for the
inputs, which for our system is translated to 15-27GB working sets
per query. Taking into consideration data structures and space for
buffer management, none of these queries fits in the aggregated
GPU memory. Thus, for all the experiments and systems the data
are CPU-resident. For each experiment we warm up each system
to allow it to load the data in memory prior to any measurement.
[Figure 8 plot: execution time (s) per query (Q1, Q5, Q6, Q9*); series: DBMS C, Proteus CPUs, Proteus Hybrid, Proteus GPUs, DBMS G]
Figure 8: CPU-, GPU-only and Hybrid performance on TPC-H
Figure 8 plots the execution time for the commercial systems
and for three configurations of Proteus: CPU-only, which uses the
two CPU sockets; GPU-only, which uses both GPUs; and hybrid
execution, which uses all the CPUs and GPUs of the server. The
performance of Proteus CPU is comparable to that of DBMS C, with the
exception of Q1. Q1 has multiple aggregates and thus DBMS C has a
higher overhead due to the multiple in-L1 passes required by its
vector-at-a-time processing. In contrast, the code generation of
Proteus CPU avoids that. DBMS G is optimized for star-schema-based
queries and in-GPU processing, and thus it was unable to run 3 of the
queries.
For the case of homogeneous device execution, we see that the
relative performance depends on the query category. For the
scan-bound queries (Q1 and Q6), CPU-only execution demonstrates
significantly better performance than GPU-only execution and is more
than 2.65 times faster in both queries. The CPU-only configurations
only access local DRAM and sustain a throughput higher than the
combined bandwidth of the interconnects, over which the GPU
configurations have to access the data. However, for the
join-intensive Q5, GPU-only execution attains higher performance and
achieves a speedup of 1.4x, despite the data transfers over PCIe.
This upfront interconnect overhead is amortized as the heavy pro-
cessing involved benefits from the hardware capabilities and the
fine-tuned algorithm on the GPU, while the CPU-side join suffers
from high latencies and/or multiple passes. Q9 is producing inter-
mediate results that increase the hash-table size requirements fur-
ther than the available memory on the GPUs and thus none of the
GPU-only systems is able to execute it.
Even though the choice between CPU-only and GPU-only
execution is case-by-case, the multi-CPU multi-GPU hybrid
configuration outperforms both in all four experiments.
The hybrid mode is most efficient for Q1 and Q6, as it
achieves 89% and 82%, respectively, of the aggregate throughput
achieved by the CPU-only configuration plus the GPU-only
configuration. For Q5
the hybrid configuration achieves 64% of the aggregate through-
put, due to the overhead of shuffling data for the joins. Addition-
ally, hybrid execution allows for co-processing at the operator level
which is the cornerstone for evaluating Q9. The co-processing join
technique presented earlier is combined with the in-GPU join to
provide a speedup of 2x over the CPU version. This result shows
the practical value of the technique as it allows for a query with
requirements higher than the capacity of the accelerators to benefit
from their processing power.
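The aggregate-throughput percentages above can be reproduced from execution times under one natural reading of the metric, which we state as an assumption: for a fixed query and dataset, throughput is proportional to the inverse of the execution time, and the aggregate throughput is the sum of the CPU-only and GPU-only throughputs. The function names are hypothetical.

```python
def ideal_hybrid_time(t_cpu: float, t_gpu: float) -> float:
    """Execution time if hybrid execution reached the full aggregate throughput."""
    return 1.0 / (1.0 / t_cpu + 1.0 / t_gpu)


def aggregate_throughput_fraction(t_cpu: float, t_gpu: float, t_hybrid: float) -> float:
    """Fraction of the CPU-only plus GPU-only throughput reached by hybrid execution.

    All arguments are execution times of the same query on the same data.
    """
    return (1.0 / t_hybrid) / (1.0 / t_cpu + 1.0 / t_gpu)
```

As a purely illustrative input, two configurations that each need 2 s have an ideal hybrid time of 1 s; a hybrid run that takes 1.25 s then reaches 80% of the aggregate throughput.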
[Figure 9 plot: execution time (s) per configuration (GPU, Hybrid); series: Non-partitioned join, Partitioned join]
Figure 9: Partitioned vs Non-Partitioned-based join on TPC-H Q5
Figure 9 depicts the execution time of the GPU-only and
multi-CPU multi-GPU variants of query Q5 when the heavy joins on the
GPU side of the plan use either a partitioned join, as a
representative of hardware-conscious joins, or a non-partitioned join,
as a representative of hardware-oblivious joins. The comparison
outlines the impact of optimized operators within the query plan.
The plans that contain the partitioned joins have a lower execution
time, with 1.44x and 1.23x speedups for GPU and hybrid execution,
respectively. The efficient device-optimized operator mitigates the
join bottleneck and increases performance, which showcases the
importance of hardware-conscious processing.
7. CONCLUSIONS
In conclusion, we presented HAPE, a design for analytical query
engines that achieves efficient query execution over heterogeneous
hardware. We showed that heterogeneous execution can be
decomposed into a two-dimensional problem: achieving efficient
intra-device execution and efficient inter-device execution. Efficient
inter-device execution requires mechanisms for transferring control
and data between devices, as well as policies (co-processing
algorithms) that define how data and control should flow between the
devices. Efficient intra-device execution can be further decomposed
into optimizing intra- and inter-operator (intra-device) efficiency.
8. ACKNOWLEDGMENTS
This project has received funding from the European Union Seventh
Framework Programme, 2013 - ERC-2013-CoG, grant agreement
number 617508, ViDa, and from H2020 - UE Framework Programme for
Research & Innovation (2014-2020), 2017 - ERC-2017-PoC, grant
agreement number 768910, ViDaR.
References
[1] R. Appuswamy et al. The case for heterogeneous htap. In CIDR, 2017.
[2] C. Balkesen et al. Main-memory hash joins on multi-core cpus: tuning to the
underlying hardware. In Data Engineering (ICDE), 2013 IEEE 29th Interna-
tional Conference on, pages 362–373. IEEE, 2013.
[3] C. Balkesen et al. Multi-core, main-memory joins: sort vs. hash revisited. Proc.
VLDB Endow., 7(1):85–96, Sept. 2013. ISSN: 2150-8097. DOI: 10.14778/2732219.2732227.
URL: http://dx.doi.org/10.14778/2732219.2732227.
[4] E. Begoli et al. Apache calcite: a foundational framework for optimized query
processing over heterogeneous data sources. In SIGMOD. ACM, 2018.
[5] S. Blanas, Y. Li, and J. M. Patel. Design and evaluation of main memory hash
join algorithms for multi-core cpus. In Proceedings of the 2011 ACM SIGMOD
International Conference on Management of Data, SIGMOD ’11, pages 37–
48, Athens, Greece, 2011. ISBN: 978-1-4503-0661-4.
[6] P. A. Boncz, S. Manegold, and M. L. Kersten. Database architecture optimized
for the new bottleneck: memory access. In Proceedings of the 25th Interna-
tional Conference on Very Large Data Bases, VLDB ’99, pages 54–65, San
Francisco, CA, USA. Morgan Kaufmann Publishers Inc., 1999. ISBN: 1-55860-
615-7. URL: http://dl.acm.org/citation.cfm?id=645925.671364.
[7] P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-Pipelining
Query Execution. In CIDR, 2005.
[8] S. Breß, H. Funke, and J. Teubner. Robust Query Processing in Co-Processor-
accelerated Databases. In SIGMOD, pages 1891–1906, 2016.
[9] T. Chen et al. Tvm: an automated end-to-end optimizing compiler for deep
learning. In 13th USENIX Symposium on Operating Systems Design and Im-
plementation (OSDI 18), pages 578–594, 2018.
[10] P. Chrysogelos et al. HetExchange: Encapsulating heterogeneous CPU-GPU
parallelism in JIT compiled engines. PVLDB, 2019.
[11] H. Funke et al. Pipelined query processing in coprocessor environments. In
Proceedings of the 2018 International Conference on Management of Data,
pages 1603–1618. ACM, 2018.
[12] G. Graefe. Encapsulation of parallelism in the volcano query processing sys-
tem. In SIGMOD, pages 102–111, 1990.
[13] B. He et al. Relational Query Coprocessing on Graphics Processors. TODS,
34(4):21:1–21:39, 2009.
[14] J. He, M. Lu, and B. He. Revisiting co-processing for hash joins on the coupled
cpu-gpu architecture. Proc. VLDB Endow., 6(10):889–900, Aug. 2013. ISSN:
2150-8097. DOI: 10.14778/2536206.2536216.
URL: http://dx.doi.org/10.14778/2536206.2536216.
[15] M. Heimel et al. Hardware-oblivious parallelism for in-memory column-stores.
PVLDB, 6(9):709–720, 2013.
[16] H. Inoue et al. Aa-sort: a new parallel sorting algorithm for multi-core simd
processors. In 16th International Conference on Parallel Architecture and
Compilation Techniques (PACT 2007), pages 189–198, Sept. 2007.
DOI: 10.1109/PACT.2007.4336211.
[17] T. Kaldewey et al. GPU join processing revisited. In DaMoN, 2012.
[18] T. Karnagel et al. Big data causing big (tlb) problems: taming random memory
accesses on the gpu. In Proceedings of the 13th International Workshop on
Data Management on New Hardware, page 6. ACM, 2017.
[19] M. Karpathiotakis, I. Alagiannis, and A. Ailamaki. Fast Queries Over Hetero-
geneous Data Through Engine Customization. PVLDB, 9(12):972–983, 2016.
[20] A. Kemper and T. Neumann. HyPer: A hybrid OLTP&OLAP main memory
database system based on virtual memory snapshots. In ICDE, 2011.
[21] V. Leis et al. Morsel-driven parallelism: a NUMA-aware query evaluation frame-
work for the many-core age. In SIGMOD, pages 743–754, 2014.
[22] MapD. https://www.mapd.com/.
[23] J. Paul, J. He, and B. He. GPL: A GPU-based Pipelined Query Processing
Engine. In SIGMOD, pages 1935–1950, 2016.
[24] H. Pirk et al. Voodoo - A Vector Algebra for Portable Database Performance
on Modern Hardware. PVLDB, 9(14):1707–1718, 2016.
[25] O. Polychroniou, A. Raghavan, and K. A. Ross. Rethinking simd vectorization
for in-memory databases. In Proceedings of the 2015 ACM SIGMOD
International Conference on Management of Data, SIGMOD ’15, pages 1493–1508,
Melbourne, Victoria, Australia. ACM, 2015. ISBN: 978-1-4503-2758-9.
DOI: 10.1145/2723372.2747645. URL: http://doi.acm.org/10.1145/2723372.2747645.
[26] G. Psaropoulos et al. Interleaving with coroutines: a practical approach for ro-
bust index joins. Proceedings of the VLDB Endowment, 11(2):230–242, 2017.
[27] R. Rui and Y. Tu. Fast Equi-Join Algorithms on GPUs: Design and Implemen-
tation. In SSDBM, 17:1–17:12, 2017.
[28] S. Schuh, X. Chen, and J. Dittrich. An experimental comparison of thirteen
relational equi-joins in main memory. In Proceedings of the 2016 International
Conference on Management of Data, SIGMOD ’16, pages 1961–1976, San
Francisco, California, USA. ACM, 2016. ISBN: 978-1-4503-3531-7.
DOI: 10.1145/2882903.2882917. URL: http://doi.acm.org/10.1145/2882903.2882917.
[29] A. Shatdal, C. Kant, and J. F. Naughton. Cache conscious algorithms for rela-
tional query processing. In Proceedings of the 20th International Conference
on Very Large Data Bases, VLDB ’94, pages 510–521, San Francisco, CA,
USA. Morgan Kaufmann Publishers Inc., 1994. ISBN: 1-55860-153-8. URL:
http://dl.acm.org/citation.cfm?id=645920.758363.
[30] P. Sioulas et al. Hardware-conscious Joins on GPUs. In ICDE, 2019.
[31] E. Stehle and H.-A. Jacobsen. A memory bandwidth-efficient hybrid radix sort
on gpus. In Proceedings of the 2017 ACM International Conference on
Management of Data, SIGMOD ’17, pages 417–432, Chicago, Illinois, USA. ACM,
2017. ISBN: 978-1-4503-4197-4. DOI: 10.1145/3035918.3064043.
URL: http://doi.acm.org/10.1145/3035918.3064043.
[32] Y. Yuan, R. Lee, and X. Zhang. The Yin and Yang of Processing Data Ware-
housing Queries on GPU Devices. PVLDB, 6(10):817–828, 2013.
[33] J. Zhou and K. A. Ross. Implementing database operations using simd instruc-
tions. In Proceedings of the 2002 ACM SIGMOD International Conference
on Management of Data, SIGMOD ’02, pages 145–156, Madison, Wiscon-
sin. ACM, 2002. ISBN: 1-58113-497-5. DOI: 10.1145/564691.564709. URL:
http://doi.acm.org/10.1145/564691.564709.
