Heterogeneity-Aware Placement
Strategies for Query Optimization
Dissertation
for the attainment of the academic degree
Doktoringenieur (Dr.-Ing.)
submitted to the
Technische Universität Dresden
Fakultät Informatik
by
Dipl.-Inf. Tomas Karnagel
born on 30 August 1986 in Leipzig
Reviewers: Prof. Dr.-Ing. Wolfgang Lehner
Technische Universität Dresden
Fakultät Informatik, Institut für Systemarchitektur
Lehrstuhl für Datenbanken
01062 Dresden
Prof. Dr. Jens Teubner
Technische Universität Dortmund
Fakultät Informatik
Lehrstuhl für Datenbanken und Informationssysteme
44227 Dortmund
Date of defense: 23 May 2017
ABSTRACT
Computing hardware is changing from systems with homogeneous CPUs to systems with heterogeneous computing units like GPUs, Many Integrated Cores, or FPGAs. This trend is caused by scaling problems of homogeneous systems, where heat dissipation and energy consumption limit further growth in compute performance. Heterogeneous systems provide differently optimized computing hardware, which allows different operations to be computed on the most appropriate computing unit, resulting in faster execution and lower energy consumption.
For database systems, this is a new opportunity to accelerate query processing, allowing
faster and more interactive querying of large amounts of data. However, the current hardware trend is also a challenge, as most database systems do not support heterogeneous computing resources, and it is not clear how to support these systems best. In the past, mainly single operators were ported to different computing units, showing great results while missing a system-wide application. To efficiently support heterogeneous systems, a systems approach to query processing and query optimization is needed.
In this thesis, we tackle the optimization challenge in detail. As a starting point, we evalu-
ate three different approaches on isolated use-cases to assess their advantages and limitations.
First, we evaluate a fork-join approach of intra-operator parallelism, where the same operator
is executed on multiple computing units at the same time, each execution with different data
partitions. Second, we evaluate using one computing unit statically to accelerate one opera-
tor, which provides high code-optimization potential due to the static and pre-known usage
of hardware and software. Third, we evaluate dynamically placing operators onto computing
units, depending on the operator, the available computing hardware, and the given data sizes.
We argue that the first and second approaches suffer from multiple overheads or high implementation costs. The third approach, dynamic placement, shows good performance while being highly extensible to different computing units and different operator implementations.
To automate this dynamic approach, we first propose general placement optimization for
query processing. This general approach includes runtime estimation of operators on different
computing units as well as two approaches for defining the actual operator placement accord-
ing to the estimated runtimes. The two placement approaches are local optimization, which
decides the placement locally at run-time, and global optimization, where the placement is de-
cided at compile-time, while allowing a global view for enhanced data sharing. The main limitation of the latter is the high dependency on cardinality estimation of intermediate results, as estimation errors for the cardinalities propagate to the operator runtime estimation and placement optimization. Therefore, we propose adaptive placement optimization, allowing the placement optimization to become fully independent of cardinality estimation, effectively eliminating the main source of inaccuracy for runtime estimation and placement optimization.
Finally, we define an adaptive placement sequence, incorporating all our proposed techniques
of placement optimization. We implement this sequence as a virtualization layer between the
database system and the heterogeneous hardware. Our implementation approach builds on preexisting interfaces to the database system and the hardware, allowing non-intrusive integration into existing database systems. We evaluate our techniques using two different database systems and two different OLAP benchmarks, accelerating query processing through heterogeneous execution.
ACKNOWLEDGEMENTS
First and foremost, I thank my thesis advisor Wolfgang Lehner, who encouraged me to start a Ph.D. on this topic and continuously provided interesting discussions on my research. Without him and his support, this thesis would not have been possible.
Additionally, I want to thank Dirk Habich and Benjamin Schlegel, who supported my research from the beginning, helping me to focus and to publish my work. Both Dirk and Benjamin had a significant impact on my research. I am grateful to Jens Teubner, who agreed to
externally review this thesis as an expert in the field. Furthermore, I thank René Müller and Guy Lohman from IBM Almaden for taking me in as an intern and showing me the interesting world of industry research. I am glad for the internship opportunity, the valuable input
to my work, and the new friends that I made. Additionally, I want to thank Max Heimel, Tal Ben-Nun, and Matthias Werner for the fruitful cooperation and for co-authoring some publications that went into this thesis. I would not have come this far without Max's database system Ocelot, Tal's hint that redundant work is sometimes better than going through a bottleneck, and Matthias's help with profiling GPUs. I also have to thank Matthias Hille for casually mentioning that we could use the OpenCL interface for the optimizer; an idea that grew in my mind over the years without my being aware of how much work it would actually be.
I also thank the many internal reviewers of this work for their valuable time and helpful
comments, especially Annett Ungethüm, Juliana Hildebrand, Alexander Krause, Ismail Oukid,
Johannes Luong, Patrick Damme, Steffen Huber, and Till Kolditz. Additionally, I thank the rest
of the database group for the constructive discussions and the good time together. In particular,
I thank Katrin Braunschweig, Ulrike Fischer, Ahmad Ahmadov, Claudio Hartmann, Hannes
Voigt, Julian Eberius, Kai Herrmann, Lars Kegel, Maik Thiele, Martin Hahmann, Robert Ul-
bricht, Thomas Kissinger, Tim Kiefer, and Tobias Jäkel. A special thanks goes to Ulrike Schöbel
for reviewing nearly all of my publications before submission and to Ines Funke for basically
organizing the whole group.
I am grateful to the German Research Foundation (DFG) for providing the funding for my research through the Cluster of Excellence "Center for Advancing Electronics Dresden". Additionally, I want to thank the Center for Information Services and High Performance Computing (ZIH) for access to the Taurus HPC system, and the Dresden GPU Center of Excellence for providing hardware access and an open space for in-depth discussions.
Last but not least, I owe thanks to my family! My wife Sandra always supported my work and believed in me, while also accepting that I would use vacations to write papers. This thesis would not have been possible without her. Additionally, my daughter Jasmin and my son Michael relentlessly tried to distract me from thinking about work while I was at home, helping me to keep a healthy work-life balance. I thank you both so much.
Tomas Karnagel
May 26, 2017
CONTENTS

1 INTRODUCTION

2 QUERY OPTIMIZATION AND HETEROGENEOUS HARDWARE
2.1 Query Optimization and Query Processing
2.1.1 Query Optimization Techniques
2.1.2 Query Processing
2.1.3 Storage Models
2.2 Hardware Heterogeneity
2.2.1 Computing Unit Architectures
2.2.2 Heterogeneous Connections
2.2.3 Open Computing Language
2.2.4 Computing Units used throughout this Thesis
2.3 Query Processing in Heterogeneous Computing Environments
2.3.1 Operator Implementations
2.3.2 Database Systems
2.3.3 Placement Strategies in Heterogeneous Computing Environments

3 APPROACHES TO UTILIZE HETEROGENEOUS ENVIRONMENTS
3.1 Approach I: Intra-Operator Parallelism on different CUs
3.1.1 Intra-Operator Parallelism
3.1.2 Possible Limitations
3.1.3 Operator Implementation and Hardware Setup
3.1.4 Analysis of the Selection Operator
3.1.5 Analysis of the Sort Operator
3.1.6 Conclusion
3.2 Approach II: Static Placement
3.2.1 Operator Implementation
3.2.2 Observations
3.2.3 TLB Analysis
3.2.4 Implementation Adjustments
3.2.5 Configuration-based Optimizer
3.2.6 Conclusion and Transferability of Results
3.3 Approach III: Dynamic Placement
3.3.1 Observations from Previous Approaches
3.3.2 Case Study: Dynamic Placement of Group-by Operator
3.3.3 High Performance vs. Dynamic Placement
3.3.4 Challenges for Automatic Placement Decisions

4 GENERAL PLACEMENT OPTIMIZATION
4.1 Runtime and Transfer Estimation
4.1.1 General Directions
4.1.2 Operator Runtime Estimation
4.1.3 Transfer Time Estimation
4.2 Local Placement Strategy
4.2.1 General Approach
4.2.2 Advantages and Limitations
4.3 Global Placement Strategy
4.3.1 Search Space
4.3.2 Greedy Search Approach
4.3.3 Search Space Reduction
4.3.4 Advantages and Limitations
4.4 Evaluation
4.4.1 Runtime Estimation
4.4.2 Placement Optimization
4.5 Conclusion

5 ADAPTIVE PLACEMENT OPTIMIZATION
5.1 Open Challenges
5.2 Adaptive Placement Approach
5.2.1 General Approach
5.2.2 Improving Placement Quality
5.3 Adaptive Placement Sequence
5.3.1 Steps at Query Compile-Time
5.3.2 Steps at Query Run-Time
5.3.3 Feasibility of our Approach
5.4 Implementation Approach
5.4.1 General Architecture
5.4.2 Database System Interface
5.4.3 Memory Management
5.4.4 Kernel Handling
5.4.5 Query Execution
5.4.6 Summary
5.5 Evaluation
5.5.1 Micro-Benchmarks
5.5.2 Overheads
5.5.3 Performance and Placement Quality
5.5.4 Adaptivity of Heterogeneous Placement
5.5.5 Portability
5.6 Conclusion

6 CONCLUSION AND FUTURE WORK
6.1 Summary
6.2 Future Work

BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
1
INTRODUCTION
Database management systems (DBMSs) have been a core technology for the last half century and a basic building block for many applications. They are not only used for data storage and data querying but also for data analytics and data mining. Therefore, improving DBMS performance means inherently improving application performance, leading to a high demand for database performance tuning. There are many possibilities for performance tuning inside a database system. An overview of the database system architecture is given in Figure 1.1. An
incoming query (e.g., SQL) is translated into an internal format, on which query optimization
is performed (the query plan). This optimization includes query plan restructuring (logical
optimization) and defining the specific operations that need to be executed in order to compute
the query result (physical optimization). The optimized plan (query execution plan (QEP)) is
passed to the processing layer, where the defined operators are executed using data and data
access methods of the storage layer.
!"#$%&#'($)*#+,")
-.'/01#'($)*#+,")
2"(3,%%0$4)*#+,")
56("#4,)*#+,")
7$89,$3,:))
;+)39"",$6)
<#":=#",)
!",$:%)
>#6#;#%,)?9,"+)
?9,"+)2&#$)
?9,"+)@A,39'($)2&#$)
>#6#)B33,%%,%)
Figure 1.1: Database System Architecture.
In order to improve performance, the database system architecture was shaped by hardware
changes in the last decades. Examples are moving (1) from sequential processing to parallel
multi-core execution (processing layer), (2) from disk-centric systems to in-memory systems
(storage layer) [Abadi et al. 2013], and (3) from row-based execution to column-based execu-
tion (optimization, processing, and storage layer). Currently, hardware is changing again from
homogeneous CPU systems towards heterogeneous systems with different computing units
(CUs) like CPUs and GPUs, mainly to overcome physical limits of homogeneous systems [Es-
maeilzadeh et al. 2011]. The computing resources in a heterogeneous system usually have
different architectures for different use-cases. Database systems need to adapt to this hardware
trend to efficiently utilize the given opportunities. Therefore, the current question is:
How can database systems efficiently utilize heterogeneous
hardware environments to speed up query processing?
In this thesis, we strive to discuss this question in detail. In the following, we present the
most influential hardware changes as motivation, before presenting challenges, our contribu-
tion, and the structure of this thesis.
[Figure: four plots of hardware evolution from 1970 to 2020 — (a) main memory size (KByte); (b) frequency (MHz) and power (W); (c) manufacturing process (nm) and number of transistors (×1000); (d) number of cores.]
Figure 1.2: Hardware evolution of different aspects from 1970 to the present. The CPU data (b, c, d) are based on Intel CPUs. Note the logarithmic scale of the y-axis in (a, b, c), where linear growth actually corresponds to exponential scaling.
Sources: 1 http://www.jcmit.com/memoryprice.htm (Sep. 2016); 2 Intel Poster: The Evolution of a Revolution; 3 Intel Product Specifications: ark.intel.com (Sep. 2016); 4 https://en.wikipedia.org/wiki/Transistor_count (Sep. 2016).
Motivation
In this section¹, we briefly describe the major database and hardware trends of the past and the present, which motivate our work. The general hardware evolution is illustrated in Figure 1.2.
Single Core Era: The first database systems ran on single-core processors with limited main memory. There, the focus of optimization was on efficient computation and minimizing costly disk accesses. While disk accesses remained a bottleneck, computation improved without changes to the database architecture through an increasing CPU frequency of ≈2× (Figure 1.2b) with each new processor manufacturing process [Pollack 1999].
Multi-Core Era: The power density and the leakage power rise with every new manufacturing process [Pollack 1999], prohibiting higher voltages and further frequency scaling. Multiple CPU cores on one die were introduced to reduce power consumption, as it is likely that not all cores are used at peak performance. Therefore, cores can be throttled individually to control the overall power consumption. Figures 1.2b and 1.2d show that multiple cores were
1Parts of the material in this section (and Section 6.1) have been developed jointly with Dirk Habich. The
section is based on [Karnagel and Habich 2017]. The copyright is held by De Gruyter and the original publication is
available at https://doi.org/10.1515/itit-2016-0048
introduced when frequency and overall power reached their limits. For database systems, this means the "Free Lunch is over" [Sutter 2005], implying that query processing cannot improve performance with newer hardware unless we adjust the database architecture towards parallelism and concurrency.
In-Memory Era: Rising main memory capacities trigger another architecture change (Figure 1.2a). Traditionally, the main memory was used as a buffer pool for a subset of the data from the disk drives. The main memory acted as a cache for the hard disk, where accesses to the disk were still the main bottleneck. With increasing memory capacity, it is now possible to store vast amounts of data, if not all of it, in main memory. This redefines the systems' bottlenecks, the ideal database architecture, and the actual database usage towards an increasing amount of analytical queries.
Current Challenge: Scaling further is the current challenge. Processor frequencies and overall power consumption do not increase anymore (Figure 1.2b). At the same time, more transistors are packed into one chip, which is made possible by the shrinking manufacturing process (Figure 1.2c). However, this also increases the power consumption per mm² and the theoretical peak consumption of the chip. Since this peak consumption is limited, not all parts (or cores) of the chip can be used at peak power (Dark Silicon [Esmaeilzadeh et al. 2011]). At the same time, data sizes and the demand for instant data analytics are growing exponentially.
One direction to accelerate query processing is using distributed database systems, where multiple nodes work together on one query. This introduces the overhead of data transfers beyond one node; however, it also adds the advantage of being able to add more computing power as needed. To increase the performance of a single node, possible approaches include combining multiple homogeneous computing units (e.g., CPU-based NUMA systems) or multiple heterogeneous computing units.
Heterogeneous Era: At the moment, multiple architectures are emerging to accelerate cer-
tain computations like GPUs for highly parallel SIMD processing; Many Integrated Cores
(MIC) for highly parallel processing of individual threads; field programmable gate arrays
(FPGA) for operators on reconfigurable logics; or different application-specific integrated cir-
cuits (ASIC) to speed up custom-specific algorithms. The complete systems themselves are
becoming more and more heterogeneous, which was already reported in 2011:
“Energy will be the key limiter of performance, forcing processor designs to use large-scale
parallelism with heterogeneous cores, ...” [Borkar and Chien 2011]
These heterogeneous systems are already widely available and most of them are also used
for data processing. For example, database research was done on using common processors
that integrate cores of different sizes and capabilities like ARM’s big-LITTLE [Mühlbauer et al.
2014; Ungethüm et al. 2016] or processors that integrate GPUs like AMD’s Accelerated Pro-
cessing Units (APU) [He et al. 2014]. For server systems, FPGA extensions are productively
used for search engines [Putnam et al. 2014] and high-end GPUs are used for visual data ana-
lytics [Mostak 2013]. Becoming more specialized, hardware/software co-design was proposed
to speed up sorted-set algorithms [Arnold et al. 2014b], hashing [Arnold et al. 2014a], and compressed bitmap processing in database systems [Haas et al. 2016].
Even further, there are research projects like the Center for Advancing Electronics Dresden
(cfAED), which strive to find new materials for building processors. These materials include
silicon nanowires [Cui et al. 2003] and carbon nanotubes [Meric et al. 2008], but also organic
electronics [Lüssem et al. 2013], chemical information processing [Voigt et al. 2013], and DNA
origami [Gür et al. 2016]. When one or more of these materials become successful for compu-
tation, processors will not only be heterogeneous in their architecture and execution modes,
but also in their materials and processor design [Völp et al. 2016]. This illustrates the need
to investigate heterogeneity-aware computation and evaluate strategies that utilize multiple
heterogeneous computing units.
Challenges in the Heterogeneous Era
To support heterogeneous architectures or even heterogeneous materials within the database
system, a redesign of the database architecture and the database query optimization is needed.
Multiple choices for the computation significantly increase the complexity of the system and it
is yet unknown in which way hardware heterogeneity can be supported efficiently in database
systems. In the past, database operator implementations were ported from pure CPU-based
processing to an execution using other CUs, like GPUs [Govindaraju et al. 2006; He et al.
2008] or FPGAs [Müller et al. 2009a,b, 2012], but also the Intel MIC [Jha et al. 2015] and the
Cell Processor [Gedik et al. 2007]. Based on the assumption that operators have been ported
to the different CUs in a heterogeneous environment, the challenge is using them in the most
beneficial way to accelerate query processing.
For example, it would be possible to use a fork-join model and partition data in a way that multiple CUs can partially execute an operator in parallel, theoretically improving performance over single-CU execution. However, it is unclear how large the speedup will be and whether possible overheads of this intra-operator parallelism can be compensated by the faster execution.
For single CU execution, it would be possible to define a static operator-CU combination
that is always used in this fixed scenario, e.g., using a GPU only for sorting. This allows a high
level of performance tuning for this specific operator and hardware. However, it is unclear if an
operator should always be executed on the same CU, given that data sizes and data properties
can be different. Additionally, this approach might not be extensible to multiple operators and
different hardware platforms.
A third approach would be dynamic execution, i.e., deciding the location (placement) of an operator dynamically for each query and operator execution. There, any CU can be used for any operator, making it possible to change the placement if the execution is not beneficial for some data sizes or operator implementations. Here, however, the question is how to define this placement automatically, as decisions have to be based on the given computing hardware and operator implementation, but also on the query structure and data transfers within a query.
Summary of Contributions
We investigate the three mentioned approaches in isolated scenarios to evaluate their potential
for heterogeneity-aware database systems. We choose the third approach as most promising
and propose further strategies for dynamic execution and heterogeneous operator placement.
The key contributions of this thesis are the following:
Approaches to utilize heterogeneous environments:
1. We evaluate intra-operator parallelism for two different operators and two different heterogeneous computing environments. We propose a data partitioning scheme and investigate performance effects of the execution in detail. As a result, we find that underutilization, result processing, and different CUs competing for computing resources significantly limit the potential of this approach.
2. We evaluate and fine-tune a group-by operator using the static offloading approach on a
GPU. We find multiple performance effects and bottlenecks, which we explain through
in-depth hardware benchmarking. This includes a thorough analysis of the GPU TLB
architecture, identifying never-before-published TLB properties. We propose a total of
eight configurations combining different parameter and implementation adjustments to
improve the performance. The configurations themselves need to be switched according
to the hash table size of the operator.
3. We propose dynamic placement decisions, where the operator actually switches between different CUs to reduce CU disadvantages by changing the placement decision at the right point. We evaluate this approach manually by executing the group-by operator on eight different CUs, including different CPUs, GPUs, and a MIC. We find this approach to be highly extensible to different hardware environments and operator implementations, and we discuss challenges for fully automating these placement decisions.
Heterogeneous placement optimization:
1. We propose a novel way of runtime and transfer estimation, a basic technique for place-
ment optimization. We utilize a learning-based black-box approach for operators and
computing units, which allows high extensibility towards unknown hardware environ-
ments and operator implementations.
2. Based on the runtime estimation, we propose local and global placement optimization to
define placement decisions for all database operators within a query in order to reduce
the overall query runtime.
3. Placement optimization is highly dependent on cardinality estimation, where small errors lead to inaccurate runtime estimations and wrong placement decisions. Therefore, we propose adaptive placement optimization, which allows the decisions to become completely independent of intermediate cardinality estimations. This is achieved by partitioning the query and allowing a combination of compile-time and run-time optimizations, leading to higher precision for runtime estimation and placement optimization.
4. Finally, we propose a novel implementation approach as a virtualization layer based on OpenCL for the database system interface. This allows our approach to be highly extensible to different heterogeneous environments, while also being able to be integrated into multiple database systems without additional effort.
We do not focus on building entirely new database systems; instead, we want to extend existing systems through our approach of heterogeneous placement. Furthermore, we focus on compute heterogeneity and not memory heterogeneity (e.g., volatile vs. non-volatile memory), as the latter is more relevant for persistence and recovery considerations.
[Figure: thesis structure — Chapter 2: Query Optimization and Heterogeneous Hardware (query optimization and processing; hardware heterogeneity; query processing in heterogeneous environments); Chapter 3: Approaches to Utilize Heterogeneous Environments (intra-operator parallelism; static placement; dynamic placement); Chapter 4: General Placement Optimization (runtime and transfer estimation; local optimization; global optimization); Chapter 5: Adaptive Placement Optimization (adaptive placement approach; adaptive placement sequence; implementation approach).]
Figure 1.3: The structure of this thesis including chapter numbers.
Outline
Figure 1.3 illustrates the outline of this work. We first discuss background material in Chap-
ter 2, divided into (1) general query optimization and processing, (2) heterogeneous hardware,
and (3) query processing using heterogeneous computing resources. Especially the latter part
17
gives an overview of related work to our approaches. In Chapter 3, we explore three differ-
ent heterogeneous placement approaches in order to find the most promising approach to fol-
low. The three approaches include (1) intra-operator parallelism on heterogeneous hardware,
(2) static placement and implementation fine-tuning, and (3) dynamic operator placement.
Every approach is evaluated prototypically using a fixed scenario, where limitations and oppor-
tunities are shown. We choose dynamic placement for further research and start with a general
approach in Chapter 4. This general approach includes runtime estimation and data transfer
cost estimation as base for any further steps. For the actual optimization, we propose local and
global optimization, each with its own advantages and limitations. To improve on the general
approach, we propose an adaptive approach in Chapter 5, building a middle path between local
and global optimization. In detail, we describe (1) the adaptive placement approach, (2) an adaptive work sequence, and (3) our implementation approach as a virtualization layer, including an evaluation with two different database systems and workloads. We conclude the thesis in Chapter 6 with a summary and future work.
2
QUERY OPTIMIZATION AND
HETEROGENEOUS HARDWARE
2.1 Query Optimization and Query Processing
2.2 Hardware Heterogeneity
2.3 Query Processing in Heterogeneous Computing Environments
Query processing has to adapt to heterogeneous computing environments to improve overall query execution. In the following, we first present an overview of traditional
database techniques including query optimization, query processing, as well as data storage.
Afterwards, we present heterogeneous computing systems by describing different hardware
architectures and their capabilities together with properties of multiple CUs, which are used
throughout this thesis. Finally, we present related work on query processing and optimization
for heterogeneous computing environments.
2.1 QUERY OPTIMIZATION AND QUERY PROCESSING
Database query processing in general uses a sequence of query parsing, query optimization,
and query processing, as shown in Figure 2.1. The query parser transforms an input query
(e.g., in SQL format) into an internal query plan of relational algebra. The query optimizer
then rewrites the query plan, generates possible execution plans, and chooses the best plan
for further processing. The query processor or execution engine executes the given plan to
compute the query result. In this section, we give a brief overview of query optimization and query execution, together with different storage models.
[Figure: SELECT … FROM … WHERE … → Query Parser → Query Optimizer → Query Processor.]
Figure 2.1: Query Processing Order.
2.1.1 Query Optimization Techniques
Query optimization defines the steps taken from a naive query plan based on an input query to a highly optimized query execution plan (QEP). Query optimization has always been a key aspect of database research, resulting in a large variety of different approaches. We give a brief overview of the aspects of query optimization that are important to the work in this thesis. This includes logical and physical query optimization, as well as cardinality estimation.
Query Optimization: As shown in Figure 2.1, the query optimizer receives a parsed query plan from the parser, with the goal of determining the most efficient plan for execution. To
achieve that goal, logical and physical query optimization is applied. Figure 2.2 shows an ex-
ample query with two joins and one selection.
Logical query optimization applies data-independent rewriting rules for operator reordering, like query unnesting, redundancy removal, selection and projection moves, and cross-join removal [Garcia-Molina et al. 2000]. For example, this could mean applying a selection early to reduce the number of intermediate result tuples and avoid unnecessary computation of tuples that are not part of the final result (Figure 2.2b).
!" #"
$%&'"
#()*"
+"
$%&'"
(a) Input Query Plan.
!"
#"
$%&'"
#()*"
+"
$%&'"
(b) Simple Rewrites.
!"
#"
$%&'"
()*+"
("
$%&'"
(c) Join Enumeration.
!"#$%
&%
'$()*%
+%
'$()*)(%
,-%./0$%
!)12%
!"#$%
!%
3#45%
./0$%
(d) Physical Plan.
Figure 2.2: Query Optimization for an example query with three tables, two joins, and one
selection.
Physical query optimization uses data characteristics and cardinality estimation to determine join enumerations and physical operator instances. Multiple join orders need to be evaluated, as it is not trivial to find the order with the least intermediate results and the lowest runtime. For most queries, it is impractical to evaluate all possible join orderings, as the search space is too large. There are two main approaches to optimize these orderings. A greedy approach defines the operator order by starting with one operator that produces the smallest result, before adding more operators with the same objective [Fegaras 1998]. This is a fast way to produce an operator ordering; however, combinations that produce a large result in the beginning but much smaller results in the end are not considered. To find the optimal plan, dynamic programming was proposed: it considers all promising combinations, even if the initial intermediate results are large [Selinger et al. 1979]. There, the problem is the large memory footprint and the large amount of computation. The resulting join order can be applied to the plan structure (Figure 2.2c). Finally, logical operators are replaced with physical instances; e.g., a join can be instantiated physically with different kinds of join implementations, like a hash join or a nested loop join (Figure 2.2d). The decisions are usually based on cardinality estimations and the data properties; e.g., an indexed nested loop join can be preferred if the data is indexed or sorted.
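To make the greedy approach concrete, the following sketch (hypothetical code, not taken from a specific system) repeatedly merges the pair of remaining relations whose join has the smallest estimated result; the estimates would come from the cardinality estimation described next:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RELS 8

/* Estimated cardinality of joining two (intermediate) relations; a real
 * optimizer would derive the selectivity from statistics. */
static uint64_t join_card(uint64_t l, uint64_t r, double sel) {
    return (uint64_t)(l * r * sel);
}

/* One possible greedy variant: repeatedly merge the pair of remaining
 * (intermediate) relations whose join has the smallest estimated result.
 * 'card' holds current cardinalities, 'sel' the pairwise selectivities. */
void greedy_join_order(size_t n, uint64_t card[], double sel[][MAX_RELS]) {
    bool merged[MAX_RELS] = { false };
    for (size_t step = 1; step < n; ++step) {
        uint64_t best = UINT64_MAX;
        size_t bi = 0, bj = 0;
        for (size_t i = 0; i < n; ++i)
            for (size_t j = i + 1; j < n; ++j) {
                if (merged[i] || merged[j]) continue;
                uint64_t c = join_card(card[i], card[j], sel[i][j]);
                if (c < best) { best = c; bi = i; bj = j; }
            }
        /* join bi and bj next; the merged result replaces bi */
        card[bi] = best;
        merged[bj] = true;
        /* a full implementation would also merge the selectivity rows */
    }
}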
Cardinality Estimation: The physical query optimization steps are based on cardinality estimations of the intermediate results. The cardinalities of the base relations are usually known, whereas the intermediate cardinalities need to be estimated in order to make correct optimization decisions. The main sources of unknown cardinalities are selections, groupings, and joins, whereas most other operators produce pre-known result sizes (e.g., a sorting result is always the size of its input). There are different approaches to cardinality estimation. Histograms over base table attributes are the most common approach, where either each value, a range of values, or only the most common values are stored with their frequency [Kooi 1980]. A histogram is ideal for single selections, as the result size (or selectivity) can easily be taken from the histogram. Usually, independence between the attributes is assumed, and the selectivities are multiplied for multi-attribute selections on the same table. This does not represent the real data if the attributes are not independent. To solve this problem, it is possible to use
!"#$% !"#$%
&'($%
!')*%
*+,-.%
*+,-.%
*+,-.%
$./*%
$./*%
$./*%
(a) Tuple-at-a-time.
!"#$% !"#$%
&'($%
!')*%
+,--%
)./,-*%
+,--%
)./,-*%
%
%+,--%
%)./,-*%
(b) Operator-at-a-time.
!"#$% !"#$%
&'($%
!')*%
+,%*-./01%
+,%*-./01%
+,%*-./01%
$02*%
$02*%
$02*%
(c) Block/Vector.
!"#$% !"#$%
&'($%
!')*%
*+,-.%
/+--%
).0+-*%
*+,-.%
(d) JIT Compilation.
Figure 2.3: Query Processing Models for an example query with one join and one sorting.
attribute histograms [Poosala and Ioannidis 1997], which can successfully model dependent
selectivities, however, multi-attribute histograms are complex to maintain and space intensive.
Another alternative are random samples of table rows [Lipton et al. 1990], which can be con-
sulted for arbitrary predicates and different query orderings. The estimation quality depends
on the sample size and the chosen rows in the sample.
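Returning to histogram-based estimation: the following sketch (hypothetical, assuming an equi-width histogram with 64 buckets and uniformly distributed values within each bucket) illustrates how a selectivity is derived and how the independence assumption combines multiple predicates:

#include <stddef.h>

/* Equi-width histogram over one attribute (hypothetical layout):
 * bucket[i] counts the tuples whose value falls into the i-th range. */
typedef struct {
    double min, max;
    double bucket[64];
    double total;            /* total number of tuples */
} Histogram;

/* Selectivity of the predicate "attr < v", assuming values are
 * uniformly distributed within each bucket. */
double selectivity_less(const Histogram *h, double v) {
    double width = (h->max - h->min) / 64.0;
    double count = 0.0;
    for (size_t i = 0; i < 64; ++i) {
        double lo = h->min + i * width;
        if (lo + width <= v) count += h->bucket[i];                   /* full bucket */
        else if (lo < v)     count += h->bucket[i] * (v - lo) / width; /* partial */
    }
    return count / h->total;
}

/* Independence assumption for a conjunction on the same table:
 *   sel(a < x AND b < y) = selectivity_less(hA, x) * selectivity_less(hB, y)
 * which misestimates the result whenever the attributes are correlated. */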
2.1.2 Query Processing
There are multiple possible processing models for query processing. In general, a database query is transformed into a query plan consisting of database operators, which define the steps
of query processing. Figure 2.3 shows a query plan with four operators, i.e., two table scans, a
join, and a sorting. Given this query plan, there are different processing models to execute the
query.
Tuple-at-a-time processing (Figure 2.3a): This processing method, also known as the Volcano model [Graefe 1994], is based on an iterator model for operators, where each operator supplies a next function, which computes the next result tuple of the given operator. Operators can be connected, where succeeding operators call the next function of preceding operators to retrieve their input data. This model is usually used in row-based database systems; however, it can also be used in column-based systems, where the result tuples are represented as tuple IDs (TIDs). The tuple-at-a-time approach is intuitive, and intermediate results are kept small. However, the execution of this approach is not optimized for modern processors. Each next call results in a function-pointer lookup and a jump to a different code location, which causes instruction cache misses. As each operator keeps its own state, data cache misses are produced, which result in additional loads from main memory. Frequently changing the computation also results in many branch mispredictions, preventing the processor from using its pipeline efficiently.
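A minimal sketch of this iterator interface in C (hypothetical names); the function-pointer call per tuple is exactly the overhead discussed above:

#include <stdbool.h>

typedef struct Tuple { int values[4]; } Tuple;   /* simplified fixed-width tuple */

/* Volcano-style operator: next() returns true and fills *out,
 * or returns false at end of stream. */
typedef struct Operator {
    bool (*next)(struct Operator *self, Tuple *out);
    struct Operator *child;
    void *state;                                  /* operator-specific state */
} Operator;

/* Selection: pull tuples from the child and forward only matching ones.
 * Note the function-pointer call per input tuple. */
static bool selection_next(Operator *self, Tuple *out) {
    while (self->child->next(self->child, out)) {
        if (out->values[0] < 42)                  /* hypothetical predicate */
            return true;
    }
    return false;
}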
Operator-at-a-time processing (Figure 2.3b): Instead of producing single tuples, operator-at-a-time processing produces whole operator results [Boncz 2002]. With this approach, operator computations are done in a tight loop, enabling the processor to keep data and instructions in caches, while allowing ideal branch prediction and pipelining. Also, the function call overhead becomes insignificant, as only one function per operator is called, instead of one function per operator per tuple. The main problem of the operator-at-a-time approach is the full materialization of intermediate results, which causes cache misses when switching to a different operator; the intermediate results are therefore mainly read from main memory.
Block or Vector Processing (Figure 2.3c): This approach defines a middle path between the two previous approaches. Instead of returning one tuple per next call, the operators return a block of tuples (e.g., 1000 tuples), but not the full result. This reduces the function call overhead significantly, while the small block of data can stay inside the CPU cache for multiple operators working on the same block. The approach is called block-oriented processing [Padmanabhan et al. 2001] for the row-based storage format, where blocks contain multiple complete tuples. With column-based storage, this approach is called vector-at-a-time processing [Boncz et al. 2005], and the operators return vectors (partitions) of table columns.
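A corresponding sketch of the vector-at-a-time interface (hypothetical names; one function call amortizes over up to 1000 tuples):

#include <stddef.h>

#define VECTOR_SIZE 1000   /* tuples per call; small enough to stay in cache */

/* Vector-at-a-time operator: next() fills 'out' with up to VECTOR_SIZE
 * values of one column and returns the number produced; 0 = end of stream. */
typedef struct VecOperator {
    size_t (*next)(struct VecOperator *self, int out[VECTOR_SIZE]);
    struct VecOperator *child;
} VecOperator;

/* Selection on a vector: the tight inner loop pipelines and vectorizes
 * well, in contrast to one function call per tuple. */
static size_t vec_selection_next(VecOperator *self, int out[VECTOR_SIZE]) {
    int in[VECTOR_SIZE];
    size_t n;
    while ((n = self->child->next(self->child, in)) > 0) {
        size_t m = 0;
        for (size_t i = 0; i < n; ++i)
            if (in[i] < 42) out[m++] = in[i];    /* hypothetical predicate */
        if (m > 0) return m;   /* retry if the whole vector was filtered out */
    }
    return 0;                  /* child exhausted */
}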
JIT Query Compilation (Figure 2.3d): This approach compiles parts of the query operators together into one function (pipeline) [Neumann 2011]. The execution is similar to the tuple-at-a-time approach, while the just-in-time (JIT) compilation removes all function call overhead by combining the computation in one single function. This is as cache efficient as the operator-at-a-time or vector-at-a-time processing, while avoiding most intermediate results. Only if an operator cannot be executed in a pipeline (e.g., a hash join that needs a full hash table as input) are multiple pipelines created. This results in multiple compiled functions, whose intermediate results are fully materialized. However, this only happens in a few cases, compared to materializing after every operator execution in the operator-at-a-time approach.
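To illustrate the shape of a compiled pipeline, the following hand-written function fuses a scan, a selection, and the probe side of a hash join into one loop; it is a sketch of the code a JIT compiler could generate, with hypothetical names and an assumed hash table lookup:

#include <stddef.h>

/* Hypothetical hash table built by a preceding pipeline (the hash join's
 * build side is a pipeline breaker and fully materialized). */
typedef struct HashTable HashTable;
extern const int *ht_probe(const HashTable *ht, int key);  /* assumed lookup */

/* One fused pipeline: scan -> selection -> hash-join probe -> emit.
 * No intermediate result is materialized inside the loop. */
size_t probe_pipeline(const int *scan_col, const int *key_col, size_t n,
                      const HashTable *ht, int *out_left, int *out_right) {
    size_t m = 0;
    for (size_t i = 0; i < n; ++i) {
        if (scan_col[i] >= 42) continue;          /* hypothetical selection */
        const int *match = ht_probe(ht, key_col[i]);
        if (match) {                              /* join partner found */
            out_left[m]  = scan_col[i];
            out_right[m] = *match;
            ++m;
        }
    }
    return m;
}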
There is no single best processing model; the ideal choice usually depends on the database workload, the underlying hardware architecture, and the used storage model.
2.1.3 Storage Models
In addition to query processing models, there are different data storage models. Figure 2.4 shows the three most widely used models, which we explain briefly in this section.
Row-oriented Storage Model. In the row-oriented storage model, also known as the N-ary Storage Model (NSM) [Ramakrishnan and Gehrke 2003], database tuples are stored one tuple after another, including all values belonging to one table entry (table row). This is beneficial if only one tuple is altered or tuples are added, as they only need to be appended to the end of the row store. Also, logically associated information is stored together, making the retrieval of the full tuple context efficient. Single tuple operations are common in Online Transaction Processing (OLTP).
!"#$%"&
!"#$% &'$% ()*+$%
,*-% ..% /012%
345"% 01% 012.%
6"5#4+% 11% 12.7%
849)"$:% /;% 2.7<%
!=8%>%?"-:$%($*@:$A%
B,*-C%..C%/012DC%B345"C%01C%012.DC%
B6"5#4+C%11C%12.7DCB849)"$:C%/;C%2.7<D%
%
E=8%>%?"-:$%($*@:$A%
!"#$A% %,*-C%345"C%6"5#4+C%849)"$:%
&'$A%% %..C%01C%11C%/;%
()*+$A% %/012C%012.C%12.7C%2.7<%
?"-:$%F%:*'49":%G4$H% (&I%>%?"-:$%($*@:$A%
("'$%/A%
!"#$A% %,*-C%345"%
&'$A%% %..C%01%
()*+$A% %/012C%012.%
%
("'$%0A%
!"#$A% %6"5#4+C%849)"$:%
&'$A%% %11C%/;%
()*+$A% %12.7C%2.7<%
%
Figure 2.4: Examples for different storage models.
Column-oriented Storage Model. In column-oriented storage [Abadi et al. 2013; Stonebraker et al. 2005], or the Decomposition Storage Model (DSM), a database table is split into columns, which are stored separately. Usually, a tuple ID or the position in the column defines the tuple relation, e.g., the third entries of all columns assemble the third tuple of the table. Inserting a tuple is more complex, as this results in one insert per column. Also, point queries for complete tuples have to access all columns. On the other hand, the storage is more efficient, as all values in a column are of the same kind (numbers, fixed-size strings, etc.), which allows compression and vectorized processing. Additionally, OLAP (Online Analytical Processing) queries like "return the average age of all people" do not have to access the whole table People in Figure 2.4 but only the column Age.
Partition Attributes Across (PAX) Model. PAX [Ailamaki et al. 2002] is a combination of the previous approaches: the complete tuples of a table are logically ordered in NSM format. Naturally, the data spans multiple memory pages. For the PAX approach, the tuple data within each page is converted from NSM to DSM format. Therefore, the individual pages contain full tuples, but these tuples are stored column-wise within the page. This approach achieves a trade-off: tuples are stored close together to accelerate point queries, while data can still be compressed and accessed cache-friendly in a column format. However, even with efficient column-oriented data access, every page has to be loaded for OLAP queries that aggregate a whole column. In contrast, with the presented DSM approach, only pages containing the queried column need to be loaded.
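The three layouts can be sketched as C data structures for a hypothetical People(Name, Age, Phone) table (all sizes are arbitrary illustration values):

#include <stddef.h>

/* NSM (row store): complete tuples stored one after another. */
typedef struct { char name[16]; int age; long phone; } PersonRow;
PersonRow nsm_table[1024];

/* DSM (column store): one separate array per attribute; the i-th entries
 * of all arrays form the i-th tuple. */
struct {
    char name[1024][16];
    int  age[1024];
    long phone[1024];
} dsm_table;

/* PAX: tuples are assigned to pages in row order, but within each page
 * the values are stored column-wise (one mini column per attribute). */
#define TUPLES_PER_PAGE 4   /* arbitrary page capacity for illustration */
typedef struct {
    char   name[TUPLES_PER_PAGE][16];
    int    age[TUPLES_PER_PAGE];
    long   phone[TUPLES_PER_PAGE];
    size_t used;
} PaxPage;
PaxPage pax_table[256];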
Beyond the presented storage models, there are a number of specializations like key-value-based storage [Seeger and Ultra-Large-Sites 2009] or graph-based storage [Jouili and Vansteenberghe 2013]. The ideal storage model depends on the database workload, e.g., OLTP or OLAP, and the chosen query processing model.
2.2 HARDWARE HETEROGENEITY
In this section, we present background information on hardware aspects like different CU architectures, different ways of connecting these CUs, and how to program them in a universal way. At the end, we present a list of the CUs used throughout this thesis.
2.2.1 Computing Unit Architectures
To describe different CUs, we choose three general architectures and discuss their advantages
and limitations for different workloads. The three architectures are common CPUs, GPUs, and Many Integrated Core (MIC) systems. We choose these architectures as examples because of their general availability and programmability. Also, all CUs used in this thesis belong to these three architectures. To describe the different architectures, we use the following criteria:
1. Single Core Architecture. Most CUs consist of multiple cores, however, the single core
architecture can be different in complexity through internal optimizations like pipelin-
ing, branch prediction, or specialized instructions.
2. Single Core Performance. The single threaded execution performance differs from CU
to CU. The single core performance and the parallelism define the full CU performance.
3. Single Core Independence. There are two options: a single core can act independently of the other cores, or the execution has to be synchronized between the cores of a CU, allowing no core independence.
4. Memory Bandwidth. Different CU architectures use different kinds of caches, memory
access optimizations, and have different memory bandwidths.
5. Vectorization. CUs usually differ in the support and width of vectorization, i.e., per-
forming vector operations within one clock cycle.
6. Parallelism. The architectures differ in their targeted degree of parallelism indicated by
the number of cores and the support of hyper-threading.
The criteria are not exhaustive; however, they give a guideline for describing each architecture and allow an architecture-based comparison in the end.
Central Processing Unit (CPU): Figure 2.5a shows an example architecture of a CPU. The
CPU chip consists of multiple cores and usually a last level cache shared by all cores. The cores
themselves include another cache for data and instructions, registers, an instruction unit for
loading and decoding instructions, and a logic processor for the actual execution. Additionally,
the cores usually allow instruction pipelining with deep pipelines, out-of-order execution, and
branch prediction in order to improve single thread performance. Commonly, frequencies are
high and hyper-threading allows single cores to execute two threads at once in order to mask
memory stalls. Modern SIMD registers and instructions currently allow vectorization to a
width of 256 bit. This results in a high core complexity, but also a high performance of a single core.
[Figure: (a) Common CPU Architecture — multiple cores with caches, registers, and instruction units plus a shared last-level cache; (b) Independent CPU Threads.]
Figure 2.5: CPU architecture and thread execution.
Separate threads can execute instructions independently of each other; ideally, one thread is executed per core (or two threads per core with hyper-threading, see Figure 2.5b). In general, CPUs have a large memory hierarchy with multiple coherent caches to improve the memory bandwidth to the main memory. If the caches cannot be used, memory access is limited to the bandwidth of the off-chip main memory.
Graphics Processing Unit (GPU): Figure 2.6a shows an example architecture of a GPU.
Here, cores are streaming multi-processors (SM) containing many smaller processors (P) and
attached registers (R), with usually 32 or 64 processors per multi-processor. The smaller pro-
cessors share one instruction unit and therefore have to execute the same instruction on all
processors, while different data can be used (SIMT - Single Instruction, Multiple Threads).
The processors themselves are simpler than their CPU counterparts, because of the shared in-
struction unit but also because of the absence of pipelining, out-of-order execution, branch
prediction, and SIMD vectorization per processor. Threads are grouped in blocks, which are
assigned to a certain multi-processor. Additionally, the blocks are logically organized in a grid.
Branching is handled by computing all needed branches on the multiprocessor sequentially,
while single threads only execute the branch they need depending on their data and thread
number. Memory stalls are handled by switching whole groups of threads whenever threads
are waiting for data. Shared memory within a multi-processor is usually self-managed scratch-
pad memory without cache coherence, making the whole chip design simpler and more energy
efficient. Modern GPUs usually have a last-level cache for all multi-processors similar to the
CPU and off-chip device memory similar to the CPU’s main memory. The main difference
here is the memory bandwidth to the device memory, which can be up to 10× greater than that of the CPU. To achieve these high bandwidths, memory accesses need to be coalesced for multiple requests from different threads, e.g., by letting a group of threads load data from the same memory region.
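As a sketch of this difference (hypothetical OpenCL kernels): in the first kernel, adjacent threads read adjacent addresses, which the hardware can merge into few wide memory transactions; in the second, a stride breaks coalescing and the effective bandwidth drops sharply:

/* Coalesced: thread i reads element i; adjacent threads access adjacent
 * addresses, which the hardware merges into few wide memory transactions. */
__kernel void copy_coalesced(__global const int *in, __global int *out) {
    int i = get_global_id(0);
    out[i] = in[i];
}

/* Strided: adjacent threads access addresses 'stride' elements apart, so
 * the accesses cannot be coalesced and effective bandwidth drops sharply. */
__kernel void copy_strided(__global const int *in, __global int *out,
                           int stride) {
    int i = get_global_id(0);
    out[i] = in[i * stride];
}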
!"#$%&'()*++(',-,
.*/0+1*'+,
!"#$%&'()*++(',2,
34+1'")$(4,
5401,
.*/0+1*'+,
6708,
!"#$%& ()*++(',9,
34+1:,
5401,
&,
;7<'*=,!*>('?,
.,
&, &,
., .,
@<+1,@*A*#,6<)7*,
B*A0)*,!*>('?,
(a) Common GPU Architecture.
!" #"
$%&'()*"
!"
+"
$%&'()"
,-./0*"
1&2)"
(b) Grouped GPU Threads.
Figure 2.6: GPU architecture and thread execution.
Many Integrated Core (MIC): The MIC approach tries to bridge the gap between CPUs and GPUs. The basic architecture is similar to a normal CPU (Figure 2.5a). To allow more parallelism than common CPUs, (1) more cores are added, (2) wider hyper-threading is applied, and (3) wider SIMD vectorization is supported. To add more cores, the cores themselves are simplified to reduce chip space and energy consumption. Therefore, the small cores do not support out-of-order execution or branch prediction, and they execute at a reduced frequency. Hyper-threading support increases from two threads per core to four threads per core, and SIMD instructions work on 512 bit. The memory bandwidth usually lies between the CPU and the GPU, through on-board memory similar to GPU systems. The main product based on the MIC architecture is the Intel Xeon Phi.
To summarize, we use our six criteria to compare the CPU, GPU, and MIC architectures. To assign real values to all criteria, we choose one example CU per architecture, as presented in Table 2.1. The values are visualized in Figure 2.7. The CPU has a high single core performance, a highly sophisticated core architecture, and cores that execute independently.

Type                     | CPU                   | GPU          | MIC
Model                    | Intel Xeon E5-2680 v3 | Nvidia P100  | Intel Xeon Phi 7120
Core Performance         | 36 GFLOPS             | 2.5 GFLOPS   | 9.3 GFLOPS
Core Architecture        | high                  | low          | medium
Core Independence        | yes                   | no (32 SIMT) | yes
Memory Bandwidth         | 33 GB/s               | 562 GB/s     | 133 GB/s
Parallelism (cores, HT)  | 12 (×2)               | 3584 (×X)    | 61 (×4)
Vectorization (width)    | 256 bit               | –            | 512 bit

Table 2.1: Specific CU examples for each of the three hardware architectures. Performance and bandwidth values are based on an OpenCL benchmark (see Section 2.2.4).
[Figure: radar charts over the six categories for (a) CPU properties, (b) GPU properties, and (c) MIC properties.]
Figure 2.7: Comparing the CPU, GPU, and MIC properties using six categories. Each architec-
ture is superior in one or more directions. (categories: single core performance, core architec-
ture, core independence, memory bandwidth, parallelism, vectorization)
Drawbacks of the CPU are limited vectorization, limited parallelism, and the low memory bandwidth. The GPU tackles these limits with high parallelization and high memory bandwidth, but with limits in the other categories. MIC systems take the middle path, showing medium core performance and medium parallelization, while being especially good at vectorization and independent execution.
With this comparison, we want to illustrate the differences between the three architectures, while also showing that each architecture has advantages and limitations. No single architecture is superior to the others in all categories; which architecture actually performs better depends on the use case.
2.2.2 Heterogeneous Connections
Multiple CUs can be connected to the main system in different ways. The center of the system is usually the host CPU with a large amount of main memory. For other CUs, like GPUs and MICs, there are two different options: a loosely-coupled or a tightly-coupled connection.
Loosely-coupled systems usually connect a CU to the main system using an external bus. This is common for dGPUs (discrete GPUs), where the GPU is connected via the PCIe bus and does not share resources directly with the CPU. This means the dGPU has its own device memory, and data used on the GPU has to be transferred through the PCIe bus. There are different versions of PCIe, with theoretical bandwidths ranging from 4 GB/s for PCIe1 over 8 GB/s for PCIe2 to 16 GB/s for PCIe3 (all for 16 lanes; the bandwidth decreases with fewer lanes). As an alternative, Nvidia proposed NVLink, with bandwidths between 20 GB/s and 200 GB/s [Nvidia 2014a], to connect Nvidia GPUs to each other and to Power CPUs; however, data still needs to be transferred.
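These bandwidths translate directly into a lower bound on the transfer cost; as a back-of-the-envelope sketch:

/* Lower bound for a bus transfer: size divided by theoretical bandwidth.
 * Real transfers add invocation latency and typically reach only part of
 * the theoretical bandwidth. */
double transfer_seconds(double bytes, double bytes_per_second) {
    return bytes / bytes_per_second;
}
/* Example: transfer_seconds(1e9, 16e9) = 0.0625 s, i.e., moving 1 GB over
 * PCIe3 at 16 GB/s takes at least 62.5 ms. */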
Tightly-coupled systems solve this problem by bringing the CUs closer to the main processor
and the main memory. Examples are integrated GPUs (iGPU) or Accelerated Processing Units
(APUs), where the CU sits on the same die as the CPU, sharing the access to main memory
or even access to CPU caches. Other approaches include Intel's recent MIC version, which uses fast socket interconnects for communication [Hazra 2014]. With these approaches, data does not need to be transferred for data access. However, due to NUMA effects and missing access optimizations for some CUs, it might be better to first transfer data to CU-dedicated memory regions for better performance. Either way, memory bandwidths for all CUs accessing the CPU's main memory are usually lower than in loosely-coupled systems.

[Figure: a host program compiled against the OpenCL library; at run-time, vendor-specific OpenCL drivers (e.g., AMD and Nvidia) compile CL code, transfer data, and control execution on the available CUs (dGPUs, iGPU, CPU).]
Figure 2.8: OpenCL setup consisting of OpenCL library and OpenCL driver.
2.2.3 Open Computing Language
Given different CUs and their architectures, there is a large variety of programming languages and language extensions supporting different kinds of CUs. At this point, we describe the Open Computing Language (OpenCL) in more detail, as OpenCL is supported by many CUs, including all CUs used in this thesis.
OpenCL was standardized in 2008 [Khronos 2011] and the OpenCL interface can be im-
plemented by any hardware vendor to support their computing units, making it possible to
execute OpenCL code on many different CUs. Currently, there is OpenCL support for all major
CPUs, including Intel, AMD, and IBM Power, as well as for all major GPUs, including AMD,
Nvidia, Intel, and ARM GPUs. Additionally, different accelerators are also supported, like the
Intel Xeon Phi, selected Altera FPGAs, or the IBM Cell processor.
Figure 2.8 shows how this support is achieved for many different CUs. A host program
is compiled using the OpenCL library, which provides the OpenCL interface without know-
ing the exact hardware environment. At run-time, the OpenCL library searches for OpenCL
drivers of different vendors. In the example of Figure 2.8, we illustrate vendor-specific drivers
from AMD and Nvidia. The drivers also use the OpenCL interface and can support multiple
CUs, including the actual host CPU. In our example, the host program is compiled without any
hardware knowledge, while at run-time, it is able to use four different CUs. To offload compu-
tation, the specific code is written in so-called kernels, which are compiled by the vendor-specific
drivers at run-time. Afterwards, data transfers and the actual offloaded execution are
possible.
Algorithm 1 OpenCL kernel example.

    __kernel void addTwoValues(__global int * input1,   // kernel function
                               __global int * input2,
                               __global int * output)
    {
        int threadID = get_global_id(0);                 // retrieve thread ID
        output[threadID] = input1[threadID] + input2[threadID]; // add inputs
    }
OpenCL kernels are specific functions for offloading. The kernel function specifies the
computation for one single thread, while the real execution is done in SIMT mode for many
threads, all applying the same computation to different data. Algorithm 1 shows an example.
The __kernel prefix defines the use as OpenCL kernel. Arguments marked with __global are
pointers to memory locations in the so-called global memory of the CU (usually the CU's dedicated
device memory). In our example, the three arguments are arrays in global memory. The
first operation within the function retrieves the absolute thread ID. While all threads perform
this operation, the actual ID will be different. In the second operation, one value from in-
put1 and one value from input2 are summed and stored in the output array. Again, each thread
performs the same operation, but since the thread ID is different, different data locations are ac-
cessed. This kernel can now be offloaded to any CU for computation, provided that the kernel
is first compiled and that the memory objects are allocated and accessible by the used CU.
To schedule kernels, a kernel instance can be queued to a specific CU, given that the kernel
arguments are provided. Kernels can be executed with different numbers of threads and thread-blocks.
A thread-block combines multiple threads to be executed on the same multi-processor,
allowing data sharing in shared memory and thread synchronization between the threads
of a multi-processor. However, there are hardware limitations on how many threads can be
combined in a thread-block, depending on the CU.
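To illustrate this scheduling interface, the following host-side sketch compiles the kernel of Algorithm 1 at run-time and queues it on a given CU. It is a minimal sketch: error handling is omitted, the context, queue, buffers, and kernel source string are assumed to exist, and the total thread count n is assumed to be a multiple of the thread-block size.

    #include <CL/cl.h>

    void run_add_two_values(cl_context ctx, cl_command_queue queue,
                            cl_device_id dev, const char *kernel_source,
                            cl_mem input1, cl_mem input2, cl_mem output,
                            size_t n) {
        /* The vendor-specific driver compiles the kernel at run-time. */
        cl_program program =
            clCreateProgramWithSource(ctx, 1, &kernel_source, NULL, NULL);
        clBuildProgram(program, 1, &dev, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(program, "addTwoValues", NULL);

        /* Provide the kernel arguments (three buffers in global memory). */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &input1);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &input2);
        clSetKernelArg(kernel, 2, sizeof(cl_mem), &output);

        /* Queue the kernel instance: n threads in total, 256 threads per
         * thread-block (the block size is limited by the hardware of the CU). */
        size_t global_size = n, local_size = 256;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                               &local_size, 0, NULL, NULL);
        clFinish(queue); /* wait for the offloaded execution to finish */
    }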
CUDA is similar to OpenCL, as it applies the same SIMT execution style with a similar usage
of kernels and execution patterns. However, CUDA is limited to Nvidia GPUs and there are
some naming differences to OpenCL.
Naming Style: As CUDA, OpenCL, and other programming languages have different names
for similar structures and operations, we provide a short naming guide for the remainder of this
thesis to avoid any confusion. Table 2.2 shows relevant CUDA and OpenCL naming differences
together with the naming we choose in this thesis. The naming we choose is not language spe-
cific but rather intuitive. We note that especially the use of “computing unit” (CU) is different
in our case.
2.2.4 Computing Units used throughout this Thesis
In this section, we present different CUs, to provide specific examples for the presented CU
architectures and heterogeneous connections, as well as to build a reference of all CUs used
throughout this thesis. Each test system in this thesis refers to a CU presented in Table 2.3. For
| CUDA                 | OpenCL         | In This Thesis       |
|----------------------|----------------|----------------------|
| Device               | Device         | Computing Unit (CU)  |
| Multi-processor (SM) | Computing Unit | Multi-processor (SM) |
| Host Memory          | Host Memory    | Host Memory          |
| Device Memory        | Global Memory  | Global Memory        |
| Shared Memory        | Local Memory   | Shared Memory        |
| Registers            | Private Memory | Registers            |
| Kernel               | Kernel         | Kernel               |
| Thread               | work-item      | Thread               |
| Block                | work-group     | Thread-block         |

Table 2.2: Naming reference for CUDA, OpenCL, and the chosen naming in this thesis.
our heterogeneous computing environments, we use mainly CPU-GPU combinations, because
of their general availability and their support for OpenCL.
• CPUs: We use two different CPUs. The AMD CPU shares the chip with an integrated
GPU (iGPU), a combination also known as an APU. The Xeon CPU is a two-socket
system with 12 cores per socket and hyper-threading disabled.
• iGPUs: The integrated GPUs include AMD and Intel variants. While the AMD iGPU is
used in a desktop processor, the Intel iGPU is used in a low-power notebook processor
(i5-4250U).
• dGPUs: We use a variety of five different dGPUs in this thesis. The K20, K80, and GT640
are based on Nvidia’s older Kepler architecture [Nvidia 2014b], while the P100 is a newer
Pascal architecture GPU [Nvidia 2016]. The K80 GPU combines two GPUs on one PCIe
Board, however, to focus on heterogeneous CPU-GPU computation and to avoid resource
contention on the PCIe board, we only use one of the GPUs in our tests. Only the GT640
and the Tahiti GPU are meant for graphics rendering, while the other dGPUs are purely
meant for general purpose computation.
• MIC: Additionally to all CPUs and GPUs, we also work with Intel’s MIC architecture, the
Xeon Phi, which is connected as PCIe card.
All CUs in Table 2.3 support OpenCL with at least version 1.1. Additionally, the CPUs and
the Xeon Phi can be programmed with multiple languages including C, C++, and Java. The
Nvidia GPUs can also be programmed with CUDA.
We want to compare the CUs independently of a database context to show their potential
for later scenarios. Therefore, we use a synthetic benchmarking tool based on OpenCL 1. With
this tool, we test the single and double precision performance, the memory bandwidth to the
CUs global memory, and the transfer bandwidth from the host to the CU memory. The results
are presented in Table 2.4.
1 clPeak: https://github.com/krrishnarraj/clpeak (version a5c4543 on 4 Aug 2016)
| Alias      | Type | Vendor | Model                | #Cores   | Freq. (MHz) | Mem. (GB) | Connection |
|------------|------|--------|----------------------|----------|-------------|-----------|------------|
| CPU        | CPU  | AMD    | A10-7870K            | 4        | 3900        | 32        | host       |
| Xeon       | CPU  | Intel  | Xeon E5-2680 v3      | 2 x 12   | 3300        | 2 x 32    | host       |
| iGPU       | iGPU | AMD    | integrated Radeon R7 | 512      | 866         | 2.2       | direct     |
| Intel iGPU | iGPU | Intel  | HD 5000 (i5-4250U)   | 40       | 1200        | 1.5       | direct     |
| K20        | dGPU | Nvidia | Tesla K20c           | 2496     | 706         | 5         | 4x PCIe2   |
| K80        | dGPU | Nvidia | Tesla K80            | 2 x 2496 | 875         | 2 x 12    | 16x PCIe3  |
| GT640      | dGPU | Nvidia | GT 640               | 384      | 901         | 2         | 16x PCIe3  |
| P100       | dGPU | Nvidia | Tesla P100           | 3584     | 1328        | 16        | 16x PCIe3  |
| Tahiti     | dGPU | AMD    | Radeon HD 7970       | 1792     | 925         | 3         | 16x PCIe2  |
| Xeon Phi   | MIC  | Intel  | Xeon Phi 7120        | 61 (244) | 1333        | 16        | 16x PCIe2  |

Table 2.3: Overview of all used CUs throughout this thesis.
| Alias      | single precision (GFLOPS) | double precision (GFLOPS) | Global Mem Bandwidth (GB/s) | Transfer Bandwidth (GB/s) |
|------------|---------------------------|---------------------------|-----------------------------|---------------------------|
| CPU        | 44.17                     | 23.76                     | 18.21                       | (8.92)                    |
| Xeon       | 863.72                    | 428.98                    | 67.18                       | (7.82)                    |
| iGPU       | 877.18                    | 55.27                     | 25.72                       | 8.78                      |
| Intel iGPU | 492.47                    | —                         | 16.67                       | 2.87                      |
| K20        | 2900.63                   | 1167.59                   | 143.67                      | 1.44                      |
| K80        | 2x 3481.79                | 2x 1403.76                | 2x 168.66                   | 11.38                     |
| GT640      | 601.97                    | 28.81                     | 26.56                       | 7.15                      |
| P100       | 9098.78                   | 4746.14                   | 562.14                      | 12.45                     |
| Tahiti     | 3726.65                   | 919.99                    | 224.23                      | 8.72                      |
| Xeon Phi   | 2262.14                   | 1163.23                   | 133.12                      | 6.84                      |

Table 2.4: Benchmark results for the different CUs.
Performance: We can see that the P100, K80, K20, and Xeon Phi are superior in single and
double precision performance, as they execute the benchmark with a high degree of parallelism.
The P100 shows the best performance as it is the newest CU in our tests. Interestingly, for
most CUs the double precision performance is nearly half of the single precision performance,
except for the GPUs that were optimized for graphics rendering rather than general purpose
computation (AMD iGPU, Tahiti, and GT640). This is probably caused by mapping double
precision computations to single precision execution units, due to missing double precision
execution units. The Intel iGPU does not support double precision at all.
Memory Bandwidth: For the global memory bandwidth, the four high-performance CUs
(P100, K80, K20, Xeon Phi) are leading again, since they have wide memory connections and
usually a high-clocked memory bus. The P100 uses High Bandwidth Memory of the second
generation (HBM2), leading to the fastest memory bandwidth in our tests.
Data Transfer: The results are different for the transfer bandwidth. Here, the two CPUs
(AMD CPU, Xeon) do not need to transfer data before using it; their numbers are only shown
as a result of the benchmark execution, but usually, no data needs to be copied. The iGPUs are
also capable of accessing main memory directly, but we found that it is better to copy the data
first, as this provides higher memory access bandwidths than accessing the data without any
copy operation. When applying the copy operation for the integrated GPUs, data is stored in
a GPU-specific part of the main memory, which ensures that the data cannot be held in CPU
caches. This reduces the GPU-CPU communication during processing and memory access.
Without the copy operation, accessing CPU data from the iGPU leads to cache snooping and
coherency overheads, reducing the GPU's effective memory bandwidth. Besides CPUs and iGPUs,
the K80 and the P100 show the best transfer results, caused by an optimized
PCIe board and a 16 lane PCIe3 connection.
We note that the numbers presented in Table 2.4 are synthetic benchmark results. The real
performance might differ for each CU for several reasons: (1) For some CUs, OpenCL is not
the preferred programming language, leading to different performance compared to the native
languages. (2) Also, the CUs have different components specialized for different computations
(e.g., single or double precision, atomic accesses, branch prediction). Therefore, results could
be different with different benchmarks or real world workloads, where some CUs might excel
with the givenworkloads (e.g., CPUswith heavy branching code), while others have difficulties.
Therefore, we want to emphasize again that the values in Table 2.4 are only for a first glimpse
of the CUs performance, while it is impossible to define the best CU for a given workload using
these numbers.
2.3 QUERY PROCESSING IN HETEROGENEOUS COMPUTING ENVIRONMENTS
In the previous sections, we presented general database query processing and heterogeneous
computing environments. In the following, we present related work on combining both: database
operators and query processing with the use of different CUs. We first give an overview of
single operator implementations that were ported to different CUs, presenting some examples
with no intent to be exhaustive. Afterwards, we look at prototype systems that implement
these operators in different ways to allow query processing on different CUs. Finally, we discuss
aspects of different placement strategies by briefly describing our approaches and related work.
The vast majority of publications on heterogeneous computing is on GPUs. There are mul-
tiple reasons why GPUs are the preferred focus of research: (1) GPUs are generally available
from laptop computers to large HPC clusters; (2) they are easy to program with well-known
programming languages (CUDA, OpenCL), and (3) they have been programmed for over a
decade, producing a variety of guides and documentation. MIC systems emerged later than
GPU programming and are generally found in larger systems, hence there are fewer publications
on MIC than on GPUs. FPGA programming emerged before general purpose GPU
programming; however, the programming is more complex and the general availability is limited
by high prices and software licensing. Other accelerators like IBM Cell or Adapteva's
Epiphany Board [Adapteva 2014] are even less widely used, resulting in fewer publications.
Given the described reasons, we focus our related work overview on GPU-based approaches.
2.3.1 Operator Implementations
Different operator implementations were proposed for GPUs. In general, the operators work
in an operator-at-a-time approach on a column-based data storage.
Selection: There are multiple ways to compute a selection result on GPUs. The simplest
way is to evaluate the selection predicate in parallel and produce a bitmap as result, indicating
which rows satisfy the selection [Wu et al. 2010]. A position in the bitmap corresponds
to a row number in the column. Multiple bitmaps can be concatenated to resemble multiple
selections on the same table. Another approach is the immediate materialization of selection
results using a prefix sum computation to return a compact list of tuple IDs (TIDs) [He et al.
2009].
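A minimal sketch of the bitmap-producing variant is shown below. It is our own illustration, not the implementation of [Wu et al. 2010]: we assume a simple less-than predicate and, for clarity, write one byte per row instead of packing eight rows into one byte.

    __kernel void select_less_than(__global const int *column,
                                   __global uchar *bitmap,
                                   const int constant_value) {
        size_t row = get_global_id(0);           /* one thread per row */
        bitmap[row] = (column[row] < constant_value) ? 1 : 0;
    }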
Join: A number of GPU join algorithms were proposed [He et al. 2008], including a non-
indexed nested loop join, an indexed nested loop join, a sort-merge join, and a hash join. For
all join algorithms on the GPU, the unknown result size is a problem with the cross product
as the worst case scenario. In most cases, the result size of the cross product is too large for
the limited GPU memory. Therefore, the authors propose to run the join in three stages as
shown in Figure 2.9: (1) computing the join result size for every thread by pre-computing the
join, (2) combining the single result sizes using a parallel prefix sum and allocating the needed
memory for the result, and (3) computing the join a second time, while now the results can
[Figure 2.9: GPU optimized join approach.]
be written to the calculated positions. Following this approach, the join is essentially computed
twice, while all steps can be done highly parallel. For highly parallel CUs like GPUs, this trade-
off increases the join performance and allows join computing as long as the result fits in GPU
memory.
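The following sketch illustrates the three stages for a non-indexed nested loop join. It is our own simplified illustration, assuming one outer row per thread and a separate parallel prefix sum over the counts between the two kernel launches:

    /* Stage 1: each thread pre-computes the join and counts its matches. */
    __kernel void count_matches(__global const int *outer,
                                __global const int *inner,
                                const int inner_size,
                                __global int *counts) {
        size_t tid = get_global_id(0);
        int count = 0;
        for (int i = 0; i < inner_size; i++)
            if (outer[tid] == inner[i]) count++;
        counts[tid] = count;
    }

    /* Stage 2 (between the kernels): a parallel prefix sum over counts
     * yields each thread's write offset and the total result size,
     * which is then allocated. */

    /* Stage 3: compute the join again, writing to the known offsets. */
    __kernel void write_matches(__global const int *outer,
                                __global const int *inner,
                                const int inner_size,
                                __global const int *offsets,
                                __global int *result_outer,
                                __global int *result_inner) {
        size_t tid = get_global_id(0);
        int pos = offsets[tid];
        for (int i = 0; i < inner_size; i++)
            if (outer[tid] == inner[i]) {
                result_outer[pos] = (int)tid; /* outer row ID */
                result_inner[pos] = i;        /* inner row ID */
                pos++;
            }
    }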
Sort: There are different approaches to sort with high parallelism. The two most common
approaches are the radix sort [Dobosiewicz 1978; Davis 1992] and the bitonic sort [Batcher
1968]. While the bitonic sort is a perfect fit for highly parallel execution on the GPU or MIC, it
has a complexity of O(log²(n)) with a fixed number of comparisons, independent of the actual
data. The radix sort performance is also independent of the data, while having a complexity of
O(n). However, the parallelization of the radix sort is more complex, because threads have to
communicate in order to find the right location for copied keys. Merrill and Grimshaw describe
an efficient implementation of a parallel GPU radix sort [Merrill and Grimshaw 2010].
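To show why the bitonic sort maps so well to SIMT execution, the following sketch contains one compare-exchange step of a bitonic sorting network (our own illustration, assuming the input size n is a power of two). The host launches this kernel repeatedly with varying parameters k and j, which yields the fixed, data-independent number of comparisons mentioned above:

    /* One compare-exchange step of a bitonic sorting network. The host
     * launches this kernel for k = 2, 4, ..., n and j = k/2, k/4, ..., 1. */
    __kernel void bitonic_step(__global int *data, const uint j, const uint k) {
        uint i = get_global_id(0);
        uint partner = i ^ j;          /* index of the compare partner */
        if (partner > i) {             /* each pair is handled once */
            int ascending = ((i & k) == 0);
            if ((data[i] > data[partner]) == ascending) {
                int tmp = data[i];     /* swap into the required order */
                data[i] = data[partner];
                data[partner] = tmp;
            }
        }
    }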
Grouping: Grouping on highly parallel CUs can be done in two ways. For the sort-based
grouping, data can be sorted first, using one of the presented approaches, before applying a
parallel reduction to remove duplicates [He et al. 2007]. The hash-based grouping uses a global
hash table, while executing the hash table inserts in parallel.
Aggregation: Aggregation is either done directly with the grouping in a combined operator
(sort-based or hash-based), or it is implemented as parallel reduction [He et al. 2007].
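As an illustration of such a parallel reduction, the following sketch sums the values of each thread-block in shared memory (OpenCL: local memory); the per-block partial sums are then combined in a second kernel launch or on the host. We assume a power-of-two thread-block size and an input size that is a multiple of it:

    /* Parallel sum reduction per thread-block (illustration only). */
    __kernel void reduce_sum(__global const int *input,
                             __global int *partial_sums,
                             __local int *scratch) {   /* shared memory */
        size_t lid = get_local_id(0);
        scratch[lid] = input[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);
        /* tree-based reduction within the thread-block */
        for (size_t stride = get_local_size(0) / 2; stride > 0; stride >>= 1) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partial_sums[get_group_id(0)] = scratch[0];
    }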
2.3.2 Database Systems
Besides single operator implementations on different CUs, there are database systems that al-
low query processing on GPUs or other CUs. Again, most systems execute in an operator-at-a-
time and column-at-a-time approach. In the following, we present these systems in detail.
gpuDB [Yuan et al. 2013] is a prototypical query execution engine to process OLAP queries,
although it only supports SSB queries [O'Neil et al. 2009]. During an offline compilation step,
each query is compiled into a binary for later use. gpuDB can either use CUDA to support
Nvidia GPUs or OpenCL for a wider application. Despite the support for OpenCL, heteroge-
neous query processing is impossible as one query can only run on one CU, which has to be
pre-defined manually before the execution.
Ocelot [Heimel et al. 2013] is an OpenCL extension to MonetDB [Boncz et al. 2008]. It is able
to process nearly arbitrary queries on different CUs by altering the MonetDB query plan, which
makes MonetDB hand off certain operators to Ocelot for external computation. Ocelot then
uses OpenCL to execute the operator on a given CU. Like gpuDB, a query can only run on a sin-
gle CU and Ocelot is currently also limited to one query at a time. Intermediate data is cached
on the CU between operators and base data might be cached even beyond one query. The latter
has the potential to influence succeeding queries, either by reusing cached data from a previous
query and therefore avoiding data transfer, or by having to evict unneeded data to make
memory space available. Ocelot was originally evaluated using TPC-H queries [TPC 2014].
gpuQP [Fang et al. 2007; He et al. 2009] implements traditional CPU operators alongside
CUDA-based GPU operators. Internally, a set of primitive operations is implemented includ-
ing map, scatter, gather, prefix scan, split, sort, reduce, and filter. Subsets of these primitives are
combined into larger database operators. gpuQP also includes a cost estimator and placement
optimization, defining which operator executes on which CU.
CoGaDB [Breß 2014] is similar to gpuQP as it keeps two sets of operators, one implemented
in C/C++ for the CPU and one implemented in CUDA for Nvidia GPUs. Operators are imple-
mented directly, without the use of primitives. A hybrid query optimizer called HyPE [Breß
2013] is used for physical optimization and operator placement.
pgStrom [Kohei 2015] is an extension to Postgres [Stonebraker and Kemnitz 1991] accelerat-
ing joins and sortings with GPUs. To include GPUs into Postgres, pgStrom implements custom
operators based on OpenCL to offload the work. The row format of Postgres is transformed
into column format and cached in a columnar cache for later use on the GPU. Query code is
compiled at run-time in order to merge GPU operators into a small number of OpenCL kernels.
OmniDB [Zhang et al. 2013] is based on gpuQP and OpenCL. The system consists of a cost
model, execution engine, and scheduler, while different hardware environments are supported
by an adapter infrastructure. For every environment, an adapter needs to be provided in order
to support it, e.g., a CPU-only adapter, a CPU-GPU adapter, or an APU adapter. The goal is to provide
a more optimized platform for operator implementations and specific cost functions for these
operators.
GPUTx [He and Yu 2011] is an OLTP database system based on GPU execution using CUDA.
The system uses a bulk execution model of small OLTP queries to combine multiple queries in
a single task, which is then executed on the GPU in parallel. The computation model is specif-
ically designed for the GPU, not supporting a dynamic reassignment of the execution location.
Alenka [Marks 2017] is a database system working on highly compressed data on the GPU in a
vector-at-a-time approach. The compression reduces the data size but adds additional computational
overhead. For GPUs, this reduces data transfer times, while the computational overhead
can be compensated by the highly parallel processing.
Virginian [Bakkum and Chakradhar] is a research prototype using only a single GPU kernel,
within which the operator computations are switched. It works on vertically partitioned data,
so called tablets, which can be moved between CPU and GPU, to allow the query to work with
data beyond the GPU memory.
There are multiple trends visible in the presented systems. Some systems extend existing
database systems (Ocelot, pgStrom) essentially reusing core database components, while the
other systems are implemented from scratch to support heterogeneous hardware, showing that
both approaches are possible. Also, there are systems supporting only CUDA (gpuQP, CoGaDB,
GPUTx), which limits them to Nvidia GPUs, while other systems build upon OpenCL.
OpenCL is widely available on different CUs, making these systems extensible. A detailed survey
on GPU-accelerated database systems is presented by Breß et al. [Breß et al. 2014].
2.3.3 Placement Strategies in Heterogeneous Computing Environments
Based on the ported operators and the database systems supporting heterogeneous execution,
we now have to define how to use the operators in heterogeneous environments. Having multi-
ple options for execution, e.g., an operator ported to the GPU and the traditional CPU-based
operator, the execution location has to be defined. We revisit the three different placement
strategies from Section 1, and discuss them in more detail. The strategies are:
(1) A collaboration approach, where all available CUs work on the same operator, each
on a different data partition (intra-operator parallelism). The original idea was presented in
GAMMA [DeWitt et al. 1986], where data is split and processed on multiple homogeneous
processors. For heterogeneous systems, the performance of an operator differs for each CU,
leading to skewed partition sizes when using all CUs. Delorme [Delorme et al. 2013] evaluated
this approach for the radix sort on AMD APUs. We go further and propose a general partition-
ing function, list possible limitations, and evaluate two operators on two GPUs in Section 3.1.
(2) A second approach is a static decision that one specific operator is executed on one
specific CU. This has the consequence that the operator needs to be highly optimized for the
chosen CU, as the static decision should provide the ideal performance for every possible exe-
cution of this operator. At the same time, this high optimization can only be done for one or
a few operator-CU combinations with reasonable effort. Nearly all ported operators to a single
CU follow this approach and show impressive performance results [Wu et al. 2010; He et al.
2009, 2008; Merrill and Grimshaw 2010; He et al. 2007]. However, it is usually not reported
how much optimization effort is needed to achieve good results in all scenarios. We evaluate
this approach in detail by porting a hash-based group-by operator to the GPU in Section 3.2.
(3) Instead of statically deciding the execution placement, a dynamic placement approach
is possible. There, operators can be executed on a number of CUs and the execution location
is chosen in a best-fit manner. Three of the presented database systems already
apply this approach for their operator execution. He et al. place their primitives on different
CUs [He et al. 2009], while CoGaDB places full query operators [Breß 2014]. PgStrom [Kohei
2015] provides a cost estimation of its operators to Postgres, which then decides between CPU
execution within Postgres or GPU execution using pgStrom. In Section 3.3, we evaluate the
potential of this approach using the same group-by operator from the static approach. We exe-
cute this operator on different CUs to evaluate if dynamic placement decisions can be used to
improve query performance.
We choose the third approach to be investigated in more detail. To automate dynamic
placement decisions, two main components are necessary: A cost model to assess each oper-
ator’s performance on each CU and an optimization approach to find the best heterogeneous
placement based on the operator runtime estimations. The latter needs to consider multiple
operators and their data sharing in order to find the best overall placement. In HyPE [Breß
2013], the authors use a learning-based approach for runtime estimation and algorithm selec-
tion. A learning model is trained during query execution, while the estimation is done using
spline interpolation [Breß et al. 2012]. Given the estimated runtime, the optimization process
then determines the best placement for each operator in the query. The placement optimizer
used in gpuQP [He et al. 2009] works on primitive operations. The runtime of these primitives
is calculated by adding (1) the input data transfer time from main memory to the GPU, (2) the
GPU computation, and (3) the result transfer time. The computation time is a combination of
memory accesses and pure computation. For each primitive, a complexity function needs to
be provided stating the number of memory accesses for input and output data. Additionally,
computation is tested by micro-benchmarks and is recorded as cost-per-tuple. The amount of
used tuples allows the complexity function and the cost-per-tuple value to estimate the primi-
tive's runtime. For the optimization, He et al. [He et al. 2009] divide the query into sub-sets of
10 primitives, evaluate all possible placements for each sub-set, and choose the one
with the fastest runtime. The sub-sets are used to reduce the search space, as evaluating all
placements for full queries would be infeasible. Breß et al. [Breß et al. 2016] decide
data placement first, placing often used data on the GPU. Afterwards, operators using this data
can execute on the GPU, without additional data transfers. We investigate both, an operator
runtime estimation model and possible optimization algorithms in Chapter 4.
Furthermore, we identified cardinality estimation to be one of the key dependencies for
defining heterogeneous placements, because the operator runtimes and the transfer costs are
estimated according to the expected intermediate cardinalities. Therefore, inaccurate cardi-
nality estimations can change the chosen heterogeneous placement and influence the overall
runtime. Leis et al. [Leis et al. 2015] report significant errors in the cardinality estimation of
different database systems, creating the need for placement optimization to become independent
of these cardinality estimates. We investigate this challenge in Chapter 5, together
with an extensible way to integrate placement optimization into multiple database systems.
3 APPROACHES TO UTILIZE HETEROGENEOUS ENVIRONMENTS

3.1 Approach I: Intra-Operator Parallelism on different CUs
3.2 Approach II: Static Placement
3.3 Approach III: Dynamic Placement

Database systems have to adapt to heterogeneous environments. However, the right approach
for this adaptation is unclear and has to be investigated. In this chapter, we
discuss three different approaches for utilizing heterogeneous computing environments for
database query processing. For each approach, we choose an isolated use-case, investigate the
potential and limitations, and reason on its wider applicability. The three approaches are in
detail:
• Intra-Operator parallelism on different CUs. We choose two database operators, selec-
tion and sorting, in a fork-join approach, where input data is partitioned and the operator
is executed concurrently on multiple CUs.
• Static placement. The idea is to atomically execute an operator on one CU by statically
defining an operator-CU combination. By knowing the exact CU for an operator,
in-depth code optimization and parameter optimization become possible. We evaluate
this approach by investigating a hash-based group-by implementation on the GPU.
• Dynamic placement. In contrast to the previous approach, database operators can be
mapped dynamically on CUs, depending on a best-fit placement. For this approach, the
focus is less on code optimization and more on defining the dynamic placement. We
evaluate this approach by executing a hash-based group-by operator on eight different
CUs and define situations, where the execution location should be changed.
All three approaches are presented in the following sections, including a detailed evalua-
tion and a discussion of general applicability. We show that the first approach has several lim-
itations and potential overheads, while the possible performance improvement is small. The
second approach executes operators atomically, avoiding the limitations of the first approach.
To allow the static decision, the code needs to be optimized for every possible use-case. We
show that these optimizations result in tailor-made code implementations, which are not easily
portable to different CUs or different operators. The third approach performs dynamic place-
ment, allowing to switch the CU for certain use-cases if the performance can be improved by
switching. We consider the third approach as the most promising one and discuss the needed
steps towards a fully automatic placement optimizer for whole database queries.
3.1 APPROACH I: INTRA-OPERATOR PARALLELISM ON DIFFERENT CUS
Homogeneous systems consisting of multiple CPUs can be utilized by using uniformly parti-
tioned data for all available CUs. The original idea was presented in GAMMA [DeWitt et al.
1986] as dataflow parallelism, where data is split and processed on multiple homogeneous pro-
cessors. There, data partitioning is simple, while skew in the data values, data transfers, and
result merging already complicate the approach.
We want to evaluate the same approach for heterogeneous systems in a fixed scenario.1
Heterogeneous systems combine different CUs, like CPUs and GPUs, with different architec-
tures, memory hierarchies, and interconnects, leading to different execution behaviors. The
original approach consists of (1) data partitioning, (2) parallel operator execution, and (3) re-
sult merging. For homogeneous systems, uniform partitioning is usually sufficient for the uni-
form computational performance, while heterogeneous systems need a non-uniform partition-
ing, depending on the execution performance of the CU, the operator, and used data sizes. We
analyze this partitioning and the resulting performance for two operators, selection and sort-
ing, with two different heterogeneous systems to evaluate the advantages and disadvantages.
We present performance insights as well as occurring limitation to intra-operator paral-
lelism in heterogeneous environments. As a result, we show that the actual potential of im-
provements is small, while the limitations and overheads can be significant, sometimes leading
to an even worse performance than single-CU execution.
Our contributions for this approach are in detail: (1) The theoretical discussion of the
potential and the limitations of intra-operator parallelism and (2) the prototypical implemen-
tation and evaluation of two database operators on two different hardware platforms.
In Section 3.1.1, we present intra-operator parallelism in more detail, before presenting the
operators and hardware environments for our analysis in Section 3.1.3. Afterwards, we analyze
the selection operator in Section 3.1.4 and the sort operator in Section 3.1.5.
3.1.1 Intra-Operator Parallelism
As intra-operator parallelism in heterogeneous environments, we define the goal of minimizing
an operator's execution time by using all available heterogeneous compute resources. This means
dividing input data into partitions, executing the operator on the given CUs, and merging the
result in the end. In the following, we discuss the general idea, an approach to find ideal
partition sizes, and the possible limitations of intra-operator parallelism.
1Parts of the material in this section have been developed jointly with Dirk Habich and Wolfgang Lehner. The
section is based on [Karnagel et al. 2016]. The copyright is held by Springer International Publishing Switzerland and
the original publication is available at http://dx.doi.org/10.1007/978-3-319-44039-2_20.
[Figure 3.1: Operator execution on a single computing unit. Legend: (1) full input data, (2) data transfer or direct memory access, (3) final result.]
General Idea
Our starting point is the general operator execution on an arbitrary computing unit as shown
in Figure 3.1. We assume that the input data is initially stored in the system's main memory
and that the output data has to be stored there as well. Therefore, all our assumptions and tests
include input and output transfers, if the CU is not accessing the main memory directly. We also
assume that the operator implementation is inherently parallel and utilizes the complete CU,
which should normally be the case when the operator is implemented with CUDA or OpenCL.
Having a system with heterogeneous resources, parallel execution between multiple CUs
becomes possible. We focus on single operator execution, therefore, we want to execute the
same operator concurrently on multiple CUs, each CU working on its own data partition. Dur-
ing operator execution, we want to avoid communication overhead through multiple data ex-
changes. Therefore, we choose an approach, where we partition the input data, execute the
operator atomically on each CU with the given partitions, and merge the result in the end.
Figure 3.2 illustrates this approach for two CUs.
[Figure 3.2: Operator execution on two computing units. Legend: (1) data partition, (2) data transfer or direct memory access, (3) partial result, (4) final result.]
[Figure 3.3: Simulating execution behavior in different setups with two CUs (A, B); in this example, 80MB need to be partitioned between CU A and CU B. Panels: (a) 1.74x speedup (42/58), (b) no speedup (0/100), (c) 2.25x speedup (46/54). Axes: runtime (sec) over data size (MB).]
While this approach is well studied for many operators in homogeneous systems, where
multiple CPU cores or multiple CPU sockets are used, there is not much information about
heterogeneous systems. In a homogeneous setup, the input data can be divided uniformly,
since every CU needs the same amount of execution time. In a heterogeneous system, differ-
ent CUs perform differently, so data has to be divided differently and multiple limitations could
hinder the execution. Mayr et al. [Mayr et al. 2000] looked at intra-operator parallelism for heterogeneous
CPU clusters with the goal of preventing underutilization of available resources. The
authors also present a detailed overview of related work. We, however, look at heterogeneity
within one node with CUs like CPUs and GPUs, leading to different approaches and limitations.
Determining the Partition Size
In a first assessment, we want to look at the potential of intra-operator parallelism together
with possible ways to determine the best data partition size.
The intuitive approach would be: when both CUs execute an operator with the same runtime,
then the data is divided in half (50/50) and the potential speedup could be 2x. However,
heterogeneous CUs usually show different execution behavior and different scaling with data
sizes for an operator. Figure 3.3 shows three scenarios of heterogeneous execution. The execu-
tion time for different data sizes is given for CU A and CU B. The goal for all three scenarios
is to execute an operator with 80MB of data and to partition the input data to achieve the best
combined runtime. In Figure 3.3a, both CUs show equal execution time at 80MB, however,
the best partition is not 50/50, but 42/58. This is caused by the slope of the execution behavior,
resulting in different execution times for smaller data sizes. For example, when dividing 50/50,
CU A runs for 50 sec and CU B for 40 sec, therefore, the concurrent execution would be 50 sec
(the maximum of both single-CU executions). This partitioning is not ideal. The goal is to
achieve the same runtime on both CUs, which is 46 sec when using 42/58 as partitioning. The
speedup compared to a single-CU execution is 1.74x. We note that in this section speedups are
always given relative to the best single-CU execution.
Figure 3.3b shows a similar scenario with a different outcome. Here no data partition size
is beneficial to improve the best single-CU execution, since even with less data, CU A does
not improve much in runtime. In our example, no partitioning has the potential to improve
the runtime of CU B, therefore, single-CU execution should be chosen. On the other hand, if
the execution behavior is exponential (Figure 3.3c), then larger improvements beyond 2x are
possible.
The question is how to calculate the best data partition size for heterogeneous CUs. As-
suming we have n different CUs and we know the execution time (exec) of an operator for a
given data size (partition), we can calculate the total execution time (exectotal) for a given
input data size (input_data_size) by:
\[ exec_{total} = \max_{1 \le k \le n} \left( exec_k(partition_k) \right) \tag{3.1} \]
with
\[ input\_data\_size = \sum_{1 \le k \le n} partition_k \tag{3.2} \]
We have to minimize exec_total by adjusting the partition sizes (partition_k) to achieve the
best result. Essentially, this function finds the partition sizes where the executions on multiple
CUs take the same time. If that is not possible, this function also allows single-CU execution, if
one partition size is equal to input_data_size. Execution times for different data sizes can be
collected through previous runs or can be estimated by using estimation models.
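For two CUs, a simple way to approximate this minimization is to scan candidate partition ratios and keep the one with the lowest concurrent runtime, as sketched below. We assume exec_a and exec_b are measured or estimated runtime functions; a finer step width or a numerical solver could replace the 1% grid:

    #include <math.h>

    /* Runtime of the operator for a given partition size (measured or
     * estimated), one function per CU. */
    typedef double (*runtime_fn)(double megabytes);

    /* Scan partition ratios in 1% steps, minimizing Equation 3.1;
     * Equation 3.2 holds by construction. Returns the best concurrent
     * runtime and stores CU A's partition size in *partition_a. */
    double best_partition(runtime_fn exec_a, runtime_fn exec_b,
                          double input_data_size, double *partition_a) {
        double best_total = INFINITY;
        for (int percent = 0; percent <= 100; percent++) {
            double part_a = input_data_size * percent / 100.0;
            double total = fmax(exec_a(part_a),
                                exec_b(input_data_size - part_a));
            if (total < best_total) {
                best_total = total;
                *partition_a = part_a;
            }
        }
        return best_total; /* 0% or 100% means single-CU execution */
    }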
3.1.2 Possible Limitations
While the presented function calculates ideal data partition sizes for ideal parallel execution,
there are many factors involved with parallel execution that could potentially increase the over-
all runtime:
1. Underutilization. For small data sizes, an operator might not be parallel enough to
fully utilize a CU, especially highly parallel CUs like GPUs and the Xeon Phi, leading to slow
execution. In that case, executing the operator with less input data leads to only small runtime
reductions (e.g., CU A in Figure 3.3b).
2. Synchronization Overhead. Parallel executions have to be synchronized in order to
merge their results (as shown in Figure 3.2). This synchronization could lead to delays
and communication overheads.
3. Result Processing. After synchronizing the executions, the intermediate results have to
be merged to generate a final result. This merge step strongly depends on the operator.
Some operators, like selection or projection, do not have a time-consuming merge step,
while others, like joins or sortings, rely on complex compute intensive merges, reducing
the potential of intra-operator parallelism significantly.
4. Shared Hardware Resources. CUs within one system are most likely to use shared re-
sources that could become a bottleneck when using all CUs simultaneously. This can
be interconnects to the host memory, the memory or DMA controller, or computing re-
sources. When a workload produces contentions on these resources, the performance
might suffer.
5. Thermal Budget. Modern CUs reduce their frequency, and therefore their performance,
when a certain temperature threshold is reached. This is usually caused by the CU itself,
however, the temperature can also increase indirectly through other CUs. The best ex-
ample are tightly-coupled systems, where it might be possible to produce enough heat
through parallel execution to force the CUs to reduce the frequency.
With the possible limitations in mind, we analyze the parallel intra-operator execution of two
operators for two different heterogeneous systems.
3.1.3 Operator Implementation and Hardware Setup
To evaluate the potential and limitations of intra-operator parallelism in heterogeneous envi-
ronments, we use two operators with different characteristics in execution time, result size,
and merging overhead. In detail, we choose a selection operator and a sort operator. Our find-
ings can be applied to other operators by anticipating possible overheads, which are presented
in this work. We want to analyze parallel execution relative to its single-CU execution, so the
actual operator implementation is not the focus of our work, however, we briefly present the
implementation for completeness. All operators are implemented in OpenCL, enabling them
to be executed on all OpenCL-supporting CUs, including most CPUs and GPUs. The operators
are implemented in an operator-at-a-time approach with a column-oriented data format.
Our selection operator scans an input column of 32 bit values and produces a bitmap in-
dicating values that satisfy the search condition. The implementation is taken from Ocelot1
[Heimel et al. 2013], an OpenCL-based extension to MonetDB [Boncz et al. 2008]. During
execution, each thread accesses 8 values from the input column, evaluates the given search
condition, and writes a one-byte value to the output bitmap. Since we are working with 32 bit
values, the output is 1/32 of the size of the input. Merging results of multiple runs can be done
simply by aligning the results contiguously in memory, which should introduce no additional
merging overhead for parallel execution.
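A sketch matching this description is shown below (our own simplification, not the actual Ocelot code, assuming a less-than predicate and an input size divisible by eight):

    /* Each thread reads 8 input values and packs the predicate results
     * into one output byte (1 byte output per 32 bytes of input). */
    __kernel void select_bitmap(__global const int *input,
                                __global uchar *bitmap,
                                const int constant_value) {
        size_t tid = get_global_id(0);
        uchar byte = 0;
        for (int b = 0; b < 8; b++)
            if (input[tid * 8 + b] < constant_value)
                byte |= (uchar)(1 << b);
        bitmap[tid] = byte;
    }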
Our sort operator is based on the radix sort from Merrill and Grimshaw [Merrill and
Grimshaw 2010]. The actual OpenCL implementation is taken from the Clogs library2, which
has been implemented and evaluated by Merry [Merry 2015]. In our evaluation, we only sort
keys without any payload, to avoid additional transfer costs. The sort-operator execution is
more compute-intensive than the selection operator, while data transfers are also more sig-
nificant, because the operator is not reducing the input values. This leads to the same data
size for input and output. To merge two sorted results, we implement a light-weight parallel
merge for two CPU threads, where one thread starts merging from the beginning and another
1https://bitbucket.org/msaecker/monetdb-opencl
2http://clogs.sourceforge.net
thread starts merging from the end. Both threads only merge until they have processed half
of the resulting values. We choose this way of merging to avoid the overheads of highly parallel
approaches, like significantly more comparisons (Bitonic Merge [Batcher 1968]) or defining
equally sized corresponding blocks in both sorted results [Satish et al. 2009].
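A minimal sketch of the two merge directions is given below; both functions write disjoint halves of the output, so they can safely run in two CPU threads at the same time:

    #include <stddef.h>

    /* Front half: merge ascending from the beginning until half of the
     * output is produced. */
    static void merge_front(const int *a, size_t na, const int *b, size_t nb,
                            int *out, size_t half) {
        size_t i = 0, j = 0;
        for (size_t k = 0; k < half; k++)
            out[k] = (j >= nb || (i < na && a[i] <= b[j])) ? a[i++] : b[j++];
    }

    /* Back half: merge descending from the end for the remaining values. */
    static void merge_back(const int *a, size_t na, const int *b, size_t nb,
                           int *out, size_t total, size_t half) {
        size_t i = na, j = nb;
        for (size_t k = total; k > half; k--)
            out[k - 1] = (j == 0 || (i > 0 && a[i - 1] >= b[j - 1]))
                         ? a[--i] : b[--j];
    }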
For the analysis, we choose two different heterogeneous systems, to allow a broad eval-
uation: (1) a tightly-coupled system using an AMD Accelerated Processing Unit (APU) that
combines an AMD CPU and an AMD iGPU (Table 2.3) and (2) a loosely-coupled system us-
ing an Intel Xeon CPU and Nvidia K80 GPU (Table 2.3). Both systems combine a CPU and a
GPU, which is the most common setup for current heterogeneous systems. The iGPU shares
the main memory with the CPU, so it can actually access the CPU’s data directly. However, for
our tests, we noticed that it is more beneficial to copy the data to the GPU region of the main
memory before the execution. This way, the GPU data cannot be cached by the CPU, avoiding
expensive cache snooping during GPU execution. The K80 also needs data copies and the data
is copied via PCIe to the GPU memory. The K80 consists of two GPUs on one board, however,
we only use the Xeon CPU and one K80 GPU to avoid limitations and overheads due to the
shared PCIe board.
3.1.4 Analysis of the Selection Operator
We begin with the analysis of the selection operator. In the following, we present the initial
test results and discuss general performance issues before examining the executions on each
system separately in more detail.
General Observations
For the initial experiment, we execute the operator on each CU with input sizes from 1024
values (4KB) up to around 268 million values (1GB). We capture the execution behavior and
apply our calculations from Section 3.1.1 to determine the data partitioning. The calculated
partitions are then used for the intra-operator execution. To see the effects of data partitioning,
we force the execution to use at least a small part of data on each CU, not allowing single-CU
execution, even if our calculations would suggest it.
The test results are shown in Figure 3.4. Single-CU execution behavior is similar for both
systems. For small data sizes, the execution time of a single CU does not differ much, because
the CUs are underutilized and show a constant OpenCL initialization overhead. For larger data
sizes, all CUs show linear scaling. Interestingly, for both systems the ideal CU changes between
1 and 4MB of data. In the tightly-coupled system, the GPU is better for large data, because of
the limited computational power of the CPU. For the loosely-coupled system, the CPU is better
because of the expensive data transfers to the GPU.
The parallel versions are generally not as good as expected. For small data sizes, we see
the same effects as previously discussed for Figure 3.3b. There is no potential for efficient
parallel execution due to the bad scaling of each single-CU execution. Since we force data
partitioning to avoid single-CU execution, we observe at least the worst-case performance of
[Figure 3.4: Selection operator executed on both test systems with different data sizes. Panels: (a) tightly-coupled system, (b) loosely-coupled system. Curves: CPU only, GPU only, parallel execution; axes: runtime (ms) over data size (MB).]
the two CUs caused by the underutilization and, additionally, we see a constant overhead for
data partitioning, CU synchronization, and final cleanup.
For large input data, these overheads should not be significant because of the higher ex-
ecution time and the better single-CU scaling. However, we still do not achieve a significant
performance improvement. In the following, executions with large data sizes are examined
separately for both systems.
Selection Operator on the Tightly-Coupled System
For large data sizes, limitations like underutilization or missing potential do not apply, how-
ever, the parallel execution performance is worse than expected. Therefore, we choose one
setting, specifically 1GB of input data, and analyze the execution in more detail. We execute
the operator with the fixed data size using different partition ratios (CPU/GPU) from 100/0 to
0/100, i.e., from 100% of the data on the CPU to 100% on the GPU. The result is shown in Fig-
ure 3.5a. The parallel execution does not match the performance expected from our calculations,
differing especially for the partition ratios where parallel execution should be beneficial.
Is the calculation wrong? To evaluate if the problem lies in our calculations, we rerun the
experiment without parallel execution. That means, we use the data partitioning but execute
the operators separately on each CU, not allowing parallel execution. Figure 3.5b shows that
the calculation and the actual executions are similar, confirming our calculation approach.
Therefore, the performance difference has to be caused by parallel execution itself.
Is heat a problem? Since our first test system is a tightly-coupled system, we would expect
the additional heat of parallel execution to be a problem, forcing both CUs to reduce their
frequency and, therefore, decrease in performance. To evaluate this idea, we rerun the three
most interesting configurations multiple times, while monitoring the frequencies of the CPU
(using lscpu) and the GPU (using aticonfig). Figure 3.5c shows the result. For the CPU, the
[Figure 3.5: Extensive analysis of the parallel selection operator on the tightly-coupled system (fixed to 1GB of data, except for (f)). Panels: (a) observation of parallel execution, (b) observation w/o parallel execution, (c) CU frequency (MHz) during tests, (d) CPU usage per thread, (e) execution with 3 out of 4 CPU cores, (f) repeating initial test with 3 cores.]
peak frequency is 3900 MHz, while it reduces the frequency to 1700 MHz when idle. For
the GPU, the peak frequency is 866 MHz and 354 MHz when idle. The results show for each
CU that peak frequencies are always used when a CU is executing the operator, not supporting
the theory of reduced frequencies caused by heat problems.
Are CU synchronizations interfering with each other? The OpenCL calls are submitted
asynchronously, so the parent thread does not block for each function call; however, it has
to synchronize in the end in order to wait for the execution to finish. This
synchronization might interfere with the execution if multiple CUs are used. We profiled the CPU
usage at thread level for more insight. The result is shown in Figure 3.5d. One thread can
use up to 100% of one core, and since the system has four CPU cores, the total core usage of all
threads cannot exceed 400% (calculated similarly to the Unix tool top). The presented numbers
are averages over many measuring points for each partition size, therefore, a low percentage
can represent a thread running at 100% for a short time, while being idle for the rest of the
execution. In Figure 3.5d, the black line represents CPU workers of OpenCL. There are four
threads (one per core) with similar execution behavior, so only one line is plotted, showing the
average of all four threads. For large data partitions on the CPU, the threads work constantly
at 100%. For small CPU partitions, the runtime is defined by GPU execution, and, therefore,
the CPU runs at 100% shortly, while being idle the rest of the time, hence, the smaller core
usage. So far, this is as expected. Surprisingly, the parent thread has nearly no CPU usage,
showing that the synchronization is not the problem because, apparently, it is implemented
using a suspend-and-resume approach instead of busy-waiting.
In Figure 3.5d, we see another thread that has not been created explicitly but nevertheless
has a significant CPU usage. We tested the same setup with single-CU execution, noticing
that this thread only occurs when the GPU is used. We suspect this thread to be a GPU
control thread that manages the GPU queues and execution from the CPU side. With small
data partitions on the GPU, this thread runs only shortly, while it has a constant 60% core
usage when the GPU is used for a longer time. Therefore, this thread leads to contention on the
CPU resources. The interference is insignificant for skewed execution times. For example,
for 90/10 the GPU runs only shortly, therefore, the GPU thread interferes only shortly, and
for 10/90 the CPU runs shortly, leaving the resources to the GPU thread. However, for similar
execution times of CPU and GPU, the interference is large, leading to a performance decrease
of both CPU and GPU. The CPU cannot use all its resources, hence the slowdown. The GPU
has a queue consisting of input transfer, execution, and output transfer, where the queued
commands are not executed on time if the GPU control thread is interrupted.
How to avoid the interference? Since we cannot avoid the GPU control thread, we could
either accept the contention on the CPU resources and have the operating system handle the
thread switching, or we could reduce the number of CPU cores used by OpenCL. The latter
can be done with OpenCL device fission [Gaster 2011], where we reduce the number of used
cores by one (see the sketch at the end of this section). Other papers also propose to leave one
core idle for controlling CPU and GPU execution [Huismann et al. 2015]. Figure 3.5e shows
the execution with only three CPU cores. Here, parallel execution and calculation are similar.
We can see that the CPU execution is about 25% slower with three cores instead of four, as
expected. However, this also influences the
[Figure 3.6: Selection operator executed on the loosely-coupled test system with 1GB of data and different partitions. Panels: (a) observation of parallel execution, (b) execution with 23 out of 24 CPU cores.]
ideal data partition and the potential to achieve a speedup. With four cores, the calculated
speedup would be 1.54x, while with three cores it is only 1.41x. Adding the interference of
CPU and GPU, parallel execution takes 181 ms with four cores (35/65) and 164 ms with three
cores (30/70), leading only to a small difference. This effect can be seen when rerunning our
initial experiment with three CPU cores in Figure 3.5f, which, unfortunately, does not show a
significant difference to the initial results.
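A sketch of this core-count reduction using the OpenCL 1.2 sub-device API is shown below (our illustration; the extension-based device fission interface of [Gaster 2011] differs slightly in naming, and error handling is omitted):

    #include <CL/cl.h>

    /* Create a sub-device restricted to 3 of the 4 CPU cores; the
     * remaining core stays free for the GPU control thread. */
    cl_device_id three_core_subdevice(cl_device_id cpu_device) {
        const cl_device_partition_property props[] = {
            CL_DEVICE_PARTITION_BY_COUNTS, 3,
            CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0
        };
        cl_device_id sub_device = NULL;
        clCreateSubDevices(cpu_device, props, 1, &sub_device, NULL);
        return sub_device; /* build the OpenCL context and queue on it */
    }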
Selection Operator on the Loosely-Coupled System
When looking at 1GB of data with different partition ratios for the loosely-coupled system, we
see a nearly ideal performance according to our calculations (Figure 3.6a). The GPU runtimes
are slightly unstable because different data sizes result in a different degree of parallelism,
leading to divergent GPU-internal scheduling, which, in this case, is more visible on the Nvidia
GPU than on the AMD GPU. Additionally, the GPU runtime is slightly higher than expected.
To investigate this, we ran the same sequence of tests as for the previous test system. Our calculations
are correct according to the single-CU execution, and power or heat issues are unlikely because
the system is loosely-coupled and, therefore, does not share a direct power budget. When looking
at the CPU utilization of each thread, we see the same effect as before: one additional thread
is controlling the GPU, and therefore fighting for CPU resources. On the CPU side, there is
no effect visible because one additional thread does not interfere significantly in a 24 core
system. However, for the GPU, a delayed control thread leads to delays in the queuing and
longer execution times. We apply the same solution as before: reducing the number of OpenCL
CPU cores by one to 23 cores (Figure 3.6b). This improves the GPU performance while the CPU
slowdown is insignificant (theoretically about 4%). Unfortunately, the GPU improvements are
only marginal, leading to no substantial improvements for the overall execution.
[Figure 3.7: Sort operator executed on the tightly-coupled test system. Panels: (a) different data sizes, (b) different partitions for 1GB of data.]
3.1.5 Analysis of the Sort Operator
The sort operator differs from the selection operator in many ways. In general, the execution
takes longer since there is more computation and multiple data accesses. Therefore, computa-
tional power and data bandwidths to the CUs' dedicated memories become important. On the
other hand, when executing in parallel, the merge step can be significant for the performance.
Sort Operator on the Tightly-Coupled System
Figure 3.7a shows the evaluation result for tightly-coupled systems. The GPU is always better
than the CPU because the computational workload is more suited to the GPU's parallelism. For
small data, the CUs are bound by underutilization leading to no potential for parallel execution.
For larger data, the parallel execution lies between the two single-CU executions, with the
merge step not being significant. In a closer analysis using 1GB of data (Figure 3.7b), the
reason for the parallel execution performance becomes obvious. The runtime between the
CUs differs by one order of magnitude, so that parallel execution cannot justify the cost of
synchronization and merging. For this system, it would be best to use only the GPU, without
executing the operator in parallel.
Sort Operator on the Loosely-coupled System
For the loosely-coupled system, the results are different since both CUs are equally good in
executing the sort operator (Figure 3.8a), which is ideal for parallel execution. However, we
see a significant overhead through merging for larger data sizes. The close analysis in Fig-
ure 3.8b illustrates the extent of the merge step through the dotted lines above the actual CU
executions. For the loosely-coupled system, the merge step is more significant for the overall
runtime than in the tightly-coupled system because, here, the execution is faster on each CU,
while the runtime of the merge step is comparable for both systems.
[Plots: (a) runtime (ms) vs. data size (MB) for CPU only, GPU only, and parallel execution with and without merge. (b) runtime (sec) vs. data partition size CPU/GPU (%) for 1GB of data, showing parallel execution, calculated runtime, CPU part, GPU part, and the merge overhead.]
Figure 3.8: Sort operator executed on the loosely-coupled test system.
The calculated runtime is based on single-CU execution without any merging overhead. The execution on the GPU varies
from the calculation, because of the additional GPU control thread. However, optimizing the
GPU execution would lead to only minor improvements because the main difference between
the single-CU parts and the actual execution is caused by the merge step.
Interestingly, the merging overhead itself varies with the chosen partitioning, caused by
branch prediction of the CPU. During the merge step, values from the two pre-sorted parti-
tions are combined to one result. For a 50/50 partitioning, the values for the final result come
from both pre-sorted partial results with 50% probability, the worst case for branch prediction.
Changing this partitioning also changes the probabilities towards more values being taken from
one partial result. This improves the CPU’s branch prediction and speeds up the merge step.
With single-CU execution, we do not need a merge at all. It might also be possible to optimize
the merge step further by, e.g., adding range partitioning [Albutiu et al. 2012], however, the
merge step itself is unavoidable for the execution on multiple CUs.
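To make the role of this branch concrete, the following minimal sketch shows the structure of such a merge step (plain C++ host code with names of our choosing, not the implementation measured above). The marked comparison is the branch whose outcome distribution depends on the chosen partitioning:

#include <cstddef>
#include <cstdint>

// Merge two pre-sorted partitions a and b (e.g., the CPU and GPU partial
// results) into out.
void merge(const uint32_t *a, size_t na,
           const uint32_t *b, size_t nb, uint32_t *out) {
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        // For a 50/50 partitioning this branch is taken with about 50%
        // probability, the worst case for the CPU's branch predictor.
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];   // drain the first partition
    while (j < nb) out[k++] = b[j++];   // drain the second partition
}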
3.1.6 Conclusion
In this section, we analyzed intra-operator parallelism for heterogeneous computing resources.
We proposed an initial way to calculate good partition sizes and presented possible limitations
that could hinder parallel execution. In our analysis, we used two operators with two differ-
ent hardware setups and showed that especially underutilization, shared resources, different
execution performance, and the merging step limit parallel execution. Underutilization as well
as shared HW resources could be seen for every test. For the latter, only contention on CPU
cores was noticeable and the impact was significant for the tightly-coupled system. Reserving
one CPU core for controlling is a possible solution, however, CPU performance suffers if there
is only a small number of cores. Additionally, we have seen no potential if the single-CU exe-
cution performance differs too much or if the merge step is too large compared to the actual
execution. These findings can be applied to many database operators and heterogeneous systems
by quantifying the merge overhead or CU performance.
Ideally for parallel execution, we need to have (1) CUs that compute an operator equally
fast, (2) one CPU core reserved for controlling, and (3) a merge step with no significant impact
on the total execution time. If a merge step is needed, however, it will always be an additional
overhead compared to single-CU execution. To avoid this overhead, we thought about parti-
tioning the input data once and running multiple operators in parallel on the partial results
without merging in between. While it is possible in homogeneous systems with uniform data
partitions, in heterogeneous systems, each operator needs differently sized data partitions be-
cause different CUs execute an operator differently. For example, the tightly-coupled system
with 1GB of data needs a 35/65 partitioning for the selection and an 8/92 partitioning for the
sort operator. Executing both operators after each other using one global partitioning would
lead to a skewed execution time for CPU and GPU. It might be possible to find a partitioning
for a chain of operators, so that all CUs finish this chain at the same time. However, this would
only be possible if intermediate results do not need to be merged and it is unclear if the final
execution time, using suboptimal partition sizes for the single operators, is worth the effort.
All in all, there are two major lessons we learned: (1) Given the limited potential and
possible limitations, it is hard to achieve any speedup through intra-operator parallelism in
heterogeneous environments and even for ideal cases we only achieved a speedup of 1.52x
(selection on the loosely-coupled system). It should always be considered whether intra-operator
parallelism is beneficial or whether it should be avoided. (2) During our analysis, we have seen different
single-CU execution behaviors like different ideal CUs for the selection or always better CUs for
sorting on tightly-coupled systems. If parallel execution is not beneficial, at least the placement
of the execution should be considered, e.g., for the selection on the tightly-coupled system
changing from CPU execution for small data sizes to GPU execution for large data sizes.
3.2 APPROACH II: STATIC PLACEMENT
In the previous section, we presented the idea of executing an operator concurrently on mul-
tiple CUs with different data partitions. However, the evaluated use-cases showed only small
potential to improve performance and many possible limitations. To avoid these limitations,
we want to execute an operator atomically, where data is not partitioned and the operator is ex-
ecuted on only one CU. In the past, single operators were ported to GPUs (e.g., joins [He et al.
2008] or sorting [Govindaraju et al. 2006]), FPGAs (e.g., stream joins [Müller et al. 2009b]),
or the Xeon Phi (e.g., joins [Jha et al. 2015]). All these offloaded operator executions show
great performance results, mainly through high optimization for the used hardware. Once an
operator is ported and fine-tuned for a specific CU, a static decision is implemented that this
operator always runs on this specific CU. To allow an operator to always execute on one CU,
the performance has to be good in any given scenario. To estimate the needed effort for this
approach, we have to answer two questions:
1. How much hardware-specific fine-tuning is needed to make an operator perform well in
any scenario?
2. How extensible is this approach towards the support of multiple operators and multiple
hardware platforms?
To evaluate this approach of static execution of an operator on a given CU, we port
a hash-based group-by operator with aggregation to an Nvidia K80 GPU (see Table 2.3). We
want to use the fixed combination of operator and hardware to optimize the operator imple-
mentation for the best possible performance. We focus our evaluation on different numbers of
groups and, therefore, hash tables with different sizes.1
Our contributions with this approach are in detail: (1) We present an in-depth study on
offloading the hash-based grouping operator to a GPU, including a detailed performance anal-
ysis as the number of groups varies. In our analysis, we find the virtual memory translation to
be a major bottleneck. Therefore, (2) we define a benchmark to investigate GPU TLB proper-
ties and present our unconventional findings. (3) We evaluate multiple different implementa-
tion and parameter adjustments, together with their impact on performance and (4) propose
a configuration-based model, which switches between different implementations to guarantee
the best performance.
We first present an overview of our group-by implementation in Section 3.2.1, before pre-
senting the first performance observations in Section 3.2.2. There, we pin-point multiple
bottlenecks including the atomic operations, data caches, and translation lookaside buffers (TLBs).
1Parts of the material in this section have been developed jointly with René Müller and Guy Lohman during
an internship at the IBM Almaden Research Center, San Jose, USA. The section is based on [Karnagel et al. 2015b].
The copyright is held by the authors and the original publication is available at http://www.adms-conf.org/
2015/gpu-optimizer-camera-ready.pdf. The initial implementation approach, the private hash table op-
timizations, and the sort-based implementations were provided by René Müller (IBM), while the analysis of the
effects and the parameter adjustments were done by the author.
[Diagram: GPU-based group-by operator. A key from the base table in host memory is transferred to the GPU via PCIe; a hash function maps it to a bucket of the GPU-resident hash table, with linear probing on collisions.]
Figure 3.9: GPU-based group-by operator.
In Section 3.2.3, we investigate GPU TLBs in more detail. Given the gained hard-
ware insights, we propose multiple ways to adjust the initial implementation and its parame-
ters in Section 3.2.4. As a result, a simple configuration-switching optimizer is presented in
Section 3.2.5. We discuss the applicability and transferability for this approach to different
hardware environments in Section 3.2.6.
3.2.1 Operator Implementation
In the following, we briefly describe the implementation and setup of our hash-based group-by
operator.
The input data of this operator is stored in main memory in a columnar representation, as
our group-by operator is meant for column-based in-memory database systems. The GPU oper-
ator itself works on 128MB chunks of input data, which are accessed through zero-copy (Unified
Virtual Addressing [Nvidia 2015]). There, data is transferred during the execution via the PCIe bus.
This leaves more space on the GPU to store the hash table, which can utilize the whole GPU
memory if needed. At the same time, this allows the operator to access vast amounts of input
data by iteratively loading 128MB of data and aggregating it in the hash table.
Figure 3.9 shows an example for the execution. Once an input key is transfered to the GPU,
a hash function is used to create a hash value, which is then mapped to a bucket index. When
accessing the hash table, the algorithm evaluates the chosen hash bucket. If it contains the
search key, the payload is updated using atomic instructions. If the bucket is empty, the search
key and the payload are inserted, also using atomic operations. If the bucket is not empty and the
stored key is not equal to the search key, the next bucket is evaluated (linear probing). For our
implementation, we support two entries per hash-bucket (key, count) as it would be needed for
this given query:
SELECT col1, count(*) FROM atable GROUP BY col1;
To support other queries with wider hash buckets and other operations, either different
kernels need to be generated or one kernel is implemented with the generality of supporting
many different scenarios. As we are mainly interested in the execution effects rather than the
generality of our operator, we currently implement each kernel variant manually in CUDA. In the
future, the kernels could be generated automatically using OpenCL and its run-time source
compilation [Khronos 2011] or NVRTC [Nvidia 2015], the C++ run-time compilation interface of
CUDA 7.0.
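To illustrate the probing scheme described above, the following sketch outlines a CUDA kernel for the COUNT(*) query (a simplified rendering under our own naming, not the tuned kernel itself; EMPTY_KEY is an assumed sentinel for empty buckets):

#include <cstdint>

#define EMPTY_KEY 0xFFFFFFFFu   // assumed sentinel for an empty bucket

__device__ uint32_t fnv1a(uint32_t key) {   // 32-bit FNV-1a over 4 bytes
    uint32_t h = 2166136261u;
    for (int i = 0; i < 4; ++i) {
        h ^= (key >> (8 * i)) & 0xFFu;
        h *= 16777619u;
    }
    return h;
}

// Insert or update a (key, count) bucket, starting at bucket b and using
// linear probing on collisions; atomics keep concurrent updates correct.
__device__ void ht_insert(uint32_t *keys, uint32_t *counts,
                          uint32_t num_buckets, uint32_t b,
                          uint32_t key, uint32_t inc) {
    while (true) {
        uint32_t stored = keys[b];
        if (stored == EMPTY_KEY)                 // try to claim the bucket
            stored = atomicCAS(&keys[b], EMPTY_KEY, key);
        if (stored == EMPTY_KEY || stored == key) {
            atomicAdd(&counts[b], inc);          // update the payload
            return;
        }
        b = (b + 1) % num_buckets;               // linear probing
    }
}

__global__ void group_by_count(const uint32_t *col, uint32_t n,
                               uint32_t *keys, uint32_t *counts,
                               uint32_t num_buckets) {
    for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += gridDim.x * blockDim.x) {
        uint32_t key = col[i];
        ht_insert(keys, counts, num_buckets,
                  fnv1a(key) % num_buckets, key, 1u);
    }
}

A host loop would launch such a kernel once per 128MB input chunk, keeping the hash table resident in GPU memory across chunks.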
3.2.2 Observations
We test the simple hash-based group-by described in the previous section on an NVIDIA K80
(single-GPU, Table 2.3). We use 13 thread blocks (the number of multi-processors on the GPU)
and 1024 threads per block. The allocated hash tables are twice the size of the group count,
leading to a fill factor of 50%. As a hash function, we apply the 32-bit FNV-1a hash [Fowler
et al. 2015] and we map the hash value to a bucket ID via modulo division. The group-by is
performed on a table with ≈1.6 billion rows consisting of four INTEGER columns:
CREATE TABLE atable (
col1 INTEGER NOT NULL,
col2 INTEGER NOT NULL,
col3 INTEGER NOT NULL,
col4 INTEGER NOT NULL
);
The table is stored in a column format, leading to a storage size of 6GB per column and 24GB
for the whole table. A table of this size can usually not be stored directly in the GPU's mem-
ory, supporting our initial implementation choice of storing the base data in the system’s main
memory. The values in col1 are independent, uniformly distributed, and random in the range
of [0, 2^32). Accessing the hash table based on this distribution results in scattered memory
accesses with minimal caching opportunities. By contrast, non-randomized or skewed distri-
butions generally have more localized memory accesses and make better use of the GPU cache
and, thus, lead to better performance. Our data distribution can thereby be considered as the
worst-case scenario. To analyze the performance as a function of the number of groups, we use
the following query that utilizes the MOD operation to limit the number of groups:
Query A: SELECT MOD(col1, ?), COUNT(*) FROM atable GROUP BY MOD(col1, ?);
We study the impact of the data transfers on the overall performance by extending the
SELECT clause of Query A with expressions that reference more columns. Ideally, we would
expect the execution time for these variants of Query A to be determined by how many columns
are referenced, as shown in Figure 3.10a. The GPU can read from the host memory via zero-
copy access at a peak speed of 11.8GB/s in our setup. Since we believe that the connection
from GPU to host memory should be the bottleneck, we would expect an execution time that is
proportional to the total size of the accessed columns. This is illustrated by the four
equi-spaced lines in Figure 3.10a. However, the actual performance we observe is different,
as shown in Figure 3.10b. We observe that: (1) the performance does not remain constant
as we increase the number of groups by adjusting the MOD parameter in the query. (2) The
runtime has a high variability when only one column is accessed. The execution time can
have large jumps for certain group combinations. This is not a measurement artifact – the
numbers are in fact repeatable. (3) Accessing more than one column appears to hide some of
this variability. (4) The execution time increases for few groups and (5) sharply increases for
100 million groups and more.
[Plots: runtime (sec) vs. number of groups (M) for four query variants: Query 1: MOD(col1,?), COUNT(*); Query 2: MOD(col1,?), SUM(col2); Query 3: MOD(col1,?), SUM(col2+col3); Query 4: MOD(col1,?), SUM(col2+col3+col4). (a) Expected performance behavior: four equi-spaced lines bounded by the 11.8 GB/s zero-copy bandwidth. (b) Actual performance behavior.]
Figure 3.10: Performance of initial group-by implementation, measured as execution time of a
table scan over ≈1.6 billion rows (fill factor 50% and 13 x 1024 threads).
For further investigations, we focus on the 1-column query, as this query shows the perfor-
mance effects in more detail than the other queries, which partially mask the effects through
data transfers. For the 1-column query, we can distinguish five regions in which different phe-
nomena appear to dominate. Figure 3.11 shows the ranges of these five regions enumerated as
I, . . . , V. Unfortunately, we do not have inside information from NVIDIA on their hardware
that would provide clear explanations for the different behaviors. In order to provide an expla-
nation, we combine the publicly available hardware descriptions with extensive profiling via
performance counters and benchmarks.
Region I: Contention. The group count, and thus the hash table, is small. For a globally
shared hash table, this leads to contention that limits the overall performance. Beginning with
the Kepler GPU architecture (this includes the K80), atomic operations on device memory are
performed in ALUs in the L2 cache [Nyland and Jones 2012], which is shared by all processors
on the GPU. Atomic operations, e.g., atomic additions used for updating a SUM aggregate,
are treated as store instructions. The operations are then routed to the ALUs based on their
target memory address. A first-in first-out (FIFO) buffer in front of each ALU enqueues the
operations before they are processed and the memory locations in L2 are updated.
The downside of this approach is that updates to the same hash bucket, or to buckets that
are located on the same cache line, are sent to the same ALU. For small numbers of groups or
a highly skewed distribution of group-by keys, this introduces a work imbalance on the ALUs
and, thus, contention.
Region II: L2 & Spiky Performance. For all regions, the execution time of our Group-By
implementation is spiky. The execution times in Region II correspond to the access time for
the input columns, given the PCI Express bandwidth of ⇡ 11.8GB/s. If it were not for the
spikes, the performance in the second region would be entirely dominated by the available
PCIe bandwidth. In Region II, the hash table sizes are large enough that contention, which dominated performance in Region I, no longer has a significant impact.
[Plot: runtime (sec) vs. #groups (M) on logarithmic scales, divided into the five regions I to V.]
Figure 3.11: Regions with different behavior for Query A (fill factor 50% and 13 x 1024 threads).
The region contains hash tables that are ≤ 1.5MB, i.e., that still fit into the L2 cache on the K80. We repeated the ex-
periment multiple times, assuming we would observe a non-deterministic artifact. However,
the spikes persisted at the same locations. Through deeper investigation, we found out that
the spikes are due to collisions of the hash mapping. The FNV-1a hash function followed by
the modulo division mapped many different keys to the same or close hash buckets, leading
to long sequences of linear probing. FNV-1a is known as a good hash function and we can
confirm this by studying the distribution of 32-bit hash values. The hash values are uniformly
distributed over the full range of 32-bit integers. The hash collisions are actually introduced
when mapping the 32-bit hash values into hash buckets. This mapping does not sufficiently
preserve the uniformity of the computed hash values, especially for small and compact key
domains (key_max − key_min ≪ 2^20). By contrast, we do not see performance issues for values that
were chosen randomly from large domains (> 2^20). However, the modulo operation in the
grouping expression of our benchmark query effectively produces such a compact key domain.
Compact key domains are common in real workloads, e.g., for sequential order keys. The col-
lisions of the hash mapping, and therefore the spikes, can be seen for all regions, however, in
other regions the effects of the collisions are masked by other effects.
Region III: Hash Table> L2Data Cache. In Region III, we observe the expected increase in
execution time when the hash table grows beyond the L2 cache capacity (1.5MB for the K80).
The execution time gradually increases and remains constant when almost every access results
in a cache miss and in a random access to global memory.
Region IV: TLB issues (1). Region IV starts at a hash table size of about 130MB (≈8.5M
groups). The behavior looks like missing the next level of data cache; however, L2 is already
the last-level cache on the GPU. The TLB is also organized as a cache, and we believe these
effects are caused by TLB misses. However, Nvidia does not publish architectural
information, so we need to investigate further with micro-benchmarks.
Region V: TLB issues (2). The jump at the beginning of Region V starts with hash table sizes
of 2GB. Again, we suspect a TLB cache to be the source of this problem and more investigation
is needed.
3.2.3 TLB Analysis
In this section, we focus on investigating virtual address translation and the Translation Looka-
side Buffer, which we suspect to be the source of the runtime increase in Region IV and Re-
gion V.1
GPUs use virtual addresses for their device memory for two reasons: (1) Isolation: The in-
direction controls a program’s memory accesses and, thus, keeps it from disallowed memory
accesses to internal device data or to data of other applications using the same GPU. (2) Frag-
mentation: Memory fragmentation can be hidden with virtual pages, allowing a large consecu-
tive region of virtual memory to be scattered across many positions in physical memory. This
can also increase memory bandwidth if physical memory is scattered to multiple memory chips
that are accessible in parallel. The translation of virtual to physical addresses is usually per-
formed in pages, which are fixed blocks of memory. To access data within one page, a virtual
page address is translated to a physical page address using a page table, while an additional off-
set defines the requested position inside the page. Address translation using the page table is
costly. To reduce the page translation delays, TLBs cache virtual to physical address mappings.
Generally, TLBs can be implemented for different page sizes and different amounts of entries.
Unfortunately, NVIDIA does not publish any information about TLB sizes of their GPUs.
Micro-Benchmark Methodology:
To determine all necessary TLB information – properties and sharing between multiproces-
sors (SMs) of a GPU – we developed two low-level benchmarks.
TLB Property Benchmark: To identify the TLB properties like size or delays, we define
a single-threaded GPU kernel traversing a continuous data array. The traversal is done in a
specific stride for a specific distance (traversed data size), while performing data-dependent
accesses (pointer chasing). The stride describes the distance between two memory accesses,
while the traversed data size is the total distance of all data accesses, e.g., 1024 memory accesses
with 2MB stride result in a traversed data size of 2GB. The stride-accessing kernel is always exe-
cuted twice. The first run initializes the memory by loading data into the TLB, while the second
run measures the execution cycles for memory accesses with initialized TLBs. We will measure
low cycle counts if the data fits in the TLB (always TLB hit), while otherwise measuring high
cycle counts due to TLB misses. To eliminate the influence of data cache misses, the accessed
data for our kernel is stored in the L2 data cache, which also works with physical addresses. So
even with the data being cached, the addresses have to be translated. Thus, any increase in our
measurements is purely due to TLB misses.
1Parts of the material about GPU TLBs have been developed jointly with Tal Ben-Nun, Matthias Werner, Dirk
Habich, andWolfgang Lehner. The section is based on [Karnagel et al. 2017a]. The copyright is held by the authors.
Figure 3.12: TLB boundary pattern.
We use our kernel to benchmark a given GPU with multiple strides and multiple distances,
while searching for the pattern shown in Figure 3.12. The depicted pattern consists of three
stride sizes (X, ½X, and 2X). If we find this pattern, we can conclude that X = page_size,
½X accesses every page twice, and 2X only accesses every second page. Each one of these
stride sizes shows a low cycle count for smaller data sizes and a higher cycle count for larger
ones, hence this is a TLB border. If the stride size is equal to the page size (X), every data
access requires a new address translation. TLB misses are encountered when the TLB cannot
hold all pages of the first execution run (> a in Figure 3.12).
For a stride size of ½X, every first access triggers a TLB miss, while the second access
experiences a TLB hit, as it accesses the same page. A TLB hit is faster than a miss, so the
average cycle count of all accesses is lower than for X. However, it still accesses exactly the same
pages as X (each page twice), which leads to TLB misses at exactly the same traversed data
size (position a in Figure 3.12). Every stride size smaller than the page size behaves like ½X:
showing lower cycle counts but experiencing the first TLB miss at the same position.
For a stride size of 2X, every second page is accessed, leading to a TLB miss at double
the traversed data size (2a). The low and high cycle counts are the same as for X, because every
access leads to an address translation and potentially to a TLB miss. Every stride size larger
than the page size behaves like 2X: showing similar cycle counts, while experiencing the first
TLB miss at later positions.
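A minimal sketch of such a stride-accessing kernel is shown below (assumed structure; the actual benchmark code is not reproduced here). The host pre-links the array so that entry i points to entry i + stride/4, and the data-dependent loads serialize the address translations:

__global__ void tlb_chase(const uint32_t *next, uint32_t num_accesses,
                          uint64_t *avg_cycles, uint32_t *sink) {
    uint32_t idx = 0;
    uint64_t start = clock64();
    for (uint32_t i = 0; i < num_accesses; ++i)
        idx = next[idx];            // data-dependent access (pointer chasing)
    *avg_cycles = (clock64() - start) / num_accesses;
    *sink = idx;                    // keep the loads from being optimized away
}

// Launched twice with <<<1, 1>>>: the first run loads the pages into the
// TLB, the second run measures the average cycles per access.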
TLB sharing benchmark: Our TLB sharing benchmark uses three stages: (1) accessing
N pages on the i-th multiprocessor SMi, (2) accessing N different pages on the k-th multi-
processor SMk, and (3) accessing the first N pages again on SMi. N is the number of pages
that fit in the TLB. We measure the used cycles for the last stage. A low cycle count indicates
no sharing between SMi and SMk, i.e., the accessed page addresses from the first stage are
still in the TLB; whereas a high cycle count indicates TLB sharing, i.e., SMk evicts the page
addresses loaded by SMi. To determine TLB sharing, we have to test every SM combination
(#SM x #SM) for every TLB level.
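The benchmark has to run each stage on one specific SM. One possible mechanism, sketched below under the assumption that a block can query its multiprocessor via the %smid special register, is to launch at least one block per SM and let only the block on the target SM perform the accesses:

__device__ uint32_t smid() {
    uint32_t id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));   // ID of the current SM
    return id;
}

__global__ void access_on_sm(const uint32_t *next, uint32_t num_pages,
                             uint32_t target_sm, uint64_t *avg_cycles,
                             uint32_t *sink) {
    if (smid() != target_sm) return;   // only the target SM participates
    uint32_t idx = 0;
    uint64_t start = clock64();
    for (uint32_t i = 0; i < num_pages; ++i)
        idx = next[idx];               // one access per page, as above
    *avg_cycles = (clock64() - start) / num_pages;
    *sink = idx;
}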
[Plots: average access cycles vs. traversed data size for different strides. (a) L1 TLB boundary: 239 cycles rising to 248 cycles at 2MB (64KB, 128KB, and 256KB strides). (b) L2 TLB boundary: 248 cycles rising to 303 cycles at 130MB (1MB, 2MB, and 4MB strides). (c) L3 TLB boundary: 303 cycles rising to 480 cycles at about 2GB (1MB, 2MB, and 4MB strides).]
Figure 3.13: TLB boundary benchmark results (K80, Kepler architecture).
[Matrices of SM × SM interference, an X marking SM pairs that evict each other's entries. (a) L1 TLB Sharing: only the diagonal is marked, i.e., each SM has a private L1 TLB. (b) L2 TLB Sharing: groups of two or three SMs interfere. (c) L3 TLB Sharing: every SM interferes with every other SM.]
Figure 3.14: TLB sharing benchmark results (K80, Kepler architecture).
Benchmark Application and Observations:
With our TLB benchmark, we test different stride sizes until we find the pattern shown in
Figure 3.12. We found this pattern three times for the K80, indicating three TLB cache levels.
The results are shown in Figure 3.13. Figure 3.13c shows exactly our search pattern for 2MB
strides and a TLB border at 2GB. Figure 3.13b is slightly different. Again we can identify 2MB
as the page size and 130MB as the traversed data size where TLB problems start appearing; however,
1MB strides (½X) behave differently. As expected, they have a lower average cycle count for TLB
misses (>130MB) and the TLB problem starts at the same point, but they do not have a lower
cycle count for data sizes <130MB. This indicates that either all tested stride sizes access a new
page for every data access (no two accesses to the same page) or that there is no smaller cache.
Figure 3.13a shows that there is a smaller cache with 2MB capacity, interestingly, for a page size
of 128KB. This means that the GPU has three TLB caches, with apparently different page sizes
for the L1 TLB cache and the L2/L3 TLB cache.
We also apply our TLB sharing benchmark using the gained knowledge of page sizes and
TLB entries. The results are shown in Figure 3.14. We tested every #SM x #SM combination
and marked interfering SMs with X. In Figure 3.14a, we see that only the SM itself can evict
its L1 TLB entries, resulting in a diagonal pattern symbolizing a private L1 TLB. For the L2 TLB,
two or three SMs can interfere with each other, i.e., can evict each other's L2 TLB entries. We
found that, in general, the TLB is shared in a group of three SMs, and for two groups, only two
SMs share an L2 TLB.
[Matrix of SM × SM L2 TLB interference with two hypothetical SMs inserted between SM9 and SM10, reconstructing a regular pattern of three SMs per sharing group.]
Figure 3.15: Hypothetical L2 TLB sharing with 15 SMs (hypothetical SMs are highlighted).
This can be explained with deactivated SMs. The K80 GPU processor
(GK210) is designed for 15 SMs, however, only 13 of them are activated. Based on the pattern
from Figure 3.14b, we can add 2 hypothetical SMs between SM9 and SM10 to reconstruct
a regular pattern of three SMs sharing the L2 TLB as illustrated in Figure 3.15. Interestingly,
the SMs that are deactivated can be different for every GPU, even if the GPU model is the
same. We tested 256 different K80 GPUs (as part of the TU Dresden HPC cluster) and found
11 different patterns, caused by a different combination of deactivated SMs. In theory, there
could be 105 different combinations for two deactivated SMs out of 15. The shown pattern in
Figure 3.14b is by far the most common one, used in 161 of the 256 tested GPUs. Most GPUs
have the deactivated SMs in two different sharing groups, leaving two times two SMs that share
the L2 TLB (instead of three SMs). However, we also found occurrences where two SMs of one
group where deactivated, leaving 1 SM to have its own L2 TLB, while the other L2 TLBs are
shared by three SMs. For the L3 TLB in Figure 3.14c, we see that every SM can evict L3 TLB
entries of every other SM, therefore, we assume global L3 TLB sharing. The findings from both
benchmarks are summarized in Table 3.1.
Plausibility and Validation: As our results show unconventional TLB properties, we want
to discuss their plausibility. Again, NVIDIA does not publish any information about TLB sizes
on their GPUs. Thus, we have to rely on our findings. We found four points that strengthen
our results:
K80                  L1 TLB        L2 TLB        L3 TLB
Entries              16            65            ≈1032
Page Size            128 KB        2 MB          2 MB
Cache-able Memory    2 MB          130 MB        ≈2064 MB
Delay on Miss        ≈10 cycles    ≈55 cycles    ≈173 cycles
TLB Sharing          private       3 SMs         global
Table 3.1: TLB findings for the K80 GPU. "≈" indicates that the result is slightly varying or inconclusive.
1. The sizes of the L1 TLB (16 entries) and L2 TLB (65 entries) for Kepler GPUs (K80)
were already reported by Mei et al. [Mei and Chu 2015] and we can confirm their results.
However, the authors did not report the different page sizes but only the page size of
2MB. This is understandable, because, even with 2MB strides, an L1 TLB miss occurs
after 16 accesses. However, we found that smaller strides (down to 128KB) also need 16
accesses before a TLB miss occurs, indicating that a 2MB stride would only access every
16th page.
2. The GPU cores are organized in a hierarchy, where one or more SMs are grouped to a
Texture Processing Cluster (TPC) and multiple TPCs are combined to a Graphics Processing
Cluster (GPC) [Nvidia 2016]. The K80 has 1 SM per TPC and 3 SMs per GPC. We see
exactly the same hierarchy for our TLB sharing, indicating that every TPC has its own L1
TLB and every GPC has its own L2 TLB, while the L3 TLB is shared for all SMs.
3. We repeated our tests on different GPUs. The GT640 and the K20 (see Table 2.3), both
from Nvidia’s Kepler architecture, show the same TLB architecture with the same page
sizes. The GT640 has only 2GB of memory, so the border of the L3 TLB could not be
tested. Additionally, we evaluated the P100 based on Nvidia’s Pascal architecture. There,
we have detected an L1 TLB and an L2 TLB with the same number of entries as on Kepler
GPUs. However, we found that the page sizes are generally 16x larger, resulting in 2MB
pages for the L1 TLB and 32MB pages for the L2 TLB. Nvidia announced that the P100 is
working on 2MB page sizes [Appleyard 2016], which we can confirm for the L1 TLB. The
L3 TLB cannot be tested, as our GPU has only 16GB of memory, while we expect the L3
TLB border at ≈32GB (1032 entries with 32MB pages). The TLB sharing is as expected.
The P100 has 2 SMs per TPC and 10 SMs per GPC (or 5 TPC). Again, we can see that
TPCs share an L1 TLB and GPCs share an L2 TLB.
4. We can see a significant performance drop for our group-by operator, when we access
more data than ≈130MB and ≈2GB. We can now pinpoint the problem to the L2 TLB
for the smaller data sizes and the L3 TLB for the larger one. Previous work on hash
joins [Kaldewey et al. 2012] has also reported a performance decrease at a data size of
2GB, which correlates with our findings.
Arguments for Unconventional TLB Properties: There are mainly two unconventional re-
sults: (1) TLB entry numbers not being a power of two and (2) different page sizes for differ-
ent TLB levels. The former was explained by Mei et al. for the L2 TLB [Mei and Chu 2015]. 65
entries are the result of associativity optimizations, where six sets hold eight entries and one set
holds 17 entries to store aligned page addresses. The L3 TLB could have similar optimizations
resulting in the unconventional number of 1032 entries.
Different page sizes are already used in some CPU systems, where they can be stored in
the same or different TLBs [Mittal 2016]. However, each page is allocated in one size or the
other. For our results, different allocation sizes are not possible as all TLBs work with these
allocations. We evaluated the allocation size and found that the smaller page size is always used
for allocations (128KB on K80). One possible explanation for the apparently larger page sizes
in the L2/L3 TLB could be a static pre-fetching algorithm, which always loads 16 contiguous pages when a TLB miss occurs.
[Plots: execution time (sec) vs. number of groups for the FNV-1a and Murmur3 hash functions. (a) Scatter plot for 1K - 100K groups in steps of 10. (b) Applying both hash functions to the initial setup.]
Figure 3.16: Execution time with FNV-1a and Murmur3. (fill factor 50% and 13 x 1024 threads)
This would result in one TLB miss and 15 TLB hits when using
the small page size as traversal stride. Therefore, this optimization significantly reduces TLB
misses for most applications with regular memory access that scan large memory regions. For
the applications (and our benchmark), this looks like 16x larger page sizes, while in fact, only
the reloading mechanism works on 16 pages at once. We have to emphasize that this is only
our speculation. Without exact knowledge of the hardware internals, we state the measured page
sizes in Table 3.1.
3.2.4 Implementation Adjustments
In the previous sections, we analyzed performance effects and the GPU TLB structure. In this
section, we evaluate the impact of the chosen parameters on the execution, in order to find
better configurations reducing negative effects.
(1) Alternative Hash Function
To avoid the collisions of the hash mapping, we repeated our experiment with different hash
functions. Here, we only show the results of the best-performing hash function (Murmur3 [Ap-
pleby 2008]), compared to the initial setup (FNV-1a). The main difference is that FNV-1a uses
multiply and XOR, while Murmur3 uses multiply and rotate. Figure 3.16a depicts a scatter
plot of the different execution times using the FNV-1a and the Murmur3 hash functions. We
show a fine resolution in the number of groups to illustrate the variance. Indeed, FNV-1a has
a variance of more than two orders of magnitude! Murmur3, however, has a lower variance
but, in general, a higher minimal runtime. In detail for the tests in Figure 3.16a, FNV-1a has
a runtime between 0.54 sec and 159 sec and an average of 0.89 sec. Murmur3 has a runtime
between 0.70 sec and 2.0 sec with an average of 0.87 sec. We conclude that the collisions
introduced by the hash-value-to-bucket mapping are more uniformly distributed for Murmur3,
resulting in a more predictable performance than with FNV-1a. Having learned this lesson, we
will use Murmur3 as the hash function in the remainder of our experiments.
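For reference, a device-side sketch of the 32-bit Murmur3 for a single 4-byte key with seed 0 (our transcription of the public-domain algorithm; the two rotations distinguish it from the shift-and-XOR structure of FNV-1a):

__device__ uint32_t murmur3_32(uint32_t key) {
    uint32_t k = key * 0xcc9e2d51u;
    k = (k << 15) | (k >> 17);        // rotate left by 15
    k *= 0x1b873593u;
    uint32_t h = k;                   // seed 0, single 4-byte block
    h = (h << 13) | (h >> 19);        // rotate left by 13
    h = h * 5u + 0xe6546b64u;
    h ^= 4u;                          // input length in bytes
    h ^= h >> 16;                     // finalization (avalanche)
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}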
[Plot: runtime (sec) vs. #groups (M) across the regions I to V for fill factors of 83%, 50%, and 25%, and for a hash table fixed to the L2 cache size.]
Figure 3.17: Evaluating different hash table fill factors for the five regions. (Murmur3 and 13 x
1024 threads)
(2) Hash Table Size and Fill Factor
In our previous experiments, we chose a fill factor of 50%, which results in a hash table size
that has twice the number of entries than the expected number of groups. This fill factor is
usually a good trade-off between hash table size and insert/lookup performance. A larger hash
table would produce fewer conflicts on inserts and a shorter lookup path, but it also consumes
more space in device memory and cache. To evaluate the impact of the hash table size, we
run our initial experiment again using the Murmur3 hash function with varying fill factors.
We choose three adaptive hash table sizes using the fill factors of 25% (4 × #groups), 50% (2 ×
#groups), and 83% (1.2 × #groups), and one fixed-size hash table, which is set to the size of the
L2 cache (1.5MB, about 200K entries in our test scenario). The results are shown in Figure 3.17.
Region I is still dominated by the atomic throughput for all our test cases. Region II shows the
actual impact of the fill factor on execution performance. Here all hash tables fit in the L2
cache, and the performance is bound by bucket contention and linear probing. We can see
that the fixed hash table performs best because it has the smallest fill factor in most cases. The
fill factor of 25% has a slightly worse performance. The higher fill factors have an even worse
performance due to more contention in the hash table. In the transition from Region II to III, we
can see that the fixed L2 version is performing worse because the fill factor and the contentions
are growing until the hash table is too small for the actual data. In this part, the fill factor of
50% performs best because it has less bucket contention than the 83% fill factor and consumes
less memory than the 25% fill factor, where, in this case, efficient caching is not possible. The
advantage is lost when the hash table no longer fits in the cache, in the last part of Regions III
and IV. In Region V, the 2GB TLB problem hits every version at a different number of
groups, because the hash table sizes are different. There, the 83% fill factor version performs
best. These results show that we need to adapt the hash table size and fill factor in order to get
the best possible performance for any number of groups.
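The adaptive sizing itself is simple arithmetic; a hypothetical helper (names are ours) with a worked example:

// Number of hash buckets for an expected group count and fill factor.
static inline uint32_t num_buckets(uint32_t groups, double fill_factor) {
    return (uint32_t)(groups / fill_factor);
}
// Example: num_buckets(100000, 0.5) = 200000 buckets; at 8 bytes per
// bucket (key + count) this is 1.6 MB, roughly the 1.5 MB L2 cache of
// the K80, matching the fixed-size variant above.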
(3) CUDA Grid Parameters
In addition to the hash function and fill factor, we can change the CUDA grid configuration. The
grid configuration specifies a blocks x threads combination:
#total_threads = blocks × threads (3.3)
Threads combined in a block can share resources, and they are guaranteed to be executed
on the same multiprocessor. However, multiple blocks can be executed on one processor at
the same time to hide memory latency. For the K80, one block can contain up to 1024 threads
[Nvidia 2014b]. NVIDIA GPUs execute up to 32 threads simultaneously, so a block should con-
tain a minimum of 32 threads to allow the hardware to be utilized. Usually, shared memory
usage limits the number of threads per block, but since we do not use shared memory in our
implementation, we are free to use any configuration of blocks and threads. In our implemen-
tation, the total work of a 128MB input data stride is divided automatically between the total
number of threads. To evaluate the different grid parameters, we tested the number of threads
in power-of-two steps from 2^0 = 1 up to 2^10 = 1024 threads and the number of blocks in
multiples of the number of multiprocessors (in our case 13). We tested the grid configurations
extensively and found four different behaviors. Representatives of these behaviors are shown
in Figure 3.18. In detail, we show our evaluation with 1000 groups (behavior similar to Regions
I and II), 1 million groups (Region III), 100 million groups (Region IV), and 200million groups
(Region V). For these tests we used a fill factor of 50%.
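The sweep just described can be written as a simple host loop (a sketch; group_by_count stands for the kernel variant under test):

// Threads per block in powers of two (1 .. 1024); blocks in doubling
// multiples of the 13 multiprocessors (13, 26, 52, ..., 53248).
for (int threads = 1; threads <= 1024; threads *= 2)
    for (int blocks = 13; blocks <= 53248; blocks *= 2)
        group_by_count<<<blocks, threads>>>(d_col, n, d_keys, d_counts,
                                            num_buckets);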
Region I/II: Figure 3.18a shows that the ideal grid configuration for this test is greater than
or equal to 26624 total threads, with a minimum of 128 threads per block. If the total number
of threads is below this threshold, there are not enough threads to saturate the PCIe
bus with memory requests and to hide memory latency, resulting in the underutilization of the
GPU.
Region III: Figure 3.18b shows a different behavior. Here, it is beneficial to have between
6656 and 13312 threads or more threads if we only use 64 threads per block. The number of
threads can be explained by the caching behavior. For 1 million groups, the hash table does
not fit in the L2 cache, so each thread has to access global memory, unless the data is in the
cache by chance. Usually, a thread has to access only one 128 byte cache line during a lookup
or insert, which accommodates 16 hash table buckets in our test scenario. Assuming that
the hash bucket for a given key is on average in the middle of a cache line, a second cache
line needs to be loaded only if the linear probing goes beyond 8 steps. With 26624 threads
running simultaneously (ideal configuration for previous test), 3.25MB will be loaded for the
first cache line. Since this does not fit into the L2 cache (1.5MB), the threads are evicting each
other’s cache lines, so that linear probing, even within the same cache line, could result in
multiple loads from global memory. However, with 6656 threads, only 0.813MB are loaded,
fitting perfectly in the L2 cache, which results in “undisturbed” linear probing for each thread.
With 13312 threads, 1.625MB of data is loaded, which does not fit in the L2 cache. However,
the cache evictions are few compared to 3.25MB of data, while the overall parallelism is
double that of 6656 threads. Therefore, this is a trade-off between parallelism and cache-line evictions.
[Heat maps of runtime over CUDA grid configurations (blocks from 13 to 53248, threads per block from 1 to 1024). (a) 1000 groups (Region II). (b) 1 million groups (Region III). (c) 100 million groups (Region IV). (d) 200 million groups (Region V).]
Figure 3.18: Evaluation of grid parameters for four different numbers of groups. The perfor-
mance is encoded from black (worst) to white (best). The highlighted configurations are used
in Figure 3.19. (Murmur3 and fill factor 50%)
Having even fewer threads would reduce the evictions, but it also under-utilizes
the system, causing worse performance. More threads are possible if there are only 64 threads
per block, because our test GPU can only schedule and execute 16 blocks simultaneously on one
multiprocessor [Nvidia 2014b], resulting in 64 threads × 16 blocks × 13 multiprocessors = 13312
threads running simultaneously (same as before). The remaining blocks can be scheduled,
when other blocks finish, leading to a constant number of threads that can be active at one
time.
Region IV: The results for Region IV are more puzzling as shown in Figure 3.18c. Here,
it is possible to achieve good performance with fewer threads; however, the best performance
is given with 53248 x 1024 threads. In addition to the 16-blocks-per-multiprocessor limit,
our GPU only supports a maximum of 2048 threads per multiprocessor. This means that at
most 26624 threads can be active at the same time. So why is 53248 x 1024 better than the
rest? Our implementation works on strides of 128MB per column, or ≈33.6 million tuples.
Executing ≈54.5 million threads (53248 × 1024) leads to one tuple per thread, with some
threads processing no tuples at all. This means each thread processes its single tuple and waits
until the whole block is finished, before another block can be executed. With fewer threads, each
thread would process multiple tuples, accessing multiple memory regions.
[Plot: runtime (sec) vs. #groups (M) across the regions I to V for the grid configurations 26 x 64, 26 x 512, 26 x 1024, and 53248 x 1024.]
Figure 3.19: Best performing grid configurations. (Murmur3 and fill factor 50%)
We assume
that accessing just one memory region per thread is beneficial for the L2 TLB.
Region V: Figure 3.18d shows the grid configurations for the L3 TLB problem. The perfor-
mance improves when we reduce the number of threads to 1664 or 832. This can be explained
by the L3 TLB properties, as the number of TLB entries is 1032. Each thread loads one
page translation into the TLB cache and, due to linear probing, the threads potentially need mul-
tiple accesses to the same page. With 832 threads, it is unlikely that other threads are evicting
TLB entries that are still needed. 1664 threads are a trade-off between higher parallelism and
evicting page translations of other threads. More threads would lead to more evictions, while
fewer threads underutilize the GPU. Interestingly, reducing the number of threads per block is
beneficial even below 32, down to 4 or 2 threads, although usually 32 threads per block are the
minimum to utilize the GPU.
Based on the results shown in Figure 3.18, we choose the best-performing configurations
for the different behaviors and evaluate these configurations in our initial test scenario with
growing numbers of groups. Specifically, we choose the following blocks × threads configura-
tions: 26 x 1024, 26 x 512, 53248 x 1024, and 26 x 64. The results are shown in Figure 3.19.
As expected, the single configurations are optimal in the regions for which they were selected,
while showing bad performance in most other regions. Again, this shows us that we need to
adapt the grid configuration according to the hash table size.
(4) Private Hash Tables
To reduce the atomic contention for small hash table sizes (up to 500 groups), we developed
two approaches of private hash tables, where not all threads update the hash table entries at
the same time. The two approaches are:
[Plot: runtime (sec) vs. number of groups (1 to 100) for a single global hash table (13 x 1024), thread-local hash tables (13 x 1024), and block-local hash tables (13 x 1024 and 104 x 256).]
Figure 3.20: Performance of local hash table implementations for low cardinality. (Murmur3
and 13 x 1024 threads)
• Thread-local Hash Tables: Each thread keeps its own hash table, into which it inserts all
tuples it processes, without the need for atomics. When a thread finishes, the
private hash table is merged with the global hash table. Since one private hash table per
thread is memory intensive, the private hash tables need to be stored in global memory.
• Block-local Hash Tables: Each block keeps a hash table to insert all tuples that are pro-
cessed by the threads of this block. The hash table accesses need to be atomic, to ensure
correct results, but fewer threads try to update the local hash table compared to one
global hash table. The local hash table size can be fixed to the shared memory size. When
a block finishes, the private hash table is merged with the global hash table.
The test results for 1 to 100 groups are illustrated in Figure 3.20. The thread-local hash
tables perform well until 6 groups, before increasing in runtime. This is caused by the size of
the local hash tables, which exceed the L2 cache even for a few groups. In contrast,
the block-local hash tables have contention on the atomics for small numbers of groups, but
perform well for larger numbers of groups. We also tested both approaches with different
grid configurations. The thread-local approach does not improve with more or fewer threads,
as it is currently a trade-off between parallel execution and total size of the cumulated hash
tables. The block-local execution improves with a higher number of total threads, however,
with fewer threads per block. Fewer threads per block reduce the contention on the atomic
operations, since fewer threads try to access the same value. We found 104x256 to be the best
configuration for our tests.
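A sketch of the block-local variant is shown below (assumed details; EMPTY_KEY, ht_insert, and murmur3_32 are the helpers from the earlier sketches, and the local table size is chosen to fit into shared memory):

#define SMEM_BUCKETS 2048   // 2048 buckets * 8 B = 16 KB of shared memory

__global__ void group_by_block_local(const uint32_t *col, uint32_t n,
                                     uint32_t *g_keys, uint32_t *g_counts,
                                     uint32_t g_buckets) {
    __shared__ uint32_t s_keys[SMEM_BUCKETS], s_counts[SMEM_BUCKETS];
    for (uint32_t b = threadIdx.x; b < SMEM_BUCKETS; b += blockDim.x) {
        s_keys[b] = EMPTY_KEY;           // initialize the local table
        s_counts[b] = 0;
    }
    __syncthreads();
    for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += gridDim.x * blockDim.x) {
        uint32_t key = col[i];
        ht_insert(s_keys, s_counts, SMEM_BUCKETS,
                  murmur3_32(key) % SMEM_BUCKETS, key, 1u);
    }
    __syncthreads();
    // Merge this block's partial result into the global hash table.
    for (uint32_t b = threadIdx.x; b < SMEM_BUCKETS; b += blockDim.x)
        if (s_keys[b] != EMPTY_KEY)
            ht_insert(g_keys, g_counts, g_buckets,
                      murmur3_32(s_keys[b]) % g_buckets,
                      s_keys[b], s_counts[b]);
}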
[Plots: runtime (sec) vs. number of groups (M), up to 800M. (a) L3 TLB problem with proposed optimizations: fill factors 50% and 83% combined with 13 x 1024 and 26 x 64 grids. (b) L3 TLB problem with TLB-conscious access: 128MB and 2GB scopes compared to the baselines.]
Figure 3.21: Reducing the performance decrease of the L3 TLB problem. (Murmur3)
(5) TLB-conscious Data Access
In addition to the small group numbers, we need to look at the large ones beyond 130MB and
2GB, where we identified different TLB levels to be the problem. Figure 3.21a illustrates the
2GB problem without the logarithmic scale. The runtime increases from 5 sec at 120M groups
to 65 sec at 720M groups, a drastic increase of 13x.
We have already explained possible solutions using optimized hash table fill factors and
optimized CUDA grid parameters. The results for fewer threads, a higher fill factor, and their
combination are shown in detail in Figure 3.21a. We can see that a higher fill factor delays the
TLB problem towards higher group numbers, due to smaller hash tables. Also, the resulting
execution can accommodate more groups in the 12GB of GPU memory (compared to a fill
factor of 0.5). Fewer threads are beneficial, once the L3 TLB problem is encountered, as the
pressure on the L3 TLB is reduced. All in all, the 2GB problem is reduced but we still experience
a slowdown of more than 4x compared to the runtime without the problem.
TLB-conscious Approach: To improve the performance further, we propose a trade-off between
TLB misses and redundant work. In detail, we propose to restrict the hash table access to the
amount of data that can be addressed in the TLB (TLB scope). To allow accesses to the whole
hash table, we use the TLB scope as an iterative window, executing the operator in multiple passes,
each allowing access to a different TLB scope. This results in HT_size / TLB_scope passes, where each pass
reads the full input data and allows hash table accesses only within the given scope. The idea is
visualized in Figure 3.22. Here, the input has to be read three times, each time with a different
TLB scope S1, S2, S3. This approach avoids most TLB misses, while repeating the scan of the
input data. However, this would also result in multiple transfers via PCIe bus. To further
improve this naive approach of Figure 3.22, we propose to read the input only once over PCIe
in Pass 1. For this pass, we add the applicable tuples of the first TLB scope to the hash table and
cache all other tuples on the GPU in a separate buffer. Since the access to the separate buffer is
according to the thread ID, the memory accesses are coalesced and do not show the problems of random memory access.
[Diagram: the hash table is divided into TLB scopes S1, S2, S3; each pass scans the input and accesses only one scope. (a) First pass. (b) Second pass. (c) Third pass.]
Figure 3.22: TLB-conscious data access with repeated scans over the input data.
Finally, all passes beyond the first one can read from the cached
buffer and do not need to transfer data over PCIe. The cached buffer has our input stride size
of 128 MB, which is usually no significant memory overhead, when working with hash tables
of multiple GBs.
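A sketch of the scoped execution (assumed control flow; helper names as in the earlier sketches). The bucket index decides whether a tuple belongs to the current scope; on the first pass, out-of-scope tuples are appended to the GPU-resident buffer, and later passes are launched with the buffer as input:

__global__ void group_by_scoped(const uint32_t *in, uint32_t n,
                                uint32_t *keys, uint32_t *counts,
                                uint32_t num_buckets,
                                uint32_t scope_begin, uint32_t scope_end,
                                uint32_t *buffer, uint32_t *buf_len) {
    for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += gridDim.x * blockDim.x) {
        uint32_t key = in[i];
        uint32_t b = murmur3_32(key) % num_buckets;
        if (b >= scope_begin && b < scope_end)
            ht_insert(keys, counts, num_buckets, b, key, 1u);
        else if (buffer != nullptr)   // first pass: cache for later passes
            buffer[atomicAdd(buf_len, 1u)] = key;
    }
}
// Later passes read the cached buffer, shift [scope_begin, scope_end) by
// the TLB scope, and pass buffer = nullptr, so out-of-scope tuples are
// simply skipped until the pass for their scope runs.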
Evaluation: The result of this approach is shown in Figure 3.21b. We use two different TLB
scope sizes, 128MB potentially avoiding L2 TLB misses and 2GB potentially avoiding L3 TLB
misses. Generally, we see a good performance for both TLB scopes. The 128MB scope increases
in runtime with more groups, as the hash table size grows and, therefore, more and more passes
are needed to access the full hash table. In the end, the pass overhead leads to the linearly grow-
ing runtime in Figure 3.21b. There is a break-even point around 130M groups (≈2GB hash
table size), where the pass overhead surpasses the L2 TLB problems of the original hash table
access. For 2GB TLB-scopes, the pass overhead is not visible this much, as there are only 6
passes needed to cover a maximum size of a 12GB hash table. Compared to the original L3 TLB
problem, this overhead is insignificant. All in all, our TLB-conscious Data Access with 128MB
scopes hides the L2 TLB misses, however, with significant pass overhead at some point. On the
other hand, 2GB scopes hide the L3 TLB misses with insignificant pass overheads for current
GPU memory sizes.
Considering TLB sharing: We also considered using the L2 TLB sharing of multiple SMs
to reduce the amount of passes. The idea is that we could assign different L2 TLB scopes to
different groups of SMs that share the TLB. The K80 has five of these groups (containing 1 to 3
SMs), therefore, it might be possible to access five TLB scopes in one pass. However, there are two drawbacks that convinced us not to follow this approach.
(1) We have seen in Section 3.2.3 that these L2 TLB sharing groups usually consist of three
SMs with the possibility to only include 1 or 2 SMs. Assigning different TLB Scopes and dif-
ferent passes to the L2 TLB groups would lead to skewed computational power, where a group
with three SMs would finish a pass much earlier than a group with one SM. Assigning more
passes to the larger group in order to balance the load is not trivial. To execute threads on differ-
ent SMs, they have to be scheduled in different thread-blocks. However, there is no mechanism to synchronize multiple blocks within one kernel. Therefore, the threads on multiple SMs do
not know when a pass has finished and a new pass can be started.
3.2 Approach II: Static Placement 71
(2) In addition to the skewed computation, the input data has to be accessed for every pass.
Even when executing multiple passes in parallel (each for one TLB sharing group), the input
data has to be accessed once per pass. Sharing data accesses between the different SM groups
through the L2 data cache is unlikely, because the different executions are not synchronized.
For example, it is not guaranteed that all SMs access the 100th input data element at the same
time, therefore, the input data is most likely not in the cache when accessed. When the execution can not share input data accesses, there is no difference between parallel passes with fewer SMs and sequential passes with all SMs. Therefore, we avoid this additional complexity and treat the L2 TLB as if it were global for all SMs.
(6) Sort-Based Approach
In addition to hash-based group-by implementations, we considered sort-based approaches.
Sorting could mitigate some effects, since it can have a much more predictable memory ac-
cess pattern. However, we found that sorting can only be beneficial, if the whole input data
can be stored, sorted, and reduced on the GPU. This is impossible for our 6GB of input data,
since additional memory space is needed during the sort process. Likewise, sorting the whole
24GB table in one step is impossible as well. Therefore, we also evaluated strided sort, where a 128MB stride is transferred, sorted, reduced, and merged with the previous results. There, the performance is worse than that of the hash-based group-by, due to the merging overhead. Because of this inferior performance, we do not consider sorting for our final implementation.
3.2.5 Configuration-based Optimizer
Finally, we can take the parameter and implementation insights of the previous sections and
build a simple model that switches between the different ideal configurations, depending on
the expected number of groups. The configurations are shown in Table 3.2. We use Murmur3
for all configurations. For small cardinalities, we switch from thread-local to block-local and to
global hash tables. For global hash tables we switch from a fixed size hash table, to a fill factor
of 50% and from 26x1024 threads to 26x512 and 53248x1024 threads. For a number of groups
beyond 7M, we switch to TLB-conscious data access, first with TLB scopes of 128MB and later
with 2GB. Finally, for 2GB scopes, we first use a fill factor of 50% and then switch to a fixed
hash table size of 11GB. Fixing the hash table to 11GB leaves enough space for the input data
stride (128MB) on the GPU when using the TLB-conscious data access, while still allowing the
fill factor to rise until the hash table is completely filled. This enables our implementation to
store up to 1.47 billion groups, while the initial implementation could only store 805 million
groups.
Figure 3.23 compares our optimized implementation with the initial version with FNV-1a,
13x1024 threads, and a fill factor of 50%. We can see great improvements for Region I and the
atomic contention, Region II and the hashing contention, Region IV and L2 TLB misses, as well
as Region V and L3 TLB misses. For Region III, the improvement is only small, mainly because
our initial choice of parameters could not be improved (e.g., fill factor 50% and 13312 threads).
However, in our extensive tests, we showed that different parameters could in fact increase the
runtime, so our contribution is to confirm the initial parameter choices for Region III.
Label range (#groups) HT access HT size Grid Hash func.
A [1 - 3] thread-local FF 50% 13x1024 Murmur3
B (3 - 500] block-local FF 50% 104x256 Murmur3
C (500 - 100K] global 1.5MB 26x1024 Murmur3
D (100K - 200K] global FF 50% 26x1024 Murmur3
E (200K - 10M] global FF 50% 26x512 Murmur3
F (10M - 110M] global, 128MB TLB scope FF 50% 26x512 Murmur3
G (110M - 663M] global, 2GB TLB scope FF 50% 53248x1024 Murmur3
H from 663M global, 2GB TLB scope 11GB 53248x1024 Murmur3
Table 3.2: Different configurations for the group-by operator and their range of application.
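Encoded as code, this optimizer is just a cascade over the expected group count. The following C++ sketch encodes Table 3.2; the struct layout and string encoding are illustrative, not the actual implementation:

struct Config {
    const char* htAccess; // hash table access mode
    const char* htSize;   // fill factor or fixed size
    const char* grid;     // blocks x threads
};

// Configurations A to H of Table 3.2; all use Murmur3.
Config selectConfig(double expectedGroups) {
    if (expectedGroups <= 3)     return {"thread-local", "FF 50%", "13x1024"};  // A
    if (expectedGroups <= 500)   return {"block-local",  "FF 50%", "104x256"};  // B
    if (expectedGroups <= 100e3) return {"global",       "1.5MB",  "26x1024"};  // C
    if (expectedGroups <= 200e3) return {"global",       "FF 50%", "26x1024"};  // D
    if (expectedGroups <= 10e6)  return {"global",       "FF 50%", "26x512"};   // E
    if (expectedGroups <= 110e6)
        return {"global, 128MB TLB scope", "FF 50%", "26x512"};                 // F
    if (expectedGroups <= 663e6)
        return {"global, 2GB TLB scope",   "FF 50%", "53248x1024"};             // G
    return     {"global, 2GB TLB scope",   "11GB",   "53248x1024"};             // H
}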
[Log-log plot of runtime (sec) over #groups (M), with the configuration regions A to H marked and curves for the implementation before and after the optimization.]
Figure 3.23: Resulting performance through switching the different configurations (A to H).
K80 (Kepler) P100 (Pascal)
Multi-processors 13 56
L2 cache size 1.5MB 4MB
Global Memory 12GB 16GB
Memory Bandwidth 168.66GB/s 562.14GB/s
Clock Frequency 875 MHz 1328MHz
L1 TLB cache 16 x 128KB pages 16 x 2MB pages
L2 TLB cache 65 x 2MB pages 65 x 32MB pages
L3 TLB cache 1032 x 2MB pages –
Table 3.3: Hardware differences of K80 (single GPU) and P100.
3.2.6 Conclusion and Transferability of Results
To conclude this approach, we want to look at the complexity of the approach and the transferability of the shown results by answering the two questions from the beginning.
How much hardware-specific fine-tuning is needed to make an operator perform well
in any scenario? For the group-by operator, we identified five regions with different effects
and bottlenecks and proposed six different approaches to reduce harmful performance effects,
resulting in eight different configurations. We conclude that the extent of the search space for
the different configurations is large and extensive benchmarking and fine-tuning is needed to
achieve the best performance. The search space also seems too large for automatic fine-tuning,
as too many options need to be considered:
search_space = No. of hash functions (FNV-1a, Murmur3, etc.)
             * No. of fill factors (1% - 100%, or fixed to L2 size, 11GB, etc.)
             * No. of grid configurations (1 thread to millions, different block sizes)
             * No. of hash table access modes (thread-local, block-local, global)
             * No. of TLB scopes (128MB, 2GB, and many more)
             * No. of complete redesigns (e.g., sort-based grouping)
How extensible is this approach towards supporting multiple operators and multiple
hardware platforms? We want to discuss this question for two directions: hardware and
software.
Hardware: To evaluate the transferability of our results, we evaluate our approach on the
newer Nvidia GPU P100 [Nvidia 2016]. The hardware properties in comparison to the K80 are
presented in Table 3.3. As we can see, the properties differ from one architecture to another.
We can still apply most of our optimization approaches; however, the ideal configurations and the points where to switch configurations have to change:
• To utilize the increased multi-processor count, we need to apply at least 56 × 1024 = 57,344 threads instead of 13 × 1024 = 13,312 for the K80. This increases the pressure on
atomics, on the L2 data cache, and the TLB caches.
• With different L2 cache sizes, the configuration switching-points need to be changed for
the thread-private approach and the global approach.
• The memory bandwidth is 3.3x higher and the frequency is 1.5x higher for the P100
compared to the K80. This means that data can be loaded faster from the GPU memory
and possibly needed cycles for a TLB miss delay are shorter. Therefore, L2 cache misses
and TLB misses are not as harmful for the P100 as for the K80. With less performance
penalty for data cache misses and TLB misses, our solutions, like TLB conscious data
access, could show more significant overhead on the P100 than on the K80.
• The P100 TLB works with larger page sizes, leading to the L2 TLB storing 2080MB of
data. This is similar to the L3 TLB border of 2GB for the K80. Therefore, our TLB scope
of 2GB is still beneficial. However, the smaller TLB scope of 128MB, initially meant to
avoid L2 TLB misses, will not be beneficial on the P100, as the L1 TLB stores only 32MB.
All in all, the ideal configurations and implementations differ for nearly all regions, to-
gether with the region borders and the switching points of configurations. It is not possible
to simply execute this highly-optimized implementation on the P100 without revisiting all our tests and benchmarks. Moreover, these were only Nvidia GPUs of two different architectures. The con-
figurations would differ even more when using AMD GPUs, CPUs, or the Xeon Phi. Each
architecture needs its own set of specifically-tailored optimizations together with highly spe-
cific switching points.
Software: In addition to different CUs, the operator itself can change. We only evaluated a hash
table with two entries per bucket, where more entries lead to larger tables and more usage of
atomic instructions. More input columns can mask some bottlenecks, where our optimiza-
tions, like TLB conscious data access, could introduce more overhead since the prevented ef-
fects were not the real bottleneck. All of this can influence the choice of ideal configurations
and the switching points for the optimizer. Furthermore, this is only one operator out of many.
A full database approach of executing operators statically on a GPU with high-performance
code optimizations would need many optimization steps for all possible configurations and op-
erators.
We summarize that fine-tuning offloaded operators for one hardware platform is possible
but connected with an immense amount of effort. This effort has to be repeated for every varia-
tion of the operator, every new hardware platform, and, ideally, for many different operators.
3.3 APPROACH III: DYNAMIC PLACEMENT
In the previous sections, we have presented two approaches to utilize heterogeneous comput-
ing environments: (1) partitioning input data and performing the operator execution in parallel
on multiple CUs and (2) executing an operator atomically on one fixed CU, with the opportu-
nity to highly optimize the implementation for this one CU. While the first approach shows
various limitations and a low speedup, the second approach shows good results. However, the
effort to optimize and fine-tune multiple database operators for multiple CUs is large and not
practical with many operators or CUs.
Our third approach is based on the observations of the previous ones. We decide to execute
each operator atomically on a single CU, however, we want to easily support multiple CUs in a
system by dynamically defining an execution location (placement) for each operator. We first
review the observations from the previous approaches in Section 3.3.1. Afterwards, we present
a case study on dynamic placement by executing the hash-based group-by operator on eight
different CUs and show how limitations of one CU can be hidden by others (Section 3.3.2).
In Section 3.3.3, we compare the code-optimization approach with the dynamic placement
approach and, finally, we present the challenges for automatic placement decision-making in
Section 3.3.4.
3.3.1 Observations from Previous Approaches
The observations and learned lessons from the previous approaches motivate this approach of
dynamic operator placement.
Intra-operator parallelism on multiple CUs. For Approach I, we have seen that intra-operator parallelism on multiple CUs is only beneficial if three properties are given: (1) the operator
does not need a time-consuming merging step, (2) the heterogeneous CUs execute an operator
nearly equally fast, and (3) the host CPU has enough cores to reserve one for GPU controlling.
Especially the first two points are not given for most operators in heterogeneous environments.
The first point can be solved by executing an operator atomically on one CU and, therefore,
avoiding the merging step. As for the second point, heterogeneous environments inherently
stand for heterogeneous execution behavior on the different CUs and therefore different exe-
cution times. However, this can be used as an opportunity to find the better performing CU
for one operator to be executed on. For Approach I, we have seen that in most cases one CU is
better than the other for specific operator executions and certain data sizes. We also see that
this choice has to change according to the data sizes, as CUs performed differently for small
and for large data sizes (e.g., Figure 3.4 and Figure 3.8).
Static execution and optimization. With the second approach, we have seen that a fixed
assignment of one operator to one specific CU allows in-depth optimization and fine-tuning,
showing good results. This approach can be applied in real life if the number of CUs is small and precisely known, e.g., when designing a database system together with the hardware.
Algorithm 2 Different Memory Access Patterns
1: procedure Coalesced-memory-access
2:   for i ← threadID; i < #inputElements; i ← i + #threads do
3:     insertIntoHashTable(inputData[i])
4: procedure Block-wise-memory-access
5:   start ← threadID * elementsPerThread
6:   end ← (threadID + 1) * elementsPerThread
7:   for i ← start; i < end and i < #inputElements; i ← i + 1 do
8:     insertIntoHashTable(inputData[i])
However, supporting multiple operators or multiple CUs results in a large implementation effort.
To make an operator implementation more transferable to different CUs, it is possible to apply
only light-weight optimizations that are beneficial for many CUs. An example is the adjustment
of the hash function for the group-by operator. However, every hardware-specific optimization
hinders the portability of this operator to other systems, e.g., executing our highly optimized
GPU operator on a Xeon Phi. Without hardware-specific optimizations, it is possible to execute
the operator on a number of CUs with reasonable performance, while possible performance
bottlenecks can be hidden through switching to another CU.
3.3.2 Case Study: Dynamic Placement of Group-by Operator
As a result of the previous approaches, we propose the dynamic placement approach. This ap-
proach assigns the execution location according to the available CUs, the operator, and the used
data. The actual operator implementation is not specifically optimized for any specific CU.
To evaluate this approach, we use the group-by operator from Section 3.2. We port the
group-by operator to OpenCL in order to be able to execute the operator on multiple CUs. As
for the optimizations from Approach II, we do not adjust any parameters according to the group count; however, we choose default parameters that differ slightly for each CU. All executions are fixed to use Murmur3 instead of FNV-1a, as this is a general software optimization (Sec-
tion 3.2.4). Additionally, the input data is fixed to 1GB (≈268M values), always located on the host side, and the fill factor is fixed to 0.5. Before execution on a new CU, we run a test execution with 50k groups, evaluating different configurations and choosing the best performing one. The configurations consist of the number of elements per thread, #elem (directly influ-
encing the number of threads), and the memory access type, i.e., coalesced or block-wise. The
different memory access types are shown in Algorithm 2.
Coalesced memory access ensures that neighboring threads load neighboring data at the
same time, allowing the memory access to be combined, if the CU supports this kind of mem-
ory access (supported by most GPUs). Block-wise memory access ensures that one thread reads
neighboring input data, leading to an improved cache usage for small amounts of cores and
per-core caches (mostly beneficial for CPUs). These two different memory access patterns are important when working with different architectures like CPUs and GPUs, because the wrong memory access pattern can largely harm performance.
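The per-CU test execution can be realized as a small exhaustive probe over this limited search space. The following C++ sketch assumes a hypothetical runAndTime helper that launches the OpenCL kernel with a given variant on the 50k-group test workload and returns the runtime in seconds; the candidate element counts are illustrative:

#include <functional>
#include <limits>
#include <vector>

struct Variant {
    int elementsPerThread; // #elem, determines the number of threads
    bool coalesced;        // coalesced vs. block-wise access
};

// Try every candidate once with the fixed test workload and keep the
// fastest; the probe runs once before the first execution on a new CU.
Variant calibrate(const std::function<double(const Variant&)>& runAndTime) {
    std::vector<Variant> candidates;
    for (int e : {1, 4, 8, 16, 32, 64, 128})
        for (bool c : {true, false})
            candidates.push_back({e, c});
    Variant best = candidates.front();
    double bestTime = std::numeric_limits<double>::max();
    for (const Variant& v : candidates) {
        double t = runAndTime(v);
        if (t < bestTime) { bestTime = t; best = v; }
    }
    return best;
}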
[Eight log-log plots of runtime (sec) over the number of groups (M), one per CU, each annotated with the selected configuration: (a) Nvidia K80: 64 elements per thread, coalesced; (b) Nvidia GT640: 8 elements per thread, coalesced; (c) Intel iGPU: 64 elements per thread, coalesced; (d) AMD iGPU: 1 element per thread; (e) AMD CPU: 32 elements per thread, block-wise; (f) AMD Tahiti GPU: 128 elements per thread, coalesced; (g) Intel Xeon CPU: 16 elements per thread, block-wise; (h) Intel Xeon Phi: 4 elements per thread, block-wise.]
Figure 3.24: Testing the hash-based Group-By on different computing units showing different
effects with varying group sizes.
After evaluating the ideal memory access pattern and the ideal number of elements per thread (#elem) for each CU, we can execute
the operator with different amounts of groups. Figure 3.24 shows the selected configuration
and the performance results for eight different CUs including different GPUs from Nvidia,
AMD, and Intel, different CPUs from Intel and AMD, and Intel’s Xeon Phi. The hardware
properties of the different CUs are presented in Table 2.3. All CPUs prefer block-wise memory
access, while the GPUs prefer coalesced access. Only the AMD iGPU works best with one ele-
ment per thread, where the choice of coalesced or block-wise memory access does not matter.
Instead, the GPU internal scheduler defines the access pattern through scheduling the indi-
vidual threads. The different CUs show different effects and limitations when executing the
group-by operator. We describe these effects in the following:
Global Memory: Each CU stores the hash table in its global memory, limiting the maximal
amount of groups that can be supported. The largest hash tables can be stored on the K80
(12GB), the Xeon Phi (16GB), the AMD CPU (32GB), and the Xeon CPU (64GB), while the
other CUs only support smaller hash tables.
Host Memory Access: The presented CUs also differ in host memory access. While CPUs
and integrated GPUs can access the host memory directly, the other CUs use direct memory
access but have to transfer the data through the PCIe bus (generation 2 or 3). Especially the PCIe2 bus limits the transfer to a maximum of 6GB/s, which can be seen for the AMD Tahiti GPU (Figure 3.24f). There, the straight line at 0.2 sec (for 1GB of input data, i.e., 5GB/s) indicates that the runtime cannot drop below this bound.
Atomic Contention: Another heterogeneous effect is the impact of atomic contention. Each
CU shows a slowdown for small numbers of groups but some CUs show better performance
or a steeper slope of improvement. All Intel-based CUs show a significant impact of atomic
contention, while especially the AMD Tahiti GPU shows the best results.
Caches: As seen earlier, caches have a high impact on performance, depending on the size
of the hash table. For our test cases in Figure 3.24, the impact is clearly visible. All CUs have different cache sizes and, hence, different hash table sizes at which the runtime increases. In general,
we can see that CPUs and the Xeon Phi have larger caches than GPUs and, therefore, can show
good performance even for larger hash tables.
Performance: Resulting from the mentioned differences, the CUs differ in performance, sometimes showing surprising effects like the Intel iGPU (used in a low-power laptop) being faster than the high-end Xeon Phi accelerator for 10 - 20k groups.
As there are many factors that differ for the given CUs when executing the group-by oper-
ator, our hope is that one CU can mitigate the limitations of others by switching the execution assignment depending on which CU performs best. To confirm this idea, we simulate having
some of the presented CUs in one system and switch the execution according to the measured
performance. Figure 3.25a shows the resulting execution behavior. For Figure 3.25a, we as-
sume to have the Tahiti GPU, K80, Xeon Phi, and Xeon CPU in one system. We have to switch
[Two log-log plots of runtime (sec) over the number of groups (M), showing the best CU per group range: (a) Tahiti GPU (A), K80 (B), Xeon Phi (C), and Xeon CPU (D), used in the order A, B, A, C, D; (b) GT640 (A), Intel iGPU (B), and AMD CPU (C), used in the order A, B, A, C.]
Figure 3.25: Choosing the best CUs for different numbers of groups.
the execution four times to achieve the best performance for the whole range. The K80 can
be used for a large range of group numbers, while the Tahiti GPU can be used to hide atomic
contentions and the L2 TLB cache problems of the K80. The Xeon Phi and the Xeon CPU
help to overcome the limited memory space of the K80 and Tahiti GPU. For Figure 3.25b,
we assume to have a system consisting of the GT640, the Intel iGPU and the AMD CPU. In
this scenario, three switching points are needed. The GT640 shows the best performance for
atomic contention of small groups, while the Intel iGPU shows good performance before cache
boundaries are reached. Then the GT640 can hide these cache effects again, while the AMD CPU is used for large group counts due to its larger memory space.
These two examples show the potential of heterogeneous placement, where the execution
is switched between different CUs to hide each other's limitations. There are two questions that
need to be answered: (1) Is this dynamic placement approach more beneficial than the static
approach with a highly optimized implementation? (2) How can we achieve this dynamic
placement automatically in a database system? We will discuss both questions in the following.
3.3.3 High Performance vs. Dynamic Placement
High performance execution can be achieved using a static placement with a fixed, pre-known CU, which allows the operator implementation to be highly optimized, as shown with our Approach II in Figure 3.23. Dynamic placement is the approach shown in Figure 3.25, where multiple potentially-unknown CUs are used to hide each other's limitations. There, code optimization is largely impossible, as the operator has to be able to execute on all available CUs.
Both approaches have certain advantages and disadvantages, as well as a certain cost. The
static approach allows code optimizations leading to high execution improvements (up to 13x
for the group-by) and once the operator is profiled and optimized, the execution behavior is
well-known and understood. On the other hand, the implementation needs to be manually fine-
tuned, limiting the approach to be applied only to a few CUs and only a few operators due to
the extent of the tuning efforts. As shown in our example, we adjust many different properties,
which form a search space. Only considering the tested options from Section 3.2.4, we arrive at a search space of about 20.6K possible configurations:

2 different hash functions
* 4 different fill factors
* 11x13 different grid configurations
* 3 different hash table access modes
* 3 different TLB-conscious optimizations
* 2 different general approaches
= 20,592 possible configurations (per group size)
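As a quick sanity check of this count, the product can be verified at compile time with a standalone C++ snippet:

// The tested options of Section 3.2.4 multiply out to 20,592.
constexpr int searchSpace = 2 * 4 * (11 * 13) * 3 * 3 * 2;
static_assert(searchSpace == 20592, "configurations per group size");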
This number is still optimistic, as there are many dozens of hash functions that could be
tested; fill factors can range between 0% and 100%; the strides used for TLB conscious data
access could range from 1MB to many more; and there are possibly more approaches to im-
plement the group-by operator. Finally, the presented search space is specific for the group-by
operator. Other operators have different and maybe even more possible configurations. This
shows the complexity of this approach and strengthens our claim that this kind of optimization
is only possible for a small number of CUs and operators. There have been attempts of automat-
ing this approach by generating a pool of possible configurations for an operator, choosing and
evaluating some configurations out of this pool at run-time, and iteratively updating the chosen
configuration [Rosenfeld et al. 2015; Broneske et al. 2014]. Given the different behavior for different group counts, the thousands of possible configurations, and the fact that we want to support many more operators, this would lead to a long training time until an acceptable performance is reached.
In contrast to the static approach, the dynamic approach can support many (unknown) CUs,
if it is possible to automatically define the placement. For each number of groups, an automated
system would have to decide on which CU the operator should be executed. The search space
per hash table size would be N, where N equals the number of available CUs in the system.
In current heterogeneous systems, this is likely to be a one-digit number. The disadvantage is the reduced potential for hardware-specific code optimization. However, optimization is still possible to some extent, as we have shown in the previous section. We implemented an operator with a variable memory access pattern and a variable number of processed elements per thread. Before the
execution, we briefly benchmarked each CU to find their preferred properties for this limited
search space. This kind of light-weight code optimization is always possible, even with dy-
namic placement. We could even go one step further, allowing different implementations for
different CUs. There, whenever the dynamic placement wants to use the K80, it can use the
highly optimized version from Section 3.2, while for other CUs, it can fall back to the naive
implementation. Therefore, it is possible to make the effort and optimize the one or two most
promising CUs, while still using dynamic execution.
3.3.4 Challenges for Automatic Placement Decisions
We identified five challenges (C1 to C5) to be addressed in order to fully automate placement
decisions and achieve placement results similar to the manual placement in Figure 3.25. These
challenges are:
C1 Ability to execute heterogeneously: Usually, database operators are implemented for
the CPU using C/C++ or Java, while operators that need to be offloaded to a GPU mainly use CUDA. In order to allow heterogeneous placement of operators, the operators need to be able to execute on all available CUs in a system. Two directions are possible.
(a) Operators could be implemented for each CU in their native language (e.g., C++
and CUDA), resulting in a variety of implementations per operator. This can result
in efficient, fine-tuned code for each CU, like the group-by Operator in Section 3.2.
However, as seen in Section 3.2, these implementations are not portable and the
developer has to provide a new implementation for every unsupported CU, making the whole approach less extensible. Systems following this approach are gpuQP [He
et al. 2009] and CoGaDB [Breß 2014].
(b) As another option, operators could be implemented once in an abstracting program-
ming language like OpenCL, where the single code base is compiled against differ-
ent hardware platforms, leading to the support of many different CUs. This way, one
implementation can be used by a variety of CUs as long as the hardware vendor pro-
vides an OpenCL driver and compiler. To make the operators portable, the imple-
mentation should not be optimized towards one particular architecture, leading to
worse performance than the first approach. However, as shown in Figure 3.25, limi-
tations and bad performance can be hidden by heterogeneous placement with a suf-
ficient number of different CUs. Systems implementing their operators in OpenCL
are gpuDB [Yuan et al. 2013] and Ocelot [Heimel et al. 2013].
C2 Execution-time knowledge: To decide an operator’s placement, the execution-time of
the operator has to be known for each CU. For database systems, this task is challenging
because there aremany different operators that can be executed with arbitrary input data,
leading to different operator runtimes. Different to traditional cardinality estimations, all
CUs need to be considered separately, since operator executions with the same cardinal-
ity show different runtimes on different CUs. All in all, a runtime estimator needs to
consider (1) the CU, (2) the operator, and (3) the used data for the operator. There were
several approaches in related work: He et al. propose a benchmarking approach, where
the execution behavior of operators is benchmarked before execution, to allow an estima-
tion at run-time [He et al. 2009]. Breß et al. propose an online-learning approach, where
operator executions are monitored and the learned execution times are used to estimate
future executions [Breß 2013].
C3 Placement optimization: Even when being able to execute operators on all CUs and
having the knowledge of their execution time, making the actual placement decision
adds additional complexity. The main reason for this complexity are the dependencies
82 Chapter 3 Approaches to Utilize Heterogeneous Environments
and data sharing between different operators within the query plan. For independent
operators, the placement decision can be made by examining the execution costs and the
transfer costs from main memory to the CU for input data and from the CU to the main
memory for resulting data. However, for dependent operators in a query plan, data may
need to be transfered from one CU to another and decisions for one operator may influ-
ence decisions for following operators. Therefore, a comprehensive way of optimization
is needed, to consider all side-effects of a single operator’s placement decision.
C4 Interaction with query optimization: In addition to making the placement decision for operators, the integration into the database system and in particular into the database
optimizer needs to be defined. This includes the optimization time, e.g., at run-time or
compile-time, as well as the choice to adopt traditional query optimization techniques
like cardinality estimation for unknown intermediate results.
C5 Extensibility: Finally, there are two ways of extensibility that should be considered.
First, the decision maker should be easily extensible to support future hardware or it
should even provide this support automatically. Second, a good heterogeneous place-
ment approach should be easily applicable to a variety of database systems, without the need to reimplement these systems completely. To achieve this extensibility, there is a
trade-off between abstracting the heterogeneous environment or making the whole en-
vironment visible to the database system. The first approach makes the database system oblivious to the underlying hardware, therefore allowing it to intrinsically adapt to hardware changes. For the second approach, extensibility has to be added
deliberately to the database system.
In the following, we base our research on OpenCL-based operators, where all operators
are able to execute on all OpenCL-based CUs. We propose approaches to challenges C2 and
C3 in our general placement optimization (Chapter 4). Afterwards, we present solutions to
challenges C4 and C5 as well as improvements to the naive approach solving challenge C3 in
our adaptive placement optimization (Chapter 5).
4
GENERAL PLACEMENT OPTIMIZATION
4.1 Runtime and Transfer Estimation
4.2 Local Placement Strategy
4.3 Global Placement Strategy
4.4 Evaluation
4.5 Conclusion
Heterogeneous placement is the approach we choose to utilize heterogeneous computing environments. Our goal is an extensible and automated approach to assign portions of work, e.g., database operators, to different computing units to improve the overall query runtime.
For the current hardware trends presented in Section 1, we see exponentially increasing
main memory sizes, while the computational capabilities and memory access bandwidths are
not improving at the same rate. Especially in-memory database systems with no bottleneck for
disk accesses are suffering from this trend. Heterogeneous computing hardware can speed up computation in these scenarios, so we focus our work especially on these in-memory systems.
Furthermore, we focus on Online Analytical Processing (OLAP) because OLAP queries are long-
running and, therefore, can benefit significantly from acceleration. Also, OLAP queries are the
base for data analytics and data mining, which increasingly requires short response times to
make fast business decisions. To efficiently support in-memory OLAP queries, we focus our
work on column-oriented database systems [Abadi et al. 2013] and block-wise processing like
column-at-a-time [Boncz et al. 2008] or vector-at-a-time [Boncz et al. 2005] approaches. Also,
in most cases we assume the operators to be executed in an operator-at-a-time approach with
fully materialized intermediate results.
As we want to improve in-memory OLAP queries, we do not focus on specific systems
or operator implementations. Many operator implementations were proposed in the past for
different CUs, for example sorting for GPUs [Govindaraju et al. 2006; Merrill and Grimshaw
2010; Satish et al. 2009], FPGAs [Müller et al. 2012; Chen et al. 2015], and Many Core CPU
systems [Teubner and Müller 2011; Balkesen et al. 2013]. Additionally, many systems were
proposed to use these ported operator implementations to allow heterogeneous execution to
some extent. To build upon the work done in this area, we focus on the optimization aspect of
heterogeneous placement, while not limiting our approach to specific implementations.
The basic approach for heterogeneous placement optimization is illustrated in Figure 4.1.
We assume to get a logically and physically optimized QEP from the query optimizer, making
the following steps independent of any specific logical or physical optimization. The goal is now
to assign placement decisions to the operators before their execution in order to increase the
Figure 4.1: Finding a good placement for the given query.
overall query runtime. A two phase approach is used to make the placement decision: (1) run-
time estimation, including transfer time estimation, and (2) placement optimization. In the
first phase, we need to estimate the operator’s runtime using the properties of the heteroge-
neous CUs and information about the operator itself. As second phase, we use the estimated
runtime and estimations on data transfer costs, to evaluate different operator placements and
the impact on the full query runtime. This phase also needs to know the properties of the given
CUs, e.g., total memory capacities and the structure of the query plan. The plan with the lowest
query runtime is chosen, which could range from highly heterogeneous executions to single-
CU execution, depending on the operator runtime costs and transfer costs. We identified two different strategies for placement assignment: the local and the global placement strategy.1
The first placement strategy (local placement optimization) conducts an estimation and place-
ment decision at run-time directly before the execution of each operator and the placement is
done for each operator separately. Therefore, the estimation can work on the most recent
information about data sizes and data location, allowing an exact estimation of needed data
transfers and execution. Additionally, only one operator is placed at a time, leaving a small search space equal to the number of available computing units. However, this approach might not have
enough foresight since the rest of the QEP is not considered in the local decision. In particular,
data sharing between operators can not be explicitly planned throughout the query.
The second strategy (global placement optimization) decides the placement for all opera-
tors of a QEP at query compile-time before the execution of any operator. In this case, global
placement is done by considering all dependencies of the QEP. This approach yields a higher
potential for better placements compared to the local placement optimization, because data shar-
ing between operators is explicitly encouraged to avoid costly data transfers. However, there
are open challenges, when optimizing the whole query for heterogeneous execution. The two
main challenges are the enormous search space of possible placements and the problem of
uncertain or unknown intermediate result sizes.
In the following, we first describe our runtime and transfer estimation approach in Sec-
tion 4.1, followed by details on local optimization in Section 4.2 and global optimizations in
Section 4.3. In the end, we evaluate the resulting estimation quality and placement perfor-
mance in Section 4.4.
1Parts of the material about the two strategies have been developed jointly with Dirk Habich and Wolfgang
Lehner. The material is based on [Karnagel et al. 2015a]. The copyright is held by the authors and the original
publication is available at http://ceur-ws.org/Vol-1330/paper-10.pdf.
4.1 RUNTIME AND TRANSFER ESTIMATION
As the starting point for heterogeneous placement, accurate runtime estimations are needed to
reason about possible executions on different CUs.1 Knowing the estimated runtime in advance
enables the optimization to reliably choose the best CU for the execution. Our main goal is a
reliable and extensible approach, which is not limited to specific operator implementations or
specific computing units.
CPU cost models focus mainly on memory access [Manegold et al. 2002] and can not be
adapted to other CUs without adjustment. A GPU cost model was presented by Bingsheng He
et al. [He et al. 2009] including transfer considerations, computation, and memory access esti-
mation. The authors benchmark and calibrate operator primitives that are used in all operators
to estimate their computation time and manually provide a memory access and complexity
function. This is not easily extensible as different operators must build upon the benchmarked
primitives. Jiong He et al. presented a cost model for load balancing in tightly-coupled sys-
tems [He et al. 2013, 2014]. The authors used an intrinsic-knowledge approach of needed
instruction counts per tuple and instructions per cycle (IPC) of different CUs, which is highly
specific to operators and hardware characteristics. A more dynamic approach was presented by
Breß et al. [Breß et al. 2012], where execution times are estimated using an automatic learning
approach with prior training and spline interpolation for the estimation.
To assess the quality of these approaches, we revisit the group-by operator from Section 3.2.
There, the naive group-by operator has an irregular behavior with changing group numbers. We
try to apply the three presented estimation approaches to evaluate the quality of their estima-
tion. We define these three general approaches as: (1) calibration, where the execution is first
calibrated and benchmarked and the estimation is done on the number of input tuples (e.g., [He
et al. 2009]); (2) intrinsics, where the execution is calculated as instructions per tuple, with a
given IPC number for the CUs (e.g., [He et al. 2013, 2014]); and (3) interpolated, where the ex-
ecution is monitored, added to a model, and the estimation is done using spline interpolation
(e.g., [Breß et al. 2012]). Figure 4.2 shows the result. Since the first two estimation methods
only work on input tuples, they show a constant estimation for constant input data (6GB). The
calibration-based approach heavily depends on the number of groups chosen in the benchmark.
In Figure 4.2, we assume that the benchmark was done for 500 groups (red circle). Since the
input data size is constant, a constant runtime is estimated. The intrinsics-based approach most likely assumes ideal performance, without any of our presented bottlenecks from Section 3.2.
Both approaches fail to estimate the encountered effects. The spline-based estimation was cre-
ated using all data points of the empirical execution as input. The resulting estimation shows
a similar behavior as the empirical execution, however, the runtime effects below 1M groups
and the spikes (which are reproducible) are not represented in the estimation.
While the estimation based on online learning and spline interpolation shows the best
results, we do not think that the estimation accurately describes the spiky execution behavior,
shown in Figure 4.2. This motivates us to develop a more accurate estimation method and
improve the quality of estimation, especially for irregular execution behavior.
1Parts of the material in this section have been developed jointly with Dirk Habich, Benjamin Schlegel and
Wolfgang Lehner. The section is based on [Karnagel et al. 2014]. The copyright is held by Springer-Verlag Berlin Hei-
delberg 2014 and the original publication is available at http://dx.doi.org/10.1007/s13222-014-0167-9.
[Log-log plot of runtime (sec) over #groups (M), comparing the empirical execution with the interpolated, calibration-based, and intrinsics-based estimations.]
Figure 4.2: Estimating irregular execution behavior of the group by operator using three ap-
proaches from related work.
In the following, we first define our general directions for estimation granularity and the
level of hardware and implementation awareness. Based on these decisions, we propose an
estimation approach for operator runtime and transfer time estimation.
4.1.1 General Directions
Before starting with runtime estimation, we need to define the general directions we want to
follow. We discuss three different questions: (1) On which granularity should the estimation be
done? (2) How much information does the estimation model need about the operator imple-
mentation? (3) How much information does the estimation model need about the computing
hardware?
Estimation Granularity: The estimation and placement can be done on a full query level or
operator level. Estimation on a query level has several drawbacks. First, it is hard to estimate a
query’s runtime without looking at a finer level like operators. Even if similar queries are reoc-
curring and the runtime could be learned, small changes in the selectivities result in different
intermediate result sizes and have a significant impact on the query runtime. Second, when
estimating full query runtime at that level, we would only be able to place whole queries on
one CU, where it executes best. However, we want to achieve heterogeneous execution within
one query, so full query estimation granularity is not applicable.
For runtime estimation on operator level, database operators can be estimated and placed
separately from each other, allowing heterogeneous execution within a single query. Also, for
one operator, runtimes can be estimated more precisely as there is a more fine-grained view
on the computation and data sizes. Therefore, we choose the operator level for our initial
approach.
Figure 4.3: Data representation for M operators and N computing units.
Implementation-aware vs. implementation-oblivious: Execution times on any CU differ
between operators and implementations. It is possible to either have the operator’s developer
specify exactly how the operator computes its result or use code analysis to automatically ex-
tract this information. However, this approach is complex, might be a source for significant
estimation errors, and it is not known how to map an extracted computation specification onto
the used hardware to compute runtime estimations. On the other hand, if we do not need im-
plementation specifications, but regard the implementation as a black box, we do not depend
on any given or extracted implementation insights.
Hardware-aware vs. hardware-oblivious: It is possible to model the full architecture of a
CU including computational power for different data types, cache hierarchies, memory access
bandwidths, and so on. With this hardware-aware approach, runtime predictions could be
calculated for each database operator given the input data and the hardware model. However,
there are two limitations: (1) Creating the model for different CUs is time-consuming and
can not be easily adapted between CUs, as even CUs from the same vendor have different
architectures or different configurations. The support of a wide range of CUs would not be
possible. (2) Even with a detailed hardware model, it will most likely not be possible to predict
all performance effects like the ones presented in Section 3.2. For example, implementation
details like the hash function might interact with data properties and hardware properties,
resulting in unexpected runtimes. A hardware-oblivious approach is more extensible without
the need for prior knowledge of the hardware characteristics. There, the hardware can be
treated as a black box and execution behavior can be learned during the execution, allowing
any CU with a supported implementation of the used operators.
4.1.2 Operator Runtime Estimation
In order to build our operator-based, implementation-oblivious, and hardware-oblivious run-
time estimation, we have to define how to store and collect operator runtimes for later estima-
tions. We also need to define the estimation technique itself.
Data representation: For the data representation, we propose to store key-value pairs (tu-
ples), where the key is derived from the data sizes used by an operator, while the values are the
actual runtimes. Figure 4.3 illustrates this approach. For each operator of the database system,
we keep one set of tuples (data-size → runtime) per CU, building the knowledge base of our
estimation model. With enough data, we can estimate the runtime for each operator on each
CU independently of other CUs or operators.
In our implementation, we combine all input and output data sizes to compute the key, as
shown in the following formula:
key = \sum_{i=1}^{n} input\_size_i + \sum_{i=1}^{m} output\_size_i    (4.1)
This is a simplistic solution and there are possible ways to improve this approach. The
main problem with our solution is that different input and output sizes can result in the same key, adding inconsistent runtimes to the knowledge base, as we are losing the information about the single inputs and outputs. A more comprehensive approach could be a multi-dimensional index, where we can store all single data sizes for one operator:

(data-size_n, data-size_{n+1}, ..., data-size_m) → runtime    (4.2)
However, when retrieving the stored data for estimation, the different dimensions have to be
weighted and prioritized in order to find the relevant entries. With our simple approach, we
do that implicitly by prioritizing all data sizes equally. As a more complex approach, we could
actually keep a multi-dimensional index and learn the priorities of the dimensions through the
runtimes. For example, for some operators solely the size of the first input correlates with the runtime, while other operators might mainly depend on the sum of the input sizes.
For our current approach, we compute the key as the sum of all input and output sizes and
leave multi-dimensional indexing as future work.
Data collection: Instead of running system benchmarks or trying to calculate the system’s
performance with complex models, we collect the operator runtimes during query processing,
when the operators are executed. This means, with every operator execution, we either add a new tuple (data-size → runtime) or update an existing one. Tuples keep an update counter
as additional information, so every tuple update can influence the stored runtime only to a
certain extent, depending on the update counter. This ensures that one single update can not
significantly change a stored value from many previous executions.
In our current implementation, we do not reset these counters, however, this would be a
possible approach to allow the adaption to runtime changes over time. For example, operator
runtime is not only dependent on the data size but also on data properties like value range,
sorting, or number of duplicates. If base data changes over time from, e.g., sorted duplicate-
free data to unsorted data with duplicates, operator runtimes might change as well and newly
measured runtimes could be prioritized in order to ensure good estimation quality. However,
as we mainly focus on data analytics of fixed base data, we leave this optimization to future
work.
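A minimal C++ realization of this knowledge base is sketched below, including Equation 4.1 as the key derivation; the damped update rule is one plausible way to let the update counter limit the influence of a single measurement, not the exact implementation:

#include <cstdint>
#include <map>
#include <numeric>
#include <vector>

// One sorted (key -> entry) set per operator and per CU, as in Figure 4.3.
struct Entry { double runtime; uint64_t updates; };
using KnowledgeBase = std::map<uint64_t, Entry>;

// Key derivation of Equation 4.1: sum of all input and output sizes.
uint64_t makeKey(const std::vector<uint64_t>& inputSizes,
                 const std::vector<uint64_t>& outputSizes) {
    return std::accumulate(inputSizes.begin(), inputSizes.end(), uint64_t{0})
         + std::accumulate(outputSizes.begin(), outputSizes.end(), uint64_t{0});
}

// Learn one measured execution: insert a new tuple or fold the new
// measurement into the stored runtime as a running average, so a single
// execution cannot overturn many previous ones.
void learn(KnowledgeBase& kb, uint64_t key, double measuredRuntime) {
    auto it = kb.find(key);
    if (it == kb.end()) {
        kb[key] = {measuredRuntime, 1};
    } else {
        Entry& e = it->second;
        e.runtime = (e.runtime * e.updates + measuredRuntime) / (e.updates + 1);
        e.updates++;
    }
}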
Estimation: The estimation is based on the collected and stored runtime data from previous
executions. When estimating the runtime for an operator-CU combination, we retrieve the cor-
responding set of tuples and determine the two closest tuples (t1, t2) to the current situation.
[Four sketches of runtime over data size: (a) No historical data, where no estimation is possible. (b) One tuple. (c) Many tuples. (d) With outliers.]
Figure 4.4: Runtime estimation depending on the learned data.
In the common case, this would mean:

key(t_1) ≤ key_current ≤ key(t_2)    (4.3)

If one of the keys key(t_1) or key(t_2) is equal to the current key, then we can use the stored runtime directly without further calculations. In all other cases, the estimation uses the two tuples and calculates a linear interpolation for the current key. In the end, the tuples t_1, t_2, and the estimated tuple t_new are on one line.
Figure 4.4 illustrates this approach, including corner cases. If no previous execution is
stored (Figure 4.4a), no runtime estimation is possible. If only one tuple is stored (Figure 4.4b),
we introduce a temporary entry (0,0), which is used together with the stored runtime informa-
tion to create the linear interpolation. As soon as more than one tuple is stored, the entry (0,0)
is not used anymore (Figure 4.4c). With multiple tuples, the linear estimation can always be
based on historic data. For estimations with smaller keys than all stored tuples, the two small-
est keys are determined and used for the estimation. Similarly, this is done with the two largest
keys, if the requested key is larger than any other key stored. With enough learned executions,
this approach is able to describe any kind of behavior, while every new entry improves the
estimation.
We choose the linear estimation approach based on only two entries in order to be robust
against outliers, as outliers are either corrected by repeated executions or, if showing repeatable
results, only influence the local neighborhood (Figure 4.4d). More executions with keys close
to outliers can weaken their influence further. When applying more global estimation like
spline interpolation, outliers potentially influence estimation results on the whole spectrum,
while outliers in our approach only impact neighboring keys and estimations.
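Building on the KnowledgeBase sketch above, the estimation itself reduces to a neighbor search plus linear interpolation. The following sketch covers the (0,0) fallback and the boundary extrapolation described here; it is an illustration under the assumptions above, not the exact implementation:

#include <iterator>

// Estimate the runtime for `key` from the two enclosing tuples; uses
// (0,0) if only one tuple is stored and extrapolates from the two
// outermost tuples beyond the learned range. Returns -1 if empty.
double estimate(const KnowledgeBase& kb, uint64_t key) {
    if (kb.empty()) return -1.0;                 // no estimation possible
    if (kb.size() == 1) {                        // line through (0,0)
        const auto& only = *kb.begin();
        return only.second.runtime * (double)key / (double)only.first;
    }
    auto hi = kb.lower_bound(key);
    if (hi != kb.end() && hi->first == key)
        return hi->second.runtime;               // exact match
    if (hi == kb.end()) --hi;                    // key above all entries
    if (hi == kb.begin()) ++hi;                  // key below all entries
    auto lo = std::prev(hi);
    double x1 = (double)lo->first, y1 = lo->second.runtime;
    double x2 = (double)hi->first, y2 = hi->second.runtime;
    return y1 + (y2 - y1) * ((double)key - x1) / (x2 - x1);
}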
Data Cleaning: Given our data representation and our estimation method based on local
estimation using two neighboring entries, we can apply data cleaning to reduce the number
of entries in our knowledge base. In our setup, an entry has only a value to the system, if it
adds new information and potentially yields more accurate estimations. Therefore, for three
entries t1, t2, and t3 with key(t1) < key(t2) < key(t3), tuple t2 can be removed if it can
be estimated using t1 and t3. In this case, tuple t2 is not adding any new information to the
estimation. To successfully remove tuples, we allow a small margin of error (Em), e.g., 5%,
in which a tested entry can be removed. With this margin, we can trade-off the estimation
quality with the needed memory space. Special cases are the two tuples with the smallest keys
and the two tuples with the largest keys. These tuples define the scaling for estimations on
[Four sketches of runtime over data size illustrating the cleaning: (a) Removing 2 tuples. (b) Removing 1 tuple. (c) No removal needed. (d) No removal possible.]
Figure 4.5: Application of data cleaning using an error margin (Em).
smaller or larger keys. Removing one of the two outer tuples could change the scaling behavior,
and estimations for much smaller or much larger keys could experience high estimation errors.
Therefore, the two smallest keys and the two largest keys can not be deleted during the cleaning
process.
The cleaning procedure is shown in Figure 4.5. First, the estimation error is calculated from two entries, always with one entry in the middle whose stored runtime is compared to the interpolation. If the error is below Em, the middle entry can be deleted, as it does not add more information. This procedure can be repeated until no
entry can be deleted anymore (Figure 4.5c). If entries can not be removed, then all entries are
important to the estimation and have to be kept in the knowledge base (e.g., Figure 4.5d). This
cleaning can either be repeated after a certain amount of newly learned executions or it could
be applied for every insert, to evaluate if a tuple adds enough new knowledge to be inserted in
the first place.
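A corresponding cleaning pass over the same KnowledgeBase structure can remove interior tuples whose runtime is reproduced by their neighbors within the margin Em, while protecting the two smallest and two largest keys; again a sketch, not the exact implementation:

#include <cmath>

// Repeatedly drop interior tuples that can be interpolated from their
// neighbors within a relative error margin Em (e.g., 0.05).
void clean(KnowledgeBase& kb, double Em) {
    bool removed = true;
    while (removed && kb.size() > 4) {
        removed = false;
        auto mid = std::next(kb.begin(), 2);     // protect the two smallest keys
        while (kb.size() > 4 && mid != std::prev(kb.end(), 2)) {
            auto lo = std::prev(mid), hi = std::next(mid);
            double x1 = (double)lo->first, y1 = lo->second.runtime;
            double x2 = (double)hi->first, y2 = hi->second.runtime;
            double est = y1 + (y2 - y1) * ((double)mid->first - x1) / (x2 - x1);
            if (std::abs(est - mid->second.runtime) <= Em * mid->second.runtime) {
                mid = kb.erase(mid);             // adds no new information
                removed = true;
            } else {
                ++mid;
            }
        }
    }
}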
4.1.3 Transfer Time Estimation
To estimate transfer times, we reuse parts from the runtime estimation. In detail, we reuse the
data representation and the estimation. In contrast to runtime estimation, we use the data size to be transferred directly as the key, instead of mapping inputs and outputs to a key value. Other approaches benchmark the bandwidth and simply multiply the data size by this bandwidth [He et al. 2009]. However, this bandwidth varies for different data sizes, depending on different
overheads. With our approach, we store the data size and the corresponding transfer time,
effectively considering different bandwidths in our estimation. For transfer estimation, we do
not apply data cleaning and use a different approach for data collection.
Data collection: Transfer costs are independent of any operator implementation, query struc-
ture, or data distribution but depend only on the data sizes and the involved CUs (CU A → CU B). Therefore, we benchmark all possible transfer directions (#CU × #CU) at ramp-up time. In fact, it is enough to benchmark the transfers once per system, without the need to
renew the benchmark or update the results with online learning.
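At ramp-up, this benchmark can be a nested loop over all CU pairs and a few representative data sizes. In the sketch below, timeCopy is a hypothetical helper that copies size bytes from one CU's memory to another's and returns the measured seconds; the probed sizes are illustrative:

#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// One (data size -> transfer time) model per directed CU pair.
using TransferModel = std::map<std::pair<int,int>, std::map<uint64_t,double>>;

TransferModel benchmarkTransfers(
        int numCUs,
        const std::function<double(int, int, uint64_t)>& timeCopy) {
    const std::vector<uint64_t> sizes = {1u << 20, 1u << 24, 1u << 28};
    TransferModel model;
    for (int src = 0; src < numCUs; ++src)
        for (int dst = 0; dst < numCUs; ++dst)
            if (src != dst)
                for (uint64_t size : sizes)
                    model[{src, dst}][size] = timeCopy(src, dst, size);
    return model;
}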
4.2 LOCAL PLACEMENT STRATEGY
In the last section, we proposed a runtime estimation approach to estimate both database operator runtime and transfer time. With this approach, it is possible to choose a CU for one
isolated operator execution. However, a database query consists of many operators, which
depend on each other’s intermediate results. Therefore, a placement decision for one operator
influences the decisions for other operators. The overall goal is to define placements for every
query operator, ideally resulting in the fastest query execution. One way to determine this
placement is the local placement strategy, where operator placements depend on the current
data location and the estimated runtime of an operator. The decision is made at run-time.
[Figure 4.6: Local placement optimization example, where Operator O3 is placed next. Only the locations and sizes of O3's inputs x and y (produced by O1 and O2 on their CUs) are used for the estimation; the consumers O4 to O6 and the memory objects a, b, and c are not considered.]
4.2.1 General Approach
The strategy to decide the operator’s placement at run-time for each operator is the most in-
tuitive approach. Placement is decided right before an operator’s execution, after preceding
operators have already finished their execution. The input and output data is kept in the CU’s
memory until it is needed on another CU. There are three questions that have to be considered:
1. How large is the input data?
2. Where is the input data placed at the moment?
3. How does the operator perform on the different computing units?
The first and the second question are needed to determine possible transfer costs; and
the first and the last question determine the estimated operator runtime. When knowing (or
estimating) all costs, a local decision can be made. The detailed approach is illustrated in Fig-
ure 4.6. The operators O1 and O2 produce the results x and y. These are stored in a CU’s
memory, where the operators have been executed (illustrated with different colors). Place-
ment and data size of each input for operator O3 is considered to calculate the transfer costs,
if transfers are needed. Then the hypothetical execution on each CU is considered with the
transfer costs and the execution estimations. The approach for a whole query is illustrated in
Algorithm 3. For each operator, the best CU is determined and the operator is executed im-
mediately on this CU. The exact data input size is known for base data as well as intermediate
Algorithm 3 Local Placement Optimization.
1: procedure Local-Optimization
2:    for all op in operators do
3:        for all cu in CUs do
4:            cost ← exec-est(op, cu) + input-transfer-est(cu)
5:            if cost < bestCost then
6:                bestCost ← cost
7:        execute(op, cu-with(bestCost))
results, since previous operators have already finished their execution. For base data, the data
placement is either in main memory, or already in a CU’s memory, if another operator accessed
the data before. For intermediate results, data is most likely stored in a CU’s memory, where
the result producing operator was executed. There is the possibility that data was evicted from
the CU’s memory, if other operators needed additional space. However, this is traceable and
the current memory location is considered. The third question, the estimated operator
runtime, is answered by our estimation model from the previous section. With the
transfer time and the operator's execution time estimated, a decision can be made by picking
the CU with the minimal sum of all input transfer costs and execution time. This is the best
decision from a local point of view. The search space for this decision is limited to the number
of CUs and the decision procedure is repeated for each operator in the order of execution. The
result transfer is not considered for the producing operator since the data might be reused by
the next operator on the same CU. If the result transfer is needed, it is added to the costs of the
consuming operator instead of the producing one.
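The following Python sketch mirrors Algorithm 3; the helpers exec_est, transfer_est, and execute as well as the op.inputs attribute are assumptions that stand in for the system's estimator and executor:

def local_optimization(operators, cus, exec_est, transfer_est, execute):
    # Place each operator right before its execution, considering only
    # its estimated runtime plus the transfers needed to move its inputs
    # to the candidate CU (Algorithm 3).
    for op in operators:
        best_cu, best_cost = None, float("inf")
        for cu in cus:
            cost = exec_est(op, cu) + sum(transfer_est(inp, cu)
                                          for inp in op.inputs)
            if cost < best_cost:
                best_cu, best_cost = cu, cost
        execute(op, best_cu)  # the result stays in best_cu's memory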
4.2.2 Advantages and Limitations
The strong advantage of the local placement strategy is its simplicity and easy implementation.
The search space corresponds to the number of computing units per decision with one decision
per plan operator. Furthermore, this approach works on the latest information about data sizes
and their placement, allowing precise runtime and transfer estimations.
On the other hand, the decisions are made locally, one operator after another. This might
not be optimal for the full query plan, as alternating decisions can introduce harmful data
transfers from one CU to another and back, because only input transfers are considered
Op   Runtime CU1   Runtime CU2   Local Decision            Ideal Decision
1    1.2s          0.1s          CU2 + transfer = 1.1s     CU1 = 1.2s
2    0.1s          1.2s          CU1 + transfer = 1.1s     CU1 = 0.1s
Total:                           2.2s                      1.3s
Table 4.1: Local placement decisions and the ideal placement. Data transfers (if needed) take
1s. The initial data is stored on CU1. The operators are executed according to their ordering.
in the cost function. An example is shown in Table 4.1. The example includes two operators
with estimated execution times for two computing units (CU1, CU2). Initial data resides on
computing unit CU1 and every data transfer, if necessary, takes 1 second. The presented local
strategy would choose CU2 for the first operator, since the runtime + transfer-time is less than
the execution time on CU1. In the second step, it chooses CU1 for the same reason. The total
execution time is 2.2 seconds including transfers. For the ideal placement, however, the total
execution time is only 1.3 seconds, since harmful data transfers and alternating placement
decisions are avoided. For this example, the locally optimized placement is 0.9s (69%) slower
than the ideal placement.
4.3 GLOBAL PLACEMENT STRATEGY
In the last section, we proposed local optimization as a simple way to perform placement op-
timization. However, the local optimization approach shows limitations when placements are
alternating and too many harmful transfers are added. This is caused by the missing global view
of the optimization strategy. In this section, we introduce global optimization, a strategy using
the global view on the query and making decisions based on the interplay between all operators.
In the following, we first discuss the possibilities but also the challenges of this strategy, before
proposing a greedy search approach and possible search space reductions.
4.3.1 Search Space
Global optimization means applying the placement optimization at compile-time for the whole
query. This leads to new possibilities as well as new challenges. The main advantage of global
optimization is the complete view on the query and the possibility to find the optimal placement
per operator to speed up the query and avoid unnecessary transfers. Global optimization
would represent the ideal placement in Table 4.1. However, the search space of global opti-
mization is enormous.
The number of placements depends on the number of available CUs and the number of op-
erators within a query. For local optimization, this means one placement decision per operator
and a search space of #CU per decision. For global optimization, all possible placements need
to be evaluated, while the number of possible placements can be calculated as follows:

$\#placements = \#CU^{\#operators}$ (4.4)

For example, there are 1.1 × 10^15 possibilities for four CUs and 25 operators. If we can esti-
mate and calculate 1M placements per second, exhaustive search would take over 35 years for
the full evaluation to find the best heterogeneous placement for a query. Depending on the
query, applying exhaustive search with pruning can take hours or days instead of years, but the
worst case performance would not change. Even hours are still too long for interactive queries.
Therefore, it is simply not possible to evaluate all possible placements and some kind of ap-
proximation is needed. To reduce the search space in order to make placement decisions, we
apply a combination of assumptions and further optimizations.
(1) We assume that the QEP is fixed in its logical and physical properties. Therefore, we
do not need to consider other optimization options than the actual execution placement. This
assumption allows a clear separation of concerns: Logical decisions can be based on simple
rewriting rules. Physical decisions can be based on cardinalities and data properties (e.g., pre-
sorted data sets or number of duplicates). Finally, placement decisions can be based on data
sizes and learning-based runtime estimation.
(2) Instead of exhaustive search, we apply an iterative greedy approach, allowing the deci-
sion process to choose a local optimum. This approach is fast and can be used for online query
optimization, while we need to evaluate the actual placement quality on real queries.
(3) In order to reduce the search space and possible options for the greedy approach, we
propose using fixed placements for operators, which are placements that choose the same CU
in any possible situation.
4.3.2 Greedy Search Approach
To allow fast placement decisions with the presented search space, we propose a greedy search
algorithm. As mentioned earlier, the search space for common queries is too large for a com-
plete search in a reasonable runtime even with effective pruning. Therefore, we propose the
greedy approach illustrated in Algorithm 4. We start with an initial starting placement for ev-
ery operator. The algorithm iterates over each operator and evaluates the different placement
decisions for the current operator. If the algorithm finds a better placement for the current
operator, it changes the decision from the initial placement. The main difference to the local
approach is that we already have placement decisions for the following operators, leading to
a more informed decision concerning possible data sharing. In Line 6 of Algorithm 4, we see
that the cost function includes the operator estimation, as well as input and output transfers,
while local optimization only considers execution and input transfers. Figure 4.7 illustrates this
difference. In addition to Operators 1 and 2, the algorithm knows the placements of the operators
4 to 6 and the data sizes a, b, and c, and is therefore able to calculate inward and outward
transfer costs of the current operator. This leads to a more informed decision than in a local
optimization. The changes made to one operator's placement can influence the placements of
the following and the previous operators as well. Therefore, the algorithm has to iterate over the
operators as long as improvements can be found. In general, we have seen up to 5 iterations for
different queries, while it might be useful to set an upper bound on the number of iterations (e.g., 20). When
no single placement adjustment of an operator improves the query runtime anymore, the
algorithm has found a (local) optimum. The actual execution according to the placement decisions
is decoupled from the placement decision-making.
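As a sketch, the greedy loop of Algorithm 4 can be written as follows in Python; cost(op, cu, placement) is assumed to return the execution estimate plus input and output transfer estimates under the given placement:

def global_optimization(operators, cus, cost, max_iters=20):
    # Greedy refinement: starting from an initial placement, adjust one
    # operator at a time until no single change improves the estimated
    # cost; max_iters bounds the number of passes over the plan.
    placement = {op: cus[0] for op in operators}  # e.g., single-CU start
    for _ in range(max_iters):
        changed = False
        for op in operators:
            best = min(cus, key=lambda cu: cost(op, cu, placement))
            if best != placement[op]:
                placement[op] = best
                changed = True
        if not changed:  # a (local) optimum is reached
            break
    return placement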
The described greedy approach is fast and improves an initial starting placement iteratively.
However, it remains a greedy approach that finds a local optimum, which is not necessarily
the global optimum for the full plan. One reason for not finding the optimal placement is the
occurrence of operator groups that should be placed together. Some operators may be most
beneficially placed together on one computing unit, so that data transfers
Algorithm 4 Global Placement Optimization.
1: procedure Global-Optimization
2:    placement ← starting placement
3:    repeat
4:        for all op in operators do
5:            for all cu in CUs do
6:                cost ← exec-est(op, cu) + input-transfer-est(cu) + output-transfer-est(cu)
7:                if cost < bestCost then
8:                    bestCost ← cost
9:                    placement[op] ← cu-with(bestCost)
10:   until placement has not changed
11:   for all op in operators do
12:       execute(op, placement[op])
[Figure 4.7: Global placement optimization example, where Operator O3 is placed next. In contrast to Figure 4.6, the estimation additionally uses the placements of the consumers O4 to O6 and the sizes of the memory objects a, b, and c.]
between them are avoided. However, the best computing unit for the group might not be
the best for the single operator, so an approach that can only change one placement at a
time might not find the best solution. The problem is illustrated in Table 4.2, showing dependent
operators, transfer costs, and runtimes. Varying input transfer times correspond
to intermediate data sizes, e.g., Operator 2 could be a join with a large result, so Operator 3
has a high input transfer time. Local optimization would choose the pure CU1 placement (I).
For global optimization, the result highly depends on the starting placement. If the starting
placement is (I), then (III) and (IV) would be evaluated (among others), but (I) would be
chosen as the placement with the minimal costs. With a starting placement of (IV) and assuming
the algorithm starts from the top, our greedy strategy would also evaluate (V) and find it to be
the best possible placement.
There is no limit on how large these operator groups can be, so an optimizer would need
to evaluate all groups of two operators, three operators, and so on. This is not possible in
a reasonable amount of time for larger database queries. A more practical idea is to
allow different initial placements and apply the greedy algorithm to all of them. For example,
when testing random starting placements, there is a chance that some operators
of a group are already assigned to the right computing unit, pulling the other operators to this
CU as well. Thus, the overall result can be improved by testing many different starting
placements and picking the best placement according to our operator runtime estimation.
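A sketch of this restart heuristic, assuming refine runs the greedy algorithm from a given starting placement and total_cost returns the estimated query runtime of a placement:

import random

def best_of_restarts(operators, cus, refine, total_cost, n_random=10):
    # Try all single-CU starts plus several random starts and keep the
    # refined placement with the lowest estimated total cost.
    starts = [{op: cu for op in operators} for cu in cus]
    starts += [{op: random.choice(cus) for op in operators}
               for _ in range(n_random)]
    return min((refine(start) for start in starts), key=total_cost)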
Op   Input Transfer   Runtime CU1   Runtime CU2   (I)   (II)   (III)   (IV)   (V)
1    1s               1s            5s            1     2      1       1      1
2    1s               1s            0.1s          1     2      2       1      2
3    5s               5s            0.1s          1     2      1       2      2
4    0.5s             1s            5s            1     2      1       1      1
Total cost including transfers:                   8     11.2   13.1    8.6    3.7
Table 4.2: Multiple placements that could be considered by global optimization. The initial
data is on CU1. If needed, the shown input transfer costs apply. The operators execute in the
given order.
4.3.3 Search Space Reduction
In order to find a good placement solution, we need to evaluate many different (random) placements.
This number grows with the search space, meaning that we should test more starting placements
for a larger search space (e.g., for more plan operators). Since we can only evaluate a limited
number of placements in a fixed time, we need to reduce the search space to improve the probability
of finding a good placement.
We propose to reduce the search space by assigning operators to a fixed CU, if the greedy
algorithm would pick this computing unit in every possible scenario. For example, Operator 1
and 4 in Table 4.2 will always be placed on CU1 even if all other operators are on CU2. We call
these strong placements, where one CU is superior in the execution of an operator to an extent
that the worst case data transfers are negligible. Since every greedy iteration for any starting
placement would pick these placements, we do not have to consider them as adjustable in
the greedy algorithm as well as in selecting the starting placement. For Table 4.2, this would
mean fixing the placement for Operator 1 and 4, reducing the search space for the placement
decisions from 24 = 16 to 22 = 4. Depending on the computing units and operators, this
approach can reduce the search space significantly, even to the point of fixing the placement
for the full plan. The strong placements can be calculated by iterating over the plan once for each
computing unit and evaluate if a single operator would be placed on another computing unit,
even if all other operators are on the initial one. For example, a plan is initially set to CU1. Each
operator is tested if a placement on CU2, CU3, and so on, is beneficial for the overall runtime,
while having all other operators on CU1. This has to be done for each CU. If, for example, one
operator is always placed on the same computing unit, then this operator can be fixed to this
computing unit as a strong placement. Calculating these strong placements introduces only a
small overhead while having the potential to reduce the search space significantly.
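A possible rendering of this test in Python, reusing the assumed cost helper from above:

def strong_placements(operators, cus, cost):
    # An operator receives a fixed (strong) placement if the greedy
    # choice is the same CU in every scenario, i.e., no matter which
    # single CU all other operators are pinned to (worst-case transfers).
    fixed = {}
    for op in operators:
        choices = set()
        for base_cu in cus:
            pinned = {other: base_cu for other in operators}
            choices.add(min(cus, key=lambda cu: cost(op, cu, pinned)))
        if len(choices) == 1:  # same CU chosen in every scenario
            fixed[op] = choices.pop()
    return fixed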
4.3.4 Advantages and Limitations
With the proposed greedy approach and the search space reduction, we can make fast decisions,
while avoiding exhaustive or pruned search of the entire search space. This allows placement
optimization using the global view of the query. There are two main limitations for global
optimization. (1) As mentioned, the optimization probably finds a good placement but not
necessarily the ideal one, which we hope to improve by evaluating the greedy approach with
many different starting placements. (2) A second limitation is that the placement is highly de-
pendent on intermediate cardinalities. In contrast to local optimization, the placement for every
operator is defined before any operator execution. Therefore, intermediate data sizes need to
be estimated for runtime and transfer cost estimation. This is a common database challenge
independent of heterogeneous execution. At this point, we assume perfect cardinality knowl-
edge, while we revisit this challenge in Chapter 5.
4.4 EVALUATION
For the evaluation, we examine runtime estimation and placement optimization separately. For
runtime estimation, we first show that our approach is working well for irregular execution
behavior, before we evaluate the estimation quality and the propagation of estimation errors
towards placement decisions. For placement optimization, we introduce two sets of benchmark
queries and a heterogeneous evaluation system consisting of four different CUs. For the given
queries, we first evaluate the search space for placement decisions and show the significance
of strong placements, before comparing local and global optimization for each query.
4.4.1 Runtime Estimation
To evaluate the accuracy of runtime estimation separate from the placement optimization of
full query trees, we use the group-by operator from the previous sections. We first revisit the
example from Section 3.2, where irregular execution behavior complicated the runtime esti-
mation. Afterwards, we evaluate the runtime estimation concerning the estimation quality
and the impact on placement decisions.
[Figure 4.8: Irregular execution behavior of the naive group-by operator, estimated by the spline interpolation approach and our approach. Runtime (sec) over #groups (M), comparing the empirical runtimes, the spline-interpolated estimate, and our estimate.]
Irregular Execution Behavior For the first evaluation of our placement model, we revisit
the scenario from Section 4.1. There, we described that previously proposed placement models
are not capable of estimating irregular execution behavior with good estimation quality. As
an example, we took the group-by operator from Section 3.2. Figure 4.8 shows the results of
the spline interpolation approach and of our novel runtime estimation approach. We can see
that our approach describes the execution much better than the spline-based approach, leading
to nearly equal results for execution and estimation. We even apply data cleaning, reducing
the number of stored tuples from 220 original entries to 132, without any significant loss of
placement quality.
Accuracy and Placement Decisions: We also revisit the group-by implementation from Section
3.3 to evaluate the estimation quality in more detail, including possible estimation errors
and their propagation to placement decisions. There, a group-by operator is executed on eight
different CUs. We use these eight executions to test our runtime estimator, while exploring
two dimensions: (1) the amount of input data (%) to simulate not yet learned execution be-
havior and (2) the error margin (Em) used for our cleaning approach (%) to decide if learned
entries can be deleted to save memory space. For the evaluation, we first insert a random data
sample into our estimation model (sample size 10% to 100% of the input data). The input
data consists of the runtime per group count. Afterwards, we apply the cleaning operation
with different margins (remove points if they fall within 0%, 5%, and 10% of their estimation).
The margin Em = 0% will not remove any points. Finally, we compare every real execution
time with the estimated one. To assess the estimation error, we use two measures: Mean Absolute
Error (MAE) and Mean Absolute Percentage Error (MAPE). The error calculation is shown in
Equations 4.5 and 4.6.
$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| Real_i - Est_i \right|$ (4.5)

$MAPE = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{Real_i - Est_i}{Real_i} \right|$ (4.6)
For both error measures, a lower value is better, while a low MAE can still
result in a large MAPE, and vice versa. This can happen if errors mainly occur for small
values, resulting in small absolute errors while the percentage error is still large. Therefore,
we present both measures for our evaluation. The results are shown in Table 4.3, where we
average both separately, MAE and MAPE, for all eight CUs that were used in Section 3.3. As
we can see, if 100% of the data is used to train the estimation model and no data cleaning
is applied, there is no estimation error. This shows that we can represent the full behavior,
if we have learned all possible execution times (e.g., for fixed recurring workloads). Other
approaches based on benchmarking, hardware intrinsics, or spline interpolation might show
an error for this case. When reducing the input data to a random sample with a decreasing
percentage, we see that the error is growing. However, the error is still below 20% in the worst
case and better than 6.5% when using 20% or more of the data for model training. This shows
that even a small percentage of learned execution times can produce a relatively good runtime
estimation for unknown scenarios. When data cleaning is enabled, the error rises as expected.
There are two interesting observations: (1) In some cases, data cleaning on random samples can
reduce the error, compared to no data cleaning (e.g., for 10% data) and (2) even for full input
data, the error can be larger than Em (e.g., a MAPE of 11.3% for Em = 10%). The latter cannot
happen if we only remove single points while keeping their neighbors: the estimation is
based on the neighboring points, and if the estimation is within the margin, a point is removed,
so the error should not exceed the given Em. However, as our cleaning approach can
remove the neighboring points in a succeeding step (if their estimation error is below Em), the
estimation error of the first deleted points can increase. For our example with full input data,
                         Em = 0%        Em = 5%        Em = 10%
                         MAE    MAPE    MAE    MAPE    MAE    MAPE
full input data (100%)   0      0       163    3.54    465    11.31
random sample - 90%      15     0.59    168    3.86    427    10.74
random sample - 80%      29     1.08    172    5.24    371    9.27
random sample - 70%      45     1.59    142    4.56    325    9.20
random sample - 60%      103    3.10    152    4.68    302    9.05
random sample - 50%      92     3.45    151    5.06    275    7.58
random sample - 40%      120    4.07    153    5.19    281    8.94
random sample - 30%      143    5.70    170    6.15    246    8.08
random sample - 20%      207    6.40    349    8.54    483    12.03
random sample - 10%      441    17.65   437    15.17   573    13.36
Table 4.3: Estimation errors for different percentages of input data and data cleaning.
this effect occurs but is not significant (11.31% instead of 10%). For sampled data, we cannot
be sure whether the estimation error is caused by the missing input data or by the data cleaning.
The important question is now how much this error propagates into wrong placement decisions.
Therefore, we use the different configurations from Table 4.3 to choose a CU for each
group count, while plotting the execution time of the chosen CU. The result is shown in Figure
4.9. We illustrate the manually placed execution from Section 3.3, the automatic placement
(automatic1) with full input data and no data cleaning (Em = 0%), and the 29 other configurations
(automatic2) for automatic placement presented in Table 4.3 (all plotted with the same
line type). Similar to Section 3.3, we evaluate two different cases: (1) four high-performance
CUs in one scenario and (2) three other CUs in another scenario.
(1) The first case is shown in Figure 4.9a. We can see that the automatic placement based
on full input data and no cleaning leads to similar results as the manual placement. Interestingly,
it even improves on the previous approach at one point (II), where it switches the CU,
which we did not do for the manual approach. Automatic2 combines multiple configurations.
The runtime difference at (I) is caused by configurations with either Em = 10% or input data
of 40% and less. The difference at (III) is only caused by configurations with Em = 10%. We
note that for placement decisions, the estimation errors (e.g., from Table 4.3) can compound:
one CU's runtime could be estimated 10% too high through a small subset of input data or
data cleaning, while another CU's runtime could be estimated 10% too low for the same reasons.
Therefore, a slow CU could be chosen for the placement even if it is 20% slower than
others. There is no variance in area (IV), which is mainly caused by the CU limitations. With
higher group counts and larger hash tables, fewer CUs are able to compute the grouping and,
therefore, they are not considered for the placement.
(2) A second case with three different CUs is shown in Figure 4.9b. Similar to the previous
example, the automatic approach with full data and no data cleaning chooses the best placements,
equal to the manual approach. For configurations with estimation errors, the errors
only propagate to a wrong placement in areas (I) and (II).
[Figure 4.9: Manual placement decisions of a group-by operator compared to automated decisions through runtime estimation. Both panels plot runtime (sec) over number of groups (M) for the manual placement, automatic1 (100% data, 0% cleaning), and automatic2 (10-100% data, 0-10% cleaning), with marked areas: (a) Tahiti GPU, K80, Xeon CPU, and Xeon Phi; (b) GT640, Intel iGPU, and AMD CPU.]
Errors at (I) are caused by configurations with Em = 10%, and errors at (II) by estimations
based on 20% of the input data or less. All in all, the resulting placements are for the most
part similar to the manual placement.
To summarize, we have seen good results for the placement of the group-by operator with
our estimation model, while the model itself has two error sources: (1) incomplete input data
and (2) the data cleaning approach. For the former, the actual error is acceptable (at most
6.4% when using 20% or more of the data), and the error shrinks with every execution due
to the online runtime learning. Additionally, the estimation error propagates not at all or only slightly
into the placement decisions and the resulting runtime. For the latter error source, data cleaning,
we have seen that it introduces errors depending on the cleaning margin (Em). Based on our
evaluation, we suggest using no cleaning if the memory size of the learned executions is not a
problem, or applying a margin of Em = 5% or less. Such a margin reduces the model entries
for linearly scaling behavior, while highly irregular scaling would not be cleaned, leaving more model
entries where they are needed.
4.4.2 Placement Optimization
Given the runtime estimation for each operator, we now look at the whole query plan and eval-
uate our two optimization strategies: local and global optimization. For that, we first introduce
our evaluation setup including two benchmark workloads, before looking at the search spaces
and the optimization quality.
[Figure 4.10: Heterogeneous evaluation setup consisting of one CPU and three GPUs. The AMD CPU and the AMD iGPU access the 32 GB main memory (2132 MHz) directly at 10.3 GB/s; the Nvidia K20 is attached via PCIe2 x4 (1.3 GB/s) and the Nvidia GT640 via PCIe3 x16 (12.4 GB/s).]
[Figure 4.11: Query runtime of SSB queries with 1M different placements per query: (a) overview, (b) focus on optimal placement.]
Hardware Setup For the evaluation, we use one heterogeneous system consisting of four dif-
ferent CUs. The system consists of one CPU (AMD CPU) and three GPUs (AMD iGPU, K20,
and GT640). The main memory connections are shown in Figure 4.10 (hardware properties
in Table 2.3). The transfer bandwidths are peak values we measured; the real bandwidths
for various data sizes can differ. We benchmark these bandwidths by transferring
16KB to 256MB for each CUi → CUj combination and additionally for transfers from the
host memory to the CUs. The benchmark results are stored like operator runtimes in our
estimation model. However, we do not apply data cleaning, as the data is a small fixed set of
tuples, where memory space is not an issue.
Database Workload In our heterogeneous system, we use gpuDB [Yuan et al. 2013] to ex-
ecute Star Schema Benchmark queries (SSB) [O’Neil et al. 2009] and Ocelot [Heimel et al.
2013] to execute TPC-H queries [TPC 2014]; both systems are limited to the stated bench-
marks. Both systems allow OpenCL-based execution and, therefore, are able to use different
CUs with one code base. However, neither system allows heterogeneous execution
within a query, only manually fixed single-CU execution. We extend these systems to log
the query structure, operator runtimes, and data sizes, to extract information for runtime esti-
mation and placement optimization. The required information can be retrieved by executing
each query on every CU in a single-CU mode. Therefore, we do not need to implement het-
erogeneous placement at this point in order to evaluate our approach. Through the offline
evaluation, we can estimate the benefits and limitations of our approach. For gpuDB, we are
able to execute all 13 SSB queries with our system. For Ocelot, only 9 queries could be executed
on all four CUs, while the other queries abort with different errors on at least one CU. We use
scale factor 10 for the SSB queries and scale factor 5 for the TPC-H queries.
Search Space and Strong Placements To illustrate the search space and the resulting dif-
ferences in query runtime, we use the collected runtime information and generate 1M random
hypothetical placements for each query. The performance results, including all operator run-
times and transfer costs, are shown in Figure 4.11 and Figure 4.12 as Box-Whisker-Plots. Every
box represents one query and the distribution of runtimes for the different placements. The
[Figure 4.12: Query runtime of TPC-H queries with 1M different placements per query: (a) overview, (b) focus on optimal placement.]
box itself covers the value range from the 25% to the 75% quantile; the included line represents the
median; the whiskers show ± 1.5 IQR (interquartile range, usually the size of the box); and the
shown points are outliers. Additionally to the Box-Whisker-Plots, we show the number of op-
erators and the best possible execution in Figure 4.11b and Figure 4.12b. We computed the
optimal placement through exhaustive search with pruning, which required several hours for
most queries.
With 9 to 24 operators for SSB queries, the search space contains 4^9 = 262,144 to 4^24 =
2.8 × 10^14 possible placements. For the TPC-H queries, there are 9 to 36 operators, leading to a
maximal search space of 4^36 = 4.7 × 10^21 possible placements. Additionally, we can see a wide
range of runtimes for the different placement possibilities; it is therefore highly important to
apply placement optimization.
To reduce the search space for global optimization, we proposed strong placements, single
operator placements that are always placed on one CU, even with the worst case transfers. Such
strong placements do not need to be considered in the optimization anymore and, therefore,
reduce the search space. For the evaluation, we build every possible sub-set of our four CUs:
four sub-sets with one CU, six two-CU combinations, four sub-sets
with three CUs, and one set with all four CUs. We report the average percentage of strong
placements for each sub-set size in Figure 4.13 and Figure 4.14. If we only use one CU, all
[Figure 4.13: Occurrence of strong operator placements within SSB queries. Bars show the percentage of strong placements (0-100%) per query (1_1 to 4_3) with 1, 2, 3, or 4 CUs enabled.]
[Figure 4.14: Occurrence of strong operator placements within TPC-H queries. Bars show the percentage of strong placements (0-100%) per query (Q3, Q4, Q5, Q6, Q10, Q11, Q12, Q15, Q18) with 1, 2, 3, or 4 CUs enabled.]
operators can be considered as strong placements. For two CUs, 20% to 50% of the operators
can be fixed to one CU, reducing the search space significantly. For example, for SSB query 2_1,
nine out of 18 operators can be assigned as strong placements on average, leading to a search
space of 2^9 = 512 instead of 2^18 = 262,144. However, for more than two CUs, we see a large
decrease of strong placements for SSB queries (Figure 4.13) and a smaller decrease for TPC-H
queries (Figure 4.14). This decrease is caused by (1) the larger choice of CUs: with more CUs
in the system, the actual execution times are more likely to be close for at least two CUs,
and (2) more expensive possible worst-case transfers, leaving no safe decision for one
CU, as it might introduce harmful transfers. All in all, we conclude that strong placements
are beneficial for two CUs, while the search space improvement is decreasing for systems with
more than two CUs.
Local vs. Global Optimization Finally, we want to compare the performance of local and
global optimization. For this, we evaluate single-CU execution, local placement optimization
and global placement optimization with different starting placements on the presented SSB
queries and TPC-H queries. We use runtime estimation with the input data from all queries
and all CUs and do not apply data cleaning. The results are normalized to the pre-computed
ideal placement execution and are shown in Figure 4.15 for the SSB queries and TPC-H queries.
We can make multiple observations.
• Single-CU execution shows runtimes across the full range, while the CPU mostly shows the
worst runtime and the GT640 or iGPU show the best single-CU runtime, because of
their parallel execution and relatively fast connection to the main memory.
• Local optimization shows good results, always being within the top 2% of the query runtimes.
For four SSB queries and two TPC-H queries, the simple local optimization approach
is able to find the best possible placement. However, there are also queries where local
optimization results in worse performance than single-CU execution, because of
unnecessarily introduced data transfers (SSB queries 4_1 and 4_2).
• The performance of global optimization varies with the starting placement. Starting with
single-CU placements can result in finding a local optimum in the starting placement,
i.e., no single placement adjustment can improve the performance, since transfers introduce
too much additional cost.
[Figure 4.15: Placement optimization results relative to the best placement's runtime. Panels: (a) SSB queries - full scale; (b) SSB queries - top 2%; (c) TPC-H queries - full scale; (d) TPC-H queries - top 2%. Each panel compares single-CU execution, local optimization, and global optimization started from single-CU, random, and local placements.]
Even with better placements being possible, our greedy
approach would not leave the starting placement for that reason. This can be seen for
most SSB queries in Figure 4.15a. There, even some random starting placements find a
single-CU execution as local optimum.
• Besides these single-CU local optima, global placements with different
single-CU starting placements and random starting placements result in good runtimes,
which are partly better than local optimization. This shows that we need to apply
global optimization with multiple different starting placements and choose the best
resulting placement to achieve a good final result.
• To ensure that global optimization achieves the same or an even better placement
than local optimization, we found it beneficial to combine both approaches. At
compile-time, we simulate local optimization, i.e., we traverse the query plan while defining
the placement decisions only according to input transfers and runtime estimation. The
resulting placements for every operator are then used as the starting placement for the
global optimization, where we improve the placement by finding data-sharing opportunities
through the global view. This approach yields results that are always at least as good as
local placement. While this is a good heuristic, some global optimizations
with other starting placements achieved even better results (e.g., SSB query 2_1, 2_3,
4_1, 4_2, and TPC-H query 5). Again, this shows us that we need to evaluate the global
optimization with multiple starting placements.
• Finally, we see that we always found good placements within only 0.5% of the ideal runtime
using our local or global optimizations. These optimizations execute in milliseconds
instead of hours for the pruned exhaustive search, making the whole placement optimization
applicable for online query processing.
During our evaluation, we noticed two interesting trends:
(1) With changing magnitude of the transfer costs, the ideal optimization strategy changes.
For example, with low or no transfer costs at all (e.g., because of small transferred data sizes or
high-bandwidth connections), local optimization is sufficient, as data sharing does not need to
be considered. Local optimization would simply pick the best performing CU for each operator
and achieve the best possible runtime. If the transfer costs are significantly larger than the ac-
tual execution, then single-CU execution is sufficient, where transfers are only needed for base
data, while all intermediate results are shared on one CU. For scenarios in between, e.g., where
transfer costs are significant but not much larger than the execution times, global
optimization is needed to find the ideal heterogeneous placement while considering data shar-
ing between operators. Global optimization can actually be used in all scenarios with different
transfer costs. It can find highly heterogeneous placements and single-CU placements, depend-
ing on the magnitude of transfer costs and the operator executions.
(2) A second trend we have seen is that local optimization might produce worse results
when optimizing many more operators. As local optimization does not consider future
operators, the placement is decided locally, adding harmful transfer costs for wrong decisions.
With only a few operators, there are not many chances for future operators to share data or
to suffer from earlier decisions. With many more operators, data sharing becomes more important,
because every placement decision could influence many future decisions and introduce many
more unnecessary transfers.
4.5 CONCLUSION
In this chapter, we proposed a runtime estimation approach based on online learning during
execution and linear interpolation between raw tuples for the estimation. We have shown that
this approach is capable of estimating runtimes even for irregular behavior and that possible
errors are low and do not propagate to the placement decisions to a large extent.
We also discussed two different placement optimization strategies, local and global opti-
mization, including implementation details and evaluation. We have shown that the search
space can be reduced with strong placements for two CUs in one system, while the potential
for more than two CUs is limited. However, even without strong placements, local and global
optimizations found good placements for our evaluation queries. Global optimization is most
likely to find a good resulting placement when multiple different starting placements are considered.
We have shown that the large search space for global optimization is not a problem and
that our greedy algorithm is suited for the placement optimization. The remaining challenge
is accurate cardinality estimation, because compile-time approaches like global optimization
are highly dependent on the estimated cardinality information for both the operator runtime
estimation and the transfer estimation.
5
ADAPTIVE PLACEMENT OPTIMIZATION
5.1 Open Challenges
5.2 Adaptive Placement Approach
5.3 Adaptive Placement Sequence
5.4 Implementation Approach
5.5 Evaluation
5.6 Conclusion
General Placement Optimization was proposed in the last chapter, including runtime estimation, local optimization, and global optimization. In this chapter, we build upon
these general placement techniques, by proposing adaptive placement optimization to solve the
problems and limitations of the previous approaches.¹
As shown before, global placement optimization shows good performance, however, it is
dependent on accurate cardinality estimation. For each physical operator within a query, and
for each available CU, the operator runtime and the data transfer costs are estimated and com-
pared, whereas the estimations are based on data cardinalities. Up to now, the available ap-
proaches [Breß 2014; He et al. 2009] assume perfect cardinality estimations even for inter-
mediate results, however, this is simply not possible for complex workloads [Leis et al. 2015].
Even small deviations in the cardinality estimation may have a major impact on the estimated
runtime, potentially leading to sub-optimal decisions at the end (error propagation).
Beyond cardinality estimation, we identified two additional limiting aspects of current ap-
proaches (including ours). First, the execution time of the same physical operator can behave
differently depending on the query structure and the input data size, potentially resulting in
imprecise runtime estimation on operator level. Second, due to dominant data transfer costs,
decisions regarding the location of intermediate results in a heterogeneous environment limit
the flexibility for future placement decisions within the same query.
To overcome these limitations, we propose a novel adaptive placement approach for query
processing on heterogeneous computing resources. Our approach takes a physical query exe-
cution plan as input and divides the plan into disjoint execution islands at compile-time. The
execution islands are determined in a way that the cardinalities of intermediate results within
each island are known or can be precisely calculated. The placement optimization and exe-
cution are performed separately per island at query run-time. Execution islands are processed
sequentially according to their data dependencies. To further enhance our approach, we pro-
pose two additional improvements: (1) a fine-grained runtime estimation technique and (2) a
placement-friendly data transfer technique.
We first describe the open challenges in more detail in Section 5.1. Then, we propose
our adaptive placement approach to become independent of cardinality estimations together
with the two mentioned improvements in Section 5.2. We combine our general approach from
Chapter 4 with our adaptive approach to an adaptive placement sequence in Section 5.3 and
propose a novel implementation approach called HERO in Section 5.4. Finally, we provide an
evaluation on micro-benchmarks and real database execution in Section 5.5.
¹Parts of the material in this chapter have been developed jointly with Dirk Habich and Wolfgang Lehner.
The chapter is based on [Karnagel et al. 2017b]. The copyright is held by the VLDB Endowment and the original
publication is available at http://www.vldb.org/pvldb/vol10/p733-karnagel.pdf.
5.1 OPEN CHALLENGES
While evaluating our current approach as well as other state-of-the-art approaches in this do-
main, we identified three open challenges, which we denote C1, C2, and C3.
C1 - Inaccurate Cardinality Information
Generally, data cardinality information is influencing the estimation models for the traditional
query optimization as well as for the placement optimization. This information is usually
provided via statistics, histograms, or estimations using heuristics. However, especially when
working with many joins, groupings, or complex selections, the estimated cardinalities for in-
termediate results can show significant errors [Ioannidis and Christodoulakis 1991]. Selected
attributes can be correlated, and statistics on data distributions cannot simply be intersected
for different attributes or relations [Christodoulakis 1984]. Leis et al. report cardinality estimation
errors by a factor of 1,000¹ or more for all tested DBMSs when the query has multiple
joins [Leis et al. 2015].
[Figure 5.1: One query with different (intermediate) cardinalities (SSB query 3_4, sel3 ≈ 2 × sel2 ≈ 6 × sel1). For each selectivity sel1 to sel3, a bar shows the percentage of operators optimally placed on CU1 to CU4.]
To demonstrate the high influence of cardinality information on the placement optimization,
we execute a single SSB query with three different selectivities, resulting in three different intermediate
cardinalities. Figure 5.1 shows the optimal placement distribution of the query
operators to four different CUs. As we can see, the ideal placements vary greatly, caused
by the different intermediate cardinalities. This illustrates the importance of cardinality information
for the placement optimization. Unfortunately, existing approaches simply assume exact
knowledge of data cardinalities for intermediate results [He et al. 2009; Breß 2014],
obtaining the information from the query optimizer. These approaches ignore the well-known
problem of inaccurate cardinality estimation, which might result in sub-optimal placements.
C2 - Inaccurate Runtime Estimation
To assign operators to CUs, the runtimes of operators have to be estimated, for which a learning-
based approach is promising for complex operators and different CU architectures (see Sec-
tion 4.1). However, learning-based approaches suffer from inaccuracies originating, again, from
wrong cardinality information as well as from behavior changes in the operators. We observed the
latter by experimentally investigating the two OpenCL-based DBMSs Ocelot [Heimel et al. 2013]
¹A factor of 1 is accurate, while 10 means ten times more or less.
and gpuDB [Yuan et al. 2013]. In both systems, physical operators behave differently depending
on the input data size or the position within the QEP. The reasons for these variations are
pre-processing steps, like bitmap materializations or hash table creations, as well as post-processing
steps, like bitmap concatenations or data conversions. The presence or absence of these extra
steps is usually not visible in a purely operator-based query execution plan; however, these
additional steps do influence the runtime of the operators.
C3 - Influence of Intermediate Result Location
During query processing, operators are executed on different CUs, with input data being trans-
ferred to, and stored in the CU’s memory together with the operator’s results. With inter-
mediate results being stored on a specific CU, the further processing is usually locked to this
CU, even if other CUs perform better. The reasons for that are transfer costs, which might be
dominating the query runtime. As a consequence, current approaches for placement optimiza-
tion are substantially dominated by data transfer costs rather than optimal operator execution,
limiting the usage of heterogeneous computing resources for a single query.
5.2 ADAPTIVE PLACEMENT APPROACH
As mentioned in the previous section, imprecise cardinality estimations are the most signif-
icant source for errors during placement optimization (Challenge C1). In our novel adaptive
placement approach, we do not strive to improve the cardinality estimation, but instead, we
focus on becoming completely independent of these estimations.
5.2.1 General Approach
To tackle Challenge C1, our approach is two-fold: (1) create execution islands at compile-time
and (2) apply placement optimization and execution per island at run-time.
At query compile-time, the query optimizer provides the most-efficient QEP as input for our
placement optimization, as done by all state-of-the-art approaches. Then, we divide the QEP
into disjoint execution islands, which combine subsequent operators of the QEP in a way that
within a single execution island, the cardinalities of intermediate results are known or can be
precisely calculated at run-time. The islands are delimited by so called estimation breakers defin-
ing the QEP positions, where new cardinality information will be available during processing.
At query run-time, the disjoint execution islands of the QEP are executed successively. Before
executing the operators within an island, placement optimization for this island is conducted.
Since we know the exact cardinalities within this execution island, we are able to precisely esti-
mate the runtime behavior of each operator as well as data transfer costs. Thus, our placement
decisions are based on accurate numbers. To make the placement decisions, we propose re-
gional optimization, which is essentially global placement optimization restricted to a specific
execution island. Whenever the execution of an island is finished, we reach an estimation
breaker and the intermediate cardinalities for the next island can be calculated.
The most challenging issue for our approach is identifying the estimation breakers within
the QEP. To determine these estimation breakers, we analyzed the execution behavior of physical
operators in the underlying DBMSs, gpuDB and Ocelot. Since almost all CUs in a heteroge-
neous hardware system offer high parallelism, this aspect also affects the implementation of
operators. For example, when thousands of threads work on the same data, traditional locks
or atomic operations hinder parallelism significantly. However, if the result cardinality is pre-
cisely known, each thread can compute the designated position of its output and execute its
work without locking.
To achieve this desired processing behavior, the exact cardinalities are usually first com-
puted within a probing step, before producing the actual operator result. Figure 5.2a illustrates
that approach for a hash join operator, as proposed by He et al. [He et al. 2008]. A traditional
hash join consists of two steps: (1) hash table creation and (2) hash table probing. The highly
parallel version has two probing steps instead of one (see also Section 2.3). The first probe
calculates the output size for each thread, where the actual size of this probe’s output is exactly
one value per thread. Hence, the output size of the first probe is known beforehand. After-
wards, the second probe uses the gathered cardinality information, knowing precisely the real
output size, and produces the actual join result in parallel.
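The control flow of this two-phase pattern can be sketched in sequential Python; an actual implementation runs both probes as data-parallel kernels and typically derives the write offsets with a prefix sum over the per-thread counts, and all names here are illustrative:

def hash_join(build_rows, probe_rows, key):
    # Step 1: hash table creation over the build side.
    ht = {}
    for row in build_rows:
        ht.setdefault(key(row), []).append(row)
    # Probe 1: count the matches per probe tuple. The output is exactly
    # one counter per tuple, so this phase's cardinality is known upfront.
    counts = [len(ht.get(key(row), ())) for row in probe_rows]
    # Estimation breaker: after probe 1, the join result cardinality
    # (sum of counts) is exact; a prefix sum yields each tuple's offset.
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    result = [None] * total
    # Probe 2: every tuple writes its matches to its precomputed
    # positions, without any locking.
    for row, off in zip(probe_rows, offsets):
        for i, match in enumerate(ht.get(key(row), ())):
            result[off + i] = (match, row)
    return result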
[Figure 5.2: Highly parallel hash join processing. (a) Hash join implementation: inputs A and B, hash table creation, probe 1 producing the size information, probe 2 producing the result. (b) Execution islands created by the unknown join result size: the estimation breakers between the probing steps divide the plan into Islands 1 to 3.]
Our QEP division into execution islands is based on these multiple probing phases. We an-
alyze the QEP and the corresponding operators at compile-time to identify such cardinality
probing steps within the operators. These probing steps are our estimation breakers, because
new cardinality information becomes available. For example in Figure 5.2a, Probe 1 computes
the result cardinality of Probe 2, therefore, an estimation breaker would be added between both
and the operator placement would be optimized for two separate islands. A larger example is
given in Figure 5.2b, showing a query with two hash joins, where the two probing steps of each
join divide the query into execution islands. For the first island, the intermediate cardinali-
ties are known, based on the cardinalities of the input tables. For all parts within this island,
the placement can be defined using runtime estimation and placement optimization, based on
exact cardinality knowledge. Afterwards, the island is executed according to the chosen place-
ments. Once the island is fully executed, the intermediate cardinalities for the next island can
be calculated (e.g., after Probe 1) and we can start the same process again until all islands are
executed.
To summarize, using our adaptive approach, the estimation of operator runtimes and data
transfer times is based on precise cardinality information, significantly improving estimation
results.
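The run-time part of the adaptive approach then reduces to a simple loop; in this sketch, regional_optimize stands for the global optimization of Chapter 4 restricted to one island, and the island methods are assumptions:

def adaptive_execution(islands, regional_optimize, execute):
    # Islands are processed sequentially according to their data
    # dependencies; when an island is reached, the cardinalities of all
    # its intermediate results are known exactly.
    for island in islands:
        island.update_cardinalities()  # e.g., from a finished probe 1
        placement = regional_optimize(island)  # regional optimization
        for sub_op in island.sub_operators:
            execute(sub_op, placement[sub_op])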
5.2.2 Improving Placement Quality
With the adaptive approach, we are independent of cardinality estimations for intermediate re-
sults. To tackle the remaining two challenges, we additionally introduce a fine-grained runtime
estimation (C2) and a less data-centric optimization (C3) to improve the placement quality of
our approach.
Fine-grained Runtime Estimation (C2)
To improve runtime estimation and better support our adaptive placement approach, we pro-
pose to work on sub-operator granularity, where sub-operators can be recurring functions
that are executed subsequently within an operator. Specific pre-processing and post-processing
steps can be expressed with specific sub-operators, allowing the estimator to consider their run-
time individually. Working on this fine-grained level has several advantages:
1. More accurate runtime estimations, since every processing step is considered separately.
2. More training data, as some sub-operators are used in multiple operators (e.g., the same
hash table creation for hash joins and hash-based groupings).
3. More fine-grained placement, as sub-operators can be placed separately, instead of plac-
ing full operators.
4. Support for our adaptive work placement, as sub-operators allow the positioning of esti-
mation breakers (see example in Figure 5.2b).
While the first two points improve the runtime estimation quality, the third point could poten-
tially improve the runtime of the whole query by using fine-grained placement decisions.
Intermediate Results with multiple Locations (C3)
To provide the optimizer with more freedom in choosing the best CU, we propose to keep
temporary copies of used data on the CUs. Data objects can be accessed and updated by sub-
operators, however, they have to be transferred to and from the CUs depending on the access
location. Instead of moving a data object to the CU where it is used next, we copy the data,
while also keeping the original version. This enables two improvements to the placement opti-
mization:
1. Future executions can choose a CU, where a copy of needed data is stored and, therefore,
avoid additional transfers. Otherwise, they can choose different CUs, transfer data, and
provide another copy to future executions.
2. Parallel access to different copies on different CUs is made possible, while before, access
on the same data object had to be sequential.
Allowing copied data objects introduces consistency challenges when data is updated. To ad-
dress this problem, we define a small set of rules for data handling (a minimal sketch of how
these rules could be applied follows the list):
1. If the required data object is not available on the assigned CU, copy it from a CU where it
resides. If multiple CUs apply (e.g., multiple copies as source), use the CU with the smallest
transfer costs.
2. If a sub-operator updates data, delete all copies except the one used by the sub-operator.
3. If a CU’s memory is full, delete older copies.
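To make the rule set concrete, the following C sketch shows how a driver could resolve the
cheapest source copy (rule 1) and invalidate copies on updates (rule 2). All names here
(DataObject, transfer_cost, copy_to, MAX_CUS) are hypothetical illustrations, not HERO's
actual interface.

#include <stdbool.h>
#include <float.h>

#define MAX_CUS 8

/* Hypothetical bookkeeping for one data object and its copies. */
typedef struct {
    bool resident[MAX_CUS];   /* is a copy present on CU i? */
} DataObject;

/* Assumed to be provided elsewhere by the driver. */
extern double transfer_cost(const DataObject *obj, int src, int dst);
extern void   copy_to(DataObject *obj, int src, int dst);

/* Rule (1): ensure obj is available on target_cu, copying from the
 * cheapest source while keeping the source copy intact. */
void ensure_resident(DataObject *obj, int target_cu) {
    if (obj->resident[target_cu]) return;       /* already there */
    int best_src = -1;
    double best_cost = DBL_MAX;
    for (int cu = 0; cu < MAX_CUS; cu++) {      /* pick the cheapest source */
        if (!obj->resident[cu]) continue;
        double c = transfer_cost(obj, cu, target_cu);
        if (c < best_cost) { best_cost = c; best_src = cu; }
    }
    if (best_src < 0) return;                   /* assumes one copy exists */
    copy_to(obj, best_src, target_cu);          /* source copy is kept */
    obj->resident[target_cu] = true;
}

/* Rule (2): after a write on writer_cu, invalidate all other copies. */
void on_update(DataObject *obj, int writer_cu) {
    for (int cu = 0; cu < MAX_CUS; cu++)
        obj->resident[cu] = (cu == writer_cu);
}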
When the majority of memory accesses is read-only, this approach can lead to temporary
copies being present on nearly every CU. There are no additional transfer costs for this ap-
proach, as the system would transfer data objects with or without copy support. However,
when using copies, data is available on source and target after the transfer, potentially avoiding
future transfers. All copies can be removed upon query termination.
[Figure 5.3: Adaptive optimization sequence: the pre-processing (yellow) extends and improves
the general steps (green). (a) General Approach: Runtime Estimation (on DB operators, on
vague cardinalities) and Global Optimization (with inaccurate estimations, limited by data
transfers) at compile-time, followed by Execution at run-time. (b) Adaptive Approach: Drill-Down,
Data Location Analysis, and Island Creation at compile-time, followed by Runtime Estimation
(on sub-operators, on precise cardinalities), Regional Optimization (with accurate estimations,
with multiple copies), and Execution at run-time.]
5.3 ADAPTIVE PLACEMENT SEQUENCE
In the previous section, we proposed adaptive placement techniques. We now propose our
Adaptive Placement Sequence, defining the optimal execution sequence of the presented tech-
niques, including runtime estimation and regional optimization.
Figure 5.3 compares the general sequence with our adaptive approach. As illustrated, our
adaptive placement sequence has five optimization steps, where the first three steps act as pre-
processing at query compile-time. Then, the following two steps and the actual query execution
are applied multiple times at run-time, once per execution island. The steps are illustrated
using the following example query based on the TPC-H schema:
SELECT sum(l_quantity) FROM lineitem WHERE l_discount < 20 and l_quantity < 24
When executing this query for example with Ocelot [Heimel et al. 2013], a QEP with four
operators is generated, as shown in Figure 5.4a. The first two operators are the selections on
l_quantity and l_discount, both producing bitmaps. The third operator materializes the result of
the selections before the fourth operator performs the aggregation. All four operators are im-
plemented in OpenCL, so they can be executed on different CUs without any code adjustments.
5.3.1 Steps at Query Compile-Time
At query compile-time, we receive the most-efficient QEP determined by the traditional query
optimizer and divide it into execution islands with the following three steps.
Drill-Down
Our overall approach works on sub-operator level to be able to determine estimation break-
ers within operators as well as to improve the placement quality. Therefore, our first pre-
processing step is a Drill-Down from operators to sub-operators of the QEP.
[Figure 5.4: Operator and sub-operator view; numbers define the execution order; arrows sym-
bolize the data flow. (a) Original operator view. (b) Drill-Down: the four original operators
(gray bounding boxes) split into 13 sub-operators (white boxes).]
Example: In our Ocelot example, sub-operators correspond to OpenCL kernels. Fig-
ure 5.4b shows the drill-down result for our running example. Theta_select_flt performs a se-
lection on float values, while bm_and intersects the two bitmaps of the previous selections to
one resulting bitmap. The prefixsum_* sub-operators count the number of set bits within the
resulting bitmap, to calculate the result cardinality. The fetchJoinBM sub-operator materializes
the bitmap result while fetching values of l_quantity, as they are needed for the next steps. Fi-
nally, the reduce_sum_* sub-operators consume the materialized selection result to build the
aggregation sum in parallel.
From a high-level point of view, both selections are assumed to have equal runtime, because
both consume a column of the same table and work with the same data cardinalities for input
and output. However, as we see in Figure 5.4, on the fine-grained level, the two operators differ
in execution. Both selections execute a theta_select_flt sub-operator but the second selection
has an additional sub-operator to combine the two bitmaps, increasing the runtime of the sec-
ond operator. When working directly on sub-operators, the estimator learns the runtimes for
theta_select_flt and bm_and separately, leading to more precise runtime estimations and also
potentially different placement decisions for these sub-operators. Additionally, we cannot as-
sume that a single placement decision is optimal for all eight sub-operators of the leftfetchjoin.
When placing only operators, we lose speedup opportunities if sub-operators are diverse and
run better on different CUs.
Data Location Analysis
After our Drill-Down, we analyze the resulting data flow of the query plan and in particular
the memory accesses of the sub-operators (read and write). This analysis is done to determine
where intermediate results can be kept in different locations according to the temporary copy
concept for challenge C3. We note that we only find opportunities for copies in this second
step, while the real occurrence of copies is placement-dependent. For example, if two sub-
operators only read the same data, our analysis confirms that copies may be kept. However, if
both sub-operators are placed on the same CU, there is no data transfer and, hence, no addi-
tional copy.
Example: The benefits are illustrated in Figure 5.5. There, we add the memory operations
alloc, read, and the memory access types for each sub-operator (r for read-only, w for read-
write). Alloc allocates the data object, while read evaluates a result. Read 4 evaluates the result
size of the bitmap materialization (data object 4), which determines the size of data object 5
(hence, the estimation breaker). Read 6 is done to output the aggregation result. One advantage
of our analysis can be seen for sub-operator 3.8: It reads data object 2, which was used before by
sub-operators 2.2, 3.1, and 3.2. If their placements were on three different CUs, sub-operator 3.8
can now choose one of the three CUs, where a copy resides, while having no transfer costs
for this data. An additional advantage can be seen for Read 4, which was scheduled to be
executed after sub-operator 3.7. However, since both operations only read data object 4, they
can be executed in parallel using copies.
Island Construction
Our third pre-processing step is the island construction. This is done by traversing the data
flow, while collecting all sub-operators with pre-defined input and result cardinalities into a
single execution island, and creating a new island after an estimation breaker. For example,
selections producing bitmaps, sort operations, foreign key joins, calculations, and aggregations
produce fixed size results, and may be executed within the same island. However, for bitmap
materializations, groupings, and joins not based on foreign keys, the cardinalities are not fixed,
leading to a staged execution: (Phase 1) calculating the result size and (Phase 2) actually pro-
ducing the result. Between these two phases, we place our estimation breakers dividing the
query plan into execution islands. Thus, the number of execution islands depends on the used
sub-operators.
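To illustrate, here is a minimal C sketch of such a traversal, assuming sub-operators are
already linked in execution order and flagged where a phase-1/phase-2 split occurs; SubOp and
its fields are illustrative names, not our actual data structures.

#include <stddef.h>
#include <stdbool.h>

typedef struct SubOp {
    struct SubOp *next;     /* successor in execution order */
    bool breaker_after;     /* result cardinality only known after execution */
    int island;             /* assigned island id */
} SubOp;

/* Assign island ids by walking the plan in execution order, starting a new
 * island after every estimation breaker. Returns the number of islands. */
int build_islands(SubOp *head) {
    int current = 0;
    for (SubOp *s = head; s != NULL; s = s->next) {
        s->island = current;
        if (s->breaker_after && s->next != NULL)
            current++;          /* the next sub-operator opens a new island */
    }
    return current + 1;
}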
Example: In Figure 5.5, the intermediate results are fixed for both selections, as they con-
sume a column of lineitem and produce a bitmap with one bit per row. However, when the
bitmaps need to be materialized, the exact result cardinality is unknown. This results in an
estimation breaker between sub-operator 3.7 and 3.8, building two separate execution islands.
The first island combines the selections and calculates the materialized result size, while the
second island combines the last step of materialization and the aggregation.
[Figure 5.5: Reordering data accesses by inspecting data objects (ovals) and dependencies. Data
accesses are either read (r) or write (w) on a specific data object. Dashed lines illustrate that
there is a choice of data source. The estimation breaker splits the plan into Execution Island 1
and Execution Island 2.]
Sub-operator
1.1 2.1 2.2 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 4.1 4.2
CU 1 2 2 4 3 4 3 3 4 1 2 5 3 4
CU 2 4 4 5 5 3 3 1 3 1 4 2 3 1
CU 3 2 2 2 1 5 5 1 1 5 1 1 4 5
CU 4 2 2 4 4 1 1 2 1 3 3 2 3 2
Table 5.1: Hypothetical runtime estimations (in units) for the given sub-operators and 4 CUs.
5.3.2 Steps at Query Run-Time
With the presented steps at query compile-time, the most-efficient QEP is divided into exe-
cution islands. The following two steps of runtime estimation and placement optimization are
directly applied before an island is executed in order to determine the optimal placement for
the specific island.
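Put together, the per-island run-time loop can be sketched as follows; the function names are
placeholders for the components described in this section, not an actual API.

typedef struct Island Island;   /* opaque handle for one execution island */

extern void estimate_runtimes(Island *);       /* per sub-operator and CU */
extern void optimize_regionally(Island *);     /* greedy placement (Algorithm 5) */
extern void execute_island(Island *);
extern void evaluate_cardinalities(Island *);  /* exact sizes for the next island */

/* Islands are processed strictly in order: only after island i has executed
 * are the exact cardinalities known that island i+1 is optimized with. */
void run_islands(Island *islands[], int n_islands) {
    for (int i = 0; i < n_islands; i++) {
        estimate_runtimes(islands[i]);
        optimize_regionally(islands[i]);
        execute_island(islands[i]);
        evaluate_cardinalities(islands[i]);
    }
}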
Runtime Estimation per Island
The first step at query run-time is to estimate the runtimes of sub-operators for the available
CUs. Here, we utilize our estimation approach from Section 4.1. We continuously monitor the
execution times together with the input data cardinalities.
Example: A possible output of this step for our running example is depicted in Table 5.1.
Here, we assume to have four different CUs and that the estimation model returns the hypo-
thetical runtimes for each combination of sub-operator and CU. The runtimes are in abstract
units and chosen in a way that they can demonstrate specific optimization aspects later. Fur-
thermore, we assume any data transfer between two CUs takes one unit.
Algorithm 5 Regional Placement Optimization.
1: procedure Regional-Optimization(island)
2:     placement ← starting placement
3:     repeat
4:         for all sub in sub-operators(island) do
5:             bestCost ← ∞                                ▷ reset per sub-operator
6:             for all cu in CUs do
7:                 cost ← exec-est(sub, cu) + input-transfer-est(cu) + output-transfer-est(cu)
8:                 if cost < bestCost then
9:                     bestCost ← cost
10:                    placement[sub] ← cu-with(bestCost)
11:    until placement has not changed
12:    for all sub in sub-operators(island) do
13:        execute(sub, placement[sub])
Optimization        Operator    Sub-Operator    Sub-Operator + copies
Local 30 34 33
Global 29 27 26
Regional 29 28 27
Table 5.2: Results of optimization (in units) for the given runtime estimations. The gray values
are visualized in Figure 5.6.
Placement Optimization per Island
To define the actual placement, we apply our regional optimization approach. Unlike global
optimization, regional optimization is limited to the sub-operators within one execution island
to ensure that only exact cardinalities and runtime estimations are used in the decision pro-
cess. Since the search space within execution islands is still too large to evaluate every possible
placement, we apply the same lightweight greedy algorithm as for global optimization (see Sec-
tion 4.3). The greedy algorithm tries to improve each sub-operator’s placement by considering
its runtime estimations and the regional context, where data transfers from preceding and to
succeeding sub-operators are considered. The regional algorithm is illustrated in Algorithm 5.
Since the result depends on the starting placement of the greedy algorithm, we first run the
algorithm starting with (1) each single-CU placement and (2) the local placement. If the esti-
mated runtime of the best plan found is larger than a threshold (in our case 100 ms), we allow
more time for optimization by evaluating multiple random starting placements. The place-
ment with the best estimated runtime is chosen for execution. For the regional optimization,
we do not implement the search space reduction through strong placement (see Section 4.3.3),
because of the limited improvement for more than two CUs.
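The multi-start strategy just described might be sketched as follows, assuming a
greedy_optimize that implements Algorithm 5; the number of random starts and all other
names are assumptions of this sketch, as the text does not fix them.

#define THRESHOLD_MS    100.0
#define N_RANDOM_STARTS 32      /* assumed; not specified in the text */
#define MAX_SUBOPS      64

typedef struct Island Island;   /* opaque */
typedef struct { int cu_of_sub[MAX_SUBOPS]; } Placement;  /* CU per sub-operator */

extern Placement greedy_optimize(Island *, Placement start);  /* Algorithm 5 */
extern double    estimated_ms(Island *, Placement);
extern Placement single_cu_start(int cu);
extern Placement local_start(Island *);
extern Placement random_start(Island *);

Placement choose_placement(Island *isl, int n_cus) {
    Placement best = greedy_optimize(isl, local_start(isl));   /* (2) local start */
    for (int cu = 0; cu < n_cus; cu++) {                       /* (1) single-CU starts */
        Placement p = greedy_optimize(isl, single_cu_start(cu));
        if (estimated_ms(isl, p) < estimated_ms(isl, best)) best = p;
    }
    /* Expensive islands justify spending more time on random restarts. */
    if (estimated_ms(isl, best) > THRESHOLD_MS)
        for (int i = 0; i < N_RANDOM_STARTS; i++) {
            Placement p = greedy_optimize(isl, random_start(isl));
            if (estimated_ms(isl, p) < estimated_ms(isl, best)) best = p;
        }
    return best;
}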
Example: In our running example, we want to show the effectiveness of our optimizations
compared to the general approach. We consider the placement of operators, of sub-operators,
and of sub-operators with data copies in Table 5.2. We also compare our optimization approach
to local and global optimization. For the global optimization, data cardinalities for intermedi-
ate results are not known at that time and have to be estimated for the placement decisions
leading to the described drawbacks. For comparability, we use perfect cardinality informa-
tion for the global optimization in the example. As we can see in Table 5.2, local placement
always yields the slowest runtimes, while global placement shows the best results under the as-
sumption of perfect cardinality information. Our regional island-based optimization represents
a middle ground, with a performance close to that of the global strategy. Generally, placement
on operators is worse than placement on sub-operators, except for the local strategy, where sub-
operator optimization produces additional data transfers. For all other cases, a drill-down to
sub-operators is beneficial. Additionally, allowing copies can achieve an improvement through
avoided transfers for all optimizations. In this example, the improvement is limited to a max-
imum of two avoided transfers as there are only two sub-operators able to exploit copied data
objects (sub-operator 3.2 and 3.8).
Figure 5.6 depicts the placement decisions for local, global, and our regional optimization
when using sub-operators and copies. It shows that local optimization is swapping CUs often
to achieve better results in execution, while introducing many additional transfers (e.g., sub-
operator 3.3 to 3.5). Regional and global optimization only differ for operator 3.7, where global
optimization chooses an additional data transfer to avoid transfers for the following operator.
Since this is exactly at the border of two islands, the regional approach cannot optimize this far
ahead and settles for a locally better placement.
To summarize, our regional placement is close to the global result. Nevertheless, if per-
fect cardinality estimations are not available – which is the normal case – the global
placement most likely becomes worse. However, since our regional approach always works
with precise cardinality information at run-time, we are not affected by inaccurate cardinality
estimations.
5.3.3 Feasibility of our Approach
Our adaptive optimization approach is best applied to columnar in-memory DBMSs with a
column-at-a-time processing model, where intermediate results are materialized, mainly trig-
gered by related work like Ocelot and gpuDB. However, other execution models may also ben-
efit from heterogeneous execution and our adaptive placement.
Block-wise processing, e.g., as vectors [Boncz et al. 2005] or morsels [Leis et al. 2014], can
be offloaded and placed on different CUs, where different blocks executing the same operator
can even run in parallel on different CUs at the same time. If intermediate results are not
materialized after every operator (like pipelined execution [Boncz et al. 2005] or generated
query code [Neumann 2011]), complete pipelines can be offloaded until a pipeline breaker
forces materialization. Although this results in a more coarse-grained placement, multiple
pipelines can be grouped to execution islands until an intermediate cardinality needs to be
calculated. The main challenge within the pipelined execution is runtime estimation, because
pipelines or generated code could always differ in their execution. However, an estimation
based on operations within the pipelines is possible, for example by estimating data accesses
and computation within the pipelines separately.
The only processing model that would most likely not benefit from our approach is tuple-
at-a-time, which is already not cache-friendly and has a high function-call overhead. These
problems would increase in heterogeneous environments, where the communication and data
transfer between CUs is costly. Therefore, this processing model is not well-suited for hetero-
geneous systems in general.
[Figure 5.6: Placements for sub-operators with copies, comparing the local, regional, and global
strategies; the colors and numbers represent the CU chosen for each sub-operator (CU 1 to CU 4),
with the estimation breaker marking the border between the two islands.]
5.4 IMPLEMENTATION APPROACH
To broaden the application of our adaptive placement approach, we want to support many
heterogeneity-aware DBMSs with our implementation approach. We achieve that by reusing
the basic technology many of these systems apply to support heterogeneous hardware: OpenCL.
Each hardware vendor supplies an OpenCL driver to their CUs, which is loaded when the CU
is first accessed. The driver manages the communication of the application to the CU using the
standardized OpenCL interface. For our evaluation, we implement our own OpenCL driver,
called HERO (HEterogeneous Resource Optimizer), as a virtualization layer, which is loaded
by the database system and manages the heterogeneous environment transparently. Using the
driver approach, an OpenCL-based database system needs no or only a few adjustments to sup-
port our adaptive placement optimization. The system communicates with HERO as if it were
a single CU, while HERO manages all available CUs underneath. HERO does that by inter-
cepting the OpenCL communication between the database system and the CU, while applying
our adaptive placement sequence and executing the work heterogeneously. The result is an
abstraction between the database system and the hardware, where all placement optimization
and execution is done.
5.4.1 General Architecture
Building upon our optimization sequence and the idea to intercept the OpenCL communica-
tion for evaluation, we developed the architecture presented in Figure 5.7. For the incoming
work, the interface itself provides the Drill-Down, since only sub-operators (in this case OpenCL
kernels) and memory operations can be submitted to the driver. Sub-operators are collected in
a work queue, while a memory manager keeps track of all queued memory accesses to provide
the data location analysis in a later step. Requesting an intermediate result acts as an estimation
breaker, triggering island construction. Runtime estimation is done for all sub-operators of an
island. Afterwards, the placement optimizer determines the heterogeneous placement by us-
ing the runtime estimations, the data access analysis, and the regional view on the query. The
placed sub-operators are then executed using the OpenCL interface and all available CUs.
5.4.2 Database System Interface
As we reuse the OpenCL standard to communicate with the database system as well as with
the heterogeneous CUs, we need to implement the OpenCL interface and have to choose an
OpenCL version. As multiple Nvidia GPUs only work with OpenCL 1.1, we are bound to this
version for the whole driver.
OpenCL Interface: In order to support our evaluation database systems, we implemented
38 of the 115 OpenCL functions. An overview of the supported functions is shown in Ta-
ble 5.3. This subset is sufficient to support gpuDB, Ocelot, and various other applications
using OpenCL. All other OpenCL functions are not supported and return an OpenCL error
code. Nevertheless, the support for these functions can be seamlessly added.
[Figure 5.7: Architecture overview of HERO: between the OpenCL interface to the database
system and the OpenCL interface to the heterogeneous environment sit the work queue, the
memory manager (tracking data objects 1 to N), the runtime estimator, the placement optimizer,
and the execution engine.]
For the implemented functions, the corresponding operation is either executed on one specific
CU (e.g., clEnqueueTask) or applied to all CUs. For example, the clCreateKernel function will
use the OpenCL program to create a kernel on every CU available, while returning a default
kernel to the host database system. When this kernel needs to be executed, HERO exchanges
the default kernel with the CU-specific kernel, once the placement decision is made.
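A sketch of this interception idea follows; the shadow structures and the vendor dispatch table
are illustrative assumptions, not HERO's real internals, and the standard OpenCL calling-
convention macros are omitted for brevity.

#include <CL/cl.h>

#define MAX_CUS 8

/* Hypothetical shadow state: the virtual program handed to the DBMS maps
 * to one real program per underlying CU. */
typedef struct {
    cl_program real_program[MAX_CUS];
    int        n_cus;
} ShadowProgram;

/* Vendor entry points, resolved when the underlying drivers are loaded. */
extern cl_kernel (*vendor_clCreateKernel[MAX_CUS])(cl_program, const char *, cl_int *);

extern ShadowProgram *lookup_shadow(cl_program virtual_program);
extern cl_kernel      make_virtual_kernel(cl_kernel real[], int n_cus);

/* Intercepted clCreateKernel: create the kernel on every CU and return a
 * single default (virtual) kernel handle to the database system. */
cl_kernel clCreateKernel(cl_program program, const char *name, cl_int *err) {
    ShadowProgram *sp = lookup_shadow(program);
    cl_kernel real[MAX_CUS];
    for (int cu = 0; cu < sp->n_cus; cu++)
        real[cu] = vendor_clCreateKernel[cu](sp->real_program[cu], name, err);
    return make_virtual_kernel(real, sp->n_cus);  /* swapped at placement time */
}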
Hardware Visibility: We implement information functions with different information poli-
cies, describing the CU information that should be shown to the database system. In detail,
we have to decide which and how many CUs should be presented to the host system and which
properties these CUs should have. The implemented strategies include:
1. Showing one virtual CU, which could be of a CPU, GPU, or accelerator type. The host
system can submit all the work to this one CU and HERO will place the work on the
underlying CUs according to our placement optimization (a sketch of this policy follows the list).
2. Showing two virtual CUs, a CPU and a GPU, which represent all CPUs and GPUs in the
system, respectively. All work submitted to one of the CUs is executed on the same type
of CU. However, for example, with multiple GPUs, HERO decides on the GPU to use.
This allows the database system to submit kernels, which are optimized for the specific
architectures, while introducing more complexity on the database system side.
3. Showing all CUs to the database system. Here, the database system has full control of the
hardware, while HERO submits the work directly to the CUs. In this case, HERO can be
used to log and visualize all communication between the database system and the CUs.
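For the first policy, the interception can be as simple as the following sketch; hero_virtual_device
is an assumed internal helper, and error handling for malformed arguments is omitted.

#include <CL/cl.h>

extern cl_device_id hero_virtual_device(void);

/* Hypothetical policy-1 interception: whatever the real topology looks like,
 * the database system sees exactly one virtual device. */
cl_int clGetDeviceIDs(cl_platform_id platform, cl_device_type type,
                      cl_uint num_entries, cl_device_id *devices,
                      cl_uint *num_devices) {
    (void)platform; (void)type;               /* policy 1 ignores the type  */
    if (devices && num_entries >= 1)
        devices[0] = hero_virtual_device();   /* one CU stands for them all */
    if (num_devices)
        *num_devices = 1;
    return CL_SUCCESS;
}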
Setup Functions: clGetPlatformIDs, clGetDeviceIDs, clCreateContext, clCreateContextFromType,
clCreateCommandQueue, clCreateProgramWithSource, clBuildProgram, clCreateKernel,
clCreateBuffer, clSetKernelArg, clSetEventCallback
Information Functions: clGetPlatformInfo, clGetDeviceInfo, clGetContextInfo,
clGetKernelWorkGroupInfo, clGetMemObjectInfo, clGetEventInfo, clGetProgramInfo,
clGetProgramBuildInfo, clGetEventProfilingInfo
Asynchronous Functions: clEnqueueMapBuffer, clEnqueueNDRangeKernel, clEnqueueTask,
clEnqueueReadBuffer, clEnqueueWriteBuffer, clEnqueueCopyBuffer, clEnqueueUnmapMemObject
Sync/Blocking Functions: clFlush, clFinish, clWaitForEvents
Release Functions: clRetainEvent, clReleaseMemObject, clReleaseKernel, clReleaseProgram,
clReleaseCommandQueue, clReleaseDevice, clReleaseContext, clReleaseEvent
Table 5.3: OpenCL functions implemented in HERO.
For our evaluation tests, we only use the first approach, as it reduces the complexity for the
database system and it increases the freedom of optimization for HERO, because all CUs can
be chosen for execution.
Visible CU properties: When showing only one CU to the database system, the question is
which hardware properties should be presented (e.g., memory space). If the database system
uses this information for execution decisions then this could influence the query performance.
For example, if we present a memory capacity of 2GB, the DBMS could decide not to use HERO
for memory space reasons, even if one CU has more than 2GB memory and could compute the
operation. We identified two strategies:
1. Presenting the minimal properties of all underlying CUs. In this case a system with CUs
having memory spaces of 24GB, 10GB, and 2GB would present a virtual CU with 2 GB,
making sure that every submitted kernel execution is able to execute on all available CUs.
2. Show the maximal properties, e.g., 24GB in our example. This way, even demanding
computation is possible; however, HERO might be forced to execute a kernel only on the
one CU with 24GB of memory, since this is the only CU able to execute the kernel.
We implement both policies. However, for our evaluation, we only use the first one, making
sure every CU can be used for every kernel. As a consequence, the tested queries need to be
able to execute on all CUs, so the CU with the smallest memory space defines our data
scale factor.
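A sketch of the first property policy, assuming the per-CU capabilities have already been
queried via clGetDeviceInfo:

#include <CL/cl.h>

/* Minimal-property policy: report the minimum of a capability over all
 * underlying CUs, so every kernel accepted by the virtual CU can run on
 * any real CU, e.g., {24GB, 10GB, 2GB} -> 2GB of global memory. */
cl_ulong visible_capability(const cl_ulong per_cu[], int n_cus) {
    cl_ulong min = per_cu[0];
    for (int i = 1; i < n_cus; i++)
        if (per_cu[i] < min)
            min = per_cu[i];
    return min;
}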
Additional Communication Channel: Beyond the OpenCL interface, we want to allow ad-
ditional communication between HERO and the database system. An additional communi-
cation channel is not needed for the pure execution of OpenCL kernels, but it can be used
to signal HERO which kernels belong to a database operator or when a query starts or ends.
To allow this communication, we reuse the OpenCL function clGetPlatformInfo, which usually
provides a memory pointer for the requested platform information. In our driver implementa-
tion, we test this memory region for hidden messages whenever this function is called. These
messages could be data requests, like the real number of CUs in the system, or notifications, like
operator names. The main advantage of this approach is its non-intrusive nature. A database
system sending these requests to a normal OpenCL driver will get a simple error code as an
answer, while systems sending no such requests still work with HERO. Therefore, adding the
support for this communication is no disadvantage for either the database system or HERO.
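From the database system's side, such a hidden request could look as follows; the parameter
value, message tag, and layout are purely illustrative assumptions, as the thesis does not
specify the wire format.

#include <CL/cl.h>
#include <stdio.h>

/* Assumed, non-standard parameter value recognized only by HERO. */
#define HERO_INFO_CHANNEL ((cl_platform_info)0x4242)

/* Hypothetical side channel: the DBMS writes a tagged message into the
 * buffer and calls clGetPlatformInfo. HERO scans the buffer for the tag;
 * a normal OpenCL driver simply returns an error code for the unknown
 * parameter, so the call is harmless either way. */
int signal_operator_start(cl_platform_id platform, const char *op_name) {
    char buf[256];
    snprintf(buf, sizeof(buf), "HERO_MSG:OPERATOR:%s", op_name);
    cl_int err = clGetPlatformInfo(platform, HERO_INFO_CHANNEL,
                                   sizeof(buf), buf, NULL);
    return (err == CL_SUCCESS) ? 0 : -1;
}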
5.4.3 Memory Management
For the memory management, HERO keeps a queue of memory operations for every memory
object (OpenCL buffer). This queue holds all accesses to this memory object including Map,
Read, Write, and Copy operations, as well as all accesses where the memory object is first used
as a kernel argument. The operations are queued in the order of submission to the OpenCL
kernel and are either marked as read-write or as read-only. The queues for each memory object
are used to keep the execution order and dependencies consistent. Kernels with different sets
of memory objects can be executed concurrently and kernels with overlapping sets of memory
objects need to be executed in queuing order. The only exceptions are read-only accesses,
which could be swapped with other read-only accesses. For a write access, all previous read or
write accesses have to be finished.
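The ordering rule can be captured in a few lines of C; the Access structure and its fields are
illustrative stand-ins for HERO's per-object queues.

#include <stdbool.h>
#include <stddef.h>

typedef enum { ACCESS_READ, ACCESS_WRITE } AccessType;

typedef struct Access {
    struct Access *next;     /* submission order within the object's queue */
    AccessType     type;
    bool           finished;
} Access;

/* Two accesses to the same memory object may run concurrently (or be
 * reordered) only if both are read-only; a write must wait for everything
 * queued before it to finish. */
bool may_start(const Access *candidate, const Access *queue_head) {
    for (const Access *a = queue_head; a != NULL && a != candidate; a = a->next) {
        if (a->finished) continue;
        if (a->type == ACCESS_WRITE || candidate->type == ACCESS_WRITE)
            return false;    /* conflicting pair: keep queuing order */
    }
    return true;             /* only finished or read-read pairs ahead of us */
}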
Every memory object can reside on multiple CUs, including none at all. When first created,
a memory object is not allocated as long as no execution is requesting it. When the memory
object is used, the allocation is done on the same CU as the execution. With multiple accesses
to one memory object, it is possible that the object is copied and kept on multiple CUs, as long
as it is not written to. Additionally, memory objects can be evicted from a CU if memory space
is needed. The data of the memory object is then stored in the main memory of the host system
if not already present, e.g., for intermediate results.
To transfer a memory object from one CU to another (even for CUs from different vendors),
we identified three different ways, which are based on different combinations of the following
four OpenCL functions to access a memory object from the host side:
• clEnqueueReadBuffer: This function reads an OpenCL memory object and transfers
the data to a prepared array on the host.
• clEnqueueWriteBuffer: This function writes data from the host to an OpenCL memory
object.
• clEnqueueMapBuffer: This function returns a memory pointer to the data. If the data is
already on the host system or host-accessible, it is not moved. If the memory is not accessible
from the host, it is transferred transparently to the user.
• clEnqueueUnmapMemObject: This function reverses the mapping after clEnqueueMap-
Buffer. If the data is host-accessible or the mapped data was not changed, nothing has to be
done; otherwise, the data is transferred transparently to the user.
These functions only allow accessing or copying data from the CU to the host system (or vice
versa). To support CU-to-CU transfers, we need to combine these functions. We found three
different combinations, which are illustrated in Algorithm 6.
Algorithm 6 Possible CU-to-CU data transfer functions.
1: procedure Transfer 1
2:     src-data ← clEnqueueReadBuffer(source)        ▷ always a transfer
3:     clEnqueueWriteBuffer(target, src-data)        ▷ always a transfer
4: procedure Transfer 2
5:     trg-data ← clEnqueueMapBuffer(target)         ▷ no transfer if target is host-accessible
6:     trg-data ← clEnqueueReadBuffer(source)        ▷ always a transfer
7:     clEnqueueUnmapMemObject(trg-data)             ▷ no transfer if target is host-accessible
8: procedure Transfer 3
9:     src-data ← clEnqueueMapBuffer(source)         ▷ no transfer if source is host-accessible
10:    clEnqueueWriteBuffer(target, src-data)        ▷ always a transfer
11:    clEnqueueUnmapMemObject(src-data)             ▷ never a transfer
• Transfer 1 allocates memory on the host, then transfers data from the source context to
the allocated memory, and writes the data to the target context. This is the naive approach
and always results in two copy operations, independent of the two data locations.
• Transfer 2 maps the target memory object to a host-accessible memory pointer and over-
rides this memory by reading the source data. Afterwards, the target object needs to be
unmapped. This could result in a high overhead if the target object is transferred from
the computing unit to the host and back. However, if no transfer is needed, e.g., if the
target CU is the host CPU, then this approach can be advantageous. It results in either
one copy operation if the target is host-accessible, or three copy operations if not.
• Transfer 3 maps the source memory object to a host-accessible memory pointer and
writes its contents to the target object. Unmapping the source pointer does not result
in a transfer, since the source data has not been changed. This approach results in either
one or two copy operations.
We found that transfers to a CPU prefer Transfer 2, and transfers from the CPU prefer Trans-
fer 3. However, it is not clear which function is best for transfers between two CUs, neither of
which is the CPU. It is impossible to decide statically which function to use in all cases or
whether one function is better than the others. Therefore, HERO implements all functions and
finds the preferred one for each source-to-target transfer by benchmarking the system’s
preferences at ramp-up time.
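As a concrete sketch, Transfer 1 can be written with the plain OpenCL 1.1 API roughly as
follows; the command-queue handles and the malloc-based staging buffer are assumptions of
this sketch, and error handling is kept minimal.

#include <CL/cl.h>
#include <stdlib.h>

/* Transfer 1: naive CU-to-CU copy via a staging buffer on the host.
 * src_q/dst_q are command queues on the source and target CUs; the two
 * buffers live in different contexts, so a direct clEnqueueCopyBuffer
 * is not possible. */
cl_int transfer1(cl_command_queue src_q, cl_mem src,
                 cl_command_queue dst_q, cl_mem dst, size_t bytes) {
    void *staging = malloc(bytes);
    if (!staging) return CL_OUT_OF_HOST_MEMORY;

    /* Always a transfer: device -> host. */
    cl_int err = clEnqueueReadBuffer(src_q, src, CL_TRUE, 0, bytes,
                                     staging, 0, NULL, NULL);
    if (err == CL_SUCCESS)
        /* Always a transfer: host -> device. */
        err = clEnqueueWriteBuffer(dst_q, dst, CL_TRUE, 0, bytes,
                                   staging, 0, NULL, NULL);
    free(staging);
    return err;
}

Transfers 2 and 3 differ only in that one side of this copy is replaced by a
clEnqueueMapBuffer/clEnqueueUnmapMemObject pair, which avoids a copy when the mapped
object is host-accessible.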
5.4.4 Kernel Handling
We integrate two different optimizations to improve our handling of OpenCL kernels: (1) ex-
traction of read and write accesses for kernel arguments (memory objects) and (2) a kernel
library with the possibility for highly optimized kernel code for multiple CUs.
!"#$"%&'%( !"#$"%&%%(
'%)$*(+,('%(+-(+"./0+%%1.(!"#$"%&'%(
223"#$"%((14/5((54-4."06/$*7(((22*%48)%(/$0(9(8:;"#<=((
( ( ( ( (22*%48)%(/$0(9(8:;"#>=((
( ( ( ( (22*%48)%(/$0(9(8:;"#?@(
A(
((((BB(*"0(06#")5(CD(
((((/$0(06#")5CD(E(*"02*%48)%2/57F@G(
(((((
((((BB()55(8:;"#(<H>(
((((8:;"#?I06#")5CDJ(E(8:;"#<I06#")5CDJ(H(8:;"#>I06#")5CDJG(
(((((
((((BB(K4."(#)#"('4$5/L4$(
((((/M(706#")5CD(EE(<@(((8:;"#>I06#")5CDJ(E(8:;"#?I06#")5CDJG(
N(
G(O:$'L4$(PQ#KR($4:$S/$5(KKT(:S0)8%"(
5"U$"(14/5(V54-4."06/$*7(
( ( (/?>9($4')T0:#"(#")54$%W(X8:;"#<=((
( ( (/?>9($4')T0:#"(X8:;"#>=((
( ( (/?>9($4')T0:#"(X8:;"#?@(YF(A(
((X<(E(0)/%(')%%(/?>(7/?>=(&&&@9(8/0')K0(7/?>(7&&&@9((
(( ( (V*"02*%48)%2/5(04(/?>(7/?>=(&&&@9@7/?>(F@(Y>(
((X>(E(K",0(/?>(X<(04(/Z[(
((X?(E(*"0"%"."$0T0#(/$84:$5K(/?>9(X8:;"#<=(/Z[(X>(
((X[(E(%4)5(/?>9(X?=()%/*$([=(\08))(\>(
((X](E(*"0"%"."$0T0#(/$84:$5K(/?>9(X8:;"#>=(/Z[(X>(
((XZ(E(%4)5(/?>9(X]=()%/*$([=(\08))(\>(
((X^(E()55($KS(/?>(XZ=(X[(
((_(
Figure 5.8: Transformation of OpenCL kernel code to LLVM IR. We can extract readonly if a
memory object can not be written to.
Extracting read-write information: For our approach of keeping copies of data objects on
multiple CUs, we need to know if OpenCL kernels read or possibly write memory objects. This
information can be provided by the programmer. However, this would introduce the need for
manual input in order to support HERO, which conflicts with one of our design goals: non-
intrusiveness. Therefore, we implement an automatic detection method based on the fact that
the kernel code is passed from the database system through HERO to the different CUs. With
the help of the clang compiler (version 3.4 or higher), we can compile the OpenCL kernel code
to the LLVM Intermediate Representation (IR) and extract the needed information. Figure 5.8
shows the transformation for a simple example. There, only buffer1 is marked as read-only be-
cause buffer3 is always written and buffer2 may be written. The same approach
also works for more complex executions including inner kernel function calls. With the ex-
tracted read-write information, we can mark the kernel’s memory accesses and support our
data copy approach.
OptimizedKernel Variants: To achieve good performance, OpenCL kernels sometimes need
to be optimized for a specific CU. The most common examples are memory access patterns,
which (in many cases) should be different for CPUs and GPUs (see Section 3.3). To support
such optimizations, we allow different OpenCL kernel variants. The database system can pro-
vide different sub-operator implementations in the OpenCL kernel code. Each version must
specify its preferred CU as a suffix like selection_CPU or selection_GPU or even the exact name
of the CU (selection_GT640). The database system can then request to execute a default sub-
operator (e.g., selection) and HERO exchanges that sub-operator after the placement decision
with the CU-specific implementation, if one exists. This allows small optimizations for single
CUs or types of CUs as long as the optimized versions share the same interface. To provide the
different kernel variants, the core database system does not need to be altered, e.g., it can still
use the selection kernel. The only adjustment is in the kernel pool, where selection_CPU or
selection_GPU can be added with no further references needed. HERO will automatically look
for these variants and fall back to the default kernel if no variants exist.
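The variant lookup itself can be a simple name-mangling scheme; the following sketch assumes
a string-keyed kernel pool, where find_kernel, cu_name, and cu_type are illustrative helpers.

#include <stdio.h>
#include <CL/cl.h>

extern cl_kernel   find_kernel(const char *name);  /* NULL if not in the pool */
extern const char *cu_name(int cu);                /* e.g., "GT640" */
extern const char *cu_type(int cu);                /* "CPU" or "GPU" */

/* Resolve the kernel variant for a placed sub-operator: most specific name
 * first (selection_GT640), then type (selection_GPU), then the default. */
cl_kernel resolve_variant(const char *base, int cu) {
    char name[256];
    snprintf(name, sizeof(name), "%s_%s", base, cu_name(cu));
    cl_kernel k = find_kernel(name);
    if (k) return k;
    snprintf(name, sizeof(name), "%s_%s", base, cu_type(cu));
    k = find_kernel(name);
    return k ? k : find_kernel(base);   /* fall back to the default kernel */
}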
[Figure 5.9: Common scheduling problems in Ocelot and gpuDB: (a) in Ocelot, a kernel reads
intermediate results but its output is never used before the memory objects are released; (b) in
gpuDB, memory objects are created, written, and released without any use in the query.]
5.4.5 Query Execution
Finally, we want to execute queries using HERO. For that, the database system schedules ker-
nels and memory operations to HERO using the OpenCL interface. To define the execution
islands, HERO queues all kernels it receives, while ignoring all synchronization functions (Fig-
ure 5.3). These functions are usually used to allow external time measurement and to guaran-
tee the execution order. However, with HERO, time measurements are not encouraged, as the
executing CU can change transparently for the host system and the execution order is guaran-
teed through our memory object queues. Therefore, HERO ignores any synchronization until
a read or map request is issued. At this point, we know that the host system needs to evalu-
ate the result in order to queue more work. These are exactly the estimation breakers of our
adaptive placement optimization. This approach has multiple advantages: (1) HERO queues
as many kernels as possible allowing the regional placement optimization on multiple kernels
and (2) we only need to execute the kernels needed to compute the result that was requested
by the database system. The second point is important, as we found many systems, including
Ocelot and gpuDB, where OpenCL kernels were queued to the OpenCL driver but were not
needed to calculate the result in the end. Two examples are given in Figure 5.9. There, the
kernels read intermediate results of the query, but their output is not used before the memory
objects are deleted. In Ocelot, this happens commonly for the grouping operator, where ad-
ditional computation (generateExtentsTable) is not used by the host system. Since this kernel
only reads data, it does not need to be executed for the rest of the query to continue. When the
release of the memory object is submitted to the driver without any read operation, the create
and release commands and the whole kernel execution can be deleted from the work and the
memory object queues. For gpuDB, redundant work occurs mostly in the form of create, write,
and release of a memory object, without any use in a query (Figure 5.9b). In HERO, we use our
dependency analysis to identify these kernels and memory operations and stall them until the
memory objects are deleted, signaling that the non-computed results are not needed anymore.
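A sketch of this lazy elimination, under the assumption of per-object access queues as described
in Section 5.4.3 (all names are illustrative):

#include <stdbool.h>

typedef struct Kernel Kernel;
typedef struct {
    Kernel **readers;  int n_readers;   /* queued kernels touching the object */
    bool host_read_requested;           /* was a read/map ever issued?        */
} MemObject;

extern void drop_from_queues(Kernel *k);

/* Called when the DBMS releases a memory object: if its content was never
 * read by the host, every stalled kernel that only produced or read this
 * object can be discarded unexecuted. */
void on_release(MemObject *obj) {
    if (obj->host_read_requested) return;      /* result was actually needed */
    for (int i = 0; i < obj->n_readers; i++)
        drop_from_queues(obj->readers[i]);     /* remove the stalled work */
}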
5.4.6 Summary
With our HERO driver, we provide a platform for a broad evaluation of our placement approach
for any OpenCL-based database system. We implement our approach using the standardized
OpenCL interface, hiding the heterogeneous hardware environment and the complexity of op-
timization from the database system. We developed multiple data transfer options and let the
system find the best option for each CU combination. Kernels can have multiple variants,
which are optimized for different CU architectures. Memory access types for kernel inputs are
extracted automatically using LLVM IR. Additional to the placement and execution of submit-
ted kernels, HERO identifies unused kernels and omits their computation until they are either
used or can be avoided entirely.
Related Work There are multiple approaches, which also implement an OpenCL driver or
CUDA driver in order to control and optimize the execution. VCL [Barak and Shiloh 2014]
extends a host system with virtual OpenCL CUs from multiple back-end nodes in a cluster. It is
implemented as OpenCL driver, where transfers to any CU can result in network transfers de-
pending on the CUs location. VCL exposes all CUs to the application, not optimizing towards
abstraction or heterogeneous placement. Since our approach and VCL are based on the OpenCL
interface, it is possible to couple both approaches, where VCL provides the infrastructure to
use a whole cluster of CUs, while HERO assigns the heterogeneous placement hidden from the
application. HERO would learn the execution and transfer times, even when transfers result
in network communication. Helium [Lutz et al. 2015] implements an OpenCL driver to queue
and potentially rewrite OpenCL kernels. There, the application submits kernels to the driver,
while Helium finds reoccurring access patterns underneath and merges kernels at run-time to
avoid multiple memory accesses. VirtCL [You et al. 2015] also implements an OpenCL driver
and assigns a heterogeneous placement, however, it decides the placement locally by schedul-
ing the next kernel on an available CU, where it might finish first. For database applications,
where results of operators are reused by other operators, pure scheduling without further opti-
mization is not beneficial. Wang et al. [Wang et al. 2014] propose a CUDA driver to schedule
and load-balance multiple database operators of multiple queries. However, the CUDA ap-
proach is limited to NVIDIA GPUs, ignoring other CUs and the challenges of heterogeneous
environments.
5.5 EVALUATION
Our evaluation is based on our OpenCL driver implementation with the OpenCL database
systems gpuDB1 and Ocelot2. First, we revisit our query optimization examples from Sec-
tion 4.4 and use the collected data to evaluate our adaptive placement optimizations with
micro-benchmarks. Second, we use our HERO driver to execute real queries using gpuDB
and Ocelot. We use gpuDB for a large part of the evaluation because each query is generated
into its own binary, making it possible to observe all effects of the placement optimization.
Ocelot on the other hand is able to run multiple queries with one instance of the database,
where preceding queries could influence the placement of succeeding queries. Therefore, we
only use Ocelot to show portability of our HERO driver in the end.
For all experiments, we use the same hardware setup as presented in Section 4.4, consisting
of the AMD CPU, AMD iGPU, Nvidia K20, and Nvidia GT640 (see Table 2.3). We use scale
factor 10 for the SSB queries and scale factor 5 for the TPC-H queries.
5.5.1 Micro-Benchmarks
To show the impact of our adaptive approach, we revisit the example given in Section 4.4 based
on SSB queries. We refine our collected measurements from operators to sub-operators and
also collect the positions of estimation breakers. In the following, we present the evaluation of
our approaches. Query statistics are shown in Table 5.4.
Query 1_1 1_2 1_3 2_1 2_2 2_3 3_1 3_2 3_3 3_4 4_1 4_2 4_3
#Operator 9 9 9 18 18 18 20 20 20 20 23 24 24
#Sub-Operator 38 39 40 97 100 91 111 109 111 108 135 144 140
#Execution Islands 5 5 5 12 12 12 13 13 13 13 15 15 15
Table 5.4: SSB query statistics.
Fine-grained optimization
We claim that sub-operator-based estimation improves the estimation quality and that placing
sub-operators instead of operators can lead to performance improvements. For the evalua-
tion, we insert all collected execution data over all queries into our runtime estimator. Then,
we estimate the (single-CU) runtime based on operators and sub-operators and compare the
estimation to the real execution time. Operator-based estimation shows a 4.73ms mean ab-
solute error (MAE) over all operators in all queries, while sub-operator estimation has only
0.7ms MAE. Additionally, the maximum errors differ largely: 58ms for operators and 1.3ms
for sub-operators. The error difference is mainly caused by the same operator containing different
numbers of sub-operators; for example, the group-by operator contains 4 to 9 sub-operators
depending on the use-case.

1gpuDB (commit 609): https://code.google.com/archive/p/gpudb/
2Ocelot (version 1cda9db): https://bitbucket.org/msaecker/monetdb-opencl

[Figure 5.10: Different placement optimizations on SSB queries: speedup relative to local
placement for the local, global, global with 0.1x cardinality error, global with 10x cardinality
error, and regional strategies.]

To evaluate different placements, we look at single operators and
compare the best single-CU execution to the best heterogeneous placement of the correspond-
ing sub-operators. To determine the heterogeneous placements, we do a full search of all pos-
sible placements for these sub-operators. We found that in most cases the runtime is equal,
i.e., sub-operators also choose a single-CU placement, while in some few cases we achieved
speedups of up to 1.47x with the heterogeneous placement.
Regional optimization using execution islands
To evaluate the regional optimization, we use the runtime information of sub-operators to
calculate the placement for the local, global, and regional approaches. As the global approach is
not realistic considering varying cardinality estimations, we evaluate global optimization with
certain errors in the cardinality. We multiply the real intermediate cardinalities either with 0.1x
or with 10x to test the robustness of the optimization and estimation. Figure 5.10 shows the
results. We can see that global optimization is always better than local optimization. Compared
to our previous tests in Section 4.4, we see worse performance for local optimization due to
using sub-operators, which introduce more placement objects and more harmful transfers than
operator-based optimization. Global optimization with cardinality errors is slower than the
original global optimization. This is because of the wrong runtime and transfer estimations.
While global 0.1x still finds good placements, global 10x mostly shows bad performance. There,
data is thought to be 10x larger, leading to CPU-heavy computation, because data does not
need to be transferred for CPU execution. Local and regional optimizations are not affected by
cardinality estimation errors, since cardinalities are known precisely, and regional optimization
shows a performance similar to global optimization without errors.
Intermediate results with multiple locations
To evaluate the impact of allowing data copies, we use the SSB queries and test 1M random
placements per query with and without allowing data copies. In the worst case, we see no
speedup because of single-CU queries or unfortunate placements, where by chance none of the
copies could be exploited. In the best case, we achieve a speedup of 1.67x for highly heteroge-
neous queries. The performance gain of keeping copies depends on a query’s transfer costs, as
transfers are the only part that could be reduced. Single-CU execution has only transfers from
the host, which cannot benefit from data copies. Queries where the execution is significantly
larger than the transfer costs also do not benefit much from data copies. As performance is not
reduced in any case, we should allow copies whenever a query is heavily dominated by transfers
that could be avoided.

[Figure 5.11: Relative overhead per query on a single CU, computed as
(total_HERO - total_noHERO) / total_noHERO, shown for the iGPU, CPU, K20, and GT640. For
better illustration, the points are grouped by vertical lines for similar queries (SSB groups 1-4).
The overall average overhead is 0.5%.]
5.5.2 Overheads
To evaluate the overhead of HERO, we compare the SSB runtimes of gpuDB with HERO exe-
cution and non-HERO execution. Since the original version of gpuDB supports only single-CU
execution, we examine this case. Figure 5.11 shows the overhead in percentage for all combina-
tions of SSB queries and CUs. For some combinations the overhead is negative, meaning that
the single-CU execution is faster when using HERO, while others experience an overhead of
up to 15% of the query runtime. There are two reasons for these variations:
Queuing: HERO queues submitted sub-operators, while, without HERO, OpenCL would
start working on the sub-operators immediately when being submitted. The advantage of queu-
ing and executing sub-operators together is an aligned heterogeneous placement that is usually
worth more than the small delay for queuing. However, for single-CU execution, the placement
is fixed to a single CU and can not be improved by optimizing sub-operators together, making
the delay visible as overhead. This delay depends on the query topology and on how eager
the CUs start to work after submitting. For example, in Figure 5.11, small queries show this
overhead in particular (1_1, 1_2, 1_3), while iGPU and GT640 suffer most from the queuing,
showing that they usually execute work more eagerly than, e.g., the K20. We suspect that the
K20 queues work internally as well, only starting to execute once a synchronization function is
called; hence, it does not suffer as much from the queuing in HERO.
[Figure 5.12: Performance of gpuDB with HERO when adding CUs for heterogeneous query
execution: speedup per SSB query for the CPU, +K20, +GT640, and +iGPU settings, together
with the resulting placement distribution (%) across CPU, K20, GT640, and iGPU.]
Controlled Execution: When HERO is executing sub-operators, it is tightly controlling the
execution on the CU. This means that when a sub-operator is given to a CU, HERO is forcing
the CU to execute it immediately without delay to ensure the correct time measurements for
our runtime estimation. This can have two effects: (1) an already eagerly executing CU might
experience additional overheads through the active waiting for the result, leading to slower
performance than without HERO (e.g., iGPU), (2) other CUs, which are lazier in execution
might experience a speedup through our approach (e.g., K20).
For heterogeneous execution in general, the overheads are not visible since the query exe-
cution benefits from the heterogeneity, making the query faster. Here, the delayed execution
is actually beneficial to find a good placement.
We also evaluated the scaling and overhead of our placement optimization sequence. We
enable between one and four CUs to be considered for a heterogeneous placement, while the
final placement is forced to be single-CU execution. When enabling two, three, or four CUs in
the optimization process, we see varying overheads from -0.2% to 0.4% of the query’s runtime,
with no recognizable pattern. We conclude that this overhead does not have any significant
impact on performance.
5.5.3 Performance and Placement Quality
For performance evaluation, we executed all SSB queries using HERO first on the CPU, be-
fore adding the K20, the GT640, and finally the iGPU. This way, we add more heterogeneity
and allow HERO to distribute sub-operators to more CUs to improve the runtime. Figure 5.12
shows the performance results and the resulting placement distribution. As expected from the
performance benchmarks in Table 2.3, the CPU is slow in execution. Adding the K20 improves
the result of all 13 queries significantly by placing most sub-operators on the GPU (speedup:
avg 16.4x, max 24.9x).

[Figure 5.13: Adaptivity evaluation by changing the intermediate cardinalities of SSB query 3_4:
runtime in seconds over the intermediate result size multiplier (1x to 30x) for a constant
placement and our adaptive placement; pie charts (A to E) show the placement distribution
across the iGPU, CPU, K20, and GT640 at each measured point.]

Adding the GT640 further improves the results for 11 queries, since
the GT640 has a 10x faster connection to the host memory compared to the K20 (speedup to
previous setting: avg 1.7x, max 2.3x). Finally, adding the iGPU accelerates 10 queries addi-
tionally (speedup to previous setting: avg 1.5x, max 2.9x). With the placement distribution,
we see that some queries choose mainly single-CU placements, only differing in the chosen
CU (1_1, 1_2, 1_3, 4_2, 4_3), while other queries choose more heterogeneous placements (e.g.,
2_1 and 2_2). Even for single-CU execution, an accurate placement optimization is needed to
determine the best CU, as pure benchmark numbers are not descriptive enough. For example,
the K20 is the most powerful CU in our system. However, if another GPU is available, the K20
is nearly never used because of the low transfer bandwidth to the system. In summary, adding
placement optimization to the database system can achieve high speedups for the execution,
for example, 50x for Query 1_2. Nevertheless, we do not want to promote such high speedups
for every system; rather, we want to show that our approach of heterogeneous optimization is
working well. Using a more powerful CPU, e.g., a modern Intel Xeon, would lead to smaller
speedups, however, it would not change the efficiency of our approach.
Limitation: In four out of the 13 queries, we see a small performance reduction when
adding computing units (Queries 1_1, 1_2, 1_3, and 2_1). There, we found that the overall
placements are not ideal, while the regional placements for the sub-operators of each island
are good. The reduction is caused by the adaptive placement approach using separate execution
islands. The decisions for later islands depend on the decisions made for earlier islands, because
they define the location of intermediate data. This is the same effect as the one presented in
Section 5.3.2. For Query 1_1, the iGPU is mostly used throughout the query. However, most
sub-operators of the first island run better on the GT640, so they are placed on this GPU. This
introduces either being bound to the GT640 if transfers are too expensive, or transferring the
data to the iGPU, which would not be necessary had the first island been executed on
the iGPU in the first place. A global optimization could most likely avoid these problems, but
would come with other drawbacks as stated earlier.

[Figure 5.14: Performance of Ocelot with HERO when adding CUs for heterogeneous query
execution: speedup per TPC-H query (3, 4, 5, 6, 10, 11, 12, 15, 18) for the CPU, +K20, +GT640,
and +iGPU settings; the labels 14.8 | 13.4 | 13.4 annotate bars exceeding the plotted range.]
5.5.4 Adaptivity of Heterogeneous Placement
Besides the pure performance, we want to show the real benefits of our adaptive placement ap-
proach for heterogeneous execution using HERO. As an example, we take SSB Query 3_4, which
accesses four tables and includes three selections, three joins, a group-by, and an order-by. The
selections are highly selective and the joins produce only small results, before the grouping
reduces the result to two tuples. We now update the base data to produce more tuples in the se-
lections, resulting in larger join results, which, at the end, are reduced by the grouping again to
two tuples. Through this method, only the intermediate cardinalities change, while the base ta-
ble size and the final result size are constant. We compare the execution performance of a fixed
placement, as a global optimizer would choose, and our adaptive approach, adjusting the place-
ment according to the collected cardinalities. Figure 5.13 shows the actual runtime results and
the placement distributions. The constant placement uses the cardinality information of the 1x
scale. The performance of the constant and adaptive placement is similar for small changes in
intermediate cardinalities, while larger changes show a significant difference. The pie charts
show the placement distribution for each testing point. While the placements are at first similar
to the constant placement, i.e., the majority of the sub-operators is on the iGPU, the placement
changes towards more sub-operators on the K20 and the GT640 when the intermediate car-
dinalities grow. This adaptation is not possible with global optimization and clearly shows the
benefit of the adaptive approach. We note that the error in cardinality estimation of up to 30x
in this experiment is small compared to the 1000x or more observed by Leis et al. [Leis et al.
2015].
5.5.5 Portability
Our implementation as OpenCL driver enables a broad evaluation. Thus, we evaluate our ap-
proach with Ocelot as a second DBMS as well. For our performance evaluation, we use 9
different TPC-H queries with a scale factor of 5. Figure 5.14 shows the results. Compared to
gpuDB, the CPU performs better, which is caused by mostly having two kernel variants with
different memory-access patterns for CPUs and GPUs. Therefore, the kernels are more opti-
mized for the used CU, as HERO switches these variants depending on the placement. For all
queries, adding the K20 improves the result significantly (speedup avg: 4.0x, max: 14.8x), ex-
cept for Query 5, where the CPU already has a performance comparable to the heterogeneous
execution. Adding the GT640 improves Queries 3, 10, and 15, while the others show that the
combination of CPU and K20 is already ideal for the given workload and data sizes. Adding
the iGPU improves only the result of Query 3. We can see in these results that HERO either
improves the runtime with more heterogeneity or maintains the performance, even if additional
CUs cannot be used beneficially for the given query. Only Query 18 suffers from the explained
limitation and the missing global view.
5.6 CONCLUSION
In this chapter, we proposed our adaptive placement approach as improvement over the general
approach presented in Chapter 4. We introduced a way to become completely independent of
cardinality estimation of intermediate results, one of the main sources of inaccuracy in runtime
estimation and transfer time estimation. Additionally, we proposed fine-grained optimization
on sub-operators instead of operators, as well as allowing data copies to avoid costly transfers,
providing more freedom to the optimizer. We incorporated our adaptive ap-
proach in an adaptive placement sequence consisting of five steps. We proposed HERO as
implementation approach, which optimizes and places queries transparently to the database
system.
In the evaluation, we showed the efficiency of our adaptive placement using micro-benchmarks
and full query execution. We demonstrated that the HERO implementation can have
a small overhead for single-CU execution, depending on the query topology and the used CUs.
However, for heterogeneous execution, this overhead is not visible, since the query execution
benefits from the heterogeneity, making the query faster. The performance tests with gpuDB and
Ocelot showed the benefits of our approach with speedups of up to 50x. Additionally, we have
demonstrated the benefits of adjusting the placement according to changing intermediate car-
dinalities, resulting in reliably good performance.
6 CONCLUSION AND FUTURE WORK
6.1 Summary
6.2 Future Work
Concluding this thesis, we first summarize our findings and contributions presented in the
previous sections, before discussing possible directions of future work based on our approaches.
6.1 SUMMARY
With computing hardware changing towards heterogeneous systems, database query process-
ing needs to adapt in order to efficiently utilize the new environment. We first investigated
three different approaches of heterogeneous execution for database query processing. As a re-
sult, we found that one approach, dynamic placement, shows good performance while being
highly extensible to different computing units and different operator implementations. Therefore,
we chose this approach for a more detailed investigation. For this, we proposed general
techniques of placement optimization, including runtime and transfer cost estimation, local
optimization, and global optimization. To improve placement optimization further, we pro-
posed an adaptive placement approach including a novel database integration technique.
Approaches to Utilize Heterogeneous Environments
Intra-Operator Parallelism: As a first approach, we investigated intra-operator parallelism
using a fork-join model of dividing the input data, executing a database operator in parallel on
multiple computing units, and merging the results in the end (a small code sketch of this pattern
follows below). As a result of our experiments, we found multiple limitations of this approach:
1. General Effects: Resource contention on CPU cores and underutilization with small data
sizes introduce performance slowdowns that need to be compensated by the actual benefits
of parallel processing. If these slowdowns cannot be compensated, this approach cannot
be used beneficially.
2. Result Processing: After an operator's execution, the result needs to be merged in order
to proceed with the remainder of the query. This overhead varies with the database
operator and can grow to the extent that the whole execution is dominated by merging.
3. Heterogeneity of CUs: In heterogeneous environments, it is likely that a particular CU
is significantly faster for a specific operator than the other CUs. Therefore, the partitioning
has to provide this CU with almost all input data. At some point, it is impractical to
partition and merge if most data is computed on one CU anyway. In that case, it is better to
execute the operator atomically on a single CU and avoid further overheads like merging
or synchronization.
As a result, we came to the conclusion that there are few cases where intra-operator
parallelism on heterogeneous CUs can be applied beneficially. Therefore, we did not follow
this approach further and instead decided to execute each database operator atomically on one
CU. While the execution on the CU can be parallel, the parallelism does not extend beyond one
CU, avoiding the mentioned overheads.
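To make the fork-join pattern concrete, the following sketch, illustrative only and simulating the two CUs with host threads rather than OpenCL devices, partitions the input of a selection, filters both partitions in parallel, and merges the partial results; the final merge is exactly the overhead discussed in the second limitation above:

    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // A minimal fork-join sketch (illustrative only): a selection operator
    // is executed on two workers in parallel, each on one data partition,
    // and the partial results are merged at the end.
    std::vector<int> forkJoinSelect(const std::vector<int>& input, int threshold) {
        std::size_t mid = input.size() / 2;  // static partitioning into two halves
        std::vector<int> left, right;
        auto filter = [threshold](const int* data, std::size_t n, std::vector<int>& out) {
            for (std::size_t i = 0; i < n; ++i)
                if (data[i] < threshold) out.push_back(data[i]);
        };
        std::thread t1(filter, input.data(), mid, std::ref(left));
        std::thread t2(filter, input.data() + mid, input.size() - mid, std::ref(right));
        t1.join(); t2.join();                                 // synchronization
        left.insert(left.end(), right.begin(), right.end());  // merging overhead
        return left;
    }

With CUs of very different speeds, the two static halves would have to become highly uneven partitions, which is exactly where the approach stops paying off.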
Static Placement: When executing an operator atomically, we can add a CU to the system
and use it for a specific database operator, basically making a static decision about which operator
is executed on which CU. We evaluated this approach with a group-by operator and an Nvidia
GPU. As a result, we found that static execution was not that simple. We observed various
unforeseen hardware and software effects, including hash contention, atomic contention,
and data-cache and TLB misses, making the execution inefficient in certain scenarios. As a
solution, we profiled the operator and the CU in great detail and derived multiple configurations
that had to be adjusted for different scenarios, resulting in optimized execution and high
performance. However, these configurations are highly dependent on the CU's hardware
architecture and the operator's implementation, and for each additional operator or hardware
platform, the high effort of profiling and deriving configurations has to be repeated.
As this is too time-consuming and not easily extensible, we did not follow this approach further.
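To illustrate the kind of scenario-dependent configuration switching this entailed, consider the following sketch; the configuration names and thresholds are placeholders, not the values derived in Chapter 3:

    #include <cstddef>

    // Illustrative configuration switch for a statically placed group-by:
    // one of several profiled configurations is chosen per scenario. The
    // configuration names and thresholds are placeholders only.
    enum class GroupByConfig { LocalHashTables, GlobalHashTable, TlbConsciousScans };

    GroupByConfig chooseConfig(std::size_t expectedGroups,
                               std::size_t hashTableBytes,
                               std::size_t tlbCoveredBytes) {
        if (expectedGroups <= 1024)            // small enough for on-chip tables
            return GroupByConfig::LocalHashTables;
        if (hashTableBytes > tlbCoveredBytes)  // random access would thrash the TLB
            return GroupByConfig::TlbConsciousScans;
        return GroupByConfig::GlobalHashTable;
    }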
Dynamic Placement: As a third approach, we looked at dynamic execution decisions. There,
we assume a system with multiple CUs, where a dynamic decision is made about the location
of an operator's execution (operator placement). We execute operators atomically
on one CU, and it is possible to optimize an operator's implementation further (as in
the previous approach). The idea of this approach is that an operator is only placed on a specific
CU if the execution on this CU is beneficial for the query runtime. This decision depends
on the CU, the operator implementations, and possibly needed data transfers. While the decisions
are strongly dependent on the hardware and the operator, the actual optimization can work
with black-box approaches for the hardware and the operator implementation, making it possible
to support all hardware-operator combinations for which an implementation exists. This makes
the approach adaptive and highly extensible, ideal for highly heterogeneous environments.
General Placement Optimization
We chose the dynamic placement of operator executions, as it is the most extensible approach
for highly heterogeneous computing environments.
Runtime Estimation: To make a decision before the actual execution, runtime estimation is
needed for each operator occurrence within a query. The runtime depends on the operator, the
chosen CU, and the used data. To allow runtime estimation as a black-box approach, we monitor
the execution during query runtime, building a learning-based model for each operator on
each CU. In addition to operator runtime estimations, transfer cost estimations from every
CU to every other CU are provided through benchmarking varying data sizes at ramp-up time.
The transfer times do not depend on the database query, operator, or data distribution;
therefore, we do not need to monitor and learn them at run-time.
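As a minimal sketch of what such a learned model could look like for one operator-CU pair (assuming simple linear interpolation between observed executions, indexed by data size; the model actually used in this thesis may differ), consider:

    #include <iterator>
    #include <map>

    // A minimal sketch of a learning-based runtime model for one
    // (operator, CU) pair: observed executions are stored as data-size/
    // runtime points, and unseen sizes are estimated by interpolating
    // between the two nearest observations.
    class RuntimeModel {
        std::map<double, double> samples;  // data size in bytes -> runtime in ms
    public:
        void observe(double size, double runtime) { samples[size] = runtime; }
        double estimate(double size) const {
            if (samples.empty()) return 0.0;                    // nothing learned yet
            auto hi = samples.lower_bound(size);
            if (hi == samples.end()) return std::prev(hi)->second;
            if (hi == samples.begin()) return hi->second;
            auto lo = std::prev(hi);
            double t = (size - lo->first) / (hi->first - lo->first);
            return lo->second + t * (hi->second - lo->second);  // interpolation
        }
    };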
Local Placement Strategy: Having the runtime estimations for all query operators on all
available CUs is only the first step of optimization. The local optimization approach now exe-
cutes each operator on the CU where it runs best, while taking input data transfers into consid-
eration. However, this can introduce many harmful transfers, as further usage of the operator’s
data is not considered.
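A condensed sketch of this local decision, with hypothetical data structures, could look as follows; note that it only accounts for the transfers of the operator's direct inputs:

    #include <cstddef>
    #include <limits>
    #include <vector>

    // Hypothetical local placement: per operator, pick the CU minimizing
    // estimated execution time plus the transfer of the operator's inputs.
    // estRun[cu] is the estimated runtime on CU cu; transferCost[from][to]
    // is the estimated cost of moving the inputs between CUs.
    std::size_t localPlacement(const std::vector<double>& estRun,
                               const std::vector<std::vector<double>>& transferCost,
                               std::size_t inputLocation) {
        std::size_t best = 0;
        double bestCost = std::numeric_limits<double>::max();
        for (std::size_t cu = 0; cu < estRun.size(); ++cu) {
            double cost = estRun[cu] + transferCost[inputLocation][cu];
            if (cost < bestCost) { bestCost = cost; best = cu; }
        }
        return best;  // later consumers of the operator's result are ignored
    }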
Global Placement Strategy: We also proposed global optimization at compile-time, which
considers all operators and their interactions within the query in order to find the best trade-off
between ideal execution and time-consuming transfers. The main challenge of this optimization
is the large search space when considering all possible placements. Therefore, instead of
evaluating all placements, we proposed a greedy algorithm that tries to improve the overall
runtime through local changes to a given placement. Since the outcome of this algorithm heavily
depends on the starting placement, we execute it multiple times with different
starting placements, such as single-CU placements, random placements, or the locally optimized
placement.
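A minimal sketch of such a greedy improvement loop, assuming a cost function that estimates the total query runtime of a complete placement including transfers (all names hypothetical):

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Hypothetical greedy global optimization: starting from a given
    // placement (one CU index per operator), repeatedly apply the single
    // operator move that improves the estimated total query runtime,
    // until no move helps anymore. 'cost' estimates a complete placement,
    // including the transfers it implies.
    using Placement = std::vector<std::size_t>;

    Placement greedyImprove(Placement p, std::size_t numCUs,
                            const std::function<double(const Placement&)>& cost) {
        bool improved = true;
        while (improved) {
            improved = false;
            double current = cost(p);
            for (std::size_t op = 0; op < p.size(); ++op) {
                std::size_t bestCU = p[op];
                for (std::size_t cu = 0; cu < numCUs; ++cu) {
                    p[op] = cu;
                    double c = cost(p);
                    if (c < current) { current = c; bestCU = cu; improved = true; }
                }
                p[op] = bestCU;  // keep the best CU found for this operator
            }
        }
        return p;  // run from several starting placements, keep the cheapest result
    }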
Adaptive Placement Optimization
For the general placement optimization, we found a strong connection between placement
quality and the quality of cardinality estimations. Cardinality information for intermediate
results is needed to estimate operator runtimes and transfer costs at compile-time. Any error
in the cardinalities propagates to the general placement optimization and influences the
placement decisions.
Adaptive Placement Strategy: To overcome the limitations of general placement optimiza-
tion, we proposed adaptive placement optimization. The adaptive optimization uses the execution
patterns of highly parallel operator execution to find groups of multiple operators (so-called
execution islands), within which the cardinalities can be calculated exactly. We allow
placement optimization only on one execution island at a time, making sure that we only work
with precisely known cardinalities, producing the best possible placement quality. In addition
to execution islands, we proposed to refine the optimization and placement granularity from
operators to sub-operators and to allow data copies to reside in multiple locations in order to
reduce unnecessary data transfers.
Adaptive Placement Sequence: We incorporated our general approach and the proposed
adaptive techniques into an adaptive placement sequence. The sequence uses three pre-processing
steps at compile-time to refine the query, to apply an advanced dependency analysis, and to
define the execution islands. In the remaining steps, we apply runtime estimation, regional
placement optimization (global optimization on an execution island), and the actual execution
according to the placements for one execution island at a time.
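Expressed as a sketch, with every type and function body a mere placeholder for the steps just described, the sequence amounts to the following driver loop:

    #include <vector>

    // A sketch of the adaptive placement sequence as a driver loop. All
    // types and function bodies are placeholders; the essential point is
    // that each island is optimized and executed in turn, so every
    // optimization step works on exactly known cardinalities.
    struct SubOperator { /* kernel, arguments, placement */ };
    struct Island { std::vector<SubOperator> subOps; };

    std::vector<Island> preprocessQuery() { return {}; }  // refine, analyze, split
    void estimateRuntimes(Island&) {}                     // per sub-operator, per CU
    void optimizeRegionally(Island&) {}                   // global opt. within island
    void executeIsland(const Island&) {}                  // results feed next island

    void runQueryAdaptively() {
        std::vector<Island> islands = preprocessQuery();
        for (Island& island : islands) {
            estimateRuntimes(island);   // inputs are materialized, cardinalities exact
            optimizeRegionally(island);
            executeIsland(island);
        }
    }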
Implementation Approach: We implemented our approach as a virtualization layer between
the database system and the heterogeneous hardware. This approach allows a clear separation
of concerns for the different query optimization steps. The virtualization layer was built upon
the standardized OpenCL interface to allow fast integration and broad usability for a number
of database systems.
Evaluation: For the evaluation, we use two different OpenCL-based database systems with
two different OLAP benchmarks, running in a highly heterogeneous computing environment.
The results show the effectiveness of our approach, which especially benefits from the adaptive
execution and the independence of cardinality estimations.
6.2 FUTURE WORK
It is possible to extend our work in most areas of this thesis. In the following, we present a set
of ideas that we find promising.
Intra-Operator Parallelism: For this approach, it is always possible to evaluate more query
operators and find cases where intra-operator parallelism works well or where it cannot be
applied. Our list of possible overheads and the initial evaluation of the selection and sort
operators can guide this future research. Additionally, it would be interesting to evaluate chains
of operators, so that multiple CUs can work on different data partitions for a number of operators
without synchronization or merging. This could reduce the overheads connected with this
approach. However, the benefit might not be significant, as the different operators in a chain
would need to work with sub-optimal partition sizes to ensure that the CUs finish at the same
time.
Static Placement: For static placement, we showed the impact of the GPU TLB hierarchy
and TLB sharing between different SMs. It would be interesting to utilize the sharing of the L2
TLB in particular to work on multiple memory regions simultaneously, without causing TLB
misses. However, this could introduce skewed executions, as a varying number of SMs share
a TLB depending on which SMs are deactivated by default. In addition to this fine-grained
optimization, it might be possible to automate the presented code optimization, so that the
same operator could be ported to a different CU with less implementation effort. However, as
presented earlier, the search space for this approach is large, as there are many different tuning
options to choose from.
Dynamic Placement: We see two interesting directions for future work on the general approach
of dynamic placement.
(1) We found in our experiments that OpenCL is not the ideal programming language for
some CUs like the Xeon Phi or the CPU. To allow different programming languages in one system,
we could either implement each operator in a CU-native language or apply code
transformation tools like Alpaka [Zenker et al. 2016]. There, the code is written once in an
OpenCL-like intermediate language, which is compiled to various other languages, including
CUDA and C++. For OpenCL, the compilation step is performed by the vendor's OpenCL
driver, which is sometimes not well supported. For example, OpenCL on the Xeon Phi mostly
shows low performance, and Nvidia GPUs mostly support version 1.1, while the current version
of OpenCL is 2.2. With an approach like Alpaka, a first compilation step into a programming
language that is well supported by the hardware vendors is done by the application,
circumventing the limited or missing OpenCL support.
(2) In this thesis, we always optimized towards query runtime. However, multiple other
optimization goals are possible, including query throughput, energy consumption, or
operation costs (e.g., in a cloud environment). It would be interesting to see how many of our
techniques could be applied to these different optimization goals.
Runtime Estimation: We compute the sum of all used data sizes of an operator as the index for
the estimation model. However, it would be possible to improve the runtime estimation with
multi-dimensional indexing, where the different data sizes and data characteristics define the
dimensions. The different dimensions would need to be weighted according to their influence
on the runtime. This weighting could be learned online by observing the dimension
values and the actual operator runtime. It is open whether this approach results in better
runtime estimations or whether the additional dimensions and weights are themselves a source
of possible estimation inaccuracies.
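Purely as an illustration of this idea, a weighted nearest-neighbor estimator over multi-dimensional samples could look as follows; the online weight update itself is omitted, and all names are hypothetical:

    #include <cmath>
    #include <cstddef>
    #include <limits>
    #include <vector>

    // Illustrative only: a weighted nearest-neighbor estimate over
    // multi-dimensional samples (several data sizes and characteristics).
    // The weights would be learned online from observed estimation errors.
    struct Sample { std::vector<double> features; double runtime; };

    double weightedDistance(const std::vector<double>& a, const std::vector<double>& b,
                            const std::vector<double>& weights) {
        double d = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i)
            d += weights[i] * (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(d);
    }

    double estimateRuntime(const std::vector<Sample>& samples,
                           const std::vector<double>& features,
                           const std::vector<double>& weights) {
        double best = std::numeric_limits<double>::max(), runtime = 0.0;
        for (const Sample& s : samples) {
            double d = weightedDistance(features, s.features, weights);
            if (d < best) { best = d; runtime = s.runtime; }
        }
        return runtime;  // runtime of the nearest observed sample
    }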
Global Placement Optimization: For global placement optimization, it might be possible to
improve the greedy algorithm to not only alter one operator's placement and evaluate the global
effects, but to alter multiple operator placements at once. This could eventually find operator
groups that benefit from being placed on the same CU, even though each single operator would
prefer a different CU. It is open whether this approach is faster and more effective than repeating
the initial greedy algorithm multiple times with different starting placements.
Adaptive Placement Optimization: The main limitation of our adaptive approach is the regional
view per execution island, where we cannot rely on cardinality information beyond one
island. This can be a disadvantage when considering future data sharing. One idea would be
to apply global optimization on inaccurate intermediate cardinalities in the beginning, while
refining the initial placement choice through regional optimization per execution island. This
could allow explicit data sharing even with operators beyond one execution island, ideally
resulting in a better placement. However, it could also influence the placement decisions per
execution island negatively, if the initial cardinality information for the global placement results
in largely wrong decisions. It remains open whether such an initial global optimization
step is beneficial or harmful to the adaptive placement.
Beyond placement optimization, it is possible to weaken our strict separation of concerns
in order to allow physical optimization and placement optimization at the same time. Physical
optimization could benefit from our adaptive approach by (1) knowing the exact CU an operator
is going to be executed on and (2) being applied at run-time for each execution island.
There, the physical optimization can benefit from the exact knowledge of intermediate cardinalities.
This approach is similar to the well-known adaptive query processing [Deshpande
et al. 2007], where optimization decisions are reevaluated after new cardinality information
becomes available.
Multi-Query Support: In this thesis, we mainly looked at single-query execution, while it
is also possible to optimize for multi-query execution. There, heterogeneous resources need to
be shared between multiple queries, which could result in a more complex execution, where
operators from multiple queries are executed together on multiple CUs. However, it could
also result in a simpler execution, where one query is assigned to one CU to allow better data
sharing while avoiding contention on the CU memory. In both cases, our runtime estimation
and placement optimization can be used to decide the operator placement or query placement.
BIBLIOGRAPHY
[Abadi et al. 2013] Abadi, Daniel; Boncz, Peter A.; Harizopoulos, Stavros; Idreos, Stratos ;
Madden, Samuel: The Design and Implementation of Modern Column-Oriented Database
Systems. In: Foundations and Trends in Databases 5 (2013), No 3, 197–280. http://dx.doi.org/
10.1561/1900000024. – DOI 10.1561/1900000024
[Adapteva 2014] Adapteva: Epiphany Architecture Reference, September 2014
[Ailamaki et al. 2002] Ailamaki, Anastassia; DeWitt, David J. ; Hill, Mark D.: Data page layouts
for relational databases on deep memory hierarchies. In: VLDB J. 11 (2002), No 3, 198–215.
http://dx.doi.org/10.1007/s00778-002-0074-9. – DOI 10.1007/s00778–002–0074–9
[Albutiu et al. 2012] Albutiu, Martina-Cezara; Kemper, Alfons ; Neumann, Thomas: Mas-
sively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems. In: PVLDB
5 (2012), No 10, 1064–1075. http://vldb.org/pvldb/vol5/p1064_martina-cezaraalbutiu_
vldb2012.pdf
[Appleby 2008] Appleby, A: Murmurhash project. 2008. – http://code.google.com/p/smhasher/
[Appleyard 2016] Appleyard, Jeremy: Nvidia Presentation: PASCAL AND CUDA 8.0, July 2016
[Arnold et al. 2014a] Arnold, Oliver; Haas, Sebastian; Fettweis, Gerhard; Schlegel, Benjamin;
Kissinger, Thomas; Karnagel, Tomas ; Lehner, Wolfgang: HASHI: An Application Specific
Instruction Set Extension for Hashing. In: International Workshop on Accelerating Data Man-
agement Systems Using Modern Processor and Storage Architectures - ADMS 2014, Hangzhou,
China, September 1, 2014, 25–33
[Arnold et al. 2014b] Arnold, Oliver; Haas, Sebastian; Fettweis, Gerhard; Schlegel, Benjamin;
Kissinger, Thomas ; Lehner, Wolfgang: An application-specific instruction set for acceler-
ating set-oriented database primitives. In: International Conference on Management of Data,
SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, 767–778
[Bakkum and Chakradhar] Bakkum, Peter; Chakradhar, Srimat: Efficient Data Management
for GPU Databases. https://github.com/bakks/virginian
[Balkesen et al. 2013] Balkesen, Cagri; Alonso, Gustavo; Teubner, Jens ; Özsu, M. T.: Multi-
Core, Main-Memory Joins: Sort vs. Hash Revisited. In: PVLDB 7 (2013), No 1, 85–96.
http://www.vldb.org/pvldb/vol7/p85-balkesen.pdf
[Barak and Shiloh 2014] Barak, A.; Shiloh, A.: The VirtualCL (VCL) Cluster Platform, 2014
[Batcher 1968] Batcher, Kenneth E.: Sorting Networks and Their Applications. In: American
Federation of Information Processing Societies: AFIPS Conference Proceedings: 1968 Spring Joint
Computer Conference, Atlantic City, NJ, USA, 30 April - 2 May, 1968, 307–314
[Boncz 2002] Boncz, Peter A.: Monet; a next-Generation DBMS Kernel For Query-Intensive Appli-
cations. 2002
[Boncz et al. 2008] Boncz, Peter A.; Kersten, Martin L. ; Manegold, Stefan: Breaking the
memory wall in MonetDB. In: Commun. ACM 51 (2008), No 12, 77–85. http://dx.doi.org/
10.1145/1409360.1409380. – DOI 10.1145/1409360.1409380
[Boncz et al. 2005] Boncz, Peter A.; Zukowski, Marcin ; Nes, Niels: MonetDB/X100: Hyper-
Pipelining Query Execution. In: CIDR 2005, Second Biennial Conference on Innovative Data
Systems Research, Asilomar, CA, USA, January 4-7, 2005, Online Proceedings, 2005, 225–237
[Borkar and Chien 2011] Borkar, Shekhar; Chien, Andrew A.: The Future of Microprocessors.
In: Commun. ACM 54 (2011), May, No 5, 67–77. http://dx.doi.org/10.1145/1941487.1941507.
– DOI 10.1145/1941487.1941507. – ISSN 0001–0782
[Breß 2013] Breß, Sebastian: Why it is time for a HyPE: A Hybrid Query Processing Engine
for Efficient GPU Coprocessing in DBMS. In: PVLDB 6 (2013), No 12, 1398–1403. http:
//www.vldb.org/pvldb/vol6/p1398-bress.pdf
[Breß 2014] Breß, Sebastian: The Design and Implementation of CoGaDB: A Column-oriented
GPU-accelerated DBMS. In: Datenbank-Spektrum 14 (2014), No 3, 199–209. http://dx.doi.
org/10.1007/s13222-014-0164-z. – DOI 10.1007/s13222–014–0164–z
[Breß et al. 2012] Breß, Sebastian; Beier, Felix; Rauhe, Hannes; Schallehn, Eike; Sattler,
Kai-Uwe ; Saake, Gunter: Automatic Selection of Processing Units for Coprocessing in
Databases. In: Advances in Databases and Information Systems - 16th East European Conference,
ADBIS 2012, Poznań, Poland, September 18-21, 2012, 57–70
[Breß et al. 2016] Breß, Sebastian; Funke, Henning ; Teubner, Jens: Robust Query Processing
in Co-Processor-accelerated Databases. In: Proceedings of the 2016 International Conference
on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01,
2016
[Breß et al. 2014] Breß, Sebastian; Heimel, Max; Siegmund, Norbert; Bellatreche, Lad-
jel ; Saake, Gunter: GPU-Accelerated Database Systems: Survey and Open Challenges.
Version: 2014. http://dx.doi.org/10.1007/978-3-662-45761-0_1. 2014, 1–35
[Broneske et al. 2014] Broneske, David; Breß, Sebastian; Heimel, Max ; Saake, Gunter: Toward
Hardware-Sensitive Database Operations. In: Proceedings of the 17th International Conference
on Extending Database Technology, EDBT 2014, Athens, Greece, March 24-28, 2014, 229–234
[Chen et al. 2015] Chen, Ren; Siriyal, Sruja ; Prasanna, Viktor K.: Energy and Memory Efficient
Mapping of Bitonic Sorting on FPGA. In: Proceedings of the 2015 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, February 22-24, 2015, 240–
249
[Christodoulakis 1984] Christodoulakis, Stavros: Implications of Certain Assumptions in
Database Performance Evaluation. In: ACM Trans. Database Syst. 9 (1984), No 2, 163–186.
http://dx.doi.org/10.1145/329.318578. – DOI 10.1145/329.318578
[Cui et al. 2003] Cui, Yi; Zhong, Zhaohui; Wang, Deli; Wang, Wayne U. ; Lieber, Charles M.:
High performance silicon nanowire field effect transistors. In: Nano letters 3 (2003), No 2,
P. 149–152
[Davis 1992] Davis, Ian J.: A Fast Radix Sort. In: Comput. J. 35 (1992), No 6, 636–642.
http://dx.doi.org/10.1093/comjnl/35.6.636. – DOI 10.1093/comjnl/35.6.636
[Delorme et al. 2013] Delorme, Michael C.; Abdelrahman, Tarek S. ; Zhao, Chengyan: Par-
allel Radix Sort on the AMD Fusion Accelerated Processing Unit. In: 42nd International
Conference on Parallel Processing, ICPP 2013, Lyon, France, October 1-4, 2013, 339–348
[Deshpande et al. 2007] Deshpande, Amol; Ives, Zachary ; Raman, Vijayshankar: Adaptive
Query Processing. In: Found. Trends databases 1 (2007), January, No 1, 1–140. http://dx.doi.
org/10.1561/1900000001. – DOI 10.1561/1900000001. – ISSN 1931–7883
[DeWitt et al. 1986] DeWitt, David J.; Gerber, Robert H.; Graefe, Goetz; Heytens, Michael L.;
Kumar, Krishna B. ; Muralikrishna, M.: GAMMA - A High Performance Dataflow Database
Machine. In: VLDB’86 Twelfth International Conference on Very Large Data Bases, Kyoto, Japan,
August 25-28, 1986, 228–237
[Dobosiewicz 1978] Dobosiewicz, Wlodzimierz: Sorting by Distributive Partitioning. In: Inf.
Process. Lett. 7 (1978), No 1, 1–6. http://dx.doi.org/10.1016/0020-0190(78)90028-5. – DOI
10.1016/0020–0190(78)90028–5
[Esmaeilzadeh et al. 2011] Esmaeilzadeh, Hadi; Blem, Emily R.; Amant, Renée St.; Sankar-
alingam, Karthikeyan ; Burger, Doug: Dark silicon and the end of multicore scaling. In:
38th International Symposium on Computer Architecture (ISCA 2011), San Jose, CA, USA, June
4-8, 2011, 365–376
[Fang et al. 2007] Fang, Rui; He, Bingsheng; Lu, Mian; Yang, Ke; Govindaraju, Naga K.; Luo,
Qiong ; Sander, Pedro V.: GPUQP: query co-processing using graphics processors. In: Pro-
ceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China,
June 12-14, 2007, 1061–1063
[Fegaras 1998] Fegaras, Leonidas: A New Heuristic for Optimizing Large Queries. In: Database
and Expert Systems Applications, 9th International Conference, DEXA ’98, Vienna, Austria, August
24-28, 1998, 726–735
[Fowler et al. 2015] Fowler, Glenn; Noll, Landon C. ; Vo, Phong: FNV Hash. 2015. –
http://www.isthe.com/chongo/tech/comp/fnv/
[Garcia-Molina et al. 2000] Garcia-Molina, Hector; Ullman, Jeffrey D. ; Widom, Jennifer:
Database System Implementation. Prentice-Hall, 2000. – ISBN 0–13–040264–8
[Gaster 2011] Gaster, Benedict R.: OpenCL Device Fission, 2011
[Gedik et al. 2007] Gedik, Bugra; Yu, Philip S. ; Bordawekar, Rajesh: Executing Stream Joins
on the Cell Processor. In: Proceedings of the 33rd International Conference on Very Large Data
Bases, University of Vienna, Austria, September 23-27, 2007, 2007, 363–374
[Govindaraju et al. 2006] Govindaraju, Naga K.; Gray, Jim; Kumar, Ritesh ; Manocha, Dinesh:
GPUTeraSort: high performance graphics co-processor sorting for large database manage-
ment. In: Proceedings of the ACM SIGMOD International Conference on Management of Data,
Chicago, Illinois, USA, June 27-29, 2006, 325–336
[Graefe 1994] Graefe, Goetz: Volcano - An Extensible and Parallel Query Evaluation System.
In: IEEE Trans. Knowl. Data Eng. 6 (1994), No 1, 120–135. http://dx.doi.org/10.1109/69.
273032. – DOI 10.1109/69.273032
[Gür et al. 2016] Gür, Fatih N.; Schwarz, Friedrich W.; Ye, Jingjing; Diez, Stefan ; Schmidt,
Thorsten L.: Toward self-assembled plasmonic devices: High-yield arrangement of gold
nanoparticles on DNA origami templates. In: ACS nano 10 (2016), No 5, P. 5374–5382
[Haas et al. 2016] Haas, Sebastian; Karnagel, Tomas; Arnold, Oliver; Laux, Erik; Schlegel, Ben-
jamin; Fettweis, Gerhard ; Lehner, Wolfgang: HW/SW-database-codesign for compressed
bitmap index processing. In: 27th IEEE International Conference on Application-specific Sys-
tems, Architectures and Processors, ASAP 2016, London, United Kingdom, July 6-8, 2016, 50–57
[Hazra 2014] Hazra, Rajeeb: Accelerating Insights ... In the Technical Computing Transforma-
tion. In: International Supercomputing Conference, Leipzig, Germany, 2014
[He et al. 2007] He, Bingsheng; Govindaraju, Naga K.; Luo, Qiong ; Smith, Burton: Efficient
gather and scatter operations on graphics processors. In: Proceedings of the ACM/IEEE Con-
ference on High Performance Networking and Computing, SC 2007, Reno, Nevada, USA, November
10-16, 2007, 46
[He et al. 2009] He, Bingsheng; Lu, Mian; Yang, Ke; Fang, Rui; Govindaraju, Naga K.; Luo,
Qiong ; Sander, Pedro V.: Relational Query Co-processing on Graphics Processors. In: ACM
Trans. Database Syst. 34 (2009), December, No 4, 21:1–21:39. http://dx.doi.org/10.1145/
1620585.1620588. – DOI 10.1145/1620585.1620588. – ISSN 0362–5915
[He et al. 2008] He, Bingsheng; Yang, Ke; Fang, Rui; Lu, Mian; Govindaraju, Naga K.; Luo,
Qiong ; Sander, Pedro V.: Relational joins on graphics processors. In: Proceedings of the
ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC,
Canada, June 10-12, 2008, 511–524
[He and Yu 2011] He, Bingsheng; Yu, Jeffrey X.: High-throughput transaction executions on
graphics processors. In: PVLDB 4 (2011), No 5, 314–325. http://dl.acm.org/citation.cfm?
id=1952381
[He et al. 2013] He, Jiong; Lu, Mian ; He, Bingsheng: Revisiting Co-Processing for Hash
Joins on the Coupled CPU-GPU Architecture. In: PVLDB 6 (2013), No 10, 889–900. http:
//www.vldb.org/pvldb/vol6/p889-he.pdf
[He et al. 2014] He, Jiong; Zhang, Shuhao ; He, Bingsheng: In-Cache Query Co-Processing on
Coupled CPU-GPU Architectures. In: PVLDB 8 (2014), No 4, 329–340. http://www.vldb.
org/pvldb/vol8/p329-he.pdf
[Heimel et al. 2013] Heimel, Max; Saecker, Michael; Pirk, Holger; Manegold, Stefan ; Markl,
Volker: Hardware-Oblivious Parallelism for In-Memory Column-Stores. In: PVLDB 6
(2013), No 9, 709–720. http://www.vldb.org/pvldb/vol6/p709-heimel.pdf
[Huismann et al. 2015] Huismann, Immo; Stiller, Jörg ; Fröhlich, Jochen: Two-level paral-
lelization of a fluid mechanics algorithm exploiting hardware heterogeneity. In: Computers
& Fluids 117 (2015), P. 114–124
[Ioannidis and Christodoulakis 1991] Ioannidis, Yannis E.; Christodoulakis, Stavros: On the
Propagation of Errors in the Size of Join Results. In: Proceedings of the 1991 ACM SIGMOD
International Conference on Management of Data, Denver, Colorado, May 29-31, 1991, 268–277
[Jha et al. 2015] Jha, Saurabh; He, Bingsheng; Lu, Mian; Cheng, Xuntao ; Huynh, Huynh P.:
Improving Main Memory Hash Joins on Intel Xeon Phi Processors: An Experimental Ap-
proach. In: PVLDB 8 (2015), No 6, 642–653. http://www.vldb.org/pvldb/vol8/p642-Jha.pdf
[Jouili and Vansteenberghe 2013] Jouili, Salim; Vansteenberghe, Valentin: An Empirical Com-
parison of Graph Databases. In: International Conference on Social Computing, SocialCom 2013,
SocialCom/PASSAT/BigData/EconCom/BioMedCom 2013, Washington, DC, USA, 8-14 September,
2013, 708–715
[Kaldewey et al. 2012] Kaldewey, Tim; Lohman, Guy M.; Müller, René ; Volk, Peter B.: GPU
join processing revisited. In: Proceedings of the Eighth International Workshop on Data Man-
agement on New Hardware, DaMoN 2012, Scottsdale, AZ, USA, May 21, 2012, 55–62
[Karnagel et al. 2017a] Karnagel, Tomas; Ben-Nun, Tal; Werner, Matthias; Habich, Dirk ;
Lehner, Wolfgang: Big Data causing Big (TLB) Problems: Taming Random Memory Ac-
cesses on the GPU. In: Thirteenth International Workshop on Data Management on New Hard-
ware, DaMoN 2017, Chicago, IL, USA, May 15, 2017
[Karnagel and Habich 2017] Karnagel, Tomas; Habich, Dirk: Heterogeneous Placement Opti-
mization for Database Query Processing. In: it Information Technology, 2017
[Karnagel et al. 2015a] Karnagel, Tomas; Habich, Dirk ; Lehner, Wolfgang: Local vs. Global
Optimization: Operator Placement Strategies in Heterogeneous Environments. In: Proceed-
ings of the Workshops of the EDBT/ICDT 2015 Joint Conference (EDBT/ICDT), Brussels, Belgium,
March 27th, 2015, 48–55
[Karnagel et al. 2016] Karnagel, Tomas; Habich, Dirk ; Lehner, Wolfgang: Limitations of Intra-
operator Parallelism Using Heterogeneous Computing Resources. In: Advances in Databases
and Information Systems - 20th East European Conference, ADBIS 2016, Prague, Czech Republic,
August 28-31, 2016, 291–305
[Karnagel et al. 2017b] Karnagel, Tomas; Habich, Dirk ; Lehner, Wolfgang: Adaptive Work
Placement for Query Processing on Heterogeneous Computing Resources. In: PVLDB 10
(2017), No 7, 733–744. http://www.vldb.org/pvldb/vol10/p733-karnagel.pdf
[Karnagel et al. 2014] Karnagel, Tomas; Habich, Dirk; Schlegel, Benjamin ; Lehner, Wolf-
gang: Heterogeneity-Aware Operator Placement in Column-Store DBMS. In: Datenbank-
Spektrum 14 (2014), No 3, 211–221. http://dx.doi.org/10.1007/s13222-014-0167-9. – DOI
10.1007/s13222–014–0167–9
[Karnagel et al. 2015b] Karnagel, Tomas; Müller, René ; Lohman, Guy M.: Optimizing GPU-
accelerated Group-By and Aggregation. In: International Workshop on Accelerating Data Man-
agement Systems Using Modern Processor and Storage Architectures - ADMS 2015, Kohala Coast,
Hawaii, USA, August 31, 2015, 13–24
[Khronos 2011] Khronos: OpenCL Working Group, The OpenCL Specification, Version: 1.1, Docu-
ment Revision: 44, 2011. http://www.khronos.org/opencl/
[Kohei 2015] Kohei, KaiGai: PG-Strom: GPGPU meets PostgreSQL to accelerate analytic
queries. In: PGCon: The PostgreSQL Conference (2015)
[Kooi 1980] Kooi, Robert: The Optimization of Queries in Relational Databases, Case Western
Reserve University, Diss., September 1980
[Leis et al. 2014] Leis, Viktor; Boncz, Peter A.; Kemper, Alfons ; Neumann, Thomas: Morsel-
driven parallelism: a NUMA-aware query evaluation framework for the many-core age. In:
International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27,
2014, 743–754
[Leis et al. 2015] Leis, Viktor; Gubichev, Andrey; Mirchev, Atanas; Boncz, Peter A.; Kemper,
Alfons ; Neumann, Thomas: How Good Are Query Optimizers, Really? In: PVLDB 9 (2015),
No 3, 204–215. http://www.vldb.org/pvldb/vol9/p204-leis.pdf
[Lipton et al. 1990] Lipton, Richard J.; Naughton, Jeffrey F. ; Schneider, Donovan A.: Practical
Selectivity Estimation through Adaptive Sampling. In: Proceedings of the 1990 ACM SIGMOD
International Conference on Management of Data, Atlantic City, NJ, May 23-25, 1990, 1–11
[Lüssem et al. 2013] Lüssem, Björn; Tietze, Max L.; Kleemann, Hans; Hoßbach, Christoph;
Bartha, Johann W.; Zakhidov, Alexander ; Leo, Karl: Doped organic transistors operating in
the inversion and depletion regime. In: Nature communications 4 (2013). http://dx.doi.org/
10.1038/ncomms3775. – DOI 10.1038/ncomms3775
[Lutz et al. 2015] Lutz, Thibaut; Fensch, Christian ; Cole, Murray: Helium: a transparent
inter-kernel optimizer for OpenCL. In: Proceedings of the 8th Workshop on General Purpose
Processing using GPUs, GPGPU@PPoPP 2015, San Francisco, CA, USA, February 7, 2015, 70–80
[Manegold et al. 2002] Manegold, Stefan; Boncz, Peter A. ; Kersten, Martin L.: Generic
Database Cost Models for Hierarchical Memory Systems. In: VLDB 2002, Proceedings of
28th International Conference on Very Large Data Bases, Hong Kong, China, August 20-23, 2002,
191–202
[Marks 2017] Marks, Anton: Alenka - GPU database engine. (2017)
[Mayr et al. 2000] Mayr, Tobias; Bonnet, Philippe; Gehrke, Johannes ; Seshadri, Praveen:
Query Processing with Heterogeneous Resources. Ithaca, NY, USA : Cornell University,
2000. – Thesis
[Mei and Chu 2015] Mei, Xinxin; Chu, Xiaowen: Dissecting GPU Memory Hierarchy through
Microbenchmarking. In: CoRR abs/1509.02308 (2015). http://arxiv.org/abs/1509.02308
[Meric et al. 2008] Meric, I.; Baklitskaya, N.; Kim, P. ; Shepard, K. L.: RF performance of top-
gated, zero-bandgap graphene field-effect transistors. In: 2008 IEEE International Electron
Devices Meeting, 2008. – ISSN 0163–1918, P. 1–4
[Merrill and Grimshaw 2010] Merrill, Duane; Grimshaw, Andrew S.: Revisiting sorting for
GPGPU stream architectures. In: 19th International Conference on Parallel Architecture and
Compilation Techniques, PACT 2010, Vienna, Austria, September 11-15, 2010, 545–546
[Merry 2015] Merry, Bruce: A Performance Comparison of Sort and Scan Libraries
for GPUs. In: Parallel Processing Letters 25 (2015), No 4. http://dx.doi.org/10.1142/
S0129626415500073. – DOI 10.1142/S0129626415500073
[Mittal 2016] Mittal, Sparsh: A survey of techniques for architecting TLBs. In: Concurrency
and Computation: Practice and Experience (2016). http://dx.doi.org/10.1002/cpe.4061. – DOI
10.1002/cpe.4061. – ISSN 1532–0634
[Mostak 2013] Mostak, Todd: An overview of MapD (massively parallel database). (2013)
[Mühlbauer et al. 2014] Mühlbauer, Tobias; Rödiger, Wolf; Seilbeck, Robert; Kemper, Alfons
; Neumann, Thomas: Heterogeneity-conscious parallel query execution: getting a better
mileage while driving faster! In: Tenth International Workshop on Data Management on New
Hardware, DaMoN 2014, Snowbird, UT, USA, June 23, 2014, 2:1–2:10
[Müller et al. 2009a] Müller, René; Teubner, Jens ; Alonso, Gustavo: Data Processing on
FPGAs. In: PVLDB 2 (2009), No 1, 910–921. http://www.vldb.org/pvldb/2/vldb09-603.pdf
[Müller et al. 2009b] Müller, René; Teubner, Jens ; Alonso, Gustavo: Streams on Wires - A
Query Compiler for FPGAs. In: PVLDB 2 (2009), No 1, 229–240. http://www.vldb.org/
pvldb/2/vldb09-622.pdf
[Müller et al. 2012] Müller, René; Teubner, Jens ; Alonso, Gustavo: Sorting networks on
FPGAs. In: VLDB J. 21 (2012), No 1, 1–23. http://dx.doi.org/10.1007/s00778-011-0232-z. –
DOI 10.1007/s00778–011–0232–z
[Neumann 2011] Neumann, Thomas: Efficiently Compiling Efficient Query Plans for Mod-
ern Hardware. In: PVLDB 4 (2011), No 9, 539–550. http://www.vldb.org/pvldb/vol4/
p539-neumann.pdf
[Nvidia 2014a] Nvidia: Summit and Sierra Supercomputers: An Inside Look at the U.S. Department
of Energy’s New Pre-Exascale Systems, November 2014
[Nvidia 2014b] Nvidia: White Paper: NVIDIA’s Next Generation CUDA Compute Architecture:
Kepler GK110/210, 2014
[Nvidia 2015] Nvidia: CUDA C Programming Guide. 7.0, March 2015
[Nvidia 2016] Nvidia: NVIDIA Tesla P100 - The Most Advanced Datacenter Accelerator Ever Built
Featuring Pascal GP100, the World’s Fastest GPU. WP-08019-001_v01.1, 2016
[Nyland and Jones 2012] Nyland, Lars; Jones, Stephen: Understanding and Using Atomic Memory
Operations. 2012. – In GTC 2012, Session S3101
[O’Neil et al. 2009] O’Neil, Patrick E.; O’Neil, Elizabeth J.; Chen, Xuedong ; Revilak, Stephen:
The Star Schema Benchmark and Augmented Fact Table Indexing. In: Performance Evalu-
ation and Benchmarking, First TPC Technology Conference, TPCTC 2009, Lyon, France, August
24-28, 2009, 237–252
[Padmanabhan et al. 2001] Padmanabhan, Sriram; Malkemus, Timothy; Agarwal, Ramesh C. ;
Jhingran, Anant: Block Oriented Processing of Relational Database Operations in Modern
Computer Architectures. In: Proceedings of the 17th International Conference on Data Engi-
neering, Heidelberg, Germany, April 2-6, 2001, 567–574
[Pollack 1999] Pollack, Fred J.: New Microarchitecture Challenges in the Coming Generations
of CMOS Process Technologies. In: Proceedings of the 32nd Annual IEEE/ACM International
Symposium on Microarchitecture, MICRO 32, Haifa, Israel, November 16-18, 1999, 2
[Poosala and Ioannidis 1997] Poosala, Viswanath; Ioannidis, Yannis E.: Selectivity Estimation
Without the Attribute Value Independence Assumption. In: VLDB’97, Proceedings of 23rd
International Conference on Very Large Data Bases, Athens, Greece, August 25-29, 1997, 486–
495
[Putnam et al. 2014] Putnam, Andrew; Caulfield, Adrian M.; Chung, Eric S.; Chiou, Derek;
Constantinides, Kypros; Demme, John; Esmaeilzadeh, Hadi; Fowers, Jeremy; Gopal, Gopi P.;
Gray, Jan; Haselman, Michael; Hauck, Scott; Heil, Stephen; Hormati, Amir; Kim, Joo-
Young; Lanka, Sitaram; Larus, James R.; Peterson, Eric; Pope, Simon; Smith, Aaron; Thong,
Jason; Xiao, Phillip Y. ; Burger, Doug: A reconfigurable fabric for accelerating large-scale
datacenter services. In: ACM/IEEE 41st International Symposium on Computer Architecture,
ISCA 2014, Minneapolis, MN, USA, June 14-18, 2014, 13–24
[Ramakrishnan and Gehrke 2003] Ramakrishnan, Raghu; Gehrke, Johannes: Database man-
agement systems (3. ed.). McGraw-Hill, 2003. – ISBN 978–0–07–115110–8
[Rosenfeld et al. 2015] Rosenfeld, Viktor; Heimel, Max; Viebig, Christoph ; Markl, Volker:
The Operator Variant Selection Problem on Heterogeneous Hardware. In: International
Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Archi-
tectures - ADMS 2015, Kohala Coast, Hawaii, USA, August 31, 2015, 1–12
[Satish et al. 2009] Satish, Nadathur; Harris, Mark J. ; Garland, Michael: Designing efficient
sorting algorithms for manycore GPUs. In: 23rd IEEE International Symposium on Parallel and
Distributed Processing, IPDPS 2009, Rome, Italy, May 23-29, 2009, 1–10
[Seeger and Ultra-Large-Sites 2009] Seeger, Marc; Ultra-Large-Sites, S: Key-Value stores: a
practical overview. (2009)
[Selinger et al. 1979] Selinger, Patricia G.; Astrahan, Morton M.; Chamberlin, Donald D.;
Lorie, Raymond A. ; Price, Thomas G.: Access Path Selection in a Relational Database
Management System. In: Proceedings of the 1979 ACM SIGMOD International Conference on
Management of Data, Boston, Massachusetts, May 30 - June 1., 1979, 23–34
[Stonebraker et al. 2005] Stonebraker, Michael; Abadi, Daniel J.; Batkin, Adam; Chen, Xue-
dong; Cherniack, Mitch; Ferreira, Miguel; Lau, Edmond; Lin, Amerson; Madden, Samuel;
O’Neil, Elizabeth J.; O’Neil, Patrick E.; Rasin, Alex; Tran, Nga ; Zdonik, Stanley B.: C-Store:
A Column-oriented DBMS. In: Proceedings of the 31st International Conference on Very Large
Data Bases, Trondheim, Norway, August 30 - September 2, 2005, 553–564
[Stonebraker and Kemnitz 1991] Stonebraker, Michael; Kemnitz, Greg: The Postgres Next
Generation Database Management System. In: Commun. ACM 34 (1991), No 10, 78–92.
http://dx.doi.org/10.1145/125223.125262. – DOI 10.1145/125223.125262
[Sutter 2005] Sutter, Herb: The free lunch is over: A fundamental turn toward concurrency in
software. In: Dr. Dobb’s Journal, 2005
[Teubner and Müller 2011] Teubner, Jens; Müller, René: How soccer players would do stream
joins. In: Proceedings of the ACM SIGMOD International Conference on Management of Data,
SIGMOD 2011, Athens, Greece, June 12-16, 2011, 625–636
[TPC 2014] TPC: Transaction Processing Performance Council (TPC) TPC Benchmark H (De-
cision Support) Standard Specification Revision 2.17.1, 2014
[Ungethüm et al. 2016] Ungethüm, Annett; Kissinger, Thomas; Habich, Dirk ; Lehner, Wolf-
gang: Work-Energy Profiles: General Approach and In-Memory Database Application. In:
Performance Evaluation and Benchmarking. Traditional - Big Data - Interest of Things - 8th TPC
Technology Conference, TPCTC 2016, New Delhi, India, September 5-9, 2016, 142–158
[Voigt et al. 2013] Voigt, Andreas; Greiner, Rinaldo; Allerdißen, Merle ; Richter, Andreas:
Towards Computation with Microchemomechanical Systems. Berlin, Heidelberg : Springer
Berlin Heidelberg, 2013. – ISBN 978–3–642–39074–6, 232–243
[Völp et al. 2016] Völp, Marcus; Klüppelholz, Sascha; Castrillon, Jeronimo; Härtig, Hermann;
Asmussen, Nils; Assmann, Uwe; Baader, Franz; Baier, Christel; Fettweis, Gerhard; Fröhlich,
Jochen; Goens, Andres; Haas, Sebastian; Habich, Dirk; Hasler, Mattis; Huismann, Immo;
Karnagel, Tomas; Karol, Sven; Lehner, Wolfgang; Leuschner, Linda; Lieber, Matthias; Ling,
Siqi; Märcker, Steffen; Mey, Johannes; Nagel, Wolfgang; Nöthen, Benedikt; Peñaloza,
Rafael; Raitza, Michael; Stiller, Jörg; Ungethüm, Annett ; Voigt, Axel: The Orchestra-
tion Stack: The Impossible Task of Designing Software for Unknown Future Post-CMOS
Hardware. In: Proceedings of the 1st International Workshop on Post-Moore’s Era Supercom-
puting (PMES), Co-located with The International Conference for High Performance Computing,
Networking, Storage and Analysis (SC16). Salt Lake City, USA, November 2016
[Wang et al. 2014] Wang, Kaibo; Zhang, Kai; Yuan, Yuan; Ma, Siyuan; Lee, Rubao; Ding,
Xiaoning ; Zhang, Xiaodong: Concurrent Analytical Query Processing with GPUs. In:
PVLDB 7 (2014), No 11, 1011–1022. http://www.vldb.org/pvldb/vol7/p1011-wang.pdf
[Wu et al. 2010] Wu, Ren; Zhang, Bin; Hsu, Meichun ; Chen, Qiming: GPU-Accelerated Pred-
icate Evaluation on Column Store. In: Web-Age Information Management, 11th International
Conference, WAIM 2010, Jiuzhaigou, China, July 15-17, 2010, 570–581
[You et al. 2015] You, Yi-Ping; Wu, Hen-Jung; Tsai, Yeh-Ning ; Chao, Yen-Ting: VirtCL: a
framework for OpenCL device abstraction and management. In: Proceedings of the 20th
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, San
Francisco, CA, USA, February 7-11, 2015, 161–172
[Yuan et al. 2013] Yuan, Yuan; Lee, Rubao ; Zhang, Xiaodong: The Yin and Yang of Processing
Data Warehousing Queries on GPU Devices. In: PVLDB 6 (2013), No 10, 817–828. http:
//www.vldb.org/pvldb/vol6/p817-yuan.pdf
[Zenker et al. 2016] Zenker, Erik; Worpitz, Benjamin; Widera, René; Huebl, Axel; Juckeland,
Guido; Knüpfer, Andreas; Nagel, Wolfgang E. ; Bussmann, Michael: Alpaka - An Abstraction
Library for Parallel Kernel Acceleration. In: 2016 IEEE International Parallel and Distributed
Processing Symposium Workshops, IPDPS Workshops 2016, Chicago, IL, USA, May 23-27, 2016,
631–640
[Zhang et al. 2013] Zhang, Shuhao; He, Jiong; He, Bingsheng ; Lu, Mian: OmniDB: Towards
Portable and Efficient Query Processing on Parallel CPU/GPU Architectures. In: PVLDB 6
(2013), No 12, 1374–1377. http://www.vldb.org/pvldb/vol6/p1374-he.pdf
List of Figures
1.1 Database System Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Hardware evolution of different aspects from 1970 to the present. . . . . . . . 13
1.3 The Structure of this thesis including chapter numbers. . . . . . . . . . . . . 17
2.1 Query Processing Order. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Query Optimization for an example query. . . . . . . . . . . . . . . . . . . . 21
2.3 Query Processing Models for an example query. . . . . . . . . . . . . . . . . 22
2.4 Examples for different storage models. . . . . . . . . . . . . . . . . . . . . . 24
2.5 CPU architecture and thread execution. . . . . . . . . . . . . . . . . . . . . 26
2.6 GPU architecture and thread execution. . . . . . . . . . . . . . . . . . . . . 27
2.7 Comparing the CPU, GPU, and MIC properties using six categories. . . . . . . 28
2.8 OpenCL setup consisting of OpenCL library and OpenCL driver. . . . . . . . . 29
2.9 GPU optimized join approach. . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1 Operator execution on a single computing unit. . . . . . . . . . . . . . . . . 42
3.2 Operator execution on two computing units. . . . . . . . . . . . . . . . . . . 42
3.3 Simulating execution behavior in different setups with two CUs. . . . . . . . . 43
3.4 Selection operator executed on both test systems with different data sizes. . . . 47
3.5 Extensive analysis of the selection operator on the tightly-coupled system. . . . 48
3.6 Selection operator executed on the loosely-coupled test system. . . . . . . . . 50
3.7 Sort operator executed on the tightly-coupled test system. . . . . . . . . . . . 51
3.8 Sort operator executed on the loosely-coupled test system. . . . . . . . . . . . 52
3.9 GPU based group-by operator. . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.10 Performance of initial group-by implementation. . . . . . . . . . . . . . . . . 57
3.11 Regions with different behavior. . . . . . . . . . . . . . . . . . . . . . . . . 58
3.12 TLB boundary pattern. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.13 TLB boundary benchmark results. . . . . . . . . . . . . . . . . . . . . . . . 61
3.14 TLB sharing benchmark results. . . . . . . . . . . . . . . . . . . . . . . . . 61
3.15 Hypothetical L2 TLB sharing with 15 SMs. . . . . . . . . . . . . . . . . . . . 62
3.16 Execution time with FNV-1a and Murmur3. . . . . . . . . . . . . . . . . . . 64
3.17 Evaluating different hash table fill factors. . . . . . . . . . . . . . . . . . . . 65
3.18 Evaluation of grid parameters for four different number of groups. . . . . . . . 67
3.19 Best performing grid configurations. . . . . . . . . . . . . . . . . . . . . . . 68
3.20 Performance of local hash table implementations for low cardinality. . . . . . . 69
3.21 Reducing the performance decrease of the L3 TLB problem. . . . . . . . . . . 70
3.22 TLB-conscious data access with repeated scans over the input data. . . . . . . 71
3.23 Resulting performance through switching the different configurations. . . . . 73
3.24 Testing the hash-based Group-By on different computing units. . . . . . . . . 78
3.25 Choosing the best CUs for different numbers of groups. . . . . . . . . . . . . 80
4.1 Finding a good placement for the given query. . . . . . . . . . . . . . . . . . 86
4.2 Estimating irregular execution behavior of the group by operator. . . . . . . . 89
4.3 Data Representation for operators and computing units. . . . . . . . . . . . . 90
4.4 Runtime estimation depending on the learned data. . . . . . . . . . . . . . . 92
4.5 Application of data cleaning using an error margin. . . . . . . . . . . . . . . . 93
4.6 Local placement optimization. . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.7 Global placement optimization. . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.8 Irregular execution behavior being estimated by our approach. . . . . . . . . . 101
4.9 Comparing manual and automated placement decisions. . . . . . . . . . . . . 104
4.10 Heterogeneous evaluation setup consisting of one CPU and three GPUs. . . . . 104
4.11 Query runtime of SSB queries with 1M different placements per query. . . . . 105
4.12 Query runtime of TPC-H queries with 1M different placements per query. . . . 106
4.13 Occurrence of strong operator placements within SSB queries. . . . . . . . . . 106
4.14 Occurrence of strong operator placements within TPC-H queries. . . . . . . . 107
4.15 Placement optimization results relative to the best placement’s runtime. . . . . 108
5.1 One Query with different (intermediate) cardinalities. . . . . . . . . . . . . . 113
5.2 Highly parallel hash join processing. . . . . . . . . . . . . . . . . . . . . . . 116
5.3 Adaptive optimization sequence. . . . . . . . . . . . . . . . . . . . . . . . . 118
5.4 Operator and sub-operator view. . . . . . . . . . . . . . . . . . . . . . . . . 119
5.5 Reordering data accesses by inspecting data objects and dependencies. . . . . . 121
5.6 Placements for sub-operators with copies. . . . . . . . . . . . . . . . . . . . 125
5.7 Architecture overview of HERO. . . . . . . . . . . . . . . . . . . . . . . . . 127
5.8 Transformation of OpenCL kernel code to LLVM IR. . . . . . . . . . . . . . . 131
5.9 Common scheduling problems in Ocelot and gpuDB. . . . . . . . . . . . . . . 132
5.10 Different placement optimizations on SSB queries. . . . . . . . . . . . . . . . 135
5.11 Relative overhead per query on a single CU. . . . . . . . . . . . . . . . . . . 136
5.12 Performance of gpuDB with HERO. . . . . . . . . . . . . . . . . . . . . . . 137
5.13 Adaptivity evaluation by changing the intermediate cardinalities. . . . . . . . 138
5.14 Performance of Ocelot with HERO. . . . . . . . . . . . . . . . . . . . . . . . 139
List of Tables
2.1 Specific CU examples for each of the three hardware architectures. . . . . . . 27
2.2 Naming reference for CUDA, OpenCL, and the chosen naming in this thesis. . 31
2.3 Overview of all used CUs throughout this thesis. . . . . . . . . . . . . . . . . 32
2.4 Benchmark results for the different CUs. . . . . . . . . . . . . . . . . . . . . 32
3.1 TLB findings for the K80 GPU. . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Different configurations for the group-by operator and their range of application. 73
3.3 Hardware differences of K80 GPU and P100 GPU. . . . . . . . . . . . . . . . 74
4.1 Local placement decisions and the ideal placement. . . . . . . . . . . . . . . 95
4.2 Multiple placements that could be considered by global optimization. . . . . 99
4.3 Estimation errors for different percentages of input data and data cleaning. . . 103
5.1 Hypothetical runtime estimations for the given sub-operators and 4 CUs. . . . 122
5.2 Results of optimization for the given runtime estimations. . . . . . . . . . . . 123
5.3 OpenCL functions implemented in HERO. . . . . . . . . . . . . . . . . . . . 128
5.4 SSB query statistics (Operators, Sub-Operators, Execution Islands) . . . . . . 134
List of Abbreviations
APU Accelerated Processing Unit
ASIC Application-Specific Integrated Circuit
CPU Central Processing Unit
CU Computing Unit
DBMS Database Management System
DSM Decomposition Storage Model
dGPU discrete GPU
FPGA Field Programmable Gate Array
FLOPS floating point operations per second
GPC Graphics Processing Cluster
GPU Graphics Processing Unit
HERO HEterogeneous Resource Optimizer
HPC High Performance Computing
HT Hyper Threading
HW Hardware
iGPU integrated GPU
JIT just-in-time
MAE Mean Absolute Error
MAPE Mean Absolute Percentage Error
MIC Many Integrated Core
NSM N-ary Storage Model
NUMA Non-Uniform Memory Access
OLAP Online Analytical Processing
OLTP Online Transaction Processing
OpenCL Open Computing Language
PCIe Peripheral Component Interconnect Express
QEP Query Execution Plan
SIMD Single Instruction, Multiple Data
SIMT Single Instruction, Multiple Thread
SM Streaming Multiprocessor (GPU Core)
SSB Star Schema Benchmark
TID Tuple ID
TLB Translation Lookaside Buffer
TPC Texture Processing Cluster
TPC-H TPC Benchmark H
CONFIRMATION
I confirm that I independently prepared the thesis and that I used only the references and
auxiliary means indicated in the thesis.
Dresden, May 26, 2017
