University of South Florida

Digital Commons @ University of South Florida
Graduate Theses and Dissertations

Graduate School

February 2020

Algorithms and Framework for Computing 2-body Statistics on
Graphics Processing Units
Napath Pitaksirianan
University of South Florida

Follow this and additional works at: https://digitalcommons.usf.edu/etd
Part of the Computer Engineering Commons, and the Computer Sciences Commons

Scholar Commons Citation
Pitaksirianan, Napath, "Algorithms and Framework for Computing 2-body Statistics on Graphics
Processing Units" (2020). Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/8984

This Dissertation is brought to you for free and open access by the Graduate School at Digital Commons @
University of South Florida. It has been accepted for inclusion in Graduate Theses and Dissertations by an
authorized administrator of Digital Commons @ University of South Florida. For more information, please contact
scholarcommons@usf.edu.

Algorithms and Framework for Computing 2-body Statistics on Graphics Processing Units

by

Napath Pitaksirianan

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy in Computer Science and Engineering
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Yicheng Tu, Ph.D
Adriana Iamnitchi, Ph.D.
Yao Liu, Ph.D.
Hadi Charkhgrad, Ph.D.
Sagar Pandit, Ph.D.

Date of Approval:
March 5, 2020

Keywords: GPGPU, GPU, Database System, Parallel Computing, CUDA
Copyright © 2020, Napath Pitaksirianan

Dedication
To my family and professor who never gave up on me and encouraged me to accomplish
my achievements.

Acknowledgments
I wish to show my gratitude to my advisor, Dr. Yichen Tu, for his guidance and encouragement in this graduate research work. He spends a lot of time and effort to support me in
both academic and non-academic matters. His insights at times helped me whenever I faced
a difficulty.
I am also grateful to the committee members, Dr. Yao Liu, Dr. Adriana Iamnitchi, Dr.
Hadi Charkhgard, and Dr. Sagar Pandit, who express interest in this work and offered their
valuable time throughout this dissertation.
I would like to thank Dr. Jay Ligatti and Dr. Yao Liu for writing a recommendation
letter for me when I joined the Ph.D. Program at the computer science and engineering
department at USF.
Many thanks to Jose Ryan, my supervisor at the technical support team at the computer
science and engineering department USF, who worked with me in the last few years.
I want to thank my mother Pharkwara, my father Bunchai, and two beloved brothers
Jarkpot and Uran for their incredible support and encouragement over the years. I also
want to recognize my friends, the members of the database laboratory, who have worked
throughout this process.
Special thanks to the management and administration at the main office, Department of
Computer science and engineering USF, for their help in supplying the necessary documents
and aid in other matters.

Table of Contents

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .vii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contribution and Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Chapter 2:
2.1
2.2
2.3

Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
GPU Architecture and CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2-Body Statistics (2-BS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Chapter 3:
3.1
3.2
3.3

General 2-BS GPU Algorithms Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Overviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Input Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Algorithms for Pairwise Computation Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.1 Analytical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Data Output Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.1 Output Privatization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.1.1 Analytical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.2 Advanced Output Privatization Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.3 Direct Output Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Additional Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5.1 Load Balancing Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5.2 Tiling with Shuffle Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Dealing with Large Data Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Automated Code Optimization for 2-BS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Evaluation of GPU-Based 2-BS Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.8.1 Data Representation Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.8.2 Evaluation of Pairwise Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.8.3 Evaluation of Complete Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.8.3.1 Output Privatization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.4

3.5

3.6
3.7
3.8

5
5
7
8

i

3.8.3.2 Advanced Output Privatization . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.8.3.3 Direct Output Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.9 Additional Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.9.1 Load balancing technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.9.2 Tiling with Shuffle instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.10 Results of Large Input Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.11 Case Study of Automated 2-BS Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.11.1 Case Study I: Radial Distribution Function (RDF) . . . . . . . . . . . . . . . . . .
3.11.2 Case Study II: Nested-Loop Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

45
47
48
48
48
50
51
51
53

Chapter 4:
4.1
4.2
4.3
4.4

Index-based 2-BS Algorithm design on GPUs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .55
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Index Data Structure on GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Index Construction on GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Index Query Processing on GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4.1 Equality Search Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.2 Ranged Search Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5 Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.5.1 Index Construction on GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.5.2 Index Equality Searching on GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5.3 Index Range Searching on GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6 2-BS Application. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6.1 Equi-Join Operation in DBMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

Chapter 5: Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .81
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Appendix A: Copyright Permissions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

ii

List of Tables

Table 2.1

A list of some 2-BS problems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Table 3.1

Symbols and notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

Table 3.2

Utilization of different GPU resources in running different 2-PCF
kernels under a data size of 512k. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Table 3.3

Utilization of different GPU resources in running different SDH
kernels for a 512,000-point dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

Table 3.4

Number of private copies H and the range of output size for
which H is found to be the optimal choice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

Table 4.1

Utilization of different GPU resources in running different group
size on GPU exact searching under a data size of 500 millions. . . . . . . . . . . . . . 75

Table 4.2

Utilization of different GPU resources in running different group
size on GPU range searching in second steps under a data size
of 500 millions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

Table 4.3

Utilization of different GPU resources in running different group
size on GPU range searching in third steps under a data size of
500 millions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

iii

List of Figures

Figure 2.1

Architecture of a recent Nvidia (e.g., Maxwell, Pascal) GPU . . . . . . . . . . . . . . . . 6

Figure 3.1

Tiling method requires loading data in blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

Figure 3.2

Combining private outputs in all blocks to obtain the final result . . . . . . . . . . 22

Figure 3.3

Modeling results under different numbers of private copies and
sizes of the output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Figure 3.4

A case of direct output buffer for GPU threads, showing Thread
3 acquiring a new page (i.e., page 4) as its output buffer . . . . . . . . . . . . . . . . . . . . 28

Figure 3.5

Two different ways of work assignment to threads in intra-block
pairwise computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

Figure 3.6

Tiling with shuffle instruction technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

Figure 3.7

Pipelining data transmission and kernel execution via CUDA streams.. . . . . 33

Figure 3.8

Decision tree of our 2-BS framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

Figure 3.9

Implementation of the 2-BS code generation framework . . . . . . . . . . . . . . . . . . . . . 36

Figure 3.10

Performance of different data types of vectorized memory access
for computing 2-PCF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

Figure 3.11

Performance of different GPU-based algorithms for computing
2-PCF: total running time and speedup over naive algorithm . . . . . . . . . . . . . . . 39

Figure 3.12

Performance of different GPU-based algorithms for computing
SDH: total running time and speedup over CPU algorithm . . . . . . . . . . . . . . . . . 42

Figure 3.13

Performance of the Reg-RoC-Out kernel under different bin sizes:
running time and occupancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

Figure 3.14

Performance with advanced output privatization: theoretical
(left) and measured (right) running time of SDH kernel. . . . . . . . . . . . . . . . . . . . . 46
iv

Figure 3.15

Output Buffer Management: total running time and speedup
over CPU in computing the item similarity problem . . . . . . . . . . . . . . . . . . . . . . . . 47

Figure 3.16

Performance of Reg-SHM kernel with and without load balancing method: total running time (left) and speedup (right) over
Reg-SHM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

Figure 3.17

Performance of different GPU-based algorithms for computing
SDH: total running time and speedup over CPU algorithm . . . . . . . . . . . . . . . . . 49

Figure 3.18

Performance of different setup for computing SDH: total running
time and speedup over single GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

Figure 3.19

Performance of different GPU-based algorithms for computing
RDF: total running time (left) and speedup over VMD code (right). . . . . . . . 52

Figure 3.20

Performance of NLJ kernel generated from 2-BS framework as
compared to best known NLJ kernel reported in [1] . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Figure 4.1

A typical data structure of B+ Tree on CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

Figure 4.2

Static B+ Tree structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

Figure 4.3

Memory access pattern of three queries.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

Figure 4.4

Overlap of multiple range queries in a single dimension key domain . . . . . . . . 68

Figure 4.5

Range queries in 2D domain, x-axis as starting key and y-axis
as ending key.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

Figure 4.6

Performance of different page sizes GPU-based algorithms for
Tree construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

Figure 4.7

Performance of different GPU-based algorithms for Tree equality
searching: total running time and speedup over CPU algorithm . . . . . . . . . . . . 73

Figure 4.8

Performance of different GPU-based algorithms for Tree equality
searching by step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

Figure 4.9

Performance of different Group size for Tree equality searching:
left two step searching algorithm, right three step searching algorithm . . . . . 74

Figure 4.10

Performance of different GPU-based algorithms for Tree range
searching: total running time and speedup over CPU algorithm . . . . . . . . . . . . 76

Figure 4.11

Tree range searching time for each grouping threshold . . . . . . . . . . . . . . . . . . . . . . 78
v

Figure 4.12

Performance of Index-join kernel as compared to best known
Hash-Join and Sort-Merge-Join kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

vi

Abstract

Various types of two-body statistics (2-BS) are regarded as essential components of lowlevel data analysis in scientific database systems. In relational algebraic terms, a 2-BS is
essentially a Cartesian product between two datasets (or two instances of the same dataset)
followed by a user-defined aggregate. The quadratic complexity of these computations hinders the timely processing of data. Thus using modern parallel hardware has become an
obvious solution to meet such challenges. This dissertation

1

presents our recent work in

designing and optimizing parallel algorithms for 2-BS computation on Graphics Processing
Units (GPUs). The unique architecture, however, provides abundant opportunities for optimizing the algorithm towards better performance and achieving high utilization of hardware
resources. While a typical 2-BS problem can be summarized into a straightforward parallel
computing pattern, traditional knowledge from (general) parallel computing often falls short
in delivering the best possible performance. Therefore, we present a suite of techniques to decompose 2-BS problems and methods for the effective use of computing resources on GPUs.
We also developed analytical models that guided us towards finding the best parameters of
our GPU programs. As a result, we achieve the design of highly-optimized 2-BS algorithms
that significantly outperform the best-known GPU and CPU implementations. Although
2-BS problems share the same core computations, each 2-BS problem, however, carries its
own characteristics that calls for different strategies in code optimization. For that, we developed a software framework that automatically generates high-performance GPU code based
1
Portions of this dissertation was previously published in Distributed and Parallel Databases Journal
(DAPD), Volume 37:587–622 DOI: https://doi.org/10.1007/s10619-018-7238-0 and in Proceedings of the
45th International Conference on Parallel Processing (ICPP), Philadelphia, PA, USA.,pp. 380-385, DOI:
10.1109/ICPP.2016.50. Permission is included in Appendix A.

vii

on a few parameters and short primer code inputs. We further present two case studies to
demonstrate that code generated by this framework reaches a very high level of efficiency.
In addition to the general problem, we also studied a particular group of 2-BS problems, in
which the computation can be reduced by using an index structure. Whereas the traditional
knowledge of the index tree structure cannot utilize the full performance of GPUs, we present
a technique to optimize the index searching for GPUs. The GPU index-searching is verified
by applications of 2-BS, which show the very high performance.

viii

Chapter 1: Introduction

1.1

Overview
Handling analytical workloads efficiently is a major challenge in today’s scientific do-

mains. Recent studies show an increasing interest in developing database systems for handling scientific data [2, 3, 4, 5]. Traditional DBMSs still fall short of algorithms and strategies
to satisfy the special needs of scientific applications, which are different from those in traditional databases in their data types and query patterns. In addition, designing efficient
algorithms for data query and analysis are still the main challenge in scientific areas. Moreover, supporting complex mathematical functions in DBMS has become an active research
area. A good example is the integration of linear algebra with DBMS in a recent ICDE
awarded paper [6].
In this dissertation, low-level analytics that is frequently used in various applications
is discussed, namely, the 2-body statistics. Bearing many forms and definitions, 2-body
statistics (2-BS) as referred to in this paper, is a type of computational pattern that evaluates
all pairs of points among an N-point data set. Therefore, in relational algebraic terms, a
2-BS is essentially a Cartesian product between two datasets (or two instances of the same
dataset) followed by a user-defined aggregate.
In general, a 2-BS can be computed by solving a distance function between all pairs
of datum. Although the distance function only demands constant time to compute for a
pair of data points, the total running time is quadratic to the data size. The first line of
defense is obviously better algorithms with lower complexity. For example, it is well known

1

that sort-merge, hash, or index-based algorithms are used to compute relational joins in
database systems. Some studies [7] also used quad-tree and a batching technique to reduce

the complexity of SDH computing to O N 1.5 . On the other hand, parallel computing techniques can be utilized to speed up the computation in practice, and is the main topic of this
dissertation. In the context of 2-BS problems, parallel computing techniques are extremely
useful for two reasons: (1) particular types of 2-BS lack efficient algorithms. For example,
kernel functions for Support Vector Machine (SVM) [8] and pairwise comparison in various
applications [9, 10] can only be solved in quadratic time. Another example is relational joins
– although sort-merge, hash, and index-based algorithms are preferred for processing equijoins, nested-loop join is the better choice for joins with complex non-equality conditions.
(2) Performance of more advanced algorithms can be further improved via parallelization.
For example, efficient join algorithms such as hash join still require complete pairwise comparison of data (e.g., within the same bucket of the hash table), for which parallel programs
[11] have shown great success. With that in mind, this work focuses on novel techniques
to implement and optimize parallel algorithms for computing 2-BSs on modern Graphics
Processing Units (GPUs).
By providing massive computing power and high memory bandwidth, GPUs have become
an integrated part of many high-performance computing (HPC) systems. Originally designed
for graphics processing, the popularity of general-purpose computing on GPUs (GPGPU)
has boosted in recent years with the development of software frameworks such as Compute
unified device architecture (CUDA) [12] and Open Computing Language (OpenCL) [13].
As today, GPUs is one of a designable platform for many compute-intensive applications
[14, 15, 16, 17], including the data analytics systems [18, 19, 20], and join operation [11, 21].
Due to the compute-intensive nature of 2-BS problems and the fact that the main body
of computations can be done in a parallel manner for most 2-BSs, GPUs stand out as a
desirable platform for implementing 2-BS algorithms.
2

However, GPU algorithms for only a few 2-BS problems have been studied (see Section
2.3 for details). In addition to the surprisingly little attention paid to this topic, existing
work lacks a comprehensive study of the necessary techniques to achieve high performance
on GPUs. This is a non-trivial task because code optimization has to consider architectural
and (system) software features of modern GPUs that are found to be more complex than
those in multi-core CPUs. As a result, traditional wisdom from (general) parallel computing
often falls short in delivering the best possible performance.

1.2

Contribution and Outline
In this dissertation, techniques and frameworks for 2-BS problems on GPUs are presented.

We introduce multiple techniques of a quadratic computation to compute 2-BS problems on
GPUs, which able to solve any 2-BS problem. We call this quadratic computation on 2-BS as
General 2-BS computation. Then, we study a particular group of 2-BS problem, in which the
problem can utilize an index structure to reduce the computation. We define this calculation
as Index-based 2-BS computation.
In chapter 2, we comprehensively review literature in 2-BS problems and related work in
GPUs. We explore the architecture of GPUs, which included the Compute Unified Device
Architecture (CUDA) model, GPU scheduler, and the cache system of GPUs. We review
multiple 2-BS problems and go over the definition of the problems. Then, we review the
knowledge of index structure, which use in Index-based 2-BS computation. Finally, we
explore the prior work of 2-BS in GPUs.
In chapter 3, we provide a technique and framework to compute general 2-BS computation on GPUs. We begin our study by classifying 2-BS problems into three categories,
in which each group of 2-BS problems generates a similar type of output. We define the
straightforward algorithm to compute 2-BS on GPUs, which comes from general knowledge
of parallel computing. Then, we optimize the algorithm by introducing multiple techniques
3

to reduce the bottleneck of the algorithm. We define a mathematic model to configure parameters of GPUs. Moreover, we present a framework to select the best technique for each
2-BS application by a few parameters. To verify our techniques and framework, we benchmark our technique with a verity of 2-BS application, such as spatial distance histogram
(SDH), two-points correlation function (2-PCF), item similarly, the nested loop join, and
radial distance function (RDF).
In chapter 4, we discuss Index-based 2-BS computational. We introduce an index structure which suite for GPUs. We present a new technique for index query processing on GPUs.
Both equality search and range search queries have been discussed in this chapter. Then, we
apply an index structure on 2-BS problems as a verification of our techniques.
Finally, in chapter 5, we conclude this dissertation and discuss the possible future studies
in this area.

4

Chapter 2: Literature Review

2.1

GPU Architecture and CUDA
In this section, we briefly introduce the architecture of modern GPUs. We use the latest

generation of Nvidia GPU product as an example (Fig 2.1). We believe such information
is essential in our discussions of (parallel) algorithm design in this paper. Readers already
familiar with GPU architecture can skip this section.
A GPU contains many processing units (cores) for handling complex graphics-related
computation. A group of cores is organized into a multiprocessor and a GPU can have tens
of multiprocessors. A GPU contains a few GBs of global memory built on high-speed memory
technology such as GDDR5 and HBM2. The CPU (i.e., host) can transfer data to the global
memory over a PCI-E link. Global memory can be accessed by different multiprocessors
simultaneously at a bandwidth up to 900 GB/sec [22]. Each multiprocessor also provides
high-speed programmable shared memory of size 64-96KB. The use of shared memory is
under full control of the programmers. There are also the programmable read-only data
cache (also named texture memory), which was first introduced in the Kepler Architecture,
for holding data that cannot be overwritten during the lifespan of the program, as well as
the non-programmable L1 cache (within each multiprocessor) and L2 cache (shared by all
multiprocessors).
On the software side, the CUDA programming model allows a large number of threads
to be launched to compute a function (called kernel ) in parallel. The entire collection of
threads (named grid ) are organized into groups (called blocks), therefore each thread can be

5

GPU Device
Multi-processors
Instruction Cache

Core

Core

Core

Core

Core

Core

Shared Memory

Host Device Memory

Core

Global Memory

Core

L2 Cache

Register File

L1 Cache /
Read-Only Data Cache

Figure 2.1: Architecture of a recent Nvidia (e.g., Maxwell, Pascal) GPU

identified by a block ID and thread ID within the block. In the CUDA runtime environment,
all threads in a block will be executed in the same multiprocessor. On the other hand,
one multiprocessor can execute multiple blocks. However, only a small number of threads
(called a warp) are guaranteed to run at the same time. Each warp contains 32 threads with
consecutive thread IDs. In a warp, each thread has its own registers, and the threads are
executed in a single-instruction-multiple-data (SIMD) manner.
Over the years, the architecture of Nvidia GPUs has evolved through several generations: Fermi [23], Kepler [24], Maxwell [25], Pascal[26], Volta [22], and Turing [27]. Newer
architectures provide more computing resources. Moreover, new functionalities and features
in the CUDA framework have been introduced over the different generations of GPUs. For
example, starting from Kepler, shuffle instructions can be used to exchange data in registers
among threads in the same warp. Kepler also allows launching kernels within an existing
kernel via a mechanism called dynamic parallelism. Such features provide new opportunities
for improving program efficiency and also impose extra challenges to algorithm design and
implementation.

6

2.2

2-Body Statistics (2-BS)
As we mentioned earlier in this paper, we refer to 2-BS as a computational pattern that

evaluates all pairs of points among an N-point dataset. This computation pattern can be
done either within a single dataset, or between two datasets. For every pair of data points,
a 2-BS computes a distance function between the pair. By iterating over all of the pairs, the
problem can be solved with total complexity O(N 2 ).
The 2-BS type of functions are found popular in many scientific domains, with numerous
concrete examples. The following are names of 2-BS that compute all pairs of points within
a single dataset: 2-tuple problem [28], all-point nearest-neighbor problem [28], pairwise
comparisons for time series motifs [29], nonparametric outlier detection and denoising [28],
kernel density regression [28], two-point angular correlation function [30], 2-point correlation
function [28], and spatial distance histogram (SDH) [7], to name a few. Another flavor of
2-Body statistic takes two different datasets as input, examples include Radial distribution
function (RDF) [31] and relational joins [11]. Table 2.1 lists some examples of 2-BS.
In practice, there are many applications of 2-BS problems. A common practice in many
application domains is to use various distance measures (e.g., Euclidean, Jaccard, and cosine distance) to find the similarity of all pairs of input datum. One important example is
recommendation systems for online advertising that predicts the interest of customers and
suggests correct items. Jensen et al. reports a music predictive model [32] based on pairwise
comparisons of Gaussian process priors between music pieces. There are two types of recommendation systems: content-based filtering and collaborative filtering [9, 10]. Both require
2-BS computation: content-based filtering depends on pairwise comparisons between items
and collaborative filtering depends on those between users.

7

Table 2.1: A list of some 2-BS problems
2-BS Problem Name
2-point correlation function

Kernel density regression

Two-point angular
correlation function

Spatial distance histogram
(SDH)

Radial distribution function

Relational joins

Similarity of all pairs
Recommendation systems

2.3

Output Description
Counting the number of pairwise
distance that is less than a constant r
Field of study: Astronomy
A statistic value that is defined by
P
K i w(|x − xi |) where K is a normalizing
constant and w is a weighting function
Field of study: Machine Learning
The probability of finding an
object within a given angular distance
Field of study: Astronomy
A histogram that shows the
distribution of all pairwise distances
within a set of atoms.
Field of study: Molecular physics
SDH normalized by density of the system
multiply by volume of the spherical shell
Field of study: Molecular physics
List of tuple pairs that satisfy a
given condition
Field of study: Database system
List of data point pairs that are similar
according to a given distance function
Field of study: Data mining
List of items that are related to (many) users
Field of study: Data mining

Related Work
The past few years witnessed a strong movement of using GPGPU for solving scientific

computing problems and numerous reports on such are generated. Surprisingly, there are
only a few reports on computing 2-BSs on GPUs. As a part of efforts to parallelize relational
joins, He et al. implemented various functionally-equivalent CPU-base join algorithms on
GPUs [11]. The algorithm was designed to take advantage of early generations of CUDA
8

framework. They utilized on-chip memory on GPU and handled undefined size of join result
by doing twice of computation: once to calculate the output size, and second to do actual
output. The report showed a 7X speedup over CPUs. Similar results are presented in [33].
Rui et al. [1] utilized new feature of GPU for nested loop join (e.g., Read-only data cache,
large L2 cache, and shuffle instructions) and reported speedup 20X over CPUs. Pirk et al.
[34] and Karnagel et al. [35] studied balancing workload of database system to between CPUGPUs via PCI-E Bus. The above work is often benchmarked against CPU-based parallel
joins. For example, He et al. [36] proposed cache-oblivious nested-loop joins by grouping
related data together to get better spatial locality and thus higher CPU cache hit. Kim et
al. [37] implemented optimized sort-merge join and hash join on a multi-core CPU system.
They improved data parallelism by taking advantage of the SIMD instructions. Albutiu et
al. [38] designed a massively parallel sort-merge join on Multi-Core CPU where each thread
only works on its local sorted partitions in a non-uniform memory access (NUMA) system.
Levine et al. [31] studied GPU-based processing of RDF, of which the main task is to
compute a histogram of all point-to-point distances. They used data privatization techniques
via constant memory and shared memory to speed up the algorithm. Constant memory
is high speed on-chip memory on GPUs with a drawback. In order to load new data to
constant memory, the CUDA kernel needs to stop and be relaunched. Ponece et al. [39] used
tiling and privatization via shared memory with two-point correlation function in cosmology
application. They reported a speedup of up to 100x over single-thread CPU code. However,
no details of their implementation was reported in the paper. More recently, Stratton et
al. sketched tiling and privatization techniques in computing two-point angular correlation
function [30], but again, no technical details were reported.
Several work presented techniques related to analysis and handling data output in computing other related problems on parallel computing platforms.

Karnagel et al.

pro-

posed GPU-based algorithms for computing group-by and aggregates that are often seen
9

in database systems [40]. Similarly, they keep copies of output data in GPU cache. However, their implementation is based on caching the entire hash table, which is large and
requires grid-level synchronization. For that, they proposed to use L2 cache that is slower as
compared to other caches such as shared memory. Ye et al. [41] studied the same problem
with multi-core CPU as the implementation platform. They implemented a partition-andAggregate algorithm that focuses on CPU cache utilization. Kaldewey et al. [42] utilized
Unified Virtual Addressing (UVA), which allows the GPU to access the host memory directly
on modern GPUs, to manage memory efficiently.
There are other related work on improving the performance of individual 2-BS computations via algorithms with complexity lower than quadratic. For computing SDH, the
state-of-art work reduces point-to-point computations to pairwise computation among nodes
in a spatial tree structure [43, 44]. Such strategies can reduce complexity from quadratic to
θ(N 3/2 ) for 2D data and θ(N 5/3 ) for 3D data. Their main idea is to group data in a local
region into a tree node, then pairwise comparisons of tree nodes (instead of individual particles) are conducted. Therefore, this type of computation requires a suitable index-structure
on GPUs. However, the core procedure of pairwise comparison remains the same, and we
could still parallelize the pairwise computation by GPUs. Other areas of studies, the complexity of Equi-Join, similarity search, and similarity join can be reduced by using index
structure searching, which reduces the cost of searching to the height of an index structure
tree [45]. The tree structure reduces the searching area of the data, which leads to smaller
data pairs of comparison. However, to utilize this technique on GPUs, we require a suitable index structure on GPUs which able to run a parallelize index-search query with high
performance.
In recent decades, there have been many efforts to parallelize index-search query on a
multi-core system, included GPU. Kim et al. proposed FAST, a static binary-tree for multicore systems [46]. They explored thread-level and data-level parallelism on both GPU and
10

CPU. They reported that FAST can handle 50 million (CPU), and 85 million (GPU) queries
per second. Later, Kaldewey et al. introduced P-ary search which utilizes all threads within
a thread block collaboratively to accelerate a single search query [47]. Recently, Shahvarani
et al. presented HB+Tree which utilizes the heterogeneous platform to search on B+ Tree
[48]. The report showed a 2.4X speedup over a CPU-optimized solution. Kaczmarski et al.
[49, 50] proposes a GPGPU based B+Tree structure for single-key search and batch search
on GPU. They reported 100X faster than FAST GPU index base and 10X faster than CPU
STX library [51]. However, most of the previous studies in index searching still underperform
the actual GPUs power, and the important operation, which is a range search query, is still
missing from the studies.
Unlike the aforementioned work focusing on individual problems and techniques, our
work aims at a comprehensive study of the multitude of techniques that can be used for the
development and optimization of GPU-based 2-BS algorithms. We propose a series of novel
techniques for minimizing overhead and increasing resource utilization, a few techniques
are never seen in previous work. All types of GPU on-chip memory are exploited for the
best results of caching input/output. The GPUs index structure, which can utilize in 2-BS
computation, is also explored.

11

Chapter 3: General 2-BS GPU Algorithms Design

In this chapter, we elaborate on the GPU algorithm design on general 2-BS computation,
which using a quadratic computation method; however, able to solve any 2-BS problem.
First, we discuss the input data representation of the GPU system. After that, we focus
on pairwise computation which focuses on loading data into multiprocessors via caching.
Then, we explain techniques to handle the output from each thread, and introduce additional
techniques that can be used by our algorithm in special cases. Next, we present an algorithm
for processing 2-BS problems on multiple GPUs and an automated code framework of 2-BS,
which combines all of our techniques. Finally, we evaluate our techniques and framework
with a verity of 2-BS applications.
Symbols/notations used in this chapter are listed in Table 3.1.

3.1

Overviews
A straightforward GPU algorithm for computing 2-BS is shown as Algorithm 1. Note the

pseudocode is written from the perspective of a single thread, reflecting the Single-ProgramMultiple-Data (SPMD) programming style of CUDA. Each thread loads one datum to a
local variable, and uses that to loop through the input dataset to find other data points
for the distance function computation. The output will be updated with the results of each
distance function computation.1
1

We focus our discussions on 2-BSs defined over a single dataset with commutative distance function (i.e.,
only one function call is needed for every pair of points). Therefore, the point with index i is only paired
with all data points beyond position i. Note there are cases where 2-BS is defined between two different
datasets (e.g., relational join) or with non-commutative distance function (e.g., SVM kernel functions). We
will mention them in coming sections as needed.

12

Table 3.1: Symbols and notations
Symbol

Meaning

CGAc
CGW
CGR
CShmAc
CShmW
CShmR
HS
N
B
M

latency of atomic operations in Global memory
latency of writing to Global memory
latency of Reading to Global memory
latency of atomic operation in Shared memory
latency of Writing to Shared memory
latency of Reading to Shared memory
Histogram Size or Output Size
Number of input datum
Block size
Number of blocks

Algorithm 1: Generic GPU-based 2-BS algorithm
Local Var: t (Thread id)
1: currentPt ← input[t]
2: for i = t + 1 to N do
3:
d ← DisFunction(currentPt, input[i])
4:
update output with d
5: end for

To optimize the above 2-BS algorithm, the challenges can be roughly summarized as
those in dealing with the input and output data, respectively. First, each input datum
i will be read many times (by different threads into registers) for the distance function
computation. Therefore, the strategy is to push the input data into the cache as much as we
can. The many types of cache in GPUs, however, complicates the problem. Second, every
thread needs to read and update the output data at the same time. Updating the output
data simultaneously might cause incorrect results. Recent GPUs and CUDA framework
provide atomic instructions to ensure correctness under concurrent access to global and
shared memory locations. However, an atomic instruction also means sequential access to
the protected data thus lowers performance. As a result, clever strategies are needed to
avoid update collisions as much as possible.

13

Given that, there is a need to characterize the multitude of 2-BS cases based on the
computational paths. This helps us to determine the proper combination of techniques we
can use for optimizing individual 2-BS problems. We found that the 2-BS we have studied are
very similar at the point-to-point distance function computation stage. However, members
of the 2-BS family tend to have very different patterns in the data output stage. We have
identified three groups of 2-BSs based on the output pattern, and will introduce different
techniques in dealing with these types.
In Type-I, this type of 2-BS generates a very small amount of output data from each
thread. These output must be small enough to be placed in registers for each thread. For
example, 2-point correlation function [28], which is fundamental in astrophysics and biology,
outputs a number of pairs of points that determine correlation in dataset. Other examples
are all-point k-nearest neighbors (when k is small) and Kernel density/regression[28], which
output classification results or approximation numbers from regression.
In Type-II, the output in this group is too big for registers but are still small enough
to be put into GPUs’ shared memory. Examples include: (1) Spatial distance histogram
(SDH) [7], which outputs a histogram of distances between all pairs of points; (2) Radial
distribution function (RDF) [31], which outputs a normalized form of SDH.
In Type-III, the size of the output for this group can be large, so they can only be put
into global memory. In some extreme cases, the size of the output is quadratic to the size
of input. Some examples are: (1) relational join [11], which outputs concatenated tuples
from two tables - total number of output tuples can be quadratic (especially in non-equality
joins); (2) Pairwise Statistical Significance [52], which is statistical significance of pairwise
alignment between two datasets and generates large output; and (3) Kernel methods which
compute kernel functions for all pairs of data in the feature space [8].
Table 2.1 shows information of various 2-BS examples, and the category each example
belongs to.
14

3.2

Input Data Representation
Before we discuss algorithmic design, we first present data structures for loading input

data. First of all, the input data is stored in the form of multiple arrays of single-dimension
values instead of using an array of structures that each holds a multi-dimensional data
point. This will ensure coalesced memory access when loading the input data. Moreover,
we vectorize each dimension array by loading multiple floating point coordinate values in
one data transmission unit. In particular, we use the float2, float3, and float4 data types
supported in CUDA for such purposes. This reduces the number of memory transactions
and thus the number of load instructions in code. Furthermore, vectorized memory access
also indirectly increases instruction level parallelism (ILP) by unrolling a loop to calculate
all pairwise distances between two vectors. Thus, in the remainder of this paper, a datum
means a vector of multiple data points. Also a distance function call between two datum
actually computes all pairwise distances.
As mentioned above, CUDA supports three vector floating point data types – float2,
float3, and float4, which holds 2, 3, and 4 regular floating point numbers, respectively.
In general, a wider vector yields higher memory bandwidth utilization, but also increases
register use in the kernel, which in turn reduces warp occupancy. Therefore, we need to find
a balance point between register usage and memory bandwidth. In this paper, the most
suitable data type is determined by experiments (Section 3.8.1).

3.3

Algorithms for Pairwise Computation Stage
Now we present design strategies in the pairwise distance function computation stage.

Due to the high latency of data transferring between the global memory and cores, our goal
is to reduce the number of data reads from global memory. In particular, we use the wellknown tiling method [53] to load data from the global memory to on-chip cache. Whenever

15

...

Input Data

Block L
Intra-block
Computaion

Cache

Global Memoy

Cache Block R

Inter-block
Computation

Figure 3.1: Tiling method requires loading data in blocks

two data points are used as inputs to the distance function, they are retrieved from cache
instead of the global memory.
Figure 3.1 illustrates the tiling idea. We divide input data into small blocks, and the
size of a block ensures it can be put into cache (we will discuss scenarios of loading to
different types of cache later). Normally, the data block size is the same as the number of
threads in each CUDA block. Each thread loads one vector of input data into the cache to
ensure coalesced access to the global memory. With blocks of data loaded to cache, the main
operation of the algorithm is now to compute distance function between two different blocks
of data (inter-block computation). Algorithm 2 shows the pseudo code of the tiling-based
algorithm. Basically, each thread block first loads an anchor block L, and loads a series of
other blocks R. Then, compute distance functions between all pairs of datum of inter and
intra blocks.
To implement the above algorithm, an important decision to make is: which cache do we
use to hold both blocks L and R? There is no straightforward answer since there are multiple
cache systems in the Nvidia GPUs. By ignoring the non-programmable L2 cache, we still
have the programmable shared memory and read-only data cache (RoC), both have TBpslevel bandwidth and a response time of just a few clock cycles [54, 55, 56]. According to
[54, 56], programmable shared memory has the lowest latency in GPUs (i.e., about 21 clock
cycles in Nvidia Maxwell), it is natural for us to use shared memory to hold both blocks

16

Algorithm 2: Block-based 2-BS computation
Local Var: t (Thread id), b (Block id)
Global Var: B (Block size), M (total number of blocks)
1: L ← the b-th input data block loaded to cache
2: for i = b + 1 to M do
3:
R ← the i-th input data block loaded to cache
4:
syncthreads()
5:
for j = 0 to B do
6:
d ← DisFunction(L[t], R[j])
7:
update output with d
8:
end for
9: end for
10: for i = t + 1 to B do
11:
d ← DisFunction(L[t], L[i])
12:
update output with d
13: end for

L and R, and this can be viewed as a starting point for our discussions. Let us call this
baseline technique as LRshm.
By taking a closer look at Algorithm 2, we found that each datum will have to be placed
into registers before it can be accessed by the distance function, and each thread only accesses
a particular datum throughout its lifetime. Therefore, it is more efficient by defining a local
variable for each data member of block L. By using a local variable in CUDA, such a variable
will be stored and accessed in registers. This will reduce the consumption of shared memory
in each thread – shared memory is a bottlenecking resource when we consider large data
output (Section 3.4). Plus, latency of accessing registers is just one clock cycle [53]. Note
that the same argument does not hold true for block R: all data in block R is meant to be
accessed by all threads in the block but a register is private to each thread. Therefore, we
have to load R into cache. Given that, we introduce another technique, named Register-SHM,
which improves LRshm by using registers to hold one datum from block L, and allocating
shared memory to hold block R. The program also needs another change to handle the
intra-block distance computation (lines 10 to 13 in Algorithm 2: such computation requires

17

Algorithm 3: Optimized Block-based 2-BS
Local Var: t (Thread id), b (Block id)
Global Var: B (Block size), M (total number of blocks)
1: reg ← the t-th datum in the b-th input data block
2: for i = b + 1 to M do
3:
R ← the i-th input data block loaded to cache
4:
for j = 0 to B do
5:
d ← DisFunction(reg, R[j])
6:
update output with d
7:
end for
8: end for
9: L ← the b-th input data block loaded to cache
10: for i = t + 1 to B do
11:
d ← DisFunction(reg, L[i])
12:
update output with d
13: end for

threads to access all datum in block L. For that, we now have to load block L to shared
memory before we run the last for loop. But the technique we use is: instead of asking for
a new chunk of shared memory for L, we overwrite the space we just used for block R. By
that, the total shared memory used is still one block. Algorithm 3 shows a new version of
Algorithm 2 enhanced by the Register-SHM technique.
We also explore another solution that further relieves the bottleneck of shared memory
by storing the block in RoC. Although this solution may not yield higher performance in
the distance computation stage, it is meaningful if we have to use shared memory for other
demanding operations (e.g., outputting results, Section 3.4). This solution basically does
not change the code structure of the second solution (with use of registers). However, we
use the RoC instead of shared memory to store blocks R (for inter-block computation) and
L (for intra-block computation). RoC has higher latency than the shared memory [54] (i.e.,
about 64 clock cycles higher in Nvidia Maxwell) but it is still an order of magnitude faster
than global memory. As a side note about implementation, RoC is not fully programmable

18

as the shared memory, but we can use the “const

restrict ” keyword combination before

a variable to ask CUDA runtime framework to store the variable into the RoC.

3.3.1 Analytical Evaluation
In order to robustly compare the performance of the proposed algorithms, we present an
analytical model of the number of accesses to different types of GPU memories during the
execution of these algorithms. The Naive algorithm does not benefit from shared memory
and RoC, and just uses global memory. The number of accesses to global memory for this
algorithm is:

N+

PN −1
i=1

(N − i)

(3.1)

Since each access costs the highest latency compare to other memory on GPU, this algorithm
is very costly. However, other improved algorithms take advantage of faster cache and yields
better performance. SHM-SHM and Register-SHM both use shared memory plus global
memory, and Register-RoC uses RoC and global memory. These three mentioned algorithms
have the same number of accesses to global memory, which equals to:

N+

PM −1
i=1

(M − i)B

(3.2)

Having the same number of accesses to global memory, we should consider other memory
accesses for the sake of comparison. In SHM-SHM algorithm, number of accesses to shared
memory is:

2

PM −1
i=1

(M − i)B 2 + 2

PB−1
i=1

(B − i)M

(3.3)

19

In Register-SHM algorithm, number of accesses to shared memory is dropping by half from
SHM-SHM, that is why it is faster than SHM-SHM. On the other hand, Register-RoC exploits
data cache instead of shared memory and the number of accesses to this memory is the same
as the number of accesses of Register-SHM to share memory. Considering corresponding
access time of each GPU memory and the number of accesses to that memory by each
algorithm, it’s clear that Register-SHM outperforms the others.

3.4

Data Output Stage
In this section, we present techniques to efficiently output the results from GPUs in

2-BS computing. Depending on the features of data output, the design strategy on this
stage can be different for various 2-BSs. The simplest type is that each thread omits a
very small amount of output (e.g., Type-I) – we simply use automatic (local) variable(s) to
store an active copy of the output data in registers, and transmit such data back to host
when kernel exits. For problems with medium-sized output (e.g., Type-II), we use shared
memory to cache the output. We present novel data privatization techniques to handle
these output. For problems with very large output size (e.g., Type-III ), we have to output
results directly to global memory. The main problem for using global memory for output
is the synchronization required by supporting different threads write into the same memory
location. To avoid incorrect results, atomic instructions are used in GPUs to have protected
accesses to (global) memory locations. In CUDA, such protected memory location is not
cached and obviously cannot be accessed in a parallel way. Therefore, it renders very high
performance penalty to use atomic instructions when threads frequently access the same
memory address. For that, we present a direct output buffer (Section 3.4.3) mechanism to
minimize such costs. Note that our paper focuses on 64-bit output data type.

20

3.4.1 Output Privatization
Data privatization is frequently used in parallel algorithms to reduce synchronization
[30]. For our problems, we store private copies of the output data structure to be used by a
subset of the threads in the on-chip cache of GPUs. The RoC cannot be used here since it
cannot be overwritten during the lifespan of the kernel. That leaves the shared memory the
only choice. By this design, the data output is done in two stages: (1) whenever the distance
function generates a new distance value, it is used to update the corresponding location
of the private output data structure via an atomic write. Although this still involves an
atomic operation, the high bandwidth of shared memory ensures minimum overhead; (2)
when all distance functions are computed, the multiple private copies of the output array
are combined to generate the final output (Figure 3.2). Here we assume the final output
can be generated using a parallel reduction algorithm such as the one presented in CUDA
thrust library. Algorithm 4 shows a new version of Algorithm 3 enhanced by the output
privatization technique.
In our initial implementation, we use one private copy of the output for each thread block.
By this, synchronization only happens within a thread block, and the bandwidth of the shared
memory can effectively hide the performance overhead. We will discuss advanced techniques
that involve more private copies in a block in Section 3.4.2. In the output reduction phase,
private outputs on shared memory are first copied (in parallel) to global memory, which is
in global scope and can be accessed by other kernels. Then a reduction kernel is launched
to combine the results into a final version of output array. This kernel is configured to have
one thread handle one element in the output array.

21

Algorithm 4: SDH with Output Privatization
Local Var: t (Thread id), b (Block id)
Global Var: B (Block size), M (total number of blocks)
1: SHM Out ← Initialize shared memory to zero
2: reg ← the t-th datum of b-th input data block
3: for i = b + 1 to M do
4:
R ← the i-th input data block loaded to cache
5:
syncthreads()
6:
for j = 0 to B do
7:
d ← DisFunction(reg, R[j])
8:
atomicAdd(SHM Out[d] , 1)
9:
end for
10: end for
11: L ← the b-th input data block loaded to cache
12: syncthreads()
13: for i = t + 1 to B do
14:
d ← DisFunction(reg, L[i])
15:
atomicAdd(SHM Out[d] , 1)
16: end for
17: syncthreads()
18: Output[b][t] ← SHM Out[t]

Each bin is combined
by reduction algorithm

Output Size

# Block

Final Result

Atomic
Writing
Stage

Reduction
Stage

Figure 3.2: Combining private outputs in all blocks to obtain the final result

22

3.4.1.1

Analytical Evaluation

Here we also present an analytical model of the performance during the process of output
writing. We can easily see that such cost in the naive algorithm is N 2 CGAc . As we know
that the latency CGAc is high, the naive algorithm is heavily bound by the time of writing
output. In the privatization-based algorithm, there are two stages in output writing: the
update stage and reduction stage. In the update stage, the cost of memory access is:

PM

i=1 (N

+ B − i)CShmAc

(3.4)

and that in the reduction stage is:

Hs [M (CGW + CShmR + CGR ) + CGW ]

(3.5)

Via the privatization technique, it is obvious that number of accesses to global memory
drops significantly from the naive algorithm. Specifically, the naive algorithm needs to access
global memory every time a result is generated from pairwise computation, but privatization
requires access only in the reduction stage. In other words, such a number decreases from N 2
to HS [2M + 1]. Moreover, we read and write to global memory without atomic instructions
in the reduction stage. In addition, the atomic writes are done in shared memory, which
bears a much lower overhead. These show that our method of writing private outputs can
significantly outperform the naive algorithm.

3.4.2 Advanced Output Privatization Method
So far we have presented a straightforward privatization method in which one private
copy of the output is used per thread block. Note that synchronization still exists when
different threads in the block write into the same output address. If the output size is small

23

Algorithm 5: 2-BS with Advanced Output Privatization
Local Var: t (Thread id), b (Block id), l (lane id)
Global Var: Hsize (Output size), Hnum (number of private copies)
1: laneID = t & 0x1f
2: initial Output
3: for each pair of pairwise computation do
4:
x ←2-BS Computation Stage
5:
atomic update Output[Hsize ∗ (laneID%Hnum ) + x]
6: end for
7: Output Reduction Stage

enough to allow multiple private copies for the same block of threads, the probability of
collision in atomic operations will decrease, leading to better efficiency of parallelization. To
realize this idea, there are two problems to solve: (1) how to assign threads (within a block)
to the multiple private copies; and (2) how to determine the exact number of required copies.
As to the first problem, it is natural to assign threads with continuous thread IDs to a
copy of temporary output. For example, with two private copies in each threads block of size
B, threads with IDs in [0, B/2) share the first copy and those with IDs in [B/2, B) access the
second copy. However, we found that this method does not further improve the performance
of the kernel. This is due to the run-time thread scheduling policy in CUDA: every 32
threads with consecutive IDs (called a warp) is the basic scheduling unit. In other words,
only threads in the same warp are guaranteed to run at the same time thus face the penalty
of collision due to atomic operations. Threads in different warps do not suffer from this
issue; thus, assigning them to different output copies does not help. Therefore, our idea is to
assign threads with the same offset of IDs in the warp to an output copy. Going back to the
previous example, we now assign threads with even-numbered IDs in all warps to share the
first output copy and those with odd-numbered IDs to the second copy. Algorithm 5 shows
details of this enhanced method: each private output is shared by threads whose IDs have
the same 5 least significant bits (called laneID). Upon completing a distance computation,
each thread updates its corresponding copy of the output (line 5).
24

The second problem (i.e., finding the best number of private outputs per block) is nontrivial: more copies will decrease the chance of collision in atomic writes, but may decrease
the number of threads running simultaneously due to the limited size of shared memory. The
impact of the latter has been studied in our previous work [57]. Based on that, we develop
an analytical model to quantify both effects and find the balance point that leads to maximal
kernel performance. We start with the following performance model for compute-intensive
kernels shown in [57].
&
R=

T
α × MP

'
(3.6)

where R is the number of rounds that takes to schedule all thread blocks in the hardware,
T = M ×B is the total number of threads, α is the number of threads that can be run in each
round in a single multiprocessor (a.k.a. thread occupancy), and MP is the total number of
multiprocessors in a GPU. Note that CUDA allows a large number of blocks to be launched
yet there are only a small number of (i.e., 30 in a Titan XP) multiprocessors in a GPU
device. Thus, the number of rounds is obviously determined by the occupancy, as all other
quantities in Equation (3.6) are constants. The occupancy, in turn, is affected by the use of
common resources for each block, which in our case is shared memory and is determined by
the number of private output copies. Due to page limitation, we skip the model describing
the relationship between occupancy and shared memory use developed in [57]. Instead, we
plot the occupancy calculated from the model in Figure 3.3 (left subfigure). As we can see,
kernel occupancy drops dramatically with the increase of number of copies and output size.
Based on such results, we now develop the analytical model.
Let L be the latency of running a single round, we obviously have R × L to be the total
kernel running time. In our case, latency is dominated by the time each thread idles due
to the conflict of atomic operations. Let k denote the number of threads sharing the same

25

1

0.6

0.4

0.2

0

1H
2H
4H
8H
16H
32H

25
Waiting time ( CL )

0.8

Occupancy

30

1H
2H
4H
8H
16H
32H

20
15
10
5

0

200

400
600
Output Size

800

1000

0

0

200

400
600
Output Size

800

1000

Figure 3.3: Modeling results under different numbers of private copies and sizes of the
output. Left: thread occupancy as given by the model developed in [57]; Right: latency
given by our model

private output in a warp (thus causing a conflict), latency can be then defined as a function
of k

L(k, CL ) =




C

L



pk CL + (1 − pk )P

k=1
(3.7)
k>1

Specifically, if each thread in a warp has its own private output (k = 1), there should be
no conflict and we denote the latency under this (ideal) situation as CL . If multiple threads
share a private output, latency is determined by the probability of seeing a collision-free
warp (pk ) and a penalty of collision P , which can be defined as

P = L(k − 1, CL + CLP )

(3.8)

26

In other words, L becomes a recursive function defined over a higher latency time CL + CLP
and fewer conflicting threads k − 1. Again, pk is the probability that all threads in the same
warp access different address locations in the outputs. This can be modeled as a classic
birthday problem [58], and we have:

pk =

HS −1
HS

×

HS −2
HS

×

HS −3
HS

× ... ×

HS −(k−1)
HS

(3.9)

This says that the first thread can update any address, the second thread can update any
address except the first thread’s output address, and so on. By using Taylor series expansion
of ex , the expression approximates to:

pk ≈ e

−k(k−1)
2HS

(3.10)

Figure 3.3 (right subfigure) shows latency derived from our model under various values of k
and HS . Note that the latency data plotted here is of unit CL instead of absolute time in
seconds. Here CL is a hardware-specific value that can be obtained via experiments. Clearly,
the latency decreases when there are more private copies of outputs. In case of only a single
output (1H), and when the output size is small the latency time becomes very high. As a
side note, the output size obviously plays a role in both sides of Figure 3.3: with the increase
of output size, we have lower occupancy (due to higher shared memory consumption) but
lower latency (due to less conflict in accessing the output).
With the above model, we can find the optimal number of private copies. Given any
output size (this is a user-specified parameter for a 2-BS problem), we can use different
values of k to solve both Equations (3.6) and (3.7) to get the estimated total running time.
Luckily, k is an integer ranging from 1 to 32 (i.e., CUDA warp size), therefore we can try all
such k values to find one that leads to the best running time.

27

Request a new page

Thread
0

Thread
1

Thread
2

Page
0

Page
1

Page
2

Page being used

Thread
3
Return empty page

Page
3

Page
4

Global
Pointer

Page
5

...

Output Buffer

Avaliable Page

Figure 3.4: A case of direct output buffer for GPU threads, showing Thread 3 acquiring a
new page (i.e., page 4) as its output buffer

3.4.3 Direct Output Buffer
Now we present a technique to handle a common problem in Type-III 2-BSs: allocating
GPU (global) memory for output whose size is unknown when the kernel is launched. The
problem is due to the fact that CUDA only allows memory allocation for data with a static
size. Such a problem has been a real difficulty for not only 2-BS computation but many
other parallel computing patterns as well. A typical solution [11] is to run the same kernel
twice – the first time is for determining the output size only, and the memory for output
is actually allocated and updated in the second run. This obviously imposes a big waste of
time.
We take advantage of a buffer management mechanism proposed in our previous work on
GPU-based joins [21] to handle unknown output size. This design only requires one single
run of the kernel, with very little synchronization among threads. Figure 3.4 demonstrates
the mechanism. First, we allocate an output buffer pool with a fixed size. Then, we divide it
into small chunks called pages. We keep a global pointer GP that holds the position of the
first available page in the buffer pool. Each thread starts with one page and fills the page
with output by keeping its own pointer to empty space in that page. Once the page is filled,
the thread acquires a new page pointed to by GP via an atomic operation. By using this
mechanism, conflicts among threads remains minimal because GP is only updated when a

28

Algorithm 6: 2-BS with Direct Output Buffer
Local Var: buf (current buffer page), count (page usage)
Global Var: GP (next free page), b (Page size)
1: buf ← atomicAdd(GP , b)
2: count ← 0
3: for each pair of input datum do
4:
x ←Pairwise Distance
5:
buf [count++] ← x
6:
if count == b then
7:
buf ← atomicAdd(GP , b)
8:
end if
9: end for

page is filled. Algorithm 6 shows the 2-BS algorithm augmented with this mechanism. The
algorithm starts from initializing local buffer pointer by using atomic add operation from
global buffer pool (line 1-2). Then, the algorithm adds each update to local buffer page
(line 5). If local buffer page is filled, the algorithm requests a new page by another atomic
operation (line 7). Here we want to point out that the GP pointer, although defined in
global memory, will be most likely cached at runtime therefore its accesses generate very
little overhead.

3.5

Additional Techniques
In this section, we introduce two additional techniques that could help increase the per-

formance of 2-BS programs.

3.5.1 Load Balancing Technique
Code divergence is the situation when different threads follow different execution paths in
an SIMD architecture. As a result, such threads will be executed in a sequential manner and
that leads to performance penalties. In CUDA, since the basic scheduling unit is a warp (of
32 threads), only divergence within a warp will be an issue. By looking at Algorithm 2, we

29

can see that the kernel will only suffer from divergence in the intra-block distance function
computation (line 10 to 13 in Algorithm 2). This is because each thread goes through a
different number of iterations (Figure 3.5). Here, we introduce a load balancing method to
eliminate divergence from the intra block computation. As we mentioned before, divergence
occurs because the workload on each thread is different. Our technique thus enforces each
thread to compute the same amount of work, i.e., half of the block size. Previously, for a
thread with index i in a block (thus i ∈ [0, B − 1]), the total number of datum it pairs with
is [B − 1 − i], meaning that every thread has a different number of datum to process, and
this leads to divergence everywhere. With the load balancing technique, we let each thread
pair with B/2 datum. In particular, at iteration j, the thread with index i pairs with datum
with index (i + j)%B. Figure 3.5 illustrates the main idea. Note that, in the last iteration,
only the lower half of all threads in a block needs to compute the output. This does not
cause a divergence as the block size is a multiple of warp size.
Load balancing

W/O Load balancing
Thread#
0
1

Thread#
1
2

2
3

3
4 ...

B-2 B-1
B-1

0
1

1
2

...

iteration

B

...

B-2
B-1

B-1

B/2
... B/2+1

B/2+1 B/2+2 B/2+3 ...

...

B-2
B-1

3
4

...

...

...

...

B/2

2
3

B-1
0

0
1

1 ... B/2-3
B/2-2
2

iteration

Figure 3.5: Two different ways of work assignment to threads in intra-block pairwise
computation

3.5.2 Tiling with Shuffle Instruction
As seen in Section 3.3, tiling via shared memory or RoC is the key technique to improve
performance of Type-I 2-BS programs. However, under some circumstances, both the shared
memory and RoC may not be available for the use of 2-BS kernels. For example, they could
30

Algorithm 7: Block-based 2-BS with shuffle instruction
Local Var: t (Thread id), b (Block id), W (warp size)
Global Var: B (Block size), M (total number of blocks)
1: reg0 ← the t-th datum of b-th input data block
2: for i = b + 1 to M do
3:
for j = t%W to B; j+=W do
4:
reg1 ← the j-th datum of i-th input data block
5:
for k = 0 to W do
6:
regtmp ← reg1 broadcasted from the k-th thread
7:
d ← DisFunction(reg0, regtmp)
8:
update output with d
9:
end for
10:
end for
11: end for

be used for other concurrent kernels as a part of a complex application. In this section, we
present another technique that relieves the dependency on cache. Note that register content
is generally regarded as private information to individual threads. However, the shuffle
instruction introduced in recent versions of CUDA allows sharing of register content among
all threads in the same warp (not in the same block). Therefore, we augment Algorithm 2
with using shuffle instructions and show the pseudocode in Algorithm 7. In particular, we
allocate three registers to store input data: reg0 (line 1) is used to store datum from L which
is the same as algorithm 2; reg1 (line 4) is used to store datum from R and changes after
every 32 iterations; regtmp (line 6) is a temporary variable, which updates every iteration
with shuffle instruction. We let each thread load a datum to reg1 (line 4). Then, in each
iteration, shuffle broadcast instruction is used to load data from other thread’s register (line
6) to regtmp. After regtmp value becomes valid, reg0 and regtmp can be used to calculate
distances (line 7). Figure 3.6 shows an example. Note that this method requires only two
more registers and does not require shared memory or read-only cache.

31

Thread id
reg0
reg1

0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
32 33 34 35 36 37 38 39
Shuffle Broadcast from tid 0

regtmp

32 32 32 32 32 32 32 32
Shuffle Broadcast from tid 1
33 33 33 33 33 33 33 33
Shuffle Broadcast from tid 2

regtmp

Interation

regtmp

34 34 34 34 34 34 34 34
.
.
.

Figure 3.6: Tiling with shuffle instruction technique

3.6

Dealing with Large Data Inputs
So far, we have presented solutions to the 2-BS problem assuming the GPU global memory

is big enough to contain all input data as well as program states. In fact, the 10GB-level
global memory in a typical GPU can hold input data big enough to keep the GPU busy for a
long time due to the quadratic nature of 2-BS computing. However, it is still meaningful to
study algorithms that handle large datasets with sizes greater than that of the global memory.
In this section, we present an advanced algorithm that processes 2-BS on arbitrarily large
data inputs.
In our algorithm, we first divide the input data into small blocks that can fit into GPU
global memory, and execute the best algorithm we have developed so far for each pair of
input data blocks sitting in global memory. The results of each pair of data blocks are then
aggregated towards computing of the final result. It is well known that data transmission
in/out GPU carries significant overhead due to the limited bandwidth of the PCI-E bus.2
In our small-data algorithms, the input data only needs to be shipped into the GPU once,
this translates into a linear overhead that is easily overshadowed by the quadratic on-board
2

The NVlink bus found in newer GPUs provides a higher bandwidth but does not fundamentally change
the fact that data transmission is the bottleneck.

32

computational time. With the large data inputs, every pair of data blocks will have to be
transmitted to the GPU. If the data has to be partitioned into k blocks, we essentially have
to ship O(k 2 ) pairs of blocks. Therefore, a major optimization goal here is to reduce the
data shipping overhead.
Our strategy is to hide the data transmission latency behind the GPU processing of the
in situ data blocks. In CUDA, data transmission in/out of GPU is asynchronous therefore
execution of concurrent kernels allows data transmission and in-core computation to be done
at the same time. The mechanism for concurrent kernel execution is called CUDA Streams:
a CUDA stream is basically a series of GPU operations (e.g., I/O, kernel execution) that
have to follow a fixed order. However, kernels belonging to different CUDA streams can be
executed concurrently: when input data is being transmitted in Stream 1, the 2-BS kernel
in Stream 2 can run. Figure 3.7 illustrates this idea. For each pair of input data blocks, if
the kernel running time is longer than data transmission time,3 the latter can be effectively
hidden.
Default Work Flow
C1

C2

...

Figure 3.7: Pipelining data transmission and kernel execution via CUDA streams. Here
C1, C3, C5, C7 represent blocks of dataset A and C2, C4, C6, C8 are those for dataset B.
For simplicity we ignored the output data transmission, which can also be pipelined in the
same manner

Apart from allowing 2-BS to be computed for large data inputs, the above approach
provides an effective way to scale up the algorithm. Since all pairs of input data blocks can
3

It is easy to find a block size to satisfy this condition due to the quadratic computational time.

33

be processed independently, our method can be easily applied to a multi-GPU environment,
in which each GPU can work on a different set of input data blocks.
A special note here is: the way to combine outputs from different block-level runs depends
on the type of 2-BS. For Type-I and Type-II, we combine the results inside each GPU via
parallel reduction when any pair of data blocks are processed. For Type-III, due to the use
of direct output buffer, no action is needed to combine output within the GPU. To handle
new chunks output, threads just require to acquire a new page from the Global Pointer.
After GPU devices complete computation of all data blocks, they transfer output back to
the host. At the end, the host combines all device-level outputs into the final output.

3.7

Automated Code Optimization for 2-BS
We have so far presented a multitude of techniques to optimize 2-BS code on GPUs. It

is clear that different techniques are effective at different stages of different 2-BS problems.
For users with new 2-BS problems with arbitrary characteristics, development of efficient
code is still challenging. In this section, we introduce a framework that encapsulates all
aforementioned techniques and automatically generates optimized GPU code for user-defined
2-BS problems. To develop code for a new 2-BS problem, our framework only requires
the following inputs: (1) a distance function; (2) information about the type, number of
dimensions of the input data and the number (1 or 2) of input datasets; (3) an output data
structure and its (estimated) size; and (4) specifications of the target GPU device. Based
on such information, our framework outputs almost complete CUDA kernels that reflect the
optimized strategy for computing the given 2-BS.
Our framework stores chunks of template code reflecting the individual techniques mentioned above as well as the models we developed for kernel optimization. In addition, we
develop a rule-based engine that integrates different chunks of code into executable CUDA
kernels. For example, critical components of the rule-based engine are about decision-making
34

Figure 3.8: Decision tree of our 2-BS framework. Acronyms: SHM (cached input on shared
memory), RoC (cache input on read only data cache), Register (cached output on register),
DOB (Direct output buffer), OP (output privatization), AOP (Advance output
privatization)

with the different sizes of the output data. This can be seen as a decision tree in Figure 3.8.
If output size is tiny or equal to a threshold α (i.e., Type-I), the code will be generated based
on caching inputs into shared memory and outputs into registers. Otherwise, if output size
is larger than a threshold value β or unknown (i.e., Type-III), the code will cache inputs in
shared memory and use direct buffer to handle output. For Type-II problems, we will check
available RoC size. If there is enough RoC to hold input data, RoC will be used as input
cache. Otherwise, shuffle instructions will be used for caching input data. Then we check the
output size again, if output size is greater than a threshold γ, regular output privatization
will be used. Otherwise, warp-level output privatization will be used. The thresholds are
set as follows: we set α to 16 bytes – the size of the largest primitive type (i.e., float4, int4)
supported of CUDA. This is because anything larger than that (e.g., an array) will be placed
in global memory. We set β to the size of shared memory (i.e., 64K for Pascal); and γ is
given by the modeling results shown in Table 3.4.

35

<?xml version="1.0" encoding="UTF-8"?>
<Framework>
XML
<Config>
<filename>main.cu</filename>
<KernelName>testKernel</KernelName>
<dim>3</dim>
<InputType>float</InputType>
<OutputType>unsigned long long int</OutputType>
<OutputSize>2</OutputSize>
<Function>distanceFunction.cu</Function>
…
</Config>
</Framework>

2-BS Framework
Rule-base
Engine

distanceFunction(X, Y, Z) {
….
}
testKernel( … ) {
….
}

main.cu
distanceFunction(X, Y, Z) {
….
}

DistanceFunction.cu

Code DB

Figure 3.9: Implementation of the 2-BS code generation framework

We developed the framework with Python [59], and some implementation details can be
found in Figure 3.9. For any 2-BS problem, the system takes as inputs an XML file holding
all relevant problem-specific parameters and a CUDA file containing the distance function.
The CUDA distance function is written as a regular C function. Blocks of CUDA code are
pre-stored in a database and selected based on the aforementioned rules and modified and
integrated into a file containing the CUDA kernel code as the output.

3.8

Evaluation of GPU-Based 2-BS Algorithms
In this section, we present empirical evaluation of aforementioned algorithms. We run

our experiments in a workstation running Linux (Ubuntu 14.04 LTS) with an Intel Xeon
6-core E5-2620 v3 CPU, 128GB of DDR4 1866-MHz memory, and an Nvidia Titan XP GPU
with 12GB of global memory. The running time reported for all experiments is end-to-end
latency, which includes time to transfer input data in and output data out of the GPU.

3.8.1 Data Representation Schemes
We first evaluate the performance of vectorized memory accesses via different data types
(i.e., float2, float3, and float4). In particular, we implemented CUDA kernels to compute a
Type-I 2-BS: the 2-point correlation function (2-PCF). The 2-PCF requires computation of
36

all pairwise Euclidean distances and the output is of very small size: one scalar describing
the number of points within a radius. We see 2-PCF as a good example here because the
workload is almost exclusively on the distance computation, which requires intensive data
loading from global memory.
We select two different caching techniques to conduct this experiment: register with
shared memory, and register with RoC. In particular, we implemented and compared eight
kernels with different data types and caching techniques. There are four kernels that are
based on “Register + Shared Memory” (i.e., named Float-SHM, Float2-SHM, Float3-SHM,
and Float4-SHM) and other four kernels are based on “Register + RoC” (i.e., named FloatROC, Float2-ROC, Float3-ROC, and Float4-ROC). We experimented on input data sizes
ranging from 512 to 3 million particle coordinates. Particle coordinates are generated following a uniform distribution in a spatial region4 .
Figure 3.10 shows the running time of all eight kernels. As we see, kernels that use Shared
Memory show similar results by using Float2, Float3 and Float4. However, when input data
is greater than 1.8 Million atoms, Float2-SHM shows the best performance, which is 5%
faster than scalar load (i.e., Float-SHM). The float3 and float4 cases did not show significant
advantage. On the other hand, kernels that use RoC demonstrate a more clear trend that
Float2-ROC outperforms all other kernels. The Float2-ROC kernel is 11% faster than scalar
load (i.e., Float-ROC). However, larger vector width (i.e., float3 and float4) did not further
improve performance, the Float3-ROC kernel is even slower than the scalar load case. To get
insights on such results, we analyzed the runtime statistics of the kernels by using the Nvidia
visual profiler, a tool for analyzing runtime characteristics of CUDA kernels. The profiler
results show that vectorized memory access via float3 and float4 yield lower performance
4

We also run experiments on skewed datasets. However, the performance of 2-BS algorithms is not
affected by data distribution thus we omit those results.

37

100

10

10

Time (second)

Time (second)

100

1

0.1

1

0.1
Float-SHM
Float2-SHM
Float3-SHM
Float4-SHM

0.01

400

800 1200 1600 2000 2400 2800

Total number of atoms (x1000)

Float-ROC
Float2-ROC
Float3-ROC
Float4-ROC

0.01

400

800 1200 1600 2000 2400 2800

Total number of atoms (x1000)

Figure 3.10: Performance of different data types of vectorized memory access for
computing 2-PCF

because they significantly increased register use of kernel and thus reduced warp occupancy
(i.e., 75% on float3 and 50% on float4).
Based on the above results, in the rest of experiments, we vectorize all input data into
float2. The only exception is the naive algorithm, in which we still use scalar load.

3.8.2 Evaluation of Pairwise Algorithms
To evaluate the performance of the aforementioned solutions in distance computation, we
implemented them in CUDA and experimented using synthetic data with different sizes. We
still used 2-PCF whose workload is exclusively on the distance computation, as the sample
problem. We implemented and compared the following kernel functions that correspond
to the different solutions mentioned above: (1) SHM-SHM: caching both blocks L and R
in shared memory; (2) Register-SHM: caching one datum in register and block R in shared
memory; (3) Register-RoC: placing one datum in register and block R in read-only cache; and
we also compare with (4) Naive: generic GPU-Based 2-BS algorithm as shown in Algorithm

38

4

100

3.5
10

1

Speedup

Time (second)

3

0.1

2.5
2
1.5
1

0.01

0.001

Naive
SHM-SHM
Register-SHM
Register-ROC

600

1200

1800

2400

Total number of atoms (x1000)

0.5
0

SHM-SHM
Register-SHM
Register-ROC

600

1200

1800

2400

Total number of atoms (x1000)

Figure 3.11: Performance of different GPU-based algorithms for computing 2-PCF: total
running time and speedup over naive algorithm

1. Note that, in all of the kernels except Naive, input variable is vectorized by float2. In
addition, all kernels are compiled by -Xptxa-dlcm=ca flag, which enables the compiler to
use L1 cache.
We experimented on input data size ranging from 512 to 3 million particles. Particle
coordinates are generated following a uniform distribution in a region. For kernel parameters,
we set the total number of threads as the data size and the value of threads per block to 512,
which is derived from an optimization model developed in our previous work [57]. The model
guarantees best kernel performance among all possible parameters by minimizing running
round (i.e, number of rounds all the specified threads of a kernel are actually scheduled).
The model also shows that running round is limited by three factors – shared memory
consumption, register use, and number of concurrent warps. As our kernel in 2-PCF uses a
small amount of resources in all three categories, we can use a relatively large block size of
kernel (i.e, 512 thread per block) and achieve the best performance.

39

Table 3.2: Utilization of different GPU resources in running different 2-PCF kernels under
a data size of 512k. Ari: arithmetic operation; Con: control operation; Mem: memory
operations; SM: shared memory; ROC: Read-Only data cached
Kernel
Naive
SHM-SHM
Reg-SHM
Reg-RoC

GPU Cores
Memory Bandwidth
Ari Con Mem SHM L2
ROC
20% 5%
67% 14%
68% 14%
55% 12%

15%
4%
4%
12%

10%
20%
20%
10%

90%
10%
10%
20%

40%
10%
10%
50%

Figure 3.11 shows the total running time of each experimental run. We observed that the
running time grows with data size in a quadratic manner – this is consistent with the O(N 2 )
complexity of such algorithms. Among all tested parallel algorithms, the Register-SHM
and SHM-SHM kernel show similar results, which is the best performance under all data
sizes – it achieves an average speedup of 3.9X (maximum speedup of 3.5X). The RegisterRoC kernel shows the least improvement over naive algorithm, with an average speedup of
3.3X and maximum speedup of 3.7X. The above results are clearly in conformity with our
understanding of the proposed caching solutions.
To evaluate the level of optimization we achieved in our solutions, we looked into the
utilization of GPU resources while running our kernels. Normally, the bottleneck is on the
memory bandwidth in processing 2-BSs such as the 2-PCF, due to the simple calculations in
the distance function. If we can feed the cores with sufficient data, the cores will show a high
utilization, which indicates that the code is highly optimized. Another way to look at this
is: since the total number of distance function calls is the same for all algorithms mentioned
so far, the less idling time the cores experience, the better performance the algorithm has.
Information related to resource utilization can be obtained by running the program through
the Nvidia visual profiler. Table 3.2 shows utilization of different hardware units as recorded
by the profiler. Clearly, the three cache-based techniques significantly increases utilization of

40

compute core resources as compared to naive algorithm. The Register-SHM and SHM-SHM
kernels both achieve a roughly 67% utilization of arithmetic units in GPU cores, indicating
near-optimal performance. That number for Register-RoC is only 55%, verifying the result
that its performance is not as good as the other two. Without a surprise, it reaches a high
utilization (1.5 TB/s or 50%) of RoC bandwidth.

3.8.3 Evaluation of Complete Algorithms
In this subsection, we present experimental results on running the algorithms optimized
for both (i.e., distance computation and output transmission) stages. We study the impact of
each output technique (i.e., output privatization, advanced output privatization, and Direct
output buffer) separately.

3.8.3.1

Output Privatization

We use the Spatial Distance Histogram (SDH) as an example for implementing our output
privatization algorithm. Classified as a Type-II 2-BS, SDH is a problem similar to 2-PCF.
SDH also requires computing all pairwise Euclidean distances, but it outputs a histogram
that shows the distribution of all computed distances. The output size (i.e., number of
buckets) of SDH is not related to the data size N , but it normally comes at the level of a few
kilobytes thus can be placed in shared memory. On GPUs, due to the concurrent updating
of output the problem is bound by memory bandwidth.
In this set of experiments, we compare six kernel functions: the first three are algorithms
we studied in Section 3.3: Naive, Register-SHM, and Register-RoC. The output stage of
those three algorithms is handled in a straightforward way: we directly output to a shared
data structure in global memory via atomic operations. The other three algorithms, named
Naive-Out, Reg-SHM-Out, and Reg-RoC-Out, are based on the first three algorithms, but
we enhance the output stage with the privatization technique. In addition, we compare
41

10000

Naive-Out
Register-SHM
Reg-SHM-Out
Reg-ROC-Out

70
1000
60
50

Speedup

Time (second)

100
10
1
0.1

30
20

CPU
Naive-Out
Register-SHM
Reg-SHM-Out
Reg-ROC-Out

0.01
0.001

40

0

400

800

1200

1600

Total number of atoms (x1000)

10
0

400

800

1200

1600

Total number of atoms (x1000)

Figure 3.12: Performance of different GPU-based algorithms for computing SDH: total
running time and speedup over CPU algorithm

all GPU algorithms with a CPU-based parallel algorithm to study the overall advantage
of running 2-BS on GPUs vs. multi-core CPUs. We again generate uniformly distributed
datasets with a size ranging from 512 to 2 million. We set the total number of threads as
the data size and the value of threads per block to 64, which is derived from an optimization
model developed in our previous work [57]
We implemented a highly-optimized parallel algorithm for computing SDH in multi-core
Intel Xeon using OpenMP in C. To improve performance, various techniques are applied
to the CPU version. First, we optimize the output stage to reduce the effects of atomic
operations. In particular, every thread is given an independent copy of the output histogram
and parallel reduction will be conducted after all distance function calls are returned. Second,
we compare the effects of OpenMP thread affinity schedulers (e.g., scatter, compact, and
balanced ) and choose the one (i.e., balanced ) that is most beneficial to overall performance.
Third, parallel loops can be executed in different scheduling modes, and selecting a scheduling
mode is usually a trade-off between overhead and load imbalance. Among the available

42

modes (e.g., static, dynamic, and guided ) in OpenMP, we choose guided as the best one for
our algorithm. Other optimizations such as algebraic elimination of costly instructions and
enabling aggressive compiler optimizations are also applied to the CPU code. We also profile
our multithread CPU code for computing SDH by pref tool. The profiler shows that the
algorithm is bound by computation, uses all cores of the CPU, and makes good use of cache
(cache miss rate is less than 2%). Therefore, our CPU code shows great spatial locality since
most accesses are satisfied through cache. In summary, we believe our CPU code is of very
high (if not optimal) performance.
Figure 3.12 shows the running time of the aforementioned kernels. First of all, we found
that the three kernels without the output privatization technique run at almost the same
speed. Therefore, we just plot one of the three (i.e, Register-SHM) in Figure 3.12. It is easy
to see that the total running time of such kernels is about one order of magnitude longer
than the ones with output privatization technique. This clearly shows how data output
dominates the running time due to atomic operations (against a global memory). On the
other hand, applying output privatization can significantly improve the speed of kernels, as
shown by the short running time of the three output-optimized kernels. The Reg-RoC-Out
kernel, by using the read-only cache for distance function computation and shared memory
for output caching, combines the power of both cache systems and therefore shows the best
performance. Specifically, Reg-RoC-Out is about 13.48 times as fast as Register-SHM. Even
the Naive-Out algorithm, without any optimization for pairwise distance computation, shows
a 12.05 speedup over Register-SHM.
Further profiling of the involved kernels support discussions made above. Table 3.3 shows
the bandwidth utilization in different GPU cache systems by the tested kernels. Clearly,
cache bandwidth is the limiting factor of the three output-optimized kernels. Among them,
Reg-RoC-Out achieves very high bandwidth utilization in both shared memory (9TB/s) and
read-only cache (750GB/s), leading to the best kernel performance. The other two kernels,
43

Table 3.3: Utilization of different GPU resources in running different SDH kernels for a
512,000-point dataset. Ari: Arithmetic Operation; Con: Control Operation; Mem: Memory
Operation; SM: shared memory; ROC: Read-only data cache
Kernel

GPU Cores
Memory Bandwidth
Ari Con Mem SM
L2
ROC

Register-SHM
Naive-Out
Reg-SHM-Out
Reg-RoC-Out

10% 10%
23% 5%
50% 20%
60% 20%

10%
7%
20%
10%

10%
95%
98%
100%

10%
10%
10%
10%

10%
10%
10%
30%

Reg-SHM-Out and Naive-Out, have lower utilization in either shared memory or RoC. All
GPU kernels beat the CPU program running on a Intel Xeon, showing GPUs being a superior
platform for computing 2-BSs. The best GPU program (i.e., Reg-RoC-Out) is about 52 times
as fast as the CPU program. Even the least optimized Reg-SHM kernel is about 3.86 times

55

100

50

90

45

80

40

70

Occupancy (%)

Time (second)

as fast as the CPU code.

35
30
25
20

60
50
40

15

30

10

20

5

10

0

0

1000

2000

3000

4000

Number of buckets

5000

6000

0

1000 2000 3000 4000 5000
Number of buckets

Figure 3.13: Performance of the Reg-RoC-Out kernel under different bin sizes: running
time and occupancy

We also study the effects of output size on the performance of the output-optimized
kernels. Figure 3.13 shows such results of the Reg-RoC-Out kernel in computing the SDH
44

of a dataset with 512,000 data points. The general trend is: when output size (i.e., total
number of buckets in the output histogram) increases, the running time also increases. Note
that the running time increases as a step function of output size. This is because the
output size affects the performance via changing the occupancy of the kernel. Figure 3.13
shows that occupancy decreases when the output size increases. Interestingly, the kernel
also shows degraded performance when the output size is too small. This shows the other
side of the story: when an output has too few elements, it will suffer from high contention:
the many threads in the block always compete for accessing an output element via the
atomic operations. In the following section, we will show that advanced output privatization
technique is the remedy for this problem.

3.8.3.2

Advanced Output Privatization

We present our empirical evaluation of our output privatization and verify our running
time model. We continue using SDH to evaluate our algorithm. According to the experiments
in Section 3.8.3.1, when output size is too small, SDH renders more shared memory accesses
for updating outputs and suffers from long running time. Therefore, we implement our
advanced output privatization technique on top of Algorithm 4, and evaluate it with a
512,000-point dataset under output size from 1 to 600 buckets.
Table 3.4: Number of private copies H and the range of output size for which H is found
to be the optimal choice
Number of copies Theoretical Results
32
1 - 10
16
11 - 30
8
31 - 65
4
66 - 150
2
151 - 450
1
> 450

Actual Results
1 - 10
11 - 35
36 - 92
93 - 152
153 - 300
> 300

45

20

1H
2H
4H
8H
16H
32H

40

Measured running time (sec)

Relative Running time to CL

50

30

20

10

0

0

120

240
360
Output Size

480

600

1H
2H
4H
8H
16H
32H

15

10

5

0

0

120

240

360

480

Number of buckets

600

Figure 3.14: Performance with advanced output privatization: theoretical (left) and
measured (right) running time of SDH kernel. Each line represents the case of one
particular number of private output copies

Figure 3.14 shows theoretical running time (left) obtained from our models shown in
Section 3.4.2 and actual running time (right) of implemented algorithm. It is easy to see
that our model matches the empirical running time very well. Recall that our modeling work
aims at finding the optimal number of private output copies. Given any output size, this can
be easily found in Figure 3.14. Table 3.4 shows how well the theoretical results predict the
best choice in real-world. In particular, we see that any number of private copies is picked
under a continuous range of output size (e.g., for output size 1-10, 32 copies are found to be
optimal by both modeling and experiments). When the output size is small (i.e., less than
65), our model is very accurate in selecting the best number of private copies. When the
output size is between 65 and 152, our model gives more wrong selections among H values
of 8, 4, and 2. However, such selections under large output size do not significantly affect
the performance of the algorithm. If we look closely into Figure 3.14, the running time for

46

1000

60

100

50

10

Speedup

Time (second)

40
1
0.1

30
20

0.01
0.001
0.0001

0

500

DOB
Without DOB
CPU
1000
1500

Total number of atoms (x1000)

10

2000

0

0

500

DOB
Without DOB
1000
1500

Total number of atoms (x1000)

2000

Figure 3.15: Output Buffer Management: total running time and speedup over CPU in
computing the item similarity problem

8, 4, and 2 copies are almost the same for both the theoretical prediction and actual values
(i.e., less than 5% difference) for a wide range of output sizes.

3.8.3.3

Direct Output Buffer

We also conduct experiments to evaluate the direct output buffer technique. For that, we
use a Type-III 2-BS problem: the item similarity problem. It computes all pairwise distances
and saves those pairs that are found to be similar. We define a similar function by using
Euclidean distance. Naturally, the output size of this problem is unknown at the beginning.
In our experiments, we use data with up to 2 million items and limit output size around 4
million pairs. We compare our technique with the multi-thread CPU program and our kernel
without using the Direct output buffer technique. The CPU program is a variance of the
one described in Section 3.8.3.1. On the other hand, the baseline GPU code encapsulates
all optimizations described in previous sections but it handles output by running the kernel
twice (i.e., first run only calculates output size). Figure 3.15 demonstrates running time of

47

the experiment. As we can see, our GPU code without the output buffer technique achieves
an average 23X speedup over CPU program (Maximum speedup of 28X). When the direct
output buffer is implemented, the speedup averagely increases to around 48X, this translates
into a 2X speedup generated by the output buffer mechanism.

3.9

Additional Techniques
In this subsection, we present empirical evaluation of additional techniques on SDH

problem.

3.9.1 Load balancing technique
In such experiments, we only record the time for processing intra-block computations in
processing the SDH. We implement the load balancing technique on top of the tiling-based
kernel Register-SHM, which is shown to be the most efficient solution in Section 3.8.2. We
compare the running time of the kernel before and after applying the technique, and Figure
3.16 shows such results – a 12%-13% improvement can be seen.

3.9.2 Tiling with Shuffle instruction
To evaluate the this technique, we run experiments in a way similar to those mentioned
in Section 3.8.3.1. We compare the shuffle instruction with tiling via shared memory and
tiling via RoC. Figure 3.17 shows the results of the experiments. Clearly, tiling with shuffle
instruction has almost the same performance as tiling with RoC or with shared memory.
This shows that the technique based on shuffle instruction can be an alternative method
when shared memory and RoC are not available, and we expect the algorithm to show the
same level of performance.

48

0.1

1.12
Speedup

Running time (sec)

1.14

0.01

1.1

1.08
1.06

Register-SHM
Register-SHM-LB
600 1200 1800 2400 3000
Total number of atoms (x1000)

0.001

1.04

Register-SHM-LB
600 1200 1800 2400 3000
Total number of atoms (x1000)

Figure 3.16: Performance of Reg-SHM kernel with and without load balancing method:
total running time (left) and speedup (right) over Reg-SHM

10000

Reg-SHM-Out
Reg-ROC-Out
Shuffle

70
1000
60
50

Speedup

Time (second)

100
10
1
0.1

30
20

0.01
0.001

40

10

CPU
Reg-SHM-Out
Reg-ROC-Out
Shuffle

0

400

800

1200

1600

Total number of atoms (x1000)

2000

0

400

800

1200

1600

Total number of atoms (x1000)

Figure 3.17: Performance of different GPU-based algorithms for computing SDH: total
running time and speedup over CPU algorithm

49

7

10000

6

Speedup

Time (second)

5
1000
4
3

100
2

10

1

Single GPU
2 GPUs
4 GPUs
8 GPUs

0

4

8

12

16

2 GPUs
4 GPUs
8 GPUs

20

Total number of atoms (x 1 Million)

0

0

4

8

12

16

20

Total number of atoms (x 1 Million)

Figure 3.18: Performance of different setup for computing SDH: total running time and
speedup over single GPU

3.10

Results of Large Input Data

In this subsection, we present end-to-end performance of our multi-GPU 2-BS algorithm.
We run our algorithm on multiple GPUs that reside in a single server running Linux with an
Intel Xeon 6-core E5-2620 v3 CPU, 128 GB of DDR4 1866-MHz memory, and eight Nvidia
Titan X Pascal GPUs with 12GB of global memory in each card.
In such experiments, we still use SDH as an example to demonstrate the performance
of our algorithm. In particular, we implement the algorithm based on the best kernel for
processing onboard input data as shown in Section 3.8.3.1, which is RoC-Out. We compare
performance based on various numbers of GPU devices: single card, 2 cards, 4 cards, and 8
cards. We generate uniformly distributed datasets with a size ranging from 2 million to 20
million. The size of data blocks was set to 1 million points. We use this number because it
is relatively large input size that requires a long time to compute SDH (even on GPUs).

50

Figure 3.18 shows the running time. We calculate the speedup over the single GPU
setup. As we can see, the speedup is improved by almost the magnitude of a number of
GPU cards. The multiple GPU system is up to 1.97 (2 GPUs), 3.72 (4 GPUs) and 6.85
(8 GPUs) times as fast as the single GPU system. The higher number of GPUs card tend
to have more overhead, due to the time required to merge the output. However, the larger
the input data is, the better the merging time can be hidden. As a result, the speedup of
the algorithm increases with data size. In addition, we also used Nvidia Profiler to study
the algorithm behavior. On 8 GPUs and 20 Million input data, the profiler shows that the
average HtD (Host to device) transfer time is 847.10ms (0.04% of running time), and DtH
(Device to host) transfer time is 10.881us (less than 0.01% of running time). This shows
that the overhead of transfer time via PCIe is minimal.

3.11

Case Study of Automated 2-BS Framework

In this section, we evaluate CUDA kernels generated by automated code generation
framework via two case studies. We demonstrate Radial Distribution Function, a Type-II 2BS, as Case Study I. Then, we present Nested-Loop join as case study II, which is a Type-III
2-BS.

3.11.1 Case Study I: Radial Distribution Function (RDF)
RDF is an essential physical feature of molecular systems. The RDF algorithm receives
two sets of atom coordinates as the input and calculates distances between two atoms from
different sets in a pairwise manner. The output is a histogram of the distances between two
atoms. Therefore, RDF can be classified as a Type-II 2-BS. In this study, we compare the
performance of our RDF code (generated automatically) with the best known code extracted
from the Visual molecular dynamics (VMD) software. VMD is a well-known molecular visu-

51

1000

3

100
2
Speedup

Running time (second)

2.5

10

1.5
1

1

0.1

VMD 50
ROC-AOP
VMD 500
ROC-OP
600 1200 1800 2400 3000
Total number of atoms (x1000)

0.5
0

ROC-AOP
ROC-OP
600 1200 1800 2400 3000
Total number of atoms (x1000)

Figure 3.19: Performance of different GPU-based algorithms for computing RDF: total
running time (left) and speedup over VMD code (right)

alization program used for displaying, animating, and analyzing large biomolecular systems.
RDF is implemented on GPUs [31] and the code is included in the current VMD release.
We create two sets of experiments to test our framework and compare with existing VMD
kernel. In the first experiment, we set the size of output histogram to 500 buckets (i.e., 2KB).
In this case, the framework generates a kernel (named RoC-OP) with tiling cached in RoC
and regular output privatization. In the second experiment, we set the output size to 50
buckets (i.e., 200 bytes). The framework generates kernel (named RoC-AOP) with tiling by
RoC but with warp-level output privatization. In both experiments, we use two datasets with
a size ranging from 100k to 3 million particles. Particle coordinates are generated randomly
following a uniform distribution. The RoC-OP has 45 lines of code while RoC-AOP is 51
lines long.
Figure 3.19 shows the running time of relevant kernels under different input sizes. The
results of original VMD code under two output sizes are labeled as VMD50 and VMD500,
respectively. As can be seen, the running time grows quadratically with increase of data size.

52

5

10000
1000

4

Speedup

Time (second)

100
10
1

3

2

0.1
1
0.01
DOB
Without DOB

0.001

0

600

1200

1800

2400

Total number of atoms (x1000)

DOB

3000

0

600

1200

1800

2400

Total number of atoms (x1000)

3000

Figure 3.20: Performance of NLJ kernel generated from 2-BS framework as compared to
best known NLJ kernel reported in [1]

We compare the performance of RoC-OP with VMD500 and RoC-AOP with VMD50. For
output size of 500, the RoC-OP kernel is faster than the VMD code by up to 18%. Under
output size of 50, the RoC-AOP kernel achieves an average speedup of 1.88X and maximum
speedup of 2X. This clearly shows that our framework is more adaptive to different scenarios
of the same problem. Although the VMD code does a good job under large output, it does
not capture the opportunity to handle smaller output more efficiently via the warp-level
privatization.

3.11.2 Case Study II: Nested-Loop Join
We also use Nested Loop Join (NLJ) as an example to verify the effectiveness of our
2-BS framework. As mentioned earlier, being the preferred algorithm for processing joins
with complex non-equality conditions, NLJ is important in database systems. In particular,
NLJ requires to compare all pairs of tuples from two tables. The output size cannot be

53

determined at the beginning of the run: the size ranges from 0 to N 2 where N is the table
size.
In this experiment, each tuple contains a randomly generated integer ID and a key value.
Tables’ sizes range from 1M to 3M tuples. We limit the output size to be roughly the
same as the input table size. We feed such parameters and the join function to our 2-BS
framework to generate the CUDA kernel. Our framework classifies NLJ as a Type-III 2-BS
problem and chooses to cache input in shared memory and use direct buffer output to handle
the output of the application. As a result, around 70 lines of kernel code are generated.
We run the generated kernel, and compare its performance to a GPU-based NLJ program
developed in previous work [1]. The latter is believed to be the most efficient GPU-based
NLJ development, beating all other NLJ programs in performance by a large margin.
As shown in Figure 3.20, the kernel generated from the 2-BS framework clearly outperforms the state-of-art program, with a speedup up to 4.4X. Looking into the details, the code
in [1] is designed to tile input table into on-chip memory (i.e., shared memory and RoC) and
it does not use direct output buffer. Note that we also experiment input size beyond 3M
tuples, but the speedup stays at 4.45X. This shows that 2-BS framework can automatically
generate kernel with very high performance with very little effort from the developer.

54

Chapter 4: Index-based 2-BS Algorithm design on GPUs

4.1

Overview
In this chapter, we present a special computation for a group 2-BS problems, in which

the computation complexity can be reduced by using an index structure. As we mentioned,
there are various applications of 2-BS that the index-structure can reduce computation
complexity, such as equal-join, similarity search, and join similarity. In these particular
problems, the distance function is a selection function, which filters or/and computes only
the user’s interesting pairs to the output. As we can see, only some particular pairs of
computation are meaningful to the output, and the rest of the pairs of computation is the
overhead to the algorithm. By using the index search, the main idea is to build the index
structure for the data and using this index structure for index searching for the particular
pairs. Thus, the pairwise computation occurs only on some data pairs instead of all pairs,
which reduces the complexity of the 2-BS problem. In order to utilize the index-based search
for 2-BS, we need a suitable tree structure for this problem.
There are many index-tree that commonly use for index search such as Binary tree, Redblack Tree, QuadTree, OctTree, K-D Tree, B Tree, B+ Tree. Each index tree structure has
unique characteristics and benefits. For example, QuadTree and OctTree are optimized for
spatial data processing; Red-black Tree is optimized for frequent insert and delete system.
However, from all that, B+ Tree yields high performance on GPUs and originally designed
for a database system.

55

...
...

...

Figure 4.1: A typical data structure of B+ Tree on CPU

B+tree is one of the most common data structures for index searching in a database
system. Since B+ Tree originally designs for a system with small memory and large hard
disk [60, 61]. This means that the B+ Tree takes advantage of the cache system of memory
to reduce I/O to disk. Moreover, the designs of B+Tree have an advantage on both equality
and range search, which is the key to reduce the computation of 2-BS problem. Many
studies explore the implementation of B+Tree on various kinds of platforms [46, 48, 62, 63].
However, in many-core systems such as GPUs, the performance is far less than the promising
performance of GPUs, and the dynamic data structure is not favorable for the GPU system.
With that in mind, this chapter focuses on novel techniques to implement and optimize
parallel algorithms for searching B+ Tree-like index structure on modern Graphics Processing
Units (GPUs).

4.2

Index Data Structure on GPUs
Before we discuss algorithmic design, we first present the data structure of index Tree on

the GPUs. As we mentioned in the above section, B+ Tree is the data structure that design
for a database system and suitable for 2-BS problems. Thus, our index structure is base on
the B+ Tree index structure, which consists of inner nodes and leaf nodes represented by
different data structures. We still use the page concept, which system can take full advantage

56

of the cache system. However, there are some difficulties in using this index structure on
GPUs. A typical data structure of B+ Tree on CPU is a dynamic data structure, which means
an inner node of an upper level of the tree point of some locations of memory to indicate the
next level locations of the tree (Figure 4.1). As we see, this data structure is non-contiguous
in memory and lead to non-coalesce access. Different from the Host memory, GPU memory
architecture has several limitations. To maximize the bandwidth of a GPU memory, GPU
requires to have a coalesced memory access to global memory. Moreover, dynamic memory
allocation causes a significant problem on GPUs since it requires a synchronize between
a large number of threads and leads to the divergence problem. Thus the dynamic data
structure with a simple allocation of memory is not desirable for GPUs.
In recent studies, Shahvarani et al. [48] proposed in two directions of index tree on GPUs:
implicit tree and regular tree structure. The implicit tree organizes tree structure by using a
single array with a breadth-first order. Figure 4.2 shows an array of the implicit tree, which
is a continuous data structure on GPUs. The implicit tree performs very high performance
of queries, but with a constraint of the tree cannot be modified. To modify the implicit tree
structure rebuilding the tree is required. On the other hand, Regular tree structure organizes
tree with actual node structure and support modification, but trade-off with the data is not
continuous. They remedied the non-continuous data by setting the page size to suitable to
the GPUs cache size. Since altering tree data is an essential ability of indexing structure, in
this dissertation, we focus on a regular B+tree direction.
In this section, we introduce dynamic index tree data structures and allocation mechanisms for GPUs, which maximize the utility of GPU memory. This tree structure is designed
for flexibility and still maintain the high performance of searching. We still arrange data into
a page. However, instead of allocating a small chunk of memory for each page, we allocate
a large chunk of GPU memory for the entire page need. Then we divide this memory into
a page and distribute it to threads upon request by our memory allocation mechanism. We
57

...
...
...

...

...
...
...

Inner node

Leaf node

Pair data

Figure 4.2: Static B+ Tree structure

shared the inner node and leaf node with the same page structure, which contains key-value
data, size, node attribute (i.e., leaf node, or inner node), and pointers to the neighbor nodes.
In the inner nodes, the key-value represents a searching key and the pointer to the lower
level of the tree. On the other hand, in the leaf nodes, the key-value represents a searching
key and the pair value.
As we see, this data structure requires a mechanism to distribute page efficiency. Such
a problem has been a real difficulty for not only tree representation but many other parallel
computing fields in which the memory size cannot be specific at the beginning. We reuse
the direct output buffer mechanism, which presents in section 3.4.3, to handle dynamic
data structure. We keep a global pointer GP that holds the position of the first available
page in the buffer pool. Each thread starts with one page and fills the page with data.
The thread can acquire a new page pointer by GP via an atomic operation. By using this
mechanism, conflicts among threads remain minimal because GP is only updated when a
page is requested. In this study, we utilize this dynamic allocator on the tree construction
and the output buffer.

58

Algorithm 8: GPU-based B+ Tree construction algorithm
Input: Input Data,Input Pair,Size
Output: Index Tree
1: (D,P ) ← (Input Data, Input Pair)
2: (nD,nP ) ← SortingDataPair(D,P )
3: Level ← LogP ageSize (S)
4: Leaf N ode ← AllocatePage()
5: AssignedDataToLeafNode(nD,nP ,Leaf N ode)
6: while Level > 0 do
7:
InnerN ode ← AllocatePage()
8:
BuildLevel(InnerN ode, Leaf N ode)
9:
Level − −
10: end while
11: return Root

4.3

Index Construction on GPUs
In this section, we demonstrate our index construction process. In a recent study [48],

the Index structure was construction inside host memory and transfer the inner node to the
GPUs. The cost of building the index structure on host memory is high and not suitable for
GPUs. The critical issues are the low parallel level on CPU and the overhead of transferring
the index from host to GPUs. Furthermore, the searching operation cannot finish in the
GPUs due to a lacking of full tree data, and the algorithm requires to transfer intermediate
data back for CPU searching. Differently from previous work, in this study, we focus on
creating an index structure for all-in-GPU search operations. Thus, this means that all of
the computation is inside a GPUs.
There are unique challenges in parallel construction of a dynamic index structure on
GPUs. The traditional way of index construction is done by multiple threads allocating a
node structure and then link with other nodes. However, the primary issue of this procedure is allocation and synchronization. The GPU’s thread-level dynamic memory allocation
carries an extremely high overhead, and the synchronization between a thread can cause a
divergent effect to the GPUs. In this work, we use our memory allocator, which discussed in

59

section 4.2, to allocate a page of memory. Then we distribute the workload into a different
block to minimize the synchronization.
To minimize the index construction time and maximize the throughput of query processing, we utilize GPU sorting algorithm, our GPU memory allocator, and bottom-up procedure
to build the index. Algorithm 8 shows the procedure of the tree construction process on GPU.
We begin the algorithm by sorting the data-pair. This operation increases the performance
of index construction and query processing. By sorting the data-pair, the query operation
can take advantage of binary searching in the quires process. In this procedure, we select
a GPU radix sort operation from CUB [64], which is the best sorting algorithm on GPUs.
Then we use our allocator to allocate pages for leaf node and using a GPU kernel to assign
the data pairs to each node.
Then, we assign each thread to handle one leaf node and compute the upper-level inner
node. After that, the algorithm builds the index tree level by level from the leaf to the root.
We process the data and place it into the leaf node of our data structure. The data are
required to reorganize into page structure. Then, we call other kernels repeatably to build
each level of the tree. We assign each thread to find an element in the inner node. Each
thread selects the maximum value from a below level node; which is the last element because
the data is sorted.

4.4

Index Query Processing on GPUs
As we mention, the most advantage of using a tree index structure is fast query operation.

Each query operation can be accomplished in O(logb N )

1

instead of search through entire

data. By using the Tree index on GPU, they require a large number of threads and memory
bandwidth to operate such operation. Recent studies [48, 47], they utilize a branch of threads
to operate a single query and improve memory utilization by explicitly load some part of the
1

b define as a fanout of the index tree structure

60

Algorithm 9: Generic GPU-based Index Searching algorithm
Input: Query List, Size
Output: Output List
1: t ← Thread id
2: x ← Query[t]
3: N ode ← Root of B+Tree
4: while N ode is not leaf node do
5:
N ode ← getNextNode(x,N ode)
6: end while
7: OutputList[t] ← getValue(N ode,x)
8: return OutputList

upper level of the tree into high-speed on-chip memory. However, the performance is still
far behind the promising performance. In addition, the range searching operation has not
been report. Thus, we believe that GPU requires a large number of queries to utilize stream
processor in GPU sufficiently, and each type of search queries is required a different way to
handle. In this section, we propose an algorithm for a large number of queries searching
operation on Equality search query and ranged search query.

4.4.1 Equality Search Query
Equality search is the most common query in a database system and the key component
for computing our 2-BS problem. The main goal of equality search is to find an exact match
of the keys in the index tree. A straightforward GPU algorithm for query data is shown
as Algorithm 9. Note the pseudocode is written from the perspective of a single thread,
reflecting the Single-Program-Multiple-Data (SPMD) programming style of CUDA. Each
thread loads one query to a local variable and uses that to search through the inner node
index to find a leaf node pointer (lines 3 to 6). Then threads search in the leaf node to find
the destination match location and return the result (line 7). Since the data is sorted on
each node, the searching operation can speed up by binary search.

61

Inner Node

Leaf Node

Q1

Q2

Q3

Figure 4.3: Memory access pattern of three queries. Q1 and Q3 hit on the median data
range of the dataset. Q2 hit on the high data range of the dataset.

The straightforward searching algorithm may seem suitable for GPUs, which assigns each
query for a thread. However, there are some challenges to the system. First, GPUs requires
data access to be coalesced-access to maximize the utilization of memory bandwidth. By
parallelize a large number of queries at the same time, the following queries may hit on a
different leaf node of the tree and cause non-coalesce access on the system. Second, the
divergence of thread in a warp. As we know, the GPUs requires threads in the same warp
need to execute the same instruction at the same time. The divergence is the phenomena
that the thread from the same warp execute different instructions. GPUs handles this
phenomenon by breaking a warp into the smaller warp and execute them separately. As we
can see in algorithm 9, it is possible that some threads may finish getNextNode operation or
getValue sooner than other threads. This leaves an empty slot in a warp and reduces the
occupancy of the system.
Given that, we start the optimizing process by analyzing the pattern of each thread.
By assuming each thread executes one query, figure 4.3 demonstrates the memory access

62

pattern of three different queries on a continuous page of memory. In a continuous page of
memory, the right-hand side is the Inner node page, and the left-hand side is the leaf node.
In the three queries, Q1 and Q3 require to find keys that are around the middle value, and
Q2 require to find a high-value key. As we can see, at the beginning of the algorithm, all
query requires Inner node data, and later on, they access leaf node data. Therefore, we first
separate the algorithm into two steps; inner node search and leaf node search. This forces
threads to access a similar location (the inner node and leaf node) and leads to better cache
utilization. Then, if we look closely into Q1 and Q3, the threads require similar data on
both threads, which means that these threads can share data in the cache system. Thus, we
can group this query and enforce them into the same multiprocessor and share an on-chip
cache system.
Hence, we propose a new idea to search the index tree on GPUs. We first preprocess
queries by a clustering operation. We group the input queries that likely to hit on the
same leaf node into the same group. Then assign each group into a different CUDA block.
By doing this, threads in the same block can get a benefit from the L1 cache system and
minimize access to global memory. To fulfill this idea, we require a fast clustering/grouping
operation that can classify queries into a large number of groups. Thus in this work, we
introduce Radix grouping operation.
The main idea of Radix grouping operation is grouping data that have the same most
significant bit together as a single group. The different groups might use a different number
of the most significant bits as a classifier. The data that belong to the same group must be
located in consecutive memory but not sorted. Thus, the objective of the Radix grouping
algorithm is to rearrange data to the correct location and report the statistic data of each
group. To relocate the data to a new location, we can calculate the new location of each
data by:

63

Algorithm 10: Radix grouping algorithm overview
Input: Query List, Number of Queries (Size), bit per round (b), final group size (s)
Output: Query List, Number of Queries, GroupSize, GroupPrefix
1: GroupSize ← {Size}
2: for each pass i do
3:
if all GroupSize ≤ s then
4:
Return Output
5:
end if
6:
if i > 0 then
7:
Swap(input,output)
8:
end if
9:
Histogram ← GenHistogram(input,i,b)
10:
P ref ix ← GenPrefixSum(Histogram)
11:
Output ← Reorder(input,Histogram,P ref ix)
12: end for
13: return (Output,Size, GroupSize, P ref ix)

Pj−1

i=0 (Hi )

+ of f set

(4.1)

Hi represents the number of elements in each group, j represents the number of groups
that should appear before the current element, and of f set is a distinct random number.
Thus, this means we need to calculate the size of each group before moving the data.
Algorithm 10 shows the overview of our Radix grouping operation. The algorithm loop
over multiple bits of the keys, which starts from the most significant bit. We call each round
of the loop as a pass. The number of passes depends on how many bits the algorithm process
each pass. For example, if the algorithm process 8 bit in each pass and the key has 64 bit,
the algorithm requires up to 8 passes to finish the procedure. As we can see, the higher
number of bit process in each pass can reduce the number of passes. However, by increasing
the number of bit for each pass, the algorithm requires exponential resources from GPUs.
By looking at algorithm 10, the algorithm can stop before all of the passes is processed, if
members of all group are smaller than user setup threshold (line 3).
64

In each pass, the algorithm requires to process three steps. First, the algorithm builds
a histogram from the entire data. This histogram shows us the size of each group in the
dataset. Then, the histogram is used to calculate a prefix sum, which we use the parallel
implementation from [65]. The prefix sum identifies the start location of each group in the
memory. Finally, the histogram and the prefix sum are used to reorder the data accordingly
to equation 4.1. As we can see, due to a large number of queries, building a histogram
and reordering the data are time-consuming operations and require multiple techniques to
optimize the operations.
We optimize building a histogram algorithm from our previous work in section 3.3 and
3.4. We still cache input in a register, and we use an on-chip shared memory to cache the
private output histogram. However, we take a benefit of a new shuffle instruction, which is
called

match any sync in Nvidia Volta [22]. The new shuffle instruction can compare and

return a bitmap of the number of threads in a warp that contain the same value. Algorithm
11 demonstrates the pseudocode of warp-level building a histogram. The pseudocode is
written base on a view of a single thread in a CUDA warp and tid is represent a thread
id in each warp. The procedure start by each thread load a query from global memory
and extract index bits of that round (line 2-3). Then each thread compares and generates
bitmap of an exact match between threads in the same warp by shuffle instruction (line 4).
The bitmap has represented the threads that have the same value of index. For example,
if thread number 0, 1, 2, 15, 31 have exactly same index, the bitmap is 0xC0010001. After
that, we selected the thread that has the highest thread id as a leader (line 5). The leader
is only the thread that updates the histogram. As we can see, an atomic operation is still
required in updating operation because an updating collision between different warps can
still happen (line 6). Finally, after all of the warps complete this procedure, the histogram
in on-chip shared memory is combined into a global histogram.

65

Algorithm 11: Warp-level building a histogram
Input: Query List, bit per round (b)
Output: Histogram
1: tid ← Thread id
2: x ← Query List[tid]
3: index ← Extract b bits from x
4: bitmap ← match any sync(index)
5: if ( ffs(bitmap)−1 == tid) then
6:
atomicAdd(histogram[index], popc(bitmap))
7: end if

After we successfully generate a histogram, we generate the prefix sum with the aforementioned parallel algorithm. By using prefix sum, we have the first piece of equation
P
( j−1
i=0 (Hi )), which is the prefix sum. Thus, in the next step, we need to calculate the
of f set that is unique per datum inside each group. A straightforward idea to calculate
of f set is using atomic operation on prefix array in global memory. The atomic operation
ensures the unique value of the offset. However, the atomic operation in global memory has
very high latency, and every thread requires to go as sequential, which means there is no
parallel computation at all. Thus, we introduce a new technique to utilize a shared memory
and reduce atomic operation in global memory. The main idea is reordering data inside the
shared memory, and batch write the order data to global memory. For that, the algorithm
required to calculate a local offset for each CUDA block in shared memory. Algorithm 12
shows the procedure of our reordering. The algorithm starts with loading queries to registers and calculates local histograms in shared memory. Then we use the local histogram to
calculate local offset (line 6). Then from the local offset, each thread uses an atomic operation to calculate an actual offset value for each query. Finally, we relocate the data by the
final offset. By doing this procedure, we can ensure the correctness of offset and relocating
data (line 8). However, because the offset from each thread is random, the write is still not
coalesced write. Therefore, we further optimize the algorithm by write to shared memory

66

Algorithm 12: Block-level Relocation data
Input: Query List, Prefix, bit per round (b)
Output: Query List
1: tid ← Thread id
2: x ← Query List[tid]
3: index ← Extract b bits from x
4: Hist ← Build block level histogram
5: for i=0 ; i < Hist.size ; i += T hreadnums do
6:
Of f set[i] ← atomicAdd(devP ref ix[i],Hist[i])
7: end for
8: SharedOut[atomicAdd(Of f set[index], 1)] ← x
9: QueryList ← SharedOut

first, and then we copy the data into global memory. This procedure improves the coalesce
write to the global memory.
After we group and reorder the queries by Radix grouping operation, we assign each group
of queries to a CUDA block. Each thread still handles a single query at a time. Because of
the grouping, the data have high spatial locality, and the data can maximize the performance
of cache system.

4.4.2 Ranged Search Query
The goal of the ranged search is to find a single/multiple results that are in the input
ranges. A straightforward GPUs algorithm for range query data is similar to Algorithm 9 in
the equality search. Each thread takes one query and runs the search on an inner node and
a leaf node. However, the algorithm requires to search for the starting key and ending key
for the inner node. Then, the algorithm runs a linear search from the starting node to the
ending node. However same as the equality search, GPUs straightforward for range query
face a limitation of the hardware. The non-coalesce memory access and long latency time
accessing global memory are still problems.
On equality searching, we improve the performance of searching by grouping query that
is similar to the same CUDA block. This technique improves a locality of data access and
67

Rij

Min

si

ei

Max

Figure 4.4: Overlap of multiple range queries in a single dimension key domain

gains the magnitude of speed up. However, the grouping step is more complicated in range
searching because each query consists of starting key and ending key. It is non-trivial to
grouping range queries. There are many methods to group the query (e.g., starting point,
ending point, range size), and it is unclear to get the best performance. Thus, it is crucial
to analyze this in detail.
We begin the analysis by plotting each range query into a single dimension key domain,
as in figure 4.4. As we can see, in the system that processes a large number of queries with
a limit number of the key range (64-bit key), there is a chance that multiple queries can
overlap with each other. This means that the overlap queries are required similar data in the
searching operation. Thus, we can take advantage of this overlap by rearranging similar data
into the same stream process, which share L1 cache. To maximize the advantage of the L1
cache, we define the grouping operation by minimizing the number of groups and maximizing
the density of overlap between queries. We define the density of overlap between queries by
the following equation:

P

PM

R

R∈Gi

i=0 Si −Ei

(4.2)

where M is the total number of the group, Gi represent each group, R is the range of each
query in the group, Si is the starting key of the group, and Ei is the ending key of the
group. This equation represents the density of the overlap, higher mean high density in the
group, or high overlap. Our goal is to find the maximum number of overlap density with
68

Ending Key

Ending Key

Starting Key

Starting Key

Figure 4.5: Range queries in 2D domain, x-axis as starting key and y-axis as ending key.
(a) Original range queries plot (b) the 2D domain after apply clustering technique.

the minimum number of groups. However, as we can see, this grouping operation requires
to analyze the whole query dataset and high computation power to generate the optimal
result. Thus, this computation is not suitable for our range searching operation on GPUs,
which has requirements of grouping time and independent workloads.
To simplify our grouping model, we plot our range input queries into a 2D space by
start position is the x-axis, and the end position is the y-axis (Figure 4.5). By plotting this
way, the neighboring points are the queries that have a similar starting point and ending
point, which, if group together, can generate a high overlap. As we can see, we can map
this problem into clustering problems such as k-mean, centroid, nearest neighbor. However,
a regular clustering algorithm can be very high cost because of a large number of clusters.
In this operation, we require fast operation and parallelizable by GPUs. Therefore, in this
paper, we propose a fast clustering technique by improving the radix grouping operation,
which presents in section 4.4.1.
The main idea of the new radix grouping operation is we select a partial bit from the
starting and ending key. This means that the 2D domain of starting and ending key (Figure
4.5) is divided into small grids. Then, if inside a particular grid contain points more than
a threshold, that grid break into a smaller grid by select more bit from starting and ending
key. This process repeats until the number of points in each grid is less than the threshold.
69

Therefore eventually, the data will be clustered into a small group. Figure 4.5 shows a 2D
space that operates by the operation. As we can see, the algorithm is almost the same as
the algorithm 10, except for the extraction part, which requires partial bits from the starting
key and ending key. The algorithm extracts the bits from both keys and concatenates them
as the grouping index. However, by extraction bits from both keys, the algorithm requires
to take double of iteration cycle and double the time of relocation of the queries. For that
reason, we also propose additional technique to reduce the dimensions.
In order to reduce the dimensions, we utilize the z-order curve, which preserves locality
and constant time to calculate. We combine the starting and ending points into a new value
that still maintains the information. We select z-order as a tool to combine the value. After
the dimensions reduction, we still use Radix grouping operation to group the queries.
After the queries are grouped, each group of queries is sent to different stream processor
(or CUDA block) and execute. This process to execute a query is simple and straightforward
with only an extra step of grouping operation. Then, the procedure searches from the starting
point node and ending point node. Then, finally, retrieve the match key from starting node
and ending node to the users.

4.5

Evaluation and Results
In this subsection, we present an empirical evaluation of the aforementioned algorithms.

We run our experiments in a workstation running Linux (Ubuntu 18.04 LTS) with an Intel
i9 CPU, 128GB of DDR4 2444-MHz memory, and an Nvidia Titan RTX GPU with 24GB
of global memory. The running time reported for all experiments is kernel execution time.

4.5.1 Index Construction on GPUs
We first evaluate the performance of tree construction on GPU via various page sizes and
data sizes. In particular, we implemented CUDA kernels to build the B+ Tree structure in
70

0.08
0.075
0.07

PageSize500
PageSize1000
PageSize5000
PageSize10000

Time (second)

0.065
0.06
0.055
0.05
0.045
0.04
0.035
0.03
0.025

300

400
500
600
Total number of key-pairs (million)

700

Figure 4.6: Performance of different page sizes GPU-based algorithms for Tree construction

GPUs which bases on our dynamic data structure from section 4.3. We assign each node to
different pages by using our buffer management mechanism to distribute pages.
We select five different page sizes to conduct this experiment: 500, 1000, 5000, 10000,
50000. We experimented on input data size ranging from 0 to 300 million key pair values.
Since the key requires to be unique, we generated the data by using Vitter’s reservoir sampling algorithm [66]. The sampling algorithm which generates random sampling of unique
numbers from a range of 0 to 240 .
We plot the results in figure 4.6. As we can see, the larger page size trends to use a
shorter time to build the tree. The reason behind this phenomenon is that we can utilize
a large number of threads to calculate each element inside the page with the larger page
size. However, because we assign a single block to build up a single page, too large page size
can cause a bottleneck in a computation. As expected, when the data size larger than 600
million pairs, Pagesize 5000 can build the tree with the fastest time. Therefore, in the rest
of the experiment, we configure the page size to 5000 pairs per page.

71

4.5.2 Index Equality Searching on GPUs
In the index Equality Searching experiment, we evaluate the searching performance of a
large number of queries in which each query requires an equality search. We implemented
a search kernel to take input as an index tree structure and queries list. The kernel returns
pairs of searching keys and values. We use the tree that generates from section 4.5.1 as an
index tree for running the search.
In this experiment, we compare five setups of index tree searching; CPU, Naive, Naive 2
Step, Grouping 3 Step, and Grouping 2 Step. Our baseline is CPU implementation, which
implemented by following the technique present in [48, 46], which using the Open MP library
to implement the algorithm. In the naive implementation, we implemented the index search
operation by assigning each query to the GPUs thread and running the searching. In Naive
2 step, we separate the operation into two parts; inner node search and leaf node search.
In Grouping 3 step, we implemented the algorithm to be the same as the Naive 2 step, but
we have an extra step to group the queries as described in section 4.4.1. In the last setup,
Grouping with 2 steps, we utilize the grouping technique and combining the inner node and
leaf node search into only one step. With these five setups, we run these setups with different
numbers of queries and capture the time. We generate the input queries randomly with a
normal distribution. The number of queries is ranged from 100 to 900 million queries.
Figure 4.7 demonstrates the running time of the experiment. As expected, GPUs can
significantly reduce the searching time of batch queries. Even with the slowest GPUs setup,
the naive setup performs up to 6.6 faster than the multi-core CPU setup. By separating naive
search into two steps, the naive 2 steps setup produces a similar searching time to without
separating. By applying a grouping operation as an extra step, the searching operation
performs about 10 times faster than the naive setup. This shows that grouping operations
can significantly improve the utilization of GPUs with the best setup that can be 90 times

72

1000

120
100

10

Speedup

Time (second)

100

CPU
Naive
Naive 2 Step
Grouping 3 step
Grouping 2 Step

1

Naive
Naive 2 Step
Grouping 3 step
Grouping 2 Step

80
60
40

0.1

0.01

20

200
400
600
800
Total number of key-pairs (million)

(a)

0

200
400
600
800
Total number of key-pairs (million)

(b)

Figure 4.7: Performance of different GPU-based algorithms for Tree equality searching:
total running time and speedup over CPU algorithm

3.5

Grouping Step
Inner-node Step
Leaf-node Step

3

Time (second)

2.5
2
1.5
1
0.5
0

Naive

Naive 2 Step Grouping 3 step Grouping 2 Step
Kernel Name

Figure 4.8: Performance of different GPU-based algorithms for Tree equality searching by
step

73

0.9
0.8

1.2

GroupingTime
SearchingTime

1

GroupingTime
Inner-Node Time
Leaf-Node Time

Time (second)

Time (second)

0.7
0.6
0.5
0.4
0.3

0.8
0.6
0.4

0.2
0.2
0.1
0

1 3 5 7 10 50 10 50 1k 1.5 2k 5k
0 0
k

Group Threshold (x1000)
(a)

0

1 3 5 7 9 50 10 50 1k 1.5 2k 5k
0 0
k

Group Threshold (x1000)
(b)

Figure 4.9: Performance of different Group size for Tree equality searching: left two step
searching algorithm, right three step searching algorithm

faster than the multi-core CPU setup. We also capture the time of each step in the searching
operation; Grouping step, inner node search step, and output step. The running time of each
step is demonstrated in figure 4.8. We report the time as zero if the setup does not have
that step. As we can see, separate Naive into 2 steps, the majority of running time spend
on Output step, which search through the leaf node and output the result. In the setup that
contains the Grouping step, half of the time, the computation spends on grouping operation.
We run further analysis to explore the number of queries in a single group or group-size,
affecting the performance of searching on GPUs. We run the best two setups, which is
Grouping 2 Step and Grouping 3 Step, with a different number of queries that assigns to a
single block. We select the group-size from 1 thousand queries per group to 5 million queries
per group and the query size of 500 million queries in a batch. The results of both setups are
shown in figure 4.9. As we can see, the two setups have a similar trend, which is the smaller
group-size takes a longer time in grouping operation but faster in searching time. This shows
that the small group-size can benefit more from the caching system, but the smaller group

74

Table 4.1: Utilization of different GPU resources in running different group size on GPU
exact searching under a data size of 500 millions.
Group
Size

L1
Hit Rate

L2
Hit Rate

Memory

SM

5k
50k
500k
2000k
5000k
No Grouping

56.91%
91.04%
90.49%
75.23%
43.32%
29.62%

93.89%
56.03%
57.04%
44.06%
26.13%
45.75%

45.23%
41.42%
40.22%
33.03%
34.87%
35.38%

22.52%
48.72%
46.77%
17.13%
3.69%
2.82%

size also required the algorithm to observe more bits data. To ensure that, we also capture
the GPU L1 and L2 cache hit rate of each group-size. Table 4.1 shows the GPU cache hit
rate, which obtains from Nvidia Profiler. Clearly, verifying the assumption, which smaller
group-size produce higher hit cache hit rate and faster running time. However, figure 4.9 also
shows that tiny grouping consumes more time in grouping than the bigger group-size. This
happens because the smaller group-size requires the algorithm to observe more numbers of
bits data. Therefore, the best group-size parameter needs to be in the middle range; in this
experiment, between 50 thousand to 1 million queries per block.

4.5.3 Index Range Searching on GPUs
In the Range Searching experiment, we evaluate the searching performance of a large
number of queries in which each query executes a range search. Same as the previous setup,
we implemented a search kernel to take input as an index tree and a searching list request,
which contains a starting and ending point. The kernel returns a pair of searching key and
the value. We use the tree that generates from section 4.5.1 as an index tree for running the
search.
In this experiment, we compare four setups of index tree range searching; CPU, Naive,
Grouping, Zordering. Our baseline is CPU implementation. Again, We implemented the
75

12

CPU
Naive
Grouping
Zordering

10

10

Naive
Grouping
Zordering

8
Speedup

Time (second)

100

1

6
4
2

0.1
100 150 200 250 300 350 400
Total number of key-pairs (million)

(a)

0
100 150 200 250 300 350 400
Total number of key-pairs (million)

(b)

Figure 4.10: Performance of different GPU-based algorithms for Tree range searching: total
running time and speedup over CPU algorithm

CPU by following the technique present in [48, 46], which using the Open MP library to
implement the algorithm. In the Naive implementation, we implemented the index search
operation by assign each query to the GPU thread and run the searching. Each thread
searches for the start leaf node position and the end leaf node position. Then each thread
searches from the starting leaf node to the ending leaf node to generate the result. In the
Grouping implementation, we add a radix grouping operation as a pre-processing step before
the searching. Then, each group queries assigns to a CUDA block and process the searching
operation. In Zordering implementation, we optimized the grouping operation by reducing
the dimension of the searching operation into a single dimension with z-order, as discussed
in section .
We run these four setups with different numbers of queries and capture the time. We
generate the input queries randomly with a normal distribution. The number of queries is
ranged from 100 to 400 million queries.

76

Table 4.2: Utilization of different GPU resources in running different group size on GPU
range searching in second steps under a data size of 500 millions.
Group
Size

L1
Hit Rate

L2
Hit Rate

Memory

SM

5k
50k
500k
5000k
10000k
No Grouping

85.46%
82.82%
54.87%
43.07%
52.75%
32.10%

54.58%
57.56%
42.31%
30.60%
38.55%
49.72%

34.98%
37.36%
41.75%
35.07%
39.68%
34.79%

58.65%
44.20%
10.64%
4.49%
8.71%
3.59%

Figure 4.10 demonstrates the running time of the experiment. As expected, GPU can
significantly reduce the searching time of batch queries. Even the slowest GPU setup, Naive
setup can perform up to 3.9 times faster than the multi-core CPU setup. By applying
a grouping operation as an extra step, the searching operation can perform about 2 times
faster than the Naive setup. This shows that, again, grouping operation can also significantly
improve the utilization of GPUs range searching operation. Furthermore, the grouping
operation with z-order reduce-dismission can gain a similar speed to grouping operation.
Again, we run further analysis by exploring the number of queries in a single group or
group-size, affecting the performance of range searching on GPUs. We run the best two
setups, which are Grouping and z-order, with a different number of queries that assign to
a single block. We select group-size from 1 thousand queries per group to 5 million queries
per group and the query size of 400 million queries in this experiment. We plot the results
of the running time in figure 4.11. As we expect, the results show the same trend as the
exact searching in the previous section. The searching time is faster when the group size is
smaller, but the Grouping time is higher as the group size is smaller. We also use the profiler
to measure the cache hit rate on each searching step. Table 4.2 and 4.3 show the cache hit
rate on both steps. As we can see, clearly that reorder the query into a small group can
increase the cache hit rate and lead to better performance of searching.
77

Time (second)

2.5

3

GroupingTime
InnerSearchingTime
LeafSearchingTime

2.5

Time (second)

3

2
1.5
1
0.5
0

Zorder
GroupingTime
InnerSearchingTime
LeafSearchingTime

2
1.5
1
0.5

1 3 5 7 10 50 10 50 1k 5k 10
0 0
k

0

Group Threshold (x1000)
(a)

1 3 5 7 10 50 10 50 1k 5k 10
0 0
k

Group Size (x1000)
(b)

Figure 4.11: Tree range searching time for each grouping threshold

Table 4.3: Utilization of different GPU resources in running different group size on GPU
range searching in third steps under a data size of 500 millions.
Group
Size

L1
Hit Rate

L2
Hit Rate

Memory

SM

5k
50k
500k
5000k
10000k
No Grouping

92.98%
77.36%
45.40%
43.70%
44.78%
66.19%

30.81%
66.66%
69.08%
69.76%
69.06%
46.55%

45.23%
38.09%
38.52%
16.52%
36.15%
31.51%

46.11%
29.09%
12.26%
5.81%
11.46%
32.24%

78

4.6

2-BS Application
In this section, we applied our index-search CUDA kernels with 2-BS real-world applica-

tions, which can reduce the quadratic computation of 2-BS. We present Join operation as a
case study.

4.6.1 Equi-Join Operation in DBMS
Relational joins is one of the most intensive computations in DBMS and commonly found
on queries. In relational algebra, the relational join operation is defined by a Cartesian
product and an equality condition. There are many types of join operations, such as cross
join, natural join and, condition join. Different kind of join operation requires different
state-of-art to handle. In 2-BS computation, Join operation falls into our third category, in
which the output size is undefined. In this section, we discuss conditional join operation,
which the condition is an equal operator. This kind of join is well-know as Equi-join.
Equi-Join is the most commons operator and essential in database systems, however
high cost to operate. The core computation of the Equi-Join is searching for the match
between two tables. Thus, the performance of the operation is determined by the speed of
searching on tables. We use Index Join as an example of a 2-BS application to verify the
effectiveness of our index search operation. We compare our index-join compare with other
state-of-art of the join operations, which are hash-join and sort-merge-join. We select the
highest performance GPGPU implementation of hash-join and sort-merge-join from [21].
In this experiment, each tuple contains a randomly generated integer ID and a key value.
Tables’ sizes range from 100 million to 500 million tuples. We select the best kernel from the
experiment in section 4.5 to implement the index-join operation, which is Grouping 2 Steps.
In our implementation, We join the two table by the keys of two tables. We build an index
of the first table and then run the search of the key of the second table through the index.

79

4.5
4

IndexJoin
HashJoin
SortMergeJoin

Time (second)

3.5
3
2.5
2
1.5
1
0.5
0

100

150

200
250
300
350
400
Total number of key-pairs (million)

450

500

Figure 4.12: Performance of Index-join kernel as compared to best known Hash-Join and
Sort-Merge-Join kernel

Figure 4.12 shows a running time results of index-join operation compare to sort-merge
join and hash-merge join. By comparing the performance of these three join algorithm, our
index-join can outperform hash-join, and sort-merge-join when input is larger than 150, and
300 million pairs respectively. When the input data is 500 million pair, the index-join is
demonstrated the best performance with up to 2 times faster than hash-join and 1.11 time
faster than sort-merge-join.

80

Chapter 5: Conclusions

The 2-BS problems are popular and fundamental to many natural science domains. In
this dissertation, we studied parallel algorithms for processing 2-BS by exploiting the high
computing power of GPUs. First, we introduce a straightforward parallel algorithm under the
CUDA framework. The unique architecture of modern GPUs, however, provides abundant
opportunities for optimizing the algorithm towards better performance. To that end, we
divide the problem into two stages: pairwise computation and writing output. In order to
increase the performance, we present modifications to the algorithm by integrating various
novel techniques in each stage. In the pairwise computation stage, we optimize the algorithm
by blocking and tiling data into multiprocessors using different data paths, shared memory,
read-only data cache, and register. We evaluate this stage by 2-PCF problem. The results
show that tiling via shared memory and register outperform other techniques for this type
of 2-BS problems by up to 4 times. Considering the writing output stage, we utilize onchip shared memory to privatize output and use a parallel reduction method to combine
each private output. Experiments show that privatizing output can improve speed up to
13 times and lead to a 52X speedup over a highly-optimized parallel CPU version for the
same problem. We also introduce direct buffer output for 2-BS with large outputs whose
size is unknown at compile time. To further improve the efficiency of the algorithm, we
develop load balancing and tilling techniques with shuffle instructions. The scalability of the
GPU Algorithms is also studied to handle very large input/output data in 2-BS computation.
Moreover, we develop a general 2-BS framework that can automatically generate CUDA code
for user-defined 2-BS problems. To maximize the performance of GPUs on 2-BS problems,
81

we also studied a particular group of 2-BS, in which the computation can be reduced by
using an index tree structure. We introduce an index searching technique in GPUs that
specifically designed for 2-BS problems. By evaluating our index searching, our index search
algorithm can perform up to 90 times faster than multi-core CPU. We also evaluated the
algorithm by using an Equi-Join to verify the performance of the algorithm on GPUs. As
a result, our index join can perform 2 times faster than the best GPU hash-join and 1.11
speed up over sort-merge-join.

82

References

[1] R. Rui, H. Li, and Y. Tu, “Join algorithms on GPUs: A revisit after seven years,” in
2015 IEEE International Conference on Big Data, Big Data 2015, Santa Clara, CA,
USA, October 29 - November 1, 2015, pp. 2541–2550, 2015.
[2] C. Türker, F. Akal, D. Studer-Joho, and R. Schlapbach, “B-fabric: An open source life
sciences data management system,” in Scientific and Statistical Database Management,
21st International Conference, SSDBM 2009, New Orleans, LA, USA, June 2-4, 2009,
Proceedings, pp. 185–190, 2009.
[3] M. Feig, M. Abdullah, S. L. Johnsson, and B. M. Pettitt, “Large scale distributed data
repository: design of a molecular dynamics trajectory database,” Future Generation
Comp. Syst., vol. 16, no. 1, pp. 101–110, 1999.
[4] G. Finocchiaro, T. Wang, R. Hoffmann, A. Gonzalez, and R. C. Wade, “DSMM:
a database of simulated molecular motions,” Nucleic Acids Research, vol. 31, no. 1,
pp. 456–457, 2003.
[5] W. Xu, S. Ozer, and R. R. Gutell, “Covariant evolutionary event analysis for base
interaction prediction using a relational database management system for RNA,” in
Scientific and Statistical Database Management, 21st International Conference, SSDBM
2009, New Orleans, LA, USA, June 2-4, 2009, Proceedings, pp. 200–216, 2009.

83

[6] S. Luo, Z. J. Gao, M. N. Gubanov, L. L. Perez, and C. M. Jermaine, “Scalable linear
algebra on a relational database system,” in 33rd IEEE International Conference on
Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017, pp. 523–534,
2017.
[7] Y.-C. Tu, S. Chen, and S. Pandit, “Computing distance histograms efficiently in scientific databases,” ICDE, pp. 796–807, 2009.
[8] B. Schölkopf, C. J. C. Burges, and A. J. Smola, eds., Advances in Kernel Methods:
Support Vector Learning. Cambridge, MA, USA: MIT Press, 1999.
[9] L. Rokach and S. Kisilevich, “Initial profile generation in recommender systems using pairwise comparison,” Systems, Man, and Cybernetics, Part C: Applications and
Reviews, IEEE Transactions on, vol. 42, pp. 1854–1859, Nov 2012.
[10] S. Jiang, X. Wang, and H. Zhu, “Learning pairwise comparisons of items with bigram content features for recommending,” in Computer Science and Network Technology
(ICCSNT), 2013 3rd International Conference on, pp. 446–449, Oct 2013.
[11] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander, “Relational
joins on graphics processors,” in Procs. ACM Intl. Conf. Management of Data (SIGMOD), pp. 511–524, 2008.
[12] NVIDIA: CUDA C Programming Guide Version 10.0.
[13] T. Group., “Opencl.,”
[14] X. Liu, M. Li, S. Li, S. Peng, X. Liao, and X. Lu, “IMGPU: gpu-accelerated influence maximization in large-scale social networks,” IEEE Trans. Parallel Distrib. Syst.,
vol. 25, no. 1, pp. 136–145, 2014.

84

[15] H. Wang, L. Geng, R. Lee, K. Hou, Y. Zhang, and X. Zhang, “Sep-graph: finding
shortest execution paths for graph processing under a hybrid framework on GPU,”
in Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming, PPoPP 2019, Washington, DC, USA, February 16-20, 2019
(J. K. Hollingsworth and I. Keidar, eds.), pp. 38–52, ACM, 2019.
[16] F. Zhang, H. Lin, J. Zhai, J. Cheng, D. Xiang, J. Li, Y. Chai, and X. Du, “An adaptive
breadth-first search algorithm on integrated architectures,” The Journal of Supercomputing, vol. 74, no. 11, pp. 6135–6155, 2018.
[17] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing, “Geeps: scalable
deep learning on distributed gpus with a gpu-specialized parameter server,” in Proceedings of the Eleventh European Conference on Computer Systems, EuroSys 2016,
London, United Kingdom, April 18-21, 2016 (C. Cadar, P. R. Pietzuch, K. Keeton, and
R. Rodrigues, eds.), pp. 4:1–4:16, ACM, 2016.
[18] P. Chrysogelos, P. Sioulas, and A. Ailamaki, “Hardware-conscious query processing
in gpu-accelerated analytical engines,” in CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online
Proceedings, www.cidrdb.org, 2019.
[19] K. Wang, K. Zhang, Y. Yuan, S. Ma, R. Lee, X. Ding, and X. Zhang, “Concurrent
analytical query processing with gpus,” PVLDB, vol. 7, no. 11, pp. 1011–1022, 2014.
[20] Y. Yuan, R. Lee, and X. Zhang, “The yin and yang of processing data warehousing
queries on GPU devices,” PVLDB, vol. 6, no. 10, pp. 817–828, 2013.
[21] R. Rui and Y. Tu, “Fast equi-join algorithms on gpus: Design and implementation,” in
Proceedings of the 29th International Conference on Scientific and Statistical Database
Management, Chicago, IL, USA, June 27-29, 2017, pp. 17:1–17:12, 2017.
85

[22] NVIDIA GeForce Tesla V100 Whitepaper.
[23] “Nvidia’s next generation cudatm compute architecture:fermi,” tech. rep., NVidia Developer Technology.
[24] “Nvidia’s next generation cudatm compute architecture:kepler gk110,” tech. rep.,
NVidia Developer Technology.
[25] NVIDIA. GTX 980 whitepaper.
[26] NVIDIA GeForce GTX 1080 Whitepaper.
[27] NVIDIA Cop, NVIDIA TURING GPU ARCHITECTURE.
[28] A. G. Gray and A. W. Moore, “N-body problems in statistical learning,” Advances in
Neural Information Processing Systems (NIPS), pp. 521–527, 1993.
[29] Y. Zhu, Z. Zimmerman, N. Shakibay Senobari, C.-C. M. Yeh, G. Funning, A. Mueen,
P. Brisk, and E. Keogh, “Exploiting a novel algorithm and gpus to break the ten
quadrillion pairwise comparisons barrier for time series motifs and joins,” Knowledge
and Information Systems, Dec 2017.
[30] J. A. Stratton, C. Rodrigues, I.-J. Sung, L.-W. Chang, N. Anssari, G. Liu, W.-M. Hwu,
and N. Obeid, “Algorithm and data optimization techniques for scaling to massively
threaded systems,” Computer , vol.45, no.8, pp. 26–32, 2012.
[31] B. G. Levine, J. E. Stone, and A. Kohlmeyer, “Fast analysis of molecular dynamics
trajectories with graphics processing units-radial distribution function histogramming,”
Journal of Computational Physics. J. Comp. Phys., pp. 3556–3569, 2011.

86

[32] B. Jensen, J. Saez Gallego, and J. Larsen, “A predictive model of music preference
using pairwise comparisons,” in Acoustics, Speech and Signal Processing (ICASSP),
2012 IEEE International Conference on, pp. 1977–1980, March 2012.
[33] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha, “Fast computation of
database operations using graphics processors,” in Procs. ACM Intl. Conf. Management
of Data (SIGMOD), SIGMOD ’04, pp. 215–226, 2004.
[34] H. Pirk, S. Manegold, and M. L. Kersten, “Accelerating foreign-key joins using asymmetric memory channels,” in International Workshop on Accelerating Data Management
Systems Using Modern Processor and Storage Architectures - ADMS 2011, Seattle, WA,
USA, September 2, 2011 (R. Bordawekar and C. A. Lang, eds.), pp. 27–35, 2011.
[35] T. Karnagel, D. Habich, B. Schlegel, and W. Lehner, “Heterogeneity-aware operator
placement in column-store DBMS,” Datenbank-Spektrum, vol. 14, no. 3, pp. 211–221,
2014.
[36] B. He and Q. Luo, “Cache-oblivious nested-loop joins,” in Proceedings of the 2006
ACM CIKM International Conference on Information and Knowledge Management,
Arlington, Virginia, USA, November 6-11, 2006, pp. 718–727, 2006.
[37] C. Kim, E. Sedlar, J. Chhugani, T. Kaldewey, A. D. Nguyen, A. D. Blas, V. W. Lee,
N. Satish, and P. Dubey, “Sort vs. hash revisited: Fast join implementation on modern
multi-core cpus,” PVLDB, vol. 2, no. 2, pp. 1378–1389, 2009.
[38] M. Albutiu, A. Kemper, and T. Neumann, “Massively parallel sort-merge joins in main
memory multi-core database systems,” PVLDB, vol. 5, no. 10, pp. 1064–1075, 2012.

87

[39] R. Ponce, M. Cardenas-Montes, J. J. Rodriguez-Vazquez, E. Sanchez, and I. Sevilla,
“Application of gpus for the calculation of two point correlation functions in cosmology,”
in ADASS XXI (Paris, 2011) conference proceedings, 2012.
[40] T. Karnagel, R. Müller, and G. M. Lohman, “Optimizing gpu-accelerated group-by
and aggregation,” in International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures - ADMS 2015, Kohala Coast,
Hawaii, USA, August 31, 2015., pp. 13–24, 2015.
[41] Y. Ye, K. A. Ross, and N. Vesdapunt, “Scalable aggregation on multicore processors,”
in Proceedings of the Seventh International Workshop on Data Management on New
Hardware, DaMoN 2011, Athens, Greece, June 13, 2011, pp. 1–9, 2011.
[42] T. Kaldewey, G. M. Lohman, R. Müller, and P. B. Volk, “GPU join processing revisited,” in Proceedings of the Eighth International Workshop on Data Management
on New Hardware, DaMoN 2012, Scottsdale, AZ, USA, May 21, 2012 (S. Chen and
S. Harizopoulos, eds.), pp. 55–62, ACM, 2012.
[43] A. Kumar, V. Grupcev, Y. Yuan, J. Huang, Y. Tu, and G. Shen, “Computing spatial
distance histograms for large scientific data sets on-the-fly,” IEEE Trans. Knowl. Data
Eng., vol. 26, no. 10, pp. 2410–2424, 2014.
[44] V. Grupcev, Y. Yuan, Y. Tu, J. Huang, S. Chen, S. Pandit, and M. Weng, “Approximate
algorithms for computing spatial distance histograms with accuracy guarantees,” IEEE
Trans. Knowl. Data Eng., vol. 25, no. 9, pp. 1982–1996, 2013.
[45] L. Chen, Y. Gao, X. Li, C. S. Jensen, and G. Chen, “Efficient metric indexing for
similarity search and similarity joins,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, pp. 556–571, March 2017.

88

[46] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee,
S. A. Brandt, and P. Dubey, “FAST: fast architecture sensitive tree search on modern
cpus and gpus,” in Proceedings of the ACM SIGMOD International Conference on
Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010,
pp. 339–350, 2010.
[47] T. Kaldewey and A. Di Blas, “Large-scale gpu search,” GPU Computing Gems Jade
Edition, pp. 3–14, 12 2012.
[48] A. Shahvarani and H. Jacobsen, “A hybrid b+-tree as solution for in-memory indexing
on CPU-GPU heterogeneous computing platforms,” in Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco,
CA, USA, June 26 - July 01, 2016, pp. 1523–1538, 2016.
[49] K. Kaczmarski, “B + -tree optimized for GPGPU,” in On the Move to Meaningful
Internet Systems: OTM 2012, Confederated International Conferences: CoopIS, DOASVI, and ODBASE 2012, Rome, Italy, September 10-14, 2012. Proceedings, Part II,
pp. 843–854, 2012.
[50] K. Kaczmarski, “Experimental b+-tree for GPU,” in ADBIS 2011, Research Communications, Proceedings II of the 15th East-European Conference on Advances in Databases
and Information Systems, September 20-23, 2011, Vienna, Austria, pp. 232–241, 2011.
[51] Timo Bingmann, STX B+ Tree C++ Template Classes.
[52] A. Agrawal and X. Huang, “Pairwise statistical significance of local sequence alignment
using sequence-specific and position-specific substitution matrices,” Computational Biology and Bioinformatics, IEEE/ACM Transactions on, vol. 8, pp. 194–205, 2011.
[53] NVIDIA. CUDA C Best Practices Guide, version 7.5.
89

[54] Analyzing GPGPU Pipeline Latency, 2014.
[55] H. Wong, M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, “Demystifying
GPU microarchitecture through microbenchmarking,” in IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2010, 28-30 March
2010, White Plains, NY, USA, pp. 235–246, 2010.
[56] J. Wang, X. Xie, and J. Cong, “Communication optimization on GPU: A case study of
sequence alignment algorithms,” in 2017 IEEE International Parallel and Distributed
Processing Symposium, IPDPS 2017, Orlando, FL, USA, May 29 - June 2, 2017, pp. 72–
81, 2017.
[57] H. Li, D. Yu, A. Kumar, and Y. Tu, “Modeling in cuda strems - a means for highthroughput data processing,” Big Data (Big Data), IEEE International Conference,
pp. 301–310, 2014.
[58] D. Bloom, “A birthday problem.,” Am. Math. Mon. 80, pp. 1141–1142, 1973.
[59] 2BS Framework.
[60] R. Elmasri, “Data management fundamentals: Database management system,” in Encyclopedia of Database Systems, Second Edition, 2018.
[61] R. Bayer and E. M. McCreight, “Organization and maintenance of large ordered indices,” Acta Inf., vol. 1, pp. 173–189, 1972.
[62] J. Zhou and K. A. Ross, “Implementing database operations using simd instructions,”
in Proceedings of the 2002 ACM SIGMOD International Conference on Management of
Data, SIGMOD ’02, (New York, NY, USA), pp. 145–156, ACM, 2002.

90

[63] B. Schlegel, R. Gemulla, and W. Lehner, “K-ary search on modern processors,” in Proceedings of the Fifth International Workshop on Data Management on New Hardware,
DaMoN ’09, (New York, NY, USA), pp. 52–60, ACM, 2009.
[64] Nvlabs, CUB Documentation.
[65] H. Nguyen, “Parallel prefix sum (scan) with cuda,” in Gpu gems 3, ch. 37, AddisonWesley Professional: Addison-Wesley Professional, 2007.
[66] J. S. Vitter, “Random sampling with a reservoir,” ACM Trans. Math. Softw., vol. 11,
no. 1, pp. 37–57, 1985.

91

Appendix A: Copyright Permissions

Portions of this dissertation was previously published in Distributed and Parallel Databases
Journal (DAPD), Volume 37:587–622 DOI: https://doi.org/10.1007/s10619-018-7238-0 and
in Proceedings of the 45th International Conference on Parallel Processing (ICPP), Philadelphia, PA, USA.,pp. 380-385, DOI: 10.1109/ICPP.2016.50.
The permission (license number 4743170651709) for reuse in a dissertation is obtained.
Partial or complete papers can be included in this dissertation.

92

According to IEEE author rights, authors can include partial or complete papers of their
own in a dissertation without formal reuse license.
Rightslink® by Copyright Clearance Center

12/18/19, 4:25 PM

RightsLink

Home

Help

Email Support

Sign in

Create Account

E!cient 2-Body Statistics Computation on GPUs: Parallelization &
Beyond
Conference Proceedings: 2016 45th International Conference on Parallel Processing (ICPP)
Author: Napath Pitaksirianan
Publisher: IEEE
Date: Aug. 2016
Copyright © 2016, IEEE

Thesis / Dissertation Reuse
The IEEE does not require individuals working on a thesis to obtain a formal reuse license, however, you may
print out this statement to be used as a permission grant:

Requirements to be followed when using any portion (e.g., !gure, graph, table, or textual material) of an IEEE
copyrighted paper in a thesis:
1) In the case of textual material (e.g., using short quotes or referring to the work within these papers) users must
give full credit to the original source (author, paper, publication) followed by the IEEE copyright line © 2011 IEEE.
2) In the case of illustrations or tabular material, we require that the copyright line © [Year of original publication]
IEEE appear prominently with each reprinted !gure and/or table.
3) If a substantial portion of the original paper is to be used, and if you are not the senior author, also obtain the
senior author's approval.

Requirements to be followed when using an entire IEEE copyrighted paper in a thesis:
1) The following IEEE copyright/ credit notice should be placed prominently in the references: © [year of original
publication] IEEE. Reprinted, with permission, from [author names, paper title, IEEE publication title, and month/year
of publication]
2) Only the accepted version of an IEEE copyrighted paper can be used when posting the paper or your thesis online.
3) In placing the thesis on the author's university website, please display the following message in a prominent place
on the website: In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does
not endorse any of [university/educational entity's name goes here]'s products or services. Internal or personal use
of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or
promotional purposes or for creating new collective works for resale or redistribution, please go to
http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License
from RightsLink.
If applicable, University Micro!lms and/or ProQuest Library, or the Archives of Canada may supply single copies of
the dissertation.
BACK

© 2019 Copyright - All Rights Reserved | Copyright Clearance Center, Inc. | Privacy statement
Comments? We would like to hear from you. E-mail us at customercare@copyright.com

https://s100.copyright.com/AppDispatchServlet#formTop

CLOSE

| Terms and Conditions

Page 1 of 1

93

