Algorithmic and Software System Support to Accelerate Data Processing in CPU-GPU Hybrid Computing Environments by Wang, Kaibo
  
Algorithmic and Software System Support to Accelerate Data Processing in CPU-GPU 
Hybrid Computing Environments 
 
 
DISSERTATION 
 
 
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy 
in the Graduate School of The Ohio State University 
 
By 
Kaibo Wang 
Graduate Program in Computer Science and Engineering 
 
The Ohio State University 
2015 
 
 
Dissertation Committee: 
Xiaodong Zhang, Advisor 
P. Sadayappan 
Christopher Stewart 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Copyright by 
Kaibo Wang 
2015 
 
 
 
 
 
 
 
 
Abstract 
 
Massively data-parallel processors, Graphics Processing Units (GPUs) in particular, 
have recently entered the mainstream of general-purpose computing as powerful 
hardware accelerators for a wide range of applications, including databases, medical 
informatics, and big data analytics. However, despite their performance benefits and cost 
effectiveness, the utilization of GPUs in production systems remains limited. A 
major reason behind this situation is the slow development of a supportive GPU software 
ecosystem. More specifically, (1) CPU-optimized algorithms for some critical computation 
problems have irregular memory access patterns with intensive control flows, and 
cannot be easily ported to GPUs to take full advantage of their fine-grained, massively 
data-parallel architecture; (2) commodity computing environments are inherently concurrent 
and require coordinated resource sharing to maximize throughput, while existing systems 
are still mainly designed for dedicated usage of GPU resources. 
In this Ph.D. dissertation, we develop efficient software solutions to support the 
adoption of massively data-parallel processors in general-purpose commodity computing 
systems. Our research focuses on three areas. First, to make a strong case 
for GPUs as indispensable accelerators, we apply GPUs to significantly improve the 
performance of spatial data cross-comparison in digital pathology analysis. Instead of 
trying to port existing CPU-based algorithms to GPUs, we design a new algorithm and 
optimize it for the GPU hardware architecture to achieve high performance. Second, we 
propose operating system support for automatic device memory management to improve 
the usability and performance of GPUs in shared general-purpose computing 
environments. Several effective optimization techniques are employed to ensure the 
efficient usage of GPU device memory space and to achieve high throughput. Finally, we 
develop resource management facilities in GPU database systems to support concurrent 
analytical query processing. By allowing multiple queries to execute simultaneously, the 
resource utilization of GPUs can be greatly improved. Concurrent execution also enables 
GPU databases to be used in important application areas where multiple user queries 
need to make continuous progress simultaneously. 
 
  
 
 
 
 
 
 
 
 
 
 
Dedication 
 
To my family and dearest friends. 
 
  
 
 
 
 
 
 
Acknowledgments 
 
This journey could not have been so memorable without the companionship of many 
wonderful people. My advisor opened the door and led me step by step toward the essential 
goal of doing impactful research. Xiaoning Ding and Rubao Lee were great mentors whose 
deep knowledge and consistent support smoothed the roughest roads. Yuan Yuan, Kai 
Zhang, and Yin Huai walked shoulder to shoulder with me and lent the strongest arms 
whenever help was needed. My girlfriend was always supportive and never complained 
about my academic life, which was often too busy to give her enough attention. 
Thank you all! 
 
  
 
 
 
Vita 
 
2006................................................................B.S. Computer Science and Engineering, 
Northwestern Polytechnical University, Xi’an, China 
2009................................................................M.S. Computer Science and Engineering, 
Northwestern Polytechnical University, Xi’an, China 
2010 to present ...............................................Ph.D. Computer Science and Engineering, 
The Ohio State University 
 
Publications 
 
1.  Kai Zhang, Kaibo Wang, Yuan Yuan, Rubao Lee, Lei Guo, Xiaodong Zhang. Mega-KV: 
A Case for GPUs to Maximize the Throughput of In-Memory Key-Value Stores. 
Proc. VLDB Endow., 8(11):1226-1237, 2015. 
2.  Kaibo Wang, Kai Zhang, Yuan Yuan, Siyuan Ma, Rubao Lee, Xiaoning Ding, 
Xiaodong Zhang. Concurrent Analytical Query Processing with GPUs. Proc. VLDB 
Endow., 7(11):1011-1022, 2014. 
3.  Kaibo Wang, Xiaoning Ding, Rubao Lee, Shinpei Kato, Xiaodong Zhang. GDM: 
Device Memory Management for GPGPU Computing. In Proceedings of the 2014 ACM 
International Conference on Measurement and Modeling of Computer Systems 
(SIGMETRICS 2014), pages 533-545, 2014. 
4.  Kaibo Wang, Yin Huai, Rubao Lee, Fusheng Wang, Xiaodong Zhang, Joel H. Saltz. 
Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems. 
Proc. VLDB Endow., 5(11):1543-1554, 2012. 
5.  Xiaoning Ding, Kaibo Wang, Phillip B. Gibbons, Xiaodong Zhang. BWS: Balanced 
Work Stealing for Time-Sharing Multicores. In Proceedings of the 7th European 
Conference on Computer Systems (EuroSys 2012), pages 365-378, 2012. 
6.  Xiaoning Ding, Kaibo Wang, Xiaodong Zhang. SRM-Buffer: An OS Buffer 
Management Technique to Prevent Last Level Caches from Thrashing in Multicores. In 
Proceedings of the 6th European Conference on Computer Systems (EuroSys 2011), 
pages 243-256, 2011. 
7.  Xiaoning Ding, Kaibo Wang, Xiaodong Zhang. ULCC: A User-Level Facility for 
Optimizing Shared Cache Performance on Multicores. In Proceedings of the 16th ACM 
Symposium on Principles and Practice of Parallel Programming (PPoPP 2011), pages 
103-112, 2011. 
 
Fields of Study 
 
Major Field:  Computer Science and Engineering 
 
  
 
 
 
Table of Contents 
 
Abstract
Dedication
Acknowledgments
Vita
Publications
Fields of Study
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems
2.1 Introduction
2.2 Problem Identification
2.2.1 Background: Spatial Cross-Comparison
2.2.2 Existing Solutions with SDBMSs
2.2.3 Performance Profiling of SDBMS Solution
2.3 The PixelBox Algorithm
2.3.1 Pixelization of Polygon Pairs
2.3.2 Reduction of Computing Intensity
2.3.3 Optimized Algorithm Implementation
2.3.4 Related Discussions
2.4 System Framework
2.4.1 The Pipelined Structure
2.4.2 Dynamic Task Migration
2.5 Experiments
2.5.1 Experiment Methodology
2.5.2 Performance of the PixelBox Algorithm
2.5.3 Effectiveness of Optimization Techniques
2.5.4 Parameter Sensitivity of PixelBox
2.5.5 Performance of the Pipelined Framework
2.5.6 Effectiveness of Dynamic Task Migration
2.5.7 Performance Evaluation with All Data Sets
2.6 Related Work
2.7 Conclusions
Chapter 3 GDM: Device Memory Management for GPGPU Computing
3.1 Introduction
3.1.1 Problems with Application-Level Device Memory Management
3.1.2 GDM: OS Device Memory Management
3.1.3 Contributions
3.2 Demand for System-Level Device Memory Management
3.2.1 GPGPU Computing Architecture
3.2.2 Device Memory: A Critical Resource
3.2.3 Issues with Existing System Designs
3.2.4 Demand for System Management
3.3 GDM Overview
3.3.1 Minimizing Overhead: A Major Challenge
3.3.2 Guidelines for GDM Design
3.4 GDM Design
3.4.1 Staging Areas
3.4.2 Device Memory Regions, Objects, Blocks
3.4.3 Loading Data to Device Memory
3.4.4 Management of Device Memory Space
3.5 Implementation
3.5.1 Regions, Blocks, and Objects
3.5.2 Signature Computing
3.6 Evaluation
3.6.1 Experiment Setup and Methodology
3.6.2 Tolerating Device Memory Leaks
3.6.3 Multitasking Performance
3.6.4 Validation of Design Optimizations
3.6.5 Defending against DoS Attacks
3.7 Related Work
3.8 Conclusions and Future Work
Chapter 4 Concurrent Analytical Query Processing with GPUs
4.1 Introduction
4.2 Background and Motivation
4.2.1 Analytical Query Processing with GPUs
4.2.2 Low Resource Utilization
4.2.3 Problems with Uncoordinated Query Co-Running
4.3 MultiQx-GPU: An Overview
4.4 Device Memory Manager
4.4.1 Framework
4.4.2 Data Replacement
4.5 Query Scheduler
4.6 Implementation
4.7 Experiments
4.7.1 Settings and Metrics
4.7.2 Performance of Concurrent Executions
4.7.3 Validations of Optimizations
4.7.4 Experiments with Replacement Policies
4.7.5 Effectiveness of Query Scheduling
4.7.6 Overhead
4.8 Related Work
4.9 Summary and Future Work
Chapter 5 Concluding Remarks
Bibliography
  
 
 
 
List of Tables 
 
Table 2.1: Performance comparisons between different schemes.
Table 3.1: Total size of device memory space allocated in each benchmark.
 
 
  
 
 
 
List of Figures 
 
Figure 2.1: Cross-comparing queries for the Jaccard similarity of two polygon sets extracted from the same image.
Figure 2.2: Time decomposition of cross-comparing queries in PostGIS on a single core.
Figure 2.3: Polygons extracted from medical images have axis-aligned edges and integer-valued vertices.
Figure 2.4: The principles of PixelBox.
Figure 2.5: A sampling box’s position relative to a polygon: (a) outside; (b) inside; (c, d) hover.
Figure 2.6: A cross-comparing pipeline with dynamic task migrations.
Figure 2.7: Performance comparison of GEOS and PixelBox.
Figure 2.8: Performance of two algorithm decisions: using sampling boxes and computing areas of union indirectly.
Figure 2.9: Performance impact of various optimization techniques in algorithm implementation.
Figure 2.10: The sensitivity of PixelBox performance to pixelization threshold T.
Figure 2.11: Performance benefits of dynamic task migration.
Figure 2.12: The overall performance of SCCG compared with PostGIS-M on 18 data sets.
Figure 3.1: GPGPU system organization.
Figure 3.2: The overall architecture of GDM.
Figure 3.3: Device memory region and object.
Figure 3.4: An LRU stack is structured for the LRU-COST replacement policy.
Figure 3.5: The computation of data signatures.
Figure 3.6: Decompositions of benchmark execution times.
Figure 3.7: The impact of device memory leaks with and without GDM management.
Figure 3.8: Performance of multitasking workloads with and without GDM management.
Figure 3.9: Effectiveness of GDM optimizations.
Figure 4.1: An example query execution plan in YDB.
Figure 4.2: Utilization of GPU resources during dedicated executions of SSB queries with YDB.
Figure 4.3: The impact of query scheduling on system throughput.
Figure 4.4: Overview of MultiQx-GPU. Shaded boxes denote the two new components provided by MultiQx-GPU to manage GPU resources.
Figure 4.5: Device memory usage of SSB queries at scale factor 14. qxy denotes the yth query of the xth query flight. The Allocated bars show the total device memory space allocated in each query. The Peak bars show the maximum device memory space held by each query during its execution.
Figure 4.6: Throughput of pairwise SSB query co-runnings in three different systems. MultiQx-GPU-Raw is a variant of MultiQx-GPU without optimizations. p.q denotes the combination of queries p and q.
Figure 4.7: Improvement of GPU resource utilization with MultiQx-GPU relative to YDB.
Figure 4.8: Improvement of DMA efficiency with the help of swapping framework optimizations.
Figure 4.9: The influence of each individual optimization technique on system performance. Performance is normalized to the fully optimized MultiQx-GPU.
Figure 4.10: Performance of co-running selected SSB queries under various data replacement policies. Throughput is normalized against LRU.
Figure 4.11: System speedups achieved with different scheduling policies, normalized to FIX-1.
Figure 4.12: Overhead of MultiQx-GPU.
  
 
 
Chapter 1 Introduction 
 
 
Modern computer systems need to handle increasingly data- and compute-intensive 
workloads due to the proliferation of throughput-oriented applications. The continued 
demand for high performance requires scalable parallel processing at affordable cost. 
Massively data-parallel processors, and Graphics Processing Units (GPUs) in particular, 
have recently emerged as powerful, general-purpose hardware accelerators for a large 
class of applications. Compared with latency-optimized CPUs, the unique architecture of 
GPUs, with numerous slim cores and fast on-board device memory, intrinsically matches 
the immense data parallelism present in many throughput-demanding workloads. 
By exploiting this massive data parallelism, GPUs can often deliver orders-of-magnitude 
performance improvements over conventional CPU-based solutions in a 
cost-effective manner. 
Despite the abundance of data parallelism in workloads, exploiting it for efficient 
acceleration with GPUs can be quite challenging. Unlike multicore CPUs, which comprise 
a few powerful cores aiming for low latency, GPUs employ numerous cores, each of 
which is less powerful than a single CPU core, and require fine-grained SIMD 
(Single-Instruction-Multiple-Data) execution to achieve maximal performance. As a result, 
algorithms optimized for running on CPUs often cannot be easily ported to fully utilize 
the hardware resources of GPUs. This has made it difficult to apply 
GPUs in some important domains. We will later show such an example in digital 
pathology analysis (Chapter 2) and demonstrate how we expand the adoption of GPUs 
into this new field through effective algorithm design and implementation. 
As GPUs enter the mainstream of general-purpose computing, however, the related system 
software has developed rather slowly. Today’s throughput-demanding 
applications have inherent requirements for multitasking to maximize system throughput. 
Unfortunately, the state-of-the-art commercial general-purpose GPU (GPGPU) system 
software is still mainly designed for dedicated usage of GPU resources. Without effective 
system support for resource sharing, main GPU resources such as device memory are 
controlled directly by each individual GPGPU application. This not only increases the 
programming burden on application developers, but also causes serious system problems, 
including application crashes, resource underutilization, and vulnerability to malicious 
users. These problems have greatly limited the adoption of GPUs in production systems. 
Databases, as an important application field of GPUs, also have a high demand for 
concurrent query processing. Executing one query at a time simplifies GPU database 
engine design, but can often lead to low resource utilization and thus suboptimal system 
performance. Important database applications such as high-performance data 
warehousing and multi-client data flow analysis rely on user queries to make continuous 
progress to satisfy the goal of interactive analysis. The lack of support for concurrent 
querying severely restricts the use of GPU databases in these fields. 
  
In this dissertation, we develop efficient software solutions to address the limitations of 
the current GPGPU software ecosystem and expand the application of GPUs in commodity 
computing systems. First, we apply GPUs as key accelerators to improve the performance 
of comparing the similarities of micro-anatomic polygon sets in digital pathology 
analysis. The bottlenecks of conventional CPU-based approaches are effectively 
addressed through GPU-optimized algorithm designs and dynamic load balancing 
between CPUs and GPUs. Second, we propose operating system support for automatic 
device memory management to improve the usability and performance of GPUs in shared 
general-purpose computing environments. Finally, we develop resource management 
facilities in GPU database systems to support efficient concurrent analytical query 
processing. This dissertation makes the following main contributions: 
• As an important application of spatial databases in pathology imaging analysis, 
cross-comparing the spatial boundaries of a huge number of segmented 
micro-anatomic objects demands extremely data- and compute-intensive operations, 
requiring high throughput at an affordable cost. However, the performance of 
spatial database systems has not been satisfactory since their implementations of 
spatial operations cannot fully utilize the power of modern parallel hardware. In 
this dissertation, we provide a customized software solution that exploits GPUs and 
multi-core CPUs to accelerate spatial cross-comparison in a cost-effective way. 
Our solution consists of an efficient GPU algorithm and a pipelined system 
framework with task migration support. Extensive experiments with real-world 
data sets demonstrate the effectiveness of our solution, which improves the 
performance of spatial cross-comparison by over 18 times compared with a 
parallelized spatial database approach. 
• GPGPUs are evolving from dedicated accelerators towards mainstream commodity 
computing resources. During the transition, the lack of system management of 
device memory space on GPGPUs has become a major hurdle. In existing GPGPU 
systems, device memory space is still managed explicitly by individual 
applications, which not only increases the burden on programmers but can also 
cause application crashes, hangs, or low performance. 
In this dissertation, we present the design and implementation of GDM, a fully 
functional GPGPU device memory manager that addresses the above problems and 
unleashes the computing power of GPGPUs in general-purpose environments. To 
effectively coordinate the device memory usage of applications, GDM takes 
control over device memory allocations and data transfers to and from device 
memory, leveraging a buffer allocated in each application's virtual memory. GDM 
utilizes the unique features of GPGPU systems and relies on several effective 
optimization techniques to guarantee the efficient usage of device memory space 
and to achieve high performance. 
We have evaluated GDM and compared it against state-of-the-art GPGPU system 
software on a range of workloads. The results show that GDM can prevent 
application crashes, including those induced by device memory leaks, and 
improve system performance by up to 43%. 
  
• In current databases, GPUs are used as dedicated accelerators to process each 
individual query. Sharing GPUs among concurrent queries is not supported, 
causing serious resource underutilization. Based on the profiling of an open-source 
GPU query engine running commonly used single-query data warehousing 
workloads, we observe that the utilization of the main GPU resources is at most 
25%. The underutilization leads to low system throughput. 
To address the problem, this dissertation proposes concurrent query execution as an 
effective solution. To efficiently share GPUs among concurrent queries for high 
throughput, the major challenge is to provide software support to control and 
resolve resource contention incurred by the sharing. Our solution relies on GPU 
query scheduling and device memory swapping policies to address this challenge. 
We have implemented a prototype system and evaluated it extensively. The 
experimental results confirm the effectiveness and performance advantage of our 
approach. By executing multiple GPU queries concurrently, system throughput can 
be improved by up to 55% compared with dedicated processing. 
The rest of the dissertation is organized as follows. Chapter 2 presents a CPU-GPU 
hybrid solution to accelerate spatial data cross-comparison in digital pathology analysis. 
Chapter 3 introduces GDM, a system-level GPU device memory manager to improve the 
usability and performance of GPGPU applications. In Chapter 4, we describe MultiQx-
GPU, a database system facility to manage GPU resources for concurrent analytical 
query processing with GPUs. Finally, we conclude the dissertation in Chapter 5. 
  
  
 
 
Chapter 2 Accelerating Pathology Image Data Cross-
Comparison on CPU-GPU Hybrid Systems 
 
 
2.1 Introduction 
Digitized pathology images generated by high-resolution scanners enable the 
microscopic examination of tissue specimens to support clinical diagnosis and biomedical 
research [28]. With the emerging pathology imaging technology, it is essential to develop 
and evaluate high quality image analysis algorithms, with iterative efforts on algorithm 
validation, consolidation, and parameter sensitivity studies. One essential task to support 
such work is to provide efficient tools for cross-comparing millions of spatial boundaries 
of segmented micro-anatomic objects. A commonly adopted cross-comparing metric is 
Jaccard similarity [53], which computes the ratio of the total area of the intersection 
to the total area of the union of two polygon sets. 
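To make the metric concrete, the following is a minimal Python sketch of one literal reading of this definition. It is an illustration under simplifying assumptions, not the algorithm developed in this chapter: each polygon is stood in for by an axis-aligned rectangle with integer corners, areas are obtained by counting unit pixels, and the names pixelize and jaccard are hypothetical. The union area is derived from the intersection area by inclusion-exclusion rather than computed directly.

```python
from typing import Iterable, Set, Tuple

Pixel = Tuple[int, int]
# Assumption: a "polygon" is an axis-aligned rectangle (x0, y0, x1, y1)
# with integer corners, covering the half-open ranges [x0, x1) x [y0, y1).
Rect = Tuple[int, int, int, int]


def pixelize(rects: Iterable[Rect]) -> Set[Pixel]:
    """Return the set of unit pixels covered by a set of rectangles."""
    pixels: Set[Pixel] = set()
    for x0, y0, x1, y1 in rects:
        for x in range(x0, x1):
            for y in range(y0, y1):
                pixels.add((x, y))
    return pixels


def jaccard(set_p: Iterable[Rect], set_q: Iterable[Rect]) -> float:
    """Total intersection area divided by total union area of two sets."""
    p = pixelize(set_p)
    q = pixelize(set_q)
    inter = len(p & q)
    # Inclusion-exclusion: |P u Q| = |P| + |Q| - |P n Q|.
    union = len(p) + len(q) - inter
    return inter / union if union else 0.0


p = [(0, 0, 4, 4)]  # a 4x4 square, area 16
q = [(2, 2, 6, 6)]  # the same square shifted by (2, 2), area 16
print(jaccard(p, q))  # intersection area 4, union area 28
```

Counting pixels one by one is far too slow for millions of real polygons; the sketch only fixes what quantity is being computed.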
Building high-performance cross-comparing tools is challenging, due to data explosion 
in pathology imaging analysis, as in other scientific domains [40, 45]. Whole-slide 
images made by scanning microscope slides at diagnostic resolution are very large: a 
typical image may contain over 100,000 x 100,000 pixels, and millions of objects such as 
cells or nuclei. A study may involve hundreds of images obtained from a large cohort of 
subjects. For a large-scale interrelated analysis, there may be dozens of algorithms, 
with varying parameters, generating many different result sets to be compared and 
consolidated. Thus, the data derived from the images of a single study is often on the 
scale of tens of terabytes, and will grow even larger in future clinical environments. 
Pathologists mainly rely on spatial database management systems (SDBMSs) to execute 
spatial cross-comparison [54]. However, cross-comparing a huge number of polygons is 
time-consuming using SDBMSs, which cannot fully utilize the rich parallel resources of 
modern hardware. In the era of high-throughput computing, unprecedentedly rich and 
low-cost parallel computing resources, including GPUs and multi-core CPUs, have become 
available. To use these resources to maximize execution performance, 
applications must fully exploit both thread-level and data-level parallelism and make 
good use of SIMD vector units to parallelize workloads. 
However, supporting spatial cross-comparison on a CPU-GPU hybrid platform 
poses two major challenges. First, parallelizing spatial operations, such as computing 
areas of polygon intersection and union, on GPUs requires efficient algorithms. Existing 
CPU algorithms, e.g., those used in SDBMSs, are branch intensive with irregular data 
access patterns, which makes them very hard, if not impossible, to parallelize on GPUs. 
Efficient GPU algorithms, if they exist, must successfully exploit the massive data parallelism 
in the cross-comparing workload and execute it in an SIMD fashion. Second, a GPU-friendly 
system framework is required to drive the whole spatial cross-comparing 
workload. The special characteristics of the GPU device require data batching to mitigate 
communication overhead, and coordinated device sharing to control resource contention. 
Furthermore, due to the diversity of hardware configurations and workloads, task 
  
executions have to be balanced between GPUs and CPUs to maximize resource 
utilization. 
In this chapter, we present a customized solution, SCCG (Spatial Cross-comparison on 
CPUs and GPUs), to address the challenges. Through detailed profiling, we identify that 
the bottleneck of cross-comparing query execution mainly comes from computing the 
areas of polygon intersection and union. This explains the low performance of SDBMSs 
and motivates us to design an efficient GPU algorithm, called PixelBox, to accelerate the 
spatial operations. Both the design and the implementation of the algorithm are optimized 
thoroughly to ensure its high performance on GPUs. Moreover, we develop a pipelined 
system framework for the whole workload, and design a dynamic task migration 
component to solve the load balancing problem. The pipelined framework has advantages 
for its natural support of data batching and GPU sharing. The task migration component 
further improves system throughput by balancing workloads between GPUs and CPUs. 
The main contributions of this chapter are as follows: 1) PixelBox, an efficient GPU 
algorithm and its optimized implementation for computing Jaccard similarity of polygon 
sets; 2) a pipelined framework with task migration support for spatial cross-comparison 
on a CPU-GPU hybrid platform; and 3) a demonstration of our solution’s performance 
(18x speedup over a parallelized SDBMS) with extensive and intensive experiments 
using real-world pathology data sets. 
The rest of this chapter is organized as follows. Section 2.2 introduces the background 
and identifies the problem with SDBMSs in processing spatial cross-comparing queries. 
Our GPU algorithm, PixelBox, is presented in Section 2.3 to accelerate the bottleneck 
  
spatial operations. Section 2.4 introduces the pipelined framework and the design of a 
task migration facility for workload balancing. Comprehensive experiments and 
performance evaluation are presented in Section 2.5, followed by related work in 
Section 2.6 and conclusions in Section 2.7. 
2.2 Problem Identification 
2.2.1 Background: Spatial Cross-Comparison 
A critical step in pathology imaging analysis is to extract the spatial locations and 
boundaries of micro-anatomic objects, represented with polygons, from digital slide 
images using segmentation algorithms [28]. The effectiveness of a segmentation 
algorithm depends on many factors, such as the quality of microtome staining machines, 
staining techniques, peculiarities of tissue structures and others. A slight change of 
algorithm parameters may also lead to dramatic variations in segmentation output. As a 
result, evaluating the effectiveness and sensitivity of segmentation algorithms has been 
very important in pathology imaging studies. 
The core operation is to cross-compare two sets of polygons, which are segmented by 
different algorithms or the same algorithm with different parameters, to obtain their 
degree of similarity. Jaccard similarity, due to its simplicity and meaningful geometric 
interpretation, has been widely used in pathology to measure the similarity of polygon 
sets. 
Suppose that P and Q are two sets of polygons representing the spatial boundaries of 
objects generated by two methods from the same image. Their Jaccard similarity is 
defined as 
  
$$ J = \frac{\|P \cap Q\|}{\|P \cup Q\|}, $$
where P∩Q and P∪Q denote the intersection and the union of P and Q, and ‖∙‖ is defined 
as the area of one or multiple polygons in a polygon set. To further simplify the 
computation, researchers in digital pathology use a variant definition of Jaccard 
similarity: let $r(p, q) = \frac{\|p \cap q\|}{\|p \cup q\|}$, then 
$$ J' = \langle \{ r(p, q) : p \in P,\ q \in Q,\ \|p \cap q\| \neq 0 \} \rangle, \qquad (1) $$
in which 〈∙〉 represents the average value of all the elements in a set. The greater the value 
of 𝐽′  is, the more likely P and Q resemble each other. Compared with 𝐽 , 𝐽′  does not 
consider missing polygons that appear in one polygon set but have no intersecting 
counterpart in the other. Missing polygons can be easily identified by comparing the 
number of polygons that appear in the intersection with the number of polygons in each 
polygon set. Other additional measurements of similarity, such as distance of centroids, 
are omitted in our discussion, as their computational complexity is low. 
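Formula (1) can be illustrated with a small Python sketch. It is not the implementation used in this chapter: polygons are represented, purely for illustration, as sets of covered pixel coordinates, and the names rect, ratio, and jaccard_variant are hypothetical. Returning 0.0 when no pair intersects is likewise an assumption, since the text does not define the average for that case.

```python
from typing import FrozenSet, List, Tuple

Pixel = Tuple[int, int]
PixelSet = FrozenSet[Pixel]  # hypothetical stand-in for a polygon: the pixels it covers


def rect(x0: int, y0: int, x1: int, y1: int) -> PixelSet:
    """Pixels covered by the axis-aligned rectangle [x0, x1) x [y0, y1)."""
    return frozenset((x, y) for x in range(x0, x1) for y in range(y0, y1))


def ratio(p: PixelSet, q: PixelSet) -> float:
    """r(p, q): area of intersection divided by area of union, in pixels."""
    union = len(p | q)
    return len(p & q) / union if union else 0.0


def jaccard_variant(P: List[PixelSet], Q: List[PixelSet]) -> float:
    """Mean of r(p, q) over the pairs whose intersection is non-empty."""
    ratios = [ratio(p, q) for p in P for q in Q if p & q]
    # Assumption: report 0.0 when no pair intersects at all.
    return sum(ratios) / len(ratios) if ratios else 0.0


P = [rect(0, 0, 4, 4), rect(10, 10, 12, 12)]
Q = [rect(2, 0, 6, 4), rect(20, 20, 22, 22)]
j_prime = jaccard_variant(P, Q)  # only one pair intersects: 8 / 24
```

In this example the second polygon of each set has no intersecting counterpart, so it is a "missing polygon" in the sense above and does not affect the average.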
What makes the computation of 𝐽′ highly challenging is the huge number of polygons 
involved in spatial cross-comparison. Due to the high dependability required by medical 
analysis, the image base has to be sufficiently large — hundreds of whole slide images 
are common, with each image generating millions of polygons. Since a single image 
contains a great number of objects, the average size of polygons extracted from pathology 
images is usually very small. 
To expedite both segmentation and cross-comparison, large image files are usually pre-
partitioned into many small tiles so that they can fit into memory and allow parallel 
  
segmentations. The generated polygon files for each whole image also reflect the 
structure of such partitioning: polygons extracted from a single tile are contained in a 
single polygon file; a group of polygon files constitute the segmentation result for a 
whole image; different segmentation results for the same image are represented with 
different groups of polygon files, which are cross-compared with each other for the 
purpose of algorithm validation or sensitivity studies. 
In the rest of this chapter, we refer to the area of the intersection of two polygons as 
area of intersection, and the area of the union of two polygons as area of union. 
2.2.2 Existing Solutions with SDBMSs 
Pathologists mainly rely on SDBMSs to support spatial cross-comparison [54]. In this 
solution, the cross-comparing workflow typically consists of three major steps: first, 
polygon files (raw data) are loaded into the database; second, indexes are built based on 
the minimum bounding rectangles (MBRs) of polygons; finally, queries are executed to 
compute the similarity score. Figure 2.1(a) shows a cross-comparing query in PostGIS [1] 
SQL grammar that computes the Jaccard similarity of two polygon sets, named 
‘oligoastroiii_1_1’ and ‘oligoastroiii_1_2’. The join condition is expressed with spatial 
predicate ST_Intersects, which tests whether two polygons have intersection. For each 
pair of intersecting polygons, their area of intersection, area of union, and thus the ratio of 
the two areas are computed. Spatial operators ST_Intersection and ST_Union compute the 
boundaries of the intersection and the union of two polygons, while ST_Area returns the 
area of one or a group of polygons. Finally, these ratios are averaged to derive the 
similarity score for the whole image. 
  
 
Figure 2.1: Cross-comparing queries for the Jaccard similarity of two polygon sets 
extracted from the same image. 
According to the formula ‖𝑝 ∪ 𝑞‖ = ‖𝑝‖ + ‖𝑞‖ − ‖𝑝 ∩ 𝑞‖ , the query can be re-
written so that only the ST_Intersection operator is executed for each pair of intersecting 
polygons, while the area of union can be computed indirectly through the formula. 
Moreover, ST_Intersects can also be removed since we only need records with ratio > 0 
and whether two polygons intersect can be determined by their area of intersection. By 
replacing ST_Intersects with the && operator, which tests whether the MBRs of two 
polygons intersect, we can further optimize the query, as shown in Figure 2.1(b). 
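The two ingredients of this rewrite, the MBR overlap test and the indirect computation of the area of union, can be sketched in Python. This is not PostGIS code; the helper names mbr, mbrs_intersect, and ratio_from_areas are hypothetical, and treating MBRs that merely touch as intersecting is an assumption of the sketch.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (min_x, min_y, max_x, max_y)


def mbr(vertices: List[Point]) -> Box:
    """Minimum bounding rectangle of a polygon given by its vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


def mbrs_intersect(a: Box, b: Box) -> bool:
    """Cheap filter: do the two bounding rectangles overlap?"""
    # Assumption: boxes that only share an edge or corner count as intersecting.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def ratio_from_areas(area_p: float, area_q: float, area_inter: float) -> float:
    """Ratio of intersection to union, with the union derived indirectly."""
    area_union = area_p + area_q - area_inter  # inclusion-exclusion
    return area_inter / area_union if area_union else 0.0


square_a = [(0, 0), (4, 0), (4, 4), (0, 4)]
square_b = [(2, 0), (6, 0), (6, 4), (2, 4)]
candidate = mbrs_intersect(mbr(square_a), mbr(square_b))
r = ratio_from_areas(16, 16, 8)  # the two squares overlap in a 2 x 4 strip
```

A pair that passes the MBR filter but has zero area of intersection yields a ratio of 0 and is simply discarded, which is why the explicit intersection predicate becomes unnecessary.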
  
2.2.3 Performance Profiling of SDBMS Solution 
To identify the performance bottleneck of cross-comparing queries in SDBMSs, we 
performed a set of experiments with PostGIS, a popular open-source SDBMS¹,². We used 
a real-world data set extracted from a brain tumor slide image. The total size of the data 
set in raw text format is about 750MiB, with two sets of polygons (representing tumor 
nuclei) each containing over 450,000 polygons, and over 570,000 pairs of polygons with 
MBR intersections. Details of the platform and the data set will be described in Section 
2.5. 
 
Figure 2.2: Time decomposition of cross-comparing queries in PostGIS on a single core. 
We split the query execution into separate components, and profiled the time spent by 
the query engine on each component during a single-core execution. The result is 
presented in Figure 2.2 for both the unoptimized and optimized queries. Index Search 
refers to the testing of MBR intersections based on the indexes built. 
Area_Of_Intersection and Area_Of_Union represent computing the areas of intersection 
and union, which correspond to the two combo operators, ST_Area(ST_Intersection()) 
and ST_Area(ST_Union()). ST_Area denotes the other two stand-alone ST_Area 
operators in the optimized query. 

¹ We also performed similar experiments on a mainstream commercial SDBMS, but its performance was 
much worse. For simplicity, we only present the results with PostGIS. 
² Based on our communication with the community of SciDB [2], spatial cross-comparing queries are not 
natively supported by SciDB. 

For the unoptimized query, ST_Intersects (21.8%), Area_Of_Intersection (37.4%), and 
Area_Of_Union (36.7%) take the highest percentages of execution time, representing the 
bottlenecks of the query execution. For the optimized query, since ST_Intersects and 
Area_Of_Union are removed from the SQL statement, Area_Of_Intersection becomes the 
sole performance bottleneck, capturing almost 90% of the total query execution time. As 
the left two bars show, very little time (less than 6%) was spent on index building and 
index search in both queries. The bar for ST_Area shows that the time to compute 
polygon areas is negligible, and further indicates that the high overhead of 
Area_Of_Intersection and Area_Of_Union comes from spatial operators ST_Intersection 
and ST_Union. 
The profiling result explains the low performance of spatial databases in supporting 
cross-comparing queries — computing the intersection/union of polygons is too costly as 
the number of polygon pairs is large. SDBMSs usually rely on some geometric 
computation libraries, e.g., GEOS [3] in PostGIS, to implement spatial operators. 
Designed to be general-purpose, the algorithms used by these libraries to compute the 
intersection and union of polygons are compute-intensive and very difficult to parallelize. 
We analyzed the source code of the respective functions for computing polygon intersection 
and union in GEOS and another popular geometric library, CGAL [4], and found that only 
very few sections of code can be parallelized without significantly changing algorithm 
  
structures. Both GEOS and CGAL use generic sweepline algorithms [29], which are not 
built for computationally intensive queries and thus lead to the limited performance in 
SDBMSs. 
Using a large computing cluster can surely improve system performance. However, 
unlike in many high-performance computing applications, pathologists can barely afford 
expensive facilities in real clinical settings [42]. A cost-effective yet highly 
productive solution is thus highly desirable. This motivates us to design a customized 
solution to accelerate large-scale spatial cross-comparisons. To eliminate the performance 
bottleneck, our solution needs an efficient GPU algorithm for computing the areas of 
intersection and union, as will be introduced in the next section. 
2.3 The PixelBox Algorithm 
We describe a GPU algorithm, PixelBox, which accepts an array of polygon pairs as 
input and computes their areas of intersection and union. The design of PixelBox mainly 
solves three problems: 1) how to parallelize the computation of area of intersection and 
area of union on GPUs, 2) how to reduce compute intensity when polygon pairs are 
relatively large, and 3) how to implement the algorithm efficiently on GPUs. We use the 
terms of NVIDIA CUDA [5] in our description. However, the algorithm design is general 
and applicable to other GPU architectures and programming models as well. 
2.3.1 Pixelization of Polygon Pairs 
As measured in the previous section, computing the exact boundaries of polygon 
intersection/union incurs enormous overhead and has been the main cause of the low 
performance of SDBMSs in processing cross-comparing queries. However, the most 
  
relevant component to the definition of Jaccard similarity (as shown in Formula 1) is the 
areas, not the intermediate boundaries. As a key to enable parallelization on GPUs, 
PixelBox directly computes the areas without resorting to the exact forms of the 
intersections or unions. 
 
Figure 2.3: Polygons extracted from medical images have axis-aligned edges and integer-
valued vertices. 
Polygons extracted from medical images share a common property: the coordinates of 
vertices are integer-valued, and the directions of edges are either horizontal or vertical. 
Such polygons are a special form of rectilinear polygons [52]. As illustrated in 
Figure 2.3, since medical images are usually raster images, the boundary of a segmented 
polygon follows the regular grid lines at the pixel granularity. 
Taking advantage of this property, PixelBox treats a polygon as a continuous region 
surrounded by its spatial boundary on a pixel map. As shown in Figure 2.4(a), pixels 
within the MBR of polygons p and q can be classified into three categories: 1) pixels 
(e.g., A) lying inside both p and q, 2) pixels (e.g., B and C) lying inside one polygon but 
not the other, and 3) pixels (e.g., D) lying outside both. The area of intersection (‖𝑝 ∩ 𝑞‖) 
can be measured by the number of pixels belonging to the first category. The area of 
union (‖𝑝 ∪ 𝑞‖) corresponds to the number of pixels in the first and second categories. 
  
Finally, pixels in the third category contribute to neither ‖𝑝 ∩ 𝑞‖  nor ‖𝑝 ∪ 𝑞‖ . The 
pixelized view of polygon intersection and union averts the hassle of computing 
boundaries and, more importantly, exposes a great opportunity for exploiting fine-grained 
data parallelism hidden in the cross-comparing computation. 
 
Figure 2.4: The principles of PixelBox. 
In order to determine a pixel’s position relative to a polygon, a well-known method is 
to cast a ray from the pixel and count its number of intersections with the polygon’s 
boundary [46]. As illustrated in Figure 2.4(b), if the number is odd, the pixel (e.g., A) lies 
inside the polygon; if the number is even, the pixel (e.g., B) lies on the outside. 
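The ray-casting test can be sketched in Python as follows. This is a generic even-odd implementation rather than the GPU routine of this chapter; the function names are hypothetical, and casting the ray from the pixel's center (col + 0.5, row + 0.5) is an assumption that keeps the ray away from the integer-valued vertices.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def point_in_polygon(px: float, py: float, vertices: List[Point]) -> bool:
    """Even-odd rule: cast a ray in the +x direction and count boundary crossings."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line y = py can be crossed.
        if (y0 > py) != (y1 > py):
            x_cross = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
            if x_cross > px:  # crossing lies to the right of the point
                inside = not inside  # odd number of crossings so far
    return inside


def pixel_in_polygon(col: int, row: int, vertices: List[Point]) -> bool:
    """Test the center of the unit pixel [col, col+1) x [row, row+1)."""
    return point_in_polygon(col + 0.5, row + 0.5, vertices)


# An L-shaped rectilinear polygon with integer-valued vertices.
L_SHAPE = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
in_corner = pixel_in_polygon(0, 0, L_SHAPE)
in_notch = pixel_in_polygon(3, 3, L_SHAPE)
```

Horizontal edges never satisfy the straddle condition, so for rectilinear polygons only the vertical edges contribute crossings.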
  
The pixelization method is very suitable for execution on GPUs. Since testing the 
position of one pixel is totally independent of another, we can parallelize the computation 
by having multiple threads process the pixels in parallel. Moreover, since the positions of 
different pixels are computed against the same pair of polygons, the operations performed 
by different threads follow the SIMD fashion, which is required by GPUs. Finally, the 
area of intersection and area of union can be computed altogether during a single 
traversal of all pixels with almost no extra overhead, because the criteria for testing 
intersection (which uses Boolean AND operation) and union (which uses Boolean OR 
operation) are both based on each pixel’s positions relative to the same polygon pair. As 
the number of input polygon pairs is large, we can delegate them to multiple thread 
blocks. For each polygon pair, the contributions of all pixels in the MBR can be 
computed by all threads within a thread block in parallel. 
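A sequential Python sketch of the pixelization method is given below. The two nested loops stand in for the threads of a thread block, each of which would handle a subset of the pixels on a GPU; the names pixel_in_polygon and pixelized_areas are hypothetical, and scanning the joint bounding rectangle of both polygons is an assumption of the sketch.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def pixel_in_polygon(col: int, row: int, poly: List[Point]) -> bool:
    """Even-odd ray casting from the pixel center in the +x direction."""
    px, py = col + 0.5, row + 0.5
    inside = False
    for i in range(len(poly)):
        (ax, ay), (bx, by) = poly[i], poly[(i + 1) % len(poly)]
        if (ay > py) != (by > py) and ax + (py - ay) * (bx - ax) / (by - ay) > px:
            inside = not inside
    return inside


def pixelized_areas(p: List[Point], q: List[Point]) -> Tuple[int, int]:
    """Return (area of intersection, area of union) counted in pixels."""
    xs = [x for x, _ in list(p) + list(q)]
    ys = [y for _, y in list(p) + list(q)]
    inter = union = 0
    for row in range(min(ys), max(ys)):
        for col in range(min(xs), max(xs)):
            in_p = pixel_in_polygon(col, row, p)
            in_q = pixel_in_polygon(col, row, q)
            inter += int(in_p and in_q)  # Boolean AND: pixel is in both polygons
            union += int(in_p or in_q)   # Boolean OR: pixel is in at least one
    return inter, union


square_a = [(0, 0), (4, 0), (4, 4), (0, 4)]
square_b = [(2, 0), (6, 0), (6, 4), (2, 4)]
areas = pixelized_areas(square_a, square_b)
```

Both counters are updated from the same two position tests, which is why obtaining the area of union in the same traversal costs almost nothing extra.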
2.3.2 Reduction of Computing Intensity 
The pixelization method described above has a weakness — the computing intensity 
rises quickly as the number of pixels contained in the MBR increases. Even though 
polygons are usually very small in pathology imaging applications, as the resolution of 
scanner lens increases, the sizes of polygons may also increase accordingly to capture 
more details of the objects. There are also cases when the areas of intersection and union 
are computed between a small group of relatively large polygons and many small 
polygons, e.g., when processing an image with a few capillary vessels surrounded by 
many cells. Moreover, as will be shown in Section 2.5, even when polygons are small, it 
is still possible to further bring down the compute intensity and improve performance. 
  
To reduce the intensity of computation and make the algorithm more scalable, 
PixelBox utilizes another technique, called sampling boxes, whose idea is similar to the 
adaptive mesh refinement method [15] in numerical analysis. Due to the continuity of the 
interior of a polygon, the positions of pixels have spatial locality – if one pixel lies inside 
(or outside) a polygon, other pixels in its neighborhood are likely to lie on the inside (or 
outside) too, with exceptions near the polygon’s boundary. Exploiting this property, we 
can calculate the areas of intersection and union region by region, instead of pixel by 
pixel, so that the contribution of all pixels in a region may be computed at once. 
This technique is illustrated in Figure 2.4(c). The MBR of a polygon pair is recursively 
partitioned into sampling boxes, first at coarser granularity (see the large grid cells in the 
figure), then going finer at selected sub-regions (e.g., as shown by the small boxes near 
the top) which need further exploration. For example, when computing the area of 
intersection, if a sampling box lies completely inside both polygons, the contribution of 
all pixels within the sampling box is obtained at once, which equals the size of the 
sampling box; otherwise, the sampling box needs to be partitioned into smaller sub-
sampling boxes and tested further. In Figure 2.4(c), the grey sampling boxes do not need 
to be further partitioned because their contributions to the areas of intersection and union 
are already determined. 
Similar to the pixelization method, the sampling-box approach requires computing a 
sampling box’s position relative to a polygon, which has three possible values: inside – 
every pixel in the box lies inside the polygon; outside – every pixel in the box lies outside 
the polygon; and hover – some pixels lie inside while others lie outside the polygon. 
  
 
Figure 2.5: A sampling box’s position relative to a polygon: (a) outside; (b) inside; (c, d) 
hover. 
Lemma 1. A sampling box’s position relative to a polygon is determined by three 
conditions: (i) none of the sampling box’s four edges crosses through the polygon’s 
boundary; (ii) none of the polygon’s vertices lies inside the sampling box; (iii) the 
sampling box’s geometric center lies inside the polygon. 
The sampling box lies inside the polygon if all three conditions are true; it lies 
outside the polygon if the first two conditions are true but the last is false; it hovers 
over the polygon in all other cases, when condition (i) or (ii) is false. 
Lemma 1 gives the criteria for computing a sampling box’s position, which is further 
illustrated in Figure 2.5. For each sampling box, its four edges are tested against the 
polygon’s boundary. If there are edge-to-edge crossings, the sampling box must hover 
over the polygon (case (d) in Figure 2.5). Otherwise, if any of the polygon’s vertices lies 
inside the sampling box, the entire polygon must be contained in the sampling box due to 
the continuity of its boundary, in which case the position is also hover (case (c) in Figure 
2.5); if none of the polygon’s vertices is inside the sampling box, the sampling box may 
be either totally inside (case (b) in Figure 2.5) or totally outside (case (a) in Figure 2.5) 
the polygon, in which case the position of the sampling box’s geometric center gives the 
final answer. If the sampling box’s four edges overlap with the polygon’s boundary, the 
sampling box’s position can be considered as either inside or outside. The next level of 
  
partition will distinguish the contribution of each sub-sampling box to the areas of 
intersection and union. 
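The classification in Lemma 1 can be sketched in Python for rectilinear polygons. As a simplification that is an assumption of this sketch, conditions (i) and (ii) are folded into a single test: whether any polygon edge passes through the open interior of the sampling box. For axis-aligned edges this is true whenever a box edge is crossed or a vertex lies inside the box, and it also treats an edge that merely overlaps the box's border as not hovering. All names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) with x0 < x1 and y0 < y1
INSIDE, OUTSIDE, HOVER = "inside", "outside", "hover"


def edge_enters_box(a: Point, b: Point, box: Box) -> bool:
    """True if the axis-aligned edge a-b passes through the box's open interior."""
    x0, y0, x1, y1 = box
    if a[0] == b[0]:  # vertical edge
        lo, hi = min(a[1], b[1]), max(a[1], b[1])
        return x0 < a[0] < x1 and max(lo, y0) < min(hi, y1)
    lo, hi = min(a[0], b[0]), max(a[0], b[0])  # otherwise assumed horizontal
    return y0 < a[1] < y1 and max(lo, x0) < min(hi, x1)


def center_in_polygon(box: Box, poly: List[Point]) -> bool:
    """Condition (iii): even-odd ray casting from the box's geometric center."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    inside = False
    for i in range(len(poly)):
        (ax, ay), (bx, by) = poly[i], poly[(i + 1) % len(poly)]
        if (ay > cy) != (by > cy) and ax + (cy - ay) * (bx - ax) / (by - ay) > cx:
            inside = not inside
    return inside


def box_position(box: Box, poly: List[Point]) -> str:
    n = len(poly)
    if any(edge_enters_box(poly[i], poly[(i + 1) % n], box) for i in range(n)):
        return HOVER  # condition (i) or (ii) is violated
    return INSIDE if center_in_polygon(box, poly) else OUTSIDE


L_SHAPE = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
positions = [box_position(b, L_SHAPE)
             for b in [(0, 0, 2, 2), (2, 2, 4, 4), (1, 1, 3, 3), (-2, -2, 6, 6)]]
```

The last box in the example contains the whole polygon, which corresponds to case (c) in Figure 2.5 and is therefore reported as hover.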
Testing the position of a sampling box is more costly than doing this for a pixel. When 
the granularity of a sampling box is large, the extra overhead is compensated by the 
amount of per-pixel computations reduced. However, as sampling boxes are more fine-
grained, the cost of computing their positions becomes more significant. Moreover, 
applying sampling boxes requires synchronization between cooperative threads — 
examination of one sampling box cannot begin until all threads have finished the 
partitioning of its parent box. Frequent synchronizations lead to low utilization of 
computing resources and have been one of the main hazards to performance improvement 
on GPUs [55]. 
To retain the merits of both efficient data parallelization and low compute intensity, 
PixelBox combines pixelization with sampling-box techniques. As depicted in Figure 
2.4(d), sampling boxes are applied at first to quickly finish testing for a large number of 
regions; when the size of a sampling box becomes smaller than a threshold, T, the 
pixelization method takes over and finishes the rest of the computation. 
Unlike the pixelization-only method, computing area of intersection and area of union 
altogether will incur extra overhead with sampling boxes. For example, if a sampling box 
hovers over one polygon but lies outside the other, its contribution is clear to the area of 
intersection, but unclear to the area of union; in this case, more fine-grained partitioning 
is required until the area of union is determined or the pixelization threshold is reached. 
To reduce the amount of sampling box partitioning and further improve algorithm 
  
performance, the area of union is not computed together with the area of intersection in 
PixelBox. Instead, similar to the query optimization in Figure 2.1(b), we compute the 
areas of polygons, and use the formula, ‖𝑝 ∪ 𝑞‖ = ‖𝑝‖ + ‖𝑞‖ − ‖𝑝 ∩ 𝑞‖, to derive the 
areas of union indirectly. Computing the area of a simple polygon is very easy to 
implement on GPUs. With the formula³ $A = \frac{1}{2} \sum_{i=0}^{n-1} (x_i y_{i+1} - x_{i+1} y_i)$, in which $(x_i, y_i)$ is the 
coordinate of the ith vertex of the polygon, we can let different threads compute different 
vertices and sum up the partial results to get the area. 
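The per-thread decomposition of the area formula can be sketched in Python. The strided assignment of vertices to threads, the function names, and the default of four threads are assumptions made for illustration; the sketch also takes vertex indices modulo n (so the last vertex pairs with the first) and applies an absolute value so that the result does not depend on vertex orientation.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def partial_area_terms(vertices: List[Point], thread_id: int, num_threads: int) -> int:
    """Sum of the cross terms x_i*y_{i+1} - x_{i+1}*y_i handled by one thread."""
    n = len(vertices)
    total = 0
    for i in range(thread_id, n, num_threads):  # strided share of the vertices
        x_i, y_i = vertices[i]
        x_next, y_next = vertices[(i + 1) % n]  # wrap around to the first vertex
        total += x_i * y_next - x_next * y_i
    return total


def polygon_area(vertices: List[Point], num_threads: int = 4) -> float:
    """Combine the per-thread partial sums into the polygon's area."""
    partials = [partial_area_terms(vertices, t, num_threads) for t in range(num_threads)]
    return abs(sum(partials)) / 2


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
area_square = polygon_area(square)
area_l = polygon_area(l_shape)
```

Given the areas of two polygons and their area of intersection, the area of union follows from the formula above without any further geometric computation.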
2.3.3 Optimized Algorithm Implementation 
Algorithm 1 shows the pseudocode of PixelBox. Sampling boxes are created and 
examined recursively — one region is probed from coarser to finer granularities before 
the next one. A shared stack is used to store the coordinates of the sampling boxes and 
the flags showing whether each sampling box needs to be further partitioned. For each 
polygon pair allocated to a thread block, its MBR is pushed onto the stack as the first 
sampling box (line 13). All threads pop the sampling box on the top of the stack to 
examine (line 18). If the sampling box does not need to be further probed, all threads will 
continue to pop the next sampling box (lines 19-20) until the stack becomes empty and the 
computation for the polygon pair finishes. For a sampling box that needs to be further 
examined, if its size is smaller than threshold T, the pixelization procedure is applied (lines 
22-28); otherwise, it is partitioned into sub-sampling boxes, and, after further processing, 
new sampling boxes will be pushed onto the stack by all threads simultaneously (lines 30-39). 
                                                 
³ See http://en.wikipedia.org/wiki/Polygon. 
  
 
  
In the algorithm, POLYAREA computes the partial area of a polygon handled by a 
thread; BOXSIZE returns the number of pixels contained in a sampling box; 
PIXELINPOLY(m, i, p) computes the position of the ith pixel in sampling box m relative to 
polygon p; SUBSAMPBOX(b, i) partitions a sampling box b and returns the ith sub-box for 
a thread to process; BOXPOSITION(b, p) computes the position of sampling box b relative 
to polygon p; BOXCONTINUE computes whether a sampling box needs to be further 
partitioned based on its positions relative to two polygons; and BOXCONTRIBUTE 
computes whether a sampling box contributes to the area of intersection according to its 
position. 
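The overall control flow can be sketched sequentially in Python. This is an illustrative reconstruction from the prose, not Algorithm 1 itself: it runs in a single thread, classifies a box when it is popped rather than before it is pushed, always splits a box into four quadrants, and handles only rectilinear polygons with integer-valued vertices. The function names, the default threshold, and the quadrant split are assumptions.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1): pixels [x0, x1) x [y0, y1)
INSIDE, OUTSIDE, HOVER = "inside", "outside", "hover"


def point_in_polygon(px: float, py: float, poly: List[Point]) -> bool:
    """Even-odd ray casting with a ray in the +x direction."""
    inside = False
    for i in range(len(poly)):
        (ax, ay), (bx, by) = poly[i], poly[(i + 1) % len(poly)]
        if (ay > py) != (by > py) and ax + (py - ay) * (bx - ax) / (by - ay) > px:
            inside = not inside
    return inside


def box_position(box: Box, poly: List[Point]) -> str:
    """Plays the role of BOXPOSITION for axis-aligned polygon edges."""
    x0, y0, x1, y1 = box
    for i in range(len(poly)):
        (ax, ay), (bx, by) = poly[i], poly[(i + 1) % len(poly)]
        if ax == bx:  # vertical edge through the box's open interior
            if x0 < ax < x1 and max(min(ay, by), y0) < min(max(ay, by), y1):
                return HOVER
        elif y0 < ay < y1 and max(min(ax, bx), x0) < min(max(ax, bx), x1):
            return HOVER  # horizontal edge through the box's open interior
    center_in = point_in_polygon((x0 + x1) / 2, (y0 + y1) / 2, poly)
    return INSIDE if center_in else OUTSIDE


def pixelbox_intersection_area(p: List[Point], q: List[Point], threshold: int = 16) -> int:
    xs = [x for x, _ in list(p) + list(q)]
    ys = [y for _, y in list(p) + list(q)]
    stack = [(min(xs), min(ys), max(xs), max(ys))]  # the MBR is the first sampling box
    area = 0
    while stack:
        x0, y0, x1, y1 = box = stack.pop()
        width, height = x1 - x0, y1 - y0
        if width <= 0 or height <= 0:
            continue  # empty sub-box produced by splitting a one-pixel-wide box
        pos_p, pos_q = box_position(box, p), box_position(box, q)
        if OUTSIDE in (pos_p, pos_q):
            continue  # no pixel of this box lies in the intersection
        if pos_p == INSIDE and pos_q == INSIDE:
            area += width * height  # the whole box contributes at once
        elif width * height < max(threshold, 2):
            # Pixelization: test every pixel center of the small box.
            area += sum(
                point_in_polygon(c + 0.5, r + 0.5, p) and point_in_polygon(c + 0.5, r + 0.5, q)
                for r in range(y0, y1) for c in range(x0, x1))
        else:
            mx, my = (x0 + x1) // 2, (y0 + y1) // 2  # split into four sub-boxes
            stack += [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]
    return area


rect_p = [(0, 0), (100, 0), (100, 80), (0, 80)]
rect_q = [(30, 20), (130, 20), (130, 60), (30, 60)]
area_pq = pixelbox_intersection_area(rect_p, rect_q)
```

The result does not depend on the threshold; the threshold only shifts work between the sampling-box phase and the per-pixel phase.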
The use of a stack to store sampling boxes saves memory space and allows both the 
testing of sampling box positions and the generation of new sampling boxes to be parallelized. 
Synchronization is required before popping a sampling box (line 17) to ensure that thread 
0 or the last thread in the thread block has pushed the sampling box to the top of the 
stack. When threads push new sampling boxes to the stack, they do not overwrite the old 
stack top (line 37); otherwise, an extra synchronization would be required before pushing 
new sampling boxes to ensure that the old stack top has been read by all threads. In the 
current design, the old stack top is marked as ‘no further probing’ (line 38), and will be 
omitted by all threads when being popped out again. 
The GPU kernel only computes the partial areas of intersections and the partial 
summed areas of polygons accumulated per thread (lines 5-6), which will be reduced 
later on the CPU to derive the final areas of intersection and union. Reduction is not 
performed on the GPU because the number of partial values for each polygon pair is 
  
relatively small (equal to the thread block size), which makes it not very efficient to 
execute on the GPU. We measured the time taken by the reductions on a CPU core; the 
cost is negligible compared to other operations on the GPU. 
In the rest of this sub-section, we explain some optimizations employed in the 
algorithm implementation. 
Utilize shared memory. Effectively using shared memory is important for improving 
program performance on GPUs [50]. The sampling box stack is frequently read and 
modified by all threads in a thread block, and thus should be allocated in the shared 
memory. Meanwhile, polygon vertex data are also repeatedly accessed when computing 
the positions of pixels and sampling boxes. Loading vertices into shared memory reduces 
global memory accesses. Due to the limited size of shared memory on GPUs, it is 
infeasible to allocate for the largest vertex array size. As a trade-off, we set a static 
size for the shared memory region containing polygon vertices, and only those polygons 
whose vertices fit into the region are loaded into the shared memory. 
Avoid memory bank conflicts. Bank conflicts happen when threads in a warp try to 
access different data items residing in the same shared-memory bank simultaneously. In 
this case, memory accesses are serialized, which decreases both bandwidth and core 
utilization. In the sampling box procedure, pushing new sampling boxes to the stack may 
incur bank conflicts if each sampling box is stored contiguously in the shared stack. This 
problem can be solved by separating the stack into five independent ones: four sub-stacks 
store the coordinates of sampling boxes, and the fifth one stores whether each sampling 
box needs to be further probed. 
  
Perform loop unrolling. Computing pixel or sampling box positions requires 
comparing with polygon edges in a loop. Unrolling the loop to have multiple polygon 
edges tested in a single iteration reduces the number of branch instructions and hides 
memory latency more efficiently. 
2.3.4 Related Discussions 
Pixelization threshold T. The pixelization procedure is applied when the number of 
pixels contained in a sampling box becomes less than the threshold T. Let n be the number of 
threads in a thread block; a good value for T should be between n and $n^2$. If T < n, 
the number of pixels contained in the last sampling box is less than the number of 
threads, which will not keep all threads busy during the pixelization procedure; if T > $n^2$, 
the last sampling box contains too many pixels, because it could have been further 
partitioned at least once while still keeping all threads busy during the pixelization 
procedure. According to our testing (see Section 2.5.4), T = $n^2/2$ is a good choice. 
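As a worked example of these bounds, the following Python lines evaluate them for a thread block of 64 threads. The block size is a hypothetical value chosen only for illustration, and the helper names are not part of PixelBox.

```python
from typing import Tuple


def threshold_bounds(n: int) -> Tuple[int, int]:
    """Recommended range for T given n threads per block: from n to n squared."""
    return n, n * n


def suggested_threshold(n: int) -> int:
    """The value reported as a good choice in the text: n squared over two."""
    return (n * n) // 2


n = 64  # hypothetical thread-block size
low, high = threshold_bounds(n)
t = suggested_threshold(n)
# A box that reaches pixelization has fewer than t pixels, so each of the
# n threads tests at most about t / n of them.
pixels_per_thread = t // n
```

With these numbers the recommended range is 64 to 4096 pixels, the suggested threshold is 2048, and each thread tests at most about 32 pixels per pixelized box.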
Algorithm accuracy. Pixelizing polygons may introduce errors into the areas 
computed. In a general sense, the finer the granularity of pixels is defined, the more 
accurate the computed result is. For pathology imaging analysis, however, PixelBox does 
not incur any loss of precision. As explained in Section 2.3.1, the areas computed equal 
the numbers of pixels actually lying inside the intersection/union of polygons on the 
original image. This property generalizes to polygons segmented from any raster image 
in medical imaging and other applications. We validated the correctness of PixelBox by 
comparing the areas computed by PixelBox with those computed by PostGIS, and found 
that the results are the same. We regard the generalization of PixelBox to vectorized 
polygons as future work. 
Implications of PixelBox to other spatial operators. The principal ideas of PixelBox 
can also be applied to accelerate other compute-intensive spatial operators on GPUs. For 
example, ST_Contains can be implemented by computing the area of intersection and 
testing whether it equals the area of the object being contained. ST_Touches can be 
accelerated using ideas similar to PixelBox: compare the edges of one polygon with the 
edges of the other; also test the positions of vertices in one polygon relative to the other 
polygon; if there is no edge-to-edge crossing, no vertex of one polygon lies within the 
other polygon, and at least one vertex of one polygon lies on the edge of the other, these 
two polygons touch each other; otherwise, they do not touch. We believe that many 
frequently used spatial operators in SDBMSs can be parallelized on GPUs by either 
directly utilizing the PixelBox algorithm or using approaches similar to PixelBox. This is 
another interesting topic we would like to explore in the future. 
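The containment test mentioned above can be illustrated with a short Python sketch. Polygons are represented, for illustration only, as sets of covered pixels, and the names covered_pixels and contains_by_area are hypothetical; this shows the area-comparison idea, not an implementation of the PostGIS operator.

```python
from typing import FrozenSet, Tuple

PixelSet = FrozenSet[Tuple[int, int]]  # hypothetical stand-in for a pixelized polygon


def covered_pixels(x0: int, y0: int, x1: int, y1: int) -> PixelSet:
    """Pixels covered by the axis-aligned rectangle [x0, x1) x [y0, y1)."""
    return frozenset((x, y) for x in range(x0, x1) for y in range(y0, y1))


def contains_by_area(container: PixelSet, candidate: PixelSet) -> bool:
    """The candidate is contained if the area of intersection equals its own area."""
    area_of_intersection = len(container & candidate)
    return area_of_intersection == len(candidate)


outer = covered_pixels(0, 0, 10, 10)
inner = covered_pixels(2, 2, 5, 5)
straddling = covered_pixels(8, 8, 12, 12)
inner_is_contained = contains_by_area(outer, inner)
straddling_is_contained = contains_by_area(outer, straddling)
```

On a GPU the area of intersection would come from PixelBox and the area of the candidate from the polygon area formula, so no boundary computation is needed.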
2.4 System Framework 
Having presented our core GPU algorithm for computing areas of intersection and 
union, we are now in a position to introduce how the whole workflow for spatial cross-
comparison is implemented and optimized in a CPU-GPU hybrid environment. From the 
input of the raw text data for polygons to the output of the final results, the workflow 
consists of multiple logical stages. To fully exploit the rich resources of the underlying 
CPU/GPU hardware, these stages must be executed in a controllable and dynamically 
adaptable way. To achieve this goal, the system framework must address three 
challenges: 1) since the GPU has a memory space separate from that of the CPU, input data 
batching for the GPU is needed to compensate for the long latency of host-device 
communication; 2) the GPU is an exclusive, non-preemptive compute device [39], so 
uncontrolled kernel invocations may cause resource contention and low execution 
efficiency on the GPU; and 3) task execution has to be balanced between CPUs and GPUs 
in order to maximize system throughput. 
In this section, we present our system framework solution. We first introduce our 
pipelined structure for the whole workload, and then present our dynamic task migration 
mechanism between CPUs and GPUs. 
2.4.1 The Pipelined Structure 
 
Figure 2.6: A cross-comparing pipeline with dynamic task migrations. 
We have designed and implemented a pipelined structure for the whole workload. 
Through inter-stage buffers, task productions and consumptions are overlapped to 
improve resource utilization and system throughput. As depicted in Figure 2.6, the cross-
comparing pipeline comprises four stages: 
1. The parser loads polygon files and transforms the format of polygons from text to 
binaries. This stage executes on CPUs with multiple worker threads. 
2. The builder builds spatial indexes on the transformed polygon data. Since polygons 
are small, Hilbert R-Tree [38] is used to accelerate index building. This stage 
executes on CPUs in a single thread because its execution speed is already very 
fast. 
3. The filter performs a pairwise index search on the polygons parsed from every two 
polygon files, and generates an array of polygon pairs with intersecting MBRs. 
Similar to the builder, this stage also executes on CPUs with a single worker 
thread. 
4. The aggregator computes the areas of intersection and union for each polygon 
array using our PixelBox algorithm. The ratios of areas are then aggregated to 
derive the Jaccard similarity for a whole image. Polygon pairs that do not actually 
intersect, i.e., with the area of intersection being zero, will not be considered. 
A computation task at each pipeline stage is defined at the image tile scale. For 
example, an input task for the parser is to parse two polygon files segmented from the 
same image tile; an input task for the builder is to build indexes on the two sets of 
polygons parsed by a single parser task. In practice, a digital image slide may contain 
hundreds of small image tiles; each tile may contain thousands of polygons. The 
granularity of tasks defined at image tile level matches the image segmentation 
procedure, and allows the workload to propagate through the pipeline in a balanced way. 
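The aggregation step of the last stage can be sketched as below. The text only says that the ratios are aggregated, so averaging the per-pair ratios is an assumption made for this illustration, and the function name image_jaccard is hypothetical.

```python
def image_jaccard(pair_areas):
    """Aggregate per-pair area ratios into one similarity value.

    pair_areas: iterable of (intersection_area, union_area) tuples,
    one per polygon pair that passed the MBR filter.
    Assumption: the aggregate is the mean of the per-pair ratios.
    """
    # Pairs whose intersection area is zero do not actually intersect
    # and are excluded, as described for the aggregator stage.
    ratios = [inter / union for inter, union in pair_areas if inter > 0]
    if not ratios:
        return 0.0
    return sum(ratios) / len(ratios)
```

For example, the pairs (50, 100), (0, 80), and (30, 40) contribute the ratios 0.5 and 0.75; the second pair is dropped because its intersection is empty.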
Utilizing such a pipelined framework is critical to solving the aforementioned challenges. 
First, the work buffers between pipeline stages provide natural support for GPU input 
data batching. For example, since the number of polygon pairs filtered may be drastically 
different from tile to tile, it is necessary for the aggregator to group multiple small tasks 
in its input buffer and send them in a batch to the GPU at once. Second, with a pipelined 
framework, a single instance of the aggregator consolidates all kernel invocations to the 
GPUs, which greatly reduces unnecessary contentions and makes the execution more 
efficient. Finally, the pipelined framework creates a convenient environment for load 
balancing between CPUs and GPUs, as will be introduced next. 
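The pipelined structure with bounded inter-stage buffers can be sketched as follows. This is a simplified illustration rather than the SCCG implementation (which uses Pthreads): every stage runs in a single thread, and the names run_pipeline, _run_stage, _feed, and buffer_size are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the task stream

def _run_stage(func, inbox, outbox):
    """Consume tasks from inbox, apply func, and produce into outbox."""
    while True:
        task = inbox.get()
        if task is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(func(task))

def _feed(tasks, inbox):
    """Push all input tasks into the first buffer, then signal the end."""
    for task in tasks:
        inbox.put(task)
    inbox.put(_DONE)

def run_pipeline(tasks, stage_funcs, buffer_size=4):
    """Run tasks through the stages; bounded queues act as work buffers."""
    buffers = [queue.Queue(maxsize=buffer_size)
               for _ in range(len(stage_funcs) + 1)]
    threads = [threading.Thread(target=_feed, args=(tasks, buffers[0]))]
    for i, func in enumerate(stage_funcs):
        threads.append(threading.Thread(
            target=_run_stage, args=(func, buffers[i], buffers[i + 1])))
    for t in threads:
        t.start()
    results = []
    while True:  # drain the final buffer while the stages are running
        item = buffers[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Because each buffer is bounded, a slow stage eventually blocks its producer, which is the property that lets buffer occupancy serve as a progress indicator.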
  
2.4.2 Dynamic Task Migration 
Based on the pipelined structure, we have built a task migration component for the 
whole workflow to achieve load balancing between CPUs and GPUs. First, we have 
ported the PixelBox algorithm to CPUs (called PixelBox-CPU), and parallelized its 
execution with multiple worker threads. Second, we have also designed a GPU kernel for 
the parser stage (called GPU-Parser), whose performance is only comparable to its CPU 
counterpart since text parsing requires implementing a finite state machine, which has 
been shown to be not very efficient for parallel execution [26]. In this way, the parser and the 
aggregator stages are flexible to execute tasks on both CPUs and GPUs, which creates an 
opportunity for balancing workload distributions through dynamic task migrations. 
What must be noted is that the task migration relies on a special feature of the 
pipelined framework to detect workload imbalance from the application level. The work 
buffers between pipeline stages give useful indication on the progress of computation and 
the status of compute devices. Specifically, if the input buffer of the aggregator stage 
becomes full, the migrator knows that this stage is making slow progress and the GPUs 
have been congested. On the other hand, if the input buffer of the aggregator stage 
becomes empty, it indicates that the GPUs are being under-utilized. In each case, tasks 
are dynamically migrated from GPUs to CPUs, or from CPUs to GPUs, to mitigate load 
imbalance and improve system throughput. 
To implement the task migration scheme, two background threads, called migration 
threads, are created — one for the aggregator stage, one for the parser stage. They usually 
stay in the sleeping state and are only woken up when the input buffer of the aggregator 
stage becomes full or empty. In the case of GPU congestion, the aggregator’s migration 
thread is woken up, which selects the smallest tasks from the input buffer of the 
aggregator and invokes PixelBox-CPU to execute them. In the case of GPU idleness, the 
parser’s migration thread is woken up to fetch some tasks from the parser’s input buffer 
and execute them on GPUs. The design of the task migration component is also 
illustrated in Figure 2.6. 
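The two migration decisions can be sketched as follows. This is an illustrative model only: tasks are plain dictionaries, task size is measured by a hypothetical "pairs" field, and the batch parameter (how many tasks to move per wake-up) is an assumption not stated in the text.

```python
from collections import deque

def migrate_on_congestion(aggregator_buffer, capacity, batch=2):
    """GPU congestion: when the aggregator's input buffer is full, remove
    the smallest tasks so that they can run with PixelBox-CPU."""
    if len(aggregator_buffer) < capacity:
        return []
    smallest = sorted(aggregator_buffer, key=lambda t: t["pairs"])[:batch]
    for task in smallest:
        aggregator_buffer.remove(task)
    return smallest

def migrate_on_idleness(aggregator_buffer, parser_buffer, batch=2):
    """GPU idleness: when the aggregator's input buffer is empty, take
    parser tasks so that they can run with GPU-Parser."""
    if aggregator_buffer:
        return []
    count = min(batch, len(parser_buffer))
    return [parser_buffer.popleft() for _ in range(count)]
```

Selecting the smallest aggregator tasks for the CPU keeps the large, highly parallel tasks on the GPU, where they benefit most.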
2.5 Experiments 
This section evaluates our SCCG solution, including the PixelBox algorithm and the 
system framework. We have implemented PixelBox and GPU-Parser with NVIDIA 
CUDA 4.0. Intel Threading Building Blocks [43], a popular work-stealing software 
library for task-based parallelization on CPUs, is used to parallelize text parsing and 
PixelBox-CPU. The pipelined framework is developed using Pthreads. The dynamic task 
migration component is built into the execution pipeline, and can be turned on or turned 
off according to the requirements of respective experiments. 
2.5.1 Experiment Methodology 
We perform experiments on two platforms. One is a Dell T1500 workstation with an 
Intel Core i7 860 2.80GHz CPU (4 cores), an NVIDIA GeForce GTX 580 GPU, and 
8GiB main memory. The operating system is 64-bit Red Hat Enterprise Linux 6 with 
2.6.32 kernel. The other platform is an Amazon EC2 instance with two Intel Xeon X5570 
2.93GHz CPUs (8 cores and 16 threads in total) and two NVIDIA Tesla M2050 GPUs. The 
size of the main memory is 22GiB, and the operating system is 64-bit CentOS with 2.6.18 
Linux kernel. The T1500 workstation is primarily used to test the performance of the PixelBox algorithm 
and the pipelined scheme, and to measure the overall performance of SCCG in cross-
comparing all data sets. The Amazon EC2 instance is used to measure the performance of a 
parallelized PostGIS solution in cross-comparing all data sets. The task migration 
component is verified on both platforms. The version of PostGIS we used is 1.5.3; the 
PostgreSQL version is 9.1.3. 
Our experiments use 18 real-world data sets extracted from 18 digital pathology 
images used in brain tumor research at the authors’ institution. The total size of the data 
sets in raw text format is about 12GiB. The average polygon size is about 150 pixels, 
with a standard deviation of around 100. The average number 
of polygons in each data set is about half a million, with the largest data set containing over 
2 million. 
In all experiments performed in this chapter, we do not consider data loading or disk 
I/O time for the purpose of fair comparison. First, it is well known that the database 
system has high loading overhead when processing one-pass data with the “first-load-
then-query” data processing model. SCCG averts this problem through customized text 
parsing and pipelined execution to process the polygon stream on the fly. Second, disk 
I/O, even though still a significant performance factor for SDBMSs, is no longer the 
severest bottleneck for cross-comparing queries; most time is spent on computation. The 
effect of disk I/O can be further mitigated through SCCG’s pipelined framework by 
adding a disk prefetcher in front of parser stage to sequentially load polygon files into 
main memory. The use of more advanced storage devices, such as SSDs and disk arrays, 
can also reduce disk I/O time significantly. Thus, in the following experiments, we 
assume that the polygon data are already loaded into main memory or imported into the 
database before the pipeline or queries are executed. 
2.5.2 Performance of the PixelBox Algorithm 
In this subsection, we evaluate the performance of PixelBox and verify some design 
decisions discussed earlier. The experiments are carried out on the T1500 workstation. 
Since PostGIS uses GEOS as its geometric computation library, we use the performance 
of GEOS on a single core as the baseline in respective experiments. Optimizations similar 
to the query in Figure 2.1(b) are used in the baseline to avoid the heavy function call for 
polygon unions. We select a representative data set, called oligoastroIII_1, for the 
experiments. It contains 462016 polygons in one polygon set and 458878 polygons in the 
other. In total, the filter yields 619609 pairs of polygons whose MBRs intersect. 
 
Figure 2.7: Performance comparison of GEOS and PixelBox. 
In Figure 2.7, we first show the overall performance of GEOS, PixelBox-CPU on a 
single core (denoted PixelBox-CPU-S), and PixelBox in computing the areas of 
intersection and union for all 619609 polygon pairs. Both absolute execution times and 
relative speedups are shown on logarithmic scales. The computation with GEOS takes 
over 430 seconds. PixelBox-CPU-S performs better than GEOS thanks to algorithmic 
improvements, reducing computation time to about 290 seconds. Compared with GEOS, 
PixelBox achieves a speedup of over two orders of magnitude, finishing all computations 
within only 3.6 seconds. This experiment shows that the PixelBox algorithm 
can fully utilize the power of GPUs to accelerate the computation. 
In order to validate several algorithm design decisions, i.e., using sampling boxes to 
reduce compute intensity, and computing areas of union indirectly, we perform a stress test 
with PixelBox using a set of 15724 polygon pairs filtered from two representative 
polygon files in oligoastroIII_1. We increase the polygon sizes by multiplying the 
coordinates of polygon vertices with a scale factor whose value varies from 1 to 5. The 
data sets used in this chapter are extracted from slide images captured under a 20x 
objective lens. Considering that the objective lenses commonly used have a magnification of around 
40x at most (which increases the sizes of polygons by 4 times), scaling up the 
coordinates of polygons by a maximum factor of 5 (which increases the sizes of polygons 
by 25 times) is more than sufficient. 
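The quadratic relation between the scale factor and polygon size can be checked with a small sketch. The shoelace formula is used here only to measure area for the illustration; the L-shaped polygon and the function names are hypothetical.

```python
def shoelace_area(vertices):
    """Area of a simple polygon given its vertices in order."""
    twice_area = 0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2

def scale(vertices, factor):
    """Multiply every vertex coordinate by the scale factor."""
    return [(x * factor, y * factor) for x, y in vertices]

# A small rectilinear (L-shaped) polygon with area 12.
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
```

Scaling the coordinates by a factor k multiplies the area by k squared: going from a 20x to a 40x lens corresponds to k = 2 (4 times the area), and k = 5 gives 25 times the area.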
We compare the performance of PixelBox with two base versions: one that uses only 
the pixelization method (called PixelOnly), the other that combines the pixelization and 
sampling-box techniques but computes both area of intersection and area of union 
directly (called PixelBox-NoSep). We tune the grid size, block size, and T (for PixelBox-
NoSep and PixelBox), so that all algorithms execute in their best performance. Their 
execution times are shown in Figure 2.8. 
  
 
Figure 2.8: Performance of two algorithm decisions: using sampling boxes and 
computing areas of union indirectly. 
At all scale factors, the performance of PixelBox-NoSep is consistently higher than that 
of PixelOnly due to the use of sampling boxes, while PixelBox outperforms 
PixelBox-NoSep by further reducing the number of sampling box partitionings 
performed. When the scale factor is 1, the overhead of per-pixel examination is relatively 
low because the polygons are small. But PixelBox-NoSep and PixelBox still 
outperform PixelOnly in this case, reducing execution time by 28% and 34% respectively. 
As the scale factor increases, the performance of PixelOnly drops rapidly due to the 
dramatic increase of the number of pixels that must be handled by the algorithm. 
However, the performance of PixelBox-NoSep and PixelBox only degrades slightly. As 
the scale factor reaches 5, that is when the sizes of polygons are increased by 25 times, 
PixelBox-NoSep improves over PixelOnly by reducing execution time by over 50%, 
while PixelBox shortens the execution time even further by 73% compared with 
PixelBox-NoSep. This experiment verifies the effectiveness of using sampling boxes to 
reduce compute intensity. It also shows that, by computing areas of union indirectly, the 
performance of the algorithm can be further enhanced due to reduced sampling box 
partitioning. Note that the performance of PixelOnly, PixelBox-NoSep, and 
PixelBox is much higher than the GEOS baseline at all scale factors (it takes GEOS 
over 11 seconds). 
2.5.3 Effectiveness of Optimization Techniques 
On the T1500 workstation, we evaluate the effectiveness of various optimization 
techniques employed during algorithm implementation, i.e., using shared memory (for 
loading the polygon vertex data), avoiding bank conflicts (when pushing new sampling 
boxes), and loop unrolling (when computing positions). We take the same set of 15724 
polygon pairs used in the previous experiment, with the scaling factors being 1, 3, and 5, 
and measure the execution times of four variants of the PixelBox algorithm: PixelBox-
NoOpt denotes the base version in which none of the optimization techniques are used; 
PixelBox-NBC denotes the version when bank conflicts are avoided; PixelBox-NBC-UR 
denotes the version when bank conflicts are avoided and loop unrolling is performed; 
finally, PixelBox-NBC-UR-SM denotes the version when all optimizations are utilized. 
In all variants, the sampling box stack is always allocated in shared memory, because 
otherwise a global heap whose size is proportional to the total number of threads in the 
whole grid has to be allocated, which we consider an unreasonable design scheme. 
The performance of each variant normalized to PixelBox-NoOpt is shown in Figure 
2.9. It can be seen that the optimization techniques discussed above are effective in 
improving the performance of PixelBox. When the scale factor is 1, the performance is 
improved by a factor of 1.14 after all optimization techniques are utilized; when the scale 
factor is 5, the speedup rises to a factor of 1.30. The contributions of the different optimization 
techniques to the algorithm performance, however, vary. The effects of loop 
unrolling and using shared memory are more significant than that of avoiding bank 
conflicts. This is because PixelBox spends more time on computing the positions of 
pixels and sampling boxes than on generating new sampling boxes. Thus, loop unrolling 
and using shared memory, which improve the efficiency of computing positions, play a 
larger role in the performance of PixelBox. 
 
Figure 2.9: Performance impact of various optimization techniques in algorithm 
implementation. 
2.5.4 Parameter Sensitivity of PixelBox 
In order to test the sensitivity of algorithm performance to the pixelization threshold T, 
we take the same set of 15724 polygon pairs used above and measure how the execution 
time of PixelBox varies as we change the value of T. We do the experiments on the 
T1500 workstation. We set the thread block size to 64, and the performance trend in each 
scale factor (SF1 to SF5) is shown in Figure 2.10. The result verifies our analysis for 
choosing the value of T. The performance of PixelBox is sub-optimal when T is too small 
or too large. It performs the best when the value of T lies between 512 and 4096, which 
corresponds to the range from $n^2/8$ to $n^2$, in all scale factors. We also repeated the 
experiment when setting thread block size to other values, and the trend was similar. But 
when the block size is too large (e.g., >= 256), the overall performance of PixelBox 
degrades. This is because fewer thread blocks can run concurrently on a multiprocessor and 
the sampling box partitioning will be less fine-grained when block size is too large. 
According to our experience, setting n to a small value and the value of the pixelization 
threshold around $n^2/2$ achieves the highest performance. 
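The relation between n and the threshold T can be made concrete with a short calculation. It assumes that n is the thread block size, which is consistent with the block size of 64 and the reported range of 512 to 4096; the function names are hypothetical.

```python
def threshold_range(n):
    """Range of well-performing thresholds, from n^2/8 to n^2."""
    return n * n // 8, n * n

def suggested_threshold(n):
    """Threshold around n^2/2, reported as the best setting."""
    return n * n // 2
```

With n = 64, threshold_range gives 512 and 4096, matching the measured range, and suggested_threshold gives 2048.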
 
Figure 2.10: The sensitivity of PixelBox performance to pixelization threshold T. 
  
2.5.5 Performance of the Pipelined Framework 
We evaluate the performance of the pipelined framework in this subsection. Task 
migration is disabled to remove its influence on the pipeline’s performance. On the 
T1500 workstation, we collect the execution times of four schemes that cross-compare 
the oligoastroIII_1 data set: 
• PostGIS-S executes the optimized query shown in Figure 2.1(b) with PostGIS on a 
single core; 
• NoPipe-S uses a single execution stream that executes a non-pipelined version of 
the framework in Figure 2.6, in which the four stages execute sequentially on each 
pair of input polygon files without pipelining; 
• NoPipe-M represents the thread-parallel scheme where multiple execution streams 
are launched with each one invoking NoPipe-S independently; 
• Pipelined is the fully pipelined scheme used in SCCG. 
Scheme     PostGIS-S   NoPipe-S   NoPipe-M   Pipelined 
Speedup    1           37.07      63.64      76.02 
Table 2.1: Performance comparisons between different schemes. 
The result is shown in Table 2.1, with speedup numbers normalized against the 
PostGIS-S baseline. Since the bottleneck stage of the pipeline has been accelerated by 
PixelBox on GPUs, NoPipe-S achieves over 37-fold speedup compared with PostGIS-S. 
NoPipe-M performs better than NoPipe-S (63x speedup over PostGIS-S) because 
simultaneously issuing multiple streams improves the utilization of resources. However, 
due to the serialization caused by uncoordinated use of GPUs on the last stage, the CPU 
resource cannot be well utilized. We measured the CPU utilization during the execution 
of NoPipe-M and observed that all CPU cores were only about 50% utilized throughout 
the execution, which confirmed our analysis. The Pipelined scheme achieves the highest 
performance, accelerating the speed of cross-comparison by a factor of 76 compared with 
PostGIS-S. The result justifies the use of the pipelined framework and shows the 
importance of coordination when using GPUs. 
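The speedups in Table 2.1 are all relative to PostGIS-S, so the gain of one scheme over another follows by division. The sketch below only rearranges the reported numbers; the names are hypothetical.

```python
# Speedups over PostGIS-S, as reported in Table 2.1.
speedup_over_postgis = {
    "PostGIS-S": 1.0,
    "NoPipe-S": 37.07,
    "NoPipe-M": 63.64,
    "Pipelined": 76.02,
}

def relative_gain(scheme, baseline):
    """How many times faster `scheme` is than `baseline`."""
    return speedup_over_postgis[scheme] / speedup_over_postgis[baseline]
```

For instance, relative_gain("Pipelined", "NoPipe-M") is roughly 1.19, which isolates the benefit of coordinated, pipelined GPU use over uncoordinated multi-stream execution.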
2.5.6 Effectiveness of Dynamic Task Migration 
In order to verify the design of the task migration component, we perform experiments 
in three different platform configurations: the T1500 workstation (Config-I), the Amazon 
EC2 instance with both GPU cards used (Config-II), and the Amazon EC2 instance with 
only one GPU card used (Config-III). We use the first two configurations to evaluate the 
effectiveness of the task migration component to offload workloads from CPUs to GPUs, 
and use the last one for testing load balance in the other direction. Since the GPUs on 
both platforms are too powerful, in order to make the case of GPU-to-CPU task 
migrations happen, we purposely slow down PixelBox by selecting a sub-optimal thread 
block size in Config-III. In a real-world system environment, due to concurrent sharing of 
GPUs with other applications, GPUs may not be exclusively occupied by a single 
application, which is the case we want to emulate in the last configuration. 
The oligoastroIII_1 data set is used in experiments. We show the throughput of task-
migration-enabled SCCG normalized to the throughput of task-migration-disabled SCCG 
in each configuration. Throughput is defined as the size of data set divided by execution 
time. 
As Figure 2.11 shows, on the T1500 workstation, the throughput of SCCG with dynamic 
task migration is about 50% higher than that of SCCG without dynamic task migration. In this 
setting, the aggregator stage cannot keep the GPU fully occupied, which triggers the 
migrator to dynamically offload tasks from the parser stage to execute on GPU. This 
improves the performance of the parser stage and thus enhances the throughput. On 
Amazon EC2 with both GPUs utilized, the GPU resource still cannot be fully utilized by 
the aggregator stage. Thus, workloads are migrated from CPUs to GPUs, and the 
throughput of the pipeline is improved by over 40%. The throughput improvement is 
lower than in Config-I because the CPUs are more powerful, which causes less workload 
to be offloaded to GPUs. On Amazon EC2 with only one GPU utilized, dynamic task 
migration improves the pipeline throughput by over 14%. In this scenario, the aggregator 
stage becomes the bottleneck of the pipeline, and some aggregator tasks are migrated to 
execute on CPUs. But due to the relatively small speed gap between the parser and the 
aggregator stage and the limited performance of PixelBox-CPU on CPUs, the throughput 
improvement is smaller compared to other configurations. 
 
Figure 2.11: Performance benefits of dynamic task migration. 
  
2.5.7 Performance Evaluation with All Data Sets 
In this section, we give the complete performance results of SCCG compared with a 
parallelized PostGIS solution over all 18 data sets. The experiments with SCCG are 
performed on the T1500 workstation with only one GPU card and a 4-core CPU. The 
experiments with PostGIS are performed on the Amazon EC2 instance with both 4-core 
CPUs fully utilized. The reason why we choose a less powerful platform for SCCG is to 
demonstrate both its performance advantage and cost-effectiveness. Query executions in 
PostGIS are parallelized over all CPU cores by evenly partitioning polygon tables into 16 
chunks and launching 16 query streams to process different chunks concurrently. We 
refer to this execution scheme as PostGIS-M. Being generous to PostGIS, we only 
consider index building and query execution times; time spent on partitioning polygon 
tables is not included. We measure the times taken by SCCG and PostGIS-M on cross-
comparing each data set, and the relative speedups of SCCG compared with PostGIS-M 
are presented in Figure 2.12. 
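The PostGIS-M execution scheme (even partitioning plus concurrent query streams) can be sketched as follows. This is a structural illustration only: the rows are an in-memory list rather than database tables, query is a stand-in callable, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition_evenly(rows, num_chunks):
    """Split rows into num_chunks contiguous chunks whose sizes differ
    by at most one."""
    base, extra = divmod(len(rows), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks

def run_streams(rows, query, num_streams=16):
    """Run one query stream per chunk concurrently; results keep the
    chunk order."""
    chunks = partition_evenly(rows, num_streams)
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        return list(pool.map(query, chunks))
```

With 16 streams on 16 hardware threads, each stream processes one chunk, so the slowest chunk determines the overall query time.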
To give an impression of the absolute execution times, it takes PostGIS-M over 1120 
seconds to process all data sets, while SCCG finishes all computations within only 64 
seconds. As Figure 2.12 shows, the varied speedups of SCCG over PostGIS-M on 
different data sets are due to the different numbers and sizes of polygons among the data 
sets. For example, the first data set contains only 20 polygon files and about 57000 
polygons; while the last data set comprises a total of 442 polygon files with over 4 
million polygons contained. Among all data sets, SCCG achieves a minimum of 13-fold 
speedup and a maximum of over 44-fold speedup compared with PostGIS-M. The last 
column gives the geometric mean of speedups across all data sets, which is over 18 times. 
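The geometric mean used to summarize per-data-set speedups can be computed as sketched below. The per-data-set values are not listed in the text, so no real inputs are reproduced here; the function name is hypothetical.

```python
import math

def geometric_mean(values):
    """Geometric mean, computed through logarithms for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Ratio of the total execution times reported in the text
# (over 1120 seconds for PostGIS-M, within 64 seconds for SCCG).
overall_time_ratio = 1120 / 64
```

The ratio of total times is about 17.5, which is close to but not the same quantity as the geometric mean of per-data-set speedups, since the former is dominated by the largest data sets.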
 
Figure 2.12: The overall performance of SCCG compared with PostGIS-M on 18 data 
sets. 
The result shows the effectiveness of our SCCG solution in improving the performance 
of spatial cross-comparison at low cost. Two Intel X5570 CPUs cost over $2000, while 
the total cost of an Intel Core i7 860 CPU and an NVIDIA GTX580 GPU is only about 
$820 according to the current market price as of March 2012. 
2.6 Related Work 
Though modern computer architecture has brought rich parallel resources, existing 
geometric algorithms for spatial operations implemented in the widely used libraries (e.g. 
CGAL and GEOS) and in major SDBMSs are still single-threaded. There have been several 
attempts at parallel algorithms. A parallel algorithm was proposed in [32] to compute the 
areas of intersection and union on CPUs. The algorithm was not designed to execute in 
SIMD fashion, which has been the key to achieving high performance on both CPUs and 
GPUs in the era of high-throughput computing [44]. As a numerical approximation 
method, Monte Carlo [31] can be used to compute the areas of intersection and union on 
GPUs, by repeatedly generating randomized sampling points and counting the number of 
points lying within the region. However, repeated casting of random sampling points 
makes Monte Carlo much more compute-intensive than our optimized PixelBox 
algorithm. A paper [51] proposed to test polygon intersections by drawing polygons on a 
frame buffer through the OpenGL interfaces and counting the number of pixels with 
specific colors. This method could be extended to compute the areas of intersection and 
union, but it would suffer from a performance problem similar to that of the pixelization-only 
approach due to high compute intensity. The idea of rounding objects to pixels has 
appeared in fields such as computer graphics [48] and GIS [6], while we realize and 
utilize the rectilinear property of polygons to solve an important problem in pathology 
imaging analysis. 
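The Monte Carlo alternative mentioned above can be sketched as follows to show why it is compute-intensive. This is a generic CPU illustration, not the method of [31]: shapes are given as hypothetical point-membership predicates, and the sample count and seed are arbitrary choices.

```python
import random

def monte_carlo_intersection_area(inside_a, inside_b, bbox,
                                  samples=100_000, seed=0):
    """Estimate the area of the intersection of two shapes by casting
    uniformly random points into a bounding box."""
    x0, y0, x1, y1 = bbox
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x = rng.uniform(x0, x1)
        y = rng.uniform(y0, y1)
        if inside_a(x, y) and inside_b(x, y):
            hits += 1
    # Fraction of hits times the bounding box area.
    return hits / samples * (x1 - x0) * (y1 - y0)

def in_square_a(x, y):
    """Square [0, 4] x [0, 4]."""
    return 0 <= x <= 4 and 0 <= y <= 4

def in_square_b(x, y):
    """Square [2, 6] x [2, 6]; overlaps square A in a 2 x 2 region."""
    return 2 <= x <= 6 and 2 <= y <= 6
```

The estimate only approaches the true area of 4 as the number of samples grows, and the error shrinks with the square root of the sample count, which illustrates the cost of repeated random sampling.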
Prior work has proposed optimized algorithms and implementations for various 
database operations on the GPU architecture, including join [36], selection and 
aggregation [34], sorting [33], tree search [41], list intersection and index compression 
[25], and transaction execution [37]. Moreover, using a CPU-GPU hybrid environment to 
accelerate foreign-key joins has been explored in the paper [49]. Compared with these 
works, we focus on optimizing spatial operations for image comparisons in a CPU-GPU 
hybrid environment. In addition, considering our system execution framework, related 
work about the utilization of pipelined execution parallelism can be found in parallel 
database systems [47] and optimized data-sharing query execution engine [35]. Related 
work about task scheduling and GPU resource management can be found in work-
stealing and real-time systems [30, 39]. 
2.7 Conclusions 
We have presented our solution for fast cross-comparison of analytical pathology 
imaging data in a CPU-GPU hybrid environment. After a thorough profiling of a spatial 
database solution, we identified the performance bottleneck of computing areas of 
intersection and union on polygon sets. Our PixelBox algorithm and its implementation 
on GPUs can fundamentally remove the performance bottleneck. Moreover, our pipelined 
structure with dynamic task migration can efficiently execute the whole workload using 
CPUs and GPUs. Our solution has been verified through extensive experiments. It 
achieves more than 18x speedup over parallelized PostGIS when processing real-world 
pathology data. 
We believe our work makes a strong case for performing high-performance, cost-
effective digital pathology analysis. The immense power of GPUs and the vectorized 
functional units on modern hardware must be fully utilized in order to handle the ever-
increasing, data-intensive computations. Efficient parallelization of computations on 
GPUs, in turn, relies on both the problem characteristics and GPU-optimized algorithm 
design and implementation. For example, PixelBox trades off a little bit of compute 
efficiency for a huge gain of data parallelism, and its compute-bound nature also 
perfectly matches the advantages of GPU architecture. From the system perspective, we 
consider the incorporation of GPUs into the database ecosystem as an imperative trend 
with high economic benefits. In a CPU-GPU hybrid environment, many system problems 
such as GPU-aware query execution engine, load balancing, and multi-query GPU 
sharing need to be addressed. 
  
  
 
 
Chapter 3 GDM: Device Memory Management for GPGPU 
Computing 
 
 
3.1 Introduction 
General-purpose GPUs, a.k.a. GPGPUs, are quickly evolving from conventional, 
dedicated accelerators towards mainstream commodity computing devices, which is 
driven by the demands for cost-effective high performance from new application domains 
and supported by GPU hardware and system software advancement [75, 77, 79, 56]. 
During the transition, system software plays an increasingly important role in managing 
GPUs. System software relieves application developers from explicit resource 
management in their programs. It must also coordinate the utilization of GPU resources, 
ensuring that applications can make continuous progress and no application can be 
deprived of resource usage indefinitely [62, 63]. 
Recent research and improvements on GPGPU resource management have mainly 
focused on supporting GPU abstractions [81], GPU file system [82], and the management 
of GPU computing units [70, 66, 71]. These system enhancements improve the usability 
and performance of GPGPU computing. However, despite these improvements, with 
state-of-the-art GPGPU system software an application still can easily crash, hang, or 
lose performance (Section 3.2.3). The major reason behind this problem is the lack of 
GPU device memory management at the operating system (OS) level, which has become 
a major hurdle to adopting GPGPUs as truly general-purpose mainstream computing facilities. 
This chapter identifies these problems and systematically studies the essentiality and 
design of GPGPU device memory management. 
3.1.1 Problems with Application-Level Device Memory Management 
Device memory is the primary onboard DRAM storage for the computation performed 
on GPU. Unlike system memory where the OS controls space allocation and reclamation, 
GPU device memory is still directly controlled by individual applications in current 
systems, which complicates GPGPU application design. In large applications, managing 
the usage of device memory space is a heavy burden for programmers. There have been 
numerous reports on application and system crashes [7, 8, 9, 10, 11, 12] caused by 
applications' failure to manage device memory correctly. 
Managing device memory space at application level becomes even more difficult when 
there are multiple applications or application components (e.g., multiple worker threads 
in a server) with contradicting demands for device memory. Due to the lack of an 
arbitrator to coordinate the conflicts, applications can crash or hang on unexpected 
shortage of device memory space. Even if an application may manage to survive by using 
smaller device memory space or shifting computation back to the CPU, its performance 
can suffer dramatically. 
For instance, in Matlab, each worker thread can offload its computation tasks to GPUs 
for acceleration. However, if their working sets cannot totally fit into the device memory, 
some workers can easily fail or encounter severe performance degradation [7]. Device 
memory conflicts will become increasingly common as GPGPUs are more 
widely adopted in large-scale applications (e.g., Matlab, AutoCAD, relational 
databases, etc.), or in the cloud where resources are shared by virtual machines [66]. 
We will discuss the problems with existing GPU system design in more detail in 
Section 3.2.3 and illustrate their consequences in Section 3.6. 
3.1.2 GDM: OS Device Memory Management 
As a critical system resource, device memory space must be managed by the OS to 
effectively coordinate conflicting demands and to guarantee efficiency. In this chapter, 
we present the design and implementation of GDM (GPGPU Device-memory Manager) 
in the OS. With experiments, we show that such a device memory management 
component in the OS is indispensable for unleashing the high computing power of GPUs 
in general-purpose systems. 
Without requiring modifications to existing APIs, GDM transparently takes control 
over device memory allocations and data transfers to and from device memory. Instead of 
letting applications directly allocate device memory space and exchange data with GPUs, 
GDM sits between applications and GPU devices, acting as an agent and coordinator for 
carrying out these operations, leveraging a staging area created in each application's
virtual memory space. It monitors the utilization of device memory space allocated to
each application, and dynamically reclaims underutilized space by swapping out the
content to staging areas. In this way, GDM controls and coordinates the actual device 
memory consumption of different applications, or different phases of a single application, 
so as to achieve system-wide benefits of performance and service quality. With support
of GDM, even multiple applications with conflicting memory requirements can 
efficiently share the same GPU and make progress concurrently. As we will demonstrate 
in Section 3.6, GDM also enhances the capability of GPGPU systems to tolerate device 
memory leaks and defend against malicious device memory usage. 
These benefits, however, come with overhead, which stems mainly from
the extra data movements incurred by GDM management. Several unique characteristics
of the GPGPU system make it especially challenging to reduce the overhead. Firstly, 
GPGPU applications are usually data-intensive. Thus, GDM must handle large sets of 
data that potentially incur high cost. Secondly, the data-driven nature of GPGPU 
computing involves synchronizations at various stages, which hinder the overlapping 
between data transfer and the computation over the data. This makes the performance of 
GPGPU applications sensitive to the delay caused by data movement. Finally, GPU 
devices may lack necessary hardware support for efficiently minimizing the overhead, 
which makes the solution even more challenging. To address these challenges, we have 
developed a series of optimization techniques in GDM, such as object-level access 
pattern inference, hashing-based dirty block detection, and cost-aware data replacement 
policy. These techniques can effectively reduce unnecessary data movement to achieve 
high performance.  
3.1.3 Contributions 
This chapter systematically studies the necessity and design of GPGPU device
memory management. It makes the following main contributions: (1) We have identified
and analyzed the serious problems caused by the lack of OS management of device
memory space on existing GPGPU systems. (2) We have explored the design space of 
managing device memory at system level. (3) We have implemented a prototype of GDM 
in an open-source GPGPU driver and on commonly used hardware to best utilize device 
memory resource for general-purpose systems, including a set of optimization techniques 
and principles that are crucial to the performance of device memory management for 
GPUs. (4) We have conducted an extensive performance evaluation on GPU systems and
report the insights gained. The experiments show that GDM can effectively prevent applications from
crashing or stalling due to unexpected shortage of free device memory space. The 
experiments also show that the optimization techniques can increase system throughput 
by up to 46%. 
3.2 Demand for System-Level Device Memory Management 
To deliver high performance, GPGPU computing not only relies on vectorized GPU 
processors to process data in parallel but also requires high-speed memory to guarantee 
fast data accesses. Thus, a common practice is to integrate GPU processors with device 
memory on the same GPU board, which is connected with the system bus to accept data-
parallel tasks. This section introduces the system organization, which this chapter mainly 
focuses on, and validates the indispensability of efficient device memory management. 
3.2.1 GPGPU Computing Architecture 
GPUs are suitable for performing data-parallel computation. They are often used 
together with the CPUs to form a hybrid computing system, as shown in Figure 3.1.
 
Figure 3.1: GPGPU system organization. 
For high performance, a GPU usually has tens to hundreds of stream processors (SPs).
Each SP is a many-lane SIMD engine. To satisfy data accesses from such a large number 
of SPs, a wide and fast memory interface must be employed. The device memory 
designed for GPUs is therefore optimized for high bandwidth and integrated close to the 
SPs. 
Generally, the bandwidth of GPU device memory is several times higher than the
bandwidth of system memory accessed by CPUs, which places more emphasis on low latency.
For example, a server-class NVIDIA Tesla K10 GPU provides over 300 GB/s device 
memory access. In contrast, the maximum memory bandwidth of a similar-level Intel 
Xeon E5-4650 CPU can only reach about 50 GB/s. Compared with system memory, the 
capacity of device memory, however, is much more limited, due to the pin-count and
power constraints imposed by the memory technology (e.g., GDDR) used for GPUs [73].
For example, a high-end GPU card is usually equipped with only a few gigabytes of
device memory, while tens of gigabytes of system memory have been common on
modern servers for years.
GPUs are connected to the system bus to accept data-parallel tasks, which are often 
called GPGPU kernels (or kernels for brevity), and the data to be processed. GPGPU 
system software is responsible for task scheduling, initiating data transfers, and handling 
task exceptions. The operations performed by system software are mostly control-
intensive, and thus can only be executed efficiently on CPUs. 
In this chapter, we mainly target the mainstream GPGPU computing architecture
described above, in which dedicated device memory modules are used by GPUs to 
maximize throughput. Another GPGPU architecture, represented by AMD's APU [13], 
fuses graphics units and CPU cores on the same die and lets them share system memory. 
It cannot provide the same high computing power as a GPU with dedicated device 
memory does. Processors with the fused architecture are mostly used in mobile and low-
end desktop systems to handle graphics workload at a low cost. The performance is 
bottlenecked by the number of graphics units that can be integrated on the same CPU 
chip and by the narrow system memory bandwidth contended by both CPU and graphics 
cores. To alleviate memory bandwidth bottleneck, there are proposals to integrate fast 
memory modules (e.g. stacked memory [58] or eDRAM [80]) into this architecture. 
These memory modules will play an important role to improve the performance of 
computation on the graphics cores. The principles and techniques developed in this 
chapter can be adapted to manage these memory modules and other accelerators (e.g., 
DSP) with similar memory structures as well.
3.2.2 Device Memory: A Critical Resource 
Device memory provides a high-speed data storage for GPGPU computing, and must 
be well managed in order to achieve high performance. Despite its limited capacity, 
applications have high demands for device memory space. On one hand, as applications 
become increasingly data-intensive, the data sets handled by a GPGPU task also grow
rapidly, requiring larger device memory space. On the other hand, GPGPU applications 
tend to keep their working sets on the device memory for future reuses to minimize data 
transfers. 
As an example, when GPUs are used to process database queries in data warehousing 
applications, main accelerator structures such as hash tables have to be loaded into device 
memory [85]. These data structures can be very large, especially for big-data problems 
[59]. Meanwhile, these data structures are usually used by different queries repeatedly. 
Keeping them in the device memory helps improve application performance. The small
capacity and the high demand from applications make GPU device memory a critical but 
limited system resource. 
3.2.3 Issues with Existing System Designs 
Despite its critical role, device memory has not been well managed by the system.
In a general-purpose computing environment, applications are still forced to manually 
manage device memory on their own. Before a task can be offloaded to GPU, the 
application must ensure that enough space has been reserved on the device memory and 
the working set of the task has been transferred to the reserved space. After the task 
finishes, it also has to decide whether the datasets should remain in device
memory for reuse by other tasks, or be transferred back to system memory
to make room for other data to be processed. 
The above design used to be more or less acceptable in the early era of GPGPU 
computing when GPUs were dedicated to applications with clear, static demands for 
device memory space. However, as both the scale and scope of GPGPU applications 
expand, it has become an increasingly heavy burden, if not an impossible one, for programmers to
correctly keep track of the demands for device memory space and manage the 
consumption accordingly. 
For example, some applications consist of GPU-accelerated modules developed by 
different groups of developers, or third-party GPGPU libraries and runtimes (e.g., CULA 
[21], PyCUDA [14], and Theano [57]). It is difficult to monitor and coordinate the device 
memory space consumption of different components. Applications such as Matlab, 
Boinc, and GPU databases may also launch multiple workers, whose activities and 
demands for device memory space depend on user requests and are affected by the OS 
scheduling. It is laborious and inefficient to deal with such dynamics at the programming 
stage. When GPGPUs are shared by multiple applications (e.g. in the cloud), managing 
device memory space inside each individual application also leads to uncoordinated 
contention for the space. 
Due to the complexity of managing device memory, applications may frequently 
experience shortage of free device memory space. For example, one worker thread may 
not be able to obtain enough device memory space if other worker threads have occupied 
too much of it. The application may crash or hang if it cannot handle the situation
correctly. There have been an increasing number of device memory related crash reports 
in both open-source and commercial GPGPU software such as Matlab [7], Boinc [8], and 
Theano [11]. An application may survive by reducing the granularity of GPGPU tasks or 
shifting computation back to the CPU dynamically. But either method can significantly 
reduce application performance. Note that a shortage of free device memory may
happen even when the allocated device memory space is not being actively used, which 
leads to resource underutilization. 
The absence of system management of device memory also causes other system issues. 
For instance, device memory leaks are a common type of software bug that exists in
many real-world GPGPU systems, including key computation libraries [15, 9], popular 
language runtimes [11, 12], and widely deployed applications [16, 17]. Without system 
management, the leaked memory space cannot be reclaimed until the leaking application 
crashes or is terminated. This shrinks the device memory space available to applications 
and significantly degrades system performance. Even worse, without system 
management, a malicious program can reserve most device memory space without 
releasing it, rendering the whole GPGPU system unusable.
3.2.4 Demand for System Management 
The above issues cannot be effectively addressed at application level due to the lack of 
system-wide information and the authority required for managing a shared resource. For 
example, a library that implements device memory management functionalities can 
relieve the burden of application programmers. However, a library can only provide local 
management within each individual application or application component adopting the
library. The conflicting demands between applications or application components still 
cannot be addressed. 
To address the above issues, GPGPU system software must be enhanced to control 
device memory management. This will not only relieve application developers from this 
tedious obligation, but also provide an arbitrator to coordinate the contention for device
memory space. With the new improvements of GPU hardware and firmware, especially 
those to support multitasking [78, 65], the application domains and environments of 
GPGPU computing will continue expanding. The demand for system management of 
device memory space is also becoming more imperative. 
The demand for system management of device memory is analogous to that for the OS 
to manage the physical space of system memory [61]. Before virtual memory was 
introduced, the large efforts spent by programmers to incorporate memory overlaying 
procedures into their programs proved inefficient and unrealistic as applications became 
increasingly complex. Nowadays, in almost all modern systems, the physical space of 
system memory is managed by the OS; applications just need to allocate and de-allocate 
objects in their virtual spaces to use system memory. 
However, unlike system memory management, because of the special characteristics of 
GPGPU systems and applications, the management of device memory must address a set 
of unique challenges to achieve high performance. We will introduce these challenges 
and the design of GDM in the next two sections.
3.3 GDM Overview 
The objective of GDM is to take over the control of device memory space from 
applications without changing current APIs for device memory operations. For this 
purpose, GDM creates a staging area in each GPGPU application's virtual memory 
space. This staging area effectively serves as the device memory extensions for the GPU 
kernels launched in the application. Thus, device memory operations from the application 
can be redirected to the corresponding staging area, while the actual control of device 
memory space is released to GDM. 
 
Figure 3.2: The overall architecture of GDM.
Figure 3.2 illustrates the positions of GDM and GDM staging areas in the system and 
how GDM interacts with other system components. GDM is built as part of the GPGPU 
driver in the OS. It intercepts and handles device memory related operations from 
GPGPU applications. GDM handles an allocation operation (e.g., cuMemAlloc in CUDA 
[79]) by allocating the required space in the staging area. Data to be transferred to device 
memory is first copied to the staging area, and is later transferred to the device memory 
when the kernel accessing the data is launched. This is shown with the arrows ① and ②,
respectively. After the kernel finishes, the data may stay in the device memory. When the 
device memory is short of free space, GDM transfers some data back to the 
corresponding staging area and reclaims the space, as shown by arrow ③. When an
application calls the function to copy some data (e.g. computation results) from the 
device memory, GDM locates the latest version of the data (either in the staging area or 
in the device memory) and copies the data to the user buffer designated by the 
application. The arrows marked by ④ show the data transfers. To handle a de-allocation
operation (e.g. cuMemFree in CUDA), GDM frees and reclaims the corresponding space 
in both the staging area and the device memory. 
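The interposition described above can be illustrated with a small user-level model. The following is only a sketch under stated assumptions, not the GDM driver code: the class name `ToyGDM`, its method names, and the insertion-order eviction are hypothetical choices made for illustration, and Python byte arrays stand in for the staging area and device memory.

```python
class ToyGDM:
    """Toy model of the interposition described above; not the real driver."""

    def __init__(self, device_capacity):
        self.device_capacity = device_capacity  # bytes of device memory
        self.staging = {}    # handle -> bytearray in the staging area
        self.device = {}     # handle -> bytearray resident on the device
        self.next_handle = 1

    def mem_alloc(self, size):
        # An allocation only reserves staging space; no device memory yet.
        handle = self.next_handle
        self.next_handle += 1
        self.staging[handle] = bytearray(size)
        return handle

    def memcpy_htod(self, handle, data):
        # Arrow 1: host data lands in the staging area first.
        self.staging[handle][:len(data)] = data
        self.device.pop(handle, None)  # any device copy is now stale

    def device_used(self):
        return sum(len(buf) for buf in self.device.values())

    def launch_kernel(self, handles, kernel):
        # Arrow 2: load the kernel's working set, evicting others if needed.
        for h in handles:
            if h in self.device:
                continue
            self._make_room(len(self.staging[h]), keep=set(handles))
            self.device[h] = bytearray(self.staging[h])
        kernel(*(self.device[h] for h in handles))

    def _make_room(self, need, keep):
        # Arrow 3: write victims back to staging, then reclaim their space.
        # Assumption: victims are taken in insertion order.
        for victim in list(self.device):
            if self.device_used() + need <= self.device_capacity:
                return
            if victim not in keep:
                self.staging[victim] = self.device.pop(victim)
        if self.device_used() + need > self.device_capacity:
            raise MemoryError("working set exceeds device memory")

    def memcpy_dtoh(self, handle):
        # Arrow 4: return the latest version, wherever it currently lives.
        src = self.device.get(handle, self.staging[handle])
        return bytes(src)

    def mem_free(self, handle):
        # Release the space in both the staging area and device memory.
        self.staging.pop(handle, None)
        self.device.pop(handle, None)
```

In this model, two 6-byte buffers can be used in turn on an 8-byte "device": launching a kernel on the second buffer writes the first one back to its staging area, and a later copy-out still returns the data produced by the first kernel.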
3.3.1 Minimizing Overhead: A Major Challenge 
GDM relieves applications from the burden of directly managing device memory 
space. While it avoids the problems due to uncoordinated usage of device memory space, 
the benefit does not come without cost. The main overhead of GDM is from the extra 
data copying (to and from staging areas) and data transferring (to and from device 
memory).
Some unique features of the GPU hardware and GPGPU application execution model 
make minimizing the overhead particularly challenging. Firstly, kernels running on GPU 
devices are usually data-intensive. Transferring large data sets over system bus may incur 
high overhead.⁴ As shown by previous studies [81, 64] and our own measurement
(Section 3.6.1), data transfers already account for a considerable portion of GPU 
operation time for many applications. If the amount of data movement incurred by GDM 
cannot be effectively controlled, the benefits of device memory management can be 
easily outweighed by the potentially high cost, diminishing the usefulness of the whole
system. 
Secondly, the performance of GPGPU applications is sensitive to the delays caused by 
bulky data transfers. The data-driven nature of GPGPU computing requires 
synchronizations at various stages. For example, a kernel cannot be launched before the 
transfer of its input data to the device memory finishes. These synchronization barriers 
reduce the opportunities of overlapping operations before and after the synchronization 
points, making application performance sensitive to the delays on these operations. 
Among these operations, most are related to data transfers between the CPU and the 
device. Thus, the extra data transfers incurred by GDM management may degrade 
application performance if not handled properly. Meanwhile, system bus transactions are 
usually non-preemptive. A data transfer through PCIe, for example, cannot be interrupted 
once the DMA command is sent to the GPU copy engine. This exacerbates the problem 
caused by GDM-initiated data movement.
⁴ The data transfer rate via the system bus is about one order of magnitude lower than device memory bandwidth.
Finally, some hardware facilities for minimizing the cost have not been or cannot be 
efficiently implemented on GPUs. For example, in current GPU designs, there is no 
support for page reference bits to track fine-grained data access patterns. This poses great 
challenges to identifying inactive device memory areas. Hardware-set page dirty bits,
a convenient feature for detecting data modifications, are also missing. On GPUs, page
faults usually incur prohibitive costs [76]. On some GPGPU systems, page faults even
cause application crashes. As far as we know, there are no clear plans to improve
these facilities in GPU hardware in the foreseeable future.
To address these challenges, GDM minimizes the cost following two directions. One is 
to minimize data movements. This is achieved mainly through lazy copying, exploiting 
data locality, and careful classification of the data. The other direction is to reduce the 
latency incurred by data movements. GDM reduces the latency in two ways. Without 
compromising the correctness of program executions, GDM implicitly makes the 
handling of some heavy synchronous operations asynchronous to user programs, 
allowing programs to proceed while the operation is delegated to GDM for processing. 
GDM also internally breaks some bulky synchronous operations into several smaller 
pieces so that the processing of one piece can be overlapped with another to reduce costs. 
This also practically makes a long, bulky operation interruptible. 
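The benefit of breaking a bulky operation into pieces can be estimated with a simple two-stage pipeline model. The sketch below is illustrative only: the function name, the choice of two stages (for example, copying a piece into a transfer buffer and then moving it to the device), and the rates are assumptions, not measurements or details taken from GDM.

```python
def two_stage_time(total_bytes, chunks, stage1_rate, stage2_rate):
    """Time to push data through two sequential stages, chunk by chunk."""
    chunk = total_bytes / chunks
    s1 = chunk / stage1_rate   # e.g. copying a piece into a transfer buffer
    s2 = chunk / stage2_rate   # e.g. moving that piece to the device
    # The first piece pays both stages; afterwards the slower stage dominates.
    return s1 + s2 + (chunks - 1) * max(s1, s2)


# Hypothetical example: 8 units of data, both stages move 1 unit per second.
unsplit = two_stage_time(8, 1, 1, 1)   # the two stages run back to back
split = two_stage_time(8, 4, 1, 1)     # four pieces, stages overlapped
```

With equal stage rates, the total time approaches that of a single stage as the number of pieces grows. Smaller pieces also bound how long a newly arrived request has to wait behind an operation already in progress.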
3.3.2 Guidelines for GDM Design 
There are a few guidelines that have greatly influenced the design of GDM. One 
important guideline regards the choice between demand loading and anticipatory 
loading. In this chapter, we classify the methods of loading data to memory into two
categories, namely demand loading and anticipatory loading. Demand loading refers to 
the method in which data loading is triggered by data accesses. It is usually achieved by 
hardware-supported exception mechanisms. For example, in demand paging, the page fault
handler is triggered automatically by hardware when a page being accessed is not present
in the memory. The page fault handler loads the missing page from the disk and may 
prefetch a few more pages that it predicts to be accessed soon. 
Anticipatory loading refers to the method in which the working set of a task is loaded 
into memory before the task is scheduled to run. It is the mechanism used in current 
GPGPU computing systems. An application reserves device memory space and transfers 
the working set of a GPU kernel to the device memory before it launches the kernel. 
Demand loading is motivated by the high cost of loading data into memory. It pays the 
cost of handling page faults to load only the data that is demanded by the application and 
minimize the cost incurred by loading extra data. Anticipatory loading is more 
advantageous when the data sets handled by an application can be accurately determined 
before its execution. Though there are proposals to provide hardware support for demand 
paging on GPUs [76, 72], we argue that anticipatory loading will continue playing an 
important role in device memory management on future GPU devices. This is based on 
the following two observations. 
In contrast with the data sets handled by CPUs, the data sets handled by GPUs are 
usually more predictable before the kernel starts execution. For example, some data sets 
to be referenced can be inferred from the data transfer APIs (e.g. cuMemcpyHtoD in
CUDA) before launching a kernel; some are specified in the parameters of the GPU 
kernel. This makes anticipatory loading a viable approach in practice. 
Handling page faults on a GPU incurs much higher overhead than doing so on a CPU, 
because it stalls a faster processor for a longer time. A GPU kernel is usually executed by 
hundreds of thousands of threads on a GPU, with the running state of each thread 
maintained in large register files, shared memory, and hardware caches which are often 
virtually addressed [74]. Saving the state of a GPU kernel on page faults, flushing caches, 
and restoring kernel state to resume execution thus take much longer than on the
CPU. As the numbers of GPU cores and threads launched by GPU kernels keep 
increasing on future GPUs, the cost of handling page faults will also escalate 
significantly. Moreover, handling GPU page faults requires the involvement of system 
software running on the CPU (e.g., to carry out the corresponding memory management 
and re-scheduling operations). The extra delays and operations on the critical path of 
page fault handling further prolong GPU stall time. The large overhead associated with 
page fault handling on a GPU thus may not be justified. 
In GDM design, we use anticipatory loading for the data sets that are predicted to be 
accessed. To minimize the overhead incurred by handling page faults, demand loading is 
only used to handle unexpected data accesses if the GPU device supports page faults. 
The second guideline regards the granularities of device memory management to 
match the data-parallel nature of GPGPU computing. The granularities determine the 
units in which data in the staging areas and device memory space should be managed. In 
system memory management, a memory page of a few kilobytes is a commonly used
granularity by the OS. But this granularity is too small for GPGPU computing. The data-
parallel feature of GPGPU computing determines that the data sets handled in GPGPU 
programs are usually very large (even the register file sizes on GPUs are at least hundreds 
of kilobytes). Using small granularities increases the overhead of managing metadata. 
More importantly, data transfers to and from the device memory in small units cannot 
amortize the start-up latency of memory controller, incurring prohibitive costs. 
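The amortization argument can be made concrete with a simple latency-plus-bandwidth model, in which each transfer pays a fixed start-up cost before data flows at the peak rate. The latency and bandwidth values below are hypothetical placeholders chosen for illustration, not measurements of any particular system.

```python
def effective_bandwidth(size_bytes, startup_latency_s, peak_bw_bytes_per_s):
    """Achieved bandwidth when each transfer pays a fixed start-up latency."""
    transfer_time = startup_latency_s + size_bytes / peak_bw_bytes_per_s
    return size_bytes / transfer_time


# Hypothetical bus: 10 microseconds start-up latency, 8 GB/s peak rate.
LATENCY = 10e-6
PEAK = 8e9

for size in (4 * 1024, 64 * 1024, 1024 * 1024, 16 * 1024 * 1024):
    frac = effective_bandwidth(size, LATENCY, PEAK) / PEAK
    print(f"{size // 1024:>6} KiB: {frac:.1%} of peak")
```

Under these assumed parameters, a page-sized transfer achieves only a small fraction of the peak rate, whereas a transfer of several megabytes comes close to it.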
Memory regions have been used in several studies. A device memory region is 
allocated by the user program through the memory reservation API call, and can be as 
large as hundreds of megabytes or even gigabytes. Thus, data transfers in units of regions 
can cause high synchronization and data movement overhead. Moreover, managing data 
at the region level is incapable of capturing the distinct access patterns of user data 
structures created within a single region, which are important information for the 
management of device memory space. 
Ideal granularities are those that can balance the latency and throughput of data 
movement and can preserve program-level data structure information to minimize 
overhead. GDM manages device memory space with both block and object-level 
information. We will introduce these concepts and how GDM utilizes them for 
management in the next section. 
The third guideline regards the generality of GDM design. We realize that GPU 
hardware design is still evolving toward a mature, general-purpose computing device.
Thus, in our design, we do not exclude possible new features in future GPU hardware 
that may help with device memory management. At the same time, we try to keep the
GDM design as general as possible. We explore the techniques that can minimize its 
reliance on the uncertainties of future GPU hardware features. This also helps it be 
adopted, starting with current GPU hardware. 
3.4 GDM Design 
This section presents the details of GDM, focusing on the design tradeoffs and 
optimization techniques for minimizing the overhead of device memory management. 
3.4.1 Staging Areas 
GDM creates a staging area for each GPGPU application in its virtual memory space. 
Instead of using a large chunk of space with contiguous virtual addresses, a staging area
consists of a set of virtual memory areas. These areas are dynamically allocated when 
GDM handles the requests from applications for device memory reservations. Because 
these areas are located in the application's virtual memory space, physical memory is not 
allocated until they are populated. 
Staging areas first serve as a temporary storage for the data to be transferred onto 
device memory. With staging areas, data transfers to the device memory can be fulfilled 
asynchronously. Specifically, when an application calls the API function to transfer some 
data from a user source buffer to the device memory, this function returns after GDM 
marks the source buffer copy-on-write. The data is transferred to the device memory later 
from the source buffer if it has not been changed (arrow ⑤ in Figure 3.2). Otherwise, the
data is copied to the staging area when it is about to be changed in the source buffer, and 
is later transferred to the device memory from the staging area. In many GPGPU
applications, a source buffer is often not modified before the data transferred from it is 
used in a kernel. Thus, copy-on-write can effectively reduce the memory consumption of 
staging areas and the cost incurred by data copying. 
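The copy-on-write behavior can be mimicked at user level as follows. This is a minimal sketch and not how GDM is implemented: a real system would rely on page protection and write faults, whereas here the hypothetical method `on_source_write` stands in for the fault handler, and the class name `CowTransfer` is likewise an assumption.

```python
class CowTransfer:
    """Pending host-to-device transfer that defers copying the source buffer."""

    def __init__(self, source):
        self.source = source   # user buffer, conceptually write-protected
        self.staged = None     # snapshot in the staging area, if one was made
        self.copies = 0        # number of extra staging copies performed

    def on_source_write(self, index, value):
        # Stand-in for the write fault: snapshot the data before it changes.
        if self.staged is None:
            self.staged = bytes(self.source)
            self.copies += 1
        self.source[index] = value

    def flush_to_device(self):
        # Called when the kernel that consumes the data is about to be issued.
        # Returns the bytes that would be sent to device memory.
        if self.staged is not None:
            return self.staged
        return bytes(self.source)
```

If the source buffer is never written before the flush, no staging copy is made and the data goes to the device directly from the user buffer. If it is written, the device still receives the contents as they were at the time of the transfer call.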
Staging areas also serve as the swap space for the data that can no longer stay on the 
device memory due to space contentions. When an application needs more free device 
memory space to launch kernels, GDM evicts some data from device memory to the 
staging area and reclaims the space for its own data sets. The data swapped to the staging 
area may later be loaded back to the device memory when the kernel referencing the data 
is to be issued. 
Creating staging areas inside the virtual address spaces of applications provides a few 
benefits. The low-level management of staging areas, from space allocation/de-allocation 
to data swapping between system memory and disks when the system memory is under 
pressure, relies on the existing virtual memory manager in the operating system. This, on 
one hand, simplifies the design of GDM. On the other hand, it puts the system memory 
space occupied by staging areas under the unified management with other system 
components and applications. This helps the operating system balance system memory 
usage for the overall benefit of system performance. 
3.4.2 Device Memory Regions, Objects, Blocks 
A fundamental design decision to make is the granularity at which the device memory 
space should be managed. One natural choice is device memory region. A device 
memory region is allocated/de-allocated by the user program through the device memory 
reservation/release API calls. Applications may reserve different regions for different
data sets to be handled by GPU kernels. In these cases, data in the same memory region 
may show good access uniformity; managing data based on regions can thus be an 
efficient choice. Regions have been used as the units of device memory management in 
some existing studies [71, 69]. 
However, in some important GPGPU applications [68, 84], we do see cases in which a 
memory region includes multiple data sets with distinct access patterns (e.g. data 
structures with different read/write properties or being referenced by different kernels). 
Data sets can also be shared among different GPU contexts easily with a single IPC call, 
which makes the program structure much clearer to maintain. For these applications, 
managing data with regions fails to classify fine-grained data access characteristics and 
increases both space and data movement overhead. 
 
Figure 3.3: Device memory region and object. 
GDM identifies this demand and adds an object-based memory management layer 
below regions to differentiate data sets with different access patterns in each memory 
region. In GDM, an object is a data set handled by a data transfer operation. This is based 
on the observation that programs usually invoke separate data transfer API calls to pack 
multiple data sets into the same device memory region. The region area modified by each 
data transfer API call corresponds to an object in GDM. For efficiency, GDM merges
small objects with their neighboring objects in the same device memory region. As an 
example, based on the pseudo code snippet in Figure 3.3, GDM creates one region (i.e. 
region1) and three objects: one for dest_buf_1, one for dest_buf_2, and one for the
remaining part of the region.
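The derivation of objects from data transfer calls can be sketched as follows. The function below is a hypothetical illustration: it assumes non-overlapping transfers inside one region, and the merging rule (fold an object smaller than a threshold into its preceding neighbor) is an assumed policy, since the text states only that small objects are merged with neighboring objects.

```python
def split_region_into_objects(region_size, transfers, min_object_size=0):
    """Return sorted (start, end) object ranges covering [0, region_size).

    transfers: list of (offset, length) pairs, one per data transfer call,
    assumed here to be non-overlapping and inside the region.
    """
    objects = []
    cursor = 0
    for offset, length in sorted(transfers):
        if offset > cursor:
            objects.append((cursor, offset))        # untouched gap
        objects.append((offset, offset + length))   # area written by the call
        cursor = offset + length
    if cursor < region_size:
        objects.append((cursor, region_size))       # rest of the region

    # Assumed policy: merge objects below the threshold into the previous one.
    merged = []
    for start, end in objects:
        if merged and (end - start) < min_object_size:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged
```

For a region written by two transfer calls at its start, this yields three objects: one per transfer and one for the remaining part of the region, in line with the example above. The offsets and sizes used to exercise the function are arbitrary.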
Objects can still be very large and cumbersome to manage. Meanwhile, object sizes 
usually vary widely in GPGPU programs, which introduces unnecessary complexity and
overhead in memory management. For example, transferring large objects leads to high 
synchronization cost; evicting a whole object lowers the utilization of device memory if 
the required space is smaller than the object size. To address these problems, GDM 
further breaks objects into fixed-size blocks. Then, it allocates/reclaims device memory 
space and transfers data in units of blocks. The block size is selected to effectively 
amortize the start-up latency of data transfers. 
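The effect of fixed-size blocks on eviction can be shown with two small helper functions. Both are hypothetical sketches: the block layout (aligned to the start of the object, with a possibly shorter last block) and the victim order are assumptions made for illustration.

```python
def blocks_of(obj_start, obj_end, block_size):
    """Split an object [obj_start, obj_end) into fixed-size blocks."""
    return [(s, min(s + block_size, obj_end))
            for s in range(obj_start, obj_end, block_size)]


def blocks_to_evict(resident_blocks, bytes_needed):
    """Pick just enough blocks (in the given order) to free bytes_needed."""
    victims, freed = [], 0
    for start, end in resident_blocks:
        if freed >= bytes_needed:
            break
        victims.append((start, end))
        freed += end - start
    return victims
```

With block-level reclamation, only as many blocks as needed are evicted, so the rest of a large object can remain resident in device memory.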
As will be explained in the following subsections, this hierarchical layout of regions, 
objects, and blocks makes the management of device memory space especially efficient. 
3.4.3 Loading Data to Device Memory 
For the correct and efficient execution of a kernel, GDM must load the working set of 
the kernel onto device memory. Basically, two key questions must be addressed: which 
data sets should be loaded, and when should they be loaded. 
To address the first question, GDM uses different techniques for different types of 
GPU devices. If page faults are correctly supported on the GPU device, GDM monitors 
and analyzes the parameters used to launch a GPU kernel and the data transfer API calls 
made before the kernel launch. It extracts the objects involved in the parameters and API
calls. Usually these objects are the data sets to be handled by the kernel. For these 
objects, GDM transfers the data block by block into the device memory before the kernel 
is issued (i.e. anticipatory loading). Other data sets, if accessed, will be loaded on demand 
on page faults. 
If page faults are not supported, GDM by default loads the whole context to the device 
memory. To reduce data transferring, GDM provides interfaces for programs to specify 
objects needed by a kernel, with which advanced programmers can direct GDM to only 
load the specified objects. 
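The decision logic of the last two paragraphs can be summarized in a short sketch. The function name, its parameters, and the representation of objects as names in sets are hypothetical; in particular, `hints` stands in for the programmer-facing interface mentioned above, whose actual form is not specified here.

```python
def plan_loading(page_faults_supported, context_objects,
                 kernel_args=(), recent_transfers=(), hints=None):
    """Return (load_before_launch, load_on_fault) as sets of object names."""
    everything = set(context_objects)
    if page_faults_supported:
        # Anticipatory loading for predicted objects, demand loading otherwise.
        predicted = (set(kernel_args) | set(recent_transfers)) & everything
        return predicted, everything - predicted
    if hints is not None:
        # The program has specified which objects the kernel needs.
        return set(hints) & everything, set()
    # No page faults and no hints: load the whole context.
    return everything, set()
```

Without page-fault support nothing can be loaded on demand, so the second set is always empty in that case.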
For anticipatory loading, another key question is when to transfer the
data sets to device memory. If a data set is transferred to the device memory too early, it 
may be evicted prematurely before the kernel referencing it is issued. This incurs extra 
data transfers. If a data set is transferred to the device memory too late, the execution of 
the kernel will be delayed. 
In a busy system, where kernels queue up waiting to be issued, the system throughput 
depends on how quickly these kernels can be issued. Thus, the most efficient way is to 
transfer the data sets used by the kernels according to the order of the kernels in the 
queue. When GDM finishes transferring the data sets used by a kernel, it can start to 
transfer the data sets used by the next kernel in the queue. In this way, GDM can overlap 
data transfers with GPU computation to a great extent and minimize the time that kernels 
must wait for their working sets. 
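The benefit of transferring working sets in queue order can be sketched with a small timing model. This is an illustration under simplifying assumptions that are not stated in the text: one copy engine, one compute engine, enough device memory for every working set, and hypothetical function names.

```python
def pipelined_makespan(kernels):
    """kernels: list of (transfer_time, compute_time) in queue order.

    The copy engine loads working sets back to back in queue order, and a
    kernel starts once its data is resident and the previous kernel is done.
    """
    copy_done = 0
    gpu_done = 0
    for transfer_time, compute_time in kernels:
        copy_done += transfer_time
        start = max(copy_done, gpu_done)
        gpu_done = start + compute_time
    return gpu_done


def serial_makespan(kernels):
    """No overlap: each kernel waits for its own transfer, then runs."""
    return sum(transfer_time + compute_time
               for transfer_time, compute_time in kernels)


queue = [(2, 3), (2, 3), (2, 3)]
print(pipelined_makespan(queue), serial_makespan(queue))
```

In this model only the first transfer is exposed; later transfers are hidden behind the computation of earlier kernels.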
When all the launched kernels have been issued, if the device memory still has free
space and the GPU copy engine is idle, GDM will look through recent application requests
for data transfers to the device memory and pick out the unfulfilled ones. As explained in Section
4.1, with staging areas, GDM handles data transfer requests to the device memory
asynchronously. Thus, there may be some data sets in staging areas that have not been
transferred to the device memory even though the applications have requested the transfers (e.g. by
calling cuMemcpyHtoD). GDM takes the opportunity and loads these data sets onto
device memory because they are more likely to be accessed in kernels soon. To prevent 
performance loss, GDM stops loading the data when the device memory is filled. GDM 
also stops loading the data when a kernel is launched, so that the GPU copy engine can be 
quickly released to transfer the data sets of the newly launched kernel. 
3.4.4 Management of Device Memory Space 
GDM makes every effort to satisfy the device memory demand of the kernel to be 
issued. If the device memory is short of free space, GDM must evict some data of 
finished kernels and reclaim the space. Thus, a core issue with the management of device 
memory space is data replacement, i.e. the policy that determines which data sets should 
be evicted when the free device memory space is insufficient. 
A large number of replacement policies have been proposed in previous studies of 
system memory and buffer management. The goal of these policies is mainly to 
maximize hit ratios, i.e. reuses of data in the memory. Whenever a replacement
decision has to be made, these policies try to select the item that is least likely to be
reused in the future.
However, conventional replacement policies are usually designed for systems where 
small amounts of data (e.g. pages or blocks) are loaded on demand. Directly adopting
them will lead to sub-optimal performance in GPGPU systems, where usually a large
amount of data (e.g. enough to fill two thirds of device memory capacity) must be loaded
before the corresponding kernel can start execution. Therefore, in GDM, in addition to
maximizing hit ratios, the design of a replacement policy must achieve a second goal:
minimizing the time needed to free the space for loading the data sets of the incoming kernel.
The latency of readying the required space has a direct impact on application performance.
 
Figure 3.4: An LRU stack is structured for the LRU-COST replacement policy. 
GDM enhances the LRU replacement policy to maximize data reuses in the device
memory and to minimize the latency of data eviction. The replacement policy with the
enhancement is named LRU-COST⁵. LRU-COST uses a stack to manage the data sets
loaded into the device memory. When a kernel is issued, all the data sets it will operate
on are put on the top of the LRU stack, pushing existing data sets in the stack down towards
the bottom. As shown in Figure 3.4, LRU-COST partitions the stack into two sections.
The part on the top is named the LRU section, and the part at the bottom is named the COST
section. The size of the COST section ranges from 0 to selection_factor × len_stack, where
selection_factor is an adjustable parameter with a default value of 0.2 and len_stack is the
size of the whole stack. Data sets in the COST section are classified and sorted to
minimize the eviction cost, as we will explain later. The data sets with the lowest costs
are put at the bottom. When more free space is needed, LRU-COST selects the data sets
at the bottom of the COST section to evict. When the COST section is depleted, it is refilled with
the data sets in the LRU section that belong to the working sets of finished kernels. It
preferentially selects the data sets at the bottom of the LRU section until it reaches its
maximum size.
⁵ While other replacement policies can also be enhanced with a similar approach, we select LRU because it is
widely used and easy to implement.
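The two-section stack can be sketched as a toy model. The class and field names below are hypothetical, sizes are counted in data sets rather than bytes, and the rule that a refill moves at least one entry is an assumption added so that the example always makes progress; none of these details are specified by the text.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class DataSet:
    name: str
    dirty: bool = False    # evicting a dirty data set requires a write-back
    in_use: bool = False   # True while a kernel that uses it has not finished


class LruCostStack:
    """Toy model of the two-section stack; sizes are counted in data sets."""

    def __init__(self, selection_factor=0.2):
        self.selection_factor = selection_factor
        self.lru = []    # LRU section: index 0 is its bottom, the end is the top
        self.cost = []   # COST section: index 0 is the bottom (evicted first)

    def touch(self, data_set):
        """Called when a kernel using data_set is issued: move it to the top."""
        for section in (self.lru, self.cost):
            if data_set in section:
                section.remove(data_set)
        self.lru.append(data_set)

    def _refill(self):
        len_stack = len(self.lru) + len(self.cost)
        # Assumption: allow at least one entry so that eviction can progress.
        limit = max(1, int(self.selection_factor * len_stack))
        movable = [d for d in self.lru if not d.in_use][:limit]
        for d in movable:
            self.lru.remove(d)
        # Stable sort: clean (cheap) data sets end up at the bottom.
        self.cost = sorted(movable, key=lambda d: d.dirty)

    def evict(self):
        """Return the next victim, or None if nothing can be evicted."""
        if not self.cost:
            self._refill()
        return self.cost.pop(0) if self.cost else None
```

With ten idle data sets touched in order and only the oldest one dirty, the first refill moves the two oldest entries into the COST section and the clean one is evicted before the dirty one, even though the dirty one is older.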
The cost of evicting a data set is determined by its status. Evicting a clean data set 
incurs lower cost than evicting a dirty one. A data set is clean if it has not been changed 
since it was loaded into the device memory. Otherwise, the data set is dirty. The cost of
evicting a clean data set is minimal, because there is a copy of the data set in the virtual 
memory space of the application, either in the staging area or in the corresponding source 
buffer. To evict a dirty data set, some cost has to be paid to transfer the data back to the 
staging area to preserve the changes. 
In real-world applications, usually a considerable portion of GPU data is read-only 
during kernel executions. For example, based on our analysis of 16 benchmarks in the 
Rodinia benchmark suite [60], on average, 61% of all the GPU data referenced during
kernel executions are not modified. Thus, there is a good potential to improve 
performance by preferentially evicting clean data sets. 
In the future, the clean/dirty status of data sets may be traced by hardware 
automatically. However, as mentioned earlier, existing GPU hardware does not provide 
such support. Thus, after a kernel finishes execution, GDM has no immediate information 
to determine whether a data set has been modified. If GDM cannot find a way to 
differentiate clean data from dirty data, it must write back the data sets being evicted 
indiscriminately to the corresponding staging areas. This may significantly increase 
wasteful system bus traffic and cause delay of kernel execution. 
To address this problem, GDM computes a signature for each data block in the COST 
section, which is a set of MD5 hash values of the data in the block (please refer to 
Section 3.5.2 for details). When the COST section is refilled, GDM immediately issues a 
maintenance kernel to compute the new signatures for the data blocks in the section from 
bottom to top. If the new signature of a block is different from its previous signature or if 
its previous signature does not exist, the block is marked dirty. For a 4MB block, its 
signature can be kept with only 4KB and is transferred along with the block. Computing 
the signature of a data block on a GPU can be made over an order of magnitude faster 
than transferring the data back to the staging area. 
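The dirty-marking rule described above can be sketched on the host side as follows. This is a simplified illustration: it uses a single MD5 digest per block as a stand-in for the set of MD5 values that GDM computes in a GPU kernel, and the function names are hypothetical.

```python
import hashlib


def md5_signature(data: bytes) -> bytes:
    # Stand-in signature: one MD5 over the whole block. GDM instead keeps a
    # set of MD5 values per block and computes them on the GPU.
    return hashlib.md5(data).digest()


def blocks_to_write_back(blocks, previous_signatures, signature=md5_signature):
    """Return the ids of blocks that must be copied back before eviction.

    blocks: dict mapping block id -> current block content (bytes)
    previous_signatures: dict mapping block id -> earlier signature; a block
        without an entry has no previous signature.
    """
    dirty = []
    for block_id, data in blocks.items():
        old = previous_signatures.get(block_id)
        # Dirty if there is no previous signature or the signature changed.
        if old is None or old != signature(data):
            dirty.append(block_id)
    return dirty
```

Blocks that are not returned can be dropped without any transfer, because an identical copy already exists on the host side.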
Although computing signatures can significantly reduce the data traffic on system bus, 
it increases the workload of GPU processors. When there are other application kernels 
that have been issued on the device, the maintenance kernel and the application kernels 
may compete for GPU processors. This reduces system throughput (because the
execution of application kernels is delayed) and/or increases the latency incurred by data
eviction (because computing signatures is delayed). To address the above problem, GDM 
uses three heuristics to reduce the amount of computation. 
Uniformity Heuristic: In the same object, if the first data block, the data block in the 
middle of the object, and the last data block are dirty blocks, GDM assumes that other 
blocks are also dirty blocks. This is based on the observation that kernels usually carry 
out similar operations on the data in the same object, because of the data-parallel nature 
of GPGPU computing. This is to reduce signature computation for write-mostly data. 
Overwrite Elimination: A data block in the device memory is invalidated and its space 
can be reclaimed when the application overwrites its content via a data transfer API call. 
This usually takes place when an application performs computation on a data set larger 
than device memory capacity. In a loop, the application repeatedly updates its data in the 
device memory and launches a kernel to process each partition of the data set. Thus, if 
GDM expects a block may be invalidated based on its overwriting history, it gives a low 
priority to computing its signature by putting it on the top of the COST section (see 
Figure 3.4). This heuristic can be used to reduce signature computation for both read-
mostly and write-mostly data. 
Double-Transfer Avoidance: A data block becomes a clean block if the application 
calls an API to copy its content out from the device memory. Thus, if GDM expects that 
a data block may be transferred to system memory upon application requests, it gives a 
low priority to computing its signature by putting it far away from the bottom of the
COST section (refer to Figure 3.4). When the content has been transferred and the block 
is still in the COST section, GDM moves the block to the bottom of the stack. 
To apply the last two heuristics, GDM keeps track of the blocks that changed status to
“clean” or were overwritten, and uses the information as hints to predict whether the
behaviors will repeat.
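As an illustration of the Uniformity Heuristic, the sketch below probes the first, middle, and last blocks of an object and skips the remaining signature checks when all three are dirty. The function name and the callback block_is_dirty, which stands for the expensive signature comparison of one block, are hypothetical.

```python
def classify_object_blocks(num_blocks, block_is_dirty):
    """Return a list of per-block dirty flags for one object.

    block_is_dirty(i) stands for the expensive signature check of block i.
    """
    if num_blocks == 0:
        return []
    probes = sorted({0, num_blocks // 2, num_blocks - 1})
    probed = {i: block_is_dirty(i) for i in probes}
    if all(probed.values()):
        # Uniformity heuristic: assume the remaining blocks are dirty as well.
        return [True] * num_blocks
    # Otherwise fall back to checking every block that was not probed.
    return [probed[i] if i in probed else block_is_dirty(i)
            for i in range(num_blocks)]
```

For an object of eight fully rewritten blocks, only three signature checks are performed instead of eight; the price is that a clean block between dirty probes would be written back unnecessarily.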
3.5 Implementation 
We have implemented a prototype of GDM in the GPGPU driver, Gdev [71], on 
Linux. We choose Gdev because it is open-source and has been shown in previous 
research [71] to perform comparably with the proprietary commercial CUDA system. 
Our prototype system targets discrete GPU cards, which are usually connected to the host 
CPU system through PCIe bus. In this section, we highlight some of the implementation 
details that deserve articulation. 
3.5.1 Regions, Blocks, and Objects 
When an application reserves a device memory region, GDM allocates two virtual 
memory areas for it: one from the CPU program's virtual memory space, which is used as 
the region's staging area; the other from the GPU context's virtual device address space, 
which is used by the GPU kernels to access data in the region. The starting address of the 
virtual device memory area is returned to the application as the identifier of the data 
region allocated. We set block size to 4MB in our prototype system, which is small 
enough compared with common object sizes in GPGPU programs and meanwhile 
preserves over 98% of PCIe efficiency. Nouveau [18], the GPU device driver Gdev relies
on, currently does not allow users to allocate/de-allocate virtual and physical device
memory areas separately, nor does it support dynamic mapping/unmapping between
them. We have thus modified the source code of Nouveau (fewer than 400 lines of
changes) to expose such functionalities to GDM.
Objects are maintained by GDM to infer user data structures and improve data 
replacement efficiency. Initially, every data region is a single object. A host-to-device 
data transfer, if larger than block size, can split an object into two or three smaller objects 
and/or merge several objects into a larger one. In our implementation, objects are aligned 
at the block boundary. 
3.5.2 Signature Computing 
By definition, computing the MD5 hash value of a given data block is inherently 
sequential. A block has to be logically broken into 64-byte chunks. For each chunk, a 16-
byte hash value is computed based on the data in this chunk and the hash value computed 
from the previous chunk. The hash value computed for the last chunk is used as the MD5 
hash value of the whole data block. To enable efficient parallelization on GPUs, GDM 
computes an array of MD5 hash values, instead of a single one, for a data block, and uses 
them together as the signature of the block. 
This is illustrated in Figure 3.5. The data in a block are equally partitioned among GPU
threads; each thread computes an MD5 hash value for the data assigned to it. Consecutive
4-byte words are allocated to different threads, so that device memory accesses can be
coalesced for maximal kernel efficiency. Figure 3.5 shows how a data block is partitioned
and the MD5 computed by one thread in an n-thread kernel. The signature of the data
block comprises the MD5 values computed by all GPU threads. The optimal number of
threads for computing the signature of a data block is determined by the block size and
GPU hardware parameters. In our implementation, we set the number of threads per 4MB
data block to 256, which achieves very good performance. To maximize GPU core
utilization, GDM computes the signatures of multiple blocks in one kernel, which further
improves performance.
 
Figure 3.5: The computation of data signatures. 
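The partitioning scheme can be mimicked on the CPU to make the data layout concrete. The sketch below is only a host-side model: each GPU thread is simulated by one loop iteration, Python's hashlib stands in for the GPU MD5 code, and the per-thread padding and finalization follow standard MD5, which is an assumption since the text does not specify those details. The function name block_signature is hypothetical.

```python
import hashlib

WORD_BYTES = 4     # consecutive 4-byte words go to different threads
DIGEST_BYTES = 16  # size of one MD5 hash value


def block_signature(block: bytes, num_threads: int = 256) -> bytes:
    """Concatenate one MD5 digest per simulated thread.

    Thread t hashes words t, t + num_threads, t + 2 * num_threads, ...
    """
    words = [block[i:i + WORD_BYTES] for i in range(0, len(block), WORD_BYTES)]
    digests = []
    for t in range(num_threads):
        thread_data = b"".join(words[t::num_threads])
        digests.append(hashlib.md5(thread_data).digest())
    return b"".join(digests)


block = bytes(4 * 1024 * 1024)       # a 4MB block of zeros
signature = block_signature(block)
print(len(signature))                # 256 threads x 16 bytes per digest
```

With 256 threads and 16-byte digests, the signature of a 4MB block occupies 256 × 16 = 4096 bytes, which matches the 4KB figure given in Section 3.4.4. A change to any single word alters only the digest of the thread that owns that word.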
3.6 Evaluation 
This section evaluates GDM under various workloads. Before presenting the results, 
we first introduce the setup and methodology of our experiments. 
3.6.1 Experiment Setup and Methodology 
The experiments were carried out on a machine equipped with a 2.80GHz Intel Core
i7-860 CPU, 8GB system memory, and an NVIDIA GTX 480 GPU card⁶. The operating
system is Red Hat Enterprise Linux 6 running the 3.3 kernel. The GPU device driver is
Nouveau patched with our code to enable separate allocation of virtual and physical
device memory areas and dynamic mapping between them. The GPGPU drivers are Gdev
(i.e. the stock Gdev) and GDM (i.e. Gdev with GDM enhancement). The CUDA
compiler is NVIDIA nvcc 4.0. Excluding the space reserved by Gdev and Nouveau, there
is about 1400MB of device memory space available to user applications.
⁶ We choose a GPU with the NVIDIA GF100 architecture because Gdev supports it most reliably. Other GPUs
(e.g. Kepler) are fundamentally the same with respect to memory management.
Name        Size (MB)     Name        Size (MB)
backprop    668           mpe         658
hj          2114          nn          1138
hotspot     1316          srad        619
kmeans      120           theano      4253
Table 3.1: Total size of device memory space allocated in each benchmark.
The benchmarks used in our experiments, as listed below, represent a variety of typical 
GPGPU applications and systems, including scientific computing, databases, machine 
learning, and image processing. Among them, four benchmarks, backprop, hotspot, nn, 
and srad, were selected from the Rodinia benchmark suite. Other benchmarks were 
extracted from real-world applications or existing open-source projects.
• backprop is an implementation of the backpropagation machine learning
algorithm.
• hj performs a hash join operation over two database table columns. The data types
of the two columns are both four-byte integers.
• hotspot solves a differential function extracted from a popular thermal modeling
tool called HotSpot.
• kmeans implements the k-means clustering algorithm that partitions an array of
multidimensional data elements into several clusters [19].
• mpe evaluates the value of expression A × B + C × D, in which A, B, C, and D are
four large matrices.
• nn computes the k-nearest neighbors of a given target within a large cloud of data
points.
• srad is a diffusion method used mainly in ultrasonic and radar imaging
applications to remove speckles.
• theano reproduces a real-world device memory leak bug in the Theano scientific
Python library (the first commit in [11]). This bug is caused by incorrect increment
of the reference counting value for a device memory region.
These benchmarks are highly optimized to maximize the utilization of GPU cores and 
have different demands for device memory space. The total size of device memory 
regions allocated in each benchmark is listed in Table 3.1. Figure 3.6 shows the 
percentages of execution time spent on DMA transfers, GPU kernels, and other 
operations (e.g., CPU computations and disk accesses) when each benchmark executes 
alone with stock Gdev. theano is DMA-bound and is not drawn because it crashes due to 
device memory leak and cannot finish execution with stock Gdev. As shown in Figure 
3.6, all the benchmarks are GPGPU-intensive. At the same time, they incur different
amounts of data movement between the host and GPU device. Thus, the benchmarks can
stress the design of GDM and test its components to minimize data movement overhead. 
 
Figure 3.6: Decompositions of benchmark execution times.
We evaluated GDM under two types of workloads: solo runs and combo runs. In a solo 
run, a benchmark executes alone. In a combo run, two benchmarks co-run with each 
other, and the benchmarks run multiple times to ensure the full overlap of their 
executions. Since our implementation of GDM is built into Gdev, we use the 
performance with the stock Gdev as baseline. The metrics we use to compare the 
performance are execution time for solo runs and weighted speedup for combo runs. The
weighted speedup of a combo run is the sum of the speedups of the participating
benchmarks over their solo executions, i.e., $\sum_{i=1}^{n} (\mathit{solo\_time}_i / \mathit{combo\_time}_i)$ [83]. For
brevity, we call the weighted speedup of a combo run its throughput.
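The weighted speedup metric can be computed directly from its definition. The function name and the sample timings below are illustrative only and are not measurements from the experiments.

```python
def weighted_speedup(solo_times, combo_times):
    """Sum over benchmarks of solo_time_i / combo_time_i."""
    if len(solo_times) != len(combo_times):
        raise ValueError("need one solo and one combo time per benchmark")
    return sum(solo / combo for solo, combo in zip(solo_times, combo_times))


# Two benchmarks that each take twice as long when co-running:
# each contributes 0.5, so the combo run's throughput is 1.0.
print(weighted_speedup([10.0, 20.0], [20.0, 40.0]))
```

A value of 1.0 means the combo run is no better than running the benchmarks one after another; values closer to the number of benchmarks indicate that co-running causes little slowdown.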
3.6.2 Tolerating Device Memory Leaks 
Leveraging host memory and, subsequently, disks as swap space, GDM identifies and 
removes inactive blocks from the device memory. This greatly postpones device memory 
exhaustion and increases the capability of the system to tolerate device memory leaks. 
We use theano as a real-world example to demonstrate this advantage. theano is a 
Python script written using the Theano library. It invokes an in-place add operation
100 times in a loop. In each iteration, the underlying Theano runtime creates an array in
the device memory and launches a GPU kernel to update the elements in the array with
an incremental value. However, due to a bug in Theano, the array is not released
when the iteration finishes, so each iteration leaks about 40MB of device memory.
Figure 3.7(a) shows the per-iteration execution times of theano. Figure 3.7(b) 
illustrates the amount of allocable device memory (in logarithmic scale) after each 
iteration. Without GDM support, the leak quickly drains the device memory and causes
theano to crash before one third of the iterations can be finished. Gdev cannot deal with 
device memory leaks in this case because data swapping is not fully supported. 
 
Figure 3.7: The impact of device memory leaks with and without GDM management. 
In the experiment, we use the stock Gdev as a representative of existing GPGPU
systems without fully functional device memory management. However, the problem
demonstrated with it is not limited to Gdev, but exists in similar systems too. For
example, we have also run theano with the commercial NVIDIA driver; it crashes after
36 iterations.
With GDM support, though the device memory may be filled, the space occupied by 
leaked regions can be reclaimed and is considered to be allocable. This allows theano to 
continue launching kernels and making progress continuously. Thus, theano is able to 
complete execution correctly. Since the leaked regions are modified during kernel 
executions, GDM incurs a constant overhead on data swapping after iteration 33. But, 
before GDM starts to swap out leaked regions, it does not slow down theano, as shown in 
Figure 3.7(a). 
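A back-of-the-envelope calculation shows why the swapping overhead sets in at around one third of the run. The sketch assumes, for illustration only, that the total allocation reported for theano in Table 3.1 is spread evenly over the 100 iterations and is entirely leaked; the variable names are hypothetical.

```python
TOTAL_ALLOCATED_MB = 4253   # theano entry in Table 3.1
ITERATIONS = 100            # number of in-place add operations
AVAILABLE_MB = 1400         # device memory available to user applications

# Assumption: every iteration leaks an equal share of the total allocation.
leak_per_iteration_mb = TOTAL_ALLOCATED_MB / ITERATIONS
iterations_until_full = AVAILABLE_MB / leak_per_iteration_mb

print(f"about {leak_per_iteration_mb:.1f} MB leaked per iteration")
print(f"device memory fills after about {iterations_until_full:.1f} iterations")
```

Under these assumptions the device memory fills after a little under 33 iterations, which is roughly consistent with the crash of the unmanaged runs before one third of the iterations and with GDM starting to swap after iteration 33.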
3.6.3 Multitasking Performance 
In this subsection, we study the performance of GDM under multitasking workloads. 
We select all the possible combo runs consisting of two of the benchmarks excluding 
theano. We classify them into two groups based on their device memory demands. The 
first group has 14 combo runs with high demands. In each combo run, the total demand 
exceeds the device memory capacity, and the benchmarks contend for device memory 
space during the execution. The second group consists of the remaining 7 combo runs. They
have low demands for device memory space, and the benchmarks can share the device 
memory without contention. Before starting the combo-run experiments, we first 
measured the solo execution times of each benchmark with the stock Gdev and GDM 
respectively. Due to management activities, GDM performs slightly slower than the stock 
Gdev, but the difference is less than 2% on average. 
Figure 3.8(a) shows the throughput of the combo runs in the first group with Gdev and 
GDM. With GDM, all the combo runs can finish their executions correctly. However, 
with Gdev, only 3 combo runs can finish without failures. Eleven combo runs suffer
program crashes, which happen to either one or both participating benchmarks. We have
also performed the same experiment with the commercial NVIDIA driver; all combo runs in
the first group failed due to program crashes.
 
Figure 3.8: Performance of multitasking workloads with and without GDM management. 
Most combo runs fail without GDM support because state-of-the-art approaches either
do not support memory management policies at all (e.g. CUDA) or use only primitive ones (e.g.
Gdev). For example, Gdev implements a simple data swapping mechanism based on its
shared device memory support. With Gdev, when a region A in application P is to be
loaded into a device memory that is short of free space, only a single region whose size is
larger than A can be selected under the strict conditions that (1) it is not in application P,
and (2) it has never replaced or been replaced by any regions other than A in P. If no
region can be found to meet these constraints, the program may be blocked or crash
due to insufficient device memory space for it to launch kernels. Unlike Gdev, GDM
provides fully functional device memory management that allows the device memory 
space to be flexibly shared by any regions. This guarantees the successful executions of 
GPGPU applications on multitasking systems. 
Figure 3.8(a) also shows that GDM can handle device memory contentions more 
efficiently than the stock Gdev. With Gdev, even though a few combo runs successfully 
finish their executions, they suffer substantial performance losses. For example, for the 
combo run of backprop and nn, the throughput achieved with the stock Gdev is only 70% 
of that with GDM. Due to the lack of necessary mechanisms and policies to reduce data 
movement, Gdev cannot support data swapping with low overhead. The optimization 
techniques in GDM can effectively minimize the overhead. Thus, GDM can improve the 
performance of these workloads by 20% on average (up to 43%). 
Figure 3.8(b) compares the performance of GDM and the stock Gdev under the 
workloads in the second group. For all the workloads except the co-running of backprop 
and kmeans, the performance difference between GDM and Gdev is barely observable. 
When backprop co-runs with kmeans, the throughput with GDM is slightly lower (by 
4%) than that with Gdev. This shows the low overhead of GDM for multitasking 
workloads without device memory contentions.
3.6.4 Validation of Design Optimizations 
Throughout the design of GDM, optimization techniques are adopted to minimize data 
transfers and the associated cost. In this subsection, we validate the effectiveness of these 
optimization techniques through experiments. We compare the performance of the full-
fledged GDM with a simplified version of GDM, named GDM-Base, which only 
provides basic management over device memory to guarantee the correct executions of 
the combo-run workloads. 
 
Figure 3.9: Effectiveness of GDM optimizations.
Specifically, GDM-Base handles host-to-device data transfers eagerly. It carries out
data transfers immediately upon the application's requests (e.g. cuMemcpyHtoD). When the
device memory is short of free space, the traditional LRU algorithm is used to select a victim
data set to replace. The victim data set is transferred back to the corresponding staging
area. When a kernel is to be launched, GDM-Base examines the data sets in the context,
and loads the data sets that are not resident in the device memory.
Since the overhead of device memory management is mainly incurred when device 
memory space is under pressure, we select the combo runs in the first group and compare 
the performance of GDM and GDM-Base under these workloads. As shown in Figure 
3.9(a), with optimizations, GDM is able to consistently improve the throughput of the 
workloads by 21% on average and up to 46% (relative to GDM-Base). 
To further understand how the optimization techniques improve performance, we have 
collected the work efficiency of GPU, which is defined as the percentage of total GPU 
time spent on kernel executions and effective data movement. A data movement is
effective if it would also be carried out when the benchmark runs alone. For example, in a combo-run
workload of program A and program B, x repetitions of program A fully overlap with y 
repetitions of program B. During the co-running of the programs, the total time spent by 
the GPU to execute kernels and move data is c. If the time used for kernel execution and 
data movement is a for program A and b for program B when each of program A and B 
executes alone, the GPU work efficiency for this combo run is (ax+by)/c. Work 
efficiency reflects the amount of overhead incurred by device memory management, with 
high efficiency indicating low overhead. Though the overhead is mainly from extra data
transfers incurred by device memory management, kernel execution time is included in 
work efficiency measurement because we want to correlate overhead reduction with 
throughput increase. 
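The work efficiency formula (ax+by)/c generalizes to any number of co-running programs. The sketch below follows that formula; the function name and the sample numbers are illustrative and are not taken from the experiments.

```python
def gpu_work_efficiency(solo_gpu_times, repetitions, combo_gpu_time):
    """(a*x + b*y + ...) / c for a combo run.

    solo_gpu_times: GPU time (kernels plus data movement) of one solo
        execution of each program, i.e. a and b in the text.
    repetitions: how many times each program ran during the combo run,
        i.e. x and y in the text.
    combo_gpu_time: total GPU time spent during the combo run, i.e. c.
    """
    useful = sum(t * r for t, r in zip(solo_gpu_times, repetitions))
    return useful / combo_gpu_time


# a = 2.0, x = 3, b = 4.0, y = 1, c = 12.5  ->  (6 + 4) / 12.5
print(gpu_work_efficiency([2.0, 4.0], [3, 1], 12.5))
```

In this made-up example, one fifth of the GPU time during the combo run is spent on data movement that would not occur in solo runs.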
Figure 3.9(b) shows that the optimization techniques improve GPU work efficiency by 
14% on average and up to 37%. This explains the throughput improvement observed in 
Figure 3.9(a). Meanwhile, it also explains the varying degrees of performance 
enhancement for different combo runs. For example, the throughput of the co-running of 
backprop and hj only rises by 4%. This is because the GPU has already been working at 
almost full efficiency before and after optimizations are applied (97% vs. 98%), as shown 
by the first two bars in Figure 3.9(b). 
For most workloads, with the optimization techniques in GDM, the GPU work 
efficiency is close to 100%. This shows that the optimization techniques are effective
in controlling the overhead. However, we notice that there are a few workloads with
GPU work efficiency below 90%. This indicates that there is still potential to further 
improve the performance of GDM in future work. 
3.6.5 Defending against DoS Attacks 
GDM makes the system capable of thwarting denial-of-service (DoS) attacks that 
deplete the device memory space available to GPGPU applications. To demonstrate this 
capability of GDM, we have designed a malicious program that reserves a device 
memory region with the same size as the usable device memory capacity. The program 
repeatedly issues a GPU kernel that updates the data content in the reserved region so as 
to cause the largest performance degradation to GDM management.
We co-run each of the benchmarks (except theano) with the malicious program, and
measure its execution time⁷. With GDM, all benchmarks successfully finish executions in
spite of the presence of the malicious program. The performance of the benchmarks is
lowered by the malicious program compared to their solo executions, but the slowdown is kept at an
acceptable level (69% on average). The highest slowdown happens with srad (284%)
because it launches kernels frequently and each kernel accesses a moderately large
working set, causing more data evictions than other benchmarks. The lowest slowdown is
observed with kmeans (9%) because it has the least demand on device memory space.
⁷ On existing systems, a malicious program can also attack the system by issuing a non-terminating kernel
(e.g. an infinite loop) or a large number of kernels. Thus, a thorough solution requires enhancements on
GPU kernel scheduling, which is beyond the scope of this chapter. This chapter only focuses on the attacks
through device memory space, and lets the system schedule GPU kernels in a round-robin manner in the
experiment.
3.7 Related Work
We are not the first to realize the problems caused by having GPU programmers
directly and explicitly manage the device memory. Gdev [71] provides a data sharing
mechanism for inter-process communication (IPC) and shows that this mechanism can be
used to support device memory swapping. However, because it is based on an IPC
mechanism and lacks generality, this proof-of-concept workaround barely works in
practice and suffers serious performance issues, as shown in our
experiments. RSVM [69] provides an application-level device memory manager in a
library. It relieves programmers from explicitly managing device memory, but programs
must call the functions it provides to gain the benefits. Meanwhile, it suffers from the
problems inherent in application-level management. For example, it cannot address the
contention between applications and does not allow an application to use other libraries
that call CUDA APIs to allocate device memory or transfer data. Compared with these
studies, GDM identifies the critical issues of device memory management at the system level
and provides a general and non-intrusive solution.
System management of GPGPU resources other than device memory has received 
attention in several recent studies. Pegasus [66] is a computation scheduling facility for
virtualized, accelerator-based multiprocessor systems. It makes the GPU a schedulable entity
in the hypervisor and supports both high-throughput and low-latency scheduling among
multiple guest OSes. TimeGraph [70] is a GPU command scheduler to support fair
sharing of GPU computing resources for real-time, multitasking GPU applications. PTask
[81] provides an OS abstraction for GPU computing resource and data transfer
management. It presents a dataflow programming model that exposes information for the OS
kernel to provide performance isolation and to coordinate data movement between
collaborative processes. GPUfs [82] proposes file system support for GPGPUs to allow a 
GPU program to access host files directly. 
Some research projects in architecture and compiler areas improve the usability of 
GPUs as mainstream computing devices. iGPU [76] is a GPGPU architecture to support 
exceptions and speculative executions with compiler support. ADSM [64] is a data-
centric programming model for heterogeneous computing that maintains an asymmetric 
shared memory space to achieve low cost. CGCM [67] is an automatic management and 
optimization system to reduce programmer's efforts for CPU-GPU data transfer. There 
are plans to provide unified and shared virtual spaces for CPU and GPU to access [20].
These projects do not provide, or have not yet provided, a solution to manage the physical space in the
device memory. Instead, they pose a higher demand on the operating system to manage the
device memory space, which is targeted by the research in this chapter.
3.8 Conclusions and Future Work 
This chapter identifies a crucial problem with existing GPGPU system software design. 
Namely, the lack of sophisticated device memory management causes application 
crashes, hangs, and inefficient utilization of GPGPU resources. This problem can 
seriously hinder the adoption of GPGPUs as mainstream computing devices in general-
purpose systems. 
The chapter presents GDM, a fully functional device memory manager, to effectively 
address the problem. The design fully considers the unique features of GPGPU 
computing and GPGPU devices from the perspectives of both challenging problems and 
optimization opportunities. GDM manages device memory with both block and object-
level information, and employs various optimization techniques to ensure system 
performance. Experiments verify the capabilities of GDM to tolerate device memory 
leaks, prevent program crashes, defend against malicious programs, and achieve high 
performance. 
As future work, we plan to improve the management of GPU device memory 
in two directions. First, it is possible to further reduce the overhead by leveraging 
the information from compilers or applications. Through static analysis of kernel source 
code, the compiler can infer some information that may otherwise need to be obtained 
with extra cost. In many applications such as databases and in-memory big data engines, 
similar information can also be easily inferred from application-level semantics. Second, 
we also plan to investigate the collaboration between device memory manager and GPU 
kernel scheduler for more optimization opportunities. For example, when there is not 
enough free space in the device memory for the execution of a selected kernel, the system 
should balance the benefit of launching the kernel and the potential overhead. It may also 
decide whether to schedule another kernel with a smaller working set, or to wait for an 
issued kernel to finish execution and free up the space. 
 
 
Chapter 4 Concurrent Analytical Query Processing with GPUs 
 
 
4.1 Introduction 
Multitasking has been a proven practice in computer systems to achieve high resource 
utilization and system throughput. However, despite the wide adoption of GPUs for 
analytical query processing, they are still mainly used as dedicated co-processors, unable 
to support efficient executions of multiple queries concurrently. 
Due to the heterogeneous, data-driven characteristics of GPU operations, a single 
query can hardly consume all GPU resources. Dedicated query processing thus often 
leads to resource underutilization, which limits the overall performance of the database 
system. In market-critical applications such as high-performance data warehousing and 
multi-client dataflow analysis, a large number of users may demand query results 
simultaneously. As the volume of data to be processed keeps increasing, it is also 
essential for user queries to make continuous progress so that new results can be 
generated constantly to satisfy the goal of interactive analysis. The lack of concurrent 
querying capability restricts the adoption of GPU databases in these application fields. 
While dedicated usage of GPUs is still needed for latency-critical queries to ensure 
performance isolation, databases must be improved to support concurrent multi-query 
execution as an option to maximize the throughput of non-latency-sensitive queries on 
the GPU device. This consolidated usage of GPU resources enhances system efficiency 
and functionalities, but it makes the design of the query execution engine more challenging. 
To achieve the highest performance, each user query tends to reserve a large amount of 
GPU resources. Unlike CPUs where the operating system supports fine-grained context 
switches and virtual memory abstractions for resource sharing, current GPU hardware 
and system software provide none of these interfaces for database resource management. 
For example, GPU tasks cannot be preempted once started; on-demand data loading is 
not supported during task execution; automatic data swapping service is also missing 
when the device memory undergoes pressure. As a result, without efficient coordination 
by the database, multiple GPU queries attempting to execute simultaneously can easily 
cause low resource usage, system thrashing, or even query abortions, which significantly 
degrades, instead of improves, overall system performance. 
In this chapter we present a resource management facility called MultiQx-GPU (Multi-
Query eXecution on GPU) to address the above challenges and support efficient 
executions of concurrent queries in GPU databases. It ensures high resource utilization 
and system performance through two key components: a query scheduler that maintains 
optimal concurrency level and workload on the GPUs, and a data swapping mechanism to 
maximize the effective utilization of GPU device memory. This chapter also presents a 
prototype implementation of MultiQx-GPU in an open-source GPU query engine and 
discusses several technical issues addressed by our system to ensure its efficiency in 
practice. Through intensive experiments with a wide range of workloads, we demonstrate 
the effectiveness and performance advantage of our solution. By supporting concurrent 
query processing, MultiQx-GPU improves system throughput by up to 55% relative to 
the system without such support. 
This chapter makes the following main contributions. First, we have made a strong 
case for building an effective resource sharing facility as a part of a database to manage 
concurrent query executions with GPUs. Second, we have shown the effectiveness of our 
design and implementation of the software facility with intensive experiments. Finally, 
the software framework presented in this chapter is open-source and can also be 
enhanced to support GPU resource sharing activities in other data processing 
applications, raising the productivity and system utilization. 
The rest of the chapter is organized as follows. Section 4.2 introduces the background 
and motivation of the research. Section 4.3 outlines the overall structure of MultiQx-GPU. 
Sections 4.4 and 4.5 describe the device memory swapping and query scheduling 
components of MultiQx-GPU, respectively. After a summary of the implementation 
issues in Section 4.6, Section 4.7 evaluates the prototype system, Section 4.8 introduces 
related work, and Section 4.9 concludes the chapter. 
4.2 Background and Motivation 
This section provides background on GPU query processing and motivates this 
research by demonstrating the problems of lacking multi-query support. Based on 
extensive benchmarks over some existing GPU query engines, we show the low resource 
utilization induced by dedicated query processing and identify several system issues that 
must be addressed in order to co-run GPU queries efficiently. 
4.2.1 Analytical Query Processing with GPUs 
With vectorized cores and high-bandwidth device memory, GPUs have been widely 
utilized in databases for analytical query processing [112, 93, 98]. In this subsection we 
describe the architecture of one such system, called YDB [22], as an example to briefly 
introduce the state of the art. 
YDB is a standalone GPU execution engine for warehouse-style queries. Its front end 
consists of a query parser and optimizer, whose designs are based on the YSmart query 
translation framework [97]. It translates an SQL query into an optimized query plan tree, 
which is then used by the query generator to generate a driver program. This driver 
program controls the query execution flow; it is compiled and linked with the GPU 
operator library to produce an executable query binary. During execution, the query 
binary reads table data from a column-format backend storage and invokes the corresponding 
GPU operators to offload data to GPUs for fast processing. Finally, the query results are 
materialized into row format and returned to the user. 
To explain GPU query execution in more detail, consider the following query that 
computes the total revenue from orders with discounts no less than 1% in each month of 
1993: 
SELECT d_month, SUM(lo_revenue) 
FROM lineorder, ddate 
WHERE lo_orderdate = d_datekey 
AND d_year = 1993 AND lo_discount >= 1 
GROUP BY d_month 
Figure 4.1 illustrates an execution plan generated by YDB for the query. It first 
performs a table scan on the fact table lineorder. The selection predicate lo_discount >= 
1 is evaluated to generate a selection vector. With this vector, the scan operator filters 
lo_orderdate and lo_revenue, and returns an intermediate table consisting of the two 
filtered columns to the driver program. Similarly, with a selection predicate d_year = 
1993, the driver program invokes a scan operation on the dimension table ddate, 
generating an intermediate table with the filtered d_datekey and d_month columns. 
Following the scans, the two intermediate tables are joined: a hash table is built on 
d_datekey’ and probed with lo_orderdate’ to generate a filtering vector, which is then 
used to filter the d_month’ and lo_revenue’ columns of the intermediate tables. In the 
end, the join output is aggregated (and materialized) to get the final query result. 
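
The operator sequence of this plan can be mimicked on the CPU in a few lines. The following Python sketch is only an illustration of the scan, hash join, and aggregation steps described above; it is not YDB code. The function names (scan, hash_join, group_sum) are hypothetical, the tiny tables are made-up sample data, and the join emits the matching rows directly instead of producing a separate filtering vector.

```python
from collections import defaultdict

def scan(table, predicate, columns):
    """Evaluate a predicate into a selection vector, then filter the columns."""
    n = len(next(iter(table.values())))
    selection = [predicate({c: table[c][i] for c in table}) for i in range(n)]
    return {c: [v for v, keep in zip(table[c], selection) if keep] for c in columns}

def hash_join(dim, dim_key, fact, fact_key, dim_col, fact_col):
    """Build a hash table on the dimension key and probe it with the fact key."""
    hash_table = {k: i for i, k in enumerate(dim[dim_key])}
    out = {dim_col: [], fact_col: []}
    for row, k in enumerate(fact[fact_key]):
        hit = hash_table.get(k)
        if hit is not None:
            out[dim_col].append(dim[dim_col][hit])
            out[fact_col].append(fact[fact_col][row])
    return out

def group_sum(table, key_col, val_col):
    """Aggregate val_col by key_col (SUM ... GROUP BY)."""
    sums = defaultdict(int)
    for k, v in zip(table[key_col], table[val_col]):
        sums[k] += v
    return dict(sums)

# Made-up sample data, stored column by column.
lineorder = {"lo_orderdate": [19930105, 19930210, 19940101, 19930115],
             "lo_discount": [2, 0, 5, 1],
             "lo_revenue": [100, 200, 300, 400]}
ddate = {"d_datekey": [19930105, 19930115, 19930210, 19940101],
         "d_year": [1993, 1993, 1993, 1994],
         "d_month": ["January", "January", "February", "January"]}

lo = scan(lineorder, lambda r: r["lo_discount"] >= 1, ["lo_orderdate", "lo_revenue"])
dd = scan(ddate, lambda r: r["d_year"] == 1993, ["d_datekey", "d_month"])
joined = hash_join(dd, "d_datekey", lo, "lo_orderdate", "d_month", "lo_revenue")
result = group_sum(joined, "d_month", "lo_revenue")
```

In a GPU engine each of these steps would be one or more kernels operating on whole columns; the sketch only shows the data flow between operators.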
 
Figure 4.1: An example query execution plan in YDB. 
The GPU operator library provides the GPU implementations of common database 
operations such as scans, joins, aggregations, and sorting. These operations are optimized 
at both kernel and procedure levels in YDB. Shared memory and memory access 
coalescing are fully exploited to maximize single kernel performance. IOMMU-based 
direct host memory access (through CUDA [79] unified virtual addressing or OpenCL 
[65] mapped buffer interfaces) and data compression techniques are supported to mitigate 
data transfer overhead. To ensure kernel execution efficiency, table tuples are pushed 
from one operator to another in batches. Data sets that cannot fit directly into device 
memory are partitioned into smaller blocks and processed one by one. 
Despite possible differences in implementation details, the core design principles of 
other analytical GPU query engines are similar to YDB. For example, Ocelot [93] is a 
hardware-oblivious parallel database engine supporting query executions on either CPUs 
or GPUs. Its integration with MonetDB [23] requires it to comply with the internal 
interfaces of MonetDB, but its column-based data stores, operator-at-a-time execution 
model, and the designs of major GPU operators agree with YDB closely. 
4.2.2 Low Resource Utilization 
Current analytical GPU engines such as YDB and Ocelot use GPUs as dedicated query 
co-processors. The query engine admits one user query at a time, then generates and 
executes a query plan assuming exclusive use of the GPU device. Although this dedicated query 
processing scheme simplifies query optimization and algorithm design, it inevitably 
causes low resource utilization due to the heterogeneous, data-driven features of query 
processing with GPUs. 
A typical query execution comprises both CPU and GPU phases. The CPU phases are 
in charge of, e.g., initializing GPU contexts, preparing input data, setting up GPU page 
tables, launching kernels, materializing query results, and controlling the steps of query 
progress. These operations can take a notable portion of query execution time, which may 
cause GPU resources to be underutilized during these periods. Besides CPU phases, there 
also exist data dependencies among the various query stages. For example, a kernel cannot be 
launched until its input data are loaded into device memory or mapped to the GPU page 
table; aggregations cannot start until the join results are generated. Techniques such as 
double buffering can be used to mitigate data latency, but their applicability is 
constrained by the limited opportunities within a single query and the high complexity 
introduced to GPU operator designs. Assuming dedicated occupation of the device, GPU 
queries also tend to release reserved device memory space lazily to improve data reuse 
and simplify algorithm implementation. This lowers the effective usage of allocated 
space. 
 
Figure 4.2: Utilization of GPU resources during dedicated executions of SSB queries 
with YDB. 
To show the problem of low resource utilization, we measure the executions of Star 
Schema Benchmark (SSB [101]) queries on a modern server with an Intel CPU and 
NVIDIA GPU (platform details in Section 4.7.1). We use YDB to generate an optimized 
binary for each of the 13 SSB queries at scale factor 14. Each binary is executed on the 
server several times in dedicated mode. We collect the average utilization of the main 
GPU resources during one query execution. To minimize the influence of disk accesses, 
all data sets are preloaded into system memory before queries are executed. 
Figure 4.2 depicts the utilization of three major types of GPU resources. The first two 
bars in each group give the utilization of the GPU's copy and compute units. It can be seen 
that both hardware resources are poorly utilized when a query executes dedicatedly on 
the GPU. The copy unit, which is in charge of DMA data transfers, is in use for only 
24% of query execution time on average. The compute unit, which executes the kernels 
of GPU operators, is even less utilized, accounting for an average of merely 8% of query 
makespan (4% at minimum). By further breaking down DMA traffic, we find that the 
overwhelming majority (over 99%) of data transfers are from the host to the device. 
Therefore, if a server-class GPU with dual copy units (e.g., an NVIDIA Tesla or Quadro 
GPU) is used in the production system, the device-to-host copy unit would remain 
(almost) completely idle through the entire query lifetime, wasting precious PCIe 
bandwidth resource. 
Figure 4.2 also shows the low utilization of device memory space, as illustrated by the 
third bar in each group. Queries allocate device memory to hold their working sets for 
fast access. However, not all of the allocated space is effectively utilized at all times 
during query execution. We have instrumented YDB to collect memory traces and 
computed the space utilization, which is defined as the fraction of allocated device memory 
space occupied by actively accessed data. It can be seen that, averaged across all queries, only 
23% of allocated device memory space is effectively utilized, with the lowest near 16% 
for queries in the q1 series. The allocated but underutilized space could not be put into 
better uses since only a single query was executed at any time in YDB. 
The problem of low resource utilization is what first motivated us to exploit concurrent 
query processing with GPUs. However, as will be shown in the next subsection, 
with some critical components missing, current GPU databases still cannot support 
concurrent query executions efficiently. 
4.2.3 Problems with Uncoordinated Query Co-Running 
Running multiple queries on the same GPUs can improve resource utilization and 
system performance. However, as we demonstrate next, these benefits do not come 
for free. Due to the lack of necessary database facilities to coordinate the sharing of 
GPU resources, co-running queries naively can cause serious problems such as query 
abortions or mediocre throughput. 
One of the most important functionalities not supported in current database systems is 
the coordination over GPU device memory usage. To maximize performance, each query 
tends to allocate a large amount of device memory space and keep its data on the device 
for efficient reuse. This causes severe conflicts when multiple queries try to use device 
memory simultaneously. Since the underlying GPU driver does not support automatic 
data swapping, co-running queries, if not managed by the database, can easily abort or 
suffer low performance. Even though there are recent proposals to suggest adding such 
service in the operating system [108], the database engine still needs to provide this 
functionality on its own in order to take advantage of additional information from query-
level semantics for maximizing performance. 
To show this demand, we measure how the system performs when co-running SSB 
queries used in the previous subsection on YDB. For all 69 combinations whose peak 
device memory consumption exceeds the device capacity (i.e., suffering contention), 
device memory allocation failures are observed for one or both of the participating 
queries. Because of high device memory conflicts, some query pairs fail to finish 
execution every time they are co-run. Some others suffer failures 
sporadically, depending on whether their co-runs happen to trigger the conflict. To 
verify the commonness of the problem, we have also performed similar experiments with 
Ocelot running TPC-H benchmarks on an AMD GPU. Ocelot supports device memory 
swapping within a single query, but provides no mechanisms to handle device memory 
conflicts caused by concurrent queries. We observe similar experiment results: all query 
co-runs suffering device memory conflicts cannot finish execution successfully with 
Ocelot. The underlying GPU drivers used in the YDB and Ocelot experiments are the 
latest commercial CUDA and OpenCL drivers from NVIDIA and AMD respectively. The 
problem shown by our experiments is thus general: it exists on both major GPU 
computing platforms. 
Besides device memory swapping, another critical facility missing in current GPU 
databases is query scheduling. Due to the limited capacity of GPU resources and the 
diverse demands of user queries, system performance is sensitive to the number of 
queries co-running on the GPU. Running too many queries can lead to severe resource 
contention that may cause high overhead. Running too few queries, on the other hand, 
underutilizes resources and loses the opportunity to maximize system performance. 
Query scheduling maintains an optimal workload on the GPUs by controlling the 
combinations of queries that can execute concurrently, and thus plays an important role in 
system throughput. 
 
Figure 4.3: The impact of query scheduling on system throughput. 
To demonstrate the necessity of query scheduling, we measure the performance of a 
system we have developed to support concurrent query executions (see Section 4.6 for 
details), running SSB queries at scale factor 14. Without enabling the query scheduling 
functionality, we change the number of queries executed concurrently by our system and 
show the average system throughput achieved under each setting in Figure 4.3. It can be 
seen that system throughput improves from running one query at a time to running 
queries in pairs, but degrades quickly as the number of co-running queries exceeds 
two. Noticeably, when four queries are allowed to execute simultaneously, system 
throughput drops to only 1/4 of the optimal value, which is even 65% lower than running 
queries one by one. This result shows that a database system without proper query 
scheduling functionality can easily suffer low system throughput or high system 
thrashing, which severely undermines the benefits of concurrent query processing. 
4.3 MultiQx-GPU: An Overview 
To support concurrent query processing, MultiQx-GPU provides the functionalities 
needed by databases to coordinate GPU resource sharing. In this section we highlight the 
design principles and overall structure of the system. 
The design of MultiQx-GPU abides by two main principles. The first one is versatility 
– the techniques presented by the system should be applicable to different GPU databases 
and computing frameworks for managing GPU resources. GPU database technologies are 
still evolving very quickly. Different systems have different query engine 
implementations and may be based on different GPU computing frameworks. The 
methods employed by MultiQx-GPU therefore should be easily utilized in all these 
variations. This requires the design of MultiQx-GPU to capture the essential properties of 
GPU query processing, to build upon the common abstractions of GPU frameworks, and 
to integrate with existing and future GPU database engines in a non-intrusive manner. 
The second principle followed by MultiQx-GPU is high efficiency. Originally 
designed for gaming and supercomputing applications, GPU hardware and system 
software still do not have native support for multitasking. Basic system-level 
functionalities familiar to the CPU world, such as virtual memory (VM) and fine-grained 
context switches, are not provided by commercial GPU drivers. This forces MultiQx-GPU 
to add an extra layer of application-level software to support multi-query 
capabilities on its own, which, if not designed with great care, could incur high overhead. 
 
Figure 4.4: Overview of MultiQx-GPU. Shaded boxes denote the two new components 
provided by MultiQx-GPU to manage GPU resources. 
Figure 4.4 shows the position of MultiQx-GPU in the overall GPU database software 
stack. MultiQx-GPU is built into the database query engine, but remains loosely coupled 
with existing components in the query engine. It enforces controls over GPU resource 
usage by transparently intercepting the GPU API calls from user queries. This design 
does not change existing programming interfaces of the underlying GPU drivers, and 
minimizes the modifications to the other components of the GPU query engine. MultiQx-GPU 
resides completely in the application space, and does not rely on any OS-level 
functionalities that are privileged to the GPU drivers. It can thus be easily ported between 
different query engine systems (such as Ocelot, YDB, and MapD) and GPU computing 
frameworks (such as CUDA, OpenCL, and DirectCompute [100]) to enable GPU 
resource sharing. 
MultiQx-GPU comprises two main components providing the support required for 
concurrent query executions. Working like an admission controller, the query scheduler 
component controls the concurrency level and intensity of resource contention on GPU 
devices. By controlling which queries can execute concurrently in the first place, the query 
scheduler maintains an optimal workload on the GPUs that maximizes system 
throughput. Once a proper concurrency level is maintained, the device memory manager 
component further ensures system performance by resolving the resource conflicts among 
concurrent queries. Through a VM-like automatic data swapping service, it makes sure that 
multiple queries with moderate resource conflicts can make concurrent progress efficiently 
without suffering query abortions or causing low resource utilization. 
In the next two sections we elaborate on the detailed designs of these two components 
and explain various decisions made by MultiQx-GPU to minimize overhead. Since the 
design of query scheduler assumes the capability of the query engine to efficiently 
resolve resource contention, we first introduce the design of the device memory manager 
to achieve this basic functionality in the following section. 
4.4 Device Memory Manager 
The primary functionality of the device memory manager is to coordinate the 
conflicting demands for device memory space from different queries so that they can 
make concurrent progress efficiently. To achieve this goal, it relies on an optimized data 
swapping framework and replacement policy to minimize overhead. 
4.4.1 Framework 
When free device memory space becomes insufficient, instead of rejecting a query's 
service request, MultiQx-GPU tries to swap some data out from device memory and 
reclaim their space for better uses. This improves the utilization of device memory space 
and makes concurrent executions more efficient. To achieve this purpose, the device 
memory manager employs a data swapping framework that is motivated by a system 
called GDM [108]. Unlike GDM, our framework resides in the application space. It 
cannot rely on any system-level interfaces, but it has the advantage of using query-level 
semantics for data swapping. 
To support data swapping, the framework maintains a swapping buffer in the host 
memory to hold the query data that do not currently need to reside in device memory. 
When a device memory allocation request is received, it creates a virtual 
memory area in the swapping buffer and returns the address of the virtual memory area to 
the query. Device memory space only needs to be allocated when a kernel accessing the 
data is to be launched. The framework maintains a global list of data regions allocated on 
the device memory for all running queries. When free space becomes insufficient, the 
device memory manager selects some swappable regions from the list and evicts them to 
the swapping buffer. Due to the special features of multi-query workloads, several 
optimization techniques are employed by the framework to improve performance, as 
explained below. 
Lazy transferring. When a query wants to copy some data to a device memory region 
(e.g., through cudaMemcpy in CUDA), the data are not immediately transferred to device 
memory until they are to be accessed in a GPU kernel. The swapping buffer serves as the 
temporary storage for the data to be transferred. This design prevents data from being 
evicted from device memory prematurely because data only need to be transferred to 
device memory when they are to be immediately accessed. To further reduce overhead, 
the memory manager marks the query source buffer copy-on-write. The data can later be 
transferred directly from the source buffer if it has not been changed. 
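
The idea of lazy transferring can be sketched with a toy model. The Python class below is a minimal illustration under stated assumptions, not the MultiQx-GPU implementation: the class name LazyDeviceMemory and its methods are hypothetical, device memory and DMA are simulated with dictionaries and a counter, and the copy-on-write optimization on the source buffer is not modeled.

```python
class LazyDeviceMemory:
    """Toy model: host-to-device copies are staged and performed at kernel launch."""

    def __init__(self):
        self.swap_buffer = {}   # region id -> data staged in host memory
        self.device = {}        # region id -> data resident in (simulated) device memory
        self.transfers = 0      # number of simulated host-to-device transfers

    def malloc(self, region):
        # Only a virtual area in the swapping buffer is created here.
        self.swap_buffer[region] = None

    def memcpy_to_device(self, region, data):
        # Stage the data in the swapping buffer; drop any stale device copy.
        self.swap_buffer[region] = list(data)
        self.device.pop(region, None)

    def launch(self, kernel, regions):
        # Transfer only the regions that this kernel references.
        for r in regions:
            if r not in self.device:
                self.device[r] = list(self.swap_buffer[r] or [])
                self.transfers += 1
        return kernel(*(self.device[r] for r in regions))

mem = LazyDeviceMemory()
mem.malloc("a")
mem.malloc("b")
mem.memcpy_to_device("a", [1, 2, 3])
mem.memcpy_to_device("b", [4, 5, 6])
staged_only = mem.transfers          # no transfer has happened yet
total = mem.launch(sum, ["a"])       # only region "a" is moved to the device
```

In this model region "b" never occupies device memory until a kernel actually references it, which is the behavior the paragraph above describes.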
Page-based coherence management. GPU queries usually reserve device memory 
space in large regions. The memory manager internally partitions a large region into 
several small, fixed-size, logical pages. Each page keeps its own state and maintains data 
coherence between host and device memories independently. Managing data coherence at 
page units has at least two performance benefits. First, by breaking a large, non-
interruptible DMA operation into multiple smaller ones, data evictions can be canceled 
immediately when they become unnecessary (e.g., when a region is being released). 
Second, a partial update to a region only changes the states of affected pages, instead of a 
whole region, which reduces the amount of data that need to be synchronized between 
host and device memories. 
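
A small sketch may help to make the second benefit concrete. The Python code below is an illustrative model only: the page size of 4 bytes is deliberately tiny and is an assumption (the text does not state the page size used), and the class and method names are hypothetical. It tracks one state flag per logical page and shows that a partial update dirties only the affected pages.

```python
PAGE_SIZE = 4  # bytes per logical page; an assumed, deliberately tiny value

class Region:
    """A device memory region split into fixed-size logical pages."""

    def __init__(self, size):
        self.size = size
        self.num_pages = -(-size // PAGE_SIZE)  # ceiling division
        # Per-page flag: True if the device copy is newer than the host copy.
        self.device_dirty = [False] * self.num_pages

    def pages_of(self, offset, length):
        """Indices of the pages overlapped by bytes [offset, offset + length)."""
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        return range(first, last + 1)

    def kernel_write(self, offset, length):
        # A partial update changes the state of the affected pages only.
        for p in self.pages_of(offset, length):
            self.device_dirty[p] = True

    def bytes_to_evict(self):
        # Only dirty pages must be copied back to the swapping buffer.
        total = 0
        for p, dirty in enumerate(self.device_dirty):
            if dirty:
                total += min(PAGE_SIZE, self.size - p * PAGE_SIZE)
        return total

region = Region(size=18)                  # 5 pages: 4 + 4 + 4 + 4 + 2 bytes
region.kernel_write(offset=6, length=4)   # bytes 6..9 fall in pages 1 and 2
```

Evicting this region would copy back 8 bytes rather than all 18, and because the copy is issued page by page it could be abandoned between pages if the region were released in the meantime.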
Data reference and access advice. To avoid allocating device memory space for 
unused regions, the memory manager needs to know which data regions are to be 
referenced during a kernel execution. It is also beneficial for the memory manager to 
know how the referenced regions are to be accessed by a kernel. In this way, for example, 
the content of a region not containing kernel input data need not be loaded into device 
memory before the kernel is issued; the memory manager also need not preserve 
region content in the swapping buffer during its eviction if the data are not to be reused. 
To achieve this purpose, the memory manager provides interfaces for queries to pass data 
reference and access advice before each kernel invocation. 
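
How such hints could be used before a kernel launch is sketched below. The text does not specify the actual interface, so everything here is an assumption made for illustration: the Access flags, the plan_launch function, and the region names are hypothetical.

```python
from enum import Flag, auto

class Access(Flag):
    INPUT = auto()    # the kernel reads the region's existing content
    OUTPUT = auto()   # the kernel overwrites the region

def plan_launch(advice, resident):
    """Return (regions to allocate, regions whose content must be loaded).

    advice   -- dict mapping each referenced region to its Access flags
    resident -- set of regions already present in device memory
    """
    to_allocate = [r for r in advice if r not in resident]
    # Only regions holding kernel input need their content transferred.
    to_load = [r for r in to_allocate if Access.INPUT in advice[r]]
    return to_allocate, to_load

# Hypothetical scan kernel: reads one column, writes a selection vector.
advice = {"lo_discount": Access.INPUT, "selection_vector": Access.OUTPUT}
to_allocate, to_load = plan_launch(advice, resident=set())
```

Both regions need device space, but only the input column is copied; the output region is allocated without any transfer, and regions the kernel does not reference are not considered at all.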
4.4.2 Data Replacement 
When free device memory space becomes scarce, the memory manager has to reclaim 
some space for the kernel to be launched. The replacement policy that selects data 
regions for evictions plays an important role in system performance. 
There are three main differences between data replacement in device memory and 
conventional CPU buffer pool management. First, the target of device memory 
replacement is a small number of variable-size regions rather than a large number of 
uniform-size pages. GPU queries usually allocate a few device memory regions, whose 
sizes may differ dramatically depending on the roles of the regions and query properties. 
Since the physical device memory space allocated for a region cannot be partially de-
allocated without necessary driver support, a victim region, once selected, must be 
evicted from device memory completely. Second, unlike CPU databases where data 
evictions can be interleaved with data computation to hide the latency of replacement, a 
GPU kernel cannot start execution until sufficient space is vacated on the device memory 
for all its data sets. This makes GPU query performance especially sensitive to the 
latency of data replacement. Third, device memory not only has to contain input table 
data and output query results, but also stores various intermediate kernel objects whose 
content can be modified from both CPU and GPU. This makes the data access patterns of 
device memory regions much more diverse than buffer pool pages. 
Based on these unique characteristics, we propose a policy, called CDR (Cost-Driven 
Replacement), which combines the effects of region size, eviction latency, and data 
locality to achieve good performance. When a replacement decision has to be made, CDR 
scans the list of swappable regions, and selects the region that would incur the lowest cost 
for eviction. The cost c of a region is defined by a simple formula, 
\[ c = e + f \times s \times l, \tag{2} \]
where e represents the size of the data that needs to be evicted from device memory, s is 
the region size, l represents the position of the region in the LRU list, and f is a constant, 
which we call the latency factor, whose value is between 0 and 1. If two regions happen to 
have the same cost value, CDR breaks the tie by selecting the less recently used one for 
replacement. 
The first part of Formula 2, e, quantifies the latency of vacating space. Its value 
depends on the status of the data pages in a region. For example, e is zero if none of the 
pages has been modified by kernels on the device memory. If some pages have been 
updated by the query process from the CPU, the device memory copies of those modified 
pages would have been invalidated and thus should not be evicted back to the swapping 
buffer, leading to a value of e less than s. The second part of Formula 2, 𝑓 × 𝑠 × 𝑙, 
depicts the potential overhead if the evicted region would be reused in a future kernel. 
The value of l is between 1/n and 1, depending on the region's position among the n 
swappable regions in the LRU order. For example, l = 1/n for the least recently used 
region, l = 2/n for the second least recently used one, and so on. The role of latency 
factor f is to give a heavier weight to data eviction latency in the overall cost formula. 
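
The selection rule can be written down in a few lines. The following Python sketch is an illustration of Formula 2 under stated assumptions, not the authors' code: the names Region and cdr_select are hypothetical, the latency factor f = 0.5 is an arbitrary value within the stated range, and the region sizes are made-up numbers.

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    size: int         # s: region size in bytes
    evict_bytes: int  # e: bytes that must be copied back before the space is freed

def cdr_select(lru_order, f=0.5):
    """Pick the victim with the lowest cost c = e + f * s * l.

    lru_order lists the swappable regions from least to most recently used,
    so the region at index i has l = (i + 1) / n.
    """
    n = len(lru_order)
    best, best_cost = None, None
    for i, region in enumerate(lru_order):
        cost = region.evict_bytes + f * region.size * (i + 1) / n
        # Strict '<' keeps the earlier (less recently used) region on ties.
        if best_cost is None or cost < best_cost:
            best, best_cost = region, cost
    return best, best_cost

regions = [
    Region("hash_table", size=400, evict_bytes=400),  # least recently used, fully dirty
    Region("lo_revenue", size=800, evict_bytes=0),    # clean input column
    Region("result", size=600, evict_bytes=600),      # most recently used, fully dirty
]
victim, cost = cdr_select(regions, f=0.5)
```

With these hypothetical numbers, plain LRU would evict hash_table and copy 400 bytes back to the host, whereas CDR picks the clean lo_revenue column, whose space can be reclaimed without any copy.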
As will be shown in Section 4.7.4, CDR delivers higher performance than conventional 
replacement policies in supporting concurrent query executions, thanks to its capability to 
identify suitable victim regions that incur low overhead. 
4.5 Query Scheduler 
In an open system where user queries arrive and leave dynamically, the query 
scheduler maintains an optimal workload on the GPUs by controlling which queries can 
co-run simultaneously. A query is allowed to start execution if it can make effective use of 
the underutilized or unused GPU resources without incurring high overhead associated 
with resource contention. The GPU workload status is monitored continuously, so that 
delayed queries can be rescheduled as soon as enough resources become available. 
A critical issue in query scheduling is to estimate the actual resource demand of a GPU 
query. As explained in Section 4.2.2, the amount of resource being effectively utilized by 
a query can be much lower than its reservation. Scheduling queries based on the maximal 
reservation can thus cause GPUs to be under-loaded, leading to suboptimal system 
throughput. In GPU databases, different queries and query phases may have diverse 
resource consumption, depending on query and data properties such as filter conditions, 
table sizes, data types, and content distributions. If the query scheduler cannot accurately 
predict the actual resource demand, a mistakenly scheduled query can easily bring down 
the overall system performance by a large margin, as has been shown in Section 4.2.3. 
To address the problem, we propose a simple, practical metric to effectively quantify 
the resource demand of a GPU query. The design of the metric is based on some 
observations that are generally applicable to analytical GPU databases. First, for GPU 
query processing, the utilization of device memory space has the principal impact on 
system throughput and can be frequently saturated under multi-query workloads. Unlike 
compute cycles and DMA bandwidth that can be freely reused, reusing a device memory 
region requires data evictions and space re-allocation, which can potentially incur high 
overhead. This makes system performance strongly correlated with the demand for and 
utilization of device memory space. Second, to ensure data transfer and kernel execution 
efficiencies, analytical GPU engines usually employ a batch-oriented, operator-based 
query execution scheme. Under this scheme, table data are partitioned into large chunks 
and pushed from one operator to another for processing. It is thus a good model to 
consider query execution as a temporal sequence of operators, each of which accepts an 
input data chunk, processes it with the GPU, and generates an output data chunk that may 
be passed to the next operator for further processing. 
Based on the above observations, we define a metric called weighted device memory 
demand, or briefly weighted demand, which is the weighted average of the device 
memory space consumed by a query's operator sequence. The weight is computed as the 
percentage of query execution time spent in each operator. The device memory space 
consumed by an operator equals the maximal total size of device memory regions 
referenced by any GPU kernel in the operator. Suppose that each operator's execution 
time and device memory consumption are $t_i$ and $m_i$ respectively, the weighted demand $m$
of the query can be computed by
$$ m = \frac{\sum_i (t_i \times m_i)}{\sum_i t_i}. \tag{3} $$
The device memory consumption of an operator can be computed from query predicates 
and table data statistics. The execution time of an operator can be predicted through 
modeling, as has been shown in our previous work [112]. 
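As a rough illustration of Equation (3), the following Python sketch computes the weighted demand from a list of per-operator estimates. The function name, the tuple layout, and the sample numbers are hypothetical and are not taken from the MultiQx-GPU code base; the sketch only assumes that a predicted execution time and a device memory footprint are available for each operator.

```python
from typing import Sequence, Tuple


def weighted_demand(operators: Sequence[Tuple[float, float]]) -> float:
    """Time-weighted average of device memory use, as in Equation (3).

    Each element is (t_i, m_i): predicted execution time of an operator and
    the device memory it consumes. The units are up to the caller.
    """
    total_time = sum(t for t, _ in operators)
    if total_time <= 0:
        raise ValueError("total execution time must be positive")
    return sum(t * m for t, m in operators) / total_time


# Hypothetical three-operator query: (seconds, megabytes).
ops = [(1.0, 400.0), (3.0, 800.0), (1.0, 200.0)]
print(weighted_demand(ops))  # 600.0, lower than the 800 MB peak
```

In this made-up example the weighted demand (600 MB) is well below the peak reservation (800 MB), which is the gap the metric is meant to expose to the scheduler.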
To accommodate the changes of resource consumption in different query phases, the 
weighted demand of a query is dynamically updated as the query executes, and is 
exposed to the query scheduler for making timely scheduling decisions. When a new 
query arrives, the query scheduler computes its initial weighted demand. If this value
exceeds the available device memory capacity, measured as the difference
between the device memory capacity and the sum of the weighted demands of scheduled
queries, the query's execution is delayed. The query scheduler reconsiders
a postponed query whenever a running query's resource demand
changes or a query finishes execution.
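The admission rule described in this section can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the actual scheduler module: the class and method names are hypothetical, and the order in which delayed queries are reconsidered (arrival order, admitting every query that fits) is our assumption, since the text does not prescribe one.

```python
from collections import deque


class WeightedDemandScheduler:
    """Toy admission controller driven by weighted device memory demand."""

    def __init__(self, capacity_mb: float):
        self.capacity_mb = capacity_mb
        self.running = {}       # query id -> current weighted demand (MB)
        self.delayed = deque()  # (query id, weighted demand), arrival order

    def available(self) -> float:
        # Capacity minus the sum of weighted demands of scheduled queries.
        return self.capacity_mb - sum(self.running.values())

    def submit(self, qid: str, demand_mb: float) -> bool:
        """Start the query if its demand fits; otherwise delay it."""
        if demand_mb <= self.available():
            self.running[qid] = demand_mb
            return True
        self.delayed.append((qid, demand_mb))
        return False

    def update_demand(self, qid: str, demand_mb: float) -> list:
        """A running query entered a phase with a different demand."""
        self.running[qid] = demand_mb
        return self._reschedule()

    def finish(self, qid: str) -> list:
        del self.running[qid]
        return self._reschedule()

    def _reschedule(self) -> list:
        # Assumed policy: scan delayed queries in arrival order.
        started, still_delayed = [], deque()
        while self.delayed:
            qid, demand = self.delayed.popleft()
            if demand <= self.available():
                self.running[qid] = demand
                started.append(qid)
            else:
                still_delayed.append((qid, demand))
        self.delayed = still_delayed
        return started


# Hypothetical demands in MB against roughly 1.46 GB of usable memory.
sched = WeightedDemandScheduler(capacity_mb=1460)
sched.submit("q11", 900)    # starts
sched.submit("q21", 700)    # delayed: only 560 MB remain
sched.submit("q31", 500)    # starts
print(sched.finish("q11"))  # ['q21']
```

A production scheduler would also need to decide what to do with a query whose demand exceeds the total capacity; this sketch simply leaves it in the delayed queue.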
4.6 Implementation 
We have implemented a prototype of MultiQx-GPU above the CUDA computing 
framework and integrated it with YDB to support concurrent query processing. This 
section summarizes our experiences in building the system. 
Our system adds two new components to the original YDB software stack. The 
memory manager component is implemented in a highly modular shared library 
(5200+ lines of C code). It is dynamically linked with the query binary to intercept 
CUDA API calls (through the LD_PRELOAD dynamic linking option on Linux). The 
query scheduler component is an add-on Python module that wraps around a query binary 
to control its execution. Most existing YDB components remain unmodified. A small 
amount of code (120+ lines) is added to the GPU operator library to provide data 
reference and access advice to the memory manager. The algorithm designs and kernel 
implementations of all GPU operators are unchanged. 
A new programming interface, cudaAdvice(addr, flags), is exported by the memory 
manager component to receive data advice. In the interface, addr denotes the address of 
the data region to be accessed in the next kernel execution, and flags is the advice about 
how this region is to be accessed, which can be input, output, or both. 
Our prototype system employs a process-based query execution model: each user 
query executes in a separate operating system process. To share information such as 
region states and resource availability, a shared memory area is created in the host 
memory. Data replacement requests and responses are communicated between different 
query processes through POSIX message queues. Copy-on-write is implemented through 
the mprotect system call on Linux. Since mprotect does not capture the event when a 
memory region is freed, we additionally override the free() function in libc. We 
set the page size to 8MB, which provides the fine granularity required for data coherence 
management while retaining over 99% of PCIe efficiency. The value of the latency 
factor f is set to 0.01, which works well empirically. 
The implementation of our prototype system addresses several technical issues which 
are discussed in the rest of the section. These issues are mainly caused by the undesired 
behaviors or missing services in the underlying GPU driver. 
False synchronizations. GPU kernels execute asynchronously with respect to the host 
query process. To avoid data races and kernel failures, the GPU driver usually enforces 
implicit synchronizations when handling some important operations such as data transfers 
and memory releases [24]. For example, the CUDA driver forcibly inserts a global barrier 
before a device memory region is released to ensure that no GPU operations are still 
accessing the region when its address mapping is removed from the GPU page table. 
Such implicit synchronizations may be helpful to some other GPU applications. In 
concurrent databases, however, since the query engine has full knowledge about the data 
access behaviors of its kernels and the data dependencies among different GPU 
operations, the extra synchronizations performed by the GPU driver are often 
unnecessary and can delay the progress of user queries. Our system circumvents this 
problem with two main mechanisms. First, it creates a dedicated CUDA stream (or two 
streams, each for one direction, if the GPU card has dual copy engines) for DMA 
transfers. To prevent data transfer requests from being unnecessarily blocked, our system 
internally maintains a set of small, pinned buffers, and automatically converts all data 
transfers into asynchronous DMAs by pipelining data through the pinned buffers in the 
dedicated stream. Second, our system uses a dedicated garbage collection thread to 
handle device memory releases, so that the main query thread is never blocked by such 
operations. This daemon thread also handles the case when a region still being evicted 
needs to be released. 
Kernel event handling. When a kernel finishes execution, the states of the regions 
accessed by the kernel have to be timely updated to ensure system correctness and 
performance. This can be done by letting the GPU driver execute a segment of 
maintenance code on CPU whenever a kernel finishes execution (e.g., through the 
cudaStreamAddCallback interface in CUDA). However, this can degrade system 
performance because the GPU hardware command queue is suspended when the CPU 
service thread waits to be scheduled by the operating system and executes the 
maintenance code. To avoid such overhead, our system inserts a short GPU code segment 
after each kernel. This code has the sole functionality of updating the states of the regions 
accessed by a kernel when it finishes. By pinning the state fields of data regions in the 
host memory and mapping them to the device address space, region states can be updated 
efficiently with a few direct host memory accesses without any intervention from the 
CPU. 
Device memory fragmentation. Data replacement requires frequent allocations and de-
allocations of device memory space. We find that the device memory space may become 
slightly fragmented in some cases after the system runs for a long time. When this 
happens, a device memory allocation operation may fail even if there seems to be 
sufficient free space in device memory. This problem appears to stem from the implementation of 
the physical device memory allocator in the GPU device driver, but we cannot 
confirm the exact cause because the commercial CUDA driver lacks public 
documentation. Our system uses a two-round memory allocation procedure to address the 
problem without relying on the underlying driver details: if the first allocation attempt 
fails, the device memory manager evicts some data region, no matter whether the free 
space still seems enough or not, and retries the allocation. Since the release of free space 
often forces the device driver to re-organize its free memory list, this mechanism 
effectively addresses the influence of fragmentation in practice. Because fragmentation 
happens rarely, the extra space eviction caused by this approach also does little harm to 
system performance. 
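The two-round allocation procedure can be summarized with the following Python sketch. It is a simplified model rather than the actual C implementation: try_alloc and evict_one are hypothetical callbacks standing in for the driver allocation call and the memory manager's eviction routine, and ToyDevice is a made-up stand-in that fails until some space has been released.

```python
def allocate_two_round(size, try_alloc, evict_one):
    """Try to allocate; on failure, evict one region and retry once."""
    region = try_alloc(size)
    if region is not None:
        return region
    # Evict regardless of how much free space the bookkeeping reports:
    # releasing space is assumed to make the driver tidy its free list.
    evict_one()
    return try_alloc(size)


class ToyDevice:
    """Stand-in allocator that fails while it is 'fragmented'."""

    def __init__(self):
        self.fragmented = True
        self.evictions = 0

    def try_alloc(self, size):
        return None if self.fragmented else ("region", size)

    def evict_one(self):
        self.evictions += 1
        self.fragmented = False


dev = ToyDevice()
region = allocate_two_round(64, dev.try_alloc, dev.evict_one)
print(region, dev.evictions)  # ('region', 64) 1
```

The retry happens only once here; whether the real system retries further after a second failure is not stated in the text.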
4.7 Experiments 
This section evaluates the performance of MultiQx-GPU thoroughly and verifies the 
effectiveness of various design decisions in supporting concurrent query executions. 
Before presenting the results, we first introduce the settings and methodology of our 
evaluation. 
4.7.1 Settings and Metrics 
The experiments are conducted on a workstation equipped with a four-core 3.4GHz 
Intel Core i7-2600 CPU, 16GB system memory, and an NVIDIA GTX 580 GPU installed 
in a PCIe 2.0 x16 slot. The maximum device memory capacity is 1.6GB, among which 
about 1.46GB of space is available for executing database queries. The GPU operator 
library and the device memory manager component of MultiQx-GPU are compiled and 
executed with NVIDIA CUDA Toolkit 5.0. The operating system is Red Hat Enterprise 
Linux Workstation 6.5 with 2.6.32 kernel. 
Our experiments mainly use the queries and data sets from the Star Schema 
Benchmark, which is widely used in database research due to its realistic modeling of 
data warehousing workloads. The table data are generated using the standard benchmark 
tool and converted into the column format required by the YDB query engine. The scale 
factor is set to 14 by default, unless otherwise noted, which populates the fact table with 
about 80 million tuples (6.6GB in total size). The query executables are pre-generated to 
exclude query parsing, optimization, code generation, and compilation times from 
performance measurement. To minimize the influence of disk accesses, we load all data 
sets into host memory before starting each experiment. Figure 4.5 lists the device 
memory usage of each query when it executes alone with MultiQx-GPU. The diversity of 
device memory behaviors makes our workloads more representative. 
 
Figure 4.5: Device memory usage of SSB queries at scale factor 14. qxy denotes the yth 
query of the xth query flight. The Allocated bars show the total device memory space 
allocated in each query. The Peak bars show the maximum device memory space held by 
each query during its execution. 
Several metrics are used in our experiments to characterize system performance from 
different perspectives. We use weighted speedup to measure the throughput of multi-
query executions in a closed system, where the co-running queries are fixed. It is defined 
as the sum of the speedups of participating queries [107, 99]. Suppose n queries execute 
concurrently, the throughput of their executions is computed as $\sum_{i=1}^{n} s_i / c_i$, in which $s_i$ is the 
execution time of the $i$th query when it runs alone and $c_i$ denotes its execution time when 
it co-runs with other queries. Under this definition, the throughput of running queries 
sequentially (i.e., without queries executing concurrently) is 1. Since queries have 
different execution times, we run each query multiple times to ensure its full overlap with 
other queries. In an open system where user queries arrive and leave dynamically, we 
measure system performance with the metric of queries per second, which is defined as 
the number of queries processed divided by the total processing time. 
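For concreteness, both throughput metrics can be computed as in the Python sketch below. The function names and the timing values are hypothetical. The second call illustrates one way to read the sequential baseline: if two queries simply take turns, each takes twice its solo time within the overlapped window, and the weighted speedup is 1.

```python
def weighted_speedup(solo_times, corun_times):
    """Sum of per-query speedups s_i / c_i (closed system)."""
    if len(solo_times) != len(corun_times):
        raise ValueError("need one co-run time per solo time")
    return sum(s / c for s, c in zip(solo_times, corun_times))


def queries_per_second(num_queries, total_seconds):
    """Open-system metric: queries processed per unit of wall-clock time."""
    return num_queries / total_seconds


# Two queries that slow down moderately when sharing the GPU.
print(weighted_speedup([2.0, 4.0], [3.0, 5.0]))

# Time-sliced (effectively sequential) execution of the same two queries.
print(weighted_speedup([2.0, 4.0], [4.0, 8.0]))  # 1.0

print(queries_per_second(100, 40.0))  # 2.5
```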
To gain insights into the performance numbers observed, we also measure the 
utilization of GPU resources such as the copy and compute units during workload 
execution. To validate the effectiveness of various resource management designs to 
reduce unnecessary data movement, we use another metric, called DMA efficiency, that 
measures the percentage of DMA time used for effective data transfers. A data transfer is 
effective if it is required when the query executes alone. Suppose a instances of query A 
co-run with b instances of query B in exactly full overlap, the time spent on DMA data 
transfers is x for A and y for B when each of them executes alone, and the total DMA time 
during their co-run is t; then the DMA efficiency of the co-run of A and B 
is (ax+by)/t. 
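A small worked example of this definition, written in Python with made-up timings, may help. The parameter names mirror the symbols in the text; the numbers are purely illustrative.

```python
def dma_efficiency(a, x, b, y, t):
    """Fraction of co-run DMA time spent on transfers needed in solo runs.

    a, b: number of instances of queries A and B in the co-run
    x, y: DMA time of one solo run of A and of B
    t:    total DMA time observed during the co-run
    """
    if t <= 0:
        raise ValueError("total DMA time must be positive")
    return (a * x + b * y) / t


# Three runs of A (0.2 s of DMA each) overlap two runs of B (0.5 s each);
# the co-run spends 2.0 s on DMA in total, so about 80% of it is effective.
print(dma_efficiency(a=3, x=0.2, b=2, y=0.5, t=2.0))
```

The remaining share of DMA time in such a co-run would correspond to swapping traffic that neither query needs when it runs alone.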
In the following text, we first present the overall performance of MultiQx-GPU in 
executing concurrent queries, and then verify the effectiveness of its various components. 
4.7.2 Performance of Concurrent Executions 
Through coordinated sharing of GPU resources, MultiQx-GPU improves system 
throughput by letting multiple queries make efficient progress concurrently. In this 
subsection we evaluate the overall performance of MultiQx-GPU in supporting 
concurrent executions. The evaluation is performed by co-running SSB queries 
pairwise. Among the 91 possible query combinations (78 distinct pairs of the 13 SSB 
queries plus 13 self-pairings), we select the 69 co-runnings 
whose peak device memory consumption exceeds the device memory capacity (i.e., those suffering 
conflicts). We measure the throughput they achieve with MultiQx-GPU and compare it 
with that of the original YDB system. The first two bars of each group in Figure 4.6 show the 
results. 
 
Figure 4.6: Throughput of pairwise SSB query co-runnings in three different systems. 
MultiQx-GPU-Raw is a variant of MultiQx-GPU without optimizations. p.q denotes the 
combination of queries p and q. 
It can be seen that, by processing multiple queries at the same time, MultiQx-GPU 
greatly enhances system performance compared with dedicated query executions. The 
throughput is consistently improved across all 69 co-runnings, by an average of 39% (at 
least 15%) as compared with YDB. For the co-runnings of q32 with q41, q33 with q41, 
and q43 with itself, the improvements are more than 55%. The high performance leap 
achieved by MultiQx-GPU is mainly attributed to the better utilization of GPU resources. 
Under the efficient management of MultiQx-GPU, the DMA bandwidth and GPU 
computing cycles unused by one query can be allocated to serve the resource 
requirements from other queries. The efficient utilization of resources improves overall 
system throughput. 
To validate the reasons for performance improvement, we have measured the 
utilization of GPU's compute and copy units during the execution of each query pair on 
both YDB and MultiQx-GPU. In Figure 4.7, the left bar graph shows the geometric 
means of GPU resource utilization averaged across all query combinations on the two 
platforms. The right bar graph depicts the improvement of resource utilization per query 
pair achieved on MultiQx-GPU versus on YDB, with both the geometric mean and max-
min values shown for each resource type. It can be seen that MultiQx-GPU significantly 
improves the utilization of both DMA and GPU compute resources over that with YDB. 
On average, the DMA engine becomes 62% more occupied (from 35% on YDB to 57% 
on MultiQx-GPU) after concurrent query execution is enabled, while the utilization of 
GPU cores is improved from 13% to 18% (by 33%) accordingly. The per-query-
combination improvements of resource utilization also reflect this trend, with the 
utilization of DMA and compute engines consistently raised by 61% and 31% on average 
respectively. Through manual inspection, we have also found a close correlation between 
the increase in resource utilization and the throughput improvement for different 
query combinations. The diversity of resource utilization improvement shown in the right 
bar graph of Figure 4.7 thus explains the differences in throughput enhancement in Figure 
4.6. 
 
Figure 4.7: Improvement of GPU resource utilization with MultiQx-GPU relative to 
YDB. 
Figure 4.7 also shows the potential to further improve multi-query performance on 
GPUs. Currently, despite the great improvement of resource utilization with MultiQx-
GPU, the DMA and compute units are still not 100% utilized. This is mainly due to the 
hardware limitations of GPUs and the overhead of data swapping. On one hand, the GPU is an 
exclusive, non-preemptive device: the kernels and DMA commands inserted into the 
hardware command queue are usually executed on the GPU one after another. This 
serializes query executions at the hardware level and greatly limits the concurrency level 
of the whole system. The data swapping activities performed by MultiQx-GPU, on the 
other hand, also limit performance enhancement. Even though the optimization techniques 
presented in this paper, as will be verified in the next section, help reduce the 
overhead, there are still some penalties that cannot be totally avoided. 
To measure the performance of MultiQx-GPU processing larger data sets, we have 
repeated the above experiment at scale factor 28, which corresponds to over 13GB of 
table data (each query processing about 3GB of data on average, much larger than the 
device memory capacity). Across all 69 query co-runnings, MultiQx-GPU improves 
throughput consistently by an average of 33% (up to 54%) above YDB, which is 
comparable to the performance improvement at scale factor 14 (39% on average, 55% at 
maximum). This result is well expected because the execution behaviors of GPU 
operators remain unchanged. When the sizes of intermediate results cannot fit into device 
memory, the YDB query generator partitions data sets into smaller chunks and generates 
a query binary that processes them separately. This only increases the number of times 
each operator is executed, not query behavior. 
4.7.3 Validations of Optimizations 
MultiQx-GPU’s data swapping framework employs a set of optimization techniques, 
including lazy transferring, page-based coherence maintenance, and data reference and 
access advice, to minimize overhead. To validate the effectiveness of these techniques in 
ensuring system performance, we repeat the experiment performed in the previous sub-
section on a degraded version of MultiQx-GPU, denoted as MultiQx-GPU-Raw, that does 
not have these optimizations enabled. MultiQx-GPU-Raw transfers data to device 
memory eagerly, maintains data coherence in units of whole data regions, and does not 
exploit advice from user queries to assist data swapping. We measure the throughput of 
query co-runnings achieved with MultiQx-GPU-Raw, and show the results with the third 
bar of each group in Figure 4.6. 
It can be seen that the optimization techniques have high impact on system 
performance. Compared with the optimized MultiQx-GPU, MultiQx-GPU-Raw achieves 
much lower throughput for all query co-runnings. With an average slowdown of 40%, the 
largest performance loss reaches over 73% due to the high memory conflict between q11 
and q13. In fact, MultiQx-GPU-Raw performs even worse than YDB in 
most co-runnings; it reduces the throughput of 55 (of 69) query combinations 
by an average of 16%, and by up to 64%, compared with YDB. The lack of optimization 
techniques greatly undermines the usefulness of the MultiQx-GPU system to support 
concurrent query processing, causing performance degradation, instead of enhancement, 
relative to dedicated query executions. 
The major reason for the low performance of MultiQx-GPU-Raw is the excessive data 
movement overhead caused by the unoptimized data swapping framework. This is 
explained in Figure 4.8. We measure the average DMA efficiencies achieved with 
MultiQx-GPU-Raw and MultiQx-GPU when executing the workloads, as shown in the 
left bar graph. It can be seen that, without the optimizations, the efficiency of the DMA 
engine is only 38%, which means that over 60% of the DMA time is spent on data 
transferring that would not be necessary during normal query executions. With the 
optimizations, the DMA engine works much more efficiently, raising DMA efficiency by 
over 118% (reaching 84%). This significant reduction of unnecessary data transfers 
directly translates to the high performance gap between MultiQx-GPU-Raw and 
MultiQx-GPU shown in Figure 4.6. The right bar graph sheds light on the same result 
from the perspective of per-execution DMA efficiency improvement. The efficiency of 
the DMA engine during the execution of each query pair improves by 117% on average 
from MultiQx-GPU-Raw to MultiQx-GPU, with the minimum and maximum changes 
being 17% and 219%. 
 
Figure 4.8: Improvement of DMA efficiency with the help of swapping framework 
optimizations. 
To further evaluate the effectiveness of each individual optimization technique, we 
have implemented several modified variants of MultiQx-GPU. Each variant turns off the 
optimization technique being tested, while keeping all other optimizations enabled. We 
measure the throughput of the same workload used above under these variants. To save 
space, we categorize SSB queries into four groups based on their query flight number, 
and only show the average performance of query co-runnings that belong to different 
group combinations. For example, q1.q2 denotes the average throughput for co-running 
one query from query flight q1 with another query from query flight q2. 
The result is depicted in Figure 4.9. NoLazy represents a variant of MultiQx-GPU 
without lazy transferring; data are transferred to device memory as soon as the request is 
received. NoCow transfers data to device lazily, but does not have the copy-on-write 
optimization for the data copied to swapping buffer. NoRef denotes the variant without 
data reference advice; the memory manager assumes that a kernel references all the data 
regions resident in a GPU context. NoAccess enables reference advice, but has no data 
access advice; every region referenced by a kernel is assumed to contain both data input 
and data output. Finally, NoPage does not maintain data coherence in page units. 
 
Figure 4.9: The influence of each individual optimization technique on system 
performance. Performance is normalized to the fully optimized MultiQx-GPU. 
It can be seen that missing any optimization technique would degrade system 
performance consistently, by varying degrees under different workloads. For example, 
removing data reference advice alone lowers MultiQx-GPU performance by an average 
of 22% (60% at maximum for workloads in the q1.q1 group). The effect of data reference 
advice seems much more significant to workloads involving queries in the q1 series (47% 
to 60% slowdown) than to others (5% to 20%). This is because q1 queries have the 
highest device memory consumption (as can be seen from Figure 4.5), for which data 
reference advice would be most effective at reducing unnecessary device memory 
contention. The influence of page-level coherence management may seem moderate on 
average (5.7%), but still can be significant (14%) for the q1.q2 workloads. This result 
shows the indispensability of every optimization technique. It is only when they work 
together that MultiQx-GPU achieves its highest performance. 
4.7.4 Experiments with Replacement Policies 
By controlling which victim regions are evicted under resource 
contention, the data replacement policy plays an important role in system throughput. This 
subsection presents the results of our experiment with several data replacement policies 
to support multi-query executions and verifies the effectiveness of CDR in improving 
MultiQx-GPU performance. We compare the performance of five replacement policies, 
LRU (Least Recently Used), MRU (Most Recently Used), LFU (Least Frequently Used), 
RANDOM, and CDR. The first four policies are selected because they are widely used in 
conventional multitasking and data management systems. We measure the throughput of 
the same workloads used in the previous two subsections, achieved using MultiQx-GPU 
(all optimizations are enabled) with different replacement policies. Due to space 
constraints, we randomly select 6 queries and only present the results for their 
co-runnings, but similar observations can be made with other queries as well. 
 
Figure 4.10: Performance of co-running selected SSB queries under various data 
replacement policies. Throughput is normalized against LRU. 
As shown in Figure 4.10, there are no significant differences among the performance 
of LRU, MRU, LFU, and RANDOM; they perform unevenly, but closely match each 
other under different workloads. CDR, however, performs much better than other policies 
across all query co-runnings, consistently improving system throughput by 44% on 
average (56% at maximum) compared with LRU. The performance advantage of CDR 
compared with other policies is expected, due to its careful design to select victim regions 
that minimize space eviction and data swapping costs. On the contrary, the other four 
policies do not consider the unique features of GPU queries and their concurrent 
executions. The criteria they use to make replacement decisions are rather random in 
terms of the benefits to overall system performance, often leading to increased kernel 
launch latency and unnecessary data swapping. 
4.7.5 Effectiveness of Query Scheduling 
To verify the effectiveness of query scheduling in ensuring system performance, we 
use a random-load generator to generate a sequence of 100 query requests, which model 
user queries issued dynamically in an open system. The arrivals of query requests follow 
a Poisson process, with the arrival rate set to 4 queries per second. The query issued at each 
interval is randomly picked from a pre-generated query set. Applying the same query 
request trace, we compare the system performance (in terms of queries per second) 
achieved by MultiQx-GPU under five query scheduling policies: FIX-1, FIX-2, FIX-3, 
Peak, and Weight. FIX-n denotes the policy that fixes the number of concurrently 
executing queries to n. For example, FIX-1 corresponds to the policy that executes one 
query at a time (without concurrent executions). Peak is a scheduling policy that 
schedules queries based on their peak device memory demands. Weight represents our 
scheduling policy proposed in Section 4.5. 
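A request trace of this kind can be produced with a few lines of Python. The sketch below is not the load generator used in the experiments; the function name, the seed, and the query list are hypothetical. It relies on the fact that the inter-arrival times of a Poisson process are exponentially distributed with mean 1/rate.

```python
import random


def poisson_arrival_trace(num_requests, rate_per_sec, query_set, seed=0):
    """Return a list of (arrival_time_in_seconds, query_name) pairs."""
    rng = random.Random(seed)  # fixed seed so the trace can be replayed
    now = 0.0
    trace = []
    for _ in range(num_requests):
        # Exponential gap between consecutive arrivals of a Poisson process.
        now += rng.expovariate(rate_per_sec)
        trace.append((now, rng.choice(query_set)))
    return trace


# 100 requests at 4 queries per second, drawn from a hypothetical query set.
trace = poisson_arrival_trace(100, 4.0, ["q11", "q21", "q31", "q41"])
print(len(trace), trace[0])
```

Replaying the same trace under every policy, as the experiment does, keeps differences in queries per second attributable to the scheduling policy rather than to the arrival pattern.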
Figure 4.11(a) shows the speedup of each scheduling policy relative to FIX-1 when the 
scale factor is 14. It can be seen that the speedups achieved by FIX-3 (1.1) and Peak 
(1.07) are much lower than those delivered by FIX-2 (1.39) and Weight (1.37). At scale 
factor 14, SSB queries have high device memory demands. Co-running three queries 
causes excessive resource conflict, which lowers system performance. Peak, on the other 
hand, is too conservative at selecting queries to co-run, and thus loses the opportunities to 
improve system performance by executing more queries concurrently. The performance 
of FIX-2 is comparable with Weight, because the concurrency level it supports (2) 
matches this workload. But, as we will show next, it cannot maintain its peak 
performance under other settings. 
 
Figure 4.11: System speedups achieved with different scheduling policies, normalized to 
FIX-1. 
To demonstrate how each scheduling policy performs under workloads with less-
intensive resource conflicts, we repeat the above experiment with data sets at scale factor 
8. The speedup of each policy compared with FIX-1 is shown in Figure 4.11(b). It can be 
seen that FIX-1 still performs the worst among all policies, while Weight continues to 
deliver the highest speedup (over 62% higher than FIX-1). FIX-2, which performs 
on par with Weight at scale factor 14, can no longer match Weight at scale 
factor 8 because it does not consider the changing resource demands of user queries. This 
experiment, combined with the previous one, verifies the effectiveness of the query 
scheduler to ensure system performance at various system settings. 
Figure 4.11 also shows the benefits of concurrent query executions for speeding up 
dynamic query workloads. Even though FIX-2, FIX-3, and Peak cannot consistently 
deliver the highest system performance as Weight does, they all outperform FIX-1, often 
by large margins. 
4.7.6 Overhead 
To efficiently manage GPU resources, MultiQx-GPU creates a swapping buffer in the 
host memory and performs various maintenance actions such as setting copy-on-write 
protections, updating the states of data blocks, and scheduling kernels. These 
management activities may add some overhead to query executions even when there is no 
resource conflict. In this subsection, we evaluate this overhead and show that it is 
sufficiently low in practice. 
We measure the overhead by comparing the performance of MultiQx-GPU (all 
resource management functionalities enabled) with YDB at scale factor 14 under two 
groups of workloads. The first group consists of the solo executions of the 13 SSB 
queries; the second group comprises the 22 pairs of query co-runnings that do not suffer 
device memory conflicts. The results are presented in Figure 4.12. It can be seen that, for 
single-query executions, the performance of SSB queries achieved with MultiQx-GPU 
closely matches their performance with YDB, which does not have GPU resource 
management overhead. MultiQx-GPU increases execution times by at most 4.3% 
compared with YDB, with the average slowdown being 2.2% across all queries. For 
query co-runnings, on the other hand, the average throughput achieved with MultiQx-
GPU is only 3.4% lower than that with YDB. We therefore believe that the overhead of 
MultiQx-GPU is negligibly low. 
 
Figure 4.12: Overhead of MultiQx-GPU. 
4.8 Related Work 
The use of GPUs for database applications has been intensively studied in existing 
research. Some works focus on the designs and implementations of efficient GPU 
algorithms for common database operations such as join [90, 103, 94], selection [89], 
sorting [105, 88], and spatial computation [109, 98, 86], achieving orders of magnitude of 
performance improvement over conventional CPU-based solutions. Other works exploit 
various software optimization techniques to accelerate query plan generations [92], 
improve kernel execution efficiency [110, 106], reduce PCIe data transferring [110, 111], 
and support query co-processing with both GPUs and CPUs [102]. 
Our work in this paper is mainly related to recent development efforts of GPU query 
engines, which provide infrastructure software support for the integration of GPUs in 
real-world database systems. YDB [112], based on which MultiQx-GPU is implemented, 
is designed for data warehousing query processing. It employs a column-based storage 
format, and generates query plans that execute in a push-based, batch-oriented fashion. 
Ocelot [93] is a hybrid OLAP query processor built as an extension to MonetDB. By adopting 
a hardware-independent query engine design, it supports efficient executions of OLAP 
queries on both CPUs and GPUs. Ocelot provides a memory management interface that 
abstracts away the details of the underlying memory structure to support portability. The 
memory manager can also perform simple data swapping within a single query. However, 
as we mentioned in Section 4.2, it does not have sufficient mechanisms or policies to 
support correct, efficient executions of queries in concurrent settings. MapD [98] is a 
spatial database system using GPUs as the core query processing devices. Through 
techniques such as optimized spatial algorithm implementations, kernel fusing, and data 
buffering, MapD outperforms existing CPU spatial data processing systems by large 
margins. GPUTx [91] is a high-performance transactional GPU database engine. It 
batches multiple transactional queries into the same kernel for efficient executions on the 
GPU and ensures isolation and consistency under concurrent updates. The workloads 
GPUTx targets are short-running, small tasks that would not cause device memory 
contention. The techniques thus cannot be used for concurrent analytical query 
processing on GPUs, where tasks usually have long time spans and have high demands 
for device memory space. HyPE [87] is a hybrid engine for CPU-GPU query co-
processing. The idea of its operator-based execution cost model is similar to the weighted 
demand metric proposed in this chapter. Compared with these works, MultiQx-GPU 
identifies the critical demands and opportunities of supporting concurrent query 
executions in analytical GPU databases. It addresses a set of issues in GPU resource 
management to achieve high system performance under multi-query workloads. 
In addition, there are several research works on GPU resource management in general-
purpose computing systems. PTask [104] adds abstractions of GPUs in the OS kernel to 
support managing GPUs as first-class computing resources. It provides a dataflow-based 
programming model and enforces system-level management of GPU computing 
resources and data movement. TimeGraph [95] is a GPU scheduler that provides performance 
isolation for real-time graphics applications. Gdev [96] is an open-source CUDA driver 
and runtime system. It supports inter-process communication through GPU device 
memory and provides simple data swapping functionality based on the IPC mechanism. 
GDM [108] is an OS-level device memory manager, which motivated the design of 
MultiQx-GPU’s data swapping framework. 
4.9 Summary and Future Work 
This chapter presents the motivation, design, implementation, and evaluation of 
MultiQx-GPU, a high-performance software support system for concurrent analytical 
query processing with GPU devices. MultiQx-GPU provides two critically necessary 
functionalities, namely query scheduling and device memory swapping, to allow 
coordinated multi-query executions with GPUs. Our extensive experimental results show 
that MultiQx-GPU can significantly improve overall throughput when executing data 
warehousing benchmarks. MultiQx-GPU is open-source software. The source code can 
be downloaded from http://jason.cse.ohio-state.edu/mqx. 
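To make the role of query scheduling more concrete, the following is a minimal sketch of memory-aware admission control, written in Python for brevity. It is not the scheduling policy implemented in MultiQx-GPU. The class name AdmissionScheduler, the field mem_demand, and the first-come-first-served policy are all assumptions introduced for illustration only.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Query:
    name: str
    mem_demand: int  # hypothetical estimate of the device memory the query needs


class AdmissionScheduler:
    """Toy FIFO admission control keyed on device memory demand (illustrative only)."""

    def __init__(self, capacity: int):
        self.capacity = capacity   # assumed device memory budget
        self.in_use = 0            # sum of demands of admitted queries
        self.running = []          # queries currently admitted to the GPU
        self.waiting = deque()     # queries held back to avoid memory contention

    def submit(self, query: Query) -> None:
        self.waiting.append(query)
        self._admit()

    def finish(self, name: str) -> None:
        for q in self.running:
            if q.name == name:
                self.running.remove(q)
                self.in_use -= q.mem_demand
                break
        self._admit()

    def _admit(self) -> None:
        # Admit queries in arrival order while their demands fit the budget.
        # An oversized query is admitted only when the device is otherwise idle.
        while self.waiting:
            head = self.waiting[0]
            fits = self.in_use + head.mem_demand <= self.capacity
            if not fits and self.running:
                break
            self.waiting.popleft()
            self.in_use += head.mem_demand
            self.running.append(head)
```

With a budget of 10 units and queries demanding 6, 5, and 3 units submitted in that order, only the first query is admitted at once; the other two start together after it finishes. A real scheduler would need a richer demand estimate and a policy that cooperates with device memory swapping.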
We will extend MultiQx-GPU in two directions. First, considering the importance of 
CPU-GPU co-processing for high-performance query executions, we plan to investigate 
the possibility of combining CPU scheduling in the operating system with the GPU 
scheduling in our system. Second, under the guideline of minimal changes to query 
engines, the current query scheduler in MultiQx-GPU does not consider data overlapping 
between different queries. We will study how to coordinate the two layers to support data 
sharing. 
  
  
 
 
Chapter 5 Concluding Remarks 
 
 
This dissertation addresses two critical issues for GPUs to be effectively utilized in 
general-purpose computing environments to achieve both high performance and high 
throughput for large-scale data processing. The first issue is to develop data-parallel 
algorithms that exploit the massively parallel capability of GPUs. This effort can be highly 
application-dependent, and it serves as the foundation because existing parallel algorithms 
for multicore processors may not fit the GPU architecture. The case study of accelerating 
digital pathology analysis with GPUs presented in this dissertation gives insight into 
the nature of this issue. The second issue is related to GPU resource allocation in 
complex and dynamic systems. With the provided system support, we are able to 
adaptively execute tasks on both CPUs and GPUs for the best performance interests of 
applications and for the best utilization of the two different types of computing units. 
More specifically, we have presented our work from the following three perspectives. 
First, to make a strong case for GPUs as indispensable devices to some performance-
critical applications, we have provided our solution to significantly accelerate the cross-
comparison of analytical pathology imaging data in a CPU-GPU hybrid system. Our 
PixelBox algorithm and its GPU implementation effectively eliminate the performance 
bottleneck of computing the areas of the intersections and unions of polygon sets. The 
pipelined framework with task migration support maximizes resource utilization by 
dynamically balancing the workload between CPUs and GPUs. With real-world 
workloads, we show that our solution achieves more than 18x speedup compared with a 
parallelized PostGIS solution. 
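The quantity at the center of this bottleneck can be illustrated with a brute-force pixel-counting sketch in Python. This is not the PixelBox algorithm, which is designed to avoid exactly this kind of exhaustive per-pixel work; it only shows what "areas of the intersection and union" means for a pair of polygons. The function names and the assumption of simple polygons with integer vertex coordinates are ours.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def pixel_inside(px: int, py: int, poly: List[Point]) -> bool:
    """Ray-casting test for the center of the unit pixel whose corner is (px, py)."""
    cx, cy = px + 0.5, py + 0.5
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Only edges that straddle the horizontal line y = cy can be crossed.
        if (y1 > cy) != (y2 > cy):
            x_cross = x1 + (cy - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > cx:
                inside = not inside
    return inside


def intersection_union_area(p: List[Point], q: List[Point]) -> Tuple[int, int]:
    """Count pixels covered by both polygons and by at least one polygon."""
    xs = [x for x, _ in p + q]
    ys = [y for _, y in p + q]
    inter = 0
    union = 0
    # Scan every pixel in the joint bounding box of the two polygons.
    for py in range(min(ys), max(ys)):
        for px in range(min(xs), max(xs)):
            in_p = pixel_inside(px, py, p)
            in_q = pixel_inside(px, py, q)
            if in_p and in_q:
                inter += 1
            if in_p or in_q:
                union += 1
    return inter, union
```

For two 4x4 squares with corners at (0, 0) and (2, 2), the function returns an intersection area of 4 and a union area of 28. Each pixel test is independent of the others, which is what makes this style of computation a natural fit for data-parallel hardware.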
Second, we identified a critical problem with existing GPGPU software systems: 
the lack of device memory management causes application crashes, hangs, and 
inefficient utilization of GPGPU resources. This problem can seriously hinder the 
adoption of GPGPUs as mainstream computing devices in general-purpose systems. 
We proposed GDM, a fully functional device memory manager in the operating system, 
to effectively address the problem. The design fully considers the 
unique features of GPGPU computing and GPGPU devices from the perspectives of both 
challenging problems and optimization opportunities. Experiments with various 
benchmarks verify the capabilities of GDM to tolerate device memory leaks, prevent 
program crashes, defend against malicious programs, and achieve high performance. 
Finally, we described the motivation, design, implementation, and evaluation of 
MultiQx-GPU, a high-performance database software system to support concurrent 
analytical query processing with GPU devices. MultiQx-GPU provides two critically 
necessary functionalities, query scheduling and device memory swapping, to enable 
coordinated executions of multiple queries with GPUs. Our extensive experimental 
results show that MultiQx-GPU can significantly improve overall throughput when 
executing data warehousing benchmarks. 
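The idea behind device memory swapping can be sketched in a few lines of Python. The sketch below is a toy model under stated assumptions, not the swapping framework of MultiQx-GPU: it tracks buffer sizes only, uses a least-recently-used eviction order that we chose for simplicity, and the class and method names are hypothetical.

```python
from collections import OrderedDict
from typing import List


class SwappingDeviceMemory:
    """Toy model: keep buffers resident on the device, evicting LRU buffers to host."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.device = OrderedDict()  # buffer id -> size, least recently used first
        self.host = {}               # buffers currently swapped out to host memory

    def used(self) -> int:
        return sum(self.device.values())

    def access(self, buf_id: str, size: int) -> List[str]:
        """Make buf_id resident and return the ids of buffers evicted to do so."""
        if size > self.capacity:
            raise MemoryError("buffer larger than device memory")
        if buf_id in self.device:
            # Already resident: only refresh its position in the LRU order.
            self.device.move_to_end(buf_id)
            return []
        # Swap the buffer back in if it was previously evicted.
        self.host.pop(buf_id, None)
        evicted = []
        while self.used() + size > self.capacity:
            victim, victim_size = self.device.popitem(last=False)
            self.host[victim] = victim_size
            evicted.append(victim)
        self.device[buf_id] = size
        return evicted
```

In this model a query that touches a buffer never fails for lack of device memory as long as the buffer itself fits; it pays the cost of evicting other buffers instead. A practical design also has to decide which buffers are safe to evict while kernels are running, which this sketch ignores.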
  
  
 
 
Bibliography 
 
 
[1] http://postgis.refractions.net. 
[2] http://www.scidb.org. 
[3] http://trac.osgeo.org/geos/. 
[4] http://www.cgal.org. 
[5] http://developer.nvidia.com/category/zone/cuda-zone. 
[6] http://resources.esri.com/help//9.3/arcgisengine/java/gp_toolref/conversion_toolbox/converting_features_to_raster_data.htm. 
[7] http://mathworks.com/matlabcentral/newsreader/view_thread/324086. 
[8] http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=2780. 
[9] http://culatools.com/blog/2012/03/12/3099. 
[10] http://blenderartists.org/forum/showthread.php?269777. 
[11] https://github.com/Theano/Theano (commit#: 5a755867f21b, fe69a5a5b3a4, 
410016f9d602, 9bdeda96639e). 
[12] http://mail-archive.com/pycuda@tiker.net/msg02432.html. 
[13] http://amd.com/en-us/innovations/software-technologies/apu. 
[14] http://documen.tician.de/pycuda/. 
[15] https://devtalk.nvidia.com/default/topic/513370/cublas-problem. 
[16] http://setiweb.ssl.berkeley.edu/beta/forum_thread.php?id=1441. 
[17] http://mathworks.com/matlabcentral/answers/85601-unavoidable-memory-leaks-in-mex. 
[18] http://nouveau.freedesktop.org. 
[19] https://github.com/serban/kmeans. 
[20] http://www.hsafoundation.com/. 
[21] http://www.culatools.com. 
[22] http://code.google.com/p/gpudb. 
[23] http://monetdb.org. 
[24] http://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html. 
[25] N. Ao, F. Zhang, D. Wu, D. S. Stones, G. Wang, X. Liu, J. Liu, and S. Lin. 
Efficient parallel lists intersection and index compression algorithms using graphics 
processing units. PVLDB, 4(8):470–481, 2011. 
[26] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. 
Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A view of 
the parallel computing landscape. CACM, 52(10):56–67, 2009. 
[27] M. J. Berger and J. Oliger. Adaptive mesh refinement for hyperbolic partial 
differential equations. Journal of Computational Physics, 53(3):484 – 512, 1984. 
[28] L. A. D. Cooper, J. Kong, D. A. Gutman, F. Wang, J. Gao, C. Appin, S. Cholleti, T. 
Pan, A. Sharma, L. Scarpace, T. Mikkelsen, T. Kurc, C. S. Moreno, D. J. Brat, and J. 
H. Saltz. Integrated morphologic analysis for the identification and characterization of 
disease subtypes. Journal of the American Medical Informatics Association, 
19(2):317–323, 2012. 
[29] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational 
geometry: algorithms and applications. Springer-Verlag, 1997. 
[30] X. Ding, K. Wang, P. B. Gibbons, and X. Zhang. BWS: balanced work stealing for 
time-sharing multicores. In EuroSys, pages 365–378, 2012. 
[31] G. S. Fishman. Monte Carlo: concepts, algorithms, and applications. Springer-
Verlag, 1996. 
[32] W. R. Franklin, V. Sivaswami, D. Sun, M. Kankanhalli, and C. Narayanaswami. 
Calculating the area of overlaid polygons without constructing the overlay. 
Cartography and Geographic Information Science, 21(2):81–89, 1994. 
[33] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: high 
performance graphics co-processor sorting for large database management. In 
SIGMOD, pages 325–336, 2006. 
[34] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast 
computation of database operations using graphics processors. In SIGMOD, pages 
215–226, 2004. 
[35] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. QPipe: a simultaneously 
pipelined relational query engine. In SIGMOD, pages 383–394, 2005. 
[36] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander. Relational 
joins on graphics processors. In SIGMOD, pages 511–524, 2008. 
[37] B. He and J. X. Yu. High-throughput transaction executions on graphics 
processors. PVLDB, 4(5):314–325, 2011. 
[38] I. Kamel and C. Faloutsos. Hilbert r-tree: an improved r-tree using fractals. In 
VLDB, pages 500–509, 1994. 
[39] S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa. TimeGraph: GPU 
scheduling for real-time multi-tasking environments. In USENIX ATC, pages 2–2, 
2011. 
[40] M. L. Kersten, S. Idreos, S. Manegold, and E. Liarou. The researcher’s guide to the 
data deluge: querying a scientific database in just a few seconds. PVLDB, 
4(12):1474–1477, 2011. 
[41] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, 
S. A. Brandt, and P. Dubey. FAST: fast architecture sensitive tree search on modern 
CPUs and GPUs. In SIGMOD, pages 339–350, 2010. 
[42] D. B. Kirk and W.-m. W. Hwu. Programming massively parallel processors: a 
hands-on approach. Morgan Kaufmann, 2010. 
[43] A. Kukanov and M. J. Voss. The foundations for scalable multi-core software in 
Intel Threading Building Blocks. Intel Technology Journal, 11(4):309–322, 2007. 
[44] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. 
Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking 
the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and 
GPU. In ISCA, pages 451–460, 2010. 
[45] S. Loebman, D. Nunley, Y. Kwon, B. Howe, M. Balazinska, and J. P. Gardner. 
Analyzing massive astrophysical datasets: can pig/hadoop or a relational dbms help? 
In CLUSTER, pages 1–10, 2009. 
[46] J. O’Rourke. Computational geometry in C. Cambridge University Press, 1998. 
[47] J. Patel, J. Yu, N. Kabra, K. Tufte, B. Nag, J. Burger, N. Hall, K. Ramasamy, R. 
Lueder, C. Ellmann, J. Kupsch, S. Guo, J. Larson, D. De Witt, and J. Naughton. 
Building a scaleable geo-spatial DBMS: technology, implementation, and evaluation. 
In SIGMOD, pages 336–347, 1997. 
[48] J. Pineda. A parallel algorithm for polygon rasterization. In SIGGRAPH, pages 17–
20, 1988. 
[49] H. Pirk, S. Manegold, and M. L. Kersten. Accelerating foreign-key joins using 
asymmetric memory channels. In VLDB - Workshop on Accelerating Data 
Management Systems Using Modern Processor and Storage Architectures, pages 
585–597, 2011. 
[50] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. 
Hwu. Optimization principles and application performance evaluation of a 
multithreaded GPU using CUDA. In PPoPP, pages 73–82, 2008. 
[51] C. Sun, D. Agrawal, and A. El Abbadi. Hardware acceleration for spatial selections 
and joins. In SIGMOD, pages 455–466, 2003. 
[52] L. Surhone, M. Tennoe, and S. Henssonow. Rectilinear polygon. Betascript 
Publishing, 2010. 
[53] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-
Wesley Longman Publishing Co., Inc., 2005. 
[54] F. Wang, J. Kong, L. Cooper, T. Pan, T. Kurc, W. Chen, A. Sharma, C. 
Niedermayr, T. Oh, D. Brat, A. Farris, D. Foran, and J. Saltz. A data model and 
database for high-resolution pathology analytical image informatics. Journal of 
Pathology Informatics, 2(1):32, 2011. 
[55] Y. Zhang and J. D. Owens. A quantitative performance analysis model for GPU 
architectures. In HPCA, pages 382–393, 2011. 
[56] AMD. AMD accelerated parallel processing OpenCL programming guide, 2013. 
[57] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. 
Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression 
compiler. In SciPy, 2010. 
[58] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. 
McCauley, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. 
Shen, and C. Webb. Die stacking (3d) microarchitecture. In MICRO, 2006. 
[59] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. 
Zhou. Scope: easy and efficient parallel processing of massive data sets. Proc. VLDB 
Endow., 1(2):1265–1276, 2008. 
[60] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. 
Rodinia: A benchmark suite for heterogeneous computing. In IISWC, 2009. 
[61] P. J. Denning. Virtual memory. ACM Comput. Surv., 2(3):153–189, 1970. 
[62] P. J. Denning. Third generation computer systems. ACM Comput. Surv., 3(4):175–
216, Dec. 1971. 
[63] D. R. Engler and M. F. Kaashoek. Exterminate all operating system abstractions. In 
HOTOS, 1995. 
[64] I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W.-m. W. Hwu. An 
asymmetric distributed shared memory model for heterogeneous parallel systems. In 
ASPLOS, 2010. 
[65] K. O. W. Group. The OpenCL specification 1.2, 2013. 
[66] V. Gupta, K. Schwan, N. Tolia, V. Talwar, and P. Ranganathan. Pegasus: 
coordinated scheduling for virtualized accelerator-based systems. In USENIX ATC, 
2011. 
[67] T. B. Jablin, P. Prabhu, J. A. Jablin, N. P. Johnson, S. R. Beard, and D. I. August. 
Automatic CPU-GPU communication management and optimization. In PLDI, 2011. 
[68] K. Jang, S. Han, S. Han, S. Moon, and K. Park. SSLShader: cheap SSL 
acceleration with commodity processors. In NSDI, 2011. 
[69] F. Ji, H. Lin, and X. Ma. RSVM: a region-based software virtual memory for GPU. 
In PACT, 2013. 
[70] S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa. TimeGraph: GPU 
scheduling for real-time multi-tasking environments. In USENIX ATC, 2011. 
[71] S. Kato, M. McThrow, C. Maltzahn, and S. Brandt. Gdev: first-class GPU resource 
management in the operating system. In USENIX ATC, 2012. 
[72] H. Kim. Supporting virtual memory in GPGPU without supporting precise 
exceptions. In MSPC, 2012. 
[73] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. 
Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking 
the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and 
GPU. In ISCA, 2010. 
[74] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified 
graphics and computing architecture. IEEE Micro, 28(2), 2008. 
[75] M. Macedonia. The GPU enters computing's mainstream. Computer, 36(10):106–
108, 2003. 
[76] J. Menon, M. De Kruijf, and K. Sankaralingam. iGPU: exception support and 
speculative execution on GPUs. In ISCA, 2012. 
[77] T. Ni. DirectCompute: Bring GPU computing to the mainstream. In GTC, 2009. 
[78] NVIDIA. NVIDIA's next generation CUDA compute architecture: Kepler GK110, 
2012. 
[79] NVIDIA. NVIDIA CUDA C programming guide, 2013. 
[80] J. Poulton. An embedded DRAM for CMOS ASICs. In ARVLSI, 1997. 
[81] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel. PTask: operating 
system abstractions to manage GPUs as compute devices. In SOSP, 2011. 
[82] M. Silberstein, B. Ford, I. Keidar, and E. Witchel. GPUfs: integrating a file system 
with GPUs. In ASPLOS, 2013. 
[83] A. Snavely and D. M. Tullsen. Symbiotic job scheduling for a simultaneous 
multithreaded processor. In ASPLOS, 2000. 
[84] K. Wang, Y. Huai, R. Lee, F. Wang, X. Zhang, and J. H. Saltz. Accelerating 
pathology image data cross-comparison on CPU-GPU hybrid systems. Proc. VLDB 
Endow., 5(11), 2012. 
[85] Y. Yuan, R. Lee, and X. Zhang. The Yin and Yang of processing data warehousing 
queries on GPU devices. Proc. VLDB Endow., 6(10):817–828, 2013. 
[86] N. Bandi, C. Sun, D. Agrawal, and A. El Abbadi. Hardware acceleration in 
commercial databases: A case study of spatial operations. In VLDB, 2004. 
[87] S. Bress. Why it is time for a HyPE: A hybrid query processing engine for efficient 
GPU coprocessing in DBMS. Proc. VLDB Endow., 6(12), 2013. 
[88] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High 
performance graphics co-processor sorting for large database management. In 
SIGMOD, 2006. 
[89] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast 
computation of database operations using graphics processors. In SIGMOD, 2004. 
[90] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander. Relational 
joins on graphics processors. In SIGMOD, 2008. 
[91] B. He and J. X. Yu. High-throughput transaction executions on graphics 
processors. Proc. VLDB Endow., 4(5):314–325, 2011. 
[92] M. Heimel and V. Markl. A first step towards GPU-assisted query optimization. In 
ADMS, 2012. 
[93] M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious 
parallelism for in-memory column-stores. Proc. VLDB Endow., 6(9):709–720, 2013. 
[94] T. Kaldewey, G. Lohman, R. Mueller, and P. Volk. GPU join processing revisited. 
In DaMoN, 2012. 
[95] S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa. TimeGraph: GPU 
scheduling for real-time multi-tasking environments. In USENIX ATC, 2011. 
[96] S. Kato, M. McThrow, C. Maltzahn, and S. Brandt. Gdev: First-class GPU resource 
management in the operating system. In USENIX ATC, 2012. 
[97] R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang. YSmart: Yet another 
SQL-to-MapReduce translator. In ICDCS, 2011. 
[98] T. Mostak. An overview of MapD (massively parallel database). MIT Technical 
Report, 2013. 
[99] O. Mutlu and T. Moscibroda. Parallelism-aware batch scheduling: Enhancing both 
performance and fairness of shared dram systems. In ISCA, 2008. 
[100] T. Ni. DirectCompute: Bring GPU computing to the mainstream. In GTC, 2009. 
[101] P. O'Neil, B. O'Neil, and X. Chen. Star schema benchmark. 
cs.umb.edu/~poneil/StarSchemaB.PDF. 
[102] H. Pirk, S. Manegold, and M. Kersten. Waste not... efficient co-processing of 
relational data. In ICDE, 2014. 
[103] H. Pirk, S. Manegold, and M. L. Kersten. Accelerating foreign-key joins using 
asymmetric memory channels. In VLDB, pages 27–35, 2011. 
[104] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel. PTask: Operating 
system abstractions to manage GPUs as compute devices. In SOSP, 2011. 
[105] N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey. 
Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort. In 
SIGMOD, 2010. 
[106] E. A. Sitaridi and K. A. Ross. Ameliorating memory contention of OLAP operators 
on GPU processors. In DaMoN, 2012. 
[107] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a simultaneous 
multithreaded processor. In ASPLOS, pages 234–244, 2000. 
[108] K. Wang, X. Ding, R. Lee, S. Kato, and X. Zhang. GDM: device memory 
management for GPGPU computing. In SIGMETRICS, 2014. 
[109] K. Wang, Y. Huai, R. Lee, F. Wang, X. Zhang, and J. H. Saltz. Accelerating 
pathology image data cross-comparison on CPU-GPU hybrid systems. Proc. VLDB 
Endow., 5(11):1543–1554, 2012. 
[110] H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel weaver: 
Automatically fusing database primitives for efficient GPU computation. In Micro, 
pages 107–118, 2012. 
[111] S. Yalamanchili. Scaling data warehousing applications using GPUs. In FastPath, 
2013. 
[112] Y. Yuan, R. Lee, and X. Zhang. The Yin and Yang of processing data warehousing 
queries on GPU devices. Proc. VLDB Endow., 6(10):817–828, 2013. 
