Scientific Supercomputing with Graphics Processing Units by Werkhoven, B.J.C. van
Scientific Supercomputing
with
Graphics Processing Units
Ben van Werkhoven
Advanced School for Computing and Imaging
This work was carried out in the ASCI graduate school.
The work presented in this thesis was partially funded by the Dutch national
program COMMIT.
About the cover
The cover of this thesis shows a visualization of the input data used by the Par-
allel Ocean Program. The visualization software was created by Jeroen Rietveld
as part of his masters project, which was supervised by Ben van Werkhoven. The
algorithm for the visualization of the terrain was developed by Ben as part of
teaching the Computer Graphics course. The picture on the cover shows a visu-
alization that uses particles to visualize the flow of the ocean. The colors show
the velocities of the particles as they move along the flow. A special thanks to
Maarten van Meersbergen for helping to create the high-resolution screenshot.
Copyright © 2014 Ben van Werkhoven
VRIJE UNIVERSITEIT
Scientific Supercomputing
with
Graphics Processing Units
ACADEMISCH PROEFSCHRIFT
ter verkrijging van de graad Doctor aan
de Vrije Universiteit Amsterdam,
op gezag van de rector magnificus
prof.dr. F.A. van der Duyn Schouten,
in het openbaar te verdedigen
ten overstaan van de promotiecommissie
van de Faculteit der Exacte Wetenschappen
op maandag 27 oktober 2014 om 13.45 uur
in de aula van de universiteit,
De Boelelaan 1105
door
Bernardus Johannes Cornelis
van Werkhoven
geboren te Hoorn
promotor: prof.dr.ir. H.E. Bal
copromotoren: dr. F.J. Seinstra
dr. J. Maassen
members of the thesis committee: prof.dr. H. Corporaal
prof.dr.ir. D.H.J. Epema
dr. S. Jha
dr.-Ing. habil. T. Kielmann
dr. C.G.M Snoek

Acknowledgments
First of all, I have to thank Nadine, for being my constant source of motivation
and inspiration. I am grateful for her patience, for all the times that we could
not spend together because I was writing this thesis. She believes in me and my
capabilities of succeeding in whatever I pursue, even when I don’t believe I can. I
don’t think that I would have been able to write this thesis without her support.
I owe a great deal of gratitude to my parents for everything they have done
for me in the past 28 years, but in particular for all their support during my
studies and this PhD project. Thanks to Lisanne en Bas for all the time we
spent together and the wonderful bond that we share.
Thanks Peter for being my best friend ever since our trip to Rome in 2003.
Thank you Peter, Arjen, Samir, Marcel, and Dave for supplying me with plenty
of distractions on our Friday nights and many other times.
Thank you Frank for devoting such a great deal of time to supervising me,
even when you were no longer being paid to do so. I have to thank Frank as
well for all the effort he put into getting me a position as a PhD student at the
VU. Frank and I met when we were teaching computer graphics together with
Thilo. It often occurred that Frank and I were still discussing our research ideas
even hours after the students had left the classroom. It has been great doing my
masters project under the supervision of both Frank and Jason, that resulted in
my first publication and one amazing trip to Saint-Malo together with Rob and
Ana that I will never forget :-)
I thank Henri for all the time and energy he devoted to me during this project
and for being a truly kind supervisor and employer. I’m thankful for all the
freedom I was given by both Frank and Henri to pursue research in the directions
that I wanted. While I do believe that this freedom made doing a PhD project
in itself a lot harder, I would certainly not have learned as much without it.
Thanks Jason for all your advice on my research and for your refreshing ideas
on how to structure and present my talks and papers.
Thanks Henk and Michael for all our collaborations and the great times Jason
and I had on our weekly visits to IMAU. I’ve learned so much from our collabo-
rations, even things that I had never expected to learn during my PhD, such as
viii
being able to program fluently in Fortran90 or understanding the basics of Ocean
Modeling.
I would like to thank Pieter for our endless discussions on almost any possi-
ble topic. Although we seem to never agree on anything, our discussions have
certainly contributed to my critical thinking skills. Thanks Timo for all the nice
coffee breaks we shared. I’d like to thank all others that I have worked with
during my PhD project: Rob, Maarten, Niels, Alessio, Ceriel, Kaveh, Alexandru,
Ismail, Rutger, Kees, and everyone else. A special thanks to Kees for being the
best administrator of the DAS that we could wish for.
I want to thank all the great teachers at the VU, whose classes I have enjoyed
so much during my bachelors and masters. In particular I would like to thank
Wan Fokkink, Thilo Kielmann, Guillaume Pierre, Maarten van Steen, and Henri
Bal from whom I enjoyed multiple courses. Their teachings motivated me to
pursue a research-oriented masters program, as well as pursuing a career as a
Computer Scientist myself.
Contents ix
Contents
1. Introduction 1
1.1. Scientific supercomputing with GPUs . . . . . . . . . . . . . . . . 2
1.2. Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3. Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4. Introduction to GPU Computing . . . . . . . . . . . . . . . . . . . 7
1.5. Hardware setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2. User transparent parallel multimedia computing on GPU clus-
ters 13
2.1. Parallel-Horus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2. CUDA-based compute kernels . . . . . . . . . . . . . . . . . . . . . 16
2.3. Lazy parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4. A line detection application . . . . . . . . . . . . . . . . . . . . . . 20
2.5. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3. Optimizing convolution operations on GPUs 33
3.1. Naive implementation and well-known optimizations . . . . . . . . 35
3.2. Avoiding shared memory bank conflicts . . . . . . . . . . . . . . . 41
3.3. Adaptive tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4. Adaptive tiling combined with loop unrolling . . . . . . . . . . . . 48
3.5. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.6. Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.7. Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.8. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4. Improving the performance of the Parallel Ocean Program 63
4.1. Load balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2. Results: load balancing . . . . . . . . . . . . . . . . . . . . . . . . 73
x Contents
4.3. Execution on graphics processing units . . . . . . . . . . . . . . . . 77
4.4. Performance of POP on GPUs . . . . . . . . . . . . . . . . . . . . 83
4.5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5. Performance models for CPU-GPU data transfers 91
5.1. Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2. Overlapping computation and communication . . . . . . . . . . . . 96
5.3. Modeling computation and communication overlap for streams . . 98
5.4. Modeling PCIe transfer time . . . . . . . . . . . . . . . . . . . . . 102
5.5. Overview of performance models . . . . . . . . . . . . . . . . . . . 106
5.6. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6. Conclusion 121
6.1. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.2. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.3. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
References 131
Samenvatting 139
Curriculum Vitae 143
1Chapter 1
Introduction
A team of researchers from Stanford and Google recently realized a breakthrough
in the field of machine learning, known as the Google Brain project [44]. They
built a deep neural network, a software model of a brain, with about a billion
connections. Roughly speaking, a billion connections is about the number of
connections in the brain of a honey bee. The neural network was trained over
the course of three days using an unlabeled set of 10 million images that were
taken from random YouTube videos and resized to 200 by 200 pixels. The most
remarkable part is that the learning process was completely unsupervised. After
three days of unsupervised learning, the model had discovered that there are
two concepts that frequently show up on the Internet: the model had learned to
recognize images of faces and cats. Training the model took about three days
on a computer cluster of a thousand servers, containing 2,000 CPUs in total.
This machine had cost about 5 million dollars to build and consumed about 600
kWatts.
Researchers from the same group at Stanford, together with a team of en-
gineers at Nvidia, have worked on another approach to tackle the same prob-
lem, simulating deep neural networks on Graphics Processing Units (GPUs) [13].
Nvidia has reported that the entire training process of the Google Brain can now
be performed on 3 GPU servers, containing 4 Nvidia GTX Titan Z GPUs per
server [32]. The system, consisting of 12 GPUs in total, costs over a hundred
times less to deploy ($33,000) and consumes over a hundred times less energy
(4 kWatts). Because of this breakthrough, machine learning at the scale of the
Google Brain project has become accessible to research groups around the world
and opens the road for dramatically increasing the size of simulations.
2 Chapter 1. Introduction
1.1 Scientific supercomputing with GPUs
A supercomputer is a high performance computing system built for specialized
applications that require state-of-the-art processing capacity. Supercomputers
typically consist of a large collection of processors and a high-speed network
interconnect. These systems are built to run large-scale scientific applications
that aim to solve scientific problems using mathematical simulations or analysis.
In recent years, supercomputers have been undergoing a major transformation
by equipping traditional cluster systems with graphics cards that contain GPUs.
These specialized processing units offer a large amount of compute power against
low costs and power consumption. This thesis focuses on understanding and
increasing the performance of large-scale scientific applications using graphics
processing units.
The fast growing video game industry has created an enormous economic
incentive for GPU vendors to meet the ever-increasing demands for more detailed
and more realistic computer graphics [5]. The GPU is responsible for rendering
complex scenes at high resolutions and computing the color of every pixel in every
frame at up to 60 frames per second. Such high demands can only be satisfied if
a large part of the chip area is devoted to performing floating-point calculations.
Fortunately, graphics workloads have a very high degree of data parallelism, as
the colors of all the different pixels that appear on screen can be computed in
parallel. This demand has motivated GPU vendors to design chips containing a
large number of small cores that are particularly suitable for performing massive
amounts of floating-point calculations per second in parallel.
The specific architectural characteristics of GPUs make these processors an
interesting platform for computationally intensive applications, even outside of
computer graphics. The first non-graphics related GPU applications still used
graphics libraries to program the GPU. However, the practice of running non-
graphics related workloads on GPUs is currently also supported specifically by
GPU vendors. Several programming languages are available today that target
the use of GPUs for computing, rather than computer graphics. Currently, GPU
vendors even have special brands of GPUs, for example Nvidia’s Tesla [63], ded-
icated to high-performance computing rather than computer graphics.
GPUs offer a tremendous increase in compute performance against low costs
and power consumption. As such, GPUs have become a computing platform
that cannot be ignored by supercomputing applications. Therefore, the high-
level research question of this thesis is:
How can we simplify the use of GPUs in scientific supercomputing
applications, increase application performance, and increase our un-
derstanding of the performance of these GPU-enabled applications?
1.1. Scientific supercomputing with GPUs 3
Many challenges and difficulties must be overcome when using GPUs for sci-
entific computing. Developing GPU applications is generally considered to be
a very difficult task, as programmers have to deal with multiple levels of paral-
lelism and a hierarchy of non-coherent memories. GPUs are massively parallel
devices that require existing software to be rewritten in a specialized program-
ming model. GPU applications are separated in a host part, executing on the
CPU, and device functions (called GPU kernels) executing on the GPU. The host
code is responsible for managing device memory, transferring data between host
and device memory, and invoking device functions. GPU kernels are executed
by thousands of threads in parallel requiring that parallelism is expressed at a
fine-grained level within the computations, which in turn requires expertise of
both the algorithm and the underlying architecture. The optimization process
of GPU kernels is also very challenging. The underlying hardware and threading
model contain hard resource restrictions that affect performance and make the
optimization space discontinuous [73]. As a result, subtle interactions between
resource limitations may have a large impact on application performance.
Following our high-level research goal stated above, we have grouped the main
challenges into the following categories:
Challenge 1 How to lower the threshold for domain experts to use GPU equipped
computing systems?
Challenge 2 How to measure the impact of the use of GPUs on application
performance?
Challenge 3 How to efficiently integrate GPU kernels into large existing code
bases?
Challenge 4 How to maximize the efficiency of compute-intensive GPU kernels?
Challenge 5 How to optimize data transfers between CPU and GPU?
The challenges formulated here are applicable to GPU computing applications in
general and require much more work than is possible within the scope of this thesis
to be resolved completely. Instead we aim to gain more understanding of the
fundamental problems that give rise to these challenges and provide contributions
towards a general solution for each challenge. While the work presented in this
thesis focuses on scientific applications some of the presented solutions also have
more general applicability.
4 Chapter 1. Introduction
1.2 Scope
In this thesis, we focus on scientific applications that perform large amounts of
floating-point calculations on dense data sets. This is to ensure that GPUs are a
realistic and efficient execution platform for the considered class of applications.
In our view, the landscape of floating-point intensive scientific applications can
be viewed as a spectrum of application types. One end of the spectrum contains
applications with compute-intensive workloads having high arithmetic intensities.
These applications are generally focused on analysis rather than simulation, as
the algorithm attempts to extract and infer information from the data set at
hand.
On the other end of the spectrum, we have applications with workloads that
are data intensive and have low arithmetic intensities. These applications do
perform large amounts of floating-point calculations, but the complexity of the
calculation per iteration is in general much lower, while the total number of itera-
tions is much higher. These applications are in general more focused on simulation
than analysis, and typical examples of these applications aim to numerically solve
partial differential equations.
Between these extremes is a spectrum of applications where characteristics
from both ends can be combined to characterize applications. Therefore, we
focus on applications that are from both ends of the spectrum, in order to deal
with challenges that are representative for the entire landscape.
As shown in Table 1.1, this thesis considers two application domains from
either end of the spectrum, Multimedia Content Analysis and Climate Mod-
eling. Multimedia Content Analysis is a representative application domain for
compute-intensive applications that perform many floating-point calculations per
data element, and is covered in Chapters 2 and 3. In these chapters we focus
on challenges concerning GPU computing in general and related to compute in-
tensive applications in particular. Chapter 2 focuses on Challenges 1, 2, and
3, by investigating how GPU execution is integrated efficiently into a high-level
parallel programming model that can be used without requiring any knowledge
of GPU computing from the application programmer. Chapter 3 focuses entirely
on Challenge 4, kernel-level optimizations for compute-intensive operations.
The second application domain that we consider is Climate Modeling, which
is a representative domain for applications with data-intensive workloads. Chap-
ter 4 focuses on Challenges 2 and 3 in the context of an ocean modeling applica-
tion. Chapter 5 presents performance modeling techniques to solve Challenge 5
in a way that is applicable to GPU computing applications in general. We eval-
uate the performance models using GPU kernels from both Multimedia Content
Analysis and Climate Modeling. In the following, we present a brief summary of
the content of each chapter in this thesis.
1.3. Thesis outline 5
Chapter Application Domain Challenges
2 Multimedia Content Analysis 1, 2, and 3
3 Multimedia Content Analysis 4
4 Climate Modeling 2, 3
5 Multiple domains 5
Table 1.1: Overview of the applications domains and challenges covered by each
chapter.
1.3 Thesis outline
The chapters in this thesis are largely based on a collection of scientific articles
that have been published. This section describes the overall structure of the
thesis and provides a brief description of the challenges solved in each chapter.
Chapter 2 investigates how GPU execution can be integrated into high-level
parallel programming models that can be used directly by domain experts. It
is essential to develop efficient programming models that hide the intrinsic com-
plexities of the underlying computing hardware, as it is unrealistic to expect
application domain experts to also become experts in high-performance comput-
ing. This is true for parallel computing on multiple compute nodes in a cluster
system, but the inclusion of GPUs at every node in a GPU cluster introduces an
additional level of complexity. Therefore, Chapter 2 investigates how an existing
user transparent parallel programming model for Multimedia Content Analysis
on traditional cluster systems can be extended to efficiently execute applications
on GPU clusters without requiring any additional parallelization or programming
effort from the application programmer. The content of this chapter is based on
the following publications:
Towards user transparent parallel multimedia computing on GPU-
clusters. B. van Werkhoven, J. Maassen, and F.J. Seinstra. In Pro-
ceedings of the first Workshop on Applications for Multi and Many
Core Processors, 37th ACM/IEEE International Symposium on Com-
puter Architecture (ISCA 2010), Pages 1-12, Saint-Malo, France, June
2010.
User transparent data and task parallel multimedia computing with
Pyxis-DT. T. van Kessel, B. van Werkhoven, N. Drost, J. Maassen,
H.E. Bal, and F.J. Seinstra. Future Generation Computer Systems,
Volume 29, Issue 8, Pages 2252-2261, October 2013.
Chapter 3 presents an extensive study of the optimization process of 2D Con-
volution, a compute kernel that is representative for the compute-intensive end
6 Chapter 1. Introduction
of the spectrum and in particular for Multimedia Content Analysis. Convolu-
tions are essential to signal and image processing applications, and are typically
responsible for a large fraction of the application’s execution time. This chapter
introduces a new optimization approach for implementing efficient GPU-enabled
library-based convolution operations called adaptive tiling. The optimization
techniques are also applied to other compute kernels from the image process-
ing domain, such as separable convolution. The content of this chapter is based
on the following publications:
Optimizing Convolution Operations in CUDA with Adaptive Tiling.
B. van Werkhoven, J. Maassen, and F.J. Seinstra. In Proceedings
of the second Workshop on Applications for Multi and Many Core
Processors, 38th ACM/IEEE International Symposium on Computer
Architecture (ISCA 2011), Pages 1-12, San Jose, CA, June 2011. Best
paper award.
Optimizing Convolution Operations on GPUs using Adaptive Tiling.
B. van Werkhoven, J. Maassen, H.E. Bal, and F.J. Seinstra. Fu-
ture Generation Computer Systems, Volume 30, Pages 14-26, January
2014.
Chapter 4 investigates whether and how data-intensive applications can benefit
from using GPUs. This chapter considers the Parallel Ocean Program (POP), a
representative climate modeling application that is used as the ocean component
of the coupled Community Earth System Model (CESM) [22]. We investigate
the performance characteristics of the Parallel Ocean Program to identify what
specific parts of the application can benefit the most from execution on GPUs.
In addition, this chapter describes how the existing software can be modified
to efficiently include GPU execution. Finally, we report the impact that the
use of GPUs will have on performance, also when using multiple GPU clusters
simultaneously. The content of this chapter is based on the following publication:
A distributed computing approach to improve the performance of the
Parallel Ocean Program (v2.1). B. van Werkhoven, J. Maassen, M.
Kliphuis, H.A. Dijkstra, S.E. Brunnabend, M. van Meersbergen, F.J.
Seinstra, and H.E. Bal. Geoscientific Model Development, Volume 6,
Pages 4705-4744, January 2014.
Chapter 5 presents techniques that are necessary to optimize data intensive
operations for performance. Many GPU applications perform data transfers to
and from GPU memory at regular intervals. For example because the data does
not fit into GPU memory or because of inter-node communication at the end of
each time step in a simulation. Overlapping GPU computation with CPU-GPU
1.4. Introduction to GPU Computing 7
communication can reduce the costs of moving data. Several different techniques
exist for transferring data to and from GPU memory and for overlapping those
transfers with GPU computation. This chapter provides insight in the perfor-
mance of GPU applications and proposes an analytical performance model that
includes PCIe transfers and overlapping computation and communication. The
content of this chapter is based on the following publications:
Performance models for CPU-GPU data transfers. B. vanWerkhoven,
J. Maassen, F.J. Seinstra, and H.E. Bal. In Proceedings of the 14th
IEEE/ACM International Symposium on Cluster, Cloud and Grid
Computing (CCGRID 2014), Chicago, IL, May 2014. Best paper
nominee.
Performance models for overlapping GPU computation with CPU-
GPU data transfers. B. van Werkhoven, J. Maassen, F.J. Seinstra,
and H.E. Bal. In preparation, June 2014.
Chapter 6 abstracts the technical and low-level contributions presented in the
previous chapters into more generic considerations. This chapter aims to identify
the lessons learned from the work in this thesis that can be used as guidelines for
future research that addresses similar challenges.
1.4 Introduction to GPU Computing
In this section, we give a brief introduction to GPU Computing and explain some
of the terminology used in this thesis. Throughout this thesis, we will use the
CUDA programming model [62] and as such use CUDA terminology, although
our methods can easily be translated to OpenCL [39].
In CUDA, a system consists of a host (the CPU), and one or more devices
(the GPUs). The host is responsible for the memory management of the device
and can move application data between host and device memory. This means the
host thread must allocate and free the GPU memory when needed. Transferring
data between host and device memory proceeds using either explicit memory
copy statements or mapped memory. Data transfers between CPU memory and
GPU memory can be overlapped with computation on both the GPU and the
CPU using a combination of CUDA Streams and asynchronous memory copy
statements or by using mapped memory.
Compute operations on the device are most commonly invoked from the host
and are referred to as kernels. The kernels are designed to exhibit a large amount
of data parallelism, allowing the same arithmetic operation to be performed si-
multaneously on different parts of the data. Kernels are executed concurrently
on the device by extremely large numbers of threads. As CUDA threads are
8 Chapter 1. Introduction
much more lightweight than traditional CPU threads, kernels can be executed by
millions of threads in parallel.
Threads executing the same kernel are organized in a two-level hierarchy,
threads are grouped into thread blocks, which are grouped into a grid. Thread
blocks typically contain hundreds of threads depending on the device. The built-
in threadIdx and blockIdx variables provide the thread and thread block index,
which are used by threads to direct themselves to different parts of the data.
Threads in the same thread block are executed by the same streaming multipro-
cessor (SM) and are able to synchronize with each other. Contiguous sections of
32 threads in a thread block are grouped into warps, in which all threads execute
the same instruction in parallel. When threads in a warp take different execu-
tion paths, multiple passes with suppression of certain threads are required to
complete execution.
CUDA supports different memory types. Global memory is a large high-
latency off-chip memory that is typically used to store large input and output data
structures. Global memory is the only device memory space that can be accessed
by the host. Although global memory bandwidth can be very high (177 GB/s
on a GTX480), it generally limits the performance of CUDA kernels if no other
memory types are utilized. Due to the design of modern DRAMs, the peak global
memory bandwidth can only be achieved if memory accesses are coalesced, that
is all threads in a warp access aligned and consecutive global memory locations.
Global memory is not coherent, that is the result of a write operation is not
guaranteed to be read by other threads until additional synchronization is used
or the kernel finishes execution.
Constant memory, resides in global memory space, but is mapped to a small
cached read-only memory, supporting low-latency, high bandwidth access. Con-
stant memory is particularly useful for values that are needed by all threads in a
warp simultaneously.
Shared memory is an on-chip memory that each SM is equipped with. Shared
memory can be allocated to thread blocks and accessed at very high speed in a
highly parallel manner. As all threads in a thread block can read and write to
shared memory, it is an efficient way for threads to share input data and inter-
mediate results. However, like global memory, shared memory is not coherent
and as such a thread block wide synchronization is required before written val-
ues become visible to other threads. Shared memory is aligned in a number of
memory banks, such that large arrays are wrapped row-wise across the banks. If
all threads in a warp access values in different banks the entire shared memory
bandwidth can be achieved. However, if different threads in a warp access dif-
ferent values in the same memory bank a shared memory bank conflict will occur
and the conflicting accesses are serialized.
1.5. Hardware setup 9
Registers also reside at the SMs and are private to each thread. The number of
registers used by each thread differs per kernel and is determined by the compiler.
The register usage per thread block thus also depends on the number of threads
in each thread block.
The number of thread blocks that can execute in parallel on each SM is bound
by the dynamic partitioning of SM resources. Each SM has a maximum number
of thread blocks and threads that it can support simultaneously. In addition,
the register file and shared memory are partitioned among the thread blocks
executing on an SM. The occupancy is defined as the number of warps that can
execute in parallel on each SM. The partitioning of SM resources often results in
subtle interactions between resource limits, as a slight increase in resource usage
could reduce the occupancy and result in a dramatic reduction in the achieved
performance.
As we will see in this thesis, one of the most difficult aspects of implementing
efficient GPU kernels is finding a balance between these subtle interactions and
code optimizations.
1.5 Hardware setup
In this section we describe the hardware setup used for the experiments performed
in this thesis. For each system we provide a brief overview of the compute infras-
tructure and explain what types of GPUs are present in the system.
1.5.1 Lisa GPU Cluster
The Lisa GPU Cluster [47] is a collaboration of the VU University Amsterdam,
the University of Amsterdam, and SURFsara (Stichting Academisch Rekencen-
trum Amsterdam). The system has a total of 6 nodes available that are equipped
with GPU accelerators. Each of these nodes is equipped with two Quad-Core In-
tel Xeon CPUs running at 2.50 GHz, with 32 GB of host memory. The nodes
have 2 Nvidia Tesla M1060 GPUs connected over PCIe 2.0.
The M1060 is of the GT200 architecture that predates the well-known Fermi
architecture [58], which means that unlike its successors it does not have L1 or
L2 caches. The Tesla M1060 GPU has 30 SMs with 8 cores each (240 CUDA
cores in total) running at 1.30 GHz and has 4 GB of global memory. The M1060
supports compute capability 1.3, which means it supports up to 512 threads per
thread block and has a limited 16KB of shared memory. The shared memory is
divided among 16 memory banks and processes requests by half-warps.
10 Chapter 1. Introduction
1.5.2 Huygens Supercomputer
The Huygens [33] is an IBM pSeries 575, a clustered SMP (Symmetric Multi-
processing) system. Each node contains 16 dual core IBM Power 6 processors
running at 4.7 GHz, resulting in 32 cores per node. As the cores support Simul-
taneous Multi Threading (SMT), every node appears to have 64 CPUs. Most
applications will perform better by using 64 MPI tasks per node (two MPI tasks
per processor core). Per node, 128 GB of memory is available (4 GB per core).
The nodes are connected using 8x(4xDDR) InfiniBand, resulting in a 160 Gbit/s
inter-node bandwidth.
Although Huygens does not contain GPUs, the system is used in Section 4.2
for the evaluation of different load balancing algorithms. Cartesius [10], which is
the successor of the Huygens as the Dutch national supercomputer, will contain
132 Nvidia Tesla K40 GPUs.
1.5.3 Distributed ASCI Supercomputer (DAS-4)
The DAS-4 (The Distributed ASCI Supercomputer 4) [16] is a six-cluster wide-
area distributed system designed by the Advanced School for Computing and
Imaging (ASCI). DAS-4 is funded by NWO (the Netherlands Organization for Sci-
entific Research) and the participating universities and organizations. The goal
of DAS-4 is to provide a common computational infrastructure for researchers
within ASCI, who work on various aspects of parallel, distributed, grid and cloud
computing, and large-scale multimedia content analysis.
Besides the 1 Gbit/s Ethernet (10 Gbit/s at the head nodes), DAS-4 also
employs Quad Data Rate (QDR) InfiniBand (IB) as an internal 20 Gbit/s in-
terconnect. The DAS-4 cluster sites are connected over dedicated 10 Gbit/s
Ethernet lightpaths. Except for some multi-cluster experiments in Chapter 4, we
only use the DAS-4 cluster located at the VU University Amsterdam.
The DAS-4 VU cluster is a largely homogeneous system, but also has several
different accelerators in different types of nodes. The cluster contains 74 regular
nodes with 2 quad-core Intel Xeon E5620 running at 2.4 GHz, 24 GB of main
memory at each node. A number of regular nodes are equipped with one GPU
each: 22 Nvidia GTX480, two Nvidia Tesla C2050, and one Nvidia GTX680,
all connected over PCIe 2.0. The system also has eight fat nodes containing 2
six-core Intel Xeon E5-2620 running at 2.0GHz, 64 GB of main memory, and one
Nvidia Tesla K20m connected over PCIe 2.0 at each node. One special node in
the system contains the Nvidia GTX Titan GPU, having 2 six-core Intel Xeon
E5-2630 running at 2.3GHz with a PCIe 3.0 interconnect.
In the following we give a more detailed description of the GPUs within the
DAS-4 system that are used in the experiments of this thesis, in order of their
release date.
1.5. Hardware setup 11
GTX 480
The GTX 480 is the flagship of the Fermi architecture [58] and the first Nvidia
GPU with a L1/L2 cache hierarchy. The GTX480 has 480 cores divided over
15 SMs with 32 cores each running at 1.40 GHz. Each SM also has 48 KB
of shared memory divided across 32 4-byte wide memory banks. The theo-
retical peak compute performance, computed as cores × frequency × 2, of the
GTX480 is 1344.96 GFLOP/s for single precision and 672.48 GFLOP/s for dou-
ble precision. The theoretical peak global memory bandwidth, computed as
(buswidth×memoryclock)/8, is 177 GB/s.
GTX 680
The GTX680 is the first GPU of the Kepler architecture [59] (GK104). The
Kepler GPUs have significantly more compute cores per SM compared to Fermi
GPUs (192 in Kepler versus 32 in Fermi). The GTX680 has 8 SMs which totals
to 1536 cores that run at a relatively low clock frequency (1006 MHz) to improve
energy efficiency. The theoretical peak compute performance of the GTX680 is
3090.43 GFLOP/s for single precision and 1545.22 GFLOP/s double precision.
The theoretical peak global memory bandwidth, however, has not scaled up pro-
portionally with the increased compute performance of the Kepler cards and is
192 GB/s for the GTX680. The GTX680 has 2 GB of global memory. The Kepler
SMs also have increased space for registers and can support up to 2048 threads
executing concurrently per SM. On the Kepler architecture global memory loads
and stores are only cached in L2 and not in L1. The L1 cache is reserved for
accesses to local memory and register spilling. However, the amount of shared
memory per SM on Kepler is exactly the same as on Fermi, 48 KB per SM.
Tesla K20
The Tesla K20 is the first Tesla GPU of the Kepler architecture (GK110) [60] and
includes several features that were not included in the GTX680. These features
include Dynamic Parallelism, Hyper-Q, and GPU Direct [60]. The Tesla K20m
GPUs in the DAS-4 system have 2496 CUDA cores running at 0.7 GHz, giving a
theoretical peak compute performance of 3519.36 GFLOP/s single precision and
1762.18 GFLOP/s for double precision. The Tesla K20m has 5 GB of device
memory and a global memory bandwidth of 208 GB/s when ECC support is
disabled. The Tesla K20m is connected over PCIe 2.0 and has 2 copy engines for
concurrent DMA transfers.
12 Chapter 1. Introduction
GTX Titan
The GTX Titan is a high-end consumer GPU of the same architecture as the
Tesla K20. The GTX Titan has 2688 compute cores at 0.88 GHz, producing a
theoretical peak compute performance of 4499.71 GFLOP/s single precision and
2249.86 GFLOP/s for double precision. The GTX Titan has 6 GB of device
memory and a theoretical peak global memory bandwidth of 288.4 GB/s. The
GTX Titan is the only GPU in the DAS-4 system that is connected over PCIe
3.0 and has 1 copy engine for DMA transfers.
13
Chapter 2
User transparent parallel
multimedia computing on
GPU clusters
Multimedia Content Analysis (MMCA) investigates methods of automated knowl-
edge extraction from image, video, and multimedia data. Research in the domain
is driven by applications, ranging from medical image registration [81] and remote
sensing [93], to computer vision and video indexing [86]. Computerized access
to the content of such data is a huge problem, as digital video produces high
data rates, and multimedia archives steadily run into Petabytes of storage space.
The massive amounts of data in these applications makes storing, cataloging,
processing, and retrieving of information a very challenging task. As a result,
high-performance computing is indispensable in the MMCA domain.
It is essential to develop efficient programming models that hide the intrinsic
complexities of the underlying computing hardware, as it is unrealistic to expect
application domain experts to also become experts in high-performance comput-
ing. In the literature, a number of such user transparent parallel programming
models have been described for traditional cluster systems (e.g. see [23, 56, 45,
36, 66, 77]). These programming models are based on a software library of pre-
parallelized compute kernels that cover the bulk of all commonly applied MMCA
functionality. Generally, these kernels are designed for data parallel execution on
traditional compute clusters. However, the inclusion of GPUs at every node in a
GPU cluster system introduces an additional level of complexity. Clearly, there
is a need for easy-to-use and efficient programming models for high-performance
multimedia computing on GPU-equipped cluster systems.
14 Chapter 2. User transparent multimedia computing on GPU clusters
This chapter covers challenges 1, 2, and 3.
First, we investigate how an existing user transparent parallel programming
model for Multimedia Content Analysis for traditional cluster systems, named
Parallel-Horus, can be extended to facilitate execution on GPU clusters (Chal-
lenge 1). Next, we apply a number of inter-kernel optimizations to ensure the
efficient integration of GPU kernels into the existing programming model. (Chal-
lenge 3). Finally, as part of our evaluation, we compare the performance of three
different implementations of a representative example application with the orig-
inal Parallel-Horus system against the same applications implemented with our
extended Parallel-Horus system (Challenge 2). The results obtained on two dif-
ferent GPU clusters show that the resulting system is able to efficiently execute
applications on GPU clusters without requiring any additional parallelization or
programming effort from the application programmer.
This chapter is organized as follows. Section 2.1 describes the original Parallel-
Horus system and proposes our extensions. Section 2.2 discusses the translation
of the ’sequential’ compute kernels to CUDA-based compute kernels. Section 2.3
discusses the integration of CUDA-based kernels in Parallel-Horus using an ap-
proach called lazy parallelization. Section 2.4 presents a representative example
application for which we consider three different implementations. Section 2.5
presents an extensive evaluation of the extended programming model. Section 2.6
discusses future work and concludes.
2.1 Parallel-Horus
Parallel-Horus [77] is a user transparent parallelization tool for the MMCA do-
main. Implemented in C++ and MPI, it allows programmers to develop data
parallel multimedia applications as fully sequential programs. The library’s API
is made identical to that of an existing sequential library: Horus [41]. Similar
to other frameworks (e.g. [56]), Horus recognizes that a small set of algorithmic
patterns can be identified that covers the bulk of all commonly applied function-
ality. Horus includes pattern implementations for common MMCA functionality
such as unary and binary pixel operations, global reduction, neighborhood and
convolution operations, and geometric transformations (e.g. rotation, scaling).
Recent developments include patterns for operations on large data sets, as well
as patterns on increasingly important derived data structures, such as feature
vectors.
An example application for line detection (explained further in Section 2.4) is
shown in Figure 2.1. Although this application is specified as a purely sequential
program, it contains no code for parallelization, Parallel-Horus can execute the
program in a data parallel fashion. To this end, Parallel-Horus extends the Horus
library by introducing a thin layer right in the heart of the algorithmic patterns,
2.1. Parallel-Horus 15
1 FOR all orientations θ DO
2 RotatedIm = GeometricOp(OriginalIm , "rotate", θ);
3 ContrastIm = UnPixOp(ContrastIm , "set", 0);
4 FOR all smoothing scales σu DO
5 FOR all differentiation scales σv DO
6 FiltIm1 = GenConvOp(RotatedIm , "gaussXY", σu, σv , 2, 0);
7 FiltIm2 = GenConvOp(RotatedIm , "gaussXY", σu, σv , 0, 0);
8 DetectedIm = BinPixOp(FiltIm1 , "absdiv", FiltIm2);
9 DetectedIm = BinPixOp(DetectedIm , "mul", σu × σv);
10 ContrastIm = BinPixOp(ContrastIm ,"max", DetectedIm);
11 OD
12 OD
13 BackRotatedIm = GeometricOp(ContrastIm , "rotate", −θ);
14 ResultIm = BinPixOp(ResultIm , "max", BackRotatedIm);
15 OD
Figure 2.1: Pseudo code for a line detection application, referred to as ConvRot.
HORUS API (sequential)
ALGORITHMIC
PATTERNS
MPI extensions
SEQUENTIAL
COMPUTE
KERNELS
CUDA-BASED
COMPUTE
KERNELS
Figure 2.2: General overview of Parallel-Horus and its CUDA-based extensions.
as shown in Figure 2.2. The layer uses MPI to communicate image data and
other structures among the participating compute nodes. Parallel-Horus executes
each of the operations in Figure 2.1 by combining the original sequential compute
kernels with a number of MPI-based communication steps [80]. Usually, an image
is scattered among the nodes in a cluster, leaving each node with a partial image.
Apart from the need for additional pre- and post-communication steps (such as
border handling in convolution operations), the original sequential compute kernel
available in Horus is now applied to each partial image individually.
It was realized that it is not sufficient to consider parallelization of Horus
operations in isolation. Therefore, the library was extended with a finite-state
machine based approach for communication minimization called, lazy paralleliza-
tion, that automatically parallelizes a fully sequential program at run-time by
inserting communication primitives and additional memory management opera-
tions whenever necessary [79].
16 Chapter 2. User transparent multimedia computing on GPU clusters
Results for realistic multimedia applications have shown the feasibility of the
approach. Data parallel performance on traditional clusters is consistently op-
timal with respect to message passing programs [77]. Notably, Parallel-Horus
was applied in NIST TRECVID benchmark evaluations for content-based video
retrieval, and played a crucial role in achieving top-ranking results in a field of
strong international competitors [87, 77]. Later extensions to Parallel-Horus, that
allow for services-based distributed multimedia computing, have been applied
successfully in large-scale distributed systems, involving hundreds of compute
resources worldwide [77].
When extending Parallel-Horus for GPU-execution, we aim for a similar min-
imal level of intrusiveness. Therefore, we leave the thin communication layer as it
is, and focus on introducing CUDA-based alternatives to the sequential compute
kernels that implement the algorithmic patterns (see bottom half of Figure 2.2).
In this way, MPI and CUDA work in concert, allowing the use of multiple GPUs
on the same node, and on multiple nodes in a cluster system simultaneously,
simply by creating one MPI process for each GPU.
2.2 CUDA-based compute kernels
This section discusses the implementation of the compute kernels used in Parallel-
Horus that allow parallel execution on GPUs using the CUDA [62] programming
model. The result of the initial translation process discussed here is to be con-
sidered a naive implementation, meaning that the implementation is obtained
from the simplest possible translation of a sequential program into a parallel one.
Naive implementations are important as they do not contain any architecture
specific optimizations and are therefore easy to maintain. Naive implementa-
tions also provide a lower bound on the programming effort required to express
a certain problem in a massively parallel programming model, such as CUDA.
The process of evaluating and optimizing the performance of individual CUDA
kernels is discussed in Chapter 3. Therefore, in the remainder of this chapter, we
do not include any architecture specific optimizations.
The image processing operations in the Horus library are categorized in a
number of algorithmic patterns that cover the bulk of all commonly applied op-
erations. Each algorithmic pattern corresponds to one of the operation classes
defined in Image Algebra [70]. Each operation in Horus is implemented by in-
stantiating the algorithmic pattern with the function to be applied to the input
image. Examples of algorithmic patterns in Horus are unary pixel operations, bi-
nary pixel operations, neighborhood operations, and geometric transformations.
Ultimately, all operations in Horus apply a certain function to every pixel in
an image. In this chapter, we consider only 2D images, but techniques presented
can be easily extended to 3 or more dimensions. All operations start with two
2.2. CUDA-based compute kernels 17
1 for (y=0; y<imHeight; y++) {
2 for (x=0; x<imWidth; x++) {
3
4 if (dimension == 1) {
5 for (i=0; i<kerWidth; i++) {
6 sum.x += sPtr ->x * kPtr ->x;
7 sPtr ++;
8 kPtr ++;
9
10 }
11 } else {
12 for (i=0; i<kerHeight; i++) {
13 sum.x += sPtr ->x * kPtr ->x;
14 sPtr += totalWidth;
15 kPtr ++;
16 }
17 }
18
19 sum.x /= kerWeight;
20 iPtr ->x = sum.x;
21 iPtr ++;
22
23 }
24 }
(a)
1 y = threadIdx.y+blockIdx.y*BY;
2 x = threadIdx.x+blockIdx.x*BX;
3
4 if (dimension == 1) {
5 for (i=0; i<kerWidth; i++) {
6 sum.x += sPtr ->x * kPtr ->x;
7 sPtr ++;
8 kPtr ++;
9
10 }
11 } else {
12 for (i=0; i<kerHeight; i++) {
13 sum.x += sPtr ->x * kPtr ->x;
14 sPtr += totalWidth;
15 kPtr ++;
16 }
17 }
18
19 sum.x /= kerWeight;
20 iPtr ->x = sum.x;
21 iPtr ++;
22
23
24
(b)
Figure 2.3: (a) Implementation of the kernel that applies a separable convolution
operation in either the horizontal or vertical dimension. (b) Implementation of
the CUDA kernel for the same operation.
loops that iterate over all pixels in the image (see Figure 2.3(a)). The simplest
way to obtain a CUDA kernel for such an operation is to unroll the outer two
loops and let every CUDA thread compute a single iteration (see Figure 2.3(b)).
In other words, every CUDA thread computes the value of a single pixel in the
output image. For most operations, consecutive CUDA threads will load and
store consecutive pixel values to and from global memory, ensuring coalesced
access and high utilization of the global memory bandwidth. However, for some
operations, such as geometric transformations, coalesced access to global memory
cannot be guaranteed.
The kernel that results from this simple transformation may perform very
little work per thread. For example, for a unary pixel operation each CUDA
thread loads the value of one pixel, performs a single instruction, stores the
outcome in the output image, and terminates. The CUDA programming model
offers an extremely lightweight version of the concept of a thread. However, for
operations such as unary and binary pixel operations, the amount of work per
thread is too low to be very efficient. On the other hand, for other functions,
such as neighborhood operations, the amount work per thread is much higher.
18 Chapter 2. User transparent multimedia computing on GPU clusters
host
device UnaryPixOp() UnaryPixOp()
Figure 2.4: Two CUDA kernel invocations for a Unary Pixel Operation and the
corresponding memory operations to move data between host and device.
host
device UnaryPixOp() UnaryPixOp()
Figure 2.5: Two CUDA kernel invocations for a Unary Pixel Operation and an
optimized sequence of memory operations to move data between host and device.
2.3 Lazy parallelization
Rather than optimizing the performance of compute kernels in isolation, we focus
here on optimizing the execution of multiple compute kernels applied in sequence
(as is typical for Horus applications). As such, several implementation strategies
can be considered when replacing the original Horus kernels with their CUDA
counterparts.
In the simplest approach, every call to a CUDA kernel is by default embedded
in code for allocating device memory, copying the image(s) to device memory,
executing the kernel, copying the resulting data structure back to main memory,
and finally deallocating device memory. As a result, the image is copied back
and forth between host and device even in the presence of consecutive kernel
invocations on the same data, as shown in Figure 2.4.
Although this approach is very simple to implement, a hand-coded version
of the same application will most likely be more efficient. In typical imaging
applications, the above approach will lead to unnecessary memory allocation
2.3. Lazy parallelization 19
Host
Device
stale up-to-date
stale q0:initial state q1:after I/O operation
up-to-date q2:after kernel execution q3:after synchronization
Table 2.1: Internal states for the finite state machine specification.
operations and redundant movements of data between device memory and host
memory. Figure 2.5 shows how two redundant data movements can be eliminated
without introducing any inconsistencies.
Linking up the CUDA kernels with the Parallel-Horus library must be done
carefully. This is because GPU copies of data structures can become stale due to
the normal workings of an imaging application itself, for example when an I/O
operation replaces an already existing image with a new one. More importantly,
some parallel versions of the imaging operations require the host copy of a data
structure to be up-to-date. One example is 2D convolution, which requires all
participating compute nodes in a parallel system to exchange border data using
MPI. This exchange is necessary as the evaluation of each individual pixel requires
information about that pixel’s neighbors. Hence, just before the border exchange,
a synchronization operation must be inserted to synchronize the host and device
copy of an image.
In Parallel-Horus [79], a fully sequential program is parallelized automati-
cally at runtime by inserting communication primitives and additional memory
management operations whenever necessary. This approach, referred to as lazy
parallelization, introduces a simple finite state machine (FSM) for every data
structure that is processed in parallel. The rest of this section presents our
extensions to Parallel-Horus, where a second, simple, finite state machine is in-
troduced for every data structure to avoid redundant GPU memory operations.
In our extension, it is assumed that every CUDA memory management operation
is redundant, unless proven otherwise. In other words, a memory operation is
only executed if its removal would introduce a data structure state inconsistency
(e.g. an operation executing on stale data).
The most up-to-date copy of the data structure resides either at the host,
the device, or both. Together with the initial state q0 our simple FSM has the
following four states, see Table 2.1. State q1 represents the state immediately after
an I/O operation, such as filling the data structure with input data read from a
file or immediately after an MPI communication operation has completed. State
q2 indicates that a CUDA kernel has been invoked on the device data structure.
Due to CUDA’s concurrent execution model, the host does not block to wait for
the kernel to complete execution. This means that the host can have multiple
pending kernel invocations, possibly on multiple data structures. However, when
20 Chapter 2. User transparent multimedia computing on GPU clusters
the host copy of a data structure is required to be up-to-data, the host thread
will block to wait for all kernel invocations on that data structure to complete
and copy the resulting data structure to host memory, which brings it to state q3.
State q3 indicates that the data structure is up-to-date in both host and device
memory. Data structures that are only used as arguments to CUDA kernels
can remain in state q3. The data structure is required to be up-to-date on the
device, however, as the device will not change the data, the host copy will remain
up-to-date as well.
The states of multiple data structures are not always independent. The output
data structures are the only structures that actually move from one state to
another. Other argument structures never change state, as these are accessed
only and never updated. All transitions that move a data structure to state q2
indicate parallel execution on a CUDA device. Additionally, all transitions that
cause a data structure to be moved to state q1 indicate I/O operations, such as
communication through MPI or disk I/O. Although writing a data structure to a
file does not change the data structure or its state, the data structure is required
to be present and up-to-date in host memory.
State transition functions related to the memory management of data struc-
tures are crucial to resolve data structure state inconsistencies, which may appear
in any executing imaging application. For example, whenever a CUDA kernel,
say a unary pixel operation, is invoked on a data structure that resides in host
memory (state q1), a data structure state inconsistency occurs, as UnaryPixOp
requires the input data structure to be in either state q2 or state q3. This par-
ticular state inconsistency is automatically solved by inserting a HostToDevice
operation, which allocates device memory (if necessary) and performs a copy
operation that copies the data structure from host to device memory.
The DeviceToHost operation is a synchronizing operation, which blocks the
host thread to wait for all pending kernel invocations on the data structure to
complete. To facilitate concurrent execution between host and device, CUDA
kernel invocations return immediately. In other words, the host thread does not
block to wait for a kernel to complete execution. However, when a DeviceToHost
memory operation is executed the host thread will block, to wait for all pending
kernel invocations on that data structure to complete and then copy the output
data structure to host memory.
2.4 A line detection application
Before we evaluate the extended Parallel-Horus, we first introduce an example
application. The example application is selected because results for data paral-
lel execution with Parallel-Horus are well available [80]. Despite being simple,
2.4. A line detection application 21
Figure 2.6: Detection of C. Elegans worms (courtesy of Janssen Pharmaceuticals,
Belgium).
the application is representative of the behavior of many other realistic MMCA
applications [77].
As discussed in [24], the computationally demanding problem of line detection
is solved by considering the second order directional derivative in the gradient di-
rection, for each possible line direction. This is achieved by applying anisotropic
Gaussian filters, parameterized by orientation θ, smoothing scale σu in the line
direction, and differentiation scale σv perpendicular to the line. When the filter
is correctly aligned with a line in the image, and σu, σv are optimally tuned to
capture the line, filter response is maximal, thus yielding line detection. Fig-
ure 2.6(a) gives a typical example of an input image, and (b) the result obtained
for a reasonably large subspace of (σu, σv, θ).
The anisotropic Gaussian filtering problem can be implemented in many dif-
ferent ways. We consider three possible approaches. First, for each orientation
θ it is possible to create a new filter based on σu and σv. In effect, this yields
a rotation of the filters, while the orientation of the input image remains fixed.
Hence, a sequential implementation based on this approach (which we refer to as
Conv2D) implies full 2-dimensional convolution for each filter.
The second approach (referred to as ConvUV) is to decompose the anisotropic
Gaussian filter along the perpendicular axes u, v, and use bilinear interpolation to
approximate the image intensity at the filter coordinates. Although comparable
to the Conv2D approach, ConvUV is expected to be faster due to a reduced number
of accesses to the image pixels. Pseudo code for the two algorithms is almost
identical, as presented in Figure 2.7.
Pseudo code for the third approach (referred to as ConvRot) has been given
in Figure 2.1 (see Section 2.1). In this case, the program starts by rotating
the original input image for a given orientation θ. In addition, for all (σu, σv)
combinations the filtering is performed by xy-separable Gaussian filters. For
each orientation the maximum response is combined in a single contrast image.
22 Chapter 2. User transparent multimedia computing on GPU clusters
1 FOR all orientations θ DO
2 FOR all smoothing scales σu DO
3 FOR all differentiation scales σv DO
4 FiltIm1 = GenConvOp(OriginalIm , "func", σu, σv , 2, 0);
5 FiltIm2 = GenConvOp(OriginalIm , "func", σu, σv , 0, 0);
6 ContrastIm = BinPixOp(FiltIm1 , "absdiv", FiltIm2);
7 ContrastIm = BinPixOp(ContrastIm , "mul", σu × σv);
8 ResultIm = BinPixOp(ResultIm , "max", ContrastIm);
9 OD
10 OD
11 OD
Figure 2.7: Conv2D and ConvUV, with "func" either "gauss2D" or "gaussUV".
Finally, the temporary contrast image is rotated back to match the orientation
of the input image, and the maximum response image is obtained.
2.5 Evaluation
We first evaluate the impact of lazy parallelization on the performance of ap-
plications implemented using our GPU-extended Parallel-Horus. Secondly, we
investigate the scalability of our example applications for different image sizes.
Finally, we compare the original Parallel-Horus implementation with our GPU-
extended version.
2.5.1 Evaluation of lazy parallelization
In this section, we evaluate the performance improvement of the lazy paralleliza-
tion optimization technique presented in Section 2.3. We call the version of
Parallel-Horus with GPU extensions using no intra-kernel optimization and naive
kernel implementations naive, and we call the implementation that employs lazy
parallelization naive lazy. We evaluate the performance impact of lazy paral-
lelization using two different GPU clusters, the Lisa GPU cluster and the DAS-4
cluster.
Lisa cluster
We have performed our measurements of the ConvUV and Conv2D applications
for an image size of 1024× 1024 using the Lisa GPU Cluster (see Section 1.5.1).
In our experiments we use different configurations, each of which is denoted
differently by the number of nodes and GPUs used. For example, measurements
involving one compute node and one GPU per node are denoted by 1x1. Likewise,
4x2 means that 4 nodes are used with 2 GPUs per node. For the CUDA-based
executions, the latter case implies the concurrent use of 8 GPUs.
2.5. Evaluation 23
Figure 2.8 shows that for the ConvUV application up to 40.1% of the execution
time can be saved when lazy parallelization is used to reduce the number of
redundant memory operations and data movements between device and host
memory. Figure 2.8 shows the execution times of the ConvUV application with
and without using lazy parallelization. The naive implementation demonstrates
good scalability when using configurations with 1 GPU per node. However, when
2 GPUs per node are used scalability is significantly worse. For the naive lazy
implementation the performance penalty when using multiple GPUs per node is
significantly reduced by using lazy parallelization.
The execution times for the Conv2D application are shown in Figure 2.9. Lazy
parallelization again improves performance for all cases, reducing up to 19.6%
of the execution time. The performance improvement is less than for ConvUV,
because a much smaller fraction of the execution time of Conv2D is spent on
redundant memory transfers.
Considering that the kernels used in this evaluation are naive implementa-
tions the total execution time of the applications can be significantly reduced
using more efficient kernel implementations. As such, the percentages presented
here represent a lower bound on the performance improvement achieved by the
lazy parallelization optimization, because the fraction of execution time spent on
memory management operations and data movements increases as the time spent
on computations is reduced.
Configurations 1x2 and 2x1 both employ the same number of GPUs, while
using a different number of nodes. The same holds for the 2x2 and 4x1 config-
urations. Despite the fact that for these configurations the number of GPUs is
pairwise identical, the implementations perform better for the 2x1 and 4x1 cases
than for 1x2 and 2x2 cases, respectively. When lazy parallelization is used to
reduce the number of memory management operations and data movements, the
difference in execution times becomes significantly less. This suggests that the
performance penalty caused by using multiple GPUs on the same node, instead
of different nodes, is caused by the large number of memory transfer operations.
DAS-4 cluster
We have also performed measurements of ConvUV, Conv2D and ConvRot with an
input image size of 4096×4096 on up to 20 GTX480 GPUs in parallel within the
DAS-4 cluster. The GTX480 is of the more recent Fermi architecture and has a
much higher theoretical peak performance than the M1060 available in the Lisa
cluster (1344.96 vs 622.08 GFLOP/s). For more details see Section 1.5.
Figures 2.10 and 2.11 show the performance of ConvUV, Conv2D respectively
with and without using the lazy parallelization optimization. While the perfor-
mance of ConvUV is better when using a small number of GPUs, Conv2D demon-
strates better scalability and therefore the execution times of both applications
24 Chapter 2. User transparent multimedia computing on GPU clusters
 0
 5
 10
 15
 20
 25
 30
 35
1x1 1x2 2x1 3x1 4x1 2x2 3x2 4x2
Se
co
nd
s
Number of Nodes x Processes per Node
ConvUV on the Lisa Cluster
Naive
Naive Lazy
Figure 2.8: The total application execution times for two versions of the GPU-
extended ConvUV application on the Lisa cluster.
 0
 10
 20
 30
 40
 50
 60
 70
 80
1x1 1x2 2x1 3x1 4x1 2x2 3x2 4x2
Se
co
nd
s
Number of Nodes x Processes per Node
Conv2D on the Lisa Cluster
Naive
Naive Lazy
Figure 2.9: The total application execution times for two versions of the GPU-
extended Conv2D application on the Lisa cluster.
2.5. Evaluation 25
 0
 20
 40
 60
 80
 100
 120
 140
 160
 180
1 2 4 8 16 20
Se
co
nd
s
GPUs
ConvUV on the DAS-4 GPU-Cluster
Naive
Naive Lazy
Figure 2.10: The total application execution times for two versions of the GPU-
extended ConvUV application on the DAS-4 cluster.
 0
 50
 100
 150
 200
 250
1 2 4 8 16 20
Se
co
nd
s
GPUs
Conv2D on the DAS-4 GPU-Cluster
Naive
Naive Lazy
Figure 2.11: The total application execution times for two versions of the GPU-
extended Conv2D application on the DAS-4 cluster.
26 Chapter 2. User transparent multimedia computing on GPU clusters
are roughly equal when using 20 GPUs. For the ConvUV application the amount
of execution time saved by using lazy parallelization is up to 42.1% and 37.4% on
average. For Conv2D lazy parallelization saves up to 36.5% of the execution time
and 33.8% on average. The results show that lazy parallelization significantly
improves performance in all tested cases.
The performance of the ConvRot application is shown in Figure 2.12. The
ConvRot application scales well up to 4 GPUs after which the application becomes
communication bound and performance no longer improves by adding compute
nodes. These results are entirely in line with earlier speedup characteristics re-
ported in [78] for a larger number of CPU nodes. The poor scalability of ConvRot
is due to the data dependencies present in the algorithm (i.e. the repeated image
rotations), and is not related to the capabilities of the Parallel-Horus system or
our GPU extensions. The fact that performance improvement from using lazy
parallelization for 8 or more nodes becomes significantly less also indicates that a
larger portion of the execution time is spent on communication rather than com-
putation. In general, lazy parallelization improves the performance of ConvRot
by up to 51.82% and 30.0% on average. An overview of the scalability of all three
applications is shown in Figure 2.13.
2.5.2 Scalability for different image sizes
In this section we assess the scalability of Parallel-Horus applications based on
the size of the input image, which limits the amount of data parallelism within
the program. As such, we investigate to what number of GPUs an application
can scale efficiently given a certain input image size. In the previous section,
we have seen that the ConvRot application demonstrates poor scalability beyond
four GPUs, therefore in this scalability study we only use the Conv2D and ConvUV
applications.
Figures 2.14 and 2.15 shows the speedup for Conv2D and ConvUV, compared
to their execution time on a single GPU. We have used a number of different
input image sizes ranging from 1 megapixel (1 MP) to 16 megapixel (16 MP)
using up to 20 GTX 480 GPUs in the DAS-4 cluster. As shown in both figures,
the scalability of the application is better for larger image sizes. Using the 1 MP
input image Conv2D achieves a speedup of 9.56 on 20 GPUs, while at the same
number of GPUs ConvUV reaches a speedup of only 6.75. For both applications
using more than 2 GPUs simultaneously is hardly worth the effort. However, for
the largest problem size the Conv2D application experiences a speedup of 17.9
on 20 GPUs and ConvUV reaches a speedup of 16.2. In fact, the time spent in
computations is reduced so dramatically that, in this case, file I/O takes up a
significant fraction of the execution time. From these results we conclude that
2.5. Evaluation 27
 0
 20
 40
 60
 80
 100
 120
 140
 160
1 2 4 8 16 20
Se
co
nd
s
GPUs
ConvRot on the DAS-4 GPU-Cluster
Naive
Naive Lazy
Figure 2.12: The total application execution times for two versions of the GPU-
extended ConvRot application on the DAS-4 cluster.
 0
 2
 4
 6
 8
 10
 12
 14
 16
 18
 20
 0  2  4  6  8  10  12  14  16  18  20
Sp
ee
du
p
Number of GPUs
Speedup of three line detection applications on DAS-4
linear(x)
Conv2D
ConvUV
ConvRot
Figure 2.13: Application speedup results for the three different implementations
on the DAS-4 cluster.
28 Chapter 2. User transparent multimedia computing on GPU clusters
 0
 2
 4
 6
 8
 10
 12
 14
 16
 18
 20
 0  2  4  6  8  10  12  14  16  18  20
Sp
ee
du
p
Number of GPUs
Speedup of Conv2D for different image sizes on DAS-4
linear(x)
8 MP
4 MP
2 MP
1 MP
Figure 2.14: Speedup of Conv2D relative to single-GPU execution on DAS-4 using
different image sizes (MP is megapixel).
 0
 2
 4
 6
 8
 10
 12
 14
 16
 18
 20
 0  2  4  6  8  10  12  14  16  18  20
Sp
ee
du
p
Number of GPUs
Speedup of ConvUV for different image sizes on DAS-4
linear(x)
8 MP
4 MP
2 MP
1 MP
Figure 2.15: Speedup of ConvUV relative to single-GPU execution on DAS-4 using
different image sizes (MP is megapixel).
2.5. Evaluation 29
the GPU extended Parallel-Horus system is able to create efficient applications
with good scalability properties.
2.5.3 User-perceived speedup
In order to assess the performance impact from using either CPUs or GPUs in
Parallel-Horus applications we define the notion of user-perceived speedup. User-
perceived speedup is the speedup observed by the end-user of an entire applica-
tion as a result from switching between different implementations of a parallel
programming model, without making any modifications to the application.
For the end-users of GPU applications it is very important to know the per-
formance impact of using GPUs. However, the performance of an application
executing on either a CPU or a GPU can not easily be compared. First and
foremost because the performance being compared is obtained using applications
that have different implementations, use different compilers, and execute on com-
pletely different hardware. As was pointed out by Lee et al. [46], many perfor-
mance comparisons between CPU and GPU applications that have been reported
in the literature are often unfair and therefore meaningless. This is often because
the GPU kernels are optimized towards the target architecture and are compared
against CPU kernels that contain little to no optimizations. In addition, Gregg
and Hazelwood [28] state that many CPU vs GPU performance comparisons do
not consider the fact that the data has to be transferred to and from GPU mem-
ory. This can also result in unfair performance comparisons, as transferring the
data to and from the GPU is necessary and can have a considerable impact on
application performance.
As such, we would like to stress that the comparison made using user-perceived
speedup is on the level of entire applications, as opposed to only compute kernels,
and therefore includes all data transfers between host and device memory. In
addition, all CPU and GPU compute kernels are naive implementations that do
not contain any architecture-specific optimizations.
For this experiment we have used the DAS-4 GPU cluster using up to 20
GTX480 GPUs and an input image size of 4096 × 4096. Table 2.2 presents the
execution times in seconds for Conv2D using 1 to 20 CPUs or GPUs. The CPU
runs use all four cores in the chip by allocating 4 MPI processes to each CPU. The
# of CPUs/GPUs 1 2 4 8 16 20
CPU (4 cores) 2504.56 1259.93 663.69 342.95 193.98 166.34
GPU 142.75 70.58 37.77 18.72 9.54 7.98
user-perceived speedup 17.55 17.85 17.57 18.32 20.33 20.84
Table 2.2: Execution times in seconds for Conv2D using 1 to 20 CPUs or GPUs.
30 Chapter 2. User transparent multimedia computing on GPU clusters
GPU runs simply use the single GPU present in each node. As shown in Table 2.2
the GPU extended version of Parallel-Horus achieves a user-perceived speedup
of at least 17.55. We would like to stress that the results for both architectures
are obtained using the same sequential application in combination with different
implementations of the user transparent parallel programming model, without
requiring any parallelization effort from the application programmer.
2.6 Conclusions
In this chapter, we have discussed the implementation of GPU-based extensions
to the user transparent parallel programming model, named Parallel-Horus, for
MMCA. The resulting programming system allows a fully sequential program to
execute on GPU clusters without any additional parallelization or programming
effort from the application developer. Instead of optimizing the performance of
compute kernels in isolation, we have focused on optimizing the execution of
multiple compute kernels applied in sequence.
First, we have replaced the compute kernels that implement the algorithmic
patterns of the Parallel-Horus system with equivalent CUDA-based implementa-
tions. Next, we have presented a finite state machine based approach to optimize
the execution of multiple compute kernels applied in sequence. Lazy paralleliza-
tion in Parallel-Horus automatically parallelizes a sequential application by in-
serting communication operations and local memory management operations at
runtime, whenever necessary. We extend this approach by inserting GPU mem-
ory management operations at runtime, whenever necessary. Lazy parallelization,
based on a finite state machine specification, has proven to constitute a simple,
yet scalable and effective method for the inter-kernel optimization of MMCA ap-
plications on GPU-clusters. Due to the high abstraction level incorporated in
the finite state machine specification, the approach is easily applicable in CUDA-
based automatic parallelization tools outside the MMCA application domain as
well.
We have evaluated the performance of the extended Parallel-Horus system
using multiple versions of an example line detection application on two different
GPU clusters. Our evaluation shows that lazy parallelization improves perfor-
mance in all tested cases. The performance improvement due to lazy paralleliza-
tion is higher for ConvUV, because a relatively larger part of the execution time of
ConvUV is spent on redundant memory transfers between device and host memory.
Our ConvUV and Conv2D applications perform better on configurations with
only a single GPU, compared to setups with multiple GPUs per node. When lazy
parallelization is used to reduce the number of memory management operations
and data movements, the difference in execution times becomes significantly less.
2.6. Conclusions 31
The performance penalty that occurs for setups with multiple GPUs per node is
due to the reduced amount of PCIe bandwidth available to each GPU.
Finally, we have analyzed the scalability of a line detection application paral-
lelized by Parallel-Horus with GPU extensions, on the DAS-4 GPU-cluster using
up to 20 GPUs. We have demonstrated that the model is capable of accelerat-
ing a fully sequential program, without requiring any additional effort from the
application developer.
32 Chapter 2. User transparent multimedia computing on GPU clusters
33
Chapter 3
Optimizing convolution
operations on GPUs
Convolution operations are essential to signal and image processing applications
and are frequently used to implement the most fundamental tasks in Computer
Vision, for example the detection of edges and lines in images [25]. Outside of
image processing, convolutions are used in probability [7], statistics [82], signal
processing [12], and differential equations [69].
In the context of image processing, a convolution operation computes the
linear combination of the weights in the convolution filter and a range of pixels
from the input image for each output pixel. A 2D convolution of an input image
I of size (w×h) and a convolution filter F of size (Fw×Fh) computes an output
image O of size (w − Fw × h− Fh):
O(y, x) =
Fh∑
j=0
Fw∑
i=0
I(y + j, x+ i)× F (j, i)
The computational complexity of the operation depends on the size of the im-
age and convolution filter. As such, convolution operations are typically respon-
sible for a large fraction of the execution time of image processing applications.
Convolution is a highly parallel operation, as it can be computed for all pixels
in the output image simultaneously and is therefore well suited for execution
on GPUs. However, creating efficient GPU implementations and optimizing for
performance can be extremely challenging. There are many different hardware
details and features of GPUs that have to be taken into account to optimize
GPU kernels for compute performance. In this chapter, we present an extensively
optimized library-based implementation for convolution operations.
34 Chapter 3. Optimizing convolution operations on GPUs
This work is part of a larger effort to obtain an implementation of the Parallel-
Horus [77] programming model that allows sequentially written MMCA programs
to execute as highly optimized applications for GPU-clusters without requiring
any parallelization effort from the application programmer. Because convolu-
tion operations can be parallelized over multiple compute nodes by splitting and
merging the input and output images across the nodes, this chapter only discusses
optimizations within a single GPU compute node.
Our work extends previous work on implementing and optimizing convolution
operations for GPUs. Many existing implementations require that the convolu-
tion filter has a fixed size that is known at compile time [35, 89, 103] or consider
only separable filters [68]. Separable filters are a special case of 2-dimensional
filters that can be decomposed into two 1-dimensional filters. While the approach
presented in this chapter focuses on 2D convolution, the techniques equally well
apply to separable convolution. As discussed in detail in Section 3.7, our imple-
mentation applies well-known optimizations that are discussed in the literature
and introduces several new optimization strategies.
In this chapter, we address Challenge 4 as defined in Chapter 1: How to max-
imize the efficiency of compute-intensive GPU kernels?. This chapter provides
the following contributions:
• We present an extensive study of the optimization process of 2D convolution
and separable convolution operations on modern graphics cards.
• We demonstrate that once the well-known optimization techniques have
been applied, there are many optimizations still possible.
• We introduce a new optimization approach for implementing efficient GPU-
enabled library-based convolution operations, called adaptive tiling, which
we also combine with loop unrolling.
• To the best of our knowledge, our implementation is the most optimized and
best performing implementation of 2D Convolution in the spatial domain
available to date.
The remainder of this chapter is organized as follows. Section 3.1 discusses
well-known optimization techniques that have to be applied to our CUDA ker-
nels before we can apply our own optimization approach. Section 3.2 presents
our approach for avoiding shared memory bank conflicts. Section 3.3 presents
our new optimization approach called adaptive tiling and discusses the perfor-
mance improvements. Section 3.4 combines adaptive tiling with loop unrolling
to create our most efficient implementation. Section 3.5 evaluates the perfor-
mance improvements of each optimization step on various graphics hardware.
Section 3.6 discusses the limitations that are inherent to spatial solutions to the
3.1. Naive implementation and well-known optimizations 35
2D convolution problem. Section 3.7 discusses related work and Section 3.8 and
concludes.
3.1 Naive implementation and well-known opti-
mizations
This section presents a naive CUDA implementation and discusses existing op-
timization techniques that form a starting point for our own optimizations. The
discussion of these techniques is included to present the reader with a complete
overview of the optimization process. Readers with much experience in GPU
programming and optimization may choose to skip this section. As detailed in
Section 3.7, the implementation approach presented here also improves upon ex-
isting work.
In this chapter, we continuously report performance results obtained on the
Nvidia GTX680 Kepler GPU. Whenever necessary, we also report results ob-
tained on the GTX480 Fermi GPU and the Tesla K20, also of the Kepler archi-
tecture. A detailed overview of the architectural differences between the GPUs
is detailed in Section 1.5.
In each of our measurements, the kernel performs a 2D convolution of an image
of 4096x4096 floating point pixels and uses filter sizes ranging from 3 up to 43 in
both dimensions. Using larger or smaller images influences the total execution
time of the operation, but only has a very limited effect on the performance
behavior of the kernel in terms of GFLOP/s. For the presentation of results
3D graphs are used as the performance of our 2D convolution implementations
often varies in both dimensions, for example some combinations of filter sizes and
resource limitations on the GPU may cause performance cliffs.
3.1.1 Naive implementation
A convolution operation computes a new value for every pixel based on a weighted
average of the original pixel and the pixels in its neighborhood. These weights are
stored in a convolution filter, which also determines the size of the neighborhood.
To ensure that every pixel can be evaluated (even at the edge of the image) we
assume that the input image includes a border and is thus larger than the output
image. Note that the performance of the convolution operation is independent of
the way the border is filled.
An implementation in C for the 2D convolution kernel, shown in Figure 3.1(a),
uses two loops to iterate over all pixels in the image. The inner two loops iterate
over each pixel in the neighborhood of the current pixel and compute a weighted
average using the weights stored in the convolution filter. The algorithm takes
36 Chapter 3. Optimizing convolution operations on GPUs
1 for (y=0; y<Ih; y++) {
2 for (x=0; x<Iw; x++) {
3 sum = 0;
4 for (j=0; j<Fh; j++) {
5 for (i=0; i<Fw; i++) {
6 sum += S[y+j][x+i] * F[j][i];
7 }
8 }
9 I[y][x] = sum / (Fw * Fh);
10 }
11 }
(a)
1 x = threadIdx.x+blockIdx.x*Bw;
2 y = threadIdx.y+blockIdx.y*Bh;
3 sum = 0;
4 for (j=0; j<Fh; j++) {
5 for (i=0; i<Fw; i++) {
6 sum += S[y+j][x+i] * F[j][i];
7 }
8 }
9 I[y][x] = sum / (Fw * Fh);
10
11
(b)
Figure 3.1: (a) Pseudo code for a simple 2D convolution. (b) Pseudo code for
the same algorithm implemented as a CUDA kernel.
an image I of size (Iw × Ih) and a filter F of size (Fw × Fh) as arguments. A
naive CUDA implementation, shown in Figure 3.1(b), is obtained by creating one
CUDA thread for each output pixel. This way, every CUDA thread computes
the weighted average of a single pixel’s neighborhood and writes a single pixel to
the output image. The input and output images can be padded to multiples of
the thread block width and height, to allow images of any size to be processed
by the kernel.
3.1.2 Constant and shared memory
The first step in the process of optimizing CUDA kernels is ensuring that the
kernel is not global memory bandwidth bound. This can be verified using the
Roofline Model [101]. The key idea behind the roofline model is to calculate
the arithmetic intensity (FLOP/byte ratio) of a kernel and multiply this by the
theoretical peak bandwidth of the device. The result is an estimate of the peak
performance that can be achieved by the kernel. If this exceeds the theoretical
peak performance of the device the kernel is compute bound, otherwise it is
memory bandwidth bound.
The arithmetic intensity of the 2D convolution kernel is calculated as follows.
For every weight in the convolution filter, each thread loads 2 floating point val-
ues, the pixel and the filter weight making up a total of 8 bytes. These two
inputs are multiplied and added to a local sum, giving an arithmetic intensity
of 0.25 FLOP/byte. On a device with no hardware managed caches, the maxi-
mum compute performance of the kernel is computed by multiplying the memory
bandwidth of the device with the arithmetic intensity of the kernel. However, on
3.1. Naive implementation and well-known optimizations 37
Thread Blockw
(1) Pixels processed
by this thread block
(2) Border pixels
loaded by this thread
block
Using shared memory for Tiled Convolution
Thread BlockH
Figure 3.2: Layout of the data stored in shared memory. Example shows a 16x16
thread block processing an 11x7 filter.
devices with hardware managed caches, many pixel values can be loaded from
the cache as neighboring threads will require the overlapping pixel data.
Rather than relying on the hardware caches to cope with the high memory
bandwidth requirements, parts of the data can be stored in different device mem-
ories. First of all, half of the loads in the kernel are values from the convolution
filter, of which all threads in a warp load the same value simultaneously. This
access pattern is ideally suited for constant memory. Secondly, threads within
a block can cooperate by sharing data through shared memory. As neighboring
threads will require largely overlapping regions from the input image, the threads
within a block may cooperatively load the entire area needed by all threads in
the block, this approach is occasionally referred to as tiled convolution [35] and
is illustrated in Figure 3.2. This way, the threads load the area required by this
thread block exactly once, instead of many times. For 2D convolution it is crucial
to assign 2-dimensional workloads to thread blocks, in order to maximize data
reuse. This is because threads that require overlapping regions from the input
image neighbor each other in 2 dimensions.
The exact bandwidth requirements of the kernel now depend on the convolu-
tion filter and thread block dimensions, and is given by (Fw−1+Bw)× (Fh−1+
Bh)×4 bytes per thread block, where Fw and Fh are the width and the height of
the convolution filter and Bw and Bh are the width and the height of the thread
block. The arithmetic intensity then becomes,
AI =
2× Fw × Fh ×Bw ×Bh
(Fw − 1 +Bw)× (Fh − 1 +Bh)× 4 . (3.1)
38 Chapter 3. Optimizing convolution operations on GPUs
Given a thread block of 32 × 32 and the smallest possible 2D filter 3 × 3, the
arithmetic intensity is 3.99 FLOP/byte. Given that the GTX680 has 192.26
GB/s of global memory bandwidth, a 2D convolution kernel that uses shared
memory has a peak performance of 766.36 GFLOP/s, which is still lower than
the theoretical peak of the device (3090.43 GFLOP/s). However, arithmetic
intensity increases as the filter size increases. For most filter sizes in our test
range, except for the 20 smallest, the theoretical peak of the kernel exceeds the
theoretical peak of the device and therefore the kernel is compute bound rather
than memory bandwidth bound for most filters. In Section 3.3, we will discuss
how the arithmetic intensity, especially for the smallest filters, can be further
increased using our adaptive tiling approach.
Figure 3.3 shows the performance for the compute bound kernel that uses
constant and shared memory. The performance first increases as the filter size
gets larger. However, as the filter size increases the amount of shared memory
used by the kernel also increases, reducing the performance as fewer thread blocks
can run concurrently on each SM. In Figure 3.3(a), the number of thread blocks
that execute concurrently on each SM decreases from 8 to 3 as the filter size
increases. This is visible in the graph as five distinct edges where performance
decreases slightly. Between these edges performance slightly increases again,
because larger filter sizes have a higher arithmetic intensity.
Figure 3.3(a) also shows a few performance peaks, the largest at the high edge
when the filter width is 33. This high edge occurs because the other filter sizes
suffer from shared memory bank conflicts that occur when multiple threads try
to access different values in the same bank. Our techniques for avoiding shared
memory bank conflicts in 2D convolution kernels are discussed in Section 3.2. The
smaller peaks occur at filter widths 5, 9, and 17, which execute with fewer bank
conflicts than other filter widths, except for width 33. This is because multiple
rows of border and pixel data add up to multiples of 32 banks, for these filter
widths, and therefore some of the memory accesses will run conflict free. For
more information on shared memory bank conflicts see the CUDA Programming
Guide [62].
3.1.3 1xN Tiling
In the convolution kernels, discussed so far, each thread block processes a single
tile of pixels from the input image. However, the number of tiles each thread block
computes can be increased. The approach of processing multiple tiles in the hori-
zontal dimension is often referred to as 1xN tiling [74] or thread-block merge [103].
Increasing the amount of work per thread eliminates redundant instructions such
as array index calculations, loop accounting, and loading values from the convo-
lution filter that were previously distributed across different threads, as shown in
3.1. Naive implementation and well-known optimizations 39
 4
 8
 12
 16
 20
 24
 28
 32
 36
 40  4  8
 12 16
 20 24
 28 32
 36 40
 50
 100
 150
 200
 250
 300
 350
2D Convolution kernel using shared memory and 16x16 thread block on a GTX 680
GFLOP/s
Filter height
Filter width
 50
 100
 150
 200
 250
 300
 350
 4
 8
 12
 16
 20
 24
 28
 32
 36
 40  4  8
 12 16
 20 24
 28 32
 36 40
 50
 100
 150
 200
 250
 300
 350
2D Convolution kernel using shared memory and 32x32 thread block on a GTX 680
GFLOP/s
Filter height
Filter width
 50
 100
 150
 200
 250
 300
 350
(a) (b)
Figure 3.3: Performance of the kernel using shared memory and (a) a 16x16
thread block size and (b) a 32x32 thread block size.
1 sum0 = 0, sum1 = 0;
2 for (j=0; j<Fh; j++) {
3 for (i=0; i<Fw; i++) {
4 sum0 += S[j*Sw+i] * F[j][i];
5 sum1 += S[j*Swi+Bw] * F[j][i];
6 }
7 }
Figure 3.4: Pseudo code for the main loop of the 1x2 tiled 2D convolution kernel.
Figure 3.4. Additionally, a thread block that computes two neighboring tiles from
the input image no longer needs to load the overlapping borders between the two
tiles, further increasing arithmetic intensity. The total amount of shared memory
allocated to each thread block increases to (Fw − 1 +Bw ×N)× (Fh − 1 +Bh),
when 1xN tiling is used. When the amount of work per thread is doubled, only
half the number of thread blocks are created to execute the same kernel.
Figure 3.4 shows how 1x2 tiling may be applied to the 2D convolution kernel. S
is the dynamically allocated shared memory. Bw is the width of the thread block.
The code on Line 5 is inserted to allow threads to compute two intermediate
sums, while reusing the filter weight stored in F. Register usage increases by one
register per thread to store the additional intermediate result. Tiling increases
the amount of work per thread, which requires more registers per thread and
more shared memory per thread block. Therefore, the tiling factor can only be
increased within the resource limits of the device.
Figure 3.5 shows the performance of the 1x2 and 1x4 tiled kernels on a
GTX680 using a 32x32 thread block. Using large thread blocks reduces memory
bandwidth consumption of the kernel, because with fewer thread blocks there
are fewer overlapping neighborhoods between the tiles processed by each thread
block. Because of the large number of threads in each thread block at most two
40 Chapter 3. Optimizing convolution operations on GPUs
 4
 8
 12
 16
 20
 24
 28
 32
 36
 40  4  8
 12 16
 20 24
 28 32
 36 40
 100
 150
 200
 250
 300
 350
 400
 450
2D Convolution kernel 1x2 tiled with a 32x32 thread block on a GTX 680
GFLOP/s
Filter height
Filter width
 100
 150
 200
 250
 300
 350
 400
 450
4 8 12 16 20 24 28 32 36 40 4 8
1216
2024
2832
3640
100150
200250
300350
400450
500
2D Convolution kernel 1x4 tiled with a 32x32 thread block on a GTX 680
GFLOP/s
Filter height
Filter width
100
150
200
250
300
350
400
450
500
(a) (b)
Figure 3.5: Performance of (a) 1x2 and (b) 1x4 tiled kernels executed with 32x32
thread block.
 4
 8
 12
 16
 20
 24
 28
 32
 36
 40  4  8
 12 16
 20 24
 28 32
 36 40
 100
 150
 200
 250
 300
 350
 400
 450
2D Convolution kernel 1x2 tiled with a 16x16 thread block on a GTX 680
GFLOP/s
Filter height
Filter width
 100
 150
 200
 250
 300
 350
 400
 450
 4
 8
 12
 16
 20
 24
 28
 32
 36
 40  4  8
 12 16
 20 24
 28 32
 36 40
 100
 150
 200
 250
 300
 350
 400
 450
2D Convolution kernel 1x4 tiled with a 16x16 thread block on a GTX 680
GFLOP/s
Filter height
Filter width
 100
 150
 200
 250
 300
 350
 400
 450
(a) (b)
Figure 3.6: Kernels executed with a 16x16 thread block, most filter widths cause
bank conflicts, (a) using a 1x2 tiled kernel (b) using a 1x4 tiled kernel.
thread blocks can execute on each SM at any given time. However, as the fil-
ter size increases shared memory usage also increases, which limits the number
of thread blocks that can execute concurrently on each SM. The edge in both
performance graphs is exactly caused by this drop in occupancy, from 64 warps
(i.e. two 32x32 thread blocks) to 32 warps (i.e. a single 32x32 thread block). In
fact, for the 1x4 tiled kernel, filter sizes (41x43) and (43x43) require more shared
memory than available per SM and therefore cannot be executed.
Figure 3.6 shows the performance of the 1x2 and 1x4 tiled kernels on a
GTX680 using a 16x16 thread block. The performance in both figures decreases
as the filter size increases, because increased shared memory usage decreases the
number of thread blocks that can execute concurrently. Overall, the performance
for both kernels is dominated by shared memory bank conflicts, except when the
3.2. Avoiding shared memory bank conflicts 41
filter width is exactly 17. The next section explains why shared memory bank
conflicts occur and how they can be avoided.
3.2 Avoiding shared memory bank conflicts
The previous section discussed an implementation of the 2D convolution kernel
based on existing optimization techniques, such as 1xN tiling. We showed that
kernels with higher tiling factors often execute more efficiently, but may require
more shared memory than available on the device when combined with a large
thread block size (32x32). The shared memory requirements of the kernel can be
reduced by reducing the thread block size to, for example 16x16. However, using
smaller thread blocks may cause shared memory bank conflicts. Therefore, this
section explains in detail why shared memory bank conflicts occur and presents
a novel approach to avoid bank conflicts in convolution kernels.
To understand why bank conflicts can occur we need to understand how the
threads access the shared memory data structure. The code for loading the data
into shared memory is shown in Figure 3.7. In 2D convolution, it is crucial to
assign 2-dimensional workloads to thread blocks to maximize data reuse. In our
approach, the threads within a thread block cooperatively load the area needed
from the input image and store it in shared memory as shown in Figure 3.7. The
statement at Line 4 sets the input image pointer to the start of the area loaded
by the thread block. Note, that this does not include individual thread indexes
yet. The for-loops on lines 9 and 10 ensure that each thread loads a different
item and that the threads cooperatively load the required 2D area of pixels into
shared memory.
In modern graphics hardware, any shared memory read or write of n addresses
that fall in n distinct memory banks can be serviced simultaneously, yielding an
overall bandwidth that is n times as high as the bandwidth of a single module [62].
In CUDA, multi-dimensional arrays are wrapped row-wise across different shared
memory banks. Whenever two threads from the same warp access different values
in the same memory bank, a shared memory bank conflict occurs, causing the
accesses to be serialized. Given the access pattern in Figure 3.7, it depends on
the thread block width and the convolution filter width whether or not a shared
memory bank conflict will occur. Note that when the thread block width is equal
to the number of memory banks, no shared memory bank conflicts occur as all
threads in each warp access values in different banks, independent of the filter
dimensions.
Figure 3.8 illustrates how bank conflicts occur when the thread block width is
16 and the convolution filter width is 23. The light gray colored values represent
the values that belong to the first row of pixels required by this thread block. The
dark gray colored values belong to the second row; note that both rows include
42 Chapter 3. Optimizing convolution operations on GPUs
1 int ty = threadIdx.y;
2 int tx = threadIdx.x;
3
4 iPtr += blockIdx.y*Bh*Iw + (blockIdx.x*Bw);
5
6 int jEnd = Fh -1 + Bh;
7 int iEnd = Fw -1 + Bw;
8 for (j=ty; j<jEnd; j+= Bh) {
9 for (i=tx; i<iEnd; i+= Bw) {
10 shared_mem[j*Sw + i] = iPtr[j*Iw + i];
11 }
12 }
13 __syncthreads ();
Figure 3.7: Pseudo code that allows threads within a thread block to coopera-
tively load a rectangular area of size (Fh − 1 +Bh)× (Fw − 1 +Bw) and store it
in shared memory.
border pixels. The top row of numbers (in bold) number the 32 memory banks.
The other numbers represent the 32 threads in a warp that all request the value
of the first pixel in their neighborhood. In the top figure, bank conflicts occur as
threads 6 to 15 and 16 to 25 access different values in the same memory banks.
In the bottom figure, bank conflicts are avoided as threads 16 to 32 are directed
towards different memory banks. This can be achieved by introducing a number of
padding columns to the array stored in shared memory. Padding was introduced
as a technique to solve shared memory bank conflicts for a matrix transpose
kernel [71]. However, in matrix transpose adding a single padding column to the
2D array is enough to solve all bank conflicts. In 2D Convolution this is a much
harder problem as the required number of padding columns depends on the size
of the convolution filter and thread block dimensions.
Our approach to avoiding bank conflicts is to extend the 2-dimensional array
stored in shared memory with zero or more padding columns. The width of the
array without padding is defined as Sw = Fw−1+Bw. The width of the padded
array is obtained by increasing Fw − 1 to the first multiple of the number of
memory banks, Sw = dFw−1M e ×M + Bw, where M is the number of memory
banks. The new width of the array is chosen such that each consecutive group
of threads from the same row (i.e. with same threadIdx.y) within a single warp
starts at the first unused memory bank. This guarantees that all threads within
that warp access values in different memory banks, and thus guarantees conflict
free access to shared memory. Note that when Bw is a multiple of the number of
memory banks no padding is required, and when Bw is less than half the number
of memory banks this approach may use more padding than strictly necessary.
However, the advantage of this approach is that the same kernel code as listed in
Figure 3.7 may be used to obtain a bank conflict free implementation.
3.2. Avoiding shared memory bank conflicts 43
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Figure 3.8: Mappings between data and memory banks in shared memory, both
without padding (top figure) and with padding (bottom figure). The top row of
numbers (in bold) number the 32 memory banks. The other numbers represent
the 32 threads in a warp that all request the value of the first pixel in their
neighborhood. In the top picture bank conflicts occur as threads 6 to 15 access
different values in the same banks as threads 16 to 25.
An alternative approach to avoiding bank conflicts may consider rearranging
the order in which threads compute their local sum. The idea behind this ap-
proach is that threads may compute their local sum in any order, and as such,
may direct themselves to different memory banks in order to avoid bank conflicts.
There are two important drawbacks to this approach. First of all, when threads
within a warp compute their local sum in different orderings, weights from the
convolution filter can no longer be broadcast, increasing memory bandwidth con-
sumption. Secondly, reordering the computations for threads within the same
warp introduces a significant number of instructions for index calculations, in-
creasing the execution time.
The performance of both kernels with and without padding is shown in Fig-
ure 3.9, limited to square filters only for visibility. This clearly shows that the
kernel with padding outperforms the one without padding in nearly all cases.
For filter size 17x17 in Figures 3.9(a) and (b) the performance of both kernels
is exactly the same. This is because for this particular filter size all kernels run
conflict-free and no padding is required. For filter size 41x41, the additional
shared memory required for padding limits the occupancy and causes a drop
in performance larger than the performance hit caused by bank conflicts. Fig-
ure 3.9(b) shows that our approach for avoiding shared memory bank conflicts
also applies to the GTX480, because cards of the Fermi architecture also have 32
memory banks. The performance across different filter sizes is less stable on the
GTX680, compared to the GTX480, because drops in occupancy have a much
larger effect on performance for the GTX680 [61]. This is because the Kepler
cards have fewer, but larger SMs. Therefore, the SMs on Kepler are capable of
44 Chapter 3. Optimizing convolution operations on GPUs
 100
 150
 200
 250
 300
 350
 400
 450
3x3 7x7 11x11 15x15 19x19 23x23 27x27 31x31 35x35 39x39 43x43
G
FL
O
P/
s
filter width and height (only square filters used)
Performance effects of bankconflicts for 1x4 tiled 2D Convolution on GTX680
16x16 with padding
16x16 with conflicts
 100
 150
 200
 250
 300
 350
 400
 450
3x3 7x7 11x11 15x15 19x19 23x23 27x27 31x31 35x35 39x39 43x43
filter width and height (only square filters used)
Performance effects of bankconflicts for 1x4 tiled 2D Convolution on GTX480
16x16 with padding
16x16 with conflicts
(a) (b)
Figure 3.9: Kernels executed with 16x16 thread block. Padding is used to prevent
bank conflicts. (a) shows execution on the GTX680 and (b) on the GTX480.
executing more thread blocks concurrently. However, as the amount of shared
memory per SM is the same for both Kepler and Fermi SMs, only smaller filters
can be executed at higher occupancy. A more detailed analysis of the implications
of both architectures on 2D convolution is given in Section 3.5.
Figure 3.10 shows the performance for the full range of tested filter sizes using
padding to prevent bank conflicts. Figure 3.10 may be compared to Figure 3.6 to
see that padding improves the performance of the kernel for nearly all filter sizes.
The performance drop that occurs right after the filter width is increased beyond
17 occurs because of the increase in shared memory use by the padding columns
that avoid bank conflicts. Performance drops in the other direction, when the
filter height is increased, also occur because the increase in shared memory usage
causes a reduction in the occupancy.
Convolution operations with large filters have large overlapping borders be-
tween tiles, and thus require more shared memory per thread block. Tiling in-
creases the use of shared memory even further. In fact, the 1x4 tiled kernel,
shown in Figure 3.5(b), requires more shared memory than available on the de-
vice when executed with filter sizes (41x43) and (43x43). This shows the very
limited flexibility of the 1xN tiling approach. Therefore, tiling must be kept to
a minimum for large filters. However, when only small filters are used, which
is typical in many applications, a higher tiling factor may be applied to achieve
more efficient execution on the GPU. The next section presents our adaptive tiling
approach that addresses exactly this issue.
3.3. Adaptive tiling 45
 4
 8
 12
 16
 20
 24
 28
 32
 36
 40  4  8
 12 16
 20 24
 28 32
 36 40
 100
 150
 200
 250
 300
 350
 400
 450
2D Convolution kernel 1x2 tiled with a 16x16 thread block on a GTX 680
GFLOP/s
Filter height
Filter width
 100
 150
 200
 250
 300
 350
 400
 450
 4
 8
 12
 16
 20
 24
 28
 32
 36
 40  4  8
 12 16
 20 24
 28 32
 36 40
 100
 150
 200
 250
 300
 350
 400
 450
 500
2D Convolution kernel 1x4 tiled with a 16x16 thread block on a GTX 680
GFLOP/s
Filter height
Filter width
 100
 150
 200
 250
 300
 350
 400
 450
 500
(a) (b)
Figure 3.10: Kernels executed with 16x16 thread block. Padding is used to
prevent bank conflicts. (a) using a 1x2 tiled kernel (b) using a 1x4 tiled kernel.
3.3 Adaptive tiling
For many CUDA kernels, such as the matrix multiplication kernel described
in [74], there exists a one-size-fits-all best tiling factor that can be set at compile
time and will create the most efficient kernel for any input size used at runtime.
For convolution operations this approach is too restrictive as the efficiency of the
kernel is dictated by the size and shape of the convolution filter. In this section,
we present Adaptive Tiling as a new optimization approach for convolution op-
erations on GPUs. With adaptive tiling the tiling factor is selected at runtime
depending on the input data and the resource limitations of the device.
Adaptive tiling allows our convolution operations with relatively small filters
to be executed with higher tiling factors and operations with relatively larger
filters by kernels with lower tiling factors. This approach further increases the
arithmetic intensity, especially for small convolution filters, which is given by
AI =
2× Fw × Fh ×Bw × T ×Bh
(Fw − 1 +Bw × T )× (Fh − 1 +Bh)× 4 (3.2)
where T is the tiling factor.
In fact, to select a tiling factor at run time means to select a particular
implementation of a kernel to run. A small script is used to generate the variations
of the kernel for a set of tiling factors before compilation. These kernels are all
included in the library and can be selected by the run-time system. Instead of
using a script, one might also use for example, C++ templates. Our adaptive
tiling approach works independently of how the kernels are generated.
There are two main resource constraints that need to be considered when
increasing the tiling factor: shared memory and the register file. First, before the
tiling factor is increased, shared memory only contains the pixels processed by this
46 Chapter 3. Optimizing convolution operations on GPUs
thread block and their overlapping borders, using relatively little shared memory.
More efficient execution can be achieved when the tiling factor is increased, at
the cost of increasing the amount of shared memory the kernel uses. Therefore,
the tiling factor can only be increased as long as the tiles and the border around
them still fit in shared memory.
Second, as the tiling factor increases, more registers are required per thread.
Therefore, the tiling factor can only be increased up to the point where one thread
block consumes the entire register file on an SM. However, the total number
of registers used by a thread block may be reduced, by reducing the number
of threads per thread block. This slightly increases the use of global memory
bandwidth as more thread blocks will be required to complete the computation.
However, if the decrease in thread block size allows further increase in the tile
factor, the total number of thread blocks does not necessarily increase. This
suggests that it might be worthwhile to lower the number of threads per thread
block to allow for higher tiling factors.
Our approach uses a compile-time fixed thread block size, while the highest
possible tiling factor is selected at run time based on the largest possible fit within
the shared memory available on the device and the convolution filter size of the
operation at hand. Note that this means each SM can execute no more than
one thread block simultaneously, because this single thread block will consume
almost all of the available shared memory. While this approach yields a very
flexible and efficient implementation on GPUs of the Fermi architecture [58], the
selection scheme needs some fine tuning to be efficient for the Kepler cards.
The SMs of the Kepler architecture need roughly twice as much parallelism per
SM, compared to Fermi SMs [61]. For 2D convolution, increased parallelism per
SM can only be achieved by increasing occupancy, which means either creating
larger thread blocks or allowing multiple thread blocks to execute concurrently on
each SM. Both approaches for increasing occupancy imply a significant increase
in the required amount of shared memory, while the Kepler SMs do not have
more shared memory than Fermi SMs. For this reason our run-time selection
scheme on the Kepler chooses the highest possible tiling factor such that two
thread blocks can still run concurrently on each SM for small filter sizes. For
larger filter sizes it is still efficient to simply select the highest possible tiling
factor.
Figure 3.11 shows the achieved performance for the 2D convolution kernel
that uses the adaptive tiling optimization for the best performing thread block
sizes. The filter dimensions are used to determine the resource requirements of
the kernel, which are then used to select the best tiling factor within the resource
limitations of the device present at runtime. The edges in the performance graphs
(a) and (b) are caused by changes in the tiling factor. Figures (c) and (d) show
3.3. Adaptive tiling 47
 4
 8
 12
 16
 20
 24
 28
 32
 36
 40  4  8
 12 16
 20 24
 28 32
 36 40
 100
 150
 200
 250
 300
 350
 400
 450
 500
2D Convolution kernel with Adaptive Tiling (32x32) on a GTX 680
GFLOP/s
Filter height
Filter width
 100
 150
 200
 250
 300
 350
 400
 450
 500
 4
 8
 12
 16
 20
 24
 28
 32
 36
 40  4  8
 12 16
 20 24
 28 32
 36 40
 1
 2
 3
 4
 5
Tile factors used in the adaptive tiled kernel (32x32) on a GTX 680
Tile factor
Filter height
Filter width
 1
 2
 3
 4
 5
(a) (c)
 4
 8
 12
 16
 20
 24
 28
 32
 36
 40  4  8
 12 16
 20 24
 28 32
 36 40
 100
 150
 200
 250
 300
 350
 400
 450
 500
2D Convolution kernel with Adaptive Tiling (16x32) thread block on a GTX 480
GFLOP/s
Filter height
Filter width
 100
 150
 200
 250
 300
 350
 400
 450
 500
 4
 8
 12
 16
 20
 24
 28
 32
 36
 40  4  8
 12 16
 20 24
 28 32
 36 40
 6
 8
 10
 12
 14
 16
Tile factors used in the adaptive tiled kernel (16x32) on a GTX 480
Tile factor
Filter height
Filter width
 6
 8
 10
 12
 14
 16
(b) (d)
Figure 3.11: (a) and (b) performance using the adaptive tiling optimization on
GTX680 and GTX480. (c) and (d) the tiling factors selected by the run time
system.
the actual tile factors that were selected by the run-time system to create the
corresponding performance graphs shown in (a) and (b).
The adaptive tiling approach for the GTX680 uses a 32x32 thread block size,
similar to Figure 3.5. Compared to Figure 3.5(b) adaptive tiling improves the
efficiency of 121 out of the 441 tested filter sizes, for the other filter sizes the
same tiling factor is selected and therefore the performance is the same. Addi-
tionally, the adaptive tiling implementation is able to execute filter sizes 41x43
and 43x43, because the run-time system decreases the tiling factor if the resource
requirements exceed that of the device. The relatively large thread block size
used on the GTX680 limits the range of possible tiling factors, but is necessary
to supply sufficient parallelism to the Kepler SMs. However, the fixed thread
block size of 32x32 is not the best performing for each filter size within our test
range. Therefore, our parameter sweep discussed in the next section, also uses
different thread block sizes for operations with different filter sizes.
48 Chapter 3. Optimizing convolution operations on GPUs
3.4 Adaptive tiling combined with loop unrolling
One important optimization that we have not considered until now is loop un-
rolling. Loop unrolling introduces trade off between flexibility and efficiency, as
the number of iterations for all loops of a given kernel have to be known at com-
pile time. This means that for each possible filter size we create a unique CUDA
kernel and optimize it individually. Instead of selecting the highest possible tiling
factor that still fits in shared memory, the run time system now selects the CUDA
kernel with the best performing combination of block size and tiling factor given
a particular filter size. We have implemented this using a look-up table which
holds this information for each possible filter size in the expected range of filter
sizes.
This approach has a number of drawbacks: First of all, the number of CUDA
kernels becomes quite large, which also increases compilation times. Secondly,
the range of filter sizes that may be used by the applications have to be known at
the time the library is compiled. However, it is important to know exactly how
much performance can be gained from unrolling each kernel. Although creating
an optimized library using this optimization is considerably more work, it is still
worthwhile considering that many researchers from the multimedia domain can
benefit from this effort.
We have generated the unrolled versions of our CUDA kernels for each possi-
ble filter size ranging from 3 to 43 in both dimensions. At the time the kernel is
generated the thread block width and height as well as the tiling factor have to
be known, as these determine the number of iterations of each unrolled loop. For
each filter size, a search is performed through a range of selected thread block di-
mensions and tiling factors. The selected ranges for thread block dimensions are
powers of two from 16 to 64 in the x-dimension and 4 to 64 in the y-dimension.
The tiling factors range from 1 to 10. In this search, we have tested the per-
formance of 110 (11 block sizes × 10 tiling factors) unrolled kernels generated
specifically for each of the 441 evaluated filter sizes. The results are stored in a
look-up table, which for each filter size stores a reference to the best performing
unrolled kernel as well as the execution parameters required at launch time, such
as the tiling factor and thread block dimensions.
Whenever the host code of the library is required to execute a 2D convolu-
tion operation on the GPU at run time, it uses the look-up table to select the
best performing implementation that should be used based on the filter dimen-
sions of the operation at hand. The look-up table also specifies the tiling factor
and thread block dimensions, which are used to compute several execution pa-
rameters, such as the amount of dynamically allocated shared memory, possibly
including padding, and the total number of thread blocks required to complete
execution of the kernel.
3.4. Adaptive tiling combined with loop unrolling 49
 4
 8
 12
 16
 20
 24
 28
 32
 36
 40  4  8
 12 16
 20 24
 28 32
 36 40
 200
 300
 400
 500
 600
 700
 800
 900
 1000
2D Convolution tiled and unrolled on a GTX 680
GFLOP/s
Filter height
Filter width
 200
 300
 400
 500
 600
 700
 800
 900
 1000
Figure 3.12: Performance for 2D convolution using a uniquely generated tiled
and unrolled kernel for each filter size.
The look-up table, which is unfortunately too large to be included in this text,
essentially describes several of the properties of the underlying device and could
potentially be used to guide the optimization process of other kernels. For 2D
convolution, thread block size 16x32 was the best performing thread block size
for 49.7% of the tested filter sizes. In fact, 342 out of the 441 best performing
kernels use a thread block width of 16. This means our technique for avoiding
shared memory bank conflicts, as described in Section 3.2, is used in 77.6% of the
best performing 2D convolution kernels. Although using a thread block width
of 16 increases resource usage in terms of shared memory, and possibly reduces
the occupancy, the fact that the thread block size is small, allows for higher
tiling factors. High tiling factors significantly reduce the number of redundant
instructions previously spread across different threads, which has a large impact
on the performance of compute-bound kernels. The average tiling factor used in
the 441 best performing kernels is 4.16, tiling factor 2 is used for 21.3% of the
filter sizes, factor 4 in 33.6%, and tiling factors larger than 4 are used in 34.2%
of the kernels. This confirms the limitations of using a compile-time fixed tiling
factor and shows why adaptive tiling is so important to convolution operations.
Figure 3.12 shows the achieved performance for the 2D convolution kernel that
uses a specifically generated CUDA kernel for each filter size. In Figure 3.12, the
best performing kernel is the one that performs a convolution with filter size 17x43
and achieves about 938.4 GFLOP/s, which uses a thread block size of 16x64 and
tiling factor 6. More importantly, the smallest and most commonly used 2D filter
size (3 × 3) improved from 135.9 with adaptive tiling to 251.3 GFLOP/s when
combining adaptive tiling with unrolling.
50 Chapter 3. Optimizing convolution operations on GPUs
 100
 200
 300
 400
 500
 600
 700
 0  5  10  15  20  25  30  35  40  45
G
FL
O
P/
s
Filter height/width
Separable Convolution kernel tiled and unrolled on the GTX 680
row
col
Figure 3.13: Performance of separable convolution on the GTX680 using adaptive
tiling combined with unrolling.
The optimization approach presented here can also be applied to other op-
erations from the image processing domain, including neighborhood operations,
such as erosion, dilation, separable convolution, and anisotropic Gaussian filter-
ing. Separable convolution, for example, considers convolution operations with
2-dimensional filters that may be split in two 1-dimensional filters, which are
applied consecutively to obtain the same result as when performing a full 2D
convolution. The separable convolution operation has considerably lower com-
plexity, but unfortunately cannot be used for all 2D convolution filters. We have
implemented separable convolution by implementing and optimizing the hori-
zontal and vertical 1-dimensional filtering operations as two separate operations
and applied the adaptive tiling combined with unrolling optimization to both
operations.
The tested thread block sizes range from 16 to 64 in the x-dimension and 4 to
64 in the y-dimension. The tested range of tiling factors is from 1 to 20. Note that
tiling for the vertical filtering operation is performed vertically, instead of in the
x-direction as described in Section 3.1. This is necessary for increasing data reuse
within thread blocks, where horizontal tiling would only increase the number of
border pixels each thread block loads, without actually increasing data reuse.
Because of the 1D filters, the separable convolution operation uses considerably
less shared memory than 2D convolution. Therefore, it is possible to allow for
much higher tiling factors.
The resulting look-up table for separable convolution is considerably smaller
than the look-up table for 2D convolution and is therefore shown in Table 3.1.
3.4. Adaptive tiling combined with loop unrolling 51
Fw Tiling Factor Block size Fh Tiling Factor Block size
3 4 32x4 3 5 64x4
5 4 32x4 5 4 32x4
7 5 32x4 7 4 32x4
9 4 32x4 9 3 32x4
11 4 32x4 11 4 32x4
13 8 32x4 13 3 32x4
15 8 32x4 15 6 32x8
17 8 32x4 17 6 32x8
19 8 32x4 19 7 32x8
21 8 32x4 21 7 32x8
23 7 16x8 23 6 32x8
25 7 16x8 25 5 32x8
27 7 16x8 27 6 32x8
29 7 16x8 29 6 32x8
31 7 16x8 31 8 32x8
33 7 16x8 33 4 32x8
35 6 16x8 35 5 32x8
37 6 16x8 37 5 32x8
39 6 16x8 39 7 32x8
41 8 16x8 41 8 16x8
43 8 16x8 43 6 16x8
Table 3.1: Look-up table for horizontal (left) and vertical (right) filtering as used
in separable convolution on the GTX680.
The look-up table again provides insights that could guide future optimization
searches for similar kernels. For the horizontal filtering operation we observe
thread block size 32x4 was the most efficient for the 10 smallest filters. For larger
filters, thread block size 16x8 created the most efficient kernels when combined
with tiling factors of 6 or higher. In 11 out of the 21 filter sizes a thread block
width of 16 is used, which means the kernel also uses our technique for avoiding
shared memory bank conflicts as explained in Section 3.2. However, for the
vertical 1D-filtering operation bank conflicts cannot occur, because no border
pixels are loaded in the x-direction. For the vertical filtering, thread block size
32x4 is the most efficient for 18 out of 21 filters using tiling factors ranging from
3 to 8. Figure 3.13 shows the performance of the kernels from Table 3.1 for both
horizontal and vertical filtering.
52 Chapter 3. Optimizing convolution operations on GPUs
 0
 100
 200
 300
 400
 500
 600
 700
 800
 900
3x3 7x7 11x11 15x15 19x19 23x23 27x27 31x31 35x35 39x39 43x43
G
FL
O
P/
s
filter width and height (only square filters used)
Performance of different 2D Convolution kernels on GTX680
adapt+unroll
adapt tiled
32x32 tiledx4
32x32 shared
32x32 naive
Figure 3.14: Performance for several different implementations of the 2D convo-
lution problem on the GTX680.
3.5 Evaluation
In this section, we present a short evaluation of the performance effects of each
individual optimization step for the GTX680. The performance results of each
optimization step are also given for the GTX480 and Tesla K20 GPUs to show
the effects on different architectures. To simplify the discussion we only show the
performance results for the square filter sizes in our test range, i.e. 3x3, 5x5, etc.
Figure 3.14 shows the performance of each optimization step discussed in this
chapter on the GTX680. The performance of the naive kernel implementation
is limited by the memory bandwidth and does not exceed 106.3 GFLOP/s. Fig-
ure 3.14 clearly shows that each optimization step we apply further improves
performance. The adaptive tiling optimization again uses a fixed thread block
size of 32x32 and dynamically chooses the tiling factor at runtime. The per-
formance of the adaptive tiling and the statically 1x4 tiled implementation is
similar, because the adaptive tiling selects tiling factor 4 for the majority of the
square filters on the GTX680. Note that for filter size 43x43 the statically 1x4
tiled implementation cannot execute, as it requires more shared memory than
3.5. Evaluation 53
 0
 100
 200
 300
 400
 500
 600
 700
3x3 7x7 11x11 15x15 19x19 23x23 27x27 31x31 35x35 39x39 43x43
G
FL
O
P/
s
filter width and height (only square filters used)
Performance of different 2D Convolution kernels on GTX480
adapt+unroll
adapt tiled
32x32 tiledx4
32x32 shared
32x32 naive
Figure 3.15: Performance for several different implementations of the 2D convo-
lution problem on the GTX480.
available per SM, while the adaptive tiling algorithm simply selects to execute
the 43x43 filter with tiling factor 3. Our final optimization step demonstrates a
performance improvement of up to a factor of 9.14 over the naive implementa-
tion. This is not only due to the unrolling optimization, but also because the
best performing combination of thread block size and tiling factor is selected for
each individual filter. Note that unrolling the loops is only possible when the
number of iterations of all loops is known at compile time. This means that an
individual implementation is created for each filter size.
The performance effects of each optimization step on the GTX480 is shown
in Figure 3.15. The performance of the naive implementation on the GTX480 is
quite similar to the naive performance on the GTX680. For filter sizes 3x3 and
5x5, the naive implementation outperforms the implementation that explicitly
uses shared memory. This is because the naive approach benefits from the fact
that global memory loads are cached in L1 on the GTX480. The shared memory
implementation also benefits from caching in L1, albeit with the introduction of
a small overhead for additional instructions and synchronization. However, using
54 Chapter 3. Optimizing convolution operations on GPUs
shared memory is very important to achieving high performance, for larger filter
sizes or when the tiling factor is increased. For filter size 43x43, the tiled 1x4
implementation cannot execute, because it requires more memory than available
on the device. The adaptive tiling implementation does not suffer from this
issue, because it simply selects the implementation with the largest possible tiling
factor that still fits in shared memory. The adaptive tiling implementation uses
a thread block size of 16x32, which means it also uses our approach for avoiding
shared memory bank conflicts as explained in Section 3.2. The adaptive tiling
combined with unrolling implementation again improves significantly over the
other implementations.
The GTX680 (Figure 3.14) outperforms the GTX480 (Figure 3.15) for most
filter sizes. However, the difference is less than what one would expect from the
theoretical peak performance of both devices. There are several architectural
differences between Kepler and Fermi that limit the performance of 2D convo-
lution operations on Kepler cards. The most important one for 2D convolution
is that Kepler has 6 times less shared memory per compute core in each SM
compared to Fermi. Therefore, it becomes much harder to keep all 192 cores in
an SM occupied. When there is not enough shared memory, it is not possible to
either increase the thread block size or to increase the number of thread blocks
that run simultaneously on each SM. This causes low occupancy of each SM and
therefore low utilization of the 192 compute cores, as well as lower utilization
of global memory bandwidth and higher dependence on long latency operations.
All involved factors explain why performance is expected to be further from the
theoretical peak.
Figure 3.16 shows the performance of each optimization step on the K20. The
performance of the naive implementation on the K20 is about 24% faster than
the naive performance on the GTX480 and GTX680. It is important to note that
Figure 3.16 again shows that each optimization step we apply further improves
performance.
The K20 also has 48KB of read-only cache at L1 that may be used by global
memory loads for read-only data [60]. This requires that the global memory loads
are explicitly enclosed by the __ldg() intrinsic. The compiler transforms these
loads into non-coherent loads that pass through the read-only cache. In Fig-
ure 3.16, our most efficient implementation (adapt+unroll+nc) uses this specific
optimization. The fact that the performance of many of the kernels in Figure 3.16
improves by this optimization also indicates that the performance is related to
the achieved global memory bandwidth.
Note that Figure 3.16 shows only a very small subset of the filters that were
tested and that performance varies with the filter size in both dimensions. For
example, for filter size 17x43 our adaptive tiling combined with unrolling opti-
mization executes at just over a teraflop. Finally, the adaptive tiling combined
3.5. Evaluation 55
 0
 100
 200
 300
 400
 500
 600
 700
 800
 900
3x3 7x7 11x11 15x15 19x19 23x23 27x27 31x31 35x35 39x39 43x43
G
FL
O
P/
s
filter width and height (only square filters used)
Performance of different 2D Convolution kernels on K20
adapt+unroll+nc
adapt+unroll
adapt tiled
32x32 tiled 1x4
32x32 shared
32x32 naive
Figure 3.16: Performance for several different implementations of the 2D convo-
lution problem on the K20.
with unrolling optimization, both with and without using the read-only cache,
demonstrates a performance improvement of up to a factor 8.3 over the naive
implementation.
We observe that the tiling factors used in the best performing kernels on
the GTX680 and the K20 are generally lower than on the GTX480. However,
selecting the tiling factor at run time is even more important on the GTX680 and
the K20, than on the GTX480. While the tiling factors for the best performing
kernels range from 2 to 10 on the GTX480 and from 1 to 8 on the K20, tiling
factor 4 is the most common for both cards accounting for 44.2% on the GTX480
and only for 24.9% on the K20.
The look-up tables for both the GTX680 and the K20 also demonstrate that
selecting the thread block size at run time is still very important. We observe that
the thread block sizes used in the best performing kernels on the Kepler cards are
on average 56.5% larger than on the GTX480. This confirms that larger thread
blocks are required to achieve sufficient occupancy on the Kepler cards. Thread
block size 16x64 is the most common among the best performing kernels on the
56 Chapter 3. Optimizing convolution operations on GPUs
GTX680 and K20, accounting for 49.7% and 36.5%. For both Kepler cards, the
vast majority of the best performing kernels use thread block width of 16, 83.2%
on the GTX680 and 92.1% on the K20. Therefore, for most filter sizes in our
test range, the best performing kernel has a thread block width that requires
the use of our approach to avoid shared memory bank conflicts, as described in
Section 3.2.
3.6 Limitations
A 2D convolution operation in geometric space can also be implemented as a
point-wise multiplication in frequency space. Nvidia’s white paper [67] explains
their implementation of 2D convolution based on the CUFFT library. The ad-
vantage of this method is that it has lower complexity for very large filters. This
presents a limitation that is inherent to spatial solutions to the 2D convolution
problem. We therefore present a short comparison of both approaches.
The FFT approach has some minor limitations in the fact that it uses more
memory than a spatial solution. The input image needs to be rounded up to the
nearest power of two if the padded dimension is less than or equal to 1024, or to
the nearest multiple of 512 otherwise. Then the filter is wrapped and stored in
another image of the same size, roughly doubling the amount of GPU memory
required for the operation. However, due to the lower complexity the FFT-based
approach will become increasingly faster than any spatial solution as the filter
size increases. The question is which approach is fastest for which range of filter
sizes.
We have tested both our implementation and Nvidia’s implementation for
the FFT-based approach [67] that uses the CUFFT library, on the GTX680
given an image size of 4096x4096 for a filter range of 3 to 43 elements in both
dimensions. The contour plot, shown in Figure 3.17, shows the relative speedup
of our approach over the FFT approach. For the most commonly used filter sizes,
3x3 and 5x5, the speedup of our approach over the FFT-based approach is 18.96
and 10.09, respectively. Operations with filter sizes below and left of the ’1’ line,
the gray area in the figure, execute faster with our implementation than with the
FFT-based implementation. However, for very large filter sizes the FFT-based
approach is faster, simply because of the lower complexity of the algorithm.
It should be noted that the use of very large filters is rare in the MMCA
applications. It is however also possible to implement a library routine for 2D
convolution such that a run time choice is made between our implementation and
an FFT-based implementation based on the filter size of the operation at hand.
We consider this as future work.
3.7. Related work 57
Speedup of Spatial solution over FFT-based on the GTX 680
     0.5
       1
       2
       4
       8
      16
 3  7  11  15  19  23  27  31  35  39  43
Filter height
 3
 7
 11
 15
 19
 23
 27
 31
 35
 39
 43
Fi
lte
r w
id
th
 0
 2
 4
 8
 16
 20
Figure 3.17: Speedup of spatial solution over the FFT-based approach. Filter
sizes in the gray area are executed faster with our implementation than with the
FFT-based implementation.
3.7 Related work
The work presented in this chapter extends earlier work on implementing and
optimizing convolution operations for the GPU. Very early work on implementing
convolution operations on GPUs used shader programs in graphics APIs [64,
21], which can support only limited filter sizes due to the limited number of
instructions per pixel. This section discusses more recent work which is all based
on CUDA. We also present performance results to compare our method of using
shared memory for tiled convolution against a different method suggested by
Hwu [35]. Finally, we compare our optimized kernels with kernels created by the
auto-optimizing source-to-source compiler by Yang et al. [103].
OpenCV [1] is an open source library for Computer Vision and contains several
implementations for many image processing operations. OpenCV has a special
module named GPU operations, which also contains various implementations for
convolution operations. The 2D convolution operations discussed in this chapter
correspond with the filter2D method in OpenCV. OpenCV’s current implemen-
tation of 2D convolution for GPUs uses CUDA and is limited to 2D filters with
58 Chapter 3. Optimizing convolution operations on GPUs
a total size not greater than 256 elements. The only optimization that is ap-
plied to their CUDA implementation is the use of constant memory to store the
convolution filter. No shared memory or other optimizations are used.
Russo et al. [72] compare various implementations of separable convolution on
GPUs using CUDA, on FPGAs using Verilog and on CPUs using C and Matlab.
The optimizations that are applied to their CUDA implementation consist of
using shared memory and static 1xN tiling. They conclude that CUDA on GPUs
outperforms their FPGA and CPU implementations.
Hartung et al. [29] compare various implementations of 2D convolution us-
ing CUDA on GPUs and ANSI-C and OpenMP on CPUs. They refer to the
implementation of 2D convolution in the spatial domain as applying a spatially-
varying kernel. The optimizations they apply to their CUDA implementation
are using shared memory and static 1xN tiling. They conclude that their CUDA
implementation outperforms their various CPU implementations.
Nugteren et al. [57] introduce a classification of algorithmic skeleton imple-
mentations for image processing, which enable GPU code generation and mapping
of the algorithm to the GPU hardware. 2D convolution is discussed as an exam-
ple of the skeleton class for neighborhood-to-pixel image processing operations.
Local data reuse is exploited through the use of shared memory. The thread
block size is increased to the maximum allowed by the hardware to minimize the
amount of border pixels shared between thread blocks. No further optimizations
to increase arithmetic intensity or instruction efficiency, such as 1xN tiling, are
applied.
Al Umairy et al. [2] attempt to improve the performance of Nvidia’s FFT-
based implementation of 2D convolution. They optimize towards a specific use
case within the domain of electromagnetic diffraction modeling. While they argue
that many applications use small convolutions, it is unclear if this refers to small
input data or small convolution filters. Their conclusion is that the CUFFT
library should be extended with an operation that directly performs a complete
2D convolution operation, as opposed to using separate forward and inverse FFT
transformations. Their solution is not compared to an implementation in the
spatial domain.
Dastgeer et al. [17] have developed two performance metrics based on either
occupancy or tiling factor for our Adaptive Tiling optimization, based on our
earlier work [97]. These metrics help to guide the decision making process in
selecting the tiling factor at run time. Their experimental results confirm that
maximizing the tiling factor rather than occupancy results in the best perfor-
mance for GPUs of the Fermi architecture and older cards. As we have shown
in Section 3.3, GPUs of the Kepler architecture require a more balanced ap-
proach. In this chapter, we also show that even more efficient implementations
can be obtained by combining adaptive tiling with unrolling and choosing the
3.7. Related work 59
best performing combination of thread block size and tiling factor for each filter
size.
The 2D convolution problem is well known in the context of GPU Computing
and is often used as example to illustrate or teach how constant and shared
memory may be used to reduce the global memory bandwidth consumption of
a kernel [89, 35, 34]. However, these descriptions of the optimization process do
not go beyond the discussion of well-known techniques presented in Section 3.1
and only consider a static 5x5 filter size.
In contrast to our approach of using shared memory (explained in Section 3.1),
Hwu [35] suggests keeping the number of elements in shared memory equal to
the number of threads in the block. This simplifies the loading process, as each
thread loads only one element from the input image. As a consequence, not all
threads can be active during the computation as the threads near the edge of
the tile do not have enough information to complete their computation. There
are two important drawbacks to this approach. First, as some threads are not
participating in the computation, more thread blocks must be created to complete
the computation. This limits the opportunity for data reuse within the thread
block and increases overall global memory bandwidth consumption. Secondly,
this approach does not scale with increasing filter sizes. For the 5x5 filter size,
threads in the last four rows and columns do not participate in the computation.
As the filter size is increased, the number of active threads within each thread
block drops dramatically. More importantly, while the opportunity for data reuse
increases with larger filter sizes, data reuse in this approach actually decreases as
there are less active threads in each thread block to reuse data.
The question that remains is which approach executes more efficiently consid-
ering only the 5x5 2D convolution filter. We have implemented both approaches
to using shared memory and unrolled both kernels since the filter size is known
at compile time. We have tested on GTX480 with an image of 4096x4096 and
a static filter of 5x5. Our implementation of Hwu’s approach achieves 180.8
GFLOP/s, while our approach of having some threads load multiple elements
and keep all threads active during computation achieves 229.8 GFLOP/s.
We have also compared our optimized kernels with kernels created by the
auto-optimizing source-to-source compiler by Yang et al. [103]. Their GPGPU
compiler takes a slightly modified naive kernel as input and outputs an optimized
kernel. The optimization strategy that the source-to-source compiler applies
to the 2D convolution kernel is primarily focused on using shared memory to
convert noncoalesced global memory accesses into coalesced memory accesses.
To this end, the compiler assumes that the number of loop iterations is always a
multiple of 16. This results in a kernel where the data required for the first 16
iterations of the innermost loop is loaded into shared memory by all threads in
the thread block, following the design that some threads load multiple elements.
60 Chapter 3. Optimizing convolution operations on GPUs
This guarantees coalesced memory accesses, as well as data reuse within each
block of 16 iterations for the innermost loop.
Unfortunately, many opportunities for data reuse are unexploited. After the
first 16 iterations the data in shared memory is overwritten with the data required
for the next 16 iterations which are again loaded from global memory, although
the elements could have been reused from shared memory as well. In contrast,
our implementation tries to maximize data reuse in both dimensions and for all
iterations, instead of only within each block of 16 iterations in the horizontal
dimension. However, we do believe that the optimization techniques presented
in this chapter could be integrated into an auto-optimizing compiler such as the
GPGPU Compiler.
To confirm that our approach yields a more efficient kernel, we have tested the
performance of a 2D convolution kernel created for a static filter size of 32x32.
On the GTX480 with an input image of 4096x4096, the optimized kernel created
by the GPGPU compiler for a 32x32 filtering operation takes 143.7 ms, achieving
239.0 GFLOP/s. Our optimized kernel took 45.8 ms for the same operation,
achieving 749.6 GFLOP/s.
3.8 Conclusions
Convolution operations are an essential tool in signal and image processing and
are typically responsible for a large fraction of the application’s execution time.
In this chapter, we have discussed the implementation and optimization of con-
volution operations for the domain of multimedia content analysis (MMCA). As
part of our work on transparent parallelization tools for the MMCA domain,
our goal is to obtain an efficient library based implementation for convolution
operations.
We have evaluated the performance effects of many different implementations
for the 2D convolution operation and separable convolution on the GTX680,
GTX480 and Tesla K20 graphics cards. We have introduced a new approach for
solving shared memory bank conflicts in the context of convolution operations as
well as an optimization approach, called adaptive tiling, for implementing efficient,
yet flexible, convolution operations on modern GPUs. We have also presented
an implementation that combines adaptive tiling with loop unrolling to obtain a
less flexible, yet highly efficient implementation. The approach is less flexible, as
the filter sizes used by the application at runtime have to be known at compile
time. While the approach presented in this chapter focuses on 2D convolution,
the techniques also apply to separable convolution.
All optimizations combined do have a very large impact on performance. More
importantly, each optimization step we introduce provides a significant perfor-
mance improvement for each tested GPU. For the GTX480, the most commonly
3.8. Conclusions 61
used 2D filter size in image processing (3x3) executes at 50.8 GFLOP/s using
the first naive implementation, 141.2 GFLOP/s when adaptive tiling is used, and
finally 336.0 GFLOP/s using tiling combined with loop unrolling. The best per-
forming kernel on the GTX480 is the one that performs a convolution with filter
size 7x39 and achieves about 854 GFLOP/s, which is at 63.6% of the theoretical
peak performance of the GTX480. For the GTX680, the 3x3 filter executes at
66.6 GFLOP/s using the naive implementation, 133.2 GFLOP/s with adaptive
tiling, and finally 251.3 GFLOP/s using tiling combined with loop unrolling. The
best performing kernel on the GTX680 is the one that performs a convolution
with filter size 17x43 and achieves 938.4 GFLOP/s, using a thread block size of
16x64 and tiling factor 6. GPUs of the Kepler architecture have much less shared
memory and less global memory bandwidth per compute core compared to GPUs
of the Fermi architecture. As a result, the compute performance of 2D convolu-
tion operations does not come as close to the theoretical peak performance as on
the GTX480. For example, our best performing kernel on the K20 executes a 2D
convolution operation with filter size 17x43 and achieves about 1003 GFLOP/s,
which is only at 28.5% of the theoretical peak performance.
The look-up table stores the results of a search through a set of generated
unrolled CUDA kernels and stores which thread block size and which tiling factor
was used in the creation of the best performing kernel for each filter size. The
resulting table essentially describes some of the properties of the underlying device
and could potentially be used to guide the optimization process of other kernels.
The look-up table also demonstrates the importance of our approach for solving
shared memory bank conflicts, which enabled us to also use thread blocks with a
width of 16 threads in an efficient manner. For 2D Convolution on the GTX480,
76.2% of the best performing kernels use a thread block width of 16 threads. For
the GTX680 and the K20, even 83.2% and 92.1% of the best performing kernels
use a thread block width of 16 and therefore require padding to avoid bank
conflicts. While padding increases resource usage in terms of shared memory,
and possibly reduces the occupancy, the fact that the thread block size is small,
allows for higher tiling factors. High tiling factors significantly reduce the number
of redundant instructions previously spread across different threads, which has a
large impact on the performance of compute-bound kernels.
Our evaluations show the limitations of implementing convolution operations
with a compile time fixed tiling factor. The look-up tables for 2D convolution
and separable convolution demonstrate the importance of our adaptive tiling ap-
proach. For example in 2D convolution on the GTX480, tiling factor 2 is used
for 24.0% of the filter sizes, factor 4 in 44.2%, and tiling factors larger than 4
in 31.8%. Selecting the tiling factor at run time is even more important on the
GTX680 and the K20. While the tiling factors for the best performing kernels
range from 2 to 10 on the GTX480 and from 1 to 8 on the K20, tiling factor 4
62 Chapter 3. Optimizing convolution operations on GPUs
is the most common for both cards accounting for 44.2% on the GTX480 and
only for 24.9% on the K20. This again shows why our adaptive tiling approach
is so important to optimizing convolution operations. Our evaluation also shows
that selecting the thread block size at run time is also very important for high
performance. We observe that the thread block sizes used in the best performing
kernels on the Kepler cards are on average 56.5% larger than on the GTX480.
This confirms that larger thread blocks are required to achieve sufficient occu-
pancy on the Kepler cards.
FFT-based implementations of the 2D convolution operation have a lower
complexity for convolution operations with very large convolution filters, com-
pared to spatial implementations such as the ones discussed in this chapter.
Therefore, for very large 2D convolution filters an FFT-based implementation
will outperform any spatial solution to the 2D convolution problem. However,
for the most commonly used 2D filter sizes, 3x3 and 5x5, the speedup of our
approach over Nvidia’s FFT-based approach on the GTX680 is a factor 18.96
and 10.09, respectively.
The look-up tables used for selecting the most efficient tiled and unrolled ker-
nel implementation are tightly linked to a specific GPU architecture. Therefore,
future work is directed towards developing performance models, as construct-
ing the look-up table currently requires a considerable amount of performance
testing.
We have made the source code of our kernels available from the author’s
homepage as part of the data-parallel Parallel-Horus programming model. The
look-up tables for all tested GPUs and our performance measurement data is also
made available on the web.
63
Chapter 4
Improving the performance of
the Parallel Ocean Program ∗
In the previous chapters, we have considered compute-intensive applications from
the domain of Multimedia Content Analysis. In this chapter we focus on an appli-
cation domain at the other end of the spectrum, Climate Modeling. The Parallel
Ocean Program is a representative ocean modeling application that is used as the
ocean component of the coupled Community Earth System Model (CESM) [22].
The Parallel Ocean Program is characteristic for applications with data-intensive
workloads that are difficult to scale up and yet perform large amounts of floating-
point calculations, spread over a large number of iterations.
It is currently not well known what specific parts of ocean models can benefit
the most from execution on GPUs, how the existing software should be revised
to efficiently use GPUs, and what impact the use of one or more GPU clusters
will have on performance. In this chapter, we aim to answer these questions that
correspond to Challenges 2, and 3, as defined in Chapter 1.
The field of Ocean Modeling is currently undergoing a paradigm shift in the
understanding of the processes controlling the global ocean circulation. Two
factors have contributed to this shift: (i) the now about twenty-year long record of
satellite data and (ii) the possibility to simulate the ocean circulation using high-
resolution models which include processes on relatively small scale (10–50km).
Resolving this scale captures the instability processes that lead to ocean eddies
(a local swirling of the ocean created when the fluid flows past an obstacle), which
subsequently interact and affect the large-scale ocean flow [94].
∗I would like to acknowledge Jason Maassen who had a major contribution in the de-
velopment of the hierarchical load balancing scheme presented in Section 4.4.1 and who also
performed the multi-cluster experiments discussed in Section 4.4.4.2.
64 Chapter 4. Improving the performance of the Parallel Ocean Program
The level of realism (in relation to available observations) in simulating the
ocean with high-resolution, strongly eddying models substantially increases com-
pared to the low-resolution models in which the effects of eddies are parametrized.
For example, it leads to a much better simulation of the different oceanic bound-
ary currents, in particular the separation of the Gulf Stream in the Atlantic.
Also, the degree to simulate the surface kinetic energy distribution, which can be
compared with satellite data, markedly improves [51, 85].
The use of the strongly eddying models is, even on the supercomputing plat-
forms currently available, still computationally expensive, and simulations have
a long turn-around time. Typical performances are from one to a few model
years per 24h using thousands of cores [19]. Considering the fact that it takes at
least 1000yr to reach a near statistical equilibrium state, innovations to increase
the performance of these models and to efficiently analyze the data from the
simulations have a high priority.
GPUs have been used to successfully accelerate weather prediction mod-
els [27, 55, 83] and atmospheric models [18, 37]. However, for the field of ocean
modeling GPUs have only been successfully used for simple fluid dynamics com-
putations [76, 91], barotropic ocean models [9], and regional models [50].
We present two innovations to improve the performance of a widely-used
global ocean circulation model, the Parallel Ocean Program (POP). First, we ad-
dress alternative domain decomposition schemes and hierarchical load balancing
strategies which enable multi-platform simulations such that further scaling can
be achieved. Second, we show how POP can be adapted to run on GPUs and
study the effect of GPU usage on its performance.
The source code of our modified version of POP can be obtained from https:
//github.com/NLeSC/eSalsa-POP/. POP is also used as the ocean component
of the much used Community Earth System Model (CESM). We have applied our
modifications to a standalone version of POP (v2.1). However, we have confirmed
through source code inspection that all of our changes are also applicable to and
fully compatible with the latest release of CESM (v1.2.0). In the near future, we
plan to integrate our extensions to the CESM version of POP.
4.1 Load balancing
The model considered here is the global version of POP [20] developed at Los
Alamos National Laboratory. We consider the strongly eddying configuration,
indicated by R0.1, as used in recent high-resolution ocean model simulations
[51, 99]. This version has a horizontal resolution of 0.1◦ (10 km) using a 3600×
2400 horizontal grid with a tripolar grid layout, having poles in Canada and
Russia. The model has 42 non-equidistant z levels, increasing in thickness from
10m just below the upper boundary to 250m just above the lower boundary at
4.1. Load balancing 65
(a) (b)
Figure 4.1: (a) Sketch of the blockwise subdivision of the domain in POP. (b)
The halo regions of a block; image from Smith et al. [84].
6000m depth. In addition, bottom topography is discretized using partial bottom
cells, creating a more accurate and smoother representation of topographic slopes.
4.1.1 Domain decompositions and block distributions
POP supports parallelism on distributed memory computers through the Mes-
sage Passing Interface (MPI). To distribute the computation over the processors,
POP uses a three-dimensional mesh, sketched in Fig. 4.1a. The domain is de-
composed into equal-sized rectangular blocks in the horizontal direction. Each
block also contains several layers in the vertical direction (depth). The blocks
are then distributed over the available MPI tasks, where each task receives one
or more blocks. Blocks consisting only of land points may be discarded from
the computation. Below we will assume that a single MPI task is assigned to
a processor core (unless stated otherwise).
Each block is surrounded by a halo region (Fig. 4.1b) that contains a copy
of the information of the neighboring blocks. These halos allow the calculations
on each block to be performed relatively independently of its neighbor blocks,
thereby improving parallel performance. Nevertheless, the data in the halo re-
gions needs to be updated regularly. This requires a data exchange between the
blocks, which leads to communication between the MPI tasks, the amount of
data depending on the width of the halo, the size of the blocks, and the block
distribution over the MPI tasks.
In POP, the halo width is typically set to 2. For an example block size of
60×60, the number of elements that need to be exchanged per block in every halo
exchange is 4× (60× 2) + 4× 4 = 496. This number may need to be multiplied
66 Chapter 4. Improving the performance of the Parallel Ocean Program
by the number of vertical levels, depending on the data structure on which the
halo exchange is performed. Some data structures, like the horizontal velocity,
store a value for every gridpoint at every depth level. As a result, a 3-D halo
exchange is required that exchanges elements from every depth level. Others data
structures, such as surface pressure, only consist of a single level. There, a 2-D
halo exchange is sufficient.
For neighboring blocks that are assigned to the same MPI task, the data
exchange is implemented by an internal copy and no MPI communication is
required. Also, no data needs to be exchanged with (or between) land elements.
Therefore, the amount of data that needs to be communicated between MPI tasks
depends heavily on the way the blocks are distributed over the MPI tasks.
4.1.2 Existing block partitioning schemes
POP currently supports three algorithms for distributing the blocks over the
available MPI tasks, Cartesian, rake [52], and space-filling curve [19]. The Carte-
sian algorithm starts by organizing the tasks in a two-dimensional grid. Next,
the blocks are assigned to these tasks according to their position in the domain.
If the number of MPI tasks does not divide the number of blocks evenly in ei-
ther dimension, some tasks may receive more blocks than others. In addition,
some tasks may be left with less work (or even no work) if one or more blocks
assigned to it only contain land. As shown by Dennis et al. [19], load imbalance
between tasks can significantly degrade the performance of high-resolution ocean
simulations.
The rake algorithm attempts to improve the load balance by redistributing
the blocks over the tasks. Note that this requires that the number of blocks is
significantly larger than the number of MPI tasks. The rake algorithm starts with
a Cartesian distribution and the corresponding two-dimensional MPI task grid.
First, the average number of blocks per task is computed. Then, for each row in
the task grid, the algorithm takes the first task in the row and determines whether
the number of blocks exceeds the average. If so, the excess blocks are passed on
to the next task. This process is repeated for all tasks in the row. The process
is repeated for all columns of the task grid. As described by Smith et al. [84],
the algorithm “can be visualized as a rake passing over each node and dragging
excess work into the next available hole”. In an attempt to keep neighboring
blocks close together, constraints are placed on block movements that prevent
blocks from moving too far from their direct neighbors. Unfortunately, there are
instances where the rake algorithm actually results in a worse load balance where
blocks get raked into a corner. As a result Dennis et al. [19] state that “We do not
consider the current implementation of the rake algorithm... sufficiently robust.”
4.1. Load balancing 67
Figure 4.2: Examples of the space-filling curve load balancing algorithm, with the
Hilbert (left panel), meandering Peano (middle panel) and Cinco (right panel)
curves; image from Dennis et al. [19].
The space-filling curve algorithm [19] uses a combination of Hilbert, meander-
ing Peano, and Cinco curves to partition the blocks (Fig. 4.2). Conceptually, it
draws a single line that visits each of the blocks exactly once. It then splits this
line into equal-sized segments, each segment visiting the same number of blocks.
Due to the way the line is drawn, the blocks in each segment are also continuous
in the two-dimensional domain. This solution degrades slightly when the land-
only blocks are discarded, which introduces “cuts” in the curve. Nevertheless,
the space-filling curve algorithm significantly improves the load balance between
MPI tasks. A limitation of this approach is that each of the space-filling curves
can only partition domains of a specific size. For example, a domain P × P can
be partitioned by a Hilbert curve if P = 2n, or by a meandering Peano curve if
P = 3m, where n and m are integers. By using combinations of different curves,
the set of supported problem sizes can be extended.
4.1.3 Hierarchical block partitioning
None of the load balancing algorithms described in the previous section takes into
account the inherent hierarchical nature of modern computing hardware. This
typically consists of multiple cores per processor, multiple processors per node,
multiple nodes per cluster, and even the availability of multiple clusters for a nu-
merical simulation. The communication performance drops as we go up in the
hierarchy. The cores in a processor share cache memory and can therefore com-
municate almost instantaneously, while communication between processors has
to go through main memory, which is much slower. Communication between dif-
ferent nodes must go through an external network, which is orders of magnitude
slower, and communication between clusters in different locations is again orders
68 Chapter 4. Improving the performance of the Parallel Ocean Program
of magnitude slower. Therefore, simply balancing the load for the individual pro-
cessors (or cores) is not sufficient. Instead, a hierarchical load balancing scheme
must be used that takes both processor load and the communication hierarchy of
the target machine into account. We suggest using a similar approach to the one
used in Zoltan [105, 90]. However, where Zoltan supports dynamic load balancing
(where the work distribution may change during the application’s lifetime), we
compute a single static solution before the application is started.
Our hierarchical load balancing scheme, like the rake and space-filling curve
algorithms described earlier, assumes that the number of blocks is significantly
larger than the number of processors. Instead of simply specifying the number of
MPI tasks for which to create a partitioning, the user must now specify a sequence
of partitionings. For example, a sequence 2 : 16 : 8 indicates that the blocks must
first be partitioned into 2 sets (preferably of equal size), each of which is then
partitioned into 16 pieces, which are further divided into 8 pieces. The sequence of
partitionings relates directly to the hierarchy that is present in the computational
platform. For example, the 2 : 16 : 8 partitioning can be used for an experiment
on two clusters, each containing 16 nodes of 8 cores.
Once the user has specified the desired partitioning, the algorithm proceeds by
repeatedly splitting the available blocks into N (preferably equal-sized) subsets.
We try to partition the domain in such a way that the shape of each of the subsets
is as close to a square as possible. This will reduce the amount of communication
out of each subset in relation to the amount of work inside each subset.
When splitting a domain, multiple solutions may be available which are equiv-
alent from a load balancing perspective. However, the amount of communication
required between subsets may vary between these solutions due to assignment
of blocks to MPI tasks and the location of land-only blocks. Our algorithm
therefore compares these solutions and selects the one which generates the least
communication between subsets.
To explain our algorithm in more detail, we use the simplified example domain
shown in the upper left panel (a1) of Fig. 4.3. This example domain contains
1200 × 1000 grid elements. It is divided into blocks of 100 × 100, resulting in
12 × 10 blocks, of which 20 are land-only blocks. To divide this domain into 10
subsets, the algorithm starts by computing the required number of blocks per
subset. The 100 non-land blocks must be divided into 10 subsets, resulting in
10 blocks per subset. Next, the algorithm tries to arrange the desired number of
subsets in a (roughly) rectangular grid. The dimensions of this grid, consisting
of N subsets, is determined as follows:
4.1. Load balancing 69
(a1): initial domain (a2): select 30 blocks
for rst column
(a3): select 30 blocks
for second column
(a4): select 20 blocks
for third column
(a5): select 20 blocks 
for fourth column
(a6): initial split
completed
(b1): rst
column
(b2): select
10 blocks
for rst
set
(b3): select
10 blocks
for second
set
(b4): select
10 blocks 
for third
set
(b5): rst
column
completed
(c): nal result
Step 1: Split domain column wise
Step 2: Split column selections row-wise
(only rst column selection is shown)
Figure 4.3: Description of the hierarchical load balancing scheme for an example
of 12× 10 blocks, of which 20 are land-only blocks, as shown in panel (a1). The
initial column-wise split is shown in panels (a2)–(a6), the next row wise split in
the panels (b1)–(b5) and the final results is shown in panel (c).
70 Chapter 4. Improving the performance of the Parallel Ocean Program
[2,2] [3,3] [3,3,2] [3,3,2,2]
Figure 4.4: Example subdivisions of a square into 4, 6, 8, and 10 rectangular
sections.
f:= floor(sqrt(N));
c:= ceiling(sqrt(N))
if (f = c) We have found a square grid of [f x f]
if (f*c = N) We have found a rectangular grid of [f x c]
if (N < f*c) We have found a rectangular grid of [f x c] - (f*c-N)
if (N > f*c) We have found a square grid of [c x c] - (c*c-N)
In the first two cases of the algorithm shown above, a square or rectangular
decomposition is available containing exactly N subsets. In the last two cases, the
decomposition contains (f ∗c−N) or (c∗c−N) subsets too many respectively. To
correct this, we repeatedly remove a single subset from each row until the desired
number of subsets is reached. Figure 4.4 shows four example subdivisions, for
values of N = 4, 6, 8, and 10, that correspond to each of these four cases. For
our example domain we will use the rightmost subdivision in Fig. 4.4 for N = 10
named [3, 3, 2, 2], which represents the number of blocks in each column.
Next, we compute the required number of blocks per column using the average
number of blocks per subset and the selected subdivision. For our example, we
will use the [3, 3, 2, 2] subdivision as in Fig. 4.4 and the 10 blocks/subset average,
which will result in columns containing [30, 30, 20, 20] blocks. We then split the
domain into subsets by traversing the blocks in a vertical zigzag fashion and
selecting all non-land blocks until the desired number of blocks for that column
in reached. It should be noted that the partitioning scheme is not a flood-fill
type of algorithm, which may skip over isolated points. Instead, our partitioning
scheme simply skips over any land points encountered while scanning in a certain
direction, and continues scanning in a zigzag fashion until the required number
of ocean (i.e. non-land) points have been selected.
The panels (a2–a6) in Fig. 4.3 show how the example domain is split into
the 4 columns. We subsequently split each of the columns in a horizontal zigzag
fashion into the desired number of subsets for that column. Panels (b1–b5), show
4.1. Load balancing 71
permutation blocks communication per task
per task (min/avg/max)
(3,3,2,2) 10 1440/2186/2888
(2,3,3,2) 10 1244/2187/2888
(2,2,3,3) 10 1240/2188/3100
(2,3,2,3) 10 1240/2188/3300
(3,2,2,3) 10 1240/2229/3720
(3,2,3,2) 10 1440/2265/2876
Table 4.1: Permutations of the [3,3,2,2] example distribution showing the num-
ber of assigned blocks and the communication per task in grid points per level.
The entries are sorted by average communication per task. The topmost entry
provides the best solution.
an example for the first column, which needs to be split into 3 subsets of 10 blocks
in each case. A similar subdivision is applied to the other columns. The final
block distribution for the example domain is shown in Fig. 4.3c.
As explained above, the subdivision shown in panel (c) of Fig. 4.3 is only one
out of a series of options. Several permutations of the [3, 3, 2, 2] subdivision can be
created that are equivalent from a load-balancing perspective but require a differ-
ent amount of communication. In addition, the subdivision can also be rotated,
thereby initially dividing the domain row-wise instead of column-wise. Finally,
when selecting the blocks in a zigzag fashion (as shown in Fig. 4.3) a choice can
be made from which position to start the selection, top or bottom, or left or right.
In our algorithm we simply compute all unique permutations of the subdivision
in all possible rotations, with all possible starting points. We then select the
solution with the lowest average communication per subset. If multiple equiva-
lent solutions exist, we select the one with the lowest maximum communication
per subset. Table 4.1 shows the best scoring results for all permutations of the
[3, 3, 2, 2] subdivision. All solutions use the same number of blocks per task, but
the amount of communication varies per solution. Once a domain has been split
into the desired number of subsets, the algorithm is repeated for each of these
subsets for the next split.
4.1.4 Hierarchical partitioning of tripole grids
In the application of the hierarchical load balancing scheme to POP, the tripolar
grid layout, where the North Pole is replaced with two poles located (on land)
in Canada and Russia, needs special attention. Note that tripolar grids are fre-
quently used in ocean models because the grid spacing in the Arctic is much more
uniform and the cell aspect ratios are closer to one when compared to traditional
72 Chapter 4. Improving the performance of the Parallel Ocean Program
(a)
(b)
Figure 4.5: (a) A subdivision of the topography into 60 by 40 blocks. The
two tripoles are depicted by the red dots on the upper boundary. Note that the
leftmost and rightmost dots represent the same tripole; the tripole communication
is (partially) shown by the arrows. (b) A remapping of the grid that moves an
area of 30× 7 blocks. The original tripole boundary is shown as a red line.
4.2. Results: load balancing 73
latitude-longitude (dipole) grids [84]. In this case, additional communication is
required for the blocks located on the line between these poles [84]. These blocks
are located on the upper boundary of the grid, as shown in Fig. 4.5a. To sup-
port a tripolar grid layout in our hierarchical load balancing scheme, we add
the additional tripole communication to the communication requirements of the
subset whenever a subset contains a tripole block. The extra communication will
then be taken into account in the search phase of the algorithm. Although this
approach will improve the partitioning, the result will not be optimal. As shown
in Fig. 4.5a, two communicating tripole blocks may be located on opposite sides
of the grid. This makes it difficult for our partitioning scheme to put these two
blocks into the same subset. We overcome this problem by remapping the grid
before we start the partitioning (Fig. 4.5b). By simply moving blocks from one
side of the grid to the other, we enable our partitioning algorithm to optimize
the tripole communication. Note that this remapping is only performed on the
grid used in our partitioning algorithm. No change to POP is required, as POP
only uses the result of the partitioning in which the original block numbering is
maintained.
4.2 Results: load balancing
In this section we will compare the performance of our hierarchical algorithm to
the Cartesian, rake and space-filling curve block partitioning schemes. In our
experiments we carry out a 10 day simulation with the R0.1 version of POP, as
described at the beginning of Sect. 4.1, and show performance measures averaged
over these 10 days. For these experiments we have used the Huygens supercom-
puter (see Section 1.5.2) and DAS-4 (see Section 1.5.3). We use DAS-4 in a single
cluster and two cluster experiment. In the two cluster experiment, the clusters
are connected using an internet link with a maximum bandwidth of 1 Gbit/s.
The average round trip time between clusters is 2.6ms. As the link is shared with
other users, the available bandwidth and round trip latency may vary over time.
4.2.1 Using MPI for multiple clusters.
For POP to run on multiple clusters, an MPI implementation is required that
is capable of communicating both within and between clusters. This is far from
trivial, as clusters are often protected by a firewall that disallows any incoming
communication into the cluster. Also, it is common for the compute nodes to
be configured such that they can only communicate with the cluster frontend,
but not directly with the outside world [49]. To solve this problem, we created
wrapper code that is capable of intercepting the MPI calls in POP. For each
intercepted call, the MPI wrapper decides whether it should be forwarded to the
74 Chapter 4. Improving the performance of the Parallel Ocean Program
Figure 4.6: An example of POP running without the MPI wrapper on a single
cluster (left panel) and with the MPI wrapper on a multi cluster (right panel).
local MPI implementation, or whether it should be sent to another cluster. To
use the MPI wrapper code, POP needs to be recompiled using a different MPI
library, however, no changes to the POP code itself are required.
To communicate between clusters, one or more support processes, so-called
hubs, are used. Each hub typically runs on the cluster frontend, and serves as
a gateway to the other clusters. If necessary, multiple hubs can be connected to-
gether to circumvent communication restrictions caused by firewalls. In Fig. 4.6,
the left panel shows a traditional POP run on a single machine, while the right
image illustrates how a hub is used in DAS-4 to connect two clusters together.
Only a single hub is needed, as all compute nodes in DAS-4 can communicate
with all head nodes, even those of other clusters. However, compute nodes cannot
directly communicate with compute nodes in other clusters.
4.2.2 Performance
Table 4.2 shows the configurations of the partitioning schemes. For each exper-
iment we use 256 MPI tasks. The Cartesian distribution uses a 225× 150 block
size, resulting in exactly one block per MPI task (no land blocks are discarded).
Both rake and space-filling curve use a block size of 60 × 60 and discard 628
of 2400 blocks (i.e. 26%). The table also shows the minimum, average, and
maximum communication per MPI task, as well as the amount of traffic gener-
ated between the clusters for the two cluster experiment. We will discuss these
below. As can be seen from Table 4.2, the hierarchical domain distribution signif-
icantly decreases the amount of traffic between the clusters compared to rake and
4.2. Results: load balancing 75
algorithm block blocks blocks communication communication
size per core discarded per task between clusters
(min/max) (min/avg/max) (messages/volume)
cartesian 225× 150 1/1 0 (of 256) 0/1267.4/2408 22.3 M/99.0 GB
rake 60× 60 5/8 628 (of 2400) 748/1940.5/3936 77.9 M/337.4 GB
sfc 60× 60 6/7 628 (of 2400) 1007/1707.7/2960 41.0 M/212.7 GB
hierarchical 60× 60 6/7 628 (of 2400) 504/1394.9/2584 20.0 M/82.5 GB
Table 4.2: Configuration of the Cartesian, rake and space-filling curve (sfc) and
hierarchical distributions.
 0
 50
 100
 150
 200
cartesian 225x150 rake 60x60 sfc 60x60 hierarchical 60x60
m
o
d
e
ld
a
y
s
/d
a
y
Huygens (4x64)
Single Cluster DAS4 (32x8)
Two Cluster DAS4 (2x16x8)
1
1
8
1
3
8 1
4
5
1
4
6
9
9
1
4
4 1
5
3
1
5
5
8
9
8
0
9
1
1
4
2
Figure 4.7: Performance comparison of POP using Cartesian, rake, space-filling
curve and hierarchical block partitioning schemes on three different hardware
configurations, each using 256 MPI tasks.
space-filling curve. As a result, the performance overhead of using two clusters
is limited.
The performance results of POP are shown in Fig. 4.7 in model days/day. On
Huygens and single cluster DAS-4, the rake and space-filling curve block distri-
bution clearly improve the performance over the Cartesian distribution. On Huy-
gens, the performance improvement of space-filling curve is close to the amount of
work discarded (23% vs. 26%). On DAS-4 the improvement is much greater (54%
vs. 26%) due to the better cache behavior of smaller blocks. The space-filling
curve distribution outperforms the rake distribution in all cases, due to the better
load balancing characteristics, as shown in Table 4.2. Figure 4.7 also shows that
for the two cluster DAS-4 experiments the performance degrades. Interestingly,
76 Chapter 4. Improving the performance of the Parallel Ocean Program
Configuration Performance Speedup
(modeldays/day)
1 cluster, 16 nodes 82 1.0
1 cluster, 32 nodes 155 1.9
2 clusters, 16 nodes each 142 1.7
Table 4.3: Speedup on DAS-4 for one and two cluster configurations using a hi-
erarchical domain distribution.
the performance reduction for Cartesian is only 10%, while space-filling curve
(41%) and rake (44%) are much more affected. This difference is caused by the
increased communication caused by these two block distributions, as shown in
Table 4.2.
Although rake and space-filling curve both decrease the amount of work per
MPI task, they also significantly increase the amount of communication between
tasks. On supercomputers, where POP is traditionally run, this problem is mit-
igated by high-speed network interconnects, but in a multi-cluster environment,
the internet link between clusters becomes a bottleneck. In Table 4.2, the col-
umn “communication between clusters” clearly shows that compared to Cartesian,
rake causes a 3.4x increase in the communication between clusters. The increase
caused by the space-filling curve is smaller, a factor of 2.1, but still significant.
The hierarchical scheme performs slightly better than the space-filling curve
scheme on Huygens and single cluster DAS-4 (Fig. 4.7). This is to be expected,
as the communication overhead is small on these systems due to the fast local
network interconnects. On two cluster DAS-4, however, the hierarchical domain
distribution provides a significant performance improvement over the existing
algorithms. When running on two clusters, the performance drop compared to
a single cluster run is only 8% for the hierarchical domain distribution, compared
to 10% for Cartesian, 41% for space-filling curve and 44% for rake.
Table 4.3 shows the speedup on DAS-4 compared to a 16-node run on a single
cluster. The speedup on 32-nodes on a single cluster is, with a factor of about
1.9, almost perfect. Although the speedup on two clusters (of 16-nodes each) is
slightly lower, about a factor of 1.7, the performance gain compared to a single
cluster is still significant. These results clearly demonstrate that using multiple
clusters can be beneficial, especially to increase the number of machines beyond
the size of a single cluster.
4.3. Execution on graphics processing units 77
4.3 Execution on graphics processing units
This section discusses the main challenges that exist when moving parts of the
computation in POP to a GPU. We use the CUDA programming model [62]
in order to have fine-grained control over our GPU implementation and to be
able to explain and improve performance results. Many different software tools,
libraries, (directive-based) parallelization tools, and compilers aim to assist in
the development of GPU code. However, it is our goal to gain a deep under-
standing of the performance behavior of POP, which requires more control over
the implementation and in particular how data is transferred between the host
memory and GPU device memory. We are currently not aware of the capability
to implement GPU kernels that overlap GPU computation with CPU-GPU com-
munication in any of the existing directive-based parallelization tools for GPUs.
However, if this were possible, it would require a collection of directives similar
to the collection of calls to the CUDA runtime that are currently responsible for
achieving this overlapping behavior. While directive-based parallelization tools
do leave the kernel code in the same language as the original, understanding the
underlying architecture is still required in order to modify that parallelized code
and assess its correctness. In the following sections, we use CUDA terminology
[62], although our methods could just as easily apply to OpenCL [39].
POP consists of a large Fortran 90 codebase, and in this chapter we there-
fore limit ourselves to the most compute-intensive parts of the program and only
oﬄoad those computations to the GPU. The main challenge with this approach
is to overcome the PCIe bus bottleneck. Whenever computations are to be per-
formed on the GPU, the input and output data has to be transferred from host
memory through the PCIe bus to GPU device memory and vice versa. The
achieved bandwidth to GPUs connected through the PCIe 2.0 bus is approxi-
mately 5.7 GB/s from host to device and 6.3 GB/s from device to host. This is
significantly lower than the bandwidth between host memory and a CPU and the
bandwidth between GPU device memory and the GPU. Therefore, it is crucial
that we maximize the overlap of data transfers to the GPU with computation
and with transfers from the GPU back to the host.
To overlap GPU communication and computation we need fine-grained con-
trol over how data is transferred to the GPU. There are several alternative tech-
niques for moving data between host and device using the CUDA programming
model. The most commonly used approach is to simply use explicit memory copy
statements to transfer large blocks of memory to and from the GPU.
Alternatively, CUDA streams may be used to separate the computation into
distinct streams that may execute in parallel. This way, communication from
one stream can be overlapped with computation and communication in other
streams. GPUs with 2 copy engines, such as Nvidia’s Tesla K20, can use the PCIe
78 Chapter 4. Improving the performance of the Parallel Ocean Program
% time function module #calls computes
15.09 state state_mod 29562112 density of water and derivatives
6.69 hdiffu_del4 hmix_del4 4865280 horizontal diffusion of momentum
5.79 advu advection 4865280 advection of momentum
5.33 bldepth vmix_kpp 115840 ocean boundary layer depth
5.25 hdifft_del4 hmix_del4 4865280 horizontal diffusion of tracers
4.62 chrongear pop_solversmod 115840 conjugate-gradient solver
4.07 ri_iwmix vmix_kpp 115840 viscosity and diffusivity coefficients
3.83 vmix_coeffs_kpp vmix_kpp 115840 vertical mixing coefficients
3.66 impvmixt_correct vertical_mix 115840 implicit vertical mixing corrector step
3.34 blmix vmix_kpp 115840 boundary layer mixing coefficients
3.27 impvmixt vertical_mix 231680 implicit vertical mixing of tracers
3.27 clinic baroclinic 4865280 forcing terms of baroclinic momentum
3.17 advt_centered advection 4865280 centered differencing tracer advection
3.12 btropoperator pop_solversmod 14705152 barotropic solver operator
3.10 baroclinic_driver baroclinic 115840 integration of velocities and tracers
2.88 ddmix vmix_kpp 115840 add double-diffusion diffusivities
Table 4.4: List of the most compute intensive functions in POP, covering 76.48%
of the total computation time. The reported time does not include time spent in
functions called by this function.
bus in full duplex with explicit memory copies in different streams. This way,
communication and computation from different streams can be fully overlapped.
Finally, the mapped memory approach uses no explicit copies, but maps part
of the host memory into device memory space. Whether this approach is feasible
depends on the memory access pattern of the kernel. Typically, mapped memory
can only be used efficiently if each input and output element is read or written
only once by the GPU function, called kernel. Although this approach results in
very clean host code, requiring no explicit copy statements, it requires complex
kernel implementations with intricate memory access patterns to ensure high
performance.
4.3.1 Targets for GPU implementation
To determine which part of POP to port to the GPU, we must first get an impres-
sion of where the most time is spent. It is well known that the three-dimensional
baroclinic solver is the most computationally intensive part of POP [38, 102]. We
therefore limit ourselves to analyzing the performance of the baroclinic solver.
Figure 4.8 gives an overview of the structure of function calls in POP. The
function step is called repeatedly to advance the model by one time step, which
consists mainly of performing neighborhood communication, calling the barotropic
and baroclinic driver functions, and some correction steps. Table 4.4 lists the
most compute intensive functions from Figure 4.8. These profiling results of are
4.3. Execution on graphics processing units 79
main
100.00%
(0.00%)
step
99.97%
(0.46%)
100.00%
baroclinic_driver
81.09%
(3.10%)
81.09%
state
15.09%
(15.09%)
0.30%
barotropic_driver
8.89%
(0.11%)
8.89%
correct_adjust
7.82%
(0.04%)
7.82%
set_surface_forcing
0.89%
(0.00%)
0.89%
2.48%
vmix_coeffs
35.26%
(0.00%)
35.26%
clinic
20.63%
(3.27%)
20.63%
tracer_update
15.56%
(2.09%)
15.56%
impvmixt
3.27%
(3.27%)
1.64%
impvmixu
2.25%
(2.25%)
2.25%
2.48%
1.64%
impvmixt_correct
3.66%
(3.66%)
3.66%
vmix_coeffs_kpp
35.26%
(3.83%)
35.26%
gradp
2.23%
(1.90%)
2.23%
hdiffu
6.70%
(0.00%)
6.70%
advu
5.79%
(5.79%)
5.79%
vdiffu
2.64%
(2.64%)
2.64%
vdifft
1.82%
(1.82%)
1.82%
add_sw_absorb
0.71%
(0.71%)
0.71%
advt
5.32%
(0.00%)
5.32%
hdifft
5.25%
(0.00%)
5.25%
tgrid_to_ugrid
0.59%
(0.59%)
0.58%
buoydiff
8.51%
(1.24%)
8.51%
bldepth
7.48%
(5.33%)
7.48%
ddmix
5.37%
(2.88%)
5.37%
blmix
4.81%
(3.34%)
4.81%
ri_iwmix
4.69%
(4.07%)
4.69%
7.27%
wscale
2.85%
(2.85%)
1.38%2.48% 1.48%
ugrid_to_tgrid
0.63%
(0.63%)
0.62%
hdiffu_del4
6.69%
(6.69%)
6.69%
comp_flux_vel
2.13%
(1.99%)
2.13%
advt_centered
3.17%
(3.17%)
3.17%
hdifft_del4
5.25%
(5.25%)
5.25%
Figure 4.8: Callgraph of the Parallel Ocean Program. Obtained using gprof and
gprof2dot, edited to fit this page and format. Subfunctions of barotropic_driver
and set_surface_forcing have been cropped for readability.
80 Chapter 4. Improving the performance of the Parallel Ocean Program
obtained from one month of simulation using the R0.1 version (see beginning of
Sect. 2) on the DAS-4 cluster (described in Sect. 4.1). For this experiment we
have used a Cartesian distribution with blocks of size 255× 300 and 8 processes
per node on 16 nodes.
Table 4.4 lists the percentage of the total execution time spent in this func-
tion, not including subfunctions. All functions in Table 4.4, except those from
the module pop solversmod, belong to the baroclinic solver. Our profiling re-
sults indicate that the baroclinic solver does not contain any true computational
hotspots. That is, no individual function consumes a major part of the compu-
tation time.
However, the density computations from the equation of state are requested
by several different parts both within the baroclinic solver and at the end of each
time step. The computation of water densities is required so frequently by the
model that their computation time consumes 15.09% of the total execution time
on average.
The functions from the vmix_kpp module in Table 4.4 are part of the compu-
tation of the vertical mixing coefficients for the KPP mixing scheme [42], which
in total consumes about 35.3% of the total execution time. We therefore focus on
obtaining a GPU implementation for the equation of state and for the computa-
tion of vertical mixing coefficients, in particular the three functions state, buoydiff
(the computation of buoyancy differences) and ddmix. We focus on buoydiff()
and ddmix() since they are among the most compute intensive functions and are
responsible for 64.9% of the calls to state().
It is well-known that kernel-level optimizations focused on increasing compu-
tation throughput are generally not worthwhile when memory bandwidth is the
primary factor in limiting performance [73]. A frequently used tool for perfor-
mance analysis on multi- and many-core hardware using the Roofline model [101]
is the arithmetic intensity. For example, the Nvidia Tesla K20 GPU has a theo-
retical peak performance of 1173 GFLOP/s for double precision and a theoretical
peak global memory bandwidth of 208 GB/s. However, in practice the achieved
memory bandwidth is (roughly) 160 GB/s, as reported by the bandwidthTest
tool in the Nvidia CUDA SDK. A rough estimation tells us that an arithmetic
intensity of at least 7.3 FLOP/byte is required for the kernel to become compute-
bound. Thus, if the arithmetic intensity is less than 7.3 FLOP/byte, we know
the kernel is memory bandwidth-bound when executed on the K20.
The arithmetic intensity of the state() function is computed as follows. Al-
though POP supports various implementations for the equation of state, we focus
on the MWJF 25-term equation of state [53] because it is the most commonly
used implementation. The MWJF state() function requires the temperature and
salinity tracers as inputs as well as 25 coefficients, of which 6 depend on the
water pressure, the rest is constant. The state() function outputs the density of
4.3. Execution on graphics processing units 81
- Explicit implementation (uses explicit copy statements)
CPU-GPU Comm.
GPU Computation
- Implicit implementation (uses device-mapped host memory)
CPU-GPU Comm.
GPU Computation
- Streams implementation (uses CUDA Streams and explicit copy statements)
CPU-GPU Comm.
GPU Computation
Figure 4.9: Schematic of the three different implementations, Explicit, Implicit,
and Streams, that shows the potential overlap between computation and commu-
nication.
water and optionally also outputs the derivatives of the water density with re-
spect to temperature and salinity. When only the density of water is computed,
state() performs 40 floating point operations per grid point with an arithmetic
intensity of 2.5 FLOP/byte, assuming that all 25 coefficients can be stored in on-
chip caches and can be fully reused. When all outputs are requested, 89 floating
point operations are executed per grid point, resulting in an arithmetic inten-
sity of 5.56 FLOP/byte. With an arithmetic intensity of either 2.5 or 5.56, the
MWJF state() kernel is memory bandwidth-bound. Therefore, we focus on op-
timizing the time spent on communication between host and device, rather than
kernel-level optimizations.
4.3.2 Efficient integration of GPU code
We now describe how POP should be revised to efficiently use GPUs. For our
discussion, we focus on three functions in POP state(), buoydiff(), and ddmix().
Due to a lack in GPU performance models that consider asynchronous PCIe
transfers, it is currently impossible to predict what kind of implementation will
be the most efficient. For each function we have therefore implemented three
different versions that we call Explicit, Implicit, and Streams. We first describe the
three versions in general and then discuss the specific implementations for state(),
buoydiff(), and ddmix() in detail. Figure 4.9 provides a schematic overview of the
three different implementations with regard to the way GPU computation (shown
in green) and CPU-GPU communication (shown in blue) could be overlapped.
82 Chapter 4. Improving the performance of the Parallel Ocean Program
Explicit is a bulk synchronous implementation that uses explicit memory copy
statements to copy all the required input data to GPU and from the GPU for
the entire three-dimensional grid. The kernel used in Explicit creates a two-
dimensional array of threads, i.e. one thread for each horizontal grid point, which
iterate the grid points in the vertical dimension. Implicit uses mapped memory
and therefore requires no explicit memory copy statements. Instead, data is
requested by the GPU directly from the host memory and sent over the PCIe
bus. The performance of accessing the memory in this way is very sensitive to
the order in which data is requested and care must be taken not to create gaps or
misalignments from the mapping between threads and data. Therefore, Implicit
uses a kernel implementation that creates a one-dimensional array of threads with
size equal to the number of grid points in the three-dimensional grid. Each thread
then computes its three-dimensional index from its one-dimensional thread ID to
direct itself to the correct part of the computation. The Streams implementation
creates one stream for each vertical level and uses explicit copy statements to copy
the corresponding vertical level of the input and output variables to and from
the GPU. If the computation of one vertical level requires input from multiple
vertical levels, CUDA events are used to delay the computation until all inputs
have been moved to the device and vice versa. The kernel used in Streams is
similar to the kernel used in Explicit, except for the fact that the kernel only
computes the grid points of one vertical level.
The three different implementations are very different in terms of code and the
effort to create them. All three implementations use very distinctive host codes as
well as modified GPU kernels. For example, the Implicit implementation barely
requires any host code, whereas the Streams implementation requires multiple
loops of memory copy operations and kernel invocations with advancing offsets.
Note that, except for the differences described here, the kernels do not contain
any architecture-specific optimizations.
While the state() function computes the density of water at a certain vertical
level k, the function is mostly used directly surrounded by a loop over all vertical
levels. These code blocks can safely be replaced by a call to a single function that
directly computes the water densities for all vertical levels. Our Explicit imple-
mentation uses explicit copies to move the three-dimensional grid of tracer values
between host and device and creates one thread for each horizontal grid point,
which compute all outputs in the vertical direction. However, this approach is
unable to overlap communication to and from the device with GPU computation.
It is possible to also parallelize the computation of different vertical levels using
CUDA streams. Our Streams implementation ensures that GPU computation
can be overlapped with GPU communication of different vertical levels and thus
alleviates the PCIe bus bottleneck to a large extent. Because of the simple access
4.4. Performance of POP on GPUs 83
pattern in state(), where each input and output element is read or written only
once, it is also a good candidate for the highly parallel Implicit implementation.
More complex uses of the equation of state are found within the computation
of the vertical mixing coefficients for the KPP mixing scheme [42], in particular in
the computation of buoyancy differences (buoydiff) and double-diffusion diffusiv-
ities (ddmix). In POP the vertical mixing coefficients are sequentially computed
for all vertical levels. The computation of buoyancy differences at level k requires
the density of both the surface level and level k−1 displaced to level k, as well as
the water density at level k. These values can be computed for each level in par-
allel as long as all the data is present on the GPU. Overlapping data movement
from the host to the GPU with GPU computation and data movement from the
GPU to host becomes significantly more difficult, because the tracers for levels 1,
k−1, and k need to be present on the GPU to compute the buoyancy differences
at level k. The Streams implementation first schedules memory copies to the
GPU for all vertical levels in concurrent streams and then invokes GPU kernel
launches for all levels. However, before the execution of the kernel in stream k
can start, the memory copies in stream 1, k − 1, and k need to be complete.
The kernel executing in stream k outputs to different vertical levels for different
variables. Therefore, some of the memory copies from device to host in stream k
have to wait for the kernel in stream k− 1 to complete. We use the CUDA event
management functions to guarantee that no computations or memory transfers
start prematurely.
In the ddmix function, the computation of diffusivities at level k requires the
derivatives of density with respect to temperature and salinity at level k and
k − 1. That is, the computation of level k reuses the derivatives that were used
to compute level k − 1. At a first glance, it would seem that the computation of
all vertical levels cannot be parallelized. The sequential approach prevents that
these values have to be recomputed, but inhibits the ability to overlap communi-
cation and computation of different vertical levels. Therefore, our implementation
also parallelizes the computation in the vertical dimension by introducing double
work. The cost of computing the derivatives twice is significantly less than the
inability to overlap computation and communication. Similarly to the buoyancy
differences computation, the kernel executing in stream k requires the memory
copies of stream k and k − 1 to be complete. Again, CUDA event management
functions are used to guarantee that no data is copied from the GPU back to the
host before GPU computations have finished.
4.4 Performance of POP on GPUs
In this section, we will describe the performance of the R0.1 version of POP on
a single cluster and on multiple GPU clusters. In the first subsection below,
84 Chapter 4. Improving the performance of the Parallel Ocean Program
we focus on the performance impact on individual POP subroutines when using
a GPU. In the second subsection, we address the performance of the whole POP
code on a single GPU and on multiple GPU clusters.
4.4.1 Performance impact of GPU usage: individual rou-
tines
First we evaluate the performance of single functions that were taken out of POP
for individual benchmarking. We test our three implementations (Explicit, Im-
plicit, and Streams) for each discussed function of POP on a single node equipped
with a Nvidia Tesla K20 GPU in the DAS-4 cluster. The Tesla K20 has 2496
CUDA cores running at 705 MHz providing a theoretical peak double precision
performance of 1173 GFLOP/s. The K20 has 5 GB of device memory and a the-
oretical peak memory bandwidth of 208 GB/s. The K20 is connected through
a PCIe 2.0 bus and has 2 copy engines which enable full duplex use of the PCIe
bus for concurrent explicit memory transfers. The grid dimensions used for the
experiments discussed here are 229 × 304 × 42. This is the same block size as
used to obtain our profiling results, with 2 ghost cells in both horizontal dimen-
sions. The performance results presented here are averaged execution times of
five distinct runs. The execution times of these individual routines on the tested
GPUs show minimal variance.
For all three implementations the majority of the execution time is spent
on transferring the data to and from the GPU. For example, for the Streams
implementation of state() only 10.3% of the execution time is spent on GPU
computation, and only 19.4% and 13.3% for buoydiff() and ddmix(), respectively.
Note that the reported times for buoydiff() and ddmix() include the time spent
within state() when called as a subfunction. In fact, calls to state() from the
GPU kernels of buoydiff() and ddmix() are inlined to optimize the data access
pattern of these kernels.
Figure 4.10 shows the performance results for all three functions with three
different GPU implementations. For the state() function the Implicit implemen-
tation provides the best performance. Although the kernel implementation used
by Implicit is slightly less efficient than the kernel used by Explicit, the total
execution time is significantly less because a large part of the memory trans-
fers between host and device and computation are overlapped. While Streams
achieves overlapping behavior similar to Implicit, it is more coarse-grained with
one vertical level at a time rather than individual grid points. That explains why
Implicit outperforms the Streams implementation for the state() function.
The buoydiff() function has a very low arithmetic intensity and therefore the
computation again accounts for only a small part of the total execution time.
The Implicit implementation is slower than Explicit because the access pattern
4.4. Performance of POP on GPUs 85
 0
 5
 10
 15
 20
 25
 30
state buoydiff ddmix
Ti
m
e 
(m
s)
Explicit
Implicit
Streams
12
17
25
8
23 23
9
12
17
Figure 4.10: Performance results for the three POP functions on a GPU with
three different implementations as obtained on the Tesla K20 GPU with a 229×
304 block size.
in buoydiff() requires several input elements multiple times. As a result, the Im-
plicit approach transfers more data than necessary over the PCIe bus. Although
these transfers can be overlapped with computation and with transfers in the
opposite direction, the performance penalty for transferring data multiple times
reduces the overall performance. The Streams approach again benefits from the
fact that data transfers and computation can be overlapped, but without the
restrictions that come with the Implicit approach. The data access pattern in
buoydiff() requires that operations in some streams may have to wait for opera-
tions in another stream to complete before they can start. The overhead of these
synchronizations accounts for on average 3.26% of the total execution time of the
Streams implementation.
To parallelize the computation of ddmix() in the vertical dimension, the Im-
plicit and Streams implementations do some double work. That is, some values
are computed twice by different threads operating at different vertical levels,
whereas a thread in the Explicit approach may reuse that value from the compu-
tation of a previous vertical level. Therefore, the time spent in computation for
Implicit and Streams is higher than that of Explicit. However, due to the overlap
of computation and PCIe transfers in both directions, both Streams and Implicit
do outperform the Explicit implementation in terms of total execution time. The
86 Chapter 4. Improving the performance of the Parallel Ocean Program
Implicit implementation again suffers from the fact that, although overlapped
with communication and computation, data has to be transferred multiple times
through the PCIe bus.
In the GPU implementation of the POP we use in the next section, the Im-
plicit implementation for state() and the Streams implementation for buoydiff()
and ddmix() are used. As buoydiff() is executed before ddmix() as part of the
computation of vertical mixing coefficients, ddmix() reuses the tracers that have
been copied to the GPU by buoydiff(). Additionally, for all three functions, the
execution on the GPU as well as all data transfers are overlapped with the com-
putation of other functions on the CPU. Therefore, the CPU never has to wait
for the results of GPU computations.
4.4.2 Performance of POP on multiple (GPU) clusters
In this section, we evaluate the performance of the combination of the two ap-
proaches presented in this chapter. The goal of this evaluation is to assess whether
the addition of a GPU is at all beneficial for performance on the application level.
This is certainly not trivial, considering that large amounts of data have to be
moved back and forth between the different memories over a relatively slow PCIe
link. Additionally, only a small number of functions are executed on the GPU
and a single GPU is shared between the various CPU cores. As such, we compare
the performance of two versions of the program: one that only uses CPUs and
one that uses the available CPUs as well as the GPU.
We recognize that a truly fair comparison between the different experimental
setups is very hard to achieve. We take the achieved performance in terms of
the number of model days per day of simulation as a measure for comparison.
We have chosen not to normalize these results using additional metrics such as
hardware costs or power consumption to keep the experimental setup as simple
as possible. Hardware costs of both CPUs and GPUs are influenced by different
factors in addition to their performance capabilities. Power consumption is an
important factor in the operational costs for modern supercomputers. However,
as only a small fraction of the code currently executes on the GPU, it is clear
that with the current state of the software, the GPU will be idle for a large
fraction of the execution. Whether a complete GPU implementation of POP is
more efficient than a CPU-only implementation in terms of power consumption
is an interesting issue, but it is outside the scope of this thesis.
For this evaluation we use the DAS-4 cluster (see Section 1.5). First, 8 com-
pute nodes each containing two quad-core Intel E5620 CPUs (8 cores per node in
total) and a Nvidia GTX480 GPU are used. In addition, we also use 8 compute
nodes each containing two six-core Intel E5-2620 CPUs (12 cores per node in to-
tal) and a Nvidia Tesla K20 GPU each. As a reference for the CPU-only version
4.4. Performance of POP on GPUs 87
 0
 20
 40
 60
 80
 100
 120
4 cores/node 8 cores/node 12 cores/node
m
o
d
e
ld
a
y
s
/d
a
y
8-core DAS4 (CPU only)
8-core DAS4 (CPU + GTX480)
12-core DAS4 (CPU only)
12-core DAS4 (CPU + K20)
3
0
4
1
0
3
6
4
9
0
3
3
6
2
8
7
3
8
7
1
1
0
0
Figure 4.11: Performance of POP using 8 compute nodes of the DAS-4 cluster,
with and without GPUs, using hierarchical partitioning with 60× 60 block size.
of POP we use the original POP code with the hierarchical partitioning scheme
described in Sect. 4.1.3. Comparisons against other load balancing schemes can
be derived from Figure 4.7. All configurations in this section use a block size of
60× 60.
Figure 4.11 shows the performance of POP using 4, 8, and 12 MPI tasks per
node, with and without GPU. Note that only a single GPU is available in each
node. Therefore, the GPU is shared between the multiple MPI tasks on a single
node. For the 8-core DAS-4 nodes, the performance gained by using the GPU is
approximately 20%, both when using 4 or 8 MPI tasks. This directly corresponds
with the execution time consumed by POP code that has been ported to the
GPU. The figure also shows that the scalability of POP itself is far from perfect.
Running on 8 MPI task per node, only provides a speedup of 1.4 compared to 4
MPI tasks per node, both for the CPU-only and GPU versions.
For the 12-core DAS-4 nodes, the performance gained by using the GPU is
approximately 15% when using 4 MPI tasks per node, and 13% when using 8 or
12 MPI tasks per node. Although this relative performance gain is lower than for
the 8-core nodes, the absolute performance gain is much higher due to the better
performance offered by the (newer) six-core CPU and K20 GPUs. In addition, the
scalability of POP on the 12 core nodes is also much better, achieving a speedup
of 1.9 on 8-cores and 2.6 on 12-cores (both relative to the 4-core experiment).
The results show that it is possible to combine the hierarchical partitioning
scheme with GPU execution and still obtain a performance increase. This is
88 Chapter 4. Improving the performance of the Parallel Ocean Program
 0
 20
 40
 60
 80
 100
 120
CPU only CPU+GTX480
m
o
d
e
ld
a
y
s
/d
a
y
Single Cluster DAS4 (16 x 8 cores)
Two Clusters DAS4 (2 x 8 x 8 core)
8
2
9
4
7
8
8
8
Figure 4.12: Performance of POP using 16 compute nodes of the DAS-4 cluster,
on one or two clusters, using hierarchical partitioning with 60× 60 block size.
a remarkable result, as the hierarchical partitioning scheme prefers small block
sizes, such as 60 × 60, to eliminate as many land-only blocks as possible and
distribute load evenly among MPI tasks, while the GPU code would prefer larger-
sized blocks to increase GPU utilization. However, GPU utilization is already
increased by the fact that all MPI tasks running on a single node share a single
GPU for all their GPU computations. It is important to understand that this
would not have been possible with larger block sizes because of the limited size
of the GPU memory. As such, the two approaches presented in this chapter work
in concert to improve the performance of POP.
As a final experiment, we study the performance of POP on multiple platforms
including GPUs. For this experiment, we use 8-core DAS-4 compute nodes with
an Nvidia GTX480 GPU (described in Sects. 1.5 and 4.4.2).
Figure 4.12 compares the performance of a 16-nodes single cluster run with
a 2 × 8-node two cluster run. Results are shown for CPU-only and CPU+GPU
experiments. The results show a performance increase of 15% on one cluster
and 13% on two clusters when using GPUs. The performance loss when switch-
ing from one to two clusters is 5% for the CPU-only version and 6% for the
CPU+GPU version. These results clearly indicate that running POP on multi-
ple GPU clusters is feasible and also beneficial in terms of performance. Moreover,
it allows users with access to multiple smaller GPU clusters to scale up to well
beyond the size of a single GPU cluster.
4.5. Conclusions 89
4.5 Conclusions
High-resolution ocean and climate models are becoming a very important tool in
climate research. It is crucially important that multi-century simulations with
these models can be performed efficiently. In this chapter, we presented a new
distributed computing approach to increase the performance of the POP model.
We first have shown that it is possible to optimize the load balancing of POP
such that it can run successfully in a multi-platform setting. The hierarchical
load balancing scheme was shown to perform much better than the existing load
balancing schemes (Cartesian, rake and space-filling curve), mainly due to the
reduction in communication between the MPI tasks. In the future, we plan to
take advantage of the Zoltan library to extend our load balancing scheme to
also take performance differences between machines into account. Secondly, it
was demonstrated that it is advantageous to port part of POP to GPUs (and
get a performance increase), even though POP itself does not contain any real
hotspots and is therefore not an obvious candidate for using GPUs.
In the experiments shown, only three functions in POP were implemented on
a GPU. Another substantial portion of the execution time is spent computing the
advection of momentum and the horizontal diffusion of momentum and tracers.
Obtaining a GPU implementation for these functions is deferred to future work.
The advection of tracers also uses the equation of state to compute the potential
density referenced to the surface layer, which is used to compute a variety of time
averaged fields. Currently, the majority of the execution time per kernel is spent
on PCIe transfers. When more computation is moved to the GPU, more data can
be reused, and some intermediate data structures that result from computation
may even never have to leave the GPU. In that case, some PCIe transfers can
be eliminated completely. In future work we hope to produce a complete GPU
implementation of the vertical mixing part of POP.
The software presented in this chapter has the same portability properties as
the original POP. The GPU code is written in CUDA, which is a widely used
language for GPU Computing applications. To increase portability across differ-
ent GPU architectures no architecture-specific optimizations have been included.
OpenCL is a well-known alternative to CUDA that aims at a wider set of many-
core compute devices and different compilers are available for different platforms.
However, there are no real linguistic differences between CUDA and OpenCL,
and porting the code will be a simple engineering effort. The use of both exten-
sions (domain decomposition or GPU functions) can be enabled, disabled, and
controlled individually through the well-known pop_in namelist file.
Finally, we have shown that the combination of these two approaches also im-
proves performance. Although we demonstrated this only for the DAS-4 cluster,
it opens up the possibility to submit a POP job in the near future over multiple
90 Chapter 4. Improving the performance of the Parallel Ocean Program
supercomputing platforms (with or without GPUs). The new hierarchical load
balancing scheme and the MPI wrapper methodology are crucial elements for
maintaining the performance of POP. Future work is to port more of POP to
GPUs and to scale up the multi-cluster experiments to production size hardware.
91
Chapter 5
Performance models for
CPU-GPU data transfers
Many studies of GPU computing applications focus on individual kernel perfor-
mance and ignore the fact that many applications have to regularly transfer data
over the PCIe bus [28]. In particular large-scale supercomputing applications
suffer from this issue, either because the data does not fit into the GPU memory,
inter-node communication at the end of each time step requires data to be present
at the host, or parts of the application are executed on the CPU. Moreover, as
supercomputing systems are large investments even a 10% speedup can often save
millions of dollars [30].
This chapter focuses on Challenge 5 as defined in Chapter 1: How to optimize
data transfers between CPU and GPU? PCIe transfers can have a large impact on
performance, especially when considering that the bandwidth is much lower than
the GPU device memory bandwidth. Gregg and Hazelwood [28] state that GPU
kernel execution times can increase to between 2 and 50 times of the original,
when PCIe transfer times are included. The de facto way of transferring data to
the GPU is by delaying all GPU computation until the entire transfer is complete.
An important reason for this is that overlapping computation and communication
is challenging and requires a considerable effort from the programmer and results
in much more code [26, 100].
Transferring data to or from the GPU proceeds through explicit memory
copy statements or using device-mapped host memory. Overlap between com-
putation and communication can be achieved using either CUDA streams or
device-mapped host memory. As we will show in this chapter, the most effi-
cient implementations may even require a mix of these different approaches. The
different implementations require completely different host codes as well as modi-
92 Chapter 5. Performance models for CPU-GPU data transfers
fications to the GPU kernel. For example, using mapped memory barely requires
any host code, whereas using CUDA streams may require multiple loops of mem-
ory copy operations and kernel invocations with advancing offsets. Currently,
if the application should run as efficiently as possible the programmer has little
choice but to implement all alternatives and run benchmarks to see which method
performs best. Implementing all different options is often a large programming
effort and not feasible. Knowing which method to apply in advance can save the
application developers a lot of time.
In order to solve these issues and provide insight in the performance of GPU
applications that require regular transfers across the PCIe bus, we propose an
analytical performance model that includes for PCIe transfers and overlapping
computation and communication. The model is capable of classifying the alter-
native implementations with regards to their relative performance. While the
parameterized model is specified analytically, it may be instantiated empirically
to provide performance estimates for a target hardware platform.
The reason to develop an analytical performance model, as opposed to alter-
native methods such as simulation, is that analytical models tend to be easier
to use and provide more high-level insight [30]. While cycle accurate or model-
based simulators can produce extremely accurate projections, the vast amount
of data that is generated may not lead to the desired insight within a time frame
that is reasonable for application developers. Our goal is to create an analytical
performance model with a reasonably small number of parameters. This ensures
that the model is easy to use by application developers while performance is
characterized correctly.
Using our performance model programmers should be able to quickly answer
questions such as: What is the dominant factor for my programs’ performance,
kernel execution or PCIe transfers? How much can be gained from overlapping
computation and communication? Do I have to use memory copy operations or
mapped memory? If I use streams, what number of streams is likely give the best
performance? How will switching to PCIe 3.0 platforms over PCIe 2.0 impact
performance?
The model is presented in two stages. First, we model to what extent com-
putation and communication can be overlapped. Second, we create a model to
accurately predict PCIe transfer times and combine the two to create a set of
performance estimations for different methods of overlapping computation and
communication. We show that our performance models can be used to correctly
classify the best performing implementation strategy for a small set of kernels
and GPUs.
In this chapter, we use CUDA terminology [62], although our method can just
as easily be applied in OpenCL [39]. The CPU is referred to as the host and the
GPU accelerator as the device, which are connected through PCI Express. We
5.1. Related work 93
use the term communication in this chapter to refer to data transfers between
host and device.
This chapter is structured as follows. Section 5.1 discusses related work. Sec-
tion 5.2 explains what methods for overlapping computation and communication
are available and provides upper bounds to indicate performance gain. Section 5.3
presents our models for kernels using CUDA streams. Section 5.4 shows how PCIe
transfers can be modeled accurately. Section 5.5 combines the models from Sec-
tions 5.3 and 5.4, presents our model for mapped memory kernels, and provides
estimates for the number of streams to use as part of the classification. Sec-
tion 5.6 evaluates the presented performance models, leading to the conclusions
in Section 5.7.
5.1 Related work
This section discusses related work on performance modeling for GPU applica-
tions grouped into three categories, models for kernel execution time, models for
GPU cluster applications, and models that consider overlap between computation
and communication.
5.1.1 Models for GPU kernel execution time
Various performance models for modeling GPU kernel execution times exist in
the literature. For example, Zhang and Owens [104] have developed a semi-
empirical model to analyze and predict kernel execution times. Hong and Kim [31]
and Baghsorkhi et al. [6] presented detailed analytical models to predict the
performance of GPU kernels. These models are intended for use in compilers to
guide kernel optimization. They require low-level information about the kernel,
such as the amount of warp- and instruction level parallelism, which makes them
difficult to be used directly by the programmer. Therefore, in this chapter we
assume the execution time of the kernel is known by the programmer, either by
experimentation or by using any of the existing performance models. Modeling
the kernel execution time is outside the scope of this chapter.
The roofline model [101] is an insightful high-level model that provides an
upper bound on performance of parallel software on multi-core architectures.
The model abstracts the machine into two performance metrics: peak memory
bandwidth and peak compute performance. Application kernels are abstracted
into a single value (operational intensity) that describes the number of floating-
point operations per byte of DRAM traffic. The upper bound on the performance
of the kernel is then given as the minimum of the peak compute performance and
the product of the peak memory bandwidth and operational intensity of the
kernel.
94 Chapter 5. Performance models for CPU-GPU data transfers
For GPU applications, the model can be used assuming that all data fits in
the DRAM of the device. In the absence of caches or data reuse, arithmetic
intensity (the number of floating-point operations per byte of input) can used
instead of operational intensity. In Section 5.2, we explain how we extend the
roofline model to also capture the impact of transferring data between host and
device.
5.1.2 Models for GPU cluster applications
Another class of performance models considers not only the kernel execution
time but also the performance in a GPU cluster setting. For example, Schaa en
Kaeli [75] describe a model for predicting the performance of GPU applications for
multi-GPU systems. PCIe transfer time is modeled using a constant bandwidth
as the only parameter. Overlapping computation and communication is not part
of the model.
cudaMPI [43] is a system that provides MPI-like message passing to com-
municate data stored on device memory in a GPU cluster setting. Lawlor [43]
argues against the use of streams and suggests to use mapped memory instead.
Bernaschi et al. [8] compare the performance of several other GPU-aware MPI im-
plementations (OpenMPI, MVAPICH2 and APEnet) using the Heisenberg spin
glass model as a benchmark application. They argue that the use of CUDA
streams is instrumental to achieving the best performance.
Aspen [88] is a domain specific language for developing analytical performance
models. Aspen consists of two parts, one that models the application and one
that describes the target machine. For example, Aspen is used to develop a per-
formance model for a 3D FFT application in a GPU cluster setting [88]. Aspen
assumes a completely connected intra-node topology that operates at a fixed link
bandwidth. As such, their abstract machine model currently has no language fea-
tures for describing the PCIe bus and it is assumed that communication between
host and device operates at a fixed data rate of 8 gigabytes per second.
These performance models for GPU cluster applications have to capture PCIe
transfer times to some extent. However, it is not their goal to provide the applica-
tion developer with insight in to what extent computation can be overlapped with
communication and how such overlap can be achieved. Therefore, in this chapter
we develop an accurate model for predicting PCIe transfer times and combine
that with models that capture overlap between computation and communica-
tion. Finally, our model may be included in performance models for GPU cluster
applications that do consider overlap between computation and communication.
5.1. Related work 95
5.1.3 Models that consider computation and communica-
tion overlap
Hoefler et al. [30] have proposed a method for developing application performance
models for supercomputing systems. Their method does not explicitly consider
the use of GPUs, but it may well be applicable to GPU applications. Through sev-
eral different steps their method abstracts the application into three performance
metrics, required memory traffic, floating-point instructions, and fixed-point in-
structions. Their method tells the modeler to extract the duration of overlappable
serial computation and communication for each kernel. Unfortunately, the pa-
per does not detail how the modeler is to extract this information and this step
is also omitted in the application example discussed in the paper. The perfor-
mance models we present in this chapter can be used by application developers
to perform this step of Hoefler’s method.
Meswani et al. [54] have developed a framework for predicting the performance
of applications executing on accelerators. Using automatically extracted applica-
tion signatures and a machine profile based on benchmarks, they aim to predict
the application runtime before the application is ported. PCIe transfer rates are
modeled as a function of the size of the data to be transferred between host and
device. The PCIe throughput rates reported in their paper are much lower than
what we observe, even when considering they assume a hardware setup with two
GPUs per node. Unfortunately, their paper does not detail how these numbers
were obtained.
Meswani et al. define a performance metric called data transfer ratio, which
denotes the fraction of the duration of the total data transfer that is to be added
to the execution time on the accelerator. For a selected number of kernels they
list data transfer ratios, indicating what fraction of the data transfer needs to be
overlapped in order for the GPU kernel to outperform the CPU. They conclude
that in general 70% of data transfer time between host and device needs to be
overcome in order to benefit from using accelerators. This stresses the impor-
tance of effectively overlapping data transfer times and computation for HPC
applications. However, the analysis of how the overlap between computation and
data transfer should be achieved, as well as the methods for achieving such data
transfer ratios are outside the scope of their paper.
Gómez-Luna et al. [26] have studied the performance of CUDA streams for
the older GTX 280 and GTX 480 graphics cards. In this chapter, we do not
consider GPUs that predate Fermi cards. They model the execution time of a
kernel using CUDA streams including transfers between host and device.
Gómez-Luna et al. model the performance of an application using CUDA
Streams on a Fermi GPU using the stream creation time as the only drawback to
the use of streams. To include the stream creation time in kernel execution time
96 Chapter 5. Performance models for CPU-GPU data transfers
estimates may be reasonable for applications that consist of a single GPU kernel,
such as benchmarks or CUDA SDK examples, but for real-world applications this
is not correct. Streams are typically created at the start of the application and
destroyed only when the application has finished. Once created, a stream may
be used for thousands of kernel invocations. They estimate the stream creation
time at 0.03 ms per stream [26], clearly this becomes negligible if the stream can
be reused for many kernel invocations. Stream creation times are not a limiting
factor in the performance of streamed CUDA applications in general.
When the stream creation time is removed from their performance estimates,
no drawback from using additional streams is left in the model. As such, the
model suggests that an infinite number of streams results in optimal performance,
which is clearly not realistic as long as there is some overhead associated with
using additional streams. We may interpret their model as a semi-empirical model
that abstracts all per stream overhead into some constant amount of time per
stream. This model however does not provide much insight in the working of
streams other than that there is some overhead associated with their use. In the
following we explain how we model the execution time of streamed CUDA kernels
in a way that provides the necessary insight for the application programmer.
5.2 Overlapping computation and communication
This section explains what methods for overlapping computation and commu-
nication are available to application developers and provides upper bounds to
indicate how much performance can be gained.
Overlapping computation with communication requires fine-grained control
over how data is transferred to and from the GPU. There are several alternative
techniques for moving data between host and device in the CUDA Programming
model. The most commonly used approach is to simply use explicit memory copy
statements to transfer large blocks of memory to the GPU, invoke GPU kernels,
and copy the results back from the GPU. These explicit memory copy operations
can be invoked either synchronously or asynchronously with respect to the host.
While the host does not wait for an asynchronous copy to complete, the copy
operation will entirely precede GPU kernel execution and as such, computation
and communication are not overlapped.
The mapped memory approach uses no explicit copies, but maps part of the
host memory into device memory space. Whether this approach is feasible de-
pends on the memory access pattern of the kernel. Typically mapped memory
can only be used efficiently if each input and output element is read or written
only once by the kernel. This is because every load or store on device-mapped
host memory may result in a transfer over PCIe. Although this approach results
in very clean host code, requiring no explicit copy statements, it may require
5.2. Overlapping computation and communication 97
more complex kernel implementations with delicate memory access patterns to
ensure high performance.
CUDA streams may be used to separate the computation into distinct streams
that may execute in parallel. This way, communication from one stream can be
overlapped with computation, and with communication in other streams. GPUs
with 1 copy engine can only use the PCIe bus in half duplex when data is moved
using explicit memory copies. As such, computation can only be overlapped with
communication in at most one direction. The only way for these GPUs to use the
PCIe bus in full duplex is to use device-mapped host memory instead. GPUs with
2 copy engines, such as Nvidia’s Tesla K20, can use the PCIe bus in full duplex
using explicit memory copies in different streams. This way, computation and
communication in both directions can be fully overlapped using different streams.
Streams may also be used to allow different kernels to execute concurrently, re-
gardless of how data is transferred between host and device. This task-parallel
use of streams for exploiting application-level task parallelism is outside the scope
of this work.
5.2.1 Bounding the performance of kernels with overlap
We can apply the roofline model [101], outside of its intended scope, for mod-
eling the performance of GPU kernels including PCIe transfer times. For this,
we need a metric different from operational intensity to filter out all redundant
loads and stores to device memory. This is similar to how the roofline model
uses operational intensity rather than arithmetic intensity to measure the traffic
between cache and DRAM rather than between processor and cache. To model
this correctly we need the ratio between the total number of floating point oper-
ations performed by the kernel and the amount of bytes transferred between host
and device memory. We call this data intensity, defined as the total number of
floating point operations divided by the total number of bytes transferred over
PCIe. The roofline model defines the attainable performance as
PRM = min(peak compute performance,
peak memory bandwidth× operational intensity).
Using data intensity we determine whether the execution time of a kernel will
be dictated by PCIe transfers or by kernel execution. This extended roofline
model is defined as
Pmax = min(peak compute performance,
peak memory bandwidth× operational intensity,
peak PCIe bandwidth× data intensity).
98 Chapter 5. Performance models for CPU-GPU data transfers
Pmax gives the upper bound on the kernel performance assuming maximal
overlap between computation and communication. However, if the PCIe transfers
entirely precede and follow GPU kernel execution then the performance is limited
as follows:
Pzero =
total flops
( bytes transferredpeak PCIe bandwidth ) + (
total flops
PRM
)
,
where PRM is the upper bound on performance provided by the roofline model.
The two estimates that we have just defined can be used to bound the per-
formance for maximal or zero overlap between computation and communication.
As such, they can be used to indicate how much performance can be gained from
overlapping computation and communication. However, this model does not yet
express the performance gain that some extent of overlap between computation
and communication may have, nor does it give any insight into how such overlap
can be achieved.
The device-mapped host memory approach can achieve an overlap between
computation and communication at the highest possible granularity (i.e. warp-
level). The mapped memory approach entirely forgoes the use of device memory
and as such we cannot use data intensity as a measure for the amount of data
that will be transferred. In fact, every request that would normally be served by
the device memory will result in a transfer over PCIe. As such, we can directly
use operational intensity combined with the PCIe bandwidth to provide an upper
bound on performance, by defining
Pmm = min(peak compute performance,
peak PCIe bandwidth× operational intensity).
The performance bounds introduced in this section can be used to determine
performance bottlenecks, but they cannot be used to model the performance
that a certain degree of overlap may provide nor do they provide insights in how
overlap may be best achieved. Therefore, the following sections introduce new
performance models that are specifically created to model the extent of overlap
between computation and communication.
5.3 Modeling computation and communication
overlap for streams
In order to model what extent of overlap can be achieved when streams are used,
we start using the same notation as Gómez-Luna et al. [26] and the CUDA Best
5.3. Modeling computation and communication overlap for streams 99
Practices Guide [62]. Transfer times tThd (host to device) and tTdh (device to
host) represent the sum of time spent on transfers for all streams combined. The
kernel execution time for all streams combined is written as tE . The time per
stream, whether spent on execution or transfers in either direction, is written
as tnStreams . As some overhead is associated with the use of streams, tThd and
tTdh will generally be larger than the observed transfer time when no streams are
used. As such, the estimates presented in this section are too optimistic and are
refined at a later stage.
Based on what GPUs are currently available we identify three categories:
(1) devices with implicit synchronization (explained below) and 1 copy engine,
examples are GPUs from the Fermi architecture e.g. GTX 480 and Tesla C2050,
but also the Kepler GTX 680; (2) devices with no implicit synchronization and
1 copy engine, for example the GTX Titan; and (3) devices with no implicit
synchronization and 2 copy engines, for example the Tesla K20. Figures 5.1, 5.2,
and 5.3 show the possible overlap between computation and communication for
each category using 1, 2 and 4 streams. The blue bar on the left represents the
time spent on data transfers from host to device, the green bar represents kernel
execution time on the GPU, followed by another blue bar representing the time
spent on transferring data from device to host. From these diagrams we can
easily derive simple analytical models for the performance of a kernel that uses
CUDA streams.
The overlapping behavior of the execution is separated into two scenarios
depending on whether kernel execution or transfers are the dominant factor for
performance. A component is dominant for performance when the time spent
on that component is so large that, if the time spent on another component can
be overlapped with the dominant component, it will overlap with the dominant
component for all streams but one. While this separation helps to explain the
models, in practice when estimating the total execution time one can simply use
the largest estimate of both dominant kernel and dominant transfers scenarios.
5.3.1 Implicit synchronization and 1 copy engine
Implicit synchronization delays operations that require a dependency check to
see if a streamed kernel launch is complete until all thread blocks of all prior
kernel launches from any stream have started executing [62], as shown in Fig-
ure 5.1. In the dominant kernel scenario, computation can overlap with transfers
from host to device in different streams. In the dominant transfers scenario, the
transfers from host to device can overlap with computation in different streams.
Because of implicit synchronization, practically no overlap between computation
and transfers from device to host can be achieved. The extent of overlap between
host to device transfers and computation can be modeled as follows:
100 Chapter 5. Performance models for CPU-GPU data transfers
dominant kernel dominant transfers
Figure 5.1: Diagrams showing possible overlap for devices with implicit synchro-
nization and 1 copy engine when using 1, 2, or 4 streams.
Dominant kernel
t =
tThd
nStreams
+ tE + tTdh
Dominant transfers
t =tThd +
tE
nStreams
+ tTdh
For devices with implicit synchronization and 1 copy engine, we come to the
same conclusions as Gómez-Luna et al. did for Fermi GPUs, although we derive
the model from the hardware specifications rather than extensive experimenta-
tion.
5.3.2 No implicit synchronization and 1 copy engine
On devices like the GTX Titan, computation can overlap with transfers in differ-
ent streams in either direction, as shown in Figure 5.2. However, because there
is only 1 copy engine, transfers in the opposite direction can only start once all
transfers in the first direction have completed. For the dominant kernel scenario,
there is no implicit synchronization and rather than the entire time spent on
device to host transfers we will only see the time spent on the device to host
transfers in the last stream. The dominant kernel scenario is now able to entirely
hide the kernel execution time, depending on whether the host to device and
device to host transfers are large enough to hide all computation time.
5.3. Modeling computation and communication overlap for streams 101
dominant kernel dominant transfers
Figure 5.2: Diagrams showing possible overlap for devices with no implicit syn-
chronization and 1 copy engine when using 1, 2, or 4 streams.
Dominant kernel
t =
tThd
nStreams
+ tE +
tTdh
nStreams
Dominant transfers
t =max( tThd + tTdh,
tThd +
tE
nStreams
+
tTdh
nStreams
,
tThd
nStreams
+
tE
nStreams
+ tTdh)
The last two expressions of the max() statement in the estimate for the dom-
inant kernel scenario capture the fact that the transfer may not be large enough
to hide all computation time. To fully understand this, consider the dominant
transfers scenario in Figure 5.2 for 2 streams and imagine that either the host to
device transfers or the device to host transfers only consume about half of the
time that they do in the picture. In either case, it is clear that when there is not
enough time spent on transfers and the number of streams is low enough, some
amount of time spent on kernel execution will not be overlapped.
5.3.3 No implicit synchronization and 2 copy engines
With 2 copy engines, transfers in one direction can overlap with transfers in the
opposite direction in different streams, as shown in Figure 5.3. This means that all
102 Chapter 5. Performance models for CPU-GPU data transfers
dominant kernel dominant transfers
Figure 5.3: Diagrams showing possible overlap for devices with no implicit syn-
chronization and 2 copy engines when using 1, 2, or 4 streams.
three components can be successfully overlapped with each other. The resulting
execution time can be modeled as the largest component plus the per stream
time spent on the other two components. For this reason, we can distinguish two
scenarios within the dominant transfer scenario: One where the host to device
transfer is dominant and one where the device to host transfer is dominant. As
explained earlier, when estimating the total time spent on transfers and execution
we can simply use the largest estimate of the different scenarios presented in this
section. Therefore, we provide a single estimate that captures the dominant
kernel and and both dominant transfers scenarios:
t = max
tThd
nStreams
+ tE +
tTdh
nStreams
,
tThd +
tE
nStreams
+
tTdh
nStreams
,
tThd
nStreams
+
tE
nStreams
+ tTdh)
Comparing Figures 5.2 and 5.3 we can also conclude that the 2nd copy engine
only improves performance if tEnStreams +
tThd
nStreams > tThd.
5.4 Modeling PCIe transfer time
Communication between host and device memory proceeds over PCI Express.
PCI Express is a packet-based switched point-to-point interconnect [65]. As
such, we decided to use the LogP [14], LogGP [3], and parameterized LogP [40]
class of performance models as a base for modeling the performance of transfers
over PCI Express. Note that we only consider asynchronous transfers over DMA,
5.4. Modeling PCIe transfer time 103
Figure 5.4: Time spent on transferring data between host and device for two
consecutive messages of k1 and k2 bytes.
this corresponds to either using cudaMemcpyAsync on pinned memory in a non-
default stream or using device-mapped host memory.
LogP [14] abstracts the communication of fixed-sized short messages through
four parameters: communication latency (L), overhead (o), gap (g), and the
number of processors (P ). Latency is the amount of time it takes each byte
to travel from endpoint to endpoint in the network. The overhead is defined
as the length of time that a processor is engaged in transmission or reception.
The gap is the reciprocal of the available per byte bandwidth for short messages
per processor. LogGP [3] extends the LogP model, by introducing a separate
bandwidth for large messages (G). The bandwidth for short messages (g) is kept
as a parameter in order to model the startup bottleneck of the network. We take
LogGP as a starting point for our model, as PCIe transfers between host and
device typically consist of large messages sent over DMA.
To model the performance of a memory transfer between host and device we
adapt the LogGP model in the following way. Under LogGP transferring a single
large message of k bytes is modeled as t = L+ o+ (k − 1)G+ o. For modeling a
single data transfer over PCIe we use: t = L+ o+kG. The second o in LogGP is
for the time spent by the receiving processor. In our case, the transfer proceeds
over DMA and data is written directly into the memory at the receiving end,
therefore there is no receiving processor overhead. The sending overhead o that
is left, can be seen as the time the processor is involved in registering the DMA
request with the controller. We replace (k − 1) with k, which is only part of the
LogGP model to reduce to LogP for k = 1. Because all transfers happen over
DMA, sending the first byte is not considered to be part of the sending processor
overhead. Not including the −1 byte seems more natural.
Sending multiple large messages of k1 and k2 bytes under LogGP is modeled
as t = L+ o+(k1− 1)G+ g+(k2− 1)G, where g captures the startup bottleneck
104 Chapter 5. Performance models for CPU-GPU data transfers
Host to Device Device to host
L+o 0.009420 ms
Ghd 8.318392E-008 ms/byte
ghd 0.002503 ms
L+o 0.009023 ms
Gdh 7.924734E-008 ms/byte
gdh 0.002674 ms
Table 5.1: Overview of model parameters for a PCIe 3.0 system
of the network. Just as in the LogP and LogGP models we assume a single port
model, i.e., only a single transmission can be active at any given time. Therefore,
if multiple streams are used to transfer a large message, we assume the per stream
transfers happen one after the other. As such, we model the transmission time of
large messages sent using multiple streams in a way similar to multiple consecutive
messages under LogGP. For example, we model 2 streams each transmitting half
of k bytes as t = L+ o+ 12k ×G+ g + 12k ×G, or more generally for nStreams
streams
t = L+ o+ (
k
nStreams
×G× nStreams) + g × (nStreams− 1),
which equals
t = L+ o+ k ×G+ g × (nStreams− 1).
Figure 5.4 presents an overview of how the parameters impact transfer times.
5.4.1 Model validation
We validate the model using an Nvidia GTX Titan GPU connected over PCIe
3.0 in a system with 2 Intel Xeon E5-2630 CPUs running at 2.3GHz using CUDA
5.5. The model parameters are obtained as follows. L + o is approximated by
measuring the transfer time of a single byte transfer. G is estimated measuring
n transfers of size k1, . . . , kn bytes. The sum of transfer times
tk1 + · · ·+ tkn = n(L+ o) +G(k1 + · · ·+ kn)
gives
G =
tk1 + · · ·+ tkn − n(L+ o)
k1 + · · ·+ kn .
We have written a small benchmark application that extracts the model pa-
rameters using the averages of a large number of runs. The model parameters
for our test system are shown in Table 5.1. Because the parameters for host to
5.4. Modeling PCIe transfer time 105
 0
 0.5
 1
 1.5
 2
1 2 4 8 16 32 64 128 256
Ti
m
e 
(m
s)
Number of streams
observed
estimated
Figure 5.5: Host to Device transfer performance for 16 MB over PCIe 3.0.
 0
 0.5
 1
 1.5
 2
1 2 4 8 16 32 64 128 256
Ti
m
e 
(m
s)
Number of streams
observed
estimated
Figure 5.6: Device to Host transfer performance for 16 MB over PCIe 3.0.
device transfers and device to host transfers differ, we will write Ghd, or Gdh
respectively, to distinguish between the two. We observe that g < L + o, which
means that the latency of consecutive transmissions can be overlapped.
We have tested the model accuracy for data sizes ranging from 16 MB to 1
GB and varying the number of streams from 1 to 256. For the host to device
transfers our model always underestimates performance with a maximum error of
1.18%. For device to host transfers the maximum overestimation error is 2.47%
and maximum underestimation is 0.65%. Figures 5.5 and 5.6 show the observed
and estimated PCIe transfer times for transfers of 16 MB varying the number
106 Chapter 5. Performance models for CPU-GPU data transfers
Host to Device Device to host
PCIe 2.0 explicit-explicit
PCIe 2.0 explicit-mapped
PCIe 3.0 explicit-mapped
Ghd 1.741965E-007
Ghd 2.491988E-007
Ghd 1.193386E-007
Gdh 1.747120E-007
Gdh 2.236644E-007
Gdh 1.480396E-007
Table 5.2: Model parameters for bidirectional transfers in ms/byte.
of streams. We only show graphs for size 16 MB, which produced the largest
relative errors. While a more detailed model may be able to produce even more
accurate performance estimations, we consider our model to be accurate enough
for our intended purposes.
Some devices have 2 copy engines and therefore transfers in one direction
can overlap with transfers in the opposite direction. The performance of these
transfers cannot be expected to be exactly the same as the performance of non-
overlapped transfers. While the communication channel can function in full du-
plex, some resources, including the end points of both transfers, are still shared.
As such, we can expect that the transfer time increases while a concurrent transfer
in the opposite direction occurs. Similarly, we can expect that the performance of
a single large transfer in one direction will degrade when many smaller transfers
in the opposite direction occur due to accesses to device-mapped host memory.
Therefore, we simply provide the observed Ghd and Gdh for bidirectional trans-
fers using streams on the Tesla K20 (PCIe 2.0) and for transfers colliding with
mapped memory transfers for both the K20 and the GTX Titan, see Table 5.2. A
separate performance model for the effective bandwidth under contention could
improve our performance estimations. However, performance modeling individ-
ual system parameters is beyond the scope of this work and not necessary given
the current accuracy of our performance predictions.
5.5 Overview of performance models
This section combines the model for PCIe transfers from Section 5.4 with the
models for computation and communication overlap using streams from Sec-
tions 5.3. In addition, we present our model for mapped memory kernels and
provide estimates for the number of streams to use as part of the classification.
An overview of all performance estimates we consider in this chapter is shown
in Figure 5.7. For applications using streams these are the estimates presented
in Section 5.3 with the PCIe transfer model from Section 5.4 filled in, shown
here as a reference for the reader. Note that the kernel execution time for all
streams combined tE , as well as the per stream kernel execution time tEnStreams
have to be estimated using any of the existing performance models or be derived
5.5. Overview of performance models 107
Implicit synchronization and 1 copy engine
Dominant kernel
t =L+ o+
Bhd
nStreams
×Ghd + tE + L+ o+Bdh ×Gdh + g × (nStreams− 1) (5.1)
Dominant transfers
t =L+ o+Bhd ×Ghd + g × (nStreams− 1)+ (5.2)
tE
nStreams
+ L+ o+Bdh ×Gdh + g × (nStreams− 1)
No implicit synchronization and 1 copy engine
Dominant kernel
t =L+ o+
Bhd
nStreams
×Ghd + tE + L+ o+ Bdh
nStreams
×Gdh (5.3)
Dominant transfers
t = max( (5.4)
L+ o+Bhd ×Ghd + g × (nStreams− 1) + L+ o+Bdh ×Gdh + g × (nStreams− 1),
L+ o+Bhd ×Ghd + g × (nStreams− 1) + tE
nStreams
+ L+ o+
Bdh
nStreams
×Gdh,
L+ o+
Bhd
nStreams
×Ghd + tE
nStreams
+ L+ o+Bdh ×Gdh + g × (nStreams− 1))
No implicit synchronization and 2 copy engines
t = max( (5.5)
L+ o+Bhd ×Ghd + g × (nStreams− 1) + tE
nStreams
+ L+ o+
Bdh
nStreams
×Gdh,
L+ o+
Bhd
nStreams
×Ghd + tE + L+ o+ Bdh
nStreams
×Gdh,
L+ o+
Bhd
nStreams
×Ghd + tE
nStreams
+ L+ o+Bdh ×Gdh + g × (nStreams− 1))
Device-mapped host memory.
t = max( (5.6)
L+ o+Bhd ×Ghd + L+ o, L+ o+ tE + L+ o, L+ o+ L+ o+Bdh ×Gdh)
Figure 5.7: Overview of performance models for using streams and for using
device-mapped host memory. Bhd is the number of bytes transferred from host
to device. Bdh is the number of bytes transferred from device to host.
108 Chapter 5. Performance models for CPU-GPU data transfers
empirically. Also note that if the amount of data transferred per stream varies the
largest amount per stream should be used in place of BhdnStreams or
Bdh
nStreams . The
individual models for streamed kernels have already been discussed in Section 5.3.
Therefore, the rest of this section discusses our model for a kernel using mapped
memory as well as estimates for a suitable number of streams.
5.5.1 Model for mapped memory
The model for kernels using mapped memory (Equation 5.6) reflects that compu-
tation as well as communication in either direction can overlap with each other.
As such, the performance can be estimated using only the longest of the three
components (host to device transfers, computation, or device to host transfers),
and the latency of the other components. For example, when the time spent on
transferring data from host to device is the longest component, the computa-
tions and device to host transfers can occur while additional data is still being
transferred from host to device. Therefore, the total time of the kernel can be
modeled as the time of the longest component and the latency of the overlapped
components. Because the performance of mapped memory kernels is not subject
to implicit synchronization or the number of copy engines, no separate models
are necessary.
The number of bytes transferred in either direction, Bhd and Bdh, should be
increased for elements in device-mapped host memory that are read or written
multiple times, as multiple accesses translate to multiple transfers over PCIe. In
addition, the accesses to device-mapped host memory must be fully coalesced,
otherwise the amount of traffic over PCIe should be multiplied for each serialized
access similar to noncoalesced accesses to device memory. This implies that
mapped-memory should in general only be used for kernels with simple access
patterns and may in other cases require that the kernel is rewritten to only
perform strictly coalesced memory accesses.
The model for mapped memory presented here is the result of a compromise
between accuracy and usability of the model. The model assumes that the two
smaller components can be overlapped with the component consuming the largest
amount of time, including possible transfer latencies of overlapped components,
but ignoring the possible compute latency. In extreme cases, this approach may
not be accurate enough. For example, when the amount of input or output data
for a single wrap is extremely large and that transfer is not the longest component,
the transfer latency for this warp will be underestimated by the model. Similarly,
when kernel execution is not the longest component, but the compute latency of
an individual warp is extremely large, it will be underestimated by the model.
We believe that a much more detailed model using warp-level granularity can
be more accurate for these extreme cases. This increased detail would require
5.5. Overview of performance models 109
a large extension in terms of parameters to the model, including the modeling
of the kernel execution time at the granularity of individual warps. We consider
our model presented here to be sufficiently accurate, while providing a high-level
framework for algorithm design and analysis.
5.5.2 Limits on the number of streams
To identify which implementation strategy is the most efficient for a particular
problem, some number of streams must be chosen to calculate performance pre-
dictions for implementations using streams. The performance models shown in
Figure 5.7 show the possible drawbacks from using additional streams. As such,
we can also reason how long performance could be improved from increasing the
number of streams. The key idea of using more streams is to maximize over-
lap. However, at a certain point we expect performance to degrade because the
cost of using an additional stream is not outweighed by the performance gain of
increased overlap.
This section explains how to produce estimates for the number of streams, to
ensure that a representative number of streams is used as part of the classification.
For many applications, it is practical to limit the number of streams to one of
the input dimensions, but in this chapter we assume that no more than 128
streams are used. Unfortunately, we cannot use the estimates provided in this
section to accurately predict the optimal number of streams at runtime for a real
application, because the kernel execution time is also influenced by the number
of streams. However, these estimations may be used to limit the search space for
(auto-)tuning the number of streams once the implementation using streams has
been created.
For devices with implicit synchronization we can take the first derivative of
Equations 5.1 and 5.2 to find for the dominant kernel case:
nStreamsmax =
√
Bhd ×Ghd
g
and for the dominant transfer case:
nStreamsmax =
√
tE
2g
.
For the dominant kernel scenario on devices with no implicit synchronization
and 1 copy engine, we can not derive a limit for the number of streams based
on Equation 5.3. As Equation 5.3 does not include the performance drawback
of using additional streams on the kernel execution time (tE). Modeling kernel
execution time is outside the scope of this chapter.
From Equation 5.4 it would seem from the model that performance can only
be increased by decreasing the number of streams. As long as the transfers can
successfully overlap all kernel execution time, lowering the number of streams
110 Chapter 5. Performance models for CPU-GPU data transfers
0
4
8
12
16
20
24
0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128
T
im
e
(m
s)
Number of streams
tThd
nStreams + tE +
tTdh
nStreams
tThd + tTdh
tThd +
tE
nStreams +
tTdh
nStreamstThd
nStreams +
tE
nStreams + tTdh
Figure 5.8: Performance as estimated by individual clauses of our performance
model varying the number of streams, for a device with no implicit synchroniza-
tion and 1 copy engine, 128 MB of input data, 2 MB of output data, 10 ms kernel
execution time, and PCIe 3.0 system parameters.
will reduce the overhead from using additional streams. However, nStreams can
only be decreased until BhdnStreams × Ghd + tEnStreams or tEnStreams + BdhnStreams ×
Gdh become larger than L + o + BdhnStreams × Gdh + g × (nStreams − 1) or L +
o + BdhnStreams × Gdh + g × (nStreams − 1), respectively. This is illustrated in
Figure 5.8, where all individual clauses from the performance models for devices
with no implicit synchronization and 1 copy engine are shown. The performance
estimate produced by our performance model is the maximum of all clauses for
a given number of streams. The best performing number of streams according
to the model can be found by calculating the intersection of the top two lines in
Figure 5.8.
The largest number of streams for devices with no implicit synchronization
and 2 copy engines can be derived from Equation 5.5 in Figure 5.7. Again, for
the dominant kernel scenario a limit on the number of streams depends on how
tE increases with the number of streams. For the dominant transfer scenario we
can derive the maximum number of streams from Equation 5.5:
nStreamsmax =

√
Bdh×Gdh+tE
g if (tThd > tTdh)√
Bhd×Ghd+tE
g if (tTdh > tThd)
5.6. Evaluation 111
GPU CEs IS CPUs Interconnect
GTX 480 1 3 2 Intel Xeon E5620 @ 2.4GHz PCIe 2.0
GTX Titan 1 7 2 Intel Xeon E5-2630 @ 2.3GHz PCIe 3.0
Tesla K20m 2 7 2 Intel Xeon E5-2620 @ 2.0GHz PCIe 2.0
Table 5.3: Overview of hardware used in experiments. CEs=Copy Engines.
IS=Implicit Synchronization.
5.6 Evaluation
This section evaluates the ability of our performance models to accurately classify
the performance of different kernel implementations for our target hardware plat-
forms. We introduce several real-world examples of kernel implementations that
overlap computation and communication to demonstrate how our performance
models can be applied to model the execution times of each implementation.
For this evaluation we use the DAS-4 distributed supercomputer. DAS-4 is a
heterogeneous platform containing many different compute nodes and accelera-
tors. We use three configurations detailed in Table 5.3 each using CUDA 5.5.
These configurations correspond with the three different categories introduced in
Section 5.3.
We evaluate our performance models using two kernels from the Parallel
Ocean Program [48], a 2D convolution kernel, and a matrix multiplication ker-
nel. The kernels from the Parallel Ocean Program fall into the dominant transfer
scenario. Convolution and matrix multiplication are compute-intensive kernels
that are modeled using the dominant kernel performance models.
5.6.1 Implementations
For each kernel we consider four different versions that we call Explicit, Implicit,
Streams, and Hybrid. Each version is designed for different overlapping behavior
between CPU-GPU data transfers and GPU computation. We first describe the
four versions in general before discussing the specific kernel implementations in
detail.
Explicit is a bulk synchronous implementation that uses explicit memory copy
statements to copy all the required input data to the GPU and the produced out-
put data from the GPU to host memory. The kernels used in Explicit implemen-
tations create a two-dimensional array of threads. The result is that for 2D data,
one thread computes one output element. For 3D data, used in the kernels from
the Parallel Ocean Program, each thread computes an array of output elements.
Implicit uses mapped memory and therefore requires no explicit memory copy
statements. Instead, data is requested by the GPU directly from the host memory
112 Chapter 5. Performance models for CPU-GPU data transfers
and sent over the PCIe bus. The performance of accessing the memory in this way
is very sensitive to the order in which data is requested and care must be taken
not to create gaps or misalignments from the mapping between threads and data.
Therefore, Implicit uses a kernel implementation that creates a one-dimensional
array of threads with size equal to the number of output elements. Each thread
then computes its N-dimensional index from its one-dimensional thread ID to
direct itself to the correct part of the computation.
The Streams implementation partitions the computation into a number of
streams and launches a kernel for each partial computation. Explicit copy state-
ments are used to copy the input and output data for each partial computation. If
the computation of one partition requires input from multiple partitions, CUDA
events are used to delay the computation until all inputs have been moved to the
device and vice versa. The kernel used in Streams is similar to the kernel used
in Explicit, except for the fact that each thread computes exactly one output
element. Note that, except for the differences described here, the kernels do not
contain any architecture specific optimizations.
The Hybrid implementation is similar to the Streams implementation in that
it uses explicit memory copies and streams to transfer the input data from host
to device and for invoking kernels. However, the kernel uses device-mapped
host memory to transfer the output from device to host. This may be somewhat
counter-intuitive, but the main idea behind this implementation is that input data
is often requested multiple times by different threads, therefore using mapped
memory for host to device communication is unlikely to give the best performance.
Since each output element is written only once, using mapped memory for device
to host communication is a good fit. This approach allows overlap between the
device to host transfers and host to device transfers in other streams even on
devices with only one copy engine and/or implicit synchronization. However,
there is no benefit for this method on devices with no implicit synchronization
and 2 copy engines. The performance of this implementation for all devices is
modeled as the performance of a streamed kernel for devices with 2 copy engines
and no implicit synchronization (see Figure 5.7, Equation 5.5).
5.6.2 State kernel
As explained in Chapter 4, the Parallel Ocean Program is a global ocean cir-
culation model which solves the three-dimensional primitive equations for fluid
motions under hydrostatic and Boussinesq approximations, in which depth is
used as the vertical coordinate. In this evaluation we use two kernels, state and
buoydiff, that are part of the computation of the vertical mixing coefficients [42]
within the Parallel Ocean Program [96].
5.6. Evaluation 113
The state kernel computes the water density for each vertical level k based
on two variables, temperature and salinity. State optionally also computes the
derivatives of the density with respect to temperature and salinity, which totals
to three output variables. Our Explicit implementation uses explicit copies to
move the three-dimensional grid of variables between host and device and creates
one thread for each horizontal grid point, which computes all outputs for each
level in the vertical direction. This approach is unable to overlap communication
to and from the device with GPU computation. However, it is possible to also
parallelize the computation of different vertical levels using CUDA streams. Our
Streams implementation ensures that GPU computation can be overlapped with
GPU communication of different vertical levels. Because of the simple access
pattern in state, where each input and output element is read or written only
once, it is also a good candidate for the highly parallel Implicit implementation.
The Hybrid implementation uses CUDA streams and explicit memory copies to
move data from host to device, but uses device-mapped host memory to transfer
the output from device to host.
We have measured the performance of the state and buoydiff kernel on a
1024 × 1024 × 42 domain for each of the hardware platforms in Table 5.3. The
observed performance results are compared against the performance predicted by
our models for the state kernel in Figure 5.9. 42 CUDA streams are used in both
the Streams and Hybrid implementations. One stream is mapped to one vertical
level of the domain.
For the state kernel the Implicit implementation is the most efficient on each
of the hardware platforms. This is expected as each input and output element is
read or written only once and the Implicit implementation is capable of overlap-
ping computation and communication in both directions to a very high extent.
This is also what is reflected by our performance models. The reason the Streams
implementation on the K20 outperforms Streams on the GTX 480 is not because
of the increased compute performance of the K20 or increased PCIe bandwidth,
but is mainly due to the second copy engine. This way the K20 is able to (par-
tially) overlap kernel execution with device to host transfers. The difference
between Implicit and Hybrid on the GTX 480 is larger than on the Titan or
K20, because the effective device to host bandwidth degrades more rapidly under
contention from host to device transfers.
5.6.3 Buoydiff kernel
The buoydiff kernel is also part of the computation of the vertical mixing co-
efficients within the Parallel Ocean Program. Buoydiff computes the buoyancy
differences between different vertical levels in the ocean, and internally uses the
state kernel for the computation of potential density. The computation of buoy-
114 Chapter 5. Performance models for CPU-GPU data transfers
 0
 20
 40
 60
 80
 100
 120
 140
 160
 180
 200
observed estimated observed estimated observed estimated
Ti
m
e 
(m
s)
Explicit
Implicit
Streams
Hybrid
K20TitanGTX480
Figure 5.9: Observed and estimated performance of the State kernel on three
different GPUs.
ancy differences at level k requires the density of both the surface level and level
k − 1 displaced to level k, as well as the water density at level k. These three
values can be computed for each level in parallel as long as all the data is present
on the GPU. Overlapping data movement from the host to the GPU with GPU
computation and data movement from the GPU to host becomes significantly
more difficult, because the input values (temperature and salinity) for levels 1,
k−1, and k need to be present on the GPU to compute the buoyancy differences
at level k. Therefore, the Streams and Hybrid implementations first schedule
memory copies to the GPU for all vertical levels in concurrent streams and then
invoke GPU kernel launches for all levels. However, before the execution of the
kernel in stream k can start, the memory copies in stream 1, k − 1, and k need
to be complete. The kernel executing in stream k outputs to different vertical
levels for different variables. Therefore, for the Streams implementation, one of
the device to host memory copies in stream k has to wait for the kernel in stream
k − 1 to complete. We use the CUDA event management functions to guarantee
that no computations or device to host transfers start prematurely.
We model the dependencies between operations in different streams within
the Streams and Hybrid implementations of buoydiff by regarding the host to
device transfers in the first two streams as a sequential part of the execution.
The measured and modeled performance of the buoydiff kernel, using the same
experimental setup as for the state kernel, are shown in Figure 5.10. The Explicit
5.6. Evaluation 115
 0
 50
 100
 150
 200
 250
 300
 350
 400
observed estimated observed estimated observed estimated
Ti
m
e 
(m
s)
Explicit
Implicit
Streams
Hybrid
K20TitanGTX480
Figure 5.10: Observed and estimated performance of the buoydiff kernel on three
different GPUs.
and Implicit implementations do not differ from the general description given in
Section 5.6.1.
As expected, the Implicit implementation is not the best performing for the
buoydiff kernel. The input data is requested multiple times by different threads,
which results in data being repeatedly transmitted. For the buoydiff kernel the
most efficient implementation differs from platform to platform. On the GTX
680 and the GTX Titan the Hybrid implementations are the best performing.
This is because the Hybrid implementation enables the kernel to utilize the PCIe
bus in full duplex, where this is not possible when using only explicit memory
copies. On the K20, which has 2 copy engines and no implicit synchronization,
there is no benefit to using a Hybrid implementation and as such Streams is the
best performing. These observations exactly match the classifications provided
by our performance models.
5.6.4 2D Convolution kernel
Convolution operations are essential to signal and image processing applications
and are used to implement the most fundamental tasks in Computer Vision,
for example the detection of edges and lines in images [25]. In the context of
image processing, a convolution operation computes the linear combination of
the weights in the convolution filter and a range of pixels from the input image
116 Chapter 5. Performance models for CPU-GPU data transfers
 0
 5
 10
 15
 20
 25
 30
 35
 40
 45
observed estimated observed estimated observed estimated
Ti
m
e 
(m
s)
Explicit
Implicit
Streams
Hybrid
K20TitanGTX480
Figure 5.11: Observed and estimated performance of the a 2D convolution kernel
on three different GPUs.
for each output pixel. As such, the input image also includes a border to support
the computation of pixels near the edge of the image.
For this evaluation we have used a kernel that has been optimized for compute
performance (see Chapter 3). Optimizations that are applied are, among others,
the use of constant memory, shared memory, read-only cache (on the Tesla K20
and GTX Titan), adaptive tiling [95], and loop unrolling. The kernel computes a
2D convolution of a 4096× 4096 image using a convolution filter of size 17× 17.
The kernel execution time exceeds the device to host and host to device transfer
times and is therefore modeled using the dominant kernel scenario.
The Streams and Hybrid implementations use 64 CUDA streams to overlap
the device to host transfers with GPU computations by transferring blocks of
horizontal rows from the input image in different streams. The first stream
transfers an additional block of rows that includes the border data. Similar to
the buoydiff kernel, computation in stream x can start when the device to host
transfers in stream x − 1 and x have completed. Transfers from device to host
only consist of the output data computed by the kernel in that stream and require
no additional synchronization. The Explicit and Implicit implementations do not
differ from the general description given in Section 5.6.1.
The estimated and observed performance for our 2D convolution kernel is
shown in Figure 5.11. For both the GTX 480 and GTX Titan GPUs our perfor-
mance models correctly predict the most efficient implementation. For the Tesla
5.6. Evaluation 117
K20, the measured and predicted times of Streams and Hybrid are very close
to each other, with Hybrid being predicted as slightly faster than Streams. As
explained earlier, using the Hybrid implementation strategy is not beneficial on
devices with 2 copy engines, such as the Tesla K20. Therefore, we do not consider
this as a problem for our performance model.
Although the 2D convolution kernel falls into the dominant kernel scenario,
the time spent on kernel execution and transferring data from host to device and
device to host are all of roughly similar size. In particular on the GTX 480, the
total time spent on each component is very close (tTdh = 11.34ms, tE = 15.78ms,
and tThd = 10.24ms). In fact, for the Hybrid implementation on the GTX 480,
where the effective device to host bandwidth is reduced because of contention, the
device to host transfer time increases such that it becomes the dominant factor
for performance. As such, the performance models for the dominant transfer
scenario are used to model its performance.
5.6.5 Matrix multiplication kernel
Matrix multiplication is one of the most extensively studied kernels for GPUs.
For this evaluation we have used a simple, naive implementation. The kernel
computes the product of two matrices C = A × B, all of size 4096 × 4096.
Matrix multiplication is a compute-intensive operation and is modeled using the
dominant kernel performance models.
The Streams and Hybrid implementations use 128 CUDA streams and achieve
overlap of the device to host transfers of matrix A by transferring the matrix in
row-wise blocks of data. As the computation of each row in A requires a different
column in B, matrix B is transferred from device to host beforehand without any
overlap. This is modeled in our estimates for Streams and Hybrid by adding the
estimated device to host transfer time of B. Transfers from device to host only
consist of the output data computed by the kernel in that stream and require
no additional synchronization. The Explicit and Implicit implementations do not
differ from the general description given in Section 5.6.1.
The observed and estimated performance of the matrix multiplication kernel
is shown in Figure 5.12. For each GPU the performance models correctly identify
the most efficient implementation, which is Hybrid for the GTX480 and Streams
on the GTX Titan and Tesla K20. The Implicit implementation is very inefficient
as all elements in the B matrix are requested for each row in A and thus have
to be transmitted repeatedly over the PCIe bus. This is because we have used a
naive kernel implementation, a more efficient kernel that reuses input data can
significantly reduce the bandwidth consumption.
118 Chapter 5. Performance models for CPU-GPU data transfers
 0
 100
 200
 300
 400
 500
 600
 700
observed estimated observed estimated observed estimated
Ti
m
e 
(m
s)
2836.1 2877.9 1421.8 1429.1 2840.6 2836.6 Explicit
Implicit
Streams
Hybrid
K20TitanGTX480
Figure 5.12: Observed and estimated performance of the a 2D convolution kernel
on three different GPUs.
5.6.6 Discussion
We have evaluated whether our performance models can be used to accurately
identify the most efficient implementation using a selection of kernels and one
GPU from each class that we consider. The results have shown that our perfor-
mance models can be used to predict the most efficient implementation strategy
for each kernel on each device.
To evaluate the precision of the performance predictions for the individual
implementations, we use the error of the prediction relative to the measured total
execution time. In all cases, the prediction error is below 10% of the observed
execution time. The Explicit implementations are predicted with a mean error
of 0.79% and a maximal error of 3.11%, which occurs for the buoydiff kernel on
the GTX Titan. The Implicit implementations prove to be the hardest to predict
with a mean error of 3.60% and 9.74% maximum for the convolution kernel on the
K20. For Streams the mean error is 1.15%, with the highest at 3.70% occurring
for 2D convolution on the K20. Our Hybrid implementations are predicted with
a mean error of 2.17% that goes up to 6.75% for buoydiff on the GTX 480.
We have also used our performance models to detect subtle performance is-
sues with our Streams and Hybrid implementations. In an earlier study [98],
we noticed that the observed times for the Streams and Hybrid implementations
were often higher than predicted by our models. This was due to the additional
synchronization used to wait for operations in other streams, as is used in our
5.7. Conclusions 119
implementations of buoydiff and 2D convolution. At first, the cross-stream syn-
chronization was placed just before the kernel launch, preventing the kernel from
launching prematurely. This setup, however, allows the device to host transfers
to occur in an arbitrary order, introducing unnecessary delays for some kernels
to be launched. We have resolved the issue by moving the synchronization to
before the device to host transfer. This way, we enforce a strict ordering of op-
erations across streams and ensure a higher degree of overlap between device to
host transfers and kernel execution.
Our Hybrid implementation was the most efficient implementation for buoy-
diff and 2D convolution on the GTX 480 and GTX Titan, and for the matrix
multiplication kernel on the GTX 480. This implementation strategy uses CUDA
streams to overlap transfer from device to host with GPU computation and uses
mapped memory for overlapping the device to host transfers. This is an impor-
tant result, as our Hybrid implementation improves performance over the Streams
implementation by up to 30.0% of the execution time, while the CUDA Program-
ming guide [62] states there is no need to use streams when mapped memory is
used. We have shown that using streams and mapped memory for a single kernel
can be very beneficial on devices with one copy engine.
The performance improvement from overlapping computation and commu-
nication varies from kernel to kernel. For the state, buoydiff, and convolution
kernels the average performance improvement is a factor of 1.61, 1.95, and 1.90,
respectively. For our matrix multiplication kernel the performance improvement
is only up to 3.7% of the execution time, because the performance is dominated
by the long kernel execution time.
5.7 Conclusions
PCIe transfers can have a large impact on performance. In particular when
the entire data set does not fit into the GPU memory, inter-node communica-
tion at the end of each time step requires data to be present at the host, or
parts of the application are executed on the CPU. Overlapping computation and
communication can be challenging and requires a considerable effort from the
programmer and results in much more code. Several different methods for trans-
ferring data and overlapping computation and communication are available, but
little is known about when to apply which method. The different implementa-
tions require completely different host codes as well as modifications to the GPU
kernel. Implementing all alternatives is often a large programming effort and not
feasible.
In order to solve these issues and provide insight in the performance of GPU
applications that require regular transfers across the PCIe bus, we propose an
analytical performance modeling approach that includes for PCIe transfers and
120 Chapter 5. Performance models for CPU-GPU data transfers
overlapping computation and communication. Our evaluation shows that our
performance model can be used to correctly select the best performing implemen-
tation strategy for four different kernels on three different GPUs. In all cases,
the prediction error of the observed execution time of individual implementa-
tions is below 10%. The largest error occurs in the estimations of the Implicit
implementation, which is explained by the fact the effective bandwidth under
contention from a large number of transfers resulting from reads and writes to
device-mapped host memory varies. We have also used our performance models
to detect subtle performance issues with earlier versions of our Streams and Hy-
brid implementations, clearly indicating the applicability and effectiveness of our
models.
The current set of parameters for the models presented in this chapter is a
compromise between capturing the main performance characteristics and provid-
ing a reasonable framework for algorithm design and analysis. While no small
set of parameters can describe all machine aspects completely, analysis becomes
more difficult with a large set of parameters.
We believe the performance models presented in this chapter open several av-
enues for future work. The degrading PCIe transfer performance for overlapping
transfers in opposite directions is currently captured by using a lower effective
bandwidth. Our performance models could be extended with a separate model
for the effective bandwidth under contention, which could further improve per-
formance estimations. In addition, the performance models presented in this
chapter may be extended with a performance model for kernel execution time.
Such a unified model can be used to provide a complete overview of the behav-
ior of kernels that overlap computation and communication. Our performance
models or extended versions may be applied within optimizing compilers or other
software development tools that automatically generate efficient host code for
CUDA programs.
121
Chapter 6
Conclusion
GPUs offer a large amount of compute power against low hardware costs and
power consumption. This is why GPUs are such a promising platform for su-
percomputing applications. As a result, supercomputing systems have become
more diverse, as different combinations of multi-core CPUs and many-core GPUs
are used to build these systems. Developing scientific applications that execute
efficiently on these systems has become much more challenging, but not less im-
portant. Supercomputer systems are large investments and even a reduction of
10% in the execution time of applications can save millions of dollars [30].
In this chapter, we first summarize the contents of this thesis. This is followed
by a discussion of the contributions of our work towards answering the challenges
of using GPUs for scientific supercomputing, as presented in Chapter 1. After
this, we will present the overall conclusions of this thesis.
6.1 Summary
In this thesis we have investigated how to simplify the use of Graphics Process-
ing Units (GPUs) in scientific supercomputing applications, increase application
performance, and increase our understanding of the performance of these GPU-
enabled applications. In order to structure our approach we have formulated five
challenges that are covered in the different chapters of this thesis. The challenges
require much more work than is possible within the scope of this thesis to be
solved completely. Instead we aim to gain more understanding of the fundamen-
tal problems that give rise to these challenges and provide contributions towards
a general solutions for each challenge.
To develop techniques that are applicable to the entire landscape of floating-
point intensive scientific applications, we have looked into two application do-
122 Chapter 6. Conclusion
mains, Multimedia Content Analysis and Climate Modeling. Multimedia Content
Analysis is a representative application domain for compute-intensive applica-
tions, whereas Climate Modeling is representative for data-intensive applications.
Chapters 2 and 3 focus on applications from the domain of Multimedia Content
Analysis, Chapter 4 on a Climate Modeling application, and Chapter 5 focuses
on a set of compute kernels from both these and other application domains.
In Chapter 2, we have shown how GPU execution can be integrated into
high-level parallel programming models that are ready to be used directly by
application domain experts. It is essential to develop such efficient programming
environments that hide the intrinsic complexity of GPU computing, as it is unre-
alistic to expect domain researchers to also become experts in high-performance
computing.
Specifically, we have replaced the compute kernels that implement the al-
gorithmic patterns of the Parallel-Horus programming system for multimedia
computing with equivalent GPU kernels. Next, we have used a lazy paralleliza-
tion approach to optimize the execution of multiple compute kernels applied in
sequence. Based on a finite state machine specification, we insert GPU mem-
ory management operations at runtime, whenever necessary. Our approach has
proven to constitute a simple, yet scalable and effective method for the inter-
kernel optimization of MMCA applications on GPU-clusters.
We have evaluated the performance of the extended Parallel-Horus system
using multiple versions of a representative line detection application on two dif-
ferent GPU clusters. Our evaluation shows that lazy parallelization improves
performance in all tested cases. We have demonstrated that the model is capa-
ble of significantly accelerating a fully sequential program, without requiring any
additional effort from the application developer.
In Chapter 3, we have presented an extensive optimization study of 2D con-
volution, a representative operation for compute-intensive applications in many
application domains. Convolutions are essential to signal and image processing
applications, and are typically responsible for a large fraction of the application’s
execution time. We have introduced a new approach for solving shared memory
bank conflicts in the context of convolution operations as well as an optimization
approach, called adaptive tiling, for implementing efficient, yet flexible, convolu-
tion operations on modern GPUs. We have also presented an implementation
that combines adaptive tiling with loop unrolling to obtain a less flexible, yet
highly efficient implementation. The latter approach is less flexible, as the filter
sizes used by the application at runtime have to be known at compile time.
We have evaluated the performance impact of many different implementa-
tions for the 2D convolution operation and separable convolution on the GTX680,
GTX480 and Tesla K20 GPUs. All optimizations combined have a large impact
on performance. Our final optimization step demonstrates a performance im-
6.1. Summary 123
provement of up to a factor 9.14 over naive GPU implementations. This is not
only due to the adaptive tiling combined with unrolling optimization, but also
because the best performing combination of thread block size and tiling factor is
selected for each individual filter.
In Chapter 4, we have investigated whether and how data-intensive applica-
tions can benefit from using GPUs. This chapter considers the Parallel Ocean
Program (POP), a representative climate modeling application. We have inves-
tigated the performance characteristics of POP to identify what specific parts of
the application can benefit the most from execution on GPUs. In addition, we
have described in detail how the existing software should be modified to efficiently
include GPU execution. Finally, we report the impact that the use of GPUs has
on performance, also when using multiple GPU clusters simultaneously.
We have identified three functions that are part of the vertical mixing pipeline
of POP to be good candidates for GPU execution. To efficiently include the use
of GPUs within the existing software, we have created multiple versions of each
GPU function to experimentally verify which version realized the highest overlap
between GPU computation and CPU-GPU communication. In addition, the
execution of all GPU kernels as well as all data transfers is fully overlapped with
the execution of other functions on the CPU. As such, the functions executing
on the GPU have practically disappeared from the application’s execution time.
Our evaluation shows that the use of GPUs is beneficial for performance, even
though the application itself did not have any real computational hotspots.
These are promising results, as currently the majority of the execution time
per kernel is spent on PCIe transfers. When more computation is moved to
the GPU, more data can be reused on the GPU, and some intermediate data
structures that result from computation may even never have to leave the GPU.
In that case, many the PCIe transfers can be eliminated completely. In future
work we hope to produce a complete GPU implementation of the vertical mixing
part of POP.
In addition, we have shown that it is possible to improve the load balancing
of POP such that it can run successfully in a multi-cluster setting. Moreover, our
evaluation shows that the combination of GPU execution and the hierarchical
load-balancing scheme improves performance even further. This is a remarkable
result, as the hierarchical partitioning scheme prefers small block sizes, while
the GPU code prefers larger sized blocks to increase GPU utilization. However,
GPU utilization is again increased by the fact that all MPI tasks running on a
single node share a single GPU for all their GPU computations. As such, the two
approaches work in concert to improve application performance.
In Chapter 5, we have presented techniques that are necessary to optimize
CPU-GPU communication intensive applications for performance. Many GPU
applications perform data transfers to and from GPU memory at regular inter-
124 Chapter 6. Conclusion
vals. Overlapping GPU computation with CPU-GPU communication can reduce
the costs of moving data. Several different techniques exist for transferring data
to and from GPU memory and for overlapping those transfers with GPU compu-
tation. The different implementations require completely different host codes as
well as modifications to the GPU kernel. Implementing all alternatives is often
a large programming effort and not feasible. This chapter provides insight in the
performance of GPU applications and proposes an analytical performance model
that includes PCIe transfer and overlapping computation and communication.
Our evaluation shows that the performance models can be used to correctly
select the best performing implementation strategy for a set of real-world kernels
on three different GPUs. In all cases, the prediction error is below 10% of the
observed execution time. We have also used our performance models to detect
performance issues in earlier versions of our implementations. Overlapping GPU
computation and CPU-GPU communication improves the performance of the
kernels in our evaluation by almost a factor of 2. The results clearly indicate the
applicability and effectiveness of our performance models.
6.2 Discussion
In this thesis, we set out to seek answers to the five challenges we have formulated
that cover the major difficulties of using GPUs for scientific supercomputing. In
the following, we discuss the contributions of our work towards answering each
challenge and formulate the insights gained from the research carried out as part
of this thesis.
Challenge 1 How to lower the threshold for domain experts to use GPU equipped
computing systems?
GPU computing has always been considered difficult. This is in part because
GPU computing programs are implemented using specialized programming mod-
els such as CUDA or OpenCL. While these programming models only expand the
C programming language with a few keywords, writing GPU programs requires
a different mindset on parallelism and concurrency.
Over the years, a variety of programming tools, such as compiler directives
and source-to-source optimizing compilers, have been developed to abstract away
some of the complexity of GPU programming. The problem with creating ab-
stractions is that the features that are abstracted may be irrelevant for some
applications, but crucial to the performance of others. For example, many auto-
matic optimization tools for GPUs apply optimizations that are known to improve
the performance of a matrix multiplication kernel. However, this approach is far
from guaranteed to yield efficient implementations of other GPU kernels.
6.2. Discussion 125
An orthogonal approach is to create a programming model intended for use
within a particular application domain, as opposed to a general purpose program-
ming model. User transparent parallel programming models, like Parallel-Horus,
are excellent examples of programming models that focus on a particular appli-
cation domain.
In this thesis, we have shown that it is possible to incorporate the use of GPUs
into a user transparent programming model. The resulting system allows appli-
cations that are able to execute efficiently on GPU clusters to be created in a fully
sequential manner, without requiring any parallelization or GPU programming
effort from the application programmer. While we have applied this approach to
a programming model for Multimedia Content Analysis, the approach in itself is
not limited to a specific application domain.
Challenge 2 How to measure the impact of the use of GPUs on application
performance?
The performance impact of using GPUs as a platform for computations can be
significant. As part of the work carried out in this thesis, we have seen that
many supercomputing applications do not consist of only highly optimized im-
plementations of every compute kernel. In fact, many scientific applications use
code that would be called a naive implementation in GPU jargon, for the simple
reason that naive implementations are much easier to maintain and port to new
architectures. We believe that, for this reason in particular, it is perhaps more
relevant to compare the CPU versus GPU performance of naive kernel implemen-
tations. Additionally, there are other factors that are left out by a comparison
solely based on performance, such as the time invested to create highly optimized
implementations for either architecture.
The performance of an application executing on either a CPU or a GPU
can not easily be compared. First and foremost because the performance being
compared is obtained using applications that have different implementations, use
different compilers, and execute on completely different hardware. We recognize
that a truly fair comparison between different experimental setups is very hard
to achieve. In this thesis, we have chosen to directly report performance num-
bers and not normalize results using additional metrics such as hardware costs or
power consumption to keep the experimental setup as simple as possible. Hard-
ware costs of both CPUs and GPUs are influenced by different factors in addition
to their performance capabilities. Power consumption is an important factor in
the operational costs for modern supercomputers. However, when only a small
part of the application executes on the GPU, it is clear that the GPU will be idle
for a large part of the execution.
To assess the performance impact from using either CPUs or GPUs in ap-
plications we have defined the notion of user-perceived speedup. User-perceived
126 Chapter 6. Conclusion
speedup is defined as the speedup observed by the end-user of an entire appli-
cation as a result from switching between different implementations of a single
parallel programming model, without making any modifications to the applica-
tion. For the end-users of GPU applications it is very important to learn the
performance impact of using GPUs.
In what manner the exact comparisons are to be made remains an active
topic of discussion within the scientific community [28, 46]. We believe that the
notion of user-perceived speedup allows for performance comparisons on the level
of entire applications in clear manner that can easily be explained to the end-users
of applications.
Challenge 3 How to efficiently integrate GPU kernels into large existing code
bases?
In this thesis, this challenge emerged on two occasions: first, when applying GPU
extensions to the library-based Parallel-Horus system; and second, when integrat-
ing GPU kernels into the Parallel Ocean Program. In this section, we summarize
the lessons learned from both exercises to arrive at a step-wise approach that
may guide the process of integrating GPU code into existing software.
When integrating GPU code into a software library, the problem is that it is
often unclear in what order and frequency the various functions of the library will
be called by the application. A naive approach would be to allocate and free GPU
memory and copy the input and output data to and from the GPU for each func-
tion called. A more efficient approach requires that a policy is defined regarding
the GPU memory management and the movement of data to and from the GPU.
An optimistic approach could for example asynchronously copy the GPU output
data back to main memory, expecting the next operation requires the data to
be present on the CPU. The policy we have implemented in Parallel-Horus is a
lazy approach. This approach assumes that all memory management operations
are redundant, unless not executing the operation would lead to inconsistencies
at runtime. This approach minimizes the number of redundant memory PCIe
transfers, but has the drawback that it eliminates every opportunity for over-
lap between GPU computation and CPU-GPU communication. Whether lazy
parallelization or overlapping CPU-GPU communication will be better for appli-
cation performance depends on behavior is expected from the applications that
are implemented using the programming model.
The Parallel Ocean Program is not implemented using a library of functions,
but as a monolithic application. As such, it was much more difficult to identify
suitable targets for GPU execution. In particular, because POP contains no true
computational hotspots. Because there was no opportunity for optimizing the
GPU kernels individually, we have focused on overlapping computation and com-
munication as much as possible. In addition, based on the sequence in which the
6.2. Discussion 127
4. analyze inter-kernel 
data flow
3. optimize compute 
kernels, if possible
2. create naive kernel 
implementations
1. identify time 
consuming operations
5b. eliminate redundant
memory operations
5a. minimize redundant 
memory operations
6. overlap transfers and
GPU computations
7. overlap GPU kernels 
with CPU execution
standalone applicationssoftware libraries
Figure 6.1: A step-wise approach for integrating GPU code into a large existing
codebase.
GPU kernels are called, we found that some of the input data of different kernels
could be kept on the GPU to eliminate redundant data transfers. Furthermore,
the execution of all GPU kernels as well as all data transfers to and from the
GPU are overlapped with the computation of other functions on the CPU.
Figure 6.1 shows a step-wise approach derived from our experience that may
be used to guide the process of efficiently integrating GPU code into existing
software. We distinct between either library-based or standalone applications,
as these different approaches present different constraints to the application de-
veloper. This is true for library-based software in particular, as it is often not
known beforehand in what order the various functions will be called. This lim-
its the amount of overlap between data transfers and computation that can be
achieved. However, when GPU kernels are integrated into a standalone appli-
cation the application developer has more information and can as such apply
techniques for overlapping computation and communication more effectively.
Challenge 4 How to maximize the efficiency of compute-intensive GPU kernels?
In this thesis, we have presented a detailed study of the optimization process of a
2D convolution kernel to find answers to this challenge. When kernel execution is
the limiting factor in the performance of a GPU application, applying kernel-level
optimizations becomes crucial for performance. These optimizations are applied
128 Chapter 6. Conclusion
by restructuring the GPU kernel code to reorder or regroup computations or
change data access patterns. Several features of the GPU architecture, such as
the memory hierarchy or notion of threads and thread blocks, are exposed within
GPU programming models. Other features such as memory coalescing, bank
conflicts, or occupancy have to be considered when creating GPU applications
but are not explicitly part of GPU programming models. All different facets
of GPU programming, implicit or explicit, have to be taken into account when
optimizing a kernel for performance. Implementing every possible configuration
in the kernel optimization space is often infeasible and several studies have tried
to develop heuristics to guide the optimization process.
From our experience with implementing convolution operations we have learned
that optimizing GPU kernels consists of a search for the best performing combina-
tion of resource utilization and optimization techniques. For the 2D convolution
operation, we found that applying kernel-level optimizations was mostly concern-
ing the increase of data reuse at the cost of increased resource utilization, and
thus decreasing occupancy. One of the most difficult aspects of optimizing GPU
kernels is knowing how to exactly value occupancy against optimizations that
improve the throughput of the instruction stream. The performance impact of a
change in occupancy is not easily predicted and is as such difficult to develop an
intuition for.
Additionally, the occupancy, the number of warps that can execute in parallel
on each SM, is not always easy to calculate. Most factors used to calculate
the occupancy are known to the developer, except for the number of registers
per thread, which is determined by the compiler. We think it would be very
helpful if an SDK could indicate the expected occupancy for a kernel while the
optimizations are being applied. However, the performance impact of occupancy
varies per kernel and per GPU.
Another remarkable observation is that while optimizing convolution opera-
tions for compute performance, every optimization that we applied was actually
increasing data reuse and therefore lowering the memory bandwidth requirements
of the kernel This indicates that strictly following the roofline model when opti-
mizing GPU kernels may be too limited. Even if the kernel was no longer mem-
ory bandwidth bound, according to analysis using the roofline model, reducing
the memory bandwidth consumption still improved performance. The balance
between compute power and memory bandwidth, where in current GPU archi-
tectures compute power is abundant, while memory bandwidth is constrained,
is likely to change in the near future. The planned introduction of stacked
DRAMs [11] in Nvidia’s Pascal architecture [32]. Stacked DRAMs place the
normally off-chip DRAM directly on top of the GPU chip. This should increase
the global memory bandwidth from several hundreds of gigabytes per second to
several terabytes per second.
6.3. Conclusions 129
In this thesis, we have also learned when not to apply kernel-level optimiza-
tions. For example, for the kernels of the Parallel Ocean Program, there was little
to no opportunity for data reuse. Frequently applied optimizations to improve
the instruction stream such as tiling have significantly less effect if they do not
also increase data reuse.
Challenge 5 How to optimize data transfers between CPU and GPU?
Optimizing data transfers concerns maximizing the overlap between GPU compu-
tation and CPU-GPU communication. We faced this challenge when investigat-
ing how to efficiently integrate the GPU kernels into the Parallel Ocean Program.
Several different techniques are available for transferring data to and from GPU
memory and for overlapping those transfers with GPU computation. No perfor-
mance models existed that could help to decide which method to use, therefore,
we implemented all and measured the performance. However, to provide a much
more general answer to this challenge we created an analytical performance model
that predicts PCIe transfer times and the potential overlap between computation
and communication.
Our performance models can be used to correctly select the best performing
implementation strategy before implementing each alternative and can therefore
save the application developer a large amount of time. We believe that the per-
formance modeling techniques presented in this thesis have the potential of being
widely used by GPU application developers, because of their general applicability.
The long-term applicability of the performance models depends on whether
GPUs will remain separate computing devices. Microprocessor vendors have al-
ready developed so called APUs (accelerated processing units) [4], that contain
a small number of fast CPU cores and a larger number of smaller SIMD cores for
executing floating-point intensive workloads. However, the applicability of APUs
for supercomputing systems has currently not been demonstrated by a large su-
percomputing system ranked among the Top500 [92]. Nvidia has stated that for
the coming years, a heterogeneous chip containing both types of cores is currently
not on their roadmap [15]. Instead, Nvidia has announced NVLink [32], a new
interconnect that will replace PCIe 3.0 and reportedly improves the through-
put over PCIe by a factor of 5 to 12. While the empirical parameters for the
bandwidth and latency of transfers are likely to change dramatically, we have no
reason to believe that our performance models will become any less relevant in
the near future.
6.3 Conclusions
GPUs can be a very efficient computing platform for scientific supercomputing,
but can be challenging to use efficiently. Therefore, the research question of this
130 Chapter 6. Conclusion
thesis was: "How can we simplify the use of GPUs in scientific supercomputing
applications, increase application performance, and increase our understanding
of the performance of these GPU-enabled applications?" The results of the work
presented in this thesis contribute to all three aspects of the stated research
question.
The GPU revolution has shifted the balance in the computing platform from
compute bound to memory bandwidth bound. This requires a completely differ-
ent way of reasoning about efficiency when developing scientific applications for
such platforms. We have observed that parts of the Parallel Ocean Program were
written to minimize the number of performed floating-point operations by storing
intermediate results and using additional memory. Before multi-core CPUs and
many-core GPUs, this was a perfectly fine strategy for creating efficient applica-
tions. However, memory bandwidth is often the limiting factor in performance for
current supercomputers and communication is more expensive than computation.
The introduction of GPUs as a computing platform for supercomputing sys-
tems created a large gap between hardware and software. In particular the fact
that developing efficient GPU code requires a significant amount of expertise,
creates obvious risks for supercomputing centers. It is currently taking large-
scale scientific applications that consume many compute hours at several large
supercomputing sites around the world, such as the Parallel Ocean Program, a
long time to get adjusted to the new computing platform. This spawns the ques-
tion of what we have learned from the GPU revolution that could expedite the
transition process when the next revolution in supercomputing systems arrives.
We believe that the underlying reason for the slow transition process of scien-
tific applications is due to the fact that they are currently implemented in a way
that is too rigid. In other words, the amount of manual labor required to revise
and restructure the code is not proportional to the conceptually simple tasks at
hand. This is based on the observation that a huge amount of manual labor is
required to, for example, rework a computational-conservative algorithm into a
memory-bandwidth-conservative one. We believe that more powerful application
development tools are necessary to increase productivity and portability. We
consider our work on user transparent programming models and performance
modeling among the essential steps in that direction.
References 131
References
[1] OpenCV 2.4.5. Open Source Computer Vision. http://www.opencv.org/, 2013.
[2] S.A. Al Umairy, A.S. van Amesfoort, I.D. Setija, M.C. van Beurden, and H.J. Sips.
On the use of small 2d convolutions on gpus. In First Workshop on Applications
for Multi and Many Core Processors (A4MMC) at ISCA, pages 52–64, 2012.
[3] A. Alexandrov, M.F. Ionescu, K.E. Schauser, and C. Scheiman. Loggp: Incor-
porating long messages into the logp model for parallel computation. Journal of
Parallel and Distributed Computing, 44(1):71–79, 1997.
[4] AMD. Accelerated Processing Units (APUs).
http://www.amd.com/us/products/technologies/apu, 2014.
[5] ESA Entertainment Software Association. Essential facts about the computer and
video game industry. sales, demographics and usage data., 2013.
[6] S.S. Baghsorkhi, M. Delahaye, S.J. Patel, W.D. Gropp, and W.W. Hwu. An adap-
tive performance modeling tool for gpu architectures. In ACM Sigplan Notices,
volume 45, pages 105–114. ACM, 2010.
[7] S.T. Belinschi. The lebesgue decomposition of the free additive convolution of two
probability distributions. Probability Theory and Related Fields, 142(1-2):125–150,
2008.
[8] M. Bernaschi, M. Bisson, and D. Rossetti. Benchmarking of communication tech-
niques for gpus. Journal of Parallel and Distributed Computing, 2012.
[9] F. Bleichrodt, R. Bisseling, and H.A. Dijkstra. Accelerating a barotropic ocean
model using a GPU. Ocean Modeling, 41:16–21, 2012.
[10] Cartesius. The Dutch National Supercomputer Cartesius,
https://www.surfsara.nl/systems/cartesius.
[11] B. Caulfield. Nvidia CEO Updates NVIDIA’s Roadmap.
http://blogs.nvidia.com/blog/2013/03/19/nvidia-ceo-updates-nvidias-roadmap/,
2014.
[12] J. Chen, P. Jönsson, M. Tamura, Z. Gu, B. Matsushita, and L. Eklundh. A simple
method for reconstructing a high-quality ndvi time-series data set based on the
savitzky–golay filter. Remote Sensing of Environment, 91(3):332–344, 2004.
132 References
[13] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and A.Y. Ng. Deep learning
with cots hpc systems. In Machine Learning. ICML 2013. International Confer-
ence on, pages 1337–1345, 2013.
[14] D.E. Culler, R.M. Karp, D. Patterson, A. Sahay, E.E. Santos, K.E. Schauser,
R. Subramonian, and T. von Eicken. Logp: A practical model of parallel compu-
tation. Communications of the ACM, 39(11):78–85, 1996.
[15] W.J. Dally. Efficiency and programmability: Enablers for exascale. In High Per-
formance Computing, Networking, Storage and Analysis. Supercomputing 2013.
International Conference for (GPU Technology Theater talk), pages 1–21. Nvidia,
2013.
[16] DAS4. The Distributed ASCI Supercomputer DAS4, http://www.cs.vu.nl/das4.
[17] U. Dastgeer and C. Kessler. A performance-portable generic component for 2d
convolution computations on gpu-based systems. In Seventh Workshop on Pro-
grammability Issues for Heterogeneous Multicores (MULTIPROG) at HiPEAC,
pages 1–12, 2012.
[18] I. Demeshko, S. Matsuoka, N. Maruyama, and H. Tomita. Ultra-high resolu-
tion atmospheric global circulation model nicam on graphics processing unit. In
Parallel and Distributed Processing Techniques and Applications. PDPTA 2012.
International Conference on, 2012.
[19] J.M. Dennis. Inverse space-filling curve partitioning of a global ocean model.
In Parallel & Distributed Processing. IPDPS 2007. International Symposium on,
pages 1–10, 2007.
[20] J.K. Dukowicz and R.D. Smith. Implicit free-surface method for the Bryan-Cox-
Semtner ocean model. Journal of Geophysical Research, 99(C4):7991–8014, 1994.
[21] O. Fialka and M. Cadik. FFT and convolution performance in image filtering on
GPU. In Information Visualization. IV 2006. International Conference on, pages
609–614. IEEE, 2006.
[22] National Center for Atmospheric Research (NCAR). Community Earth System
Model (CESM) http://www2.cesm.ucar.edu/, 2014.
[23] A. Galizia, D. D’Agostino, and A. Clematis. A Grid Framework to Enable Parallel
and Concurrent TMA Image Analysis. International Journal of Grid and Utility
Computing, 1(3):261–271, August 2009.
[24] J.M. Geusebroek, A.W.M. Smeulders, and H. Geerts. A minimum cost approach
for segmenting networks of lines. International Journal of Computer Vision,
43(2):99–111, 2001.
[25] J.M. Geusebroek, A.W.M. Smeulders, and J. Van De Weijer. Fast anisotropic
gauss filtering. Computer Vision, pages 99–112, 2002.
[26] J. Gómez-Luna, J.M. GonzáLez-Linares, J.I. Benavides, and N. Guil. Performance
models for asynchronous data transfers on consumer graphics processing units.
Journal of Parallel and Distributed Computing, 72(9):1117–1126, 2012.
References 133
[27] M. Govett, J. Middlecoff, and T. Henderson. Running the nim next-generation
weather model on gpus. In Cluster, Cloud and Grid Computing. CCGrid 2010.
International Symposium on, pages 792–796. IEEE, 2010.
[28] C. Gregg and K. Hazelwood. Where is the data? why you cannot debate cpu
vs. gpu performance without the answer. In Performance Analysis of Systems
and Software. ISPASS 2011. International Symposium on, pages 134–144. IEEE,
2011.
[29] S. Hartung, H. Shukla, J.P. Miller, and C. Pennypacker. Gpu acceleration of
image convolution using spatially-varying kernel. In Image Processing. ICIP 2012.
International Conference on, pages 1685–1688. IEEE, 2012.
[30] T. Hoefler, W. Gropp, M. Snir, and W. Kramer. Performance Modeling for
Systematic Performance Tuning. In High Performance Computing, Networking,
Storage and Analysis. Supercomputing 2011. International Conference for (SotP
Session), November 2011.
[31] S. Hong and H. Kim. An analytical model for a gpu architecture with memory-
level and thread-level parallelism awareness. In Computer Architecture. ISCA
2009. International Symposium on, pages 1–12. ACM/IEEE, 2009.
[32] J.H. Huang. Opening Talk. GPU Technology Conference 2014., 2014.
[33] Huygens. The Dutch National Supercomputer Huygens,
https://www.surfsara.nl/systems/huygens.
[34] W.W. Hwu. Convolution Lab. http://courses.engr.illinois.edu/ece408/
fall2011/MPs/MP3-README.txt, 2011.
[35] W.W. Hwu. Lecture notes. http://courses.engr.illinois.edu/ece408/fall2011/
ece408_syll.html, 2011.
[36] Z. Juhasz and D. Crookes. A PVM Implementation of a Portable Parallel Image
Processing Library. In Parallel Virtual Machine. EuroPVM’96. European Con-
ference on, pages 188–196. Springer, 1996.
[37] R. Kelly. Gpu computing for atmospheric modeling. Computing in Science &
Engineering, 12(4):26–33, 2010.
[38] D.J. Kerbyson and P.W. Jones. A performance model of the parallel ocean
program. International Journal of High Performance Computing Applications,
19(3):261–276, 2005.
[39] Khronos Group. OpenCL, www.khronos.org/opencl/, 2013.
[40] T. Kielmann, H.E. Bal, S. Gorlatch, K. Verstoep, and R.FH. Hofman. Net-
work performance-aware collective communication for clustered wide-area sys-
tems. Parallel Computing, 27(11):1431–1456, 2001.
[41] D. Koelma, E. Poll, and F.J. Seinstra. Horus c++ reference. University of Ams-
terdam, The Netherlands, Technical Report, January 2002.
[42] W.G. Large, J.C. McWilliams, and S.C. Doney. Oceanic vertical mixing: A
review and a model with a nonlocal boundary layer parameterization. Reviews of
Geophysics, 32(4):363–403, 1994.
134 References
[43] O.S. Lawlor. Message passing for gpgpu clusters: Cudampi. In Cluster Computing
and Workshops. CLUSTER’09. International Conference on, pages 1–8. IEEE,
2009.
[44] Q.V. Le, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, and A.Y. Ng.
Building high-level features using large scale unsupervised learning. In Acoustics,
Speech and Signal Processing. ICASSP 2013. International Conference on, pages
8595–8598. IEEE, 2013.
[45] J. Lebak, J. Kepner, H. Hoffmann, and E. Rutledge. Parallel vsipl++: An open
standard software library for high-performance parallel signal processing. Pro-
ceedings of the IEEE, 93(2):313–330, 2005.
[46] V.W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A.D. Nguyen, N. Satish,
M. Smelyanskiy, S. Chennupaty, P. Hammarlund, et al. Debunking the 100X GPU
vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In
Computer Architecture. ISCA 2010. International Symposium on, pages 451–460.
ACM, 2010.
[47] Lisa. The Lisa GPU Cluster at SURFsara, https://www.surfsara.nl/systems/lisa.
[48] Los Alamos National Laboratory. Parallel Ocean Program4,
http://climate.lanl.gov/Models/POP/, 2013.
[49] J. Maassen and H.E. Bal. Smartsockets:Solving the Connectivity Problems in
Grid Computing. In High-Performance Distributed Computing. HPDC 2007. In-
ternational Symposium on, Monterey, CA, USA, June 2007.
[50] J. Mak, P. Choboter, and C. Lupo. Numerical ocean modeling and simulation
with cuda. In OCEANS Conference, pages 1–6. IEEE, 2011.
[51] M.E. Maltrud, F.O. Bryan, and S. Peacock. Boundary impulse response func-
tions in a century-long eddying global ocean simulation. Environmental Fluid
Mechanics, 10(1–2):275–295, 2010.
[52] C.P. Marquet and J.L. Dekeyser. Data-parallel load balancing strategies. Parallel
Computing, 24:1665–1684, 1998.
[53] T.J. McDougall, D.R. Jackett, D.G. Wright, and R. Feistel. Accurate and compu-
tationally efficient algorithms for potential temperature and density of seawater.
Journal of Atmospheric and Oceanic Technology, 20(5):730–741, 2003.
[54] M.R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole.
Modeling and predicting performance of high performance computing applications
on hardware accelerators. International Journal of High Performance Computing
Applications, 27(2):89–108, 2013.
[55] J. Michalakes and M. Vachharajani. Gpu acceleration of numerical weather pre-
diction. Parallel Processing Letters, 18(04):531–548, 2008.
[56] P.J. Morrow, D. Crookes, J. Brown, G. McAleese, D. Roantree, and I. Spence.
Efficient implementation of a portable parallel programming model for image
processing. Concurrency and Computation: Practice and Experience, 11(11):671–
685, 1999.
References 135
[57] C. Nugteren, H. Corporaal, and B. Mesman. Skeleton-based automatic paralleliza-
tion of image processing algorithms for gpus. In Embedded Computer Systems.
SAMOS 2011. International Conference on, pages 25–32. IEEE, 2011.
[58] Nvidia. Fermi Compute Architecture. NVIDIA Corporation Whitepaper, 2010.
[59] Nvidia. GeForce GTX 680. NVIDIA Corporation Whitepaper, 2012.
[60] Nvidia. Kepler GK110 Architecture. NVIDIA Corporation Whitepaper, 2012.
[61] Nvidia. Kepler Tuning Guide. NVIDIA Corporation Whitepaper, 2012.
[62] Nvidia. CUDA Programming Guide. http://docs.nvidia.com/cuda/, 2014.
[63] Nvidia. Tesla Brand. http://www.nvidia.com/object/tesla-supercomputing-
solutions.html, 2014.
[64] B.R. Payne, S. O. Belkasim, G.S. Owen, M. C. Weeks, and Y. Zhu. Accelerated 2D
image processing on GPUs. In Computational Science. ICCS 2005. International
Conference on, pages 256–264. Springer, 2005.
[65] PCI-SIG. PCI Express Base Specification Revision 3.0, 2010.
[66] A. Plaza, D. Valencia, J. Plaza, and P. Martinez. Commodity cluster-based par-
allel processing of hyperspectral imagery. Journal of Parallel and Distributed
Computing, 66(3):345–358, 2006.
[67] V. Podlozhnyuk. FFT-based 2D convolution. NVIDIA Corporation Whitepaper,
2007.
[68] V. Podlozhnyuk. Image convolution with CUDA. NVIDIA Corporation Whitepa-
per, 2007.
[69] M. Renardy and R.C. Rogers. An introduction to partial differential equations,
volume 4. Springer, 2004.
[70] G.X. Ritter, J.N. Wilson, and J.L. Davidson. Image algebra: An overview. Com-
puter Vision, Graphics, and Image Processing, 49(3):297–331, 1990.
[71] G. Ruetsch and P. Micikevicius. Optimizing matrix transpose in cuda. NVIDIA
CUDA SDK Application Note, 2009.
[72] L.M. Russo, E.C. Pedrino, E. Kato, and V.O. Roda. Image convolution processing:
A gpu versus fpga comparison. In Programmable Logic. SPL 2012. Southern
Conference on, pages 1–6. IEEE, 2012.
[73] S. Ryoo, C.I. Rodrigues, S.S. Stone, S.S. Baghsorkhi, S.Z. Ueng, J.A. Stratton,
and W.W. Hwu. Program optimization space pruning for a multithreaded gpu. In
Code generation and optimization. International Symposium on, pages 195–204.
ACM, 2008.
[74] S. Ryoo, C.I. Rodrigues, S.S. Stone, J.A. Stratton, S.Z. Ueng, S.S. Baghsorkhi,
and W.W. Hwu. Program optimization carving for GPU computing. Journal of
Parallel and Distributed Computing, 68(10):1389–1401, 2008.
[75] D. Schaa and D. Kaeli. Exploring the multiple-gpu design space. In Parallel &
Distributed Processing. IPDPS 2009. International Symposium on, pages 1–12.
IEEE, 2009.
136 References
[76] J. Schmidt, C. Piret, N. Zhang, B.J. Kadlec, D.A. Yuen, Y. Liu, G.B. Wright, and
E.O.D. Sevre. Modeling of tsunami waves and atmospheric swirling flows with
graphics processing unit (gpu) and radial basis functions (rbf). Concurrency and
Computation: Practice and Experience, 22(12):1813–1835, 2010.
[77] F.J. Seinstra, J.M. Geusebroek, D. Koelma, C.G.M. Snoek, M. Worring, and
A.W.M. Smeulders. High-Performance Distributed Image and Video Content
Analysis with Parallel-Horus. IEEE Multimedia, 14(4):64–75, 2007.
[78] F.J. Seinstra and D. Koelma. User transparency: a fully sequential programming
model for efficient data parallel image processing. Concurrency and Computation:
Practice and Experience, 16(6):611–644, 2004.
[79] F.J. Seinstra, D. Koelma, and A.D. Bagdanov. Finite State Machine-Based Opti-
mization of Data Parallel Regular Domain Problems Applied in Low-Level Image
Processing. IEEE Transactions on Parallel and Distributed Systems, 15(10):865–
877, 2004.
[80] F.J. Seinstra, D. Koelma, and J.M. Geusebroek. A software architecture for
user transparent parallel image processing. Parallel Computing, 28(7–8):967–993,
August 2002.
[81] R. Shams, P. Sadeghi, R.A. Kennedy, and R.I. Hartley. A survey of medical image
registration on multicore and the gpu. Signal Processing Magazine, 27(2):50–60,
2010.
[82] T. Shi, M. Belkin, and B. Yu. Data spectroscopy: Eigenspaces of convolution
operators and clustering. The Annals of Statistics, pages 3960–3984, 2009.
[83] T. Shimokawabe, T. Aoki, C. Muroi, J. Ishida, K. Kawano, T. Endo, A. Nukada,
N. Maruyama, and S. Matsuoka. An 80-fold speedup, 15.0 tflops full gpu accel-
eration of non-hydrostatic weather model asuca production code. In High Per-
formance Computing, Networking, Storage and Analysis. Supercomputing 2010.
International Conference for, pages 1–11. IEEE, 2010.
[84] R. Smith, P. Jones, B. Briegleb, F.O. Bryan, G. Danabasoglu, J. Dennis,
J. Dukowicz, C. Eden, B. Fox-Kemper, P. Gent, M. Hecht, S. Jayne, M. Jochum,
W. Large, K. Lindsay, M.E. Maltrud, M. Norton, S. Peacock, M. Vertenstein,
and S. Yeager. The Parallel Ocean Program (POP) reference manual: Ocean
component of the Community Climate System Model (CCSM), , 2010.
[85] R.D. Smith, M.E. Maltrud, F.O. Bryan, and M.W. Hecht. Numerical simulation
of the north atlantic ocean at 1/10. Journal of Physical Oceanography, 30(7):1532–
1561, 2000.
[86] C.G.M. Snoek and M. Worring. Multimodal video indexing: A review of the
state-of-the-art. Multimedia tools and applications, 25(1):5–35, 2005.
[87] C.G.M. Snoek, M. Worring, J.M. Geusebroek, D. Koelma, F.J. Seinstra, and
A.W.M. Smeulders. The semantic pathfinder: Using an authoring metaphor for
generic multimedia indexing. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 28(10):1678–1689, 2006.
References 137
[88] K.L. Spafford and J.S. Vetter. Aspen: a domain specific language for performance
modeling. In High Performance Computing, Networking, Storage and Analysis.
Supercomputing 2012. International Conference for, pages 1–11. IEEE, 2012.
[89] J. Stam. Convolution Soup: A case study in CUDA optimization.
http://www.nvidia.com/content/GTC/documents/1412_GTC09.pdf, 2009.
[90] J.D. Teresco, J. Fair, and J.E. Flaherty. Resource-aware scientific computation on
a heterogeneous cluster. Computing in Science & Engineering, 7(2):40–50, 2005.
[91] J.C. Thibault and I. Senocak. Cuda implementation of a navier-stokes solver
on multi-gpu desktop platforms for incompressible flows. In AIAA Aerospace
Sciences Meeting, pages 1–15, 2009.
[92] Top500. http://www.top500.org, 2014.
[93] D. Tuia, M. Volpi, L. Copa, M. Kanevski, and J. Munoz-Mari. A survey of active
learning algorithms for supervised remote sensing image classification. Selected
Topics in Signal Processing, 5(3):606–617, 2011.
[94] G.K. Vallis. Atmospheric and Oceanic Fluid Dynamics: Fundamentals and Large-
Scale Circulation. Cambridge University Press, Cambridge, U.K., 2006.
[95] B. van Werkhoven, J. Maassen, H.E. Bal, and F.J. Seinstra. Optimizing con-
volution operations on gpus using adaptive tiling. Future Generation Computer
Systems, 30(0):14 – 26, 2014.
[96] B. van Werkhoven, J. Maassen, M. Kliphuis, H.A. Dijkstra, S.E. Brunnabend,
M. van Meersbergen, F.J. Seinstra, and H.E. Bal. A distributed computing ap-
proach to improve the performance of the parallel ocean program (v2.1). Geosci-
entific Model Development, 7(1):267–281, 2014.
[97] B. van Werkhoven, J. Maassen, and F.J. Seinstra. Optimizing convolution opera-
tions in cuda with adaptive tiling. In Second Workshop on Applications for Multi
and Many Core Processors (A4MMC) at ISCA, pages 1–12, 2011.
[98] B. van Werkhoven, J. Maassen, F.J. Seinstra, and H.E. Bal. Performance models
for cpu-gpu data transfers. In Cluster, Cloud and Grid Computing. CCGRID
2014. International Symposium on, pages 1–10, 2014.
[99] W. Weijer, M.E. Maltrud, M.W. Hecht, H.A. Dijkstra, and M.A. Kliphuis. Re-
sponse of the Atlantic Ocean circulation to Greenland Ice Sheet melting in a
strongly-eddying ocean model. Geophysical Research Letters, 39(9), 2012.
[100] J.B. White and J.J. Dongarra. Overlapping computation and communication for
advection on hybrid parallel computers. In Parallel & Distributed Processing.
IPDPS 2011. International Symposium on, pages 59–67. IEEE, 2011.
[101] S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual
performance model for multicore architectures. Communications of the ACM,
52(4):65–76, 2009.
[102] P.H. Worley and J. Levesque. The performance evolution of the parallel ocean
program on the cray x1. In Cray User Group Conference, 2003.
138
[103] Y. Yang, P. Xiang, J. Kong, and H. Zhou. A GPGPU compiler for memory
optimization and parallelism management. In Programming language design and
implementation. PLDI 2010. Conference on, pages 86–97. ACM, 2010.
[104] Y. Zhang and J.D. Owens. A quantitative performance analysis model for gpu
architectures. In High Performance Computer Architecture. HPCA 2011. Inter-
national Symposium on, pages 382–393. IEEE, 2011.
[105] Zoltan User Guide: Hierarchical Partitioning. last access: December 2013
http://www.cs.sandia.gov/Zoltan/ug_html/ug_alg_hier.html, 2013.
Samenvatting
De inzet van grafische processoren
voor wetenschappelijke
supercomputing applicaties
Grafische processoren (GPU’s) bieden de mogelijkheid tot een enorme reken-
kracht tegen lage aanschafkosten en stroomverbruik. Daarom zijn GPU’s een veel
belovend platform voor supercomputing applicaties. Supercomputers zijn als ge-
volg hiervan complexer en diverser geworden en worden tegenwoordig opgebouwd
uit combinaties van multi-core CPUs en many-core GPU’s. Het ontwikkelen van
wetenschappelijke applicaties die efficiënt gebruik kunnen maken van deze syste-
men is daardoor een veel grotere uitdaging geworden, maar niet minder belang-
rijk. Omdat supercomputers enorme investeringen zijn, kan zelfs een besparing
van 10% op de executietijd van applicaties al een miljoenenbesparing opleveren.
Om technieken te ontwikkelen die toepasbaar zijn voor de gehele scope van
wetenschappelijke applicaties, hebben we ons gericht op twee applicatiedomei-
nen, Multimedia Content Analysis en Climate Modeling. Multimedia Content
Analysis is een karakteristiek applicatiedomein voor reken-intensieve applicaties,
terwijl Climate Modeling juist karakteristiek is voor data-intensieve applicaties.
Daarna hebben we ons op gericht op het minimaliseren van de tijd die wordt be-
steed aan data transfers en het overlappen van deze transfers met berekeningen
op een manier die algemeen toepasbaar is voor GPU-computing applicaties.
In hoofdstuk 2, hebben we laten zien hoe GPU berekeningen geïntegreerd kun-
nen worden in hoog niveau parallelle programmeermodellen die direct gebruikt
kunnen worden door domeinexperts. Het is essentieel om efficiënte programmeer-
omgevingen te creëren die de intrinsieke complexiteit van GPU computing kunnen
140
verbergen, omdat het onrealistisch is om te verwachten dat domeinexperts ook
experts worden op het gebied van GPU-computing.
Om dit doel te realiseren hebben we de rekenfuncties die de algoritmische
patronen in het Parallel-Horus systeem implementeren vervangen door functies
die uitgevoerd worden door de GPU. Vervolgens hebben we een lazy paralleliza-
tion aanpak gebruikt om de uitvoering van meerdere rekenfuncties op de GPU
na elkaar te optimaliseren. Door middel van een finite-state machine worden
GPU geheugen operaties ingevoegd terwijl het programma draait. Deze aanpak
bleek niet alleen simpel te implementeren, maar ook een schaalbare en effectieve
methode voor de optimalisatie van MMCA applicaties op GPU-clusters.
We hebben de performance van het aangepaste Parallel-Horus systeem geë-
valueerd met behulp van verschillende versies van een karakteristieke voorbeeld
applicatie die lijnen in een afbeelding detecteert. Onze evaluatie liet zien dat
lazy parallelization de performance verbetert in alle geteste gevallen. Hiermee
hebben we aangetoond dat het aangepaste systeem in staat is om een volledig
sequentieel programma uit te voeren op een GPU-cluster, zonder dat dit enige
extra aanpassingen of expertise van domeinexperts vereist.
Hoofdstuk 3 laat een uitgebreide studie zien van de optimalisatie van de 2D
convolutie operatie. Deze operatie is karakteristiek voor reken-intensieve applica-
ties en vormt een essentieel onderdeel van signaal en beeld verwerkingsapplicaties.
Convolutie operaties zijn doorgaans verantwoordelijk voor een groot gedeelte van
de executietijd van dit soort applicaties. We hebben een nieuwe methode ge-
ïntroduceerd voor het oplossen van shared memory bank conflicts in de context
van convolutie operaties evenals een nieuwe optimalisatie aanpak die we adaptive
tiling noemen. Met behulp van adaptive tiling is het mogelijk om efficiënte en
flexibele convolutie implementaties te creëren voor GPU’s. We hebben ook nog
een combinatie van adaptive tiling met loop unrolling geïmplementeerd om een
minder flexibele, maar zeer efficiënte implementatie te verkrijgen. Deze laatste
methode is minder flexibel, omdat de afmetingen van het convolutie filter dat
wordt gebruikt in de applicaties al tijdens het compilen bekend moet zijn.
We hebben de performance geëvalueerd van verschillende implementaties van
convolutie operaties op een reeks van moderne grafische processoren. Alle opti-
malisaties hebben samen een grote impact op de performance. Onze uiteindelijke
optimalisatie laat een performance verbetering zien van tot een factor 9.14 over
naïeve implementaties. Dit komt niet alleen door de combinatie van adaptive
tiling en loop unrolling, maar ook doordat de best presterende combinatie van
thread block afmeting en tiling factor wordt geselecteerd voor iedere filter afme-
ting.
In hoofdstuk 4 hebben we onderzocht hoe data-intensieve applicaties voordeel
kunnen halen uit het gebruik van GPU’s. Dit hoofdstuk buigt zich over het
Parallel Ocean Program (POP), een karakteristieke Climate Modeling applicatie.
We hebben onderzocht wat de performance eigenschappen zijn van het Parallel
Ocean Program om te kunnen identificeren welke specifieke onderdelen van de
applicatie het meeste voordeel kunnen halen uit executie op GPU’s. Daarnaast
hebben we in detail beschreven hoe de bestaande software aangepast dient te
worden om efficiënt gebruik van GPU’s te bewerkstelligen. Ten slotte hebben we
onderzocht wat de impact is van het gebruik van GPU’s op de performance ook
wanneer het programma wordt gedraaid op meerdere GPU-clusters tegelijkertijd.
We hebben drie functies geïdentificeerd voor implementatie op GPU’s, deze
functies vormen een belangrijk onderdeel van de berekening van verticale convec-
tie in de oceaan. Om het efficiënt gebruik van de GPU te realiseren binnen de
bestaande software hebben we meerdere versies van iedere GPU functie geïmple-
menteerd, om experimenteel te kunnen verifiëren welke versie de beste overlap
behaalt tussen GPU berekeningen en communicatie van en naar GPU geheugen.
Daarnaast worden de berekeningen en de data transfers van en naar de GPU
overlapt met berekeningen op de CPU. Onze evaluatie laat zien dat het gebruik
van GPU’s bijdraagt aan de peformance, zelfs wanneer de applicatie geen echte
computationele hotspots bevat.
Dit is een veelbelovend resultaat, omdat het merendeel van de executietijd per
kernel nu nog opgaat aan PCIe transfers. Wanneer meer berekeningen worden
verplaatst naar de GPU, kan er meer data worden hergebruikt op de GPU en
sommige tussentijdse resultaten hoeven mogelijk zelfs nooit de GPU te verlaten.
In dat geval kunnen veel van de PCIe transfers volledig worden voorkomen. In de
nabije toekomst hopen we te komen tot een volledige GPU implementatie voor
de verticale mixing in POP.
Daarnaast, hebben we laten zien dat het mogelijk is om de werkverdeling in
POP te verbeteren, zodanig dat de applicatie kan draaien op meerdere clusters
tegelijkertijd. Nog belangrijker is dat onze evaluatie laat zien dat ook de com-
binatie van GPU executie en de hiërarchische werkverdeling ook performance
verbeterend werkt. Dit is een opmerkelijk resultaat, omdat het hiërarchische ver-
delingsschema de voorkeur geeft aan kleine blok afmetingen, om nauwkeurig het
werk te kunnen verdelen, terwijl de GPU juist grotere blokken nodig heeft om de
utilisatie op peil te houden. De GPU utilisatie wordt echter ook verhoogd door
het feit dat alle processen die draaien op de CPU cores slecht één enkele GPU
delen voor al hun berekeningen. Daardoor werken de twee verschillende methodes
eigenlijk hand-in-hand om de performance van de applicatie te verbeteren.
In hoofdstuk 5 hebben we technieken gepresenteerd die noodzakelijk zijn om
de performance van data-intensieve operaties te verbeteren. Eigenlijk alle GPU
applicaties voeren regelmatig data transfers uit van en naar GPU geheugen. Het
overlappen van berekeningen op de GPU en deze CPU-GPU communicatie kan
bijdragen aan het minderen van de kosten van het verplaatsen van data. Er
bestaan verschillende technieken om data van en naar de GPU te versturen en
142
om deze transfers te overlappen met GPU berekeningen. Het implementeren van
alle alternatieven is doorgaans erg veel werk en niet haalbaar. Om inzicht te
verschaffen in de performance van GPU applicaties stellen wij het gebruik van
een analytisch performance model voor, dat rekening houdt met deze transfers
en de overlap tussen berekening en communicatie.
Onze evaluatie laat zien dat de performance modellen gebruikt kunnen worden
om de best presterende implementatie strategie te selecteren voor een verzameling
van functies uit bestaande en veel gebruikte applicaties voor drie verschillende
klassen van GPU’s. In alle geteste gevallen is de foutmarge van de schattingen
minder dan 10% van de geobserveerde executietijd. Ook hebben we onze perfor-
mance modellen gebruikt om subtiele performance problemen in eerdere versies
van onze implementaties te ontdekken. Het overlappen van GPU berekeningen
en CPU-GPU communicatie verbetert de performance van de functies in onze
evaluatie tot een factor 2. Deze resultaten laten duidelijk de toepasbaarheid en
effectiviteit van onze performance modellen zien.
De introductie van GPU’s als een rekenplatform voor supercomputers heeft
een groot gat geslagen tussen hardware en software. We zijn van mening dat de
onderliggende reden hiervoor is dat de bestaande software eigenlijk te rigide is
geïmplementeerd. De hoeveelheid handmatig werk die nodig is om de code te
herzien en herstructureren staat niet in verhouding met hoe conceptueel eenvou-
dig die taak eigenlijk is. Wij zijn van mening dat krachtigere tools voor software
ontwikkeling noodzakelijk zijn om de productiviteit van dit proces te verhogen.
Wij zien het werk in dit proefschrift aan onder andere programmeermodellen en
performance modellen als essentiële stappen in deze richting.
Curriculum Vitae
Personal Data
Name: Ben van Werkhoven
Place of birth: Hoorn, The Netherlands
Date of birth: May 10, 1986
Nationality: Dutch
Current Address: Onbekendegracht 11 - 2,
1018 XP Amsterdam, The Netherlands
ben@cs.vu.nl
http://www.cs.vu.nl/ ben/
Education
PhD: High Performance Distributed Computing, VU, Amsterdam . . . . . . . . . . 2010-now
MSc: Parallel and Distributed Computer Systems, VU, Amsterdam . . . . . . . . 2008-2010
BSc: Bachelor Informatica, VU, Amsterdam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2004-2008
VWO: Natuur & Techniek, Copernicus, Hoorn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1998-2004
Awards
Best Paper Nomination: B. van Werkhoven, J. Maassen, F.J. Seinstra, and H.E.
Bal. "Performance models for CPU-GPU data transfers". In Proceedings of the 14th
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CC-
GRID), May 2014.
Winner: H. Dijkstra, F.J. Seinstra, H. Bal, D. Wallom, D. KranzlmÃĳller, S. Jha, F.
Bryan, B. Kirtman, M.A. Kliphuis, S.E. Brunnabend, M. Santcroos, J. Maassen, M. van
Meersbergen, B. van Werkhoven. "An Advanced Distributed Computing Approach
to High-Resolution Climate Modeling". Enlighten Your Research Global (organized by
ESnet, Internet2, Funet, Janet, and SURFnet), November 2013.
144
Sustainability Prize: F.J. Seinstra, J. Maassen, N. Drost, M. van Meersbergen, B.
van Werkhoven, H.E. Bal, I. Pelupessy, S. Portegies Zwart, M. Kliphuis, H. Dijkstra.
"High-Performance Distributed Multi-Model / Multi-Kernel Simulations". Enlighten
Your Research 3 (organized by SURFnet, SARA, BigGrid, NWO), December 2011.
Best Paper Award: B. vanWerkhoven, J. Maassen, and F.J. Seinstra. "Optimizing
Convolution Operations in CUDA with Adaptive Tiling". 2nd Workshop on Applica-
tions for Multi and Many Core Processors (A4MMC 2011) at the 38th ACM/IEEE
International Symposium on Computer Architecture (ISCA 2011), June 2011.
Publications
B. van Werkhoven, J. Maassen, F.J. Seinstra, and H.E. Bal. "Performance models
for CPU-GPU data transfers". In Proceedings of the 14th IEEE/ACM International
Symposium on Cluster, Cloud and Grid Computing (CCGRID), May 2014.
B. van Werkhoven, J. Maassen, M. Kliphuis, H.A. Dijkstra, S.E. Brunnabend, M.
van Meersbergen, F.J. Seinstra, and H.E. Bal. "A distributed computing approach to
improve the performance of the Parallel Ocean Program (v2.1)". Geoscientific Model
Development, Volume 7, Issue 1, Pages 267-281, Febuary 2014.
B. vanWerkhoven, J. Maassen, H.E. Bal, and F.J. Seinstra. "Optimizing Convolution
Operations on GPUs using Adaptive Tiling". Future Generation Computer Systems,
Volume 30, Pages 14-26, January 2014.
T. van Kessel, B. van Werkhoven, N. Drost, J. Maassen, H.E. Bal, and F.J. Seinstra.
"User transparent data and task parallel multimedia computing with Pyxis-DT". Future
Generation Computer Systems, Volume 29, Issue 8, Pages 2252-2261, October 2013.
B. van Werkhoven, J. Maassen, and F.J. Seinstra. "Optimizing Convolution Oper-
ations in CUDA with Adaptive Tiling". 2nd Workshop on Applications for Multi and
Many Core Processors (A4MMC 2011) at the 38th ACM/IEEE International Sympo-
sium on Computer Architecture (ISCA 2011), June 2011.
B. van Werkhoven, J. Maassen, and F.J. Seinstra. "Towards User Transparent Par-
allel Multimedia Computing on GPU-clusters". 1st Workshop on Applications for Multi
and Many Core Processors (A4MMC 2010) at the 37th ACM/IEEE International Sym-
posium on Computer Architecture (ISCA 2010), June 2010.
Other Publications
S.E. Brunnabend, H.A. Dijkstra, M. Kliphuis, B, van Werkhoven, H.E. Bal, F.J.
Seinstra, J. Maassen, and M. van Meersbergen. "Changes in extreme regional sea surface
height due to an abrupt weakening of the Atlantic MOC". Ocean Science Discussions,
Volume 11, Pages 1213-1241, April 2014.
S.E. Brunnabend, H.A. Dijkstra, M. Kliphuis, B. van Werkhoven, H.E. Bal, J.
Maassen, M. van Meersbergen, and F.J. Seinstra. "Future high-resolution dynamic
sea level changes". EGU General Assembly Conference Abstracts, April 2013
N. Drost, J. Maassen, M. van Meersbergen, B. van Werkhoven, H.E. Bal, F.I.
Pelupessy, S. Portegies Zwart, M. Kliphuis, H.A. Dijkstra, and F.J. Seinstra "Dis-
tributed Multi-Model/Multi-Kernel Simulations: A Case Study in Jungle Computing
using eScience Infrastructure and High Speed Networks" 8th IEEE International Con-
ference on eScience 2012 - Collaborative Research using eScience Infrastructure and
High Speed Networks (NECS), October 2012.
F.J. Seinstra, J. Maassen, R.V. van Nieuwpoort, N. Drost, T. van Kessel, B. van
Werkhoven, J. Urbani, C. Jacobs, T. Kielmann, and H.E. Bal. "Jungle Computing:
Distributed Supercomputing beyond Clusters, Grids, and Clouds". In: M. Cafaro and
G. Aloisio, editors, Grids, Clouds and Virtualization, Pages 167-197, Springer-Verlag,
2011.
Teaching
2012-2013
1. Concurrency and Multithreading - Excercise classes
2. Computer Graphics - Guest lecture, grading projects, and preparing software
2011-2012
1. Concurrency and Multithreading - Excercise classes
2. Computer Graphics - Guest lecture, grading projects, and preparing software
2010-2011
1. Concurrency and Multithreading - Excercise classes
2. Computer Graphics - Four lectures, grading projects, and preparing software
3. Bachelor Seminar - Co-lecturer
2009-2010
1. Concurrency and Multithreading - Excercise classes
2. Computer Graphics - Guest lecture, grading projects, and creating software
3. Operating Systems - Demos, exam questions, and grading
