Accelerating Numerical Simulations on Multiple GPUs with Multiple CUDA Streams Applied on a Sediment-Transport Model for Dual Lithologies by Bernhoff-Jacobsen, Heidi-Christin
Accelerating Numerical Simulations on
Multiple GPUs with Multiple CUDA
Streams
Applied on a Sediment-Transport Model for Dual
Lithologies
Heidi-Christin Bernhoff-Jacobsen
Master’s Thesis Spring 2015

Accelerating Numerical Simulations on





List of Figures vi
List of Tables vii
List of Acronyms viii
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Document Structure/Chapter Overview . . . . . . . . . . . . . . . 3
2 Thesis Domain 5
2.1 Sediment Transport Model . . . . . . . . . . . . . . . . . . . . . 5
2.2 Area of Interest Used for Testing . . . . . . . . . . . . . . . . . . 6
2.3 The Computer System . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Thesis Questions and Problems to be Solved . . . . . . . . . . . . 8
2.6 Contribution of This Work . . . . . . . . . . . . . . . . . . . . . 9
3 Methodology 11
3.1 Paradigms for The Discipline of Computing . . . . . . . . . . . . 11
3.2 The Numerical Method For Partial Differential Equations . . . . . 12
3.2.1 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 The Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 The Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Domain Decomposition Method . . . . . . . . . . . . . . . . . . 13
3.6 Empirical Research Method . . . . . . . . . . . . . . . . . . . . 14
3.6.1 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . 15
3.6.2 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . 15
ii
3.7 Enhancement Methods And Techniques . . . . . . . . . . . . . . 15
4 The Mathematical Model 17
4.1 Numerical Analysis of a
Dual-Sediment Transport Model . . . . . . . . . . . . . . . . . . 17
4.2 Numerical Scheme . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.1 Initial Condition (IC) . . . . . . . . . . . . . . . . . . . . 20
4.2.2 Temporal Discretization . . . . . . . . . . . . . . . . . . 20
4.2.3 Fully-Explicit Numerical Scheme . . . . . . . . . . . . . 20
4.2.4 Spatial Discretization . . . . . . . . . . . . . . . . . . . . 21
4.2.4.1 Spatial Discretization on Diffusion . . . . . . . 22
4.2.4.2 Spatial Discretization on Convection . . . . . . 23
4.2.4.3 Boundary Condition (BC) . . . . . . . . . . . . 23
4.2.5 Numerical Scheme for Diffusion . . . . . . . . . . . . . . 26
4.2.6 Numerical Scheme for Convection . . . . . . . . . . . . . 28
5 High Performance Computing (HPC) 31
5.1 Supercomputing in a Historical Perspective . . . . . . . . . . . . 32
6 Parallel Computer Architectures 35
6.1 Parallel Computing . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.1.1 Flynn’s Taxonomy -
A Scheme for Parallel Architectures . . . . . . . . . . . . 35
6.1.2 Designing a Parallel Algorithm . . . . . . . . . . . . . . 37
6.1.2.1 The Task/Channel Model for Parallel Architec-
tures) . . . . . . . . . . . . . . . . . . . . . . . 37
6.1.3 General-Purpose Parallel Computer Architecture . . . . . 39
6.1.3.1 CUDA - Compute Unified Device Architecture . 39
6.1.3.2 CUDA C Programming Model . . . . . . . . . 41
7 Heterogeneous Computing with CUDA C 47
7.1 Heterogeneous Computing . . . . . . . . . . . . . . . . . . . . . 47
7.2 Parallel Programming Features in CUDA C . . . . . . . . . . . . 48
7.2.1 Parallel kernel . . . . . . . . . . . . . . . . . . . . . . . 48
7.2.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.2.3 CUDA Device Memory . . . . . . . . . . . . . . . . . . 49
7.2.3.1 Linear memory . . . . . . . . . . . . . . . . . 50
7.2.3.2 Page-Locked Host Memory . . . . . . . . . . . 50
7.2.4 Synchronous Execution . . . . . . . . . . . . . . . . . . 51
7.2.5 Asynchronous Concurrent Execution . . . . . . . . . . . 52
7.2.6 CUDA Streams . . . . . . . . . . . . . . . . . . . . . . . 52
iii
7.2.6.1 Streams and Events on a Multi-GPU application 52
7.2.6.2 Synchronization . . . . . . . . . . . . . . . . . 53
7.2.7 CUDA Events . . . . . . . . . . . . . . . . . . . . . . . 53
7.2.8 CUDA Featureas for a Multi-GPU System . . . . . . . . . 54
7.2.8.1 Peer-to-Peer (P2P) Memory Access . . . . . . . 54
7.2.8.2 Peer-to-Peer (P2P) Memory Copy . . . . . . . . 55
7.2.8.3 Unified Virtual Address Space (UVA) . . . . . . 56
7.2.8.4 Peripheral Component Interconnect Express (PCIe) 56
8 Implementation of the
Sediment Transport Model 57
8.1 Analysis of a Parallel Architecture for the
Numerical Model . . . . . . . . . . . . . . . . . . . . . . . . . . 58
8.1.1 The Task/Channel approach for a Parallel Architecture . . 58
8.1.1.1 Partitioning . . . . . . . . . . . . . . . . . . . 58
8.1.1.2 Partitioning for a Multi-GPU system . . . . . . 59
8.1.1.3 Communication . . . . . . . . . . . . . . . . . 59
8.1.1.4 Communication for a Multi-GPU system . . . . 59
8.1.1.5 Agglomerating and Mapping . . . . . . . . . . 60
8.1.1.6 Agglomerating and Mapping for a Multi-GPU
system . . . . . . . . . . . . . . . . . . . . . . 61
8.2 General Details Regarding Both
Serial and Parallel Implementations . . . . . . . . . . . . . . . . 61
8.2.1 Variables for Implementation . . . . . . . . . . . . . . . . 62
8.2.1.1 Spatial and Time Intervals - dx, dy and dt . . . . 65
8.2.1.2 Ghost Points . . . . . . . . . . . . . . . . . . . 66
8.2.1.3 Influx . . . . . . . . . . . . . . . . . . . . . . . 66
8.2.2 Seismic Data Read Into The System With The
NetCDF-Library . . . . . . . . . . . . . . . . . . . . . . 67
8.2.3 Time Series . . . . . . . . . . . . . . . . . . . . . . . . . 68
8.2.4 Coalesced Memory . . . . . . . . . . . . . . . . . . . . . 69
8.3 Serial Implementation with ANSI C . . . . . . . . . . . . . . . . 69
8.3.1 Allocation of Coalesced Memory for the Serial Code . . . 69
8.4 Parallelization of Serial Code using the
CUDA C Extension . . . . . . . . . . . . . . . . . . . . . . . . . 70
8.4.1 Allocation and Copying of Coalesced Memory
for the Parallel Code . . . . . . . . . . . . . . . . . . . . 71
8.4.2 Threads and Blocks in a Grid . . . . . . . . . . . . . . . . 72
8.4.3 Application for a Single GPU . . . . . . . . . . . . . . . 73
8.4.4 Applications for Multiple GPUs . . . . . . . . . . . . . . 74
8.4.4.1 Data Decomposition . . . . . . . . . . . . . . . 75
iv
8.4.4.2 Pinned Memory on Host . . . . . . . . . . . . . 78
8.4.4.3 Implementation 2. . . . . . . . . . . . . . . . . 79
8.4.4.4 Implementation 3. . . . . . . . . . . . . . . . . 82
8.4.4.5 Implementationn 4. . . . . . . . . . . . . . . . 84
8.4.4.6 Implementation 5. . . . . . . . . . . . . . . . . 85
8.4.4.7 Implementation 6. . . . . . . . . . . . . . . . . 86
8.4.4.8 CUDA Kernels . . . . . . . . . . . . . . . . . . 87
9 Performance Analysis 89
9.1 Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9.2 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
9.3 Amdahl’s law . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
9.4 Gustafson’s law . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
9.5 The Karp-Flatt Metric . . . . . . . . . . . . . . . . . . . . . . . . 90
9.6 The Isoefficiency Metric . . . . . . . . . . . . . . . . . . . . . . 90
10 Conclusion 91
10.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
10.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
10.2.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 98
10.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100




4.1 5 Point Stencil . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1 Tianhe-2, Chinese Supercomputer . . . . . . . . . . . . . . . . . 33
6.1 GPU devoting more transistors to data processing than CPU [NVIDIA
CUDA Toolkit Documentation, v7.0] . . . . . . . . . . . . . . . . 40
6.2 GPU Computing Applications [NVIDIA CUDA Toolkit Documen-
tation, v7.0] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.3 The CUDA Architecture [NVIDIA CUDA Toolkit Documentation,
v7.0] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.4 Automatic scalability in the CUDA architecture [NVIDIA CUDA
Toolkit Documentation, v7.0] . . . . . . . . . . . . . . . . . . . . 43
6.5 Grid of thread block in the CUDA architecture [NVIDIA CUDA
Toolkit Documentation, v7.0] . . . . . . . . . . . . . . . . . . . . 44
6.6 Memory Hierarchy of the CUDA architecture [ REF: NVIDIA ] . 46
8.1 One Row of Ghost Points Around the Boundary . . . . . . . . . . 66
8.2 Domain Decomposed on GPUs . . . . . . . . . . . . . . . . . . . 76
10.1 Solution of h after 1000 years . . . . . . . . . . . . . . . . . . . . 94
10.2 Solution of s after 1000 years . . . . . . . . . . . . . . . . . . . . 94
vi
List of Tables
2.1 Questions This Thesis Aims To Answer . . . . . . . . . . . . . . 9
4.1 Varibles for the Mathematical Model . . . . . . . . . . . . . . . . 18
4.2 Transport Coefficients for Sand and Mud . . . . . . . . . . . . . . 19
4.3 Derivation of the Formulas for Boundary Conditions at the Influx . 25
6.1 Flynn’s Taxonomy for Parallel Computer Architecture . . . . . . 36
6.2 CUDA Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . 40
7.1 CUDA’s Hierarchical Execution Model . . . . . . . . . . . . . . . 48
7.2 Levels of Parallelism in CUDA . . . . . . . . . . . . . . . . . . . 49
7.3 Different "kinds" of Memory Copy in CUDA . . . . . . . . . . . 50
8.1 Additional Variables for the Programmed Code . . . . . . . . . . 64
8.2 Implementations with Various Features for Parallel Architecture . 71
10.1 Available GPUs for Testing . . . . . . . . . . . . . . . . . . . . . 91
10.2 Implementation Specifications . . . . . . . . . . . . . . . . . . . 93
10.3 Speedup on GeForce GTX 590 for the maximum of 4 GPUs . . . 96
10.4 Speedup onTesla K20 for the maximum of 2 GPUs . . . . . . . . 97
10.5 Answers to the Thesis Questions . . . . . . . . . . . . . . . . . . 100
vii
List of Acronyms
CPU Central Processing Unit
GPU Graphics Processing Unit
GPGPU General-purpose computing on Graphics Processing Units
HPC High Performance Computing
PDE Partial Differential Equation
PDE’s Partial Differential Equation’s
CFD Computational Fluid Dynamics
FDM Finite Difference Method
f FEM Finite Element Method
FVM Finite Volume Method
FTCS Forward Time Centered Space
DDT Domain Decomposition Methods
BVP Boundary Value Problem
BC Boundary Condition






PCB Printed Circuit Board
I/O Input/Output
UVA Unified Virtual Addressing
PCIe Peripheral Component Interconnect Express (PCI Express)
OpenMP Open Multi Processing
MPI Message Passing Interface
SISD Single Instruction, Single Data stream
SIMD Single Instruction, Multiple Data stream
MIMD Multiple Instruction, Multiple Data streams
MISD Multiple Instruction, Single Data streams




Improving the overall computational time is one of the challenges in scientific
computing today. Performing simulations and visualizations of large data have
been very slow or impossible to do on desktop computers or laptops. This is
due to the limited processing capacity of these computers’ Central Processing
Unit (CPU). This thesis will investigate the field of High Performance Computing
(HPC) and the possibility of running the simulations of a sediment transport model
in parallel on the many cores of the Graphics Processing Unit (GPU), and observe
if these computations will perform faster, and yield results with the same accuracy
as a CPU. It will also look at the possibility of coupling multiple GPUs, and
observe if that will give an additional speedup.
1.1 Background
In the field of scientific computing/computational science, mathematical models
and quantitative analysis techniques are used to solve a big specter of scientific
problems. Computers are used to perform simulations of the modeled equations
and other forms of computations. With the computer simulation we get a solution
that is a numerical approximations to the problem we are aiming to solve. Nu-
merical simulations can be performed in many various fields and with different
objectives, such as reconstructing and understanding events like earthquakes and
tsunamis, or predicting the future like a weather forecast or predicting unobserved
events such as where to find oil. Other fields where this is used include medical
applications, various physical phenomenon and rocket science.
Computer programs have traditionally been written for serial computations. Prob-
lems have been broken down to a set of instructions that have been executed
sequentially, one after another, on a single core on the Central Processing Unit
1
(CPU). The CPUs have over the last years moved away from having one core to
having multiple cores that can run in parallel. In 2006 Intel released a dual-core
processor called Intel Core 2 Duo, and CPUs with 4, 8, 16 and more processors
has been produced since then. The mainstream desktop computers and laptops
sold today have multi-core processors, a single CPU with two or more processing
units called cores, that can run multiple instructions simultaneously. To perform
a simulation with billions of computations on a CPU with e.g. 8 cores running in
parallel will still take a very long time, and we see that for such cases the multi-
core CPUs still have their limitations. Coupling multiple CPUs into a cluster has
been done to increase processing speed, but the purchase of all the units for such a
cluster is often rather expensive. Reducing cost is one of the motifs why this thesis
investigates the use of GPUs. Another motif is reducing computational time by
running simulations on parallel computers.
For certain problems in the field of scientific computing, the graphics card, has
been used to do the computations. The GPU is a circuit board originally designed
to rapidly alternate and render graphics on the computer screen. The possibility to
use the GPU to other tasks than graphics rendering has made scientists explore the
GPU’s architecture through General-Purpose computing on Graphics Processing
Units (GPGPU).
Accelerators such as GPUs are used to speed up computationally-intensive tasks.
These are also implemented in supercomputers around the world used by e.g. re-
searchers.
Compute Unified Device Architecture (CUDA) [NVIDIA CUDA Toolkit Docu-
mentation, v7.0] is a parallel computing platform and programming model. It
allows software developers to access the power of GPUs through the CUDA-
accelerated libraries end extensions to standard programming languages such as
C, C++, Fortran etc.
The CUDA architecture available on NVIDIA graphics cards will be tested
in this thesis, allowing the simulations of [“Application of a dual-lithology, depth-
dependent diffusion equation in stratigraphic simulation”] sediment transport model
to run in parallel on a single GPU. Coupling of multiple GPUs will also be tested.
Both the single- and multi-GPU will be tried with synchronous and asynchronous
simulations. The asynchronous version will perform the communication between
the GPU and CPU, or between the multiple GPU’s simultaneously with the com-
putations, and by that aim to avoid the overhead due to communications (between
the nodes). Here the communication between the GPUs will be tested using
the Peripheral Component Interconnect Express (PCI Express, also abbreviated
PCIe). Computer busses move information between the different parts of the hard-
2
ware in a computer system including the CPU, RAM and peripheral devices. The
PCIe is an expansion bus that in this case moves information between the nodes of
an GPU. In addition to running parallel code on multiple GPUs, multiple OpenMP
threads will will be spawn on the CPU to reduce kernel launch overhead
The geological phenomenon described in the equation set derived by [Rivenæs
1993] is one of many in the field of geology and geophysics. On the path to the
numerical solution to the mathematical sediment transport model, the equation
set is discretized using a Finite Difference method according the Forward-Time
Central-Space (FTCS) scheme.
The software is then implemented for GPU’s produced by Nvidia in CUDA C and
additional OpenMP threads, but also in plain C for CPU(s). The CPU implemen-
tation is used for comparison of speedup.
1.2 Motivation
The major motivation behind this work is to reduce computational time. The
results of large systems can take hours, days or moths to compute. For both re-
search and the industry reducing the overall computing time means money saved.
For business it may result in getting a product out on the market sooner, and in
research it means moving forward towards new findings. Getting results faster is
an important goal for everyone.
1.3 Document Structure/Chapter Overview
Chapter 1: Introduction presents the problem and why the filed is interesting to
investigate.
Chapter 2: Thesis Domain describes the problem to be investigated more thor-
oughly, and give a more precise description of the problem to be analized.
Chapter 3: Describes the different methodologies.
Chapter 4: Presents the mathematical model, discretizing schema and boundary-
and initial conditions.
Chapter 5: Gives an overview of High Performance Computing.
Chapter 6: Presents Parallel Architectures, Flynn’s taxonomy and Foster’s method-
ology.
Chapter 7: Presents the features in CUDA C applied in this thesis.
Chapter 8: Presents details in the various implementations of the sediment trans-
port.
Chapter 9: Presents a performance analysis.
3




2.1 Sediment Transport Model
The mathematical model investigated in this thesis is a sediment transport model
evolved/described by [“Application of a dual-lithology, depth-dependent diffusion
equation in stratigraphic simulation”]. The model describes an application of a
two-dimensional (2D) basin simulation. It is based on topographical diffusion of
a dual-lithology mass, as it runs down a dynamic slope and into a basin. The two
materials are classified as sand and mud. Each lithology has its own transport-
coefficient function, and the effect of compaction is included in the mass balance,
as well as depth-dependent transport coefficients (diffusivities).
This physical phenomenon is described by a set of Partial Differential Equa-
tions (PDEs). The equation set is a coupled system consisting of a diffusion part
and a convection part.
Diffusion describes/is the movement/motion of (mud/silt) molecules as it flows
from one region of high concentration to a a region of low concentration along a
concentration gradient. Diffusion can also relate to other matters as e.g. pressure
along a pressure gradient or temperature along a temperature gradient. The word
diffusion comes from the latin word "diffundere", and means "to spread out". We
here use the word in the context of mass transportation.
Convection is another transport phenomenon in the field of physics and fluid
mechanics. It describes the collective movement of groups/aggregates of molecules
within fluids. It describes how the masses (of sand) moves in a circulatory motion
as the variation in density (of the mud/silt) changes.
5
A broader elaboration of these equations will be handled in chapter 4 "The
Mathematical Model". A thorough presentation is given on how to get from an
equation set representing a physical model to the final computational model that
is used for implementations of the system.
To be able to do the computations/simulations on a computer, the area of inter-
est, also called domain, is separated into a grid consisting of discrete grid points.
Calculations on our equations are performed for each grid-point. The equation-
system consists of various variables that may hold different values on the different
grid-points.
Forward-Time Central-Space (FTCS) scheme is the Finite Difference Method
used to discretize the equations. In the numerical analysis the path to an algo-
rithm representing the numerical approximation of the analog equations is dis-
played.
Other extensive details regarding the analysis of Computational Fluid Dynamics
(CFD) are presented in the same chapter.
2.2 Area of Interest Used for Testing
The area of interest that will be focused on when testing the numerical algorithms
is situated in Florida. More specificly Lake Okeechobee and the Kissimmee
River. The lake is remarkably shallow with a maximum depth of only 4 m. and
an average depth of 2.7 m. The Kissimmee River is the single contributor of
sediments to the lake, counting for 30% of the (water-)volume added to the lake.
High resolution data for the lake’s bathymetry are derived from depth sounding,
and they are publicly available. These data will be utilized in the simulated model.
2.3 The Computer System
The application is based on the numerical model of a 2D system with time series.
In the process of implementing a computer system for the sediment-transport
model, there are several considerations to be taken. The simulations of this model
are to be run for large data-sets and it will be computing-heavy. If the system is
to be computed on a CPU of any desktop computer, even the more powerful ones,
it will most probably take terribly long time. The widely used term Big Data can
be be used when simulating a numerical model as the sediment-transport model
and with data-sets as large as the ones for Lake Okeechobee, since this yields a
system so large and complex that traditional data processing applications are inad-
6
equate. For that reason this thesis will investigate the field of High Performance
Computing (HPC), which is the practice of aggregating computing power that
delivers a much higher performance than than what is possible on a single CPU.
This system consists of a set of numerical PDE’s to be computed for a large set
for grid points. At each grid point computations on the same equations will be
performed, but the values of the equation’s variables may differ from one grid
point to the next. In a multiprogramming context, this system classifies under
the Singel Instruction, Multiple Data streams (SIMD) paradigm. This means
that the system may benefit from an implementation made for a parallel computer
architecture. In a parallel context this thesis will investigate the use of hardware
accelerators such as the GPUs produced by NVIDIA, a bleeding edge technology.
In the process of the thesis work there will be two main implementations. The
first is a serial application made for computations on a CPU, the next is a parallel
implementation made to run on a single GPU. The development of the parallel
code normally demands extended programming skills that can take time to master
and take in use. On the other hand, if the parallel system runs much faster than the
serial system then there are tremendous benefits. A further extension is to evolve
the parallel system to run on multiple GPU’s of an arbitrary number. One focus
will then be on GPU’s sharing the same PCI Express bus for data transfer.
The programing languages that will be used are C for the serial system to run on
a CPU, and C with the CUDA extensions/bindings called CUDA C, used for the
parallel GPU-version. Multiple technologies such as CUDA C combined withe
Open MP threads will be tested for the implementation of the multi-GPU applica-
tion.
Further details on HPC will be surveyed in the Chapter 5.
Details regarding CUDA C programming will be targeted in the Chapter 7.
2.4 Speedup
Measurements of performance will be made for the different systems, and they
will be compared in various ways. The performance enhancement can be mea-
sured using speedup as a quantifying metric for relative performance improve-
ment. More on this topic will be surveyed in the Chapter 9: Performance Analysis.
In this research the aim is to see if there are any improvements in computa-
tional time comparing a parallel application running on a single GPU to a serial
7
application running on a CPU. An additional focus is to see if there is even more
time to save if running the system on multiple GPU’s.
Another research aspect to focus on is if the parallel computations yields re-
sults with the same accuracy as the serial computations.
If the parallel system achieves a good speedup, it will open up for the possibil-
ity to run larger simulations. One possibility could be to do data-simulations over
a larger grid representing a larger interest domain or it could be a more refined grid
with a higher number of grid points within the same domain size, but yielding a
more detailed result. Other possibilities could be the choice of extending the time
series to include more time steps, or even a combination of the different options
that opens up if good speedup is achieved.
So, will the parallel code run faster and yield results in shorter time than the
serial code, and will the parallel code yield results with the same accuracy as the
CPU?
The next section will state the questions this thesis will aim to answer.
2.5 Thesis Questions and Problems to be Solved
One of the areas this thesis will investigate is wether there are any benefits in im-
plementing a parallel system. Then another question is; if there are time to save
when running the application on one GPU, will there be more time to save when
run on multiple GPUs?
Performing simulations and visualizations of large data and complex systems has
been very slow or impossible to compute due to the lack of computing power of
the CPU. The evolution in the GPU production has been amazing over the last
years, towards a circuit board adapted for heavy computations and other tasks
than graphics rendering. With the use of GPGPU, a new world has opened up to
computer programmers.
Here is a structured presentation of the questions this thesis will aim to answer
and the problems this thesis will aim to solve.
8
Q 1: Is a parallel application running on a single GPU
faster than a serial application running on a CPU?
And if so:
Q 2: Is a parallel application running on multiple GPUs
faster than a parallel application running on a single GPU?
Q 3: Is the application enhancing performance utterly if additional
CUDA Streams are utilized?
If the the GPU-implementations yields the results faster, then the further ques-
tions will be:
Q 3: Does the parallel application running on a single GPU yield results
with the same accuracy as the serial application running on a CPU?
Q 4: Does the parallel application running on multiple GPU’s yield results
with the same accuracy as the application running on a single GPU?
Table 2.1: Questions This Thesis Aims To Answer
2.6 Contribution of This Work
This thesis has reached the goal the of answering the Thesis Questions in table
2.1. The goal was to enhance speedup for sediment-transport simulations, with
the use of GPUs. Another goal was to achieve solutions with the same accuracy
as the solution from a CPU-simulation. Through this investigation, the enhanced
speedup has been more than satisfactory for simulations on both a a single GPU
as well as on multiple GPUs. The numerical results yield the same accuracy on






3.1 Paradigms for The Discipline of Computing
This thesis investigates a technology-oriented discipline with fundamentals in
mathematics and engineering. The discipline of computing embraces all of com-
puter science and computer engineering. The two fields do not differ in the core
material. [Comer et al. 1989] suggests paradigms for the discipline of computer
science, and emphasizes:
1. Theory - based in the mathematical model.
2. Abstraction - based in experimental scientific method.
3. Design - based in engineering, involves designing and testing a system.
With their view on methodologies for the field, this chapter presents methods
utilized on the path to a solution to the problem aimed to be solved in this thesis.
The main problem to be tested is if a computer-system can enhance perfor-
mance if the simulation is executed on a parallel architecture like the SIMD archi-
tecture on a GPU, compared to execution on a serial CPU.
For this purpose the chosen mathematical model is the sediment-transport
model by [Rivenæs 1993]. This model consists of a coupled system of Partial
Differential Equations (PDE’s). To be able to compute the set of analog PDE’s,
the equations has to be discretized and for that a numerical method is needed.
11
3.2 The Numerical Method For Partial Differential
Equations
The approach to find numerical solutions to PDE’s has led researchers into the
field of numerical analysis, and numerous well established numerical methods has
been evolved to help in this matter. Some of these method are, the Finite Differ-
ence Method (FDM), Finite Element Method (FEM) and Finite Volume Method
(FVM).
On the sediment transport model the Finite Difference Method is applied.
With this method, functions are represented by their values at discrete points in
a grid. Through the differences in these values, derivatives are approximated.
From the mathematical model the numerical model is derived using an Explicit
Method. This is a discretizing method applying a forward difference in time, and
central difference in space, called Forward-Time Central-Space (FTCS). It is a
first-order method in time and a second-order method in space. For the sediment
transport model there are 2 dimensions in space, so second-order derivatives are
calculated in both x- and y-direction. The calculations made for time-step ul uses
the values at calculated at timestep ul−1.
3.2.1 Stability
To maintain numerical stability for 1D problems with the FTCS method, the fol-
lowing condition has to satisfied.
r = αδ tδx2 ≤ 12
This means that the size of the time step, δ t is restricted by this stability con-
dition. For problems with large diffusivity the δ t has to be very small. A small
δ t can be a major disadvantage of this method, as it will have to perform many
time-iterations to simulate a given time period.
For systems with many variables and multiple large data-sets as the sediment-
transport system, the simulations can take a long time. Therefore numerous opti-
mization techniques and methods will be tested and investigated in this thesis.
3.3 The Variables
The variables used in this investigation are given by the mathematical model by
[Rivenæs 1993] as seen in the chapter 4. It is a massive equation system with
a myriads of variables that are operated on as the numerical model is simulated
12
through the vey many time steps. Some of these variables holds scalars or con-
stants, while other variables are large data-sets extracted from survey of seismic
activity. The data-sets used in the computational domain is gridded for the spatial
discretization.
3.4 The Data
The data that will be used in the sediment-transport simulations are extracted from
Lake Okeechobee in Florida. The lake is positioned in a shallow geological trough
and is of the size 1.900 square kilometers (730 sq. mi). It is exceptionally shallow
with an average depth of 3 meters (9 feet). The Kissimmee River gives the influx
of sand and mud to the lake.
The massive data-sets used in these computations are the results of seismic mea-
surements of Lake Okeechobee. These data are publicly available [Gtopo30 1996].
3.5 Domain Decomposition Method
In the field of numerical analysis and numerical PDE’s, simulations are mostly
performed over massive data-domains. Computations performed on very large
data-set(s) takes time. If they are executed sequentially it takes much longer time
then if they are executed in parallel on a single GPU, and even more time can
be saved if run on multiple GPU’s compared to a single GPU. With the purpose
of the enhancement to yield a faster result, a method to run the simulations on
multiple GPU’s is applied. Solving boundary value problems on multiple GPU’s
can be done by splitting the data-domain into smaller subdomains, where compu-
tations for each subdomain are computed on one of the dedicated GPU’s in the
multi-GPU system. The Domain Decomposition Method (DDM) is a method that
splits a data-domain into independent, adjacent subdomains. The method is ideal
for parallel computing on a Boundary Value Problems (BVP).
A 1D problem can only be subdivided in a 1D decomposition.
Let the intervals over the linear domain M be denoted [mstart ,mend].
For a 1D domain on the interval [0,1], if split into two equally sized subdomains,
one is on the interval [0, 12 ] and the other is on the interval [
1
2 ,1].
A 2D-problem is exposed to a variation of decomposition possibilities. The
domain can be decomposed in 1D or 2D manners.
A 2D decomposition of a 2D domain is decomposed both horizontally or verti-
cally.
13
Let the intervals over a MxN domain be denoted [mstart ,mend]x[nstart ,nend].
With a 2x2 decomposition on the interval [0,1]x[0,1], the different subdomains













A 1D decomposition on a 2D domain can be decomposed horizontally or ver-
tically.
A 1D decomposition into 4 subdomains will with a 1x4 decomposition on the in-









If decomposed in the other direction; 4x1 on the same interval, it yields subdo-









A 2D decomposition yields a larger number of neighbors that boundary val-
ues has to be communicated to than a 1D decomposition. Due to communication-
overhead a 2D decomposition is not necessarily a better choice than a 1D decom-
position.
3.6 Empirical Research Method
To evaluate the results in this research, empirical research method is used. It is
a way to acquire knowledge through experimenting and testing, and it may also
be achieved through direct or indirect observations or experience. Empirical ev-
idence is the record of a someone’s direct observations or experience and can be
analyzed quantitatively or/and qualitatively.
Implementing, testing and comparing the applications will make up for a large
part of the empirical studies in this work. There will of course be iterations of
application development with the intent of making improvements.
14
Based on the problem domain and the questions this thesis aims to solve, both
a qualitative and a quantitative analysis will be made.
3.6.1 Quantitative Analysis
For the quantitative analysis of the results, the execution time of the various sys-
tems are measured. This time is given by a very direct observation of benchmark-
ing the actual speed of the executed system. Execution time for all the different
systems can be compared, and by the comparison it is possible answer some of
this thesis’ questions.
Through the quantitative analysis of these result it is possible to decide if a parallel
implementation is faster than a serial implementation, and if a parallel implemen-
tation of multiple GPU’s is faster than an implementation of one GPU. It is also
possible to decide if an implementation with OpenMP threads yields a faster so-
lution.
Performance and Speedup are two "methods" to analyze the measurements and
enhancement between to different systems. Both methods are in this way quanti-
tative as both methods derive results form quantitative measurements. More about
the these two methods is reviewed in the chapter on Performance Analysis 9.
3.6.2 Qualitative Analysis
For a qualitative analysis of the results, this thesis examines the numerical results
of the simulations, and makes comparisons between the serial implementation and
the various parallel versions. If the different systems parallel systems, through ex-
ecution, render the same correct numerical result as the serial system, it can be
said that the parallel implementations produce a result of equal quality as the se-
rial.
Although speedup and performance are analyzed qualitatively, it would be
possible to say that a system that gives a noticeable speedup and improved perfor-
mance, in addition to a correct numerical result of course, adds to the total quality
of the application.
3.7 Enhancement Methods And Techniques
Computer programs have been implemented with the aim and ambition to solve
the problems this thesis states. The code has been produced with a variety of fea-
15
tures, exploiting the possibilities that lies within the chosen technologies. Exten-
sive empirical testing has been conducted through running the different programs
evolved with the different programming details made for optimization. Along with
the use of the well-established methods presented here, the research is accom-
plished with a great specter of enhancing programming techniques and methods.
A broader explanation on how these methods and the large variety of enhance-





4.1 Numerical Analysis of a
Dual-Sediment Transport Model
There are several numerical strategies for solving this system, and Clark et al.
[Clark, Wei, and Cai 2010] has studied two of the possibilities for solving a cou-
pled system of distinct nonlinear equations governing sediment transport of Lake
Okeechobee, Florida. They used high-resolution bathymetry data of the actual
lake derived from depth-sounding [Gtopo30 1996]. The computations were per-
formed on a multicore-based cluster.
This thesis will investigate the numerical strategy of a fully-explicit numer-
ical scheme, and the computations will be tested on the CUDA architecture on
NVIDIA GPUs.
Erosion and deposition of sediment is a process often modeled using a diffu-
sion equation. The speed of diffusion represents the transport efficiency of the sed-
iment. When multiple sediments are involved in the model, one can add another
equation for each sediment. Various physical processes like sediment compaction,
tectonic movements and carbonate production can be governed by a numerous
empirical rules, but has been neglected by Clark et al. in order to investigate the
parallel performance of their diffusion equation alone.
To model sediment transport in fluvial, on- and off-shore environments, dif-
fusion has been used. Jordan and Flemings [Jordan and Flemings 1991] suggest
a model for sediment transport, approximating a slope-controlled diffusion, and






h= h(x,y, t) is the height of the basin in the x,y-plane at a given time(step),
t is the time and
κ = κ(x,y, t) is the transport coefficient or the efficiency of the diffusion.
This equation (4.1) was later modified by Rivenæs [Rivenæs 1993], so two types
of sediments could be handled in the model, with one transport coefficient for
each sediment.
The variables used for mathematical system are:
Variable Name Explanation
h Bathymetry (Height of the basin)
s Fraction of sand
(1− s) Fraction of mud/silt
α Transport coefficient for sand
β Transport coefficient for mud/silt
A Layer thickness
Cs Compaction ratio for sand
Cm Compaction ratio for mud
fs Influx sand
fm influx mud
dx Size of spatial step in x-direction
dy Size of spatial step in y-direction
dt Size of time step
l Temporal discretization step
i Spatial discretization step in x-direction
j Spatial discretization step in y-direction
Table 4.1: Varibles for the Mathematical Model
∂h
∂ t








= ∇ · (αs∇h). (4.3)
Here,
α = α(x,y) and β = β (x,y) are the two diffusion coefficients for sand and silt/-
mud, and
s= s(x,y, t) and (1− s(x,y, t)) are the corresponding fractions of the two sediment
18
types at a particular location. For this case s represents the first material, sand, and
(1− s) represents the second; material, silt. Further
A is the layer thickness, representing the height of unsettled sediments in the basin,
and in this case it’s set to 1 m. Also, A scales the partial derivatives of s with re-
spect to distance and determines how effective changes in h will be affecting the
sediment concentration.
When taking into consideration the compaction ratio for sand and silt, the






∇ · (αs∇h)+ 1
Cm











∇ · (αs∇h). (4.5)
where
Cs is the compaction ratio for sand and
Cm is the compaction ratio for mud.
Water discharge rates, water-flow induced shear-stress and drag-coefficient
can be neglected due to the separation of the two material types rather than having
one diffusion coefficient based on grain-size.
For simplicity the values are not depth dependent, since Clark et al. [Clark,
Wei, and Cai 2010] assume that sediment diffusion is not as vigorous in the lake
as for the seashore. Therefor they utilize lower values than those suggested by the
lakes mean depth of 2.7 m. In this case, they use the coefficients for sand and silt,
and the table below shows the different regions with the corresponding transport
coefficients in square meters per year (m2/yr).
Region Sand (α) Silt (β )
Kissimmee River 70.000 100.000
Lake Okeechobee 2.100 3.000
Surronds 70 100
Table 4.2: Transport Coefficients for Sand and Mud
19
4.2 Numerical Scheme
The problem to be solved is a typical initial value problem. Clark et al. [Clark,
Wei, and Cai 2010] solve the initial value problem by time integration, and at time
step l the latest numerical solutions of h and s, denoted hl+1 and sl+1. To have the
equations system solved by a computer, the continuous equations need to be trans-
ferred into discrete counterparts by using techniques for numerical discretization.
Solving the analogue equations will yield a correct result, while solving the dis-
cretized scheme will give an approximation to the solution. Various numerical
discretization strategies can be chosen, but this thesis will look exclusively at the
Fully-explicit scheme.
4.2.1 Initial Condition (IC)
At the start of our simulations, at time step 0, we set the initial condition to be:
h(x,y,0) = h0(x,y)
s(x,y,0) = s0(x,y)
These initial values for h and s that are used at the start of the simulations at
time step 0, come from seismic measurements of Lake Okeechobee in Florida
[Gtopo30 1996].
4.2.2 Temporal Discretization
The time domain 0 < t ≤ T is divided into a number of equal-distanced discrete
time levels with ∆t set as the step size. The time level index is noted with l and
we use superscript l such that hl denotes h(x,y, l∆t) and sl denotes s(x,y, l∆t).
The temporal derivatives are then approximated as:
∂h




∂ t ≈ s
l+1−sl
∆t
4.2.3 Fully-Explicit Numerical Scheme







∇ · (αsl∇hl)+ 1
Cm











∇ · (αsl∇hl+1) (4.7)
It is to be noted that h is must be updated before s during each time-step. This
is why the newly computed hl+1 from the first equation is immediately used to
compute sl+1 in the second equation. Another remark is that Wei et al. [Wei et al.
2013] suggests sl+1 instead of sl , used in the s∂h∂ t term on the left side of equation
4.5. They have shown with numerical experiments that this trick improves the
numerical stability of this fully-explicit scheme in which both hl+1 and sl+1 are
computed straight forwardly. The values hl+1 and sl+1 can be computed using
simple algebraic operations, and exploit the advantage that no linear systems need
to be solved.
The values of α , β and mesh spacing limits the size of ∆t in the fully-explicit




This scheme has 1. order accuracy in time.
4.2.4 Spatial Discretization
The equations 4.4 and 4.5 for respectively diffusion and convection are partially



































































































Motivated by numerical and programming simplicity, [Wei et al. 2013] use Fi-
nite Differences for the spatial discretization.





Figure 4.1: 5 Point Stencil
4.2.4.1 Spatial Discretization on Diffusion
Centered Difference in space is a standard way that will be used for the two
diffusion terms on the right hand side of 4.4, and this will give a second-order
accuracy in space. Applying centered difference to the ∇ · (αs∇h) term yields the
following discretized form:
∇ · (αs∇h) = α(i+1/2, j)s(i+1/2, j)(h(i+1, j)−h(i, j))−α(i−1/2, j)s(i−1/2, j)(h(i, j)−h(i−1, j))
∆x2
+
α(i, j+1/2)s(i, j+1/2)(h(i, j+1)−h(i, j))−α(i, j−1/2)s(i, j−1/2)(h(i, j)−h(i, j−1))
∆y2
(4.10)
where the subscripts i, j are the indices at a grid point in a 2D uniform mesh with
the mesh spacing ∆x and ∆y . The half-indexed terms in the formula 4.10 are
22
evaluated as:
α(i+1/2, j)s(i+1/2, j) = (α(i, j) s(i, j)+α(i+1, j) s(i+1, j))/2.
4.2.4.2 Spatial Discretization on Convection
Equation 4.5 is a convection equation with respect to s, due to the term ∇(˙αs∇h).
To obtain numerical stability, one-sided upwind finite difference is preferred over
centered difference, although it has a first order accuracy. Therefore, when check-
ing the flow direction, the convection term ∇(˙αs∇h) is moved to the left side of
the equation, as customary. By this −∇h gives the convection velocity. The ap-
proximation of the x-component, −∂h/∂x ≈ (hi−1, j− hi+1, j)/2∆x, and the sign
determines how the x-component of the convection term is discretized by one-
sided upwind difference. More specifically, if hi−1, j > hi+1, j approximation with






































In the y-direction the discretization is done similarly.
4.2.4.3 Boundary Condition (BC)
In the numerical model most of the boundary has a no-flow condition. The en-








One part of the boundary has a non-zero inflow along a 1.5 km wide channel
representing the River Kissimmee. Where a river crosses the boundary of the
23
solution domain, the Dirichlet Boundary Condition is set for s and the non-
zero Inhomogeneous Neumann BC is set for ∂h∂n . By this, the system allows
specifications of an inflow of 80% mud and 20% sand.







These boundary conditions model an inflow of sediments due to e.g. a river cross-
ing the boundary of the solution domain.
Treatment of the Boundary Condition (BC):
The boundaries can then be described by two parts.
Most of the boundary has a no-flow condition using a second-order accurate treat-







The standard approach by using one layer of ghost points around the boundary
points is applied.
Along domain where the river Kissimmee crosses the boundary, special treatment
is needed for the inhomogeneous influx conditions The two conditions, 4.14 and












Here h is an Inhomogeneous Neumann BC and s takes a Dirichlet BC.
The derivation of the formulas for the boundary conditions where we have the
influx can be seen i Table 4.3.
24
Influx Sand Influx Mud
fs =−αs∂h∂n fm =−β
(
1− s)∂h∂n
=⇒ fs∂n−α∂h = s











=⇒ − fmβ = ∂h∂n + fsα
=⇒ ∂h∂n =− fsα − fmβ
rewrite the expression for sand
=⇒ fs−α ∂h∂n = s










=⇒ s= β fsβ fs+α fm
Table 4.3: Derivation of the Formulas for Boundary Conditions at the Influx
25
4.2.5 Numerical Scheme for Diffusion
Recalling the temporal and the spatial discretization schema with the superscript
l for temporal discretization and the subscrtipts i and j for spatial discretization in
respectively x and y direction. These schemas are applied on the derivative of the
diffusion equation as seen in equation 4.8 and yields this discretized equation:























The diffusion is then solved with respect to hl+1, displayed in equation (4.20).

























































































































































































4.2.6 Numerical Scheme for Convection
For convection the discretization schema is applied on the derivative of the equa-
















































































When the expression for the function as shown in the section on Spatial Dis-
cretization of Convction 4.2.4.2 is filled in between the hard brackets with respect




















( hl+1i+1 j − hl+1i−1 j
2∆x
)]
, hi−1 j > hi+1 j
1
Cs




( hl+1i+1 j − hl+1i−1 j
2∆x
)]









( hl+1i j+1 − hl+1i j−1
2∆y
)]
, hi j−1 > hi j+1
1
Cs




( hl+1i j+1 − hl+1i j−1
2∆y
)]
, hi j−1 < hi j+1
(4.22)




















A + hl+1i j − hli j

















A + hl+1i j − hli j



















A + hl+1i j − hli j

















A + hl+1i j − hli j







Before describing the implementation of both the serial and the parallel appli-
cations, a survey on ideas and concepts regarding High Performance Computing
(HPC) will be presented in this chapter. HPC is the use of parallel processing
technologies for running applications efficiently. The efficiency can be measured
in different ways.
Clock speed/Clock rate is the time used by a microprocessor to execute an in-
struction. All computers contains a clock that regulates the speed. Clock rates are
expressed in megahertz (MHz) or gigahertz (GHz).
Throughput is a measure of how many units of information a (computer-) sys-
tem can process in a given time-period. The throughputs are often described as
FLoating-point OPerations per Second (FLOPS). This is a benchmark mea-
surement for the speed of microprocessors.
The earliest supercomputers had a throughput measured in
kiloFLOPS (KFLOPS) = 103 FLOPS and
megaFLOPS (MFLOP) = 106 FLOPS.
For supercomputers today we are talking about throughput measured in teraFLOPS
(TFLOPS) = 1012 FLOPS and
petaFLOPS (PFLOPS) = 1015 FLOPS.
Eager scientists are even looking towards exa-scale computing for a throughput
on the size of exaFLOPS (EFLOPS) = 1018 FLOPS.
The term HPC applies to systems performing above a teraflop (1012) floating point
operations per second.
31
5.1 Supercomputing in a Historical Perspective
The basic components used for electronics in radios, televisions, radars, tele-
phones and computers was the vacuum tube (electron tube), a device that was
used to control electrical current through a vacuum in a sealed container. The
ones with a control grid could also be used as amplifiers. The vacuum tubes was
invented in 1907 and are mostly used until the first half of the twentieth century.
In 1947 came the first transistor, a semiconductor device used to amplify and
switch electronic signal and electrical power, invented by the American physicists
John Bardeen, Walter Brattain, and William Shockley. The transistor revolution-
ized the field of electronics, and the inventors were rewarded the Nobel Prize in
Physics in 1956.
Among the advantages that have allowed the transistors to replace the vacuum
tube processors are lower power dissipation and higher energy efficiency.
The first electronic general purpose computer was Electronic Numerical Inte-
grator and Computer (ENIAC). It was Turing-complete (computationally uni-
versal) and reprogrammable to solve a large class of numerical problems. The
press called it "Giant Brain" when it was announced in 1946. It was primarily
designed to calculate artillery firing tables for the US Army.
The Turing machine is named after the British pioneering computer scientist,
mathematician, cryptanalyst and logician, Alan Mathison Turing (1912-1954).
With the development of the Turing machine, a theoretical or hypothetical de-
vice, he formalized the concepts of "algorithm" and "computation". An algorithm
is a step-by-step set of operations/instructions to be performed, and computation
is a calculation or use of computing technology in information processing.
In the 1960’s various supercomputers were introduced.
As early as November 1959 the IBM 7090 was installed, designed for large-scale
scientific and technological applications. This was a transistorized version of the
earlier IBM 709, a vacuum tube computer. The processing speed was around 100
KFLOP/s and about six times faster than the 709.
The first Atlas supercomputer was commissioned in 1962 and installed at the
University of Manchester. The machine used discrete germanium transistors to
lower the voltage.
32
Another supercomputer was introduced by Seymour Cray, and in 1976 the 80
MHz Cray-1 system was installed with a single CPU. The 80 MFLOPS Cray-1
was in 1982 succeeded by the 800 MFLOPS Cray X-MP, the first multi-processing
computer. This was a shared-memory parallel vector processor with a clock cycle
of 9.5 nano seconds (ns) (105 MHz) compared to Cray-1’s clock cycle of 12.5
ns. A vector processor, or array processor, is a CPU that has implemented an in-
struction set able to operate on one-dimensional arrays called vectors, in contrast
to a scalar processor that only operates on single data units. In 1984 an improved
model of the Cray X-MP came with up to a four-processor system.
Twice a year the TOP500 list of supercomputers [Top500List 2015] is published,
and one machine has been on the top of the list the last times the list has come out.
China’s Tianhe-2 supercomputer, which translates to "Milky Way 2", has been
the fastest in the world since June 2013. It runs at 33.86 petaFLOPS (PFLOPS),
or 33.86 quadrillion floating point operations per second. Or in other words; 33.86
thousand million million floating point operations per second.






Workstations today are a hundred times faster than the ones made a decade ago.
Yet scientists and engineers still experience that "fast" isn’t fast enough. Waiting
for hours, days or even weeks for larges computations to finish can be costly. The
use of a parallel computers to solve a computational problem is called parallel
computing, and is widely used to solve large problems. A parallel computer is a
multi-processor computer system where computations on different processors are
run in parallel. The use of parallel architectures to solve computational problem
often saves a lot of time, but it is important to make the right choices of computer
architecture and algorithm design, to achieve the most optimal result.
6.1.1 Flynn’s Taxonomy -
A Scheme for Parallel Architectures
Operations processed in a computer is a set of instructions that operates on a set
of data operands. These processes can be executed as a stream of instructions
operating on a stream of data. The possibility of multiplicity in the streams of
instructions and data lies in the computer architecture.
A well-known classifications scheme for parallel computer architectures is Flynn’s
Taxonomy, proposed by Michael J. Flynn in 1966 [Quinn 2003]. In this scheme
a computer is categorized on it’s ability to exhibit parallelism in the instruction
stream and the data stream., see figure 6.1
35


















Table 6.1: Flynn’s Taxonomy for Parallel Computer Architecture
Single Instruction, Single Data stream (SISD)
This is the architecture of a sequential computer which exploits no parallelism
neither in the instruction nor data streams. An example is the uniprocessor, a
system where all processing tasks share a single CPU.
Single Instruction, Multiple Data stream (SIMD)
Computers with this architecture employ a single instruction stream but multiple
data streams. Processor arrays and pipelined vector processors are examples in
this category, and both technologies employs a single control unit executing one
instruction or instruction stream at the time. A processor array has multiple pro-
cessor elements operating in parallel on multiple data elements in a data stream,
and the same instruction is performed on each data element. The vector processor
has a single processor element that operates in sequence on the data elements.
Multiple Instruction, Single Data streams (MISD)
The MISD architecture employs a pipeline of multiple independently executing
instructions performed on a single data stream. The systolic array exemplifies
this kind of computer, forwarding results from one function to the next on a single
data element.
Multiple Instruction, Multiple Data streams (MIMD)
The MIMD class of parallel architectures is for computers consisting of multi-
ple instruction streams performed on multiple data streams. Multiprocessors and
multicomputers fit into this category, and both architectures are based on multi-
36
ple CPU’s where the different CPU’s simultaneously execute different instructions
streams on multiple data elements. This has possibly been the most familiar form
of parallel architectures.
6.1.2 Designing a Parallel Algorithm
6.1.2.1 The Task/Channel Model for Parallel Architectures)
When designing a parallel algorithm, Ian Fosters [Quinn 2003] task/channel
methodology can be used. The model is intended for MIMD architectures, and
not for SIMDs. MIMD architectures are multi-processors and multi-computers
witch consists of multiple CPU’s with a shared memory. The MIMD architecture
typically have fewer processors than the SIMD does.
In the case of constructing the algorithm for the numerical model for sediment
transport, the channel/task method will be presented here to enlighten a few top-
ics around choices for designing parallel algorithms.
The task/channel model represents a parallel computation as a set of tasks that
may interact with each other by sending messages through a channel. A task is
a program’s instructions and private data on the local memory and a collection of
Input/Output (I/O) ports, and a task can use the ports to send local data to other
tasks. One task’s output port can connect with another task’s input port through a
message queue called a channel.
In the task/channel model any task can send data to other tasks and continue it’s
instructions regardless if the receiving task has received the data. Sending is a
non-blocking process and is said to be an asynchronous operation. On the other
hand a task can not receive data from another task before the other task has sent
it, so the task is blocked until the value is received. Receiving is then said to be a
blocking process, and is a synchronous operation.
In this model:
• The execution time of a parallel algorithm is the period of time the parallel
tasks are active.
• The starting time of a parallel algorithm is when all tasks simultaneously
starts executing.
• The finishing time of a parallel algorithm is when the last task has stopped
executing.
Ian Foster’s design methodology proposes a four-step process for designing




In the search for how much parallelism is possible in a system, the methodology
starts with a partitioning step. This process divides data and computations into
pieces, and can take either a data-centric or a computation-centric approach.
The data-centric design approach for parallel algorithms is where data is divided
into pieces and then computations are associated with the data pieces. This is
called domain decomposition and results in sets of data and a number of oper-
ations to be performed on the data. Some examples of domain decomposition is
when data is divided in different dimensions (1D, 2D and 3D).
The computation-centric design approach is where computations are divided into
tasks and then it’s decided which data are associated with the individual compu-
tations. This is called functional decomposition, and the collection of tasks can
often achieve concurrency through pipelining.
Communication
The next step is to determine the communication pattern between the tasks. If data
needs to be shared between the tasks, then communication is needed.
The two possible communication patterns for parallel algorithms are local com-
munication and global communication.
Communication is not needed for a sequential algorithm, and the extra time it
takes to perform the communication among tasks in a parallel algorithm is called
overhead. An important goal of parallel algorithm design is to minimize the
overhead. The overhead may be strongly affected by how the tasks are assigned
to the processors, especially if the number of tasks greatly exceeds the number of
processors. To help evaluate the communication structure of the parallel system,
Foster has a checklist:
• Balance communication operations among the tasks.
• Each task communicates only with a small number of neighbors.
• Tasks can perform their communications concurrently.
• Tasks can perform their computations concurrently.
Agglomeration
After achieving as much parallelism as possible through the first two design steps,
it is time to group the primitive tasks into larger tasks to improve performance,
38
simplify programming and maintain scalability of a program. This process s
called agglomeration. Reducing communication overhead is one of the goals of
agglomeration. One way to reduce the communication to zero is to agglomerate
all primitive tasks that communicates with each other. This is called increasing
the locality. Replicating computations on different processors may also take less
time than communicating the data over a channel. This replication must be small
enough to allow a system to scale.
Mapping
Mapping is the process of assigning the different tasks to the various processors
in a way that maximizes processor utilization and minimizes interprocessor com-
munication costs, but these two goals are often conflicting. Processor utilization
is the average percentage of time a systems’s processors actively performs tasks
to solve a problem. Computation time may vary for different tasks. When com-
putations are balanced evenly, the processor utilization is maximized.
Different characteristics in the parallel algorithm will lead to different mapping
strategies. Mapping can be part of a static or dynamic task allocation. One ex-
ample on when to choose dynamic load-balancing algorithms is in the case where
tasks are created and destroyed runtime and has frequent communications be-
tween the tasks. An example of when to choose static load-balancing algorithms
to minimize the communication overhead, is when you have a static number of
tasks with unstructured communication patterns.
6.1.3 General-Purpose Parallel Computer Architecture
The insatiable market demand for real-time, high-definition 3D-graphics has driven
producers of GPU’s to evolve the general-purpose programmable, highly parallel,
multithreaded, many-core processors with extreme computational power and high
bandwidth. The GPU’s are devoting more transistors to data processing than to
data caching and flow control as the CPU’s are. This gives the GPU an advan-
tage over the CPU for data-parallel computations, where the same instruction is
executed on numerous data elements in parallel. See figure 6.1
6.1.3.1 CUDA - Compute Unified Device Architecture
Reasons why there is such a large performance gap between CPU’s and GPU’s
lies in the fundamental differences in the design of the hardware.
The Compute Unified Device Architecture (CUDA) [NVIDIA CUDA Toolkit Doc-
umentation, v7.0] is a GPGPU architecture that was introduced by NVIDIA Cor-
poration in November 2006. Since then different generations of GPU’s has been
evolved, and for different purposes, whether it’s been made for graphics rendering
39
Figure 6.1: GPU devoting more transistors to data processing than CPU [NVIDIA
CUDA Toolkit Documentation, v7.0]
or general purpose tasks. CUDA also came with a programming platform includ-
ing adapted libraries, middleware and extensions to programming languages. This
opened up for the possibility to make parallel systems solve complex computa-
tional problems more efficiently than on a CPU. A figure of the GPU Computing
Applications can be seen in figure 6.2
A CUDA capable architecture holds one or multiple Stream Multiprocessors
(SM’s). Each SM contains several Streaming Processors (SP’s).





• Global memory/Device memory
Table 6.2: CUDA Memory Hierarchy
where registers are the fastest to access and global memory the slowest.
GPU’s keep a high memory bandwidth that greatly exceeds CPU’s. The number
of SM’s, SP’s and memory details varies with the different generations of GPU’s.
The CUDA architecture is displayed in the figure 6.3.
The CUDA architecture is supplemented with a software environment that
allows developers to use C, C++, FORTRAN or other programming languages,
apply programming interfaces or directive-based approaches.
40
Figure 6.2: GPU Computing Applications [NVIDIA CUDA Toolkit Documenta-
tion, v7.0]
The three key abstractions presented in the CUDA programming model are the
hierarchy of thread groups, shared memories and barrier synchronization. This
gives the platform for the coarsely grained data parallelism and task parallelism
and within that the fine-grained data parallelism and thread parallelism. The de-
veloper may then partition a problem into groups of smaller problems solved in
blocks of threads. A CUDA program can execute on a various number of proces-
sor cores and therefore supports automatic scaling of a problem. See fig: 6.2
6.1.3.2 CUDA C Programming Model
CUDA C is an extension to the C programming language, evolved to allow parallel
execution of programs on the GPU. It let’s developers familiar to the C program-
ming language exploit the technology without the effort of learning a completely
new programming language. The CUDA C programming model is benefiting
from the CUDA-kernels that let the programmer (re)define c-functions. When a
kernel is called it executes N times in parallel by N different CUDA threads, while
41
Figure 6.3: The CUDA Architecture [NVIDIA CUDA Toolkit Documentation,
v7.0]
a C-function only executes once. The kernel-call must have defined:
• the number of threads on each block
• the number of thread-blocks in a grid
A kernel call uses a new <<<...>>> execution configuration syntax, and the invo-
cation of a kernel may look like the code listing 6.1
42
Figure 6.4: Automatic scalability in the CUDA architecture [NVIDIA CUDA
Toolkit Documentation, v7.0]
Listing 6.1: CUDA Kernel Call
1 kernelName <<<numBlocks , numThreads >>>( argument_1 , . . . ) ;
When the kernel is called from a C-function it must have the __global__ decla-
ration specifier, and it may be defined as shown the code listing: 6.2
Listing 6.2: CUDA Kernel Declaration
1 _ _ g l o b a l _ _ void kernelName ( argument_1 , . . . )
2 {
3 . . .
4 }
The number of thread-blocks in the grid and the number of threads per block is
for the programmer to decide through the variables specified in the kernel-call. In
the code-example 6.1 these variables are called numBlocks and numThreads, and
they may be of type int or dim3. The blocks of threads are organized into a 1D,
2D or 3D grid. Figure 6.5 displays threads on blocks and blocks in a grid. The
43
index of a block can be accessed inside the kernel through the built-in variable
blockIdx, and the size of the block through the variable blockDim.
A unique thread-ID is given each thread that executes the kernel, and the
thread-ID is also accessible within the kernel through the built-in variable threadIdx.
The threadIdx defines a threads position in a 1D, 2D or a 3D thread block, and
is therefore a 3-component vector. Defining the unique thread-ID is based on the
thread’s indexed position. For a thread in a 2D system of dimensions (Dx,Dy), the
ID of the thread at index (x,y) is (x+yDx). For a 3D block of size Dx,Dy,Dz) the
ID of a thread with the index (x,y,z) is (x+ yDx+ zDxDy).
Figure 6.5: Grid of thread block in the CUDA architecture [NVIDIA CUDA Toolkit
Documentation, v7.0]
Thread-blocks can be executed in any order. The threads of a thread block execute
concurrently on one multiprocessor, and a multiprocessor may also execute mul-
44
tiple thread blocks concurrently. When thread blocks terminate execution, new
blocks will be launched for execution on the vacant multiprocessors.
The unique architecture that NVIDIA calls Single-Instruction, Multiple-Thread
(SIMT) allows a multiprocessor to execute hundreds of threads concurrently. The
SIMT architecture is akin to SIMD (Single-Instruction, Multiple Data streams),
but the SIMD architecture operates a single instruction to control multiple data on
a vector processor, whereas the SIMT architecture operates on single or multiple
threads. The SIMT instructions specifies the execution and branching behavior
of each single thread. SIMT is intended to limit instruction fetching overhead
and the instructions are pipelined for instruction-level parallelism within a single
thread. With simultaneous hardware multi–treading, SIMT also offers extensive
thread-level parallelism. Data parallelism [Kirk and Wen-mei 2012] is achieved
with the SIMD architecture where data is distributed across different parallel com-
puting nodes executing the single, same instruction on each data element. Data
parallelism can also be achieved with the SIMT architecture as a single instruction
can operate on multiple threads in parallel, where each thread operates on a data
element.
The CUDA-threads can access data from the various memory spaces. Each thread
can access it own (per-thread) local memory. All threads within a block can
share data through the (per-block) shared memory. The global memory can be
accessed by all threads. Texture and constant memory spaces can also be ac-
cessed by all threads, but these memory spaces are read-only. An organization of
the CUDA memory hierarchy is revealed in figure 6.6. In addition to the global
memory, the constant and texture memories are persistent during an applications
kernel launch.
The CUDA programming model assumes that the C-code is executed on the
host (CPU) and all the CUDA-kernels with the CUDA-threads executes on a de-
vice (GPU), a physically separate coprocessor. This model also expects that both
the host and the device maintains their respective memory spaces in DRAM. The
program must manage memory allocations and deallocations, and data transfer
between the host and the device memory.
More details on CUDA will be conveyed in Chapter 7 Heterogeneous Com-
puting with CUDA C.
45






In the world of HPC, General-Purpose computing on Graphics Processing
Units (GPGPU) is considered to be a "bleeding edge technology". In this re-
search, GPU’s from NVIDIA will be investigated for their compute capabilities.
NVIDIA GPUs are built on the CUDA Architecture. The hardware is constructed
to perform general-purpose tasks in addition to traditional graphics-rendering
tasks.
To construct computer-programs to be running general-purpose computations
on CUDA GPUs, it’s convenient use the CUDA C programming language that
extends the C programming language. The CUDA C language is a powerful
tool for parallel programming. CUDA C is easy to combine with C by making
a few changes, and will be used for the parallel implementation of the sediment-
transport application. The C-functions must then be rewritten into CUDA-kernels.
In such applications the kernels will run on a GPU, while the CPU handles the
scheduling of the kernels. It’s typically referred to as heterogeneous computing
when a system exploits more than one kind of processor.
This parallel application takes advantage of the GPU for it’s massive process-
ing power, but uses the CPU to run the operating system. It is heterogeneous for
a mixed serial-parallel programming, serial program with parallel kernel. In this
parallel context, a heterogeneous system consists of a host and a device or multi-
ple devices, each with separate processor-capacities and separate memory spaces
that they maintain. The memories are referred to as a host memory and device
47
memory respectively.
7.2 Parallel Programming Features in CUDA C
CUDA’s extension set to the C language, CUDA C, has the runtime implemented
in the cudart runtime library. It initializes the first time a function is called, and
needs no explicit initialization. The entry points are prefixed cuda.
7.2.1 Parallel kernel
The CUDA-kernels runs on the the CUDA device, a GPU that is physically sepa-
rated from the CPU. The CUDA kernel is like a C-function, but the code is exe-
cuted in parallel. The massive data-load is sectioned into grids consisting of thread
blocks, and each thread on the block performs computations for one data-element
at one grid point. For explicit synchronization of the threads the intrinsic function
__syncthreads()
can be called. The function acts as a barrier, and waits for all the threads on a
block to finish before it can proceed.
Kernels can operate out of device on memories through functions provided by
the runtime library. These functions lets the system perform allocations, copies
and deallocations of data. It also permits transfer of data between host and device
or between the different devices in a multiple GPU system. It can also directly
load and store data between GPUs.
7.2.2 Scalability
A parallel kernel are composed of very many lightweight threads, grouped into
thread blocks. CUDA is exceptionally scalable [Nickolls et al. 2008] and has a
hierarchical execution model:
• Decompose problem into sequential steps such as kernels
• Decompose kernel into computing parallel blocks
• Decompose block into computing parallel threads
Table 7.1: CUDA’s Hierarchical Execution Model
48
The hardware distributes independent blocks to Streaming Multiprocessors
(SMs) as available, and schedules independent threads of execution. The blocks
can run concurrently or sequentially, they are independent thus scalable. A block
is a virtualized multiprocessor. When a kernel is launched it fits processors to data
and computation.
There are different levels of parallelism:
• Thread parallelism
– each threads an independent thread of execution
• Data parallelism
– across threads in a block
– across blocs in a kernel
• Task parallelism
– different blocks are independent
– independent kernels
Table 7.2: Levels of Parallelism in CUDA
7.2.3 CUDA Device Memory
The CUDA Architecture handles it’s own memory, and the CUDA-kernels can
only access the memory on the device. As shown in figure 6.6 we recall that:
• Global memory - can be accessed from all the blocks in the grid
• Shared memory - can be accessed by all the threads on a block
• Local memory - can be accessed by a single thread
The device memory can be allocated as a linear memory or as CUDA-arrays.
The device memory allocation and deallocation has to be handled separately using
cudaMalloc() and cudaFree(). The transfer of data between the host and device
memory also needs to be handled explicitly, and can be done with cudaMemcpy().
The various cudaMemcpy*() functions takes the parameter: cudaMemcpyKind
which is used as a flag describing the kind of copying that is being performed,
defining if the copy goes from a CPU or GPU and to a CPU or GPU. The choices







Table 7.3: Different "kinds" of Memory Copy in CUDA
7.2.3.1 Linear memory
Linear memory for 2D or 3D objects is recommended to be allocated using
cudaMallocPitch() and cudaMalloc3D(). The function cudaMalloc3D() allo-
cates at least width×height×depth bytes of linear memory. A 2D linear memory
is allocated with cudaMallocPitch(). This allocates the memory of a size that is
at least width (in bytes) ×height.
These functions also make sure to pad the allocated memory to meet the hardware
requirements, if needed. By this it ensures the best performance when accessing
row addresses or copying between regions of device memory. Both the 2D and
the 3D allocation mentioned, are padding the memory to a size being a multiple
of the warp-size. Since data are fetched with the size of a warp, the memories
are then optimized for this. The size of the padded width is returned in a variable
called pitch, and holds the size of the stride in number of bytes.
Both the 2D and the 3D allocation, returns a a pointer to the memory.
7.2.3.2 Page-Locked Host Memory
The runtime provides functions with the possibility for a page-locked host mem-
ory, also referred to as pinned memory, which can be allocated using
cudaHostAlloc() or cudaMallocHost() and freed with cudaFreeHost().
Page-locked memory guarantees that the operating system will never page out
this memory to disk, and ensures it’s space on the physical memory. A page-
locked host memory benefits form the possibility to perform copies between the
device memory and the page-locked host memory concurrently with kernel exe-
cution.
Mapped Memory - Zero Copy
Pinned memory can also be mapped directly into the host’s memory space from
50
some devices, a feature that eliminates memcpy between host and device. The
device code will then transfer data implicitly as needed. For a mapped memory,
the setup must be done on the host with proper flags, and the memory must be al-
located with the function cudaHostAlloc() with the flag cudaHostAllocMapped.
This is also referred to as Zero copy. For integrated systems that utilize the CPU
memory, Zero copy will be faster. Also for systems where data are read/written
from/to global memory once during the application the Zero copy is fast.
If data has to be read/written often throughout the simulations, a mapped mem-
ory will run slow due to massive synchronization of different devices trying to read
and write to the same data concurrently.
Portable Memory
An other option is to utilize a pinned memory allocation using a portable memory.
A block of page-locked memory is only available to the device that was current
when the the block was allocated, and to all the devices that shares the same Uni-
fied Virtual Address space (UVA). To make all the devices in such a multi-GPU
system have the benefits of accessing the same specific memory block, it must be
allocated with cudaHostAlloc(), and he flag cudaHostAllocPortable must be
passed when the memory block is allocated.
When creating a system for multi GPU’s with a portable memory, it makes
a single address space available to all the devices sharing the UVA, and copy-
ing of data between the GPU’s is performed with the cudaMemcpy() function.
The parameter kind which passes the flag cudaMemcpyKind can then be set to
cudaMemcpyDefault.
7.2.4 Synchronous Execution
Kernel execution and memory operations are launched and executed sequentially,
one task at the time. The CPU controls the scheduling of the tasks, and the control
is not returned from the device to the host thread before the device has finished
executing it’s task. Synchronous executions are blocking and no further tasks can
be launched for execution until the last call is done.
In the CUDA context, a kernel’s parallelism embodies the possibility to per-
form a task on many data-elements concurrently, the so-called thread parallelism,
still being synchronous. As opposed to synchronous execution there are possibili-
ties for asynchronous execution. This and other features for enhanced parallelism
has been enabled with CUDA, and will be conveyed in later sections.
51
7.2.5 Asynchronous Concurrent Execution
In addition to the thread-parallelism, CUDA has establish the possibility for con-
current execution between host and device. The control is then given back to the
CPU before the kernel or memory operation has has finished it’s execution on the
GPU.
This is facilitated through some asynchronous function calls. Copies between
page-locked memory and device memory can be performed concurrently with
kernel execution for devices that supports this behavior. It is possible to check
for this capability by the asyncEngineCount in the device property, if it is set to a
value larger than zero.
Concurrent kernel execution is possible for some devices with compute capability
higher than 2.0. Up to 32 concurrent kernel launches are possible on devices of
compute capability 3.5 or higher. The capability can be checked with a query to
the concurrentKernels i device property. It is set to 1 for devices that supports
this.
It is possible for the programmer to block asynchronous behavior by setting
the environment variable CUDA_LAUNCH_BLOCKING to 1.
7.2.6 CUDA Streams
CUDA Streams are great for accelerating parallel applications, and manage the
application’s concurrency. A stream represents a queue of GPU operations, like
kernel launches and memory operations, that are executed in a certain order. The
commands in the stream are operated by the host thread. It is also possible to uti-
lize more than one host thread, as will be explained later in the section on OpenMP
threads.
Different streams can be executed concurrently. They execute their com-
mands separate from each other, and may therefore be executed concurrently and
out of order with respect to each other. A stream is created with the instruc-
tion cudaStreamCreate(), and destroyed with cudaStreamDestroy(). If kernel
launches and memory operations are issued without an explicitly declared stream
parameter, the tasks are issued to the default stream.
7.2.6.1 Streams and Events on a Multi-GPU application
In a multi-GPU application each GPU can handle multiple streams and multiple
events. These streams and events are only valid within the scope where they are
declared. The scope of a stream is determined by the device that was current at the
52
time of the streams creation. CUDA streams are per device, and calls to a stream
can only be issued when the device is current. This means that streams associated
with one device, are not considered by other devices. If a task is issued to a stream
that is not associated with the current device, the operation will fail. If the input-
event and the input-stream to an operation are associated to different devices, the
operation will fail. This can be monitored with the function cudaEventRecord().
The function returns cudaSuccess if the event succeeded. Otherwise it yields a
cudaError. Any synchronization between streams has to be handled explicitly.
7.2.6.2 Synchronization
Streams can be synchronized with each other. There are some explicit synchro-
nization calls that handles synchronization on different levels.
cudaDeviceSynchronize()
waits until all host threads have completed all their commands in all their streams.
cudaStreamSynchronize()
waits until all the preceding commands in a stream are completed. This function
takes the specified stream as a parameter.
cudaStreamQuery()
provides the application with a way to know if all the foregoing tasks in a stream
have completed.
7.2.7 CUDA Events
CUDA events are markers for synchronization that can:
• Yield a fine grained synchronization in a specified stream
• Time (asynchronous) tasks in streams
• Yield an inter stream synchronization, where one stream waits for an event
in an other stream
The events are created with cudaEventCreate(),
and destroyed with cudaEventDestroy().
Some of the different event functions are:
cudaEventRecord()
Sets a point for where it starts to record a given event.
53
cudaStreamWaitEvent()
this function delays all the tasks that are called for after this function call until
the given event in the given stream has finished executing (from where it stared
recording). It takes both a stream and an event as parameters. This synchroniza-
tion does not stall the host, but only waits for the operations to finish within a
stream and an event. The stream may be on an other device, and this function can
be used to set a barrier until a event in an other stream on an other device is is done.
cudaSynchronizeEvent()
waits for an event to complete.
cudaEventElapsedTime()
is a function for timing. It takes two CUDA events as parameters, one for start
time and one for stop time.
7.2.8 CUDA Featureas for a Multi-GPU System
A CPU may be connected to multiple GPU’s. For a system with multiple devices,
the host needs to be able to distinct the devices from each other. CUDA-enabled
devices can be enumerated and counted. The following code extracts the number
of a devices in the system.
Listing 7.1: Get Device Count with/in CUDA C
i n t d e v i c e C o u n t ;
cudaGetDeviceCount (& d e v i c e C o u n t ) ;
To operate on a specific device, the context for the device needs to be set. A
host thread sets the device it operates on with a call to cudaSetDevice(). This
function takes the device, an integer, as a parameter. The kernel launches and
memory operations are then handled by that specific device.
7.2.8.1 Peer-to-Peer (P2P) Memory Access
In a multi GPU system, P2P comunication allows for a direct communication be-
tween the GPUs. In order to have the ability for a P2P memory communication
between two devices, a Unified Virtual Access space (UVA) has to be available
and utilized by the devices. The direct communication eliminates system memory
allocation and copy overhead.
Direct communication between two GPUs can be established by Direct Transfer
or Direct Access. (Direct Transfer will be addressed in the next section.)
54
On some multi GPU systems, a kernel executing on one device can derefer-
ence a pointer to the memory of an other device. This is called P2P memory
access. Data is cached in the L2 of target GPU.
Direct Access
gives one GPU the possibility to load/read or store/write data-values directly from
or onto the memory of an other GPU.
cudaDeviceCanAccessPeer()
is a function that checks if a system supports P2P between two devices when
called. It requests if peer access is possible between a set of two devices, and
returns true if so.
cudaDeviceEnablePeerAccess()
must be called to enable P2P direct access from a kernel on GPU to the memory
of another GPU
7.2.8.2 Peer-to-Peer (P2P) Memory Copy
Data can be copied directly between the memories of two different devices, with-
out the need to pass through the host, as it works transparently with UVA.
Direct Transfer
allows one GPU to copy data directly to or from an other GPU’s memory.
Memory copies can be performed synchronously or asynchronously. For
synchronous P2P memcpy functions CUDA offers:
cudaMemcpyPeer()
cudaMemcpy3DPeer()




is used as a parameter describing the kind of copying that is being performed,
whether it is from/to CPU/GPU. This parameter becomes unnecessary when a
55
Unified Virtual Address space (UVA) is available, so this "default" parameter is
used instead
cudaMemcpyDefault.
7.2.8.3 Unified Virtual Address Space (UVA)
With UVA, only one address space is needed for all CPU and GPU memory.
The application needs to be running as a 64-bit’s process and the devices must
be of compute capability 2.0 or higher. It simplifies the library functions and
the cudaMemcpyDefault can be utilized instead of any of the other "kinds" of
cudaMemcpyKind. To benefit from this feature, the host memory must be allocated
through CUDA with cudaHostAlloc() and is automatically portable across all
the devices using the UVA.
7.2.8.4 Peripheral Component Interconnect Express (PCIe)
The PCIe is a computer bus for maximum high throughput. This kind of bus is
acquired for fast P2P communication between devices. To carry out P2P copy





This chapter will embroider different aspects regarding the implementations of
the numerical model for dual-sediment transport, by [Rivenæs 1993]. On the path
to the numerical solution, the equation set has been discretized and initial condi-
tions and boundary conditions have been set before the software is written for a
both serial and parallel implementations. The numerical aspects of the model was
derived in Chapter 4. Recalling that the system for the sediment-transport model
runs time-series over a set of partial differential equations. Each time step calcu-
lates the bathymetry of Lake Okeechobee in the variable h, and the fractions of
the two sediments sand and silt as the values of s and 1− s, all represented as 2D
data-grids. The Finite Difference scheme called Forward-Time, Central-Space
(FTCS) is used to approximate a solution to the PDE’s.
The serial implementation runs on the host (CPU), and the parallel on the
device (GPU) or multiple devices. For the serial implementation the ANSI C pro-
gramming language has been used. The parallel programs have been developed
for a SIMD/SIMT architecture on NVIDIA’s CUDA architecture coded with the
CUDA C extensions. Most of the CUDA C features that was presented in Chapter
7 have been employed, and will be elaborated in the context of the implementa-
tions and how they have been utilized. Parallel applications have been developed
for both a single GPU, and for multiple GPU’s. For multiple GPU’s additional
CUDA features have been explored. Also a set of multiple OpenMP-threads are
spawned to reduce overhead on the CUDA-kernel launches. The use of OpenMP
will also be presented here.
Prior to revealing details in the program code for of the model, an analysis for
a parallel architecture is conveyed in this chapter.
57
8.1 Analysis of a Parallel Architecture for the
Numerical Model
When considering an architecture for a parallel application, it is important to ana-
lyze the characteristics of the problem to be solved. It is important to look at what
kind of an algorithm/system is to be implemented and the details that it evolves
around.
8.1.1 The Task/Channel approach for a Parallel Architecture
As a tool for analyzing which parallel architecture to use in this specific case, ele-
ments from the task/channel model is used. See section 6.1.2. Although this tool
ay be intended for a MIMD architecture, it has features that also a SIMD applica-
tion can benefit from.
Reviewing details in the model, the following analysis has been the basis for
the choice of a parallel architecture.
8.1.1.1 Partitioning
A system for the sediment-transport model does calculations on extremely large
datasets. It holds multiple data-variables in 2D data-matrices with a data-item at
each grid-point. Each data-item is updated with a few functions to perform the cal-
culations, and this procedure is done for every time step. A few tasks/functions are
performed at each time step. The tasks are performed on the same data-elements,
first one task on all the elements in parallel, then the next task performs on all the
elements in parallel and so on, one function after an other. Therefore the system
does not yield functional parallelism, as would be the case if the different func-
tions could be run in a parallel. This system does yield data parallelism since
one task can be performed on multiple data-elements simultaneously.
With the SIMT architecture on a single GPU, the thread parallelism naturally
leaves a decomposition of the data-domain, a so-called data-decomposition, into
each thread handling a single data-element. Therefore a data-domain can contain
only one data-element and be associated with one task at the time.
58
8.1.1.2 Partitioning for a Multi-GPU system
When the computer system contains multiple GPU’s it is likely to yield additional
parallelism. This application is still restricted to perform data-parallelism, and
since it deals with very large data-sets it is natural to choose a decomposition over
the data-domain. With a multi-GPU system it is possible to divide the data onto
the different GPU’s. The complete data-load is divided into N chunks, each to
be calculated on one of the N GPU’s in the multi-GPU system. In addition to
maintaining the parallelism in each single GPU as described, where each thread
performs tasks on a data-element, it is possible to benefit from enhanced accelera-
tion when performing that same thread parallelism on multiple GPU’s. Then each
chunk of data on a the separate GPU’s performs the same thread parallelism as a
single GPU do. The data-load can be decomposed in a 1D, 2D or a 3D manner.
8.1.1.3 Communication
When data-sets are decomposed, tasks are performed separately on the different
decomposed data-domains. A data-domain can contain one single data-element at
a grid-point, or many data-elements can be grouped together. When the value of
element h j+1,i in one domain is needed in an other domain to perform a task on el-
ement h j,i, then the element h j+1,i must be communicated through a channel to the
domain needing the data-element. If a data-domain holds only one data-element
and needs to compute a stencil with four neighboring elements and each element
belongs to four separate neighbors, then it needs to perform communication over
four channels for accessing all neighboring elements needed in the computations.
see figure
Considering the SIMT architecture provided on NVIDIA’s GPU’s the com-
munication of data-values between threads on a single GPU does not have to be
handled explicitly. NVIDIA has organized for memories to be shared between the
threads, and with a high bandwidth access.
8.1.1.4 Communication for a Multi-GPU system
When the data is partitioned between the multiple GPU’s, the data along the
boundaries must be communicated to and from the the neighboring GPU. When
calculations are performed on a data-domain at one time step, the data along the
boundaries are needed by the neighbor for it’s calculations at the next time step.
Therefore the data-domains containing single data-elements have to be agglomer-
ated into fewer and larger data-domains containing more data-elements.
59
To perform tasks on data-elements along the border to a neighbor of the de-
composed data-domains, communication must be performed between the different
domains. Each data-element along a border of the decomposition has to be com-
municated to the neighboring data-domain, but the data are communicated as a
block of data-elements through one channel, rather than through one channel per
data-element. With fewer neighbors there will be less communication and less
overhead. The communication only has to be performed along the border of the
domains, and larger data-domains will yield a reduction in the communication
overhead.
8.1.1.5 Agglomerating and Mapping
It would be impossible to compute all tasks on all data-elements at the same time.
One reason is that the values calculated in one time step is used to calculate the
new values in the next time step. A second reason is that all the tasks/functions
performed at each time step have to follow a certain consecutive order for the cal-
culations to yield correct result. The tasks are listed as the functions in the code
example 8.4. Although there are multiple tasks to be performed in this algorithm,
they all have dependencies to one another and can not be performed simultane-
ously on a an architecture for Multiple-Instruction in parallel like the MIMD ar-
chitecture. Since these functions have to be performed sequentially, one after an
other, the system can be considered to benefit from an architecture of a "Single
Instructions"-type. The multiple data elements in this finite difference application
can be calculated on simultaneously, but by only one function at a the time. By
this it can be concluded that this is a system with "Multiple-Data streams" that
can be computed on simultaneously by one task at the time, and therefore benefit-
ing from an architecture for Single-Instructions. This yields a SIMD architecture
when recalling Flynn’s taxonomy for parallel computer architectures, and SIMD
is then the preferred choice when implementing the sediment-transport.
In this application akin of SIMD architecture, namely the SIMT architecture
will be exploited, executing one data-element per thread, with multiple threads in
parallel. With the stencil for the sediment-transport model, there will be a need to
communicate the data-elements to and from four neighbors as the decomposition
of the data-domain is made per data-element. This could easily result in a major
overhead due to the all the communication of data-elements performed over chan-
nels, but the CUDA architecture omits this to happen. It could be the case where
each data-element in each function at every time-step needed to be communicated.
Luckily the CUDA architecture provided on the GPU’s produced by NVIDIA em-
ploy the SIMT parallel architecture. CUDA features many options for parallelism
and some have already been mentioned, like the parallelism performed with the
60
multiple threads and the high memory bandwidth. The data access between the
threads on a single GPU is optimized through the great bandwidth to the memory
hierarchy, and by this it allows high-speed access and computing on neighboring
data-elements. The extended number of memory interfaces (also called channels),
is a hardware feature that the GPU’s profit on. When computations are executed
on a single GPU, the communication overhead is not to be considered between
the threads as the multiple threads can share memory spaces and have extremely
fast access to the same data.
Although the CUDA architecture performs fast on single threads, it has a fea-
ture where threads can be placed into blocks. Placing groups of threads onto
thread-blocks increases the speed of the system. The reason is that the multi-
processor executes threads in groups of 32 parallel threads called warps. All
threads in a warp start together at the the same program address. The size of
the thread-blocks are optimally a multiple of the warp-size. On a single GPU the
data-elements are agglomerated into such thread-blocks.
In the world of parallel computing this problem is considered to be embar-
rassingly parallel. It could also be called "perfectly parallel" or "pleasingly par-
allel". The calculations could be performed in parallel on all the grid points in
the data-domain at one time step by one function, and have no dependency to
other tasks nor dependencies to data that needs to be communicated during the
calculations.
8.1.1.6 Agglomerating and Mapping for a Multi-GPU system
It has already been mentioned that in the multi-GPU system, the decomposition
of the data-domain lets each GPU handle one chunk of data. To obtain a good
load-balance, the total data-domain is decomposed into N (more or less) equally
sized data-sets, one for each of the N GPUs. To reduce communication, a 1D
decomposition along x has been chosen.
8.2 General Details Regarding Both
Serial and Parallel Implementations
Implementations of the systems have been developed with stencil computations
of the coupled system of Partial Differential Equations (PDEs) with time series,
according to the numerical model. Operations are executed on the sets of 2D
grid-data and the values are updated according to the stencils by a series of arith-
metic instructions. The application for the sediment-transport model is an Initial
61
value problem, and has been programmed with the equations ?? for diffusion and
4.20 for convection in addition to the functions handling the boundaries. The spa-
tial discretization of these equations requires on layer of ghost points around the
boundaries of each (sub-)domain. The simulations are executed through discrete
time steps, and all the data in the large 2D grids for h and s are updated at every
time step.
Both the serial and the parallel systems are implemented in C, although the
parallel system exploits the CUDA C extensions. CUDA C is basically C-programming
with CUDA extensions to perform memory-handling and thread-operations on the
CUDA parallel architecture.
8.2.1 Variables for Implementation
The numerical model represents how the bathymetry of the lake and the fractions
of the two sediments, sand and silt, are distributed when the masses are coming
down the River Kissimmee and into Lake Okeechobee. The lake is represented
as a 2D domain, and the physical size of the lake is set to 62130 by 69500 me-
ters. The change in bathymetry and the fractions of sand and silt over time, are
simulated using so-called time series over the 2D-domain. To simulate the distri-
bution of the sediments entering the lake from the river, the calculations for both
diffusion and convection are performed on the discrete data for each grid-point in
the 2D data-domain, at each time step. The numerical model with the discretized
equations from chapter 4 are implemented, and equation 4.20 is used for diffusion
and equation 4.23 is used for convection. In addition the updates of the boundaries
are computed.
There are various variables used for the calculations at each grid-point and at
each time step in the numerical model, as can be recalled from the variables listed
in the table 4.1 in the chapter explaining the numerical model 4. Some of the
variables are scalar constants used for all the grid points and all the time steps,
and some are 2D-matrices with a discrete values for each grid-point in the 2D
data-domain. The values for h, s, α and β are represented in such 2D-matrices.
The matrices α and β are constant matrices, which means that their data-values
remains the same for every time step without any updates.
The values in the 2D data-grids representing h and s are updated at each discrete
grid point in space and for each time-step. These matrices defines the bathymetry
of the lake, and the fraction of sand (and silt represented as 1− s) at a given point
in time and space.
The variables α and β from the numerical schema derived in Chapter 4, cor-
62
responds to variables alpha and beta in the implemented code.
For the simulations some assisting variables are needed. As the new value for h
are calculated at time step l+1, the old values calculated at time step l are used in
the calculations. For this reason, two values are needed for h. By this the matrix
for the new time step at l+1, representing hl+1 in the mathematical model is put
into the variable h_new, and the variable for the previous time step hl is kept as h
for simplicity when reading the program code on the right hand side.
In addition to this a third variable for h has to be available at the end of each time
step, when the the pointer to the matrices needs to we swapped for the next iter-
ation. This variable is called h_tmp. The same set of variables are declared for s
with s_new and s_tmp.
63
Variable Name Explanation
h The "old" values of h, calculated at the previous time step
s The "old" values of s, calculated at the previous time step
h_new Newly calculated values of h
s_new Newly calculated values of s
h_tmp Temporary grid for h during swapping of pointers
s_tmp Temporary grid for s during swapping of pointers
dx Spatial step in x-direction (Equivalent to ∆x in the numerical scheme)
dy Spatial step in y-direction (Equivalent to ∆y in the numerical scheme)
dt Time step (Equivalent to ∆t in the numerical scheme)
t Value incremented by dt at each time step
dx_R = 1/dx (Reciprocal variable)
dy_R = 1/dy (Reciprocal variable)
dx2_R = 1/(dx∗dx) (Reciprocal variable)
dy2_R = 1/(dy∗dy) (Reciprocal variable)
as_center = alpha[ j][i]∗ s[ j][i]
bs_center = beta[ j][i]∗ s[ j][i]
M Number of elements in x-direction
N Number of elements in y-direction
M2 =M+2 (Number of elements in x-direction incl. 2 ghost points)
N2 = N+2 (Number of elements in y-direction incl. 2 ghost points)
num_gpus Number of GPUs
N_dev = N/num_gpus (Num elem in y-dir on each GPU after domain decomp)
N2_dev = N_dev+2 (Num elem in y-dir on each GPU incl. 2 ghost points)
nStreams Number of CUDA Streams
nEvents Number of CUDA Events
nThreads Number of OpenMP threads
tr Number of threads on a thread block
bl Number of blocks in a grid
in f lux Influx percentage of sand and mud together (set to 50% at the start)
in f lux_start Start position in the grid for the influx
in f lux_stop Stop position in the grid for the influx
Table 8.1: Additional Variables for the Programmed Code
Other variables are also introduced to enhance computational performance.
The reciprocal variables are set so the stencil computations can avoid as many
repeating divisions as possible which would reduce the speed. Also the calcu-
lations of as_center = alpha[ j][i] ∗ s[ j][i] which are repeated several times in a
stencil is put in a separate variable to enhance speedup. The same is done with
bs_center= beta[ j][i]∗ [ j]s[i]. Layer thickness A is set to 1m. The variables added
for the implementation to the ones that are already presented with the numerical
64
model, are displayed with explanations in Table 8.1.
8.2.1.1 Spatial and Time Intervals - dx, dy and dt
The spatial intervals dx and dy have been calculated based on the physical size
of the domain from Lake Okeechobee’s seismic measurements. The size of dx is
based on the physical length in x-direction divided on the number of intervals in x-
direction, and likewise for dy which is based on the physical length in y-direction
divided on the number of intervals in y-direction. How these variables in addition
to the dt are set, can be seen in the code Listings 8.1.
Listing 8.1: Setting the variables dx, dy and dt
1 # d e f i n e X_LEN 62130 / / Leng th o f t h e s p a t i a l domain i n x−d i r e c t i o n
2 # d e f i n e Y_LEN 69500 / / Leng th o f t h e s p a t i a l domain i n y−d i r e c t i o n
3
4 # d e f i n e imin ( a , b ) ( a<b ? a : b ) / / I n l i n e f u n c t i o n − r e t u r n s t h e sm a l l e s t o f two
v a l u e s
5 # d e f i n e imax ( a , b ) ( a>b ? a : b ) / / I n l i n e f u n c t i o n − r e t u r n s t h e l a r g e s t o f two
v a l u e s
6
7 dx = X_LEN / (M+2−1.) ; / / The s i z e o f t h e t h e s p a t i a l i n t e r v a l s i n x−d i r e c t i o n (
s pac i ng be tween t h e x _ i and x_ ( i +1) )
8 dy = Y_LEN / ( N+2−1.) ; / / The s i z e o f t h e t h e s p a t i a l i n t e r v a l s i n y−d i r e c t i o n (
s pac i ng be tween t h e y _ j and y_ ( j +1) )
9
10 / * A s s e r t t h a t d t i s sma l l enough f o r t h e CFL S t a b i l i t y C h r i t e r i o n
11 * Find t h e sm a l l e s t d t a c co rd i ng t o t h e c h r i t e r i o n :
12 * d t <= ( dx ^2 * dy ^2 ) / (2 * C_max *( dx ^2 + dy ^2 ) )
13 * Ca l l t o i n l i n e f u n c t i o n t o r e t u r n t h e sm a l l e s t v a l u e ( d t or t h e s t a b i l i t y
c h r i t e r i o n f o r d t ) * /
14 / / d t = imin ( dt , ( ( ( dx*dx ) * ( dy*dy ) ) / (2* coe f f _max * ( ( dx*dx ) +( dy*dy ) ) ) ) ) ; / /
From hp l ’ s book + a r t i c l e ( i n k l . c o e f f s )
15 setMaxDt (& h _ a l p h a _ s t o r a g e , &h _ b e t a _ s t o r a g e , &dt , dx , dy , M, N) ;
16
17
18 / * S e t t h e l a r g e s t p o s s i b l e v a l u e f o r dt , t o o b t a i n s t a b i l i t y
19 * With a f u l l y e x p l i c i t scheme i t i s l i m i t e d by t h e v a l u e s o f a lpha and be t a
20 * /
21 void setMaxDt ( double ** h _ a l p h a _ s t o r a g e , double ** h _ b e t a _ s t o r a g e , double * dt ,
double dx , double dy , s i z e _ t M, s i z e _ t N) {
22 i n t i ;
23 double alpha_max = (* h _ a l p h a _ s t o r a g e ) [ 0 ] ;
24 double beta_max = (* h _ b e t a _ s t o r a g e ) [ 0 ] ;
25 f o r ( i =1 ; i <(M*N) ; i ++) {
26 alpha_max = imax ( alpha_max , (* h _ a l p h a _ s t o r a g e ) [ i ] ) ;
27 beta_max = imax ( beta_max , (* h _ b e t a _ s t o r a g e ) [ i ] ) ;
28 }
29 / / * d t = ( dx*dy ) / imax ( alpha_max , beta_max ) ;
30 * d t = 0 . 0 1 ;
31 }
The time interval dt was first set according to the Courant-Friedrichs-Levy
(CFL) Stability Criterion, but was later set to 0.01 year for both diffusion and
convection according to the suggestions in [Clark, Wei, and Cai 2010].
65
8.2.1.2 Ghost Points
The grid of the 2D data has been added 1 row of ghost points all around on the
boundary as presented in figure 8.1. The size of the grid without the ghost points




Figure 8.1: One Row of Ghost Points Around the Boundary
8.2.1.3 Influx
Where the River Kissimmee enters Lake Okeechobee, there is an influx of sand
and mud, which together is set to be 50% of the total mass. Of this mass there
is a total of 20% sand and 80% mud. The variables influx_start and influx_-
stop indicates the position along the border where the influx starts and stops. The
influx is along x at j = 0, and the variables are used to set the Dirichlet- and
Inhomogeneous Neumann Boundary Condition. An overview of the variables can
be viewed in the code listing 8.2.
66
Listing 8.2: Variables for the influx from River Kissimmee
1 double i n f l u x = 5 0 ; / / I n f l u x−p e r c e n t a g e
2 double f _ s = i n f l u x * 0 . 2 , f_m= i n f l u x * 0 . 8 ; / / I n f l u x−va l u e o f sand (20%) and mud (80%)
3 i n t i n f l u x _ s t a r t =755 , i n f l u x _ s t o p =809; / / Grid Po i n t s f o r i n f l u x − s t a r t and s t o p
8.2.2 Seismic Data Read Into The System With The
NetCDF-Library
The numerical model implemented for the sediment transport model contains nu-
merous of variables as displayed. The variables need values, and some of the
variables regarding in the sediments-transport of Lake Okeechobee, have their
data-values at the simulation’s start point coming from seismic measurements.
The variables that are read into the system from seismic surveys [Gtopo30 1996]
are: h,s,α and β .
These data exists in a Network Common Data Form (NetCDF) [Unidata
2015] file format, as a .nc file, a machine-independent data format for array-
oriented scientific data. These datasets are first opened and read into the system
with the NetCDF-library for the C-language, and then stored into the memory
allocated on the CPU. The data in the .nc files are read into the system with the
code in the code listing 8.3. The size of the grid comes from variables read from
the file, in addition to the data for the large grids.
The initial values in the s-grid , s(i, j,0), are set to 0.5, as the initial sand-
silt volume fraction are assumed to be at 50% according to [Clark, Wei, and Cai
2010]..
Listing 8.3: Reading files of the format .nc with the NetCDF-library
1 # i n c l u d e < n e t c d f . h>
2 . . .
3
4 double * a l p h a ;
5 i n t r e t v a l , nc id , v a r i d , x id , y i d ;
6 s i z e _ t x len , y l e n ;
7
8 i f ( ( r e t v a l = nc_open ( " a l p h a . nc [ 2 ] " , NC_NOWRITE, &n c i d ) ) )
9 ERR( r e t v a l ) ;
10 i f ( ( r e t v a l = n c _ i n q _ d i m i d ( nc id , " x " ,& x i d ) ) )
11 ERR( r e t v a l ) ;
12 i f ( ( r e t v a l = n c _ i n q _ d i m l e n ( nc id , xid ,& x l e n ) ) )
13 ERR( r e t v a l ) ;
14 i f ( ( r e t v a l = n c _ i n q _ d i m i d ( nc id , " y " ,& y i d ) ) )
15 ERR( r e t v a l ) ;
16 i f ( ( r e t v a l = n c _ i n q _ d i m l e n ( nc id , yid ,& y l e n ) ) )
17 ERR( r e t v a l ) ;
67
18
19 i f ( ( a l p h a = ( double *) ma l l oc ( x l e n * y l e n * s i z e o f ( double ) ) ) == NULL) {
20 p r i n t f ( " F a i l e d t o a l l o c a t e t h e 2D a r r a y f o r a l p h a . n " ) ;
21 re turn −1;
22 }
23
24 i f ( ( r e t v a l = n c _ i n q _ v a r i d ( nc id , " z " , &v a r i d ) ) )
25 ERR( r e t v a l ) ;
26 i f ( ( r e t v a l = n c _ g e t _ v a r _ d o u b l e ( nc id , v a r i d , a l p h a ) ) )
27 ERR( r e t v a l ) ;
28 i f ( ( r e t v a l = n c _ c l o s e ( n c i d ) ) )
29 ERR( r e t v a l ) ;
8.2.3 Time Series
This system simulates over the 2D data domain, although represented as 1D coa-
lesced, in time series. In addition to performing the updates for Diffusion on h and
Convection on s, there is also a need to update the boundaries for both h and s at
each time step. The boundary conditions for h are updated after the diffusion-step,
first by a Homogeneous Neumann BC, and then by a Inhomogeneous Neumann
BC. The BCs need to be updated on h before the convection can be calculated on
s, as the newly calculated values for h are used in the calculation of s. The BCs on
s are updated after the convection-step, first updated by the Homogeneous Neu-
mann BC, before it takes a Dirichlet BC. Since the Homogeneous Neumann BC
is 0, it means copying the value of the neighboring index on the inner side of the
data-domain to the boundary-points. Inhomogeneous Neumann BC and Dirichlet
BC are set along the border, only where the Kissimmee River gives an influx. At
each time step t is updated with an incrementation of dt. Pseudo-code for the
functions in the loop of the time series and the swapping of the matrix-pointers
can be seen in the code listing 8.4.
Listing 8.4: Pseudo Code of the Main Function Loop
1 whi le ( t <max_t ) {
2 t += d t ;
3
4 / / compute d i f f u s i o n and upda t e bounda r i e s on h
5 d i f f u s i o n ( ) ;
6 update_boundary_neumann_homogeneous ( ) ;
7 update_boundary_neumann_inhomogeneous ( ) ;
8
9 / / compute c o n v e c t i o n and upda t e bounda r i e s on s
10 c o n v e c t i o n ( ) ;
11 update_boundary_neumann_homogeneous ( ) ;
12 u p d a t e _ b o u n d a r y _ d i r i c h l e t ( ) ;
13
14 / / swap mat r i x−p o i n t e r s
15 h_tmp = h_o ld ;
16 h_o ld = h_new ;
68
17 h_new = h_tmp ;
18
19 s_tmp = s _ o l d ;
20 s _ o l d = s_new ;
21 s_new = s_tmp ;
22 }
8.2.4 Coalesced Memory
One major feature is added to reduce the time spent to access the data at run-
time. The optimization lies in the structuring of the data. Instead of using a 2D
data-structure for the 2D-matrices, this application benefits from a 1D coalesced
memory on both the CPU and the GPU. Using a coalesced 1D data-structure is
an advantage when data is to be accessed, as they are read from the same memory
address, and not from multiple address spaces. This improves the performance of
the application as less time is used on data accessing.
8.3 Serial Implementation with ANSI C
The implementation of the sequential system is done pretty straight forward using
the C programming language with the built-in libraries that lies within the ANSI-
C standard [ REF ]. The stencil computations are performed using loops over the
time series and the 2D spatial data domain.
8.3.1 Allocation of Coalesced Memory for the Serial Code
Some of the host data that are handled by the host alone has been allocated with
malloc(). For the 2D-grids, 1D coalesced data-structures has been allocated for
fast access of the data. The feature of 2D-pointers are created for the 2D data-sets.
The 2D-pointers helps indicate which row the data-element belongs to, and makes
the coded stencils more readable. This way to create 2D-indexing helps on the
comprehension of the code and what lies underneath the numerics, for anybody
reading or working with the computer program. The allocations of the coalesced
memory and allocations of the corresponding pointer to the rows created on the
CPU, together with how to access the data, is presented in the code listing 8.5.
69
Listing 8.5: 1D Coalesced Memory for Fast Access on the CPU.
1 / * A l l o c a t i n g t h e 1D coa l e a s c e d memory s t o r a g e f o r a 2D−d a t a s e t * /
2 i f ( ( u _ s t o r a g e = ( double *) ma l lo c ( s i z e o f ( double ) *M*N ) ) == NULL ) {
3 f p r i n t f ( s t d e r r , " Out o f memory \ n " ) ;
4 e x i t ( 0 ) ;
5 }
6
7 / * A l l o c a t e t h e p o i n t e r t o a c c e s s da ta i n a 2D manner * /
8 i f ( ( u = ( double **) m a l l oc ( s i z e o f ( double ) *N ) ) == NULL ) {
9 f p r i n t f ( s t d e r r , " Out o f memory \ n " ) ;
10 e x i t ( 0 ) ;
11 }
12
13 / * S e t t h e 2D−p o i n t e r t o p o i n t c o r r e c t l y i n t o t h e 1D coa l e a s c e d
14 * memory s t o r a g e where each new l i n e b eg i n s * /
15 f o r ( i =0 ; i <N; i ++) {
16 u [ i ] = &( u _ s t o r a g e [ i *M] ) ;
17 }
18
19 / * Ac c e s s i n g data−e l emen t s u s i n g 2D−i n d e x i n g * /
20 f o r ( j =0 ; j <N; j ++) {
21 f o r ( i =0 ; i <M; i ++) {
22 . . .
23 u [ j ] [ i ] = . . . ;
24 . . .
25 }
26 }
8.4 Parallelization of Serial Code using the
CUDA C Extension
The parallelization carried out by [Clark2010] was applied using Message Pass-
ing Interface (MPI). With MPI their code was executed on CPUs while this thesis
has investigated the benefit from the extraordinary processing power of NVIDIA’s
GPUs.
* In addition to aim to answer this thesis’ questions stated in 2.1, this thesis
also intend to investigate the following problems.
* The first aim was to see if the CUDA application could outperform the MPI
application. If so; in addition to yield the benefits of enhanced speedup, it is also
great in the aspect of knowing that a few GPUs can be less costly and more energy
saving than a CPU cluster.
* Another aim was to see if the features presented by [Sourouri et al] would
enhance the the speedup utterly. They have tested some techniques for enhance-
ments for parallelizing code on GPUs using CUDA Streams in combination with
70
OpenMP threads. The question was if their techniques applied on the sediment-
transport application would yield an utterly enhanced speedup.
All computer code has been written from scratch, including the serial code
for the CPU, but the MPI code has been viewed in the process of working with
this thesis. This thesis has extend the serial code to run in parallel on one or
multiple GPUs rather than on a CPU cluster, using the CUDA C Extension for the
CUDA Parallel Architecture. The CUDA features mentioned in chapter 7 has been
deployed in the procedure, and miscellaneous optimization techniques has been
tested out. Here is an enumerated list of the implementations with descriptions
and features that has been examined in this thesis:
1. Parallelizing with use of CUDA kernels on 1 GPU
2. Coupling multiple GPU’s using Synchronous computations and communi-
cations between GPUs via host.
3. Coupling multiple GPU’s using Synchronous computations and P2P com-
munications between GPUs over the PCI Express bus.
4. Coupling multiple GPU’s using Asynchronous computations and commu-
nications between GPUs via host, with CUDA Streams for concurrency.
5. Coupling multiple GPU’s using Asynchronous computations and P2P com-
munications between GPUs over the PCI Express bus, with CUDA Streams
for concurrency.
6. Coupling multiple GPU’s using Asynchronous computations and P2P com-
munications between GPUs over the PCI Express bus, with CUDA Streams
for concurrency, and in addition create a set of OMP-threads for each GPU,
and let one OMP-thread on the CPU launch one CUDA-kernel, and an other
OMP-thread handle asynchronous communication. With this, observations
of additional speedup can be expected, due to a reduction of kernel-launch
overhead.
Table 8.2: Implementations with Various Features for Parallel Architecture
8.4.1 Allocation and Copying of Coalesced Memory
for the Parallel Code
The device memory has been allocated using the function cudaMallocPitch().
This is a function, which allocates a linear memory that is at least width (in bytes)
71
×height in size. This function will make sure that the data are padded appropri-
ately to meet the alignment requirements for coalescing the data. This function is
recommended by [NVIDIA:CUDA] for best performance when threads in a warp
are accessing the addressed row or performing copies with 2D arrays. The length
of rows in the 2D array should then be a multiple of the warp size (32). as these
data are fetched runtime in the size of a warp, and cudaMallocPitch() therefore
pads the row in the memory accordingly. The function returns a value pitch,
witch is the length of the rows including the padding, and it is measured in the
size of bytes. The pitched data must be handled correctly so the right data are
accessed right. When copying data between host and device, and the device-array
is padded, the function cudaMemcpy2D() can be used to handle the copying cor-
rectly. A small example of the allocation with cudaMallocPitch() and copying
of data from host to a device can be seen in code listing 8.6.
Listing 8.6: Allocation of Coalesced Memory and Copy between Host and Device.
1 double *h , * s ; / / g r id−da ta / ma t r i c e s
2 s i z e _ t p i t c h ; / / s i z e o f paddded g r i d
3
4 / * A l l o c a t e l i e a r memory f o r dev i c e−da ta (1D−v e c t o r s ) * /
5 HANDLE_ERROR( c u d a M a l l o c P i t c h (&h_dev , &p i t c h , s i z e o f ( double ) *M2, N2 ) ) ;
6 HANDLE_ERROR( c u d a M a l l o c P i t c h (&s_dev , &p i t c h , s i z e o f ( double ) *M2, N2 ) ) ;
7
8 / * Copy v e t o r s FROM HOST−memory TO DEVICE−memory * /
9 HANDLE_ERROR( cudaMemcpy2D ( h_dev , p i t c h , h_hos t , s i z e o f ( double ) *M2,
10 s i z e o f ( double ) *M2, N2 , cudaMemcpyHostToDevice ) ) ;
11 HANDLE_ERROR( cudaMemcpy2D ( s_dev , p i t c h , s _ h o s t , s i z e o f ( double ) *M2,
12 s i z e o f ( double ) *M2, N2 , cudaMemcpyHostToDevice ) ) ;
8.4.2 Threads and Blocks in a Grid
When computing on the GPU and exploiting the parallelism of a CUDA-kernel,
it is necessary to decide for the number of threads on a thread-block, tr, and the
number of thread-blocks in a grid, bl. Both these sizes are in 2D for our 2D data,
and the variables tr and bl are therefore declared of the type dim3. The computer-
code for this can be seen in the code listing 8.7. The code has one line commented
out which is setting the size of the grid for an arbitrary number of GPUs, and the
variable N_dev is the size of the decomposed domain. The two last lines for the
boundaries launches fewer threads as it handles less data. The program optimizes
the speed by only launching the necessary number of threads (in a multiple of
warps), and not the size of a full grid.
72
Listing 8.7: Threads on a Block and Blocks in a Grid.
1 / * Number o f t h r e a d s on a thread−b l o c k i n ( x− and y d i r e c t i o n ) * /
2 i n t n u m _ x _ t h r e a d s _ p e r _ b l o c k = 16 , n u m _ y _ t h r e a d s _ p e r _ b l o c k = 4 ;
3 dim3 t r ( num _x _ t h re ad s_p e r _b l oc k , n u m _ y _ t h r e a d s _ p e r _ b l o c k ) ;
4
5 / * Number o f th read−b l o c k s i n a g r i d * /
6 dim3 b l ( (M+ t r . x +1) / t r . x , (N+ t r . y +1) / t r . y ) ; / * ** FOR 1 GPU *********** * /
7 / / dim3 b l ( (M+ t r . x+1) / t r . x , ( N_dev+ t r . y+1) / t r . y ) ; / *** FOR MULTIPPLE GPU’S *** /
8 dim3 bl_BC_HomNeumann ( 1 6 , 4 ) ; / * ** For Homogeneous Neumann BC ** * /
9 dim3 b l _ B C _ i n f l u x ( 1 , 1 ) ; / * ** For BC a t t h e i n f l u x ******** * /
8.4.3 Application for a Single GPU
The first parallel application (1), listed in table 8.2 is described here. The appli-
cation performing on a (single) GPU has some of the same features as the serial
code. Instead of the C-functions, the CUDA-kernels are called from a loop han-
dling the time series. It updates diffusion and convection on all the grid points
except on the boundaries. Update of the boundaries are performed in separate
kernels. The loop handling the kernel calls for 1 GPU can be seen in the code
Listing 8.8. Note that for the update of the boundaries, the kernels are called for
a smaller set of threads than for the diffusion or convection to reduce time, as
mentioned in section 8.4.2.
73
Listing 8.8: Calling the CUDA-Kernels in Time Series on 1 GPU.
1 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ 0 ] ) ) ;
2 whi le ( t <t_max ) {
3 t += d t ;
4
5 / * Compute D i f f u s i o n on H * /
6 dif fus ion_Solve_H_GPU <<<bl , t r >>>(h , s , h_new , a lpha , be t a , p i t c h , dx , dy
, d t , c_s , c_m , M2, N2 ) ;
7
8 / * Update Boundary Cond i t i o n s on H * /
9 updateBC_Neuman_Homogeneous_H_GPU <<<bl_BC_HomNeumann , t r >>>(h_new , p i t c h ,
dx , dy , M2, N2 ) ;
10 updateBC_Neuman_Inhomogeneous_H_GPU <<< bl_BC_in f lux , t r >>>(h , h_new , a lpha
, be t a , p i t c h , i n f l u x _ s t a r t , i n f l u x _ s t o p , f_s , f_m , dy ) ;
11
12 / * Compute Convec t i on on S * /
13 convect ion_Solve_S_GPU <<<bl , t r >>>(h , s , h_new , s_new , a lpha , p i t c h , dx ,
dy , dt , c_s , c_m , A, M2, N2 ) ;
14
15 / * Update Boundary Cond i t i o n s on S * /
16 updateBC_Neuman_Homogeneous_S_GPU <<<bl_BC_HomNeumann , t r >>>(s_new , p i t c h ,
dx , dy , M2, N2 ) ;
17 updateBC_Dir ichle t_S_GPU <<< bl_BC_in f lux , t r >>>(s_new , a lpha , be t a , p i t c h ,
i n f l u x _ s t a r t , i n f l u x _ s t o p , f_s , f_m , dy ) ;
18
19 / * Swap array−p o i n t e r s b e f o r e t h e n e x t t i m e s t e p * /
20 h_tmp = h ;
21 h = h_new ;
22 h_new = h_tmp ;
23
24 s_tmp = s ;
25 s = s_new ;
26 s_new = s_tmp ;
27 }
8.4.4 Applications for Multiple GPUs
The various applications for multi GPUs that are described in 8.2, are presented
in this section. Different features that has been investigated regarding the imple-
mentations, is described.
When programming for multiple GPUs there are several possibilities regarding
how to structure the program, various ways to communicate data between the de-
vices, and different features can be employed, all with one major objective; to
enhance the performance of the parallel system.
When dealing with multiple devices it’s important to set the right context for
the kernel-calls or memory handling. Since each device handles it’s own memory,
it is crucial that operations are performed on the correct set of data allocated on a
given device. Each device is given a device-id, and the right context is set for each
device with the function cudaSetDevice() as displayed in the code listing 8.9.
74
Listing 8.9: Sets The Device-ID, and the Current Device’s Context.
1 i n t d ; / / i t e r a t o r over GPUs
2 i n t num_gpus ; / / number o f GPUs r e t u r n e d by t h e i n t r i n s i c f u n c t i o n
3
4 / * Check f o r t h e number o f a v a i l a b l e d e v i c e s * /
5 HANDLE_ERROR( cudaGetDeviceCount ( &num_gpus ) ) ;
6 i f ( num_gpus < 2) {
7 p r i n t f ( "We need a t l e a s t two compute 1 . 0 o r g r e a t e r d e v i c e s , b u t on ly found %d \
n " , num_gpus ) ;
8 }
9
10 / * S e t dev i c e−i d s f o r each GPU * /
11 i n t d e v i c e [ num_gpus ] ;
12 f o r ( d =0; d<num_gpus ; d ++) {
13 d e v i c e [ d ] = d ;
14 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
15 }
8.4.4.1 Data Decomposition
When the computations are performed on multiple devices, the data must be de-
composed into smaller data domains, one for each of the multiple devices, and
memory must be allocated on each device.
For the sediment transport application the data domain has been decomposed in
only 1 direction to reduce the number of neighbors, and by that reducing the
number of communication channels. Establishing the communication is a time
consuming operation, so in order to minimize communication overhead, a 1D de-
composition is preferred. The data domain is decomposed along x in equally sized
data blocks, and placed on the number of available GPUs, or the number of GPUs
that is set for the simulation (num_gpus).
The variable N being the size of the non-decomposed domain in y-direction,
has then been discarded for the decomposed size of N. The decomposed size of N
is set in the variable N_dev and equals the number of N divided by the number of
GPUs (num_gpus). The handling of ghost-point for the decomposed domains are
done equally for every GPU. The decomposed domain including one row of ghost
points around the boundary, has like the variable N (for non-decomposed), been
added 2 rows in y-direction, one on each side of the decomposed domain. The
decomposed size including the ghost points is then given in the variable N2_dev
which equals the value of N_dev+ 2. An illustration of the decomposed data-
domain with boundary-points and ghost-points spread on the multiple GPUs is
unveiled in figure 8.2.
Since the decomposition is performed along x, the domain is not divided in
the x-direction. Therefore the values for the domain size in x-direction remain the
75





























Figure 8.2: Domain Decomposed on GPUs
The data communicated between neighboring devices are copied into "tranfer-
buffers" for sending and receiving data. The newly calculated boundary points
along x in one domain are transferred to the ghost points in the neighboring do-
main. The upper boundary on GPU 0, being the upper_send_bu f f er is trans-
ferred to the ghost points on GPU 1, being the upper neighbor, via lower_recv_bu f f er.
This is performed like vise in the opposite direction, where GPU 1 is sending
it’s lower boundary in lower_send_bu f f er to GPU 0’s upper ghost points in
upper_recv_bu f f er. This is applied for all neighbors on all the working GPU’s.
Fgure 8.2 also shows the transferred data buffers.
The various implementations handles this data transfer in different ways, as
explained in later sections in this chapter.
When handling multiple GPUs, the application loops over all the devices, sets
76
the context of the device, and the variables are declared and allocated on that spec-
ified GPU. The declaration and allocation of the decomposed data sets for each
device can be seen in listings 8.10
Listing 8.10: Allocating Device Memory for the Decomposed Data-Domains on
All the Devices.
1 / * ** Dev ice da ta ** * /
2 s i z e _ t N_dev = N/ num_gpus ; / / h e i g h t o f data−domain a f t e r domain d e compo s i t i o n
3 s i z e _ t N2_dev = N_dev +2; / / h e i g h t o f data−domain i n c l u d i n g ghos t−p o i n t s
4 s i z e _ t p i t c h _ b y t ; / * p i t c h s i z e i n number o f b y t e s = num_elements * s i z e o f (
doub l e ) * /
5
6 double * h_dev [ num_gpus ] , * h_new_dev [ num_gpus ] , * h_tmp_dev [ num_gpus ] ; / *
Coaleased 1D s t o r a g e * /
7 double * s_dev [ num_gpus ] , * s_new_dev [ num_gpus ] , * s_tmp_dev [ num_gpus ] ;
8 double * a l p h a _ d e v [ num_gpus ] , * b e t a _ d e v [ num_gpus ] ;
9
10 double * h _ f i r s t _ r o w _ s e n d [ num_gpus ] , * h _ f i r s t _ r o w _ r e c v [ num_gpus ] ;
11 double * s _ f i r s t _ r o w _ s e n d [ num_gpus ] , * s _ f i r s t _ r o w _ r e c v [ num_gpus ] ;
12 double * h _ l a s t _ r o w _ s e n d [ num_gpus ] , * h _ l a s t _ r o w _ r e c v [ num_gpus ] ;
13 double * s _ l a s t _ r o w _ s e n d [ num_gpus ] , * s _ l a s t _ r o w _ r e c v [ num_gpus ] ;
14
15 f o r ( d =0; d<num_gpus ; d ++) {
16 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
17
18 / * A l l o c a t e l i e a r memory f o r dev i c e−da ta (1D−v e c t o r s ) on GPU’ s f o r t h e new
da ta t o be computed * /
19 / * padded / p i t c h e d * /
20 HANDLE_ERROR( c u d a M a l l o c P i t c h (&h_new_dev [ d ] , &p i t c h _ b y t , s i z e o f ( double ) *M2,
N2_dev ) ) ;
21 HANDLE_ERROR( c u d a M a l l o c P i t c h (&s_new_dev [ d ] , &p i t c h _ b y t , s i z e o f ( double ) *M2,
N2_dev ) ) ;
22 HANDLE_ERROR( c u d a M a l l o c P i t c h (&h_tmp_dev [ d ] , &p i t c h _ b y t , s i z e o f ( double ) *M2,
N2_dev ) ) ;
23 HANDLE_ERROR( c u d a M a l l o c P i t c h (& s_tmp_dev [ d ] , &p i t c h _ b y t , s i z e o f ( double ) *M2,
N2_dev ) ) ;
24
25 / * A l l o c a t e padded memory f o r dev i c e−da ta t o be cop i ed from t h e e x i s t i n g
cpu / hos t−a r r a y s * /
26 HANDLE_ERROR( c u d a M a l l o c P i t c h (&h_dev [ d ] , &p i t c h _ b y t , s i z e o f ( double ) *M2,
N2_dev ) ) ;
27 HANDLE_ERROR( c u d a M a l l o c P i t c h (& s_dev [ d ] , &p i t c h _ b y t , s i z e o f ( double ) *M2,
N2_dev ) ) ;
28 HANDLE_ERROR( c u d a M a l l o c P i t c h (& a l p h a _ d e v [ d ] , &p i t c h _ b y t , s i z e o f ( double ) *M2,
N2_dev ) ) ;
29 HANDLE_ERROR( c u d a M a l l o c P i t c h (& b e t a _ d e v [ d ] , &p i t c h _ b y t , s i z e o f ( double ) *M2,
N2_dev ) ) ;
30
31 / * A l l o c a t e 1D−a r r a y s f o r t r a n s f e r r i n g da ta be tween t h i s GPU and t h e
p r e v i o u s GPU, and t h i s GPU and t h e n e x t GPU * /
32 HANDLE_ERROR( c u d a M a l l o c P i t c h (& h _ f i r s t _ r o w _ s e n d [ d ] , &p i t c h _ b y t , s i z e o f (
double ) *M2, 1 ) ) ;
33 HANDLE_ERROR( c u d a M a l l o c P i t c h (& h _ f i r s t _ r o w _ r e c v [ d ] , &p i t c h _ b y t , s i z e o f (
double ) *M2, 1 ) ) ;
34 HANDLE_ERROR( c u d a M a l l o c P i t c h (& h _ l a s t _ r o w _ s e n d [ d ] , &p i t c h _ b y t , s i z e o f (
double ) *M2, 1 ) ) ;
35 HANDLE_ERROR( c u d a M a l l o c P i t c h (& h _ l a s t _ r o w _ r e c v [ d ] , &p i t c h _ b y t , s i z e o f (
77
double ) *M2, 1 ) ) ;
36
37 HANDLE_ERROR( c u d a M a l l o c P i t c h (& s _ f i r s t _ r o w _ s e n d [ d ] , &p i t c h _ b y t , s i z e o f (
double ) *M2, 1 ) ) ;
38 HANDLE_ERROR( c u d a M a l l o c P i t c h (& s _ f i r s t _ r o w _ r e c v [ d ] , &p i t c h _ b y t , s i z e o f (
double ) *M2, 1 ) ) ;
39 HANDLE_ERROR( c u d a M a l l o c P i t c h (& s _ l a s t _ r o w _ s e n d [ d ] , &p i t c h _ b y t , s i z e o f (
double ) *M2, 1 ) ) ;
40 HANDLE_ERROR( c u d a M a l l o c P i t c h (& s _ l a s t _ r o w _ r e c v [ d ] , &p i t c h _ b y t , s i z e o f (
double ) *M2, 1 ) ) ;
41
42 / * Copy v e t o r s FROM HOST−memory TO DEVICE−memory * /
43 HANDLE_ERROR( cudaMemcpy2D ( h_dev [ d ] , p i t c h _ b y t , h _ o l d _ s t o r a g e _ h o s t [ d ] ,
s i z e o f ( double ) *M2, s i z e o f ( double ) *M2, N2_dev , cudaMemcpyDefaul t ) ) ;
44 HANDLE_ERROR( cudaMemcpy2D ( s_dev [ d ] , p i t c h _ b y t , s _ o l d _ s t o r a g e _ h o s t [ d ] ,
s i z e o f ( double ) *M2, s i z e o f ( double ) *M2, N2_dev , cudaMemcpyDefaul t ) ) ;
45 HANDLE_ERROR( cudaMemcpy2D ( a l p h a _ d e v [ d ] , p i t c h _ b y t , a l p h a _ s t o r a g e _ h o s t [ d ] ,
s i z e o f ( double ) *M2, s i z e o f ( double ) *M2, N2_dev , cudaMemcpyDefaul t ) ) ;
46 HANDLE_ERROR( cudaMemcpy2D ( b e t a _ d e v [ d ] , p i t c h _ b y t , b e t a _ s t o r a g e _ h o s t [ d ] ,
s i z e o f ( double ) *M2, s i z e o f ( double ) *M2, N2_dev , cudaMemcpyDefaul t ) ) ;
47 }
8.4.4.2 Pinned Memory on Host
For some of the host data that has to be copied or mapped from host to device dur-
ing runtime, in addition to being transferred between multiple devices using P2P
and/or asynchronous behavior, the CUDA-function cudaHostAlloc() has been
used to allocate data on host, and kind is set to cudaHostAllocPortable. This
ensures for page-locked or so-called pinned memory on the host. The pinned
memory space is shared between the host and device or between multiple devices,
and this memory has pointers for host and device. An example on how pinned
host memory is allocated for the decomposed data, in addition to allocations of
2D-pointers, can be seen in 8.11.
Listing 8.11: Allocation of Pinned/Page-Locked Memory Space)
1 / * Decomposed ho s t da ta ( d i v i d e d by number o f gpu ’ s ) * /
2 double * h _ o l d _ s t o r a g e _ h o s t [ num_gpus ] , * h _ n e w _ s t o r a g e _ h o s t [ num_gpus ] ;
3 double * s _ o l d _ s t o r a g e _ h o s t [ num_gpus ] , * s _ n e w _ s t o r a g e _ h o s t [ num_gpus ] ;
4 double * a l p h a _ s t o r a g e _ h o s t [ num_gpus ] , * b e t a _ s t o r a g e _ h o s t [ num_gpus ] ;
5 double ** h _ o l d _ p t r _ h o s t [ num_gpus ] , ** h _ n e w _ p t r _ h o s t [ num_gpus ] ;
6 double ** s _ o l d _ p t r _ h o s t [ num_gpus ] , ** s _ n e w _ p t r _ h o s t [ num_gpus ] ;
7 double ** a l p h a _ p t r _ h o s t [ num_gpus ] , ** b e t a _ p t r _ h o s t [ num_gpus ] ;
8
9 i n t s t a r t _ r o w ; / / The row i n t h e " g l o b a l g r i d " ( h o l d i n g a l l t h e p a r t i a l g r i d s )
where a p a r t i a l g r i d s t a r t s
10
11 f o r ( d =0; d<num_gpus ; d ++) {
12
13 / * A l l o c a t e 1D coa l e s c e d s t o r a g e o f da ta * /
14 c u d a H o s t A l l o c (& h _ o l d _ s t o r a g e _ h o s t [ d ] , s i z e o f ( double *) *M2*N2_dev ,
c u d a H o s t A l l o c P o r t a b l e ) ;
15 c u d a H o s t A l l o c (& h _ n e w _ s t o r a g e _ h o s t [ d ] , s i z e o f ( double *) *M2*N2_dev ,
c u d a H o s t A l l o c P o r t a b l e ) ;
78
16 c u d a H o s t A l l o c (& s _ o l d _ s t o r a g e _ h o s t [ d ] , s i z e o f ( double *) *M2*N2_dev ,
c u d a H o s t A l l o c P o r t a b l e ) ;
17 c u d a H o s t A l l o c (& s _ n e w _ s t o r a g e _ h o s t [ d ] , s i z e o f ( double *) *M2*N2_dev ,
c u d a H o s t A l l o c P o r t a b l e ) ;
18 c u d a H o s t A l l o c (& a l p h a _ s t o r a g e _ h o s t [ d ] , s i z e o f ( double *) *M2*N2_dev ,
c u d a H o s t A l l o c P o r t a b l e ) ;
19 c u d a H o s t A l l o c (& b e t a _ s t o r a g e _ h o s t [ d ] , s i z e o f ( double *) *M2*N2_dev ,
c u d a H o s t A l l o c P o r t a b l e ) ;
20
21 / * A l l o c a t e 2D p o i t e r s f o r each row * /
22 c u d a H o s t A l l o c (& h _ o l d _ p t r _ h o s t [ d ] , s i z e o f ( double *) *M2*N2_dev ,
c u d a H o s t A l l o c P o r t a b l e ) ;
23 c u d a H o s t A l l o c (& h _ n e w _ p t r _ h o s t [ d ] , s i z e o f ( double *) *M2*N2_dev ,
c u d a H o s t A l l o c P o r t a b l e ) ;
24 c u d a H o s t A l l o c (& s _ o l d _ p t r _ h o s t [ d ] , s i z e o f ( double *) *M2*N2_dev ,
c u d a H o s t A l l o c P o r t a b l e ) ;
25 c u d a H o s t A l l o c (& s _ n e w _ p t r _ h o s t [ d ] , s i z e o f ( double *) *M2*N2_dev ,
c u d a H o s t A l l o c P o r t a b l e ) ;
26 c u d a H o s t A l l o c (& a l p h a _ p t r _ h o s t [ d ] , s i z e o f ( double *) *M2*N2_dev ,
c u d a H o s t A l l o c P o r t a b l e ) ;
27 c u d a H o s t A l l o c (& b e t a _ p t r _ h o s t [ d ] , s i z e o f ( double *) *M2*N2_dev ,
c u d a H o s t A l l o c P o r t a b l e ) ;
28
29 / * S e t t h e row−p o i n t e r * /
30 f o r ( j =0 ; j <N2_dev ; j ++ ) {
31 h _ o l d _ p t r _ h o s t [ d ] [ j ] = &( h _ o l d _ s t o r a g e _ h o s t [ d ] [ j *M2] ) ;
32 h _ n e w _ p t r _ h o s t [ d ] [ j ] = &( h _ n e w _ s t o r a g e _ h o s t [ d ] [ j *M2] ) ;
33 s _ o l d _ p t r _ h o s t [ d ] [ j ] = &( s _ o l d _ s t o r a g e _ h o s t [ d ] [ j *M2] ) ;
34 s _ n e w _ p t r _ h o s t [ d ] [ j ] = &( s _ n e w _ s t o r a g e _ h o s t [ d ] [ j *M2] ) ;
35 a l p h a _ p t r _ h o s t [ d ] [ j ] = &( a l p h a _ s t o r a g e _ h o s t [ d ] [ j *M2] ) ;
36 b e t a _ p t r _ h o s t [ d ] [ j ] = &( b e t a _ s t o r a g e _ h o s t [ d ] [ j *M2] ) ;
37 }
38
39 / * I n i t i a l i z e t h e decomposed ma t r i c e s * /
40 s t a r t _ r o w =d*N_dev ;
41 f o r ( j =0 ; j <N2_dev ; j ++) {
42 f o r ( i =0 ; i <M2; i ++) {
43 h _ o l d _ p t r _ h o s t [ d ] [ j ] [ i ] = h_o ld [ s t a r t _ r o w + j ] [ i ] ;
44 s _ o l d _ p t r _ h o s t [ d ] [ j ] [ i ] = s _ o l d [ s t a r t _ r o w + j ] [ i ] ;
45 a l p h a _ p t r _ h o s t [ d ] [ j ] [ i ] = a l p h a [ s t a r t _ r o w + j ] [ i ] ;





The multi GPU system described as implementation (2) in 8.2 is coupling mul-
tiple GPU’s using synchronous computations and communications between the
devices. This means that communication does not overlap with computations.
In this app the communication between devices is performed via host, using the
function cudaMemcpy() with the kind set to cudaMemcpyDeviceToHost moving
data from device memory to host memory. When all the data arrays are copied to
host, swapping of pointers to the arrays are done before the data are copied to the
neighboring device with the same function, but in the opposite direction, and the
79
kind parameter is set to cudaMemcpyHostToDevice. A system according to this
description is displayed in listing 8.12.
Listing 8.12: Multi GPU Synchronous Copy via Host.
1 whi le ( t <t_max ) {
2 t += d t ;
3
4 / * **************************** * /
5 / * ** Compute D i f f u s i o n on H ** * /
6 / * **************************** * /
7 f o r ( d =0; d<num_gpus ; d ++) {
8 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
9 dif fus ion_Solve_H_GPU <<<bl , t r >>>( h_dev [ d ] , s_dev [ d ] , h_new_dev [ d ] , \
10 a l p h a _ d e v [ d ] , b e t a _ d e v [ d ] , p i t c h _ b y t , dx , dy , dt , \
11 c_s , c_m , M2, N2_dev ) ;
12 }
13 / * Update Boundary Cond i t i o n s on H * /
14 f o r ( d =0; d<num_gpus ; d ++) {
15 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
16 updateBC_Neuman_Homogeneous_H_GPU <<<bl , t r >>>( h_new_dev [ d ] , \
17 p i t c h _ b y t , dx , dy , M2, N2_dev ) ;
18 }
19 / * Th i s upda t e i s on l y where we have t h e i n f l u x * /
20 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ 0 ] ) ) ;
21 updateBC_Neuman_Inhomogeneous_H_GPU <<<bl , t r >>>( h_dev [ 0 ] , h_new_dev [ 0 ] , \
22 a l p h a _ d e v [ 0 ] , b e t a _ d e v [ 0 ] , p i t c h _ b y t , \
23 i n f l u x _ s t a r t , i n f l u x _ s t o p , f_s , f_m , dy ) ;
24
25 / * ** S e c t i o n copy ing da ta from d e v i c e t o h o s t − swapping p o i n t e r s on ho s t
− copy from ho s t t o d e v i c e ** * /
26 / * Copy Border−da ta from Dev ice t o Host * /
27 f o r ( d =1; d<num_gpus ; d ++) {
28 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
29 HANDLE_ERROR( cudaMemcpy ( h _ f i r s t _ r o w _ s e n d _ h o s t [ d ] , \
30 h_new_dev [ d ]+M2+1 , M2 , \
31 cudaMemcpyDeviceToHost ) ) ;
32 }
33 f o r ( d =0; d<num_gpus−1; d ++) {
34 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
35 HANDLE_ERROR( cudaMemcpy ( h _ l a s t _ r o w _ s e n d _ h o s t [ d ] , \
36 h_new_dev [ d ] + ( N_dev*M2) +1 , M2 , \
37 cudaMemcpyDeviceToHost ) ) ;
38 }
39
40 / * Swap boundary p o i n t e r s on ho s t * /
41 f o r ( d =0; d<num_gpus−1; d ++) {
42 h _ l a s t _ r o w _ r e c v _ h o s t [ d ] = h _ f i r s t _ r o w _ s e n d _ h o s t [ d + 1 ] ;
43 h _ f i r s t _ r o w _ r e c v _ h o s t [ d +1]= h _ l a s t _ r o w _ s e n d _ h o s t [ d ] ;
44 }
45
46 / * Copy Border−Arrays from ho s t t o ma t r i x on d e v i c e * /
47 f o r ( d =0; d<num_gpus−1; d ++) {
48 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
49 HANDLE_ERROR( cudaMemcpy ( h_new_dev [ d ] + ( ( N_dev +1) *M2+1) , \
50 h _ l a s t _ r o w _ r e c v _ h o s t [ d ] , M2 , \
51 cudaMemcpyHostToDevice ) ) ;
52 }
53 f o r ( d =1; d<num_gpus ; d ++) {
54 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
80
55 HANDLE_ERROR( cudaMemcpy ( h_new_dev [ d ] , \
56 h _ f i r s t _ r o w _ r e c v _ h o s t [ d ] , M2 , \
57 cudaMemcpyHostToDevice ) ) ;
58 }
59
60 / * ***************************** * /
61 / * ** Compute Convec t i on on S ** * /
62 / * ***************************** * /
63 f o r ( d =0; d<num_gpus ; d ++) {
64 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
65 convect ion_Solve_S_GPU <<<bl , t r >>>( h_dev [ d ] , s_dev [ d ] , \
66 h_new_dev [ d ] , s_new_dev [ d ] , a l p h a _ d e v [ d ] , p i t c h _ b y t , \
67 dx , dy , dt , c_s , c_m , A, M2, N2_dev ) ;
68 }
69
70 / * Update Boundary Cond i t i o n s on S * /
71 f o r ( d =0; d<num_gpus ; d ++) {
72 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
73 updateBC_Neuman_Homogeneous_S_GPU <<<bl , t r >>>( s_new_dev [ d ] , \
74 p i t c h _ b y t , dx , dy , M2, N2_dev ) ;
75 }
76 / * Th i s upda t e i s on l y where we have t h e i n f l u x * /
77 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ 0 ] ) ) ;
78 updateBC_Dir ichle t_S_GPU <<<bl , t r >>>( s_new_dev [ 0 ] , a l p h a _ d e v [ 0 ] , \
79 b e t a _ d e v [ 0 ] , p i t c h _ b y t , i n f l u x _ s t a r t , i n f l u x _ s t o p , f_s , f_m , dy ) ;
80
81 / * ** S e c t i o n copy ing da ta from d e v i c e t o h o s t − swapping p o i n t e r s on ho s t
− copy from ho s t t o d e v i c e ** * /
82 / * Copy da ta Border−Arrays from d e v i c e t o h o s t f o r S * /
83 f o r ( d =1; d<num_gpus ; d ++) {
84 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
85 HANDLE_ERROR( cudaMemcpy ( s _ f i r s t _ r o w _ s e n d _ h o s t [ d ] , \
86 s_new_dev [ d ]+M2+1 , M2 , \
87 cudaMemcpyDeviceToHost ) ) ;
88 }
89 f o r ( d =0; d<num_gpus−1; d ++) {
90 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
91 HANDLE_ERROR( cudaMemcpy ( s _ l a s t _ r o w _ s e n d _ h o s t [ d ] , \
92 s_new_dev [ d ] + ( N_dev*M2) +1 , M2 , \
93 cudaMemcpyDeviceToHost ) ) ;
94 }
95
96 / * Swap boundary p o i n t e r s on ho s t * /
97 f o r ( d =0; d<num_gpus−1; d ++) {
98 s _ l a s t _ r o w _ r e c v _ h o s t [ d ] = s _ f i r s t _ r o w _ s e n d _ h o s t [ d + 1 ] ;
99 s _ f i r s t _ r o w _ r e c v _ h o s t [ d +1]= s _ l a s t _ r o w _ s e n d _ h o s t [ d ] ;
100 }
101
102 / * Copy from Border−Arrays t o from ho s t t o ma t r i x on d e v i c e f o r S * /
103 f o r ( d =0; d<num_gpus−1; d ++) {
104 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
105 HANDLE_ERROR( cudaMemcpy ( s_new_dev [ d ] + ( ( N_dev +1) *M2+1) , \
106 s _ l a s t _ r o w _ r e c v _ h o s t [ d ] , M2 , \
107 cudaMemcpyHostToDevice ) ) ;
108 }
109 f o r ( d =1; d<num_gpus ; d ++) {
110 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
111 HANDLE_ERROR( cudaMemcpy ( s_new_dev [ d ] , \
112 s _ f i r s t _ r o w _ r e c v _ h o s t [ d ] , M2,
113 cudaMemcpyHostToDevice ) ) ;
114 / / HANDLE_ERROR( cudaMemcpy ( s_new_dev [ d ]+(M2+1) ,




117 / * Swap array−p o i n t e r s b e f o r e t h e n e x t t i m e s t e p * /
118 f o r ( d =0; d<num_gpus ; d ++) {
119 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
120 h_tmp_dev [ d ] = h_dev [ d ] ;
121 h_dev [ d ] = h_new_dev [ d ] ;
122 h_new_dev [ d ] = h_tmp_dev [ d ] ;
123 }
124 f o r ( d =0; d<num_gpus ; d ++) {
125 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
126 s_tmp_dev [ d ] = s_dev [ d ] ;
127 s_dev [ d ] = s_new_dev [ d ] ;




This implementation, is also coupling multiple GPU’s using synchronous com-
putations and communications, but the communications between GPUs are per-
formed using P2P over the PCI Express bus. For this the memory has been
pinned, meaning that the host memory has been allocated through the function
cudaHostAlloc() as shown in code listing 8.11.
Since the ghost- and boundary-data of h and s need to be communicated be-
tween the GPUs, the transfer of the data have been deployed by copying the
boundary data from the padded 2D-matrix on the device into an unpadded array
also on the same device. How padded data are accessed and copied into unpadded
data, and vice versa, can be viewed in the code listing 8.13.
Listing 8.13: Copy Boundary-Points to the Neighbors Ghost-Points.
1 / * ** Copy boundary−da ta from data−domain t o t h e t r a n s f e r −ar ray ** * /
2 _ _ g l o b a l _ _ void c o p y M a t r i x D a t a T o T r a n s f e r A r r a y s ( double *h_new , double * h_row_send ,
i n t M2, i n t row_no , s i z e _ t d _ p i t c h ) {
3 i n t i = b l o c k I d x . x * blockDim . x + t h r e a d I d x . x ;
4 i n t j = b l o c k I d x . y * blockDim . y + t h r e a d I d x . y ;
5
6 i f ( j ==row_no && i >=0 && i <M2) {
7 double * h = ( double *) ( ( char *) h_new + j * d _ p i t c h ) ;




12 / * ** Copy ghos t−da ta from t h e t r a n s f e r −ar ray t h e data−domain ** * /
13 _ _ g l o b a l _ _ void c o p y T r a n s f e r A r r a y s T o M a t r i x D a t a ( double *h_new , double * h_row_recv ,
i n t M2, i n t row_no , s i z e _ t d _ p i t c h ) {
14 i n t i = b l o c k I d x . x * blockDim . x + t h r e a d I d x . x ;
15 i n t j = b l o c k I d x . y * blockDim . y + t h r e a d I d x . y ;
16
17 i f ( j ==row_no && i >=0 && i <M2) {
18 double * h = ( double *) ( ( char *) h_new + j * d _ p i t c h ) ;
82
19 h [ i ] = h_row_recv [ i ] ;
20 }
21 }
After the copying the padded data to the unpadded arrays, transfer of arrays
between the devices is performed utilizing the synchronous P2P copy function
cudaMemcpyPeer(). When the neighboring device has received the array, it is
copied into the 2D padded matrix on this new device. For this functionality, the
sections under diffusion and convection regarding copy and transfer of bound-
ary/ghost data as seen in code listing 8.12, can be changed with the code block as
revealed in listing 8.14. This code handles h, but the same is applied on s.
Listing 8.14: Kernel Calls to Copy Data and Synchronous Transfer of Data P2P.
1 / * Copy from 2D Padded Mat r i x Data t o
2 * 1D Unpadded Arrays b e f o r e P2P t r a n s f e r f o r H * /
3
4 f o r ( d =1; d<num_gpus ; d ++) { / / copy second f i r s t row i n t h e lower g r i d
5 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
6 c o p y M a t r i x D a t a T o T r a n s f e r A r r a y s <<<bl , t r >>>( h_new_dev [ d ] ,
7 h _ f i r s t _ r o w _ s e n d [ d ] , M2, 1 , p i t c h _ b y t ) ;
8 }
9 f o r ( d =0; d<num_gpus−1; d ++) { / / copy second l a s t row i n upper g r i d
10 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
11 c o p y M a t r i x D a t a T o T r a n s f e r A r r a y s <<<bl , t r >>>( h_new_dev [ d ] ,
12 h _ l a s t _ r o w _ s e n d [ d ] , M2, N_dev , p i t c h _ b y t ) ;
13 }
14
15 / * Peer−To−Peer−copy o f Border−Arrays be tween d e v i c e s f o r H * /
16 f o r ( d =0; d<num_gpus−1; d ++) { / / copy from g r i d 1 t o 3 , t o g r i d 0 t o 2
17 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d + 1 ] ) ) ;
18 HANDLE_ERROR( cudaMemcpyPeer ( h _ f i r s t _ r o w _ r e c v [ d + 1] , d e v i c e [ d + 1] ,
19 h _ l a s t _ r o w _ s e n d [ d ] , d e v i c e [ d ] , p i t c h _ b y t ) ) ;
20 }
21 f o r ( d =0; d<num_gpus−1; d ++) { / / copy from g r i d 0 t o 2 , t o g r i d 1 t o 3
22 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
23 HANDLE_ERROR( cudaMemcpyPeer ( h _ l a s t _ r o w _ r e c v [ d ] , d e v i c e [ d ] ,
24 h _ f i r s t _ r o w _ s e n d [ d +1 ] , d e v i c e [ d +1 ] ,
p i t c h _ b y t ) ) ;
25 }
26
27 / * Copy from 1D Unpadded Arrays t o
28 * 2D Padded Mat r i x Data a f t e r P2P t r a n s f e r f o r H * /
29 f o r ( d =0; d<num_gpus−1; d ++) { / / copy i n t o l a s t row o f lower g r i d
30 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
31 c o p y T r a n s f e r A r r a y s T o M a t r i x D a t a <<<bl , t r >>>( h_new_dev [ d ] ,
32 h _ l a s t _ r o w _ r e c v [ d ] , M2, N_dev +1 , p i t c h _ b y t ) ;
33 }
34 f o r ( d =1; d<num_gpus ; d ++) { / / copy i n t o f i r s t row i n upper g r i d
35 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
36 c o p y T r a n s f e r A r r a y s T o M a t r i x D a t a <<<bl , t r >>>( h_new_dev [ d ] ,




What differ this program from implementation (2), is that this one performs com-
munication and computations asynchronously. Memory has been pinned and com-
munication is performed via host. For asynchronous computations and commu-
nications between GPUs, CUDA Streams are applied for concurrency. A stream
operates a sequence of commands and can only operate in one given context as on
one given GPU. Although one GPU may operate multiple streams. In this appli-
cation 2 streams has been created on each device as displayed in code listing 8.15.
The code also shows the creation of CUDA events that will be utilized in a later
application.
Listing 8.15: Creating 2 Streams and 2 Events on Each Device.
1 / * Crea t e CUDA Streams and Even t s f o r each GPU * /
2 i n t nSt reams = 2 , nEven t s = 2 ;
3
4 c u d a S t r e a m _ t s t r e a m [ num_gpus ] [ nS t reams ] ;
5 c u d a E v e n t _ t e v e n t [ num_gpus ] [ nEven t s ] ;
6
7 f o r ( d =0; d<num_gpus ; d ++) {
8 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
9 HANDLE_ERROR( cudaStreamCreate (& s t r e a m [ d ] [ 0 ] ) ) ;
10 HANDLE_ERROR( cudaStreamCreate (& s t r e a m [ d ] [ 1 ] ) ) ;
11 HANDLE_ERROR( cudaEventCreate (& e v e n t [ d ] [ 0 ] ) ) ;
12 HANDLE_ERROR( cudaEventCreate (& e v e n t [ d ] [ 1 ] ) ) ;
13 }
As CUDA kernels are called or a calls to CUDA memory operations are car-
ried out, the kernels and the memory functions have to take a given stream as
parameter. This is displayed in listing 8.16.
Listing 8.16: Loop With 2 Streams Yielding Asynchrone Behavior.
1 whi le ( t <t_max ) {
2 t += d t ;
3
4 / * **************************** * /
5 / * ** Compute D i f f u s i o n on H ** * /
6 / * **************************** * /
7 f o r ( d =0; d<num_gpus ; d ++) {
8 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
9 dif fus ion_Solve_H_GPU <<<bl , t r , 0 , s t r e a m [ d ][0] > > >( h_dev [ d ] , s_dev [ d ] , \
10 h_new_dev [ d ] , a l p h a _ d e v [ d ] , b e t a _ d e v [ d ] , \
11 p i t c h _ b y t , dx , dy , d t , c_s , c_m , M2, N2_dev ) ;
12 }
13 / * Update Boundary Cond i t i o n s on H * /
14 f o r ( d =0; d<num_gpus ; d ++) {
15 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
16 updateBC_Neuman_Homogeneous_H_GPU <<<bl , t r , 0 , s t r e a m [ d ] [0 ] > > >( \
84
17 h_new_dev [ d ] , p i t c h _ b y t , dx , dy , M2, N2_dev ) ;
18 }
19 / * Th i s upda t e i s on l y where we have t h e i n f l u x on t h e " g l o b a l g r i d ’ s "
row 0 , on H * /
20 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ 0 ] ) ) ;
21 updateBC_Neuman_Inhomogeneous_H_GPU <<<bl , t r , 0 , s t r e a m [ 0 ] [ 0 ] > > > ( \
22 h_dev [ 0 ] , h_new_dev [ 0 ] , a l p h a _ d e v [ 0 ] , b e t a _ d e v [ 0 ] , \
23 p i t c h _ b y t , i n f l u x _ s t a r t , i n f l u x _ s t o p , f_s , f_m , dy ) ;
24
25 / * Copy Tran s f e r−da ta from Dev ice t o Host * /
26 f o r ( d =1; d<num_gpus ; d ++) {
27 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
28 HANDLE_ERROR( cudaMemcpyAsync ( h _ f i r s t _ r o w _ s e n d _ h o s t [ d ] , \
29 h_new_dev [ d ]+M2+1 , M2 , \
30 cudaMemcpyDeviceToHost , \
31 s t r e a m [ d ] [ 0 ] ) ) ;
32 }
33 f o r ( d =0; d<num_gpus−1; d ++) {
34 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
35 HANDLE_ERROR( cudaMemcpyAsync ( h _ l a s t _ r o w _ s e n d _ h o s t [ d ] , \
36 h_new_dev [ d ] + ( N_dev*M2) +1 , M2 , \
37 cudaMemcpyDeviceToHost , \
38 s t r e a m [ d ] [ 1 ] ) ) ;
39 }
40
41 / * Swap boundary p o i n t e r s on ho s t * /
42 f o r ( d =0; d<num_gpus−1; d ++) {
43 h _ l a s t _ r o w _ r e c v _ h o s t [ d ] = h _ f i r s t _ r o w _ s e n d _ h o s t [ d + 1 ] ;
44 h _ f i r s t _ r o w _ r e c v _ h o s t [ d +1]= h _ l a s t _ r o w _ s e n d _ h o s t [ d ] ;
45 }
46
47 / * Copy from Border−Arrays t o Mat r i x Data a f t e r P2P t r a n s f e r f o r H * /
48 f o r ( d =0; d<num_gpus−1; d ++) {
49 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
50 HANDLE_ERROR( cudaMemcpyAsync ( h_new_dev [ d ] + ( ( N_dev +1) *M2+1) , \
51 h _ l a s t _ r o w _ r e c v _ h o s t [ d ] , M2 , \
52 cudaMemcpyHostToDevice , \
53 s t r e a m [ d ] [ 0 ] ) ) ;
54 }
55 f o r ( d =1; d<num_gpus ; d ++) {
56 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
57 HANDLE_ERROR( cudaMemcpyAsync ( h_new_dev [ d ] + (M2+1) , \
58 h _ f i r s t _ r o w _ r e c v _ h o s t [ d ] , M2 , \
59 cudaMemcpyHostToDevice , \
60 s t r e a m [ d ] [ 1 ] ) ) ;
61 }
62
63 / * ***************************** * /
64 / * ** Compute Convec t i on on S ** * /
65 / * ***************************** * /
66 / / same p a t t e r n i s made f o r c o n v e c t i o n
67 }
8.4.4.6 Implementation 5.
This implementation is similar to implementation (4), but it performs asynchronous
computations and P2P communications between GPUs over the PCI Express bus.
The concurrency is deployed with CUDA Streams and synchronized with CUDA
85
Events. The creation of 2 events per device can be viewed in listing 8.15.
The P2P communication is set up as earlier, but the code is asynchronous, and
therefore employs the function cudaMemcpyPeerAsync(), whitch also takes a ded-
icated stream as an argument. This application is also handling synchronization
with the function cudaStreamWaitEvent() witch is waiting until all calls before
the setting of an event-record in a given stream has finished executing and re-
ports completion. It takes both a stream and an event as arguments. The event
record is set with the function cudaEventRecord(), and takes an event as an ar-
gument. The function cudaStreamWaitEvent() can wait for events in streams on
other devices.This cross device synchronization is efficient as it is not stalling the
host it can enhance performance compared to other barriers. Example on use of
asynchronous P2P copy, and kernel calls waiting on events is exhibited in listing
8.17.
Listing 8.17: Creating 2 Streams and 2 Events on Each Device.
1 f o r ( d =0; d<num_gpus ; d ++) {
2 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
3 dif fus ion_Solve_H_GPU <<<bl , t r , 0 , s t r e a m [ d ][0] > > >( h_dev [ d ] , s_dev [ d ] ,
h_new_dev [ d ] , a l p h a _ d e v [ d ] , b e t a _ d e v [ d ] , p i t c h _ b y t , dx , dy , dt , c_s ,
c_m , M2, N2_dev ) ;
4 HANDLE_ERROR( cudaEven tRecord ( e v e n t [ d ] [ 0 ] ) ) ;
5 }
6
7 f o r ( d =0; d<num_gpus−1; d ++) {
8 HANDLE_ERROR( cudaSetDevice ( d e v i c e [ d ] ) ) ;
9 HANDLE_ERROR( cudaS t reamWai tEven t ( s t r e a m [ d + 1 ] [ 0 ] , e v e n t [ d + 1 ] [ 0 ] , 0 ) ) ;
10 HANDLE_ERROR( cudaMemcpyPeerAsync ( h _ l a s t _ r o w _ r e c v [ d ] , d e v i c e [ d ] ,
h _ f i r s t _ r o w _ s e n d [ d +1 ] , d e v i c e [ d +1 ] , p i t c h _ b y t , s t r e a m [ d ] [ 0 ] ) ) ;
11 / / HANDLE_ERROR( cudaEven tRecord ( e v e n t [ d ] [ 1 ] ) ) ;
12 }
8.4.4.7 Implementation 6.
This application, the last application described in 8.2, is coupling multiple GPU’s
using asynchronous computations and P2P communications between GPUs over
the PCI Express bus. It applies CUDA Streams for concurrency.In addition it cre-
ates a set of OMP-threads for each GPU, and let one OMP-thread on the CPU
launch one CUDA-kernel, and an other OMP-thread handle asynchronous com-
munication. This is applied according to [ Sourouri at al’s ] suggestions with 4
CUDA Streams and 2 OpenMP threads per device. With these features, obser-
vations of additional speedup can be expected and a reduction of kernel-launch
overhead. Creation of OpenMP threads in presented in the listing 8.18.
Listing 8.18: Create Open MP Threads.
86
1 / * ** S e t t h e number o f OpenMP t h r e a d s ** * /
2 unsigned i n t num_OMP_threads = num_gpus * 2 ;
3 omp_set_num_threads ( num_OMP_threads ) ;
4
5 #pragma omp p a r a l l e l
6 {
7 unsigned i n t nThreads = omp_get_num_threads ( ) ;
8 }
Other details in the application are unveiled in an extensive example listed in
appendix.
8.4.4.8 CUDA Kernels
The CUDA kernels for diffusion and convection are as follows in listings 8.19 and
8.20.
Listing 8.19: CUDA Kernel for Diffusion.
1 / * ** So l v e t h e D i f f u s i o n term ( h ) ** * /
2 _ _ g l o b a l _ _ void di f fus ion_Solve_H_GPU ( \
3 double *d_h , double * d_s , \
4 double *d_h_new , double * d_a lpha , \
5 double * d_be ta , s i z e _ t d _ p i t c h , \
6 double dx , double dy , double dt , \
7 double c_s , double c_m , i n t M, i n t N) {
8
9 i n t i = blockDim . x * b l o c k I d x . x + t h r e a d I d x . x ;
10 i n t j = blockDim . y * b l o c k I d x . y + t h r e a d I d x . y ;
11
12 / / p r i n t f ( " c_s=%.4 f \ t dy=%.4 f \ n " , c_s , dy ) ;
13
14 / * Ca l c u l a t e t h e va l u e f o r each i n n e r g r i d p o i n t * /
15 i f ( i >0 && i <M−1 && j >0 && j <N−1) {
16 / / i f ( i <M && j<N) {
17
18 / * P o i n t e r s t o rows i n t h e t h e d i f f e r e n t p i t c h e d a r r a y s * /
19 double * h = ( double *) ( ( char *) d_h + j * d _ p i t c h ) ;
20 double * s = ( double *) ( ( char *) d_s + j * d _ p i t c h ) ;
21 double * h_new = ( double *) ( ( char *) d_h_new + j * d _ p i t c h ) ;
22 double * a l p h a = ( double *) ( ( char *) d _ a l p h a + j * d _ p i t c h ) ;
23 double * b e t a = ( double *) ( ( char *) d _ b e t a + j * d _ p i t c h ) ;
24 i n t s t r i d e = d _ p i t c h / s i z e o f ( double ) ;
25 / / p r i n t f ( " In k e r n e l : h=%.4 f \ n " , h [ 3050 ] ) ;
26
27 double dx2_R = 1 / ( dx*dx ) , dy2_R = 1 / ( dy*dy ) ;
28 double a s _ c e n t e r = a l p h a [ i ]* s [ i ] , b s _ c e n t e r = b e t a [ i ]*(1− s [ i ] ) ;
29
30 h_new [ i ] = ( ( d t / c_s ) * ( \
31 ( ( ( ( ( a s _ c e n t e r + a l p h a [ i +1]* s [ i + 1 ] ) * . 5 ) * ( h [ i +1]−h [ i ] ) )−\
32 ( ( ( a l p h a [ i −1]* s [ i −1]+ a s _ c e n t e r ) * . 5 ) * ( h [ i ]−h [ i −1]) ) ) *dx2_R ) + \
33 ( ( ( ( ( a s _ c e n t e r + a l p h a [ i + s t r i d e ]* s [ i + s t r i d e ] ) * . 5 ) * ( h [ i + s t r i d e ]−h [ i ] ) )−\
34 ( ( ( a l p h a [ i−s t r i d e ]* s [ i−s t r i d e ]+ a s _ c e n t e r ) * . 5 ) * ( h [ i ]−h [ i−s t r i d e ] ) ) ) *dy2_R )
) ) + \
35 ( ( d t / c_m ) * ( \
36 ( ( ( ( ( b s _ c e n t e r + b e t a [ i +1]*(1− s [ i + 1 ] ) ) * . 5 ) * ( h [ i +1]−h [ i ] ) )−\
37 ( ( ( b e t a [ i −1]*(1− s [ i −1])+ b s _ c e n t e r ) * . 5 ) * ( h [ i ]−h [ i −1]) ) ) *dx2_R ) + \
87
38 ( ( ( ( ( b s _ c e n t e r + b e t a [ i + s t r i d e ]*(1− s [ i + s t r i d e ] ) ) * . 5 ) * ( h [ i + s t r i d e ]−h [ i ] ) )− \
39 ( ( ( b e t a [ i−s t r i d e ]*(1− s [ i−s t r i d e ] ) + b s _ c e n t e r ) * . 5 ) * ( h [ i ]−h [ i−s t r i d e ] ) ) ) *dy2_R ) ) )
+ \
40 h [ i ] ;
41 }
42 }
Listing 8.20: CUDA Kernel for Convection.
1 / * ** So l v e t h e Convec t i on term ( s ) ** * /
2 _ _ g l o b a l _ _ void convect ion_Solve_S_GPU (
3 double *d_h , double * d_s , \
4 double *d_h_new , double *d_s_new , \
5 double * d_a lpha , s i z e _ t d _ p i t c h , \
6 double dx , double dy , double dt , \
7 double c_s , double c_m , \
8 i n t A, i n t d_M , i n t d_N ) {
9
10 i n t i = b l o c k I d x . x * blockDim . x + t h r e a d I d x . x ;
11 i n t j = b l o c k I d x . y * blockDim . y + t h r e a d I d x . y ;
12
13 i f ( i >0 && i <d_M−1 && j >0 && j <d_N−1) {
14
15 / * P o i n t e r s t o rows i n t h e t h e d i f f e r e n t p i t c h e d a r r a y s * /
16 double * h = ( double *) ( ( char *) d_h + j * d _ p i t c h ) ;
17 double * s = ( double *) ( ( char *) d_s + j * d _ p i t c h ) ;
18 double * h_new = ( double *) ( ( char *) d_h_new + j * d _ p i t c h ) ;
19 double * s_new = ( double *) ( ( char *) d_s_new + j * d _ p i t c h ) ;
20 double * a l p h a = ( double *) ( ( char *) d _ a l p h a + j * d _ p i t c h ) ;
21
22 double s_x , s_y , h_x , h_y , h_xx , h_yy ;
23 double a s _ c e n t e r = a l p h a [ i ]* s [ i ] ; / / g1
24 double dx_R =1/ dx , dy_R = 1 / dy ;
25 double dx2_R = 1 / ( dx*dx ) , dy2_R = 1 / ( dy*dy ) ;
26 i n t s t r i d e = d _ p i t c h / s i z e o f ( double ) ;
27
28 / * Ca l c u l a t e t h e va l u e f o r each i n n e r g r i d p o i n t * /
29 i f ( ( h_y = ( h_new [ i + s t r i d e ] − h_new [ i−s t r i d e ] ) * 0 . 5 * dy_R ) >0)
30 s_y = ( a l p h a [ i + s t r i d e ]* s [ i + s t r i d e ] − a s _ c e n t e r ) * dy_R ;
31 e l s e
32 s_y = ( a s _ c e n t e r − a l p h a [ i−s t r i d e ]* s [ i−s t r i d e ] ) * dy_R ;
33
34 i f ( ( h_x = ( h_new [ i +1] − h_new [ i −1]) * 0 . 5 * dx_R ) > 0 )
35 s_x = ( a l p h a [ i +1]* s [ i +1] − a s _ c e n t e r ) * dx_R ;
36 e l s e
37 s_x = ( a s _ c e n t e r − a l p h a [ i −1]* s [ i −1]) * dx_R ;
38
39 h_xx = ( h_new [ i +1] − 2*h_new [ i ] + h_new [ i −1]) * dx2_R ;
40 h_yy = ( h_new [ i + s t r i d e ] − 2*h_new [ i ] + h_new [ i−s t r i d e ] ) * dy2_R ;
41
42 s_new [ i ] = ( ( d t / c_s ) * ( s_x * h_x + s_y * h_y + a s _ c e n t e r \







Implementing a parallel algorithm can be difficult, and in order to find out if it is
worth the trouble to implement it, some tools are used help to predict the enhance-
ment in performance of a parallel system. How much faster is a parallel algorithm
compared to a sequential algorithm?
To answer this question some formulas for speedup and efficiency will be investi-
gated.
9.1 Speedup
Viewed in the light of parallel computing, speedup is the ratio between the se-
quential execution time and the parallel and hopefully improved execution time.




When the number of parallel processors is denoted n, the number of serial proces-










Efficiency is defined by speedup divided by the number of processors used:
E f f iciency=
Sequential execution time












Amdahl’s law raises some issues regarding parallel computations, It ignores over-
head related to communication time!
9.4 Gustafson’s law
Both Gustafson’-Barsis’s law and Amdahl’s law ignores the term for parallel over-
head: κ(n, p), and can by that overestimate the speedup achieve by a parallel
system.
9.5 The Karp-Flatt Metric
Takes into consideration the overhead term.
T (n, p) = σ(n)+φ(n)/p+κ(n, p)
9.6 The Isoefficiency Metric
Scalability is a measure of the ability to increase performance as the number of




This chapter concludes this thesis, presenting the results from the investigation
and contribution of this research.
10.1 Summary
High Performance Computing has been the he main focus in this work. The aim
has been to enhance the performance of simulations by having the computations
performed in parallel on GPUs with the CUDA architecture rather than serially
on a single CPU core. The model of choice for the conduction of the tests has
been a sediment transport model. The PDEs representing the model have been
discretized into to a fully explicit finite difference scheme and implemented ac-
cordingly. The sediment transport model has been implemented with ANSI-C
programming language for the serial implementation and CUDA C for the paral-
lel implementation.
Various parallel features have been applied and tested in the conduction of
this work. Simulations have been made with numerous optimization techniques
for enhanced speedup on the parallel architecture. The tests have been performed
on 2 different computers, both holding multiple NVIDIA GPUs with the CUDA
architecture. The two computers have GPUs with different specifications and dif-
ferent numbers of available GPUs as shown in table 10.1.
GPUs GPU Type Architecture
4 GeForce GTX 590 Fermi
2 Tesla K20m Kepler
Table 10.1: Available GPUs for Testing
91
Specification of Implementations
Both the serial and parallel implementations benefit from a few optimizing strate-
gies like coalesced memory and some intermediate variables to reduce the number
of operations. The "Generic Code" has been for written for an arbitrary number
of GPUs. It can run on all the GPUs the system has available, and decomposes
the data-domain accordingly.
The different implementations and their characteristics are listed in table 10.2.
The communication between devices for the multi-GPU applications has been
tested for both synchronous and asynchronous communication, as is shown under
the column named "Comm", and for asynchronous communication, additional
CUDA streams have been deployed. Communication between devices can be per-
formed P2P over the PCIe bus or via host, implied with a "Yes" or a "No" in the
table. Also the right number of streams are launched according to the number of
devices the simulation is set for.
92
Application #GPUs Comm P2P #Streams
1 CPU N/A N/A N/A N/A
1 GPU 1 N/A N/A 1
Hardcoded for 2 GPUs 2 Sync Yes 1
Generic Code Multi Sync Yes 1
Generic Code Multi Sync No 1
Generic Code Multi Async No 2
Table 10.2: Implementation Specifications
93
10.2 Results
Diffusion-based evolution of Lake Okeechobee has been simulated for 1000 years
and the solutions can be viewed in the figures 10.1, 10.2.
Figure 10.1: Solution of h after 1000 years
Figure 10.2: Solution of s after 1000 years
94
All the simulations have been tested with data of double precision. Results
from the different tests that were carried out in this investigation are unveiled in
table 10.3 for GeForce GTX 590 and table 10.4 for Tesla K20m. The results are
for a 10 year simulation.
All the implementations have been tested on a data domain with a grid size of
1000x1000. Tests on different multi-GPU implementations have been carried out
for both 2 and 4 GPUs on GeForce GTX 590 where 4 GPUs are applicable.
95
# GPUs Domain Size Time Speedup Application
0 1000x1000 40.55 s N/A Serial code
1 1000x1000 2.125 s 19.08 Parallel code, 1STREAM
2 1000x1000 1.222 s 33.18 Hardcoded for 2 GPUs, SYNC, P2P, 1STREAM
2 1000x1000 1.244 s 32.60 Generic code for Multi-GPU, SYNC, P2P, 1STREAM
4 1000x1000 0.920 s 44.08 Generic code for Multi-GPU, SYNC, P2P, 1STREAM
2 1000x1000 1.509 s 26.99 Generic code for Multi-GPU, SYNC, copyViaHOST, 1STREAM
4 1000x1000 1.221 s 33.36 Generic code for Multi-GPU, SYNC, copyViaHOST, 1STREAM
2 1000x1000 1.350 s 30.17 Generic code for Multi-GPU, ASYNC, copyViaHOST, 2STREAMS
4 1000x1000 0.696 s 58.52 Generic code for Multi-GPU, ASYNC, copyViaHOST, 2STREAMS
Table 10.3: Speedup on GeForce GTX 590 for the maximum of 4 GPUs
96
# GPUs Domain Size Time Speedup Application
0 1000x1000 38.01 s N/A Serial code
1 1000x1000 1.156 s 32.88 Parallel code, 1STREAM
2 1000x1000 0.763 s 49.82 Hardcoded for 2 GPUs, SYNC, P2P, 1STREAM
2 1000x1000 0.812 s 46.81 Generic code for Multi-GPU, SYNC, P2P, 1STREAM
2 1000x1000 0.953 s 39.88 Generic code for Multi-GPU, SYNC, copyViaHOST, 1STREAM
2 1000x1000 0.769 s 49.43 Generic code for Multi-GPU, ASYNC, copyViaHOST, 2STREAMS
Table 10.4: Speedup onTesla K20 for the maximum of 2 GPUs
97
10.2.1 Analysis
The tests have been carried out on 2 different computers, both holding multiple
NVIDIA GPUs with the CUDA architecture.
The optimal size of thread blocks has proven to be 16×4 on Fermi, and 16×8
on Tesla on a grid of size 1000×1000.
Analysis of the results from GeForce GTX 590, Fermi, seen in table 10.3 and
for Tesla K20m, Kepler as seen in table 10.4, show a clear enhancement on using
GPU’s for simulations of the sediment transport model vs. CPU. All the imple-
mentations of so-called generic code for multi-GPUs have been tested for both 2
and 4 GPUs on Fermi.
1 GPU, 1 STREAM:
The speedup from the serial computations on a CPU to the parallel computations
on 1 GPU shows that the speedup is 19.08 on Fermi and 27.2 on Kepler. This
is a remarkable speedup, considering that large simulations can take weeks and
months to carry out.
Multi-GPU hardcoded, SYNC, P2P, 1 STREAM:
For multi-GPU implementations that are synchronous with P2P communication a
hardcoded version has been implemented in addition to the "generic code". The
implementation of a hardcoded parallel version for 2 GPUs runs faster than the
generic code on 2 GPUs with the same features. This is due to the reduction of
programming structures created to handle an arbitrary number of GPUs, and the
tests show that the hardcoded version is more optimized with respect to speedup.
With respect to results from 1 GPU an "ideal scaling" would be if timing the
results for 1 GPU /2 equals the results for 2 GPU. The simulations for 1 GPU
indicate an ideal scaling from 1 to 2 GPUs as:
2.125/2 = 1.0625 on Fermi.
The results for 2 GPUs hardcoded is 1.222, which is a little over the ideal, but
not too bad considering the overhead due to communication between GPUs. This
implementation yields a good speedup on both Fermi and Kepler, with a speedup
of respectively 33.18 and 49.82.
Multi GPU, Generic code, SYNC, P2P, 1 STREAM:
This version has been tested on both 2 and 4 GPUs. On 2 GPUs it yields a slightly
poorer result than the hardcoded version, as expected. With respect to 4 devices
the generic code yields a very acceptable speedup, although it does not scale ide-
98
ally, again due to communication overhead. This version shows a nice speedup
from 2 to 4 GPUs from 33.6 to 44.08. The communication between the GPUs is
synchronous, although P2P, and is launched sequentially by the CPU so a kernel
launch overhead is to be expected in this implementation. The more GPUs in-
volved, the more kernel launches and memory copies are issued, since the set of
instructions is launched on every GPU.
Multi GPU, Generic code, SYNC, copy via HOST, 1STREAM:
This implementation is almost the same as the one above, but this copies data be-
tween devices via host, and not P2P. Without the benefit of the PCIe transfer of
the data between devices, it shows in the table 10.3 a reduced speed on both 2 and
4 GPUs, with respect to the P2P implementation, as was expected.
Multi GPU, Generic code, ASYNC, copy via HOST, 2 STREAMS:
This implementation is almost the same as the one above, but it deploys asyn-
chronous kernel executions and memory copies. The results clearly show that
asynchronous execution yields very good speedup compared to synchronous. The
asynchronous execution is performed with 2 CUDA Streams to yield the asyn-
chronous behavior. The simulations on Fermi yields a speedup of 58.52 on 4
GPUs, and 49.43 on 2 K20m GPUs.
The results for the different implementations show the same tendencies on
Fermi and Kepler, although Kepler is faster on 2 GPUs than Fermi, due the en-
hanced specifications of the GPU.
Contribution
With respect to the Thesis Questions that were stated at the beginning of this in-
vestigation, as can be seen in table 2.1, the results as described above, answers
and enlightens the topics that this thesis aimed to solve and answer. This thesis’
questions are answered in table 10.5. This research has proved that:
99
A 1: A parallel application running on a single GPU
is faster than a serial application running on a CPU core.
A 2: A parallel application running on multiple GPU’s
is faster than a parallel application running on a single GPU.
A 3: The application does enhance performance utterly if additional
CUDA Streams are utilized.
A 3: A parallel application running on a single GPU yields the exact
same results as the serial application running on a CPU core,
hence the same accuracy.
A 4: A parallel application running on multiple GPU’s does yield results
with the same accuracy as the application running on a single GPU.
Table 10.5: Answers to the Thesis Questions
Looking at the overall results, with a speedup of 58.52 on Fermi’s 4 GPUs, and
49.43 on Kepler’s 2, it is clear that taking the effort of implementing applications
like this sediment-transport model on a parallel architecture is worth while. With
respect to the accuracy of the results, the simulations on GPUs give the exact
same result as the CPU, also for double precision data, as was deployed in this
research. Others, [Tutkun and Edis 2012] and [Michéa and Komatitsch 2010],
have found that parallel applications of similar kinds yield a speedup between 9
and 60 for GPUs of different kinds and also in combination with Message Passing
Interface (MPI). Although, in this matter, it is difficult to compare these results
with others, it is not unlikely to think that the results presented in this thesis are
quite satisfying.
10.3 Future Work
As for future work different optimizations are possible to carry out.
One is to expand the implementations to take more CUDA streams for utterly
parallelism.
In addition to more streams, OMP threads could be combined. The OMP
threads may reduce kernel launch overhead as suggested by [Sourouri et al. 2014].
The OMP threads could be tested with 2 threads as for one of the already existing
implementations, or for implementations with more streams and P2P communi-
cations.
Another suggestion for future work would be to see if the use texture memory
for constant variables like the matrices holding the coefficients α and β is enhanc-
100
ing speedup.
If it would be possible to have more (high end) GPUs available, it would be




List of Code Examples
6.1 CUDA Kernel Call . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 CUDA Kernel Declaration . . . . . . . . . . . . . . . . . . . . . 43
7.1 Get Device Count with/in CUDA C . . . . . . . . . . . . . . . . 54
8.1 Setting the variables dx, dy and dt . . . . . . . . . . . . . . . . . 65
8.2 Variables for the influx from River Kissimmee . . . . . . . . . . . 67
8.3 Reading files of the format .nc with the NetCDF-library . . . . . . 67
8.4 Pseudo Code of the Main Function Loop . . . . . . . . . . . . . . 68
8.5 1D Coalesced Memory for Fast Access on the CPU. . . . . . . . . 70
8.6 Allocation of Coalesced Memory and Copy between Host and De-
vice. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
8.7 Threads on a Block and Blocks in a Grid. . . . . . . . . . . . . . 73
8.8 Calling the CUDA-Kernels in Time Series on 1 GPU. . . . . . . . 74
8.9 Sets The Device-ID, and the Current Device’s Context. . . . . . . 74
8.10 Allocating Device Memory for the Decomposed Data-Domains
on All the Devices. . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.11 Allocation of Pinned/Page-Locked Memory Space) . . . . . . . . 78
8.12 Multi GPU Synchronous Copy via Host. . . . . . . . . . . . . . . 80
8.13 Copy Boundary-Points to the Neighbors Ghost-Points. . . . . . . 82
8.14 Kernel Calls to Copy Data and Synchronous Transfer of Data P2P. 83
8.15 Creating 2 Streams and 2 Events on Each Device. . . . . . . . . . 84
8.16 Loop With 2 Streams Yielding Asynchrone Behavior. . . . . . . . 84
8.17 Creating 2 Streams and 2 Events on Each Device. . . . . . . . . . 86
8.18 Create Open MP Threads. . . . . . . . . . . . . . . . . . . . . . . 86
8.19 CUDA Kernel for Diffusion. . . . . . . . . . . . . . . . . . . . . 87
8.20 CUDA Kernel for Convection. . . . . . . . . . . . . . . . . . . . 88
103
Bibliography
[1] NVIDIA Corporation. NVIDIA CUDA Toolkit Documentation, v7.0. URL:
https://docs.nvidia.com/cuda/index.html (cit. on pp. 2, 39–44).
[2] Jan C. Rivenæs. “Application of a dual-lithology, depth-dependent diffu-
sion equation in stratigraphic simulation”. In: Basin Research 4.2 (). ISSN:
1365-2117. URL: http://dx.doi.org/10.1111/j.1365-2117.1992.
tb00136.x (cit. on pp. 2, 5).
[3] Jan C. Rivenæs. “A computer simulation model for siliclastic basin stratig-
raphy”. University of Trondheim, 1993 (cit. on pp. 3, 11, 12, 18, 57).
[4] D. E. Comer et al. “Computing As a Discipline”. In: Commun. ACM 32.1
(Jan. 1989). Ed. by Peter J. Denning, pp. 9–23. ISSN: 0001-0782. DOI: 10.
1145/63238.63239. URL: http://doi.acm.org/10.1145/63238.63239
(cit. on p. 11).
[5] U. G. Survey. Gtopo30. In: 1996 (cit. on pp. 13, 17, 20, 67).
[6] S.R. Clark, Wenjie Wei, and Xing Cai. “Numerical Analysis of a Dual-
Sediment Transport Model Applied to Lake Okeechobee, Florida”. In: Par-
allel and Distributed Computing (ISPDC), 2010 Ninth International Sym-
posium on. 2010, pp. 189–194. DOI: 10.1109/ISPDC.2010.29 (cit. on
pp. 17, 19, 20, 65, 67).
[7] T. E. Jordan and P. B. Flemings. “Large-scale stratigraphic architecture,
eustatic variation, and unsteady tectonism: A theoretical evaluation”. In:
Journal of Geophysical Research: Solid Earth 96.B4 (1991), pp. 6681–
6699. ISSN: 2156-2202. DOI: 10.1029/90JB01399. URL: http://dx.doi.
org/10.1029/90JB01399 (cit. on p. 17).
[8] Wenjie Wei et al. “Balancing efficiency and accuracy for sediment transport
simulations”. In: Computational Science Discovery 6.1 (2013), p. 015011.
URL: http://stacks.iop.org/1749-4699/6/i=1/a=015011 (cit. on
pp. 21, 22).
[9] Top500List. Top 500 Supercomputers. 2015. URL: \url{http://www.top5
00.org/},urldate={2015-06-01} (cit. on p. 33).
104
[10] Michael J Quinn. Parallel Programming. Vol. 526. TMH CSE, 2003 (cit.
on pp. 35, 37).
[11] David B Kirk and W Hwu Wen-mei. Programming massively parallel pro-
cessors: a hands-on approach. Newnes, 2012 (cit. on p. 45).
[12] John Nickolls et al. “Scalable parallel programming with CUDA”. In:
Queue 6.2 (2008), pp. 40–53 (cit. on p. 48).
[13] Unidata. Network Common Data Form NetCDF. 2015. URL: http://www.
unidata.ucar.edu/software/netcdf/ (visited on 05/01/2015) (cit. on
p. 67).
[14] Bulent Tutkun and Firat Oguz Edis. “A GPU application for high-order
compact finite difference scheme”. In: Computers & Fluids 55 (2012),
pp. 29–35 (cit. on p. 100).
[15] David Michéa and Dimitri Komatitsch. “Accelerating a three-dimensional
finite-difference wave propagation code using GPU graphics cards”. In:
Geophysical Journal International 182.1 (2010), pp. 389–402 (cit. on
p. 100).
[16] M. Sourouri et al. “Effective multi-GPU communication using multiple
CUDA streams and threads”. In: Parallel and Distributed Systems (IC-
PADS), 2014 20th IEEE International Conference on. 2014, pp. 981–986.
DOI: 10.1109/PADSW.2014.7097919 (cit. on p. 100).
105
106
