Use of multi-GPU systems for large FFTs: with applications in ultrasound simulations by Nandapalan, Nimalan
Use of Multi-GPU Systems for Large
FFTs: With Applications in Ultrasound
Simulations
Nimalan Nandapalan
February 2013
A thesis submitted for the degree of Master of Philosophy
of the Australian National University

To families that stick together through thick and thin

Declaration
The work in this thesis is my own except where otherwise stated.
Nimalan Nandapalan

Publications
N. Nandapalan, J. Jaros, B. Treeby, A. Rendell. Implementation of 3D FFTs
Across Multiple GPUs in Shared Memory Envronments. In 2012 13th Inter-
national Conference on Parallel and Distributed Computing, Applications and
Technologies. IEEE.

Acknowledgements
There are far too many people who deserve acknowledgement for in some way
being involved in the journey that accompanied this thesis. While not everyone
can be mentioned, no one is forgotten.
First, I would like to thank my supervisors, Alistair ‘Chop Chop’ Rendell,
Brad ‘Common, less sleep more work’, and Jiri ‘Cool Bananas!’ Jaros. Without
their guidance nothing would have been possible. Their advice, friendship and
patience have made this candidature a truly memorable experience.
I would also like to thank all of my fellow students - Gaurav, Josh, Beau, the
nameless people who smiled in the corridors and many, many more. The countless
discussions, jokes, distractions and hours spent mindlessly trying to debug each
other’s code will not be forgotten.
Last and most certainly not least my family. Words can not even begin to
acknowledge the effect that they have made. Love, shelter, faith and support pale
compared to the feeling that we will always be together through the thick and
thin of it.
Addendum: The author would also like to acknowledge that some machines
were harmed in the making of this thesis: Jabberwocky, Bandersnatch, some
Vorpal Blades, N-iMac, the espresso machine in the staffroom and the countless
other machines that either computed their last digits, or were wounded beyond
repair in the line of duty (even that nameless machine that spewed smoke and
shut down the whole building).
ix

Abstract
Ultrasound simulations are a type of application that are both computationally
and communicatively intensive. With better performance, implementations of
these can be used in designing new ultrasound probes, developing better signal
processing techniques, training new ultrasonographers, in treatment planning and
many other uses [12]. The pseudo-spectral technique can be used effectively to ex-
press the wave-propagation model used in these simulations, and is characterised
by its use of the Fast Fourier Transform (FFT). The FFT can account for over
half of the time spent by ultrasound simulations, with the remaining consisting
of embarrassingly parallel arithmetic [29].
The use of a Graphics Processing Unit (GPU) for general computations like
the FFT has become ubiquitous with favourable performance. The current trend
in the design of the Central Processing Unit (CPU) of most systems has seen a
shift from single-core to multi-core processing with these now being assembled into
multi-socket configurations. GPUs are already massively multi-core processors—
typically with three or four times as many cores—the question remains: will GPUs
follow a similar trend and incorporate multiple devices in individual sockets when
implemented?
The purpose of the work in this thesis is to assess the viability of multi-GPU
systems for ultrasound simulations in terms of cost and performance compared
to other system designs that offer similar computational resources.
Current machine hardware is capable of supporting multiple GPU through pe-
ripheral devices and offers a glimpse of the potential of future machines however,
relatively little work has been reported on the use of such systems for ultrasound
simulations and the FFT algorithm. In this thesis, to address this issue, we
benchmark and model the device-to-device communication potential of an exist-
ing multi-GPU system. Four different methods are considered, namely: via CPU,
pointer swapping, hybrid-staged, and kernel. The results reveal that the pointer
xi
xii
swapping and kernel based methods of managing communication can be up to
twice as efficient as other methods.
The methods for communication identified in the benchmarks are then used
as the basis for a number of important generic communication functions, which
are in turn used to implement a distributed 3D FFT algorithm as required by
the ultrasound simulation. The multi-GPU distributed 3D FFT with four GPUs
was found to be up to 18% faster than an existing FFT implementation on a six
core CPU.
This multi-GPU distributed 3D FFT implementation is then used in an ultra-
sound simulation as a proof-of-concept case study of the thesis. By overlapping
communication and computation between the CPU and GPU resources a speed
up of 8% is observed.
Contents
Acknowledgements ix
Abstract xi
List of Figures xv
List of Tables xvii
Notation and Terminology xix
1 Introduction 1
2 Background 5
2.1 GPU and CPU Processors, and Hardware . . . . . . . . . . . . . 6
2.2 GPU Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 The Multi-GPU Architecture . . . . . . . . . . . . . . . . . . . . 15
2.4 Multi-GPU Communication . . . . . . . . . . . . . . . . . . . . . 18
2.5 Ultrasound Simulations and the FFT . . . . . . . . . . . . . . . . 21
2.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Multi-GPU System Behaviour 27
3.1 Benchmarking Using The Swap Operation . . . . . . . . . . . . . 28
3.1.1 Swapping Via CPU . . . . . . . . . . . . . . . . . . . . . . 29
3.1.2 Swapping Via Device Pointers . . . . . . . . . . . . . . . . 31
3.1.3 A Hybrid Swap: Staging On The CPU . . . . . . . . . . . 32
3.1.4 Swapping Using Device Kernels . . . . . . . . . . . . . . . 34
3.1.5 Equipment: The Benchmarked System . . . . . . . . . . . 36
3.1.6 Benchmark Methodology . . . . . . . . . . . . . . . . . . . 38
3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
xiii
xiv CONTENTS
3.3 Modelling the Multi-GPU System Behaviour . . . . . . . . . . . . 43
3.3.1 The Bandwidth Response Function . . . . . . . . . . . . . 44
3.3.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 The Multi-GPU 3D FFT 49
4.1 Communication In The Distributed 3D FFT . . . . . . . . . . . . 50
4.2 A General Algorithm for The Distributed 3D FFT . . . . . . . . . 53
4.3 Specific Implementations . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Equipment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5 Application: k-Wave 65
5.1 Equations in Ultrasound Simulation . . . . . . . . . . . . . . . . . 65
5.2 CPU–Multi-GPU Pipelining for Simulation Acceleration . . . . . 67
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6 Conclusion 75
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
A Equipment 85
A.1 Multi-GPU System: Infib . . . . . . . . . . . . . . . . . . . . . . 85
A.2 Multi-GPU System: Trinity . . . . . . . . . . . . . . . . . . . . . 86
B Complete Bandwidth Table for Infib 89
C Table for Modelling Swap Operation 91
D GPU-CPU Pipelining of FFT in K-Space Pseudo-Spectral Meth-
ods 93
Bibliography 95
Acronyms 101
Glossary 103
Index 104
List of Figures
2.1 GPU vs. CPU Structural Comparison . . . . . . . . . . . . . . . . 7
2.2 GPU vs. CPU FLOPS and Memory Bandwidth Comparison . . . 8
2.3 Shared and distributed memory systems . . . . . . . . . . . . . . 11
2.4 CUDA Kernel Example . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 CUDA and Memory Hierarchy . . . . . . . . . . . . . . . . . . . . 14
2.6 Hypothetical multi-GPU systems. . . . . . . . . . . . . . . . . . . 16
2.7 Collective Communication Functions . . . . . . . . . . . . . . . . 20
3.1 Device-to-device Communication Via CPU . . . . . . . . . . . . . 30
3.2 Device-to-device communication Via Pointer Swapping . . . . . . 32
3.3 Device-to-device Communication via Hybrid-Staging . . . . . . . 33
3.4 Device-to-device Communication Via Kernel . . . . . . . . . . . . 35
3.5 Multi-GPU System Schematic . . . . . . . . . . . . . . . . . . . . 37
3.6 Minimum and Maximum Bandwidth for Benchmark . . . . . . . . 40
3.7 Bandwidth Response Function Fitting . . . . . . . . . . . . . . . 44
3.8 Swap Operation Modelling: Predicted vs. Observed Time . . . . . 48
4.1 Multi-GPU Scatter . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Multi-GPU Gather . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Multi-GPU All-To-All . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4 General algorithm for distributed memory type 3D multi-GPU FFT. 54
4.5 Profile of Simple and Pointer Swapping Implementations . . . . . 59
5.1 Visualisation of an Ultrasound Simulation . . . . . . . . . . . . . 66
5.2 Matlab Code of Simulation . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Technique for GPU-CPU pipelining . . . . . . . . . . . . . . . . . 69
5.4 GPU-CPU Overlapping of Matlab Code . . . . . . . . . . . . . . 70
5.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . 73
xv
xvi LIST OF FIGURES
6.1 Technique for GPU-CPU Pipelining With Two Copy Engines . . . 80
A.1 Multi-GPU System Schematic . . . . . . . . . . . . . . . . . . . . 86
List of Tables
3.1 Bandwidth Benchmark Results . . . . . . . . . . . . . . . . . . . 42
3.2 Footprint and Communication Costs of Swap Methods . . . . . . 43
4.1 FFT Performance over Batch Size . . . . . . . . . . . . . . . . . . 57
4.2 FFT Implementation Performance . . . . . . . . . . . . . . . . . . 61
4.3 Modelled FFT Performance . . . . . . . . . . . . . . . . . . . . . 62
5.1 CPU Ultrasound Simulation Performance . . . . . . . . . . . . . . 71
5.2 Non-Pipelined Ultrasound Simulation Performance . . . . . . . . 71
5.3 Pipelined Ultrasound Simulation Performance . . . . . . . . . . . 72
B.1 Complete Bandwidth Benchmark Results . . . . . . . . . . . . . . 90
C.1 Predicted and Measured Time for Swap Operation . . . . . . . . . 92
xvii
xviii LIST OF TABLES
Notation and Terminology
This chapter presents some of the notation and terminology used throughout the
thesis for reference. Here, the points of reference for 3D space, matrices and other
domain specific lexicology are defined.
Data[i] Array notation for the ith element of an array called Data
where the first element is at position 0 (Data[0]).
di Sequence notation for the i
th element of a sequence denoted d
with d1 indicating the first element.
D,d The capitalisation of the symbol denoting the elements of a
sequence indicates the entire sequence, i.e: D = d1, d2, . . . ,
where d may be used as index or to denote an arbitrary element
of D. In case of ambiguity, the bold face of the symbol is used
(d).
|Data| Size/cardinality of a set, or number of elements in a sequence
Data, e.g: Given a ‘case’ contains four ‘sixers’, then |sixers|
in a case is six.
#Data The number of sets or sequnces Data in some group, e.g: Given
a ‘case’ contains four ‘sixers’ then #sixers in a case is four.
X 
Y 
Z 
The origin: The point of reference and orientation in
3D space. This defines the axis X as depth with back-
wards and forwards, Y as width with left and right,
and Z as height with up and down.
xix
xx NOTATION AND TERMINOLOGY
Matrix
A matrix: A form of data with any number of di-
mensions. Here, unless otherwise specified matrices
are 3D. Referring to the origin for our 3D space, the
upper-left-forward corner is the location of the first el-
ement, with subsequent elements behind, elements to
the right more distant, and elements below the great-
est distance from the first.
buffer An allocated region of memory.
Chapter 1
Introduction
“For, usually and fitly, the presence of an
introduction is held to imply that there is something
of consequence and importance to be introduced.”
—Arthur Machen.
The primary goal of this thesis can be motivated from two opposite direc-
tions: understanding an emerging trend in system design towards multi-GPU
architectures, and the demands of computing applications such as ultrasound
simulations.
From the perspective of system design the current trend is towards multi-core
processing, with current consumer hardware now typically containing four Central
Processing Units (CPUs). Furthermore, these machines are often configured with
Graphics Processing Units (GPUs) which usually contain three or four times
as many more cores, and the use of GPUs for general purpose programming is
now quite ubiquitous. In the space of high performance computing systems this
initially involved augmenting each node of a distributed memory system with a
single GPU. Typical node hardware (and many consumer system) can however
support multiple GPUs and, as node CPUs become increasingly multi-core, the
trend would suggest that each node will become populated with multiple GPUs.
To date, relatively little work has been reported on the use and optimisation of
algorithms to run on multiple GPUs attached to a single shared memory host.
One reason to create a system like this presently is that current GPU devices are
manufactured with a fixed amount of memory that is often limited in size and
1
2 CHAPTER 1. INTRODUCTION
bandwidth compared to the memory allocation available to the CPU. Aggregating
the resources from multiple GPUs can potentially solve this.
From the application context, motivation is driven by the desire to perform
large-scale ultrasound simulations in time-frames that are clinically meaningful
[64]. Ultrasound simulations of sufficient detail can be used in the design of new
ultrasound probes, the development of signal processing techniques, in training
new ultrasonographers, and in treatment planning. The demands on hardware
to produce simulations of sufficient detail are significant, and almost place this
problem solely in the domain of high performance computing systems. That
given, GPUs have been shown on many occasions to offer large speed ups to
many different types of problems [28], and ultrasound simulations have been
shown to benefit from acceleration on a single GPU [63]. However, the existing
GPU hardware, in particular the memory capacity, is not suitable for large-scale
simulations required for detailed ultrasound simulations. Once again, aggregating
the resources from multiple GPUs can solve this, however the implications of
doing so are not immediately intuitive.
Based on these motivations, the goal of this thesis is to characterise the per-
formance of multi-GPU systems and determine the efficacy of multi-GPU systems
for ultrasound simulation applications.
In order to address this goal we make a number of contributions. We present
an analysis of the performance characteristics of the multi-GPU architecture,
and define and implement a selection of generic communication functions for the
platform, along with several techniques for managing said communication. The
performance characteristics are used to create a model suitable for predicting
the performance of algorithms on existing and potential instances of multi-GPU
systems. The generic communication functions are also demonstrated in our
implementation of the 3D Fast Fourier Transform (FFT) for the architecture,
and this FFT is then applied to an ultrasound simulation along with a technique
for efficiently using this configuration and its performance analysed.
This thesis is thus organised into the following five chapters. Chapter 2
presents the theory and background information behind the concepts in this the-
sis, and concludes with a survey of existing literature. Chapter 3 begins the
analysis of multi-GPU systems through the development of methods for manag-
ing communication and the benchmark of their performance on an experimental
multi-GPU machine. Chapter 4 defines the implementation of the 3D FFT on
3the platform in terms of general purpose communication functions and compares
the performance of this implementation to their equivalent versions for the CPU
and GPU architectures. Chapter 5 builds upon the previous by replacing the
FFT component of an ultrasound simulation with the mutli-GPU implementa-
tion demonstrating a possible use for the multi-GPU architecture to augment
existing work. Chapter 6 concludes the thesis with a discussion of the goal and
the state of the multi-GPU architecture in terms of current and future hardware
and addresses potential future work to further develop the platform.
4 CHAPTER 1. INTRODUCTION
Chapter 2
Background
“Education is an admirable thing, but it is well to
remember from time to time that nothing that is
worth knowing can be taught.”
—Oscar Wilde
The following chapter covers the fundamentals involved in this thesis: it intro-
duces the reader to the hardware, key concepts and the context of the problems
addressed. This chapter is divided into the following seven sections: GPU and
CPU Processors, GPU Programming, Multi-GPU Systems, Multi-GPU Com-
munication, Ultrasound Simulations and the 3D FFT, and related work. The
multi-GPU architecture that is central to this thesis is made of a number of es-
tablished components which we introduce individually. First, there are the GPU
processors themselves which we compare to the more familiar CPUs. We focus
on NVIDIA’s hardware which was used in the experiments. The second section
is on the software interface and programming paradigms for GPU applications,
which is by NVIDIA’s CUDA [41] solution. The third section introduces the
multi-GPU architecture. The fourth section discusses the PCIe protocol and net-
work and the aspects of communication and memory that are relevant. In the
fifth section the application domain is presented along with the key component
of its algorithm: the 3D Fast Fourier Transform (FFT), along with the particular
ultrasound simulation studied: k-Wave. Section six concludes the chapter with a
review of recent relevant literature.
5
6 CHAPTER 2. BACKGROUND
2.1 GPU and CPU Processors, and Hardware
The Graphics Processing Unit (GPU) is considerably different to the Central Pro-
cessing Unit (CPU) that the majority of programmers are used to developing for
[49]. From an evolutionary point of view [54] these two processors could not have
a more different history. The CPU found in most consumer machines were origi-
nally developed for general purpose sequential tasks, but have evolved to include
pipelines, superscalar features and out-of-order execution factoring heavily into
their design for efficiency. The primary means of improving the performance of
these processors for a long time was through engineering to simply increase the in-
ternal speed at which the processors clock speed operated. From their conception
regular consumer CPUs have increased in clock speed from about 1 Mhz to over
3 Ghz (with some being over-driven at an unsustainable rate beyond 8 Ghz [24]).
This speed up is more or less in correlation with the number of transistors on the
processor, which has been doubling at a rate of approximately 18 months—an
observation known as Moore’s Law [37]. Eventually limitations in manufactur-
ing technology meant physical, thermal and power constraints prevented this
technique from yielding performance improvements with a single processing core
which resulted in the current trend of multi-core processors with very complex
control logic and memory management.
In contrast, GPUs are more specialised and were originally designed to accel-
erate certain parts of the graphics pipeline (the process by which information is
eventually rendered onto a device such as a computer screen monitor) as opposed
to the CPU which was always designed to handle a variety of tasks. Consequently
GPUs have evolved for processing the massive amounts of data associated with
2D and 3D graphics and have almost always followed multi-core designs. These
feature a much greater concentration of computational resources to other func-
tional units within the processor than CPUs. Quintessentially, the difference
between these processors is that the GPU has evolved into a massively parallel,
multi-threaded and multi-core processor, while CPUs have become highly efficient
at executing a few threads on a small number of cores.
The design of the GPU core is centred around the Single-Instruction Multiple-
Thread (SIMT) style of instruction execution, which differs in a few but significant
ways from the more well known Single-Instruction Multiple-Data (SIMD) model
[31], and conventional multi-threading in CPUs. The main distinction is that in
SIMT the same instruction is issued to some number of threads which results
2.1. GPU AND CPU PROCESSORS, AND HARDWARE 7
CPU GPU 
Main/System/Shared Memory Device Local Memory 
Cache 
Control 
ALU ALU 
ALU ALU 
…
 
Figure 2.1: Structural comparison between GPUs and CPUs. CPUs contain
a larger and more complicated cache structure (memory in orange), for a
more complex controller (grey) with fewer computational cores (white). We
define the CPU as exclusive of the main memory. GPUs have smaller, faster
memory (green) closer to the cores, with simpler caches and controllers for
a larger amount of computational resources. Based on figure by NVIDIA
[41]
in a more flexible programming model than SIMD implementations such as SSE
[31]. This means that SIMT programs have fewer memory alignment and storage
constraints; support for conditional and branching program flow; and thread
specific and localised registers for each data element.
Consequently the multi-processor found within the GPU consists of a single
control unit for a number of computational function units, typically enough for
sixteen threads. This way several warps of threads (groups of 32 threads) can be
kept ready, collectively working on any set of data, allowing context switching to
mask high latency operations. Then the same instruction can be issued by the
controller for all threads in one warp in a single fetch/decode effecting perhaps the
most efficient means of performing that computation [16]. These multi-processors
are modular and at the design point of any particular GPU processor any number
of these multi-processors can be factored in, to produce a highly parallel processor.
The dissimilarity between the CPU and GPU architectures designs can be seen
in Figure 2.1, where the GPU clearly reserves a significantly larger proportion of
resources in a more efficient configuration for data processing than the CPU [52].
This results in some of the key features of the GPUs: its capability for enormous
computational rates and its memory bandwidth. These can be compared using
the floating-point operations per second (FLOP/S) for the computation rate,
and bytes per second (B/s) for the memory bandwidth. Figure 2.2 illustrates the
trend in these two metrics for a number of recent CPUs from Intel and GPUs
8 CHAPTER 2. BACKGROUND
Chapter 1. Introduction 
 
 
2  CUDA C Programming Guide Version 4.1 
 
 
Figure 1-1. Floating-Point Operations per Second and 
Memory Bandwidth for the CPU and GPU 
Chapter 1. Introduction 
 
 
2  CUDA C Programming Guide Version 4.1 
 
 
Figure 1-1. Floating-Point Operations per Second and 
Memory Bandwidth for the CPU and GPU 
Figure 2.2: FLOPS and Memory Bandwidth Comparison for GPUs (gree )
and CPUs (blue)—statistically GPUs are considerably more capable, and
the rate of performance growth is also greater than CPUs. Figure taken
from NVIDIA [41]
from NVIDIA.
The full extent of the degree of separation in parallelism supported becomes
quite apparent when considering the number of ‘in-flight’ threads (ready to ex-
ecute), and threads within a single active context of a particular GPU in com-
parison to a CPU. For example, take the NVIDIA GTX 580 GPU [41] device
which uses the GF110 GPU and has 16 Multi-Processors. Each Multi-Processor
in this generation of hardware is capable of supporting 1536 threads in flight. This
means that across the entire device, well over 20, 000 threads can potentially be
supported in a ready-to-execute state, with 256 threads in active contexts. In
contrast, current Intel i7 CPUs with hyper-threading technology enabled only
support a few threads in flight, so a quad or hex core system only supports eight
or twelve active threads in context respectively.
Under the classic Von Neumann architecture [66] the GPU has been a periph-
eral device attached to the CPU of a system. Thus, the vast majority of GPU
devices that are produced have the form-factor of expansion cards which augment
the system. These GPU device cards are made of a number of components which
can include: at least one GPU processing chip; a large memory space (originally
the frame buffer for the display); specialised memory for graphics rendering (con-
stant texture memory); and, controllers (Direct Memory Access (DMA) [26]) and
interfaces (PCIe) for communicating with the rest of the system. GPUs also ap-
pear in integrated systems alongside CPUs in implementations such as NVIDIA’s
2.1. GPU AND CPU PROCESSORS, AND HARDWARE 9
Tegra [42] and AMD’s Fusion processors [1], where they share the same memory.
They have also been placed into stand-alone units, and as modular server racks
as seen in the line of products offered by NVIDIA’s Tesla series [45].
As GPUs typically exist as peripheral devices to the CPU the hardware facil-
itating communication between the two processors is a significant component in
the function of a GPU system. The Peripheral Component Interconnect Express
(PCIe) is a protocol defining all aspects required for a high-speed general-purpose
input/output (I/O) interconnect [5]. The primary use of this interconnect is to
attach peripheral devices to the core of a system housed on a motherboard. These
devices can fulfil roles ranging from both wired and wireless network adapters, to
high-bandwidth storage, to dedicated graphics output devices such as the afore
mentioned GPUs. PCIe is the current development of the interconnect technology
building on its predecessors which include PCI, PCI-X and AGP. The protocol
has itself undergone a number of revisions, the current standard is at version 3.0
with version 4.0 announced, however most systems in use—including those in this
thesis—are based on version 2.x. While originally designed and implemented as
a bus like in the classic Von Neumann architecture [66], this is no longer strictly
true as PCIe now operates more like a common switched network.
In order to efficiently support a number of peripheral devices, PCIe is typically
used in a switched network consisting of several multiplexed, point-to-point serial
connections that branch out from the primary interface to other switches and
eventually the peripheral device slots, thus at the time of communication each
device has a dedicated connection. The smallest possible connection, referred
to as a x1 lane (read by-one lane), is a fully duplex link capable of transferring
one bit per cycle. In practice, parallel lanes are used by devices for greater
bandwidth and so the available configurations are x1, x2, x4, x8, x16 and x32.
The x16 configuration is the most common for high bandwidth devices and has
a theoretical peak transfer rate of 8.0 GB/s in each direction [50].
While the PCIe network facilitates the communication between GPUs and
CPUs, its use is completely subjective to the memory models in use throughout
the system. The physical placement, amount, and access of memory within a
system can be the primary charactering element of it, for example defining the
difference between Harvard, Von Neumann, and Turing style machines [11]. In
this section we introduce two major memory models which are used in the the-
sis: the distributed and the shared memory models, along with two significant
10 CHAPTER 2. BACKGROUND
concepts in memory access: Non-Uniform Memory Access (NUMA), and Unified
Virtual Addressing (UVA).
The shared memory model is a design for multiple processors where some
memory resource is simultaneously available, more or less directly, in a global
scope to more than one of the processors[8]. Figure 2.3a illustrates this arrange-
ment. The main memory of most modern multi-core desktop machines is an
example of shared memory—true also when considering the role of main memory
when a GPU and CPU are present. Given the complicated memory hierarchies
of modern systems, shared memory systems can exist at multiple layers within
the structure in both the physical hardware and abstract software implementa-
tions. Shared memory systems can be either NUMA or have a uniform memory
access speed. Operating systems are capable of creating shared memory spaces in
the virtual addressing of applications with the main advantage of facilitating very
fast inter-process communication[7]. The OpenMP Application Programming In-
terface (API) [14][47] is an example of an implementation capable of supporting
this. The conceivable down-side to this model is that with increasing numbers of
processors the bottle-neck effect can become more and more significant depending
on the memory accesses that occur.
In contrast to shared memory systems, the distributed memory model de-
scribes an alternate approach to exploiting parallelism [8]. This model describes
machines created by connecting many independent computers together as nodes
in a network. The key being that each computer within the machine is associated
with its own private memory that is not directly accessible by other computers—
only via communication over a network—as described in Figure 2.3b. This com-
munication can be handled by software such as the Message Passing Interface
(MPI) library [34]. Distributed memory systems are more scalable as the inter-
connecting network is the primary limiting factor in adding an arbitrary number
of the independent computers[3]. An issue that distributed memory systems do
force upon programmers is the problem of data distribution and management—
which when performed poorly can render the parallel capacity of a distributed
system redundant.
Non-Uniform Memory Access (NUMA) as its name suggests refers to the
access time for a processor to all regions of memory in the design for a system, and
in this case where that time is not consistent usually because of physical design
[33]. This can result from a number of different reasons including the distinction
2.2. GPU PROGRAMMING 11
Memory 
CPU CPU CPU CPU 
(a) A shared memory system.
Memory 
CPU CPU CPU CPU 
Memory Memory Memory 
Network 
(b) A distributed memory system.
Figure 2.3: Shared and distributed memory systems. In a shared mem-
ory system, the memory is available directly to all of the processors. In
the distributed memory system each processor has its own memory, the
addressing of which is invisible to the other processors.
between the local memory of a processor and remote memory of a neighbouring
processor, the network between the processors, or any shared memory. Also,
both on-chip and off-chip caches for processors can be factored into the non-
uniformity of access times between memory regions. The principle objective of
the NUMA design philosophy is to improve scalability by avoiding the contention
and hardware costs associated with ensuring a uniform access speed to all memory.
This is achieved by separating processors in a structured manner [6].
The Unified Virtual Address space (UVA) is a software abstraction for how
memory is accessed within distributed memory systems [41][56]. Memory virtual-
isation allows otherwise independent memory spaces to contribute to a common
virtual memory pool. Effectively allowing a shared memory space to be abstracted
over an underlying distributed memory system. This means that pointers to
memory in a physically remote locations to a processor can be dereferenced and
used without the need for additional message passing or communication. The
potential benefits of using this include: improved memory utilisation through the
removal of redundancies in data replication during communication, increased ef-
ficiency and lower latency in certain applications through aggregated bandwidth,
and simplified data management without the need to program data distribution.
2.2 GPU Programming
The API for GPUs is the software abstraction that makes the hardware available
for developers to utilise. There are a number of different APIs presently in use
12 CHAPTER 2. BACKGROUND
!
!
!
!
!
!
!"#$%&'()'*++,-)%./,01%21'3,(-%456% ! 7!
!
!8*9:1'%;5%
&'()'*++,-)%<(01=%
"#$%!&#'()*+!$,)+-./&*%!)#*!0'$,!&-,&*()%!1*#$,.!)#*!2345!(+-6+'00$,6!0-.*7!
18!-/)7$,$,6!#-9!)#*8!'+*!*:(-%*.!$,!2;!5,!*:)*,%$<*!.*%&+$()$-,!-=!2345!2!$%!
6$<*,!$,!>*&)$-,!?;@;!
A/77!&-.*!=-+!)#*!<*&)-+!'..$)$-,!*:'0(7*!/%*.!$,!)#$%!&#'()*+!',.!)#*!,*:)!&',!1*!
=-/,.!$,!)#*!!"#$%&'((!>4B!&-.*!%'0(7*;!
;5> ?1'-1=3%
2345!2!*:)*,.%!2!18!'77-9$,6!)#*!(+-6+'00*+!)-!.*=$,*!2!=/,&)$-,%C!&'77*.!
)"&*"+,C!)#')C!9#*,!&'77*.C!'+*!*:*&/)*.!D!)$0*%!$,!('+'77*7!18!D!.$==*+*,)!-./'0
$1&"2(,C!'%!-((-%*.!)-!-,78!-,&*!7$E*!+*6/7'+!2!=/,&)$-,%;!
5!E*+,*7!$%!.*=$,*.!/%$,6!)#*!!!"#$%&#!!!.*&7'+')$-,!%(*&$=$*+!',.!)#*!,/01*+!-=!
2345!)#+*'.%!)#')!*:*&/)*!)#')!E*+,*7!=-+!'!6$<*,!E*+,*7!&'77!$%!%(*&$=$*.!/%$,6!'!
,*9!'''«(((!"3"#4$5%*0#%*6574&2$5%*!%8,)':!F%**!5((*,.$:!G;H?I;!J'&#!)#+*'.!)#')!
*:*&/)*%!)#*!E*+,*7!$%!6$<*,!'!/,$K/*!$1&"2(08/!)#')!$%!'&&*%%$17*!9$)#$,!)#*!E*+,*7!
)#+-/6#!)#*!1/$7)L$,!)*+,&-.-/!<'+$'17*;!
5%!',!$77/%)+')$-,C!)#*!=-77-9$,6!%'0(7*!&-.*!'..%!)9-!<*&)-+%!'!',.!9!-=!%$M*!:!
',.!%)-+*%!)#*!+*%/7)!$,)-!<*&)-+!-N!
!!"#$%&$'"($)*&*+*,&"
--.',/0'--"1,*("2$34((5)',0+6"47")',0+6"87")',0+6"9:"
;"
""""*&+"*"<"+=%$0(>(?@?A"
""""9B*C"<"4B*C"D"8B*CA"
E"
"
*&+"F0*&5:"
;"
""""@@@"
""""!!"#$%&$'"*&1,30+*,&"G*+="H"+=%$0(I"
""""2$34((JJJK7"HLLL547"87"9:A"
E"
O*+*C!*'&#!-=!)#*!:!)#+*'.%!)#')!*:*&/)*!0,12--34!(*+=-+0%!-,*!('$+L9$%*!
'..$)$-,;!0Figure 2.4: Code snippets illustrating the use of CUDA kernels. The kernel
call in the main() function executes the kernel function VecAdd() with one
block and N threads. Figure taken from NVIDIA[41]
for GPU programming and one of the most prominent is CUDA [41]. Examples
of other current APIs include OpenCL [30][59] and OpenACC [53]. CUDA is
the name given by NVIDIA for its complete solution for GPGPU products which
it released in November 2006, well before OpenCL (2008), and supports more
features in part due to its longer development history. In CUDA, a typical flow of
a program involves: the initialisation of a task on the CPU for GPU programming
called the host ; the allocation of space for the task’s data in the memory of the
GPU, referred to as the device; copying the data to the device; executing a device
side task called a kernel ; and finally the returning of the result to the host.
The terms host and device are common terminology in CUDA GPU program-
ming as the program is rarely in direct control of the GPU. Instead, the GPUs
are used and accessed as separate entities within the host system via an asyn-
chronous queue operation model [16]. This means that the host is responsible for
placing tasks (memory operations and kernel executions) into the devices queue
and is immediately free without any further waiting for more work—there is no
guarantee on the exact time a task in the queue is completed, only that it begins
after the previously queued task has finished, and finishes before the next queued
task has begun. This requires some form of synchronisation between the host and
device for the host to know when a result is ready which is achieved through a
host-initiated barrier. Any illusion of synchronous activity, or blocking behaviour
is provided by implicitly performing barrier synchronisations.
The kernels within the CUDA architecture have a few unique properties that
differentiate them from regular programmable functions. In Figure 2.4 there is
an example of defining a simple device kernel (VecAdd()) and a host function
2.2. GPU PROGRAMMING 13
(main()) calling it. As all kernels are launched asynchronously they can have
no return value for the host. What a kernel invocation actually specifies is a
SIMT type code execution launch—for this we require the number of threads
and their configuration. The invocation takes arguments in addition to regular
function arguments using the triple angled brackets (<<<,>>>) defined by CUDA
to specify the dimensions of the grid of blocks, the dimensions of the block of
threads, and optionally the amount of shared memory to be allocated per block
(the example uses none). The example specifies one block of N threads. Within the
kernel function each thread is identifiable via a CUDA data structure, threadID,
which can be used to distinguish work for each thread—in this example it is used
by each thread to index into the elements of two vectors to perform the vector
addition operation at a thread parallel level.
The grid of blocks is an abstraction made by the CUDA programming model
[41]. These blocks are where a problem is divided into a grid of any dimension
of sub-problems that can each be solved independently. The blocks themselves
should be sub-problems capable of being solved in a SIMT fashion by multiple
threads. This abstraction is designed to mirror the scalable structure of the multi-
processors within the GPU (see Section 2.1 for multi-processors), and thus the
blocks are mapped entirely to a multi-processor, with the threads within the block
sharing the multi-processor’s shared memory, and executing on the Arithmetic
Logic Unit (ALU) under the same controller with their own registers and data as
per the SIMT model. This correlation between the CUDA programming model,
memory hierarchy and hardware structure is described in Figure 2.5, which shows
the relationship between the grids, blocks, threads and memory. The key benefit
of using this block division is that if given enough blocks any number of scheduling
methods can be used for high efficiency in computation and device usage on the
GPUs and multi-processors [58].
Another significant component of the CUDA programming model is the con-
cept of streams. NVIDIA designed this in recognition of the time required to
transfer data between GPU devices and the benefits of such techniques as pipelin-
ing and scheduling with multiple queues. A CUDA stream represents a user
defined queue of the asynchronous GPU operations (kernels, memory transfers)
that get executed in a specific order. In this context the GPU devices are made
of two components: a copy engine (the device DMA controller); and a compute
engine (the CUDA cores). By allowing multiple streams within a device, the copy
14 CHAPTER 2. BACKGROUND! !"#$%&'()*(+',-'#../0-(1,2&3!
!
!
!456(+',-'#../0-(78/2&(9&':/,0(;<=! ! >>!
!
(
?/-8'&()@)<( 1&.,'A(B/&'#'C"A(
)<D B&%&',-&0&,8:(+',-'#../0-(
"#!$%%&#'()'*+!,-!.$/&(*!0123!'4*!567"!8(9/()::$;/!:9+*%!)##&:*#!'4)'!'4*!
567"!'4(*)+#!*<*=&'*!9;!)!84-#$=)%%-!#*8)()'*!!"#$%"!'4)'!98*()'*#!)#!)!=98(9=*##9(!
'9!'4*!&'()!(&;;$;/!'4*!5!8(9/():>!?4$#!$#!'4*!=)#*3!@9(!*<):8%*3!A4*;!'4*!B*(;*%#!
*<*=&'*!9;!)!CD6!);+!'4*!(*#'!9@!'4*!5!8(9/():!*<*=&'*#!9;!)!5D6>!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
"#$%&#!'('$)*!
")+,!-!
.#$/0!123!45!.#$/0!143!45!.#$/0!1-3!45!
.#$/0!123!-5!.#$/0!143!-5!.#$/0!1-3!-5!
")+,!4!
.#$/0!143!45!
.#$/0!143!-5!
.#$/0!143!25!
.#$/0!1-3!45!
.#$/0!1-3!-5!
.#$/0!1-3!25!
67)(&,!.#$/0! !
8()9%#$/0!:7&)(,!
'('$)*!
67)(&,!
8()9;7)(&,!#$/&#!
'('$)*!
Figure 2.5: The thread grouping in the CUDA programming model is mir-
rored in the memory hierarchy of GPUs. Source: NVIDIA[41]
engine can execute a memory transfer from one stream while the CUDA cores
are busy processing a kernel from another stream.
The last feature of CUDA to mention in this context, and one which dis-
tinguishes it from other APIs such as OpenCL, is the support for a UVA among
devices of compute capability 2.x. More specific to memory, UVAs are elaborated
in Section 2.1. Essentially, this enables direct access to remote device memory
by CUDA kernels and CUDA memory copy routines.
Despite the feature rich state of CUDA and the massive computational capac-
ity of Graphics Processing Units, they are not without their limitations. Perhaps
first and foremost all problems can not be mapped well to this programming
model—an inherently serial task will not improve simply by implementing it un-
der CUDA on GPUs. Also, due to the repetitive, arithmetically intensive but
simple nature of graphics rendering, GPUs do not have as complicated and ef-
ficient memory hierarchies as often found in CPUs. This means programming
techniques and algorithms that have been developed with these in mind, such as
cache-aware algorithms, are much less effective. Nevertheless, the control afforded
to developers and programmers by this programming model allows considerable
freedom in the micro-management of algorithms; in the placement of data within
memory, and how communication is performed and when computation is carried
2.3. THE MULTI-GPU ARCHITECTURE 15
out.
2.3 The Multi-GPU Architecture
The Multi-GPU architecture describes a type of computing system that can be
designed based on the Von Neumann architecture for computer systems, and
continues in the same ideology of NUMA machines. Simply put a multi-GPU
system is a machine with more than one GPU device—its main feature is a large
number of GPUs attached peripherally to the CPU. This section begins with
a description of the multi-GPU architecture, discusses some of its features and
concludes by introducing some hypothetical multi-GPU systems as examples of
categories of systems useful for further discussions within the thesis.
The multi-GPU architecture can be described by the four core components
that contribute to its function. The first and most obvious component to be
introduced are the processing cores. The processing cores are comprised of a
CPU, which may contain one or more processors, and several Graphics Processing
Units, each housed within distinct devices. The second part required by a multi-
GPU architecture is a communication network which connects the peripherally
housed GPUs to the CPU. As discussed earlier, this is typically covered in a
modern system by the PCIe protocol and hardware. The third component, as
per the Von Neumann architecture, is the memory hierarchy required to facilitate
the operation on data by the processing cores. This results in a NUMA like
arrangement of memory locality across the PCIe network, for each GPU device
and the Von Neumann main memory. The fourth and final component is the
API, and in this thesis we use NVIDIA’s CUDA[41].
There are a number of potential benefits from a multi-GPU architecture as
described in comparison to other architectures where multiple GPU processors
are present. The cluster based architecture with a single or limited number of
GPUs per node is the most notable example of another architecture with multiple
GPUs. The advantages offered by the multi-GPU system described include lower-
latency GPU communication, lower hardware costs and a smaller system for a
comparable number of GPUs.
When using multiple devices in the above system it is necessary to have mul-
tiple tasks executing simultaneously in order to actually utilise the system. If the
tasks are mutually independent, no communication among devices is necessary—if
16 CHAPTER 2. BACKGROUND
(a) A minimal system. (b) A simple tree structured system.
(c) A system with a large branching
factor. (d) A fully interconnected system.
Figure 2.6: Hypothetical multi-GPU systems.
dependent, communication is required. Historically, inter-device communication
in CUDA was controlled by the host. That is, the host would collect pieces of
data from each device into its memory space, and then send the relevant data to
the relevant destination device memory. One of the most attractive features of
the latest versions of CUDA is the support for a UVA. This enables direct access
to remote device memory by CUDA kernels and CUDA memory copy routines,
making communication considerably easier from a development perspective. Also,
when considering the presence of DMA controllers on the devices themselves, an
accessible UVA would allow the devices to dynamically initiate and manage such
communications leading to much more efficient communication with minimal in-
put from developers.
For the purpose of assisting discussions throughout the thesis we introduce
2.3. THE MULTI-GPU ARCHITECTURE 17
a number of hypothetical multi-GPU systems displaying some of the conceptual
possibilities available from the core components identified. These systems should
exhibit a number of key features inherent to the architectural design and should
allow us to comment on the performance of multi-GPU systems. These systems
also display a range of variations which are all acceptable as instances of the
multi-GPU architecture, and which should allow us to compare and contrast the
effect of various design decisions in the construction of said machines.
All of the following systems are based around a single CPU with a simplified
memory hierarchy and are represented in Figure 2.6. The first hypothetical multi-
GPU system, which we shall denote as the minimal system, is the smallest and
most simple system capable of being classified as a multi-GPU system. As can be
seen in Figure 2.6a, this system comprises of just two GPU devices which share
a single internal network connecting them to their host and the shared memory
that is the system’s main memory.
The systems in Figure 2.6b and Figure 2.6c, which we call the simple tree
structured and large branching factor systems respectively, represent two poten-
tial designs/methods for increasing the number of GPUs within the system. As
its name implies the simple tree structured system achieves this by replicating
the internal network as a tree comprising of the internal network as the trunk
and branches, and the GPU devices as leaves. The key feature of this design deci-
sion is that the simple tree structure can be repeated and scaled endlessly in this
fashion to construct a multi-GPU system of any size, and it can also be grown
recursively by replacing a GPU with another instance of the internal network and
two more GPUs creating a more complex tree structured system.
Alternatively, the system with the large branching factor in Figure 2.6c in-
creases the number of GPUs by expanding the internal network, requiring it to be
capable of supporting the added GPU devices. This has advantages over a design
like the simple tree structure due to its simplicity and the ability to place GPUs
‘closer’ in a network topological sense. However, a system with a large branching
factor will require more expensive hardware with greater requirements, and will
encounter limits in scalability sooner than other systems such as those based on
the simple tree structure. Those familiar with trees will recognise that combining
the simple tree structure with a large branching factor is a simple technique to
achieve better configurations/performance.
Figure 2.6d shows the final hypothetical system which we are calling a fully
18 CHAPTER 2. BACKGROUND
interconnected system. This is a purely speculative system which can not be
constructed with current hardware. The purpose of this system is to explore the
idea where every processor within the system has a dedicated communication
path to every other processor. As the name suggests, this system contains a fully
interconnected network. Such an arrangement is still the same sort of multi-
GPU architecture as described as the GPUs are still separated from the host
by a common internal network and there is a distinct host and device memory
space. While such a system eliminates all communication contention and will
offer a significantly larger bandwidth to other systems, it suffers a number of
drawbacks. The most obvious of which is that the number of connections grows
quadratically:
c =
p(p− 1)
2
Where c is the number of connections, and p is the number of processors involved.
2.4 Multi-GPU Communication
The communication of interest in a multi-GPU system is more or less directly
related to PCIe communication. At its core the PCIe is a communication network
and to understand communication over the PCIe network an understanding of
the various forms in which communication can occur in general is required. The
taxonomy and lexicology of computer and network communication has not been
standardised. We introduce communication using terms based on the MPI stan-
dard [34], however one should consider other descriptions and the cross-over of
terms that can occur [60]. Communication in these sorts of systems can be broken
down into two main categories: those where the links or communications are based
on point-to-point connections, and those based on collective transmissions[34],
which together give the following four patterns: one-to-one, one-to-all, all-to-one,
all-to-all. To assist describing the patterns found in these communications we
introduce the terms senders and receivers as communicators to identify points in
the communication network, and messages which are essentially data sent from
senders to receivers.
One-to-one communication is synonymous with point-to-point transfers, and
is defined where the communication connection is between two end points—i.e:
one sender, one message, one receiver. Point-To-Point communication is perhaps
one of the most fundamental forms of communication available in a system, and
2.4. MULTI-GPU COMMUNICATION 19
hence the PCIe is heavily based on it. The services or functions that fall under
this pattern are the basics between one sender and receiver: send, receive, put,
get, etcetera.
Collective communication is defined where one or more of the communication
connections involve a group of points—e.g: one sender, with some message to all
other communicators as receivers (one-to-all). This is where the remaining com-
munication patterns are found: one-to-all, all-to-one, all-to-all. There are a large
number of services or functions that fall under these communication patterns:
broadcast, reduce and all-gather are but a few. The three that are relevant to
this thesis are scatter, gather and all-to-all. These three functions are all that
are required for communication by the distributed FFT algorithm. While the
hardware of certain networks can support complicated communication, such as
the broadcast operation and thus perform these algorithms more efficiently, when
performed on a PCIe network all of these functions are typically implemented as
a series of point-to-point messages.
The scatter function uses the one-to-all communication pattern to distribute
some data from one point across all of the points involved. In this pattern one
communicator, the sender, contributes to the result of the communication which
the collective (all) receives. An example of the scatter function for six processes
(communicators) is given in Figure 2.7. The implementation of this as a series of
point-to-point transfers is performed by first splitting the data A into a series of
messages where |A| is the number in the collective (in the example A = A0, A1,
A2, A3, A4, A5), then sending the Ai
th message to the ith corresponding point.
This requires p− 1 point-to-point transfers where p is the number of processes or
communicators involved.
The gather function is the inverse of the scatter, and thus has the opposite
communication pattern to scatter: all-to-one, which it uses to aggregate the
data from all of the points in the collective on one receiver. This is shown in
Figure 2.7 by gather acting in the opposite direction to scatter. For the all-to-
one communication pattern, all communicators collectively generate the result on
one receiver. When using point-to-point transfers to perform the gather function
an order or rank must be imposed on the points in the collective. With this order,
each sender in the collective can send its message to the one receiver which can
then use the rank of the sender (which is known for each connection) to assemble
the result, A, as a series where the message from the ith sender is set to position Ai.
20 CHAPTER 2. BACKGROUND
5.1. INTRODUCTION AND OVERVIEW 143
A0 A1 A2 A3 A4 A5 scatter
gather
A0
A1
A2
A3
A4
A5
A0 A1 A2 A3 A4 A5
B0 B1 B2 B3 B4 B5
C0 C1 C2 C3 C4 C5
D0 D1 D2 D3 D4 D5
E0 E1 E2 E3 E4 E5
F0 F1 F2 F3 F4 F5
A0 B0 C0 D0 E0 F0
A1 B1 C1 D1 E1 F1
A2 B2 C2 D2 E2 F2
A3 B3 C3 D3 E3 F3
A4 B4 C4 D4 E4 F4
A5 B5 C5 D5 E5 F5
complete
exchange
A0
B0
C0
D0
E0
F0
allgather
A0 B0 C0 D0 E0 F0
A0 B0 C0 D0 E0 F0
A0 B0 C0 D0 E0 F0
A0 B0 C0 D0 E0 F0
A0 B0 C0 D0 E0 F0
A0 B0 C0 D0 E0 F0
A0
data
broadcast
pr
oc
es
se
s A0
A0
A0
A0
A0
A0
Figure 5.1: Collective move functions illustrated for a group of six processes. In each case,
each row of boxes represents data locations in one process. Thus, in the broadcast, initially
just the first process contains the data A0, but after the broadcast all processes contain it.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
Figure 2.7: Assorted collective commu ication fu ctions for six processes.
Each row represents data local to a process. Source: The MPI Standard[34]
This requires the same number of point-to-point transfers as the scatter function.
The all-to-all function (also known as all-to-all scatter/gather) as its name
implies is based on the all-to-all communication pattern and describes a relatively
more complicated movement of data as again illustrated in Figure 2.7. This
communication pattern requires all points in the collective both in the formation
and the reception of the result. The operations of the function can be explained
in terms of a series of point-to-point transfers for each point in the collective,
where that point acts as a sender with a distinct message for every other point as
a receiver. In the example there are six distinct series of data, each divided into
six messages—one for each point in each case. The jth message of the ith sender
is received into the ith position of the jth series of the result, e.g: A1 goes to B0 in
the result. This requires p× p− 1 point-to-point transfers to achieve.
Two important metrics to consider when discussing these communications
2.5. ULTRASOUND SIMULATIONS AND THE FFT 21
are bandwidth and latency. The bandwidth of a communication can be defined as
either the net or effective rate at which communication is achieved, or the capacity
or maximum throughput of data over time, typically expressed in bits or bytes per
second (bps or b/s for bits, Bps or B/s for bytes). The latency in communication
represents the delay between the first signal leaving the sender and its arrival at
the receiver, and in other terms is referred to as the communication overhead and
is expressed as a duration of time. Our current understanding of physics imposes
a minimum latency in communication based on the speed of light (the fastest
speed information can travel) and the distance over which it must be sent but
is more typically influenced by processing at intermediary points in the network
such as queues in a switched topology. Together the bandwidth and latency of a
system are enough to characterise and describe the communication performance
of it.
2.5 Ultrasound Simulations and the FFT
An ultrasound simulation predicts the interaction of ultrasonic waves in a given
environment. This involves representing properties of a physical space and its
contents, then applying physics based model of wave propagation from an initial
condition over the domain for a time-span made into discrete time steps. Some
applications of the use of ultrasound simulations include designing new ultrasound
probes, developing signal processing techniques, training new ultrasonographers,
and in treatment planning [12][64]. Ultrasound simulations can be used in treat-
ment planning as the procedure High-Intensity Focussed Ultrasound (HIFU) uses
ultrasound to destroy unwanted tissue through ablation, and the results of sim-
ulating different intensities and durations using a scan of the patient can assist
in the creation of a suitable treatment plan. There are many different models for
computing the propagation of waves under the laws of physics, and one class of
methods of particular recent interest is the pseudo-spectral method.
The pseudo-spectral methods are a technique in numerical analysis which offer
improved accuracy and a number of other benefits over methods such as finite
difference time domain solutions and finite element methods [17][22]. One other
major benefit is that pseudo-spectral methods only require two data points for
a given wave-length compared to ten to twenty for other methods for the same
accuracy. This results in fewer computations and less memory requirements for
22 CHAPTER 2. BACKGROUND
a given problem.
The work in this thesis is presented in relation to a particular ultrasound
simulation implementation: version 1.0 of the K-Wave [62] toolbox by Treeby
and Cox. An application like k-Wave using the pseudo-spectral method would
typically proceed by iterating over discrete time intervals and computing values
for the pressure, density and velocity of the waves, while factoring in scattering,
absorption and other effects for each time step of the simulation using the results
of the previous iteration for the next [64]. This results in expressions in the form
of Equation 2.1 which describe how k-Wave calculates the change in pressure in a
particular Euclidean direction ξ for a time step n, ∂
∂ξ
pn, which is one component
in the computation for the pressure matrix p in the next time step:
∂
∂ξ
pn = F−1 (i · kξ · κ · F (pn)) (2.1)
The key characteristic of equations of this form is the use of the function F , the
Discrete Fourier Transform (DFT); some element-wise matrix operations; fol-
lowed by the inverse DFT (F−1). As these simulations are typically performed in
3D space, these equations are repeated at least three times, and a typical simu-
lation would contain a number of equations similar to this, each being performed
over a significant number of time steps resulting in a large number of invocations
of the Fourier Transform for any simulation.
The Fourier family of transforms is named after Jean Baptiste Joseph Fourier
based on his claim that any continuous periodic function can be expressed as a
series of well chosen sinusoidal basis functions (sine, cosine) [18], and is thus most
commonly used to transform a function of points over time f(t) into a function
whose argument is a frequency g(k) [57]. The DFT is defined in Equation 2.2:
g(k) = Ff(x)
=
N−1∑
n=0
e−i2pikn/N · f(xn)
(2.2)
Where, N is the number of samples over time, x is a set of points over time, and
k = 0, . . . , N−1 representing the frequencies where |k| is also N as a result of the
Nyquist limit [48]. The inverse of the transform is achieved by simply changing
the sign of the exponent (e−i2pikn/N to ei2pikn/N) and normalising by N . There
are many different algorithms for calculating the various forms of the Fourier
2.5. ULTRASOUND SIMULATIONS AND THE FFT 23
transform. A completely naive implementation of the DFT based on the direct
description of Equation 2.2 would produce an algorithm with run-time complexity
O(N2): there are N results in g(k) each requiring the summation of N terms. The
most widely used algorithm of the Fourier transform for computational purposes
is the FFT . Although first described much earlier by the likes of Karl Friedrich
Gauss in the 19th century [27], this algorithm is credited to Cooley and Tukey [10]
who analysed its run-time and presented it in the context of computer science.
The FFT makes use of a recursive decomposition to reduce the complexity of
the run-time to O(N logN). This relies on the factorisation of N : by recursively
expressing a larger Fourier transform of N points as a product of two smaller
Fourier transforms of N1 and N2 points where N = N1×N2. Thus, for equations
of the form of Equation 2.1, where the other components are O(N), it is the
dominating arithmetic component.
One dimensional signal processing is of limited use. Moving to a set of points
in two dimensions over time (x1,x2) corresponds to a two dimensional function in
time, and substituting that into Equation 2.2 gives Equation 2.3:
g(k0, k1) = Ff(x0, x1)
=
N0−1∑
n0=0
N1−1∑
n1=0
e
−i2pi(k0 n0N0+k1
n1
N1
) · f(xn0 , xn1)
(2.3)
The problem with this equation in terms of performance on any processor be-
comes apparent when in consideration of the typical storage of data for the two
dimensional input where elements in x0 are adjacent and elements in x1 occur
in strides of #x0 elements. This means that while one summation exhibits good
data locality, the other if computed directly will exhibit very poor locality which
inevitably leads to poor performance for the algorithm. However, it can be solved
because the calculation of the product in the summation for each set of points
for a fixed value of one dimension can be calculated independently. That is, by
computing an intermediary product of the partial 2D DFT in the x0 component,
and rearranging the data along these dimensions, the remainder of the 2D DFT
product can be calculated by another partial 2D DFT in the x1 component with
good data locality over the partial 2D DFT algorithms. In practice, this results in
#x1×#x0-point 1D FFTs in the x0 dimension, followed by a transpose between
those dimensions, and #x0 ×#x1-point 1D FFTs in the x1 dimension.
As the most practical ultrasound simulations are in the regular spatial di-
24 CHAPTER 2. BACKGROUND
mension space, they require the 3D form of the Fourier transform. In the multi-
dimensional case, instead of working with sinusoidal basis functions we are dealing
with their higher dimensional counterparts—sinusoidal planes/planar waves. Ex-
tending the equation for the 2D DFT for the multi-dimensional version with an
arbitrary number of dimensions d results in Equation 2.4:
g(k0, k1, . . . , kd−1) = F(0,1,...,d−1)f(x0, x1, . . . , xd−1)
=
N0−1∑
n0=0
e−i2pik1n1/N1 · N1−1∑
n1=0
e−i2pik2n2/N2 · · ·Nd−1−1∑
nd−1=0
·f(xn0 , xn1 , . . . , xnd−1)

(2.4)
From the final expression in Equation 2.4 and the decomposition of Equation 2.3,
it can be seen that recursively constructing a sequence of (d − 1)-dimensional
DFTs, results in an expression of the d-multidimensional DFT result as a series
of 1D DFTs and transposes. While there exist many other methods of performing
multi-dimensional DFTs [15], the main advantage of this decomposition is that it
allows the DFT of arbitrary dimensions to be calculated while reusing the same
code, and common data manipulation functions. This is also the reason why the
3D FFT is an interesting algorithm to study on a distributed memory type system
as between the initial distribution, transposition, and collection of the result, the
scatter, all-to-all, and gather functions are used respectively, and correspondingly
all of the communication patterns mentioned in Section 2.4. in the context of
computers is that simple transpositions of the intermediate results can greatly
improve data locality allowing the actively computed one-dimensional FFT to
operate within contiguous memory.
2.6 Related Work
The following presents a short summary of items with implications and relevance
to the contents of this thesis. It contains a survey of recent literature in FFTs,
ultrasound and multi-GPU computing.
The approach just described for the distributed 3D FFT is incorporated into
the de facto standard implementation of the FFT for CPUs: the Fastest Fourier
Transform in the West (FFTW) library by Frigo and Johnson [19][20][21]. The
FFTW library is released under the GPL and is thus open source. FFTW has
favourable performance in comparison to other publicly available FFT imple-
2.6. RELATED WORK 25
mentations as well as machine specific and vendor specific implementations. It
achieves this by planning the execution of the FFT decomposition using a com-
bination of different FFT algorithms including variants of the Cooley-Tukey,
Rader’s and Bluestein’s algorithms. The plan is further refined by either estimat-
ing, measuring or auto-tuning the performance of the plan on the given hardware.
This approach has the additional advantage that the plan can be reused if the
computation of FFTs of the specified dimensions need to be repeated.
As FFTW is to CPUs the standard for GPU implementations is NVIDIA’s
CUFFT library [43]. It is a proprietary library and integrates with the rest of
the CUDA framework. CUFFT is modelled after FFTW, primarily to encourage
code portability, and thus uses a similar combination of FFT algorithms for peak
performance (Cooley-Tukey and Goldstein only). Plans are implemented like
in FFTW however much less rigorously as compatible hardware is only available
from NVIDIA’s line of products and thus many aspects can be built in. The plans
in this case also contain additional configuration details relating to the layout
and allocation of threads, thread blocks and other GPU resources. This does
allow CUFFT to support a greater level of parallelism and scalability, especially
for higher dimensional FFTs where the decomposition allows for batches of the
one-dimensional FFTs to be computed in parallel. It should be noted that this
provides no interface for utilizing multiple GPU devices.
The work presented by Nukada et. al. in [40] is a continuation of previous
works with the NukadaFFT library in [39] and [38]. This uses a combination
of MPI and CUDA (CUFFT) to perform and analyse the scalability of the dis-
tributed 3D FFT on a specific supercomputer: the Tsubame 2.0, using up to 256
nodes with up to three GPUs on each. They conclude that through low-level
interconnect manipulation to manage intra-node and inter-node communication
with their dynamic selection among the network rails a total of 4.8 TFLOPS was
achieved when performing the transform.
P3DFFT is an open-source off-the-shelf framework for distributing 3D FFT
[51]. It does not compute the transform itself, but handles all of the decomposition
and communication tasks required for performing a distributed 3D FFT. It makes
use of a 2D (pencil) decomposition, using a localized library, such as FFTW, for
the component transform.
Csechowski et al. [13] extended P3DFFT to create DiGPUFFT for use on
their GPU cluster. They observed significant performance bottlenecks which
26 CHAPTER 2. BACKGROUND
they attribute to the cost of communication over the PCIe bus between CPU and
GPU. This was estimated to represent approximately 27% of the total time taken
by the 3D FFT.
PKUFFT [9] is similar to DiGPUFFT in that it uses a pencil data decompo-
sition with GPUs to perform the actual computation. Whereas P3DFFT appears
to be limited to real-to-complex (R2C, forward) and complex-to-real (C2R, back-
ward) transforms, PKUFFT includes complex-to-complex (C2C) transforms. It
differs from P3DFFT in the data manipulation it performs at the various stages,
and the factoring of architectural elements of GPU clusters into the decomposi-
tion and underlying computations.
ULSFFT [23] look at a recursive composition, essentially a restructuring of the
butterfly graph to allow a scatter-gather processing model. Gu et al. [25] propose
a number of techniques for performing large FFTs when the data is maintained
out of a single devices memory space.
None of the above explicitly consider the case of multiple GPUs hosted by a
single node; so direct communication between devices was not considered. Also
aspects of all these systems are now obsolete, e.g. PKUFFT is presented for
version 2.3 of CUDA, and the PCIe controller versions used in the various systems
are not immediately apparent.
Chapter 3
Multi-GPU System Behaviour
“Never underestimate the bandwidth of a station
wagon full of tapes hurtling down the highway.”
—Andrew S. Tanenbaum, 1981
In a system based on the multi-GPU architecture described in the previous
chapter, the performance of each major component in isolation is already quite
well understood—if not by their specifications, then at least by the countless ex-
amples of their applications in literature. The multi-GPU architecture is quite
different to more traditional machine architectures. The key to understanding
the behaviour of this architecture for any given application is in the communica-
tion between the GPU devices. This chapter presents an analysis of that aspect
of multi-GPU systems. This begins with a description of various methods for
performing the communication in this context. Following that, a set of empirical
tests is designed and carried out to examine the performance characteristics of
a particular system based on this architecture. This is followed by a discussion
of the results observed, followed by the construction of a theoretical model for
the performance of the communication aspect and its application and an analysis
of it.
The reason behind the design of the multi-GPU architecture is more apparent
in consideration of cluster-based computing. When GPUs are used in a cluster
they are typically found in a per-node basis. If the nodes were instead modelled
on the multi-GPU system then the same number of GPUs could be found in
fewer nodes resulting in a smaller cluster size with lower hardware costs, and less
27
28 CHAPTER 3. MULTI-GPU SYSTEM BEHAVIOUR
latency in GPU to GPU communication. The inter-connects of a cluster will be
less busy with fewer nodes, and it is more or less impossible for an inter-node
inter-connect to be faster than a direct PCIe connection as most inter-connects
invariably interface through the PCIe network.
Back in the context of a single multi-GPU machine, the communication be-
tween two GPUs can be managed in a number of different ways. A significant
feature of this platform is when it is possible to perform this communication in
direct device to device (D2D) transfers—without any contribution from the CPU
or the use of intermediary buffers. This capability and its performance charac-
teristics have not been explored in any great detail, and the primary tool for
examining this is through the systematic benchmarking of some process.
3.1 Benchmarking A Multi-GPU System Using
The Swap Operation
A simple analysis of the all-to-all communication pattern from Section 2.4 on
the multi-GPU architecture would reveal it is the most demanding in terms of
requirements on the network of interconnects for the GPU devices—the conse-
quences of the architectural design when constructed from the current generation
of hardware (Fermi/PCIe 2.0) are far from trivial. For that reason, the all-to-
all communication pattern makes a suitable candidate for basing a benchmark
for the system. Using this basis should allow the benchmark to be capable of
fully exposing the performance characteristics of the multi-GPU architecture. To
construct a benchmark from the all-to-all communication operation we begin by
further analysing the execution of this operation. On the PCIe network and
CUDA framework used in the multi-GPU architecture of this thesis, this oper-
ation must be executed as a series point-to-point communications that may be
performed either synchronously or asynchronously. Each of these point-to-point
communications will involve any pair of communicators twice: first when one
device is sending and the other receiving, and again when the same device is
now the receiver in a later part of the all-to-all communication sequence. In ef-
fect, these two devices involved in this part of the all-to-all communication are
swapping data. Thus, the sequence of point-to-point transfer communication op-
erations, that are performed in one direction between a sender and receiver, can
be represented as half as many swap operations between pairs of communicators
3.1. BENCHMARKING USING THE SWAP OPERATION 29
as indicated by Equation 3.1 and Equation 3.2:
T = n× (#d− 1) (3.1)
S =
n× (#d− 1)
2
(3.2)
Where T is the number of transfers, S is the number of swaps, #d is the number
of devices involved, and n is the number of units of data—note that the number of
transfers is also equivalent to the minimum number of data movements (∆M) in
Equation 4.1. Hence, to explore the performance characteristics of the multi-GPU
system in relation to the collective communication patterns, all that is required
is a thorough benchmarking of the swap operation.
Swapping data between pairs of devices in the multi-GPU architecture is not a
simple and straight-forward process. There exists a number of different methods
by which this sort of device-to-device communication can be orchestrated on
any particular hardware configuration. Assuming a system configuration like the
minimal system in Section 2.3, four methods for performing the swap operation
were designed:
1. A swap via the CPU and host memory.
2. A swap using only device-to-device communication by swapping pointers.
3. A hybrid approach using both the CPU and direct device-to-device com-
munication.
4. A swap via kernel, where a GPU kernel uses registers to carry out the swap.
3.1.1 Swapping Via CPU
In the first method for performing the swap operation, the communication is
performed exclusively through the CPU, hence it is called the via CPU method.
As mentioned in Section 2.1, the PCIe is an interconnect designed for peripheral
devices to connect to the CPU. The communication route between the CPU and
a device on the PCIe network was a primary focus of the initial design intentions,
and would thus already be a highly optimised route in the hardware and low-
level software of the system. In fact, under the CUDA framework, it was not
until version 4.0 was released that a direct communication route between two
GPU devices on a PCIe network was available. Hence, this method for swapping
30 CHAPTER 3. MULTI-GPU SYSTEM BEHAVIOUR
Figure 3.1: Device-to-device communication via CPU, swapping A with B
in four steps: 1) A to main memory; 2) B to main memory; 3) A from main
memory to the old location of B; 4) B from main memory to the old location
of A completing the swap.
data between devices also represents what was only possible for device-to-device
communication before those recent changes allowed this capability in a multi-
GPU system.
The process under which this method actually operates can be broken down
into four distinct steps. This is illustrated using the minimal multi-GPU system
in Figure 3.1. Before the first step, the system exists in a state with data to
be transferred in the local memories of each device, and the main memory with
sufficiently free space. The first step consists of one device, in this example GPU0,
sending its data A through the CPU to a corresponding area in main memory.
Next, the second step which is identical to step one proceeds but with GPU1 and
the data B, which can not proceed in this example until step one has completed
due to contention on the internal network. A synchronisation is required here
to prevent data from being overwritten before it is sent. Once this step has
completed the local memory on the devices are now free to be overwritten—the
completion of step two would typically be managed by barrier synchronisation
between the devices. At which point the third step can proceed which is where
the data from the first device, now in main memory, is sent to the other device’s
local memory. In this example that is the data A to GPU1. The swap is concluded
in the fourth step when the remaining data, B, is moved onto the first device
GPU0.
When considering this communication route on a system such as the simple
tree structured hypothetical system, the immediate drawback of this method
becomes apparent. There is a large volume of traffic between main memory and
3.1. BENCHMARKING USING THE SWAP OPERATION 31
the device memory and the aggregate bandwidth on the device side of the PCIe
network will always be larger than that on the CPU side inevitably leading to
a bottleneck on the system at that point. The best possible performance on
a modern system such as Trinity or Infib (see Appendix A) is that the time
multiplexing by the QPI and PCIe will fully saturate the QPI interface resulting
in a theoretical limit for these systems of 12 GB/s. A brief analysis of this method
reveals two communication costs of note: an additional memory footprint, and the
total volume of data moved over the network—or gross network traffic. Tracing
this method we observe in the memory footprint that for every swap performed
there is an additional main memory space requirement equal to the total size of
the data set. Furthermore, the total sum of the data moved over the network is
equal to twice the size of the data set, as each lot of data is moved once in each
direction between the host and devices.
3.1.2 Swapping Via Device Pointers
The second method represents one of the two ways selected in which the swap
operation is performed completely using only direct device-to-device communica-
tion. The advantage in this is to free the CPU and main memory for other tasks.
This method achieves this through the developments in NVIDIA’s GPUdirect
technology [46] and the recent addition to the CUDA framework of an asyn-
chronous memory copy that operates directly between the local memories of two
devices, and the redirection of device pointers to different regions of memory
within the pair of devices involved in the swap. Therefore, this is aptly named
the pointer swapping method and although analogous to conventional pointer
swapping in shared memory environments (in regards to the activity at a device
level), this is distinct in the context of the multi-GPU architecture in that it
involves memory copying.
The number of steps involved in this method is again three, however the initial
condition before the first step can proceed is different to the previous methods.
Conceptually, the host is not involved in this method and thus there are no
requirements in main memory. Instead, this method requires that additional
local memory be allocated and available on each device, which will be referred to
as the spare buffer for each device. This can be seen by the example in Figure
3.2. The first step of this method has the first GPU send its data directly to
the second storing it in the other device’s spare buffer. In the example with the
32 CHAPTER 3. MULTI-GPU SYSTEM BEHAVIOUR
Figure 3.2: Device-to-device communication via pointer swapping, swap-
ping A with B in three steps: 1) A directly to the spare buffer on the other
device in a D2D transfer; 2) B directly to the other spare buffer in a D2D
transfer; 3) pointers to the active buffer on each device are changed com-
pleting the swap.
minimal system this is GPU0 sending A to the spare buffer on GPU1. The second
step is simply the vice versa of the first with the other GPU device, and so in
the example with GPU1 and B and the spare buffer on GPU0. Synchronisation
is required here to maintain coherency between the active pointers and buffers.
Once this is complete, the third step can finish the swap operation by changing
the pointer for each device to the active part of its local memory to the recently
received data. The original active buffer can now be freed or reserved for a further
swap operation.
This method has an additional memory footprint equal in size to the data set
in extra device local memory. On the other hand, the total volume of data moved
over the internal network on the multi-GPU architecture is exactly the same as
the total size of the data involved.
3.1.3 A Hybrid Swap: Staging On The CPU
The idea behind the third method for swapping data between pairs of devices is
to attempt some level of balance between communication through the host and
communication directly between devices. The motivation for doing this is the
intention of achieving peak performance in both these communication pathways
(utilising more of the internal network), and making use of the shared memory
space, which is usually a more abundant resource. This method achieves this
by staging data transfers through the CPU in main memory while performing
3.1. BENCHMARKING USING THE SWAP OPERATION 33
Figure 3.3: Device-to-device communication via the hybrid-staged method,
swapping A with B in three steps: 1) A to main memory; 2) B directly to
the old location of A in D2D transfer; 3) A from main memory to the old
location of B to complete the swap.
other steps with direct device-to-device communication, hence we shall call it the
hybrid-staged method.
The swap method for this idea can be described as a process comprising of
three steps. The illustration for this swap operation is provided in Figure 3.3.
The initial condition for this method is identical to the via CPU method. The
first step is only conceptually different in that the area in main memory that
the data A from GPU0 is sent to is a temporary staging area for the particular
pair of GPUs and the swap operation. On completion of the first step, the local
memory previously occupied by A in the example is now available. Step two takes
advantage of this and has the second device directly send its data to the first,
in this case B from GPU1. Thus in turn freeing the memory in the second device
for it to receive the temporarily staged data originally from the first device. This
third and final step relinquishes the temporary staging area in the main memory
and concludes the swap. The consequence of this method is that at each step, the
memory required for the next is not available until the completion of the current
step—it requires two synchronisations corresponding with the completion of the
first and second steps.
However, this method has the benefit of potentially being well suited for a
queue of operations with more GPUs than what is available in a minimal system.
When considering the communication patterns for this method on the simple tree
structured system we can see that if GPU0 was engaged in the communication
with the CPU, then GPU2 and GPU3 could potentially be performing the direct
communication in step two. The ability to pipeline the steps in this method is
34 CHAPTER 3. MULTI-GPU SYSTEM BEHAVIOUR
not limited to that hypothetical system and is applicable to systems based on
fully interconnected networks or larger branching factors. The memory footprint
of this method requires half the total size of the data set in additional main
memory. The total volume of data moved over the network during this method
is one-and-a-half times the size of the involved data.
3.1.4 Swapping Using Device Kernels
The fourth and final method uses a kernel to handle the communication in place
and is thus named the kernel method. Using a kernel is the other way in which
communication between devices can be carried out directly and without interact-
ing with main memory or their host controllers. In a sense, one device is taking
the role of the host and controlling the other for a period of time. The example
for the kernel swap method is in Figure 3.4 and again swaps the data A on device
GPU0 with B on GPU1.
The kernel simply operates in one asynchronous launch. That said, the kernel
itself can be identified as consisting of three familiar steps as witnessed in the
other methods. The kernel has few requirements besides both devices being
ready to communicate with data in memory. All this method requires is that
the executing device has the pointer for the other device’s memory. The kernel
begins execution on one of the devices involved in the swap and as its first step
it loads the values of its data into local registers. In the example, this would be
GPU0 loading A into the registers on GPU0. The second step involves the kernel
pulling the data from the other device by directly accessing the remote memory
of the second device. This is performed by directly dereferencing the pointer to
the data on the other device—as opposed to requesting it through a managed
communication function—and storing the data in its local memory, in the space
previously occupied by what is now in the registers. Using the example, at the
end of step two GPU0 now contains B in local memory and A in its register space,
and GPU1 is in a free state. The kernel method then concludes in the third
step by performing a remote store operation, sending its data from its registers
to the other device. By using the computational registers to power the swap
operation, this method has no additional memory footprint. In addition, this
method transfers a total volume of data between the pair of devices equal to the
total size of the data, and as it executes in a single asynchronous kernel launch it
requires no explicit synchronisations during the swap operation like the previous
3.1. BENCHMARKING USING THE SWAP OPERATION 35
Figure 3.4: Device-to-device communication via a kernel, swapping A with
B in three steps: 1) A to registers; 2) B direct remote memory load from
GPU0 moves B over old location of A; 3) direct remote memory store moves
A to old location of B completing the swap.
methods.
These four methods should test the multi-GPU architecture sufficiently well
to expose the performance characteristics of the system well enough to manage
device-to-device communication with good efficiency. There exist a number of
other methods for performing the swap operation which were not considered.
Subtle variations of the chosen methods, Occam’s Razor (the complexity of the
method), and some fundamental understanding of the system eliminate these
other methods from possibly adding any information that the four chosen methods
could not.
An example of a method that was not chosen is the algorithm based on the
bitwise exclusive or, ⊕, operator. This method of swapping two binary values is
demonstrated in Equation 3.3. Two four-bit binary words are stored in A and
B. By repeatedly applying the exclusive or operator we can see that at Equation
3.3f and Equation 3.3h, the values of A and B have been swapped. This works
as the first application of the exclusive or operator at Equation 3.3c encodes
the difference between the two values into A. From that, A can be reproduced in
the space where B was, and after that, B can be reproduced from the encoded
value stored in A and the value of A stored in B and can overwrite the encoded
value which is no longer required. While this method requires no extra memory
footprint, even at the level of the processing registers, these sorts of algorithms
based on bitwise operations can perform with worse efficiency than other methods
36 CHAPTER 3. MULTI-GPU SYSTEM BEHAVIOUR
based on load-store operations depending on the processor in question. This is
because the operation pipeline of most modern processors is optimised towards the
load-store sequence of operations and these sorts of algorithms can also present
difficulties for compilers to optimise.
A := 0101 (3.3a)
B := 0110 (3.3b)
A = A⊕B (3.3c)
= 0011 (3.3d)
B = A⊕B (3.3e)
= 0101 (3.3f)
A = A⊕B (3.3g)
= 0110 (3.3h)
3.1.5 Equipment: The Benchmarked System
In order to design and implement a benchmark for a multi-GPU system, we
must first identify a particular system to benchmark. The system used for the
benchmarks was the system called Infib. The full details of the Infib system are
available in Appendix A.1. There are a number of important features of this
system relevant to the benchmarking of the swap operation.
First and foremost, the system is based on the simple tree structured hypo-
thetical system in Figure 2.6b of Section 2.3. The schematic for the Infib system
is reproduced in Figure 3.5. This system contains a dual socket arrangement.
The hypothetical system with four GPUs is repeated for each socket, allowing for
up to eight GPU devices to be attached.
However, one PCIe slot is reserved for an external network interconnect and
thus there are only seven GPUs on this system. This leads to the next point of
note as the system attempts to implement the PCIe protocol version 2.0. How-
ever, there is a hardware fault in the implementation of Intel’s IOH (input/output
hub) 5520 chip. What this means in terms of benchmarking this system is that the
connection between the dual sockets does not report that communication between
devices on the PCIe network as possible: i.e. direct communication between the
GPU devices is not available where it should be, and devices GPU0, GPU1, GPU2
3.1. BENCHMARKING USING THE SWAP OPERATION 37
Memory 
24GB 
PEX 
8647 
GPU0 
GTX 580 
Infiniband  
Mellanox 40Gb/s 
CPU 
Memory 
24GB 
Intel IOH 
5520 
Intel IOH 
5520 
CPU 
Xeon X5650 
ICH 10R 
(LANs, DISKs) 
PEX 
8647 
PEX 
8647 
PEX 
8647 
GPU2 
GTX 580 
GPU1 
GTX 580 
GPU3 
GTX 580 
GPU4 
GTX 580 
GPU5 
GTX 580 
GPU6 
GTX 580 
16 links shared 
between 2 GPUs 
16 links shared 
between IF and GPU 
PC
I-E
 x 
16
 
PCI-E x 16 
PCI-E x 16 
PCI-E x 16 
CPU 
Xeon X5650 
QPI 
12GB/s 
ESI 
8GB/s 16 links shared 
between 2 GPUs 
16 links shared 
between 2 GPUs 
25GB/s 
25GB/s 
Figure 3.5: Schematic of the core components on the motherboard of the In-
fib multi-GPU, shared-memory system. The system contains seven Geforce
GTX 580 GPU devices in a tree structure, two six-core CPUs and 48 GB
of main memory in a NUMA arrangement.
can not communicate directly with devices GPU3, GPU4, GPU5, and GPU6 and must
instead communicate via the shared memory space in main memory. The ex-
istence of this technical fault is acknowledged in [35] and further explained by
[55].
The last technical detail of interest is the theoretical limits for the bandwidth
between various key points in the system’s architecture. In regards to main mem-
ory, Infib is equipped with twelve 4 GB memory modules (48 GB RAM) and has
an aggregated bandwidth (between CPU and main memory) of 2 × 25 GB/s.
Communication between the CPUs is supported by the Intel QuickPath Inter-
38 CHAPTER 3. MULTI-GPU SYSTEM BEHAVIOUR
connection (QPI) with a theoretical bandwidth of 12 GB/s. The QPI provides
the interface between the CPUs and the rest of the internal network under the
PCIe version 2.0 protocol. The bandwidth of one link between a CPU and the
PCIe network has a theoretical limit of approximately 8 GB/s. This is also the
bandwidth between any two adjacent GPUs in the system, that is GPU devices
that are connected to the same PEX interconnect chip. Due to the dual socket ar-
rangement, the Infib system has a total of four links to the internal PCIe network
which implies a theoretical bandwidth limit of 32 GB/s.
3.1.6 Benchmark Methodology
Given the aforementioned Infib system in Section 3.1.5 as a target system, the
following benchmark was designed to understand the communication character-
istics and bottlenecks of the architecture. This benchmark consists of a series of
programs that perform the swap operation between selected pairs of GPUs under
different configurations.
The configurations for the benchmark can be divided into the four methods
for performing the swap operation presented earlier in this section. Programs
implementing each of the four methods for swapping: via CPU, hybrid-staged,
with pointer swapping, and by kernels were created. Each of these programs takes
a list of device IDs of even length. The ordering of the device IDs and the length
of the list dictate the number of devices and the roles they play in the selected
swap operations. Typically the identified devices are grouped as pairs and swap
their data. For example, given the list 1, 2, 3, 4, two swaps will be performed:
between GPU1 and GPU2, and between GPU3 and GPU4. Furthermore, the list
1, 3, 2, 4, would produce two different swaps: GPU1 would instead swap with
GPU3, and GPU2 with GPU4, and thus the communication would ‘stride’ over the
corresponding PEX chips of the Infib system and behaviour of the performance
may change depending on the method of swapping used. In the context of the
Infib system, the range of configurations of device pairs which were selected was
based on the available combinations provided by striding the CPUs and PEX
switched involved in the PCIe network for each method. The exact combinations
are available in Appendix B.
In order to compare the different configurations, a single metric was consid-
ered: the effective bandwidth over the entire PCIe network, also known as the
net bandwidth. This is a measure for data transfer over a network that is inde-
3.2. RESULTS 39
pendent of the particular algorithm used. The effective bandwidth is defined in
Equation 3.4 as the rate at which the net data movement is achieved:
EBW =
p× n
t
bytes/s (3.4)
Here EBW is the effective bandwidth, p is the number of GPU devices involved
in the benchmark, n is the size of the data used in the swap in bytes, and t is
the time taken to perform the swap. Thus, the unit for the metric is in bytes per
second (B/s).
On a multi-GPU system such as Infib, there is a degree of overhead for all com-
munication tasks. Due to the granularity of message packaging and the scheduling
of the PCIe network traffic, any communication must be of a sufficient size to
not be completely dominated by the time taken to manage it. To avoid this
problem, the benchmark takes one additional parameter, the message size. The
message size was fixed for the purpose of characterising the system. This value
was found by steadily increasing the size of the messages sent until the effective
bandwidth reached a stable rate, indicating that the measurements taken corre-
spond to the effects of the method and the system and not the underlying network
management.
There is one final point of configuration for the benchmark worth mentioning.
As at least the kernel swap method definitely uses the computational cores of
the devices, the effects of using the CUDA programming technique of streams
was explored. Each benchmark was run with both one and two active streams.
When the second stream was used, the swap operation was performed with the
corresponding stream in the device pairs. For example, configured with device
IDs 1, 2 and two streams, stream 1 of GPU1 would swap with stream 1 of GPU2 and
stream 2 of GPU1 would swap with stream 2 of GPU2—in effect doubling the volume
of data in the benchmark in comparison to the single streamed version. This
may allow the device to prepare communications for one stream while the other
is actively communicating, making the complete saturation of the interconnects
possible and a faster effective bandwidth observable.
3.2 Results
The following set of results are for the effective bandwidths of the four methods
introduced in Section 3.1 for swapping data between devices. A range of band-
40 CHAPTER 3. MULTI-GPU SYSTEM BEHAVIOUR
0	  
1	  
2	  
3	  
4	  
5	  
6	  
7	  
8	  
9	  
via	  CPU	   Hybrid-­‐
Staged	  
Pointer	  
Swapping	  
Kernel	  
Ba
nd
w
id
th
	  (G
B/
s)
	  
Swap	  Method	  
2	  GPU	  Swap	  Bandwidth	  on	  Inﬁb	  
(a) 2 GPUs
0	  
2	  
4	  
6	  
8	  
10	  
12	  
14	  
16	  
18	  
via	  CPU	   Hybrid-­‐
Staged	  
Pointer	  
Swapping	  
Kernel	  
Ba
nd
w
id
th
	  (G
B/
s)
	  
Swap	  Method	  
4	  GPU	  Swap	  Bandwidth	  on	  Inﬁb	  
(b) 4 GPUs
Figure 3.6: Box graphs showing minimum and maximum effective band-
widths for the swap benchmarks. Bottom of the box represents minimum,
top of the box is maximum. Higher values are better.
widths were recorded by varying the device pairs used by the methods, as de-
scribed in Section 3.1.6, and the minimum and maximum values for each method
are displayed in Table 3.1. Both the minimum and maximum bandwidths are im-
portant for characterising the systems performance. In the context of an all-to-all
communication from which this benchmark is based, each process communicates
with every possible partner meaning that the routes in the network with the
highest and lowest (and all values in between) will be used at some stage.
Figure 3.6 plots the minimum and maximum bandwidths using box graphs
for each number of GPUs which were used. The minimum bandwidth is repre-
sented as the bottom of the box and the maximum the top. While the kernel
method appears to have the best minimum bandwidth, this is however mislead-
ing. The results for the kernel version are affected as there is no possible way to
have the devices GPU0, GPU1, GPU2 communicate directly with any of the other
devices (GPU3, GPU4, GPU5, and GPU6). This is due to a hardware fault in the
manufacturer’s implementation of the PCIe network [35] with the consequence
that GPU configurations used by the other methods can not be used with the
kernel method. These configurations correspond to where the contention on the
internal network is potentially the greatest for those other methods, and hence
where the other methods perform the worst, skewing the results.
3.2. RESULTS 41
0	  
5	  
10	  
15	  
20	  
25	  
30	  
via	  CPU	   Hybrid-­‐
Staged	  
Pointer	  
Swapping	  
Kernel	  
Ba
nd
w
id
th
	  (G
B/
s)
	  
Swap	  Method	  
6	  GPU	  Swap	  Bandwidth	  on	  Inﬁb	  
(a) 6 GPUs
Figure 3.6: (continued).
This problem is averted in the other methods for swapping data. The asyn-
chronous direct device-to-device communication functions in the CUDA frame-
work reverts to a method most similar to the via CPU communication method
if the protocols for the PCIe communication route are not supported, which is
detected by the initialisation function. It would appear that explicitly managing
this communication in the via CPU method proceeds faster than the framework
is capable of regressing to in the hybrid-staged and pointer swapping methods.
Perhaps the most interesting result is the variation in bandwidths by each
method for a given configuration. Due to the differences in the communication
patterns of each method, for any given configuration, one method may present
its minimum bandwidth while another may present its maximum. For example,
when performing a single swap operation between GPU1 and GPU3 we observe
an effective bandwidth of 5.38 GB/s with the via CPU method, which is the
maximum observed bandwidth for the method and two GPUs. However, with this
configuration and the pointer swapping method we observe a bandwidth of just
2.91 GB/s—its minimum. The complete set of data is available in Appendix B.
When any communication involving the CPU occurs (the via CPU and hybrid-
staged methods), the observed effective bandwidth is significantly limited. These
methods have the smallest bandwidth when devices on the same IOH chip or
PEX switch are used, which limits the bandwidth in and out of the CPU. Con-
42 CHAPTER 3. MULTI-GPU SYSTEM BEHAVIOUR
Table 3.1: Effective Bandwidth for Inter-Device Data Swap (GB/s)
Method
2 GPUS 4 GPUS 6 GPUS
Min. Max. Min. Max. Min. Max.
Via CPU 2.95 5.38 3.83 7.55 5.64 5.71
Hybrid-Staged 2.82 4.71 2.83 9.41 2.99 9.88
Pointer Swapping 2.91 8.11 2.97 16.20 3.10 24.29
Kernel 5.94 8.24 5.96 16.46 8.88 24.31
versely, the direct device-to-device exclusive methods (pointer swapping and ker-
nel methods) achieved their maximum bandwidths in these conditions due to the
aggregated bandwidth of the PEX switches.
This behaviour is also suggested by table 3.2 which summarises the total
memory footprint and the gross communication volume of the four swap methods.
The two methods not involving the CPU only move data when necessary, i.e.,
the gross communication volume in relation to the net communication volume
is exactly the same. In contrast, the via CPU method has to move twice the
amount of data and the hybrid-staged method one-and-a-half times the amount
over the PCIe network, which lowers the effective bandwidth.
All methods except the kernel method require some amount of additional
memory on the host or device. The hybrid-staged and kernel methods are better
in this respect as they require no extra GPU memory which is a premium resource
on current hardware. The shared main memory is a less premium resource as it
is modularly expandable, and consequently less costly to increase and is typically
available in larger amounts than GPU memory, which is usually fixed for a given
device at the point of manufacture. However, the kernel method is limited to using
devices on the same IOH chip, and although the hybrid-staged method works with
devices on different IOH chips, it requires data to be staged by the CPU, which is
managed differently to the simple via CPU method and consequently the effective
bandwidths vary significantly depending on device selection. In other words, the
hybrid-staged method is better than the via CPU method in that it uses less
memory and eliminates one route of contention on the PCIe network.
The conclusion to be drawn from these results, is that for the Infib system and
other systems designed on the same principles, the best method for performing
3.3. MODELLING THE MULTI-GPU SYSTEM BEHAVIOUR 43
Table 3.2: Gross communication and total memory footprint for swap meth-
ods relative to the size of the data being swapped.
Method Total Footprint Gross Communication Volume
Via CPU 2× 2×
Hybrid-Staged 1.5× 1.5×
Pointer Swapping 2× 1×
Kernel 1× 1×
the swap operation between two GPU devices is through direct device-to-device
communication either using the pointer swapping method or the kernel method.
The primary reason for this is caused by the communication bottleneck at the
CPU-PCIe interface. The peak effective bandwidth observed by the system was
obtained using the kernel method and was approximately 24.31 GB/s, which is
close to the theoretical limit of the system. This benchmark is far from defini-
tive for a behavioural analysis of the multi-GPU architecture. The behavioural
analysis of the multi-GPU architecture based on this benchmark reveals some
characteristics of note. The kernel method involves both the communication and
computation cores of the device and therefore locks the entire device in the swap
operation. It is a highly demanding kernel and thus does not schedule as well on
the network of devices as the intrinsic CUDA communication functions used in
the pointer swapping method. In the pointer swapping method, the control oper-
ations are completely out of the device execution scope, and are only concerned
with the device communication and the copy engine. This is almost analogous to
copying an array of data in the programming language C using a for-loop versus
using a memory management function implementing lower-level control such as
the memcpy() function; the full implications of this would only be appreciable in
a more complex application test.
3.3 Modelling the Multi-GPU System Behaviour
This section attempts to present a model that can be used to help predict the per-
formance of algorithms on any multi-GPU architecture based on the observations
made in Section 3.2. This is based on the bulk synchronous parallel model [65]
44 CHAPTER 3. MULTI-GPU SYSTEM BEHAVIOUR
Fitted function vs. observed bandwidth 
100 101 102 103 104 105
0
1000
2000
3000
4000
5000
6000
7000
y(x) = a atan(b x)
a = 4076.3
b = 0.016324
R = 0.99838  (lin)
B
an
dw
id
th
 (M
B
/s
) 
Message Size (KB) 
Figure 3.7: Example of a function fitted (red) to the observed effective
bandwidth (blue). Inset: expression for the fitted function, the coefficients
for said function, and R is Pearson’s coefficient.
for program execution. Critical to calculating the communication times within
the bulk synchronous parallel model is the bandwidth response function which
we define. Using this, we can then present a model for this type of program
execution on the multi-GPU architecture.
3.3.1 The Bandwidth Response Function
The bandwidth response function is a mapping from a message size to the band-
width expected when transferring that message over the PCIe network for a given
configuration. It is unique for each communication pathway and machine. In or-
der to measure this the benchmark described earlier in Section 3.1 was modified
to record the observed effective bandwidth for an assortment of message sizes
between various key-points; from device to host, host to device, and device to de-
vice. The construction of the bandwidth response function begins with plotting
these bandwidths and fitting a function to describe them.
The trigonometric arctan function as described in Equation 3.5 was found to
3.3. MODELLING THE MULTI-GPU SYSTEM BEHAVIOUR 45
provide a good fit for the observed effective bandwidths over message size.
EBW (x) = a× arctan(b× x) (3.5)
Here EBW is the effective bandwidth, x is the message size, and a and b are
parameters defining the shape of the curve which are specific to each communi-
cation pathway on each machine. An example of one of these fitted functions is
provided in Figure 3.7. While Pearson’s coefficient for the example indicates that
this is a strong correlation, the error at specific responses can be quite significant.
This allows us to speculate on the performance in this transitional range at which
the effective bandwidth changes dramatically, corresponding to the point when
the PCIe network becomes saturated. However, in most cases the fitted functions
do not diverge from the observed values significantly. This can be further refined
by introducing a bandwidth cap at the theoretically defined limit for the commu-
nication pathway. Thus, the bandwidth response function becomes the minimum
of the fitted function and the theoretical limit. Alternatively, the fitted function
can be used as a basis for trivial modifications to form an upper and lower bound
for the bandwidth response function.
3.3.2 The Model
With the bandwidth response function defined, it can now be applied to the
bulk synchronous parallel model for the multi-GPU architecture. The bulk syn-
chronous parallel model introduced by L.G. Valiant [65] is a programming model
for the design of algorithms on parallel hardware much like the Von Neumann
model is to sequential processors. The model describes a computation as a series
of discrete global supersteps, S. Each superstep consists of three components:
1. The concurrent computation by the processors (the GPU devices).
2. A communication phase between the processors (point-to-point over PCIe).
3. A barrier synchronisation of all the processors (managed by CUDA).
Thus, the cost or time for an algorithm in the bulk synchronous parallel model
can be calculated as the sum of the completion times for each of these terms as
expressed in Equation 3.6:
t = W +H ·G+ |S| × l (3.6)
46 CHAPTER 3. MULTI-GPU SYSTEM BEHAVIOUR
Where t is the time in seconds for the model, W is the work, H ·G is the time to
communicate (H is the size of the communication which is typically in bytes, and
G is the capacity to communicate which is typically in seconds to communicate
one unit of H), |S| is the number of supersteps and l is the cost of synchronisation.
The terms W , H and G can all be represented as a series for each superstep and
so Equation 3.6 becomes Equation 3.7:
t =
|S|∑
s=1
(ws + hs × gs + l) (3.7)
Where among the processors, wi is the maximum computational component,
hs is the maximum amount of communication performed, gs is the capacity to
communicate for hs at superstep s.
One of the key contributions of this section is the definition of the capacity
to communicate for a process, gi, for the multi-GPU architecture. The value
of any element of G is dependent on a large number of factors relating to the
communication network. A simple unit analysis reveals that for the time hs× gs,
where hs is the size of data (bytes), gs must be expressed in terms of seconds
for a number of bytes to be communicated. This is simply the multiplicative
inverse of the bandwidth response function defined earlier and thus results in the
following formula for one superstep of the bulk synchronous parallel model for
the multi-GPU architecture:
ts = ws +
hs
EBW (hs)
+ l
= ws +
hs
a× arctan(b× hs) + l
(3.8)
To demonstrate this model we apply it to the hybrid-staged method for per-
forming the swap operation. This method is interesting as it requires three dif-
ferent bandwidth response functions—host to device (h2d), device to host (d2h),
and device to device (d2d). The validity of the model can be measured by com-
paring the predicted and observed time to swap an amount of data for a particular
system (Infib). The description of the hybrid-staged method translates directly
to three supersteps based on the communication. Expanding the summation of
supersteps results in Equation 3.9 for the time to swap, tswap, a message of size
3.3. MODELLING THE MULTI-GPU SYSTEM BEHAVIOUR 47
n bytes:
tswap(n) =
(
w1 +
n
EBWd2h(n)
+ l
)
+
(
w2 +
n
EBWd2d(n)
+ l
)
+
(
w3 +
n
EBWh2d(n)
+ l
) (3.9)
In the case of the swap operation there is no work component, and measuring
the time to synchronise p devices and the effective bandwidth functions reduces
Equation 3.9 to Equation 3.10:
tswap(n) =
n MB
3.6× 103 × arctan(3.7× 10−2 × n KB)
+
n MB
3.8× 103 × arctan(8.5× 10−3 × n KB)
+
n MB
3.9× 103 × arctan(3.3× 10−2 × n KB)
+ 3× p× 9.4× 10−7
(3.10)
Equation 3.10 expresses the time for the swap operation in terms of the size of
the message. The results comparing this to the measured time for completing the
swap operation using the hybrid-staged method are in Figure 3.8 and show that
this modelling technique supports the observed performance of the system. The
table of results for Figure 3.8 is attached in Appendix C.
The accuracy of the model demonstrated in its application to the hybrid-
staged method on the system used can also be expected to be maintained in the
remaining three methods for the swap operation. This is because all of the swap
operations consist of well defined steps that can correspond to supersteps in the
model. Furthermore, the bandwidth response function for each type of commu-
nication required can be readily derived courtesy of the extensive programming
interface and profiling tools provided by CUDA 4.0 and the available hardware.
48 CHAPTER 3. MULTI-GPU SYSTEM BEHAVIOUR
Figure 3.8: Comparison of model based predicted time and observed time
for the swap operation via the hybrid-staged method for a range of message
sizes. The modelling technique appears to predict the performance with
good accuracy.
Chapter 4
The Multi-GPU 3D FFT
“An algorithm must be seen to be believed.”
—Donald E. Knuth, 1968
In this chapter we analyse the 3D Fast Fourier Transform (FFT), which is a
major component of ultrasound simulations using the pseudo-spectral method, in
the context of the multi-GPU architecture. The 3D FFT as discussed in Section
2.5 contains a combination of computational operations and data rearrangement
operations to execute efficiently. This raises the question as to how to efficiently
map this algorithm onto the parallel device environment described by the multi-
GPU architecture. The memory space of the multi-GPU architecture is inherently
distributed in nature. Consequently, the data rearrangements translate to com-
munication between the memory spaces. Hence we focus on this communication
component which has not been analysed in the context of the multi-GPU architec-
ture until now. Towards this analysis the chapter begins by describing the com-
munication used by the 3D distributed FFT in terms of generic patterns on the
multi-GPU architecture. These descriptions are then used in combination with
the knowledge from the previous chapter on the point-to-point communication
characteristics of the platform to design a general algorithm for the distributed
memory type 3D FFT, and a number of implementations are made of it based
on the various methods for managing the communication between devices. The
chapter then concludes by contributing the results of the performance analysis of
these implementations.
49
50 CHAPTER 4. THE MULTI-GPU 3D FFT
Figure 4.1: The scatter operation on a multi-GPU system. The boxes
surrounding local memories represent distinct GPU devices. Data moves
from the host to the devices.
4.1 Communication In The Distributed 3D FFT
There are three communication patterns of interest that are used in the dis-
tributed memory type of 3D FFT on the multi-GPU architecture. These are the
scatter, gather and all-to-all communication patterns. Due to the heterogeneous
layout of the multi-GPU system’s processing power and memory, the standard
communication patterns observed in Section 2.4 are different. This is largely to do
with the distinction and relationship between the host controller and the devices
attached to it. For example, the scatter operation in the context of a multi-GPU
system is an operation from host memory (in this section host memory is an ab-
straction that can be applied to any main, system, or shared memory resource on
the host) to the device local memories—the initial process on the host does not
retain any current data after the scatter completes—the data is now distributed
among the other processes which are the GPU devices. This can be seen in Figure
4.1. In a typical scatter operation involving five processes—operating on the host
memory and four devices—current data would still remain on all processes. Here
the data is typically scattered from the source, the host’s memory, and is redis-
tributed among the local memories of the devices. This can be performed using
either an implied or imposed ordering among the devices for an added effect. In
the example provided in Figure 4.1, four units of data denoted A, B, C, and D, are
distributed from the host memory and each unit of data now resides in physically
isolated locations in the NUMA memory structure. From the perspective of the
4.1. COMMUNICATION IN THE DISTRIBUTED 3D FFT 51
Figure 4.2: The gather operation on a multi-GPU system. This is very
similar to Figure 4.1. Data moves from the devices to the host.
CUDA framework, the data has been moved from the host memory space to the
device memory space. In the more common scenarios of applications of CUDA,
the process of moving data from host to devices normally only involves one device
and one discrete memory operation, where under the multi-GPU architecture this
involves multiple distinct operations to distribute the data. As can be seen from
the example the scatter operation on the multi-GPU architecture contains n data
unit movements over the PCIe network for n data units; one more data movement
than the regular scatter communication pattern in Section 2.4.
The standard gather operation in Section 2.4 is the inverse form of the scatter
operation. Consequently we observe the same modification to the gather oper-
ation as we did to the scatter when applying it to the multi-GPU architecture.
In the context of this architecture, the data is gathered from the local memories
of the devices and collated in the host memory of the system. This similarity
between the scatter and gather operations on the multi-GPU architectures can
be seen in Figure 4.2. Here, the four units of data A, B, C, and D, are moved from
the local memories of each GPU where they each reside, to the host’s memory. If
the scatter operation and gather operation were to be performed in sequence with
the same ordering then the data in the host’s memory after the two operations
will be identical to the data before the scatter. Like the scatter operation, the
gather operation on the multi-GPU architecture contains one more data move-
ment when compared to its regular form on a generic system with n data unit
movements over the PCIe network for n data units, which can be performed se-
quentially one by another, or in parallel. By this measure both the scatter and
52 CHAPTER 4. THE MULTI-GPU 3D FFT
A	   B	   C	   D	  
+	   −	   ×	   ÷	  
1	   2	   3	   4	  
♠	   ♣	   ♥	   ♦	  
Host/Shared/Main Memory 
GPU0 – Device/Local Memory GPU2 – Device/Local Memory 
GPU1 – Device/Local Memory GPU3 – Device/Local Memory 
(a) Before (b) After
Figure 4.3: Before and after the all-to-all operation. Data moves between
the devices.
gather operations are small operations in the communication patterns considered
in this thesis.
The final communication pattern described in Section 2.4 that is relevant to
the multi-GPU distributed 3D FFT, All-To-All communication, is probably the
least affected when used on multi-GPU systems. This is only if we consider the
devices in the operation and disregard the host in the process. Consequently the
host’s memory is not used at all, and the communication proceeds much like any
other all-to-all communication among the devices and their local memories over
the PCIe network. This can be seen in 4.3.
Before the all-to-all communication, represented by Figure 4.3a, each device
for intents and purposes contains distinct data. In this case symbols from distinct
sets (letters, numbers, mathematical operators, and playing card suits). During
the all-to-all communication operation, each device sends data to every other de-
vice. At the end of this operation as seen in Figure 4.3b, each device now contains
data from every other, and consequently one member from each of the sets. In
analysing this operations on the multi-GPU architecture we should first note that
in comparison to the scatter and gather operations, this process involves more
data. This can be calculated at #d×n, where; #d is the number of processes (in
this case devices) involved, and n is the number of data units used in the scatter
and gather operations. This is also the same as the regular all-to-all communica-
tion pattern in Section 2.4. The number of movements of units of data involved
in this form of all-to-all communication among devices in a multi-GPU system is
like its regular version; significantly more than the scatter and gather operations.
4.2. A GENERAL ALGORITHM FOR THE DISTRIBUTED 3D FFT 53
In the example in Figure 4.3b, it should be noted that the arrows indicating com-
munications are bi-directional implying two communications between each pair of
devices—one in each direction. This means that the calculation for the minimum
number of data unit movements in the all-to-all communication is slightly more
complicated and is as follows in Equation 4.1:
∆M = #d×
(
n−
⌈
n
#d
⌉)
(4.1)
Where ∆M is the minimum number of data unit movements, #d is the number
of processors (in this case just GPUs) involved, and n is the number of data units.
4.2 A General Algorithm for The Distributed
3D FFT
Considering all that has been presented on communication methods and charac-
teristics, we created three implementations for the distributed 3D FFT. These
three implementations are all based on one algorithm with the variations intro-
duced by changing the method used for swapping data in a key phase of the
algorithm, and adjusting it accordingly where necessary. This general algorithm
for the distributed 3D FFT on the multi-GPU architecture can be described in
seven steps as illustrated in Figure 4.4.
If the 3D input matrix in main memory is stored in row-major order such
that elements in the X dimension are contiguous in memory, then the first step
of decomposing and scattering the matrix begins by partitioning the matrix in the
Z dimension into a sequence of contiguous batches, B, with an arbitrary batching
factor or size, |b|. This defines the dimension of a batch as: |b| × |Y | × |X|
elements, and the number of batches (#b) as |Z| ÷ |b|. The batches are then
further decomposed in the Y dimension into a sequence of pencils, P , where the
dimension of a pencil is |p| × |p| × |X| (|p| and |b| are equal) and the number of
pencils (#p) is |Y |÷ |p|. The left part of Figure 4.4a represents the matrix at the
end of this decomposition. The first step is finished when the matrix is scattered
to the devices as described earlier, resulting in the data distribution in the right
side of Figure 4.4a.
The second step is a computational step where 2D FFTs are performed on
each batch in the XY plane. This consists of |b| XY -point 2D FFTs for each
54 CHAPTER 4. THE MULTI-GPU 3D FFT
batch. These can be computed independently and concurrently as indicated in
red in Figure 4.4b.
To complete the 3D FFT each Z component, which is presently distributed
across multiple devices, must be rearranged to be within a device. This is accom-
plished by an Y Z transposition of the pencils as illustrated in step three in Figure
4.4c, and requires at least one all-to-all communication. The communication for
|b| 
…
 
CPU 
GPU0 
GPU1 
B 
(a) Step 1: Decomposition and ini-
tial scatter of data.
…
 
GPU0 
GPU1 
(b) Step 2: Concurrent 2D FFT of
each XY plane.
GPU0 
GPU1 
…
 
(c) Step 3: Y Z transpose of pencils
(all-to-all communication).
GPU1 
…
 
…
 
GPU1 
…
 
…
 
(d) Step 4: Local shuﬄe: pencil-wise
Y Z transpose followed by XY trans-
pose.
…
 
GPU0 
GPU1 
(e) Step 5: Concurrent 1D FFT in
the original Z plane.
GPU1 
…
 
…
 
GPU1 
…
 
…
 
(f) Step 6: Inverse local shuﬄe:
XY transpose followed by pencil-
wise Y Z transpose.
…
 
CPU 
GPU0 
GPU1 
(g) Step 7: Gather—collection of re-
sults.
Figure 4.4: General algorithm for distributed memory type 3D multi-GPU
FFT.
4.2. A GENERAL ALGORITHM FOR THE DISTRIBUTED 3D FFT 55
this is managed by traversing the upper-right triangular set of pencils, and for
source pencil pi, calculating the destination buffer for the pencil, and performing
a data swap. The destination is uniquely identified by the ID number of the
device, dj, a buffer identified by the ID of the batch stored in it bj, and an offset
into the buffer for the number of pencils before it pj (i.e. the destination pencil).
These destination values are derived as follows:
• dj = pi
#p/#d
.
• bj = pi (mod #p
#d
).
• pj = bi.
It should be noted that unlike in an all-to-all communication where the minimum
number of data movements (∆M) is given by:
∆M = #d×
(
n− n
#d
)
this still requires the movement of data between two locations even if they are
within the same device. Therefore the total number of memory operations and
data transfers is:
∆M = n× (n− 1)
which is only equivalent if n and #d are the same.
Now that the data is on the correct device, it must be rearranged to be
contiguous within the local memory of the device for efficient computation. This
is addressed in the fourth step by performing a memory shuﬄing operation on
each batch. What this actually involves is a Y Z transposition on each pencil
(left, Figure 4.4d), followed by an XY transposition on the entire batch (right,
Figure 4.4d). The net effect of steps three and four on the whole matrix is a
Y Z transpose followed by a XY transpose of the entire matrix. This step can
be viewed as a computational component as there is no communication involved
between devices.
With this complete, the final part of the FFT result (the Z component) can
now be calculated. This is completed as step five and consists of |b| × |Y | 1D
FFTs each of size Z (Figure 4.4e).
The 3D FFT of the data now exists in the device memory space, albeit in a
slight permutation. What remains is to restore this result to main memory in its
56 CHAPTER 4. THE MULTI-GPU 3D FFT
original configuration. This is started in step six, which is simply the inverse of
step four, the shuﬄe operation—the XY transpose of the batch is performed first,
then the pencil-wise Y Z transpose. Then it is finished in step seven by collecting
the result in main memory with an out-of-place multi-GPU gather operation.
The out-of-place refers to the fact that this collection operations uses a mapping
function similar to that of step three to place the pencils in a different position if
the distributed memory space was abstractly assembled to represent the original
matrix. This is done to reverse the effects of the pencil-wise Y Z transpose of step
three and is illustrated in Figure 4.4g.
4.3 Specific Implementations
As mentioned earlier, the following implementations are based on variations of
the swap operation performed in step three of the general algorithm—the most
critical and involved communication phase. From the previous chapter we can see
that the kernel and pointer swapping methods are the most viable and can be used
in the above algorithm directly, thus creating two of the three implementations—
referred to as the Kernel and Pointer Swapping implementations respectively.
The last one is based on the via CPU method. This implementation is present
to contrast the other implementations using direct device-to-device communica-
tions, and represent what is possible when not utilising the UVA features of the
multi-GPU architecture. Using this method effectively requires some minor mod-
ification to the algorithm. Instead of a scatter operation in the first step, the
batches are managed individually, and after step two they are copied to the ad-
ditional main memory buffer using the out-of-place memory mapping of pencils.
Between this and the subsequent movement of the newly formed batch, the same
result as the all-to-all communication is achieved. What this means is that the
matrix is maintained in main memory, and batches and pencils are transferred on
demand via the CPU to and from the GPUs for processing. This version, using
the via CPU method, will be referred to as the Simple implementation.
Using the terminology of the bulk synchronous parallel model (Section 3.3)
the general algorithm can be performed in three supersteps. The first superstep
contains steps one and two, the second contains step three and no computation,
and the third containing the remainder. This suggests that there are at least two
barrier synchronisations that are unavoidable. Barrier synchronisation for each
4.4. EQUIPMENT 57
Table 4.1: Performance for assorted batch sizes and the Simple FFT im-
plementation across 6 GPUs with a 10243 complex single-precision input
Batch Size
Time (seconds)
Size of Pencils
1 Stream 2 Streams
1 8.96 8.72 8 KB
2 7.27 7.30 32 KB
4 6.35 6.42 128 KB
8 6.33 6.33 512 KB
16 6.17 6.15 2 MB
32 6.77 6.57 8 MB
device exists as the only intrinsic method for synchronisation, which is achieved
by iterating for each device and blocking until the asynchronous queue has been
emptied.
4.4 Equipment
Two multi-GPU systems were used for the distributed multi-GPU 3D FFT: the
Trinity system (Appendix A.2) for development, debugging and profiling; and the
Infib system (Appendix A.1) for the performance testing of 3D FFTs.
In addition to the information presented in Section 3.1.5, the most relevant
detail about the Infib system is the total aggregated size of the local device
memory space, which is 10.5 GB. However, due to the manufacturer’s fault (see
Section 3.1.5) the kernel method has access to at most a total of 6 GB of device
memory across four devices in a unified address space. These values place the
upper limit on the maximum size of the FFT that can be performed, depending
on the method chosen.
4.5 Results
For all three of these implementations the batch size |b| is one parameter which can
affect the performance in two ways. Essentially the larger the message sent over
58 CHAPTER 4. THE MULTI-GPU 3D FFT
the PCIe the more likely it is to saturate the bandwidth and transfer efficiently,
therefore we expect larger batch sizes to perform faster. Thus, the effect of varying
it is the first result we analyse. Table 4.1 demonstrates the effect of varying the
batch size for the simple implementation of the distributed 3D FFT.
The first way this impacts performance is because this value directly changes
the size of the largest contiguous set of data that is accessed and transferred,
i.e. the size and number of transfers on the PCIe, which if not sufficiently large
will not transfer efficiently. This leads to at least #d× (#d− 1) device-to-device
transfers and at most |Z| × (|Z| − 1). These two cases occur as the number of
transfers (#t) and the size of each transfer (|t|) are dependent on the batch size:
#t =
Z
|b| −
#b
#d
|t| = |b| × |Y | × |X|
Poorer performance is expected in the second case, when the batch size is small, as
there is a greater number of smaller transfers which are less likely to saturate the
PCIe bus. In order to saturate the PCIe bus and operate at maximum bandwidth,
sufficiently large messages are required. This can be seen in the speed up as the
batch size increases, and the 25% improvement from a size of 1 to 32. Using
a larger batch size will only lead to an improvement to a certain extent as a
pipeline will always contain a start up and finish time where all resources are not
in use concurrently as parallel resource usage can not occur until the initial unit
has completed its first task—increasing the batch size will increase this time as
witnessed when moving from a batch size of 16 to 32.
The second way in which the batch size can affect performance is in the
application of streams. This factor defines the size and consequently the time
of the computations and communications. A shorter time results in a higher
granularity of tasks (kernels, or memory transfers) to be scheduled between the
copy and compute engines of the device. If the computational component is
significant, then the batch size should be balanced according to the bandwidth to
efficiently overlap computation and communication. The difference between the
results with one stream (no overlap) and two streams (some overlap) in Table 4.1
suggests that this component is insignificant, accounting for approximately 4% of
the total time, as the times do not differ significantly when overlapped. We also
observe some loss of performance for specific batch sizes when streamed, which
4.5. RESULTS 59
Time 
G
P
U
0 
G
P
U
1 
M
em
cp
y 
M
em
cp
y 
K
er
ne
l 
K
er
ne
l 
H2D 
D2H 
H2D 
D2H 
Other 
FFT 
Other 
FFT 
(a) Simple FFT Implementation Profile
Time 
G
P
U
0 
G
P
U
1 
M
em
cp
y 
M
em
cp
y 
K
er
ne
l 
K
er
ne
l 
H2D 
D2D 
Other 
FFT 
Other 
FFT 
D2H 
H2D 
D2D 
D2H 
(b) Pointer Swapping FFT Implementation Profile
Figure 4.5: Timeline of GPU activity showing memory transfers (brown),
in-device data rearrangements (green), and FFT kernels (blue) for the Sim-
ple and Pointer Swapping implementations of the general algorithm for the
distributed 3D FFT. The computational part is small compared to the
memory operations and overlapped in most cases.
we attribute to the dynamic nature of the device scheduler and the alignment of
data in memory.
In order to confirm this, the implementations were profiled using the NVIDIA
Visual Profiler [41]. Figure 4.5 provides a timeline of GPU activity for the Simple
and Pointer Swapping implementations. Each profile is divided into two sections:
the inter-device activity (communication through memcpy() type routines), and
intra-device activity (computational kernels and data rearrangement kernels). In
the inter-device section the first two rows of the profile for the Simple imple-
mentation and three of the Pointer Swapping implementation show the timing
and duration of the host-to-device, device-to-host, and device-to-device memory
60 CHAPTER 4. THE MULTI-GPU 3D FFT
transfers. In the intra-device section, the first two rows of each are the data re-
arrangement kernels and the rest are computational kernels related to the FFT.
The computational kernels all display very short durations for execution com-
pared to all other activity, and are also mostly overlapped by the other activity.
This confirms that the computational component is not significant and is hidden.
In Table 4.2 we present results for the three implementations on multiple
GPUs alongside results for FFTW on 6 and 12 CPU cores and CUFFT (where
applicable). Each FFT was applied to complex 3D single-precision cubic matrices
with dimensions 5123, 7683, and 10243. The performance was measured as the
average of the times taken to perform the FFT. Times were recorded for batch
sizes 1, 2, 4, 8, 16, and 32, and for one and two streams per device and the fastest
average times are reported in the table. The results for the Simple implementa-
tion provides a baseline for the versions using direct device-to-device transfers in
memory management. The performance of this version is slower in all cases to
the reference CPU (FFTW) and GPU (CUFFT) versions.
The results for the Pointer Swapping implementation are not available for
10243, and 7683 for two and four GPUs, as the total device memory is not enough
for the swap method which requires 7 GB for 7683 and 16 GB for 10243. Similarly,
for the Kernel implementation for 10243 and 7683 with two GPUS. When using
six GPUs, the Pointer Swapping implementation stages some communications
through main memory, and the Kernel implementation can not be run due to the
hardware bug in the Intel IOH chips preventing the UVA accessing more than
four devices.
In all cases the implementations using direct device-to-device transfers were
faster than the Simple version communicating via host memory. The greatest
speed up was observed with the Pointer Swapping implementation on two GPUs
over the 5123 matrix where the performance improved by 49%.
Despite the kernel method approach for swapping performing better in the
effective bandwidth tests (Chapter 3), the performance of the Kernel implemen-
tation was in general worse than the Pointer Swapping version. An explanation
for this performance is the limit in the resource configuration of kernels. Another,
is that the kernel both involves transfers and computation and consequently does
not schedule well. It should be noted that the Kernel implementation requires
half the memory of the other implementation which allows it to compute the 7683
matrix with only four GPUs. Although using fewer GPUs, this was slower than
4.5. RESULTS 61
Table 4.2: Time (seconds) to Compute 3D Complex FFT. Results presented
as a dash (-) indicate unobtainable values due to the memory limitations
given by the number of GPUs.
Size
Implementation Cores/GPUs 5123 7683 10243
Simple
1 1.74 6.32 14.05
2 1.21 4.26 9.81
4 0.81 2.78 6.52
6 0.77 2.64 6.16
Pointer Swapping
2 0.61 - -
4 0.50 - -
6 0.64 2.07 -
Kernel
2 0.80 - -
4 0.74 2.46 -
CUFFT 1 0.56 - -
FFTW
6 0.61 2.01 5.64
12 0.30 1.07 2.82
the Pointer Swapping implementation with six GPUs by 16%. This suggests that
the slower transfers staged through main memory with pointer swapping may
have been masked by the other device-to-device communications.
Compared to CUFFT on one GPU and FFTW on multiple cores, all three
implementations were generally slower. However, at 5123 the Pointer Swapping
implementation over four GPUs was found to outperform CUFFT (a single GPU)
by 12%, and FFTW when six cores on a single CPU were used by 18%. FFTW
over 12 cores (both CPUs) was still faster in these conditions by at least 40%.
As final point of analysis we present in Table 4.3 the predicted performance
using the model in Chapter 3, of the Pointer Swapping implementation on four
devices for 5123 input along side the observed values for a variety of batch sizes
for verification. The table also includes predicted values for 10243—which are
experimentally unverifiable—as if the intended functionality of the system were
present, or if sufficient memory was available on the devices, using the model
62 CHAPTER 4. THE MULTI-GPU 3D FFT
Table 4.3: Model based prediction for the execution time (seconds) of the
distributed 3D FFT for select input size and batch sizeswith four GPUs.
Batch Size
5123 10243
Observed Predicted Predicted
1 2.91 3.24 13.46
2 1.64 1.68 7.59
4 1.00 0.95 5.06
8 0.71 0.63 4.04
16 0.58 0.50 3.62
32 0.50 0.45 3.43
defined in Chapter 3. The main detail of the model used is in the expression of
the communication cost in the host-to-device, device-to-device and device-to-host
transfers. This is simplified and taken as the number of batches per device by
the size of a batch divided by the bandwidth function for the given message size
(defined by the pencil size). Equation 4.2 shows this expression for the time to
compute the host-to-device transfer, tH2D:
tH2D =
#b
|d| × |b| MB
a× arctan(b× |b| KB (4.2)
The correlation between the predicted and observed results in Table 4.3 for an
input of 5123 indicates that the modelling technique in Chapter 3 has a reasonable
accuracy for predicting the performance of more complicated algorithms such as
the distributed FFT. The predictions for the input size of 10243 suggest that the
performance of the Pointer Swapping implementation will scale and maintain its
relative performance between one and two six-core CPU processors.
The conclusion that can be drawn from these results is that although the
performance of the multi-GPU architecture for the FFT algorithm is not always
faster than existing CPU implementations, depending on the configuration used
it is possible to achieve speed ups with existing hardware. The observed perfor-
mance is due to the limitations of PCIe 2.0 and a hardware bug in the Intel IOH
5520 controller. This indicates that future generations of the hardware, and more
4.5. RESULTS 63
efficient applications of the algorithm with better integration into the data flow
of an ultrasound simulation will result in significant speed ups. At the very least,
the migration of this component of an ultrasound simulation to the multi-GPU
architecture should not impede the performance of the entire simulation and risk
negating any performance improvement yields from the other components of the
simulation on the multi-GPU architecture.
64 CHAPTER 4. THE MULTI-GPU 3D FFT
Chapter 5
Application: k-Wave
“People think that computer science is the art of
geniuses but the actual reality is the opposite, just
many people doing things that build on each other,
like a wall of mini stones.”
—Donald E. Knuth
The purpose of the following chapter is to assess the viability of the multi-GPU
architecture for the K-Wave ultrasound simulation application. This is achieved
by adapting the simulation algorithms and off-loading a significant computational
load, the 3D FFT, from the CPU to the GPUs. The chapter begins with a descrip-
tion of the equations for an ultrasound simulations and their implementation—the
k-Wave toolbox. Then, we present a technique to efficiently use the multi-GPU
architecture and the multi-GPU FFT from the previous chapter in conjunction
with the CPU, and then apply it to this simulation to create our prototype im-
plementation of the equations. This chapter concludes with an analysis of the
performance results for a small selection of configurations sufficient to demon-
strate the technique on an existing multi-GPU system.
5.1 Equations in Ultrasound Simulation
The k-Wave application [62] is a large Matlab toolbox for acoustic and ultrasound
processing and simulations. It was released in open source in November, 2012 but
has been in active beta since July, 2009 [63]. K-Wave has many features in terms
65
66 CHAPTER 5. APPLICATION: K-WAVE
(a) Thresholded IVP
voxel-plot (b) t=0 (c) t=25
(d) t=50 (e) t=100 (f) t=200
Figure 5.1: Visualisation of initial value problem ultrasound simulation:
a) 3D/voxel plot of pressure with threshold shows initial condition (two
pressure regions); b–f) pressure recorded through a single plane of the 3D
space at time steps 0. 25, 50, 100 and 200.
of its ability to model different aspects of ultrasound simulations which allow it
to account for non-linear wave propagation, heterogeneous mediums, and acous-
tic absorption by the medium. The k-Wave toolbox achieves this functionality
through the solution of a number of Partial Differential Equations by the k-space
pseudo-spectral method.
For the purpose of this chapter we will not use all of these features and instead
use a simpler simulation that still shares the same form of computation. The
features of the simulation are that it uses a linear wave propagation model in
a heterogeneous, lossless medium. It is simpler only in that a few equations
and terms are not included, resulting in a less complex model. This simulation
is a 3D initial value problem: where the model is used to simulate and detect
the pressure over time from an initial distribution within a three dimensional
propagation medium. Figure 5.1 provides a visualisation of the simulation.
The profile of a simulation accounting for absorption which requires fourteen
transforms per time step presented by Jaros et. al. in [29] and reveals that close
to 57% of the time is spent in computing the 3D Fourier Transforms, with the
remaining consisting of many element wise addition and multiplication operations
5.2. CPU–MULTI-GPU PIPELINING FOR SIMULATION ACCELERATION67
between scalars, vectors and matrices.
The equations that govern the simple simulation in this chapter are given in
Equations 5.1 to Equation 5.4 (for further details see Treeby et. al. 2012 [64]).
For this simple simulation in three dimensions there are a total of ten transforms
per time step. Equation 5.1 requires one forward transform, F(pn), which can
be shared among the dimensions and then three inverse transforms when ξ is
equal to X,Y and Z respectively. Similarly Equation 5.3 requires three forward
and three inverse transforms corresponding to one forward and inverse for each
dimension. By extrapolation, it can be expected that at least 40% of the time
for this simpler simulation is similarly spent in FFTs. Given the multi-GPU
FFT from Chapter 4 and the capability to overlap tasks between CPU and GPU
provided by the asynchronous execution model, in the best possible case up to
this amount of computation can be hidden using the technique in the following
section.
∂
∂ξ
pn = F−1 (i · kξ · κ · F (pn)) (5.1)
un+1ξ = u
n
ξ −
∆t
ρ0
∂
∂ξ
pn (5.2)
∂
∂ξ
un+1ξ = F−1
(
i · kξ · κ · F
(
un+1ξ
))
(5.3)
ρn+1ξ =
ρnξ −∆tρ0 ∂∂ξun+1ξ
1 + 2∆t ∂
∂ξ
un+1ξ
(5.4)
5.2 CPU–Multi-GPU Pipelining for Simulation
Acceleration
The performance of the simulation on a suitable multi-GPU system can be im-
proved by staggering the initiation of computation on the CPU with communi-
cation and computation on the GPU such that the time each spends idling is
minimised. The technique we propose is to interleave intermediary results for
the dimensions being calculated between the CPU and GPU processors. The
potential for improved efficiency is conceptually similar to the use of instruction
pipelines in conventional CPU resource scheduling.
In order to best describe this technique’s application to this simulation, we
express the equations in terms of computer code in the form of the original Matlab
implementation from which the CUDA version is derived. The technique can only
68 CHAPTER 5. APPLICATION: K-WAVE
1 p k = f f t n (p) ;
2 ux sgx = bsxfun (@times , pml x sgx , bsxfun (@times , pml x sgx ,
ux sgx ) − dt . / rho0 sgx .∗ real ( i f f t n ( bsxfun (@times ,
ddx k sh i f t po s , kappa .∗ p k ) ) ) ) ;
3 uy sgy = bsxfun (@times , pml y sgy , bsxfun (@times , pml y sgy ,
uy sgy ) − dt . / rho0 sgy .∗ real ( i f f t n ( bsxfun (@times ,
ddy k sh i f t po s , kappa .∗ p k ) ) ) ) ;
4 uz sgz = bsxfun (@times , pml z sgz , bsxfun (@times , pml z sgz ,
uz sgz ) − dt . / rho0 sgz .∗ real ( i f f t n ( bsxfun (@times ,
dd z k sh i f t p o s , kappa .∗ p k ) ) ) ) ;
5 duxdx = real ( i f f t n ( bsxfun (@times , ddx k sh i f t n eg , kappa .∗
f f t n ( ux sgx ) ) ) ) ;
6 duydy = real ( i f f t n ( bsxfun (@times , ddy k sh i f t n eg , kappa .∗
f f t n ( uy sgy ) ) ) ) ;
7 duzdz = real ( i f f t n ( bsxfun (@times , ddz k sh i f t n eg , kappa .∗
f f t n ( uz sgz ) ) ) ) ;
Figure 5.2: Matlab code relevant to equations in the k-space pseudo-
spectral method involving Fourier Transforms.
be used with Equation 5.1 and Equation 5.3 and these correspond to seven Matlab
operations which are reproduced in Figure 5.2 from [29]:
The first line of Figure 5.2 computes the inner forward FFT component of
Equation 5.1 and stores the result in a temporary variable p_k for reuse. Lines
2–4 correspond to Equation 5.2 which uses the result of Equation 5.1 (computed
in-line using p_k) for each dimensions, hence why it appears repeated three times.
The bsxfun() used throughout is a function provided by Matlab to map an
operation element-wise between two matrices which do not have to be of the
same rank, i.e. it allows a matrix to be represented as a scalar or vector saving
memory while still performing an element-wise operation between matrices. Lines
5–7 represent Equation 5.3 similarly to lines 2–4 and Equation 5.2 and using the
result of Equation 5.2 in each dimension for each dimension, hence the repetition
for X, Y , and Z.
The key to this technique is in identifying the data dependencies between
various terms in the equations and restructuring the code accordingly. In this
case the simulation is dimensionally independent—the majority of the data for
calculating the simulation component in the X dimension is independent of the
data in the Y dimension. The data that is shared between the dimensions, such
as p_k, is used in a read-only access and so does not affect the rearrangement
of the calculation of terms so long as it is complete when the term is required.
Similarly within a dimension, for example the X dimension, the only constraints
apparent from Figure 5.2 is that the term ux_sgx be computed before the cal-
5.2. CPU–MULTI-GPU PIPELINING FOR SIMULATION ACCELERATION69
culation for duxdx can begin. The idea then behind this technique is that the
GPU can prepare data for the next dimension rank, while the CPU can finish
the computation for the given dimension. This means that while the CPU within
the system is computing the result for the X dimension, the GPUs can begin
preparing data for the computation in the Y dimension, as indicated by Figure
5.3. The only additional requirement of this technique is the synchronisation
necessary between the preparation–computation/GPU–CPU to avoid a race con-
dition in the calculation within a dimension. Thus, in a 3D simulation if each
dimension has a preparation and calculation phase, then we can consider six units
of computation. In the ideal case the time spent in the calculation phase would
closely overlap the preparation phase, or vice-versa, effectively hiding two units
of computation and leading to a theoretical speed up of at most 33%.
Figure 5.4 describes for the Matlab code lines 1–4 in Figure 5.2, the restruc-
turing and division of work between the GPU on the left with CPU operations
that can be computed concurrently on the right with horizontal lines represent-
ing the required synchronisation between the GPUs (Matlab code lines 5–7 are
attached in Appendix D). In the first unit of execution the GPU prepares the
Fourier Transform of the pressure matrix (p_k) while the CPU prepares the dot
product of kappa and ddx_k_shift_pos. A barrier synchronisation is then used
so that p_k has been prepared and is used to compute the intermediary result Ax.
The GPU is left idle during this unit of execution as its next task is to compute
the inverse Fourier Transform of Ax. During that, the CPU prepares the inter-
mediary Y dimension result from the dot product of p_k, ddy_k_shift_pos, and
kappa. After the synchronisation the next unit of execution prepares the inverse
Fourier Transform of Ay, while the CPU prepares the Az intermediary result.
Optionally, the first intended product, ux_sgx can be computed by the CPU in
this unit of execution or the next as found in Figure 5.4. This pattern continues
Time 
GPU 
CPU 
Preparation 
Computation 
Legend: X	  
X	  
Y	  
Y	  
Z	  
Z	  
Figure 5.3: Technique for GPU-CPU pipelining in 3D: overlapping prepa-
ration (grey) and computation (white).
70 CHAPTER 5. APPLICATION: K-WAVE
GPU CPU
p k = f f t n (p) ; Ax = bsxfun (@times , ddx k sh i f t po s , kappa ) ;
( no op ) Ax = p k .∗ Ax;
IFx = real ( i f f t (Ax) ) ; Ay = p k .∗ bsxfun (@times , ddy k sh i f t po s , kappa ) ;
IFy = real ( i f f t (Ay) ) ; Az = p k .∗ bsxfun (@times , dd z k sh i f t p o s , kappa ) ;
IFz = real ( i f f t (Az) ) ;
ux sgx = bsxfun (@times , pml x sgx , bsxfun (@times ,
pml x sgx , ux sgx ) − dt . / rho0 sgx ∗ IFx ) ;
uy sgy = bsxfun (@times , pml y sgy , bsxfun (@times ,
pml y sgy , uy sgy ) − dt . / rho0 sgy ∗ IFy ) ;
( no op )
uz sgz = bsxfun (@times , pml z sgz , bsxfun (@times ,
pml z sgz , uz sgz ) − dt . / rho0 sgz ∗ IFz ) ;
Figure 5.4: GPU-CPU overlapping for Equations 1-4 from Figure 5.2. Hor-
izontal lines represent barrier synchronisations.
until the last synchronisation which is required to ensure that the inverse Fourier
Transform of Az is ready for the calculation of uz_sgz.
In terms of actual implementation details, the entire process is managed with
nested parallelism provided by OpenMP. Two outer-level threads control the
GPU and CPU. These threads spawn inner-level threads defined by the run-time
configuration to perform work sharing among the CPU cores and multiple GPUs.
This way, OpenMP’s built-in barrier synchronisation can be used by the outer-
level threads to control the progression through the execution units described.
5.3 Results
The purpose of this section is as a proof-of-concept for the implementation of
ultrasound simulations on multi-GPU systems. To achieve this, an initial value
problem as described in Section 5.1 was created for the spatial domain sizes of 323,
643, 1283, 2563 and 5123. Varying the problem domain in this manner will change
the size of the data transferred over the PCIe network and provide information
on the communication overheads and bandwidth utilisation in the context of a
real-world application. The Infib machine (see Appendix A.1) was used again for
these simulations. The simulation was run using the CPU–Multi-GPU pipelining
technique (pipelined version) for a small number of time steps and the average
time per step was compared to an optimised C++ version [29]. The multi-GPU
FFT of Chapter 4 was also directly substituted into C++ version replacing the
5.3. RESULTS 71
Table 5.1: Original C++ Version (CPU only): Time Per Step (seconds)
of Initial Value Problem k-Wave Simulation, Assorted Number of Threads
and Problem Sizes.
Size3
Number of Threads
1 2 4 6
32 0.002 0.001 0.001 0.001
64 0.02 0.013 0.006 0.005
128 0.195 0.105 0.067 0.052
256 2.17 1.26 0.659 0.521
512 20.9 10.7 6.31 5.68
Table 5.2: Non-Pipelined Version (CPU ‖ GPU): Time Per Step (seconds)
of Initial Value Problem k-Wave Simulation, Assorted Number of Threads
and Problem Sizes.
Size3
Number of Threads
1 2 4 6
32 0.039 0.04 0.04 0.04
64 0.084 0.082 0.082 0.082
128 0.242 0.234 0.221 0.217
256 1.23 1.05 1.034 1.02
512 8.62 7.45 7.2 7.14
FFTW library to create another version (non-pipelined) which would demon-
strate the effect of restructuring the simulation. The number of threads used
by the simulation was also varied to reflect the current number of physical cores
available in common desktop machines. However, the number of GPUs used was
fixed at four, with a suitable batch size chosen for the domain size, as this is
sufficient to show the proof of concept, and the performance of the many other
configurations can be extrapolated from the model and results of Chapter 4. The
results measured are presented in Tables 5.1–5.3.
These results reveal that while the C++ version of the simulation scales with
72 CHAPTER 5. APPLICATION: K-WAVE
Table 5.3: Pipelined Version (CPU & GPU): Time Per Step (seconds) of
Initial Value Problem k-Wave Simulation, Assorted Number of Threads
and Problem Sizes.
Size3
Number of Threads
1 2 4 6
32 0.049 0.049 0.049 0.049
64 0.094 0.088 0.087 0.087
128 0.224 0.215 0.204 0.21
256 0.995 0.90 0.83 0.885
512 6.27 6.26 5.82 6.2
the number of active threads, the multi-GPU pipelined version does not appear to
scale significantly. The conclusion to draw from this is that under these configura-
tions the simulation is completely bounded by the FFT component: the observed
time for the 5123 FFT with four GPUs from Section 4.5 was 0.50 seconds, and
with ten FFTs this gives 5 seconds per time step in calculating the transform in-
dicating the remaining element-wise operations are either successfully overlapped
or insignificant.
Comparing the C++ and non-pipelined versions also provides support for this
observation. As the number of threads increase in the non-pipelined version the
speed increase compared to the C++ does not scale accordingly. As the time for
each FFT for each domain size is essentially constant this indicates the scaling
observed in the C++ version is achieved in the performance of the FFT.
The multi-GPU version pipelined with four threads confirms the proof of con-
cept with the 5123 problem where it is 0.5 seconds faster per time step than
the C++ version (a speed up of 8%) showing the viability of the technique and
multi-GPU systems. This result is significant considering that these simulations
typically involve hundreds to several thousands of time steps, and the standard de-
viation across run times is negligible. The results using four threads are graphed
in Figure 5.5, and from these plots it can be seen that while the multi-GPU
pipelined version is slower than the C++ for small problems, it performs better
as the domain size increases until its performance surpasses the C++ version.
This can be explained through the effective bandwidth ratio between the CPU
5.3. RESULTS 73
Figure 5.5: k-Wave initial value problem simulation with four threads com-
paring a C++ only version to a CPU–Multi-GPU pipelined version.
and GPU as the domain size, and consequently the size of the individual com-
munications increases.
This result is further reflected in the comparison between the non-pipelined
and pipelined versions as the domain size increases. Over the problem sizes of 323
and 643 the non-pipelined version is faster regardless of the number of threads
used. However, when the problem is at 1283 or larger the pipelined version
performs faster. The most likely explanation for this is that at the smaller sizes
the more highly optimised C++ component of the simulation may make better
utilisation of data locality in the CPU caches. Once the size of data exceeds
cache limits and main memory accesses increase the performance of the pipelined
version and its ability to overlap these becomes favourable.
The performance of the multi-GPU system under six threads in comparison
to four can be explained by a combination of factors. First, each processor in
the Infib system supports six physical cores and so when running six inner-level
processing threads along with two CPU/GPU task threads there will inevitably
be a larger number of context switches incurring an overhead. The four inner-
level GPU management threads should not interrupt significantly as they are
mostly idle after populating the asynchronous queues of the devices. Secondly
and quite simply: more threads increase the cost of synchronisation at the end
of each parallel section.
From these observations we can surmise that critical to making effective use
74 CHAPTER 5. APPLICATION: K-WAVE
of this technique of multi-GPU pipelining is to balance the ratio of work in the
preparation phase. As the multi-GPU pipelined version of the code is significantly
structurally different to the original C++ code direct comparison of the parallel
speed up is difficult. Further exploration of this is possible by moving more
computations from the CPU to the GPU space (e.g. computing p k * Ax, etc.).
However, this requires more data in the GPU memory space but is necessary
in determining whether a CPU only, multi-GPU pipelined, or multi-GPU only
solution is most efficient.
Chapter 6
Conclusion
“Chop Chop.”
—Alistair P. Rendell
This thesis is brought to a conclusion in three parts. First a brief summary of
the work contained is presented. Second a discussion is raised on the implications
of next generation components and technologies, and the use of other hardware
for the performance of the multi-GPU architecture. Third, the thesis finishes
with some suggestions for future work, which would consolidate the application
of ultrasound simulations on the multi-GPU architecture.
6.1 Summary
The goal of this thesis was to define and characterise the performance of multi-
GPU systems and assess the viability of multi-GPU systems for ultrasound sim-
ulations. There are a number of reasons to attempt this and why a multi-GPU
system is a potentially suitable platform for these types of applications. The per-
formance of GPUs for general purpose programming and numerical methods is
already well accepted [49]. A multi-GPU system allows access to an aggregated
memory source extending the available memory beyond the limits of a single
device while simultaneously allowing a reduction in computation time through
parallel processing. Our approach to achieving this goal was performed in three
stages each building upon the previous and presented in Chapters 3 through to
Chapter 5.
75
76 CHAPTER 6. CONCLUSION
In Chapter 3, we began by analysing a multi-GPU system from existing hard-
ware in a shared memory system and conducted a benchmark of various methods
of communication within the system. Four methods for communication were
identified: a method communicating exclusively via the system’s shared memory
and CPU; a method using pointer swapping between buffers and direct device
to device communications; a hybrid method combining direct device to device
communication with staged transfers through shared memory; and the use of ker-
nels to manage the data through DMA transfers directly. These methods were
benchmarked by measuring the effective bandwidth achieved when swapping data
between two GPUs in the system. The results of this, presented in Chapter 3,
identified a significant bottleneck in communication between the CPU and GPUs
depending on the method used to manage communication with techniques using
kernels and pointer swapping found to have the best performance. The perfor-
mance of the multi-GPU platform is further characterised by the modelling of the
communication bandwidth as a function of the size of the message being sent.
This allows for the prediction of the performance of various algorithms using an
expression derived from the bulk synchronous parallel modelling technique for
the execution time of an algorithm, and thus the speculation of the performance
of hypothetical systems. The model is shown to have good accuracy in predict-
ing the performance of the swap operation and supports the observation of the
bottleneck identified in the benchmark results.
Chapter 4 involved defining and implementing various generic communication
functions for the multi-GPU architecture with a degree of efficiency gained from
the observations from the benchmark results in Chapter 3. These functions are
then used to implement and measure the performance of the distributed 3D Fast
Fourier Transform (FFT) with the intent of applying it to an ultrasound simula-
tion. This implementation of the distributed 3D FFT is also described in terms
of the bulk synchronous parallel model in three super steps. First, the data is
distributed in batches, and the 2D FFT on each slice is performed. Second, the
data is redistributed using pencils amongst the devices to make it contiguous in
the third dimension. The predictions of the model and the observed profile reveal
this to be the most time consuming component of the algorithm. Third, the 1D
FFTs are performed over the remaining dimension, and the data gathered back
to the host. The 3D FFT distributed among the devices of a multi-GPU system
was found to be faster in some configurations than a single multi-core CPU. More
6.2. DISCUSSION 77
specifically, computing the FFT for a 5123 matrix using four NVIDIA GTX 580
devices was 18% faster than using all six cores on an Intel Xeon X5650 processor,
and 12% faster than using a single device. Furthermore, by using the different
methods for data swapping with the second super step the most efficient method
in application was revealed to be by pointer swapping between active and pend-
ing buffers using asynchronous device to device memory transfers in a UVA. This
was found to be up to 49% faster than communication via the shared memory
and CPU, clearly showing that direct device to device transfers are a key factor
in obtaining high performance on multi-GPU systems.
Chapter 5 was a proof of concept that was created by modifying an existing
ultrasound simulation to use the multi-GPU distributed 3D FFT. The modifica-
tions made were to improve the time scheduling efficiency of the computational
resources by overlapping tasks between the CPU and GPUs minimising the time
each spends idling. This is made possible by a degree of independence in the
computation between each spatial dimension in the simulation. The duration of
the average time step for a simple simulation reveals that this initial application
of this technique without further typical optimisations can effectively overlap and
mask components of an algorithm, and resulted in an 8% speed up in comparison
to a production version of a CPU only simulator in the same configuration.
6.2 Discussion
The performance of simulations like those of ultrasound propagation and algo-
rithms such as the distributed 3D FFT on the multi-GPU architecture are of
particular interest as these are memory bound problems. It would be expected
that memory bound problems like this will have poorer performance and effi-
ciency when solved on a processor such as a GPU when compared to a CPU as
the design and resource allocation are more devoted to computation than memory
management. Many proponents of GPGPU programming cite examples where
the execution time of CPU programs can be sped up significantly, often in the
order of twenty times. [28][49][52]. These orders of speed up are only possible
where the most significant component of a problem is computational. The reason
to attempt this is to quantify the impact on performance when such problems
are implemented on a multi-GPU architecture. Amdahl’s law [2] implies that the
maximum improvement is bound by the sequential component. Given that the
78 CHAPTER 6. CONCLUSION
profile of the ultrasound simulation revealed that the FFT occupies no more than
40% of the computational load, the most optimistic improvement that can be ex-
pected will be in the magnitude of 1.5 times the original speed. If the performance
decreases significantly, then the multi-GPU architecture will not be suitable for
ultrasound simulations. If the cause of the performance impact can be identified,
then considerations can be taken into account with subsequent generations of the
architecture to address them.
The improved performance of the simulation in select configurations indicates
that despite the deficiencies in existing hardware towards the task, it is currently
plausible to achieve a greater cost-performance efficiency through the multi-GPU
architecture than with other traditional cluster or component upgrades. The price
of a single node or machine for high performance computing tasks at present can
easily reach $1000, and with the choice of high-end or server quality components,
the price can readily exceed $10 000. In contrast, current consumer grade GPU
devices can be bought for around $100, with ‘enthusiast’ and high performance
devices costing at most several thousand dollars. This means that a suitable
machine could be retrofitted with almost any number of additional GPU devices
with significantly less cost than the purchase of a new machine, or the addition
of another node to a cluster, and with better performance.
Another significant expense is the cost of communication between nodes. Even
the fastest interconnects between nodes can not compete with the performance
and benefits from locality in intranode connections. This is especially true for
CPU communication: two single CPU nodes communicating via MPI will typ-
ically be significantly slower than a single node containing two CPUs in a dual
socket arrangement communicating via OpenMP. In fact it would not be uncom-
mon for the performance of a problem to be worse on a small cluster than on a
single machine if the resources of a single machine were sufficient to efficiently
complete the task due to the overhead and latency of communication [32].
The reason for this performance impact is due to the fact that the communica-
tion between nodes inevitably moves over the PCIe network. The communication
pathway used between CPUs in different nodes typically uses a peripheral device
such as the Mellanox Infiniband card found in the Infib system on one node and
a matching device on the other which are used to relay messages on the remote
PCIe networks to their corresponding CPUs. The extent of the bottleneck caused
by the PCIe network becomes more apparent when considering the ratio of band-
6.2. DISCUSSION 79
width available between the CPUs, GPUs and various memory resources involved
in a multi-GPU system. Taking the Infib system for example again, comparing
the bandwidth between a single GPU’s compute cores and memory of 200 GB/s,
or the CPU cores and host memory of 25 GB/s, the bi-directional limit of 8 GB/s
with PCIe 2.0 is significantly limiting. The results of benchmarking the 3D FFT
at 5123 confirm this where using two GPUs achieves at best a PCIe bandwidth
of about 8 GB/s during the critical device to device communication resulting in
performance that is 12% slower than using a single GPU version, while using four
GPUs allows access to a greater aggregate PCIe bandwidth of 16 GB/s resulting
in a 12% speed up to the single GPU version. This has significant implications for
both the hardware and software development for multi-GPU systems, chiefly that
the GPU devices communication capability is the primary performance limiting
factor.
These ratios of memory bandwidth to processing capacity explain the known
scalability of the FFTW library on large clusters. Increasing the number of nodes
within the cluster simultaneously increases the available aggregate bandwidth,
much the same way that using four GPUs improves over two GPUs in the mutli-
GPU 3D FFT. Increasing the processing power within each node of a cluster will
only yield a performance improvement until the processors capability to generate
communication exceeds the communication networks ability to service it.
There are a number of ways in which current high end hardware and next gen-
eration consumer hardware can address some of the problems outlined. While the
GTX 580 devices used in the Infib system may be considered as recent high-end
consumer hardware, NVIDIA also produce a line of high performance computing
devices—the Tesla series. The feature of particular interest in the Tesla series
of devices is the presence of two copy engines. All consumer hardware thus far
have had only a single copy engine, and thus can not use the duplex capability
of the PCIe 2.0 lanes. This ability to perform communications in duplex can
improve the performance of the simulation in two ways. First, there will be a
higher bandwidth for device to device communication in the swap operation as
the transfers in each direction can occur concurrently. Second, the additional
copy engine is another resource in the communication/computation pipeline re-
sulting in an extra degree of overlap and a potential speed up of up to 22% in
addition to the improvement from the technique in Chapter 5 (see Figure 6.1).
As the problems explored in this thesis are memory/communication bound, this
80 CHAPTER 6. CONCLUSION
(a) One Copy Engine
(b) Two Copy Engines
Figure 6.1: Technique for GPU-CPU pipelining in 3D: overlapping prepara-
tion (grey) and computation (white) with one and two copy engines. Using
two engines allows host to device (H2D) transfers to be overlapped with
device to host (D2H) transfers leading to an additional 22% speed up.
has a large potential for performance increase.
Another way in which the performance of these problems can be improved by
upcoming hardware is simply through the increase in available memory per device.
In Chapters 3 and 4 it was established that the bandwidth achieved is limited
to an extent by the size of the message being transferred. In the distributed 3D
FFT algorithm presented the maximum size of the smallest transfer is bound
by the size of the batch and the memory within a single device. Essentially,
increasing the memory not only leads to the ability to compute larger problems
but also affects the capacity to saturate the interconnects between devices. This
is reflected in the size and speed of the 3D FFT over one, two and four GPUs,
and what would theoretically be possible with six GPUs. The model predicts
that with six GPUs at full bandwidth the performance of the multi-GPU FFT
with transfers back to host would be marginally faster (less than 5%) than the
observed times for FFTW with 12 cores. The memory capacity of GPU devices
has steadily increased with successive generations, and this trend is likely to
6.2. DISCUSSION 81
continue especially considering the release of GPU devices such as the NVIDIA
GTX Titan into the consumer market. The Titan has 6 GB of memory—four
times the amount found on the GTX 580s used in the Infib machine—comparable
to the previous generation of Tesla series of devices which typically have more
memory than the same generation consumer devices.
If the aggregate memory space is sufficiently large, it is also possible to improve
the performance of the simulation by moving parts of the computation from the
CPU to the GPU. Comparing the times for the distributed 3D FFT (0.5 seconds)
at 5123 and the number of FFTs per timestep (10), and the average duration
for a timestep of an ultrasound simulation of the same size (5.1 seconds) we can
see that under this configuration the CPU load from the element wise operations
slows down the execution by only a small fraction. If any of the aforementioned
factors results in a speed up in the FFT, such as what is possible with two copy
engines, then this balance will shift resulting in the GPU idling. To address this
some of the computation can be performed on the GPU, however this will require
enough free space for the corresponding matrices. If the device memory space
is large enough, then it may be possible to achieve a significant performance
increase by performing the simulation entirely in the device memory space. This
will negate the need to transfer data between the host and device at each time
step as presently performed in Chapter 5.
The GPU devices in the Infib machine were only recently succeeded by the
current generation of GPU devices code named by NVIDIA as Kepler. One of the
key features of the Kepler generation GPUs is dynamic parallelism: the ability
kernels to launch other kernels independently of the host/CPUi [44]. This not
only results in a reduction in interrupts for the CPU in controlling the GPU and
communication over the PCIe, but also leads to a greater granularity in parallel
control and scheduling in tasks within the GPU.
Another feature of select Kepler generation GPUs with potential performance
implications for the FFT and ultrasound simulations is the Hyper-Q, which allows
up to 32 simultaneous, hardware-managed connections from different threads
and processes (including MPI processes) to a single GPU [44]. The Hyper-Q
addresses a limitation of previous processors in that there was only one queue for
work tasks—imposing false serialisation between kernels where all but one of the
multi-processors on a GPU could be idle, but theoretically capable of performing
work if scheduled.
82 CHAPTER 6. CONCLUSION
The final hardware consideration in this discussion with perhaps the greatest
impact on multi-GPU system performance stem from the implications of the
next iterations of the PCIe protocol. It is possible that many of the hardware
constraints may be alleviated with systems based on PCIe 3.0 which doubles the
theoretical throughput of PCIe 2.0. Whether this will correspond to halving the
communication times in GPU communication can only be speculated.
The construction of systems completely capable of multi-GPU support and
utilising the full potential of the components is inherently difficult. A paper
that came out during this thesis by Schaetz and Uecker[55] experienced the same
technical difficulties as their systems were based on the same underlying hardware
as the Infib system used in this thesis. Due to the hardware limitations of the
Intel 5520 IOH chips, the implementations are limited to four GPUs. This places
a smaller ceiling on the aggregate GPU memory (size of the UVA space) and size
of the largest FFT that can be performed than one would expect. In order to
use more devices the only option would be to stage data in host memory, and
transfer data on demand to GPUs for processing. This will be less efficient but
allows for larger problem sizes.
These sorts of limitations could be addressed using PCIe based external inter-
faces such as Intel’s Thunderbolt. The current version of Thunderbolt is based
around the PCIe 2.0 protocol with x4 lanes and a bandwidth of 10 Gb/s—slower
than current generation infiniband interconnects. The advantage of the Thun-
derbolt is that it interfaces directly with the PCIe network with the ability to
daisy-chain up to six additional devices. Provided that similar mistakes to In-
tel’s 5520 IOH chips are not made, a large number of devices could be modularly
connected to the interface under a single UVA, with future versions of the inter-
face based on the PCIe 3.0 or later protocols capable of larger bandwidths and
potentially many more devices.
6.3 Future Work
As suggested throughout the discussion there are a number potential optimisa-
tions and additional work that can be performed with the multi-GPU distributed
3D FFT and its application to an ultrasound simulation. These form the basis
for recommending the following future work that can proceed from the findings
of this thesis.
6.3. FUTURE WORK 83
A simple optimisations that can be explored involves staging and coalescing
data before communication which can improve data locality, bandwidth satura-
tion, and scheduling. However, some of the communications to the same desti-
nation are sent in separate transfers at various times—explicit control over the
scheduling of certain tasks is not possible without incurring significant synchro-
nisation overheads.
Related to this are alternative algorithms for computing the FFT itself, such
as those suggested by Tang et. al. [61]. However, these sorts of methods of-
ten incur some other form of penalty such as numerical accuracy or restrictive
domain sizes/sparsity constraints. Therefore, analysis of these methods requires
consideration of the application for acceptability.
The dynamic parallelism of the Kepler generation of devices can be used
at a batch level by the distributed 3D FFT to improve the efficiency of the
computation. In the current implementation synchronisations are implicitly in
place around the memory shuﬄe, transposes and 2D and 1D FFTs that performed
within a single device. By using the dynamic parallelism approach, kernels can
be launched to perform some of these operations as soon as data is available with
minimal overhead. Presently in the worst possible case one compute core may be
active on the last batch of data for an operation such as the transpose prior to the
FFT while the remaining cores are idle waiting for the barrier synchronisation
to pass where it is possible for all of these cores to be already computing the
FFT result. By launching the next operation from the previous, rather than
synchronising and launching all similar operations in batch these sorts of idles
can be avoided—simultaneously improving the ratio of computational operations
to pure communication tasks.
These types of problem domain specific kernels can also be further developed
to incorporate the effect of the memory operations with final storing operation of
the computational kernel. That is, rather than performing the computation and
storing the result in place, a less general kernel designed for the current task can
store the result in its final destination. This sort of approach would essentially
incorporate the kernel method of data swapping into the computational kernels.
This level of integration would not be possible with any of the other swap methods
such as the pointer swapping method that presently performs more optimally.
Another kernel level optimisation that is also conceptually employed by FFTW
as an available option in the MPI version of the library is to omit the final stage
84 CHAPTER 6. CONCLUSION
of memory manipulation and return a transformed result. This would require an
index mapping for the matrix element-wise operations so that the corresponding
elements are used correctly.
The development time for the above mentioned complex kernels would be
significant, and the testing of the various parameters and configurations for any
given system even more so.
The final task that can be recommended would be to implement or port
the entire ultrasound simulation into a multi-GPU implementation in the GPU
memory space. This would be the only true comparison of a multi-GPU system
to a pure CPU system. It is the author’s experience and opinion that this task
would result in madness and is not suitable for doctoral work.
Appendix A
Equipment
A.1 Multi-GPU System: Infib
The primary multi-GPU system used in this work is based on the Tyan bare-
bone TYAN FT72B7015 [36]. The motherboard has two LGA 1366 sockets for
Intel Core i7 processors in a NUMA configuration (see schematic in Figure A.1).
Each socket is populated with a six-core Intel Xeon X5650 processor giving a
total of twelve physical cores. The server is equipped with twelve 4 GB mem-
ory modules (48 GB RAM) and has an aggregated memory bandwidth of 2× 25
GB/s. Communication between the CPUs is supported by the Intel QuickPath
Interconnection (QPI) with a theoretical bandwidth of 12 GB/s.
The QPI also connects each CPU with an Intel IOH chip that offers various
I/O connections including a total of four PCIe x16 links. Each PCIe link is
branched by a PLX PEX 8647 switch to give a total of eight PCIe slots. As this
system is designed as a node in a cluster, one PCIe slot is reserved for a high-
bandwidth interconnect (Infiniband). The remaining seven slots are populated
with GPUs.
Each GPU is an NVIDIA GeForce GTX 580 with 512 CUDA cores and 1.5
GB of memory. Access to the CPU by the GPUs or vice versa is provided by the
PEX bridge multiplexing the PCIe x16 links as required on demand. All GPUs
use NVIDIA CUDA 4.2 [43].
85
86 APPENDIX A. EQUIPMENT
Memory 
24GB 
PEX 
8647 
GPU0 
GTX 580 
Infiniband  
Mellanox 40Gb/s 
CPU 
Memory 
24GB 
Intel IOH 
5520 
Intel IOH 
5520 
CPU 
Xeon X5650 
ICH 10R 
(LANs, DISKs) 
PEX 
8647 
PEX 
8647 
PEX 
8647 
GPU2 
GTX 580 
GPU1 
GTX 580 
GPU3 
GTX 580 
GPU4 
GTX 580 
GPU5 
GTX 580 
GPU6 
GTX 580 
16 links shared 
between 2 GPUs 
16 links shared 
between IF and GPU 
PC
I-E
 x 
16
 
PCI-E x 16 
PCI-E x 16 
PCI-E x 16 
CPU 
Xeon X5650 
QPI 
12GB/s 
ESI 
8GB/s 16 links shared 
between 2 GPUs 
16 links shared 
between 2 GPUs 
25GB/s 
25GB/s 
Figure A.1: Schematic of the core components on the motherboard of the
infib multi-GPU shared-memory system.
A.2 Multi-GPU System: Trinity
Another system that was available is based on a more accessible desktop machine
using the Asus motherboard: the P6T6 WS Revolution [4]. This system was
built primarily for development and debugging of CUDA code. The motherboard
is configured with a single LGA 1366 socket for Intel Core i7 processors. The
CPU in this socket is an Intel i7 960 processor with four physical cores (Intel’s
Hyper-Threading is enabled on this system therefore it presents 8 virtual cores to
the operating system). This machine is equipped with six 2 GB memory modules
(12 GB RAM).
The GPUs attached to this system are NVIDIA GeForce GTX 480 cards with
A.2. MULTI-GPU SYSTEM: TRINITY 87
1.5 GB of memory. There are three of these devices on the system, however their
exact configuration is unknown. While the system can support three devices at
full PCI x16, there is reason to believe this arrangement has not been achieved.
The chipset and interconnecting PCIe chips for this system comprise of Intel
5520/5500/X58 I/O Hub to ESI Port, ICH10R, and NVIDIA NF200 PCIe 2.0
switches. The system uses CUDA 4.2 [43].
88 APPENDIX A. EQUIPMENT
Appendix B
Complete Table of Measured
Bandwidths on the Multi-GPU
System Infib
This appendix accompanies Chapter 3. The table below shows the complete set
of configurations measured for each method. The list of numbers in the device
coordination indicate pairs of devices that swapped for example: 1 2 3 4 means
devices 1 and 2 swapped, and devices 3 and 4 swapped.
89
90
A
P
P
E
N
D
IX
B
.
C
O
M
P
L
E
T
E
B
A
N
D
W
ID
T
H
T
A
B
L
E
F
O
R
IN
F
IB
Table B.1: Effective Bandwidth for Inter-Device Data Swap (GB/s)
Device Coordination
Via CPU Hybrid-Staged Pointer Swapping Kernel
1 Stream 2 Streams 1 Stream 2 Streams 1 Stream 2 Streams 1 Stream 2 Streams
1 0 3.84 3.84 3.67 4.08 4.91 6.43 5.94 5.94
1 2 2.95 2.95 3.97 4.71 6.18 8.11 8.22 8.24
1 3 5.38 5.38 2.82 2.82 2.91 2.92 - -
1 2 3 4 5.88 5.89 7.92 9.41 12.33 16.20 16.04 16.27
3 4 5 6 3.83 3.84 5.85 6.64 12.31 16.20 16.37 16.46
0 1 3 5 7.53 7.55 7.32 8.12 9.81 12.84 11.66 11.77
3 5 4 6 3.84 3.84 3.69 4.09 4.90 6.42 5.96 5.96
1 3 2 4 5.40 5.40 2.85 2.83 2.97 2.97 - -
1 2 3 4 5 6 5.71 5.71 8.72 9.88 18.47 24.29 23.99 24.31
0 2 3 5 4 6 5.65 5.67 5.49 6.09 7.34 9.63 8.88 8.92
0 4 1 5 2 6 5.64 5.71 2.99 3.00 3.10 3.62 - -
Appendix C
Table of Predicted and Measured
Times for the Swap Operation on
the Multi-GPU System Infib
This appendix accompanies Figure 3.8. Table C.1 shows the predicted time in
seconds, using the model created in Chapter 3 for a swap operation performed
using the hybrid-staged method for a range of message sizes and compares it to
the actual observed time on a particular multi-GPU system.
91
92 APPENDIX C. TABLE FOR MODELLING SWAP OPERATION
Table C.1: Model Predicted and Actual Measured Times for the Swap
Operation via the hybrid-staged Method on Infib (seconds)
Message Size (KB) Predicted Time (s) Observed Time (s)
4× 2−10 5.08E-05 5.39E-05
32× 2−10 5.08E-05 5.65E-05
256× 2−10 5.08E-05 4.94E-05
2 5.08E-05 5.34E-05
16 5.24E-05 6.33E-05
128 9.48E-05 1.19E-04
432 2.40E-04 2.61E-04
1024 5.33E-04 6.70E-04
2000 1.02E-03 1.40E-03
3456 1.74E-03 2.22E-03
5488 2.74E-03 3.32E-03
8192 4.08E-03 4.62E-03
11664 5.81E-03 6.23E-03
16000 7.96E-03 8.30E-03
21296 1.06E-02 1.09E-02
27648 1.37E-02 1.39E-02
35152 1.74E-02 1.75E-02
43904 2.18E-02 2.17E-02
54000 2.68E-02 2.64E-02
65536 3.25E-02 3.22E-02
Appendix D
GPU-CPU Pipelining of FFT in
K-Space Pseudo-Spectral
Methods
The following appendix accompanies Section 5.2. This describes the technique
for overlapping work on the GPU with the work on the CPU by making use of
the multi-GPU FFT from Chapter 4. This corresponds to Equations 1-4 from
Figure 5.2 for the k-space psuedo-spectral method.
GPU CPU
p k = f f t n (p) ; Ax = bsxfun (@times , ddx k sh i f t po s , kappa ) ;
( no op ) Ax = p k .∗ Ax;
IFx = real ( i f f t (Ax) ) ; Ay = p k .∗ bsxfun (@times , ddy k sh i f t po s , kappa ) ;
IFy = real ( i f f t (Ay) ) ; Az = p k .∗ bsxfun (@times , dd z k sh i f t p o s , kappa ) ;
IFz = real ( i f f t (Az) ) ;
ux sgx = bsxfun (@times , pml x sgx , bsxfun (@times ,
pml x sgx , ux sgx ) − dt . / rho0 sgx ∗ IFx ) ;
uy sgy = bsxfun (@times , pml y sgy , bsxfun (@times ,
pml y sgy , uy sgy ) − dt . / rho0 sgy ∗ IFy ) ;
( no op )
uz sgz = bsxfun (@times , pml z sgz , bsxfun (@times ,
pml z sgz , uz sgz ) − dt . / rho0 sgz ∗ IFz ) ;
93
94APPENDIX D. GPU-CPU PIPELINING OF FFT IN K-SPACE PSEUDO-SPECTRALMETHODS
GPU CPU
Ax = f f t n ( ux sgx ) ; ( no op )
Ay = f f t n ( uy sgy ) ; Ax = bsxfun (@times , ddx k sh i f t n eg , kappa .∗ Ax) ;
Az = f f t n ( uz sgz ) ; Ay = bsxfun (@times , ddy k sh i f t n eg , kappa .∗ Ay) ;
duxdx = real ( i f f t n (Ax) ) ;
duydy = real ( i f f t n (Ay) ) ;
Az = bsxfun (@times , ddz k sh i f t n eg , kappa .∗ Az) ;
duzdy = real ( i f f t n (Az) ) ; ( no op )
Bibliography
[1] A. M. D. AMD. http://sites.amd.com/us/fusion/apu/Pages/fusion.
aspx. 9
[2] G. M. Amdahl. Validity of the single processor approach to achieving large
scale computing capabilities. In Proceedings of the April 18-20, 1967, spring
joint computer conference, pages 483–485. ACM, 1967. 77
[3] G. R. Andrews. Foundations of multithreaded, parallel, and distributed pro-
gramming. Addison-Wesley, 2002. 10
[4] ASUSTeK Computer Inc. P6t6 ws revolution. http://www.asus.com/
Motherboard/P6T6 WS Revolution/, 2013. 86
[5] A. V. Bhatt. Creating a PCI Express interconnect. www.
pcisig. com/ specifications/ pciexpress/ technicallibrary/
pciexpresswhitepaper. pdf , 2002. 9
[6] R. Bryant and J. Hawkes. Linux scalability for large NUMA systems. In
Linux Symposium, page 76, 2003. 11
[7] R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, and J. McDonald.
Parallel programming in OpenMP. Morgan Kaufmann, 2000. 10
[8] B. Chapman, G. Jost, and R. Van Der Pas. Using OpenMP: portable shared
memory parallel programming, volume 10. MIT press, 2007. 10
[9] Y. Chen, X. Cui, and H. Mei. Large-scale FFT on GPU clusters. ICS, pages
315–324, 2010. 26
[10] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of
complex Fourier series. Mathematics of computation, pages 297–301, 1965.
23
95
96 BIBLIOGRAPHY
[11] B. J. Copeland. The Essential Turing. Clarendon Press, 2004. 9
[12] B. Cox, S. Kara, S. Arridge, and P. Beard. k-space propagation models for
acoustically heterogeneous media: Application to biomedical photoacoustics.
The Journal of the Acoustical Society of America, 121:3453, 2007. xi, 21
[13] K. Czechowski, C. McClanahan, C. Battaglino, K. Iyer, P.-K. Yeung, and
R. Vuduc. On the communication complexity of 3D FFTs and its impli-
cations for exascale. In ICS, San Servolo Island, Venice, Italy, June 2012.
25
[14] L. Dagum and R. Menon. OpenMP: an industry standard API for
shared-memory programming. Computational Science & Engineering, IEEE,
5(1):46–55, 1998. 10
[15] P. Duhamel and M. Vetterli. Fast fourier transforms: A tutorial review and
a state of the art. Signal Processing, 19(4):259 – 299, 1990. 24
[16] R. Farber. CUDA application design and development. Morgan Kaufmann,
2011. 7, 12
[17] M. W. Feise, J. B. Schneider, and P. J. Bevelacqua. Finite-difference and
pseudospectral time-domain methods applied to backward-wave metamate-
rials. Antennas and Propagation, IEEE Transactions on, 52(11):2955–2962,
2004. 21
[18] J. B. J. Fourier. The analytical theory of heat. The University Press, 1878.
22
[19] M. Frigo and S. Johnson. FFTW: An adaptive software architecture for the
FFT. In ASSP, volume 3, pages 1381–1384. IEEE, 1998. 24
[20] M. Frigo and S. Johnson. The Design and Implementation of FFTW3.
93(2):216–231, 2005. 24
[21] M. Frigo and S. Johnson. FFTW, C subroutine library. http://www.fftw.
org/, 2011. 24
[22] S. Georgakopoulos, C. Balanis, and R. Renaut. Pseudospectral methods
versus fdtd. In Antennas and Propagation Society International Symposium,
2000. IEEE, volume 3, pages 1506 –1509 vol.3, 2000. 21
BIBLIOGRAPHY 97
[23] J. Glenn-Anderson. Ultra Large-Scale FFT Processing on Graphics Proces-
sor Arrays. enparallelcom, 2009. 26
[24] M. Gorman. AMD gets Guinness World Record for fastest CPU with over-
clocked octa-core FX processor. http://www.engadget.com/2011/09/13/
amd-gets-guiness-world-record-for-fastest-cpu-with-overclocked-o/,
Sep 2011. 6
[25] L. Gu, J. Siegel, and X. Li. Using GPUs to Compute Large Out-of-card
FFTs, pages 255–264. 2011. 26
[26] A. Harvey and D. Staff. DMA Fundamentals on Various PC Platforms.
National Instruments, April, 1991. 8
[27] M. T. Heideman, D. H. Johnson, and C. S. Burrus. Gauss and the history of
the fast Fourier transform. Archive for history of exact sciences, 34(3):265–
277, 1985. 23
[28] L. Howes and D. Thomas. GPU Gems 3, chapter Efficient Random Number
Generation and Application Using CUDA. Addison-Wesley, 2007. 2, 77
[29] J. Jaros, B. E. Treeby, and A. P. Rendell. Use of multiple GPUs on shared
memory multiprocessors for ultrasound propagation simulations. In 10th
Australasian Symposium on Parallel and Distributed Computing, edited by
J. Chen and R. Ranjan, ACS, volume 127, pages 43–52, 2012. xi, 66, 68, 70
[30] Khronos OpenCL Working Group et. al. The OpenCL Specification. A.
Munshi, Ed, 2008. 12
[31] D. B. Kirk and W. H. Wen-mei. Programming massively parallel processors:
a hands-on approach. Morgan Kaufmann, 2010. 6, 7
[32] D. Knuth. Fundamental algorithms, volume 1 of The art of computer pro-
gramming, volume 1. Addison-Wesley,, 2 edition, 1969. 78
[33] Linux Scalability Effort. NUMA Frequently Asked Questions. http://lse.
sourceforge.net/numa/faq/, 2013. 10
[34] Message Passing Interface Forum. MPI: A Message-Passing Interface Stan-
dard Version 3.0. University of Tennessee, 2012. 10, 18, 20
98 BIBLIOGRAPHY
[35] P. Micikevicius. M07: High Performance Computing with CUDA. SC11
Tutorial http://sc11.supercomputing.org, Nov. 2011. 37, 40
[36] MiTAC International Corporation. Tyan FT72B7015 server barebone.
http://www.tyan.com/product SKU spec.aspx?ProductType=BB&pid=
439&SKU=600000195, Sept. 2011. 85
[37] G. E. Moore. Cramming more components onto integrated circuits, electron-
ics 38, 1965. 6
[38] A. Nukada and S. Matsuoka. Auto-tuning 3-D FFT library for CUDA GPUs.
Proceedings of the Conference on High Performance Computing Networking
Storage and Analysis SC 09, 2(3):1, 2009. 25
[39] A. Nukada, Y. Ogata, T. Endo, and S. Matsuoka. Bandwidth intensive 3-
D FFT kernel for GPUs using CUDA. In High Performance Computing,
Networking, Storage and Analysis, 2008. SC 2008. International Conference
for, pages 1 –11, 2008. 25
[40] A. Nukada, K. Sato, and S. Matsuoka. Scalable Multi-GPU 3-D FFT for
TSUBAME 2.0 Supercomputer. In Proceedings of the International Confer-
ence on High Performance Computing, Networking, Storage and Analysis,
page 44. IEEE Computer Society Press, 2012. 25
[41] NVIDIA Corp. NVIDIA CUDA Programming Guide Version 4.1. Technical
report, NVIDIA, Nov. 2011. 5, 7, 8, 11, 12, 13, 14, 15, 59
[42] NVIDIA Corp. Tegra. http://www.nvidia.com/object/tegra.html, 2011.
9
[43] NVIDIA Corp. CUDA Toolkit 4.1 CUFFT Library. Technical report,
NVIDIA, Jan. 2012. 25, 85, 87
[44] NVIDIA Corp. NVIDIA Kepler GK110 Architecture
Whitepaper. http://www.nvidia.com/content/PDF/kepler/
NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, 2012. 81
[45] NVIDIA Corp. Tesla. http://www.nvidia.com/object/
tesla-supercomputing-solutions.html, July 2012. 9
BIBLIOGRAPHY 99
[46] NVIDIA Corp. NVIDIA GPUDirect. https://developer.nvidia.com/
gpudirect, 2013. 31
[47] OpenMP Architecture Review Board. The OpenMP API specification for
parallel programming. http://openmp.org/wp/, 2013. 10
[48] A. V. Oppenheim, R. W. Schafer, J. R. Buck, et al. Discrete-time signal
processing, volume 2. Prentice hall Englewood Cliffs, NJ:, 1989. 22
[49] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kr
”uger, A. Lefohn, and T. Purcell. A Survey of General-Purpose Computation
on Graphics Hardware. 26(1):80–113, 2007. 6, 75, 77
[50] PCI-SIG. Pci-sig. http://www.pcisig.com/, 2013. 9
[51] D. Pekurovsky and J. Goebbert. P3DFFT-Highly scalable parallel 3D Fast
Fourier Transforms library. Technical report, University of California, Nov.
2011. 25
[52] M. Pharr and R. Fernando. CPU Gems 2. Addison-Wesley, 2005. 7, 77
[53] D. Poole. Introduction to OpenACC directives. In NVIDIA GPU technology
conference, 2012. 12
[54] J. Sanders and E. Kandrot. CUDA by example: an introduction to general-
purpose GPU programming. Addison-Wesley Professional, 2010. 6
[55] S. Schaetz and M. Uecker. A Multi-GPU Programming Library for Real-
Time Applications. Algorithms and Architectures for Parallel Processing,
pages 114–128, 2012. 37, 82
[56] T. C. Schroeder. Peer-to-Peer & Unified Virtual Address-
ing. http://developer.download.nvidia.com/CUDA/training/
cuda webinars GPUDirect uva.pdf, 2011. 11
[57] S. W. SMITH. The scientist and engineer’s guide to digital signal processing,
1999. 22
[58] W. Stallings. Operating systems: internals and design principles. Prentice
Hall, 2008. 13
100 BIBLIOGRAPHY
[59] J. E. Stone, D. Gohara, and G. Shi. Opencl: A parallel programming stan-
dard for heterogeneous computing systems. Computing in science & engi-
neering, 12(3):66, 2010. 12
[60] A. S. Tanenbaum. Computer Networks. Prentice Hall, 2003. 18
[61] P. T. P. Tang, J. Park, D. Kim, and V. Petrov. A framework for low-
communication 1-D FFT. In High Performance Computing, Networking,
Storage and Analysis (SC), 2012 International Conference for, pages 1–12.
IEEE, 2012. 83
[62] B. E. Treeby and B. T. Cox. k-Wave: MATLAB toolbox for the simulation
and reconstruction of photoacoustic wave fields. Journal of biomedical optics,
15(2):021314–021314, 2010. 22, 65
[63] B. E. Treeby and B. T. Cox. k-Wave: A MATLAB toolbox for the time-
domain simulation of acoustic wave fields. http://www.k-wave.org/index.
php, 2012. 2, 65
[64] B. E. Treeby, J. Jaros, A. P. Rendell, and B. T. Cox. Modeling nonlinear
ultrasound propagation in heterogeneous media with power law absorption
using a k-space pseudospectral method. J. Acoust. Soc. Am., 131(6):4324–
4336, 2012. 2, 21, 22, 67
[65] L. Valiant. A bridging model for parallel computation. Communications of
the ACM, 33(8):103–111, 1990. 43, 45
[66] J. von Neumann. First Draft of a Report on the EDVAC. Annals of the
History of Computing, IEEE, 15(4):27–75, 1993. 8, 9
Acronyms
ALU Arithmetic Logic Unit.
API Application Programming Interface.
CPU Central Processing Unit.
DFT Discrete Fourier Transform.
DMA Direct Memory Access.
FFT Fast Fourier Transform.
GPGPU General Purpose-Computing on Graphics Processing Unit.
GPU Graphics Processing Unit.
NUMA Non-Uniform Memory Access.
PCIe Peripheral Component Interconnect Express.
SIMD Single-Instruction Multiple-Data.
SIMT Single-Instruction Multiple-Thread.
TLA Three Letter Acronym.
TMT Too Many TLA.
UVA Unified Virtual Address space.
101
102 Acronyms
Glossary
API Interface for programming specific applications: protocols designed for soft-
ware to communicate; definitions of routines, data structures and constants;
typically implemented as a library. .
CUDA NVIDIA’s API for GPGPU.
device entire GPU - DMA, GPU, Memory, fans, power, etc..
DFT A mathematical transform of a function over time into a function whose
argument is a frequency.
DMA allows non-CPU hardware to access system/main memory independently
of the CPU.
FFT An algorithm for computing the Discrete Fourier Transform.
GPGPU The use of the GPU for tasks normally performed by CPUs.
GPU A type of processor typically designed for preparing and outputting a
signal for a display.
multi-GPU Use of multiple GPUs within a single machine.
PCIe A software and hardware protocol for hardware expansion on computer
machines.
superstep An algorithm described by the Bulk Synchronous Parallel model is
computed as a series of supersteps. See Section 3.3.
Ultrasound A cyclic sound pressure wave with a frequency greater than the
upper limit of the human hearing range.
103
Index
API, 11
CPU, 6
CUDA, 12
device, 8
DFT, 22
DMA, 13
Fast Fourier Transform, 23
FFT, 22, 53
Graphics Processing Unit, 6
NUMA, 10
PCIe, 9
SIMT, 6
superstep, 45
ultrasound, 21, 65
UVA, 11
104
