Libra.Net: Single Task Scheduling in a CPU-GPU Heterogeneous Environment by PAGANUCCI, STEFANO
UNIVERSITA` DEGLI STUDI DI PISA
Facolta` di Scienze Matematiche, Fisiche e Naturali
Corso di Laurea Specialistica in Tecnologie Informatiche
Tesi di Laurea
Libra.Net: Single Task Scheduling in a
CPU-GPU Heterogeneous Environment
Candidato
Stefano Paganucci
Relatori Controrelatore
Dott. Antonio Cisternino Prof. Marco Vanneschi
Dott. Cristian Dittamo
Anno Accademico 2009 - 2010

A Fiorella.

“ Hardware: the parts of a computer that can be kicked.”
Jeff Pesis

Ringraziamenti
Un grazie particolare ad Antonio e Cristian per l’aiuto ed il sostegno durante il lavoro,
per i pranzi insieme e le chiaccherate. A Nicole, Gabriele, Alessandro, Matteo e Liliana
per la bellissima esperienza che e` stata LOA Mobile e per le successive serate insieme
per locali pisani. A Simone per i consigli ed il confronto informatico durante tutto il
corso di studi. Ad Andrea per l’aiuto reciproco durante la preparazione degli esami e le
serate fuori dal comune.
Per tutto quello che abbiamo passato insieme, grazie a tutti gli amici di sempre:
Simo, Matte, Barbara, Diego, Monica, Fritz, Carmine, Marco, Oliver, Hanz, Roberto (il
Pacio). Ai compagni della classe piu` mitica della storia, la 5 Bst, per le gite, le infinite
risate e l’amicizia che ancora oggi ci lega tutti. Insomma, un grazie enorme a tutti gli
amici.
Agli zii e ai cugini: Giusy, Luca, Liviana, John, Alessio e Michele. A nonno Beppe,
nonna Irma, nonno Livio e nonna Ie` che con il loro sostegno hanno reso possibile
tutto questo. A Irene e Babbo per sopportarmi ogni giorno e per l’enorme affetto di-
mostratomi.
A Eli, perche´ senza di lei non avrei superato i momenti difficili, perche´ mi ha capito
e perche´ e` sempre riuscita ad allietare ogni momento.
Alla mia mamma, per aver sempre creduto in me e perche´ so quanto avrebbe voluto
condividere con me questo momento.

Contents
1 Introduction 11
2 State of the Art 15
2.1 Evolution of the Graphics Processing Unit . . . . . . . . . . . . . . . . . . 15
2.1.1 Fixed-function Graphics Pipeline . . . . . . . . . . . . . . . . . . . 16
2.1.2 Programmable Shading . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.3 Programmable Graphics . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.4 General Purpose GPU . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 GPGPU platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Nvidia Compute Unified Device Architecture . . . . . . . . . . . . 20
2.2.2 AMD Stream Computing . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.3 OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3 Hybrid architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4 Other existing approaches for parallelism exploitation using GPUs . . . . 44
3 Tools 47
3.1 Common Language Infrastructure . . . . . . . . . . . . . . . . . . . . . . . 47
3.1.1 Design and Capabilities . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1.2 Common Intermediate Language . . . . . . . . . . . . . . . . . . . 50
3.2 Common Language Runtime . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.1 Reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.2 Metadata extensibility . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2.3 Delegates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2.4 Managed and un-managed code . . . . . . . . . . . . . . . . . . . . 56
3.2.5 Interoperating with un-managed code . . . . . . . . . . . . . . . . 56
3.3 Shared Source Common Language Infrastructure . . . . . . . . . . . . . . 58
9
10 CONTENTS
3.4 CLIFile Reader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4 Performance Modeling for GPGPU and CPU 61
4.1 Parameters for performance evaluation . . . . . . . . . . . . . . . . . . . . 61
4.2 GPGPU Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.1 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.2 Ground Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.3 Refinement Example: Nvidia PTX 1.x . . . . . . . . . . . . . . . . 66
4.2.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 CPU Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.1 SSCLI Just-In-Time compiler . . . . . . . . . . . . . . . . . . . . . 76
4.3.2 Opcodes cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 80
5 Implementation 83
5.1 4-Centauri . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Libra.Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.1 Bytecode Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.2 Model Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2.3 Executing a task . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6 Conclusions and Future Works 101
A Tables 105
B Listings 117
Bibliography 126
Chapter 1
Introduction
In 1966, Micheal J. Flynn proposed a classification of computer architectures based
upon the number of concurrent instructions and data streams available in the architec-
ture. Flynn recognized four classes of computer architectures: Single Instruction Single
Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data
(MISD) and Multiple Instruction Multiple Data (MIMD). General-purpose architectures
are architectures not devoted to a unique usage while special-purpose architectures are
architectures devoted to a specific task. Generally, general-purpose architectures em-
brace the Von Neumann computational model while special-purpose architectures adopt
non-Von Neumann computational models. The Von Neumann computational model cor-
responds to the SISD architecture in which a single processor, a uniprocessor (CPU),
executes a single instruction stream, to operate on a single data stream. SIMD architec-
tures exploit instead data parallelism in which multiple data elements of a stream are
operated by a single instruction. For instance, Graphics Processing Units (GPUs) are
SIMD architectures because graphics rendering often applies the same function to each
pixel or vertex. MISD architectures apply multiple instructions on a single data stream.
These are, for example, architectures used for fault tolerance or heterogeneous systems
operating on the same stream element that must agree on the result. An example of
MIMD architectures are distributed systems either with shared memory and distributed
memory [1, 2].
GPUs have recently evolved from fixed-function rendering devices into highly par-
allel programmable many-core architectures fulfilling the enormous demand for high-
definition 3D games. The computational power of GPUs is available and inexpensive.A
typical latest-generation card costs $400–500 at release and drops rapidly as new hard-
ware emerges. As today’s computer systems often include CPUs and GPUs it is im-
portant to enable software developers to take full advantage of these heterogeneous
processing platforms [3].
For this reasons, many researchers and developers have become interested in harness-
ing the power of commodity graphics hardware for general-purpose computing. Initially,
development of non-graphics applications on GPUs was very complex because program-
11
12 Introduction
mers had to be expert in two domains: the application domain and computer graphics.
Developers had to map their applications onto the computer graphics domain specific
language and code and decode respectively input and output results. Recent years have
seen an explosion in interest in such research efforts, known collectively as GPGPU
computing (General-Purpose GPU) [4]. The term GPGPU was coined by Mark Harris
in 2002 when he recognized an early trend of using GPUs for non-graphics applica-
tions [5]. Since 2005, the two major GPU vendors, Nvidia and AMD, have introduced
fully programmable GPUs and development platforms such as Nvidia Compute Unified
Device Architecture [6] and AMD Stream Computing [7]. GPGPU devices operate as
co-processors to the main CPU, or host. More precisely, a portion of an application that
drives the computation of each core is commonly known as a shader (in the traditional
3D terminology) or kernel (a term to stress the will to go beyond 3D graphics) [8]. An
entire application is referred to as task. However, due to the special usage of GPU, it is
impossible to complete all computing tasks solely on the GPU. Many control and serial
instructions still need to complete on the CPU [9]. Despite this new interest in GPGPU
computing, few efforts has been done by researchers in formulating performance models
(cost models) in order to accurately evaluate performance of GPGPU applications and
compare CPU executions with GPU ones of the same task.
In the last few years, researchers have become interested in Virtual Execution Envi-
ronments (VEEs) such as the Java Virtual Machine (JVM) [10] and the .NET Common
Language Runtime (CLR) [11]. VEEs abstract from many features of the underlying
architecture through dynamic translation of code before its execution on the host. Pro-
grams are expressed in an Intermediate Language (IL) and executed using an abstract
computational model. Moreover, VEEs provide a specification in which code is mixed
with metadata. Metadata permits to monitor program activities enforcing security, and
enables dynamic loading and reflection. An important technique related to metadata is
annotation: elements of source code such as classes, methods and fields can be annotated
with special attributes. Attributes are compiled and saved with IL code but are ignored
by the runtime. Using reflection programmers can retrieve such attributes and change
programs behavior.
This thesis focuses on identifying bottlenecks in GPGPU computations through per-
formance modeling. Moreover, we would like to find out if such bottlenecks could de-
crease performance of parallel applications in a way that the CPU execution time of
those applications become lower than the GPU one. In particular, we will focus on
the overhead introduced by data transfers between host and device and the available
techniques that may reduce their impact on performance. We would like to hide to
developers who write parallel applications more architectural details as possible such as
the number of cores and the memory bandwidth. In order to achieve better productivity,
developers should be able to write parallel applications without learning GPGPU pro-
gramming languages and APIs. For this reason, we leveraged the abstraction power of
Virtual Execution Environments. In particular, our work enables programmers to write
kernels as annotated methods using one of the available high-level languages supported
by the Common Language Infrastructure (CLI) [12]. These kernels may be scheduled
13
on a CPU or on a GPU. In the latter case, kernels written in Microsoft Intermediate
Language (MSIL) must be compiled into a target architecture language (e.g. AMD IL
[7] and Nvidia PTX [6]).
In this thesis we developed Libra.Net, a single task scheduler for a CPU-GPU het-
erogeneous environment. Libra.Net is written in C# using the .NET Framework and
schedules tasks written in MSIL. In order to enable the scheduler to estimate the exe-
cution time of a task on a CPU and on a GPU, we formulated a CPU and a GPGPU
performance model. If the CPU execution time of a task is lower than the GPU one,
the scheduler executes the task on the CPU, otherwise the task is executed on the GPU.
Libra.Net is a component of the 4-Centauri meta-compiler, a source-to-source compiler
developed by Cristian Dittamo during his PhD that translates Microsoft Intermediate
Language into Nvidia PTX code [8]. During his thesis, Giacomo Righetti extended the
4-Centauri compiler with translation from MSIL to AMD IL [13]. Libra.Net performs
the following steps:
• Takes in input a Kernel-annotated method.
• Analyzes the method code statically in order to estimate the number of instructions
that will be executed at runtime.
• Decides on which platform execute the kernel basing on our CPU and GPGPU
performance models. If the scheduler decides to execute the task on a GPGPU,
4-Centauri provides compilation from MSIL to the target architecture language.
• Executes the kernel.
This thesis is organized in four chapters. In Chapter 2 we give a brief introduc-
tion to the evolution of GPUs that has brought to GPGPU computing. Moreover, we
present and compare the main aspects of the three most spread GPGPU platforms:
Nvidia CUDA [6], AMD Stream Computing [7] and OpenCL [3]. Finally, we introduce
some hybrid architectures that tries to merge Von Neumann and non-Von Neumann
architectures and other solutions to parallel exploitation using GPUs.
In Chapter 3 we describe design and capabilities of the Common Language Infras-
tructure and we briefly introduce the techniques of the Common Language Runtime used
for the scheduler implementation.
In Chapter 4 we propose a GPGPU performance model structured in layers. The
abstract layer (we called the “ground model”) is a performance model based on the
OpenCL platform and specification. We chose OpenCL because it is an open standard.
Moreover it is supported by the two major graphics card vendors (Nvidia and AMD).
The ground model must be refined to produce a performance model tied to a vendor-
specific platform or even to a specific device. In particular, we propose a refinement of
the ground model that is tied to the Nvidia PTX 1.x platform.
In order to compare CPU and GPU execution times we formulated a simple CPU
performance model. Our model is based on the computational model of the CLI virtual
14 Introduction
execution environment. Virtual execution environments abstract from specific underly-
ing architecture features making their sequential execution model well-suited for perfor-
mance modeling. Moreover, virtual execution environments enable programs to inspect
their code at runtime through Metadata and Reflection. These techniques are useful
to infer a program behavior. We corroborated the GPGPU and the CPU performance
models through an experimental approach. We set up several tests in order to compare
experimental and theoretical results.
Finally, in Chapter 5 we present the main implementation aspects of Libra.Net and
evaluate its efficiency in two case study applications: a Vector Addition and a Matrix
Multiplication.
Chapter 2
State of the Art
In this chapter we give an introduction to the major hardware and software innovations
that brought from first generation special-purpose graphics cards to modern General-
Purpose Graphics Processing Units (GPGPUs). Then we present the programming
model, memory model and communication model of the three main GPGPU platforms.
Finally, we give a brief description of the new CPU-GPU hybrid architectures and other
approaches for parallel exploitation using GPUs.
2.1 Evolution of the Graphics Processing Unit
To make this thesis self-contained, in the following sections we present the hardware and
software evolution of graphics processing units which has led to GPGPU computing.
We identify four main GPU generations: fixed-function graphics pipeline, programmable
shading, programmable graphics and GPGPU (Figure 2.1) [14]. This is a simplified
Figure 2.1: Evolution of the Graphics Processing Unit.
view of the evolution of the GPU. Other minor software and hardware improvements
should be placed between the four macro-steps but this is out of our interest.
The graphics pipeline1, also called rendering pipeline, is a method of rasterization-
based rendering implemented by graphics cards. The graphics pipeline accepts as input a
representation of a three-dimensional scene and outputs a 2D image that can be displayed
1In computing, a pipeline is a set of data processing elements connected in series, so that the output
of one element is the input of the next one. The elements of a pipeline are often executed in parallel or
in time-sliced fashion.
15
16 State of the Art
on a screen. A generic 3D application provides a representation of a scene in the form of
vertices that can be manipulated in parallel by the steps composing the graphics pipeline
[15]. Figure 2.2 depicts a simplified model of the graphics pipeline.
Figure 2.2: The graphics pipeline.
2.1.1 Fixed-function Graphics Pipeline
First generation graphics cards (1996-1999) implemented a hardware-coded fixed-function
graphics pipeline (Figure 2.3) i.e. there was a specialized (not programmable) hardware
unit for each step composing the pipeline. In OpenGL 1.x [16] and DirectX 9 [17] a little
“customization” was possible through parameters setting.
Figure 2.3: Fixed-function graphics pipeline.
2.1.2 Programmable Shading
From 2001, major graphics software libraries such as OpenGL [16] and Direct3D (compo-
nent of the DirectX API [17]) began to enable programmers to define special functions,
called shaders, to be executed by GPUs. The beginning of shaders marked the transition
from a fixed-function pipeline to a programmable one as depicted in Figure 2.4. Shaders
2.1 Evolution of the Graphics Processing Unit 17
Figure 2.4: Programmable shading.
replace fixed-functions composing the rendering pipeline in order to obtain customized
graphic effects such as bump mapping and color toning [15].
Shader programs can be written using a shading language (i.e. a Domain Specific
Language) such as the OpenGL Shading Language (GLSL) [16], the High Level Shader
Language (HLSL) [11] from Microsoft or Cg (C for graphics) [18] developed by Nvidia
in collaboration with Microsoft.
2.1.3 Programmable Graphics
A major innovation in graphics shaders, introduced by DirectX 10 in 2006, was of the
Unified Shader Model which consists of two aspects: the Unified Shader Model and the
Unified Shader Architecture. The Unified Shader Model (known in OpenGL simply
Figure 2.5: Unified shader model.
as “Unified Shader Model” [16], known in Direct3D 10 as “Shader Model 4.0” [19])
18 State of the Art
consists in defining a very similar instruction set for all shader types. The Unified
Shader Architecture is a low-level model that “unifies” the GPU compute units meaning
that any of them can run any type of shader. Hardware is not required to have a Unified
Shading Architecture to support the Unified Shader Model, and vice versa. With the
Unified Shader Model the graphics pipeline become fully programmable. The graphics
pipeline steps can be replaced, added and ordered by programmers (Figure 2.5) [20].
2.1.4 General Purpose GPU
The advance from the Unified Shader Model to GPGPU computing mainly consists in a
change of the provided programming interfaces and languages rather than on hardware
modifications. It has passed from Domain Specific Languages (shading languages) to
general-purpose programming languages such as C, FORTRAN, etc. The hardware
design has substantially remained the same, just more transistors have been placed on a
chip following a trend in which the number of transistors doubles every six months. Since
modern GPGPUs still implement a SIMD architectural model, speaking about “general-
purpose” architectures would be incorrect. In fact, as it will be explained in this thesis,
not all kinds of applications are well-suited to be executed on a GPU but only those that
can fit the data-parallellism programming model. The “general-purpose” term has been
introduced in order to emphasize the new capability of GPUs of executing non-graphics
applications. Lots of algorithms belonging for example to the fields of physics simulation
and computational biology can be accelerated by data parallel implementations making
them well-suited for the execution on GPGPUs.
The GPGPU computational model exposes to programmers a stream processing
paradigm. Stream processing is a computer programming paradigm, related to SIMD,
that allows some applications to more easily exploit a limited form of parallel process-
ing. The stream processing paradigm restricts the parallel computation that can be
performed by parallel software and hardware. Given a set of data (a stream), a series
of operations (kernel functions) are applied to each element in the stream. Shaders and
kernels have the same meaning. The name has been changed just to stress the the will
to go beyond 3D graphics.
The following example explains the GPGPU programming model using pseudo-code.
Listing 2.1 shows a sequential implementation of the matrix sum.
void sum( f loat A[ ] , f loat B[ ] , f loat C [ ] )
{
for ( int i = 0 ; i < n ; i++)
{
for ( int j = 0 ; j < m; j++)
{
C[ i ] [ j ] = A[ i ] [ j ] + B[ i ] [ j ] ;
}
}
}
Listing 2.1: Sequential sum of two matrices.
2.1 Evolution of the Graphics Processing Unit 19
This code can be executed sequentially on a CPU such that C[0][1] is computed after
C[0][0] and so on. However, the elements of C can be calculated independently by a
number of threads equals to the size of each matrix. Listing 2.2 shows a multi-threaded
version of the same application.
void sum( f loat A[ ] , f loat B[ ] , f loat C [ ] )
{
for ( int i = 0 ; i < n ; i++)
{
for ( int j = 0 ; j < m; j++)
{
l aunch thread { C[ i ] [ j ] = sum kernel ( i , j , A, B) ; }
}
}
synch threads ( ) ;
}
Listing 2.2: Multi-threaded version of the sum of two matrices.
The function sum_kernel() can represent the kernel of an hypothetical GPGPU
application that performs the sum of two matrices. In this case threads are mapped
onto the elements of matrix C, that is each thread is assigned a single element of the
output matrix (one-to-one mapping). In a parallel environment the nested “for” loops
are simulated by the hardware. Programmers must thus provide the kernel function,
the input and output buffers and the type of mapping. The runtime will correctly map
threads to cores of the underlying architecture [7].
20 State of the Art
2.2 GPGPU platforms
The following sections describe the programming model of the three main GPGPU plat-
forms: Nvidia Compute Unified Device Architecture [6], AMD Stream Computing [7]
and OpenCL [3]. Since our aim is to introduce the GPGPU computational model and
not to present all the available GPGPU platforms, other solutions will be cited at the
end of this chapter.
2.2.1 Nvidia Compute Unified Device Architecture
Nvidia Compute Unified Device Architecture (CUDA) is a general purpose parallel com-
puting architecture developed and introduced by Nvidia in November 2006. CUDA pro-
vides a new Instruction Set Architecture (ISA) and a new parallel programming model
enabling programmers to solve non-graphics computational problems using its GPUs.
A software environment that uses C as the primary high-level programming language is
given to developers. Other programming languages are going to be supported by CUDA
such as FORTRAN [6].
Programming Model
The CUDA programming model assumes that CUDA threads execute on a physically
separate device that operates as a coprocessor to the host running the C program.
Moreover, the CUDA programming model assumes that both the host and the device
maintain their own separate memory spaces in RAM (Random Access Memory), referred
to as host memory and device memory, respectively. CUDA programs manage device
memory through calls to the CUDA runtime. This includes device memory allocation
and deallocation as well as data transfer between host and device memory [6].
When an application launches a kernel, an user-specified number of threads are
created and scheduled for the execution. Threads execute the same kernel function.
Each thread executing the kernel has a unique identifier. Threads can be grouped in
one-dimensional, two-dimensional or three-dimensional thread blocks. Thread blocks
are themselves organized in a one-dimensional or two-dimensional grid as illustrated in
Figure 2.6. For each kernel, a single grid of threads can be launched. The number of
thread blocks per grid is dictated by the size of the data being processed. Precisely, the
grid size is obtained dividing data size by block size [6].
Programming Interface. The two main programming interfaces provided by CUDA
are CUDA C and CUDA Driver API. CUDA C is a small extension to the C language
syntax. CUDA Driver API is a low-level interface that enables programmers to gain more
control over kernel execution but is harder to program and debug. CUDA C is essentially
built on top of the CUDA Driver API, hiding to programmers low-level operations such
2.2 GPGPU platforms 21
Figure 2.6: The CUDA thread hierarchy.
as runtime initialization and management. CUDA C provides extensions to the standard
C syntax enabling developers to define kernel functions.
For instance, kernels are declared with the __global__ declaration specifier and have
access to some built-in variables. Programmers configure a kernel launch through the
<<< ... >>> configuration syntax specifying, for example, block size and grid size. The
thread identifier is accessible through threadIdx built-in variable that is a triple in which
the components represent the coordinates of the thread inside its thread block. The
blockIdx built-in variable is a 2-component vector that contains the block index within
its grid. The dimension of each block is accessible through the blockDim built-in variable
[6].
Listing 2.3 [21] shows a kernel function that uses the three different built-in variables
presented above to perform the saxpy function, a combination of scalar multiplication
and vector addition. CUDA functions such as cudaMallocHost(), cudaMalloc() and
cudaMemcpy() will be explained further in this chapter when we will speak about the
CUDA communication model (Section 2.2.1).
22 State of the Art
global void Saxpy ( f loat a , f loat ∗ InData1 , f loat ∗ InData2 , f loat ∗ Result )
{
int idx = blockIdx . x ∗ blockDim . x + threadIdx . x ;
Result [ idx ] = InData1 [ idx ] ∗ a + InData2 [ idx ] ;
}
int main ( int argc , char∗∗ argv )
{
f loat ∗ InitData1 , ∗ InitData2 , ∗ InData1 , ∗ InData2 , ∗Result , ∗HostResult ;
f loat a = 10 . 0 ;
unsigned int Length = 100 ;
/∗ −−−−−−− MEMORY ALLOCATION −−−−−−−− ∗/
cudaMallocHost ( ( void ∗∗)&HostResult , Length ) ;
memset ( HostResult , 0 , Length ) ;
cudaMalloc ( ( void ∗∗)&InData1 , s izeof ( f loat ) ∗ Length ) ;
cudaMalloc ( ( void ∗∗)&InData2 , s izeof ( f loat ) ∗ Length ) ;
cudaMalloc ( ( void ∗∗)&Result , s izeof ( f loat ) ∗ Length ) ;
/∗ −−−−−−− SET INPUT VALUES −−−−−−−− ∗/
In i tData1 = ( f loat ∗) mal loc ( s izeof ( f loat ) ∗ Length ) ;
In i tData2 = ( f loat ∗) mal loc ( s izeof ( f loat ) ∗ Length ) ;
for ( int i = 0 ; i < Length ; ++i ) {
In i tData1 [ j ] = ( f loat ) rand ( ) ;
In i tData2 [ j ] = ( f loat ) rand ( ) ;
}
cudaMemcpy( InData1 , InitData1 , s izeof ( f loat ) ∗ Length , cudaMemcpyHostToDevice ) ;
cudaMemcpy( InData2 , InitData2 , s izeof ( f loat ) ∗ Length , cudaMemcpyHostToDevice ) ;
/∗ −−−−−−− RUN COMPUTE KERNEL −−−−−−−− ∗/
int n = 16 ∗ 1024 ∗ 1024 ;
dim3 threads = dim3(512 , 1) ;
dim3 b locks = dim3(n / threads . x , 1) ;
Saxpy<<<blocks , threads , 0>>>(a , InData1 , InData2 , OutData ) ;
/∗ −−−−−−− GET RESULT −−−−−−−− ∗/
cudaMemcpy( HostResult , Result , s izeof ( f loat ) ∗ Length , cudaMemcpyDeviceToHost ) ;
}
Listing 2.3: CUDA implementation of the saxpy function.
Thread blocks execute independently, that is they can be executed in any order, in
parallel or in series. It is, therefore, possible to schedule thread blocks in any order across
any number of cores enabling programmers to write code that scales with the number
cores. Threads within a block can share data using the on-chip shared memory region
and synchronize memory accesses through the __synchthreads() function that acts as
a barrier at which all threads in a block must wait before any is allowed to proceed.
Compilation tool-chain. The CUDA compilation tool-chain is composed of three
stages as depicted in Figure 2.7. Kernels are compiled with nvcc [22](the compiler
provide by Nvidia) in an intermediate language called PTX [23] (Parallel Thread eXe-
cution). PTX is the Nvidia low-level virtual machine and ISA designed to support the
operations of a parallel CUDA processor.
During the first stage the Edison Design Group preprocessor separates host code (i.e.
2.2 GPGPU platforms 23
Figure 2.7: CUDA’s compilation process.
code to be executed on the host) from device code (i.e. code to be executed on the device).
Host code can be compiled using nvcc or another tool chosen by the programmer. Source
code written for the host CPU follows a fairly traditional path and allows developers to
choose their own C/C++ compiler. During the second stage, device code is compiled
into PTX and/or in a binary form (cubin object). The second stage is performed by the
Nvidia’s PTX-to-Target Translator, which converts Open64’s assembly-language output
into executable code for specific Nvidia GPUs [24]. An application can link to the
generated device code, i.e. the PTX or the cubin, or load and execute the kernel at
runtime using just-in-time compilation. Just-in-time compilation increases application
load time but benefits from latest compiler improvements. Moreover this technique
permits the execution of an application on different devices without recompilation.
Nvidia and the Portland Group (component of STMicroelectronics) has recently
announced their collaboration on CUDA x86 [25], a solution that enables CUDA appli-
cations to be executed on any PC system or server. The PGI CUDA C x86 compiler
enables programmers to optimize CUDA applications for their execution on x86 systems
not equipped with Nvidia graphics cards.
Memory Model
CUDA threads can access and share several memory spaces at different levels during a
kernel execution (Figure 2.8):
• Registers. A set of 32-bit registers located on-chip. Each thread can access a
private subset of the available registers.
24 State of the Art
 Chapter  2:  Programming  Model  
  
 
CUDA  C  Programming  Guide  Version  3.1      11 
  
  
Figure  2-­2.   Memory  Hierarchy  
2.4 Heterogeneous  Programming  
As illustrated by Figure 2-3, the CUDA programming model assumes that the 
CUDA threads execute on a physically separate device that operates as a coprocessor 
to the host running the C program. This is the case, for example, when the kernels 
execute on a GPU and the rest of the C program executes on a CPU. 
  
  
  
  
  
  
  
  
  
  
  
  
  
  
Global  memory  
Grid  0  
Block  (2,  1)  Block  (1,  1)  Block  (0,  1)  
Block  (2,  0)  Block  (1,  0)  Block  (0,  0)  
Grid  1  
Block  (1,  1)  
Block  (1,  0)  
Block  (1,  2)  
Block  (0,  1)  
Block  (0,  0)  
Block  (0,  2)  
Thread  Block     
Per-­block  shared  
memory  
Thread  
Per-­thread  local  
memory  
Figure 2.8: The CUDA memory spaces [6].
• Local Memory. A memory region with a scope local to a thread. Local does not
mean physically close to the cores in which threads executes. In fact this memory
region is located off-chip on a partition of the GPU RAM. Accessing this memory
region is as expensive as accessing global memory. Moreover, like global memory,
local memory is not cached. Local memory is only used by the nvcc compiler
to hold automatic variables that can not be saved in registers. Programmers can
access this memory space only developing at the lowest level provided by Nvidia
(PTX ISA).
• Shared Memory. An user-managed cache located on-chip that is shared between
threads in a block. Shared memory is a limited resource (16 Kb). Like registers, if
not enough space is available, considering threads memory requirements, a kernel
is not launched and an error is reported. The shared memory is divided into
equally sized modules, called banks. Accessing shared memory is as fast as accessing
registers. However, if different requests from different threads addresses the same
bank (called bank conflict), the accesses are serialized decreasing bandwidth and
performance.
• Global Memory. This memory region is located inside the GPU RAM. It is shared
between all threads in a grid and across multiple kernel launches of the same
CUDA application. Input and output data are allocated in global memory to be
2.2 GPGPU platforms 25
transferred respectively from and to the host memory. Accessing this memory
region is very expensive and could be a bottleneck for kernel performance.
• Constant Memory. There is a total of 64 KB constant memory on a device. The
constant memory space is cached. As a result, a read from constant memory costs
one memory read from device memory only on a cache miss; otherwise, it just costs
one read from the constant cache.
• Texture Memory. The read-only texture memory space is cached. Therefore, a
texture fetch costs one device memory read only on a cache miss; otherwise, it
just costs one read from the texture cache. The texture cache is optimized for 2D
spatial locality, so threads reading texture addresses that are close together will
achieve best performance.
Table 2.1 lists the CUDA memory spaces and their features.
Memory On/Off chip Cached Access Scope Lifetime
Register On n/a R/W 1 thread Thread
Local Off No R/W 1 thread Thread
Shared On n/a R/W 1 Block Block
Global Off No R/W All threads + host Host allocation
Constant Off Yes R All threads + host Host allocation
Texture Off Yes R All threads + host Host allocation
Table 2.1: Salient features of CUDA device memory spaces.
Coalesced and un-coalesced global memory accesses. Coalesced memory access
is a memory access mechanism that is able to considerably increase the overall per-
formance of an application [26]. Global memory operations issued by threads can be
batched into a single memory transaction when certain requirements are met. Detailed
information about coalescing requirements are provided by the CUDA documentation
[6]. If a memory access pattern does not fulfill these requirements, a separate transaction
results for each requested element (un-coalesced memory accesses).
Hardware Model
CUDA devices are composed of a scalable array of Streaming Multiprocessors (SMs)
as illustrated in Figure 2.9. Each SM contains a fixed number (typically 8) of Stream
Processor (or CUDA cores). Each SP has a fully pipelined integer arithmetic logic unit
(ALU) and floating point unit (FPU). A CUDA core executes a floating point or integer
instruction per clock for a thread.
26 State of the Art
Figure 2.9: CUDA Hardware Architecture.
An SM executes hundreds of threads concurrently in a SIMT2 (Single-Instruction
Multiple-Thread) fashion. The SMs schedule threads in groups of 32 threads called
warps that are the unit of parallelism. All threads in a warp execute the same instruction
at the same time but they are free to take different execution paths in case of a branch
divergence. If threads diverge for a data-dependent conditional branch the paths are
executed in series and when all paths are completed the warp converge back to the
same instruction. Branch divergence can occur only within a warp, different warps can
execute independently and take different execution paths. Full efficiency is reached when
all threads within a warp agree on their execution path [6].
When a kernel is launched, threads of a grid are enumerated and distributed to the
SMs for the execution. Each SM is assigned a set of thread blocks that are executed
sequentially. Thread blocks are partitioned into warps that are scheduled by the warp
scheduler. The partitioning is always the same: each warp contains thread of consecutive
thread IDs with the first warp containing thread 0 [27].
The execution information of a warp such as the program counter and the registers
content are maintained on-chip during the entire lifetime of the warp. Therefore the
context switch has no cost because the scheduler select a ready warp (i.e. a warp that
contains active threads or equivalently a warp that does not contain threads waiting for
2The SIMT architecture is akin to SIMD. A key difference is that SIMD vector organizations expose
the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior
of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-
level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads.
2.2 GPGPU platforms 27
a memory response) and simply issues the next instruction for that warp.
Compute Capability. The compute capability is a revision number that specifies
features and capabilities of a device. It is defined by a major revision number and a
minor revision number. Devices with the same major revision number are of the same
core architecture. The major revision number of devices based on the Fermi architecture
is 2. Prior devices are all of compute capability 1.x (their major revision number is 1).
The minor revision number corresponds to an incremental improvement to the core
architecture, possibly including new features [6].
Communication Model
The typical steps involved in the execution of a CUDA application are illustrated in
Figure 2.10. First of all input data are transferred from host memory to device memory
through the PCIe bus. When data transfer is completed, the host instructs the device
to start processing and the computation take place onto the device cores in parallel.
Finally, results are copied back from the device memory to the host memory to be
post-processed.
Figure 2.10: Example of CUDA processing flow.
Data transfers between host memory and device memory can be bottlenecks in
GPGPU computations because the peak bandwidth of the PCIe bus is almost an order
of magnitude lower than the peak bandwidth between the device memory and the device
chip. Hence, to achieve the best performance, it is fundamental to
• minimize data transfers between host and device;
28 State of the Art
• batch many small transfers into a single larger transfer for reducing the overhead
introduced by the bus and buffer management;
• increase transfer bandwidth allocating buffers in page-locked host memory;
• overlapping computation and data transfers. This technique is not available on all
CUDA devices and depends on the compute capability.
Page-Locked Host Memory. Page-locked host memory, also known as pinned mem-
ory, is a region of the host memory that, in spite of pageable memory, is not swapped
to secondary storage. The main advantages using page-locked host memory are:
• Copies between page-locked host memory and device memory can be performed
concurrently with kernel execution for some devices.
• On systems with a front-side bus, bandwidth between host memory and device
memory is higher if host memory is allocated as page-locked.
However, pinned memory is a limited resource. Allocating too much data in this memory
space reduces the physical memory available for the operating system decreasing overall
performance.
Overlapping computation and data transfers. Host to device data transfers are
generally blocking transfers meaning that control is returned to the host thread only after
all data has been transferred. However it is possible to overlap host computation and
memory transfers through asynchronous calls in which control is returned immediately to
the host thread. Asynchronous data transfers functions contain an additional parameter
that represents the stream. A stream is a sequence of operations that are performed in
order by a device. Listing 2.4 shows how to execute the cpuFunction() while some data
is transferred to the device and the kernel is executed on the device.
cudaMemcpyAsync ( a d , a h , s i z e , cudaMemcpyHostToDevice , 0) ;
aKernel<<<gr id , block>>>(a d ) ;
cpuFunction ( ) ;
Listing 2.4: Overlapping host computation and data transfers.
In this case the memory transfer is associated to the stream 0 that is the default
stream. The kernel also uses the stream 0 so it will not start until the transfer will be
completed. Some devices have the capability of overlapping memory transfers and kernel
execution simply assigning different streams to each operation as shown in Listing 2.5.
Obviously, the computation must have data dependencies that permit the partitioning
of data in chunks that can be manipulated independently by different kernels [27].
2.2 GPGPU platforms 29
s i z e = N∗ s izeof ( f loat ) /nStreams ;
for ( i = 0 ; i < nStreams ; i++) {
o f f s e t = i ∗N/nStreams ;
cudaMemcpyAsync ( a d+o f f s e t , a h+o f f s e t , s i z e , d i r , stream [ i ] ) ;
}
for ( i = 0 ; i < nStreams ; i++) {
o f f s e t = i ∗N/nStreams ;
aKernel<<<N/( nThreads∗nStreams ) , nThreads , 0 , stream [ i ]>>>(a d+o f f s e t ) ;
}
Listing 2.5: Concurrent copy and kernel execution.
Nvidia GF100
The Nvidia GF100 processor, code named “Fermi” [28], is the new core architecture from
Nvidia that comes with many improvements on the preceding CUDA architecture. The
new Fermi SMs contains 32 CUDA cores, four times the number of cores in preceding
CUDA architectures. A Streaming Multiprocessor is able now to execute two warps
concurrently thanks to the Dual Warp Scheduler, two Instruction Dispatch Units and
the amount of CUDA cores. With Fermi, the amount of shared memory per SM has
Figure 2.11: On the left is represented the serial kernel execution prior to Fermi, while on the right is
represented the concurrent kernel execution available on Fermi.
been increased to 64 Kb that can be configured as 48 Kb of shared memory and 16 Kb
of L1 cache or 16 Kb of shared memory and 48 Kb of L1 cache. In addition to the L1
cache, Fermi features a 768 Kb unified L2 cache that services all load, store and texture
requests. Fermi is the first architecture that support PTX 2.0 instruction set. Support
to C++ has been obtained with the implementation of a unified address space that
unifies the three previously separated address spaces (local memory, shared memory,
global memory). Moreover, creation and deletion of objects, exception handling and
function pointers are now supported. The application context switch on the GPU has
been improved for better kernel-to-kernel communication performance. Fermi supports
concurrent kernel execution for kernels of the same application (Figure 2.11). This allows
many small kernels to be executed concurrently on the GPU without waste of resources.
30 State of the Art
2.2.2 AMD Stream Computing
AMD Stream Computing is a programming model and an hardware architecture that
enables GPGPU computing over AMD GPUs.
Programming Model
Kernels are executed in parallel using a virtualized SIMD programming model by the
GPU hardware. Instances of a kernel running on the GPU are called threads. Threads
of the same kernel execution are grouped together forming the domain of execution and
mapped to the elements of the output buffer. A group is a set number of threads that
execute blocks of code together in parallel before another group can execute the same
block of code. The current generation of AMD GPUs allows threads within a single
group to share data and synchronize with each other. This can be useful in certain
applications where inter-thread communication is either vital to the algorithm or can
vastly speedup the execution of the application [7].
Programming Interface. Programmers developing GPGPU applications on AMD
devices can choose between two programming interfaces: AMD Brook+ and AMD Com-
pute Abstraction Layer (CAL) [7]. Brook+ is a data-parallel C compiler that extends
the ANSI C programming language with two main key elements: streams and kernels.
Kernels are declared using the kernel declaration specifier while streams are specified
through the <> syntax. Each thread can obtain its identifier within its group calling
the instanceGroup() function while its identifier inside the entire domain of execution
can be obtained through the instance() function. Group size can be specified using
Attribute[GroupSize (x, y, z)]. Threads in a group can synchronize their execu-
tion and memory accesses calling the syncGroup() function. When all threads in a
group have reached the synchronization barrier, the execution of each thread can con-
tinue from that point. Listing 2.6 [21] shows a Brook+ application that performs the
saxpy function.
2.2 GPGPU platforms 31
kernel void Saxpy ( f loat a , f loat x<>, f loat y<>, out f loat r e su l t <>)
{
r e s u l t = a ∗ x + y ;
}
int main ( int argc , char∗∗ argv )
{
unsigned int Length = 100 ;
f loat a = 10 . 0 ;
f loat ∗ InData1 ;
f loat ∗ InData2 ;
f loat ∗ Result ;
/∗ −−−−−−− MEMORY ALLOCATION −−−−−−−− ∗/
InData1 = ( f loat ∗) mal loc ( s izeof ( f loat ) ∗ Length ) ;
InData2 = ( f loat ∗) mal loc ( s izeof ( f loat ) ∗ Length ) ;
Result = ( f loat ∗) mal loc ( s izeof ( f loat ) ∗ Length ) ;
/∗ −−−−−−− SET INPUT VALUEs −−−−−−−− ∗/
for ( int i = 0 ; i < Length ; ++i )
{
InData1 [ i ] = ( f loat ) rand ( ) ;
InData2 [ i ] = ( f loat ) rand ( ) ;
}
/∗ −−−−−−− SET DOMAIN −−−−−−−− ∗/
f loat indata1<Length>;
f loat indata2<Length>;
f loat r e su l t<Length>;
streamRead ( indata1 , InData1 ) ;
streamRead ( indata2 , InData2 ) ;
/∗ −−−−−−− RUN COMPUTE KERNEL −−−−−−−− ∗/
Saxpy (a , InData1 , InData2 , OutData ) ;
/∗ −−−−−−− GET RESULT −−−−−−−− ∗/
streamWrite (OutData , r e s u l t ) ;
/∗ −−−−−−− CLEAN UP and EXIT −−−−−−−− ∗/
f r e e ( InData1 ) ;
f r e e ( InData2 ) ;
f r e e ( Result ) ;
return 0 ;
}
Listing 2.6: Brook+ implementation of the saxpy function.
Comparing Listing 2.3 and Listing 2.6 we note many similarities between the CUDA and
the AMD Stream Computing programming model: input data initialization, allocation
and copy into the device memory, kernel launch and copy of output data from device
memory to host memory.
CAL is a device driver library used by programmers to write kernels at a lower level
and to gain more control over kernel execution. CAL runtime accepts kernels written in
AMD IL and produces executable code for the target architecture. Listing 2.7 [21] shows
a CAL application that performs the saxpy function. Code in Listing 2.6 and Listing
2.7 perform the same kernel launch. However, the latter requires a number of code lines
32 State of the Art
that is almost three times that of the former. Moreover, CAL programmers must have
a better understanding of the underlying hardware architecture [7].
const CALchar∗ ILke rne l =
“ i l p s 2 0 \n”
“ d c l i n p u t p o s i t i o n i n t e r p ( l i n e a r n op e r s p e c t i v e ) v0 . xy \n”
“dc l ou tpu t g en e r i c o0 . x \n”
“dc l cb cb0 [ 1 ] \ n”
“d c l r e s o u r c e i d (0 ) type (2d , unnorm) fmtx ( f loat ) fmty ( f loat ) fmtz ( f loat ) fmtw (
f loat ) \n”
“d c l r e s o u r c e i d (1 ) type (2d , unnorm) fmtx ( f loat ) fmty ( f loat ) fmtz ( f loat ) fmtw (
f loat ) \n”
“sample re source (0 ) sampler (0 ) r0 , v0 . xyxx\n”
“sample re source (1 ) sampler (0 ) r1 , v0 . xyxx\n”
“mad ieee o0 . x , cb0 [ 0 ] . x , r0 . x , r1 . x\n”
“ret dyn \n”
“end\n” ;
int main ( int argc , char∗∗ argv )
{
/∗ −−−−−−− INITIALIZATION −−−−−−−− ∗/
c a l I n i t ( ) ;
CALuint numDevices = 0 ;
calDeviceGetCount(&numDevices ) ;
CALdeviceinfo i n f o ;
ca lDev iceGet In fo (& in fo , 0) ;
CALdevice dev i ce = 0 ;
calDeviceOpen(&device , 0) ;
CALcontext ctx = 0 ;
calCtxCreate(&ctx , dev i c e ) ;
/∗ −−−−−−− COMPILE & LINK KERNEL −−−−−−−− ∗/
CALdeviceattr ibs a t t r i b s ;
a t t r i b s . s t r u c t s i z e = s izeof ( CALdeviceattr ibs ) ;
ca lDev iceGetAtt r ibs (&a t t r i b s , 0) ;
CALobject obj ;
ca l c lCompi l e (&obj , CAL LANGUAGE IL, ILkerne l , a t t r i b s . t a r g e t ) ;
// Link o b j e c t in to an image
CALimage image = NULL;
c a l c l L i nk (&image , &obj , 1) ;
/∗ −−−−−−− MEMORY ALLOCATION −−−−−−−− ∗/
// a l l o c a t e input / output resources and map them in to the con tex t
unsigned int Length = 100 ;
CALresource InData1 = 0 ; CALresource InData2 = 0 ;
ca lResAl locLocal1D(&InData1 , device , Length , CAL FORMAT DOUBLE 1, 0) ;
ca lResAl locLocal1D(&InData2 , device , Length , CAL FORMAT DOUBLE 1, 0) ;
CALresource Result = 0 ;
ca lResAl locLocal1D(&Result , device , Length , CAL FORMAT DOUBLE 1, 0) ;
CALresource a = 0 ;
calResAllocRemote1D(&a , &device , 1 , 1 , CAL FORMAT DOUBLE 1, 0) ;
CALuint p i t ch1 = 0 , p i t ch2 = 0 ;
CALmem InMem1 = 0 , InMem2 = 0 ;
/∗ −−−−−−− SET INPUT VALUEs −−−−−−−− ∗/
f loat ∗ f i nda ta1 = NULL;
f loat ∗ f i nda ta2 = NULL;
calCtxGetMem(&InMem1 , ctx , InData1 ) ;
calCtxGetMem(&InMem2 , ctx , InData2 ) ;
calResMap ( (CALvoid∗∗)&f indata1 , &pitch1 , InData1 , 0) ;
for ( int i = 0 ; i < Length ; ++i )
f i nda ta1 [ i ∗ p i tch1 ] = rand ( ) ;
2.2 GPGPU platforms 33
calResUnmap ( InData1 ) ;
calResMap ( (CALvoid∗∗)&f indata2 , &pitch2 , InData2 , 0) ;
for ( int i = 0 ; i < Length ; ++i )
f i nda ta2 [ i ∗ p i tch2 ] = rand ( ) ;
calResUnmap ( InData2 ) ;
f loat ∗ constPtr = NULL;
CALuint constPi tch = 0 ;
CALmem constMem = 0 ;
// Map constant resource to CPU and i n i t i a l i z e va lue s
calCtxGetMem(&constMem , ctx , a ) ;
calResMap ( (CALvoid∗∗)&constPtr , &constPitch , a , 0) ;
constPtr [ 0 ] = 1 0 . 0 ;
calResUnmap ( a ) ;
// Mapping output resource to CPU and i n i t i a l i z i n g va lue s
void ∗ r e s da t a = NULL;
CALuint p i t ch3 = 0 ;
CALmem OutMem = 0 ;
calCtxGetMem(&OutMem, ctx , Result ) ;
calResMap(&res data , &pitch3 , Result , 0) ;
memset ( re s data , 0 , p i t ch3 ∗ Length ∗ s izeof ( f loat ) ) ;
calResUnmap ( Result ) ;
/∗ −−−−−−− LOAD MODULE & SET DOMAIN −−−−−−−− ∗/
CALmodule module ;
calModuleLoad(&module , ctx , image ) ;
CALfunc func ;
CALname InName1 , InName2 , OutName , ConstName ;
calModuleGetEntry(&func , ctx , module , “main” ) ;
calModuleGetName(&InName1 , ctx , module , “ i0” ) ;
calModuleGetName(&InName2 , ctx , module , “ i1” ) ;
calModuleGetName(&OutName , ctx , module , “o0” ) ;
calModuleGetName(&ConstName , ctx , module , “cb0” ) ;
calCtxSetMem ( ctx , InName1 , InMem1) ;
calCtxSetMem ( ctx , InName2 , InMem2) ;
calCtxSetMem ( ctx , OutName , OutMem) ;
calCtxSetMem ( ctx , ConstName , constMem) ;
CALdomain domain = {0 , 0 , Length , 1} ;
/∗ −−−−−−− RUN COMPUTE KERNEL −−−−−−−− ∗/
CALevent event ;
calCtxRunProgram(&event , ctx , func , &domain ) ;
// wait f o r func t i on to f i n i s h
while ( calCtxIsEventDone ( ctx , event ) == CAL RESULT PENDING)
{ } ;
/∗ −−−−−−− GET RESULT −−−−−−−− ∗/
CALuint p i t ch4 = 0 ;
f loat ∗ f outdata = 0 ;
calResMap ( (CALvoid∗∗)& foutdata , & pitch4 , Result , 0) ;
for ( int i = 0 ; i < Length ; ++i )
foutdata [ i ∗ p i tch4 ] = ( f loat ) ( i ∗ p i tch4 ) ;
calResUnmap ( Result ) ;
/∗ −−−−−−− CLEAN UP and EXIT −−−−−−−− ∗/
calModuleUnload ( ctx , module ) ;
ca l c lFree Image ( image ) ;
c a l c lF r e eOb j e c t ( obj ) ;
calCtxReleaseMem ( ctx , InMem1) ;
ca lResFree ( InData1 ) ;
calCtxReleaseMem ( ctx , InMem2) ;
ca lResFree ( InData2 ) ;
calCtxReleaseMem ( ctx , constMem) ;
34 State of the Art
ca lResFree ( a ) ;
calCtxReleaseMem ( ctx , OutMem) ;
ca lResFree ( Result ) ;
ca lCtxDestroy ( ctx ) ;
ca lDev i c eC lo se ( dev i c e ) ;
calShutdown ( ) ;
}
Listing 2.7: AMD CAL implementation of saxpy function.
Compilation tool-chain. The software stack that provides compilation of Brook+
files (.br) is illustrated in Figure 2.12. brcc is a source-to-source meta compiler that
takes a .br file and separates code to be executed on the CPU from code to be executed
on the GPU. The former is compiled into the host machine code while the latter is
compiled into AMD IL (i.e. the AMD ISA). brt is a run-time library able to call device
Figure 2.12: The Brook+ compilation tool-chain.
code from CPU code and execute kernels using just-in-time compilation and the CAL
driver.
Memory Model
The following memory spaces are exposed to programmers and accessible by threads
executing a kernel [7]:
• Global Buffer. This memory region contains input and output buffers. It can be
located in the GPU RAM or in the CPU PCI express memory region.
2.2 GPGPU platforms 35
• Local Data Store (LDS). An on-chip memory space shared by all threads within a
group. The LDS has a write-private, read-public model: a thread can write only
to its own memory space but can read from the memory space of any thread in
the same group.
• Shared Registers. A set of 32-bit registers located on-chip. Shared registers are
a method of sharing data at a lower level than the LDS. The LDS shares data at
group level; shared registers share data at wavefront level.
Hardware Model
AMD GPGPU devices are called Stream Processors. They are composed of a set of
SIMD multiprocessors called SIMD Engines each containing a set of thread processors.
Thread processors within a SIMD engine execute the same instruction at the same
time while different SIMD engines are free to execute independently. Thread processors
consist in a pipeline of stream cores which fundamentally are ALUs capable of performing
integer, single precision floating point, double precision floating point and transcendental
operations (Figure 2.13). Thread processors can issue four instruction at a time. In
Figure 2.13: Generalized Stream Processor Structure.
order to hide fetching latencies four threads are executed per thread processor with
interleaving. If the stream processor has 16 thread processors per SIMD engine, 64
threads can be executed in parallel by each SIMD engine. This group of threads is
called wavefront. If threads within a wavefront diverge on their execution path due to
a branch condition, the paths of the branch are executed serially. In case of branch
divergence, the total time of execution of a branch is the sum of the execution times of
its paths [7].
Stream processors are devices capable of executing thousands of threads concurrently.
The thread scheduler de-schedules any thread that is waiting for a memory response and
36 State of the Art
schedule a new active thread for the execution. For kernels with high arithmetic intensity
this technique hides memory latencies and pipelining latencies. Figure 2.14 shows a
simplified example in which four active threads are running concurrently on a single
thread processor. As soon as a thread issues a memory request, the thread scheduler
schedules the next active thread for execution and so on. In this case computation
periods are sufficient to hide memory latencies [7].
Figure 2.14: Simplified execution of threads in a single thread processor.
Communication Model
There are three memory regions that are involved in data transfers between host and
device: host memory (the CPU RAM), PCIe memory (a region of the CPU RAM) and
the stream processor memory (GPU RAM). Communications between host and device
occurs over the PCIe bus using the DMA (Direct Memory Access) engine. During a
memory transfer process, data is copied from host memory to PCIe memory and then
from PCIe memory to stream processor memory. Using pinned memory the first copy
can be skipped but some memory requirements must be fulfilled such as a limited size
of data to be transmitted. DMA transfers can be performed asynchronously, i.e. a
data transfer can occur during a computation on the host and a computation on the
stream processor. Host to device data transfers and kernel invocations are implicitly
asynchronous while device to host data transfers are synchronous. Programmers can
change this behavior for example making asynchronous data transfers from device to
host. In this way more operations on the CPU can be overlapped with operations on
the GPU to achieve best performance [7].
2.2 GPGPU platforms 37
AMD Cypress
Cypress is the latest stream processor from AMD. It has 10 SIMDs, containing 16 Thread
Processors each. TPs execute 4 threads at a time, making it a 64-wide SIMD processor.
Each TP is internally arranged as a 5-way scalar processor, allowing for up to five scalar
operations co-issued in a VLIW (Very Long Instruction Word) instruction. Hence the
SIMD processor is 160 wide and the total number of scalar processors is 800.
The wavefront construction and the order of thread execution depends on the ras-
terization order of the domain of execution. Rasterization follows a pre-set zig-zag like
pattern across the domain. Currently, this pattern is not publicly available; it is only
known that it is based on a multiple of 8x8 blocks within the domain, aligned with
the size of a wavefront. The rasterization process is transparent to the user, but could
impact memory access performance.
The performance of FP64 calculations in the best case scenario is halved with re-
spect to the original FP32 case, i.e. 500 GFLOPs. The worst case scenario sees the
performance reduced to one quarter of the original FP32 execution.
AMD equips Cypress GPUs with one L1 Texture Cache for each SIMD core, plus
256KB L2 caches, for Texture, Color and Z/Stencil, and 32 KB L1 caches, for constant
data and instructions, for Texture, Vertex and Memory Read/Write. Moreover, the
Cypress goes even further with another memory area (of another 16 KB) called Global
Data Share, to enable communication among SIMD arrays. The CPU processor units
do not directly access GPU local memory; instead they issue memory requests through
dedicated hardware units. There are two ways to access memory: cached and uncached.
Aside from caching, the main difference between the two is that uncached memory
supports writes to arbitrary locations (scatter), whereas cached memory writes allow
outputs to the associated domain location of the thread only [8].
38 State of the Art
2.2.3 OpenCL
OpenCL [3] is an open standard platform managed by the Khronos Group, a non-profit
technology consortium. OpenCL enables the execution of general-purpose applications
across heterogeneous architectures. OpenCL consists of a programming language and
compiler for writing kernels, and APIs that can be used to control the communication
and interaction between the host and the devices. The OpenCL programming language
is a subset of the ISO C99 with extensions for parallelism exploitation. OpenCL is
supported by CUDA and AMD Stream Computing.
Programming Model
OpenCL programs can be divided into two parts: kernels that are functions to be ex-
ecuted on OpenCL devices, and the host program that is code that executes on the
host. Every kernel instance, called work-item, is assigned a global identifier from a three-
dimensional index space. Each work-item executes the same kernel function but operates
on different data elements and can follow different execution pathways. Work-items are
organized into work-groups which are assigned a unique work-groupID. Work-items are
also assigned a localID that identifies each work-item inside its block. Thus, a work-item
can be uniquely identified by its global identifier or through a combination of its localID
and its work-groupID. The index space associated with the OpenCL execution model,
called NDRange, defines the work-items and how the data maps onto the work-items.
In a strictly data parallel model, there is a one-to-one mapping between work-items and
stream elements over which a kernel can be executed in parallel. OpenCL implements
a relaxed version of the data parallel programming model where a strict one-to-one
mapping is not a requirement.
OpenCL provides a hierarchical data parallel programming model through work-
groups and work-items. There are two ways to specify the hierarchical subdivision. In
the explicit model a programmer defines the total number of work-items to execute in
parallel and how work-items are divided among work-groups. In the implicit model, a
programmer specifies only the total number of work-items to execute in parallel, and
the division into work-groups is managed by the OpenCL implementation [3]. Listing
2.8 shows an OpenCL implementation of the saxpy function. The get_global_id()
function returns the unique global work-itemID value for the current thread. We do not
provide specifications for each function used in Listing 2.8 because this is out of the scope
of this thesis. Listings 2.3, 2.6, 2.7 and 2.8 should give the reader a brief introduction to
these APIs in order to understand the efforts required by new programmers that start
developing GPGPU applications.
2.2 GPGPU platforms 39
kernel void Saxpy ( constant f loat ∗x , global f loat ∗y , const f loat a )
{
const uint i = g e t g l o b a l i d (0 ) ;
y [ i ] += a ∗ x [ i ] ;
}
int main ( int argc , const char ∗ argv [ ] )
{
/∗ −−−−−−− INITIALIZATION −−−−−−−− ∗/
// Enumerate a l l a v a i l a b l e OpenCL p la t forms
c l u i n t num platforms ;
c l p l a t f o rm i d ∗ p lat fo rms ;
c lGetPlat formIDs (0 , NULL, &num platforms ) ;
p la t fo rms = mal loc ( s izeof ( c l p l a t f o rm i d ) ∗ num platforms ) ;
c lGetPlat formIDs ( num platforms , p lat forms , NULL) ;
// Se l e c t the f i r s t p la t form
c l p l a t f o rm i d p l a t f o rm id = plat fo rms [ 0 ] ;
// Enumerate a l l a v a i l a b l e OpenCL p la t forms
c l u i n t num devices ;
c l d e v i c e i d ∗ dev i c e s ;
c lGetDeviceIDs ( p la t fo rm id , CL DEVICE TYPE ALL, 0 , NULL, &num devices ) ;
d ev i c e s = mal loc ( s izeof ( c l d e v i c e i d ) ∗ num devices ) ;
c lGetDeviceIDs ( p la t fo rm id , CL DEVICE TYPE ALL, num devices , dev ices , NULL) ;
// Se l e c t the f i r s t dev i ce
c l d e v i c e i d d e v i c e i d = dev i c e s [ 0 ] ;
// Create OpenCL contex t
context = clCreateContext (0 , 1 , &dev i c e i d , NULL, NULL, &e r r ) ;
// Create a command queue fo r the dev i ce
commands = clCreateCommandQueue ( context , d ev i c e i d , 0 , &e r r ) ;
/∗ −−−−−−− COMPILE & LINK KERNEL −−−−−−−− ∗/
program = clCreateProgramWithSource ( context , 1 , &ke rne l s ou r c e , NULL, &e r r ) ;
// Bui ld the program
clBuildProgram ( program , 1 , &dev i c e i d , NULL, NULL, NULL) ;
// Create ke rne l o b j e c t
aKernel = c lCreateKerne l ( program , ”Saxpy” , &e r r ) ;
/∗ −−−−−−− MEMORY ALLOCATION −−−−−−−− ∗/
// A l l o ca t e memory on dev i ce
dev x = c lCrea t eBu f f e r ( context , CL MEMREAD ONLY, s izeof ( f loat ) ∗ VECTOR SIZE,
NULL, NULL) ;
dev y = c lCrea t eBu f f e r ( context , CL MEM READWRITE, s izeof ( f loat ) ∗ VECTOR SIZE,
NULL, NULL) ;
// Create random data on hos t . . .
f loat ∗x = mal loc (VECTOR SIZE ∗ s izeof ( f loat ) ) ;
f loat ∗y = mal loc (VECTOR SIZE ∗ s izeof ( f loat ) ) ;
f loat a ;
/∗ −−−−−−− SET INPUT VALUEs −−−−−−−− ∗/
srand ( time (NULL) ) ;
a = ( f loat ) rand ( ) / RANDMAX;
for ( int i = 0 ; i < VECTOR SIZE; ++i )
{
40 State of the Art
x [ i ] = ( f loat ) rand ( ) / RANDMAX;
y [ i ] = ( f loat ) rand ( ) / RANDMAX;
}
// . . . and wr i t e i t to memory o b j e c t s
clEnqueueWriteBuffer (commands , dev x , CL TRUE, 0 , s izeof ( f loat ) ∗ VECTOR SIZE,
x , 0 , NULL, NULL) ;
c lEnqueueWriteBuffer (commands , dev y , CL TRUE, 0 , s izeof ( f loat ) ∗ VECTOR SIZE,
y , 0 , NULL, NULL) ;
/∗ −−−−−−− SET DOMAIN −−−−−−−− ∗/
// Set k e rne l arguments
c lSetKerne lArg ( aKernel , 0 , s izeof ( cl mem ) , &dev x ) ;
c lSetKerne lArg ( aKernel , 1 , s izeof ( cl mem ) , &dev y ) ;
c lSetKerne lArg ( aKernel , 2 , s izeof ( f loat ) , &a ) ;
/∗ −−−−−−− RUN COMPUTE KERNEL −−−−−−−− ∗/
// Enqueue ke rne l execu t ion
const s i z e t work items = VECTOR SIZE;
clEnqueueNDRangeKernel (commands , aKernel , 1 , NULL, &work items , NULL, 0 , NULL,
NULL) ;
// Wait f o r a l l commands in queue to f i n i s h
c lF i n i s h (commands) ;
/∗ −−−−−−− GET RESULT −−−−−−−− ∗/
// Read r e s u l t s from dev i ce memory to hos t memory
f loat ∗ r e s u l t = mal loc ( s izeof ( f loat ) ∗ VECTOR SIZE) ;
clEnqueueReadBuffer (commands , dev y , CL TRUE, 0 , s izeof ( f loat ) ∗ VECTOR SIZE,
r e su l t , 0 , NULL, NULL) ;
/∗ −−−−−−− CLEAN UP and EXIT −−−−−−−− ∗/
// Cleaning up OpenCL resources
cleanup ( ) ;
return 0 ;
}
Listing 2.8: OpenCL implementation of the saxpy function.
Memory Model
Work-items executing a kernel have access to four distinct memory regions [3]:
• Global Memory. This memory region permits read/write access to all work-items in
all work-groups. Work-items can read from or write to any element of a memory
object. Reads and writes to global memory may be cached depending on the
capabilities of the device.
• Constant Memory: A region of global memory that remains constant during the
execution of a kernel. The host allocates and initializes memory objects placed
into constant memory.
• Local Memory: A memory region local to a work-group. This memory region can
be used to allocate variables that are shared by all work-items within a work-group.
It may be implemented as dedicated regions of memory on the OpenCL device.
2.2 GPGPU platforms 41
Alternatively, the local memory region may be mapped onto sections of the global
memory.
• Private Memory: A region of memory private to a work-item. Variables defined in
one work-item’s private memory are not visible to another work-item.
Table 2.2 lists the OpenCL memory regions and their features.
Host Kernel
Global Dynamic Allocation No Allocation
R/W access R/W access
Constant Dynamic Allocation Static Allocation
R/W access Read-only access
Local Dynamic Allocation Static allocation
No access R/W access
Private No allocation Static allocation
No access R/W access
Table 2.2: OpenCL memory regions.
Hardware Model
The OpenCL platform consists of a host connected to one or more devices, also called
OpenCL devices. Each device is composed of a set of Compute Units and each compute
unit is further subdivided into a set of Processing Elements that are the fundamental
units for computation (see Figure 2.15). An OpenCL application runs on the host
and submits commands to a device for the execution of computations. The processing
elements within a compute unit execute instructions in a SIMD or in a SPMD (Single-
Program, Multiple-Data) fashion. In the first case the processing elements execute the
same instruction at the same time while in the second case each processing element
maintain its own program counter [3].
Communication Model
Host programs manage the execution of kernels through contexts and command queues.
A context includes the following elements:
• Devices: The collection of OpenCL devices to be used by the host.
• Kernels: The OpenCL functions that run on OpenCL devices.
• Program Objects: The program source and executable that implement the kernels.
42 State of the Art
Figure 2.15: The OpenCL hardware model.
• Memory Objects: A set of memory objects visible to the host and the OpenCL
devices. Memory objects contain values that can be manipulated by instances of
a kernel.
The elements listed above are manipulated by the host program using the OpenCL
API. The host program puts commands into a command-queue that are flushed to the
devices bound to a specific context. Commands can be of the following types:
• Kernel execution commands: Execute a kernel on the processing elements of a
device.
• Memory commands: Transfer data to, from, or between memory objects, or map
and unmap memory objects from the host address space.
• Synchronization commands: Constrain the order of execution of commands.
For example the following function enqueues a command to execute a kernel on a device:
c l i n t clEnqueueNDRangeKernel (
cl command queue command queue ,
c l k e r n e l aKernel ,
c l u i n t work dim ,
const s i z e t ∗ g l oba l wo r k o f f s e t ,
const s i z e t ∗ g l oba l wo rk s i z e ,
const s i z e t ∗ l o c a l wo r k s i z e ,
c l u i n t num even t s i n wa i t l i s t ,
const c l e v en t ∗ e v e n t wa i t l i s t ,
c l e v e n t ∗ event )
2.2 GPGPU platforms 43
Commands in a queue can be executed in two ways:
• In-order Execution: Commands are launched in the order they appear in the
command-queue and complete in order. In other words, a prior command in the
queue completes before the beginning of the following command. This serializes
the execution order of commands in a queue.
• Out-of-order Execution: Commands are issued in order, but do not wait to com-
plete before following commands execute. Any order constraints are enforced by
the programmer through explicit synchronization commands.
Synchronization can occur in two different scenarios:
• Work-items in a single work-group.
• Commands enqueued to command-queue(s) in a single context.
In the first case synchronization is done using a barrier at which each work-item
within a work-group must wait until all work-items of that group have reached the
barrier. In the second case, commands in a command-queue can be synchronized using
a command-queue barrier: every command preceding the barrier must be completed
before the succeeding command can start its execution.
44 State of the Art
2.3 Hybrid architectures
For completeness, in this section we list some of the emerging and established hybrid
architectures. For hybrid architectures we mean computers that contain both Von Neu-
mann and a non-Von Neumann hardware units.
Cell Broadband Engine [29] is a computer architecture developed by Sony Corpo-
ration, IBM and Toshiba. It combines a general-purpose Power Architecture core of
modest performance with streamlined coprocessing elements which greatly accelerate
multimedia and vector processing applications, as well as many other forms of dedicated
computation.
Larrabee [30] is the codename of an hybrid architecture proposed by Intel that should
become a new processor for parallel computations. The main feature of this architec-
ture consists in the use of traditional x86 cores exploited for the execution of parallel
applications. Unlike CUDA, AMD and other GPGPU platforms, Larrabee should allow
programmers to develop parallel applications without learning new programming lan-
guages or techniques. Probably, Larrabee will be replaced with the new Many Integrated
Core architecture named Intel Knights Corner.
AMD has recently announced the Fusion processor that will be released in 2011 [31].
Fusion processors are both CPUs and GPUs (Accelerated Processing Units).
2.4 Other existing approaches for parallelism exploitation
using GPUs
Microsoft DirectCompute [32] is an application programming interface that supports
general-purpose computing on graphics processing units on Microsoft Windows Vista
and Windows 7. DirectCompute is part of the Microsoft DirectX collection of APIs and
was initially released with the DirectX 11 API but runs on both DirectX 10 and DirectX
11 graphics processing units.
Intel Array Building Blocks [33] is the combination of Intel’s Ct Technology [34] and
RapidMind technology [35]. RapidMind is a development platform and a runtime that
enables single-threaded applications to execute on multi-core architectures. Intel Array
Building Blocks provides a C++ API and runtime for developing parallel applications
without the knowledge of low-level parallelism mechanisms or hardware architectures.
It provides a flexible programming model for multicore and many-core architectures.
Accelerator [36] is an high-level data parallel library which uses parallel processors
such as the GPU or multicore CPU to accelerate execution. Accelerator fills the gap
between virtual machines and special-purpose architectures. In fact, programmers use
the .NET CLR to write data-parallel applications. The main advantage is that no
hardware aspects are exposed to programmers. Data-parallel code is compiled on the fly
into pixel shader code. This is main limitation, since it relates Accelerator to a specific
2.4 Other existing approaches for parallelism exploitation using GPUs 45
graphics API and makes porting of applications very difficult.
CUDA.NET [37] is a library that wraps the functionality by Nvidia CUDA driver
for .NET based applications. To ease development and migration from existing CUDA
application written in C to .NET, the same API was reserved. Therefore a new ob-
ject oriented approach is suggested for allowing programmers work easily with CUDA
and devices. All methods are static, thus allowing direct access to the same functions.
CAL.NET provide access to CAL functionality through .NET applications, such that
programmers can create cross-platform CAL solutions, that can run on any operating
system supported by CAL without changing source code.
46 State of the Art
Chapter 3
Tools
In the following section we give a brief introduction to the Common Language Infras-
tructure (CLI) design and capabilities. In particular, we will focus on technologies that
operates with Metadata such as Custom Attributes and Reflection, and compilation
methods such as Just-In-Time compilation (JIT). In Section 3.2 we present, along with
its technologies, the Microsoft proprietary implementation of the CLI, the Common Lan-
guage Runtime. Moreover, Microsoft distributed an open-source implementation of the
CLI, the Shared Source Common Language Infrastructure (SSCLI), presented in Section
3.3.
3.1 Common Language Infrastructure
The Common Language Infrastructure (CLI) is an open specification (published under
ECMA-335 [12] and ISO/IEC 23271) that provides a specification for executable code
and the execution environment (the Virtual Execution System) in which it runs. The
CLI has several implementations such as the Common Language Runtime [11] (CLR),
the SSCLI [38], Mono [39] and Portable.NET [40] (part of the DotGnu project).
3.1.1 Design and Capabilities
The Common Language Infrastructure covers the following aspects:
• The Common Type System
• Metadata
• The Common Language Specification
• The Virtual Execution System
47
48 Tools
Common Type System
The Common Type System (CTS) defines the base set of types that embody standard-
ized cross-language interoperability and specifies how types definitions and values are
represented in computer memory [41]. In order to support object-oriented, functional
and procedural programming the CTS deals with two kind of entities: objects and values.
Values represent simple types like integers and floats. They store information about a
type such as the meaning of the bit representation, the storage it occupies in memory and
the operations allowed. Objects define new types so type information are self-contained
in an object representation. Objects may recursively contain other entities like values
or objects.
Metadata
Metadata is a structured way to represent all information managed by the CLI to locate
and load classes, lay out instances in memory, resolve method invocations, translate
Common Intermediate Language (CIL) to native code, enforce security, and set up run-
time context boundaries [12]. Metadata contains all the information about types defined
in the CTS so it is widely used by tools that manipulate programs (e.g. compilers, de-
buggers and run-time code generators). In particular, metadata enables control over
runtime behavior of programs through attributes. It can be accessed in two different
ways: directly read from a CLI module file following the format specified in the ECMA
standard (partition II), and through a reflection API (see Section 3.2.1).
Common Language Specification
To fully interact with other objects regardless of the language they were implemented
in, objects must expose to callers only those features that are common to all the lan-
guages they must interoperate with. For this reason, it has been defined the Common
Language Specification (CLS), which is a set of basic language features needed by many
applications. The CLS rules define a subset of the Common Type System: all the
rules that apply to the common type system apply to the CLS, except where stricter
rules are defined in the CLS (Figure 3.1). The CLS helps enhance and ensure language
interoperability by defining a set of features that developers can rely on to be available
in a wide variety of languages [11].
Virtual Execution System
For the purposes of this thesis we consider Virtual Execution System (VES), Virtual
Execution Environment (VEE) and Virtual Machine (VM) as synonyms. They provide a
system that works with bytecode (i.e. intermediate language) in a way that is processor-
independent. Examples of VEEs are the Java Virtual Machine (JVM), the CLI itself,
the virtual machines of Nvidia and AMD.
3.1 Common Language Infrastructure 49
Figure 3.1: Languages offer a subset of the CLR/CTS and a superset of the CLS (but not necessarily
the same superset) [42].
VEEs have the following advantages:
• Portability. By using an intermediate language n languages on m platforms can
be implemented by n + m translators instead of n ∗m translators.
• Efficiency. By delaying the translation to a specific native platform as much as
possible, the execution platform can make optimal use of the knowledge of the
underlying architecture, or even adapt to the dynamic behavior of the program.
• Security. High-level intermediate code is more amenable to deployment and run-
time enforcement of security and typing constraints than low level binaries.
• Interoperability. By sharing a common type system and high-level execution envi-
ronment interoperability between different languages becomes easier than binary
interoperability.
• Flexibility. Combining high level intermediate code with meta-data enables the
construction of meta-programming concepts such as reflection, dynamic code gen-
eration, serialization, etc [43].
Virtual machines can be of two different types: stack-based and register-based.
Stack-based virtual machines use a stack to manage a program execution reading operands
from the stack and pushing results onto the stack. Instead, register-based virtual-
machines use a set of abstract registers to pass arguments to instructions. Examples
of stack-based virtual-machines are the CLI and the JVM while examples of register-
based VM are the machines of Nvidia PTX and AMD IL.
50 Tools
3.1.2 Common Intermediate Language
The Common Intermediate Language (CIL) is an ECMA standard (ECMA-335) for the
bytecode understood by the CLI. It defines a set of opcodes (IL instructions), not tied
to any existing hardware architecture, which drive an abstract stack machine. The two
most important implementations of the CIL are the Microsoft Intermediate Language
(MSIL) [11] and the MONO CIL [39].
Compilation tool-chain of the CLI
Applications written in high-level languages must be first compiled into CIL before being
executed by the VES (Figure 3.2). This step requires n source-to-source compilers where
Figure 3.2: Common Language Infrastructure tool-chain.
n is the number of high-level languages that are going to be compiled. In Listings 3.1, 3.2
and 3.3 the Hello World code snippet is implemented respectively in C#, Visual Basic
and Delphi.
public stat ic void Main( St r ing [ ] a rgs )
{
Console . WriteLine ( “He l lo World ! ” ) ;
}
Listing 3.1: “Hello World!” C# implementation.
3.1 Common Language Infrastructure 51
Public Sub Main(ByVal args as String ( ) )
Console .WriteLine ( “He l lo World ! ” ) ;
End Sub
Listing 3.2: “Hello World!” Visual Basic implementation.
procedure Hel lo . Main ( args : string [ ] ) ;
begin
Console . WriteLine ( “He l lo World ! ” ) ;
end ;
Listing 3.3: “Hello World!” Delphi implementation.
However, in order to guarantee interoperability, the first step of compilation pro-
duces the same IL code depicted in Listing 3.4. This example has been produced using
.NET Reflector [44], a software developed by Redgate that is able to decompile assem-
blies into a subset of the supported .NET languages.
. method public h idebys i g stat ic void Main( s t r i n g [ ] a rgs ) c i l managed
{
. en t rypo int
. maxstack 8
IL 0000 : nop
IL 0001 : l d s t r “He l lo World ! ”
IL 0006 : c a l l void [ mscor l ib ] System . Console : : WriteLine ( s t r i n g )
IL 000b : nop
IL 000c : r e t
}
Listing 3.4: CIL compiled code that prints “Hello World!” on the console.
CIL modules can not be directly executed by the CPU but require an additional
compilation step in order to translate CIL instructions into executable machine code.
This step can be performed either at compile-time (Ngen of Microsoft .NET [45]) or at
runtime by a just-in-time compiler. The Native Image Generator (Ngen) is a tool that
improves the performance of managed applications. Ngen creates native images, which
are files containing compiled processor-specific machine code, and installs them into the
native image cache on the local computer. The runtime can use native images from the
cache instead using the JIT compiler to compile the original assembly.
Just-in-time compilation
Just-in-time compilation, or simply “jitting”, is placed between interpretation and static
compilation. An interpreted language is translated in executable code during every
execution, while statically compiled languages are compiled in executable code before
execution. Just-in-time compilation is similar to interpretation but it caches already
translated code to achieve best performance. Static compilation may seem the best
choice to achieve peak performance because it does not introduce overhead at runtime.
Instead this is not always true since interpretation and just-in-time compilation benefit
52 Tools
from best runtime knowledge and for this reason can produce more optimized executable
code.
Types in a CLI-based application are typically not loaded until they are needed, and
once a type is loaded, its methods are not translated until they are needed for execution.
Every subsequent call to a method does not require the method to be recompiled because
the executable code was cached the first time the method was compiled. Once a method
has been compiled the JIT compiler jumps at the first instruction and executes that
method as illustrated in Figure 3.2.
3.2 Common Language Runtime 53
3.2 Common Language Runtime
The Common Language Runtime is the Microsoft proprietary implementation of the CLI
and the core of the Microsoft’s .NET initiative. The most important supported languages
are C#, F#, Visual Basic, a CLI version of C++, J#, Fortran, Perl and Python. In this
section we present the programming techniques provided by the CLR that we have used
in this thesis. For more documentation on the features of the CLR refer to the Microsoft
Developer Network [11].
3.2.1 Reflection
Reflection is the capability of an application to access or even modify its internal state
of execution. These capabilities are related to two important aspects:
• Introspection. The application observe and decide its execution basing on its
internal state.
• Intercession. The application modify its execution or change its interpretation.
Systems supporting reflection, like the CLR and the JVM, are sometimes called reflective
systems. Reflective systems provide for example the possibility of locating members,
invoking methods, accessing type attributes or creating object instances whose name
is not known until runtime. Information about programs are stored in the form of
metadata along with bytecode at compile time. Programmers developing applications on
a reflective system have access to special APIs that allow the interaction with metadata.
The following method invocation shows how to retrieve the methods defined in MyType
class using the Type.GetMethods() method.
MethodInfo [ ] ms = MyType . GetMethods ( BindingFlags . Publ ic | BindingFlags .
DeclaredOnly | BindingFlags . In s tance ) ;
GetMethods() requires a bitmask that filters, in this case, any public, not inherited and
instance method. Moreover, it is possible to exploit reflection for dynamically instantia-
tion of classes and method invocations. In the following listing an instance of the MyType
class is created and then the Add() method is invoked on it.
Object MyInst = Act ivator . Create Ins tance (MyType) ;
Object MyObj = MyInst . InvokeMember (“Add” ,
BindingFlags . InvokeMethod ,
System . Type . DefaultBinder ,
MyInst ,
new Object [ ] { int , int }) ;
The InvokeMember() parameters represent respectively the name of the invoked method,
a bitmask to filter methods, the default binder, the instance that owns the method and
finally the parameters array to be passed to the method [43].
54 Tools
3.2.2 Metadata extensibility
CLR provides metadata extensibility with arbitrary information using Custom Attributes
(CA). A CA is an instance of a class that inherits from the System.Attribute class.
Properties of CA must be computable at compile-time because attribute classes are in-
stantiated at compile-time instead of runtime like any other class. In fact, CA are placed
into an assembly and are ignored by the execution environment but they can be retrieved
by applications through the reflection APIs [11].
Listing 3.5 shows how to declare and retrieve Custom Attributes. In this example,
the Author class is a CA since it extends the Attribute class. The Author class has a
constructor with a string parameter representing the name. The Author CA is assigned
to the CSPoint class through the [...] syntax and the attribute is instantiated with the
“Damien Watkins” name. At runtime, custom attributes can be retrieved through the
GetCustomAttributes() method.
using System . Re f l e c t i o n ;
public class Author : Att r ibute
{
using System ;
public readonly s t r i n g name ;
public Author ( s t r i n g name)
{
this . name = name ;
}
public ove r r i d e S t r ing ToString ( )
{
return St r ing . Format ( “Author : {0}” , name) ;
}
}
[ Author ( “Damien Watkins” ) ]
public class CSPoint : Point
{
public stat ic void Main ( )
{
MemberInfo i n f o = typeo f ( CSPoint ) ;
ob j e c t [ ] a t t r i b u t e s = i n f o . GetCustomAttributes ( ) ;
Console . WriteLine ( “Custom Att r ibute s are : ” ) ;
for ( int i = 0 ; i < a t t r i b u t e s . Length ; i++)
{
System . Console . WriteLine ( “Attr ibute ” + i + “ : i s ” + a t t r i b u t e s [ i ] .
ToString ( ) ) ;
}
}
}
Listing 3.5: C# Custom Attributes.
3.2.3 Delegates
Delegates define a reference type that can encapsulate a static or an instance method.
Delegates are similar to function pointers in C++ but they are type-safe and secure. A
3.2 Common Language Runtime 55
delegate object is an instance of a delegate type [11]. The following C# listing shows
how to declare a delegate type and how to create delegate objects from that type. The
delegate objects are bound to an instance and to a static method.
/∗ Delegate d e f i n i t i o n ∗/
public de l e ga t e int BinaryOp ( int x , int y ) ;
/∗ Target c l a s s ∗/
public class MathLib {
i n t e r n a l int sum = 0 ;
public int Add( int m, int n)
{
sum += m + n ;
return m + n ;
}
public stat ic int Subtract ( int a , int b)
{
return a − b ;
}
}
class MyApp
{
stat ic void Main ( )
{
MathLib ta r g e t = new MathLib ( ) ;
Type t t = typeo f (MathLib ) ;
Type dt = typeo f (BinaryOp ) ;
BinaryOp op1 = (BinaryOp ) Delegate . CreateDelegate ( dt , tt , “Subtract” ) ;
BinaryOp op2 = (BinaryOp ) Delegate . CreateDelegate ( dt , target , “Add”) ;
}
}
The CreateDelegate() method of the System.Delegate class creates a new delegate
object of the specified type and bind it to a method that must have the same signature
of the delegate type. Otherwise an exception is thrown. Delegate objects can also be
created invoking the delegate constructor like for any other C# object as follows:
// Map the d e l e g a t e to the ins tance method
MyDelegate d = new MyDelegate (p . InstanceMethod ) ;
d ( ) ;
// Map to the s t a t i c method
d = new MyDelegate (MyClass . StaticMethod ) ;
d ( ) ;
Delegates are similar to single-method interfaces. The main difference is that in-
terfaces require compatibility between the target method type and the interface type.
Delegates only require that the method they are going to be bind has the same signature
of that expected by the delegate type [46].
56 Tools
3.2.4 Managed and un-managed code
The common language runtime uses the terms “managed” and “un-managed” to qualify
code, data, and pointers. Managed or un-managed identify the amount of control that
the runtime has over aspects of a program. Anything that is managed is tightly controlled
by the common language runtime [11].
Managed code provides the execution engine with sufficient information (metadata)
to allow the execution engine to manage its execution safely. Safe execution includes
such aspects of program execution as debugging, inter-language interoperability, memory
management, and security. Managed data describes values that are allocated on the
garbage-collected heap by the common language runtime. Un-managed code does not
provide such information to the execution engine. Thus, the execution engine cannot
provide these services for un-managed code [11].
3.2.5 Interoperating with un-managed code
The fact that managed and un-managed code may exist in the same program is not
necessarily a problem. However, the interaction between managed and un-managed
code requires close attention. For example, the garbage collector can only run when
all threads are suspended. When a thread executes un-managed code, the execution
engine cannot suspend the thread, and garbage collection cannot occur. Another area
of concern is exception handling. Managed code will throw a .NET Framework common
language runtime exception while native code may throw a WIN32 exception. When
exceptions propagate between managed and un-managed code, the exception needs to
change to fit the expected model [11].
Platform Invocation Services
Platform Invocation Services, commonly referred to as P/Invoke, is a feature of the
Common Language Runtime, that enables managed code to interact with un-managed
code (Figure 3.3). Platform invoke relies on metadata to locate exported functions and
marshal1 their arguments at run time. When platform invoke calls an un-managed
function, it performs the following sequence of actions:
• Locates the DLL containing the function.
• Loads the DLL into memory.
• Locates the address of the function in memory and pushes its arguments onto the
stack, marshaling data as required.
1In computer science, marshalling (similar to serialization) is the process of transforming the memory
representation of an object to a data format suitable for storage or transmission. It is typically used when
data must be moved between different parts of a computer program or from one program to another.
3.2 Common Language Runtime 57
Figure 3.3: A Platform Invoke call to an un-managed DLL function [11].
• Transfers control to the un-managed function [11].
To declare a method as having an implementation from a DLL export using C# (Listing
3.6), developers must:
• Declare the method with the static and extern C# keywords.
• Attach the DllImport attribute to the method. The DllImport attribute allows
you to specify the name of the DLL that contains the method. The common
practice is to name the C# method the same as the exported method, but you can
also use a different name for the C# method.
• Optionally, specify custom marshaling information for the method’s parameters
and return value, which will override the .NET Framework default marshaling
[11].
using System ;
using System . Runtime . I n t e r opSe rv i c e s ;
class PlatformInvokeTest
{
[ Dl l Import ( “msvcrt . d l l ” ) ]
public stat ic extern int puts ( s t r i n g c ) ;
[ Dl lImport ( “msvcrt . d l l ” ) ]
i n t e r n a l stat ic extern int f l u s h a l l ( ) ;
public stat ic void Main ( )
{
puts ( “Test” ) ;
f l u s h a l l ( ) ;
}
}
Listing 3.6: P/Invoke example using C#.
For every .NET Framework type there is a default un-managed type, which the
common language runtime will use to marshal data across a managed to un-managed
58 Tools
function call. For example, the default marshaling for C# string values is to the type
LPTSTR (pointer to TCHAR char buffer). You can override the default marshaling using
the MarshalAs attribute in the C# declaration of the un-managed function as illustrated
in Listing 3.7.
using System ;
using System . Runtime . I n t e r opSe rv i c e s ;
class PlatformInvokeTest
{
[ Dl l Import ( “msvcrt . d l l ” ) ]
public stat ic extern int puts (
[ MarshalAs (UnmanagedType . LPStr ) ]
s t r i n g m) ;
[ Dl lImport ( “msvcrt . d l l ” ) ]
i n t e r n a l stat ic extern int f l u s h a l l ( ) ;
public stat ic void Main ( )
{
puts ( “He l lo World ! ” ) ;
f l u s h a l l ( ) ;
}
}
Listing 3.7: Custom Marshaling.
GCHandle table
The System.Runtime.InteropServices.GCHandle class is used with the GCHandleType
enumeration to create a handle corresponding to any managed object. This handle can
be one of four types: Weak, WeakTrackResurrection, Normal, or Pinned. When the
handle has been allocated, you can use it to prevent the managed object from being
collected by the garbage collector when an un-managed client holds the only reference.
Without such a handle, the object can be collected by the garbage collector before
completing its work on behalf of the un-managed client [11].
You can also use GCHandle to create a pinned object that returns a memory address
to prevent the garbage collector from moving the object in memory.
When the handle goes out of scope you must explicitly release it by calling the
GCHandle.Free() method; otherwise, memory leaks may occur. When you free a pinned
handle, the associated object will be unpinned and will become eligible for garbage
collection, if there are no other references to it.
3.3 Shared Source Common Language Infrastructure
Shared Source Common Language Infrastructure (SSCLI), previously codenamed “Ro-
tor”, is the Microsoft’s shared source implementation of the CLI. SSCLI can be seen as
composed by four different parts:
3.4 CLIFile Reader 59
• The CLI execution engine. This is the core of the CLI virtual execution environ-
ment implementation. JIT compilation, memory management, assembly and class
loading, type resolution, metadata parsing, stack walking, and other fundamental
mechanisms are implemented here.
• Component frameworks that both wrap and extend the execution engine
• A portability layer (the Platform Adaptation Layer) used to move from one operat-
ing system to another. The place where the work to bring Rotor to new platforms
would begin.
• Tools, tests, compilers, documentation, and utilities for working with managed
code
In Chapter 4 we will study in depth the JIT compiler implemented in SSCLI 2.0 in
order to formulate our CPU performance model.
3.4 CLIFile Reader
CLIFile Reader [47] is a library specifically designed to read and rewrite .NET binaries.
It interacts with .NET reflection only under explicit request so large code bases can
be analyzed using it. It is built using memory mapping to avoid unnecessary memory
allocation, data is accessed directly on the disk and CLR meta-data tables are exposed
as a set of tables using indexers. CLIFileRW provides the ILCursor class, that is a linear
cursor into a stream of CIL instructions. The cursor provides a number of facilities that
are needed to generate optimized streaming code [8].
60 Tools
Chapter 4
Performance Modeling for
GPGPU and CPU
In this chapter we present performance models for GPGPUs and CPUs that have been
formulated and validated during this thesis. The aim of this part is to enable the
scheduler to estimate the completion time of a task either on the CPU and on the GPU.
Basing on our performance models, the scheduler decides on which platform (CPU,
GPU) execute a task.
4.1 Parameters for performance evaluation
In this section we introduce the main parameters related to performance evaluation used
throughout this thesis.
The mean execution time of a sequential computation is the mean time required to
completely execute that computation. For parallel computations, we define the comple-
tion time as the mean time required to process all elements in an input stream. Equation
4.1 shows how to calculate the mean completion time Tc, where m is the number of el-
ements in the stream and T is the mean execution time required to process a single
element [48].
Tc = m ∗ T (4.1)
Bandwidth, also called throughput, measures the mean number of operations that an
hardware unit (e.g. an Arithmetic Logic Unit or a memory unit) is able to execute in the
unit of time. As demonstrated by Shane Ryoo et. al. [49], device memory bandwidth is
one of the most important factors that can affect performance of GPGPU computations.
Thus, calculating memory bandwidth can be a key factor in approximating the overall
performance of an application accurately. Device memory bandwidth can be calculated
as theoretical bandwidth or as effective bandwidth. Theoretical bandwidth can be cal-
culated using hardware specifications provided by vendors such as memory clock rate,
61
62 Performance Modeling for GPGPU and CPU
memory interface width and type of RAM (e.g. Double Data Rate).
Bt =
ClockRateHz ∗ (interfaceWidth/8) ∗ 2
109
(4.2)
For example, using Equation 4.2, the peak theoretical memory bandwidth of the Nvidia
GeForce GTX 580 is 192.4 GB/sec:
(2004 ∗ 106 ∗ (384/8) ∗ 2)/109 = 192.4 (4.3)
In this calculation, the memory clock rate is converted in to Hz, multiplied by the
interface width (divided by 8, to convert bits to bytes) and multiplied by 2 due to the
double data rate. Finally, this product is divided by 109 to convert the result to GB/sec
(GBps).
Effective bandwidth must be calculated timing specific program activities and consid-
ering the amount of data respectively read and written from and to the device memory.
Be =
Br+Bw
109
T
(4.4)
In Equation 4.4 Br and Bw represent respectively the number of bytes read and written
by the kernel. T represent the measured time required for transfer. These two indicators
could be very different but only the second should be used to evaluate performance
accurately [27].
Latency represents the time spent by an hardware unit for completely process a
single stream element. The service time measures the mean execution time between the
beginning of elaboration of two consecutive stream elements. Bandwidth can also be
expressed using the service time (T ) as illustrated in Equation 4.5.
B =
1
T
(4.5)
As we will see further in this thesis, an important parameter for evaluating the
performance of SIMD architectures is the arithmetic intensity. Arithmetic intensity
is defined as the number of operations performed per memory word1 transferred. It is
important for GPGPU computations to have high arithmetic intensity in order to reduce
the effect of memory access latencies [6].
1We consider a memory word as composed of 32 bits.
4.2 GPGPU Performance Model 63
4.2 GPGPU Performance Model
Since we aimed to provide a performance model that is independent from any vendor-
specific GPGPU device (Nvidia, AMD, etc.), we defined an abstract model, called the
ground model, based on the OpenCL programming model. As seen in Chapter 2, OpenCL
is an open programming standard that targets various compute devices such as Nvidia
and AMD GPUs. Its programming model naturally maps onto the Nvidia and the AMD
programming models. For this reasons, the OpenCL programming model may be seen
as the intersection between the features exposed by CUDA and the those exposed by
AMD Stream Computing.
Without loss of generality, in this section we will use the OpenCL terminology. How-
ever, as seen in Chapter 2, each term has a synonymous in AMD Stream Computing
and CUDA as well. Table 4.1 correlates CUDA, AMD Stream Computing and OpenCL
terms.
CUDA AMD Stream Computing OpenCL
CUDA Device Stream Processor Compute Device
Streaming Multiprocessor SIMD Engine Compute Unit
CUDA core Thread Processor Processing Element
Kernel Kernel Kernel
Warp Wavefront -
Thread Group Group Work-group
Grid Domain of Execution NDRange
Global Memory Global Buffer Global Memory
Shared Memory Shared Memory Local Memory
Local Memory - Private Memory
Table 4.1: Correlation between CUDA, AMD Stream Computing and OpenCL terminologies.
The following sections present a ground model that calculates the completion time of
a single kernel setup and execution on a single GPGPU device. Moreover, we will show
how the ground model can be incrementally refined/specialized to models that target
specific platforms (e.g. CUDA and AMD Stream Computing) or even specific GPU
cards. For instance, we propose and validate a model refinement tied to the Nvidia PTX
1.x architecture.
In order to validate the refined model we performed several tests in CUDA stressing
different features of the device hardware. We did not have the possibility to propose
and validate a refinement for the AMD platform because of lack of instrumentation and
equipment. This could be done in future works.
64 Performance Modeling for GPGPU and CPU
4.2.1 Related works
Many works in literature investigate on how to achieve peak performance exploiting
the computational power of GPGPU enabled devices. Almost all of them take some
applications of interest and look for specific code optimizations to reduce or eliminate
bottlenecks. Shane Ryoo et. al. [49] present some optimization principles for CUDA
applications and apply them to a suite of problems (e.g. the matrix multiplication)
verifying the increase of performance. Examples of these optimizations are leverage
zero-overhead thread scheduling executing hundreds of threads to hide memory latency,
and optimized use of on-chip memory to reduce bandwidth usage. In [50] Shuai Che
et. al. examine several common, computationally demanding applications (e.g. Traffic
Simulation, Thermal Simulation and K-Means) whose performance may benefit from
graphics hardware’s parallel computing capabilities.
A performance model for Iterative Stencil Loops on GPUs is presented by Meng and
Skadron [51]. They automate Ghost Zone optimizations basing on the CUDA computa-
tional model.
Some works are focused on providing a performance model for a specific platform.
Kothapalli et. al. [52] present a performance prediction model for the CUDA platform
that encompasses the various facets of the GPU architecture like scheduling, memory
hierarchy and pipelining among others. Hong and Kim in [53] propose an analytical
model that estimates the execution time of massively parallel programs. The key com-
ponent of their model is the estimation of the number of parallel memory requests (they
call this the memory warp parallelism) by considering the number of running threads
and memory bandwidth. Based on the degree of memory warp parallelism, the model
estimates the cost of memory requests, thereby estimating the overall execution time
of a CUDA program. To the best of our knowledge there are not works in literature
proposing performance models for the AMD Stream Computing platform.
We found only one work in literature that investigates on how to build GPGPU
performance models not tied to any GPGPU platform. Baghsorkhi et. al. [54] propose
a performance model for GPU architectures focusing on microarchitecture features such
as un-coalesced memory accesses, scratch-pad memory bank conflicts and control flow
divergence. It also introduces an abstract interpretation of a GPU kernel, the work flow
graph, based on the program dependence graph (PDG).
4.2.2 Ground Model
In order to set up and execute a kernel on a device, GPGPU applications perform
some preliminary operations on the host, such as context creation and configuration and
device memory allocations. The execution time required for runtime initialization and
configuration can be considered constant for each kernel launch assuming that input data
has already been initialized. As we demonstrate further in this chapter, the execution
time needed for device memory allocations is linear in the size of allocated memory.
4.2 GPGPU Performance Model 65
Before kernel execution, input data must be copied from the host memory to the device
memory through the PCIe bus. Kernels are not able to start their execution until the
data partition on which they operate is not completely transferred. The kernel is then
executed in parallel by the processing elements of the device and finally results are copied
back to the host.
We calculate the completion time of a kernel configuration and execution as
Tgpu = Tsetup + Th2d + Tkernel + Td2h (4.6)
where Th2d and Td2h represent respectively host-to-device and device-to-host transfer
latencies, Tsetup is the time spent in configuration and Tkernel represent the completion
time of the kernel on the device. A generic OpenCL application performs the following
steps:
1. Creates an OpenCL context on a GPU device
2. Creates a command-queue
3. Allocates and copies the input buffer memory objects
4. Creates and builds a program object.
5. Creates a kernel object
6. Sets arguments values
7. Sets work-item dimensions
8. Executes the kernel
9. Reads output buffers
10. Releases OpenCL objects
In this example steps 1 to 7 and 10 contribute to Tsetup, step 3 contributes to Th2d, step
9 contributes to Td2h and step 8 contributes to Tkernel.
Host-to-device and device-to host data transfer latencies are approximated dividing
the amount of data to be transmitted by the effective PCIe transfer bandwidth as follows:
Th2d =
DATAin
Bh2d
(4.7)
Td2h =
DATAout
Bd2h
(4.8)
DATAin and DATAout represent the amount of data copied respectively to device mem-
ory and to host memory. Bh2d and Bd2h represent the PCIe bandwidth for data transfers
respectively from host to device and from device to host.
66 Performance Modeling for GPGPU and CPU
4.2.3 Refinement Example: Nvidia PTX 1.x
In this section we propose a refinement of the ground model for the Nvidia PTX platform
prior to Fermi (PTX 1.x platform). For this reason, global memory is assumed to be not
cached and only one kernel at a time can be launched on a device. This refinement tries
to formulate a performance model able to approximate Tkernel for a subset of Nvidia
devices.
As we presented in Chapter 2, CUDA devices are able to hide memory latencies
using zero-overhead scheduling of threads. However, the device scheduler requires a lot
of threads in order to hide memory latencies. The required number of threads depends on
the arithmetic intensity of the kernel and on specific platform requirements. Therefore
it is not unique and can be very different for different devices. Since at this level of
abstraction our model is not tied to any vendor-specific device, we can ignore the impact
of global memory latencies on kernel completion time assuming that enough threads are
launched for the execution.Moreover, since device memory and device cores operate in
parallel [7], we obtain
Tkernel = max(Tmem, Tcalc) (4.9)
Tmem and Tcalc represent the time spent by a kernel respectively in read/write mem-
ory operations in global memory and pure computation.
Threads within a thread group execute concurrently on the CUDA cores of a single
Streaming Multiprocessor. Moreover, different thread groups assigned to the same SM
execute sequentially but different SMs execute concurrently. The scheduling policy of
thread groups across the available SMs of a device is not given by the CUDA docu-
mentation because it depends on the specific device over which the kernel executes [6].
Assuming that threads are uniformly scheduled to SMs we obtain:
Tcalc = Ttg ∗
⌈
Ntg
NSM
⌉
(4.10)
Ttg is the completion time of a single thread group, Ntg is the total number of thread
groups launched for kernel execution and NSM is the number of available SMs. Equation
4.10 states that the completion time of a kernel is equal to the completion time of a
thread group multiplied by the number of thread groups assigned to the most loaded
SM. Different SMs execute different thread groups that may have different execution
times. For example, if a kernel function contains a flow-control statement with a guard
that depends on the block ID or on the thread ID, different thread groups may take
different execution paths and complete their execution in different times. For this reason,
considering a different completion time for each thread group would give a more accurate
estimation of Tcalc. However, we consider Ttg as the mean completion time for each thread
group launched.
4.2 GPGPU Performance Model 67
Memory Throughput
RAM memories of different devices may be of different types (e.g. DDR2 or DDR3) and
may have different clock rate, interface width, number of modules and capacity. The
impact of memory operations on GPU performance is clearly difficult to estimate in this
scenario. Moreover we do not want to provide a performance model related to a specific
RAM model but abstract from those features. A more accurate approach is that followed
in [48] where parallel architectures are seen as client-server systems: clients are device
cores while the server is the main memory. This approach is based upon the queueing
theory. For the purposes of this thesis, a less accurate performance model is acceptable.
For these reasons, we estimate Tmem dividing the amount of data (expressed in
bytes) read (Bin) and written (Bout) by the kernel by the transfer bandwidth of the
device memory (DMB in Equation 4.11).
Tmem =
Bin + Bout
DMB
(4.11)
Memory access pattern. CUDA devices are capable of coalescing separate memory
requests issued by threads in a warp into a single memory transaction (coalesced memory
accesses). As seen in Chapter 2, CUDA GPUs require different access patterns to coalesce
memory accesses. If these requirements are not fulfilled, one memory transaction for
each thread is issued (un-coalesced memory accesses) and memory transfer bandwidth
decrease significantly. It is therefore important to calculate memory bandwidth either
for coalesced and un-coalesced memory accesses to be able to predict the completion
time of a kernel accurately. This can be done using effective bandwidth since theoretical
bandwidth does not consider the memory access pattern.
We implemented a CUDA kernel that stresses device memory using different ac-
cess patterns (Listing 4.1). Timing this kernel execution we calculate effective memory
bandwidth for coalesced and un-coalesced memory accesses.
global void memoryTest ( const vectorType∗ A, vectorType∗ B, int s t r i d e )
{
int i = blockDim . x ∗ blockIdx . x + threadIdx . x ∗ s t r i d e ;
B[ i ] = A[ i ] ;
}
Listing 4.1: CUDA kernel for effective global memory bandwidth calculation.
The memoryTest kernel copies vector A into vector B, both stored in global memory.
The stride parameter is used to control the memory access pattern. For CUDA devices
with compute capability equal to 1.1, coalesced and un-coalesced memory accesses are
obtained setting stride to 1 and 32 respectively. A more accurate estimation of Tmem is
as follows:
Tmem =
BinC + BoutC
BandC
+
BinU + BoutU
BandU
(4.12)
68 Performance Modeling for GPGPU and CPU
BandC represents the memory bandwidth of coalesced memory accesses while BandU
represents the memory bandwidth of un-coalesced memory accesses. BinC and BoutC
represent the amount of data, expressed in bytes, read and written respectively to and
from global memory using coalesced memory transactions. BinU and BoutU represent
the amount of data, expressed in bytes, read and written respectively to and from global
memory using un-coalesced memory transactions.
Instruction Throughput
Ttg represents the completion time of a thread block without considering memory oper-
ations (i.e. only considering arithmetic and logic instructions). Nvidia provides a rich
documentation containing the approximated instruction throughput for each PTX in-
struction [6]. The instruction throughput is defined in clock cycles and refers to a single
warp. Thus, to calculate the completion time of a thread block, the completion time of
a single warp must be multiplied by the number of warps assigned to each thread block.
Nwarp =
⌈
ThreadBlockSize
WarpSize
⌉
(4.13)
Tcalc−warp =
∑
i∈ISA
TotalInstructionsi
InstructionThroughputi
(4.14)
Ttg =
Tcalc−warp ∗Nwarp
ClockRate
(4.15)
Tcalc−warp represents the completion of a warp expressed in clock cycles. Equation 4.15
shows how to calculate Ttg where ClockRate is the clock rate of a Streaming Multipro-
cessor.
Ttg could be refined any more considering the overhead for the launch of each thread
on the GPU. This overhead is caused by the activation of threads and others hidden
operations performed by a device before a kernel execution. Tilak Raj and Pramod
Subramanyan [55] propose a method to calculate this parameter. They implemented a
NOP kernel, i.e. a kernel that executes no operations, and measured its execution time.
They found that the execution time increases linearly with the number of threads as
expected.
Ttg = Ttg + (Toverh ∗ ThreadBlockSize) (4.16)
where Toverh represents the overhead introduced by a single thread launch. The re-
finement step presented in this section would be replaced with a more precise model
considering specific platform features. For example, it should be used the performance
model proposed in [54].
4.2 GPGPU Performance Model 69
4.2.4 Experimental Evaluation
The ground model and its refinement for the Nvidia PTX 1.x platform proposed in this
section have been validated using an experimental approach. We implemented different
CUDA kernels that stress different hardware units such as global memory and cores.
The hardware and software configuration is listed in Table 4.2. The following section
describes how to time CUDA calls.
CPU AMD Phenom 9850 Quad-Core Processor 2.50 GHz 64-bit
GPU Nvidia GeForce 9800 GTX/9800 GTX+
Main Memory 8 GB RAM
Video Memory 512 MB
OS Windows 7 Enterprise
Platform CUDA 3.0
Driver Nvidia Drivers 196.21
Table 4.2: Hardware and software configuration.
Timing CUDA calls
CUDA calls can be timed using either CPU or GPU timers but programmers should
be careful when using CPU timers because some CUDA calls are asynchronous mean-
ing that they return control back to the calling CPU thread prior to completing their
work. To synchronize the execution of CPU and GPU, the CUDA API provides the
cudaThreadSynchronize() function that should be used immediately before starting
and stopping the CPU timer.
GPU timers use events. Events can be created, destroyed and recorded. The fol-
lowing listing shows a kernel execution that is timed using GPU timers and CUDA
events.
cudaEvent t s ta r t , stop ;
f loat time ;
cudaEventCreate(& s t a r t ) ;
cudaEventCreate(&stop ) ;
cudaEventRecord ( s ta r t , 0) ;
kernel<<<gr id , threads>>> ( d odata , d ida ta ) ;
cudaEventRecord ( stop , 0) ;
cudaEventSynchronize ( stop ) ;
cudaEventElapsedTime(&time , s t a r t , stop ) ;
cudaEventDestroy ( s t a r t ) ;
cudaEventDestroy ( stop ) ;
Listing 4.2: How to time a kernel execution using CUDA events.
The cudaEventRecord() function places the start and stop events into an events
queue. When the execution reaches the cudaEventElapsedTime() function, the time
elapsed between the start and stop events is saved into the variable named time. This
70 Performance Modeling for GPGPU and CPU
value is expressed in milliseconds and has a resolution of approximately half a microsec-
ond [27].
Since GPU timers are more precise we used them to time our CUDA applications.
Tsetup
The time spent in configuring the execution environment for a kernel can be measured
timing specific calls to the runtime. As seen, there are different operations that con-
tribute to Tsetup. These are, for example, the creation of the context or the creation of
the command queue. The execution time of setup operations is constant across different
kernel launches, except that for device memory allocations. Device memory allocations
require an execution time that is linear in the number of bytes to be allocated. We
performed a benchmark that measures the time required to allocate a single byte. This
value is then multiplied by the number of bytes allocated in device memory in order to
obtain the completion time of device memory allocations. Experimental results for Tsetup
will be presented in Chapter 5. Moreover, we will demonstrate that for certain GPGPU
applications, when the data size is lower than a specific threshold (the threshold is tied
to the specific application and the hardware and software environment), Tsetup becomes
the performance bottleneck.
Th2d and Td2h
In this section we provide a comparison between experimental and estimated latencies of
PCIe data transfers. The following charts (Figure 4.1, 4.2, 4.3 and 4.4) report experimen-
tal and estimated latencies of data transfers at increasing transfer size. We performed
separated tests using pageable and pinned memory. Results are reported in Tables A.7,
A.8, A.9 and A.10. The mean relative percentage error is about 7% and 0.5% using
respectively pageable memory and pinned memory. The difference may be explained
by the fact that transfer bandwidth using pageable memory is less constant than using
pinned memory due to the overhead introduced by memory swapping mechanisms.
4.2 GPGPU Performance Model 71
0 5 10 15 20 25 30
Transfer Size (MB)
0
5
10
15
20
25
Ti
m
e 
(m
s)
Experimental
Estimated
Host To Device
Pageable Memory
Figure 4.1: Experimental and theoretical results for host to device data transfers using pageable
memory.
0 5 10 15 20 25 30
Transfer Size (MB)
0
5
10
15
20
Ti
m
e 
(m
s)
Experimental
Estimated
Device To Host
Pageable Memory
Figure 4.2: Experimental and theoretical results for device to host data transfers using pageable
memory.
72 Performance Modeling for GPGPU and CPU
0 5 10 15 20 25 30
Transfer Size (ms)
0
5
10
15
20
Ti
m
e 
(m
s)
Experimental
Estimated
Host To Device
Page-locked Memory
Figure 4.3: Experimental and theoretical results for host to device data transfers using pinned memory.
0 5 10 15 20 25 30
Transfer Size (MB)
0
5
10
15
20
Ti
m
e 
(m
s)
Experimental
Estimated
Device To Host
Page-locked Memory
Figure 4.4: Experimental and theoretical results for device to host data transfers using pinned memory.
4.2 GPGPU Performance Model 73
Tkernel
We developed three CUDA applications in order to corroborate our model: a vector addi-
tion (Listing B.1), a non-optimized matrix multiplication (Listing B.2) and an optimized
matrix multiplication (Listing B.3). The first and the second applications prove the cor-
rectness of our model for applications with low arithmetic intensity (memory-bound
applications). Moreover, the non-optimized matrix multiplication combines coalesced
and un-coalesced memory transactions. The third application reduce the memory trans-
actions performed by the kernel increasing the arithmetic intensity. In this case Tcalc
is higher than Tmem. The following charts (Figure 4.5, 4.6, 4.7 and 4.8) plot experi-
mental and estimated completion times of kernels at increasing data size. Results are
reported in Tables A.11, A.12, A.1 and A.2. Figure 4.5, 4.6 and 4.7 demonstrate that
0 10 20 30 40 50 60
Vector Size (MB)
0
50
100
150
200
250
300
350
Ti
m
e 
(m
s)
Experimental
Estimated (Effective Bandwidth)
Estimated (Theoretical Bandwidth)
Vector Addition
Stride = 1
Block Size = 256
Figure 4.5: Experimental and theoretical results for vector addition with coalesced memory accesses.
The mean relative percentage error is about 1.7% using effective bandwidth and about 18.4% using
theoretical bandwidth.
accurate performance prediction is achieved considering effective bandwidth instead of
theoretical bandwidth. Listings B.6, B.7 and B.8 represent the compiled PTX code of
vector addition, matrix multiplication and optimized matrix multiplication respectively.
The relative percentage error between experimental and estimated completion times of
the optimized matrix multiplication is higher than the vector addition and the matrix
multiplication ones. The difference may be caused by the use of thread block-level syn-
chronization. Thread block-level synchronization is necessary to synchronize memory
accesses to shared memory of threads within a thread block. Synchronization may in-
crease the completion time of a thread block thus increasing the overall kernel completion
time. However, our model refinement does not consider the impact of thread block-level
synchronization on kernel performance. Modeling thread block-level synchronization we
74 Performance Modeling for GPGPU and CPU
0 0.5 1 1.5 2
Vector Size (MB)
0
20
40
60
80
100
Ti
m
e 
(m
s)
Experimental
Estimated (Effective Bandwidth)
Estimated (Theoretical Bandwidth)
Vector Addition
Stride = 32
Block Size = 256
Figure 4.6: Experimental and theoretical results for vector addition with un-coalesced memory ac-
cesses. The mean relative percentage error is about 2% using effective bandwidth and about 92% using
theoretical bandwidth.
0 10 20 30 40 50
Matrix Size (MB)
0
10000
20000
30000
40000
Ti
m
e 
(m
s) Experimental
Estimated (Effective Bandwidth)
Estimated (Theoretical Bandwidth)
Matrix Multiplication
Block Size = 256
Figure 4.7: Experimental and theoretical results for non-optimized matrix multiplication (every mem-
ory access is performed in global memory). The mean relative percentage error is about 11% using
effective bandwidth and about 80% using theoretical bandwidth.
4.2 GPGPU Performance Model 75
could obtain different values for Tcalc with a lower relative percentage error between
experimental and estimated completion times.
0 10 20 30 40 50
Matrix Size (MB)
0
200
400
600
800
1000
1200
1400
Ti
m
e 
(m
s)
Estimated
Experimental
Optimized Matrix Multiplication
Block Size = 256
Figure 4.8: Optimized Matrix Multiplication (use of Shared Memory). The Max model has a relative
percentage error of 15.5% against the 16.5% of the Sum model.
76 Performance Modeling for GPGPU and CPU
4.3 CPU Performance Model
Different microprocessor architectures such as Intel or PowerPC have different hardware
implementations based on the number of cores, caching policies or assembler instruction
set. Moreover, modern computers can be equipped with a wide variety of operating
systems (e.g. Windows 7, Linux or Mac OS X). Our goal is to provide a CPU performance
model that is operating system- and processor-independent. Therefore, we formulate a
CPU performance model based on the computational model of the CLI. In particular, we
studied how the just-in-time compiler translates CIL opcodes into native microprocessor
instructions recognized by the underlying specific CPU. As seen in Chapter 3, since the
CLR is a proprietary implementation its code can not be inspected and studied. For this
reason, we studied the SSCLI JIT compiler implementation. In this chapter we are not
interested in studying the whole SSCLI implementation but we would like to provide
an overview on how the JIT compiler translates opcodes into native microprocessor
instructions. We executed and timed several tests using the .NET Framework [11] in
order to validate our model.
A different approach to CPU performance modeling is proposed in [48]. Since it
is placed at a lower level than VEEs, this CPU performance model is more accurate
than our (e.g. in certain applications the impact of caches is crucial). However, we
are not interested in achieving high accuracy but rather in demonstrating that a VEEs
performance model is possible and useful in real applications.
4.3.1 SSCLI Just-In-Time compiler
In this section we give a brief description of the SSCLI JIT compilation process. Starting
from a single CIL opcode, we navigated the source code that produces its executable
code.
The SSCLI compilation process is very simple because SSCLI is designed to be easily
portable. JIT compilation of a CIL method starts at method FJit::jitCompile()
located sscli20/clr/src/fjit/fjit.cpp inside SSCLI implementation. The compiler
begins stepping through the CIL opcodes, starting at the method’s entrypoint (i.e. the
first opcode of the method). The SSCLI compiles the CIL opcodes one at a time. For
each CIL opcode, the compiler places a stream of native microprocessor instructions,
representing the compilation of the CIL opcode, into a global buffer. CIL opcodes drive
a switch statement whose case statements represent the entire set of CIL opcodes. For
each case, the JIT compiler emits a corresponding sequence of instructions into its buffer
[41].
In this section, we show the compiling steps for the ADD opcode. SSCLI defines the
CIL opcodes in a table named opcode.def that can be found in the sscli20/clr/src/inc
directory. This table contains the following entry:
OPDEF(CEE_ADD, "add", Pop1+Pop1, Push1, ...)
4.3 CPU Performance Model 77
The first parameter is the name used by the SSCLI for the current instruction, followed
by a human-readable string. The next two parameters represent the stack behavior of
the given opcode (in this case two pops followed by a push). The information encoded
by the macros in opcode.def are used by the JIT compiler.
As said above, the FJit::jitCompile() method contains a large switch statement
that discriminates between opcodes and calls out to the appropriate compilation function
[38]. For the ADD opcode the switch case is like the following:
switch ( opcode ) {
// omitted many cases
case CEE ADD:
J i tRe su l t = compileCEE ADD() ;
break ;
The compileCEE_ADD() is a function that emits opcode-specific code. Every opcode-
driven case contains a call to such a function. This function appears in the same file as
the switch, and looks like the following:
FJi tResu l t FJit : : compileCEE ADD()
{
OpType r e su l t add ;
BINARY NUMERIC RESULT( topOp ( ) , topOp (1) , CEE ADD, r e su l t add ) ;
TYPE SWITCH PTR( topOp ( ) , emit ADD , ( ) ) ;
POP STACK(2) ;
pushOp( r e su l t add ) ;
return FJIT OK ;
}
The BINARY_NUMERIC_RESULT is a verification check that is contingent on the types
of operands on the stack. It checks that the two operands match and sets the value of
the result_add variable to the expected type of the result. TYPE_SWITCH_PTR then calls
lower-level macros based on the types of operands:
#define TYPE SWITCH PTR( type , emit , a rgs )
switch ( type . enum ( ) ) {
emit WIN32 ( case typeByRef : )
emit WIN32 ( case typeRef : )
emit WIN32 ( case typeMethod : )
case typeI4 :
emit## I4 args ;
break ;
emit WIN64 ( case typeByRef : )
emit WIN64 ( case typeRef : )
emit WIN64 ( case typeMethod : )
case typeI8 :
emit## I8 args ;
break ;
case typeR8 :
emit## R8 args ;
break ;
default :
FJIT FAIL (FJIT INTERNALERROR) ;
}
For example, if the stack has 4 byte integers on it, the emit_ADD_I4 macro is called
as follows:
78 Performance Modeling for GPGPU and CPU
#ifndef emit ADD I4
#define emit ADD I4 ( )
{
c a l l I n f o . r e s e t ( ) ;
em i t t o s a r g ( 1 , INTERNAL CALL ) ;
em i t t o s a r g ( 2 , INTERNAL CALL ) ;
em i t c a l l h e l p e r I 4 I 4 I 4 ( ADD I4 helper ) ;
em i t pu sh r e su l t I 4 ( )
}
First the callInfo, an accumulator used by some of the emit macros used to support
different calling conventions, is reset. Then, the JIT compiler emits the native code to
perform the addition and push the result onto the stack. The emit_tos_arg macro
is sometimes used to enregister the top-of-stack value, but since in this case the ADD
operation is implemented using an internal function, the arguments should be passed on
the stack rather than being put into registers. The INTERNAL_CALL macro is defined as
false to cause correct behavior. The address of the ADD_I4_helper function is placed
into the instruction stream in the buffer and finally, the return value is pushed back onto
the stack. Compilation for this particular opcode is complete at this point [38].
The JIT compiler has several layers of macros that are used during code emission.
At the core is a set of primitive helper functions that are designed to be easily ported,
which are enriched by processor-specific macros. All of the macros discussed to this point
are part of this layer that is portable across JIT implementations. At the top level, the
sscl20i/clr/src/fjit directory contains a file named fjitcore.h, which is the main
file for the JIT compiler, and which defines the calling convention and also acts as the root
of a processor-specific tree of include files. This file, in turn, includes fjit.h, which is
a key header file that imports the opcodes defined by sscli20/clr/src/inc/openum.h
and listed in opcode.def.
In addition to fjit.h, the fjitdef.h header file contains machine-independent code
emitter macros, which are themselves tied to the actual processor being used by a nam-
ing convention: the appropriate <processor>def.h and <processor>fjit.h files are
switched at compile-time. For example, when building the JIT compiler for an x86 pro-
cessor, the x86def.h and x86fjit.h files would be switched into the build. For the
Power PC processor, the files to use would be ppcdef.h and ppcfjit.h.
In the <processor>fjit.h file, macros can be redefined to get processor-specific
performance gains. Returning to our example, the ADD opcode uses the processor-specific
macro layer. In x86fjit.h, the emit_ADD_I4 macro is redefined to take advantage of
x86 specific instructions:
#define emit ADD I4 ( )
enregisterTOS ;
x86 pop (X86 ECX) ;
x86 bar i th (x86OpAdd , x86Big , x86 mod reg (X86 EAX, X86 ECX) ) ;
inRegTOS = true
This macro has intimate knowledge of the processor’s instruction set and conventions.
The use of processor-specific helper functions also deserves discussion. In the case of
4.3 CPU Performance Model 79
ADD, the call takes the form of a C function that can be found in fjitdef.h. The
emit_callhelper macro eventually bottoms out into the x86_call_reg macro, which
emits x86 microprocessor instructions to call the helper function:
#ifdef DECLARE HELPERS
int HELPER CALL ADD I4 helper ( int i , int j ) {
return j + i ;
}
#endif
Because the arguments for this function were pushed earlier in the CIL stream using
opcodes whose compiled behavior “matches” that of the helper call, the arguments are
already in the right spot on the stack for the function invocation.
4.3.2 Opcodes cost
SSCLI JIT compiler has a hierarchical structure. Macros are expanded at different
levels starting from the compileCEE_[opcode]() functions and going deep toward the
processor-specific macro layer. We chose one of this layers and counted how many code
emissions (i.e. the number of native instructions placed into the output buffer) are
defined by the macro that compiles each opcode. That number can be taken as an
approximation of the cost for executing a single opcode. Since we are not interested in
creating a cost model related to a specific chip architecture (e.g. Intel and PowerPC), we
set at the layer of the emit_[opcode] macros in which the compiler ignores which is the
target processor. We obtained a cost model that assigns an abstract cost to any opcode
of the CIL. The term “abstract” indicates that the cost to execute a CIL instruction is
not measured in time but is instead a pure value.
Abstract stack behavior
The opcodes cost may also be obtained observing the behavior of the CLI evaluation
stack for each opcode. The CLI specification [12] uses a graphical notation called stack
transition diagram to represent the evaluation stack behavior of an opcode. The stack
transition diagram displays the state of the evaluation stack before and after the instruc-
tion is executed. Below is a typical stack transition diagram.
..., value1, value2→ ..., result
This diagram indicates that the stack shall have at least two elements on it, and in the
definition the topmost value (“top-of-stack” or “most-recently-pushed”) will be called
value2 and the value underneath (pushed prior to value2) will be called value1. The
instruction removes these values from the stack and replaces them by another value,
called result in the description. Consider, for example, the ADD opcode. Its stack
transition diagram is the same as the diagram above. The ADD opcode pops value1
and value2 from the evaluation stack, adds them and then pushes the result on the
evaluation stack. Totally four operations, two pops, an elementary operation and a
push. For this reason we say that the ADD instruction costs four. This approach can
80 Performance Modeling for GPGPU and CPU
be applied to any opcode of the CIL obtaining the same cost model presented in the
previous section. The reason is that, for each opcode, the JIT compiler emits the native
microprocessor instructions which perform the stack transition for that opcode.
4.3.3 Experimental Evaluation
We implemented three different applications using the .NET framework: a vector addi-
tion, a matrix multiplication and a matrix and vector multiplication. For each applica-
tion we timed its execution using the Win32 QueryPerformanceCounter and
QueryPerformanceFrequency functions [11]. We created a managed wrapper class that
incapsulate these functions in order to time the managed code. To compare abstract
costs with experimental times we performed a simple benchmark, timed its execution
and divided its execution time by its estimated abstract cost. In this way, we obtained a
correlation between the abstract cost unit and time unit. Tests has been performed on
two different computers whose characteristics are listed in Table 4.3 and 4.4. Figure 4.9
and 4.10 report experimental and estimated execution times of the three applications.
.NET applications can be compiled/built and executed using two available config-
urations: Debug and Release. Using the Debug configuration, programs are compiled
with full symbolic debug information and no optimization. Optimization complicates
debugging, because the relationship between source code and generated instructions is
more complex. Using the Release configuration, programs compiled code contains no
symbolic debug information and is fully optimized.
CPU Intel Core 2 Duo 2,4 GHz 32-bit
Main Memory 2 GB RAM
OS Windows 7 Professional
Framework .NET Compact Framework 3.5
Table 4.3: Hardware and software configuration.
CPU Intel Core 2 Quad Q9300 2,5 GHz 64-bit
Main Memory 6 GB RAM
OS Windows 7 Enterprise
Framework .NET Compact Framework 3.5
Table 4.4: Hardware and software configuration.
Building a program in Debug mode generally led to longer MSIL than building the
same program in Release mode. This is the reason why estimated execution times using
Debug and Release mode are different. Particularly, the former are almost twice the
latter. But abstract costs of each application estimated using the Debug mode are
not twice those estimated using the Release mode. These means that a single abstract
4.3 CPU Performance Model 81
Figure 4.9: Experimental and estimated execution times using the configuration of Table 4.3.
Figure 4.10: Experimental and estimated execution times using the configuration of Table 4.4.
82 Performance Modeling for GPGPU and CPU
unit of measurement corresponds to different execution time depending on the build
configuration. The cause of this difference may be that the runtime could perform
additional operations when executing in Debug mode or that the JIT compiler has the
capability to produce more optimized executable code when executing in Release mode
(i.e. a single opcode could be translated to few microprocessor instructions).
Chapter 5
Implementation
In this chapter we present Libra.Net, our task scheduler for a CPU-GPU heterogeneous
environment. Libra.Net is a particular implementation that targets the CLR but it
could also be developed, for example, for Mono and other CLI platforms. We inte-
grated Libra.Net into the 4-Centauri project [8], a MSIL to Nvidia PTX meta-compiler
developed by Cristian Dittamo during his PhD [56, 8]. Recently, 4-Centauri has been
extended by Giacomo Righetti with PBricks, a MSIL to AMD IL meta-compiler. Before
presenting Libra.Net, in the following section we introduce the 4-Centauri’s design and
implementation aspects.
5.1 4-Centauri
The 4-Centauri compiler translates a given program written in MSIL into another equiv-
alent program written in the Nvidia PTX intermediate language. 4-Centauri uses stan-
dard mechanisms provided by the CLR, such as Custom Attributes, in order to repre-
sent parallel computations and abstract from underlying architectures. In particular,
programmers are aware of the additional operation costs due to engines communication
via streams.
Three different levels of abstraction are exposed to programmers. At the higher level
programmers have no access to any information about the execution environment, they
just make a reference to a domain-specific library (Figure 5.1) that expresses different
models of parallel execution. At this level programmers do not have to re-implement
their code or learn another programming language. The middle level is targeted to
experts in a particular domain that can develop specific libraries leveraging the stream
programming model provided by 4-Centauri. Architecture details are transparent to
programmers at this level too, but streaming capabilities are exposed through a dedicated
API. At a lower level, experts in computer graphics architectures can develop their on
meta-compiler in order to map high-level code to new architectures [8].
4-Centauri provides three key abstractions to enable programmers to write applica-
83
84 Implementation
Figure 5.1: 4-Centauri software stack.
tions that can be compiled for the execution on the GPU:
• Stream objects are used to represent data that is going to be processed on the GPU.
They contain an ordered collection of data elements of a given type that can only
be accessed sequentially. Stream objects are implemented by two generic types:
InputStream<T> and OutputStream<T>. An InputStream can only be read while
an OutputStream can only be written. Kernels must have parameters of this two
types. The interface that manipulates Streams looks like the following:
public class InputStream<T> {
public ove r r i d e int Count ( ) { . . . }
public ove r r i d e void Reset ( ) { . . . }
public ove r r i d e bool MoveNext ( ) { . . . }
public T Current { . . . }
/∗ For workload d i s t r i b u t i o n purpose ∗/
public stat ic int GetSize ( ) { . . . }
public List<InputStream<T>> Sp l i t ( int num array ) { . . . }
}
public class OutputStream<T> {
public ove r r i d e int Count ( ) { . . . }
public ove r r i d e void Reset ( ) { . . . }
public ove r r i d e bool MoveNext ( ) { . . . }
public T Current { . . . }
}
/∗ For workload s c a t t e r i n g and ga ther ing purposes ∗/
public stat ic int GetSize ( ) { . . . }
public List<OutputStream<T>> Sp l i t ( int num array ) { . . . }
public void Concat (OutputStream<T> [ ] nstream ) { . . . }
5.1 4-Centauri 85
• Kernel objects are used to map methods to stream processors. Programmers can
declare methods annotated with a special Custom Attribute that specify that
method is a kernel.
• Controls implement control code. They can initiate, monitor, and terminate the
execution of kernels. These are transparent to programmers.
The 4-Centauri compiler consists of the following parts [8]:
• Analyzer. It looks for Kernel-annotated methods in a given MSIL code. If there
is at least one GPU available, for each kernel found, the Analyzer checks whether
it is already compiled or not. In the former case, the Analyzer configures the
execution environment (e.g. memory allocation and initialization, context and
function definition, etc.) and invokes the run-time support in order to execute the
kernel. In the latter case, the Kernel-annotated method is passed to the Parser.
• Parser. It takes a Kernel-annotated method as input and builds a parse tree (PT)
by using an abstract stack.
• Code Generator. It translates the PT nodes into a specific intermediate language
depending on the available underlying GPU, such as Nvidia PTX.
• Runtime support. It provides support for executing CPU and GPU computations,
such as scattering input data, and gathering results at the end of computation. If
part of the computation is executed on a GPU, our support performs all that is
necessary to configure and execute Kernel code.
In order to leverage Nvidia CUDA driver functionalities Cristian Dittamo developed
a dedicated .NET library, called NVIDIALib, based on the .NET platform invocation ser-
vices. NVIDIALib is a driver wrapper which translates the CUDA driver un-managed C
library’s interface into a compatible .NET managed one in C#. In particular, NVIDIALib
implements a set of static methods, one for each function exposed by CUDA driver. For
instance, the function
void cuMemAlloc ( CUdeviceptr∗ dptr , unsigned int by t e s i z e ) ;
allocates bytesize bytes of linear memory on the device and returns dptr pointer to
the allocated memory. This is wrapped by a static method as follows
[ Dl l Import (“NVIDIALib . d l l ” , ExactSpe l l i ng = true , CharSet = CharSet . Auto ) ]
public stat ic extern int MAllocOnGPU ( [ In , Out ] IntPtr dptr , int by t e s i z e ) ;
86 Implementation
5.2 Libra.Net
Libra.Net performs the following steps:
• Takes in input a Kernel-annotated method.
• Analyzes the method code statically in order to estimate the number of instructions
that are going to be executed.
• Decides on which platform executing the task basing on our CPU and GPGPU
performance models. If the scheduler decides to execute the task on a GPGPU,
4-Centauri provides compilation from MSIL to the target architecture language.
• Executes the task.
5.2.1 Bytecode Analysis
The byte-code analysis for the given method is performed to estimate the number of
MSIL instructions that will be executed at runtime by the method. In particular, the
scheduler maintains a table that associates each opcode with the estimated number of
times it will be executed. In order to recognize loops and branches the method byte-code
is analyzed using the CLIFileRW library. Exception handling and method calling are not
considered during the analysis phase because they are not supported by the PTX 1.x
platform.
Loops
Since control flow in the CLI is exposed in the form of conditional and unconditional
branches, the analysis tool tries to infer high-level language constructs (e.g. for loops
and if-else statements) recognizing code patterns inside the method byte-code. Despite
CLIFileRW provides a powerful tool to iterate over .NET binaries, recognizing loops and
branches is not so easy. In fact, source-to-source compilers, that translate high-level
code into MSIL, can introduce more or less optimizations whereas compiling in Debug
or Release mode. The same problem was encountered by Giacomo Righetti during its
thesis [13]. In Debug Mode, for example, each loop statement is translated into a couple
of branch instructions: one forward unconditional branch and one backward conditional
branch. The conditional brach has always as target the instruction placed at offset +2
respect to the unconditional branch as illustrated in Listing 5.1. This pattern can be
recognized while scanning MSIL code using a parser.
5.2 Libra.Net 87
while ( cur r ent > 2)
while ( cur r ent > 1)
current−−;
L 0008 : br . s L 001c
L 000a : nop
L 000b : br . s L 0013
L 000d : nop
L 000e : l d l o c . 0
L 000f : l dc . i 4 . 1
L 0010 : sub
L 0011 : s t l o c . 0
L 0012 : nop
L 0013 : l d l o c . 0
L 0014 : ldc . i 4 . 1
L 0015 : cgt
L 0017 : s t l o c . 1
L 0018 : l d l o c . 1
L 0019 : br t rue . s L 000d
L 001b : nop
L 001c : l d l o c . 0
L 001d : ldc . i 4 . 2
L 001e : cgt
L 0020 : s t l o c . 1
L 0021 : l d l o c . 1
L 0022 : br t rue . s L 000a
L 0024 : r e t
Listing 5.1: Debug Mode Compilation of loops. Green labels and targets represent the outer loop
while yellow ones represent the inner loop.
In Release Mode loops compilation is more optimized and the above property is
sometimes not verified. If the high level code has been compiled in Release Mode, an
exhaustive study of the different compilation patterns must be performed. A possible
solution to this problem can be found in [13].
Our analysis tool recognizes loops compiled in Debug mode performing the following
steps. When a forward unconditional branch (e.g. br) with label l1 is encountered,
the parser stores l1. If the parser reads a backward conditional branch with label l2
that points to the instruction at offset +2 from l1 a new loop block with bounds {l1,l2}
is recognized. In order to recognize nested loops, the analysis tool stores any forward
unconditional branch label and when a backward conditional branch is encountered it
searches for labels with offset -2 from that target.
Calculating the cost of a loop means multiplying the cost of the loop body for the
times the loop will be executed. Since static analysis does not permit to evaluate the
number of times a loop will be executed, the analysis tool implement some heuristics (e.g.
the heuristic presented above). We consider two cases: loops iterating over the elements
of a Stream and loops iterating for a constant number of times. In both cases, the
number of iterations can be retrieved observing stack operations before a conditional
branch. In the first case, one of the streams passed as method arguments is pushed
onto the stack through the ldarg opcode and then a special method that iterates over
88 Implementation
the stream elements is called. 4-Centauri uses the MoveNext() method to check wether
another element of the stream is available and, in that case, move to the next element. A
typical code fragment that iterates over the elements of a Stream looks like the following:
L 001f : lda rg . 2
L 0020 : c a l l v i r t i n s t ance bool [ CompileLib ] StreamDefs . Stream : : MoveNext ( )
L 0025 : s t l o c . 0
L 0026 : l d l o c . 0
L 0027 : br t rue . s L 0009
In order to estimate the number of iterations over the elements of a Stream the scheduler
binds actual and formal parameters of the Kernel-annotated method. For instance, in
the above listing, the ldarg.2opcode loads the second parameter onto the stack. Since
the second parameter is a Stream, its length is obtained from the corresponding actual
parameter.
int i = 5 ;
while ( i < 10)
i++;
L 0000 : nop
L 0001 : ldc . i 4 . 5
L 0002 : s t l o c . 0
L 0003 : br . s L 0009
L 0005 : l d l o c . 0
L 0006 : ldc . i 4 . 1
L 0007 : add
L 0008 : s t l o c . 0
L 0009 : l d l o c . 0
L 000a : ldc.i4.s 10
L 000c : c l t
L 000e : s t l o c . 1
L 000f : l d l o c . 1
L 0010 : br t rue . s L 0005
L 0012 : r e t
In this case, instructions from label L_0005 to L_0010 will be approximately executed
10 times because at label L_000a a constant equal to 10 is pushed onto the stack and
compared with the local variable that drives the loop statement.
Branches
Methods containing if-else statements can follow different code paths during their ex-
ecution. As explained in the previous section, we are not able to statically evaluate
guards and determine which path will be taken by a method. The easiest solution to
calculate the cost of an entire if-else statement is to add the cost of if and then blocks
multiplied by the probability that each block has to be executed. Since an if guard can
not be evaluated statically, we chose a probability of 0.5 for each block, i.e. each block
has the same probability to be executed. This process can be iterated for nested if-else
statements.
5.2 Libra.Net 89
i f ( expr1 )
{
stmt1 ;
}
else
{
i f ( expr2 )
{
stmt2 ;
}
else
{
stmt3 ;
}
}
For instance, the cost of the above listing is given by the following equation.
Cost =
1
2
∗ cost(stmt1) + 1
2
(
1
2
∗ cost(stmt2) + 1
2
∗ cost(stmt3)
)
=
1
2
∗ cost(stmt1) + 1
4
∗ cost(stmt2) + 1
4
∗ cost(stmt3)
The following listing shows the compiled MSIL of an if-else statement. In this example,
instructions labeled L_000e and L_000f represent the if block while those labeled L_0012
and L_0013 represent the else block.
int i = 0 ;
i f ( i == 0)
i = 5 ;
else
i = 4 ;
L 000c : br t rue . s L 0012
L 000e : ldc . i 4 5
L 000f : s t l o c . 0
L 0010 : br . s L 0014
L 0012 : ldc . i 4 4
L 0013 : s t l o c . 0
L 0014 : lda rg . 1
5.2.2 Model Implementation
The GPGPU ground model proposed in Chapter 4 has been implemented as a C# class
we called GPGPUPerfModel. This class exploit the available driver wrappers to instantiate
model parameters such as the number of available compute units or transfer bandwidth
between host and device. As said in Chapter 4, Tkernel represents the completion time of
all threads within a grid. This parameter strongly depends on the underlying architecture
or platform and must be calculated refining the ground model. For this reason, the
scheduler provides a sub-class of the GPGPUPerfModel class for each available refined
model (Figure 5.2). These sub-classes directly interacts with the compiler in order
to obtain costs of MSIL instructions when they are compiled for a target architecture.
90 Implementation
Figure 5.2: Class hierarchy that implements our GPGPU performance model.
Each sub-class override some of the GPGPUPerfModel class methods. For example, we
implemented the NVIDIAPerfModel class to leverage the Nvidia PTX refined model.
The method that returns the number of available compute units is declared inside the
GPGPUPerfModel class but it is overridden by the NVIDIAPerfModel class that interacts
with the correct driver wrapper, as shown in the following listing. Moreover, the Nvidia
PTX compiler provides, for each opcode, the number of global memory transactions that
derive from the translation of that opcode.
public ove r r i d e int ComputeUnits ( )
{
return CUDA.GetSMCount( deviceID ) ;
}
Moreover we implemented the CPUPerfModel class that implements our CPU perfor-
mance model. Since it is a sequential model, the CPUPerfModel class simply multiply
the cost each opcode by the estimated number of times it will be executed at runtime.
Benchmarks for GPU
In order to calculate transfer bandwidth between the CPU and the GPU and between
the GPU main memory and the GPU chip we extended the MSIL to Nvidia PTX meta-
compiler of 4-Centauri with the capability of executing benchmarks. In particular, we
enriched the NVIDIALib.dll library with new methods wrapping several CUDA func-
tions that launch benchmarks. For example, the following C function performs data
transfers between host and device and returns the measured bandwidth.
extern “C” NVIDIALIB API f loat bandwidthTest (memcpyKind kind , memoryMode mode)
{
return testBandwidthRange ( INIT SIZE , END SIZE , INCREMENT SIZE, kind , mode) ;
}
5.2 Libra.Net 91
In order to interoperate with un-managed code, for each new function in NVIDIALib.dll,
we declared a corresponding C# managed method using the P/Invoke mechanism. The
wrapping method of the bandwidthTest() function looks like the following:
[ Dl l Import (“NVIDIALib . d l l ” , ExactSpe l l i ng = true , CharSet = CharSet . Auto ) ]
public stat ic extern f loat bandwidthTest (memcpyKind kind , memoryMode mode) ;
Listing B.4 shows how we implemented the testBandwidthRange() function in
CUDA. When the cudaMemcpy() function is called a data transfer between host and de-
vice is performed. The last parameter of this function specifies the copy direction. There
are three self-explained possibilities: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost
and cudaMemcpyDeviceToDevice. Effective transfer bandwidth between device memory
and device cores is calculated as explained in Section 4.2.3 using an ad-hoc memory-
bound kernel. Another way to measure device to device transfer bandwidth is to use
cudaMemcpy() with cudaMemcpyDeviceToDevice as parameter. Unlike our method, this
function does not consider un-coalesced memory accesses. We explained the importance
of un-coalesced memory accesses in Section 4.2.3.
Benchmarks for CPU
In order to compare GPU and CPU costs, the compiler needs to calculate the mean
processing time of a single abstract cost unit. For this reason, we implemented a set of
benchmarks that stress the CPU core. For each benchmark we estimated its cost using
our CPU performance model. Every time the system configuration changes, the compiler
executes these benchmarks and times their execution. The mean processing time of a
single abstract cost unit is obtained dividing the execution time of each benchmark by
its estimated cost.
5.2.3 Executing a task
If a task is going to be executed on the CPU, the Kernel-annotated method must be
executed a number of times equal to the size of the domain of execution. Since the
scheduler must be able to invoke methods without knowing methods name until runtime
on the CPU, we used the Reflection mechanisms for method loading and invocation.
The execution of a task on the CPU looks like the following:
private void ScheduleOnCPU ( )
{
ob j e c t o = Act ivator . Create Ins tance ( c lassType ) ;
MethodInfo mi = classType . GetMethod (methodName) ;
while ( outArg .MoveNext ( ) )
mi . Invoke (o , parameters ) ;
}
Unfortunately, as demonstrated by Peter Sestotf in [46], these reflection mechanisms
introduce a huge overhead due to parameters wrapping and unwrapping. Experimental
92 Implementation
and theoretical execution times could be very different making our work useless. Sestoft
proposes two alternative techniques in order to obtain better performance: delegates and
interfaces. The more efficient way consist in using interfaces but, as seen in Chapter 3,
they require that the interface type is the same of the target method type. This can not
be guaranteed because developers are not forced to define classes (types) of a specific
type. Delegates require only signature compatibility, i.e. method and delegate signature
must match. Moreover they reach very similar performance to those achieved through
interfaces. The C# code in Listing 5.2 schedules and executes a task on the CPU using
a delegate type.
1 //Delegate d e c l a ra t i on
2 private de l e ga t e void Task (OutputStream<int> o , InputStream<int > [ ] i ) ;
3
4 private void ScheduleOnCPU ( )
5 {
6 ob j e c t o = Act ivator . Create Ins tance ( c lassType ) ;
7 MethodInfo mi = classType . GetMethod (methodName) ;
8 Task task = (Task ) Delegate . CreateDelegate ( typeo f (Task ) , o , mi ) ;
9
10 // execu t ing the ta sk
11 while ( outArg .MoveNext ( ) )
12 {
13 task ( outArg , inputArgs ) ;
14 }
15 }
Listing 5.2: Task scheduling on the CPU using delegates.
If a task is going to be executed on a GPU, the scheduler instructs 4-Centauri to
translate the given Kernel-annotated method to the target architecture code.
5.2.4 Evaluation
In this section we present results of two applications scheduled by Libra.Net: a Vector
Addition and a Matrix Multiplication. These applications has been developed in 4-
Centauri. Tests has been performed using the hardware and software configuration
listed in Table 5.1.
CPU Intel Core 2 Quad Q9300 2,5 GHz 64-bit
GPU Nvidia GeForce GT320
Main Memory 6 GB RAM
Video Memory 1 GB
OS Windows 7 Enterprise
Platform CUDA 3.1
Driver Nvidia Drivers 260.99
Table 5.1: Hardware and software configuration.
5.2 Libra.Net 93
Vector Addition
The first example is a computation that performs the sum of two vectors and stores the
result in an output vector. The vector addition task can be sequentially executed on a
CPU and its code looks like the following:
for ( int i = 0 ; i < s i z e ; i++)
output [ i ] = input1 [ i ] + input2 [ i ] ;
The same computation can be efficiently executed in parallel by a SIMD architecture
like a GPU. The for loop is simulated by the hardware units. Each thread is assigned a
single vector element. Threads read one element from the first vector and one from the
second vector, perform a pair-wise addition and store result in the corresponding output
element. A CUDA implementation of the vector addition is given in Listing B.1.
A 4-Centauri implementation of the vector addition looks like the following:
public class MyClass {
// . . .
[ Kernel ]
public void Add(OutputStream<int> output , InputStream<int > [ ] inputs ) {
output . Current = inputs [ 0 ] . Current + inputs [ 1 ] . Current ;
}
}
public class Program {
// . . .
stat ic void Main( s t r i n g [ ] a rgs ) {
MyClass mc = new MyClass ( ) ;
int [ ] A;
int [ ] B;
int [ ] C;
/∗ Vectors i n i t i a l i z a t i o n ∗/
// . . .
OutputStream<int > [ ] outputs = new OutputStream<int > [ 1 ] ;
outputs [ 0 ] = new OutputStream<int>(C) ;
InputStream<int > [ ] inputs = new InputStream<int > [ 2 ] ;
inputs [ 0 ] = new InputStream<int>(A) ;
inputs [ 1 ] = new InputStream<int>(B) ;
GPUDataParallel<int , int> dp = new GPUDataParallel<int , int>(mc) ;
dp .Map(“Add” , outputs , inputs ) ;
}
}
The GPUDataParallel class implements both Map and Reduce operations that take
stream data types as arguments for scattering input data to and gathering results from
a GPU [8].
This task is apparently well-suited to be executed on a GPU because it is highly
parallelizable. However, as we will demonstrate, for certain configurations the vector
addition does not take advantage to be executed on a GPU because of the overhead
introduced by Th2d, Td2h and Tsetup. In particular, we show that under specific hardware
94 Implementation
configurations the CPU completion time of the vector addition task is lower than the
GPU one. The computation grain is very “fine” such that the time required to elaborate
Figure 5.3: Vector addition CPU cost (left) and GPU cost (right). The GPU implementation uses
pageable memory to copy data from host memory to device memory.
a single element on the CPU is lower than the time required to copy that element from
the host memory to the device memory. Figure 5.3 and 5.4 compares CPU and GPU
costs calculated by Libra.Net (Table A.5). In this case we implemented a non-optimized
version of vector addition using pageable memory for data transfers. Theoretical results
(Table A.6) are corroborated with experimental ones like depicted in Figure 5.5.
The first optimization consists in allocating input and output vectors inside the
pinned host memory region. Despite the completion time of the kernel and the overhead
of the setup phase remain the same, data transfer latencies between host and device
decrease significantly. This optimization does not improve performance enough in order
to take advantage of a GPU implementation. In particular, the CPU completion time of
the task is now comparable to the GPU one. Figure 5.6 compares CPU and GPU costs.
As seen in Chapter 2, some CUDA devices have the capability to overlap copy and
kernel execution. In particular, CUDA devices of compute capability greater than 1.1
(this is our case) are able to overlap computation with one memory copy to or from the
host. Compute capability 2.0 (Fermi) improves on this by enabling a second parallel
copy operation in the opposite direction (PCIe is symmetric). Overlapping can be used
when data dependency is such that data can be broken into chunks and transferred in
multiple stages, launching multiple kernels to operate on each chunk as it arrives. When
the execution time (Tkernel) exceeds the transfer time (Tt) a rough estimate for the overall
5.2 Libra.Net 95
Figure 5.4: Vector Addition theoretical results using pageable memory.
Figure 5.5: Vector Addition experimental results using pageable memory.
96 Implementation
Figure 5.6: Vector addition task executed on the CPU (left) and on the GPU (right). The GPU
implementation uses pinned memory to copy data from host memory to device memory.
Figure 5.7: Vector Addition theoretical results using pinned memory.
5.2 Libra.Net 97
Figure 5.8: Vector Addition experimental results using pinned memory.
execution time is Tkernel+Tt/nStreams for the staged version versus Tkernel+Tt for the
sequential version. If the transfer time exceeds the execution time, a rough estimate for
the overall time is Tt + Tkernel/nStreams [27]. Tt is equal to Th2d + Td2h because host-
to-device and device-to-host data transfers execute sequentially. In this case Tt exceeds
Tkernel thus the gain in performance is given by the decrease of Tkernel. For this reason,
in this case, data transfers still remain the bottleneck of the computation. Moreover, the
efforts made by a programmer to implement and optimize the CUDA vector addition
are not comparable to those made to implement the CPU version.
In order to reduce the impact of data transfers on application performance, we pro-
pose a transversal solution. Input and output data may be compressed to reduce the
time needed for the transfer. Input data should be compressed on the host while out-
put data should be compressed on the device. Once input data has been completely
transferred to the device memory it must be decompressed before a kernel execution.
Decompression may occur in parallel on the GPU using a kernel that implements a paral-
lel version of a compression/decompression algorithm. Each thread is assigned a portion
of the input data and executes a decompression algorithm on that portion. Once data
has been completely decompressed the task is able to be launched. Since in GPGPU
computations data stored in device memory is available across multiple kernel launches
by the same application, this technique can be applied. If the mean compression ratio
of the algorithm and the overhead introduced by compression and decompression phases
are known it is possible to estimate the overall execution time of a task and quantify, if
reached, the gain in performance [57].
98 Implementation
Matrix Multiplication
The standard matrix multiplication algorithm is a “medium-grained” computation. For
square matrices a C# implementation looks like the following.
for ( int i = 0 ; i < s i z e ; i++)
for ( int j = 0 ; j < s i z e ; j++)
for ( int h = 0 ; h < s i z e ; h++)
C[ i , j ] = A[ i , h ] ∗ B[ h , j ] ;
As seen in Chapter 4, the matrix multiplication may be implemented in different
ways using CUDA. In this example we consider the non-optimized matrix multiplication
version in which every memory access is performed in global memory (Listing B.2). In
Figure 5.9 a comparison between the estimated costs of the matrix multiplication task at
increasing data size when executed on the CPU and on the GPU is represented. Matrix
Figure 5.9: Matrix Multiplication theoretical results.
size represents the number of matrix elements in each dimension.
Tests has been performed on the CPU and on the GPU using the hardware and
software configuration listed in Table 5.1. For this example we used pageable memory
since in this case data transfers between host and device are not bottlenecks. Exper-
imental results are reported in Table A.3 while theoretical ones are reported in Table
A.4. Figure 5.9 and 5.10 demonstrate that for small matrices (matrix size lower than
32) the CPU completion time of matrix multiplication is lower than the GPU one. In
particular, for small matrices, the bottleneck is represented by the GPU setup phase. If
the matrix size is greater than 32 the CPU completion time on the CPU grows more
rapidly compared to the GPU one. Thus, for large matrices, it is more convenient to
5.2 Libra.Net 99
Figure 5.10: Matrix Multiplication experimental results.
execute the matrix multiplication on the GPU, as expected.
If data transfers are performed using pinned memory the overall GPU execution time
decrease but Tsetup still remain a bottleneck for matrix size lower than 32. If the compiler
is able to optimize the matrix multiplication kernel leveraging shared memory, Tkernel
drastically decrease because global memory accesses are minimized. Since the bottleneck
is represented by Tsetup we are not yet able to execute the matrix multiplication for small
matrices on GPUs efficiently.
100 Implementation
Chapter 6
Conclusions and Future Works
In this thesis we addressed the problem of identifying potential bottlenecks in GPGPU
computations and comparing CPU and GPU execution times of the same task. In
particular, we formulated and validated performance models for CPUs and GPGPUs.
Finally, we implemented our scheduler, we called Libra.Net, as an extension of the 4-
Centauri meta-compiler. In order to evaluate its efficiency, we proposed results of two
simple case studies: a Vector Addition and a Matrix Multiplication.
In Chapter 2 we gave a brief introduction to the evolution of GPUs that has brought
to GPGPU computing. Moreover, we presented and compared the main aspects of the
three most spread GPGPU platforms: Nvidia CUDA, AMD Stream Computing and
OpenCL.
In Chapter 3 we described design and capabilities of the Common Language Infras-
tructure and we briefly introduced the techniques of the Common Language Runtime
used for the scheduler implementation.
In Chapter 4 we introduced the principles of our GPGPU performance model. The
GPGPU performance model is structured in layers. We recognized an abstract layer,
we called the “ground model”, based on the OpenCL standard and common to any
GPGPU platform. This layer must be refined to obtain performance models tied to a
vendor-specific platform (e.g. Nvidia CUDA or AMD Stream Computing) or even to a
specific device. We proposed and validated a refinement of the ground model for the
Nvidia PTX 1.x platform.
We demonstrated that in some cases on-chip memory latencies can be hidden by
the device scheduler. Thus, the impact of on-chip memory transactions may be esti-
mated dividing the amount of data read and written by the device memory bandwidth.
Moreover we demonstrated that, for devices capable of coalescing memory accesses, the
device memory transfer bandwidth should be calculated performing both coalesced and
un-coalesced memory accesses. Another important consideration is that for kernels with
high arithmetic intensity their execution is dominated by pure computation performed
within the device cores and for kernels with low arithmetic intensity their execution is
101
102 Conclusions and Future Works
dominated by memory operations.
In order to approximate performance accurately, we should know how PTX instruc-
tions are compiled into executable machine code. But Nvidia does not provide doc-
umentation about the step of compilation that translates PTX code into executable
machine code. Despite it is not possible to perform an accurate performance evaluation
we demonstrate that our refined model is able to estimate, with a good degree of ap-
proximation, the completion time of a GPGPU kernel. Our model may be refined in
future works considering the effect of branch divergence, shared memory bank conflicts
and thread block-level synchronization. Moreover, tests could be performed on the AMD
Stream Computing platform in order to verify if the proposed refinement could be used
to estimate the completion time of kernels on AMD GPUs. It could be interesting to
propose a performance model tied to the Fermi architecture, considering the impact of
caching policies and concurrent execution of kernels on the same device.
We proposed a CPU performance model that is operating system- and processor-
independent. Exploiting the powerful abstraction provided by virtual machines we for-
mulated a simple performance model for the execution of tasks in a CLI virtual envi-
ronment. Tests demonstrate that our model can be used by a scheduler to approximate
the completion time of a task on a CPU. This model is not exploitable for accurate
or exact performance modeling because of its limitations: although the computational
model of virtual machines is quite simple they abstract from an enormous number of
features of the underlying architecture. For example, our CPU performance model does
not consider the cost of memory operations and the impact of caches. Moreover, our
model abstract from compiling optimizations performed by the JIT compiler or by an
ahead-compiler (Ngen) and garbage collection.
In Chapter 5 we showed two tests performed using Libra.Net and demonstrated that
in some cases the CPU completion time of a task is lower than the GPU one even if
the task fits the data-parallel programming model. Bottlenecks may be data transfer
latencies between host and device and the setup phase required to prepare the execution
on the GPU. We shown an example where if data size is lower than a fixed threshold the
task executes faster on the CPU than on the GPU, if data size exceeds that threshold
the task executes faster on the GPU than on the CPU.
Some techniques that reduce the impact of data transfers between host and device are
available on the main GPGPU platforms. These are the allocation of data using pinned
memory or the use of asynchronous functions overlapping communication and kernel
execution. In the vector addition example, the use of pinned memory improves the task
performance when executed on the GPU but, as seen, the CPU and the GPU execution
time are almost the same. If we overlap communication and kernel execution, the kernel
executes concurrently with data transfers. Therefore, the kernel completion time does
not contribute to the overall completion time of the task on the GPU that, in this case,
become lower than the CPU one. Despite these improvements, the difference between
the GPU and CPU execution time becomes comparable. Tasks could be parallelized
producing multi-threaded implementations for the execution on modern CPUs multi-
103
core increasing performance and productivity respect to GPUs implementations [58].
As seen, the complexity of GPGPU code development and optimization is very high and
it increases with the introduction of new technologies and architectures like the NVIDIA
Fermi.
104 Conclusions and Future Works
Appendix A
Tables
Matrix Multiplication
Size Experimental Tmem Tcalc Tkernel Tmemtheo
0,25 25,210 13,241 0,513 13,241 2,727
1 152,747 105,931 4,045 105,931 21,818
2,25 516,934 357,517 13,589 357,517 73,636
4 913,499 847,448 32,136 847,448 174,545
6,25 1649,110 1655,172 62,678 1655,172 340,909
9 2866,466 2860,138 108,208 2860,138 589,091
12,25 4390,393 4541,793 171,716 4541,793 935,455
16 6567,276 6779,586 256,196 6779,586 1396,364
20,25 9311,830 9652,966 364,638 9652,966 1988,182
25 12646,059 13241,379 500,035 13241,379 2727,273
30,25 16741,076 17624,276 665,378 17624,276 3630,000
36 21707,828 22881,103 863,659 22881,103 4712,727
42,25 27071,563 29091,310 1097,871 29091,310 5991,818
49 33785,227 36334,345 1371,005 36334,345 7483,636
56,25 41818,207 44689,655 1686,052 44689,655 9204,545
Table A.1: Experimental and theoretical results of the matrix multiplication kernel. The size is
expressed in MB while times are expressed in milliseconds.
105
106 Tables
Optimized Matrix Multiplication
Size Experimental Tmem Tcalc Tkernel
0,25 0,339 0,142 0,276 0,276
1 2,563 1,121 2,138 2,138
2,25 8,237 3,763 7,135 7,135
4 19,325 8,897 16,816 16,816
6,25 37,683 17,349 32,730 32,730
9 65,178 29,948 56,428 56,428
12,25 104,259 47,522 89,458 89,458
16 157,285 70,897 133,369 133,369
20,25 226,810 100,901 189,712 189,712
25 315,050 138,362 260,035 260,035
30,25 425,319 184,108 345,887 345,887
36 561,485 238,966 448,819 448,819
42,25 728,029 303,763 570,379 570,379
49 932,211 379,328 712,117 712,117
56,25 1180,558 466,487 875,583 875,583
Table A.2: Experimental and theoretical results of the optimized matrix multiplication kernel. The
size is expressed in MB while times are expressed in milliseconds.
107
Matrix Multiplication
Size CPU H2D D2H Kernel Setup GPU
16 0,021 0,001 0,001 0,018 0,231 0,250
32 0,156 0,004 0,002 0,029 0,198 0,233
48 0,520 0,009 0,005 0,047 0,195 0,256
64 1,229 0,016 0,009 0,084 0,236 0,345
80 2,449 0,025 0,013 0,154 0,229 0,421
96 4,249 0,036 0,019 0,260 0,228 0,543
112 6,789 0,050 0,026 0,438 0,227 0,741
128 11,905 0,065 0,035 0,714 0,225 1,038
144 14,223 0,082 0,044 0,879 0,355 1,359
160 19,598 0,101 0,054 1,116 0,257 1,528
176 25,827 0,122 0,065 1,461 0,208 1,857
192 40,196 0,146 0,078 1,933 0,227 2,383
208 42,474 0,171 0,091 2,512 0,261 3,035
224 53,999 0,198 0,106 3,062 0,199 3,565
240 65,319 0,227 0,121 3,829 0,198 4,375
256 93,989 0,259 0,138 5,021 0,221 5,639
272 96,585 0,292 0,156 5,459 0,186 6,093
288 134,442 0,328 0,175 6,437 0,265 7,204
304 134,008 0,365 0,195 7,681 1,251 9,491
320 183,340 0,404 0,216 9,495 1,252 11,367
336 182,411 0,446 0,238 10,711 1,292 12,687
352 243,442 0,489 0,261 12,237 1,276 14,263
368 237,874 0,535 0,285 14,100 2,318 17,237
384 316,358 0,582 0,311 16,667 2,340 19,901
400 305,972 0,632 0,337 17,472 2,297 20,737
416 401,809 0,683 0,365 19,226 2,368 22,642
432 390,790 0,737 0,393 21,918 2,264 25,311
448 499,769 0,793 0,423 23,676 2,213 27,104
464 515,343 0,850 0,454 26,066 2,229 29,599
480 625,971 0,910 0,485 29,161 2,349 32,905
496 657,269 0,972 0,518 32,010 2,296 35,796
Table A.3: Experimental results for the Matrix Multiplication task using the testing environment of
Table 5.1.The size represents the matrix dimension while times are expressed in milliseconds.
108 Tables
Matrix Multiplication
Size CPU H2D D2H Kernel Setup GPU
16 290816 13302 7095 16961 3037895 3075253
32 2326528 53207 28381 133635 2600000 2815223
48 7852032 119716 63857 448705 2563816 3196094
64 18612224 212828 113524 1060855 3108158 4495366
80 36352000 332544 177381 2068771 3008421 5587117
96 62816256 478864 255429 3571135 2996579 7302007
112 99749888 651787 347667 5666632 2984868 9650955
128 148897792 851313 454096 8453947 2957105 12716462
144 212004864 1077443 574716 12031764 4668158 18352081
160 290816000 1330177 709525 16498766 3383947 21922416
176 387076096 1609514 858526 21953639 2742368 27164047
192 502530048 1915454 1021717 28495066 2983947 34416184
208 638922752 2247999 1199098 36221731 3432895 43101722
224 797999104 2607146 1390670 45232319 2618158 51848293
240 981504000 2992897 1596432 55625514 2600000 62814844
256 1191182336 3405252 1816385 67500000 2910789 75632427
272 1428779008 3844211 2050529 80954461 2443816 89293016
288 1696038912 4309772 2298863 96087582 3489211 106185428
304 1994706944 4801938 2561387 112998047 16462763 136824135
320 2326528000 5320707 2838102 131784539 16477500 156420848
336 2693246976 5866079 3129007 152545744 16999605 178540436
352 3096608768 6438055 3434103 175380345 16786974 202039477
368 3538358272 7036635 3753390 200387027 30496053 241673104
384 4020240384 7661818 4086867 227664474 30794474 270207632
400 4544000000 8313604 4434534 257311369 30221842 300281350
416 5111382016 8991994 4796392 289426398 31152895 334367679
432 5724131328 9696988 5172441 324108244 29785658 368763331
448 6383992832 10428585 5562680 361455592 29119211 406566068
464 7092711424 11186786 5967109 401567126 29334737 448055758
480 7852032000 11971590 6385729 444541530 30912368 493811217
496 8663699456 12782998 6818540 490477488 30211842 540290867
Table A.4: Estimated costs for the Matrix Multiplication task basing on the environment of Table
5.1.The size represents the matrix dimension while costs are expressed in the abstract cost unit.
109
Vector Addition
Size CPU H2D D2H Kernel Setup GPU
262144 0,949 1,304 0,892 0,153 2,794 5,144
786432 2,502 3,248 2,615 0,430 3,219 9,512
1310720 4,145 5,274 4,400 0,697 3,617 13,988
1835008 5,585 7,135 6,202 0,973 3,842 18,152
2359296 9,792 9,170 7,541 1,273 4,288 22,272
2883584 11,382 11,005 8,776 1,514 4,574 25,869
3407872 13,706 12,922 10,459 1,801 5,046 30,227
3932160 15,080 14,584 11,811 2,121 5,429 33,945
4456448 13,505 16,695 13,525 2,354 5,625 38,199
4980736 19,557 18,460 28,962 2,620 6,120 56,162
5505024 21,942 20,876 28,432 2,950 6,665 58,922
6029312 23,716 22,107 18,769 3,170 6,829 50,875
6553600 33,935 24,263 19,463 3,460 7,309 54,494
7077888 20,337 26,167 22,395 3,812 7,429 59,803
7602176 32,228 28,009 42,357 4,010 7,763 82,139
8126464 32,182 30,148 39,248 4,260 8,115 81,770
8650752 51,052 32,124 40,934 4,659 8,487 86,204
9175040 36,391 33,640 33,141 4,838 8,792 80,411
9699328 45,378 35,832 40,179 5,086 9,300 90,398
10223616 41,237 37,498 42,540 5,506 9,549 95,093
10747904 42,771 39,564 37,038 5,667 9,924 92,193
11272192 44,155 41,460 37,377 5,944 10,313 95,094
11796480 46,234 43,402 52,569 6,348 10,683 113,002
12320768 52,393 45,460 39,490 6,482 11,067 102,499
12845056 52,593 46,998 51,838 6,745 11,298 116,880
13369344 51,209 48,968 42,801 7,164 11,689 110,622
13893632 54,636 51,086 60,059 7,298 12,088 130,531
14417920 57,319 52,742 43,628 7,583 12,372 116,324
14942208 64,113 55,435 50,449 8,044 12,761 126,689
15466496 62,297 57,530 64,688 8,147 13,111 143,477
15990784 61,062 58,656 47,363 8,433 13,841 128,293
16515072 64,917 61,115 50,269 8,858 13,978 134,221
Table A.5: Experimental results for the Vector Addition task using the testing environment of Table
5.1. The size represents the number of vector elements while times are expressed in milliseconds.
110 Tables
Vector Addition
Size CPU H2D D2H Kernel Setup GPU
262144 9699328 9766006 5209256 1366911 26361698 42703871
786432 29097984 29298019 15627767 4100732 30371038 79397557
1310720 4849664 48830032 26046279 6834554 34120189 115831054
1835008 67895296 68362045 36464791 9568375 36243302 150638513
2359296 87293952 87894058 46883302 12302197 40456415 187535972
2883584 106692608 107426071 57301814 15036018 43148113 222912016
3407872 126091264 126958084 67720325 17769840 47601509 260049759
3932160 145489920 146490097 78138837 20503661 51213302 296345897
4456448 164888576 166022110 88557349 23237483 53070472 330887413
4980736 184287232 185554123 98975860 25971304 57738868 368240155
5505024 203685888 205086136 109394372 28705125 62877925 406063558
6029312 223084544 224618149 119812884 31438947 64420849 440290829
6553600 242483200 244150162 130231395 34172768 68955189 477509514
7077888 261881856 263682175 140649907 36906590 70086604 511325275
7602176 281280512 283214188 151068418 39640411 73240472 547163489
8126464 300679168 302746201 161486930 42374233 76551887 583159251
8650752 320077824 322278214 171905442 45108054 80066132 619357842
9175040 339476480 341810227 182323953 47841876 82940943 654916999
9699328 358875136 361342240 192742465 50575697 87735849 692396251
10223616 378273792 380874253 203160976 53309519 90085472 727430220
10747904 397672448 400406266 213579488 56043340 93618679 763647773
11272192 417071104 419938279 223998000 58777162 97293585 800007025
11796480 436469760 439470292 234416511 61510983 100778868 836176654
12320768 455868416 459002305 244835023 64244805 104404528 872486661
12845056 475267072 478534318 255253534 66978626 106585094 907351573
13369344 494665728 498066331 265672046 69712448 110269245 943720070
13893632 514064384 517598344 276090558 72446269 114037170 980172340
14417920 533463040 537130357 286509069 75180090 116712736 1015532252
14942208 552861696 556662370 296927581 77913912 120383962 1051887825
15466496 572260352 576194383 307346093 80647733 123692075 1087880284
15990784 591659008 595726396 317764604 83381555 130573585 1127446140
16515072 611057664 615258409 328183116 86115376 131866887 1161423787
Table A.6: Estimated Costs for the Vector Addition task basing on the environment of Table 5.1. The
size represents the number of vector elements while costs are expressed in the abstract cost unit.
111
Host to Device - Pageable Memory
Size Experimental Th2d
1 1,086 0,885
2 2,056 1,770
3 3,015 2,655
4 4,168 3,540
5 4,914 4,425
6 6,098 5,310
7 6,505 6,195
8 6,970 7,080
9 7,746 7,965
10 9,552 8,850
11 10,762 9,735
12 10,810 10,619
13 10,920 11,504
14 13,721 12,389
15 14,514 13,274
16 15,633 14,159
17 16,366 15,044
18 16,350 15,929
19 18,283 16,814
20 16,900 17,699
21 18,297 18,584
22 20,715 19,469
23 20,546 20,354
24 23,509 21,239
25 24,214 22,124
26 24,119 23,009
27 23,986 23,894
28 25,987 24,779
29 26,879 25,664
30 27,88 26,549
31 28,246 27,434
32 27,611 28,319
Table A.7: Experimental and theoretical results of host to device data transfer with pageable memory.
The size is expressed in MB while times are expressed in milliseconds.
112 Tables
Device to Host - Pageable Memory
Size Experimental Td2h
1 1,183 0,689
2 1,838 1,378
3 2,463 2,068
4 3,117 2,757
5 3,748 3,446
6 4,394 4,135
7 5,041 4,824
8 5,687 5,513
9 6,368 6,203
10 6,929 6,892
11 7,662 7,581
12 8,217 8,270
13 8,859 8,959
14 9,533 9,649
15 10,157 10,338
16 10,861 11,027
17 11,448 11,716
18 12,183 12,405
19 12,711 13,094
20 13,418 13,784
21 14,001 14,473
22 14,763 15,162
23 15,316 15,851
24 15,926 16,540
25 16,547 17,229
26 17,272 17,919
27 17,892 18,608
28 18,539 19,297
29 19,15 19,986
30 19,873 20,675
31 20,437 21,365
32 20,984 22,054
Table A.8: Experimental and theoretical results of device to host data transfer with pageable memory.
The size is expressed in MB while times are expressed in milliseconds.
113
Host to Device - Pinned Memory
Size Experimental Th2d
1 0,761 0,740
2 1,503 1,480
3 2,246 2,221
4 2,987 2,961
5 3,73 3,701
6 4,473 4,441
7 5,21 5,181
8 5,956 5,922
9 6,698 6,662
10 7,44 7,402
11 8,19 8,142
12 8,925 8,882
13 9,666 9,623
14 10,406 10,363
15 11,148 11,103
16 11,899 11,843
17 12,641 12,583
18 13,383 13,323
19 14,136 14,064
20 14,867 14,804
21 15,61 15,544
22 16,35 16,284
23 17,096 17,024
24 17,834 17,765
25 18,575 18,505
26 19,316 19,245
27 20,057 19,985
28 20,803 20,725
29 21,54 21,466
30 22,288 22,206
31 23,022 22,946
32 23,763 23,686
Table A.9: Experimental and theoretical results of host to device data transfers with pageable memory.
The size is expressed in MB while times are expressed in milliseconds.
114 Tables
Device to Host - Pinned Memory
Size Experimental Td2h
1 0,632 0,617
2 1,249 1,234
3 1,872 1,851
4 2,492 2,468
5 3,105 3,085
6 3,705 3,701
7 4,337 4,318
8 4,96 4,935
9 5,573 5,552
10 6,181 6,169
11 6,813 6,786
12 7,417 7,403
13 8,041 8,020
14 8,644 8,637
15 9,269 9,254
16 9,892 9,870
17 10,508 10,487
18 11,123 11,104
19 11,742 11,721
20 12,301 12,338
21 12,97 12,955
22 13,59 13,572
23 14,205 14,189
24 14,829 14,806
25 15,461 15,423
26 16,019 16,039
27 16,641 16,656
28 17,271 17,273
29 17,974 17,890
30 18,599 18,507
31 19,238 19,124
32 19,88 19,741
Table A.10: Experimental and theoretical results of device to host data transfer with pinned memory.
The size is expressed in MB while times are expressed in milliseconds.
115
Vector Addition performing coalesced memory accesses
Size Experimental Tmem Tcalc Tkernel Tmemtheo
2 10,279 10,345 1,594 10,345 8,523
4 20,559 20,690 3,188 20,690 17,045
6 32,198 31,034 4,782 31,034 25,568
8 41,136 41,379 6,376 41,379 34,091
10 51,400 51,724 7,969 51,724 42,614
12 64,424 62,069 9,563 62,069 51,136
14 71,985 72,414 11,157 72,414 59,659
16 82,232 82,759 12,751 82,759 68,182
18 96,722 93,103 14,345 93,103 76,705
20 102,813 103,448 15,939 103,448 85,227
22 113,075 113,793 17,533 113,793 93,750
24 128,969 124,138 19,127 124,138 102,273
26 133,674 134,483 20,720 134,483 110,795
28 143,916 144,828 22,314 144,828 119,318
30 161,187 155,172 23,908 155,172 127,841
32 164,513 165,517 25,502 165,517 136,364
34 174,773 175,862 27,096 175,862 144,886
36 193,437 186,207 28,690 186,207 153,409
38 198,050 196,552 30,284 196,552 161,932
40 208,515 206,897 31,878 206,897 170,455
42 225,699 217,241 33,471 217,241 178,977
44 226,206 227,586 35,065 227,586 187,500
46 236,402 237,931 36,659 237,931 196,023
48 257,919 248,276 38,253 248,276 204,545
50 257,042 258,621 39,847 258,621 213,068
52 267,274 268,966 41,441 268,966 221,591
54 290,261 279,310 43,035 279,310 230,114
56 287,880 289,655 44,629 289,655 238,636
58 298,117 300,000 46,222 300,000 247,159
60 322,460 310,345 47,816 310,345 255,682
62 318,733 320,690 49,410 320,690 264,205
Table A.11: Experimental and theoretical results of the vector addition kernel. The size is expressed
in MB while times are expressed in milliseconds.
116 Tables
Vector Addition performing uncoalesced memory accesses
Size Experimental Tmem Tcalc Tkernel Tmemtheo
0,0625 3,298 3,233 0,050 3,233 0,266
0,125 6,588 6,466 0,100 6,466 0,533
0,1875 9,890 9,698 0,149 9,698 0,799
0,25 13,181 12,931 0,199 12,931 1,065
0,3125 16,482 16,164 0,249 16,164 1,332
0,375 19,786 19,397 0,299 19,397 1,598
0,4375 23,067 22,629 0,349 22,629 1,864
0,5 26,371 25,862 0,398 25,862 2,131
0,5625 29,662 29,095 0,448 29,095 2,397
0,625 32,959 32,328 0,498 32,328 2,663
0,6875 36,261 35,560 0,548 35,560 2,930
0,75 39,558 38,793 0,598 38,793 3,196
0,8125 42,872 42,026 0,648 42,026 3,462
0,875 46,149 45,259 0,697 45,259 3,729
0,9375 49,446 48,491 0,747 48,491 3,995
1 52,739 51,724 0,797 51,724 4,261
1,0625 56,034 54,957 0,847 54,957 4,528
1,125 59,348 58,190 0,897 58,190 4,794
1,1875 62,659 61,422 0,946 61,422 5,060
1,25 65,947 64,655 0,996 64,655 5,327
1,3125 69,238 67,888 1,046 67,888 5,593
1,375 72,555 71,121 1,096 71,121 5,859
1,4375 75,835 74,353 1,146 74,353 6,126
1,5 79,108 77,586 1,195 77,586 6,392
1,5625 82,417 80,819 1,245 80,819 6,658
1,625 85,690 84,052 1,295 84,052 6,925
1,6875 89,011 87,284 1,345 87,284 7,191
1,75 92,317 90,517 1,395 90,517 7,457
1,8125 95,604 93,750 1,444 93,750 7,724
1,875 98,924 96,983 1,494 96,983 7,990
1,9375 102,183 100,216 1,544 100,216 8,256
Table A.12: Experimental and theoretical results of the vector addition kernel. The size is expressed
in MB while times are expressed in milliseconds.
Appendix B
Listings
1 global void VecAdd( const e l ement t ∗ A, const e l ement t ∗ B, e l ement t ∗C)
2 {
3 int i = ( blockDim . x ∗ blockIdx . x + threadIdx . x ) ;
4 C[ i ] = A[ i ] + B[ i ] ;
5 }
Listing B.1: CUDA Vector Addition.
1 global void MatMul(Matrix A, Matrix B, Matrix C)
2 {
3 e l ement t Cvalue = 0 .0 f ;
4 int row = blockIdx . y ∗ blockDim . y + threadIdx . y ;
5 int c o l = blockIdx . x ∗ blockDim . x + threadIdx . x ;
6 for ( int e = 0 ; e < A. width ; e++)
7 Cvalue += A. e lements [ row ∗ A. width + e ] ∗ B. e lements [ e ∗ B. width + co l ] ;
8 C. e lements [ row ∗ C. width + co l ] = Cvalue ;
9 }
Listing B.2: CUDA Matrix Multiplication.
1 // Get a matrix element
2 d e v i c e f loat GetElement ( const Matrix A, int row , int c o l )
3 {
4 return A. elements [ row ∗ A. s t r i d e + co l ] ;
5 }
6
7 // Set a matrix element
8 d e v i c e void SetElement (Matrix A, int row , int co l , f loat value )
9 {
10 A. e lements [ row ∗ A. s t r i d e + co l ] = value ;
11 }
12
13 // Get the BLOCK SIZExBLOCK SIZE sub−matrix Asub o f A tha t i s
14 // l o ca t ed co l sub−matrices to the r i g h t and row sub−matrices down
15 // from the upper− l e f t corner o f A
16 d e v i c e Matrix GetSubMatrix (Matrix A, int row , int c o l )
17 {
18 Matrix Asub ;
19 Asub . width = BLOCK SIZE ;
20 Asub . he ight = BLOCK SIZE ;
21 Asub . s t r i d e = A. s t r i d e ;
117
118 Listings
22 Asub . e lements = &A. e lements [A. s t r i d e ∗ BLOCK SIZE ∗ row + BLOCK SIZE ∗ c o l ] ;
23 return Asub ;
24 }
25
26 // Matrix mu l t i p l i c a t i o n ke rne l c a l l e d by MatMul ()
27 global void MatMulShared (Matrix A, Matrix B, Matrix C)
28 {
29 // Block row and column
30 int blockRow = blockIdx . y ;
31 int blockCol = blockIdx . x ;
32
33 // Each thread b l o c k computes one sub−matrix Csub o f C
34 Matrix Csub = GetSubMatrix (C, blockRow , blockCol ) ;
35
36 // Each thread computes one element o f Csub
37 // by accumulat ing r e s u l t s in to Cvalue
38 f loat Cvalue = 0 ;
39
40 // Thread row and column wi th in Csub
41 int row = threadIdx . y ;
42 int c o l = threadIdx . x ;
43
44 // Loop over a l l the sub−matrices o f A and B tha t are
45 // requ i red to compute Csub
46 // Mul t i p l y each pa i r o f sub−matrices t o g e t h e r
47 // and accumulate the r e s u l t s
48 for ( int m = 0 ; m < (A. width / BLOCK SIZE) ; ++m) {
49 // Get sub−matrix Asub o f A
50 Matrix Asub = GetSubMatrix (A, blockRow , m) ;
51
52 // Get sub−matrix Bsub o f B
53 Matrix Bsub = GetSubMatrix (B, m, blockCol ) ;
54
55 // Shared memory used to s t o r e Asub and Bsub r e s p e c t i v e l y
56 shared f loat As [BLOCK SIZE ] [ BLOCK SIZE ] ;
57 shared f loat Bs [BLOCK SIZE ] [ BLOCK SIZE ] ;
58
59 // Load Asub and Bsub from dev i ce memory to shared memory
60 // Each thread loads one element o f each sub−matrix
61 As [ row ] [ c o l ] = GetElement (Asub , row , c o l ) ;
62 Bs [ row ] [ c o l ] = GetElement (Bsub , row , c o l ) ;
63
64 // Synchronize to make sure the sub−matrices are loaded
65 // be fo r e s t a r t i n g the computation
66 sync th r ead s ( ) ;
67
68 // Mul t i p l y Asub and Bsub t o g e t h e r
69 for ( int e = 0 ; e < BLOCK SIZE ; ++e )
70 Cvalue += As [ row ] [ e ] ∗ Bs [ e ] [ c o l ] ;
71
72 // Synchronize to make sure t ha t the preced ing
73 // computation i s done be f o re l oad ing two new
74 // sub−matrices o f A and B in the next i t e r a t i o n
75 sync th r ead s ( ) ;
76 }
77
78 // Write Csub to dev i ce memory
79 // Each thread wr i t e s one element
80 SetElement (Csub , row , co l , Cvalue ) ;
81 }
119
Listing B.3: CUDA Optimized Matrix Multiplication.
1 enum memcpyKind { DEVICE TO HOST, HOST TO DEVICE, DEVICE TO DEVICE } ;
2 enum memoryMode { PINNED, PAGEABLE } ;
3
4 f loat testHostToDeviceBandwidth (unsigned int memSize , memoryMode memMode)
5 {
6 f loat elapsedTimeInMs = 0 .0 f ;
7 f loat bandwidthInMBs = 0 .0 f ;
8
9 cudaEvent t s ta r t , stop ;
10 c u t i l S a f eC a l l ( cudaEventCreate(& s t a r t ) ) ;
11 c u t i l S a f eC a l l ( cudaEventCreate(&stop ) ) ;
12
13 // a l l o c a t e hos t memory
14 unsigned char ∗h odata = NULL;
15 i f (PINNED == memMode)
16 {
17 #i f CUDART VERSION >= 2020
18 //pinned memory mode − use s p e c i a l f unc t i on to ge t OS−pinned memory
19 c u t i l S a f eC a l l ( cudaHostAlloc ( ( void ∗∗)&h odata , memSize , 0) ) ;
20 #else
21 //pinned memory mode − use s p e c i a l f unc t i on to ge t OS−pinned memory
22 c u t i l S a f eC a l l ( cudaMallocHost ( ( void ∗∗)&h odata , memSize ) ) ;
23 #endif
24 }
25 else
26 {
27 // pageab l e memory mode − use mal loc
28 h odata = (unsigned char ∗) mal loc (memSize ) ;
29 }
30 // i n i t i a l i z e the memory
31 for (unsigned int i = 0 ; i < memSize/ s izeof (unsigned char ) ; i++)
32 {
33 h odata [ i ] = (unsigned char ) ( i & 0 x f f ) ;
34 }
35
36 // a l l o c a t e dev i ce memory
37 unsigned char∗ d ida ta ;
38 c u t i l S a f eC a l l ( cudaMalloc ( ( void ∗∗) &d idata , memSize ) ) ;
39
40 int loop = MEMCOPY ITERATIONS;
41
42 c u t i l S a f eC a l l ( cudaEventRecord ( s ta r t , 0) ) ;
43 //copy hos t memory to dev i ce memory
44 i f (PINNED == memMode)
45 {
46 for (unsigned int i = 0 ; i < loop ; i++)
47 {
48 c u t i l S a f eC a l l ( cudaMemcpyAsync ( d idata , h odata , memSize ,
cudaMemcpyHostToDevice , 0) ) ;
49 }
50 }
51 else {
52 for (unsigned int i = 0 ; i < loop ; i++)
53 {
54 c u t i l S a f eC a l l ( cudaMemcpy( d idata , h odata , memSize ,
cudaMemcpyHostToDevice ) ) ;
55 }
56 }
120 Listings
57 c u t i l S a f eC a l l ( cudaEventRecord ( stop , 0) ) ;
58 c u t i l S a f eC a l l ( cudaThreadSynchronize ( ) ) ;
59 // t o t a l e l apsed time in ms
60 c u t i l S a f eC a l l ( cudaEventElapsedTime(&elapsedTimeInMs , s ta r t , stop ) ) ;
61
62 // c a l c u l a t e bandwidth in MB/s
63 bandwidthInMBs = (1 e3 f ∗ memSize ∗ ( f loat ) loop ) / ( elapsedTimeInMs ∗ ( f loat ) (1
<< 20) ) ;
64
65 // c lean up memory
66 c u t i l S a f eC a l l ( cudaEventDestroy ( stop ) ) ;
67 c u t i l S a f eC a l l ( cudaEventDestroy ( s t a r t ) ) ;
68 i f (PINNED == memMode)
69 {
70 c u t i l S a f eC a l l ( cudaFreeHost ( h odata ) ) ;
71 }
72 else
73 {
74 f r e e ( h odata ) ;
75 }
76 c u t i l S a f eC a l l ( cudaFree ( d ida ta ) ) ;
77
78 3return bandwidthInMBs ;
79 }
Listing B.4: CUDA Bandwith test.
1 private f loat RunMemoryKernel (bool coa l e s c ed )
2 {
3 IntPtr job1 = CUDA. InitEnv (new GPUDevice ( deviceID ) , 3 , 1) ;
4 GPUContext ctx1 = CUDA. createContext ( job1 ) ;
5 GPUModule mod1 = CUDA. LoadBin ( job1 , @“.\memoryTest . ptx” ) ;
6 GPUFunction fun1 = CUDA. LoadFunc ( job1 , “memoryTest” ) ;
7
8 int g r i dS i z e = 4096 ;
9 int b lo ckS i z e = 256 ;
10 int s t r i d e = coa l e s c ed ? 1 : 32 ;
11 int s i z e = b lo ckS i z e ∗ g r i dS i z e ∗ s t r i d e ;
12 f loat [ ] A = new float [ s i z e ] ;
13 f loat [ ] C = new float [ s i z e ] ;
14 int mem size = s i z e ∗ s izeof ( f loat ) ;
15 Random rand = new Random( ) ;
16 for ( int i = 0 ; i < s i z e ; ++i )
17 A[ i ] = ( f loat ) rand . Next (100) ;
18
19 //Host a l l o c a t i o n s
20 GCHandle gcHdInput = GCHandle . A l loc (A, GCHandleType . Pinned ) ;
21 GCHandle hOutputHandle = GCHandle . A l loc (C, GCHandleType . Pinned ) ;
22
23 //Device a l l o c a t i o n s
24 int ptrInput = CUDA.MAllocOnGPU( job1 , mem size ) ;
25 int ptrOutput = CUDA.MAllocOnGPU( job1 , mem size ) ;
26
27 //Data t r an s f e r Host to Device
28 CUDA.MCpyCPU2GPU( job1 , ptrInput , gcHdInput . AddrOfPinnedObject ( ) , mem size ) ;
29
30 // Set parameters
31 int o f f s e t = 0 ;
32 o f f s e t = CUDA. SetParamV( job1 , fun1 , o f f s e t , ptr Input ) ;
33 o f f s e t = CUDA. SetParamV( job1 , fun1 , o f f s e t , ptrOutput ) ;
34 o f f s e t = CUDA. SetParamI ( job1 , fun1 , o f f s e t , s t r i d e ) ;
121
35 CUDA. SetParamSize ( fun1 , o f f s e t ) ;
36
37 CUDA. SetBlockShape ( fun1 , b lockS ize , 1 , 1) ;
38
39 CUDA. StartTimer ( ) ;
40 CUDA. LaunchGrid ( fun1 , g r i dS i z e , 1) ;
41 CUDA. StopTimer ( ) ;
42 f loat time = CUDA. GetElapsedTime ( ) ;
43
44 //Data t r an s f e r Device to Host
45 CUDA.MCpyGPU2CPU( job1 , hOutputHandle . AddrOfPinnedObject ( ) , ptrOutput , mem size )
;
46
47 //Marshal . Copy( outPtr , C, 0 , s i z e ) ;
48
49 gcHdInput . Free ( ) ;
50 hOutputHandle . Free ( ) ;
51
52 return 2 .0 f ∗ ( mem size / s t r i d e ∗ 1 e3 f ) / ( time ∗ ( f loat ) (1 << 20) ) ;
53 }
Listing B.5: CUDA Memory-copy kernel.
1 . entry Z6VecAddPKiS0 Pi (
2 . param . u64 cudaparm Z6VecAddPKiS0 Pi A ,
3 . param . u64 cudaparm Z6VecAddPKiS0 Pi B ,
4 . param . u64 cudaparm Z6VecAddPKiS0 Pi C )
5 {
6 . reg . u16 %rh<4>;
7 . reg . u32 %r<8>;
8 . reg . u64 %rd<10>;
9 . l o c 29 5 0
10 $LDWbegin Z6VecAddPKiS0 Pi :
11 . l o c 29 9 0
12 cvt . u32 . u16 %r1 , %t i d . x ;
13 mov . u16 %rh1 , %c ta id . x ;
14 mov . u16 %rh2 , %nt id . x ;
15 mul . wide . u16 %r2 , %rh1 , %rh2 ;
16 add . u32 %r3 , %r1 , %r2 ;
17 cvt . s64 . s32 %rd1 , %r3 ;
18 mul . wide . s32 %rd2 , %r3 , 4 ;
19 ld . param . u64 %rd3 , [ cudaparm Z6VecAddPKiS0 Pi A ] ;
20 add . u64 %rd4 , %rd3 , %rd2 ;
21 ld . g l oba l . s32 %r4 , [%rd4 +0] ;
22 ld . param . u64 %rd5 , [ cudaparm Z6VecAddPKiS0 Pi B ] ;
23 add . u64 %rd6 , %rd5 , %rd2 ;
24 ld . g l oba l . s32 %r5 , [%rd6 +0] ;
25 add . s32 %r6 , %r4 , %r5 ;
26 ld . param . u64 %rd7 , [ cudaparm Z6VecAddPKiS0 Pi C ] ;
27 add . u64 %rd8 , %rd7 , %rd2 ;
28 s t . g l oba l . s32 [%rd8+0] , %r6 ;
29 . l o c 29 10 0
30 e x i t ;
31 $LDWend Z6VecAddPKiS0 Pi :
32 }
Listing B.6: Compiled PTX of the kernel in Listing B.1. Target architecture 1.1.
1 . entry Z18MatMulGlobalKernel6MatrixS S (
2 . param . a l i g n 8 . b8 cudaparm Z18MatMulGlobalKernel6MatrixS S A [ 2 4 ] ,
3 . param . a l i g n 8 . b8 cudaparm Z18MatMulGlobalKernel6MatrixS S B [ 2 4 ] ,
122 Listings
4 . param . a l i g n 8 . b8 cudaparm Z18MatMulGlobalKernel6MatrixS S C [ 2 4 ] )
5 {
6 . reg . u16 %rh<6>;
7 . reg . u32 %r<25>;
8 . reg . u64 %rd<16>;
9 . reg . pred %p<4>;
10 . l o c 29 5 0
11 $LDWbegin Z18MatMulGlobalKernel6MatrixS S :
12 cvt . u32 . u16 %r1 , %t i d . x ;
13 cvt . u32 . u16 %r2 , %t i d . y ;
14 mov . u16 %rh1 , %nt id . x ;
15 mov . u16 %rh2 , %c ta id . x ;
16 mov . u16 %rh3 , %nt id . y ;
17 mov . u16 %rh4 , %c ta id . y ;
18 ld . param . s32 %r3 , [ cudaparm Z18MatMulGlobalKernel6MatrixS S A+0] ;
19 mov . u32 %r4 , 0 ;
20 setp . l e . s32 %p1 , %r3 , %r4 ;
21 @%p1 bra $Lt 0 2306 ;
22 ld . param . s32 %r3 , [ cudaparm Z18MatMulGlobalKernel6MatrixS S A+0] ;
23 mov . s32 %r5 , %r3 ;
24 mul . wide . u16 %r6 , %rh3 , %rh4 ;
25 mul . wide . u16 %r7 , %rh1 , %rh2 ;
26 add . u32 %r8 , %r6 , %r2 ;
27 add . u32 %r9 , %r7 , %r1 ;
28 mul . l o . s32 %r10 , %r3 , %r8 ;
29 mov . s32 %r11 , %r10 ;
30 add . s32 %r12 , %r10 , %r3 ;
31 ld . param . s32 %r13 , [ cudaparm Z18MatMulGlobalKernel6MatrixS S B+0] ;
32 cvt . s64 . s32 %rd1 , %r13 ;
33 mul . wide . s32 %rd2 , %r13 , 4 ;
34 ld . param . u64 %rd3 , [ cudaparm Z18MatMulGlobalKernel6MatrixS S B+16] ;
35 cvt . s64 . s32 %rd4 , %r9 ;
36 mul . wide . s32 %rd5 , %r9 , 4 ;
37 add . u64 %rd6 , %rd3 , %rd5 ;
38 ld . param . u64 %rd7 , [ cudaparm Z18MatMulGlobalKernel6MatrixS S A+16] ;
39 cvt . s64 . s32 %rd8 , %r10 ;
40 mul . wide . s32 %rd9 , %r10 , 4 ;
41 add . u64 %rd10 , %rd7 , %rd9 ;
42 mov . s32 %r14 , 0 ;
43 mov . s32 %r15 , %r5 ;
44 $Lt 0 1794 :
45 //<loop> Loop body l i n e 5 , ne s t ing depth : 1 , es t imated i t e r a t i o n s : unknown
46 . l o c 29 11 0
47 ld . g l oba l . s32 %r16 , [%rd6 +0] ;
48 ld . g l oba l . s32 %r17 , [%rd10 +0] ;
49 mul . l o . s32 %r18 , %r16 , %r17 ;
50 add . s32 %r14 , %r14 , %r18 ;
51 add . s32 %r11 , %r11 , 1 ;
52 add . u64 %rd10 , %rd10 , 4 ;
53 add . u64 %rd6 , %rd2 , %rd6 ;
54 setp . ne . s32 %p2 , %r11 , %r12 ;
55 @%p2 bra $Lt 0 1794 ;
56 bra . uni $Lt 0 1282 ;
57 $Lt 0 2306 :
58 mul . wide . u16 %r19 , %rh3 , %rh4 ;
59 mul . wide . u16 %r20 , %rh1 , %rh2 ;
60 add . u32 %r8 , %r19 , %r2 ;
61 add . u32 %r9 , %r20 , %r1 ;
62 mov . s32 %r14 , 0 ;
63 $Lt 0 1282 :
64 . l o c 29 13 0
65 ld . param . u64 %rd11 , [ cudaparm Z18MatMulGlobalKernel6MatrixS S C+16] ;
123
66 ld . param . s32 %r21 , [ cudaparm Z18MatMulGlobalKernel6MatrixS S C+0] ;
67 mul . l o . s32 %r22 , %r21 , %r8 ;
68 add . s32 %r23 , %r9 , %r22 ;
69 cvt . s64 . s32 %rd12 , %r23 ;
70 mul . wide . s32 %rd13 , %r23 , 4 ;
71 add . u64 %rd14 , %rd11 , %rd13 ;
72 s t . g l oba l . s32 [%rd14+0] , %r14 ;
73 . l o c 29 14 0
74 e x i t ;
75 $LDWend Z18MatMulGlobalKernel6MatrixS S :
76 }
Listing B.7: Compiled PTX of the kernel in Listing B.2. Target architecture 1.1.
1 . entry Z18MatMulSharedKernel6MatrixS S (
2 . param . a l i g n 8 . b8 cudaparm Z18MatMulSharedKernel6MatrixS S A [ 2 4 ] ,
3 . param . a l i g n 8 . b8 cudaparm Z18MatMulSharedKernel6MatrixS S B [ 2 4 ] ,
4 . param . a l i g n 8 . b8 cudaparm Z18MatMulSharedKernel6MatrixS S C [ 2 4 ] )
5 {
6 . reg . u32 %r<34>;
7 . reg . u64 %rd<38>;
8 . reg . f32 %f<52>;
9 . reg . pred %p<4>;
10 . shared . a l i g n 4 . b8 cuda cuda l o c a l va r 488757 38 As72 [ 1 0 2 4 ] ;
11 . shared . a l i g n 4 . b8 cuda cuda l o ca l va r 488758 38 Bs1096 [ 1 0 2 4 ] ;
12 . l o c 17 39 0
13 $LDWbegin Z18MatMulSharedKernel6MatrixS S :
14 . l o c 17 46 0
15 ld . param . s32 %r1 , [ cudaparm Z18MatMulSharedKernel6MatrixS S A+0] ;
16 shr . s32 %r2 , %r1 , 31 ;
17 mov . s32 %r3 , 15 ;
18 and . b32 %r4 , %r2 , %r3 ;
19 add . s32 %r5 , %r4 , %r1 ;
20 shr . s32 %r6 , %r5 , 4 ;
21 cvt . s32 . u16 %r7 , %t i d . y ;
22 cvt . s32 . u16 %r8 , %t i d . x ;
23 cvt . s32 . u16 %r9 , %c ta id . y ;
24 cvt . s32 . u16 %r10 , %c ta id . x ;
25 mov . u32 %r11 , 0 ;
26 setp . l e . s32 %p1 , %r6 , %r11 ;
27 @%p1 bra $Lt 0 3330 ;
28 mov . u64 %rd1 , cuda cuda l o c a l va r 488757 38 As72 ;
29 mov . u64 %rd2 , cuda cuda l o ca l va r 488758 38 Bs1096 ;
30 mov . s32 %r12 , %r6 ;
31 ld . param . s32 %r13 , [ cudaparm Z18MatMulSharedKernel6MatrixS S A+8] ;
32 mul . l o . s32 %r14 , %r6 , 16 ;
33 cvt . s64 . s32 %rd3 , %r8 ;
34 cvt . s64 . s32 %rd4 , %r7 ;
35 ld . param . s32 %r15 , [ cudaparm Z18MatMulSharedKernel6MatrixS S B+8] ;
36 mul24 . l o . s32 %r16 , %r10 , 16 ;
37 mul . l o . s32 %r17 , %r13 , %r9 ;
38 mul . l o . s32 %r18 , %r17 , 16 ;
39 mov . s32 %r19 , %r18 ;
40 add . s32 %r20 , %r14 , %r18 ;
41 mul . wide . s32 %rd5 , %r8 , 4 ;
42 add . u64 %rd6 , %rd2 , %rd5 ;
43 mul . wide . s32 %rd7 , %r7 , 64 ;
44 add . u64 %rd8 , %rd1 , %rd7 ;
45 mul . wide . s32 %rd9 , %r7 , 16 ;
46 add . u64 %rd10 , %rd3 , %rd9 ;
47 mul . l o . u64 %rd11 , %rd10 , 4 ;
124 Listings
48 add . u64 %rd12 , %rd11 , %rd1 ;
49 add . u64 %rd13 , %rd11 , %rd2 ;
50 mul . l o . s32 %r21 , %r15 , 16 ;
51 cvt . s64 . s32 %rd14 , %r21 ;
52 mul . wide . s32 %rd15 , %r21 , 4 ;
53 ld . param . u64 %rd16 , [ cudaparm Z18MatMulSharedKernel6MatrixS S A+16] ;
54 mul . l o . s32 %r22 , %r13 , %r7 ;
55 add . s32 %r23 , %r8 , %r22 ;
56 cvt . s64 . s32 %rd17 , %r23 ;
57 mul . wide . s32 %rd18 , %r23 , 4 ;
58 add . s64 %rd19 , %rd16 , %rd18 ;
59 cvt . s64 . s32 %rd20 , %r18 ;
60 mul . wide . s32 %rd21 , %r18 , 4 ;
61 add . s64 %rd22 , %rd19 , %rd21 ;
62 ld . param . u64 %rd23 , [ cudaparm Z18MatMulSharedKernel6MatrixS S B+16] ;
63 mul . l o . s32 %r24 , %r15 , %r7 ;
64 add . s32 %r25 , %r8 , %r24 ;
65 cvt . s64 . s32 %rd24 , %r25 ;
66 mul . wide . s32 %rd25 , %r25 , 4 ;
67 add . s64 %rd26 , %rd23 , %rd25 ;
68 cvt . s64 . s32 %rd27 , %r16 ;
69 mul . wide . s32 %rd28 , %r16 , 4 ;
70 add . s64 %rd29 , %rd26 , %rd28 ;
71 mov . f32 %f1 , 0 f00000000 ; // 0
72 mov . s32 %r26 , %r12 ;
73 $Lt 0 2818 :
74 //<loop> Loop body l i n e 46 , ne s t ing depth : 1 , es t imated i t e r a t i o n s : unknown
75 . l o c 17 73 0
76 ld . g l oba l . f 32 %f2 , [%rd22 +0] ;
77 s t . shared . f32 [%rd12+0] , %f2 ;
78 . l o c 17 74 0
79 ld . g l oba l . f 32 %f3 , [%rd29 +0] ;
80 s t . shared . f32 [%rd13+0] , %f3 ;
81 . l o c 17 78 0
82 bar . sync 0 ;
83 . l o c 17 82 0
84 ld . shared . f32 %f4 , [%rd8 +0] ;
85 ld . shared . f32 %f5 , [%rd6 +0] ;
86 mad . f32 %f6 , %f4 , %f5 , %f1 ;
87 ld . shared . f32 %f7 , [%rd8 +4] ;
88 ld . shared . f32 %f8 , [%rd6 +64] ;
89 mad . f32 %f9 , %f7 , %f8 , %f6 ;
90 ld . shared . f32 %f10 , [%rd8 +8] ;
91 ld . shared . f32 %f11 , [%rd6 +128] ;
92 mad . f32 %f12 , %f10 , %f11 , %f9 ;
93 ld . shared . f32 %f13 , [%rd8 +12] ;
94 ld . shared . f32 %f14 , [%rd6 +192] ;
95 mad . f32 %f15 , %f13 , %f14 , %f12 ;
96 ld . shared . f32 %f16 , [%rd8 +16] ;
97 ld . shared . f32 %f17 , [%rd6 +256] ;
98 mad . f32 %f18 , %f16 , %f17 , %f15 ;
99 ld . shared . f32 %f19 , [%rd8 +20] ;
100 ld . shared . f32 %f20 , [%rd6 +320] ;
101 mad . f32 %f21 , %f19 , %f20 , %f18 ;
102 ld . shared . f32 %f22 , [%rd8 +24] ;
103 ld . shared . f32 %f23 , [%rd6 +384] ;
104 mad . f32 %f24 , %f22 , %f23 , %f21 ;
105 ld . shared . f32 %f25 , [%rd8 +28] ;
106 ld . shared . f32 %f26 , [%rd6 +448] ;
107 mad . f32 %f27 , %f25 , %f26 , %f24 ;
108 ld . shared . f32 %f28 , [%rd8 +32] ;
109 ld . shared . f32 %f29 , [%rd6 +512] ;
125
110 mad . f32 %f30 , %f28 , %f29 , %f27 ;
111 ld . shared . f32 %f31 , [%rd8 +36] ;
112 ld . shared . f32 %f32 , [%rd6 +576] ;
113 mad . f32 %f33 , %f31 , %f32 , %f30 ;
114 ld . shared . f32 %f34 , [%rd8 +40] ;
115 ld . shared . f32 %f35 , [%rd6 +640] ;
116 mad . f32 %f36 , %f34 , %f35 , %f33 ;
117 ld . shared . f32 %f37 , [%rd8 +44] ;
118 ld . shared . f32 %f38 , [%rd6 +704] ;
119 mad . f32 %f39 , %f37 , %f38 , %f36 ;
120 ld . shared . f32 %f40 , [%rd8 +48] ;
121 ld . shared . f32 %f41 , [%rd6 +768] ;
122 mad . f32 %f42 , %f40 , %f41 , %f39 ;
123 ld . shared . f32 %f43 , [%rd8 +52] ;
124 ld . shared . f32 %f44 , [%rd6 +832] ;
125 mad . f32 %f45 , %f43 , %f44 , %f42 ;
126 ld . shared . f32 %f46 , [%rd8 +56] ;
127 ld . shared . f32 %f47 , [%rd6 +896] ;
128 mad . f32 %f48 , %f46 , %f47 , %f45 ;
129 ld . shared . f32 %f49 , [%rd8 +60] ;
130 ld . shared . f32 %f50 , [%rd6 +960] ;
131 mad . f32 %f1 , %f49 , %f50 , %f48 ;
132 . l o c 17 87 0
133 bar . sync 0 ;
134 add . s64 %rd29 , %rd29 , %rd15 ;
135 add . s32 %r19 , %r19 , 16 ;
136 add . s64 %rd22 , %rd22 , 64 ;
137 setp . ne . s32 %p2 , %r19 , %r20 ;
138 @%p2 bra $Lt 0 2818 ;
139 bra . uni $Lt 0 2306 ;
140 $Lt 0 3330 :
141 mov . f32 %f1 , 0 f00000000 ; // 0
142 $Lt 0 2306 :
143 . l o c 17 21 0
144 ld . param . s32 %r27 , [ cudaparm Z18MatMulSharedKernel6MatrixS S C+8] ;
145 ld . param . u64 %rd30 , [ cudaparm Z18MatMulSharedKernel6MatrixS S C+16] ;
146 mul . l o . s32 %r28 , %r27 , %r9 ;
147 add . s32 %r29 , %r10 , %r28 ;
148 mul . l o . s32 %r30 , %r29 , 16 ;
149 cvt . s64 . s32 %rd31 , %r30 ;
150 mul . wide . s32 %rd32 , %r30 , 4 ;
151 add . u64 %rd33 , %rd30 , %rd32 ;
152 mul . l o . s32 %r31 , %r7 , %r27 ;
153 add . s32 %r32 , %r8 , %r31 ;
154 cvt . s64 . s32 %rd34 , %r32 ;
155 mul . wide . s32 %rd35 , %r32 , 4 ;
156 add . u64 %rd36 , %rd33 , %rd35 ;
157 s t . g l oba l . f 32 [%rd36+0] , %f1 ;
158 . l o c 17 93 0
159 e x i t ;
160 $LDWend Z18MatMulSharedKernel6MatrixS S :
161 }
Listing B.8: Compiled PTX of the kernel in Listing B.3. Target architecture 1.1.
126 Listings
Bibliography
[1] M. J. Flynn. Some computer organizations and their effectiveness. IEEE Transac-
tions on Computers, C-21:948–960, 1972.
[2] Ralph Duncan. A survey of parallel computer architectures. Computer, 23(2):5–16,
1990.
[3] Khronos OpenCL Working Group. The OpenCL Specification, version 1.0.29, 8
December 2008.
[4] Owens John D., Luebke David, Govindaraju Naga, Harris Mark, Kroger Jens,
Lefohn Aaron E., and Purcell Timothy J. A Survey of General-Purpose Compu-
tation on Graphics Hardware. Eurographics 2005, State of the Art Reports, pages
21–51, August 2005.
[5] GPGPU. General-Purpose Computation on Graphics Processing Units. http:
//www.gpgpu.org/.
[6] Nvidia. Nvidia CUDA Programming Guide 3.1, 2010.
[7] AMD-ATI. ATI Stream Computing, 2009.
[8] Cristian Dittamo. On Expressing Different Concurrency Paradigms on Virtual Ex-
ecution Systems. PhD Thesis, 2009.
[9] Lei Wang, Yong-zhong Huang, Xin Chen, and Chun-yan Zhang. Task scheduling of
parallel processing in cpu-gpu collaborative environment. In ICCSIT ’08: Proceed-
ings of the 2008 International Conference on Computer Science and Information
Technology, pages 228–232, Washington, DC, USA, 2008. IEEE Computer Society.
[10] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specification. Addison
Wesley, 2nd edition, 1999.
[11] Microsoft. Microsoft Developer Network. http://msdn.microsoft.com/.
[12] ECMA International. Standard ECMA-335 - Common Language Infrastructure
(CLI). 4 edition, June 2006.
127
128 BIBLIOGRAPHY
[13] Giacomo Righetti. Esecuzione di codice intermedio su GPU: le VM incontrano i
processori grafici. Tesi di Laurea Specialistica, 2009.
[14] Cristian Dittamo. General Purpose GPU.
[15] Alexander Zibula. General Purpose Computation on Graphics Processing Units
(GPGPU) using CUDA. http://www.nvidia.it/page/geforce256.html.
[16] Mark Segal and Kurt Akeley. The OpenGL Graphics System: A Specification
(Version 4.0).
[17] Chas. Boyd. DirectX 11 Compute Shader, 2008.
[18] William R. Mark, R. Steven Glanville, Kurt Akeley, and Mark J. Kilgard. Cg:
a system for programming graphics hardware in a c-like language. ACM Trans.
Graph., 22(3):896–907, 2003.
[19] Suryakant Patidar, Shiben Bhattacharjee, Jag Mohan Singh, and P. J. Narayanan.
Exploiting the Shader Model 4.0 Architecture, 2007.
[20] Microsoft. Common-Shader Core (DirectX HLSL). http://msdn.microsoft.com/
en-us/library/bb509580(VS.85).aspx.
[21] Cristian Dittamo and Antonio Cisternino. GPU White Paper. http://tinyurl.
com/3xo9vh5.
[22] Nvidia. The CUDA Compiler Driver NVCC.
[23] Nvidia. PTX: Parallel Thread Execution ISA Version 2.1.
[24] Halfhill Tom. Parallel Processing with CUDA. Microprocessor Journal, 2008.
[25] Nvidia. CUDA x86, SC10 Supercomputing Conference, 2010.
[26] Phuong Hoai Ha, Philippas Tsigas, and Otto J. Anshus. The Synchronization Power
of Coalesced Memory Accesses. In DISC ’08: Proceedings of the 22nd international
symposium on Distributed Computing, pages 320–334, Berlin, Heidelberg, 2008.
Springer-Verlag.
[27] Nvidia. Nvidia Cuda Best Practice Guide 3.1, 2010.
[28] Nvidia. Nvidia’s Next Generation CUDA Compute Architecture: Fermi, 2009.
[29] Dac Pham, Hans-Werner Anderson, Erwin Behnen, Mark Bolliger, Sanjay Gupta,
H. Peter Hofstee, Paul E. Harvey, Charles R. Johns, James A. Kahle, Atsushi
Kameyama, John M. Keaty, Bob Le, Sang Lee, Tuyen V. Nguyen, John G. Petro-
vick, Mydung Pham, Juergen Pille, Stephen D. Posluszny, Mack W. Riley, Joseph
Verock, James D. Warnock, Steve Weitzel, and Dieter F. Wendel. Key features
BIBLIOGRAPHY 129
of the design methodology enabling a multi-core SoC implementation of a first-
generation CELL processor. In Fumiyasu Hirose, editor, ASP-DAC, pages 871–878.
IEEE, 2006.
[30] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Pradeep Dubey, Stephen
Junkins, Adam Lake, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan,
Michael Abrash, Jeremy Sugerman, and Pat Hanrahan. Larrabee: A Many-Core
x86 Architecture for Visual Computing. IEEE Micro, 29(1):10–21, 2009.
[31] AMD. Amd Fusion Family of APUs: Enabling a Superior, Immersive PC Experi-
ence.
[32] Eric Young. Directcompute Optimizations and Best Practices, 2010.
[33] Intel. Intel Array Building Blocks for Windows OS, User’s guide.
[34] Anwar Ghuloum, Eric Sprangle, Jesse Fang, Gansha Wu, and Xin Zhou. Ct: A
Flexible Parallel Programming Model for Tera-scale Architectures.
[35] M. Monteyne. Rapidmind multi-core develpment platform. Technical report, Tech.
Rep., 2007.
[36] David Tarditi, Sidd Puri, and Jose Oglesby. Accelerator: using data parallelism
to program GPUs for general-purpose uses. SIGARCH Comput. Archit. News,
34(5):325–335, 2006.
[37] Mordechai Butrashvily. CUDA.NET manual. Reference for programmers, 2008.
[38] Joel Pobar and Ted Neward. Shared Source CLI 2.0 Internals. http://callvirt.
net/blog/.
[39] Mono-Project. Mono-project. http://www.mono-project.com/.
[40] DotGNU Project. Portable .NET. http://www.gnu.org/software/dotgnu/pnet.
html.
[41] Stutz David, Neward Ted, and Shilling Geoff. Shared Source Cli Essentials. O’Reilly
& Associates, Inc., Sebastopol, CA, USA, 2002.
[42] Jeffrey Richter and Jeffrey Richter. Applied Microsoft .NET Framework Program-
ming. Microsoft Press, Redmond, WA, USA, 2002.
[43] Cristian Dittamo. Tecniche di parallelizzazione di programmi sequenziali basate su
annotazioni. Tesi di Laurea Specialistica, 2006.
[44] Redgate. .NET Reflector. http://www.red-gate.com/products/reflector/.
[45] Microsoft. Native Image Generator. http://msdn.microsoft.com/en-us/
library/6t9t5wcf(VS.80).aspx.
130 BIBLIOGRAPHY
[46] Peter Sestoft. Runtime Code Generation with JVM and CLR. Technical report,
Department of Mathematics and Physics, IT University of Copenhagen, Denmark,
2002.
[47] A. Cisternino. CliFileRW. http://www.codeplex.com/clifilerw.
[48] Marco Vanneschi. Architetture Parallele e Distribuite, SEU, Pisa.
[49] Ryoo Shane, Rodrigues Christopher I., Baghsorkhi Sara S., Stone Sam S., Kirk
David B., and Hwu Wen-mei W. Optimization Principles and Application Perfor-
mance Evaluation of a Multithreaded GPU Using CUDA. Proceedings of the 13th
ACM SIGPLAN Symposium on Principles and practice of parallel programming,
pages 73–82, 2008.
[50] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, and
Kevin Skadron. A performance study of general-purpose applications on graphics
processors using cuda. J. Parallel Distrib. Comput., 68(10):1370–1380, 2008.
[51] Meng Jiayuan and Skadron Kevin. Performance Modeling and Automatic Ghost
Zone Optimization for Iterative Stencil Loops on GPUs. In Proc. of ACM ICS,
2009.
[52] Kothapalli Kishore, Mukherjee Rishabh, Rehman Suhail, Patidar Suryakant,
Narayanan P. J., and Srinathan Kannan. A Performance Prediction Model for the
CUDA GPGPU Platform. International Conference on High Performance Com-
puting, December 2009.
[53] Sunpyo Hong and Hyesoon Kim. An analytical model for a GPU architecture with
memory-level and thread-level parallelism awareness. In Stephen W. Keckler and
Luiz Andre´ Barroso, editors, ISCA, pages 152–163. ACM, 2009.
[54] Sara S. Baghsorkhi, Matthieu Delahaye, Sanjay J. Patel, William D. Gropp, and
Wen mei W. Hwu. An adaptive performance modeling tool for GPU architectures.
In R. Govindarajan, David A. Padua, and Mary W. Hall, editors, PPOPP, pages
105–114. ACM, 2010.
[55] Tilak Raj G. and Pramod Subramanyan. Studying Memory System Performance
of a Multithreaded GPU. Technical report, 2008.
[56] Cristian Dittamo. On expressing different concurrency paradigms on virtual exe-
cution systems. In Sheikh Iqbal Ahamed, Elisa Bertino, Carl K. Chang, Vladimir
Getov, Lin Liu, Hua Ming, and Rajesh Subramanyan, editors, COMPSAC (1),
pages 664–667. IEEE Computer Society, 2009.
[57] Daniel Haugen. Seismic Data Compression and GPU Memory Latency, 2009.
[58] Rajesh Bordawekar, Uday Bondhugula, and Ravi Rao. Can CPUs match GPUs on
Performance with Productivity?: Experiences with Optimizing a FLOP-intensive
Application on CPUs and GPU. Technical report, IBM Research Report.
