Louisiana Tech University

Louisiana Tech Digital Commons
Doctoral Dissertations

Graduate School

Winter 2014

Performance modeling and optimization
techniques for heterogeneous computing
Supada Laosooksathit

Follow this and additional works at: https://digitalcommons.latech.edu/dissertations
Part of the Applied Statistics Commons, Mathematics Commons, and the Other Computer
Sciences Commons

PERFORM ANCE MODELING AN D OPTIMIZATION TECHNIQUES
FOR HETEROGENEOUS CO M PUTING
by
Supada Laosooksathit, B.S., M.S.

A Dissertation Presented in Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy

COLLEGE OF ENGINEERING AND SCIENCE
LOUISIANA TECH UNIVERSITY

March 2014

UMI Number: 3662202

All rights reserved
INFORMATION TO ALL USERS
The quality of this reproduction is dependent upon the quality of the copy submitted.
In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if material had to be removed,
a note will indicate the deletion.

Di!ss0?t&Ciori Publishing
UMI 3662202
Published by ProQuest LLC 2015. Copyright in the Dissertation held by the Author.
Microform Edition © ProQuest LLC.
All rights reserved. This work is protected against
unauthorized copying under Title 17, United States Code.

ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, Ml 48106-1346

LOUISIANA TECH UNIVERSITY
THE GRADUATE SCHOOL

D ec 10, 2 0 1 3
Date

We

.

hereby

recom m end

that

the

dissertation

prepared

under

our

supervision

SUPAD A LAOSOOKSATHIT

by______________________________________________________________________________________________________
entitled__________________________________________________________________________________________________

PERFORM ANCE MODELING AND OPTIMIZATION TECHNIQUES
FOR HETERO GENEOUS COMPUTING

be

accepted

in

partial

fulfillm ent

of

the

requirem ents

for

the

D egree

of

DOCTOR OF PHILOSOPHY

Supervys<Votf)i(>sertation Research

Head o f Department

Com putational A nalysis and M odeling
Department
R ecom m endatiom concurred in:

/

A dvisory C om m ittee

Approved:

Approved:

DiregtefoC^Sfaduate Studies

uate School

V
Dean o f the College
GS Form 13a
(6/07)

ABSTRACT
Since Graphics Processing Units (GPUs) have increasingly gained popularity
amoung non-graphic and computational applications, known as General-Purpose com
putation on GPU (GPGPU), GPUs have been deployed in many clusters, including
the worlds fastest supercomputer. However, to make the most efficiency from a GPU
system, one should consider both performance and reliability of the system.
This dissertation makes four major contributions. First, the two-level check
point/restart protocol th at aims to reduce the checkpoint and recovery costs with a
latency hiding strategy in a system between a CPU (Central Processing Unit) and
a GPU is proposed. The experimental results and analysis reveals some benefits,
especially in a long-running application.
Second, a performance model for estimating GPGPU execution time is pro
posed. This performance model improves operation cost estimation over existing
ones by considering varied memory latencies. The proposed model also considers
the effects of thread synchronization functions. In addition, the impacts of various
issues in GPGPU programming such as bank conflicts in shared memory and branch
divergence are also discussed.
Third, the interplay between GPGPU application performance and system re
liability of a large GPU system is explored. This includes a checkpoint scheduling

model for a certain GPGPU application. The effects of a checkpoint/restart mecha
nism on the application performance is also discussed.
Finally, optimization techniques to remedy uncoalesced memory access in
G PU ’s global memory are proposed. These techniques are memory rearrangement
using 2-dimensional m atrix transpose and 3-dimensional m atrix permutation. The
analytical results show th at the proposed technique can reduce memory access time,
especially when the transformed array/m atrix is frequently accessed.

APPROVAL FOR SCHOLARLY DISSEMINATION
The author grants to the P rescott M em orial Library o f Louisiana T ech U niversity the right to
reproduce, by appropriate methods, upon request, any or all portions o f this Dissertation. It is understood
that “proper request” consists o f the agreem ent, on the part o f the requesting party, that said reproduction
is for his personal u se and that subsequent reproduction w ill not occu r w ithout written approval o f the
author o f this D issertation.

Further, any portions o f the D issertation used in b ook s, papers, and other

w orks must be appropriately referenced to this Dissertation.
Finally, the author o f this D issertation reserves the right to publish freely, in the literature, at
any tim e, any

or all portions o f this D issertation.

Author

S u p a d a L aosooksathit

v
Date

1 2 /1 0 /2 0 1 3

_______

GS Form 14
(5/03)

DEDICATION
my late mothor, Suvimol Laosooksathit, I dedicate this work.

TABLE OF CONTENTS
A BSTRA CT......................................................................................................................... iii
DEDICATION..................................................................................................................... vi
LIST OF TABLES............................................................................................................... x
LIST OF FIGURES............................................................................................................ xi
ACKNOWLEDGMENTS....................................................................................................xv
CHAPTER 1

INTRODUCTION..................................................................................

1

1.1

Overview of G P U s.............................................................................................

1

1.2

GPGPU Performance and Optimization.......................................................

2

1.3

Checkpoint/Restart Mechanism......................................................................

4

CHAPTER 2

BACKGROUND AND RELATED W O R K S...................................

6

2.1

Checkpoint/Restart Mechanism......................................................................

6

2.2

Checkpoint Scheduling and System Reliability.............................................

7

2.3

GPGPU Performance M odels..........................................................................

8

2.4

Memory Optim ization.......................................................................................

9

2.4.1

Memory Optimization in H P C ............................................................ 9

2.4.2

Memory Optimization in G PG P U ...................................................... 11

CHAPTER 3

2.4.2.1

GPU-CPU Memory Latency Hiding.................................. 11

2.4.2.2

GPGPU Memory Hierarchy................................................. 12

TWO-LEVEL CHECKPOINT/RESTART FOR G PG P U ............. 14

vii

3.1

CUDA Stream s................................................................................................... 14
3.1.1

Performance Model of CUDA Stream s................................................ 16

3.1.2

Benchmarks................................................................................................ 22

3.2

Two-Level Checkpoint/Restart Protocols........................................................25

3.3

Experim ents......................................................................................................... 27

3.4

Sim ulation............................................................................................................ 31

3.5

Conclusion.............................................................................................................. 35

CHAPTER 4 PERFORMANCE MODEL FOR G P G P U ..........................................36
4.1

4.2

Performance M odelling...................................................................................... 38
4.1.1

Parameters...................................................................................................38

4.1.2

Performance Model in a General C a s e ................................................ 42

4.1.3

Performance Model in a Case with Synchronization........................ 43

4.1.4

Control Flow Cases....................................................................................45

Results and Discussion......................................................................................... 47
4.2.1

Code Analysis........................................................................................... 47

4.2.2 Results......................................................................................................... 49
CHAPTER 5 GPU APPLICATION PERFORMANCE VS CHECKPOINTS ... 52
5.1

Application Performance and System Reliability.......................................... 53
5.1.1

Predicting the System Size from the Maximum System Reliability 57

5.1.2 Predicting the System Size from an Expected Performance

5.2

58

5.1.2.1

Non-scalable W orkload........................................................... 59

5.1.2.2

Scalable W orkload................................................................. 59

Checkpoint Scheduling....................................................................................... 60

ix

5.3

Perform ance........................................................................................................ 64

5.4

Real-World Case Study....................................................................................... 69

5.5

The Model with an Excess Weibull...................................................................72

5.6

Conclusion..............................................................................................................74

CHAPTER 6

GPGPU OPTIM IZATION......................................................................75

6.1

Coalescing VS Uncoalescing............................................................................. 76

6.2

Memory Rearrangement.................................................................................... 78

6.3

6.4

6.2.1

Single S tride...............................................................................................79

6.2.2

Double Strides......................................................................................... 81

Analytical M odels.................................................................................................86
6.3.1

Predetermined D ata Size....................................................................... 86

6.3.2

Unknown Data Size................................................................................ 88
6.3.2.1

Single Stride..............................................................................88

6.3.2.2

Double S trid es....................................................................... 90

Results and Cost A nalysis..................................................................................90
6.4.1

Predetermined D ata Size....................................................................... 91

6.4.2 Unknown Data Size and Break-even Analysis................................... 91
6.5

Conclusion............................................................................................................ 92

CHAPTER 7 CONCLUSIONS AND FUTURE W O R K ............................................94
BIBLIOGRAPHY.................................................................................................................96

LIST OF TABLES
Table 3.1:

Times spent by operations of matrix multiplication, measured by
CUDA Events.................................................................................................. 23

Table 3.2:

Total execution times with non-stream and the expected total
execution time with 8-stream are calculated from the operational
times on Table 3.1......................................................................................... 24

Table 4.1:

The values of experimental parameters validated in the m odel

Table 6.1:

The comparison between the kernel execution times based on our
technique using the sample code in Listing 6.4 with the original
memory access and transformed memory access..................................... 91

Table 6.2:

The comparison between the performance gain from the transformed
memory access to the time of 2D matrix transpose.................................92

x

48

LIST OF FIGURES
Figure 1.1:

Two-level checkpoint for a heterogeneous system .................................

5

Figure 2.1:

Memory hierarchy in G P G P U .................................................................. 12

Figure 3.1:

The results of simpleStreams application with various numbers of
streams and array sizes................................................................................ 15

Figure 3.2:

Time diagram of non-streamed execution............................................... 17

Figure 3.3:

Time diagram of streamed execution: (a) asynchronous memory
copies from host to device, (b) asynchronous memory copies from
device to host, and (c) asynchronous memory copies in both directions. 18

Figure 3.4: Time diagram of execution time when the time of memory copies
are longer than kernel executions: (a) synchronous memory copies,
(b) synchronous memory copies from host to device, (c) asynchronous
memory copies from device to host, and (d) asynchronous memory
copies in both directions.............................................................................. 20
Figure 3.5: Time diagram of execution time when the time of kernel executions
are longer than memory copies: (a) synchronous memory copies,
(b) synchronous memory copies from host to device, (c) asynchronous
memory copies from device to host, and (d) asynchronous memory
copies in both directions.............................................................................. 21
Figure 3.6:

The checkpoint protocol for GPU streamed C P R ................................. 26

Figure 3.7:

The restart protocol for GPU streamed C PR ........................................ 27

Figure 3.8:

The percentage of performance improvement in terms of (a)checkpoint
cost, (b) recovery cost due to a failure occurrence, and (c) wasted
time due to a failure occurrence................................................................. 30

Figure 3.9:

The percentages of total checkpoint costs compared to wasted time
of the application with non-streamed, 4-streamed, and 8-streamed
CPRs, when the size of arrays is 224 and (a) the checkpoint interval
is 30 minutes, and (b) the M TTF is 12 hours......................................... 32

xi

Figure 3.10: The percentages of total recovery costs compared to wasted time
of the application with non-streamed, 4-streamed, and 8-streamed
CPRs, when the size of arrays is 224 and (a) the checkpoint interval
is 30 minutes, and (b) the M TTF is 12 hours......................................... 33
Figure 3.11: The percentages of wasted time compared to completion time of the
application with non-streamed, 4-streamed, and 8-streamed CPRs,
when the size of arrays is 224 and (a) the checkpoint interval is 30
minutes, and (b) the MTTF is 12 hours.................................................... 33
Figure 4.1:

Memory latency on GPU NVIDIA GeForce GTX 295 ........................ 37

Figure 4.2:

GPU architecture: (a) programming model; (b) warp scheduling
in an S M ...........................................................................................................39

Figure 4.3:

The instruction list in (a) transforms into the timeline in (b)............ 41

Figure 4.4:

Diagrams show the execution time when: (a) there is no idle time;
(b) there is idle time. A shaded box indicates th a t the warp is
being executed. A white box indicates th at the warp is waiting for
the memory access........................................................................................ 43

Figure 4.5:

Diagrams show the execution time when: (a) a synchronization
function causes extra time between two consecutive computational
instructions; (b) a synchronization function causes extra time between
memory and computational instructions. Again, a shaded box
indicates th a t the warp is being executed. A white box indicates
th at the warp is waiting for the memory access...................................... 44

Figure 4.6:

The kernel execution time from the performance model compared
to the measurement of: (a) naive m atrix multiplication; (b) tiled
matrix m ultiplication.....................................................................................49

Figure 4.7:

Percentage of error of the performance model compared to the
measurement for both benchm arks........................................................... 50

Figure 5.1:

Probability of survival for different values of c and k from Equation
(5.4).................................................................................................................. 56

Figure 5.2:

The time between two consecutive checkpoints when: (a)-(c) k is
predicted by the maximal reliability expressed by Equation (5.8);
(d)-(f) k is predicted by the desired performance with fixed workload
expressed by Equation (5.9); (g)-(i) k is predicted by the desired
performance with scalable workload expressed by Equation (5.10).... 64

xiii

Figure 5.3: Comparison of application performance when: (a)-(c) k is predicted
by the maximal reliability expressed by Equation (5.8); (d)-(f) k is
predicted by the desired performance with fixed workload expressed
by Equation (5.9); (g)-(i) k is predicted by the desired performance
with scalable workload expressed by Equation (5.10).............................. 68
Figure 5.4: Time between two consecutive checkpoints for the case study when:
(a)-(c) k is predicted by the maximal reliability expressed by Equation
(5.8); (d) (f) k is predicted by the desired performance with fixed
workload expressed by Equation (5.9); (g)-(i) k is predicted by the
desired performance with scalable workload expressed by Equation
(5.10)................................................................................................................ 70
Figure 5.5: Comparison of application performance for the case study when:
(a)-(c) k is predicted by the maximal reliability expressed by Equation
(5.8); (d)-(f) k is predicted by the desired performance with fixed
workload expressed by Equation (5.9); (g) (i) k is predicted by the
desired performance with scalable workload expressed by Equation
(5.10)................................................................................................................ 71
Figure 6.1: Memory access pattern categories: (a) Coalesced memory access,
(b) Single stride uncoalesced memory access, and (c) Double stride
uncoalesced memory access......................................................................... 77
Figure 6.2: Array B transformation associated to the sample code in Listing 6.4
and 6.5 (a) The original array; (b) The 2D matrix is constructed;
(c) The transformed array..............................................................................80
Figure 6.3: Transformation of the 2D matrix th a t only the elements on the
diagonal line are accessed as given by Listing 6.7; (a) Original
Layout, (b) Modified matrix with the width of n B + 1........................ 81
Figure 6.4:

3D M atrix...................................................................................................... 82

Figure 6.5: ID array with double strides where di < dy, (a) The original layout,
(b) The 3D m atrix with m = dt, n = d^fd^ (c) Permuted 3D m atrix
with the order of [2,3,1], and (d) The final layout................................... 83
Figure 6.6: ID array with double strides where di > d2\ (a) The original layout,
(b) The 3D m atrix with m = di/dj, n = d3, (c) Permuted 3D m atrix
with the order of [2,1,3], and (d) The final layout.................................. 84
Figure 6.7:

2D matrix with double strides; (a) The original layout, (b) The
final layout after 3D matrix perm utation with the order of [2,1,3]...... 85

xiv

Figure 6.8: Tiled 2D matrix transposition where (a) is the original m atrix and
(b) is the transposed m atrix........................................................................88

ACKNOWLEDGMENTS
This dissertation would not have been possible without the great advice of my
advisor, Dr. Chokchai (Box) Leangsuksun, who has been patiently supporting me
throughout my doctoral study. He has guided me in every aspect of my research and
career, including writing and presenting research, providing critical feedback, giving
me opportunities to work with several researchers from many scientific areas, and
encouraging me to always move forward.
I would like to acknowledge Dr. Raja Nassar for his guidance in probability
and statistics and his patience in editing the technical papers. I am sincerely grateful
to Dr. Mihaela Paun for discussing ideas and validating my research in statistics, Dr.
Weizhong Dai for advising me through the CAM program, and Dr. Zeno Greenwood
for his valuable comments.
In personal, I would like to thank Nichamon Naksinehaboon and Narate Taerat
for helping me throughout the study and for living in Ruston. I would also like to
thank Thanadech Thanakornworakij for giving moral suggestions during my hard
times. Most importantly, I would like to express my deepest gratitude to my mother,
Suvimol Laosooksathit, who always taught me to be patient and persistent; my father,
Dr. Surin Laosooksathit, who is my role model to persue a doctoral degree and always
loves and supports me unconditionally; and my sister, W ipada Laosooksathit, who is
my morale and always believes in me.

xv

CHAPTER 1
INTRODUCTION
Exascale computing demands higher computational power and massive paral
lelism. In response to this demand, many organizations have upgraded their computa
tional infrastructures. Oak Ridge National Laboratory (ORNL)’s Titan has also been
upgraded and has become the world’s fastest supercomputer. T itan ’s computational
power is mainly from the newest NVIDIA’s graphic cards [1]. Therefore, to make the
most efficiency from a system with Graphics Processing Units (GPUs), the application
performance on such a system and the GPU system reliability should be carefully
studied.

1.1 Overview of GPUs
GPUs were first introduced for graphics computation. Due to the ability to
accelerate computation by massive data-parallelism, GPUs have been deployed for
non-graphic applications, known as General-Purpose computation on GPU (GPGPU)
[2]. The paradigm to exploit both CPU (Central Processing Unit) and GPU is also
known as heterogeneous computing.
For a node-wise system, since a GPU works as a co-processor of a CPU, the
CPU is referred to as the host, and the GPU as the device. First the data set has
to be prepared on the host side and transferred to the device. This process is called
1

2

host-to-device memory copy. Once host-to-device memory copy finishes, the program
is executed on the device side. A part of the program that will be executed on the
device is called a kernel. After the kernel execution finishes, the results are transferred
back to the host. This process is called device-to-host memory copy [3] [4]. Note th at
a host thread can handle only one GPU device.
Once the kernel is invoked, a kernel grid is created inside the GPU. The grid
contains thread blocks th a t are distributed to the GPU streaming multiprocessors
(SMs). Each thread block contains a number of threads that will execute instructions
on an SM. A new set of thread blocks is launched on the SMs again as previous thread
blocks terminate [3].

1.2 G P G P U Performance and Optim ization
By estimating cost benefit of GPGPU computing, the programmers will be
able to find ways to optimize their parallel program for a better use of both CPU
and GPU in a single application. Therefore, a performance model for GPGPU has
become more crucial in order to gain an insight into speed improvement issues from
a High Performance Computing (HPC) software development perspective.
In order to estim ate an application completion-time, the memory transfer time
between the GPU and the CPU, and the kernel execution time must be obtained.
Thus, one can make a decision whether the GPU is worth participating in the
com putation and can further improve the heterogeneous computing application.
Furthermore, the performance model can also help to improve the efficiency
of fault tolerance techniques for GPGPU, such as checkpoint/restart mechanisms,

3

especially for a large GPU cluster.

One of the major problems in this area is

checkpoint scheduling. To find an optimum checkpoint placement, the time-to-failure
(TTF) of the system and the completion-time of the application must be taken into
account. The system T T F can be obtained by a failure prediction technique, while the
completion time of the application can be estimated by the performance model. The
details of the performance model for a GPGPU application is illustrated in Chapter
4.
Even though deploying many computing elements can increase the computing
power, a very large system is prone to fail. Both soft and hard failures can cause
interruptions to applications running on the system at that time. Thus, Chapter 5
will describe an interplay between the performance of a GPGPU application and the
reliaibilty of a GPU system. The model to find an optimal number of nodes that will
satisfy both criteria is also proposed.
In addition, an optimization technique th a t aims toward performance im
provement in GPGPU applications is presented in Chapter 6.

Since one of the

major concerns in GPGPU optimization is coalescing in global memory, this work
proposes memory rearrangement techniques to remedy uncoalesced global memory
access patterns by using 2-dimensional m atrix transpose and 3-dimensional matrix
permutation. The proposed techniques can be applied to many common memory
access patterns. The cost benefit of these techniques will also be discussed in details.
Moreover, the analytical results reveal that the proposed techniques are beneficial if
the transformed array/m atrix is frequently accessed.

4

1.3 Checkpoint/R estart Mechanism
Since a large-scale GPU cluster is the main focus system in this dissertation,
fault tolerance is an important issue that should not be omitted.
Checkpoint/restart is a fault tolerance mechanism th at has been used in many
system platforms.

Instead of restarting computation from the beginning when a

failure occurs, with checkpoint/restart, a process can be rolled back from the last
checkpoint and can be migrated to a healthier node or system [5] [6].
However, GPU fault tolerance has recently been a concern in HPC. There are
very few existing works th at address resilience issues in a GPU environment. However,
with the popularity of large GPU systems, it is anticipated th at reliability will be quite
critical for its future success.
In a GPU system, any failure th at occurs during kernel execution will normally
cause a loss of kernel computation. Figure 1.1 shows a two-step checkpoint protocol
that first transfers checkpoint d ata and GPU status from the GPU to the CPU
memory and then saves the software state to either a reliable storage or a healthier
node.

5

Figure 1.1: Two-level checkpoint for a heterogeneous system

To improve the utilization of the checkpoint/restart mechanism, there are
two major concerns: reducing the costs of checkpoint and recovery processes, and
finding optimal checkpoint placements. To reduce the costs of checkpoint and recovery
processes, the two-level checkpoint/restart protocols are further discussed in Chapter
3. These protocols utilize a latency hiding strategy to reduce the memory transfer
time. Moreover, the checkpoint scheduling model for finding a sequence of optimal
checkpoint placements is presented in Chapter 5. The proposed model aims to reduce
the wasted time, which is an aggregation of checkpoint costs, recovery costs, and
recomputing time, while increasing the application performance in case of failure.

2
BACKGROUND AND RELATED WORKS
This chapter discusses the related work on a checkpoint/restart mechanism,
performance model, and application optimization for GPGPU.

2.1 C heckpoint/R estart M echanism
A checkpoint/restart mechanism is a process to improve the application re
silience. By saving the application state and considerable data in the checkpoint file,
the process can be restarted from th a t state later. Therefore, the recomputing time
- the time spent to re-execute the work due to a failure - can be reduced. There are
many existing works th at have been implemented for checkpoint/restart [5] [7] [8] [9].
BLCR [8] is a checkpoint/restart mechanism for Linux systems. It is developed
for checkpointing at the operating system-level, which allows the system preemption.
Periodic and preemptive checkpointing can be used in response to the precursors of
a possible failure.
VCCP [5] provides a transparent checkpoint/restart mechanism for virtual
machines (VMs). It uses a hypervisor-based coordinated checkpoint/restart protocol
so th at the guest OS does not have to be changed.
For GPGPU resiliency, CheCUDA [6] is a checkpoint/restart mechanism for
NVIDIA’s CUDA (Compute Unified Device Architecture).
6

However, it results in

reduction of system performance due to the checkpoint overhead, particularly, for a
large data set. This cost is mainly produced by memory transfer.
HiAL-Ckpt [10] is also a checkpoint/restart mechanism for GPGPU. However,
it is implemented based on Brook+ programming language and allows the program
mer to do checkpoints at the application level. Although its idea of the hierarchical
checkpoint is similar to our previous work [11], the overhead optimization is not
considered.

2.2 Checkpoint Scheduling and System Reliability
There have been existing checkpoint scheduling models th at aim to minimize
wasted time. Liu et al [12] have proposed a scheme to derive a sequence of checkpoint
placements that uses the theory of stochastic renewal reward process. Although their
model targets a large-scale HPC system, it can be applied to any system as long as
the system reliability is derivable.
Paun et al [13] have presented a scheduling model for incremental checkpoints.
An incremental checkpoint mechanism has been introduced to reduce the checkpoint
cost.

They have proposed a scheme to find an optimal number of incremental

checkpoints for any failure intensity function.
Since the aforementioned checkpoint scheduling models rely on a system relia
bility model, for an HPC system, Gottumukkala et al [14] have proposed a reliability
model of a system of k nodes where each individual node follows a Weibull distribution.
Their model also considers the excess life of survival nodes. The results have shown
th at their model is more accurate than previous models used in literature.

8

Thanakornworakij et al [15] have proposed a new reliability model that is an
extension of Gottumukkala’s work [14]. Their work is based on the reliability of a
system of k nodes with simultaneous failures. Their model considers the correlation
between nodes, which can make the nodes in the system fail simultaneously.

2.3 G PG PU Performance M odels
Today, as GPUs have become popular in the HPC area, there have been
existing works th at model GPGPU performance by estimating the kernel execution
time [16] [17] [18] [19].
Hong and Kim [16] have introduced two metrics, Memory Warp Parallelism
(MWP) and Computation Warp Parallelism (CWP) in order to describe the GPU par
allel architecture. This performance model aims to predict the GPGPU performance
from CPU code skeletons.
Zhang and Owens [18] have developped a quantitative performance model
based on their microbenchmarks so that they can identify bottlenecks in the program.
Nevertheless, this model does not incorporate the cache model, bank-conflict, thread
synchronization, and non-perfect pipeline in an SM.
Baghsorkhi et al [19] have presented an analytical model to predict the per
formance based on a GPGPU work flow graph. The model allows a compiler to
determine the benefit of parallelization mathematically. Moreover, the model can
identify the bottlenecks to guide the compiler through the optimization process.

Furthermore, Wong et al [20] have developped a suite of microbenchmarks to
obtain the architectural characteristics of an NVIDIAs GPU. They have also investi
gated the architectural details of the processing cores and the memory hierarchies.

2.4 Memory O ptim ization
In scientific and mathematical problem domains, a lot of application data sets
are organized and operated as arrays and matrices. For example, iterative methods
are commonly used for solving scientific problems. However, those methods require
high d ata communication, which may cause bottlenecks in parallelization [21] [22].

2.4.1 M emory O ptim ization in HPC
There have been many attem pts to optimize the cost of data communication.
D ata alignment problems are an issue th at aims to assign data and computations to
a set of virtual processors [23] [24]. David Bau et al [23] have proposed a strategy to
solve alignment problem by using elementary linear algebra. They have claimed th at
their strategy is able to achieve communication-free alignment.
In a distributed environment, such as grid computing, the data are scattered
in different locations. Chervenak et al [25] have discussed the effect of data placement
policies and workflow management systems in a grid environment. Ranganathan et
al [26] have presented a d ata scheduling framework and data movement operations
th a t address resource utilization, response time, resource locality, etc.
However, those strategies may not be able to directly apply to a co-processor
environment such as GPGPU. In GPGPU, the data are resided in a memory unit

10

called global memory. There have been many studies th at aim to reduce the global
memory access time.
Yang et al [27] have introduced an optimization compiler for GPGPU. Their
compiler analyzes off-chip memory access patterns and optimizes the memory accesses
through vectorization and coalescing by using shared memory. Then it analyzes data
dependencies and identifies possible data sharing across threads and thread blocks
and merges threads and/or thread blocks to improve memory reuse. After that, it
uses a d ata prefetching technique from temporary variables to overlap global memory
latency with computation.
Bader et al [28] have proposed a CUDA library for a set of d ata rearrangement
operations. They address four types of generic kernels. (1) The first type consists
of basic read/w rite routines.

They are kernels th at provide optimal read/w rite

accesses from the global memory by allowing for data transfer as per common access
patterns. (2) The second type consists of data reordering routines, which are kernels
for rearranging TV-dimensional array into M-dimensional array, where M < N , by
employing an offset/striding approach and shared memory. (3) The third are interlace
and de-Interlace kernels. An interlace kernel combines multiple data-sets to form a
single data-set. A de-interlace kernel splits a single data-set into multiple smaller data
sets. (4) Forth, a generic stencil computation kernel provides a generic and optimal
framework for stencil computations, where each point in a 2D grid is updated with
weighted contributions from its neighbors (e.g. in PD E solvers).
The techniques proposed in Chapter 6 aim to achieve an optimization at the
compile-time under a condition th at the data size is predetermined. In the case that

11

the data size is unknown before the run-time, matrix transpose/perm utation during
the compile-time is impossible.

Hence, the proposed techniques must be applied

during run-time with a condition of the trade-off between optimization overhead
versus performance gain, i.e. the overall uncoalesced memory access time to such
a data set is longer than the time for matrix transformation.

2.4.2 M emory Optimization in G PG PU
Since GPGPU performance has become crucial, especially for big-data com
puting, the techniques to minimize memory latency due to memory access at any
level are essential in GPGPU computing.

2.4.2.1

G P U -C P U Memory Latency Hiding

Masuhara et al [29] have described that the latency hiding is a technique to
eliminate time to wait for the remote message by overlapping local computation and
remote communication. They show two versions of the sample function, which are
different in the number of requests th at are sent in advance of the actual use of the
data. One only requests for the element th at is used in the next iteration. Another
one requests for all the elements before the computation. This paper also shows how
the mechanisms are implemented.
NVIDIA’s CUDA has introduced a latency hiding technique, called stream.
Since the GPU is idle during the memory transfer, this technique allows overlapping
between kernel execution and memory transfer [3]. The details of CUDA stream are
discussed in Section 3.1.

12

2.4.2.2

G P G P U Memory Hierarchy

When a part of the program th at runs on the GPU is invoked, a collection of
threads called “grid” is created. There are a specific number of thread blocks in a grid,
and each block has a specific number of threads. The threads in a block are managed
in a group of threads call a warp. Threads in a warp will dispatch an instruction and
access the memory simultaneously. Furthermore, there are multiple memory spaces
in GPGPU as illustrated in Figure 2.1.

Host Memory

Device Global Memory
v — ... ~ 7

—

..............

*
Kernel 1

Kernel 0
,

.

,

Thread 0

Thread 1

ft

ft

Shared Memory
Block 0

,

-

,

Thread0

Thread 1

I F

" 0

/

Shared Memory

Block 1

, ,

|T h re a d s

Thread 1

ft

ft

...

| ;

Shared Memory

Block 0

Thread d j (Thread 1

#
|

it

—
...

Shared Memory

Block 1

GPU

Figure 2.1: Memory hierarchy in GPGPU

In Figure 2.1, global memory is accessible by all threads in the kernel, and
even across kernels. The data to be executed by the kernel have to be transfered from
the CPU (host) memory and stored in the GPU (device) global memory. Once the
threads in a warp are active, the data required by those threads will be simultaneously
read from the global memory to the register memory. Likewise, the results from the
active threads will be writen to the global memory at the same time. Besides the

registers, shared memory is visible to all threads in the same block as long as the
block is active. This memory unit acts as a cache where a bulk of d ata read from the
global memory can be temporarily stored.

CHAPTER 3
TWO-LEVEL CHECKPOINT/RESTART FOR GPGPU
In a very large scale system, a fault tolerance is critical to mitigate the
failures in a long running application. Checkpoint/restart is of popular fault tolerance
techniques.

By saving d ata and essential information e.g., software state, into a

checkpoint file, the system can be recovered from the saved software state instead of
recomputing from the beginning of the program.
This chapter presents the checkpoint/restart mechanism for GPGPU. The idea
centers around transferring checkpoint d ata first from the GPU (device) memory
to the CPU (host) memory and then from the CPU either to a reliable storage or
another healthier node when a failure occurs. The saved software state will be used for
recovery. The proposed checkpoint/restart protocol on the GPU utilizes the latency
hiding strategy in order to improve overall performance.

3.1 CUDA Streams
NVIDIAs CUDA has introduced a latency hiding technique called CUDA
stream th at enables an overlap between the memory copy and kernel execution [3].
The performance models in [4] show that the stream technique can reduce the kernel
execution times while the applications may require more data transfer between device
and host. Moreover, it suggests that the stream technique is more suitable for the
14

applications in which the d ata are independent so th at both host-to-device and deviceto-host memory transfer can be carried on without kernel interruption.
In NVIDIA’s CUDA, streams usage is illustrated by an application called
simpleStreams. The application initializes an array, each assigned a specific value.
In the experiment th at renders the results as shown in Figure 3.1, each value
in the array is initialized as an integer, for example 5. The host and the device
memories are then allocated to the array of size n , which varies from 64 thousand to
16 million integers. Then the integer and the array are synchronously copied to the
device memory. In addition, the number of streams (s) varies between 1, 2, 4, 8 , 16,
32, 64, and 128.

S tre a m E x p erim e n t
—* 80-e-+ £ 70-

-a -

.2 6 0 -

16M
8M
4M
2M
1M
512K
256K
128K
64K

50-

401

16

32

64

128

N um ber of S tream s

Figure 3.1: The results of simpleStreams application with various numbers of streams
and array sizes

16

First, the experiment synchronously performs a number of actions: it operates
the kernel, copies memory from device to host, and collects the elapsed time. The
execution time for s is then recorded as the kernel and memory copies th at are
operated asynchronously. The elapsed times are then averaged over 10 repetitions.
The simpleStreams has been implemented in NVIDIA CUDA SDK 2.0 [30]. The
hardware environment utilized in this experiment is a GeForce 8800 G T with Intel(R)
Pentium(R) D 2.80 GHz CPU.
Figure 3.1 shows th at the execution times for a large data set, e.g., 16 million
integers, are extremely high when these applications are run without streams. These
execution times get lower when running with 2, 4, and 8 streams, respectively. These
times then begin to increase again as the number of streams subsequently increases.
Thus, for the purposes of this experiment, data execution time could be reduced to
its minimum value when 8 streams are utilized. This value is suspected to depend
on the specification of the hardware. Other param eters that affect the results are the
sizes of the respective blocks and grids.

3.1.1 Performance M odel o f CUDA Streams
Even though streams can be utilized to improve performance as shown in the
experiment in the previous section, there are other factors th at affect the execution
times of the GPGPU applications. The GPU memory transfer and execution times
are also major factors in improving performance. In this section, the performance
models for CUDA streams will be introduced in order to study the effects of those
two factors.

17

Assuming th at there are two execution blocks in a grid, hereinafter and in
the following figures referred to as ”block 1” and ’’block 2 ", these two blocks will
be executed in parallel, and each block has to wait for the data from host-to-device
memory copies. Another assumption is that the time to execute operations in each
block and the time of transferring d ata to be executed are the same. The time diagram
of non-streamed execution is shown in Figure 3.2.

-* —

p a ta group I|_
p a ta group

\

Block 1
Block 2
Result group! 1
[Result group! 2
H-D memcopies

Kernel

D-H memcopies

Figure 3.2: Time diagram of non-streamed execution

In Figure 3.2, the kernel waits until all data are copied from the host to the
device memory and then operates on those data. Thus, the results are copied back
to the host again. However, each block in the kernel may process data independently.
The data can be grouped as data group 1 and data group 2, which are operated by
kernel block 1 and block 2. Then, the result group 1 and result group 2 are generated,
respectively. The total execution time, Tq . can be considered as stated in Equation
(3.1) where T hd represents the time of data transfer from the host to the device, TK
is the kernel execution time, and T dh is the time of data transfer from the device to
the host.
T g — T hd + T k + T d h ■

(3.1)

18

Due to independent data and operations in each kernel execution, which means
the operations and memory copies can be done at the same time, streams allow
memory copy to overlap with kernel execution. Therefore, the total execution time
of the program can be reduced. This strategy can be described by the time diagrams
in Figure 3.3.

pata group-'!^
p ata group X \
\ |

pata group I|_
p a ta group \
Block 1

Block 1 ^ 1

/r—

(

Block 2
Result group 1
Result group! 2

|

Block 2 |

\Jpesult group !^
[Result group 2

H-D memcopies

H-D memcopies
Kernel

Kernel
D-H memcopies

D-H memcopies

(a)

(b)
pata group \ ' r''
pata group %

*

Block I

Block 2 |

•*.— ►
;

'Result grow

!Result groui 2
H-D memcopies
Kernel
D-H memcopies

(c)
Figure 3.3: Time diagram of streamed execution: (a) asynchronous memory copies
from host to device, (b) asynchronous memory copies from device to host,
and (c) asynchronous memory copies in both directions.

According to the time diagram in Figure 3.3 (a), once the d ata group 1
transaction completes, the kernel block 1 starts operating on those d ata at the
same time th a t the data group 2 is transferred to the device memory. The time
of transferring d ata group 2 to the device memory is consequently hidden and can be

19

considered in Equation (3.2)
Tg = —

+ T k + Td h .

(3.2)

Figure 3.3 (a) describes the ability of kernel and memory copies overlapping
after each kernel block execution is completed. The data group 1 and group 2 are
sequentially copied from the host to the device. The kernel block 1 and block 2 then
operate those data, respectively. As the kernel block 1 completes its operation, the
result group 1 is copied back to the host memory at the same time as the kernel
block 2 is operating d ata group 2. The total execution tim e can be considered as in
Equation (3.3) [30]
Ta = THD+ T K + ^ S ii.
s

(3 .3 )

In these two cases, the device is still idle during the time that the d ata are
sequentially copied. Then they can be combined together for better performance as
illustrated by the diagram shown in Figure 3.3 (c). It is obvious th at the time of
transferring d ata group 2 and result group 1 are hidden. Then the total execution
time can be reduced as in Equation (3.4)
l a s . + T k + T ss..
s
s

( 3 .4 )

However, the total time of execution also depends on how complicated the
operations inthe kernels are (kernel execution time) and dependency ofoperations
and data.

Figure 3.4shows the time diagram of execution time when the

memory copies are longer than kernel executions.

time of

20

Data group LData group 1

i
i
i
i
i
i
i
i
Data group 2 L fc-l
1
fc-l
1
Block ll__ ^

f

x^Blockj 1
Blockj2
) Result group 1

ClockC
,
\
j Result group 1
1
1
Result group 2
H-D memcopies

Kernel

| D at^ g ro u p f

V tz ti

!

Result group 2|
!Kernel

Kernel
D-H memcopies

H-D memcopies

D-H memcopies

(a)

(b)

Data group 1

Data group i
,'* ■
/ | Data group !T |

Data group 2

f M
■Block! 1
\ ! ;

kloeKfi" _ !

I t 1 C'
' i
;

'Block‘2
''-{Jgesull group

j

V/

+

X

».

'Block'2
SResult gro'
{ Result group 2{

| Result group 2

RemellKemel

Kernel

D-H memcopies

H-D memcopies

1

1—fr;

kernel

D-H memcopies

H-D memcopies

(c)

(d)

Figure 3.4: Time diagram of execution time when the time of memory copies
are longer than kernel executions: (a) synchronous memory copies,
(b) synchronous memory copies from host to device, (c) asynchronous
memory copies from device to host, and (d) asynchronous memory copies
in both directions.
Compared to the asynchronous memory copies in Figure 3.4 (a), the diagrams
in Figure 3.4 (b), (c), and (d) show how the kernel execution and memory copies
overlapped in the case th a t kernel execution times are shorter than memory copies.
The total execution tim e for the cases shown in Figure 3.4 (b) and (c) can be
considered in Equation (3.5). As shown in Figure 3.4 (d), the kernel execution time
can be neglected no m atter how many streams it uses. The model in this case is
described in Equation (3.6)
Tg = T hd H— — + T dh
s
T hd + Td h -

(3.5)
(3.6)

21
The time diagram in Figure 3.5 (b), (c), and (d) shows the performance of
streams when the kernel execution times are longer than the time of memory copies
compared to the diagram in Figure 3.5 (a). The total execution times of each diagram
can be considered as in Equations (3.2), (3.3), and (3.4), respectively. In Figure 3.5
(d), if the memory copy time in a stream is much less than the kernel execution time,
it supposes th at there is only kernel execution time to be considered as the total
execution time.

Data group

Data groupU ^

Data grou

!

/EJata group* 2
Block 1

''•j.

Block 2

Jf*

Block 1

Block 2

suit grout) 1

Result group 1
Rjesult groijp 2

H-D memcopies

Kernel

D-H memcopies

H-D memcopies

j D-H memcopies ;
Kernel

(a)

(b)
Data grotr

data group! I
Qata grouf} 2

/ 1Data

Block 1

Block 2
'{Result grfmp 1

!

1-II

\!

H-D memcopies

Kernel

(c)

Blqck 1

A

Block 2
Result group

Result gr()up 2

{D-H men)copies D-H memcopies

\
;\

group 2

H-D memcopies

!_
Result grpup 2
rr
r
i<
>>
T>H merAcopies SD-H menjcopies
Kernel
1
1

(d)

Figure 3.5: Time diagram of execution time when the time of kernel executions
are longer than memory copies: (a) synchronous memory copies, (b)
synchronous memory copies from host to device, (c) asynchronous
memory copies from device to host, and (d) asynchronous memory copies
in both directions.

22

3.1.2 Benchmarks
In this section, the GPGPU matrix multiplication application is introduced in
order to demonstrate the utilization of streams on an application.
W hen two matrices A and B with dimensions of n x rn and m x p are multiplied,
the dimension of the result m atrix C is n x p. Each element of m atrix C, Cij, is the
summation of each element in row i of matrix A. a**., multiplied by each element of
column j of m atrix B, bkj, as shown in Equation (3.7).
m

(3.7)
To implement m atrix multiplication in CUDA programming, each thread in
the device computes each element of matrix C, i.e., each thread computes the summation in Cij. All elements in row i of matrix A and column j of matrix B are transferred
to the corresponding thread [3].
In this experiment, the multiplication of two squared matrices is considered.
The size of the matrices varies from 64 x 64 elements to 2400 x 2400 elements. Both
matrices are copied from the host memory to the device memory. Then the kernel
multiplies those two matrices and the result m atrix is copied back to the host memory.
This application also runs on a GeForce 8800 G T system with Intel(R) Pentium(R)
D 2.80 GHz CPU.
The operation times of m atrix multiplication recorded by CUDA Events are
shown in Table 3.1. The results suggest that the kernel times are longer as the sizes
of matrices are larger. The times of transferring two matrices from the host to the
device are also about twice th at of transferring a result matrix from the device to the

23

host. Since streams allows overlapping on the GPU, we use the values in Table 3.1
with the performance model described in Section 3.1.1 to evaluate the performance
increase.

Table 3.1: Times spent by operations of matrix multiplication, measured by CUDA
Events
M atrix
64
160
320
800
1600
2400

x
x
x
x
x
x

64
160
320
800
1600
2400

H-D memory copy
(ms)
0.16
0.24
0.64
3.28
12.79
28.63

Kernel execution
(ms)
0.05
0.15
0.83
11.76
94.45
319.35

D-H memory copy
(ms)
0.05
0.09
0.28
1.61
6.34
14.23

When the sizes of matrices are 64 x 64 and 160 x 160, the time of transferring
data from the host to the device memory is longer than the kernel execution time.
This case matches the time diagram in Figure 3.4 (a). Others match the tim e diagram
in Figure 3.5 (a). The corresponding performance model for the streams applied on
both host-to-device and device-to-host memory copies is described in Equation (3.6).
Table 3.2 shows non-stream execution time versus 8 -stream execution time.
percentage of performance increase from streams is discussed.

The

24

Table 3.2: Total execution times with non-stream and the expected total execution
time with 8-stream are calculated from the operational times on Table 3.1
Matrix
64
160
320
800
1600
2400

x
x
x
x
x
x

64
160
320
800
1600
2400

Non-stream execution
(ms)
0.26
0.48
1.76
16.65
113.58
362.21

8 -stream execution

(ms)
0.21

0.33
0.95
12.37
96.84
324.71

% of performance
increase
19.23
31.25
46.02
25.71
14.74
10.35

From Table 3.2, it can be seen th a t when the size of matrices are small, such
as 64 x 64, the performance of m atrix multiplication slightly increases. However,
when the sizes of matrices are 320 x 320, the performance of the application with 8
streams increases 46.02 percent of the application performance without stream. The
performance starts to drop as the sizes of the matrices are larger because the gaps
between memory transfer times and kernel execution times are extended.
Since the latency of data transfer between the host and the device is one
of the major factors th at impacts the performance of a GPGPU application, there
are some strategies to eliminate this factor. Streaming is one of the strategies th at
aims to reduce the latency by overlapping kernel execution with data transfer. The
proposed performance models and time diagrams show th at streams can hide kernel
execution time in the case that the application consumes more time on data transfer
and that streams can hide memory transfer time when the device is occupied by
kernel execution. However, this strategy performs well on the application when the
kernel execution time and the time of memory transfer are not too much different.
The results suggest th at streaming strategy is effective when the application data

25

are independent because it can be applied on both host-to-device and device-to-host
memory transfers.

3.2 Two-Level Checkpoint/R estart Protocols
This section describes the GPU checkpoint/restart (CPR) protocols with CUDA
stream utilization [11 ] [31]. First of all, the GPU checkpoint concept is based on the
following assumptions:
1. Kernel execution is sufficiently long such that the checkpoint will finish before
the kernel completes. Otherwise the idea would not be beneficial.
2. The data is independent such th at a transfer of data input of the next execution
block can be overlapped with an execution of the current code block.
3. There is a thread in the CPU th at handles the checkpoint process.

Figure 3.6 illustrates the streamed checkpoint protocol. The checkpoint cost
is a result of host memory allocation and device-to-host memory copy after a thread
synchronization.

W ith CUDA streams, the kernel continues executing while the

checkpoint data are copied to the host. Once the memory copy finishes, the host
dumps all the checkpoint data to a checkpoint file, which is handled by the underlying
C PR mechanism.

W hen using CUDA streams, the data set is split into chunks.

The first chunk is copied to the host before the next kernel is invoked. Then, the
subsequent chunk is copied while the kernel is executing after the first chunk transfer
completes [4].

26

CPU

GPU

Process starts

Kernel starts

H-D memory copy

'
H ostM em A llocJ

<

ThreadSynchronization/Checkpoint

. ^C heckpoint Overhead

t

Underline CPR,
mechanism

*

' ThreadSynchronization/Checkpoint
Host M emAllocJ

£

Checkpoint Overhead

Underline CPR.
mechanism

4

r

D-H memory copy

Process ends

r

Kernel completes

Figure 3.6: The checkpoint protocol for GPU streamed CPR

Figure 3.7 shows the streamed restart protocol. When a failure occurs while
the kernel is executing, the device application context, including the device memory,
is destroyed. Therefore, to restore the application, the device context has to be
recreated along with device memory, and host-to-device memory has to be copied
again. The restart process begins by reading the checkpoint file to host memory.
However, reading the checkpoint file is handled by the underlying mechanism and the
duration of this process depends on the speed of reading from the hard drive, I/O or
network storage. Those factors are not considered in the GPU recovery cost. The

27

recovery cost includes the overhead of device memory allocation and host-to-device
memory copy. CUDA streams are beneficial by starting the kernel execution while
the rest of the checkpoint d ata is being transferred. The experiments based on the
proposed C PR protocols are presented in the next section.

CPU

G PU

Jh read S y n ch ro n i 2ation()A^heckpoint
Host M emAllocJ

heckpoint O verhead

Underline

m echanism ^^^^^H
Failure O c c u rre n c e

Underline I
m ech an ism l
Device MemAlloc
iestart O verhead

Figure 3.7: The restart protocol for GPU streamed CPR

3.3 Experiments
To study the cost of checkpoint and restart processes, the behavior of the
protocols presented in the previous section is simulated and the checkpoint costs,
recovery costs, and wasted time, which is the summation of checkpoint cost, recovery
cost, and recomputing time due to a failure, are collected. The experiments are done
on various sizes of a large array addition application with three types of GPU CPR
mechanisms: non-streamed, 4-streamed, and 8-streamed CPRs.

28

Listing 3.1 illustrates the simulation based on the protocols in Figures 3.6 and
3.7. Since the checkpoint process takes place on the host, the kernel of the iterated
array addition is split into three parts: k e rn e l-1 (), k ern el_ 2 (), and k ern el_ 3 (),
which are prolonged by I, m, and n loops of iterative array addition (C = A + B),
respectively. We assume th at k e rn e l_ l() is the computation before a checkpoint,
kernel_2 () is the computation between the checkpoint and a failure, and k e rn e l-3 ()
is the computation after the failure until the application finishes. Then when a failure
occurs, it has to recompute only k ern el_ 2 () instead of restarting from the beginning.
L istin g 3.1: The pseudo code imitating GPU streamed checkpoint and restart
protocols
x / / Begin computing
2 Do host-to-device memory copy of array A and B.
3 Execute kernel-1 () with I iterations.
4 / / Checkpoint process
5 Synchronize all threads to prepare for memory copy.
6 Allocate host memory for data checkpoint, i.e., for array A, B , and C.
7 Do device-to-host memory copy of array A, B , and C.
s Execute kernel_2() with m iterations.
9 / / Failure occurrence
10 Free all device memory
n / / R estart process
12 Reallocate device memory for array A, B, and C.
13 Do host-to-device memory copy of array A , B , and C.
14 / / Recompute kernel_2()
15 Re-execute kernel_2() with m iterations.
16 Execute kernel_3() with n iterations.
17 Do device-to-host memory copy of the result array C.

29

The experiments are done on an NVIDIA GeForce GTX 295 system, which
has compute capability 1.3. The heuristics of NVIDIA’s graphic card with compute
capability 1.x indicate that the maximum number of blocks in a grid is 65535, and
the maximum number of threads in a block is 512 [3]. Then, the maximum number
of threads is 65535 x 512, or 225 — 29. For the purpose of load balancing and data
correctness while invoking the kernel with streams, we vary the size of arrays from
210 to 224. If the size of arrays reaches 225, the number of iterations in the kernel will
be influenced.
In a non-streamed case, the memory copy and the kernel execution are done
consecutively. However, in 4-streamed and 8-streamed cases, those instructions are
done simultaneously. The checkpoint cost is obtained by timing host memory alloca
tion and device-to-host memory copy. The recovery cost is obtained by timing device
memory reallocation and host-to-device memory copy. The wasted time is obtained
by the summation of those costs and the time of k ern el_ 2 () execution. The results
are shown in Figure 3.8.

30

Recovery Cost Improvement

Checkpoint Cost Improvement

o.
*m

•

s

4-stream ed over non-stream ed
8-stream ed over non-stream ed

”2v

I °

8a
E

f 7

•« •

4-stream ed over non-stream ed
8-stream ed over non-stream ed

.

o
c

i

3
I

Size of Arrays

Size of Arrays

(b)

(a)
Wasted Time Improvement
• « • 4-stream ed over non-stream ed
- e 8-stream ed over non-stream ed

"0H
)8
E

-

IQ.
E

*5
c

Size of Arrays

(c)
Figure 3.8: The percentage of performance improvement in terms of (a) checkpoint
cost, (b) recovery cost due to a failure occurrence, and (c) wasted time
due to a failure occurrence

Figure 3.8 (a) illustrates percentage of the performance improvement of streamed
CPR in terms of checkpoint cost.

When the size of arrays is less than 219, the

percentage of improvement is negative since the streamed CPR has no advantage
over non-streamed CPR. However, it becomes positive when the size of arrays is 219.
When the size of arrays is 224, 4-streamed and 8-streamed CPRs gain advantage over
non-streamed CPR with the percentages of 29.790 and 32.699, respectively.

31

Similar to the checkpoint cost, as shown in Figure 3.8 (b), the recovery cost of
4-streamed CPR is smaller than non-streamed CPR when the size of arrays is at least
219. Also, the recovery cost of 8-streamed C PR gains advantage over 4-streamed C PR
when the size of arrays is at least 221. As the size of arrays is 224, the percentages of
improvement of 4-streamed and 8-streamed CPR are 66.743 and 74.305, respectively.
This similarity is because both checkpoint and recovery costs depend mainly on the
duration of memory copy.
The wasted times due to a failure occurrence on an application with nonstreamed, 4-streamed, and 8-streamed CPRs are illustrated by Figure 3.8 (c). Since
the wasted time is the aggregate of checkpoint cost, recovery cost, and recomputing
time, the performance improvement in the aspect of wasted time is similar to the
checkpoint and recovery costs.
In this experiment, the benefits of the streamed C PR mechanism with only
one checkpoint and one failure is studied. To study the benefits of streamed CPR on
real-world applications, the simulation on a long-run application and its results will
be presented in the next section.

3.4 Simulation
In the simulation to study the benefits of streamed CPR on a long-run ap
plication, despite various checkpoint intervals, the mean-time-to-failures (MTTFs)
must be considered. Due to the fact th a t a large scale GPU cluster system, such as
ORNL’s Titan, LLNL’s Sequoia, etc. [1] is a heterogeneous system with a significant
number of GPUs, the MTTF of the system depends on the M TTF of other modules

32

or nodes in the system. In this study, the M TTFs are varied from 12 hours to 7
days with the checkpoint interval of 30 and 120 minutes. The application length is
fixed at 1000 hours. Since both 4-streamed and 8-streamed CPR have an advantage
over non-streamed CPR when the size of arrays is over 219, the simulation is done for
the array size of 224. The performance is observed in three different aspects: (1) the
percentage of total checkpoint costs compared to wasted time; (2) the percentage of
total recovery costs compared to wasted time; and (3) the percentage of wasted time
compared to the completion time, which is the summation of the application length
and the wasted time. Figure 3.9 illustrates the percentages of total checkpoint costs,
while Figures 3.10 illustrates the percentages of total recovery costs, and Figures 3.11
shows the percentages of wasted time.

Percentage of Checkpoint Costs
Compared to Wasted Time

Percentage of Checkpoint Costs
Compared to Wasted Time

CO

o

O
§

to
CN

N o n -stre a m e d
4 - s tr e a m e d
8 - s tr e a m e d

N o n -stre a m e d
4 - s tr e a m e d
8 - s tr e a m e d

©

ovJ
C
o

CO

CD

o
o
o

o

c

8I—

o

CL

d

o
o
d

12

36

96

60

MTTF (hours)

(a)

120

144

168

50

100

150

200

25 0

300

350

C heckpoint Interval (m inutes)

(b )

Figure 3.9: The percentages of total checkpoint costs compared to wasted time of the
application with non-streamed, 4-streamed, and 8-streamed CPRs, when
the size of arrays is 224 and (a) the checkpoint interval is 30 minutes, and
(b) the M TTF is 12 hours.

33

Percentage of Recovery Costs
Compared to Wasted Time

Percentage of Recovery Costs
Compared to Wasted Time
co

o
o

N o n -stre a m e d
4 - s tr e a m e d
8 - s tr e a m e d

o

N o n -stre a m e d
4 - s tr e a m e d
8 - s tr e a m e d

d

o S
COo

«

.o o

o

oo

o
o

p -H12

36

60

96

120

144

°

168

30

90

210

150

270

330

C heckpoint Interval (m inutes)

MTTF (hours)

(a)

(b)

Figure 3.10: The percentages of total recovery costs compared to wasted time of the
application with non-streamed, 4-streamed, and 8-streamed CPRs, when
the size of arrays is 224 and (a) the checkpoint interval is 30 minutes, and
(b) the M TTF is 12 hours.

Percentage of Wasted Time
Compared to Completion Time

Percentage of Wasted Time

Com pared to Completion Time

o

oco j

d

N o n -stre a m e d
4 - s tr e a m e d
8 - s tr e a m e d

N o n -stre a m e d
4 - s tr e a m e d
8 - s tr e a m e d

IS
e
p

©
CO
I
o ^
c
I o_
£
®

V)

#
o
c

5
Cl

CM

-

in -

12

36

60

96
MTTF (hours)

(a)

120

144

168

30

90

210

150

270

3 30

C heckpoint Intrval (m inutes)

(b)

Figure 3.11: The percentages of wasted time compared to completion time of the
application with non-streamed, 4-streamed, and 8-streamed CPRs, when
the size of arrays is 224 and (a) the checkpoint interval is 30 minutes, and
(b) the M TTF is 12 hours.

Figure 3.9 illustrates the percentages of total checkpoint costs compared to
wasted time when the size of arrays is 224. According to Figure 3.8 (a), when the size
of arrays is 224, the checkpoint cost of 8-streamed CPR is the smallest. As a result,
8-streamed CPR performs better than non-streamed and 4-streamed CPRs in term of
total checkpoint overheads, particularly when there are more checkpoints. In addition,
the CPRs with the checkpoint interval of 30 minutes, shown in Figure 3.9 (a), produce
larger total checkpoint costs when the M TTF increases.

Moreover, with various

checkpoint intervals, shown in Figure 3.9 (b), the total checkpoint overheads are
smaller for the larger checkpoint interval. As non-streamed C PR generates the most
checkpoint costs in both cases, the total checkpoint costs are very small compared to
the wasted time with no more than 5 percent.
Figure 3.10 illustrates the percentages of total recovery costs compared to
wasted time when the size of arrays is 224. Since the number of failures depends on
M TTF, when the M TTF increases, both recovery costs and wasted time decrease. As
a result, the graphs slightly change. T hat is between 0.006 - 0.008 percent for the
non-streamed CPR, 0.0020 - 0.0025 percent for the 4-streamed CPR, and 0.0015 0.0020 percent for the 8-streamed CPR (as shown in Figure 3.10 (a)). On the other
hand, when the M TTF is fixed at 12 hours, by varying the checkpoint intervals (as
shown in Figure 3.10 (b)), the graphs drop due to the increase of checkpoint costs,
which makes the wasted time increase. Furthermore, in both cases, streamed CPR
obviously performs better than non-streamed CPR. However, the recovery costs do
not have much effect on the performance of the application since the percentages of
total restart overheads are less than 0.01 percent of wasted time in any cases.

35

Figure 3.11 illustrates the percentage of wasted time compared to completion
time, which is the summation of wasted time and application length. When the
checkpoint interval is fixed at 30 minutes (shown in Figure 3.11 (a)), as the M TTF
increases, the percentage of wasted time tends to be smaller. However, when the
M TTF is fixed at 12 hours (shown in Figure 3.11 (b)). as the checkpoint interval
increases, the percentage of wasted time also increases because the recomputing time
increases. Nevertheless, the costs of non-streamed, 4-streamed, and 8-streamed CPUs
are relatively small compared to the recomputing time and completion time. Thus
the difference of wasted time between those three types of CPRs is insignificant.

3.5 Conclusion
Even though the checkpoint/restart mechanism can improve the application
resilience, it reduces the performance by the cost of the checkpoint process. In this
study, the GPU checkpoint/restart protocol th at aims to reduce the fault tolerance
cost is proposed. The experiments have revealed th at the streamed CPR can reduce
the checkpoint cost when the size of checkpoint data is large enough. Additionally, it
can improve the restart process by reducing the recovery cost.
The simulation has shown that the streamed CPR can reduce the cost of
the checkpoint process. However, in a long-run application, since the costs of both
checkpoint and restart processes are relatively small compared to the recomputing
time and the completion time, streamed CPR may not be beneficial on a single node
GPU system.

CHAPTER 4
PERFORMANCE MODEL FOR GPGPU
Although a GPU is considered an accelerator for a CPU, the architectures of
a GPU and a CPU are quite different [3] [32]. In order to estimate an application
completion-time, the kernel execution time on a GPU has to be estimated. Thus,
a programmer can make a decision whether the GPU is worth participating in the
computation and can further improve the heterogeneous computing application. In
addition, the performance model can also help to increase the efficiency of fault
tolerance techniques for a GPGPU application, such as checkpoint/restart mecha
nisms, especially for a large GPU cluster. One of the major problems in this area is
checkpoint scheduling. To find an optimum checkpoint placement, the time-to-failure
(TTF) of the system and the completion-time of the application must be taken into
account. The system T T F can be obtained by a failure prediction technique, while
the completion time of the application can be estimated by the performance model.
In this study, a novel performance model is proposed.

The model based

on the current state-of-the-art model by Hong and Kim [16]. However, Hong and
Kim’s model is limited by an assumption th at a latency due to every global memory
instruction is the same. Nonetheless, the empirical results reveal th a t the memory
latencies are varied depending on the data type and the type of memory access.

36

37

Figure 4.1 shows th at a global memory write operation takes much longer than a
global memory read operation in every case. Also, 4-byte word accesses take longer
than 1-byte word accesses. Thus it is important to consider various memory access
costs th a t will impact GPU performance outcome.

Latency of global memory access

to
■
■
13
■
■
■

io •o
i *
8u>
o
o
®

E

Col write 1 -b y te word
Col read 1 -b y te word
Uncol read 1 -b y te word
Col write 4 -b y te word
Col read 4 -b y te word
Uncol read 4 -b y te word

CO

CM

t-

1

2

4

8

16

32

N um ber of w arps per SM

Figure 4.1: Memory latency on GPU NVIDIA GeForce GTX 295

Since the detailed architecture of a Graphic Double D ata Rate (GDDR) mem
ory is not public, the actual memory latencies are measured. Then instead of using
fixed memory latencies like in Hong and Kim’s model, the latencies from an actual
benchmark are used in our model as described in Section 4.1.1.
Moreover, Hong et al [16] considers a latency for a shared memory instruction
to be the same as a computation instruction. In fact, a computation instruction
usually takes approximately 20 clock cycles while a shared memory takes 38 cycles
according to [20], and 40 cycles according to the experiment.

38

Due to the aforementioned reasons, our model does not rely on P T X code - an
assembly-like code th at is generated by compiling with nvcc - to obtain the number
of registers, the number of computations, and memory instructions [3] [33]. On the
other hand, those instructions can be counted from CUDA code.

4.1 Performance M odelling
In this section, the im portant parameters in the proposed performance model
are defined. Then, the model notations in a general case and in cases with synchro
nization functions are described. Furthermore, the impacts of branch divergence and
bank conflicts are also discussed in detail.

4.1.1 Parameters
In GPGPU programming, parallel instructions are executed by multiple threads.
These threads are maintained as thread blocks th at will be assigned to stream proces
sors (SMs) as illustrated in Figure 4.2(a). The thread blocks are also organized into
a grid of a kernel. Moreover, Figure 4.2(b) shows th a t threads in a block executed in
an SM are managed in a group of parallel threads called a warp.

39

G PU
S h a re d Mem

M u ltip ro cesso r

R e g isters

M u ltip ro cesso r
C o re s

M ultiprocessor
W arp S c h e d u le r

W arp S c iheduier

Mini

T h re a d

S h a re d
M em ory

R e g iste r file

Mini
,
"■j1'

„

. ' fj.

W arp

!D e v ic e M em ory (DRAM)

H o st M em ory

(a)

(b)

Figure 4.2: GPU architecture: (a) programming model; (b) warp scheduling in an
SM

Let S q be the number of blocks in a grid (so called grid size), and the number
of SMs be N s m • Then the number of blocks assigned to an SM is defined by
Nb =

N.S M

(4.1)

However, not all assigned blocks can be resided into an SM. The new blocks will
be assigned to an SM as it completes the execution of the previous blocks [3] [34]. Let
N rb be the number of blocks that can be resided into an SM, which is called resident
blocks, and N P be the number of groups of blocks assigned to a multiprocessor. Then,
Np can be defined as
NP =

N,
RB

( 4 . 2)

40

Moreover, N RB is restricted by hardware limitation, i.e., the maximum number
of resident blocks (NRB>max), the maximum number of resident warps {NRWrnax), and
the limitation of resources, such as the amount of shared memory and registers [3]
[34]. Let each block assigned to an SM have S R threads. Those threads are grouped
into warps, where each warp has Sw threads. Then N RB is defined by
N rb = min < N b , N RB>max,

NHWjnaxSw
SB

r Me
lvlb,max i
M sb

1

Rmax
R b }■

(4.3)

where M stmax and Rmax are the maximum amount of shared memory (in bytes) and
the maximum number of 32-bit registers per SM, respectively.

These values are

hardware specifications while M BB and R B are the amount of shared memory and
the number of registers required by a block, respectively. Both M sb and R B can be
estimated mathematically as described in the NVIDIA Programming Guide [3].
Furthermore, the number of resident warps is defined by
N rw —

n rbs b

Sw

(4.4)

Note th at there is a limit to the number of threads per block and to the number
of blocks per grid. T hat is, So and S B cannot exceed the hardware heuristics.
Furthermore, there are 4 additional parameters to be introduced in this section:
(1) the operation time required by the total set of computational operations, Tc; (2)
the operation tim e due to the last set of computational operations, T ql 'i (3) the
memory latency resulted from the total set of memory instructions, TM; and (4) the
memory latency lost by the last memory instruction in the kernel, TML.
From Figure 4.3, suppose that there are 2 resident blocks assigned to an SM,
each resident black has 2 warps, and there are 5 instructions to be executed by each

41

thread. Since I I , 13, and 14 are computational instructions, Tc is a summation of the
operation times due to I I , 13, and 14, where TCl is an aggregation of the operation
times due to 13, and 14. Also, TM is the memory latency resulted by 12 and 15,
while T ml is a result of memory access in 15 only.

List of Instructions
11
compute
12
memory access
13
compute
14
compute
15
memory access
(a)

^

B2W 212

B2W215

B2W1 12
^
„

11 11

B1W2I2

B1W1 12

B1W1 B1W2|

B2W1 IS

^

B1W2I5
^ ___B1W1 15

B1W1 B1W1 BfW2 B1W2
14
13
13
H

a

T

(b)

Figure 4.3: The instruction list in (a) transforms into the timeline in (b).

Since T = C x FP, where T is the operation time, C is the number of clock
cycles required by an operation, and FP is the frequency of an SM, the number of clock
cycles has to be determined in order to evaluate the operation time or the memory
latency.

The number of clock cycles due to the set of computational operations

can be analyzed and derived from the kernel code, which is detailed in the NVIDIA
Programming Guide [3].
Nonetheless, evaluating the memory latency is more complex. For a global
memory access, the global memory latency is varied depending on the size of data,
memory bandwidth, and the number of transactions required by a memory instruc
tion (memory coalescing).

According to the NVIDIA Programming Guide [3], a

global memory latency can be as high as 400 to 800 clock cycles per memory access.

42

The CUDA C Best Practices Guide [32] also states th at the latency for reading an
uncached d ata from local or global memory ranges from 400 to 600 clock cycles.
For these reasons, instead of using a fixed number of clock cycles, the operation
time from the measurement divided by Np is considered in order to estim ate the global
memory latency.

4.1.2 Performance M odel in a General Case
In this section, the performance model is described by assuming th at there is
no branch convergence in a warp. Thus, every thread in a warp executes the same
set of instructions. Then the computation times and memory latencies are the same
for each warp. Moreover, assuming that there is no synchronization instruction, the
idle time does not depend on the time that a warp waits for other warps to complete
the instructions prior to the barrier, but depends only on memory access.
Figure 4.4(a) illustrates the case that Tc is longer than T m , or N RW is large
enough to hide the idle time. The execution time of the kernel T ( K ) can be defined
as
T ( K ) = (TMl + T c N r w ) N P.

(4.5)

43

Time

B1W1 1

B1W1 stalls

| B1W1

B1W1 stalls

.

iB l

I B 1W 2

~B2Wi~[

B1W2 stalls

h?w i staHs

B2W1 stalls

B2W1 stalls

B2W2 stalls
! T_CL |

T ML

(a)
Time
i
B1W1 |

i

B1W1 stalls

1 B2W1 I

B tW t|

U2W1 stalls

B1W1 state

I 2W1 |

|

...

SIW1

B1W1 s
B2W1 stalls

B2W1 stalls

r « — Idle—

■
T_ML

! T_C L!

»»

(b)
Figure 4.4: Diagrams show the execution time when: (a) there is no idle time; (b)
there is idle time. A shaded box indicates th at the warp is being executed.
A white box indicates th a t the warp is waiting for the memory access.

However, in Figure 4.4(b), N Rw is not large enough to fill the idle time [3] [32],
Hence,
T ( K ) = (T m + Tc + T cl (N rw - 1)) N P.

(4.6)

To determine whether N RW is large enough to hide the idle time, Equation (4.5)
is subtracted from Equation (4.6) to find the idle time and set to zero. Consequently,
N rw = ^

+1. Therefore, if N RW >
’■ CL

T m ~ T ml

+ 1 , there will not be an

-CL

idle time.

4.1.3 Performance M odel in a Case with Synchronization
In this section, a model th at considers synchronization in a kernel is described.
It is an extension from the model for a general case presented in the previous section.

44

A synchronization function or __synchthread() acts as a barrier at which
threads in a block must wait until every thread hits this point before they are allowed
to execute the next instruction. However, threads from different blocks do not have to
wait [3]. In general, if there are multiple resident blocks scheduled in an SM ( N r b >
1), the compiler will try to fill the idle clock cycles with warps from the other blocks.
As a result, the impact of synchronization functions can be omitted. Nonetheless, if
there is only one resident block in an SM (N r b = 1), the amount of execution time
will be increased by the waiting time resulted by the synchronization function. The
time diagrams in Figure 4.5 describe impacts of synchronization functions.

B1W1 stalls

B1W2

B1W1 stalls

B1W2 stalls

B1W2 stalls

B1W3 stalls

B1W3 stalls

B1W4 stalls

|

Extra clock cycles

B1W4 stalls

|

B1W1 stalls
B1W2 stalls

B1W2 stalls

Extr'a clock cycles

(b)
Figure 4.5: Diagrams show the execution time when: (a) a synchronization function
causes extra time between two consecutive computational instructions;
(b) a synchronization function causes extra time between memory and
computational instructions. Again, a shaded box indicates th a t the warp
is being executed. A white box indicates th a t the warp is waiting for the
memory access.

Figure 4.5(a) is the diagram extended from Figure 4.4(a), which is the case
th at it originally does not have idle time. In this case, the synchronization function
generates an extra time, Tsyn+TM,syn>where TSyn is the operation time for executing
the synchronization function, and T M^Syn is the memory latency due to the memory
instructions before or after the synchronization function. Let the number of syn
chronization functions be Nsyn• Therefore, the extra time caused by synchronization
functions for each block is (TSyn + TM,syn) Nsyn- Then the kernel execution time
defined by Equation (4.5) can be improved as follows
T ( K ) — [Tm L + T c N rw + (Tsyn + TM,Syn) N s yn} Np.

(4.7)

Figure 4.5(b) is the diagram modified from Figure 4.4(b). It illustrates the
extra time due to a synchronization function in the case th a t it originally has idle time.
The extra time generated by the synchronization function is Tsyn + Tctsyn ( N r w — 1 ) ,
where Tc ,s yn is the time due to the set of computational instructions between the
memory instructions and the synchronization function. Therefore, the kernel execu
tion time defined by Equation (4.6) can be improved as
T ( K ) = \Tm + Tc + Tc l ( N r w - 1) + (TSyn + Tc,Syn (N RW - 1)) N Syn} N P.

(4.8)

4.1.4 Control Flow Cases
In kernels, control flow statements, such as i f , e ls e , sw itch, f o r, do, and
w hile, may cause branch divergence in a warp, meaning th at threads in a warp
execute different code paths.

Once threads in a warp hit a diverging point, the

warp sequentially executes each branch path taken by disabling threads th at are not
on th at path. W hen all paths complete, the threads re-converge back to the same

46

execution path [3] [35]. Consequently, branch divergence prolongs the execution time
by increasing the number of instructions.
L istin g 4.1: An example of pseudo code th at results in branch divergence
1 tid = threadldx.x;
2 if t i d is an even number th e n
3
operation A;
4 else
5
operation B;
6 end

For instance, the code in Listing 4.1 describes th at half of the threads in a
warp do operation A, another half do operation B. Suppose th a t there are 8 threads
in a warp, once they hit the i f statement, t l , t3, t5, and t7 have to wait until tO, t2,
t4, and t4 finish operation A. Then t l , t3, t5, and t7 will do operation B. If operations
A and B take 4 clock cycles each, the warp will take 8 clock cycles to execute the

operations in the i f - e ls e statement.
Another possible issue that may occur when using shared memory is bank
conflicts. Since shared memory in an SM is separated into a certain number of banks,
there is a possibility th a t two or more threads may access different data in the same
bank and cause threads to sequentially access that bank. Therefore, the time to
access shared memory is prolonged depending on the hardware architecture.

47

4.2 Results and Discussion
In this section, the code analysis using the example of CUDA code is described.
Additionally, naive and tiled matrix multiplications are used as benchmarks to val
idate the proposed models. The GPU th at is used in this experiment is NVIDIA
GeForce GTX 295.

4.2.1 Code Analysis
To illustrate how we obtain the parameters introduced in Section 4.1, the
kernel code of a matrix multiplication listed in Listing 4.2 is analyzed.

Suppose

th at the programmer specifies the number of blocks and the number of threads in
a block as equal to the dimension of square matrices to be multiplied (dim), which
in this example is 128. Then, S g = S b = 128. The values of other parameters are
summarized in Table 4.1. Since the shared memory is not used in this kernel, Ms,max
and M sb in Equation (4.5) are omitted.
L istin g 4.2: The kernel code of simple matrix multiplication in CUDA
1 int i = blockldx.x;
2 int j = threadldx.x;

3 int dim = blockDim.x;
4 float Cvalue = 0.0;
5 for i n t k = 0; k < d i m ; k + -I- do
e
Cvalue + = A[i*dim+k] * B[k*dim+j];
7 end
8 C[i*dim+j] = Cvalue;

48

Table 4.1: The values of experimental parameters validated in the model
Param eters
— 30

N sm

lO

o

CO

II
i-----

00

CM

ll

N RB,max
3
N r w ,max — 1024
S w — 32
Rmax = 16384
R b = 1024

min{5,8, [1024 x
32/1281, [16384/1024]}
"er?

=

II

£

N rb

N r w = [5x128/32] = 20

Tc = 14.5894
Tcl = 0.0644
T m = 121.0543
T ml = 4.9199

Descriptions
Hardware specification
Equation (4.1)
Hardware specification
Hardware specification
Hardware specification
Hardware specification
There are 10 registers required by this kernel. The
CUDA C Programming Guide [3] also shows how
to determine R b Equation (4.3)
Equation (4.2)
Equation (4.4)
Code analysis
Code analysis
Code analysis
Code analysis

From the pseudo code in Listing 4.2, one can see that each thread declares
five param eters and calculates the position of m atrix A and matrix B. Also, when the
thread executes for-loop, there are seven computations in each iteration. Then, the
thread computes the position of matrix C th at takes three more computations. Since
each computation takes 20 clock cycles, and the frequency of NVIDIA GTX 295 is
1242 MHz, Tc = (7 +128 x 7 + 3) x 20/1242 = 14.5894 *xs, and TCL = 4 x 20/1242 =
0.0644 n s.
Furthermore, in each iteration of the kernel computation, there are two global
memory accesses: one is reading elements from m atrix A, which is coalesced, while the
other is reading elements from m atrix B, which is uncoalesced. From our experiment,
a coalesced memory read takes 0.4289 fis and an uncoalesced memory read takes

49
0.4784 /xs. Finally, each thread writes a value of m atrix C to the global memory,
which takes 4.9199 /is. Therefore, TM = (0.4289 + 0.4784) x 128 + 4.9199 = 121.0543
/xs, and T m l = 4.9199 /xs.
From the experimental parameters in Table 4.1, one can determine whether
there are idle times by [(T m — T m l ) / ( T c — T c l ) J + 1 = 70, which is more than the
number of resident warps, N r w • Hence there are not enough resident warps to hide
the idle time. Thus T ( K ) = (121.0543+14.5894 + 0.0644(20-1)) x 1 - 138.8673 /is.

4.2.2 Results
We use naive and tiled matrix multiplications as benchmarks to validate our
model. The sizes of squared matrices vary from 32 x 32 to 512 x 512. The results are
shown by the graphs in Figures 4.6 and 4.7.

Kernel execution tim e o f naive matrix multiplication

o

o

_

GO

- •-

o

o
•o

Kernel execution tim e o f tiled matrix multiplication

Model

Model
Measured

Measured

.

1 1 .
8© £
<0
20
1
<D §
E §
P

CO

-

oo

_

o

-

CM

32x32

O
64x64

128x128
Matrix size

(a)

256x256

512x512

32x32

64x64

128x128

256x256

512x512

Matrix size

(b)

Figure 4.6: The kernel execution time from the performance model compared to
the measurement of: (a) naive m atrix multiplication; (b) tiled matrix
multiplication

50

Percentage of error
oo

o

00

_

-

Naive Matmul
Tiled Matmul

CTlg« <o
c
o>
o _
£
o

_

o 32x32

64x64

128x128

512x512

Matrix size

Figure 4.7: Percentage of error of the performance model compared to the measure
ment for both benchmarks

Figure 4.6(a) illustrates the kernel execution time of the naive matrix multipli
cation estimated by the proposed performance model compared to the measurement.
For the m atrix size of 32 x 32, the execution time from the model is only 0.005
millisecond different from the measurement, and 0.015 millisecond different for the
m atrix size of 64 x 64, even though the graph in Figure 4.7 indicates th at the
percentage of error is relatively high. This is because the execution times for both
cases are very small. For the matrix size o f5 1 2 x 5 1 2 , the execution time from the
model is 1.3 ms different from the measurement, or approximately 16% error.
Figure 4.6(b) illustrates the kernel execution time of the tiled m atrix multipli
cation estimated by the model compared to the measurement. Again, the analytical
execution time from the model differs from the measurement by 0.002 millisecond for
the m atrix size of 32 x 32, and 0.003 milliseconds for the matrix size of 64 x 64. For

the matrix size of 512 x 512, it differs from the measurement by 0.175 millisecond, or
approximately 11% error.
Both Figures 4.6(a) and 4.6(b) also indicate th a t the execution time from the
model follows the measurement will. Figure 4.7 shows that, for both benchmarks, our
model is more accurate when the size of the m atrix is practically large enough.
In conclusion, the proposed performance model shows improvements of the
kernel execution time estimation for a GPGPU application, which considers memory
access types and the impacts of other important factors such as synchronization
functions, branch divergence, and bank conflicts. It has also suggested th at the global
memory latency is varied depending on the type of memory access and the data type.
The results have also indicated th a t our performance model is more accurate when
the data size is practically large enough.

CHAPTER 5
GPU APPLICATION PERFORMANCE VS
CHECKPOINTS
Since a checkpoint restart mechanism can sustain application execution and
performance by reducing the recomputing time when failures occur [11], [31], the
purpose of this work is to optimize fault tolerance costs, while balancing the cost
factors and the application performance. We propose mathematical models to address
the following:
1. An optimal number of nodes for maximum system reliability and the desired
application performance.
2. Near optimal checkpoint placements that mitigate system failures but still
maintain the desired performance level.

In previous work [11], [31], a checkpoint/restart protocol that aims to reduce
the GPU checkpoint cost and rollback using a latency hiding strategy has been
proposed. Additionally, it has been argued, [31], th at only cost reduction is not
enough to minimize the wasted time due to a system failure. This is because the
wasted time is dominated by the recomputing time.
There are existing checkpoint scheduling models th at aim to minimize wasted
time. Liu et al [12] proposed a scheme to derive a sequence of checkpoint placements

52

53

th a t uses the theory of a stochastic renewal reward process. Although their model
targets a large-scale HPC system, it can be applied to any system as long as the
system reliability is derivable.
The study here is based on the reliability models presented by Gottumukkala
and Thanakornworakij [14] [15]. However, the fc-node system is extended such th at
each node has a co-processor, for instance, a GPU. In addition, the proposed check
point scheduling model is an enhancement of Liu et al [12] and Niehamon et al [13]’s
models in the sense th at the system is a heterogeneous system and the model relies
on the two-step checkpoint/restart protocol.

5.1 Application Performance and System Reliability
Since the system to be considered is a GPU cluster, the major factor th a t
impacts the performance and reliability of the system is the number of nodes, denoted
by k. Theoretically, when k increases the application performance increases, but the
reliability decreases. Thus, the performance improvement can be determined as
PT =

=

Ts

1

-

Ts ’

( 5 . 1)
v '

where Ts is the completion time on a single-node, and Tp is the completion time on
a k-node system. For single kernel GPU programming, T s is the sum of the kernel
execution time (Tk), host-to-device memory copy ( 7 h d ), device-to-host memory copy
(Td h

), the execution time on the host ( T c ) , and the overhead caused by parallelization,

such as data communication between nodes (TPo ) [36].

For a node-wise GPU environment, the sequential completion time is defined
as
T s = T k + T hd + Tdh + Tc .

(5.2)

Assuming th a t the computation is data parallelizable with load balancing
(the data to be executed are equally distributed across the compute nodes), T k is
completely parallelizable; i.e., it is divisible by k. Yet, the data transfer time between
the CPU and the GPU on each node depends on the amount of data the node receives.
Therefore, only parts of TpD and TDH can be reduced by the distribution. Moreover,
Tc can also be partially divisible by k. Let f HD, f dh , and f c be the fractions of THd ,
T d h , and Tc th at can be distributed, respectively. Amdahl’s law [37] suggests th at
Tp

>

- (Tk + J hd T hd + J dh T dh + f c T c )
+ (1 —fHD)TnD + (1 —} dh )Tdh + (1 ~ f c ) T c + TPo ■

(5.3)

To describe the system reliability, the time-to-failure (TTF) for each node is
assumed to follow a Weibull distribution. For simplicity and without loss of generality,
the failure rates are assumed to be equal (A = Aj = A2 = A3 = ... = \ k). Also, once a
node fails, the entire computation is assumed to fail. Therefore, the probability th at
the system will survive beyond time t is expressed as
S(t) = e~kXiC,

(5.4)

where t is measured from the moment th a t the system starts until the application is
completed, and c is the shape param eter of the Weibull distribution for each node

55

For the purpose of estimating Tp and S(t), where t = Tp, the proportions
of T k , T h d , T d h , T c , and TPo to Ts are assumed to be 0.75, 0.02, 0.01, 0.20, and
0.02, respectively. However, these numbers are just an example to demonstrate the
applicability of the model. The parallelizable fractions are fuD = 1, foH = 1, and
f c — 0.90. Moreover, Ts is scaled to an extended period of time (e.g., 15 days) in
order to study the im pact of the system reliability.
To determine if an application has acceptable performance and reliability,
the graphs of the application performance against the system reliability are plotted.
Figure 5.1 shows the probability of survival for different values of the Weibull shape
parameter, c, the number of nodes, k, and the failure rates, A. The y-axis represents
the survival probability of the system. The x-axis represents the number of nodes in
a logarithmic scale. The percentage of performance improvement for each value of k
is presented in parentheses beneath log10 k.
In Figure 5.1, the study shows th at the system reliability, S(t), can increase
with a growth in k before it ultimately decreases. This can be interpreted as being
due to the fact th a t Tp depends on k. The completion time Tp decreases with an
increase in k. Moreover, as c becomes larger, the system tends to be less reliable.
There is evidence, [14] and [15], that time-to-failure follows a Weibull distribution. It
is known that for large c values, the Weibull becomes more symmetric, approximating
a normal distribution. Hence, in this work, we consider the values of c between 1 and
2, where the Weibull failure rate decreases with time (c > 1) and c is not large enough
for the Weibull to approximate normal. Additionally, we consider the system from
the start to the first failure.

56

c - 1 .1

Performance
Lambda = 1/(1000 years)
Lambda = 1/(100 years)
Lambda 9 0.00005
Lambda 9 0.0001

Performance
Lambda = 1/(1000 years)
Lambda = 1/(100 years)
Lambda = 0.00005
Lambda = 0.0001

0 0.8

-

1 0.6 -

1 * 5 0 .4 -

2 o 0.4 -

•0 .2 -

. 0.2 -

0 .0

0.0

-

0

1

-

0

2 3.5327543789925

1

2 3.5327543789925
Iog10 (k)

Iog10 (k)

(a)

(b)
c ■ 1.5

c « 1 .3

Performance
Lambda = 1/(1000 years)
Lambda 9 1/(100 years)
Lambda = 0.00005
Lambda 9 0.0001

Performance
Lambda 9 1/(1000 years)
Lambda 9 1/(100 years)
Lambda 9 0 00005
Lambda 9 0.0001

0 0.1

2 *6 0.4 -

*5 0.4 02
0.0

0

1

0.2

-

0 .0

-

0

2 3.5327543789925

1

log 10 (k)

(c)

(d)
c * 2 .(

c» 1 .T

Performance
Lambda 9 1/(1000 years)
Lambda = 1/(100 years)
Lambda 9 0.00005
Lambda 9 0.0001

Performance
Lambda 9 1/(1000 years)
Lambda = 1/(100 years)
Lambda 9 0.00005
Lambda = 0.0001

0.8

0 0.1

•

ts.
xj *5 0.4 -

2 *5 0.4 -

•0.2 -

. 0 .2 -

0 .0

-

2 3.5327543789925
Iog10 (k)

0

0.0
1

2 3.5327543789925
loglO (k)

(e)

0

1

2 3.5327543789925
togiO (k)

(f)

Figure 5.1: Probability of survival for different values of c and k from Equation (5.4)

57

The graphs in Figure 5.1 also show the system reliability of the given applica
tion for different values of A. Since the system reliability decreases and application
performance increases with a growth in the number of nodes, one can identify a
number of nodes k th a t has acceptable levels of reliability and performance. For
instance, from Figure 5.1, one can choose, for a given curve, a value for k that would
give an acceptable reliability and performance. For example, it is seen from the curve
in Figure 5.1 (a), for A = l/1000yrs that one can choose a k value with performance
close to 96 percent th at has a reliability close to 1.

5.1.1 Predicting the System Size from th e Maximum System Reliability
Let p be the parallelizable part of the program, defined by
P = T k + / h d T hd + J d h T dh + f c T c ,

(5.5)

and s be the sequential (non-parallelizable) part of the program, defined by
s = (l-

Ihd)

Th d + (1 - f DH) Td h + (1 - f c ) T c + T P 0 .

(5.6)

Therefore, by Amdalh’s law [37], the parallel application completion time can
be expressed as
Tp = 1 + s -

(5-7)

To find the optimum k that maximizes S(Tp,k), one must solve the following
equation
r\

— S{TP, k) = -A { k c T P
c lT'P + 7 » e - * A7> = 0,
OK

(k cT'p + TP) T c~l = 0.

58

Because T'p = —p / k 2,
fcCr P + r P = - ^

+ ( | + s) = o .

Hence,
k=

(5.8)

Since k must be at least 1, Equation (5.8) indicates that, for the system
reliability with a Weibull-distributed failure intensity, c must be larger than 1. In
addition, the proportion of p over s must also be at least l / ( c —1). Furthermore, one
can predict the improvement in application performance from Equations (5.1) and
(5.3).

5.1.2 Predicting the System Size from an Expected Performance
In the previous section, we have proposed a model to determine the optimum
k th at gives the maximum system reliability. However, typical programmers focus
on the performance improvement without considering the system reliability when
deploying parallel programming on a large scale system. In this section, a model that
identifies the optimal k value for a desired performance level is proposed.
The application problems are categorized into two cases: non-scalable and
scalable workload. Since Am dahl’s law is used for explaining the speed improvement
when the execution time is related to the problem size, Amdahl’s law is applied to
the former case. For the latter case, Gustafson’s law is used because Gustafson’s law
states th at for a scalable problem that maintains a fixed execution time, the speedup
is a linear function of system size.

59

5.1.2.1

Non-scalable Workload

For a fixed problem size, the workload does not grow when more nodes are
deployed. Therefore, the completion time decreases as the number of nodes increases.
According to Amdahl’s law [37], Ts = p + s — Tpo, and TP = p / k + s.

Since

Pr = 1 — Tp/Ts, Tp = Ts( 1 —Pr )• As a result,

where Ts( 1 —Pr) — s > 0. Hence, Pr < 1 - s/Ts-

5.1.2.2

Scalable Workload

For a scalable problem, more nodes are added to the system as the problem
size increases. Gustafson’s [38], [39] suggests th at if the workload is scalable according
to the number of nodes, Tp is fixed, while Ts is scaled by k. Let Tp — T k + T hd +
TDh + T c + Tp0 = p + s. Consequently,
Ts

— k(TK

+

I h d Thd

+

I d h T d h + IcT c)

+ (1 —J h d )T hd + (1 —/d h )T d p + (1 —f c ) T c
kp + s + Tp0 ■
From Pr = 1 —T p / T s >Ts — TP/( 1 —Pr ). Thus,
(5.10)
where Pr ^ 1.
Once k and Tp are determined by the models indicated by Equations (5.9),
and (5.10), the system reliability can be predicted by Equation (5.4). However, if

the reliability is unacceptably low, meaning that the survival probability is below
the baseline specified by the programmer, the checkpoint/restart mechanism may be
needed in order to increase the system reliability.

5.2 Checkpoint Scheduling
In a GPGPU environment, a checkpoint will have to be performed on both
GPU and CPU during the GPGPU kernel execution. Besides, a typical checkpoint is
performed only on the CPU. In addition, to maintain data correctness, a checkpoint
must not be allowed during data transfer.
To find an optimal checkpoint placement, we enhance an original optimal
checkpoint model [12], [13] for an HPC system. Let C q and Cc be checkpoint costs
from the GPU to the CPU, and from the CPU to a reliable storage on a A;-node
system, respectively. However, the GPU checkpoint latency (the time to transfer
the data from the GPU to the CPU on a /c-node system) can be estimated by the
device-to-host memory transfer time. Therefore, Cq is defined as
Ca = Og +

+ (1 _ f DH)TDH,

(5.11)

where 0% is the time spent on a checkpoint process.
Moreover, the proportion of kernel execution on k nodes is T x / k T P . The
proportion of CPU execution on k nodes is [fcTc + k ( 1
checkpoint cost for a GPGPU application becomes

-

f c ) T c ] / k T P . Thus, the

61

Furthermore, let R c be the CPU recovery cost of a k-node system. Then, the
GPU recovery cost of the k-node system is estimated by
Ra = Og + ! hd^

md

+ (1 - f „ D)Th d ,

(5.13)

where O q is the time spent on a recovery process.
Similarly, the recovery cost can be evaluated by

R = VFp [Tk R g + {Tk + f c Tc + k ( l - f c ) T c ) R c ].

(5.14)

Let n(t) be the checkpoint frequency function, such th at
ptm
1= /
n(t)dt,

(5.15)

" t m —1

where tm, (m = 1 ,2 ,3 ,...) is the m th checkpoint placement, and to = 0. Let Tj
be a random variable representing the TT F, and ip be a rollback coefficient th a t is
ranged between 0 and 1. Note that the rollback coefficient can be estimated by the
algorithm presented by Liu et al [12]. The number of checkpoints from the system
start until a failureoccurs is given by C J0T/n(t)df. Also, according to [12]and [13],
the recomputing time can be approximated by
(5.16)
Consequently, the wasted time is expressed as
W (T,) = C / n(T)dT + - £ r - + R.
Jo

(5.17)

r c G/ j

Let f (t ) be the probability density function (pdf) of the system TT F. The
expected wasted time becomes
/•oo r

E[w]- [ \ c

ft

L n{T)iT+W)+R.

(5.18)

62

Therefore, the optimal checkpoint frequency function th at minimizes the ex
pected wasted time can be expressed as
(5.19)

c ] j s(ty

From the probability of survival described by Equation (5.4), the distribution
of time-to-failure can be expressed as
f ( t ) = k \ c t cc—1 e*—k\tc

(5.20)

Therefore, from Equations (5.4), (5.19), and (5.20), we have
n

(5.21)

From Equation (5.15), it is seen th at
i

=

f\flV k x ^ d t

Jtm-i V ^
2

Ik Xcip

C

cT 1
2

i ^c+i

c+ 1V

tn
tm—1j

c

Thus, tm can be obtained by
2

tm

c+ 1

C
s±i
+
tm1
kXap

Since tQ= 0,

c+1

(5.22)

2

c+1

ctp
and
t%

c+1

c+1
C
+
k\c(p

C_
k\c<p

c+1

Therefore, by induction, we obtain

t”' = ( m —

\ J l x ^ J

■

<5-23)

T hat is, for the selected A and c, the sequence of optimal checkpoint placements
tm, m — 1 ,2 ,3 ,..., where t0 = 0, can be obtained by Equation (5.23).
The graphs th at illustrate the influence of k , as predicted by Equations (5.8),
(5.9), and (5.10), on the checkpoint interval, are shown in Figure 5.2.
From Figure 5.2 (a) (c), the results show that, as k increases, the checkpoint
frequency decreases. On the other hand, when the application performance becomes
a criterion (Figure 5.2 (d)-(i)), more nodes are needed in order to improve the
performance, resulting in lower reliability. Hence, more checkpoints are needed to
m itigate the effects of failures, leading to a smaller checkpoint interval. It can also
be seen th at the checkpoint interval decreases when A increases (an increase in A
decreases the reliability of the system). Furthermore, when comparing the graphs
horizontally, it is seen th a t as c increases, the checkpoint interval decreases because
the system becomes less reliable. Therefore, more checkpoints are needed.

64

14
—
Lambda ■ 0.001
----- Lambda - 0.0005
----- Lambda ■ 0.0002
Lambda * 0.0001

I

Lambda * 0.001
Lambda * 0.0005
Lambda = 0.0002
Lambda = 0.0001

12

4

6
\ 40-

Lambda * 0.001
Lambda ■ 0.0005
Lambda = 0.0002

1*
6

4
2

\ 201

3

5

7

9

11

13

15

17

19

21

m

(a)

(c)

(b)
k - 95
----——
—
—

5 2 .0 -

c* 1. 7, 95
Lambda®
Lambda
Lambda®
Lambda*

. 2 .0 -

0.001
0.0005
0.0002
0.0001

Lambda * 0.001

Lambda * 0.0002

§ 1 .5 -

l10'
c 0.5 -

20.0

8 0. 0 -

1 3

5

7

9 11

14

17

20

23

0.0

26

m

(e)

(d)
c*

|

1 .1

m

Lambda * 0.001
Lambda * 0.0005
Lambda * 0.0002
Lambda = 0.0001

12

%10
%8

(f)

c - 1.5, k *

, k - 20
B

B

8

Lambda * 0.0005
Lambda * 0.0002

6

8

4
4
2

2

*0

0
4

5

6

7

8

1

9

3

5

7

9

11 13 15 17 19 21

m

(g)

(h)

(i)

Figure 5.2: The time between two consecutive checkpoints when: (a)- (c) k is
predicted by the maximal reliability expressed by Equation (5.8); (d)-(f)
k is predicted by the desired performance with fixed workload expressed
by Equation (5.9); (g)-(i) k is predicted by the desired performance with
scalable workload expressed by Equation (5.10).

5.3 Performance
In Section 5.1, the mathematical models (Equations (5.8), (5.9), and (5.10))
to derive the optimal number of nodes based on different criteria have been presented.
The previous models did not consider the effect of a failure on the application per
formance. However, it is possible th at there will be failure occurrence during the

65

computation. This will impact the application performance due to the recomputing
time.

As a result, both the effect of reliability and application performance are

considered in this section.Thus, the reliability-aware

performance

model can be

denoted as follows:
p = i _

T.L f.

(5 24)

Ts + E[Tbly
where E[Tbk] is the expected recomputing time for an application running on k nodes,
and Tfci is the expected recomputing time for an application running on a single node.
These recomputing times can be expressed as
rTp
/ tf(t,k)dt
E\Tm)

= -------A
pip ? ! -------/ /(*,*:)<i d£
Jo
c(k\y/c

<5-25>

\c ’
1 _

J

k\

(5.26)

e - fcATp

and
7 |- ,A 2 ? )
y
a
---------------------------,
1 —e

cAi/c ' V c ’_

-

E[Tbi] = ^

where 7 (z, y ) is a lower incomplete gamma function, 7 (z, y) = Ifvt z~l

(5.27)

dt.

Jo

Despite the costs of checkpoint and recovery, it would be of interest to deter
mine if the checkpoint/restart mechanism can increase the application performance
by shortening the recomputing time.
Let E[Wk] and E\W\] be the expected wasted time on a k-node and a single
node systems, respectively. If the ith node is replaced at the j th restart, by substituting

66

Equations (5.20) and (5.21) into Equation (5.18), we obtain
E[Wk] =

“

C
+R
ipk X c t c~l

j ° ° C 1 ‘J ^ V k X c T ^ d r + „>

J/ o c + 1

k \ c t c *€ ~kXtc dt

y / C p k X c t ^ ~ k X c t°~ l e~kxtCdt
C<p
k X c t c- l e~kxtCdt
k X c t c~l

+
POO

+ /

R k X c f - lc~kxtCdt.

(5.28)

Jo

Let u = k X t c. Thus,
c+ 1
2c

E[Wk)

c+1

^ ^ f ( n )

+

C(P

A:Ac

c —1
2c

/°Y _ M
Jo

e u d«

e “ du

U A /

poo

+R
=

/

Jo

' du
e" “

yc
- ^ - V C y : l Ac+1
( k X ) ^ Jo

r°u(s£ +1)-1e-'tdu

kXc

Ck x ) ^

/ u 2c
o/o

e “ dw
(5.29)

Therefore,
E{ wt } = ^ . ^ r ( +
c ± + i ) + J c * -------c + 1 (A;A)2c \ 2c
y
V c (A;A)1

2c J

(5.30)

where T(z) is a gamma function.
From the property of the gamma function (r(z + 1) = zT{z)), we have

^

+ 0

= £^

r f t r ) He n c e ’

E[Wk] =

(5.31)
{k x y .

67

Similarly,

The application performance can be expressed as

Therefore, the change in performance is given by the following expression:
?

p
'

Tp + E p k )

Ts +

B [r „ ]

TP+ E[Wt]
Ts + E{W, } '

To study the impacts of checkpoints on the application performance
predicted k values,Figure

'
for the

5.3 illustrates the reliability-aware performance th at the

application will achieve during deploying or not deploying checkpoints (evaluated by
Equations (5.24) and (5.33)). Again, Figure 5.3 (a)-(c) illustrates the performance
for various values of c and A when k is predicted by Equation (5.8); Figure 5.3 (d)-(f)
shows the performance when k is predicted by Equation (5.9); and Figure 5.3 (g) (i)
presents the performance when k is predicted by Equation (5.10). It is obvious that,
on a large scale system, when deploying checkpoints, the application performance is
better than it will be without checkpoints.

68

■ Perfw/ockpt
■ Perfwithckpt

« ■ 1.1

C « 1.5

■ Perfw/ockpt
■ Perfwithckpt

■ Perfw/ockpt
■ Perfwithckpt

§5

I
0.0005

0.0005
0.0002
Lambda

0.0002

0.0005

lambda

(a)

(c)
■ Perfw/ockpt
■ Perf with ckpt

€ ■ 1.1

c ■ 1.5

■ Perfw/ockpt
■ Perfwithckpt

■ Perfw/ockpt
• Perfwithckpt

m

m
0.0005

0.0002

Lambda

0.001

0.0002

0.0005

0.0002

0.0001

0.001

u
0.0005

0.0002

0.0001

Lambda

(e)

(d)
c ■ 1.1

■ Perfw/ockpt
■ Perf with dept

c * 1.5

(f)

I
0.0005

0.0002

Lambda

(g)

■ Perfw/ockpt
■ Perfwithckpt

■ Perfw/ockpt
■ Perf with ckpt

I
0.0005
0.0002
Lambda

(h)

0.0005

0.0002

Lambda

(i)

Figure 5.3: Comparison of application performance when: (a)-(c) k is predicted by
the maximal reliability expressed by Equation (5.8); (d)-(f) k is predicted
by the desired performance with fixed workload expressed by Equation
(5.9); (g)-(i) k is predicted by the desired performance with scalable
workload expressed by Equation (5.10).

Additionally, in Figure 5.3 (a), the system can gain less performance than that
shown in other figures because, for the maximal reliability, Equation (5.8) suggests
very few number of nodes.

69

5.4

R eal-W o rld C ase S tu d y

In this simulation, the optimized matrix multiplication algorithm given by [30]
is analyzed as a case study, where the m atrix size is 223 x 223 elements. Therefore,
Ts is approximately 30 days; i.e., twice as long as the previous example. Moreover,
the proportions of TK , THD, TDH, Tc , and TPO to Ts are 0.7458, 0.0122, 0.0172,
0.2247, and 0.05, respectively. In addition,

fn o

— I d h = 1 and f c = 0.78. These

numbers are obtained by the performance model and code analysis presented in [40].
Furthermore, we assume th at both host and device memory units are very large.
To illustrate the influence of k values that are predicted by Equations (5.8),
(5.9), and (5.10) on the checkpoint interval, we plot the graphs shown in Figure 5.4.
Similar to Figure 5.2, from Figure 5.4 (a)-(c), our study shows that, as k
increases with c, the checkpoint interval decreases. Although, as shown by Figure 5.4
(d) - (i), k does not increase with c, more nodes are suggested in order to improve
the performance. As a result, the reliability decreases, and more checkpoints are
needed, which leads to smaller checkpoint intervals. Furthermore, as A is also a critical
factor in the reliability model, when A is very small; for instance, A = l/(1000yrs),
checkpoints may not be needed (as shown by Figure 5.4 (a) and (g)). On the other
hand, when A is large, the checkpoint interval is very small, indicating that the system
may take too much computation for checkpointing and may cause a performance drop.

70

Lambda = 1/1000yre
Lambda * 1/100yrs
Lambda » 0.00005
Lambda * 0.0001

—

Lambda * 1/1000yrs
Lambda * 1/100yra
—— Lambda * 0.00005
—
Lambda *0.0001

Lambda - 1/1000yrs
Lambda * 1/100yr*
Lambda * 0.00005
' Lambda * 0.0001

C:
1

2

3

4

5

6
m

7

8

9

10 11

1 5 9

(a)

I

(c)

e - 1 .5 ,k * 1651

Lambda * 1/1000ym
Lambda *1/100yra
Lambda * 0.00005
Lambda * 0.0001

—
—

i 267

1 6 12 19 26 33 40 47 54 61 68 75

(b)

e * 1 1 . k * 1651

I

14 19 24 29 34 39 44 49 54

Lambda * 1/1000yrs
—— Lambda *1/100yr*
—
Lambda * 0.00005
—
Lambda * 0.0001

1 1242 2949 4656 6363 8070 9777 11639

609 9S1 1331 1748 2165 2582

c * 1 .7 ,k * 1651
Lambda * 1/1000yra
Lambda* 1/100yre
Lambda *0.00005
Lambda * 0.0001

1 1658 4412 6966 9520 12306 15324

(e)

(d)
—— Lambda “ 1/1000yr»
Lambda *1/100yra
Lambda-0.00005
Lambda = 0.0001

I\

(f)

Lambda * 1/1000yrs
—
Lambda = 1/100yrs
----- Lambda « 0.00005
—
Lambda * 0.0001

—

Lambda » 1/1000yra
Lambda *1/100yra
Lambda * 0.00005
— - Lambda * 0.0001

V
1

3

5

7

9

11

14

17

20

23

(g)

1 7 15 24 33 42 51

(h)

1 8 17 27 37 47 57 67 77 87 97

(i)

Figure 5.4: Time between two consecutive checkpoints for the case study when: (a)(c) k is predicted by the maximal reliability expressed by Equation
(5.8); (d)-(f) k is predicted by the desired performance with fixed
workload expressed by Equation (5.9); (g)-(i) k is predicted by the desired
performance with scalable workload expressed by Equation (5.10).

To study the impacts of checkpoints on the application performance for the
predicted k values, Figure 5.5 illustrates the reliability-aware performance th a t the
application will achieve during deploying or not deploying checkpoints (evaluated by
Equations (5.24) and (5.33)).

71

■ Perfw/ockpt
■ Perf with ckpt

k- 4

■ Perfw/ockpt
■ Perf with ckpt

k- e

■ Perf w/o ckpt
■ Perf with ckpt

f'

1/100yre
0.00005
Lambda

l l l l

0.0001

1/1000yrs

(a)
■ Perf w/o ckpt
■ Perf with ckpt

k « 1651

■
1/100yrs
0.00005
Lambda

i
°

0.0001

1/1000yrs

e® 1.1
k - 10

■ Perfw/ockpt I
■ Perfwithckpt I

m

b
0.00005

Lambda

(g)

1/1000yrt

c * 1.7
k - 1651

n

m
1/100yra
0.00005
Lambda

0.0001

c * 1.5
k - 10

i
1/IOOOyr*

0.0001

i
°

1/1000yre

0.0001

■ Perfw/ockpt
■ Perfwithckpt

i

m
1/100yr»
0.00005
Lambda

0.0001

( f )
■ Perf w/o ckpt I
■ Perfwithckpt 1

u

n
1/100yr»
0.00005
Lambda

(h)

1/100yr*
0.00005
Lambda

(c)
■ Perfw/ockpt
■ Perfwithckpt

(e)

( d )

1/100yrs

°

0.0001

(b)

k > 1651

-

1/100yre
0.00005
Lambda

0.0001

c « 1.7
k - 10

■ Perf w/o ckpt
■ Perfwithckpt

m
1/100ym
0.00005
Lambda

0)

Figure 5.5: Comparison of application performance for the case study when: (a)- (c) k
is predicted by the maximal reliability expressed by Equation (5.8); (d)-(f)
k is predicted by the desired performance with fixed workload expressed
by Equation (5.9); (g)-(i) k is predicted by the desired performance with
scalable workload expressed by Equation (5.10).

Figure 5.5 (a) (c) illustrates the performance for various values of c and A,
when k is predicted by Equation (5.8). As Equation (5.8) gives k = 1 when c = 1.1,
there is no performance gain. Besides, since k increases with c, the performance in
Figure 5.5 (c) is larger than in Figure 5.5 (b). Nontheless, the checkpoint can be
advantageous only when A and c are small.

72

Figure 5.5 (d) (f) shows the performance when k is predicted by Equation
(5.9). In this case, k is large, leading to high performance but very low reliability.
The system spends too much time handling the checkpoint processes. Therefore, the
checkpoint can be beneficial only when c is small enough and A is very small.
Similarly, Figure 5.5 (g)-(i) presents the performance, when k is predicted by
Equation (5.10). Since k suggested by Equation (5.10) is not as large as th at by
Equation (5.9), the performance is lower. However, in this case, the checkpoint can
be more advantageous when A is small.

5.5 The Model w ith an Excess Weibull
In Section 5.1, the system that can function until the first failure has been
considered. In reality, however, after a failure occurs at one node, the other nodes
continue to function. As a result, after the j th restart, the time-to-failure, x tJ, of the
ith node th at did not fail has an excess life distribution [14], [15]. Hence, the failure
intensity can be described as
k \ c t c *e kXtc

if the ith node is replaced
at the j th restart,

f(t) = <
Ac

(%ij + t)c Je A£>=i[(XiJ+<)

if the ith node is not replaced
at the j th restart,

V

(5.35)

and the probability of survival can be expressed as
if the ith node is replaced at the j th restart,
if the ith node is not replaced at the j th restart.
(5.36)

73

If the ith node is not replaced at the j tUhl restart, the checkpoint frequency
function becomes
n(t) =

(5.37)

C— 1

'\

i=

1

Again, substituting Equation (5.37) into Equation (5.15), we obtain
1=

l(p\c

ft-m
rin

C

”tm—1\

(5.38)

+ *)c_l d ti= 1

Equation (5.38) can be solved numerically by varying tm and fm_ i . Moreover,
the performance of the application can be evaluated as described by Equations (5.24)
and (5.33). However, in this case, the expected recomputing time on a /e-node system,
E[Tbk], becomes
Ac /

Jo

E[Tbk) =

+

*=l
1_

(5.39)
e - x T,i=i[(xa + t)c- xij]

For a single-node system, the excess time after the j th restart is denoted by
Xj. Therefore, E[Tbi] becomes
eXxi
E[Tbl]

T p e - ^ +Tp)e~xi]

cA<

,

1 _ e -A !(^+T p)c-x5]

(5.40)

where 7 (z, y) is a lower incomplete gamma function.
In addition, The expected wasted time J5[H4] and E[IFi] becomes
E[W k) =

1/ C v? A3 c3
/•OO

/
Jo

ft

fe

y ] ( x i j + <)c_1 e_AEl=l ( l ^ +t)c / * X ]
[ i=1
Jo \ i = i

/
+ R,

+ r ) c- 1d r dt

OO

y ( x i j + t)c-1e

dt
(5.41)

By solving Equations (5.39), (5.40), (5.41), and (5.42) numerically, one can
determine, from Equation (5.34), the change in application performance due to the
checkpoint/restart mechanism.

5.6 Conclusion
In the real world, application performance and system reliability are key factors
that influence large scale HPC applications. However, as the application performance
increases with the number of nodes, the system reliability decreases. The impacts of
these two critical factors on a heterogeneous HPC system, where the failure intensity
of each node follows a Weibull distribution, are studied. Several models have been
proposed in order to determine an optimal number of nodes based on three different
criteria: maximal reliability, the desired performance of an application with a fixed
problem size, and the desired performance of an application with a scalable problem
size. In addition, a checkpoint scheduling model for a heterogeneous system with
k nodes has been presented. Moreover, we have shown th at the checkpoint/restart
mechanism based on our checkpoint scheduling model can increase the performance
of an application when failures occur as long as c and A are small.

CHAPTER 6
GPGPU OPTIMIZATION
One of the major concerns in GPGPU optimization is coalescing in global
memory. The d ata to be operated by the GPU multiprocessors are stored in the
GPU’s DRAM called global memory. Since the global memory is an off-chip memory
unit, global memory access is very expensive. In some cases [3], it may take over 400
clock cycles to read coalesced data (the data that are aligned properly in the global
memory) required by all threads in a “warp” , which is a group of threads th at execute
instructions simultaneously (See more detail in Section 6.1.). For uncoalesced data
access, it will take much longer due to multiple round-trip accesses. Consequently, un
coalesced global memory access prolongs the memory latency, resulting in a decrease
in GPGPU application performance [3].
There are several existing techniques that attem pt to reduce the latency due to
uncoalesced global memory access. A common way to solve this issue is to promote
the use of shared memory. However, shared memory is much smaller than global
memory and can lead to bank conflicts in shared memory. This happens when the
bandwidth of the shared memory is too small to serve multiple data accesses at a
time [3].

75

76

Another performance improvement technique is memory rearrangement. This
technique aims to re-align data in global memory such th at they are coalesced before
d ata access during kernel execution.
This research proposes memory rearrangement techniques to remedy uncoa
lesced global memory access patterns by using 2-dimensional matrix transpose and
3-dimensional m atrix permutation. The proposed techniques can be applied to many
common memory access patterns.

The cost benefit of these techniques will also

be discussed in detail. Moreover, the analytical results reveals that the proposed
techniques are beneficial if the transformed array/m atrix is frequently accessed.

6.1 Coalescing V S Uncoalescing
To understand the memory access patterns in the GPU, there are two termi
nologies that have to be introduced in this section: coalescing and uncoalescing.
For the most efficient data access, the data requested by threads in a warp
should be aligned properly in the same segment, which is called coalesced memory
alignment.
Figure 6.1 illustrates the coalesced and uncoalesced memory access patterns.
Figure 6.1 (a) shows the array/m atrix elements required by threads in a warp (rep
resented by light blue arrows) are aligned in the same segment 0 - 3 1 . This access
pattern is called coalesced memory access. On the other hand, Figure 6.1 (b) shows
a single stride uncoalesced memory access pattern with an offset of / and a stride
of d. That is the array/m atrix elements accessed by threads in a warp start at the

77

f th element, and each access is skipped by d elements. Listing 6.1 shows the sample
GPU code th at generates this single stride memory access pattern.

31.32

(c)
Figure 6.1: Memory access pattern categories: (a) Coalesced memory access, (b) Sin
gle stride uncoalesced memory access, and (c) Double stride uncoalesced
memory access

L istin g 6.1: GPU code th at generates a single stride memory access pattern
1 idx = blockldx.x * bloekDim.x + threadldx.x;
2 C[idx] = A[idx] + B[d * idx + f];

For a double stride uncoalesced memory access pattern, as illustrated by Figure
6.1 (c), there are two stride parameters, di and dj. Again, the array/m atrix elements
requested by threads start at the f th element, and the two consecutive accesses are
separated by dj elements. However, the next thread block will request for the set of
elements th at is di separated from the previous set of elements. The sample GPU

78

code th at generates a double stride memory access pattern is shown in Listing 6.3,
which is translated from the CPU code with double loops (two iterative parameters)
in Listing 6.2.
L istin g 6.2: CPU code with a double stride memory access pattern
1 fo r i = 0; i < CountJ; ++i do
for j = 0; j < CounLj; ++j do
C[i+j] = B[d_i*i + d_j*j];
en d
5 end

2
3
4

L istin g 6.3: GPU code th at translated from the CPU code in Listing 6.2, which
generates a double stride memory access pattern
1 xidx = blockldx.x * blockDim.x + threadldx.x;
2 yidx = blockldx.y * blockDim.y + threadldx.y;

3 C[xidx + (n_C * yidx)] = B[(d_l * xidx) + (d_2 * n_B * yidx) + f];

Since the uncoalesced memory access pattern causes the data needed by a
warp to be scattered in the memory, the warp needs multiple round trips in order to
access all the data required. Consequently, the memory latency is prolonged, which
decreases the application performance.

6.2 M em o ry R e a rra n g e m e n t
In this section, the techniques for solving uncoalesced global memory, called
memory rearrangement, are described. These techniques deploy matrix transpose
for a 2D matrix and permutation for a 3D matrix. The uncoalesced memory access
problems are categorized by the number of strides. In this research, only two problem
cases are discussed: single stride and double stride access patterns.

79

6.2.1 S in g le S trid e
From the GPU code in Listing 6.1, the elements of matrix B are read by the
stride d and the offset / . To eliminate the uncoalesced global memory access, a 2D
m atrix is constructed such th a t the width of the 2D matrix is equal to the stride d.
Note th at, for the best utilization, Count should be a multiple of a warp size, SwThe example is given by Listing 6.5, which is the GPU code translated from the CPU
code in Listing 6.4, and illustrated by Figure 6.2.
L istin g 6.4: Array access pattern with a single stride (CPU code)
1 fo r i = 0; i < Count; ++i do
2
C[i] = A[i] + B[3 * i + 5];
3 end

L is tin g 6.5: Array access pattern with a single stride (GPU code th at is a
parallel version of the CPU code in Listing 6.4)
1 idx = blockldx.x * blockDim.x + threadldx.x;
2 C[idx] = A[idx] + B[3 * idx + 5];

The CPU code in Listing 6.4 can be w ritten in parallel GPU code as shown
in Listing 6.5. The elements of matrix B are read by the stride d = 3 and the offset
/ = 5 as shown in Figure 6.2 (a). Figure 6.2 (b) shows the 2D m atrix with the width
equal to the stride d = 3. The 2D matrix is transposed, resulting in the transformed
array with the coalesced access pattern illustrated in Figure 6.2 (c).

80

oi i m 3iin

7

w m w z mMm aM*vm pr\nM
(a)

(c)
Figure 6.2: Array B transformation associated to the sample code in Listing 6.4 and
6.5 (a) The original array; (b) The 2D matrix is constructed; (c) The
transformed array.

Similarly, each of the matrices in Listing 6.6 has a stride d — 3. Therefore, the
technique illustrated in Figure 6.2 must be applied to arrays A, B, and C.
L istin g 6.6: Array access pattern with a single stride in a loop-statement
1 for i = 5; i < Count; i += 3 do
2
C[i] = A[i] + B[i];
3 end

In a specific case as shown by Listing 6.7, B is a 2D m atrix in which only
the elements in the diagonal line will be accessed. Hence, the reference address of an
element in B can be determined by r e f = B[inB + i] = B[(nB + 1 )i], where n B is the
width of m atrix B. T hat is, the stride d = n B + 1. Then the technique illustrated in
Figure 6.2 can also be applied to this case as shown in Figure 6.3.

81

Listing 6.7: An access pattern th at requires only elements on the diagonal line
of a 2D matrix_________________________________________________________
1 for i = 0; i < Count; ++i do
2
C[i] = B [i] [i];
3 end

n B+ 1

1 2 3 4
7 8 9
6
12 13 14

13 14
(a)

11
I

(b)

Figure 6.3: Transformation of the 2D matrix th at only the elements on the diagonal
line are accessed as given by Listing 6.7; (a) Original Layout, (b) Modified
m atrix with the width of n# + 1.

6.2.2 Double Strides
For this problem category, a 3D matrix structure has to be introduced in order
to further describe the double stride memory access patterns.
From a 3D matrix shown in Figure 6.4, the reference address of an element in
the m atrix can be determined by r e f = xnp + yp + z, where x is the counter th at
counts along the height (m) of the matrix, y is the counter th at counts along the
width (n) of the matrix, and k is the counter th at counts along the depth (p) of the
matrix. This implies th at the threads in a warp will access elements along fc-direction.
Given a CPU code shown in Listing 6.2 with two iterative parameters, where di is the
stride on the outer-loop and dj is the stride on the inner-loop, this problem category
can be classified into two sub-categories: when di < dj and di > dj.

82

n

m<

Figure 6.4: 3D M atrix

From the sample code shown in Listing 6.8, there are two strides di = 2 and
dj = 6, i.e., di < dj. A 3D m atrix should be constructed such th a t the depth of
the m atrix (/^-dim ension) p —

Data.Size

, where the height m = di and the width

n = dj/di. In addition, since p implies the number of elements required by threads
in a block, p should be equal to a multiple of SwL istin g 6.8: An access pattern where di < dj

1 for i = 0; i < CountJ ; ++i do
fo r j = 0; j < Count-j; ++i do
a
C[i+j] = B[2*i + 6*j];
4
end
5 end
2

Figure 6.5 shows the outcome of 3D m atrix permutation associated with the
sample CPU code in Listing 6.8. From the original layout illustrated by Figure 6.5
(a), the 3D m atrix is constructed as shown by Figure 6.5 (b) and perm uted by the
order of [2,3,1]. Normally, the dimensional order of a 3D matrix is represented by
[1,2,3], i.e., 1 represents the height, 2 represents the width, and 3 represents the depth
of the 3D matrix, respectively. In this case, the 3D m atrix is permuted with the order

83
of [2,3,1], which means th at the height of the m atrix becomes the depth, the width
becomes the height, and the depth becomes the width. Therefore, the permuted 3D
matrix is presented by Figure 6.5 (c), where the final memory layout is shown by
Figure 6.5 (d).

(a)

5

11 17 23

(b)
5 I 11 I 17123*1

1 I 7 113| 19
(d)

Figure 6.5: ID array with double strides where di < d j; (a) The original layout, (b)
The 3D m atrix with m = di, n = dj/di, (c) Permuted 3D m atrix with the
order of [2,3,1], and (d) The final layout.

For the case th at di > dj (The sample code is shown by Listing 6.9 and the
memory layout is illustrated by Figure 6.6 (a).), the 3D m atrix can be constructed
such th at m = ck/dj, n — dj, and p —

D ataSize

di

as shown in Figure 6.6 (b). Then

the perm utation can be performed with the order of [2,1,3] as shown in Figure 6.6 (c).
Hence, the outcome of the proposed technique is presented by Figure 6.6 (d).

84

L istin g 6.9: An access pattern where di > dj
1 for i = 0; i < CountJ ; ++i do
2
fo r j = 0; j < Count-j; ++j do

s
4

C[i+j] = B[8*i + 2*j];

end
5 end

(d)

Figure 6.6: ID array with double strides where d, > dj\ (a) The original layout, (b)
The 3D matrix with m — di/dj, n — dj, (c) Permuted 3D m atrix with
the order of [2,1,3], and (d) The final layout.
To describe an access pattern of a 2D m atrix as shown by the sample code in
Listing 6.10, let the dimension of m atrix C b e n e x m c, and the dimension of matrix
B be

ub

x

tub-

From the code in Listing 6.10, the index of C can be determined

by in c + j- If n c is a multiple of warp size, Sw , C is coalesced. Otherwise, C is
uncoalesced and should be reconstructed, such th at n c x m e = n'c x m'c , where n'c
is equal to a multiple of S w •

85

Listing 6.10: Access pattern of a 2D M atrix
1 for i = 0; i < CountJ; ++i do
2
for j = 0; j < Count-j; ++j do
3
C[i][j] = B[3*i][2*j];
4
end
5 end

As shown by Listing 6.10, the index of B can be determined by 3in s + 2j.
T hat is, the strides dt = 3Ub , and dj = 2. Therefore, di > dj. Thus, the technique
shown in Figure 6.6 can be applied to this case as shown in Figure 6.7. Additionally,
every warp will be coalesced if Count j is the a multiple of Sw-

1 3 5
9 11 13
17 19 21
25 27 29
33 35 37
41 43 45
49 51 53

(a)

7
15
23
31
39
47
55

(b)

Figure 6.7: 2D matrix with double strides; (a) The original layout, (b) The final
layout after 3D m atrix permutation with the order of [2,1,3].

After performing the rearragement process, the outer-loop i can be distributed
among thread blocks, while the inner-loop j is distributed among threads in a block.
In summary, the algorithm explaining the proposed technique is given by
Listing 6.11. This algorithm describes the approach to solve both single stride and
double stride global memory access patterns.

86

Listing 6.11: An algorithm of m atrix transformation to eliminate an uncoa
lesced global memory access pattern
1 Detect whether it is a single-stride or a double-strides access pattern by the
depths of the nested loop
2 if single-stride th e n
3
Detect the stride d and the offset f
4
Construct a 2D matrix, where:
1. n ~ d
5
6
2. the first position of the matrix = /
7
Transpose the 2D matrix

II

r—
H

8 end
9 else if double-strides th en
Detect the stride d* and d j, and the the offset f
10
11
if di < dj th e n
12
Construct a 3D matrix, where:
13
1. m = di
14
2. n = dj/di
15
3. the first position of the matrix = /
16
Permute the 3D m atrix with the order of [2,3,1]
17
en d
18
e lse if di > dj th en
19
Construct a 3D matrix, where:
20
21
2. n = dj
22
3. the first position of the matrix = /
Perm ute the 3D m atrix with the order of [2,1,3]
23
24
en d
25 end

6.3 Analytical M odels
To discuss the cost of m atrix transpose/perm utation in order to eliminate
uncoalesced global memory access patterns, it has to be considered whether the data
size is known (predetermined) or unknown before the GPGPU run-time.

6.3.1 Predeterm ined Data Size
In case the data size is predetermined before the run-time, the proposed
technique can be done during the compile-time on the CPU. In the case of a single

87

stride access pattern, an algorithm of the 2D matrix transpose as shown by the
algorithm in Listing 6.12 yields the complexity of 0 (n m ).
L istin g 6.12: An algorithm for 2D m atrix transposition on a CPU

1 for x = 0; x < m ; ++x do
2
for y = 0; y < n; ++y do
a
A_t[y][x] = A[x][y];
4
end
5 end

In the case of a double stride access pattern, a 3D matrix perm utation algo
rithm on a CPU with the order of [2,3,1] is given by Listing 6.13. Therefore, the
complexity of a 3D matrix permutation is 0(nm p).
L istin g 6.13: An algorithm for 3D matrix permutation on a CPU with the
order of [2,3,1]
1 for x = 0; x < m; ++x do
2
for y = 0; y < n; ++y do
3
for z = 0; z < p; ++z do
4
A_p[y][z][x] — A [x] [y] [z];
5
end
e
end
7 end

In summary, if the data size is predetermined, the complexity of the compile
time will increase by 0 (n m ) for a single stride access pattern, and 0 (n m p ) for a
double stride access pattern. It is possible that n, m, and p will multiply the m atrix
transformtion overhead, but the transformation is done only once. Therefore, the
break-even point between the performance gain from the m atrix transformation and
the overhead have to be considered.

88

6.3.2 Unknown D ata Size
If the d ata size is not predetermined before the run-time, the proposed tech
niques have to be applied during the run-time. This section describes the cost analysis
to determine the break-even point and benefit of the proposed technique.

6.3.2.1

Single Stride

For the single stride access pattern, the 2D matrix transpose is performed
to eliminate the uncoalesced global memory access. An optimized m atrix transpose
program is illustrated by Figure 6.8 and given in CUDA SDK [30].

TILE_DIM

r

^---- \

B LO CK RO W S
r*-1

2
3.

TILE DIM

m

(a)

(b)

Figure 6.8: Tiled 2D m atrix transposition where (a) is the original m atrix and (b) is
the transposed matrix

89

In t h e p e r f o r m a n c e m o d e ls in t h e p r e v io u s w o rk , w h ic h p r e s e n ts t h e m a t h e 
m a t ic a l m o d e ls fo r e s t im a t in g t h e GPU r u n - t im e w ith t h e e f fe c t s o f m a n y p a r a m e te r s
a n d c h a r a c t e r is t ic s [40], t h e k e r n e l e x e c u t io n t im e d e p e n d s o n t h e fo llo w in g fa c to r s:
71771

• T h e g r id s iz e , Sn = ------------n, w h e r e n a n d m a r e t h e w id t h a n d h e ig h t o f
6

’

6

TILE_DIM2

t h e m a tr ix , r e s p e c t iv e ly . In a d d it io n , t h e sh a r e d m e m o r y s iz e is TILE_DIM x
TILE-DIM .
• T h e b lo c k s iz e , S b = TILE_DIM x BL0CK-R0WS, w h e r e TILE_DIM is a n in t e g r a l
m u lt ip le o f BLOCK-ROWS. F u r th e r m o r e , BL0CK_R0WS is a n in te g r a l m u lt ip le o f t h e
o

mi

r

i i

i i

ii

TILE-DIM

w a r p siz e , o w . I h e r e to r e , e a c h t h r e a d h a n d l e s ------------------ e le m e n ts .
BL0CK-R0WS
• T h e m e m o r y la t e n c y in t h e k e r n e l,
_

TILE_DIM

„

«

T m = ------------------ (C oaL read + b m e m - w n te + G o a l-w n te + Sm erruread),
BLOCK-ROWS V
w h e r e CoaLread is t h e tim e t h a t a w a r p r e a d s t h e d a t a fr o m g lo b a l m e m o r y
c o a le s c e d ly , a n d CoaLwrite is t h e t im e t h a t a w a r p w r it e s t h e d a t a t o g lo b a l
m e m o r y c o a le s c e d ly . M o r e o v e r , Sm em jread is t h e t im e a w a r p r e a d s t h e d a t a
fr o m s h a r e d m e m o r y , a n d Sm em jw rite is t h e t im e a w a r p w r ite s t h e d a t a t o
sh a red m em ory.
• T h e c o m p u t in g t im e in t h e k e r n e l, T c is a fu n c tio n o f

TILE_DIM

BLOCK-ROWS

M o s tly , t h i s m a t r ix t r a n s p o s e p r o g r a m s p e n d s m o r e t im e t o m o v e d a t a in a n d
o u t o f g lo b a l a n d s h a r e d m e m o r y . T h e r e fo r e , t h e k e r n e l e x e c u t io n t im e o f t h e m a t r ix
t r a n s p o s e , Ttrans, is a f u n c t io n o f m e m o r y a c c e s s e s , TILE_DIM , BL0CK_DIM, a n d t h e
m a t r ix s iz e a s d e s c r ib e d b y E q u a tio n ( 6 .1 ) .

m
1trans

nm

CoaLread + CoaLw rite + Smerruread + Sm em jw rite

N sm N rb

TILE.DIM X BLOCK-ROWS

( 6 -1)

90

6 .3 .2 .2

Double Strides

The double stride access pattern can be eliminated by 3D m atrix permutation,
which can be considered in two cases: permuting the matrix by the order of [2,1,3] and
permuting the matrix by the order of [2,3,1]. For permuting the m atrix from [1,2,3]
to [2,1,3] the third dimension is fixed. Hence, the 2D matrix transpose can be applied
to each slice along the third dimension [41]. Therefore, Ttrans can be approximated
by Equation (6.2).
^

_
tTanS ~

map

CoaLread + CoaLwrite + Sm em jread + Sm em jw rite

N s M N Rb

TILE_DIM x BL0CK_R0WS

(a ^
'

^

'

However, for permuting the matrix from [1,2,3] to [2,3,1], the grid size can be
71

a p p r o x im a t e ly d e t e r m in e d b y S g ~ ------------- n { p + TILE_DIM — l ) ( m + TILE_DIM — 1 ).
t i l e _ d i m 2V

a

’

Thus, Ttrans can be approximated by Equation (6.3).
_

nra»(T IL E _D IM )

\

Ttrans ~ t ; — zz— ■,----------------~•(CoaL read+ C oal_w rite+ bm em j'ead+ bm em jvnte).
N s m N r b (BL0CK_R0WS) V

’

(6.3)

6.4 Results and Cost Analysis
In this section, the effects of uncoalesced memory access are studied based on
the proposed technique using sample code in Listing 6.5. They are analyzed in two
cases: when the data sized is known and unknown before the run-time. However, in
the double stride case, the outer-loop can be distributed across thread blocks, the
analytical results of single and double stride cases are not significantly different.

91

6.4.1 Predeterm ined D ata Size
Assuming that the data size is known before the run-time, the m atrix transpose
is done during compile-time and is not taken into account here. The results are shown
in Table 6.1. It is obvious th a t the kernel execution time with the original memory
access is longer than the kernel execution time with the access to the transformed
memory alignment. For instance, to access 218 elements, the original memory latency
is 0.03 ms longer than the transformed memory latency. Additionally, the original
memory access to 220 elements access takes 0.09 ms longer than the transformed
memory access. Thus, the performance gain is on average over 25 percent.

Table 6.1: The comparison between the kernel execution times based on our technique
using the sample code in Listing 6.4 with the original memory access and
transformed memory access.
Count
(elements)
64K
128K
256K
512K
1M

Original memory
latency (ms)
0.03
0.04
0.08
0.14
0.27

Transformed memory
latency (ms)
0.02
0.03
0.05
0.10
0.18

Performance gain
(%)
33.3
25.0
37.5
28.6
33.3

6.4.2 Unknown Data Size and Break-even Analysis
In the case that the data size is unknown before the run-time, the time of
matrix transpose has to be considered. Table 6.2 shows the effect of transformed
memory access using 2D m atrix transpose. It is seen that the 2D m atrix transpose
(Run-time overhead) takes longer than the time gained from transformed memory
access.

Thus, this technique will not be beneficial if the transformed memory is

92

accessed only once or twice. However, it can be beneficial if the transformed memory
is accessed frequently.
For example, to access 218 elements, the 2D m atrix transpose takes 0.091 ms,
which is over 3 times the performance gain from the transformed memory access.
Hence, the proposed technique is beneficial if the transposed matrix is accessed at
least 4 times.

Table 6.2: The comparison between the performance gain from the transformed
memory access to the time of 2D matrix transpose
Count
(elements)
64K
128K
256K
512K
1M

Performance gain
from the transformed memory access (ms)
0.01
0.01
0.03
0.04
0.09

Run-time overhead
(ms)
0.028
0.048
0.091
0.173
0.340

6.5 C onclusion
Optimization is a challenging problem. Uncoalesced memory access is one
of the issues th a t decreases the performance of GPGPU applications.

To reduce

memory latency due to uncoalesced memory access patterns, the memory rearrange
ment techniques using matrix transpose/perm utation are proposed. The patterns are
categorized by the number of strides. For a single stride memory access pattern, a
2D m atrix associated with the stride is constructed. Then 2D m atrix is transposed
to rearrange the data in global memory. For a double stride memory access pattern,
a 3D m atrix is constructed and 3D matrix perm utation is performed. The analytical
models have been discussed, considering whether the data size is predetermined. If

the data size is known before the run-time, the proposed techniques can be applied
during the compile-time. Otherwise, the proposed techniques have to be performed
at the run-time. Our analytical results have shown the performance gain and the
break-even point for the proposed techniques.
Since the proposed techniques can be the most advantageous when the data
size is predetermined, the future work is to predict the data size required by a GPU
kernel execution and detect memory uncoalescing th at may occur during the kernel
execution.

CHAPTER 7
CONCLUSIONS AND FUTURE WORK
Due to the rapid increase in computational performance required to handle
massive data sets, GPUs have been deployed in several HPC systems. GPUs can pro
vide highly parallel computation via several multithreaded processors. However, there
are a few studies th a t give an insight on both GPGPU performance and reliability.
This dissertation has presented streamed checkpoint/restart (CPR) protocols
for GPGPU th at utilize CUDA stream to reduce the checkpoint and recovery costs
in order to minimize the total wasted time. Furthermore, the study has revealed
th a t even though the costs can be reduced, streamed CPR may not be beneficial in a
long-running application since the wasted time is dominated by the recomputing time.
Consequently, a checkpoint scheduling model to minimize the recomputing tim e has
also been proposed.
In order to derive effective checkpoint scheduling models, an application run
time is required. Therefore, a performance model for predicting a GPGPU application
run-time has been proposed in this dissertation. This performance model improves the
kernel execution time estimation by considering different memory access types and
the impacts of other factors such as synchronization functions, branch divergence,
and bank conflicts. The results have shown that this performance model can achieve

94

95

higher accuracy when the d ata size is predetermined and especially, practically large
enough.
Furthermore, in this dissertation, the impacts of application performance and
system reliability on a heterogeneous HPC system have been studied. Due to the
fact th at the application performance increases with the number of nodes while the
system reliability decreases, the models to determine an optimal number of nodes
have been proposed.

These models are created based on three different criteria:

maximal reliability, the desired application performance with a fixed problem size
and a scalable problem size. In addition, the analytical results have indicated th at
the checkpoint/restart mechanism based on the proposed checkpoint scheduling model
can increase the application performance when failures occur.
Moreover, this dissertation has presented an optimization technique to reduce
memory latency caused by uncoalesced memory access patterns. This technique uses
m atrix transformation to rearrange data resided in the G PU ’s global memory. The
optimization overhead can be estimated by the proposed analytical model.

The

analytical results have shown that this technique can be beneficial when the data
size is predetermined or the transformed data are frequently accessed.
In the future, we hope that the models proposed in this dissertation will be
implemented at compiler level to achieve autom atic checkpoint and optimization tools
at compile-time.

BIBLIOGRAPHY
[1] Top 500 Supercomputing Sites, http://www.top500.org. Online: Accessed: Dec,
2012 .

[2] General-Purpose Computation on Graphics Hardware, http://gpgpu.org. Online;
accessed: Dec, 2012.
[3] NVIDIA. CUDA C Programming Guide Version 4.0, May 2011.
[4] S. Laosooksathit, A. Baggag, and C. Chandler. Stream Experiments: Toward
Latency Hiding in GPGPU. In Proceedings of the 9th IASTED International
Conference, volume 676, page 240, 2009.
[5] Hong Ong, Natthapol Saragol, Kasidit Chanchio, and Chokchai Leangsuksun.
VCCP: A Transparent, Coordinated Checkpointing System for Virtualizationbased Cluster Computing. In IEEE Cluster, 2009.
[6] Hiroyuki Takizawa, K atsuto Sato, Kazuhiko Komatsu, and Hiroaki Kobayashi.
CheCUDA: A Checkpoint/Restart Tool for CUDA Applications. In PDCAT,
pages 408-413, 2009.
[7] John W. Young. A first order approximation to the optimum checkpoint interval.
Commun. ACM, 17(9):530-531, September 1974.
[8] Paul H Hargrove and Jason C Duell. Berkeley lab checkpoint/restart (blcr) for
linux clusters. Journal of Physics: Conference Series, 46(1):494, 2006.

96

97

[9] Adam J. Ferrari, Stephen J. Chapin, and Andrew S. Grimshaw. Process intro
spection: A heterogeneous checkpoint/restart mechanism based on automatic
code modification. Technical report, Charlottesville, VA, USA, 1997.
[10] X. Xu, Y. Lin, T. Tang, and Y. Lin. HiAL-Ckpt: a hierarchical application-level
checkpointing for CPU-GPU hybrid systems. In 5th International Conference on
Computer Science and Education (ICCSE), pages 1895-1899, 2010.
[11] S. Laosooksathit, N. Naksinehaboon, C. Leangsuksan, A. Dhungana, C. Chan
dler, K. Chanchio, and A. Farbin.

Lightweight Checkpoint Mechanism and

Modeling in GPGPU Environment. Computing (HPC systems), 12:13, 2010.
[12] Yudan Liu, Raja Nassar, Chokchai Leangsuksun, Nichamon Naksinehaboon,
Mihaela Paun, and Stephen L. Scott. An optimal checkpoint/restart model for
a large scale high performance computing system. In IPDPS. IEEE, 2008.
[13] Mihaela Paun, Nichamon Naksinehaboon, R aja Nassar, Chokchai Leangsuksun,
Stephen L. Scott, and Narate Taerat.

Incremental Checkpoint Schemes for

Weibull Failure Distribution. International Journal of Foundations of Computer
Science, 21(03):329, 2010.
[14] N.R. Gottumukkala, R. Nassar, M. Paun, C.B. Leangsuksun, and S.L. Scott.
Reliability of a System of k Nodes for High Performance Computing Applications.
Reliability, IEEE Transactions, 59(1):162-169, 2010.
[15] Thanadech Thanakornworakij, Raja Nassar, Chokchai Leangsuksun, and Mi
haela Paun. Reliability Model of a System of k Nodes with Simultaneous Failures
for High Performance Computing Applications,

accpted at the International

Journal of High Performance Computing Applications, Oct 2012.

98

[16] Sunpyo Hong and Hyesoon Kim. An analytical model for a gpu architecture
with memory-level and thread-level parallelism awareness. SIG ARCH Comput.
Archit. News, 37:152-163, June 2009.
[17] Kishore Kothapalli, Rishabh Mukherjee, Suhail Rehman, Suryakant Patidar, PJ
Narayanan, and Kannan Srinathan. A performance prediction model for the
CUDA G PGPU platform. In HiPC, pages 463-472, Kochi, India, December
2009. IEEE.
[18] Yao Zhang and John D. Owens. A Quantitative Performance Analysis Model for
GPU Architectures. In Proceedings of the 17th IE E E International Symposium
on High-Performance Computer Architecture (HPCA 17), pages 382-393. IEEE,
February 2011.
[19] Sara S. Baghsorkhi, M atthieu Delahaye, Sanjay J. Patel, William D. Gropp,
and Wen-mei W. Hwu.

An Adaptive Performance Modeling Tool for GPU

Architectures, pages 105-114, 2010.
[20] Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and
Andreas Moshovos. Demystifying GPU Microarchitecture through Microbench
marking.

2010 IEEE International Symposium on Performance Analysis of

Systems & Software (ISPASS), pages 235-246, March 2010.
[21] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra,
V. Eijkhout, R. Pozo, C. Romine, and H. Van der Vorst.

Templates for the

Solution of Linear Systems: Building Blocks fo r Iterative Methods, 2nd Edition.
SIAM, Philadelphia, PA, 1994.

99

[22] James W. Demmel, Michael T. Heath, and Henk A. van der Vorst. Parallel Nu
merical Linear Algebra. Technical Report UCB/CSD-92-703, EECS Department,
University of California, Berkeley, Oct 1992.
[23] David Bau, Induprakas Kodukula, Vladimir Kotlyar, Keshav Pingali, and Paul
Stodghill. Solving alignment using elementary linear algebra. In Proceedings of
the 7th Annual Workshop on Languages and Compilers fo r Parallel Computers,
pages 46-60. Springer-Verlag, 1994.
[24] Weng-Long Chang, Jih-Woei Huang, and Chih-Ping Chu.

Using elementary

linear algebra to solve data alignment for arrays with linear or quadratic
references, volume 15, pages 28-39, Piscataway, NJ, USA, January 2004. IEEE
Press.
[25] Ann Chervenak, Ewa Deelman, Miron Livny, Mei-Hui Su, Rob Schuler, Shishir
Bharathi, Gaurang Mehta, and Karan Vahi.

D ata Placement for Scientific

Applications in Distributed Environments. In Proceedings of the 8th IE E E /A CM
International Conference on Grid Computing, GRID ’07, pages 267-274, Wash
ington, DC, USA, 2007. IEEE Computer Society.
[26] Kavitha Ranganathan and Ian Foster.

Decoupling computation and data

scheduling in distributed data-intensive applications. In Proceedings of the 11th
IEEE International Symposium on High Performance Distributed Computing,
HPDC ’02, page 352, Washington, DC, USA, 2002. IEEE Computer Society.
[27] Yi Yang, Ping Xiang, Jingfei Kong, and Huiyang Zhou. A GPGPU Compiler
for Memory Optimization and Parallelism Management. In PLDI, pages 86-97,
2010 .

100
[28] Michael

Bader,

Hans-Joachim

Bungartz,

Dheevatsa

Mudigere,

Srihari

Narasimhan, and Babu Narayanan. Fast GPGPU D ata Rearrangement Kernels
using CUDA. volume abs/1011.3583, 2010.
[29] Hidehiko Masuhara, Satoshi Matsuoka, and Akinori Yonezawa. Implementing
parallel language constructs using a reflective object-oriented language, 1996.
[30] NVIDIA, online, http://www.nvidia.com/object/cuda_home_new.html.
[31] S. Laosooksathit, N. Naksinehaboon, and C. Leangsuksan.
point/restart modeling for GPGPU.

Two-level check

In 2011 9th IE E E /A C S International

Conference on Computer Systems and Applications (AICCSA), pages 276-283.
IEEE, December 2011.
[32] NVIDIA. CUDA C Best Practices Guide, May 2011.
[33] NVIDIA. PTX: Parallel Thread Execution ISA Version 2.1, April 2010.
[34] Chapter 3 CUDA threads.
[35] Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. Dynamic
Warp Formation and Scheduling for Efficient GPU Control Flow. In Proceedings
of the 40th Annual IEEE/AC M International Symposium on Microarchitecture,
MICRO 40, pages 407-420, Washington, DC, USA, 2007. IEEE Computer
Society.
[36] Blaise Barney. Introduction to Parallel Computing. Online; Accessed: Jan, 2013.
https://com puting.llnl.gov/tutorials/parallel_comp/.
[37] Mark D. Hill and Michael R. Marty. Amdahl’s Law in the Multicore Era. In
IEEE Computer Society, pages 33-38, July 2008.
h ttp :/ / www.cs.wise.edu / multifacet / papers / ieeecomputer 08_amdahl _multicore.pdf.

101
[38] John L. Gustafson. Reevaluating Amdahl’s Law. volume 31, pages 532-533,
1988.
[39] John L. Gustafson, Gary R. Montry, Robert E. Benner, C. W. Gear, John L.
Gustafson, Gary R. Montry, Robert, and E. Benner. Development of Parallel
Methods for a 1024-Processor Hypercube.

SIAM Journal on Scientific and

Statistical Computing, 9:609-638, 1988.
[40] S. Laosooksathit and C. Leangsuksun.

A Performance Model for Predicting

Completion-time of a GPGPU Application. Technical report, 2012. Unpublished
manuscript.
[41] Lung-Sheng Chien. M atrix transpose.
http://oz.nthu.edu.tw / d947207/index_3Ddata.htm.

