Efficient Performance Evaluation for Highly Multi-threaded Graphics Processors by Sadeghi Baghsorkhi, Sara
c© 2011 Sara Sadeghi Baghsorkhi
EFFICIENT PERFORMANCE EVALUATION FOR HIGHLY
MULTI-THREADED GRAPHICS PROCESSORS
BY
SARA SADEGHI BAGHSORKHI
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2011
Urbana, Illinois
Doctoral Committee:
Professor Wen-mei W. Hwu, Chair and Director of Research
Professor William D. Gropp
Associate Professor Nacho Navarro
Professor David A. Padua
Associate Professor Sanjay J. Patel
ABSTRACT
With the emergence of highly multithreaded architectures, an effective perfor-
mance monitoring system must reflect the interaction between a large number of
concurrent events, and associate the overall effect of individual events and inef-
ficiencies to the operations in the application source code. The state-of-the-art
performance counters in highly multithreaded graphic processors currently do not
provide this level of precision. Although fine-grained sampling of performance
counters after each source-level operation could potentially achieve the desired
precision, the high frequency of sampling required will likely cause too much dis-
tortion to the actual application behavior and make the sampled counter values
inaccurate.
In this thesis, I present a novel software-based approach for monitoring the
memory hierarchy performance in highly multithreaded general-purpose graph-
ics processors. The proposed analysis is based on memory traces collected for
small snapshots of application execution. A trace-based memory hierarchy model
with a Monte-Carlo experimental methodology generates statistical bounds of per-
formance measures in the presence of non-uniform thread interleaving and data
sharing in a highly multithreaded execution environment. The statistical approach
overcomes the classical problem of disturbed execution timing due to instrumen-
tation. The approach scales well as I deploy a minimal sampling technique to
reduce the trace generation overhead and model simulation time.
The proposed scheme also keeps track of individual memory operations in the
ii
source code and can quantify the amount of their contribution to detrimental ef-
fects on memory system performance. A cross-validation of the model results
shows close agreement with the values read from the hardware performance coun-
ters on an NVIDIA Tesla C2050.
I later use the predicted memory hierarchy performance statistics in an ana-
lytical model to identify performance characteristics of a kernel and its expected
execution time. To account for the systematic error present in the predictions,
I approximate the error function and express a range of potential true execution
times for each predicted value.
iii
To my parents and my grandmother
iv
ACKNOWLEDGMENTS
I would first like to express my deepest gratitude to my adviser Professor Wen-
mei Hwu. Over the past six years, he provided me with tremendous support and
enlightened my perspective on compilers, computer architecture and applicable
research. This work would have not been possible without his inimitable expertise
and generous patience and support.
I would like to thank my doctoral dissertation committee for their constructive
comments: Professor William Gropp and Professor Sanjay Patel provided me with
valuable feedback during early stages of this work, which helped me start off in
the right direction. Professor David Padua was a great asset during the course of
my graduate studies; I learned a lot from him when I took my first courses on
compiler design and parallel programming. His expertise and technical insights
into my thesis work were extremely helpful. I thank Professor Nacho Navarro who
has always provided me with insightful feedback and tremendous encouragement.
I would like to thank Matthieu Delehaye for his contribution to the infrastruc-
ture that this work is built upon, specially the Clang front-end for parsing CUDA
applications. I specially thank my friend and colleague, Dr. Isaac Gelado for all
the stimulating discussions.
I thank Professor Marc Snir, Dr. Yuri Dotsenko and Dr. Naga Govindaraju,
from whom I have learned a lot during various stages of my doctoral studies.
Many thanks go to Ms. Marie-Pierre Lassiva-Moulin and Ms. Laurie Talking-
ton for their valuable administrative help.
v
I am grateful to my friend Behzad for his companionship and support that car-
ried me through difficult times.
I would like to dedicate this dissertation to my parents, Maryam and Abdolreza,
and my grandmother, Saltanat, to whom I am forever indebted for their perpetual
love and support.
vi
TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . xii
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Limited Performance Visibility . . . . . . . . . . . . . . . . . . . 2
1.2 Shortcomings of Sampling-based Schemes . . . . . . . . . . . . . 3
1.3 Limitations of Full System Simulation . . . . . . . . . . . . . . . 4
1.4 The Significance of Source Level Feedback . . . . . . . . . . . . 5
1.5 Sources of Uncertainty . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Contributions and Organization . . . . . . . . . . . . . . . . . . . 9
CHAPTER 2 BACKGROUND AND RELATED WORK . . . . . . . . . 11
2.1 The Monte Carlo Method . . . . . . . . . . . . . . . . . . . . . . 11
2.2 The Bayesian Method . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 The Graphics Processing Unit . . . . . . . . . . . . . . . . . . . 16
2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
CHAPTER 3 A HIGHLY MULTI-THREADED ARCHITECTURE
MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1 Throughput-oriented Computing . . . . . . . . . . . . . . . . . . 26
3.2 Hardware Thread Scheduling . . . . . . . . . . . . . . . . . . . . 27
3.3 Bulk Instruction Scheduling . . . . . . . . . . . . . . . . . . . . 27
3.4 Throughput-oriented Memory System . . . . . . . . . . . . . . . 29
3.5 The Complete Model Overview and Use-case Realization . . . . . 31
3.6 Microbenchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . 32
CHAPTER 4 THE STOCHASTIC MEMORY HIERARCHY MODEL . . 39
4.1 Spatial and Temporal Locality . . . . . . . . . . . . . . . . . . . 39
4.2 The Monte Carlo Approach . . . . . . . . . . . . . . . . . . . . . 41
CHAPTER 5 SOFTWARE FRAMEWORK . . . . . . . . . . . . . . . . 48
5.1 Instrumenting the Kernel . . . . . . . . . . . . . . . . . . . . . . 48
5.2 Recording Traces . . . . . . . . . . . . . . . . . . . . . . . . . . 52
vii
CHAPTER 6 EXPERIMENTAL EVALUATION: MEMORY HIER-
ARCHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.1 Experiments Setup . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.2 Error Quantification: Sampling Error – Precision . . . . . . . . . 67
6.3 Error Quantification: Systematic Error – Accuracy . . . . . . . . 68
6.4 Systematic Error Function Estimation . . . . . . . . . . . . . . . 69
CHAPTER 7 HIGH RESOLUTION PERFORMANCE STATISTICS . . 74
7.1 The Average Latency per each Static Load . . . . . . . . . . . . . 74
7.2 Case Study: Sparse Matrix Vector Multiplication Kernel . . . . . 76
CHAPTER 8 MEASURING KERNEL PERFORMANCE: A GLOBAL
PERSPECTIVE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
8.1 Compute and Memory Time . . . . . . . . . . . . . . . . . . . . 80
8.2 Probabilistic Merit of a Kernel . . . . . . . . . . . . . . . . . . . 90
8.3 Case Study: 7-Point Stencil . . . . . . . . . . . . . . . . . . . . . 93
CHAPTER 9 CONCLUSIONS AND FUTURE WORK . . . . . . . . . . 99
9.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
9.2 Future Work Direction . . . . . . . . . . . . . . . . . . . . . . . 100
APPENDIX A SYMBOLIC EVALUATION . . . . . . . . . . . . . . . . 102
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
viii
LIST OF TABLES
2.1 NVIDIA performance counters used to cross validate the pre-
dicted L1 and L2 caches hit ratios . . . . . . . . . . . . . . . . . 19
7.1 Dynamic load count per static load in SpMV kernel . . . . . . . . 79
8.1 Summary of predicted performance information for the Sten-
cil kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.2 Comparison of the probabilistic relative error of the stencil
kernels – reported probability for each pair of X and Y corre-
spond to P (X ≤ Y ). . . . . . . . . . . . . . . . . . . . . . . . . 98
ix
LIST OF FIGURES
2.1 Block diagram of NVIDIA’s G80 graphics processor . . . . . . . 17
3.1 A Highly Multithreaded GPU Model . . . . . . . . . . . . . . . . 30
3.2 Array of linked-lists used in microbenchmarks . . . . . . . . . . . 32
3.3 Average memory access time for different stride values with a
fixed list size and number of memory access . . . . . . . . . . . . 34
3.4 Average memory access time for different list size values with
a fixed stride and number of memory access . . . . . . . . . . . . 36
4.1 Random points mimic schedule deviation for simultaneously
scheduled thread blocks . . . . . . . . . . . . . . . . . . . . . . 43
4.2 The stochastic memory hierarchy model . . . . . . . . . . . . . . 44
4.3 Probability distribution for main memory latency in LBM . . . . . 46
4.4 Probability distribution for main memory latency in SpMV . . . . 46
5.1 Organization of the sampling buffer . . . . . . . . . . . . . . . . 49
6.1 Benchmarks suite used for experiments . . . . . . . . . . . . . . 57
6.2 Probability distribution for the L1 and L2 hit ratios - dense
matrix multiply . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3 Probability distribution for the L1 and L2 hit ratios – FFT . . . . . 60
6.4 Probability distribution for the L1 and L2 hit ratios – Black
Scholes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.5 Probability distribution for the L1 and L2 hit ratios – sparse
matrix vector multiply . . . . . . . . . . . . . . . . . . . . . . . 62
6.6 Probability distribution for the L1 and L2 hit ratios – merge sort . 63
6.7 Probability distribution for the L1 and L2 hit ratios – lattice-
Boltzman method . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.8 Probability distribution for the L1 and L2 hit ratios – stencil . . . 65
6.9 Probability distribution for the L1 and L2 hit ratios – merge
sort (global) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.10 Sampling error for L2 read hit ratio – sparse matrix vector
multiply . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.11 Prior probability density function of the relative error . . . . . . . 70
x
6.12 Probability distribution of systematic error for predicted L1
and L2 hit ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.1 Breakdown of memory latency for static loads in LBM . . . . . . 75
7.2 Breakdown of memory latency for static loads in Stencil . . . . . 76
7.3 Breakdown of memory latency for static loads in SpMV . . . . . 78
8.1 Scheduling of memory and compute cycles . . . . . . . . . . . . 85
8.2 Serialization of concurrent memory accesses . . . . . . . . . . . . 87
8.3 Scatter plot of predicted versus measured execution times . . . . . 90
8.4 Posterior probability density function of the relative error . . . . . 91
A.1 Expression simplification for Example 1. (a) Initial expres-
sion tree. (b) Distributing integer division. (c) Distribution
is carried on along the path that contains free variables; con-
straint N = k× 2J is bound to free variable N. (d) Multipli-
cation is distributed over the add operator. (e) Multiplication
and integer division cancel each other out. (f) The simplified
expression tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
xi
LIST OF ABBREVIATIONS
GPU Graphics Processing Unit
CPU Central Processing Unit
VLIW Very Long Instruction Word
ISA Instruction Set Architecture
ILP Instruction Level Parallelism
MLP Memory Level Parallelism
TLP Thread Level Parallelism
WLP Warp Level Parallelism
SIMT Single Instruction Multiple Thread
SIMD Single Instruction Multiple Data
SPMD Single Program Multiple Data
PDF Probability Distribution Function
CUDA Compute Unified Device Architecture
OpenCL Open Computing Language
IRM Independent Reference Model
LLN Law of Large Numbers
CLT Central Limit Theorem
ECC Error Correction Code
PTX Parallel Thread Execution
SSA Static Single Assignment
AST Abstract Syntax Tree
API Application Programming Interface
xii
CHAPTER 1
INTRODUCTION
Graphics processing units (GPUs) traditionally had highly specialized program-
ming models and interfaces that limited the ability of developers to easily map
general-purpose applications to these platforms. With the emergence of pro-
grammable GPUs along with Compute Unified Device Architecture (CUDA) [1]
and Open Computing Language (OpenCL) [2] programming interfaces, GPUs
made their entry into the general purpose computing community. Developers
started to have the standard programming interfaces and architectural features to
quickly port programs to a platform with a massively parallel, GPU-based co-
processor.
Introduction of the newer generations of GPUs such as Fermi [3] created an-
other inflection point in use of GPUs in general purpose and scientific computing.
Graphics processors are traditionally optimized for throughput oriented work-
loads, which allowed early GPUs to not include data caches. Newer GPU genera-
tions took a step forward by introducing a new memory architecture that supports
cache organizations to boost the performance of applications which have irregular
and indirect accesses to data; small per-core L1 caches capture inter-thread reuse,
and larger unified L2 caches exploit inter-core sharing. This is a significant step
forward to better support a more diverse set of workloads and reduce some of the
performance discrepancies that previously existed. Furthermore, supporting Error
Correction Code (ECC) through the whole memory hierarchy and accelerating
double-precision performance have enhanced the GPU programming ecosystem.
1
1.1 Limited Performance Visibility
The above improvements have made GPU computing users to take a more serious
look at the significant computational power provided by GPUs. Yet, efficiently
programming many-core GPUs remains intellectually challenging and requires a
significant amount of effort. The major challenge in programming GPUs is usu-
ally in keeping the high-bandwidth computational processing elements well uti-
lized by providing them with the required data in a timely manner. To achieve this,
it is important for the GPU applications to make effective use of the GPU mem-
ory hierarchy. Therefore, information about interaction of an application with the
memory system is necessary to optimized memory accesses and its performance.
Currently developers need to manually examine the GPU kernel source code
and identify the sources of performance degradation such as long latency loads
on the critical path of a tight loop. Much of this is done through guesswork and
trial-and-error.
Furthermore, multiple levels of concurrency and the large number of concurrent
events such as the in flight memory operations, reduce the developer’s or even the
optimizing compiler’s performance visibility. A highly multithreaded GPU acts
like a complex, nonlinear system with tightly coupled dynamics. A microarchi-
tectural event such as a memory load should be evaluated in a meaningful context
with regard to other events from the co-existing threads.
Therefore, the performance evaluation process is very time-consuming for ex-
perienced application developers and much more difficult and tedious for less-
informed programmers. With millions of units currently in use, graphics proces-
sors have become the largest deployment of massively parallel systems. As more
users experiment with graphic processors, the larger market is going to keep their
cost down. The snowballing phenomenon will in turn expand the GPU computing
2
community and necessitate development of techniques that systematically analyze
the source code in the context of a highly multithreaded execution environment to
replace the manual investigation of the code.
1.2 Shortcomings of Sampling-based Schemes
Current GPUs’ level of support for application developers, compiler designers and
system architects to investigate the behavior of source code segments that cause
performance bottlenecks has been limited. While they provide a set of perfor-
mance counters that collect raw statistics for microarchitecture events, e.g., the
overall number of misses for the L1 and the L2 caches, the counter values cannot
be monitored or sampled during application execution.
Likewise, support for time-based sampling is provided through application pro-
gramming interfaces or profiler triggers [1] that can be inserted in the code at user
level. For example, a call to the NVIDIA’s profiler API prof trigger will
increment the value of one out of the eight hardware counters enabled. While a
time-based sampling built on top of these interfaces can help identify the hotspots
in the code, a fine-grain sampling approach is not feasible because its high sam-
pling rate will likely affect the accuracy of the profile data. On the other hand,
infrequent sampling will increase the number of blind spots and reduce the preci-
sion of the measurement.
Profiling the performance of highly multithreaded applications is not confined
to recently introduced high performance accelerators. Salapura et al. [4] discuss
the significance of the feedback derived from the performance statistics in high
performance computing systems such as Blue Gene and propose a new perfor-
mance counter architecture that scales better with the number of concurrent events
in a highly multithreaded system.
3
In this thesis, I propose an alternative software-level support to capture the per-
formance statistics with respect to the highly parallel interleaved memory system
in the presence of large number of concurrent events. The proposed approach does
not require the instrumented application to have the same execution timing as the
uninsturmented version. As a result, it does not suffer from the same limitations
as the traditional profiling techniques.
Most highly multithreaded applications are memory bound and their perfor-
mance is strongly dependent on efficient use of the memory subsystem. Therefore,
the focus this thesis is on capturing the existing inefficiencies in a GPU kernel with
respect to the graphics processors memory hierarchy.
1.3 Limitations of Full System Simulation
An alternative to hardware support performance monitoring is the full system sim-
ulation. Simulators work by emulating the full processor(s) architectural features.
Depending on the accuracy of the simulation, they can be often excessively slow.
Cycle accurate simulators [5, 6, 7] reconstruct highly detailed models of system
components and accurate timing information. The downside is that they can only
emulate a few cycles of the actual processor execution per second. The slowdown
factor is even more blatant when simulating a highly multithreaded processor with
high number of in flight threads and states to keep track of. Therefore, the high
execution cost of cycle accurate simulations, particularly for fine grain highly
multithreaded systems such as graphics processors, precludes them as practical
options to collect performance monitoring information.
Functional simulators [8], on the other hand, can simulate thousands of instruc-
tions per second at the cost of losing cycle accuracy and timing information. The
direct utilization of these simulators for performance tuning of memory operations
4
in GPU application kernels will result in inaccurate performance profiles.
The idea presented in this thesis is based on use of a subset of instruction traces
including the memory addresses. But unlike functional simulators, arrival order-
ing of memory operations is reconstructed based on a stochastic execution model
of threads in the GPU. The proposed memory hierarchy simulation model runs
significantly faster than traditional simulators and provides a good approximation
to actual arrival ordering of memory operations at different levels of GPU mem-
ory hierarchy. Therefore, the proposed scheme provides capabilities for fine grain
memory efficiency analysis of the GPU application by estimating the latency for
accessing the memory and modeling all cache levels in the memory hierarchy in
a cost-effective manner.
1.4 The Significance of Source Level Feedback
As mentioned earlier, knowledge of an applications interaction with the memory
systems is crucial for tuning the performance. This information is most useful to
the application developer or the optimizing compiler when it is presented in terms
of specific memory accesses (locations) at source code level.
Limited hardware support in current graphics processors makes it difficult to
provide this level of feedback to application developers and makes it difficult to
determine which instructions cause accesses that poorly utilize the memory hier-
archy.
Some processors such as Intel’s Itanium [9] have more support for performance
monitoring. For example, hardware counters report the last memory address that
has triggered a cache miss. The Itanium processor also provides means to limit
counting of cache misses to a user-determined region of memory. In a highly
multithreaded GPU, interleaved execution of a large number of threads makes it
5
difficult to determine – via hardware performance counters – which instruction
has caused the miss. Therefore, it is not feasible to provide a programmer with
an accurate idea of which program memory operations exhibit the worst memory
efficiency behavior.
With the software-level approach for performance monitoring that I propose,
source code level feedback is provided to a programmer about memory opera-
tions and data structures that exhibit a poor memory efficiency behavior. Source
code locations of memory operations (that trigger microarchitecture events) are
propagated through the memory hierarchy simulation model. Therefore, the pro-
posed performance modeling framework is capable of producing precise and high
resolution profile statistics.
1.5 Sources of Uncertainty
When analyzing complex systems or phenomena, complete agreement between
the model assumptions and the properties of the system being analyzed is almost
infeasible. This deviation between a model and the corresponding real system – a
GPU in this work – can originate from the following factors:
1. Limited knowledge exists about the system or the phenomena being studied.
2. The model assumes a simplified representation of a system. The purpose of
this deliberate simplification that is enforced when constructing the model
is to reduce evaluation cost, e.g., the time of analysis.
In the course of this study, the first factor becomes relevant when specific de-
sign parameters of the GPU hardware and software cannot be reliably determined.
For example, the exact scheduling scheme for thread blocks of a kernel is not dis-
closed by the GPU vendors. The amount of perturbation in terms of the actual
6
instructions being executed by the GPU hardware – after modification and opti-
mization performed by the vendor’s compiler – is another source of discrepancy.
Unlike cycle accurate simulation approaches [5] discussed earlier, the proposed
modeling framework does not account for all microarchitectural details of GPU.
The choice of this source of uncertainty that belongs to the second category, was
deliberate to make the analysis time tractable; execution cost of traditional simu-
lation methods does not scale well in a highly multithreaded environment.
In this work, I devised two approaches to reduce the detrimental effect of these
uncertainties on the quality of the results generated by the performance modeling
framework:
1. I use a Monte Carlo experimental methodology – a classic statistical method
– to account for lack of knowledge and randomness in a GPU execution
environment, e.g., the scheduling and the order of memory requests arriving
at data caches.
The GPU execution model under this approach includes both determinis-
tic and stochastic components. The deterministic component accounts for
known events and properties of the GPU, while the stochastic part reflects
the uncertainty or lack of knowledge related to the unknown events or be-
haviors. For example, this statistical approach allows the proposed model-
ing framework to overcome the problem of lack of timing information, e.g.,
exact ordering of memory events.
2. I use a Bayesian approach to account for the uncertainty factors that the
discussed Monte Carlo method cannot compensate for. The Bayesian anal-
ysis assumes that all uncertainties are a result of lack of knowledge. It then
focuses on the observable quantities or a set of outputs from the model. It
then based on a set of subjective probabilities corresponding to each of the
7
observed outputs, expresses uncertainties in a system. Following the same
approach, I estimate the systematic error function for the predicted perfor-
mance statistics.
As a result of the above efforts, in spite of allowing for some level of inaccuracy,
the model serves its purpose to provide meaningful and statistically sound results.
Using statistical models and analysis to compensate for lack of information in
modeling computer systems and memory hierarchy is an established approach,
specially when dealing with large amount of input data, e.g., large traces. A
widely studied example is the Independent Reference Model (IRM), which is a
memoryless model for analyzing the behavior of the I/O page references [10]. In
the absence of accurate and detailed traces for the storage system, IRM assumes
that at each discrete point in time, exactly one page is referenced with a probabil-
ity, pi, that is derived based on the number of times that the page is observed in a
stream of sparse and less detailed traces. IRM is very simple with respect to the
fact that it does not account for the exact ordering of the page reference pattern,
but at the same time it retains the identity of a page via the probability pi.
I follow a similar approach with respect to analyzing the behavior of memory
operations in a GPU system. In the absence of exact information regarding the
ordering and scheduling of instructions, I reconstruct enough statistically correct
information to capture the overall behavior of memory operations with respect to
the memory system.
8
1.6 Contributions and Organization
The contributions of this work are:
1. I propose a stochastic model of memory hierarchy for highly multithreaded
execution environments: While major configuration parameters for the un-
derlying hardware (graphics processor) are extracted through microbench-
marking as discussed in Chapter 3, I use a Monte Carlo approach (random-
ized trials) to capture the sources of non-determinisim in the simulation
model, e.g. the relative arrival ordering of the memory requests at each
level of the memory hierarchy. This stochastic approach is presented in
Chapter 4.
The results that are presented in Chapter 6 confirm that the proposed ap-
proach yields fast convergence with respect to randomized trials with a 95%
confidence interval of 0.02 and high accuracy when compared against val-
ues read from the hardware performance counters.
2. I devise an adaptive trace collection mechanism that gathers enough mem-
ory addresses to capture the intra-thread, inter-thread and inter-core interac-
tions within memory accesses issued from a set of independent but concur-
rently running threads (thread blocks).
The generated traces are later combined together through the stochastic
memory hierarchy modeling framework to predict miss rates for the L1 and
L2 caches and measure the expected latency of the main memory.
The trace collection is implemented as a source to source compiler trans-
formation module that seamlessly inserts probes inside the GPU kernel (de-
vice code) and the required routines to handle trace buffers in the surround-
ing code (host code) without any user intervention. The trace collection
9
mechanism incurs modest execution overhead and results in minimal user
intervention. This is discussed in Chapter 5.
3. The proposed approach can identify and isolate the behavior of specific
memory operations in an application with respect to the memory hierarchy
and in the context of other contemporary memory operations in the system.
This information can later be used to target optimizations such as tiling for
locality, data layout and thread coarsening for specific code segments. In
Chapter 7, I show how based on the predicted statistics for the memory hi-
erarchy efficiency, the performance of individual memory loads (expected
latency) can be identified.
Based on this information, in Chapter 8, I devise a tractable analytical model
and identify the critical path with respect to the execution time of a kernel.
The latency of the critical path determines the expected execution time for a
kernel. Based on the results presented in Chapter 8, the model predicts the
execution times for the studied GPU kernels with an average relative error
of 16%.
I also propose a method to estimate the systematic error of the performance
predictions. This is discussed in Section 6.4. Based on the estimated error
function, I compare and evaluate the probabilistic merit of a kernel with
respect to other kernels in Section 8.2.
10
CHAPTER 2
BACKGROUND AND RELATED WORK
2.1 The Monte Carlo Method
The Monte Carlo method [11] is a technique to approximate solutions to a com-
plex, nonlinear system that involves more than a few uncertain input parameters.
For example, in mathematical physics, problems that involved smaller number of
particles were solved through systems of ordinary differential equations which are
apparatus to study physical phenomena in classical mechanics.
On the other hand, to find solutions to complex systems with a large number
of particles, statistical mechanics uses a completely different approach, similar
to randomized trials, that is not concerned about individual particles but rather
studies the properties of a set of particles.
Through statistical sampling, the Monte Carlo method iteratively evaluates a
deterministic model using sets of random numbers as inputs. As a result of using
random inputs, the deterministic model of the complex system is transformed into
a stochastic model.
The modern Monte Carlo technique – not the statistical sampling – was invented
by Stan Ulam and later independently rediscovered by Enrico Fermi. The name
Monte Carlo was used in World War II to refer to the sampling-based methods for
neutron scattering computations.
Ulam automated the statistical sampling by developing algorithms for the newly
invented electronic computer ENIAC [12] and transformed non-random problems
11
into random forms to facilitate the computation of their solution through statistical
sampling. He describes how the idea took shape in his mind [13] as:
”The first thoughts and attempts I made to practice [the Monte Carlo
Method] were suggested by a question which occurred to me in 1946 as I
was convalescing from an illness and playing solitaires. The question was
what are the chances that a Canfield solitaire laid out with 52 cards will
come out successfully? After spending a lot of time trying to estimate them
by pure combinatorial calculations, I wondered whether a more practical
method than ”abstract thinking” might not be to lay it out say one hun-
dred times and simply observe and count the number of successful plays.
This was already possible to envisage with the beginning of the new era
of fast computers, and I immediately thought of problems of neutron dif-
fusion and other questions of mathematical physics, and more generally
how to change processes described by certain differential equations into
an equivalent form interpretable as a succession of random operations.
Later [in 1946, I] described the idea to John von Neumann, and we began
to plan actual calculations.”
2.1.1 The Mathematical Foundation
The Monte Carlo method is designed to analyze and propagate uncertainty by
determining how random variation and lack of knowledge affects the sensitivity
of the solutions of the system that is being modeled. The inputs are randomly
generated based on their probability distributions to represent samples collected
from an actual population. The outputs of the simulation are also represented by
probability distributions.
The expected value of a random variableX , defined over the interval [a, b], with
12
a probability distribution of f(x) is described by:
E(X) =
∫ b
a
xf(x)dx (2.1)
The expected value, E(X), can become hard to compute via the above inte-
gral. The Monte Carlo method states that E(X) can be estimated by collecting
n sample points x1, x2, x3, ..., xn from X based on f(x) and then calculating the
average values of xis:
E(X) ≈ Ên(X) = 1
n
n∑
i=1
xi (2.2)
The Monte Carlo method is validated by two simple laws that are foundations
to statistical analysis:
1. The Law of Large Numbers (LLN) states that the mean (average) of a se-
quence of random variables with a common distribution converges to the
(true) mean of the population, as the size of the sequence approaches in-
finity. The LLN is important because it guarantees the convergence of the
Monte Carlo method via stable results for random events as the number of
observations (sample size) increases.
2. The Central Limit Theorem (CLT) states that the sum of a large number of
independent observations, in general, exhibit a normal distribution (the bell
curve) as the number of samples increases. In the context of the Monte-
Carlo method, the CLT is the basis to evaluate the quality of solutions of the
simulated system.
For Equation 2.2, the CLT states that E − Ên = σZ, where Z ∼ N(0, 1)
13
and σ, the standard deviation of Ên, is estimated by the following equation:
σ̂n =
√√√√ 1
n
n∑
i=1
(xi − Ên)2 (2.3)
The result of the Monte carlo are reported in the form E = Ên ± σ̂n, which
gives one standard deviation error bar of [Ên − σ̂n, Ên + σ̂n]. Based on the
CLT, one standard deviation error bar indicates that the probability E being
within the above interval over large number of trials is 66%. Similarly,
based on the CLT, chances ofE being in confidence interval [Ên−2σ̂n, Ên+
2σ̂n] is 95%.
2.2 The Bayesian Method
In probability theory, Bayesian theorem states that the probability of a particular
inference being true – the posterior probability – given some observed evidence,
is a combination of the prior probability of the inference and the compatibility of
the observed evidence with the inference.
The prior distribution is the probability distribution before observing data. Af-
ter observing data, the new probability distribution, the posterior distribution, is
computed using Bayes’ inference theorem [14]. The Bayes’ theorem for discrete
probability distributions is based on the conditional probability and expressed as:
P (A|B) = P (B|A)P (A)
P (B)
(2.4)
In Equation 2.4, P (A|B), refers to the probability of event A, given the occur-
rence of some other event B. In other words, in a random experiment when B has
occurred, the probability of the occurrence of A is adjusted from the unconditional
14
probability into the conditional probability under B. In the context of the Bayes’
theorem:
• P (A) is the prior probability and does not contain any information about
occurrence of B.
• P (A|B) is the posterior probability – the adjusted probability of A upon
observance of B.
• P (B|A) is called the likelihood of event B happening in the presence of
prior probability distribution of P (A).
• P (B) is the prior probability of B and is used to normalize the probabilities
so that they sum up to one.
Bayes’ theorem in this form gives a mathematical representation of how the
posterior probability distribution preserves knowledge contained in the prior dis-
tribution and the adjustment made according to the new observed data.
In this work, I use the Bayesian method to estimate the probability distribution
of the systematic error function, which will be discussed in Section 6.4. I assume
that the prior probability distribution is uniform over the interval [0, 100) and
decays exponentially afterwards.
The uniform distribution is used when background information about the prior
distribution is limited, which indicates that before any observations, all outcomes
are equally likely. The exponentially decaying tail represent the fact that the for a
reasonably good performing model, chances of reporting a prediction with a large
error magnitude tends to be very small.
15
2.3 The Graphics Processing Unit
A little over a decade ago, NVIDIA built the first graphics processor to accelerate
the real-time graphics. Gradually, high arithmetic and memory bandwidth made
GPUs ideal high performance computing platforms for data-parallel applications.
The first GPU programming interfaces were limited to shader languages such
as DirectX [15], OpenGL [16] and Cg [17]. As a result, programmers had to
express the computation of their applications in terms of graphics pipeline opera-
tions. Later, Brook [18] and Sh [19] added stream extensions to the C language to
abstract away the graphics hardware details. Accelerator [20] moved one step fur-
ther from stream programming by providing data-parallel arrays with aggregate
element-wise operations.
2.3.1 The G80 Architecture
While the above research libraries made noticeable effort to overcome the draw-
backs of the shader languages, the inflection point in GPU computing happened
when NVIDIA introduced the G80 family of unified graphics and compute pro-
cessors. The G80 GPUs were accompanied by a new combined software and
hardware architecture called the Compute Unified Device Architecture (CUDA),
which significantly improved programability of the new graphics processors and
established the new GPU computing model.
As the G80 architecture combined the vertex and pixel pipelines, the new uni-
fied processor was capable of executing both graphics (vertex, pixel, etc.) and
general purpose computing programs. This design established a ground base for
the new generation of programmable graphics processors.
Figure 2.1 summarizes major components of a graphics processor – important
during the course of my discussion in this thesis – based on a simplified diagram
16
DRAM  DRAM  DRAM  DRAM  DRAM  DRAM  DRAM  DRAM 
Mul(processor 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
SP 
Figure 2.1: Block diagram of NVIDIA’s G80 graphics processor
of the NVIDIA’s GTX 8800 GPU.
At a high level, the GPU programming model is based on the Single Program
Multiple Data (SPMD) model; a kernel function defines the program that is ex-
ecuted by each of the thousands or millions of fine-grain threads that compose
a GPU application. Host code initializes the kernel parameters and invokes it
through CUDA APIs.
From an architectural perspective, a graphics processor is formed by group
of multiprocessors each consisting of a number of simple inorder cores and a
17
massively parallel memory architecture. The programmer decomposes the prob-
lem domain into a number of blocks. For each block, a small program is exe-
cuted which uses a number of independent threads to perform its computation.
The blocks are scheduled on multiprocessors which operate on them in paral-
lel. Groups of Independent threads within in block operate based on the single-
instruction multiple-thread (SIMT) execution model; threads within a group run
in lock-step executing the same instruction.
The first programmable GPUs provided limited memory hierarchy and synchro-
nization support: A small per core software-managed cache (shared memory) was
provided to share data within a thread-block. A barrier synchronization intrinsic
was supported for intra-thread-block communication.
2.3.2 The Fermi Architecture
Major successors to the G80 design are based on the Fermi architecture, which
was first released in 2010. The Fermi architecture has taken a significant leap
forward in GPU architecture design by providing a two-level cache hierarchy to
better support the applications that are not able to use the GPU’s shared memory
efficiently. Among other major key improvements are the improved double preci-
sion performance, faster atomic operations to reduce the cost of inter-thread-block
communication, and the Error Correction Code (ECC) support.
I use the Fermi architecture (NVIDIA Tesla C2050) as the target platform for
this work. In what follows, I briefly highlight hardware and software features that
either significantly influenced the most important concepts in my work, or were
used in the course of developing the infrastructure that this work is based upon. A
more detailed description of these can be found in [3].
18
Counter name Description
l1 global load hit Number of global load hits in L1 cache
l1 global load miss Number of global load misses in L1 cache
l2 subp0 write sector misses Accumulated write sector misses from L2
cache for slice 0 for all the L2 cache units
l2 subp1 write sector misses Accumulated write sectors misses from L2
cache for slice 1 for all the L2 cache units
l2 subp0 read sector misses Accumulated read sectors misses from L2
cache for slice 0 for all the L2 cache units
l2 subp1 read sector misses Accumulated read sectors misses from L2
cache for slice 1 for all the L2 cache units
l2 subp0 write sector queries Accumulated write sector queries from L1 to
L2 cache for slice 0 of all the L2 cache units
l2 subp1 write sector queries Accumulated write sector queries from L1 to
L2 cache for slice 1 of all the L2 cache units
l2 subp0 read sector queries Accumulated read sector queries from L1 to
L2 cache for slice 0 of all the L2 cache units
l2 subp1 read sector queries Accumulated read sector queries from L1 to
L2 cache for slice 1 of all the L2 cache units
Table 2.1: NVIDIA performance counters used to cross validate the predicted L1
and L2 caches hit ratios
Streaming Processors
A streaming multiprocessor in Fermi has 32 in-order streaming cores. Each
core can execute a single floating point or integer instruction per clock.
Thread Scheduling
Thread blocks are allocated to the streaming multiprocessors by a global
thread scheduling engine. Within a streaming multiprocessor, a local thread
scheduler picks up warps that are ready to run and executes them on the
streaming processors.
Memory Hierarchy
The Fermi architecture memory hierarchy design takes a large step towards
a more generic general purpose setting with two levels of data caches; a
19
private L1 cache shared within streaming cores of a multiprocessor and a
unified L2 shared among multiprocessors.
Atomic Operations
Atomic operations are operations which are performed in a mutually exclu-
sive manner (without any interruption from other threads) to prevent race
conditions among concurrently running threads of a GPU application. They
work for both shared memory (intra-thread-block mutual exclusion) and
global memory (inter-thread-block mutual exclusion). The unified second
level cache in Fermi, greatly enhanced the atomic operation performance
for global memory.
In Chapter 5, I describe how atomic operations are used to collect trace
information when an instrumented GPU kernel is being executed.
Parallel Thread eXecution Instruction Set Architecture
Parallel Thread Execution (PTX) is NVIDIA’s three address intermediate
representation in Static Single Assignment (SSA) format for CUDA and
OpenCL programming interfaces. The PTX assembly is translated by the
shader compiler into the cubin form native to the target GPU architecture.
The CUDA programming interface has recently added support to allow PTX
assembly being inlined in the kernel code. This feature is exploited in the
design of the dynamic trace collection module, described in Chapter 5, to
implement a robust and safe communication mechanism between GPU and
CPU.
Performance Counters
Performance counters are enabled by setting a few environment variables to
initialize the set of hardware counters to be sampled.
20
In this work, performance statistics provided by the hardware counters are
used to cross validate the predictions of the memory hierarchy model, e.g.,
the overall hit ratios for the L1 and L2 caches. Table 2.1 lists the hardware
counters that were used for the purpose of this validation.
2.4 Related Work
2.4.1 Functional Simulation
Functional simulators such as Barra [8], a GPU simulator that implement the PTX
instruction set, simulate instructions one by one and are used to check whether the
instruction set implementation is correct. They cannot provide accurate perfor-
mance statistics as they do not account for the correct timing of events.
Nevertheless, a functional simulator can be used to generate instruction traces
for a full system trace-driven simulation. Instruction traces can also be used in
cache predictor simulations, relatively accurate in predicting hit and miss ratios
for a single stream of instructions.
In this work, I use an approach similar to a trace-driven cache predictor, with
the exception that in a highly multithreaded execution environment, there are more
than a single stream of instructions – thousands of inflight threads issue memory
operations all together. As a result, the proposed memory hierarchy simulation
approach needs to combine the individual instruction streams (memory accesses)
into a single one. It then, predicts the efficiency of the GPU memory hierarchy
based on the synthesized relative ordering generated.
21
2.4.2 Full System Simulation
While full cycle-accurate simulators such a GPUSim [5] perform a detailed mi-
croarchitecture simulation of a program to study the performance of proposed
architecture designs, the high overhead of full system simulation has lead to al-
ternative simulation techniques at least for the CPU simulators. These simulation
approaches that target reducing the simulation time can be grouped into the fol-
lowing categories:
• Reducing input set: Among the first studied approaches to reduce the sim-
ulation time is to modify the input set and have the simulator execute fewer
number of instructions [21]. This way the benchmark would have gone
through all different stages of computation and the reported performance
metrics have potentially similar characteristics compared to the full input
set. However, more comprehensive studies [22] showed that the reported
cache miss rates for the reduced input set can be very different from the full
input set simulation results. Therefore, the predicted Instruction per Cycle
(IPC) statistics, computed under a reduced input set, will not be necessarily
accurate.
• Truncated execution: The assumption behind this simulation technique is
that an arbitrary interval of program execution is a reasonably good rep-
resentative of the application behavior. Therefore, only a fixed number of
instructions are simulated in detailed. This snapshot can start from the be-
ginning of the application, or the initial phases of application computation
can be skipped by functionally simulating those and later switching to a
detailed simulation for the snapshot of interest. The functional simulation
helps in avoiding the cold error due to uninitialized states of the memory
and processor.
22
• Sampling: Sampling approaches fully simulate only a subset (sampling
units) of a benchmark’s execution. Instructions between the sampling units
are fast forwarded through functional simulation, used primarily to main-
tain the up to date microarchitectural state (caches and branch predictors).
Sampling can be performed under one the following schemes:
Representative sampling: This approach extracts a subset of dynamic in-
struction stream such that it matches the overall behavior of the bench-
mark when running the full input set. Representative subsets are cho-
sen through profiling and offline analysis of the program basic blocks
[23]. After a particular sequence of instruction streams (basic blocks)
have been simulated once, the collected statistics is weighted appropri-
ately to reflect the overall presence of each basic block when executing
the full input set.
Periodic sampling: Periodic sampling [24] simulates selected subsets (sam-
pling units) of the dynamic instruction stream sampled at fixed inter-
vals. The sampling frequency is determined based on the desired level
of accuracy.
Random sampling: In random sampling, the simulation results from a
large number of randomly sampled intervals are combined together
to produce the overall simulation statistics [25].
2.4.3 Analytical Modeling
Analytical modeling, on the other extreme, provides valuable insight by approx-
imating the microarchitecture behavior. It specifically works well when the un-
derlying design is relatively simple, e.g., in-order cores as opposed to superscalar
processors. In this work, I combine an analytical model with a novel trace-driven
23
memory hierarchy simulation approach to evaluate the performance of applica-
tions running on a highly multithreaded graphics processor.
Previous studies on performance estimation [26, 27, 28] and tuning [29] for
general purpose computing on graphics processing units (GPGPUs) were con-
strained by the programming environment and the necessity of mapping algo-
rithms to existing GPU features.
The first body of research on GPU computing (on programable graphics pro-
cessors) started with the work of Ryoo et al. [31] who used Pareto-optimal curves
to prune the optimization space of general-purpose applications on GPUs. They
introduce efficiency (a flat instruction count) and utilization (a measure of how
many pure compute cycles are available from other executing warps) as single
number metrics. They did not model memory latency and assumed that none of
the studied GPU kernels were memory bound.
Later Baghsorkhi et al. [32] and Hong et al. [33] proposed analytical models
to evaluate the performance of the GPU applications. In a previous work [32], I
explained how static analysis can be efficiently used to to account for control flow
divergence, global memory coalescing and shared memory bank conflicts when
the control flow and array index expressions are not data-dependent. This topic
is discussed in more details in Appendix A. I also showed that GPU kernels with
data-depedent control flow conditions and memory accesses can be efficiently in-
strumented to collected the data required for the analytical model.
Neither of the approaches discussed in [33, 31] handle the diverging control
flow or the data-dependent GPU applications properly. Hong et al. [33], conser-
vatively account for both control flow paths even if all threads (of a warp) go
through the same path. Properly handling control flow is important for a per-
formance model, specially when memory loads are protected by conditionals; a
conservative approach can result in more than twice the number memory opera-
24
tions that a kernel executes, which will result in large deviation between predicted
and real performance statistics.
None of the previous work in the area of performance tuning and modeling of
graphics processors [31, 32, 33, 34, 35], to this point, have accounted for cache
behavior with inter-core or inter-thread data sharing and neither did model a GPU
with a multi-level memory hierarchy.
This work is also different from GPU simulators such as GPUSim [5] as they
emulate each instruction in detail and account for all GPU micro-architecture
states (in the case of GPUSim). This requires much more simulation overhead
compared to the approach proposed in this thesis; I simulate a small portion of the
memory traces. Furthermore, my approach provides better targeted (source code
level) performance feedbacks to application developers and systems engineers.
25
CHAPTER 3
A HIGHLY MULTI-THREADED
ARCHITECTURE MODEL
In this chapter, I will define a machine model for a highly parallel graphics pro-
cessor targeted toward general purpose and compute oriented workloads. This
machine model must be simple enough to facilitate study and efficient evaluation
of the graphics processors. On the other hand, it need to enclose enough details
so that it can reflect a realistic behavior of the hardware. In what follows, I high-
light a few fundamental differences reflected in this model that separates graphics
processors from their peer multi-core processors.
3.1 Throughput-oriented Computing
GPUs emphasize high throughput and single instruction multiple thread (SIMT)
performance rather than single thread (instruction) execution latency. As a result,
the number of inflight threads and memory accesses is much higher compared to
conventional multi-core CPUs.
The SIMT is enabled by having streaming processors running neighboring threads
in lock-step, executing the same instruction (on different data). This microarchi-
tectural grouping of threads which can affect both control flow and memory ac-
cess efficiency introduces the concept of warp – a group of threads that compose
a hardware vector unit.
26
3.2 Hardware Thread Scheduling
Preemptive thread scheduling by the operating system cannot efficiently support a
throughput oriented environment. Therefore, GPUs implement thread scheduling
through a dedicated hardware.
A global thread block scheduler assigns thread blocks to streaming multiproces-
sors – arrays of streaming processors that implement the SIMT execution model.
In this work, I will consider a fine-grain round robin scheduling paradigm with
respect to the relative order of thread blocks assigned to each streaming multipro-
cessor. For example, with 4 streaming multiprocessors available, each running 6
thread blocks, the model assumes the following mapping of logical thread blocks
to multiprocessors:
Multiprocessor 0 Multiprocessor 1 Multiprocessor 2 Multiprocessor 3
56, 60, 64 57, 61, 65 58, 62, 66 59, 63, 67
68, 72, 76 69, 73, 77 70, 74, 78 71, 75, 79
This mapping is the closest model to the dynamic scheduling scheme that the
hardware implements.
Within a multiprocessor, a warp scheduler alternate between different warps
following a weighted round robin scheme.
3.3 Bulk Instruction Scheduling
To make the model tractable and avoid fine-grain instruction scheduling of a ker-
nel instructions, which would result in a full system simulation model, I propose
a bulk scheduling scheme. The computation within a kernel is divided into a se-
quence of memory loads followed by a sequence of compute or store instructions.
When static memory loads are repeated within a dynamic instruction stream,
in the presence of enclosing loop constructs, they end up in separate sequences
27
that are sequentially executed. Therefore, a back-edge in the control flow will
terminate a sequence.
Another delimitating factor is the existence of data-dependance between static
memory loads. For example, if the index expression of a memory load is com-
puted based on the value that is being read by a previous memory load, the second
load will start new sequences for both memory loads and compute instructions.
Explicit synchronization instructions will also mark the end of the current se-
quences.
The above simplifications preserve enough information about the instruction
stream so that the model can reflect a reasonable representation of the kernel com-
putation. Meanwhile, it is general enough that can approximate different instruc-
tion scheduling schemes adapted by different GPU vendors. For example, AMD’s
GPUs are designed based on a Very Long Instruction Word (VLIW) Instruction
Set Architecture (ISA); multiples of instructions that can execute in parallel are
packed into statically scheduled VLIW bundles. These bundles then compose rel-
atively short clauses; clauses that are initiated by control flow instructions, may
only contain a single type of instructions, e.g., memory loads or ALU instruc-
tions. In contrast, NVIDIAs GPUs require more sophisticated scheduling logic
for fine-grain scoreboarding and resolving dependencies among dynamic stream
of instructions. Through the proposed bulk instruction scheduling scheme these
level of details can be abstracted away.
The bulk instruction scheduling scheme proposed above accounts for the major
dependencies among instructions and differentiate between types of instructions
that consume different hardware resources (memory loads and compute instruc-
tions). At the same time, it consciously ignores the low level details of dependen-
cies and scheduling order to make the analysis tractable.
28
3.4 Throughput-oriented Memory System
GPUs conserve the memory bandwidth by grouping together aligned loads and
stores. Each coalesced memory access uses a single memory request, while mov-
ing multiple elements of data.
Traditionally, highly multithreaded architectures such as HEP [36], M-Machine
[37], Tera MTA [38] and more recently multithreaded GPU processors employ
hardware multi-threading for fast context switches to tolerate memory latency.
Following this approach to tolerate memory latency, the early multithreaded ar-
chitectures did not include data caches. More recent GPUs such as NVIDIA’s
Fermi architecture [39] added caches as means to conserve memory bandwidth.
Private L1 caches expoit inter-thread input data sharing to reduce the meory band-
width consumption by each core. A shared unified L2 further reduces off-chip
bandwidth requirements by exploiting data re-use between cores.
The throughput oriented environment also influences the design and function-
ality of the memory hierarchy. For example, data caches – if exist at all – are no
longer optimized for exploiting long term locality as their effective size shrinks
with high number of concurrent memory request. These caches are designed to
exploit short-term spatial and temporal data locaity across the concurrently exe-
cuting threads rather than long-term temporal data reuse.
To lock data that is reused frequently but with a reuse distance larger than the
effective cache capacity, the on-chip software-managed cache or shared memory
can be used.1
In NVIDIA GPUs, shared memory and the L1 data cache use the same underly-
ing physical structure and cannot be accessed simultaneously. Therefore, shared
memory can be conceptually considered as an extension to the L1 cache, with the
1Reuse distance is defined as the number of distinct memory accesses between two occurrences
of a memory reference.
29
difference that the data that is mapped to the shared memory address space will
always hit in the L1. This assumption simplifies modeling of the memory level
parallelism (MLP) that is discussed in Section 8.1.2.
Figure 3.1: A Highly Multithreaded GPU Model
30
1 # inc lude <c u d a r u n t i m e a p i . h>
2
3 i n t que ryDev ice ( ){
4
5 / / t o t a l number o f d e v i c e s
6 i n t devCount ;
7 cudaGetDeviceCount (& devCount ) ;
8
9 f o r ( i n t i = 0 ; i < devCount ; ++ i ){
10
11 p r i n t f ( ” Device %d :\ n ” , i ) ;
12 cudaDev iceProp devProp ;
13 c u d a G e t D e v i c e P r o p e r t i e s (& devProp , i ) ;
14
15
16 p r i n t f ( ”Name : %s\n ” , devProp . name ) ;
17 p r i n t f ( ” T o t a l g l o b a l memory : %u\n ” , devProp . to t a lGloba lMem ) ;
18 p r i n t f ( ”Max s h a r e d memory p e r b l o c k : %u\n ” , devProp . sharedMemPerBlock ) ;
19 p r i n t f ( ”Max r e g i s t e r s p e r b l o c k : %d\n ” , devProp . r e g s P e r B l o c k ) ;
20 p r i n t f ( ”Warp s i z e : %d\n ” , devProp . warpS ize ) ;
21 p r i n t f ( ”Max t h r e a d s p e r b l o c k : %d\n ” , devProp . maxThreadsPerBlock ) ;
22 f o r ( i n t i = 0 ; i < 3 ; ++ i )
23 p r i n t f ( ”Max b l o c k d imens ion : %d\n ” , devProp . maxThreadsDim [ i ] ) ;
24 f o r ( i n t i = 0 ; i < 3 ; ++ i )
25 p r i n t f ( ”Max g r i d d imens ion : %d\n ” , devProp . maxGridSize [ i ] ) ;
26 p r i n t f ( ” Clock r a t e : %d\n ” , devProp . c l o c k R a t e ) ;
27 p r i n t f ( ”Num of m u l t i p r o c e s s o r s : %d\n ” , devProp . m u l t i P r o c e s s o r C o u n t ) ;
28
29 }
30
31 re turn 0 ;
32 }
Listing 3.1: Query device function
3.5 The Complete Model Overview and Use-case
Realization
Figure 3.1 shows a diagram of a GPU architecture with all the components that
I discussed in previous sections. For the rest of this thesis, I will use this model
of highly multithreaded graphics processors and customize it for NVIDIA’s Fermi
architecture to demonstrate the design and effectiveness of proposed performance
evaluation appraoch.
A Fermi-based GPU is composed of a number of streaming multiprocessors
each consisting of a group of simple in-order processor pipelines. Each multi-
processor can have multiple in-flight warps from the same or different thread-
blocks whose executions overlap. A warp scheduler switches among warps in
31
Figure 3.2: Array of linked-lists used in microbenchmarks
a semi-round-robin fashion. When a warp is stalled on a memory access, the
scheduler picks another warp to execute. When the data becomes ready the warp is
returned to the list of warps eligible for execution. Warps running on a node share
an L1 data cache. A unified L2 cache sits between individual multiprocessors and
the main memory. The L1 and L2 are set associative caches.
Configuration parameters for the streaming multiprocessors can be determined
through a call to NVIDIA’s API function cudaGetDeviceProperties as
shown in Listing 3.1. NVIDIA’s Tesla C2050, which is the Fermi based GPU
that I use for experiments during the course of this work, contains 32 streaming
processor per each of the 14 multiprocessors and works based on a warp size of
32.
To determine the specific memory hierarchy configurations and their manage-
ment policies, I use a micro-benchmarking approach that will be discussed in the
following section.
3.6 Microbenchmarks
I extract the configuration parameters of the memory hierarchy through micro-
benchmarking. The information is later used by the model to adapt itself to the
32
profile of the specific target platform for executing the application.
The micro-benchmark suite is built around a GPU kernel that traverses ele-
ments of an array of linked-list in a circular (wrap-around) fashion. Consecutive
elements of a linked-list are located within a certain stride from each other, while
elements of different linked-lists are contiguous in the memory as illustrated by
Figure 3.2. I invoke the kernel with a single GPU thread unless stated otherwise.
To determine different configuration parameters and arrangements of the mem-
ory hierarchy, I customize the micro-benchmark by varying the size of linked-lists,
the number of linked-lists (stride), and the number of memory accesses inside the
micro-benchmark kernel.
3.6.1 Cache Block Size
The first parameter that I measure is the cache block size. I run the micro-
benchmark and vary the strides for a fixed linked-list size. The first access to
a block in the cache is always a miss. Successive accesses to the same block will
hit in the cache. As the stride increases, fewer accesses hit the same cache block
and eventually when the stride is equal to the cache block every access will be a
miss. This experiment initially implies a cache block size of 128 bytes (the solid
line in Figure 3.3). This leads one to the conclusion that either there is no higher
level caches or the cache block size of the higher level caches are at most 128
bytes. Later, the micro-benchmark results in Section 3.6.2 capture two levels of
cache memory, which leads to the conclusion that L2 cache block size is at most
128 bytes. The Fermi L1 cache can be deactivated though a compiler switch. With
that option available, I further investigate the cache block size of the second level
cache (the dashed-line in Figure 3.3). The new experiment yields an L2 cache
block size of 32 bytes. Based on this outcome, I assume that on an L1 miss four
33
 0
 100
 200
 300
 400
 500
 600
 700
 0  100
 200
 300
 400
 500
 0
 100
 200
 300
 400
 500
 600
 700
 800
Av
er
ag
e 
Lo
ad
 T
im
e 
(na
no
se
c)
D
iff
er
en
tia
l L
oa
d 
Ti
m
e
Access Stride
L1 Cache L2 Cache
Figure 3.3: Average memory access time for different stride values with a fixed
list size and number of memory access
consecutive L2 cache blocks are accessed.
3.6.2 Cache Capacity
Next, I determine the size of the first and second level caches. I run the micro-
benchmark using a fixed stride of 128 bytes (the L1 cache block size) for a single
linked-list. At each invocation, the kernel sequentially accesses elements of the
list and wraps around a certain number of times. When the list size is smaller
than the cache capacity, subsequent accesses to the same element will hit in the
cache, and the average memory access time is low. When the list size exceeds the
capacity of the cache, accesses to the same element will always miss in the cache
and the average access time increases.
The solid line in Figure 3.4, which shows the results from running this bench-
34
mark on the Tesla C2050, suggests a 16KB first-level cache and a 768KB second-
level cache. I repeat the same experiment with a fixed access stride of 32 bytes
(the L2 block size) and the L1 cache disabled. Results shown in Figure 3.4 (the
dashed line) re-confirm the 768KB capacity for the L2.
3.6.3 Cache Set Associativity
To measure the first and the second level cache associativities, I access elements
of a linked-list with fixed strides that are equal to the cache size, which guarantees
that all the accesses will end up in the same cache set. The linked list is traversed
a large number of times in an experiment. I repeat the experiment multiple times
and each time increase the linked-list size by one, which increases the number of
distinct cache blocks accessed during linked list traversal by one. Eventually the
cache set overflows. At this point, each access causes a (conflict) miss which will
be reflected in a significantly longer average latency of the corresponding load.
The results indicate that both L1 and L2 caches are 64-way set-associative.
The 64-way set associative L1 may at first glance look unreasonable, but when
I measure the L1 access latency in the following section, it indicates a latency of
about 52 cycles. This long latency access time justifies a 64-way set associative
cache.
3.6.4 Latencies
To measure the access latencies for each level of the memory hierarchy, I run
the micro-benchmark for a linked-list that completely fits into the memory in that
level but exceeds the memory capacity of the lower levels. Using a stride equal
to the cache block size, I access the linked-elements many times to factor out the
effect of cold misses. I measure an average L1 access time of 90 nanoseconds,
35
 100
 150
 200
 250
 300
 350
 400
 450
 500
 4  8  16
 32
 64
 128
 256
 512
 1024
Av
er
ag
e 
Lo
ad
 T
im
e 
(na
no
se
c)
List Size (KB)
Stride 128 / L1 Enabled Stride 32 / L1 Disabled
Figure 3.4: Average memory access time for different list size values with a fixed
stride and number of memory access
an average L2 access time of 250 nanoseconds. I later show, in Section 4.2.3,
how I customize the measurement of the access time to global memory for each
application.
3.6.5 Inclusive, Exclusive or Victim
I also need to determine whether the second-level cache is inclusive or exclusive.
Therefore, I execute the micro-benchmark using a list size equal to the L1 cache
capacity (i.e. 16KB) for as many thread blocks as the number of available stream-
ing multiprocessors plus one, i.e. 15 thread blocks in the case of Tesla C2050. To
ensure that different thread blocks are not scheduled at the same time on the same
core (streaming multiprocessor), I enforce each thread block to have 1024 threads
or 32 warps, which is the maximum number of in flight warps that can be simulta-
36
neously scheduled on each core. However, only one thread is performing memory
accesses in each thread block. Each of the first 14 thread blocks access elements
of 14 completely distinct linked-lists. The last thread block, which is roughly
executed after the first 14 finished their execution, randomly chooses one of the
previously accessed linked-lists to start its memory accesses. Memory accesses
from the first 14 thread blocks will miss in both L1 and L2 caches. However,
memory accesses of the last thread block will hit in L2, if the cache is inclusive
and in L1 if the last thread block is scheduled on the same core as a thread block
that has previously accessed the list. I run this benchmark 1024 times, and discard
the results that indicate the accesses from the last thread block hit in the L1 cache.
Next, I run a second set of experiments: this time all 15 thread blocks access
distinct linked-lists. I compare the average memory latency for these two set of
experiments. Results show a lower memory latency for the case that the last thread
block reads from a previously accessed linked-list. This leads one to the conclu-
sion that the second-level cache is inclusive.
3.6.6 Write Policy
Since the L1 caches are relatively small considering the number of inflight threads
for each core, it makes sense to forward stores directly to the shared L2 cache. To
verify this scenario, I modify the micro-benchmark to first initialize the values of a
linked list by writing the pre-calculated indices for the next items in the list. After
this initialization phase, I use the internal clock register to measure the execution
time of loading successive elements from the linked-list. The time measured indi-
cates that loads do not hit in the L1 cache. In other words, write requests bypass
the first-level cache. Through further micro-benchmarking and by closely inves-
tigating the numbers reported by the hardware counters, I verified that stores hit
37
in the L2 only if the corresponding cache block has been activated via a write
request, and that the L2 follows a write back, write allocate policy for stores. So,
for the memory hierarchy model I assume that when a store request reaches the
L2, if the cache block is valid it writes to the cache while setting the dirty bit for
that block. If the block is not valid, it updates the memory and brings the cache
block to the cache and set its valid bit.
38
CHAPTER 4
THE STOCHASTIC MEMORY
HIERARCHY MODEL
In this chapter I describe how a stochastic model of the GPU memory hierarchy is
built on top of the deterministic memory model discussed earlier in Chapter 3; the
order of memory request being issued is partially determined by random variables.
Initial memory requests are generated by instrumenting the GPU kernel at source
level to record the addresses accessed during the kernel execution.
The instrumentation framework, which is discussed in more detail in Chap-
ter 5, is designed as a source-to-source transformation module. Static probes are
inserted within the GPU kernel source code to record the memory addresses read
from and written to by active threads for a subset of thread blocks as the GPU
kernel is executed. This subset of thread blocks represents a snapshot of the ker-
nel execution whose dimensions are determined through a heuristic based on the
number of concurrently executing thread blocks, the (average) number of dynamic
memory operations per warp, and the size of the last level cache.
4.1 Spatial and Temporal Locality
To understand the performance of the GPU kernel with respect to the memory
hierarchy, it is necessary to collect enough traces to warm up the L1 and L2 caches.
The information must also be detailed enough to capture:
39
• the intra-thread locality and cache interactions
• the inter-thread and inter-thread-block locality and cache interactions
• the behavior of the kernel during different phases of the computation
The memory references in a trace are distributed both temporally and spatially.
To capture the spatial locality, which in the context of a highly multithreaded
graphics processor is translated to inter-thread, inter-thread-block and inter-core
locality and memory hierarchy interactions, the execution snapshot of traces is
extended horizontally. The proposed technique collects traces for thread blocks
that are potentially scheduled close together on different cores (streaming multi-
processors), i.e. thread blocks with consecutive logical IDs.
To account for the temporal locality the execution snapshot is extended such
that enough traces are collected within a single thread. The threshold is deter-
mined by the capacity of the last level cache and the total number of inflight
threads for all the cores. If required, the execution snapshot is extended across
the boundaries of multiple thread-blocks that are scheduled back-to-back on a
single core.
With this approach, a subset of the memory traces is collected, but the subset is
detailed enough to reflect locality within different granularities discussed above.
If required, traces can be collected for multiple snapshots of the kernel execution.
However, the initial experiments confirm that applying the model to the traces col-
lected from a single snapshot produces precise and accurate enough estimations.
This is quite expected as the computation in a typical GPU kernel is structured
around a Single Program Multiple Data (SPMD) programming model. Neverthe-
less, the demand-driven sampling of the traces provides the flexibility to trade-off
between the overhead of trace collection and the accuracy of the results generated
based on the collected traces.
40
The above approach, is to some extent similar to the time sampling technique
introduced by Laha et al. [40]. They used contiguous segments of memory ac-
cesses over certain time intervals for trace driven simulation of single-thread work-
loads. For large caches and in a single-thread execution model, the cold-start error
caused due to unknown status of the cache at the start of each trace simulation is a
major source of inaccuracy. In a highly multithreaded execution model, the effect
of the cold-start error is fairly insignificant; the execution environment setup for
throughput oriented caches limits their ability in exploiting long-term temporal
locality. I expect that the same limitation will hold in future GPUs as parallelism
scales with at least the same rate as the size of the caches increases.
4.2 The Monte Carlo Approach
Collected traces exhibit precise intra-warp ordering of the memory references. But
they do not maintain any information on the relative order of memory references
issued from different warps or thread blocks. Note that the proposed approach
does not rely on the execution order of memory loads and stores when collecting
traces. The rationale for not relying on the ordering during the execution of the
instrumented kernel is that adding the static probes to the kernel source code:
• changes the kernel resource usage (number of registers) which may conse-
quently alter the streaming multiprocessors occupancy, i.e., the number of
concurrently active thread blocks.
• increases the number of inflight memory operations which will distort the
state of caches and the level of congestion in the memory hierarchy.
• changes the instruction mix of the GPU kernel and introduces spurious syn-
chronization or stall points.
41
As a result, instrumenting the kernel will induce considerable timing distortions
in the execution order of the memory references. To account for this effect, the
order of memory requests arriving at each level of the memory hierarchy is recon-
structed via a Monte Carlo method which is an efficient sampling approach for
systems with dynamics that are highly coupled together. Traces are then driven
into a memory hierarchy simulator following the randomly sampled ordering for
each run.
Thus, given a pool of memory requests waiting to be serviced at each level of
the memory hierarchy, the following steps are perfomed:
1. Generate a valid random ordering from the pool of avialable memory re-
quests.
2. Use the deterministic memory hierarchy simulator discussed in Section 3 to
obtain an output for the ordering derived in step 1.
3. Repeat steps 1 and 2, N times.
4. Determine the probability distribution of results using histograms and sum-
marize the confidence of the predictions.
The output of the model is a probabilistic performance behavior of the memory
system such as hit ratios for the first and second level caches.
4.2.1 Schedule Deviation
As independent thread blocks are scheduled to run on streaming multiprocessors,
their executions start to fall out of synch with each other due to non-uniform mem-
ory access latencies, variations in the number of memory operations and compu-
tation load as individual threads or warps may follow different control flow paths,
etc. To account for this deviation, when scheduling loads and stores from inflight
42
Figure 4.1: Random points mimic schedule deviation for simultaneously
scheduled thread blocks
warps, random start points are chosen for thread blocks scheduled simultaneously
across different streaming multiprocessors. The horizontal dashed lines in Fig-
ure 4.1 highlight the execution snapshots devised based on random start points for
two multiprocessors. In this example, snapshots are expanded across two back-to-
back scheduled thread blocks – each composed of four warps – and each multipro-
cessor executes two thread blocks simultaneously. Tiny bars in front of dynamic
memory operations L0, · · ·,L6 represent individual memory transactions being is-
sued for the corresponding memory vector instruction. The entry is empty if none
of the threads within the warp have executed the matching load or store.
43
Figure 4.2: The stochastic memory hierarchy model
4.2.2 Scheduling Traces
The scheduling scheme that is used preserves the vector nature of the memory
references issued by threads within a warp. Memory references are coalesced
based on the size of the data being accessed and the cache block size. Coalesced
accesses (memory transactions) for each warp are bound together to ensure that
they are all scheduled at the same time. Each warp of a thread block has a queue
of ready-to-issue memory transactions. Attached to each memory transaction is
44
a unique ID that corresponds to the source code location of the memory load or
store that has triggered the transaction.
Figure 4.2 illustrates the layout of the memory hierarchy simulator. The intra-
core trace schedulers pick traces from the warp queues following a weighted round
robin fashion. Traces are sent to the L1 caches in the order that they have been
picked. Loads update the status and counters of the L1 caches. All stores and the
loads that miss in the L1 are forwarded to the L2 cache.
When scheduling traces from a warp queue, the scheduler continues picking
transactions from the same warp if the corresponding cache lines are triggered
by static memory operations within the same scheduling sequence, provided that
no data-dependence has been recorded between back to back scheduled accesses
(read after writes). This scheduling policy follows the concept of bulk instruction
scheduling discussed earlier in Section 3.3.
The inter-core trace scheduler picks traces from L1 queues based on a weighted
random scheduling algorithm. Loads and Stores that miss in L2 along with store
evictions are placed into the memory request queue as shown in Figure 4.2.
4.2.3 Main Memory
In the context of a highly multithreaded environment, memory latency observed
from a single-thread execution is not necessarily a meaningful measure of per-
formance. Congestions in the memory system may add to the memory latency.
Congestion pattern is characterized based on how efficiently a shared resource is
utilized by the memory requests that arrive close in time. The main memory, one
of the most significant shared resources in a GPU, is organized into a number
of interleaved channels. Depending on how memory accesses are spread across
these channels the effective memory bandwidth and latency will change. To esti-
45
600 800 1000 1200 1400 1600 1800
Latency (nanoseconds)
0.05
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
Pr
ob
ab
ili
ty
Average: 1277.24 Stdev: 187.96
DRAM_LBM
Figure 4.3: Probability distribution for main memory latency in LBM
600 800 1000 1200 1400 1600 1800
Latency (nanoseconds)
0.05
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
Pr
ob
ab
ili
ty
Average: 1285.04 Stdev: 179.98
DRAM_SpMV
Figure 4.4: Probability distribution for main memory latency in SpMV
46
mate the effective memory latency observed by memory operations issued close
to each other, memory accesses that reach the main memory are grouped together
following a uniform random distribution.
To form groups of simultaneously issued memory loads or stores, random num-
ber of memory requests are picked from the queue of misses arrived from the L2
cache. Each group can contain up to 32 memory requests which is the maximum
number of distinct memory transactions a core can issue simultaneously, i.e. the
SIMD memory vector size. Then up to 14 groups are picked, i.e. one group per
multiprocessor core, to form a set of memory requests that reach the main memory
relatively at the same time.
The addresses in this aggregated group of memory transactions are then normal-
ized and translate into indices of an array whose elements are the size of a single
main memory transaction (32 bytes). The indices are then loaded into a syn-
thesized micro-benchmark and the main memory access latency for each batch
of concurrently scheduled accesses is measured via the internal clock register.
Figures 4.3 and 4.4 show the probability density function (PDF) of the observed
latencies collected by the above approach for the lattice-Boltzmann method and
sparse matrix vector multiplication benchmarks.
The general rule is that more frequent random accesses and higher memory
bank conflict rates will degrade the main memory efficiency and increase the ex-
pected latency. If a dominant measured latency exists – which is not sensitive
to the exact grouping or ordering of the memory accesses arriving closely at the
main memory – that latency is a proper statistical representative for the memory
accesses that miss in the L2 cache. For example, the PDFs shown in Figures 4.3
and 4.4 are far from a uniform distribution and both indicate a dominant expected
access latency for the corresponding benchmarks.
47
CHAPTER 5
SOFTWARE FRAMEWORK
This chapter discusses the infrastructure and framework that collects: dynamic
memory traces for the stochastic memory hierarchy simulator discussed in Chap-
ter 4 and an approximation of the number of dynamic instructions executed, which
is used by the analytical model discussed in Chapter 8.
5.1 Instrumenting the Kernel
To collect and analyze the memory addresses accessed and instructions executed
by a given kernel, additional instructions are inserted inside the kernel and the
host codes. This is performed automatically by a source-to-source transformation
module that is run before the actual compilation of the program. This module
and the actual compiler are wrapped together into a new compiler driver that can
replace the original compiler in a project build system. This minimal level of
intervention simplifies the use of the framework significantly.
The source-to-source transformation module performs the following modifica-
tions:
• Inserts memory address probes into the kernel source code at locations
where the kernel is performing a read or a write operation. A probe is a
call to a function that will be in charge of storing the memory address ac-
cessed as well as a unique ID that identifies the source code location of the
memory operation.
48
Figure 5.1: Organization of the sampling buffer
• Stores for each memory address probe, its exact source code location and
the size of the element being accessed into a static array.
• Inserts control flow probes into the kernel source code before and after each
control structure, and before and after the main body of the control struc-
ture. A control structure is a block of programming that evaluates a con-
dition and chooses a control flow path to take based on the result of the
evaluation, e.g., a ”for loop”, an ”if then else” clause, etc.
• Generates a mapping to record the number of instructions in the abstract
syntax tree (AST) between each potentially consecutive pair of control flow
probes. This mapping is used during the post processing of traces to produce
an approximate dynamic count of instructions.
• Adds allocation and deallocation routines for the sampling buffer before and
after the launch of the kernel. A pointer to the buffer is added as an extra
argument to the kernel invocation code. The organization of the sampling
buffer is explained later in Section 5.2.
49
• Adds an additional parameter to the kernel that communicates to the probes
the location within the buffer that they need to store the corresponding sam-
pled information.
• Adds routines to compute the original kernel occupancy as described in
listing 3.1.
• Adds routines that collect traced memory addresses and post-process them
for the memory hierarchy simulation model.
To initiate trace collection for a subset of simultaneously executing thread-
blocks, one need to know the thread-block occupancy for a streaming multiproces-
sor and the number of available streaming multiprocessors; the latter is obtained
via a call to the NVIDIA’s programming APIs. The former is computed as follows.
The occupancy of a GPU kernel is the ratio of the number of active warps to
the maximum number warps supported on a streaming multiprocessor [41]. The
occupancy is determined by the amount of resources that the kernel consumes.
Resources can be allocated either statically or at the kernel invocation time. Oc-
cupancy is also dependent on the device that the kernel will be executed on.
Before starting the source-to-source transformation, the driver calls NVIDIA’s
nvcc compiler to collect the kernel-specific information required to compute the
occupancy. This includes the number of registers and the size of the shared mem-
ory used. The instrumentation module inserts calls to NVIDIA’s programming
interfaces to identify the GPU device that the kernel will be executed on – simi-
lar to routines shown in Listing 3.1. It then compute the occupancy of the kernel
just before its invocation; meanwhile, it collects the kernel invocation parameters
whose values are required for the occupancy computation, i.e., the thread block
dimensions and the amount of shared memory dynamically allocated.
Redundant loads and stores that are likely to be eliminated by the back-end
50
compiler are not instrumented. To achieve this, a global value numbering analysis
is implemented to identify the redundant memory accesses and a basic pointer
alias analysis is used to determine if a memory operation will be removed through
redundant load elimination.
The source to source transformation module is build on top of Clang [30], the C
language family frontend for LLVM [42]. Clang has been modified so that it can
correctly parse a subset of the CUDA [1] source code and build the abstract syntax
tree of both the host and the device code. Each node of the tree precisely records
its exact location within the source code, even when the node is the result of one
or more macro instantiations. This property is essential for the transformation
module to provide accurate source level feedback.
The output of the transformation cannot be generated directly from a modified
abstract syntax tree: To build the abstract syntax tree, a header containing the
declaration of CUDA types and runtime functions must be preincluded. The seri-
alization of the abstract syntax tree would contains these additional declarations.
But nvcc preincludes a header file as well. So, if nvcc is run to compile the
source code generated from the abstract syntax tree, it will result in duplicated
declaration errors. Instead, the output of the transformation is a patch file that is
applied to the user source code. The modifications that need to be applied to the
original source code are recorded as a list of source code insertion operations that
are later used to generate the patch file. This mechanism leverages the Rewriter
API that Clang provides.
The instrumentation of the source code must be done with an unmodified ab-
stract syntax tree to ensure that the source locations are correct. However, the
abstract syntax tree must be modified to increase the quality of the global value
numbering analysis: Expression trees are canonicalized to increase the chance
that two arithmetically equivalent expressions have the same tree representation.
51
This means that the source code must be instrumented before the analysis is per-
formed. As a result, the source code is over-instrumented. After the analysis is
performed, the results are forwarded to the patch file generation module to skip
the unnecessary instrumented memory operations.
A significant challenge in instrumenting the memory operations is identifying
the correct type of the address space being accessed by a pointer, i.e. global mem-
ory, shared memory, etc. The type of a pointer cannot be used to infer which
address space it points to as it can change during the lifetime of the pointer.
A dataflow-based algorithm is used that disambiguates the address space that a
pointer refers to for each segment of the source code. When incapable of resolv-
ing the actual pointer type, it conservatively associates the pointer to the global
memory, similar to the nvcc compiler.
5.2 Recording Traces
The memory address probes inserted within the kernel store the addresses ac-
cessed by each thread into a sampling buffer. The control flow probes record
the number of instructions observed between two control flow checkpoints. The
implementation of these probes and the design of the buffer are inherent to the
execution model within the GPU: The threads within a warp are executed in lock-
step and the warps within a thread block are executed concurrently. Therefore, the
probes are designed to store per-warp rather than per-thread memory accesses or
instruction counts. In addition, the buffer is designed so that its size does not limit
the acquisition of an unbounded amount of instructions or memory addresses.
A buffer is composed of a number of block buffers. Each block buffer itself
contains a set of warp buffers – one for each warp in the thread block – as il-
lustrated in Figure 5.1. The number of block buffers is equal to the number of
52
1 g l o b a l void SpMV( a d a p t s a m p l i n g d e v i c e b u f f e r ∗ a d a p t s a m p l e s ,
2 f l o a t ∗x , cons t f l o a t ∗va l , . . . )
3 {
4 a d a p t s a m p l i n g s t a t u s a d a p t t h i s t h r e a d s t a t u s ;
5 a d a p t i n i t s a m p l i n g s t a t u s ( samples , &t h i s t h r e a d s t a t u s ) ;
6
7 t i d = t h r e a d I d x . y ;
8 b i d = b l o c k I d x . y ;
9 t =0 ;
10 myi = b i d ∗ BLOCKSIZE + t i d ;
11
12
13 a d a p t s a m p l e p r e d ( samples , &t h i s t h r e a d s t a t u s , 0 ) ;
14 i f ( myi < ( numRows ) ){
15 a d a p t s a m p l e p r e d ( samples , &t h i s t h r e a d s t a t u s , 6 ) ;
16
17 a d a p t s a m p l e m e m i n d e x ( samples , &t h i s t h r e a d s t a t u s , 0 , &( rowInd [ myi ] ) ) ;
18 l b = rowInd [ myi ] ;
19
20 a d a p t s a m p l e m e m i n d e x ( samples , &t h i s t h r e a d s t a t u s , 2 , &( rowInd [ myi + 1 ] ) ) ;
21 ub = rowInd [ myi + 1 ] ;
22
23 a d a p t s a m p l e p r e d ( a d a p t s a m p l e s , &t h i s t h r e a d s t a t u s , 2 ) ;
24 f o r ( j = l b ; j<ub ; j ++) {
25 a d a p t s a m p l e p r e d ( samples , &t h i s t h r e a d s t a t u s , 4 ) ;
26
27
28 a d a p t s a m p l e m e m i n d e x ( samples , &t h i s t h r e a d s t a t u s , 4 , &( i n d i c e s [ j ] ) ) ;
29 i n d = i n d i c e s [ j ] ;
30
31 a d a p t s a m p l e m e m i n d e x ( samples , &t h i s t h r e a d s t a t u s , 6 , &(y [ i n d ] ) ) ;
32 y v a l = y [ i n d ] ;
33
34 a d a p t s a m p l e m e m i n d e x ( samples , &t h i s t h r e a d s t a t u s , 8 , &( v a l [ j ] ) ) ;
35 t += v a l [ j ] ∗ y v a l ;
36
37 a d a p t s a m p l e p r e d ( samples , &t h i s t h r e a d s t a t u s , 5 ) ;
38 }
39 a d a p t s a m p l e p r e d ( samples , &t h i s t h r e a d s t a t u s , 3 ) ;
40
41 a d a p t s a m p l e m e m i n d e x ( samples , &t h i s t h r e a d s t a t u s , 1 , &(x [ myi ] ) ) ;
42 x [ myi ] = t ;
43
44
45 a d a p t s a m p l e p r e d ( samples , &t h i s t h r e a d s t a t u s , 7 ) ;
46 }
47 a d a p t s a m p l e p r e d ( samples , &t h i s t h r e a d s t a t u s , 1 ) ;
48 a d a p t r e l e a s e b u f f e r (& a d a p t t h i s t h r e a d s t a t u s ) ;
49
50
51 }
Listing 5.1: Instrumented SpMV – device code
53
1 / / Sampl ing f u n c t i o n used by d e v i c e
2
3 d e v i c e f o r c e i n l i n e void
4 UncachedS to re ( u i n t 6 4 t va l , u i n t 6 4 t ∗ d e s t ) {
5
6 asm ( ”{ s t . g l o b a l . wt . u64 [%1] , %0;}” : : ” l ” ( v a l ) , ” l ” ( d e s t ) ) ;
7
8 }
9
10 d e v i c e f o r c e i n l i n e void
11 UncachedLoad ( u i n t 6 4 t& val , u i n t 6 4 t ∗ s r c ) {
12
13 asm ( ”{ l d . g l o b a l . cv . u64 %0, [%1] ;} ” : ”= l ” ( v a l ) : ” l ” ( s r c ) ) ;
14
15 }
Listing 5.2: Sampling functions – GPU-CPU communication
simultaneously active thread blocks in the GPU.
I use a per-block-buffer locking mechanism, described in Listing 5.3, to prevent
concurrent accesses to the same buffer from multiple thread blocks. Each block
buffer has a counter that holds the number of active threads using the buffer. When
zero, the buffer is free to be acquired. A thread block acquires exclusive rights
to the buffer when it successfully exchanges, via an atomic operation, the value
zero in the counter with the number of threads that it contains. Each thread will
atomically decrement this counter just before terminating.
A warp buffer is a ring buffer with fixed-sized lines that are wide enough to
hold for each thread within a warp, an address value and a static ID associated
with the source code location of the corresponding memory operation. Instruction
counts are stored following the same format with the exception that the static ID
associated with an instruction count entry is -1.
When an active thread executes a probe, the next available line in the warp
buffer is updated and a line counter is incremented. The buffer is originally ini-
tialized with invalid addresses (odd values). So when an inactive thread does not
store any address it can be easily captured. Probes will spin on the warp buffer
counter when the buffer is full.
During the kernel execution, the host program constantly snoops at individual
54
1
2 / / Sampl ing f u n c t i o n used by d e v i c e −− i n t e r−t h r e a d communica t ion
3
4 d e v i c e void
5 a d a p t a c q u i r e b l o c k l o c k ( cons t a d a p t s a m p l i n g s t a t u s ∗ s t a t u s ) {
6
7 bool b l o c k A c q u i r e d = f a l s e ;
8 bool p e e r A c q u i r e d = f a l s e ;
9 cons t u i n t 6 4 t window id = s t a t u s−>window id ;
10 bool f i r s t T h r e a d = ( t h r e a d I d x . x ==0) && ( t h r e a d I d x . y = = 0 ) ;
11
12 i f ( f i r s t T h r e a d ) {
13 whi le ( atomicCAS ( ( unsigned long long i n t ∗ ) s t a t u s−>b l o c k , 0 , 1 ) ! = 0 ) ;
14 }
15 s y n c t h r e a d s ( ) ;
16
17 i f ( ! f i r s t T h r e a d ) {
18 a t o m i c I n c ( ( unsigned i n t ∗ ) s t a t u s−>b l o c k , 0 x f f f f f f f f ) ;
19 }
20 s y n c t h r e a d s ( ) ;
21 }
22
23 d e v i c e void
24 a d a p t r e l e a s e b l o c k l o c k ( a d a p t s a m p l i n g s t a t u s ∗ s t a t u s ) {
25
26 i f ( s t a t u s−>do sample ) {
27 t h r e a d f e n c e ( ) ;
28 atomicDec ( ( unsigned i n t ∗ ) ( s t a t u s−>b l o c k ) , 0 x f f f f f f f f ) ;
29 s t a t u s−>do sample = 0 ;
30 }
31
32 }
33
34 d e v i c e void
35 a d a p t a c q u i r e b u f f e r ( a d a p t s a m p l i n g s t a t u s ∗ s t a t u s ) {
36
37 a d a p t a c q u i r e b l o c k l o c k ( s t a t u s ) ;
38
39 }
40
41 d e v i c e void
42 a d a p t r e l e a s e b u f f e r ( a d a p t s a m p l i n g s t a t u s ∗ s t a t u s ) {
43
44 a d a p t r e l e a s e b l o c k l o c k ( s t a t u s ) ;
45
46 }
47
48 d e v i c e void
49 a d a p t s a m p l e m e m o p i n d e x ( a d a p t s a m p l i n g d e v i c e b u f f e r ∗ ds ,
50 a d a p t s a m p l i n g s t a t u s ∗ s t a t u s ,
51 u i n t 6 4 t in sn , u i n t 6 4 t i n d e x ) {
52
53 i f ( ! s t a t u s−>do sample ) re turn ;
54 a d a p t w r i t e b u f f e r ( ds , s t a t u s , i n sn , 0 , i n d e x ) ;
55
56 }
57
58 d e v i c e void
59 a d a p t s a m p l e p r e d ( a d a p t s a m p l i n g d e v i c e b u f f e r ∗ ds ,
60 a d a p t s a m p l i n g s t a t u s ∗ s t a t u s , u i n t 6 4 t p red ) {
61
62 i f ( ! s t a t u s−>do sample ) re turn ;
63 a d a p t w r i t e b u f f e r ( ds , s t a t u s , pred , 1 , 0 ) ;
64
65 }
Listing 5.3: Sampling functions – inter-thread (thread-block) communication
55
warp-buffer counters. When a warp buffer is full, the host program reads the warp
buffer content (traces) and reset the warp buffer counter. Now, threads that were
spinning on the counter will resume to record the trace information. When enough
traces is gathered, the host program set the counters to a predefined value that
disables probes inside the kernel. The minimum number of indices to be sampled
and the start window are set through a pair of environment variables whose values
are read and applied to the initialization routines on the host side.
To allow both the host and the device code access the buffer concurrently, the
buffer is allocated using pinned-memory. Pinned-memory is the host memory that
is removed from the virtual memory, so it is not paged out by the operating system.
Since the atomic operations operating on a pinned-memory are not atomic from
the point of view of the host [1], the communication between host and device is
implemented via un-cached load and store operations by inlining PTX assembly
as shown in Listing 5.2. The ld.cv command considers the cached memory
lines are stale and st.wt performs as a write-through store bypassing the GPU
L2 cache.
The exact layout of the buffer is computed based on thread block dimensions,
number of concurrently active thread blocks and the overall size of the buffer
itself.
56
CHAPTER 6
EXPERIMENTAL EVALUATION:
MEMORY HIERARCHY
In this chapter, I present the result of application of the stochastic memory hierar-
chy model discussed in Chapter 4 to several GPU kernel applications. The results,
which are presented in terms of the L1 hit ratios for loads and the L2 hit ratios for
loads and stores are cross validated against the hit ratios reported by the hardware
performance counters.
6.1 Experiments Setup
Description Data-
dependent
accesses
Data-
dependent
control flow
Diverging
control
flow
Load imbal-
ance within
thread blocks
DMM – Dense matrix
multiplication
No No No No
SpMV – Sparse matrix
vector multiplication
Yes Yes Yes Yes
FFT – fast Fourier
transform
No No No No
Black Scholes –
stochastic differential
equation model for
option pricing
No No No No
LBM – 3D lattice-
Boltzmann method
No Yes Yes Low
Merge sort No Yes Yes Low
Stencil – 3D heat equa-
tion
No No Low No
Figure 6.1: Benchmarks suite used for experiments
57
The proposed memory hierarchy modeling approach is validated on an NVIDIA
Tesla C2050 general-purpose graphics processor.
The benchmarks used for the experiments consist of CUDA implementations of
GPU kernels listed in Table 6.1. The benchmarks suite chosen covers GPU ker-
nels commonly used in scientific computing and signal processing applications:
dense matrix multiplication (DMM), sparse matrix vector multiplication (SpMV),
fast Fourier transform (FFT), Black Scholes, the 3D lattice-Boltzmann method
(LBM), merge sort, and a 3D stencil computation. These benchmark also exhibit
diverse behavior with respect to memory access patterns, data-dependent control
flow and memory references, mix of load and store operations, and the degree of
locality. Table 6.1 summarize the functionality of each benchmark and program
behaviors that it exhibits.
The overall probability distributions for the L1 and the L2 cache hit rates are
generated by the proposed stochastic model for the discussed GPU benchmarks
and results are displayed in Figures 6.2 to 6.8. Each histogram shows the proba-
bility of a specific predicted hit ratio on the y-axis based on results obtained from
1024 simulations. The dashed vertical red line marks the hit ratios reported by the
hardware counters.
58
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.95 Stdev: 0.00
l1_DMM
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.61 Stdev: 0.01
l2_read_DMM
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.00 Stdev: 0.00
l2_write_DMM
Figure 6.2: Probability distribution for the L1 and L2 hit ratios - dense matrix
multiply
59
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.00 Stdev: 0.00
l1_FFT
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.00 Stdev: 0.00
l2_read_FFT
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.42 Stdev: 0.01
l2_write_FFT
Figure 6.3: Probability distribution for the L1 and L2 hit ratios – FFT
60
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.00 Stdev: 0.00
l1_BlackScholes
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.00 Stdev: 0.00
l2_read_BlackScholes
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.00 Stdev: 0.00
l2_write_BlackScholes
Figure 6.4: Probability distribution for the L1 and L2 hit ratios – Black Scholes .
61
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.02 Stdev: 0.00
l1_SpMV
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.22 Stdev: 0.02
l2_read_SpMV
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.00 Stdev: 0.00
l2_write_SpMV
Figure 6.5: Probability distribution for the L1 and L2 hit ratios – sparse matrix
vector multiply
62
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f 
si
m
u
la
ti
o
n
s
Average: 0.44 Stdev: 0.01
l1_MergeSort
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f 
si
m
u
la
ti
o
n
s
Average: 0.00 Stdev: 0.00
l2_read_MergeSort
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f 
si
m
u
la
ti
o
n
s
Average: 0.00 Stdev: 0.00
l2_write_MergeSort
Figure 6.6: Probability distribution for the L1 and L2 hit ratios – merge sort
63
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.94 Stdev: 0.00
l1_LBM
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.04 Stdev: 0.00
l2_read_LBM
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.09 Stdev: 0.01
l2_write_LBM
Figure 6.7: Probability distribution for the L1 and L2 hit ratios –
lattice-Boltzman method
64
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f 
si
m
u
la
ti
o
n
s
Average: 0.14 Stdev: 0.00
l1_Stencil
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f 
si
m
u
la
ti
o
n
s
Average: 0.49 Stdev: 0.02
l2_read_Stencil
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f 
si
m
u
la
ti
o
n
s
Average: 0.00 Stdev: 0.00
l2_write_Stencil
Figure 6.8: Probability distribution for the L1 and L2 hit ratios – stencil
65
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f 
si
m
u
la
ti
o
n
s
Average: 0.43 Stdev: 0.00
l1_Sort
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f 
si
m
u
la
ti
o
n
s
Average: 0.95 Stdev: 0.00
l2_read_Sort
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f 
si
m
u
la
ti
o
n
s
Average: 0.00 Stdev: 0.00
l2_write_Sort
Figure 6.9: Probability distribution for the L1 and L2 hit ratios – merge sort
(global)
66
6.2 Error Quantification: Sampling Error – Precision
I report the average and the standard deviation for each hit ratio probability dis-
tribution in Figures 6.2 to 6.8. The standard deviation describes the spread of
the predicted hit ratios and determines how random variations affects the sensitiv-
ity and reliability of the proposed memory model. A small standard deviation –
0.01 on average for the outputs of the model – indicates that the predictions from
the model are robust with respect to the randomly sampled ordering of memory
requests.
0.0 0.2 0.4 0.6 0.8 1.0
Hit Ratio
0
10
20
30
40
50
60
%
 o
f s
im
ul
at
io
ns
Average: 0.22 Stdev: 0.02
l2_read_SpMV
Figure 6.10: Sampling error for L2 read hit ratio – sparse matrix vector multiply .
Based on the Law of Large Numbers (LLN) and the Central Limit Theorem
(CLT) – discussed in Section 2.1.1 – the probability of the expected predicted
hit ratios being within the the confidence interval [Ên − 2σ̂n, Ên + 2σ̂n] is 95%.
For example, L2 hit ratio predictions for the SpMV kernel shown in Figure 6.10
exhibit an average of 0.22 with a standard deviation of 0.02. Over large number
67
of trials, the true average is expected to fall within interval [0.18, 0.26] with a
probability of 95%.
6.3 Error Quantification: Systematic Error – Accuracy
I define the absolute error function Eabs(h, hˆ) = |h− hˆ| for measured (h) versus
estimated (hˆ) hit ratios, where h, hˆ ∈ [0, 1]. Relative error, Erel(h, hˆ) = |h−hˆ|h , is a
more meaningful measure of accuracy than the absolute error when the predicted
values are large and unbounded. For the case presented here, relative error does
not relate closely to the performance associated with the difference in the pre-
dicted cache hit ratios. For example, using the relative error one gets the same
accuracy for Erel(0.02, 0.04), Erel(0.2, 0.4) and Erel(0.5, 1.0). The fact that the
above errors would be judged to have equal importance is not correct from mi-
croarchitecture point of view. Furthermore, Erel would result in irrational high
errors for very small values of h that are close to zero. Therefore, I use absolute
error as a measure of accuracy.
I cross-validated the results predicted by the memory hierarchy model with
those read from the hardware performance counters provided by the NVIDIA.
The performance counters – listed in Table 2.1 – cannot be read or sampled during
the kernel execution. Before running the kernel, an environment variable is set to
activate the counters. After the kernel execution is finished, the values of the
counters that reflect the overall performance statistics with respect to all memory
references are written into a log file. The vertical dashed lines in Figures 6.2
to 6.8 highlight the hit ratios computed based on values read from the hardware
counters. When compared against hardware counters, the proposed approach have
an average absolute error of 3.4% for the L1 read hit ratios, 1.9% for the L2 read
hit ratios and 1.0% for the L2 write hit ratios.
68
6.4 Systematic Error Function Estimation
Experimental measurements are subject to inaccuracy and errors. When the error
is large enough such that its impact is noticeable on estimations, it is appropriate
to report and account for the uncertainty of the results.
Each predicted estimation for the kernel execution time contains an additive
error component. In this case, since it is infeasible to calculate the error function in
closed form, I will provide a statistical formulation of the error function based on
Bayesian analysis, introduced in Section 2.2. In what follows, the error function
is represented as a random variable. In general, to describe a random variable,
one needs to have access to its corresponding probability density function (PDF),
which expresses the relative likelihood of the function taking a given value.
To estimate the PDF of the error, I use a Bayesian curve fitting approach. The
Bayesian method updates a prior PDF (of the error function) with a set of new
observations to form a new posterior PDF. The resulting posterior distribution
will optimally formulate all the observed “knowledge.”
Since the error function is initially unknown, it is necessary to come up with
a prior PDF, which is constructed based on assumptions that are theoretically
compatible with a reasonable error model. In what follows, I summarize the as-
sumptions made in design of the error PDF:
1. I consider each of the observed errors (for the predictions) as a ”data point.
These data points are expressed in the precentage values (which can be
higher that 100%). Let T be the first integer value greater than the max-
imum value among such data points.
The interval [0, T ) is divided into T subintervals, each of lenght 1. For
example, with a maximum observed error of 19.7%, [0, 20) is divided into
subintervals [0, 1), [1, 2),· · ·, [19, 20).
69
exponential decay
factor: α
xT
Figure 6.11: Prior probability density function of the relative error
2. The likelihood of a data point (error) greater than T drops exponentially.
3. In the absence of additional prior knowledge, I assume a uniform prior dis-
tribution for the error (points) in the interval [0, T ). However, note that the
computed Bayesian posterior will not be uniformly distributed.
A reasonable prior PDF for the error that is consistent with the assumptions
stated in items 1-3 above is the following:
fprior err(x) =
 βIbxc if 0 ≤ x < Te−α(x−T ) if x ≥ T (6.1)
where, the indicator function Ibxc is the probability density of error occurring in
the subinterval [bxc, bx+1c). The real numbers α and β are chosen appropriately
as described in the following.
For ferr prior(x) to be a proper PDF [43], its integral over (−∞,+∞) should
70
sum up to 1. The area under the exponentially decaying function e−α(x−T ) is equal
to:
∫ +∞
T
e−α(x−T )dx =
1
α
(6.2)
For the prior PDF, I assume that the chances of error occurring within interval
[0, T ) is T times that of [T,+∞). This implies that β = 1
α
. Therefore, the area
under ferr prior(x) will be equal to:
lim
x→+∞
Ferr prior(x) =
∫ +∞
0
ferr prior(x)dx =
T
α
+
1
α
= 1 =⇒ α = T + 1
(6.3)
A reasonable prior PDF – shown in Figure 6.11 – can be defined by initializing
T to 100 percent.
Next, I compute the error for the predicted hit ratios for each of the discussed
benchmarks. The obtained errors serve as observations to update the prior PDF.
To this end, the value of T is updated according to the maximum error observed.
For example, this upper limit value for the L1 hit ratios in the studied benchmark
suite is equal to 8%.
Based on the Bayes rule, the probability (likelihood) of the observed error (data
points) is proportional to the number of times that they have occurred (among the
observations). To avoid overfitting the posterior PDF to the observed data [43],
all the T subinterval defined over [0, T ) are given a small and equal probability.
Therefore, given S sample points, the area under ferr post(x), the posterior PDF,
is defined as:
71
lim
x→+∞
Ferr post(x) =
∫ +∞
0
ferr post(x)dx =
T
α
+
1
α
+
T × S
α
= 1 (6.4)
=⇒ α = T + 1 + T × S
Here, each lenght-1 subinterval in [0, T ) and all the subintervals in [T,+∞)
have a 1/α chance of accommodating the error. In addition, each observed data
point effectively increases the probability of its corresponding subinterval by a
factor of T . Based on the observed error data and parameters derived from Equa-
tion 6.4, probabilities of Ix for x ∈ {0, 1, · · ·, T − 1} are updated accordingly.
Note that the error function defined here is one realization of the general form
described earlier in Definition 6.1. The relative weight of the area under each seg-
ment of the curve can be adjusted to tune the sensitivity of the posterior function
with respect to the observed data points.
Figure 6.12 shows the PDF of posterior error function computed for the L1 read
hit ratios, L2 read hit ratios and L2 write hit ratios respectively. With PDF of the
systematic error computed as described above – for a new predicted value, e.g.,
ĥ in case of cache hit ratios – one can generate a range of true values that the
new predicted value is likely to occupy. Therefore, the prediction result can be
expressed as PDF of the true value. Application of PDF of the prediction error
is covered in more detail in Section 8.2 when I discuss the probabilistic merit of
different application kernel configurations.
72
0 2 4 6 8 10 120
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
Error %
P
ro
b
a
b
il
it
y
D
is
tr
ib
u
ti
o
n
0 1 2 3 4 5 6 70
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
Error %
P
ro
b
a
b
il
it
y
D
is
tr
ib
u
ti
o
n
0 1 2 3 4 5 6 7 80
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Error %
P
ro
b
a
b
il
it
y
D
is
tr
ib
u
ti
o
n
Figure 6.12: Probability distribution of systematic error for predicted L1 and L2
hit ratios
73
CHAPTER 7
HIGH RESOLUTION PERFORMANCE
STATISTICS
In this chapter, I discuss how high resolution performance statistics are generated
by coupling source code level information and the results of the memory hierarchy
simulation model. This comprehensive view of the level of memory efficiency
exploited by individual data structures or memory operations within a particular
code segment is a crucial step for targeted memory optimizations.
I later illustrate though a simple example how this low level information can
highlight inefficient memory accesses and play up the data structures that are ideal
candidates for optimization.
7.1 The Average Latency per each Static Load
In Section 4.2.3, I presented a micro-benchmarking approach to measuring the
main memory access time, Tm, for each application. In Section 3.6.4, I discussed
my approach to measuring the L1 and the L2 cache latencies, denoted by t1 and
t2, with a simple micro-benchmark. I later summarized the overall probabilistic
predicted hit ratios by the proposed memory hierarchy model for the L1 and L2
caches, denoted by H1 and H2, in Figures 6.2 to 6.8.
In this section, I combine all these results using the following simple access
latency equation:
X(H1, H2, Tm) = H1 · t1 + (1−H1) · [H2 · t2 + (1−H2) · Tm] (7.1)
74
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Static Load ID in Source Code
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
La
te
n
cy
 (
m
ic
ro
se
co
n
d
s)
LBM
Figure 7.1: Breakdown of memory latency for static loads in LBM
In the above equation, X, Tm, H1, H2 are random variables; their probability
density function (PDF) can vary from one application to the other or, from one
static memory operation in the source code to another.
Since I propagate down the source code location of each memory operation, I
can obtain the PDF for individual static loads in the code. Based on that infor-
mation, for example, I can estimate the expected access latency, E(X), for each
static load in the kernel source code. Figures 7.1, 7.3 and 7.2 show the break-
down of the expected memory latency with respect to each static load for LBM,
SpMV, and Stencil benchmarks. This information can be used to target memory
optimizations such as tiling, thread coarsening and data layout transformation for
specific memory operations or data structures in the program.
75
0 1 2 3 4 5 6
Static Load ID in Source Code
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
La
te
n
cy
 (
m
ic
ro
se
co
n
d
s)
Stencil
Figure 7.2: Breakdown of memory latency for static loads in Stencil
7.2 Case Study: Sparse Matrix Vector Multiplication
Kernel
In this section I show through an example how high resolution performance in-
formation can be used to apply targeted optimizations. Initial examination of the
sparse matrix vector multiplication kernel source code shown in Listing 7.1 im-
plies that loads from lines 11 and 13 each exhibits a high level of intra-thread
temporal locality as they are executed within a tight loop using consecutive in-
dices. Therefore, the latency for these loads is expected to be low. On the other
hand, memory references from the load in line 12 are not analyzable statically and
expected to follow an irregular pattern resulting in poor data locality.
Table 7.1 summarizes information with respect to memory loads for this kernel
76
1 f l o a t t = 0 ;
2 unsigned i n t myi = b i d ∗ BLOCKSIZE + t i d ;
3
4 i f ( myi < numRows ){
5
6 unsigned i n t l b = r o w I n d i c e s [ myi ] ;
7 unsigned i n t ub = r o w I n d i c e s [ myi + 1 ] ;
8
9 f o r ( j = l b ; j < ub ; j ++) {
10
11 unsigned i n t i n d = i n d i c e s [ j ] ;
12 y v a l = y [ i n d ] ;
13 t += v a l [ j ] ∗ y v a l ;
14
15 }
16
17 x [ myi ] = t ;
18 }
Listing 7.1: Sparse matrix vector kernel
1 f l o a t t = 0 ;
2 unsigned i n t myi = b i d ∗ BLOCKSIZE + t i d ;
3 i f ( myi < numRows ){
4
5 unsigned i n t l b = r o w I n d i c e s [ myi ] ;
6 unsigned i n t ub = r o w I n d i c e s [ myi + 1 ] ;
7
8 / / p r o l o g u e code −− a l i g n m e n t a d j u s t m e n t
9 .
10 .
11 .
12
13 f o r ( j = newlb ; j < newub ; j +=4) {
14
15 u i n t 4 i n d = i n d i c e s [ j / 4 ] ;
16 f l o a t 4 v a l u e = v a l [ j / 4 ] ;
17
18 f l o a t y v a l = y [ i n d . x ] ;
19 t += v a l u e . x ∗ y v a l ;
20 y v a l = y [ i n d . y ] ;
21 t += v a l u e . y ∗ y v a l ;
22 y v a l = y [ i n d . z ] ;
23 t += v a l u e . z ∗ y v a l ;
24 y v a l = y [ i n d .w ] ;
25 t += v a l u e .w ∗ y v a l ;
26 }
27
28 / / e p i l o g u e code −− a l i g n m e n t a d j u s t m e n t
29 .
30 .
31 .
32
33 x [ myi ] = t ;
34 }
Listing 7.2: Vectorized sparse matrix vector kernel
77
0 1 2 3 4
Static Load ID in Source Code
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
La
te
n
cy
 (
m
ic
ro
se
co
n
d
s)
SpMV
Figure 7.3: Breakdown of memory latency for static loads in SpMV
and Figure 7.3 shows the predicted load latency for each of the static loads of
SpMV kernel – generated by the stochastic memory hierarchy model – consider-
ing the interactive effects of all concurrent memory operations. Results reported
in Figure 7.3 are different from the speculation derived based on manual source
code examination. Based on this new information when considering the interac-
tions among the concurrently running threads, loading the vector value (line 12)
is more efficient compared to the other two static loads that initially exhibit a
high reuse pattern. This observation makes these two loads the most promising
candidates with respect to memory optimization.
Listing 7.2 shows an optimized version of the SpMV kernel with intra-thread
vectorization applied to loads initially from lines 11 and 13 to exploit the available
78
Static Load ID Corresponding line in
Listing 7.1
Avg. Dynamic count
per thread
0 6 1
1 7 1
2 11 51
3 12 51
4 13 51
Table 7.1: Dynamic load count per static load in SpMV kernel
data reuse that cannot be efficiently captured by the available data caches. The
optimized kernel uses just enough extra registers to maintain the initial occupancy
level of streaming multiprocessors, which combined with more efficient memory
accesses results in a speedup of 2.9 compared to the initial kernel configuration.
79
CHAPTER 8
MEASURING KERNEL PERFORMANCE:
A GLOBAL PERSPECTIVE
In the previous chapter I discussed how the expected memory latency for a static
load in the GPU kernel is computed based on the estimated L1 and L2 hit ratios
associated with it and the expected main memory latency for the kernel. Being
able to identify the least performing set of memory operations in the code is a
valuable piece of information for an application developer or a compiler. In this
chapter, I will take that information one step further to determine the dominating
detrimental factor with respect to the performance of a kernel. Particularly, the
goal is to determine to what degree a GPU kernel is memory bound, if at all.
8.1 Compute and Memory Time
I decompose the execution time of a GPU kernel into the compute time and the
memory time. The former is the time to complete the computation assuming an
ideal memory system, while the latter is the time needed to flow all data required
by computational cores from memory.
As mentioned in Section 5, the instrumentation framework collects an estimate
of the number of dynamic instructions executed during the kernel runtime by in-
serting probes at control flow entry and exit points. The number of instruction
that the kernel executes between two control flow probes is estimated by the num-
ber of the abstract syntax tree nodes (representing actual computation) traversed
when control flow proceeds from one probe to the other. It should be noted that
80
this instruction count may not reflect the following properties of the computational
instructions:
• The actual number of instructions executed by the hardware – For example,
a multiplication followed by an add may be converted into a multiply-add
instruction by the compiler.
• Different types of instructions – A double precision multiply and a jump
instruction are each counted as one instruction, though the streaming cores’
execution bandwidth with respect to each is different.
• Data dependencies – Memory or back-to-back register dependencies are not
reflected by the instruction count.
The above limitations does not allow for a detailed instruction scheduling ap-
proach, but it will provide a relatively reasonable estimate to capture the first
order performance degradation factors as indicated by the results discussed in
Section 8.2.
The compute time of a kernel is estimated by examining the dynamic instruction
count from the collected traces for a snapshot of the kernel execution. An average
dynamic instruction count for warps of a thread block is computed based on the
above information. The GPU hot clock frequency, which is the execution speed
of individual streaming cores, is identified through a call to the CUDA program-
ming APIs. Occupancy and the number of available streaming multiprocessors
are similarly computed as describe earlier in Listing 3.1.
8.1.1 Memory Level Parallelism
Memory-Level Parallelism (MLP) is the ability to simultaneously service multiple
memory transactions. The analytical model treats each streaming multiprocessor
81
as a number of in order cores that exploit memory level parallelism (MLP) through
concurrent memory accesses issued from a set of coexisting threads. Coexisting
threads are congregated from:
1. threads tightly coupled together through a SIMD memory vector instruction
2. threads executed tightly close together as a result of interleaving the execu-
tion of coexisting warps
To compute the memory time in the presence of the MLP, the analytical model
makes the following assumptions:
1. A warp with outstanding memory transaction(s) initiated by a previous mem-
ory load cannot issue any new memory loads until all previous outstanding
transactions are resolved.
2. The computation within a kernel is divided into a sequence of memory loads
followed by a sequence of other type of instructions. A dynamic instruction
stream, in the presence of enclosing loop constructs, generates separate se-
quences of memory and compute regions that are executed sequentially.
3. Graphics processors based on the Fermi architecture, support up to 48 con-
current warps on each streaming multiprocessor. And each streaming multi-
processor contains 32 in order cores each capable of issuing an independent
memory transaction. Therefore, it is reasonable to assume a total of 32,
48, or 64 outstanding loads (Los) per multiprocessor. Based on an error
minimization process, I assume a maximum of 48 outstanding loads for the
experimental results presented in this section.
4. As thread blocks enforce implicit synchronization points at their entry and
exit points, any measurement of critical path or resources should be con-
ducted at a thread block level.
82
Assumption 1 is a reasonable simplification when the amount of inter-thread
MLP is large enough to consume the corresponding hardware resources, i.e., the
number of outstanding loads. Thus, the model overlaps the latency of loads from
different threads and serializes the intra-thread load latencies as shown in Fig-
ure 8.1. This is a reasonable assumption for most memory bound GPU appli-
cations as they are composed of a large number of concurrently running threads
that can easily saturate the hardware limits. In cases that the inter-thread MLP
– computed later in Equation 8.1 – is smaller than Los, i.e. LosMLP > 1, up to
d Los
MLP
e memory loads – if they exist – within a memory region can be issued
together, which results in a shorter memory time critical path. Accordingly, the
initial inter-thread MLP computed in Equation 8.1 is also updated to reflect the
exposed intra-thread MLP.
With the above assumptions, the degree of inter-thread MLP presented in a ker-
nel or in other words, the number of inflight memory transactions, is determined
by:
Coalescing Factor: Coalescing Factor (CF) represents the number of distinct
memory transactions issued on average for a memory load operation in a
kernel. For example, initially a coalescing factor of one means that memory
locations referenced by all threads in a warp are mapped to a single cache
block. The coalescing factor can vary from one memory load to another
within a kernel. To capture the overall behavior, an average CF per warps
of a thread block is computed over the total number of dynamic loads col-
lected for the snapshot of the kernel execution and is then multiplied by the
number of warps per thread block.
Warp Level parallelism: Warp Level Parallelism (WLP), analogous to the no-
tion of Thread Level Parallelism (TLP), refers to the number of warps that
83
are being executed concurrently on a GPU. The execution of concurrent
warps is heavily interleaved by the warp scheduler. Therefore, it is reason-
able to assume that memory operations from concurrently executing warps
are issued tightly close together. This is specially true if the kernel is mem-
ory bound.
Based on the described assumptions and definitions, the available inter-thread
MLP within a kernel is estimated by:
MLP =
CF ×WLP
TBwarps
(8.1)
where, TBwarps is the number of warps per one thread block.
Based on dynamic memory traces, the expected L1 and L2 hit ratios, and the
main memory access latency, the critical path with respect to memory cycles
(time) is identified for each thread block. Different warps within a thread block
may have different dynamic number of memory loads, which results in load im-
balance with respect to the number of memory operations. Threads within a thread
block synchronize once at the beginning of the thread block execution and once
when the thread block execution ends. Therefore, the critical path analysis is con-
ducted in the granularity of a thread block.
8.1.2 Utilization of the Available Memory Level Parallelism
According to the execution model that I described earlier, GPUs exploit inter-
thread MLP to overlap the memory latency of loads issued from coexisting threads.
Figure 8.1 illustrates the case when the MLP available within a kernel can be
fully exploited by overlapping the latency of all memory operations (of different
threads) ready to be issued.
84
Figure 8.1: Scheduling of memory and compute cycles
For the sake of simplicity, assume there are a total of four concurrent thread
blocks. The kernel in this example has 3 static loads – denoted by L 0, L 1, and
L 2. The expected memory latency varies from one static load to the other and
is proportional to the length of the vertical blocks in Figure 8.1. Static loads are
repeated according to the dynamic control flow pattern present within (the warps
of) a thread block. As a result the computation within a thread block is divided into
multiple segments or regions. Each region starts with a set of memory loads that
read the required data. When the data is ready, the region starts it computation.
A kernel consists of one or more memory regions each followed by a compute
85
region. A sequence of memory-compute regions can occur when there is a loop
within the body of a kernel.
To estimate the critical memory path (time), multiple (equal to the thread-block
occupancy of a kernel) sample thread blocks are chosen. The length of the criti-
cal paths for memory access time of the sampled thread blocks are computed by
scheduling their dynamic memory loads back-to-back and then adding up the es-
timated memory access latency of each memory load (based on its static ID). The
critical paths are then averaged to obtain the expected total memory access latency
for a typical thread block, Tmem path. Next, memory time for a set of concurrently
running thread blocks (threads) is estimated by the following equation:
Tmem = Tmem path ×max(1, MLP
Los
) (8.2)
The inter-thread MLP is estimated by Equation 8.1. Equation 8.2 states that
memory access time for concurrently executing threads can be overlapped as long
as enough hardware resources (outstanding load entries) exist to accommodate the
inter-thread MLP available within a kernel. Otherwise, the memory level paral-
lelism is repressed, which results in serialization of a portion of memory accesses.
Figure 8.2 illustrates this serialization effect with an example similar to the one
shown in Figure 8.1. The difference between the two cases is that in Figure 8.2
CF is increased such that the available inter-thread MLP is larger (almost twice in
this example) than Los. Therefore, memory accesses from different thread blocks
(threads) are serialized. Equation 8.2 accounts for this serialization effect via the
term MLP
Los
.
Figure 8.1 also shows how execution of compute cycles – represented by darker
blocks – is interleaved with the latency of memory loads in the absence of ordering
constraints enforced via synchronizations points. If the kernel is memory bound,
86
Figure 8.2: Serialization of concurrent memory accesses
compute time is fully overlapped with the memory time, except for the compute
time of the last region of a thread block.
As mentioned before, compute time is computed based on an estimate of the
87
number of instructions nodes in the abstract syntax tree. These instructions and
consequently the count associated to them can be different from the actual stream
of instructions being executed by the hardware. Therefore, it is more appropriate
to assume a bulk scheduling model for their execution. Following this assump-
tion and the concepts of memory and compute regions introduced before, the total
count of instructions – which corresponds to the compute time Tcomp – is amor-
tized over the total number of regions as follows:
Tregion =
Tcomp
Nregions
(8.3)
Note that Tmem corresponds to the memory time for all the concurrently running
thread blocks (threads), while Tcomp account for compute time of only on thread
block. Therefore, for a set of concurrently running thread blocks on a streaming
multiprocessor, total compute time is equal to:
Tcomp total =
Tcomp ×WLP
TBwarps
(8.4)
1. If Tcomp total for a kernel is less than the corresponding Tmem, the kernel
is memory bound and the execution time for a single set of concurrently
executing thread blocks – one execution window – can be estimated by:
Twindow = Tmem + Tregion × WLP
TBwarps
(8.5)
According to the above equation, the overall execution time critical path is
composed of memory time – latency of concurrently issued memory loads
from simultaneous threads – and the time required to execute the last com-
pute block of all the tread blocks scheduled on the same streaming multi-
processor.
88
2. For compute bound kernels, compute time, Tcomp total, dominates the exe-
cution time of a kernel. Consequently, Twindow is equal to Tcomp total.
3. In the presence of synchronization points, the critical path for the execu-
tion time of a kernel is constructed based on the assumption that memory
and compute times cannot be interleaved. This assumption is accurate with
respect to the warps within a thread-block as synchronization points are
inserted between memory regions and compute regions. Therefore the la-
tency of compute instructions cannot overlap with the latency of memory
loads from the same thread block. With respect to different thread blocks,
the assumption is still reasonable since now both compute time and mem-
ory time are part of the critical path. For example, if compute time of the
current thread block is overlapped with the memory time of another thread
block, part of the memory time of the current thread block that was sup-
posed to be overlapped with the memory time of the second thread block is
now exposed.
Therefore, Twindow is computed as:
Twindow = Tmem + Tcomp total (8.6)
To obtain the total execution time of a kernel, Twindow computed by the above
equations is multiplied by the total number of execution windows for a kernel,
denoted by Nwindows, which is equal to:
Nwindows =
NTBs
NSMs × WLPNwarps
(8.7)
The model discussed in this section treats loads from the shared memory the
same as global memory accesses. Shared memory accesses are even coalesced
89
to account for bank conflicts and accesses that result in multiple shared memory
loads. The only difference is that references to shared memory have the fixed
memory access latency that is equal to the L1 cache hit access time.
Figure 8.3 shows the scatter plot for the predicted versus measured execution
time for the benchmark suite discussed in Section 6. The results indicates an
average relative error of 16%, with the highest prediction error of 27.1%.
100 101 102 103
Measured(ms)
100
101
102
103
P
re
d
ic
te
d
(m
s)
Precicted v.s. measured execution time(ms)
Figure 8.3: Scatter plot of predicted versus measured execution times
8.2 Probabilistic Merit of a Kernel
In Section 6.4, I discussed a curve fitting approach to estimate the PDF of the
systematic error of a model. Following a similar discussion a posterior PDF is
computed for the relative error of the predicted execution time for each of the
90
0 5 10 15 20 25 30 35 400
0.02
0.04
0.06
0.08
0.1
0.12
0.14
Error %
P
ro
b
a
b
il
it
y
D
is
tr
ib
u
ti
o
n
Figure 8.4: Posterior probability density function of the relative error
discussed benchmarks; the obtained relative errors serve as observations to up-
date the prior PDF introduced in Section 6.4. Figure 8.4 shows the PDF of the
computed posterior error function, ferr post(x).
Now that the error function, ferr post(x) is defined, for a predicted execution
time, t̂, one is interested in the range of true values that the execution time is
likely to occupy. The PDF for the true value of the execution time, t(E), where E
is the random variable representing the relative error with PDF of ferr post(x), is
defined as follows:
|t(E)− tˆ|
t
= E =⇒ t(e) =

tˆ
1+E
E ≥ 1
tˆ
1+E
, tˆ
1−E E < 1
(8.8)
The PDF t(E) expresses the range representing the probable value of the true
execution time. Therefore, when comparing the relative merit of two predicted
91
values tˆ1 and tˆ2, instead of asking the question, ”Is tˆ1 ≤ tˆ2?”, the question is
rephrased to: ”What is the probability of tˆ1 ≤ tˆ2?” In other words, the goal is to
evaluate P (tˆ1 ≤ tˆ2) and not to determine whether tˆ1 ≤ tˆ2.1
To compute the probability of P (tˆ1 ≤ tˆ2), ferr post(x) is sampled and based
on each collected data point for error, the potential value(s) for each t(E) are
computed. The number of the data points collected determine the resolution of the
sampling. In this work, I choose 500 sample data points, which results in a total
of 1000 sampled values for each pair of t1(E) and t2(E). The number of samples
chosen from each subinterval [bec, be + 1c) is proportional to the likelihood of
random variable E occurring within that interval.
Next, the n sampled values for t1(E) and t2(E) are sorted in ascending order
as St1 = {t1;0, t1;1, · · ·, t1;n−1} and St2 = {t2;0, t2;1, · · ·, t2;n−1} respectively such
that tk;i ≤ tk;i+1.
To compute the probability P (tˆ1 ≤ tˆ2) for each sample point t1;i from St1 , the
following steps are performed:
1. Find the first sample point t2;j in St2 such that t1;i < t2;j . If no such t2;j
exists set j = n.
2. Let pi = 1n × n−jn . The term pi represents the probability of choosing the
sample point t1;i in St1 and the value of t1;i being less than or equal to a
sample point chosen in St2 .
Consequently, P (tˆ1 ≤ tˆ2) is the sum of pi terms computed above:
P (tˆ1 ≤ tˆ2) =
n−1∑
i=0
pi (8.9)
1P (X) is defined as the probability of event X .
92
8.3 Case Study: 7-Point Stencil
In this section, following the previously discussed approach for evaluating the
merit of a kernel, I will investigate three optimized version of the stencil code,
which represent important class of scientific computation such as partial differen-
tial equation solvers that work based on structured grids.
In a general stencil code, data is arranged in a regular two or tree dimensional
grid. The physical location of the grid can be directly derived from the indices
of the array elements. Owing to the regular access patterns that can be statically
determined and a low computational intensity, a number of effective optimization
strategies can be applied to the stencil code.
Here, I look into a well studied stencil code [44] representing a three dimen-
sional heat equation initially expressed as:
B[ i , j , k ] = c0 ∗ A[ i , j , k ] +
c1 ∗ ( A[ i − 1 , j , k ] + A[ i , j − 1 , k ] +
A[ i , j , k − 1] + A[ i + 1 , j , k ] +
A[ i , j + 1 , k ] + A[ i , j , k + 1 ] )
During each operation a point in the grid is updated by a weighted linear com-
bination of a subset of its neighbors.
8.3.1 Exploiting Locality through the L1 Cache
The first version of the stencil benchmark, shown in Figure 8.1, is an improved
version of the naive stencil code that uses one thread to calculate one output point.
In this improved version, as shown in Figure 8.1, each thread computes a one-
element column across the z-dimension of the grid. As a result of this tiling op-
timization, threads reuse the data along the z-dimension. Threads within a warp
93
(and a thread block) compute a rectangular column along the z-dimension. There-
fore, although threads load their input values independently from the global mem-
ory, a good portion of the memory accesses get coalesced.
The estimated memory time for this kernel is almost three times of the estimated
compute time, indicating a memory bound kernel.
According to the core computation discussed above, each internal point con-
tributes to calculation of seven output values – itself, 4 planar neighbors, and the
top and bottom grid points. Corner points are used for fewer output values. With
this memory access pattern, the only justifiable reason for a kernel to be memory
bound is that the working set for the concurrently running thread blocks exceeds
the effective capacity of the L1 cache and the existing temporal locality is not fully
exploited.
An effective optimization commonly used for such a scenario is tiling. There-
fore, the next natural choice of the optimized kernel is one that is tiled across the
xy plane.
8.3.2 Two Dimensional Tiling in Shared Memory
To keep the working set of the concurrently running threads close to the streaming
cores, the new version of the kernel uses shared memory to store grid points of
the three active planes (top, current, and bottom) across x and y dimensions.
In the kernel shown in Listing 8.2, after each phase of computation along the
z-dimension, threads update the content of the shared memory with the elements
of the next three relevant slice along the z-dimension – currently hold locally in
registers.
Since corner points are used for fewer output values and considering the limited
capacity of the shared memory, only the the middle elements of a plane are stored
94
1
2 g l o b a l void b l o c k 2 D t i l i n g ( f l o a t c0 , f l o a t c1 , f l o a t ∗A0 ,
3 f l o a t ∗Anext , i n t nx , i n t ny , i n t nz )
4 {
5
6 i n t i = b l o c k I d x . x ∗ blockDim . x + t h r e a d I d x . x ;
7 i n t j = b l o c k I d x . y ∗ blockDim . y + t h r e a d I d x . y ;
8
9 i f ( i > 0 && j > 0 && ( i < nx − 1) && ( j < ny − 1) ){
10
11 f l o a t bot tom = A0 [ i + nx ∗ ( j + ny ∗ 0 ) ] ;
12 f l o a t c u r r e n t = A0 [ i + nx ∗ ( j + ny ∗ 1 ) ] ;
13
14 f o r ( i n t k = 1 ; k < nz − 1 ; k ++){
15 f l o a t t o p = A0 [ i + nx ∗ ( j + ny ∗ ( k + 1 ) ) ] ;
16
17 Anext [ i + nx ∗ ( j + ny ∗ k ) ] =
18 ( t o p + bot tom +
19 A0 [ i + nx ∗ ( j + 1 + ny ∗ k ) ] +
20 A0 [ i + nx ∗ ( j − 1 + ny ∗ k ) ] +
21 A0 [ i + 1 + nx ∗ ( j + ny ∗ k ) ] +
22 A0 [ i − 1 + nx ∗ ( j + ny ∗ k ) ] ) ∗ c1
23 − c u r r e n t ∗ c0 ;
24
25 bot tom = c u r r e n t ;
26 c u r r e n t = t o p ;
27 }
28 }
29 }
Listing 8.1: Stencil code loading data from the L1 and L2 caches
in the shared memory. The rest of the elements are loaded directly from the main
memory.
The estimated memory time for this version drops almost 70% compared to the
previous version. But the compute time increases by 40%. Since threads synchro-
nize frequently (after each step of computation), in this version both memory time
and compute time contribute to the total execution time. As a result, the over all
execution time reduction is only 4% compared to the previous version.
8.3.3 Thread Coarsening
Based on the data collected from the model, previous kernel with 2-dimentional
tiling in shared memory, observed an increase in the compute time portion of the
execution time.
A commonly performed optimization to reduce redundant calculations or mem-
95
1 g l o b a l void b l o c k 2 D h y b r i d ( f l o a t c0 , f l o a t c1 , f l o a t ∗A0 , f l o a t ∗Anext ,
2 i n t nx , i n t ny , i n t nz )
3 {
4
5 cons t i n t i = b l o c k I d x . x ∗ blockDim . x + t h r e a d I d x . x ;
6 cons t i n t j = b l o c k I d x . y ∗ blockDim . y + t h r e a d I d x . y ;
7 cons t i n t s h i d = t h r e a d I d x . x + t h r e a d I d x . y ∗ blockDim . x ;
8
9 / / sh are d memeory
10 ex tern s h a r e d f l o a t sh A0 [ ] ;
11 sh A0 [ s h i d ]= 0 . 0 f ;
12 s y n c t h r e a d s ( ) ;
13
14 / / compute r e g i o n s f o r load and s t o r e
15 cons t bool r e g i o n = i > 0 && j > 0 && ( i < nx − 1) && ( j < ny − 1) ;
16 cons t bool x l b o u n d = ( t h r e a d I d x . x == 0 ) ;
17 cons t bool x h bound = ( t h r e a d I d x . x == ( blockDim . x − 1 ) ) ;
18 cons t bool y l b o u n d = ( t h r e a d I d x . y == 0 ) ;
19 cons t bool y h bound = ( t h r e a d I d x . y == ( blockDim . y − 1 ) ) ;
20
21 / / r e g i s t e r f o r bo t tom and t o p p l a n e s
22 f l o a t bot tom = 0 . 0 f , t o p = 0 . 0 f ;
23
24 / / l oad da ta f o r t h e bo t tom and c u r r e n t p l a n e s
25 i f ( ( i < nx ) && ( j < ny ) ){
26 bot tom = A0 [ Index3D ( nx , ny , i , j , 0 ) ] ;
27 sh A0 [ s h i d ] = A0 [ Index3D ( nx , ny , i , j , 1 ) ] ;
28 }
29
30 s y n c t h r e a d s ( ) ;
31
32 f o r ( i n t k = 1 ; k < nz − 1 ; k ++){
33
34 f l o a t a l e f t r i g h t , a r i g h t l e f t , a t o p , a down ;
35 / / l oad r e q u i r e d da ta on xy p l a n e s
36 / / i f i t i s i n t h e sh ar ed memory , l oad from sh ar ed memory
37 / / o t h e r w i s e , l oad from t h e g l o b a l memory
38 i f ( ( i < nx ) && ( j < ny ) )
39 t o p =A0 [ Index3D ( nx , ny , i , j , k + 1 ) ] ;
40
41 i f ( r e g i o n ){
42
43 a l e f t r i g h t = x l b o u n d ?A0 [ Index3D ( nx , ny , i − 1 , j , k ) ] : sh A0 [ s h i d − 1 ] ;
44
45 a r i g h t l e f t = x h bound ?A0 [ Index3D ( nx , ny , i + 1 , j , k ) ] : sh A0 [ s h i d + 1 ] ;
46
47 a down = y l b o u n d ?A0 [ Index3D ( nx , ny , i , j − 1 , k ) ] : sh A0 [ s h i d − blockDim . x ] ;
48
49 a t o p = y h bound ?A0 [ Index3D ( nx , ny , i , j + 1 , k ) ] : sh A0 [ s h i d + blockDim . x ] ;
50
51 Anext [ Index3D ( nx , ny , i , j , k ) ] = ( a l e f t r i g h t + a r i g h t l e f t + a t o p +
52 a down + t o p + bot tom ) ∗ c1 − sh A0 [ s h i d ] ∗ c0 ;
53
54 }
55
56 / / swap t h e da ta
57 s y n c t h r e a d s ( ) ;
58 bot tom = sh A0 [ s h i d ] ;
59 sh A0 [ s h i d ] = t o p ;
60 s y n c t h r e a d s ( ) ;
61 }
62
63
64 }
Listing 8.2: Stencil code using shared memory for elements in the middle, and
L1 and L2 for the hollows
96
Version Memory
time %
Predicted
bottleneck
Predicted exe-
cution time im-
provement %
Measured exe-
cution time im-
provement %
No shared memory 100% memory
time
NA NA
Shared memory
2D tiled
46% compute
time
4% ≈ 0
Coarsened shared
memory 2D tiled
52% memory
time
11% 21%
Table 8.1: Summary of predicted performance information for the Stencil kernels
ory accesses is thread coarsening. Thread coarsening consists of packing the
computation of multiple fine grain threads into a more coarse grained one to avoid
redundant work. The downside is that in order to perform redundant computation
once and save the results, more registers are required. Increase in register usage
may decrease the occupancy and utilization of the streaming cores.
To investigate this potential optimization, the previous shared memory tiled
version of the stencil kernel is coarsened (tiled) across the x-dimension. Now each
thread performs twice the amount of work it used to do. The coarsened version
is run through the performance modeling framework. The result of the analysis
indicates that occupancy is not affected by the increase in the number of registers
used and that the compute time improves by 16% compared to the shared memory
tiled version. The overall execution time is reduced by 6% and 11% compared to
the shared memory tiled and the initial version respectively.
Table 8.1 summarize the performance results collected from the model and the
GPU hardware for the three discussed kernels: The second column reports the
contribution of the predicted memory time to the total predicted execution time for
each kernel, based on which, the major time consuming component is identified
in column 3. Columns 4 and 5 measure the amount of performance improvement
for predicted (model) and measured (hardware) execution times with respect to
97
Y
X
cache cache & tiled cache & tiled
& coarsened
cache NA 0.40 0.34
cache & tiled 0.60 NA 0.40
cache & tiled
& coarsened
0.66 0.60 NA
Table 8.2: Comparison of the probabilistic relative error of the stencil kernels –
reported probability for each pair of X and Y correspond to P (X ≤ Y ).
the first kernel that does not use shared memory tiling.
As discussed in Section 1.5, expressing the predicted performance or execution
time through to a single number without accounting for the potential deviation of
the output of the model from the real execution time is not appropriate. Therefore,
based on the approach discussed in Section 8.2 the relative probabilistic merits of
the three stencil kernels are computed and summarized in Table 8.2.
Probabilities reported in Table 8.2 put the raw prediction numbers into perspec-
tive with respected to the previously observed error in the model. For example,
according to the data presented in Table 8.2, chances of the coarsened stencil ker-
nel outperforming the initial version is 66%. The raw execution time reported by
the model are 25.95ms and 29.10ms respectively. Actual measurement on the
GPU results in corresponding execution times of 22.60ms and 29.00ms, which
follows the same trend predicted by the model. But one needs to account for the
potential additive error reported along with the prediction times. After perturbing
the raw predictions with the potential error, a more realist comparison states that
likelihood of version 3 performing better than the first kernel is 66%.
In general, the larger the magnitude of the difference between two predicted
execution times is, given a PDF for the error, the higher is the chance of circum-
venting the uncertainty of predictions caused by the systematic error present in
the model.
98
CHAPTER 9
CONCLUSIONS AND FUTURE WORK
9.1 Conclusions
This work presents a novel solution to the problem of providing meaningful per-
formance feedback to developers in highly multithreaded graphics processors.
The stochastic modeling approach presented allows for using a simple and effi-
cient tracing scheme without user intervention and without concerns for distorted
execution timing. It further provides confidence level information in the presence
of scheduling uncertainties. Its sampling strategy also allows one to minimize the
cost of tracing and simulation. The close match between the generated distribu-
tion mean and the sampled hardware performance counter values provides good
validation for the proposed stochastic memory hierarchy model.
The high resolution performance statistic generated through coupling source
code level instrumentation and the memory hierarchy simulation model provides
a comprehensible view of the level of memory efficiency being exploited by in-
dividual data structures in the program. This information, though still far from
targeted optimization hints to a developer or compiler, is a crucial and fundamen-
tal step forward towards that goal.
The analytical model built on top of the stochastic memory hierarchy model,
on the other hand, provides a big picture overview of the overall performance and
the first order bottlenecks in a GPU kernel.
99
9.2 Future Work Direction
There are several avenues for future work. While the model is demonstrated and
validated based on CUDA applications and the Tesla C2050 hardware, it can po-
tentially be applied to other highly multithreaded processors too. As a direction
for future work, it would be useful to extend the system to OpenCL [2] applica-
tions and apply it to the AMD GPUs for further evaluation across a wider range
of GPU architectures.
Another avenue for this work is to augment the current framework with an intel-
ligent component to mine the performance statistics and retrieve more insightful
information. For example, it would be interesting to collect extra raw performance
data from the memory hierarchy model and later analyze it with respect to the in-
teraction of memory accesses to different data structures in the program, or even
accesses to the same data structure that are initiated from different locations in the
source code.
To evaluate these interactions, it may be beneficial to feed partial traces – mem-
ory addresses from specific static memory locations in the program – to the mem-
ory hierarchy simulator. Then, one can evaluate the performance of the memory
system in the absence of a subset of memory accesses and potentially set guide-
lines based on which one can optimally map the working set of an application into
the proper address spaces available on a GPU. For example, a set of memory refer-
ences may have detrimental effects – by causing conflict misses – on the existing
locality within a data structure that is initially mapped to the global memory. If
the subset is isolated and loaded into the shared memory, the detrimental effect of
conflict misses may fade away.
Another interesting direction would be investigating the potential speed up and
performance improvement of a data parallel computation – not necessarily ex-
100
pressed as a GPU kernel – on a target GPU computing platform. A systematic
approach to solve this problem, in addition to performance analysis techniques
presented in this work, requires an efficient mapping of data structures into the
available address spaces on the a GPU, e.g., global memory, shared memory, etc.
101
APPENDIX A
SYMBOLIC EVALUATION
In this chapter, I propose a tractable approach to symbolic evaluation of conditions
and array access expressions. The propposed symbolic evaluation system has two
components:
1. A symbolic execution engine for evaluating effects of expressions that are
simple enough.
2. A set of simplification rules for simplifying more complex expressions.
If the expression to be evaluated is too complicated for the symbolic evaluation
engine, the expression is conditionally simplified such that the evaluation of the
simplified expression implies results equal to that of the original expression. The
simplification process may be applied recursively until tractable expressions are
obtained or the simplification engine decides that no further simplification is ap-
plicable. In the latter case, the expression is interpreted conservatively, which
results in a lower bound for the performance.
During the simplification process certain constraints on free variables may be
imposed, e.g., input parameters. The validity of the results is contingent on the
truth of the conjectures made about free variables by the simplification system.
Therefore, the results of each evaluation step are accompanied by a set of con-
straints that specify the domain of applicability of each evaluation step.
102
A.0.1 Structural Conditions
A predicate denotes either a structural condition or a data-dependent condition.
A structural conditional expression is a function of thread coordinates, enclosing
loop induction variables and symbolic constants of the kernel. In this section I will
show how to analyze a large class of structural conditions that are defined by Def-
inition (A.1), through symbolic evaluation. Analyzing data-dependent conditions
requires statistical knowledge of the data and is discussed in Section 5.
O(At + B, I(i)), I(i) = αi + β or I(i) = 2αi+β (A.1)
i ∈ {0, 1, ..., I}, t ∈ {0, 1, ...,T− 1} and A,B, α, β ∈ Z
O(x, y) ∈ {x = y, x 6= y, x ≤ y, x ≥ y, x (mod y) ≡ 0}
For brevity, I consider one-dimensional thread-blocks of the size T, where T is
well-bounded due to hardware limitations. Variable t stands for the thread ID in
the affine expression At + B. In Definition (A.1), i is the loop induction variable
with the upper bound of I, which may not be known at compile time. Operator
O can be either a comparison or a modulo operator. Without loss of generality,
I restrict the discussions to ≤ for comparison operators, A, B, α, β ∈ N, and
I(i) = αi + β. Conclusions on other cases can be derived similarly. The follow-
ing two cases apply, based on the type of operator O.
1. O is a comparison operator: At + B ≤ αi + β
We solve At + B for each t ∈ {0, 1, ...,T− 1}. Let At′ + B ≤ αi + β for
0 ≤ t′ < T; we have At′+B−β
α
≤ i ≤ I. Consequently, the total number of
steps for symbolic evaluation is given by:
103
T−1∑
t=0
I−
⌈At + B− β
α
⌉
+ 1 ≈ 2T(αI− B + β + α)− AT(T− 1)
2α
(A.2)
Notice that for I0 ≤ i, where I0 = A(T−1)+B−βα , all threads satisfy the in-
equality. In other words, control flow divergence does not occur for com-
putation steps greater than I0. If I replace I with I0 in Eq. (A.2) the upper-
bound for the total number of evaluation steps becomes:
AT2 −AT+ 2αT
2α
(A.3)
Notice that I have iterated over the thread IDs. To measure the thread ID
dependent effects correctly, the active threads for each computation step i
are replayed, then the control flow divergence, coalescing and bank conflict
effects are measured. As a result, the total cost is twice the value of Ex-
pression (A.3). Notice that if I0 ≤ T, we directly iterate over i. The cost
in Expression (A.3) is formulated in terms of program constants or known
hardware limits and is proportional to cost of symbolically evaluating the
condition. The cost is computed in advance. If the computed cost is less
than a predefined threshold, the condition is symbolically evaluated. Other-
wise, the condition is assumed to be true for all threads.
Now, let w0 be the computed execution weight for the region under the
control of the above predicate. To make up for the divergence-free part (the
I− I0 iterations), the overall weight is computed as: ]
max(I− I0, 0) + w0I0
max(I0, I)
(A.4)
If I is known at compile time, Expression (A.4) is reduced to a single value,
otherwise it remains parametric on I and consequently the total execution
104
cycles become a function of I. Note that in cases where I is not known
statically and its actual value is less than I0, I have approximated the corre-
sponding execution weight by w0.
2. O is the modulo operator: (At + B) ≡ 0 (mod αi + β)
We solve expressionAt + B for each t ∈ {0, 1, ...,T− 1}. Let (At′ + B) ≡
0 (mod αi + β) which implies that αi + β is some combination of prime
factors of At′ + B. If AT + B is not a very large number we can store
combinations of prime factors of all numbers less that AT + B in a lookup
table in order to retrieve them efficiently. Let P be a combination of prime
factors of At′ + B. Factor P is equal to αi + β if P−β
α
is an integer and
0 ≤ P−β
α
≤ I; the converse is also true. Therefore, t′ is a candidate thread
that satisfies the condition. The total number of steps to test all values of t
is bounded by:
T−1∑
t=0
⌊At + B
2
⌋
≈ AT(T− 1)
4
+
BT
2
Similar to case 1, this method requires two passes to collect the performance
information.
Since the largest prime factor that is examined is equal toA(T− 1) + B, the
condition evaluates to true for all threads if I0 ≤ i, where I0 = A(T−1)+B−βα .
Similar to the discussion given for the previous case, the overall execution
weight is computed as stated in Expression (A.4).
A.0.2 Structural Memory Accesses
In this section I first describe the class of memory access expressions that is ac-
cepted by our expression simplification engine. Later I explain how a candidate
105
expression is simplified through an example. Finally, I discuss my approach to ef-
ficiently capture qualitative properties of these expressions from the point of view
of GPU architecture performance.
The subscripting function of an array reference, F , is a multivariate expression.
Function F can be defined recursively as the sum of possibly complex terms ac-
cording to Definitions (A.5) and (A.6), where v is a vector of integer variables
of the program (thread coordinates, input parameters, enclosing loops induction
variables, and an element representing integer number 1) all of which are involved
in the evaluation of the subscripting expression, and v[`] is the `th element of the
variable vector v.
F (0)(v) = v[`] for some ` ∈ { 1, 2, · · · , |v|} (A.5)
F (n)(v) =
N(n)∑
i=1
M
(n)
i∏
j=1
Pi,j
Pi,j ∈ {Ok ·
(F (n−1)i,j )(v)} for k = 1, 2, 3, 4
For some α, β ∈ Z and some ` ∈ { 1, 2, · · · , |v|}, I define the following four
operators, namelyO1, · · · , O4, on the subscripting functionF . Note that negative
exponent expressions are excluded by definition.
O1 · (F (n−1)i,j )(v) = F (n−1)i,j (v) mod 2αv[`]+β (A.6)
O2 · (F (n−1)i,j )(v) = b
F (n−1)i,j (v)
2αv[`]+β
c
O3 · (F (n−1)i,j )(v) = 2αv[`]+β ×F (n−1)i,j (v)
O4 · (F (n−1)i,j )(v) = α×F (n−1)i,j (v)
106
The operators defined above are commonly used to express complex access
patterns, e.g., wrap-around and strided accesses.
Example 1. Expression (A.7) shows a relatively complex index expression from
the FFT kernel. Variables involved in this expression include induction variables
of two enclosing loops (i and 2j), input parameter (N), and thread coordinates (b
as thread-block ID and t as thread ID).
⌊b× N+ t
2j
⌋
× 2j+1 + (b× N+ t) mod 2j + i× 2j (A.7)
The following derivation shows how the first term in Expression (A.7) satis-
fies the definition of F , where the vector of involved variables is initialized as
v = [b, t, i, j, N, 1]:
F (2)(v)= bF
(1)
1,1 (v)
2j
c × 2j+1 ×F (1)1,2 (v) (A.8)
= b(F1,1)
(0)
1,1(v)× (F1,1)(0)1,2(v) + (F1,1)(0)2,1(v)
2j
c
× 2j+1 × (F1,2)(0)1,1(v)
= bb× N + t
2j
c × 2j+1 × 1
In step two of the derivation above, it is understood that for example, the term
(F1,1)1,2 refers to the second factor of the first term in F1,1.
Expression (A.7) is examined to determine whether the qualitative behavior
of residing threads in a memory vector instruction (a half-warp in the case of
NVIDIA GPUs) can be confined to their local thread coordinates. Figure A.1(a)
shows the expression tree for Expression (A.7). We refer to the multivariate ex-
pressions that are composed of thread ID and a combination of other thread co-
107
ordinates (thread-block ID) with free variables as complex expressions. Through
a bottom-up process I label the complex sub-expressions. Labels are propagated
upward in the expression tree until a non-distributive operator such as modulo or
integer division is reached (referred to as a tainted operator). At this point, the
simplification engine attempts to recursively disentangle the local thread ID part
from the rest of the terms by imposing constraints on free variables.
We explain this simplification process for the first term of Expression (A.7)
which is highlighted in the expression tree of Fig. A.1(a) by dashed arcs. The first
non-distributive tainted operator node that is visited during the bottom-up process
for this sub-tree is the div operator (b
2j
c). Operator div can be distributed if at
least one of the terms in its enumerator is a multiple of the denominator 2j. Tainted
operator div is distributed speculatively over + operator. From the two resulting
sub-expression trees in Fig. A.1(b), only the one on the left contains a free vari-
able. Therefore, speculation continues down the left sub-tree and along the path
that leads to the free variable N, as shown in Fig. A.1(c). At this point, the con-
straint N = k× 2J (for some k ∈ Z) is added to the simplified expression tree of
Fig. A.1(c), where J is the upper bound for the induction variable j. The simpli-
fication steps are valid only if the constraint on N is proved to be true. The proof
can be either derived at compile time or checked at runtime. Figure A.1(d) shows
the routine distribution of multiplication over addition. In Fig. A.1(e), based on
the constraint put on N earlier, multiplication and integer division have equaled
out each other. As the result of these simplifications, the original expression tree is
transformed to the one shown in Fig. A.1(f). The local thread coordinate t is now
separated from other thread coordinates such as b. The two highlighted sub-trees
in Fig. A.1(f) are now simple enough to be symbolically evaluated.
108
A.0.3 Memory Bank Conflict and Coalescing
In general, deriving the exact memory access pattern is infeasible, especially with
the presence of relatively complex subscripting functions (Section A.0.2) and un-
known upper bounds for the induction variables. However, we are interested in
certain qualitative properties of the array subscript expressions. For example, for
efficient utilization of memory bandwidth, successive words are assigned to suc-
cessive memory banks. Let NUMbank be the number of memory banks, which is
also equal to the number of threads that can simultaneously access memory. The
problem of determining the pattern of memory bank conflicts can be transformed
to the problem of computing x mod NUMbank, where x belongs to a finite set of
integers. Size of this set is proportional to the number of variables present in the
subscripting expression. Let v = [v1, v2, · · · , vn] be all n participating variables
in evaluation of expression E(v). Each vi can be formulated as αiNUMbank + ρi,
where 0 ≤ ρi < NUMbank. Evaluating E(v) to capture memory bank conflict pat-
terns is equivalent to evaluation of E(ρ), where ρ = [ρ1, ρ2, · · · , ρn]. The differ-
ence is that members of variable vector ρ are well-bounded.
The simplification engine is more stringent on subscripting expressions that are
evaluated for memory coalescing. Not only should the terms involving local coor-
dinates (thread ID) be separable from the terms that involve other thread coordi-
nates and free variables, but each thread ID dependent term should be expressable
as a sum of terms such as At + B, bAt+B
2αi+β
c or (At + B) mod 2αi+β , where t is the
local thread coordinate, A,B ∈ Z and α, β ∈ I. As a result, the symbolic evalu-
ation engine can identify through a limited number of steps, how many different
memory segments are accessed by threads within a single memory vector instruc-
tion.
109
+yy
""
+
zz %%
*
{{ 
mod
 ""
*
 
div
 ""
2j+1 +
zz 
2j i 2j
+
}} 
2j *
{{ 
t
*
~~ 
t b N
b N
(a)
*
{{ 
+
{{ ##
2j+1
div
 ##
div
 !!
*
}} 
2j t 2j
b N
(b)
*
xx

+
{{ &&
2j+1
*
}} 
div
 %%
b div
 ""
t 2j
N 2j N = k× 2J
(c)
+
zz ''*
{{ 
*
xx 
b *
{{ 
div
 &&
2j+1
div
 ""
2j+1 t 2j
N 2j N = k× 2J
(d)
+
xx ''*
~~ 
*
xx 
b *
 
div
 &&
2j+1
2 N t 2j
N = k× 2J
(e)
+
yy
%%
+
zz %%
+
|| ##
mod
 ""
*
 %%
*
 
*
{{ 
t 2j i 2j
b *
 
div
 ""
2j+1
2 N t 2j N = k× 2J
(f)
Figure A.1: Expression simplification for Example 1. (a) Initial expression tree.
(b) Distributing integer division. (c) Distribution is carried on along the path that
contains free variables; constraint N = k× 2J is bound to free variable N. (d)
Multiplication is distributed over the add operator. (e) Multiplication and integer
division cancel each other out. (f) The simplified expression tree.
110
REFERENCES
[1] NVIDIA Staff, NVIDIA CUDA Programming Guide 3.0, 2010.
[2] The OpenCL Specification, 2009.
[3] W. W. Hwu and J. Stone, “A programmers view of the new GPU computing
capabilities in the Fermi architecture and CUDA 3.0,” White paper, Univer-
sity of Illinois, 2009.
[4] V. Salapura, K. Ganesan, A. Gara, M. Gschwind, J. C. Sexton, and R. E.
Walkup, “Next-generation performance counters: Towards monitoring over
thousand concurrent events,” in Proceedings of the ISPASS 2008 - IEEE In-
ternational Symposium on Performance Analysis of Systems and software,
2008, pp. 139–146.
[5] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, “Analyzing
CUDA workloads using a detailed GPU simulator,” in Performance Anal-
ysis of Systems and Software, 2009. ISPASS 2009. IEEE International Sym-
posium on, 2009, pp. 163 –174.
[6] M. Rosenblum and M. Varadarajan, “SimOS: A fast operating system simu-
lation environment,” Stanford, CA, USA, Tech. Rep., 1994.
[7] T. Austin, E. Larson, and D. Ernst, “SimpleScalar: An infrastructure for
computer system modeling,” Computer, vol. 35, pp. 59–67, February 2002.
[Online]. Available: http://portal.acm.org/citation.cfm?id=619072.621910
[8] D. D. Sylvain Collange and D. Parello, “Barra, a parallel functional GPGPU
simulator.”
[9] I. Corporation, “Intel itanium 2 processor reference manual for software
development and optimization,” 2004, Manual. [Online]. Available:
http://www.intel.com/design/itanium2/manuals/251110.htm
[10] A. V. Aho, P. J. Denning, and J. D. Ullman, “Principles of optimal
page replacement,” J. ACM, vol. 18, pp. 80–93, January 1971. [Online].
Available: http://doi.acm.org/10.1145/321623.321632
111
[11] C. C. Hurd, “A note on early Monte Carlo computations and scientific
meetings,” IEEE Ann. Hist. Comput., vol. 7, pp. 141–155, April 1985.
[Online]. Available: http://portal.acm.org/citation.cfm?id=3648.3665
[12] A. G. H.H. Goldstine, “The electronic numerical integrator and computer
(ENIAC),” IEEE Annals of the History of Computing, vol. 18, pp. 10–16,
1996.
[13] R. Eckhardt, “Stan Ulam, John Von Neumann and
the Monte Carlo Method,” 1987, Internet draft.
[Online]. Available: http://www.lanl.gov/history/admin/files/Stan
Ulam John von Neumann and the Monte Carlo Method.pdf
[14] T. Bayes, “An essay towards solving a problem in the doc-
trine of chances,” 1763, Internet draft. [Online]. Available:
http://www.stat.ucla.edu/history/essay.pdf
[15] M. D. Root and J. R. Boer, Directx Complete, 1st ed., M. Sprague, Ed. New
York, NY, USA: McGraw-Hill, Inc., 1998.
[16] C. OpenGL Architecture ReviewBoard, OpenGL reference manual: the of-
ficial reference document for OpenGL, release 1. Boston, MA, USA:
Addison-Wesley Longman Publishing Co., Inc., 1992.
[17] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard, “Cg: a
system for programming graphics hardware in a C-like language,” ACM
Trans. Graph., vol. 22, pp. 896–907, July 2003. [Online]. Available:
http://doi.acm.org/10.1145/882262.882362
[18] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Hous-
ton, and P. Hanrahan, “Brook for GPUs: stream computing on
graphics hardware,” in ACM SIGGRAPH 2004 Papers, ser. SIG-
GRAPH ’04. New York, NY, USA: ACM, 2004. [Online]. Available:
http://doi.acm.org/10.1145/1186562.1015800 pp. 777–786.
[19] M. D. McCool, Z. Qin, and T. S. Popa, “Shader metaprogramming,”
in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference
on Graphics hardware, ser. HWWS ’02. Aire-la-Ville, Switzerland,
Switzerland: Eurographics Association, 2002. [Online]. Available:
http://portal.acm.org/citation.cfm?id=569046.569055 pp. 57–68.
[20] D. Tarditi, S. Puri, and J. Oglesby, “Accelerator: using data parallelism to
program gpus for general-purpose uses,” in Proceedings of the 12th inter-
national conference on Architectural support for programming languages
and operating systems, ser. ASPLOS-XII. New York, NY, USA: ACM,
2006. [Online]. Available: http://doi.acm.org/10.1145/1168857.1168898
pp. 325–335.
112
[21] A. KleinOsowski and D. J. Lilja, “MinneSPEC: A new SPEC benchmark
workload for simulation-based computer architecture research,” Computer
Architecture Letters, vol. 1, 2002.
[22] J. J. Yi, S. V. Kodakara, R. Sendag, D. J. Lilja, and D. M. Hawkins, “Char-
acterizing and comparing prevailing simulation techniques,” in Proceedings
of the 11th International Symposium on High-Performance Computer
Architecture. Washington, DC, USA: IEEE Computer Society, 2005. [On-
line]. Available: http://portal.acm.org/citation.cfm?id=1042442.1043426
pp. 266–277.
[23] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automat-
ically characterizing large scale program behavior,” SIGOPS Oper.
Syst. Rev., vol. 36, pp. 45–57, October 2002. [Online]. Available:
http://doi.acm.org/10.1145/635508.605403
[24] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe, “SMARTS:
accelerating microarchitecture simulation via rigorous statistical sampling,”
in Proceedings of the 30th annual international symposium on Computer
architecture, ser. ISCA ’03. New York, NY, USA: ACM, 2003. [Online].
Available: http://doi.acm.org/10.1145/859618.859629 pp. 84–97.
[25] T. M. Conte, M. A. Hirsch, and K. N. Menezes, “Reducing state
loss for effective trace sampling of superscalar processors,” in Pro-
ceedings of the 1996 International Conference on Computer Design,
VLSI in Computers and Processors, ser. ICCD ’96. Washing-
ton, DC, USA: IEEE Computer Society, 1996. [Online]. Available:
http://portal.acm.org/citation.cfm?id=645464.653497 pp. 468–477.
[26] W. Liu, W. Muller-Wittig, and B. Schmidt, “Performance Predictions for
General-Purpose Computation on GPUs,” in International Conference on
Parallel Processing, September 2007.
[27] N. K. Govindaraju, S. Larsen, J. Gray, and D. Manocha, “A Memory Model
for Scientific Algorithms on Graphics Processors,” in ACM/IEEE Confer-
ence on Supercomputing, November 2006.
[28] K. Fatahalian, J. Sugerman, and P. Hanrahan, “Understanding the Efficiency
of GPU Algorithms for Matrix-matrix Multiplication,” in Conference on
Graphics Hardware, August 2004.
[29] C. Jiang and M. Snir, “Automatic Tuning Matrix Multiplication Performance
on Graphics Hardware,” in International Conference on Parallel Architec-
tures and Compilation Techniques, September 2005.
[30] http://clang.llvm.org/.
113
[31] S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, J. A.
Stratton, and W.-m. W. Hwu, “Program optimization space pruning for a
multithreaded GPU,” in Proceedings of the 6th annual IEEE/ACM interna-
tional symposium on Code generation and optimization, 2008, pp. 195–204.
[32] S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W.-m. W.
Hwu, “An adaptive performance modeling tool for GPU architectures,” in
Proceedings of the 15th ACM SIGPLAN symposium on Principles and prac-
tice of parallel programming, ser. PPoPP ’10. New York, NY, USA: ACM,
2010. [Online]. Available: http://doi.acm.org/10.1145/1693453.1693470
pp. 105–114.
[33] S. Hong and H. Kim, “An analytical model for a GPU architecture with
memory-level and thread-level parallelism awareness,” in Proceedings of the
36th annual international symposium on Computer architecture, 2009, pp.
152–163.
[34] Y. Zhang and J. D. Owens, “A quantitative performance analysis model for
GPU architectures,” in Proceedings of the 17th IEEE International Sympo-
sium on High-Performance Computer Architecture, 2011.
[35] G. Diamos, A. Kerr, and M. Kesavan, “Translating GPU binaries to tiered
SIMD architectures with Ocelot.”
[36] B. J. Smith, “Readings in computer architecture,” M. D. Hill, N. P. Jouppi,
and G. S. Sohi, Eds. Morgan Kaufmann Publishers Inc., 2000, ch. Ar-
chitecture and applications of the HEP mulitprocessor computer system, pp.
342–349.
[37] M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich,
and W. S. Lee, “The M-Machine multicomputer,” in Proceedings of the 28th
annual international symposium on Microarchitecture, 1995, pp. 146–156.
[38] A. Snavely, L. Carter, J. Boisseau, A. Majumdar, K. S. Gatlin, N. Mitchell,
J. Feo, and B. Koblenz, “Multi-processor performance on the Tera MTA,” in
Proceedings of the 1998 ACM/IEEE conference on Supercomputing, 1998,
pp. 1–8.
[39] J. Nickolls and W. J. Dally, “The gpu computing era,” IEEE Micro, vol. 30,
pp. 56–69, March 2010.
[40] S. Laha, J. H. Patel, and R. K. Iyer, “Accurate low-cost methods for perfor-
mance evaluation of cache memory systems,” IEEE Trans. Comput., vol. 37,
pp. 1325–1336.
[41] NVIDIA, “CUDA occupancy calculator,” devel-
oper.download.nvidia.com/cuda/CUDAOccupancycalculator.xls.
114
[42] C. Lattner and V. Adve, “LLVM]: A Compilation Framework for Lifelong
Program Analysis & Transformation, booktitle = Proceedings of the interna-
tional symposium on Code generation and optimization: feedback-directed
and runtime optimization, year = 2004, pages = 75–,.”
[43] H. Stark and J. Woods, Probability, random processes, and estimation theory
for engineers. Prentice-Hall, Inc. Upper Saddle River, NJ, USA, 1986.
[44] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker,
D. Patterson, J. Shalf, and K. Yelick, “Stencil computation optimization
and auto-tuning on state-of-the-art multicore architectures,” in Proceed-
ings of the 2008 ACM/IEEE conference on Supercomputing, ser. SC
’08. Piscataway, NJ, USA: IEEE Press, 2008. [Online]. Available:
http://portal.acm.org/citation.cfm?id=1413370.1413375 pp. 4:1–4:12.
115
