Cache Memory Access Patterns in the GPU Architecture by Nimkar, Yash
Rochester Institute of Technology
RIT Scholar Works
Theses
7-2018
Cache Memory Access Patterns in the GPU
Architecture
Yash Nimkar
ypn4262@rit.edu
Follow this and additional works at: https://scholarworks.rit.edu/theses
This Thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized
administrator of RIT Scholar Works. For more information, please contact ritscholarworks@rit.edu.
Recommended Citation
Nimkar, Yash, "Cache Memory Access Patterns in the GPU Architecture" (2018). Thesis. Rochester Institute of Technology. Accessed
from
Cache Memory Access Patterns in the GPU
Architecture
Yash Nimkar
Cache Memory Access Patterns in the GPU
Architecture
Yash Nimkar
July 2018
A Thesis Submitted
in Partial Fulfillment
of the Requirements for the Degree of
Master of Science
in
Computer Engineering
Department of Computer Engineering
Cache Memory Access Patterns in the GPU
Architecture
Yash Nimkar
Committee Approval:
Dr. Sonia Lopez Alarcon Date
Thesis Advisor
Department of Computer Engineering
Rochester Institute of Technology
Dr. Amlan Ganguly Date
Department of Computer Engineering
Rochester Institute of Technology
Dr. Roy Melton Date
Department of Computer Engineering
Rochester Institute of Technology
i
Acknowledgments
I would like to thank my Thesis Advisor, Dr. Sonia Lopez Alarcon for all her help
and support. This has been a continual learning experience for me and my thesis
would not have been the same without Dr. Sonia Lopez’s guidance and direction. I
would also like to thank Dr. Amlan Ganguly and Dr. Roy Melton for being on my
Master’s thesis committee.
I would like to thank Richard Tolleson and Richard Flegal from the Department
of Computer Engineering at RIT for all IT related help especially while setting up
my research VMs and computers.
I would like to thank Paul Mezzanini, Sidney Pendelberry and Emilio Del Plato
from the Research Computing department at RIT for all their help in relation to the
research computing clusters and VMs that were initially used for my research.
I would like to thank Xun Gong who is a Multi2Sim developer at Northeast-
ern University for all his help with the Multi2Sim simulator and for setting up and
customizing the NVIDIA Kepler model on the Multi2Sim.
I would like to thank our labs ex-students or alumni, Ben Johnstone and Harshdeep
Chawla, for their help as I used their research as a base to understand the GPU
architectures and setup the GPU simulator.
Lastly, I would like to thank my family members and close friends for their contin-
ued support along the way. This would not have been possible without their support
and encouragement.
ii
iii
Abstract
Data exchange between a Central Processing Unit (CPU) and a Graphic Process-
ing Unit (GPU) can be very expensive in terms of performance. The characterization
of data and cache memory access patterns differ between a CPU and a GPU. The
motivation of this research is to analyze the cache memory access patterns of GPU
architectures and to potentially improve data exchange between a CPU and GPU.
The methodology of this work uses Multi2Sim GPU simulator for AMD Radeon and
NVIDIA Kepler GPU architectures. This simulator, used to emulate the GPU ar-
chitecture in software, enables certain code modifications for the L1 and L2 cache
memory blocks. Multi2Sim was configured to run multiple benchmarks to analyze
and record how the benchmarks access GPU cache memory. The recorded results
were used to study three main metrics: (1) Most Recently Used (MRU) and Least
Recently Used (LRU) accesses for L1 and L2 caches, (2) Inter-warp and Intra-warp
cache memory accesses in the GPU architecture for different sets of workloads, and (3)
To record and compare the GPU cache access patterns for certain machine learning
benchmarks with its general purpose counterparts.
iv
Contents
Signature Sheet i
Acknowledgments ii
Dedication iii
Abstract iv
Table of Contents v
List of Figures vii
List of Tables ix
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 6
2.1 Simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Multi2Sim 12
3.1 Setting up the simulator on an Ubuntu OS . . . . . . . . . . . . . . . 13
3.2 AMD Southern Islands (Radeon) GPU Architecture . . . . . . . . . . 14
3.3 NVIDIA Kepler GPU Architecture . . . . . . . . . . . . . . . . . . . 19
4 Analyzing Cache Memory Access Patterns 24
4.1 L1 and L2 Cache Hit Ratios . . . . . . . . . . . . . . . . . . . . . . 25
4.1.1 AMD Southern Islands . . . . . . . . . . . . . . . . . . . . . 25
4.1.2 NVIDIA Kepler . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 MRU and LRU Cache Counters for Temporal Locality . . . . . . . . 27
4.3 Inter-warp and Intra-warp Cache Locality . . . . . . . . . . . . . . . 33
5 Results 39
5.1 Cache Hit Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
v
CONTENTS
5.2 MRU and LRU Temporal Locality Results . . . . . . . . . . . . . . . 41
5.3 Inter-warp and intra-warp Locality Results . . . . . . . . . . . . . . . 48
5.4 Comparing the CPU and GPU results . . . . . . . . . . . . . . . . . 54
6 Machine Learning Benchmarks using CUDA 58
6.1 Design Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2.1 L1 and L2 cache hit ratios . . . . . . . . . . . . . . . . . . . 61
6.2.2 MRU and LRU Temporal Locality Results . . . . . . . . . . 62
6.2.3 Inter-warp and Intra-warp Locality Results . . . . . . . . . . 65
7 Conclusion 70
7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.1.1 Unified Memory Model . . . . . . . . . . . . . . . . . . . . . 72
7.1.2 Machine Learning Benchmarks on the GPU . . . . . . . . . 72
A Code Listings 78
vi
List of Figures
3.1 Block diagram of the AMD 7970 architecture [14] . . . . . . . . . . . 16
3.2 Block diagram of the Compute Unit of the AMD 7970 architecture [14] 17
3.3 Block diagram of the NVIDIA Kepler architecture [1] . . . . . . . . . 19
3.4 Kepler GPU Memory Architecture [1] . . . . . . . . . . . . . . . . . . 20
3.5 NVIDIA Kepler Steaming Multi-processor (SM) architecture [1] . . . 21
4.1 Cache Replacement Policies [15] . . . . . . . . . . . . . . . . . . . . . 29
5.1 Cache Memory Hit Ratios for the OpenCL benchmarks . . . . . . . . 40
5.2 Cache Memory Hit Ratios for the CUDA benchmarks . . . . . . . . . 40
5.3 OpenCL benchmark results for the AMD SI architecture for L1 cache 42
5.4 CUDA benchmark MRU Results for L1 Cache of the NVIDIA Kepler
architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5 OpenCL benchmark results for the AMD SI architecture for L2 cache 45
5.6 CUDA benchmark MRU Results for L2 Cache of the NVIDIA Kepler
architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.7 OpenCL L1 vector and L2 cache intra-warp and inter-warp access per-
centages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.8 CUDA L1 vector and L2 cache intra-warp and inter-warp access per-
centages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.9 OpenCL L1 vector intra-warp and inter-warp hits and misses . . . . . 52
5.10 OpenCL L2 intra-warp and inter-warp hits and misses . . . . . . . . 52
5.11 CUDA L1 intra-warp and inter-warp hits and misses . . . . . . . . . 53
5.12 CUDA L2 intra-warp and inter-warp hits and misses . . . . . . . . . 53
6.1 Machine Learning benchmark results for the NVIDIA Kepler architec-
ture for L1 cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2 Machine Learning benchmark results for the NVIDIA Kepler architec-
ture for L2 cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3 L1 and L2 cache inter-warp and intra-warp access percentages for Ma-
chine Learning benchmarks . . . . . . . . . . . . . . . . . . . . . . . . 66
6.4 L1 intra-warp and inter-warp hits and misses for Machine Learning
benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
vii
LIST OF FIGURES
6.5 L2 intra-warp and inter-warp hits and misses for Machine Learning
benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.1 NVIDIA’s Unified Memory Model [27] . . . . . . . . . . . . . . . . . 73
viii
List of Tables
3.1 Dependency list for the Multi2Sim simulator . . . . . . . . . . . . . . 13
3.2 Difference in terminology between OpenCL and CUDA . . . . . . . . 15
3.3 AMD Southern Islands emulated GPU configuration on Multi2Sim [10,
14] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4 NVIDIA Kepler emulated GPU configuration on Multi2Sim [1, 14] . . 22
5.1 OpenCL Benchmark MRU results for L1 Cache (per module) of the
AMD SI architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 CUDA Benchmark MRU results for L1 Cache of the NVIDIA Kepler
architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3 OpenCL Benchmark MRU results for L2 Cache of the AMD SI archi-
tecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.4 CUDA Benchmark MRU results for L2 Cache of the NVIDIA Kepler
architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.5 Inter-warp and Intra-warp Access Percentages for L1 and L2 cache for
OpenCL benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.6 Inter-warp and Intra-warp Access Percentages for L1 and L2 cache for
CUDA benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.7 MRU Results for the CPU [11] . . . . . . . . . . . . . . . . . . . . . 54
5.8 Comparison of MRU results for the CPU and the GPU [11] . . . . . 55
5.9 Cache Hit Rates for the CPU [10] . . . . . . . . . . . . . . . . . . . 56
5.10 Comparison of Cache Hit Ratios for the CPU and the GPU . . . . . 57
6.1 Defined Machine Learning algorithms in CUDA . . . . . . . . . . . . 59
6.2 L1 and L2 Hit Ratios for Machine Learning Benchmarks . . . . . . . 61
6.3 MRU Results for the Machine Learning benchmarks on the NVIDIA
Kepler architecture for L1 cache . . . . . . . . . . . . . . . . . . . . 62
6.4 MRU Results for the Machine Learning benchmarks on the NVIDIA
Kepler architecture for L2 cache . . . . . . . . . . . . . . . . . . . . 63
6.5 Inter-warp and Intra-warp Access Percentages for L1 and L2 cache . . 66
6.6 Comparison of the warp locality results for the two sets of CUDA
benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
ix
LIST OF TABLES
x
Chapter 1
Introduction
Graphic Processing Units (GPUs) are regarded as one of the standards when it comes
to high processing power and parallel execution. As the world advances in technology,
the applications of GPUs range from machine learning and computer vision to general
purpose computing and more. Although the use of GPUs is growing greatly in indus-
try, there is very little knowledge and understanding of GPUs and their relationship
with cache, as well as the communication between a CPU and a GPU.
Data level parallelism is achieved when multiple processors work on the same code
to execute the results in parallel therefore improving the performance. GPUs enabled
programmers to run multiple applications using data level parallelism by executing
multiple threads in parallel. This was done using the defined hardware characteristics,
such that each thread will execute the same kernel functionality. This execution model
is one level higher than the Single Instruction Multiple Data (SIMD) model and is
called Single Instruction Multiple Thread (SIMT). In 2007, NVIDIA launched their
first GPU accelerators with supporting hardware to improve the performance of high
end applications. GPU accelerators are mainly used in heterogeneous multi-processor
systems, where the GPU works in tandem with the CPU to improve or accelerate
performance.
In a heterogeneous multi-processor chip, the CPU and GPU are the two main
processors. The CPU is the primary host while the GPU is the device that works
1
CHAPTER 1. INTRODUCTION
off the primary host (CPU). The CPU’s cache memory access patterns are very well
documented, but the same cannot be said about the GPU. There has not been enough
research done in the GPU field to understand how the GPU accesses its cache and to
understand the cache memory access patterns of the GPU.
1.1 Motivation
One of the primary objectives of this research is to understand the GPU cache memory
access patterns and in turn improve the communication between GPU and CPU in a
heterogeneous multi-processor chip. A GPU simulator called Multi2Sim was used to
simulate various benchmarks for two sets of GPU architectures — NVIDIA’s Kepler
and AMD’s Radeon.
The development of GPUs has enabled high data parallelism using multiple, easy
to use, and flexible programming models. Modern GPUs can be used for general
purpose processing and are part of a heterogeneous multi-processor chip along with
the CPU. Modern GPUs offer greater computational power than the CPU and are
used to run applications in parallel taking advantage of the application’s data level
parallelism. GPUs are Single Instruction Multiple Data (SIMD) architectures. This
enables them to offer greater performance with low latency and overhead.
In a heterogeneous multi-processor system, different tasks are assigned to and
handled by different processing cores based on the individual characteristics of the
processor and the task assigned to it. This improves performance and saves power,
effectively using the available resources on the system or chip with an even work load
for the processors.
Understanding the memory access patterns of the GPU can lead to improving the
communication and memory management between the CPU and the GPU. The CPU
cache memory definitions and access patterns have been studied and identified for
various releases of CPUs [11, 12, 13]. Savaldor Petit et al. [11] talk about percentage
2
CHAPTER 1. INTRODUCTION
of cache hits along the multiple lines for each set of the CPU’s cache. Various CPU
benchmarks were run to identify the temporal locality of lines for each set of CPU
cache. The first MRU line shows a very high percentage of cache hits between 85%
to 95%, depending on the benchmarks being tested. The other 15% to 5% of hits
were then distributed among the other MRU lines. The results showed a very high
percentage of hits on the Most Recently Used (MRU) block or MRU0 line, and this
value was recorded to be 92% on average. This served as a motivation to find the
MRU cache access patterns of the GPU to see if the GPU showed similar results to
those of the CPU.
Studying the GPU cache locality and cache memory access patterns can further
help understand the working of the GPU for specific high-performance benchmarks
like those used for machine learning. Most of the machine learning workloads are
executed using pre-defined libraries or frameworks like Tensorflow. These machine
learning workloads are heavily reliant on a GPU for high data level parallelism and
high computational power. To the best of the author’s knowledge, machine learning
algorithms or workloads have not been tested on a simulator level to understand
how the GPU works with these different workloads and how this impacts its cache
memory. By running some machine learning benchmarks on the Multi2Sim simulator
for the GPU, the GPU cache behavior can be studied for different machine learning
workloads.
This research can also pave the way for a new approach, currently being pursued
by AMD and NVIDIA, called the Unified Memory Model[27] between the CPU and
the GPU. Based on this model, the CPU and GPU share a dedicated unified memory
block located in Dynamic Random Access Memory (DRAM), where the GPU can
access the data directly from DRAM without involving or depending on the CPU.
3
CHAPTER 1. INTRODUCTION
1.2 Objective
The main objective of this research is to learn more about the cache memory access
patterns of the GPU. These results will help identify specifics about GPU cache
operations and can be used to improve data exchange between the GPU and the
CPU in a heterogeneous multi-processor chip. Multi2Sim [6] was used to perform a
wide range of simulations on two different GPU architectures. Mult2Sim emulates
the AMD’s Radeon GPU architecture as well as NVIDIA’s Kepler GPU architecture.
A set of OpenCL and CUDA benchmarks was used to test both GPU architectures
for different workloads.
The first objective was to setup the Multi2Sim simulator on an Ubuntu VM with
all the required dependencies. The simulator was compiled to run the OpenCL and
CUDA benchmarks to test the working of the GPU. To modify the GPU emulator
code defined in the Mult2Sim simulator, first there was a need to understand how the
AMD and NVIDIA GPUs were defined in hardware and how they were emulated in
code on the simulator. After understanding the architectural code and the memory
code, three metrics were defined to analyze GPU cache behavior: cache hit ratios,
MRU temporal locality and inter-warp/intra-warp locality.
The caching policy was studied to add specific cache counters to keep track of
the most recently used (MRU) and least recently used (LRU) blocks in the GPU
cache. A set of MRU and LRU counters were implemented to study the temporal
locality of the the GPU cache. These MRU counters determine the number of cache
accesses made for each MRU line in a multi-way set associative cache. In the GPU
architecture, L1 and L2 cache blocks were defined using different set associativity.
This associativity determined how many ways or lines were used to split the cache.
For example, if L1 cache was defined a 4-way set associative cache, then each set
was defined using 4 ways or lines. MRU0 represents the most recently used line and
4
CHAPTER 1. INTRODUCTION
MRU4 would represent the least recently used line. The caching policy uses these
multiple lines to add, replace, and remove blocks of memory for each set in cache.
The MRU counters were updated during the GPU execution every time a block got
accessed in cache at that very location on the MRU line. The respective MRU line
counter was incremented. These MRU counters were used to record the temporal
locality of the GPU in L1 and L2 cache. These GPU MRU results were compared
to those of the CPU to see if the GPU showed similar values and access patterns to
those of the CPU.
A set of inter-warp and intra-warp cache counters were implemented to understand
the cache locality for different sets of workloads for the defined GPU architectures.
The intra-warp access counters represent the accesses made within each warp while
the inter-warp access counters represent the accesses made across various warps in the
overall GPU architecture. This set of counters can help analyze the cache memory
access patterns of a GPU. The L1 and L2 intra hits were defined as the number of L1
and L2 data cache hits that resulted from intra-warp locality, respectively. Contrarily,
the L1 and L2 inter hits were the data cache hits that resulted from inter-warp locality.
This helped understand the cache locality of the GPU architecture for different sets
of workloads.
Furthermore, some machine learning benchmarks were defined in CUDA and
tested on the GPU architecture to analyze how the GPU reacts to different ma-
chine learning workloads. These benchmarks were defined using math algorithms
that are most commonly used by Convolution Neural Networks (CNN) and Deep
Neural Networks (DNN). The cache hit ratios, MRU temporal locality and the inter-
warp/ intra-warp locality were recorded and compared against the general purpose
CUDA benchmarks.
5
Chapter 2
Background
2.1 Simulators
A large portion of the work in this thesis focused on setting up the right GPU simu-
lator. GPGPU Sim [1, 2] was the most popular simulator a few years ago, and most
of the research in the GPU field was performed using this simulator. The advantages
of this simulator were that it was easy, convenient and had all the latest GPU archi-
tectures modeled to replicate GPUs in hardware, at that time. But over the years,
this simulator has not been updated, and this results in many compatibility issues in
terms of the latest versions of CUDA, OpenCL, GCC compilers and Python. This
meant that this simulator could not be used to perform experiments for the latest
GPU architectures, to study recent technology trends. This GPGPU simulator re-
cently merged with Gem5 [3], but the some of the dependencies still have not been
updated. Hence neither simulator (GPGPU Sim or Gem5-gpu) was chosen for these
experiments.
Barra [1, 4] and Ocelot [1, 5] are some of the other GPU simulators that were
considered for this research. The drawback of Barra is that it runs CUDA applications
and benchmarks for NVIDIA GPUs only and does not have the capability to run any
OpenCL applications and benchmarks for the AMD series of GPUs. In addition to
that, Barra can be used to run CUDA benchmarks on an assembly level using an
6
CHAPTER 2. BACKGROUND
Instruction Set Architecture (ISA) and not on a Parallel Thread Execution (PTX)
level. Unlike Barra, Ocelot supports both CUDA and OpenCL benchmark on a ISA
and PTX level. Since Ocelot has not been maintained and updated to the latest SDK,
GCC, and CUDA releases and hence was not chosen for this research.
Multi2Sim [1, 6] was chosen as the main simulator for this research as it was the
most up-to-date simulator with the latest dependencies and accurate functionality.
Multi2Sim is easy to setup and use, and it is very well documented and supported,
being one of the newer open source simulators out there. Older versions of Multi2Sim
supported only CPU architecture simulations. But recent simulator released have
helped incorporate GPU architecture emulation and simulation for AMD’s Radeon
Southern Islands series and for NVIDIA’s Kepler GPU series. This is another advan-
tage as this simulator can be used to run both OpenCL and CUDA benchmarks to
test the AMD Radeon and NVIDIA Kepler GPU architectures.
2.2 Related Work
In this new generation of high performance computing, the GPUs have taken up a
new distinction in terms of processing power using parallel thread execution. The
technology has a come a long way since the initial release of the GPU in terms of its
architecture and how it works communicates with the CPU.
Lee et al. [7] proposed a method to improve GPU performance by updating
the warp scheduling policy within a GPU architecture. This method also used a
simulator and tested the scheduling policies of the GPU, like the round-robin and
greedy instruction issue scheduling policies and patterns. Most of the related research
in this field had focused on static scheduling methods for all workloads, and this
was the first time a dynamic or adaptive instruction issue scheduling policy was
suggested. The proposed method was called instruction-issue pattern based adaptive
warp scheduler (iPAWs). This scheduling policy switched between a greedy and
7
CHAPTER 2. BACKGROUND
round-robin scheduler based on the type of workload being run on the GPU. Various
sets of benchmarks were run to simulate large diverse workloads on an NVIDIA
GPU using CUDA. The proposed adaptive warp scheduling method was observed to
be optimal than the static greedy or round-robin scheduling methods. This paper
accurately described the use of a GPU simulator for results and defined the GPU
modifications well in terms of the GPU architecture. This paper served as a baseline
for the use of a GPU simulator for a similar research in the GPU field, but this time
to analyze the memory access patterns of a GPU. This paper also presented GPU
results for warp locality mainly, inter-warp and intra-warp statistics in L1 and L2
cache. This helped in understanding the warp locality for different sets of workloads
for both the Greedy, Round-Robin scheduling policies and iPAWS (as suggested by
the paper).
Mei et al. [8] talk about the GPU memory hierarchy through micro-benchmarking.
This paper proposes a new micro-benchmarking approach to test the GPU cache
memory components for three NVIDIA GPU models: Kepler, Fermi and Maxwell.
The memory components studied are data cache, texture cache, and translation look-
aside buffer (TLB). This paper goes on to introduce the proposed micro-benchmarking
approach to study the memory access patterns of the GPU which are comprised of
cache memory units with the Least Recently used (LRU) caching policy. The memory
access patterns of the GPU and the memory statistics, like number of accesses, number
of hits, number of misses, hit-ratio, etc. were recorded for the new proposed method
of fine grained benchmarking method. The proposed method proved to enhance the
memory capability and capacity for all three generations of the NVIDIA GPUs, more
so for Kepler and Maxwell than Fermi. Furthermore, it enhanced the performance
of the GPU by reducing the latency for shared memory of the GPU caused due to
bank conflicts. Some of the drawbacks or limitations of this approach included low
utilization rate of GPU hardware resources and unbalanced bandwidth values.
8
CHAPTER 2. BACKGROUND
Johnstone et al. [9] talk about the bandwidth requirements of the GPU cores
to determine the appropriate choice of an interconnect between the GPU and CPU
in a heterogeneous multi-core chip. This paper completely focused on running GPU
simulations for different benchmarks and workloads using GPGPU simulator. The
interconnect properties were found to be dependent on the performance of the GPU,
and how the bandwidth affected the GPU performance. The GPU architectures and
the use of GPGPU simulator were the main takeaways from this paper in relation to
this research area.
Choo et al. [10] talk about analyzing and optimizing GPU cache memory perfor-
mance for different computational workloads using Multi2Sim. This paper used the
AMD Southern Islands (SI) GPU architecture defined on MultiSim to run various
computational workloads and observe the L1 and L2 cache hit rates and behavior.
The L1 and L2 cache hit ratios were compared for the CPU and the GPU. The CPU
showed much higher cache hit ratios than the GPU as expected, as the CPU focuses
more on the memory hierarchy during execution to increase performance. On the
other hand, the GPU relies more on parallel execution to process larger workloads
such that the same set of instructions can be executed across multiple threads to
increase productivity and performance. Since this paper used Multi2Sim, it was used
as a reference point for any AMD SI GPU simulations performed during this the-
sis. The cache hit ratios recorded from this thesis were also compared to the results
recorded from this paper for reference. There was no research in terms of caching
policies or identifying the most recently used blocks in cache memory. This paper
completely focuses on improving the GPU memory performance by proposing two
methods: shared L1 vector data cache and clustered work-group scheduling. Both of
these methods were executed for different workloads, and the performance improve-
ments were recorded.
Furthermore, the CPU cache memory definitions and access patterns have been
9
CHAPTER 2. BACKGROUND
studied and identified for various releases of CPUs [11, 12, 13]. Savaldor Petit et
al. [11] investigated the temporal locality of a multi-way set associative cache by
recording the percentage of cache hits along the multiple lines for each set of the
CPU’s cache. This helped analyze the power consumption and performance of the
CPU for the different cache lines using the current caching policies. A new drowsy
cache policy was proposed to demonstrate a good balance between performance and
power consumption. The experiments were performed using the HotLeakage simulator
and the Spec2000 benchmark suite for CPUs. This drowsy cache policy was then
compared to two existing caching policies, Most Recently Used On (MRO) and Two
Most Recently Used On (TMRO), to compare the performance and power statistics.
This research involved going through each Most Recently Used (MRU) block or line in
cache and the power values, and hit percentages were recorded for each line. Various
CPU benchmarks were run to identify the temporal locality of lines for each set of
CPU cache. The first MRU line shows a very high percentage of cache hits between
85% to 95%, depending on the benchmarks being tested. The other 15% to 5% of
hits were then distributed among the other MRU lines. The results showed a very
high percentage of hits on the Most Recently Used (MRU) block or MRU0 line, and
this value was recorded to be 92% on average. This paper serves as a good reference
point for the MRU cache access patterns for the CPU. This served as a motivation
to find the MRU cache access patterns of the GPU to see if the GPU showed similar
results to the CPU results recorded by this paper.
Kumar et al. [12] talk about the modern CPU cache memory hierarchy and
perform cache analysis for various cache replacement policies in CPU cache memory.
The latest generations of processors were compared to find the different sets of factors
that affect the performance and utilization of cache memory. Furthermore, the cache
replacement policies were studied and analyzed using performance analysis. Banday
et al. [13] talk about the recent advances in cache memories for the latest processors.
10
CHAPTER 2. BACKGROUND
Most of the latest processors were compared based on cache hierarchy, organization,
performance, cache access patterns and cache replacement policies. The CPU’s cache
memory access patterns and functionality are known and have been studied for years,
but the same cannot be said about the GPU.
11
Chapter 3
Multi2Sim
Multi2Sim is a well known and documented research simulator. The simulator was
recently updated to include certain GPU architectures. Furthermore, Multi2Sim is
regarded as the most up to date open source simulator that works with the latest
versions of CUDA, OpenCL, Python, and GCC.
As stated previously, the two main GPU architectures that Multi2Sim can simu-
late are AMD’s Radeon Southern Islands and NVIDIA’s Kepler GPU architectures.
Both GPU architectures are approximately two generations behind the current in-
dustry standard, as is the case with most of the CPU and GPU simulators. However,
they can be modified to add any current or new architectures at the simulation level
using newer releases. Both AMD Radeon and NVIDIA Kepler represent really strong
GPUs that are used for a wide variety of applications like parallel and high perfor-
mance computing. They are commonly used in machine learning and computer vision
applications, mainly using libraries and frameworks like Tensorflow. AMD’s SI GPU
architecture used the OpenCL libraries and framework while NVIDIA’s Kepler GPU
architecture used CUDA libraries.
Multi2Sim emulates GPU architectures in software and enables the user to run
a variety of benchmarks on the CPU itself as though the experiments were being
run on a GPU. This enables the user to make architectural changes to the defined
GPU model without making any hardware changes. Every time a change is made
12
CHAPTER 3. MULTI2SIM
Table 3.1: Dependency list for the Multi2Sim simulator
Dependency Version
Operating System Ubuntu (14.04 preferred)
CUDA CUDA version 6.5 (preferred) or higher
Python Python2 and Python3
GCC GCC version 4.8.2 or higher
in the GPU emulator code, the Multi2Sim framework is recompiled to reflect those
changes, and a new executable is generated to run simulations. Multi2Sim supports
multiple CPU and GPU benchmarks that help test the architectures for different
workloads. Spec2000 is the most commonly used CPU benchmark suite. Contrarily,
AMD SDK 2.5 [21] CUDA SDK [22] are the commonly used GPU benchmark suites.
Both OpenCL and CUDA benchmarks display two sections, one for the CPU and one
for the GPU. Both the CPU and GPU sections are simulated on the respective CPU
and GPU architectures implemented in software by Multi2Sim.
3.1 Setting up the simulator on an Ubuntu OS
The open source Multi2Sim simulator code was downloaded from the official M2Sim
github repository and was setup on a 64 bit Ubuntu 14.04 operating system. Although
Ubuntu is the primary OS supported by Multi2Sim, the 14.04 version is the optimal
Ubuntu version as the simulator was tried and tested on this version during the
developmental phase. This does not limit the Multi2Sim to only the 14.04 version.
Systems with newer Ubuntu versions can also be used to setup this simulator.
Table 3.1 shows the major dependencies of the Multi2Sim simulator. These spe-
cific packages and versions had been tested during the developmental stages of the
Multi2Sim simulator and hence were chosen and setup on the testing system.
13
CHAPTER 3. MULTI2SIM
Helper tools were used to download and add the installation scripts for the sim-
ulator. After running the configure and make install commands, Multi2Sim was
successfully built, and the m2s executable was generated in the dedicated Multi2Sim
home directory. The m2s executable was used to run various simulations on the
defined CPU and GPU architecture models. All the commands supported by the
simulator were listed by running the command m2s –help. Listing 3.1 shows all the
bash commands that were used to configure and install Multi2Sim.
1
2 # I n s t a l l i n g any dependenc ies − i f r equ i r ed
3 $ sudo yum i n s t a l l devtoo l s e t−2
4 $ s c l enable devtoo l s e t−2 bash
5
6 # Bash Commands to i n s t a l l Multi2Sim
7 $ l i b t o o l i z e
8 $ a c l o c a l
9 $ autoconf
10 $ automake −−add−miss ing
11
12 $ cd ˜Multi2Sim Home Directory
13 $ . / c on f i gu r e −−p r e f i x=/home/ypn4262/ t o o l s /multi2s im /5.0/
14
15 # Compiling a l l the s imulator source f i l e s and genera t ing the m2s executab le
16 $ make −j 4
17 $ make i n s t a l l
18
19 # Using the help command to f i nd a l l the a v a i l a b l e commands f o r Multi2Sim
20 $ m2s −help
21 $ . / t o o l s /multi2s im /5.0/ bin /m2s −−help
Listing 3.1: Bash commands to setup Multi2Sim
3.2 AMD Southern Islands (Radeon) GPU Architecture
AMD’s Radeon series of GPUs are the closest competitors to NVIDIA’s GeForce
series of GPUs. AMD’s Southern Islands family of HD GPUs consists of the Radeon
HD 7000 series of GPUs. The Southern Islands architecture defined by Multi2Sim
consists of Radeon HD 7770, 7850, 7870 and 7970 architectures. The default and most
commonly used Southern Islands architecture is Radeon HD 7970. This architecture
was analyzed to understand the GPU and its architectural design.
14
CHAPTER 3. MULTI2SIM
Table 3.2: Difference in terminology between OpenCL and CUDA
OpenCL CUDA
Compute Unit (CU) Streaming Multiprocessor (SM)
Compute Element CUDA Core
Work-item Thread
Wavefront (64 work-items) Warp (32 threads)
Wavefront pool Warp pool
Work-group Thread block
Local memory Shared memory
Private memory Registers
The AMD SI GPU architecture used an OpenCL Application Programming In-
terface (API) platform which serves as a parallel to the CUDA platform of NVIDIA’s
GPUs. In the OpenCL implementation, the GPU is defined with Compute Unit (CU)
which serves as the CUDA equivalent of NVIDIA’s Streaming Multiprocessor (SM).
Furthermore, OpenCL APIs use work items, wavefront, and wavefront pools which
serve as the CUDA equivalent of NVIDIA’s threads, warps, and warp pools (or thread
blocks) respectively. Table 3.2 shows the difference in terminology between OpenCL
and CUDA.
Figure 3.1 shows a block diagram of the AMD Radeon 7970 architecture from
the AMD Southern Islands GPU family. The Radeon 7970 architecture has three
main parts, compute devices, compute units and SIMD lanes. Figure 3.1 (a) shows
the block diagram of a compute device. This specific architecture is defined using a
thread dispatcher and scheduler with 32 compute units. The ultra-thread dispatcher
is responsible for scheduling work groups and assigning them to the available Compute
Units. The Compute Units interact directly with global memory, which is comprised
of cache memory and main memory. This global memory block can be accessed by
the entire collection of Compute Units in the ND-Range [14].
Figure 3.1 (b) shows the block diagram of a compute unit. The individual compute
15
CHAPTER 3. MULTI2SIM
Figure 3.1: Block diagram of the AMD 7970 architecture [14]
unit has 4 SIMD execution units, each having 16 SIMD lanes for parallel work-item
execution. These SIMD units are free to interact directly with local memory. The
SIMD lanes allow parallel work-item execution, such that the same set of instructions
is executed across multiple work-items. A wavefront consisting of 64 work-items is
created within each work group and is assigned to a specific SIMD unit for execution.
Each of the 16 SIMD lanes of the Compute Unit is executes four work-items per
wavefront. The local memory block allows the work-items to share information during
execution [14].
Figure 3.1 (c) represents a single SIMD lane which shows how an ALU interacts
with the register file. The ALU is comprised of functional units to process integer
and floating-point values. The register file is responsible for each work-item’s private
memory [14].
Figure 3.2 shows a block diagram of the Compute Unit of the AMD Radeon 7970
GPU architecture. The figure shows the modular structure of the Compute Unit
comprising of a Scalar Unit, Vector Memory Unit, Branch Unit, Local Data Share
Unit (LDS) unit, and 4 SIMD units for parallel execution. During execution, multiple
work-groups are assigned to each compute unit. These work-groups are further split
into wavefronts (comprised of 64 work-items), and the wavefronts are executed simul-
taneously for each instruction. The front-end of the Compute Unit fetches and reads
the instructions from the instruction memory. These instructions are then passed
16
CHAPTER 3. MULTI2SIM
Figure 3.2: Block diagram of the Compute Unit of the AMD 7970 architecture [14]
to the functional units for execution. The Scalar Unit executes scalar arithmetic
and scalar memory instructions. The Vector Memory Unit handles all vector global
memory instructions. The Local Data Share Unit (LDS) handles the local memory
instructions. The Branch Unit (BRU) handles all the branch instructions for control
flow. Each of the functional units (SIMD units) interacts with either global or local
memory and executes vector arithmetic and logic instructions [14].
Table 3.3 shows the hardware configuration of the AMD Radeon 7970 GPU series
that was replicated by the AMD SI GPU model in Multi2Sim [10, 14]. L1 cache and
L2 cache were defined as 4-way and 16-way set associative caches, respectively. The
L1 cache was split between L1 scalar cache and L1 vector cache. The L1 scalar cache
was responsible for all the L1 scalar instructions that got fetched and executed only
once for the entire wavefront. Most of the constant data values were stored here. The
L1 vector cache was responsible for all the vector instructions that got fetched for
the whole wavefront but got executed multiple times for each of the 64 work-items in
that wavefront for parallel execution. The L1 vector cache was the primary focus for
the L1 cache statistics. The L1 vector cache has 32 memory modules such that each
17
CHAPTER 3. MULTI2SIM
Table 3.3: AMD Southern Islands emulated GPU configuration on Multi2Sim [10, 14]
Configuration Value
Computational
Frequency 1000 Hz
Number of Compute Units 32
Number of SIMD lanes 16
Max Number of Wavefront Pools 4
Max Wavefronts per Pool 10
L1 Vector Cache
Number of L1 Cache Modules 32
Associativity / Number of Ways 4
Number of Sets 64
L1 Block Size 64 B
Total L1 Cache Size 512 KB
L1 Scalar Cache
Number of L1 Cache Modules 8
Associativity / Number of Ways 4
Number of Sets 64
L1 Block Size 64 B
Total L1 Cache Size 128 KB
L2 Cache
Number of L2 Cache Modules 6
Associativity / Number of Ways 16
Number of Sets 128
L2 Block Size 64 B
Total L2 Cache Size 768 KB
of the 32 compute units is directly mapped to a single L1 cache module. L2 cache
has 6 memory modules that are shared by all of the 32 compute units. The L2 cache
had a higher set associativity than L1 cache and was larger than L1 cache in terms
of cache size.
18
CHAPTER 3. MULTI2SIM
Figure 3.3: Block diagram of the NVIDIA Kepler architecture [1]
3.3 NVIDIA Kepler GPU Architecture
NVIDIA is the leader in the GPU market with its latest generation GeForce GPU
series. The NVIDIA Kepler GPU architecture was the sole NVIDIA architecture
available on Multi2Sim, but future updates may result in other GPU models. The
Kepler architecture comprises of the GeForce 6000 series and the GeForce 7000 series
GPUs.
Figure 3.3 shows a block diagram of the NVIDIA Kepler architecture from the
Kepler series of GPUs. Part (a) shows the block diagram of the SIMD execution
pipeline of the GPU. The SIMD pipeline consists of the front end, a register file, 32
SIMD lanes followed by an execution buffer. Part (b) shows a block diagram of the
Branch Unit (BRU) in the GPU architecture. The BRU describes how the opcode is
fetched from the dispatch buffer and sent to the execution unit. Moreover, the data
are written to the register file using the write buffer. Part (c) shows the Load and
Store Unit (LSU) of the GPU architecture. The LSU controls how the GPU interacts
with its different memory units like cache, shared memory, global memory, and local
memory. The LSU unit is used by the Streaming Multiprocessor (SM) to interact
with the GPU’s memory units, and data are either written to or read from these
memory units during execution.
19
CHAPTER 3. MULTI2SIM
Figure 3.4: Kepler GPU Memory Architecture [1]
Figure 3.4 shows the top level block diagram of the Kepler GPU architecture.
The GigaThread Engine interacts with the SMs which in turn interact with global
memory. There are a total of 14 SMs defined in the Kepler architecture on the
Multi2Sim and each SM has a dedicated L1 cache module assigned to it. The L2 cache
modules are shared between multiple SMs. Finally, the entire memory architecture
is supported using interconnects and memory controllers. The GigaThread engine
is responsible for storing and assigning the thread blocks to the defined SMs. The
block and grid dimensions are defined in the GigaThread Engine. The engine is in
charge of assigning thread blocks to available SMs for execution. The thread blocks
then get passed on to the SM where they get executed for the fetched instruction
opcode. Based on available resources, multiple thread blocks may be assigned to the
same SM. The Kepler architecture is defined using CUDA, which specifies a thread
as set of instructions to be executed. Multiple threads can be executed in parallel
using the Single Instruction Multiple Data (SIMD) pipeline which enables high data
level parallelism. In CUDA, a group of threads is defined as a thread block, where
all the threads in that thread block execute the same set of instructions in parallel.
A group of 32 threads is defined as a warp, and a thread block is further comprised
of multiple warps. These warps get assigned to available SMs during execution. This
20
CHAPTER 3. MULTI2SIM
Figure 3.5: NVIDIA Kepler Steaming Multi-processor (SM) architecture [1]
architecture defines 4 warp schedulers, thus enabling 4 warps to be scheduled and
processed at the same time [8].
Figure 3.5 shows the block diagram of the Streaming Multiprocessor (SM) in the
Kepler architecture. Each SM is comprised of various functional units like the Single
Precision Unit (SPU), the Branch Unit (BRU), the Double Precision Unit (DPU),
the Integer Math Unit (IMU), the Special Functional Unit (SFU), and the Load and
Store Unit (LSU).
Table 3.4 shows the hardware configuration of the NVIDIA Tesla K20X GPU [17,
18] that was matched by the NVIDIA Kepler GPU architecture on Multi2Sim.
Similar to the AMD GPU architecture, NVIDIA’S Kepler architecture also defines
L1 caches and L2 caches as 4-way and 16-way set associative cache blocks, respectively.
L1 cache had 14 memory modules such that each of the 14 Streaming Multiprocessors
(SM) is directly mapped to a single L1 cache module. L2 cache is divided into 6
21
CHAPTER 3. MULTI2SIM
Table 3.4: NVIDIA Kepler emulated GPU configuration on Multi2Sim [1, 14]
Configuration Value
Computational
Frequency 732 Hz
Number of SMs 14
Number of SIMD lanes 32
Warp Size 32
Max Warps per SM 64
Max Threads per SM 2048
L1 Cache
Number of L1 Cache Modules 14
Associativity / Number of Ways 4
Number of Sets 32
L1 Block Size 128 B
Total L1 Cache Size 16 KB
L2 Cache
Number of L2 Cache Modules 6
Associativity / Number of Ways 16
Number of Sets 32
L2 Block Size 128 B
Total L2 Cache Size 1536 KB
memory modules which are shared by all of the 14 SMs.
The AMD SI and NVIDIA Kepler GPU architectures were both linked to Multi2Sim’s
common memory model. Initially, the NVIDIA Kepler architecture was not connected
to any memory module, as the common memory module defined in Multi2Sim was
based on the AMD GPU’s memory hierarchy. After a discussion with the Multi2Sim
Developers team, there was consensus that the memory model could be connected to
the NVIDIA Kepler GPU architecture to show accurate functionality. Multi2Sim’s
memory module was then connected to the NVIDIA Kepler GPU architecture in the
LSU unit of the SM definition. The LSU unit was responsible for all memory ac-
cesses. Since both GPU architectures used a common memory model, the results
were consistent and easy to compare.
Listings 3.2 and 3.3 show the UNIX commands used to compile and run OpenCL
and CUDA benchmarks on Multi2Sim.
22
CHAPTER 3. MULTI2SIM
1 #Compiling the Benchmarks
2
3 cd Benchmarks Home Dir
4 cd Matr ixMul t ip l i ca t i on #( or any other benchmark )
5 vim MakeFile #(To update the MakeFile and the Multi2Sim d i r e c t o r y )
6 #(And l i n k i n g the OpenCL and CUDA l i b r a r i e s )
7 make #(To compile the benchmark )
Listing 3.2: Compiling the OpenCL and CUDA benchmarks
1 #Running AMD SDK 2.5 Benchmark (forAMD Southern I s l and )
2
3 <multi2s im binary><benchmark host program> l o a d <benchmark GPU binary>
4
5 . /m2s −−s i−r epor t r epor t . txt −−mem−r epor t mem report . txt −−s i−sim de t a i l e d
6 m2s−bench−amdsdk−2.5−master /Matr ixMul t ip l i ca t i on /Matr ixMul t ip l i ca t i on −−load
7 m2s−bench−amdsdk−2.5−master /Matr ixMul t ip l i ca t i on /Mat r i xMu l t ip l i c a t i on Kerne l s . bin −q
8
9
10 #Running CUDA SDK 2.5 Benchmark ( f o r NVIDIA Kepler )
11
12 <multi2s im binary> <benchmark GPU binary>
13
14 export LD LIBRARY PATH=/home/ t o o l s /multi2s im /5.0/ l i b
15
16 . /m2s −−kpl−r epor t kp l r epo r t . txt −−mem−r epor t kpl mem report . txt −−kpl−sim de t a i l e d
17 /home/ypn4262/Thes is /m2sim/benchmarks/m2s−bench−cudasdk−6.5−master /VectorADD/vectorAdd m2s
Listing 3.3: Running the OpenCL and CUDA benchmarks on Multi2Sim
23
Chapter 4
Analyzing Cache Memory Access Patterns
The objective of this research is to analyze cache memory access patterns for GPU
architectures. In particular, the three metrics include cache hit ratios, Most Recently
Used (MRU) and Least Recently Used (LRU) cache access counters, and finally data
cache accesses for intra-warp and inter-warp locality for different workloads.
The L1 and L2 cache statistics and hit ratios were compared for a variety of bench-
marks for both AMD SI and NVIDIA Kepler GPU models. The data cache hit ratios
varied based on the benchmark being tested and depended greatly on the benchmarks
size. Each benchmark was run for multiple input sizes to find the optimal cache hit
ratio. All cache hit ratios were compared for the all benchmarks to understand the
cache performance of both AMD SI and NVIDIA Kepler GPU architectures for the
tested workloads.
The MRU and LRU counters recorded the most and least recently used blocks in
each set for L1 and L2 cache respectively. The results for each set were combined
to generate cache access percentages for each MRU line in cache. The cache access
percentages for different MRU lines were compared and analyzed to find GPU cache
memory trends for a variety of benchmarks. The GPU results were compared against
the CPU as a reference.
The L1 and L2 data cache accesses were studied for intra-warp and inter-warp
localities. Inter-warp accesses represent the accesses made in data cache such that
24
CHAPTER 4. ANALYZING CACHE MEMORY ACCESS PATTERNS
the accessing threads belong to the same warp. Intra-warp accesses represent the
accesses made in data cache where the accessing threads do not belong to the same
warp, but belong to different warps within the warp-pool. The inter-warp and intra-
warp results varied based on the workload characteristics. The cache statistics were
recorded and analyzed for OpenCL and CUDA benchmarks for inter-warp and intra-
warp localities.
4.1 L1 and L2 Cache Hit Ratios
4.1.1 AMD Southern Islands
The hardware configuration defined in Chapter 3 demonstrates the AMD Southern
Islands GPU architecture model on Multi2Sim. This model replicates the AMD
Radeon HD 7970 GPUs series. The AMD Southern Islands GPU model uses a 4-way
set associative L1 cache and a 16-way set associative L2 cache. The L1 cache consists
of two sections, L1 scalar cache and L1 vector cache. L1 scalar cache has a total of
8 defined memory modules, such that each module maps to four compute units. L1
vector cache has a total of 32 defined memory modules, such that each module maps
to a single compute unit. Contrarily, L2 cache does not have any scalar or vector
differentiation. L2 cache consists of 6 cache memory modules which are shared across
all 32 compute units during execution.
Multi2Sim’s pre-defined memory counters were used to find the memory statis-
tics for the entire memory model. These memory statistics were recorded for cache
memory, global memory and local memory. The memory statistics included Accesses,
Evictions, Retrieved Accesses, Hits, Misses, and Hit Ratios for each and every mem-
ory module defined in cache. All these statistics across various memory modules for
the specific type of cache (L1 or L2) were compared. The same set of AMD SDK
OpenCL benchmarks was used to test the behavior and cache hit ratios of L1 scalar
25
CHAPTER 4. ANALYZING CACHE MEMORY ACCESS PATTERNS
cache, L1 vector cache and L2 cache. All the results were tabulated, graphed and
compared to find common trends of cache memory using the defined AMD SI GPU
model on Multi2Sim.
4.1.2 NVIDIA Kepler
Similarly, the L1 and L2 cache statistics were recorded for the NVIDIA Kepler GPU
model on Multi2Sim. Pertaining to the hardware configurations mentioned in Chapter
3 for the Kepler GPU model, the L1 and L2 cache definitions were noted. The
NVIDIA Kepler GPU model did not differentiate the L1 cache as scalar or vector
cache, as the AMD SI GPU model did. The Kepler model defined the L1 cache as
unified cache modules. The Kepler GPU model used a 4-way set associative L1 cache
and a 16-way set associative L2 cache, similar to the AMD SI model. L1 vector cache
has a total of 14 defined memory modules, such that each L1 cache module maps to
a single SM (as the Kepler architecture used 14 SMs). L2 cache consists of 6 cache
memory modules, and these are shared across multiple SMs during execution.
The memory statistics for various CUDA SDK benchmarks were recorded for the
NVIDIA Kepler GPU architecture. The L1/L2 cache hit ratios and memory statistics
were tabulated, graphed and analyzed. The results from the CUDA benchmarks and
OpenCL benchmarks were compared to understand which model worked better with
the respective memory hierarchy.
26
CHAPTER 4. ANALYZING CACHE MEMORY ACCESS PATTERNS
4.2 MRU and LRU Cache Counters for Temporal Locality
The simulator, timing and memory code for both AMD SI and NVIDIA Kepler GPU
architectures was studied to learn how the memory hierarchy was used in each GPU
architecture. As explained in Chapter 3, the AMD GPU architecture used an OpenCL
API platform while the NVIDIA Kepler GPU architecture used a CUDA API plat-
form.
The Compute Unit (for OpenCL) or the Streaming Multiprocessor (for CUDA) is
the main processing unit of the GPU that executes the defined kernels. Both GPU
models use Multi2Sim’s memory model to load and store data. Specific memory units
are defined in the GPU architectures on Multi2Sim, and these units are responsible for
performing all memory load and store operations for the respective GPU architecture.
This memory unit is called the Vector Memory Unit in AMD’s SI model and Load
and Store (LSU) unit in NVIDIA’s Kepler model respectively.
The memory model was studied to find the cache memory implementation. First,
the pre-defined sets and blocks in cache were modified to include set and block coun-
ters within their respective modular designs. These counters were incremented every
time the cache block was accessed. Additionally, the cache code was examined to find
the defined caching policies. The cache block used three cache replacement policies:
Random, First In First Out (FIFO) and Least Recently Used (LRU). These cache
replacement policies signify how the list of blocks are replaced in memory.
The Random replacement policy replaces blocks of memory at random when a
new block is added to cache. Each set uses a list to store all the existing blocks in
that respective set. This list of blocks is created during initialization. The new block
replaces the old block at the exact same position in the list of blocks in cache. The
FIFO Replacement policy replaces the blocks in memory using the First In First Out
(FIFO) protocol as used in the Queue data structure. The oldest block is the first
27
CHAPTER 4. ANALYZING CACHE MEMORY ACCESS PATTERNS
one to be replaced. The new block replaces the old block at the same position in the
list but with its new time stamp to indicate that it is the most recently added block.
The Least Recently Used (LRU) replacement policy works by removing the least
recently used block from cache memory when the cache is full, and replaces that
block with the new entry. Each set in cache uses its own LRU list, which is a list or
collection of blocks for that respective set in cache. This LRU list is used only for the
Replacement LRU caching policy. The Replacement LRU policy is the most broadly
used replacement policy. Unlike the generic list of blocks used for Random and FIFO
replacement policies, this LRU list organizes the blocks from Most Recently Used
(MRU) to Least Recently Used (LRU) blocks. The LRU list keeps track of the blocks
in cache memory and updates the position of the blocks based on the most recent
cache accesses made. The new block replaces the old block at the same position of
the LRU list. The new block is then removed and added to the front of the LRU list
to indicate that it is the most recently used block in cache. For example, if the LRU
list is organized to accommodate Blocks 1, 2 and 3 (in that order), when the cache
is full and Block3 is the least recently used block in cache, then Block3 is replaced
in the list with the new entry. The new block is placed in the same position as the
replaced block in the LRU list. That new block node is then removed from the list
and is added to the head of the LRU list indicating that Block3 is the most recently
used block and is stored in the first MRU line. The LRU list is ordered from most
recently used blocks to least recently used blocks in cache memory, for each set in
cache. Figure 4.1 shows the three cache replacement policies implemented in the
memory model and their working.
Cache specific counters were defined in the memory hierarchy to keep track of the
Most Recently Used (MRU) and Least Recently Used (LRU) lines in cache memory.
This helped identify the temporal locality of GPU cache and identify certain memory
access patterns in cache memory. These counters were called MRU and LRU counters
28
CHAPTER 4. ANALYZING CACHE MEMORY ACCESS PATTERNS
Figure 4.1: Cache Replacement Policies [15]
as the name signifies, and helped identify the accesses made to the most recently used
and least recently used lines in cache memory. These MRU and LRU counters were
defined for both L1 and L2 cache blocks.
A new LRU counter list was created to replicate the LRU caching list used by
each set in cache. The list was defined in the architecture of the cache itself, such
that it could be updated every time the LRU caching list was updated by the LRU
cache replacement policy. As soon as a cache access was made to a specific block
in cache memory, that block’s index in the LRU caching list was obtained and the
block’s internal counter was incremented. The LRU counter list was then incremented
at the same index to show that the block at that index was accessed. The defined
LRU caching policy protocol identified the block being accessed and then removed
the block node from the LRU caching list placing it at the head of the list. This
indicated that the first block in the list was the most recently used block and the
29
CHAPTER 4. ANALYZING CACHE MEMORY ACCESS PATTERNS
last block was the least recently used block. The next time that block was accessed
in cache, the LRU list would increment the counter at the block’s new index thus
replicating the defined functionality.
The same process was repeated for every cache access made, and the LRU counters
(at the respective block indices) were incremented every time a block was accessed
based on the initial position of the block in the LRU caching list. This meant that
the original index of the block was used to show which cache line was accessed in
terms of most recently used down to least recently used.
Listing 4.1 shows the MRU counter implementation in cache. Listing 4.2 shows
how the MRU counters were incremented and updated based on the original index of
the accessed block. This was done in correlation with the LRU caching policy. The
AccessBlock() function defined in the Cache.cc file was modified. Listing 4.3 shows
a small portion of code from the file SystemEvents.cc. The code is used to call the
AccessBlock() function every time a cache access is made. The memory module’s
access counter and statistics are also updated here.
30
CHAPTER 4. ANALYZING CACHE MEMORY ACCESS PATTERNS
1 c l a s s Set
2 {
3 // Only Cache needs to i n i t i a l i z e f i e l d s
4 f r i e nd c l a s s Cache ;
5
6 // L i s t o f b locks in LRU order − Pre−de f ined and used by the LRU caching po l i c y
7 misc : : L i st<Block> l r u l i s t ;
8
9 // Pos i t i on in Cache : : b locks where the b locks s t a r t f o r t h i s s e t
10 Block ∗ b locks ;
11
12 // Creat ing a l i s t o f MRU/LRU counter s
13 i n t ∗ l r u c oun t e r s ;
14
15
16 void i n i t l r u c o u n t e r s ( i n t n){
17 th i s−>l r u c oun t e r s = new in t [ n ] ;
18 f o r ( i n t i =0; i<n ; i++){
19 th i s−>l r u c oun t e r s [ i ] = 0 ;
20 }
21 }
22
23 void s e t l r u c o un t e r s ( i n t ∗ ar r ){
24 th i s−>l r u c oun t e r s = arr ;
25 }
26
27 in t ∗ g e t l r u c oun t e r s ( ) {
28 return th i s−>l r u c oun t e r s ;
29 }
30
31 void in c r ement l ru coun t e r s ( i n t po s i t i o n ){
32 th i s−>l r u c oun t e r s [ p o s i t i o n ]++;
33 }
34 } ;
Listing 4.1: MRU implementation in the Set architecture of cache. (Defined in Cache.cc)
31
CHAPTER 4. ANALYZING CACHE MEMORY ACCESS PATTERNS
1 void Cache : : AccessBlock ( unsigned s e t i d , unsigned way id )
2 {
3 // Get s e t and block
4 Set ∗ s e t = getSet ( s e t i d ) ;
5 Block ∗block = getBlock ( s e t i d , way id ) ;
6
7 // A block i s moved to the head o f the l i s t f o r LRU po l i c y . I t w i l l a l s o be moved i f
8 // i t i s i t s f i r s t a c c e s s f o r FIFO pol i cy , i . e . , i f the s t a t e o f the block was i n v a l i d .
9 bool move to head = rep lacement po l i cy == ReplacementLRU | |
10 ( r ep l a c ement po l i cy == ReplacementFIFO && block−>s t a t e == BlockInva l id ) ;
11
12 // Move to the head o f the LRU l i s t
13 i f ( move to head )
14 {
15 // ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ Incrementing the MRU counters ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
16 in t index = 0 ;
17
18 f o r ( misc : : L i st<Block > : : I t e r a t o r i t e r = set−> l r u l i s t . begin ( ) ;
19 i t e r != set−> l r u l i s t . end ( ) ; ++i t e r ){
20
21 i f ( i t e r . node == &(block−>l ru node ) ){
22 set−>i n c r ement l ru coun t e r s ( index ) ;
23 break ;
24 }
25 index++;
26 }
27 set−> l r u l i s t . Erase ( block−>l ru node ) ;
28 set−> l r u l i s t . PushFront ( block−>l ru node ) ;
29 }
30
31 // Code − For each Block counter
32 // block−>incrementCounter ( ) ;
33 }
Listing 4.2: Incrementing the MRU counters in cache (Defined in Cache.cc)
1 // S t a t i s t i c s
2 module−>i n cAcce s s e s ( ) ;
3 module−>i n c i n t r a a c c e s s c o u n t e r s ( frame−>warp pool id , frame−>warp id in poo l ) ;
4 module−>UpdateStats ( frame ) ;
5
6 // Entry i s locked . Record the t r an s i e n t tag so that a
7 // subsequent lookup de t e c t s that the block i s being brought .
8 // Also , update LRU counter s here .
9 cache−>setTrans ientTag ( frame−>set , frame−>way , frame−>tag ) ;
10 cache−>IncAccessCounter ( ) ;
11 cache−>AccessBlock ( frame−>set , frame−>way) ;
12
13 // Access l a t ency
14 module−>i n cD i r e c to ryAcce s s e s ( ) ;
15 es im engine−>Next ( ev en t f i nd and l o ck a c t i on , module−>getDirectoryLatency ( ) ) ;
Listing 4.3: Incrementing the memory statistics in cache (Defined in SystemEvents.cc)
32
CHAPTER 4. ANALYZING CACHE MEMORY ACCESS PATTERNS
4.3 Inter-warp and Intra-warp Cache Locality
Inter-warp and intra-warp counters were defined in order to understand the L1 and
L2 data cache locality. Inter-warp accesses represented the accesses made in data
cache such that the executed threads belonged to the same warp. Intra-warp accesses
represented the accesses made in data cache where the executed threads did not
belong to the same warp, but belonged to different warps within the warp-pool.
This set of inter and intra counters were defined in the same way for both AMD
SI and NVIDIA Kepler GPU architectures on Multi2Sim. Since the AMD SI GPU
architecture uses OpenCL APIs, these counters represent inter-wavefront and intra-
wavefront accesses. In the OpenCL model, a wavefront is defined as a collection
of 64 work-items. Contrarily, since the NVIDIA Kepler architecture uses CUDA
APIs, these counters represent inter-warp and intra-warp accesses. In the CUDA
implementation, a warp is defined as a collection of 32 threads.
These inter-warp (or inter-wavefront) and intra-warp (or intra-wavefront) accesses
were recorded for each memory module of L1 and L2 data cache. The counters were
defined and incremented in the memory model of Multi2Sim (Module.cc). Both AMD
SI and NVIDIA Kepler used the same memory model on Multi2Sim, thus making the
memory code modifications identical for both architectures, although, the Load and
Store Units of both GPU architectures needed to be modified to pass on the warp
id and warp-pool id (or wavefront id and wavefront pool id) for each warp/wavefront
during execution. The vector memory unit was in charge of load and store instructions
on the AMD SI architecture. Meanwhile the Load and Store Unit (LSU) was in charge
of all memory operations in the NVIDIA CUDA architecture. The inter-warp and
intra-warp counters definitions for the memory module are shown in Listing 4.4.
33
CHAPTER 4. ANALYZING CACHE MEMORY ACCESS PATTERNS
1 // Inter−warp and intra−warp code in Module . h and Module . cc f i l e s in the memory model o f Multi2Sim
2
3 // Temp warp id and warp pool id to be s to red by the LSU unit during schedu l ing warps to SMs
4 in t temp warp pool id = 0 ;
5 i n t temp warp id in poo l = 0 ;
6
7 // The warp id and warp pool id from the prev ious frame acce s sed
8 in t l a s t wa rp i d = −1;
9 i n t l a s t wa rp poo l i d = −1;
10 // The warp id and warp pool id from the prev ious frame that was h i t f o r int ra−warp and inte r−warp
11 in t l a s t i n t r a h i t w a r p i d = −1;
12 i n t l a s t i n t r a h i t w a r p p o o l i d = −1;
13 i n t l a s t i n t e r h i t w a r p i d = −1;
14 i n t l a s t i n t e r h i t w a r p p o o l i d = −1;
15
16 // Intra−warp and inte r−warp acc e s s and h i t counter s
17 i n t in t ra warp counte r = 0 ;
18 i n t in t e r warp counte r = 0 ;
19 i n t i n t r a h i t c o un t e r = 0 ;
20 i n t i n t e r h i t c o u n t e r = 0 ;
21
22 // Function to increment in t e r−warp and intra−warp counters
23 void i n c i n t e r i n t r a c o u n t e r ( i n t warp pool id , i n t warp id ){
24 i f ( ( l a s t wa rp i d == −1) && ( l a s t wa rp poo l i d == −1) ){
25 l a s t wa rp poo l i d = warp poo l id ;
26 l a s t wa rp i d = warp id ;
27 }
28 e l s e {
29 i f ( ( l a s t wa rp i d == warp id ) && ( l a s t wa rp poo l i d == warp poo l id ) )
30 in t ra warp counte r++;
31 e l s e {
32 in t e r warp counte r++;
33 l a s t wa rp poo l i d = warp poo l id ;
34 l a s t wa rp i d = warp id ;
35 }
36 }
37 }
38
39 void i n c i n t e r i n t r a h i t c o u n t e r ( i n t warp pool id , i n t warp id ){
40 i f ( ( l a s t i n t r a h i t w a r p i d == −1) && ( l a s t i n t r a h i t w a r p p o o l i d == −1) ){
41 l a s t i n t r a h i t w a r p p o o l i d = warp poo l id ;
42 l a s t i n t r a h i t w a r p i d = warp id ;
43 }
44 e l s e {
45 i f ( ( l a s t i n t r a h i t w a r p i d == warp id ) && ( l a s t i n t r a h i t w a r p p o o l i d == warp poo l id ) )
46 i n t r a h i t c o un t e r++;
47 e l s e {
48 i n t e r h i t c o u n t e r++;
49 l a s t i n t r a h i t w a r p p o o l i d = warp poo l id ;
50 l a s t i n t r a h i t w a r p i d = warp id ;
51 }
52 }
53 }
Listing 4.4: Inter-warp and intra-warp definitions in Module.cc
34
CHAPTER 4. ANALYZING CACHE MEMORY ACCESS PATTERNS
Although both GPU architectures used the same memory model in Multi2Sim,
the Load and Store Units of both GPU architectures were defined differently. The
Load and Store Units for both GPU architectures were modified to pass on the warp
id and warp-pool id (or wavefront id and wavefront pool id) for each warp/wavefront
to the data cache memory module during execution. The vector memory unit was in
charge of load and store instructions on the AMD SI architecture. However, the Load
and Store Unit (LSU) was in charge of all memory operations in the NVIDIA CUDA
architecture. The code additions in the Load and Store Unit included recording the
warp id and warp pool id for each warp being executed by the SM and passing on
those values to the memory module. The memory module used the warp pool id and
warp id and incorporated them in each frame object for every cache access made.
This helped track the inter-warp and intra-warp accesses in the memory module.
Every time the frame was accessed as a hit or a miss, the memory module updated
its internal counters in the UpdateStats function. The warp id and warp pool id
of the previously accessed frame were compared to those of the currently accessed
frame to determine whether the data cache access represented inter-warp or intra-
warp locality. The same process was repeated in the Vector Memory Unit of the
AMD SI GPU architecture, but for wavefronts instead of warps. The inter-warp and
intra-warp implementation in the LSU unit of the Kepler GPU architecture is shown
in Listing 4.5. The UpdateStats function defined in the memory module is shown in
Listing 4.6.
35
CHAPTER 4. ANALYZING CACHE MEMORY ACCESS PATTERNS
1 // Update Uop wr i t e ready cyc l e
2 f o r ( auto i t = uop−>getWarp ( )−>ThreadsBegin ( ) , e = uop−>getWarp ( )−>ThreadsEnd ( ) ; i t != e ; ++i t )
3 {
4 // Get thread
5 Thread ∗ thread = i t−>get ( ) ;
6
7 // Access memory f o r each a c t i v e thread
8 i f ( uop−>getWarp ( )−>getThreadActive ( thread−>getIdInWarp ( ) ) )
9 {
10 // Get the thread uop
11 Uop : : ThreadInfo ∗ t h r e ad i n f o = &uop−>t h r e a d i n f o l i s t
12 [ thread−>getIdInWarp ( ) ] ;
13
14 // Check i f the thread i n f o s t r u c t has a l ready made a s u c c e s s f u l
15 // cache ac c e s s . I f so , move on to the next thread
16 i f ( th r ead in fo−>acce s s ed cache )
17 cont inue ;
18
19 // Trans late v i r t u a l address to a phys i c a l address
20 unsigned phy s i c a l add r e s s = sm−>getGPU()−>getMmu( )−>
21 Trans lateVi r tua lAddress (
22 uop−>getThreadBlock ( )−>getGrid ( )−>address space ,
23 th r ead in fo−>g loba l memory acces s addres s ) ;
24
25 // Make sure we can acc e s s the cache i f so , submit the ac c e s s .
26 // i f not , mark the acce s s ed f l a g o f the thread i n f o s t r u c t
27 i f (sm−>cache−>canAccess ( phy s i c a l add r e s s ) )
28 {
29 // Se t t ing warp id and warp pool id f o r Intra−warp and inte r−warp Accesses
30 Warp ∗warp = uop−>getWarp ( ) ;
31 i n t warp id = warp−>get Id ( ) ;
32 i n t warp poo l id = uop−>getWarpPoolId ( ) ;
33
34 WarpPoolEntry ∗warp poo l entry = uop−>getWarpPoolEntry ( ) ;
35 i n t i d i n warp poo l = warp pool entry−>getIdInWarpPool ( ) ;
36
37 // Se t t ing the Warp Pool and Warp Ids in Module
38 sm−>cache−>temp warp pool id = warp poo l id ;
39 sm−>cache−>t emp warp id in poo l = id in warp poo l ;
40
41 // This i s where the Cache acc e s s c a l l i s made
42 sm−>cache−>Access ( module access type , phy s i c a l add r e s s ) ;
43 th r ead in fo−>acce s s ed cache = true ;
44
45 // S t a t i s t i c s
46 th i s−>ca che a c c e s s c oun t e r++;
47 sm−>i n c t o t a l c a c h e a c c e s s e s ( ) ;
48 }
49 e l s e
50 a l l t h r e a d s a c c e s s e d = f a l s e ;
51 }
52 }
Listing 4.5: Inter-warp and intra-warp definitions in the Load and Store Unit (LSU.cc)
36
CHAPTER 4. ANALYZING CACHE MEMORY ACCESS PATTERNS
1 // UpdateStats func t i on de f ined in Module . cc in the memory model o f Multi2Sim
2 // The intra−warp and in te r−warp counters are a l s o incremented here
3 void Module : : UpdateStats (Frame ∗ frame )
4 {
5 // Assert that the frame module i s in f a c t the module
6 a s s e r t ( t h i s == frame−>getModule ( ) ) ;
7
8 // Record acc e s s type . Hits and Misses recorded s epa ra t e l y here so that
9 // we can san i ty check them aga in s t the t o t a l number o f a c c e s s e s .
10 i f ( frame−>r e q u e s t d i r e c t i o n == Frame : : RequestDirectionUpDown ) {
11 i f ( frame−>read ) {
12 num reads++;
13 i f ( frame−>r e t r y )
14 num retry reads++;
15 i f ( frame−>b lock ing )
16 num blocking reads++;
17 e l s e
18 num non blocking reads++;
19 i f ( frame−>h i t ) {
20 // Incrementing the in t e r−warp and intra−warp counters
21 i n c i n t e r i n t r a h i t c o u n t e r ( frame−>warp pool id , frame−>warp id in poo l ) ;
22 num read hits++;
23 i f ( frame−>r e t r y )
24 num re t ry r ead h i t s++;
25 }
26 e l s e {
27 num read misses++;
28 i f ( frame−>r e t r y )
29 num ret ry read mis se s++;
30 }
31 }
32 // Must go a f t e r Read
33 e l s e i f ( frame−>nc wr i t e ) {
34 num nc writes++;
35 i f ( frame−>r e t r y )
36 num ret ry nc wr i t e s++;
37 i f ( frame−>b lock ing )
38 num block ing nc wr i t e s++;
39 e l s e
40 num non block ing nc wr i te s++;
41 i f ( frame−>h i t ) {
42 // Incrementing the in t e r−warp and intra−warp counters
43 i n c i n t e r i n t r a h i t c o u n t e r ( frame−>warp pool id , frame−>warp id in poo l ) ;
44 num nc wr i t e h i t s++;
45 i f ( frame−>r e t r y )
46 num re t r y nc wr i t e h i t s++;
47 }
48 e l s e {
49 num nc wr i te misses++;
50 i f ( frame−>r e t r y )
51 num re t ry nc wr i t e mi s s e s++;
52 }
53 }
54 e l s e i f ( frame−>wr i te ) {
55 num writes++;
56 i f ( frame−>r e t r y )
37
CHAPTER 4. ANALYZING CACHE MEMORY ACCESS PATTERNS
57 num ret ry wr i t e s++;
58 i f ( frame−>b lock ing )
59 num block ing wr i te s++;
60 e l s e
61 num non block ing wr i tes++;
62 i f ( frame−>h i t ) {
63 // Incrementing the in t e r−warp and intra−warp counters
64 i n c i n t e r i n t r a h i t c o u n t e r ( frame−>warp pool id , frame−>warp id in poo l ) ;
65 num wr i te h i t s++;
66 i f ( frame−>r e t r y )
67 num re t ry wr i t e h i t s++;
68 }
69 e l s e {
70 num write misses++;
71 i f ( frame−>r e t r y )
72 num re t ry wr i t e mi s s e s++;
73 }
74 }
75 e l s e {
76 throw misc : : Panic ( ” Inva l i d memory operat ion type” ) ;
77 }
78 }
79 e l s e i f ( frame−>r e q u e s t d i r e c t i o n == Frame : : RequestDirectionDownUp ) {
80 a s s e r t ( frame−>h i t ) ;
81 i f ( frame−>wr i te ) {
82 num write probes++;
83 i f ( frame−>r e t r y )
84 num ret ry wr i t e probes++;
85 }
86 e l s e i f ( frame−>read ) {
87 num read probes++;
88 i f ( frame−>r e t r y )
89 num retry read probes++;
90 }
91 e l s e {
92 throw misc : : Panic ( ” Inva l i d memory operat ion type” ) ;
93 }
94 }
95 e l s e {
96 num hlc ev i c t i on s++;
97 }
98 }
Listing 4.6: Updating the statistics of the memory module (Module.cc)
38
Chapter 5
Results
5.1 Cache Hit Ratios
A set of benchmarks from the AMD SDK 2.5 Suite [21] and CUDA SDK 6.5 Suite
[22] were run on Multi2Sim to test the cache behavior of the GPU for a variety
of workloads. The OpenCL benchmarks from the AMD SDK suite were tested on
the AMD SI GPU model, while the CUDA benchmarks were tested on the NVIDIA
Kepler GPU model. Figure 5.1 shows the L1 and L2 cache hit ratio comparisons for
the executed OpenCL benchmarks. Figure 5.2 shows the L1 and L2 cache hit ratio
comparisons for the executed CUDA benchmarks.
Referring to the OpenCL benchmark results from Figure 5.1, the average hit
ratios for L1 scalar cache, L1 vector cache and L2 cache were 70%, 46% and 35%
respectively. The L1 scalar data cache is responsible for storing scalar and constant
values during execution while the L1 vector data cache is responsible for storing the
vector data for the workload being tested on the GPU. Comparing the hit ratios of
L1 and L2 cache for the CUDA benchmarks in Figure 5.2, the average hit ratios for
L1 and L2 data cache were 46% and 26% respectively.
For both sets of benchmarks, the L2 cache showed the lowest hit ratios among the
L1 and L2 data caches. This was because unlike L1 cache modules, L2 modules were
shared by multiple CUs or SMs during execution. These low level cache modules in
39
CHAPTER 5. RESULTS
Figure 5.1: Cache Memory Hit Ratios for the OpenCL benchmarks
Figure 5.2: Cache Memory Hit Ratios for the CUDA benchmarks
40
CHAPTER 5. RESULTS
L2 cache were accessed only if the L1 cache access was not successful.
Both L1 and L2 data caches display hit ratios of under 50% on average, and this
is because of the working of the GPU. The GPU relies heavily on parallel thread
execution and less on cache memory. Contrarily, the CPU has fewer number of cores
and processes a small number of threads at a time but relies more heavily on cache
memory to show high performance levels. The GPU exploits Thread-level Parallelism
(TLP) while the CPU exploits Instruction-level Parallelism (ILP).
5.2 MRU and LRU Temporal Locality Results
A set of benchmarks from the AMD SDK 2.5 Suite [21] and CUDA SDK 6.5 Suite
[22] were run on Multi2Sim to find temporal locality of multiple MRU lines for L1
and L2 cache blocks. The OpenCL benchmarks from the AMD SDK 2.5 suite were
used to test the cache behavior of the AMD SI GPU model, while the CUDA SDK
6.5 suite was used to test the cache behavior of the NVIDIA Kepler GPU model.
Each benchmark was compiled and then run on the simulator for a specific number of
elements as specified in the respective benchmarks kernel files. The temporal locality
of multiple lines were recorded for all benchmarks. These results were compared to
analyze the cache behavior for different types of workloads and the locality for each
MRU line.
Table 5.1 and Figure 5.3 indicate the L1 cache statistics and MRU results for the
4-way set associative L1 cache architecture. Table 5.2 and Figure 5.4 indicate the L2
cache statistics and MRU results for the 16-way set associative L2 cache architecture.
The MRU counters for each MRU line in cache were recorded.
41
CHAPTER 5. RESULTS
Table 5.1: OpenCL Benchmark MRU results for L1 Cache (per module) of the AMD SI
architecture
Benchmark MRU0 MRU1 MRU2 MRU3
Bitonic Sort 100 % 0 % 0 % 0 %
Binomial Option 81 % 17 % 1 % 1 %
Floyd Marshall 76 % 6 % 8 % 10 %
Radix Sort 96 % 1 % 1 % 2 %
Simple Convolution 59 % 16 % 7 % 18 %
Recursive Gaussian 70 % 11 % 6 % 13 %
Sobel Filter 81 % 12 % 5 % 2 %
Matrix Multiplication 85 % 4 % 5 % 6 %
Mersenne Twister 100 % 0 % 0 % 0 %
Monte Carlo 90 % 3 % 3 % 4 %
Average 84 % 7 % 3 % 6 %
Figure 5.3: OpenCL benchmark results for the AMD SI architecture for L1 cache
42
CHAPTER 5. RESULTS
Table 5.2: CUDA Benchmark MRU results for L1 Cache of the NVIDIA Kepler architec-
ture
Benchmark MRU0 MRU1 MRU2 MRU3
Vector Add 96 % 1 % 1 % 2 %
Transpose 93 % 3 % 2 % 2 %
Simple Streams 51 % 28 % 15 % 6 %
Simple Call Back 87 % 2 % 2 % 9 %
Histogram 65 % 32 % 2 % 1 %
Matrix Multiplication 90 % 4 % 3 % 3 %
Fast Walsh Transform 58 % 37 % 1 % 4 %
Cpp Overload 99 % 1 % 0 % 0 %
Average 77 % 15 % 4 % 4 %
Figure 5.4: CUDA benchmark MRU Results for L1 Cache of the NVIDIA Kepler archi-
tecture
43
CHAPTER 5. RESULTS
For the OpenCL benchmarks, the most recently used line (MRU0) on average
comprised of 84% of the total number of access made to L1 cache. The second, third
and fourth MRU lines consists of 7%, 3% and 6% of all the L1 cache accesses. For
the CUDA benchmarks, the first MRU line on average consists of 77% of the total
number of access made to L1 cache. The second, third and fourth MRU lines consists
of 15%, 3.53% and 4% of all the L1 cache accesses.
Some OpenCL benchmarks like Bitonic Sort, Radix Sort, Mersenne Twister and
Monte Carlo showed the highest cache access percentages for the most recently used
line of 90% or more for the OpenCL benchmarks. Similarly, the VectorAdd, Trans-
pose and Cpp Overload CUDA benchmarks also showed the highest number of cache
accesses in the most recently used line of above 90% or more for the CUDA bench-
marks. The Simple Convolution OpenCL benchmark shows the lowest cache access
percentage for the most recently used line at under 50% for the OpenCL results.
Similarly, Fast Walsh Transform and Simple Streams CUDA benchmarks show the
lowest percentages of accesses in the most recently used line of under 60% for the
CUDA results.
Tables 5.3 and 5.4 indicate the L2 cache statistics and MRU results for the 16-
way set associative L2 cache architecture for the OpenCL benchmarks and CUDA
benchmarks respectively. Similar to the L1 cache results, the MRU counters for each
MRU line in cache were recorded. The graphical representation of the MRU results
for L2 cache for the OpenCL and CUDA benchmarks are shown in Figures 5.5 and
5.6 respectively.
44
CHAPTER 5. RESULTS
Table 5.3: OpenCL Benchmark MRU results for L2 Cache of the AMD SI architecture
Benchmark MRU0 MRU1 MRU2 MRU3 - 15
Bitonic Sort 97 % 3 % 0 % 0 %
Binomial Option 99 % 1 % 0 % 0 %
Floyd Marshall 31 % 12 % 16 % 41 %
Radix Sort 75 % 24 % 1 % 0 %
Simple Convolution 73 % 27 % 0 % 0 %
Recursive Gaussian 51 % 14 % 9 % 26 %
Sobel Filter 36 % 13 % 12 % 39 %
Matrix Multiplication 32 % 22 % 18 % 28 %
Mersenne Twister 53 % 15 % 10 % 22 %
Monte Carlo 58 % 2 % 2 % 38 %
Average 60 % 13 % 7 % 20 %
Figure 5.5: OpenCL benchmark results for the AMD SI architecture for L2 cache
45
CHAPTER 5. RESULTS
Table 5.4: CUDA Benchmark MRU results for L2 Cache of the NVIDIA Kepler architec-
ture
Benchmark MRU0 MRU1 MRU2 MRU3 - MRU15
Vector Add 65 % 3 % 3 % 29 %
Transpose 33 % 11 % 10 % 46 %
Simple Streams 38 % 10 % 10 % 42 %
Simple Call Back 56 % 1 % 1 % 42 %
Histogram 45 % 13 % 11 % 31 %
Matrix Multiplication 31 % 22 % 17 % 30 %
Fast Walsh Transform 43 % 9 % 6 % 42 %
Cpp Overload 100 % 0 % 0 % 0 %
Average 46 % 10 % 8 % 36 %
Figure 5.6: CUDA benchmark MRU Results for L2 Cache of the NVIDIA Kepler archi-
tecture
46
CHAPTER 5. RESULTS
For the OpenCL benchmarks, the first MRU line on average is comprised of 60% of
the total number of access made to L2 cache in the AMD SI architecture. The second
and third MRU lines is comprised of 13% and 7%. Since the L2 cache was a 16-way
set associative cache, all the remaining MRU lines (MRU3 to MRU15) were grouped
together for clearer analysis, and these accounted for 20% of all cache accesses in L2
cache. For the CUDA benchmarks, the first MRU line on average is comprised of
46% of the total number of accesses made to L2 cache in the Kepler architecture.
The second and third MRU lines is comprised of 10% and 8%. The remaining MRU
lines accounted for 36% of all the cache accesses made to L2 cache. The L2 cache
showed lower MRU locality than L1 cache because there were more number of threads
sharing 16 MRU lines in L2 cache as opposed to 4 MRU lines in L1 cache.
Similar to the L1 cache results, some OpenCL benchmarks like Bitonic Sort, Bi-
nomial Option and Simple Convolution show a high percentage of cache accesses for
the first MRU line of 80% or more. The Cpp Overload CUDA benchmark shows the
highest percentage of accesses in the most recently used line of 100%. The Floyd
Marshall and Matrix Multiplication OpenCL benchmarks show low cache access per-
centages for the most recently used line of under 35%. Similarly, Transpose and
Matrix Multiplication CUDA benchmarks show the lowest percentages of accesses in
the most recently used line of under 35%. Likewise, the most recently used line had
the highest number of L2 cache accesses. After the first MRU line, the grouped value
for MRU3 to MRU15 was the next highest access percentage for L2 cache. This was
because of the compounding effect of grouping the MRU lines together.
The most recently used line dominated the number of cache accesses made to L1
and L2 caches for both sets of benchmarks. This indicates that if any line (or block)
needs to preserved, transferred or improved in GPU cache then it is the most recently
used line in each set. The second MRU line recorded the next highest cache access
percentage and hence this line would be the next to be preserved or improved after
47
CHAPTER 5. RESULTS
the most recently used MRU line.
5.3 Inter-warp and intra-warp Locality Results
The L1 and L2 data cache accesses for inter-warp and intra-warp localities for the
OpenCL are tabulated as shown in Tables 5.5 and 5.6. Table 5.5 represents the cache
results for the OpenCL benchmarks, and Table 5.6 represents the cache results for
the CUDA benchmarks. These results show the correlation between the data cache
accesses and the warp based localities.
From Table 5.5 and Figure 5.7, the L1 cache showed higher data cache accesses re-
sulting from intra-warp locality for the OpenCL benchmarks. The average intra-warp
and inter-warp cache access percentages for L1 cache were 68% and 32% respectively.
Meanwhile, L2 cache showed higher data cache accesses resulting from inter-warp lo-
cality. The average intra-warp and inter-warp cache access percentages for L2 cache
were 28% and 72% respectively.
The CUDA results from Table 5.6 and Figure 5.8 show how the cache localities
differed between L1 and L2 cache. Both L1 and L2 caches showed higher data cache
accesses resulting from inter-warp locality. The inter-warp cache access percentages
for L2 cache were slightly higher than those for L1 cache. The average intra-warp
and inter-warp cache access percentages for L1 cache were 33% and 67% respectively.
The average intra-warp and inter-warp cache access percentages for L2 cache were
21% and 79% respectively.
48
CHAPTER 5. RESULTS
Table 5.5: Inter-warp and Intra-warp Access Percentages for L1 and L2 cache for OpenCL
benchmarks
L1 Cache L2 Cache
Benchmark Intra-warp % Inter-warp % Intra-warp % Inter-warp %
Bitonic Sort 4 % 96 % 23 % 77 %
Eigen Value 49 % 51 % 84 % 16 %
Floyd Marshall 66 % 34 % 8 % 92 %
Radix Sort 100 % 0 % 100 % 0 %
Simple Convolution 78 % 22 % 21 % 79 %
Recursive Gaussian 81 % 19 % 23 % 77 %
Sobel Filter 75 % 25 % 6 % 94 %
Matrix Multiplication 69 % 31 % 17 % 83 %
Mersenne Twister 93 % 7 % 10 % 90 %
Monte Carlo 87 % 13 % 53 % 47 %
Average 68 % 32 % 28 % 72 %
Figure 5.7: OpenCL L1 vector and L2 cache intra-warp and inter-warp access percentages
49
CHAPTER 5. RESULTS
Table 5.6: Inter-warp and Intra-warp Access Percentages for L1 and L2 cache for CUDA
benchmarks
L1 Cache L2 Cache
Benchmark Intra-warp % Inter-warp % Intra-warp % Inter-warp %
Vector Add 28 % 72 % 35 % 65 %
Transpose 53 % 47 % 6 % 94 %
Simple Streams 17 % 83 % 5 % 95 %
Simple Call Back 21 % 79 % 32 % 68 %
Matrix Multiplication 15 % 85 % 3 % 97 %
Fast Walsh Transform 39 % 61 % 28 % 72 %
Cpp Overload 65 % 35 % 33 % 67 %
Histogram 24 % 76 % 23 % 77 %
Average 33 % 67 % 21 % 79 %
Figure 5.8: CUDA L1 vector and L2 cache intra-warp and inter-warp access percentages
50
CHAPTER 5. RESULTS
The inter-warp and intra-warp cache access hits and misses for L1 and L2 cache
are graphed in Figures 5.9, 5.10, 5.11, and 5.12. Figures 5.9 and 5.10 represent the
OpenCL cache results while Figures 5.11 and 5.12 represent the cache results for the
CUDA benchmarks.
The OpenCL benchmarks showed a high percentage of misses for both inter-warp
and intra-warp localities. This was directly related to the hit ratios recorded for L1
and L2 cache. Since L1 cache showed a higher intra-warp access percentage, the
statistics were dominated by intra-warp hits or misses. Contrarily, L2 cache showed
a higher inter-warp access percentage which meant that the L2 cache statistics were
dominated by inter-warp hits or misses.
The CUDA benchmarks recorded a higher intra-warp access percentage for both
L1 and L2 cache which meant that both L1 and L2 statistics were dominated by
inter-warp hits or misses. The low hit ratios further suggested a higher number of
inter-warp misses, especially for L2 cache which showed a hit ratio of 26%.
Most of the OpenCL and CUDA benchmark results show the fewer number of hits
resulting from inter-warp or intra-warp locality. This shows the behavior of the default
thread scheduler and the characteristics of the tested workloads. OpenCL benchmarks
favoured intra-warp locality for L1 cache and inter-warp locality for L2 cache. CUDA
benchmarks favoured inter-warp locality for both L1 and L2 caches. The thread
scheduler can be modified to exploit inter-warp locality by allowing threads from
different warps to load and reuse the data from cache, to limit the number of inter-
warp accesses. This can be done provided all the threads are being processed at the
same rate.
51
CHAPTER 5. RESULTS
Figure 5.9: OpenCL L1 vector intra-warp and inter-warp hits and misses
Figure 5.10: OpenCL L2 intra-warp and inter-warp hits and misses
52
CHAPTER 5. RESULTS
Figure 5.11: CUDA L1 intra-warp and inter-warp hits and misses
Figure 5.12: CUDA L2 intra-warp and inter-warp hits and misses
53
CHAPTER 5. RESULTS
Table 5.7: MRU Results for the CPU [11]
1st MRU line 2nd MRU line Rest
Average 92 % 6 % 2 %
5.4 Comparing the CPU and GPU results
Savaldor Petit et al. [11] studied the cache access patterns of the CPU. This paper
records the percentage of cache hits along multiple MRU lines of each set in CPU
cache. A variety of CPU benchmarks were tested to understand how the CPU handled
different workloads. These benchmark results were tabulated and graphed, and the
average cache access percentages for each MRU line in CPU cache were recorded.
Table 5.7 summarizes those findings to compare the GPU’s cache access patterns to
those of the CPU.
Referring to the CPU statistics from Table 5.7, the first MRU line dominates the
cache accesses with the highest percentage of cache accesses. The second MRU line
shows the next highest percentage of hits followed by the other MRU lines. The
remaining MRU lines are grouped together for clearer results in the table as the
percentages were minimal. The first MRU line accounts for an average of 92% of
all the cache accesses. The second MRU line accounts for an average of 6%. The
remaining set of MRU lines account for 2%.
In comparison to the results of the GPU simulations for AMD and Kepler ar-
chitectures, the CPU cache access patterns seem to be very similar to the GPU.
Recalling Table 5.1 in Section 5.2, the first, second, third and fourth MRU lines ac-
count for 84%, 7%, 4% and 5% of all L1 cache accesses on average, respectively for
the AMD Southern Islands GPU architecture. Reviewing Table 5.3 in Section 5.2,
the first, second and third MRU lines account for 60%, 13% and 7% of all L2 cache
accesses on average, respectively for the AMD Southern Islands GPU architecture.
54
CHAPTER 5. RESULTS
Table 5.8: Comparison of MRU results for the CPU and the GPU [11]
Device Architecture 1st MRU line 2nd MRU line Rest
CPU Intel Itanium2 - L1 cache 92 % 6 % 2 %
GPU AMD SI - L1 cache 84 % 7 % 9 %
GPU AMD SI - L2 cache 60 % 13 % 27 %
GPU NVIDIA Kepler - L1 cache 77 % 15 % 8 %
GPU NVIDIA Kepler - L2 cache 46 % 10 % 44 %
The remaining set of MRU lines (MRU3 to MRU15) in L2 cache of the GPU was
grouped together as shown in Table 5.2, and accounted for 20% of all L2 cache ac-
cesses. In Table 5.2 in Section 5.2, the first, second, third and fourth MRU lines
account for 77%, 15%, 4% and 4% of all L1 cache accesses on average, respectively
for the NVIDIA Kepler GPU architecture. Lastly, looking at Table 5.4 in Section 5.2,
the first, second and third MRU lines account for 46%, 10% and 8% of all L2 cache
accesses on average, respectively for the AMD Southern Islands GPU architecture.
The remaining set of MRU lines (MRU3 to MRU15) in L2 cache of the GPU was
grouped together and accounted for 36% of all L2 cache accesses. Table 5.8 shows the
CPU and GPU comparisons for the respective MRU access percentages. The overall
results reflect the GPU cache performance of different sets of workloads over various
two GPU architectures.
In conclusion, The MRU0 cache access percentage for the GPU is lower than
that for the CPU. This is because GPUs show lower temporal locality. This is a
consequence of the high level of parallelism (mainly, threads sharing cache). This
indicates that if any line (or block) needs to preserved, transferred or improved in
CPU and GPU cache then it is the most recently used line in each set.
55
CHAPTER 5. RESULTS
Table 5.9: Cache Hit Rates for the CPU [10]
L1 Cache Hit Ratio L2 Cache Hit Ratio
Average 88 % 93 %
Choo et al. [10] analyzed GPU cache memory traffic for compute workloads and
proposed two implementations to optimize cache memory, Multi-CU shared L1 data
cache and clustered work-group scheduling. The research used Multi2Sim and ran
benchmarks on the AMD SI model of the GPU and the default model of the CPU. The
hit ratios for L1 and L2 cache were compared for the GPU and CPU architectures to
identify the memory traffic and memory latency. Two new approaches were suggested
to improve and optimize CPU and GPU cache memory performance for a variety of
compute workloads. Table 5.9 shows the average cache hit rates for L1 and L2 cache
of the CPU.
Collating results from Table 5.9, L2 cache indicates a higher hit ratio than L1
cache. The average L1 and L2 cache hit ratios were recorded as 88% and 93%,
respectively. A comparison and analysis of the results of the GPU simulations for
AMD and Kepler architectures show the CPU cache hit ratios seem to be much
higher than those of the GPU.
Table 5.10 shows the L1 and L2 cache hit ratio comparisons between the CPU and
GPU results. Figure 5.1 in Section 5.1 indicates 70%, 46% and 35% were the average
cache hit ratios for L1 scalar cache, L1 vector cache and L2 cache, respectively, for the
AMD Southern Islands GPU architecture. In addition, Figure 5.2 in Section 5.2 shows
the average cache hit ratios for L1 cache and L2 cache were 46% and 26%, respectively,
for the NVIDIA Kepler GPU architecture. These GPU cache hit ratios for AMD SI
and NVIDIA Kepler GPU architectures were analogous. Both architectures show low
cache hit ratios, and this is because of how CPU and GPU work differently.
The GPU is comprised of a high number of cores that allow the GPU to process
56
CHAPTER 5. RESULTS
Table 5.10: Comparison of Cache Hit Ratios for the CPU and the GPU
Device Architecture L1 Hit Ratio L2 Hit Ratio
CPU Intel Itanium2 88 % 93 %
GPU AMD SI 70 % (L1 Scalar) % 35 %
GPU AMD SI 46 % (L1 vector) % 35 %
GPU NVIDIA Kepler 46 % 26 %
thousands of threads at the same time. The GPU relies heavily on parallel execution
for high levels of performance. The parallel execution framework hides longer latencies
that may result from lower levels of cache. But this limits the cache performance of
the GPU in comparison with the CPU. Contrarily, the CPU relies heavily on the
memory hierarchy for higher performance. The CPU consists of fewer cores using
large amounts of cache memory that process a low number of threads at a time. The
CPU’s reliance on cache memory helps improve performance and show high levels of
cache behavior. Hence the CPU will always record higher data cache hit rates than
the GPU.
The CPU shows much higher cache hit ratios for both L1 and L2 cache. For L1
cache, the GPU shows approximately 45 % lower average cache hit ratios to those of
the CPU. For L2 cache, the GPU shows approximately 65 % lower average cache hit
ratios to those of the CPU. This implies that the GPU cache memory traffic for the
tested benchmarks is significantly worse that of the CPU. The low GPU L1 and L2
cache hit ratios imply that there is higher memory traffic at lower memory levels like
L2 cache and Global Memory [10]. This promotes long memory latency in the GPU
architecture.
57
Chapter 6
Machine Learning Benchmarks using CUDA
6.1 Design Methodology
Since machine learning benchmarks for GPU simulators were limited, a common set
of machine learning algorithms were defined using CUDA as part of this work. These
benchmarks are compatible with CUDA libraries and can be run on Multi2Sim’s
Kepler GPU architecture.
Some of the common machine learning algorithms used for Convolution Neural
Networks (CNNs) were identified, and they are listed in Table 6-1. The machine
learning algorithms used for this research were derived from a Deep Neural Network
(DNN) implementation in CUDA [20]. This reference consisted of a CUDA GPU
kernel containing multiple math algorithms for machine learning. These algorithms
were selected based on the working of a CNN and its layers. Both Matrix Multi-
plication benchmarks were used to parse and process the data for a set of matrices,
which was helpful for the Mean Pooling, Max Pooling and Zero Padding benchmarks.
The Max Pooling, Mean Pooling and ReLU benchmarks replicated the working of the
Pooling layer and ReLU layer of a CNN. The Zero Padding and Linear Interpolation
benchmarks replicate the math functions used to process and preserve the data during
execution by the CNN.
58
CHAPTER 6. MACHINE LEARNING BENCHMARKS USING CUDA
Table 6.1: Defined Machine Learning algorithms in CUDA
Defined Benchmarks
Matrix Multiplication
Matrix Element-wise Multiplication
Max Pooling
Mean Pooling
Zero Padding
Linear Interpolation
ReLU / Leaky-ReLU
Individual micro-benchmarks were created using CUDA APIs for each algorithm
to understand the cache behavior of the GPU for those respective workloads. The
kernels were modified to be compatible with the NVIDIA Kepler architecture. A set
of functions were defined to initialize, run and validate the data for each benchmark.
The Matrix Multiplication benchmark was defined to perform basic matrix multi-
plication of two input matrices and generated a resulting output matrix. The Matrix
Element-wise Multiplication was a variation of the Matrix Multiplication benchmark,
such that it performed element-wise multiplication of two input matrices. The el-
ements at the each index location of both the input matrices were multiplied to
generate the result at the same index location of the output matrix. The Matrix
Multiplication and Matrix Element-wise multiplication benchmarks served as a base
for any computational matrix algebra operations by a CNN.
In a CNN, there are two main aspects that affect the output of each convolution
layer during filtering: stride and padding. The stride offsets along the length of the
data and the padding allows the output data to have the same spatial size as the
input data. Zero padding is one of the common padding approaches which adds zeros
to the output data evenly across the borders to preserve the data size. The Zero
Padding benchmark was defined to replicate this functionality of CNNs. A pad input
was defined to specify the row/column pad to be placed around the input matrix.
59
CHAPTER 6. MACHINE LEARNING BENCHMARKS USING CUDA
The benchmark used an input parameter to specify a surrounding pad value. The
default pad value was set to 0 for Zero Padding, but could be altered if needed. The
output matrix was generated by adding the zero pad layers around the input matrix.
The Max and Mean Pooling were defined to replicate the pooling layer of CNNs.
Both benchmarks were defined using a matrix size and a pooling factor as inputs. The
pooling factor determined the dimensions of the pooling window while performing the
pooling operation (Max Pooling or Mean Pooling). A filtering window was chosen,
and the output matrix was generated by taking the mean or max of all the matrix
values within that filtering window.
The Linear Interpolation benchmark was defined to implement a curve fitting
algorithm between a set of points. This algorithm is also used by CNNs to interpolate
images by using a filtering window across the source image.
The Rectifier Linear Unit (ReLU) benchmark was defined to replicate the ReLU
layer of CNNs. This benchmark defined and used a rectifier, which used an activa-
tion function to preserve the positive values and remove any negative values during
execution. This benchmark was also modified to include and test Leaky ReLU and
Noisy ReLU functionality.
The GPU cache behavior for the implemented machine learning benchmarks was
studied for three main metrics - L1/L2 cache hit ratios, MRU access percentages and
inter-warp/intra-warp locality for L1 and L2 caches. In addition, the benchmarks
were compiled and run in the same way as the CUDA SDK benchmarks as shown in
Chapter 5. All the machine learning benchmark results were compared to the CUDA
general purpose benchmark results as both sets of benchmarks were run on the same
Kepler GPU architecture using CUDA libraries. This helped analyze the difference
in GPU cache behavior for different workloads without changing the underlying GPU
architecture (Kepler).
60
CHAPTER 6. MACHINE LEARNING BENCHMARKS USING CUDA
Table 6.2: L1 and L2 Hit Ratios for Machine Learning Benchmarks
Benchmarks L1 Cache Hit Ratio L2 Cache Hit Ratio
Matrix Multiplication 8 % 50 %
Matrix Element-wise 69 % 48 %
Mean Pooling 90 % 52 %
Max Pooling 92 % 51 %
Zero Padding 74 % 49 %
Linear Interpolation 75 % 53 %
ReLU 25 % 46 %
Average ML 62 % 50 %
Average Gen Purpose 46 % 26 %
6.2 Results
6.2.1 L1 and L2 cache hit ratios
Results for the L1 and L2 cache hit ratios were recorded and compared as shown in
Table 6.2. Interpreting the results in Table 6.2, the average cache hit ratio for L1
cache was 62% and L2 cache was 50%. Matrix Multiplication and ReLU had the
lowest L1 cache hit ratios of 8% and 25%. Mean Pooling and Max Pooling showed
the highest L1 cache hit ratios, where Mean Pooling was 90%, and Max Pooling was
92%. The rest of the benchmarks showed 70% to 75% hit ratios for L1 cache. The
L2 cache hit ratios for all the tested benchmarks were approximately 45% to 55%.
The hit ratios of L1 and L2 cache for the CUDA SDK benchmark suite were recorded
to be 46% and 26% respectively. The computational machine learning benchmarks
showed better cache hit rates for L1 and L2 cache than both the general purpose
CUDA SDK benchmarks.
61
CHAPTER 6. MACHINE LEARNING BENCHMARKS USING CUDA
Table 6.3: MRU Results for the Machine Learning benchmarks on the NVIDIA Kepler
architecture for L1 cache
Benchmark MRU0 MRU1 MRU2 MRU3
Matrix Multiplication 95 % 2 % 2 % 1 %
Matrix Element-wise 81 % 9 % 10 % 0 %
Mean Pooling 69 % 27 % 4 % 0 %
Max Pooling 70 % 26 % 4 % 0 %
Zero Padding 100 % 0 % 0 % 0 %
Interpolation 98 % 2 % 0 % 0 %
ReLU 84 % 9 % 3 % 4 %
Average ML 85 % 11 % 3 % 1 %
Average Gen Purpose 77 % 15 % 4 % 4 %
6.2.2 MRU and LRU Temporal Locality Results
The MRU results of the defined benchmarks for L1 and L2 cache were recorded and
compared. Tables 6-3 and 6-4 show the MRU access percentages for L1 cache and L2
cache, respectively. The MRU percentage values for each benchmark are graphed for
L1 and L2 cache and compared in Figures 6.1 and Figure 6.2, respectively.
Based on the L1 cache results recorded in Table 6.3 and Figure 6.1, all the tested
benchmarks showed similar cache access patterns to the AMD and CUDA SDK bench-
mark results from Chapter 5. On average, the first MRU line consists of 85% of all
the L1 cache accesses for the respective benchmark being tested. The second, third
and fourth MRU lines consists of 11%, 3% and 1% of all the L1 cache accesses. Per-
centages for the first MRU line for some benchmarks like Matrix Multiplication, Zero
Padding, and Interpolation showed a percentage of 95% or higher. In addition, the
next tier of benchmarks like Matrix Element-wise Multiplication, and ReLU showed
approximately 80 to 85% of cache accesses. Finally, other benchmarks and algorithms
like Mean Pooling and Max Pooling showed the smallest percentage of cache accesses
for the first MRU line at 69% and 70%, respectively.
The MRU access percentages were also compared for L2 cache. The L2 cache
62
CHAPTER 6. MACHINE LEARNING BENCHMARKS USING CUDA
Figure 6.1: Machine Learning benchmark results for the NVIDIA Kepler architecture for
L1 cache
Table 6.4: MRU Results for the Machine Learning benchmarks on the NVIDIA Kepler
architecture for L2 cache
Benchmark MRU0 MRU1 MRU2 MRU3 - MRU15
Matrix Multiplication 39 % 34 % 25 % 2 %
Matrix Element-wise 83 % 16 % 1 % 0 %
Mean Pooling 88 % 11 % 1 % 0 %
Max Pooling 97 % 3 % 0 % 0 %
Zero Padding 95 % 5 % 0 % 0 %
Interpolation 98 % 2 % 0 % 0 %
ReLU 48 % 25 % 17 % 10 %
Average ML 78 % 14 % 6 % 2 %
Average Gen Purpose 46 % 10 % 8 % 36 %
63
CHAPTER 6. MACHINE LEARNING BENCHMARKS USING CUDA
Figure 6.2: Machine Learning benchmark results for the NVIDIA Kepler architecture for
L2 cache
showed similar cache access patterns to L1 cache where the first MRU line dominated
the total number of accesses made consists of 78% of all L2 cache accesses. The second
and third MRU line accounted for the next highest cache accesses made. The second
MRU line consists of 14% and the third MRU line consists of 6% of all L2 cache
accesses. Finally, the MRU counters for the fourth through sixteenth MRU lines were
summed and recorded as one column result, and this accounted for 2% (on average)
of all cache accesses made for each benchmark. Additionally, for the first MRU line,
benchmarks like Matrix Element-wise Multiplication, Mean Pooling, Max Pooling,
Zero Padding and Interpolation show very high cache access percentages between
80% to 100%. Contrarily, Matrix Multiplication and ReLU show lower percentages
for the MRU line at 39% and 48%, respectively. The variation of the result is because
of the workload and how the GPU processed the data for those algorithms utilizing
cache memory.
This showed that the computational machine learning benchmarks performed bet-
ter in terms of MRU access percentages than the general purpose CUDA SDK bench-
64
CHAPTER 6. MACHINE LEARNING BENCHMARKS USING CUDA
marks. In Chapter 5, the first MRU and second MRU lines were identified as the
most important lines to be preserved in cache. They are even more important for
the machine learning benchmarks, especially for the L2 cache which shows a very
significant difference.
6.2.3 Inter-warp and Intra-warp Locality Results
Table 6.5 shows the inter-warp and intra-warp cache access percentages for L1 and
L2 caches. The L1 and L2 cache inter-warp and intra-warp access percentages for
all benchmarks are displayed as a column graph in Figure 6.3. The results show
how the cache localities differed between L1 and L2 cache for a variety of machine
learning benchmarks using CUDA. Both L1 and L2 caches showed higher data cache
accesses resulting from inter-warp locality, provided that the inter-warp cache access
percentages for L2 cache were slightly higher than those for L1 cache. The average
intra-warp and inter-warp cache access percentages for L1 cache were 48% and 52%,
respectively. The average intra-warp and inter-warp cache access percentages for
L2 cache were 40% and 60%, respectively. This showed that the math algorithms
being tested favoured inter-warp locality, but the the split between inter-warp and
intra-warp locality was more balanced than that of the general purpose benchmarks.
The inter-warp and intra-warp cache access hits and misses for L1 and L2 cache
are graphed in Figure 6.4 and 6.5. These cache statistics were directly related to the
cache hit ratios recorded for L1 and L2 cache modules. Lower cache hit ratios resulted
in more inter-warp and intra-warp misses. This shows the characteristics of the tested
workloads on the GPU favored inter-warp locality. The machine learning benchmarks
showed higher numbers of intra-hits than the general purpose benchmarks, and this
was because the machine learning benchmarks showed higher data cache hit ratios
for both L1 and L2 cache.
From the results recorded in Table 6.6, it was concluded that the machine learning
65
CHAPTER 6. MACHINE LEARNING BENCHMARKS USING CUDA
Table 6.5: Inter-warp and Intra-warp Access Percentages for L1 and L2 cache
L1 Cache L2 Cache
Benchmark Intra-warp % Inter-warp % Intra-warp % Inter-warp %
Matrix Multiplication 43 % 57 % 5 % 95 %
Matrix Element-wise 27 % 73 % 46 % 54 %
Mean Pooling 52 % 48 % 58 % 42 %
Max Pooling 52 % 48 % 30 % 70 %
Zero Padding 58 % 42 % 77 % 23 %
Linear Interpolation 26 % 74 % 58 % 42 %
ReLU 43 % 57 % 6 % 94 %
Average ML 43 % 57 % 40 % 60 %
Average Gen Purpose 33 % 67 % 21 % 79 %
Figure 6.3: L1 and L2 cache inter-warp and intra-warp access percentages for Machine
Learning benchmarks
66
CHAPTER 6. MACHINE LEARNING BENCHMARKS USING CUDA
Figure 6.4: L1 intra-warp and inter-warp hits and misses for Machine Learning bench-
marks
Figure 6.5: L2 intra-warp and inter-warp hits and misses for Machine Learning bench-
marks
67
CHAPTER 6. MACHINE LEARNING BENCHMARKS USING CUDA
Table 6.6: Comparison of the warp locality results for the two sets of CUDA benchmarks
Cache Average Values CUDA SDK Machine Learning
L1 Cache
Hit Ratio 46 % 62 %
Intra Accesses 33 % 43 %
Intra Hits 11 % 25 %
Intra Misses 21 % 18 %
Inter Accesses 67 % 57 %
Inter Hits 29 % 38 %
Inter Misses 39 % 21 %
L2 Cache
Hit Ratio 26 % 50 %
Intra Accesses 21 % 40 %
Intra Hits 7 % 15 %
Intra Misses 13 % 25 %
Inter Accesses 79 % 60 %
Inter Hits 19 % 35 %
Inter Misses 60 % 25 %
benchmarks showed the higher cache hit ratios for both L1 and L2 caches than the
general purpose CUDA benchmarks, and this was mainly because of the difference in
workloads.
The results from Table 6.6 were compared to find the breakdown of the L1 cache
accesses for two sets of GPU benchmarks for inter-warp and intra-warp data cache lo-
calities. An analysis of the results indicates that the machine learning benchmarks had
the higher intra-warp access percentage, while the CUDA SDK benchmarks recorded
the higher inter-warp access percentage. Furthermore, the machine learning bench-
marks showed higher numbers of hits resulting from intra-warp and inter-warp locality
than the CUDA SDK benchmarks.
The results from Table 6.6 recorded the L2 cache access breakdowns for the two
sets of GPU benchmarks to identify the best set of benchmarks for L2 cache. As
a result, machine learning benchmarks had the higher intra-warp access percentage,
while the CUDA SDK benchmarks recorded the higher inter-warp access percentage.
68
CHAPTER 6. MACHINE LEARNING BENCHMARKS USING CUDA
Moreover, the highest amount of hits and misses resulting from intra-warp locality
were from machine learning benchmarks. In terms of the the inter-warp results, the
machine learning benchmarks showed the higher numbers of data cache hits resulting
from inter-warp locality, whereas the CUDA SDK benchmarks showed the higher
numbers of misses resulting from inter-warp locality.
The L1 and L2 cache hit ratios were the highest for the machine learning bench-
marks as compared to the CUDA SDK general purpose benchmarks. Consequently,
the machine learning benchmarks favoured inter-warp locality for both L1 and L2
caches. Likewise, the machine learning benchmarks had higher numbers of data
cache hits resulting from inter-warp and intra-warp localities as the hit ratios for the
machine learning benchmarks were higher for both L1 and L2 caches than the CUDA
SDK benchmarks. This was because the machine learning benchmarks were mainly
computational benchmarks and not general purpose benchmarks.
69
Chapter 7
Conclusion
Multi2Sim was successfully setup on an Ubuntu OS with all its dependencies to run
GPU simulations and analyze GPU cache statistics and behavior. AMD SDK 2.5
and CUDA SDK 6.5 were the two benchmark suites used to run GPU simulations
on the AMD SI and NVIDIA Kepler GPU architectures, respectively. The bench-
marks helped analyze GPU cache behavior and cache access patterns for a variety of
workloads.
Both AMD and NVIDIA Kepler GPU architectures, displayed L1 and L2 data
caches hit ratios of under 50%. The GPU cache hit ratios were compared to those of
the CPU. The CPU displayed higher hit ratios for both L1 and L2 caches. This is
because the GPU relies heavily on parallel thread execution and less on cache memory.
The GPU achieves high levels of performance using Thread-level Parallelism (TLP).
Contrarily, the CPU has fewer cores and processes a small number of a threads at a
time but relies more heavily on cache memory to show high performance levels. The
CPU achieves high level of performance using Instruction-level Parallelism (ILP).
The MRU results helped identify that the most recently used block dominated the
number of cache accesses for both L1 and L2 cache. This indicates that if any line
(or block) needs to preserved, transferred, or improved in GPU cache then it would
be the most recently used line in each set, for a multi-way associative cache.
The inter-warp and intra-warp locality for cache accesses was studied for both
70
CHAPTER 7. CONCLUSION
GPU architectures. This locality was directly dependent on the characteristics of
the workloads being tested. The OpenCL benchmarks tested on the AMD SI GPU
model showed higher intra-warp locality for L1 cache and high inter-warp locality for
L2 cache. Contrarily, the CUDA benchmarks tested on the NVIDIA Kepler GPU
model showed higher inter-warp locality for both L1 and L2 caches, more so in L2
cache than L1 cache.
A set of machine learning benchmarks was defined to analyze the GPU behavior
for machine learning workloads. These benchmarks were defined using CUDA APIs
and were tested on the NVIDIA Kepler GPU model on Multi2Sim. The cache hit
ratios, MRU results and inter-warp/intra-warp localities were compared against the
general purpose computing benchmarks from the CUDA SDK benchmark suite as
both benchmarks were tested on the Kepler GPU model on Multi2Sim. The machine
learning benchmarks showed higher cache hit ratios than the CUDA SDK benchmark
suite. The general purpose CUDA SDK benchmarks recorded L1 and L2 cache hit
ratios of under 50%. However, the machine learning benchmarks exhibited average
cache hit ratios for L1 cache of 62.03% and L2 cache of 49.75%. The machine learning
benchmarks showed similar cache access patterns and locality to the general purpose
benchmark suites. The most recently used line dominated the number of cache ac-
cesses for both L1 and L2 caches by more than 75%. The difference in temporal
locality was significant for L2 cache more so than L1 cache. The machine learning
benchmarks favored inter-warp locality for both L1 and L2 caches.
71
CHAPTER 7. CONCLUSION
7.1 Future Work
7.1.1 Unified Memory Model
In a heterogeneous muti-processor system, one of the main performance bottlenecks
is that the CPU and GPU are responsible for their own sets of physical memory.
Data need to be allocated for both CPU and GPU memory. Furthermore, the data
get initialized and transferred from the CPU to the GPU for processing, and then get
transferred back to CPU memory. This leads to large overheads and latencies.
NVIDIA introduced a new Unified Memory Model [26, 27] in the latest CUDA6
toolkit and SDK, which helped overcome the bottleneck of having physically separate
memory blocks. The Unified Memory Model defines a block of shared memory in
DRAM which is shared by the CPU and the GPU. The system handles this unified
shared memory for both processors such that there is no explicit need to copy or
transfer data from one memory unit to the other. The shared unified memory can
be accessed by either CPU or GPU with a single pointer. The system handles any
conflicts such that each processor thinks it is accessing memory from its respective
memory block. This largely reduces latency and overhead, and helps improve the
performance at run-time.
Even though this architecture change has been implemented in hardware in the
latest GPUs using CUDA 6, the GPU simulators have not been updated for this
architectural change yet. The simulator can be configured for the Unified Memory
Model design to study the CPU and GPU cache behavior.
7.1.2 Machine Learning Benchmarks on the GPU
Most of the investigated machine learning benchmark suites [20, 28, 29 30] required
a GPU as a hardware component. None of the benchmark suites were built to run on
GPU simulators. Shi Dong et al. [28] defined a Deep Neural Network benchmark suite
72
CHAPTER 7. CONCLUSION
Figure 7.1: NVIDIA’s Unified Memory Model [27]
for general purpose GPUs. This benchmark suite had a really good implementation of
GPU CUDA kernels for multiple machine learning algorithms. The limitation of the
current version of the benchmark suite was that it could only be run using a GPU as
a hardware component and not GPU simulators. However, the authors are working
on defining GPU CUDA kernels for GPU simulators for the next release. This would
enable running the benchmarks on Multi2Sim to analyze GPU cache behavior for a
variety of machine learning workloads.
73
Bibliography
[1] X. Gong, R. Ubal and D. Kaeli, “Multi2Sim Kepler: A detailed architectural
GPU simulator,” 2017 IEEE International Symposium on Performance Analysis
of Systems and Software (ISPASS), Santa Rosa, CA, 2017
[2] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, “Ana-
lyzing cuda workloads using a detailed gpu simulator,” in IEEE Int. Symp. on
Performance Analysis of Systems and Software, 2009, pp.163174.
[3] J. Power, J. Hestness, M. S. Orr, M. D. Hill, and D. A. Wood, “gem5-gpu: A
heterogeneous cpu-gpu simulator,” IEEE Computer Architecture Letters, vol. 14,
no. 1, pp. 3436, 2015.
[4] S. Collange, M. Daumas, D. Defour, and D. Parello, “Barra: A parallel func-
tional simulator for gpgpu,” in 2010 IEEE International Symposium on Modeling,
Analysis and Simulation of Computer and Telecommunication Systems, 2010, pp.
351360.
[5] G. F. Diamos, A. R. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: a dynamic opti-
mization framework for bulk-synchronous applications in heterogeneous systems,”
in ACM Proc. of the 19th Int. Conf. on Parallel Architectures and Compilation
Techniques, 2010, pp. 353364.
[6] The Multi2Sim Simulation Framework - A CPU-GPU Model for Heterogeneous
Computing. URL: http://www.multi2sim.org/
[7] M. Lee, G. Kim, J. Kim, W. Seo, Y. Cho and S. Ryu, “iPAWS: Instruction-issue
pattern-based adaptive warp scheduling for GPGPUs,” 2016 IEEE International
Symposium on High Performance Computer Architecture (HPCA), Barcelona,
2016.
74
BIBLIOGRAPHY
[8] Xinxin Mei, Xiaowen Chu, “Dissecting GPU Memory Hierarchy through Mi-
crobenchmarking”, March 2016
[9] Ben Johnstone and Sonia Lopez Alarcon, “Bandwidth Requirements of GPU Ar-
chitectures” (2014), Rochester Institute of Technology, New York, USA
[10] K. Choo, W. Panlener and B. Jang, “Understanding and Optimizing GPU Cache
Memory Performance for Compute Workloads,” 2014 IEEE 13th International
Symposium on Parallel and Distributed Computing, Marseilles, 2014, pp. 189-
196.
[11] Salvador Petit, Julio Sahuquillo, Jose M. Such, and David Kaeli. 2005. Exploiting
temporal locality in drowsy cache policies. In Proceedings of the 2nd conference
on Computing frontiers (CF ’05). ACM, New York, NY, USA.
[12] S. Kumar and P. K. Singh, “An overview of modern cache memory and perfor-
mance analysis of replacement policies,” 2016 IEEE International Conference on
Engineering and Technology (ICETECH), Coimbatore, 2016.
[13] M. T. Banday and M. Khan, “A study of recent advances in cache memo-
ries,” 2014 International Conference on Contemporary Computing and Informatics
(IC3I), Mysore, 2014.
[14] Multi2Sim 5.0 Guide - http://www.multi2sim.org/downloads/m2s-guide-4.2.pdf
[15] Caching Algorithms - www.ecs.umass.edu/ece/labs/vlsicad/ece665/presentations/Cache-
Wang.ppt
[16] B. Wang, Z. Liu, X. Wang and W. Yu, “Eliminating intra-warp conflict misses
in GPU,” 2015 Design, Automation & Test in Europe Conference & Exhibition
(DATE), Grenoble, 2015, pp. 689-694.
75
BIBLIOGRAPHY
[17] NVIDIA Kepler GK110 Architecture Whitepaper, 2012
https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-
Architecture-Whitepaper.pdf
[18] NVIDIA Tesla K20X GPU Accelerator, 2013.
http://developer.download.nvidia.com/GTC/inside-tesla-kepler-k20-family.pdf
[19] AMD Radeon GCN Architecture Whitepaper, 2012
https://www.amd.com/Documents/GCN Architecture whitepaper.pdf
[20] Xingdi (Eric) Yuan, “A Deep Neural Net implementation in CUDA”, MIT, 2015.
https://github.com/xingdi-eric-yuan/cuda-deep-neural-nets
[21] https://github.com/Multi2Sim/m2s-bench-amdsdk-2.5-src
[22] https://github.com/Multi2Sim/m2s-bench-cudasdk-6.5
[23] Wilson W. L. Fung Ivan Sham George Yuan Tor M. Aamodt, “Dynamic Warp
Formation and Scheduling for Efficient GPU Control Flow”, University of British
Columbia, Vancuover, BC, Canada
[24] Rajat Raina, Anand Madhavan, and Andrew Y. Ng. 2009. “Large-scale deep un-
supervised learning using graphics processors,” In Proceedings of the 26th Annual
International Conference on Machine Learning (ICML ’09). ACM, New York, NY,
USA
[25] B. Jang, D. Schaa, P. Mistry and D. Kaeli, “Exploiting Memory Access Pat-
terns to Improve Memory Performance in Data-Parallel Architectures,” in IEEE
Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 105-118, Jan.
2011.
76
BIBLIOGRAPHY
[26] R. Landaverde, Tiansheng Zhang, A. K. Coskun and M. Herbordt, “An in-
vestigation of Unified Memory Access performance in CUDA,” 2014 IEEE High
Performance Extreme Computing Conference (HPEC), Waltham, MA, 2014.
[27] Unified Memory Model by NVIDIA - https://devblogs.nvidia.com/unified-
memory-in-cuda-6/
[28] Shi Dong and David Kaeli. 2017. DNNMark: A Deep Neural Network Benchmark
Suite for GPUs. In Proceedings of the General Purpose GPUs (GPGPU-10). ACM,
New York, NY, USA, 63-72.
[29] https://github.com/doody1986/DNNMark
[30] Yusaku Sako, Bartosz Ludwiczuk, “Benchmark on Deep Learning Frameworks
and GPUs”, 2018 https://github.com/u39kun/deep-learning-benchmark
77
Appendix A: Code Listings
1 #inc lude ” cuda runtime . h”
2 #inc lude ” s td i o . h”
3
4 #de f i n e BLOCK SIZE 16
5 #de f i n e SRC ROWS 128
6 #de f i n e SRC COLS 128
7 #de f i n e DEST ROWS 32
8 #de f i n e DEST COLS 32
9 #de f i n e STRIDE X 4
10 #de f i n e STRIDE Y 4
11 #de f i n e N SRC ROWS ∗ SRC COLS
12 #de f i n e VALUE 0
13
14 g l o b a l void MeanPool ing kernel ( i n t ∗ src , i n t ∗ dst , const i n t rowssrc , const i n t c o l s s r c ,
const i n t rowsdst , const i n t co l sd s t , const i n t s t r idex , const i n t s t r idey , const i n t n){
15 in t t i d = threadIdx . x + blockIdx . x ∗ blockDim . x ;
16 i n t s t r i d e = blockDim . x ∗ gridDim . x ;
17 whi le ( t i d < n){
18 in t cdst = t id % co l s d s t ;
19 i n t rd s t = t i d / c o l s d s t ;
20 i n t r s r c = rds t ∗ s t r i d e y ;
21 i n t c s r c = cdst ∗ s t r i d e x ;
22 i n t xend = ( c s r c + s t r i d e x − 1) > ( c o l s s r c − 1) ? ( c o l s s r c − 1) : ( c s r c + s t r i d e x − 1) ;
23 i n t yend = ( r s r c + s t r i d e y − 1) > ( rowssrc − 1) ? ( rowssrc − 1) : ( r s r c + s t r i d e y − 1) ;
24 // l o c [ t i d ] = ( r s r c ∗ c o l s s r c + c s r c ) ;
25 i n t va l = 0 ;
26
27 f o r ( i n t i = r s r c ; i <= yend ; ++i ){
28 f o r ( i n t j = c s r c ; j <= xend ; ++j ){
29 i f ( s r c [ i ∗ c o l s s r c + j ] > dst [ t i d ] ) {
30 va l += s r c [ i ∗ c o l s s r c + j ] ;
31 }
32 }
33 }
34 dst [ t i d ] = val / ( s t r i d e x ∗ s t r i d e y ) ;
35 t i d += s t r i d e ;
36 }
37 }
38
39
40 void MeanPooling ( i n t ∗ src , i n t ∗ dst , i n t src rows , i n t s r c c o l s , i n t dest rows , i n t d e s t c o l s , i n t
s t r idex , i n t s t r idey , i n t n) ;
41
42 void pr in t mat r i x ( i n t ∗matrix , i n t height , i n t width ) ;
43
44 i n t main ( )
45 {
46 in t ∗a = ( in t ∗) c a l l o c (SRC ROWS ∗ SRC COLS, s i z e o f ( unsigned in t ) ) ;
47
48 in t ∗c = ( in t ∗) c a l l o c (DEST ROWS ∗ DEST COLS, s i z e o f ( unsigned in t ) ) ;
49 i n t ∗d = ( in t ∗) c a l l o c (DEST ROWS ∗ DEST COLS, s i z e o f ( unsigned in t ) ) ;
50 i n t i ;
51 f o r ( i =0; i < SRC ROWS ∗ SRC COLS; i++)
52 {
78
APPENDIX A. CODE LISTINGS
53 a [ i ] = 2∗ i ;
54 }
55
56 in t ∗dev a = 0 ;
57 i n t ∗dev c = 0 ;
58
59 // Al l o ca t e GPU bu f f e r s f o r three vec to r s ( two input , one output ) .
60 cudaMalloc ( ( void ∗∗)&dev c , DEST ROWS ∗ DEST COLS ∗ s i z e o f ( unsigned in t ) ) ;
61
62
63 cudaMalloc ( ( void ∗∗)&dev a , SRC ROWS ∗ SRC COLS ∗ s i z e o f ( unsigned in t ) ) ;
64
65 // Copy input vec to r s from host memory to GPU bu f f e r s .
66 cudaMemcpy( dev a , a , SRC ROWS ∗ SRC COLS ∗ s i z e o f ( unsigned in t ) , cudaMemcpyHostToDevice ) ;
67
68 dim3 dimBlock (BLOCK SIZE ,BLOCK SIZE) ; // block ( blockIdx , blockIDy )
69 dim3 dimGrid ( (SRC COLS + dimBlock . x − 1) / dimBlock . x ,
70 (SRC ROWS + dimBlock . y − 1) / dimBlock . y ) ; // g r id ( g l o a l s i z eX + block idx −1)/ blockidx ,
g l o a l s i z eY + block idy −1)/ b lock idy ) ;
71
72
73 MeanPool ing kernel<<<dimGrid , dimBlock>>>(dev a , dev c , SRC ROWS, SRC COLS, DEST ROWS, DEST COLS
, STRIDE X, STRIDE Y, N) ;
74
75 // Copy output vector from GPU bu f f e r to host memory .
76 cudaMemcpy( c , dev c , DEST ROWS ∗ DEST COLS ∗ s i z e o f ( i n t ) , cudaMemcpyDeviceToHost ) ;
77
78 //MeanPooling (a , d , SRC ROWS, SRC COLS, DEST ROWS, DEST COLS, STRIDE X, STRIDE Y, N) ;
79
80
81 // bool f l a g = true ;
82 // f o r ( i n t i = 0 ; i < DEST ROWS ∗ DEST COLS; i++ )
83 //{
84
85 // i f ( c [ i ] != d [ i ] )
86 //{
87 // p r i n t f (” V e r i f i c a t i o n f a i l \n”) ;
88 // p r i n t f (” i = %d \n” , i ) ;
89
90 // // p r i n t f (”A = %d\n” , a [ i ] ) ;
91 // // p r i n t f (”B = %d\n” , b [ i ] ) ;
92 // p r i n t f (”C = %d\n” , c [ i ] ) ;
93 // p r i n t f (”D = %d\n” , d [ i ] ) ;
94
95 // f l a g = f a l s e ;
96 // break ;
97 //}
98 //}
99
100 // i f ( f l a g )
101 // p r i n t f (” V e r i f i c a t i o n pass\n”) ;
102
103 p r i n t f ( ”Performing Mean Pool ing . . . \ n” ) ;
104 p r i n t f ( ”Benchmark executed s u c c e s s f u l l y .\n” ) ;
105 p r i n t f ( ”\n” ) ;
106
79
APPENDIX A. CODE LISTINGS
107 p r i n t f ( ” Input Matrix :\n” ) ;
108 pr in t mat r i x (a , SRC ROWS, SRC COLS) ;
109 p r i n t f ( ”Result Matrix :\n” ) ;
110 pr in t mat r i x ( c , DEST ROWS, DEST COLS) ;
111
112 // p r i n t f (” Input Matrix :\n”) ;
113 // pr in t mat r i x (a , 4 , 4) ;
114 // p r i n t f (” Result Matrix :\n”) ;
115 // pr in t mat r i x ( c , 2 , 2) ;
116
117 // p r i n t f (”Matrix D:\n”) ;
118 // pr in t mat r i x (d , 10 , 10) ;
119 }
120
121
122 void MeanPooling ( i n t ∗ src , i n t ∗ dst , i n t src rows , i n t s r c c o l s , i n t dest rows , i n t d e s t c o l s , i n t
s t r idex , i n t s t r idey , i n t n)
123 {
124 // in t max = 0 ;
125 in t tmp = 0 ;
126 in t index = 0 ;
127
128 f o r ( i n t i = 0 ; i < s r c rows ; i++)
129 {
130 f o r ( i n t j = 0 ; j < s r c c o l s ; j++)
131 {
132 tmp = sr c [ i ∗ s r c rows + j ] ;
133
134 i f (tmp > dst [ index ] ) {
135 dst [ index ] = tmp ;
136 }
137
138 // dst [ i ∗ s r c rows + j + 1 ] = s r c [ i ∗ s r c rows + j ] ;
139 }
140 }
141
142 }
143
144 void pr in t mat r i x ( i n t ∗matrix , i n t height , i n t width )
145 {
146 in t i , j ;
147 f o r ( i = 0 ; i < width ; i++)
148 {
149 f o r ( j = 0 ; j < he ight ; j++)
150 p r i n t f ( ”%5d” , matrix [ i ∗ width + j ] ) ;
151 p r i n t f ( ”\n” ) ;
152 }
153 }
Listing A.1: CUDA mean pooling benchmark GPU Kernel
80
APPENDIX A. CODE LISTINGS
1 . g l oba l b i t on i cSo r t
2
3 .mem
4
5 . data
6 FloatMode = 192
7 IeeeMode = 0
8
9 userElements [ 0 ] = IMM UAV, 10 , s [ 4 : 7 ]
10 userElements [ 1 ] = IMM CONST BUFFER, 0 , s [ 8 : 1 1 ]
11 userElements [ 2 ] = IMM CONST BUFFER, 1 , s [ 1 2 : 1 5 ]
12
13 COMPUTE PGM RSRC2:USER SGPR = 16
14 COMPUTE PGM RSRC2:TGID X EN = 1
15 . arg
16
17 u32∗ theArray 0 uav10 RW
18 u32 value : s tage 16
19 u32 passOfStage 32
20 u32 width 48
21 u32 d i r e c t i o n 64
22
23 . t ext
24
25 s bu f f e r load dword s0 , s [ 8 : 1 1 ] , 0x04 // 00000000: C2000904
26 s bu f f e r load dword s1 , s [ 8 : 1 1 ] , 0x18 // 00000004: C2008918
27 s bu f f e r load dword s2 , s [ 1 2 : 1 5 ] , 0x04 // 00000008: C2010D04
28 s bu f f e r load dword s3 , s [ 1 2 : 1 5 ] , 0x08 // 0000000C: C2018D08
29 s waitcnt lgkmcnt (0) // 00000010: BF8C007F
30 s min u32 s0 , s0 , 0 x 0 0 0 0 f f f f // 00000014: 8380FF00 0000FFFF
31 v mov b32 v1 , s0 // 0000001C: 7E020200
32 v mul i32 i24 v1 , s16 , v1 // 00000020: 12020210
33 v add i32 v0 , vcc , v0 , v1 // 00000024: 4A000300
34 v add i32 v0 , vcc , s1 , v0 // 00000028: 4A000001
35 s sub i32 s0 , s2 , s3 // 0000002C: 81800302
36 v l s h r r e v b32 v1 , s0 , v0 // 00000030: 2C020000
37 s l s h l b32 s1 , 1 , s0 // 00000034: 8F010081
38 s bu f f e r load dword s3 , s [ 1 2 : 1 5 ] , 0x00 // 00000038: C2018D00
39 v mul l o i32 v1 , s1 , v1 // 0000003C: D2D60001 02020201
40 v l s h l r e v b32 v1 , 1 , v1 // 00000044: 34020281
41 v bfe u32 v2 , v0 , 0 , s0 // 00000048: D2900002 00010100
42 v add i32 v1 , vcc , v1 , v2 // 00000050: 4A020501
43 v add i32 v2 , vcc , s1 , v1 // 00000054: 4A040201
44 v l s h l r e v b32 v1 , 2 , v1 // 00000058: 34020282
45 v l s h l r e v b32 v2 , 2 , v2 // 0000005C: 34040482
46 s waitcnt lgkmcnt (0) // 00000060: BF8C007F
47 v add i32 v1 , vcc , s3 , v1 // 00000064: 4A020203
48 v add i32 v2 , vcc , s3 , v2 // 00000068: 4A040403
49 t bu f f e r load format x v3 , v1 , s [ 4 : 7 ] , 0 o f f e n format : [BUF DATA FORMAT 32 ,BUF NUM FORMAT FLOAT]
// 0000006C: EBA01000 80010301
50 t bu f f e r load format x v4 , v2 , s [ 4 : 7 ] , 0 o f f e n format : [BUF DATA FORMAT 32 ,BUF NUM FORMAT FLOAT]
// 00000074: EBA01000 80010402
51 s bu f f e r load dword s0 , s [ 1 2 : 1 5 ] , 0x10 // 0000007C: C2000D10
52 s l s h l b32 s1 , 1 , s2 // 00000080: 8F010281
53 s waitcnt lgkmcnt (0) // 00000084: BF8C007F
54 s sub i32 s2 , 1 , s0 // 00000088: 81820081
81
APPENDIX A. CODE LISTINGS
55 v and b32 v0 , s1 , v0 // 0000008C: 36000001
56 v cmp eq i32 vcc , 0 , v0 // 00000090: 7D040080
57 v mov b32 v0 , s0 // 00000094: 7E000200
58 v mov b32 v5 , s2 // 00000098: 7E0A0202
59 v cndmask b32 v0 , v5 , v0 , vcc // 0000009C: 00000105
60 s waitcnt vmcnt (0) // 000000A0 : BF8C1F70
61 v min u32 v5 , v3 , v4 // 000000A4 : 260A0903
62 v max u32 v3 , v3 , v4 // 000000A8 : 28060903
63 v cmp eq i32 vcc , 0 , v0 // 000000AC: 7D040080
64 v cndmask b32 v0 , v5 , v3 , vcc // 000000B0 : 00000705
65 t bu f f e r s t o r e format x v0 , v1 , s [ 4 : 7 ] , 0 o f f e n format : [BUF DATA FORMAT 32 ,BUF NUM FORMAT FLOAT]
// 000000B4 : EBA41000 80010001
66 s waitcnt expcnt (0) // 000000BC: BF8C1F0F
67 v cndmask b32 v0 , v3 , v5 , vcc // 000000C0 : 00000B03
68 t bu f f e r s t o r e format x v0 , v2 , s [ 4 : 7 ] , 0 o f f e n format : [BUF DATA FORMAT 32 ,BUF NUM FORMAT FLOAT]
// 000000C4 : EBA41000 80010002
69 s endpgm // 000000CC: BF810000
Listing A.2: OpenCL bitonic sort benchmark for GPU execution
82
