Characterizing Optimizations to Memory Access Patterns using
  Architecture-Independent Program Features by Chilukuri, Aditya et al.
Characterizing Optimizations to Memory Access
Paerns using Architecture-Independent Program
Features
Aditya Chilukuri
aditya.chilukuri@anu.edu.au
Australian National University
Josh Milthorpe
josh.milthorpe@anu.edu.au
Australian National University
Beau Johnston
beau.johnston@anu.edu.au
Australian National University
ABSTRACT
High-performance computing developers are faced with the
challenge of optimizing the performance of OpenCL work-
loads on diverse architectures. eArchitecture-Independent
Workload Characterization (AIWC) tool is a plugin for the
Oclgrind OpenCL simulator that gathers metrics of OpenCL
programs that can be used to understand and predict pro-
gram performance on an arbitrary given hardware architec-
ture. However, AIWC metrics are not always easily inter-
preted and do not reect some important memory access
paerns aecting eciency across architectures. We pro-
pose a new metric of parallel spatial locality – the closeness
of memory accesses simultaneously issued by OpenCL work-
items (threads). We implement the parallel spatial locality
metric in the AIWC framework, and analyse gathered results
on matrix multiply and the Extended OpenDwarfs OpenCL
benchmarks. e dierences in the observed parallel spa-
tial locality metric across implementations of matrix mul-
tiply reect the optimizations performed. e new metric
can be used to distinguish between the OpenDwarfs bench-
marks based on the memory access paerns aecting their
performance on various architectures. e improvements
suggested to AIWC will help HPC developers beer under-
stand memory access paerns of complex codes and guide
optimization of codes for arbitrary hardware targets.
KEYWORDS
Architecture Independent Analysis, Heterogenous Comput-
ing, Workload Characterization, Memory Access Paerns
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for prot or commercial advantage and that copies bear
this notice and the full citation on the rst page. Copyrights for components
of this work owned by others than ACMmust be honored. Abstracting with
credit is permied. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specic permission and/or a fee. Request
permissions from permissions@acm.org.
IWOCL’20, Munich, Germany
© 2020 ACM. 978-x-xxxx-xxxx-x/YY/MM. . .$15.00
DOI: 10.1145/nnnnnnn.nnnnnnn
ACM Reference format:
Aditya Chilukuri, Josh Milthorpe, and Beau Johnston. 2020. Charac-
terizingOptimizations toMemoryAccess Paerns usingArchitecture-
Independent Program Features. In Proceedings of InternationalWork-
shop on OpenCL, Munich, Germany, April 27–29, 2020 (IWOCL’20),
11 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
High-performance computing (HPC) systems are increas-
ingly heterogeneous. A single node on a modern super-
computer may combine traditional CPUs with one or more
accelerators such as GPUs, Field Programmable Gate Ar-
rays (FPGAs), or many integrated core devices (MICs). High
bandwidth interconnects support tight integration between
multiple devices of dierent types on a single compute node.
e OpenCL programming language is designed to sup-
port modern HPC soware engineers in writing code that
executes on multiple hardware targets. is gives HPC so-
ware developers greater exibility by allowing codes aimed
at a range of hardware targets to be wrien in a single pro-
gramming language environment.
Application codes dier in resource requirements, control
structure and available parallelism. Similarly, compute de-
vices dier in number and capabilities of execution units,
processing model, and available resources. ese heteroge-
neous computing environments present opportunities for
soware engineers and HPC integrators to design highly
optimized systems with multiple kernels executing on hard-
ware targets best suited for the diverse computational tasks
performed by each kernel [18]. However, this opportunity
also presents a challenge in optimizing code to run on di-
verse architectures. e work reported here aims to help
HPC developers understand and optimize memory access
paerns to improve performance on heterogeneous systems.
In the evolution of computer architectures over the last few
decades, the exponential growth in computational capability
has not been matched by proportional increases in memory
speeds[4]. Memory accesses pose larger bolenecks to per-
formance as application demand for main memory scales
ar
X
iv
:2
00
3.
06
06
4v
1 
 [c
s.D
C]
  1
2 M
ar 
20
20
IWOCL’20, April 27–29, 2020, Munich, Germany Aditya Chilukuri, Josh Milthorpe, and Beau Johnston
with the arithmetic capability of computer systems. To miti-
gate this latency, modern CPU designs have employed a wide
range of cache technologies to reduce main memory accesses
using the principle of spatial locality: the observation that
data that a program accesses close together in time tend to
also be close together in memory. On the other hand, GPUs
rely on hardware multithreading to hide memory latency,
and their architectures favour ALU capability over sophis-
ticated logic to manage a cache hierarchy and out-of-order
execution. As a result, the performance of kernels on CPUs
and GPUs alike is strongly dependent on memory access
paerns intrinsic to the code.
Our aim in this paper is to develop a framework to guide
HPC soware engineers in hardware-dependent code opti-
mization – specically by guiding the improvement of mem-
ory access paerns. We provide examples of manufacturer-
recommended code optimizations to improve memory access
paerns on the target architecture, and examine these us-
ing the architecture-independent workload characterization
(AIWC) [8] plugin for Oclgrind [16]. Our work highlights
the benets and challenges arising from an architecture-
independent analysis of memory-based program character-
istics. We propose two new metrics for AIWC to beer char-
acterize these memory-based optimizations. We measure
the presented codes using the new metrics and demonstrate
the metrics’ eectiveness in capturing the essence of the
optimizations performed.
e structure of this paper is as follows. In section 2 we
use the example of matrix multiplication on a GPU to show-
case the specic vendor-recommended optimizations our
work aims to measure. In section 3 we discuss how AIWC
and its precursors proled memory access paerns in an
architecture-independent fashion and also consider relevant
architecture-dependent approaches to memory access prol-
ing. In section 4 we consider non-parallel memory metrics
collected by AIWC and evaluate the suitability of these met-
rics for capturing the impact of performance optimizations
to various OpenCL codes. In section 5 we propose a new
parallel spatial locality metric, implement it in the AIWC
framework, and evaluate it over the matrix multiplication
example from section 2. In section 6 we present the collected
metric for selected benchmarks from the Extended OpenD-
warfs benchmark suite to validate our methodology in this
paper. Finally, in section 7 we discuss further avenues to
extend our work and conclude.
2 MOTIVATING EXAMPLE
We aim to capture the essence of vendor-recommended opti-
mizations for target architectures in our metrics. We demon-
strate the optimization strategies using an OpenCL kernel
which multiplies square matrices of order N . We start with
a simple unoptimized kernel, and then improve the memory
access paerns of this kernel by incrementally performing
the optimizations recommended in the CUDA Optimization
Handbook [14] for NVIDIA GPUs. ese incrementally op-
timized OpenCL codes are then analysed using AIWC to
determine the accuracy in proling favourable memory ac-
cess paerns of the current framework and the proposed
extensions to AIWC.
Simple Unoptimized Matrix Multiplication
e unoptimized matrix multiplication kernel presented in
Appendix A is used as a baseline to validate the performance
improvements measured from each following optimization.
Each thread of the kernel updates the (дlobalRow,дlobalCol)
element of matrix C by computing the dot product of the
corresponding row of A and column of B.
Using Shared Memory to Coalesce Global Memory
Access to Matrix A
We rst notice that the number of global memory accesses to
matrices A and B increases as O(N 3) with respect to matrix
size N . Global memory is typically located o-chip and
accesses induce large delays. NVIDIA GPUs coalesce global
memory loads and stores issued within thread-groups into as
few DRAM transactions as possible. Multiple global memory
loads and stores are coalesced into a single transaction when
certain device-specic conditions are met. On most NVIDIA
GPUs, data accesses are coalesced when multiple requests
are made for memory locations from the same cache line in
global memory [15]. Appendix B contains the coalescedA
kernel, which coalesces accesses to matrix A by storing tiles,
or square blocks, ofA’s values into shared memory (OpenCL
local memory).
Using Shared Memory to store Tiles of Matrix B
e code is further optimized by improving the locality of
reads from Matrix B. While in the previous kernel each
thread reads only a single element of matrix A from global
memory, each thread reads a full column of matrix B. e re-
peated reads of elements of matrix B can be shared between
threads in a work group by reading tiles from matrix B into
shared memory. Appendix C contains the coalescedAB ker-
nel which performs this optimization.
Optimizing Handling of Shared Memory
NVIDIA shared memory is divided into multiple banks –
stored in independent memory modules – to allow parallel
memory access. Bank conicts occur when shared mem-
ory in the same bank is accessed concurrently. e code
in coalescedAB kernel is further optimized by implicitly
Architecture-Independent Memory Access Paern Analysis IWOCL’20, April 27–29, 2020, Munich, Germany
transposing tiles of MatrixAwhile loading from global mem-
ory. is improves memory bank utilization during reads
to shared memory. e coalescedABT kernel demonstrates
this optimization by modifying lines 20,21 of coalescedAB
to:
20 ASub[localCol][localRow] = A[tiledRow];
21 BSub[localRow][localCol] = B[tiledCol];
Alignment of Memory Allocation
Memory access alignment is important to best utilize all parts
of GPU memory architecture. Global memory buer align-
ment can allow threads to access blocks of global memory
aligned to the nearest cache line. is enables coalescing of
memory accesses. If buers are misaligned, parallel memory
requests may cross over cache lines, and may double the
number of slow global memory accesses needed as demon-
strated in gures 1 and 2.
0 32 64 96 128 160 192 224 256 288 320 352 384
Figure 1: All threads accessmemory aligned to nearest cache
line in parallel [14]
0 32 64 96 128 160 192 224 256 288 320 352 384
Figure 2: Unaligned sequential memory addresses t in two
cache lines [14]
A similar principle applies when using shared memory.
Bank conicts may be reduced by aligning allocations of
shared memory buers. e alignedABT kernel improves
the alignment of shared memory tiles in coalescedABT. An
arbitrary large alignment value of 4096 is chosen as it is larger
than the cache line size of modern hardware and aributed
to the local array declarations in the existing code as shown
below.
1 __local float aTile[TILE_DIM][TILE_DIM]
__attribute__ ((aligned(4096)));
2 __local float bTile[TILE_DIM][TILE_DIM]
__attribute__ ((aligned(4096)));
All the examples above are an incomplete listing of opti-
mization strategies developers can apply to code targeted at
NVIDIA GPUs. In section 4, we will compare the collected
AIWC metrics with the expected eect of each optimiza-
tion to examine how eectively our methodology uncovers
underlying bolenecks and guides optimization eorts.
3 RELATEDWORK
Hoste and Eeckout [5] show thatwhile conventional architecture-
dependent characteristics are useful in locating performance
bolenecks, they can hide the underlying, inherent program
behaviour causing performance bolenecks. A conceptual
understanding of the performance characteristics of com-
plex codes is necessary for the programmer to eectively
optimize these codes. Architecture-dependent characteris-
tics typically include instructions per cycle (IPC), cache and
branch prediction miss rates, page faults and DRAM bus data
transfer rates. ese are typically collected using hardware
performance counters available on most target architectures.
ese performance counters do not serve to guide optimiza-
tion beyond highlighting potential bolenecks [3, 5]. Further,
in many cases, architecture-dependent characteristics cannot
be directly correlated to specic code paerns. For exam-
ple, the causes of high cache miss rates in the execution of
a program are complex and depend on microarchitecture
specic features such as cache size, prefetch behaviour and
cache placement policies. A HPC developer tasked with
optimizing code for a given hardware target would benet
from architecture-independent metrics of the code that can
be used to measure the eect of code modications. ese
metrics would help to guide the developer in both nding
and xing performance bolenecks.
To address the limitations of conventionalmicroarchitecture-
dependent characteristics, Hoste and Eeckout [5] developed
the Microarchitecture-Independent Workload Characteriza-
tion tool (MICA). ey observed that performance counter-
based approaches to proling codes oen failed to nd under-
lying program features that map to improved or worsened
usage of performance-critical hardware features of the target
architecture. e MICA framework is a holistic characteri-
zation tool, and thus collects features including instruction
mix, instruction-level parallelism, register trac, data stream
strides and branch predictability.
Of these metrics, data stream strides are of particular in-
terest in memory access paern proling. MICA’s stride
length metric measures the distance between consecutive
memory accesses in a single-threaded application. For CPU
architectures running single-threaded applications, this met-
ric correlates to the spatial locality of memory accesses –
a measure of how closely bunched are memory access in
nearby times. is is directly correlated to cache reuse rates,
critical to code performance on CPUs [6].
IWOCL’20, April 27–29, 2020, Munich, Germany Aditya Chilukuri, Josh Milthorpe, and Beau Johnston
e MICA approach was tailored for single-threaded ap-
plications as the metrics collected rely heavily on Pin in-
strumentation [13]. As such, MICA is unsuited to analysing
HPC workloads with heavy use of parallelism. e Work-
load ISA-Independent Characterization for Applications (WI-
ICA) [17] extends MICA to present a framework to analyse
single-threaded programs independent of the instruction set
architecture (ISA).
Kim and Shrivastava [11] present CuMAPz, a CUDAmem-
ory access proling tool to guide NVIDIA GPU optimiza-
tions. CuMAPz focuses on the problem of improving CUDA
application performance using NVIDIA memory-based op-
timizations. CuMAPz analyses CUDA codes structurally
and simulates code execution on the memory hierarchy of
specic NVIDIA GPU models.
During simulation of the target code, CuMAPz records all
memory accesses in various buers to simulate the global
and shared memories on NVIDIA GPUs. Using this detailed
simulation data, CuMAPz can estimate the performance-
critical (i) shared memory data reuse prot, (ii) prot from
coalesced access (iii) memory channel skew cost and (iv)
bank conict cost characteristics of the target code.
e simulation environment used by CuMAPz and the
aached analysis framework is highly specic to CUDA en-
abled GPUs. Replicating the CuMAPz framework for all
target architectures is challenging. However, CuMAPz is
an interesting simulator from the standpoint of this paper’s
work since it adheres to the core parts of NVIDIA GPUs’
memory models in its analysis, while allowing the user to
specify their GPU model specic hardware information.
Our approach is an architecture-independent analysis of
memory access paerns to provide metrics correlating to
similar performance critical memory access optimizations
as CuMAPz. We aim to further the state of the art by provid-
ing a framework to guide developers in optimizing OpenCL
codes for any given target architecture. To the best our
knowledge, none of the previous works presents a set of
performance metrics that accurately characterize memory
access paerns of parallel applications independent of the
target architecture.
e Architecture-Independent Workload Characterization
(AIWC) tool [8] collects a set of instruction set architecture
(ISA)-independent features based on those identied by Shao
and Brooks [17]. AIWC runs as a plugin to the Oclgrind [16]
framework, which simulates OpenCL kernel execution on an
ideal device according to the OpenCL execution and mem-
ory models. AIWC collects metrics of kernel memory access,
including simple counts such as the (i) total memory foot-
print, the total number of unique addresses accessed; and
(ii) 90% memory footprint, the number of unique addresses
covering 90% of memory accesses. While these metrics are
architecture-independent, they are correlated to program
performance on typical architectures. For example, a small
ratio of 90% memory footprint to total memory footprint
indicates that a program accesses a small subset of memory
addresses frequently, which is highly benecial for perfor-
mance in a cached memory hierarchy.
AIWC also records the global memory address entropy
(GMAE), a positive real number corresponding to the ran-
domness of the memory access distribution of a program.
To measure locality of memory accesses, AIWC collects the
local memory address entropy (LMAE) of memory addresses
accessed aer dropping n least signicant bits of all memory
addresses accessed by the program. To calculate this, AIWC
collects a frequency distribution of all non-register mem-
ory accesses by all threads in the target kernel. Using the
collected frequency distribution, AIWC calculates 10 sepa-
rate local memory address entropy (LMAE) values according
to increasing number of least signicant bits (LSB) skipped
using the explicit formula for the n-bits skipped LMAE:
LMAEn−bits =
∑
a∈An
pa log2(p−1a ) (1)
• An is the set of all addresses accessed aer skipping
n LSBs of each address.
• pa := #accessa#accesstotal is the probability (calculated as
relative frequency) at which each memory address
is accessed.
LMAE measures the locality of memory accesses per-
formed over the full execution of a program. A steeper drop
in entropy with increasing number of bits may correlate to
more localized memory accesses over the program’s execu-
tion.
4 NON-PARALLEL MEMORY METRICS
We rst consider AIWC memory metrics that do not take
into account the interaction between work items executing
in parallel. We compare these metrics for the incrementally
optimized versions of the matrix multiplication example pre-
sented in section 2.
In addition to the memory metrics reported in [8], we
added a new metric of relative local memory usage. is
measures the proportion of all memory accesses from the
symbolic execution of the kernel that occurred to memory al-
located as local. On NVIDIA GPUs, this memory address
space is mapped to fast on-chip shared memory. Relative
local memory usage is an example of a metric that is useful to
measure performance-critical access paerns on some archi-
tectures such as GPUs, and not others, such as CPUs. CPUs
do not typically have a notion of user-controlled on-chip
memory shared between hardware threads such as NVIDIA
GPUs’ shared memory. is is a natural consequence of
Architecture-Independent Memory Access Paern Analysis IWOCL’20, April 27–29, 2020, Munich, Germany
simple coalA coalAB coalABT alignABT
0
0.1
0.2
0.3
Kernel
Ex
ec
ut
io
n
tim
e
(m
s)
N = 80
N = 256
simple coalA coalAB coalABT alignABT
0
50
100
Kernel
Ex
ec
ut
io
n
tim
e
(m
s)
N = 1408
N = 2048
Figure 3: Execution time of matrix multiplication kernels (NVIDIA Tesla P100)
programming for a heterogeneous system. Specic code pat-
terns may translate to performance improvements only on
certain hardware.
In the original AIWC tool [10], global and local memory
address entropy (MAE) were calculated using physical ad-
dresses of memory used by the Oclgrind simulator back-end
of AIWC. is caused irregularities in entropy calculations
across multiple runs of the same simulation. We improved
the calculation of memory address entropy by using virtual
addresses to calculate MAE values using an abstract ideal
address space on which all memory accesses by the kernel
occur. is allows AIWC to accurately abstract over the
hardware and ISA-specic dierences in memory layouts
across the diverse hardware targets.
Figure 3 shows execution times for the matrix multipli-
cation kernels presented in section 2 for N × N matrices of
dierent sizes. We recorded kernel execution time exclusive
of data transfer on an NVIDIA Tesla P100 using NVIDIA
OpenCL 1.2 CUDA 10.1.236 (driver 418.87), and took the
average of 100 runs. Across varying problem sizes, the ven-
dor recommended optimizations to the matrix multiplication
code lead to increased performance.
Table 1 summarizes AIWC memory metrics collected for
each of the matrix multiply kernels in section 2. Note that
only two LMAE values are shown for brevity. We now
analyse the eectiveness of AIWC metrics in proling the
NVIDIA recommended memory optimizations applied to the
matrix multiplication kernel.
Relative local memory usage: As reliance on NVIDIA GPUs’
shared memory increases in each kernel from simple to
coalescedAB, we nd that the proportion of OpenCL local
memory usage increases as expected.
Global and local MAE: Entropy measurements decrease
from simple to coalescedAB, as the optimizations reduce
the number of reads frommatricesA and B in global memory,
replacing these with reads from smaller tiles in local memory.
We observe an almost completely uniform distribution of
memory accesses in simple, where the program makes N
loads to each element of A and B with N the dimension of
the matrices. e distribution of memory accesses becomes
increasingly non-uniform as we perform fewer accesses to
matricesA and B and more accesses to smaller local memory
buers, resulting in decreases in local and global memory
entropy values from simple to coalescedAB.
Memory Footprint: Similar to trends in global and local
MAE, we nd that the ratio of 90% memory footprint to
total memory footprint decreases from 60.12% for simple to
0.25% (coalescedAB). Increased utilization of local memory
in the optimized kernels means that the local memory buers
make up a greater proportion of total memory accesses in
the program. As the local memory buers are small and
reused within a workgroup, the memory footprint of the
local memory buers is also smaller.
AIWC’s metrics strongly reward optimizations that tend
to localize memory accesses. Local memory buers are typi-
cally smaller than global memory arrays when programming
for GPUs due to hardware limitations on sizes of sharedmem-
ory [15]. However, the metrics currently measured by AIWC
do not have a direct causal relation to code paerns that opti-
mize memory accesses on GPUs. e proposed relative local
IWOCL’20, April 27–29, 2020, Munich, Germany Aditya Chilukuri, Josh Milthorpe, and Beau Johnston
simple coalescedA coalescedAB coalescedABT alignedABT
Total memory footprint 196608 196608 196608 196608 196608
90% Memory Footprint 118196 56176 489 489 489
Global MAE 17.02 13.18 9.78 9.78 9.78
LMAE #bits=3 16.02 12.18 8.78 8.78 8.78
LMAE #bits=10 9.02 5.18 1.78 1.78 1.78
Relative Local Memory Usage 0 0.50 0.94 0.94 0.94
Table 1: Selection of AIWC[8] metrics for 256 × 256 matrix multiplication
memory usage metric is the rst to correspond to a recom-
mended optimization strategy of using fast on-chip shared
memory. Further, we nd that all current AIWC metrics do
not measure any sizeable dierence between coalescedAB,
coalescedABT and alignedABT codes. We address this by
proposing another new metric for locality of memory ac-
cesses in the following section.
5 A PARALLEL SPATIAL LOCALITY METRIC
Aggregate metrics of the kind presented by AIWC necessar-
ily present a simplied view of program behaviour, omiing
many details. Dierent ways of aggregating program mea-
surements lead to dierent features of program execution
being emphasised in the nal metrics. For example, the cal-
culation of memory address entropy described in section 3
relies only on the frequency distribution of memory accesses
to all addresses accessed by the kernel, and discards temporal
information. e order of sequential memory accesses per-
formed by each thread, as well the relationship betweenwork
items in an OpenCL work group, are both vital in accurately
characterizing parallel codes.
We propose a new architecture-independent metric, par-
allel spatial locality, to measure memory access paerns
in parallel programs. e proposed metric is inspired by
CuMAPz’ direct approaches to measuring optimization spe-
cic characteristics of CUDA codes.
During simulation, AIWC collects a list of all memory
accesses by each thread of execution. In the OpenCL pro-
gramming model, threads within a work groups execute the
same code and share access to local memory. We can group
together memory accesses of threads in a work group at
each logical time step in the symbolic execution of the code.
On a GPU, memory accesses executed at the same time by
dierent threads in a work group are likely to interact, de-
termining the extent of memory access coalescing and bank
conicts.
ere are three steps involved in generating an AIWC
metric: recording, calculating and summarizing data collected
from the symbolic execution of the kernel under inspection.
Record: we rst record memory accesses performed by
each thread in an OpenCL work group as described above to
achieve a global ordering of all memory accesses performed
by the group. is ordering is collected in the form of logical
timestamps (t0..tlast ) at which memory accesses occur.
Calculate: for each timestamp t = t0..tlast , calculate the
n-bits-dropped entropy of memory addresses accessed by all
threads in a work group within the timestamp t . Here n can
range between 0 and 10 as was the case for LMAE.
Summarize: average the collected entropy values across
all the timestamps to calculate the parallel spatial locality
metrics for one thread group. We then average the n-bits-
dropped entropy summaries across thread groups to obtain
the n-bits-dropped parallel spatial locality metrics for the
kernel’s execution.
e proposed metric is a parallel computing analogue
for MICA’s data stride metric that measures the distance
between consecutive data accesses in a single-threaded envi-
ronment. In parallel programs, to accurately measure spa-
tial locality of accesses, we must consider memory accesses
performed by multiple threads in a close temporal scope.
e proposed metric calculates the locality of accesses in
each time step of the program’s execution and steeper reduc-
tions in n-bits-dropped parallel spatial locality scores will
be observed in programs that oen access nearby memory
addresses within the same timestamp. Such programs will
perform beer on GPUs, as they will make beer use of both
global memory access coalescing and shared memory bank
structures. To a lesser extent, the proposed metric reects
performance-critical memory access paerns on CPUs, as
pulling a single cache line from global memory into last-level
cache may improve memory access times for all CPU cores.
Figure 4 shows the proposed parallel spatial locality metric
as measured by AIWC for each of the matrix multiplication
kernels from section 2. We observe that the coalescedABT
and alignedABT kernels have the steepest reductions in en-
tropy as the number of bits skipped is increased, which cor-
relates to beer locality of parallel memory accesses. It is
Architecture-Independent Memory Access Paern Analysis IWOCL’20, April 27–29, 2020, Munich, Germany
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
# of bits skipped
n
-b
its
dr
op
pe
d
Pa
ra
lle
lS
pa
tia
lL
oc
al
ity simple
coalescedA
coalescedAB
coalescedABT
alignedABT
Figure 4: Parallel spatial locality metric obtained from AIWC for matrix multiply kernels for 256 × 256 matrix multiplication
expected for these kernels to exhibit beer parallel spatial lo-
cality than simple, as coalescedABT and alignedABTmake
use of local memory, where accesses tend to be localized sim-
ply due to the small size of shared on-chip memory typically
available on GPUs. Further we nd that the proposed met-
ric successfully distinguishes between the coalescedAB and
coalescedABT kernels. It accurately depicts a steeper reduc-
tion for the more optimized coalescedABT kernel, where a
larger proportion of parallel memory accesses make beer
use of the memory bank structure of GPU shared memory
than all previous kernels. is is a signicant improvement
over the state-of-the-art AIWCmetrics in characterizing how
codes localize simultaneous memory accesses to beer use
the hardware provided.
6 EVALUATION: EXTENDED OPENDWARFS
BENCHMARK SUITE
e Extended OpenDwarfs benchmark suite [9, 12] is a set
of diverse OpenCL workloads. Each benchmark is assigned
to one of the 13 Berkeley Dwarfs, common computational
and communication paerns which aim to capture the land-
scape of parallel computing workloads [2]. To show that the
proposed parallel spatial locality metric is useful for under-
standing performance properties of a wide range of applica-
tion codes, we present the results of running the AIWC tool
on selected benchmarks from the suite [1]. e workloads
presented are not optimized for any specic architecture –
hence optimizations using OpenCL local memory (which
translates to CUDA shared memory) are not performed.
Many of these benchmarks can be run with up to four
problem sizes based on the sizes of caches found in modern
CPU memory hierarchies [9]. For these results we used a
problem size seing of medium, except for the GEM bench-
mark, which is run at a seing of tiny. For benchmarks such
as BFS that have multiple kernel invocations per run, we
present the AIWC parallel spatial locality metrics for the in-
vocationwith the highest number of LLVM-SPIR instructions
executed. Note that the presented benchmarks use 32-bit
numeric types (OpenCL int and float), so dropping up to
2 bits of the memory addresses accessed will not change the
parallel spatial locality, since any addresses accessed are at
least 4 bytes apart.
N-body methods: GEM
e GEM benchmark computes the electrostatic potential of
a biomolecule by calculating the sum of charges contributed
by all atoms in the biomolecule at each specic surface vertex.
is is an embarrassingly parallel problem. Each OpenCL
work-item operates on a single surface vertex, nding the
electrostatic potential generated by looping through every
atom in the biomolecule in global order. e computation
paern is highly regular and memory accesses are perfectly
synchronized. Atom data is accessed consecutively, with
all work-items simultaneously accessing each atom’s data.
e parallel spatial locality metric reects this paern of
IWOCL’20, April 27–29, 2020, Munich, Germany Aditya Chilukuri, Josh Milthorpe, and Beau Johnston
0 1 2 3 4 5 6 7 8 9 10
0
2
4
6
8
10
# of bits skipped
n
-b
its
dr
op
pe
d
Pa
ra
lle
lS
pa
tia
lL
oc
al
ity
GEM NW SRAD Kernel 1 SRAD Kernel 2
CSR BFS Kernel 1 BFS Kernel 2 LUD Diagonal
LUD Perimeter LUD Internal
Figure 5: Parallel spatial locality metric for selected OpenDwarfs benchmark kernels
ecient memory utilization (single loads from global mem-
ory servicing all OpenCL work-items). e recorded parallel
spatial locality approaches the theoretical limit of 0 (0.0124
at 10 bits dropped). is indicates that almost all memory
accesses made by the kernel are perfectly synchronized be-
tween OpenCL threads. Performance results [9, 12] show
that GEM performs signicantly beer on GPUs than on
CPUs, as memory unit stalls are at low levels for both CPUs
and GPUs due to the highly ecient memory utilization of
this benchmark. As memory operations do not present a
boleneck, this benchmark is able to take advantage of the
superior oating-point compute capability of GPUs [12].
Dense Linear Algebra: Lower-upper decomposition
(LUD)
LUD in OpenDwarfs is a program to decompose an input
N × N matrix as the product of one lower and one upper
diagonal matrix. Memory access paerns in a dense linear
algebra workload such as LUD are typically highly regular
and deterministic for each OpenCL work item, based on the
matrix dimension and oset parameters. e OpenDwarfs
implementation of LUD [1] splits the LUD computation into
three kernels: LUD perimeter and LUD diagonal kernels
spawn work-items in a single work dimension; while LUD
internal is decomposed in both dimensions of the input.
e LUD kernels partially benet from memory access
coalescing on GPUs, for the lines of code where all the work-
items access contiguous memory along a row of the input
matrix. However, when threads simultaneously access a col-
umn of the input matrix, multiple memory requests are made
as addresses accessed are too distant to be coalesced into
a single memory transaction on GPUs. is is reected in
the drop in parallel spatial locality for the LUD kernels to
approximately half the value at 0 bits skipped. Taking LUD
internal as an example, the 0-bits skipped parallel spatial
locality metric is 4.101, while the 10-bits skipped parallel
spatial locality metric is 2.053. e swi decline of the met-
ric to 2.053 from 2 to 6-bits skipped parallel spatial locality
indicates that approximately half the parallel memory ac-
cesses in LUD internal are highly localized. is occurs
when work-items simultaneously access contiguous memory
along a matrix row. Similar to the presented matrix multiply
codes, LUD can be optimized for GPUs by eective utilization
of shared memory [1].
Dynamic Programming: Needleman-Wunsch (NW)
Needleman-Wunsch is a dynamic programming algorithm
used to perform protein sequence alignment by identifying
the similarity between two strings of amino acids. e com-
putation of each element in the similarity matrix depends on
Architecture-Independent Memory Access Paern Analysis IWOCL’20, April 27–29, 2020, Munich, Germany
its west, north and north-west neighbours. is dependency
enforces a wavefront computation paern, travelling along
the main diagonal of the matrix. Each iteration of the kernels
computes over an antidiagonal of the matrix, starting at its
top le corner, and nishing at its boom right corner.
In particular, thewavefront paern of computations causes
each work-item to request distant memory addresses in a
parallel fashion. us, parallel memory requests into the in-
put matrix prohibit locality. We note the lack of any dropo
in parallel spatial locality as the number of bits dropped
increases. is rightly indicates that memory addresses ac-
cessed at each logical timestamp are very distant. On GPUs,
this translates to poor utilization of both memory access
coalescing and caching [12]. is trend in the parallel spatial
locality metric suggests that a possible improvement for GPU
performance would be to load blocks of the input matrix into
on-chip local memory to reduce the number of global mem-
ory requests – this is typically performed in GPU optimized
implementations of the Needleman-Wunsch algorithm [1].
Structured Grids: Speckle Reducing Anisotropic
Diusion (SRAD)
SRAD removes locally correlated noise from images by fol-
lowing a repeated grid update computation paern on im-
age pixel grids. Conditional statements in the code cause
thread divergence, potentially within a work-group, to han-
dle boundary conditions. ese boundaries constitute a small
portion of the executed work-items, especially on larger
data-sets and so the eective thread divergence is minimal.
Memory access paerns in SRAD are statically determined
and relatively localized. Similar to the LUD and simple ker-
nels, both SRAD kernels observe a rapid drop-o in parallel
spatial locality (gure 5) as the number of bits skipped is
increased, with the metric stabilizing at approximately half
its original value when 0 bits are skipped.
At each point in the kernels’ memory access prole, each
work-item with global ID (i, j) in an OpenCL work-group
accesses the (i, j)th elements of various matrices. e se-
quential memory access paern is non-linear since dierent
matrices are accessed consecutively by the kernel, prohibit-
ing ideal caching [12]. However, memory requests made
simultaneously by a work-group always fall within a rect-
angular block of one of the matrices. is allows memory
access coalescing on GPUs for OpenCL work-items access-
ing contiguous matrix data, greatly reducing the number of
memory requests made on each line of the kernel code on
GPUs. However, work-items accessing data along a column
of the matrix do not observe memory access coalescing. us
we observe that while cache hit rates are typically low on this
benchmark, particularly on GPUs [12], GPUs can hide the
latency of global memory accesses through memory access
coalescing to some extent.
Graph Traversal: BFS
Graph traversal algorithms require pointer chasing opera-
tions to traverse nodes of a graph and perform calculations.
e Breadth-First Search implemented in OpenDwarfs calls
two OpenCL kernels to traverse nodes immediately con-
nected to the list of nodes at each depth starting from the
root node. BFS is characterized by an imbalanced workload
per kernel launch depending on the degree of the nodes be-
ing operated on, with only a proportion of launched nodes
performing meaningful work. As such, the AIWC features
collected for each kernel invocation vary signicantly. e
parallel spatial locality metric collects the entropy of memory
accesses at each time step, and workload imbalances across
work-items are dealt with by averaging entropy scores across
all the time steps in the execution of an OpenCL work-group.
Of the two BFS kernels, it is BFS kernel 1 that performs
the graph traversal necessary to generate a list of neighbours
at each node. e memory access paerns in BFS kernel
1 are irregular. Work-items fetch discontinuous memory
locations aributed to any particular node in the graph, de-
pending on the connectivity of the node they operate on.
e program structure involves multiple levels of pointer-
chasing and thus the precise memory addresses accessed are
data-dependent. is leads to poor parallel spatial locality.
us, we see a slow dropo in parallel spatial locality for
BFS kernel 1.
Sparse Linear Algebra: CSR
Compressed Sparse Row Matrix-Vector Multiplication (CSR)
computes the product of a sparse matrix and a dense vector.
e matrix is stored in a compressed row storage sparse
matrix format, which is very ecient for storage when the
number of zero elements is far greater than the number of
non-zero elements. ree inputs are provided to the CSR
kernel. e non-zero elements of a matrix are stored in
row-major order in Ax , along with separate arrays Aj, Ap
indicating the position of each non-zero element in the ma-
trix.
In the OpenDwarfs implementation, each row of the input
sparse matrix is assigned to a separate OpenCL work item.
e locations of matrix data read by each work-item are
dependent on the number and position of non-zero values in
the sparse-matrix, which are decided by the values in Aj and
Ap. Similar to BFS, this means memory access paerns are
runtime-dependent due to indirect addressing. is paern
of indirect addressing is typical of applications in the Spare
Linear Algebra Dwarf, which severely hinders locality of
memory accesses performed. e collected parallel spatial
IWOCL’20, April 27–29, 2020, Munich, Germany Aditya Chilukuri, Josh Milthorpe, and Beau Johnston
locality metric reects this trend. Figure 5 shows the grad-
ual decline in parallel spatial locality as the number of bits
dropped is increased.
7 CONCLUSIONS AND FUTUREWORK
To the best of our knowledge, this work is the rst to pro-
pose the use of architecture-independent metrics of paral-
lel memory access to guide hardware-specic optimization.
We implemented and evaluated a new feature, relative lo-
cal memory usage, to help characterize memory-based GPU
code optimizations. We also implemented a new parallel
spatial locality metric to capture the idea of closeness of
memory accesses made by parallel OpenCL workloads. We
ran the enhanced AIWC tool on matrix multiply kernels and
selected OpenDwarfs benchmarks, presenting results and
analysis to validate our methodology.
Our proposed parallel spatial locality metric may also cor-
relate to some memory-based optimization strategies for
CPUs. Future work would apply the approach followed by
this paper to optimizations for CPU and FPGA architectures
to critically evaluate the viability of AIWC in analyzing mem-
ory access paerns in codes targeted to these architectures.
While our work in this paper intended to guide optimiza-
tion eorts, another potential use case for AIWC is provid-
ing architecture-independent performance predictions for
OpenCL kernels by generating machine learning models
based on AIWC metrics [7]. Future work would modify the
presented metrics and develop new memory metrics with
the specic intent of being fed to machine learning mod-
els to predict kernel performance on arbitrary and novel
architectures.
REFERENCES
[1] Extended OpenDwarfs. hps://github.com/ANU-HPC/OpenDwarfs/
commit/dee488cac9833f029dfada356ae4077b68c4b5, Jan 2020.
[2] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro,
Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A
Paerson, William Lester Plishker, John Shalf, Samuel Webb Williams,
et al. e landscape of parallel computing research: A view from
berkeley. Technical report, Technical Report UCB/EECS-2006-183,
EECS Department, University of California, Berkeley, 2006.
[3] Karthik Ganesan, Lizy John, Valentina Salapura, and James Sexton. A
performance counter based workload characterization on Blue Gene/P.
In International Conference on Parallel Processing (ICPP), pages 330–337.
IEEE, 2008.
[4] John L. Hennessy and David A. Paerson. Computer Architecture - A
antitative Approach, 5th Edition. Morgan Kaufmann, 2012.
[5] Kenneth Hoste and Lieven Eeckhout. Microarchitecture-independent
workload characterization. IEEE Micro, 27(3), 2007.
[6] Intel Corporation. Intel 64 and IA-32 Architectures Optimization
Reference Manual. 2016.
[7] Beau Johnston, Gregory Falzon, and Josh Milthorpe. Opencl perfor-
mance prediction using architecture-independent features. In 2018
International Conference on High Performance Computing & Simula-
tion, HPCS 2018, Orleans, France, July 16-20, 2018, pages 561–569. IEEE,
2018.
[8] Beau Johnston and JoshMilthorpe. AIWC: OpenCL-based architecture-
independent workload characterization. IEEE/ACM Workshop on the
LLVM Compiler Infrastructure in HPC (LLVM-HPC), Nov 2018.
[9] Beau Johnston and JoshMilthorpe. Dwarfs on accelerators: Enhancing
OpenCL benchmarking for heterogeneous computing architectures. In
Proceedings of the 47th International Conference on Parallel Processing
Companion, ICPP ’18, pages 4:1–4:10, New York, NY, USA, 2018. ACM.
[10] Beau Johnston, James Price, Moritz Panzer, Petros Kalos, TomDeakin,
Nido Media, and Daniel Saier. BeauJoh/Oclgrind: Adding AIWC
– An Architecture Independent Workload Characterisation Plugin.
hps://doi.org/10.5281/zenodo.1134175, December 2017.
[11] Yooseong Kim and Aviral Shrivastava. CuMAPz: a tool to analyze
memory access paerns in CUDA. In Leon Stok, Nikil D. Du, and
Soha Hassoun, editors, Proceedings of the 48th Design Automation
Conference, DAC 2011, San Diego, California, USA, June 5-10, 2011,
pages 128–133. ACM, 2011.
[12] Konstantinos Krommydas, Wu-chun Feng, Christos D Antonopoulos,
and Nikolaos Bellas. OpenDwarfs: Characterization of dwarf-based
benchmarks on xed and recongurable architectures. Journal of
Signal Processing Systems, 85(3):373–392, 2016.
[13] Chi-Keung Luk, Robert Cohn, RobertMuth, Harish Patil, Artur Klauser,
Geo Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazel-
wood. Pin: building customized program analysis tools with dynamic
instrumentation. In ACM SIGPLAN notices, volume 40, pages 190–200.
ACM, 2005.
[14] NVIDIA Corporation. CUDA C++ Best Practices Guide. 2019.
[15] NVIDIA Corporation. CUDA C++ Programming Guide. 2019.
[16] James Price and Simon McIntosh-Smith. Oclgrind: An extensible
OpenCL device simulator. In Proceedings of the 3rd International Work-
shop on OpenCL, page 12. ACM, 2015.
[17] Yakun Sophia Shao and David Brooks. ISA-independent workload
characterization and its implications for specialized architectures. In
IEEE International Symposium on Performance Analysis of Systems and
Soware (ISPASS), pages 245–255. IEEE, 2013.
[18] Kyle Spaord, Jeremy Meredith, and Jerey Veer. Maestro: data
orchestration and tuning for OpenCL devices. Euro-Par 2010-Parallel
Processing, pages 275–286, 2010.
Architecture-Independent Memory Access Paern Analysis IWOCL’20, April 27–29, 2020, Munich, Germany
A SIMPLE KERNEL
1 __kernel void simpleMultiply(__global float *A,
2 __global float *B,
3 __global float *C,
4 int N)
5 {
6 const int globalRow = get_global_id(0); //
Row ID of C (0..N-1)
7 const int globalCol = get_global_id(1); //
Col ID of C (0..N-1)
8 // Compute a single element of C (loop over
K)
9 float acc = 0.0f;
10 for (int k = 0; k < N; ++k) {
11 acc += B[k * N + globalCol] * A[globalRow
* N + k];
12 ` }
13 // Store the result
14 C[globalRow * N + globalCol] = acc;
15 }
B COALESCEDA KERNEL
1 __kernel void coalescedAMultiply(
2 const __global float* A,
3 const __global float* B,
4 __global float* C,
5 const int N)
6 {
7 __local float aTile[TILE_DIM][TILE_DIM];
8
9 const int localRow = get_local_id(0);
10 const int localCol = get_local_id(1);
11
12 const int globalRow = get_global_id(0);
13 const int globalCol = get_global_id(1);
14
15 __private float sum = 0.0f;
16
17 const int numTiles = N / TILE_DIM;
18 __private const int tiledRow = globalRow*N+
localCol;
19 for (int i = 0; i < numTiles; i++) {
20 aTile[localRow][localCol] =
A[tiledRow+i*TILE_DIM];
21 barrier(CLK_LOCAL_MEM_FENCE);
22 for (int k = 0; k < TILE_DIM; k++) {
23 sum += aTile[localRow][k] *
B[(i*TILE_DIM+k)*N+globalCol];
24 }
25 barrier(CLK_LOCAL_MEM_FENCE);
26 }
27 C[globalRow*N+globalCol] = sum;
28 }
C COALESCEDAB KERNEL
1 __kernel void coalescedABMultiply(
2 const __global float* A,
3 const __global float* B,
4 __global float* C,
5 const int N) {
6 __local float ASub[TILE_DIM][TILE_DIM];
7 __local float BSub[TILE_DIM][TILE_DIM];
8
9 const int localRow = get_local_id(0);
10 const int localCol = get_local_id(1);
11 const int globalRow = get_global_id(0);
12 const int globalCol = get_global_id(1);
13
14 float acc = 0.0f;
15 const int numTiles = N/TILE_DIM;
16
17 for (int i = 0; i < numTiles; i++) {
18 const int tiledRow =
globalRow*N+i*TILE_DIM + localCol;
19 const int tiledCol = globalCol +
(TILE_DIM*i + localRow)*N;
20 ASub[localRow][localCol] = A[tiledRow];
21 BSub[localRow][localCol] = B[tiledCol];
22
23 barrier(CLK_LOCAL_MEM_FENCE);
24
25 for (int k=0; k<TILE_DIM; k++) {
26 acc += ASub[localRow][k] *
BSub[k][localCol];
27 }
28
29 barrier(CLK_LOCAL_MEM_FENCE);
30 }
31 C[globalRow*N + globalCol] = acc;
32 }
