High-performance computing developers are faced with the challenge of optimizing the performance of OpenCL workloads on diverse architectures. The Architecture-Independent Workload Characterization (AIWC) tool is a plugin for the Oclgrind OpenCL simulator that gathers metrics of OpenCL programs that can be used to understand and predict program performance on an arbitrary given hardware architecture. However, AIWC metrics are not always easily interpreted and do not reflect some important memory access patterns affecting efficiency across architectures. We propose a new metric of parallel spatial locality: the closeness of memory accesses simultaneously issued by OpenCL work-items (threads). We implement the parallel spatial locality metric in the AIWC framework, and analyse gathered results on matrix multiply and the Extended OpenDwarfs OpenCL benchmarks. The differences in the observed parallel spatial locality metric across implementations of matrix multiply reflect the optimizations performed. The new metric can be used to distinguish between the OpenDwarfs benchmarks based on the memory access patterns affecting their performance on various architectures. The improvements suggested to AIWC will help HPC developers better understand memory access patterns of complex codes and guide optimization of codes for arbitrary hardware targets.
INTRODUCTION
High-performance computing (HPC) systems are increasingly heterogeneous. A single node on a modern supercomputer may combine traditional CPUs with one or more accelerators such as GPUs, Field Programmable Gate Arrays (FPGAs), or many integrated core devices (MICs). High bandwidth interconnects support tight integration between multiple devices of different types on a single compute node.
The OpenCL programming language is designed to support modern HPC software engineers in writing code that executes on multiple hardware targets. This gives HPC software developers greater flexibility by allowing codes aimed at a range of hardware targets to be written in a single programming language environment.
Application codes differ in resource requirements, control structure and available parallelism. Similarly, compute devices differ in number and capabilities of execution units, processing model, and available resources. These heterogeneous computing environments present opportunities for software engineers and HPC integrators to design highly optimized systems with multiple kernels executing on hardware targets best suited for the diverse computational tasks performed by each kernel [18]. However, this opportunity also presents a challenge in optimizing code to run on diverse architectures. The work reported here aims to help HPC developers understand and optimize memory access patterns to improve performance on heterogeneous systems.
In the evolution of computer architectures over the last few decades, the exponential growth in computational capability has not been matched by proportional increases in memory speeds [4]. Memory accesses pose larger bottlenecks to performance as application demand for main memory scales with the arithmetic capability of computer systems. To mitigate this latency, modern CPU designs have employed a wide range of cache technologies to reduce main memory accesses using the principle of spatial locality: the observation that data that a program accesses close together in time tend to also be close together in memory. On the other hand, GPUs rely on hardware multithreading to hide memory latency, and their architectures favour ALU capability over sophisticated logic to manage a cache hierarchy and out-of-order execution. As a result, the performance of kernels on CPUs and GPUs alike is strongly dependent on memory access patterns intrinsic to the code.
Our aim in this paper is to develop a framework to guide HPC software engineers in hardware-dependent code optimization, specifically by guiding the improvement of memory access patterns. We provide examples of manufacturer-recommended code optimizations to improve memory access patterns on the target architecture, and examine these using the architecture-independent workload characterization (AIWC) [8] plugin for Oclgrind [16]. Our work highlights the benefits and challenges arising from an architecture-independent analysis of memory-based program characteristics. We propose two new metrics for AIWC to better characterize these memory-based optimizations. We measure the presented codes using the new metrics and demonstrate the metrics' effectiveness in capturing the essence of the optimizations performed.
The structure of this paper is as follows. In section 2 we use the example of matrix multiplication on a GPU to showcase the specific vendor-recommended optimizations our work aims to measure. In section 3 we discuss how AIWC and its precursors profiled memory access patterns in an architecture-independent fashion and also consider relevant architecture-dependent approaches to memory access profiling. In section 4 we consider non-parallel memory metrics collected by AIWC and evaluate the suitability of these metrics for capturing the impact of performance optimizations to various OpenCL codes. In section 5 we propose a new parallel spatial locality metric, implement it in the AIWC framework, and evaluate it over the matrix multiplication example from section 2. In section 6 we present the collected metric for selected benchmarks from the Extended OpenDwarfs benchmark suite to validate our methodology in this paper. Finally, in section 7 we discuss further avenues to extend our work and conclude.
MOTIVATING EXAMPLE
We aim to capture the essence of vendor-recommended optimizations for target architectures in our metrics. We demonstrate the optimization strategies using an OpenCL kernel which multiplies square matrices of order N. We start with a simple unoptimized kernel, and then improve the memory access patterns of this kernel by incrementally performing the optimizations recommended in the CUDA Optimization Handbook [14] for NVIDIA GPUs. These incrementally optimized OpenCL codes are then analysed using AIWC to determine how accurately the current framework, and the proposed extensions to AIWC, profile favourable memory access patterns.
Simple Unoptimized Matrix Multiplication
The unoptimized matrix multiplication kernel presented in Appendix A is used as a baseline to validate the performance improvements measured from each following optimization. Each thread of the kernel updates the (globalRow, globalCol) element of matrix C by computing the dot product of the corresponding row of A and column of B.
Using Shared Memory to Coalesce Global Memory Access to Matrix A
We first notice that the number of global memory accesses to matrices A and B increases as $O(N^3)$ with respect to matrix size N. Global memory is typically located off-chip and accesses induce large delays. NVIDIA GPUs coalesce global memory loads and stores issued within thread-groups into as few DRAM transactions as possible. Multiple global memory loads and stores are coalesced into a single transaction when certain device-specific conditions are met. On most NVIDIA GPUs, data accesses are coalesced when multiple requests are made for memory locations from the same cache line in global memory [15]. Appendix B contains the coalescedA kernel, which coalesces accesses to matrix A by storing tiles, or square blocks, of A's values into shared memory (OpenCL local memory).
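The coalescing condition described above can be illustrated with a toy model that counts how many cache lines one group of simultaneous accesses touches. This is only a sketch under stated assumptions: the 128-byte line size, the 32 work-item group, the row-major layout and every function name are hypothetical choices made for illustration, and real devices apply further device-specific rules.

```python
# Illustrative model only: the line size, element size and group size below
# are assumptions, not values taken from the paper.
LINE_BYTES = 128
ELEM_BYTES = 4
GROUP_SIZE = 32


def transactions(addresses, line_bytes=LINE_BYTES):
    """Count the distinct cache lines touched by one set of simultaneous accesses."""
    return len({addr // line_bytes for addr in addresses})


def row_accesses(n, row):
    """Byte addresses when work-items read consecutive elements of one row."""
    return [(row * n + col) * ELEM_BYTES for col in range(GROUP_SIZE)]


def column_accesses(n, col):
    """Byte addresses when work-items read consecutive elements of one column."""
    return [(row * n + col) * ELEM_BYTES for row in range(GROUP_SIZE)]


N = 256  # order of a row-major square matrix (hypothetical size)
row_tx = transactions(row_accesses(N, 0))
col_tx = transactions(column_accesses(N, 0))
print(f"row-wise: {row_tx} transaction(s), column-wise: {col_tx} transaction(s)")
```

In this model, work-items that read along a row share one cache line, while work-items that read down a column each touch a different line.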
Using Shared Memory to store Tiles of Matrix B
The code is further optimized by improving the locality of reads from Matrix B. While in the previous kernel each thread reads only a single element of matrix A from global memory, each thread reads a full column of matrix B. The repeated reads of elements of matrix B can be shared between threads in a work group by reading tiles from matrix B into shared memory. Appendix C contains the coalescedAB kernel which performs this optimization.
Optimizing Handling of Shared Memory
NVIDIA shared memory is divided into multiple banks, stored in independent memory modules, to allow parallel memory access. Bank conflicts occur when shared memory in the same bank is accessed concurrently. The code in the coalescedAB kernel is further optimized by implicitly transposing tiles of Matrix A while loading from global memory. This improves memory bank utilization during reads to shared memory. The coalescedABT kernel demonstrates this optimization by modifying lines 20,21 of coalescedAB to:
Alignment of Memory Allocation
Memory access alignment is important to best utilize all parts of the GPU memory architecture. Global memory buffer alignment can allow threads to access blocks of global memory aligned to the nearest cache line. This enables coalescing of memory accesses. If buffers are misaligned, parallel memory requests may cross over cache lines, and may double the number of slow global memory accesses needed, as demonstrated in figures 1 and 2.
Figure 2: Unaligned sequential memory addresses fit in two cache lines [14]
A similar principle applies when using shared memory. Bank conflicts may be reduced by aligning allocations of shared memory buffers. The alignedABT kernel improves the alignment of shared memory tiles in coalescedABT. An arbitrary large alignment value of 4096 is chosen as it is larger than the cache line size of modern hardware and attributed to the local array declarations in the existing code as shown below.
The examples above are not an exhaustive list of the optimization strategies developers can apply to code targeted at NVIDIA GPUs. In section 4, we will compare the collected AIWC metrics with the expected effect of each optimization to examine how effectively our methodology uncovers underlying bottlenecks and guides optimization efforts.
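The cost of crossing a cache-line boundary can be made concrete with a small calculation. The sketch below is not the alignedABT listing; it only counts cache lines spanned by a contiguous block of accesses, assuming a hypothetical 128-byte line, 32 work-items and 4-byte elements. The helper names are invented for illustration.

```python
def cache_lines_touched(base, n_items=32, elem_bytes=4, line_bytes=128):
    """Cache lines spanned by n_items contiguous elements starting at byte `base`."""
    first_line = base // line_bytes
    last_line = (base + n_items * elem_bytes - 1) // line_bytes
    return last_line - first_line + 1


def align_up(offset, alignment):
    """Round a byte offset up to the next multiple of `alignment`."""
    return (offset + alignment - 1) // alignment * alignment


aligned = cache_lines_touched(base=0)                     # starts on a line boundary
unaligned = cache_lines_touched(base=64)                  # starts half-way into a line
realigned = cache_lines_touched(base=align_up(64, 128))   # padded to the next boundary
print(aligned, unaligned, realigned)
```

Under these assumptions the misaligned block needs twice as many cache lines as the aligned one, which is the doubling described in the text.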
RELATED WORK
Hoste and Eeckhout [5] show that while conventional architecture-dependent characteristics are useful in locating performance bottlenecks, they can hide the underlying, inherent program behaviour causing performance bottlenecks. A conceptual understanding of the performance characteristics of complex codes is necessary for the programmer to effectively optimize these codes. Architecture-dependent characteristics typically include instructions per cycle (IPC), cache and branch prediction miss rates, page faults and DRAM bus data transfer rates. These are typically collected using hardware performance counters available on most target architectures. These performance counters do not serve to guide optimization beyond highlighting potential bottlenecks [3, 5]. Further, in many cases, architecture-dependent characteristics cannot be directly correlated to specific code patterns. For example, the causes of high cache miss rates in the execution of a program are complex and depend on microarchitecture-specific features such as cache size, prefetch behaviour and cache placement policies. An HPC developer tasked with optimizing code for a given hardware target would benefit from architecture-independent metrics of the code that can be used to measure the effect of code modifications. These metrics would help to guide the developer in both finding and fixing performance bottlenecks.
To address the limitations of conventional microarchitecture-dependent characteristics, Hoste and Eeckhout [5] developed the Microarchitecture-Independent Workload Characterization tool (MICA). They observed that performance counter-based approaches to profiling codes often failed to find underlying program features that map to improved or worsened usage of performance-critical hardware features of the target architecture. The MICA framework is a holistic characterization tool, and thus collects features including instruction mix, instruction-level parallelism, register traffic, data stream strides and branch predictability.
Of these metrics, data stream strides are of particular interest in memory access pattern profiling. MICA's stride length metric measures the distance between consecutive memory accesses in a single-threaded application. For CPU architectures running single-threaded applications, this metric correlates to the spatial locality of memory accesses: a measure of how closely bunched memory accesses made at nearby times are. This is directly correlated to cache reuse rates, which are critical to code performance on CPUs [6]. The MICA approach was tailored for single-threaded applications as the metrics collected rely heavily on Pin instrumentation [13]. As such, MICA is unsuited to analysing HPC workloads with heavy use of parallelism. The Workload ISA-Independent Characterization for Applications (WIICA) [17] extends MICA to present a framework to analyse single-threaded programs independent of the instruction set architecture (ISA).
Kim and Shrivastava [11] present CuMAPz, a CUDA memory access profiling tool to guide NVIDIA GPU optimizations. CuMAPz focuses on the problem of improving CUDA application performance using NVIDIA memory-based optimizations. CuMAPz analyses CUDA codes structurally and simulates code execution on the memory hierarchy of specific NVIDIA GPU models.
During simulation of the target code, CuMAPz records all memory accesses in various buffers to simulate the global and shared memories on NVIDIA GPUs. Using this detailed simulation data, CuMAPz can estimate the performance-critical (i) shared memory data reuse profit, (ii) profit from coalesced access, (iii) memory channel skew cost and (iv) bank conflict cost characteristics of the target code. The simulation environment used by CuMAPz and the attached analysis framework is highly specific to CUDA-enabled GPUs. Replicating the CuMAPz framework for all target architectures is challenging. However, CuMAPz is an interesting simulator from the standpoint of this paper's work since it adheres to the core parts of NVIDIA GPUs' memory models in its analysis, while allowing the user to specify hardware information specific to their GPU model.
Our approach is an architecture-independent analysis of memory access patterns to provide metrics correlating to similar performance-critical memory access optimizations as CuMAPz. We aim to further the state of the art by providing a framework to guide developers in optimizing OpenCL codes for any given target architecture. To the best of our knowledge, none of the previous works presents a set of performance metrics that accurately characterize memory access patterns of parallel applications independent of the target architecture.
The Architecture-Independent Workload Characterization (AIWC) tool [8] collects a set of instruction set architecture (ISA)-independent features based on those identified by Shao and Brooks [17]. AIWC runs as a plugin to the Oclgrind [16] framework, which simulates OpenCL kernel execution on an ideal device according to the OpenCL execution and memory models. AIWC collects metrics of kernel memory access, including simple counts such as the (i) total memory footprint, the total number of unique addresses accessed; and (ii) 90% memory footprint, the number of unique addresses covering 90% of memory accesses. While these metrics are architecture-independent, they are correlated to program performance on typical architectures. For example, a small ratio of 90% memory footprint to total memory footprint indicates that a program accesses a small subset of memory addresses frequently, which is highly beneficial for performance in a cached memory hierarchy.
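The two footprint counts can be sketched from a flat list of accessed addresses. This is a minimal illustration and not AIWC's implementation: the function name, the greedy most-frequent-first selection and the example traces are assumptions made here to show how a skewed access distribution yields a small 90% footprint.

```python
from collections import Counter


def memory_footprints(accesses, percent=90):
    """Return (total footprint, `percent`% footprint) for a list of addresses.

    The percent footprint is taken here as the smallest number of unique
    addresses, most frequently accessed first, that covers `percent`% of
    all accesses (an assumed reading of the definition).
    """
    counts = Counter(accesses)
    total_accesses = len(accesses)
    covered = 0
    footprint = 0
    for _, count in counts.most_common():
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= percent * total_accesses:
            break
        covered += count
        footprint += 1
    return len(counts), footprint


# Skewed trace: one hot address takes 90 of 100 accesses.
hot = [0x1000] * 90
cold = [0x2000 + 4 * i for i in range(10)]
skewed = memory_footprints(hot + cold)

# Uniform trace: 100 addresses, one access each.
uniform = memory_footprints([4 * i for i in range(100)])
print(skewed, uniform)
```

The skewed trace has a ratio of 1 to 11 between the two footprints, while the uniform trace has a ratio of 90 to 100, matching the interpretation given above.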
AIWC also records the global memory address entropy (GMAE), a positive real number corresponding to the randomness of the memory access distribution of a program. To measure locality of memory accesses, AIWC collects the local memory address entropy (LMAE) of memory addresses accessed after dropping n least significant bits of all memory addresses accessed by the program. To calculate this, AIWC collects a frequency distribution of all non-register memory accesses by all threads in the target kernel. Using the collected frequency distribution, AIWC calculates 10 separate LMAE values according to increasing number of least significant bits (LSB) skipped using the explicit formula for the n-bits skipped LMAE:
$$\mathrm{LMAE}_n = -\sum_{a \in A_n} p_a \log_2 p_a$$
where:
• $A_n$ is the set of all addresses accessed after skipping n LSBs of each address.
• $p_a := \frac{\#\mathrm{access}_a}{\#\mathrm{access}_{total}}$ is the probability (calculated as relative frequency) at which each memory address is accessed.
LMAE measures the locality of memory accesses performed over the full execution of a program. A steeper drop in entropy with increasing number of bits skipped may correlate to more localized memory accesses over the program's execution.
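The entropy calculation can be sketched as follows, assuming the usual Shannon entropy over the relative access frequencies defined above. The function names and the two synthetic traces are hypothetical, and the choice of n = 1..10 for the ten LMAE values is an assumption; this is not AIWC's code.

```python
import math
from collections import Counter


def address_entropy(addresses, bits_dropped=0):
    """Shannon entropy (in bits) of addresses after dropping the low bits."""
    counts = Counter(addr >> bits_dropped for addr in addresses)
    total = sum(counts.values())
    # p * log2(1/p) written as (c/total) * log2(total/c).
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def lmae_profile(addresses, max_bits=10):
    """Entropy for 1..max_bits dropped bits (assumed range for the ten values)."""
    return [address_entropy(addresses, n) for n in range(1, max_bits + 1)]


# Sixteen 4-byte elements packed into 64 contiguous bytes.
clustered = [4 * i for i in range(16)]
# Sixteen elements spread 1024 bytes apart.
scattered = [1024 * i for i in range(16)]

clustered_profile = lmae_profile(clustered)
scattered_profile = lmae_profile(scattered)
print(clustered_profile)
print(scattered_profile)
```

Both traces have the same entropy with no bits dropped, but only the clustered trace falls to zero as bits are skipped, which is the steeper drop that the text associates with localized accesses.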
NON-PARALLEL MEMORY METRICS
We first consider AIWC memory metrics that do not take into account the interaction between work items executing in parallel. We compare these metrics for the incrementally optimized versions of the matrix multiplication example presented in section 2.
In addition to the memory metrics reported in [8], we added a new metric of relative local memory usage.
This measures the proportion of all memory accesses from the symbolic execution of the kernel that occurred to memory allocated as local. On NVIDIA GPUs, this memory address space is mapped to fast on-chip shared memory. Relative local memory usage is an example of a metric that is useful to measure performance-critical access patterns on some architectures, such as GPUs, and not others, such as CPUs. CPUs do not typically have a notion of user-controlled on-chip memory shared between hardware threads such as NVIDIA GPUs' shared memory.
This is a natural consequence of programming for a heterogeneous system. Specific code patterns may translate to performance improvements only on certain hardware.
In the original AIWC tool [10], global and local memory address entropy (MAE) were calculated using physical addresses of memory used by the Oclgrind simulator back-end of AIWC. This caused irregularities in entropy calculations across multiple runs of the same simulation. We improved the calculation of memory address entropy by using virtual addresses to calculate MAE values using an abstract ideal address space on which all memory accesses by the kernel occur.
This allows AIWC to accurately abstract over the hardware and ISA-specific differences in memory layouts across the diverse hardware targets.
Figure 3 shows execution times for the matrix multiplication kernels presented in section 2 for N × N matrices of different sizes. We recorded kernel execution time exclusive of data transfer on an NVIDIA Tesla P100 using NVIDIA OpenCL 1.2 CUDA 10.1.236 (driver 418.87), and took the average of 100 runs. Across varying problem sizes, the vendor-recommended optimizations to the matrix multiplication code lead to increased performance. Table 1 summarizes AIWC memory metrics collected for each of the matrix multiply kernels in section 2. Note that only two LMAE values are shown for brevity. We now analyse the effectiveness of AIWC metrics in profiling the NVIDIA-recommended memory optimizations applied to the matrix multiplication kernel.
Relative local memory usage: As reliance on NVIDIA GPUs' shared memory increases in each kernel from simple to coalescedAB, we find that the proportion of OpenCL local memory usage increases as expected.
Global and local MAE: Entropy measurements decrease from simple to coalescedAB, as the optimizations reduce the number of reads from matrices A and B in global memory, replacing these with reads from smaller tiles in local memory. We observe an almost completely uniform distribution of memory accesses in simple, where the program makes N loads to each element of A and B, with N the dimension of the matrices. The distribution of memory accesses becomes increasingly non-uniform as we perform fewer accesses to matrices A and B and more accesses to smaller local memory buffers, resulting in decreases in local and global memory entropy values from simple to coalescedAB.
Memory Footprint: Similar to trends in global and local MAE, we find that the ratio of 90% memory footprint to total memory footprint decreases from 60.12% for simple to 0.25% for coalescedAB. Increased utilization of local memory in the optimized kernels means that the local memory buffers make up a greater proportion of total memory accesses in the program. As the local memory buffers are small and reused within a workgroup, the memory footprint of the local memory buffers is also smaller.
AIWC's metrics strongly reward optimizations that tend to localize memory accesses. Local memory buffers are typically smaller than global memory arrays when programming for GPUs due to hardware limitations on sizes of shared memory [15]. However, the metrics currently measured by AIWC do not have a direct causal relation to code patterns that optimize memory accesses on GPUs. The proposed relative local memory usage metric is the first to correspond to a recommended optimization strategy of using fast on-chip shared memory. Further, we find that none of the current AIWC metrics measures any sizeable difference between the coalescedAB, coalescedABT and alignedABT codes. We address this by proposing another new metric for locality of memory accesses in the following section.
A PARALLEL SPATIAL LOCALITY METRIC
Aggregate metrics of the kind presented by AIWC necessarily present a simplified view of program behaviour, omitting many details. Different ways of aggregating program measurements lead to different features of program execution being emphasised in the final metrics. For example, the calculation of memory address entropy described in section 3 relies only on the frequency distribution of memory accesses to all addresses accessed by the kernel, and discards temporal information. The order of sequential memory accesses performed by each thread, as well as the relationship between work items in an OpenCL work group, are both vital in accurately characterizing parallel codes. We propose a new architecture-independent metric, parallel spatial locality, to measure memory access patterns in parallel programs. The proposed metric is inspired by CuMAPz' direct approaches to measuring optimization-specific characteristics of CUDA codes.
During simulation, AIWC collects a list of all memory accesses by each thread of execution. In the OpenCL programming model, threads within a work group execute the same code and share access to local memory. We can group together memory accesses of threads in a work group at each logical time step in the symbolic execution of the code. On a GPU, memory accesses executed at the same time by different threads in a work group are likely to interact, determining the extent of memory access coalescing and bank conflicts.
There are three steps involved in generating an AIWC metric: recording, calculating and summarizing data collected from the symbolic execution of the kernel under inspection.
Record: we first record memory accesses performed by each thread in an OpenCL work group as described above to achieve a global ordering of all memory accesses performed by the group. This ordering is collected in the form of logical timestamps ($t_0 .. t_{last}$) at which memory accesses occur.
Calculate: for each timestamp $t = t_0 .. t_{last}$, calculate the n-bits-dropped entropy of memory addresses accessed by all threads in a work group within the timestamp t. Here n can range between 0 and 10 as was the case for LMAE.
Summarize: average the collected entropy values across all the timestamps to calculate the parallel spatial locality metrics for one thread group. We then average the n-bits-dropped entropy summaries across thread groups to obtain the n-bits-dropped parallel spatial locality metrics for the kernel's execution.
The proposed metric is a parallel computing analogue for MICA's data stride metric that measures the distance between consecutive data accesses in a single-threaded environment. In parallel programs, to accurately measure spatial locality of accesses, we must consider memory accesses performed by multiple threads in a close temporal scope. The proposed metric calculates the locality of accesses in each time step of the program's execution, and steeper reductions in n-bits-dropped parallel spatial locality scores will be observed in programs that often access nearby memory addresses within the same timestamp. Such programs will perform better on GPUs, as they will make better use of both global memory access coalescing and shared memory bank structures. To a lesser extent, the proposed metric reflects performance-critical memory access patterns on CPUs, as pulling a single cache line from global memory into last-level cache may improve memory access times for all CPU cores.
Figure 4 shows the proposed parallel spatial locality metric as measured by AIWC for each of the matrix multiplication kernels from section 2. We observe that the coalescedABT and alignedABT kernels have the steepest reductions in entropy as the number of bits skipped is increased, which correlates to better locality of parallel memory accesses.
Figure 4: Parallel spatial locality metric obtained from AIWC for matrix multiply kernels for 256 × 256 matrix multiplication
It is expected for these kernels to exhibit better parallel spatial locality than simple, as coalescedABT and alignedABT make use of local memory, where accesses tend to be localized simply due to the small size of shared on-chip memory typically available on GPUs. Further, we find that the proposed metric successfully distinguishes between the coalescedAB and coalescedABT kernels. It accurately depicts a steeper reduction for the more optimized coalescedABT kernel, where a larger proportion of parallel memory accesses make better use of the memory bank structure of GPU shared memory than all previous kernels. This is a significant improvement over the state-of-the-art AIWC metrics in characterizing how codes localize simultaneous memory accesses to better use the hardware provided.
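The record, calculate and summarize steps of this section can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the AIWC plugin's code: the trace format of (work-group, timestamp, address) tuples, the function names and the synthetic trace are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(addresses, bits_dropped):
    """Shannon entropy (in bits) of addresses after dropping the low bits."""
    counts = Counter(addr >> bits_dropped for addr in addresses)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def parallel_spatial_locality(trace, max_bits=10):
    """trace: iterable of (group_id, timestamp, address) tuples (assumed format).

    Returns one value per n in 0..max_bits.
    """
    # Record: bucket addresses by work-group and logical timestamp.
    by_group = defaultdict(lambda: defaultdict(list))
    for group, timestamp, address in trace:
        by_group[group][timestamp].append(address)

    metric = []
    for n in range(max_bits + 1):
        group_means = []
        for steps in by_group.values():
            # Calculate: entropy of the addresses issued within each timestamp.
            per_step = [entropy(addrs, n) for addrs in steps.values()]
            # Summarize: average over the timestamps of this work-group...
            group_means.append(sum(per_step) / len(per_step))
        # ...then average over work-groups.
        metric.append(sum(group_means) / len(group_means))
    return metric


trace = []
for wi in range(16):  # 16 work-items per group (hypothetical)
    trace.append((0, 0, 0x1000))              # group 0, t0: all read one address
    trace.append((0, 1, 0x2000 + 4 * wi))     # group 0, t1: contiguous 4-byte reads
    trace.append((1, 0, 0x4000 + 1024 * wi))  # group 1, t0: reads 1024 bytes apart

psl = parallel_spatial_locality(trace)
print([round(v, 3) for v in psl])
```

In this synthetic trace the first work-group's accesses collapse to zero entropy once enough bits are dropped, while the second work-group's widely spaced accesses keep their entropy, so the averaged metric falls from 3 to 2 instead of to 0.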
EVALUATION: EXTENDED OPENDWARFS BENCHMARK SUITE
The Extended OpenDwarfs benchmark suite [9, 12] is a set of diverse OpenCL workloads. Each benchmark is assigned to one of the 13 Berkeley Dwarfs, common computational and communication patterns which aim to capture the landscape of parallel computing workloads [2]. To show that the proposed parallel spatial locality metric is useful for understanding performance properties of a wide range of application codes, we present the results of running the AIWC tool on selected benchmarks from the suite [1]. The workloads presented are not optimized for any specific architecture, hence optimizations using OpenCL local memory (which translates to CUDA shared memory) are not performed.
Many of these benchmarks can be run with up to four problem sizes based on the sizes of caches found in modern CPU memory hierarchies [9]. For these results we used a problem size setting of medium, except for the GEM benchmark, which is run at a setting of tiny. For benchmarks such as BFS that have multiple kernel invocations per run, we present the AIWC parallel spatial locality metrics for the invocation with the highest number of LLVM-SPIR instructions executed. Note that the presented benchmarks use 32-bit numeric types (OpenCL int and float), so dropping up to 2 bits of the memory addresses accessed will not change the parallel spatial locality, since any addresses accessed are at least 4 bytes apart.
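The remark about 32-bit types can be checked directly: when every address is a multiple of 4, dropping one or two low bits never merges two distinct addresses, so the entropy is unchanged until a third bit is dropped. The sketch below assumes 4-byte-aligned addresses and uses hypothetical names and values.

```python
import math
from collections import Counter


def entropy(addresses, bits_dropped):
    """Shannon entropy (in bits) of addresses after dropping the low bits."""
    counts = Counter(addr >> bits_dropped for addr in addresses)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Sixteen work-items each reading one 4-byte element (assumed aligned base).
addresses = [0x1000 + 4 * wi for wi in range(16)]
values = [entropy(addresses, n) for n in range(4)]
print(values)
```

The first three values are identical, and the entropy only starts to fall at 3 bits dropped, when pairs of neighbouring elements begin to share an 8-byte block.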
N-body methods: GEM
The GEM benchmark computes the electrostatic potential of a biomolecule by calculating the sum of charges contributed by all atoms in the biomolecule at each specific surface vertex.
This is an embarrassingly parallel problem. Each OpenCL work-item operates on a single surface vertex, finding the electrostatic potential generated by looping through every atom in the biomolecule in global order. The computation pattern is highly regular and memory accesses are perfectly synchronized. Atom data is accessed consecutively, with all work-items simultaneously accessing each atom's data. The parallel spatial locality metric reflects this pattern of efficient memory utilization (single loads from global memory servicing all OpenCL work-items). The recorded parallel spatial locality approaches the theoretical limit of 0 (0.0124 at 10 bits dropped). This indicates that almost all memory accesses made by the kernel are perfectly synchronized between OpenCL threads. Performance results [9, 12] show that GEM performs significantly better on GPUs than on CPUs, as memory unit stalls are at low levels for both CPUs and GPUs due to the highly efficient memory utilization of this benchmark. As memory operations do not present a bottleneck, this benchmark is able to take advantage of the superior floating-point compute capability of GPUs [12].
Dense Linear Algebra: Lower-upper decomposition (LUD)
LUD in OpenDwarfs is a program to decompose an input N × N matrix as the product of one lower triangular and one upper triangular matrix. Memory access patterns in a dense linear algebra workload such as LUD are typically highly regular and deterministic for each OpenCL work item, based on the matrix dimension and offset parameters. The OpenDwarfs implementation of LUD [1] splits the LUD computation into three kernels: LUD perimeter and LUD diagonal kernels spawn work-items in a single work dimension, while LUD internal is decomposed in both dimensions of the input. The LUD kernels partially benefit from memory access coalescing on GPUs, for the lines of code where all the work-items access contiguous memory along a row of the input matrix. However, when threads simultaneously access a column of the input matrix, multiple memory requests are made as addresses accessed are too distant to be coalesced into a single memory transaction on GPUs. This is reflected in the drop in parallel spatial locality for the LUD kernels to approximately half the value at 0 bits skipped. Taking LUD internal as an example, the 0-bits skipped parallel spatial locality metric is 4.101, while the 10-bits skipped parallel spatial locality metric is 2.053. The swift decline of the metric to 2.053 between 2 and 6 bits skipped indicates that approximately half the parallel memory accesses in LUD internal are highly localized.
This occurs when work-items simultaneously access contiguous memory along a matrix row. Similar to the presented matrix multiply codes, LUD can be optimized for GPUs by effective utilization of shared memory [1].
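A toy calculation shows why an even mix of row-wise and column-wise accesses would halve the metric. The sketch below is not derived from the LUD kernels: it assumes a hypothetical 16 work-item group and a 256 × 256 row-major float matrix, with one time step reading along a row and one reading down a column.

```python
import math
from collections import Counter


def entropy(addresses, bits_dropped):
    """Shannon entropy (in bits) of addresses after dropping the low bits."""
    counts = Counter(addr >> bits_dropped for addr in addresses)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


N = 256          # matrix order (assumed)
ELEM_BYTES = 4   # OpenCL float
GROUP = 16       # work-items per group (assumed)

row_step = [(0 * N + c) * ELEM_BYTES for c in range(GROUP)]     # contiguous reads
column_step = [(r * N + 0) * ELEM_BYTES for r in range(GROUP)]  # 1024-byte stride

# Average the per-time-step entropies, as in the summarize step.
profile = [(entropy(row_step, n) + entropy(column_step, n)) / 2 for n in range(11)]
print(profile)
```

The row-wise step loses all of its entropy between 2 and 6 bits dropped while the column-wise step keeps its full 4 bits, so the average settles at half its starting value, the same shape as the LUD internal figures quoted above.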
Dynamic Programming: Needleman-Wunsch (NW)
Needleman-Wunsch is a dynamic programming algorithm used to perform protein sequence alignment by identifying the similarity between two strings of amino acids. The computation of each element in the similarity matrix depends on its west, north and north-west neighbours. This dependency enforces a wavefront computation pattern, travelling along the main diagonal of the matrix. Each iteration of the kernels computes over an antidiagonal of the matrix, starting at its top left corner, and finishing at its bottom right corner.
In particular, the wavefront pattern of computations causes each work-item to request distant memory addresses in a parallel fashion. Thus, parallel memory requests into the input matrix prohibit locality. We note the lack of any drop-off in parallel spatial locality as the number of bits dropped increases. This rightly indicates that memory addresses accessed at each logical timestamp are very distant. On GPUs, this translates to poor utilization of both memory access coalescing and caching [12]. This trend in the parallel spatial locality metric suggests that a possible improvement for GPU performance would be to load blocks of the input matrix into on-chip local memory to reduce the number of global memory requests. This is typically performed in GPU-optimized implementations of the Needleman-Wunsch algorithm [1].
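The absence of a drop-off can be reproduced with a synthetic antidiagonal access. This sketch is not taken from the OpenDwarfs kernel: it assumes a hypothetical 1024 × 1024 row-major matrix of 4-byte integers and 16 work-items, each reading one element of the same antidiagonal at a given time step.

```python
import math
from collections import Counter


def entropy(addresses, bits_dropped):
    """Shannon entropy (in bits) of addresses after dropping the low bits."""
    counts = Counter(addr >> bits_dropped for addr in addresses)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


N = 1024         # matrix order (assumed)
ELEM_BYTES = 4   # OpenCL int
GROUP = 16       # work-items per group (assumed)
d = GROUP - 1    # index of the antidiagonal being processed

# Work-item i reads element (row i, column d - i): neighbours in the wavefront
# are (N - 1) elements apart in a row-major layout.
antidiagonal = [(i * N + (d - i)) * ELEM_BYTES for i in range(GROUP)]
profile = [entropy(antidiagonal, n) for n in range(11)]
print(profile)
```

Because consecutive addresses are 4092 bytes apart in this layout, dropping up to 10 bits never merges two of them, and the entropy stays at its maximum for every n.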
Structured Grids: Speckle Reducing Anisotropic
Di usion (SRAD) SRAD removes locally correlated noise from images by following a repeated grid update computation pa ern on image pixel grids. Conditional statements in the code cause thread divergence, potentially within a work-group, to handle boundary conditions. ese boundaries constitute a small portion of the executed work-items, especially on larger data-sets and so the e ective thread divergence is minimal. Memory access pa erns in SRAD are statically determined and relatively localized. Similar to the LUD and simple kernels, both SRAD kernels observe a rapid drop-o in parallel spatial locality ( gure 5) as the number of bits skipped is increased, with the metric stabilizing at approximately half its original value when 0 bits are skipped.
At each point in the kernels' memory access profile, each work-item with global ID (i, j) in an OpenCL work-group accesses the (i, j)-th elements of various matrices. The sequential memory access pattern is non-linear since different matrices are accessed consecutively by the kernel, prohibiting ideal caching [12]. However, memory requests made simultaneously by a work-group always fall within a rectangular block of one of the matrices. This allows memory access coalescing on GPUs for OpenCL work-items accessing contiguous matrix data, greatly reducing the number of memory requests made on each line of the kernel code on GPUs. However, work-items accessing data along a column of the matrix do not observe memory access coalescing. Thus we observe that while cache hit rates are typically low on this benchmark, particularly on GPUs [12], GPUs can hide the latency of global memory accesses through memory access coalescing to some extent.
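The contrast between row-wise and column-wise accesses can be made concrete with a small Python sketch that counts how many aligned memory segments a group of simultaneous accesses touches. The 4-byte element size, the 128-byte segment size and the function name are assumptions chosen for illustration; actual coalescing rules differ between GPU architectures.

```python
def segments_touched(cells, cols, elem_bytes=4, segment_bytes=128):
    """Count the distinct aligned segments covered by simultaneous accesses.

    `cells` holds the (i, j) indices requested by the work-items at one
    time step; the matrix is assumed to be stored in row-major order.
    """
    segments = set()
    for i, j in cells:
        byte_address = (i * cols + j) * elem_bytes
        segments.add(byte_address // segment_bytes)
    return len(segments)


# 32 work-items reading along one row of a 1024-column matrix
row_access = [(0, j) for j in range(32)]
# 32 work-items reading down one column of the same matrix
column_access = [(i, 0) for i in range(32)]
```

Under these assumptions the row-wise access falls into a single segment, while the column-wise access touches a separate segment for every work-item.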
Graph Traversal: BFS
Graph traversal algorithms require pointer chasing operations to traverse nodes of a graph and perform calculations. The Breadth-First Search implemented in OpenDwarfs calls two OpenCL kernels to traverse nodes immediately connected to the list of nodes at each depth starting from the root node. BFS is characterized by an imbalanced workload per kernel launch depending on the degree of the nodes being operated on, with only a proportion of launched nodes performing meaningful work. As such, the AIWC features collected for each kernel invocation vary significantly. The parallel spatial locality metric collects the entropy of memory accesses at each time step, and workload imbalances across work-items are dealt with by averaging entropy scores across all the time steps in the execution of an OpenCL work-group.
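A minimal Python sketch of this kind of computation follows. It assumes Shannon entropy over the addresses issued at one time step, computed after discarding a number of least-significant address bits, and a plain mean over time steps; the function names are hypothetical and the exact formulation used inside AIWC may differ.

```python
import math
from collections import Counter


def entropy_bits(addresses, bits_dropped=0):
    """Shannon entropy (in bits) of the addresses issued at one time step."""
    counts = Counter(addr >> bits_dropped for addr in addresses)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def parallel_spatial_locality(timesteps, bits_dropped=0):
    """Average the per-time-step entropy over all non-empty time steps.

    `timesteps` is a list of address lists, one per logical time step
    of a work-group. Steps without accesses are ignored.
    """
    scores = [entropy_bits(step, bits_dropped) for step in timesteps if step]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical trace: one step with adjacent addresses, one with distant ones
trace = [[0, 4, 8, 12], [0, 4096, 8192, 12288]]
```

In this sketch, nearby addresses collapse onto the same value once a few low bits are dropped, so the entropy falls quickly; distant addresses stay distinct until many bits are dropped, so the entropy falls slowly.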
Of the two BFS kernels, it is BFS kernel 1 that performs the graph traversal necessary to generate a list of neighbours at each node. The memory access patterns in BFS kernel 1 are irregular. Work-items fetch discontinuous memory locations attributed to any particular node in the graph, depending on the connectivity of the node they operate on. The program structure involves multiple levels of pointer-chasing and thus the precise memory addresses accessed are data-dependent. This leads to poor parallel spatial locality. Thus, we see a slow dropoff in parallel spatial locality for BFS kernel 1.
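The data-dependent addressing can be seen in a minimal Python sketch of a single frontier-expansion step. It assumes a graph stored as per-node edge offsets and degrees together with a flat edge array; all names are hypothetical and the code is not taken from the OpenDwarfs source.

```python
def expand_frontier(starts, degrees, edges, frontier, visited, cost):
    """Visit the neighbours of every node in the current frontier.

    Each loop iteration over `node` stands in for one work-item.
    `cost` is updated in place; the next frontier is returned.
    """
    next_frontier = [False] * len(starts)
    for node in range(len(starts)):
        if not frontier[node]:
            continue  # this work-item has no meaningful work to do
        first = starts[node]                  # first level of indirection
        for e in range(first, first + degrees[node]):
            neighbour = edges[e]              # second level of indirection
            if not visited[neighbour]:
                cost[neighbour] = cost[node] + 1
                next_frontier[neighbour] = True
    return next_frontier


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
starts = [0, 2, 3, 4]
degrees = [2, 1, 1, 0]
edges = [1, 2, 3, 3]
cost = [0, -1, -1, -1]
next_frontier = expand_frontier(
    starts, degrees, edges,
    frontier=[True, False, False, False],
    visited=[True, False, False, False],
    cost=cost,
)
```

The locations written in `cost` and `next_frontier` depend on the contents of `edges`, so the addresses issued by neighbouring work-items are only known at run time.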
Sparse Linear Algebra: CSR
Compressed Sparse Row Matrix-Vector Multiplication (CSR) computes the product of a sparse matrix and a dense vector. The matrix is stored in a compressed row storage sparse matrix format, which is very efficient for storage when the number of zero elements is far greater than the number of non-zero elements. Three inputs are provided to the CSR kernel. The non-zero elements of a matrix are stored in row-major order in Ax, along with separate arrays Aj, Ap indicating the position of each non-zero element in the matrix.
In the OpenDwarfs implementation, each row of the input sparse matrix is assigned to a separate OpenCL work-item. The locations of matrix data read by each work-item are dependent on the number and position of non-zero values in the sparse matrix, which are decided by the values in Aj and Ap. Similar to BFS, this means memory access patterns are runtime-dependent due to indirect addressing. This pattern of indirect addressing is typical of applications in the Sparse Linear Algebra Dwarf, and it severely hinders the locality of the memory accesses performed. The collected parallel spatial locality metric reflects this trend. Figure 5 shows the gradual decline in parallel spatial locality as the number of bits dropped is increased.
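The row-per-work-item scheme can be sketched in a few lines of Python. The sketch assumes the conventional CSR roles for the arrays, with `Ap` holding row offsets into `Ax` and `Aj` holding the column index of each non-zero; the text above does not spell these roles out, and the function name is hypothetical.

```python
def csr_spmv(Ap, Aj, Ax, x):
    """Compute y = A * x for a matrix A stored in CSR form.

    Each iteration of the outer loop stands in for one work-item.
    """
    n_rows = len(Ap) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Ap[row] .. Ap[row + 1] delimits this row's non-zeros in Ax and Aj
        for k in range(Ap[row], Ap[row + 1]):
            acc += Ax[k] * x[Aj[k]]  # indirect read of x through Aj
        y[row] = acc
    return y


# Hypothetical 3 x 3 example:
# [[1, 0, 2],
#  [0, 0, 3],
#  [4, 5, 0]]
Ap = [0, 2, 3, 5]
Aj = [0, 2, 2, 0, 1]
Ax = [1.0, 2.0, 3.0, 4.0, 5.0]
y = csr_spmv(Ap, Aj, Ax, [1.0, 2.0, 3.0])
```

Both the range of `k` and the index `Aj[k]` are read from memory, so the addresses touched by different rows at the same step depend on the sparsity pattern of the input.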
CONCLUSIONS AND FUTURE WORK
To the best of our knowledge, this work is the first to propose the use of architecture-independent metrics of parallel memory access to guide hardware-specific optimization. We implemented and evaluated a new feature, relative local memory usage, to help characterize memory-based GPU code optimizations. We also implemented a new parallel spatial locality metric to capture the idea of closeness of memory accesses made by parallel OpenCL workloads. We ran the enhanced AIWC tool on matrix multiply kernels and selected OpenDwarfs benchmarks, presenting results and analysis to validate our methodology.
Our proposed parallel spatial locality metric may also correlate with some memory-based optimization strategies for CPUs. Future work would apply the approach followed in this paper to optimizations for CPU and FPGA architectures to critically evaluate the viability of AIWC in analyzing memory access patterns in codes targeted to these architectures.
While our work in this paper intended to guide optimization efforts, another potential use case for AIWC is providing architecture-independent performance predictions for OpenCL kernels by generating machine learning models based on AIWC metrics [7]. Future work would modify the presented metrics and develop new memory metrics with the specific intent of being fed to machine learning models to predict kernel performance on arbitrary and novel architectures.
