Programming models such as CUDA and OpenCL allow the programmer to specify the independence of threads, effectively removing ordering constraints. Still, parallel architectures such as the graphics processing unit (GPU) do not exploit the potential for data-locality enabled by this independence. Programmers are therefore required to perform data-locality optimisations such as memory coalescing or loop tiling manually. This work makes a case for locality-aware thread scheduling: re-ordering threads automatically for better locality to improve the programmability of multi-threaded processors. In particular, we analyse the potential of locality-aware thread scheduling for GPUs, considering, among other factors, cache performance, memory coalescing and bank locality. This work does not present an implementation of a locality-aware thread scheduler, but rather introduces the concept and identifies its potential. We conclude that non-optimised programs have the potential to achieve good cache and memory utilisation when using a smarter thread scheduler. For example, a case study of a naive matrix multiplication shows an 87% performance increase, leading to an IPC of 457 on a 512-core GPU.
Introduction
In the past decade, graphics processing units (GPUs) have emerged as a popular platform for non-graphics computations. Through languages such as OpenCL and CUDA, programmers can use these massively parallel architectures (and other accelerators) for computational domains such as linear algebra, image processing and molecular science. The increased popularity of such accelerators has made programming, maintainability, and portability issues of major importance. Although accelerator programming models have partially addressed these issues, programmers are still expected to tune their code for aspects such as (in the case of GPUs) memory coalescing, warp size, core count and the on-chip memories.
To counter the imminent memory wall [3], recent GPUs have been equipped with software-managed on-chip memories (scratch-pad) and hardware-managed on-chip memories (cache). In particular, for integrated solutions with general-purpose memories (e.g. ARM Mali, AMD Fusion, XBox One), off-chip memory bandwidth is scarce: using the on-chip memories efficiently is required to exploit the GPU's full potential [13]. In fact, many GPU programs are memory bandwidth intensive: in one example set of benchmarks, this holds for as many as 18 out of 31 [5]. Specific examples of cache optimisations include cache blocking for sparse matrix vector multiplication (5x speed-up) and loop tiling for a stencil computation (3x speed-up). Programmers of GPUs therefore perform memory coalescing to maximise off-chip throughput, or tiling to improve data-locality. Furthermore, programmers determine the allocation of threads to threadblocks, which affects scheduling freedom and cache performance.
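To illustrate why memory coalescing affects off-chip throughput, the following Python sketch counts how many distinct memory lines a single warp touches under two access patterns. The 128-byte line size, the 4-byte elements and the stride of 1024 words are assumptions chosen for illustration, and the helper names are hypothetical; it is not a model of any specific GPU's memory system.

```python
WARP_SIZE = 32     # threads per warp on NVIDIA GPUs
LINE_BYTES = 128   # assumed size of one memory line / transaction
WORD_BYTES = 4     # assumed element size (32-bit words)


def lines_touched(addresses):
    """Return the number of distinct LINE_BYTES-sized lines covered."""
    return len({addr // LINE_BYTES for addr in addresses})


def warp_addresses(stride_words):
    """Byte address read by each thread of one warp for a given stride."""
    return [t * stride_words * WORD_BYTES for t in range(WARP_SIZE)]


# Thread t reads word t: neighbouring threads share a line.
coalesced = lines_touched(warp_addresses(1))

# Thread t reads word 1024 * t (e.g. walking down a column of a
# row-major matrix with 1024 columns): every thread hits its own line.
strided = lines_touched(warp_addresses(1024))

print("lines touched, unit stride:", coalesced)
print("lines touched, stride 1024:", strided)
```

Under these assumptions the unit-stride warp needs a single line while the strided warp needs one line per thread, which is the gap that manual coalescing is meant to close.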
With programming models such as CUDA and OpenCL, programmers create a large number of independent threads that execute a single piece of program code (a kernel). Still, microprocessors such as the GPU do not exploit the potential of spatial and temporal data-locality enabled by this independence. Therefore, we propose locality-aware thread scheduling: changing the schedule of threads, warps and threadblocks with respect to a kernel's memory accesses.
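The effect of thread order on locality can be sketched with a toy model. The Python code below replays the memory accesses of a naive matrix multiplication through a small, fully associative LRU cache for two different thread orders. Everything here is an assumption made for illustration: the matrix size, line size and cache capacity, the one-thread-at-a-time execution, and the function names. It is not the scheduler or the simulator used in this work, and it ignores warps, parallel cores and the memory hierarchy of a real GPU.

```python
from collections import OrderedDict

N = 32          # matrix dimension (assumed)
LINE = 8        # elements per cache line (assumed)
CAPACITY = 64   # cache capacity in lines (assumed)


def thread_lines(i, j):
    """Yield the cache lines read by the thread computing C[i][j]."""
    for k in range(N):
        yield ("A", (i * N + k) // LINE)   # A[i][k], row-major layout
        yield ("B", (k * N + j) // LINE)   # B[k][j], row-major layout


def count_misses(schedule):
    """Run threads one at a time through a fully associative LRU cache."""
    cache = OrderedDict()
    misses = 0
    for i, j in schedule:
        for line in thread_lines(i, j):
            if line in cache:
                cache.move_to_end(line)        # mark as most recently used
            else:
                misses += 1
                cache[line] = True
                if len(cache) > CAPACITY:
                    cache.popitem(last=False)  # evict least recently used
    return misses


# Default order: threads sorted by row, then by column.
row_major = [(i, j) for i in range(N) for j in range(N)]

# Re-ordered: finish a LINE-wide strip of columns for all rows first,
# so the lines of B belonging to that strip stay in the cache.
strip_order = [(i, j)
               for j0 in range(0, N, LINE)
               for i in range(N)
               for j in range(j0, j0 + LINE)]

misses_row_major = count_misses(row_major)
misses_strip = count_misses(strip_order)
print("misses, row-major thread order:", misses_row_major)
print("misses, strip thread order:    ", misses_strip)
```

Both schedules run exactly the same threads with the same kernel; only the order differs. In this toy setting the strip order incurs fewer misses because the lines of B are reused before they are evicted, which is the kind of benefit a locality-aware scheduler would aim to obtain without changes to the program code.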
This work does not aim to improve performance for already optimised (e.g. coalesced, tiled) code, but is instead motivated by non-optimised program code and the performance potential of locality-aware thread scheduling. This improves programmability, a metric intertwined with: 1) portability: the generality of program code when targeting different microprocessors, 2) productivity: the time it takes to design and maintain program code, and 3) performance: the speed or energy efficiency of a program. Although the focus of this work lies on GPUs, we note that the ideas are equally valid for other cache-based processors that are programmable in an SPMD fashion.
This work demonstrates that locality-aware thread scheduling can significantly improve the programmability of GPUs. The main contributions are:
- Section 5: The potential of multi-level locality-aware thread scheduling for GPUs is identified and quantified for several non-optimised benchmarks.
- Section 6: Two example kernels are evaluated further, identifying the effects of thread scheduling on, among others, caches and memory bank locality.
Background
This section briefly introduces the GPU architecture and its execution model. Additional background can be found in the CUDA programming guide [10]. We use NVIDIA's Fermi architecture as an example in this paper. The Fermi architecture has up to 16 cores (also known as streaming multiprocessors or compute units). Each core contains 32 processing elements (or CUDA cores) and a 64KB on-chip configurable memory, combining scratchpad and L1 data cache (16/48KB or 48/16KB). All cores share a larger L2 cache (up to 768KB).
The CUDA and OpenCL programming models allow programmers to specify small programs (kernels) that are executed multiple times on different data. Each instance of a kernel (a thread in CUDA terminology, a workitem in OpenCL terminology) has its own unique identifier. Programmers furthermore divide all their threads into fixed-size blocks (threadblocks in CUDA terminology, workgroups
