Abstract-We consider automatic performance tuning of stencil computations on Graphics Processing Units. We present a strategy that uses machine learning to determine the best way to use memory followed by a heuristic that divides the remaining optimizations into groups and exhaustively explores one group at a time. We evaluate our strategy using 102 synthetically generated OpenCL stencil kernels on an Nvidia GTX Titan GPU. We assess our strategy both in terms of the number of configurations explored during auto-tuning and the quality of the best configuration obtained. We explore two alternative heuristics that use different groupings of the optimizations. We show that, relative to a random sampling of the space and an expert search, our strategy achieves a reduction in the number of configurations explored of up to 80% and 84% respectively while also finding better performing configurations.
I. INTRODUCTION
Stencil computations appear in many domains such as image processing [22] , partial differential equation solvers [16] and cellular automata [8] . They are often computationally intensive and have abundant parallelism. Thus, they are excellent candidates for acceleration on Graphics Processing Units (GPUs).
However, achieving high performance on GPUs is often challenging. Programmers must apply optimizations to their stencil code in order to exploit the underlying GPU architecture. They must ensure memory coalescing, use local and texture memories, reduce thread divergence and select kernel launch configurations that balance parallelism with resource usage [20] . The impact of each optimization is often difficult to assess, particularly when combined with other optimizations. Programmers are left to explore a large space of optimization configurations, i.e., combinations of optimizations and the way in which they are applied, to find good performing ones. This exploration entails running the code for each configuration of interest-a process that can take months of compute time [7] .
Thus, there has been considerable interest in performance auto-tuning, i.e., automatically exploring the space of configurations in efficient ways to determine a good combination of optimizations to apply. A common approach is to use expert knowledge of the stencil computation and the underlying GPU architecture to limit the space, then to exhaustively search through it; we refer to this strategy as expert search. However, this approach requires new expert knowledge for every stencil computation and for every new GPU architecture. In addition, it can still require exploring a prohibitively large number of configurations.
In this work we present an auto-tuning strategy for searching through the space of possible optimization configurations for good performing ones. Our approach tackles the large space in two ways. First, it partitions it based on memory optimization and uses machine learning to predict the partition containing the best configuration. Second, it groups the remaining optimizations so as to minimize the dependencies among optimizations in different groups and auto-tunes the groups independently. We believe this approach allows us to explore fewer configurations than expert search approaches while obtaining as good or better performing configurations.
We first define a set of optimizations that are commonly applied to stencil computations. We train a random forest [3] machine learning model to predict the optimal memory type based on static program features. We then develop two alternative heuristics for grouping the remaining optimizations. The first groups the optimizations along the dimensions of the grid in which the GPU threads are organized. The second puts each optimization, applied to all dimensions of the grid, in a group by itself. We evaluate our strategy on an Nvidia GTX Titan using 102 synthetically generated OpenCL stencil kernels with wide ranges of stencil patterns (dense, star, no-corner, diamond and thumbtack) and radii (0 to 5). In comparison to an expert search approach, our strategy with the better of the two heuristics (the one based on dimensions) explores 84% fewer configurations while finding a configuration that is on average 3% better. Compared to a random sampling of the space, our strategy with the same heuristic explores 80% fewer configurations while finding one that is 7% better on average.
The rest of this paper is structured as follows. Section II provides background material. Section III defines stencil computations. Section IV describes the optimizations we tune. Section V demonstrates the interestingness of the configurations space. Section VI details our auto-tuning strategy. Section VII reports our evaluation. Section VIII outlines related work and Section IX concludes.
II. BACKGROUND A. OpenCL Programming Model
We use Open Computing Language (OpenCL) [25] , an industry standard language for programming heterogeneous parallel systems. Its paradigm consists of a host from which programs, known as kernels, are launched and one or more devices on which those kernels execute. Many instances of a kernel are executed concurrently, each by a work-item, i.e., a thread. Work-items are grouped in a multidimensional grid and are identified by their indices, known as global IDs, in each dimension of this grid.
OpenCL also has the notion of a work-group, a multidimensional collection of work-items within the grid that execute on the device at the same time. The IDs of a workitem in each dimension within its work-group are known as its local IDs. Work-groups provide two major benefits. First, work-items in the same work-group are present simultaneously on the device, making it possible to synchronize across them. Most hardware implementations ensure that workgroups are executed together at some level in the hardware. On the GTX Titan used in this study, this hardware is called a Streaming Multiprocessor (SMX) [21] . Second, work-items in the same work-group share a software-managed cache known as local memory.
B. GPU Architecture
GPUs are highly parallel platforms consisting of clusters (SMXs) of simple cores. Multiple cores within an SMX execute the same instruction on possibly different data. When an OpenCL kernel executes, one or more work-groups are assigned to an SMX at once. The number of work-groups running on each SMX affects both parallelism and hardware utilization.
One of the most important factors affecting performance is how data is loaded from global memory, the slow, off-chip memory of the GPU. If multiple consecutive work-items all access a contiguous, aligned chunk of memory of a particular size these accesses are coalesced into a single memory transaction [20] . Coalesced accesses significantly reduce memory access time. Further, expensive global memory accesses can be eliminated by storing data used by multiple work-items in local memory.
Texture memory is a hardware-managed cache with multidimensional locality that is used to store image data types [20] . If data is allocated as an image, then it is stored in global memory like normal, but is cached in the texture cache. This cache exploits locality in non-innermost dimensions of image data.
III. STENCIL COMPUTATIONS A stencil computation produces one or more output arrays as a function of one or more input arrays in such a way that each output element is a function of some input elements at fixed offsets relative to the index of the output element. This fixed set of offsets defines the stencil. Figure 1 shows sequential C code for the general form of a stencil computation with three dimensional input and output arrays. The loops x, y, and z are known as the spatial loops and they sweep through the elements of the output array, out. They compute each of its elements as a function, f, of the stencil defined by the set of constants C ij , the ith offset in the jth dimension. X_SIZE and Y_SIZE are constants whose definitions are omitted for brevity. The t loop repeats this process, and is known as the time loop.
It is possible for the input and output arrays of a stencil computation to be the same, e.g., in successive overrelaxation [26] . This introduces loop-carried data dependencies that render such stencils less suited for GPUs. Thus, f o r ( i n t t = 0 ; t < T MAX; ++ t ) { f o r ( i n t z = 0 ; z < Z SIZE ; ++z ) f o r ( i n t y = 0 ; y < Y SIZE ; ++y ) f o r ( i n t x = 0 ; x < X SIZE ; ++x ) { f l o a t temp0 = i n we restrict ourselves to stencil computations in which the input and output arrays are different. For simplicity, we only consider stencil programs with one input and one output array of the same dimensionality. Stencil computations can have a variety of different stencil patterns, as shown in Figure 2 . The stencil radius is the maximum distance between any element of the stencil and the centre of the stencil. The examples in Figure 2 all have a stencil radius of two. The stencil size is the number of input elements used to compute each output element.
When implemented naively in OpenCL the spatial loops of a stencil become a kernel, as shown in Figure 3 . The computation of each output element is mapped to a single work-item and the for loops are replaced with OpenCL API calls to get_global_id(). Thus, in this example, a three dimensional global grid is used. We use the letters x, y, and z to refer to the 0th, 1st, and 2nd dimensions of this grid respectively. The time loop is executed by the host, causing it to repeatedly launch the kernel. We focus on optimizing this kernel and thus do not consider optimizing the time loop (e.g., tiling [9, 19] ).
IV. OPTIMIZATIONS

A. Work-Group Size
The work-group size is the number of work-items in a work-group in each dimension. This size can have a f o r ( i n t x = x s t a r t ; x < x s t a r t +2; ++x ) { f l o a t temp0 = i n significant impact on performance because it affects available parallelism. It needs to be large enough to utilize the hardware, but not so large as to reduce parallelism by reducing the number of work-groups that can execute at once. A large work-group size can also exhaust other resources such as registers or local memory, reducing performance or rendering a configuration unexecutable.
B. Block Merging
Block merging combines the work done by adjacent workitems into a single work-item. The number of work-items merged together is referred to as the block merge factor. Figure 4 shows an example of applying this optimization. A loop is introduced to compute multiple elements instead of one (two in the example).
Block merging increases thread work granularity thereby mitigating some of the overhead of launching threads. It also allows duplicate reads to be removed if the introduced loop is unrolled. Specifically, it allows global memory accesses to be replaced by register reads if the merged work-items require the same input element in their computation. The removal of these duplicate reads requires a compiler that is able to unroll the loop created by merging, identify duplicate reads and remove them.
However, block merging increases the register use of each work-item. This in turn can decrease the maximum work-group size and the number of concurrent work-groups executing on an SMX. Further, block merging in the innermost dimension of the global grid can uncoalesce already f o r ( i n t y = y s t a r t ; y < y end ; y += wy )
. . . coalesced accesses. These factors negatively impact performance and make it hard to predict the benefit of block merging.
C. Cyclic Merging
Cyclic merging combines non-adjacent work-items such that work is assigned to work-items in a work-group in a round-robin fashion. The number of work-items merged in this manner is referred to as the cyclic merge factor. Cyclic merging introduces a loop into the kernel code for each dimension merged and also requires some changes to the array index calculations, as shown in Figure 5 .
Cyclic merging increases thread work granularity and has the benefit that when applied to the inner-most dimension of the grid it does not disrupt existing memory coalescing. However, it is unlikely to remove duplicate reads since the elements accessed by the merged work-items are likely far apart. Similar to block merging, it can limit parallelism by increasing register usage and decreasing the maximum workgroup size and the maximum number of concurrent workgroups. Thus, it may or may not benefit performance.
D. Local Memory Caching
This optimization refers to caching the input data into local memory before reading it. This caching is done cooperatively by the threads in a way that ensures memory coalescing when possible. Only the minimum amount of data needed to avoid storing partial results is loaded from the z dimension at any one time so as to reduce the total amount of local memory needed. Figure 6 shows an example of the changes to the kernel to implement this optimization. Reads from global to local memory are introduced at the beginning of the kernel, the existing reads from global memory are replaced with reads to local memory, and a barrier is added between the two types of reads for synchronization.
Use of local memory can improve kernel performance when there are repeated accesses to the same data within a work-group because global memory transactions are replaced by faster local memory accesses. It can also improve memory coalescing when memory transactions are otherwise uncoalesced. However, large local memory requirements can unnecessarily restrict the maximum work-group size, limiting available parallelism. Furthermore, use of local memory adds the overhead of extra memory accesses and synchronization. Thus, it is not always beneficial [11] .
E. Vectorization
This optimization converts reads from the input array, arithmetic operations on the intermediate result, and writes to the output array into vector operations to exploit specialized vector hardware. It only applies to kernels that are block merged in the x dimension and only to those operations that were formerly done by multiple work-items that, postmerging, are accomplished by a single work-item. Thus the two optimizations in conjunction replace parallelism via work-items with parallelism via vectorization.
Reads that align to vector boundaries only require changing the data types of relevant variables to vector types. However, for input reads that do not align to a vector boundary, two vector reads need to be made and an interim vector result must be composed. Output writes are always aligned to vector boundaries.
F. Image/Texture Memory
This optimization stores the input array as an OpenCL image data type rather than a standard OpenCL memory buffer, allowing the kernel to exploit the texture cache. It requires changing the data type of the input array and its allocation on the host. Further, OpenCL API calls must be used to access the elements rather than normal array subscripts. The texture cache is hardware managed and thus requires no additional changes to the kernel.
This optimization can improve performance by replacing expensive global memory accesses with accesses to the texture cache. However, local memory can accomplish the same and using both simultaneously adds overhead for no benefit. Determining which to use for a given kernel is nontrivial. 
G. Optimization Space
All of the optimizations are summarized in Table I along with abbreviations used throughout the paper to refer to them. Each optimization takes on a value from the given range. Thus, an optimization configuration is the full set of values, one for each optimization. We restrict optimization values to powers of 2. Further, the possible values for many of the optimizations depend on the other optimization values. The product of W, B, and C for a given dimension must be less than or equal to N in that dimension. In addition, VX must be less than or equal to BX. These restrictions result in a total optimization space of over 50 million configurations. Many of these configurations are not actually executable because they violate some limitation of the hardware, e.g., they use more local memory than is available on the device. Nonetheless, the space is prohibitively large to explore exhaustively.
H. Optimization Interactions
The above optimizations interact with one another in nontrivial ways. For example, block and cyclic merging can increase register usage of work-items and thus impact the optimal work-group size. Similarly, using local memory adds an additional resource constraint that can limit the workgroup size. Local memory also impacts the optimal workgroup shape. When local memory is not used, a rectangular work-group shape, with a large size in the x dimension and small size in y and z, is usually favoured because of better cache performance. However, when local memory is used, more cubic work-group shapes are favoured because they result in more memory reuse for the same size of local memory.
Thus, finding the optimal optimization configuration is not simply a matter of determining the best configuration for each optimization in isolation.
V. INTERESTINGNESS
We demonstrate the need for auto-tuning by showing that the optimization space is "interesting". That is, it is sufficiently complex that simple approaches to determining a high-performing configuration fail. To this end, we show that:
1) The majority of configurations in the space perform poorly.
2) The optimal value for each optimization varies from kernel to kernel.
3) The optimizations interact, requiring that they be explored together and not individually. 4) The best optimization configuration on one kernel does not necessarily perform well on other kernels. Ideally, we would examine the entire optimization space. However, this is not feasible. Consequently, we perform our exploration of the optimization space by randomly sampling configurations. These configurations are used as a proxy for the full optimization space.
To improve the quality of the randomly sampled configurations, we omit obviously bad configurations: (1) 
These restrictions reduce the optimization space to 567,000 configurations. We randomly sample 1500 configurations for each of the 102 kernels used in our evaluation (see Section VII-A). We believe that this size is sufficient because we found that the performance of the best optimization configuration from the sample plateaued after 1200-1500 configurations. For each sampled configuration, speedup is calculated by dividing the runtime of the best-performing configuration for that kernel from the sample by the runtime of the given configuration. Figure 7 shows a histogram of speedup values across all of the configurations, that is, each bar indicates how many optimization configurations fell in the given speedup bin. The majority of the configurations perform poorly, with only 1.1% of them achieving a speedup within 10% of the best performing configuration. Conversely, a very large fraction of the configurations, 39.2%, experienced more than a 10x slowdown relative to the best configuration. This shows that the optimization space is skewed towards poorly performing configurations.
A. High-Performing Configurations Are Few
B. High-Performing Configurations Are Difficult to Find
We next show that the best value for each optimization varies across different kernels. If a single value is always best for each optimization, then it doesn't matter that most of the configurations perform poorly, one can always set each optimization to its best value. Figure 8 shows the number of times each possible value for each optimization was used in the best-performing configuration. The graphs show that there is significant variation in the optimization values that give the best performance, particularly for WX, WY, WZ, CZ, and data loading technique, where no individual value is best for more than 35 out of the 102 kernels. BX/VX shows less variation, likely 
C. The Optimizations Interact
Next, we show that the optimizations interact by showing that tuning the optimizations individually is not sufficient to find a good optimization configuration. For each kernel, we first determine the best value for each optimization while keeping the other optimizations held to the values used by the best configuration found by random sampling for that kernel. Then we combine the independently chosen optimizations to arrive at a final configuration.
A histogram of the speedup values of this approach across all 102 kernels is shown in Figure 9 . The leftmost bar consists of configurations that are not even executable. On average, this approach achieves only 90% of the performance of the best sampled configuration. That is, on average, it performs worse than its starting point. Thus, determining the best value for each optimization individually does not result in good performance; the optimizations must be examined together. 
D. Good Configurations Are Not the Same Across Kernels
Finally, we show that the best performing configuration for a kernel does not, on average, perform well across all kernels, thus demanding the auto-tuning of each kernel.
We run the best performing configuration of each kernel on the 101 other kernels. Figure 10 shows a histogram of the resulting speedups. If the optimal configuration of each kernel worked best for all other kernels, all configurations would achieve a speedup of 1. The left-most bar, once again, corresponds to configurations that are not even executable. Further, although many of the configurations perform well, and a few even exceed the performance of the best on a particular kernel, there is a significant tail to the histogram, indicating that performance can often be very poor. On average, these configurations achieve 64% of the performance of the best configuration. This indicates that the best performing configuration cannot be learned for one kernel and used for other kernels.
VI. AUTO-TUNING STRATEGY A. Predicting Data Loading Technique
The use of local and texture memory as well as vectorization, described in Section IV, all optimize how data is loaded from the GPU's memory hierarchy. We simplify the optimization space by only allowing data to be loaded in one of four ways: from global memory without vectorization, Figure 10 : Speedup distribution of the best configuration for each kernel on all other kernels from global memory with vectorization, from local memory, or from texture memory. As mentioned previously, we refer to this choice as the data loading technique. We determine this technique first because it has the largest impact on performance.
The impact of the data loading technique depends to a large extent on the memory access patterns of the stencil computation. Thus, we hypothesize that the optimal data loading technique can be predicted using static program features, such as stencil size. Figure 11 shows a histogram of the stencil sizes for which each data loading technique is best for our 102 kernels. There is a general transition from using global memory with vectorization to texture memory to global memory without vectorization and then finally to local memory as the stencil size (i.e., the number of input elements in the stencil) increases. This suggests that this (and possibly other) static features can be used to predict the optimal data loading technique. Since the mapping of static program features to optimal data loading technique is likely complex and platformspecific we do not attempt to model it analytically. Instead, we use supervised machine learning [23] to predict the optimal data loading technique. We train a random forest [3] machine learning model to this end. The inputs to this model are stencil size and dimensionality of a kernel as well as which dimension, if any, differs from the rest if the kernel is asymmetric. The output is the predicted optimal data loading technique for the kernel.
B. Grouping and Search Heuristics
Our auto-tuning heuristics further reduce the optimization space by breaking up the remaining optimizations into groups that are explored independently of each other. The values of the optimizations in each group are exhaustively explored, that is, the kernel is run for every possible value, while keeping the values of the other optimizations fixed. We refer to this process as tuning of the optimizations in question. Once the best values are found for these optimizations, they are fixed and the optimizations in the next group are tuned.
This approach has two key challenges. The first is how to group the optimizations, given that they are not independent of one another, as shown in Section V-C. We choose groups that minimize dependencies across groups. The second challenge is how to order the tuning of the groups. We tune the groups in order of their expected performance impact. We then mitigate the impact of group order by repeating the entire sequence of groups n times. The choice of n is important. If too small, then the heuristics will not have enough iterations to reach a local minima. If n is too large, the heuristics may evaluate extra configurations needlessly and take longer to run. We arbitrarily choose n = 2; more intelligent exploration of this parameter is left for future work.
We explore two complementary ways of grouping the optimizations, by dimension and by optimization, resulting in two possible search heuristics. In the first, all optimizations are explored for each dimension independently. In the second, each optimization is explored independently for all dimensions.
1) Group by Dimension:
This heuristic considers all of the optimizations that correspond to a particular dimension at once, regardless of the optimization they perform. Specifically, it: 1) Sets WY=1, WZ =1, BY=1, BZ=1, CY=1, and CZ=1 2) Tunes WX, BX, and CX.
3) Tunes WY, BY, and CY. 4) Tunes WZ, BZ, and CZ. 5) Repeats steps 2-4 n times. By tuning those optimizations affecting the x dimension first, it is given the highest priority and thus has the largest impact. This is done because the innermost dimension is the one that impacts memory coalescing and most significantly benefits from caching. Since larger values are usually better for the work-group size in the dimension of coalescing, the initial values of WY and WZ are set to one so as to allow WX to be as large as possible. Similarly, BY, BZ, CY, and CZ are set to one initially to allow BX and CX to explore as large a range as possible before encountering significant register pressure.
2) Group by Optimization: This heuristic considers all of the optimizations that correspond to a particular optimization at once, regardless of which dimension they affect. Specifically, it: 1) Sets BX=1, BY=1, BZ=1, CX=1, CY=1, and CZ=1. 2) Explores WX, WY, and WZ while keeping BX, BY, and BZ fixed and while keeping the following products constant: WXxCX, WYxCY, WZxCZ. 3) Tunes BX, BY and BZ. 4) Tunes CX, CY, and CZ. 5) Repeats steps 2-4 n times. By tuning the work-group size first over the thread merging factors, we prioritize overall parallelism level over register usage. Similarly, rather than holding CX, CY, and CZ constant in step 2, W i xC i for i=X, Y, and Z is kept constant, thereby allowing the Cs to vary in a controlled manner. This product corresponds to the amount of work assigned to each work-group, while the Cs correspond to the amount assigned to each work-item. It is likely that the amount of work assigned to each work-group has a bigger impact on the total amount of available parallelism and thus this heuristic once again prioritizes this overall parallelism level over register usage.
Finally, we tune the block merging factor before the cyclic merge factor to ensure that there is as much register availability for block merging as possible because it is usually the more efficient way of merging threads as it allows for removal of duplicate reads through register sharing.
C. Overall Strategy
Our overall strategy when auto-tuning a kernel is: first use the machine learning model to predict the optimal data loading technique for the kernel then apply our heuristic while fixing the data loading technique to the value predicted by the model. We evaluate two alternative heuristics and the optimal data loading technique can vary between them. Thus, we train a separate model for each heuristic.
VII. EVALUATION A. The Stencil Kernels
We desire a large, representative set of stencil kernels to both test our heuristics and train our machine learning models. Since there is no publicly available, large benchmark suite of such kernels, we generate a set of synthetic kernels in such a way that their variation is controlled [4] and they are representative of a wide range of stencil computations.
We generate five stencil subtypes: dense, star, diamond, no-corners, and thumbtack, as depicted in Figure 2 . For each subtype, we consider 1D, 2D, and 3D stencils variants. All the stencils act on 3D input and output arrays consisting of 256 elements in each dimension. Their radii range from 0 to 5. We generate 102 different kernels in this way, as summarized in Table II. All of the kernels perform a weighted sum of their input elements using weights that are randomly chosen during program generation. These weights are thus known at kernel compile time. We assume that boundary data is copied into place beforehand. Finally, all the kernels are run with single-precision floating point input data that is randomly generated.
The optimizations are hand-coded into the kernels and are parameterized for the different configurations using preprocessor directives. Thus, before compiling a kernel, it is possible to select what optimizations to apply by setting various define statements. This allows optimizationrelated parameters to be known at compile time, allowing the compiler to perform as much simplification as possible. The auto-tuner copies the input data from the host to the GPU before the first configuration is tested. If a desired configuration has already been run its binary is retrieved from a cache of past configurations, otherwise the auto-tuner sets various optimization-related parameters in the code and compiles the kernel using the OpenCL API. If the kernel can be launched on the device, the OpenCL binary is run on the GPU four times and the average runtime of the last three runs is recorded.
B. The Auto-Tuner
Our approach requires a recompile for every unique optimization configuration, even if configurations differ only in runtime parameters such as the work-group size. This allows us to make these values compile-time constants rather than getting them at runtime using OpenCL API functions. This maximizes the compiler's ability to optimize by constant folding, constant propagation and dead code elimination.
For machine learning training, we run each heuristic with each data loading technique. We then train and test a model for each heuristic using leave-one-out cross-validation [23] to ensure that no kernel was part of the training set used to construct the model used to test it.
C. Reference Approaches
We compare our heuristics to 3 reference approaches: 1) Random Sampling (Rand): This approach corresponds to running the first stage of the interestingness experiment described in Section V, i.e. running 1500 configurations, and selecting the best performing one.
2) Expert Search (ExpS): This approach represents the one used by most existing stencil auto-tuning work in the literature (described in Section VIII). It performs exhaustive, empirical auto-tuning but only on a restricted subset of the space chosen by an expert. We use our experience to restrict the space as follows: VX must be less than or equal to 4, WX must be greater than or equal to 32, WY*CY must be less than or equal to 4, and WZ*CZ must be less than or equal to 4.
3) Oracle: There is an oracle-based approach for each of our two heuristics. Each consists of running the heuristic with every data loading technique and then taking the best configuration found across all of them. These oracles effectively give upper bounds on the speedups we can achieve with our machine learning models.
D. Evaluation Platform and Metrics
Our results are gathered on an Nvidia GTX Titan GPU forced to its highest performance state to ensure that the results are not skewed by power saving features. The host is an Intel i7 960 with 6GB of RAM running Ubuntu 12.04 and Nvidia driver version 325.15. This version of the Nvidia OpenCL compiler does not remove duplicate reads introduced by block merging. Thus, we fix BY and BZ to one for this evaluation (see Section V). The machine learning models are created using Weka 3.6 [10] with its default random forest parameters.
Each of the machine learning models is assessed using two metrics: its absolute accuracy, i.e., on what fraction of the kernels it predicts the correct data loading technique, and the average speedup of the heuristic using the model relative to its corresponding oracle, averaged over all kernels. This speedup reflects the penalty of a mispredicted loading technique on performance.
The performance of the heuristics is also assessed with two metrics. The first is the runtime of the best configuration discovered by the heuristic. This time is normalized by calculating the speedup with respect to the best configuration found by Rand. The geometric mean of speedup is used to report average speedup across the kernels. The second metric is the time the heuristic takes to get to that configuration, measured by the number of configurations examined and the runtime of these configurations on the GPU. The runtime of each configuration corresponds to a single run of a kernel, i.e. a single iteration of the time loop.
E. Results
The speedups, number of unique configurations, and GPU runtimes (in seconds) for all heuristics and reference approaches, averaged over all kernels, are shown in Table III.  The table also shows "best count", the number of times each strategy found the best performing configuration amongst those explored. Note that the best counts for the oracles are necessarily greater than or equal to that of our strategies as they look at a superset of the configurations of our strategies. Further, the best count column total for the baselines and the oracles is slightly more than 102 because for some kernels more than one strategy finds the best configuration. A number of observations can be made from this table. First ExpS outperforms Rand both in speedup and in best count. It explores significantly more configurations in order to do so, but requires less device-side runtime as these configurations are generally better performing. Secondly, the machine-learning-based heuristics are able to achieve higher speedups than both of those reference approaches while considering significantly fewer configurations and requiring less device-side runtime. In particular the dimensions heuristic considers 80% fewer configurations than Rand (296 vs. 1490) while finding ones that are 7% better on average and considers 84% fewer configurations than ExpS (296 vs. 1800) while finding ones that are 3% better on average. Table IV gives some insight into why this is the case. It summarizes the speedups and best counts achieved by our heuristics for each data loading technique (i.e., without the use of the machine learning models). None of the individual loading techniques is best all of the time, and none of them has a speedup as large as Rand. However, Rand rarely finds the best configuration of all of the strategies as evidenced by its best count of 2. Conversely, the individual data loading techniques, although worse on average, have much higher best count values; when they do well, they do very well. By combining the strengths of all of these data loading techniques, the machine-learning-based strategies are able to achieve best counts of 18 and 53 for the dimensions and optimizations heuristics respectively. As a result, they are able to achieve greater average speedups than Rand and ExpS with fewer configurations.
Comparing the dimensions and optimizations heuristics, we see that the optimizations approach is able to achieve a substantially higher speedup (13% vs. 7%) but at the cost of more than three times the number of configurations and significantly more runtime. Thus, we believe that the dimensions heuristic provides the better tradeoff between the various metrics. With that said, which of these approaches is best could vary from one user to the next depending on their particular requirements for speedup and tuning time.
The accuracies of the machine learning models are 73% and 80% for the dimensions and optimizations heuristics respectively. Relative to the oracles, their heuristics achieve 97% and 98% of the maximum possible performance. These values indicate that although the models do not always predict correctly, when they are incorrect, their mistakes have little impact on performance.
F. Restricting the Optimization Space
We can use expert knowledge of the optimization space to further limit the exploration performed by the heuristics and thus reduce the number of configurations they consider. We force the work-group size in the x dimension to be at least 16 and the work group sizes in the y and z dimensions to be at most 32. This ensures we exploit memory coalescing and disallows a total work-group size over 1024, which is rarely beneficial. We can afford to impose looser restrictions than ExpS because our strategy has already significantly reduced the optimization space size. The results of applying these restrictions to our dimensions heuristic are shown in Table V . There is a significant reduction in the number of configurations considered and an even more significant reduction in runtime. This is because the excluded configurations are generally the worst performing ones. Speedup remains unchanged because the new heuristic is not able to explore any configurations that the old one could not.
This shows that our strategy complements expert search approaches as the two strategies can effectively be used together.
VIII. RELATED WORK
There is a large body of work dealing with automatic performance tuning of stencil codes on both multicores and GPUs. Some of this work considers an expert-chosen subset of the optimization space and runs a program for all combinations of optimizations in this space [5, 6, 14, 15, 18] . Others use non-exhaustive strategies but focus only on a single optimization [9, 11, 12, 19, 24] . In contrast, we consider multiple optimizations concurrently and develop a strategy to prune the entire optimization space with no need for expert choices, taking into account the interaction amongst the optimizations. Our approach complements this existing work in that when combined with expert knowledge, our strategies improve further. Finally, most of the existing work is only applied to a small set of programs, albeit sometimes across different inputs [2, 17] . In contrast, we apply our work to a large number of programs that are synthetically generated to have a wide range of stencil patterns and sizes.
There are several approaches that use statistical and machine learning models to partition and prune the optimization configuration space; e.g., recursive partitioning regression trees [13] and kernel canonical correlation analysis [7] . In contrast to our work, these approaches require a large set of samples of the optimization configurations to achieve high prediction accuracy.
Finally, there are auto-tuning frameworks [1, 5] that can be used to auto-tune stencil computations. Our work complements these frameworks. Indeed, our heuristics could be implemented as strategies within them. 
IX. CONCLUSIONS
We presented a strategy for automatic performance tuning of stencil computations on GPUs when applying common optimizations. Our strategy uses machine learning to determine the best approach to memory loading and then explores the remaining optimization in groups. We presented two possible ways of grouping the optimizations so as to reduce inter-group dependencies. We synthetically generated 102 OpenCL kernels and evaluated our strategy both in terms of the number of configurations explored during auto-tuning and in terms of the quality of the best configuration obtained. We showed that our best heuristic achieves a reduction in the number of configurations explored relative to random sampling (by 80%) and expert search (by 84%) strategies while also finding better performing configurations.
There are a number of directions for future work. First some of the limitations of this work can be addressed. For example, optimizations that target the time loop or stencil code with different computational characteristics can be considered. Second, machine learning models other than random forests may be explored. Finally, the performance of our auto-tuning strategy can be evaluated on GPU platforms other than those of Nvidia.
