ABSTRACT
Determining the best set of optimizations to apply to a kernel to be executed on the graphics processing unit (GPU) is a challenging problem. There are large sets of possible optimization configurations that can be applied, and many applications have multiple kernels. Each kernel may require a specific configuration to achieve the best performance, and moving an application to new hardware often requires a new optimization configuration for each kernel.
1. INTRODUCTION
The utilization of parallel programming on architectures such as the graphics processing unit (GPU) leads to significant speedups over sequential CPU implementations for many algorithms. In particular, there is a large body of work utilizing NVIDIA's CUDA architecture to obtain speedups using GPUs, e.g., in audio/video processing, medical imaging, and black hole simulation [18].
In order to achieve the best performance possible, we may need to apply particular code transformations, such as loop tiling, unrolling, permutation, and fusion/fission.
However, constant tweaking is required to determine which transformations give the best performance, and the code resulting from these transformations is brittle. A kernel may be optimized for the current architecture, but the same code might give poor performance on a different architecture.
In this paper, we use HMPP Workbench, a directive-based compilation framework targeted to GPUs. The tool allows us to insert pragmas to transform sequential code and generate CUDA and OpenCL code that can potentially match or exceed the performance of manually tuned GPU kernels. A minimal example of this directive-based style is shown below.
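The following sketch marks a simple C loop for GPU code generation. The directive spelling follows published HMPP examples ('codelet' and 'callsite' directives), but the exact clauses can vary by HMPP version, and the saxpy function is our own illustration rather than code from this paper:

#pragma hmpp saxpy codelet, target=CUDA, args[y].io=inout
void saxpy(int n, float alpha, float x[n], float y[n])
{
    /* HMPP turns this loop into a GPU kernel; without HMPP the function
     * still compiles and runs as plain sequential C */
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

int main(void)
{
    enum { N = 1 << 20 };
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* the callsite directive dispatches this call to the generated kernel */
    #pragma hmpp saxpy callsite
    saxpy(N, 3.0f, x, y);
    return 0;
}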
To find a good set of transformations to apply to input code on a given architecture, we use auto-tuning, a process by which we optimize the code with several applicable transformation configurations and then pick the optimized version of the code with the best performance. Specifically, we use auto-tuning with HMPP transformations to convert programs written in sequential C code into parallel CUDA/OpenCL code that gives improved performance over the default HMPP configuration. In this work, we use this approach to obtain optimized versions of 2D and 3D convolution kernels, codes in the PolyBench [1] suite, and an implementation of belief propagation for stereo vision.

The main contributions of this paper are: (1) showing that auto-tuning applied to GPU kernels written in a directive-based language can be used to effectively parallelize and optimize a variety of codes, in many cases meeting or exceeding the performance of hand-written GPU programs, (2) showing how particular transformations affect performance and describing the best transformation configuration found for each kernel, and (3) showing that the best transformations are kernel- and architecture-specific.
We give a background of GPU computing in Section 2, describe directive-based programming and the HMPP Workbench in Section 3, discuss our experimental setup in Section 4, show optimized HMPP results on 2D/3D convolution, PolyBench kernels, and belief propagation for stereo vision in Section 5, describe related work in Section 6, and discuss conclusions and future work in Section 7.
2. GPU COMPUTING
The graphics processing unit (GPU) has recently evolved from specialized hardware into a powerful co-processor to the CPU. The most powerful GPUs can perform more FLOPS and have greater memory bandwidth than the most powerful CPUs [28]. The development of the CUDA and OpenCL programming environments allows the programmer to run multiple threads in parallel to utilize the parallel processing power of the GPU without the use of a graphics API.

[Figure 1: HMPP loop transformations; for each, the codelet on the left shows the code with the pragma, and the codelet on the right shows the resulting transformation: (a) 'contiguous' unroll, (b) 'split' unroll, (c) tiling.]
In the CUDA/OpenCL environment, the parallel threads are organized into 1D, 2D, or 3D structures called thread blocks, and each thread block is placed in a 1D or 2D grid of thread blocks. The number of threads in a thread block is limited to a value determined by the particular GPU. Threads within a thread block execute on the same GPU multiprocessor as part of a 32-thread chunk known as a warp, have access to a common space of fast, on-chip shared memory, and can synchronize with each other.
The programming guide from NVIDIA [28] suggests various optimizations when programming on the GPU, including adjusting the thread block size to increase multiprocessor occupancy, using registers and shared memory rather than high-latency global memory for data accesses, and organizing the threads in a way that favors coalesced global memory accesses.
3. DIRECTIVE-BASED GPU PROGRAMMING
In this work, we use directive-based GPU programming with the HMPP Workbench from CAPS Entreprise to generate GPU code and explore a large space of possible optimizations. The HMPP compiler framework is shown in Figure 2. This tool allows the programmer to generate CUDA/OpenCL kernels from loops written in sequential C code via pragmas placed before the kernel and in the command to run the kernel [11]. A project to make HMPP directives an open standard called OpenHMPP [8] is underway.
There are benefits to developing applications in a high-level directive-based language as compared to developing in low-level languages such as OpenCL and CUDA. For example, high-level languages preserve the serial implementation of the code: the focus is on highlighting the parallelism in the code as opposed to developing a platform-specific implementation of the parallelism. A particular advantage of a directive-based approach is that it eases the interaction between domain scientists and programmers. However, programming with high-level languages may mean a loss of performance when compared to programming in low-level languages.
In addition to pragmas to specify parallelization, HMPP provides pragmas to drive code transformations for optimization. These include pragmas for loop permutation, where the loops are re-ordered; tiling/unrolling of loops at any factor with varying parameters; and fusion/distribution of computations in loops. Figure 1 shows the input and output for tiling and loop unroll transformations using the 'contiguous' and 'split' unroll schemas with factor 2. When using the 'contiguous' option to unroll a loop that sequentially steps through an array of size N, the array accesses in the first iteration of the unrolled loop are to indices 0 and 1. When using the 'split' unrolling schema, the accesses in the first iteration are to indices 0 and N/2, which helps maintain memory coalescence across threads. To determine the behavior of the remaining iterations that occur when the unroll factor does not perfectly divide the array, HMPP provides 'remainder' and 'guarded' pragmas. The default 'remainder' option generates a remainder loop that is processed after the unrolled loop, while the 'guarded' option inserts guards in the unrolled loop so only the intended computations are run in the final iteration. A plain-C sketch of the two unroll schemas is given below.
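The following sketch writes out by hand the effect of the two unroll schemas at factor 2, matching the description above; both schemas are shown one after the other for comparison, and the loop body is illustrative rather than HMPP output:

void unroll_demo(int N, const float *in, float *out)
{
    int i;

    /* 'contiguous' unroll by 2: the first iteration accesses indices 0
     * and 1. Adjacent elements fall into the same iteration, enabling
     * data reuse across the unrolled statements, but when iterations map
     * to adjacent GPU threads the per-thread accesses are strided and no
     * longer coalesced. */
    for (i = 0; i + 1 < N; i += 2) {
        out[i]     = 2.0f * in[i];
        out[i + 1] = 2.0f * in[i + 1];
    }
    for (; i < N; i++)        /* 'remainder' option: trailing cleanup loop */
        out[i] = 2.0f * in[i];

    /* 'split' unroll by 2: the first iteration accesses indices 0 and
     * N/2, so consecutive iterations (threads) still touch consecutive
     * addresses within each half, preserving coalesced accesses. A
     * 'guarded' variant would instead predicate the extra statements in
     * the final iteration rather than emit a cleanup loop. */
    for (int j = 0; j < N / 2; j++) {
        out[j]         = 2.0f * in[j];
        out[j + N / 2] = 2.0f * in[j + N / 2];
    }
    if (N % 2)                /* odd N: element N-1 needs remainder handling */
        out[N - 1] = 2.0f * in[N - 1];
}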
In addition to CAPS, Cray and PGI offer directive-based GPU programming. Also, the cuda-lite framework [38] shares some similarities with the HMPP Workbench, since both utilize code annotations to simplify the process of optimizing GPU code. However, the cuda-lite framework operates at a lower level, with the annotations placed within a CUDA kernel, while the pragmas in directive-based GPU programming are placed around loops in sequential code.
4. EXPERIMENTAL SET-UP
Our experiments explore the use of HMPP code transformation pragmas to perform auto-tuning to optimize GPU kernels written using directives in a high-level language. Using HMPP directives, we can perform auto-tuning over a large optimization space on CUDA and OpenCL kernels. In this work, we focus on loop permutation, loop unrolling, loop tiling, and specifying which loops to parallelize.
We start with optimizing 2D and 3D convolution kernels, go on to optimize the codes in the PolyBench suite, and finally work on optimizing belief propagation for stereo vision. In each of the codes, we apply 'contiguous' and 'split' loop unrolling using the 'remainder' and 'guarded' options for the remainder loop. We also explore tiling and permutation as appropriate and look at which loops are best to parallelize. The 'contiguous' and 'split' unrolling schemas, along with the tiling schema, are described and shown in Section 3.
The transformed codes are generated using a Python script that contains a list of unroll, tiling, permutation, and parallelization transformations to use for a particular kernel. Running the script generates and compiles codes with every combination of input transformations. In order to generate the annotated codes, the script uses a file with the source for the code with specified locations to place each pragma for code transformation. Figure 3 shows a portion of this file for the PolyBench GEMM kernel; a sketch of its structure is given below. The code contains a place to put a pragma to permute the loops as well as places to insert pragmas for unrolling/tiling each loop and to specify whether or not to parallelize each target loop. We then run each transformed program to determine the best input configuration.
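The sketch below conveys what such an annotated GEMM source looks like. The %(...) placeholder markers and their names are our own illustration of the idea, not the script's actual tokens; the script substitutes a concrete pragma (or nothing) at each slot:

/* GEMM loop nest with transformation-pragma slots (cf. Figure 3). Each
 * slot is either left empty or filled with one HMPP codegen pragma, e.g.
 * a permutation, unroll, tile, or parallelization directive. */
void gemm(int ni, int nj, int nk, float alpha, float beta,
          float A[ni][nk], float B[nk][nj], float C[ni][nj])
{
    /* %(PERMUTE): optional loop-permutation pragma for the nest */
    /* %(PAR_1) %(UNROLL_OR_TILE_1): 1st loop */
    for (int i = 0; i < ni; i++) {
        /* %(PAR_2) %(UNROLL_OR_TILE_2): 2nd loop */
        for (int j = 0; j < nj; j++) {
            C[i][j] *= beta;
            /* %(UNROLL_OR_TILE_3): 3rd loop; e.g. the best found CUDA
             * configuration (Table 3) unrolls this loop with the
             * 'split' and 'guarded' options at factor 3 */
            for (int k = 0; k < nk; k++)
                C[i][j] += alpha * A[i][k] * B[k][j];
        }
    }
}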
The measured run-times include the running time of each kernel and the overhead involved in calling the kernel (possibly multiple times), but not the time to transfer data between the CPU and the GPU. All the computations used for auto-tuning are performed on 32-bit floating-point values and utilize thread block dimensions of 32 x 8 to allow for full GPU multiprocessor occupancy.
5. EXPERIMENTAL RESULTS
In this section, we present our auto-tuning HMPP results on 2D and 3D convolution, the PolyBench benchmark suite, and belief propagation as applied to stereo vision. In many of the results, the auto-tuned HMPP codes meet or exceed the performance of hand-written CUDA and OpenCL versions of the codes.
5.1 2D Convolution
[Figure 4: Code for the 2D convolution kernel.]
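A representative form of the Figure 4 kernel is the following 3x3 stencil, in which each output point is a weighted sum of its neighborhood; the coefficient values and exact loop bounds here are illustrative assumptions, not the verbatim benchmark code:

void conv2d(int ni, int nj, float A[ni][nj], float B[ni][nj])
{
    /* illustrative 3x3 stencil coefficients */
    const float c11 = +0.2f, c12 = -0.3f, c13 = +0.4f,
                c21 = +0.5f, c22 = +0.6f, c23 = +0.7f,
                c31 = -0.8f, c32 = -0.9f, c33 = +0.1f;

    /* by default, HMPP parallelizes this two-deep loop nest */
    for (int i = 1; i < ni - 1; i++)
        for (int j = 1; j < nj - 1; j++)
            B[i][j] = c11 * A[i-1][j-1] + c12 * A[i-1][j] + c13 * A[i-1][j+1]
                    + c21 * A[i  ][j-1] + c22 * A[i  ][j] + c23 * A[i  ][j+1]
                    + c31 * A[i+1][j-1] + c32 * A[i+1][j] + c33 * A[i+1][j+1];
}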
First, we look at optimizing the 2D convolution kernel shown in Figure 4 using code transformations in HMPP. Our experiments are performed using CUDA and OpenCL with arrays of size 4096 x 4096. In our initial experiment, we reverse the loop order via permutation and find that the alternate loop order significantly increases the runtime, causing us to keep the initial loop order. Also, experiments involving loop unrolling showed that using the 'guarded' option for remainder behavior gave similar or better results than the 'remainder' option, causing us to use the 'guarded' option in all ensuing experiments on this kernel.

[Figure 5: Speedup using 'contiguous' unroll, 'split' unroll, and tiling on the 2D convolution kernel.]

Our next experiments look at tiling and unrolling the inner loop at factors 1-8, using both the 'contiguous' and 'split' schemas for unrolling. The results for these experiments are shown in Figure 5. The CUDA results show that applying 'contiguous' unrolling or tiling with factor 2 gives the best results, but the speedup corresponding to these schemas drops off at larger factors, while 'split' unrolling continues to give a speedup at larger factors. The pattern is a little different in the OpenCL results: applying 'contiguous' unroll or tiling with factor 2 gives some speedup, but the best speedup occurs with an unroll factor of 8 using the 'split' schema.
The likely explanation for the run-time improvement when using a 'contiguous' unroll or tiling factor of 2 is that the convolution kernel is structured such that contiguous loop unrolling/tiling allows particular data to be shared across unrolled iterations, decreasing the number of memory accesses. However, contiguous unrolling weakens the capability to perform coalesced memory accesses, particularly at higher unroll factors, causing a drop-off in performance. Meanwhile, the 'split' unroll schema does not provide data sharing between iterations, but does allow the kernel to maintain coalesced memory accesses, preventing this performance drop-off.
The results of the best found configurations for this kernel are also shown alongside the results for other kernels in Figures 8 and 9 and are described in Table 3.
5.2 3D Convolution
Next, we look into optimizing the 3D convolution kernel shown in Figure 6 using 256 x 256 x 256 arrays.
Our initial experiments use each of the 6 possible loop orderings via permutation without any tiling or unrolling. Loop order can influence memory coalescence on the GPU, and memory coalescence or the lack of it can be a major factor in the run-time of a GPU kernel. To simplify the discussion of loop order, we refer to the outer 'i' loop as loop '1', the middle 'j' loop as loop '2', and the inner 'k' loop as loop '3'.
In this kernel, there are three nested loops which can theoretically be parallelized, but HMPP only supports parallelizing up to two nested loops. By default, the first two loops are parallelized, and a non-optimal loop permutation order can lead to non-coalesced memory accesses and a longer run-time.
[Figure 6: Code for the 3D convolution kernel.]
The results of our initial experiments are shown in Figure 7 and show a significant speedup over the default (1,2,3) configuration in the permutations where the inner parallelized loop is loop '3' in the initial code.
Next, we keep the default (1,2,3) loop ordering and use an alternate method to determine loop parallelization, inserting pragmas to set the first loop to 'noParallel' and the subsequent two loops to 'parallel', causing the final two loops to be parallelized while the first loop remains sequential. The run-time of this configuration is shown alongside the results of each permutation in Figure 7; the CUDA and OpenCL results using this configuration are slower than the (1,3,2) and (2,3,1) permutations but faster than the other permutations. A sketch of this pragma placement is shown below.
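In the sketch below, the noParallel/parallel names come from the text above, while the '#pragma hmppcg' spelling and the stencil body are illustrative assumptions:

void conv3d(int ni, int nj, int nk,
            float A[ni][nj][nk], float B[ni][nj][nk])
{
    #pragma hmppcg noParallel
    for (int i = 1; i < ni - 1; i++) {          /* loop '1': kept sequential */
        #pragma hmppcg parallel
        for (int j = 1; j < nj - 1; j++) {      /* loop '2': parallelized */
            #pragma hmppcg parallel
            for (int k = 1; k < nk - 1; k++) {  /* loop '3': parallelized; k
                                                   indexes the contiguous
                                                   dimension, so mapping it to
                                                   threads keeps global memory
                                                   accesses coalesced */
                /* illustrative 6-point stencil body */
                B[i][j][k] = 0.125f * (A[i-1][j][k] + A[i+1][j][k]
                                     + A[i][j-1][k] + A[i][j+1][k]
                                     + A[i][j][k-1] + A[i][j][k+1]);
            }
        }
    }
}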
Additional experiments explore tiling and unrolling each loop at various factors in the best permutations. In the CUDA experiments, the speedup of the (1,3,2) permutation goes from 21.2x to 27.2x when unrolling loop '3' by a factor of 4 using the 'contiguous' schema and 'guarded' remainder option. This represents the best found CUDA speedup. In the OpenCL experiments, the best found configuration uses the (2,3,1) permutation without any unrolling or tiling. The speedup of this configuration is 22x over the default. The results of these configurations are shown alongside other codes in Figures 8 and 9 and described in Table 3.
5.3 The PolyBench Benchmark Suite
The PolyBench collection of benchmarks contains codes for linear algebra, linear algebra solvers, data mining, and stencils, all with static control parts [1]. We convert a number of codes in this benchmark suite to CUDA/OpenCL using HMPP pragmas. These codes are described in Table 2. For the ATAX and BICG codes, we manually performed loop distribution on the inner loop nests before running the experiments to allow greater parallelism than in the initial code structure; a sketch of this transformation for ATAX is shown below. We will automate this process in future work via the 'distribute' HMPP pragma.
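The sketch below shows the distribution on ATAX, assuming the standard PolyBench form of the kernel (y = A^T (A x)); it is our illustration of the manual step, not the paper's exact code:

/* In the fused form, a single outer i loop both builds tmp[i] and
 * accumulates into y[], so the outer loop carries a dependence through y
 * and cannot be parallelized as a whole. Distribution splits the body
 * into two nests that can each be parallelized. Assumes y[] is
 * zero-initialized by the caller. */
void atax_distributed(int nx, int ny, float A[nx][ny],
                      float x[ny], float y[ny], float tmp[nx])
{
    /* nest 1: fully parallel over i */
    for (int i = 0; i < nx; i++) {
        tmp[i] = 0.0f;
        for (int j = 0; j < ny; j++)
            tmp[i] += A[i][j] * x[j];
    }
    /* nest 2: with its loops reversed by permutation (the best found
     * ATAX configuration in Table 3), it becomes parallel over j */
    for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i++)
            y[j] += A[i][j] * tmp[i];
}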
In the current set of experiments, we look at permutation, unrolling, tiling, and specifying which loop(s) to parallelize with the parameters shown in Table 1. An exhaustive search of all possible transformations within this space is not feasible, as many of these kernels contain a number of loops, so a sampling of this space is utilized. No particular formula determines the sampling, as every kernel is different; we try to include the transformations likely to give the best results based on the results of other kernels. In the future, we may be able to improve results by doing a more exhaustive 'local search' around the best optimizations found in these experiments. The array dimensions used in our experiments, the running time of a sequential CPU implementation using these dimensions, and the number of HMPP-transformed versions of each converted PolyBench code are given in Table 2. The cluster described in Section 4 is used for the CPU running times.
We constructed hand-written CUDA and OpenCL kernels to compare with the optimized HMPP results. These kernels are tuned to maximize parallelism and coalescence of memory accesses but are not heavily optimized. For these initial experiments, we used 32-bit float data. The resulting speedup of the best HMPP-transformed version of these codes over the default HMPP configuration and the hand-coded results are in Figures 8 and 9 alongside the convolution results. The default HMPP and best found configuration running times are shown in Table 2, and the best found HMPP configuration for each code is described in Table 3. The results show that both the CUDA and OpenCL default HMPP configurations are faster than CPU implementations on 12 of the 15 kernels, in many cases by a large margin. In addition, the best found HMPP configurations from auto-tuning beat the CPU result on all 15 kernels and are often significantly faster than the default HMPP configuration. Section 5.5 discusses these results in more detail.
Our hand-written GPU kernels as well as the default and best found optimized HMPP-annotated codes are distributed as the PolyBench/GPU 1.0 package, available at http://www.cse.ohio-state.edu/~pouchet/software/polybench/GPU.
5.4 Experiments Using Doubles
To see if we can generalize our auto-tuning results on 32-bit floats to 64-bit doubles, we take the best transformations found for floats and apply these transformations to the same codes using doubles. We also ran manual implementations of the kernels using doubles and normalized the results to the running time of the default HMPP configuration using doubles. The results when using doubles for CUDA and OpenCL are shown on the right of Figures 8 and 9. These results show that the configurations that give speedups on floats also give speedups using doubles on most of the kernels, though not always to the same extent. Specifically, the geometric mean of the speedup using these configurations over default HMPP on CUDA (OpenCL) is 2.73x (2.62x) using floats and 2.24x (2.24x) using doubles.
5.5 HMPP Auto-Tuning Results Discussion
Figures 8 and 9 show results from our auto-tuning experiments on the convolution and PolyBench codes using HMPP on a C2050 (Fermi) GPU. The running times of the sequential CPU, default HMPP, and best found HMPP-transformed configurations are given in Table 2. Each auto-tuned code beats the performance of the default code produced by the HMPP compiler, in some cases by a large margin. All the auto-tuned HMPP CUDA codes beat the default HMPP CUDA configuration by over 13 percent, with the speedup ranging from 1.135x for the SYRK kernel to 49.7x for the ATAX kernel. In addition, the auto-tuned kernels show a geometric mean speedup of 2.73x, compared with 2.02x for the hand-written kernels. When running the best auto-tuned configurations using 64-bit doubles, there continues to be a speedup on most of the codes; the geometric mean of the speedup is 2.24x over default HMPP.
The results are similar using OpenCL, though some of the speedups are lower, particularly on kernels where the best found configuration involves loop unrolling. One possible explanation for this is that CUDA codes are compiled with an Open64-based compiler in CUDA 4.0, while OpenCL codes are compiled with an LLVM-based compiler; the LLVM-based compiler may be doing a better job of optimizing some of the non-transformed HMPP codes. Still, auto-tuned HMPP targeting OpenCL gives a speedup over default HMPP on many of the codes. The geometric mean of the OpenCL speedup using the auto-tuned results is 2.62x over default HMPP using floats and 2.24x using doubles, respectively.

[Table 2: Kernel description, size of each array dimension, number of HMPP-transformed versions generated, and runtime (in seconds) of the sequential CPU implementation and of the default HMPP and best found HMPP-transformed configurations on the C2050 (Fermi) using floats on the 2D/3D convolution and parallelized PolyBench codes.]
Table 3: Best found HMPP transformation configuration for each code.

2DCONV
  CUDA:   Unroll 2nd loop using 'contiguous' and 'guarded' options with factor 2
  OpenCL: Unroll 2nd loop using 'split' and 'guarded' options with factor 8
2MM
  CUDA:   Unroll 3rd and 6th loops using 'split' and 'guarded' options w/ factor 3
  OpenCL: Unroll 3rd and 6th loops using 'split' and 'guarded' options w/ factor 4
3DCONV
  CUDA:   Permute loop-nest order from (1,2,3) to (1,3,2); unroll 2nd loop using 'contiguous' and 'guarded' options w/ factor 4
  OpenCL: Permute loop-nest ordering from (1,2,3) to (2,3,1)
3MM
  CUDA:   Unroll 3rd, 6th, and 9th loops using 'split' and 'guarded' options w/ factor 3
  OpenCL: Unroll 3rd, 6th, and 9th loops using 'contiguous' and 'guarded' options with factor 8
ATAX
  CUDA:   Reverse order of 2nd nested loop set (4th and 5th loops) using permutation and tile 1st and 2nd loops w/ factor 4
  OpenCL: Reverse order of 2nd nested loop set (4th and 5th loops) using permutation and tile 1st and 2nd loops w/ factor 2
BICG
  CUDA:   Reverse order of 1st loop set (2nd and 3rd loops) and tile 2nd, 3rd, 4th, and 5th loops w/ factor 2
  OpenCL: Reverse order of 1st loop set (2nd and 3rd loops) and tile 2nd loop w/ factor 4
CORR
  CUDA:   Parallelize 8th loop rather than 7th loop using noParallel/parallel pragmas and tile 9th loop w/ factor 4
  OpenCL: Parallelize 8th loop rather than 7th loop using noParallel/parallel pragmas and unroll 9th loop using 'contiguous' and 'remainder' options w/ factor 2
COVAR
  CUDA:   Parallelize 6th loop rather than 5th loop using noParallel/parallel pragmas and unroll 7th loop using 'split' and 'guarded' options w/ factor 4
  OpenCL: Parallelize 6th loop rather than 5th loop using noParallel/parallel pragmas
FDTD-2D
  CUDA:   Unroll 4th loop w/ factor 2 and 8th loop w/ factor 4, each using 'split' and 'guarded' options
  OpenCL: Unroll 4th and 8th loops using 'split' and 'remainder' options w/ factor 2
GEMM
  CUDA:   Unroll 3rd loop using 'split' and 'guarded' options with factor 3
  OpenCL: Unroll 3rd loop using 'contiguous' and 'guarded' options with factor 8
GESUMMV
  CUDA:   Tile 1st loop w/ factor 4 and tile 2nd loop w/ factor 2
  OpenCL: Tile 1st loop w/ factor 4 and tile 2nd loop w/ factor 3
GRAMSCHM
  CUDA:   Unroll 5th and 6th loops using 'split' and 'guarded' options w/ factor 3
  OpenCL: Unroll 5th and 6th loops using 'contiguous' and 'guarded' options w/ factor 4
MVT
  CUDA:   Tile 1st, 2nd, and 4th loops w/ factor 2
  OpenCL: Tile 1st loop w/ factor 2
SYR2K
  CUDA:   Unroll 5th loop using 'split' and 'remainder' options with factor 3
  OpenCL: Unroll 2nd loop using 'split' and 'guarded' options with factor 2 and 5th loop using 'split' and 'guarded' options with factor 4
SYRK
  CUDA:   Unroll 5th loop using 'split' and 'guarded' options with factor 2
  OpenCL: Unroll 2nd and 5th loops using 'split' and 'remainder' options with factor 2

The results in Table 3 show some patterns in which transformations should be applied. It is important to use the best loop permutation to maximize memory coalescence; for example, we get a speedup of over 20x compared to the default on the 3DCONV, ATAX, and BICG kernels by using permutation. The programmer should also consider which loops are parallelized, because the default configuration may not be optimal: in the CORR and COVAR kernels, we get over 2.5x speedup by specifying the loops to parallelize. For loop unrolling, the results show that applying unrolling to the innermost loop in a loop nest often gives the best speedup. In the 2DCONV, 2MM, 3MM, GEMM, SYR2K, and SYRK kernels, the best configurations on CUDA and OpenCL involve loop unrolling on the inner loop of particular loop nest(s).
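As a concrete example, the best found 2DCONV CUDA entry above corresponds to annotating the inner loop of the Figure 4 kernel roughly as follows; the '#pragma hmppcg unroll' spelling is an assumption about HMPP's codegen directive syntax, and stencil3x3 is a hypothetical helper standing in for the 3x3 weighted sum:

for (int i = 1; i < ni - 1; i++) {
    /* unroll the 2nd (inner) loop with the 'contiguous' schema and
     * 'guarded' remainder handling at factor 2 (illustrative spelling) */
    #pragma hmppcg unroll 2, contiguous, guarded
    for (int j = 1; j < nj - 1; j++)
        B[i][j] = stencil3x3(A, i, j);  /* 3x3 weighted sum, as in Figure 4 */
}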
Still, the best found transformations vary across kernels and programming environments, showing a need to experiment with different optimizations and parameter values due to the different characteristics of each kernel and programming environment.
5.6 Experiments With Different Architectures
We go on to explore how the best found optimization configurations can vary across GPU architectures. We take the best found CUDA HMPP optimization configuration for each kernel on the C2050 (Fermi) GPU and run it on the NVIDIA GTX 280 (Tesla), which uses the GT200 architecture and has 240 cores, and the NVIDIA 9800 GT, which uses the G92 architecture and has 112 cores. The resulting runtime is compared with the running time of the best CUDA HMPP configuration found using auto-tuning on each architecture. The results for each kernel are shown in Figure 10.
Some of the configurations that worked well on the Fermi did not perform as well on the Tesla or 9800 GT. Taking the best found transformed kernels from the Fermi and running them on the Tesla or 9800 GT resulted in a slowdown compared to the default HMPP configuration on 4 of the 15 kernels on the Tesla and 5 of the 15 kernels on the 9800 GT. Still, some of the best optimization configurations, particularly permutation and parallelization of particular loops, did result in speedups across all architectures. For example, in the cases where permutation or parallelization is used in the best found optimization configuration on the Fermi, those configurations also result in a speedup over the default HMPP on the Tesla and 9800 GT GPU architectures. Still, the best optimization configuration found on a specific target architecture, T1, is often significantly faster than taking the best configuration from an alternate architecture, T2, and using it on T1. On the left of Figure 10, the geometric mean speedup over the default HMPP configuration across all kernels for the 'Best Fermi on Tesla' configurations is 1.69x, while the speedup of the auto-tuned 'Best Tesla' configurations is 2.39x. On the right of Figure 10, the geometric mean speedup of the 'Best Fermi on 9800GT' configurations is 1.34x, while the speedup of the auto-tuned 'Best 9800GT' configurations is 1.71x. These results show that the best transformations are not only specific to the kernel, but also to the GPU architecture.
5.7 Optimizing Belief Propagation
We go on to use HMPP to optimize belief propagation as applied to stereo vision. The input to the algorithm is a stereo set of two images, and the output is a computed disparity map between the images. In Figure 11, we show an image from the Tsukuba stereo set used in our experiments and the corresponding ground-truth disparity map.

We begin with the sequential code corresponding to the work by Felzenszwalb [12] available online. We adjust the implementation to maximize the number of coalesced memory accesses and add HMPP pragmas to run the program on the GPU. Each step of the algorithm is parallelized, as the computations in every step can be performed on each image pixel simultaneously. We profile the default parallel HMPP implementation and find that an iterative 'message-passing' step dominates the runtime, so we focus on optimizing that step; a sketch of this step is given below. The targeted code repeats for a given number of iterations and consists of message values corresponding to each disparity at each pixel being passed to neighbors and updated. Additional details about each step of the algorithm can be found in Felzenszwalb's work.
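The sketch below conveys the structure of one message-update pass using a naive O(D^2) minimization; the actual implementation uses Felzenszwalb's linear-time min-convolution [12], and the array layout and names here are our own simplified assumptions. The point of the sketch is the dependence structure: each pixel's update is independent, which is the parallelism HMPP exploits:

#include <float.h>
#include <math.h>

enum { D = 15 };  /* disparity levels, as in the Tsukuba experiments */

/* Update the message each pixel sends to its right neighbor from the data
 * cost and the messages arriving from its other three neighbors, using a
 * truncated-linear smoothness term. */
void update_right_messages(int w, int h, float lambda, float trunc,
                           float data[h][w][D],   float m_up[h][w][D],
                           float m_down[h][w][D], float m_left[h][w][D],
                           float m_right[h][w][D])
{
    for (int y = 0; y < h; y++) {        /* every pixel is independent, so */
        for (int x = 0; x < w; x++) {    /* these loops map to GPU threads */
            for (int d = 0; d < D; d++) {
                float best = FLT_MAX;
                for (int dp = 0; dp < D; dp++) {
                    float smooth = fminf(lambda * fabsf((float)(d - dp)), trunc);
                    float c = data[y][x][dp] + m_up[y][x][dp]
                            + m_down[y][x][dp] + m_left[y][x][dp] + smooth;
                    if (c < best) best = c;
                }
                m_right[y][x][d] = best;
            }
        }
    }
}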
Our experiments use the Tsukuba stereo set with image dimensions of 384 x 288 and 15 possible disparity levels. The implementation is run using a single level with 250 message-passing iterations, with the optimized implementation determined in the same manner as in Section 5.3 using the transformations shown in Table 1. The CUDA and OpenCL results of the initial and best optimized HMPP implementations are in Figure 12 alongside the results of a manual CUDA implementation developed by Grauer-Gray et al. [15]. The speedup is relative to the initial sequential CPU implementation. Each CUDA and OpenCL implementation performs at least seven times faster than the CPU implementation, with the optimized HMPP implementation giving a greater speedup. In the CUDA results, the manual implementation performs better than default HMPP but worse than our optimized HMPP implementation, showing how HMPP optimizations can improve the performance of a program to outperform a manual GPU implementation.
6. RELATED WORK
There are a number of library generators that automatically produce high-performance kernels, including FFT [13, 29, 36], BLAS [34, 37, 5, 14, 16], sparse numerical computation [19, 33, 26, 4, 23], and domain-specific routines [3, 7, 24]. Recent research [21, 22, 17] expands automatic code generation to routines whose performance depends not only on architectural features, but also on input characteristics. These systems are a step toward automatically optimizing code for different architectures. However, these prior works have been largely focused on small domain-specific kernels.
Automatic code generation and optimization for different architectures has been explored in many studies. The research related to loop optimizations includes work by Wolf et al. [35], Kisuki et al. [20], and Stephenson et al. [32] that looks at such optimizations on the CPU. Wolf et al. looked at how to best combine particular transformations on the architecture, Kisuki et al. focused on retrieving the optimal tiling size and unroll factor simultaneously, and Stephenson et al. used supervised learning to predict the optimal unroll factor of any loop. These works focused on optimizing for single-core CPUs; in our work, we optimize codes for many-core GPU architectures.
There is also related work on GPU code optimization by Ryoo et al. [31], who looked at a number of manual optimizations of CUDA kernels, including tiling, pre-fetching, and full loop unrolling within CUDA kernels. Liu et al. [25] looked at varying thread block dimensions and loop unrolling as optimizations in CUDA kernels. Baskaran et al. [2] did an extensive study of loop unrolling within CUDA kernels, but they were concerned with improving the performance of a single application.
Choi et al. [6] developed an auto-tuning framework for sparse matrix-vector multiplication on the GPU. Nukada et al. [27] optimized 3D-FFT of varying transform sizes on GPUs using auto-tuning. These two related works applied auto-tuning to specific applications.
Another related work is the CUDA-CHiLL project [30], which can translate loop nests to high-performance CUDA code. The developers of this project provide a programming language interface that uses an embedded scripting language to express transformations or recipes. Their work only supports CUDA transformations, and it is necessary to program the script, which is not as convenient as using a directive-based approach like HMPP.
Cui et al. ([10] and [9]) introduced the EPOD framework built on Open64, which can encapsulate algorithm-specific optimizations into patterns that can be reused in other similar codes. It targets multiple platforms, including multi-core CPUs and NVIDIA GPUs, and one of the works focuses on generating optimization scripts for BLAS3 GPU kernels. Our work differs in that it uses a directive-based tool where the transformations are placed in the program code rather than in an outside script, and it also shows results for specific transformations and on a greater variety of codes.
Our work differs from much of the previous GPU work because it does not require analysis and transformations of initial GPU kernels, instead utilizing directives placed in C code to apply the optimizations and generate the kernels. In addition, this work focuses on transformations at a higher level than presented in the related work, looking at transformations of the loop(s) to be parallelized rather than loops within the kernel.
7. CONCLUSIONS AND FUTURE WORK
The problem of determining the best set of optimizations to apply to a kernel to be executed on a GPU has been extensively studied. In this work, we developed optimized GPU kernels using auto-tuning with the directive-based HMPP Workbench. We found that we were able to significantly improve the running time over the default HMPP implementation as well as match or even exceed the performance of manually written CUDA/OpenCL code on kernels in the PolyBench suite and on the belief propagation application as applied to stereo vision. Using the code transformations available in HMPP, we were able to quickly generate different versions of code spanning a large space of optimizations.
In the future, we plan to extend this work by looking at more large applications and more optimizations, including distribution/fusion. We also plan to experiment with additional GPU architectures, including AMD GPUs, as well as other many-core architectures.
ACKNOWLEDGMENTS
This work was funded in part by the U.S. National Science Foundation through NSF CAREER award 0953667 and the Defense Advanced Research Projects Agency through the DARPA Computer Science Study Group (CSSG).
