Abstract-Graphics Processing Units (GPUs) are increasingly used to accelerate scientific applications. The state-of-the-art limits the adaptability of GPU kernels to both problem parameters and hardware characteristics. This makes writing high performance libraries for GPUs challenging. We address these challenges through Kernel Specialization (KS) which supports both user and hardware parameters and produces highly optimized GPU code. We apply KS to Particle Image Velocimetry (PIV), a technique used to obtain instantaneous velocity measurements in fluids for such diverse applications as aircraft design and artificial heart design. KS helps the user search PIV's highly non-linear design space, supports a wide range of PIV parameters, and results in improved acceleration times over existing kernels.
Ç

INTRODUCTION
T HE significant performance gains available with General Purpose computing on Graphic Processing Units (GPGPUs) has resulted in GPGPU applications for an increasing variety of problems. GPGPU applications often achieve peak performance when kernel implementations are carefully crafted to accommodate particular problem and GPU hardware characteristics. This leads to programming practices that limit the adaptability of kernel implementations in two ways. First, kernels cannot be easily adapted to similar problems that differ in a few particulars. Second, kernels cannot be easily adapted to different hardware characteristics. An application built for one GPU model can have limited performance on a newer model, a drawback compounded by the rapid evolution of GPU hardware. The state-of-the-art in both hardware and software tools makes writing libraries challenging, hinders code reuse, and limits the applicability of GPGPUs to a wider range of problem instances. Our research, using NVIDIA's CUDA programming environment, tackles these challenges with general techniques that can be easily extended to OpenCL.
Parameterization provides a mechanism for programming more adaptable GPGPU kernels. Parameters can be used to tune the kernel to particular problem instances and for characteristics of the current GPU, such as the number of threads per block and the amount of work assigned to each thread. Kernel implementations are thus made configurable by adjusting parameter values rather than rewriting code. Unfortunately, many important GPU optimizations, such as loop unrolling and register blocking, require static values at compile time. Register blocking, an optimization explained by Volkov [1] , involves doing more work per thread using more registers. To achieve these optimizations, it is common practice to hard code both problem and hardware parameters, limiting adaptability. In contrast, when parameter values are runtime evaluated (RE), performance suffers since many crucial performance optimizations cannot be applied by the compiler.
Instead of locking a kernel implementation into one of two regimes: higher performance without adaptability or adaptability with lower performance, we use kernel specialization (KS), a method that yields both high-performance and adaptability [2] , [3] . We combine adaptable kernel implementation techniques with a framework, GPU Prototyping Framework (GPU-PF), that dynamically compiles specialized kernels at application runtime. Specialization at runtime means late binding of problem and hardware specific parameters, specified as kernel compile time static values. Delaying specification of these variables until runtime enables compile time GPU optimizations that yield high performance. This approach yields adaptable CUDA C kernel source code so that the benefits of both adaptability and highly optimized code are achieved.
Kernel specialization improves the CUDA development experience because instantiating many versions of the same kernel for different parameter and data type combinations is avoided. This not only improves the maintainability and readability of GPU kernels, but also minimizes the size of program binaries, as binaries do not contain compiled versions of many possible versions of the kernel. Popular libraries such as OpenCV [4] and MathWorks Parallel Computing Toolbox (PCT) [5] ship or generate a large number of binaries for different problem instances.
Our technique complements related research that focuses on greater flexibility of GPU kernels. For example, autotuning [6] , [7] could easily be used with KS, enhancing both. Contributions of our research include a methodology to write kernels once and recompile for different parameters, along with an application framework that automates the process. This research:
1. Demonstrates improved adaptability of GPU kernels to different problems and architectures while offering good performance, 2. Employs optimizations such as loop unrolling and strength reduction in a runtime adjustable fashion, 3. Demonstrates reduced register usage, and 4. Combines kernel specialization with warp specialization. In [2] , we demonstrated the advantages of KS applied to sliding window applications and medical image reconstruction. In this work, we show the advantages when applying KS to Particle Image Velocimetry (PIV). PIV is used to obtain instantaneous velocity measurements in fluids and has diverse applications. It is used to design airfoils, helicopter rotors, wind energy rotors, artificial hearts, and to improve swim suits for Olympic swimmers. This research is the result of collaboration with the MIT Robot Locomotion Group (RLG) [8] where PIV is used to control a flap in liquid [9] . KS benefits PIV both by improving performance and in allowing more flexible problem implementations than were previously considered. KS combined with warp specialization supports the ability to apply PIV for real-time control.
The remainder of this paper is organized as follows. Section 2 gives an overview of approaches to GPU programming. Section 3 enumerates the benefits of our approach, explains adaptable kernel coding techniques and describes the runtime compilation framework. Section 4 describes the particle image velocimetry application used to demonstrate KS. Experiments and results for the PIV application appear in Sections 5, and 6. Results are presented for two different NVIDIA GPUs: a Tesla C1060 and a Tesla C2070. Section 7 describes related work and Section 8 draws conclusions and presents future areas of inquiry. More details of the experiments and results as well as detailed related work is available in the online supplement material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety. org/10.1109/TPDS.2014.2317721.
GPU PROGRAMMING FOR PERFORMANCE
GPU programming is distinct from general purpose programming in software. Here we focus on compiler optimizations, warp specialization, and memory hierarchies. For a more complete discussion of GPU programming see [10] .
While the approach discussed in this paper can be applied to other GPGPU languages such as OpenCL, all our experiments use CUDA and NVIDIA GPUs. CUDA provides a software environment for writing GPU kernels and specifies an abstract hardware target with large amounts of parallelism available in the form of a grid of thread blocks. CUDA code is first compiled and then run.
Taking full advantage of compiler optimizations is a key factor in programming GPUs for performance. For example, runtime evaluation of parameter values hurts performance significantly as compiler optimizations cannot be applied. Control code, such as loop bound evaluation, results in instructions that require each thread to either execute or wait, incurring a performance penalty known as thread divergence. A natural way to parallelize a loop is to assign the work for each loop iteration to a separate thread. For programs with nested loops, each thread may run one or more loop iterations. These inner loops can incur significant overhead in CUDA programs due to loop control evaluation [11] . Many applications fix loop counts at compile time, which improves performance by allowing the compiler to implement loop unrolling. However, this reduces the kernel's ability to adapt to different problem parameters. Other compile time optimizations also greatly improve performance. For example, integer division and modulus are expensive operations on a GPU and can be removed with strength reduction for operands that are powers of two. In addition, reduction trees are commonly found in parallel code and reductions on data lengths that are powers of two simplify control flow. When values are not known at compile time, such optimizations cannot be applied and performance suffers.
A recent CUDA programming technique, called warp specialization can also greatly impact performance. Many CUDA performance considerations center around grids, blocks, and threads [12] . Each thread is logically independent, but is run in a SIMD manner. The number of threads launched in parallel at runtime, known as a warp, is architecture specific. For all NVIDIA GPUs to date, the warp size is 32 threads, which means that parameter and data sizes that are multiples of 32 tend to run most efficiently. Introduced in the CudaDMA library [13] , warp specialization dedicates several warps to load data and the rest to compute, thus using both the memory hierarchy and compute resources simultaneously. We apply this technique because it makes better use of GPU resources.
Finally, to perform well, GPU programs need to take memory hierarchies into account. NVIDIA GPUs consist of memory hierarchies including off-die global memory, ondie shared memory, and registers. Shared memory in Fermi GPUs can be configured as L1 cache or user managed local memory; in earlier GPUs, shared memory was user managed. The NVIDIA memory hierarchy is inverted relative to CPUs, with more on-die memory dedicated to the register file than to shared memory or cache. When we refer to block-level memories, we include both shared memory and registers. Tradeoffs between shared memory and register usage can affect performance in ways that may be unintuitive to the programmer [1] . Writing parameterized code is complicated because registers cannot be directly addressed on existing NVIDIA GPUs. Register allocation must be fixed at compile time. Most approaches to improving performance focus on improving GPU occupancy, which involves maximizing the number of threads that are launched in parallel. Volkov [14] , [1] has shown that increasing instructionlevel parallelism (ILP) may provide increased performance over maximum occupancy. Note that higher occupancy may require more hardware resources and memory bandwidth for many active warps, while not improving overall throughput. Volkov's approach increases the use of registers per thread and can maximize the concurrent use of available hardware by taking advantage of register memory bandwidth. This is achieved by a technique called register blocking, which involves loop unrolling as well as packing the operations from multiple threads into one thread. This merging of threads is an explicit CUDA code transformation where fewer threads must each manage more work. Register blocking via unrolling loops and merging multiple threads helps to maximize ILP.
KERNEL SPECIALIZATION
Kernel specialization applies to both problem specific parameters and hardware-oriented parameters that adjust the characteristics of a kernel for a particular device. Kernels are coded in CUDA C to exploit existing CUDA compiler tools to generate optimized binaries. Traditionally, block and grid dimensions are specified at compile time for best performance. With KS, these dimensions can be left unspecified until kernel runtime without sacrificing performance. KS enables programming with dynamically sized memory declarations that are converted to fixed size prior to kernel compilation at runtime. KS also allows developers to use a simpler, static shared memory allocation syntax but have it behave like dynamically allocated shared memory.
The GPU Prototyping Framework generates kernels for particular problem and hardware instances as needed at application runtime, and thus eliminates the need to compile all supported variants into a set of binaries. This reduces program binary size, yet permits complex applications that fit within kernel-resource constraints imposed by the hardware. The framework caches kernel binaries so if a previously generated specialized kernel is needed again during application runtime, it can be loaded with speed similar to a dynamically linked shared object. Thus, runtime compilation cost can be amortized through reuse of compiled kernels. Many data flow pipelines are amenable to this approach. An example is a kernel that is used repeatedly during video acquisition. KS is more effective for parameters that control algorithmic behavior and control flow than for scaling factors that can be considered data. In this section, we first look at coding techniques for KS, and then the GPU-PF. Additional details can be found in [2] , [3] .
Coding Techniques
Coding techniques for creating adaptable yet specialized kernels are a central part of our KS method. Adaptable kernel code contains undefined constants that are provided values when the kernel is compiled, at either application compile time or application runtime. To specify parameters at compile time we use: (1) pre-processing directives and constant-value macro expansion and (2) C++ templates. The flavor of these techniques is demonstrated by comparing the for loop in the three listings that follow. In the example, KS enables loop unrolling, an important optimization applied by nvcc (NVIDIA's CUDA Compiler) that can only be applied if the loop bound is known at compile time. Listing 1 is a runtime evaluated loop that uses a variable as an upper bound without any KS. Listing 2 has been specialized with a loop bound whose value is specified at kernel compile time. Listing 3 (KS) is the most flexible in that one can choose, at kernel compile time, whether the loop bound should be RE as in listing 1 or should be defined at compile time as in listing 2. The ability to perform compile time optimization at application runtime distinguishes KS from statically-compiled techniques.
Listing 1. A runtime evaluated loop without kernel specialization
In Listing 1, the value of loopCount is only known at runtime, consequently the compiler cannot apply loopunrolling optimizations. In contrast, the code in Listing 2 has been written such that the loop upper-bound is defined as a compile-time constant, consequently the compiler can perform loop-unrolling optimizations at application compile time.
Listing 2. A specialized loop upper-bound provided at compile-time Listing 3 shows fully adaptable KS kernel code, in which the loop upper-bound can be toggled, either to be RE (no compiler loop-unrolling) or to have a value specified at kernel compile time (enable compiler loop-unrolling). The semantics of loops::op(loopCount) are such that if CT_LOOPS is defined as true, count is statically defined with the value of LOOP_COUNT, otherwise the value of count is determined at runtime. Similar KS techniques enable other compile-time optimizations, such as strength reduction, constant folding and propagation, and memory optimizations such as a single interface for many supported memory types (global, constant, etc.). The benefits of compile time optimization are magnified by dynamically compiling specialized kernels at runtime, a mechanism provided by the GPU Prototyping Framework. Listing 3. Specialized loop upper-bound can be set at compile-time or runtime
The GPU Prototyping Framework for Kernel Specialization
We developed the GPU Prototyping Framework for constructing CUDA C applications with data streaming processing pipelines. The GPU-PF orchestrates runtime compilation of specialized kernels, and provides a library of helper functions for implementing KS. The CUDA API supports runtime kernel compilation from PTX (intermediate code supported by NVIDIA) input, but not CUDA C. With the GPU-PF, developers use CUDA C to develop applications with specialized kernels, and the framework invokes the compiler, nvcc, at runtime to compile kernels. The framework passes KS parameters using nvcc's -D flag to define macros. After an application has been specified, the values of parameters can be adjusted; GPU-PF handles propagating the effects of the parameter changes through the application. Some supported parameter types are listed in Table 1 . Fig. 1 gives a general overview of KS with the GPU-PF. The user writes host application code that uses parameters, resources and actions to construct a processing pipeline. The host application code also contains code for the three phases supported by GPU-PF: specification, refresh, and execution of the pipeline. The host application code contains no CUDA-specific code as all CUDA code is encapsulated within GPU-PF functions, and thus kept distinct from host application code. As seen in Fig. 1 , the host application code implements the problem specification primarily with original code including some GPU-PF calls for specialization parameters, whereas the refresh and execution code largely consists of GPU-PF library code invoked by function calls. The host application code can alter parameters such that the next pipeline will be dynamically recompiled using the new parameter values. The host application code is compiled into a binary executable whose runtime behavior is outlined with a dashed line in Fig. 1. 1. Specification. During specification, pipeline parameters, resources (e.g., memory allocations and kernel modules), and actions (e.g., data transfers and kernel execution) are instantiated, but nothing is allocated and no actions other than error checking are performed.
2.
Refresh. Runtime kernel compilation occurs during refresh, and cached kernel binaries are used if possible to avoid unnecessary compiles. The refresh stage occurs before the first pipeline execution and, when static parameters are updated, before the next pipeline execution. Only resources affected by parameter changes are updated, and all resource allocation takes place at refresh instead of during execution. Both host and GPU memory are allocated during refresh. 3. Execution. During processing, the specialized kernel executes, calling the refresh phase if necessary. Refresh is only required the first time a specialized kernel is executed. At program termination, GPU-PF automatically cleans up all resources and memory. The GPU-PF library provides host-code abstractions to make KS and CUDA programming less tedious and less error-prone. These include program resources and actions on the resources. Resource types include data locations (memory or files) and modules (e.g., the CUDA kernels). Multiple memory types are abstracted to have the same interface regardless of underlying memory type. Handled memory types include global memory, constant memory, host memory (malloced, CUDA pinned, memory mapped, or user provided), and generic references to a subset of any memory type. Texture memory references are handled as a distinct memory type, due to the decoupling of the memory allocation in host code (as global or CUDA array memory) from the texture reference inside a CUDA module.
Applications use the GPU-PF library to specify any of a set of actions including: memory copy, kernel execution, file I/O, and calls to arbitrary user functions. To aid debugging, GPU-PF provides logs detailing the actions performed and parameters used. GPU-PF includes timing facilities, based on CUDA GPU or host timers, to report application timing information.
Overhead of KS
To take advantage of KS and GPU-PF, the user must modify their code. The host code needs to make use of API calls provided by GPU-PF. For GPU code, the user must use Cstyle preprocessing macros and optionally C++ templates. Note that the amount of change included in a given GPU kernel is up to the discretion of the programmer. KS provides advantages even with minimal changes to GPU code; more sophisticated changes will result in added benefits. An advantage of using KS is delaying various parameter decisions to runtime, which incurs the extra cost of compilation (on the order of seconds); however, due to caching, this overhead is incurred infrequently. A user can limit the number of runtime compilations by making frequently changing parameters runtime evaluated; however, this may result in a decrease in kernel performance. Additional overhead of the framework, due to the refresh stage, is on the order of milliseconds and is about one 10th of the total time for CUDA initialization. Refresh does not occur every time a kernel is executed, so this overhead is also incurred infrequently. The GPU binary cache can grow quite large, however KS gives the benefit of only generating kernels for different parameter settings as needed, rather than shipping all possible variants of them as part of a GPU solution.
PARTICLE IMAGE VELOCIMETRY: AN APPLICATION
Our previous work describes the use of kernel specialization for lung tumor tracking and for medical X-ray cone beam computed tomography [2] , [3] . Here we apply KS to improve the performance of particle image velocimetry. PIV is used to obtain instantaneous velocity measurements in fluids. The fluid is seeded with tracer particles that are assumed to faithfully follow the flow dynamics, and then illuminated so particles are visible. The motion of the particles is used to calculate speed and direction (the velocity field) of the flow being studied. The PIV setup considered here processes video streams of particles seeded into a moving fluid (Fig. 2) . The particles are illuminated by laser pulses, making them more easily distinguishable from the fluid. Tracking the particles between frames can be used to determine characteristics of the fluid flow within the field of view. Problem parameters were based on a PIV test bed used by the MIT Robot Locomotion Group [8], [9] . To compare a pair of video frames, a mask is extracted from one image and compared to several nearby locations in the second image. The location offset in the second image that produces the greatest similarity is selected to generate a local velocity estimation for the mask. Doing the same computation many times over the image pair produces a velocity field estimate. We consider two different PIV implementations, one that is relatively static and a second in which kernel specialization enables more dynamic parameters.
The mask associated with velocity estimates is shown in Fig. 3 . In the static version, the space of mask offsets searched is defined by the rectangular dimensions of the interrogation window (IW). The search space is always dense, meaning that a similarity score is generated for every possible offset of the mask within the IW. Consider an IW of 40 Â 40 and a mask of 32 Â 32 starting in the center. In this case, the mask can move up to four pixels in each direction (up, right, down, left) producing a nine-by-nine square offset space. Many individual template matching operations are performed and the mask data is unique for each individual velocity estimate.
In static PIV processing, the distribution of local searches throughout an image pair for which velocity estimates are generated is a regular grid and is defined in terms of IW parameters. Starting from the top left corner, the distance between adjacent velocity estimates is controlled by a parameter that specifies, for each dimension, the number of pixels by which adjacent interrogation windows overlap. Continuing with the 40 Â 40 IW example, an overlap of 20 in each direction results in a grid of velocity estimates that are 20 pixels apart such that adjacent interrogation windows share half of their pixels in each dimension.
The dynamic version of PIV processing was designed to suit the needs of the MIT RLG [8], [9] . This problem specification is defined in terms of IW corners and offsets. The IW corners are provided as a list of arbitrary x-and y-coordinates representing the locations of the top left corner of each desired interrogation window. The offsets at which to calculate similarity scores are also specified as a list of arbitrary xand y-coordinates. The set of offsets are applied to each mask, and the offset with the highest match is used to calculate the velocity vector. This new problem specification allows the same implementation to perform both global and local searches. For example, image-wide scales, similar to those used in the static case, can be used to determine the overall fluid motion. When the flow is non-turbulent and a direction can be determined ahead of time, the search space defined by the offsets is biased in the flow's direction. The problem specification can also be used to specify a PIV instance where the set of masks and offsets are tightly clustered around an area of turbulent flow, thus producing better estimates.
The dynamic version of PIV limits opportunities for data sharing, a key performance consideration. In the static version, the regular spacing of interrogation windows and dense rectangular search space create a regular data access pattern across interrogation windows and between consecutive mask offsets. The regularity provides opportunities to minimize data traffic through the use of shared and texture memory. In contrast, the dynamic problem specification makes it challenging to a priori determine memory patterns and data sharing opportunities.
The similarity score used for the GPU implementations is sum of squared differences (SSD):
The selection process for a given mask's velocity based on similarity scores is a reduction scan for each mask corner. The minimum value SSD is chosen.
CUDA Implementation
The PIV CUDA implementation consists of three main tasks: loading data, accumulating squared differences, and reducing accumulated data to a single value. Specific parameters are from the MIT RLG [9] . The target range for the number of IW corners is between 200 and 2,000 with up to 2,000 offsets for each. Mask dimensions are between 25 and 64 pixels on a side and always square. The computation associated with each mask is independent; the target parameter ranges reduce the computation associated with a single mask to a relatively small portion of the total computation. Based on this, a mapping of one mask per CUDA block was chosen. At the low end of the range, this produces more than enough blocks to fully utilize current NVIDIA hardware, while at the same time generating significant inter-thread parallelism except at extremely low counts of offsets and mask sizes. Each iteration of the summation loops is independent. Another benefit of this mapping scheme is the mask data is static for each block. Thus, it is possible to load the mask data to block-local memory only once.
Note that data is not shared by multiple IWs, since it is problem dependent and based on the values of both IW corners and offsets. Reading IW data through texture memory and the L1-style caching on NVIDIA Fermi and later GPUs can mitigate the performance impact. The reduction operation is per IW and therefore no more than 2,000 values. This reduction is performed at the block level since it is efficient, removes the overheads associated with a second kernel launch, and decreases global memory traffic by reducing the output of the kernel to a single value per block. The accumulation associated with computing the similarity score at each offset must be mapped to a block's threads. Based on the need to use the register file to maximize performance and because the mask data is static, register blocking is used to store the mask tile data in registers. Consecutive threads load consecutive column-major mask data values into registers (Fig. 4) . While this generates a stride for intra-thread accesses, it produces coalesced interthread data access patterns for both the initial mask data load and reading IW data from the second video frame.
This block-level mapping breaks the accumulation of squared differences across the threads and keeps the threads in lock-step over the offset space. The thread mapping provides sufficient per-thread work for each offset and generates ILP. At the low end of the template size range of twenty-five pixels per side, this still maps more than four pixels of mask data to each thread for a thread block of 128 threads. Given the number of mask data pixels assigned to each thread and the number of threads per block, a thread block will cover a certain "natural" area. For example, 128 threads per block at four pixels per thread results in a natural area of 512 pixels.
For larger masks, the mask is tiled and the block loops over each tile. Looping over the offset space inside the mask tiling requires tracking and accumulating mask tile contributions for each offset. A shared memory area with the same number of elements as in the offset space is reserved for this purpose. However, before a given mask tile's contribution for a given offset is accumulated, the portion calculated by each thread must be reduced to a single value. Block-level reductions represent a potential point of inefficiency as fewer and fewer threads participate in each round of the reduction. The idle threads must wait for the reduction to complete before continuing on to the next mask offset. In the case of the PIV kernel, a reduction must be performed for each offset for each mask tile, potentially incurring this performance impact several thousand times during the lifetime of the kernel.
To address this, the PIV kernel applies warp specialization, as described in Section 2. Note that with loop unrolling and register blocking, the compute threads in the PIV kernel already produce significant instruction-level and memorylevel parallelism. Several different styles of warp specialization were attempted. The three main tasks, loading data, accumulating squared differences, and reducing accumulated data to a single value, were distributed in different ways among two and three-stage pipelines in which each pipeline stage was assigned to a varying number of warps. The best performance observed was a two-stage doublebuffered approach used for the results reported here. One set of warps does both the loading of new data and computing the sum of squared differences assigned to each thread, as shown in Fig. 5 . In this group of warps, data is read directly by the threads consuming the values without buffering in shared memory. Once a mask tile at a given offset has been processed, each thread writes its contribution for the current shift offset into shared memory. The second warp group, consisting of a single warp, performs the reduction. Combined with double buffering, this allows the first group to immediately begin processing the next shift offset. Using a single warp prevents the need for any synchronization related to the reduction. The single warp performing the reduction only accesses shared memory, and does not compete with the other loading and computing warps, as input data for the loading and computing warps comes from registers and global or texture memory. The last remaining thread in the reduction tree accumulates the value into the correct offset element in shared memory. Once all mask tiles are accumulated for all offsets, the final reduction scan for the minimum value is performed. This single output is the index of the offset coordinates that produces the optimal value. Note that this style of warp specialization works best for an application such as PIV which uses a map reduce style of computation. We are investigating more structured ways of applying warp specialization across a range of applications.
Kernel Specialization
For PIV, kernel specialization is used in a number of ways to improve the performance of the runtime evaluated kernel. Note that the RE kernel had already been optimized. The key benefits offered by KS are a combination of loop unrolling and enabling dynamic register blocking. NVIDIA GPU registers cannot be dynamically addressed by the thread that owns them. Any loops operating over register-resident data must be static at compile time so the specific registers can be encoded into the GPU binary. KS converts the number of registers used for register blocking into a dynamic variable. This allows the kernel to adjust its register usage based on the number of registers available on a target GPU and on the current problem. In addition, KS allows other offsets and strides to be statically compiled into the PIV kernel. These include thread identification thresholds used to control inter-warp divergence and the number of threads in each specialized group.
The specialized kernel is derived much like the large template matching kernel, described in [2] . A RE kernel is developed first, and then a kernel specialized version is created by replacing kernel arguments and CUDA built-in references with new macro names provided at compile time. Little additional optimization is applied to the kernel. For example, we neither remove runtime guards for cases where the natural thread block area is an integer multiple of the mask area nor eliminate tile looping when the natural thread block area exactly matches the full mask area. In addition, the RE kernel contains fixed register blocking values, with register blocking levels specialized for each value, to enable determination of the optimal value as well as to study its impact.
EXPERIMENTAL SETUP
To explore the performance impacts of kernel specialization, the runtime evaluated (non-specialized) and specialized GPU PIV implementations were run on a number of different data sets. For each kernel and problem set, a wide range of implementation parameters were tested. GPU-PF was used for these experiments. The data presented for different implementations include kernel performance and register usage, two metrics that demonstrate the advantages of KS.
Results are shown for single kernel performance improvement, since KS is a kernel level optimization. The PIV kernels are also compared to previously optimized FPGA implementations [15] , [16] . In addition, KS and RE kernel performance are compared. For the GPU applications, timing is handled by GPU-PF, which uses CUDA events for GPU-related activities and high-resolution host timers otherwise. Table 2 summarizes the hardware configurations. The C1060 has 240 thread processors compared to 448 on the C2070. The C1060 also has lower memory bandwidth (102.4 GB/sec versus 144 GB/sec) and lower peak performance. Hence, better performance is expected from the C2070 if the greater thread level parallelism can be exploited and the memory bandwidth can be used efficiently. All CPU results are from the Xeon-based workstation. Two different NVIDIA GPUs were used to illustrate KS support for different GPU hardware.
The PIV application was benchmarked on a number of different problem sets and a range of implementation parameters. More than ten thousand different benchmarking trials were examined. Problem specific parameters varied include image size, interrogation window size, mask size and overlap. GPU specific parameters varied include number of threads per block, texture memory usage, and whether or not warp specialization is enabled. With the PIV kernel, all the parameters are orthogonal, resulting in 160 implementation parameter combinations per problem instance. Details of the different experimental setups are given in Appendix 1, available in the online supplemental material.
RESULTS
Across the range of problem parameters tested for the PIV application, kernels using kernel specialization produced notable improvements over the corresponding fully runtime evaluated kernels, both in performance and register usage per thread. Results compare the kernel variants to each other and list the best performance across the wide range of implementation parameters for each GPU and problem pair. Table 3 shows that specialized kernels convey cumulative speed-ups of two to three times relative to a baseline runtime evaluated kernel for ten data sets (A and B). Note that the RE kernel would be even slower if not for the use of static register blocking. The appendix, available in the online supplemental material, contains a more detailed table of relative and cumulative performance. Table 4 demonstrates the benefits of kernel specialization for peak performance for the PIV application. For the C2070, a fixed kernel (hard-coded with a single set of compile-time values that are used regardless of problem parameters) can achieve about 80 percent of peak performance, whereas for the C1060 about 90 percent can be achieved. While acceptable on average, Table 4 shows that in each case the minimum relative performance is significantly lower, demonstrating the benefits of making such parameters adjustable. Peak performance was obtained with KS. Appendix 2, available in the online supplemental material, contains more details.
We found that optimal values for register blocking and thread count can vary non-linearly with PIV mask size changes and across architectures. Results for two mask sizes run on the C2070 are shown in Fig. 6 . A more complete set of these results can be found in Figs. 1 and 2 in the appendix, available in the online supplemental material. The variation of the optimal configuration with input parameter changes and hardware architecture changes motivates the selection of parameters at application runtime.
KS can also significantly reduce the number of registers per kernel thread, assuming all other configuration values are equal. Fewer registers per thread improves performance as it enables more active warps and enables more resource intensive kernels to fit on a given multiprocessor. In the case of the PIV kernel on the C1060, KS used 16 registers compared with 32 for the RE kernel. In summary, compared to both RE kernels and fixed kernels, kernel specialization produces significant speedups. In addition, our results confirm the benefits of warp specialization, as well as illustrating peak performance with fewer threads but with greater ILP and register blocking. More detailed results can be found in the appendix, available in the online supplemental material.
RELATED WORK
Merrill et al. [17] show that GPU kernel performance is highly non-linear and non-predictable such that no single kernel configuration provides good performance over the range of parameters tested. Our particle image velocimetry results in Section 4 show a similar phenomenon. Such results motivate the need for dynamic approaches, such as our kernel specialization approach, that achieve good performance by adjusting to problem specifics.
A number of researchers have targeted improved performance as well as the parameterization and adaptability of GPU kernels. These techniques can be separated into four categories: application specific customization, code generation, autotuning, and domain specific tools. Details of the related work can be found in Appendix 3.
CONCLUSIONS AND FUTURE WORK
Our results show that kernel specialization provides significant performance improvements and execution times that are noticeably lower by removing overheads associated with flexibility. These benefits are derived on top of good baseline implementations of real-world scientific applications of non-trivial complexity. Previous work on medical image processing also showed these benefits [2] . Simpler kernels may provide even greater performance improvements. KS provides increased adaptability without a performance penalty. The data sets chosen cover a wide range of irregular parameters. KS provides performance improvements over high-performing baseline implementations by removing overheads associated with flexibility.
The PIV application results show that KS provides benefits beyond those associated with loop unrolling. The PIV KS kernels were compared with runtime evaluated kernels that already took advantage of loop unrolling over the core data loading and computation. The fully specialized kernels allow optimization over a range of parameters, and apply constant folding and propagation as well as additional strength reduction optimizations.
Kernel performance is highly non-linear and dependent on the correct selection of implementation parameter values. We have shown that KS outperforms fixed kernel implementations across a range of parameters. We have also demonstrated that KS allows the adaptation of kernels to different NVIDIA GPU processors. Autotuning techniques would help in selecting the best parameter values to apply for each range of parameters. We plan to incorporate autotuning into future versions of our tools. This paper shows the benefits of kernel specialization for Particle Image Velocimetry. Previously, we have shown KS benefits large template matching applications and medical image reconstruction. In the future, we plan to show that KS benefits many other applications as well, especially those with problem parameters and data sets that are not powers of two and where the input parameters change with different problem instances.
Miriam Leeser received the PhD degree and Diploma in computer science from Cambridge University. She is a professor of electrical and computer engineering and the head of the Reconfigurable and GPU Computing Laboratory at Northeastern University (NEU). Her research interests include reconfigurable computing and computer arithmetic. She is a senior member of the IEEE, ACM, and the Society of Women Engineers.
Laurie Smith King received the PhD degree in computer science from the College of William and Mary. She is a professor of computer science at the College of the Holy Cross. Her research interests include hardware-software codesign and programming languages. She is a member of the IEEE and the ACM.
" For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
