Abstract: Graphics Processing Units (GPUs) are massively parallel, many-core processors with tremendous computational power and very high memory bandwidth. With the advent of general purpose programming models such as NVIDIA's CUDA and the new standard OpenCL, general purpose programming using GPUs (GPGPU) has become very popular. However, the GPU architecture and programming model have brought with them many new challenges and opportunities for compiler optimizations. One such classical optimization is loop unrolling. Current GPU compilers perform limited loop unrolling.
I. INTRODUCTION
The ever-increasing demand for graphics processing power driven by the rapidly growing computer games industry has led to the development of extremely powerful, highly parallel, multi-threaded, many-core Graphics Processing Units (GPUs). For example, nVidia's GeForce GTX 280 [1] GPU has 240 processor cores on a single chip, and has a peak performance of about 900 GFLOPS. The introduction of General Purpose Programming on GPUs (GPGPU) programming models such as nVidia's CUDA [2] and the new standard OpenCL [3] has made it easier to harness the vast processing power of GPUs for solving non-graphics, general purpose problems in areas such as linear algebra, signal processing, life sciences, etc.
An important program optimization that is performed automatically by compilers, and sometimes manually, is loop unrolling. Loop unrolling is a technique in which the body of a suitable loop is replaced with multiple copies of itself, and the control logic of the loop is updated accordingly. A remainder loop is added at the end of the unrolled loop, if necessary. Loop unrolling for CPU programs has been researched and implemented in compilers for many decades now. However, the results have not carried over automatically to GPGPU programs, due to the significant differences in architecture and programming model. Current GPGPU compilers perform very little or no unrolling, and there are no reported compile-time techniques for selecting optimal unroll factors for GPGPU programs.
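As a minimal illustration of the transformation described above (it is not taken from this work), the Python sketch below unrolls a summation loop by a factor of 4 and adds a remainder loop for the leftover elements. The function names and the unroll factor are hypothetical choices made only for this example.

```python
def sum_rolled(data):
    """Original loop: one add plus one compare-and-branch per element."""
    total = 0
    for i in range(len(data)):
        total += data[i]
    return total


def sum_unrolled_by_4(data):
    """The same loop unrolled by 4, followed by a remainder loop."""
    n = len(data)
    main_end = n - (n % 4)  # largest multiple of 4 that does not exceed n
    total = 0
    i = 0
    while i < main_end:     # one compare-and-branch per four elements
        total += data[i]
        total += data[i + 1]
        total += data[i + 2]
        total += data[i + 3]
        i += 4
    while i < n:            # remainder loop: at most three iterations
        total += data[i]
        i += 1
    return total
```

Both functions return the same result; the unrolled version executes roughly a quarter of the loop-control operations at the cost of a larger loop body.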
In this work, we attempt to characterize the impact of loop unrolling on GPGPU programs. Using this characterization, we develop a static, semi-automatic, compile-time technique to select optimal unroll factors for suitable loops in CUDA programs. Our technique is based on analyzing the compiled CUDA code and estimating the relative performance of various unroll configurations. In addition, we propose a technique for pruning the search space of unroll factors. We present experimental results to show that we correctly identify optimal unroll factors for several CUDA programs and benchmarks. In Section II, we characterize the impact of loop unrolling on GPGPU programs. In Section III, we describe the framework for selecting optimal unroll factors and in Section IV we show the results of using the framework. Finally, we compare our work with other existing approaches and conclude.
II. LOOP UNROLLING FOR GPGPU PROGRAMS
The effects of loop unrolling on CPU programs have been understood and described extensively in past work [4]. As with CPU programs, unrolling loops in GPGPU programs can reduce instruction counts and increase opportunities for scheduling and register tiling, with the same pitfalls of increased register usage and code size. However, the differences in architecture, programming models, and resource constraints of GPUs imply that the analytical models developed for CPU loop unrolling cannot be directly extended to GPGPU loops.
In this section, we look at the various factors that influence loop unrolling for GPGPU programs and their complex interplay. We first provide a high-level overview of the GPU architecture and outline the various resources available, their constraints and their limitations. We then examine how these resource constraints can influence compiler optimizations such as loop unrolling.
A. GPU Architecture and Resources
From the perspective of GPGPU, the GPU can be viewed as a many-core processor containing an array of Streaming Multiprocessors (SMs), each of which consists of 8 Scalar Processors (SPs), two transcendental function units, a multithreaded instruction unit, fast on-chip shared memory and a register file. Each SP is also associated with off-chip local memory, which is mostly used for register spills and storing automatic arrays. In addition to these, each GPU device also contains large, slow, off-chip device (global) memory. The number of SMs per device and the sizes of the various memories and the register file vary by device.

TABLE I
GTX 280 RESOURCES, THEIR SIZES AND CONSTRAINTS

| Resource | Size | Constraint |
| Total global memory | 1GB | None |
| Total shared memory per SM | 16KB | Shared by all concurrent blocks |
| Total number of registers per SM | 16K | Shared by all concurrent threads |
| Warp size | 32 | None |
| Max threads per block | 512 | None |
| Max dimensions of a block | 512, 512, 64 | None |
| Max dimensions of a grid | 64K, 64K, 1 | None |
Each SM manages the creation, execution, synchronization and destruction of concurrent threads in hardware, with zero scheduling overhead, and this is one of the key factors in achieving very high execution throughput. Each parallel thread is mapped to an SP for execution, and each thread maintains its own register state. All the SPs in an SM execute their threads in lock-step, according to the order of instructions issued by the per-SM instruction unit. The SM creates and manages threads in groups of 32, and each such group is called a warp. A warp is the smallest unit of scheduling within each SM.
The GPU achieves efficiency by splitting its workload into multiple warps and multiplexing many warps onto the same SM. When a warp that is scheduled attempts to execute an instruction whose operands are not ready (due to an incomplete memory load, for example), the SM switches context to another warp that is ready to execute, thereby hiding the latency of slow operations such as memory loads.
The CUDA programming model organizes threads into a three-level hierarchy. At the highest level of the hierarchy is the grid. A grid is a 2D array of thread blocks, and thread blocks are, in turn, 3D arrays of threads. The GPU hardware scheduler maps one or more thread blocks onto each SM, and the threads in a thread block are split into warps and scheduled onto the SPs. Threads belonging to the same thread block can cooperate with each other by sharing the low-latency, on-chip shared memory and through barrier synchronization. Synchronization across thread blocks is not directly supported by the programming model.

Table I lists some of the important resources on a specific GPGPU device: the GeForce GTX 280. Typically, the limits on resources such as the number of threads per block, the block dimensions and the grid dimensions present a challenge to the programmer, who has to make the best use of these resources to solve the problem at hand. However, the most important resource from the perspective of the compiler is the register file. Since the entire register file on an SM is shared among all the concurrently executing threads, if a compiler optimization increases the register usage per thread, then the number of threads that can execute concurrently (GPU occupancy) is reduced. This trade-off needs to be considered by all optimizations, such as loop unrolling, that may increase the per-thread register usage. We need to examine the effects of loop unrolling for GPGPU programs in the context of these resource constraints to determine whether a given GPGPU loop will benefit from loop unrolling, and if so, what the optimal unroll factor would be.
B. Factors Influencing Loop Unrolling For GPGPU Programs

1) Instruction Count: One of the most obvious benefits of unrolling loops is the reduction in instruction count, because fewer compare and branch instructions are needed to perform the same amount of computation. Instruction count may also fall because of the better instruction selection, scheduling and optimization opportunities that result from unrolling.
2) ILP and GPU Occupancy: Loop unrolling, in general, provides better instruction scheduling opportunities to the compiler due to increase in the number of available independent operations within the body of the loop, resulting in increased Instruction Level Parallelism (ILP). The compiler could use the additional independent operations to hide latencies of pipelines and memory access.
In addition to ILP, the GPU tolerates these latencies by multiplexing a large number of warps onto each SM, up to the limit allowed by resource constraints. When a warp that is executing encounters a high-latency operation, the scheduler switches execution to another warp that is ready for execution. By the time all warps have been cycled through and the first warp that was switched out is switched back in, it is likely that the high-latency operation will have completed.
Due to this complex interplay of ILP and occupancy inside a GPU, estimating the benefit from increased ILP on the overall performance of the program is hard. Increase in register usage, which mostly comes with increase in ILP, reduces GPU occupancy since the register file is a resource shared by all active threads. Therefore, any model for estimating the benefit of loop unrolling must accurately model the trade-off between increased ILP and decreased occupancy. We attempt to characterize this trade-off in Section III-C.
3) Instruction Cache Capacity: Loop unrolling results in an increase in size of the loop body. When the size of the loop body exceeds that of the instruction cache, cache misses occur whenever a portion of the loop body that is not in the cache is executed. These misses could reduce the performance of the program, undoing some of the benefits gained by reduced instruction count and increased ILP. The effect of the I-cache on loop unrolling is the same for CPU programs and GPGPU programs.
The sizes of the I-caches on nVidia GPUs are not documented by nVidia, and there are no known techniques for automatic measurement of I-cache capacities on GPUs. Therefore, experiments similar to those described in [5], but adapted to run on a GPU instead of a CPU, were developed. These experiments indicated that the capacity of the I-cache on both the GTX 280 and the 8800 GTX devices was 32KB, and this measured capacity was used in later experiments.
III. SELECTION OF OPTIMAL UNROLL FACTORS FOR LOOPS IN GPGPU PROGRAMS
This section describes the design and detailed working of the system that performs the selection of optimal loop unroll factors for loops in CUDA GPGPU programs.
A. Design Overview
The framework for selecting optimal loop unroll factors for user-specified loops in CUDA programs is centered around two components: the PTX Analyzer and the Unrolling Driver, which make use of a chain of other tools to help analyze the given kernel. Figure 1 provides a high-level overview of the design of the framework.
The analysis framework requires that the loops/loop-nests selected by the user for unrolling have no conditional control flow within their bodies, since the analysis is purely static. Analyzing loops with conditional control flow in their bodies would require some dynamic analysis, such as profiling, in order to correctly model the control flow. This is not an overly constraining requirement: branching within loop bodies can produce warp divergence and reduce performance, so most GPGPU programmers already try to avoid or minimize conditional control flow within loop bodies. In addition, the nVidia GPU ISA allows predicated execution of instructions, which can be used to achieve limited control flow within loop bodies without significant performance loss, and loops with such predicated control flow are accepted by the analyzer.
Fig. 1. Overview of the optimal unroll factor estimation framework
The components of the framework are described below:
1) Compilation and dis-assembly tool-chain:
Input: Annotated CUDA source code with loop identifiers and unroll factors.
Output: Disassembled PTX of the CUDA source with target loop(s) unrolled by the specified amount.
Description: The tool-chain consists of ORIO, NVCC and DECUDA. ORIO [6] is a source-to-source transformation tool that performs semi-automatic unrolling of loops. NVCC [2] is nVidia's compiler driver that translates CUDA source code to a GPU-specific binary (CUBIN). The CUBIN format is not publicly disclosed and the back-end of NVCC is nVidia-proprietary. DECUDA [7] is a reverse-engineered disassembler that translates the compiled binary back to PTX, which is an intermediate representation common to all nVidia targets and is publicly documented by nVidia.
We chose to explicitly compile the source code with NVCC before analysis instead of basing our analysis entirely on a theoretical model so that we could accurately model the idiosyncrasies of the compiler's code-generator and register allocator, whose inner workings are unknown since they are nVidia-proprietary. Also, it is important to note that the PTX generated by DECUDA is not the same as the PTX generated by NVCC under the -ptx flag. The latter is the program's IR before it is processed by the compiler's back-end, whereas the former is the IR after being processed by the back-end. Analyzing the disassembled PTX allows us to factor in the optimizations performed by the compiler's back-end accurately.
2) Occupancy Constraining Factor:
Input: Disassembled PTX of a CUDA kernel from DECUDA.
Output: Details related to the occupancy of the GPU.
Description: The Occupancy Constraining Factor (OCF) component gathers information about the resource usage by the kernel and computes the distribution of thread blocks to each SM and the maximum number of warps that can be concurrently active on each SM, based on the device-specific resource constraints and usage. This information is used by the Driver to compute the number of iterations of the thread block to be performed by each SM, which is then used by the PTX Analyzer to accurately estimate the number of execution cycles. Equations 1 through 6 describe the computations performed by OCF.
The maximum occupancy for each SM in the device, $O_{max}$, is given by:

$$O_{max} = \min(O_R, O_{ShMem}, O_T, O_B, O_W) \quad (1)$$

where $O_R$, $O_{ShMem}$, $O_T$, $O_B$ and $O_W$ represent the maximum occupancy for each SM due to the constraints on register usage, shared memory usage, thread count, thread block count and warp count respectively. These values are computed in isolation, without the effects of other resource constraints, and are given by the following equations.

$$O_R = \left\lfloor \frac{R_{SM}}{R_{thread} \times T_B} \right\rfloor \quad (2)$$

where $R_{SM}$ is the number of registers available per SM, and $R_{thread}$ and $T_B$ represent the number of registers per thread and the number of threads per block for the given kernel.

$$O_{ShMem} = \left\lfloor \frac{ShMem_{SM}}{ShMem_B} \right\rfloor \quad (3)$$

where $ShMem_{SM}$ is the amount of shared memory available per SM and $ShMem_B$ is the amount of shared memory consumed by each thread block in the given kernel.

$$O_T = \left\lfloor \frac{T_{SM}}{T_B} \right\rfloor \quad (4)$$

where $T_{SM}$ is the maximum number of threads allowed per SM by the hardware and $T_B$ is the number of threads per block in the given kernel's specification.

$$O_B = B_{SM} \quad (5)$$

where $B_{SM}$ is the maximum number of thread blocks per SM allowed by the hardware.

$$O_W = \left\lfloor \frac{W_{SM}}{W_B} \right\rfloor \quad (6)$$

where $W_{SM}$ is the hardware limit on the maximum number of warps per SM and $W_B$ is the number of warps per block in the given kernel. $W_B$ is the same as $T_B/32$.

The details of the hardware limits on the various resources for GTX 280 are given in Table I.
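The occupancy computation performed by OCF can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the framework's actual code: the register-file and shared-memory sizes follow Table I, while the per-SM limits on threads, blocks and warps are assumed values that Table I does not list, and register or shared-memory allocation granularity is ignored. The names `DeviceLimits` and `max_occupancy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceLimits:
    """Per-SM limits. The first two values follow Table I; the three
    marked 'assumed' are not given in the text."""
    registers_per_sm: int = 16 * 1024
    shared_mem_per_sm: int = 16 * 1024   # bytes
    max_threads_per_sm: int = 1024       # assumed
    max_blocks_per_sm: int = 8           # assumed
    max_warps_per_sm: int = 32           # assumed
    warp_size: int = 32


def max_occupancy(regs_per_thread, threads_per_block, shared_mem_per_block,
                  limits=DeviceLimits()):
    """Return O_max, in thread blocks per SM, as the minimum of the
    per-resource occupancies O_R, O_ShMem, O_T, O_B and O_W."""
    # W_B = T_B / 32, rounded up for block sizes that are not multiples of 32.
    warps_per_block = -(-threads_per_block // limits.warp_size)
    o_r = limits.registers_per_sm // (regs_per_thread * threads_per_block)
    # A kernel that uses no shared memory is not constrained by it.
    if shared_mem_per_block > 0:
        o_shmem = limits.shared_mem_per_sm // shared_mem_per_block
    else:
        o_shmem = limits.max_blocks_per_sm
    o_t = limits.max_threads_per_sm // threads_per_block
    o_b = limits.max_blocks_per_sm
    o_w = limits.max_warps_per_sm // warps_per_block
    return min(o_r, o_shmem, o_t, o_b, o_w)
```

Under these assumed limits, a kernel with 256 threads per block and 4KB of shared memory per block fits 4 blocks per SM at 16 registers per thread but only 2 at 32 registers per thread, which is the register-pressure effect that unrolling can trigger.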
3) Driver:
Input: GPU occupancy information from OCF and an estimate of the cycles spent in execution of the loops under consideration, from the PTX Analyzer.
Output: Selection of optimal unroll factors for the given CUDA kernel; decisions on which unroll factors to evaluate and when to stop evaluation.
Description:
The Driver is the central controlling component of the framework. It executes in a loop, systematically increasing the unroll factors of the specified loops starting from 1, and annotates the user-annotated CUDA source with the current unroll factor(s). It then invokes the compilation and disassembly tool-chain on the annotated source. At the end of the tool-chain execution, OCF computes the maximum occupancy of each SM in the device, based on the resource consumption by the kernel with unrolled loops. Using this occupancy information, the Driver invokes the PTX Analyzer to analyze the disassembled PTX and estimate the number of execution cycles spent by the kernel. The Driver maintains a table of estimated execution cycles corresponding to the various unroll factors and updates the table with the results of analysis from the PTX Analyzer for each new unroll configuration. The Driver terminates the evaluation of unroll factors when one of the following events occurs:
• The unroll factors reach a user-specified upper limit. If the user is aware of the maximum amount by which the loops can be unrolled, the Driver makes use of this information. If this information is not available, the Driver uses its own internal limit to stop the search.
• The resource usage, specifically the register usage per thread, increases to an extent where not even a single thread block can be scheduled on an SM. The CUDA runtime will not allow a kernel with an occupancy of less than 1 thread block per SM to be launched.
• The size of the outermost loop being unrolled reaches or exceeds the size of the SM's instruction cache, as described in Section II-B3. In practice, we observed very few cases in which the evaluation had to be terminated due to this event; the two earlier causes were more common.
At the end of the evaluation, the Driver selects the unroll configuration with the least estimated execution cycles as the optimal unroll factor from its table of results. If, however, the Driver finds that the kernels with unrolled loops perform worse than the original version with no unrolling, it reports that unrolling is not beneficial for the given kernel.
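The search loop and its three termination events can be sketched in Python as below. This is an illustrative sketch only: the three callables are hypothetical stand-ins for the tool-chain, OCF and the PTX Analyzer, stepping the unroll factor by one is an assumption (the text says only that factors are increased systematically from 1), and the default internal limit of 256 and I-cache size of 32KB are taken from values mentioned elsewhere in this paper.

```python
def select_unroll_factor(estimate_cycles, blocks_per_sm, loop_size_bytes,
                         user_limit=None, internal_limit=256,
                         icache_bytes=32 * 1024):
    """Evaluate unroll factors 1, 2, 3, ... and return (best_factor, table).

    estimate_cycles(uf)  -> estimated execution cycles for unroll factor uf
    blocks_per_sm(uf)    -> thread blocks that fit on one SM after unrolling
    loop_size_bytes(uf)  -> code size of the outermost unrolled loop
    All three are hypothetical callables supplied by the caller.
    """
    limit = user_limit if user_limit is not None else internal_limit
    table = {}
    uf = 1
    while uf <= limit:                          # event 1: upper limit reached
        if blocks_per_sm(uf) < 1:               # event 2: kernel cannot launch
            break
        if loop_size_bytes(uf) >= icache_bytes:  # event 3: I-cache exceeded
            break
        table[uf] = estimate_cycles(uf)
        uf += 1
    if not table:
        return None, table
    # Ties go to the smallest factor because dicts keep insertion order.
    best = min(table, key=table.get)
    return best, table
```

A returned factor of 1 corresponds to the case in which unrolling is reported as not beneficial.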
The Driver receives the value of maximum occupancy, $O_{max}$, from OCF and the total number of blocks to be scheduled on the device, $B_{total}$, from the user. It first computes $B_{SM}$, the total number of blocks assigned to each SM for execution, by distributing $B_{total}$ equally among the SMs of the device. It then computes:

$$FullIters_{SM} = \left\lfloor \frac{B_{SM}}{O_{max}} \right\rfloor$$

where $FullIters_{SM}$ is the number of full iterations, i.e., iterations for which each SM can execute with maximum occupancy $O_{max}$. Along similar lines, we also have:

$$PartialIters_{SM} = \left\lceil \frac{O_{partial}}{O_{max}} \right\rceil$$

where $PartialIters_{SM}$ gives the number of partial iterations, i.e., iterations for which each SM can execute with occupancy less than $O_{max}$. Basically, $O_{partial}$ is the number of blocks that remain to be executed after all the full iterations have completed. This occupancy level, $O_{partial}$, is computed as:

$$O_{partial} = B_{SM} \bmod O_{max}$$

Using this information, the Driver computes the active warp count that is required by the PTX Analyzer for estimating the execution cycles.

$$W_{full} = O_{max} \times W_B, \qquad W_{partial} = O_{partial} \times W_B$$

where $W_{full}$ is the number of active warps in each of the full iterations, $W_{partial}$ is the number of active warps for the duration of the partial iteration, if any, and $W_B$ is the number of warps per block, as described in Equation 6.
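One plausible reading of these quantities is sketched below in Python. It assumes that the blocks are divided equally among the SMs and rounded up so that the most heavily loaded SM is modelled; the rounding, the parameter `num_sms` and the function name `sm_schedule` are assumptions made for this sketch and are not specified in the text.

```python
def sm_schedule(total_blocks, num_sms, o_max, warps_per_block):
    """Return (full_iters, partial_iters, w_full, w_partial) for one SM."""
    if o_max < 1:
        raise ValueError("kernel cannot be launched with occupancy below 1")
    blocks_per_sm = -(-total_blocks // num_sms)   # ceiling division (assumed)
    full_iters = blocks_per_sm // o_max           # iterations at occupancy O_max
    o_partial = blocks_per_sm % o_max             # blocks left over afterwards
    partial_iters = 1 if o_partial > 0 else 0     # leftovers fit in one iteration
    w_full = o_max * warps_per_block              # active warps, full iteration
    w_partial = o_partial * warps_per_block       # active warps, partial iteration
    return full_iters, partial_iters, w_full, w_partial
```

For example, 300 blocks on 30 SMs with a maximum occupancy of 4 blocks and 8 warps per block give two full iterations with 32 active warps and one partial iteration with 16 active warps.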
The Driver stores the current unroll configuration in a file that can be accessed by the PTX Analyzer. Finally, the Driver uses the computed values of $W_{full}$, $FullIters_{SM}$, $W_{partial}$ and $PartialIters_{SM}$ along with the iteration count $I$ to invoke the PTX Analyzer, get the estimated execution cycle counts for the various loops/loop-nests in the CUDA kernel and compute the total execution cycles. Specifically, the computation of total execution cycles, $C_{total}$, is performed in two steps that use ComputeLoopCycles, an invocation of the PTX Analyzer to estimate the number of cycles spent in the body of the given loop, with the given active warp count and iteration count. ComputeLoopCycles is described in detail in Section III-B.
For every unroll configuration evaluated, the Driver stores the value of C total along with unroll configuration. When all unroll configurations have been evaluated, the one for which C total has the least value is chosen as the optimal unroll configuration.
It is important to note that we do not attempt to estimate the actual number of cycles spent inside the body of $L$. Instead, we assume that the original version of every loop $L$ has an iteration count $I$, which may not be exactly equal to or even close to the actual iteration count of $L$. As the body of $L$ is unrolled and analyzed, we divide $I$ by $UF_L$, the current unroll factor for $L$, and use that as the iteration count for estimating the cycles spent in the body of the unrolled loop. This gives us an estimate of the relative performance of the loop body under different unroll factors, instead of an estimate of actual performance. We use the relative differences in estimated performance to select the optimal unroll factor.
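The Python sketch below illustrates why an assumed iteration count is sufficient for ranking unroll factors. It combines the full and partial iterations in one plausible way (full iterations weighted by the estimate at the full warp count, partial iterations by the estimate at the partial warp count); that combination, the function name `total_cycles` and the toy analyzer used in the usage example are assumptions, not the framework's definition.

```python
def total_cycles(compute_loop_cycles, unroll_factor, assumed_iters,
                 full_iters, partial_iters, w_full, w_partial):
    """Relative cycle estimate for one unroll factor.

    compute_loop_cycles(unroll_factor, warps, iters) is a hypothetical
    stand-in for the PTX Analyzer.
    """
    iters = assumed_iters / unroll_factor   # I divided by UF_L
    cycles = 0.0
    if full_iters > 0:
        cycles += full_iters * compute_loop_cycles(unroll_factor, w_full, iters)
    if partial_iters > 0:
        cycles += partial_iters * compute_loop_cycles(unroll_factor, w_partial, iters)
    return cycles


def best_unroll_factor(compute_loop_cycles, factors, assumed_iters,
                       full_iters, partial_iters, w_full, w_partial):
    """Return the factor with the smallest relative estimate."""
    estimates = {uf: total_cycles(compute_loop_cycles, uf, assumed_iters,
                                  full_iters, partial_iters, w_full, w_partial)
                 for uf in factors}
    return min(estimates, key=estimates.get)
```

With a toy analyzer that charges a fixed, made-up cost per unrolled body (for example 40, 70, 120 and 300 cycles for factors 1, 2, 4 and 8), the selected factor is the same whether the assumed iteration count is 64 or one million, because the count scales every estimate equally.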
4) PTX Analyzer:
Input: Disassembled PTX of a CUDA kernel from DECUDA, and information about active warp count, iteration count and unroll factors from the Driver.
Output: Estimate of the number of cycles spent in executing the given CUDA kernel; estimate of the number of stall cycles.
Description:
The PTX Analyzer is the core analysis component of the framework. It reads the disassembled PTX file generated by DECUDA and discovers many useful attributes of the CUDA kernel represented by the PTX file. It is therefore a very useful tool for understanding the structure, behavior, and performance characteristics of the compiled CUDA kernel. Among the attributes gathered by the PTX Analyzer, the important ones are:
• Number of instructions in the kernel and categorization of their types.
• The ratio of arithmetic operations to memory operations.
• Control Flow Graph (CFG) of the kernel, which can be printed to file for visual inspection.
• Number of loops in the kernel, their nesting levels, the number and type of instructions in their bodies and their ratios.
• Estimate of the number of cycles spent in executing each of the identified loops.
We use the PTX Analyzer only for estimating the number of execution cycles in loop bodies.
The PTX Analyzer parses the PTX file output by DECUDA and, using its knowledge of the PTX ISA, identifies and categorizes the instructions as arithmetic operations, branch operations (conditional/unconditional), memory loads/stores, synchronization operations and other miscellaneous operations. Further, it also distinguishes between the different kinds of memory being accessed (global, local, shared). Once the instructions are categorized, the analyzer uses standard techniques described in compiler literature such as [8] for identifying basic blocks and the flow of control between them, and for constructing the CFG of the kernel.
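The standard leader-based construction of basic blocks and CFG edges can be sketched in Python as follows. This is a generic textbook-style sketch, not the PTX Analyzer's code: the instruction encoding is hypothetical, and the opcode names `bra` (unconditional branch) and `bra.cond` (conditional branch) are placeholders rather than exact PTX syntax.

```python
def build_cfg(instrs):
    """Split a linear instruction list into basic blocks and CFG edges.

    instrs: list of dicts with keys 'label' (str or None), 'op' (str) and
    'target' (label of the branch destination, or None).
    Returns (blocks, edges): blocks is a list of (start, end) index ranges
    with end exclusive; edges is a set of (from_block, to_block) pairs.
    """
    branches = ('bra', 'bra.cond')
    label_to_index = {ins['label']: i
                      for i, ins in enumerate(instrs) if ins['label']}
    # Leaders: the first instruction, every branch target, and every
    # instruction that follows a branch.
    leaders = {0} if instrs else set()
    for i, ins in enumerate(instrs):
        if ins['op'] in branches:
            leaders.add(label_to_index[ins['target']])
            if i + 1 < len(instrs):
                leaders.add(i + 1)
    starts = sorted(leaders)
    blocks = list(zip(starts, starts[1:] + [len(instrs)]))
    block_of = {start: b for b, (start, _) in enumerate(blocks)}
    edges = set()
    for b, (start, end) in enumerate(blocks):
        last = instrs[end - 1]
        if last['op'] in branches:
            edges.add((b, block_of[label_to_index[last['target']]]))
        if last['op'] != 'bra' and end < len(instrs):
            edges.add((b, block_of[end]))   # fall-through edge
    return blocks, edges
```

An edge from a block to itself or to an earlier block is a back edge, which is how a loop such as the single-block loop in a simple kernel shows up in the resulting graph.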
By making use of the low-level details about the CUDA kernel provided by the PTX Analyzer, we propose a technique to estimate the number of execution cycles spent in the bodies of various loops in the kernel, using which the Driver can select the optimal loop unrolling configuration. The following section describes the technique for estimating execution cycles in loop bodies.
B. Estimating the Number of Cycles
In this section, we describe the technique used to estimate the number of cycles spent in executing the body of a given loop/loop nest L. This estimate is used by the Driver to select the optimal unroll configuration as described in Section III-A3.
The key to estimating the number of execution cycles is to model the behavior of the CUDA scheduler, which is responsible for scheduling the thread blocks onto the SMs in the device and for switching from one warp to another when the current warp encounters a blocking instruction. The behavior of the scheduler has not been disclosed by nVidia, but previous research [9] has indicated that round-robin scheduling can be used to model it. We too use a model with round-robin scheduling of warps, i.e., when the current warp is switched out because it executes an instruction whose operands are not ready, we assume all the remaining active warps are scheduled before returning to the first warp. Also, we assume a static, equal distribution of thread blocks to SMs.
The procedure for estimating the total number of cycles spent in the body of a given loop, with a given active warp count and iteration count, is given in Figure 2. The main routine in estimation is ComputeLoopCycles, which is implemented inside the PTX Analyzer. ComputeLoopCycles takes as input $L$, the CFG representation of a loop/loop-nest, $W$, the number of warps currently active on the SM, and $I$, the assumed iteration count for all versions of loop $L$. The iteration count $I$ is divided by $UF_L$, the unroll factor for $L$, so as to amortize the cycle counts by the unroll factor. The procedure then walks through the instructions in the body of the loop from top to bottom, accumulating cycles based on the type of instruction seen, using the procedure ProcessInstr. If it detects an inner loop, it calls itself recursively, computes the cycle counts in the inner loop and updates the current cycle counter with the counts from the inner loop as well. When the cycle count for a single iteration of the complete loop body has been computed, it is multiplied by the iteration count (previously amortized by the unroll factor for the current loop) and returned to the Driver for further analysis. Since the loops accepted by the analyzer are free of conditional control flow in the loop body, the only branches present in the loop body will be the back edges from the loop footer to the loop header and the loop-exit branch. This makes the loop CFG traversal relatively straightforward.
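The recursive structure of this walk can be sketched in Python as below. The sketch is deliberately simplified: warp switching and the active warp count $W$ are left out, each instruction carries a fixed, made-up cycle cost in place of ProcessInstr, and the `Instr` and `Loop` classes are hypothetical stand-ins for the loop CFG.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Instr:
    cycles: int   # assumed issue cost of this instruction


@dataclass
class Loop:
    unroll_factor: int = 1
    body: List[Union[Instr, "Loop"]] = field(default_factory=list)


def compute_loop_cycles(loop, assumed_iters):
    """Estimated cycles for all iterations of `loop`, with the assumed
    iteration count I amortized by the loop's own unroll factor."""
    iters = assumed_iters / loop.unroll_factor
    per_iteration = 0.0
    for item in loop.body:                 # walk the body top to bottom
        if isinstance(item, Loop):         # inner loop: recurse
            per_iteration += compute_loop_cycles(item, assumed_iters)
        else:
            per_iteration += item.cycles   # stand-in for ProcessInstr
    return per_iteration * iters
```

For an outer loop containing one 2-cycle instruction and an inner loop unrolled by 2 whose body costs 8 cycles, an assumed iteration count of 16 gives (2 + 8 × 8) × 16 = 1056 cycles.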
ComputeLoopCycles maintains two different counters: $C_{current}$ tracks the number of cycles incurred since the current warp was scheduled back after a warp switch (or from the beginning of execution before the first switch), while $C_{total}$ tracks the total number of cycles spent across all warps since the beginning of execution. In addition, ComputeLoopCycles also creates a table $T$ of pending loads. The procedure ProcessInstr, outlined in Figure 3, is responsible for updating the counters and the table based on the type of instruction encountered.

Fig. 3. ProcessInstr (fragment): $C_{current}$ ← $C_{current}$ + CyclesForInstr(Instr); UpdatePendingLoads($T$, CyclesForInstr(Instr)); return
ProcessInstr is called from ComputeLoopCycles whenever it encounters a new instruction in the loop body. It examines the current PTX instruction and updates the cycle counts with appropriate values assumed to be provided by the generic procedure CyclesForInstr. As long as no instructions that cause a warp switch are encountered, $C_{current}$ is updated with the cycle count of the current instruction. When a load from global or local memory is seen, a new entry keyed on the destination register of the load is added to the table. Whenever the cycle counters are updated, the cycle counters corresponding to all the existing table entries are also updated. This is performed by the procedure UpdatePendingLoads.
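The bookkeeping for pending loads can be sketched in Python as follows. This is an assumption-laden sketch, not the procedure in Figure 3: the instruction encoding, the `state` dictionary and the 400-cycle load latency are illustrative choices, and the handling of the warp switch itself is left to the caller, which receives the list of operands that are not yet ready.

```python
GLOBAL_LOAD_LATENCY = 400   # cycles; illustrative value, not from the paper


def update_pending_loads(table, cycles):
    """Advance the elapsed-cycle counter of every pending load."""
    for dest in table:
        table[dest] += cycles


def process_instr(instr, table, state, cycles_for_instr):
    """Update C_current and the pending-load table for one instruction.

    instr: dict with 'op', 'dest' and 'srcs' (hypothetical encoding).
    table: maps a load's destination register to cycles elapsed since issue.
    state: dict holding 'c_current'.
    Returns the source registers whose loads have not yet had enough
    cycles to complete; a non-empty list signals a warp switch.
    """
    unready = [reg for reg in instr['srcs']
               if reg in table and table[reg] < GLOBAL_LOAD_LATENCY]
    cost = cycles_for_instr(instr)
    state['c_current'] += cost
    update_pending_loads(table, cost)
    if instr['op'] in ('ld.global', 'ld.local'):
        table[instr['dest']] = 0    # new entry keyed on the destination register
    return unready
```

Keying the table on the destination register makes the later check a simple lookup of each source operand.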
For each instruction examined, ProcessInstr checks if any of the source operands of the instruction is a destination of a pending load present in the table. If it is, and the number of cycles corresponding to the load is not sufficient to hide the latency of the memory load, then a warp switch is caused and the cycle counters are updated accordingly. Since this load is

Fig. 4. ComputeFinalCycles: Estimate the number of cycles spent between the last blocking instruction (global/local load, barrier sync) and the last instruction in the loop
Input: CFG of input loop: $L$
Output: Estimated number of cycles spent between the last blocking instruction and the last instruction in the body of loop $L$
1) $C_{final}$ ← 0
2) for each basic-block B in $L$, from loop footer to loop header, do
a) for each instruction I in B, from last to first, do
if I is a blocking instruction,

If loop $L$ contains an instruction that could cause a warp switch, then we need to compute what are called final cycles. Calculation of final cycles is necessary to increase the accuracy of the estimation of executed cycle counts, and they specifically model the following scenario:
Starting from iteration 2 through iteration $I - 1$ of the loop $L$, when $W_0$, the first warp in the set of active warps, executes $Instr_{FWS}$, the first instruction in the loop body that causes a warp switch, the rest of the warps $W_1$ through $W_{W-1}$ are blocked at $Instr_{LWS}$, the last instruction that caused a warp switch in the previous iteration of the loop. Therefore, each warp in $W_1$ through $W_{W-1}$ now resumes execution at $Instr_{LWS}$ and continues until it reaches $Instr_{FWS}$ in its next loop iteration. Therefore, the number of instructions available to hide the latency of $Instr_{FWS}$ is not just the instructions that precede $Instr_{FWS}$ in the loop body, but also those in the range $(Instr_N - Instr_{LWS})$, where $Instr_N$ is the last instruction in the loop body. The number of cycles accumulated in the range $(Instr_N - Instr_{LWS})$ is called Final Cycles and is computed by the procedure ComputeFinalCycles, described in Figure 4.
Strictly speaking, ComputeFinalCycles may not be completely accurate in some cases. It is likely to be accurate when $Instr_{LWS}$ is a barrier synchronization operation, but when $Instr_{LWS}$ is a global/local memory load, the actual warp switch will take place at a later instruction, when the destination register of the load is read. Also, the memory load may have completed and may not result in a warp switch. But in practice, the technique used in ComputeFinalCycles is a conservative approximation and will usually not result in over-computing final cycles. This approximation is unlikely to significantly affect the relative performance differences required for predicting the optimal unroll factors.
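The backward walk performed by ComputeFinalCycles can be sketched in Python as follows. The walk order follows the description above (loop footer to loop header, last instruction to first); stopping at the first blocking instruction met and returning zero when the loop has none are assumptions, since final cycles are only needed for loops that can cause a warp switch. The instruction encoding and the callables are hypothetical.

```python
def compute_final_cycles(blocks, is_blocking, cycles_for_instr):
    """Cycles between the last blocking instruction and the end of the loop.

    blocks: basic blocks of the loop in header-to-footer order, each a list
    of instructions. is_blocking and cycles_for_instr are caller-supplied.
    """
    c_final = 0
    for block in reversed(blocks):        # loop footer to loop header
        for instr in reversed(block):     # last instruction to first
            if is_blocking(instr):
                return c_final            # reached the last blocking instruction
            c_final += cycles_for_instr(instr)
    return 0   # no blocking instruction: final cycles do not apply (assumed)
```

For a loop whose last block ends with a barrier followed by two 4-cycle arithmetic instructions, the function returns 8.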
C. Search Space Pruning
The GPU does not attempt to reduce latencies of accessing global and local memories through the use of caches. Instead, it relies on sufficient ILP or a large number of active warps, or usually a combination of the two, to tolerate such long latencies. Therefore, a reduction in ILP or a reduction in the number of active warps, or occupancy, can lead to a situation where there are not enough instructions or warps to cover the latency of a pending global memory load (or, less likely, the arithmetic pipeline). In such a situation, the GPU device stalls without performing any useful computation, and this situation is detrimental to performance.
Since loop unrolling can boost ILP, but can increase register usage, which in turn can reduce occupancy, the effects of unrolling on stall cycles cannot be modeled directly. However, we can extend the cycle estimation framework discussed in previous sections to also estimate stall cycles for any given loop and unroll factor. Figure 5 describes an extension to the procedure ProcessInstr, which allows the PTX Analyzer to also estimate the number of stall cycles incurred during the execution of the loop. Stall cycles, $C_{stall}$, are characterized as the number of cycles spent by the GPU device without executing any useful computation, waiting for a pending load to complete. When a warp that is executing issues an instruction whose source operands are the targets of a pending load, the scheduler switches to other warps that are ready to execute. However, after executing all these warps, when the first warp is scheduled again, if the load has not yet completed, then there are no more ready warps to switch to, and the GPU is forced to stall. When the product of the number of cycles since the last warp switch, $C_{current}$, and the active warp count, $W$, is greater than the latency of the memory load, there will be no stall cycles. This situation can be improved further if $C_{ilp}$, the number of cycles between the issue of the load and the use of the load, is high, since $C_{ilp}$ will reduce the latency of the global load accordingly.
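The stall condition described above can be written as a small Python function. This is a sketch of one reading of the text, not the extension shown in Figure 5: the load latency left after subtracting $C_{ilp}$ is compared against the cycles covered by round-robin execution of the active warps, and any shortfall is counted as stall. The 400-cycle latency in the usage note is an illustrative value.

```python
def stall_cycles(load_latency, c_current, active_warps, c_ilp=0):
    """Cycles the SM idles waiting for a pending load (0 if fully hidden).

    load_latency : latency of the global/local load, in cycles
    c_current    : cycles since the last warp switch
    active_warps : active warp count W
    c_ilp        : cycles between the issue of the load and its first use
    """
    remaining = load_latency - c_ilp       # latency left to hide after ILP
    covered = c_current * active_warps     # cycles supplied by the warps
    return max(0, remaining - covered)
```

With an assumed 400-cycle load and 20 cycles between warp switches, 32 active warps hide the load completely, whereas 8 active warps leave 240 stall cycles, or 140 if 100 cycles of independent work separate the load from its use.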
The estimate of stall cycles can be used by the Driver to stop searching for optimal unroll factors. When the total stall cycles incurred by the loop body increase as a result of a reduction in GPU occupancy, it is an indication that the occupancy levels have fallen below the minimum number of warps required to keep the GPU fully busy, or at least as busy as the original version of the loop. It is also an indication that a boost in ILP is unlikely to overcome the reduction in performance due to the drop in occupancy, i.e., we have reached a point of diminishing returns and further search may be unnecessary.
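The pruning rule can be sketched in Python as below. The sketch assumes the Driver keeps, for each evaluated unroll factor, the active warp count and the estimated stall cycles; the function name `prune_point` and the history format are hypothetical.

```python
def prune_point(history):
    """Return the first unroll factor at which occupancy dropped and stall
    cycles rose relative to the previous factor, or None if there is none.

    history: list of (unroll_factor, active_warps, stall_cycles) tuples in
    the order in which the factors were evaluated.
    """
    for (_, prev_warps, prev_stall), (uf, warps, stall) in zip(history, history[1:]):
        if warps < prev_warps and stall > prev_stall:
            return uf
    return None
```

For a made-up history in which the warp count falls from 32 to 8 while the stall estimate stays at zero until the last step, the function returns the factor at that last step, which is where the search could stop.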
The results in Table II demonstrate the effect of computing stall cycles for the MonteCarlo simulation kernel on the GTX 280. No stall cycles are estimated for the original version with no unrolling. As successive unroll factors are compiled and evaluated, ILP appears to increase; register usage rises as well, but performance still improves because there are enough active warps to mask all the load latencies. However, at an unroll factor of 16, the maximum occupancy drops to just 1 thread block, or 8 warps, per SM; this results in GPU stalls, and performance drops significantly. This indicates that we have reached a point of diminishing returns, and the Driver could stop evaluating more unroll factors at this point.
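The link between per-thread register usage and occupancy can be made concrete with a simplified sketch. It assumes the per-SM limits of compute capability 1.3 devices such as the GTX 280 (16384 registers, 32 resident warps, 8 resident blocks) and a block size of 256 threads, inferred from the "1 thread block or 8 warps" figure above. It ignores register allocation granularity and shared memory, and the register counts in the loop are hypothetical, not the values from Table II.

```python
WARP_SIZE = 32

# Assumed per-SM limits for a compute capability 1.3 device (e.g. GTX 280).
GTX280 = {"regs_per_sm": 16384, "max_warps_per_sm": 32, "max_blocks_per_sm": 8}


def resident_blocks(regs_per_thread, threads_per_block, sm=GTX280):
    """Thread blocks that fit on one SM, limited by registers, warps, blocks."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    by_regs = sm["regs_per_sm"] // (regs_per_thread * threads_per_block)
    by_warps = sm["max_warps_per_sm"] // warps_per_block
    return min(by_regs, by_warps, sm["max_blocks_per_sm"])


def active_warps(regs_per_thread, threads_per_block, sm=GTX280):
    """Active warps per SM for the given register usage and block size."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    return resident_blocks(regs_per_thread, threads_per_block, sm) * warps_per_block


# Hypothetical register counts for increasingly unrolled versions of a loop.
for regs in (16, 21, 32, 40):
    print(regs, resident_blocks(regs, 256), active_warps(regs, 256))
```

Under these assumptions, going from 16 to 40 registers per thread reduces the resident blocks from 4 to 1, i.e. from 32 active warps to 8.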
IV. EXPERIMENTAL RESULTS
In this section we present the results of using the system described in the above sections to select optimal unroll factors for loops in CUDA programs. For our experiments, we used two systems with the following configurations:
1) CPU: 8-core Intel Core i7 @ 2.67 GHz with 6 GB RAM, GPU: nVidia GTX 280 @ 1.3 GHz, with 30 SMs (240 SPs) and 1 GB DRAM, CUDA version: 2.1, Optimization level: O3
2) CPU: Intel Core2 Duo @ 2.13 GHz with 4 GB RAM, GPU: nVidia 8800 GTX @ 675 MHz, with 16 SMs (128 SPs) and 768 MB DRAM, CUDA version: 2.0, Optimization level: O3
We used CUDA kernels from the nVidia CUDA SDK and the Parboil Benchmark Suite from UIUC to demonstrate the effectiveness of the selection of optimal unroll factors. Table III lists the predicted cycles and the measured execution time for varying loop unroll factors in the benchmarks. Though a large number of unroll factors were explored during the course of the experiments, only a few are presented here for the sake of brevity and clarity.
Figures 6a through 9b compare the estimated and measured performance of each unroll factor evaluated, relative to the estimated and measured performance of the original program. An interesting case was the N-Body simulation. While selecting the optimal unroll factors for the kernel on the 8800 GTX with NVCC 2.0, we observed that the per-thread register usage did not increase significantly even up to unroll factors of 256; in fact, the performance of the loop increased steadily with each increase in the unroll factor until it reached 256, which is an upper limit on the depth of unrolling. From Table III, we can see that 256 is the optimal unroll factor for the N-Body kernel on the 8800 GTX. However, on the GTX 280, with NVCC 2.1, the per-thread register usage increased from 22 to 58 at unroll factor
The Black-Scholes computation kernel that is part of the nVidia CUDA SDK was configured by the original authors to use a maximum of 16 registers per thread in order to maintain high GPU occupancy. When the loop in the kernel was unrolled 4 times with the register limit at 16, there was a 17 percent speedup. However, when the register limit was removed, unrolling the same loop did not yield a similar speedup. In the latter case, since there was no register limit, per-thread register usage increased beyond 16, leading to reduced GPU occupancy and therefore reduced performance. Again, by computing the right occupancy factors, the Driver was able to correctly estimate the relative performance behavior of all the unrolled versions, with and without the register limit. Details can
V. RELATED WORK
Loop unrolling for CPU based programs has been studied for many decades now. Techniques for selecting optimal loop unroll factors for CPU based programs have been proposed in [4], [10]-[12].
However, as we have described in Section II, the architectural differences, programming model requirements, and the different resource constraints of GPGPU dictate that the results from CPU based programs cannot be carried over directly to GPGPU programs.
In the GPGPU literature, it has been established that loop unrolling is a beneficial optimization for GPGPU programs [13]-[15], and attempts have been made to identify optimal loop unroll factors. [15] describes how loop unrolling can be used to reduce the dynamic instruction count in GPGPU programs and thereby increase performance, but does not describe a systematic way of analyzing unroll factors and selecting the best among them. [13] and [14] both use empirical search: they explicitly run various versions of the original program with different unroll factors and use the actual run times to guide the search further or to choose the best unroll factor. However, empirical search has two drawbacks. First, the cost of running various versions of the program before picking the best one can be very high, especially if the program has a long running time. Second, empirical search, like other dynamic analysis techniques, is usually sensitive to program inputs. The results seen with a particular input may not be reproducible with a different input, so the empirical search would have to be conducted for all possible program inputs, which may not be feasible. [13] attempts to improve the quality of the search by using an analytical model to guide it, and [14] uses an adaptive compiler system to reduce the scope of the empirical search. But neither technique performs static, compile-time analysis that avoids running different versions of the program, and neither approach is completely input-agnostic. To the best of our knowledge, there is no other compile-time, static analysis technique that attempts to select optimal loop unroll factors for loops in GPGPU programs, like the technique described in this paper.
The technique described here analyzes the disassembled PTX code to estimate the number of cycles executed in the body of the given loop with an estimated trip count, and produces a relative ordering of the performance benefits for various unroll factors. The best loop unroll factor is chosen from this relative ordering. Also, since our technique is static, it is input-agnostic.
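Choosing a factor from a relative ordering can be sketched as below. The sketch assumes that estimated cycle counts are already available for each unroll factor, with factor 1 standing for the original loop; `rank_unroll_factors` and the cycle counts are hypothetical and are not results from this paper.

```python
def rank_unroll_factors(estimated_cycles):
    """Order unroll factors by estimated speedup over the original loop.

    estimated_cycles maps unroll factor -> estimated cycles for the loop;
    factor 1 is assumed to be the original, non-unrolled version.
    """
    base = estimated_cycles[1]
    speedups = {f: base / c for f, c in estimated_cycles.items()}
    # Highest estimated relative performance first.
    return sorted(speedups.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical cycle estimates for four versions of one loop.
ranking = rank_unroll_factors({1: 1000, 2: 800, 4: 500, 8: 625})
best_factor = ranking[0][0]
print(ranking)
print(best_factor)
```

Only the ordering matters for the selection, so the absolute cycle estimates need not match measured execution times.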
VI. CONCLUSION
Although loop unrolling is a classical compiler optimization that has been available for CPU programs for many years, it is not yet widely implemented in current GPGPU compilers. This is primarily due to the complex interplay between the GPU architecture, programming model, resource constraints and loop unrolling, which is not clearly characterized. In this paper, we have examined the factors that influence loop unrolling for GPGPU programs and described the tradeoff between ILP and GPU occupancy. We have developed a technique to select optimal unroll factors based on the estimation of relative performance. Further, we have described a technique for pruning the search space of unroll factors and provided experimental results to show the effectiveness of the techniques described. Future work in this area can extend these techniques to newer GPGPU programming models such as OpenCL, as well as to other compiler optimizations, such as unroll-and-jam of nested loops. Also, dynamic information, such as that from a profiler, could be used to enable unrolling of loops with conditional control flow in the loop body.
