Abstract. DOALL loops are tiled to exploit DOALL parallelism and data locality on GPUs. In contrast, due to loop-carried dependences, DOACROSS loops must be skewed first in order to make tiling legal and exploit wavefront parallelism across the tiles and within a tile. Thus, tile size selection, which is performance-critical, becomes more complex for DOACROSS loops than DOALL loops on GPUs. This paper presents a model-driven approach to automating this process. Validation using 1D, 2D and 3D SOR solvers shows that our framework can find the tile sizes for these representative DOACROSS loops to achieve performances close to the best observed for a range of problem sizes tested.
Introduction
GPGPUs have become one of the most powerful and popular platforms to exploit fine-grain parallelism in high performance computing. Recent research on developing programming and compiler techniques for GPUs focuses on (among others) general programming principles [5, 9, 14] , cost modeling and analysis [1, 15, 3] , automatic code generation [2, 11] , and performance tuning and optimization [4, 6, 12, 19] . However, these research efforts are almost exclusively limited to DOALL loops. In practice, DOACROSS loops play an important role in many scientific and engineering applications, including PDE solvers [13] , efficient preconditioners [7] and robust smoothers [8] . Presently, Pluto [2] seems to be the only framework that can map sequential DOACROSS loops to CUDA code automatically for NVIDIA GPUs. This is done by applying loop skewing and tiling with user-supplied tile sizes (for a user-declared grid of thread blocks of threads).
DOALL loops are tiled to exploit DOALL parallelism and data locality on GPUs. Unlike DOALL loops, DOACROSS loops must be skewed first to ensure that the subsequent tiling transformation preserves the loop-carried dependences. Furthermore, performing skewing and tiling allows wavefront parallelism to be exploited both across the tiles and within a tile. Tile size selection, which is performance-critical on GPUs, are more complex for DOACROSS than DOALL loops due to parallelism-inhibiting loop-carried dependences and more complex interactions among the GPU architectural constraints. Thus, it is not practical to rely on the user to pick the right tile sizes to optimize code through improving processor utilization and reducing synchronization overhead. Existing tile size techniques proposed for caches in CPU architectures do not apply [10, 20] .
This paper makes the following contributions:
-We present (for the first time) a model for estimating the execution times of tiled DOACROSS loops running on GPUs (Section 3); -We introduce a model-driven framework to automate tile size selection for tiled DOACROSS loops running on GPUs (Section 4); -We evaluate the accuracy of our model using representative 1D, 2D and 3D SOR solvers and show that the tile sizes selected lead to the performances close to the best observed for a range of problem sizes tested (Section 5).
Parallelization of DOACROSS loops on GPUs
We describe a scheme for mapping sequential DOACROSS loops to CUDA code on GPUs. Our illustrating example is a 1D SOR-like solver. This scheme is the same as that supported by Pluto [2] except tiles are mapped to thread blocks in a different way in order to achieve better load balance.
for(i1=1;i1<=I1;i1++) for(i2=1;i2<=I2;i2++) Loop Transformations In Pluto, parallelizing an n-dimensional DOACROSS loop nest L consists of mapping it into a 2n-dimensional loop nest as follows:
The mapping process for the 1D SOR solver in Figure 1 is illustrated in Figure 2 . The mapping ρ [18] is realized by composing a loop skewing S, a loop tiling T and another loop skewing W. First, the iteration space of L is skewed by a unimodular transformation S ∈ Z n×n . Second, the skewed iteration space is tiled into n-dimensional rectangles of size m 1 × · · · × m n by T . S is chosen to guarantee the legality of tiling so that all loop-carried dependences in L are preserved [18] . At this point, a 2n-dimensional loop nest is created such that the first n loops (called tile loops) enumerate the tiles and the inner n loops (called element loops) enumerate the iterations within a tile. Finally, another skewing transformation W is applied to the iteration spaces of both sets of loop nests to expose wavefront parallelism across the tiles and within a tile. In either loop nest, the first loop is sequential and the remaining n − 1 loops are DOALL. We will speak of inter-tile wavefronts and intra-tile wavefronts (as shown in Figure 2 ).
Mapping to GPUs A NVIDIA GPU consists of a number of streaming multiprocessors (SMs), each of which contains a number of processor cores called streaming processors (SPs). All SPs in one SM share a local memory and a set of registers. GPU programming is enabled through CUDA. A kernel is executed by a grid of threads organized as thread blocks (known as a thread organization). A thread block is scheduled to execute on any one of the SMs (as a whole). The threads in a block are partitioned into 32-thread warps, which are units of execution (on the SPs of the SM to which the block is mapped). The threads in one block can synchronize through syncthreads and communicate through shared memory. The inter-block synchronization is not directly supported. Figure 3 gives the CUDA code for the 1D SOR solver parallelized as shown in Figure 2 . Tiles are mapped to thread blocks and individual loop iterations in a tile are mapped to the threads in a block. All inter-tile wavefronts are executed sequentially to satisfy inter-tile (or inter-block) dependences. Hence, the syncblocks macro as introduced in Pluto at 4 . In Pluto, the tiles in an (n − 1)-dimensional inter-tile wavefront are distributed over a 2D grid of thread // inter-tile loop nest for(t1=Lt1;t1<=Ut1;t1++){ for(t2=Lt2(t1)+blockIdx.x;t2<=Ut2(t2);t2+=gridDim.x){ // intra-tile loop nest Code for shared memory coalesced loads __syncthreads(); // barrier for the loads for(e1=Le1(t1,t2);e1<=Ue1(t1,t2);e1++){ for(e2=Le2(t1,t2,e1)+threadIdx.x;e2<=Ue2(t1,t2,e1);e2+=blockDim.x){ i2=h(t1,t2,e1,e2);
barrier for each intra-tile wavefront } Code for shared memory coalesced stores __syncthreads(); // barrier for the stores } __syncblocks(); // barrier for each inter-tile wavefront } blocks of size gridDim.x × gridDim.y cyclically along two of the n−1 dimensions of the wavefront. This can cause load imbalance for large tiles since a wavefront has irregular boundaries. In this paper, the tile coordinates in such a wavefront are "linearized" much like how the subscripts of a multi-dimensional array are. Then the tiles are mapped to a 1D grid of thread blocks of size gridDim.x cyclically to achieve better load balance (with gridDim.y=1 always). As a result, all thread blocks in an inter-tile wavefront can be potentially executed in parallel but the tiles within a block are always executed sequentially.
The loop iterations in a tile are distributed as in Pluto to a 3D thread block of size blockDim.x × blockDim.y × blockDim.z cyclically. To improve memory coalescing, all data read by a tile are first loaded from device memory to shared memory at 1 and all those written in a tile are stored back to device memory at 3 . Like inter-tile wavefronts, all intra-tile wavefronts are executed sequentially. Hence, the syncthreads instruction at 2 . The iterations in an intra-tile wavefront that are assigned to different threads can execute in parallel.
Execution Time Modeling
We parameterise an execution time model for a tiled DOACROSS loop nest in order to automate tile size selection. Initially, we assume that all tiles are full. We consider first intra-tile execution (Section 3.1) and then inter-tile execution (Section 3.2). In Section 3.3, we estimate the parameters used. In Section 3.4, we discuss briefly how to mitigate the effects of border tiles on performance.
Intra-Tile Execution
This section focuses on estimating the execution time, T T ILE , for a single (full) tile, denoted T ILE. As shown in (1) and Figures 2 and 3 , the loop iterations in a tile are indexed by (e 1 , . . . , e n ), where e 1 enumerates all intra-wavefronts within the tile. As illustrated in Figure 2 , T T ILE , which can be broken down into the time on loading the input data at 1 , the time on executing the tile, and the time on storing the results back 3 , is approximated by:
The first term Ue1 e1=Le1 T e1 is the computation cost of T ILE estimated as a sum of the execution times T e1 of all its intra-tile wavefronts with e1 ranging over these wavefronts starting from the smallest given by the lower bound L e1 of loop e1 to the largest given by the upper bound U e1 of loop e1 along dimension e1. The second term T mem denotes the memory latency consumed by the memory accesses at the code before 1 and the code before 3 . The last term 2σ thd denotes the overhead of the two syncthreads at 1 and 3 , where σ thd is dependent on the number of threads used, i.e., blockDim.x × blockDim.y × blockDim.z.
The execution time T e1 of the intra-tile wavefront indexed by e1 is given by:
GPUs execute instructions with warps as units of execution and hide memory latency through interleaving of thread blocks. In the scheme shown in Figure 3 , the warps are never idle when executing a wavefront as all memory accesses happen before and after the execution of T ILE. Thus, the first term represents the workload for computing the wavefront indexed by e 1 , which is estimated to be proportional to G e1 , the number of 32-thread warps executed at the wavefront. In addition, the first term implicitly considers the effects of bank conflicts on the execution time of T ILE. However, the same G e1 may result when the number of loop iterations, H e1 , in the wavefront indexed by e1 varies (due to division by 32). To accommodate its impact on performance, T e1 is fine-tuned by including the second term β × H e1 , which attempts to differentiate the effects of varying H e1 values on performance. Note that G e1 and H e1 may vary from wavefront to wavefront as shown in Figure 2 (d). Given an intra-tile wavefront, both can be precisely calculated. The last term σ thd is the overhead of syncthreads at 2 . By substituting T i e1 in (4) into (3), we obtain:
which is illustrated graphically in Figure 4 . As highlighted, G e1 varies across the wavefronts with less work being done at the beginning and end of the computation process for T ILE when its wavefronts are executed. The memory access latency T mem can be estimated by:
where N mem denotes the number of loads and stores made in T ILE. Note that N mem is not related to memory coalescing since that would make its estimation dependent on the actual data layout at run time. This simple approximation seems to be adequate as T mem is not a dominant term in (3) for the following reasons combined. First, DOACROSS loops are usually not bandwidth-bound. Second, optimal tile sizes found are large in order to exploit two levels of wavefront parallelism. Finally, the memory accesses performed by warps can overlap. We will return to this issue briefly at the end of Section 5. By substituting T mem given in (6) into (5), simplifying and letting
we obtain the following estimated execution time of T ILE:
with U e1 −L e1 +1 syncthreads instructions executed at 2 inside the wavefront, one at 1 and one at 3 , as shown in Figure 3 .
Inter-Tile Execution
A DOACROSS loop nest is parallelized into a CUDA kernel. The execution time, T total , for the kernel, i.e., for its inter-tile wavefronts is estimated by:
The first term is the computation cost of all tiles in the kernel estimated as a sum of the execution times T t1 of all its inter-tile wavefronts with t1 starting from the lower bound L t1 of loop t1 to the upper bound U t1 of loop t1 along dimension t1. The second term σ ker is the kernel startup cost. Thus, U t1 − L t1 + 1 is the number of tiles contained in the inter-tile wavefront indexed by t1. If all P SMs execute simultaneously up to B thread blocks each, then the number of tiles, denoted I t1 , contained in a thread block is:
where B is decided by the GPU architectural constraints and kernel code according to the CUDA programming guide as demonstrated in Table 1 . The execution time T t1 is determined by the slowest among the P SMs with the other SMs idle waiting at the syncblocks macro at 4 . As a result, we have:
where T i t1 is the execution time of the i-th tiles in all B thread blocks by the slowest SM and σ blk is the overhead of syncblocks at 4 (to be measured below).
Parameter Estimation
We determine the six parameters used in T total given in (9) for a tiled loop nest L: σ ker , σ thd , σ blk , α, β and γ. We do so for a given thread organization (determined by gridDim and blockDim) so that its tile size selection can be automated (Section 4). For NVIDIA GPUs, there are at most 16B max different thread organizations because (1) there are B max different 1D grid layouts with gridDim.x = B × P , where B ≤ B max ≤ B SM = 8 as shown in Table 1 , and (2) the number of threads per block, i.e., blockDim.x × blockDim.y × blockDim.z is one of the 16 possibilities contained in {32, 64, . . . , 512}. Table 1 . Determining B for an NVIDIA Tesla C1060 GPU. An item in Column 1 that depends on hardware, kernel code or both is indicated with an H, S or B appropriately.
Architectural Parameters: σ ker , σ thd and σ blk These overheads are small (relative to the execution time of a loop nest L) and are measured for a GPU architecture as follows. First of all, σ ker is the startup overhead of the kernel for L, which can be obtained through running an empty version of the kernel (with the computations in L removed) for a given thread organization. In fact, as σ ker ≪ T total , treating it as a small constant for all thread organizations does not affect in practical terms how the relative performances of L are ranked for all combinations of thread organizations and tile sizes used. As for syncthreads executed at 1 , 2 and 3 , it is lightweight on NVIDIA GPUs. Its overhead σ thd depends on the number of threads per block, i.e., blockDim.x × blockDim.y × blockDim.z and is measured as in [16] . There are only 16 cases to consider as blockDim.x × blockDim.y × blockDim.z is a multiple of 32 ranging from 32 to 512. Finally, the syncblocks macro is invoked at the end of each inter-tile wavefront at 4 . Its overhead σ blk , which is higher than σ thd , depends mainly on the number of thread blocks contained in an inter-tile wavefront, i.e., gridDim.x = B × P . The effects of different blockDim.x × blockDim.y × blockDim.z values on σ blk are negligible. As B ≤ B max ≤ B SM = 8, syncblocks is measured as in [17] for a few, i.e., up to B max different gridDim.x values.
Program-Dependent Parameters: α, β and γ Once the values of σ ker , σ thd and σ blk are determined, the given loop nest L is simplified to possess one inter-tile wavefront with exactly B × P thread blocks consisting of only full tiles. This ensures that all P SMs have exactly the same workload so that these three program-dependent parameters can be accurately measured.
The three parameters are found for each of up to 16B max different thread organizations as mentioned earlier (where B max ≤ 8). In each case, the simplified loop nest L is executed for a total of n times, each with a different tile size. Let T i be the execution time corresponding to the tile size S i used. Given a tile size S i , all parameters in T total except α, β and γ are now known. We can find the values of α, β and γ by performing a linear curve fitting using the least-square method for T total with respect to the n execution times, T 1 , . . . , T n , obtained.
1 Compute the register usage per thread, RT , using any tile size and thread organization. 2 for each tile size m = (m1, . . . , mn) that satisfies the tile size constraint 3
Let ST B (shared memory usage per block) be set as the shared memory usage per tile 4 for each t = (blockDim.x, blockDim.y, blockDim.z) that satisfies the blockDim constraint 5
Let RT B = RT × blockDim.x × blockDim.y × blockDim.z 6 Let B = min(BW , BR, BS ), where BW , BR and BS are computed in Table 1.  7 Evaluate T total given in (9) for the current tile size m and the current thread organization specified by gridDim = B × P and blockDim = t 8 if T total < T best // T best is initialized to ∞ 9 T best = T total 10
Record m as the best tile size so far (and set gridDim.x = B × P and blockdDim = t) Fig. 5 . An algorithm for automating tile size selection.
Border Tiles
A border tile may execute faster than a full tile. If the i-th inter-tile wavefront that induces T i t1 in (11) contains non-full border tiles, then T i t1 may overapproximate the actual execution time of the wavefront. We can improve this inaccuracy with an estimate of 0.5 × T T ILE as the execution time of a border tile T ILE by assuming that the average size of border tiles is half of a full tile.
Model-Driven Tile Size Selection
Given the estimated execution time of T total in (9) for a tiled loop nest L as input, we employ an "educated" search to find automatically and efficiently an optimal tile size m = (m 1 , . . . , m n ) for L and an associated thread organization, determined by gridDim and blockDim, used for realizing the optimal tiling. In this paper, a tile layout is determined by a tile size and a thread organization.
The Algorithm
We use two kinds of constraints to prune the search space:
Tile Size Constraint The tile size, i.e., L T ILE = m 1 × · · · × m n is bounded from below by a data reuse rate D =
LT ILE
Nmem (where N mem is introduced in (6)) and from above by the size of shared memory. For DOACROSS loops, large tile sizes lead to higher data reuse rates. Thus, D must be larger than an empirical minimum threshold to ensure better intra-tile data locality. blockDim Constraint In NVIDIA GPUs, blockDim.x × blockDim.y × blockDim.z represents the number of threads per block. According to [16] , the SP performance usually suffers with too many or too few threads. Furthermore, blockDim.x × blockDim.y × blockDim.z must be no smaller than the number of iterations contained in the largest intra-tile wavefront to ensure that every thread has some work to do. Thus, some small and large values of blockDim.x × blockDim.y × blockDim.z can be ignored.
Our algorithm for automating tile size selection for L is outlined in Figure 5 . Recall that as shown in Figure 3 , all tiles in a thread block are executed sequentially. Thus, for every type of resource listed in Table 1 , the amount consumed by a block is calculated on a per-tile basis. The basic idea is to perform an "educated" search when going through all tile layouts to find the one with the smallest execution time T total . In line 1, the register usage per thread, denoted R T , is measured independently of tile layouts used. This is because in each case the same code as shown in Figure 3 is compiled for each thread by NVIDIA's nvcc compiler. Finding R T this ways speeds up the process for calculating R T B in line 5. Similarly, in line 3, S T B does not depend on blockDim. Once R T B and S T B are known, B W , B R and B S are computed in line 6 as per Table 1 .
The Framework
We have implemented our tile size selection technique using a combination of the Clan polyhedral representation extractor, Pluto's polyhedral parallel tiling infrastructure and CLooG code generator, as shown in Figure 6 . Our tile size selection module is invoked in the third step in the sequence.
Experiments
We use three representative DOACROSS kernels, 1D (3-point), 2D (5-point) and 3D (7-point) SOR solvers, to demonstrate the accuracy and efficiency of our tile size selection framework on an NVIDIA Tesla C1060 GPU (c.f. Table 1 ). Four problem sizes are discussed for each kernel, representing 12 different optimization problems for which best tile layouts (tile size/blockDim combinations) are solved.
Accuracy It is impractical to measure the accuracy of our tile size selection framework for a kernel by comparing the actual execution time of the best tile layout found with the execution times of all tile layouts.
We have decided to evaluate this work empirically as is often done in automated performance tuning. For each of the 12 optimization problems discussed here, we have randomly sampled 1000 different tile layouts. The largest relative error (between the estimated execution time T total and actual execution time) is observed to be within 6.05%. To see this graphically, the relative errors of 100 sampled tile layouts for each optimization problem are plotted in Figures 7 -9 . Let us look at the actual performance gap between the tile layout found by us and the best-performing one in each case. Let us consider a generic optimization problem O. Let T m total and R m total be the estimated and actual execution times of any tile layout m for O with the relative error being e m . In particular, let T opt total and R opt total be the estimated and actual execution times of the the best tile layout opt predicted for O with the relative error being e opt . The (worst) performance gap between opt and the best-performing one, m, is bounded by ( 1+em 1+eopt − 1) × 100%, when R opt − R m = T opt total /(1 + e opt ) − T m total /(1 + e m ), i.e., e m is the largest, where R opt > R m . Based on our sampled tile layouts, the performance gaps are found to be 5.29%, 0.51%, 2.10% and 4.92% for the four problem sizes of 1D SOR (displayed from left to right in Figure 7 ), 4.33%, 1.31%, 5.14% and 2.01% for the four problem sizes of 2D SOR (Figure 8 ), and 0.66%, 2.28%, 6.05% and 1.30% for the four problem sizes of 3D SOR (Figure 9 ). Search Time Our algorithm is efficient in finding the best tile layout for a loop nest (on an Intel Xeon 2.0 GHz CPU). When tiling an n-dimensional loop nest that represents an (n − 1)-D SOR solver with a tile size m = (m 1 , . . . , m n ), m 1 represents the time dimension and m 2 , . . . , m n represent the n − 1 spatial dimensions for the underlying mesh. Due to loop skewing, the worksets of different time slices of a tile are also skewed [10] . Thus, the data reuse rates of a tile for the 1D, 2D and 3D SOR solvers are expressed as a function of m and are bounded from above by m 2 , m2m3 m2+m3 and m2m3m4 m2m3+m2m4+m3m4 , respectively, when m 1 → ∞. For the 1D SOR solver, the data reuse rate induces a tile size constraint:
LT ILE Nmem ≥ 300, where the threshold 300 is empirically set (Section 4.1). The search time is 238 secs over a search space of 3 × 10 6 tile layouts. For the 2D SOR solver, the tile size constraint is LT ILE Nmem ≥ 6. The search time is 369 secs over a search space of 3.2 × 10 6 tile layouts. For 3D SOR, the data reuse rate imposes LT ILE Nmem ≥ 1. The search time is 503 secs over a search space of 4.7 × 10 6 tile layouts.
