Loop tiling is an effective optimization to improve performance of multiply nested loops, which are the most time-consuming parts in many programs. Most massively parallel systems today are organized hierarchically, and different levels of the hierarchy differ in the organization of parallelism and the memory models they adopt. To make better use of these machines, it is clear that loop nests should be tiled hierarchically to fit the hierarchical organization of the machine; however, it is not so clear what should be the exact form of these hierarchical tiles. In particular, tile shape selection is of critical importance to expose parallelism of the tiled loop nests. Although loop tiling is a well-known optimization, not much is known about tile shape selection.
INTRODUCTION
Most highly parallel systems are organized hierarchically. Often, different levels in the hierarchy have different organizations and follow different memory models. For example, in a computer cluster, the nodes connected by a network form the first level of the hierarchy. At the next level, each node is typically a shared-memory multiprocessor, which could be connected to one or more GPGPUs. The processors within the GPGPU form a third level of the hierarchy.
Hierarchy-aware optimizations are usually necessary to unleash the potential of hierarchically organized systems. Loop tiling is an effective optimization to improve the performance of multiply nested loops, which are the most time-consuming parts of many programs. Tiling can be applied to improve data locality [Wolf and Lam 1991b; Wolfe 1989a; Abu-Sufah et al. 1981; Ahmed et al. 2000; Song and Li 1999; Coleman and McKinley 1995; Ramanujam and Sadayappan 1991] and expose parallelism [Wolf and Lam 1991a; Wolfe 1989b; Andonov et al. 2001; Högstedt et al. 1999; Wonnacott 2000] .
Authors' address: X. Zhou, M. J. Garzarán, and D. A. Padua, Department of Computer Science, University of Illinois at Urbana-Champaign; emails: {zhou53, garzaran, padua}@illinois.edu. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org. To make better use of hardware hierarchies, loop nests can also be tiled hierarchically to fit the organization of the target machine. Figure 1 (a) shows a doubly nested loop, covering an iteration space I = {(i, j) : 0 ≤ i < N ∧ 0 ≤ j < N }. After applying a twolevel tiling, the number of levels of nesting increases to six, as shown in Figure 1 (b). The outer two pairs of loops i k -j k (k = 1, 2) form a subiteration space I k = {(i k , j k )} at the corresponding tiling level. Figure 1 (c) matches each tiling level to one of the hierarchy levels of a simple cluster with SMP nodes. In top-down order, (1) the iteration space of the loop nest is tiled, and each tile is mapped to a node of the cluster, and (2) on each node, the assigned tile is tiled again. Each of these second-level tiles is assigned to a processor in the node. Hierarchical tiling can also be done bottom-up by partitioning the original iteration space into tiles that are assigned to the lowest level of hardware and then tiles are grouped into larger tiles for higher levels of hardware.
The shape and size of tiles can have a significant impact on locality, intertile communication, and parallelism [Ohta et al. 1995; Xue 1997; Lim et al. 1999] . However, little is known about strategies for the selection of tile shapes. Existing strategies would typically result in loops similar to that shown in Figure 1 where rectangles are used at both levels of tiling. In this article, we extend the study of tile shape selection to the more general shape of parallelograms.
In hierarchical tiling, tile shapes at the different levels interact with each other and together determine execution time. This means that to minimize execution time, it is not sufficient to select the tile shape at each level separately; in some cases, a better tile shape selection is obtained by tackling hierarchical tiling in a global manner. Figure 2 shows an example of different tile shape choices. Arrows represent the dependences between tiles. Dashed lines show the tiles that can be executed in parallel in a wavefront schedule. The numbers stand for the order of the execution of the wavefronts. In the three cases shown in the figure, the iteration space is constrained by two dependences: d 0 = (1, 0) and d 1 = (0, 1). There are several possible shapes at each level of tiling. The straightforward choice is to use squares at all levels as in Figure 2 (a), where the tiles along the same diagonal line can be executed in parallel. If we assume that the computation time of each tile at the lowest level (level 0) is one unit of time, the total execution time of the schedule in Figure 2 (a) is 7 × 7 = 49 units of time. However, there are better tile shapes. If we change the level 1 tile shape into a parallelogram, as shown in Figure 2 (b), the execution time in terms of units of time becomes 4 × 11 = 44 < 49. However, nonrectangular parallelograms are not necessarily always a better solution. If we also choose a parallelogram shape for the level 0 tiling, as shown in Figure 2 (c), the total execution time would increase to 7 × 11 = 77 units of time (here we assume that the parallelogram tile at level 0 takes the same time to execute as the square tile.).
This article makes the following contributions:
-We present a model that computes execution time as a function of the tile shapes. The model assumes unlimited resources so that tile shape is the only factor that determines execution time. This model is used to guide tile shape selection. -We show that the problem of optimal tile shape selection for hierarchical tiling is a nonlinear bilevel programming problem. -We describe the implementation of an automatic system for the selection of tile shapes in a hierarchical system. When the best tile shapes are not rectangular, this system can find good choices in the space of parallelograms.
The rest of the article is organized as follows. Section 2 discusses related work. Section 3 introduces basic concepts and the assumptions used in the rest of the article. Section 4 describes the model that computes ideal execution time in terms of tile shape. Section 5 discusses the implementation of the automatic tile shape selection system to find good tile shapes. Section 6 present our experimental results. Section 7 concludes the article.
RELATED WORK
Tiling is a well-studied transformation for the optimization of loop nests Lam 1991a, 1991b; Wolfe 1989a Wolfe , 1989b Abu-Sufah et al. 1981; Ahmed et al. 2000; Song and Li 1999; Coleman and McKinley 1995; Ramanujam and Sadayappan 1991; Andonov et al. 2001; Högstedt et al. 1999; Wonnacott 2000; Bondhugula et al. 2008] . It is useful to improve locality and/or parallelism. Most of the research in the area has focused on regular or intuitive tile shapes such as rectangles and regular parallelograms. However, studying more general shapes is important, because as our experiments show, more general shapes of parallelograms have the potential to achieve higher performance than what is possible with the simpler shapes mentioned previously that can be produced with the help of simple transformations such as loop skewing. Notice that although the work by Bondhugula et al. [2008] can also find tiles that belong to this more general class of parallelograms, their approach only considers one level of tiling, whereas our approach takes into account all levels of tiling.
It is well known that tile shape, as well as tile size, can have a significant impact on locality, intertile communication, and parallelism. Xue [1997] presents an approach to find the parallelogram or hyperparallelepiped 1 tile shape that minimizes communication volume between tiles. This work does not consider the impact of tile shape on parallel scheduling, although minimizing communication between tiles can sometimes result in higher parallelism. Högstedt et al. [1999] introduced a model to select the tile shape that minimizes execution time. Since their work only considers a single level of tiling, they show that for a single level of tiling, the best execution time can be determined by solving a linear programming problem. None of these papers discusses the tile shape selection problem in the context of hierarchical tiling.
The idea of taking advantage of the hierarchy of the hardware is not new. Liu et al. [2011] propose a cache hierarchy-aware tile scheduling technique to maximize data reuse. Leung et al. [2010] implemented a C-to-CUDA compiler that performs hierarchical decomposition for GPUs. Hierarchical tiling has been proposed to improve performance in the presence of memory hierarchy and to enhance parallel schedule [Carter et al. 1995; Hartono et al. 2009; Carter et al. 1996; Zhou et al. 2012] . In Zhou et al. [2012] , we introduced hierarchical overlapped tiling, a transformation that tries to reduce the overhead introduced by overlapped tiling by taking into account the hierarchy of the machine (overlapped tiling reduces communication overhead by introducing redundant computation). Tiles in overlapped and hierarchical overlapped tiling always have the shape of a trapezoid. In addition, to minimize redundant computation, the slope of the side edges of the trapezoid is determined by the slope of the extreme dependence vectors. Thus, the only adjustable parameter that can impact the shape of the trapezoid is the height of the trapezoid. The approach discussed in this article can use any parallelogram shape, as long as it is compatible with the constraints imposed by the dependence vectors. Section 6 compares the execution times obtained by the approach that we propose in this article with those obtained using hierarchical overlapped tiling. The work of Renganarayana and Rajopadhye [2004] is closest to that presented in this article. Their work presents a model to estimate the overall execution time of loop nests. Their framework determines the optimal tile sizes for hierarchical tiling by solving a convex optimization problem. However, this study only considers hyperrectangular shapes.
DEFINITIONS
This section defines the terms used in the rest of the article.
Iteration Space Representation
The iteration space of the loop nest shown in Figure 3 (a) is I = { i} ⊆ Z n , where i is the vector of loop index values. This loop nest can also be represented as shown in Figure 3 (b). To simplify the problem, we restrict the shape of the iteration space to ndimensional hyperparallelepipeds. We use n edges of the hyperparallelepiped sharing a common vertex to represent the n-dimensional iteration space and choose the common vertex as the origin. Figure 4 gives two examples of the edge vectors that define two-and three-dimensional iteration spaces. Assume that e k = (e k,0 , e k,1 , . . . , e k,n−1 ), k = 0, 1, . . . , n − 1 are the n edge vectors of the hyperparallelepiped-shaped iteration space I. The edge matrix of I is
We define the function span(E) as follows: Clearly, I = span(E), and the total number iterations in I is |det(E)| (because |det(E)| is the volume of the parallelepiped).
Tiling Transformation Representation
When all dimensions are tiled, 2 tiling is a map from
. . , i n−1 ) is the index for each tile and j = ( j 0 , j 1 , . . . , j n−1 ) is the iteration index within each tile. We only consider disjoint tiles; thus, for each i ∈ I, there is a unique ( i : j) corresponding to it and vice versa.
The loop nest in Figure 3 is transformed by tiling into the form shown in Figure 5 . The top n loop nests form a new n-dimensional iteration space I . We can then tile recursively to produce a hierarchically tiled loop nest. We say that I is the tiled iteration space of the original iteration space I.
Again, to simplify the problem, we assume that the shape of all nonboundary tiles are the same n-dimensional hyperparallelepiped. Assume that t k = (t k,0 , t k,1 , . . . , t k,n−1 ), k = 0, 1, . . . , n − 1 are the n edges of the hyperparallelepiped tile; we call the following matrix T tiling matrix:
Each nonboundary tile, J, is equal to span(T ) and contains |det(T )| iterations.
To study how I, I , and T are related, we derive an equation involving E, E , and T , where E is the edge matrix of I (I = span(E )). If we describe tiling as an affine transformation, the shape of I is guaranteed to be an n-dimensional hyperparallelepiped because the shape of I is hyperparallelepiped. Since we assume that tiling is an affine transformation, we must be able to find an n × n matrix A such that
Next, consider the effect of tiling transformation represented by A. As shown in Figure 6 , after tiling, under the new coordinates system shown as axis labels i 0 and i 1 , the vectors t 0 and t 1 become (1, 0) and (0, 1), respectively. More generally, we have the 2 The problem becomes simpler if one or more dimensions are not tiled.
following equation (1 n denotes the n × n identity matrix):
We conclude that A = T −1 and E = E · T −1 . We say that T −1 represents the affine transformation of the tiling transformation with tile shape T .
Hierarchical Tiling
As mentioned in Section 3.2, it is possible to recursively tile an iteration space I. Hierarchical tiling applies tiling at each level. We follow a bottom-up approach for hierarchical tiling. First, we use tiling matrix T 0 to tile the original iteration space I 0 = I, producing a new iterations space I 1 . More generally, on the k-th level of tiling, we apply tiling matrix T k to tile the iteration space I k and generate the new iteration space I k+1 . Assume that E k is the edge matrix of I k , then
Dependences
Dependences are the partial order requirements that must be enforced between iterations. Two iterations can be scheduled either concurrently or in any sequential order if there are no direct or indirect dependences between them. Otherwise, the iteration at the source of the dependence must finish before the iteration at the destination of the dependence starts. 
Without loss of generality, we assume that m ≥ n and there are n dependence vectors that are linearly independent. result, that there can be no cycles in the intertile dependence graph, which means that the hyperplanes defining the tiles must not be crossed by two dependence vectors with different directions. Figure 7 gives two examples of tile shapes. The tile shape shown in Figure 7 (a) is valid, because dependence vectors only cross the hyperplane in one direction. For example, for the tile boundaries that are parallel to t 0 , only dependence vector d 1 can go across from the left side to the right side; for the tile boundaries that are parallel to t 1 , both dependence vector d 0 and d 1 can go across, although always from the lower side to the upper side. This means that for each boundary hyperplane, the tile has either in-bound dependences or out-bound dependences. However, the tile shape in Figure 7 (b) is invalid, as there are both in-bound and out-bound dependences crossing the boundary hyperplane of t 0 . This means that the horizontal neighboring tiles are in a dependence cycle. Figure 8 depicts the restriction that the topological sorting requirement imposes on the dependences. To produce a valid tile shape without a dependence cycle between tiles, the infinite cone spanned by extending the tile edges t 0 , t 1 , . . . , t n−1 must contain every dependence vector. 4 Thus, for the example in Figure 8 , t 0 and t 1 must not be in the shaded area. To describe the preceding observation formally, we say that each dependence vector d k must be covered by the cone spanned by the extension cords of t 0 ,
We further require that tiles be large enough so that intertile dependences only exist between adjacent tiles (including diagonal adjacency). This requirement can be expressed as follows: The preceding two requirements can be merged into the following:
which is the constraint imposed by dependences when choosing the tile shape. When the original iteration space I is tiled by matrix T , each dependence vector d k is also transformed to d k by the affine transformation represented by T −1 . Since T must satisfy the constraints of Equation (3), we have
As shown in Figure 9 , a dependence vector (2, 2) would result in three dependence vectors after tiling: (1, 0), (0, 1), and (1, 1). In general, a single dependence vector d k of the original iteration space I generates one or more dependences between tiles in the tiled iteration space I . The rule is as follows:
that are linearly independent, the intertile dependences in I generated by those n dependence vectors must include n orthonormal unit vectors.
From the previous observation, we conclude that D k , which is the dependence matrix at the k-th level of tiling, must have the following form:
Execution Model
Assuming that the sequential execution time only depends on the amount of computation, and that the execution time of each iteration is constant, the sequential execution time of a loop with iteration space I would be
Heret c is the execution time of each iteration in iteration space I. The execution time of a parallelized iteration space depends on the schedule of iterations within the iteration space. To not be tied to any specific scheduling scheme, we use the concept of ideal execution time in our analysis. Ideal execution time is the minimal execution time of an iteration space I that can be achieved by any valid schedule of iterations. If each iteration in I is viewed as an activity and D represents the dependences between activities, the ideal execution time is determined by the critical path in I. To simplify the problem, we assume that there is infinite hardware parallelism available at each level. In practice, this assumption means that the amount of hardware parallelism is larger than the maximum number of iterations that can be executed in parallel. Therefore, the critical path in I is the longest path of dependences between iterations.
Let L(E, D) denote the length of the longest path of dependent iterations in the iteration space I with edge matrix E and dependence matrix D; the ideal execution time of I is Figure 10 . This will result in a longer dependence path. As a result, we can conclude that the longest dependence path must only contain d 0 and d 1 (P), and other dependence vectors can be ignored 5 when calculating L(E, D). In particular, if there are n dependence vectors that are orthonormal unit vectors, we can ignore other rows in D and only consider those n dependence vectors, because any other single dependence vector can be replaced by the combination of one or more orthonormal unit vectors, which will result in a longer dependence path. Based on the preceding observation, we have Equation (8):
According to Equation (5), under the context of hierarchical tiling, the dependence matrix each at each level D k = 1 n * , ∀k ≥ 1. Therefore, Equation (8) can be used to simplify further analysis. 
MODEL
This section describes a model to estimate the execution time as a function of the tile shape. The model is used to direct tile shape selection.
Problem Statement
The problem of selecting tile shape for hierarchical tiling is defined as identifying the tile shapes defining an l-level hierarchical tiling that minimizes the execution time of the computation defined by giving an n-dimensional hyperparallelepiped-shaped iteration space I and m dependence vectors. The problem reduces to determining a sequence of tiling matrices T 0 , T 1 , . . . , T l−1 .
The goal of this model is to assess the impact on parallelism of general parallelogram tile shapes in hierarchical tiling. This model does not consider data locality and only focuses on the impact of intertile dependences on parallel scheduling. Our model assumes that maximal exposure of parallelism (instead of sophisticated data reuse schemes) is the dominant factor determining overall performance. Thus, the execution time predicted by this model can be inaccurate if the preceding assumption is not true. However, this assumption should be true if the scale of the system or cluster used is large enough compared to the size of the input data. In addition, as indicated in Section 3, the model makes the following assumptions:
(1) The model considers general parallelogram tile shapes-that is, parallelograms for two-dimensional spaces and hyperparallelepipeds for spaces with higher number of dimensions. We follow a bottom-up approach to do hierarchical tiling. First, tiling matrix T 0 is selected. T 0 tiles the original iteration space I 0 = I. Then, each tile is considered as an iteration of a new iteration space, I 1 . On the k-th level of tiling, tiling matrix T k is selected to tile the iteration space I k and generate the new iteration space I k+1 . Let E k be the edge matrix of the k-th level iteration space I k (with E 0 = E, the edge matrix of I), then
The iterations within the bottom-level tiles (the finest grain, level 0) are executed sequentially, so the per-tile execution time can be calculated by Equation (6) (t c denotes the execution time of each single iteration):
For upper-level tiles (T k , 1 ≤ k < l), each tile at the level immediately below is considered as a single iteration. The per-iteration execution time is Time(T k−1 ). Let D k denote the dependences at the k-th level with D 0 = D and t k s be the synchronization and/or communication overhead associated to each tile at this level; the per-tile execution time is
Then, the total execution time after tiling is
Thus, the problem of optimal tile shape selection for hierarchical tiling is equivalent to selecting a sequence of tiling matrices T 0 , T 1 , . . . , T l−1 to minimize the preceding equation. Equation (11) does not include D; however, according to the discussion in Section 3.4, the shape of T 0 must conform the constraints (Equations (1) and (2)) imposed by each dependence vector in D.
Compute the Longest Dependent Path
As shown in Equation (11), total execution time is a function of the tiling matrices T 0 , T 1 , . . . , T l−1 . To do a quantitative analysis on the relation between tile shape and execution time, and then find the minimum of Equation (11), we must identify the function L first.
is the length of the longest dependent path with dependence matrix 1 n in tile T k . The iteration space in tile T k can be normalized to an n-dimensional hypercube by applying the affine transformation 6 represented by T
the tile shape at the k-th level of tiling, T k must conform to the constraints imposed by each dependence vector in D k . According to Equations (1) and (2),
Then, let D denotes the dependences within the tile (normalized tile so that the tile is defined by orthogonal edge vectors):
In the two-dimensional case, the intuitive explanation of the preceding observation is that the direction of each d j must be in the upward, right, or upper-right direction, as shown in Figure 11 (a) and (b). Since the normalized iteration space is a hypercube, the longest path must start from the bottom-left corner, which is the base point (0, 0, . . . , 0).
To compute L(T k , 1 n ), we must find the longest path Figure 11 (c)) such that
6 To keep the problem in the integer domain, the affine transformation can be revised to T −1 k · |det(T k )|, and the discussion in this section would be still valid. L is the length of the dependent path found. Therefore, L(T k , 1 n ) = max{L}. We have that the last point p L−1 is defined as 
Therefore, we conclude that
The meaning of L(E l , 1 n ) is the length of the longest dependent path in the iteration space of the highest-level tiling. Similarly, we can apply the affine transformation represented by E −1 l to normalize the iteration space:
Let D denote the transformed dependence matrix:
However, unlike the case when calculating L(T k , 1 n ), there is no guarantee that ∀ d j , 0 ≤ d j ≤ 1. More intuitively, each dependence vector d j can point in any direction. Therefore, L(E l , 1 n ) cannot be directly solved by the linear programming problem of Equation (13).
Since dependence vectors d j can point in any direction, the longest dependent path does not necessarily start from origin (0, 0, . . . , 0) of the hypercube iteration space. Thus, the dependent path P in the iteration space E l is as defined by Equation (12) 58:14 X. Zhou et al. with the difference that p 0 is not necessarily equal to 0. To simplify the analysis, we define a related problem that is graphically depicted in Figure 12 : find the longest path P , which is a sequence of points p 0 ,
L is the length of path P . P starts from the origin and only requires that the end point p L −1 is within the hypercube surrounding the origin. p L −1 can be calculated as follows:
l . Therefore, max{L } can be solved through the following linear programming problem
Next, we need to prove that max{L } is approximately equal to max{L} so that 1 n ) . First, given any path P, we can construct a path P by letting p j = p j − p 0 . Thus, it is easy to see that max{L} ≤ max{L }.
On the other hand, given a path P with start point p 0 = 0 and end point p L −1 = ( p 0 , p 1 , . . . , p n−1 ), to find a corresponding path in space span(1 n ) (which is a hypercube), we construct s = (s 0 , s 1 , . . . , s n−1 ) as follows:
58:16 X. Zhou et al. Fig. 13 . The process of automatic tile shape selection using NOMAD.
Given the edge matrix E of the original iteration space I, an optimal tile shape for hierarchical tiling is given by the sequence of valid tiling matrices T 0 , T 1 , . . . , T l−1 that minimize Time(I) in Equation (19). If I is n-dimensional, each tiling matrix T k contains n 2 elements. In total, there are n 2 · l variables. Thus, given that Equation (19) is a recurrence that contains linear-programming problems terms, we can conclude that the problem of selecting the tile shape that minimizes Time(I) in Equation (19) is a bilevel programming problem. Because Time(I) is nonlinear, it generally cannot be easily solved directly. Notice that for a single level of tiling, if we assume infinite resources, as our model does, the minimum execution time is determined by the dependence critical path. However, for multilevel tiling, the critical path is a path of dependent tiles. Thus, the critical path at one level of the hierarchy depends on the critical path at the lower level. Because of this, the best tile shape at one level might not be the best one when the next level is also considered. This makes the problem of finding the best tile shapes for hierarchical tiling a bilevel programming problem.
AUTOMATIC TILE SHAPE SELECTION
According to the previous discussion, optimal tile shape selection for hierarchical tiling is an multidimensional nonlinear optimization problem without a known analytical solution. We use NOMAD [Le Digabel 2011] to select the tile shapes automatically based on the model described in Section 4. NOMAD is a tool to solve optimization problems defined by an opaque program. We use NOMAD to optimize Time(I), whose computation requires solving n − 1 linear programming problems (Equation (19)). We use LP_Solve [Berkelaar et al. n.d.] to solve these linear programming problems. The program accepts the iteration space I, the dependence matrix D, the single execution time t c , the tiling overhead at each level t s , and the n 2 · l components of the tiling matrices T 0 , T 1 , . . . , T l−1 , and outputs the result of Time(I). NOMAD implements the Mesh Adaptive Direct Search (MADS) algorithm [Abramson et al. 2009 ] to repeatedly generate a better set of n 2 · l values as the input to the black-box program by searching on the mesh. NOMAD is not guaranteed to find an optimal solution; it will stop and output the current best solution when no better solution can be found. The process of automatic tile shape selection using NOMAD is shown in Figure 13 .
EXPERIMENTS

Experiment Platform
We evaluated the strategy for tile shape selection discussed in this article on Blue Waters. Blue Waters is a powerful supercomputer thath consists of nodes with an NVIDIA Tesla K20X GPU accelerator. The Tesla K20X GPU contains 2,688 CUDA cores. The collection of all nodes and the GPUs within each node naturally form two levels of hardware hierarchy. We only use 256 nodes for the experiments. We see the collection of nodes as the first level and the cores within the GPU as the second. For the purposes of this study, we ignore the intermediate level containing the CPU.
Benchmarks
We use seven stencil programs as benchmarks: one-dimensional and two-dimensional Gauss and Jacobi, PathFinder, Poisson, and HotSpot. Gauss and Jacobi are the kernels of the Gauss-Seidel and Jacobi finite difference PDE solvers, respectively. PathFinder is a dynamic programming algorithm to find a minimum weighted path. Poisson is a solver of the Poisson equation, using the Laplace operator [Ames 1977 ] over a twodimensional grid with a five-point stencil. HotSpot implements a chip temperature estimation model [Huang et al. 2004] . PathFinder and one-dimensional Gauss and Jacobi are one-dimensional stencils that form two-dimensional iteration spaces (time dimension + space dimension), whereas the rest of benchmarks are two-dimensional stencils that form three-dimensional iteration spaces (time dimension + two space dimensions).
Tiling Schemes
We implemented the automatic tile shape selection system discussed in Section 5. We also implemented a tool using the Omega Library [Kelly et al. 1995 ] to transform loop nests onto the selected tiled form. The tool generates OpenCL [Khronos OpenCL Working Group 2008] code for the inner loops and MPI code for the outer loops. In this way, the lower level of tiles will be mapped onto the GPU device of each node, whereas the higher level of tiles will be mapped onto different nodes of the cluster.
We compare the resulting tiled loops with other tiled versions of these loops that (1) make use of common tile shapes or (2) are the result of applying the hierarchical overlapped tiling method introduced in Zhou et al. [2012] . The common tile shapes that we used in the experiments includeSquare, Diamond, and Skewing. Depending on the direction of skewing, we compare two tile shapes of Skewing: Skewing-1 and Skewing-2. These tile shapes either are very regular or can be easily constructed by simple loop transformation such as loop skewing. We can say that these common tiling schemes are the easiest and most natural to use when applying tiling manually. Table I lists the tile shapes for two-and three-dimensional iteration spaces and the corresponding tiling matrices. No tile shape can be used for all cases, because the tiling matrix that represents the tile shape must satisfy Equation (1), and therefore correctness is guaranteed only for some dependence matrices. Table I 1, 1 ) is the matrix of the boundary dependence vectors of Jacobi. Square, Skewing-1, and Skewing-2 can be applied to iteration spaces with dependence matrix D g , but Diamond cannot. However, only Diamond and Skewing-1 are good for iteration spaces subject to the dependence matrix D j , whereas the other two tile shapes are not. For hierarchical tiling, it is possible to choose different tile shapes for each level of tiling. However, according to Equation (5), for the k-th level of tiling, k > 0, the dependence matrix always contains 1 n = D g . This means that Diamond can only be applied at the lowest (0-th) level.
For hierarchical overlapped tiling, trapezoid tiles are used at each level. This type of tiling cannot be represented using tiling matrices. To fit the dependence vectors, the trapezoid tile must be "rotated" for Gauss as shown in Figure 14 (a).
Platform Parameters
We measured the execution time of a single iteration t c and the synchronization/communication overhead t s at each level on the experimental platform. The parameter t c is application dependent, mainly determined by the number of points of the stencil; t s depends on the target machine. Our experimental results show that for 3-point, 5-point, and 9-point stencils, t c is about 1.3 μs, 1.5 μs, and 1.6 μs, respectively. At level 1 of the hardware parallelism (internode), t s is the communication overhead of sending and receiving MPI messages between nodes. Because level 0 parallelism is mapped onto the GPU accelerator of each node with OpenCL code, at level 0 (intranode) t s is the overhead of starting an OpenCL kernel and transmitting data between host memory and device memory. Our experimental results show that at level 0, t s is about 18 μs; at level 1, t s is about 63 ms for two-dimensional iteration spaces and 124 ms for three-dimensional iteration spaces. The communication overhead is higher for three-dimensional iteration spaces because tiles in a threedimensional iteration space have more neighbors than two-dimensional iteration spaces, and thus more messages must be communicated. 
Performance
In this section, we compare the performance achieved by the code generated by the techniques of this article with code that is built using some combination of the common tiling shapes mentioned previously. For the comparison, we use two common tiling schemes: Tiling Scheme 1 and Tiling Scheme 2, as well as the hierarchical overlapped tiling. As discussed in Section 6.3, a tiling scheme does not work for every problem because of the constraints imposed by dependences. Figure 15 shows our experimental results. The figure shows the execution time of the different schemes normalized to the execution time of Tiling Scheme 1. Execution time includes the overall execution time of the entire application and includes communication and computation. To make a fair comparison, we do an empirical search to determine the best tile size for each tiling scheme. The white bar and grey bar in Figure 15 represent the execution times achieved by Tiling Scheme 1 and Tiling Scheme 2 using the optimal value of x's for the tiling matrix at each level. The black bar is the execution time obtained using the strategy described in this article, which automatically generates the tile shapes for a two-level hierarchical tiling. As the figure shows, the codes using the automatically selected tile shapes are always faster than the corresponding codes using the common tiling schemes and the corresponding codes using the hierarchical overlapped tiling method [Zhou et al. 2012] . Notice that hierarchical overlapped tiling does not perform well for Gauss, as this approach is generally not appropriate for this type of stencils. On average, the automatically selected tile shapes achieve 34%, 23%, and 20% speedup over Tiling Scheme 1, Tiling Scheme 2, and hierarchical overlapped tiling, respectively. In particular, for onedimensional Gauss, it achieves 43% speedup over Tiling Scheme 1 and 32% over Tiling Scheme 2.
Table III lists the tile shape selection for one-dimensional Gauss and Jacobi.
Model Accuracy
Finally, we also measured the accuracy of the model discussed in this work. Figure 16 shows, for one-dimensional Gauss and Jacobi, the execution time measured and the execution time estimated by the model for the different tiling schemes. The figure shows that the discrepancy between the measured and the estimated execution time is fairly small (less than 15%) for all tiling schemes except for the automatically selected shapes for one-dimensional Jacobi. Factors that can cause inaccuracy include that t c and t s can have small variation for different programs and that hardware parallelism is not unlimited (to to simplify the model, it assumes that hardware parallelism is always sufficient compared to the computation parallelism exposed by the program at each level). In practice, predicting the absolute value of execution time is not as important as predicting which shape will deliver better execution times. Figure 17 plots the measured and estimated execution time for one-dimensional Gauss as the tile shape changes for level 0 and level 1. We use T 0 = ( 
CONCLUSION AND FUTURE WORK
In this article, we build an analytical model to study the relation between the tile shape at each level of hierarchical tiling and the execution time of the tiled loop nest. We show that the problem of optimal tile shape selection for hierarchical tiling is a nonlinear bilevel programming problem. We implement an automatic system that does black-box optimization with our analytical model to automatically select tile shapes for hierarchical tiling. The tile shape selected by the automatic system can be quite different from the intuitive regular shapes. Our experimental results show that irregular tile shapes may have the potential to achieve higher performance over regular, intuitive tiling shapes.
