Abstract
Introduction
Tiling is a widely used technique t o increase the granularity of computations and the locality of data references. This technique is restricted t o perfect loop nests with uniform dependences, which we define as in Banerjee [3] . The basic idea is to group elemental computation points into tiles that will be viewed as computational units. The larger the tiles, the more efficient the computations are performed using state-of-the-art processors with pipelined arithmetic units and a multi-level memory hierarchy (this is illustrated by recasting numerical linear algebra algorithms in terms of blocked Level 3 BLAS kernels [7, 61) . Also, another advantage of tiling is the decrease of the communication time (which is proportional t o the surface of the tile) relatively to the computation time (which is proportional to the volume of the t,ile). The price to pay for tiling may be an increased latency (if there are data dependences, say, we need to wait for the first processor t o complete the whole execution of the first tile before another processor can start the execution of the second one, and so on), as well as some load-imbalance problems (the larger the tile, the more difficult t o distribute computations equally among the processors).
Tiling has been studied by several authors and in different contexts. A short review of the existing literature is provided in the extended version of this paper [5] . Basically, most of the work amounts t o partitioning the computation domain of a uniform loop nest into tiles whose shape and size arc optimized according to some criteria. Little attention has been paid t o distributing the tiles to physical processors and to computing the final scheduling. For is assigned several tiles, what should be the computation ordering of these tiles? An in-depth study has been presented by Ohta et a1 [lo] , x h o have cstended results of Hiranandani et al. [8] on fine grain pipelining for DOACROSS loops. TVe survey their work in Section 3 .
In this paper, we build upon the work of Ohta et a1 [lo] . TVe reformulate the problem of tiling with limited resources using more realistic assumptions on data dependences and communicationcomputation overlap than those used in [lo] . We also derive an optimal mapping to assign tiles to physical processors. All these results are presented in Sections 4 and 5. Finally, we state some conclusions in Section 6.
Tiling as a loop transformation technique
When targeting a data-parallel or SPMD style of programming, classical constraints in the literature to define tiles are the following:
Tiles are bounded For scalability reasons, x e want the number of points inside a tile to be bounded by a constant independent of the domain size. Tiles are identical by translation This constraint is imposed to allow for automatic code generation: a tile must be the image by a translation of any other one unless it crosses the computation domain boundaries. Tiles are "atomic" Each tile is a unit of computation: all synchronization points are beginnings and ends of tiles. The order on tiles must be compatible with the order on nodes: one must thus avoid that two distinct tiles depend upon each other.
A4s already said, tiling is restricted t o perfectly nested loops with uniform dependences, such as the following simple example:
This loop nest has depth 2. The dependences are uniform (intuitively; dependence vectors are translations), and they can be encapsulated into the dependence matrix
The atomicity constraint can be expressed by the analytical condition HD 2 0, where H is the matrix of vectors normal to the faces (or the edges in two-dimensional problems) of the tile [9] . In Figure 1 , we sketch a valid tiling for our example. The matrix H is the one derived using the scalable comiriunication-to-computation criteria of Boulet et al. [4] :
We check that H D 2 0. Note that the volume of the tile, which represents the number of computations per tile, is given by the determinant of H-': Vcomp = det(H-') = 96. The number of comrnuriications is the following: each tile sends e 24 data items to its right neighbor, e 34 data items to its lower neighbor. and 6 data items to its lower-right neighbor In other words, the computation of a tile cannot be started before both its left and upper neighbor tiles have been executed.
As stated above, the atomicity constraint implies that inter-processor communications only take place at the end of the processing of each tile. Note that current architectures do allow for communications and computations t o overlap, and it is important t o point out that the atomicitay constraint does not prevent a given processor from simultaneously communicating boundary data of one tile (whose execution it just completed) and starting the computation of another tile. Also, minimizing communication start-up overheads is a "sine-qua-non'' condition towards achieving good performance. Even though sophisticated routing strategies are designed and implemented in hardware, communication start-up costs remain very expensive as opposed to the elemental time for communicating one data item (and even worse for performing a computation). Frequent exchanges of short messages should therefore be replaced by fewer sends and receives of longer messages. To summarize, in the context of distributed memory architectures, tiling is a technique that permits to optimize communications while increasing the granularity of computations.
Tiling with resource constraints
Ohta et al. [lo] (H6) The scheduling of the tiles is column-rvise: at step 0. processor Pi executes tile (0.0) and the computed value is communicated to the adjacent processor PI (more precisely, a rectangular slice of width w and height n2 is sent). .it step 1, processors P o and P i execute tiles (0,1) and ( 1 , O ) simultaneously. Bfter having executed a whole column of tiles; a processor moves on to its next column.
Figure 2. Mapping rectangular tiles onto a ring of processors.
A step is the time required to compute a tile and to communicate data. Ohta et al. [lo] use the following expression:
where t is the elemental computation time, a is a communication start-up and b is the inverse of the communication bandwidth times the width U: of the slice being communicated (the communication cost is a linear expression in the message size).
To compute the total execution time, Ohta et al. [lo] use the formula ( M l + Mp)Ttzle, where Ml = P -1 is the latency (the step a t which the last processor begins to work) and M p = i:n:'22 is the number of tiles per processor (assumed to be an integer). Using the approximation Ml = P, they derive the total execution time T as LY-1 s,
The execution time is found to be minimal when choosing n1 = % and n2 = The objective of this paper is t o discuss the hypotheses ( H l ) to (H6) of Ohta et al., and t o reformulate their results using a more accurate modeling of current architectures. Indeed, their study is conducted while assuming that processors cannot simultaneously communicate bordering data items of the last tile and perform computations for the next tile. However, overlapping computations and communications is a facility provided by all distributed memory computers, so we relax this restriction. This simple modification has a tremendous effect on the determination of the best tile size.
N l t '
4 Allowing for communication-computation overlap
On the model
Hypotheses (H2), (H3) and (H4) may appear very restricting. However, we point out the following justifications:
Tile shape We assume that the tiles are rectangular. This is to be understood as the outcome of previous program transformations. The first step in tiling amounts to determining the best shape and size of the tiles, assuming an infinite grid of virtual processors. Because this step will lead to tiles whose edges are parallel to extrema1 dependence vectors in the convex hull of the dependence cone, me can perform a unimodular transformation and rewrite the original loop nest along the edge axes. The resulting domain may not be a rectangular, but we can approximate it using the smallest bounding box (however, this approximation may impact the accuracy of our results). On the other hand, hypotheses ( H j ) and (H6) are unnecessarily restrictive, because the mapping and scheduling of the tiles should be an output decision of the procedure of tiling with limited resources, rather than being given a priori. We overcome this restriction in Section 5.
Dependence vectors

Revisiting the results of Ohta et al.
The total execution time is given by the following proposition (all the proofs of the results of this section are available in [5] ):
Proposition 1 Under the hypotheses ( H l ) to (HS) of Section 3, and allowing for communicationcomputation overlap, the total computation time T is (assuming all fractions to be integer):
(1)
To understand Proposition 1 intuitively, note that two cases can occur:
Either there are enough tiles in each column so that when a processor has completed the execution of a whole tile column, it does not have to wait for its next tile column to be ready (this will happen if GTcomp is greater than or equal to the delay P Ttzle imposed by horizontal constraints), Or each processor has to wait upon finishing a tile column until the next one is ready.
The optimal number of processors that should be used so as to minimize the total execution time is given by the following proposition:
Corollary 1 Under the hypotheses (H2) to (H6) of Section 3, and allowing for communication computation overlap, let
The number of processors P that minimizes the total execution time is given by:
For large domains, we will have 1 5 P 3 5 P,. and it is no surprise that the optimal number of processors is the one required to ensure steady-state execution: Equation (1) in Proposition 1 states that all processors remain active once started if
,Yzn1t 2 P(n1nzt + U + h a ) .
So far, we have assumed that nl and n2 were input parameters, because the size and shape of the tiles may be imposed by some a priori considerations (such as the cache size). We can try to determine the values of n1 and I L~ in the range 1 5 nl 5 zil, 1 5 712 5 iV2 that would minimize the total execution time. We rewrite the steady-state equation by introducing the following function f:
Corollary 2 Under the hypotheses ( H l ) t o (HS) of Section 3, and allowing for communrcationcomputation overlap, the total execution tzme 2s mznzmum for
This result is disappointing in that we end up with degenerate tiles in most practical situations.
For instance if P << N z (which is very likely to happen in practice), f ( 1 ) 2 1, and the optimal tile size is n1 = n2 = 1, not a very coarse-grain tiling indeed! For other values of the problem parameters we would have an optimal tile size that depends upon the domain size, thereby contradicting the assumption that tiles are bounded (Section 2). Tote that Ohta et a1 [lo] also have this problem in their original model. The flaw is that the model is not accurate enough to take the impact of data locality and data reuse into account (which are the main objectives of tiling, and the main motivation for designing blocked linear algebra algorithms [7] ). A first solution is to model the computation cost of a tile by an affine expression like Tcomp = nlnnt + U , where U represents some access overhead. It is not difficult to plug this expression into the formula given for the total execution time T , and to derive the optimal tile size. Another solution is to assume a fixed tile size that would be imposed by some a priori considerations (such as the cache size). Again, we can let n1nz = S in Equation (l) , and minimize T for nl; say. the tzle perzod Pt as the time elapsed between corresponding instructions of two successive tiles that are mapped t o the same processor; while they define the tile latency Lt to be the time between corresponding instructions of two successive tiles that are mapped t o diflerent processors. With these notations, the parallel execution time becomes [2] A very interesting approach has been proposed by Andonov and Rajopadhye [2] : they introduce
The pourer of this approach is that the expressions for Lt and Pt can be modified to take into account several architectural models, while Equation (3) still remains valid. A very detailed architectural model is presented in [ Z ] , and several other models are explored in [l] .
With our notations, Pt = Tcomp and Lt = Tcomp + T,,,,. We can rewrite Equation (1) as otherwise Equation ( 3 ) , or its variant Equation (4) , is the key to our tiling problem, because it expresses the parallel execution time as a function of the domain size, of the number of processors, and of the tile parameters Pt and Lt, or equivalently Tcomp and T,,,,.
Optimal mapping and scheduling
Hypotheses (HS) and (H6) are very restrictive in that they impose the mapping of tiles to processors as well as their scheduling. The intuitive motivation for (H5) is that a cyclic distribution of tiles is quite natural to load-balance computations. Once the distribution of tiles to processors is fixed, there are several possible schedulings (any wavefront execution that goes along a left-to-right diagonal is valid). Specifying a column-wise execution may lead to the simplest code generation.
It turns out that (H5) and (H6) provide the best solution among all possible distributions of tiles to processors, which is a very strong result. This result holds true under the assumption that the communication cost for a tile is not larger than its computation cost. Since the communication cost for a tile grows linearly with its size, while the computation costs grows quadratically, this hypothesis will be satisfied if the tile is large enough'. This result is formally stated in the theorem below. Beforehand, we need to refine the communication cost as follows: Tcomm_honz = a+ bnz is the cost of communicating data from (the processor owning) tile ( i , j ) TconLm_ue,t = a' + b'nl is the cost of communicating data from (the processor owning) tile
We pay a communication cost only when the two processors that own the neighboring tiles are not the same. So far we never paid any cost for vertical communications, and we always did for horizontal communications, because of hypothesis (H5). We had to refine the communication cost because in this section, we do not make any assumption on the mapping of tiles to processors.
Depending upon the values of Tcomm-horlz and Tcomm-ve,t, the best mapping will be column-wise or row-wise: 7x1 and nz be chosen so that to (the processor owning) its right neighbor tile (i + l , j ) , 
Theorem 1 Under the hypotheses (H2) to (H4) of Section 3, and allowing f o r communicationcomputation overlap, let
nl). T h e n the absolute minimum f o r the total execution time is
'Of course, we can imagine theoretical situations where the communication cost is so large that a sequential execution would lead to the best result.
and it is achieved b y mapping rows of tiles using a one-dimensional cyclic distribution (tile ( 1 ,~) zs allocated to processor i mod P). and by sch,eduling the trles row-wise. 
We know from Lemma 1 that for all k 2 1,
ni: prove by induction that for I. 5 k 5 P -1, at least P -k processors are kept idle between the beginning of the execution of piz,ot..tile(k -1) and that of pzr,ot_tile(k). This will lead to be result that
This d l prove the desired result, because the same amount of idleness, so to speak, will be spent at the end of the computation (by symmetry of the dependence graph). Now, for the induction: (l) , the only successors of pivoLtrle(0) that can be executed are in S(1). But all tasks in S(1) must be executed sequentially, hence between the beginning of the execution of piz~ot_tiZe(O) and that of pivof_tile (l) , a t least (P -1) processors are kept idle.
Assume that the hypothesis is true until step k . Between the beginning of the execution of piuot-tile(k) and that of pivot_tile(k + l ) , at most one processor can be active in S(1), at most another one in 5(2), . . _ 1 and a t most one processor in S ( k + l), so that at most k + 1 processors can be active, or equivalently, at least P -( k + 1) processors remain idle. It is worth to point out that Theorem 1 holds true in a large framework. Whatever the model used for estimating the communication time T,,,, and the computation time Tcomp, the parallel execution time for a columnwise allocation of tiles to processors is given by Equation (4). Theorem 1 basically says that such a columnwise or rowwise allocation will be optimal as soon as
T C , "
T c o m p
2. the weight of a tile column (or tile row) is greater that the tile latency
The first hypothesis will be fulfilled if the tile is large enough (because the cornrnunication cost grows linearly while the computation cost grows quadratically). The second hypothesis will be fulfilled as soon as the domain is large enough in front of the number of processors, a situation very likely to happen in practice. 6 
Conclusion
In this paper, we have studied tiling techniques aimed at adapting the granularity of uniform loop nest algorithms towards execution on distributed-memory machines. We view tiling as a two-step process: the first step amounts to determining the best shape and size of the tiles (assuming an infinite grid of virtual processors), while the second step consists in mapping and scheduling the tiles to physical processors. We have concentrated on the second step, assuming a realistic model where (independent) communication and computation may overlap. We have obtained several new results, including a strong result on the optimal mapping and scheduling. However, much remains t o be done to extend these results to arbitrary dimensions and domain shapes.
More generally, the relationship between tiling; scheduling and mapping is not yet well understood, and the two-step approach may not prove too complicated for practical problems. Yet, such a two-step approach is typical in the field of parallelizing compilers (other examples are general task graph scheduling, software pipelining and loop parallelization algorithms).
Finally, the recent development of heterogeneous computing platforms may well lead to using tiles whose size and shape will depend upon the characteristics of the processors they are assigned to ... a truly challenging problem!
