This paper presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency trac on shared-memory multiprocessors. While several previous papers have l o o k ed at hyperplane partitioning of iteration spaces to reduce communication trac, the problem of deriving the optimal tiling parameters for minimal communication in loops with general ane index expressions has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. Our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We i n troduce the notion of data footprints to estimate the communication trac between processors and use linear algebraic methods and lattice theory to compute precisely the size of data footprints. We h a v e implemented this framework in a compiler for Alewife, a distributed shared memory multiprocessor.
Introduction
Cache-based multiprocessors are attractive because they seem to allow the programmer to ignore the issues of data partitioning and placement. Because caches dynamically copy data close to where it is needed, repeat references to the same piece of data do not require communication over the network, and hence reduce the need for careful data layout. However, the performance of cache-coherent systems is heavily predicated on the degree of temporal locality in the access patterns of the processor. Loop partitioning for cache-coherent m ultiprocessors is an eort to increase the percentage of references that hit in the cache.
A short version of this paper appears in ICPP 1993.
Contributions and Related Work
This paper develops a unied theoretical framework that can be used for loop partitioning in cache-coherent m ultiprocessors, or for loop and data partitioning in multicomputers with local memory. 1 The central contribution of this paper is a method for deriving an optimal hyperparallelepiped tiling of iteration spaces to minimize communication. The tiling species both the shape and size of iteration space tiles. Our framework allows the partitioning of doall loops accessing multiple arrays, where the index expressions in array accesses can be any ane function of the indices.
Our analysis uses the notion of uniformly intersecting references to categorize the references within a loop into classes that will yield cache locality. This notion helps specify precisely the set of references that have substantially overlapping data sets. Overlap produces temporal locality in cache accesses. A similar concept of uniformly generated references has been used in earlier work in the context of reuse and iteration space tiling [3, 4] .
The notion of data footprints is introduced to capture the combined set of data accesses made by references within each uniformly intersecting class. (The term footprint was originally coined by Stone and Thiebaut [5] .) Then, an algorithm to compute precisely the total size of the data footprint for a given loop partition is presented. Precisely computing of the size of the set of data elements accessed by a loop tile was itself an important and open problem. While general optimization methods can be applied to minimize the size of the data footprint and derive the corresponding loop partitions, we demonstrate several important special cases where the optimization problem is very simple. The size of data footprints can also be used to guide program transformations to achieve better cache performance in uniprocessors as well.
Although there have been several papers on hyperplane partitioning of iteration spaces, the problem of deriving the optimal hyperparallelepiped tile parameters for general ane index expressions has remained open. For example, Irigoin and Triolet [6] i n troduce the notion of loop partitioning with multiple hyperplanes which results in hyperparallelepiped tiles. The purpose of tiling in their case is to provide parallelism across tiles, and vector processing and data locality within a tile. They propose a set of basic constraints that should be met by a n y partitioning and derive the conditions under which the hyperplane partitioning satises these constraints.
Although their paper describes useful properties of hyperplane partitioning, it does not address the issue of automatically generating the tile parameters. Careful analysis of the mapping from the iteration space to the data space is very important in automating the partitioning process. Our paper describes an algorithm for automatically computing the partition based on the notion of cumulative footprints, derived from the mapping from iteration space to data space.
Abraham and Hudak [7] considered loop partitioning in multiprocessors with caches. However, they dealt only with index expressions of the form index variable plus a constant. They assumed that the array dimension was equal to the loop nesting and focused on rectangular and hexagonal tiles. Furthermore, the code body was restricted to an update of A[i; j].
Our framework, however, does not place these restrictions on the code body. It is able to handle much more general index expressions, and produce parallelogram partitions if desired. We also show that when Abraham and Hudak's methods can be applied to a given loop nest, our theoretical framework reproduces their results.
Ramanujam and Sadayappan [8] deal with data partitioning in multicomputers with local memory and use a matrix formulation; their results do not apply to multiprocessors with caches. Their theory produces communication-free hyperplane partitions for loops with ane index expressions when such partitions exist. However, when communication-free partitions do not exist, they can deal only with index expression of the form variable plus a constant oset. They further require the loop dimension to be equal to the loop nesting.
In contrast, our framework is able to discover optimal partitions in cases where communication free partitions are not possible, and we do not restrict the loop nesting to be equal to array dimension. In addition, we show that our framework correctly produces partitions identical to those of Ramanujam and Sadayappan when communication free partitions do exist.
In a recent paper, Anderson and Lam [9] derive communication-free partitions for multicomputers when such partitions exist, and block loops into squares otherwise. Our notion of cumulative footprints allows us to derive optimal partitions even when communication-free partitions do not exist.
Gupta and Banerjee [10] address the problem of automatic data partitioning by analyzing the entire program. Although our paper deals with loop and data partitioning for a single loop only, the following dierences in the machine model and the program model lead to problems that are not addressed by Gupta and Banerjee: (1) The data distributions considered by them do not include general hyperparallelepipeds. In order to deal with hyperparallelepipeds, one requires the analysis of communication presented in our paper. Our work complements the work of Wolfe and Lam [3] and Schreiber and Dongarra [11] . Wolfe and Lam derive loop transformations (and tile the iteration space) to improve data locality in multiprocessors with caches. They use matrices to model transformations and use the notion of equivalence classes within the set of uniformly generated references to identify valid loop transformations to improve the degree of temporal and spatial locality within a given loop nest. Schreiber and Dongarra briey address the problem of deriving optimal hyperparallelepiped iteration space tiles to minimize communication trac (they refer to it as I/O requirements).
However their work diers from this paper in the following ways: (1) Their machine model does not have a processor cache. (2) The data space corresponding to an array reference and the iteration space are isomorphic. These restrictions make the problem of computing the communication trac much simpler. Also, one of the main issues addressed by S c hreiber and Dongarra is the atomicity requirement of the tiles which is related to the dependence vectors and this paper is not concerned with those requirements as it is assumed that the iterations can be executed in parallel.
Ferrante, Sarkar, and Thrash [12] address the problem of estimating the numb e r o f c a c he misses for a nest of loops. This problem is similar to our problem of nding the size of the cumulative footprint, but diers in these ways: (1) We consider a tile in the iteration space and not the entire iteration space; our tiles can be hyperparallelepipeds in general. (2) We partition the references into uniformly intersecting sets, which makes the problem computationally more tractable, since it allows us to deal with only the tile at the origin. (3) Our treatment o f coupled subscripts is much simpler, since we look at maximal independent columns, as shown in Section 5.2.
Overview of the Paper
The rest of this paper is structured as follows. Section 2 states our system model and our program-level assumptions. Section 3 rst presents a few examples to illustrate the basic ideas behind loop partitioning; it then discusses the notion of data partitioning, and when it is important. Section 4 develops the theoretical framework for partitioning and presents several additional examples. Section 5 extends the basic framework to handle more general expressions, and Section 6 indicates modications to the basic framework to handle data partitioning and more general types of systems. The framework for both loop and data partitioning has been implemented in the compiler system for the Alewife multiprocessor. The implementation of our compiler system and a sampling of results is presented in Section 7, and Section 8 concludes the paper.
Problem Domain and Assumptions
This paper focuses on the problem of partitioning loops in cache-coherent shared-memory multiprocessors. Partitioning involves deciding which loop iterations will run collectively in a thread of computation. Computing loop partitions involves nding the set of iterations which when run in parallel minimizes the volume of communication generated in the system. This section describes the types of programs currently handled by our framework and the structure of the system assumed by our analysis. Figure 1 shows the structure of the most general single loop nest that we consider in this paper. 3 ). Similar notation has been used in several papers in the past, for example, see [3, 4] . All our vectors and matrices have i n teger entries unless stated otherwise. We assume that the loop bounds are such that the iteration space is rectangular. However, we note that our methods can still be used to derive reasonable partitions when this condition is not met. Loop indices are assumed to take all integer values between their lower and upper bounds, i.e, the strides are one. Previous work [7, 8, 13] in this area restricted the arrays in the loop body to be of dimension exactly equal to the loop nesting. Abraham and Hudak [7] further restrict the loop body to contain only references to a single array; furthermore, all references are restricted to be of the form A[i 1 + a 1 ; i 2 + a 2 ; : : : ; i d + a d ] where a j is an integer constant. Matrix multiplication is a simple example that does not t these restrictions.
Program Assumptions
Given P processors, the problem of loop partitioning is to divide the iteration space into P tiles such that the total communication trac on the network is minimized with the additional constraint that the tiles are of equal size, except at the boundaries of the iteration space. The constraint of equal size partitions is imposed to achieve load balancing. We restrict our discussions to hyperparallelepiped tiles, of which rectangular tiles are a special case.
Like [ 7 , 8 , 1 3 ] , we do not include the eects of synchronization in our framework. Synchronization is handled separately to ensure correct behavior. For example, in the doall loop in Figure 1 , one might i n troduce a barrier synchronization after the loop nest if so desired. We also note that in many cases ne-grain data-level synchronization can be used within a parallel do loop to enforce data dependencies and its cost approximately modeled as slightly more expensive communication than usual [14] . See Appendix B for some details.
System Model
Our analysis applies to systems whose structure is similar to that shown in Figure 2 . The system comprises a set of processors, each with a coherent cache. Cache misses are satised by global memory accessed over an interconnection network or a bus. The memory can be implemented as a single monolithic module (as is commonly done in bus-based multiprocessors), or in a distributed fashion as shown in the gure. The memory modules might also be implemented on the processing nodes themselves (data partitioning for locality makes sense only for this case). In all cases, our analysis assumes that the cost of a main memory access is much higher than a cache access, and for loop partitioning, our analysis assumes that the cost of the main memory access is the same no matter where in main memory the data is located. The goal of loop partitioning is to minimize the total number of main memory accesses. For simplicity, w e assume that the caches are large enough to hold all the data required by a loop partition, and that there are no conicts in the caches. When caches are small, the optimal loop partition does not change, rather, the size of each loop tile executed at any given time on the processor must be adjusted so that the data ts in the cache (if we assume that the cache is eectively ushed between executions of each loop tile). Unless otherwise stated, we assume that cache lines are of unit length. The eect of larger cache lines can be included easily as suggested in [7] , and is discussed further in Section 6.2.
Loop Partitions and Data Partitions
This section presents examples to introduce and illustrate some of our denitions and to motivate the benets of optimizing the shapes of loop and data tiles. More precise denitions are presented in the next section.
As mentioned previously, w e deal with index expressions that are ane functions of loop indices. In other words, the index function can be expressed as in Equation 1 . Consider the following example to illustrate the above expression of index functions. In this example, the second and fourth column of G are zero indicating that the second and fourth subscripts of the reference are independent of the loop indexes. In such cases, we show in Section 5.2 that we can ignore those columns and treat the referenced array as an array o f lower dimension. In future, without loss of generality, w e assume that the G matrix contains no zero columns. Now, let us introduce the concept of a loop partition by examining the following example. Loop partitioning species the tiling parameters of the iteration space. Loop partitioning is sometimes termed iteration space partitioning or tiling. 
EndDoall
Let us assume that we h a v e 100 processors and we w ant to distribute the work among them. There are 10,000 points in the iteration space and so one can allocate 100 of these to each o f the processors to distribute the load uniformly. Figure 3 shows two simple ways of partitioning the iteration space { by r o ws and by square blocks { into 100 equal tiles.
Minimizing communication volume requires that we minimize the number of data elements accessed by each loop tile. To facilitate this optimization, we i n troduce the notion of a data footprint. F ootprints comprise the data elements referenced within a loop tile. In other words, the footprints are regions of the data space accessed by a loop tile. In particular, the footprint with respect to a specic reference in a loop tile gives us all the data elements accessed through that reference from within a tile of a loop partition.
Using Figure 4 , let us illustrate the footprints corresponding to a reference of the form B[i+j,i-j] for the two loop partitions shown in Figure 3 . The footprints in the data space resulting from the loop partition a are diagonal stripes and those resulting from partition b are square blocks rotated by 45 degrees. Algorithms for deriving the footprints are presented in the next section.
Let us compare the two loop partitions in the context of a system with caches and uniformaccess memory (see Figure 2 ) by computing the number of cache misses. The number of cache misses is equal to the number of distinct elements of B accessed by a loop tile, which is equal to the size of a loop tile's footprint on the array B. (Section 6.1 deals with minimizing cache-coherence trac). Caches automatically fetch a loop tile's data footprint as the loop tile executes. For each tile in partition a, the numb e r o f c a c he misses can be shown to be 104 (see Section 5.1) whereas the number of cache misses in each tile of partition b can be shown to be 140. Thus, because it allows data reuse, loop partition a is a better choice if our goal is to minimize the number of cache misses, a fact that is not obvious from the source code.
When is data partitioning important? Data partitioning is the problem of partitioning the data arrays into data tiles and assigning each data tile to a local memory module, such that the number of memory references that can be satised by the local memory is maximized. Data partitioning is relevant only for nonuniform memory-access (NUMA) systems (for example, see Figure 5 ).
In systems with nonuniform memory-access times, both loop and data partitioning are required. Our analysis applies to such systems as well. The loop tiles are assigned to the processing nodes and the data tiles to memory modules associated with the processing nodes so that a maximum number of the data references made by the loop tiles are satised by the local memory module. Note that in systems with nonuniform memory-access times, but which h a v e caches, data partitioning may still be performed to maximize the number of caches misses that can be satised by the memory module local to the processing node.
Referring to Figure 4 , the footprint size is minimized by c hoosing a diagonal striping of the data space as the data partition, and a corresponding horizontal striping of the iteration space as the loop partition. The additional step of aligning corresponding loop and data tiles on the same node maximizes the number of local memory references. In fact, the above horizontal partitioning of the loop space and diagonal striping of the data space results in zero communication trac. Ramanujam and Sadayappan [8] presented algorithms to derive such communication-free partitions when possible. On the other hand, in addition to producing the same partitions when communication-trac-free partitions exist (see Sections 5.1 and 6.3), our analysis will discover partitions that minimize trac when such partitions are non-existent a s w ell (see Example 8) .
Example 3
For the loop shown in Example 3, a parallelogram partition results in a lower cost of memory access compared to any rectangular partition since most of the inter iteration communication can be internalized to within a processor for a parallelogram partition (see Section 7.1). Because rectangular partitions often do not minimize communication, we w ould like to include parallelograms in the formulation of the general loop partitioning problem. In higher dimensions a parallelogram tile generalizes to a hyperparallelepiped; the next section denes it precisely.
A F ramework for Loop and Data Partitioning
This section rst denes precisely the notion of a loop partition and the notion of a footprint o f a loop partition with respect to a data reference in the loop. We prove a theorem showing that the number of integer points within a tile is equal to the volume of the tile, which allows us to use volume estimates in deriving the amount of communication. We then present the concept of uniformly intersecting references and a method of computing the cumulative footprint for a set of uniformly intersecting references. We develop a formalism for computing the volume of communication on the interconnection network of a multiprocessor for a given loop partition, and show h o w loop tiles can be chosen to minimize this trac. We briey indicate how the cumulative footprint can be used to derive optimal data partitions for multicomputers with local memory (NUMA machines).
Loop Tiles in the Iteration Space
Loop partitioning results in a tiling of the iteration space. We consider only hyperparallelepiped partitions in this paper; rectangular partitions are special cases of these. Furthermore, we focus on loop partitioning where the tiles are homogeneous except at the boundaries of the iteration space. Under these conditions of homogeneous tiling, the partitioning is completely dened by specifying the tile at the origin, as indicated in Figure 6 . Under homogeneous tiling, the concept The rest of this paper deals with optimizing the shape of the tile at the origin for minimal communication. Because the amount of communication is related to the number of integer points within a tile, we begin by proving the following theorem relating the volume of a tile to the number of integer points within it. This theorem on lattices allows us to use volumes of hyperparallelepipeds derived using determinants to determine the amount of communication.
Theorem 1 The number of integer points (iteration points) in tile L is equal to the volume of the tile, which is given by j detLj.
Proof: We provide a sketch of the proof; a more detailed proof is given in [15] .
It is easy to show that the theorem is true for an n-dimensional semi-open rectangle. For a given n-dimensional semi-open hyperparallelepiped, let its volume be V and let P be the number of integer points in it. It can be shown that one can pack R n of these hyperparallelepipeds into an n-dimensional rectangle of volume V R and number of integer points P R , for any positive integer R, such that both V R R n V and P R R n P grow slower than R n . In other words, V R = R n V + v(R); P R = R n P + p ( R )
where v(R) and p(R) grow slower than R n . N o w subtracting the second equation from the rst one, and noting that V R = P R for the n-dimensional rectangle, we get,
Given that both v(R) and p(R) grow slower than R n , this can only be true when V P = 0 .
Proposition 1 The number of integer points in any general tile is equal to the number of integer points in the tile at the origin.
Proof: Straight-forward from the denition of a general tile.
In the following discussion, we ignore the eects of the boundaries of the iteration space in computing the numb e r o f i n teger points in a tile. As our interest is in minimizing the communication for a general tile, we can ignore boundary eects.
Footprints in the Data Space
For a system with caches and uniform access memory, the problem of loop partitioning is to nd an optimal matrix L that minimizes the numb e r o f c a c he misses. The rst step is to derive an expression for the number of cache misses for a given tile L. Because the number of cache misses is related to the number of unique data elements accessed, we i n troduce the notion of a footprint that denes the data elements accessed by a tile. The footprints are regions of the data space accessed by a loop tile. The footprint gives us all the data elements accessed through a particular reference from within a tile of a loop partition. Because we consider homogeneous loop tiles, the number of data elements accessed is the same for each loop tile.
We will compute the numb e r o f c a c he misses for the system with caches and uniform access memory to illustrate the use of footprints. The body of the loop may contain references to several variables and we assume that aliasing has been resolved; two references with distinct names do not refer to the same location. Let A 1 ; A 2 ; : : : ; A K be references to array A within the loop body, and let f( To facilitate computing the size of the union of the footprints we divide the references into multiple disjoint sets. If two footprints are disjoint or mostly disjoint, then the corresponding references are placed in dierent sets, and the size of the union is simply the sum of the sizes of the two footprints.
However, references whose footprints overlap substantially are placed in the same set. The notion of uniformly intersecting references is introduced to specify precisely the idea of \sub-stantial overlap". Overlap produces temporal locality in cache accesses, and computing the size of the union of their footprints is more complicated.
The notion of uniformly intersecting references is derived from denitions of intersecting references and uniformly generated references. The intersection of footprints of two references that are not uniformly generated is often very small. For non-uniformly generated references, although the footprints corresponding to some of the iteration-space tiles might o v erlap partially, the footprints of others will have n o o v erlap. Since we are interested in the worst-case communication volume between any pair of footprints, we will assume that the total communication generated by t w o non-uniformly intersecting references is essentially the sum of the individual footprints.
However, the condition that two references are uniformly generated is not sucient for two references to be intersecting. As a simple example, A[2i] and A[2i + 1] are uniformly generated, but the footprints of the two references do not intersect. For the purpose of locality optimization through loop partitioning, our denition of reuse of array references will combine the concept of uniformly generated arrays and the notion of intersecting array references. This notion is similar to the equivalence classes within uniformly generated references dened in [3] .
Denition 6 Two array references are uniformly intersecting if they are b oth intersecting and uniformly generated.
Example 5 The following sets of references are uniformly intersecting. This follows directly from the denition of uniformly intersecting references. Recall that an element~i of the loop tile is mapped by the reference (G;ã r ) to data elementd r =~iG +ã r , and by the reference (G;ã s ) to data elementd s =~iG +ã s . The translation vector, (d s d r ), is clearly independent o f i .
The volume of cache trac imposed on the network is related to the size of the cumulative footprint. We describe how to compute the size of the cumulative footprint in the following two sections as outlined below.
First, we discuss how the size of the footprint for a single reference within a loop tile can be computed. In general, the size of the footprint with respect to a given reference is not the same as the numb e r o f p o i n ts in the iteration space tile.
Second, we describe how the size of the cumulative footprint for a set of uniformly intersecting references can be computed. The sizes of the cumulative footprints for each o f these sets are then summed to produce the size of the cumulative footprint for the loop tile.
Size of a Footprint for a Single Reference
This section shows how to compute the size of the footprint (with respect to a given reference and a given loop tile L) eciently for certain common cases of G. The general case of G is dealt with in Section 5. We begin with a simple example to illustrate our approach. Proof:ĩ 1 G =~i 2 G implies~i 1 =~i 2 if and only if the only solution toxG =0 is0. The latter implies that the nullspace o f G T is of dimension 0. From a fundamental theorem of Linear Algebra [16] , this means that the rows of G are linearly independent. It is to be noted that when the rows of G are not independent there exists a nontrivial integer solution toxG =0, given that the entries in G are integers. This proves the second statement of the lemma. Lemma 2 The mapping of the iteration space to the data space as dened b y G is onto if and only if the columns of G are independent and the g.c.d. of the subdeterminants of order equal to the number of columns is 1.
Proof: Follows from the Hermite normal form theorem as shown in [17] . Lemma 2 LG. G is invertible implies that the rows of G are independent and hence the mapping dened b y G is one to one from Proof: It is immediate from Lemma 2 that G is onto when it is unimodular. G is onto implies that every data point in D has an inverse in the iteration space. Can the inverse of the data point be outside of L? L emma 3 shows this is not possible since G is invertible.
We make the following two observations about Theorem 2.
G is unimodular is a sucient condition; but not necessary. An example corresponds to the reference A[i + j]. Further discussions on this is contained in Section 5.
One may w onder why G being onto is not sucient for D to coincide with the footprint. Even when every integer point i n D has an inverse, it is possible that the inverse is outside of L. F or example, the mapping dened by the G matrix The data point (1) is in LG but none of its inverses is in L. The same is true for the data points (2); (3); (6); (7), and (11). The one to one property o f G guarantees that no point from outside of L can be mapped to inside of D. The reason for this is that the one to one property is true even when G is treated as a function on reals. Let us now i n troduce our technique for computing the cumulative footprint when G is unimodular. Algorithms for computing the size of the individual footprints and the cumulative footprint when G is not unimodular are discussed in Section 5.
Size of the Cumulative F ootprint
The size of the cumulative footprint F for a loop tile is computed by summing the sizes of the cumulative footprints for each of the sets of uniformly intersecting references. This section presents a method for computing the size of the cumulative footprint for a set of uniformly intersecting references when G is unimodular, that is, when the conditions stated in Theorem 2 are true. More general cases of G are discussed in Section 5. We rst describe the method when there are exactly two uniformly intersecting references, and then develop the method for multiple references. Ignoring SDUHis reasonable if we assume that the oset vectors in a uniformly intersecting set of references are small compared to the tile size. We refer to this simplication as the overlapping subtile approximation. This approximation will result in our estimates being higher than the actual values. Although one can easily derive a more exact expression, we use the overlapping subtile approximation to simplify the computation. The following discussion formalizes this notion and extends it to multiple references.
Cumulative F ootprint for Multiple References The basic approach for estimating the cumulative footprint size involves deriving an eective oset vectorâ that captures the combined eects of multiple oset vectors when there are several overlapping footprints resulting from a set of uniformly intersecting references. First, we need a few denitions.
Denition 8 Given a loop tile L, there a r e two neighboring loop tiles along the ith row of L dened b y f y j y = x + l i ; x 2 tile Lg and fỹ jỹ =x l i ; x 2 tile Lg, wherel i is the ith row of L, for 1 i l . W e r efer to the former neighbor as the positive neighbor and the latter as the negative neighbor. We also refer to these neighbors as the neighbors of the parallel sides of the tile determined by the rows of L, excluding the ith row. Figure 11 illustrates the notion of neighboring tiles.
The notion of neighboring tiles can be extended to the data space in like manner as follows. The notion of the communication along the rows of D facilitates computing the size of the cumulative footprint. Consider the data footprints of a loop tile with respect to a set of uniformly intersecting references shown in Figure 12 . The cumulative footprint can be expressed as the union of any one of the footprints and the remaining elements of the cumulative footprint. We take the union because a given data element needs to be fetched only once into a cache.
In Figure 12 , the cumulative footprint is the union of the footprint of the loop tile with respect toã 4 and the shaded regions corresponding to the remaining elements of the cumulative footprint resulting from the other references. The area of the shaded region can be approximated by the sum of communication along the kth row for 1 k 2 a s s h o wn in Figure 13 2 For data partitioning, however, the formulation must be modied as discussed in Section 6.3.
Proof: As observed e arlier, the size of the cumulative footprint is approximately the size of any of the footprints plus the communication across its sides. Clearly the size of any one of the footprints is given by j detDj. The rest follows from Lemma 4. Finally, as stated earlier, the total communication generated by non-uniformly intersecting sets of references is essentially the sum of the communicating generated by the individual cumulative footprints. Example 8 in Section 4.5 discusses an instance of such a computation.
Minimizing the Size of the Cumulative F ootprint
We n o w focus on the problem of nding the loop partition that minimizes the size of the cumulative footprint. The overall algorithm is summarized in Table 1 . 
EndDoall
Here we h a v e t w o uniformly intersecting sets of references: one for A and one for B. Let us look at the class corresponding to B since it is more instructive. Because A has only one reference, whose G is unimodular, its footprint size is independent of the loop partition, given a xed total size of the loop tile, and therefore need not gure in the optimization process. The G matrix corresponding to the references to B is, 
This expression must be minimized keeping j det Lj (or the product L i L j L k ) a constant. The product represents the area of the loop tile and must be kept constant to ensure a balanced load. The constant is simply the total area of the iteration space divided by P, the number of processors. For example, if the loop bounds are I, J, and
This optimization problem can be solved using standard methods, for example, using the method of Lagrange multipliers [18] . The size of the cumulative footprint is minimized when Abraham and Hudak's algorithm [7] gives an identical partition for this example. We n o w use an example to show h o w to minimize the total numb e r o f c a c he misses when there are multiple uniformly intersecting sets of references. The basic idea here is that the references from each set contribute additively to trac. 
There are three uniformly intersecting classes of references, one for B, one for C, and one for A. Because A has only one reference, its footprint size is independent of the loop partition, given a xed total size of the loop tile, and therefore need not gure in the optimization process. 
G i s I n v ertible, but not Unimodular
G is invertible and not unimodular implies that not every integer point in the hyperparallelepiped D is an image of an iteration point i n L . A unit cube in the iteration space is mapped to a hyperparallelepiped of volume equal to j det Gj. So the size of the data footprint is j det D= det Gj = j det Lj. When G is invertible the size of the data footprint is exactly the size of the loop tile since the mapping is one to one.
Next, the expression for the size of the cumulative footprint i s v ery similar to the one in Theorem 3, except that the data elements accessed are not dense in the data space. That is, the data space is sparse. The size of the cumulative footprint according to Theorem 4 is
we constrain L i L j = 100 for load balance, we get L j = 1 and L i = 100. This partitioning represents horizontal striping of the iteration space.
Columns of G are Dependent and the Rows are Independent
We can apply Theorem 4 to compute the size of a footprint when the columns of G are dependent, as long as the rows are independent. We derive a G 0 from G by c hoosing a maximal set of independent columns from G, such that G 0 is invertible. We can then apply Theorem 4 to compute the size of the footprint as shown in the following example. 
The rows of G are Dependent
The rows of G are dependent means that the mapping from the iteration space to the data space is many to one. It is hard to derive an expression for the footprint in general when the rows are dependent. However, we can compute the footprint and the cumulative footprint for many special cases that arise in actual programs. In this section we shall look at the common case where the rows are dependent because one or more of the index variables do not appear in the array reference. We shall illustrate our technique with the matrix multiply program shown in Example 10 below. The notation l$C[i,j] means that the read-modify-write of C[i,j] is atomic.
Example 10
The references to the matrices A, B and C belong to separate uniformly intersecting references. So the cumulative footprint is the sum of the footprints of each of the references. We will focus We cannot apply our earlier results to compute the footprint since G is a many t o o n e mapping. However, we can nd an invertible G 0 such that for every loop tile L, there is a tile L 0 such that the number of elements in footprints LG and LG 0 are the same. 
Other System Environments
This section describes how our framework can be used to solve the partitioning problem in a wide range of systems including those with coherent caches, distributed-memory, and non-unit cache line sizes.
Coherence-Related Cache Misses
Our analysis presented in the previous section was concerned with minimizing the cumulative footprint size. This process of minimizing the cumulative footprint size not only minimizes the number of rst-time cache misses, but the number of coherence-related misses as well. 
In a load-balanced partitioning, j det Lj = L i L j is a constant, so the L i L j term drops out of the optimization. The optimization process then attempts to minimize L j , which is proportional to the volume of cache coherence trac, as depicted in Figure 14 . Let us focus on regions X, Y and Z in Figure 14 (c). As explained in Figure 13 , the processor working on the loop tile to which these regions belong (say, processor P O ) shares a portion of its cumulative footprint with processors working on neighboring regions in the data space. Specically, region Z is a subtile of the positive neighbor and region Y is a subtile shared with its negative neighbor. Region X, however, is completely private to P O .
Let us consider the situation after the rst iteration of the outer sequential loop. Accesses of data elements within region X will hit in the cache, and thereby incur zero communication cost. Data elements in region Z, however, potentially cause misses because the processor working on the positive neighbor might h a v e previously written into those elements, resulting in those elements being invalidated from P O 's cache. Each of these misses by processor P O suers a network round trip because of the need to inform the processor working on its positive neighbor to perform a writeback and then to send the data to processor P O . F urthermore, if the home memory location for the block is elsewhere, the miss requires an additional network roundtrip.
Similarly, in region Y, a write by processor P O potentially incurs two network round trips as well. The two round trips result from the need to invalidate the data block from the cache of the processor working on the negative neighbor, and then to fetch the blocks into P O 's cache.
In any case, the coherence trac is proportional to the area of the shared region Z, which i s equal to the area of the shared region Y, and is given by L j .
Eect of Cache Line Size
The eect of cache line sizes can be incorporated easily into our analysis. Because large cache lines fetch m ultiple data words at the cost of a single miss, one data space dimension will be favored by the cache. Without loss of generality, let us assume that the j th dimension of 
Data Partitioning
In systems in which main memory is distributed with the processing nodes (e.g., see Figure 5 ), data partitioning is the problem of partitioning the data arrays into data tiles and the nested loops into loop tiles and assigning the loop tiles to the processing nodes and the corresponding data tiles to memory modules associated with the processing nodes so that a maximum number of the data references made by the loop tiles are satised by the local memory module. Our formulation facilitates data partitioning straightforwardly. There are two cases to consider: systems with caches and systems without caches.
Systems with Caches The data partitioning strategy in distributed shared-memory systems with caches ( Figure 5(a) ) proceeds as follows. The optimal loop partition L is rst derived by minimizing the cumulative footprint size as described in the previous sections. Data partitioning requires the additional derivation of the optimal data partition D for each class of uniformly intersecting references from the optimal loop partition L. We derive the shapes of the data tiles D for each G corresponding to a specic class of uniformly intersecting references. A specic data tile is chosen from the footprints corresponding to each reference in an uniformly intersecting set. In systems with caches, the choice of a specic footprint d o e s n o t matter, because each data element in the footprint results in a single miss. We then place each loop tile with the data tiles accessed by it on the same processing node.
As an example, let us work out the optimal data partitioning for Example 2. The optimal loop partition for this example was worked out in Section 5.1. The optimal L was shown to stripe the iteration space horizontally and was given by " 100 0 0 1 # The corresponding footprint D = LG represents a diagonal striping of the data space and is
given by " 100 100 1 1 # Thus, for this example, if diagonal tiles of data (as depicted in Figure 15 ) are placed in the memory modules close to the processors with the corresponding iteration tiles, cache misses will be satised completely within the node. This data partition thus represents a communicationfree data partition.
Interestingly, because G for this example is not unimodular (its determinant is 2), not all data space points are accessed. In the gure, the shaded points represent the untouched data elements.
Systems without Caches The compiler has two options to optimize communication volume in systems without caches. The compiler can choose to make local copies of remote data, or it can fetch remote data each time the data is needed. In the former case, the compiler can use the same partitioning algorithms described in this paper for systems with caches, but it must also solve the data coherence problem for the copied data. This section addresses the latter case.
Although the overall data partitioning strategy remains largely the same as described in the previous section, we m ust make one change in the footprint size computation to reect the fact that a given data tile is placed in local memory and data elements from neighboring tiles have t o be fetched from remote memory modules each time they are accessed. Because data partitioning for distributed-memory systems without caches (see Figure 5(b) ) assumes that data from other memory modules is not dynamically copied locally (as in systems with caches), we replace the max min formulation by the cumulative spread a We then derive the optimal L to minimize this communication volume. We then derive the optimal data partition D for each class of uniformly intersecting references from the optimal loop partition L as described in the previous section on systems with caches. A specic data tile is chosen from the footprints corresponding to each reference in an uniformly intersecting set. In systems without caches, because a single data element might h a v e to be fetched multiple times, the choice of a specic data footprint does matter. A simple heuristic to maximize the number of local accesses is to choose a data tile whose osets are the medians of all the osets in each dimension. We can show that using a median tile is optimal for one-dimensional data spaces, and close to optimal for higher dimensions. However, a detailed description is beyond the scope of this paper. We then place each loop tile with the corresponding data tiles accessed by it on the same processor.
Implementation and Results
This paper presents cumulative footprint size measurements from an algorithm simulator and execution time measurements from an actual compiler implementation on a multiprocessor.
Algorithm Simulator Experiments
We h a v e written a simulator of partitioning algorithms that measures the exact cumulative footprint size for any given hyperparallelepiped partition. The simulator also presents analytically computed footprint sizes using the formulation presented in Theorem 3.
We present in Figure 16 
EndDoall
The example demonstrates that the analytical method yields accurate estimates of cumulative footprint sizes. The estimates are higher than the measured values when the partitions are mismatched with the oset vectors due to the overlapping subtile approximation described in Section 4.4. We can also see that the dierence between the optimal parallelogram partition and a poor partition is signicant. The dierences become even greater if bigger osets are used. This example also shows that rectangular partitions do not always yield the best partition.
Implementation on Alewife
We h a v e also implemented some of the ideas from our framework in a compiler for the Alewife machine [19] to understand the extent to which good loop partitioning impacts end application performance, and the extent to which our theory predicts the optimal loop partition. The Alewife machine implements a shared global address space with distributed physical memory and coherent caches. The nodes contain slightly modied SPARC processors and are congured in a 2-dimensional mesh network.
Distributed-memory architectures require three types of related analyses to distribute code and data on to the machine:
Loop Partitioning Each processor must be assigned a set of loop iterations that maximizes reuse of data in caches and achieves good load balance.
Data Partitioning and Alignment Arrays must be distributed among the processors such that memory references that miss in the cache go to the local memory rather than across the network to another node. This is accomplished by partitioning arrays with tile shapes suggested by the D matrix, and then aligning corresponding loop and data tiles on the same processor.
Placement In an architecture like Alewife the memory access time depends on the distance between the node making the memory request and the node where the requested data resides. The data partitioning and alignment phases make assignments to virtual processors which m ust be mapped onto the real machine in order to minimize memory reference latency. This is a smaller eect that may become important i n v ery large machines. We h a v e implemented loop and data partitioning as well as alignment. The results in this paper focus on loop partitioning so we made sure that whatever loop partition was chosen, the optimal data partition for that particular loop partition was used. Otherwise, isolating the eect of cache misses is dicult because changing the loop partition alters both the number of non-local memory references and the numb e r o f c a c he misses.
The structure of our compiler is shown in Figure 17 . The input to the compiler is a program where parallelism is specied either by the programmer, or in a previous compilation phase. As in [7] , we separate the notion of parallelization from that of implementation. The languages accepted at present are Mul-T, a parallel Lisp language, and Semi-C, a parallel version of C. An initial series of transformations are performed including constant-folding and procedure integration producing a graphical intermediate form called WAIF.
WAIF is a hierarchical graphical representation of a source program. WAIF has two abstraction levels: The program graph (WAIF-PG) and the task and data communications graph (WAIF-CG). WAIF-PG is a customized version of an abstract syntax tree. WAIF-CG summarizes the communication patterns between tasks and data structures that can be derived from a static analysis. Data and loop partitioning are performed as transformations on the WAIF-CG and then code for sequential threads with explicit synchronization is generated. The sequential code-generation process performs standard optimizations such as strength reduction and loop-invariant code motion, producing machine code for Alewife's processors.
Alewife Experiment
The performance gain due to loop and data partitioning depends on the ratio of communication to computation and other overhead. To get an understanding of these numbers for Alewife, we ran several versions of the following parallel loop nest on an Alewife machine simulator. Using the algorithms in this paper, and taking the four-word cache line size into account, the compiler chooses a rectangular loop partition and determines that the optimal partition has an aspect ratio of 2:1. The compiler then chooses the closest aspect ratio (1:1) that also achieves load balance for the given problem size and machine size, which results in a tile size of 64x64 iterations. We also generate code using suboptimal partitions with tile sizes ranging from 16x256 to 256x16. This set of executions is labeled run A.
We ran a second version of the program using a dierent set of oset vectors that give a n optimal aspect ratio of 8:1 (run B). This results in a desired tile size between 256x16 and 128x32, with the compiler choosing 256x16, which has the aspect ration 16:1. Figure 18 shows the running times for the dierent tile sizes, and demonstrates that the compiler was able to pick the optimal partitions for both cases. There is some noise in these gures because there can be variation in the cost of accessing the memory that is actually shared due to cache coherence actions, but the minima of the curves are about where the framework predicted. The actual slope of the curves depends on the cost of computing an address and the actual memory latency. One of the reasons that the slopes of these curves are not very steep is that we used ideal data partitions making most references local. The other is that the code generated for index calculations for array references is not very good right n o w because Alewife does not have virtual memory. This means that arrays must be allocated non-contiguously and accessed indirectly. Much of this overhead can be eliminated with more sophisticated compiler analysis of loop invariant expressions. We can separate these two eects by running the same programs with a no-op replacing the actual loads and stores for the array references. This running time represents the noncommunication overhead including index calculation as well as the time to spawn tasks on all processors. Subtracting these times from those in the previous gure gives us the numbers in Figure 19 . They represent dierences due only to communication and thus represent the greatest possible gain from correct partitioning when the data is partitioned so that almost all accesses are to local memory. In a more realistic program it would likely not be possible to have such an ideal data partition and there would be more non-local references making cache reuse that much more important. In addition, these dierences are smaller than they might b e o n future machines because the local memory latency on Alewife is quite low and will increase as processors get even faster.
Another consideration is that 16 processors is a small machine size. In a larger machine the rectangular partitions can have wider aspect ratios leading to greater dierences for non-optimal partitions. We ran the same program on 64 processors, increasing the data size so that each processor would do the same amount o f w ork as in the 16 processor case. The much wider variation can be seen in Figure 20 . The 8x512 point for run A is o the top of the chart.
Conclusions
The performance of cache-coherent systems is heavily predicated on the degree of temporal locality in the access patterns of the processor. If each block of data is accessed a number of times by a given processor, then caches will be eective in reducing network trac. Loop partitioning for cache-coherent m ultiprocessors strives to achieve precisely this goal. for extending our compiler implementation to include general ane index expressions and data partitioning. Gino Maa helped dene and implement the compiler system and its intermediate form. Andrea Carnevali suggested the simple proof for Theorem 1 on lattices that is sketcked in this paper. We a c knowledge the contributions of the Alewife group for implementing and supporting the Alewife simulator and runtime system used in obtaining the results.
A A F ormulation of Loop Tiles Using Bounding Hyperplanes
A specic hyperparallelepiped loop tile is dened by a set of bounding hyperplanes. Similar formulations have also been used earlier [6] .
Denition 15 Given a l dimensional loop nestĩ, e ach tile of a hyperparallelepiped loop partition is dened by the hyperplanes given by the rows of the l l matrix H and the column vectors and as follows. The parallel hyperplanes areh jĩ = j andh jĩ = j + j , for 1 j l. A n iteration belongs to this tile if it is on or inside the hyperparallelepiped.
When loop tiles are assumed to be homogeneous except at the boundaries of the iteration space, the partitioning is completely dened by specifying the tile at the origin, namely (H;0;), as indicated in Figure 21 . For notational convenience, we denote the tile at the origin as L. 
B Synchronization References
Sequential do loops can often be converted to parallel do loops by i n troducing ne-grain datalevel synchronization to enforce data dependencies or mutual exclusion. The cost of synchronization can be approximately modeled as slightly more expensive communication [14] . For example, in the Alewife system the inner loop of matrix multiply can be written using ne-grain synchronization in the form of the loop in Example 12. 
EndDoall
In the code segment in Example 12, the \l$" preceding the C matrix references denote atomic accumulates. Accumulates into the C array can happen in any order, just that each accumulate action must be atomic. Such synchronizing reads or writes are both treated as writes by the coherence system. Similar linguistic constructs are also present i n I d [ 2 0 ] and in a variant o f F ORTRAN used on the HEP [21] .
