This paper presents a new method for the problem of mapping of nested FOR-loops with uniform dependencies, into mesh-connected parallel architectures. This method is based on loop mapping for systolic arrays. The virtual array of cells is derived from the index space, by applying a linear transformation. This array is divided (cut) into a fixed number of clusters, equal to the number of available real processors. The basic idea of our method is that the cutting is performed along properly selected boundary directions, so as to minimize inter-cluster communication and equilibrate the number of virtual cells for every cluster. Each cluster is then assigned to a different processor, which performs in a roughly independent manner, as the communication requirements are now minimized. This mapping cuts down overall communication delays, while using a fixed number of processors from a (n-1)-dimensional mesh-connected distributed architecture.
Introduction
The primary task in parallelizing a FOR-loop, is finding a schedule of computations into time, while preserving the data dependencies of the initial lexicographic loop order. Most methods are based on finding a linear time transformation, since linear time scheduling differs only at a constant from the optimal schedule for "fat" domains, as Darte proved in [3] . Linear scheduling was introduced by Lamport in [8] , with the hyperplane method 1 , which organizes indiced computations into welldefined distinct groups called hyperplanes. Many methods were proposed in literature to find the optimal hyperplane; some of them are based on the solution of diophantine equations [8, 10] and others on the use of integer programming [13, 3] or even linear programming in subspaces [13] . When index spaces with uniform de-pendence vectors are concerned, a polynomial complexity scheduling algorithm is presented in [7] .
Once optimal parallel execution is found, an efficient method of mapping the concurrent groups of computations (hyperplanes) into the parallel architecture should be applied. A systematic methodology for mapping into fixed size systolic arrays was presented in [8] . Since the target architecture is synchronously operating, there is no need for communication efficient mapping and the main criterion for optimality is now the total number of processors. Other methods dealt with the same problem of mapping, while reducing not only the size but also the resulting dimension of the systolic array (see [4, 6, 11] ).
Researchers are trying to minimize the inter-processor communication, by dividing the index space into as much as possible independent groups of computations. Shang and Fortes [14] , and Peir [12] , have presented methods for dividing the index space into independent sets of computations, which are assigned to different processors, thus zeroing the communication cost. Unfortunately, the independent partitioning of the index space is not always feasible. Sheu and Tai have presented a systematic method of partitioning the index space into groups and assign them into different processing elements [15] . Their method first projects the n-dimensional index space onto the zero hyperplane. The resulting projected space is divided into groups of computations, which preserve the initial optimal time schedule. However, their method does not reduce overall communications, since it ignores the maximization of intra-group references. A similar technique was presented by King in [5] . This method produces better results, since it groups together chains of related computations, but requires a greedy (exhaustive) search to define the group, where each computation belongs. Time scheduling is not explicitly defined. They only discuss computational structures on two dimensional index spaces. However, they introduced the idea of grouping related computations together. In addition to this, their target architecture has an unlimited number of processors, which is not realistic for most of actual large index spaces.
In this paper we propose a new method, based on mapping into an unbounded array of systolic cells. The initial index space J n is linearly transformed into another Jï n , which is used for initial systolic mapping. Transformation matrix T is divided into the hyperplane transformation and the space transformation S, which actually maps the initial index space J n to a (n-1)-dimensional projected virtual space. This space represents a virtual array of cells, which should be further divided to fit the size of the available hardware. Our method analytically derives a partitioning, which minimizes communication requirements and delays. This (n-1)-dimensional array is partitioned into blocks of neighboring virtual nodes, where each block is assigned to a different physical processor. We establish the notion of a single cut of the projected index space, along a boundary direction. All possible single cuts along these directions are evaluated, by a formula calculating the approximate total communication requirements between the separate parts of each cut.
DKP_WS99_Paper.doc submitted to World Scientific : 07/10/99 : 12:49 PM
3/14
Once the optimal cut (the one with the minimum communication requirements) has been selected, we perform the same procedure along the rest n-2 dimensions. The resulting set of cuts, called a mapping, divides the virtual space into clusters; all points of each cluster are assigned to the same physical processor. The interprocessor communication is reduced, as neighboring virtual processors are merged together into a physical one. In addition, by cutting the virtual space into equal size tiles (except boundary effects), overall computational load is balanced. Thus, the resulting space mapping is efficient, in terms of processor utilization and communication delays.
Basic terminology and notation used throughout this paper is given in Section 2. Sections 3 presents the properties and algorithms used for partitioning; the virtual array of systolic cells and the proposed mapping is elaborated in Section 4. An example of our method is presented in Section 5 and the obtained results are discussed in Section 6.
Preliminary Concepts and Definitions
We focus on algorithms, which have the form of a nested FOR-loop structure, with uniform data dependencies [11] . The algorithmic model is formally described as follows:
where i i , 0 ≤ i ≤ n-1, are the loop variables, l i and u i are integer-valued constants that represent the lower and upper limits for loop variables, and S j , 1 ≤ j ≤ k, are k assignment statements. For the nested loop given above, the index vector is defined as the vector of dimension n:
T . Furthermore, each statement S j is an assignment of the form:
, where v 0 is an output variable, v j , 1 ≤ j ≤ p, are input variables and E is an arbitrary expression of the input variables. It should be noted that all variables in the loop statements, may be indexed by the index vector i. Let J n be the set of indices:
where Z is the set of integer numbers. J n is an n-dimensional integer space. Each point in this n-dimensional integer space, is a distinct instantiation of the loop body. The instance of statement S j is represented by S j (i). Notice that the points of J n are
4/14
ordered lexicographically; the usual symbol `<` is used to denote this (linear) ordering. Dependencies may exist between variables appearing in instances of assignment statements. A (directed) dependence between two instances S j (i 1 ) and S j (i 2 ), is characterized by the dependence vector d = i 2 -i 1 , as in [10] . The dependence matrix D contains as columns all existing dependence vectors. A dependence vector is denoted by d i , 1 ≤ i ≤ m. Such an algorithm will be defined by its index space J n and its dependence matrix D and will be represented by A(J n , D).
Partitioning the Index Space
In this section, we describe our method of partitioning and mapping an ndimensional index space onto a mesh-connected MIMD architecture
2
. This method is based on systolic loop mapping.
When bounded number of cells or processors is given, a partitioning methodology should also be applied to fit the virtual array (resulting from mapping phase) into the real one. We follow GPLS method (Globally Parallel Locally Sequential) 3 . This means, that since the number of available cells (physical processors) is less than the number of virtual nodes, the virtual array is partitioned into blocks, whose number equals the size of the real (physical) array of processors. All virtual nodes inside the same block are sequentially executed -by the same processor.
The cut of the virtual array into the fixed number of blocks is done along directions, which reduces the overall communication requirements and divides the array into equal size partitions. Since everything in each block is executed by the same physical processor there are no communication requirements. The only communication overhead is imposed by inter-block message passing, which is unavoidable, since all blocks are executed in parallel (Globally Parallel). Of course, there are now more local memory requirements, but the size of memory in distributed architectures is large enough, and accesses to local memory for the same block are fast. Hereunder, the details of our approach are further analyzed.
Transformation of the Index Space J n
We assume that a linear transformation matrix T has already been selected. T is given as two submatrices, and S, as in [10] :
The first row of the above matrix T is the hyperplane vector , which determines the time execution ordering for the transformed index space -ï n : -ï n = T . J n . The submatrix S, is the space transformation, which maps onto a (n-1)-dimensional array of systolic cells S . J n , as in [10] . For the needs of our method, we consider that T was optimally selected. This means that: gives the optimal hyperplane with respect to the least makespan and S produces the optimal space mapping, in terms of the total number of cells and communication links, as shown in [7] . By applying the transformation matrix T to the original index space J, the latter has been transformed to a new index space Jï. By ignoring the first dimension of the transformed index space Jï, we obtain a projected (n-1)-dimensional index space, which will be partitioned and eventually mapped to a (n-1)-dimensional mesh of processing cells, as in [10] .
Cuts and Terminology
In the rest of the paper, the prefix `h-` will be used extensively. It is read as "hyper" and stands for n-dimensional. The terminology after the prefix is taken from the 3-dimensional space. That is:
• h-space: the n-dimensional space, In the course of our method, one must keep in mind the following issues:
4
• matrix T has been previously selected so as, to best meet the needs of our problem, • transformed index space is an h-polygon with edges that are parallel in pairs, • real processor space, onto which our transformed space is to be mapped, is an h-mesh and that • optimal cut will be given in terms of continuous space measures (not discrete).
Communication Cost Function for Alternative Cuts
The cost value of each mapping is computed as the sum of the cost values of its individual cuts. The cost of a cut is defined to be the number of transformed dependence vectors that traverse the cut's h-line. The formula that computes the cost value of a single cut is thus the following (cost function 1):
where: m is the number of distinct dependence vectors, p is the vector that is per- The cost function 2 that is defined above, gives a heuristic measure, of how good a mapping is. Moreover, when the projected transformed index space is an hDKP_WS99_Paper.doc submitted to World Scientific : 07/10/99 : 12:49 PM
8/14
parallelogram, applying the cost function to all different mappings, leads to the optimal solution with optimal processor utilization.
Determining Possible Cut Directions

Algorithm 1. Calculating the binding h-lines
Step 1.1: Define matrix V of dimension nx2 n , containing as columns all the permutations (2 n ) of the coordinates of the index space boundary points:
Step 1.2 &DOFXODWH WKH WUDQVIRUPDWLRQ RI 9 9ï T . V.
Step 1.3: Ignore first row of V, which represents time coordinate, and construct matrix W of dimension (n-1)x2 n , containing the other rows. W represents boundary point coordinates in the projected transformed index space. 
Mapping Algorithm
In order to find all possible mappings, we compute all different ways in which the projected transformed h-space can be allocated to the processor space. We assume that the processor space is denoted by a vector , of n-1 elements. Each element shows how many processing cells are available on each direction of the h-mesh, where the transformed h-space is to be mapped 6 . This is equal to the number of cuts that should be made along this direction plus one.
Our mapping algorithm is divided in two phases: pre-calculation, where constant coefficients are calculated once and for all, and calculation, where the cost of
9/14
all possible mappings is calculated. In the following, will denote the number of different non-trivial values in vector and % j will denote these values, for 1 ≤ j ≤ .
Algorithm 2. Pre-calculation of cost for multiple cuts
Step 2.1: For all pairs of binding h-lines:
Step 2.1.1: Let i p be the perpendicular vector:
Step 2.1.2: Calculate the constant coefficient for cuts in the direction of i p :
Step 2.1.3: For all different numbers % j in the vector of processors, 1 ≤ j ≤ :
Step 2.1.3.1: Calculate the constant coefficient for multiple cuts in the direction of every i p , cutting the projected transformed index space in % j segments:
where cutArea() is a function that returns the h-length of the cut's h-line, according to Algorithm 4.
Algorithm 3. Calculation of the minimal mapping
Step 3.1: For all valid mappings: 
Step 4.1: Let P the set of points that define the cut's h-line segment, initially empty:
Step 4.2: For all combinations of (n-2) binding h-lines not perpendicular to p (for all p i ≠ p): Step 4.2.1: Add the given h-line.
10/14
Step 4.2.2: Solve the linear system of dimension (n-1)x(n-1) to compute the point of intersection of the n-1 h-lines.
Step 4.2.3: If the solution satisfies all the remaining binding h-lines, then add it to P, else discard it. Step 4.3: Calculate the h-length of the h-line segment defined by the points in P.
Since all points in P lie on the given h-line, they define an h-line segment
7
. H-line segments are really finite subspaces of dimension n-2, whose h-length must be calculated. The calculation of the h-length of an h-line segment that is defined by a set of k points, where k ≥ n-1, can be reduced to the same calculation, but for a set of n-1 points. The problem does not apply for the case of n = 3, but it is easy to see that in the case of n = 4 the h-line segment is a polygon and we have to triangulate it, in order to calculate its area. Thus, it suffices to define the h-length of an h-line segment that is defined by n-1 points, and this can be done inductively:
Algorithm 5. Inductive definition of h-length
Step 5.1: Base case, for n = 3, use Euclidean distance.
Step 5.2: Inductive case, for n > 3 do the following:
Step 5.3: Exclude one point arbitrarily: u .
Step 5.4: Use same algorithm to calculate the h-length lï of the h-line segment that is defined by the remaining n-2 points (in an h-space of dimension n-3).
Step 5.5: Find the projection u′ of u on the h-plane defined by the remaining n-2 points.
Step 5.6: Calculate the Euclidean distance d between u and u′ .
Step 5.7: The result is the product of l and d.
EXAMPLE
Consider the following FOR-loop:
for i 0 = 1 to 6 do for i 1 = 1 to 4 do The first line of Dï denotes time dependence of transformed algorithm and is ignored, while the remaining matrix has as columns the space dependencies in the transformed index space. In our example, the transformed dependencies are: In Figure 2 we can see the projected transformed index space, together with the transformed dependencies. In order to calculate the bounds of the projected transformed index space, we apply Algorithm 1. The matrices that are calculated in the first three steps of the algorithm are, respectively: V = , The convex hull of the points, whose coordinates are given in matrix W, is a polygon whose vertices are the points: (2, 2), (7, 2), (10, 5), (10, 7), (5, 7) and (2, 4). The inequalities that describe the interior of this polygon, in the form of pairs of parallel lines, are the following: pair 1: 2 ≤ x ≤ 10, pair 2: 2 ≤ y ≤ 7, pair 3: -2 ≤ z ≤ 5. 
where the length of the cuts have been calculated by using Algorithms 4 and 5. It is not difficult to see that the best mapping indicated by Algorithm 3 consists of two multiple cuts, one in the direction of pair 1 and one in the direction of pair 3. This mapping is shown in Figure 3 . Its total cost is: 6668 . 38
This does not significantly deviate from the real cost, as can be seen in Figure 4 .
We can see that the real cost of the multiple-cut in the direction of pair 1, i.e. the DKP_WS99_Paper.doc submitted to World Scientific : 07/10/99 : 12:49 PM
13/14
number of dependence vectors traversing the vertical lines, is equal to 20, whereas the estimated cost was 18.6668. 8 Similarly, the real cost of the multiple cut in the direction of pair 3, is equal to 22, instead of the estimated 20.
We should note here, that if the projected transformed index space is not an hparallelogram, there is inevitably a loss of processing cells. This is the case in our 2-dimensional example here, where the underused processors can be seen in Figure 3 .
If we apply the method proposed in [9] , for finding an optimal transformation when mapping to systolic cells, the following results will be produced, among many others, for our example: These matrices result in systolic arrays of 42, 24, 12 and 12 cells respectively. They all preserve the optimal hyper-plane [1, 1, 1] . The optimal transformation for the needs of our problem is T 4 . This matrix maps into the least possible number of cells, as shown in [1] , and is better than T 3 , because it produces an array requiring only two external communication links (two external transformed dependence vectors). In the beginning of the example, we have selected T 1 for presentation, because the resulting structure is big enough to demonstrate the merits of our method.
Conclusion
In this paper we have presented a new method for the partitioning and mapping of nested loops onto fixed size, distributed, mesh-connected architectures. This method is based on transforming the initial n-dimensional index space J n into an equivalent -ï n , using a transformation (matrix) T, and then divide the projected virtual (n-1)-dimensional space, through S, into blocks which are assigned to different processors. Interprocessor communication is considerably reduced, by choosing the optimal cut along each dimension. As it was shown, the proposed method is formally presented and easily programmable. Future work includes affine by statement partitioning of the index space, to further reduce the redundant interprocessor communication links.
References
