Parallelising compilers typically need some performance estimation capability in order to evaluate the trade-offs between different transformations.
Such a capability requires sophisticated techniques for analysing the program and providing quantitative estimates to the compiler's internal cost model.
Making use of techniques for symbolic evaluation of the number of iterations in a loop, this paper describes a novel compile-time scheme for partitioning loop nests in such a way that load imbalance is minimised.
The scheme is based on a property of the class of canonical loop nests, namely that, upon partitioning into essentially equal-sized partitions along the index of the outermost loop, these can be combined in such a way as to achieve a balanced distribution of the computational load in the loop nest as a whole. A technique for handling non-canonical loop nests is also presented; essentially, this makes it possible to create a load-balanced partition for any loop nest which consists of loops whose bounds are linear functions of the loop indices. Experimental results on a virtual shared memory parallel computer demonstrate that the proposed scheme can achieve better performance than other compile-time schemes.
Introduction
In order to evaluate the performance trade-offs of different transformations, parallelising compilers are usually armed with some performance estimation capability; this issue has been addressed recently by a number of researchers [3, 7, 19]. Although the implementation details of these schemes vary, generally they attempt to identify sources of performance loss, such as load imbalance, interprocessor communication, cache misses, etc. [4, 6]. This has two implications for a parallelising compiler. Firstly, the compiler must be capable of extracting quantitative information from programs; since parallelising compilers usually target the parallelisation of loop nests, significant information lies in the number of times each loop will be executed; this can be used, for instance, to estimate the amount of work assigned to each processor, or the number of non-local accesses to data [7].
Secondly, the compiler should avoid postponing critical decisions concerning the parallelisation process until run-time, since doing so reduces the information available and, consequently, the accuracy of its performance prediction. One such critical decision is the mapping of loop nests onto a parallel architecture; that is, the way that the loop iterations are allocated to processors for execution.
The problem of mapping loop nests has attracted significant interest; traditionally, researchers have inclined towards run-time based schemes [8, 11, 13, 18]. Their underlying argument has been that information not available at compile-time may permit a more balanced distribution of the workload, especially in cases where different iterations of the parallel loop perform different amounts of work. Although the balanced distribution may be achieved at the expense of other overheads (e.g., increasing the number of memory accesses so as to balance the computational work), run-time mapping schemes seem to be preferable whenever the execution of the loop nest depends on expressions whose value is unknown at compile-time.
However, in a number of cases, the resulting pattern of workload variation is highly regular, and it may be feasible to devise compile-time mapping schemes. This is the case, for example, when each iteration of the parallel loop performs a different amount of work as a result of enclosed loops whose execution depends on the value of the index of the parallel loop.
This theme has been investigated by Haghighat and Polychronopoulos [9, 10], who suggest that symbolic cost estimates can be used to design robust compile-time strategies for mapping loop nests. They propose balanced chunk scheduling, a mapping scheme for triangular perfect loop nests; the main idea is to partition the outermost loop in such a way that each processor executes the innermost loop body the same (or almost the same) number of times. However, balanced chunk scheduling can only be applied to a limited class of loop nests and, in practice, it is often more desirable that the outermost, parallel loop be distributed in equal-sized partitions; for instance, in a data parallel environment, if a triangular loop nest is followed by a rectangular loop nest in which the same arrays are involved, an unequal partitioning (in this respect) would lead to significant communication and/or load imbalance overhead. In this paper we develop a general approach for compile-time mapping of parallel outer loops in the body of which there are inner loops whose execution depends on the outer loop index.
We make use of symbolic cost estimates as a means of evaluating our strategy, and minimisation of load imbalance is the main objective. By avoiding options which would tend to increase other sources of overhead, this scheme achieves good practical results.
The remainder of the paper is structured as follows: Section 2 provides a brief background on measuring load imbalance and on symbolic counting of the number of loop iterations. Section 3 forms the main body of this paper; after first describing a strategy for mapping the class of canonical loop nests, we then show how to transform non-canonical loop nests into multiple canonical loop nests. This strategy is evaluated and compared with other mapping schemes in Section 4. Section 5 summarises the paper and its results.
Background

Load Imbalance
We define the total (computational) workload in a parallel code fragment (in this paper, this will always be a parallel loop nest) to be $W_{tot}$, distributed among $p$ processors in such a way that each processor $i$, $0 \le i < p$, is assigned a workload equal to $W_i$. Clearly, $\sum_{i=0}^{p-1} W_i = W_{tot}$. This distribution embodies a load imbalance, $L$, given by

$$L = \max_{0 \le i < p} W_i - \frac{W_{tot}}{p}, \qquad (1)$$

and a relative load imbalance, $L_R$, given by

$$L_R = \frac{L}{\max_{0 \le i < p} W_i}. \qquad (2)$$

When $L = L_R = 0$, that is, for all $i$, $W_i = W_{tot}/p$, the code exhibits perfect load balance.
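The two measures can be computed directly from a list of per-processor workloads. The following is a minimal sketch, assuming that $L$ is the largest workload minus the ideal share $W_{tot}/p$ and that $L_R$ is $L$ divided by the largest workload; these definitions are an assumption chosen to agree with the relative imbalance quoted for rectangular loops, and the function name `load_imbalance` is hypothetical.

```python
from fractions import Fraction

def load_imbalance(workloads):
    """Return (L, L_R) for per-processor workloads W_0 .. W_{p-1}.

    Assumed definitions (not confirmed by the text):
      L   = max(W_i) - W_tot / p
      L_R = L / max(W_i)
    """
    p = len(workloads)
    w_tot = sum(workloads)
    w_max = max(workloads)
    L = w_max - Fraction(w_tot, p)
    L_R = L / w_max if w_max > 0 else Fraction(0)
    return L, L_R

# Perfectly balanced: every processor holds W_tot / p.
print(load_imbalance([5, 5, 5, 5]))
# Slightly unbalanced: two processors hold one extra unit of work.
print(load_imbalance([3, 3, 2, 2]))
```

Exact rational arithmetic is used so that a perfectly balanced distribution yields exactly zero rather than a small floating-point residue.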
Counting the Number of Loop Iterations
In the case of loops, an estimate for the values of $W_i$ in (1) can be derived by considering the number of times that each part of the loop body is executed. This corresponds to the (complex) problem of enumerating the integer points of a polytope [1]. In the context of loop nests, some techniques to compute this are described in [5, 14, 17]. In general, the number of times, $n$, that a single statement surrounded by $m$ loops is executed is given by:

$$n = \sum_{i_1=l_1}^{u_1} \sum_{i_2=l_2}^{u_2} \cdots \sum_{i_m=l_m}^{u_m} 1, \qquad (3)$$

where $l_j$, $u_j$ are the lower and upper bounds, respectively, of the $j$-th loop, $1 \le j \le m$. If, for every loop, the loop bounds are constant, integer expressions whose values are known at compile-time, the loop nest is rectangular (in $m$ dimensions) and it is trivial to show that $n = \prod_{j=1}^{m} \sum_{i_j=l_j}^{u_j} 1$. However, in a variety of situations, the loop bounds may be non-constant (for instance, dependent on the index of an outer loop), or may contain expressions whose value is not known at compile-time.
For the latter, it is important to know, when evaluating sums such as those in (3), whether $u_j \ge l_j$. In this paper, whenever this is the case, we split the loop iterations in such a way that the upper bound is always greater than or equal to the lower bound; this is discussed in Section 3.4.
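The importance of knowing whether the upper bound is at least the lower bound can be seen with a small enumeration. The sketch below counts executions of the innermost statement by brute force, treating a loop whose upper bound is below its lower bound as executing zero times; the helper `count_iterations` and the example bounds are hypothetical and serve only as an illustration.

```python
def count_iterations(bounds, outer=()):
    """Count executions of a statement nested inside the given loops.

    bounds is a list of (lower, upper) pairs of functions; each function
    receives the indices of the enclosing loops as positional arguments.
    A loop with upper < lower contributes zero iterations.
    """
    if not bounds:
        return 1
    (lower, upper), rest = bounds[0], bounds[1:]
    return sum(count_iterations(rest, outer + (j,))
               for j in range(lower(*outer), upper(*outer) + 1))

# Rectangular nest: i = 1..10, j = 1..5.
rect = [(lambda: 1, lambda: 10), (lambda i: 1, lambda i: 5)]
# Triangular nest: i = 1..10, j = 1..i.
tri = [(lambda: 1, lambda: 10), (lambda i: 1, lambda i: i)]
# Inner loop j = 5..i is empty for i < 5.
shifted = [(lambda: 1, lambda: 10), (lambda i: 5, lambda i: i)]

print(count_iterations(rect))      # 50
print(count_iterations(tri))       # 55
print(count_iterations(shifted))   # 21

# Summing (upper - lower + 1) blindly over all i counts negative terms
# for i < 5 and therefore underestimates the true count.
naive = sum(i - 5 + 1 for i in range(1, 11))
print(naive)                       # 15
```

In the last case the symbolic sum is only correct once the outer loop has been split at i = 5, which is the kind of splitting the text refers to.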
Footnote 1: The units of workload may be the number of operations executed, or the CPU cycles needed to execute the code on some machine.

where $l_0 = l$, $u_k = l_{k+1} - 1$, for $0 \le k < p-1$, $u_{p-1} = u$, and, for all $l_k$, $u_k$, where $n = u - l + 1$. The following satisfy both (4) and (5):
• Partitioning by decreasing order:
where the first ($n \bmod p$) partitions contain $\lceil n/p \rceil$ iterations, while the rest contain $\lfloor n/p \rfloor$.
• Partitioning by increasing order:
where the last ($n \bmod p$) partitions contain $\lceil n/p \rceil$ iterations, while the rest contain $\lfloor n/p \rfloor$.
If $n$ is a multiple of $p$, both relations reduce to $l_k = l + kn/p$, $k = 0, 1, 2, \ldots, p-1$, and the loop is divided into equal partitions.
In this case, assuming that the body of the loop nest does not contain statements whose execution depends on the value of the index of the outer loop, perfect load balance is achieved. If n is not a multiple of p, then a relative load imbalance equal to (p -n mod p)/(n + p -n mod p) is expected, a quantity which approaches zero if n >> p. These partitioning techniques can also be applied to the outermost loop of a loop nest, whether or not it is perfectly nested. If the bounds of the inner loops are not a function of the index of the outer loop, nor are there any conditional statements in the loop body whose execution depends on the value of the index of the outer loop, then perfect load balance may be achieved.
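The two orderings can be sketched as follows. This is an illustrative implementation, not the closed-form relations of the paper: it builds the partition sizes explicitly (the larger partitions first for decreasing order, last for increasing order) and then checks the relative load imbalance against the expression $(p - n \bmod p)/(n + p - n \bmod p)$. The function names are hypothetical, and the relative imbalance is assumed to be the excess of the largest partition over $n/p$, divided by the largest partition.

```python
from fractions import Fraction

def partition(l, u, p, order="decreasing"):
    """Split iterations l..u into p consecutive (lower, upper) ranges."""
    n = u - l + 1
    q, r = divmod(n, p)
    sizes = [q + 1] * r + [q] * (p - r)   # r partitions get one extra iteration
    if order == "increasing":
        sizes.reverse()
    parts, lo = [], l
    for s in sizes:
        parts.append((lo, lo + s - 1))
        lo += s
    return parts

def relative_imbalance(parts):
    """Assumed measure: (max size - n/p) / max size, for a non-empty loop."""
    sizes = [hi - lo + 1 for lo, hi in parts]
    w_max, p = max(sizes), len(sizes)
    return Fraction(w_max * p - sum(sizes), w_max * p)

print(partition(1, 10, 4, "decreasing"))  # [(1, 3), (4, 6), (7, 8), (9, 10)]
print(partition(1, 10, 4, "increasing"))  # [(1, 2), (3, 4), (5, 7), (8, 10)]
print(relative_imbalance(partition(1, 10, 4)))  # 1/6
```

For n = 10 and p = 4 the quoted expression gives (4 - 2)/(10 + 4 - 2) = 1/6, which matches the computed value.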
For perfectly nested loops, partitioning may be applied to the iterations of more than one loop at the same time. To illustrate this, assume that partitioning for p processors takes place over two loops, the first being executed n times, and the second being executed m times.
Minimising $L$ in (1) requires us to find $p_1$, $p_2$, where $p = p_1 p_2$, such that $\lceil n/p_1 \rceil \lceil m/p_2 \rceil$ is a minimum.
Instead 
Loop Nests Containing Conditionals
If the loop body contains conditionals whose execution does not depend on the value of the index of the outer loop, then the load imbalance resulting from the partitioning schemes described so far is not affected; each iteration of the outer loop still performs the same amount of work and, consequently, perfect load balance can be achieved. In the special case where an inner loop conditional involves the index of the outer loop and a constant, then, by applying index set splitting [20] prior to partitioning, the conditional may be removed [2].
For instance, let $l \le i \le u$ bound the index $i$ of the outermost, parallel loop, and $l_c \le i \le u_c$ correspond to the logical expression evaluated by a conditional in the inner loop body. Then, the original loop nest can be split into three consecutive loop nests, whose indices take values in the following intervals, respectively:

$$[\,l,\ \min(u,\, l_c - 1)\,], \qquad [\,\max(l,\, l_c),\ \min(u,\, u_c)\,], \qquad [\,\max(l,\, u_c + 1),\ u\,]. \qquad (6)$$

The second interval contains the values of $i$ which satisfy both $l \le i \le u$ and $l_c \le i \le u_c$. In some cases, the upper bound of an interval will be smaller than its lower bound, and the corresponding loop nest will never be executed; in this case, it can be eliminated.
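The splitting step can be sketched as a small function that produces the three candidate intervals and discards the empty ones. This is an illustration under assumptions: the names `split_index_set`, `lc` and `uc` (the bounds implied by the conditional) are hypothetical, and the interval endpoints used here are the standard ones for index set splitting rather than a transcription of (6).

```python
def split_index_set(l, u, lc, uc):
    """Split l..u around the range lc..uc tested by a conditional.

    Returns the non-empty (lower, upper) intervals, in order: values
    below the conditional's range, values inside it, values above it.
    Assumes lc <= uc, i.e. the conditional can be true for some value.
    """
    candidates = [
        (l, min(u, lc - 1)),          # conditional is false (i < lc)
        (max(l, lc), min(u, uc)),     # conditional is true
        (max(l, uc + 1), u),          # conditional is false (i > uc)
    ]
    # An interval whose upper bound is below its lower bound is never
    # executed, so the corresponding loop nest is eliminated.
    return [(lo, hi) for lo, hi in candidates if lo <= hi]

print(split_index_set(1, 100, 20, 60))   # [(1, 19), (20, 60), (61, 100)]
print(split_index_set(1, 100, 50, 200))  # [(1, 49), (50, 100)]
```

After splitting, the conditional can be dropped from each resulting loop nest, since its outcome is fixed within every interval.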
Assuming that at least two of the intervals are non-empty, the question is how to partition the resulting loop nests.
There are two main approaches which are illustrated in the following example. Consider the code shown in Figure 1.a; applying (6) and assuming that, for the values of L, U, A, it is known at compile-time that L<A<U, the code shown in Figure 1.b results. One approach to partitioning this code is to partition each loop using either of the schemes described in Section 3.1; for both loops, if the number of iterations is a multiple of the number of processors, $p$, then perfect load balance is achieved.
In the general case, let $n$, $m$ be the number of loop executions, and $W_1$, $W_2$ be the workload in the body of each loop, respectively; assuming that the same partitioning scheme (either by decreasing or increasing order) is applied to both loops, then the resulting $L$ is given by $L = \lceil n/p \rceil W_1 + \lceil m/p \rceil W_2 - (n W_1 + m W_2)/p$. Whenever $(n \bmod p) + (m \bmod p) \le p$, the load imbalance can be reduced by partitioning one of the loops by decreasing order and the other one by increasing order. This approach is followed in the code shown in Figure 1.c.
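A small numerical check illustrates the benefit of mixing the two orderings. The sketch below assumes that the load imbalance is the largest per-processor workload minus the average workload, and that processor $k$ executes the $k$-th partition of both loops; the helper names and the chosen values (n = 10, m = 6, p = 4, unit workloads) are hypothetical.

```python
from fractions import Fraction

def sizes(n, p, order):
    """Partition sizes for n iterations on p processors."""
    q, r = divmod(n, p)
    s = [q + 1] * r + [q] * (p - r)      # larger partitions first
    return s if order == "decreasing" else s[::-1]

def imbalance_two_loops(n, m, w1, w2, p, order1, order2):
    """Assumed measure: max per-processor load minus the average load."""
    loads = [a * w1 + b * w2
             for a, b in zip(sizes(n, p, order1), sizes(m, p, order2))]
    return max(loads) - Fraction(sum(loads), p)

# Same ordering for both loops: loads are [5, 5, 3, 3].
print(imbalance_two_loops(10, 6, 1, 1, 4, "decreasing", "decreasing"))  # 1
# Opposite orderings: loads are [4, 4, 4, 4].
print(imbalance_two_loops(10, 6, 1, 1, 4, "decreasing", "increasing"))  # 0
```

Here (n mod p) + (m mod p) = 2 + 2 = 4, which does not exceed p, so no processor receives the larger partition of both loops when the orderings are opposed.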
Canonical Loop Nests
The partitioning schemes for rectangular loops presented in Section 3.1 result in a small value of load imbalance, when each iteration of the parallel loop performs the same amount of work. A simple counter-example is that of a triangular loop nest in which the index of the outer loop, $i$, takes values from 1 to $n$, while the index of the inner loop takes values from 1 to $i$. When this is mapped onto $p$ processors, then, for $p, n \ge 2$, the relative load imbalance has a lower bound of 1/4. It is apparent that the partitioning schemes described so far are inadequate for minimising load imbalance; nevertheless, using them as a basis, more effective schemes can be devised.
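The lower bound can be observed numerically for the simplest setting. The sketch below restricts itself to the case where $n$ is a multiple of $p$ (an assumption made here so that all partitions have equal size), gives iteration $i$ of the outer loop a workload of $i$, and assumes that the relative load imbalance is the excess of the largest workload over the average, divided by the largest workload. The function name is hypothetical.

```python
from fractions import Fraction

def triangular_block_imbalance(n, p):
    """Relative imbalance of a triangular nest split into equal blocks.

    Outer iteration i (1..n) performs i inner iterations; processor k
    receives the consecutive outer iterations k*n/p + 1 .. (k+1)*n/p.
    """
    if n % p != 0:
        raise ValueError("this sketch assumes n is a multiple of p")
    chunk = n // p
    loads = [sum(range(k * chunk + 1, (k + 1) * chunk + 1)) for k in range(p)]
    w_max = max(loads)
    return Fraction(w_max * p - sum(loads), w_max * p)

print(triangular_block_imbalance(2, 2))     # 1/4
print(triangular_block_imbalance(4, 2))     # 2/7
print(float(triangular_block_imbalance(1024, 16)))
```

The last processor always receives the heaviest block, so adding processors does not remove the imbalance; it approaches 1/2 as both $n/p$ and $p$ grow.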
In the remainder of this paper we examine loop nests of depth m having the general form shown in Figure 2 
The following example illustrates Theorems 1 and 2:
Example 1 Consider the loop nest shown in Figure 3.a. Assuming that N > 1, then the inequalities -2 < 3*I-1 and J+I ≤ 5*I+2 always hold, while, for each inequality, the coefficients of I are non-zero; hence, the requirements of Definition 1 are satisfied and the loop nest is canonical of depth 3. Thus, based on Theorems 1 and 2, and assuming that the number of iterations of the outer loop, N, is a multiple of $2p^2$, where $p$ is the number of processors, partitioning the loop nest according to (7) leads to perfect load balance; the partitioned loop nest is shown in Figure 3.b.
In the general case, where the number of iterations of the outer loop is not a multiple of $2p^{m-1}$, the partitioning technique suggested by Theorem 2 can be applied, provided that the outer loop is partitioned according to one of the partitioning schemes described in Section 3.1. In this case, a small value of load imbalance is expected.
Theorems 1 and 2 also apply to loop nests in which there is more than one inner loop at the same level (i.e., loops which are surrounded only by the same outer loops) whose bounds depend on the index of a surrounding loop; the necessary proviso is that, for any loop in the nest, the lower bound is always less than or equal to the upper bound.

Figure 4: Transforming a non-canonical loop nest to canonical loop nests
Non-Canonical Loop Nests
Section 3.3 describes an approach to partitioning canonical loop nests, as introduced in Definition 1. In this section we re-consider loop nests having the general form shown in Figure 2, but without the restrictions associated with Definition 1, apart from the requirement that $l_1 \le u_1$ (the loop nest is non-empty). Applying index set splitting, the original loop nest can be transformed into multiple adjacent loop nests, each of which either satisfies the requirements for partitioning inherent in Theorem 2 or is rectangular. Consider the loop nest shown in Figure 2; the first step consists of finding the values of $i$ which satisfy the inequalities $l_1 \le i \le u_1$ and $l_{21} i + l_{22} \le u_{21} i + u_{22}$. If no such values exist, then the loop with index $j_2$ is never executed. If all values given by $l_1 \le i \le u_1$ satisfy them, the loop with index $j_2$ is always executed; therefore, this loop and the outermost loop together meet the criteria for a canonical loop nest of depth 2, and no index set splitting is required. Conversely, if only a subset of the values of $i$ satisfies both inequalities, say $l_1 \le i \le u'_1$, where $u'_1 < u_1$, then the outer loop must be split into two consecutive loops, the first of which corresponds to the values given by $l_1 \le i \le u'_1$, and the second of which corresponds to $u'_1 + 1 \le i \le u_1$; for the former values, the loop with index $j_2$ and the outer loop together meet the criteria for a canonical loop nest of depth 2, while, for the latter values, the loop with index $j_2$ is never executed.
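The first step amounts to solving one linear inequality in $i$ over the integers. The following is a minimal sketch of that step, assuming the inner loop runs from $l_{21} i + l_{22}$ to $u_{21} i + u_{22}$; the function name `executable_range` is hypothetical, and the sample bounds (outer loop 1..1000, inner loop from 2*I-500 to 1000) are assumed for illustration, chosen so that the result agrees with the condition I ≤ 750 used in Example 2.

```python
def executable_range(l1, u1, l21, l22, u21, u22):
    """Values of i in l1..u1 for which the inner loop is non-empty.

    Solves l21*i + l22 <= u21*i + u22, i.e. a*i + b >= 0 with
    a = u21 - l21 and b = u22 - l22, over the integers.
    Returns (lower, upper), or None if the inner loop never executes.
    """
    a = u21 - l21
    b = u22 - l22
    lo, hi = l1, u1
    if a > 0:
        lo = max(lo, -(b // a))      # i >= ceil(-b / a)
    elif a < 0:
        hi = min(hi, b // (-a))      # i <= floor(b / -a)
    elif b < 0:
        return None                  # constant bounds with upper < lower
    return (lo, hi) if lo <= hi else None

# Inner loop from 2*I-500 to 1000, outer loop I = 1..1000.
print(executable_range(1, 1000, 2, -500, 0, 1000))   # (1, 750)
# Inner loop from 1 to I-3, outer loop I = 1..10.
print(executable_range(1, 10, 0, 1, 1, -3))          # (4, 10)
```

If the returned range is the whole of $l_1..u_1$, no splitting is needed; otherwise the outer loop is split at the returned bounds and the part outside them is dropped for this inner loop.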
If, as a result of the previous step, there are some values of i for which the two outermost loops form a canonical loop nest of depth 2, then the next step consists of finding values of i and jz for which the three outermost loops form a canonical loop nest of depth 3. These values must satisfy the system of inequalities:
$$\begin{aligned} l'_1 &\le i \le u'_1 \\ l_{21} i + l_{22} &\le j_2 \le u_{21} i + u_{22} \\ l_{31} i + l_{32} j_2 + l_{33} &\le u_{31} i + u_{32} j_2 + u_{33}, \end{aligned}$$

where the first inequality corresponds to those values of $i$ that make the two outermost loops a canonical loop nest of depth 2. The same procedure is repeated for each loop, successively, until there are no remaining loops or else a given system of, say $k$, $2 \le k \le m$, inequalities has no solutions (this would imply that the loop with index $j_k$ is never executed for the values of $i, j_2, \ldots, j_{k-1}$ that make the $k-1$ outermost loops form a canonical loop nest of depth $k-1$). Note that, in the case where the original loop nest contains more than one consecutive loop at some level, the same procedure should be applied for each loop separately.
These ideas are illustrated in the following example:
Example 2 Consider the loop nest of depth 3 shown in Figure 4 .a. Since there are two consecutive loop nests in the body of the I loop, the procedure described above must be applied separately for each of them.
For the first loop nest, the J loop and the outermost loop form a canonical loop nest of depth 2; and the K loop joins them to form a canonical loop nest of depth 3 whenever 2*I-J ≤ 1000 ⇔ J ≥ 2*I-1000. Thus, the J loop must be split into two consecutive loops depending on appropriate values of J; the bounds of the first such loop will be 1 and MAX(1,2*I-1000)-1, and of the second such loop MAX(1,2*I-1000) and I. Since the body of the J loop does not contain statements other than the K loop, no statements are executed in the case when 1 ≤ J ≤ MAX(1,2*I-1000)-1; hence, the corresponding loop can be eliminated.
For the second loop nest, the J loop and the outermost loop form a canonical loop nest of depth 2 when 2*I-500 ≤ 1000 ⇔ I ≤ 750; the index of the I loop is split accordingly. The K loop joins in to form a canonical loop nest of depth 3 when I+J ≤ 1000 ⇔ J ≤ 1000-I. The code resulting after applying the necessary transformations is shown in Figure 4.b. Evaluating MAX(1,2*I-1000), by replacing it with appropriate conditionals which are then removed using index set splitting (see Section 3.2), results in the code shown in Figure 4.c (note that MIN(1000 MIN( ,1000 is always equal to 1000-I since I (7) for m = 3, the partitioned code leads to perfect load balance when using 5 processors, and, in general, a relatively low value of load imbalance [15].
Evaluation and Experimental Results
A series of experiments has been conducted in order to evaluate the performance obtained by the partitioning strategy described above, compared with other compile-time approaches. Two routes have been adopted for analysing the results when applying different mapping schemes: the first compares the values of load imbalance, $L$, and relative load imbalance, $L_R$, computed as shown in Section 2.1; the second compares the resulting performance on a virtual shared memory computer, the KSR1. Our objectives have been not only to evaluate the practical efficacy of the new partitioning schemes, but also to establish whether the theoretical values for $L$ and/or $L_R$ are a sound means for justifying the selection of a particular mapping scheme. Two benchmark programs are used (see below). The compared approaches are denoted KAP, MARS, CYC, BCS, and CAN: KAP corresponds to the mapping strategy of the KAP auto-parallelising compiler; MARS corresponds to the mapping strategy of the MARS experimental parallelising compiler [3]; CYC corresponds to a cyclic scheme for mapping the iterations onto processors (i.e., processor 0 executes iterations $1, p+1, 2p+1, \ldots$, processor 1 executes iterations $2, p+2, 2p+2, \ldots$, in general, processor $i$, $0 \le i \le p-1$, executes iterations $i+1+kp$, $k = 0, 1, 2, \ldots, n/p-1$ [11]); BCS corresponds to balanced chunk scheduling [10] (extended to support loop nests of depth 3); and the general term CAN corresponds to the partitioning scheme described by (7). A suffix is added to CAN to distinguish between different values of $m$ and/or transformations applied; these are described below, as appropriate.
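The cyclic (CYC) mapping can be sketched in a few lines. The example below lists the iterations assigned to each processor and, as an illustration only, the resulting workloads for a triangular nest in which outer iteration $i$ performs $i$ units of work; the function name and the values n = 12, p = 4 are hypothetical.

```python
def cyclic_iterations(i, n, p):
    """Iterations of a loop 1..n executed by processor i (0 <= i < p)
    under the cyclic scheme: i+1, i+1+p, i+1+2p, ..."""
    return list(range(i + 1, n + 1, p))

n, p = 12, 4
for proc in range(p):
    its = cyclic_iterations(proc, n, p)
    # Workload for a triangular nest: outer iteration i costs i units.
    print(proc, its, sum(its))
```

Under this scheme consecutive iterations go to different processors, which evens out a slowly varying workload but gives up the locality of consecutive blocks of iterations.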
Upper Triangular Matrix Multiplication
The code for the first benchmark, shown below, performs the multiplication of two upper triangular n x n matrices. Clearly, the loop nest is canonical of depth 3 (see Definition 1), and the partitioning scheme CAN-3, based on (7) for m = 3, may lead to perfect load balance. For comparison, the partitioning scheme CAN-2, corresponding to m = 2, is also implemented.
The load imbalance, L, in terms of the number of times the assignment statement of the loop body is executed, and the corresponding relative load imbalance, LR, for two different values of N, 256 and 1024, are shown in Table 1. MARS and KAP exhibit high L and LR, CAN-2 exhibits relatively smaller values, while the remaining three mapping schemes exhibit significantly smaller values; in all cases, CAN-3 exhibits the smallest values.
The partitioned programs were executed on the KSR1, using the same two values of N; the resulting performance is depicted in Figures 5 and 6, where the ideal line assumes linear speed-up. In both graphs, KAP and MARS perform worst of all while CAN-3 performs best; the performance of CAN-3 is comparable with that of CYC and BCS. These results are consistent with the performance that might be anticipated from the values of L and LR shown in Table 1.
Banded SYR2K
The second benchmark, banded symmetric rank-2k update (SYR2K), contains non-affine bounds, as shown below:
Clearly, this loop nest is not canonical. However, converting the MIN and MAX functions to IF statements, and removing the latter by index set splitting (see Section 3.2), the code can be transformed into four consecutive canonical loop nests of depth 3, assuming that N > 2*BB-1 [15]; this version is denoted CAN-3t. For comparison, two additional mapping schemes are also implemented; they are based on direct application of the partitioning schemes described by (7), for m = 2 (CAN-2) and m = 3 (CAN-3), to the original loop nest, regardless of the fact that the latter is not canonical. No version based on balanced chunk scheduling was implemented since loop nests having bounds containing MIN and MAX functions do not conform to its requirements.
The load imbalance, L, in terms of the number of times the assignment statement of the loop body is executed, and the corresponding LR, for two pairs of values for N and BB, {512,64}, and {1024,256}, are shown in Table 2. The partitioned programs were executed on the KSR1, using the same two pairs of values for N and BB; the resulting performance is depicted in Figures 7 and 8. In the first case (Figure 7), KAP and MARS perform worst of all, except when running on 16 processors, where CYC performs worst of all. CAN-3t performs best of all when using fewer than 16 processors; equally good results are achieved by CAN-3 and, to some extent, CAN-2. CYC exhibits odd behaviour; it performs nearly best of all when running on 12 processors, but worst of all when running on 16 processors, and nearly worst when running on 8 processors.
This is due to the significant number of cache misses when the number of processors is a power of 2. Similar remarks can be made about the results in Figure 8 . CAN-3t performs best of all; CAN-3 exhibits comparable performance, but CAN-2 performs significantly worse. KAP and MARS perform worst of all except when using 8 or 16 processors; in these cases, CYC, which also suffers from a high number of cache misses, performs worst of all.
Comparing the computed values of L and LR in Table 2 and the actual performance shown in Figures 7 and 8, another interesting observation is that, although CAN-3t nearly always exhibits higher load imbalance than CAN-3, its actual performance is generally better than that of CAN-3 (except when running on more than 12 processors, where the difference in load imbalance between the two partitioning schemes becomes relatively higher); the superior performance of CAN-3t is due to the elimination of MIN and MAX functions from the loop bounds (apart from those necessary for partitioning the outermost, parallel loop).
Conclusion
This paper has presented a partitioning scheme for loop nests in which, upon partitioning into equal partitions along the index of the outermost loop, each partition has a computational load which can be expressed in terms of a polynomial expression; these loop nests, termed canonical, are composed of loops for which the upper bound is always greater than or equal to the lower bound.
It has also been shown how to apply index set splitting to transform non-canonical loop nests in such a way that the above criterion is satisfied. Although minimising load imbalance has been the primary target of the scheme, it seems that, by partitioning into groups having consecutive iterations (in contrast to the cyclic partitioning scheme [11]), as well as into as near as possible equal-sized partitions along the index of the outermost loop (in contrast to balanced chunk scheduling [9, 10]), our approach has also been effective in reducing other forms of overhead.
