This paper presents a compile-time scheme for partitioning non-rectangular loop nests which consist of inner loops whose bounds depend on the index of the outermost, parallel loop. The minimisation of load imbalance, on the basis of symbolic cost estimates, is considered the main objective; however, options which may increase other sources of overhead are avoided. Experimental results on a virtual shared memory computer are also presented.
Introduction
Cost estimates provided by symbolic analysis at compile-time may lead to robust compile-time schemes for mapping certain classes of loop nests; these may be particularly helpful in the context of parallelising compilers [8]. Such a scheme, balanced chunk scheduling, is suggested in [3] for mapping triangular perfect loop nests. (A loop nest whose iteration points, when depicted on a Cartesian system of coordinates, correspond to a triangle is called triangular; conversely, a loop nest whose iteration points correspond to a rectangle is called rectangular.) Its main disadvantage lies in not distributing the iterations of the parallel loop evenly, which is often undesirable in practice.
In this paper we develop a novel scheme for mapping, at compile-time, parallel loops whose body contains loops whose execution depends on the index of the parallel loop. A brief definition of load imbalance, whose minimisation is considered the main objective, is given in Section 2. Section 3 analyses the proposed scheme and considers transformations which may be needed. Section 4 presents some experiments, and, finally, Section 5 summarises the results.
Background
With the term loop partitioning we refer to the stage of the mapping phase which deals with the formation of p groups of loop iterations which can be executed in parallel by p processors. When forming these groups the objective is to minimise any overheads, thus increasing performance. In this paper, we consider load imbalance as the dominant overhead; however, to avoid false sharing, we require that each group contains as many consecutive loop iterations as possible, while, to achieve scalability, we require that each group is assigned the same (or almost the same) number of iterations of the loop to be partitioned.
Assuming that the total amount of computation (i.e. the workload) in a loop nest is $W_{tot}$, which is distributed amongst p processors in such a way that each processor i,
In order to estimate the values of $W_i$ in (1), we consider the number of times that each part of the loop body is executed. Techniques to compute this number (which corresponds to the number of integer points in a polytope) are described in [5, 6, 7]; they are based on the evaluation of nested sums with each sum corresponding to a loop.
Based on the above, the iterations of a single loop with lower bound $l$ and upper bound $u$ can be partitioned across p processors, with processor k, $0 \le k < p$, executing a loop whose bounds, $l_k$, $u_k$, can be computed by $u_k = l_{k+1} - 1$, for $0 \le k < p - 1$, $u_{p-1} = u$, and, either
If n, the number of iterations of the loop, is a multiple of p, both the above relations reduce to $l_k = l + kn/p$. In this case, as well as for any rectangular loop nest, perfect load balance is achieved.
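The bound computation for a single loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name `block_bounds` is hypothetical, and the floor-based rule $l_k = l + \lfloor kn/p \rfloor$ is an assumption chosen only because it reduces to $l_k = l + kn/p$ when n is a multiple of p.

```python
def block_bounds(l, u, p):
    """Split the inclusive range l..u into p consecutive chunks.

    Returns a list of (l_k, u_k) pairs for k = 0..p-1.
    Assumption: l_k = l + floor(k*n/p), with n = u - l + 1.
    """
    n = u - l + 1
    lowers = [l + (k * n) // p for k in range(p)]
    bounds = []
    for k in range(p):
        # u_k = l_{k+1} - 1 for k < p-1, and u_{p-1} = u
        u_k = lowers[k + 1] - 1 if k < p - 1 else u
        bounds.append((lowers[k], u_k))
    return bounds


# n = 12 is a multiple of p = 4: every chunk gets exactly n/p iterations.
print(block_bounds(1, 12, 4))
# n = 10 is not: chunk sizes differ by at most one iteration.
print(block_bounds(1, 10, 4))
```

Each chunk consists of consecutive iterations, which matches the requirement stated earlier that groups contain as many consecutive iterations as possible.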
Methodology

Generalities
In the case of a non-rectangular loop nest, the partitioning schemes described in the previous section lead to a high value of load imbalance; for instance, for a triangular perfect loop nest the relative load imbalance has a lower bound of 1/4. However, these schemes can serve as a basis for a more suitable partitioning strategy. In order to illustrate this, consider the triangle and the trapezium shown in Figure 1; the horizontal axis corresponds to the outermost parallel loop. Drawing lines parallel to BC (resp. CD for the trapezium) which cut AC (resp. AD) into 4 equal parts, it is possible to divide the triangle (resp. the trapezium) into 2 partitions of equal area (the light-shaded area and the dark-shaded area); this can be generalised for any number of partitions, as well as for a convex polygon (by dividing it into triangles and trapeziums). This strategy is analysed in the next sections.
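The geometric argument can be checked numerically. The sketch below is an illustration under stated assumptions: the shape is modelled as the area under a linear height $h(x) = a + bx$ on $[0, 1]$ (a = 0 gives a triangle, a > 0 a trapezium), the base is cut into 2p equal-width strips, and partition k is assumed to consist of strips k and 2p-1-k. The function names are hypothetical.

```python
from fractions import Fraction


def strip_areas(m, a=0, b=1):
    """Exact areas of m equal-width strips under h(x) = a + b*x on [0, 1]."""
    areas = []
    for j in range(m):
        x0, x1 = Fraction(j, m), Fraction(j + 1, m)
        # Integral of a + b*x over [x0, x1]
        areas.append(a * (x1 - x0) + Fraction(b, 2) * (x1**2 - x0**2))
    return areas


def paired_areas(p, a=0, b=1):
    """Area of each of p partitions, where partition k = strips k and 2p-1-k."""
    strips = strip_areas(2 * p, a, b)
    return [strips[k] + strips[2 * p - 1 - k] for k in range(p)]


print(paired_areas(2))        # triangle cut into 4 strips, 2 partitions
print(paired_areas(2, a=1))   # trapezium cut into 4 strips, 2 partitions
```

Because the strip area grows linearly with the strip index, the sum of a strip and its mirror image is the same for every pair, so all partitions have equal area.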
Partitioning canonical loop nests
The class of loop nests examined in this section has the general form of the loop nest shown in Figure 2. The DOALL construct denotes a parallel loop which has to be partitioned. It is assumed that the collective sets of statements labelled statements.1, statements.2, and statements.3 do not include statements whose execution depends on the value of the index of a surrounding loop (this implies that they may include DO ... ENDDO loops, which perform the same number of iterations regardless of the value of I). It is also assumed that the second set (statements.2) contains at least one statement. Then: . . .
Theorem 3.1 makes use of a property of canonical loop nests, namely that, upon partitioning into equal partitions along the index of the outermost loop, the k-th partition, $0 \le k < p$, has a workload equal to $Ak + B$, where A and B are constants. Based on this property, we can extend the definition of the loop nest used in Theorem 3.1 to cover cases where there is more than one consecutive inner loop:
Corollary 3.1 also applies to loop nests where the bound of an inner loop depends on the indices of two outer loops. Unrolling the innermost of the two outer loops, the resulting code has the general form of the loop nest shown in Figure 3.
Example 3.1 Consider the triangular loop nest shown in Figure 4.a. Based on the general form shown in Figure 2, this example is a special case for $l_{21} = 0$, $l_{22} = l_1$, $u_{21} = 1$, $u_{22} = 0$; these values satisfy the conditions for a canonical loop nest, as required by Definition 3.1. Thus, assuming that the number of iterations of the outer loop, n, is a multiple of 2p, where p is the number of processors used, the partitioning scheme described by Theorem 3.1 leads to perfect load balance; the partitioned code is shown in Figure 4.b.
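The linear-workload property suggests why pairing partitions balances the load. The short derivation below is a sketch under an assumption that is not spelled out in the surviving text: the outer loop is cut into 2p equal partitions (so the constants A and B refer to that finer partitioning), $W_k$ denotes the workload of partition k, and processor k executes partitions k and 2p-1-k.

```latex
\begin{align*}
W_k &= A k + B, \qquad 0 \le k < 2p, \\
W_k + W_{2p-1-k} &= (A k + B) + \bigl(A (2p - 1 - k) + B\bigr) \\
                 &= A (2p - 1) + 2B .
\end{align*}
```

The sum does not depend on k, so under this assumption every processor receives the same workload whenever the 2p partitions are of equal size.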
The same partitioning technique may be applied in the general case, where n is not a multiple of 2p, provided that the outer loop is partitioned according to one of the two partitioning schemes described in Section 2. Then, a relatively small value of load imbalance is expected; for instance, in the case of a canonical perfect loop nest, the relative load imbalance does not exceed $1/(1 + n/2p)$ [6].
We use the KSR directives to denote parallelism in the code. Thus, the code enclosed within the PARALLEL REGION and END PARALLEL REGION directives is executed by all P processors, but using different data for each processor; the latter is achieved by means of a library function, IPRMID(), which returns an integer between 0 and P-1 depending on which processor executes the code. The variables I, J, K, LK, and UK are declared as private, that is, each processor has its own copy of the variable.
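The per-processor structure described here, in which each processor derives its own loop bounds from its identifier, can be mimicked in Python. This is only a sketch under several assumptions: the loop nest is a hypothetical triangular nest (`DO I = 1, n` with inner `DO J = 1, I`), not necessarily the one in Figure 4; the argument `my_id` stands in for the value returned by IPRMID(); chunk bounds use an assumed floor-based rule; and each processor is assumed to execute chunks `my_id` and `2p-1-my_id` out of 2p.

```python
def chunk(l, u, parts, j):
    """Inclusive bounds of the j-th of `parts` consecutive chunks of l..u.

    Assumption: floor-based splitting, l_j = l + floor(j*n/parts).
    """
    n = u - l + 1
    lo = l + (j * n) // parts
    hi = l + ((j + 1) * n) // parts - 1
    return lo, hi


def work_of(my_id, p, n):
    """Inner-loop iterations executed by processor `my_id`."""
    work = 0
    for j in (my_id, 2 * p - 1 - my_id):      # the processor's two chunks
        lk, uk = chunk(1, n, 2 * p, j)        # private bounds, like LK and UK
        for i in range(lk, uk + 1):
            work += i                         # trip count of DO J = 1, I
    return work


for n in (16, 18):                            # 16 is a multiple of 2p, 18 is not
    print(n, [work_of(k, 4, n) for k in range(4)])
```

When n is a multiple of 2p all processors report the same count; otherwise the counts differ, which is the residual load imbalance the text refers to.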
Generalised loop nests
This section re-considers loop nests having the general form shown in Figure 2, but without the restrictions associated with Definition 3.1. First, we prove that, if there are no values of i for which the loop nest is canonical, then the loop nest is rectangular; it is assumed that $l_1 \le u_1$.
Theorem 3.2 Consider the loop nest shown in Figure 2; then, either there is a subset of the iteration space of the outer loop for which the loop nest is canonical, or the loop nest is rectangular.
Repeating this procedure for each of the inner loops, the interval $[l_1, u_1]$ is split into a maximum of m + 1 subintervals in each of which some of the inner loops are canonical with respect to the outer loop, while others are not executed and can be eliminated. This is illustrated in the following example:
number of processors, p, divides 50, the resulting code can be partitioned such that perfect load balance is achieved. Instead, if the partitioning scheme described by Theorem 3.1 had been applied directly to the loop nest shown in Figure 5.a, and assuming that p = 10 processors were used and the amount of work in the body of each of the inner loops was W, a load imbalance equal to 9690" would result; applying the partitioning schemes described in Section 2, a load imbalance equal to 48465" would result.
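The effect of splitting the outer interval can be illustrated with a small Python sketch. The loop nest used here is hypothetical and is not the one in Figure 5: the outer loop runs over 1..100 and the inner loop is `DO J = I, 50`, which does not execute for I > 50. Only inner-loop iterations are counted as work, and the pairing of chunks k and 2p-1-k out of 2p equal chunks is an assumption about the partitioning scheme.

```python
def inner_trips(i):
    """Trip count of the hypothetical inner loop DO J = I, 50 (zero for I > 50)."""
    return max(0, 50 - i + 1)


def split_point(l1, u1):
    """Last i in l1..u1 for which the inner loop still executes (linear scan)."""
    last = l1 - 1
    for i in range(l1, u1 + 1):
        if inner_trips(i) > 0:
            last = i
    return last


def paired_work(l, u, p, cost):
    """Work per processor when l..u is cut into 2p equal chunks and
    processor k executes chunks k and 2p-1-k (assumed scheme)."""
    n = u - l + 1
    assert n % (2 * p) == 0, "sketch only handles n divisible by 2p"
    size = n // (2 * p)
    work = []
    for k in range(p):
        total = 0
        for j in (k, 2 * p - 1 - k):
            start = l + j * size
            total += sum(cost(i) for i in range(start, start + size))
        work.append(total)
    return work


s = split_point(1, 100)
direct = paired_work(1, 100, 5, inner_trips)       # no splitting
first = paired_work(1, s, 5, inner_trips)          # inner loop canonical here
second = paired_work(s + 1, 100, 5, inner_trips)   # inner loop eliminated here
print(s, direct, first, second)
```

Partitioning the unsplit nest leaves the processors with unequal work, whereas partitioning each subinterval separately gives every processor the same share in this example.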
Evaluation and Experimental Results
A series of experiments on a virtual shared memory computer, the KSR1, has been conducted. Three benchmark programs which comprise non-rectangular loop nests were used. The different schemes compared are denoted by the shorthands KAP, MARS, BCS, and CAN; KAP corresponds to the mapping strategy of the KAP commercial parallelising compiler, MARS corresponds to the mapping strategy of the MARS experimental parallelising compiler [1], BCS corresponds to balanced chunk scheduling, and CAN to the partitioning scheme described by Theorem 3.1.
Adjoint convolution has been used to evaluate the effectiveness of run-time loop mapping schemes [2, 4], as well as balanced chunk scheduling [3]. The version of the code used is shown in Figure 6. First, we compute the load imbalance, L, in terms of the work associated with the assignment statement of the loop body, and the corresponding relative load imbalance, $L_R$, for N = 8000; the results are shown in Table 1. The performance of the partitioned programs on the KSR1, for the same value of N, is shown in Figure 7; the ideal line corresponds to linear speed-up. KAP and MARS perform worst of all, while BCS and CAN perform best of all; these results are consistent with the performance anticipated from the values of load imbalance shown in Table 1.
The second benchmark examined, a program adding two upper triangular n x n matrices, has been chosen as an example of a loop nest where the size of the data involved is considerably larger than in adjoint convolution; the corresponding code is shown in Figure 8. The load imbalance, L, in terms of the work associated with the assignment statement of the loop body, and the corresponding relative load imbalance, $L_R$, for N = 1600, are shown in Table 2.
The performance on the KSR1 is depicted in Figure 9; CAN performs best of all, and KAP performs worst of all. While this might have been anticipated from the values of load imbalance shown in Table 2, it appears that the latter do not suffice to provide an adequate justification for the performance of MARS and BCS. This is because the large amount of time spent on memory handling (due to the relatively large size of the arrays involved) renders load imbalance a small fraction of the overall overhead.
Table 2. Expected load imbalance and relative load imbalance for triangular matrix addition.
Finally, we examine TRED2, an approximately 140-line-long routine from the eigenvalue solver package EISPACK. In order to parallelise the code we consider three loop nests which account for over 99% of the overall execution time for a problem size N = 1024; the first two loop nests are triangular, thus the scheme described by Theorem 3.1 can be applied. CAN performs best of all, leading to an improvement of up to 20% over MARS and up to 70% over KAP; detailed results are shown in [6].
Concluding Remarks
This paper presented a strategy for mapping loop nests in which, upon partitioning into p equal partitions along the index of the outermost loop, the k-th partition has a computational load equivalent to $Ak + B$, where A and B are constants. It has also been shown how to apply index set splitting to transform certain loop nests in such a way that the above criterion is satisfied. Our results indicate that the proposed strategy outperforms techniques used by existing parallelising compilers.
Although the strategy has been developed on the basis of minimising load imbalance and evaluated on a virtual shared memory computer, it may also be applicable to a distributed memory environment. Consider, for instance, the code shown in Figure 8. Our strategy implies that if the three arrays are partitioned columnwise into, say, 2p equal partitions with processor k, $0 \le k < p$, being assigned the k-th
