Barrier MIMDs are asynchronous Multiple Instruction stream Multiple Data stream architectures capable of parallel execution of variable execution time instructions and arbitrary control flow (e.g., while loops and calls); however, they differ from conventional MIMDs in that the need for run-time synchronization is significantly reduced. This work considers the problem of scheduling nested loop structures on a barrier MIMD. The basic approach employs loop coalescing, a technique for transforming a multiply-nested loop into a single loop. Loop coalescing is extended to nested triangular loops, in which inner loop bounds are functions of outer loop indices. Also, a more efficient scheme to generate the original loop indices from the coalesced index is proposed for the case of constant loop bounds. These results are general, and can be applied to extend previous work using loop coalescing techniques. We concentrate on using loop coalescing for scheduling barrier MIMDs, and show how previous work in loop transformations [Wol89], [Pol88] and linear scheduling theory [ShF88], [ShO90] can be applied to this problem.
I. Introduction
Parallel computer architectures hold great promise for solving large, compute-intensive problems.
To fully exploit parallel machines, it is necessary to translate applications software into efficient parallel code. Most of the parallelism in programs is found in loops, and techniques are necessary to extract loop parallelism and exploit it at run-time.
This work considers loop parallelization and scheduling for a new class of parallel machines called barrier MIMD (Multiple Instruction stream, Multiple Data stream) architectures [DiS88], [OKD90]. Barrier MIMDs are characterized by a fast, flexible hardware barrier synchronization mechanism that executes in a few clock cycles. Barriers may be applied across any arbitrary subset of the processors. Recall that a processor performs the following steps at a barrier synchronization point:
[1] Marks itself as present at the barrier.
[2] Waits for all other participating processors to arrive at the barrier.
[3] After all participating processors have arrived at the barrier, it continues execution past the barrier.
In a barrier MIMD, step [3] is modified so that processors proceed past the barrier simultaneously. Using this property, previous work [ZaD90] has shown that for basic blocks of code executed on a barrier MIMD, static scheduling can remove many unnecessary synchronizations at compile-time.
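The three steps listed above can be sketched in software. The following is a minimal illustration using Python's standard `threading.Barrier`; the function name `run_barrier_demo` and the bookkeeping lists are hypothetical. A software barrier of this kind only guarantees that no thread passes before all have arrived; it does not model the simultaneous release that the hardware mechanism of a barrier MIMD provides.

```python
import threading

def run_barrier_demo(num_procs=4):
    """Run num_procs threads that meet at one barrier; return what each saw."""
    barrier = threading.Barrier(num_procs)
    arrived = []                      # processors that have marked themselves present
    lock = threading.Lock()
    seen_after = [None] * num_procs   # arrivals visible to each thread after the barrier

    def worker(p):
        with lock:
            arrived.append(p)         # step [1]: mark presence at the barrier
        barrier.wait()                # step [2]: wait for all participants
        with lock:                    # step [3]: continue past the barrier
            seen_after[p] = len(arrived)

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_after
```

Every thread observes all `num_procs` arrivals once it is past the barrier, which is the ordering guarantee that static scheduling relies on.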
This work considers the problem of scheduling nested loop structures on a barrier MIMD. Since the processors have separate, independent control streams, the body of the nested loops can contain subroutine calls, IF statements, other control flow constructs and variable-time instructions. Hence, barrier MIMDs can exploit loop parallelism that VLIW and SIMD machines, limited to a single control stream, must ignore.
The basic approach employs loop coalescing [Pol88], a technique for transforming a multiply-nested loop into a single loop. Loop coalescing is extended to nested triangular loops, in which inner loop bounds are functions of outer loop indices. Also, a more efficient scheme to generate the original loop indices from the coalesced index is proposed for the case of constant loop bounds. These results are general, and can be applied to extend previous work using loop coalescing techniques. We concentrate on using loop coalescing for scheduling barrier MIMDs, and show how previous work in loop transformations [Wol89], [Pol88] and linear scheduling theory [ShF88], [ShO90] can be applied to this problem.
This manuscript is organized as follows. In section two, some previous work in scheduling parallel, shared-memory MIMD architectures is reviewed. Section three extends the loop coalescing transformation to triangular loops, and proposes an improved technique for coalescing rectangular loops (with constant upper and lower bounds). Section four shows how a coalesced loop can be scheduled on a barrier MIMD; an algorithm for generating the proper sequence of barrier synchronizations is given. Finally, conclusions and directions for future work are given in section five.
Previous Work
Scheduling schemes for parallel architectures fall into two broad classes: static and dynamic. In [...] with DOMAIN statements such as EYEJAY are called domains. In the aerodynamic flow codes to be executed on the FMP, only rectangular domains were considered, as these were the most common domains found in such code. Loops iterating over rectangular domains are called rectangular loops; they correspond to nested loops with constant upper and lower bounds.
Parallel execution of the DOALL iterations began when control flow in the program reached the DOALL. Early FMP studies considered employing a centralized control unit to compute an optimal allocation of the loop instances. However, the final design employed a decentralized mechanism for static loop scheduling¹: processor id numbers P were assigned from 0 to PMAX-1, where PMAX was the number of processors. Each processor was also given the maximum instance number and the number of processors executing the DOALL. Processor P began by executing instance number IJ=P. In the previous example, the index variables were I and J: each processor can determine these index variables from the instance number IJ with the following equation:

Mapping consecutive iterations to a single processor is called consecutive allocation in this work.
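As an illustration of the decentralized FMP scheme, the sketch below lists the instance numbers a processor would execute and recovers the loop indices from each one. It rests on assumptions not stated in the surviving text: processor P steps through the instances with a stride equal to the number of processors (interleaved allocation), indices are 0-based, and instances are numbered in row-major order so that a div/mod pair recovers (I, J). The function names and the parameter `jmax` (the inner loop trip count) are hypothetical.

```python
def fmp_instances(p, num_procs, max_instance):
    """Instance numbers IJ executed by processor p under interleaved allocation.

    Assumes processor p starts at IJ = p and advances by num_procs each time.
    """
    return list(range(p, max_instance + 1, num_procs))

def indices_from_instance(ij, jmax):
    """Recover 0-based (I, J) from instance number IJ.

    Assumes row-major numbering of a rectangular domain with jmax inner iterations.
    """
    return ij // jmax, ij % jmax
```

For a 3 x 4 rectangular domain (12 instances) on four processors, processor 1 would execute instances 1, 5 and 9, which map to (0, 1), (1, 1) and (2, 1) under these assumptions.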
Generalized Loop Coalescing
In this section, a technique for coalescing triangular nested loops with inner loop bounds that are functions of the outer loop indices is proposed. An improved method for generating the original indices from the coalesced index for rectangular loops is also given. Triangular loops are ubiquitous in the numerical linear algebra codes [DoM79], [GoV83] that are perhaps the most common input to vectorizing and parallelizing compilers. The new technique broadens the applicability of loop coalescing.
The approach used in the FMP to generate the original loop indices from the coalesced index can be applied to rectangular loops with nest levels greater than two. The basic idea is to coalesce starting from the innermost nest levels and proceed outward. The two innermost levels are coalesced, followed by the next innermost loop and the coalesced loop formed in the previous step, and so on until the outermost loop has been coalesced.

To execute the coalesced loop on a barrier MIMD, each processor independently computes the transition function for successive j until i(j)>JK, where JK is the current instance for the processor. This gives J for the instance, which is then used to compute K. The body of the loop is executed using these generated values for I and J. The cost of these operations depends on the complexity of the transition function, which in turn depends on the form of the inner loop bounds. Alternatively, the transition series could be generated at compile-time and saved in local memory in the processors, reducing the run-time overhead at the expense of extra storage.
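The index recovery described above can be sketched for one concrete triangular nest. The example assumes the loop `DO J = 1, N` / `DO K = 1, J`, for which the number of instances with outer index at most j is j(j+1)/2; this choice of loop, the 0-based instance number `jk`, and the function names are illustrative assumptions rather than the general form treated in the text.

```python
def transition(j):
    """Number of coalesced instances with outer index <= j for DO J=1,N / DO K=1,J."""
    return j * (j + 1) // 2

def triangular_indices(jk):
    """Recover 1-based (J, K) from a 0-based coalesced instance number jk."""
    j = 1
    while transition(j) <= jk:      # advance j until transition(j) > jk
        j += 1
    k = jk - transition(j - 1) + 1  # offset of jk within row j
    return j, k
```

Precomputing `transition(j)` for all j at compile-time and storing the series locally would replace the multiply in the loop with a table lookup, at the cost of the extra storage.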
To generalize the approach given above for doubly-nested loops, it is necessary to determine the proper transition function for general loop bounds. In the general case, doubly-nested loops have the following form:
The number of instances in the coalesced loop is then given as
x(M,N) = t(N)
A closed-form expression for i(j) is required, and this will sometimes require manipulation of the [...]

For the general loop form, once J is computed from the transition function, K is determined from the expression
As a more complex example, consider the loop of figure 6, which is part of Trench's algorithm for determining the inverse of a Toeplitz matrix [GoV83]⁴. The transition function is derived as follows:
Distributing the summation
Proper synchronization for the coalesced form of this loop is considered in the next section. Unlike the other rectangular loop coalescing techniques, the new approach does not use integer division. In the best case the I and J computations require a single compare operation each, and JK and K computations require two integer multiplies, two subtractions, and one addition. However, on average the I and J computations will require that i and j be incremented some average amount until the inequality is satisfied.
The best approach will depend on the availability of integer division in hardware and the relative speed of integer division and multiplication, as well as the average increment per iteration in the triangular approach. Recent processor architecture designs have reduced the amount of hardware support for relatively infrequent operations such as division, and software support routines for integer division are slow. One study found that a general purpose divide routine averaged 80 cycles per divide operation.
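The trade-off between division and compare-and-increment can be made concrete with a small sketch. This is not the scheme proposed in this work, whose details depend on the transition function; it is a simpler illustration, under assumed 0-based indices and interleaved allocation, of how a processor can carry (I, J) forward from one of its iterations to the next using only additions, subtractions and compares. The names `advance` and `iterations_for` are hypothetical.

```python
def advance(i, j, step, jmax):
    """Advance 0-based (i, j) by `step` coalesced iterations without division."""
    j += step
    while j >= jmax:    # carry into the outer index until j is back in range
        j -= jmax
        i += 1
    return i, j

def iterations_for(p, num_procs, imax, jmax):
    """(i, j) pairs executed by processor p in an imax x jmax rectangular nest."""
    i, j = advance(0, 0, p, jmax)
    out = []
    while i < imax:
        out.append((i, j))
        i, j = advance(i, j, num_procs, jmax)
    return out
```

The number of passes through the `while` loop in `advance` grows with the ratio of the stride to the inner trip count, which is the "average increment" cost that has to be weighed against the cost of a divide.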
Notice that the need for multiplies and divides to compute indices for each iteration can in general be eliminated by using consecutive allocation (mentioned in section 2) and replicating the original looping control structure in the code for each process. This is discussed further in [Pol88]. We stress the other techniques because they efficiently support arbitrary allocations (including consecutive allocation); however, when consecutive allocation is appropriate, the use of the original looping structure may be preferable.

Loop Scheduling and Synchronization on Barrier MIMD Architectures
In the previous section, a generalized technique for coalescing loops was described. In this section loop coalescing is considered for static, decentralized scheduling of barrier MIMD architectures. The approach taken will be similar to that for the FMP, except the compiler will automatically construct the domain for a set of nested loops after the appropriate analysis has been performed, and the domains are not restricted to rectangular shapes. In addition, the instances of the coalesced loop may be synchronized as necessary by a barrier, so coalescing is not restricted to loops without dependencies. Loop coalescing simplifies loop scheduling, since the single dimension of the coalesced iteration space can be allocated evenly among the processors with small scheduling overhead.
The basic properties of barrier MIMD architectures were mentioned in the introduction. They include a fast hardware barrier synchronization mechanism that can be applied across any subset of the processors. A barrier processor generates the proper sequence of barrier masks to ensure correct sequencing and proper timing relationships between computational processors. It places the barriers in a barrier synchronization buffer where they are matched against processors waiting at a barrier, and then executed.
A single WAIT line from each processor to the synchronization buffer is used to indicate that a particular processor is participating in a barrier synchronization. Thus, when scheduling a loop it is necessary to generate code for the computational processors to request a barrier and for the barrier processor to generate the proper barrier masks in the correct order.
In addition, before execution of a coalesced loop on a barrier MIMD, the barrier processor must broadcast the number of iterations in the coalesced loop and the number of processors executing the loop.
Loop iterations in the coalesced index set are assigned to the computational processors using interleaved allocation, as in the FMP. This binding occurs at compile-time between loop iterations and a virtual barrier MIMD machine; the binding between the virtual and actual barrier MIMD machine occurs at run-time when the barrier processor broadcasts the number of iterations in the loop and the number of processors in the actual machine.⁵
Data dependencies [ShF88], [Wol89] between loop iterations must be considered during coalescing, and if such dependencies do exist then the resulting coalesced loop may require barrier synchronization.
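To illustrate why dependencies lead to barriers, the sketch below builds a generic wavefront schedule for a doubly-nested loop in which iteration (i, j) is assumed to depend on (i-1, j) and (i, j-1). Iterations with the same value of a*i + b*j form one wavefront and can run in parallel; a barrier separates consecutive wavefronts. This is a textbook-style illustration under stated assumptions, not the barrier generation algorithm of this work: the function names, the default coefficients a = b = 1, and the use of a full barrier across all processors are hypothetical choices.

```python
def wavefronts(n_i, n_j, a=1, b=1):
    """Group iterations (i, j) of an n_i x n_j loop nest by wavefront a*i + b*j."""
    fronts = {}
    for i in range(n_i):
        for j in range(n_j):
            fronts.setdefault(a * i + b * j, []).append((i, j))
    return [fronts[w] for w in sorted(fronts)]

def schedule(n_i, n_j, num_procs):
    """Per-processor event lists: iterations interleaved within each wavefront,
    with a "BARRIER" marker between consecutive wavefronts."""
    events = [[] for _ in range(num_procs)]
    fronts = wavefronts(n_i, n_j)
    for w, front in enumerate(fronts):
        for k, iteration in enumerate(front):
            events[k % num_procs].append(iteration)
        if w < len(fronts) - 1:         # no barrier needed after the last wavefront
            for ev in events:
                ev.append("BARRIER")
    return events
```

Every processor sees the same number of barriers in the same order, so the barrier sequence can be determined entirely at compile-time.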
If no dependencies exist between iterations, then no synchronization is required and processors proceed to
5. This approach also allows the machine to be partitioned so that independent loops (or programs) may be executing simultaneously on different parts of the machine.

The algorithm for generating barriers for simple linear schedules is now described. Each computational processor executes this algorithm to generate a proper sequence of barriers to correctly implement the simple linear schedule.
Algorithm: Barrier Generation
The wavefront index $\omega$ is generated from the linear schedule function $\sigma(J)$, where $J = (J_1, J_2, \ldots, J_n)$ are the n indices of the original nested loops that have been coalesced. The wavefront index represents the wavefront in which iteration $(J)$ is executed. Let p be the processor id number, P the number of processors executing the schedule, and let N be the number of iterations in the coalesced loop. $J$ represents the current iteration being executed by processor p. The procedure is:

Statements 14 and 16 can be distributed out of the I loop; since the range for the resulting loop matches that of the DO loop labeled 10 and no dependencies exist between these loops, they can be fused [Wol89]. The resulting code is shown in figure 13.
J = (IJ mod (N-K)) + (K+1).
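The mod expression for J can be checked with a short sketch. Only the J formula appears in the text; the companion div expression for I used below, the 0-based coalesced index `ij`, and the function name `elimination_indices` are assumptions made for illustration, chosen so that both I and J range over K+1 through N.

```python
def elimination_indices(ij, n, k):
    """Map 0-based coalesced index ij to (I, J), both ranging over k+1..n."""
    width = n - k
    i = ij // width + (k + 1)   # assumed companion formula for I
    j = ij % width + (k + 1)    # J = (IJ mod (N-K)) + (K+1), as in the text
    return i, j
```

With N = 6 and K = 4, for example, the four coalesced iterations map to (5, 5), (5, 6), (6, 5) and (6, 6) under these assumptions.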
Since both the I and J loops may be executed in parallel, there is no need to generate barriers to enforce a proper ordering between iterations of the coalesced loop. The code for partial or complete pivoting, if it were included in the example, could be parallelized like the P loop. As with the forward elimination example, the barrier processor could tune the processor allocation to adapt to the monotonically decreasing parallelism as K increases.
Table 3 shows the allocation of the iterations of the coalesced loop for four processors (N = 6).
natural when loop coalescing is combined with linear schedules. Notice how the loop limits for the inner I and P loops (labeled 20 and 30) vary with K. For small values of K, most of the parallelism resides in the I loop since the P loop range is small; however, the situation changes as K approaches N, where the I loop range becomes small, and the P range large. Parallelism exists in both loops⁸ but shifts from the I loop to the P loop as K moves through its range. Since loop interchange is possible, it is difficult to decide which loop should be parallelized for a machine that supports a single level of loop parallelism. If coalescing is applied to these loops the parallelism inherent in the loop structure can be exploited more effectively, since it would be inherent in the single coalesced loop.
As another example of the difficulty in effective loop parallelization, consider again the loop nest from Trench's algorithm, given in figure 6. Now assume that, instead of the loop bounds given in the

8. The parallelism in the P loop must be realized through and an associative reduction [Wol89].
can exploit parallelism in one or the other loop; the loop with maximum parallelism depends on the relative values of M and N⁹. The appropriate test can be executed at run-time to determine which loop should be parallelized: with loop coalescing, the div and mod parameter can be a variable set according to the results of this test. The result is a very efficient technique to statically generate the proper run-time test to exploit the maximum parallelism possible.
Conclusions
In this work, loop coalescing has been extended to apply to triangular nested loops. A new approach has been proposed for coalescing rectangular loops that is more efficient than current techniques. The new loop coalescing techniques, combined with some familiar loop transformations for parallelization, have been applied to the problem of scheduling nested loop structures on barrier MIMD architectures. Simple linear schedules have been shown to be an effective paradigm for efficiently exploiting the parallelism in nested loops. These schedules can also quite easily take advantage of parallelism that is inherent in the interaction between nested loops. Loop coalescing also has advantages in parallelizing loop structures where the parallelism shifts from one loop to another during execution, and where simple tests at run-time can determine the best loop to parallelize.
Future research efforts include extending the barrier generation algorithm so that it can be applied to linear schedules in general. Current work also includes a prototype compiler that will implement several of the transformations described in this work.
9. The analysis necessary to determine such tests in the general case is given in [ShO90].
