Loop Coalescing and Scheduling for Barrier MIMD Architectures by O'Keefe, Matthew T. & Dietz, Henry G.
Purdue University
Purdue e-Pubs
Department of Electrical and Computer
Engineering Technical Reports
Department of Electrical and Computer
Engineering
6-1-1990






O'Keefe, Matthew T. and Dietz, Henry G., "Loop Coalescing and Scheduling for Barrier MIMD Architectures" (1990). Department of
Electrical and Computer Engineering Technical Reports. Paper 727.
https://docs.lib.purdue.edu/ecetr/727
Loop Coalescing and 
Scheduling for Barrier 
MIMD Architectures




School of Electrical Engineering
Purdue University
West Lafayette, Indiana 47907
Matthew T. O’Keefe and Henry G. Dietz
School of Electrical Engineering
Purdue University
West Lafayette, IN 47907
June 7, 1990
hankd@ecn.purdue.edu
(317) 494-3357
ABSTRACT
Barrier MIMDs are asynchronous Multiple Instruction stream Multiple Data stream architectures
capable of parallel execution of variable execution time instructions and arbitrary control flow
(e.g., while loops and calls); however, they differ from conventional MIMDs in that the need for
run-time synchronization is significantly reduced. This work considers the problem of scheduling
nested loop structures on a barrier MIMD. The basic approach employs loop coalescing, a technique
for transforming a multiply-nested loop into a single loop. Loop coalescing is extended to nested
triangular loops, in which inner loop bounds are functions of outer loop indices. Also, a more
efficient scheme to generate the original loop indices from the coalesced index is proposed for the
case of constant loop bounds. These results are general, and can be applied to extend previous work
using loop coalescing techniques. We concentrate on using loop coalescing for scheduling barrier
MIMDs, and show how previous work in loop transformations [Wol89], [Pol88] and linear scheduling
theory [ShF88], [ShO90] can be applied to this problem.
Key phrases: Loop Coalescing, Loop Transformation, Barrier Synchronization, Compiler
Parallelization, Compiler Optimization, Static Barrier MIMD.
1. Introduction
Parallel computer architectures hold great promise for solving large, compute-intensive problems.
To fully exploit parallel machines, it is necessary to translate applications software into
efficient parallel code. Most of the parallelism in programs is found in loops, and techniques are
necessary to extract loop parallelism and exploit it at run-time.
This work considers loop parallelization and scheduling for a new class of parallel machines called
barrier MIMD (Multiple Instruction stream, Multiple Data stream) architectures [DiS88], [OKD90].
Barrier MIMDs are characterized by a fast, flexible hardware barrier synchronization mechanism that
executes in a few clock cycles. Barriers may be applied across any arbitrary subset of the
processors. Recall that a processor performs the following steps at a barrier synchronization point:
[1] Marks itself as present at the barrier.
[2] Waits for all other participating processors to arrive at the barrier.
[3] After all participating processors have arrived at the barrier, it continues execution past the barrier.
In a barrier MIMD, step [3] is modified so that processors proceed past the barrier simultaneously. Using 
this property, previous work [ZaD90] has shown that for basic blocks of code executed on a barrier 
MIMD, static scheduling can remove many unnecessary synchronizations at compile-time.
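The three barrier steps can be illustrated with a small software sketch. This is only an analogy: it uses Python's `threading.Barrier`, which does not provide the simultaneous release or few-cycle latency of the hardware mechanism described here, and the function name `run_barrier_demo` is hypothetical.

```python
import threading


def run_barrier_demo(num_processors=4):
    """Return the event log of threads meeting at a single barrier."""
    barrier = threading.Barrier(num_processors)
    log = []
    log_lock = threading.Lock()

    def processor(pid):
        with log_lock:
            log.append(("arrive", pid))   # step [1]: mark presence at the barrier
        barrier.wait()                    # step [2]: wait for all participants
        with log_lock:
            log.append(("depart", pid))   # step [3]: continue past the barrier

    threads = [threading.Thread(target=processor, args=(p,))
               for p in range(num_processors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In the returned log, every "arrive" entry precedes every "depart" entry, since no thread can leave the barrier until all have reached it.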
This work considers the problem of scheduling nested loop structures on a barrier MIMD. Since the
processors have separate, independent control streams, the body of the nested loops can contain
subroutine calls, IF statements, other control flow constructs and variable-time instructions.
Hence, barrier MIMDs can exploit loop parallelism that VLIW and SIMD machines, limited to a single
control stream, must ignore.
The basic approach employs loop coalescing [Pol88], a technique for transforming a multiply-nested
loop into a single loop. Loop coalescing is extended to nested triangular loops, in which inner
loop bounds are functions of outer loop indices. Also, a more efficient scheme to generate the
original loop indices from the coalesced index is proposed for the case of constant loop bounds.
These results are general, and can be applied to extend previous work using loop coalescing
techniques. We concentrate on using loop coalescing for scheduling barrier MIMDs, and show how
previous work in loop transformations [Wol89], [Pol88] and linear scheduling theory [ShF88],
[ShO90] can be applied to this problem.
This manuscript is organized as follows. In section two, some previous work in scheduling parallel,
shared-memory MIMD architectures is reviewed. Section three extends the loop coalescing
transformation to triangular loops, and proposes an improved technique for coalescing rectangular
loops (with constant upper and lower bounds). Section four shows how a coalesced loop can be
scheduled on a barrier MIMD; an algorithm for generating the proper sequence of barrier
synchronizations is given. Finally, conclusions and directions for future work are given in section
five.
2. Previous Work
Scheduling schemes for parallel architectures fall into two broad classes: static and dynamic. In
static scheduling, compile-time information is used to determine a binding between tasks and
processors before program execution begins; this approach has very low run-time overhead but can
result in poor load balancing under certain conditions. In contrast, dynamic scheduling employs
run-time information to perform this binding during program execution, resulting in good load
balancing at the expense of high run-time overhead. Hybrid schemes between static and dynamic
scheduling are also possible.
The Flow Model Processor (FMP) MIMD architecture [LuB80], [Lun87] employs static scheduling for
allocating parallel loop iterations to processors. The FMP is a shared-memory MIMD notable for its
fast hardware barrier synchronization mechanism and a decentralized approach to scheduling and
control. The target application domain for the machine was computational aerodynamics, although it
supports a general MIMD model.
The Flow Model Processor was programmed using an extended Fortran language that included a
parallel DO loop construct, the DOALL. The DOALL provided the basic parallel construct for the FMP;
no dependencies exist between DOALL iterations so they can be executed in parallel. The iteration
space for the DOALL was described by a DOMAIN statement. For example, the declaration
    DOMAIN /EYEJAY/: I = 1, IMAX; J = 1, JMAX
declares that there are IMAX*JMAX elements, each consisting of a pair of values for I and J in the
ranges shown. Each pair of index values specifies an instance IJ of the loop body. Index sets
created with DOMAIN statements such as EYEJAY are called domains. In the aerodynamic flow codes to
be executed on the FMP only rectangular domains were considered, as these were the most common
domains found in such code. Loops iterating over rectangular domains are called rectangular loops.

Parallel execution of the DOALL iterations began when control flow in the program reached the
DOALL. Early FMP studies considered employing a centralized control unit to compute an optimal
allocation of the loop instances. However, the final design employed a decentralized mechanism for
static loop scheduling (see footnote 1): processor id numbers P were assigned from 0 to PMAX-1,
where PMAX was the number of processors. Each processor was also given the maximum instance number
and the number of processors executing the DOALL. Processor P began by executing instance number
IJ = P. In the previous example, the index variables were I and J; each processor can determine
these index variables from the instance number IJ with the following equation:

    IJ = J*IMAX + I

In this case, I = IJ mod IMAX and J = IJ div IMAX. After computing each instance, a processor
increments its instance number IJ by PMAX to obtain the next instance to compute. This mapping of
iterations to processors is called interleaved allocation in this work. This continues until
IJ > IJMAX, the maximum instance number. All processors then participate in a hardware barrier
synchronization before program execution proceeds.
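The interleaved allocation just described can be sketched in a few lines of Python. This is an illustrative sketch, not FMP code: the function names are hypothetical, and instance numbers and the indices I and J are assumed to be zero-based so that IJ = J*IMAX + I inverts exactly with div and mod.

```python
def fmp_instances(p, pmax, max_instance):
    """Instance numbers executed by processor p: p, p+PMAX, p+2*PMAX, ...

    Stops once the instance number exceeds max_instance (assumed zero-based).
    """
    return list(range(p, max_instance + 1, pmax))


def instance_to_indices(ij, imax):
    """Invert IJ = J*IMAX + I, assuming zero-based I and J."""
    j, i = divmod(ij, imax)   # J = IJ div IMAX, I = IJ mod IMAX
    return i, j
```

For a 3 x 2 domain (six instances, numbered 0 to 5) on four processors, `fmp_instances(0, 4, 5)` gives `[0, 4]`, and instance 4 maps back to `(I, J) = (1, 1)`.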
A centralized control mechanism is needed only at the beginning of the DOALL to broadcast the
number of processors participating and the maximum instance number. At that point, processors can
independently compute the iterations assigned to them without accessing any central control or
shared variables. This avoids the contention and run-time overhead inherent in a dynamic scheduling
scheme. The FMP loop scheduling technique establishes a binding at compile-time between loop
iterations and a virtual machine, where each processor is given an equal number of iterations; a
binding between the virtual machine and actual machine is made at run-time.
Notice that the loop iterations are divided up among the processors equally and are allocated “all
at once” at the beginning of parallel execution. If loop iteration execution times vary widely,
there would seem to be a danger that the processors would finish at widely different times.
Detailed instruction-level simulation studies conducted during the design of the Flow Model
Processor showed that the execution time of iterations was close and the amount of processor time
spent waiting was small. Kruskal and Weiss [KrW85] studied this problem and showed that for a wide
class of distributions for iteration execution times, allocating an equal number of iterations to
each processor all at once has good efficiency.
1. Although the scheduling is static, i.e. performed at compile-time, recompilation is unnecessary if the machine
configuration changes or different numbers of processors are used to execute the DOALL.
A dynamic scheduling scheme known as guided self-scheduling [Pol88] was developed by
Polychronopoulos and Kuck to reduce the amount of run-time overhead while still maintaining good
load balancing among processors. Loop coalescing is applied to transform nested parallel DO loops
with constant upper and lower bounds (i.e., rectangular loops) into a single parallel DO loop with
a single dimension. Other transformations such as loop distribution and loop interchanging [Wol89]
can be applied to transform a set of nested loops into the proper form for coalescing. In essence,
loop coalescing is a compiler technique that constructs the FMP domains automatically at
compile-time.
Processors obtain iterations of the coalesced loop by accessing the shared coalesced index variable;
the number of iterations given at each access varies dynamically, starting out large but tapering
off to a single iteration according to

$$x_i = \left\lceil \frac{R_i}{PMAX} \right\rceil, \qquad R_{i+1} \leftarrow R_i - x_i$$

where $R_i$ is the number of iterations remaining at step $i$ (and $R_1 = N$, the total number of
iterations in the loop), $x_i$ is the number of iterations given to the processor requesting work
at step $i$, and PMAX is the number of processors. This adaptive variation in allocated work
reduces the number of synchronization operations compared to allocating a single iteration at a
time. The number of synchronization operations is also reduced by coalescing, since only a single
index, not multiple indices as in the original loops, need be accessed. In guided self-scheduling,
the processor at step $i$ executes iterations $[N - R_i + 1, \ldots, N - R_i + x_i]$.
Mapping consecutive iterations to a single processor is called consecutive allocation in this work.
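The tapering chunk sizes of guided self-scheduling can be seen in a short sketch. It assumes the usual ceiling form of the rule, with each request receiving the remaining iterations divided by PMAX and rounded up; the function name is hypothetical, and the shared index is modeled as a plain local variable rather than a synchronized one.

```python
def guided_chunks(n, pmax):
    """Return the (first, last) iteration range handed out at each request.

    Iterations are numbered 1..n, matching the text's [N-R_i+1, ..., N-R_i+x_i].
    """
    remaining = n                        # R_1 = N
    chunks = []
    while remaining > 0:
        x = -(-remaining // pmax)        # ceiling of remaining / pmax
        first = n - remaining + 1        # N - R_i + 1
        chunks.append((first, first + x - 1))
        remaining -= x                   # R_{i+1} = R_i - x_i
    return chunks
```

With ten iterations and four processors the chunk lengths are 3, 2, 2, 1, 1, 1, so six accesses to the shared index replace the ten needed when a single iteration is allocated at a time.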
3. Generalized Loop Coalescing
In this section, a technique for coalescing triangular nested loops with inner loop bounds that are
functions of the outer loop indices is proposed. An improved method for generating the original
indices from the coalesced index for rectangular loops is also given. Triangular loops are
ubiquitous in the numerical linear algebra codes [DoM79], [GoV83] that are perhaps the most common
input to vectorizing and parallelizing compilers. The new technique broadens the applicability of
loop coalescing.
The approach used in the FMP to generate the original loop indices from the coalesced index can be
applied to rectangular loops with nest levels greater than two. The basic idea is to coalesce
starting from the innermost nest levels and proceed outward. The two innermost levels are
coalesced, followed by the next innermost loop and the coalesced loop formed in the previous step,
and so on until the outermost loop in the loop nest is reached and coalesced. Consider the
following loop:

    DO 10 I = 1, IMAX
      DO 20 J = 1, JMAX
        DO 30 K = 1, KMAX

Coalescing the two innermost loops yields

    DO 10 I = 1, IMAX
      DO 20 JK = 1, JMAX*KMAX
        J = JK div KMAX
        K = JK mod KMAX

followed by the remaining loops

    DO 10 IJK = 1, IMAX*JMAX*KMAX
      I = IJK div (JMAX*KMAX)
      JK = IJK mod (JMAX*KMAX)
      J = JK div KMAX
      K = JK mod KMAX
Each coalescing step results in the need for one integer division (see footnote 2) at run-time to
generate the loop indices from the coalesced index, and this example requires two integer
divisions. In contrast, Polychronopoulos’s scheme [Pol88] requires two integer divisions, one
multiplication and one subtraction per loop index, resulting in six integer divisions, three
multiplications, and three subtractions for this example.
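A minimal sketch of the two-division index recovery for the triply-nested rectangular loop follows. It assumes a zero-based coalesced index (0 to IMAX*JMAX*KMAX-1) and adds one to each result to recover the one-based loop indices; that offset handling and the function name are assumptions of this sketch, not part of the scheme as printed.

```python
def rect_indices(ijk, jmax, kmax):
    """Recover one-based (I, J, K) from a zero-based coalesced index."""
    i, jk = divmod(ijk, jmax * kmax)   # first division: quotient is div, remainder is mod
    j, k = divmod(jk, kmax)            # second division
    return i + 1, j + 1, k + 1
```

Each `divmod` corresponds to one integer division whose quotient and remainder are both used, which is why the whole recovery costs two divisions.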
Techniques for coalescing two-level triangular loops are now given. In a triangular nested loop,
the inner loop bounds are functions of the outer loop index variable.
2. The div and mod operations have been specified in the loop body for clarity. The quotient of the integer division
represents the div result, the remainder the mod result.
Consider the following triangular loop structure:

    DO 10 J = 1, N
      DO 20 K = 1, J

The index set for this loop with N=5 is given in figure one.
[Figure: iteration space of the triangular loop for N=5, with J on the horizontal axis and K on the
vertical axis; each point is labeled with its position in the serial execution order.]
Figure 1: Example Triangular Loop with Serial Execution Order.
The iterations are labeled with their serial execution order. The total number of instances,
$\tau(N)$, in the coalesced loop is given by the expression

$$\tau(N) = \sum_{J=1}^{N} \lambda(J,N)$$

where $\lambda(J,N) = J$ for this example, which yields $\tau(N) = N(N+1)/2$. The function
$\lambda(J,N)$ represents the number of iterations of the inner K loop as a function of the outer
loop index J and upper bound N. Normalized inner loops have lower loop bounds and increments equal
to one, and $\lambda(J,N)$ reduces to the loop upper bound. In the general case, the function
$\lambda(J,N)$ is given as

$$\lambda(J,N) = \left( ub(J,N) - lb(J,N) + 1 \right) \;\mathrm{div}\; inc(J,N)$$

where $ub(J,N)$, $lb(J,N)$, and $inc(J,N)$ are the upper bound, lower bound, and increment
functions, respectively, for the inner loop.
For the example loop, the function $\tau(N) = N(N+1)/2$ is the number of iterations in the
coalesced loop. The index variable for the coalesced loop will be JK, with a lower bound of 0 and
upper bound of $\tau(N) - 1$. The original loop indices J and K must be re-generated from the
coalesced index.
Figures 2 and 3 show how J and K vary with the coalesced index JK.
[Figure: plot of K against the coalesced index JK, for JK from 0 to 15.]
Figure 2: K as a Function of JK.
It can be seen that transitions occur at 0, 1, 3, 6 and 10. This series can be generated by a
transition function $\iota(j)$, $0 \le j \le N-1$, where, in this example, $\iota(j) = j(j+1)/2$.
The transition function can be used to determine the value of index J given coalesced index JK:
$J = \min\{ j : \iota(j) > JK \}$, i.e., the smallest $j$ such that $\iota(j) > JK$. Hence, to
determine J from JK, the function $\iota(j)$ must be computed for $j = 0, 1, 2, \ldots$ until
$\iota(j) > JK$. It is then straightforward to compute the inner loop index K.
[Figure: plot of J against the coalesced index JK, for JK from 0 to 15.]
Figure 3: J as a Function of JK.
To execute the coalesced loop on a barrier MIMD, each processor independently computes the
transition function for successive $j$ until $\iota(j) > JK$, where JK is the current instance for
the processor. This gives J for the instance, which is then used to compute K. The body of the loop
is executed using these generated values for J and K. The cost of these operations depends on the
complexity of the transition function, which in turn depends on the form of the inner loop bounds.
Alternatively, the transition series could be generated at compile-time and saved in local memory
in the processors, reducing the run-time overhead at the expense of extra storage.
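Both ways of recovering J and K for the triangular example can be sketched as follows. The sketch assumes the transition function j(j+1)/2 of this example and takes K as JK minus the transition value at J-1, plus one, which is consistent with the serial order of Figure 1. The function names are hypothetical, and the precomputed series is searched with `bisect_right`, which is one possible lookup rather than the method the text prescribes.

```python
from bisect import bisect_right


def transition(j):
    """Transition function for DO J = 1, N / DO K = 1, J."""
    return j * (j + 1) // 2


def tri_indices_search(jk):
    """Recompute the transition function for j = 0, 1, 2, ... until it exceeds JK."""
    j = 0
    while transition(j) <= jk:
        j += 1
    return j, jk - transition(j - 1) + 1


def make_transition_table(n):
    """Transition series 0, 1, 3, 6, ... stored ahead of time (extra storage)."""
    return [transition(j) for j in range(n + 1)]


def tri_indices_table(jk, table):
    """Look up the smallest j with table[j] > JK in the stored series."""
    j = bisect_right(table, jk)
    return j, jk - table[j - 1] + 1
```

For N=5 both functions map JK = 0, 1, 2, ... to (1,1), (2,1), (2,2), and so on up to (5,5) at JK = 14; the table version trades N+1 stored integers for the repeated evaluation of the transition function.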
To generalize the approach given above for doubly-nested loops, it is necessary to determine the
proper transition function for general loop bounds. In the general case, doubly-nested loops have
the following form:

    DO 10 I = M, N, P
      DO 20 J = lb(I,M,N), ub(I,M,N), inc(I,M,N)

Figure 4: General Form for Doubly-Nested Loops.
The upper and lower bounds and increment of the inner loop are functions of the outer loop index
and bounds. Note that the outer loop bounds and increment M, N, and P can be integer expressions.
Loop normalization could be employed to transform the loop increment and lower bounds to one to
simplify the general-form loop structure, but this increases the complexity of subscript
expressions (see footnote 3). It is not used in this work.
The transition function $\iota(j)$ for the general-form doubly-nested loops is

$$\iota(j) = \sum_{J=M}^{j} \lambda(J,M,N), \qquad
\lambda(J,M,N) = \begin{cases}
\left( ub(J,M,N) - lb(J,M,N) + 1 \right) \;\mathrm{div}\; inc(J,M,N) & \text{if } inc(J,M,N) > 0 \\
\left( lb(J,M,N) - ub(J,M,N) + 1 \right) \;\mathrm{div}\; inc(J,M,N) & \text{if } inc(J,M,N) < 0
\end{cases}$$

The number of instances in the coalesced loop, $\tau$, is then given as

$$\tau(M,N) = \iota(N)$$

A closed-form expression for $\iota(j)$ is required, and this will sometimes require manipulation
of the summation. This was not the case for the example in figure one, where

$$\iota(j) = \begin{cases}
\sum_{J=1}^{j} J = j(j+1)/2 & 1 \le j \le N \\
0 & j < 1
\end{cases}$$

is a well-known form. As another example, consider the following loop structure:

    DO 10 J = 1, N
      DO 20 K = 1, N-J+1

Figure 5: Example of a General Form Doubly-Nested Loop.
3. Wolfe [Wol86] recently observed that loop normalization can adversely affect the complexity of transforming
loops since it typically increases the complexity of the array subscript expressions.
The transition function for this loop is

$$\iota(j) = \begin{cases}
\sum_{J=1}^{j} (N - J + 1) & 1 \le j \le N-1 \\
0 & j < 1
\end{cases}$$

which can be reduced to

$$\iota(j) = \sum_{J=1}^{j} N - \sum_{J=1}^{j} J + \sum_{J=1}^{j} 1 = j(2N - j + 1)/2, \qquad 1 \le j \le N-1$$

For the general loop form, once J is computed from the transition function, K is determined from
the expression

$$K = JK - \iota(J-1) + lb(J,M,N)$$

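As a check on the closed form for the loop of figure 5, the sketch below recovers (J, K) from the coalesced index and compares the result against the original loop order. It assumes a zero-based coalesced index and takes K as JK minus the transition value at J-1, plus the inner lower bound (here 1); the function names are hypothetical.

```python
def transition_fig5(j, n):
    """Closed form j(2N - j + 1)/2 for DO J = 1, N / DO K = 1, N-J+1."""
    return j * (2 * n - j + 1) // 2 if j >= 1 else 0


def recover_fig5(jk, n):
    """Recover (J, K) from a zero-based coalesced index JK."""
    j = 1
    while transition_fig5(j, n) <= jk:   # smallest j whose transition exceeds JK
        j += 1
    lower_bound = 1                      # lb of the inner K loop
    k = jk - transition_fig5(j - 1, n) + lower_bound
    return j, k
```

For N=4 the coalesced loop has `transition_fig5(4, 4) = 10` iterations, and `recover_fig5` maps indices 0 through 9 to (1,1), (1,2), (1,3), (1,4), (2,1), and so on, in the order of the original nest.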
As a more complex example, consider the loop of figure 6, which is part of Trench’s algorithm for
determining the inverse of a Toeplitz matrix [GoV83] (see footnote 4):

    DO 10 J = 2, (N-1)/2 + 1
      DO 20 K = J, N-J+1
        B(J,K) = B(J-1,K-1) +
                 (v[N+1-K]*v[N+1-J] - v[J-1]*v[K-1])/GAMMA;
    20 CONTINUE
    10 CONTINUE

Figure 6: Doubly-nested Loop Taken from Trench’s Algorithm.
The iteration space for this loop (with N=10) is given in figure 7.
Figure 7: Iteration Space for Loop Nest from Trench’s Algorithm (N=10).
The transition function is derived as follows:

$$\iota(j) = \sum_{J=2}^{j} \left( (N - J + 1) - J + 1 \right)
= \sum_{J=2}^{j} N - \sum_{J=2}^{j} 2J + \sum_{J=2}^{j} 2, \qquad 2 \le j \le (N-1)/2 + 1$$

and simplifying yields

$$\iota(j) = N(j-1) - 2\left( j(j+1)/2 - 1 \right) + 2(j-1) = -j^2 + (N+1)j - N, \qquad 2 \le j \le (N-1)/2 + 1$$

4. Proper synchronization for the coalesced form of this loop is considered in the next section.
Table 1 shows how the coalesced loop iterations for Trench’s algorithm are spread across a
four-processor barrier MIMD. The original indices for the loop are given in parentheses (J,K) next
to the coalesced index.

    PE 0         PE 1         PE 2         PE 3
     0 (2,2)      1 (2,3)      2 (2,4)      3 (2,5)
     4 (2,6)      5 (2,7)      6 (2,8)      7 (2,9)
     8 (3,3)      9 (3,4)     10 (3,5)     11 (3,6)
    12 (3,7)     13 (3,8)     14 (4,4)     15 (4,5)
    16 (4,6)     17 (4,7)     18 (5,5)     19 (5,6)

Table 1: Processor Assignment for Trench Loop Nest (4 processors)
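The assignment in Table 1 can be reproduced with a short sketch that combines the Trench transition function, the expression for K with lower bound J, and interleaved allocation. The function names are hypothetical, and the integer division in the outer bound is assumed to truncate as in Fortran.

```python
def trench_transition(j, n):
    """Closed form -j^2 + (N+1)j - N for the Trench loop nest."""
    return -j * j + (n + 1) * j - n


def trench_indices(jk, n):
    """Recover (J, K) from the zero-based coalesced index JK."""
    j = 2                                         # outer loop starts at J = 2
    while trench_transition(j, n) <= jk:
        j += 1
    k = jk - trench_transition(j - 1, n) + j      # inner lower bound is lb = J
    return j, k


def interleaved_assignment(n, num_pes):
    """Per-processor lists of (JK, (J, K)) under interleaved allocation."""
    j_max = (n - 1) // 2 + 1                      # truncating division
    total = trench_transition(j_max, n)           # iterations in the coalesced loop
    return [[(jk, trench_indices(jk, n)) for jk in range(pe, total, num_pes)]
            for pe in range(num_pes)]
```

Calling `interleaved_assignment(10, 4)` yields four columns of five entries each; the first column is 0 (2,2), 4 (2,6), 8 (3,3), 12 (3,7), 16 (4,6), matching the PE 0 column of the table.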
The approach for coalescing doubly-nested triangular loops can be applied successively to coalesce
multiply-nested loops. This procedure begins with the innermost loops and continues outward; for
example, with a triply-nested loop, the two inner loops are coalesced, followed by the outermost
loop and the new coalesced inner loop. The following multiply-nested rectangular loop will be
coalesced with the triangular loop coalescing techniques. This will allow a comparison between the
previous loop coalescing techniques for rectangular loops and the new, general approach described
in this work.

    DO 10 I = 1, IMAX
      DO 20 J = 1, JMAX
        DO 30 K = 1, KMAX

Coalescing the J and K loops yields the transition function

$$\iota_{JK}(j) = \sum_{J=1}^{j} KMAX = j \cdot KMAX$$

and the loops now have the form

    DO 10 I = 1, IMAX
      DO 20 JK = 0, JMAX*KMAX-1
        J = min { j : j*KMAX > JK }
        K = JK - (j-1)*KMAX + 1

Coalescing these two loops gives the transition function

$$\iota_{IJK}(i) = \sum_{I=1}^{i} (JMAX \cdot KMAX) = i \, (JMAX \cdot KMAX)$$

and the completely coalesced loop is

    DO 10 IJK = 0, IMAX*JMAX*KMAX-1
      I = min { i : i*(JMAX*KMAX) > IJK }
      JK = IJK - (i-1)*(JMAX*KMAX)
      J = min { j : j*KMAX > JK }
      K = JK - (j-1)*KMAX + 1

Unlike the other rectangular loop coalescing techniques, the new approach does not use integer
division. In the best case the I and J computations require a single compare operation each, and JK
and K computations require two integer multiplies, two subtractions, and one addition. However, on
average the I and J computations will require that $i$ and $j$ be incremented some average amount
until the inequality is satisfied.
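The division-free recovery can be sketched as below. The helper `smallest_multiple_above` and its `start` parameter are hypothetical: `start` models resuming the search from a previously found value, which is one way the average number of increments might be kept small when a processor's coalesced index only increases.

```python
def smallest_multiple_above(value, stride, start=1):
    """Smallest m >= start with m*stride > value, found by compare and increment."""
    m = start
    while m * stride <= value:
        m += 1
    return m


def rect_indices_no_div(ijk, jmax, kmax):
    """Recover one-based (I, J, K) from zero-based IJK without integer division."""
    plane = jmax * kmax
    i = smallest_multiple_above(ijk, plane)   # I = min { i : i*(JMAX*KMAX) > IJK }
    jk = ijk - (i - 1) * plane                # JK = IJK - (i-1)*(JMAX*KMAX)
    j = smallest_multiple_above(jk, kmax)     # J = min { j : j*KMAX > JK }
    k = jk - (j - 1) * kmax + 1               # K = JK - (j-1)*KMAX + 1
    return i, j, k
```

The result agrees with the original loop order for every coalesced index; only comparisons, multiplications, additions and subtractions are used.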
The best approach will depend on the availability of integer division in hardware and the relative
speed of integer division and multiplication, as well as the average increment per iteration in the
triangular approach. Recent processor architecture designs have reduced the amount of hardware
support for relatively infrequent operations such as division, and software support routines for
integer division are slow. One study found that a general purpose divide routine averaged 80 cycles
per divide operation [MaP88].
Notice that the need for multiplies and divides to compute indices for each iteration can in
general be eliminated by using consecutive allocation (mentioned in section 2) and replicating the
original looping control structure in the code for each process. This is discussed further in
[Pol88]. We stress the other techniques because they efficiently support arbitrary allocations
(including consecutive allocation); however, when consecutive allocation is appropriate, the use of
the original looping structure may be preferable.
4. Loop Scheduling and Synchronization on Barrier MIMD Architectures
In the previous section, a generalized technique for coalescing loops was described. In this
section loop coalescing is considered for static, decentralized scheduling of barrier MIMD
architectures. The approach taken will be similar to that for the FMP, except the compiler will
automatically construct the domain for a set of nested loops after the appropriate analysis has
been performed, and the domains are not restricted to rectangular shapes. In addition, the
instances of the coalesced loop may be synchronized as necessary by a barrier, so coalescing is not
restricted to loops without dependencies. Loop coalescing simplifies loop scheduling, since the
single dimension of the coalesced iteration space can be allocated evenly among the processors with
small scheduling overhead.
The basic properties of barrier MIMD architectures were mentioned in the introduction. They
include a fast hardware barrier synchronization mechanism that can be applied across any subset of
the processors. A barrier processor generates the proper sequence of barrier masks to ensure
correct sequencing and proper timing relationships between computational processors. It places the
barriers in a barrier synchronization buffer where they are matched against processors waiting at a
barrier, and then executed. A single WAIT line from each processor to the synchronization buffer is
used to indicate that a particular processor is participating in a barrier synchronization. Thus,
when scheduling a loop it is necessary to generate code for the computational processors to request
a barrier and for the barrier processor to generate the proper barrier masks in the correct order.
In addition, before execution of a coalesced loop on a barrier MIMD, the barrier processor must
broadcast the number of iterations in the coalesced loop and the number of processors executing the
loop. Loop iterations in the coalesced index set are assigned to the computational processors using
interleaved allocation, as in the FMP. This binding occurs at compile-time between loop iterations
and a virtual barrier MIMD machine; the binding between the virtual and actual barrier MIMD machine
occurs at run-time when the barrier processor broadcasts the number of iterations in the loop and
the number of processors in the actual machine (see footnote 5).
Data dependencies [ShF88], [Wol89] between loop iterations must be considered during coalescing,
and if such dependencies do exist then the resulting coalesced loop may require barrier
synchronization. If no dependencies exist between iterations, then no synchronization is required
and processors proceed to asynchronously execute the coalesced loop until all iterations are
computed. At this point, all processors barrier synchronize before continuing execution.
5. This approach also allows the machine to be partitioned so that independent loops (or programs) may be
executing simultaneously on different parts of the machine.
Loop transformations can be used to restructure a loop nest to provide different coalescing results
[Pol88]. For example, loop interchanging [Wol89] can be applied as necessary to move parallel loops
to the innermost nest levels [Pol88]. Alternatively, serial loops could be moved into the innermost
levels, with outer parallel loops coalesced around them. Loop distribution [KuM72], [Wol89] can
also be employed to transform loops into perfectly-nested form for coalescing. The best loop
structure for coalescing depends on several factors, including the necessity of balancing work
among processors to exploit as much parallelism as possible, the data dependence structure of the
nested loops, and run-time constraints such as data locality. One major difference between barrier
MIMDs and other MIMD machines is that barrier synchronization is very fast and efficient; also, the
static nature of scheduling the machine makes large variations in processor execution times
unlikely.
The order of loops before coalescing directly affects the allocation of loop iterations across the
processors as well as the number of barriers generated. Proper execution on a barrier MIMD imposes
certain constraints on this ordering. In particular, the innermost coalesced loops must not have
any dependencies across loop iterations. In this work, only the outermost coalesced loop is allowed
to have dependencies: the dependencies across this loop may require barriers for correct execution.
For example, consider the loop nest from Trench’s algorithm (figure 7): a dependence exists
between iterations (J,K) and (J+1,K+1), which will be represented as the dependence vector
$d = [1 \; 1]$ [ShF88]. This dependence vector can be seen in figure 7. From the figure, it is
clear that all iterations with $J = b$, where $b$ is a constant, can be executed in parallel, i.e.,
the K loop may be executed in parallel; the J loop is executed serially, and barriers can be used
to enforce this ordering. The basic idea is to determine a schedule $\sigma(J,K) = \Pi(J,K) + c$
that is a linear function of the loop indices so that iterations executing in parallel have
$\sigma(J,K) = d$, where $d$ is a constant.
The difference between schedule values for consecutive iterations executed on a single processor
determines how many barriers that processor should execute. Table 2 shows how this approach
generates barriers to enforce the proper execution order for the loop nest from Trench’s
algorithm. Barriers are represented as horizontal lines in the table. The linear schedule for this
example is $\sigma(J,K) = J - 2$.
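The rule that schedule differences determine barrier counts can be sketched directly. The sketch assumes the schedule J - 2 of this example and an initial schedule value of zero; the function names are hypothetical.

```python
def schedule(j, k):
    """Simple linear schedule for the Trench loop nest: wavefront index J - 2."""
    return j - 2


def barriers_before_each(iterations):
    """Barriers one processor executes before each of its (J, K) iterations.

    `iterations` lists the processor's iterations in execution order.
    """
    counts = []
    previous = 0                          # schedule value of the first wavefront
    for j, k in iterations:
        current = schedule(j, k)
        counts.append(current - previous)
        previous = current
    return counts
```

For the PE 0 column of Table 1, (2,2), (2,6), (3,3), (3,7), (4,6), the counts are 0, 0, 1, 0, 1: the processor crosses a barrier only when its next iteration lies in a later wavefront.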

    PE 0         PE 1         PE 2         PE 3
     0 (2,2)      1 (2,3)      2 (2,4)      3 (2,5)
     4 (2,6)      5 (2,7)      6 (2,8)      7 (2,9)
    --------------------------------------------------
     8 (3,3)      9 (3,4)     10 (3,5)     11 (3,6)
    12 (3,7)     13 (3,8)
    --------------------------------------------------
    16 (4,6)     17 (4,7)     14 (4,4)     15 (4,5)
    --------------------------------------------------
                              18 (5,5)     19 (5,6)

Table 2: Proper Execution Ordering Enforced by Barriers.
Schedules that are linear functions of the loop indices are referred to as linear schedules
[ShF88]. These schedules are related to the well-known wavefront method [KuM72], [Kuh80] but are
generalized in the sense that coefficients of the linear schedule function are not restricted to
integers and may be rational numbers [ShF88]. It will be shown in this work that the wavefronts
(called hyperplanes in [ShF88]) in a linear schedule can be implemented directly by barriers. This
work will be concerned primarily with simple linear schedules that are functions of a single index
variable, although more general linear schedules are briefly considered. The schedule proposed for
the loop nest of Trench’s algorithm was a simple linear schedule. The wavefronts generated by this
schedule can be seen in figure 8.
Linear schedules have many advantages. In a classic paper [KaM67], Karp, Miller and Winograd
proved that, under certain conditions (uniform data dependencies and unit-time computations), the
difference between the execution time of an optimal linear schedule and that of the free or
dataflow schedule, which executes a computation as soon as its operands are available, is bounded
by a constant (see footnote 6). Hence, a good linear schedule should be able to exploit most of the
parallelism within a loop (or set of nested loops). Simple linear schedules have a straightforward
interpretation in terms of nested loops. The outermost loop corresponds to the wavefront direction;
the simple linear schedule is a function of the outermost loop index, as in the example loop nest
from Trench’s algorithm. The barriers that enforce the wavefront order are, in effect, enforcing
the serial order inherent in the outermost loop.
Figure 8: Wavefronts for Loop Nest from Trench’s Algorithm (N=10).
The algorithm for generating barriers for simple linear schedules is now described. Each computa­
tional processor executes this algorithm to generate a proper sequence of barriers to correctly implement 
the simple lirieaf schedule.
Algorithm: Barrier Generation
The wavefront index $\omega$ is generated from the linear schedule function $\sigma(\mathbf{I})$, where
$\mathbf{I} = (I_1, I_2, \ldots, I_n)$ are the n indices of the original nested loops that have been
coalesced. The wavefront index represents the wavefront in which iteration $\mathbf{I}$ is executed. Let
p be the processor id number, P the number of processors executing the schedule, and let N be the number
of iterations in the coalesced loop. I represents the current iteration (the coalesced index) being
executed by processor p. The procedure is:
[1] [Initialize.] $\omega_0 \leftarrow 0$, $I \leftarrow p$, done $\leftarrow$ FALSE.
[2] [Generate indices.] Compute the indices $\mathbf{I}$ from the coalesced index I. (As described in the
previous section.)
[3] [Calculate wavefront index.] $\omega \leftarrow \sigma(\mathbf{I})$, $\beta \leftarrow \omega - \omega_0$.
($\beta$ gives the number of barriers before execution of iteration I.)
[4] [Generate barriers.] Execute $\beta$ barrier waits before executing iteration I.
[5] [Check for completion.] If done = TRUE, execute one more barrier and then terminate the
algorithm.
[6] [Set up for next iteration.] $\omega_0 \leftarrow \omega$, $I \leftarrow I + P$.
[7] [Check for last iteration.] If $I > N$, then $I \leftarrow N-1$, done $\leftarrow$ TRUE.
[8] Go to [2].
The barrier processor must generate $\sigma(\mathbf{I}_{max}) - \sigma(\mathbf{I}_{min})$ barrier masks,
where $\mathbf{I}_{max}$ and $\mathbf{I}_{min}$ are the maximum and minimum points for the linear schedule
$\sigma$. Note that in this algorithm, the barrier mask includes all P processors, so the capability to
barrier synchronize subsets of the processors is unused. More sophisticated algorithms could be developed
to avoid this. In step [6], the next iteration to execute is obtained as $I \leftarrow I + P$, yielding an
interleaved allocation of iterations. Consecutive allocation, as used in guided self-scheduling, is also
possible with minor modifications of the algorithm.
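The barrier generation procedure can be traced with a small simulation. The Python below is a minimal sketch under stated assumptions, not the authors' implementation: iterations are numbered 0..N-1, so the completion test is written as I >= N; the iteration body is taken to run between steps [4] and [5]; and the names barrier_events, indices_of and sigma are hypothetical.

```python
def barrier_events(p, P, N, indices_of, sigma):
    """Trace steps [1]-[8] for processor p as a list of events.

    Assumptions (not from the text): iterations are numbered 0..N-1
    and P <= N, so every processor has at least one iteration.
    """
    assert 0 <= p < P <= N
    events = []
    omega0, i, done = 0, p, False              # [1] initialize
    while True:
        idx = indices_of(i)                    # [2] original loop indices
        omega = sigma(idx)                     # [3] wavefront index
        beta = omega - omega0                  #     barriers owed before iteration i
        events.extend(["barrier"] * beta)      # [4] wait beta times
        if done:                               # [5] one closing barrier, then stop
            events.append("barrier")
            return events
        events.append(("exec", i))             # iteration i runs in wavefront omega
        omega0, i = omega, i + P               # [6] interleaved allocation
        if i >= N:                             # [7] out of work: aim at last iteration
            i, done = N - 1, True


# Example: 3 x 4 rectangular nest, outer loop serial, inner loop parallel.
ROWS, COLS = 3, 4


def indices_of(k):
    """Rectangular coalesced index -> (outer, inner)."""
    return k // COLS, k % COLS


def sigma(idx):
    """Simple linear schedule: the wavefront is the outer index."""
    return idx[0]


traces = [barrier_events(p, 3, ROWS * COLS, indices_of, sigma) for p in range(3)]
for p, trace in enumerate(traces):
    print(p, trace)
```

In this sketch every processor executes the same number of barrier waits, which is why a single sequence of all-processor barrier masks is enough to drive the whole schedule.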
Several examples of loop scheduling for barrier MIMDs will now be given to clarify and expand the
ideas in this section. The first example code, given in figure 9, solves a lower triangular system of
equations using forward elimination [GoV83].
      DO 10 I = 1, N
100      Y(I) = B(I)
         DO 20 J = 1, I-1
200         Y(I) = Y(I) - L(I,J)*Y(J)
20       CONTINUE
300      Y(I) = Y(I)/L(I,I)
10    CONTINUE
Figure 9: Forward Elimination for Triangular System Solution.
Statement 100 can be distributed out of the I loop, and the I and J loops interchanged, bringing the
loop nest into the form shown in figure 10.










      DO 10 I1 = 1, N
         Y(I1) = B(I1)
10    CONTINUE
      DO 20 J = 1, N
         Y(J) = Y(J)/L(J,J)
         DO I = J+1, N
            Y(I) = Y(I) - L(I,J)*Y(J)
         CONTINUE
20    CONTINUE
Figure 10: Restructured Loop before Coalescing.
The I1 loop may be executed in parallel, with a barrier separating loops I1 and J. Loops J and I
can now be coalesced. Statement 300 can be moved into the inner I loop at the cost of computational
redundancy; alternatively, this statement could be executed conditionally within the inner loop, depending
on the generated index values [Pol88]. The right approach depends partly upon the ability of the machine
to quickly broadcast values from one processor to all others; if this capability is missing, then it may pay
to compute the value locally in each processor.






which reduces to $\lambda(j) = jN - j(j+1)/2$, $1 \le j \le N$. The original indices can be generated
from the coalesced index $JI$ as follows:
$J = \min \{ j : \lambda(j) > JI \}$
and
$I = JI - \lambda(J-1) + (J+1)$
This transition function is rather complex, but there are several approaches to reducing the overhead in 
computing it. One obvious solution is to have an independent integer function unit dedicated to comput­
ing the transition function in parallel with the loop body computations. Note that the transition function
can be computed in advance for increasing values of J  as it is independent of the loop body. This 
“ look-ahead” approach to computing the transition function could also be used to fill gaps in computa­
tion while a processor is waiting to barrier synchronize with other processors. Thus, it appears that the 
transition function overhead can be masked quite effectively.
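The transition function for the triangular loop can be checked against the nested loop it replaces. The following Python is a minimal sketch, assuming the coalesced index JI is numbered from 0 (which is what the strict inequality in the definition of J requires); the function names lam, triangular_indices and interleaved_indices are hypothetical. The generator shows the look-ahead idea: because a processor visits increasing values of JI, the search for J never has to restart.

```python
def lam(j, n):
    """Number of coalesced iterations contributed by outer indices 1..j."""
    return j * n - j * (j + 1) // 2


def triangular_indices(ji, n):
    """Recover (J, I) with 1 <= J < I <= N from the 0-based coalesced index."""
    if not 0 <= ji < lam(n, n):
        raise ValueError("coalesced index out of range")
    j = 1
    while lam(j, n) <= ji:          # J = min { j : lam(j) > JI }
        j += 1
    return j, ji - lam(j - 1, n) + (j + 1)


def interleaved_indices(n, p, num_procs):
    """Yield (JI, (J, I)) for processor p, advancing J monotonically."""
    j = 1
    for ji in range(p, lam(n, n), num_procs):
        while lam(j, n) <= ji:      # resumes from the previous J (look-ahead)
            j += 1
        yield ji, (j, ji - lam(j - 1, n) + (j + 1))


N = 5
coalesced = [triangular_indices(ji, N) for ji in range(lam(N, N))]
nested = [(j, i) for j in range(1, N + 1) for i in range(j + 1, N + 1)]
print(coalesced == nested)
print(list(interleaved_indices(N, 1, 4)))
```

The linear search for J is only one possible realization; a closed-form or table-driven evaluation would serve equally well.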
Coalescing loops J and I yields
      DO 10 I1 = 1, N
100      Y(I1) = B(I1)
10    CONTINUE
      DO 20 JI = 0, l(N) - 1
         J = min { j : l(j) > JI }
         I = JI - l(J-1) + (J+1)
200      Y(I) = Y(I) - L(I,J)*Y(J)
         IF (I .EQ. J+1) Y(I) = Y(I)/L(I,I)
20    CONTINUE
Figure 11: Restructured Loop after Coalescing.
The simple linear schedule for this coalesced loop is $\sigma(\mathbf{I}) = J - 1$. The barriers enforce the proper ordering
between successive column computations. Row computations are executed in parallel depending on the 
number of processors allocated at the end of the loop. Notice how the parallelism width of the forward 
elimination algorithm decreases monotonically as the algorithm moves down the columns of the matrix 
L. This is quite common for such triangular loop structures. The barrier processor can tune the processor 
allocation for the coalesced loop by separating the computation into phases: as the parallelism width goes 
down (or up) for each phase, fewer (or more) processors can be allocated by the barrier processor for the 
current phase.
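The wavefront structure of the coalesced forward elimination loop can be illustrated by executing it one column at a time. The Python below is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the matrix is stored 0-based while J and I keep their 1-based meaning from the text, and Y(1) is assumed to be divided by L(1,1) before the coalesced loop starts, since the coalesced loop itself only scales Y(2) through Y(N).

```python
def lam(j, n):
    """Coalesced iterations contributed by columns 1..j."""
    return j * n - j * (j + 1) // 2


def forward_elimination_wavefronts(L, b):
    """Solve L y = b by running the coalesced loop wavefront by wavefront.

    Returns the solution and the parallelism width of each wavefront.
    """
    n = len(b)
    y = [float(v) for v in b]        # the I1 loop: Y(I1) = B(I1)
    y[0] /= L[0][0]                  # assumption: Y(1) is scaled up front
    widths = []
    for J in range(1, n):            # wavefront J - 1; a barrier follows each one
        first, last = lam(J - 1, n), lam(J, n)
        widths.append(last - first)
        for JI in range(first, last):            # independent within a wavefront
            I = JI - lam(J - 1, n) + (J + 1)
            y[I - 1] -= L[I - 1][J - 1] * y[J - 1]
            if I == J + 1:                       # Y(J+1) is now complete
                y[I - 1] /= L[I - 1][I - 1]
    return y, widths


L = [[2, 0, 0],
     [1, 1, 0],
     [4, 2, 5]]
b = [2, 3, 23]
print(forward_elimination_wavefronts(L, b))   # ([1.0, 2.0, 3.0], [2, 1])
```

The returned widths shrink by one per wavefront, which is the monotonically decreasing parallelism that a phase-based processor allocation would track.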
The next example considered is Gaussian elimination [GoV83]. The innermost loops for Gaussian
elimination, labeled 20 and 30 in figure 12, can be coalesced and scheduled effectively on a barrier
MIMD.




      DO 40 K = 1, N-1
C
C        code for partial (or complete) pivoting elided
C
         DO 10 P = K+1, N
            W(P) = A(K,P)
10       CONTINUE
         DO 30 I = K+1, N
14          COEF = A(I,K)/A(K,K)
16          A(I,K) = COEF
            DO 20 J = K+1, N
               A(I,J) = A(I,J) - COEF*W(J)
20          CONTINUE
30       CONTINUE
40    CONTINUE
Figure 12: Original Code for Gaussian Elimination [GoV83].
Statements 14 and 16 can be distributed out of the I loop; since the range of the resulting loop matches
that of the DO loop labeled 10 and no dependencies exist between these loops, they can be fused
[Wol89]. The resulting code is shown in figure 13.




      DO 40 K = 1, N-1
C
C        code for partial (or complete) pivoting elided
C
         DO 10 P = K+1, N
            W(P) = A(K,P)
14          COEF = A(P,K)/A(K,K)
            A(P,K) = COEF
10       CONTINUE
         DO 30 I = K+1, N
            DO 20 J = K+1, N
               A(I,J) = A(I,J) - COEF*W(J)
20          CONTINUE
30       CONTINUE
40    CONTINUE
Figure 13: Restructured Code for Gaussian Elimination.
The P loop may be executed in parallel; of course, since it is a single loop, no coalescing is necessary.
The restructured code after coalescing inner loops I and J into index IJ is shown in figure 14 (see
footnote 7). A barrier is required between the P loop and the coalesced IJ loop, and after the IJ loop to
enforce the proper ordering inherent in the outer loop, which is executed serially. Clearly,
$\lambda(N,K) = (N-K)^2$ and the functions to generate the original indices from IJ are
$I = (IJ \;\mathrm{div}\; (N-K)) + (K+1)$
and
$J = (IJ \;\mathrm{mod}\; (N-K)) + (K+1)$.
Since both the I and J loops may be executed in parallel, there is no need to generate barriers to
enforce a proper ordering between iterations of the coalesced loop. The code for partial or complete
pivoting, if it were included in the example, could be parallelized like the P loop. As with the forward
elimination example, the barrier processor could tune the processor allocation to adapt to the
monotonically decreasing parallelism as K increases.
7. In this rectangular loop example, the transition function has been replaced by div and mod operations.
      DO 40 K = 1, N-1
C
C        code for partial (or complete) pivoting elided
C
         DO 10 P = K+1, N
            W(P) = A(K,P)
14          COEF = A(P,K)/A(K,K)
16          A(P,K) = COEF
10       CONTINUE
         DO 30 IJ = 0, (N-K)**2 - 1
            I = (IJ div (N-K)) + (K+1)
            J = (IJ mod (N-K)) + (K+1)
            A(I,J) = A(I,J) - COEF*W(J)
30       CONTINUE
40    CONTINUE
Figure 14: Restructured Code for Gaussian Elimination.
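The div and mod index recovery used in the coalesced IJ loop can be checked directly against the original I and J loops. The Python below is a minimal sketch; rectangular_indices is a hypothetical name, and the values of N and K are arbitrary illustration values.

```python
def rectangular_indices(ij, n, k):
    """Recover (I, J) with K+1 <= I, J <= N from IJ in 0 .. (N-K)**2 - 1."""
    width = n - k                      # iterations in each of the I and J loops
    quotient, remainder = divmod(ij, width)
    return quotient + (k + 1), remainder + (k + 1)


N, K = 6, 2
recovered = [rectangular_indices(ij, N, K) for ij in range((N - K) ** 2)]
nested = [(i, j) for i in range(K + 1, N + 1) for j in range(K + 1, N + 1)]
print(recovered == nested)
```

Because the divisor N-K changes with K, it has to be treated as a run-time value inside the serial K loop rather than as a compile-time constant.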
The next example considers executing a non-simple linear schedule on a barrier MIMD. This and
the following example show how linear schedules can exploit the maximum parallelism in nested loop 
structures by considering all loops simultaneously. The loop given in figure 15 implements a four-point 
difference problem: notice that the dependencies, shown in figure 16, preclude parallelizing either the I  
or J loop directly. This example is taken from [Wol89].
      DO 20 I = 2, N-1
         DO 10 J = 2, N-1
            A(I,J) = (A(I-1,J) + A(I+1,J)
     +                + A(I,J-1) + A(I,J+1))/4
10       CONTINUE
20    CONTINUE
Figure 15: Four-point difference problem.
The wavefront technique [KuM72], [Kuh80] was originally developed to exploit the parallelism in
such loops. These loops can be coalesced and executed with a linear schedule. Loop structures such as the
four-point difference problem show that parallelism in some nested loops is not inherent in one or the
other loop, but can be extracted by considering both loops simultaneously. This simultaneous approach is
natural when loop coalescing is combined with linear schedules.
A valid linear schedule for executing the loops in figure 15 is $\sigma(I,J) = I + J - 4$; the wavefronts for
this schedule correspond to the dashed lines in figure 16. Note that this is not a simple linear schedule,
since $\sigma$ is a function of both I and J. However, in this example, the coalesced loop can be executed on
a barrier MIMD if the number of processors allocated is equal to N-2. The barrier generation algorithm
will still work properly for some linear schedules if the number of processors is restricted to the range of
the innermost coalesced loop. The exact conditions under which it may still be applied are a current research
problem. Table 3 shows the allocation of the iterations of the coalesced loop for four processors (N=6).
Figure 16: Index Space and Wavefronts for 4-Point Problem (N=6).
   PE 0        PE 1        PE 2        PE 3
 0 (2,2)
 4 (2,3)     1 (3,2)
 8 (2,4)     5 (3,3)     2 (4,2)
12 (2,5)     9 (3,4)     6 (4,3)     3 (5,2)
            13 (3,5)    10 (4,4)     7 (5,3)
                        14 (4,5)    11 (5,4)
                                    15 (5,5)
Table 3: Proper Execution Ordering Enforced by Barriers.
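The allocation in Table 3 can be regenerated from the schedule and the interleaved assignment of coalesced iterations to processors. The Python below is a sketch with assumptions stated: the function name allocation_table is hypothetical, and the index order (I varying fastest within the coalesced index) is read off the entries of Table 3 rather than stated in the text.

```python
def allocation_table(n, num_pe):
    """Rows are wavefronts, columns are processors; cells hold (IJ, (I, J))."""
    width = n - 2                                  # each loop runs over 2 .. N-1
    assert num_pe == width, "the text allocates exactly N-2 processors"
    rows = {}
    for ij in range(width * width):
        i, j = ij % width + 2, ij // width + 2     # order inferred from Table 3
        wave = i + j - 4                           # sigma(I, J) = I + J - 4
        rows.setdefault(wave, [None] * num_pe)[ij % num_pe] = (ij, (i, j))
    return [rows[w] for w in sorted(rows)]


for row in allocation_table(6, 4):
    print(row)
```

Each printed row corresponds to one row of the table, and consecutive rows are separated by one barrier.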
Another example of the interaction between loops that affects the amount of exploitable parallelism
can be seen in the loop nest of figure 17. This loop nest also provides an example of a rational linear
schedule [ShF88], [ShO90].
      DO 20 I = 1, N
         DO 10 J = 1, M
            A(I,J) = A(I-2,J) + B(I,J)*C(I) + D(J,I)
10       CONTINUE
20    CONTINUE
Figure 17: Loop Nest with a Rational Linear Schedule.
The J loop can be executed in parallel, but a dependence along the I loop prevents its parallelization;
notice, however, that the dependence distance in the I loop is 2. This means that two wavefronts along
the J loop can be executed in parallel: this can be realized with the rational linear schedule
$\sigma(I,J) = (I-1)/2$. The loop is coalesced in the normal manner and the resulting schedule is, in fact,
optimal in terms of execution time. In this example, a rational linear schedule yields twice as much
parallelism compared to parallelizing the J loop alone. A detailed discussion of issues related to optimal
linear schedules and loop scheduling can be found in [ShO90].
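The effect of the rational schedule can be seen by grouping the rows of the I loop into wavefronts. The Python below is a minimal sketch which assumes that the rational value (I-1)/2 is truncated to an integer wavefront index; that rounding rule and the name rational_wavefront are assumptions made for illustration.

```python
def rational_wavefront(i):
    """Wavefront of row I under sigma = (I - 1)/2, truncated (assumption)."""
    return (i - 1) // 2


N = 8
waves = {}
for i in range(1, N + 1):
    waves.setdefault(rational_wavefront(i), []).append(i)

# A(I,J) reads A(I-2,J), so row I must run in a later wavefront than row I-2.
respects_dependence = all(
    rational_wavefront(i) > rational_wavefront(i - 2) for i in range(3, N + 1)
)
print(waves)
print(respects_dependence)
```

Two rows of the I loop share each wavefront, so N rows need only N/2 wavefronts while the distance-2 dependence is still honored.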
The following example provides insight on a subtle problem in exploiting loop parallelism and how 
loop coalescing can help solve the problem. The innermost loop nest for Cholesky decomposition 
([GoV83], p. 89) is shown in figure 18.
      DO 40 K = 1, N
         temp = 0.0
         DO 10 P = 1, K-1
10          temp = temp + G(K,P)*G(K,P)
         G(K,K) = sqrt(A(K,K) - temp)
         DO 30 I = K+1, N
            temp = 0.0
            DO 20 P = 1, K-1
20             temp = temp + G(I,P)*G(K,P)
30          G(I,K) = (A(I,K) - temp)/G(K,K)
40    CONTINUE
Figure 18: Code for Cholesky Decomposition [GoV83].
Notice how the loop limits for the inner P and I loops (labeled 20 and 30) vary with K. For
small values of K, most of the parallelism resides in the I loop since the P loop range is small; however,
the situation changes as K approaches N, where the I loop range becomes small and the P range
large. Parallelism exists in both loops (see footnote 8) but shifts from the I loop to the P loop as K moves
through its range. Since loop interchange is possible, it is difficult to decide which loop should be
parallelized for a machine that supports a single level of loop parallelism. If coalescing is applied to these
loops, the parallelism inherent in the loop structure can be exploited more effectively, since it would be
inherent in the single coalesced loop.
As another example of the difficulty in effective loop parallelization, consider again the loop nest
from Trench’s algorithm, given in figure 6. Now assume that, instead of the loop bounds given in the
figure, the bounds for J are 1..N and for K are 1..M. Given the dependence structure, a linear schedule
can exploit parallelism in one or the other loop; the loop with maximum parallelism depends on the
relative values of M and N (footnote 9). The appropriate test can be executed at run-time to determine
which loop should be parallelized: with loop coalescing, the div and mod parameter can be a variable set
according to the results of this test. The result is a very efficient technique to statically generate the
proper run-time test to exploit the maximum parallelism possible.
8. The parallelism in the P loop must be realized through an associative reduction [Wol89].
5. Conclusions
In this work, loop coalescing has been extended to apply to triangular nested loops. A new approach 
has been proposed for coalescing rectangular loops that is more efficient than current techniques. The new 
loop coalescing techniques, combined with some familiar loop transformations for parallelization, have 
been applied to the problem of scheduling nested loop structures on barrier MIMD architectures. Simple 
linear schedules have been shown to be an effective paradigm for efficiently exploiting the parallelism in 
nested loops. These schedules can also quite easily take advantage of parallelism that is inherent in the 
interaction between nested loops. Loop coalescing also has advantages in parallelizing loop structures 
where the parallelism shifts from one loop to another during execution, and where simple tests at run-time 
can determine the best loop to parallelize.
Future research efforts include extending the barrier generation algorithm so that it can be applied to
linear schedules in general. Current work also includes a prototype compiler that will implement several 
of the transformations described in this work.





H. G. Dietz and T. Schwederski, “Extending Static Synchronization Beyond SIMD and VLIW,”
Tech. Report TR-EE 88-25, Purdue University, School of Electrical Engineering, June 1988.
[DoM79]
J. J. Dongarra, C. B. Moler, J. R. Bunch, and G. W. Stewart, LINPACK Users’ Guide, SIAM:
Philadelphia, 1979.
[GoV83]
G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University: Baltimore,
1983.
[KaM67]
R. M. Karp, R. E. Miller and S. Winograd, “The Organization of Computations for Uniform
Recurrence Equations,” Journal of the ACM, vol. 14, no. 3, pp. 563-590, July 1967.
[KrW85]
C. P. Kruskal and A. Weiss, “Allocating Independent Subtasks on Parallel Processors,” IEEE
Trans. Software Eng., vol. SE-11, no. 10, pp. 1001-1016, October 1985.
[KuM72]
D. J. Kuck, Y. Muraoka, and S. C. Chen, “On the Number of Operations Simultaneously
Executable in Fortran-Like Programs and Their Resulting Speedup,” IEEE Trans. Comput., vol. C-21, no.
12, pp. 1293-1310.
[Kuh80]
R. H. Kuhn, Optimization and Interconnection Complexity for Parallel Processors, Single-Stage
Networks, and Decision Trees. Ph.D. dissertation, Dept. of Comp. Science, U. of Illinois at
Urbana-Champaign, February 1980.
[LuB80]
S. F. Lundstrom and G. H. Barnes, “A Controllable MIMD Architecture,” Proc. Int. Conf. on
Parallel Processing, pp. 19-27, 1980.
[Lun87]
S. F. Lundstrom, “ Applications Considerations in the System Design of Highly Concurrent Mul­




D. J. Magenheimer, L. Peters, K. W. Pettis, and D. Zuras, “Integer Multiplication and Division on
the HP Precision Architecture,” IEEE Trans. Comput., vol. C-37, no. 8, pp. 980-990, August 1988.
[OKD90]
M. T. O’Keefe and H. G. Dietz, “Hardware Barrier Synchronization: Static Barrier MIMD,” to
appear, 1990 Int. Conf. on Parallel Processing, St. Charles, IL.
[Pol88]
C. D. Polychronopoulos, Parallel Programming and Compilers, Kluwer Academic Publishers,
Boston, 1988.
[ShF88]
W. Shang and J. A. B. Fortes, “Time Optimal Linear Schedules for Algorithms with Uniform
Dependencies,” Proceedings of Int’l Conf. on Systolic Arrays, May 1988, pp. 393-402 (also to
appear in IEEE Trans. on Computers).
[ShO90]
W. Shang, M. T. O’Keefe, and J. A. B. Fortes, “On Optimal, Generalized Cycle Shrinking,”
Technical Report, School of Electrical Engineering, Purdue University, in preparation (May 1990).
[Wol86]
M. J. Wolfe, “Loop Skewing: The Wavefront Method Revisited,” Int. Jour. of Parallel
Programming, vol. 15, no. 4, 1986.
[Wol89]
M. J. Wolfe, Optimizing Supercompilers for Supercomputers, MIT Press: Cambridge, MA, 1989.
[ZaD90]
A. Zaafrani, H. G. Dietz, and M. T. O’Keefe, “Static Scheduling for Barrier MIMD Architectures,”
to appear, 1990 Int. Conf. on Parallel Processing, St. Charles, IL.
