Swing modulo scheduling: a lifetime-sensitive approach by Llosa Espuny, José Francisco et al.
Swing Modulo Scheduling: A Lifetime-Sensitive Approach 
Josep Llosa, Antonio González, Eduard Ayguadé, and Mateo Valero
Universitat Politècnica de Catalunya
Departament d'Arquitectura de Computadors, Barcelona (Spain)
Email: {josepll,antonio,eduard,mateo}@ac.upc.es
Abstract 
This paper presents a novel software pipelining approach, 
which is called Swing Modulo Scheduling (SMS). It 
generates schedules that are near optimal in terms of 
initiation interval, register requirements and stage count. 
Swing Modulo Scheduling is a heuristic approach that has
a low computational cost. The paper describes the 
technique and evaluates it for the Perfect Club benchmark 
suite. SMS is compared with other heuristic methods 
showing that it outperforms them in terms of the quality of 
the obtained schedules and compilation time. SMS is also 
compared with an integer linear programming approach 
that generates optimum schedules but with a huge 
computational cost, which makes it feasible only for very 
small loops. For a set of small loops, SMS obtained the 
optimum initiation interval in all the cases and its schedules 
required only 5% more registers and a 1% higher stage 
count than the optimum. 
Keywords: Fine Grain Parallelism, Instruction 
Scheduling, Loop Scheduling, Software Pipelining, 
Register Requirements, VLIW and Superscalar 
Architectures. 
1. Introduction 
Software pipelining [5] is an instruction scheduling 
technique that exploits instruction level parallelism out of 
loops by overlapping successive iterations of the loop and 
executing them in parallel. The key idea is to find a pattern 
of operations (named the kernel code) so that when 
repeatedly iterating over this pattern, it produces the effect 
that an iteration is initiated before the previous ones have 
completed. 
The drawback of aggressive scheduling techniques, such 
as software pipelining, is their high register pressure. The 
register requirements increase as the concurrency increases 
[18,16], due to either machines with deeper pipelines, or 
wider issue, or a combination of both. Registers, like 
functional units, are a limited resource. Therefore, if a 
schedule requires more registers than available, some
actions, such as adding spill code, have to be performed.
The addition of spill code can degrade performance [16]
due to additional cycles in the schedule, or due to memory
interferences.
1089-795X/96 $5.00 © 1996 IEEE
Proceedings of PACT'96
Some research groups have targeted their work towards 
exact methods based on integer linear programming. For
instance, the proposal in [11] searches the entire scheduling
space to find the optimal resource-constrained schedule 
with minimum buffer requirements, while the proposals in 
[10,6] find schedules with the actual minimum register 
requirements. The task of generating an optimal (in terms 
of throughput and register requirements) resource- 
constrained schedule for loops is known to be NP-hard. All 
these exact approaches require a prohibitive time to 
construct the schedules and therefore their applicability is 
restricted to very small loops. Therefore, any practical 
algorithm must use some heuristics to guide the scheduling 
process. Some of the proposals in the literature only care 
about achieving high throughput [21,14,13,24,8,20] while 
other proposals have also been targeted towards 
minimizing the register requirements [9,12,17], which 
result in more effective schedules. 
Stage Scheduling [9] is not a whole modulo scheduler by 
itself but a set of heuristics targeted to reduce the register 
requirements of any given modulo schedule. This objective 
is achieved by moving operations in the schedule. The 
resulting schedule has the same throughput but lower 
register requirements. Unfortunately, there are constraints
on the movement of operations that might lead to
suboptimal reductions of the register requirements.
Slack Scheduling [12] is a heuristic technique that
simultaneously schedules some operations late and other 
operations early with the aim of reducing the register 
requirements and achieving maximum execution rate. The 
algorithm integrates recurrence constraints and critical- 
path considerations in order to decide when each operation 
is scheduled. The algorithm is based on Iterative Modulo 
Scheduling [8,20] in the sense that it may result in ejecting 
operations already scheduled to give place to a new one 
(sort of controlled backtracking). 
Hypernode Reduction Modulo Scheduling (HRMS) [17]
is a heuristic strategy that tries to shorten loop variant
lifetimes, without sacrificing performance. The main part
of HRMS is the ordering strategy. The ordering phase
orders the nodes before scheduling them, so that only
predecessors or successors of a node can be scheduled
before it is scheduled (except for recurrences). During the
scheduling step the nodes are scheduled as soon/late as
possible, if predecessors/successors have been previously
scheduled. The effectiveness of their proposal is compared
in terms of achieved throughput and compilation time
against other heuristic methods [12,24] showing a better
performance. The main drawback of the HRMS heuristic
proposed to order the nodes is that it does not take into
account that nodes are more critical in the scheduling
process if they belong to a more critical path of the graph.
In this paper we present a novel ordering strategy, Swing 
Modulo Scheduling (SMS), that considers latencies to 
decide how critical the nodes are. It is a heuristic
technique that has a low computational cost (e.g., 
compiling all the innermost loops without conditional exits 
and procedure calls of the Perfect Club takes less than half 
a minute) while it produces schedules very close to those 
generated by optimal approaches based on exhaustive 
search which have a prohibitive computational cost for real 
programs. 
The rest of the paper is organized as follows. Section 2 
overviews the main concepts related to software
pipelining. Section 3 discusses an example to motivate our 
proposal, which is formalized in Section 4. Section 5 shows 
the main results of our experimental evaluation of the 
schedules generated by SMS. It is also compared with the 
schedules generated by other heuristic approaches and the 
optimal ones. The main concluding remarks are given in 
Section 6. 
2. Overview of Software Pipelining 
In a software pipelined loop, the schedule for an iteration is 
divided into stages so that the execution of consecutive 
iterations which are in distinct stages is overlapped. The 
number of stages in one iteration is termed stage count 
(SC). The number of cycles between the initiation of
successive iterations (i.e. the number of cycles per stage) in
a software pipelined loop is termed the Initiation Interval
(II) [21].
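As a small illustration of these two definitions, the following Python sketch maps a cycle of the one-iteration schedule onto its stage and its row in the kernel for a given II. The function names `fold` and `stage_count` are ours and do not appear in the paper, and the schedule is assumed to start at cycle 0.

```python
def fold(cycle, ii):
    """Map a cycle of the one-iteration schedule to (stage, kernel row)."""
    return cycle // ii, cycle % ii


def stage_count(cycles, ii):
    """Number of stages spanned by a schedule whose first cycle is 0."""
    return max(cycles) // ii + 1


# An operation issued at cycle 11 with II = 4 lands in row 3 of stage 2.
stage, row = fold(11, 4)
print(stage, row)                      # 2 3
print(stage_count([0, 2, 4, 11], 4))   # 3
```

With II = 4, a one-iteration schedule that spans cycles 0 to 11 therefore has three stages, and the kernel has four rows.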
The Initiation Interval II between two successive
iterations is bounded by both recurrence circuits in the
graph (RecMII) and resource constraints of the architecture
(ResMII). This lower bound on the II is termed the
Minimum Initiation Interval (MII = max(RecMII, ResMII)).
The reader is referred to [8,20] for an extensive discussion
of how to calculate ResMII and RecMII.
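The resource bound can be sketched in a few lines of Python. This is only an approximation under stated assumptions: every functional unit is fully pipelined and accepts one operation per cycle, and each operation uses exactly one unit class. The operation mix below is hypothetical, and RecMII is passed in rather than computed (see [8,20] for the full procedure).

```python
from math import ceil


def res_mii(op_counts, unit_counts):
    """Resource bound: the most heavily used unit class decides.

    Assumes fully pipelined units that accept one operation per cycle.
    """
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)


def mii(rec_mii, op_counts, unit_counts):
    """MII = max(RecMII, ResMII); rec_mii is supplied by the caller."""
    return max(rec_mii, res_mii(op_counts, unit_counts))


# Hypothetical loop body: the operation mix is chosen for illustration only.
ops = {"add": 3, "mul": 4, "mem": 5}
units = {"add": 1, "mul": 1, "mem": 2}
print(res_mii(ops, units))   # 4  (four multiplications on one multiplier)
print(mii(0, ops, units))    # 4  (no recurrences, so RecMII does not bind)
```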
Hardware configuration: 1 add unit, 1 mul unit, 2 load/store units.
Latencies: add 2 cycles, mul 2 cycles, load 2 cycles, store 1 cycle.
Figure 1: Dependence graph for the motivating example.
Values used in a loop correspond either to loop-invariant 
variables or to loop-variant variables. Loop-invariants are 
repeatedly used but never defined during loop execution. 
Loop-invariants have only one value for all iterations of the 
loop, therefore each one requires one register for all the 
execution of the loop regardless of the schedule and the 
machine configuration. 
For loop-variants, a value is generated in each iteration 
of the loop and, therefore, there is a different lifetime 
corresponding to each iteration. Because of the nature of 
software pipelining, lifetimes of values defined in an 
iteration can overlap with lifetimes of values defined in 
subsequent iterations. This is the main reason why the 
register requirements are increased. In addition, for values
with a lifetime larger than the II, new values are generated
before the previous ones are used. To fix this problem,
both software solutions (modulo variable expansion [15])
and hardware solutions (rotating register files [7]) have
been proposed.
Some of the software pipelining approaches can be 
regarded as the sequencing of two independent steps: node 
ordering and node scheduling. These two steps are 
performed assuming MII as the initial value for II. If it is
not possible to obtain a schedule with this II, the scheduling
step is performed again with an increased II. The next section
shows how the ordering step influences the register
requirements of the loop.
3. Motivating example 
Consider the dependence graph in Figure 1, and an 
architecture configuration with the pipelined functional 
units and latencies specified in the same figure. Since the
graph in Figure 1 has no recurrence circuits, its initiation
interval is constrained only by the available resources; in
this case, the resource that limits the MII is the multiplier
and the value is MII = 4/1 = 4.

Figure 2: Top-Down scheduling: a) Schedule of one
iteration, b) Lifetimes of variables, c) Kernel of the
scheduling, and d) Register requirements.

A possible approach to order the operations to be scheduled
would be to use a top-down strategy that gives priority to
operations in the critical path; with this ordering, nodes
would be scheduled in the following order: <n1, n2, n5, n8,
n9, n3, n10, n6, n4, n11, n12, n7>. Figure 2.a shows the
top-down schedule for one iteration and Figure 2.c the
kernel code (numbers in brackets represent the stage to
which the operation belongs). Figure 2.b shows the
lifetimes of loop variants. The lifetime of a loop variant
starts when the producer is issued and ends when the last
consumer is issued. Figure 2.d shows the register
requirements for this schedule; for each cycle it shows the
number of live values required by the schedule. The
number of registers required can be approximated by the
maximum number of simultaneously live values at any
cycle, which is called MaxLive (in [22] it is shown that
register allocation never requires more than MaxLive+1
registers). In Figure 2.d, MaxLive=11. Notice that with this
approach, variables generated by nodes n2 and n9 have an
unnecessarily large lifetime due to the early placement of the
corresponding operations in the schedule; as a
consequence, the register requirements for the loop
increase.
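The way MaxLive follows from the lifetimes can be sketched as below. The lifetimes are hypothetical (they are not those of Figure 2), and the sketch assumes that a value occupies a register from the cycle its producer is issued up to, but not including, the cycle its last consumer is issued. Because a new iteration starts every II cycles, each lifetime is folded onto the II rows of the kernel.

```python
def max_live(lifetimes, ii):
    """MaxLive of a modulo schedule.

    lifetimes: (start, end) cycles in the one-iteration schedule; a value is
    counted as live in cycles start .. end-1 (an assumed convention).
    """
    live = [0] * ii
    for start, end in lifetimes:
        for cycle in range(start, end):
            # Copies of this value from other iterations share the same row.
            live[cycle % ii] += 1
    return max(live)


# Hypothetical lifetimes with II = 4. The value produced at cycle 0 is kept
# for nine cycles, so more than two copies of it are live at some point.
early = [(0, 9), (2, 4), (4, 6)]
print(max_live(early, 4))   # 4

# Issuing the same producer at cycle 5 instead shortens its lifetime.
late = [(5, 9), (2, 4), (4, 6)]
print(max_live(late, 4))    # 2
```

The second call mirrors the remark about n2 and n9: delaying a producer whose consumers are scheduled late shortens the lifetime and lowers MaxLive.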
In the strategy presented in [17] the ordering is done with
the aim that all operations (except for the first one) have a
previously scheduled reference operation. For instance, for
the previous example, they would suggest the following
order to schedule operations <n1, n3, n5, n6, n4, n7, n8,
n10, n11, n9, n2, n12>. Notice that with this scheduling
order, both n2 and n9 (the two conflicting operations in the
top-down strategy) have a reference operation (n8 and n10,
respectively) already scheduled when they are going to be
placed in the partial schedule.

Figure 3: HRMS scheduling: a) Schedule of one
iteration, b) Lifetimes of variables, c) Kernel of the
scheduling, and d) Register requirements.

Figure 3.a shows the final schedule for one iteration. For 
instance, when we schedule operation n9, operation n10 
has already been placed in the schedule (at cycle 8) so it 
will be scheduled as close as possible to it (at cycle 6), thus 
reducing the lifetime of the value generated by n9. 
Something similar happens with operation n2, which is 
placed in the schedule once its successor is scheduled. 
Figure 3.b shows the lifetimes of loop variants and Figure 
3.d shows the register requirements for this schedule. In 
this case, MaxLive=9. 
The ordering suggested by HRMS does not give 
preference to operations in the critical path. For instance, 
operation n5 should be scheduled 2 cycles after the 
initiation of operation n1; however this is not possible since
during this cycle the adder is busy executing operation n3,
which has been scheduled before. Due to that, an operation
in a more critical path (n5) is delayed in favor of another
operation that belongs to a less critical path (n3).
Something similar happens with operation n11, which
conflicts with the placement of operation n6, which again
belongs to a less critical path but has been selected earlier
by the ordering. Figures 4.a and 4.c show the schedule obtained by
our proposal and Figures 4.b and 4.d the lifetime of
variables and register requirements for this schedule.
MaxLive for this schedule is 8. The schedule is obtained
using the following ordering <n12, n11, n10, n8, n5, n6,
n1, n2, n9, n3, n4, n7>. Notice that nodes in the critical
path are scheduled with a certain preference with respect to
the others. The following section details the algorithm that
orders the nodes with these ideas in mind, and the
scheduling step.

Figure 4: SMS scheduling: a) Schedule of one iteration,
b) Lifetimes of variables, c) Kernel of the scheduling, and
d) Register requirements.

4. Swing Modulo Scheduling (SMS) 
Most modulo scheduling approaches consist of two steps.
First, they compute a schedule trying to minimize the II
but without caring about register requirements, and then
variables are allocated to registers. The execution time of a
software pipelined loop depends on the II, the maximum
number of live values of the schedule (MaxLive) and the
stage count.
The II determines the issue rate of loop iterations.
Regarding the second factor, if MaxLive is not higher than
the number of available registers then the computed
schedule is feasible and MaxLive does not influence the
execution time. Otherwise, some action should be taken in
order to reduce the register pressure. Some possible
solutions outlined in [20] and evaluated in [16] are:
* Reschedule the loop with an increased II. In general,
  increasing the II will reduce MaxLive but it decreases
  the issue rate.
* Add spill code. This again has a negative effect since
  it increases the required memory bandwidth and it will
  result in more memory penalties (e.g. cache
  misses). In addition, memory may become the most
  saturated resource and therefore adding spill code may
  require increasing the II.
Finally, the stage count determines the number of
iterations of the epilogue part of the loop (it is exactly equal
to the stage count minus one).
Swing Modulo Scheduling (SMS) is a modulo scheduling
technique that tries to achieve a minimum II, reduce
MaxLive and minimize the stage count. It is a heuristic
technique that has a low computational cost while it
produces schedules very close to those generated by
optimal approaches based on exhaustive search, which
have a computational cost prohibitive for real programs.
In order to achieve a minimum II and to reduce the stage
count, SMS schedules the nodes in an order that takes into
account the RecMII of the recurrence to which each node
belongs (if any) and, as a secondary factor, it considers how
critical the path to which the node belongs is.
To reduce MaxLive, SMS tries to minimize the lifetime
of all the values of the loop. To achieve that, it tries to keep
every operation as close as possible to both its predecessors
and successors. When an operation is to be scheduled, if the
partial schedule has only predecessors, it is scheduled as
soon as possible. If the partial schedule contains only
successors, it is scheduled as late as possible. The situation
in which the partial schedule contains both predecessors
and successors of the operation to be scheduled is
undesirable since in this case, if the lifetime from the
predecessors to the operation is minimized, the lifetime
from the operation to its successors is increased. Some
techniques like [9] deal with this situation by rescheduling
the predecessors and the successors. SMS does not perform
this type of backtracking but schedules the operations in
such an order that this situation happens very rarely. In fact
it happens only once for each recurrence and it is avoided
completely if the loop does not contain any recurrence.
The algorithm followed by SMS consists of the
following three steps that are described in detail below:
* Computation and analysis of the dependence graph.
* Ordering of the nodes.
* Scheduling.
SMS can be applied to generate code for innermost loops
without subroutine calls. Loops containing IF statements
can be handled after applying if-conversion [1] and
provided that the processor supports predicated execution
[7].
4.1. Computation and analysis of the dependence graph
The dependence graph of an innermost loop consists of a
set of four elements ($DG = \{V, E, \delta, \lambda\}$):
* $V$ is the set of nodes (vertices) of the graph, where each
  node $v \in V$ corresponds to an operation of the loop.
* $E$ is the set of edges, where each edge $(u,v) \in E$
  represents a dependence from operation $u$ to operation
  $v$. Only data dependences (flow, anti and output
  dependences) are included since the type of loops that
  SMS can handle only include one branch instruction at
  the end that is associated to the iteration count. Other
  branches have been previously eliminated by the
  if-conversion phase.
* $\delta_{u,v}$ is called the distance function. It assigns a
  nonnegative integer to each edge $(u,v) \in E$. This value
  indicates that operation $v$ of iteration $I$ depends on
  operation $u$ of iteration $I - \delta_{u,v}$.
* $\lambda_u$ is called the latency function. For each node of the
  graph, it indicates the number of cycles that the
  corresponding operation takes.
Given a node $v \in V$ of the graph, $Pred(v)$ is the set of all
the predecessors of $v$. That is, $Pred(v) = \{u \mid u \in V \text{ and } (u,v) \in E\}$.
In a similar way, $Suc(v)$ is the set of all the successors
of $v$. That is, $Suc(v) = \{u \mid u \in V \text{ and } (v,u) \in E\}$.
Once the dependence graph has been computed, some
additional functions that will be used by the scheduler are
calculated. In order to avoid cycles, one backward edge of
each recurrence is ignored for performing these
computations. These functions are the following:
* $ASAP_u$ is a function that assigns an integer to each
  node of the graph. It indicates the earliest time at
  which the corresponding operation could be
  scheduled. It is computed as follows:
  If $Pred(u) = \emptyset$ then $ASAP_u = 0$
  else $ASAP_u = \max_{v \in Pred(u)} (ASAP_v + \lambda_v - \delta_{v,u} \times MII)$
* $ALAP_u$ is a function that assigns an integer to each
  node of the graph. It indicates the latest time at which
  the corresponding operation could be scheduled. It is
  computed as follows:
  If $Suc(u) = \emptyset$ then $ALAP_u = \max_{v \in V} ASAP_v$
  else $ALAP_u = \min_{v \in Suc(u)} (ALAP_v - \lambda_u + \delta_{u,v} \times MII)$
* $MOV_u$ is called the mobility function. For each node of
  the graph, it denotes the number of time slots at which
  the corresponding operation could be scheduled.
  Nodes in the most critical path have a mobility equal
  to zero and the mobility will increase as the path in
  which the operation is located is less critical. It is
  computed as follows:
  $MOV_u = ALAP_u - ASAP_u$
* $D_u$ is called the depth of each node. For each node of
  the graph, it is defined as the maximum number of
  predecessors weighted by their latency. It is computed
  as follows:
  If $Pred(u) = \emptyset$ then $D_u = 0$
  else $D_u = \max_{v \in Pred(u)} (D_v + \lambda_v)$
* $H_u$ is called the height of each node. For each node of
  the graph, it is defined as the maximum number of
  successors weighted by their latency. It is computed as
  follows:
  If $Suc(u) = \emptyset$ then $H_u = 0$
  else $H_u = \max_{v \in Suc(u)} (H_v + \lambda_u)$

$\lambda_a = \lambda_b = \lambda_c = \lambda_d = \lambda_e = 1$
$\delta_{a,b} = \delta_{a,c} = \delta_{a,d} = \delta_{b,e} = \delta_{c,e} = \delta_{d,e} = 0$
$ASAP_a = 0$; $ASAP_b = ASAP_c = ASAP_d = 1$; $ASAP_e = 2$
$ALAP_a = 0$; $ALAP_b = ALAP_c = ALAP_d = 1$; $ALAP_e = 2$
$MOV_a = MOV_b = MOV_c = MOV_d = MOV_e = 0$
$D_a = 0$; $D_b = D_c = D_d = 1$; $D_e = 2$
$H_a = 2$; $H_b = H_c = H_d = 1$; $H_e = 0$
Figure 5: A sample dependence graph.
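The five functions can be checked against Figure 5 with a short Python sketch. The dictionary-based graph encoding and the function names are our own choices, not the paper's; the graph is assumed to be acyclic (backward edges already removed), and the value of MII is an assumption that does not matter here because every distance is zero.

```python
from functools import lru_cache

# Figure 5 graph: a -> {b, c, d} -> e, all latencies 1, all distances 0.
succ = {"a": ["b", "c", "d"], "b": ["e"], "c": ["e"], "d": ["e"], "e": []}
pred = {v: [u for u in succ if v in succ[u]] for v in succ}
lat = {v: 1 for v in succ}
dist = {(u, v): 0 for u in succ for v in succ[u]}
MII = 3  # assumed value; irrelevant here because every distance is 0


@lru_cache(maxsize=None)
def asap(u):
    return max((asap(v) + lat[v] - dist[v, u] * MII for v in pred[u]), default=0)


@lru_cache(maxsize=None)
def alap(u):
    if not succ[u]:
        return max(asap(v) for v in succ)
    return min(alap(v) - lat[u] + dist[u, v] * MII for v in succ[u])


@lru_cache(maxsize=None)
def depth(u):
    return max((depth(v) + lat[v] for v in pred[u]), default=0)


@lru_cache(maxsize=None)
def height(u):
    return max((height(v) + lat[u] for v in succ[u]), default=0)


def mov(u):
    return alap(u) - asap(u)


for u in succ:
    print(u, asap(u), alap(u), mov(u), depth(u), height(u))
```

The printed rows reproduce the values listed in Figure 5: every node has mobility zero because all three paths from a to e are equally long.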
4.2. Ordering the nodes 
The ordering phase takes as input the dependence graph 
previously calculated and produces an ordered list 
containing all the nodes of the graph. This list indicates the 
order in which the nodes of the graph will be analyzed by 
the scheduling phase. That is, the scheduling phase (see 
next section) first allocates a time slot for the first node of 
the list; then, it looks for a suitable time slot for the second 
node of the list and so on. Notice that, as the number of 
nodes already placed in the partial schedule increases, there 
are more constraints to be met by the remaining nodes and 
therefore it is more difficult to find a suitable location for 
them. 
As previously outlined, the target of the ordering phase 
is twofold: 
Give priority to the operations that are located in the 
most critical paths. In this way, the fact that the last 
operations to be scheduled should meet more 
constraints is offset by their higher mobility ($MOV_u$).
This approach tends to reduce the II and the stage 
count. 
Try to reduce MaxLive. In order to achieve this, the 
scheduler will place each node as close as possible to 
both its predecessors and successors. However, the 
order in which the nodes are scheduled has a severe 
impact on the final result. For instance, assume the 
sample dependence graph of Figure 5 and a dual-issue 
processor. 
If node a is scheduled at cycle 0 and then node e is 
scheduled at cycle 2 (that is, they are scheduled based 
on their ASAP or ALAP values), it is not possible to 
find a suitable placement for nodes b, c and d since 
there are not enough slots between a and e. On the 
other hand, if nodes a and e are scheduled too far 
away, there are many possible locations for the 
remaining nodes. However, MaxLive will be too high 
no matter which possible schedule is chosen. For 
instance, if we try to reduce the lifetime from a to b, 
we are increasing by the same amount the lifetime 
from b to e .  In general, having scheduled both 
predecessors and successors of a node before 
scheduling it may result in a poor schedule. Because of 
this, the ordering of the nodes will try to avoid this 
situation whenever possible (notice that in the case of 
a recurrence, it can be avoided for all the nodes 
excepting one). 
If the graph has no recurrences, the intuitive idea to 
achieve these two objectives is to compute an ordering
based on a traversal of the dependence graph. The
traversal starts at the node at the bottom of the most
critical path and moves upwards, visiting all the ancestors.
The order in which the ancestors are visited depends on
their depth. In case of equal depth, nodes are ordered from
lower to higher mobility. Once all the ancestors have been
visited, all the descendants of the already ordered nodes are
visited, now moving downwards and in the order given
by their height. Upward and downward
sweeps of the graph alternate until the whole
graph has been traversed.
If the graph has recurrences, the graph traversing starts at 
the recurrence with the highest RecMII and applies the 
previous algorithm considering only the nodes of the 
recurrence. Once this subgraph has been traversed, the 
nodes of the recurrence with the second highest RecMII are 
traversed. At this step, the nodes located at any path 
between the previous and the current recurrence are also 
considered in order to avoid having scheduled both 
predecessors and successors of a node before scheduling it. 
When all the nodes belonging to recurrences or any path 
among them have been traversed, then the remaining nodes 
are traversed in a similar way. 
Concretely, the ordering phase is a two-level algorithm. 
First a partial order is computed. This partial order consists 
of an ordered list of sets. The sets are ordered from the highest
to the lowest priority, but there is no order inside each
set. Each node of the graph belongs to just one set.
The highest priority set consists of all the nodes of the
recurrence with the highest RecMII. In general, the ith set
consists of the nodes of the recurrence with the ith highest 
RecMII, eliminating those nodes that belong to any 
previous set (if any) and adding all the nodes located in any 
path that joins the nodes in any previous set and the 
recurrence of this set. Finally, the remaining nodes are 
grouped into sets of the same priority but this priority is 
lower than that of the sets containing recurrences. Each one 
of these sets consists of the nodes of a connected 
component of the graph that do not belong to any previous 
set. 
Once this partial order has been computed, then the
nodes of each set are ordered to produce the final and
complete order. This step takes as input the previous list of
sets and the whole dependence graph. The sets are handled
in the order previously computed. For each recurrence of
the graph, a backward edge is ignored in order to obtain a
graph without cycles. The final result of the ordering phase
is a list of ordered nodes O containing all the nodes of the
graph.

    O := Empty_list
    for each set of nodes S in decreasing priority do
        if Pred_L(O) ≠ ∅ and Pred_L(O) ⊆ S then
            R := Pred_L(O) ∩ S
            order := bottom-up
        else if Suc_L(O) ≠ ∅ and Suc_L(O) ⊆ S then
            R := Suc_L(O) ∩ S
            order := top-down
        else
            R := {node with the highest ASAP value in S};
                 if more than one, choose any one
            order := bottom-up
        end if
        repeat
            if order = top-down then
                while R ≠ ∅ do
                    v := element of R with the highest H_v;
                         if more than one, choose node with lowest MOV_v
                    O := O | <v>
                    R := R - {v} ∪ (Suc(v) ∩ S)
                endwhile
                order := bottom-up
                R := Pred_L(O) ∩ S
            else
                while R ≠ ∅ do
                    v := element of R with the highest D_v;
                         if more than one, choose node with lowest MOV_v
                    O := O | <v>
                    R := R - {v} ∪ (Pred(v) ∩ S)
                endwhile
                order := top-down
                R := Suc_L(O) ∩ S
            endif
        until R = ∅
    endfor

Figure 6: Ordering algorithm.

The ordering algorithm is shown in Figure 6, where |
denotes the list append operation and Pred_L(O) and
Suc_L(O) are the sets of predecessors and successors of a
list of nodes respectively, which are defined as follows:
$Pred\_L(O) = \{v \mid \exists u \in O \text{ such that } v \in Pred(u) \text{ and } v \notin O\}$
$Suc\_L(O) = \{v \mid \exists u \in O \text{ such that } v \in Suc(u) \text{ and } v \notin O\}$
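A possible Python rendering of the ordering algorithm is sketched below, applied to the graph of Figure 5. It reflects our reading of Figure 6 and makes two assumptions that the figure does not state: ties that the figure leaves open are broken by node name, and nodes already in the list are never added back to R. The ASAP, depth, height and mobility values are taken from Figure 5 rather than recomputed.

```python
def order_nodes(sets, pred, succ, asap, depth, height, mov):
    """Order the nodes of each set by alternating bottom-up/top-down sweeps.

    `sets` is the partial order (list of node sets, highest priority first);
    the graph is assumed acyclic (backward edges already dropped).
    """
    order = []

    def pred_l():
        return {v for u in order for v in pred[u]} - set(order)

    def succ_l():
        return {v for u in order for v in succ[u]} - set(order)

    for s in sets:
        if pred_l() and pred_l() <= s:
            r, direction = pred_l() & s, "bottom-up"
        elif succ_l() and succ_l() <= s:
            r, direction = succ_l() & s, "top-down"
        else:
            # Highest ASAP; ties broken by name (the paper allows any choice).
            r = {max(sorted(s), key=lambda v: asap[v])}
            direction = "bottom-up"
        while r:
            if direction == "top-down":
                key, grow = height, succ
            else:
                key, grow = depth, pred
            while r:
                # Highest height/depth first, then lowest mobility, then name.
                v = min(sorted(r), key=lambda n: (-key[n], mov[n]))
                order.append(v)
                # Assumption: nodes already ordered never re-enter R.
                r = (r | (set(grow[v]) & s)) - set(order)
            if direction == "top-down":
                direction, r = "bottom-up", pred_l() & s
            else:
                direction, r = "top-down", succ_l() & s
    return order


# Graph and node functions of Figure 5.
succ = {"a": ["b", "c", "d"], "b": ["e"], "c": ["e"], "d": ["e"], "e": []}
pred = {"a": [], "b": ["a"], "c": ["a"], "d": ["a"], "e": ["b", "c", "d"]}
asap = {"a": 0, "b": 1, "c": 1, "d": 1, "e": 2}
depth = dict(asap)  # depth equals ASAP in Figure 5
height = {"a": 2, "b": 1, "c": 1, "d": 1, "e": 0}
mov = {v: 0 for v in succ}

result = order_nodes([set(succ)], pred, succ, asap, depth, height, mov)
print(result)   # ['e', 'b', 'c', 'd', 'a']
```

In the resulting order no node is preceded by both one of its predecessors and one of its successors, which is the property the ordering phase aims for when the graph has no recurrences.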
4.3. Scheduling 
The scheduling step analyses the operations in the order
given by the ordering step. The scheduling tries to schedule
the operations as close as possible to the neighbors that
have already been scheduled. When an operation is to be
scheduled, it is scheduled in different ways depending on
which of its neighbors are already in the partial
schedule.
* If an operation $u$ has only predecessors in the partial
  schedule, then $u$ is scheduled as soon as possible. In
  this case the scheduler computes the Early_Start of $u$
  as:
  $Early\_Start_u = \max_{v \in PSP(u)} (t_v + \lambda_v - \delta_{v,u} \times II)$
  where $t_v$ is the cycle where $v$ has been scheduled, $\lambda_v$
  is the latency of $v$, $\delta_{v,u}$ is the dependence distance
  from $v$ to $u$, and $PSP(u)$ is the set of predecessors of $u$
  that have been previously scheduled. Then the
  scheduler scans the partial schedule for a free slot for
  the node $u$ starting at cycle $Early\_Start_u$ until the cycle
  $Early\_Start_u + II - 1$. Notice that, due to the modulo
  constraint, it makes no sense to scan more than II
  cycles.
* If an operation $u$ has only successors in the partial
  schedule, then $u$ is scheduled as late as possible. In this
  case the scheduler computes the Late_Start of $u$ as:
  $Late\_Start_u = \min_{v \in PSS(u)} (t_v - \lambda_u + \delta_{u,v} \times II)$
  where $PSS(u)$ is the set of successors of $u$ that have
  been previously scheduled. Then the scheduler scans
  the partial schedule for a free slot for the node $u$
  starting at cycle $Late\_Start_u$ until the cycle
  $Late\_Start_u - II + 1$.
* If an operation $u$ has both predecessors and successors
  in the partial schedule, then the scheduler computes
  $Early\_Start_u$ and $Late\_Start_u$ as described above and
  scans the partial schedule starting at cycle $Early\_Start_u$
  until the cycle $\min(Late\_Start_u, Early\_Start_u + II - 1)$.
  This situation will only happen for exactly one node of each
  recurrence circuit.
* Finally, if an operation $u$ has neither predecessors nor
  successors in the partial schedule, the scheduler computes
  the Early_Start of $u$ as:
  $Early\_Start_u = ASAP_u$
  and scans the partial schedule for a free slot for the
  node $u$ from cycle $Early\_Start_u$ to cycle
  $Early\_Start_u + II - 1$.
If no free slots are found for a node, then the II is
increased by 1. The scheduling step is repeated with the
increased II, which will provide more opportunities for
finding free slots. One of the advantages of our proposal is
that the nodes are ordered only once, even if the scheduling
step has to do several trials.
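The four cases and the retry with a larger II can be sketched in Python as follows. The resource model is a deliberate simplification and an assumption of ours: the processor offers `width` identical issue slots per cycle, tracked per kernel row. The names `schedule`, `swing_schedule` and the bound `max_ii` are hypothetical. The example uses the graph of Figure 5 on a dual-issue processor, for which five operations on two slots give MII = 3, and the node order <e, b, c, d, a>.

```python
def schedule(order, pred, succ, lat, dist, asap, ii, width):
    """Try to place every node for a fixed II. Returns {node: cycle} or None.

    Resource model (an assumption): `width` identical issue slots per cycle.
    """
    t, used = {}, [0] * ii            # used[row] = occupied slots in that row
    for u in order:
        psp = [v for v in pred[u] if v in t]   # scheduled predecessors
        pss = [v for v in succ[u] if v in t]   # scheduled successors
        early = max((t[v] + lat[v] - dist[v, u] * ii for v in psp), default=None)
        late = min((t[v] - lat[u] + dist[u, v] * ii for v in pss), default=None)
        if psp and pss:
            window = range(early, min(late, early + ii - 1) + 1)
        elif psp:
            window = range(early, early + ii)
        elif pss:
            window = range(late, late - ii, -1)
        else:
            window = range(asap[u], asap[u] + ii)
        slot = next((c for c in window if used[c % ii] < width), None)
        if slot is None:
            return None               # no free slot: the caller increases II
        t[u] = slot
        used[slot % ii] += 1
    return t


def swing_schedule(order, pred, succ, lat, dist, asap, mii, width, max_ii=64):
    """Retry with II = MII, MII+1, ...; the node order is computed only once."""
    for ii in range(mii, max_ii + 1):
        t = schedule(order, pred, succ, lat, dist, asap, ii, width)
        if t is not None:
            return ii, t
    return None


# Figure 5 graph on a dual-issue processor.
succ = {"a": ["b", "c", "d"], "b": ["e"], "c": ["e"], "d": ["e"], "e": []}
pred = {"a": [], "b": ["a"], "c": ["a"], "d": ["a"], "e": ["b", "c", "d"]}
lat = {v: 1 for v in succ}
dist = {(u, v): 0 for u in succ for v in succ[u]}
asap = {"a": 0, "b": 1, "c": 1, "d": 1, "e": 2}

ii, t = swing_schedule(["e", "b", "c", "d", "a"], pred, succ, lat, dist, asap, 3, 2)
print(ii, t)   # 3 {'e': 2, 'b': 1, 'c': 1, 'd': 0, 'a': -1}
```

Node a ends up at cycle -1; as in the paper's figures, such a schedule would be normalized so that it starts at cycle 0. Every node other than e is placed as late as possible relative to its already scheduled successors, which keeps the lifetimes short.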
4.4. Examples 
This section illustrates the performance of the SMS by 
means of two examples. The first example is a small loop 
without recurrences and the second example uses a 
dependence graph with recurrences. 
Assume that the dependence graph of the body of the
innermost loop to be scheduled is that of Figure 1,
where all the edges represent dependences of distance zero.
Assume also a four-issue processor with four functional
units (1 adder, 1 multiplier and 2 load/store units) fully
pipelined with the latencies listed in Figure 1.
The first step of the scheduling is to compute the MII and
the ASAP, ALAP, mobility, depth and height of each node
of the graph. MII is equal to 4. Table 1 shows the remaining
values for each node.
Node | ASAP | ALAP | M | D  | H
n1   |  0   |  0   | 0 |  0 | 10
n2   |  0   |  2   | 2 |  0 |  8
n3   |  2   |  6   | 4 |  2 |  4
n4   |  4   |  8   | 4 |  4 |  2
n5   |  2   |  2   | 0 |  2 |  8
n8   |  4   |  4   | 0 |  4 |  6
n9   |  0   |  4   | 4 |  0 |
n12  | 10   | 10   | 0 | 10 |  0
Table 1: ASAP, ALAP, mobility (M), depth (D) and height
(H) of nodes of Figure 1.
Then, the nodes are ordered. The first level of the
ordering algorithm groups all the nodes into the same set
since there are no recurrences. Then, the elements of this
set are ordered as follows:
* Initially R = {n12} and order = bottom-up.
* Then, all the ancestors of n12 are ordered depending
  on their depth and their mobility as a secondary factor.
  This gives the partial order O = <n12, n11, n10, n8,
  n5, n6, n1, n2, n9>.
* Then, the order shifts to top-down and all the
  descendants are ordered based on their height and
  mobility. This gives the final ordering O = <n12, n11,
  n10, n8, n5, n6, n1, n2, n9, n3, n4, n7>.
The next step is to schedule the operations following the
previous order. II is initialized to MII and the operations are
scheduled as shown in Figure 4:
* The first node of the list, n12, is scheduled at cycle 10
  (given by its ASAP) since there are neither
  predecessors nor successors in the partial schedule¹.
  Once the schedule is folded this will become cycle 3 of
  stage 2.
1. In fact the resulting schedule stretches from cycles -1
to 10 but in all the figures we have normalized the
representation starting always at cycle 0, so n12 is in cycle
11 of Figure 4.
