Hypernode reduction modulo scheduling by Llosa Espuny, José Francisco et al.
Hypernode Reduction Modulo Scheduling * 
Josep Llosa, Mateo Valero, Eduard Ayguadé and Antonio González
Departament d'Arquitectura de Computadors
Universitat Politècnica de Catalunya
Campus Nord, Mòdul D6, Gran Capità s/n
08071, Barcelona, SPAIN
{josepll,mateo,eduard,antonio}@ac.upc.es
Abstract 
Software pipelining is a loop scheduling technique
that extracts parallelism from loops by overlapping the
execution of several consecutive iterations. Most prior
scheduling research has focused on achieving minimum
execution time, without regard to register requirements.
Most strategies tend to stretch operand lifetimes because
they schedule some operations too early or too late.
This paper presents a novel strategy that simultaneously
schedules some operations late and other operations
early, minimizing all the stretchable dependencies and
therefore reducing the registers required by the loop.
The key to this strategy is a pre-ordering phase that
selects the order in which the operations will be
scheduled. The results show that the method described
in this paper performs better than other heuristic
methods and almost as well as a linear programming
method, while requiring much less time to produce
the schedules.
Keywords: Instruction Scheduling, Loop Schedul- 
ing, Software Pipelining, Register Allocation, Register 
Spilling. 
1 Introduction 
Software pipelining is an instruction scheduling 
technique that exploits the instruction level paral- 
lelism of loops by overlapping successive iterations of 
the loop and executing them in parallel. Finding the 
optimal schedule is an NP-complete problem and there 
are several works that propose and evaluate different
heuristic strategies to perform software pipelining
[20, 12, 11, 23, 6, 19].
The drawback of aggressive scheduling techniques,
such as software pipelining, is their high register
pressure. The register requirements increase as the
concurrency increases [16, 15], due to machines with
deeper pipelines, wider issue, or a combination of
both. Registers, like functional units, are a limited
resource. Therefore, if a schedule requires more
registers than are available, some action, such as adding
spill code, has to be taken. The addition of spill code
can degrade performance [15] due to additional cycles
in the schedule or due to memory interference.
*This work has been supported by the Ministry of Education
of Spain under contracts TIC 880/92 and TIC 429/95, by
ESPRIT 6634 Basic Research Action (APPARC) and by CEPBA
(European Center for Parallelism of Barcelona).
1072-4451/95 $4.00 © 1995 IEEE
Proceedings of MICRO-28
The problems introduced by the high register
requirements of aggressive scheduling techniques,
together with the trend of increasing ILP in current
microprocessors [9, 24], have led to scheduling research
oriented to minimize the register requirements (in part
due to the limited number of registers that existing
architectures have, and in part due to the limitations
in chip area and especially access time that register
files with a high number of registers will impose). In
this direction there are also proposals for alternative
register file organizations [22, 4, 14].
In order to achieve maximum performance, schedul- 
ing algorithms that reduce the register pressure while 
scheduling for high throughput are of high interest. 
Huff's Slack Scheduling [10] is a heuristic technique
that attempts to address this concern. SPILP [8] 
is an integer linear programming formulation of the 
scheduling problem that obtains the optimal resource- 
constrained schedule, with minimal buffer require- 
ments. In [7] a linear programming formulation that 
obtains optimum schedules with the minimum regis- 
ter requirements is presented. Unfortunately heuris- 
tic strategies do not always obtain the optimum re- 
sults and linear programming methods require a much 
higher time to construct the schedules than heuristic 
methods. 
This paper presents Hypernode Reduction Modulo 
Scheduling (HRMS), a heuristic strategy that tries to 
shorten loop variant lifetimes, without sacrificing per- 
formance. The main part of HRMS is the ordering 
strategy. The ordering phase orders the nodes before 
scheduling them, so that only predecessors or succes- 
sors of a node can be scheduled before it is scheduled 
(except for recurrences). During the scheduling step 
the nodes are scheduled as soon/late as possible, if pre- 
decessors/successors have been previously scheduled. 
This strategy has been tested with a set of loops
taken from [8] and compared against three leading
scheduling strategies. These three strategies are the
previously mentioned Slack and SPILP, together with
FRLC [23], a heuristic strategy that does not take
register requirements into consideration.
Experimental results show that HRMS obtains bet- 
ter schedules than the other heuristic strategies, with 
a comparable scheduling time. On the other hand, 
HRMS produces similar results to SPILP, but requires 
up to 2 orders of magnitude less time than SPILP to 
produce the schedules. In addition, HRMS is com- 
pared against a Top-Down scheduler [15] and charac- 
terized in terms of quality of the generated schedules 
and the computational cost on a test-bench of over 
a thousand loops from the Perfect Club Benchmark 
Suite [3] that account for 78% of the execution time 
of the Perfect Club. 
In Section 2 an example is used to illustrate the
problems that most strategies have, and to show how
our strategy shortens lifetimes and reduces register
pressure. Section 3 describes our proposal (HRMS).
Section 4 presents the experiments performed, and
finally, Section 5 states our conclusions.
2 Overview of software pipelining and
motivating example
In a software pipelined loop the schedule for an it- 
eration is divided into stages so that the execution of 
consecutive iterations which are in distinct stages is 
overlapped. The number of stages in one iteration is
termed the stage count (SC). The number of cycles
between the initiation of successive iterations (i.e. the
number of cycles per stage) in a software pipelined
schedule is termed the Initiation Interval (II) [20].
The Initiation Interval II between two successive
iterations is bounded either by loop-carried dependences
in the graph (RecMII) or by resource constraints
of the architecture (ResMII). This lower
bound on the II is termed the Minimum Initiation
Interval (MII). The reader is referred to [6, 19] for an
extensive discussion of how to calculate ResMII and
RecMII.
Values used in a loop correspond either to
loop-invariant variables or to loop-variant variables.
Loop-invariants are repeatedly used but never defined
during loop execution. Loop-invariants have only one
value for all iterations of the loop; therefore each one
requires one register for the whole execution of the loop,
irrespective of the schedule and the machine
configuration.
For loop-variants, a value is generated in each it- 
eration of the loop and, therefore, there is a differ- 
ent lifetime corresponding to each iteration. Because 
of the nature of software pipelining, lifetimes of val- 
ues defined in an iteration, can overlap with lifetimes 
of values defined in subsequent iterations. In addi- 
tion, for values with a lifetime larger than the II new 
values are generated before the previous one is used, 
overwriting it. 
Figure 1: Dependence graph of our motivating example.
One approach to fix this problem is to provide some
form of register renaming so that successive definitions
of a value use distinct registers. Renaming can be
performed at compile time by using modulo variable
expansion [13], i.e., unrolling the kernel and renaming
at compile time the multiple definitions of each
variable that exist in the unrolled kernel. A rotating
register file can be used to solve this problem without
replicating code, renaming different instantiations of
a loop-variant at execution time [5].
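As a rough illustration of why lifetimes longer than the II force renaming, the following sketch estimates how many instances of a value are alive at once and the resulting kernel unroll degree under modulo variable expansion. It assumes the common formulation in which a value with lifetime L needs ceil(L / II) registers and the kernel is unrolled by the maximum of these counts. The function names and the sample lifetimes are hypothetical and are not taken from the paper.

```python
import math

def copies_needed(lifetime, ii):
    """Number of simultaneously live instances of one loop-variant."""
    return max(1, math.ceil(lifetime / ii))

def unroll_degree(lifetimes, ii):
    """Kernel unroll factor if every value gets its own renamed copies."""
    return max((copies_needed(l, ii) for l in lifetimes), default=1)

# Hypothetical lifetimes (in cycles) for a loop scheduled with II = 2.
print(unroll_degree([2, 3, 5], ii=2))  # ceil(5 / 2) = 3
```

A rotating register file avoids the unrolling but still needs the same number of physical registers per value.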
2.1 Motivating example 
Consider the dependence graph of Figure 1, and an 
architecture where all the operations can be executed 
by any functional unit (i.e. general-purpose functional 
units). Assume that there are 4 pipelined units, and 
that the execution latency is 2. Since the graph in
Figure 1 has no recurrence circuits, its initiation
interval is constrained only by the available resources:
MII = ⌈7/4⌉ = 2.
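The resource bound for this example can be checked with a short calculation. The sketch below assumes identical, fully pipelined, general-purpose units (as stated above) and takes the operation count of seven from the nodes A to G named in the text. The helper names are hypothetical.

```python
import math

def res_mii(num_operations, num_units):
    """Resource-constrained lower bound on II for identical pipelined units."""
    return math.ceil(num_operations / num_units)

def mii(res, rec=0):
    """MII is the larger of the resource and recurrence bounds."""
    return max(res, rec)

# Seven operations (A..G), four units, no recurrence circuits.
print(mii(res_mii(7, 4)))  # 2
```

With heterogeneous functional units the bound would have to be computed per unit class, which this sketch does not attempt.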
In many approaches, the lifetimes of some values 
can be unnecessarily large. As an example, Figure 2a 
shows a top-down scheduling, and Figure 3a a bottom- 
up scheduling for the example graph. 
In the top-down scheduling, node E is scheduled
before node F. Since E has no predecessors it can be
placed at any cycle, but in order not to delay any
possible successor, it is placed as soon as possible.
Figure 2b shows the lifetimes of loop variants for the
top-down schedule, assuming that a value is alive from
the beginning of the producer operation to the
beginning of the last consumer. Notice that loop variant
V5 has an unnecessarily large lifetime due to the early
placement of E during scheduling.
In the bottom-up approach E is scheduled after F,
therefore it is placed as late as possible, reducing the
lifetime of V5 (Figure 3b). Unfortunately C is scheduled
before B and, in order not to delay any possible
predecessor, it is scheduled as late as possible. Notice
that V2 has an unnecessarily large lifetime due to
the late placement of C.
In the strategy we propose, an operation will be
ready for scheduling even if some of its predecessors
and successors have not been scheduled. The only
condition (to be guaranteed by the pre-ordering step) is
that when an operation is scheduled, the partial
schedule contains only predecessors or successors or none
of them, but not both of them (in the absence of
recurrences). The ordering is done with the aim that all
operations have a previously scheduled reference
operation (except for the first operation to be scheduled).
For instance, consider that nodes of the graph in
Figure 1 are scheduled in the order {A, B, C, D, F, E, G}.
Notice that node F will be scheduled before nodes {E,
G}, a predecessor and a successor respectively, and
that the partial schedule will contain only a
predecessor (D) of F. With this scheduling order, both C
and E (the two conflicting operations in the top-down
and bottom-up strategies) have a reference operation
already scheduled when they are placed in the partial
schedule.
Figure 2: Top-Down scheduling: a) Schedule of one
iteration, b) Lifetimes of variables, c) Kernel, d)
Register requirements.
Figure 4a shows the final schedule for one iteration.
Operation A will be scheduled in cycle 0. Operation B,
which depends on A, will be scheduled in cycle 2. Then
C, and later D, are scheduled in cycle 4. At this point,
operation F is scheduled as soon as possible, i.e. at
cycle 6 (because it depends on D), but there are no
available resources at this cycle, so it is delayed to
cycle 7. Now the scheduler places operation E as late
as possible because there is a successor of E previously
placed in the partial schedule, thus operation E is
placed at cycle 5. Finally, since operation G has a
predecessor previously scheduled, it is placed as soon
as possible, i.e. at cycle 9.
Figure 4b shows the lifetimes of loop variants. No- 
tice that neither C nor E have been placed too late 
and too early respectively, because the scheduler al- 
ways takes previously scheduled operations as a refer- 
ence point. Since F has been scheduled before E, the 
scheduler has a reference operation to calculate a late 
start cycle for E. Figure 4d shows the number of alive 
registers in the kernel (Figure 4c) during the steady 
state phase of the execution of the loop. There are 
6 alive registers in the first row and 5 in the second, 
therefore, the loop variants require only 6 registers. 
In contrast the top-down schedule requires 8 registers 
and the bottom-up schedule requires 7 registers. 
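The register counts per kernel row can be reproduced by folding each lifetime modulo II. The sketch below is only an illustration: the lifetimes are inferred from the cycles given in the text (A at 0, B at 2, C and D at 4, E at 5, F at 7, G at 9), under the assumption that V1, V2, V4, V5 and V6 are produced by A, B, D, E and F and consumed along the edges A→B, B→C, B→D, D→F, E→F and F→G. Figure 1 is not reproduced here, so these edges are an assumption.

```python
def live_per_row(lifetimes, ii):
    """Count live values in each kernel row (cycle mod II) in steady state.

    lifetimes maps a value name to (start, end): alive from the start of the
    producer up to, but not including, the start of the last consumer.
    """
    rows = [0] * ii
    for start, end in lifetimes.values():
        for cycle in range(start, end):
            rows[cycle % ii] += 1
    return rows

# Lifetimes inferred from the schedule described in the text (assumption).
LIFETIMES = {"V1": (0, 2), "V2": (2, 4), "V4": (4, 7), "V5": (5, 7), "V6": (7, 9)}

rows = live_per_row(LIFETIMES, ii=2)
print(rows, max(rows))  # [6, 5] 6
```

The maximum over the rows is the number of registers the loop variants need, which agrees with the 6 and 5 quoted above.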
The following section describes the algorithm that
orders the nodes before scheduling, and the scheduling
step.
Figure 3: Bottom-Up scheduling: a) Schedule of one
iteration, b) Lifetimes of variables, c) Kernel, d)
Register requirements.
3 Hypernode Reduction Modulo Scheduling
The dependences of an innermost loop can be
represented by a Dependence Graph G = DG(V, E, δ, λ).
V is the set of vertices of the graph G, where each
vertex v ∈ V represents an operation of the loop. E is
the dependence edge set, where each edge (u, v) ∈ E
represents a dependence between two operations u, v.
Edges may correspond to any of the following types of
dependences: register dependences, memory
dependences or control dependences. The dependence
distance δ(u,v) is a nonnegative integer associated with
each edge (u, v) ∈ E. There is a dependence of
distance δ(u,v) between two nodes u and v if the execution
of operation v depends on the execution of operation
u δ(u,v) iterations before. The latency λ_u is a nonzero
positive integer associated with each node u ∈ V and
is defined as the number of cycles taken by the
corresponding operation to produce a result.
HRMS tries to minimize the register requirements
of the loop by scheduling any operation u as close as
possible to its relatives, i.e. the predecessors of u,
Pred(u), and the successors of u, Succ(u). Scheduling
operations in this way shortens operand lifetimes and
therefore reduces the register requirements of the loop.
To software pipeline a loop, the scheduler must
handle cyclic dependences caused by recurrence circuits.
A recurrence circuit from an operation to an instance
of itself Ω iterations later must not be stretched
beyond Ω × II. In addition, placing an operation u at
a cycle t_u commits its associated resources for cycles
t_u + s × II, ∀s.
Figure 4: Bidirectional scheduling: a) Schedule of one
iteration, b) Lifetimes of variables, c) Kernel, d)
Register requirements.
HRMS solves these problems by splitting the
scheduling into two steps: a pre-ordering step that
orders the nodes, and the actual scheduling step, which
schedules the nodes (one at a time) in the order given
by the pre-ordering step.
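The recurrence constraint stated earlier in this section (a circuit spanning a given number of iterations must fit within that many initiation intervals) leads to the usual way of bounding II from below. The sketch below assumes each circuit is summarized by its total latency and total dependence distance. The paper defers the actual RecMII calculation to [6, 19], so the helper names and sample circuits here are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)

def rec_mii(circuits):
    """Smallest II that stretches no circuit.

    circuits: iterable of (total_latency, total_distance), distance > 0.
    """
    return max((ceil_div(lat, dist) for lat, dist in circuits), default=0)

def fits(total_latency, total_distance, ii):
    """True if the circuit is not stretched beyond distance * II cycles."""
    return total_latency <= total_distance * ii

# Two hypothetical circuits: 5 cycles over 2 iterations, 4 cycles over 1.
print(rec_mii([(5, 2), (4, 1)]))  # max(3, 4) = 4
```

A graph with no recurrence circuits yields 0 here, so the resource bound alone determines MII.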
The pre-ordering step orders the nodes of the
dependence graph with the goal of scheduling the loop
with an II as close as possible to MII and using the
minimum number of registers. It gives priority to
recurrence circuits in order not to stretch any recurrence
circuit. It also ensures that, when a node is scheduled, 
the current partial scheduling contains only predeces- 
sors or successors of the node, but never both (unless 
the node is the last node of a recurrence circuit to be 
scheduled). 
The ordering step assumes that the dependence
graph G = DG(V, E, δ, λ) to be ordered is a connected
component. If G is not a connected component, it is
decomposed into a set of connected components {G_i};
each G_i is ordered separately and finally the lists of
nodes of all G_i are concatenated, giving a higher
priority to the G_i with a more restrictive recurrence
circuit (in terms of RecMII).
Next the pre-ordering step is presented. First we 
will assume that the dependence graph has no recur- 
rence circuits (Section 3.1), and in Section 3.2 we in- 
troduce modifications in order to deal with recurrence 
circuits. Finally Section 3.3 presents the scheduling 
step. 
3.1 Pre-ordering of graphs without recur- 
rence circuits 
To order the nodes of a graph, an initial node, which
we call the Hypernode, is selected. In an iterative
process, all the nodes in the dependence graph are
reduced to this Hypernode. The reduction of a set of
nodes to the Hypernode consists of deleting the set of
edges among the nodes of the set and the Hypernode,
replacing the edges between the rest of the nodes and
the reduced set of nodes by edges between the rest of
the nodes and the Hypernode, and finally deleting the
set of nodes being reduced.
function Pre_Ordering(G, L, h)
{ Returns a list with the nodes of G ordered }
{ It takes as input: }
{   The dependence graph (G) }
{   A list of nodes partially ordered (L) }
{   An initial node (i.e. the hypernode) (h) }
  List := L;
  while (Pred(h) ≠ ∅ or Succ(h) ≠ ∅)
    V' := Pred(h);
    V' := Search_All_Paths(V', G);
    G' := Hypernode_Reduction(V', G, h);
    L' := Sort_PALA(G');
    List := Concatenate(List, L');
    V' := Succ(h);
    V' := Search_All_Paths(V', G);
    G' := Hypernode_Reduction(V', G, h);
    L' := Sort_ASAP(G');
    List := Concatenate(List, L')
  return List
Figure 5: Function that pre-orders the nodes in a
dependence graph without recurrence circuits.
The pre-ordering step (Figure 5) requires an initial
Hypernode and a partial list of ordered nodes. The
current implementation selects the first node of the
graph (i.e. the node corresponding to the first
operation in the program order), but any node of the graph
can be taken as the initial Hypernode¹. This node is
inserted in the partial list of ordered nodes, then the
pre-ordering algorithm sorts the rest of the nodes.
At each step, the predecessors (successors) of the
Hypernode are obtained. Then the nodes that appear
in any path among the predecessors (successors)
are obtained (function Search_All_Paths²). Once the
predecessors (successors) and all the paths connecting
them have been obtained, all these nodes are reduced
(see function Hypernode_Reduction in Figure 6) to the
Hypernode, and the subgraph which contains them is
topologically sorted. The topological sort determines
the partial order of predecessors (successors), which is
appended to the ordered list of nodes. The
predecessors are topologically sorted using the algorithm we
name PALA. The PALA algorithm is like an ALAP
(As Late As Possible) algorithm, but the list of
ordered nodes is inverted. The successors are
topologically sorted using an ASAP (As Soon As Possible)
algorithm.
As an example, consider the dependence graph in
Figure 7a. Next, we illustrate the ordering of the
nodes of this graph step by step.
¹The algorithm tries to shorten lifetimes irrespective of the
starting node. Preliminary experiments showed that selecting
different initial nodes produced different schedules that had
approximately the same register requirements (there were minor
differences caused by resource constraints).
²The execution time of Search_All_Paths is O(||V|| + ||E||).
function Hypernode_Reduction(V', G, h)
{ G = (V, E, δ, λ); V' ⊆ V; h ∈ V }
{ Creates the subgraph G' = (V', E', δ, λ) ⊆ G }
{ And reduces G' to the node h in the graph G }
  E' := ∅;
  for each u ∈ V' do
    for each e = (v1, v2) ∈ Adj_edges(u) do
      E := E − {e};
      if v1 ∈ V' and v2 ∈ V'
        then E' := E' ∪ {e}
      else
        if v1 = u and v2 ≠ h
          then E := E ∪ {(h, v2)}
        if v2 = u and v1 ≠ h
          then E := E ∪ {(v1, h)}
    V := V − {u}
  return G'
Figure 6: Function Hypernode_Reduction
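Figures 5 and 6 can be combined into a small executable sketch for graphs without recurrence circuits. This is an illustration under stated assumptions, not the authors' implementation: the graph is a plain set of (u, v) edges, Search_All_Paths is realised as the intersection of forward and backward reachability, the ASAP and PALA sorts are approximated by sorting on longest distance from the sources or from the sinks with ties broken by name, and the edge list is a hypothetical graph chosen to be consistent with the step-by-step walkthrough of Figure 7 (the real edges of that figure are not given in the text).

```python
from collections import defaultdict

def adjacency(edges):
    """Successor and predecessor maps for a set of (u, v) edges."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    return succ, pred

def reachable(starts, adj):
    """All nodes reachable from `starts` (inclusive) through `adj`."""
    seen, stack = set(starts), list(starts)
    while stack:
        for n in adj[stack.pop()]:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen

def search_all_paths(nodes, edges):
    """`nodes` plus every node lying on a path between two of them."""
    succ, pred = adjacency(edges)
    return reachable(nodes, succ) & reachable(nodes, pred)

def level_sort(nodes, edges, from_sinks):
    """Sort by longest distance from sources (ASAP-like) or, if
    `from_sinks`, from sinks (inverted ALAP, i.e. PALA-like)."""
    sub = [(v, u) if from_sinks else (u, v)
           for u, v in edges if u in nodes and v in nodes]
    _, pred = adjacency(sub)
    level = {}
    def depth(n):
        if n not in level:
            level[n] = 1 + max((depth(p) for p in pred[n]), default=-1)
        return level[n]
    return sorted(nodes, key=lambda n: (depth(n), n))  # ties: by name

def hypernode_reduction(group, edges, h):
    """Contract `group` into h; edges inside group-plus-h disappear."""
    renamed = {(h if u in group else u, h if v in group else v)
               for u, v in edges}
    return {(u, v) for u, v in renamed if u != v}

def pre_ordering(edges, order, h):
    """Order all nodes connected to hypernode h (acyclic graphs only)."""
    order, edges = list(order), set(edges)
    progress = True
    while progress:                      # stop when h has no neighbours
        progress = False
        for use_preds in (True, False):  # predecessors, then successors
            succ, pred = adjacency(edges)
            group = set(pred[h] if use_preds else succ[h])
            if not group:
                continue
            group = search_all_paths(group, edges)
            order += level_sort(group, edges, from_sinks=use_preds)
            edges = hypernode_reduction(group, edges, h)
            progress = True
    return order

# Hypothetical edges, consistent with the Figure 7 walkthrough (assumption).
EDGES = {("A", "C"), ("C", "G"), ("C", "H"), ("D", "G"), ("G", "J"),
         ("B", "D"), ("B", "E"), ("E", "I"), ("I", "J"), ("F", "B")}

print(pre_ordering(EDGES, ["A"], "A"))
# ['A', 'C', 'G', 'H', 'D', 'J', 'I', 'E', 'B', 'F']
```

The hypernode is represented by the name of the initial node, so contracting a group simply renames its members to that node. Rebuilding the adjacency maps on every pass keeps the sketch short at the cost of efficiency.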
Figure 7: Sample example for reordering without re- 
currences. 
1. Initially, the list of ordered nodes is empty
(List = ∅). We start by designating a node of
the graph as the Hypernode (ℋ in Figure 7).
Assume that A is the first node of the graph. The
resulting graph is shown in Figure 7b. Then A is
appended to the list of ordered nodes (List = {A}).
2. In the next step the predecessors of ℋ are
selected. Since it has no predecessors, the
successors are selected (i.e. the node C). Node C is
reduced to ℋ, resulting in the graph of Figure
7c, and C is added to the list of ordered nodes
(List = {A, C}).
3. The process is repeated, selecting nodes G and
H. In the case of selecting multiple nodes, there
can be paths connecting these nodes. The
algorithm looks for the possible paths, and
topologically sorts the nodes involved. Since there are
no paths connecting G and H, they are added to
the list (List = {A, C, G, H}) and reduced to the
Hypernode, resulting in the graph of Figure 7d.
4. Now ℋ has D as a predecessor, thus D is reduced,
producing the graph in Figure 7e, and appended
to the list (List = {A, C, G, H, D}).
5. Then, the successor J of ℋ is ordered (List =
{A, C, G, H, D, J}) and reduced, producing the
graph in Figure 7f.
6. At this point ℋ has two predecessors B and I,
and there is a path between B and I that contains
the node E. Therefore B, E, and I are reduced to
ℋ, producing the graph of Figure 7g. Then, the
subgraph that contains B, E, and I is topologically
sorted, and the partially ordered list {I, E, B} is
appended to the list of ordered nodes (List =
{A, C, G, H, D, J, I, E, B}).
7. Finally node F is reduced to ℋ, producing the
graph of Figure 7h with only the Hypernode,
which is the stop condition of the ordering
algorithm.
After performing the ordering phase, the nodes will
be scheduled in the order {A, C, G, H, D, J, I, E, B,
F}. Notice that the nodes that have been ordered as
predecessors (i.e. I, E, B and F) will be scheduled as
late as possible while the nodes ordered as successors
will be scheduled as soon as possible.
Figure 8: Types of recurrences.
3.2 Pre-ordering of graphs with recurrence circuits
In order not to degrade performance when there
are recurrence circuits, the ordering step is performed
giving priority to the recurrence circuits with higher
RecMII. The main idea is to reduce all the recurrence
circuits to the Hypernode, while ordering their nodes.
After this step, we have a dependence graph without
recurrence circuits, with an initial Hypernode and with
a partial ordering of all the nodes that were contained
in recurrence circuits. Then, we order this dependence
graph as shown in Subsection 3.1.
Before presenting the ordering algorithm for
recurrence circuits, let us put forward some considerations
about recurrences. Recurrence circuits can be
classified as:
• Single recurrence circuits (Figure 8a).
• Recurrence circuits that share the same set of
backward edges (Figure 8b). We use the term
recurrence subgraph for the set of recurrence circuits that
procedure Ordering_Recurrences(G, L, List, h);
{ This procedure takes the dependence graph (G) }
{ and the simplified list of recurrence subgraphs (L) }
{ It returns a partial list of ordered nodes (List) }
{ and the resulting hypernode (h) }
  V' := Head(L);
  G' := Generate_Subgraph(V', G);
  h := First(G');
  List := {h};
  List := Pre_Ordering(G', List, h);
  while L ≠ ∅ do
    V' := Search_All_Paths({h, Head(L)}, G);
    G' := Generate_Subgraph(V', G);
    List := Pre_Ordering(G', List, h);
Figure 9: Procedure to order the nodes in recurrence
circuits
share the same set of backward edges. In this
way Figures 8a and 8b are recurrence subgraphs.
• Several recurrence circuits can share some of their
nodes (Figures 8c and 8d) but have distinct sets
of backward edges. In this case we consider that
these recurrence circuits are different recurrence
subgraphs.
All recurrence circuits are identified during the cal- 
culation of RecMII . For instance, the recurrence cir- 
cuits of the graph of Figure 8b are {A, D, E} and 
{A, B, C, E}. During the identification of recurrence 
circuits they are simplified so that recurrence circuits 
that belong to the same recurrence subgraph (i.e. that 
have the same set of backward edges) are stored as a 
single recurrence subgraph (in the worst case we can 
have a recurrence subgraph for each backward edge). 
For instance, the recurrence circuits associated with
Figure 8b are stored as the recurrence subgraph {A,
B, C, D, E}. After all recurrence subgraphs have been
stored, all the redundant nodes are removed so that a
node only appears in the list associated with one
recurrence subgraph. The nodes that appear in more
than one recurrence subgraph are removed from all
the sublists except for the most restrictive sublist in
terms of RecMII (i.e. the first one in the list of
recurrence subgraphs). For instance, the list of recurrence
subgraphs associated with Figure 8c, {{A, C, D}, {B,
C, E}}, will be simplified to the list {{A, C, D}, {B,
E}}.
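The removal of redundant nodes can be sketched in a few lines. The sketch assumes the input list is already ordered by decreasing RecMII, so the first sublist in which a node appears is the most restrictive one. The function name and the sample input (two subgraphs sharing one node) are hypothetical.

```python
def simplify(subgraphs):
    """Keep each node only in the first (most restrictive) sublist."""
    seen, result = set(), []
    for nodes in subgraphs:
        kept = [n for n in nodes if n not in seen]
        seen.update(kept)
        if kept:                      # drop sublists that became empty
            result.append(kept)
    return result

# Two recurrence subgraphs sharing node C, most restrictive first.
print(simplify([["A", "C", "D"], ["B", "C", "E"]]))
# [['A', 'C', 'D'], ['B', 'E']]
```

Dropping sublists that become empty is a choice made here for convenience. The text does not say how such a case is handled.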
Once the lists of nodes have been simplified, the actual
ordering for recurrence circuits is performed. The
algorithm that orders the recurrence circuits (see Figure
9) takes as input a list L of the recurrence subgraphs
ordered by decreasing values of their RecMII. Each
entry in this list is a list of the nodes traversed by
the associated recurrence subgraph. Trivial recurrence
circuits, i.e. dependences from an operation to itself,
do not affect the pre-ordering step since they impose
no scheduling constraints, as the scheduler has
previously ensured that II ≥ RecMII. The algorithm
starts by generating the corresponding subgraph for
the first recurrence circuit, but without one of the
backward edges that causes the recurrence (we
remove the backward edge with the highest δ(u,v)).
Therefore the resulting subgraph has no recurrences and
can be ordered using the algorithm without
recurrences presented in Subsection 3.1. The whole
subgraph is reduced to the Hypernode. Then, we look for
all paths between the Hypernode and the next
recurrence subgraph (in order to properly use the algorithm
Search_All_Paths it is required that all the backward
edges causing recurrences have been removed from the
graph). After that, the graph containing the
Hypernode, the next recurrence circuit, and all the nodes that
are in paths that connect them is ordered by applying
the algorithm without recurrence circuits and reduced
to the Hypernode. If there is no path between the
Hypernode and the next recurrence circuit, any node
of the recurrence circuit is reduced to the Hypernode,
so that the recurrence circuit is now connected to the
Hypernode.
Figure 10: Example for Ordering_Recurrences
procedure
This process is repeated until there are no more 
recurrence subgraphs in the list. At this point all 
the nodes in recurrence circuits or in paths connecting 
them have been ordered and reduced to the Hypern- 
ode. Therefore the graph that contains the Hypernode 
and the remaining nodes, is a graph without recur- 
rence circuits, that can be ordered using the algorithm 
presented in the previous subsection. 
For instance, consider the dependence graph of
Figure 10a. This graph has two recurrence subgraphs, {A,
C, D, F} and {G, J, M}. Next we will illustrate the
reduction of the recurrence subgraphs:
1. The subgraph {A, C, D, F} is the one that
imposes most restrictions on RecMII. Therefore
the algorithm starts by ordering it. If we
isolate this subgraph and remove the backward edge
we obtain the graph of Figure 10b. After
ordering this graph the list of ordered nodes is
(List = {A, C, D, F}). When the graph of Figure
10b is reduced to the Hypernode ℋ in the
original graph (Figure 10a), we obtain the dependence
graph of Figure 10c.
2. The next step is to reduce the following
recurrence subgraph {G, J, M}. For this purpose
the algorithm searches all the nodes that are in
all possible paths between ℋ and the recurrence
subgraph. Then, the graph that contains these
nodes is constructed (see Figure 10d). Since
backward edges have been removed, this graph has
no recurrence circuits, so it can be ordered
using the algorithm presented in the previous
section. When the graph has been ordered, the list of
nodes is appended to the previous one, resulting in
the partial list (List = {A, C, D, F, I, G, J, M}).
Then, this subgraph is reduced to the Hypernode
in the graph of Figure 10c, producing the graph of
Figure 10e.
3. At this point, we have a partial ordering of the
nodes belonging to recurrences, and the initial
graph has been reduced to a graph without
recurrence circuits (Figure 10e). This graph without
recurrence circuits is ordered as presented in
Subsection 3.1. So finally the list of ordered nodes is
List = {A, C, D, F, I, G, J, M, H, E, B, L, K}.
3.3 Scheduling step 
The scheduling step places the operations in the
order given by the ordering step. The scheduler tries
to place each operation as close as possible to the
neighbors that have already been scheduled. When an
operation is to be scheduled, it is handled in one of
three ways, depending on which of its neighbors
are in the partial schedule.
• If an operation u has only predecessors in the
partial schedule, then u is scheduled as soon as
possible. In this case the scheduler computes the
Early_Start of u as:
$Early\_Start_u = \max_{v \in PSP(u)} \left( t_v + \lambda_v - \delta_{(v,u)} \times II \right)$
where t_v is the cycle where v has been scheduled,
λ_v is the latency of v, δ(v,u) is the dependence
distance from v to u, and PSP(u) is
the set of predecessors of u that have been
previously scheduled. Then the scheduler scans
the partial schedule for a free slot for the node
u starting at cycle Early_Start_u until the cycle
Early_Start_u + II − 1. Notice that, due to the
modulo constraint, it makes no sense to scan more
than II cycles.
• If an operation u has only successors in the partial
schedule, then u is scheduled as late as possible.
In this case the scheduler computes the Late_Start
of u as:
$Late\_Start_u = \min_{v \in PSS(u)} \left( t_v - \lambda_u + \delta_{(u,v)} \times II \right)$
where PSS(u) is the set of successors of u that
have been previously scheduled. Then the
scheduler scans the partial schedule for a free slot
for the node u starting at cycle Late_Start_u until
the cycle Late_Start_u − II + 1.
• If an operation u has predecessors and
successors, then the scheduler scans the partial
schedule starting at cycle Early_Start_u until the cycle
min(Late_Start_u, Early_Start_u + II − 1).
If no free slots are found for a node, then the II is  
increased by 1. The scheduling step is repeated with 
the increased II , which will have more opportunities 
for finding free slots. One of the advantages of our 
proposal is that the nodes are ordered only once, even 
if the scheduling step has to do several trials. 
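The scheduling step can be sketched as follows. This is a minimal illustration under several assumptions that go beyond the text: all functional units are identical and fully pipelined, so the modulo reservation table is just a count per kernel row, the first operation is placed at cycle 0, and the edge list for the motivating example is inferred from the cycles quoted in Section 2.1 rather than taken from Figure 1. The function and variable names are hypothetical.

```python
def try_schedule(order, edges, latency, ii, units):
    """Place nodes in `order`; return {node: cycle}, or None if some node
    finds no free slot. `edges` maps (u, v) to the dependence distance."""
    cycle, used = {}, [0] * ii        # used[r]: operations issued in row r
    for u in order:
        preds = [(v, d) for (v, w), d in edges.items() if w == u and v in cycle]
        succs = [(w, d) for (v, w), d in edges.items() if v == u and w in cycle]
        early = max((cycle[v] + latency[v] - d * ii for v, d in preds), default=0)
        late = min((cycle[w] - latency[u] + d * ii for w, d in succs), default=None)
        if preds and succs:           # bounded on both sides
            candidates = range(early, min(late, early + ii - 1) + 1)
        elif succs:                   # as late as possible, scanning backwards
            candidates = range(late, late - ii, -1)
        else:                         # as soon as possible (or first node)
            candidates = range(early, early + ii)
        slot = next((c for c in candidates if used[c % ii] < units), None)
        if slot is None:
            return None
        cycle[u] = slot
        used[slot % ii] += 1
    return cycle

def modulo_schedule(order, edges, latency, mii, units):
    """Retry with II + 1 until every node fits; the order is computed once."""
    ii = mii
    while True:
        result = try_schedule(order, edges, latency, ii, units)
        if result is not None:
            return ii, result
        ii += 1

# Motivating example: edges are an assumption consistent with Section 2.1.
EXAMPLE_EDGES = {("A", "B"): 0, ("B", "C"): 0, ("B", "D"): 0,
                 ("D", "F"): 0, ("E", "F"): 0, ("F", "G"): 0}
EXAMPLE_LATENCY = {n: 2 for n in "ABCDEFG"}
EXAMPLE_ORDER = ["A", "B", "C", "D", "F", "E", "G"]

print(modulo_schedule(EXAMPLE_ORDER, EXAMPLE_EDGES, EXAMPLE_LATENCY, 2, 4))
```

Under these assumptions the sketch places F at cycle 7 because row 0 of the kernel is already full, and then places E at cycle 5 relative to F, as in the walkthrough of Figure 4a.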
4 Results 
HRMS has been implemented in C++ using the
LEDA libraries [17]. In this section we present some
results of our experimental study. We compare HRMS
with other scheduling methods using a small set of
dependence graphs for which there are previously
published results. In addition, HRMS has been
exhaustively tested and evaluated for over one thousand loops
from the Perfect Club Benchmark Suite [3]. For these
loops, HRMS performance is compared with that of a
Top-Down scheduler.
4.1 Comparison with other scheduling 
We have evaluated how well our method performs 
compared with 3 leading methods. The selected meth- 
ods are: an heuristic method that does not take into 
account register requirements [23 FRLC), a life-time 
sensitive heuristic method [lo] {$hack) and a linear 
programming method [8](SPILP). 
We used 24 dependence graphs from [SI with a ma- 
chine configuration with l FP Adder, l FP Multiplier, 
1 FP Divider and 1 Load/Store unit. We have as- 
sumed a unit latency for add, subtract and store in- 
structions, a latency of 2 for multiply and load, and a 
latency of 17 for divide. 
Table 1 compares the initiation interval (II), the
number of buffers (Buf) and the total execution time
of the scheduler on a Sparc-10/40 workstation, for the
four scheduling methods. The results for the other
three methods have been obtained from [8]. The number
of buffers required by a schedule is defined in [8]
as the sum of the buffers required by each value in the 
loop. A value requires as many buffers as the number 
of times the producer instruction is issued before the 
issue of the last consumer. In addition, stores require 
one buffer. In [18] it was demonstrated that the buffer 
requirements provide a very tight upper bound on the 
total register requirements. 
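The buffer count defined above can be illustrated with a short Python sketch. It assumes that "before" means strictly earlier, that the producer of a value is re-issued every II cycles, and that consumer cycles are given in absolute time (already including any loop-carried distance); the function names are hypothetical and are not taken from [8].

```python
def buffers_for_value(t_producer, consumer_cycles, ii):
    """Producer issues (one every II cycles) strictly before the last consumer issue."""
    lifetime = max(consumer_cycles) - t_producer
    return -(-lifetime // ii)          # integer ceiling of lifetime / II


def total_buffers(values, num_stores, ii):
    """values: (producer cycle, [consumer cycles]) pairs; each store adds one buffer."""
    return sum(buffers_for_value(t, uses, ii) for t, uses in values) + num_stores
```

For example, with II = 2 a value produced at cycle 0 whose last consumer is issued at cycle 5 needs 3 buffers under these assumptions, because the producer is issued at cycles 0, 2 and 4 before that consumer.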
Table 2 summarizes the main conclusions of the
comparison. The entries of the table represent the
number of loops for which the schedules obtained by
HRMS are better (II <), equal (II =), or worse
(II >) than the schedules obtained by the other methods,
in terms of the initiation interval. When the initiation
interval is the same, it also shows the number of
loops for which HRMS requires fewer buffers (Buf <),
an equal number of buffers (Buf =), or more buffers
(Buf >). Notice that HRMS achieves the same performance
as the SPILP method both in terms of II and
buffer requirements. When compared to the other
methods, HRMS obtains a lower II in about 33% of
the loops. For the remaining 66% of the loops the
II is the same but in many cases HRMS requires fewer
buffers, especially when compared to FRLC.
Table 1: Comparison of HRMS schedules with other scheduling methods. 
Table 2: Comparison of HRMS performance versus 
the other 3 methods. 
                     | HRMS | SPILP  | Slack | FRLC
Compilation time (s) | 1.45 | 290.72 | 0.93  | 0.71
Table 3: Comparison of HRMS compilation time to 
the other 3 methods. 
Finally, Table 3 compares the total compilation time
in seconds for the four methods. Notice that HRMS is
slightly slower than the two heuristic methods, but
these methods perform noticeably worse in finding a
good schedule. On the other hand, the linear programming
method (SPILP) requires a much higher
time to construct a schedule that turns out to have
performance similar to that of the schedule produced by
HRMS. In fact, most of the time spent by SPILP is
due to Livermore Loop 23, but even without taking
this loop into account, HRMS is over 40 times faster.
4.2 Further evaluation of HRMS 
In order to further evaluate HRMS, the dependence
graphs of all the innermost DO loops of the Perfect
Club have been obtained with the ICTINEO compiler
[2]. We have not measured loops with subroutine calls
or with conditional exits. Loops with conditionals in
their body have been converted to single basic block
loops using IF-conversion [1]. A total of 1258 loops,
which account for 78% of the total execution time of
the Perfect Club³, have been scheduled.
We assume a unit latency for store instructions, 
a latency of 2 for loads, a latency of 4 for addi- 
tions and multiplications, a latency of 17 for divi- 
sions and a latency of 30 for square roots. The 
loops have been scheduled for a machine configura- 
tion with 2 load/store units, 2 adders, 2 multipliers 
and 2 Div/Sqrt units. All units are fully pipelined 
except the Div/Sqrt units which are not pipelined at 
all. 
Scheduling all the loops consumed 5.5 minutes on a
Sparc-10/40 workstation. Computing the recurrence
circuits consumed only 3.2% of the scheduler execution
time and the pre-ordering step consumed only
9% of the scheduler execution time. In contrast, the
scheduling step consumed 87.8% of the scheduler
execution time. Notice the minimal impact of the
ordering step (which is the key part of HRMS) on the
overall scheduling time.
³ Executed on an HP 9000/735 workstation.
Figure 11: Static cumulative distribution of register
requirements of loop variants.
[Plot: x-axis, number of registers; curves, HRMS and Top-Down]
Figure 12: Dynamic cumulative distribution of register
requirements of loop variants.
[Plot: x-axis, number of registers; curves, HRMS and Top-Down]
In order to evaluate performance, and to obtain 
dynamic results, the execution time (in cycles) of a
scheduled loop has been estimated as the II of this
loop times the number of iterations this loop performs 
(i.e. the number of times the body of the loop is exe- 
cuted). For this purpose the programs of the Perfect 
Club have been instrumented to count iterations for 
the selected loops. 
HRMS achieved optimal execution time (II = MII)
for 1227 loops (97.5%). On average, the scheduler
achieved an II = 1.01 × MII. Considering dynamic
execution time, the scheduled loops would execute at
98.4% of the maximum performance.
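The static and dynamic figures above can be related with a small Python sketch. It assumes that each loop is summarized by a hypothetical triple (II, MII, iterations) and that "maximum performance" means every loop running at its MII; both the data layout and the function names are assumptions made for illustration.

```python
def estimated_cycles(loops):
    """Estimated execution time: II times the iteration count, summed over loops."""
    return sum(ii * n for ii, _mii, n in loops)


def fraction_of_peak(loops):
    """Dynamic metric: cycles if every loop ran at MII, over the estimated cycles."""
    ideal = sum(mii * n for _ii, mii, n in loops)
    return ideal / estimated_cycles(loops)


def share_optimal(loops):
    """Static metric: fraction of loops scheduled with II equal to MII."""
    return sum(1 for ii, mii, _n in loops if ii == mii) / len(loops)
```

The static metric counts every loop once, whereas the dynamic one weights each loop by how often its body executes, which is why the two percentages reported above differ.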
Once the loops have been scheduled, a lower bound 
on the register pressure of the loops (MaxLive) can 
be found by computing the maximum number of val- 
ues that are alive at any cycle of the schedule. This 
section approximates the register requirements by this 
lower bound⁴. Lifetimes of loop variants start when
the producer is issued and end when the last consumer
is issued. Loop invariants are produced before entering
the loop and are alive during the whole execution of
the loop, so each of them requires one register throughout.
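A minimal Python sketch of the MaxLive computation follows. It assumes that each loop-variant lifetime is given as a half-open interval [start, end) in cycles of a single iteration (start: issue of the producer, end: issue of the last consumer) and that lifetimes of overlapped iterations fold onto the II rows of the kernel; the function name and arguments are hypothetical.

```python
def max_live(lifetimes, ii, num_invariants=0):
    """Maximum number of values alive in any cycle of the steady-state kernel."""
    live = [0] * ii
    for start, end in lifetimes:
        # A value alive at absolute cycle c occupies kernel row c mod II.
        for c in range(start, end):
            live[c % ii] += 1
    # Each loop invariant is alive during the whole loop: one register each.
    return max(live) + num_invariants
```

With II = 2, the lifetimes [0, 5) and [2, 4) give 4 live loop variants in row 0 and 3 in row 1, so MaxLive is 4 before loop invariants are added.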
Since a scheduler can only reduce the register
requirements due to loop variants, Figure 11 compares
the register requirements of loop variants for both schedulers.
On average HRMS requires 87% of the registers
required by the Top-Down scheduler. Even though there
are few loops requiring a high number of registers,
these loops represent an important
share of the execution time of the Perfect
Club (see [15]). Figure 12 shows the dynamic register
requirements (i.e. each loop has been weighted by its
execution time) of loop variants for both schedulers.
⁴ For an extensive discussion of the problem of allocating
registers for software-pipelined loops refer to [21]. The allocation
strategies presented in that paper almost always achieve the
MaxLive lower bound. In particular, the wands-only strategy
using end-fit with adjacency ordering never requires more
than MaxLive + 1 registers.
In most machines, loop variants and loop invariants
are stored in the same register file, so their combined
register pressure is also of interest. Figure 13
shows the dynamic register requirements of loop variants
plus loop invariants. Notice that about 20% (it
varies depending on the scheduler) of the cycles are
spent in loops requiring more than 64 registers and
45% of the cycles are spent in loops requiring more than
32 registers.
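The difference between the static and the dynamic distributions can be made concrete with a short Python sketch. It assumes each loop is described by a hypothetical pair (registers required, estimated cycles); the function name and the sample data in the note below are illustrative and are not measurements from the Perfect Club.

```python
def fraction_above(loops, limit, dynamic=True):
    """Fraction of loops (static) or of cycles (dynamic) needing more than `limit` registers."""
    def weight(cycles):
        return cycles if dynamic else 1    # dynamic: weight each loop by its execution time
    total = sum(weight(c) for _regs, c in loops)
    above = sum(weight(c) for regs, c in loops if regs > limit)
    return above / total
```

For three made-up loops `[(10, 100), (40, 300), (70, 600)]`, two thirds of the loops need more than 32 registers, but those loops account for 90% of the cycles, which is the effect the dynamic distribution is meant to expose.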
Given that actual machines have a limited number 
of registers (generally 32), it is also of interest to eval- 
uate the effect on performance of loop variants plus 
loop invariants when there is a fixed amount of avail- 
able registers. Figure 14 shows the execution time of 
the loops scheduled with both schedulers when there 
are infinite, 64 and 32 registers available. When a loop 
requires more than the available number of registers, 
spill code has been added [15] and the loop has been re- 
scheduled. Notice that the code generated by HRMS 
is about 43% faster in a machine with 64 registers and 
about 21% faster in a machine with 32 registers for 
the assumed architecture. We can also observe that 
the code generated by HRMS for a machine with 32 
registers runs almost as fast as the code generated by 
the Top-Down scheduler for a machine with 64 regis- 
ters. 
5 Conclusions 
This paper has presented Hypernode Reduction 
Modulo Scheduling (HRMS), a novel and effective 
technique for resource-constrained software pipelining. 
HRMS can deal with loops containing loop-carried
dependences and attempts to optimize the initiation
interval while reducing the register requirements of the
schedule.
Figure 13: Dynamic cumulative distribution of register
requirements of loop variants plus loop invariants.
[Plot: x-axis, number of registers; curves, HRMS and Top-Down]
HRMS pre-orders the nodes of the dependence 
graph before scheduling them. The ordering function 
gives priority to recurrence circuits, in order not to 
penalize the initiation interval. In addition, nodes are
ordered in such a way that, when a node is scheduled,
the partial schedule contains at least one reference node (a
predecessor or a successor). The ordering step guarantees
that (except in the special case of recurrence
circuits) only predecessors or only successors of the current
node are already scheduled, but not both.
Nodes are scheduled after being ordered. The 
scheduling step schedules a node as soon as possible 
if it has predecessors already scheduled, and schedules 
a node as late as possible if it has successors already 
scheduled. Scheduling nodes in this way shortens life- 
times of loop variants, and therefore reduces the reg- 
ister requirements of the schedule. 
The usefulness of HRMS has been empirically es- 
tablished by applying it to several loops taken from 
common scientific benchmarks. We have compared 
our schedules with three leading methods, namely
Govindarajan et al.'s SPILP integer programming formulation,
Huff's Slack Scheduling and Wang et al.'s
FRLC scheduling. Our schedules exhibit significant
improvement in performance in terms of initiation interval
and buffer requirements compared to FRLC,
and a significant improvement in the initiation interval
when compared to the Slack lifetime-sensitive heuristic.
We obtained results similar to those of SPILP, which required
up to two orders of magnitude more computing time
to obtain the schedules.
Finally, we provided an exhaustive evaluation of
HRMS using 1258 loops from the Perfect Club Benchmark
Suite. HRMS generates schedules that are optimal
in terms of II for 97.5% of the loops. The pre-ordering
step has a minimal impact on the scheduling
time. HRMS has also been compared with a Top-Down
scheduler that does not take register
requirements into account. It has been shown that, when there is
a limited number of registers, HRMS has a large
performance advantage.
Figure 14: Cycles required to execute the loops with
infinite registers, 64 registers and 32 registers.
[Plot: x-axis, number of registers (inf, 64, 32)]
Acknowledgments 
The authors would like to thank Q. Ning, R. Govindarajan,
Erik R. Altman and Guang R. Gao for supplying
us with the dependence graphs they used in [8] in
order to compare our proposal with other methods.
Thanks are also due to Enric Riera and the rest of
the ICTINEO team for their support of this work. Finally,
we would like to thank the anonymous referees
for their valuable comments and suggestions.
References 
[1] J.R. Allen, K. Kennedy, and J. Warren. Conversion
of control dependence to data dependence. In Proc.
10th annual Symposium on Principles of Programming
Languages, January 1983.
[2] E. Ayguadé, C. Barrado, J. Labarta, D. López,
S. Moreno, D. Padua, and M. Valero. A uniform
representation for high-level and instruction-level
transformations. Technical Report UPC-DAC-95-02,
Universitat Politècnica de Catalunya, January 1995.
[3] M. Berry, D. Chen, P. Koss, and D. Kuck. The Perfect
Club benchmarks: Effective performance evaluation 
of supercomputers. Technical Report 827, Center for 
Supercomputing Research and Development, Novem- 
ber 1988. 
[4] A. Capitanio, N. Dutt, and A. Nicolau. Partitioned
register files for VLIWs: A preliminary analysis of
tradeoffs. In MICRO-25, pages 292-300, 1992.
[5] J.C. Dehnert, P.Y.T. Hsu, and J.P. Bratt. Overlapped
loop support in the Cydra 5. In Proceedings of the
Third International Conference on Architectural Support
for Programming Languages and Operating Systems,
pages 26-38, 1989.
[6] J.C. Dehnert and R.A. Towle. Compiling for Cydra 5.
Journal of Supercomputing, 7(1/2):181-227, 1993.
[7] A.E. Eichenberger, E.S. Davidson, and S.G. Abra- 
ham. Optimum modulo schedules for minimum reg- 
ister requirements. In Proc., Internat. Conf. On Su- 
percomputing, pages 31-40, July 1995. 
[8] R. Govindarajan, E.R. Altman, and G.R. Gao. Mini- 
mal register requirements under resource-constrained 
software pipelining. In Proceedings of the 27th Annual 
International Symposium on Microarchitecture, pages 
85-94, November 1994. 
[9] P.Y.T. Hsu. Design of the R8000 microprocessor.
Technical report, MIPS Technologies, Inc, June 1994.
[10] R.A. Huff. Lifetime-sensitive modulo scheduling. In
6th Conference on Programming Language Design
and Implementation, pages 258-267, 1993.
[11] S. Jain. Circular scheduling: A new technique to perform
software pipelining. In Proceedings of the ACM
SIGPLAN '91 Conference on Programming Language
Design and Implementation, pages 219-228, June
1991.
[12] M.S. Lam. Software pipelining: An effective scheduling
technique for VLIW machines. In Proceedings of
the SIGPLAN '88 Conference on Programming Language
Design and Implementation, pages 318-328,
June 1988.
[13] M.S. Lam. A Systolic Array Optimizing Compiler.
Kluwer Academic Publishers, 1989.
[14] J. Llosa, M. Valero, and E. Ayguadé. Non-consistent
dual register files to reduce register pressure. In 1st
Symposium on High Performance Computer Architecture,
pages 22-31, January 1995.
[15] J. Llosa, M. Valero, E. Ayguadé, and J. Labarta.
Register requirements of pipelined loops and their
effect on performance. In 2nd Int. Workshop on
Massive Parallelism: Hardware Software and Applications,
October 1994.
[16] W. Mangione-Smith, S.G. Abraham, and E.S. Davidson.
Register requirements of pipelined processors. In
Int. Conference on Supercomputing, pages 260-246,
July 1992.
[17] K. Mehlhorn and S. Näher. LEDA, a library of efficient
data types and algorithms. Technical Report TR
A 04/89, Universität des Saarlandes, Saarbrücken,
1989.
[18] Q. Ning and G. R. Gao. A novel framework of register
allocation for software pipelining. In Conf. Rec. of the
Twentieth Ann. ACM SIGPLAN-SIGACT Symp. on
Principles of Programming Languages, pages 29-42,
January 1993.
[19] B.R. Rau. Iterative modulo scheduling: An algorithm 
for software pipelining loops. In Proceedings of the 
27th Annual International Symposium on Microarchi- 
tecture, pages 63-74, November 1994. 
[20] B.R. Rau and C.D. Glaeser. Some scheduling tech- 
niques and an easily schedulable horizontal architec- 
ture for high performance scientific computing. In 
Proceedings of the 14th Annual Microprogramming
Workshop, pages 183-197, October 1981.
[21] B.R. Rau, M. Lee, P. Tirumalai, and P. Schlansker.
Register allocation for software pipelined loops. In 
Proceedings of the ACM SIGPLAN’92 Conference on 
Programming Language Design and Implementation, 
pages 283-299, June 1992. 
[22] J.A. Swensen and Y.N. Patt. Hierarchical registers 
for scientific computers. In Int. Conference on Super- 
computing, 1988. 
[23] J. Wang and C. Eisenbeis. Decomposed software
pipelining: A new approach to exploit instruction
level parallelism for loop programs. In IFIP,
January 1993.
[24] S.W. White and S. Dhawan. POWER2: Next generation
of the RISC System/6000 family. In IBM RISC
System/6000 Technology: Volume II. IBM Corporation,
1993.
