Modulo scheduling with integrated register spilling for clustered VLIW architectures by Zalamea León, Francisco Javier et al.
Modulo Scheduling with Integrated Register Spilling 
for Clustered VLIW Architectures 
Javier Zalamea, Josep Llosa, Eduard Ayguadé and Mateo Valero *
Departament d'Arquitectura de Computadors (UPC)
Universitat Politècnica de Catalunya
{jzalamea,josepll,eduard,mateo}@ac.upc.es
Abstract 
Clustering is a technique to decentralize the design of 
future wide issue VLIW cores and enable them to meet
the technology constraints in terms of cycle time, area and
power dissipation. In a clustered design, registers and functional
units are grouped in clusters so that new instructions
are needed to move data between them. New aggressive 
instruction scheduling techniques are required to minimize 
the negative effect of resource clustering and delays in mov- 
ing data around. 
In this paper we present a novel software pipelining tech- 
nique that performs instruction scheduling with reduced 
register requirements, register allocation, register spilling 
and inter-cluster communication in a single step. The al- 
gorithm uses limited backtracking to reconsider previously 
taken decisions. This backtracking provides the algorithm 
with additional possibilities for obtaining high throughput 
schedules with low spill code requirements for clustered ar- 
chitectures. We show that the proposed approach outper- 
forms previously proposed techniques and that it is very 
scalable independently of the number of clusters, the num- 
ber of communication buses and communication latency. 
The paper also includes an exploration of some parameters 
in the design of future clustered VLIW cores.
1. Introduction 
Semiconductor technology is continuously favoring 
packing more logic into a single chip. This extra logic 
allows the implementation of wider issue configurations 
with more resources (functional units, registers, etc.) and increases
the potential instruction-level parallelism (ILP) that
can be exploited. Very-long Instruction Word (VLIW) ar- 
chitectures can benefit from this tendency and are perfectly 
* This work has been supported by the Ministry of Education of Spain
under contract TIC 98/511 and by CEPBA (European Center for Parallelism
of Barcelona). Javier Zalamea is granted by the Agencia Española
de Cooperación Internacional.
1072-4451/01 $10.00 © 2001 IEEE
Figure 2. a) Cycle time, b) area and c) power dis- 
sipation for a clustered architecture composed of 8 
general-purpose functional units and 4 memory ports, 
with varying number of registers in each register file. 
and power consumption, is a major goal when proposing 
clustered organizations for future designs. 
The use of clustered design is not a new idea and is cur- 
rently found in several systems, either based on dynamic 
scheduling or statically scheduled VLIW. Clustered designs 
are also found in many commercial embedded and DSP 
processors such as TI's TMS320C6x [17], Equator's
MAP1000 [15], ADI's TigerSharc [14] and HP Lx [11].
The reduced complexity of clustered designs and the 
decentralized design translates into lower area cost, lower 
power consumption and higher clock rates. Figure 2 shows 
the cycle time, area and power consumption (using the model of
Rixner et al. [29]) of a VLIW core with 8 functional units
and 4 memory ports organized as a unified, as a 2-cluster
and as a 4-cluster processor core. For example, a 4-cluster
processor with 64 registers per cluster (i.e. 256 registers in
total) has a cycle time slightly below a 16-register unified
configuration and requires an area similar to a 32-register
unified configuration and a power consumption close to a 
1 &register unified configuration. 
Statically scheduled VLIWs require efficient compiler 
technology to extract ILP from applications. Software 
pipelining [5] is a very effective instruction scheduling technique
for loop-intensive codes that overlaps the execution
of various successive iterations. Different approaches have
been proposed in the literature [1] for the generation of software
pipelined schedules. Modulo scheduling [26] is a class
of software pipelining algorithms that is very cost effective 
and has been implemented in many production compilers. 
Most of the early modulo scheduling techniques focused 
mainly on achieving high throughput [1, 7, 25, 28].
However, one of the drawbacks of modulo scheduling
(and software pipelining in general) is that they increase the
register requirements. This has motivated some recent modulo
scheduling approaches that not only try to maximize
throughput but also try to minimize register requirements
[6, 9, 16, 20, 22]. Despite obtaining schedules with reduced
register requirements, if a schedule requires more registers
than those available in the processor some additional steps
are needed, such as an increase in the initiation interval (II)
[21], the addition of spill code and rescheduling of the loop
[21] or a combination of all of these [32].
These approaches for register constrained software 
pipelining follow a two-step process. In the first step, the 
loop is scheduled without considering register constraints. 
Once the loop is scheduled, register allocation is performed 
over the existing schedule. If the loop requires more registers
than those available on the target architecture, the corresponding
action is taken (i.e. increase the II and/or insert
spill code) and the loop is scheduled again. MIRS (Mod- 
ulo scheduling with Integrated Register Spilling) [33] is an 
approach that performs modulo scheduling, register alloca- 
tion, and register spilling simultaneously in the same step. 
This is achieved thanks to the use of an iterative modulo 
scheduling approach with backtracking (i.e. with the pos- 
sibility of undoing previously taken scheduling and spilling 
decisions). 
Schedulers for clustered VLIW processors based on 
modulo scheduling have followed a similar approach. The 
early ones did not pay attention to register pressure [12],
tried to minimize it [23], or used simple approaches to re- 
duce it whenever sufficient registers were not available (e.g. 
increasing the II [31]). 
Scheduling instructions for clustered architectures re- 
quires the same compiler technology as unified architec- 
tures (efficient instruction scheduling, register requirement 
minimization, register spilling) and in addition requires the 
assignment of operations to clusters and the insertion of 
communication operations. Each communication (move) 
requires the scheduling of a coupled send-receive pair in 
the source-destination cluster which is a complex operation 
(in terms of reservation table) which adds extra difficulty to 
the scheduling task. 
In this paper we present MIRS-C (Modulo scheduling 
with Integrated Register Spilling and Cluster assignment). 
MIRS-C performs modulo scheduling, register allocation, 
spilling and cluster assignment simultaneously in a single 
step. The proposal is based on an iterative approach that al- 
lows us to undo previously taken scheduling decisions, re- 
move previous spill actions and to remove previously added 
cluster communication operations. In this paper we show
that MIRS-C outperforms previously used techniques both
when there is an unbounded number of registers available
and when there are register constraints. In addition, a
broad evaluation of several clustered and unified configurations
shows that the clustered configurations suffer a very
low degradation in terms of cycles with respect to
the unified architectures (which can be considered an upper
bound). However, when cycle time is factored in, the clustered
configurations clearly outperform the unified ones.
The paper is organized as follows. Section 2 describes 
some related work on scheduling for clustered architectures. 
Section 3 presents MIRS-C, first paying attention to its iterative
nature and backtracking capabilities and then focusing
on clustered architectures. Section 4 performs a qualitative
and quantitative evaluation of the schedules generated
by MIRS-C and compares them with the ones achieved
by a non-iterative proposal. This section also contributes 
an evaluation of the performance of clustered VLIW cores 
when using MIRS-C. Finally, Section 5 concludes the paper 
and describes some future work. 
2. Related work on instruction scheduling for
clustered architectures
There have been a number of previous proposals for han- 
dling the problem of scheduling for clustered architectures. 
Some of them focus on acyclic schedules and perform the 
cluster assignment and instruction scheduling in two sequential
steps [8, 10, 18]. Some of them (such as [4]) also
handle constraints in terms of reduced connectivity between 
the registers and the functional units. Recently there have 
been other proposals [19, 24] for solving the cluster assignment
and instruction scheduling in a single step.
In the context of cyclic schedules (such as modulo 
scheduling), there have been a few proposals to solve the 
same problem. Some of them [23] perform the job in two 
sequential steps (cluster selection and instruction schedul- 
ing). Other approaches perform them in a single step, such 
as [12] for an unusual register file organization based on a 
set of local queues for each cluster and a queue file for each 
communication channel or [31] for a bus-based clustered 
architecture. An implementation of the latter is used in Sec- 
tion 4 to compare the quality of the schedules generated by 
our proposal.
Unlike previous approaches, MIRS-C (Modulo scheduling with
Integrated Register Spilling and Cluster assignment) takes
into account loop-invariant values for both
register spilling and cluster assignment. The main characteristic
of MIRS-C is the use of an iterative approach that
allows some limited backtracking. This means that the algo- 
rithm may decide to undo previously taken decisions about 
cluster assignment and instruction scheduling during the it- 
erative process. This limited backtracking is very useful 
in scheduling operations with complex reservation tables 
(which appear in inter-cluster move operations) as well as 
in undoing, in a simple and efficient way, previously taken
spill and inter-cluster data movement decisions.
3. MIRS-C algorithm
In this section we describe the MIRS-C modulo schedul- 
ing algorithm in three steps: first, we provide some defini- 
tions which are useful for understanding MIRS-C; second, 
we present the algorithm for non-clustered architectures and 
finally we describe the additional functions required to handle
clustered architectures. More formal definitions of
some concepts used in MIRS-C are omitted for brevity, but
are included in a research report [33].
3.1. Definitions and concepts 
Dependence relationships between operations in a loop 
are usually represented using a Dependence Graph (G). This 
graph is used to schedule the operations so that these depen- 
dence relationships are honored when the loop is executed 
in a specific target architecture. In G, each vertex (node) U
represents an operation of the loop and the edges represent 
a dependence between two nodes. These dependences can 
be: register dependences, memory dependences or control 
dependences. 
The iterative approach presented in this paper first pre- 
orders the nodes in what we call the PriorityList, using 
the HRMS strategy [22]. After that, the actual iterative scheduling
process constructs a Partial Schedule S by scheduling
nodes one at a time following the order in the PriorityList. 
During this iterative process, new nodes may be added to 
G. These nodes appear because of adding spill code or be- 
cause of the need for moving data between clusters. These 
store/load and move nodes will inherit their priority from
the associated producer/consumer nodes.
The pre-ordering of nodes is done with the aim of
scheduling the loop with an II as close as possible to the minimum
initiation interval (MII) and using a minimum number of
registers. To achieve this, priority is given to recurrences so
as not to stretch out any recurrence circuit. It also ensures
that when a node is scheduled, the current partial schedule
contains only predecessor nodes or successor nodes of
the node, but never both (unless the node is the last node of
a recurrence circuit to be scheduled) [22].
During the scheduling step, each node is placed on the 
partial schedule S as close as possible to its neighbors that 
have already been scheduled. In order to achieve this, the 
following definitions are useful: 
EarlyStart for a node U in graph G is the earliest cycle at 
which the node can be scheduled in such a way that all 
its predecessors may complete their execution. 
LateStart for a node U in graph G is the latest cycle at 
which the node can be scheduled in such a way that it 
may complete its execution before its successors start 
executing, and 
Direction for a node U is the search direction for a free slot
in the partial schedule (starting from EarlyStart or
from LateStart).
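The definitions of EarlyStart and LateStart can be illustrated with a minimal sketch. The paper does not spell out the formulas here, so the code assumes the usual modulo-scheduling formulation, in which a dependence that spans `dist` iterations relaxes the constraint by `dist * II` cycles. The names `early_start`, `late_start`, `preds`, `succs`, `cycle` and `latency` are hypothetical and chosen only for this example.

```python
from math import inf

def early_start(u, preds, cycle, latency, ii):
    """Earliest cycle for u given its already-scheduled predecessors.

    preds[u] is a list of (p, dist) pairs; dist is the number of
    iterations separating p from u (0 for the same iteration).
    Assumed formula: cycle[p] + latency[p] - dist * II.
    """
    starts = [cycle[p] + latency[p] - dist * ii
              for p, dist in preds.get(u, []) if p in cycle]
    return max(starts) if starts else -inf

def late_start(u, succs, cycle, latency, ii):
    """Latest cycle for u so that it completes before its scheduled successors.

    Assumed formula: cycle[s] - latency[u] + dist * II.
    """
    starts = [cycle[s] - latency[u] + dist * ii
              for s, dist in succs.get(u, []) if s in cycle]
    return min(starts) if starts else inf

# Toy example: a -> u -> b in the same iteration, II = 4.
cycle = {"a": 0, "b": 10}            # partial schedule S
latency = {"a": 4, "u": 4, "b": 4}
preds = {"u": [("a", 0)]}
succs = {"u": [("b", 0)]}
es = early_start("u", preds, cycle, latency, ii=4)
ls = late_start("u", succs, cycle, latency, ii=4)
```

In this toy case the node can be placed anywhere between `es` and `ls`; a node with no scheduled neighbor on one side is simply unconstrained on that side.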
Figure 3 shows the most important actions related to the 
scheduling of a U node. For each U node the algorithm first 
computes EarlyStart, LateStart and Direction. With this
information the algorithm then tries to find a cycle in the 
current partial schedule S in which the node can be placed 
Procedure Schedule(G, S, U, i) {
    Early_Start(G, S, U);
    Late_Start(G, S, U);
    Direction(G, S, U);
    IF (Find_Free_Slot(S, U))
        S = Schedule_in_Cluster(i, U);
    ELSE
        S = Forcing_and_Ejection(i, U);
}
Figure 3. Main steps performed to schedule one op- 
eration. For non-clustered architectures the cluster 
number i = 0. 
without causing any resource conflict or dependence viola- 
tion. If such a cycle is found, the partial schedule is updated, 
along with the resources and registers used¹. If not, the
algorithm applies the Forcing_and_Ejection heuristic that
forces node U into a specific cycle and ejects nodes causing 
resource constraints or dependence violations. This heuris- 
tic is described in more detail later on (Subsection 3.2.2). 
In order to control the iterative nature of the algorithm, 
the following definitions are useful. The BudgetRatio is the 
number of attempts that the iterative algorithm is allowed to 
perform per node in G. Budget is the number of attempts 
that the iterative algorithm can still perform before giving 
up the current value of II. Initially, the Budget is set to the
product of the number of nodes in G by the BudgetRatio. 
The next definitions are useful in order to understand 
the phase in which to decrease the register requirements. 
Values used in a loop correspond either to loop-variant or
loop-invariant variables. Each loop-invariant variable has
only one value for all iterations of the loop. For each loop-variant
variable a new value is generated in each iteration
of the loop and thus has a different lifetime. The maximum
number of simultaneously live values (MaxLive) is a
relatively accurate approximation of the number of registers
required for the schedule [27]. The critical cycle is defined 
as the scheduling cycle in which the number of live values 
is highest. 
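As an illustration of MaxLive and the critical cycle, the following sketch counts live values per modulo slot. It assumes that each loop-variant lifetime is a half-open interval `[start, end)` in the flat schedule, that lifetimes from successive iterations overlap every II cycles, and that each loop-invariant occupies one register in every cycle. The function name `max_live` and its parameters are hypothetical.

```python
def max_live(lifetimes, ii, invariants=0):
    """Return (MaxLive, critical_cycle, live_per_slot) for a modulo schedule.

    lifetimes:  list of (start, end) pairs, half-open, in schedule cycles.
    invariants: number of loop-invariant values (live in every slot).
    """
    live = [invariants] * ii
    for start, end in lifetimes:
        # A lifetime longer than II overlaps itself across iterations,
        # so it is counted once per covered cycle, folded modulo II.
        for t in range(start, end):
            live[t % ii] += 1
    # First slot with the highest count is reported as the critical cycle.
    critical = max(range(ii), key=lambda c: live[c])
    return live[critical], critical, live

result = max_live([(0, 4), (2, 3)], ii=3, invariants=1)
```

The choice of the first maximal slot as the critical cycle is an arbitrary tie-break made for this example.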
The lifetime of a variable lasts from its definition to its
final use. The lifetime of a variable can be divided into
several sections (called uses), each of which goes from the
previous use to the current one. The lifetime section corresponding
to the latency of the producer functional unit of the
variable is called the non-spillable section.
3.2. Algorithm for non-clustered architectures 
3.2.1. Algorithm 
Figure 4 summarizes the main steps of the iterative MIRS-C
algorithm. Those steps whose numbers start with C apply 
when clustered architectures are targeted. For non-clustered 
¹ Register requirements are approximated with the maximum number
of simultaneously live values, MaxLive.
Procedure MIRS-C(G) {
    S = empty();  // Initialize
    II = MII;
    Priority_List = Order_HRMS(G);
    WHILE (!Priority_List.empty()) {
(1)     Budget = Budget_Ratio * Number_Nodes(G);
(2)     U = Priority_List.highest_priority();
(C1)    i = Select_Cluster(G, S, U);
(C2)    WHILE (Need_Move(G, S, U, i)) {
            move = Add_Move(G, U, i);
            Schedule(G, S, move, i);
        }
(3)     Schedule(G, S, U, i);
(4)     IF (Priority_List.empty()) {
            Register_Allocation(G, S);
        }
(5)     Check_and_Insert_Spill(G, S, Priority_List);
(6)     IF (Restart_Schedule(G, Budget)) {
            Re_Initialize(II++, S, Priority_List);
            GOTO (1);
        }
        Budget--;
    }
(7) Print(II, S);
}
Figure 4. Skeleton of the MIRS-C algorithm.
architectures they have no effect. The algorithm uses the 
node ordering strategy defined in [22] to assign priorities to 
the nodes in G, although, of course, other priority heuristics 
could be used. 
After picking up a U node from the PriorityList (step
2), the algorithm schedules it (step 3) as explained in Figure 3.
Once the scheduling of the U node is accomplished,
and if the PriorityList is empty, the algorithm performs
the actual register allocation² (step 4). After that, the
Check_and_Insert_Spill heuristic is applied (step 5). This
heuristic evaluates the necessity of introducing spill code
and decides which lifetimes are selected for spilling (see
Subsection 3.2.3). After applying spilling and inserting
new nodes in the dependence graph, the Restart_Schedule
heuristic (step 6) decides whether the current partial schedule
S and value of II are still valid (for continuing with the
scheduling process) or whether it is better to restart the process
with an increased value for II (see Subsection 3.2.4).
The scheduling process finishes in step (7) when the al- 
gorithm detects that the PriorityList is empty. At this point, 
the actual VLIW code is generated. 
3.2.2. Forcing_and_Ejection heuristic
When the scheduler fails to find a cycle in which to schedule
the U node, it forces the node into a specific cycle given by
    Forced_Cycle = max(EarlyStart, Prev_Cycle(i) + 1)
if the search Direction is from EarlyStart to LateStart or, if
Direction is in reverse order, by
    Forced_Cycle = min(LateStart, Prev_Cycle(i) - 1).
In both expressions, Prev_Cycle(i) is the cycle at which
the node was scheduled in the previous partial schedule
(before a possible ejection) [16].
² Sometimes MaxLive is a lower bound and it is necessary to insert
additional spill code to ensure that the schedule does not use more registers
than those available on the target architecture. These new spill nodes are
inserted into the PriorityList.
When forcing a node in a particular cycle, the heuristic 
ejects nodes that cause resource conflicts with the forced 
node. If for a particular resource conflict several candidate 
nodes are possible, the heuristic selects the one that was first 
placed in the partial schedule S. Other iterative algorithms 
[6, 16, 28] eject all the operations that cause a resource conflict.
In our iterative algorithm, only one is ejected. The
heuristic also ejects all previously scheduled predecessors 
and successors whose dependence constraints are violated 
due to the enforced placement. 
Notice that all the unscheduled operations are returned to 
the PriorityList with their original priority. Therefore, they
will be immediately picked up by the iterative algorithm for 
rescheduling. 
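The forced-cycle rule and the single-victim ejection policy of this subsection can be sketched as follows. This is only an illustration under simple assumptions: the search direction is encoded as the strings `"forward"` (EarlyStart to LateStart) and `"backward"`, and `placement_order` is a hypothetical map recording the order in which nodes entered the partial schedule.

```python
def forced_cycle(direction, early_start, late_start, prev_cycle):
    """Cycle into which a node is forced when no free slot is found."""
    if direction == "forward":          # search from EarlyStart to LateStart
        return max(early_start, prev_cycle + 1)
    return min(late_start, prev_cycle - 1)   # reverse search

def pick_ejection_victim(conflicting, placement_order):
    """Among the nodes conflicting on one resource, eject the one that
    was placed first in the partial schedule (only one is ejected)."""
    return min(conflicting, key=lambda n: placement_order[n])

fwd = forced_cycle("forward", early_start=5, late_start=9, prev_cycle=7)
bwd = forced_cycle("backward", early_start=5, late_start=9, prev_cycle=7)
victim = pick_ejection_victim(["a", "b"], {"a": 3, "b": 1})
```

Moving one cycle past the previous placement in the search direction prevents the node from being forced into the same slot repeatedly. Ejection of predecessors and successors whose dependences are violated is not modeled here.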
3.2.3. Check_and_Insert_Spill heuristic
This heuristic first compares the actual number of registers 
required (RR) in the current partial schedule and the total 
number of registers available (AR) and decides whether to 
insert spill code or proceed with the next node on the
PriorityList. In our implementation, spill code is introduced
whenever RR > SG × AR, SG being the spill gauge. SG
may take any value greater than or equal to 1. If set to
1, it means that the algorithm adds spill code as soon as the 
register limit is reached. When set to a very large value it 
causes the algorithm to perform spilling after obtaining a 
partial schedule for all the nodes in the dependence graph. 
The effects of intermediate values for this parameter on the
quality of the schedule are discussed in [33]. In this paper 
we have used SG = 2. Spill code may also be added
when the PriorityList is empty and RR > AR.
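The two conditions under which spill code is considered can be written as a small predicate. The function name `should_insert_spill` and its parameter names are hypothetical; the default SG = 2 is the value used in this paper.

```python
def should_insert_spill(rr, ar, sg=2.0, priority_list_empty=False):
    """rr: registers required by the partial schedule.
    ar: registers available. sg: spill gauge (>= 1)."""
    over_gauge = rr > sg * ar
    # Once every node is scheduled, any excess over AR must be spilled.
    final_excess = priority_list_empty and rr > ar
    return over_gauge or final_excess

mid_schedule = should_insert_spill(100, 64)
end_of_schedule = should_insert_spill(100, 64, priority_list_empty=True)
```

With 64 available registers and SG = 2, a partial schedule needing 100 registers is tolerated while nodes remain to be scheduled, but triggers spilling once the PriorityList is empty.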
In order to efficiently reduce the register requirements, 
the spill heuristic tries to select the use (from among those 
that cross the critical cycle in the partial schedule S) that 
has the largest ratio between its lifetime and the memory 
traffic that its spilling would generate (number of load and 
store operations to be inserted). If such a use is not found 
or it does not span a minimum number of cycles, one of 
the nodes already scheduled in the critical cycle is ejected 
and placed back in the PriorityList. This forces a reduction 
of the register requirements in the critical cycle by moving 
the non-spillable section of the use outside this cycle. The 
minimum number of cycles that the selected use must last 
(named minimum span gauge MSG) is another parameter 
of our algorithm that influences the quality of the schedules 
generated and is experimentally evaluated [33]. In this pa- 
per we use MSG = 4. 
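The selection of the use to spill can be sketched as below. The sketch assumes each use is a half-open interval `[start, end)` of schedule cycles, that spilling always adds at least one memory operation, and that a use "crosses" the critical cycle if it covers a cycle congruent to it modulo II. `Use`, `crosses` and `select_use_to_spill` are hypothetical names for this illustration.

```python
from dataclasses import dataclass

@dataclass
class Use:
    name: str
    start: int     # first cycle of this lifetime section
    end: int       # cycle of the consumer (exclusive)
    mem_ops: int   # loads/stores that spilling this use would insert (>= 1)

def crosses(use, critical, ii):
    """True if the use is live in the critical modulo slot."""
    return any(t % ii == critical for t in range(use.start, use.end))

def select_use_to_spill(uses, critical, ii, msg=4):
    """Pick the crossing use with the largest lifetime / traffic ratio.

    Returns None when no use crosses the critical cycle or the best one
    is shorter than the minimum span gauge; the caller would then eject
    a node scheduled in the critical cycle instead.
    """
    crossing = [u for u in uses if crosses(u, critical, ii)]
    if not crossing:
        return None
    best = max(crossing, key=lambda u: (u.end - u.start) / u.mem_ops)
    return best if best.end - best.start >= msg else None

uses = [Use("a", 0, 10, 2), Use("b", 2, 8, 1), Use("c", 11, 12, 1)]
chosen = select_use_to_spill(uses, critical=3, ii=4)
```

Here use `b` wins: it lasts 6 cycles for a single memory operation, whereas `a` lasts 10 cycles but costs two.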
For the selected use, spill load/store nodes are inserted
in the dependence graph G. These operations are also
inserted in the PriorityList with the priority of their associated
consumer/producer nodes minus 1. In addition,
these nodes are forced to be placed as close as possible
to their associated consumer/producer nodes. To achieve
this, the EarlyStart of a spill load node is set to its
LateStart - DG and the LateStart of a spill store node
is set to its EarlyStart + DG, DG being the distance
gauge. The influence of this gauge is discussed in [33]. In 
this paper we assume DG = 4. 
Once spill nodes are inserted in the dependence graph 
G, the Budget is increased by the number of nodes inserted
times the BudgetRatio in order to give further chances to
the iterative algorithm to complete the schedule. 
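The bookkeeping performed when spill nodes are inserted (priority, scheduling window and budget) can be summarized in a short sketch. All function names are hypothetical; DG = 4 is the value assumed in this paper.

```python
def spill_priority(associated_priority):
    """Spill nodes get the priority of their consumer/producer minus 1."""
    return associated_priority - 1

def spill_load_window(late_start, dg=4):
    """(EarlyStart, LateStart) of a spill load: at most DG cycles before
    the latest point at which its consumer still receives the value."""
    return late_start - dg, late_start

def spill_store_window(early_start, dg=4):
    """(EarlyStart, LateStart) of a spill store: at most DG cycles after
    the value becomes available from its producer."""
    return early_start, early_start + dg

def enlarge_budget(budget, inserted_nodes, budget_ratio):
    """Each inserted node earns BudgetRatio additional attempts."""
    return budget + inserted_nodes * budget_ratio

load_win = spill_load_window(late_start=20)
store_win = spill_store_window(early_start=7)
new_budget = enlarge_budget(budget=12, inserted_nodes=2, budget_ratio=6)
```

Narrow windows keep the spilled value in a register only briefly around its producer and consumer, which is what makes the spill reduce register pressure.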
3.2.4. Restart_Schedule heuristic
The iterative algorithm discards the current partial schedule 
in two cases: 1) if the number of trials is exhausted (Budget
reaches 0) and 2) if the processor configuration with the
current value of II cannot support the memory traffic generated
by the newly inserted spill operations. In both
situations, the algorithm is restarted with a larger value for
II. If both tests are passed, the algorithm proceeds with the
next node on the PriorityList. 
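The two restart conditions can be sketched as a predicate. The memory-traffic test is not given in closed form in the text, so the sketch assumes a simple resource bound: with fully pipelined memory ports, at most `mem_ports * II` memory operations fit in one iteration of the schedule. The names used are hypothetical.

```python
def restart_schedule(budget, mem_ops, mem_ports, ii):
    """True if the current partial schedule should be discarded."""
    out_of_budget = budget <= 0
    # Assumed bound: each memory operation occupies one port for one cycle.
    mem_saturated = mem_ops > mem_ports * ii
    return out_of_budget or mem_saturated

keep_going = restart_schedule(budget=5, mem_ops=8, mem_ports=4, ii=2)
too_much_traffic = restart_schedule(budget=5, mem_ops=9, mem_ports=4, ii=2)
```

For a clustered machine the same bound would plausibly be checked per cluster; that refinement is left out of this sketch.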
3.3. Algorithm for clustered architectures 
In this subsection we explain the additional operations 
(steps C1 and C2 in the algorithm shown in Figure 4) and 
the modifications required when compiling for clustered
architectures. These operations involve cluster selection,
insertion, scheduling and ejection of move operations, and
balancing register requirements.
3.3.1. Cluster selection 
After picking up the U node (step 2), the algorithm decides
the most appropriate cluster (i) in which to schedule it
(step C1). This is done taking into consideration, in the
specified order, the following aspects:
• Availability of empty slots (one slot is enough) to
schedule the U operation in the current partial schedule
for each cluster.
• Minimum number of move operations that would
be required to access the variables produced/consumed
by already scheduled operations.
• Minimum occupancy of the functional unit that can
perform the U operation.
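Because the three criteria are applied in the specified order, cluster selection can be expressed as a lexicographic comparison. In the sketch below the per-cluster inputs (`has_free_slot`, `moves_needed`, `occupancy`) are assumed to have been computed beforehand; these names and `select_cluster` are hypothetical.

```python
def select_cluster(clusters, has_free_slot, moves_needed, occupancy):
    """Return the cluster that is best under the three ordered criteria.

    Tuples compare element by element, and False sorts before True, so
    clusters with a free slot come first, then fewer moves, then lower
    functional-unit occupancy.
    """
    return min(clusters,
               key=lambda i: (not has_free_slot[i],
                              moves_needed[i],
                              occupancy[i]))

best = select_cluster(
    clusters=[0, 1, 2],
    has_free_slot={0: False, 1: True, 2: True},
    moves_needed={0: 0, 1: 2, 2: 1},
    occupancy={0: 0.1, 1: 0.2, 2: 0.9},
)
```

Cluster 0 needs no moves but has no free slot, so it loses to cluster 2, which has a slot and needs fewer moves than cluster 1.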
3.3.2. Move operations 
Once the i cluster is selected to host a U operation, the nec- 
essary move operations are introduced in the dependence 
graph (step C2). A move operation is needed whenever a U 
node requires a value produced by an operation scheduled 
in a different cluster or whenever it produces a result which
is later consumed by an operation scheduled in a different
cluster. If a U node has one or more successors in another
cluster, only one move operation is inserted. 
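The rule for deciding which move operations a node needs can be sketched as follows. The sketch assumes that "only one move operation" means one move per destination cluster, and it ignores the possibility of reusing a move already inserted for another consumer. `moves_for` and its arguments are hypothetical names; each move is represented as a `(value_producer, source_cluster, destination_cluster)` triple.

```python
def moves_for(u, cluster_u, preds, succs, cluster_of):
    """List the moves required to place u in cluster_u.

    cluster_of maps already-scheduled nodes to their cluster;
    unscheduled neighbors impose no move yet.
    """
    moves = []
    # Incoming values produced in another cluster must be brought in.
    for p in preds.get(u, []):
        c = cluster_of.get(p)
        if c is not None and c != cluster_u:
            moves.append((p, c, cluster_u))
    # One outgoing move per remote cluster, however many consumers it has.
    remote = {cluster_of[s] for s in succs.get(u, []) if s in cluster_of}
    for c in sorted(remote - {cluster_u}):
        moves.append((u, cluster_u, c))
    return moves

needed = moves_for(
    "u", 0,
    preds={"u": ["a", "b"]},
    succs={"u": ["c", "d", "e"]},
    cluster_of={"a": 0, "b": 1, "c": 1, "d": 1, "e": 0},
)
```

In the example, `c` and `d` both sit in cluster 1 and share a single outgoing move.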
Once move operations are inserted, the algorithm first 
schedules the new move operations and then the original 
U operation. This implies the repetition of the scheduling 
steps (Figure 3) for each of these nodes. 
Move operations can also be added when a loop- 
invariant variable is selected for spilling (step 4). Invariants 
consume a single register during the whole lifetime of the 
loop in a non-clustered architecture. In a clustered archi- 
tecture, however, if the invariant has several consumer op- 
erations scheduled in different clusters, we initially assign 
one register in each cluster in which the invariant is used. 
If the algorithm decides to spill the lifetime associated with 
an invariant and it is stored in another cluster, then the algo- 
rithm inserts a move node to bring it as late as possible. If 
the invariant is not available in another cluster or resources 
(ports and buses in the interconnection) are saturated, then 
the invariant is loaded from memory. 
Move operations can be ejected from the current partial 
schedule (Forcing_and_Ejection heuristic) whenever a resource
conflict occurs in the cycle in which they are scheduled.
A move operation can also be ejected and removed
from the dependence graph whenever the algorithm decides
to eject a U operation that is the predecessor or the unique
successor of that move node. When a U node is picked up
again, the algorithm will decide if move operations are re- 
ally required (because the selection policy may end up with 
a different decision than the one initially taken for the same 
node). 
Move operations can also be ejected from the schedule 
and removed from the dependence graph during the spilling 
process. When a use that has a source/target move node is
selected for spilling, the move node can be eliminated from 
the dependence graph unless the following conditions are 
satisfied: 
1. the move node is the source node of the use which has
been selected for spilling,
2. the move has several consumers, and
3. one of these consumers is scheduled before the target
of the use selected for spilling.
If the move node is eliminated, the movement between
clusters is carried out through memory by the new
store/load spill operations. When a move node is eliminated
from the graph, the edge from the predecessor operation
is deleted and all edges coming out from the move node
are connected to the predecessor. 
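The graph update performed when a move node is eliminated can be illustrated with a small sketch. It assumes the dependence graph is stored as a dictionary mapping each node to the list of its successors and that a move node has exactly one predecessor; `remove_move` is a hypothetical helper name.

```python
def remove_move(succs, move):
    """Delete `move` from the adjacency dict and reconnect its
    consumers directly to its (single) predecessor."""
    # Unpacking fails loudly if the move does not have exactly one predecessor.
    (pred,) = [n for n, out in succs.items() if move in out]
    # Delete the edge pred -> move.
    succs[pred] = [s for s in succs[pred] if s != move]
    # Re-attach every edge leaving the move to the predecessor.
    for s in succs.pop(move, []):
        if s not in succs[pred]:
            succs[pred].append(s)
    return succs

graph = {"p": ["mv", "x"], "mv": ["c1", "c2"], "x": [], "c1": [], "c2": []}
remove_move(graph, "mv")
```

After the call, `p` feeds `x`, `c1` and `c2` directly; the spill store/load nodes that replace the inter-cluster transfer would be inserted afterwards by the spill heuristic.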
3.3.3. Balancing the register pressure 
If the algorithm discovers that the number of available reg- 
isters in a cluster is exhausted, then it applies certain steps to 
reduce the register pressure. One of them consists of mov- 
ing (push or pull) the cycle in which move operations are 
scheduled. This releases registers in one of the clusters and 
uses them in the other cluster. In other words, we advance 
(delay) the moment at which the value is sent (received) to 
(from) another cluster. This action is performed as part of 
the Check_and_Insert_Spill heuristic. If this is not sufficient,
spill code is added in the usual way.
4. Performance Evaluation and Comparison 
In this section we evaluate the quality of the sched- 
ules generated by MIRS-C and evaluate the performance 
achieved on several processor configurations under ideal 
and real memory assumptions. 
For the evaluation we use a workbench composed of all 
the loops from the Perfect Club benchmark [2] that are suitable
for software pipelining³. Loop unrolling has been applied
on small loops in order to saturate the functional units.
A total of 1258 loops representing about 80% of the total 
execution time of the benchmark are used. 
The evaluation framework includes a set of VLIW 
cluster configurations k-(GPxMy-REGz) defined as fol- 
lows: k clusters, each one composed of x general-purpose 
floating-point functional units; y memory ports (number of 
load/store units) and z registers in the register file. Each
cluster also includes 2 ports (one input and one output 
port) which perform the move operations between clusters 
through an inter-connection with 2 buses. In all configura- 
tions the latencies of operations performed in the functional 
units are: 4 cycles for addition and multiplication, 17 cy- 
cles for division and 30 cycles for square root. All opera- 
tions are fully pipelined except for division and square root. 
Move operations are also pipelined and take λm cycles. In
this paper, we focus our study and experimental evaluation
on aggressive processor configurations which could be
implemented in the near future with a potential ILP to be
exploited. In particular we consider a range of configurations
such that k = {1, 2, 4}, k × x = 8, k × y = 4 and
z = {16, 32, 64, 128}. We consider two possible values for
the latency of move operations, λm = {1, 3}.
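The configuration space described above can be enumerated mechanically. The sketch below only restates the constraints of the text (8 general-purpose units and 4 memory ports in total, split evenly among k clusters); the function name, the dictionary keys and the inclusion of a move latency for the single-cluster case (where it has no effect) are choices made for this example.

```python
from itertools import product

def configurations(total_gp=8, total_mem=4):
    """Enumerate k-(GPxMy-REGz) configurations with k*x = 8 and k*y = 4."""
    configs = []
    for k, z, move_latency in product((1, 2, 4), (16, 32, 64, 128), (1, 3)):
        x, y = total_gp // k, total_mem // k   # units and ports per cluster
        configs.append({
            "name": f"{k}-(GP{x}M{y}-REG{z})",
            "k": k, "x": x, "y": y, "z": z,
            "move_latency": move_latency,
        })
    return configs

configs = configurations()
names = {c["name"] for c in configs}
```

For instance, the 4-cluster machine with 64 registers per cluster appears as `4-(GP2M1-REG64)`.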
³ Although the Perfect Club benchmark set is considered obsolete for
the purposes of evaluating supercomputer performance, the structure and 
computation performed in the loops are still representative of current nu- 
merical codes. 
Table 1. Comparison between the algorithm pro- 
posed in [31] and the MIRS-C when an unbounded 
number of registers is considered. 
Config.   | Number of loops               | [31]       | MIRS-C
k   λm    | Not Cnvr   Different Schedule | ΣII  Σtrf  | ΣII  Σtrf
4.1. Comparison with other methods 
MIRS-C is an iterative algorithm that solves the register- 
constrained instruction scheduling problem for clustered ar- 
chitectures in a unified way. The algorithm was initially 
evaluated for monolithic processor cores (i.e. a single clus- 
ter with all the resources) [33], showing a noticeable im- 
provement over existing algorithms to handle the register- 
constrained instruction scheduling for non-clustered con- 
figurations. In summary, MIRS is able to improve the per- 
formance of two previous non-iterative scheduling tech- 
niques and achieve speed-ups in the range 1.5-2 and re- 
ductions in memory traffic of the order of 0.4-0.6. 
The next step in the evaluation is to compare the quality 
of the schedules generated by MIRS-C with the ones generated
by a non-iterative scheduler [31]. That algorithm
does not apply backtracking, i.e. it does not eject operations
already scheduled. In addition, when the algorithm runs 
out of registers, then it increases the II of the loop without 
trying to insert spill code. In order to analyze the impact 
of these two aspects, we perform two different sets of ex- 
periments. First we assume that the register file has an un- 
bounded number of registers (i.e. the scheduler will never 
have to increase the II or insert spill code due to a shortage 
of registers). The second experiment will assume a register 
file with 64 registers and will be useful in comparing the 
quality of the schedules in a register-constrained architec- 
ture. 
Table 1 shows the ΣII for all the loops in the workbench.
The experiment assumes an unbounded number of registers
in each cluster, showing the ability of MIRS-C and the algorithm
proposed in [31] to generate good schedules. The capability
of ejecting nodes in the partial schedule when certain
resource conflicts arise is crucial and results in better
schedules. Ejection may cause the conflicting operation to
be scheduled in a different cycle but in the same cluster or
even to be scheduled in a different cluster. This is especially
useful when complex operations are scheduled. ΣII
is reduced by factors of 0.95, 0.93 and 0.91 for 1, 2, and
4 clusters, respectively. Notice that the higher the number 
of clusters, the higher the increase in the execution rate that 
MIRS-C is able to achieve. 
Table 2. Comparison between the algorithm proposed in [31]
and MIRS-C when the total number of registers is
constrained to k × z = 64.
When the size of the register file is limited and the
scheduler runs out of registers, the algorithm proposed in [31]
relies on reducing the execution rate (increasing the II). The
evaluation done by its authors does not consider the
allocation of loop invariants and therefore does not expose
non-convergence issues [21]. Our implementation of their
proposal takes loop invariants into account, and as a result the
scheduler is unable to find a valid solution for a relatively
large number of loops in our workbench (the ones consuming
most of the time in the applications). The column “Not Cnvr” in
Table 2 reports the number of loops that do not converge to
a valid solution when using the algorithm proposed in [31].
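A rough sketch of why increasing the II alone may fail to converge once loop invariants are counted is given below. It rests on simplifying assumptions that are not taken from the paper: each invariant holds one register for the whole loop, each loop variant has a fixed lifetime of at least one cycle, and a variant with lifetime lt needs ceil(lt / II) registers. The function names are hypothetical.

```python
import math


def registers_needed(invariants, lifetimes, ii):
    """Register requirement under the simplified model described above."""
    return invariants + sum(math.ceil(lt / ii) for lt in lifetimes)


def smallest_feasible_ii(invariants, lifetimes, available, start_ii, max_ii=1000):
    """Increase II until the loop fits in `available` registers.

    Returns the first II that fits, or None when no II can ever fit.
    """
    # Once II exceeds every lifetime, each variant still needs one register,
    # so the requirement never drops below this floor.
    floor = invariants + len(lifetimes)
    if floor > available:
        return None
    for ii in range(start_ii, max_ii + 1):
        if registers_needed(invariants, lifetimes, ii) <= available:
            return ii
    return None


lifetimes = [12, 9, 6, 20]          # hypothetical lifetimes, in cycles
print(smallest_feasible_ii(10, lifetimes, available=16, start_ii=2))
print(smallest_feasible_ii(14, lifetimes, available=16, start_ii=2))
```

In the second call the invariants plus one register per variant already exceed the register file, so no increase of the II helps; under this model only spill code can make the loop schedulable.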
The column labeled “Different Schedule” in Table 2 reports
the number of loops for which [31] and MIRS-C generate
a schedule with different values of II and/or memory traffic
(trf). In almost all cases (all except two) the II achieved
by MIRS-C is lower, which results in a higher instruction
execution rate. Notice that the number of loops for which a
different schedule is obtained increases when the number of
clusters increases. The following columns in the same table
report the sum of the individual II (ΣII) and the number
of memory operations (Σtrf) for these loops. For instance,
for k = 4 and λm = 3, MIRS-C produces schedules with
an average reduction of 0.63 in the II at the expense of an
average increase in memory traffic of 1.44.
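The aggregate columns described above can be computed as in the following sketch. The per-loop numbers are invented for illustration and the function name `compare_schedulers` is hypothetical; only loops scheduled by both algorithms and with a different (II, trf) pair are summed.

```python
def compare_schedulers(results_a, results_b):
    """Sum II and memory traffic over loops that are scheduled differently.

    results_a, results_b: dict mapping loop name -> (ii, trf).
    Loops missing from results_b (e.g. not converged) are skipped.
    """
    differing = [name for name in results_a
                 if name in results_b and results_a[name] != results_b[name]]
    return {
        "loops": len(differing),
        "sum_ii": (sum(results_a[n][0] for n in differing),
                   sum(results_b[n][0] for n in differing)),
        "sum_trf": (sum(results_a[n][1] for n in differing),
                    sum(results_b[n][1] for n in differing)),
    }


# Hypothetical per-loop results: (II, number of memory operations).
baseline = {"loop1": (10, 4), "loop2": (8, 3), "loop3": (20, 6), "loop4": (7, 2)}
backtracking = {"loop1": (10, 4), "loop2": (6, 5), "loop3": (14, 9)}

summary = compare_schedulers(baseline, backtracking)
print(summary)
```

In this made-up data the second scheduler lowers the summed II on the differing loops while issuing more memory operations, which is the kind of trade-off the table columns expose.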
Finally, Table 3 shows a comparison, in terms of scheduling
time, between the algorithm proposed in [31] and the
MIRS-C algorithm. Note that for the same subset of loops,
the backtracking algorithm (MIRS-C) is very competitive.
Moreover, for register-constrained configurations, MIRS-C
is slightly faster, since adding spill code sometimes avoids
re-scheduling the whole loop. The set of loops for which [31]
fails to find a valid schedule is small. However, it is
composed of extremely big loops that require a large
compilation time. For this reason MIRS-C spends most of the
scheduling time finding a valid schedule for those loops.
Table 3. Comparison of scheduling time between the
algorithm proposed in [31] and MIRS-C. The rows
with fewer than 1258 loops are for the subset of loops
for which [31] finds a valid schedule.

           Sched-Time (λm=1)           Sched-Time (λm=3)
k x z      loops   [31]     MIRS-C     loops   [31]     MIRS-C
1 x ∞      1258    25.99    27.93
1 x 64     1258    -        59.34
1 x 64     1248    27.71    25.99
2 x ∞      1258    30.28    42.52      1258    34.1     36.41
4 x 16     1226    276.60   254.14     1218    266.99   256.25

4.2. Evaluation of processor configurations with
ideal memory
In this section we use MIRS-C to explore the performance
of a set of processor configurations in terms of execution
cycles (II times the number of iterations of the loops),
memory traffic and execution time
(assuming that the cycle time is constrained by the access
time of the register file, as shown in Figure 2).
As shown in Figure 5, configurations with a higher degree
of clustering and a sufficient number of registers per
cluster result in schedules that take more cycles to execute.
However, the lower cycle time compensates for this loss and
clearly results in a lower execution time. When the number
of registers per cluster is small, the configurations with
more clusters have more registers in total, require less spill
code and therefore result in schedules that take fewer cycles
to execute. Notice that for all values of k the minimum
execution time is achieved when a total of 64 registers
is available (16 registers per cluster when k=4, 32 registers
per cluster when k=2 and 64 registers in the monolithic
design). However, for k=4 and k=2, a noticeable reduction
in memory traffic is achieved if 32 and 64 registers per
cluster, respectively, are used. This has an impact on cycle time
which is more or less compensated by the reduction in the
number of cycles to be executed.
In order to measure the degradation introduced by clus- 
tering, we compare the number of cycles required when 64 
registers are available in total. The number of execution
cycles increases by 8% (2 clusters) and by 19% (4 clusters)
relative to the non-clustered configuration.
In summary, when the memory is assumed to behave in 
an ideal manner, the configuration with k=4 REG16 (k=2 
REG32) achieves a speed-up of 54.2% (27.7%) with respect 
to the non-clustered one (REG64). In addition, as
shown in Figure 2, the configuration with k=4 (k=2) reduces
the area by a factor of 0.15 (0.36) and the power consumption
by a factor of 0.49 (0.67).
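As a back-of-the-envelope check, the figures quoted above can be combined to recover the cycle-time ratio they imply. The sketch assumes that execution time is the product of execution cycles and cycle time, that a speed-up of 54.2% (27.7%) means a ratio of execution times of 1.542 (1.277), and that the 19% (8%) increase in cycles refers to the same 64-register configurations and move latency; the function name is hypothetical.

```python
def implied_cycle_time_ratio(cycle_increase, speedup):
    """Cycle time of the clustered design relative to the unified one.

    From speedup = (C_u * t_u) / (C_c * t_c) with C_c = (1 + cycle_increase) * C_u,
    it follows that t_c / t_u = 1 / ((1 + cycle_increase) * speedup).
    """
    return 1.0 / ((1.0 + cycle_increase) * speedup)


ratio_k2 = implied_cycle_time_ratio(0.08, 1.277)   # k=2, REG32
ratio_k4 = implied_cycle_time_ratio(0.19, 1.542)   # k=4, REG16
print(f"k=2: {ratio_k2:.3f}  k=4: {ratio_k4:.3f}")
```

Under these assumptions the clustered cycle times come out at about 0.725 (k=2) and 0.545 (k=4) of the unified one, which is why the extra cycles are more than compensated.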
Similar conclusions are drawn when the latency of the 
move operation is higher. The ordering strategy used by 
MIRS-C gives priority to nodes that belong to recurrences. 
This means that these nodes are scheduled first and
therefore have fewer constraints when they need to be placed in
the partial schedule. As a consequence, move operations
tend to appear outside of the recurrences, thus minimizing 
the effect of move latency. 
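The idea of giving priority to recurrence nodes can be sketched as follows. This is not the ordering strategy actually used by MIRS-C; it is a minimal illustration, with hypothetical names, that marks the nodes lying on a dependence cycle and places them ahead of the remaining nodes.

```python
def nodes_in_recurrences(succs):
    """Return the nodes that lie on some cycle of the dependence graph.

    succs: dict mapping each node to the list of its successors.
    """
    def reachable_from(start):
        seen, stack = set(), list(succs.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(succs.get(node, ()))
        return seen

    # A node is in a recurrence if it can reach itself.
    return {node for node in succs if node in reachable_from(node)}


def priority_order(succs):
    """Recurrence nodes first, then the rest (alphabetical within each group)."""
    recurrent = nodes_in_recurrences(succs)
    return sorted(succs, key=lambda node: (node not in recurrent, node))


# Hypothetical loop body: add -> mul is a loop-carried dependence.
graph = {"load": ["mul"], "mul": ["add"], "add": ["mul", "store"], "store": []}
print(priority_order(graph))
```

Here `add` and `mul` form the recurrence and are ordered first, so any move operation needed later is more likely to land on the `load` or `store` edges, outside the critical cycle.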
Figure 5. Execution cycles, memory traffic and execution
time for different VLIW core configurations under the ideal
memory assumption. First column: λm=1. Second column:
λm=3.

Table 3 (continued).
           Sched-Time (λm=1)           Sched-Time (λm=3)
k x z      loops   [31]     MIRS-C     loops   [31]     MIRS-C
2 x 32     1258    -        187.35     1258    -        198.31
2 x 32     1243    49.18    46.25      1244    61.07    42.54
4 x ∞      1258    45.36    142.03     1258    48.36    167.30
4 x 16     1258    -        819.95     1258    -        983.95

In all previous evaluations we have assumed a fixed number
of ports to carry out move operations and a fixed number
of buses to interconnect the clusters: in particular, two ports
in each cluster and two buses in the interconnection. The
following experiment has been designed to show the
scalability of clustered architectures and to evaluate the
requirements in terms of number of buses. The scalability is
evaluated by replicating k times a cluster element
GP2M1-REG32. We consider 2, 3, 4 and an unbounded number of
buses connecting the clusters. Notice that the organization
scales quite well as long as the number of buses is close
to k/2. Therefore, the assumption made throughout the paper
regarding the number of buses is correct.
4.3. Evaluation of processor configurations with
real memory and binding prefetching
Finally we analyze the performance of clustered config- 
urations in a real memory environment. The memory is as- 
sumed to be multi-ported (with k x y = 4 ports), with a 
cache memory of 32 Kb and a line size of 32 bytes. The 
cache memory is lockup-free and allows up to 8 pending 
memory accesses. Hit latency for read (write) accesses is 
2 (1) cycles. Miss latency is considered to be 25 ns; this
latency is translated to cycles taking into consideration the
cycle time for each processor configuration.
Figure 6. Scalability of clustered VLIW cores and
number of buses. (Horizontal axis: number of clusters, 1 to 8.)
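The translation of the 25 ns miss latency into cycles can be sketched as below. The cycle times used are hypothetical (the actual values come from Figure 2, which is not reproduced here), and rounding up to a whole number of cycles is an assumption.

```python
import math


def miss_latency_cycles(miss_latency_ns, cycle_time_ns):
    """Miss latency in processor cycles, rounded up (assumed rounding rule)."""
    return math.ceil(miss_latency_ns / cycle_time_ns)


MISS_LATENCY_NS = 25.0                      # value given in the text
for cycle_time_ns in (1.0, 0.5):            # hypothetical cycle times
    cycles = miss_latency_cycles(MISS_LATENCY_NS, cycle_time_ns)
    print(f"cycle time {cycle_time_ns} ns -> {cycles} cycles")
```

A configuration with a shorter cycle time therefore sees the same miss as a larger number of cycles, which is why the stall component has to be evaluated per configuration.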
The evaluation breaks down the total number of cycles 
and execution time into two components: useful (i.e. when 
the processor is doing useful work) and stall (i.e. when 
the processor is blocked waiting for a cache miss to
complete the access). All performance figures in this section
are relative to the number of useful cycles of configuration
k=1 (GP8M4-REG64).
MIRS-C can either assume hit latency when scheduling
memory load operations or apply binding prefetching.
Scheduling with hit latency minimizes the register pressure 
and theoretically increases performance. This generates a 
valid schedule that stalls the processor whenever a cache 
miss occurs or whenever a dependent instruction needs the 
datum to be brought up from memory (in the case of
lockup-free caches). Binding prefetching can be used to tolerate
the latency of these cache misses [3]. Binding prefetching
consists of scheduling the load instructions assuming cache
miss latency. It does not increase memory traffic but it does
increase register pressure. Configurations based on
clustering are able to offer a higher register capacity than
non-clustered organizations, and will therefore potentially
benefit from binding prefetching and aggressive
prefetching strategies.
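The increase in register pressure can be estimated with a simple model, given here as an assumption rather than as the paper's accounting: the destination register of a load is reserved from issue until the value arrives, and a new iteration starts every II cycles, so a load scheduled with latency L keeps ceil(L / II) registers busy. The hit latency of 2 cycles comes from the text; the miss latency of 25 cycles and II = 4 are hypothetical.

```python
import math


def registers_for_load(latency_cycles, ii):
    """Registers kept busy by one load per iteration in a modulo schedule,
    assuming its lifetime equals the latency it was scheduled with."""
    return math.ceil(latency_cycles / ii)


II = 4                                     # hypothetical initiation interval
hit = registers_for_load(2, II)            # scheduled with hit latency
miss = registers_for_load(25, II)          # scheduled with miss latency
print(f"hit latency: {hit} register(s), miss latency: {miss} register(s)")
```

Under this model each load scheduled with miss latency costs several additional registers, so larger aggregate register files leave more room for binding prefetching.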
In this paper we use a selective binding prefetching ap- 
proach [30]. The algorithm assumes that those load opera- 
tions included in recurrences as well as spill load operations 
are scheduled assuming hit latency. All other load oper- 
ations are scheduled assuming miss latency. Those loops 
which execute a small number of iterations are also sched- 
uled assuming hit latency for all their memory load opera- 
tions (in order to avoid long prologues and epilogues in the 
software pipelined code). 
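The selection rule described above can be written down as a small sketch. The `Load` record, the function name and the threshold `min_trip_count` for "a small number of iterations" are hypothetical, as is the miss latency expressed in cycles; only the hit latency of 2 cycles is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Load:
    name: str
    in_recurrence: bool = False   # the load belongs to a recurrence
    is_spill: bool = False        # the load was inserted as spill code


def scheduling_latency(load, trip_count, hit_latency=2, miss_latency=25,
                       min_trip_count=8):
    """Latency assumed by the scheduler for a load (illustrative rule)."""
    if trip_count < min_trip_count:
        return hit_latency        # short loops: avoid long prologues/epilogues
    if load.in_recurrence or load.is_spill:
        return hit_latency
    return miss_latency           # binding prefetching for all other loads


loads = [Load("a[i]"), Load("x", in_recurrence=True), Load("spill0", is_spill=True)]
for load in loads:
    print(load.name, scheduling_latency(load, trip_count=100))
print("a[i] in a short loop:", scheduling_latency(loads[0], trip_count=4))
```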
Figure 7. Evaluation of some processor configurations
with real memory and binding prefetching: (a) execution
cycles, (b) execution time.
Figure 7 shows the behavior of several core
configurations: k=1 with z={64,128}, k=2 with z={32,64} and k=4
with z={32,64}. The plot on the left shows the total number
of execution cycles when load operations are scheduled
assuming hit latency (columns above the label Normal)
and applying binding prefetching (columns above the label
Prefetching). Notice that prefetching leads to a noticeable
reduction of stall cycles for all configurations. Using these
values, one would conclude that clustering is not worth
using. However, when the number of cycles is multiplied by
the cycle time of the configuration, the picture changes.
The plot on the right shows the execution time for the same
processor configurations. Notice that the appropriate
register file size for the non-clustered configuration is 64, for
k=2 is 64 registers per cluster, and for k=4 is 32 registers
per cluster. When comparing these “best” configurations,
we notice that k=4 achieves a speed-up of 1.46 and k=2 
achieves a speed-up of 1.19, both with respect to the non- 
clustered configuration. As we have mentioned in previous 
sections, this improvement is also obtained with a reduction 
in terms of area and power consumption, making clustered 
architectures the design choice for future VLIW configura- 
tions. 
5. Conclusions 
In this paper we have presented a novel software pipelin- 
ing technique for clustered VLIW processors. The proposed 
technique performs instruction scheduling, register alloca- 
tion and cluster assignment in a single step. The integration 
of these three actions in a single step allows us to find global 
solutions that are a good trade-off between them instead of 
optimizing for one of them while penalizing the others. 
The proposed technique is based on an iterative approach
with limited backtracking, which allows one to undo
previous scheduling, spilling or communication decisions
without the compilation time penalty of a wide search of the
solution space.
The results show important improvements over previous 
techniques and negligible performance degradation when 
compared to a unified architecture in terms of execution
cycles. However, when cycle time is factored in, the clustered
architectures are significantly superior to the unified one.
Experiments also show that this technique allows scalabil- 
ity of up to 8 clusters. Finally, when the memory hierar- 
chy is factored in, important speed-ups are obtained by the 
clustered architectures because extra prefetching can be per- 
formed by using the higher number of available registers. 
References
[1] V. Allan, R. Jones, R. Lee, and S. Allan. Software pipelining. ACM Computing Surveys, 27(3):367-432, September 1995.
[2] M. Berry, D. Chen, P. Koss, and D. Kuck. The Perfect Club benchmarks: Effective performance evaluation of supercomputers. Technical Report 827, Center for Supercomputing Research and Development, November 1988.
[3] D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In Proc. Fourth Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), pages 40-52, April 1991.
[4] A. Capitanio, N. Dutt, and A. Nicolau. Partitioned register files for VLIWs: A preliminary analysis of tradeoffs. In Proc. of the 25th Annual Int. Symp. on Microarchitecture (MICRO-25), pages 292-300, December 1992.
[5] A. Charlesworth. An approach to scientific array processing: The architectural design of the AP120B/FPS-164 family. Computer, 14(9):18-27, 1981.
[6] A. K. Dani, V. J. Ramanan, and R. Govindarajan. Register-sensitive software pipelining. In Procs. of the Merged 12th International Parallel Processing and 9th International Symposium on Parallel and Distributed Systems, April 1998.
[7] J. Dehnert and R. Towle. Compiling for the Cydra 5. The Journal of Supercomputing, 7(1/2):181-228, May 1993.
[8] G. Desoli. Instruction assignment for clustered VLIW DSP compilers: A new approach. Technical Report HPL-98-13, HP Laboratories, January 1998.
[9] A. Eichenberger and E. Davidson. Stage scheduling: A technique to reduce the register requirements of a modulo schedule. In Proc. of the 28th Annual Int. Symp. on Microarchitecture (MICRO-28), pages 338-349, November 1995.
[10] J. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, 1986.
[11] P. Faraboschi, G. Brown, G. Desoli, and F. Homewood. Lx: A technology platform for customizable VLIW embedded processing. In Proc., 27th Annual Internat. Symp. on Computer Architecture, pages 203-213, June 2000.
[12] M. Fernandes, J. Llosa, and N. Topham. Partitioned schedules for clustered VLIW architectures. In Proc., 12th International Parallel Processing Symposium and 9th Symposium on Parallel and Distributed Processing (IPPS/SPDP’1998), pages 386-391, March 1998.
[13] J. Fisher. Very long instruction word architectures and the ELI-512. In Proc., Tenth Annual Internat. Symp. on Computer Architecture, pages 140-150, June 1983.
[14] J. Fridman and Z. Greenfield. The TigerSHARC DSP architecture. IEEE Micro, pages 66-76, January-February 2000.
[15] P. N. Glaskowsky. MAP1000 unfolds at Equator. Microprocessor Report, 12(16), December 1998.
[16] R. Huff. Lifetime-sensitive modulo scheduling. In Proc. of the 6th Conference on Programming Language, Design and Implementation, pages 258-267, 1993.
[17] Texas Instruments Inc. TMS320C62x/67x CPU and Instruction Set Reference Guide. 1998.
[18] S. Jang, S. Carr, P. Sweany, and D. Kuras. A code generation framework for VLIW architectures with partitioned register banks. In Procs. of 3rd Int. Conf. on Massively Parallel Computing Systems, April 1998.
[19] K. Kailas, K. Ebcioglu, and A. Agrawala. CARS: A new code generation framework for clustered ILP processors. In Proc., 7th High-Performance Computer Architecture (HPCA-7), January 2001.
[20] J. Llosa, A. González, E. Ayguadé, M. Valero, and J. Eckhardt. Lifetime-sensitive modulo scheduling in a production environment. IEEE Trans. on Comps., 50(3), March 2001.
[21] J. Llosa, M. Valero, and E. Ayguadé. Heuristics for register-constrained software pipelining. In Proc. of the 29th Annual Int. Symp. on Microarchitecture (MICRO-29), pages 250-261, December 1996.
[22] J. Llosa, M. Valero, E. Ayguadé, and A. González. Hypernode reduction modulo scheduling. In Proc. of the 28th Annual Int. Symp. on Microarchitecture (MICRO-28), pages 350-360, November 1995.
[23] E. Nystrom and A. Eichenberger. Effective cluster assignment for modulo scheduling. In Procs. of 31st Annual Int. Symp. on Microarchitecture (MICRO-31), pages 103-114, November 1998.
[24] E. Ozer, S. Banerjia, and T. Conte. Unified assign and schedule: A new approach to scheduling for clustered register file microarchitectures. In Procs. of 31st Annual Int. Symp. on Microarchitecture (MICRO-31), pages 308-315, November 1998.
[25] S. Ramakrishnan. Software pipelining in PA-RISC compilers. Hewlett-Packard Journal, pages 39-45, July 1992.
[26] B. Rau and C. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proc. of the 14th Annual Microprogramming Workshop, pages 183-197, October 1981.
[27] B. Rau, M. Lee, P. Tirumalai, and P. Schlansker. Register allocation for software pipelined loops. In Proc. of the ACM SIGPLAN’92 Conference on Programming Language Design and Implementation, pages 283-299, June 1992.
[28] B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proc. of the 27th Annual International Symposium on Microarchitecture, pages 63-74, November 1994.
[29] S. Rixner, W. Dally, B. Khailany, P. Mattson, U. Kapasi, and J. Owens. Register organization for media processing. In Proc., 6th High-Performance Computer Architecture (HPCA-6), pages 375-386, January 2000.
[30] J. Sánchez and A. González. Cache sensitive modulo scheduling. In Procs. of the 30th Annual Int. Symp. on Microarchitecture (MICRO-30), pages 338-348, December 1997.
[31] J. Sánchez and A. González. The effectiveness of loop unrolling for modulo scheduling in clustered VLIW architectures. In Proc. International Conference on Parallel Processing (ICPP’2000), pages 555-562, August 2000.
[32] J. Zalamea, J. Llosa, E. Ayguadé, and M. Valero. Improved spill code generation for software pipelined loops. In Procs. of the Programming Languages Design and Implementation (PLDI’00), pages 134-144, June 2000.
[33] J. Zalamea, J. Llosa, E. Ayguadé, and M. Valero. MIRS: Modulo scheduling with integrated register spilling. In Proc. of 14th Annual Workshop on Languages and Compilers for Parallel Computing (LCPC2001), August 2001.
