Software pipelining is a widespread technique to find an instruction-level parallel schedule for loops. Reducing execution time often results in an increased demand for resources to execute the loop operations and to store variables. This paper presents a new technique to reduce the register pressure generated by pipelined schedules. The technique finds a new schedule that aims at reducing the number of required registers without modifying the initiation interval of the schedule or the number of resources required to execute the instructions. A two-step approach is proposed for such a reduction: minimizing the SPAN of the loop and rescheduling operations within a basic block. Experimental results show that further improvements on the schedules found by the best existing techniques can be obtained at the expense of a negligible computational cost.
Introduction
The number of registers in a parallel architecture is fixed by the processor design, and is therefore limited. For this reason, much effort has been devoted in recent years to reducing the number of registers required to execute a loop. In these architectures, a loop is usually executed by means of software pipelining techniques. These techniques attempt to find a pipelined schedule of the loop, which contains instructions belonging to different iterations. The number of registers required to execute the schedule may be reduced by storing some variables in memory. This is done by adding spill code [1] to the loop body. Spilling may decrease the register requirements without degrading software pipelining performance if the spilling decision (which variables must be spilled to memory) is efficiently controlled [2]. [3] observed that scheduling followed by register allocation requires much spill code. On the other hand, register allocation followed by scheduling reduces the potential parallelism too much. The same conclusion is supported in [4], where scheduling and register allocation are solved simultaneously. We show in this paper that scheduling followed by register allocation may obtain optimal results in most cases.
When there are not enough available registers to execute the schedule, some techniques increase the expected number of cycles (initiation interval or II) and schedule the loop again [2].
Intuitively, for the same loop, a longer schedule executes fewer instructions in parallel than a shorter one, and therefore requires fewer registers at any given time. However, recent experiments have demonstrated that this approach may never converge [5]. Spilling may also be considered after increasing the expected II.
This work was supported by CYCYT TIC-
In order to assign physical registers to loop variables, several register allocation approaches have been proposed in the literature. A technique which uses a register allocation graph is presented in [6]. Each node in the register allocation graph represents a variable. An edge between two nodes indicates that the lifetimes of the two variables do not overlap, and therefore they can be stored in the same register (a variable is said to be alive between the time it is generated (written) and its last use (read)). The aim of this technique is to determine the minimum number of cliques that cover the graph. [1, 7, 8, 9, 10, 11] present an alternative approach based on the use of a conflict graph. In the conflict graph, each node represents a variable, and an edge between two nodes exists when the two variables cannot share the same register because their lifetimes overlap. Other approaches are based on interval graphs [12] and cyclic interval graphs [13, 14]. An interval graph contains information that is available in neither register allocation graphs nor conflict graphs. Given a set of lifetimes, the interval graph contains only the overlapping information of any two lifetimes in the set. Cyclic interval graphs are an appropriate representation for variable lifetimes in loops. Coloring a cyclic interval graph with the minimum number of colors is known to be an NP-hard problem [15]. Consequently, no polynomial-time algorithm is known to solve this problem.
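As an illustration of the conflict graph described above, the following minimal sketch builds the conflict edges from a set of lifetimes. It is not taken from any of the cited works: the function name conflict_graph and the half-open interval model (a variable is alive on [write, last_read)) are assumptions made only for this example, and loop wrap-around is ignored.

```python
from itertools import combinations

def conflict_graph(lifetimes):
    """Return the set of conflict edges between variables.

    lifetimes maps a variable name to a half-open interval (begin, end),
    an assumed model in which the variable is alive on cycles begin..end-1.
    Two variables conflict when their intervals overlap.
    """
    edges = set()
    for (a, (a0, a1)), (b, (b0, b1)) in combinations(sorted(lifetimes.items()), 2):
        if a0 < b1 and b0 < a1:  # the two intervals share at least one cycle
            edges.add((a, b))
    return edges

# Hypothetical lifetimes: x and y overlap, z overlaps with neither.
example = {"x": (0, 3), "y": (1, 4), "z": (4, 6)}
print(conflict_graph(example))  # {('x', 'y')}
```

Variables that are not joined by an edge, such as x and z here, may share a register; the complement of this edge set corresponds to the register allocation graph of [6].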
Strategy overview of RESIS
In this paper we propose RESIS (REduce Span and Incremental Scheduling), an approach aiming at reducing the register requirements of a schedule without increasing the initiation interval.
After using RESIS, any known approach to perform register allocation can be used. RESIS is based on changing both the cycle at which each instruction is executed and the iteration to which it belongs. RESIS works in two separate steps:
SPAN reduction: the first step is done at graph level. First of all, the dependence graph associated with the given schedule is built. Then, variable lifetimes are reduced by reducing the iteration index of some instructions and by scheduling the resulting graph again. Any algorithm for scheduling basic blocks (which does not modify the iteration index of any operation) can be used.
Incremental scheduling: after reducing the SPAN, variable lifetimes are reduced by moving some instructions within the schedule. This is done without modifying the iteration index of any instruction. To do so, the cycle at which the most variable lifetimes overlap is found. The instructions to be moved are selected among those that produce a result whose lifetime crosses that cycle.
Motivation example
Let us introduce an example. Figure 1(a) shows the data dependence graph of the SPEC-SPICE Loop 10. We assume a result latency of one cycle for subtract and store instructions, and two cycles for multiply and load. We also assume that the execution of the instructions is fully pipelined, and a superscalar architecture with one multiplier, one adder/subtracter and one load/store unit. With the previous assumptions, Figure 1 [...]
RESIS may reduce the register requirements of schedules obtained by using any software pipelining approach because it performs a global optimization of the variable lifetimes. The example from Figure 1 shows that scheduling instructions so that every variable lifetime is as short as possible is not always the best solution. Increasing the lifetime of the appropriate variable may result in a better schedule that requires fewer registers, as shown in Figure 1(c).
The rest of the paper is organized as follows: Section 2 explains the way to represent a loop and a schedule. Since a pipelined schedule can be represented by a new loop equivalent to the initial one, Section 2 also presents the equivalence relationships between two loops. Section 3 shows three different lower bounds on the number of registers required to execute a loop. Sections 4 and 5 present the two main steps of RESIS: SPAN reduction and incremental scheduling. Some results are reported in Section 6. Finally, Section 7 concludes the paper.
Equivalent DGs
Since the iteration index may not be the same for all the instructions in the schedule, we conclude that the schedule may be associated with a DG different from the initial one, but equivalent to it. The equivalence between both DGs can be described by means of the rules of retiming. Retiming [19] transforms a DG in such a way that the indices of the nodes and the distances of the dependences may differ between the two DGs.
By the rules of retiming, every dependence e = (u, v) in the DG is associated with an integer D, so that [...]
Obtaining a DG from a schedule
The DG associated with the schedule is a DG with the same topology as the initial one, but with different mappings for the iteration indices and the dependence distances. The new iteration index of each instruction is taken from the schedule. The new distance of each dependence is computed by using Equation (1). [...] can be easily obtained from the DG in Figure 2(a) by using any software pipelining approach. However, note that it can also be obtained from the DG in Figure 2(c) by using any algorithm for scheduling basic blocks. This idea has been used in [21, 22] to design an effective methodology for software pipelining with resource and timing constraints.
3 Lower bounds on registers
Variable lifetime
For a dependence u → v, the variable lifetime extends from the completion of u to the cycle in which the FUs executing v no longer require the input data. For a given node u with n outgoing edges in the DG, only one register (and not n) is necessary to store the result computed by u. Therefore, only the edge (u, v) with the longest variable lifetime is taken into account to compute the number of registers required to store the result of u in the loop execution.
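The rule that only the longest outgoing edge of a node counts can be written in a few lines. This is a minimal sketch; the function name and the representation of per-edge lifetimes as a dictionary keyed by (u, v) are assumptions made for illustration only.

```python
def longest_lifetime_per_result(edge_lifetime):
    """Keep, for each producer u, the longest lifetime among its outgoing edges.

    edge_lifetime maps a dependence (u, v) to the lifetime, in cycles, of the
    value that u produces for v. One register holds the result of u, so only
    the maximum over all consumers v matters.
    """
    longest = {}
    for (u, _v), length in edge_lifetime.items():
        longest[u] = max(longest.get(u, 0), length)
    return longest

# Hypothetical dependences: a feeds b and c, b feeds c.
print(longest_lifetime_per_result({("a", "b"): 2, ("a", "c"): 5, ("b", "c"): 1}))
# {'a': 5, 'b': 1}
```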
The number of registers required by a schedule is only known after register allocation. Allocating registers for a software-pipelined loop is beyond the scope of this work. An extensive discussion including heuristic solutions and empirical results can be found in [23]. This work makes a realistic estimate of the number of registers required by a schedule, and optimizes register requirements by reducing this estimate.
Lower bound MaxLive
A tight lower bound on the number of registers required by a schedule is the maximum number of variables whose lifetimes overlap at any cycle [23, 24, 25]. This number is denoted MaxLive.
Such a lower bound may not be reached, since a register assignment with such requirements may not exist. However, recent experiments performed on the Perfect Club Benchmark Loops show that no more than MaxLive + 1 registers are required to efficiently perform register allocation [5].
Figure 3 shows an example for a VLIW processor in which all the instructions execute in a single cycle. Variable lifetimes are represented as vertical lines. A point at the intersection between a line and a cycle indicates that a register is required at this cycle. Without loss of generality, we will assume here that, for a VLIW architecture and for a dependence e = (u, v), registers are used from the first cycle of u until the last cycle of v minus one (the variable lifetime depends on the target architecture; other architecture models are discussed in [26]). With the previous assumption, the lower bound on the number of registers required is 2. However, no register assignment exists that allows the code to be written by using only two registers. The lifetimes of V1 and V2 overlap, and therefore both variables must be stored in different registers. Since the lifetime of V3 overlaps with both V1 and V2, it requires a third register. Therefore, three registers are required after register allocation.
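The gap between MaxLive and the number of registers actually needed can be reproduced with a small sketch. The lifetimes below are hypothetical and are not the data of Figure 3; the function names, the half-open interval model and the initiation interval of 3 cycles are assumptions chosen only to exhibit the same effect.

```python
from itertools import combinations

def live_cycles(begin, end, ii):
    """Cycles of the loop body (modulo ii) at which a value alive on [begin, end) holds a register."""
    return [cycle % ii for cycle in range(begin, end)]

def max_live(lifetimes, ii):
    """Maximum number of simultaneously live values over the ii cycles of the loop body."""
    live = [0] * ii
    for begin, end in lifetimes.values():
        for cycle in live_cycles(begin, end, ii):
            live[cycle] += 1
    return max(live)

def all_pairs_conflict(lifetimes, ii):
    """True when every pair of values is alive together at some cycle."""
    cycles = {name: set(live_cycles(b, e, ii)) for name, (b, e) in lifetimes.items()}
    return all(cycles[a] & cycles[b] for a, b in combinations(cycles, 2))

# Hypothetical lifetimes; V3 wraps around into the next iteration.
lifetimes = {"V1": (0, 2), "V2": (1, 3), "V3": (2, 4)}
print(max_live(lifetimes, 3))            # 2
print(all_pairs_conflict(lifetimes, 3))  # True
```

In this example at most two values are alive at any cycle, yet each pair of values overlaps somewhere, so no two of them can share a register and three registers are needed.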
Henceforth, we will use the term variable lifetime to refer to the lifetime of the variable associated with a given dependence. For a dependence e = (u, v), the minimum variable lifetime occurs when u is scheduled as late as possible and v is scheduled as soon as possible.
Absolute and relative lower bound
A lower bound on the number of registers required by any schedule of a given DG is the number of registers required when: [...]
An absolute lower bound (ALB) on the number of registers can be computed by using Equation (3) in the initial DG. This is because, in general, the initial DG contains the greatest number of ILDs among all the equivalent DGs representing the loop. Therefore, variable lifetimes are the shortest possible, and thus the absolute lower bound can be considered a lower bound for any schedule of the loop.
A relative lower bound (RLB) on the number of registers can be computed by using Equation (3) with the DG associated with the schedule. Such a lower bound is valid for any schedule associated with this DG (that is, with the same indices in the nodes). In general, ALB ≤ RLB, and both are no greater than MaxLive.
4 SPAN reduction
4.1 Strategy overview
The first step is based on reducing the variable lifetimes by reducing the SPAN of the DG associated with the given schedule, and then scheduling the DG again. The SPAN of a DG is defined as max − min + 1, where max and min are the maximum and minimum values of the iteration index, respectively. In general, a reduction of the SPAN in a DG leads to a reduction in the variable lifetimes in any associated schedule, in the number of registers required to store partial results across iterations, and in the iteration time (the time to execute an iteration of the initial loop).
The SPAN reduction phase works as follows. First of all, the DG associated with the given schedule, DG', is built (we assume the initial DG is known). The distance of the dependences in DG' is computed by using the indices of the nodes in the schedule, according to Equation (2). Following this, the maximum value of the iteration index is computed by exploring all nodes in DG'. Then, this value is iteratively decreased while the following three conditions hold:
1. a DG with minimum SPAN has not been found (the minimum SPAN equals the loop unrolling degree if the loop has previously been unrolled; otherwise, the minimum SPAN is 1);
2. the critical path (CP) of the current DG is not longer than the expected II;
3. the estimated number of registers for the current schedule is greater than the absolute lower bound (MaxLive > ALB).
Figure 4 shows an example of the effectiveness of reducing the SPAN in an architecture with 2 FUs. Figure 5 shows a flow diagram of the algorithm used to reduce the SPAN. In order to make the scheduling task easier, the SPAN reduction algorithm attempts to reduce the number of PSDs and NSDs without increasing the SPAN by transforming them into FSDs. Function reduce_scheduling_dependences performs this task (see Figure 7) by means of retiming.
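The SPAN definition and the three loop conditions of this section can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary of iteration indices and the scalar arguments are hypothetical, and the retiming and rescheduling performed inside the loop are not modelled.

```python
def span(iteration_index):
    """SPAN of a DG: maximum iteration index minus minimum iteration index, plus one."""
    values = iteration_index.values()
    return max(values) - min(values) + 1

def keep_reducing(current_span, minimum_span, critical_path, ii, max_live, alb):
    """True while all three conditions for decreasing the maximum index still hold."""
    return (current_span > minimum_span   # a DG with minimum SPAN has not been found
            and critical_path <= ii       # the critical path does not exceed the expected II
            and max_live > alb)           # the register estimate is still above the ALB

# Hypothetical DG with three nodes spread over iterations 0..2.
indices = {"a": 0, "b": 2, "c": 1}
print(span(indices))                                  # 3
print(keep_reducing(span(indices), 1, 4, 4, 5, 3))    # True
```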
In the algorithm, a DG is considered better for scheduling than another one if it contains fewer PSDs. For an equal number of PSDs, the DG with the lowest number of NSDs is considered the best. An edge is selected to be retimed only once if a better DG is not found. The heuristics used to select an edge for retiming (in order of priority) are as follows: [...]
Such strategies fine-tune the schedule by moving instructions. Figure 9 shows an example of the way to reduce register requirements by incremental scheduling. We consider two different movements of instructions in the schedule:
Re-schedule: moves an instruction from the current cycle to another cycle if sufficient resources are available.
Swap: swaps the scheduling of two instructions. The swapping is performed when both instructions have a similar execution pattern (both instructions use the same resources at the same cycle).
The algorithm to rearrange instructions within the schedule works as follows:
1. Compute the relative lower bound (RLB) on the number of registers required by the schedule. This is done to stop the search when a schedule requiring such resources is found. [...]
The following sections describe the heuristics used to select an instruction to be moved (Section 5.2) and how an instruction is moved within the schedule (Section 5.3).
Selecting an instruction to move
Let u be the instruction to move, and let c be the cycle across which u must be moved. If u has been scheduled before c, u will be moved forward across cycle c. Otherwise, it will be moved backwards across cycle c. An instruction is selected only once if no movements are carried out within the schedule. The criteria used to select an instruction to be moved are as follows:
1. First, we try to select nodes to be moved forward. This is done because the scheduling algorithm schedules instructions as soon as possible. Since a forward movement delays the instruction, the variable lifetime may be reduced. Nodes are selected among those that have no predecessors scheduled before (or at) cycle c and have at least one successor scheduled after cycle c (otherwise, no register reduction will be produced). The node which produces the longest variable lifetime is selected.
2. If no node with such characteristics exists, we try to select a node to be moved backwards. A node is selected among those that are scheduled after cycle c. The node selected is the one with the largest difference between the number of required input data (written in registers) and the number of data outputs to be written in registers. Among nodes with the same difference, the node which produces the variable with the longest lifetime is selected.
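The two selection criteria above can be sketched as follows. This is a simplified illustration rather than the implementation used in the paper. It assumes that a node counts as "scheduled before c" when its cycle is at most c, that every predecessor contributes one register input, that a node writes one register output when it has at least one successor, and that all dependences are within the same iteration. The function name and data layout are hypothetical.

```python
def select_instruction(c, cycle, preds, succs):
    """Pick a node to move across cycle c, returning (node, direction) or None.

    cycle maps each node to its scheduled cycle; preds and succs map each
    node to the lists of its predecessors and successors in the DG.
    """
    def lifetime(u):
        # Length of the longest lifetime produced by u (0 if u has no consumer).
        return max((cycle[v] - cycle[u] for v in succs[u]), default=0)

    # Criterion 1: nodes that can be delayed past cycle c.
    forward = [u for u in cycle
               if cycle[u] <= c
               and all(cycle[p] > c for p in preds[u])   # no predecessor at or before c
               and any(cycle[v] > c for v in succs[u])]  # result is alive across c
    if forward:
        return max(forward, key=lifetime), "forward"

    # Criterion 2: nodes after c, ranked by (inputs - outputs), then by lifetime.
    backward = [u for u in cycle if cycle[u] > c]
    if backward:
        def rank(u):
            inputs = len(preds[u])
            outputs = 1 if succs[u] else 0
            return (inputs - outputs, lifetime(u))
        return max(backward, key=rank), "backward"
    return None

# Hypothetical two-node schedule: x at cycle 0 feeds y at cycle 2.
print(select_instruction(1, {"x": 0, "y": 2},
                         {"x": [], "y": ["x"]},
                         {"x": ["y"], "y": []}))  # ('x', 'forward')
```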
Moving an instruction
This section presents how a movement is performed within the schedule. A movement consists of moving a node (forward or backwards) across a given cycle c, so that the associated variable is no longer alive at cycle c. The instruction to move is re-scheduled (if possible) after cycle c if the movement is forward, and before or at cycle c if the movement is backwards. The swapping of two instructions is only performed when the instruction selected for movement cannot be re-scheduled because of the lack of available resources. In order not to change the iteration index of any instruction, no movement can be done across the boundaries of the schedule.
Re-scheduling
An instruction u is re-scheduled as follows:
1. Unschedule u (assume that u was scheduled at cycle S(u)).
2. If u must be moved forward, try to schedule u from cycle c+1 to cycle ALAP(u) (the last cycle in the schedule at which u can be executed without violating any data dependence).
3. If u must be moved backwards, try to schedule u from cycle c to cycle S(u).
Note that neither exploration is as greedy as possible. For example, if u must be moved forward, we could try to schedule u from cycle ALAP(u) to cycle c+1, instead of from cycle c+1 to cycle ALAP(u). A similar argument applies to a backwards move. It might seem that with the greedier order the algorithm would converge quickly towards the final solution. However, experiments have shown that in some cases this approach obtains worse results.
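Steps 1 and 2 of the re-scheduling procedure, for the forward case only, can be sketched as follows. The resource model is an assumption made for this example: each instruction occupies one unit of a given type for a single cycle (fully pipelined units), and capacity gives the number of units of each type. The function name and arguments are hypothetical; the backwards case and the swap are not covered.

```python
def reschedule_forward(u, c, schedule, alap, unit, capacity):
    """Try to move u to the first cycle in c+1 .. ALAP(u) that has a free unit.

    schedule maps nodes to cycles and is updated in place. Returns True if u
    was moved past cycle c, or False if u was put back at its original cycle.
    """
    original = schedule.pop(u)                 # step 1: unschedule u
    for cycle in range(c + 1, alap[u] + 1):    # step 2: scan from c+1 up to ALAP(u)
        busy = sum(1 for v, s in schedule.items()
                   if unit[v] == unit[u] and s == cycle)
        if busy < capacity[unit[u]]:
            schedule[u] = cycle
            return True
    schedule[u] = original                     # no free slot: restore the old cycle
    return False

# Hypothetical schedule with one adder and one multiplier.
schedule = {"a": 0, "b": 1, "m": 2}
unit = {"a": "add", "b": "add", "m": "mul"}
moved = reschedule_forward("a", 1, schedule, {"a": 3}, unit, {"add": 1, "mul": 1})
print(moved, schedule["a"])  # True 2
```

Scanning from c+1 upwards places u at the earliest legal cycle past c, which matches the exploration order described above rather than the greedier order starting at ALAP(u).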
Swapping
In order to swap u with another node, a node v is selected among those that have an execution pattern similar to that of u. Node v is moved across cycle c without increasing the number of registers required at cycle c. The swapping is done recursively by following the same algorithm as that used to move u. That is, first of all we try to re-schedule v without moving any other node. If v cannot be re-scheduled, we attempt to put v in the slot used by another node x, and then we try to swap x for another node which has been scheduled after cycle c. Although this is a recursive behavior, experiments have shown that the depth of the search is small and the recursive exploration is not expensive.
Experimental Results
We have borrowed from [31] a set of benchmark loops selected from assorted scientific programs such as Livermore Loops, SPEC, Linpack and Whetstone. As in [31], we assume a unit result latency for add, subtract, store, and move instructions, a result latency of 2 cycles for multiply and load, and a result latency of 17 cycles for divide. We also assume that all the FUs are fully pipelined. We assume a superscalar architecture with 1 FP adder, 1 FP multiplier, 1 FP divider and 1 load/store unit. The lifetime for a dependence e = (u, v) is measured from the start of u to the start of v.
In order to show the efficacy of RESIS, we have executed the algorithm over the schedules generated by HRMS [16]. Table 1 shows the reduction obtained in the number of registers. For each benchmark, the first column shows the initiation interval of the schedule found. The next two columns show the absolute (ABS) and the relative (RLB) lower bounds. RLB has been computed by using the final schedule. The next column (OPT) shows the actual minimal register requirements. This number has been calculated by using an integer linear programming approach [32]. The next columns show the register requirements (MaxLive) of the schedule found by HRMS and by RESIS after each step (SPAN reduction and incremental scheduling), as well as the CPU time used. This time has been measured on a Sparc-10/40 workstation. Finally, the last column (diff) shows the register reduction achieved.
When comparing the absolute and the relative lower bounds to the optimal register requirements, we find that the proposed lower bounds are very close to MaxLive. This suggests that ALB and RLB are good estimates of MaxLive.
Note that, although HRMS is a very good algorithm from the point of view of register pressure, RESIS achieves improvements in 22.2% of the cases. Note also that the optimization is achieved by both the SPAN reduction phase and the incremental scheduling phase. The short time used to calculate the final schedule suggests that RESIS is suitable for integration in a parallelizing compiler. Finally, the schedule found after incremental scheduling is optimal (from the point of view of MaxLive) in 92.6% of the cases. This shows that RESIS is a good approach to improve the register requirements of a schedule. However, we think that integrating both SPAN reduction and incremental scheduling into a single task may obtain still better results.
Conclusions
In this paper we have presented RESIS, a new algorithm for register optimization. RESIS is divided into two steps, namely SPAN reduction and incremental scheduling. SPAN reduction is based on reducing the maximum iteration index of any instruction in the schedule. Incremental scheduling is a code reordering technique which attempts to reduce the maximum number of variables whose lifetimes overlap at any cycle. Two movements are considered: re-scheduling an instruction and swapping two instructions. We have also presented different lower bounds on register requirements: an absolute lower bound on the number of registers required by any schedule of a loop; a relative lower bound on the number of registers required by any schedule of a particular DG; and a lower bound on the number of registers required by a given schedule, calculated as the maximum of the register requirements over all cycles.
The results obtained by RESIS show that it is a good approach for reducing the number of registers required by a schedule. 
