Abstract: The rapid advances in high-performance computer architecture and compilation techniques provide both challenges and opportunities to exploit the rich solution space of software pipelined loop schedules. In this paper, we develop a framework to construct a software pipelined loop schedule which runs on the given architecture (with a fixed number of processor resources) at the maximum possible iteration rate (à la rate-optimal) while minimizing the number of buffers, a close approximation to minimizing the number of registers.
The main contributions of this paper are: First, we demonstrate that such a problem can be described by a simple mathematical formulation with precise optimization objectives under a periodic linear scheduling framework. The mathematical formulation provides a clear picture which permits one to visualize the overall solution space (for rate-optimal schedules) under different sets of constraints. Secondly, we show that a precise mathematical formulation and its solution does make a significant performance difference. We evaluated the performance of our method against three leading contemporary heuristic methods. Experimental results show that the method described in this paper performed significantly better than these methods. The techniques proposed in this paper are useful in two different ways:
(i) As a compiler option which can be used in generating faster schedules for performance-critical loops (if the interested users are willing to trade the cost of longer compile time for faster runtime).

[...] provide a rich solution space involving a large number of schedules for software pipelining. In exploiting the space of good compile-time schedules, it is important to find a fast, software-pipelined schedule which makes the best use of the machine resources (both function units and registers) available in the underlying architecture.
In this paper, we are interested in addressing the following software pipelining problem:
Problem 1 [OPT]: Given a loop L and a machine architecture M, construct a schedule that achieves the highest performance of L within the resource constraints of M while using the minimum number of registers.

The performance of a software-pipelined schedule can be measured by the initiation rate of successive iterations. Thus "highest performance" refers to the "fastest schedule" or to the schedule with the maximum initiation rate. A schedule with the maximum initiation rate is called a rate-optimal schedule.
The following two important questions are related to Problem 1, the OPT problem.
Question 1: Can a simple mathematical formulation be developed for the OPT problem?
Question 2: Does the optimality formulation pay off in real terms? We need to answer the question "So what, after all?"
In order to answer Question 1, we consider an instance of Problem 1. That is,

Problem 2 [OPT-T]: Given a loop L, a machine architecture M, and an iteration period T, construct a schedule, if one exists, with period T satisfying the resource constraints of M and using the minimum number of registers.

In this paper we consider target architectures involving both pipelined and non-pipelined execution units. Our approach to solving the OPT-T problem is based on a periodic scheduling framework for software pipelining [15], [11]. Based on the periodic scheduling framework, we present a simple Integer Linear Programming (ILP) formulation for OPT-T. We are able to express the resource constraints as linear constraints. Combining such resource constraints with the work by Ning and Gao, where a tight upper bound for the register requirement is specified using linear constraints [11], a unified formulation for the OPT-T problem is obtained. As in [11], we use FIFO buffers to model the register requirement in this paper. (The relationship between the Ning/Gao formulation and ours can be better understood by examining Fig. 2 (page 6), in which the tradeoff between buffer and function unit optimality is depicted.)
Readers who are familiar with related work in this field will find the optimality objective in the above problem formulation to be very ambitious. Of course, the general complexity of the optimal solution is NP-Hard, and heuristics are needed to solve the problem efficiently. However, we feel that a clearly stated optimality objective in the problem formulation is quite important for several reasons:
1. The solution space of "good" schedules has increased considerably with the rapid advances in high-performance architecture. Current and future generation processors are likely to contain multiple function units. Likewise, in compilers, advances made in dependence analysis (such as array data flow analysis [16] and alias analysis [17]) will expose more instruction-level parallelism in the code, while loop unrolling, loop fusion and other techniques will increase the size of the loop body [18]. So a given loop is likely to have many good schedules to choose from, and optimality criteria are essential to guide the selection of the best ones.
2. There are always a good number of users who have performance-critical applications. For them, the runtime performance of these applications is of utmost concern. For these applications, the user may be willing to trade a longer compilation time for an improvement in the runtime speed. Compilers for future generation high-performance architectures should not deny such opportunities to these users. The techniques developed in this paper can be provided to such users via a compiler option.
3. The techniques proposed in this paper can also be used in a scheduling framework to ascertain the optimal solution so as to evaluate and improve existing/newly proposed heuristic scheduling methods.

Thus the usefulness of the techniques proposed in this paper should be viewed in the light of items (1) to (3) above.
We have implemented the solution method and tested it on 1008 loops extracted from various benchmark programs such as the SPEC92, the NAS kernels, linpack, and the livermore loops. The loops were scheduled for different architectural configurations involving pipelined or non-pipelined execution units. In our experiments, we were able to obtain the optimal schedule for more than 80% of the test cases considered. These experiments, run on a SPARC 20, required an execution time with median ranging from 0.6 to 2.7 seconds for the different architectural configurations. The geometric mean of execution time ranged from 0.9 to 7.4 seconds.

[...] performance of various scheduling methods on the 1008 kernel loops. The ILP approach yielded schedules that are faster in 6% of the test cases compared to Slack Scheduling, in 21% of the test cases compared to the FRLC method, and in 27% of the test cases compared to the modified list scheduling. In terms of buffer requirement, the ILP approach did significantly better than the three heuristic methods in, respectively, 61%, 87%, and 83% of the test cases (see footnote 2).

In this paper we have concentrated only on loop bodies without conditional statements. Though it is possible to extend our approach to loops involving conditional statements using techniques discussed in [21], it is not clear whether the optimality objective will still hold. We defer this study to future work. Further, in this work we focus only on architectures involving pipelined or non-pipelined function units. Function units having arbitrary structural hazards are dealt with in [22] by extending the formulation proposed for non-pipelined function units.
Finally, as will become evident, the proposed framework can easily handle other optimization problems in software pipelining. For example, given the number of available registers, it can minimize either the number of required FUs or a weighted sum of the FUs in different FU types. Other possible problem formulations can be observed from Figure 2 (refer to page 6).
This paper is organized as follows. In the following section, we motivate our approach with the help of an example. The solution space of software pipelined schedules is discussed in Section III. In Section IV, the formulation of the OPT-T problem for pipelined execution units is developed. The OPT-T formulation for non-pipelined function units is presented in Section V. Section VI deals with an iterative solution to the OPT problem. In Section VII, the results of scheduling 1008 benchmark loops are reported. Our ILP schedules are compared with the schedules generated by other leading heuristic methods in Section VIII. In Section IX, we discuss other related work. Concluding remarks are presented in Section X.
II. Background and Motivation
In this section, we motivate the OPT problem and the solution method to be presented in the rest of this paper with the help of a program example.
A. Motivating Example
We introduce the notion of rate-optimal schedules under resource constraints, and illustrate how to search among them for the ones which optimize the register usage. A more rigorous introduction to these concepts will be given in the next section. We adopt as our motivating example the loop L in Figure 1 given by Rau et al. in [13].
Both C language and instruction level representations of the loop are given in Fig. 1(b), while the dependence graph is depicted in Fig. 1(a). Assume that instruction $i_0$ is [...] Integer FUs, 2 FP Units and 1 Load/Store unit. Further, in this subsection, we will assume that all pipelined function units are free of structural hazards and that an operation can be initiated in each function unit at each time step. Scheduling for non-pipelined function units is discussed in Section II-C.

Footnote 2: For a small number of test cases, less than 4%, the ILP schedule was worse in terms of either initiation rate or buffer requirement. This is due to the fact that we limit our ILP search to a maximum of 3 minutes. More details on the results are presented in Section VII.
The performance of a software-pipelined schedule for L can be measured by the initiation rate of successive iterations. In the following discussion, we often use the reciprocal of the initiation rate, the initiation interval T. Let us first establish a lower bound for T, i.e., the shortest initiation interval for loop L under various constraints. It is well known that the initiation interval is governed by both loop-carried dependencies in the graph and the resource constraints presented by the architecture. Under the loop-carried dependency constraint, the shortest initiation interval, $T_{dep}$, is given by:

$$T_{dep} = \max_{\text{cycles } C} \frac{d(C)}{m(C)}$$
where $d(C)$ is the sum of the delays (or latencies) of the instructions (or nodes) in cycle C of the dependence graph, and $m(C)$ is the sum of the dependence distances around cycle C [23]. Those cycles $C_{crit}$ with the maximum value of $d(C_{crit})/m(C_{crit})$ are termed critical cycles of the graph. In our example graph (refer to Fig. 1(a)), the self loop on instruction $i_2$ is the critical cycle. Thus, $T_{dep}$ for the given dependency graph is 2.
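As a rough illustration of the bound just described, the following Python sketch computes the maximum of $d(C)/m(C)$ over all simple cycles by brute-force enumeration. It is only a sketch under stated assumptions: the function name `t_dep` is ours, nodes are assumed to be integers, and the example graph is a partial DDG containing only the arcs the text mentions explicitly (the self loop on $i_2$ with distance 1, and the dependences from $i_0$ to $i_5$ and from $i_1$ to $i_4$); the delays are inferred from the execution times quoted later in the paper.

```python
from fractions import Fraction


def t_dep(delays, arcs):
    """Return the maximum of d(C)/m(C) over all simple cycles C.

    delays: dict mapping node -> delay d_i
    arcs:   dict mapping (i, j) -> dependence distance m_ij
    Brute-force enumeration, intended for small graphs only.
    """
    succ = {}
    for (i, j) in arcs:
        succ.setdefault(i, []).append(j)

    best = Fraction(0)

    def walk(start, node, d_sum, m_sum, seen):
        nonlocal best
        for nxt in succ.get(node, []):
            m = m_sum + arcs[(node, nxt)]
            if nxt == start:
                # Cycle closed; ignore zero-distance cycles (not schedulable).
                if m > 0:
                    best = max(best, Fraction(d_sum, m))
            elif nxt not in seen and nxt > start:
                # Only visit nodes larger than the start node so that each
                # cycle is enumerated once, from its smallest node.
                walk(start, nxt, d_sum + delays[nxt], m, seen | {nxt})

    for s in delays:
        walk(s, s, delays[s], 0, {s})
    return best


# Partial DDG for loop L (assumption: only arcs named in the text).
DELAYS = {0: 1, 1: 2, 2: 2, 3: 2, 4: 2, 5: 1}
ARCS = {(2, 2): 1, (0, 5): 0, (1, 4): 0}

print(t_dep(DELAYS, ARCS))  # prints 2
```

Exact rational arithmetic (`Fraction`) is used so that non-integer bounds such as 3/2 are not rounded before the ceiling is taken.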
Resource constraints (of the architecture) also impose a lower bound on the initiation interval. 
Thus
$$T_{res} = \max\left(\frac{1}{3}, \frac{3}{2}, 2\right) = 2$$

Considering both dependence and resource constraints, the lower bound on the minimum initiation interval ($T_{lb}$) for our example with pipelined FUs is

$$T_{lb} = \max\{\lceil T_{dep} \rceil, \lceil T_{res} \rceil\} = \max\{2, 2\} = 2$$

That is, any schedule of loop L that obeys the resource constraint will have a period greater than or equal to $T_{lb} = 2$. The smallest iteration period $T_{min} \ge T_{lb}$ for which a resource-constrained schedule exists is called the rate-optimal period (with the given resource constraints) for the given loop. It can be observed that the initiation rate $1/T_{lb}$ for a given DDG may be improved by unrolling the graph a number of times. The unrolling factor can be decided based on either the $T_{dep}$ or the $T_{res}$ value, or on both. However, for the purpose of this paper, we do not consider any unrolling of the graph, though the techniques developed in this paper can be used in those cases as well.

B. An Illustration of the OPT Problem
In this paper we investigate periodic linear schedules, under which the times at which the various operations begin their execution are governed by a simple linear relationship. That is, under the linear schedule considered in this paper, the j-th instance of an instruction i begins execution at time $T \cdot j + t_i$, where $t_i \ge 0$ is an integer offset and T is the initiation interval or the iteration period of the given schedule.
($1/T$ is the initiation rate of the schedule.) Table I gives a possible schedule (Schedule A) with period 2 for our example loop. This schedule is obtained from the linear schedule form $T \cdot j + t_i$, with $T = 2$, $t_{i_0} = 0$, $t_{i_1} = 2$, $t_{i_2} = 4$, $t_{i_3} = 7$, $t_{i_4} = 9$, and $t_{i_5} = 11$. Schedule A has a prologue (from time step 0 to time step 9) and a repetitive pattern (at time steps 10 and 11). During the first time step in the repetitive pattern (time step 10), 1 FP instruction ($i_2$), 1 Integer instruction ($i_0$), and 1 Load/Store instruction ($i_1$) are executed, requiring 1 FP Unit, 1 Integer FU and 1 Load/Store Unit. Instructions $i_3$, $i_4$ and $i_5$ are executed during the second time step (time step 11), requiring 2 FP Units and 1 Load/Store Unit. Since the resource requirement of the repetitive pattern does not exceed what is available in the architecture, it is a resource-constrained schedule. Further, Schedule A is one of those resource-constrained schedules which achieve the fastest initiation interval ($T_{min} = 2$).
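The resource check on the repetitive pattern can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: the names are ours, the mapping of instructions to FU types is inferred from the description above, and the count of 3 Integer FUs is an assumption read off the 1/3 term in the resource bound. For pipelined FUs only the issue slot $t_i \bmod T$ of each instruction matters.

```python
from collections import Counter

T = 2
# Assumed machine: 3 Integer FUs, 2 FP units, 1 Load/Store unit.
AVAILABLE = {"INT": 3, "FP": 2, "LS": 1}
# Assumed FU type of each instruction, inferred from the text.
FU_TYPE = {"i0": "INT", "i1": "LS", "i2": "FP",
           "i3": "FP", "i4": "FP", "i5": "LS"}
# Offsets t_i of Schedule A as quoted in the text.
SCHEDULE_A = {"i0": 0, "i1": 2, "i2": 4, "i3": 7, "i4": 9, "i5": 11}


def start_time(offsets, instr, j, period=T):
    """Start time of the j-th instance of instr: T*j + t_i."""
    return period * j + offsets[instr]


def pattern_usage(offsets, period=T):
    """FU demand per slot of the repetitive pattern (pipelined FUs)."""
    usage = [Counter() for _ in range(period)]
    for instr, t in offsets.items():
        usage[t % period][FU_TYPE[instr]] += 1
    return usage


def fits(offsets, period=T):
    """True if no slot needs more FUs of a type than are available."""
    return all(count <= AVAILABLE[fu]
               for slot in pattern_usage(offsets, period)
               for fu, count in slot.items())


print(pattern_usage(SCHEDULE_A))
print(fits(SCHEDULE_A))  # prints True
```

Slot 0 of the pattern corresponds to time step 10 and slot 1 to time step 11 in the discussion above.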
Next let us compute the register requirement for this schedule. In Schedule A, the instruction $i_0$ fires six times before the first $i_5$ fires. Since there is a data dependence between $i_0$ and $i_5$, the values produced by $i_0$ must be buffered and accessed by $i_5$ in order to ensure correct execution of the program. Conceptually, some sort of FIFO buffers need to be placed between producer and consumer nodes. In this paper we will assume that a buffer is reserved at the time step when the instruction is issued, and remains reserved until the last instruction consuming that value completes its execution. The size of each buffer depends on the lifetime of the value. Therefore, a buffer of size 6 needs to be allocated for instruction $i_0$. As another example, four instances of $i_1$ are executed before the execution of the first instance of $i_4$. Consequently, a buffer size of 4 is required for instruction $i_1$. In a similar way, a buffer size of 1 each is required for instructions $i_3$ and $i_4$, and a buffer size of 2 is required for $i_2$.

In [26], it was demonstrated that the minimum buffer requirement provides a very tight upper bound on the total register requirement, and once the buffer assignment is done, a classical graph coloring method can subsequently be performed which generally leads to the minimum register requirement. In this paper, we assume that such a coloring phase will always be performed once the buffer size is determined. Consequently we restrict our attention to these FIFO buffers or logical registers.
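The buffer sizes quoted for $i_0$ and $i_1$ can be reproduced with a small Python sketch. It assumes that the buffer size of a producer/consumer pair is the lifetime $t_j + T \cdot m_{ij} - t_i$ divided by the period and rounded up; the rounding is our assumption, chosen because it matches the counts given in the text. The function name is hypothetical.

```python
def buffer_size(t_producer, t_consumer, period, distance=0):
    """Number of producer firings whose values are live at once.

    lifetime = t_consumer + period * distance - t_producer,
    rounded up to a whole number of periods (assumption).
    """
    lifetime = t_consumer + period * distance - t_producer
    return -(-lifetime // period)  # integer ceiling


# Schedule A, period 2: i0 (t=0) feeds i5 (t=11); i1 (t=2) feeds i4 (t=9).
print(buffer_size(0, 11, 2))  # prints 6
print(buffer_size(2, 9, 2))   # prints 4
```

An instruction with several consumers would take the maximum of this quantity over its consumers.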
A question of interest is: do there exist other rate-optimal schedules of L with the same resource constraint, but which use fewer registers? This is exactly what we have posed as Problem 1 (the OPT problem) in the introduction.
The answer is affirmative, and is illustrated by Schedule B in Table II-B, which uses only 14 buffers. This schedule is also resource-constrained with an iteration period of 2. The values of $t_i$ for the instructions are $t_{i_0} = 0$, $t_{i_1} = 1$, $t_{i_2} = 3$, $t_{i_3} = 6$, $t_{i_4} = 8$, and $t_{i_5} = 10$.
The buffer requirements for this schedule are as shown below:

Instruction   $i_0$  $i_1$  $i_2$  $i_3$  $i_4$  $i_5$  Total
Buffers         5      4      2      1      1      1     14

It may be verified that no schedule with period 2, satisfying the resource constraint, uses fewer than 14 buffers. Thus Schedule B is the solution we sought for the OPT problem, a rate-optimal schedule for the given loop L. Note that we generated this schedule using the method outlined in Section IV-C.

Next let us focus on the issues involved in scheduling non-pipelined FUs. When the FUs are non-pipelined, each instruction initiated on an execution pipe continues to keep the FU busy until it completes its execution. Thus the $T_{res}$ lower bound for non-pipelined FUs is:

A schedule, Schedule C, for non-pipelined FUs is shown in Table III. In this table we use the notation, e.g. $i_2$, to indicate that instruction $i_2$ continues its execution from the previous time step. The repetitive pattern, starting at time step 9, indicates that during each time step at most 2 FP, 1 Integer, and 1 Load/Store Units are required. Thus, it appears that Schedule C is a resource-constrained rate-optimal schedule for non-pipelined FUs. Unfortunately, this schedule is not legal. This is because, for Schedule C, we cannot find a fixed assignment of instructions to FUs. By this we mean that a compile-time mapping of instructions to specific FUs cannot be done for the repetitive pattern. To see this, consider the repetitive pattern starting at time step 9. If we assign the first FP unit to instruction $i_2$ at time step 9, and the second FP unit to $i_4$ at time step 10, then we have the first FP unit free at time step 11 and the second FP unit free at time step 12 (or time step 9, taking the time steps modulo 3).
But mapping $i_3$ to the first FP unit at time step 11 and to the second FP unit at time step 9 implies that the instruction $i_3$ migrates or switches from one FU to another during the course of its execution. Such switching is impractical. In order to ensure that an instruction does not switch FUs during its execution, we require that there be a fixed assignment of instructions to FUs. Unfortunately, there does not exist any schedule with a period $T = 3$ which satisfies the fixed FU assignment and requires only 2 FP units (in addition to 1 Integer and 1 Load/Store unit).
As indicated in the above example, for architectures with non-pipelined FUs, the software pipelining problem involves not only instruction scheduling (when each instruction is scheduled for execution) but also mapping (how instructions are assigned to FUs). Thus, to obtain rate-optimal resource-constrained software pipelining, we need to formulate the two related problems, namely scheduling and mapping, in a unified framework. Section V discusses such a formulation for non-pipelined FUs.
Table II-C shows a correct software pipelined schedule for the motivating example. In this schedule, instructions $i_3$ and $i_4$ share the first FP unit while $i_2$ executes on the second FP unit. Note that the period of the schedule is $T = 4$.
In order to give a proper perspective of problems addressed in this paper, a discussion on the solution space of linear schedules is presented in the following section.
III. The Solution Space of Linear Schedules
This section presents an overall picture of the solution space for periodic linear schedules P with which we are working. Within this space, our interest is only in those periodic schedules which use R function units or fewer; this set is denoted by the region labeled R. Obviously R is a subset of P. It may be noted that the initiation intervals of the schedules in R are greater than or equal to $T_{min}$ defined in Section II-A. Since we are interested in rate-optimal schedules, we denote all schedules with period $T_{min}$ by the region labeled T. There can be periodic schedules in T which use more than R function units.
The intersection of the sets T and R refers to the set of schedules with a period $T_{min}$ and using R or fewer function units. This is denoted by the region labeled TR. The schedules in TR are rate-optimal under the resource constraint R; that is, there is no schedule which uses not more than R resources and has a faster initiation interval. In our example loop L, Schedule A is an element of TR. By the definition of $T_{min}$, it is guaranteed that there exists at least one schedule with $T = T_{min}$ that uses R or fewer resources. Hence TR is always nonempty.

Fig. 2. Schedule Space of a Given Loop. Legend: P: Periodic Schedules; T: Schedules with Period $T_{min}$; R: Schedules using R or fewer resources; TB: Schedules with period $T_{min}$ and minimum Buffers; TR: Schedules with period $T_{min}$ and using R or fewer resources; TRB: Schedules with period $T_{min}$, using R or fewer resources, and with minimum Buffers; TRR: Schedules with period $T_{min}$, using R or fewer resources, and N or fewer Registers.
To optimally use the available registers in the architecture, it is important to pick, in TR, a schedule that uses minimum registers. The set of such schedules is denoted by the region labeled TRB. Note that the existence of such a schedule is guaranteed, from the fact that region TR is nonempty and the definition of set TRB. In our example, Schedule A is not a member of TRB while Schedule B is.
To put our problem statement in proper perspective, the goal in the OPT problem (see Introduction, Problem 1) is to find a linear schedule which lies within region TRB. However, for a compiler writer, the TRB region is only of indirect interest in the following sense. A compiler writer is more interested in finding a schedule with the shortest period T using R or fewer FUs and not requiring more than N registers, the available registers in the machine. Such schedules form the TRR region shown in Fig. 2. The region TRR may be contained in, may contain, may intersect, or may be disjoint with TRB. Any one of the four relationships is possible, for the following reasons.
(1) There is no guarantee that there exists a schedule with period T and using N or fewer registers. In this case TRR is null.
(2) As mentioned in Section II-B, since logical buffers provide a good approximation to physical registers, one can easily see that when a TRR schedule exists, it is possible to have either all TRR schedules in TRB or all TRB schedules be TRR schedules.
(3) Though the minimum buffer requirement provides a very tight upper bound for the register requirement, a minimum register schedule need not necessarily be a minimum buffer schedule. Thus TRR intersects TRB and TRR is not contained in TRB.
(4) Last, though very unlikely, it is possible that none of the TRR schedules are in TRB. In this case, $TRR \cap TRB = \emptyset$.
As will be seen later, it is possible to modify our formulation in Sections IV and V to find a TRR schedule using the approach followed in [26], [27]. The details of these approaches and the additional complexity introduced by them are beyond the scope of this paper. The reader is referred to [26] for further details. Due to the additional complexity introduced by the above approach in modeling register requirements directly, we restrict our attention in this paper to finding a TRB schedule.
Lastly, in Figure 2 there is a region labeled TB which denotes the set of all schedules with an initiation interval $T_{min}$ that use the minimum number of registers. That is, for the initiation interval $T_{min}$, there may be schedules which use fewer registers than those in TRB. However, a schedule in TB may or may not satisfy the resource constraint R. In our example loop L, in fact, the intersection of TB and R is empty. Figure 2(a) depicts this situation. Of course, this is not always the case. Fig. 2(b) represents the case when TB intersects R. Notice that in this case, TRB is a subset of TB. An interesting feature of the TB region is that a schedule belonging to TB can be computed efficiently using a low-degree polynomial time algorithm developed by Ning and Gao [11]. As alluded to in the Introduction, this fact will be used as a key heuristic later in searching for a solution in TRB. More specifically, the register requirement of a TB schedule is used as a lower bound for the number of registers in the OPT problem.
IV. OPT-T Formulation for Pipelined FUs
In this section, we first briefly introduce some background material. In the subsequent subsection, we develop the integer program formulation for the OPT-T problem. In Section IV-C, the OPT-T formulation for the motivating example of Fig. 1 is shown.
A. Definitions
This paper deals only with innermost loops. We represent such loops with a Data Dependence Graph (DDG), where nodes represent instructions, and arcs the dependences between instructions. With loop-carried dependences, the DDG could be cyclic. If node i produces a result in the current iteration and the result is used by node j, dd iterations later, then we say that the arc $(i, j)$ has a dependence distance dd, and we use $m_{ij}$ to denote it. In the DDG this is represented by means of dd initial tokens on the arc $(i, j)$.
Definition IV.1: A data dependence graph is a 4-tuple $(V, E, m, d)$ where V is the set of nodes, E is the set of arcs, $m = \{m_{ij} : \forall (i, j) \in E\}$ is the dependence distance vector on arc set E, and $d = \{d_i : \forall i \in V\}$ is the delay function on node set V.
In this paper we focus on the periodic schedule form $T \cdot j + t_i$ discussed in Section II. A periodic schedule is said to be feasible if it obeys all dependence constraints imposed by the DDG. The following lemma due to Reiter [23] characterizes feasible periodic schedules.
Lemma IV.1 (Reiter [23]): The initial execution times $t_i$ are feasible for a periodic schedule with period T if and only if they satisfy the set of inequalities:

$$t_j - t_i \ge d_i - T \cdot m_{ij}, \quad \forall (i, j) \in E$$
where $d_i$ is the delay of node i, T the period, and $m_{ij}$ the dependence distance for arc $(i, j)$.
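A feasibility test in the sense of the lemma is short enough to sketch in Python. The sketch assumes the usual form of the dependence inequalities, $t_j - t_i \ge d_i - T \cdot m_{ij}$ for every arc $(i, j)$, and uses a partial DDG for the motivating example containing only the arcs named in the text; the function name, the node numbering and the delays are our assumptions.

```python
def is_feasible(offsets, period, delays, arcs):
    """Check t_j - t_i >= d_i - T * m_ij for every arc (i, j)."""
    return all(offsets[j] - offsets[i] >= delays[i] - period * m
               for (i, j), m in arcs.items())


# Partial DDG (assumption): self loop on i2 with distance 1,
# plus the dependences i0 -> i5 and i1 -> i4 mentioned in Section II.
DELAYS = {0: 1, 1: 2, 2: 2, 3: 2, 4: 2, 5: 1}
ARCS = {(2, 2): 1, (0, 5): 0, (1, 4): 0}

SCHEDULE_A = {0: 0, 1: 2, 2: 4, 3: 7, 4: 9, 5: 11}
SCHEDULE_B = {0: 0, 1: 1, 2: 3, 3: 6, 4: 8, 5: 10}

print(is_feasible(SCHEDULE_A, 2, DELAYS, ARCS))  # prints True
print(is_feasible(SCHEDULE_B, 2, DELAYS, ARCS))  # prints True
print(is_feasible(SCHEDULE_B, 1, DELAYS, ARCS))  # prints False
```

The last call fails because the self loop on $i_2$ requires $0 \ge 2 - T$, which cannot hold for $T = 1$.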
In this paper, we assume that the rate-optimal period $T_{min}$ is always an integer. If not, the given DDG can be unrolled a suitable number of times, such that the resulting (unrolled) DDG has an integer period. Further, we have concentrated in this paper on straight-line code. Huff found that a large majority of FORTRAN loops contain no conditionals [7]. For loops involving conditionals, we assume a hardware model that supports predicated execution as in [24]. If-conversion [28] can be performed to support this model. As well, in [13] it was shown that predicated execution simplifies code generation after modulo scheduling.
For Schedule B, the values of the $t_i$ variables used in the linear form are: $t_0 = 0$, $t_1 = 1$, $t_2 = 3$, $t_3 = 6$, $t_4 = 8$, $t_5 = 10$.
The main question is how to relate the A matrix to the $t_i$ variables. For this purpose we can rewrite each $t_i$ as:

Lastly, we need to represent the register requirements of the schedule in a linear form. As mentioned earlier, in this paper, we model register requirements by FIFO buffers placed between producer and consumer nodes. Such an approach was followed in [11]. Further, we assume that buffer space is reserved as soon as the producer instruction commences its execution and remains reserved until the (last) consumer instruction begins its execution.
Consider an instruction i and its successor j. The result value produced by i is consumed by j after $m_{ij}$ iterations. This duration, called the lifetime of the result, is equal to $(t_j + T \cdot m_{ij} - t_i)$ in the periodic schedule. During this time, i would have fired $(t_j + T \cdot m_{ij} - t_i)/T$ times, and therefore this many buffers are needed to store the output of i. If instruction i has more than one successor j, then the register requirement for i is the maximum of $(t_j + T \cdot m_{ij} - t_i)/T$ over its successors j.

In [25], it was demonstrated that the minimum buffer requirement provides a very tight upper bound on the total register requirement, and once the buffer assignment is done, a classical graph coloring method can subsequently be performed which generally leads to the minimum register requirement. In this paper, we assume that such a coloring phase will always be performed once the schedule is determined. Now, integrating the buffer requirements with our ILP formulation, we can obtain the formulation which minimizes the buffer requirements in constructing rate-optimal resource-constrained schedules. For this purpose, the objective function is to minimize the total number of buffers used by the schedule. That is,

$$\text{minimize} \quad \sum_{i=0}^{N-1} b_i$$
The complete ILP formulation is shown in Figure 3 .
C. OPT-T Formulation for the Motivating Example
To illustrate the operation of the OPT-T formulation, we again examine the motivating example presented in Section II.
The minimum iteration period for the DDG in Figure 1 is $T = 2$. Further, there are $N = 6$ nodes. Equation (14) gives the dependence constraints for a feasible schedule:

Finally, the objective is to minimize the total number of buffers $\sum_{i=0}^{N-1} b_i$ subject to the constraints in Equations (16)-(23), and that $a_{t,i}$, $k_i$, $t_i$, and $b_i$ are nonnegative integers. Solving this integer program formulation yields Schedule B.
In solving the above integer programming problem, we need to obtain values for all $a_{t,i}$ variables and $k_i$ variables, and thus obtain the values for the $t_i$ variables which determine the schedule. Each $t_i$ variable can take values only within a specific range (determined by the dependences and the iteration period of the DDG), which in turn will restrict the range of t for which $a_{t,i}$ can take the value 1.
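The relation between the $a_{t,i}$, $k_i$ and $t_i$ variables can be illustrated with a short Python sketch. It assumes the common modulo decomposition $t_i = k_i \cdot T + \sum_t t \cdot a_{t,i}$, where column i of the 0-1 matrix A has a single 1 in row $t_i \bmod T$; this form is our assumption about the rewriting the text refers to, and the function names are hypothetical.

```python
def decompose(offsets, period):
    """Split each t_i into k_i = t_i // T and a 0-1 column of A.

    Returns (k, a) where a[slot][i] == 1 iff t_i mod T == slot.
    """
    k = [t // period for t in offsets]
    a = [[1 if t % period == slot else 0 for t in offsets]
         for slot in range(period)]
    return k, a


def recompose(k, a, period):
    """Rebuild t_i = k_i * T + sum over slots of slot * a[slot][i]."""
    return [k[i] * period + sum(slot * a[slot][i] for slot in range(period))
            for i in range(len(k))]


# Offsets of Schedule B (t_0 .. t_5) with period T = 2.
SCHEDULE_B = [0, 1, 3, 6, 8, 10]
k, a = decompose(SCHEDULE_B, 2)
print(k)     # prints [0, 0, 1, 3, 4, 5]
print(a[0])  # prints [1, 0, 0, 1, 1, 1]
print(a[1])  # prints [0, 1, 1, 0, 0, 0]
```

Summing a row of A over the instructions of one FU type gives the number of units of that type needed in that slot when the FUs are pipelined.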
V. OPT-T Formulation for Non-Pipelined FUs
In this section we develop the formulation for the OPT-T problem for non-pipelined FUs. As illustrated in Section II-C, this problem requires both scheduling and mapping to be performed simultaneously. In the following section we show how the resource usage for non-pipelined FUs can be modeled. The formulation of the mapping problem is discussed in Section V-B.
A. Resource Usage in Non-Pipelined FUs
In order to estimate the resource requirements with non-pipelined FUs, we need to know not just when each instruction is initiated (given by the A matrix), but also how long [...]

Notice that the FP instructions and the Load instructions, which take 2 time units to execute, require the FU for more than one time step in the usage matrix. As before, adding the appropriate elements of each row gives the FU requirement for type r.
How do we obtain the U matrix from A? An instruction i [...]

In our example loop, instructions $i_0$ and $i_5$ take one time unit to execute. Hence $u_{t,i_0} = a_{t,i_0}$ and $u_{t,i_5} = a_{t,i_5}$. That is,

$$\begin{aligned}
u_{0,i_0} &= a_{0,i_0}, & u_{1,i_0} &= a_{1,i_0}, & u_{2,i_0} &= a_{2,i_0} \\
u_{0,i_5} &= a_{0,i_5}, & u_{1,i_5} &= a_{1,i_5}, & u_{2,i_5} &= a_{2,i_5}
\end{aligned}$$

For instructions $i_1$, $i_2$, $i_3$ and $i_4$, $u_{t,i}$ is defined as:

$$\begin{aligned}
u_{0,i_1} &= a_{0,i_1} + a_{2,i_1}, & u_{1,i_1} &= a_{1,i_1} + a_{0,i_1}, & u_{2,i_1} &= a_{2,i_1} + a_{1,i_1} \\
u_{0,i_2} &= a_{0,i_2} + a_{2,i_2}, & u_{1,i_2} &= a_{1,i_2} + a_{0,i_2}, & u_{2,i_2} &= a_{2,i_2} + a_{1,i_2} \\
u_{0,i_3} &= a_{0,i_3} + a_{2,i_3}, & u_{1,i_3} &= a_{1,i_3} + a_{0,i_3}, & u_{2,i_3} &= a_{2,i_3} + a_{1,i_3} \\
u_{0,i_4} &= a_{0,i_4} + a_{2,i_4}, & u_{1,i_4} &= a_{1,i_4} + a_{0,i_4}, & u_{2,i_4} &= a_{2,i_4} + a_{1,i_4}
\end{aligned}$$

The requirement for type r FUs at time step t is

$$\sum_{i \in I(r)} u_{t,i}.$$
Since this should not exceed the number of available FUs,

$$\sum_{i \in I(r)} u_{t,i} \le F_r \quad \text{for all } t \in [0, T-1] \text{ and for all } r \qquad (25)$$

Replacing the resource constraint (Equation 10) in the ILP formulation (refer to Figure 3) by Equations 24 and 25, we obtain the scheduling part of the ILP formulation for non-pipelined FUs. However, as explained in Section II-C, the complete formulation must include the mapping part (fixed FU assignment) as well. Otherwise the schedules produced by the formulation may require the switching of instructions between FUs during the course of execution. In the following subsection we show how the mapping problem can also be formulated under the same framework.
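The construction of the usage matrix and the check of the resource constraint can be sketched in Python as follows. The sketch assumes the general rule $u_{t,i} = \sum_{l=0}^{d_i - 1} a_{(t-l) \bmod T,\, i}$, which reproduces the per-instruction equations of the example when $T = 3$ and $d_i = 2$. The example data covers only the three FP instructions of Schedule C, with issue slots 0, 2 and 1 (modulo 3) for $i_2$, $i_3$ and $i_4$ as read from the discussion in Section II; all names are ours.

```python
def usage_matrix(a, delays, period):
    """u[t][i] = sum of a[(t - l) mod T][i] for l in 0 .. d_i - 1."""
    n = len(delays)
    return [[sum(a[(t - l) % period][i] for l in range(delays[i]))
             for i in range(n)]
            for t in range(period)]


def within_limit(u, available):
    """True if every time step needs at most `available` FUs."""
    return all(sum(row) <= available for row in u)


# FP instructions of Schedule C, columns ordered (i2, i3, i4), T = 3.
A_FP = [[1, 0, 0],   # slot 0: i2 issues
        [0, 0, 1],   # slot 1: i4 issues
        [0, 1, 0]]   # slot 2: i3 issues
D_FP = [2, 2, 2]     # each FP instruction occupies its FU for 2 steps

U_FP = usage_matrix(A_FP, D_FP, 3)
print(U_FP)                   # prints [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
print(within_limit(U_FP, 2))  # prints True
```

The count-based check passes with 2 FP units even though, as the text goes on to explain, no fixed assignment of these three instructions to 2 units exists.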
B. Fixed FU Assignment
Consider Schedule C shown in Table III. Since the loop kernel is repeatedly executed, we map times 9, 10, and 11 to 0, 1, and 2 as shown in Figure 4. The usage of FP units is shown in Figure 4(b). Note that the function unit used by $i_3$ wraps around from time 2 to 0. This is a problem. At time 2, $i_3$ begins executing on the function unit that was used by $i_2$ at times 0 and 1. Since each instruction is supposed to use the same FU on every iteration, this causes a conflict at time 0, when $i_3$ is still executing on the FU needed by $i_2$. The difficulty is that Equation 25 only counts the number of FUs in use at one time, i.e. the number of solid horizontal lines present at each of the 3 time steps in Figure 4(b). However, we need to ensure that the two segments (corresponding to instruction $i_3$) get assigned to the same FU.
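The conflict can be confirmed by exhaustively trying every fixed assignment of the three FP instructions to FP units. The Python sketch below is a brute-force check rather than the integer formulation developed in this section; the issue slots (0, 1 and 2 modulo 3 for $i_2$, $i_4$ and $i_3$) are read from the description of Schedule C, and the function names are ours.

```python
from itertools import product


def busy_slots(start, delay, period):
    """Slots (modulo the period) during which an instruction holds its FU."""
    return {(start + l) % period for l in range(delay)}


def fixed_assignment(starts, delays, period, num_fus):
    """Return one instruction -> FU mapping with no overlap, or None."""
    names = list(starts)
    busy = {i: busy_slots(starts[i], delays[i], period) for i in names}
    for colors in product(range(num_fus), repeat=len(names)):
        ok = all(colors[x] != colors[y] or not (busy[names[x]] & busy[names[y]])
                 for x in range(len(names))
                 for y in range(x + 1, len(names)))
        if ok:
            return dict(zip(names, colors))
    return None


STARTS = {"i2": 0, "i4": 1, "i3": 2}
DELAYS = {"i2": 2, "i4": 2, "i3": 2}

print(fixed_assignment(STARTS, DELAYS, 3, 2))  # prints None
print(fixed_assignment(STARTS, DELAYS, 3, 3) is not None)  # prints True
```

With period 3 every pair of these instructions overlaps in some slot, so three FP units would be needed for this particular pattern even though at most two are busy at any one time.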
This problem bears a striking similarity to the problem of assigning variables with overlapping lifetimes to different registers. In particular, it is a circular arc coloring problem [29]. We must ensure that the two fragments corresponding to $i_3$ get the same color, a fact represented by the dotted arc in Figure 4(b). In addition, the arcs of $i_3$ overlap with both $i_2$ and $i_4$, meaning $i_3$ must have a different color than either. Similarly, $i_2$ and $i_4$ must have different colors from each other.

Using the usage matrix, we can now formulate the coloring problem with integer constraints. If two instructions $i$ and $j$ are executing at time $t$, then each must be assigned a different FU. That is, if $c_i$ and $c_j$ represent the colors (the function units to which they are mapped) of instructions $i$ and $j$ respectively, then $c_i \ne c_j$ if both $u_{t,i}$ and $u_{t,j}$ are 1. Such a constraint can be represented in integer programming by adopting the approach given by Hu [30]. We introduce a set of 0-1 integer variables $w_{i,j}$, with one such variable for each pair of nodes using the same type of function unit. Roughly speaking these

The successful formulation of the OPT-T problem provides the basis of our solution to the OPT problem. To solve the OPT problem, we iteratively solve the OPT-T formulation for increasing values of $T$, starting from $T_{lb}$, until we find a schedule satisfying the function unit constraint. In other words, $T_{min}$ is the smallest value greater than or equal to $T_{lb}$ for which a schedule obeying the resource constraint exists. We want to solve the OPT-T formulation with iteration period $T_{min}$. It has been observed that in most cases $T_{min}$ is at or near $T_{lb}$ [8], [7]. Thus an iterative search starting at $T_{lb}$ converges quickly to $T_{min}$.
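The iterative search for $T_{min}$ can be sketched in Python as follows. The solver is passed in as a callback, `solve_opt_t`, which is a hypothetical stand-in for solving the OPT-T formulation at a fixed period and is assumed to return a schedule or `None`; the cap `max_t` is also an assumption added so that the loop always terminates.

```python
def find_t_min(t_lb, solve_opt_t, max_t):
    """Try T = t_lb, t_lb + 1, ... and return the first feasible (T, schedule).

    solve_opt_t(T) is a hypothetical callback standing in for the OPT-T
    solve; it must return a schedule, or None if no schedule exists at T.
    Returns None if no period up to max_t admits a schedule.
    """
    for T in range(t_lb, max_t + 1):
        schedule = solve_opt_t(T)
        if schedule is not None:
            return T, schedule
    return None


# Toy stand-in solver: pretend the resources only allow periods of 4 or more.
def toy_solver(T):
    return f"schedule@{T}" if T >= 4 else None


result = find_t_min(t_lb=2, solve_opt_t=toy_solver, max_t=10)
```

Because $T_{min}$ is usually at or near $T_{lb}$, the loop is expected to call the solver only a few times in practice.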
In solving the ILP formulation of the OPT-T problem, we can guide our search by giving a lower bound on the number of buffers required. We illustrate this idea as follows. Let $T$ be the smallest iteration period for which a schedule obeying the function unit constraint exists. For this value of $T$, by solving the minimum register optimal schedule formulation proposed by Ning and Gao [11], we can obtain a lower bound on the number of buffers. Ning and Gao's formulation is a linear program and can be solved efficiently. However, since this formulation [11] does not include resource constraints, the obtained schedule may or may not satisfy resource constraints.
VII. Performance of ILP Schedules
In this section we present the performance results of the ILP scheduler. Section VIII is devoted to a comparison with heuristic methods.
We have implemented our ILP based software pipelining method on a UNIX workbench. We have experimented with 1008 single-basic-block inner loops extracted from various scientific benchmark programs such as SPEC92 (integer and floating point), linpack, livermore, and the NAS kernels. The DDGs for the loops were obtained by instrumenting a highly optimizing research compiler. We have considered loops with up to 64 nodes in the DDG as in [7]. The DDGs varied widely in size, with a median of 7 nodes, a geometric mean of 8, and an arithmetic mean of 12.
To solve the ILPs, we used the commercial program CPLEX. Because our ILP approach can take a very long time on some loops, we adopted the following approach. First, we limited CPLEX to 3 minutes in trying to solve any single ILP, i.e. a maximum of 3 minutes was allowed to find a schedule at a given $T$. Second, initiation intervals in the range $[T_{min}, T_{min} + 5]$ were tried if necessary. Once a schedule was found at some $T$ in this range, no greater values of $T$ were tried.
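One consequence of the time limit is that a schedule found at some $T$ is only known to be rate-optimal if no smaller $T$ was abandoned because the solver ran out of time. The Python sketch below illustrates that bookkeeping under stated assumptions: the per-period outcomes are given as a list of status strings (`"found"`, `"timeout"`, `"infeasible"`), and these labels, the function name, and the `window` parameter are all hypothetical rather than part of CPLEX or of the method described here.

```python
def classify_search(outcomes, t_start, window=6):
    """Summarise a time-limited search over T = t_start, t_start + 1, ...

    outcomes[k] is the (assumed) status of the solve at T = t_start + k:
    "found", "timeout" or "infeasible". At most `window` periods are
    examined, matching a range of six consecutive values of T.
    Returns (T, label), or (None, "no schedule") if nothing was found.
    """
    proven = True  # becomes False once a smaller T is left undecided
    for offset, status in enumerate(outcomes[:window]):
        if status == "found":
            label = "optimal" if proven else "possibly optimal"
            return t_start + offset, label
        if status == "timeout":
            proven = False
    return None, "no schedule"


first_try = classify_search(["found"], t_start=5)
after_timeout = classify_search(["timeout", "found"], t_start=5)
```

A `"timeout"` at a smaller period leaves open whether a faster schedule exists, so the later result can only be labelled possibly optimal.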
We have assumed the following execution latencies for the various instructions.

We applied our scheduling method to different architectural configurations. We considered architectures with pipelined or non-pipelined execution units. We also considered architectures where the FUs are generic, i.e. each FU can execute any instruction. Such FUs are referred to as homogeneous FUs. A heterogeneous FU type, such as a Load/Store Unit, on the other hand, can only execute instructions of a specific type (or a class of types). The six different architectural configurations considered in our experiments are:

In a large majority of cases, the ILP approach found an optimal schedule close to $T_{min}$, as shown in Table VII. To be specific, for architectures with homogeneous pipelined FUs (A1 and A2), the ILP approach found an optimal schedule in more than 88% of cases. For non-pipelined homogeneous FUs, an optimal schedule was found in 71% of the cases. Lastly, for architectures with heterogeneous FUs (A5 and A6), the figure varies from 80% to 85%. For all architectural configurations, in a small fraction of the test cases, the ILP method found a schedule at a $T$ greater than a possible $T_{min}$. That is, in these cases, the obtained schedule is a possible optimal schedule. We say a possible $T_{min}$ and a possible optimal schedule here since there is no evidence either way: CPLEX's 3-minute time limit expired without indicating whether or not a schedule exists for a lower value of $T$.

Next we proceed to compare how close the ILP schedules were to the optimal buffer requirement. In deriving minimal buffer, rate-optimal schedules, CPLEX's 3-minute time limit was sometimes exceeded before finding a buffer-optimal schedule. In those cases we took the best schedule obtained so far. In other words, this could be one of the schedules from the set TR in Fig. 2. Once again, this schedule could possibly lie in TRB, but there is no evidence, for or against, as the 3-minute time limit of CPLEX was exceeded.
We compare the buffer requirement of this schedule with that of a TB schedule obtained from the Ning-Gao formulation [11]. We note again that the Ning-Gao formulation obtains minimal buffer, rate-optimal schedules using linear programming techniques and does not include resource constraints. Thus the bound obtained from the Ning-Gao formulation is a loose lower bound, and there may or may not exist a resource-constrained schedule with this buffer requirement. Let us denote the buffer requirements of TB, TR, and TRB schedules by $B_{TB}$, $B_{TR}$, and $B_{TRB}$ respectively. Then $B_{TR} \ge B_{TRB} \ge B_{TB}$. To compare the quality of schedules, we take the minimum buffer requirement $B_{min}$ as $B_{TRB}$ if a TRB schedule is found and $B_{TB}$ otherwise. Thus, when a TRB schedule is not found, $B_{min}$ is an overly optimistic lower bound.

Table VII shows the quality of ILP schedules in terms of their buffer requirements. Here we consider only those cases where the ILP approach found a schedule, optimal or otherwise. As can be seen from this table, the ILP approach produces schedules that require minimal buffers in 85% to 90% of the cases for architectures involving heterogeneous FUs (pipelined or non-pipelined) or homogeneous pipelined FUs (6 or 4 FUs). For architectures with homogeneous non-pipelined FUs (A3 and A4), the quality of the schedules, in terms of both computation rate ($1/T$) and buffer requirement, is poor compared to all other architectural configurations. This is due to the increased complexity of mapping rather than scheduling. The complexity of mapping instructions to FUs is significantly higher for homogeneous FUs than for heterogeneous FUs. This is because each instruction can potentially be mapped to any of the FUs, and hence the overlap (in execution) of all pairs of instructions needs to be considered. In the heterogeneous model, on the other hand, we only need to consider pairs of instructions that are executed on the same FU type.
Finally, how long did it take to get these schedules? We measured the execution time (henceforth referred to as the compilation time) of our scheduling method on a Sun/Sparc20 workstation. The geometric mean, arithmetic mean, and median of the execution time for the 6 architectural configurations are shown in Table VII. A histogram of the execution time for various architectural configurations is shown in Figure 6. From Table VII we observe that the geometric mean of the execution time is less than 2 seconds for architectures with homogeneous pipelined FUs and less than 5 seconds for architectures with heterogeneous FUs. The median of the execution time is less than 3 seconds for all cases. Architectural configurations A3 and A4 (with homogeneous non-pipelined FUs) required

We conclude this section by noting that even though our ILP based scheduling method was successful in a large majority of test cases, it still could not find a schedule for 15% to 20% of the test cases within the given time limit and number of tries. For these cases, there are a number of alternatives: (1) allow the ILP more than 3 minutes, (2) change the order in which the ILP solver attempts to satisfy the constraints, (3) move to some other exact approach such as enumeration [26], (4) fall back to some heuristic. We have made no systematic investigation of (1) and (2), although we have found that each is successful for some loops. Enumeration schedules about the same number of loops as the ILP approach described here, although the loops successfully scheduled by the two approaches are not identical [26]. The ILP approach can also be used as the basis for some heuristics. For example, heuristic limits on the scheduling times of each node could be added as constraints to the ILP.
VIII. Comparison with Heuristic Methods
Our extensive experimental evaluation indicates that the ILP approach can obtain the schedule for a large majority of the test cases reasonably quickly. But do the optimality objective and the associated computation cost pay off in terms of the computation rate or buffer requirement of the derived schedules? It is often argued that existing heuristic methods (without any mathematical optimality formulation) do very well and consequently there is no need to find optimal schedules. Our results indicate otherwise. We consider 3 leading heuristic methods for comparative study.

Table VIII compares the computation rate and buffer requirements of ILP schedules with those of the heuristic methods for various architectural configurations. In particular, columns 3 and 4 tabulate the number of loops in which the ILP schedules did better and the percentage improvement in $T_{min}$ achieved. Similarly, columns 8 and 9 represent the improvements in buffer requirements. Because of the approach followed in obtaining the ILP schedules (restricting the time to solve an ILP problem to 3 minutes and then trying the next higher value of $T$, which yields sub-optimal schedules), the computation rate and/or the buffer requirements of ILP schedules are worse than those of the heuristic methods in a small fraction of the test cases. Columns 5 and 6 represent, respectively, the number of test loops and the percentage improvement in $T_{min}$ achieved by the heuristic methods. Columns 10 and 12 in Table VIII are for buffer improvements. Note that the buffer requirements are compared only when the corresponding schedules had the same iteration period.
As can be seen from

In these cases, the ILP schedules are faster on average by 13% to 15%, as shown in column 4 of Table VIII. Further, the high computation costs of ILP schedules pay significant dividends in terms of buffer requirements for all architecture configurations. In more than 45% of the test cases (when the corresponding schedules have the same iteration period), the buffer requirements of ILP schedules are less than those of Huff's Slack Scheduling method. The geometric mean of the improvement (in buffer requirements) achieved by the ILP schedules ranges from 15% to 22%.

Compared to Gasperoni's modified list scheduling and Wang et al.'s FRLC method, ILP produced faster schedules in 18% to 40% (or 187 to 394) of the test cases for the various architectural configurations considered. The improvement in $T_{min}$ achieved by the ILP schedules is significant: 26% to 48%. This means that the schedules generated by the ILP method can run up to 50% faster than those generated by the FRLC method or the modified list scheduling method. These heuristic methods score well in a small fraction (up to 3%) of the test cases. Once again, the buffer requirements of ILP schedules are better (by 17% to 29%) than those of FRLC or modified list scheduling in 460 to 640 test cases.
The most attractive feature of the heuristic methods is their execution time. The execution time for any of the heuristic methods was less than 1 second for more than 90% of the loops. The mean execution time was less than 0.25 second for all the architectural configurations. Of the three heuristic methods, Huff's Slack Scheduling method required slightly more computation time.
Our experiments reveal that the ILP-based optimal scheduling method does produce good schedules, though at the expense of a longer compilation time. With the advent of more efficient ILP solvers, the compilation time is likely to decrease in the future. Despite the high compilation costs, our experiments suggest the possible use of the ILP approach for performance-critical applications. In the following subsection we present a case for the ILP approach even though the use of such an approach in production compilers is debatable.
A. Remarks
We hope that the experimental results presented in this and in the previous section will help the compiler community in the assessment of the ILP based exact method. Despite a reasonably good performance in a large majority of the test cases, the use of ILP based exact methods in production compilers remains questionable. However, in the course of our experiments, we noticed that many loop bodies occur repeatedly in different programs. We developed a tool that analyzes whether two DDGs are similar in the sense that they (1) execute the same operations, or at least execute operations with the same latency and on the same function unit, and (2) have the same set of edges and dependence distances between those operations.
We found that out of our 1008 test cases, there are only 415 loops that are unique. One loop body was common to 73 different loops! The repetition of loop bodies, on the one hand, implies that our benchmark suite consists of only 415 unique test cases (rather than 1008); on the other hand, it suggests that the number of distinct loops appearing in scientific programs is limited, and that the compiler could use our ILP approach to precompute optimal schedules for the most commonly occurring loops. This scheme could also be tailored to individual users by adding new loops to the database as the compiler encounters them. In fact, the ILP computation could be run in the background, so that the user may get non-optimal code the first time his/her code

The complexity of the tool that analyzes whether two DDGs are similar is $O(E^4)$ in the worst case, but $O(E)$ in the average case, where $E$ is the number of edges in the DDG, and in most cases $E \approx N$, the number of nodes in the DDG. It took 53 seconds on a Sun/Sparc20 to find the 415 unique loops out of the 1008, i.e. about 53 milliseconds per loop. For practical use, the tool requires that a database of DDGs and their schedules be stored in an encoded form. The number of DDGs (in the database) that are compared with a given loop can be drastically reduced by a simple comparison of the number of nodes and the number of arcs of the DDGs.
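The similarity test and the node/arc pre-filter can be illustrated with a small Python sketch. It is not the tool described above and makes no attempt to match its complexity: the DDG encoding (a list of `(latency, fu_type)` labels plus a set of `(src, dst, distance)` edges), the function names, and the naive backtracking search are all assumptions made for illustration.

```python
from collections import Counter


def size_key(labels, edges):
    """Cheap pre-filter: DDGs can only match if node and arc counts agree."""
    return (len(labels), len(set(edges)))


def similar(labels_a, edges_a, labels_b, edges_b):
    """True if some relabelling of A's nodes turns A into B.

    labels_x[i] is an assumed (latency, fu_type) pair for node i;
    edges_x holds (src, dst, dependence_distance) triples.
    """
    edges_a, edges_b = set(edges_a), set(edges_b)
    if size_key(labels_a, edges_a) != size_key(labels_b, edges_b):
        return False
    if Counter(labels_a) != Counter(labels_b):
        return False
    n = len(labels_a)
    mapping, used = {}, set()

    def edges_ok():
        # Every A-edge whose endpoints are both mapped must exist in B.
        return all((mapping[s], mapping[d], dist) in edges_b
                   for s, d, dist in edges_a
                   if s in mapping and d in mapping)

    def extend(u):
        if u == n:
            return True
        for v in range(n):
            if v in used or labels_a[u] != labels_b[v]:
                continue
            mapping[u] = v
            used.add(v)
            if edges_ok() and extend(u + 1):
                return True
            del mapping[u]
            used.discard(v)
        return False

    return extend(0)


# Toy DDGs: B is A with its first two nodes swapped; C changes one distance.
labels_a = [(1, "int"), (2, "fp"), (2, "fp")]
edges_a = {(0, 1, 0), (1, 2, 0), (2, 1, 1)}
labels_b = [(2, "fp"), (1, "int"), (2, "fp")]
edges_b = {(1, 0, 0), (0, 2, 0), (2, 0, 1)}
edges_c = {(0, 1, 0), (1, 2, 0), (2, 1, 2)}
```

In a schedule database, `size_key` could serve as a dictionary key so that the more expensive `similar` check is only run against stored DDGs of the same size.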
One last question remains on the usefulness of such a database of DDGs and their precompiled schedules: how many of these (precompiled) schedules required a longer compilation time? This question is relevant because if the database of DDGs contains only loops for which the schedule can in any case be found in a short compilation time, it will perhaps take less time to determine the schedule than to search the database. We investigate this by plotting the compilation time of the 415 unique loops against multiplicity, that is, how often each DDG repeats in the benchmark suite. We also plot the size of the DDGs versus multiplicity in Fig. 7.
As can be seen from Figure 7, though the repetition of DDGs is more common when the size of the DDG is small, large DDGs do repeat, perhaps with a low degree of multiplicity (2 to 6). The plots of compilation time of DDGs (for various architectural configurations) against multiplicity indicate similar results; i.e. though a majority of the database is likely to contain DDGs that take a shorter compilation time, there do exist DDGs which require a longer compilation time and repeat in the benchmark suite, perhaps with a low degree of multiplicity. This is especially true for architectural configurations A3 to A6.
Our initial results only show that DDGs that require a longer compilation time do repeat, though with a lower degree of multiplicity. However, they do not address the tradeoff between the cost of storing a database of loops with their precompiled schedules and the advantage of obtaining optimal schedules quickly. This tradeoff determines the usefulness of the database approach. Further study is required to derive stronger and conclusive results.

Lam [8] proposed a resource-constrained software pipelining method using list scheduling and hierarchical reduction of cyclic components. Our A matrix is similar to her modulo resource reservation table, a concept originally due to Rau and Glaeser [12]. Both, as she put it, "represent the resource usage of the steady state by mapping the resource usage of time $t$ to that of $t \bmod T$." Lam's solution of the OPT problem was also iterative. Huff's Slack Scheduling [7] is also an iterative solution to the OPT problem. His heuristics (i) give priority to scheduling nodes with minimum slack in the time at which they can be scheduled, and (ii) try to schedule a node at a time which minimizes the combined register pressure from node inputs and outputs. He reported extremely good results in addressing the OPT

Gao [11] proposed an efficient method of obtaining a software-pipelined schedule using minimum buffers for a fixed initiation rate. However, they did not address function unit requirements in their formulation. In comparison to all these, our approach tries to obtain the fastest computation rate and minimum buffers under the given resource constraint.
In [39] Feautrier independently gave an ILP formulation similar to our method. However, his method does not include FU mapping for non-pipelined execution units. Eichenberger, Davidson and Abraham [27] have proposed a method to minimize the maximum number of live values at any time step for a given repetitive pattern by formulating the problem as a linear programming problem. However, their approach starts with a repetitive pattern that already satisfies the resource constraint. It is possible to incorporate their approach in our formulation and model registers directly, rather than through logical buffers. Such an approach was independently developed and incorporated in our formulation by Altman [26]. Hwang et al. have proposed an integer programming formulation for scheduling acyclic graphs in the context of high-level synthesis of systems [40].
X. Conclusions
In this paper we have proposed a method of constructing software pipelined schedules that use minimum buffers and run at the fastest iteration rate for the given resource constraints. A graph coloring method can be applied to the obtained schedule to get a schedule that uses minimum registers. Our approach is based on an integer programming formulation. The formulation is quite general in that (1) it can be used to provide a compiler option to generate faster schedules, perhaps at the expense of longer compilation time, especially for performance-critical applications; and (2) since our formulation has precisely stated optimality objectives, it can be used to ascertain the optimal solution and hence evaluate and improve existing/newly proposed heuristic methods.
We have empirically established the usefulness of our formulation by applying it to 1008 loops extracted from common scientific benchmarks on six different architecture models with varying degrees of instruction-level parallelism and pipelining. Our experimental results based on these benchmark loops indicate that our method can find an optimal schedule (optimal in terms of both computation rate and register usage) for a large majority of test cases reasonably fast. The geometric mean time to find a schedule was less than 5 seconds and the median was less than 3 seconds. Even though our ILP method takes longer, it produced schedules with smaller register requirements in more than 60% of the test cases. ILP schedules are faster (better computation rate) than their counterparts in 14% of the test cases (on the average). We believe that the results presented in this paper will be helpful in assessing the tradeoffs of ILP based exact methods for software pipelining.
