Introduction
It is well known that, as a rule, there is inadequate instruction-level parallelism (ILP) between the operations in a single basic block and that higher levels of parallelism can only result from exploiting the ILP between successive basic blocks. Global acyclic scheduling techniques, such as trace scheduling [13, 23] and superblock scheduling [19], do so by moving operations from their original basic blocks to preceding or succeeding basic blocks. In the case of loops, the successive basic blocks correspond to the successive iterations of the loop rather than to a sequence of distinct basic blocks.
Various cyclic scheduling schemes have been developed in order to achieve higher levels of ILP by moving operations across iteration boundaries, i.e., either backward to previous iterations or forward to succeeding iterations.
One approach, "unroll-before-scheduling", is to unroll the loop some number of times and to apply a global acyclic scheduling algorithm to the unrolled loop body [13, 19, 23]. This achieves overlap between the iterations in the unrolled loop body, but still maintains a scheduling barrier at the back-edge. The resulting performance degradation can be reduced by increasing the extent of the unrolling, but it is at the cost of increased code size.
Software pipelining [8] refers to a class of global cyclic scheduling algorithms which impose no such scheduling barrier. One way of performing software pipelining, the "move-then-schedule" approach, is to move instructions, one by one, across the back-edge of the loop, in either the forward or the backward direction [11, 12, 20, 15, 28]. Although such code motion can yield improvements in the schedule, it is not always clear which operations should be moved around the back-edge, in which direction and how many times to get the best results. The process is somewhat arbitrary and reminiscent of early attempts at global acyclic scheduling by the ad hoc motion of code between basic blocks [42]. On the other hand, this currently represents the only approach to software pipelining that at least has the potential to handle loops containing control flow in a near-optimal fashion, and which has actually been implemented [28]. How close it gets, in practice, to the optimal has not been studied and, in fact, for this approach, even the notion of "optimal" has not been defined.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. MICRO 27, 11/94, San Jose, CA, USA. © 1994 ACM 0-89791-707-3/94/0011...$3.50
The other approach, the "schedule-then-move" approach, is to instead focus directly on the creation of a schedule that maximizes performance, and to subsequently ascertain the code motions that are implicit in the schedule. Once again, there are two ways of doing this. The first, "unroll-while-scheduling", is to simultaneously unroll and schedule the loop until one gets to a point at which the rest of the schedule would be a repetition of an existing portion of the schedule [3]. Instead of further unrolling and scheduling, one can terminate the process by generating a branch back to the beginning of the repetitive portion.
Recognition of this situation requires that one maintain the state of the scheduling process, which includes at least the following information: knowledge of how many iterations are in execution and, for each one, which operations have been scheduled, when their results will be available, what machine resources have been committed to their execution into the future and are, hence, unavailable, and which register has been allocated to each result. All of this has to be identical if one is to be able to branch back to a previously generated portion of the schedule. Computing, recording and comparing this state presents certain engineering challenges that have not yet been addressed by a serious implementation.
On the other hand, by focusing solely on creating a good schedule, with no scheduling barriers and no ad hoc, a priori decisions regarding inter-block code motion, such unroll-while-scheduling schemes have the potential of yielding very good schedules even on loops containing control flow.
Another "schedule-then-move" approach is modulo scheduling [34], a framework within which algorithms of this kind may be defined.¹ The framework specifies a set of constraints that must be met in order to achieve a legal modulo schedule. The objective of modulo scheduling is to engineer a schedule for one iteration of the loop such that when this same schedule is repeated at regular intervals, no intra- or inter-iteration dependence is violated, and no resource usage conflict arises between operations of either the same or distinct iterations. This constant interval between the start of successive iterations is termed the initiation interval (II). In contrast to unrolling approaches, the code expansion is quite limited. In fact, with the appropriate hardware support, there need be no code expansion whatsoever [36]. Once the modulo schedule has been [...]

¹ The original use of the term "software pipelining" by Charlesworth was to refer to a limited form of modulo scheduling. However, current usage of the term has broadened its meaning to the one indicated here.

The subject of this paper is the modulo scheduling algorithm itself, which is at the heart of this entire process. This includes the computation of the lower bound on the initiation interval. The reader is referred to the papers cited above for a discussion of the other steps that either precede or follow the actual scheduling.
Although the modulo scheduling framework was formulated over a decade ago [34], at least two product compilers have incorporated modulo scheduling algorithms [30, 10], and any number of research papers have been written on this topic [16, 21, 41, 39, 44, 45, 18], there exists a vague and unfounded perception that modulo scheduling is computationally expensive, too complicated to implement, and that the resulting schedules are sub-optimal.
In large part, this is due to the fact that there has been little work done to evaluate and compare alternative algorithms and heuristics for modulo scheduling from the viewpoints of schedule quality as well as computational complexity.
This paper takes a first step in this direction by describing a practical modulo scheduling algorithm which is capable of dealing with realistic machine models. Also, it reports on a detailed evaluation of the quality of the schedules generated and the computational complexity of the scheduling process. For lack of space, this paper does not even attempt to provide a comparison of the algorithm described here to other alternative approaches for software pipelining.
Such a comparison will be reported elsewhere. Also, the goal of this paper is not to justify software pipelining.
The benefits of this, just as with any other compiler optimization or transformation, are highly dependent upon the workload that is of interest. Each compiler writer must make his or her own appraisal of the value of this capability in the context of the expected workload.
The remainder of this paper is organized as follows. Section 2 discusses the algorithms used to compute the lower bound on the initiation interval. Section 3 describes the iterative modulo scheduling algorithm. Section 4 presents experimental data on the quality of the modulo schedules and on the computational complexity of the algorithms used, and Section 5 states the conclusions.

[...] format. The resource usage of a particular opcode is specified as a list of resources and the attendant times at which each of those resources is used by the operation relative to the time of issue of the operation. Figure 1a is a pictorial representation of the resource usage pattern of a highly pipelined ALU operation, with an execution latency of four cycles, which uses the two source operand buses on the cycle of issue, uses the two pipeline stages of the ALU on the next two cycles, respectively, and then uses the result bus on its last cycle of execution. Likewise, Figure 1b shows the resource usage pattern of a multiply operation on the multiplier pipeline. This method of modelling resource usage is termed a reservation table [9].
From these two reservation tables, it is evident that an ALU operation (such as an add) and a multiply cannot be scheduled for issue at the same time since they will collide in their usage of the source buses. Furthermore, although a multiply may be issued any number of cycles after an add, an add may not be issued two cycles after a multiply since this will result in a collision on the result bus.
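The collision rules just stated can be checked mechanically. The following is a minimal sketch in Python, not the paper's data structure: a reservation table is modelled as a mapping from a resource name to the set of cycles, relative to issue, on which that resource is used. The ALU table follows the description of Figure 1a. The multiply table is an assumption standing in for Figure 1b (four multiplier stages and a result bus use on cycle 5), chosen only so that it agrees with the collisions described in the text. The resource names and the function `collides` are hypothetical.

```python
# Reservation tables: resource name -> set of cycles (relative to issue) on which it is used.
# ALU follows the textual description of Figure 1a.
ALU = {
    "src_bus_0": {0}, "src_bus_1": {0},
    "alu_stage_1": {1}, "alu_stage_2": {2},
    "result_bus": {3},
}
# MUL is an ASSUMED stand-in for Figure 1b (the figure is not reproduced here).
MUL = {
    "src_bus_0": {0}, "src_bus_1": {0},
    "mul_stage_1": {1}, "mul_stage_2": {2}, "mul_stage_3": {3}, "mul_stage_4": {4},
    "result_bus": {5},
}


def collides(first, second, offset):
    """True if `second`, issued `offset` cycles after `first`, needs a
    resource on a cycle in which `first` already uses that resource."""
    for resource, cycles in second.items():
        busy = first.get(resource, set())
        if any((cycle + offset) in busy for cycle in cycles):
            return True
    return False


same_cycle = collides(ALU, MUL, 0)         # both need the source buses on cycle 0
add_two_after_mul = collides(MUL, ALU, 2)  # both need the result bus on cycle 5
mul_after_add = [collides(ALU, MUL, k) for k in range(1, 20)]
```

Under these assumed tables, `same_cycle` and `add_two_after_mul` are both true, while no entry of `mul_after_add` is, which matches the three statements made about the two operations.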
When performing scheduling with a realistic machine model, a data structure similar to the reservation [...]

Accordingly, the ResMII is computed by first sorting the operations in the loop body in increasing order of the number of alternatives, i.e., degrees of freedom. As each operation is taken in order from this list, the number of times it uses each resource is added to the usage count for that resource. For each operation, that alternative is selected which yields the lowest partial ResMII, i.e., the usage count of the most heavily used resource at that point. When all operations have been considered, the usage count for the most heavily used resource constitutes the ResMII.
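The greedy ResMII procedure described above can be sketched as follows. This is an illustrative Python rendering under stated assumptions, not the author's implementation: each operation is represented as a non-empty list of alternatives, and each alternative as a mapping from a resource name to the number of times the operation uses it. The function name `res_mii` and the resource names in the example are hypothetical.

```python
def res_mii(operations):
    """Greedy ResMII estimate.

    operations: list of operations; each operation is a non-empty list of
    alternatives; each alternative maps resource name -> number of uses.
    """
    usage = {}  # running usage count per resource
    # Operations with the fewest alternatives (degrees of freedom) go first.
    for alternatives in sorted(operations, key=len):
        best_usage, best_peak = None, None
        for alternative in alternatives:
            trial = dict(usage)
            for resource, count in alternative.items():
                trial[resource] = trial.get(resource, 0) + count
            # Partial ResMII: usage count of the most heavily used resource.
            peak = max(trial.values(), default=0)
            if best_peak is None or peak < best_peak:
                best_usage, best_peak = trial, peak
        usage = best_usage
    return max(usage.values(), default=0)


# Hypothetical loop body: loads have one alternative, adds can use either ALU.
load = [{"mem_port": 1}]
add = [{"alu_0": 1}, {"alu_1": 1}]
three_adds_one_load = res_mii([add, add, add, load])
```

In the example the three adds are spread over the two ALUs, so the most heavily used resource is used twice and the estimate is 2. Ties between alternatives are broken in favour of the first one listed, which is a choice of this sketch rather than something the text specifies.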
The recurrence-constrained MII (RecMII)
A loop contains a recurrence if an operation in one iteration of the loop has a direct or indirect dependence upon the same operation from a previous iteration. The dependence in question may either be a data dependence (flow, anti- or output) or a control dependence.
Clearly, in the chain of dependences between the two instances of the operation, one or more dependences must be between operations that are in different iterations. We shall refer to such dependences as inter-iteration dependences.
Dependences between operations in the same iteration are termed intra-iteration dependences.
A single notation can be used to represent both types of dependence. The distance of a dependence is the number of iterations separating the two operations involved.
A dependence with a distance of 0 connects operations in the same iteration, a dependence from an operation in one iteration to an operation in the next one has a distance of 1, and so on. All undesirable anti- and output dependences are assumed to have been eliminated, in a preceding step, by the use of expanded virtual registers (EVRs) and dynamic single assignment [32]. Briefly, an EVR extends the concept of a virtual register to one that can retain the entire sequence of values written to that EVR. Since earlier values are never overwritten and destroyed, anti-dependences can be eliminated, even in cyclic code in which the same operation is executed repeatedly. (Of course, like conventional virtual registers, EVRs cannot be implemented in hardware. This discrepancy is handled by the register allocator. Rotating registers [37, 5] provide hardware support for EVRs in innermost loops, but are not essential.)

The dependences can be represented as a graph, with each operation represented by a vertex in the graph and each dependence represented by a directed edge from an operation to one of its immediate successor operations. There may be multiple edges, possibly with opposite directions, between the same pair of vertices. The dependence distance is indicated as a label on the edge. Additionally, each edge possesses a second attribute, which is the delay, i.e., the minimum time interval that must exist between the start of the predecessor operation and the start of the successor operation.
In general, this is influenced by the type of the dependence edge and the execution latencies of the two operations as specified in Table 1. For a classical VLIW processor with non-unit architectural latencies, the delay for an anti-dependence or output dependence can be negative if the latency of the successor is sufficiently large. This is because it is only necessary that the predecessor start at the same time as or finish before, respectively, the successor finishes. A more conservative formula for the computation of the delay, which assumes only that the latency of the successor is not less than 1, is also shown in Table 1. This is more appropriate for superscalar processors.
The recurrence-constrained lower bound on II, RecMII, is calculated using this dependence graph. The existence of a recurrence manifests itself as a circuit in the dependence graph. Assume that the sum of the delays along some elementary circuit c in the graph is Delay(c) and that the sum of the distances along that circuit is Distance(c). The existence of such a circuit imposes the constraint that the scheduled time interval between an operation on this circuit and the same operation Distance(c) iterations later must be at least Delay(c). [...]

[...] Since resource reservations are made on an MRT, a conflict at time T implies conflicts at all times T + k*II. So, it is sufficient to consider a contiguous set of candidate times that span an interval of II time slots. Therefore, MaxTime, which is the largest time slot that will be considered, is set to MinTime + II - 1, whereas in acyclic list scheduling it is effectively set to infinity.

Figure 3. The function IterativeSchedule.
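The recurrence constraint on circuits described above can be made concrete with a small sketch. It assumes the standard reading of that constraint: since successive iterations start II cycles apart, the same operation Distance(c) iterations later starts II * Distance(c) cycles later, so every circuit must satisfy Delay(c) - II * Distance(c) <= 0, and RecMII is the smallest II for which this holds. The code below tests each candidate II with an all-pairs longest-path pass over edge weights delay - II * distance; this is a simple stand-in written for illustration, not necessarily the procedure used in the paper, and the function names are hypothetical.

```python
import math


def has_positive_circuit(num_ops, edges, ii):
    """True if some circuit has Delay(c) - ii * Distance(c) > 0.

    edges: list of (pred, succ, delay, distance), operations numbered 0..num_ops-1.
    """
    longest = [[-math.inf] * num_ops for _ in range(num_ops)]
    for pred, succ, delay, distance in edges:
        longest[pred][succ] = max(longest[pred][succ], delay - ii * distance)
    # Floyd-Warshall style relaxation, maximizing instead of minimizing.
    for k in range(num_ops):
        for i in range(num_ops):
            for j in range(num_ops):
                via_k = longest[i][k] + longest[k][j]
                if via_k > longest[i][j]:
                    longest[i][j] = via_k
    return any(longest[i][i] > 0 for i in range(num_ops))


def rec_mii(num_ops, edges):
    """Smallest II >= 1 for which no circuit violates its recurrence constraint."""
    # If every circuit has distance >= 1, an II equal to the sum of all
    # positive delays is always sufficient, so the search can stop there.
    bound = max(1, sum(max(delay, 0) for _, _, delay, _ in edges))
    for ii in range(1, bound + 1):
        if not has_positive_circuit(num_ops, edges, ii):
            return ii
    raise ValueError("a circuit with zero total distance has positive delay")


# Two-operation recurrence: total delay 5 around the circuit, total distance 1.
simple = rec_mii(2, [(0, 1, 3, 0), (1, 0, 2, 1)])
# Total delay 7 spread over a distance of 2 iterations.
spread = rec_mii(2, [(0, 1, 4, 0), (1, 0, 3, 2)])
```

The first example yields 5 and the second yields 4, i.e., the delay-to-distance ratio rounded up. Starting the search at II = 1 is a convention of this sketch for loops without any recurrence.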
Computation of the scheduling priority
As is the case for acyclic list scheduling, there is a limitless number of priority functions that can be devised for modulo scheduling. Most of the ones used have been such as to give priority, one way or another, to operations that are on a recurrence circuit over those that are not [16, 21, 10]. This is to reflect the fact that it is more difficult to schedule such operations since all but the first one scheduled in an SCC are subject to a deadline. Instead, we shall use a priority function that is a direct extension of the height-based priority [17, 31] [...]

HeightR( ) has a couple of good properties. First, as we shall see in Section 4, a large fraction of the loops are quite simple in their structure. For such loops there is a very good chance of scheduling them in one pass, but only if the operations are scheduled in topological sort order. HeightR( ) ensures this. Second, HeightR( ) gives higher priority to operations in those SCCs which have less slack. This makes HeightR( ) an effective heuristic in loops which have multiple, non-trivial SCCs.
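A height-based priority of this kind can be sketched as follows. The defining equations for HeightR( ) are not reproduced in this text, so the sketch assumes the usual extension of height to cyclic graphs: the height of the pseudo-operation STOP is 0, and the height of any other operation P is the maximum, over its immediate successors Q, of HeightR(Q) + Delay(P, Q) - II * Distance(P, Q). Starting every height at 0 and relaxing until nothing changes is a choice made here for brevity, and the function name `height_r` is hypothetical.

```python
def height_r(succs, ii, stop):
    """Heights under the ASSUMED recurrence
    height[P] = max over successors Q of height[Q] + delay - ii * distance.

    succs: operation -> list of (successor, delay, distance); every successor
    other than `stop` must itself be a key of `succs`.
    """
    height = {op: 0 for op in succs}
    height.setdefault(stop, 0)
    # With no positive circuit, the values settle within len(height) passes.
    for _ in range(len(height) + 1):
        changed = False
        for op, edges in succs.items():
            for succ, delay, distance in edges:
                candidate = height[succ] + delay - ii * distance
                if candidate > height[op]:
                    height[op] = candidate
                    changed = True
        if not changed:
            return height
    raise ValueError("heights keep growing: II is below the recurrence bound")


# Hypothetical loop: a -> b within an iteration, b -> a across one iteration.
succs = {
    "a": [("b", 2, 0)],
    "b": [("STOP", 1, 0), ("a", 1, 1)],
}
heights = height_r(succs, ii=3, stop="STOP")
order = sorted(succs, key=lambda op: -heights[op])
```

With II = 3 the heights come out as 3 for `a` and 1 for `b`, so sorting by decreasing height visits `a` before `b`, which is the topological order of the intra-iteration edges.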
Calculation of the range of candidate time slots
The MRT enforces correct schedules from a resource usage viewpoint. Correctness, from the viewpoint of dependence constraints imposed by predecessors, is taken care of by computing and using Estart, the earliest time that the operation in question may be scheduled while honoring its dependences on its predecessors. In the context of recurrences and iterative modulo scheduling, it is impossible to guarantee that all of an operation's predecessors have been scheduled, and have remained scheduled, when the time comes to schedule the operation in question. So, Estart is calculated considering only those immediate predecessors that are currently scheduled. The early start time for operation P is given by the equation in Figure 5b, where Pred(P) is the set of immediate predecessors of P and SchedTime(Q) is the time at which Q has been scheduled.
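The Estart calculation can be illustrated with a short sketch. Figure 5b is not reproduced in this text, so the formula used here is an assumption consistent with the description: Estart(P) is the maximum, over the currently scheduled immediate predecessors Q, of SchedTime(Q) + Delay(Q, P) - II * Distance(Q, P), and is never less than 0. Unscheduled predecessors contribute nothing. The function and argument names are hypothetical.

```python
def estart(op, preds, sched_time, ii):
    """Earliest start for `op`, considering only currently scheduled predecessors.

    preds: operation -> list of (predecessor, delay, distance)
    sched_time: operation -> scheduled time, for scheduled operations only
    """
    earliest = 0
    for pred, delay, distance in preds.get(op, []):
        if pred in sched_time:  # unscheduled predecessors are ignored
            earliest = max(earliest, sched_time[pred] + delay - ii * distance)
    return earliest


preds = {"c": [("a", 3, 0), ("b", 2, 1), ("d", 9, 0)]}
# "d" is not currently scheduled, so its long delay does not constrain "c".
start_c = estart("c", preds, {"a": 1, "b": 6}, ii=4)
```

Here `a` forces a start no earlier than 1 + 3 = 4, and `b`, one iteration back, no earlier than 6 + 2 - 4 = 4, so `start_c` is 4. Once `d` is scheduled, the value has to be recomputed, which is why the text evaluates Estart at the moment an operation is about to be placed.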
Dependences with predecessor operations are honored by not scheduling an operation before its Estart. Dependences with successor operations are honored by virtue of the fact that when an operation is scheduled, all operations that conflict with it, either because of resource usage or due to dependence conflicts, are unscheduled. When these operations are scheduled subsequently, and Estart is computed for them, the dependence constraints are observed. At any point in time, the partial schedule for the currently scheduled operations fully honors all constraints between these scheduled operations.
It is pointless and redundant to consider more than II contiguous time slots starting with Estart. If a legal time slot is not found in this range because of resource conflicts, it will not be found outside this range. Therefore, MaxTime is set equal to Estart + II - 1.
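The reason only II slots need to be examined is easiest to see in code. The following is a minimal sketch of a modulo reservation table (MRT) in which every resource use is recorded at its time modulo II, together with a search over the range Estart to Estart + II - 1. The class, its methods and the single-resource example are hypothetical and much simpler than a realistic machine model.

```python
class ModuloReservationTable:
    """Resource reservations folded modulo II (illustrative sketch only)."""

    def __init__(self, ii):
        self.ii = ii
        self.owner = {}  # (resource, time mod II) -> operation holding that slot

    def _slots(self, table, time):
        return [(res, (time + cycle) % self.ii)
                for res, cycles in table.items() for cycle in cycles]

    def conflicts(self, table, time):
        """Operations already holding a slot that `table`, issued at `time`, needs."""
        return {self.owner[s] for s in self._slots(table, time) if s in self.owner}

    def reserve(self, op, table, time):
        for slot in self._slots(table, time):
            self.owner[slot] = op


def find_time_slot(mrt, table, estart):
    """First conflict-free time in [estart, estart + II - 1], or None."""
    for time in range(estart, estart + mrt.ii):
        if not mrt.conflicts(table, time):
            return time
    return None


ADD = {"alu": {0}}  # hypothetical one-cycle use of a single ALU
mrt = ModuloReservationTable(ii=3)
mrt.reserve("x", ADD, 0)
clash_later = mrt.conflicts(ADD, 6)      # time 6 folds onto the same slot as time 0
first_free = find_time_slot(mrt, ADD, 3)
```

Because time 6 and time 0 map to the same MRT row, `clash_later` names `x`, and the search starting at 3 skips that row and returns 4. Searching beyond Estart + II - 1 would only revisit rows that have already been rejected.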
Selection of operations to be unscheduled
Assume that a time slot is found, between MinTime and MaxTime, that does not result in a resource conflict with any currently scheduled operation. The only operations that will need to be unscheduled are those immediate successors with which there is a dependence conflict. No operation need be unscheduled because of a resource conflict.
On the other hand, if every time slot from MinTime to MaxTime results in a resource conflict then we must make two decisions. First, we must choose a time slot in which to schedule the current operation and, second, we must choose which currently scheduled operations to displace from the schedule. The first decision is made with an eye to ensuring forward progress; in the event that the current operation was previously scheduled, it will not be rescheduled at the same time. This avoids a situation where two operations keep displacing each other endlessly from the schedule. If Estart is greater than the previous schedule time, the operation is scheduled at Estart. If not, it is scheduled one cycle later than it was scheduled previously.
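The forced choice of a time slot can be written down in a few lines. This sketch takes the reading under which the chosen slot never precedes Estart, an assumption made here so that dependences on scheduled predecessors remain honored: the operation goes at Estart if it has never been scheduled or if Estart lies beyond its previous slot, and one cycle after its previous slot otherwise. The function name and the use of `None` for "never scheduled" are hypothetical.

```python
def choose_forced_slot(estart, prev_time=None):
    """Slot to use when every candidate in the II-wide window has a resource conflict.

    prev_time is the time at which the operation was last scheduled, or None
    if it has never been scheduled.
    """
    if prev_time is None or estart > prev_time:
        return estart
    # Never reuse the previous slot: move one cycle later to guarantee progress.
    return prev_time + 1


never_scheduled = choose_forced_slot(5)
estart_moved_on = choose_forced_slot(7, prev_time=4)
same_window = choose_forced_slot(3, prev_time=4)
```

The three calls return 5, 7 and 5 respectively. In every case the result is at least Estart and differs from the previous slot, which are the two properties the passage relies on.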
Regardless of which time slot we choose to schedule the operation, one or more operations will have to be unscheduled because of resource conflicts. In the event that there are multiple alternatives for scheduling an operation, the choice of alternative determines which operations are unscheduled. Ideally, we would like to select that alternative which displaces the lowest priority operations.
Instead of attempting to make this determination directly, all operations are unscheduled which conflict with the use of any of the alternatives. The current operation is then scheduled using one of the alternatives. The displaced operations will then be rescheduled, perhaps at the very same time, in the order specified by the priority function.

[...] was written out to a file that was then read in by the research scheduler.
The input set to the research scheduler consisted of 1327 loops (1002 from the Perfect Club, 298 from Spec, and 27 from the LFK).
In the Cydra 5, 64-bit precision arithmetic was implemented on its 32-bit data paths by using each stage of the pipelines for two consecutive cycles. This results in a large number of block and complex reservation tables which, while they amplify the need for iterative scheduling, are unrepresentative of future microprocessors with 64-bit datapaths. A compiler switch was used to force all computation into 32-bit precision so that, from the scheduler's point of view, the computation and the reservation tables better reflect a machine with 64-bit datapaths. The scheduling experiments were performed using the detailed, precise reservation tables for the Cydra 5 as well as the actual latencies (Table 2). The one exception is the load latency which was assumed to be 20 cycles rather than the 26 cycles that the Cydra 5 compiler uses for modulo scheduled loops.

Presented in Table 3 are various statistics on the nature of the loops in the benchmarks utilized. The first column lists the measurement, the second column lists the minimum value that the measurement can possibly yield, and the remaining columns provide various aspects of the distribution statistics for the quantity measured. The third column lists the frequency with which the minimum possible value was encountered, the fourth and the fifth columns specify the median and the mean of the distribution, respectively, and the last column indicates the maximum value that was encountered for that measurement.
As can be seen from Table 3 , the number of operations per loop is generally quite small but there is at least one loop which has 163 operations. The fact that the median is less than the mean indicates a distribution that is heavily skewed towards small values, but having a long tail. The MII behaves in much the same way, as does the lower bound on the length of the modulo schedule for a single iteration of the loop. The lower bound on the modulo schedule length for a given II is the larger of MinDist [START, STOP] and the actual schedule length achieved by acyclic list scheduling. The large number of small loops appears to be due to the presence in the benchmarks of a large number of initialization loops.
Examining the distribution statistics in Table 3 for the quantity Max(0, RecMII - ResMII) we find an even more pronounced skew towards small values (mean = 4.54, maximum = 115). What is noteworthy is that for 84% of all loops this value is 0, for 90% it is less than or equal to 20, and for 95% it is less than or equal to 28. This has implications for the average computational complexity of the MII calculation; 84% of the time the RecMII is equal to or less than the ResMII and ComputeMinDist need only be invoked once per SCC in the loop.
A non-trivial SCC is one containing more than one operation. From a scheduling perspective, an operation from a trivial SCC need be treated no differently than one which is not in an SCC as long as the II is greater than or equal to the RecMII implied by the reflexive dependence edge. A loop can be more difficult to schedule if the number of non-trivial SCCs in it is large. Statistically, there tend to be very few SCCs per loop. In fact, 77% of the loops, the vectorizable ones, have no non-trivial SCCs. These statistics affect the average complexity of computing the MII. The analysis in Section 4.4 bears this out.
Characterization of iterative modulo scheduling
The total time spent executing a given loop (possibly over multiple visits to the loop) is given by

EntryFreq * SL + (LoopFreq - EntryFreq) * II

where EntryFreq is the number of times the loop is entered, LoopFreq is the number of times the loop body is traversed, and SL is the schedule length for one iteration.
The first two quantities are obtained by profiling the benchmark programs. This formula for execution time assumes that no time is spent in processor stalls due to cache faults or other causes. Except in the case of loops with very small trip counts, the coefficient of II is far larger than that of SL, and the execution time is determined primarily by the value of II. Consequently, II is the primary metric of schedule quality and SL is the secondary metric.
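A small numerical example shows why II dominates. The sketch below assumes the execution-time expression EntryFreq * SL + (LoopFreq - EntryFreq) * II, i.e., each entry into the loop pays one full schedule length and every further iteration adds one II; the function name and the sample numbers are hypothetical.

```python
def loop_execution_time(entry_freq, loop_freq, sl, ii):
    """Stall-free execution time under the ASSUMED expression
    EntryFreq * SL + (LoopFreq - EntryFreq) * II."""
    return entry_freq * sl + (loop_freq - entry_freq) * ii


# One visit to the loop, 100 iterations, SL = 20 cycles, II = 2 cycles.
long_trip = loop_execution_time(entry_freq=1, loop_freq=100, sl=20, ii=2)
# Same loop with II degraded by one cycle.
long_trip_worse_ii = loop_execution_time(entry_freq=1, loop_freq=100, sl=20, ii=3)
# Ten visits of a single iteration each: II does not matter at all.
short_trip = loop_execution_time(entry_freq=10, loop_freq=10, sl=20, ii=2)
```

For the long-running loop the time is 218 cycles, and raising II by a single cycle adds 99 cycles, far more than any plausible change in SL would. For the loop with a trip count of one the time is 200 cycles regardless of II, which is the small-trip-count exception noted in the text.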
Let DeltaII refer to the difference between the achieved II and the MII. Table 3 shows that for 96% of all loops the lower bound of MII is achieved. Of the 1327 loops scheduled, 32 had a DeltaII of 1, 8 had a DeltaII of 2, and 11 had a DeltaII that was greater than 2. Of these, all but two had a DeltaII of 6 or less, and those two had a DeltaII of 20. Iterative modulo scheduling is quite successful in achieving optimal values of II. (It is worth noting that MII is not necessarily an achievable lower bound on II. The difference of the achieved II from the true, but unknown, minimum possible II may be even less than that indicated by these statistics.) These statistics also have implications for the average computational complexity of iterative modulo scheduling since the number of candidate II values considered is DeltaII + 1.
These statistics also indicate that it is not beneficial to evaluate HeightR() symbolically, as a function of II, as is suggested by Lam for computing Estart [21]. In either case, symbolic computation is more expensive than a numerical computation. The advantage of the symbolic computation is that the reevaluation of HeightR(), when the II is increased, is far less expensive than recalculating it numerically. However, the statistics on DeltaII show that this benefit would be derived for only 4% of the loops, whereas the higher cost of symbolic evaluation would be incurred on all the loops.
A somewhat more meaningful measure of schedule quality is the ratio of the achieved II to MII, i.e., the relative non-optimality of the II over the lower bound. The distribution statistics for this metric are shown in Table 3. Again, 96% of the loops have no degradation, 99% have a ratio of 1.1 or less, and the maximum ratio is 1.5.
The secondary measure of schedule quality is the length of the schedule for one iteration.
The distribution statistics for the ratio of the achieved schedule length to the lower bound described earlier are shown in Table 3 . For all but 5 loops, this ratio is no more than 1.5. (Note that this lower bound, too, is not necessarily achievable.)
In the final analysis, the best measure of schedule quality is the execution time which is computed by using the above formula. By using the lower bounds for SL and II in that formula, a lower bound on the execution time is obtained. Only 597 of the 1327 loops end up being executed for the input data sets used to profile the benchmark programs. Only these loops were considered when gathering execution time statistics. The distribution statistics for the ratio of the actual execution time to the lower bound are shown in [...]

The last row in Table 3 provides some statistics on the scheduling inefficiency, i.e., the number of times an operation is scheduled as a ratio of the number of operations in the loop, given that the II corresponds to the smallest value for which a schedule was found. Under these circumstances, iterative modulo scheduling is quite efficient. For 90% of the loops, each operation is scheduled precisely once, the average value of the ratio is 1.03 and the largest value is 4.33. These statistics speak to the efficiency of the function IterativeSchedule. When considering the efficiency of the procedure ModuloSchedule, one must also take into account the scheduling effort expended for the unsuccessful values of II.
In procedure ModuloSchedule, the parameter BudgetRatio determines how hard IterativeSchedule tries to find a schedule for a candidate II before giving up. BudgetRatio multiplied by the number of operations in the loop is the value of the parameter Budget in IterativeSchedule. Budget is the limit on the number of operation scheduling steps performed before giving up on that candidate II.
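The relationship between BudgetRatio, Budget and the candidate II can be sketched as a small driver loop. This is an illustration under stated assumptions rather than the paper's ModuloSchedule: the scheduler itself is passed in as a hypothetical callable `iterative_schedule(ii, budget)` that returns a schedule or `None`, the candidate II is increased by one after each failure, and `max_ii` is an extra safety limit introduced here.

```python
def modulo_schedule(num_ops, mii, budget_ratio, iterative_schedule, max_ii=None):
    """Try successively larger II values, starting at MII, until one succeeds.

    iterative_schedule: HYPOTHETICAL callable (ii, budget) -> schedule or None.
    Returns (achieved_ii, schedule).
    """
    # Budget: limit on operation scheduling steps per candidate II.
    budget = int(budget_ratio * num_ops)
    ii = mii
    while max_ii is None or ii <= max_ii:
        schedule = iterative_schedule(ii, budget)
        if schedule is not None:
            return ii, schedule
        ii += 1  # the whole budget was spent on this II without success
    raise RuntimeError("no schedule found up to max_ii")


calls = []


def fake_scheduler(ii, budget):
    """Stand-in scheduler: succeeds only once II reaches 5 and the budget is at least 12."""
    calls.append((ii, budget))
    return {"ii": ii} if ii >= 5 and budget >= 12 else None


achieved_ii, schedule = modulo_schedule(
    num_ops=4, mii=3, budget_ratio=3, iterative_schedule=fake_scheduler)
```

With these stand-in values the driver makes three attempts, at II = 3, 4 and 5, each with a budget of 12 scheduling steps, and returns an II of 5. Every unsuccessful attempt consumes its full budget, which is why the scheduling effort for unsuccessful II values has to be counted when the overall cost is assessed.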
In collecting the experimental data reported on above, BudgetRatio was set at 6, well above the largest value actually needed by any loop (which was 4.33). This was done in order to understand how well modulo scheduling can perform, in the best case, in terms of schedule quality. However, this large a BudgetRatio might not be the best choice. Generally, in order to find a schedule for a smaller value of II one must use a larger BudgetRatio.
Too small a BudgetRatio results in having to try successively larger values of II until a schedule is found at a larger II than necessary. Not only does this yield a poorer schedule, but it also increases the computational complexity since a larger number of candidate values of II are attempted, and IterativeSchedule, on all but the last, successful invocation, expends its entire budget each time.
On the other hand, once the BudgetRatio has been increased enough that the minimum achievable II has been reached, further increasing BudgetRatio cannot be beneficial in terms of schedule quality. However, it can increase the computational [...] Figure 6 shows the dilation in the aggregate execution time over all the loops (as a fraction of the lower bound) and the [...]

[...] involves a number of steps, the complexity of each of which is listed in [...]

[...] it, we can, conservatively, assume that they are equal. So, the cost of iterative modulo scheduling is 2.18 (i. [...]

Iterative modulo scheduling generates near-optimal schedules. Furthermore, despite the iterative nature of this algorithm, it is quite economical in the amount of effort expended to achieve these near-optimal schedules. In particular, it is far more efficient than any cyclic or acyclic scheduling algorithm for loop scheduling which makes use of unrolling or code replication. If such algorithms replicate more than 118% of the loop body (which is just over one copy of the loop body) they will be more expensive computationally.
