This paper presents a software pipelining algorithm for the automatic extraction of fine-grain parallelism in general loops. The algorithm accounts for machine resource constraints in a way that smoothly integrates the management of resource constraints with software pipelining. Furthermore, generality in the software pipelining algorithm is not sacrificed to handle resource constraints, and scheduling choices are made with truly global information. Proofs of correctness and the results of experiments with an implementation are also presented.
Introduction
Recently there has been considerable interest in a class of compiler parallelization techniques known collectively as software pipelining. Software pipelining algorithms compute a static parallel schedule overlapping the operations of a loop body in much the same way that a hardware pipeline overlaps operations in a dynamic instruction stream. The schedule computed by a software pipelining algorithm is suitable for execution on a synchronous, tightly-coupled parallel machine, such as a super-scalar or VLIW (Very Long Instruction Word) machine.
Software pipelining algorithms are interesting for at least three reasons. The first reason is that super-scalar and VLIW machines are being built. IBM's System 6000 can execute four operations in parallel; Intel's i860 and i960 chips can execute three operations in parallel. The largest tightly-coupled synchronous machine built to date is Multiflow's TRACE-14, which has 14 functional units. Several computer manufacturers, e.g., HP, Phillips, and Siemens, are also developing VLIW or super-scalar architectures. The second reason is that these tightly-coupled machines must be programmed at a very low level. Someone writing a program for a tightly-coupled machine must develop a parallel schedule, which means that person must know about and account for details of the hardware design such as instruction timings and resource conflicts between functional units. This task is extremely time-consuming and error-prone; compilation techniques are needed to translate programs written at a reasonably high level into good parallel schedules.
The final reason is that software pipelining techniques hold the promise of producing better code with faster compilation time than other scheduling techniques. This potential is illustrated by the example in Figure 1. Figure 1a shows a simple sequential loop and Figures 1b and 1c show two different parallel schedules for the loop. For convenience, we label the operations in the original loop a–d and refer only to these labels in the parallel loops. In this example, some parallelism is present within the loop body (because operations b and c can be executed simultaneously) as well as across iterations (because d from one iteration can overlap with a from the next iteration). The classical approach to scheduling the loop in Figure 1a is to unroll the loop body some number of times and then apply scheduling heuristics within the unrolled loop body [Fis81], as illustrated in Figure 1b. While this approach allows parallelism to be exploited between some iterations of the original loop, there is still sequentiality imposed between iterations of the unrolled loop body. In general, if the loop could be fully unrolled, all parallelism both inside and across iterations could be exploited by this approach. However, full unrolling is usually impossible or impractical to obtain. Software pipelining provides a direct way of exploiting parallelism inside and across all iterations of a loop; hence software pipelining achieves the effect of scheduling with full unrolling. A software-pipelined version of the original loop is given in Figure 1c.
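To make the contrast concrete, the following schematic writes each schedule as a list of parallel instruction slots. It is an illustration of the idea behind Figure 1, not the figure itself; the (operation, iteration) encoding and the particular overlap shown are illustrative only, based on the dependences just described.

# Each inner list is one parallel instruction; pairs are (operation, iteration).

# (a) Sequential loop body: one operation per instruction, backedge to the top.
sequential = [[("a", 0)], [("b", 0)], [("c", 0)], [("d", 0)]]

# (b) Unrolled twice and compacted: b and c overlap, and d of iteration 0
# overlaps a of iteration 1, but the unrolled body still ends sequentially
# before its backedge.
unrolled = [[("a", 0)],
            [("b", 0), ("c", 0)],
            [("d", 0), ("a", 1)],
            [("b", 1), ("c", 1)],
            [("d", 1)]]

# (c) Software pipelined: after a one-instruction prologue, every iteration of
# the steady-state body overlaps d of iteration i with a of iteration i+1.
prologue = [[("a", 0)]]
steady_state = [[("b", "i"), ("c", "i")],
                [("d", "i"), ("a", "i+1")]]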
Previous Work
One body of work on software pipelining has focussed on establishing the formalism required to adequately address what software pipelining algorithms can and cannot achieve. Results in this line of development include a software pipelining algorithm that generates optimal code for loops without conditional tests [AN88a] and a proof that optimal software pipelining is impossible in general [SGE91]. However, this work has largely ignored resource constraints.
Existing software pipelining algorithms handle resource constraints in a variety of ways. Some algorithms deal with only weak forms of resource constraints, e.g., the number of operations that can be executed in parallel. Others assume resource constraints are handled in a separate "fix-up phase" after software pipelining [NPA88]. Several software pipelining algorithms account for resource constraints directly as part of the software pipelining algorithm, e.g., [RG81, Lam87]. However, in most such algorithms the treatment of resource constraints is intimately connected to software pipelining; that is, the software pipelining is not separable from the handling of resource constraints. One of our interests is to separate what is really intrinsic to software pipelining from other, orthogonal concerns. A more extensive discussion of previous and related work is included in Section 9.
Our Approach
In this paper, we present an algorithm that smoothly integrates software pipelining with the treatment of resource constraints, while at the same time maintaining a structured design that separates orthogonal concerns. Our algorithm serves two purposes. First, we believe the algorithm represents a practical direction and can form the basis of implementations of software pipelining; we discuss an implementation of our algorithm in Section 7. Second, the algorithm represents a summary of many of the most interesting aspects of our investigation of software pipelining over the last several years [Nic85, AN88a, AN88b, Aik88, Aik90, AN91]. Our algorithm has several novel features:
The handling of resource constraints is orthogonal to the software pipelining.
At each step the algorithm has global information about the operations that can be scheduled.
In a technical sense defined precisely in Section 8, given sufficient resources our algorithm can produce code arbitrarily close to the theoretical optimum.
The advantage of the first point is that the treatment of resources could be modified (say, for a different machine) and no changes would be required in the overall algorithm. The second and third points together imply that the quality of the final pipelined loop is limited only by the ability to make good resource allocation decisions (see Section 8) and not by the design of the software pipelining algorithm.
Our software pipelining algorithm is built from two components: a scheduler and a dependence analyzer. The machine-dependent scheduler is used to incrementally build a parallelized loop from a sequential loop. For each parallel instruction, the scheduler selects operations to schedule based on the set of operations available for scheduling in that instruction and available resources. The set of available operations is maintained by a global dependence analyzer; as the scheduler makes decisions about where to place operations, the set of available operations is updated incrementally. Together, the scheduler and the dependence analyzer encapsulate all machine-dependent information. As the parallelized loop is constructed, the software pipelining algorithm checks for repeating states that can be "pipelined." The software pipelining algorithm itself is very simple; the difficulty lies in establishing minimal restrictions on the scheduler and dependence analyzer that guarantee the correctness and termination of the software pipelining algorithm.
The rest of this paper is divided into nine sections. Section 2 defines the model of parallel computation used to develop the algorithm. Section 3 works through a small example to give an intuitive idea of how the software pipelining algorithm works. Section 4 describes the algorithm and presents a proof of correctness. Section 5 gives an algorithm for incrementally maintaining the set of available operations. Section 6 describes the integration of resource constraints into the algorithm. Handling resources well is critical in realistic applications of software pipelining. Section 7 briefly describes an implementation of our algorithm, some additional optimizations, and some experimental results. The experimental results bear out the strengths of our approach and point out some weaknesses; both are discussed at length. Section 8 presents a result that suggests our algorithm can achieve the best schedules possible in the presence of resource constraints. A discussion of related work is in Section 9. The final section summarizes and presents some conclusions.
Basic Terminology
This section develops a simple model of a tightly-coupled, synchronous parallel machine. The formalism is used to explain our software pipelining algorithm and to provide a basis for a proof of correctness.
A program is an automaton $\langle X, \delta, n_0, N \rangle$. $X$ is a set of $n$ operations $\{x_0, \ldots, x_{n-1}\}$. Operations are divided into assignments that read and write a global store, tests (boolean-valued functions) that affect the flow of control, and a distinguished operation stop.
The body of the program is a set $N$ of states $n_0, \ldots, n_{m-1}$. The state $n_0$ is the start state of the program. Associated with each state $n$ is $\mathit{ops}(n)$, the operations of $n$, which are elements of $X$. The states represent parallel instructions; intuitively, when control reaches a state $n$, all operations in $\mathit{ops}(n)$ are executed simultaneously. To simplify the presentation, we assume that all operations execute in unit time. Extensions to multi-cycle operations and pipelined functional units are discussed in Section 6.
A configuration is a pair $\langle n, s \rangle$ where $n$ is a state and $s$ is a store (the contents of memory locations and registers). The transition function $\delta$ maps configurations into configurations. An execution is a sequence of configurations $\langle \ldots, \langle n_i, s_i \rangle, \ldots \rangle$ such that $\delta(\langle n_i, s_i \rangle) = \langle n_{i+1}, s_{i+1} \rangle$.
The transition function $\delta$ describes how a tightly-coupled, synchronous machine actually executes a parallel instruction. We deliberately avoid defining a transition function in any detail. The transition functions of super-scalar and VLIW machines are complex and vary considerably from machine to machine. The greatest source of complexity is defining what it means to execute more than one test in parallel (multi-way jumps). As an example, in one possible model tests within a state $n$ are always organized as a binary decision tree with a unique root. One branch of each test in the decision tree is labeled true, the other is labeled false. Each leaf of the decision tree is a pointer to another state. When the state is executed, all of the tests are evaluated in parallel in the store. The next state to be executed is the leaf that terminates the (unique) path from the root where every branch is labeled by the value of that test in the store. There are other possible implementations of multi-way jumps; many mechanisms have been proposed and implemented [Fis80, KN85, AAG+86, Ebc87]. The software pipelining algorithm we present applies to any of these control-flow mechanisms.
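As an illustration of the decision-tree model, the following sketch evaluates all tests of a state against a single store and follows the labeled branches to the unique leaf. The class names and the store representation are illustrative only and are not part of the formal model.

from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Leaf:
    next_state: str                        # name of the successor state

@dataclass
class Test:
    predicate: Callable[[dict], bool]      # boolean-valued function of the store
    on_true: Union["Test", "Leaf"]
    on_false: Union["Test", "Leaf"]

def next_state(root, store):
    # Conceptually all tests fire in parallel in the same store; since the
    # store is fixed, walking the tree reaches the same unique leaf.
    node = root
    while isinstance(node, Test):
        node = node.on_true if node.predicate(store) else node.on_false
    return node.next_state

# A state with two tests and (up to) four leaves:
tree = Test(lambda s: s["i"] < s["n"],
            on_true=Test(lambda s: s["x"] == 0, Leaf("n1"), Leaf("n2")),
            on_false=Leaf("exit"))
assert next_state(tree, {"i": 0, "n": 10, "x": 0}) == "n1"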
We use the following abstraction of control flow throughout this paper. We assume that control flow is determined entirely by tests; that is, the result of evaluating the tests in a state determines the next state. A branch of a state $n$ is a truth assignment $\langle x_1 = \mathit{true}, x_2 = \mathit{false}, \ldots \rangle$ to the tests $x_1, x_2, \ldots$ in $n$. The set of all branches of $n$ is $\mathit{branch}(n)$; if $n$ has no tests, then $\mathit{branch}(n)$ is the singleton set $\{\langle\rangle\}$ consisting of the empty truth assignment. The function succ-on-branch maps a state $n$ and a branch $c \in \mathit{branch}(n)$ to a successor node $n'$ (the name succ-on-branch stands for "successor on branch").¹ We assume that if $\textit{succ-on-branch}(n, c) = n'$, then there is a store $s$ such that $\delta(\langle n, s \rangle) = \langle n', s' \rangle$ and that the evaluation of the tests in configuration $\langle n, s \rangle$ satisfies the truth assignment $c$.

¹Note that in most cases a node with $k$ tests will not have $2^k$ distinct successors. For generality, we treat each of the $2^k$ branches separately in our algorithm; in an implementation for a particular control-flow mechanism many branches can be merged.
The set of successors $\mathit{succ}(n)$ of a state $n$ is $\{n' \mid \exists c \text{ s.t. } \textit{succ-on-branch}(n, c) = n'\}$. When $n$ is executed, control is transferred to some $n' \in \mathit{succ}(n)$. A state that contains the operation stop cannot contain other operations and cannot have any successors.
We next define a meaning function $\mu$ for programs, which is used in the proof that our software pipelining algorithm is correct (i.e., that it preserves the meaning of the original program).
Definition 2.1 Let $P$ be a program $\langle X, \delta, n_0, N \rangle$. If there is an execution $\langle \langle n_0, s \rangle, \ldots, \langle n_k, s' \rangle \rangle$ such that $\mathit{ops}(n_k) = \{\mathit{stop}\}$ then $\mu(P, s) = s'$. If no such execution exists, then $\mu(P, s) = \bot$.
Programs $P_1$ and $P_2$ are equivalent ($P_1 \equiv P_2$) if $\forall s \; \mu(P_1, s) = \mu(P_2, s)$.
Software pipelining is a loop parallelization technique, so we must describe the loops we are interested in parallelizing. For convenience, we use the following definition. A sequential loop is a program with $i$ operations $x_0, \ldots, x_{i-1}$ and $i$ states $n_0, \ldots, n_{i-1}$ where $n_j = \{x_j\}$. All backedges go to the start state $n_0$; that is, if $n_i \in \mathit{succ}(n_j)$ and $i \le j$ then $i = 0$. Every state is assumed to be reachable from the start state.
An Example
Given a sequential loop L, our software pipelining algorithm incrementally builds a parallelized loop from L. Initially, the parallelized loop is empty (has no states) and the algorithm chooses a set of operations from the sequential loop L that legally can be scheduled as the start state of the parallel loop. After scheduling a subset of the available operations as the start state, the algorithm recursively schedules the successors of the start state by considering what operations can be scheduled in the successor states, and so on. The main difficulty is guaranteeing that this procedure terminates. We show that eventually the scheduled states must fall into a detectable repeating pattern, at which point a loop can be constructed from this pattern of repeating states.
An important data structure used by the algorithm is an incrementally maintained set A of available operations. At each step, A contains a set of operations available for scheduling in the current state being constructed. How this set is built and maintained is discussed in Section 5. For now it is only important to understand that set A contains all operations that could be scheduled legally in the current state without violating program semantics.
Initially, the new program graph is empty and A contains all operations available for scheduling in the first state. Consider the program in Figure 2. We display programs as control-flow graphs with the convention that true branches of tests are to the left and false branches are to the right. Not all operations can be scheduled in the first state; for example, c must be scheduled after b, since c references a value that b writes. In standard compiler terminology, there is a data dependence from b to c [KKP+81].
For this example, we assume a machine model in which all reads take place before any writes during execution of a state, and write conflicts are not permitted. In this model, the operations a, b, and f are all available for scheduling in the first state. Because the algorithm may overlap operations from different iterations, we superscript operations with the scheduled iteration from which they came. In addition, we subscript available operation sets to keep track of different values for different states. Thus, initially $A_0 = \{a^0, b^0, f^0\}$.
Another component of the pipelining algorithm is the scheduler. The scheduler selects from A a set of operations to schedule in the current state. Together, the procedure to maintain the set of available operations and the scheduler encapsulate all machine-dependent information. The software pipelining algorithm itself is built on top of these two components.
A pipelined version of the loop in Figure 2 is given in Figure 3. In Figure 3, the state $n_i$ is labeled by the integer $i$. The rest of this section describes how the software pipelining algorithm computes this parallel schedule from the sequential loop. For the first state $n_0$, assuming that the machine has sufficient resources, the scheduler could choose to schedule all available operations. Because f is a test, there will be two successors of the first state: one for the case where f evaluates to true, and one for the case where f evaluates to false. The sets of available operations are different for the two successors.
Consider the successor $n_1$ of $n_0$ for the case where $f^0$ evaluates to false. This case is easy, as the program terminates on this branch. The new set $A_1$ of available operations is $\{c^0, d^0, e^0\}$, reflecting the fact that $a^0$, $b^0$ and $f^0$ have been scheduled and that this branch of $f^0$ is the loop exit. Because write conflicts are not permitted, $d^0$ and $e^0$ cannot be scheduled in the same state, but both are "available": at this point, all dependences on the two statements have been satisfied. Assume that the scheduler selects operations $c^0$ and $d^0$ for state $n_1$. Operation $c^0$ is a test, so there are two successors of this state. For the successor $n_2$ where $c^0$ evaluates to false, the set of available operations $A_2$ is $\{e^0\}$. Assume that the scheduler places $e^0$ in $n_2$. For the single successor $n_3$ of $n_2$ the set of available operations $A_3$ is just $\{g^0\}$, the stop operation. Thus $n_3$ contains only $g^0$. Backing up to $n_1$, the set of available operations for the branch where $c^0$ evaluates to true is also $\{g^0\}$, so the successor of $n_1$ on this path is also $n_3$. This completes the terminating path from $n_0$. On the other path, where $f^0$ evaluates to true, the new set of available operations $A_4$ is $\{c^0, d^0, e^0, a^1\}$. Note that the operation $a^1$ from the second iteration is available for scheduling in parallel with statements from the first iteration. A subtle point is that operation $b^1$ is not available for scheduling, even though all reads take place before all writes and all operations from the first iteration that read variable j are available in $A_4$. Operation $b^1$ is not available because (as before) $d^0$ and $e^0$ cannot be scheduled in the same state. Even though all operations that read j are in $A_4$, not all of these can be scheduled in $n_4$, and this fact prevents statements that write j from being available.
Assume that the scheduler selects operations $c^0$, $d^0$, and $a^1$ for state $n_4$. Operation $c^0$ is a test, so there are two successors of this state. For the successor $n_5$ where $c^0$ evaluates to true, the set $A_5$ is $\{b^1, f^1\}$. Assuming that the scheduler places both operations in $n_5$, the set of available operations for the successor of $n_5$ on the path where $f^1$ is true is $\{c^1, d^1, e^1, a^2\}$. Note that, except for the superscripts, this set is exactly the same as $A_4$. The superscripts are just a way of keeping track of the iteration of each operation; the sets have the same operations. Rather than continue scheduling at this point, the pipelining algorithm simply makes $n_4$ a successor of $n_5$. Similarly, the set of available operations for the successor of $n_5$ where $f^1$ evaluates to false is $\{c^1, d^1, e^1\}$. Except for superscripts, this is exactly the same as $A_1$. As before, the pipelining algorithm makes $n_1$ a successor of $n_5$. Backing up, the pipelining algorithm next considers the successor $n_6$ of $n_4$ where $c^0$ evaluates to false. The set of available operations $A_6$ is $\{e^0, b^1, f^1\}$. Assuming that the scheduler places all three operations in $n_6$, the sets of available operations for the two successors of $n_6$ are the same as for $n_5$ and scheduling proceeds just as it did for $n_5$. The algorithm terminates with the schedule in Figure 3.
There are three technically difficult aspects of the software pipelining algorithm. The first problem is justifying the step where previously scheduled states are "reused", such as when the pipelining algorithm decided to make $n_4$ the successor of $n_5$. We have simply implied that this is correct, and in the example it happens to be correct, but in general this step is not correct. Intuitively, the problem is that just because two sets of available operations happen to be the same for two different states, that does not by itself guarantee that all subsequent sets of available operations would be the same in all successors of those states.
We illustrate this problem with the loop in Figure 4a. To make the example as simple as possible, there are no conditional statements or exits from the loop. Assuming that the variable i is always zero upon entering the loop, note that statements b and c are independent for the first 50 iterations and data dependent for the next 50 iterations. If dependence analysis recognizes that b and c are independent for the first 50 iterations, then as the parallelized loop is built the scheduler could place b and c together in the first 50 iterations. Following the pipelining strategy for the previous example, repeating states would be detected in the second iteration, leading to the parallelized program in Figure 4b, which is clearly incorrect. In this example, irregular dependencies make it difficult to detect repeating behavior. Section 4 formalizes the software pipelining algorithm and provides constraints on the scheduler and available operation information that guarantee the correctness and termination of the software pipelining algorithm. The second problem is computing the sets of available operations. An algorithm for maintaining these sets incrementally was first presented in [EN89] for programs without loops (i.e., with acyclic control-flow graphs). In Section 5, we present a detailed description of the computation and maintenance of available operations for use in software pipelining of loops. Our presentation is simpler and easier to understand and implement than the algorithm in [EN89].
The third significant problem is managing finite resources. While resource allocation does not bear directly on the correctness of our software pipelining algorithm, good resource usage is obviously important if the algorithm is to be useful in practice. In Section 6 we show how the management of finite resources is integrated with software pipelining in our system.
The Software Pipelining Algorithm
The example in Section 3 illustrates that the key step in our algorithm is discovering when states can be "reused" to form a software pipeline. Recognizing patterns in the scheduled operations is not trivial and is in fact not valid if the scheduler and the available operation analysis are not constrained in some way. For example, if the scheduler merely selects operations to schedule at random, no repeating behavior can be inferred. Similarly, even if the scheduler is well-behaved, the example in Figure 4 shows that if the available operation analysis does not exhibit a detectable pattern, software pipelining is not possible.
In this section we present constraints on the scheduler and available operation analysis that make software pipelining possible. These constraints are quite weak and are easily satisfied in practice. After presenting the constraints, we present the software pipelining algorithm itself and prove its correctness. Finally, we discuss termination of the software pipelining algorithm.
The Constraints
Recall that $x_i^c$ denotes the instance of operation $x_i$ from iteration $c$ of a loop. The following definition is used in the discussion of the constraints.
Definition 4.1 Let $X = \{\ldots, x_i^{j_i}, \ldots\}$ be a set of operations. The set $X^c$ is the set $\{\ldots, x_i^{j_i + c}, \ldots\}$.
As discussed in Section 3, one component of the software pipelining algorithm is a scheduler for a specific machine. The following constraint requires that: (1) the scheduler is a function, (2) the scheduler must schedule some operation in every state, and (3) the operation chosen can depend on the set of operations available and the relative distance in iterations between the operations available, but not on the actual iterations of the operations available. Constraint 4.2 Let $X$ be a set of operations. The scheduler must be a function mapping a set of already scheduled operations and a set of available operations to a single operation or the value "none."
In addition, $\mathit{schedule}(X, A) \ne \text{none}$ if $X = \emptyset$. We also require that
$$(\exists x_j \; \forall i \; \mathit{schedule}(X^i, A^i) = x_j^{k+i} \text{ and } x_j^{k+i} \in A^i) \;\lor\; (\forall i \; \mathit{schedule}(X^i, A^i) = \text{none})$$
In our algorithm, $X$ is the set of operations already scheduled in the state currently under consideration. The operation $x_j^k$ returned by the scheduler is an additional operation to be scheduled in the same state. The primary restriction imposed by Constraint 4.2 is that the scheduler is a function of operations available in the state being scheduled. This constraint is weak because the set of available operations provides global information about the program: the scheduler can choose any statement that could be legally scheduled in the current state. This particular constraint also has a significant design benefit: it cleanly separates the scheduler from the rest of the algorithm, thus isolating the most machine-dependent portion of the code. Any scheduler satisfying Constraint 4.2 will work with the software pipelining algorithm. In Section 6 we show how to generalize Constraint 4.2 to include resource constraints.
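As an illustration, a minimal scheduler satisfying Constraint 4.2 might look as follows. It is a deterministic function of the scheduled set X and the available set A, and it ranks candidates by iteration distance relative to the earliest available iteration, so shifting all iteration numbers by a constant shifts its answer by the same constant. The fits predicate stands in for the machine-dependent legality test; the sketch is illustrative only, not a prescribed implementation.

def schedule(X, A, fits=lambda X, op: True):
    # Operations are (name, iteration) pairs. Returns an element of A or None
    # (the paper's "none"). With the default fits, some operation is always
    # chosen when A is nonempty, as requirement (2) demands when X is empty.
    candidates = A - X
    if not candidates:
        return None
    base = min(it for (_, it) in candidates)   # normalize absolute iterations
    for op in sorted(candidates, key=lambda o: (o[1] - base, o[0])):
        if fits(X, op):
            return op
    return None

# Example: operations of the earliest available iteration are preferred.
print(schedule(set(), {("c", 0), ("d", 0), ("a", 1)}))   # ('c', 0)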
The scheduler is used by the software pipelining algorithm to repeatedly select operations for scheduling in a state. When the scheduler returns "none", the state is finished and successors of the state are scheduled. In Section 3 we presented a simplified example in which the scheduler chooses a subset of the available operations for scheduling. However, the iterative method described here is necessary in general because the operations available for scheduling in a state $n$ can depend on the set of operations already scheduled in $n$. For example, consider the simple program fragment in Figure 5. Assuming that the parallel machine performs reads before writes, it is clear that both a and b can be scheduled together in the first (and only) state $n_0$. However, b cannot be scheduled in $n_0$ unless a is also scheduled in $n_0$; that is, b is not available for scheduling in $n_0$ unless a is scheduled in $n_0$. Otherwise, if the set of available operations were simply $\{a, b\}$, then the scheduler could choose to schedule b in $n_0$ and a in $n_0$'s successor, which is incorrect.
A second constraint is placed on the available operations. At any moment there is a set of operations $A$ that are available for scheduling associated with a state $n$. There are two ways that $A$ can be updated. First, the procedure call update-one$(n, A, x_i)$ returns a pair consisting of the updated state with operations $\mathit{ops}(n) \cup \{x_i\}$ and the new set of available operations given that $x_i$ has been scheduled. Second, when $n$ is complete we wish to compute the set of available operations in the successors of $n$. The procedure call next$(n, A)$ maps $n$ and $A$ to a set of pairs $\{\langle n_j, A_j \rangle\}$ where for every branch $c_j \in \mathit{branch}(n)$, $n_j$ is a new (empty) node, $n_j = \textit{succ-on-branch}(n, c_j)$, and $A_j$ is the set of operations available in $n_j$. Implementations of update-one and next are given in Section 5. Constraint 4.3 says that the operations available may depend on which operations have already been scheduled and the relative distance (in iterations) between operations already scheduled, but it cannot depend on the actual values of the iterations of operations already scheduled. In the implementation of update-one, the result node is simply $n$ updated to include the operation $x_i$ (see Section 5.2).
Whether Constraint 4.3 is satisfied or not depends on the form of the data dependence analysis used to maintain operation availability information. Standard data dependence graphs satisfy Constraint 4.3, as do extensions to dependence graphs, such as labeling edges with constant distance vectors [PBJ+91]. In fact, as far as we know, every proposed representation of dependence information satisfies this constraint. Constraint 4.3 is needed to rule out pathological cases like Figure 4, where irregular dependence analysis leads to incorrect schedules.
The Algorithm
The software pipelining algorithm is given in Figure 6. Given an initial set of available operations, the procedure pipeline invokes the procedure schedule-state to build a single state, and then to build states for all the branches of that state, and so on. If at any point the algorithm encounters the same set of available operations (modulo iteration numbers) a second time, it uses the previously scheduled state. The algorithm never backtracks to explore alternative schedules. While a backtracking version could be designed easily, we feel a backtracking algorithm would be too slow to be practical.
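The following sketch reconstructs the driver loop of pipeline from this description; the State class and the schedule_state and next_states parameters are illustrative stand-ins for the machinery defined below, not Figure 6 itself. In Figure 6 the scheduled-before test is performed when a pair is removed from the todo list; testing at insertion, as here, yields the same set of states.

class State:
    def __init__(self):
        self.ops = set()      # operations placed in this parallel instruction
        self.succs = {}       # branch -> successor State

def canonical(A):
    # Compare availability sets modulo a uniform iteration shift (Definition 4.1).
    base = min(it for (_, it) in A) if A else 0
    return frozenset((name, it - base) for (name, it) in A)

def pipeline(A0, schedule_state, next_states):
    start = State()
    todo = [(start, A0)]
    scheduled_before = {canonical(A0): start}
    while todo:
        n, A = todo.pop()
        A = schedule_state(n, A)          # fill n by repeated scheduler calls
        for branch, A_succ in next_states(n, A):
            key = canonical(A_succ)
            if key in scheduled_before:
                # The same availability set (modulo iterations) was seen
                # before: reuse the old state. These edges close the pipeline.
                n.succs[branch] = scheduled_before[key]
            else:
                m = State()
                scheduled_before[key] = m
                n.succs[branch] = m
                todo.append((m, A_succ))
    return start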
The order in which pipeline processes the successors of a scheduled state is unspecified and makes no difference in the final parallel program. The order in which states are scheduled can make a difference in the efficiency of the available operations computation; in Section 5 we present a slightly modified version of the algorithm in Figure 6 that processes states in an efficient order. The core of Figure 6 is the procedure that fills a single state:

procedure schedule-state(n, A)
    while schedule(ops(n), A) ≠ none do
        let x = schedule(ops(n), A) in
            ⟨n, A⟩ ← update-one(n, A, x)
    return ⟨n, A⟩

We use Constraints 4.2 and 4.3 to prove the correctness of the software pipelining algorithm in Figure 6. Let $L$ be a sequential loop and let $L'$ be the result of software pipelining. We show that $L \equiv L'$.
As a first step in the proof, we must assume that the available operation analysis is correct. Intuitively, the available operation analysis is correct if any schedule that is consistent with the analysis preserves program semantics. Formally, the analysis is correct if $L_1 \equiv L$, where $L_1$ is the infinite parallel program computed by procedure pipeline2, a variant of pipeline that schedules new states indefinitely instead of reusing previously scheduled ones. For the induction step, assume that $e = \langle \langle n_0, s_0 \rangle, \ldots, \langle n_i, s_i \rangle \rangle$ is an execution of $L_1$ and that $e' = \langle \langle n'_0, s_0 \rangle, \ldots, \langle n'_i, s_i \rangle \rangle$ is an execution of $L'$. Furthermore, assume that there exists a $k$ such that when the states $n_i$ and $n'_i$ were scheduled, the sets of available operations were $A$ in procedure pipeline2 and $A^k$ in procedure pipeline respectively. Finally, assume that $\mathit{ops}(n_i) = \mathit{ops}(n'_i)^k$. It is easy to check that all of these assumptions hold after the base case. The stores must be the same in the two transitions since, by hypothesis, $n_i$ and $n'_i$ have the same operations. Let $c$ be the branch taken in state $\langle n_i, s_i \rangle$. Note $c$ is also taken in state $\langle n'_i, s_i \rangle$, because $n_i$ and $n'_i$ have the same operations evaluated in the same store. To finish the proof, we need to show that $n_{i+1}$ and $n'_{i+1}$ have the same operations, possibly differing in iteration numbers used by the pipelining algorithm. That is, we must show that $\mathit{ops}(n_{i+1}) = \mathit{ops}(n'_{i+1})^j$ for some $j$.
Consider once more the state of the two software pipelining algorithms when $n_{i+1}$ and $n'_{i+1}$ are scheduled. By Constraint 4.3 and the induction hypothesis, $\langle m, B \rangle \in \mathit{next}(n_i, A)$ and $\langle m', B^k \rangle \in \mathit{next}(n'_i, A^k)$ where $m$ and $m'$ are fresh, empty states on branch $c$ from $n_i$ and $n'_i$ respectively. Now there are two cases. For the first case, assume $\forall j \; \textit{scheduled-before}[B^j] = \text{no}$ when $\langle m', B^k \rangle$ is removed from the todo list by pipeline. In $L_1$, let schedule-state$(m, B) = \langle n_{i+1}, C \rangle$. Then, by Constraints 4.2 and 4.3, in $L'$ we have schedule-state$(m', B^k) = \langle n'_{i+1}, C^k \rangle$ and $\mathit{ops}(n_{i+1}) = \mathit{ops}(n'_{i+1})^k$.
For the second case, assume that, when $\langle m', B^k \rangle$ is removed from the todo list by pipeline, there is a $j$ such that $\textit{scheduled-before}[B^j] = n'$. Then $n'$ was scheduled in $L'$ using available operations $B^j$. The rest of the argument is symmetric to the case above, using $B^j$ in place of $B^k$ and the fact that $n'_{i+1} = n'$. □
Termination
Theorem 4.5 proves that the software pipelining algorithm produces only correct results, but it does not show that it always terminates. To show termination, we must prove that the todo set in procedure pipeline is eventually empty. The todo set decreases in size when there is a pair $\langle n, A \rangle$ such that for some $j$, $A^j$ has been scheduled previously. Let $\sim$ be the equivalence relation on sets of operations defined by $A \sim B \Leftrightarrow \exists j \text{ s.t. } A = B^j$. If we assume that the procedure schedule-state always terminates, then to prove termination it is sufficient to show that there are only finitely many equivalence classes under $\sim$.
Unfortunately, there may be infinitely many equivalence classes and in fact the procedure pipeline is not necessarily terminating under the constraints given so far. Consider, for example, what happens if the $A$ sets simply increase in size on each recursive call. A necessary condition for $A \sim B$ is that $|A| = |B|$; if there are sets of unbounded cardinality, then there are infinitely many equivalence classes.
An additional constraint is placed on the availability information to limit the size of the set of operations available for scheduling. Constraint 4.6 There is a constant $k$ such that for all possible availability sets $A$, if $x^j \in A$ then no $y^h \in A$ for any $h \ge j + k$. This constraint states that operations can be available from at most $k$ consecutive iterations at one time.
Thus, the scheduler has a "sliding window" of operations; until the operations in the first iteration of the window are scheduled, the window cannot be shifted to include a new iteration at the end. Lemma 4.7 Under Constraint 4.6, there are only finitely many equivalence classes under $\sim$. Proof: If there are $n$ operations in a loop body and $k$ consecutive iterations can appear in $A$, then every available operation set is a subset of $\{\ldots, x_i^{c + j_i}, \ldots\}$ for some $c$, $0 \le i < n$, and $0 \le j_i < k$. □ The value $k$ of Constraint 4.6 is a parameter of the software pipelining algorithm. It need not be the same for every loop scheduled (i.e., it can be computed dynamically), but it must have a maximum value for any particular loop. Also, it is not necessary to make the window an integral number of iterations. Partial iterations work just as well, although the details of the implementation are a bit more complex.
While Constraint 4.6 is motivated by the need to guarantee termination, it also leads to a good implementation of the procedure pipeline. The most expensive part of pipeline is checking whether the current set of available operations $A$ has ever been scheduled before as some $A^j$. For a window size of $k$ iterations, operation availability information for iterations $j$ through $j + k - 1$ can be represented as a bit vector of length $kn$, where $n$ is the number of operations in the sequential loop. The bit $hn + i$ is 1 if operation $x_i^{j+h}$ is available for scheduling; otherwise it is 0. When iteration $j$ has been completely scheduled (this occurs when the first $n$ bits are all 0) the bit vector is shifted left $n$ bits, discarding information for iteration $j$, and the last $n$ bits are set to reflect the availability of operations in iteration $j + k$. With this representation, checking whether the same availability information has been seen before only requires checking whether the same bit vector has been seen before, which can be implemented very efficiently through hashing.
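A sketch of this encoding, using arbitrary-length integers as bit vectors (the function names are illustrative): bit $hn + i$ stands for operation $x_i^{j+h}$, discarding iteration $j$ corresponds to the shift described above (a right shift under this little-endian encoding), and the seen-before test is a single hashed lookup.

def make_vector(avail, j, n):
    # avail: set of (i, iteration) pairs; j: first iteration in the window.
    v = 0
    for i, it in avail:
        v |= 1 << ((it - j) * n + i)
    return v

def shift_window(v, n, k, new_bits):
    # Precondition: iteration j (the low n bits) is completely scheduled.
    assert v & ((1 << n) - 1) == 0
    # Discard iteration j and admit iteration j+k at the top of the window.
    return (v >> n) | (new_bits << ((k - 1) * n))

# The "scheduled before?" test is one hashed set lookup.
seen = set()
v = make_vector({(0, 3), (2, 4)}, j=3, n=3)   # x0 of iter 3, x2 of iter 4
print(v in seen)                              # False
seen.add(v)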
Available Operations
Available operations analysis plays a role in our algorithm similar to the role global data-flow analysis plays in traditional optimizing compilers. An algorithm for computing available operations was first given in [EN89] (for historical reasons, available operations were termed "unifiable-ops" in [EN89]). In this section we give a new presentation of available operations. While functionally equivalent to the algorithms of [EN89], our presentation is both simpler and more direct, and the final algorithms are easier to implement. The development is divided into two parts. First, we show how to compute the initial set of available operations. Second, we show how to incrementally update the information in response to decisions made by the scheduler. At the end of the section we prove the correctness of the analysis and discuss some efficiency considerations.
Computing Available Operations
Recall that Constraint 4.6 forces the available operations to span no more than $k$ iterations of a loop. Therefore, to compute the operations available for scheduling it is sufficient to examine at most $k$ iterations of a loop. Since any number of unrolled iterations form a loopless (acyclic) program, we restrict the problem of computing available operations to an analysis of loopless programs.
Computing available operations requires the use of dependence analysis between operations. There are many variations on dependence analysis in the literature that satisfy our requirements (Constraint 4.3) and it is beyond the scope of this paper to include them here [KKP+81, FOW87, PBJ+91]. The algorithms in this section are presented using an abstract mechanism for dependence. By using a particular dependence analysis representation the algorithms can be made more efficient. We use the following definitions to model dependence analysis.
$$\mathit{depends}(x, y) \;=\; \mathit{write}(x) \cap (\mathit{read}(y) \cup \mathit{write}(y)) \ne \emptyset$$
$$\mathit{depends}(X, y) \;=\; \exists x \in X \text{ s.t. } \mathit{depends}(x, y)$$
The set $\mathit{write}(x)$ (resp. $\mathit{read}(x)$) must include every location $x$ could ever write (resp. read). The set $\mathit{kill}(x)$ must include only locations $x$ always writes. Two different sets $\mathit{write}(x)$ and $\mathit{kill}(x)$ are defined because dependence analysis must be conservative in general; it is not always possible to know at compile-time exactly which locations an operation may read or write. The predicate $\mathit{depends}(x, y)$ is true if there may be a dependence from $x$ to $y$.
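These definitions transcribe directly into code. In the following sketch, read and write are supplied as set-valued functions, standing in for the conservative sets a real dependence analysis would compute; the example read/write sets are hypothetical.

def depends(x, y, write, read):
    # There may be a dependence from x to y if x may write a location that y
    # may read or may write.
    return bool(write(x) & (read(y) | write(y)))

def depends_any(X, y, write, read):
    return any(depends(x, y, write, read) for x in X)

# Example: if b writes j and c reads j, then depends(b, c) holds.
w = {"a": {"i"}, "b": {"j"}, "c": set()}
r = {"a": set(), "b": {"i"}, "c": {"j"}}
print(depends("b", "c", w.get, r.get))   # True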
Defining correct available operation analysis requires identifying the operations that cannot be available because of potential data dependence violations. Assume that $x$ precedes $y$ on a path and $\mathit{depends}(x, y)$ is true. Then clearly $y$ cannot be available on that path until $x$ is scheduled, or else $y$ could be scheduled before $x$, resulting in a dependence violation. The following dataflow equation specifies the operations reachable from state $n$ that are not data dependent on an intervening operation:
$$\mathit{nodeps}(n) = \mathit{ops}(n) \cup \Big( \big( \bigcup_{n' \in \mathit{succ}(n)} \mathit{nodeps}(n') \big) - \{x \mid \mathit{depends}(\mathit{ops}(n), x)\} \Big)$$
Since the program fragment $P$ being analyzed is loopless, $\mathit{nodeps}(n)$ can be computed for all $n$ by a single bottom-up traversal of the control-flow graph for $P$.
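Because the program is loopless, the recursion bottoms out at states with no successors, and one pass in reverse topological order suffices. A sketch, with ops, succ, and the dependence test supplied as inputs (the names are illustrative):

def compute_nodeps(states_bottom_up, ops, succ, depends_any):
    # states_bottom_up: states in reverse topological order, so every
    # successor is processed before its predecessors.
    nodeps = {}
    for n in states_bottom_up:
        reachable = set()
        for m in succ(n):
            reachable |= nodeps[m]
        # Operations reachable below n are blocked if they depend on ops(n).
        nodeps[n] = ops(n) | {x for x in reachable
                              if not depends_any(ops(n), x)}
    return nodeps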
The program in Figure 8 illustrates another situation in which an operation cannot be available. In this case, operation b cannot be available for scheduling in the first state, because its definition of j could change the value read by the reference to j in operation c. In standard compiler terminology, location j is live at the first state, and b can kill c's reference to j. Clearly, any operation that can kill a live reference cannot be available.
The second component of the available operations analysis is a computation of live references. A reference to location $l$ is live at a state $n$ if there is a state reachable from $n$ where $l$ is (potentially) read and there is no intervening write of $l$. A conventional live reference analysis is not sufficient for our purposes; instead, we wish to compute live references discounting the effect of a particular operation $x$. More precisely, we wish to know the set of live references assuming that $x$ has been moved to the root state of the program. In this case, to say "$x$ has been moved to the root" means that all occurrences of $x$ that can potentially move to the root are not counted in the live variable computation. The intuitive justification behind this computation is that when moving an operation $x$ in the schedule, it is necessary to check if $x$ will kill live references in its new position. However, in deciding whether or not $x$ will kill live references in its new position one should not count references of $x$ itself in its current position.
The following dataflow equation defines the set of locations live at state $n$ modulo operation $x$:
$$\mathit{live}(n, x) = \mathit{read}(Y) \cup \Big( \big( \bigcup_{n' \in \mathit{succ}(n)} Z_{n'} \big) - \mathit{kill}(Y) \Big)$$
where
$$Z_{n'} = \begin{cases} \mathit{live}(n', x) & \text{if } x \in \mathit{nodeps}(n') \\ \mathit{live}(n', \mathit{stop}) & \text{otherwise} \end{cases} \qquad Y = \mathit{ops}(n) - \{x\}$$
The two cases in the definition of $Z_{n'}$ distinguish between the cases where occurrences of $x$ can or cannot be blocked by data dependencies. If there is an occurrence of $x$ on a path that is not blocked by data dependencies (i.e., $x \in \mathit{nodeps}(n')$) then that occurrence of $x$ is discounted in the live reference computation (i.e., $\mathit{live}(n', x)$). If there is no occurrence of $x$ that can potentially move, then all live references are counted (i.e., $\mathit{live}(n', \mathit{stop})$ counts all references, since stop has no effect on the store). As with the computation of nodeps, $\mathit{live}(n, x)$ can be computed for all states $n$ and operations $x$ by a single bottom-up traversal of the control-flow graph for $P$. Some further improvements to the efficiency of this procedure are discussed at the end of the section.
Let $r$ be the initial state (or root) of $P$. An operation $x$ is available for scheduling in $r$ if it satisfies three conditions: $x$ is in $\mathit{nodeps}(r)$, $x$ does not kill any live reference in operations other than $x$, and $x$ is in the "sliding window" of operations. Recall that Constraint 4.6 requires that operations from no more than $k$ consecutive iterations be available for any state. For a set of available operations $A$, let $\textit{min-it}(A)$ be the minimum iteration number of any operation in $A$.
$$\mathit{available}(r) = \{x^i \mid x^i \in R \text{ and } i - \textit{min-it}(R) < k\} \quad \text{where } R = \mathit{nodeps}(r) - \{x \mid \mathit{write}(x) \cap \mathit{live}(r, x) \ne \emptyset\}$$
Note that we are concerned with live references only at the root $r$; operations that potentially kill live references at an internal state $n$ are included in $\mathit{nodeps}(n)$. The program in Figure 9 illustrates this situation. Unlike Figure 8, the reference to j in operation c is not live at the root because it reads the value written by d. The important observation is that any operation that can kill a reference that is not live at the root (e.g., b can kill c's reference to j) must be dependent on some preceding operation (e.g., $\mathit{depends}(d, b)$). That is, a reference to j that is not live at the root must be preceded by an operation that writes j; this operation prevents other operations that could write j from being available at the root.
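Putting the pieces together, the availability test at the root is a small function of the two analyses. In this illustrative sketch, operations are (name, iteration) pairs, nodeps is a map as computed above, and live and write are assumed to be supplied as functions:

def available(r, k, nodeps, live, write):
    # Conditions 1 and 2: in nodeps(r), and kills no reference live at r
    # (live(r, x) already discounts x's own movable references).
    R = {x for x in nodeps[r] if not (write(x) & live(r, x))}
    if not R:
        return R
    # Condition 3: inside the k-iteration sliding window.
    min_it = min(it for (_, it) in R)
    return {(name, it) for (name, it) in R if it - min_it < k}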
The most expensive part of computing $\mathit{available}(r)$ is computing $\mathit{live}(r, x)$ for every operation $x$. The efficiency of the naive procedure described above can be improved in two ways. First, it is not necessary to compute $\mathit{live}(r, x)$ for every $x$; it is sufficient to compute it only for those operations in $\mathit{nodeps}(r)$, because $\mathit{nodeps}(r)$ is a superset of the operations available for scheduling. Second, operations that kill live references could be detected earlier in the computation instead of checking only at the root; we have not given this alternative to simplify the presentation.
Maintaining Available Operations
We first describe at a high level how available operations are maintained, after which we give implementations of the procedures next and update-one. Let $P$ be a sequential loopless program. We add an empty state $r$ (a state with no operations) to $P$ and make it the root; $r$ is the initial state in procedure pipeline (see Figure 6). This empty state $r$ will be filled with operations chosen by the scheduler.
The next step is to compute the dataflow analysis of Section 5.1. At this point the set of operations available for scheduling in state $r$ is $\mathit{available}(r)$.
Once the initial global analysis of $P$ is completed, we are ready to begin scheduling states. When state $n$ is scheduled, it is first filled with operations (by schedule-state), and then $\langle n, A \rangle$ is removed from the todo set and $n$'s successors are added to the todo set. A state in the todo set is a frontier state. At any point in the incremental development of the parallelized loop, every frontier state of the parallel loop has the property that its known predecessors have been scheduled and its successors have yet to be scheduled. (The "unknown" predecessors are those that are added through backedges inserted to complete the software pipelined loop.) Available operations are needed only for the frontier states; predecessors of frontier states are never modified. When procedure pipeline terminates, there are no frontier states and the (modified) program $P$ is the parallel loop. The first todo set is $\{\langle r, \mathit{available}(r) \rangle\}$. Thus, initially $r$ is the only frontier state. The procedure pipeline selects the pair $\langle r, \mathit{available}(r) \rangle$ from todo and fills $r$ with operations by calling schedule-state$(r, \mathit{available}(r))$.
The procedure schedule-state in turn calls update-one one or more times to choose the scheduled operations (see Figure 6). The procedure call update-one$(r, \mathit{available}(r), x)$ performs two tasks. First, $x$ is deleted from interior states of $P$ and $x$ is added to the frontier state $r$. Thus, this transformation moves $x$ to $r$ from its original place in the sequential schedule. (Some copies of $x$ may have to remain in interior states of $P$ if $x$ cannot move on all paths to the frontier state; see the discussion below.) When a test is moved to $r$ the control flow of $P$ must be modified to preserve $P$'s semantics. Second, the sets nodeps and live are updated where necessary.
An important fact is that both the deletion of x and updating of the nodeps and live sets can be restricted to a relatively small subset of the states of P; this property makes the incremental cost of maintaining available operations reasonable. The new set of available operations is (the updated) available(r).
When the scheduling of $r$ is complete, next$(r, A) = \{\langle r_i, A_i \rangle\}$ is the set of (empty) successors $r_i$ of $r$ and the corresponding sets of available operations $A_i$. We implement next$(r, A)$ by inserting a new, empty state $r_i$ before each $\textit{succ-on-branch}(r, c_i)$ on branch $c_i \in \mathit{branch}(r)$. The set $\mathit{available}(r_i)$ is exactly the set of operations available for scheduling on branch $c_i$ from $r$. Note that the $r_i$ are new frontier states of $P$. This implementation of next allows $P$ (and therefore the available operation analysis) to be shared among all elements of the todo set. As scheduling proceeds there will be multiple frontier states in $P$, one for each element in todo. An implementation of next is given in Figure 11.
Lemma 5.2 Let $P'$ be $P$ with the modifications performed by next. Then $P' \equiv P$. Proof: Procedure next only inserts empty nodes in $P$. □
To complete the description of available operations we must give an implementation of procedure update-one. We could do this trivially in terms of the local transformations of Percolation Scheduling [Nic85], but for completeness we describe a direct implementation that is closer to the way it should be done in practice. Let $r$ be a frontier state of $P$, let $x = \mathit{schedule}(\mathit{ops}(r), \mathit{available}(r))$, and assume that $x \ne \text{none}$. We first describe how $x$ is deleted from $P$ and added to $r$ when $x$ is an assignment; this is the easier case. Moving an operation $x$ while preserving $P$'s semantics is a little subtle because $x$ may be available at the frontier state but still blocked by data dependencies on some paths. The program in Figure 12a illustrates this situation. In this example, c is available at the root because c does not kill any references live at the root and because there is a path from the root to c (in this case passing through the false branch of a) such that c is not dependent on any operation on the path. However, there may be other paths from the root to c (in this case passing through the true branch of a) such that c is dependent on some operation on the path; clearly, c cannot be deleted from such a path. In addition, there may even be paths from other frontier states to c (represented by the incoming edge from e). If c is moved to the root in Figure 12 it still must be preserved on paths from other frontier states.
As illustrated in Figure 12b , this problem can be resolved by duplicating states so that no instance of the operation being moved is shared between paths where it can move and paths where it cannot move.
In this example, only the single state containing c needs to be duplicated, but in general multiple states may have to be duplicated. In Figure 12c, the state containing c has been deleted and the operation moved to the root.
It is easy to verify that the program in Figure 12c is equivalent to the program in Figure 12a.
To formalize which states are duplicated and which states are deleted we need some additional definitions and notation. A path is a sequence of states $\langle n_1, \ldots, n_k \rangle$ such that $n_{i+1} \in \mathit{succ}(n_i)$ for all $1 \le i < k$. A state $n_k$ is covered by a state $n_1$ for operation $x$ if there is a path $\langle n_1, \ldots, n_k \rangle$ such that operation $x$ is in $\mathit{nodeps}(n_i)$ for every $n_i$ on the path:
$$\mathit{covered}(n_1, x) = \{n_k \mid \text{there exists a path } \langle n_1, \ldots, n_k \rangle \text{ s.t. } \forall i, 1 \le i \le k, \; x \in \mathit{nodeps}(n_i)\}$$
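Since a state is covered exactly when it is reachable from $n_1$ through states whose nodeps sets all contain $x$, covered can be computed by a simple graph search. An illustrative sketch:

def covered(n1, x, succ, nodeps):
    # Depth-first search that only extends a path through states whose
    # nodeps set still contains x, per the definition above.
    if x not in nodeps[n1]:
        return set()
    C, stack = set(), [n1]
    while stack:
        n = stack.pop()
        if n in C:
            continue
        C.add(n)
        stack.extend(m for m in succ(n) if x in nodeps[m])
    return C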
We say a path is covered by $(n, x)$ if every state of the path is in $\mathit{covered}(n, x)$. When an operation $x$ is moved to a frontier state $r$ it should be deleted only from paths that are covered by $(r, x)$; other paths should be left unchanged. The simplest case is if every path to $x$ is covered by $(r, x)$. We say a loopless program $P$ is delete consistent for $(r, x)$ if for every $n \in \mathit{covered}(r, x)$ such that $n = \{x\}$, every path from a frontier state to $n$ is covered by $(r, x)$. If $P$ is delete consistent for $(r, x)$, then $x$ is not blocked by data dependencies on any path to the frontier state $r$. Hence, we can delete the states $n = \{x\}$ in $\mathit{covered}(r, x)$, update the predecessors of each such $n$ to point to $n$'s successor, and add $x$ to $r$.
Lemma 5.3 Let $x$ be an assignment such that $x \in \mathit{available}(r)$ and let $N = \{n \mid n = \{x\} \text{ and } n \in \mathit{covered}(r, x)\}$. Assume that $P$ is delete consistent for $(r, x)$. Let $P'$ be $P$ with the following changes.
(Recall that $\langle\rangle$ is the empty branch; see Section 2.)

(1) Modify each $n'$ where $\textit{succ-on-branch}(n', c) = n$ for some $n \in N$ so that $\textit{succ-on-branch}(n', c) = \textit{succ-on-branch}(n, \langle\rangle)$.

(2) Delete every $n \in N$.

(3) Let $\mathit{ops}(r) \leftarrow \mathit{ops}(r) \cup \{x\}$.

Then $P \equiv P'$.
Proof: For brevity we only sketch the proof. The transformation can be implemented by a sequence of semantics-preserving Percolation Scheduling transformations between adjacent nodes [Nic85]. Since each individual Percolation Scheduling transformation preserves program semantics, the entire sequence preserves program semantics. □ Of course, Lemma 5.3 only applies if $P$ is delete consistent. We next show how to make an arbitrary loopless program delete consistent for $(r, x)$. The set of predecessors of a state $n$ is $\mathit{pred}(n) = \{n' \mid n \in \mathit{succ}(n')\}$. The following lemma gives an easy test for determining whether $P$ is delete consistent.
Lemma 5.4 $P$ is delete consistent for $(r, x)$ iff $n \in \mathit{covered}(r, x) \Rightarrow \mathit{pred}(n) \subseteq \mathit{covered}(r, x)$. Proof: If every predecessor of a member of $\mathit{covered}(r, x)$ is in $\mathit{covered}(r, x)$, then clearly every path from $r$ to $n \in \mathit{covered}(r, x)$ is covered by $(r, x)$. For the other direction, assume that there is an $n \in \mathit{covered}(r, x)$ and for some $n' \in \mathit{pred}(n)$, $n' \notin \mathit{covered}(r, x)$. Then there must be a path from some frontier state $r'$ of the form $\langle r', \ldots, n', n \rangle$. This path is not covered by $(r, x)$. □
The following algorithm makes $P$ delete consistent for $(r, x)$. Let $C = \mathit{covered}(r, x)$. Iterate the following two steps until no $n$ is chosen in (1):
(1) Choose $n \in C$ such that some $p \in \mathit{pred}(n)$ is not also in $C$.
(2) Let $n'$ be a duplicate of $n$ and for every $p \in \mathit{pred}(n)$ such that $p \notin C$, if $\textit{succ-on-branch}(p, c) = n$ then modify $p$ so that $\textit{succ-on-branch}(p, c) = n'$. Note that this algorithm copies the minimum number of states needed to make $P$ delete consistent.
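The two steps translate into a short loop: while some covered state has an uncovered predecessor, duplicate it and redirect only the uncovered predecessors to the copy. In this illustrative sketch each state carries a mutable branch-to-successor map (succs) and duplicate is assumed to copy a state's operations and successors:

def make_delete_consistent(C, pred, succs, duplicate):
    # C = covered(r, x); pred(n) is recomputed on each pass, since
    # redirecting edges changes predecessor sets.
    while True:
        n = next((n for n in C if any(p not in C for p in pred(n))), None)
        if n is None:
            return                        # Lemma 5.4's condition now holds
        n2 = duplicate(n)                 # same ops and successors; not in C
        for p in list(pred(n)):
            if p not in C:
                for c, target in succs(p).items():
                    if target is n:
                        succs(p)[c] = n2  # uncovered paths now bypass n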
Once P is delete consistent the steps of Lemma 5.3 can be applied to move x to the frontier state. All that remains is to update the nodeps and live sets. States that are duplicated in making P delete consistent retain the nodeps and live information of the original state. The set of states for which the analysis can change is covered(r; x). Since both nodeps and live are computed bottom-up, the analysis can be updated in a single bottom-up pass over the paths covered by (r; x).
Finally, we show how to update the available operations in the case where the instruction chosen by update-one is a test. Let $r$ be a frontier state of $P$, and let $x = \mathit{schedule}(\mathit{ops}(r), \mathit{available}(r))$. For update-one to move a test $x$ while preserving $P$'s semantics, it is necessary to modify the control flow of $P$. Intuitively, we duplicate all the covered paths from $r$ to $x$; the original set of paths leads to the successor on $x$'s true branch, the duplicate set of paths leads to the successor on $x$'s false branch. This transformation is illustrated in Figure 13. The program in Figure 13a is already delete consistent for (root, b). When b is moved to root in Figure 13b, state a is duplicated on b's true and false branches to preserve control flow. Recall that a branch $c$ is a truth assignment to tests $\langle x_1 = b_1, \ldots, x_n = b_n \rangle$ where $b_i$ is one of true or false. The following lemma shows how to move a test to a frontier state.
Lemma 5.5 Let $P$ be a program with frontier state $r$ and let $x \in \mathit{available}(r)$ where $x$ is a test. Assume $P$ is delete consistent for $(r, x)$ and let $X = \mathit{covered}(r, x)$ in $P$. Program $P'$ is $P$ with the following modifications performed in order:
(1) For each $n \in X$, let $n'$ be a state such that $\mathit{ops}(n) = \mathit{ops}(n')$ and for each $c \in \mathit{branch}(n)$
$$\textit{succ-on-branch}(n', c) = \begin{cases} n_1 & \text{if } n_1 \notin \mathit{covered}(r, x) \\ n'_1 & \text{if } n_1 \in \mathit{covered}(r, x) \text{ and } \mathit{ops}(n_1) \ne \{x\} \\ \textit{succ-on-branch}(n_1, \langle x = \mathit{false} \rangle) & \text{if } n_1 \in \mathit{covered}(r, x) \text{ and } \mathit{ops}(n_1) = \{x\} \end{cases}$$
where $n_1 = \textit{succ-on-branch}(n, c)$.
(2) For each $n \in X$, if $\textit{succ-on-branch}(n, c) = n_1$ and $\mathit{ops}(n_1) = \{x\}$, then modify $n$ so that $\textit{succ-on-branch}(n, c) = \textit{succ-on-branch}(n_1, \langle x = \mathit{true} \rangle)$.

(3) For each $c \in \mathit{branch}(r)$ where $c = \langle x_1 = b_1, \ldots, x_n = b_n \rangle$ and $n = \textit{succ-on-branch}(r, c)$ do
$$\textit{succ-on-branch}(r, \langle x = \mathit{true}, x_1 = b_1, \ldots, x_n = b_n \rangle) = n$$
$$\textit{succ-on-branch}(r, \langle x = \mathit{false}, x_1 = b_1, \ldots, x_n = b_n \rangle) = \begin{cases} n & \text{if } n \notin X \\ n' & \text{if } n \in X \end{cases}$$

(4) Let $\mathit{ops}(r) \leftarrow \mathit{ops}(r) \cup \{x\}$.
Then $P \equiv P'$.
Proof: Again, for brevity we only sketch the proof. It is easy to verify that $P'$ preserves the control flow of $P$. As in the proof of Lemma 5.3, the transformation can be expressed as a sequence of local transformations between adjacent nodes. □
Part (1) of Lemma 5.5 duplicates covered paths by creating a copy $n'$ of every state in $\mathit{covered}(r, x)$ and by assigning successors so that the paths formed by the $n'$ lead to the false branch of $x$. Part (2) modifies the original states in $\mathit{covered}(r, x)$ so that they lead to the true branch of $x$. Part (3) modifies the branches of $r$ to point to the original nodes if $x$ is true and to the copied nodes if $x$ is false. A description of procedure update-one is given in Figure 14.

The implementation we have described is somewhat naive and there are inefficiencies that can be eliminated at the cost of greater complexity in the algorithm. Most of the potential problems are related to space explosion, either in the size of the final code or in the size of intermediate data structures used by the algorithm. Some states that are initially different may become identical as a result of scheduling operations. This observation applies to both states in the parallel schedule and states that have yet to be scheduled. A good implementation should merge states that are identical and are on identical paths. When performed on the states of the parallel schedule, this optimization reduces the size of the final code. A separate potential problem lies in the definition of delete consistency. Making the sequential program delete consistent prior to moving an operation $x$ may result in duplicating many states of the sequential program. These duplicates cannot subsequently be merged because $x$ occurs on one set of paths in its original position (i.e., on those paths where $x$ was blocked by a data dependence) and not on the set of paths where $x$ was moved. A partial solution is to move $x$ as far as possible on the paths where it is blocked by a data dependence, thus allowing some sharing of common paths. Some scheduling systems have this property [Nic85, ME92]. However, this optimization may be of marginal value in our algorithm, because the duplicated states of the sequential program are soon eliminated by subsequent scheduling anyway.
Another approach to improving the efficiency of the techniques presented here is to use a representation other than the control-flow graph for computing available operations. The obvious alternative is to use some form of the program dependence graph, which admits more efficient algorithms for some purposes (see [LA92, AJLS92] for uses of program dependence graphs in the context of software pipelining). We have presented our techniques using a control-flow graph representation for simplicity only; there is no barrier to using other, potentially faster, representations in an implementation.
Correctness of the Analysis
In Section 4, we assumed the available operations analysis was correct to prove the correctness of the software pipelining algorithm. Recall that the available operations analysis is correct if $L_1 \equiv L$, where $L$ is a sequential loop and $L_1$ is the infinite parallel program computed by pipeline2. In this section we prove that the implementation of available operations given in Sections 5.1 and 5.2 is correct.
5.4 Managing the Window
For performance reasons, it is obviously desirable to minimize the number of iterations of L that are actually used in the available operations analysis. Only a few iterations of L need be present in P at any time, because Constraint 4.6 forces the available operations analysis for any state to span no more than k iterations of loop L. In this section we show that the number of iterations needed for available operations analysis can be limited to k.
The only problem with limiting the number of iterations used in the analysis is that different frontier states may require available operations from different iteration windows. For example, for a frontier state r, operations may be available from iterations i to i+k-1, while for another frontier state r′ operations may be available from iterations i+c to i+k+c-1. In a naive implementation, P must contain operations from iterations i through i+k+c-1 to cover both frontier states. Fortunately, this is not necessary. We can first schedule r using iterations i to i+k-1, along with any other states that have operations available from iteration i. Once all states with operations available from iteration i are scheduled, a new iteration of L can be added to P and the window shifted to cover iterations i+1 to i+k. Figure 15 is a modified version of pipeline implementing this idea: all frontier states that have operations available from iteration i are scheduled before any frontier states that have operations available from iteration i+1, and a new iteration is added to P only when every state that has operations available from iteration i has already been scheduled. Thus, P always contains the minimal number of iterations, and iterations are added to P as infrequently as possible.
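The following Python sketch shows one way the window-shifting driver just described might be organized. The helpers passed in (earliest_iteration, schedule_state, add_iteration) are stand-ins for the procedures of Figure 15, which we do not reproduce here.

    def pipeline_windowed(frontier, k, earliest_iteration, schedule_state, add_iteration):
        """Drive scheduling so that new iterations of L enter the partial
        schedule P as late as possible.

        frontier: initial frontier states (P holds iterations 0 .. k-1).
        earliest_iteration(s): lowest iteration with operations available in s.
        schedule_state(s, lo, hi): schedule s using iterations lo .. hi of L,
            returning any newly created frontier states.
        add_iteration(i): append iteration i of L to P.
        """
        i = 0                                   # lowest iteration in the window
        pending = set(frontier)
        while pending:
            ready = {s for s in pending if earliest_iteration(s) == i}
            if not ready:
                i += 1                          # window becomes i .. i+k-1
                add_iteration(i + k - 1)
                continue
            for s in ready:
                pending.discard(s)
                pending |= set(schedule_state(s, i, i + k - 1))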
There is one detail omitted from Figure 15. When the ith iteration of L is added to P, the live sets of the leaf states (i.e., the states at the end of iteration i) must be initialized to the set of locations live at the end of iteration i.
6 Resources
Resource allocation is a critical issue for software pipelining algorithms. In this section, we show how the allocation of functional units can be smoothly integrated into our software pipelining algorithm. Our approach to incorporating functional resources is similar to the reservation table methods used in dynamic scheduling algorithms [Bae80]. To describe the modifications to the algorithm that accommodate functional resources we require some additional definitions. Let {f₁, …, fₙ} be the set of functional units for a machine. We drop the assumption that every operation executes in a single cycle and assume that c is the greatest number of cycles required by any operation. A reservation table is an n × c array of boolean values, where entry (i, j) is true iff resource fᵢ is busy at cycle j. Multi-cycle operations introduce a complication: data dependences and resource constraints alone do not prevent operations that depend on x's result from being scheduled in the cycle after x is initiated. We resolve this problem by treating an i-cycle operation x as i one-cycle operations; operations that depend on the result of x are made dependent on the last operation in the chain.
To guarantee legal schedules, it is necessary to constrain the i unit-cycle operations to be scheduled in successive cycles without interruption. This constraint can be encapsulated entirely within the policy for selecting operations to schedule and thus does not affect the overall structure of the software pipelining algorithm.
To allocate functional units, the software pipelining algorithm is modified so that when a state n is scheduled, a reservation table associated with n describes resource usage at that point in the schedule. The scheduler is modified so that it chooses an operation that is both available and for which resources can be allocated. Two reservation tables R₁ and R₂ are compatible if they do not require the same functional unit in the same cycle; i.e., there is no entry (i, j) such that R₁(i, j) = true = R₂(i, j). If the reservation table R is associated with state n, then the scheduler must choose an operation x to schedule in n such that compatible(R, resources(x)). The following constraint modifies Constraint 4.2 to include reservation tables. The procedures next and update-one must also be modified to update reservation tables to reflect the changes in available resources when operations are scheduled. The procedure call next(n, A, R) should advance the reservation table R by one cycle to reflect the fact that in successors of n the resources used in the first cycle of R are no longer reserved. The procedure call update-one(n, A, R, x) should not only update n and A, but also update R by adding the resources required by x.
The next constraint modifies Constraint 4.3 to include reservation tables. The logical or of two reservation tables, R₁ ∨ R₂, is a table R such that R(i, j) = R₁(i, j) ∨ R₂(i, j). The reservation table advance(R) is a table R′ such that R′(i, j) = R(i, j+1) for j < c and R′(i, c) = false. Figure 16 gives a modified version of the software pipelining algorithm that includes reservation tables. For simplicity, the modifications are presented for the original algorithm in Figure 6 rather than the more efficient version in Figure 15. Note that the detection of repeating states now involves both the set of available operations and the reservation table. Using Constraints 6.1 and 6.2, it is straightforward to adapt the original proof of correctness of the software pipelining algorithm to prove the correctness of the algorithm in Figure 16. Termination is still guaranteed because there are only a finite number of reservation tables, and therefore repeating states are guaranteed to occur.
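The reservation-table operations just defined are mechanical, and the following Python sketch records them directly; the class name and representation are ours.

    class ReservationTable:
        """An n x c table of booleans: entry (i, j) is true iff functional
        unit i is busy at cycle j."""
        def __init__(self, n, c, busy=None):
            self.n, self.c = n, c
            self.busy = busy if busy is not None else [[False] * c for _ in range(n)]

        def compatible(self, other):
            """True iff no unit is claimed by both tables in the same cycle."""
            return not any(self.busy[i][j] and other.busy[i][j]
                           for i in range(self.n) for j in range(self.c))

        def logical_or(self, other):
            """R1 v R2: the pointwise or of the two tables."""
            return ReservationTable(self.n, self.c,
                [[self.busy[i][j] or other.busy[i][j] for j in range(self.c)]
                 for i in range(self.n)])

        def advance(self):
            """advance(R): drop the first cycle (its resources are freed in
            successor states) and append an all-false last cycle."""
            return ReservationTable(self.n, self.c,
                [row[1:] + [False] for row in self.busy])

With this representation, update-one(n, A, R, x) would produce R.logical_or(resources(x)), admissible only when R.compatible(resources(x)) holds, and next(n, A, R) would pass advance(R) to n's successors.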
6.1 Register Allocation
Registers are another critical resource that must be utilized effectively to achieve good results in practice. Traditional register allocation can interact very badly with software pipelining. If register assignment is performed before scheduling (the usual practice), then software pipelining may produce poor results, because the register allocator may unnecessarily reuse registers, thus adding data dependences to the program. Our approach is to modify an initial register allocation "on the fly" during software pipelining. Consider the program fragment in Figure 17(a). In this example, operation b is not available for scheduling at the root because its target register is one of the operand registers of operation a. However, if there is a spare register, then the dependence can be broken by renaming the destination register of b as in Figure 17(b). Now operation b′ is available for scheduling. It is necessary to insert a register move c into the program to restore the machine state after operation a. This transformation is a heuristic: it assumes that the advantage gained in eliminating the dependence outweighs the cost of the extra copy. This is usually true, and the copy operation can almost always be removed by a later global pass of generalized copy propagation [PNW92].
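As an illustration only, here is a minimal Python sketch of the renaming step. Operations are modeled as (destination, sources) pairs; the encoding and function name are hypothetical, not part of our implementation.

    def break_antidependence(op_b, free_regs):
        """If operation b is blocked only because its destination register
        is an operand of an earlier operation a, write to a spare register
        instead and repair the machine state with a copy.
        Returns (b', fix-up copy c), or None if no spare register exists."""
        dest, srcs = op_b
        if not free_regs:
            return None                  # no spare register: renaming impossible
        spare = free_regs.pop()
        b_prime = (spare, srcs)          # b': spare <- srcs, may be scheduled early
        fixup = (dest, [spare])          # c:  dest <- spare, restores the old state
        return b_prime, fixup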
There is an additional problem with the register allocation scheme described above. Including register allocation in the software pipelining algorithm requires that registers be taken into account when determining when two states are "the same". A sufficient condition is that two states s and s′ can be considered the same only if each register holds the value generated by a given operation x in state s exactly when it holds the value generated by x in state s′. This condition guarantees correctness and termination and is analogous to the similar requirements for available operations and functional units.
In practice, it appears that this condition alone permits an impractical number of states to be generated in some loops, and a repeating pattern fails to emerge within a reasonable number of steps. The problem is that, even for small register files, the number of possible assignments of values to registers is astronomical. To accelerate convergence of the pipelining algorithm, it is necessary to limit the space of possible register assignments in some way. The solution we use is as follows. A register file r is a renaming of register file r′ if r can be mapped to r′ by some set of register-to-register transfers. In the software pipelining algorithm, two states are considered to be equivalent if the register files in the two states are renamings of each other and the other conditions (on operations and functional units) are satisfied. This design identifies many register files with one another, thereby accelerating convergence of the algorithm. The cost is that register-to-register transfers must be issued on the backedges of pipelined loops to move values into the correct registers. These copy operations can be eliminated by a separate copy elimination optimization pass after software pipelining. Leaving the copy operations in the code is reasonable as well, as they incur only a minor performance penalty (see Section 7).
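A minimal sketch of the renaming test, modeling a register file as a map from register names to value identities; both functions and the representation are our illustration, assuming each live value occupies a unique register.

    def is_renaming(r1, r2):
        """r1 is a renaming of r2 if both files hold the same set of values,
        so r1 can be mapped to r2 by register-to-register moves alone."""
        return sorted(r1.values()) == sorted(r2.values())

    def backedge_moves(r1, r2):
        """The copies needed on the backedge to turn file r1 into file r2.
        (A real implementation must order the moves, or use a temporary,
        when the transfers overlap.)"""
        src_of = {v: reg for reg, v in r1.items()}
        return [(src_of[v], dst) for dst, v in r2.items() if src_of[v] != dst]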
7 Implementation and Experiments
The software pipelining algorithm described here has been implemented as part of a compiler project at the University of California, Irvine. The compiler is a version of the GNU C compiler (GCC) modified to accommodate our methods. GCC is used as a front-end to translate the C source into an intermediate representation. This translation includes an initial register allocation and a number of common optimizations: constant folding, jump optimization (e.g., removing jumps to jumps), common subexpression elimination, and strength reduction. GCC's instruction scheduling, loop unrolling, and inlining are disabled and replaced by our software pipelining and scheduling algorithm. A number of incremental optimizations (e.g., incremental tree-height reduction) are beneficial in conjunction with software pipelining. For the results presented in this paper, only dynamic renaming (see Section 6.1) and load-after-store elimination are performed together with pipelining. Load-after-store elimination identifies loads that depend on a unique store; such loads can be eliminated in favor of uses of the value being stored. (In some cases the store can be removed as well, if it is known that the eliminated load is the only read of the location written by the store.) Load-after-store elimination is useful because it removes register spill code that becomes dead as a result of dynamic renaming. Both dynamic renaming and load-after-store elimination are an inherent part of our "on the fly" register allocation scheme.
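The following Python sketch conveys the flavor of load-after-store elimination on a straight-line sequence, under the strong assumptions that locations are symbolic and perfectly disambiguated and that the stored register is not redefined in between; the operation encoding is ours.

    def eliminate_load_after_store(ops):
        """Replace a load reached by a unique store to the same location
        with a register copy of the stored value.
        ops: list of ('store', loc, src_reg) / ('load', loc, dst_reg) tuples."""
        last_store = {}                    # location -> register last stored there
        out = []
        for op in ops:
            if op[0] == 'store':
                _, loc, src = op
                last_store[loc] = src
                out.append(op)
            elif op[0] == 'load' and op[1] in last_store:
                _, loc, dst = op
                out.append(('copy', last_store[loc], dst))   # use the stored value
            else:
                out.append(op)
        return out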
The strength of our software pipelining framework is the flexibility to exploit whatever fine-grain parallelism is available in a loop. Restrictions placed on code motions are designed to be as weak as possible while still guaranteeing correctness and termination. As discussed below, this flexibility does in fact translate into very good speedups across a variety of architectural models.
The downside of the weak restrictions of our system is that there is a huge number of potential states, even for small loops and machines with modest resources. The huge state space can cause slow convergence of pipelining to a pattern and large final loops. There is a clean solution to this problem: the scheduler should be designed to minimize code explosion by restricting code motions that increase code size. For the purposes of this paper, however, we have focused the experiments to reveal information about the software pipelining algorithm rather than about particular smart scheduling heuristics. Thus, we have used only very simple greedy list-scheduling heuristics that make no effort to account for the impact of code motions on code size. As we shall see shortly, in the majority of cases the size of the state space is not a problem, and software pipelining converges quickly to a pattern even with naive scheduling.
In some cases, using an iteration window that is large enough to maximally exploit the available parallelism results in unjustifiably slow convergence. To identify these cases in this experiment, we find it useful to introduce the notion of "cut-off convergence", which constrains the maximum number of iterations scheduled to some fixed amount; the remainder of any paths that have not converged after the cut-off number of iterations are simply scheduled sequentially. We stress that "cut-off convergence" is a creature of our experiment; its purpose is to identify when code explosion is a problem. In practice one should prefer to use scheduling heuristics designed to prevent code explosion; this topic is discussed further in Section 7.2.

Table 1: Functional Unit Kind and Latency

    Name     Latency    Description
    ALU      1 cycle    integer add/sub and logical
    SHIFT    1 cycle    arithmetic and logical shifts
    FALU     3 cycles   floating point add/sub and logical
    MUL      3 cycles   integer and floating point multiply
    DIV      13 cycles  integer and floating point divide
    MEM      2 cycles   cache read (a cache miss stalls the processor)
             1 cycle    cache write
    BRANCH   4 cycles   conditional branch
7.1 Architectural Models
Two pipelined VLIW architecture models are used for the experiments: one with homogeneous functional units, and one with heterogeneous functional units. Both models assume a single 64-bit wide register file shared by all functional units. With the exception of two "unlimited resource" experiments used to measure threshold performance results, the register file is assumed to have 32 registers.
Operation latencies for both models, given in Table 1, are similar to those of the Motorola 88110 superscalar. An instance of the heterogeneous model has 1, 2, 3, or an unlimited number of each of the functional units defined in Table 1. An instance of the homogeneous model has 2, 4, 8, or an unlimited number of homogeneous functional units, where each homogeneous functional unit can perform any of the functions defined in Table 1.
Each VLIW instruction specifies one (possibly NOP) operation for each functional unit. Each operation has the optional side effect of advancing the pipeline. For both models there are no hardware interlocks for detecting data or control hazards, so the compiler is entirely responsible for ensuring that all hazards are avoided at run time.

Tables 2 and 3 show the dynamic speedup measured for both target architecture models on 24 Livermore Loops. The speedups are with respect to running the unscheduled code sequentially on the target architecture. Thus, the speedups reflect both the exploitation of multiple functional units and pipelining.

[Table 2 (body not reproduced): per-kernel speedups for the homogeneous model; columns: kernel, 2 FU's (4), 4 FU's (6), 8 FU's (10), Infx32, InfxInf.]

The first three columns of Table 2 show the speedups for the homogeneous model assuming 32 registers and 2, 4, and 8 homogeneous functional units, respectively. Table 3 shows the same information for the heterogeneous model configured with 1, 2, and 3 units of each type (again assuming 32 registers). The last two columns of each table show threshold performance levels that are discussed below.
7.2 Experimental Results
For the results presented in the first three columns of each table, loops were pipelined with progressively larger iteration windows until there was no noticeable increase in the average speedup over all benchmarks. The numbers in parentheses at the top of each column show the smallest iteration window sizes for which the highest average performance was attained for each configuration, and with which the speedups shown in the tables were generated. Notice that all of the window sizes are fairly small, and none exceeds 10 iterations. For the results shown in the first three columns, no more than 64 iterations are scheduled (i.e., this is the cut-off). In almost all cases, only a fraction of this number is needed for convergence to a pattern. The "conv type" column in Tables 4 and 5 shows how the algorithm terminated: P indicates convergence to a pattern and C indicates cut-off convergence on at least one path.
The last two columns of Tables 2 and 3, which are identical in both tables, show the speedups obtained assuming an unlimited number of functional units and either 32 registers (column "Infx32") or an unlimited number of registers (column "InfxInf"). For both columns we want to show the maximum speedup that can be obtained for the specific architecture configuration given the fixed code motion capabilities, scheduling heuristics, and front-end optimizations used in our system. Therefore, for the "Infx32" and "InfxInf" columns, the iteration window size and cut-off limits were set to the number of iterations that would be executed by each loop at run time, which for most of these loops is 100 iterations. Thus, loops that exhibit natural convergence are guaranteed to be optimal in the sense defined for Theorem 8.2, and the few loops that do not converge are optimal in the same sense because they are fully unrolled and scheduled. Note that with 8 homogeneous functional units, or 3 of each heterogeneous functional unit, the speedups are already optimal with respect to the "Infx32" numbers, yet were obtained with an iteration window size of at most 10 iterations. For 16 of the 24 loops, the "8 FU's" and "3 of each" speedups are optimal with respect to the "InfxInf" numbers, which shows that even optimal register allocation for these loops cannot increase performance.
The speedups in Tables 2 and 3 cover a wide range, from 1.9 all the way up to 17.2. What we have tried to show is that, given a fixed set of code motion capabilities, scheduling heuristics, and front-end optimizations, such as those produced by GCC, our software pipelining algorithm is able to achieve the same performance as fully unrolling and scheduling the loop. Furthermore, despite the generality of our approach, the algorithm manages to achieve good utilization of resources, even with naive scheduling heuristics. The overall performance of these benchmarks, with either pipelining or complete unrolling, could be improved in a number of ways that are orthogonal to our software pipelining approach (e.g., by improving memory reference disambiguation).

In the rest of this section we discuss and interpret the performance results in more detail. To aid in interpreting the results shown in Tables 2 and 3, we present the following performance measures in Tables 4 and 5:

conv type: Convergence type. P means that pipelining converged on a pattern; C means that it converged on the cut-off.
reg use: The maximum number of registers used at any instruction (i.e., state).
min loop: The number of instructions on the shortest path through the pipelined loop.
max loop: The number of instructions on the longest path through the pipelined loop.
total size: The total number of instructions in the benchmark, including inner loop instructions as well as all code preceding and succeeding the loop.
As discussed in Section 6.1, software pipelining sometimes inserts register-to-register transfers in order to speed convergence of the algorithm. Because these transfers can be eliminated by copy propagation (albeit at the cost of an increase in code size), we have not counted them in the speedup figures in Tables 2 and 3. Even if the copies are not eliminated, the figures in Tables 4 and 5 show that the performance penalty is low. For example, for LL2 with 2 homogeneous functional units, the worst case is that all 25 live registers must be copied on the backedge, which costs 13 cycles (⌈25/2⌉, with each of the two units issuing one copy per cycle), or 13% of the length of the pipelined loop body. For the large loops the penalty is well below 10%; for small loops the overhead can be reduced by unrolling the pipelined loop body.
There are two interesting anomalies in the speedup tables. The first is that for a few benchmarks (e.g., LL2, LL8, and LL23 in both tables) the speedup actually decreases slightly after some increases in the number of resources. One cause is that even though two pipelined loops may exhibit the same asymptotic speedup, the overhead from their pre-loop and/or post-loop code can differ (e.g., speedup goes from 14.0 to 13.6 when going from Infx32 to InfxInf for LL2). The other cause of some small decreases in performance when resources increase is overly simplistic scheduling heuristics. For instance, the list scheduling heuristics currently used in the compiler allow operations to be scheduled much earlier than their next use, potentially saturating the register file at subsequent states and thus preventing the removal, via renaming, of false dependences that might otherwise allow operations on the critical path to be scheduled earlier. Kernels LL8 and LL23 provide good examples of this effect. The fewer resources there are, the fewer "unimportant" (i.e., off the critical path) operations are scheduled far ahead of their next uses, and the less likely it is that the register file becomes saturated with unimportant values. This problem can be alleviated with different scheduling heuristics. In any case, this issue is orthogonal to software pipelining itself.
The other anomaly occurs for loops LL9, LL18, and LL22 in Table 2 and LL7, LL13, and LL18 in Table 3. Factoring out considerations like the above scheduling anomaly and other heuristic aspects such as speculative scheduling, we would expect speedup to increase linearly with the number of functional units until some threshold speedup is reached. Thus, for each doubling in the number of functional units, we would expect the speedup to be the lesser of twice the old speedup and the maximum (unlimited) speedup. However, for this second class of anomaly, when going from 2 to 4 functional units in Table 2, or from 1 to 2 of each functional unit in Table 3, the speedup is slightly less than this expected value. The reason is that, while the iteration window size was chosen to maximize the average speedup shown in the tables, the performance of a few of the loops in each table would have improved with a larger window size, though without any significant effect on the average speedup over all loops.
Finally, it is interesting to consider the circumstances under which the algorithm fails to converge to a pattern before the cut-off is reached. An analysis of the kernels with type "C" convergence (see Tables 4 and 5) shows that the problem arises in vectorizable loops, or, more generally, loops with very few flow dependences. In such loops the only constraints are resource constraints, and operations are free to move almost anywhere in the schedule. The lack of dependence structure in the program, combined with greedy scheduling heuristics, tends to lead to an explosion in the set of states, slowing convergence. Variations on a device of Ebcioğlu's may show how to modify the scheduler to avoid this problem [Ebc87]. The basic idea is to introduce artificial dependences that do not harm parallelism extraction but dramatically reduce the number of potential states the scheduler may explore.
For example, a rule of thumb for vectorizable loops could be that the instance of an operation x in iteration i must be scheduled no later than the instance of x in iteration i+1. Since the loop is vectorizable, there is no reason to prefer scheduling one before the other; eliminating some orderings reduces the overall number of potential states.
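A sketch of this rule as it might be imposed on a dependence graph; the edge-set representation is our illustration.

    def add_artificial_order(dep_edges, x_instances):
        """For a vectorizable operation x, add an artificial dependence from
        each instance x_i to x_{i+1}, forcing x_i to be scheduled before
        x_{i+1} (a slightly stronger constraint than "no later than").
        Since the loop is vectorizable, no parallelism is lost, but the
        scheduler's state space shrinks.
        x_instances: instance ids of x, ordered by iteration number."""
        for a, b in zip(x_instances, x_instances[1:]):
            dep_edges.add((a, b))
        return dep_edges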
Notice that in most cases, the total size of the final loop is an order of magnitude larger than the shortest and longest paths through the loop. Because loop control conditionals from succeeding iterations are scheduled in parallel with operations from preceding iterations, a new loop exit path is usually created for each iteration scheduled.[6] In some cases this is simply the cost side of a cost vs. performance trade-off inherent to scheduling conditional jumps, the benefits of which are to allow strictly control dependent operations (e.g., operations like stores that cannot be renamed) to be scheduled earlier, and to commit to alternative control paths as early as possible so as to minimize the amount of speculative scheduling. Fortunately, in many cases, such as for loop exit code, this cost can be significantly reduced by merging multiple identical control paths into a single shared path. In the context of available operations scheduling, this is accomplished by merging states with identical available operations sets, an optimization we have not implemented.

[6] To guarantee the preservation of correct semantics when scheduling a conditional above operations that precede it, it is necessary to duplicate those operations onto each branch of the conditional after it has been scheduled.
8 On Optimal Software Pipelining
In this section we briefly review research on the limitations of software pipelining, especially a result showing that optimal software pipelining is in general unachievable [SGE91]. Given this result, we show that our algorithm is "as good as possible" in the sense that it can produce arbitrarily good schedules.
Research in software pipelining has naturally focused on discovering algorithms for computing pipelined schedules, both in general and for specific machines. Concurrently, researchers have investigated the theoretical limitations of software pipelining. One of the central theoretical questions is whether or not there is a software pipelining algorithm that produces optimal pipelined schedules for an arbitrary loop. Because scheduling algorithms are based on preserving data dependences, the natural meaning of "optimality" is with respect to the length of dependence chains.
Definition 8.1 A program L is time optimal if for every execution ⟨⟨x₀, s₀⟩, …, ⟨{stop}, sₙ⟩⟩ of L, n is the length of the longest dependence chain in the execution.
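Definition 8.1 measures a schedule against the longest dependence chain; for a finite execution whose dependences form a DAG, that bound is easy to compute, as in this Python sketch (our own illustration, counting chain length in operations).

    from collections import defaultdict

    def longest_dependence_chain(n_ops, dep_edges):
        """Length of the longest chain in a dependence DAG, via dynamic
        programming over a topological order (Kahn's algorithm)."""
        succs = defaultdict(list)
        indeg = [0] * n_ops
        for a, b in dep_edges:             # edge (a, b): b depends on a
            succs[a].append(b)
            indeg[b] += 1
        depth = [1] * n_ops                # each op alone is a chain of length 1
        work = [v for v in range(n_ops) if indeg[v] == 0]
        while work:
            v = work.pop()
            for w in succs[v]:
                depth[w] = max(depth[w], depth[v] + 1)
                indeg[w] -= 1
                if indeg[w] == 0:
                    work.append(w)
        return max(depth) if n_ops else 0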
The obvious form of the optimality question is stated as follows: is there an algorithm that takes as input a machine description (i.e., resource constraints, instruction timings, etc.) and a loop, and produces a time optimal schedule for that machine? This problem statement is not very useful, however, because scheduling problems with finite resources are computationally intractable even without software pipelining. To gain some insight into software pipelining itself, researchers have usually abstracted the problem as follows: given sufficient resources and a loop L, is there an algorithm that computes a time optimal schedule for L?
The answer to this question is trivially "no" for some programs, such as the one in Figure 2. Recall that instructions d and e cannot be scheduled in the same instruction because they write the same store location. One branch of the test must always be optimized at the expense of the other branch, and thus there does not exist a parallel version of L that is time optimal. The conflict between d and e in Figure 2 is usually classified as another type of dependence, an output dependence [KKP+81]. To avoid this problem, we can rephrase the question again: given unbounded resources and a loop L without output dependences, is there an algorithm that computes a time optimal schedule for L? This question has been resolved negatively [SGE91]. Again the problem is that for some loops an optimal closed-form parallel version does not exist.
While Definition 8.1 is natural, it appears that so many qualifications are required to apply it in the analysis of general software pipelining algorithms that it ceases to be useful. For our purposes we adopt a different definition of what it means for a software pipelining algorithm to be "as good as possible."
Recall from Section 4 that the loop L∞ is the (infinite) parallel program that results from scheduling with complete information about available operations. While L∞ may not be "optimal", it represents the best that can be done with global knowledge of the program and the ability to fully unroll loops.
The following theorem shows that as the window size k of the software pipelining algorithm increases, the quality of the code approaches that of L∞.

9 Related Work

Today there is a variety of algorithms and frameworks for software pipelining. We describe each approach and discuss its relationship to our own work. Because of the large amount of work in the area, our discussion of each proposal is necessarily brief.
9.1 Modulo Scheduling
Modulo scheduling is an important software pipelining technique introduced by Rau and Glaeser [RG81] and subsequently used as the basis for numerous other algorithms [Lam87, Jon91, RTS92, Huf93, WMHR93]. Modulo scheduling has been used in compilers for the FPS series [Tou84], the polycyclic machine [RG81], and Cydrome's Cydra [Cyd87].
A basic modulo scheduling algorithm works as follows. Consider a loop L that requires a resource k times per iteration of the loop body. If the target machine has t copies of the resource, then an upper bound on the throughput is one iteration of L every k/t cycles. Let the initiation interval s be max(1, k/t). In modulo scheduling, the loop body is heuristically scheduled one statement at a time. When a statement is scheduled at time c, the instance of that statement in iteration i is scheduled at time c + is. If at any point a statement cannot be added to the schedule due to resource or dependence constraints, then the schedule is abandoned and the algorithm either backtracks or tries a larger initiation interval.
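The resource bound in this description is a simple calculation. The sketch below generalizes it, in the standard way, to the maximum over all resource classes; the function name and encoding are ours.

    import math

    def min_initiation_interval(uses, units):
        """Lower bound on the initiation interval s: a loop body using a
        resource f uses[f] times on a machine with units[f] copies of f
        cannot initiate iterations faster than every ceil(k/t) cycles;
        take the worst ratio over all resources (and at least 1)."""
        return max(1, max(math.ceil(uses[f] / units[f]) for f in uses))

    # e.g., 6 MEM operations per iteration on 2 MEM units forces s >= 3:
    assert min_initiation_interval({'MEM': 6, 'ALU': 4}, {'MEM': 2, 'ALU': 4}) == 3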
Modulo scheduling smoothly integrates the simultaneous treatment of resource constraints and software pipelining. The primary disadvantage of modulo scheduling is that it does not apply directly to loops with conditional tests in the loop body. Two extensions have been proposed to overcome this limitation. Lam introduced hierarchical reduction to combine modulo scheduling with complex control flow [Lam87]. In hierarchical reduction, the "then" and "else" branches of a conditional test are first scheduled independently. The shorter branch is padded with noops to make it the same length as the longer branch, and the scheduler encapsulates the entire if-then-else construct as a single statement. Hierarchical reduction suffers from several drawbacks. First, some paths are padded with noops, which may slow execution; second, treating the if-then-else as a single statement necessarily overestimates resource requirements; and third, preserving the control structure of the program restricts possible code motions.
A second proposal for integrating modulo scheduling with conditional tests is to use if-conversion [AKPW83] before modulo scheduling and reverse if-conversion [WHB92, WMHR93] after modulo scheduling. When a loop is if-converted, the expression of control flow is changed from explicit jumps to guarded operations, where each operation of the original loop is guarded by the predicates of the conditionals that control its execution. In this way, all non-trivial control flow in the loop is replaced by data dependences.
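A toy Python sketch of the idea: each operation is tagged with a guard rather than reached by a jump. The encoding is ours and ignores nested conditionals.

    def if_convert(p, then_ops, else_ops):
        """Replace an if-then-else with guarded operations in one block:
        each operation is paired with the predicate under which it may
        take effect, so control flow becomes a data dependence on p."""
        return ([((p, True), op) for op in then_ops] +
                [((p, False), op) for op in else_ops])

    # 'if p: x = a + b  else: x = a - b' becomes
    #   [((p, True), 'x = a + b'), ((p, False), 'x = a - b')]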
Modulo scheduling with if-conversion appears to improve upon modulo scheduling with hierarchical reduction [WMHR93]. However, if-conversion retains the undesirable features of hierarchical reduction to a considerable degree. First, because control flow is expressed as data dependence, speculative execution of operations (i.e., moving operations above conditionals) is not possible, nor is it possible to reorder conditionals, for the same reason. Thus, the possible code motions are restricted. In addition, performing if-conversion greatly hinders the management of limited resources during scheduling. If-conversion schedules all the operations of the original loop body in a single basic block. These operations compete for resources during scheduling, including operations that could never execute simultaneously because they appear on different control paths in the original loop. Thus, straightforward modulo scheduling of if-converted loops overestimates resource requirements.
For the case of loops without control flow and unlimited resources, there is considerable commonality between our algorithm and modulo scheduling. For example, in [AN88a] it was shown that a simplified version of our algorithm produces optimal code for loops without conditionals in the body on machines with sufficient resources. Despite the differences in conception between the two algorithms, this result was later shown to hold for a small modification of modulo scheduling as well [Jon91].
In short, our algorithm combines software pipelining, resource constraints, and the handling of control flow with a flexibility not matched by current modulo scheduling techniques. In our opinion, the significant practical advantage of modulo scheduling at this time is that, in cases where both techniques produce equally fast schedules, the schedules produced by modulo scheduling are generally more concise [Jon91].
9.2 Pipeline Scheduling
The work most closely related to our own is that of Ebcioğlu [Ebc87], Ebcioğlu and Nakatani [NE89, EN90, NE90], and later Moon and Ebcioğlu [ME92]. Pipeline scheduling differs from our approach in that the loop body is not constructed by scheduling and testing for repeating states. Instead, the original loop is incrementally transformed to create a parallel schedule. Software pipelining is achieved by moving operations across the backedge of the loop; this has the effect of moving an operation between loop iterations. The handling of control flow is based on the same principles as our own approach and is equally general. A scheduling window of operations is also used [NE90], although its purpose is to reduce code explosion rather than to guarantee termination.
An advantage of pipeline scheduling is that the loop is always equivalent to the original loop, and therefore it is legal to apply any semantics-preserving transformation to the loop at any time, even transformations that have little to do directly with scheduling. Ebcioğlu and Nakatani exploit this property by aggressively renaming registers and performing strength reduction, optimizations which substantially alter the dependence structure of the loop. We also apply some of these optimizations (see Section 7), but cannot apply them as generally as pipeline scheduling because of our need to guarantee regular dependences for correctness. As an aside, to the best of our knowledge, modulo scheduling implementations do not perform any transformations that modify the dependence graph.
An advantage of our algorithm over the current pipeline scheduling algorithm is in the handling of resource constraints. Pipeline scheduling uses only local transformations to move operations from one state to another. Thus, at some points resource constraints may need to be violated in the schedule as an operation moves through one state on its way to another. To deal with this, pipeline scheduling has a moderately elaborate phase structure in which resource constraints are alternately enforced and relaxed on specific portions of the loop body. Our algorithm treats resource constraints in a more direct and uniform way.
9.3 GURPR

GURPR, for Global Unrolling, Pipelining, and Rerolling, is a software pipelining technique proposed by Su, Ding, and Xia [SDX87]. The technique is based on URPR, an algorithm for pipelining loops without tests [SDX86]. Given a loop L, the first step of GURPR is to apply URPR to each path through the original loop body. The separate pipelined paths are then put together to form the pipelined loop, with compensation code added at points where execution could jump from one path to another.
The approach is similar in philosophy to trace scheduling: paths are first optimized as basic blocks, ignoring jumps into and out of the path, and then fix-up code is added to ensure correctness. GURPR is also subject to the same criticism as trace scheduling: there is no reason why the execution of a program should repeatedly follow the same path through the loop body. Our approach and that of Ebcioğlu and Nakatani are more uniform, overlapping iterations on all paths rather than on a subset of paths.
9.4 Petri Net Techniques
Recently there has been interest in using Petri Nets to formalize the software pipelining problem [GWN91, RA93]. There is a natural mapping from operations, dependences, and resource constraints into Petri Nets, thus combining all of these features in a single, well understood formalism. This approach has been shown to be competitive with modulo scheduling with hierarchical reduction [RA93] and appears promising.
The weakness of current algorithms based on Petri Net techniques is that control flow is handled in a way very similar to if-conversion. The net effect of the mapping into the Petri Net model is that control flow is enforced just like data dependences, and thus speculative execution of operations is not possible. Furthermore, the rate of execution of iterations is determined by the length of the longest path through the loop body, even when shorter paths through the loop are taken during execution.
10 Conclusions
We have presented a simple but fairly detailed description of a compaction-based software pipelining algorithm that handles resource constraints. The novel aspect of our algorithm is that it cleanly (in fact, completely) separates issues specific to software pipelining, such as detecting repeating "pipeline" states and termination, from other, orthogonal issues, such as the computation of available operations and scheduling decisions. We hope that this makes two contributions to the state of the art. First, our algorithm explains in a fairly simple way what software pipelining is about and what its unique characteristics are. Second, the modular and simple design of our algorithm should facilitate the development of general, retargetable implementations of software pipelining.
