Abstract
This paper presents a new approach to local instruction scheduling based on integer programming that produces optimal instruction schedules in a reasonable time, even for very large basic blocks. The new approach first uses a set of graph transformations to simplify the data-dependency graph while preserving the optimality of the final schedule. The simplified graph results in a simplified integer program which can be solved much faster. A new integer-programming formulation is then applied to the simplified graph. Various techniques are used to simplify the formulation, resulting in fewer integer-program variables, fewer integer-program constraints and fewer terms in some of the remaining constraints, thus reducing integer-program solution time. The new formulation also uses certain adaptively added constraints (cuts) to reduce solution time. The proposed optimal instruction scheduler is built within the Gnu Compiler Collection (GCC) and is evaluated experimentally using the SPEC95 floating point benchmarks. Although optimal scheduling for the target processor is considered intractable, all of the benchmarks' basic blocks are optimally scheduled, including blocks with up to 1000 instructions, while total compile time increases by only 14%.
Introduction
Instruction scheduling is one of the most important compiler optimizations because of its role in increasing pipeline utilization. Conventional approaches to instruction scheduling are based on heuristics and may produce schedules that are suboptimal. Prior work has considered optimal instruction scheduling, but no approach has been proposed that can optimally schedule a large number of instructions in reasonable time. This paper presents a new approach to optimal instruction scheduling which uses a combination of graph transformations and an advanced integer-programming formulation. The new approach produces optimal schedules in reasonable time even for scheduling problems with 1000 instructions.
This research was supported by Equator Technologies, Mentor Graphics' Embedded Software Division, Microsoft Research, the National Science Foundation's CCR Division under grant #CCR-9711676, and by the University of California's MICRO program.
Local Instruction Scheduling
The local instruction scheduling problem is to find a minimum-length instruction schedule for a basic block. This instruction scheduling problem becomes complicated (interesting) for pipelined processors because of data hazards and structural hazards [11]. A data hazard occurs when an instruction i produces a result that is used by a following instruction j, and it is necessary to delay j's execution until i's result is available. A structural hazard occurs when a resource limitation causes an instruction's execution to be delayed.
The complexity of local instruction scheduling can depend on the maximum data-hazard latency which occurs among the target processor's instructions. In this paper, latency is defined to be the difference between the cycle in which instruction i executes and the first cycle in which data-dependent instruction j can execute. Note that other authors define latency (delay) to be the cycle difference minus one, e.g., [2, 17]. We prefer the present definition because it naturally allows write-after-read data dependencies to be represented by a latency of 0 (write-after-read dependent instructions can execute in the same cycle on a typical multiple-issue processor, because the read occurs before the write within the pipeline).
Instruction scheduling for a single-issue processor with a maximum latency of two is easy. Instructions can be optimally scheduled in polynomial time following the approach proposed by Bernstein and Gertner [2]. Instruction scheduling for more complex processors is hard. No polynomial-time algorithm is known for optimally scheduling a single-issue processor with a maximum latency of three or more [2]. Optimal scheduling is NP-complete for realistic multiple-issue processors [3]. Because optimal instruction scheduling for these more complex processors is considered intractable, production compilers use suboptimal heuristic approaches. The most common approach is list scheduling, where instructions are represented as nodes in a directed acyclic data-dependency graph (DAG) [15]. A graph edge represents a data dependency, an edge weight represents the corresponding latency, and each DAG node is assigned a priority. Critical-path list scheduling is a common variation, where an instruction's priority is based on the maximum-length path through the DAG from the node representing the instruction to any leaf node [15].
While critical-path list scheduling is usually effective, it can produce suboptimal results even for small scheduling problems. Consider the DAG in Figure 1, taken from [17], where each edge is labeled with its latency and each node is labeled with its critical-path priority. For this DAG, nodes A, B and C all have the same priority, so the scheduler can arbitrarily select the order of these instructions. If the initial order is A, C, B or C, A, B, the next cycle will be a stall because the latency from B to D or from B to E is not satisfied. Other orders of A, B and C will produce an optimal schedule which has no stall.
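The sensitivity of critical-path list scheduling to tie-breaking can be illustrated with a minimal Python sketch for a single-issue processor. The five-node DAG used in the usage note is a hypothetical example chosen to show the same effect; it is not the DAG of Figure 1. The function names and the `tie_break` parameter are assumptions made for illustration only.

```python
from collections import defaultdict

def critical_path_priorities(nodes, edges):
    """nodes: list in topological order; edges: dict (u, v) -> latency.
    A node's priority is the maximum total latency from it to any leaf."""
    succs = defaultdict(list)
    for (u, v), lat in edges.items():
        succs[u].append((v, lat))
    prio = {}
    for u in reversed(nodes):  # visit leaves first
        prio[u] = max((lat + prio[v] for v, lat in succs[u]), default=0)
    return prio

def list_schedule(nodes, edges, tie_break=None):
    """Single-issue critical-path list scheduling.
    Returns {node: cycle}, cycles numbered from 1.
    tie_break (hypothetical parameter): order used among equal priorities."""
    prio = critical_path_priorities(nodes, edges)
    preds = defaultdict(list)
    for (u, v), lat in edges.items():
        preds[v].append((u, lat))
    order = tie_break or nodes
    cycle_of, cycle = {}, 1
    while len(cycle_of) < len(nodes):
        # ready: every predecessor is scheduled and its latency has elapsed
        ready = [v for v in nodes if v not in cycle_of and all(
            u in cycle_of and cycle_of[u] + lat <= cycle for u, lat in preds[v])]
        if ready:
            best = max(ready, key=lambda v: (prio[v], -order.index(v)))
            cycle_of[best] = cycle
        cycle += 1  # an empty ready list models a stall cycle
    return cycle_of
```

With the hypothetical edges A→D, B→D, B→E and C→E, all of latency 2, nodes A, B and C share priority 2. Scheduling them in the order A, B, C gives a 5-cycle schedule, while the order A, C, B forces a stall and gives 6 cycles.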
Optimal Instruction Scheduling
Although optimal instruction scheduling for complex processors is hard in theory, in practice it may be possible to optimally solve important instances of instruction scheduling problems in reasonable time using methods from combinatorial optimization. Prior work has used various combinatorial optimization approaches to optimally schedule instructions for complex processors. However, none of these approaches can optimally schedule large basic blocks in reasonable time.
Ertl and Krall [8] use constraint logic programming and consistency techniques to optimally schedule instructions for a single-issue processor with a maximum latency greater than two. Vegdahl [23] and Kessler [13] use dynamic programming to optimally schedule instructions. Chou and Chung [6] and Tomiyama et al. [22] use approaches that implicitly enumerate all possible schedules to find an optimal schedule. For efficiency, [6] and [22] propose techniques to prune the enumeration tree so that redundant or equivalent schedules are not explicitly enumerated. Experimental results for these various approaches show that they are effective only for small to medium-sized basic blocks (30 instructions or fewer).
Prior work has also used integer programming to optimally schedule instructions. An integer program is a linear program, with the added requirement that all problem variables must be assigned a solution value that is an integer. Like the other approaches to optimal instruction scheduling, prior work using integer programming has only produced approaches that are effective for small to medium-sized scheduling problems. Arya [1] proposes an integer programming formulation for vector processors, although the basic approach is general and can be applied to other processor types. The experimental results in [1] suggest the formulation can significantly improve the instruction schedule. However, the results are limited to only three small to medium-sized problems (12 to 36 instructions). Only the smallest problem is solved optimally, with the other problems timing out before an optimal solution is found. Leupers and Marwedel [14] propose an integer programming formulation for optimally scheduling a multiple-issue processor. Although their work focuses on DSP processors, again the basic approach is general and can be used for other processor types. The experimental results in [14] are also limited to a few small to medium-sized problems (8 to 37 instructions). While the solution time might be acceptable for the largest problem studied (37 instructions solves in 130 seconds), the solution time does not appear to scale well with problem size (the next smaller problem, 24 instructions, solves in 7 seconds). Thus the approach does not appear practical for large instruction scheduling problems. Chang, Chen and King [5] propose an integer programming formulation that combines local instruction scheduling with local register allocation. Experimental results are given for one simple 10-instruction example which takes 20 minutes to solve optimally. These results suggest that the approach has very limited practicality.
Although prior work using integer programming has produced limited success for optimal instruction scheduling, integer programming has been used successfully to optimally solve various other compiler optimization problems, including array dependence analysis [19], data layout for parallel programs [4] and global register allocation [9]. Integer programming is the method of choice for solving many large-scale real-world combinatorial optimization problems in other fields [16], including other scheduling problems such as airline crew scheduling. This successful use of integer programming elsewhere suggests that improved integer programming formulations may be the key to solving large-scale instruction scheduling problems.
This paper presents a new approach to optimal instruction scheduling based on integer programming, the first approach which can solve very large scheduling problems in reasonable time. The paper is organized as follows. Section 2 describes a basic integer-programming formulation for optimal instruction scheduling, which is representative of formulations proposed in prior work. The material in Section 2 provides the reader with background on how instruction scheduling can be formulated as an integer programming problem, and provides a basis for comparing the new integer-programming formulation. Experimental results in Section 2 show that the basic formulation cannot solve large instruction scheduling problems in reasonable time, which is consistent with the results from prior work. Section 3 introduces a set of DAG transformations which can significantly simplify the DAG, while preserving the optimality of the schedule. The simplified DAG leads to simplified integer programs which are shown experimentally to solve significantly faster. Section 4 introduces a new integer-programming formulation that simplifies the integer program by reducing the number of integer-program variables, reducing the number of integer-program constraints, and simplifying the terms in some of the remaining constraints. The simplified integer programs are shown experimentally to solve dramatically faster. The last section summarizes the paper's contributions and outlines future work.
2 Basic Integer-Programming Formulation
This section describes a basic integer-programming formulation for optimal instruction scheduling, which is representative of formulations proposed in prior work. The basic formulation provides background for the DAG transformations presented in Section 3 and the new integer-programming formulation presented in Section 4.
Optimal Instruction Scheduling, Basic Formulation
In the basic formulation, the basic block is initially scheduled using critical-path list scheduling. The length U of the resulting schedule is an upper bound on the length of an optimal schedule. Next, a lower bound L on the schedule length is computed. Given the DAG's critical path c and the processor's issue rate r, a lower bound on the schedule for a basic block with n instructions is:
$$L = 1 + \max\{c,\ \lceil n/r \rceil - 1\}$$
If U = L, the schedule is optimal, and an integer program is unnecessary. If U > L, an integer program is produced (as described below) to find a length U − 1 schedule. If the integer program is infeasible, the length U schedule is optimal. Otherwise a length U − 1 schedule was found and a second integer program is produced to find a length U − 2 schedule. This cycle repeats until a minimum-length schedule is found.
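The iteration described above can be sketched in Python as follows. This is an illustrative sketch only: the callback `exists_schedule` is a hypothetical stand-in for producing and solving the integer program for a given length, and the lower bound assumes that c is the total latency along the critical path, so that a schedule needs at least c + 1 cycles and at least ⌈n/r⌉ cycles.

```python
import math

def schedule_length_lower_bound(c, n, r):
    """Assumed form of the bound: the critical path needs c + 1 cycles,
    and n instructions at issue rate r need ceil(n / r) cycles."""
    return max(c + 1, math.ceil(n / r))

def optimal_length(U, L, exists_schedule):
    """U: list-schedule length (upper bound); L: lower bound.
    exists_schedule(m) -> bool is a hypothetical stand-in for solving
    the integer program that looks for a length-m schedule."""
    m = U
    # Keep trying the next shorter length until infeasible or L is reached.
    while m > L and exists_schedule(m - 1):
        m -= 1
    return m
```

When U equals L the loop body never runs, which corresponds to the case where no integer program is needed.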
To produce an m clock-cycle schedule for an n-instruction basic block, a 0-1 integer-program scheduling variable is created for each instruction, for each clock cycle in the schedule. The scheduling variable $x_i^j$ represents the decision to schedule (1) or not schedule (0) instruction i in clock cycle j. The scheduling variables for the corresponding clock cycles are illustrated in Figure 2.
A solution for the scheduling variables must be constrained so that a valid schedule is produced. A constraint is used for each instruction i to ensure that i is scheduled at exactly one of the m cycles, a must-schedule constraint with the following form:
$$\sum_{j=1}^{m} x_i^j = 1$$
Additional constraints must ensure the schedule meets the processor's issue requirements. Consider a statically scheduled r-issue processor that allows any r instructions to issue in a given clock cycle, independent of the instruction type. For this r-issue processor, an issue constraint of the following form is used for each clock cycle j:
$$\sum_{i=1}^{n} x_i^j \le r$$
If a multiple-issue processor has issue restrictions for various types of instructions, a separate issue constraint can be used for each instruction type [14].
A set of dependency constraints is used to ensure that the data dependencies are satisfied. For each instruction i, the following expression resolves the clock cycle in which i is scheduled:
$$\sum_{j=1}^{m} j \cdot x_i^j$$
Because only one $x_i^j$ variable is set to 1 and the rest are set to 0 in an integer program solution, the expression produces the corresponding coefficient j, which is the clock cycle in which instruction i is scheduled. Using this expression, a dependency constraint of the following form is produced for each edge in the DAG to enforce the dependency of instruction i on instruction k, where the latency from k to i is the constant $L_{ki}$:
$$\sum_{j=1}^{m} j \cdot x_i^j \ \ge\ \sum_{j=1}^{m} j \cdot x_k^j + L_{ki}$$
Prior work has proposed a basic method for simplifying the integer program and hence reducing integer-program solution time. Upper and lower bounds can be determined for the cycle in which an instruction i can be scheduled, thereby reducing an instruction's scheduling range. All of i's scheduling variables for cycles outside the scheduling range set by i's upper and lower bounds are unnecessary and can be eliminated. After scheduling range reduction, if any instruction's scheduling range is empty (its upper bound is less than its lower bound), no length m schedule exists and an integer program is not needed.
Chang, Chen and King propose using the critical path distance from any leaf node (any root node) or the number of successors (predecessors) to determine an upper bound (lower bound) for each instruction [5]. For the r-issue processor defined above, a lower bound $L_i$ on i's scheduling range is set by:
$$L_i = 1 + \max\{cr_i,\ \lceil (1 + p_i)/r \rceil - 1\} \qquad (1)$$
where $cr_i$ is the critical path distance from any root node to i, and $p_i$ is the number of i's predecessors. Similarly, an upper bound $U_i$ on i's scheduling range is set by:
$$U_i = m - \max\{cl_i,\ \lceil (1 + s_i)/r \rceil - 1\}$$
where $cl_i$ is the critical path distance from i to any leaf node, and $s_i$ is the number of i's successors.
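A minimal Python sketch of static scheduling-range computation follows. It assumes that the bounds take the form L_i = 1 + max(cr_i, ⌈(1 + p_i)/r⌉ − 1) and U_i = m − max(cl_i, ⌈(1 + s_i)/r⌉ − 1), and that predecessors and successors are counted transitively. The function name and data layout are hypothetical.

```python
import math
from collections import defaultdict

def static_ranges(nodes, edges, m, r=1):
    """nodes: list in topological order; edges: dict (u, v) -> latency.
    Returns {node: (lower, upper)} for a schedule of length m."""
    succs, preds = defaultdict(list), defaultdict(list)
    for (u, v), lat in edges.items():
        succs[u].append((v, lat))
        preds[v].append((u, lat))
    cr, cl, anc, desc = {}, {}, {}, {}
    for v in nodes:  # distance from any root, and all (transitive) predecessors
        cr[v] = max((cr[u] + lat for u, lat in preds[v]), default=0)
        anc[v] = set().union(*({u} | anc[u] for u, _ in preds[v]))
    for v in reversed(nodes):  # distance to any leaf, and all successors
        cl[v] = max((cl[w] + lat for w, lat in succs[v]), default=0)
        desc[v] = set().union(*({w} | desc[w] for w, _ in succs[v]))
    ranges = {}
    for v in nodes:
        lo = 1 + max(cr[v], math.ceil((1 + len(anc[v])) / r) - 1)
        hi = m - max(cl[v], math.ceil((1 + len(desc[v])) / r) - 1)
        ranges[v] = (lo, hi)
    return ranges
```

If any returned range has a lower bound greater than its upper bound, no schedule of length m exists and an integer program for that length is not needed.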
Collectively, the reduced set of scheduling variables, the must-schedule constraints, the issue constraints, and the dependency constraints constitute a basic 0-1 integer programming formulation for finding a schedule of length m. Applied iteratively as described above, this formulation produces an optimal instruction schedule.
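To make the meaning of the three constraint families concrete, the following Python sketch checks whether a given 0-1 assignment satisfies them. It is a checker, not a solver, and the names `one_hot` and `satisfies_basic_formulation` are hypothetical. Instructions are indexed from 0 and `x[i][j]` stands for the scheduling variable of instruction i in clock cycle j + 1.

```python
def one_hot(cycles, m):
    """Build x from a list of cycles (1-based), one entry per instruction."""
    return [[1 if j + 1 == c else 0 for j in range(m)] for c in cycles]

def satisfies_basic_formulation(x, r, edges):
    """x[i][j] in {0, 1}; r: issue rate; edges: dict (k, i) -> latency L_ki."""
    n, m = len(x), len(x[0])
    # Must-schedule constraints: each instruction occupies exactly one cycle.
    if any(sum(row) != 1 for row in x):
        return False
    # Issue constraints: at most r instructions in any cycle.
    if any(sum(x[i][j] for i in range(n)) > r for j in range(m)):
        return False
    # The weighted sum resolves the cycle in which each instruction is scheduled.
    cycle = [sum((j + 1) * x[i][j] for j in range(m)) for i in range(n)]
    # Dependency constraints: one per DAG edge.
    return all(cycle[i] >= cycle[k] + lat for (k, i), lat in edges.items())
```

For example, with edges `{(0, 2): 2, (1, 2): 1}` on a single-issue processor, the assignment of cycles 1, 2, 3 is valid, while 2, 1, 3 violates the latency-2 dependency.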
Basic Formulation, Experimental Results
The basic formulation is built inside the Gnu Compiler Collection (GCC), and is compared experimentally with critical-path list scheduling. As shown in [2], optimal instruction scheduling for a single-issue processor with a two-cycle maximum latency can be done in polynomial time. However, optimal scheduling for a single-issue processor with a three-cycle maximum latency, the next harder scheduling problem, is considered intractable [2]. If this easiest hard scheduling problem cannot be solved optimally in reasonable time, there is little hope for optimally scheduling more complex processors. Thus, this paper focuses on optimal scheduling for a single-issue processor with a three-cycle maximum latency.
The SPEC95 floating point benchmarks were compiled using GCC 2.8.0 with GCC's instruction scheduler replaced by an optimal instruction scheduler using the basic formulation. The benchmarks were compiled using GCC's highest level of optimization (-O3) and were targeted to a single-issue processor with a maximum latency of three cycles. The target processor has a latency of 3 cycles for loads, 2 cycles for all floating point operations and 1 cycle for all integer operations. The SPEC95 integer benchmarks are not included in this experiment because for this processor model there would be no instructions with a 2-cycle latency, which makes the scheduling problems easier to solve.
The optimal instruction scheduler is given a 1000 second time limit to find an optimal schedule. If an optimal schedule is not found within the time limit, the best improved schedule produced using integer programming (if any) is selected, otherwise the schedule produced by list scheduling is selected. The integer programs are solved using CPLEX 6.5, a commercial integer-programming solver [12], running on an HP C3000 workstation with a 400MHz PA-8500 processor and 512MB of main memory. The experimental results for the basic formulation are shown in Table 1.
Various observations are made about these data. First, using only list scheduling most of the schedules, 6,885 (93%), are shown to be optimal because the upper bound U equals the lower bound L or because after scheduling range reduction for a schedule of length U − 1, an instruction's scheduling range is empty. This is not surprising because this group of basic blocks includes such trivial problems as basic blocks with one instruction. For the 517 non-trivial problems that require an integer program, 482 (93%) are solved optimally and 35 (7%) are not solved, using a total of 35,879 CPU seconds (10.0 hours). As a point of reference, the entire benchmark suite compiles in 708 seconds (11.8 minutes) when only list scheduling is used. Thus, the basic formulation fails in two important respects: not all basic blocks are scheduled optimally, and the scheduling time is long. Only 15 of the basic blocks (3% of the non-trivial basic blocks) have an improved schedule, and the total static cycle improvement is only 15 cycles, a modest speedup. Speedup will be much higher for a more complex processor (longer latency and wider issue). For example, results in [21] for the multiple-issue Alpha 21164 processor show that list scheduling is suboptimal for more than 50% of the basic blocks studied.
The proper conclusion to draw from the results in Table 1 is not that optimal instruction scheduling does not provide significant speedup, but that the basic integer-programming formulation cannot produce optimal schedules in reasonable time, even for the easiest of the hard scheduling problems. A much better approach to optimal instruction scheduling is needed.
3 DAG Transformations
A set of graph transformations is proposed which can simplify the DAG before the integer program is produced. These transformations are fast (low-order polynomial time in the size of the DAG) and are shown to preserve the optimality of the final schedule. The integer program produced from a transformed DAG is significantly simplified and solves much faster.
DAG Standard Form
The transformations described in the following sections are for DAGs in standard form. A DAG in standard form has a single root node and a single leaf node. A DAG with multiple leaf and root nodes is transformed into a DAG in standard form by adding an artificial leaf node and an artificial root node. The artificial root node is the immediate predecessor of all root nodes. A latency-one edge extends from the artificial root node to each root node in the DAG. Similarly, an artificial leaf node is the immediate successor of all DAG leaf nodes. A latency-one edge extends from each leaf node of the DAG to the artificial leaf node. These artificial nodes do not affect the optimal schedule length of the original DAG nodes and are removed after scheduling is complete.
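The standard-form transformation can be sketched in a few lines of Python. The node names `ROOT` and `LEAF` and the function name are hypothetical, and for simplicity this sketch always adds both artificial nodes, even when the DAG already has a single root or leaf.

```python
def to_standard_form(nodes, edges, root="ROOT", leaf="LEAF"):
    """nodes: list in topological order; edges: dict (u, v) -> latency.
    Returns (new_nodes, new_edges) with one artificial root and leaf."""
    has_pred = {v for (_, v) in edges}
    has_succ = {u for (u, _) in edges}
    new_edges = dict(edges)
    for v in nodes:
        if v not in has_pred:        # v is a root of the original DAG
            new_edges[(root, v)] = 1
        if v not in has_succ:        # v is a leaf of the original DAG
            new_edges[(v, leaf)] = 1
    return [root] + list(nodes) + [leaf], new_edges
```

The returned node list remains a valid topological order, because the artificial root precedes every node and the artificial leaf follows every node.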
3.2 DAG Partitioning
Some DAGs can be partitioned into smaller subDAGs which can be optimally scheduled individually, and the subDAG schedules can be recombined to form a schedule that is optimal for the entire DAG. Partitioning is advantageous because the integer-program solution time is super-linear in the size of the DAG, and thus the total time to solve the partitions is less than the time to solve the original problem. Partitioning is also advantageous because even though the original DAG may require an integer program to find its optimal schedule, one or more partitions may be optimal from list scheduling and an integer program is not required for those partitions.
A DAG can be partitioned at a partition node. A partition node is a node that dominates the DAG's leaf node and post-dominates its root node. A partition node forms a barrier in the schedule. No nodes above a partition node may be scheduled after the partition node, and no nodes below the partition node may be scheduled before the partition node. An efficient algorithm for finding partition nodes is described in Section 3.5.1. The algorithm's worst-case execution time is O(e), where e is the number of edges in the DAG. Figure 3a shows a DAG which can be divided into two partitions at partition node D. Nodes A, B, C, and D form one partition, and nodes D, E, F, and G form the other partition. As illustrated, the two partitions are each optimally scheduled, and the schedules are then combined to form an optimal schedule for the entire DAG.
In Figure 4, edge BE is a redundant edge which can be removed. When edge BE is removed, the DAG is reduced to the DAG shown in Figure 3, which can then be partitioned.
In Figure 5b, node E is the entry node, node I is the exit node, nodes F, G, and H are the interior nodes, and node X is external to the region.
Redundant Edge Elimination
In region linearization, list scheduling is used to produce a schedule for the instructions in each DAG region. Under certain conditions (described below), the original region sub-DAG can be replaced by a linear chain of the region's nodes in the order determined by list scheduling, while preserving the optimality of the final overall schedule. This can significantly simplify the DAG, and hence the integer program. This paper considers region linearization for a single-issue processor. Region linearization for multiple-issue processors is also possible and will be considered in a future paper.
The first interior node in order O has an incoming edge in the DAG that is one of the minimum-latency edges outgoing from the entry node to the region's interior.
The last interior node in order O has an outgoing edge in the DAG that is one of the minimum-latency edges incoming to the exit node from the region's interior.
Proof: Assume an optimal schedule where the SESE re-
nodes in the transformed region are separated by latency-one edges. The entry node is separated from the first interior node and the last interior node is separated from the exit node by the same edges that occur between these nodes in the original DAG. For the DAG in Figure 6a, the node order A, B, C, D, E, F is a dominant order. Figure 6b shows the region after the linearization transformation.
Non-SESE regions
Regions which are not SESE regions contain side-exit nodes and side-entry nodes. A side-exit node is an interior node with an immediate successor that is not a region node. A side-entry node is an interior node with an immediate predecessor that is not a region node. Figures 5b and 5c are examples of non-SESE regions. In Figure 5b, node G is a side-entry node. In Figure 5c, node K is a side-exit node.
The conditions for linearizing non-SESE regions are more restrictive than those for SESE regions because of the additional dependencies to and from nodes outside the region.
Therefore, following the proof of Theorem 1, the dependencies between interior nodes are satisfied. Similarly, the dependencies between nodes outside the region and the region entry and exit nodes are satisfied. Only the predecessors of a side-exit node in the DAG precede the side-exit node in order O. Therefore, the minimum number of nodes precede each side-exit node in order O. If the region nodes are removed from the schedule and reordered in order O, a side-exit node cannot be placed in a vacancy later than the original location of the side-exit node in the schedule. Therefore, the dependencies from the side-exit node to nodes outside the region are satisfied. A symmetric argument proves that
Figure 6d shows the region after the linearization transformation.
Efficient Algorithms for DAG Transformations
This section describes a set of efficient algorithms for performing the DAG transformations described in Section 3.2 through Section 3.4.
DAG Partitioning Algorithm
An algorithm is proposed for finding the partition nodes of a DAG. If a DAG is drawn as a sequence of nodes in a topological sort [7], then all edges from nodes preceding a partition node terminate at or before the partition node. The original instruction order of the DAG nodes before instruction scheduling can be used as the topological sort for the algorithm. Each DAG edge and node is considered only once in the execution of the algorithm. Therefore, the algorithm runs in O(n + e) time, where n is the number of nodes and e is the number of edges. Table 2 illustrates the execution of the algorithm on the example DAG in Figure 7. The column 'current latest' indicates the value of latest at the start of the iteration. The column 'new latest' indicates the value of latest at the end of the iteration.
Table 2: Execution of the partitioning algorithm on the DAG in Figure 7. Columns: iteration i, $N_i$, current latest, new latest, partition node?
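The partition-node search can be sketched in Python as below. This is a reconstruction under stated assumptions rather than the authors' listing: nodes are numbered 0 to n − 1 in topological order, the DAG is in standard form, and `latest` tracks the furthest node reached by any edge leaving the nodes visited so far. A node is reported as a partition node when no edge from an earlier node passes beyond it.

```python
def partition_nodes(n, edges):
    """Nodes are 0..n-1 in topological order; edges: iterable of (u, v), u < v.
    Returns the indices of the partition nodes."""
    furthest = list(range(n))  # furthest[u]: latest node reached by an edge from u
    for u, v in edges:
        furthest[u] = max(furthest[u], v)
    result, latest = [], 0
    for i in range(n):
        # 'latest' here is the value at the start of iteration i.
        if latest <= i:
            result.append(i)   # every edge from an earlier node ends at or before i
        latest = max(latest, furthest[i])
    return result
```

In a hypothetical seven-node DAG with edges 0→1, 0→2, 1→3, 2→3, 3→4, 3→5, 4→6 and 5→6, node 3 is found as an interior partition node. Adding an edge 1→4 that skips over node 3 hides that partition node until the edge is removed.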
An Algorithm for Finding Redundant Edges
An efficient algorithm is proposed for finding the redundant edges of a partition. The algorithm iterates through all nodes in the partition except the last node. At each iteration, each edge extending from the current node is compared
For each iteration i of the outer loop, every edge terminating at a node preceding $N_i$ in order O is considered O(1) times. Therefore, the worst case execution time of the algorithm is $O(n_P e_P)$.
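As an illustration only, the following Python sketch finds redundant edges under an assumed definition: an edge from u to v is treated as redundant when some other path from u to v has a total latency at least as large as the edge's latency, so that the edge adds no scheduling constraint. Both this definition and the longest-path approach used here are assumptions made for the sketch; it is not the algorithm analyzed above.

```python
from collections import defaultdict

def redundant_edges(nodes, edges):
    """nodes: list in topological order; edges: dict (u, v) -> latency.
    Returns the edges dominated by an alternative path (assumed definition)."""
    succs = defaultdict(list)
    for (u, v), lat in edges.items():
        succs[u].append((v, lat))
    # longest[u][v]: maximum total latency over all paths from u to v.
    longest = {}
    for u in reversed(nodes):
        best = {u: 0}
        for w, lat in succs[u]:
            for v, d in longest[w].items():
                best[v] = max(best.get(v, lat + d), lat + d)
        longest[u] = best
    redundant = []
    for (u, v), lat in edges.items():
        # Alternative paths leave u through some other successor w and reach v.
        alt = [l + longest[w][v] for w, l in succs[u] if w != v and v in longest[w]]
        if alt and max(alt) >= lat:
            redundant.append((u, v))
    return redundant
```

For example, with edges A→B (latency 1), B→C (latency 2) and A→C (latency 2), the edge A→C is reported because the path through B already enforces a separation of 3 cycles.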
An Algorithm for Region Linearization
An algorithm for region linearization is proposed which uses critical-path list scheduling to find a schedule S for a region R. If node order O of S is a dominant order for the region R as defined in Section 3.4, then O is enforced by linearizing the region nodes in the DAG.
The region finding algorithm described in Section 3.5.3 finds the entry node and exit node for each region of a partition. The interior nodes of the region can be determined by performing a forward depth-first search from the entry node and a reverse depth-first search from the exit node. Nodes which are touched in both searches are interior nodes of the region. The searches can also identify side-entry nodes and side-exit nodes.
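The two searches can be sketched in Python as follows. The adjacency-dictionary layout and the function name are hypothetical, and side-entry and side-exit nodes are identified here by a separate pass over the interior nodes rather than during the searches themselves.

```python
def region_interior(entry, exit_, succs, preds):
    """succs/preds: dict node -> list of immediate successors/predecessors.
    Returns (interior, side_entries, side_exits) as sets."""
    def reach(start, adj):
        # Iterative depth-first search; 'start' itself is not included.
        seen, stack = set(), [start]
        while stack:
            u = stack.pop()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    forward = reach(entry, succs)    # forward search from the entry node
    backward = reach(exit_, preds)   # reverse search from the exit node
    interior = (forward & backward) - {entry, exit_}
    region = interior | {entry, exit_}
    side_exits = {v for v in interior
                  if any(w not in region for w in succs.get(v, ()))}
    side_entries = {v for v in interior
                    if any(u not in region for u in preds.get(v, ()))}
    return interior, side_entries, side_exits
```

With hypothetical edges E→F, E→G, E→H, F→I, G→I, H→I and X→G, chosen to match the node roles described for Figure 5b, the interior nodes are F, G and H, and G is the only side-entry node.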
By Theorem 2, in scheduling a non-SESE region R, no interior node may be scheduled before a side-exit node which is not a predecessor of the side-exit node in the DAG. Symmetrically, no interior node may be scheduled after a side-entry node which is not a successor of the side-entry node in the DAG. To enforce these constraints during region scheduling, temporary edges are added to the region subDAG. Latency-one edges are added from each side-exit node to each interior node which is not a predecessor of the side-exit node. Similarly, latency-one edges are added to each side-entry node from every interior node which is not a successor of the side-entry node. If there exists a cycle in the dependency graph after this transformation, no dominant order for the region is found. The temporary dependencies are removed after scheduling the region.
After adding the temporary dependencies, the region nodes are scheduled using critical-path list scheduling. 
DAG Transformations, Experimental Results
The experiment described in Section 2.2 was repeated with GCC's instruction scheduler replaced by an optimal instruction scheduler that includes the DAG transformations and the basic integer-program formulation. As before, each basic block is first scheduled using list scheduling and the basic blocks with schedules that are shown to be optimal are removed from further consideration. For this experiment, the DAG transformations are then applied to the remaining basic blocks, and the basic blocks are scheduled again using list scheduling. The resulting schedules are checked for optimality and those basic blocks without optimal schedules are then solved using integer programming. The results from this experiment are shown in Table 3.
Various observations can be made by comparing the data in Table 3 with the data obtained using only the basic formulation in Table 1. The DAG transformations reduce the total number of integer programs by 28%, to 374 from 517. The DAG transformations reduce the number of integer programs that time out by 83%, to 6 from 35. Total optimal scheduling time using the DAG transformations is reduced by about five-fold, to 7743 seconds from 35,879 seconds. The number of basic blocks that have an improved schedule using the DAG transformations increases by 60%, to 24 from 15. There is a 1 cycle average schedule improvement for the 15 blocks which improved using the basic formulation alone. The additional 9 basic blocks which improved using the DAG transformations have an average improvement of 1.8 cycles. This suggests that the DAG transformations help solve the more valuable problems with larger cycle improvements. Overall these data show that graph transformation is an important technology for producing optimal instruction schedules in reasonable time. However, additional new technology is clearly needed.
Advanced Integer-Programming Formulation
This section describes an advanced formulation for optimal instruction scheduling which dramatically decreases integer-program solution time compared with the basic formulation.
Advanced Scheduling-Range Reduction
As described in Section 2.1, the basic formulation uses a technique that can reduce an instruction's scheduling range, and hence the number of scheduling variables. Significant additional scheduling-range reductions are possible. The basic formulation uses static range reduction based on critical-path distance or the number of successors and predecessors. This technique is termed static range reduction because it is based on static DAG properties. An additional criterion for static range reduction is proposed which uses the number of predecessors (or successors) and the minimum latency from (or to) any immediate predecessor (or successor). For the r-issue processor defined earlier, if instruction i has $p_i$ predecessors, the predecessors will occupy at least $\lfloor p_i/r \rfloor$ cycles. If the minimum latency from a predecessor of i to i is $pred\_minl_i$, i can be scheduled no sooner than cycle $\lfloor p_i/r \rfloor + pred\_minl_i$.
A new range reduction technique, iterative range reduction, is proposed. Iterative range reduction uses initial logical implications (described below) to reduce the scheduling range of one or more instructions. This range reduction may in turn allow the ranges of predecessors or successors to be tightened. An instruction i's lower bound can be tightened by selecting the maximum of:
The static lower bound $L_i$, or
For each immediate predecessor j, j's lower bound plus the latency of edge ji.
The second criterion allows i's lower bound to be tightened iteratively as the lower bounds of i's immediate predecessors are tightened iteratively. Similarly, instruction i's upper bound can be tightened by selecting the minimum of:
The static upper bound $U_i$, or
For each immediate successor j, j's upper bound minus the latency of edge ij.
Following an initial logical implication, the predecessor and successor range reductions may iteratively propagate through the DAG and may lead to additional logical implications that can reduce scheduling ranges. These new logical implications may in turn allow additional predecessor and successor range reductions. This process can iterate until no further reductions are possible.
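The predecessor and successor tightening rules can be sketched in Python as follows. The sketch covers only the bound propagation, not the logical implications that trigger it, and the function name and dictionary layout are hypothetical. Because the nodes are visited in topological order, one forward pass and one backward pass are enough to propagate the bounds fully through the DAG.

```python
def tighten_ranges(nodes, edges, lo, hi):
    """nodes: list in topological order; edges: dict (u, v) -> latency.
    lo/hi: dicts of current lower/upper bounds. Returns tightened copies."""
    lo, hi = dict(lo), dict(hi)
    for v in nodes:
        # Lower bound: at least each immediate predecessor's bound plus latency.
        for (u, w), lat in edges.items():
            if w == v:
                lo[v] = max(lo[v], lo[u] + lat)
    for v in reversed(nodes):
        # Upper bound: at most each immediate successor's bound minus latency.
        for (u, w), lat in edges.items():
            if u == v:
                hi[v] = min(hi[v], hi[w] - lat)
    return lo, hi
```

For a chain A→B (latency 2) and B→C (latency 1), raising A's lower bound to 2 and lowering C's upper bound to 5 pins all three nodes to one-cycle ranges: A in cycle 2, B in cycle 4 and C in cycle 5.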
For the r-issue processor defined earlier, a logical implication can be made for instructions that have a one-cycle scheduling range. If an instruction i has a one-cycle scheduling range that spans cycle C, then on a single-issue processor no other instruction j can be scheduled at cycle C, so cycle C can be removed from j's scheduling range. Based on j's range reduction, the ranges of j's predecessors and successors may be reduced. This may lead to additional instructions with one-cycle ranges, and the process may iterate for further range reductions.
Iterative range reduction can lead to scheduling ranges that are infeasible, which implies that no length m schedule exists. Two infeasibility tests are:
The scheduling range of any node is empty because its upper bound is less than its lower bound.
For any k-cycle range, more than rk instructions have scheduling ranges that are completely contained within the k cycles, i.e., the scheduling ranges violate the pigeonhole principle [10].

Figure 8 illustrates iterative range reduction using the one-cycle logical implication for a single-issue processor. Figure 8a shows each node labeled with the lower and upper bounds that are computed using static range reduction for a schedule of length six (footnote 2). Nodes A, C, E and F have one-cycle ranges, so the corresponding cycles (1, 2, 5 and 6) are removed from the scheduling ranges of all other nodes, resulting in the scheduling ranges shown in Figure 8b. Predecessor and successor ranges can then be tightened using the iterative range reduction criterion, producing the scheduling ranges shown in Figure 8c. The resulting scheduling ranges are infeasible because node B's scheduling range is empty.
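The two infeasibility tests can be sketched as follows. This is a minimal illustration: the function name `is_infeasible`, the `issue_rate` parameter (standing for r), and the brute-force scan over every window of cycles are assumptions chosen for clarity rather than efficiency.

```python
def is_infeasible(lower, upper, issue_rate=1):
    """Return True if the scheduling ranges fail either infeasibility test.

    lower, upper: dicts mapping node -> first / last allowed cycle.
    issue_rate: number of instructions the processor can issue per cycle.
    """
    nodes = list(lower)
    if not nodes:
        return False
    # Test 1: some node has an empty range.
    if any(upper[n] < lower[n] for n in nodes):
        return True
    # Test 2 (pigeonhole): some window of k cycles fully contains the
    # ranges of more than issue_rate * k nodes.
    first, last = min(lower.values()), max(upper.values())
    for start in range(first, last + 1):
        for end in range(start, last + 1):
            k = end - start + 1
            contained = sum(
                1 for n in nodes if lower[n] >= start and upper[n] <= end
            )
            if contained > issue_rate * k:
                return True
    return False


# Three instructions confined to cycles 2..3.
lower = {"X": 2, "Y": 2, "Z": 2}
upper = {"X": 3, "Y": 3, "Z": 3}
print(is_infeasible(lower, upper, issue_rate=1))  # three nodes, two slots
print(is_infeasible(lower, upper, issue_rate=2))  # three nodes, four slots
```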
A second logical implication based on probing is used to reduce scheduling range. Probing is a general approach that can be used for preprocessing any 0-1 integer program to improve its solution time [20]. Probing selects a 0-1 integer program variable and attempts to show that the variable cannot be 1 by assuming the variable's value is 1 and then showing that the resulting problem is infeasible. If the problem is infeasible, by contradiction, the variable's value must be 0 and the variable can be eliminated from the problem. General-purpose probing is used in commercial solvers, but has a very high computation cost [12].

Footnote 2: List scheduling produces a length-7 schedule for this DAG, so an integer program is produced to find the next shorter schedule, as described in Section 2.1.
A specific probing technique called instruction probing is proposed for reducing instruction scheduling ranges. Instruction probing is computationally efficient because it exploits knowledge of the instruction scheduling problem and exploits the DAG's structure. Instruction probing is done for each instruction i, starting with i's lower bound. A lower-bound probe consists of temporarily setting i's upper bound equal to the current lower bound. This has the effect of temporarily scheduling i at the current lower bound. Based on i's reduced scheduling range, the ranges of i's predecessors are temporarily tightened throughout the DAG. If the resulting scheduling ranges are infeasible, the probe is successful and i's lower bound is permanently increased by 1. Based on i's new lower bound, the ranges of i's successors are permanently tightened throughout the DAG. If the resulting scheduling ranges are infeasible, the overall scheduling problem is infeasible. Otherwise, the new lower bound is probed and the process repeats. If a lower-bound probe is unsuccessful, i's lower-bound probing is complete. i's upper bound is then probed in a symmetric manner.

Figure 9 illustrates the use of instruction probing for a single-issue processor. Figure 9a shows the scheduling ranges that are produced using static range reduction for a schedule of length 8. Figure 9b shows the temporary scheduling ranges that result from probing node B's lower bound. Based on the one-cycle logical implication, cycle 2 is removed from node C's range. Node C's increased lower bound in turn causes the lower bounds for nodes E, G and H to be temporarily tightened. Because node E has a one-cycle range (cycle 5), cycle 5 is removed from node D's range, which in turn causes node F's lower bound to be tightened. Nodes F and G must be scheduled at the same cycle, which is infeasible. Thus, node B cannot be scheduled at cycle 2 and cycle 2 is permanently removed from B's scheduling range.
The consequence of B's permanent range reduction is shown in Figure 9c. Based on B's tightened lower bound, the lower bounds of nodes D, F and H are permanently tightened. Based on node B's one-cycle range, cycle 3 is removed from node C's range. Due to node D's one-cycle range, cycle 6 is removed from node G's range. The resulting ranges are infeasible because nodes F and G must be scheduled at the same cycle, thus no 8-cycle schedule exists for the DAG.
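Lower-bound probing can be sketched in a few lines of Python. This is an illustrative simplification, not the paper's implementation: the names `propagate` and `probe_lower` are hypothetical, the processor is assumed to be single-issue, ranges are kept as plain intervals (so the one-cycle implication only trims a claimed cycle from the ends of other ranges), and the only infeasibility test applied is the empty-range test. The example DAG is made up and is not the DAG of Figure 9.

```python
def propagate(lower, upper, edges):
    """Tighten ranges in place; return False if some range becomes empty."""
    changed = True
    while changed:
        changed = False
        # Predecessor/successor tightening along each DAG edge.
        for (pred, succ), latency in edges.items():
            if lower[pred] + latency > lower[succ]:
                lower[succ] = lower[pred] + latency
                changed = True
            if upper[succ] - latency < upper[pred]:
                upper[pred] = upper[succ] - latency
                changed = True
        # One-cycle implication (single issue): a node pinned to one cycle
        # excludes that cycle from the ends of every other range.
        for pinned in lower:
            if lower[pinned] != upper[pinned]:
                continue
            cycle = lower[pinned]
            for other in lower:
                if other == pinned:
                    continue
                if lower[other] == cycle:
                    lower[other] += 1
                    changed = True
                if upper[other] == cycle:
                    upper[other] -= 1
                    changed = True
        if any(lower[n] > upper[n] for n in lower):
            return False
    return True


def probe_lower(node, lower, upper, edges):
    """Probe node's lower bound; return False if the problem is infeasible."""
    while lower[node] <= upper[node]:
        trial_lower, trial_upper = dict(lower), dict(upper)
        trial_upper[node] = trial_lower[node]  # temporarily pin the node
        if propagate(trial_lower, trial_upper, edges):
            break  # probe unsuccessful: this cycle is not ruled out
        lower[node] += 1  # probe successful: cycle removed permanently
        if not propagate(lower, upper, edges):
            return False
    return lower[node] <= upper[node]


# A and B both feed C with latency 2; schedule length 4, single issue.
edges = {("A", "C"): 2, ("B", "C"): 2}
lower = {"A": 1, "B": 1, "C": 3}
upper = {"A": 2, "B": 2, "C": 4}
feasible = probe_lower("C", lower, upper, edges)
print(feasible, lower, upper)
```

Pinning C at cycle 3 forces both A and B into cycle 1, which a single-issue processor cannot satisfy, so the probe removes cycle 3 from C's range. Upper-bound probing would mirror this function.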
Optimal Region Scheduling
For a region as defined in Section 3, the critical-path distance between the region's entry and exit nodes may not be sufficiently tight. This can lead to excessive integer-program solution time because the integer program produced from the DAG is under-constrained. A region is loose if the critical-path distance from the region's entry node to its exit node is less than the distance from the entry node to the exit node in an optimal schedule of the region. Clearly in any valid overall schedule, a region's entry and exit nodes can be no closer than the distance between these nodes in an optimal schedule of the region. A loose region can be tightened by computing an optimal schedule for the region to determine the minimum distance between the region's entry and exit nodes. A pseudo edge can then be added to the DAG from the entry node to the exit node, with a latency that equals the distance between these nodes in the optimal region schedule. Pseudo edges may allow the scheduling ranges of the entry and exit nodes to be reduced. These range reductions may then iteratively propagate through the DAG as described in the previous subsection and may result in scheduling ranges that are infeasible for scheduling the basic block in m cycles. Even if the overall scheduling problem is not shown to be infeasible, the problem will solve faster because of the reduced number of scheduling variables.

The application of region scheduling is illustrated in Figure 10. Figure 10a shows the scheduling ranges for a schedule of length 12. The region AF is an instance of the region in Figure 8, which has an optimal schedule of length 7. Thus, after region scheduling is applied to region AF, a latency-6 pseudo edge is added from node A to node F, as shown in Figure 10b. The pseudo edge causes the lower bounds for nodes F, G, H, I, J and K to be tightened to the ranges shown in Figure 10b. Based on node H's one-cycle range, cycle 8 is removed from node G's range.
The increase in node G's lower bound causes node I's lower bound to increase. The resulting ranges are shown in Figure 10c . These ranges are infeasible because nodes I and J must be scheduled at the same cycle. Thus, the overall schedule is shown to be infeasible based on the optimal schedule of only one inner region.
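The effect of a pseudo edge on lower bounds can be sketched as below. The helper names `add_pseudo_edge` and `lower_bounds`, the example DAG, and the assumption that the region's optimal schedule length has already been computed elsewhere are all hypothetical. The pseudo-edge latency is taken as the optimal length minus one, matching the length-7 region and latency-6 edge in the Figure 10 discussion.

```python
def add_pseudo_edge(edges, entry, exit_node, optimal_region_length):
    """Return a copy of edges with an entry -> exit pseudo edge added."""
    tightened = dict(edges)
    # Entry issues in the region's first cycle, exit in its last cycle.
    latency = optimal_region_length - 1
    key = (entry, exit_node)
    tightened[key] = max(tightened.get(key, 0), latency)
    return tightened


def lower_bounds(topological_order, edges, first_cycle=1):
    """Critical-path lower bound for each node, visited in topological order."""
    lower = {n: first_cycle for n in topological_order}
    for node in topological_order:
        for (pred, succ), latency in edges.items():
            if pred == node:
                lower[succ] = max(lower[succ], lower[pred] + latency)
    return lower


# Region A..F: A feeds B, C, D, E, which all feed F; G follows F.
order = ["A", "B", "C", "D", "E", "F", "G"]
edges = {("A", m): 1 for m in "BCDE"}
edges.update({(m, "F"): 1 for m in "BCDE"})
edges[("F", "G")] = 1

before = lower_bounds(order, edges)
# On a single-issue processor the six region nodes need six cycles.
after = lower_bounds(order, add_pseudo_edge(edges, "A", "F", 6))
print(before["F"], before["G"], after["F"], after["G"])
```

The critical path places F only two cycles after A, while any single-issue schedule of the region needs five, so this region is loose in the sense defined above.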
Optimal region schedules are computed starting with the smallest inner regions, progressing outward to the largest outer region, the entire DAG partition. Optimal schedules for the inner regions can usually be found quickly using the infeasibility tests. If an optimal region schedule requires an integer program and the inner regions have pseudo edges, a dependency constraint is produced for each pseudo edge, in the manner described in Section 2.1. As the optimal region scheduling process moves outward, the additional pseudo edges progressively constrain the scheduling of the large outer regions, allowing the optimal scheduling of the outer regions to be solved quickly, usually using only the infeasibility tests.

For some integer programming applications, the number of nodes in the branch-and-bound tree that must be solved to produce an optimal integer solution can be dramatically reduced by adaptively adding application-specific constraints, or cuts, to the LP subproblem at each node in the branch-and-bound tree. The cuts are designed to eliminate areas of the solution space in the LP subproblem which contain no integer solutions. This enhancement to the branch-and-bound method is called branch-and-cut [24]. Two types of cuts are proposed for solving instruction scheduling integer programs: dependency cuts and spreading cuts.
Dependency Cuts
In an LP solution, an instruction may be fractionally scheduled over multiple cycles. For an instruction k that is data dependent on instruction j, fractions of j can be scheduled after all cycles in which k is scheduled, without violating the corresponding dependency constraint. This is illustrated in Table 4, which shows a partial LP solution for a scheduling problem that includes instructions j and k, where k is dependent on j with latency 1. The solution satisfies the dependency constraint between j and k because j is scheduled at cycle 1 × 0.5 + 5 × 0.5 = 3, while k is scheduled at cycle 4. However, a fraction of j is scheduled after all fractions of k. This invalid solution can be eliminated in the subsequent LP subproblems by adding the following dependency cut for cycle c, the last cycle in which j can be scheduled given the position of the last fraction of k in the current LP solution:
where LB(j) and LB(k) are the scheduling range lower bounds for j and k, respectively, and l_jk is the latency of the dependency between j and k. For example, the following dependency cut can be added for cycle 3 for the solution in Table 4 to prevent j from being fractionally scheduled after a fraction of k in cycle 4 in subsequent LP subproblems:

Table 4: Example LP solution to illustrate a dependency cut.
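The fractional solution described for Table 4 can be checked numerically. The sketch below is hedged: the weighted-cycle form of the dependency constraint follows the arithmetic given in the text, but the cumulative form used for the cut (the fraction of j issued by cycle c must be at least the fraction of k issued by cycle c + l_jk) is an assumption of this example, not a formula taken from the paper. The function names are hypothetical.

```python
def weighted_cycle(x):
    """Cycle at which a (possibly fractional) assignment places an instruction."""
    return sum(cycle * fraction for cycle, fraction in x.items())


def dependency_constraint_holds(x_j, x_k, latency):
    """Basic dependency constraint on the weighted issue cycles."""
    return weighted_cycle(x_j) + latency <= weighted_cycle(x_k)


def assumed_dependency_cut_holds(x_j, x_k, latency, c):
    """Assumed cut: the amount of j issued by cycle c covers the amount of
    k issued by cycle c + latency."""
    j_by_c = sum(f for cycle, f in x_j.items() if cycle <= c)
    k_by_c = sum(f for cycle, f in x_k.items() if cycle <= c + latency)
    return j_by_c >= k_by_c


# Fractional LP solution from the text: half of j at cycle 1, half at
# cycle 5, all of k at cycle 4, latency 1.
x_j = {1: 0.5, 5: 0.5}
x_k = {4: 1.0}
print(dependency_constraint_holds(x_j, x_k, 1))        # constraint satisfied
print(assumed_dependency_cut_holds(x_j, x_k, 1, c=3))  # cut violated

# An integer schedule (j at cycle 3, k at cycle 4) passes both checks.
print(assumed_dependency_cut_holds({3: 1.0}, {4: 1.0}, 1, c=3))
```

Under this assumed form, the cut removes the fractional solution while leaving every integer schedule that respects the latency untouched.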
Spreading Cuts
In an LP solution, an instruction k may be fractionally scheduled closer to k's immediate predecessors than the latencies allow in an integer solution, but still satisfy the corresponding dependency constraints. This is illustrated in Table 5, which shows a partial LP solution for a scheduling problem which includes instructions i, j, and k. Instruction k is dependent on both i and j with latency 2. For example, the following spreading cut can be added for cycle 3 for the solution in Table 5 to force the fractions of i and j to be spread further apart from a fraction of k in cycle 3 in subsequent LP subproblems:

Table 5: Example LP solution to illustrate a spreading cut.
Symmetric spreading cuts can be used to prevent instructions from being fractionally scheduled closer to their immediate successors than the latencies allow in an integer solution.
Redundant Constraints
Some constraints in the integer program may be redundant and can be removed. This simplifies the integer program and reduces its solution time.
If an instruction i has a one-cycle scheduling range at cycle C, the must-schedule constraint for instruction i can be eliminated, and the issue constraint for cycle C can be eliminated for a single-issue processor.
The integer program includes a dependency constraint for each DAG edge. The dependency constraint ensures that each instruction is scheduled sufficiently earlier than any of its dependent successors. However, if the scheduling ranges of dependent instructions are spaced far enough apart, the data dependency between the two instructions will necessarily be satisfied in the schedule produced by the integer program. In this case, the dependency constraint for the corresponding edge can be removed from the integer program. More precisely, if an instruction k is dependent on instruction j with latency L, then the j to k dependency constraint is redundant if:
L + (upper bound of j) ≤ (lower bound of k)

For the example DAG in Figure 11, the dependency constraint for edge CD is redundant because the upper bound for node C is cycle 3, the lower bound for node D is cycle 5 and the latency of edge CD is 1.
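The redundancy condition is a one-line comparison per edge. The sketch below assumes a dictionary-based DAG representation; the function names are hypothetical. The non-strict inequality is inferred from the meaning of the dependency constraint (k must issue at least L cycles after j) and agrees with the edge CD example.

```python
def dependency_is_redundant(upper_j, lower_k, latency):
    """True if the ranges alone guarantee k issues >= latency cycles after j."""
    return latency + upper_j <= lower_k


def needed_dependency_edges(lower, upper, edges):
    """Keep only the edges whose dependency constraint must be emitted."""
    return {
        (j, k): latency
        for (j, k), latency in edges.items()
        if not dependency_is_redundant(upper[j], lower[k], latency)
    }


# Edge CD from the text: upper bound of C is 3, lower bound of D is 5,
# latency 1, so 1 + 3 <= 5 and the constraint can be dropped.
print(dependency_is_redundant(upper_j=3, lower_k=5, latency=1))
```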
Algebraic Simplification
An algebraic simplification reduces the number of dependency constraint terms and reduces the size of the coefficients of the remaining terms. The basic dependency constraint for edge jk has one term for each cycle in j's range and one term for each cycle in k's range. The dependency constraint is simplified to include terms only for cycles in which j's and k's scheduling ranges overlap. The simplified constraints allow the integer program to solve faster. The maximum sized coefficient has been reduced to 2 from 74.
Advanced IP Formulation, Experimental Results
The experiment described in Section 2.2 is repeated with GCC's instruction scheduler replaced by an optimal instruction scheduler that includes the DAG transformations and the advanced integer-program formulation. The experimental results in Table 6 show the dramatic improvement provided by the new optimal scheduler. All basic blocks are scheduled optimally and the scheduling time is very reasonable. The graph transformations, advanced range reduction and region scheduling techniques reduce the number of basic blocks that require an integer program to 22, down from 517 using the basic formulation. These 22 most difficult problems require a total solver time of only 45 seconds, an average of only 2 seconds each. The total increase in compilation time is only 98 seconds, a 14% increase in total compilation time, which includes the time for DAG transformations, advanced range reduction and region scheduling. Total scheduling time is reduced by more than 300-fold compared with the basic formulation. The improvement in code quality is more than 4 times that of the basic formulation, 66 static cycles compared with 15 cycles. The additional 6 basic blocks that are solved optimally using the advanced formulation all have improved schedules, with an average improvement of 5.8 cycles. This suggests that the hardest problems to solve are those which provide the most performance improvement.

Table 6: Experimental results using DAG transformations and the advanced integer programming formulation.
The new approach can optimally schedule very large basic blocks. The scattergram in Figure 12 shows a dot for each of the 517 basic blocks that are processed by the graph transformations and the advanced integer programming formulation. The axes indicate the block's size and the time to optimally schedule that block. This figure shows that many very large blocks, as large as 1000 instructions, are optimally scheduled in a short time.
Summary
This paper presents a new approach to optimal instruction scheduling which is fast. The approach quickly identifies most basic blocks for which list scheduling produces an optimal schedule, without using integer programming. For the remaining blocks, a simplified DAG and an advanced integer-programming formulation lead to optimal scheduling times which average a few seconds per block, even for a benchmark set with some very large basic blocks. These results show that the easiest of the hard instruction scheduling problems can be solved in reasonable time. This is an important first step toward solving harder instruction scheduling problems in reasonable time, including scheduling very large basic blocks for long-latency multiple-issue processors. The proposed approach will also serve as a solid base for future work on optimal formulations of combined local instruction scheduling and local register allocation that can be solved in reasonable time for large basic blocks.
