Introduction
The IA-64 architecture relies on the extraction of instruction-level parallelism (ILP) in software. One philosophy of the architecture is to enable the compiler to expose the parallelism in programs, thereby simplifying the hardware implementations. The architecture provides many features that the compiler can employ in accomplishing this task. The IA-64 architecture [1] provides support for control and data speculation, allowing the compiler to reduce the impact of memory latency by breaking control and memory dependence barriers. Full predication support is also available that allows removal of branches to transform a control dependence to a data dependence on the compare controlling the branch. In effect, predication enables elimination of branches and their associated mispredictions. Such architectural support comes with the challenge for the compiler to use these features judiciously. Instruction scheduling plays a major role in the usage of such features to extract ILP.
Several instruction scheduling techniques have been described in the literature to perform scheduling across basic block boundaries. Trace scheduling [8] [9], superblock and hyperblock scheduling [11] [13] operate on regions known as traces, which consist of a contiguous set of basic blocks. In contrast, treegion scheduling [10] uses a decision-tree subgraph of a program's control flow graph as a scheduling region. Ebcioglu's VLIW scheduling [14] arranges instructions in the form of a decision tree. Bernstein & Rodeh describe global instruction scheduling [3] using a program dependence graph (PDG) [7] as a framework. In most of these techniques, control-flow transformations are used to simplify regions with side entries or a merge in control flow. Hyperblock scheduling uses predication to eliminate control flow within the region. Tail duplication, which increases code size, is used to eliminate side entries into the region. Trace scheduling allows side entries into the scheduling region but only allows straight-line code within the region. The global code scheduler (GCS) in Intel's reference compiler for the IA-64 architecture allows arbitrary acyclic control flow within the scheduling scope, referred to as a scheduling region. It also enables code scheduling across inner loops by abstracting them away through nesting. There is no restriction placed on the number of entries into or exits from the scheduling region. In order to enable effective and simplified data flow analysis in the presence of control flow, a path based dependence representation has been developed that combines data dependence and control flow within a region. This representation is described in Section 3.
Most scheduling techniques find it difficult to make good decisions on the generation and scheduling of compensation code. This problem is addressed by GCS using wavefront scheduling and deferred compensation. The global code scheduler schedules along all the paths in a region simultaneously. The wavefront is a set of blocks that represents a strongly independent cut set of the region. Instructions are only scheduled into blocks on the wavefront. The wavefront can be thought of as the boundary between scheduled and yet to be scheduled code in the scheduling region. Section 4 describes this technique.
Control flow in program code can make the task of code motion difficult and complicated. Trace scheduling requires bookkeeping for valid code motion, while superblock scheduling uses tail duplication of entire blocks to avoid such bookkeeping. Tail duplication of blocks, especially in the absence of profile information, is not always practical due to the code expansion and harmful cache and TLB effects it causes. In GCS, tail duplication is done at the instruction level and is referred to as P-ready code motion. An instruction is duplicated based on a cost and profitability analysis. Section 6 details this approach. Aggressive movement of code results in groups of branches at the tail end of the scheduling region. Using the support for multiway branches in the IA-64 architecture, these branches can be scheduled to execute together. To do this effectively GCS keeps track of block layout decisions and makes the appropriate changes to the block order in the process of scheduling the region as described in Section 7 .
Scheduling Regions
The global code scheduler is capable of scheduling code in any subgraph of the control flow graph provided the subgraph is acyclic. Though GCS allows for regions with multiple entries and exits, as a result of our region picking heuristics, the regions most commonly encountered have a single entry. In order to permit code motion across regions (such as inner loops) GCS maintains a region hierarchy, and the region of code being scheduled may contain nested regions representing already scheduled code. Figure 1 is an example of a scheduling region formed by GCS. As shown in Figure 1 (b) before scheduling begins all JS edges [2] are removed from the scheduling region. A JS edge is a control flow graph edge that emanates from a split node, i.e. a node with multiple successors, and ends at a join node, i.e. a node with multiple predecessors. JS edges are removed by adding an empty block (referred to as a JS block) between the split node and the join node. The removal of JS edges is required to enable the wavefront scheduling technique described later in this paper. It is not always simple to remove a JS edge while keeping the control flow graph functionally correct. One example is a join node with a label associated with it, which has multiple predecessors that branch to it indirectly. In case of such instances, GCS creates a fictitious JS block that is put in at the beginning of the scheduling process, and deleted at the end. The scheduler is constrained from inserting any code into this block, thereby guaranteeing that it can be removed at the end of the scheduling process. The scheduling algorithm may require blocks at certain region exits and entries to schedule compensation code in. Blocks called interface blocks are provided for this purpose at side entries and side exits. A side entry is a node in the region that has at least one predecessor that lies within, and at least one predecessor that lies outside the scheduling region. 
Similarly, a side exit is a node in the region that has at least one successor that lies within, and at least one successor that lies outside, the scheduling region.
In GCS, region picking and scheduling are strongly coupled. The high level driver for region picking and scheduling is described by Algorithm 1. Region picking commences from the innermost loop. After each region is scheduled, all or part of it is nested, and the remainder (if any) is subsequently grouped with surrounding blocks to form another scheduling region. The choice of blocks for nesting is based on resource usage heuristics. This cycle of region formation and scheduling continues until the entire routine becomes one nested region or, in other words, is completely scheduled. A summary of the data flow information for the scheduled and nested regions of code is computed. This information is used when the enclosing region is scheduled.
Path Based Dependence Representation
Data dependence representation is a central component of any code scheduler. A well-designed representation increases the efficiency of the scheduler and can enable aggressive code motion. This section describes the data structures and their use in data and memory dependence representation [5] as implemented in GCS.
Path Vectors
The number of distinct control flow paths through an acyclic region is finite. In the worst case the number of paths in a region can grow exponentially with the number of blocks. GCS's region picking heuristics keep the number of paths in the region below a specified threshold. This control over the number of paths enables usage of bit vectors to represent boolean properties on a per path basis. We define a path vector to be a bit vector in which each bit maps to a unique path in the region. A path vector can be used to represent a subset of all the paths in the region. Figure 2 depicts the paths through a scheduling region and illustrates some path vectors relevant to GCS. The path vector BPV(n) encodes all the paths that flow through block n. Control flow relationships, such as dominance/postdominance, control equivalence, disjointness etc, can be determined by performing bitwise operations on these path vectors. For example, in Figure 2 blocks B and D are control equivalent since BPV(B) equals BPV(D). It can also be deduced that block A either dominates or post-dominates block E since BPV(A) is a superset of BPV(E). Algorithm 2 provides an algorithm that enumerates the paths through an acyclic region and computes the block path vectors.
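The path vector idea can be sketched in a few lines of Python. This is a minimal illustration and not a reproduction of Algorithm 2: the function names, the example region, and the assumption that every path ends at a block with no in-region successor are all hypothetical choices made for the sketch. Path vectors are stored as Python integers used as bit sets.

```python
from typing import Dict, List


def enumerate_paths(succ: Dict[str, List[str]], entries: List[str]) -> List[List[str]]:
    """List every path from an entry block to a block with no in-region successor."""
    paths: List[List[str]] = []

    def walk(block: str, prefix: List[str]) -> None:
        prefix = prefix + [block]
        nexts = succ.get(block, [])
        if not nexts:              # assumed region exit: the path is complete
            paths.append(prefix)
        for s in nexts:
            walk(s, prefix)

    for entry in entries:
        walk(entry, [])
    return paths


def block_path_vectors(paths: List[List[str]]) -> Dict[str, int]:
    """BPV(n): bit i is set when path i flows through block n."""
    bpv: Dict[str, int] = {}
    for bit, path in enumerate(paths):
        for block in path:
            bpv[block] = bpv.get(block, 0) | (1 << bit)
    return bpv


def is_superset(a: int, b: int) -> bool:
    """True when every path in b is also in a."""
    return (a & b) == b


# Hypothetical region: A splits to B and C, which join at D, followed by E.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
paths = enumerate_paths(succ, ["A"])
bpv = block_path_vectors(paths)

control_equivalent = bpv["A"] == bpv["D"]    # same set of paths
disjoint = (bpv["B"] & bpv["C"]) == 0        # no path visits both B and C
```

In this region A and D are control equivalent, B and C are disjoint, and BPV(A) is a superset of BPV(B), each determined with a single bitwise operation.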
Data Dependence Representation
In a register based intermediate representation, instructions that read or write a register v can be partitioned into two sets, a set of readers, Readers(v), and a set of writers, Writers(v). An instruction may be a member of both sets if it both reads and writes the register. The data dependence between any member w of Writers(v) and a member r of Readers(v) is represented as a def-use path vector DUPV(w,r,v). This path vector is defined to be the set of control flow paths along which the value of v written by w is read by r. An example is depicted in Figure 3. Given a virtual register v, Algorithm 3 illustrates how to compute the def-use path vectors DUPV(w,r,v) for all r in Readers(v) and all w in Writers(v). Only flow dependence information is used in GCS. Anti and output dependence information is not necessary since it can be inferred as described in subsection 3.3. This reduces the storage space required, and the cost of keeping the dependence graph up to date while scheduling.
Algorithm 3 (computing DUPV(w,r,v); listing truncated): ... − BPV(Block(w2)) endif endfor endif endfor endfor
The dependence information for a virtual register v is independent of the readers or writers of x where x ≠ v. This enables efficient local updates of the dependence information. For example, if an instruction I is moved from one block to another, then we only need to recompute the dependences associated with the virtual registers it accesses. For each virtual register x that it accesses, all DUPV(w,r,x) need to be recomputed where w belongs to Writers(x) and r belongs to Readers(x). As seen from Algorithm 3, the cost of a local update to the dependence graph for a virtual register v is $O(N_W^2 N_R)$, where $N_R$ is the number of readers and $N_W$ the number of writers of v. The use of distinct register names for unrelated lifetimes reduces the number of readers and writers, consequently reducing the complexity of the dependence graph update. Another advantage of the virtual register based partitioning of the dependence graph is that it allows a lazy approach to updating it.
When an instruction I is moved from one block to another, the affected def-use path vectors are not recomputed immediately. Instead the def-use path vectors associated with the virtual registers referenced by I are marked stale. Any subsequent access to dependence information marked stale, triggers an automatic update of those def-use path vectors. This avoids unnecessary updates due to multiple invalidates that may occur between uses of the dependence information.
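The per-register partitioning and the lazy update can be illustrated with a small Python sketch. It is an assumption-laden illustration, not Algorithm 3: the def-use path vectors are computed here by walking each path and tracking the most recent writer, which is a simpler formulation than the block path vector arithmetic the paper's listing appears to use. The class name `LazyDependences`, the instruction tuples, and the example region are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical instruction encoding: (name, registers read, registers written).
Instr = Tuple[str, FrozenSet[str], FrozenSet[str]]


def def_use_path_vectors(paths: List[List[str]], code: Dict[str, List[Instr]],
                         v: str) -> Dict[Tuple[str, str], int]:
    """Map (writer, reader) to the bit set of paths on which reader sees writer's value of v."""
    dupv: Dict[Tuple[str, str], int] = {}
    for bit, path in enumerate(paths):
        last_writer = None
        for block in path:
            for name, reads, writes in code[block]:
                if v in reads and last_writer is not None:
                    key = (last_writer, name)
                    dupv[key] = dupv.get(key, 0) | (1 << bit)
                if v in writes:
                    last_writer = name
    return dupv


class LazyDependences:
    """Caches DUPVs per register; a code motion only marks the touched registers stale."""

    def __init__(self, paths: List[List[str]], code: Dict[str, List[Instr]]) -> None:
        self.paths, self.code = paths, code
        self._cache: Dict[str, Dict[Tuple[str, str], int]] = {}
        self.recomputations = 0

    def dupv(self, v: str) -> Dict[Tuple[str, str], int]:
        if v not in self._cache:                 # stale or never computed
            self._cache[v] = def_use_path_vectors(self.paths, self.code, v)
            self.recomputations += 1
        return self._cache[v]

    def move(self, name: str, src: str, dst: str) -> None:
        instr = next(i for i in self.code[src] if i[0] == name)
        self.code[src].remove(instr)
        self.code[dst].append(instr)
        for v in instr[1] | instr[2]:            # mark stale, do not recompute yet
            self._cache.pop(v, None)


# Hypothetical diamond region A -> {B, C} -> D with two paths.
paths = [["A", "B", "D"], ["A", "C", "D"]]
code = {
    "A": [("w1", frozenset(), frozenset({"v"}))],
    "B": [("w2", frozenset(), frozenset({"v"}))],
    "C": [],
    "D": [("r", frozenset({"v"}), frozenset())],
}
deps = LazyDependences(paths, code)
before = deps.dupv("v")
deps.move("w2", "B", "C")
after = deps.dupv("v")
deps.dupv("v")  # second access reuses the refreshed vectors
```

Moving `w2` swaps the paths on which each writer reaches `r`, and the vectors for `v` are recomputed only once, on the first access after the move.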
Renaming
Most schedulers maintain anti and output dependence information. This is done primarily to detect whether moving an instruction violates such a dependence and whether renaming would be required to eliminate the dependence. Maintaining such information implies additional complexity since the information would need to be continually updated. Our code scheduler does not maintain information for anti and output dependences. Instead, the path vectors DUPV(w,r,v), which encode flow dependences, are used to glean this information on a demand-driven basis. To detect an anti or output dependence violation, the instruction I involved in the code motion is assumed to have already moved to the new location. For each register v that is written by I, the path vector DUPV(I,r,v) is recomputed for all r ∈ Readers(v). A change in DUPV(I,r,v) indicates that an anti or output dependence is violated by the code motion, and renaming is required. The process of violating an anti dependence creates a flow dependence. Similarly, when an output dependence is violated, a flow dependence is changed (unless the result is dead, in which case the output dependence violation does not matter). Hence, by just looking for changes to flow dependences caused by a prospective code motion, the need for renaming can be determined. The pseudo-code to compute the need to rename is provided in Algorithm 4. The proof of correctness is outside the scope of this paper.
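The demand-driven check can be sketched as follows. This is an illustrative Python sketch and not Algorithm 4: the helper names, the instruction tuples, and the use of an explicit insertion index in the destination block are assumptions made for the example. The flow dependences out of I are computed before and after a tentative move, and any difference signals that renaming is needed.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical instruction encoding: (name, registers read, registers written).
Instr = Tuple[str, FrozenSet[str], FrozenSet[str]]


def flow_deps_from(paths: List[List[str]], code: Dict[str, List[Instr]],
                   writer: str, v: str) -> Dict[str, int]:
    """DUPV(writer, r, v) for every reader r, as {reader: path vector}."""
    out: Dict[str, int] = {}
    for bit, path in enumerate(paths):
        last_writer = None
        for block in path:
            for name, reads, writes in code[block]:
                if v in reads and last_writer == writer:
                    out[name] = out.get(name, 0) | (1 << bit)
                if v in writes:
                    last_writer = name
    return out


def needs_renaming(paths: List[List[str]], code: Dict[str, List[Instr]],
                   instr_name: str, src: str, dst: str, index: int) -> bool:
    """Pretend the move happened and report whether any DUPV(I, r, v) changed."""
    instr = next(i for i in code[src] if i[0] == instr_name)
    moved = {block: list(instrs) for block, instrs in code.items()}
    moved[src].remove(instr)
    moved[dst].insert(index, instr)
    return any(
        flow_deps_from(paths, code, instr_name, v)
        != flow_deps_from(paths, moved, instr_name, v)
        for v in instr[2]
    )


# Hypothetical straight-line region A -> B. I redefines v, which r0 still reads in A.
V = frozenset({"v"})
NONE: FrozenSet[str] = frozenset()
paths = [["A", "B"]]
code = {
    "A": [("w0", NONE, V), ("r0", V, NONE)],
    "B": [("I", NONE, V), ("r1", V, NONE)],
}
safe_move = needs_renaming(paths, code, "I", "B", "A", 2)      # I placed after r0
unsafe_move = needs_renaming(paths, code, "I", "B", "A", 1)    # I placed before r0
```

Placing I after `r0` leaves its flow dependences unchanged, while placing it before `r0` creates a new flow dependence from I to `r0`, which is exactly the anti dependence violation described above.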
Memory Dependence Representation
Memory dependences are represented separately from the register-based dependences. For memory dependences flow, anti, and output dependences are computed and stored. To take advantage of the data speculation capabilities in the IA-64 architecture, we associate with each memory dependence edge, the probability that the memory references interfere.
Wavefront Scheduling And Deferred Compensation
One of the weaknesses of most cross block scheduling techniques is their inability to be judicious regarding the scheduling of compensation code. In these techniques, when evaluating a candidate instruction for scheduling that requires compensation copies in other blocks, there is no notion or measure of how desirable the compensation is to the other block where it is required. In some techniques compensation is allowed only when the block where the compensation is needed has not already been scheduled. Wavefront scheduling and deferred compensation [6] try to address these issues.
Wavefront
Given an acyclic region, JS edges are eliminated and interface blocks are added, as described in Section 2, to form a scheduling region. We define a wavefront in the scheduling region as a strongly independent cut set that partitions the scheduling region into three parts:
(1) nodes above the wavefront, (2) nodes on the wavefront, and (3) nodes below the wavefront.
The wavefront is strongly independent, implying that there is no control flow path in the acyclic region that flows through more than one node on the wavefront. The wavefront nodes collectively dominate all the nodes below the wavefront, and collectively post-dominate all the nodes above the wavefront. This property guarantees that when scheduling a candidate instruction I, originally in block B_k in the region, into block B_w on the wavefront, compensation code can be inserted entirely into blocks on the wavefront. Figure 4 shows an acyclic scheduling region and all the possible wavefronts in it. Given any acyclic region R with no JS edges and any block B in it, there is always at least one wavefront that passes through B. These properties are easily proved; the proofs are outside the scope of this paper.
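With block path vectors available, the wavefront definition can be tested directly. The following Python sketch assumes that "strongly independent cut set" means every path through the region crosses exactly one wavefront node, as the text describes; the function name and the example region are hypothetical.

```python
from typing import Dict, Iterable


def is_wavefront(bpv: Dict[str, int], nodes: Iterable[str], num_paths: int) -> bool:
    """True when every path in the region crosses exactly one of the given nodes."""
    all_paths = (1 << num_paths) - 1
    covered = 0
    for n in nodes:
        if covered & bpv[n]:       # some path would cross two wavefront nodes
            return False
        covered |= bpv[n]
    return covered == all_paths    # cut set: no path escapes the wavefront


# Hypothetical diamond region A -> {B, C} -> D, with paths A-B-D (bit 0) and A-C-D (bit 1).
bpv = {"A": 0b11, "B": 0b01, "C": 0b10, "D": 0b11}

valid = [w for w in (["A"], ["B", "C"], ["D"]) if is_wavefront(bpv, w, 2)]
invalid = [w for w in (["B"], ["B", "D"], ["A", "C"]) if not is_wavefront(bpv, w, 2)]
```

In this region the only wavefronts are {A}, {B, C}, and {D}. The set {B} fails because the path through C is not cut, and {B, D} fails because one path crosses both nodes.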
Wavefront Scheduling
When taking a directional approach to scheduling (topdown or bottom-up) there is always a region of the code that is considered scheduled, and another region that is considered unscheduled. In basic block, or trace scheduling the instruction scheduled last usually represents this boundary. In wavefront scheduling, the wavefront is the partition. Assuming top-down scheduling, the nodes above the wavefront have already been scheduled, and the schedule in these nodes will not be changed (unless one resorts to backtracking). No code has been scheduled into any node below the wavefront. Nodes on the wavefront are in the process of being scheduled. Scheduling of code into any block on the wavefront is not considered complete until the wavefront is advanced and the block is moved above it. In Figure 4 , the wavefronts are numbered to show the advancement of the wavefront in a top-down scheduling scheme. Note that there can be multiple alternatives to how the wavefront is advanced as shown in the figure by W2a and W2b. Algorithm 5 provides an algorithm for advancing the wavefront.
Algorithm 5
Advancing the wavefront. Closed blocks are blocks which are candidates for moving above the wavefront.
    // Compute in del_nodes the set of nodes to be moved
    // above the wavefront.
    del_nodes = set of closed nodes on wavefront
    for each node n in del_nodes
        if (n has exactly one successor s)
            for each predecessor p of s
                if (p ∉ del_nodes)
                    del_nodes = del_nodes − {n}
                    break
                endif
            endfor
        endif
    endfor
    for each node n in del_nodes
        Wavefront = Wavefront − {n}
        Wavefront = Wavefront ∪ SuccessorsOf(n)
    endfor
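Algorithm 5 translates almost line for line into Python. The sketch below is one possible rendering under stated assumptions: the function signature, the successor and predecessor dictionaries, and the example region are hypothetical, and the set of closed blocks is taken as an input rather than computed.

```python
from typing import Dict, List, Set


def advance_wavefront(wavefront: Set[str], closed: Set[str],
                      succ: Dict[str, List[str]],
                      pred: Dict[str, List[str]]) -> Set[str]:
    """Move closed blocks above the wavefront and pull their successors onto it."""
    del_nodes = {n for n in wavefront if n in closed}
    for n in sorted(del_nodes):
        succs = succ.get(n, [])
        if len(succs) == 1:
            s = succs[0]
            # A join can only enter the wavefront once all of its predecessors leave it.
            if any(p not in del_nodes for p in pred.get(s, [])):
                del_nodes.discard(n)
    new_wavefront = set(wavefront)
    for n in del_nodes:
        new_wavefront.discard(n)
        new_wavefront.update(succ.get(n, []))
    return new_wavefront


# Hypothetical diamond region A -> {B, C} -> D.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
pred = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

step1 = advance_wavefront({"A"}, {"A"}, succ, pred)
stalled = advance_wavefront({"B", "C"}, {"B"}, succ, pred)
step2 = advance_wavefront({"B", "C"}, {"B", "C"}, succ, pred)
```

The middle call shows why the predecessor test matters: B is closed, but C is still open, so advancing across B alone would put D on the wavefront together with C and break strong independence.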
Deferred Compensation
The control flow paths in the scheduling region R along which an instruction I needs to execute can be deduced from the original position of I and its data flow properties. This set of control flow paths can be represented as a path vector, which we call the compensation path vector of the instruction, CPV(I). When I is scheduled in a block K that does not fully dominate the block from which I originated (or in other words BPV(K) is not a superset of CPV(I)), we need to generate compensation copies of I to be scheduled elsewhere. Rather than creating such copies and associating them with other blocks immediately, we record the compensation needs in CPV(I). This is done by scheduling a copy I' of I in block K, and updating CPV(I) to CPV(I) − BPV(K). The actual generation of the compensation copies is deferred until they are actually scheduled. The instruction I left behind in its original block now represents all the compensation copies that will be generated, whether it be one copy or many. When the last compensation copy is to be scheduled in a block C, BPV(C) will be a superset of CPV(I), signaling that this is the last compensation copy. At this time, instead of scheduling a copy of I, the instruction I itself is scheduled in C. Figure 5 illustrates this process. Delaying compensation code generation allows scheduling freedom for the compensation (the destination of the compensation is not dictated a priori). Scheduling multiple copies of an instruction on a path through the region is avoided, as shown in the next subsection. Compensation for downward code motion below splits is handled similarly.
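The bookkeeping for deferred compensation reduces to a few bitwise operations on CPV(I). The Python sketch below is illustrative only: the function name, the return convention, and the three-path example are assumptions and do not correspond to the blocks in Figure 5.

```python
from typing import Tuple


def schedule_in_block(cpv: int, bpv_k: int) -> Tuple[str, int]:
    """Schedule I (or a copy) in a block K and return (what was placed, updated CPV(I))."""
    if (cpv & bpv_k) == 0:
        raise ValueError("block K lies on none of the paths that still need I")
    if (cpv & ~bpv_k) == 0:
        # BPV(K) is a superset of CPV(I): this is the last copy, so I itself moves.
        return "original", 0
    # Otherwise a copy is placed and the remaining need is recorded in CPV(I).
    return "copy", cpv & ~bpv_k


# Hypothetical instruction that must execute on three paths (bits 0, 1, 2).
cpv = 0b111
steps = []
for bpv_k in (0b001, 0b010, 0b100):   # assumed BPVs of three successive target blocks
    kind, cpv = schedule_in_block(cpv, bpv_k)
    steps.append((kind, cpv))
```

The first two placements are copies that shrink CPV(I), and the third block covers everything that remains, so the original instruction is scheduled there and no further compensation is owed.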
Advancing Wavefront in the Presence of Deferred Compensation
In GCS, information about the topologically last blocks where an instruction needs to be scheduled is maintained for each instruction (including those representing deferred compensation). The constraint on how late a deferred compensation can be scheduled is imposed in order to avoid generating multiple copies of an instruction on a path through the region. For example, in Figure 5 , when instruction I' is scheduled in F, the compensation copies represented by I have to be scheduled on or before E. When I" gets scheduled in C, I has to be scheduled in D. This constraint on the scheduling range for instructions, is used in closing a block, i.e. determining that scheduling code into that block is complete. Only when a block is closed can the wavefront be advanced across it, since all blocks above the wavefront represent scheduled blocks into which we cannot schedule any new code. So a block on the wavefront gets closed once all instructions that have to be scheduled on or before it have been scheduled. This block on the wavefront may subsequently get opened in response to instructions being scheduled in other blocks on the wavefront. Hence a block can flip-flop between open and closed states. For example in Figure 5 , assume D is a closed block at the point when it is decided to schedule I" in C. This causes block D to be opened. Additionally in a top down scheduling scheme permitting downward code motion as in GCS, constraints on the downward code motion (i.e. how far down an instruction can be moved below its block of origin) need to be taken into account in determining whether a block can be closed.
Control Speculation and Predication
When evaluating a candidate for scheduling, the usefulness (inverse of speculativeness) of the code motion is measured in terms of the likelihood that control will flow through the target block T and the block of origin O, given that control reaches T. This is measured as
$$\frac{\mathrm{Prob}(\mathrm{BPV}(T) \cap \mathrm{BPV}(O))}{\mathrm{Prob}(\mathrm{BPV}(T))},$$
where Prob(pv) is the function that provides the aggregate probability that control flows along any one of the paths in the path vector pv given that control flows through the region.
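The usefulness measure is a conditional probability over path vectors and can be sketched directly. In the Python sketch below, the per-path probabilities, the function names, and the example region are assumptions; in practice the path probabilities would presumably come from profile data or static estimates.

```python
from typing import List


def prob(pv: int, path_prob: List[float]) -> float:
    """Aggregate probability that control follows any path whose bit is set in pv."""
    return sum(p for bit, p in enumerate(path_prob) if (pv >> bit) & 1)


def usefulness(bpv_t: int, bpv_o: int, path_prob: List[float]) -> float:
    """Probability that control reaches the origin block O, given that it reaches T."""
    return prob(bpv_t & bpv_o, path_prob) / prob(bpv_t, path_prob)


# Hypothetical diamond region A -> {B, C} -> D: path A-B-D is bit 0, A-C-D is bit 1.
path_prob = [0.9, 0.1]
bpv = {"A": 0b11, "B": 0b01, "C": 0b10, "D": 0b11}

hoist_from_b = usefulness(bpv["A"], bpv["B"], path_prob)   # likely path
hoist_from_c = usefulness(bpv["A"], bpv["C"], path_prob)   # unlikely path
hoist_from_d = usefulness(bpv["A"], bpv["D"], path_prob)   # control equivalent
```

Moving an instruction from D to A is never wasted work because the two blocks are control equivalent, whereas an instruction hoisted from C to A is useful on only a small fraction of executions.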
The IA-64 speculation capabilities are used to speculatively execute loads and their uses. Speculation check instructions [1] and recovery code are generated as a byproduct of speculation. Load safety information [4] is used to avoid unnecessary speculation. GCS also uses predication support in the IA-64 architecture to convert a speculative instruction to a non-speculative one. If the instruction being speculated across a branch is scheduled after the compare operation that controls the branch, the predicate produced by the compare may be available for use. In such situations it is preferable to predicate the instruction over speculating it. This eliminates the need for a check instruction and recovery code. In addition, when the predicate is false, any adverse effects of a speculative operation on the data cache and TLB are avoided. In some cases when an instruction is moved across multiple branches, the block predicate for the block of origin may not be available (i.e. compare not scheduled or result unavailable) at the point where the instruction is being scheduled. However a predicate for an intermediate block in the control dependence chain may be available. Predication with such a predicate will not render the instruction non-speculative, but will reduce the speculativeness of the instruction when the instruction executes (i.e. qualifying predicate is true). To do this effectively GCS maintains a predicate promotion list and keeps track of predicates that become available as compare instructions are scheduled. The predicate promotion list is essentially a form of control dependence information but in the predicate domain.
Instructions may be predicated by another phase of the compiler before GCS encounters them. Predicated instructions can also be speculated. GCS speculates these instructions by promoting the qualifying predicate to an available predicate in the predicate promotion list. Predicate promotion is thus achieved on the fly while scheduling.
Downward code motion
Operations such as stores and speculation check instructions cannot be speculated, and therefore tend to stay behind in the block of origin. It is advantageous to move an instruction downward if it does not fit into the schedule for a block. Another motivation to move code down is to empty a block. This can help eliminate an unconditional branch, or expose an opportunity for multiway branch generation. Downward code motion can also help reduce the amount of speculation or compensation code needed to expose ILP. Often the guarding predicate for instructions that cannot be speculated may not be available much before the controlling branch, and hence such operations do not move up using predication either. However, these operations can usually be moved down non-speculatively to a join block, since the predicate can be used to conditionally execute them. It should be noted that the predicate is only available for use in blocks dominated by the compare instructions that generate the predicate. This places a limit on how far down the code can be moved. In GCS, before a block is closed and the wavefront advanced across it, the availability of any necessary predicates for the code being moved down is checked.
P-ready Code Motion
An instruction I from block O is M-ready [2] at a block T if there is no unscheduled source operand of I on any path that flows through O and T. Consider the example in Figure 6a. In this example I is M-ready at B, but instruction J is not, due to instruction K. Note that instruction K is on a path from block B to block E that is unlikely to be taken. If the join into block E is eliminated by tail duplicating E, as shown in Figure 6b, then J is now M-ready at B, whereas J' is not. The same result can be achieved by only duplicating instruction J using P-ready code motion. We define an instruction I from block O to be P-ready (partially ready) at a block T if it is not M-ready at T, and if there is no unscheduled operand on at least one path that flows through T and O. When a P-ready instruction I is scheduled at a block T, compensation copies of I need to be placed at points where it is M-ready. In Figure 6a, J is P-ready at B. Figure 6c shows J scheduled using P-ready code motion in block B, with a compensation copy in block D. P-ready compensation code can increase dynamic instruction count. Hence a decision to select a P-ready candidate over an M-ready one should be based on the probability of executing the compensation due to P-ready code motion. In GCS, both M-ready and P-ready candidate lists are maintained and the best candidate is chosen based on heuristics. Only those instructions that are important to execute early get moved up to locations where they are P-ready. The scheduler obtains many of the benefits afforded by full tail duplication, and pays the price of tail duplication only on those instructions where it is deemed worthwhile. The current implementation of P-ready code motion in GCS has one limitation compared with tail duplication. To preserve the control flow graph within the region during scheduling, GCS does not tail duplicate branch instructions through this mechanism. This limitation may be removed in the future.
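The M-ready and P-ready definitions can be expressed with path vectors. The following Python sketch is a hypothetical formulation: it assumes a path vector `blocked` that marks the paths on which an unscheduled source operand of the instruction still exists, which is not a structure the paper names, and the region and probabilities are invented for the example.

```python
from typing import List


def readiness(bpv_o: int, bpv_t: int, blocked: int) -> str:
    """Classify an instruction from block O as a candidate for scheduling in block T."""
    common = bpv_o & bpv_t                 # paths that flow through both O and T
    if common == 0:
        return "not-ready"
    if (common & blocked) == 0:
        return "M-ready"                   # no unscheduled operand on any common path
    if (common & ~blocked) != 0:
        return "P-ready"                   # at least one common path is clear
    return "not-ready"


def compensation_probability(bpv_o: int, bpv_t: int, blocked: int,
                             path_prob: List[float]) -> float:
    """Probability of taking a path on which a compensation copy would execute."""
    pv = bpv_o & bpv_t & blocked
    return sum(p for bit, p in enumerate(path_prob) if (pv >> bit) & 1)


# Hypothetical region B -> {C, D} -> E: path B-C-E is bit 0, the unlikely path B-D-E is bit 1.
path_prob = [0.95, 0.05]
bpv_b, bpv_e = 0b11, 0b11

i_status = readiness(bpv_e, bpv_b, blocked=0b00)   # no pending operands
j_status = readiness(bpv_e, bpv_b, blocked=0b10)   # operand produced on the path through D
j_cost = compensation_probability(bpv_e, bpv_b, 0b10, path_prob)
```

Here the instruction blocked only on the unlikely path is P-ready, and the small compensation probability is the kind of quantity a cost and profitability heuristic could weigh against the benefit of scheduling it early.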
Block Reordering And Multiway Branch Generation
In order to schedule branches effectively, GCS needs to know the physical block order before scheduling begins. During the scheduling process, some blocks are emptied. Block reordering and branch re-targeting around such blocks often results in the elimination of unconditional branches. In some cases, compensation code may be inserted into a previously emptied block potentially resulting in the addition of a branch. A greater number of branches deleted than are added can result in a performance gain.
The IA-64 architecture supports the execution of a linear sequence of branch instructions in a single cycle. Such sequences of branch instructions are called multiway branches. When a block becomes empty except for a branch instruction, there may be a multiway opportunity if a control flow predecessor that ends with a branch, is also its physical layout predecessor. If such a control flow predecessor is not its physical layout predecessor the block layout may be changed to create the multiway opportunity. Therefore in GCS we choose to model all unconditional branches and to update block layout ordering as we schedule [12] . Data on the usefulness of this capability is presented in Section 8.
Results
This section presents the results of experiments conducted to measure the effectiveness of various features implemented in the global code scheduler. The SPEC95 integer benchmark suite was used in all experiments. The programs in the suite were run with an input set developed at Intel. This input set attempts to approximate the characteristics of the SPEC95 reference input set, while considerably reducing run time. The compilations used profile feedback and interprocedural inlining. The same inputs were used for the profiling as well as the measurement runs. Figure 7a shows a cumulative distribution of the size of scheduling regions measured in terms of the number of initial basic blocks they contain. This number does not include any JS or interface blocks, or nodes representing nested regions. Hence it is a true indication of the size of the scheduling scope. Though GCS's region picking heuristics attempt to build large regions, a large percentage (60%) of regions are 5 blocks or less in size. Figure 7b shows the cumulative distribution of the size of program regions. For all benchmarks except 147.vortex, the majority (60%) of our program regions are 7 blocks or smaller in size. Improvements in scheduling would be possible by increasing program region size using more aggressive inlining, unrolling, peeling and other methods. A considerable portion of the program code in 147.vortex is not exercised by the input used. GCS region picking heuristics separate and first schedule cold regions (unexercised code) and later nest these cold regions within hot ones. This causes fragmentation of the program region with no detrimental effect. This is the primary reason for smaller scheduling regions in 147.vortex even though most program regions are large.
[Figure caption: Percentage of multiway candidate branches combined into a multiway branch.]
The block reordering capabilities of GCS were evaluated by measuring the magnitude of the changes to unconditional branches during scheduling. Figure 8 shows the average number of unconditional branches added and the number deleted during scheduling by GCS as a percentage of all branches (conditional and unconditional) except calls and returns in the region. As can be seen, on average more than twice as many unconditional branches are deleted as are added. This figure also shows the 2-way and 3-way multiway branches that were formed as a percentage of all branches that can be combined in a multiway. This includes calls and returns as well as all unconditional and conditional branches, but does not include speculation check instructions that branch to recovery code. For each 2-way multiway generated, one branch cycle is saved since the 2-way is executed in one cycle instead of two. For each 3-way multiway, at least two cycles are saved.
[Figure caption: Speedup due to wavefront scheduling over basic block scheduling on the Merced microarchitecture.]
The performance impact of wavefront scheduling in GCS is shown in Figure 9 . Performance is presented as speedup over basic block scheduling. Basic block scheduling forms regions out of a contiguous set of blocks and does not include branches except for call instructions. The code was scheduled for the Merced processor from Intel Corporation. The speedup was measured assuming perfect caches. Two data points are presented for each benchmark indicating speedup with and without P-ready code motion. The geometric mean for the speedup due to wavefront scheduling is also shown. The mean increase in performance without P-ready code motion is about 30%, with 147.vortex and 124.m88ksim benefiting from almost a 50% performance increase. Except for 130.li, the speedup from P-ready code motion is modest. P-ready code motion itself provides more than a 20% increase in speedup for 130.li. With respect to scheduling, P-ready code motion is more efficient than tail duplication of basic blocks, in that instructions are duplicated only if they yield a performance increase as in 130.li.
Conclusions
An overview of the global code scheduler implemented in Intel's reference compiler for IA-64 has been provided. GCS schedules code in hierarchical acyclic regions, which may include nested regions including nested loops. This paper has described a novel scheduling technique for scheduling subgraphs of the control flow graph, called wavefront scheduling. Wavefront scheduling and deferred compensation help GCS to be judicious about the compensation code it generates. This technique is built around a framework of path based analysis. We have described at a high level the use of bit vectors in GCS to represent control flow paths to analyze data dependences, control flow relationships, and to do efficient bookkeeping of compensation. A more efficient alternative to tail duplication which we call P-ready code motion has been described. This technique can reap most of the scheduling benefit of tail duplication without paying the high code bloat cost. We have also shown the value of modeling and changing the physical block layout during scheduling especially in making effective use of the multiway branch feature in IA-64.
