The high performance of pipelined, superscalar processors such as the POWER2™ and PowerPC™ is achieved in large part through the parallel execution of instructions. This fine-grain parallelism cannot always be achieved by the processor alone; it relies to some extent on the ordering of the instructions in a program. Optimizing compilers for these processors must therefore generate or schedule instructions in an order that maximizes the possible parallelism. This paper describes the parts of the TOBEY compiler that address instruction scheduling.
Introduction
The TOBEY family of compilers is designed to optimize code for the superscalar IBM RISC System/6000® (RS/6000) computers based on the POWER, POWER2™, and PowerPC™ architectures. All of these machines have pipelined, superscalar implementations that manage instruction-level parallelism in hardware. The implementations differ in the degree of superscalar parallelism, the depth of pipelines, and the latencies of instructions. Advanced models also include register renaming and true out-of-order execution. This variety of implementations has placed great demands on the compiler and particularly on the instruction scheduler. The scheduler is the compiler component responsible for reordering and replicating instructions in order to minimize execution time on a given target machine. The scheduler must be flexible enough to generate code optimized for machines with a wide range of capabilities. The scheduler must also be portable, to ease the application of TOBEY compiler technology to a larger class of target machines. This paper describes the fundamental algorithms used in the TOBEY instruction scheduler, along with the engineering solutions designed to make them work somewhat independently of the target machine. The IBM Haifa Scientific Center, the IBM Toronto Laboratory, and the IBM Thomas J. Watson Research Center jointly developed the instruction scheduler.
Overview of the TOBEY compiler
The program undergoes two transformations before the first phase of instruction scheduling. First, inner loops are unrolled to expose more independent instructions per iteration. Next, the lifetimes of variables are analyzed and the variables are renamed so that each unique lifetime has a unique name. The global scheduling and software pipelining phase performs global code motion and high-level loop transformations, providing greater flexibility and opportunity to the two scheduling phases which follow. The local scheduling phase creates basic block instruction schedules using a sophisticated model of the target machine. Global register allocation assigns registers to variables using an enhanced implementation of the Chaitin algorithm [1]. Finally, the postpass instruction scheduling phase schedules code generated by the register allocator, schedules interlocks due to register assignment, and performs some special-case scheduling of branch instructions.
R. J. BLAINEY
Local instruction scheduling
The local instruction scheduler processes linear, branch-free segments of a program (basic blocks), reordering instructions to optimize the use of the target machine. In general, instruction scheduling reorders instructions subject to control flow and data dependence constraints. Limiting the scope to basic blocks allows the local scheduler to move instructions without being concerned about the legality of code motion across control flow. This leaves data dependence as the only constraint on reordering.
The local scheduler divides a basic block into windows of instructions. A window is delimited by reaching a maximum size, by reaching a boundary instruction (for example, a trap or special fence instruction), or by reaching the end of the block. The scheduler processes windows of limited size primarily to avoid the excessive compile time of reordering long sequences of instructions. To improve the scheduling of instructions which are near window boundaries, the windows are overlapped. For each window, the local scheduler builds a dependence graph in which the nodes are instructions and the directed edges correspond to some type of data dependence between instructions.
Using the dependence graph, the scheduler executes an algorithm called list scheduling to issue all nodes in the graph in an order which minimizes pipeline delays and idle processor cycles. The algorithm is essentially a time-driven simulation of the target machine where, in each cycle, one or more instructions may be issued.
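The division of a basic block into overlapping windows can be illustrated with a minimal sketch. The function name, the fixed overlap parameter, and the rule that windows do not overlap across a boundary instruction are assumptions made for illustration; the paper does not specify how the overlap is chosen.

```python
def split_windows(instrs, max_size, overlap, is_boundary):
    """Split a basic block into overlapping scheduling windows (illustrative)."""
    windows = []
    start = 0
    while start < len(instrs):
        end = start
        hit_boundary = False
        # Grow the window until it is full, hits a boundary, or the block ends.
        while end < len(instrs) and end - start < max_size and not hit_boundary:
            hit_boundary = is_boundary(instrs[end])
            end += 1
        windows.append(instrs[start:end])
        if end == len(instrs):
            break
        # Assumption: no overlap across a boundary instruction; otherwise
        # re-include the last `overlap` instructions, always making progress.
        start = end if hit_boundary else max(end - overlap, start + 1)
    return windows


block = ["i0", "i1", "i2", "i3", "i4", "i5", "i6"]
plain = split_windows(block, max_size=4, overlap=1, is_boundary=lambda i: False)
fenced = split_windows(block, max_size=4, overlap=1, is_boundary=lambda i: i == "i2")
```

In this sketch, instruction `i3` appears in both windows of `plain`, so it can be placed relative to its neighbors on either side of the window boundary.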
Dependence graph
The scheduler analyzes the sequence of instructions to be reordered, identifying any interesting data dependences. Three types of dependence are typically of interest: true dependence, antidependence, and output dependence. An instruction has a true dependence on a previous instruction if it uses a value generated by the other instruction.
An instruction has an antidependence on a previous instruction if it writes to a register or memory location which is used by the other instruction. Finally, an instruction has an output dependence on a previous instruction if it writes to a register or memory location which is also written by the other instruction. Figure 2 shows examples of each type of dependence.
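The three dependence types can be sketched as set intersections over the locations each instruction defines and uses. The `Instr` record and the register-only view are assumptions for illustration; a real scheduler must also handle memory locations and intervening redefinitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    defs: frozenset = frozenset()   # locations written
    uses: frozenset = frozenset()   # locations read


def classify_dependences(earlier, later):
    """Return the kinds of dependence that `later` has on `earlier`."""
    kinds = set()
    if earlier.defs & later.uses:
        kinds.add("true")      # later reads a value earlier produced
    if earlier.uses & later.defs:
        kinds.add("anti")      # later overwrites a location earlier reads
    if earlier.defs & later.defs:
        kinds.add("output")    # both write the same location
    return kinds


load = Instr("lwz r3,0(r4)", defs=frozenset({"r3"}), uses=frozenset({"r4"}))
add = Instr("addi r5,r3,1", defs=frozenset({"r5"}), uses=frozenset({"r3"}))
redefine = Instr("addi r3,r6,1", defs=frozenset({"r3"}), uses=frozenset({"r6"}))
```

Here `add` has a true dependence on `load`, `redefine` has an antidependence on `add`, and `redefine` has an output dependence on `load`.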
The scheduler labels or weights the dependence graph edges with nonnegative integers representing the pipeline delay that would result if the two instructions represented by the incident nodes were issued in sequence. For example, if the edge represents the dependence of a fixed-point instruction on a value loaded from memory, the edge is labeled 1 to represent the single cycle typically required to access the data cache. Edge weights may also be assigned the special value weak if the incident nodes represent instructions that may be executed in parallel on the target machine. Weak edges represent an ordering relation (that is, the dependent instruction may not be issued before the instruction on which it depends), but do not require the incident instructions to be issued in different cycles. A weak edge is typically used when there is an antidependence or output dependence between two instructions.
The dependence graph used by instruction scheduling also includes information about the execution time of each instruction. In our model, the execution time of an instruction is a list of execution resources (such as functional units and register files) required and the number of cycles consumed on each functional unit. The number of consumed cycles on a functional unit is the minimal number required to compute the result, not including rename, decode, or writeback stages. As expected for a RISC machine, this time is usually one cycle. However, the RS/6000 machines include complex instructions, such as integer multiply and divide, which may require many cycles to complete execution. Each node in the dependence graph is labeled with a resource list representing the execution time.
[Figure: dependence graph for the PowerPC 601™ target machine.]
The algorithm is a time-driven machine simulation which issues, in each cycle, a group of instructions that are expected to issue in parallel on the target machine. The group to be issued is based upon a working set of instructions called the tentatively scheduled set. The tentative set is formed by considering all possible combinations of instructions which are available to be issued (according to data dependences) and comparing these alternatives using a preference function. The committed set is the final form of the tentative set and represents the instructions actually issued in the cycle.
The preference function used to decide the final form of the issue group is the most important factor in the success of the algorithm. The heuristics built into this function include the earliest time that an instruction may be issued and the latest time it may be issued without delaying the execution of the window of instructions. These two measures determine the freedom of motion of the instruction. Given these bounds, the scheduler determines [...]
Given a pair of instructions (v1, v2), the heuristic set for the preference function is defined as follows:
[...] Similarly, if issuing v2 first will cause v1 to miss its deadline, then prefer v1.
5. Each instruction is assigned a weight that includes the net effect on register pressure combined with various heuristic measures designed to moderate the upward code motion effect of scheduling. Prefer the instruction with lower weight.
6. The uncover count of an instruction is the number of instructions that would become available for entry to the ready list if it were issued. Prefer the instruction with a larger uncover count.
7. Each instruction v has a sum-delay, $S_v$, which is defined as
$$S_v = \max_{i \in \mathrm{Succ}(v)} \left( S_i + W_{vi} \right),$$
where Succ(v) is the set of successors of v in the dependence graph and $W_{vi}$ is the weight of edge (v, i). The sum-delay is the maximum number of idle processor cycles (represented as nonzero edge weights) on any path to the completion of the window. Prefer the instruction with the larger sum-delay.
8. Each instruction v has a critical path, $C_v$, which is defined as
$$C_v = E_v + \max_{i \in \mathrm{Succ}(v)} \left( C_i + W_{vi} \right),$$
where $E_v$ is the execution time of instruction v. The critical path is similar to the sum-delay measure but includes the execution time of instructions in the path. Prefer the instruction with the larger critical path.
9. Prefer the instruction appearing first in the original sequence of code.
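The sum-delay and critical-path measures can be sketched as a memoized walk over the dependence graph. This sketch assumes the recurrences S(v) = max over successors i of (S(i) + W(v,i)) and C(v) = E(v) + max over successors i of (C(i) + W(v,i)), with both maxima taken as zero for a node without successors and weak edges counted as zero delay; these are our reading of the prose definitions, and the graph encoding is hypothetical.

```python
def path_measures(succ, exec_time):
    """Return (sum_delay, critical_path) dictionaries for a dependence DAG.

    succ maps a node to a list of (successor, edge_weight) pairs;
    exec_time maps every node to its execution time in cycles.
    """
    sum_delay, critical = {}, {}

    def visit(v):
        if v in sum_delay:
            return
        best_s = best_c = 0  # assumed value for nodes with no successors
        for i, w in succ.get(v, []):
            visit(i)
            best_s = max(best_s, sum_delay[i] + w)
            best_c = max(best_c, critical[i] + w)
        sum_delay[v] = best_s
        critical[v] = exec_time[v] + best_c

    for v in exec_time:
        visit(v)
    return sum_delay, critical


# Hypothetical window: a load feeding an add (one delay cycle), and a
# five-cycle multiply, both feeding a store.
succ = {"load": [("add", 1)], "add": [("store", 0)], "mul": [("store", 0)]}
exec_time = {"load": 1, "add": 1, "mul": 5, "store": 1}
S, C = path_measures(succ, exec_time)
```

In this example the sum-delay heuristic prefers `load` over `mul`, whereas the critical-path heuristic prefers `mul`, which shows why the order in which the heuristics are applied matters.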
Machine simulation
The allocation of machine resources to instructions is a critical mechanism in the scheduling of instructions for superscalar machines. To efficiently utilize the resources of the machine, particularly multiple execution units, the scheduler models the dynamic behavior of the machine as closely as possible. The scheduler manipulates data structures corresponding to real machine resources such as functional units, registers, register read and write ports, synchronization counters, and store queues. Most of these attributes (for example, the number and type of functional units) are specified using an abstract machine description language so that the scheduling simulation can be made as independent of the target architecture as possible.
One of the significant features of the TOBEY instruction scheduler is that it is almost entirely driven by tables which represent the target machine organization. The tables are generated by a software tool which understands the high-level machine description language. This allows for greater flexibility in the application of the scheduler to new machines and easier modification of machine attributes. The high-level description language also serves as an effective communication device between compiler developers and chip engineers. The edges of the dependence graph are constructed and weighted using a machine-independent algorithm. The algorithm obtains machine-specific execution time and delay values by lookup in the target machine tables.
To take advantage of the tables generated from machine descriptions, the machine simulation is designed to handle a generic superscalar RISC processor. The dispatcher is used to allocate machine resources for an instruction based on the target machine tables. If some resource is not available because some other instruction is tentatively assigned the resource, it determines which instruction is to be issued.
When instructions compete for execution resources, a preference function is evaluated to determine which instruction is better to issue. If the instruction being considered is better than the one tentatively assigned the resource, the latter is displaced in favor of the former. During the allocation of resources to an instruction, one or more instructions may be displaced. After successfully allocating resources to an instruction, the dispatcher tries to allocate resources for any displaced instructions (since they may be better than some other tentatively scheduled instructions). This process is guaranteed to terminate by preventing the allocation of a resource to a displaced instruction when the resource is tentatively assigned to the instruction which caused the displacement.
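The displacement mechanism can be sketched under strong simplifications: each instruction needs exactly one functional unit chosen from a list of acceptable units, and the preference function is a plain priority comparison. The names `allocate`, `units_for`, and `better` are hypothetical; the real dispatcher allocates several kinds of resources from the machine tables.

```python
def allocate(instr, tentative, units_for, better, blocked=frozenset()):
    """Try to give `instr` one functional unit; return True on success."""
    candidates = [u for u in units_for[instr] if u not in blocked]
    # First choice: a unit that no instruction tentatively holds.
    for unit in candidates:
        if unit not in tentative:
            tentative[unit] = instr
            return True
    # Otherwise displace an occupant that the preference function ranks lower.
    for unit in candidates:
        occupant = tentative[unit]
        if better(instr, occupant):
            tentative[unit] = instr
            # The displaced instruction may retry, but never on a unit held
            # by its displacer; the growing blocked set forces termination.
            allocate(occupant, tentative, units_for, better, blocked | {unit})
            return True
    return False


units_for = {"a": ["FXU0"], "b": ["FXU0", "FXU1"], "c": ["FXU0", "FXU1"]}
priority = {"a": 3, "b": 2, "c": 1}
tentative = {}
results = [
    allocate(i, tentative, units_for, lambda x, y: priority[x] > priority[y])
    for i in ["c", "b", "a"]
]
```

In this run `a` displaces `c` from `FXU0`; `c` then retries on `FXU1` but is not preferred over `b`, so it stays unscheduled for the cycle.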
Global instruction scheduling
Many techniques have been successful in addressing the problem of local instruction scheduling. By comparison, the scheduling of instructions across basic block boundaries is a much more difficult problem, with far fewer solutions that work well in practice. Global scheduling is complicated by the need to make decisions about the benefit and correctness of moving instructions between basic blocks. Optimal scheduling also requires that appropriate loop-level transformations such as loop unrolling are performed. Previous work in this area includes the trace-scheduling technique developed by Fisher and Ellis [6-8], the global parallelization techniques developed by Ebcioglu [9-11], and a collection of algorithms for software pipelining of loops [12-16].
The TOBEY compiler implements a global scheduling algorithm that is a generalization of list scheduling [17]. To capture the constraints imposed by control flow, the global scheduler replaces the dependence graph used in local scheduling with the program dependence graph (PDG) [18]. The PDG represents both the control and data dependence relationships between statements or instructions in a program.
Control dependence and classification of code motion
The control dependence graph (CDG) is a part of the PDG that summarizes the relationship of instructions in the control flow of the program. We simplify the graph to group instructions belonging to the same basic block so that the nodes of the CDG are simply basic blocks. The global scheduler processes the program one loop or strongly connected region at a time, beginning with the innermost (that is, those loops not containing other loops). Loops nested within other loops are treated as a single control flow node for the purpose of scheduling the containing loop. A PDG is built for each loop as it is processed. Since we may assume that we are building control dependence relations for a strongly connected region of the flow graph, we may also assume that there is no backward control flow (that is, we assume that the region has a single entry point called a header and that any edges leading to the header are not included in the region). This simplifies the control dependence graph to a special form known as the forward control dependence graph [19]. This form of CDG is a directed acyclic graph in which a directed edge from B1 to B2 represents a conditional flow of control at the exit from block B1, which proceeds to block B2 if the condition evaluates to a given value (for example, true or false) labeling the edge. We insert additional edges in the graph to connect blocks with the same set of control dependences (that is, blocks that are executed under the same conditions). We call these blocks equivalent.
We need to review the concepts of dominance and postdominance to further understand the constraints of control flow on global code motion in instruction scheduling. [...]
[...] here: legality and profitability. The code motion is legal if the instruction does not destroy any upward-exposed variables⁵ in other successor blocks to B1 and does not cause any observable side effects. In general, we cannot speculatively move an instruction that may cause an exception or alter memory. On the RS/6000 and PowerPC machines, this restriction excludes the movement of floating-point instructions, stores, procedure calls, and some loads. We will later see that some effort in the analysis of memory addressing will allow us to move a large class of load instructions.
Movement of an instruction is profitable if the instruction would likely be executed anyway or if the instruction costs nothing to execute (for example, if it is scheduled in an otherwise idle processor slot). To assess the relative likelihood of execution, we label the edges in the control dependence graph with probabilities which represent an estimate of the chance of following the conditional path indicated in the graph. The probability of executing an instruction in B2 relative to the execution of B1 is the product of the probabilities labeling the edges on the path from B1 to B2 in the CDG. Generally, we move an instruction that has a relative probability of 50% or more of being executed. In the absence of actual knowledge of the likely direction of conditional branches, we use static predictions based on the context of the branch. For example, if the branch is the closing of a loop, we assume that the branch is almost always taken (that is, we assume that loops are usually iterated many times before exiting). This framework for assessing the profitability of code motion is also capable of using information that might be available from a basic block execution profile.
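The profitability test can be written down directly. The sketch below multiplies the edge probabilities along a CDG path and applies the 50% rule; the function names and the `idle_slot` flag are illustrative assumptions, and the probabilities in the example are made up.

```python
from math import prod


def relative_probability(path_edge_probs):
    """Probability of reaching the source block, given the target executes."""
    return prod(path_edge_probs)


def worth_speculating(path_edge_probs, idle_slot=False, threshold=0.5):
    """Profitable if likely to execute anyway, or if the slot is free."""
    return idle_slot or relative_probability(path_edge_probs) >= threshold


likely = worth_speculating([0.9, 0.6])        # 0.54, at or above the threshold
unlikely = worth_speculating([0.9, 0.5])      # 0.45, below the threshold
free_slot = worth_speculating([0.9, 0.5], idle_slot=True)
```

With an execution profile available, the same test applies unchanged; only the source of the edge probabilities differs.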
Speculative loads
The most important class of instructions to move across basic blocks are those that load values from memory. Since most computation is done on register operands and the most common way to get values into registers is by using a load instruction, the movement of a load instruction usually allows the movement of many more instructions which are dependent upon the value loaded. Unfortunately, it is difficult to decide whether the movement of a load is legal. Speculative movement of a load is considered illegal if it violates memory-ordering constraints (that is, if the memory location is volatile) or causes an exception that would not have happened if the instruction had not been moved. The first of these conditions is easy to check; the second is much more difficult. Determining that a load does not cause an exception is dependent upon the operating system and machine on which the program runs. We later describe the set of tests used for the RS/6000 machines running the AIX 3.2 operating system.
⁵ A variable is upward-exposed in a basic block if there is a use of the variable before any definition in the block (that is, it depends on a value defined in a predecessor block).
[Figure 4: C program with example RS/6000 code and the associated flow and dominance graphs (dominator tree, postdominator tree, and control dependence graph; dashed lines denote equivalence edges).]
The compiler can determine that many loads are incapable of generating an exception regardless of their context. We call these loads globally safe. The scheduler recognizes these loads by their use of a special base addressing register. For example, loads using the stack frame pointer or TOC (table of contents) base register are usually considered safe. Also, those loads using base registers loaded from the TOC and with displacements that are smaller than the allocated memory for the external being referenced are usually considered safe. When a load cannot be determined to be globally safe, the load is usually an array reference or an indirect reference through a pointer variable. In this case, the safety of the load may be dependent on the context of the reference.
The first set of context-sensitive tests is performed on the target block and its dominators. The compiler searches the target and dominators for other loads that are similar to the candidate load or that define its base register to an address known to be safe. Given two loads L1 and L2, we consider L1 to be similar to L2 if L1 and L2 share the addressing registers and the displacement on L1 "covers" the displacement on L2. There are varying degrees of strictness in the similarity test. The strictest version of the test demands that the safety of load L1 implies the safety of load L2 (that is, the memory referenced by L2 is a subset of the memory referenced by L1). More permissive versions of the test, used at higher optimization levels (-O3), allow for a larger tolerance in the difference in displacements on L1 and L2. In addition to similar loads, the target block is searched for definitions of the base register used on the candidate load that are known to be safe locations. For example, loads from the TOC are usually considered safe definitions.
The second set of context-sensitive tests is performed on the conditional branches controlling the execution of the candidate load and the set of escape blocks along the path from B1 (the target block) to B2 (the source block) in the CDG. An escape block is a block that is control-dependent on some block on the path from B1 to B2 along a condition other than the one controlling the execution of B2. For example, in Figure 5, if B1 is a target block and B2 is a source block, B4 and B5 are escape blocks.
[Figure 5: Example control dependence graph showing escape blocks.]
[...] loads. If there are none, the conditional branches closing the blocks in the CDG between B1 and B2 are scanned for conformance to a special form that compares the base register of the candidate load with zero. If the "equal to zero" condition controls the escape path, the load is proven to be safe to move above that condition, because if control were to pass to the escape block, the load would have been using a base register of value zero. We are guaranteed not to get an exception in this case because AIX 3.2 allows the first page of memory to be readable (therefore, we also require the load displacement to be in the range [0, 4095]). This context-sensitive test is most commonly applied in programs which follow pointer-based data structures such as trees or linked lists that represent null pointers with a value of zero. In the program fragment shown in Figure 6, the load of r4 may be moved above the conditional branch. Once the load is moved above the conditional branch, we may also move the addi instruction (since r6 is not live on the escape path of the branch), but we cannot move the store instruction speculatively, since it may alter memory which might not otherwise have been altered.
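The zero-test condition for a speculative load reduces to a small predicate. This is a sketch under stated assumptions: the parameter names are hypothetical, the 4096-byte limit is taken from the [0, 4095] displacement range quoted for AIX 3.2, and the other legality checks (volatility, similar loads, escape-block scans) are omitted.

```python
FIRST_PAGE_LIMIT = 4096  # assumed size of the readable first page, in bytes


def safe_above_zero_test(load_base, load_disp, tested_reg, escapes_when_zero):
    """True if a load may be hoisted above a branch that tests its base register.

    escapes_when_zero: the branch leaves the path to the load exactly when
    tested_reg == 0, so a hoisted load on the escape path reads from
    address 0 + load_disp, which must lie in the readable first page.
    """
    return (
        escapes_when_zero
        and tested_reg == load_base
        and 0 <= load_disp < FIRST_PAGE_LIMIT
    )


ok = safe_above_zero_test("r3", 8, "r3", True)             # typical list traversal
too_far = safe_above_zero_test("r3", 4096, "r3", True)     # beyond the first page
other_reg = safe_above_zero_test("r3", 8, "r5", True)      # branch tests another register
```

The predicate is specific to systems that map the first page as readable; on a system that faults on address zero, it would have to return False in every case.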
Software pipelining
Recent efforts in the scheduling of instructions for wide-issue and deeply pipelined computers have turned to loop-level transformations designed to offer greater opportunity for the resolution of delay slots and the parallel execution of instructions. The most widely used technique is called software pipelining [12-...]. [...]
The algorithm is a simple extension to the global scheduling framework described above. Most inner loops that are candidates to be globally scheduled are first unrolled once. The loop is then processed normally, except that only the original loop blocks (the real iteration) are scheduled. The loop blocks from the unrolled or virtual iteration are used only as sources of instructions for the scheduling of the original loop body. Movement of an instruction from the virtual iteration to the real iteration is a special code motion that requires the "twin" instruction in the real iteration to be moved to a prologue of the loop. When scheduling is complete, some instructions may have been moved to a loop prologue because of pipelined code motion. We construct a complementary epilogue section of code following the loop that contains all instructions of the virtual iteration which were not scheduled into the real iteration. The algorithm as stated handles the overlapping of only two iterations of the loop. A simple extension allows us to overlap an arbitrary number of iterations while constructing the appropriate prologue and epilogue sections of the loop. Figure 7 shows an example of a loop compiled for the PowerPC 601 before, during, and after software pipelining. Notice that the resulting loop avoids delays between the load and use of registers r7 and r8 in the loop.
Some machines include dynamic scheduling hardware in the floating-point unit, such as register renaming and a store queue which realizes many of the benefits of static pipelining. Software pipelining has a smaller observable effect on loops which can take advantage of these special facilities.
Duplication of code
The movement of instructions which require duplication presents a special problem for the scheduler. For example, consider the movement in Figure 4 of an instruction from a source block to a target block that is not on every path of execution that might reach the source block. The scheduler must create copies of the instruction to ensure that the program produces the same results. The scheduler computes a set of basic blocks, including the target block, that collectively dominates the source block and creates copies of the moved instruction in each block in the set. The set does not include more than one block which lies on the same path to the source block, and the instruction is thus executed only once on any path from the entry node. The scheduler also tries to minimize the number of duplicates that must be generated. The dominating set generated under these requirements is called a minimal independent separating set (MISS). An algorithm for computing the set is described in [25].
The algorithm assumes that there are no join-split control flow edges in the program. A control flow edge is called a split edge if the predecessor block has more than one successor and is called a join edge if the successor block has more than one predecessor. A join-split edge is both a split edge and a join edge. To ensure that the program does not contain a join-split edge, the control flow graph is modified before global scheduling by replacing join-split edges with a split edge leading to a dummy basic block which leads to a join edge. Given two basic blocks B1 and B2, the following algorithm computes the MISS N(B1, B2). The algorithm is guaranteed not to fail if the control flow graph contains no join-split edges.
1. Mark all blocks that are reachable from B1.
2. [...]
3. Remove a marked block u from A.
4. Let P be the set of immediate predecessors of u. Let [...]
[...] Mark all blocks reachable from members of P'.
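The preprocessing step that removes join-split edges is self-contained enough to sketch. The successor-list encoding of the control flow graph and the dummy block naming scheme are assumptions for illustration, and the sketch assumes there are no duplicate edges between the same pair of blocks.

```python
from collections import Counter


def split_join_split_edges(succ):
    """Return a new CFG in which every join-split edge goes through a dummy block.

    succ maps each block name to the list of its successor block names.
    """
    pred_count = Counter(s for targets in succ.values() for s in targets)
    new_succ = {block: list(targets) for block, targets in succ.items()}
    for block, targets in succ.items():
        for idx, target in enumerate(targets):
            is_split = len(targets) > 1          # predecessor has several successors
            is_join = pred_count[target] > 1     # successor has several predecessors
            if is_split and is_join:
                dummy = f"dummy_{block}_{target}"  # hypothetical naming scheme
                new_succ[block][idx] = dummy       # split edge: block -> dummy
                new_succ[dummy] = [target]         # join edge: dummy -> target
    return new_succ


cfg = {"A": ["B", "C"], "B": ["C"], "C": []}
fixed = split_join_split_edges(cfg)
```

In the example only the edge from `A` to `C` is a join-split edge; after the transformation the edge into the dummy block is a pure split edge and the edge out of it is a pure join edge.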
Elimination of false dependences
Data dependences are the fundamental constraints on the movement of code by instruction scheduling. True dependence represents data flow and usually cannot be avoided. Antidependence and output dependence are artifacts of variable naming and register allocation. We call these false dependences, since they can both be avoided by using a different selection of variable names or register numbers.
The global scheduler identifies situations where the presence of an antidependence interferes with the generation of well-scheduled code. In such situations, it creates a new register name for the written register in an antidependence pair and then introduces a register copy to the old register name. This renaming allows the antidependence edge in the PDG to be broken, and therefore allows the incident nodes to be reordered as necessary. Once the scheduling is complete, we are left with the problem of removing the register copy instructions. This is done by unrolling the containing loop once and then propagating the renamed register into all uses in the unrolled iteration, eliminating the need for the register copy instruction. Figure 8 illustrates this process for a simple program segment. A similar dynamic renaming technique is used in the global scheduling algorithm presented in [24].
Some true dependences can also be collapsed by the scheduler. In cases where two instructions which have immediate forms on the target machine are related by a true dependence, it may be possible to rewrite the dependent instruction to reflect the effect of the other instruction. For example, consider the instruction add r3,r3,1, followed by cmpwi cr1,r3,1. The compare depends on the incremented value of r3, but may be moved above the add instruction if it is rewritten as cmpwi cr1,r3,0.
Some machines in the PowerPC family lack dynamic register-renaming capabilities. This provides further incentive to statically rename registers involved in false dependences. Since many antidependences are created by the register allocator in an attempt to optimize register reuse, a special renaming optimization transforms the program after register assignment. This optimization understands when registers should be renamed for the target machine and which registers are available for use at any point in the program. The renaming transformation is done before postpass instruction scheduling to explicitly break false dependence edges.
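The collapsing of a true dependence between an add-immediate and a compare-immediate amounts to adjusting the compare constant. The sketch below assumes the 16-bit signed immediate field of the PowerPC compare-immediate form and ignores the case where the addition itself overflows, which a real compiler would have to rule out; the function name is hypothetical.

```python
SIMM_MIN, SIMM_MAX = -(1 << 15), (1 << 15) - 1  # 16-bit signed immediate range


def fold_compare_immediate(add_imm, cmp_imm):
    """Immediate for a compare hoisted above `reg = reg + add_imm`.

    Comparing (reg + add_imm) with cmp_imm is rewritten as comparing reg
    with (cmp_imm - add_imm). Returns None if the new constant does not
    fit the immediate field, in which case the compare cannot be moved.
    """
    folded = cmp_imm - add_imm
    if SIMM_MIN <= folded <= SIMM_MAX:
        return folded
    return None


hoisted = fold_compare_immediate(add_imm=1, cmp_imm=1)          # cmpwi cr1,r3,0
too_large = fold_compare_immediate(add_imm=-32768, cmp_imm=1)   # does not fit
```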
Postpass instruction scheduling
The instruction scheduling algorithms used in the compiler are general enough to create good schedules for most programs, but experience shows that some specialized optimizations are necessary to obtain the best performance for certain code patterns. One weakness of the general scheduling algorithm is the inability to rearrange branches to avoid delays in the speculative dispatch of instructions.
[Figure 8]
Before:
        lwz   r3,0(r4)
    l1: lwz   r6,0(r7)
        addi  r5,r3,1
        addi  r3,r6,1
        stwu  r5,4(r4)
        stwu  r3,4(r7)
        bdnz  l1
During:
        lwz   r3,0(r4)
    l1: lwz   r6,0(r7)
        addi  r8,r6,1
        addi  r5,r3,1
        COPY  r3=r8
        stwu  r3,4(r7)
        stwu  r5,4(r4)
        bdnz  l1
After:
        lwz   r3,0(r4)
    l1: lwz   r6,0(r7)
        addi  r8,r6,1
        addi  r5,r3,1
        stwu  r8,4(r7)
        stwu  r5,4(r4)
        lwz   r6,0(r7)
        addi  r3,r6,1
        addi  r5,r8,1
        stwu  r3,4(r7)
        stwu  r5,4(r4)
        bdnz  l1
Branch swapping
The RS/6000 machines execute branch instructions in parallel with the execution of fixed- and floating-point instructions. Optimal performance can be achieved only when the stream of fixed- and floating-point computation remains uninterrupted. We call a branch unresolved if it is a conditional branch which depends on an unavailable condition (that is, the condition computation has not yet completed) or a branch whose target address is similarly unavailable. When the branch unit encounters an unresolved branch, it cannot determine which instructions are to be executed next, but speculatively dispatches instructions to the fixed- and floating-point units from the fall-through path of the branch. If there are enough fixed- or floating-point instructions on the fall-through path, unresolved branches suffer no penalty if they are not taken. If, however, the branch unit encounters another branch (which may or may not be resolved), speculative dispatch of instructions stops until the first branch is resolved.
The compiler often has opportunities to rearrange the instructions in a program to avoid the branch unit hold-off caused by encountering a resolved branch while speculatively dispatching instructions on the fall-through path of an unresolved branch. An optimization called branch swapping identifies situations where an unresolved branch may contain a resolved branch on the fall-through path. In these situations, it attempts to swap the branches while duplicating any intervening code. A typical branch swapping opportunity is a conditional exit from a counted loop. In this case, we usually have an unresolved conditional branch that contains a branch-on-count-register instruction (always resolved) in the fall-through path. Figure 9 shows a simple example of branch swapping.
Branch reversal
Conditional branches which are not taken usually suffer no penalty on the POWER and POWER2 machines. It is possible to put the most likely target of a branch on the fall-through path in order to avoid the delay of a taken branch. The problem here is that it is not always clear what direction a branch is most likely to take. The compiler currently uses a set of heuristics to guess which direction the branch will most likely take. Sometimes the heuristics are fairly certain (for example, the branch closing a loop is likely to be taken); in other cases, the heuristic is weaker (for example, we assume that if a branch leads to a call in one successor path but not the other, the path without the call is more likely to be taken). More certainty can be achieved by using the information contained in a basic block execution profile.
Given information about the relative probability of branch directions, the branch reversal optimization may decide to reverse the direction of a branch, bringing the instructions on the taken path into the fall-through position and moving instructions on the fall-through path to the taken position. This often has the effect of moving infrequently executed code in a loop out of line, so that the fall-through paths of conditional branches lead directly to the closing of the loop.
Gluing
An additional opportunity to avoid branch unit hold-offs is the situation in which an unconditional branch lies in the fall-through path of an unresolved conditional branch. To avoid the hold-off upon encountering the unconditional branch, gluing copies code from the target of the unconditional branch to the fall-through of the unresolved branch, and then replaces the unconditional branch with one targeted to the end of the code that was copied.
Gluing also serves as the code-copying mechanism for branch reversal. Figure 10 illustrates the reversal of a branch and the subsequent gluing of instructions on the taken path into the fall-through position. Notice that if the branch is mostly taken in the original code, it is now mostly not taken in the transformed code, making the resulting loop faster.
Performance results
All of the instruction scheduling techniques described so far have been implemented as part of the latest production versions of the C Set ++ and XL FORTRAN/6000 compilers. Tables 1 and 2 show the results of running the SPECint92 and SPECfp92 benchmark programs on the POWER2 (RS/6000 Model 590) and PowerPC 601 (RS/6000 Model 250) machines. The various levels of scheduling measured are described in Table 3. The SPECint92 and SPECfp92 benchmark programs are described in Tables 4 and 5, respectively. The measurements should not be taken as official performance measurements, since they were not performed using the carefully selected set of options and quiet execution environment required for regular published results. The measurements compare the benchmark scores (expressed as SPECratios) using various phases of instruction scheduling against a baseline score which represents no instruction scheduling. The percentage of execution time reduction is relative to the next lowest level of instruction scheduling (B relative to A, C relative to B, and so on). All measurements include the maximum optimization available in the TOBEY compiler and also include options to specifically target the machines being measured.
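Because each percentage is stated relative to the next lowest scheduling level, the per-level reductions compound multiplicatively rather than add. The following sketch shows the arithmetic only; the function name and the sample values are hypothetical and are not taken from Tables 1 and 2.

```python
def net_reduction(step_reductions):
    """Combine per-level execution time reductions, each relative to the
    previous level, into one reduction relative to the baseline.

    step_reductions: fractions such as 0.10 for a 10% reduction.
    """
    remaining = 1.0  # fraction of baseline execution time still remaining
    for r in step_reductions:
        remaining *= (1.0 - r)
    return 1.0 - remaining


# Hypothetical example: 10% (B relative to A), then 5% (C relative to B).
example = net_reduction([0.10, 0.05])
```

With these illustrative values the remaining time is 0.90 × 0.95 = 0.855 of the baseline, so the combined reduction is 14.5%, slightly less than the 15% a simple sum would suggest.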
The measurements illustrate the effect of the various types of instruction scheduling on the performance of the two machines. Notice first that the effect is more significant on the POWER2, which has a larger degree of instruction-level parallelism (two integer and two floating-point units) than the PowerPC 601 machine. The net effect of all instruction scheduling on the POWER2 is 19.4% on SPECint92 and 37.8% on SPECfp92, whereas the respective improvements on the PowerPC 601 are 7.9% and 16.5%. Another interesting measurement is the small or nonexistent improvement due to loop-based optimizations such as unrolling and software pipelining on the SPECint92 programs. The performance of these programs does not depend as heavily as that of the SPECfp92 programs on the execution time of loops and is not expected to benefit much from these optimizations. Notice also that the effect of local instruction scheduling is more significant for the SPECfp92 programs and the effect of branch optimizations is more significant for the SPECint92 programs. These effects are an indication of the relative size of basic blocks in the two benchmark suites (SPECfp92 blocks are larger) and the relative frequency of branch instructions (SPECint92 has more frequent branching).
Before the application of any of the instruction scheduling techniques, inner loops are unrolled where heuristics suggest a benefit. Other parts of the TOBEY compiler attempt both inner and outer loop unrolling, but this unrolling phase is specifically concerned with exposing more independent instructions to the scheduler. The heuristics attempt to determine the optimal unrolling factor for a loop based on the number of execution units of each type available on the machine and the set of data dependence relations and associated pipeline delays applicable to the instructions in the loop.
Where possible, registers in unrolled iterations are given names distinct from the corresponding registers in the original iteration in order to avoid introducing antidependence relations between iterations.
In the prototype implementation, the TOBEY global scheduling phase is replaced by the VLIW global scheduling and enhanced pipeline scheduling algorithms [23, 24]. For each basic block, VLIW scheduling creates a set of instructions which are available to move forward.
Scheduling chooses the best instruction from the set of instructions that can move to a point in the program and moves the instances of that instruction forward. Scheduling also makes bookkeeping copies (similar to the duplication used in the TOBEY scheduler) for edges that join the path of the selected instructions' upward motion (but are not on these paths) and updates the set of available instructions associated with basic blocks only on the paths that were traversed by the moved instructions. This algorithm provides a general mechanism for the reordering of instructions in a program across arbitrary control flow while preserving the semantics of the original program. The key advantages of this technique over the existing TOBEY scheduler are the larger scope of analysis and the ability to handle the movement of conditional branch instructions between basic blocks. One of the key disadvantages is the greatly increased compilation time required.
Memory disambiguation
To extract large amounts of parallelism from sequential code, the instructions must be somewhat independent. Instructions which reference memory often have an unknown dependence relationship because the effective addresses referenced may not be known at compile time. The compiler is able to determine that certain variable references cannot refer to the same memory location, but elements of the same array and indirect pointer references are usually aliased to one another by the compiler. Some of these aliased references do not in fact interfere, and the scheduler can do a better job if they can be proven not to interfere at compile time.
When memory is referenced through the indexing of the same array, the indices can be compared and, in many cases, shown not to reference the same array element. The problem of determining that two array references are independent has many practical solutions originally designed as part of vectorizing compilers [32-35]. The problem is somewhat more difficult when encountered by instruction scheduling, since the subscript expressions of multidimensional arrays have usually been linearized. A framework for building symbolic indexing expressions along with a suite of dependence tests has been prototyped in the TOBEY compiler. The dependence tests are successful in proving that many array references are distinct, but fail to disambiguate most pointer dereferences.
In order to generate well-scheduled code for programs which have uncertain pointer-induced aliases, the scheduler includes a prototype implementation of run-time disambiguation. This technique creates a copy of certain loops and optimizes one of the copies assuming that certain reference pairs are distinct and the other copy assuming that they are aliased. A test is introduced to choose which loop to execute depending on the actual pointer values.
The first disambiguation technique (array-based dependence analysis) is uniformly successful in improving execution time but has a large compile-time cost. The run-time disambiguation technique is successful in some cases but requires a careful analysis to determine when the benefit outweighs the cost of the run-time test.
Benchmark descriptions from Tables 4 and 5 (program names not shown):
Reduces the size of input files by using Lempel-Ziv coding.
Calculates budgets, SPEC metrics, and amortization schedules in a spreadsheet based on the UNIX cursor-controlled package "curses."
Translates preprocessed C source files into optimized Sun-3 assembly language output.
Same as mdljdp2 but single precision.
Solves particle and Maxwell's equations on a Cartesian mesh.
Generates two-dimensional, boundary-fitted coordinate systems around general geometric domains.
Traces rays through an optical surface containing spherical and planar surfaces.
Trains a neural network using back-propagation.
Simulates the human ear by converting a sound file to a cochleagram using fast Fourier transforms and other math library functions.
Solves the system of shallow-water equations using finite difference approximations.
Calculates masses of elementary particles in the framework of the quark gluon theory.
Uses hydrodynamical Navier-Stokes equations to calculate galactical jets.
Executes seven program kernels representative of operations used in NASA applications.
Calculates multi-electron integral derivatives.
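The run-time disambiguation technique described under Memory disambiguation can be sketched as loop versioning guarded by an overlap test. This is an illustrative model, not the TOBEY prototype: memory is modeled as a Python list indexed by integer "addresses", and the names `ranges_overlap` and `scale_into` are hypothetical. The fast version stands in for a schedule that hoists loads above stores, which is legal only when the two references are distinct.

```python
def ranges_overlap(a_start, a_len, b_start, b_len):
    """Run-time test: do [a_start, a_start+a_len) and [b_start, b_start+b_len) intersect?"""
    return a_start < b_start + b_len and b_start < a_start + a_len


def scale_into(mem, dst, src, n, k):
    """Compute mem[dst+i] = mem[src+i] * k for i in 0..n-1 with sequential semantics."""
    if not ranges_overlap(dst, n, src, n):
        # Version optimized under the assumption that the references are distinct:
        # all loads are issued before any store, as an aggressive schedule might do.
        loaded = [mem[src + i] for i in range(n)]
        for i in range(n):
            mem[dst + i] = loaded[i] * k
    else:
        # Version that assumes the references may alias: keep the original
        # load/store order so that each store is seen by later loads.
        for i in range(n):
            mem[dst + i] = mem[src + i] * k
    return mem
```

Both versions produce the same result as the original loop; the test only selects which schedule is safe for the actual pointer values. Whether the test pays for itself depends on the trip count and on how much the optimized copy gains, which is the profitability question raised above.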
Profile-directed optimizations
Many aggressive algorithms for extracting parallelism from sequential code assume some knowledge of the paths of execution that are most often taken. In fact, some techniques go so far as to optimize the most frequent paths at the expense of the less frequently taken paths. If we are to use such aggressive techniques, we must determine the most frequently taken paths on the basis of actual program executions. It is necessary to evaluate the dynamic behavior of the program while processing typical data sets and then feed this information back into the compiler. The level of detail required by the compiler to make decisions about the likelihood of certain execution paths is beyond what most standard profiling programs provide. Therefore, we are investigating various techniques for obtaining execution profile information at the basic block level. In addition, we are investigating enhancements to the compiler which might take advantage of this information, as well as solutions to the problem of keeping the information correct while performing the normal set of optimizing program transformations.
