To effectively exploit instruction level parallelism, the compiler must move instructions across branches. When an instruction is moved above a branch that it is control dependent on, it is considered to be speculatively executed, since it is executed before it is known whether or not its result is needed. There are potential hazards when speculatively executing instructions. If these hazards can be eliminated, the compiler can schedule the code more aggressively. The hazards of speculative execution are outlined in this paper. Three architectural models (restricted, general, and boosting), which have increasing amounts of support for removing these hazards, are discussed. The performance gained by each level of additional hardware support is analyzed using the IMPACT C compiler, which performs superblock scheduling for superscalar and superpipelined processors.
Introduction
For non-numeric programs, there is insufficient instruction level parallelism available within a basic block to exploit superscalar and superpipelined processors [1, 2, 3]. To schedule instructions beyond the basic block boundary, instructions have to be moved across conditional branches. There are two problems that need to be addressed in order for a scheduler to move instructions above branches.
First, to schedule the code efficiently, the scheduler must identify the likely executed paths and then move instructions along these paths. Second, when the branch is mispredicted, executing the instruction should not alter the behavior of the program.
Dynamically scheduled processors can use hardware branch prediction [4] to schedule instructions from the likely executed path, or schedule instructions from both paths of a conditional branch, as in the IBM 360/91 [5]. Statically scheduled processors can either predict the branch direction using profiling or some other static branch prediction mechanism, or use guarded instructions to schedule instructions along both paths [6]. For loop intensive code, static branch prediction is accurate, and techniques such as loop unrolling and software pipelining are effective at scheduling code across iterations in a well-defined manner [7, 8, 9, 10]. For control intensive code, profiling provides accurate branch prediction [11]. Once the direction of the branch is determined, blocks which tend to execute together can be grouped to form a trace [12, 13]. To reduce some of the bookkeeping complexity, the side entrances to the trace can be removed to form a superblock [14].
In dynamically and statically scheduled processors in which the scheduling scope is enlarged by predicting the branch direction, there are possible hazards to moving instructions above branches if the instruction is speculatively executed. An instruction is speculatively executed if it is moved above a conditional branch that it is control dependent upon [15]. A speculatively executed instruction should neither cause an exception which terminates the program nor incorrectly overwrite a value when the branch is mispredicted. Various hardware techniques can be used to prevent such hazards. Buffers can be used to store the values of the moved instructions until the branch commits [16, 2, 17]. If the branch is taken, the values in the buffers are squashed. In this model, exception handling can be delayed until the branch commits. Alternatively, non-trapping instructions can be used to guarantee that a moved instruction does not cause an exception [18].
In this paper we focus on static scheduling using profiling information to predict the branch direction. We present a superblock scheduling algorithm that supports three code percolation models which require varying degrees of hardware support to enable code motion across branches.
We present the architecture support required for each model. Our experimental results show the performance of the three models on superscalar and superpipelined processors.
Superblock Scheduling
Superblock scheduling is an extension to trace scheduling [12] which reduces some of the bookkeeping complexity. The superblock scheduling algorithm is a four-step process: (1) trace selection, (2) superblock formation and enlarging, (3) dependence graph generation, and (4) list scheduling.
Steps 3 and 4 are used for both prepass and postpass code scheduling. Prepass code scheduling is performed prior to register allocation to reduce the effect of artificial data dependences that are introduced by register assignment [19, 20]. Postpass code scheduling is performed after register allocation.
The C code segment in Figure 1 will be used in this paper to illustrate the superblock scheduling algorithm. Compiling the C code segment for a load/store architecture produces the assembly language shown in Figure 2. The assembly code format is
opcode destination, source1, source2
where the number of source operands depends on the opcode. The weighted control flow graph of the assembly code segment is shown in Figure 3. The weights on the arcs of the graph correspond to the execution frequency of the control transfers. For example, basic block 2 (BB2) executed 100 times, with the control going from BB2 to BB4 90% of the time and from BB2 to BB3 the remaining 10% of the time. This information can be gathered using profiling.
The first step of the superblock scheduling algorithm is to use trace selection to form traces from the most frequently executed paths of the program [12]. Figure 4 shows the portion of the control flow graph corresponding to the while loop after trace selection. The dashed box outlines the most frequently executed path of the loop. In addition to a top entry and a bottom exit point, traces can have multiple side entry and exit points. A side entry point is a branch into the middle of a trace and a side exit is a branch out of the middle of a trace. For example, the arc from BB2 to BB3 in Figure 4 is a side exit and the arc from BB3 to BB5 is a side entrance.
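The trace selection step can be illustrated with a small sketch. This is not the IMPACT implementation; it is a minimal greedy version under stated assumptions. The function name `select_traces`, the `min_prob` cutoff, and every weight except the 100/90/10 figures quoted for BB2 are hypothetical, chosen so that the resulting trace matches the BB2, BB4, BB5 path described above.

```python
def select_traces(block_weight, arc_weight, min_prob=0.5):
    """Greedily group basic blocks into traces along the likeliest arcs.

    block_weight: {block: execution count}
    arc_weight:   {(src, dst): number of times control went src -> dst}
    min_prob:     assumed cutoff; an arc is followed only if it accounts for
                  at least this fraction of the executions of the block at
                  the end of the trace being extended.
    """
    visited = set()
    traces = []

    def best_neighbor(block, forward):
        # Collect unvisited neighbors reached through sufficiently likely arcs.
        candidates = []
        for (src, dst), weight in arc_weight.items():
            here, other = (src, dst) if forward else (dst, src)
            if here != block or other in visited:
                continue
            if weight >= min_prob * block_weight[block]:
                candidates.append((weight, other))
        return max(candidates)[1] if candidates else None

    # Seed each trace with the hottest block that is not yet in a trace.
    for seed in sorted(block_weight, key=block_weight.get, reverse=True):
        if seed in visited:
            continue
        trace = [seed]
        visited.add(seed)
        tail = seed
        while (nxt := best_neighbor(tail, forward=True)) is not None:
            trace.append(nxt)      # grow the trace downward
            visited.add(nxt)
            tail = nxt
        head = seed
        while (prev := best_neighbor(head, forward=False)) is not None:
            trace.insert(0, prev)  # grow the trace upward
            visited.add(prev)
            head = prev
        traces.append(trace)
    return traces


# Hypothetical profile for the loop body (only the BB2 numbers come from the text).
block_weight = {"BB2": 100, "BB3": 10, "BB4": 90, "BB5": 100}
arc_weight = {
    ("BB2", "BB4"): 90, ("BB2", "BB3"): 10,
    ("BB3", "BB5"): 10, ("BB4", "BB5"): 90,
    ("BB5", "BB2"): 99,
}
traces = select_traces(block_weight, arc_weight)
```

With these weights the hot path BB2, BB4, BB5 forms one trace and BB3 is left in a trace of its own, so the arc from BB2 to BB3 is a side exit and the arc from BB3 to BB5 is a side entrance.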
To move code across a side entrance, complex bookkeeping is required to ensure correct program execution [12, 21]. For example, to schedule the code within the trace efficiently, it may be desirable to move instruction i12 from BB5 to BB4. To ensure correct execution when the control flow is through BB3, i12 must also be copied into BB3 and the branch instruction i10 must be modified to point to instruction i13. If there were another path out of BB3, then a new basic block would need to be created between BB3 and BB5 to hold instruction i12 and a branch to BB5. In this case, the branch instruction i10 would branch to the new basic block.
The second step of the superblock scheduling algorithm is to form superblocks. Superblocks avoid the complex repairs associated with moving code across side entrances by removing all side entrances from a trace. Side entrances to a trace can be removed using a technique called tail duplication [14]. A copy of the tail portion of the trace, from the side entrance to the end of the trace, is appended to the end of the function. All side entrances into the trace are then moved to the corresponding duplicate basic blocks. The remaining trace with only a single entrance is a superblock. Figure 5 shows the loop portion of the control flow graph after superblock formation and branch expansion (footnote 1). During tail duplication, BB5 is copied to form superblock 2 (SB2). Since BB3 only branches to BB5, the branch instruction i10 can be removed and the two basic blocks merged to form BB3'. Note that superblock 1 (SB1) no longer has a side entrance.
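Tail duplication as described above can be sketched on a control flow graph reduced to block names and arcs. The function `tail_duplicate` and the primed naming of copies are assumptions made for illustration. The sketch does not copy instructions, rescale profile weights, or merge BB3 with the copy of BB5, all of which a real compiler would have to do.

```python
def tail_duplicate(trace, arcs):
    """Remove side entrances from a trace by duplicating its tail.

    trace: list of block names in trace order
    arcs:  set of (src, dst) control flow arcs
    Returns (superblock, duplicated_blocks, new_arcs).
    """
    in_trace = set(trace)
    # Positions of non-head trace blocks that are entered from outside the trace.
    side_entries = [
        i for i, block in enumerate(trace)
        if i > 0 and any(dst == block and src not in in_trace for src, dst in arcs)
    ]
    if not side_entries:
        return list(trace), [], set(arcs)

    # Everything from the first side entrance to the end of the trace is copied.
    tail = trace[min(side_entries):]
    copy_of = {block: block + "'" for block in tail}

    new_arcs = set()
    for src, dst in arcs:
        if src not in in_trace and dst in copy_of:
            new_arcs.add((src, copy_of[dst]))   # redirect the side entrance
        else:
            new_arcs.add((src, dst))
        if src in copy_of:
            # The copy keeps the original's exits; arcs inside the tail
            # connect copy to copy.
            new_arcs.add((copy_of[src], copy_of.get(dst, dst)))
    return list(trace), [copy_of[b] for b in tail], new_arcs


loop_arcs = {
    ("BB2", "BB4"), ("BB2", "BB3"), ("BB3", "BB5"),
    ("BB4", "BB5"), ("BB5", "BB2"),
}
superblock, copies, new_arcs = tail_duplicate(["BB2", "BB4", "BB5"], loop_arcs)
```

For the loop example, the only side entrance is the arc from BB3 to BB5, so BB5 is the only block duplicated and BB3 is redirected to the copy.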
Loop-based transformations such as loop peeling and loop unrolling [22] can be used to enlarge superblock loops, which are superblocks that end with a control flow arc to themselves. For superblock loops that usually iterate only a small number of times, a few iterations can be peeled off and added to the superblock. For most cases, the peeled iterations will suffice and the body of the loop will not need to be executed. For superblock loops that iterate a large number of times, the superblock loop is unrolled several times.
After superblock formation, many classic code optimizations are performed to take advantage of the profile information encoded in the superblock structure and to clean up the code after the above transformations. These optimizations include the local and global versions of: constant propagation [...]
Footnote 1: Note that the profile information is scaled during tail duplication. This reduces the accuracy of the profile information.
The third step in the superblock scheduling algorithm is to build a dependence graph. The dependence graph represents the data and control dependences between instructions. There are three types of data dependences: flow, anti, and output. Control dependences represent the ordering between a branch instruction and the instructions following the branch. There is a control dependence between a branch and a subsequent instruction i if the branch instruction must execute before instruction i.
The last step in the scheduling algorithm is to perform list scheduling, using the dependence graph and instruction latencies to indicate which instructions can be scheduled together. The general idea of the list scheduling algorithm is to pick, from a set of nodes (instructions) that are ready to be scheduled, the best combination of nodes to issue in a cycle. The best combination of nodes is determined by using heuristics which assign priorities to the ready nodes [20].
Examples of code motion can be shown using the assembly code in Figure 6. This is the assembly code of the C code in Figure 1 [...] and r4 have been renamed to r5 and r6, respectively. Note that once the loop has been unrolled and renamed, branch I9 must branch to L1' to restore r1 and r4 before the code at L1 is executed (footnote 2). Also note that the code within the superblock corresponding to L0 is placed sequentially in instruction memory. The live-out sets of the three branches within the superblock loop are shown in Figure 7.
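The list scheduling step described above can be sketched as follows. The priority heuristic used here (latency-weighted height in the dependence graph) is one common choice and is an assumption, since the text only says that heuristics assign priorities to ready nodes. The instruction names are hypothetical; the latencies follow the one-cycle ALU and two-cycle load figures used for the example schedules.

```python
def list_schedule(latency, deps, issue_width):
    """Cycle-by-cycle list scheduling of an acyclic dependence graph.

    latency:     {instruction: latency in cycles}
    deps:        iterable of (producer, consumer) dependence arcs
    issue_width: maximum number of instructions issued per cycle
    Returns {instruction: issue cycle}.
    """
    preds = {n: [] for n in latency}
    succs = {n: [] for n in latency}
    for producer, consumer in deps:
        preds[consumer].append(producer)
        succs[producer].append(consumer)

    # Priority: length of the longest latency-weighted path to a leaf.
    priority = {}

    def height(node):
        if node not in priority:
            priority[node] = latency[node] + max(
                (height(s) for s in succs[node]), default=0)
        return priority[node]

    for node in latency:
        height(node)

    issue_cycle = {}
    cycle = 0
    while len(issue_cycle) < len(latency):
        # A node is ready once every producer's result is available.
        ready = [
            n for n in latency
            if n not in issue_cycle and all(
                p in issue_cycle and issue_cycle[p] + latency[p] <= cycle
                for p in preds[n])
        ]
        ready.sort(key=lambda n: (-priority[n], n))
        for node in ready[:issue_width]:
            issue_cycle[node] = cycle
        cycle += 1
    return issue_cycle


latency = {"load_a": 2, "load_b": 2, "add": 1, "inc": 1, "other": 1}
deps = [("load_a", "add"), ("load_b", "add"), ("add", "inc")]
schedule = list_schedule(latency, deps, issue_width=2)
```

On a 2-issue machine the two loads issue together in cycle 0, the independent instruction fills cycle 1 while the loads complete, and the dependent chain issues in cycles 2 and 3.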
Footnote 2: Tail duplication can be recursively applied to form a superblock at label L1'.
Figure 8: Dependence graphs for the three superblock scheduling models.
Performing dependence analysis on I1 through I12 for each code percolation model produces the dependence graphs shown in Figure 8. The data dependences are represented by solid arcs and labeled with f for flow and o for output (there are no anti dependences). The control dependences are represented by dashed arcs. It is clear from the corresponding number of control dependence arcs in the three graphs that code motion in the restricted code percolation model (9 arcs) is the most limited, followed by general (6 arcs) and then boosting (3 arcs). In the general code percolation model, a control dependence arc can be removed if the destination of the sink of the arc is not in the live-out set of the source of the arc. In all cases, control dependence arcs between two branch instructions cannot be removed unless the order of the branches does not matter (e.g., in a switch statement).
Other than this constraint, all remaining control dependence arcs can be removed in the boosting code percolation model.
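The arc removal rules for the three models can be summarized in a short sketch. The `Instr` record and the function `can_remove_control_arc` are hypothetical names. The rules encode our reading of this paper: Restriction 1 is taken to mean that the destination is not in the live-out set of the branch, trapping instructions stay below the branch in the restricted model, stores stay below the branch in the general model, and branch-to-branch arcs are always kept. The switch statement exception is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instr:
    name: str
    dest: Optional[str] = None   # destination register, if any
    is_branch: bool = False
    is_store: bool = False
    can_trap: bool = False       # memory access, integer divide, floating point


def can_remove_control_arc(model, branch_live_out, instr):
    """Return True if the control dependence arc branch -> instr can be removed."""
    if instr.is_branch:
        return False             # branch order is preserved in every model
    if model == "boosting":
        return True              # hardware buffers results until the branch commits
    overwrites_live_value = instr.dest in branch_live_out
    if model == "general":
        return not instr.is_store and not overwrites_live_value
    if model == "restricted":
        return not instr.can_trap and not overwrites_live_value
    raise ValueError(f"unknown model: {model}")


live_out = {"r2"}
load = Instr("load", dest="r7", can_trap=True)
add = Instr("add", dest="r2")
branch = Instr("beq", is_branch=True)
```

Here the load can move above the branch under the general and boosting models but not under the restricted model, and the add that writes r2 can move only under boosting because r2 is live on the taken path.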
The code schedules determined from the graphs in Figure 8 are shown in Figure 9 . The actions that result when the code is executed on processors without additional hardware support are given.
The code schedules assume uniform function unit resources, with the exception that only one branch can be executed per cycle (footnote 3). The integer ALU instructions have a one cycle latency and the load instructions have a two cycle latency.
For restricted code percolation (both restrictions observed), the loop takes 9 cycles to execute and the program executes properly without additional hardware support. When only Restriction 1 is observed (general code percolation), load instruction I5 can be issued in cycle t1. This reduces the loop execution time to 5 cycles. Note that since only one branch can be executed per cycle, branch I6 cannot be issued until cycle t4. While this does not affect the code schedule, if there is no additional hardware support, instruction I8 will cause a segmentation violation by accessing memory through a nil pointer. In the boosting code schedule, there are no restrictions on code motion across branches and thus instruction I7 can be issued in cycle t2. Since r2 is in the live-out set of instruction I6, without additional hardware support, count will be incremented one too many times and, if the program terminated normally, avg would be incorrect. Furthermore, as in the case of general code percolation, without hardware support there will be a segmentation violation which will terminate the program. In this example, the schedule using boosting code percolation does not improve upon the schedule achieved from general code percolation.
Footnote 3: This assumption is here in order to illustrate the hazards of removing Restriction 1. In our simulations we do not impose this restriction unless specified.
Architecture Support
In this section we discuss the details of the architecture support required by the three scheduling models. Architecture support is required to relax the restrictions on upward code motion. An instruction that is moved above a branch is referred to as a speculative instruction. When Restriction 1 is relaxed, a speculative instruction can overwrite a value used on the taken path. Therefore, some form of buffering is required to ensure that the value is not written until the branch direction is determined. To relax Restriction 2, a speculative instruction should not cause an exception if the branch is taken. In addition, when any instruction is moved above a branch and the branch is taken, the instruction may cause an extra page fault. While additional page faults do not alter the program's outcome, they will reduce the program's performance. To avoid extra page faults, an alternative approach is to handle page faults of speculative instructions when the branch commits.
The next three sections describe the architecture support needed for each code percolation model. Table 1 provides a summary of the three models. 
Restricted Code Percolation
The restricted code percolation model assumes that the underlying architecture supports a class of trapping instructions. These typically include floating point instructions, memory access instructions, and the integer divide instruction. These instructions cannot be moved across a branch [...]
The hardware support for handling page faults does not need to be modified to support restricted code percolation. Page faults are handled when they occur. Since memory accesses are not speculatively executed, the only source of additional page faults will be from instruction memory page faults. Since instructions are speculatively executed along the most likely executed path, they will likely be in the working set in memory and thus will not usually cause additional page faults.
General Code Percolation
The general code percolation model assumes that the trapping instructions in the restricted code percolation model have non-trapping counterparts [18, 24]. Our implementation of general code percolation assumes that there are non-trapping versions for integer divide, memory loads, and floating point arithmetic. These instructions can also be moved across a branch if they do not violate Restriction 1. Memory stores are still not percolated above branches for two reasons. First, it is difficult to perform perfect memory disambiguation to ensure that Restriction 1 is not violated. Second, in a load/store architecture, stores are typically not on the critical path and thus will not impact the performance as much as a load or an arithmetic instruction.
There are two types of exceptions: arithmetic and access violation. To implement non-trapping instructions, the function unit in which the exception condition occurs must have hardware to detect whether the instruction is trapping or non-trapping and only raise the exception flag for a trapping instruction. For a non-trapping load instruction, if there is an access violation, the load is aborted. When an exception condition exists for a non-trapping instruction, the value written into the destination register will be garbage. The use of this value is unpredictable; it may eventually cause an exception or it may lead to an incorrect result. Thus, code compiled with general code percolation will not necessarily raise an exception when an exception condition exists.
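The difference between a trapping and a non-trapping load can be sketched in a few lines. This is a behavioral illustration only: the `load` function, the `AccessViolation` exception, and the `GARBAGE` constant are hypothetical, and memory is modeled as a dictionary of valid addresses. The paper only states that the destination receives a garbage value; the particular constant is an arbitrary placeholder.

```python
class AccessViolation(Exception):
    """Raised by a trapping load that touches an invalid address."""


GARBAGE = 0xDEADBEEF  # placeholder for the unspecified garbage value


def load(memory, address, trapping=True):
    """Return memory[address]; model the two flavors of load on a violation."""
    if address in memory:
        return memory[address]
    if trapping:
        raise AccessViolation(f"invalid address {address:#x}")
    # Non-trapping version: the access is aborted and no exception flag is
    # raised, so the destination register ends up holding garbage.
    return GARBAGE


memory = {0x1000: 42}
valid = load(memory, 0x1000, trapping=False)
silent = load(memory, 0x0, trapping=False)   # nil pointer, no exception
```

The second call shows why a detectable error can become undetectable: the nil pointer access returns a value instead of terminating the program.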
When an exception condition exists for a speculative instruction and the branch is taken, this condition is ignored as it should be. However, it is also ignored when the branch is not taken.
The garbage value returned may eventually cause an exception but there is no guarantee. If the program does not terminate due to an exception, the output will likely be incorrect. Since the program has an error (i.e., an exception condition exists in the original program), it is valid to produce incorrect output. However, from a debugging point of view, a detectable error has become undetectable, which is undesirable. Therefore, code should first be compiled with restricted code percolation until the code is debugged. Then general code percolation can be turned on to improve the performance. This approach may not be suitable for critical applications such as transaction processing where unreported errors are not acceptable.
Some applications such as concurrent garbage collection rely on trapping instructions to execute properly. For such applications, a compiler flag can be used to prohibit certain instructions from being speculatively executed. Alternatively, additional hardware support can be used to handle exceptions for speculative instructions [25].
As with restricted code percolation, page faults are handled when they occur. No additional hardware beyond traditional hardware support is required to handle page faults. Since memory accesses can be percolated, the number of page faults for the general model may be larger than the number for the restricted model.
Boosting Code Percolation
Boosting code percolation is based on Smith et al.'s speculative execution model [17]. Speculative boosted instructions which violate Restrictions 1 and 2 can be moved above a branch because no action is committed until the branch commits. The basic architecture support for boosting is shown in Figure 10. This architecture is similar to the TORCH architecture [17]. The shadow register file is required to hold the result of a non-store boosted instruction until the branch commits. The shadow store buffer is required to hold the value of a boosted store instruction until the branch commits.
Instructions that are moved above conditional branches are marked as boosted. An instruction can be moved above more than one branch instruction. This would require additional bits to indicate the [...]
If the boosted instruction finishes before the branch commits, the result is stored in the shadow register file until the branch commits. Since code is scheduled within a superblock, instructions are moved across a branch from the not-taken path. Thus, if the branch is not taken, the values in the shadow register file are copied to the sequential register file. However, if the branch is taken, [...]
Footnote 4: If multiple branches can be issued in the same cycle, there must be an ordering of branches and hardware to support multiple squashing delay slots. Boosted instructions can be issued with multiple branches provided they are issued in the proper slot.
All exception handling for boosted instructions, including page fault handling, is delayed until the branch commits. Page faults could also be handled immediately in this model, but the hardware is available to delay page fault handling until the branch commits. When a boosted instruction causes a page fault or exception, the condition is stored until the branch commits. If the branch is taken, the exception condition is ignored. Otherwise, the values in the shadow buffers are cleared and the boosted instructions and delay slot instructions (boosted or not) in the execution pipeline are squashed. At this point the processor is in a sequentially consistent state and the boosted instructions are reexecuted sequentially until the exception occurs. To reexecute the boosted instructions, the program counter of the first boosted instruction, pc boost, must be saved (footnote 5). The instructions can either be reexecuted in software by the exception handling routine or in hardware. In the software scheme, the only additional hardware for exception handling is for the pc boost register.
In the hardware scheme, the instruction fetch mechanism must be altered to fetch from pc boost when an exception condition exists when the branch commits. Only instructions that are marked as boosted are reexecuted; all others are squashed at the instruction fetch unit. After an exception on a boosted instruction is handled (assuming it does not terminate the program), only boosted instructions are executed until the branch instruction. Then the exception condition is cleared and instruction fetch returns to normal operation.
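The commit behavior of the boosting model can be sketched as a small state machine. The class `BoostedState` and its method names are hypothetical, the shadow store buffer and delay slots are left out, and only a single branch is modeled. Discarding the shadow values when the branch is taken is our reading of the model (the introduction states that buffered values are squashed in that case).

```python
class BoostedState:
    """Sequential registers plus a shadow register file for one pending branch."""

    def __init__(self, registers):
        self.registers = dict(registers)   # sequential register file
        self.shadow = {}                    # results of boosted instructions
        self.exception_pending = False
        self.pc_boost = None                # pc of the first boosted instruction

    def execute_boosted(self, pc, dest, value, raises=False):
        """Record the effect of a boosted instruction without committing it."""
        if self.pc_boost is None:
            self.pc_boost = pc
        if raises:
            self.exception_pending = True   # handling is delayed until commit
        else:
            self.shadow[dest] = value

    def commit_branch(self, taken):
        """Resolve the branch; return a pc to reexecute from, or None."""
        restart_pc = None
        if not taken:
            if self.exception_pending:
                # Discard shadow state and reexecute the boosted instructions
                # sequentially so the exception occurs at the right place.
                restart_pc = self.pc_boost
            else:
                self.registers.update(self.shadow)
        # Taken branch: shadow values and any stored exception are dropped.
        self.shadow.clear()
        self.exception_pending = False
        self.pc_boost = None
        return restart_pc


cpu = BoostedState({"r2": 5})
cpu.execute_boosted(pc=0x40, dest="r2", value=6)
after_taken = (cpu.commit_branch(taken=True), dict(cpu.registers))

cpu.execute_boosted(pc=0x40, dest="r2", value=6)
after_not_taken = (cpu.commit_branch(taken=False), dict(cpu.registers))

cpu.execute_boosted(pc=0x40, dest="r7", value=0, raises=True)
restart = cpu.commit_branch(taken=False)
```

In the software scheme the returned program counter would be handed to the exception handling routine; in the hardware scheme it would redirect instruction fetch.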
Experiments
The purpose of this study is to analyze the cost-effectiveness of the three scheduling models. In the previous section we analyzed the cost with respect to the amount of hardware support required by each model. In this section we analyze the performance of each model for superscalar and superpipelined processors.
Methodology
To study the performance of the three scheduling models, each model has been implemented in the superblock scheduler of the IMPACT-I C compiler. The IMPACT-I C compiler [24] is a retargetable, optimizing compiler designed to generate efficient code for superscalar and superpipelined processors. The performance of code generated by the IMPACT-I C compiler for the MIPS R2000 is slightly better than that of the commercial MIPS C compiler (footnote 6) [14]. Therefore, the scheduling results reported in this paper are based on highly optimized code.
The IMPACT-I C compiler uses profile information to form superblocks. The profiler measures the execution count of every basic block and collects branch statistics. A machine description file is used to characterize the target machine. The machine description includes the instruction set, [...]
Footnote 6: MIPS Release 2.1 using the -O4 option.
To evaluate the performance of a code percolation model on a specific target architecture, a benchmark was compiled using the composite profile of 20 different inputs. Using a different input than those used to compile the program, profiling information is used to calculate the best and worst case execution times of each benchmark. The execution time of a benchmark is calculated by multiplying the time to execute each superblock by its weight and adding a mispredicted branch penalty. The worst case execution time is due to long instruction latencies that protrude from one superblock to another superblock. For the benchmark programs used in this study (Table 1), the difference between the best case and the worst case execution time is always negligible.
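The execution time calculation described above amounts to a weighted sum. The sketch below is a simplified reading: the function name `estimate_cycles`, the sample numbers, and the treatment of the misprediction term as a count multiplied by a fixed per-branch penalty are all assumptions, since the exact form of the penalty is not given here.

```python
def estimate_cycles(superblocks, mispredicted_branches, penalty_cycles):
    """Estimate benchmark execution time in cycles from profile weights.

    superblocks: iterable of (schedule_length_in_cycles, execution_weight)
    """
    scheduled = sum(cycles * weight for cycles, weight in superblocks)
    return scheduled + mispredicted_branches * penalty_cycles


# Hypothetical profile: a 5-cycle superblock executed 100 times and a
# 9-cycle superblock executed 10 times, with 10 mispredictions of 2 cycles.
total = estimate_cycles([(5, 100), (9, 10)], mispredicted_branches=10,
                        penalty_cycles=2)
```

Speedup figures would then follow by dividing the estimate for the base processor by the estimate for the processor and scheduling model under study.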
Processor Architecture
The base processor is a pipelined, single-instruction-issue processor that supports the restricted code percolation model with basic block scheduling. Its instruction set is a superset of the MIPS R2000 instruction set with additional branching modes [26]. Table 3 shows the instruction latencies.
Instructions are issued in order. Read-after-write hazards are handled by stalling the instruction-unit pipeline. The microarchitecture uses a squashing branch scheme [27] and profile-based branch prediction. Branch prediction is used to lay out the superblocks such that the branches are likely not taken. If the branch is taken, the instructions following the branch are squashed. If the branch is predicted taken, the base processor has one branch delay slot. The processor has 64 integer registers and 32 floating-point registers (footnote 7).
The superscalar version of this processor fetches multiple instructions into an instruction buffer and decodes them in parallel. An instruction is blocked in the instruction unit if there is a read-after-write hazard between it and a previous instruction. All the subsequent instructions are also blocked. All the instructions in the buffer are issued before the next instruction is fetched. The maximum number of instructions that can be decoded and dispatched simultaneously is called the issue rate. The superscalar processor also contains multiple function units. In this study, unless otherwise specified, we assume uniform function units where every instruction can be executed from every instruction slot. When the issue rate is greater than one, the number of branch slots increases [27].
Footnote 7: The code for these benchmarks contains very few floating point instructions.
The superpipelined version of this processor has deeper pipelining for each function unit. If the number of pipeline stages is increased by a factor P, the clock cycle is reduced by approximately the same factor. The latency in clock cycles is longer, but in real time it is the same as the base microarchitecture. The throughput increases by up to the factor P. We refer to the factor P as the degree of superpipelining. The instruction fetch and decode unit is also more heavily pipelined to keep the microarchitecture balanced. Because of this, the number of branch slots allocated for the predicted-taken branches increases with the degree of pipelining [27].
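The relationship between the degree of superpipelining, latency in cycles, and latency in real time can be checked with a small calculation. The function `superpipelined` and the sample values are hypothetical, and the model is idealized: the clock cycle is assumed to shrink by exactly the factor P, whereas the text says only approximately.

```python
def superpipelined(base_latency_cycles, base_cycle_time, degree):
    """Scale a base latency for a processor with superpipelining degree P.

    Returns (latency in cycles, cycle time, latency in real time).
    """
    cycle_time = base_cycle_time / degree          # clock cycle shrinks by P
    latency_cycles = base_latency_cycles * degree  # more, shorter stages
    return latency_cycles, cycle_time, latency_cycles * cycle_time


# A two-cycle load on a base machine with an 8.0 time unit clock, with P = 2.
load_latency = superpipelined(2, 8.0, 2)
```

The load takes four cycles instead of two, but each cycle is half as long, so the real-time latency is unchanged while up to P times as many instructions can be in flight.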
Results
In this section we first motivate the need for superblock scheduling and then analyze the relative performance of each of the superblock scheduling models for superscalar and superpipelined architectures. In addition, we characterize the performance of the models for various hardware resource assumptions.
Basic Block vs. Superblock Scheduling
First, we want to verify the need for superblock scheduling. Figure 11 shows that the speedup that can be achieved using basic block scheduling for an 8- [...]
Scheduling Superscalar and Superpipelined Processors
Next we want to analyze the performance of the three scheduling models on superscalar and superpipelined processors with uniform function units. Figure 12 shows the speedup of the three scheduling models for a superscalar processor model. [...] the same for superpipelined as for superscalar. Comparing the performance of the three models on a superscalar processor for issue rates 2 and 4 (Figure 12) with the performance of the models for the pure superpipelined processors in Figures 13 and 14, it can be seen that all models perform slightly better on the superscalar processors. This is due to the higher branch penalty for superpipelined processors. [...] expansion, and accumulator expansion. Furthermore, since the boosting code percolation model supports speculatively executed stores, these results show that the benefit of moving stores above branches is small. The fact that both the general and boosting models perform considerably better than the restricted code percolation model implies that moving any or all of the following types of instructions (memory loads, integer divide, and floating point arithmetic) greatly reduces the critical path. Since our benchmark set is not floating point intensive and there are usually many more loads than integer divide instructions, these results imply that scheduling loads early has a large impact on the performance. Since the latency of floating point arithmetic is relatively large, scheduling these instructions earlier will also benefit numerical applications.
Scheduling a Superscalar with Non-uniform Function Units
The cost to replicate all function units for each additional instruction slot can be very high. Therefore, we have evaluated the performance degradations due to non-uniform function unit resources.
Since the relative behavior of the three scheduling models is the same for both the superscalar and the superpipelined processors, we only analyze the effect of limiting resources for the superscalar processor. Figure 15 shows the speedup of the three scheduling models for a superscalar processor [...]
Another interesting point is that the relative performance of the restricted code percolation compared to boosting and general code percolation increases when the load delay is decreased. When the load delay is decreased from 3 to 2 for an 8-issue processor, the speedup for general and boosting code percolation increases from 12% for lex to 37% for grep, while the speedup for restricted code percolation increases from 20% for espresso to 44% for grep. Likewise, when the load delay is decreased from 2 to 1 for an 8-issue processor, the speedup for general and boosting code percolation increases from 8% for tbl to 53% for cmp, while the speedup for restricted code percolation increases from 25% for espresso to 80% for grep. This is expected since loads cannot be moved across branches in the restricted model and thus are more likely to be on the critical path than in the general and boosting models. Therefore, restricted code percolation is more sensitive to increasing the memory access delay.
Scheduling a Superscalar with 8K Data Cache
In the previous experiments we have assumed an ideal instruction and data cache. To analyze the effect of the data cache, which typically has a higher miss ratio than the instruction cache, we replaced the ideal data cache with an 8K direct mapped data cache with 32 byte blocks. An 8K data cache was chosen to represent moderate sized on-chip caches in the near future. Therefore, for the range from moderate to large data cache sizes, the performance impact due to cache misses is bounded by the speedup shown in Figure 18 and those in Figure 12. We assume that the processor stalls on a cache miss. The initial delay to memory is 4 cycles and the transfer size is 32 bits. For an 8-issue processor, Figure 18 shows that data cache misses decrease the speedup of boosting and general code percolation from 50% for compress to approximately 0% for eqntott, and of restricted code percolation from 34% for compress to approximately 0% for eqntott. As expected, the performance of the data cache has a greater impact on the more aggressive scheduling models.
While trace scheduling uses branch frequencies to determine the scheduling region, it allows side entrances into traces [12]. During scheduling, these side entrances require incremental bookkeeping to perform code duplication when an operation is moved upward across a side entrance. Similarly, percolation scheduling [29] and global instruction scheduling based on the Program Dependence Graph [15] apply incremental code duplication when an operation is moved across a program merge point (footnote 8). Superblocks eliminate the need for incremental bookkeeping by performing tail duplication to remove side entrances to the trace. In addition to simplifying scheduling, separating superblock formation from scheduling allows the compiler to apply superblock optimizations. These optimizations increase the size of the superblock and remove dependences in order to increase the instruction level parallelism [21].
Conclusion
In this paper we have analyzed three code percolation models for superscalar and superpipelined processors.
We have shown that increasing the scheduling scope from basic block to superblock increases the [...]
An extra bit is required per instruction to indicate that the instruction has been moved across a branch. In addition, extra hardware is required to control the execution pipeline, shadow register file, and shadow store buffer when a branch commits. To handle precise exceptions and page faults, the program counter of the first instruction to be moved across a branch must be saved.
The boosting code percolation model is the least restrictive; however, it also requires the most hardware support. In this paper, we analyzed the speedup of all three models on superscalar and superpipelined processors. On average, the boosting code percolation model performs slightly better than general code percolation. Both the boosting and general code percolation models perform considerably better (between 13% and 145% for an issue-8 processor) than restricted code percolation. Similar trends have been shown for processors with varying resource assumptions.
We believe that future processor instruction sets should support some form of the general code percolation model in order to be competitive in the superscalar and superpipelining domain.
Superblock scheduling and other global code scheduling techniques can exploit the general code percolation model. We hope to see future research and engineering work in the direction of making general code percolation an extended part of existing architectures and an integral part of future processor architectures.
