Abstract
Introduction
Currently available superscalar processors issue up to four instructions in each processor cycle. Even higher issue rates are expected in the future. To exploit these issue rates, a processor must identify groups of independent instructions that can safely be issued and executed in parallel. Unfortunately, the processor itself has only a local view of the dynamic instruction stream it is executing. As a result, even with advanced hardware techniques such as dynamic branch prediction and out-of-order instruction issue, the amount of parallelism that can be realised is limited and may be as low as a factor of two [1].
An alternative approach is to use an instruction scheduler to reorganise code into independent groups at compile time. The processor then issues these pre-formed groups in parallel at run time. Unlike the hardware, an instruction scheduler can take a global view of a program and is therefore able to assemble large groups to satisfy high instruction issue rates. As a result, future high performance superscalars are likely to rely increasingly on instruction scheduling to realise their full potential.
The Hatfield Superscalar Project aims to develop the instruction scheduling technology to support a high instruction issue rate. The research is based on the Hatfield Superscalar Architecture (HSA) that has been developed at the University of Hertfordshire. The long term aim of the project is to realise an order of magnitude speedup over traditional RISC processors that issue only one instruction per cycle. Further objectives include the development of heuristics to avoid excessive code expansion during instruction scheduling and the development of hardware mechanisms, such as guarded instruction execution, to support high issue rates.
This paper presents preliminary results generated using the first instruction scheduler developed for HSA. In particular, we measure the speedup achieved by different processor models and evaluate the impact of running code scheduled for one processor on alternative processor models. We also examine the impact of multi-cycle cache access times and of using guarded instruction execution to remove instructions prematurely from the instruction pipeline. Finally, we quantify the ability of the scheduler to remove branches during scheduling.
The HSA architectural model
HSA is a load and store architecture with a RISC instruction set derived from our earlier HARP project [2].
Separate integer and Boolean register files are provided. The one-bit Boolean registers are used to store branch conditions and to implement guarded instruction execution. Functional units include arithmetic, relational, shift, multiply, memory reference and branch units. A simple four-stage pipeline is used:
IF: Instruction Fetch
ID: Instruction Decode
EX: Execute
WB: Write Back
In the first stage a fixed number of instructions is fetched from the instruction cache into an Instruction Buffer. One or more processor cycles may be required. In the case of multiple cycle fetches it is assumed that the cache accesses are pipelined and that a new instruction access can begin in each cycle. In the instruction decode stage one or more instructions are issued to functional units. Instructions then spend a variable number of cycles in the execution stage before returning results to a register file in the write-back stage.
The Instruction Buffer decouples instruction fetch from instruction issue. Typically, the fetch rate of an HSA processor exceeds the maximum issue rate to allow the processor to benefit from its ability to remove or squash instructions from the Instruction Buffer before they are issued to functional units.
Branches are resolved in the ID stage. With a single cycle instruction cache the branch delay is therefore one. Load and store instructions use register indirect addressing or the ORed indexing addressing mechanism developed for HARP [3]. These simple addressing mechanisms allow memory addresses to be made available at the end of the ID stage and avoid a load delay with a single cycle data cache.
Other major features of HSA include a generalised delayed branch mechanism, guarded instruction execution and hardware support for speculative instruction execution.
Delayed branch mechanism
In a RISC pipeline, branches are typically resolved in the second pipeline stage giving a branch delay of one. This latency is often hidden by using a delayed branch mechanism in which a fixed number of instructions following the branch is always executed, irrespective of the outcome of the branch.
The classic delayed branch mechanism is too inflexible for a superscalar architecture. HSA therefore generalises the traditional mechanism by allowing each branch instruction to specify explicitly the number of instructions that must be executed in its branch delay slots [4]. The flexibility provided by this mechanism allows it to adapt to different instruction issue rates and to different branch latencies. Compatibility is ensured as long as each processor can execute the new branch instructions. Any HSA processor can therefore execute code scheduled for any other HSA processor.
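The effect of letting each branch carry its own delay count can be illustrated with a small interpreter. This is a minimal sketch and not the HSA simulator: the `Instr` record, its field names and the fixed `taken` outcome are assumptions made for illustration, and branches placed inside delay slots are simply treated as ordinary instructions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instr:
    name: str
    target: Optional[int] = None  # index of the branch target; None for non-branches
    delay: int = 0                # instructions after the branch that always execute
    taken: bool = False           # branch outcome, fixed here for illustration


def run(program: List[Instr], max_steps: int = 100) -> List[str]:
    """Execute sequentially, honouring the delay count carried by each branch."""
    executed: List[str] = []
    pc = 0
    pending_target: Optional[int] = None  # taken branch waiting for its delay slots
    slots_left = 0
    while pc < len(program) and len(executed) < max_steps:
        instr = program[pc]
        executed.append(instr.name)
        pc += 1
        if pending_target is not None:
            # This instruction occupied one delay slot of the pending branch.
            slots_left -= 1
            if slots_left == 0:
                pc = pending_target
                pending_target = None
        elif instr.target is not None and instr.taken:
            if instr.delay == 0:
                pc = instr.target
            else:
                pending_target = instr.target
                slots_left = instr.delay
    return executed


program = [
    Instr("br", target=5, delay=2, taken=True),
    Instr("a"), Instr("b"), Instr("c"), Instr("d"), Instr("e"),
]
trace = run(program)
```

Because the delay count travels with the instruction, the same program gives the same architectural result whatever the branch latency of the processor executing it, which is the compatibility property claimed above.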
In contrast, some recent superscalar architectures have abandoned the delayed branch mechanism [5, 6] in favour of using a branch target cache (BTC) to predict the outcome of branches. While a perfect BTC will result in no performance degradation, in practice there is a performance penalty proportional to both the BTC miss rate and the instruction issue rate. As a result, as the issue rate of superscalar processors increases, the performance penalty of using a BTC will also increase. Using a BTC also involves a significant increase in hardware complexity. In addition to the cost of the BTC itself, the processor must be able to recover rapidly from incorrect branch predictions by squashing any instructions that have been speculatively issued after the branch prediction.
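The proportionality argument can be made concrete with a back-of-envelope model. This is a sketch under stated assumptions, not a measurement: the function name, the fixed recovery time per misprediction and the numbers in the usage lines are all hypothetical.

```python
def lost_issue_slots(branches: int, miss_rate: float,
                     recovery_cycles: int, issue_rate: int) -> float:
    """Issue slots wasted by BTC mispredictions.

    Assumes every misprediction costs the same number of recovery cycles
    and that each lost cycle wastes a full group of issue slots.
    """
    mispredicted = branches * miss_rate
    return mispredicted * recovery_cycles * issue_rate


# Hypothetical workload: the same miss rate costs twice as many
# issue slots when the issue rate doubles.
narrow = lost_issue_slots(branches=1000, miss_rate=0.05, recovery_cycles=2, issue_rate=4)
wide = lost_issue_slots(branches=1000, miss_rate=0.05, recovery_cycles=2, issue_rate=8)
```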
Guarded instruction execution
Guarded or conditional instruction execution has been proposed by a number of researchers [7, 8] and has been implemented in several processors including the Acorn ARM [9]
TB6 ADD R1, R2, R3;
Branch removal is particularly important in superscalar processors that predict the outcome of branch instructions dynamically, since removing branches from the program will also reduce the number of branch prediction failures.
Another advantage of guarded execution is that the pressure on functional units and other processor resources can be reduced. Consider the following example with three parallel instruction groups:
BT B1, label (#6); TB1 instr1; FB1 instr4
TB1 instr2; FB1 instr5
TB1 instr3; FB1 instr6
Two branch delay slots are assumed and six instructions following the branch must be executed before the branch is taken. All six will be issued to functional units but three of them will be squashed in their functional units without returning a result to a register. Consequently the pressure on result buses and register file write ports will be reduced.
Instructions can also be squashed in the Instruction Buffer [10]. HSA will squash an instruction in the ID stage if the relevant Boolean guard becomes available and the instruction has remained in the Instruction Buffer for a full cycle without being issued. In the above example, the instructions in the two branch delay groups satisfy these conditions, so only two of the final four instructions need be issued to functional units.
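The bookkeeping in the example above can be reproduced with a few lines of code. This is a simplified sketch rather than the HSA issue logic: the tuple layout, the function name and the `known_from_group` parameter (the first group whose instructions have waited in the Instruction Buffer long enough for the guard to be known) are assumptions introduced for illustration.

```python
from typing import Dict, List, Tuple

# (guard register, value required for execution, instruction name)
Guarded = Tuple[str, bool, str]


def classify(groups: List[List[Guarded]], guards: Dict[str, bool],
             known_from_group: int) -> Dict[str, str]:
    """Label each guarded instruction with the place where it ends up."""
    outcome: Dict[str, str] = {}
    for index, group in enumerate(groups):
        for register, required, name in group:
            if guards[register] == required:
                outcome[name] = "completes"
            elif index >= known_from_group:
                # Guard already known while the instruction waits in the buffer.
                outcome[name] = "squashed_in_buffer"
            else:
                # Guard not yet known at issue: squashed inside the functional unit.
                outcome[name] = "squashed_in_unit"
    return outcome


groups = [
    [("B1", True, "instr1"), ("B1", False, "instr4")],  # issued alongside the branch
    [("B1", True, "instr2"), ("B1", False, "instr5")],  # first branch delay group
    [("B1", True, "instr3"), ("B1", False, "instr6")],  # second branch delay group
]
result = classify(groups, guards={"B1": True}, known_from_group=1)
issued = [name for name, where in result.items() if where != "squashed_in_buffer"]
```

With B1 true, four of the six guarded instructions reach a functional unit, and only two of the four in the branch delay groups are issued, matching the counts given in the text.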
Instruction scheduling
HSA relies on compile-time instruction scheduling to achieve high execution rates. Scheduling techniques were originally developed to pack independent operations into microinstructions to reduce the size of microprograms and to enable them to execute more quickly. Recognition that limited parallelism was available between branches led to attempts to schedule code globally. Initially efforts were directed towards combining basic blocks to create large groups of instructions that could then be scheduled using the techniques that had been developed to compact microcode.
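The compaction of a single basic block into groups of independent operations can be sketched as follows. This is a generic greedy packing under simplifying assumptions (unit latencies, true dependences only, instructions listed in program order), not the CGS algorithm; the function and parameter names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def pack_into_groups(deps: Dict[str, Set[str]],
                     issue_width: Optional[int] = None) -> List[List[str]]:
    """Place each instruction in the earliest group that follows all its producers.

    `deps` maps an instruction name to the names it depends on and must list
    every producer before its consumers. `issue_width` caps the group size;
    None models unlimited resources.
    """
    group_of: Dict[str, int] = {}
    groups: List[List[str]] = []
    for name, producers in deps.items():
        group = max((group_of[p] + 1 for p in producers), default=0)
        while True:
            if group == len(groups):
                groups.append([])
            if issue_width is None or len(groups[group]) < issue_width:
                break
            group += 1  # group is full, try the next one
        groups[group].append(name)
        group_of[name] = group
    return groups


block = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}, "e": set()}
unlimited = pack_into_groups(block)
two_wide = pack_into_groups(block, issue_width=2)
```

The dependence chain a, c, d fixes the block at three groups in both cases, which illustrates why the parallelism available between branches is limited and why global scheduling was pursued.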
One of the first global scheduling techniques developed was Trace Scheduling [11]. In Trace Scheduling, the scheduler identifies a series of paths or traces through a procedure that are highly likely to be followed at run time. The most important trace is then selected and scheduled as if it were a single basic block. Finally, code is added at the entry and exit points of the scheduled trace to preserve the program semantics. This process is repeated until all the traces in a procedure have been scheduled.
While spectacular speedups were achieved with highly numeric programs, Trace Scheduling has a number of limitations. Firstly, there is the problem of identifying traces that are likely to be followed at run time. Accurate trace identification depends on knowing the likely outcome of branches and, in general, this information can only be obtained by running a program and collecting the required branch statistics. To reduce this difficulty, loops are often unrolled to obtain a long trace consisting of several loop iterations. Secondly, the execution time of the trace path is optimised at the expense of off-trace paths, which may be slowed down. Thirdly, the addition of semantic preserving code together with aggressive loop unrolling can lead to dramatic code expansion. Finally, and perhaps crucially, instructions will usually only be executed in parallel if they were originally in the same trace. Branches off and into a trace therefore become barriers to code motion. In the case of loops, instructions from one loop iteration will never be overlapped with instructions from a subsequent iteration to achieve so-called software pipelining. This disadvantage can be partially offset through aggressive loop unrolling, but only at the cost of dramatic code expansion.
Professor Hwu's group at Illinois University also increases the scope for instruction scheduling by combining basic blocks to form larger scheduling units called superblocks [12]. A superblock consists of a series of basic blocks with a single entry point but multiple exit points. Superblocks of maximum size are systematically created through basic block duplication called Tail Duplication.
Ebcioglu estimates [8] that the processor cycle time will be degraded by 30%. However, in the case of a two-cycle cache access time, the penalty would be over 100%. A major objective of the HSA project is to achieve similar speedups using more realistic processor models with multi-cycle cache access times and non-unit latencies for the more complex instructions such as multiply and divide.
A conditional group scheduler
HSA uses a Conditional Group Scheduler (CGS) to reorganise code for parallel execution at run time. The primary goal of CGS is to achieve the maximum speedup while minimising code expansion. A secondary aim is to allow code scheduled for one processor to run on different processor models with minimal performance loss.
CGS [10] is a global code scheduling algorithm that applies powerful low-level mechanisms within the framework of a high-level algorithm. The high-level algorithm determines the order in which sections of code are to be scheduled and which low-level mechanisms are to be enabled during the scheduling phases. CGS provides two sets of parameters that significantly affect the overall scheduling algorithm. One set of parameters controls the individual low-level scheduling features applicable for the duration of the entire scheduling process. A second set determines the type and scope of code motions that are allowed during each individual scheduling phase.
The basic CGS scheduling unit is the loop. However, branch instructions impose an unnecessary lower bound on the time required to execute each loop body. For example, with a branch delay of two, the minimum loop length is three instruction groups, and no loop iteration can start less than three cycles after the previous iteration. This limit on the Initiation Interval can only be removed by replicating the loop body.
To minimise code expansion, CGS only replicates loops if the loop execution time can be reduced. This is achieved by first compacting the loop body into the minimum number of instruction groups. Then if the natural length of the scheduled loop is less than the total execution time of a branch instruction, the loop body is duplicated. Guarding replicated loops avoids increasing the number of conditional branches in the loop [4]. The replicated loop body is then rescheduled to combine code from different iterations. As a result, assuming a branch delay of two, a loop body with only one instruction group will be replicated twice to give an Initiation Interval of one, while a loop body with two instruction groups will be replicated once giving an Initiation Interval of at most two.
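The replication rule can be written down as a short calculation. The sketch below is one reading of the rule that reproduces both worked cases in the text; the ceiling formula, the function name and the returned upper bound on the Initiation Interval are assumptions, not a description of the CGS source.

```python
import math
from typing import Tuple


def replication_plan(natural_length: int, branch_delay: int) -> Tuple[int, float]:
    """Return (number of extra copies, upper bound on the Initiation Interval).

    A branch and its delay slots occupy branch_delay + 1 instruction groups,
    so a loop cannot be shorter than that however small its body is.
    """
    branch_time = branch_delay + 1
    if natural_length >= branch_time:
        return 0, float(natural_length)  # the branch is not the bottleneck
    copies = math.ceil(branch_time / natural_length)
    replicated_length = max(branch_time, copies * natural_length)
    return copies - 1, replicated_length / copies


one_group = replication_plan(natural_length=1, branch_delay=2)
two_groups = replication_plan(natural_length=2, branch_delay=2)
three_groups = replication_plan(natural_length=3, branch_delay=2)
```

The bound for the two-group body is not tight: rescheduling the replicated body may shorten it further, which is why the text says "at most two".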
The scheduling process is divided into four phases. In the first phase each instruction group is filled in turn, starting at the head of the loop. Instructions are moved into each group by recursively searching through the loop control structure for suitable candidates. Software pipelining is achieved by allowing instructions to be moved across loop back edges from previously scheduled groups. To preserve program semantics, instructions moved across loop back edges are also added to a prelude ahead of the loop. If no restrictions are placed on code motion across loop back edges, the scheduler performs excessive code motion, eagerly pre-computing successive values on the faster paths through the loop, sometimes many iterations ahead of the point where the values are used. Successive values of the loop count, for example, can often be pre-computed many cycles ahead without speeding up the computation.
To avoid the resultant code explosion, a number of heuristics were developed. Firstly, code motion across a loop back edge is terminated as soon as the scheduler is unable to move an instruction into the instruction group currently being assembled. The objective of this restriction is to synchronise the operations involved in each iteration and to avoid premature computation of values many iterations ahead. Code motion that requires the duplication of instructions within the loop or renaming of any previously scheduled instruction is also prohibited in this phase.
Although the first scheduling phase minimises the number of instruction groups occupied by the loop body, a lower bound is set by the delayed branch mechanism. As a result, the code will be distributed across all the available instruction groups, including any branch delay groups. A loop with a natural size of one or two may therefore be distributed across three instruction groups, including the two branch delay groups. The second scheduling phase therefore compacts the code groups at the top of a loop and allows the minimum size of the loop to be determined. During this phase code motion across the loop back edge is disabled. At the end of the second phase, loop replication is performed if the natural length of the loop is less than the total latency of the branch instructions.
In the third phase, the replicated loop body is rescheduled with code motion across loop back edges enabled once more. This phase allows code from multiple loop bodies to be overlapped. If no code replication has occurred, virtually no code motion takes place in this phase. In the final phase, code is compacted again with the back edge disabled. Earlier limitations on code replication and register renaming can now be safely removed in the absence of code motion across back edges.
Scheduling proceeds from inner to outer loops until each procedure is scheduled. During the later stages, code moves both into and across previously scheduled inner loops, provided the execution time of a previously scheduled loop is not increased. Finally, after all the procedures in a program have been scheduled, code is percolated across procedure calls. This code motion halts as soon as one of the instruction groups in the called procedure fails to
Experimental results
This section presents results that demonstrate the performance gains to be had by combining a powerful scheduling algorithm with a simplified but sophisticated superscalar architecture. Eight general-purpose programs known collectively as the Stanford Integer Benchmark Set [Table 1] are used throughout. All of the test programs are written in C and compiled by a GNU CC generated compiler that targets the HSA instruction set. After scheduling, the programs are executed on a highly parameterised HSA simulator [10].
Table 1 (partial): Stanford Integer Benchmark Set. Recoverable program descriptions: a cube packing problem; the 8 queens chess problem; Quicksort; Towers of Hanoi problem; binary tree sort.
The dynamic instruction distribution of the benchmarks is unremarkable with 40% arithmetic, 30% load and store, 13% relational and 17% branches. However, the distribution of the branch instructions is unusual with 26% of all branches being procedure calls and returns. This distribution reflects the highly recursive nature of three of the programs: perm, tower and tree.
Two versions of the HSA architecture are compared, a Slow Cache Model and a Fast Cache Model. In the Slow Cache Model, both the instruction and data cache require two cycles to perform a read operation. As a result all branch instructions have two branch delay slots. Also, data loaded from the cache by a load instruction cannot be used by an immediately following instruction without introducing a stall of one cycle. The load delay is therefore one. In the Fast Cache Model, the cache access time is reduced to one cycle, giving a single branch delay slot and eliminating the load delay. In both models multiplication requires three cycles and division 16 cycles.
Many of the results involve measurement of the overall speedup of scheduled code over unscheduled code. A Baseline Model is provided by running unscheduled code through a single-instruction-issue version of the simulator and recording the total number of machine cycles required. Since the compiler makes no attempt to fill branch delay slots, the Baseline Model effectively predicts that all branches are not taken and incurs a branch penalty whenever a branch is taken.
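The speedup metric and the baseline branch penalty can be expressed as a small calculation. This is a deliberately simplified sketch: it ignores load delay stalls and the multi-cycle multiply and divide latencies that the simulator accounts for, and the function names and the instruction counts in the usage lines are hypothetical.

```python
def baseline_cycles(instructions: int, taken_branches: int, branch_delay: int) -> int:
    """Cycles for single-issue execution with unfilled branch delay slots.

    One cycle per instruction, plus `branch_delay` wasted cycles for
    every branch that is taken.
    """
    return instructions + taken_branches * branch_delay


def speedup(baseline: int, measured: int) -> float:
    """Speedup of a measured run relative to the Baseline Model."""
    return baseline / measured


# Hypothetical counts, chosen only to show the arithmetic.
fast_baseline = baseline_cycles(instructions=1000, taken_branches=100, branch_delay=1)
slow_baseline = baseline_cycles(instructions=1000, taken_branches=100, branch_delay=2)
example_speedup = speedup(fast_baseline, measured=550)
```

Because the baseline itself slows down as the branch delay grows, a speedup quoted against one model's baseline is not directly comparable with a speedup quoted against another's.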
Unscheduled code performance
Since HSA is a superscalar rather than a VLIW architecture, some speedup will be achieved when unscheduled code is run on a multiple-instruction-issue model. The speedups obtained with the Fast Cache Model are given in Figure 1. With an issue rate of eight a 60% speedup is achieved, reducing to 34% with an issue rate of two. These figures conceal significant variation between individual programs. Perm, for example, speeds up by 118% while puzzle only improves by 36%.
Finally, if all instruction latencies are reduced to one, the arithmetic mean speedup is increased to 3.9 and the harmonic mean to 3.6, a 24% improvement over the Slow Cache Model. This difference emphasises the dangers inherent in using over-simplified processor models.
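The two summary statistics quoted here behave differently, which the following sketch illustrates. The per-program speedups are hypothetical values chosen only to show the effect, and reading the 24% figure as the ratio of the harmonic means 3.6 and 2.9 is an assumption based on the range quoted in the concluding remarks.

```python
from statistics import harmonic_mean, mean


def relative_improvement(new: float, old: float) -> float:
    """Fractional improvement of `new` over `old`."""
    return new / old - 1.0


# Hypothetical per-program speedups: one outlier lifts the arithmetic
# mean far more than the harmonic mean.
speedups = [2.0, 2.5, 3.0, 8.0]
arithmetic = mean(speedups)
harmonic = harmonic_mean(speedups)

# 3.6 against 2.9 gives roughly the 24% quoted in the text.
improvement = relative_improvement(3.6, 2.9)
```

The harmonic mean is the more conservative summary for speedups because it is dominated by the programs that improve least.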
Impact of issue rate on performance
The results in the previous section assumed processor models with infinite resources. Figure 5 shows how the performance of the Fast Cache Model degrades when code scheduled for a machine with no resource constraints is run on a model with progressively lower instruction issue rates. In general, performance degrades gracefully as the issue rate is reduced. A mean speedup of 3.0 is maintained with an issue rate of eight and the speedup is still 2.0 with the highly restrictive issue rate of three.
This graceful degradation in performance as the issue rate is reduced can be partly attributed to the ability of an HSA processor to squash code in the Instruction Buffer. Typically, guarded code will be scheduled into branch delay slots from both successor paths. Squashing allows much of this code to be removed from the Instruction Buffer before it is issued to a functional unit. Squashing therefore becomes increasingly beneficial as the instruction issue rate is reduced.
The same code was re-run with squashing in the Instruction Buffer disabled. With low issue rates, some programs now have speedups less than one [10]. In these cases the scheduled code is running more slowly than the unscheduled code. Over-optimistic code promotion has resulted in time being wasted sequentially executing instructions whose results were never used. With higher issue rates, these speculative instructions were executed in parallel and therefore did not degrade performance. Figure 6 compares the speedups obtained for the benchmark programs with and without the ability to squash code in the Instruction Buffer.
The Data Cache access time was held constant at one cycle to avoid any impact on the results from increased load latencies. Figure 9 shows two sets of harmonic mean speedups for the scheduled benchmarks. The first set indicates the speedup achieved by scheduling code for each model, while the second set normalises the speedups with respect to the Fast Cache Model. While the first set of figures suggests that CGS performs well as the Instruction Cache latency is increased, the normalised figures reveal that the total execution time does increase significantly. This loss of performance is caused by the scheduler's inability to fill all of the branch delay slots as the instruction cache access time is increased. However, detailed examination of the scheduled code suggests that much of the performance loss is caused by the inability of the scheduler to deal adequately with all types of branches. In particular, CGS currently makes no attempt to promote conditional branches into the scope of other conditional branches or to replicate conditional branches. As a result these figures should improve as CGS is enhanced.
Branch removal
Finally, it is interesting to examine the total numbers of each type of instruction that are executed in the scheduled code and to compare these figures with those taken for runs of the unscheduled programs. With the Fast Cache Model, although the total number of instructions executed increases by 23%, the number of branch instructions executed actually falls by 35%. As a result the percentage of executed instructions which are branch instructions falls from 17% to 10%. The reduction is particularly dramatic in perm where 17327 conditional branches are reduced to just five. Similar results were observed with the Slow Cache Model, with the total number of instructions executed increasing by 32% while the number of branches executed fell by 31%.
Concluding remarks
These results demonstrate the validity of the HSA model and show that compile-time scheduling can achieve significant speedups whilst retaining code compatibility across a range of processor implementations. By transferring much of the superscalar's work to the scheduler, the processor has also been greatly simplified. Nonetheless, our minimal superscalar design is still able to find significant levels of parallelism without the aid of a scheduler, achieving a speedup of 1.6.
Currently scheduling realises speedups in the range 2.9 to 3.6 depending on the processor model. These figures are 60% better than those achieved on our earlier HARP project and compare favourably with results reported by other groups.
We also show that code scheduled for a processor with infinite resources can achieve satisfactory performance over a wide range of processor models, with performance degrading gracefully as the issue rate is reduced. Much of this graceful degradation is achieved through HSA's ability to squash instructions in the Instruction Buffer. In general, implementations that cannot squash instructions require a 50% increase in processor resources to compensate. Nonetheless, maximum speed is achieved by scheduling code for a specific processor. Figures are also presented showing the impact of cache access time on speedup, with a 25% degradation as the cache access time is increased to three cycles. Future work with CGS will attempt to reduce this figure.
Finally, scheduling was found to reduce the number of branches executed by over 30%. Similar results are reported by Hwu's group, which used conditional execution specifically to remove branches and achieved a 27% reduction in dynamic branch counts [16]. In contrast, CGS reduces the number of branches as a side effect of instruction scheduling. These results suggest that guarded execution can significantly improve the performance of Branch Target Caches in a superscalar processor.
The initial thrust of our scheduling work was to realise ever greater parallelism. However, it was found that the unrestrained application of CGS's powerful code motion primitives demanded excessive resources and, on occasion, even slowed down program execution. Restraining mechanisms were therefore developed to avoid code explosion. An important conclusion of this work is therefore that the application of code motion primitives must always be tempered by heuristics which take into account the likely costs and benefits of the code motion.
Many questions regarding instruction scheduling remain unanswered. We still do not know how much parallelism can be realised through instruction scheduling. Although the speedup achieved by CGS is encouraging, our scheduler is still incomplete in many respects. Furthermore, analysis using trace driven simulation [17] suggests that significant additional parallelism is theoretically available. Further work is also required to quantify the benefits of guarded execution and instruction squashing. It is therefore likely that our scheduler will be continually enhanced for some time to come.
