Introduction
There is growing interest in machines that exploit, usually with compiler assistance, the parallelism that programs have at the instruction level. Figure 1 shows an example of this parallelism. The code fragment in 1(a) consists of three instructions that can be executed at the same time, because they do not depend on each other's results. The code fragment in 1(b) does have dependencies, and so cannot be executed in parallel.
But how much parallelism is there to exploit? Popular wisdom, supported by a few studies [7, 13, 14], suggests that parallelism within a basic block rarely exceeds 3 or 4 on the average. Peak parallelism can be higher, especially for some kinds of numeric programs, but the payoff of high peak parallelism is low if the average is still small.
These limits are troublesome. Many machines already have some degree of pipelining, as reflected in operations with latencies of multiple cycles. We can compute the degree of pipelining by multiplying the latency of each operation by its dynamic frequency in typical programs and summing the products; for the DECStation 5000, load latencies, delayed branches, and floating-point latencies give the machine a degree of pipelining equal to about 1.5.
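As a rough illustration of this calculation, the sketch below computes a frequency-weighted sum of latencies. The operation classes, frequencies, and latencies are hypothetical values invented only to show the arithmetic; they are not measurements of the DECStation 5000 or of any real program.

```python
def degree_of_pipelining(op_mix):
    """Return the sum over operation classes of (dynamic frequency * latency).

    op_mix maps a class name to (frequency, latency_in_cycles); the
    frequencies are expected to sum to 1.
    """
    total_freq = sum(freq for freq, _ in op_mix.values())
    if abs(total_freq - 1.0) > 1e-9:
        raise ValueError("dynamic frequencies must sum to 1")
    return sum(freq * latency for freq, latency in op_mix.values())


# Hypothetical instruction mix, invented for illustration only.
example_mix = {
    "single_cycle": (0.70, 1),
    "load": (0.20, 2),
    "floating_point": (0.10, 3),
}

degree = degree_of_pipelining(example_mix)
print(f"degree of pipelining: {degree:.2f}")
```

A machine whose operations all had single-cycle latency would have a degree of pipelining of exactly 1 under this definition.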
Adding a superscalar capability to a machine with some pipelining is beneficial only if there is more parallelism available than the pipelining already exploits. To increase the instruction-level parallelism that the hardware can exploit, people have explored a variety of techniques. These fall roughly into two categories.
One category includes techniques for increasing the parallelism within a basic block, the other for using parallelism across several basic blocks. These techniques often interact in a way that has not been adequately explored. We would like to bound the effectiveness of a technique whether it is used in combination with impossibly good companion techniques, or with none. A general approach is therefore needed. In this paper, we will describe our use of trace-driven simulation to study the importance of register renaming, branch and jump prediction, and alias analysis. In each case we can model a range of possibilities from perfect to non-existent.
We will begin with a survey of ambitious techniques for increasing the exploitable instruction-level parallelism of programs.
1.1. Increasing parallelism within blocks.
we must do the second and third instructions in that order, because the third changes the value of r1. However, if the compiler had used r3 instead of r1 in the third instruction, these two instructions would be independent.
A smart compiler might pay attention to its allocation of registers, so as to maximize the opportunities for parallelism. Current compilers often do not, preferring instead to reuse registers as often as possible so that the number of registers needed is minimized.
An alternative is the hardware solution of register renaming, in which the hardware imposes a level of indirection between the register number appearing in the instruction and the actual register used. Each time an instruction sets a register, the hardware selects an actual register to use for as long as that value is needed. In a sense the hardware does the register allocation dynamically, which can give better results than the compiler's static allocation, even if the compiler did it as well as it could. In addition, register renaming allows the hardware to include more registers than will fit in the instruction format, further reducing false dependencies. Unfortunately, register renaming can also lengthen the machine pipeline, thereby increasing the branch penalties of the machine.
The number of instructions between branches is usually quite small, often averaging less than 6. If we want large parallelism, we must be able to issue instructions from different basic blocks in parallel. But this means we must know in advance whether a conditional branch will be taken, or else we must cope with the possibility that we do not know.
Branch prediction is a common hardware technique. In the scheme we used [9, 12], the branch predictor maintains a table of two-bit entries. Low-order bits of a branch's address provide the index into this table.
Taking a branch causes us to increment its table entry; not taking it causes us to decrement. We do not wrap around when the table entry reaches its maximum or minimum.
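The counter table described above can be sketched as follows, using the scheme's rule that a branch is predicted taken when its entry is 2 or 3. This is a minimal sketch rather than the simulator's actual code: the class and function names, the table size (`index_bits`), the initial counter value, and the choice to drop the two lowest address bits (assuming 4-byte-aligned instructions) are all assumptions made for illustration.

```python
class TwoBitPredictor:
    """Table of two-bit saturating counters indexed by low-order address bits."""

    def __init__(self, index_bits=11, initial=1):
        # index_bits and initial are illustrative assumptions.
        self.mask = (1 << index_bits) - 1
        self.table = [initial] * (1 << index_bits)

    def _index(self, branch_addr):
        # Drop the two low bits, assuming 4-byte-aligned instructions.
        return (branch_addr >> 2) & self.mask

    def predict(self, branch_addr):
        """Predict taken when the counter is 2 or 3."""
        return self.table[self._index(branch_addr)] >= 2

    def update(self, branch_addr, taken):
        """Increment on taken, decrement on not taken, saturating at 0 and 3."""
        i = self._index(branch_addr)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def count_mispredictions(predictor, branch_addr, outcomes):
    """Replay a sequence of branch outcomes and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(branch_addr) != taken:
            misses += 1
        predictor.update(branch_addr, taken)
    return misses


# A loop branch taken nine times and then not taken, executed three times.
# The counter starts at 3 to model a branch that is already warmed up.
loop = [True] * 9 + [False]
predictor = TwoBitPredictor(initial=3)
misses = count_mispredictions(predictor, 0x400100, loop * 3)
print(misses)
```

In this run each execution of the loop costs a single misprediction, at the exit: the not-taken outcome only lowers the counter from 3 to 2, so the branch is still predicted taken when the loop is entered again.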
We predict that a branch will be taken if its table entry is 2 or 3. This two-bit prediction scheme mispredicts a typical loop only once, when it is exited.
2. This and previous work.
To better understand this bewildering array of techniques, we have built a simple system for scheduling instructions produced by an instruction trace. Our system allows us to assume various kinds of branch and jump prediction, alias analysis, and register renaming. In each case the option ranges from perfect, which could not be implemented in reality, to non-existent.
It is important to consider the full range in order to bound the effectiveness of the various techniques. For example, it is useful to ask how well a realistic branch prediction scheme could work even with impossibly good alias analysis and register renaming.
analysis. However, they did not consider more realistic assumptions, arguing instead that they were interested primarily in programs for which realistic implementations would be close to perfect.
The study by Smith, Johnson, and Horowitz [13] was a realistic application of trace-driven simulation that assumed neither too restrictive nor too generous a model. They were interested, however, in validating a particular realistic machine design, one that could consistently exploit a parallelism of only 2. They did not explore the range of techniques discussed in this paper.
We believe our study can provide useful bounds on the behavior not only of hardware techniques like branch prediction and register renaming, but also of compiler techniques like software pipelining and trace scheduling. Unfortunately, we could think of no good way to model loop unrolling.
Register renaming can cause much of the computation in a loop to migrate backward toward the beginning of the loop, providing opportunities for parallelism much like those presented by unrolling.
Much of the computation, however, like the repeated incrementing of the loop index, is inherently sequential.
We address loop unrolling in an admittedly unsatisfying manner, by unrolling the loops of some numerical programs by hand and comparing the results to those of the normal versions.
3. Our experimental framework.
To explore the parallelism available in a particular program, we execute the program to produce a trace of the instructions executed. This trace also includes data addresses referenced, and the results of branches and jumps. A greedy algorithm packs these instructions into a sequence of pending cycles.
In packing instructions into cycles, we assume that any cycle may contain as many as 64 instructions in parallel. We further assume no limits on replicated functional units or ports to registers or memory: all 64 instructions may be multiplies, or even loads. We assume that every operation has a latency of one cycle, so the result of an operation executed in cycle N can be used by an instruction executed in cycle N+1. This includes memory references: we assume there are no cache misses.
We pack the instructions from the trace into cycles as follows.
For each instruction in the trace, we start at the end of the cycle sequence, representing the latest pending cycle, and move earlier in the sequence until we find a conflict with the new instruction. Whether a conflict exists depends on which model we are considering. If the conflict is a false dependency (in models allowing them), we assume that we can put the instruction in that cycle but no farther back. Otherwise we assume only that we can put the instruction in the next cycle after this one. If the correct cycle is full, we put the instruction in the next non-full cycle. If we cannot put the instruction in any pending cycle, we start a new pending cycle at the end of the sequence.
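The packing procedure just described can be sketched as below. This is a simplified sketch under stated assumptions, not the system used in the study: it tracks register dependencies only (no memory references, branches, or window limit), and the `Instr` record, the `conflict_kind` and `pack` functions, and the example instructions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

CYCLE_WIDTH = 64  # maximum instructions per cycle, as in the text


@dataclass(frozen=True)
class Instr:
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def conflict_kind(earlier, later, renaming=False):
    """Classify the dependency of `later` on `earlier`: 'true', 'false', or None."""
    if earlier.writes & later.reads:
        return "true"  # later needs a value that earlier produces
    if not renaming and (earlier.reads & later.writes or earlier.writes & later.writes):
        return "false"  # register reuse only; perfect renaming would remove it
    return None


def pack(trace, renaming=False, width=CYCLE_WIDTH):
    """Greedily pack a trace into cycles; returns a list of lists of Instr."""
    cycles = []
    for instr in trace:
        target = 0  # with no conflict at all, the earliest pending cycle is legal
        for idx in range(len(cycles) - 1, -1, -1):  # latest cycle first
            kinds = {conflict_kind(e, instr, renaming) for e in cycles[idx]}
            if "true" in kinds:
                target = idx + 1  # must wait for the result
                break
            if "false" in kinds:
                target = idx  # may share this cycle, but go no farther back
                break
        # Skip forward past full cycles; open a new one if none has room.
        while target < len(cycles) and len(cycles[target]) >= width:
            target += 1
        if target == len(cycles):
            cycles.append([])
        cycles[target].append(instr)
    return cycles


# Hypothetical three-instruction trace: the third instruction reuses r1.
trace = [
    Instr("load", reads=frozenset({"r9"}), writes=frozenset({"r1"})),
    Instr("add", reads=frozenset({"r1"}), writes=frozenset({"r2"})),
    Instr("move", writes=frozenset({"r1"})),
]
no_renaming = pack(trace)
with_renaming = pack(trace, renaming=True)
print([[i.name for i in c] for c in no_renaming])
print([[i.name for i in c] for c in with_renaming])
```

Without renaming, the `move` shares a cycle with the `add` that reads r1 but cannot move earlier; with perfect renaming it is free to join the first cycle. The parallelism of a schedule is the number of instructions divided by the number of cycles.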
As we add more and more cycles, the sequence gets longer. We assume that hardware and software techniques will have some limit on how many instructions they will consider at once. When the total number of instructions in the sequence of pending cycles reaches this limit, we remove the first cycle from the sequence, whether it is full of instructions or not. This corresponds to retiring the cycle's instructions from the scheduler, and passing them on to be executed.
We can assume that indirect jumps are perfectly predicted.
We can assume infinite or finite hardware prediction as described above (predicting that a jump will go where it went last time). We can assume static prediction based on a profile. And we can assume no prediction. In any case we are concerned only with indirect jumps; we assume that direct jumps are always predicted correctly.
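The hardware jump prediction mentioned here, predicting that a jump will go where it went last time, might be sketched as follows. The class and function names are hypothetical, and the way the finite table is indexed (low-order address bits, so that distinct jumps can share an entry) is an assumption made for illustration rather than a detail taken from the study.

```python
class LastTargetPredictor:
    """Predict that an indirect jump goes where it went the last time."""

    def __init__(self, table_size=None):
        # table_size=None models an unbounded ("infinite") table; an integer
        # models a finite table in which distinct jumps may share an entry.
        # The indexing scheme is an illustrative assumption.
        self.table_size = table_size
        self.table = {}

    def _key(self, jump_addr):
        if self.table_size is None:
            return jump_addr
        return (jump_addr >> 2) % self.table_size

    def predict(self, jump_addr):
        """Return the remembered target, or None if this entry is still empty."""
        return self.table.get(self._key(jump_addr))

    def update(self, jump_addr, actual_target):
        self.table[self._key(jump_addr)] = actual_target


def jump_success_rate(predictor, jumps):
    """jumps is a sequence of (jump_addr, actual_target) pairs from a trace."""
    hits = 0
    total = 0
    for addr, target in jumps:
        total += 1
        if predictor.predict(addr) == target:
            hits += 1
        predictor.update(addr, target)
    return hits / total if total else 0.0


# Two hypothetical indirect jumps, each always going to the same target.
jumps = [(0x100, 0x500), (0x104, 0x600)] * 2
infinite_rate = jump_success_rate(LastTargetPredictor(), jumps)
tiny_table_rate = jump_success_rate(LastTargetPredictor(table_size=1), jumps)
print(infinite_rate, tiny_table_rate)
```

With an unbounded table each jump misses only on its first execution; with a one-entry table the two jumps keep overwriting each other's target, which is the kind of interference a finite table can introduce.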
The effect of branch and jump prediction on scheduling is easy to state. Correctly predicted branches and jumps have no effect on scheduling (except for register dependencies involving their operands). Instructions on opposite sides of an incorrectly predicted branch or jump, however, always conflict.
Another way to think of this is that the sequence of pending cycles is flushed whenever an incorrect prediction is made. Note that we generally assume no other penalty for failure. This assumption is optimistic; in most real architectures, a failed prediction causes a bubble in the pipeline, resulting in one or more cycles in which no execution whatsoever can occur. We will return to this topic later.
We can also allow instructions to move past a certain number of incorrectly predicted branches. This corresponds to architectures that speculatively execute instructions from both possible paths, up to a certain fan-out limit. None of the experiments described here involved this ability.
Four levels of alias analysis are available. We can assume perfect alias analysis, in which we look at the actual memory address referenced by a load or store; a store conflicts with a load or store only if they access the same location.
We can also assume no alias analysis, so that a store always conflicts with a load or store.
This is a common technique in compile-time instruction-level code schedulers. We look at the two instructions to see if it is obvious that they are independent.
3.2. Programs measured.
As test cases we used four toy benchmarks, seven real programs used at WRL, and six SPEC benchmarks.
These programs are shown in Figure 3. The SPEC benchmarks were run on accompanying test data, but the data was usually an official "short" data set rather than the reference data set. The programs were compiled for a DECStation 5000, which has a MIPS R3000* processor.
The MIPS version 1.31 compilers were used.
* R3000 is a trademark of MIPS Computer Systems, Inc.
Results.
We ran these test programs for a wide range of configurations. The results we have are tabulated in the appendix, but we will extract some of them to show some interesting trends. To provide a framework for our exploration, we defined a series of five increasingly ambitious models spanning the possible range. These five are specified in Figure 4; the window size in each is 2K instructions. Many of the results we present will show the effects of variations on these standard models.
Note that even the Fair model is quite ambitious. 
Branch and jump prediction.
The success of the two-bit branch prediction has been reported elsewhere [9, 10]. Our results were comparable and are shown in Figure 5.
shown as dotted lines. Unsurprisingly, the Stupid model rarely gets above 2; the lack of branch prediction means that it finds only intra-block parallelism, and the lack of renaming and alias analysis means it won't find much of that. The Fair model is better, with parallelism between 2 and 4 common. Even the Great model, however, rarely has parallelism above 8. A study that assumed perfect branch prediction, perfect alias analysis, and perfect register renaming would lead us down a dangerous garden path. So would a study that included only fpppp and tomcatv, unless that's really all we want to run on our machine.
which is not always sufficient to resolve the conflict between a store at the end of one iteration and the loads at the beginning of the next. In naive unrolling, the loop body is simply replicated, and these memory conflicts impose the same rigid framework on the dependency structure as they did before unrolling. The unrolled versions have slightly less to do within that framework, however, because 3/4 or 9/10 of the loop overhead has been removed. As a result, the parallelism goes down slightly. Even when alias analysis by inspection is adequate, unrolling the loops either naively or carefully sometimes causes the compiler to spill some registers. This is even harder for the alias analysis to deal with because these references usually have a different base register than the array references.
Loop unrolling is a good way to increase the available parallelism, but it is clear we must integrate the unrolling with the rest of our techniques better than we have been able to do here.
Effects of window size.
Our standard models all have a window size of 2K instructions: the scheduler is allowed to keep that many instructions in pending cycles at one time. Typical superscalar hardware is unlikely to handle windows of that size, but software techniques like trace scheduling for a VLIW machine might.
This would tend to have less parallelism than the continuous window model we used above. Figure 9 shows the same models as Figure 8, except assuming discrete windows rather than continuous. Under the Great model, discrete windows do nearly as well as continuous when the window is 2K instructions, but the difference increases as the window size decreases; we must use discrete windows of 128 instructions before the curves level off. If we have very small windows, it might pay off to manage them continuously; in other words, continuous management of a small window is as good as multiplying the window size by 4. As before, the Perfect model does better the larger the window, but the parallelism is only two-thirds that of continuous windows.
Effects of branch and jump prediction.
We have several levels of branch and jump prediction. Figure 10 shows the results of varying these while register renaming and alias analysis stay perfect. Reduc-
Otherwise, removing jump prediction altogether has little effect. This graph does not show static or finite prediction: it turns out to make little difference whether prediction is infinite, finite, or static, because they have nearly the same success rate.
That jump prediction has little effect on the parallelism under non-Perfect models does not mean that jump prediction is useless. In a real machine, a jump predicted incorrectly (or not at all) may result in a bubble in the pipeline.
The bubble is a series of cycles in which no execution occurs, while the unexpected instructions are fetched, decoded, and started down the execution pipeline. Depending on the penalty, this may have a serious effect on performance. Figure 11 shows the degradation
We also ran several experiments varying alias analysis in isolation, and varying register renaming in isolation.
was (by definition) indistinguishable from perfect alias analysis on programs that do not use the heap, and was somewhat helpful even on those that do. There remains a gap between this analysis and perfection, which suggests that the payoff of further work on heap disambiguation may be significant. Unless branch prediction is perfect, however, even perfect alias analysis usually leaves us with parallelism between 4 and 8.
Figure 13 shows the effect of varying the register renaming under the Perfect and Great models. Dropping from infinitely many registers to 256 CPU and 256 FPU registers rarely had a large effect unless the other parameters were perfect.
Under the Great model, register renaming with 32 registers, the number on the actual machine, yielded parallelisms roughly halfway between no renaming and perfect renaming.
renaming is important as well, though a compiler might be able to do an adequate job with static analysis, if it knows it is compiling for parallelism.
Even ambitious models combining the techniques discussed here are disappointing. Figure 14 shows the parallelism achieved by a quite ambitious hardware-style model, with branch and jump prediction using infinite tables, 256 FPU and 256 CPU registers used with LRU renaming, perfect alias analysis, and windows of 64 instructions maintained continuously. The average parallelism is around 7, the median around 5. Figure 15 shows the parallelism achieved by a quite ambitious software-style model, with static branch and jump prediction, 256 FPU and 256 CPU registers used with LRU renaming, perfect alias analysis, and windows of 2K instructions maintained continuously.
The average here is closer to 9, but the median is still around 5. A consistent speedup of 5 would be quite good, but we cannot honestly expect more (at least without developing techniques beyond those discussed here).
We must also remember the simplifying assumptions this study makes. We have assumed that all operations have latency of one cycle; in practice an instruction with larger latency uses some of the available parallel- 
