Design parameters interact in complex ways in modern processors, especially because out-of-order issue and decoupling buffers allow latencies to be overlapped. Tradeoffs among instruction-window size, branch-prediction accuracy, and instruction- and data-cache size can change as these parameters move through different domains. For example, modeling unrealistic caches can under- or overstate the benefits of better prediction or a larger instruction window. Avoiding such pitfalls requires understanding how all these parameters interact.
I. INTRODUCTION
Many microarchitecture studies focus on issues related to branch prediction, cache size, and more recently, instruction window size. Although points of diminishing returns exist, increasing the size of any of these structures generally increases performance. On the other hand, tradeoffs among these components are usually necessary due to die-size constraints.
Unfortunately, the relative importance and inter-relationships of these major parameters have become more complex as out-of-order issue and decoupling buffers cause latencies to interact in sometimes subtle ways. With a given branch predictor, for example, adding data cache may be more beneficial than increasing instruction-window size, but a small improvement to the branch predictor might be better yet, possibly shifting the tradeoffs so that adding instruction-window entries becomes more beneficial than adding cache. A branch-prediction study that models unrealistic caches can under- or overstate the benefits of better prediction. If the caches are too small, cache misses may mask the effects of branch mispredictions. Modeling perfect caches can be risky as well: some benchmarks stream data through even the largest caches, so omitting this significant contribution to execution time can lead to overstated benefits.
While such pitfalls may strike many readers as obvious, methodological mistakes like these are still common. Running pilot experiments to guard against such pitfalls and to cull the design space is expensive. Finally, detailed simulation generally prohibits executing realistic benchmarks to completion without reducing input sizes. We hope to address these problems by examining three issues in this paper.
We characterize benchmark performance in terms of cache size and instruction-window size to help cull the design space that researchers need to explore. For example, several SPECint benchmarks [59] fit in 8 K instruction caches; the small I-cache footprint of SPEC benchmarks is a common complaint, and we provide detailed data substantiating it. We discuss tradeoffs among cache size, instruction-window size, and branch-prediction accuracy. When not taken into account, these tradeoffs can skew simulation results in damaging ways; for example, assuming perfect branch prediction makes large (e.g., 128-entry) instruction windows look artificially beneficial. Using time-series data that show different phases of execution for cache miss rates and branch misprediction rates, we discuss the sensitivity of various tradeoffs to the choice of simulation window and simulation length. This permits researchers to run shorter simulations and to avoid unrepresentative phases of execution. Simulations of 50 million instructions are almost always adequate, provided the 50 M-instruction window is carefully chosen; in particular, short windows must not include the initial phases of the program. To show these effects, this paper presents data for each SPECint benchmark giving performance as a function of instruction-window size, data-cache size, and instruction-cache size, for both a contemporary hybrid branch predictor [36] and an ideal branch predictor. Comparing the ideal branch-prediction data against the hybrid branch-prediction data shows how much branch mispredictions limit performance; more importantly, it illustrates the circumstances under which simulating ideal prediction skews the results.
The paper also briefly considers how the tradeoff between data-cache size and instruction-window size changes as a function of branch-prediction accuracy, discusses return-address-stack repair, and presents some measurements of average branch-resolution times.
It is crucial to understand the interplay of these parameters, and this paper serves as a reference aid toward that end. We also hope the comprehensive data we present will help researchers reduce their simulation requirements, by describing a methodology for culling the simulation space and by providing reference data for that purpose. We focus on the SPEC suite of benchmarks both because they provide consistency and comparability across studies, and because there are few other agreed-upon benchmarks that are portable and publicly available with source code. In particular, we focus only on the integer benchmarks because the SPEC floating-point programs have near-perfect branch prediction.
Section II describes our simulation approach and benchmarks in more detail. Section III examines the relationship between performance and instruction-window size, and the impact of branch mispredictions on this relationship. Section IV examines the relationship between performance and both instruction- and data-cache size. Section V discusses branch-prediction effects in more detail, and Section VI discusses simulation techniques. Section VII summarizes related work, and Section VIII concludes the paper.
II. EXPERIMENTAL METHODOLOGY

A. Simulator
We use HydraScalar-our heavily modified version of SimpleScalar's [4] sim-outorder-for our experiments. SimpleScalar provides a toolbox of simulation components-like a branch-predictor module, a cache module, and a statistics-gathering module-as well as several simulators built from these components. Each simulator interprets executables compiled by gcc version 2.6.3 for a virtual instruction set (the "PISA") that most closely resembles MIPS IV [43]. The simulators emulate the executable's execution at varying levels of detail.
Cycle-by-cycle simulators like HydraScalar that do their own instruction fetching and functional simulation (as opposed to relying on direct execution to provide instructions for simulation) can accurately simulate execution down mis-speculated paths. Like a real processor, HydraScalar checkpoints appropriate state as it encounters branches; upon detecting a mispredicted branch, it squashes wrong-path instructions and recovers the correct state from the checkpoint. Modeling the actual instruction flow on mis-speculated paths captures consequences like prefetching and cache pollution.
HydraScalar models in detail an out-of-order-execution, 5-stage pipeline: fetch (including branch prediction), decode (including register renaming), issue, writeback, and commit. We add three further stages between decode and issue to simulate time spent renaming and enqueuing instructions. Issue selects the oldest ready instructions. Table I summarizes our baseline model, which loosely resembles the configuration of an Alpha 21264 [26]. Our experiments vary instruction-window size, first-level data- and instruction-cache sizes, and branch-predictor accuracy.

The simulations use a two-level, non-blocking cache hierarchy with miss-status holding registers (MSHRs) [29]. The cache module simulates a simple pipelined bus with fixed transaction spacing-the bus can accept a new transaction every n cycles, and once started a transaction completes without contention-and fixed memory latency (no interleaving, reordering, or fast-page-mode accesses). The model also assumes perfect write buffering (stores consume bandwidth but never cause stalls), which should have minimal impact on these results [51].

HydraScalar simulates a unified active list, issue queue, and rename register file. This type of instruction window is called a register update unit, or RUU [56]. The architectural register files (32 registers each for integer and floating point) are separate and updated on commit. Using an RUU eliminates artifacts arising from interactions between active-list size and issue-queue size, and reduces the already large number of variables we examine. A load-store queue (LSQ) disambiguates memory references: loads may only pass preceding memory references whose addresses are known not to conflict. This study does not explore LSQ size, so for simplicity we fix the LSQ at one-half the RUU size; this eliminates artifacts from stalls should the LSQ fill up.

HydraScalar uses a McFarling-style hybrid branch predictor [36] that combines two 2-level prediction mechanisms [65] with a selector that chooses between them. The two components are a 4K-entry GAg (global-history) predictor and a 1K-entry PAg (local-history) predictor with 10 bits of history per branch; the latter uses 3-bit saturating counters. For each prediction, the selector chooses the component most likely to be correct by consulting its own 4K-entry table of 2-bit saturating counters, indexed by global history [7]. Since many PHT entries correspond to not-taken branches (or are simply idle), a BTB entry is allocated only for taken branches, permitting the BTB to have fewer entries than the PHT [5]. We have also added two varieties of perfect branch prediction. The first, which we call 100%-direction, correctly predicts the direction of every conditional branch but does not prevent any BTB misses. The second, oracle prediction, correctly predicts every branch and destination, regardless of type.

The fetch stage we model makes a prediction for each branch fetched, but within a group of fetched instructions, those following the first predicted-taken branch are discarded, because control must jump to a new location. In practice, therefore, the fetch engine fetches through not-taken branches and stops at taken branches.
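To make the hybrid scheme concrete, the following C sketch shows the selection logic of such a predictor. This is a minimal illustration, not HydraScalar's actual code: the names, index computations, and widths are our own assumptions based on the configuration described above.

    #include <stdint.h>
    #include <stdbool.h>

    #define GAG_ENTRIES 4096   /* 12 bits of global history */
    #define PAG_ENTRIES 1024   /* 10 bits of local history  */

    static uint8_t  gag_pht[GAG_ENTRIES];    /* 2-bit counters          */
    static uint8_t  pag_pht[PAG_ENTRIES];    /* 3-bit counters          */
    static uint16_t local_hist[PAG_ENTRIES]; /* per-branch histories    */
    static uint8_t  selector[GAG_ENTRIES];   /* 2-bit "choice" counters */
    static uint16_t global_hist;             /* global history register */

    bool predict_direction(uint32_t branch_pc)
    {
        uint32_t g_idx   = global_hist & (GAG_ENTRIES - 1);
        uint32_t bht_idx = (branch_pc >> 2) & (PAG_ENTRIES - 1);
        uint32_t l_idx   = local_hist[bht_idx] & (PAG_ENTRIES - 1);

        bool gag_taken = gag_pht[g_idx] >= 2;  /* MSB of 2-bit counter */
        bool pag_taken = pag_pht[l_idx] >= 4;  /* MSB of 3-bit counter */

        /* The selector, indexed by global history, picks the component
           more likely to be correct (counter >= 2 means "trust GAg"). */
        return (selector[g_idx] >= 2) ? gag_taken : pag_taken;
    }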
SimpleScalar updates the predictor state during the instruction-commit stage. This means there is a window of time, while a branch traverses the pipeline, during which its outcome is not available and the branch predictor uses slightly "stale" state. The prediction accuracies reported here are thus not as high as those reported in trace-driven prediction studies that do not model cycle-by-cycle timing effects. Although timing and implementation details of commercial microprocessors are difficult to come by, it appears that early (i.e., speculative) history update is a recent innovation, presumably because of the fixup mechanisms required to undo corruption from incorrect updates. The importance of speculative update is discussed in [17], and mechanisms for implementing fixup appear in [24] and [52]. The Alpha 21264 implements speculative update with fixup for the global-history portion of its hybrid predictor [26].

(Caption, Table II: Statistics are taken only from the post-warmup, 50 M-committed-instruction simulation window, and use the baseline configuration in Table I. "All" refers to all branches, whether conditional, direct-jump, indirect-jump, or return. "Indirect branches" here does not include returns. "Branch accuracy" refers to target-address prediction, except for the conditional-branch column, which presents direction-prediction accuracies.)

Many mispredicted indirect jumps are function returns. Since a function might be called from many different locations, a BTB often provides the wrong target for these jumps. A return-address stack is a natural solution. It is best pushed and popped speculatively, in the fetch stage, and thus requires a fixup mechanism to prevent corruption. We model a simple mechanism, described later in Section V-B, that does not guarantee restoration of the correct state but in practice virtually eliminates corruption [50]. All our simulations use this mechanism and, like the Alpha 21264, use a 32-entry stack [26].
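As a minimal sketch of the speculative-update-with-fixup idea, consider the global-history register alone. The names and the 12-bit width are assumptions for illustration; the checkpoint would be held in the branch's shadow state until the branch resolves.

    #include <stdint.h>
    #include <stdbool.h>

    #define HIST_BITS 12
    #define HIST_MASK ((1u << HIST_BITS) - 1)

    static uint16_t global_hist;   /* speculatively updated at fetch */

    /* At prediction time: save the pre-update history, then shift in
       the predicted direction. */
    uint16_t speculative_update(bool predicted_taken)
    {
        uint16_t ckpt = global_hist;
        global_hist = (uint16_t)(((global_hist << 1) | predicted_taken)
                                 & HIST_MASK);
        return ckpt;
    }

    /* On misprediction recovery: restore the checkpoint, then shift in
       the now-known actual direction. */
    void fixup(uint16_t ckpt, bool actual_taken)
    {
        global_hist = (uint16_t)(((ckpt << 1) | actual_taken) & HIST_MASK);
    }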
Branch-direction mispredictions suffer at least a seven-cycle latency, because the branch condition is not resolved until the writeback stage; the latency may be longer if the branch spends time waiting in the RUU. Conditional branches whose predicted direction is correct-and direct jumps-can still miss in the BTB (a misfetch), but a dedicated adder in the decode stage computes branch targets so that BTB misses can be detected early. Such a BTB miss therefore incurs only a 2-cycle penalty. Indirect jumps need to read the register file, which we assume cannot be done from decode; when the BTB mispredicts these targets, the error is detected only in the writeback stage.
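These penalties can be restated as a small classification routine. This is purely an illustrative summary of the text above (the names are hypothetical, and actual penalties grow if the branch waits in the RUU for operands):

    #include <stdbool.h>

    typedef enum { BR_COND, BR_DIRECT_JUMP,
                   BR_INDIRECT_JUMP, BR_RETURN } br_type_t;

    int min_redirect_penalty(br_type_t type,
                             bool direction_wrong, bool target_wrong)
    {
        if (type == BR_COND && direction_wrong)
            return 7;  /* direction resolved only at writeback           */
        if ((type == BR_COND || type == BR_DIRECT_JUMP) && target_wrong)
            return 2;  /* misfetch: decode-stage adder recomputes target */
        if ((type == BR_INDIRECT_JUMP || type == BR_RETURN) && target_wrong)
            return 7;  /* register target known only at writeback        */
        return 0;      /* correctly predicted                            */
    }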
B. Benchmarks
We perform full-detail simulation only for a representative 50-million-instruction segment of each program, carefully selecting this window to capture behavior that is representative in terms of branch misprediction rate, cache miss rates, and overall IPC, and in particular to avoid unrepresentative startup behavior. Section VI presents a detailed discussion of how we chose these simulation windows. Cycle-level simulations are run in a fast mode to reach the chosen simulation window. In this "warmup" phase, no pipeline simulation takes place; only caches, branch predictor, and architectural state are updated with each instruction. Then one million instructions are simulated in full detail to prime other structures. Table II presents the total length of the warmup phase and summarizes branch behavior during the simulation window for each benchmark. Table III describes each benchmark and its input.

III. SENSITIVITY TO INSTRUCTION-WINDOW SIZE

We first consider performance as a function of RUU size, holding cache size constant and using the baseline branch predictor. The integer benchmarks fall into two groups, determined by the point at which performance as a function of RUU size plateaus. This plateau is the point at which additional entries hold useful instructions so rarely that overall performance is barely affected. Most benchmarks show a plateau in performance at 48-64 RUU entries, regardless of cache size; these benchmarks all have branch-prediction accuracies below 95%. M88ksim behaves similarly: although it achieves 95% accuracy, its performance has already plateaued at 32 entries. (At the end of this section we consider m88ksim's decrease in IPC from 32 to 48 entries.) Note that contemporary processors also have RUU sizes in the 32-64-entry range. Thus, our results show that further increases in RUU size are unwarranted for many integer codes unless coupled with concomitant increases in branch-prediction accuracy. Two benchmarks, vortex and ijpeg, plateau later, at 96 entries. Vortex has a branch-prediction accuracy of 98%, but ijpeg's accuracy is only 88%. In fact, ijpeg, with one of the lowest accuracies, gets one of the largest gains among the SPECint programs from additional RUU entries. Tomcatv, with near-100% branch-prediction accuracy, gets by far the most benefit from additional RUU entries: going from 32 to 128 RUU entries improves performance by 50% with a 64 K cache.
These data suggest that conditional-branch prediction accuracy is a helpful indicator of sensitivity to RUU size: except for ijpeg, the poorly predicted programs do not benefit from larger RUUs. A more accurate indicator also factors in basic-block size to obtain mispredictions per instruction [15] (see the formula following this paragraph). Table IV tabulates these rates. It shows that ijpeg benefits from further increments of RUU despite its low prediction rate because it has one of the lowest misprediction-per-instruction rates. Why, then, do m88ksim and perl not benefit from deeper RUUs? Their conditional-branch accuracies may be unimpressive, but they also have low values in Table IV. Poor prediction rates for indirect branches help explain the apparent discrepancy. Recall that, unlike a BTB miss for direct branches, which can be detected early, a BTB miss for indirect branches incurs a full misprediction penalty in our simulations. Table II shows that even though these benchmarks have among the lowest direction-misprediction-per-instruction rates, they have among the worst accuracies for indirect branches: 25% for m88ksim, and 33% for perl. Indirect-branch accuracies also help explain why vortex plateaus earlier than ijpeg: although both programs have low direction-misprediction-per-instruction rates, vortex predicts only 77% of indirect branches correctly, while ijpeg predicts 98% correctly. The tradeoffs change as prediction accuracy improves and bigger RUUs are more likely to be filled with correctly speculated instructions. At the extreme of 100% direction accuracy (Figure 2), most programs derive significant benefit from bigger RUUs, even out to 256 entries.
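The indicator combines direction accuracy with branch density. Writing a for conditional-branch direction accuracy, it is simply

\[
\frac{\text{mispredictions}}{\text{instruction}} \;=\; (1 - a) \times \frac{\text{conditional branches}}{\text{instruction}},
\]

so a program with mediocre accuracy but large basic blocks (few branches per instruction), like ijpeg, can still have a low misprediction-per-instruction rate.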
A high misprediction rate limits the benefit of additional RUU entries in two major ways. First, deeper RUU entries often go unoccupied: mispredictions flush instructions so often that the deepest entries rarely even become active. Second, when occupied, deeper RUU entries often contain mis-speculated instructions. Note that early branch-predictor update (see Section II-A) would boost prediction accuracy and moderately decrease the impact of these effects, while longer misprediction-resolution times (e.g., longer pipelines) would increase it. Figure 3 illustrates the first effect for gcc, which is fairly representative of the other benchmarks in this regard. The left-hand charts show, for a 128-entry RUU, the cycle-by-cycle distribution of RUU occupancies with our baseline branch predictor; the right-hand charts show the same data for 100%-direction prediction. An RUU entry is considered occupied in a cycle if it contains any instruction, whether already executed or not, and whether mis-speculated or not. The bars use the left y-axis and show the percentage of cycles in which the RUU has a specific occupancy; the curves show the cumulative distribution of the same data using the right y-axis. For gcc with imperfect prediction and a 128-entry RUU, entries beyond the 64th are occupied only 11% of the time, and entries beyond the 96th only 2% of the time. Half of this large RUU is idle most of the time. In fact, although we do not show the data here, the deeper half of even a 48-entry RUU is idle more than 50% of the time. Deeper entries are used so infrequently that they can at best have a limited impact on performance. The low average occupancy results mainly from frequent squashing of instructions after a misprediction is identified. With perfect prediction, the picture looks quite different, as the right-hand side of Figure 3 shows: even the deepest entries of a 128-entry RUU get regular use, and the RUU is often close to full.
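Gathering such a distribution is straightforward. The sketch below (illustrative names, not HydraScalar's code) accumulates a per-cycle occupancy histogram from which curves like those in Figure 3 could be drawn:

    #include <stdint.h>

    #define RUU_SIZE 128

    /* occ_hist[k] counts cycles in which exactly k entries were occupied. */
    static uint64_t occ_hist[RUU_SIZE + 1];
    static uint64_t total_cycles;

    void record_cycle(int occupied)   /* called once per simulated cycle */
    {
        occ_hist[occupied]++;
        total_cycles++;
    }

    /* Fraction of cycles in which occupancy exceeds k (e.g., k = 64). */
    double frac_cycles_beyond(int k)
    {
        uint64_t c = 0;
        for (int i = k + 1; i <= RUU_SIZE; i++)
            c += occ_hist[i];
        return total_cycles ? (double)c / (double)total_cycles : 0.0;
    }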
With poor branch prediction, when a deeper RUU entry does become active, it often merely receives a mis-speculated instruction. These instructions can have helpful or harmful cache effects, but otherwise they are useless and may cause contention for resources. Instructions that have completed execution and are waiting to retire also occupy RUU entries. Figure 4 shows the cycle-by-cycle distribution of RUU occupancy by instructions that have been correctly speculated and have not yet executed. These counts correspond to issue-queue occupancy and are naturally lower than the overall occupancies in Figure 3. For imperfect prediction with gcc, useful occupancy exceeds 64 entries only 5.5% of the time, and exceeds 96 entries only 0.25% of the time. With 100%-direction prediction, on the other hand, useful occupancy for gcc exceeds 64 entries 15.5% of the time, and exceeds 96 entries 2.4% of the time. The change is more dramatic for other programs like go: with the baseline predictor, useful occupancy exceeds 64 entries less than 0.5% of the time, while with 100%-direction prediction it exceeds 64 entries 64% of the time.
Even if a deep RUU entry receives a useful instruction, that instruction often waits for its operands. If the instruction waits so long that a smaller RUU could also have fetched it before the operands were ready, the earlier fetch and decode afforded by the larger RUU do not help. We find that instructions almost never issue from beyond the 48th-64th entry, but recall that deeper entries rarely contain any useful unissued instructions except with perfect branch prediction. RUU size tends to matter more with smaller first-level data caches than with large ones, because smaller caches generate more misses whose latencies can be overlapped with other instructions from the RUU. This is evident in Figure 1 for go, gcc, compress, xlisp, perl, vortex, and to a lesser extent ijpeg: the slope along the RUU axis (front to back) becomes shallower as the data cache becomes larger. Some instances do arise, however, in which bigger caches cause RUU size to matter more. Consider the step from a 64-entry to a 96-entry RUU for xlisp: with an 8 K cache, Figure 1 shows no difference, while with a 512 K cache, a small improvement appears.
We studied two further consequences of branch mispredictions that might keep larger RUUs from helping. First, if a larger RUU causes branches to take longer to commit (because the branches are fetched earlier), branch-prediction accuracy may decline as a result. The branch predictor is updated when branches commit; between a branch's fetch and its commit, subsequent branches see slightly stale branch-predictor state. Average branch-resolution time for m88ksim increases by 17% when moving from a 32-entry to a 48-entry RUU, and its conditional-branch accuracy in turn falls by 1%. We attribute m88ksim's better performance with a 32-entry RUU than with larger RUUs to this effect. The same effect may explain, in Figure 1, the small decline for perl at 64-entry and larger RUUs, and the small decline for xlisp at a 64-entry RUU with large caches.
Second, mis-speculated instructions can have cache effects. These can hurt performance by creating contention or by causing extra instruction- or data-cache misses; on the other hand, they can have a prefetching effect [41]. The performance impact of these effects is less than 1% for all the benchmarks.
IV. SENSITIVITY TO L1 CACHE SIZE
A. Instruction Cache
Figures 5 and 6 show further 3D graphs of IPC as a function of cache size and RUU size; here, however, we vary I-cache size. Some of our benchmarks-compress, ijpeg, and tomcatv-fit even in an 8 K I-cache, and we omit plots for them. For the others, however, Figure 5 shows that instruction-cache size profoundly affects performance. With the baseline branch predictor, increasing the I-cache from 8 K to 16 K improves performance by 20 to 40% (much more for m88ksim), and increasing it from 8 K to 128 K nearly doubles performance for gcc, perl, and vortex. So although the SPEC benchmarks are notorious for their small cache footprints, several still benefit from larger I-caches than those found in processors today.
I-cache size has even more impact with 100% direction-accuracy branch prediction, as seen in Figure 6 . This is partly because perfect prediction eliminates the prefetching done by mis-speculated paths, but this mechanism only affects IPC by about 1%. Most of the difference is due simply to eliminating time wasted on mispredictions.
Tables V and VI present instruction-cache miss ratios for the baseline and 100%-direction predictors, respectively; RUU size in these tables is 64 entries. Note that cache miss ratios include wrong-path accesses, and recall that caches are 2-way set-associative. The higher miss rates in Table VI (100% prediction) result from the loss of prefetching by wrong-path fetching.
If enough instructions reside in the RUU, some I-cache misses can be partially or completely hidden behind the execution of waiting instructions. But I-cache misses are strongly clustered: with an 8 K I-cache, 45 to 70% of misses happen within 2 cycles of the previous miss, and with a 64 K cache, 40 to 60%. (90% of perl's I-misses happen within 2 cycles, but it experiences only a few thousand misses during the simulated interval.) When misses are so strongly clustered, even a full RUU does not help much, so frequent I-cache misses introduce many bubbles. Data misses are less clustered, so this problem is less prevalent for data references, and the RUU can do a better job of compensating for the cache misses.
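The clustering measurement itself reduces to a simple sketch (counter names assumed; the 2-cycle threshold is the one used in the text):

    #include <stdint.h>

    static uint64_t last_miss_cycle = UINT64_MAX; /* sentinel: no miss yet */
    static uint64_t clustered_misses, total_misses;

    void on_icache_miss(uint64_t cycle)
    {
        if (last_miss_cycle != UINT64_MAX && cycle - last_miss_cycle <= 2)
            clustered_misses++;
        total_misses++;
        last_miss_cycle = cycle;
    }
    /* clustered_misses / total_misses is the clustered fraction, e.g.,
       45-70% of misses for an 8 K I-cache. */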
The most important implication of these results is that small I-caches limit the value of a larger RUU. Figure 2 showed that with 100%-direction branch prediction, RUU size strongly affects performance; those data assume a 64 K I-cache. But in Figure 6, shrinking the I-cache to 8 or 16 K largely wipes out RUU effects for go, gcc, and perl. The fetch engine misses in the I-cache so often that it can rarely build a sufficient buffer of instructions, and when it does, clustering of misses means that even a big, full RUU drains and the processor still stalls. This is an especially clear example of shifting tradeoffs: the choice of one design or simulation parameter can control the impact of another. For simulation purposes, this means certain parameters can have a profound impact on how realistic the results are. I-cache size can influence how much effect RUU size has on IPC, but over the range we examine, RUU size does not influence how much effect I-cache size has on IPC.
B. Data Cache
In moving from a smaller RUU to a larger one, we might expect the processor's sensitivity to cache performance to diminish, since there are more opportunities to overlap miss times with useful work. This presumes an RUU with enough correctly speculated instructions. Turning back to Figure 1, with the baseline hybrid branch predictor, this effect is mildly visible when moving from 32 to 48 RUU entries for gcc and xlisp, and more strongly over the entire graph for vortex. But "useful" RUU occupancy is generally quite low, so a larger RUU has minimal impact.
The effect is prominent, however, with 100%-direction prediction in Figure 2: every benchmark except ijpeg exhibits declining D-cache sensitivity as the RUU becomes deeper. For most benchmarks, D-cache size hardly matters once the RUU is 96-256 entries deep, depending on the benchmark.
Plotting performance as a function of cache size, as in Figures 1 and 2, also permits estimation of the programs' working-set sizes. Table IX identifies the point at which further increases in cache size stop producing improvements in IPC, a good approximation of working-set size. (Some applications, however, have a hierarchy of working sets [47].) Knowledge of working-set sizes is useful in three ways. First, it ensures that when choosing a single data-cache size for some other type of study (e.g., a branch-prediction study), the chosen operating point does not have an artificially high miss rate. Second, it helps establish a suitable range of data-cache sizes for simulation-based studies [63]. Third, it helps when scaling cache or problem sizes.
Tables VII and VIII present data-cache miss ratios for the baseline and 100%-direction predictors, respectively. Recall that caches are 2-way set-associative. The tables verify that IPC correlates closely with L1 D-cache miss ratio.
C. Instruction Cache and Data Cache Interactions
We have examined interactions between data-cache size and RUU size, and between instruction-cache size and RUU size. Figures 7 and 8 plot IPC as a function of data-cache and instruction-cache size together. The results follow naturally from the preceding ones. Performance is strongly sensitive to I-cache size until the program fits, or almost fits, in the I-cache; until then, I-cache size matters substantially more than D-cache size. With 100%-direction prediction, data-cache size matters little because we have chosen such a large RUU. We have also run similar experiments with RUU size set to 32 entries, in which case performance is more sensitive to data-cache size, just as the "ruu32" data in Figure 2 suggest.
Just as too-small I-caches minimize the impact of changing RUU size, small I-caches limit the impact of larger D-caches. This trend is visible only in Figure 7; Figure 8 uses a 128-entry RUU, so its D-cache curves are already flat. Go, gcc, and perl especially show how a too-small I-cache limits the benefit of larger D-caches. Many data and instruction misses presumably coincide, and the effect of eliminating some data misses is hidden if the instruction misses remain.
V. BRANCH PREDICTION AND THE RELATIVE IMPORTANCE OF RUU AND L1 DATA-CACHE
A. Direction-Prediction Accuracy
Sections III and IV compared the effects of RUU size and L1 data-cache size on performance, considering both a realistic hardware branch predictor and a 100%-direction-accuracy predictor. The IPC vs. RUU-size vs. data-cache-size graphs in Figures 1 and 2 form the backbone of those observations. The change in moving from our baseline hybrid predictor to 100%-direction accuracy is substantial, and the corresponding IPC surfaces differ dramatically between those two figures. This section therefore considers how the surfaces change as we vary branch-prediction accuracy in finer steps.
Unfortunately, realistic hardware often cannot achieve accuracies near 100%: some branch behavior is simply too random. To see what might happen to RUU-cache tradeoffs at such accuracies, we boost the PHT's accuracy by randomly choosing some mispredictions and artificially correcting them. Conditional-branch predictions modified in this way behave as though their directions had been correctly predicted. The simulator can be set to attain any desired direction-prediction accuracy by adjusting the fraction of repaired predictions. This artificial adjustment takes place infrequently, so most of the PHT's predictions are untouched. We do not claim that a program would behave exactly as shown, since the mispredictions to fix are chosen randomly; we hope only to suggest how the tradeoffs between RUU size and L1 data-cache size might change as branch-prediction techniques continue to improve. In fact, the tradeoffs change in a quite clear-cut fashion.
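A minimal sketch of this boosting knob follows, assuming the functional core already knows each branch's true direction at prediction time. All names are illustrative, and rand() merely stands in for whatever random source the simulator provides.

    #include <stdlib.h>
    #include <stdbool.h>

    /* Fraction of mispredictions to repair; to move from base accuracy
       a0 to target accuracy a, set repair_fraction = (a - a0)/(1 - a0). */
    static double repair_fraction;

    bool maybe_repair(bool predicted_taken, bool actual_taken)
    {
        if (predicted_taken != actual_taken &&
            (double)rand() / RAND_MAX < repair_fraction)
            return actual_taken;    /* behave as if correctly predicted */
        return predicted_taken;     /* leave the PHT's prediction alone */
    }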
For the first boost in prediction accuracy, we use a huge but realistic PHT instead of applying our artificial-boosting technique. This huge predictor is similar to the baseline predictor, but modified to XOR the history bits with some address bits, and with 64 K entries in each table. Subsequent steps in accuracy are achieved with synthetic boosting, but using the huge predictor minimizes the number of correct predictions that must be artificially created. At 100% direction accuracy, only BTB misses remain.

(Fig. 9. Gcc's performance as a function of RUU size and L1 data-cache size for a variety of branch-prediction accuracies.)

Figure 9 presents graphs of IPC vs. RUU size vs. L1 data-cache size for gcc over a range of direction-prediction accuracies. The prediction accuracies are indicated on each graph, and all the data are plotted on the same vertical scale. The first graph uses the baseline predictor and is taken from Figure 1; the second-to-last uses 100%-direction prediction and is taken from Figure 2; the intervening graphs use the mechanisms just described. The last graph in Figure 9 goes beyond 100%-direction accuracy by correcting all BTB misses, including those for returns and indirect branches-i.e., oracle prediction. We present data for gcc as representative of the other benchmarks.
As gcc's direction-prediction accuracy increases from 86% to 100%, not only does overall performance increase, but successive steps in RUU size become useful (albeit slowly), contributing to the increase in IPC. Furthermore, as mentioned earlier, with good branch-prediction accuracy, bigger RUUs do such a good job of overlapping cache-miss latencies that data-cache size becomes less critical; conversely, big caches miss so infrequently that big RUUs are less necessary for hiding misses. The combination of these effects means that the slopes along the RUU and cache axes are roughly equal at 98.4% accuracy. In fact, the slope along the cache axis flattens more slowly for gcc than for the other benchmarks.
B. Return-Address-Stack Repair
The return-address stack is a small but important structure for achieving better control-flow prediction accuracy. Procedure returns present the same problem as other indirect branches: because a procedure might be called from many different locations (consider printf()), the target of a particular return instruction varies. Although general register-indirect jumps are hard to predict, the regular structure of call-return sequences permits a return-address stack to match returns with corresponding calls. Like other prediction techniques, the stack's prediction is only a hint: if the supplied return address is wrong, the misprediction is corrected in the writeback stage.
Return-address-stack accuracy can be an especially strong lever on performance. The stack is pushed and popped immediately after a call or return is fetched, i.e., speculatively in the fetch stage, so some pushes or pops may correspond to wrong-path instructions. As with branch history, if the stack is not repaired, the wrong-path pops or pushes may corrupt it, as Jourdan et al. have pointed out [24] . They propose a sophisticated, self-checkpointing return-address stack that saves popped entries to avoid overwriting them with future mis-speculated pushes. So long as the stack does not overflow, this structure can return to any prior state.
We have shown that simply saving the current top-of-stack pointer at the time of each branch prediction, and restoring it after a misprediction, reduces return-address mispredictions arising from wrong-path corruption by 50-93%. Saving the top-of-stack contents along with the top-of-stack pointer virtually eliminates return-address mispredictions [50]. All the simulations in this paper assume this pointer-plus-contents repair.
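A minimal sketch of the pointer-plus-contents repair appears below. The names are assumptions, and the push/pop logic itself (wrapping, overflow handling) is omitted for brevity.

    #include <stdint.h>

    #define RAS_SIZE 32

    static uint32_t ras[RAS_SIZE];
    static int tos;                               /* top-of-stack index */

    typedef struct { int tos; uint32_t top_val; } ras_ckpt_t;

    /* Taken at every branch prediction. */
    ras_ckpt_t ras_checkpoint(void)
    {
        ras_ckpt_t c = { tos, ras[tos] };
        return c;
    }

    /* Applied when recovering from a misprediction. Restoring the
       pointer alone removes most corruption; restoring the saved
       contents as well undoes a wrong-path overwrite of the top entry. */
    void ras_repair(ras_ckpt_t c)
    {
        tos = c.tos;
        ras[tos] = c.top_val;
    }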
C. Branch Resolution Delays
Even though a branch can move from decode to writeback in 5 cycles in our model, branches must wait for operands and then arbitrate for issue; in fact, branches typically take 10 or more cycles to resolve. Table X presents average branch-resolution times and their standard deviations. Note that the delay should be independent of predictor organization, except insofar as differences in prediction change the instruction flow through the processor.
Such long resolution times typically mean the processor has multiple branches in flight in different stages of the pipeline. This requires a structure to store shadow register maps, return-address-stack repair state, and, if applicable, branch-history fixup state. The structure has a fixed depth (20 entries in the Alpha 21264 [26], for example) beyond which fetch stalls. For simplicity, however, our simulations assume unlimited shadow-state capacity.
VI. SIMULATION TECHNIQUES
The SPECint programs typically run for billions of instructions with the reference inputs; even the smaller inputs typically run for hundreds of millions. But the fastest detailed microarchitecture simulators still take about an hour per 100 M instructions on an UltraSPARC II or Pentium Pro. Simulating to completion is usually too expensive, especially for studies like this one, which need hundreds of separate simulation runs.
Using very small inputs and scaling hardware structures accordingly is one possible solution. Scaling is risky, however: appropriate scaling factors are not always evident, and one must be careful to scale all relevant structures appropriately. Smaller inputs may also change the relationships among various factors: if loops iterate less often, for example, branch-prediction behavior may change, inherent ILP may change, and so forth.
We focus on the obvious alternative: selecting a small, representative simulation window from a full-length run with the reference input. Many researchers do this because it makes simulation so much simpler. In particular, we simulate just 50 M instructions for each program. This section argues that a small simulation window like this must be chosen carefully, but can be reasonably representative of general program behavior. This makes it possible to simulate many different configurations or many benchmarks in a short period of time.
Sampling schemes are widely used in architecture studies and several pieces of prior work have investigated sampling methodologies, particularly related to cache memory simulations. For example, Laha et al. [30] studied the accuracy of memory reference trace sampling using caches that were 128 KB and smaller. Their study concluded that sampling allows accurate estimates of cache miss rates, but their results were presented for fairly unaggressive sampling techniques: they simulated 60% of all memory references.
As the fraction of references or instructions modeled becomes smaller, the question of how to "prime" the cache (how to deal with the unknown cache state at the beginning of each sample) becomes important. Our work takes a brute-force approach and simply simulates all instructions preceding the desired sample, just at a lower level of detail: only the model's caches, branch predictor, and architectural state are updated. Other work has studied analytic models for estimating cache miss rates during the unprimed portion of the sample [25], [64], or described means for bounding errors by adjusting simulation lengths [34]. Iyengar and Trevillyan have derived the R-metric for measuring the representativeness of a trace [18]; they generate traces by scaling basic-block transition counts and adjusting selected instructions to optimize the R-metric. Their technique incorporates cache and TLB behavior as well as branch-prediction behavior, but because it uses traces, important mis-speculation effects may be omitted.
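In outline, the brute-force fast mode looks like the following sketch; the hooks into the functional core are assumptions for illustration, not SimpleScalar's actual interfaces.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical hooks into the simulator's functional core. */
    typedef struct { uint32_t pc; bool is_branch; } inst_t;
    extern inst_t functional_execute_next(void); /* architectural state only */
    extern void   update_caches(const inst_t *in);
    extern void   update_predictor(const inst_t *in);

    void fast_forward(uint64_t skip_insts)
    {
        for (uint64_t i = 0; i < skip_insts; i++) {
            inst_t in = functional_execute_next(); /* no pipeline timing */
            update_caches(&in);                    /* keep caches warm   */
            if (in.is_branch)
                update_predictor(&in);             /* keep predictor warm */
        }
        /* Full-detail simulation of 1 M instructions then primes the
           remaining structures before statistics are collected. */
    }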
To choose our simulation window accurately, we measured interval branch-misprediction rates for each benchmark: i.e., the misprediction rate computed separately over each million-instruction interval of the program. This exposes representative segments of the trace. To illustrate, Figure 10 shows the interval-misprediction traces for four of the benchmarks; a pair of vertical lines delineates the 50 M-instruction segment we have chosen as our simulation window. Traces for all the programs appear in [54]. M88ksim's trace resembles perl's in having a short initial phase and a fairly flat trace afterwards; compress, vortex, and tomcatv resemble ijpeg in having clearly distinct, repeating phases. Compress's phases correspond to successive compress and decompress passes, which are artifacts of the benchmark version. Compress also has a dramatic startup phase of about 1.7 billion instructions during which it generates the data to be compressed and decompressed. This too is an artifact of the benchmark version, but these versions are the ones many architects use for their simulations, and the misprediction rate during the startup phase is inordinately high (14.5%, compared to a maximum of 11% for the rest of the program), creating a risk of substantially unrepresentative results if simulations include too much of this segment of compress's execution.
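Such interval traces are cheap to gather during either fast-mode or full-detail simulation. A sketch with assumed counter names:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define INTERVAL 1000000ULL    /* one million committed instructions */

    static uint64_t insts, branches, mispredicts;

    void on_commit(bool is_branch, bool mispredicted)
    {
        insts++;
        if (is_branch) {
            branches++;
            if (mispredicted)
                mispredicts++;
        }
        if (insts % INTERVAL == 0) {   /* emit one point per interval */
            printf("%llu %.4f\n",
                   (unsigned long long)(insts / INTERVAL),
                   branches ? (double)mispredicts / branches : 0.0);
            branches = mispredicts = 0;
        }
    }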
We also obtained interval traces for data- and instruction-cache miss rates and ensured that our chosen segment was suitable with respect to these data as well. In most cases the different traces show similar qualitative behavior: major shifts most likely correspond to different phases of the programs' computations. Cache traces also appear in [54].
As mentioned with regard to compress, most programs show some sort of initial phase. In several cases the misprediction rate during initialization differs markedly from that during later phases of the program: go, compress, perl, and tomcatv are examples. If this initial behavior represents too large a fraction of the simulation, the results are unreliable. The simulation window should therefore be placed after any such initial phases. Fortunately, this can be done fairly quickly using fast-mode simulation, and SimpleScalar will soon offer a checkpointing facility that removes even the need for fast-mode simulation [3]. Unless the initial behavior dominates the rest of the program's execution, its omission still gives reasonable results.
Although simulating a small but well-chosen window can produce representative results, many simulation-based studies have modeled only 50-100 M instructions from the beginning of a program's execution. This over-emphasizes initial behavior, and for some programs includes only initial behavior, risking substantial distortion of results. To show how significantly results can change when the 50 M-instruction simulation window is dominated by an initial phase, Figure 11 plots abbreviated versions of our 3D IPC vs. RUU-size vs. data-cache-size graphs for go, gcc, and perl, for both regular and 100% branch prediction. The left two columns show data for baseline branch prediction, and the right two columns show data for 100%-direction prediction. In each case, we compare a safe warmup period (the warmup times given in Table II) with just a million-cycle warmup period. In all cases, behavior differs markedly between the two-between the first and second graphs, and between the third and fourth graphs-even though the one million cycles of warmup eliminate the most egregious startup effects. For go and perl, the initial phase has substantially different branch-prediction behavior, so such differences come as no surprise. Gcc, on the other hand, has no clear initial phase, but second-level cache misses contribute significantly to its behavior, and a 1 M warmup is too short for big L2 caches. Of course, as simulations run longer, distortion from initial phases has less impact on overall results. On the other hand, the initialization phase can be long: e.g., 1.5 B instructions for compress, and over 2 B for vortex and many floating-point programs-prohibitively long for full-detail simulation.

Instead, a 50 M-instruction simulation window carefully chosen from later in the program's execution reliably gives representative results for SPECint programs. As evidence, Figure 12 gives further 3D graphs comparing our 50 M window to a 250 M-instruction simulation window (we have 500 M-instruction data for go, so we present that instead). Again, compare the first graph to the second, and the third to the fourth. Particular IPC values change slightly, but the IPC-cache-RUU surfaces remain similar. Although the data are not presented here, the same is true for all the other benchmarks (see [54] for the remaining data). The one slight exception is gcc, for which the 50 M step from an 8 K to a 16 K cache is not as pronounced as with 250 M; data with a 100 M window, however, match the 250 M data. The discrepancy is due to L2 cache misses: with a perfect L2, the 50 M and 250 M data match closely. Ijpeg is also sensitive to L2 misses, because its L2 miss behavior does not settle down until several execution phases have passed, but we did find a 50 M-instruction window that captures representative behavior.
Gathering the data for Figure 12 forced us to refine our choice of window in the cases of ijpeg and gcc. Choosing a representative 50 M-instruction window using the interval traces is a good heuristic, but verifying it by comparing 50 M-instruction plots of IPC vs. RUU size vs. D-cache size against 250 M-instruction plots provides even more reliable simulation windows.

VII. RELATED WORK

Prior work has characterized benchmark behavior along related dimensions; commercial workloads, for example, have been found to have much larger instruction-cache footprints than the SPEC benchmarks [35]. Many commercial programs also have a large static branch footprint, while most SPEC programs do not [11].
Other fundamental research has focused on understanding and improving branch-prediction accuracy by both hardware and software means. Lee, Chen, and Mudge [32], Sechrest, Lee, and Mudge [49], and Skadron et al. [53], among others, have described and measured the different causes of mispredictions, and Evers et al. [13] recently measured the benefits of branch correlation. Yeh and Patt [65], Pan et al. [40], McFarling [36], Sprangle et al. [57], and Eden and Mudge [10], among others, have proposed hardware mechanisms for keeping multi-level branch predictors, for tracking correlations between branches, and for avoiding contention among branches in the predictor's state tables. Such hardware branch-prediction mechanisms have been widely incorporated into commercial designs [16], [26], [38]. Some work has also explored software-based branch-prediction techniques: Young, Gloy, and Smith [66], [67] have demonstrated compiler-based methods for correlated branch prediction, while Mahlke and Natarajan [33] and August et al. [2] have examined branch prediction synthesized in the compiler. Rotenberg et al. take an even more aggressive tack: they reorganize the processor around traces, groups of basic blocks coalesced into a single unit. When the fetch engine hits in the trace cache, it can provide several basic blocks every cycle without the need for merging cache blocks, multiple-branch or multi-ported predictors, or a multi-ported instruction cache [19], [46]. On a more theoretical level, Gloy and Emer have developed a general language for describing predictors and show how it can be used for automated synthesis of predictors; the resulting structures can be complex, but this model may yield insight into avenues for further improvement [12].
Predicting branch targets is important, too. To better understand the performance effect of BTB misses, Michaud, Seznec, and Uhlig measure compulsory BTB misses [37] for all types of branches. Calder and Grunwald [6], Chang, Hao, and Patt [8], and Driesen and Hölzle [9] have all examined ways to augment the BTB by taking prior branch-target history into account. None of these papers explicitly treats predicting return-instruction targets, for which return-address stacks can virtually eliminate mispredictions. Jourdan et al. [24] and Skadron et al. [50] both focus on return-address-stack design, especially on mechanisms for repairing the return-address stack after it has been modified by mis-speculated instructions.
While branch prediction is a well-known performance "lever," its relationship to cache design decisions has not previously been quantitatively evaluated. Jouppi and Ranganathan [22] find that branch prediction is a stronger limitation on performance than memory latency or bandwidth.
Finally, cache design has been a key issue for processor architects for many years. Many papers study the tradeoffs between L1 cache size and speed; the most recent, simulating a MIPS R10000 model, is by Wilson and Olukotun [62]. Prefetching helps tolerate load latencies, but under out-of-order execution it must take into account that many misses are adequately tolerated without prefetching. Most prefetching techniques [39], [42] were developed with simpler processor models in mind, but [48], for example, discusses data prefetching for the HP PA-8000. Farkas et al. [14] have recently provided insights regarding memory-system design for dynamically scheduled processors, and Johnson and Hwu [21] discuss a cache-allocation mechanism to prevent rarely accessed data from displacing frequently accessed lines. Srinivasan and Lebeck measure loads' latency tolerance and demonstrate the importance of quickly completing loads that feed branch instructions [58]. None of these papers touches on the relationship of branch prediction to cache design as our paper does.
VIII. CONCLUSIONS
By presenting a database of simulation results for the SPECint programs, this paper examines shifting tradeoffs among instruction-window size, first-level data- and instruction-cache size, and branch-prediction accuracy in high-performance processors. The results show that, regardless of cache size, more than 48 RUU (instruction-window) entries yield almost no performance benefit for many benchmarks. This occurs mostly because deeper entries go unused: mispredictions occur often enough to prevent the RUU from building up a large pool of instructions, and even when the deeper entries are active, they usually contain only mis-speculated instructions. This falls far short of the optimum: for programs with branch-prediction accuracies near 100%, or if branch prediction could be made perfect, adding RUU entries even out to 256 yields strong benefits.
For many SPECint programs, L1 data-cache size is a strong lever on performance. It becomes less so as branch prediction improves and deeper RUU entries become useful, because the deeper RUU affords enough lookahead to overlap L1 misses with useful computation; at the extreme of 100% prediction accuracy, data-cache size hardly matters at all. Conversely, as cache size increases, bigger RUUs help less, because fewer misses occur and less lookahead is necessary. Nevertheless, as branch-prediction accuracy improves, sensitivity to RUU size increases quickly, and RUU effects eventually dwarf L1 data-cache effects.
The picture is somewhat different for L1 instruction-cache misses, which are typically so tightly clustered-often only 1 or 2 cycles elapse between misses-that even with 100% prediction and a deep window of instructions, a too-small I-cache is the dominant bottleneck.
The most important bottleneck nevertheless remains branch prediction. L1 data-miss penalties will always inspire innovations, but caches are now becoming sufficiently big and sophisticated that future work should perhaps focus specifically on latency-intolerant misses and on better branch prediction. As architects attack the branch-prediction bottleneck with more sophisticated hardware schemes and alternative techniques-trace caches [45] , compiler-enhanced hardware prediction [2] , [33] , predication [20] , [44] , and multi-path execution [1] , [27] , [28] , [61] -larger RUUs will become attractive.
This paper also considers sampling techniques to allow shorter but full-detail simulations. For the SPECint programs, fairly short samples of 50 M instructions from simulations with reference inputs yield good results, but accuracy becomes quite sensitive to the choice of the simulation window. The sample must come from a point after any initial execution phases, which can be quite long, up to several billion instructions.
Because latencies can overlap or compound each other in modern out-of-order processors, design parameters interact in sometimes complex ways. This paper has illustrated a number of the resulting tradeoffs and, more importantly, has highlighted some potential pitfalls that can result from unwise combinations of branch-prediction, instruction-window, and cache configurations. The comprehensive data presented here make two main contributions. First, they quantitatively show the importance of considering these configuration choices in conjunction, rather than sizing structures individually or independently. Second, they help cull the design space, allowing researchers to avoid expensive or methodologically flawed simulations.
