Abstract
Introduction
Processor technology is advancing at a rapid pace. Over the past two decades, CPU performance has roughly doubled every one and a half years. Unfortunately, memory latencies have not improved as quickly. Consequently, the speed-gap between CPU and memory is constantly growing and has reached a point where it presents one of the biggest performance bottlenecks. Load value prediction is a relatively new approach to improve the performance of memory systems by breaking dependence chains and hiding the growing latencies. In this paper, we study which predictor combinations yield the most effective hybrid load value predictors.
Load instructions often fetch predictable sequences of values [LWS96] . For instance, about half of all the load instructions in the SPECint95 benchmark suite retrieve the same value that they did the previous time they were executed. Such behavior, which has been demonstrated explicitly on a number of architectures, is referred to as value locality [Gab96, LWS96] . The load value locality can be exploited to predict the result of a load instruction before the memory can provide the value.
Correct load value predictions enable the CPU to continue processing the dependent instructions without having to wait for the memory access to finish. Of course, it is only known whether a prediction was correct once the true value has been retrieved from memory, which can take many cycles.
Speculative execution allows the CPU to continue execution with a predicted value before the prediction outcome is known. If it later turns out that the prediction was correct, the speculative status is simply dropped. If the prediction was incorrect, everything that the CPU did using the incorrect value has to be purged and redone with the correct value.
Because branch predictors require a similar mechanism to recover from mispredictions, most modern CPUs already contain the necessary hardware to perform this kind of speculation [Gab96] . However, recovering from mispredictions takes time and slows down the processor. Load value prediction is therefore only effective if most of the predictions are correct.
Several distinct types of load value locality have been identified and predictors to exploit them have been proposed [BuZo99, Gab96, LWS96, SaSm97b, TuSe99, WaFr97]. While the best performing predictors in the current literature are all hybrids [BuZo00, PMT99, RFKS98, WaFr97], no systematic study of such predictors has been performed. The goal of this paper is to evaluate all hybrids that can be built out of a register value, a last value, a stride 2-delta, a last four value, and a finite context method predictor to determine components that complement each other well and thus yield high-performing hybrid load value predictors.
We identified novel hybrid combinations that are smaller and simpler than the best hybrids from the literature yet exceed their speedup. These new hybrids yield harmonic-mean speedups over the eight SPECint95 programs of up to 18%.
Our study shows that hybrids are able to deliver 25% more speedup than the best single-component predictors and that different components contribute independently to the overall performance. We infer that the existing, distinct types of load value locality can only be exploited effectively using multi-component predictors, in which each component is tailored to a different kind of locality.
Our analysis also revealed some unexpected results. For example, powerful individual components frequently do not yield effective hybrids when combined. On the other hand, some components that perform rather poorly when used in isolation can form strong coalitions with other components. Several hybrids from the literature were found to contain components that predict highly overlapping sets of load instructions and therefore do not complement one another well. Furthermore, some hybrids actually yield a lower performance than their individual components because of negative interference.
This happens when adding a new component causes more selector-related losses than the added predictability can compensate for.
The remainder of this paper is organized as follows. Section 2 introduces the five components we use in our hybrids, Section 3 presents the evaluation methods, Section 4 discusses the performance of the hybrid load value predictors, Section 5 lists related work, and Section 6 concludes the paper.
Basic Load Value Predictors
It is almost impossible to predict a random load value correctly because a 32-bit word can hold over four billion distinct values and a 64-bit word over $10^{19}$ values. Even with merely twenty equally distributed values the odds of picking the correct one are only five percent, which is probably too low to be useful. This is why almost all the proposed load value predictors make predictions based on context, that is, based on recently loaded values.
Using context results in very promising predictors because load values tend to cluster, repeat, occur in iterating sequences, exhibit discernable patterns, and correlate with one another. Such behavior is referred to as value locality or predictability. To illustrate the extent of the existing load value locality, we present Table 2.1. The table lists five types of predictability found in the eight benchmark programs we use throughout this study. The numbers reflect the percentage of executed load instructions that are predictable.
- The register value predictability (reg) indicates how frequently the target register of a load instruction already contains the value that the load is about to read.
- The last value predictability (lv) shows how often a load fetches a value that is identical to the previous value fetched by the same load instruction.
- The stride 2-delta predictability (st2d) reflects how frequently a load instruction loads a value that is identical to the last value plus the difference between the last and the second to last loaded value.
- The last four value predictability (l4v) indicates how often a value is loaded that is identical to any one of the last four values fetched by the same load instruction.
- The finite context method predictability (fcm) shows how frequently a value is loaded that is identical to the value that followed the same sequence of last four values when it was last encountered by any load instruction in the program.
Note that the results for the finite context method are implementation specific, that is, they depend on the hash function and the table size that are used. We used a direct-mapped, tagless table with 2048 entries and a shift-xor hash function to obtain these results (see Section 2.5). Whenever the memory system satisfies a load request, the corresponding predictor line is updated with the true load value and possibly other information. Note that all load instructions, whether they are predicted or not, access the memory and therefore update the predictor. The benefit of load value prediction does not come from removing load instructions but from breaking dependencies and hiding latencies, that is, from taking the load instructions out of the critical path.
All the predictors used in this study are direct mapped, meaning the n least significant bits that are not always zero of a load instruction's PC are used as an index into the predictor to select one of the $2^n$ predictor lines. Note that load value predictors are indexed using the PC of the load instruction as opposed to conventional caches, which use the effective address. The index for a predictor with $2^n$ lines is computed as follows.
$\mathrm{index}(PC_{load\_instr}) = (PC_{load\_instr} \gg 2) \bmod 2^n$
This is probably the simplest and fastest meaningful hash function. Shifting right by two eliminates the two least significant bits that are always zero because instructions have to be word-aligned in the Alpha processor. Utilizing a more complex hash function may result in less aliasing but will most likely increase the length of the critical path. Since direct-mapping results in only little aliasing even with moderate predictor sizes, this simple but effective hash function is used throughout the literature [Gab96, GaMe98, LiSh96, LWS96, SaSm97b, WaFr97].
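The index computation can be illustrated with a short Python sketch. The function name, the example PC values, and the choice n = 11 (2048 lines) are ours and serve only as an illustration; the sketch also shows how two loads that are exactly 2^n instructions apart alias.
```python
def predictor_index(pc: int, n: int) -> int:
    """Map the PC of a load instruction to one of 2**n predictor lines."""
    # Drop the two always-zero low bits (word-aligned instructions),
    # then keep the n least significant remaining bits.
    return (pc >> 2) % (1 << n)


n = 11  # 2048 predictor lines (illustrative choice)
pc_a = 0x120001000               # hypothetical load PC
pc_b = pc_a + 4 * (1 << n)       # a load exactly 2**n instructions later
index_a = predictor_index(pc_a, n)
index_b = predictor_index(pc_b, n)
print(index_a, index_b)          # both loads map to the same line
```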
Note that in cache terminology, direct mapping implies the presence of tags. However, load value predictors, unlike caches, do not have to be correct all the time and tags are therefore not mandatory.
Because of the small amount of aliasing, predictors often have only partial tags or no tags to reduce their size. If two or more load instructions do alias (i.e., they have the same index), they need to share a line in the predictor and may evict each other's information.
The generic load value predictor from Figure 2.1 can be tailored to exploit different kinds of load value locality by selecting the kind of information to be stored in it and the computation to be performed with this information. The following subsections describe possible implementations of five basic load value predictors that exploit last value, register value, stride 2-delta, finite context method, and last four value locality. The last subsection discusses confidence estimators, which represent an important additional component in hardware-based load value predictors.
Last Value Predictor
The last value predictor [Gab96, LWS96] always predicts that a load instruction will load the same value that it did the previous time it was executed. Hence, the only information that needs to be stored in the predictor is the most recently loaded value. Predictions retrieve this value, and updates store the true load value in the predictor to make it available for the next prediction.
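The following Python sketch models the behavior just described. It is a software illustration under our own assumptions (class and method names, tagless lines initialized to zero, 2048 lines), not a description of the hardware.
```python
class LastValuePredictor:
    """Direct-mapped, tagless last value predictor (illustrative sketch)."""

    def __init__(self, n_bits: int = 11):
        self.mask = (1 << n_bits) - 1
        self.lines = [0] * (1 << n_bits)  # one value per predictor line

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> int:
        # Predict the value this line saw most recently.
        return self.lines[self._index(pc)]

    def update(self, pc: int, value: int) -> None:
        # Store the true load value for the next prediction.
        self.lines[self._index(pc)] = value


lv = LastValuePredictor()
pc = 0x120001000  # hypothetical load PC
hits = 0
for value in [5, 5, 5, 7, 7]:
    hits += lv.predict(pc) == value
    lv.update(pc, value)
print(hits)  # number of correct predictions for this short sequence
```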
The last value predictor's operation can formally be described as follows, where the numeral subscripts indicate the size in number of bits, "ld" refers to the load instruction being predicted or updated, "p_value" is the predicted value, and "u_value" is the update value. The first line, which describes the predictor, lists the fields making up a predictor line inside the curly brackets followed by the name of the predictor and the number of predictor lines. In this case, the LV predictor's lines contain a single 
Register Value Predictor
The register value predictor [TuSe99] is even simpler than the last value predictor. Since it always predicts that the target register of the load instruction (the register that is about to receive the loaded value) already contains the correct load value, i.e., that the load instruction is a NOP, no values have to be stored in the predictor. In Section 2.6, we will see that this predictor still needs to store some information to work well. 
Stride 2-Delta Predictor
The stride predictor [Gab96] truly computes the predicted value and is therefore able to predict never before seen values. In its conventional form, this predictor stores the last value along with the difference (called the stride) between the last and the second to last loaded value. The stride is added to the last value when a prediction is made to form the predicted value. Once the true load value is available, the predictor's stride field is updated to reflect the difference between the last value (which is stored in the predictor) and the true load value. Then the last value in the predictor is overwritten with the true load value. Since about 98% of all the observed strides fall within the range of -128 to 127 [RFKS98] , eight bits per predictor line are sufficient to capture almost all strides.
Unfortunately, the normal stride predictor makes two mispredictions at every transition from one predictable sequence to another. This is a problem in practice because programs fetch a surprisingly large number of short sequences of repeating values [Bur00] .
To remedy this shortcoming, a more sophisticated version of this predictor has been proposed called the stride 2-delta predictor [SaSm97a] . The 2-delta refers to the fact that this predictor retains two strides instead of only one. The first stride is identical to the one found in the conventional stride predictor. The second stride is only updated if the current update-stride is the same as the stride already stored in the first stride field. In other words, the second stride is only updated if the same stride has been seen at least twice in a row. Only the second stride is used for making predictions.
Of course, the second stride field also only needs to be eight bits wide. In the pseudo code describing the stride 2-delta predictor below, the function $\mathrm{lsb}_{0..7}(x)$ extracts the eight least significant bits of x.
Unless otherwise noted, all stride predictor results in this study refer to the stride 2-delta predictor. The stride 2-delta predictor can predict sequences of repeating values; the stride is simply zero in this case. In addition, it can predict sequences that exhibit genuine stride behavior (e.g., -4, -2, 0, 2, 4). Such sequences are, however, not very frequent [Gab96, SaSm97a] because register allocation assigns induction variables to registers, but they do occur when a program uses global variables as counters.
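A single predictor line of the stride 2-delta scheme can be sketched in Python as follows. The names are ours, the sign extension of the eight-bit strides is an assumption, and values are not wrapped to the machine word width as they would be in hardware. The example sequence shows that the predictor mispredicts only once each time the short sequence 1, 2, 3, 4 restarts.
```python
def to_int8(x: int) -> int:
    """Keep the eight least significant bits of x, read as a signed value."""
    x &= 0xFF
    return x - 256 if x >= 128 else x


class Stride2DeltaLine:
    """One line of a stride 2-delta predictor (illustrative sketch)."""

    def __init__(self):
        self.last = 0     # most recently loaded value
        self.stride1 = 0  # most recent stride
        self.stride2 = 0  # stride seen at least twice in a row

    def predict(self) -> int:
        # Only the second stride is used for predictions.
        return self.last + self.stride2

    def update(self, value: int) -> None:
        new_stride = to_int8(value - self.last)
        if new_stride == self.stride1:
            self.stride2 = new_stride  # same stride twice in a row
        self.stride1 = new_stride
        self.last = value


line = Stride2DeltaLine()
hits = 0
for value in [1, 2, 3, 4] * 3:
    hits += line.predict() == value
    line.update(value)
print(hits)  # correct predictions out of 12
```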
Last Four Value Predictor
The last four value predictor [BuZo99, LiSh96] is similar to the last value predictor except every predictor line contains the four most recently loaded values instead of only the most recent value.
Retaining more than just the last value has been shown to improve performance, even when scaling predictors to the same overall size [BuZo99]. The last four value predictor essentially consists of four independent last value predictors operating in parallel and a meta-predictor that chooses which predictor to believe. The operation of the meta-predictor and the corresponding select function (see below) are discussed in Section 2.6.
Finite Context Method Predictor
The most complex and sophisticated non-hybrid predictor we investigate is the finite context method predictor [SaSm97a, SaSm97b] . It retains the last four loaded values in every predictor line. However, since these values are only used to compute an index into the predictor's second level (a lookup 
The predictor: level1 ◊ level2
Finite context method predictors can predict long reoccurring sequences of arbitrary values (e.g., 3, 7, 4, 9, 2, ..., 3, 7, 4, 9, 2). These sequences occur, for instance, during the repeated traversal of dynamic data structures. Note that FCM predictors can also predict alternating sequences and sequences exhibiting stride behavior as long as the sequences repeat and their lengths do not exceed the size of the predictor's second level.
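To make the two-level organization concrete, the following Python sketch models a finite context method predictor. It rests on several assumptions of ours: the class and method names, the particular shift-xor hash (in particular the shift amount of three), zero-initialized tagless tables, and keeping the four values rather than a hash of them in the first level. It is meant as an illustration, not as the exact configuration used for the measurements.
```python
class FCMPredictor:
    """Two-level finite context method predictor (illustrative sketch)."""

    def __init__(self, l1_bits: int = 11, l2_bits: int = 11, order: int = 4):
        self.l1_mask = (1 << l1_bits) - 1
        self.l2_mask = (1 << l2_bits) - 1
        # Level 1: per-load history of the last `order` values, oldest first.
        self.level1 = [(0,) * order for _ in range(1 << l1_bits)]
        # Level 2: shared table with the value that last followed a history.
        self.level2 = [0] * (1 << l2_bits)

    def _l1_index(self, pc: int) -> int:
        return (pc >> 2) & self.l1_mask

    def _hash(self, history) -> int:
        # Hypothetical shift-xor hash; the shift amount is an assumption.
        h = 0
        for v in history:
            h = ((h << 3) ^ v) & self.l2_mask
        return h

    def predict(self, pc: int) -> int:
        history = self.level1[self._l1_index(pc)]
        return self.level2[self._hash(history)]

    def update(self, pc: int, value: int) -> None:
        i = self._l1_index(pc)
        history = self.level1[i]
        self.level2[self._hash(history)] = value  # remember what followed
        self.level1[i] = history[1:] + (value,)   # shift in the new value


fcm = FCMPredictor()
pc = 0x120001000  # hypothetical load PC
outcomes = []
for value in [3, 7, 4, 9, 2] * 3:
    outcomes.append(fcm.predict(pc) == value)
    fcm.update(pc, value)
print(outcomes[-5:])  # the third traversal of the sequence
```
After the contexts have been seen once, every value of the repeating sequence is predicted correctly, which is the behavior described above.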
Confidence Estimation
A substantial fraction of the executed load instructions cannot be correctly predicted with the currently known prediction techniques. Trying to predict these loads will inevitably result in mispredictions. Because recovering from mispredictions takes time, a high misprediction-rate can incur a recovery cost that eradicates any benefit from correct predictions. Hence, it is possible for a load value predictor to slow down the processor instead of speeding it up.
To keep the number of mispredictions at a minimum, almost all load value predictors incorporate some form of confidence estimator to identify predictions that are likely to be incorrect so that they can be inhibited [CRT99, LWS96, ReCa98, RFKS98, SaSm97b, TuSe99, WaFr97]. Inhibiting such predictions reduces the number of mispredictions (and the associated recovery cost) and thus improves the predictor's performance.
One way of estimating the likelihood of a correct prediction is to look for discernable patterns in the past predictability of a load instruction. The intuition is that the recent behavior is usually a good indicator of what will happen next. For example, if a load was predictable every other time it was executed in the recent past, there is a good chance that the outcome of the next prediction will be the same as the outcome of the second last prediction.
The SAg confidence estimator exploits such predictability patterns. It stores the prediction outcomes in a bit-pattern (called a history) in which the nth bit represents the outcome of the nth last prediction. Usually a one encodes a predictable value and a zero an unpredictable value. Whenever the memory returns a load value, this value is compared with its predicted value (even if the prediction was not used) and the outcome of this comparison is shifted into the history, whereby the oldest bit is lost.
In order to use such histories as a measure of confidence, it is essential to know which ones are (frequently) followed by a correct prediction and which ones are not. The SAg confidence estimator uses saturating counters to record the number of predictable values that followed each possible history pattern. Predictions are only allowed if the counter value associated with the current prediction outcome history is above a preset threshold. Thus, the counters dynamically assign a confidence to each history and continuously adjust which patterns should be followed by a prediction and which ones should not.
The following pseudo-code describes the operation of the SAg confidence estimator, which is named after the structurally identical SAg branch predictor [YePa93] . m denotes the number of history bits in each line and x represents the number of bits in each saturating counter. The threshold, the top, and the penalty, as well as the number m are parameters of the SAg confidence estimator. The best setting of these parameters depends to varying degrees on the load value predictor, the programs, and the recovery mechanism used.
Note that the first level of the SAg confidence estimator is normally merged with the first level of the load value predictor. Load value predictors with confidence estimators therefore have an additional field in each line, which is why the Reg predictor also has to store some information.
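The operation of the SAg confidence estimator can be sketched in Python as follows. The update rule (increment by one up to the top value after a correct prediction, decrement by the penalty down to zero after an incorrect one) is our assumption about how the top and penalty parameters act. The default threshold and penalty are illustrative values only, and the top value must fit into the x-bit counters.
```python
class SAgConfidenceEstimator:
    """Two-level SAg confidence estimator (illustrative sketch)."""

    def __init__(self, lines=2048, m=10, top=16, threshold=8, penalty=4):
        self.hist_mask = (1 << m) - 1
        self.top = top
        self.threshold = threshold
        self.penalty = penalty
        self.histories = [0] * lines    # level 1: one m-bit history per line
        self.counters = [0] * (1 << m)  # level 2: one counter per pattern

    def confident(self, line: int) -> bool:
        # A prediction is allowed only if the counter is above the threshold.
        return self.counters[self.histories[line]] > self.threshold

    def update(self, line: int, correct: bool) -> None:
        h = self.histories[line]
        if correct:
            self.counters[h] = min(self.top, self.counters[h] + 1)
        else:
            self.counters[h] = max(0, self.counters[h] - self.penalty)
        # Shift the newest outcome into the history; the oldest bit is lost.
        self.histories[line] = ((h << 1) | int(correct)) & self.hist_mask


ce = SAgConfidenceEstimator()
confident_before = ce.confident(0)
for _ in range(30):
    ce.update(0, True)  # the load on line 0 is always predictable
confident_after = ce.confident(0)
print(confident_before, confident_after)
```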
We can now explain the select function used in the L4V predictor (Section 2.4). It simply picks the component that reports the highest confidence, giving younger values the priority in case of a tie.
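A minimal Python sketch of this select function is given below. It assumes that the four stored values are ordered from youngest to oldest and that each comes with a confidence count; the function name and the data layout are ours.
```python
def select_l4v(values, confidences):
    """Return the stored value with the highest confidence.

    Both lists are ordered from the youngest to the oldest value,
    so a strict comparison lets the youngest value win ties.
    """
    best = 0
    for i in range(1, len(values)):
        if confidences[i] > confidences[best]:
            best = i
    return values[best]


# Values 20 and 30 tie for the highest confidence; 20 is younger.
chosen = select_l4v([10, 20, 30, 40], [3, 7, 7, 1])
print(chosen)
```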
Evaluation Methods
All measurements pertaining to this study are based on the Alpha AXP architecture [DEC92] . The performance of the various load value predictors is evaluated using the AINT simulator [Pai96] with a cycle-accurate, superscalar back-end that runs native Alpha binaries. The simulator is configured to emulate a high-performance microprocessor similar to the Alpha 21264 [KMW98] . It accurately models the processor's internal timing behavior, resource constraints, and speculative execution as well as the memory hierarchy and its latencies. Only bus-contention is not modeled.
The simulated CPU is four-way superscalar, issues instructions out-of-order from a 128-entry instruction window, has a 32-entry load/store buffer, four integer and two floating point units, a 64kB two-way set associative L1 instruction-cache, a 64kB two-way set associative L1 data-cache, a 4MB unified direct-mapped L2 cache, a 4096-entry branch target buffer (BTB), and a 2048-line hybrid gshare-bimodal branch predictor. The three caches have a block size of 32 bytes. Not modeling bus-contention, assuming fully pipelined functional units, and allowing up to four load instructions to issue per cycle reduce the average instruction latency in comparison to real CPUs.
In addition, ignoring bus-contention also reduces the memory latency. A lower instruction latency implies more executed load instructions per time-unit, which increases the pressure on the load value predictor. Hence, the performance of a load value predictor would likely, if anything, be higher in a real CPU than the measurements in this study indicate because of the reduced chance of making an out-of-date prediction and the fewer dropped updates due to a busy predictor. The slightly longer-than-modeled memory latency in real systems has the same effect, i.e., it decreases the pressure on the predictor while at the same time making correct load value predictions more beneficial because of the even longer load latency that is hidden.
We study the performance of load value predictors in the presence of two distinct misprediction recovery mechanisms. The simpler but less powerful re-fetch mechanism is the one already used for recovering from branch mispredictions [Gab96]. When a misprediction is detected in this scheme, all the instructions that follow a mispredicted instruction are purged from the instruction window and the processor state is reset to what it would have been had no instruction beyond the mispredicted one executed. The CPU then continues processing instructions by fetching the next instruction, that is, the instruction that immediately follows the mispredicted instruction. Re-fetch recovery incurs a cycle-penalty because it takes time to purge instructions from the instruction window, to restore the CPU's state, and to re-fetch instructions.
Unfortunately, in this scheme instructions are sometimes purged whose results are correct. For example, if instruction X is independent of an earlier load instruction L, then X may execute in an out-of-order processor before the load is completed. Because instruction X is independent of L, its result does not depend on the load value and should therefore be correct. Purging X is consequently not necessary, even in the presence of a mispredicted value for L.
In fact, mispredicting L does not even invalidate the instructions that do depend on L (up to the first conditional branch instruction whose branch target depends on L). In the worst case, these instructions are executed with an incorrect input value. Because all the affected instructions remain in the instruction window, it suffices to re-execute them with the correct input value [LiSh96] . Hence, the state of the directly and indirectly dependent instructions only needs to be reset after a misprediction so that the issue logic will select them again for execution. This second (or subsequent) execution will produce the correct result because the input operands are now correct. We refer to this misprediction recovery mechanism as re-execute recovery.
While the re-execute mechanism avoids the unnecessary purging of independent instructions and the overhead of re-fetching already fetched instructions, it still incurs a cycle-penalty for identifying the dependent instructions, changing their state, and re-executing them. However, the penalty is considerably smaller than the one incurred by the re-fetch recovery mechanism. Note that, as opposed to re-fetch hardware, re-execute hardware does not yet exist and incorporating it would require changes to the CPU core.
Benchmarks
We use the eight SPECint95 programs [SPEC95] as our benchmark suite. These programs are well understood, non-synthetic, and compute-intensive, which is ideal for processor performance evaluations. The SPECint95 programs are written in C and perform the following tasks:
compress: compresses and decompresses a file in memory
Except for gcc, we use the reference inputs for all programs. Due to a restriction in our simulation infrastructure, only the varasm input file is used with gcc. To avoid possible side effects that may be attributed to poor code quality, the peak-versions of the programs are utilized, which were compiled with DEC GEM-CC on a DEC Alpha 21164 using the highest optimization level "-migrate -O5 -ifo".
The optimizations include common sub-expression elimination, split lifetime analysis, code scheduling, no-op insertion, code motion and replication, loop unrolling, software pipelining, local and global inlining, inter-file optimizations, and more. In addition, the binaries are statically linked, which allows the linker to perform additional optimizations to reduce the number of run-time constants that are loaded.
The few floating-point load instructions contained in the binaries are also taken into account. Loads to the zero-registers (R31 and F31) as well as load address instructions (LDA and LDAH) are ignored since they do not access the memory and therefore do not need to be predicted. To better estimate how large a load value predictor needs to be, it is important to know how many of the static load instructions are actually executed and how frequently. Table 3.3 shows the number of load instructions that contribute the given quantiles (percentages) of all the executed loads in the eight programs. The quantiles are given both in absolute terms as well as in percent of the total number of load sites. For example, the first line in Table 3
Table 3.3: SPECint95 quantile information
The data in Table 3.3 show that a surprisingly small number of load sites supply most of the executed load instructions. On average, 3.5% of the load sites contribute ninety percent and a mere 0.6% of the load sites already contribute half of all the executed loads. Fewer than 37% of the load sites are visited at all during program execution.
These quantile numbers are promising because they imply that load value predictors do not have to be large enough to store information about every load site in a binary. Rather, a predictor capable of only holding nine percent of the load sites can, on average, already handle 99 percent of the dynamically executed loads. Of course, actual predictors need to be somewhat larger to handle 99 percent of the executed load instructions due to aliasing and uneven predictor utilization.
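The quantile numbers can be computed from per-site execution counts as sketched below in Python. The function name and the example counts are hypothetical and are not taken from the benchmark programs; the sketch only illustrates what a quantile such as Q90 measures.
```python
def sites_for_quantile(exec_counts, percent):
    """Smallest number of load sites that cover `percent` of all executed loads."""
    total = sum(exec_counts)
    covered = 0
    sites = 0
    # Visit the most frequently executed load sites first.
    for count in sorted(exec_counts, reverse=True):
        if covered * 100 >= percent * total:
            break
        covered += count
        sites += 1
    return sites


# Hypothetical execution counts for eight static load sites.
counts = [50, 30, 10, 5, 5, 0, 0, 0]
quantiles = {q: sites_for_quantile(counts, q) for q in (50, 90, 99, 100)}
print(quantiles)
```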
Segment Information
Each benchmark program is executed for about 300 million instructions on the cycle-accurate simulator to keep the simulation time reasonable. Before the detailed measurements commence, the simulator skips over the initialization code of each program. Doing so is important when only a fraction of a program's execution can be simulated because the initialization is not usually representative of the general program behavior [ReCa98]. No instructions are skipped with gcc and it is executed for 334 million instructions since this amounts to the complete compilation of the varasm input-file. Each simulated segment contains over 49 million executed load instructions, which should be sufficient to render any warm-up effects in the load value predictors negligible.
Table 3.4: Information about the eight simulated program segments
The table shows the number of instructions in billions that are skipped before starting the detailed simulations, the number of simulated instructions and load instructions in millions, the percentage of the simulated instructions that are loads, the instructions per cycle (IPC) of the baseline processor, the L1 data-cache and the L2 cache load miss-rates, and the load value predictability similar to Table 2.1.
Note that the number of instructions and loads as well as the predictability shown in Table 3.4 are measured in the CPU's commit stage, meaning that only correct path information is included in the table. The last row in the table "whole prg" repeats the averages from the whole program executions.
As is the case with the complete executions, the percentage of load instructions executed by the programs is also uniformly high in the simulated segments. About every fifth instruction is a load. With the exception of compress, the benchmark programs do not have very high L1 data-cache load miss-rates, making it hard for a load value predictor to be effective. Some of the L2 load miss-rates are, on the other hand, quite large. However, since the corresponding number of cache accesses is very small (not shown), the large L2 miss-rates do not have a significant impact on the performance.
The fast-forward points were carefully hand-selected to make the simulated segments as representative of the whole programs as possible. We chose a segment length of 300 million instructions since this appears to be enough to capture the "average" program behavior. Longer segments do not yield significantly different results. A comparison of Table 3.2 and Table 2.1 with Table 3.4 shows that both the percentage of executed instructions that are loads and in particular the predictability found in the eight segments closely match the respective numbers measured over the whole program executions.
Only for li and m88ksim was the search for a representative segment not very successful. Fortunately, li's segment exhibits too low a predictability and m88ksim's too high a predictability, making the average over the eight programs very close to the average over the complete execution of the entire benchmark suite. Executing only part of a program usually produces lower quantile numbers, in particular for the high quantiles. This phenomenon is quite apparent in Table 3.5. The Q100 and the Q99 numbers are significantly lower than their counterparts in Table 3.3, whereas the Q90 and the Q50 numbers are rather similar. The good match of the Q90 numbers indicates that the selected segments will likely exercise the load value predictors sufficiently to obtain representative results. The low Q99 and Q100 quantiles mean that the chosen segments contain proportionately too few infrequently executed loads. As a result, below average predictor aliasing has to be expected. Note, however, that techniques exist to keep infrequently executed load instructions from polluting the predictor [CRT99, BJR+99].
Results
To determine the performance of the five basic predictors from Section 2, we outfitted them with SAg confidence estimators (Section 2.6) and measured by how much they are able to speed up our simulated CPU (Section 3). Note that we use the harmonic-mean speedup over the eight SPECint95 programs as performance metric throughout this paper.
Based on previous studies [Bur00, BuZo98], we decided to use ten-bit histories in the confidence estimators and a top value for the saturating counters of sixteen for re-fetch recovery and eight for re-execute. A global search of each predictor was used to obtain the optimal threshold and penalty values. Note that the penalties yielding the highest performance with a re-execute misprediction recovery mechanism are quite low in comparison with those for re-fetch, even when accounting for the larger re-fetch counters. This is a direct reflection of the lower re-execute misprediction penalty. given results should only be used for intra-predictor comparisons between the two kinds of recovery mechanisms.
As expected, all five predictors perform better with re-execute than with re-fetch. The difference in speedup is the smallest for the Reg predictor because it exhibits the most regular predictability patterns of the five predictors, which results in the most accurate confidence estimations and therefore the smallest number of mispredictions.
The FCM predictor exhibits the largest difference between re-fetch and re-execute. The reason is that this predictor makes the most mispredictions with re-fetch, resulting in a substantial recovery cost that keeps the speedup low. 
Hybrid Performance
It is not a priori clear whether combining multiple load value predictors results in a predictor that is capable of predicting more load instructions or that can make predictions that are more accurate. For instance, two different predictors may predict the same load instructions. Obviously, combining two such predictors would not improve the performance but only result in a larger and more complex predictor. For example, the stride 2-delta predictor can make last value predictions. Consequently, combining it with a last value predictor will probably not yield a predictor that is more effective than the stride 2-delta predictor by itself.
Hybrid predictors consist of multiple component predictors of which one must be selected for making a prediction. We use the confidence estimators to guide the selection process by making the hybrids select the component with the highest confidence [ReCa98, RFKS98] . Note, however, that the selected component is only allowed to make a prediction if its confidence is above the preset threshold.
The components in the hybrid predictors discussed in this section are prioritized to resolve selector ties. When two or more components report the same highest confidence, the component with the highest confidence and the highest priority is selected. If only one component reports the highest confidence, then that component is selected regardless of its priority. Since changing the priority among the components of a hybrid does not appear to affect the performance considerably [Bur00], we only investigate hybrids in which the components are prioritized in the following order (from high priority to low priority): Reg, LV, St2d, L4V, FCM.
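The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the simulator's implementation: the function name `select_component` and the dictionary of confidence values are hypothetical, and the sketch assumes that "above the threshold" means "at or above" it.

```python
from typing import Dict, Optional

# Priority order from the text, highest priority first.
PRIORITY = ["Reg", "LV", "St2d", "L4V", "FCM"]


def select_component(confidences: Dict[str, int], threshold: int) -> Optional[str]:
    """Return the component that may predict, or None if no prediction is made.

    `confidences` maps a component name to its current confidence value.
    """
    if not confidences:
        return None
    best = max(confidences.values())
    # Assumption: a confidence equal to the threshold is sufficient.
    if best < threshold:
        return None
    # Among the components that share the highest confidence,
    # the one that comes first in the priority order wins.
    for name in PRIORITY:
        if confidences.get(name) == best:
            return name
    return None


# LV and St2d tie at the highest confidence, so the higher-priority LV wins.
print(select_component({"LV": 5, "St2d": 5, "FCM": 3}, threshold=4))
```

Priority only matters in the tie case; a component that alone reports the highest confidence is selected even if it has the lowest priority.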
In order to determine which predictors complement each other well and hence yield good hybrids, we tested every possible combination between a register value, last value, stride 2-delta, last four value, and finite context method predictor. Because the last four value predictor is a strict superset of the last value predictor (assuming the same number of lines and the same CE configuration), we exclude hybrid combinations that include both an LV and an L4V predictor. The performance of the excluded hybrids is identical to the performance of the same predictor without the redundant LV component.
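The resulting search space is small enough to enumerate directly. The following sketch lists every subset of the five predictors and drops those that contain both an LV and an L4V component. The function name `studied_combinations` is hypothetical, and single-component predictors are included here only for reference.

```python
from itertools import combinations

COMPONENTS = ["Reg", "LV", "St2d", "L4V", "FCM"]


def studied_combinations():
    """Enumerate all non-empty component subsets without an LV+L4V pair."""
    result = []
    for size in range(1, len(COMPONENTS) + 1):
        for combo in combinations(COMPONENTS, size):
            # L4V subsumes LV, so combining the two is redundant.
            if "LV" in combo and "L4V" in combo:
                continue
            result.append(combo)
    return result


combos = studied_combinations()
by_size = {}
for combo in combos:
    by_size[len(combo)] = by_size.get(len(combo), 0) + 1

print(len(combos), by_size)  # prints: 23 {1: 5, 2: 9, 3: 7, 4: 2}
```

Under this exclusion rule, only two four-component hybrids remain and the five-component hybrid is ruled out entirely.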
Since our goal is to study which predictors complement each other well, all components are 2048 lines tall regardless of the resulting hybrid's overall size. We chose this height because such predictors already yield a performance that is close to the performance of the same predictor with an infinite number of lines. (The quantile numbers from Section 3 support this observation.) Hence, studying hybrids of 2048-line components should suffice to identify the most promising combinations for building high-performing hybrid load value predictors.
While the size of some of the resulting hybrids is rather large, they can frequently be made smaller by sharing state between their components [Bur00, BuZo00, PMT99]. Nevertheless, due to the varying predictor sizes, care must be taken when using the performance numbers shown in this section for inter-hybrid comparisons.

Figure 4.2: Hybrid performance using re-fetch (x-axis: Predictor Combination; y-axis: Speedup over Baseline (%))

Note that it is not practical to optimize the threshold and penalty for every hybrid individually. Instead, the threshold and penalty values that yield the highest average speedup over the included components are used as an approximation. They are computed as follows. We evaluated the speedup of the five basic predictors for a large number of threshold and penalty pairs and recorded the results in speedup maps [Bur00]. A speedup map is a matrix with different thresholds in one dimension and different penalties in the other dimension. The matrix elements are the speedups measured for the threshold and penalty that intersect at that element. We then computed an average map by forming the arithmetic mean of the entries in the individual maps of each included component (e.g., the register value predictor's map and the last value predictor's map for the Reg+LV hybrid). The highest speedup in the averaged map determined the threshold and penalty value we used for each hybrid.

Note that this approach does not always yield the best performance but is usually close. For example, the St2d+FCM hybrid yields a speedup of 9.99% with re-fetch and 13.09% with re-execute when using the parameters from the averaged speedup map, whereas truly optimizing the threshold and penalty results in a speedup of 10.01% for re-fetch and 13.94% for re-execute. Table 4.2 shows the confidence estimator configurations derived from the averaged speedup maps.
All the histories are ten bits long and the counter top value is always sixteen for re-fetch and eight for re-execute.
Table 4.2: The confidence estimator parameters of the hybrid predictors (columns: re-fetch, re-exec)
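The averaging of speedup maps used to choose each hybrid's threshold and penalty can be illustrated with a short sketch. The function names and all numbers below are hypothetical and chosen only to show the mechanics; they are not measured speedups. The sketch assumes that every component's map covers the same grid of thresholds (rows) and penalties (columns).

```python
def average_map(maps):
    """Element-wise arithmetic mean of equally sized speedup maps."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[r][c] for m in maps) / len(maps) for c in range(cols)]
        for r in range(rows)
    ]


def best_parameters(maps, thresholds, penalties):
    """Return (threshold, penalty, speedup) at the maximum of the averaged map."""
    avg = average_map(maps)
    best = None
    for r, threshold in enumerate(thresholds):
        for c, penalty in enumerate(penalties):
            if best is None or avg[r][c] > best[2]:
                best = (threshold, penalty, avg[r][c])
    return best


# Made-up maps for a two-component hybrid (rows: thresholds, columns: penalties).
thresholds = [4, 8, 12]
penalties = [2, 4]
reg_map = [[1.0, 2.0], [3.0, 2.5], [2.0, 1.0]]
lv_map = [[4.0, 3.0], [5.0, 6.5], [3.0, 2.0]]

print(best_parameters([reg_map, lv_map], thresholds, penalties))  # prints: (8, 4, 4.5)
```

In this made-up example the first component alone would prefer a penalty of 2, so the averaged optimum is a compromise. This is one way the averaged parameters can fall slightly short of a truly optimized configuration.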
Hybrids with more components tend to yield a higher speedup than the ones with fewer components. However, because adding a component to a hybrid makes the task of the selector harder (there are more choices), it can happen that the added predictability provided by a new component is unable to offset the increased selector-related losses. When this situation occurs, the hybrid's components interfere negatively with one another and lower the overall performance.
Note that some of the most effective hybrids are small and have only two components (Reg+LV and Reg+St2d). The remaining three of the five best combinations are significantly larger because they include an L4V component. However, previous work by the authors demonstrates that the size of a Reg+St+L4V predictor can be reduced to only slightly more than that of a Reg+St2d hybrid essentially without loss of performance [BuZo00] .
Eleven of the twelve best performing hybrids include the storage-less register value predictor, indicating that the Reg predictor is a very important component in a hybrid. This result is particularly surprising because the Reg predictor by itself performs rather poorly. Note that no profiling was used to improve the register allocation, which can significantly enhance the performance of this predictor [TuSe99], yet the benefit from including a register value predictor is already substantial. When averaging the re-fetch and the re-execute speedups, the Reg+St2d+L4V hybrid performs best by a considerable margin. The most effective two-component hybrid is the Reg+L4V, which is closely followed by the Reg+St2d hybrid. Finally, the best single-component predictor is L4V trailed by the St2d predictor. Surprisingly, neither of the four-component hybrids outperforms the best three-component hybrid.

Table 4.4: Re-execute speedup benefit from adding components
Evidently, both with re-fetch and re-execute, all the hybrids that do not include a Reg component would benefit considerably from having one. This is particularly surprising because the Reg predictor does not perform very well when used in isolation. Similarly, the Reg predictor benefits from being augmented with any other component. Except for the Reg predictor, only the FCM and Reg+FCM hybrids benefit significantly from an LV component. These two predictors also profit the most from having an St2d or an L4V component added to them. As mentioned earlier, most hybrids are slowed down by an FCM component with re-fetch, whereas it is advantageous for most hybrids to have an FCM component with re-execute. Several predictors benefit from an L4V component.
Performance Analysis
In an effort to determine why the register value predictor is such a valuable addition to all hybrids while, for example, the LV component generally is not, we investigated how frequently each component in a hybrid can predict a load value that none of the other components can, how often the predictions from different components overlap, and how often they interfere with one another. Because not every prediction is equally important (e.g., predicting a load that hits in the L1 data-cache is not as important as predicting a load that has to go all the way to main memory), we study the speedup contributions of the hybrids' components rather than the actual set of load instructions that each component is able to predict.
Two-Component Hybrids
A hybrid component's unique speedup contribution is the part of the overall performance that is lost when that component is removed. In other words, the component must actually be present to deliver its unique performance contribution. Conversely, in a two-component hybrid, the shared contribution is common to both components, meaning that either is able to provide this contribution, but the contribution does not increase if both components are used together. Hence, only one of the two components is needed to deliver the shared performance contribution.
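The definitions above can be turned into a small worked example. The sketch below is one plausible reading of them rather than the exact formulas used in this study: it assumes that the hybrid's speedup decomposes additively into the two unique contributions and the shared contribution. The function name and the input numbers are hypothetical.

```python
def contributions(speedup_a, speedup_b, speedup_ab):
    """Split the speedup of hybrid A+B into unique and shared parts.

    speedup_a and speedup_b are the speedups of A and B in isolation;
    speedup_ab is the speedup of the two-component hybrid.
    """
    # Unique contribution: what the hybrid loses when that component is removed.
    unique_a = speedup_ab - speedup_b
    unique_b = speedup_ab - speedup_a
    # Assumption: the remainder of the hybrid's speedup is shared.
    shared = speedup_ab - unique_a - unique_b
    return unique_a, unique_b, shared


# Illustrative numbers only (not measured results).
print(contributions(6.0, 8.0, 11.0))  # prints: (3.0, 5.0, 3.0)
```

Under this reading, the shared part equals speedup_a + speedup_b - speedup_ab, and a unique contribution becomes negative when a component lowers the hybrid's performance.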
The unique and shared speedup contributions in two-component hybrids are computed as follows. With only one exception (St2d+L4V), re-fetch recovery results in larger shared contributions than re-execute for the same predictors. This probably means that the easily predictable loads (i.e., loads that have very high confidences associated with them) tend to be the loads that both components can predict. Those loads are most likely runtime constants that are always predictable because their values never change [CFE97] .
Assuming that predictor
Overall, the Reg predictor complements the other four predictors exceptionally well, indicating that it can predict a rather distinct set of load instructions. The next best "partner" is the FCM predictor.
The St2d predictor does not complement the LV or the L4V predictor well because they all mostly predict last value predictable loads.
The St2d+L4V hybrid is similar to the last distinct four value + stride predictor proposed by Wang and Franklin [WaFr97], and St2d+FCM is the hybrid proposed by Rychlik et al. [RFKS98] except it is not set-associative and uses a different confidence estimator.

Figure 4.5: Venn diagrams for three-component hybrids

Again, the amount of sharing correlates reasonably with the re-fetch performance, but there is no significant correlation with the re-execute performance. Nevertheless, the Venn diagrams expose components that do not contribute any performance and, more importantly, components that hurt performance. Such components should obviously be left out of hybrids because doing so will not lower the predictor's performance but will make it smaller, faster, and reduce the power consumption.
Three-Component Hybrids
Because the four-component hybrids do not outperform the best three-component hybrids with re-fetch and do not significantly outperform the best three-component hybrids with re-execute, we refrain from studying the speedup contributions of the two four-component hybrids.
Related work
Two independent research efforts [Gab96, LWS96] first recognized that load instructions exhibit value locality and concluded that there is potential for prediction. Lipasti et al. [LWS96] propose the last value predictor. Gabbay [Gab96] proposes four predictor schemes: a last value predictor, a stride predictor, a register file predictor, and a sign-exponent-fraction predictor. The SEF predictor is only useful for predicting IEEE floating-point loads. Tullsen and Seng [TuSe99] present the register value predictor as we used it in this study. We found their predictor to best complement any other component in hybrid predictors.
In their next paper, Lipasti and Shen [LiSh96] suggest making predictions based on the last n values instead of just the last value. Wang and Franklin [WaFr97] propose a last distinct four value predictor as well as the first hybrid predictor, a combination of their last distinct four value predictor and a stride predictor. In previous work [Bur00, BuZo99] , we show that the last four value predictor is simpler but about as effective as the last distinct four value predictor.
Sazeides and Smith [SaSm97b] describe the finite context method predictor. They found this predictor to perform very well with large table sizes. Since we use a relatively small FCM component in our hybrids, it may well be that a larger such component would further improve the performance.
Rychlik et al. [RFKS98] use a hybrid between a finite context method predictor and a stride 2-delta predictor in their study. Later, Rychlik et al. augment their predictor with a popular last value predictor and study updating only one component at a time to increase the predictor's capacity [RFK+98] .
We tackled the capacity issue in previous work by investigating approaches to shrink the predictor size without loss of performance [BuZo00] . By compressing values and sharing information between predictor components, we were able to reduce the size of the Reg+St2d+L4V to only about twice the size of an LV predictor with the same number of lines while maintaining the hybrid's performance.
Pinuel et al. present a hybrid between a last value, a stride, and a finite context method predictor [PMT99] in which they also share information between components to keep the predictor size small.
Summary and Conclusions
This paper studies the performance of all hybrid load value predictors that can be built out of a register value, a last value, a stride 2-delta, a last four value, and a finite context method predictor. Our analysis shows that hybrids are able to deliver substantially more speedup than even the best single-component predictor and that different components contribute independently to the overall performance. We conclude that distinct types of load value locality exist that can only be exploited using multiple components, each of which has to be tailored to a different kind of locality.
An investigation of the speedup contributions of individual components revealed that the register value predictor, which performs poorly by itself, represents the most valuable addition to any other studied component. Conversely, combining well-performing predictors often does not result in an effective hybrid. In fact, we found some predictor combinations to perform worse than a similar predictor with fewer components. This happens when a new component increases the selector-related losses by more than the added predictability can compensate for. Hence, care must be taken when combining predictors into a hybrid.
Our hybridization analysis identified the register value + stride 2-delta predictor as one of the best two-component hybrids. In spite of its substantially smaller and simpler design, it matches or exceeds the speedup of two-component hybrids from the literature. Of all the studied predictors, the register value + stride 2-delta + last four value hybrid performs best with re-fetch as well as when averaging re-fetch and re-execute speedups.
Among predictors with 2048 lines, the best hybrids yield harmonic-mean speedups over the eight SPECint95 programs of close to 18% and outperform the best single-component predictors by over 25%. These substantial performance improvements are obtained with transparent load value predictors that require no change to the instruction set architecture and can therefore even be added to existing CPU families. Furthermore, these speedups are obtained on programs that were not compiled with load value prediction in mind. In future work we will study compiler optimizations to further improve the performance of hybrid and single-component load value predictors. We will also investigate hybrid confidence estimators.
