Abstract
Introduction
Current high-performance superscalar processors use branch prediction to speculatively execute instructions beyond an unresolved branch. If the branch is mispredicted, this work is lost, and execution must restart right after the branch instruction. Newer designs increase instruction issue width and pipeline depth, increasing the relative overhead of mispredicted branches and making accurate branch prediction even more critical to performance.
Conditional direct branches, whose target is encoded in the instruction itself, can already be predicted with reported hit rates of up to 97% ( [YP93] ). In contrast, indirect branches, which transfer control to an address stored in a register, are harder to predict accurately. Unlike conditional branches, they can have more than two targets so that prediction requires a full 32-bit or 64-bit address rather than just a "taken" or "not taken" bit. Current processors predict indirect branches with a branch target buffer (BTB) which caches the most recent target address of a branch. Unfortunately, BTBs typically have much lower prediction rates than the best predictors for conditional branches. For example, an ideal (unconstrained) BTB achieves an average prediction hit ratio of only 64% on the SPECint95 benchmarks.
Though not as common as conditional branches, indirect branches occur frequently enough to cause substantial overhead. Chang et al. [CHP97] predict a reduction in execution time of 14% and 5% for the perl and gcc benchmarks on a wide-issue superscalar processor with an improved prediction mechanism for indirect branches (Target Cache).
In C++ and Java programs, indirect branches occur with even higher frequency (see Table 1 ). These languages promote a polymorphic programming style in which late binding of subroutine invocations is the main instrument for modular code design. Virtual function tables, the implementation of choice for most C++ and Java compilers, execute an indirect branch for every polymorphic call. The C++ programs studied here execute an indirect branch as frequently as once every 50 instructions; other studies [CGZ94] have shown similar results. Some of the C++ programs in Table 1 execute only 6 conditional branches for every indirect branch.
Predicated instructions [M+94] further increase the importance of indirect branch prediction since they remove conditional branches and thus conditional branch misses. For example, Intel expects predication to reduce the number of conditional branches by half for the IA-64 architecture [Intel97]. With indirect branches becoming more frequent relative to conditional branches, and with indirect branches being mispredicted much more frequently, indirect branch prediction misses can start to dominate the overall branch misprediction cost. For example, if indirect branches are mispredicted 12 times more frequently (36% vs. 3% miss ratio), indirect branch misses will dominate conditional branch misses as long as indirect branches occur more frequently than once every 12 conditional branches.
As the relevance of indirect branches grows, so does the opportunity for more sophisticated prediction mechanisms.
In the next decade, uniprocessors may reach one billion transistors, with 48 million transistors dedicated to branch prediction ( [P+97] ).
In this study, we explore the design space of prediction mechanisms that are exclusively dedicated to indirect branches. Since the link between misprediction rate and execution overhead has been demonstrated in [CHP97] , we focus on the minimization of branch misprediction rate. Initially, we assume unlimited hardware resources so that results are not obscured by implementation artifacts such as interference in tagless tables. We then progressively introduce hardware constraints, each of which causes a new type of interference and corresponding performance loss. We repeat this process until we obtain implementable predictors. Finally, the practical predictors are pairwise combined into a hybrid predictor, further improving prediction accuracy.
Benchmarks
Our benchmark suite (see Table 1) consists of large object-oriented C++ applications that range from 8,000 to over 75,000 non-blank lines of C++ code each, and beta, a compiler for the Beta programming language ([MMN93]), written in Beta. We also measured the SPECint95 benchmark suite with the exception of compress, which executes only 590 branches during a complete run. Together, the benchmarks represent over 500,000 non-comment source lines.

Table 1 notes: (a) SunSoft version 1.3; (b) Java High-level Class Modifier; (c) hardware description language compiler; (d) SUIF 1.0; (e) Fresco X11R6 library.
All C and C++ programs except self (see footnote 1) were compiled with GNU gcc 2.7.2 (options -O2 -msupersparc plus static linking) and run under the shade instruction-level simulator [CK93] to obtain traces of all indirect branches. Procedure returns were excluded because they can be predicted accurately with a return address stack [KE91]. All programs were run to completion or until six million indirect branches were executed (see footnote 2). In jhm and self we excluded the initialization phases by skipping the first 5 and 6 million indirect branches, respectively.
For each benchmark, the tables list the number of indirect branches executed, the number of instructions executed per indirect branch, the number of conditional branches executed per indirect branch, and the percentage of indirect branches in C++ programs that correspond to virtual function calls. For example, only 34% of the indirect branches in eqn are due to virtual function calls; the rest represent indirect calls through function pointers, indirect branches of switch statements, etc. In addition, the tables list the number of indirect branch sites responsible for 90%, 99%, and 100% of the indirect branches. For example, only 2 different branch sites are responsible for 90% of the dynamic indirect branches in go.
90% of the indirect branches in the OO and SPEC programs are executed from less than 100 indirect branch sites, except for self which contains a much larger number of active indirect branches (306). The SPECint95 programs are even more dominated by very few indirect branches, with less than ten interesting branches for all programs except gcc. Because there are so few distinct indirect branches in these programs, they are much more sensitive to variations in indirect branch prediction schemes since a change in the prediction accuracy of a single indirect branch may significantly affect the overall prediction rate.
The relevance of indirect branch prediction is indicated by the number of instructions per indirect branch, and by the number of conditional branches per indirect branch. Three groups emerge: five of the OO benchmarks and one C benchmark execute fewer than 100 instructions per indirect branch; four OO benchmarks and three C benchmarks execute between 100 and 200 instructions for each indirect branch; and four of the SPEC benchmarks execute more than 1,000 instructions per indirect branch. Since the impact of branch prediction will be very low for the latter four benchmarks, we exclude them from all averages. Table 1 shows the groups for which we will commonly show average misprediction rates. We have included the SPECint95 programs mostly for comparison purposes; we do not believe that they are the best choice for evaluating indirect branch predictors (except for gcc). In effect, most SPEC benchmarks are microbenchmarks as far as indirect branch prediction is concerned, since very few branches dominate their behavior. In our evaluation of indirect branch prediction schemes we will therefore focus on the behavior of the larger OO programs.

Footnote 1: self does not execute correctly when compiled with -O2 and was thus compiled with "-O" optimization. Also, self was not fully statically linked; our experiments exclude instructions executed in dynamically-linked libraries.

Footnote 2: We reduced the traces of three of the SPEC benchmarks in order to reduce simulation time. In all of these cases, the BTB misprediction rate differs by less than 1% (relative) between the full and truncated traces, and thus we believe that the results obtained with the truncated traces are accurate.

Table 1. Benchmarks and commonly shown averages
Unconstrained indirect branch predictors
We first study the intrinsic predictability of indirect branches by ignoring any hardware constraints on predictor size or organization. Thus, we assume unconstrained, fully associative tables and full 32-bit addresses (unless indicated otherwise).
Branch target buffers
Current processors use a branch target buffer (BTB) to predict indirect branches. The predictor uses the branch address as a key into a table (the BTB) which stores the last target address of the branch.
We simulated two variants: "BTB" is a standard BTB which updates its target address after each branch execution. "BTB-2bc" is a BTB with two-bit counters which updates its target only after two consecutive mispredictions [CG94]. BTB-2bc predictors perform better in virtually all cases, with an average of 24.9% misprediction rate, compared to 28.1% for a standard BTB. Polymorphic branches occasionally switch their target but are often dominated by one most frequent target, a situation observed in object-oriented programs [AH96, D+96]. But even with two-bit counters BTB accuracy is quite poor, ranging from average misprediction ratios of 20% in OO programs to 37% for C programs. Infrequent indirect branches (AVG-200) are less predictable, with a misprediction average of 38% vs. 10% for the programs in AVG-100.
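The difference between the two update policies can be illustrated with a small trace-driven sketch. This is not the simulator used in the study: the function name `simulate_btb`, the `(branch_address, target_address)` trace format, and the per-branch "previous lookup missed" flag standing in for the two-bit counter are all assumptions made for illustration.

```python
def simulate_btb(trace, two_bit=False):
    """Return the misprediction rate of an unconstrained BTB on a trace.

    trace:   list of (branch_address, target_address) pairs (assumed format).
    two_bit: if True, replace a stored target only after two consecutive
             mispredictions, approximating the "BTB-2bc" policy.
    """
    table = {}    # branch address -> stored target
    missed = {}   # branch address -> True if the previous lookup missed
    misses = total = 0
    for branch, target in trace:
        total += 1
        predicted = table.get(branch)
        if predicted == target:
            missed[branch] = False      # a hit clears the pending miss
            continue
        misses += 1
        if predicted is None or not two_bit or missed.get(branch, False):
            table[branch] = target      # cold miss, plain BTB, or second miss in a row
            missed[branch] = False
        else:
            missed[branch] = True       # first miss: keep the old target for now
    return misses / total if total else 0.0
```

On a branch that is dominated by one target but occasionally switches (for example two executions of one target followed by one of another, repeated), the plain BTB in this sketch misses twice per switch while the two-consecutive-miss variant misses once, which is the behavior the paragraph above attributes to polymorphic branches.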
Two-level prediction for indirect branch paths
Two-level predictors improve prediction accuracy by keeping information from previous branch executions in a history buffer. Combined with the branch address, this history pattern is used as a key into a history table which maps the key to the predicted target address. This table itself resembles a BTB. As in BTBs, the entries can be updated on every miss or after two consecutive misses (2-bit counters). We tested every predictor in this section with both variants, and always saw a slight improvement with 2-bit counters. I.e., ignoring a stand-alone miss when updating seems to be a good strategy in general. Thus, we will only show 2-bit counter results in the rest of the paper.
Two-level predictors differ in the way they construct a key pattern for accessing the table. We simulated various alternatives (section 3.3), but will discuss at length only the configuration which resulted in the best average hit rates, shown in Figure 1. The history pattern consists of target addresses of recently executed branches. The history buffer is shared (global), so all indirect branches influence each other's history. Concatenation with the branch address results in the key used to access the history table. The path length p determines the number of branch targets in the history pattern. (A path length of 0 reduces the two-level predictor to a BTB predictor since the key pattern consists of the branch address only.) In theory, longer paths are better since a predictor cannot capture regularities in branch behavior with a period longer than p. Shorter paths have the advantage that they adapt more quickly to new phases in the branch behavior. A long path captures more regularities, but the number of different patterns mapping to a given target is larger, so it takes longer to fill in the table. This long "warm-up" time for long patterns can prevent the predictor from taking advantage of longer term correlations before the program behavior changes again. We studied path lengths up to 18 target addresses in order to investigate both trends and see where they combine for the best prediction rate.

Figure 2 shows the impact of the history path length on the misprediction rate for all path lengths from 0 to 18. The average misprediction rate drops quickly from 24.9% for a BTB to 7.8% for p=3 and then slowly reaches a minimum of 5.8% at path length 6. Then the misprediction rate starts to rise again and keeps rising for larger path lengths up to the limit of our testing range at p=18. All benchmark suites follow this pattern, although programs with infrequent branches show uniformly higher misprediction rates.
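A minimal sketch of the two-level scheme with a global target history may help to make the role of the path length p concrete. It assumes an unconstrained table (a Python dictionary), full-precision targets, and an update on every miss rather than the 2-bit counter policy used for the reported results; the name `simulate_two_level` and the trace format are hypothetical.

```python
from collections import deque


def simulate_two_level(trace, p):
    """Misprediction rate of an unconstrained two-level predictor.

    trace: list of (branch_address, target_address) pairs (assumed format).
    p:     path length, i.e. number of recent targets kept in the global history.
    """
    history = deque(maxlen=p)   # global history shared by all branches
    table = {}                  # (branch, recent targets) -> predicted target
    misses = total = 0
    for branch, target in trace:
        total += 1
        key = (branch, tuple(history))   # branch address concatenated with the path
        if table.get(key) != target:
            misses += 1
            table[key] = target          # simplification: update on every miss
        history.append(target)           # with maxlen=0 the history stays empty
    return misses / total if total else 0.0
```

With p=0 the key is the branch address alone, so the sketch degenerates to a BTB. On a branch that cycles through three targets, p=0 never predicts correctly, while p=1 and p=2 both learn the cycle; the p=2 variant needs one more miss to warm up because it has to see more distinct patterns first.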
These results indicate that most regularities in the indirect branch traces have a relatively short period. In other words, a predictable indirect branch execution is usually correlated with the execution of less than three branches before it. Increasing the path length captures some longer term correlations, but at path length six cold-start misses begin to negate the advantage of a longer history. At this point, adding an extra branch target to the path may still allow longer-term correlations to be exploited, but on the other hand it will take the branch predictor longer to learn a full new pattern association for every branch that changes its behavior due to a phase transition in the program. A hybrid branch predictor combining both short and long path components should be able to adapt quickly to phase changes while still exploiting longer-term correlations; we experiment with such hybrid predictors in section 6.
Variations
We explored a few other choices for the history pattern elements. In the first variant we used both branch address and target, and in the second we included targets of conditional branches in the history. Both resulted in inferior prediction capacity for any pattern length p (see [DH97] ).
In [YP93], Yeh and Patt classify two-level predictors for conditional branches. For both the history buffer and the history table, three different schemes are possible, resulting in a total of nine combinations. Buffer and table can each be shared (global), each branch can have its own version (per-address), or in an intermediate form, branches that fall in the same set can share structures (per-set), where a set may be determined by the branch opcode, a compiler-assigned branch class, or a particular address range. We simulated all nine combinations, with sets based on branch address range. Due to space considerations, we cannot discuss the results at length. For path length 8, per-address history buffers (with per-address tables) resulted in a misprediction rate of 9.4%. A global history table (with a global history buffer) resulted in 9.6% misprediction rate. The configuration we decided on (Figure 1: a global buffer and per-address tables) resulted in 6.0% misprediction rate.
Limited-precision branch predictors
The global history pattern is a very long bit pattern. For p=8, it consists of 8 * 32 = 256 bits, and concatenation with the branch address results in a total of 288 bits. The information content of this bit pattern is quite low: the number of different patterns that occur during program execution is much smaller than 2^288. Since a tag in an associative table includes most of the pattern, long patterns inflate the size of the predictor table. We need to compress the pattern for each path length into a short bit pattern, ideally without compromising prediction accuracy. As a first step towards smaller history patterns, we will only consider path lengths up to size p=12, since longer path lengths result in higher misprediction rates (as seen in Figure 2).
History pattern compression
A straightforward approach for history pattern compression is to select only a limited number of bits from each target and concatenate these partial addresses into the history pattern. We explored a number of choices by using a range [a..A] of the address bits. We varied a from 2 to 10, and A from a to a+(b-1), where b is the largest number of bits that still allows the history pattern to fit within a total of 24 bits (i.e. b * p <= 24). Starting with bit a=2 worked best on average, and thus we will not show data for other values of a. Figure 3 shows the misprediction ratios resulting from the selection of bits [2..2+(b-1)], for b values of 1,2,3,4 and 8, as well as the misprediction rate for full-precision addresses.
The curve for b=8 almost completely overlaps with the full-address curve, indicating that 8 bits are enough even for short path lengths. For decreased address precision, short path lengths suffer most. For example, for path length p=10, 2 bits achieve a misprediction rate of 6.77% vs. 6.53% for full addresses, while for path length p=3, the miss ratio decreases from 10.6% (2 bits) to 7.1% (full addresses). A total bit length of 24 bits suffices for the history pattern to approach the full-address performance for all path lengths. Thus, if b is the number of bits used from each target address in the path history, the maximum value of b has to satisfy b * p <= 24. For example, for path length 2 we choose 12 bits for each history entry, and for path length 6 we choose 4. We also tried two other schemes for target address compression:
• Fold the new target address into the desired number of b bits by dividing it into chunks of b bits and xor-ing them all together.
• Shift the history pattern b bits to the left and xor with the complete new target address.

These variants were intended to use more information of the target address but did not reliably result in better prediction rates and were sometimes even worse. Since they require more logic than the bit selection discussed above, we decided to drop them from further tests.
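The three compression schemes discussed in this section (bit selection, folding, and shift-xor) can be sketched as small bit-manipulation helpers. All function names are hypothetical, and placing the most recent target in the low-order bits of the history is an assumption of this sketch, not something the text specifies.

```python
def bits_per_target(p, budget=24):
    """Largest b with b * p <= budget (24 bits in the text)."""
    return budget // p


def select_bits(target, b, a=2):
    """Bit selection: keep bits [a .. a+b-1] of the target address."""
    return (target >> a) & ((1 << b) - 1)


def fold_bits(target, b):
    """Folding: split the address into b-bit chunks and xor them together."""
    mask = (1 << b) - 1
    folded = 0
    while target:
        folded ^= target & mask
        target >>= b
    return folded


def push_select(history, target, b, p):
    """Shift the history left by b bits and append b selected target bits.

    The result is truncated to b * p bits, so the oldest target drops out.
    """
    return ((history << b) | select_bits(target, b)) & ((1 << (b * p)) - 1)


def push_shift_xor(history, target, b, total_bits=24):
    """Shift the history left by b bits and xor with the complete target."""
    return ((history << b) ^ target) & ((1 << total_bits) - 1)
```

For instance, `bits_per_target(2)` gives 12 and `bits_per_target(6)` gives 4, matching the choices quoted above for path lengths 2 and 6.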
Folding the branch address
As mentioned in section 3.3, omitting the branch address reduces the performance of a two-level predictor (for p=8, the misprediction rate increased from 6.0% to 9.6%). However, concatenating the branch address with the history pattern results in a key of 24 + 30 = 54 bits. In analogy with the Gshare predictor used in conditional branch prediction [CHP95], we can reduce the number of bits in the key pattern to 30 by xor-ing the branch address with the history pattern (we use the low order bits of the branch address, starting at bit 2, xor-ed with the most recent target bits). Table 2 shows the misprediction rate averages for both alternatives. Compared to the increase in misprediction rate due to limited table size and associativity in the next section, the reduction of the key pattern from 54 to 30 bits by xor causes a very small rise in misprediction rate. Since this operation reduces the table space used for tag bits by more than half, we use the xor scheme in the remainder of the paper.
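The two ways of forming the key can be written down in a few lines. The sketch below assumes a 32-bit branch address whose bits 2 to 31 form the 30-bit address part, a 24-bit history pattern, and an alignment in which the low-order history bits (the most recent target bits in this sketch) are xor-ed with the low-order branch address bits. The names `concat_key` and `xor_key` are hypothetical.

```python
ADDRESS_BITS = 30   # bits 2..31 of a 32-bit branch address
HISTORY_BITS = 24   # compressed history pattern


def branch_bits(branch):
    """Drop the two low-order bits and keep the 30-bit address part."""
    return (branch >> 2) & ((1 << ADDRESS_BITS) - 1)


def concat_key(branch, history):
    """54-bit key: branch address bits concatenated with the history pattern."""
    return (branch_bits(branch) << HISTORY_BITS) | history


def xor_key(branch, history):
    """30-bit key: branch address bits xor-ed with the history pattern."""
    return branch_bits(branch) ^ history
```

Different (branch, history) pairs can collide under `xor_key` but never under `concat_key`; the small rise in misprediction rate reported in Table 2 is the cost of those collisions.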
Resource-constrained branch predictors
In this section we introduce limited table sizes and limited associativity in order to obtain practical indirect branch predictors.
Limited-size fully-associative tables
Limited tables introduce a new source of branch misses: capacity misses. When the table is too small to store the history patterns of all branches in its working set, some patterns will be evicted from the table, resulting in capacity misses.
Though not all patterns are used more than once (some only occur once in the warm-up phase), for longer path lengths capacity misses will occur fairly soon. A predictor with a longer path length may be more accurate than a predictor of shorter path length for an unlimited table, but the capacity misses caused by a small table size can affect the longer path length predictor enough to negate this advantage.
Table 2. Misprediction rates (AVG in %) for xor and concat of history pattern with branch address

To estimate the effect of capacity misses we simulate fully-associative tables with LRU replacement policy. Figure 4 shows the average misprediction rate for various fully-associative tables for predictors with path length p=0 to 4, 6, 8, 10 and 12. The misprediction rate of some path lengths reaches its minimum in the explored range. For p=0 (BTB), the miss rate reaches its minimum at 256 entries. Since there are no [...] has a misprediction rate of 6.6%, with 0.6% due to capacity misses.
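A fully-associative table with LRU replacement is easy to model in software, which makes it a convenient way to isolate capacity misses. The sketch below is an illustration under assumptions, not the simulator used here: the class `LRUTable`, the function `miss_rate`, and the trace of precomputed `(key, target)` pairs are hypothetical, and entries are updated on every miss.

```python
from collections import OrderedDict


class LRUTable:
    """Fully-associative table of fixed capacity with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # least recently used entry comes first

    def lookup(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as most recently used
            return self.entries[key]
        return None

    def update(self, key, target):
        self.entries[key] = target
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used


def miss_rate(trace, capacity):
    """trace: list of (key, target) pairs; the key is the full pattern."""
    table = LRUTable(capacity)
    misses = 0
    for key, target in trace:
        if table.lookup(key) != target:
            misses += 1
            table.update(key, target)
    return misses / len(trace) if trace else 0.0
```

When the working set of patterns is only slightly larger than the table, a cyclic access pattern makes LRU evict each entry just before it is needed again, so every access misses; once the table holds the whole working set only the cold misses remain.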
Limited-size limited-associative tables
In practice, a fully-associative LRU table of sufficient size requires too much logic to implement in hardware, and thus we will explore limited-associative tables in this section.
Limited associativity means that part of the key pattern is used as an index into a table to access a limited set of entries. Each entry in the set has a tag that is checked for equality to the rest of the key pattern. The index part of the key determines how a working set of branch patterns is spread out over the sets, and how many patterns share the same set. For instance, if one only used the high-order 8 bits of the branch address as index in a BTB of 256 sets, most of the patterns would have to share the same set. This can cause conflict misses; these are similar to capacity misses, but it is the capacity of the set instead of the table that is the limiting factor. Conflict misses can be reduced without changing the total size of the table by increasing associativity or by choosing a different index scheme, so that different patterns share the same sets. We start out choosing the lower order bits of the key pattern as index. In a two-level predictor, this part contains the lower order branch address bits, xor-ed with the target address bits of the recent targets in the history pattern (see section 4.2).
We test 1-, 2- and 4-way associativity, and tagless tables, which are like 1-way associative tables but without tags. Where a one-way associative table will register a miss if the search pattern is not in the table, a tagless table will simply return the target corresponding to the index part of the pattern. We compare misprediction rates for equal table sizes, i.e. a table with 256 sets of one entry each (1-way associative) is compared to a table with 64 sets of four entries each (4-way associative).
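The difference between a tagged one-way table and a tagless table shows up as soon as two keys share an index. The following sketch assumes that the low-order bits of an integer key form the index, that the remaining bits form the tag, and that each set uses LRU replacement (the replacement policy within a set is an assumption). The class names are hypothetical.

```python
class SetAssociativeTable:
    """Table with num_sets sets of `ways` tagged entries each."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # Each set is a list of (tag, target), least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def lookup(self, key):
        index, tag = key % self.num_sets, key // self.num_sets
        entries = self.sets[index]
        for i, (entry_tag, target) in enumerate(entries):
            if entry_tag == tag:
                entries.append(entries.pop(i))   # move to most recently used
                return target
        return None                              # tag mismatch: no prediction

    def update(self, key, target):
        index, tag = key % self.num_sets, key // self.num_sets
        entries = self.sets[index]
        entries[:] = [e for e in entries if e[0] != tag]
        entries.append((tag, target))
        if len(entries) > self.ways:
            entries.pop(0)                       # evict least recently used


class TaglessTable:
    """Direct-mapped table without tags: always returns whatever is stored."""

    def __init__(self, size):
        self.slots = [None] * size

    def lookup(self, key):
        return self.slots[key % len(self.slots)]

    def update(self, key, target):
        self.slots[key % len(self.slots)] = target
```

With four sets, keys 1 and 5 share index 1. After storing a target for key 1, the one-way table reports a miss for key 5, whereas the tagless table returns the target stored for key 1, which may or may not be correct.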
We tested all table sizes of the previous section, but will show only selected examples for this analysis to reduce clutter in the graphs. Figure 5 shows the misprediction rate of different associativities for a 4096-entry table, for all path lengths.
Interleaving
The saw-tooth curve for associativities 1, 2 and 4 indicates that there is something wrong with the way the history pattern is assembled from the target address bits. In particular, for associativity one, the misprediction rate of a p=2 predictor is much higher than a p=1 predictor. Figure 6 shows an example for p=2. Since the index part of the pattern is identical for target sequence t2t1 and t3t1, both paths will occupy the same set in the table. The predictor assigns sets in the same way as a predictor of path length one. If the two patterns alternate often, the path length two predictor will incur frequent conflict misses with a one-way associative table and not return a prediction, while the path length one predictor will return the predicted target address. To a lesser degree, the same effect applies to larger path lengths and higher associativities (see footnote 1), explaining the saw-toothed lines for concatenation in Figure 5. Interleaving remedies this problem by ensuring that the index part of a pattern contains the lower order bits of all target addresses, rather than all bits of a subset of the target addresses. When the target bits are interleaved, target sequences t2t1 and t3t1 will likely differ in the index part of the pattern and will therefore not interfere with each other.
Footnote 1: Also note that since concatenation places the oldest targets completely in the tag, they are invisible to a tagless table. A path length 12 pattern, with two bits per target in a predictor with a tagless, 4096-entry table will use only the 6 most recent targets, so its effective path length is only 6.

[Figure 6: placement of target bits in the index and tag parts of the key under concatenation and under interleaving, for target sequences t2t1 and t3t1.]
Interleaving of target bits is effective because it spreads patterns over more different sets than concatenation. For example, interleaving increases table utilization for ixx from 50% to 79% for a 1024 entry, one-way associative table for path length four. Figure 7 shows that interleaving dramatically improves predictor performance compared to concatenation.
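The effect of interleaving on the index can be reproduced with a small sketch for p=2. It leaves out the xor with the branch address and assumes that the most recent target comes first in the list and occupies the lowest bits under concatenation. The exact bit order used in the study may differ, and the function names are hypothetical.

```python
def concat_pattern(targets, b):
    """Concatenate b bits per target; targets[0] (most recent) is lowest."""
    pattern = 0
    for i, target in enumerate(targets):
        pattern |= ((target >> 2) & ((1 << b) - 1)) << (i * b)
    return pattern


def interleave_pattern(targets, b):
    """Interleave target bits: bit j of target i goes to position j * p + i.

    The low-order bits of every target therefore end up in the low-order
    (index) part of the pattern.
    """
    p = len(targets)
    pattern = 0
    for i, target in enumerate(targets):
        bits = (target >> 2) & ((1 << b) - 1)
        for j in range(b):
            pattern |= ((bits >> j) & 1) << (j * p + i)
    return pattern


def index_of(pattern, index_bits):
    """Low-order bits of the pattern select the set."""
    return pattern & ((1 << index_bits) - 1)
```

With four bits per target and a four-bit index, the sequences [t1, t2] and [t1, t3] share an index under concatenation whenever they share t1, because the index holds only bits of t1. Under interleaving the index also holds the low-order bits of the older target, so the two sequences usually land in different sets.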
We experimented with three variants of interleaving schemes. Figure 8 shows the interleaving schemes for path length 4 and index length 10. The index part of the pattern contains low order bits from all targets, but two targets are more precisely represented with three bits, and two contribute only their two lower order bits. Straight interleaving represents the most recent targets with higher precision (targets 1 and 2), while reverse interleaving represents the older targets most precisely (targets 3 and 4). Ping-pong interleaving represents both the oldest and youngest target more precisely (targets 1 and 4). Suppose the current branch depends only on the address of target 4, and two of its possible values are equal in their two lower order bits. With straight interleaving, the two patterns will conflict. With reverse interleaving, they will use entries in different sets.
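The way the three schemes distribute index bits over the targets can be expressed as a short allocation function. The sketch reproduces the example given above (path length 4, index length 10); how the schemes generalize to other path and index lengths is an assumption of this sketch, and the function name is hypothetical. Position 0 in the returned list is target 1, the most recent target.

```python
def index_bits_per_target(p, index_bits, scheme):
    """Return how many index bits each target contributes.

    Every target gets index_bits // p bits; the leftover bits go one each
    to the targets favoured by the scheme.
    """
    base, extra = divmod(index_bits, p)
    if scheme == "straight":        # favour the most recent targets
        order = list(range(p))
    elif scheme == "reverse":       # favour the oldest targets
        order = list(range(p - 1, -1, -1))
    elif scheme == "ping-pong":     # alternate youngest, oldest, ...
        order = []
        low, high = 0, p - 1
        while low <= high:
            order.append(low)
            if high != low:
                order.append(high)
            low += 1
            high -= 1
    else:
        raise ValueError("unknown scheme: " + scheme)
    bits = [base] * p
    for target in order[:extra]:
        bits[target] += 1
    return bits
```

For p=4 and a 10-bit index this yields [3, 3, 2, 2] for straight, [2, 2, 3, 3] for reverse, and [3, 2, 2, 3] for ping-pong interleaving, in line with the description of Figure 8.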
We found that reverse interleaving performs slightly better on average than the two other schemes. For shorter path lengths, the order does not make much difference since the index part of the pattern contains many bits from every target.
For longer path lengths the difference in precision becomes more important. Reverse interleaving gives longer path length predictors the opportunity to use more exact information from older targets, which is their main advantage compared to shorter path lengths. In the remainder of the paper we use reverse interleaving.

Figure 7 shows that for any path length, higher associativity results in lower misprediction rates. The only exception is the tagless table, which obtains a lower misprediction rate than a four-way associative table for path lengths 8 to 12. This effect is caused by positive interference. Since these longer path lengths generate a larger set of distinct patterns, conflict misses occur frequently even in four-way associative tables. The tagless table returns its stored target as a prediction even though it may belong to a different pattern, while the associative table registers a miss. Since many patterns map to a small number of targets, the prediction is better than random so that a tagless table can outperform the associative table. Even where tagless tables do worse than two- or four-way associative tables, the difference in miss rate remains relatively small. Since associative tables require tags and tag checking logic, the hardware implementation of a tagless table is smaller and faster than its associative counterpart, so that it may be the preferable choice under many circumstances.

Figure 9 shows the AVG misprediction rates for practical associativities. The best predictor for a given table size changes depending on associativity. For tagless tables, p=3 is best for table sizes 128 to 8192. For 2-way associative tables, p=1 wins for size 128, then p=2 is best for sizes 256 to 1024, after which p=3 performs better. For 4-way associativity, the best predictor for every size up to 1024 is the same as for a fully-associative table (see Figure 4). Then p=3 remains the best choice up to table size 4096. At size 8192, p=4 has a slight edge.
p=6 retains too many conflict misses even for large table sizes and therefore loses its status as best practical predictor. Limited table size and associativity prevent the predictor from taking full advantage of the longer-term regularity detection capability of longer path length predictors (however, see the next section). Table A-1 in the appendix shows the path lengths for the best predictors of all table sizes, and Table A-2 contains their misprediction rates.
Hybrid branch predictors
As discussed in section 3.2, predictors with short path lengths adapt more quickly when the program goes through a phase change because it doesn't take much time for a short history pattern to fill up. Longer path length predictors are capable of detecting longer-term correlations but take longer to adapt and suffer more from table size limitations because a larger pattern set is mapped to the same number of targets. Here we combine the two kinds in a hybrid predictor. 
Metaprediction
A hybrid branch predictor combines two or more component predictors that each predict a target for the current branch. The hybrid predictor employs a selection mechanism (metapredictor) to predict which of the predictors is likely to be correct. A branch predictor selection table (BPST) [McFar93] associates a two-bit counter with each branch to keep track of which of two component predictors is more accurate. After resolving a branch, the counter is updated to reflect the relative accuracy of the two components. Alternatively, branches can be partitioned into different classes based on run-time or compile-time information, and each class is associated with the component predictor best suited to handle it [CHP94] .
We attach a "confidence" counter to each table entry to keep track of the number of times the table entry predicted the correct target. The counter is a n-bit saturating counter which tracks the success rate over the last 2 n-1 times the entry was consulted. (Replacing an entry resets the counter to zero). The hybrid predictor selects the target with the highest confidence value; ties are resolved using a fixed ordering (we test different orders in the next section). This metaprediction scheme is usually more fine-grained than a BPST since it keeps track of the prediction accuracy of a particular pattern rather than a particular branch. We tested 1,2,3 and 4-bit counters for all configurations in the next section. Although the performance difference between 2,3 and 4 bit counters was small, 2-bit counters usually performed best and are used for all results shown.
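A compact sketch of confidence-based metaprediction with two components is given below. It uses unconstrained tables and full-precision targets. The update rule (increment on a correct entry, decrement on an incorrect one, replace the target once the counter is at zero) is an assumption, since the text only states that replacing an entry resets the counter. The names `Component` and `simulate_hybrid` are hypothetical.

```python
from collections import deque


class Component:
    """Two-level component predictor with a confidence counter per entry."""

    def __init__(self, p, counter_bits=2):
        self.p = p
        self.max_conf = (1 << counter_bits) - 1
        self.table = {}   # key -> [target, confidence]

    def key(self, branch, history):
        recent = tuple(history)[-self.p:] if self.p else ()
        return (branch, recent)

    def predict(self, key):
        entry = self.table.get(key)
        # A missing entry gets confidence -1 so that any stored entry wins.
        return (entry[0], entry[1]) if entry else (None, -1)

    def update(self, key, actual):
        entry = self.table.get(key)
        if entry is None:
            self.table[key] = [actual, 0]
        elif entry[0] == actual:
            entry[1] = min(entry[1] + 1, self.max_conf)
        elif entry[1] > 0:
            entry[1] -= 1
        else:
            entry[0] = actual   # replace the target; counter stays at zero


def simulate_hybrid(trace, p1, p2, counter_bits=2):
    """Misprediction rate of a two-component hybrid on (branch, target) pairs."""
    components = [Component(p1, counter_bits), Component(p2, counter_bits)]
    history = deque(maxlen=max(p1, p2, 1))
    misses = 0
    for branch, target in trace:
        keys = [c.key(branch, history) for c in components]
        guesses = [c.predict(k) for c, k in zip(components, keys)]
        # max() returns the first maximal element, so ties go to component 0.
        prediction = max(guesses, key=lambda g: g[1])[0]
        if prediction != target:
            misses += 1
        for c, k in zip(components, keys):
            c.update(k, target)
        history.append(target)
    return misses / len(trace) if trace else 0.0
```

Because the counter belongs to a table entry rather than to a branch, the selection in this sketch can differ from one history pattern to the next for the same branch, which is the finer granularity contrasted with a BPST above.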
Component predictors
We simulate hybrid predictors with two component predictors of equal table size and associativity but different path lengths. The component table sizes vary from 32 entries to 16K entries, and we simulate all combinations of path lengths in the range 0..12. Figure 10 shows the AVG hit ratios for 2K- and 8K-entry component tables. More details are given in Table 3. The best hit rates are obtained by the combination of a short path length predictor (p=1..3) with a longer path length predictor (p=5..12). Since the curve is fairly symmetrical with respect to the diagonal, it appears that the order of the predictors (which is used to break ties in component predictor selection) does not matter much. For smaller tables, the curve is sharper and peaks at shorter path lengths, i.e., the choice of the short path length component is more important, and very short path lengths do much better.

Figure 11 shows the misprediction rates of the best non-hybrid and hybrid predictors for each table size and associativity. We compare predictors based on total table size, i.e., we treat a hybrid predictor with two component predictors of size N as a predictor of size 2N and compare it against the non-hybrid predictor of that size. In all but one case (64 entry, associativity 4), hybrid predictors obtain lower misprediction rates than equal-sized non-hybrid predictors, even though each component separately suffers more from capacity and conflict misses than the non-hybrid predictor. For smaller table sizes (between 64 and 512 entries), the effect of increased associativity remains stronger than that of hybridization. For example, a non-hybrid 4-way associative table of size 256 achieves a lower misprediction rate than a hybrid predictor with two 2-way associative components of size 128 each. For larger table sizes (between 1K and 32K entries), a hybrid predictor with 2-way associative components performs better than a non-hybrid 4-way associative predictor of the same size.
For 2- and 4-way associative non-hybrid predictors with tables larger than 2K entries, the prediction rate improves more by changing to a hybrid predictor than by doubling the total table size. For tables larger than 4K entries, a 4-way associative hybrid predictor outperforms even a fully-associative table of the same size.
Related work
Lee and Smith [LS84] describe several forms of BTBs. Jacobson et al. [J+96] study efficient ways to implement path-based history schemes and observe that BTB hit rates increase substantially when using a global path history. Their Correlated Task Target Buffer (CTTB), unconstrained and fully associative, reached misprediction rates of 18% and 15% for gcc and xlisp with path length 7; our study found misprediction rates of 12% and 1.5% for p=7. The different results can be explained by several factors: different benchmark versions (SPEC92 vs. SPEC95), different inputs, and radically different architectures (e.g., the multiscalar processor's history information will likely omit some branches in the immediate past). Finally, Jacobson et al. include conditional branches in the path histories, which is probably responsible for the difference in xlisp (see section 3.3).
Chang et al. [CHP97] explore a limited range of two-level predictors for indirect branches and simulate the resulting speedups of selected SPECint95 programs for a superscalar processor. The misprediction rate of a BTB-2bc is reduced by half to 30.9% for gcc with a Pattern History Tagless Target Cache with configuration gshare(9). This predictor XORs a global 9-bit history of taken/not-taken bits from conditional branches with the branch address, and uses the result as a key into a globally shared, tagless 512-entry history table. In the present study, a comparable non-hybrid predictor (p=3, tagless 512-entry) reaches a misprediction ratio of 31.5% for gcc, the best non-hybrid predictor (p=2, four-way associative 512-entry) has a 28.1% misprediction rate (31.4% for 256 entries), and the best hybrid predictor (p1=3, p2=1, four-way associative 512-entry) reaches 26.4%. These comparisons should be regarded with caution, since the two experiments differed in architectures (HPS vs. SPARC), compilers, and benchmark inputs (we were unable to obtain the exact benchmark inputs used by Chang et al.).
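The gshare(9) indexing described above can be illustrated with a short sketch. It is not the implementation of Chang et al.: the class name GshareTargetCache and its methods are hypothetical, and dropping the two low-order bits of the branch address (assuming 4-byte aligned instructions) is an assumption of this example. Only the XOR of a 9-bit taken/not-taken history with the branch address and the tagless 512-entry table are taken from the description.

```python
HISTORY_BITS = 9
TABLE_ENTRIES = 1 << HISTORY_BITS  # 512 entries, indexed by a 9-bit key
INDEX_MASK = TABLE_ENTRIES - 1


class GshareTargetCache:
    """Tagless target table indexed by (branch address XOR global history)."""

    def __init__(self):
        self.history = 0                      # last 9 conditional branch outcomes
        self.targets = [None] * TABLE_ENTRIES

    def _index(self, branch_addr):
        # Assumption: instructions are 4-byte aligned, so skip the 2 low bits.
        return ((branch_addr >> 2) ^ self.history) & INDEX_MASK

    def predict(self, branch_addr):
        # Tagless: whatever is stored at the index is returned, even if it
        # was written by a different branch or history pattern.
        return self.targets[self._index(branch_addr)]

    def update_target(self, branch_addr, actual_target):
        self.targets[self._index(branch_addr)] = actual_target

    def record_conditional(self, taken):
        # Shift the newest taken/not-taken bit into the global history.
        self.history = ((self.history << 1) | int(taken)) & INDEX_MASK
```

Because the table has no tags, two different (address, history) pairs that map to the same index silently share an entry, which is the interference that tagged, set-associative tables reduce.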
Emer and Gloy [EG97] describe several single-level indirect branch predictors based on combinations of the values of PC, SP, register number, and target address, and evaluate their performance on a subset of the SPECint95 programs. For these programs, the best predictor shown achieved a misprediction ratio of 30%, although the authors allude to a better predictor that achieves 15%.
Calder and Grunwald proposed the two-bit counter update rule for BTB target addresses [CG94] and showed that it improved the prediction rate of a suite of C++ programs.
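As a rough sketch of this update rule, the following code replaces a stored target only after two consecutive mispredictions of the same branch. The class name Btb2bc and the use of a single "missed once" flag per entry (instead of an explicit two-bit counter) are simplifications assumed for the example, and the table is an unbounded dictionary.

```python
class Btb2bc:
    """BTB sketch that replaces a target only after two consecutive misses."""

    def __init__(self):
        self.entries = {}  # branch address -> [target, missed_once flag]

    def predict(self, pc):
        entry = self.entries.get(pc)
        return entry[0] if entry else None

    def update(self, pc, actual):
        entry = self.entries.get(pc)
        if entry is None:
            self.entries[pc] = [actual, False]   # first sighting: install target
        elif entry[0] == actual:
            entry[1] = False                     # hit: forget any earlier miss
        elif entry[1]:
            entry[0], entry[1] = actual, False   # second miss in a row: replace
        else:
            entry[1] = True                      # first miss: keep the old target
```

With this rule, a branch that usually jumps to one target and only occasionally to another keeps its dominant target in the table, so an isolated deviation costs one misprediction instead of two.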
Nair [Nair95] introduced path-based branch correlation for conditional branches and showed that a path-based predictor with two-bit partial addresses attained prediction rates similar to a pattern-based predictor with taken/not taken bits (for similar hardware budgets).
Many alternative implementations in this study were inspired by conditional branch predictors. We refer to [USS97] for a recent general overview, to [YP93] for a classification of two-level predictors, and to [ECP96] for recent hybrid prediction results.
Conclusions
We have explored a wide range of two-level indirect branch predictors, starting with unconstrained predictors with full-precision addresses and unlimited hardware resources. For a suite of large C++ and C programs totalling more than half a million lines of source code, the best unconstrained predictor achieved a misprediction rate of 5.8%, indicating that indirect branches are intrinsically predictable even though current hardware predictors (BTBs) do not predict them well. An extensive search of the unconstrained two-level predictor design space showed that a global history and per-address predictors perform best on average.
Subsequent experiments introduced resource constraints in order to evaluate whether realistic predictors could approach this performance with a limited hardware budget. Introducing limited-precision addresses (for a history buffer of 24 bits) increased the misprediction rate to 6.0%. Limiting table size (thus causing capacity misses) resulted in a further increase to an 8.5% misprediction rate for a 1K-entry table and 6.6% for an 8K-entry table. Restricting table associativity resulted in 11.7% and 8.5% misprediction rates for 1K and 8K tagless tables, respectively. Four-way associative tables of the same sizes reduce the misprediction rates to 9.8% and 7.3%, respectively. In comparison, an infinite-size fully-associative branch target buffer achieves a best-case misprediction rate of 24.9%. In other words, two-level prediction improves prediction accuracy by more than a factor of three.
Combining two-level predictors with different path lengths in a hybrid predictor further improved prediction accuracy. For a 4-way associative table, the misprediction rate of the best hybrid predictor improved to 8.98% for 1K entries and 5.95% for 8K entries. We found that 2-bit per-pattern confidence counters achieve adequate meta-prediction performance and that combining a short and a long path length predictor results in the best performance. Compared to an ideal BTB, an 8K-entry hybrid predictor improves prediction accuracy by a factor of more than four.
We also explored a variety of alternatives that resulted in inferior performance. In particular:
• Per-address or per-set history buffers perform worse than a global, shared history buffer.
• Updating targets on every miss lowers the performance, compared to updating only after two consecutive misses.
• Including conditional branch targets in the history pattern lowers prediction performance by pushing the more relevant indirect branch information out of the history buffer.
• Using bits other than the lower-order bits of target addresses results in lower performance.
• For limited-associative tables, the index part of the key pattern should contain bits from as many targets as possible, i.e., interleaving of target address bits performs better than concatenation.
The difference in performance between a BTB and the best practical two-level predictor becomes significant only for history tables larger than 64 entries. As the hardware budget allows larger history tables to be implemented, the path length of the best predictor grows. At 2048 entries, a hybrid predictor's miss rate of 7.8% outperforms that of a BTB by a factor of three. This result suggests that even for very high-ILP processors, indirect branches are less likely to severely constrain the achievable IPC if the transistor budget is large enough.
Table A-2. Misprediction rates for selected benchmarks (for full results see [DH97]). Columns: table size, BTB, and non-hybrid and hybrid predictors across fullassoc, tagless, assoc1, assoc2, and assoc4 tables.
