Indirect jump instructions are used to implement multiway branch statements and virtual function calls in object-oriented languages. Branch behavior can have significant impact on program performance, but fortunately hardware predictors can alleviate much of the risk. Modern processors include indirect branch predictors which use part of the target address to update a global history. We present a code generation technique to maximize the branch history information available to the predictor. We implement our optimization as an assembly language transformation, and evaluate it for SPEC benchmarks and interpreters using simulated and real hardware, showing indirect branch misprediction decreases.
INTRODUCTION
Indirect jumping is used to implement common programming constructs such as switch statements, virtual function calls and calls through function pointers. The performance of programs that make a lot of use of these features can depend heavily on the underlying hardware's ability to correctly predict the target address of the indirect branch. Predicting the outcome of an indirect branch is often more difficult than conditional branch prediction, because indirect branches can have many different targets.
The simplest type of indirect branch predictor simply uses the branch target buffer (BTB) which is used to predict the (unchanging) target of conditional and unconditional direct branches. When used with indirect branches, the BTB simply predicts that the indirect branch will jump to the same target as last time it was executed. This works well for branches that usually jump to the same target, which are often known as monomorphic indirect branches. However in the case of indirect branches which jump to many different targets (polymorphic indirect branches) the prediction accuracy is usually poor.
To address this problem, researchers have adapted the idea of two-level branch predictors to indirect branches [Driesen and Hölzle 1998; Chang et al. 1997]. Rather than having only a single branch prediction entry per branch, a two-level predictor uses both the address of the branch itself and a history of recent branch outcomes to index a table and find a prediction. In the case of indirect branches, the history of recent branches usually consists of some bits selected from the addresses of recent indirect branch targets, combined together into a single history register. By capturing information about the outcome of recent branches, a two-level branch predictor is able to exploit correlations between different indirect branches, or between multiple executions of the same branch.
A limitation of two-level indirect branch predictors is that only a subset of the bits of each indirect branch target can be selected. For example, the indirect branch predictor in the Intel Pentium M processor selects the lower six bits of the indirect branch target, and hashes them into the branch history register [Uzelac and Milenkovic 2009]. If more than one target has the same pattern in those bits, the predictor cannot distinguish between them. In this case, the predictor has less context information with which to make the prediction, and the likely outcome is poorer prediction accuracy.
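The aliasing described above can be made concrete with a small sketch. It assumes only what the text states (six low-order target bits reach the history); the function names and the example addresses are hypothetical.

```python
KEY_BITS = 6  # low-order target bits hashed into the history, per the text


def history_key(target_addr, bits=KEY_BITS):
    """Return the low-order bits of a target address that reach the history."""
    return target_addr & ((1 << bits) - 1)


def colliding_targets(targets, bits=KEY_BITS):
    """Group targets by key and keep only groups with more than one member."""
    groups = {}
    for t in targets:
        groups.setdefault(history_key(t, bits), []).append(t)
    return {k: v for k, v in groups.items() if len(v) > 1}


# Hypothetical case-label addresses, chosen only for illustration.
targets = [0x401000, 0x401017, 0x401040, 0x401057, 0x40107A]
print(colliding_targets(targets))
```

Targets that fall into the same group are indistinguishable to the history update, regardless of how far apart they are in memory.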
In this paper we present two compiler-based software techniques to improve the predictability of indirect branches that are used to implement switch statements. Both techniques have the goal of ensuring that different targets of the switch branch have bit patterns that can be used to distinguish the branches in the indirect branch predictor. The paper makes the following contributions.
- We show that there is an opportunity to reduce collisions in the indirect branch predictor using compiler techniques.
- We present a solution based on NOP insertion, describe the problem formally, and present heuristic solutions.
- We present a second solution based on re-ordering the cases inside the switch statement and present heuristic solutions.
- We present a hybrid of both these techniques.
- We implement both strategies in an assembly optimizer tool, and evaluate them using both simulation and real hardware.
- We provide preliminary experiments that show these techniques can also be applied to improving branch prediction accuracy for virtual function calls.
BACKGROUND
Programming language switch (or case) statements can be implemented using different techniques, depending on the number of cases and the distribution of the switch constant values. For small numbers of cases, a sequence (or tree) of conditional branches suffices. Even for larger numbers of cases, a binary tree of branches can work well. However, in general the most efficient switch code can be generated for a large number of cases when the density of the switch constant values is high.

... to indirect branches. Kalamatianos and Kaeli [1999] showed that correlation with previous branches using data compression techniques could further improve accuracy. Kim et al. [2009] propose a low-cost mechanism to adapt an existing two-level conditional branch predictor to indirect branches.

CPU pipelines have grown over time, and modern processors have highly sophisticated branch predictors that attempt to avoid branch misprediction and a flush of the pipeline. In recent years, two-level indirect branch predictors have started to appear in real processors. The earliest example we are aware of is the Intel Pentium M [Gochman et al. 2003]. Effective branch predictors are so important for performance that they are a competitive edge for CPU manufacturers, so their specific behavior is seldom publicly documented. There have been attempts to reverse-engineer the exact details of branch predictors [Milenkovic et al. 2002]. Uzelac and Milenkovic [2009] provide experiment flows and microbenchmarks for reverse-engineering cache-like branch predictor structures. In particular, they provide details of the branch predictor structures of the Intel Pentium M, including the indirect branch target buffer (iBTB). They find that the Pentium M takes the six lower bits of the address of the branch target (Figure 1), and hashes them into a branch history register (called the path information register (PIR) by Intel). This PIR is used in the prediction of future branches by the Pentium M.
The PIR in the Pentium M is a fifteen-bit register, which is updated by both taken conditional branches and indirect branches as follows:

$$\mathrm{PIR} \leftarrow \begin{cases} (\mathrm{PIR} \ll 2) \oplus \mathrm{IP}_{15\ \text{bits}} & \text{if } cbt \\ (\mathrm{PIR} \ll 2) \oplus \left(\mathrm{IP}_{9\ \text{bits}} \,\Vert\, \mathrm{TA}_{6\ \text{low bits}}\right) & \text{if } ibt \end{cases}$$

where cbt is a taken conditional branch, ibt is a taken indirect branch, IP is the memory address of the branch instruction, and TA is the target address of an indirect branch. In other words, when a new conditional or indirect branch is executed, the existing value of the PIR is shifted left by two places, and a new value is xor'd into the PIR. In the case of a conditional branch, this new value is simply 15 bits of the address of the branch in memory. In the case of an indirect branch, 9 bits are taken from the address of the branch, and the six least significant bits from the address of the branch target. It has been reported that the Core microarchitecture includes the same indirect branch predictor [Kim et al. 2009].
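As a rough illustration of the update rule, the following sketch models the PIR as a 15-bit integer. Several details are assumptions rather than statements from the text: the shift is taken to be toward the most significant bit (so the oldest history falls off the top of the register), and the exact bit ranges used here (bits 18..4 of IP for conditional branches, bits 18..10 of IP for indirect branches) are placeholders for "15 bits" and "9 bits" of the branch address. The function names are hypothetical.

```python
PIR_BITS = 15
PIR_MASK = (1 << PIR_BITS) - 1


def bits(value, hi, lo):
    """Extract value[hi:lo] (inclusive bit range), hardware style."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def update_conditional(pir, ip):
    """Taken conditional branch: shift by two, XOR in 15 bits of the branch address.

    The range IP[18:4] is an assumption made for this sketch.
    """
    return ((pir << 2) ^ bits(ip, 18, 4)) & PIR_MASK


def update_indirect(pir, ip, ta):
    """Taken indirect branch: 9 branch-address bits joined with 6 low target bits.

    The range IP[18:10] is an assumption made for this sketch.
    """
    new_value = (bits(ip, 18, 10) << 6) | bits(ta, 5, 0)
    return ((pir << 2) ^ new_value) & PIR_MASK


# Two hypothetical targets that differ only above bit 5 leave the same history.
same = update_indirect(0, 0x401234, 0x402000) == update_indirect(0, 0x401234, 0x402040)
print(same)
```

Under these assumptions, only the six low-order bits of TA can ever influence the history, which is the property the rest of the paper exploits.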
Given that indirect branch predictors use only a selection of bits from the branch target when updating their history register, the effectiveness of two-level indirect branch prediction schemes depends on the bit pattern being different for the common targets. The actual addresses at which the branch targets appear in memory depend on choices made by the compiler and linker. In general, compilers and linkers make no attempt to optimize for the indirect branch predictor. In some systems the locations of the branch targets in memory are simply arbitrary, and whether the chosen bits are distinct for different branch targets is a matter of luck.

Fig. 2. Typical jump-table assembly compiled by GCC for our experimental interpreter. The jump targets will be 16-byte aligned by the linker, which means that the lower four bits of the address will be zero.
Sometimes compilers actually make the situation worse by aligning branch targets for supposed cache gains: if the target is at the lowest address of a cache line, then more code may end up fitting on that line. However, this leads to less information being available to the branch predictor hardware, and consequently, worse predictions. Figure 2 shows an example extract of assembly code compiled by GCC for a switch statement. The branch targets of cases in the switch statement are labels .L4 and .L5. The compiler has inserted directives (the .p2align directives, whose first argument is the number of low-order address bits that must be zero) to align these branch targets on a 16-byte boundary, which ensures that the lower four bits always have the value zero. If the indirect branch predictor uses these bits, then they will be useless for distinguishing between different branch targets. In rare cases, aliasing can be helpful to the indirect branch predictor, but this is unlikely.
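The cost of alignment can be illustrated numerically. This sketch assumes a six-bit key, as described for the Pentium M, and uses hypothetical target addresses; `align_up` mimics rounding a label up to a power-of-two boundary and ignores the knock-on shift of later code.

```python
def distinct_keys(targets, key_bits=6):
    """Set of low-order bit patterns the predictor history can see."""
    mask = (1 << key_bits) - 1
    return {t & mask for t in targets}


def align_up(addr, p2):
    """Round addr up to a multiple of 2**p2 (the effect of aligning a label)."""
    step = 1 << p2
    return (addr + step - 1) & ~(step - 1)


# Hypothetical unaligned case labels and their 16-byte-aligned counterparts.
unaligned = [0x1000 + off for off in (0, 7, 19, 34, 50, 70, 85, 100)]
aligned = [align_up(t, 4) for t in unaligned]

print(len(distinct_keys(unaligned)), len(distinct_keys(aligned)))
```

With four of the six key bits forced to zero, at most four distinct keys remain, however many targets the switch has.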
NOP INSERTION

Idea
Our first technique to modify indirect branch target addresses is to insert NOP instructions directly before a target instruction in order to move its memory address to a higher memory location that will not conflict with other targets.
Problem Formulation
It is clear that we can only have a collision-free (injective) mapping if the number of targets is less than or equal to the number of "buckets".
CONJECTURE 3.2. Deciding if there is a solution to the NOP insertion problem that uses fewer than t NOPs is NP-complete (where t is the total number of inserted NOPs).
Unfortunately a proof for this conjecture is not known to exist. A brute-force algorithm to evaluate all possible NOP insertion permutations will have O(
) complexity, where t is the total "budget" and n is the number of target positions. Enumerating all the possible solutions in this manner is the problem of integer compositions. An algorithm for generating integer compositions is described by Knuth [2005] .
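To give a feel for the size of the brute-force search space, the sketch below enumerates every way of distributing exactly `total` NOPs over `parts` target positions, allowing a position to receive zero NOPs (so-called weak compositions). Treating the search this way is an assumption made for illustration; the closed-form count used in `count_compositions` is the standard stars-and-bars result, not a figure taken from the paper.

```python
from math import comb


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def count_compositions(total, parts):
    """Number of weak compositions of `total` into `parts` parts."""
    return comb(total + parts - 1, parts - 1)


print(len(list(compositions(4, 3))), count_compositions(4, 3))
print(count_compositions(100, 60))  # already far too many to enumerate
```

Even modest budgets and target counts produce a search space that cannot be enumerated, which motivates the heuristics that follow.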
Heuristics
On the assumption that the problem is NP-hard, we now present three heuristics that perform well, and we show simulation results.
3.3.1. Greedy. The simplest heuristic, shown in Algorithm 1, is a greedy algorithm. It simply iterates through the addresses from lowest to highest (therefore making updated address calculation easy), and inserts NOPs until the current address no longer conflicts with previously processed addresses. The average case for this algorithm can be analyzed in the same manner as a series of unsuccessful searches in linear probing, and an aggregation of the following formula from Knuth [1998] will provide an average for the number of NOPs inserted:

$$\frac{1}{2}\left(1 + \frac{1}{(1 - N/M)^{2}}\right)$$

where N is the number of targets already considered and M is the total number of targets. The worst case for the number of NOPs inserted is M(M + 1)/2.
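The greedy heuristic described above can be sketched as follows. This is a minimal reading of the prose, not the authors' implementation: it assumes single-byte NOPs, a six-bit key, and that every NOP inserted before one target also shifts all later targets. The name `greedy_nops` is hypothetical.

```python
def greedy_nops(targets, key_bits=6):
    """Pad each target, lowest first, until its key is unused.

    Returns (new_addresses, nops_inserted_before_each_target).
    """
    mask = (1 << key_bits) - 1
    used = set()
    offset = 0  # total NOP bytes inserted so far; shifts all later targets
    new_targets, nops = [], []
    for t in sorted(targets):
        # Smallest padding giving an unused key; 0 if every key is taken.
        pad = next((n for n in range(mask + 1)
                    if ((t + offset + n) & mask) not in used), 0)
        offset += pad
        addr = t + offset
        used.add(addr & mask)
        new_targets.append(addr)
        nops.append(pad)
    return new_targets, nops


# Hypothetical targets: the first three share key 0 before padding.
print(greedy_nops([0, 64, 128, 130]))
```

Because earlier padding shifts every later target, a run of aliased targets often needs only one NOP each rather than an ever-growing probe sequence.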
ALGORITHM 1: Greedily insert NOPs.
Input: Target memory addresses (T ).
Output: New target memory addresses (T).
uniqueKeys ← ∅;
...
uniqueKeys ← uniqueKeys ∪ {T_i (mod 2^6)};
end

3.3.2. Minimize Conflicts. The greedy algorithm can be improved upon: Algorithm 2 also iterates through target addresses from lowest to highest, and it inserts enough NOPs to minimize the number of all conflicting targets at each stage, while being cautious about how many NOPs it inserts by minimizing (conflicts × number of NOPs).
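A possible reading of the minimize-conflicts heuristic is sketched below. The interpretation of "conflicts" is an assumption: here it is the number of later targets whose shifted address would collide with a key that is already taken. Single-byte NOPs and a six-bit key are also assumed, and `min_conflict_nops` is a hypothetical name.

```python
def min_conflict_nops(targets, key_bits=6):
    """For each target pick the padding that minimizes conflicts * padding.

    Only paddings that give the current target an unused key are considered.
    Returns (new_addresses, total_nops_inserted).
    """
    mask = (1 << key_bits) - 1
    targets = sorted(targets)
    used, offset, out = set(), 0, []
    for i, t in enumerate(targets):
        best_pad, best_cost = 0, None
        for pad in range(mask + 1):
            if ((t + offset + pad) & mask) in used:
                continue  # the current target itself would still collide
            # Later targets shift by the same amount; count those hitting used keys.
            conflicts = sum(((x + offset + pad) & mask) in used
                            for x in targets[i + 1:])
            cost = conflicts * pad
            if best_cost is None or cost < best_cost:
                best_pad, best_cost = pad, cost
        offset += best_pad
        out.append(t + offset)
        used.add((t + offset) & mask)
    return out, offset


print(min_conflict_nops([0, 64, 128, 130]))
```

On this small hypothetical input the result matches the greedy heuristic; the two differ only when a slightly larger padding removes conflicts for several later targets at once.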
3.3.3. Maximize "Clean Distance". A third heuristic is shown in Algorithm 3. This algorithm will also iterate through targets from lowest to highest, and it will choose the number of NOPs to insert that maximizes the distance from the current target to the next target that has a conflict, while also avoiding aggressively inserting NOPs by maximizing distance/NOPs.
Simulation Results
To compare the three heuristics (and variants thereof), we simulated input target addresses using random numbers. First, 60 random variables α with a uniform distribution within the interval (5, 29) are generated. These are used as the sizes to generate a sequence of memory addresses. This is repeated 1000 times for each heuristic that is being evaluated. Table I shows simulated results for four of the best heuristics. The second and fourth rows show heuristics that also run the greedy algorithm for each set of addresses and choose the answer with the smallest number of NOPs. The best heuristic, with an average of 116 NOPs inserted, is the combination of the maximize "clean distance" heuristic and greedy.

ALGORITHM 2: Insert NOPs, minimizing the conflicts at each point.
Input: Target memory addresses (T).
Output: New target memory addresses (T).
uniqueKeys ← ∅; offset ← 0;
for each target address T_i in T do
    smallest ← ∞; bestNN ← 0;
    for nn = 0 → (2^6 − 1) do
        nextAddrs ← getNextAddresses(offset, nn, T);
        conflicts ← countConflicts(nextAddrs, uniqueKeys);
        if T_i + offset + nn (mod 2^6) not in uniqueKeys and (conflicts × nn) < smallest then
            bestNN ← nn; smallest ← conflicts × nn;
        end
    end
    offset ← offset + bestNN;
    T_i ← T_i + offset;
    uniqueKeys ← uniqueKeys ∪ {T_i (mod 2^6)};
end

ALGORITHM 3: Insert NOPs, maximizing the distance until a conflict occurs with the offsetting at each point.
Input: Target memory addresses (T).
Output: New target memory addresses (T).
uniqueKeys ← ∅; offset ← 0;
for each target address T_i in T do
    greatest ← 0; bestNN ← 0;
    for nn = 0 → (2^6 − 1) do
        nextAddrs ← getNextAddresses(offset, nn, T);
        dist ← getCleanDist(nextAddrs, uniqueKeys);
        ...
    uniqueKeys ← uniqueKeys ∪ {T_i (mod 2^6)};
end

RE-ORDERING
It may be wasteful to insert NOP instructions when memory size is severely constrained, so another technique to change the memory addresses of the indirect branch targets (when the targets are placed sequentially in memory) is to re-order the code blocks.
Problem Formulation
Definition 4.1 (Unique Target Address Keys Ordering where n = m). Given a sequence of positive integers $S_1, \ldots, S_n$, is there a permutation of $S$ such that, for all $x_i \in X$, where $x_i = S_0 + S_1 + \cdots + S_{i-1}$, every $x_i$ is unique mod $k$?
LEMMA 4.2. For n < m, there is not always a permutation that results in no collisions in the hash values.
PROOF. If the addresses of two cases p_i and p_{i+1} differ by a value v such that v ≡ 0 (mod k), then both will hash to the same number. This happens when length(p_i) ≡ 0 (mod k). If we have more than one such case, then we will always have at least one collision in the hashed values.

PROOF. To ensure a particular x_i is unique mod k, we must choose a subset S_j of S_1, ..., S_n such that its sum mod k equals a particular integer. This sub-problem is the NP-complete problem of subset-sum on a finite field [Nathanson 1996].
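The argument behind Lemma 4.2 can be checked by brute force on a tiny instance. The sketch below assumes that targets are the start offsets of consecutively placed blocks and that the first block starts at offset 0; the block sizes are hypothetical.

```python
from itertools import permutations


def has_collision_free_order(sizes, k):
    """True if some ordering gives block start offsets that are all distinct mod k."""
    for order in permutations(sizes):
        offset, keys, ok = 0, set(), True
        for size in order:
            if offset % k in keys:
                ok = False
                break
            keys.add(offset % k)
            offset += size  # the next block starts where this one ends
        if ok:
            return True
    return False


# Two blocks whose length is a multiple of k: one of them must be followed
# by another block, which then inherits the same key.
print(has_collision_free_order([64, 64, 10], 64))  # no ordering works
print(has_collision_free_order([64, 10, 20], 64))  # placing the 64-byte block last works
```

The exhaustive search is factorial in the number of blocks, so it is only usable as a sanity check on very small inputs.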
Heuristics
Since we also assume the problem of re-ordering in the general case is NP-hard, we present two heuristics to order the blocks in the program. These heuristics do not attempt to remove all conflicts, but to reduce them.
The simplest heuristic to reduce the number of conflicts is a "pair-swap". We perform a single pass over all the triplets of target addresses (pairs of 'blocks') and swap them if it will mean fewer conflicts with target addresses already processed. Algorithm 4 shows this heuristic.
A second, more effective heuristic to reduce conflicts is to greedily schedule them: Select the next block that causes no conflicts and place it after the previously scheduled one. If there is no block that does not cause a conflict, then just pick the first block in the list. Algorithm 5 shows this heuristic.
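The greedy scheduling heuristic can be sketched as below. It follows the prose (take the first block that causes no conflict, otherwise the first block in the list) and assumes that a block of size s placed at offset o makes the next target start at o + s. Block sizes and the name `greedy_reorder` are hypothetical.

```python
def greedy_reorder(sizes, key_bits=6):
    """Schedule blocks so that each new target start has an unused key if possible.

    Returns (block_order, start_offsets).
    """
    mask = (1 << key_bits) - 1
    remaining = list(sizes)
    offset = 0
    used = {0}  # key of the first target, which starts at offset 0
    order, starts = [], []
    while remaining:
        # First block whose end (the next target's start) lands on a fresh key.
        choice = next((b for b in remaining
                       if ((offset + b) & mask) not in used), remaining[0])
        remaining.remove(choice)
        order.append(choice)
        starts.append(offset)
        offset += choice
        used.add(offset & mask)
    return order, starts


print(greedy_reorder([64, 64, 10, 20]))
```

In this example the original order gives three targets with key 0, while the greedy order leaves a single collision, which cannot be removed by re-ordering alone because two blocks are exactly one bucket-span long.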
For these heuristics, the worst case would be a new ordering where all targets conflict with each other. This could happen if the input consists of blocks that are all the size of the "buckets".
HYBRID
Re-ordering attempts to reduce conflicts by only moving blocks. A more sophisticated approach is to combine that with NOP insertion to get the best of both. Blocks are re-ordered to reduce collisions, and NOPs inserted if no block is the required size. Algorithm 6 shows this hybrid algorithm.

ALGORITHM 4: Re-order "blocks" using a "pair-swap".
Input: Target memory addresses (T).
...

ALGORITHM 5: Re-order "blocks" using a greedy fit.
Input: Target memory addresses (T).
Output: New target memory addresses (T).
blocks ← ∅; uniqueKeys ← ∅;
for each pair of target addresses T_i, T_{i+1} in T do
    blocks ← blocks ∪ {T_i, T_{i+1}};
end
offset ← 0; output(offset);
while blocks ≠ ∅ do
    choice ← blocks.first;
    for each block b in blocks do
        if offset + size(b) (mod 2^6) not in uniqueKeys then
            choice ← b;
        end
    end
    blocks ← blocks \ {choice};
    offset ← offset + size(choice);
    uniqueKeys ← uniqueKeys ∪ {offset (mod 2^6)};
    output(offset);
end
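One way the hybrid idea could work is sketched here: blocks are placed greedily as in re-ordering, and when no remaining block produces an unused key, the first block is placed and padded with NOPs until the following target's key is unused. The placement of the padding (after the block, before the next target), the single-byte NOPs, and the name `hybrid_layout` are assumptions of this sketch, not details taken from Algorithm 6.

```python
def hybrid_layout(sizes, key_bits=6):
    """Greedy fit with NOP padding as a fallback.

    Returns a list of (start_offset, block_size, nops_after_block).
    """
    mask = (1 << key_bits) - 1
    remaining = list(sizes)
    offset, used, layout = 0, {0}, []
    while remaining:
        fit = next((b for b in remaining
                    if ((offset + b) & mask) not in used), None)
        pad = 0
        if fit is None:
            fit = remaining[0]
            if len(remaining) > 1:  # padding only matters if a target follows
                pad = next((n for n in range(mask + 1)
                            if ((offset + fit + n) & mask) not in used), 0)
        remaining.remove(fit)
        layout.append((offset, fit, pad))
        offset += fit + pad
        used.add(offset & mask)
    return layout


print(hybrid_layout([64, 64, 10, 20]))
```

For the hypothetical sizes shown, a single NOP is enough to remove the collision that re-ordering alone could not avoid.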
PERFORMANCE EVALUATION
Implementation
We implemented our two techniques using the assembly language optimization framework developed by Hundt et al. [2011]. For NOP insertion, we implemented the greedy algorithm (Algorithm 1) for simplicity. For re-ordering, we implemented Algorithm 5. In some cases, the compiler (GCC) inserts .p2align assembler directives before indirect branch targets (as shown in Figure 2), aligning them to 16-byte boundaries. This is especially detrimental to the performance of the branch predictor, as only 2 unique bits of the target address would be used in the predictor history. In our implementation we remove these .p2align directives before the indirect branch targets.

Compiler Techniques to Improve Dynamic Indirect Branch Prediction 24:9

ALGORITHM 6: Hybrid: re-order "blocks" using a greedy fit, insert NOPs if necessary.
Input: Target memory addresses (T).
For re-ordering, the code between target labels may include arbitrary control-flow, so we make a simplification, and treat the code between any two target labels as a 'block' to be moved (unless there is a fall-through path from one block to another, in which case we coalesce the two).
A small adaptation was required when inserting NOPs for the problem cases where N > M, that is, where the number of branch targets is greater than the number of 'buckets' (64). If N > M, for the first 64 targets we insert NOPs to make each unique, for the next 64 we insert NOPs until each target conflicts with only one other target, and so on.
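The adaptation for more targets than buckets can be sketched by letting the permitted number of prior uses of a key grow by one for every 64 targets processed. This is an illustrative reading of the description above, assuming single-byte NOPs; `greedy_nops_overflow` is a hypothetical name.

```python
from collections import Counter


def greedy_nops_overflow(targets, key_bits=6):
    """Greedy NOP insertion when there may be more targets than keys.

    Target i may reuse a key that has been used at most i // 64 times before.
    """
    buckets = 1 << key_bits
    mask = buckets - 1
    uses = Counter()
    offset, out = 0, []
    for i, t in enumerate(sorted(targets)):
        limit = i // buckets  # 0 for the first 64 targets, 1 for the next 64, ...
        pad = next((n for n in range(buckets)
                    if uses[(t + offset + n) & mask] <= limit), 0)
        offset += pad
        addr = t + offset
        uses[addr & mask] += 1
        out.append(addr)
    return out


# Worst-case style input: 130 hypothetical targets that all share key 0.
layout = greedy_nops_overflow([64 * i for i in range(130)])
keys = [a & 63 for a in layout]
print(max(Counter(keys).values()))
```

A suitable key always exists at each step, because fewer than 64 × (limit + 1) targets have been placed when target i is processed.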
PIN Simulation
We created a simulation of the indirect branch target buffer and the path information register using PIN [Luk et al. 2005] . Figure 4 shows the misprediction rate reductions of the programs with NOPs inserted to increase target uniqueness, the programs using re-ordering, and the programs using the hybrid technique compared to the baseline (100%). The majority of the benchmarks show reductions in the misprediction rate, some of them large (>90% reduction for the mcf benchmark). The worst performing case was re-ordering targets for the sphinx3 benchmark, where the misprediction rate was increased by 40%.
We did not observe a significant change in branch mispredictions from running our techniques on the SPEC2006 benchmarks on real hardware. We believe this is because on our test machine, conditional branches share the path information register with the iBTB. This is then compounded by the low ratio of indirect branches to conditional branches in those programs. For example, the gcc benchmark, which has a significant number of indirect branches, has 28 times more conditional branches than indirect branches. Given that taken conditional branches update the branch history, and assuming a significant number of the conditional branches are taken, and the history depth is ≈8, updates to the history by indirect branches will be 'flushed' by the time another indirect branch is reached. In the next subsection, we will look at programs that make intensive use of indirect branches. 
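The "flushing" argument can be checked with a small model of a 15-bit history that shifts by two bits on every taken branch. The sketch assumes the shift moves old bits toward the top of the register, and the values XORed in by later branches are arbitrary placeholders; only the register width and shift amount come from the text.

```python
PIR_BITS = 15
SHIFT = 2
PIR_MASK = (1 << PIR_BITS) - 1


def update(pir, value):
    """One history update: shift the old history, XOR in the new value."""
    return ((pir << SHIFT) ^ value) & PIR_MASK


def updates_until_flushed():
    """Taken branches needed before an old contribution is fully shifted out."""
    return -(-PIR_BITS // SHIFT)  # ceiling division


def history_after(first_target_bits, later_values):
    """History after an indirect branch followed by other taken branches."""
    pir = update(0, first_target_bits)
    for v in later_values:
        pir = update(pir, v)
    return pir


# Arbitrary 15-bit values standing in for later taken conditional branches.
LATER = [0x1A2B, 0x0F0F, 0x7123, 0x0456, 0x3C3C, 0x2222, 0x5555, 0x0001]

print(updates_until_flushed())
print(history_after(0b000001, LATER[:7]) != history_after(0b100000, LATER[:7]))
print(history_after(0b000001, LATER[:8]) == history_after(0b100000, LATER[:8]))
```

With many taken conditional branches between consecutive indirect branches, the history seen at the next indirect branch would under this model no longer depend on the previous indirect target, whatever its low-order bits were.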
Indirect-Branch-Heavy Programs
An important use of multiway branches is in interpreter implementation. We therefore evaluate our techniques on three interpreters: a simple interpreter written in C; Lua, a switch-based dynamic language interpreter; and JamVM, a Java language interpreter (switch implementation). Our first interpreter is a simple interpreter we designed to experiment with optimizations (Figure 5). It has a full set of opcodes, but only a few are exercised in our test bytecode program, which is just a nested loop with arithmetic in the inner loop. The results of applying our code transformation techniques on our simple interpreter are enumerated in Table II. These results are for the Intel Core 2. We see a significant decrease in the number of branch mispredictions, and a pronounced reduction in execution time for all three techniques. The re-ordering and hybrid versions show a 6% increase in level 1 instruction cache reads, but have fewer misses than either the baseline or the version with NOPs inserted. The hybrid algorithm shows more indirect branch mispredictions than the other techniques; this is likely due to the other techniques producing addresses that are more fortuitous for that benchmark, but the overall effect is not as significant as the change from the baseline. The instruction cache miss rates are tiny (<0.001%), so we do not infer anything from the small changes in misses.
The second interpreter we look at is Lua, a dynamically typed language interpreter. We evaluate our techniques for Lua using a selection of benchmarks from the Computer Language Benchmarks Game [CLBG 2011] and The Great Win32 Computer Language Shootout on the Intel Core 2. Figure 6 shows the misprediction rate of the interpreter before and after performing the NOP insertion technique. Figure 7 shows speedups resulting from the reduction in indirect branch misses. Figure 8 shows the misprediction rate of the interpreter before and after performing re-ordering, and Figure 9 shows the misprediction rate of the interpreter before and after the hybrid technique. The reduction in mispredictions compared to the baseline for the three techniques is shown in Figure 10; mispredictions are reduced for the majority of Lua benchmarks. Figure 11 shows the level 1 instruction cache results for Lua. The size of the instruction cache (32 KB) dwarfs the code size of the interpreter.
The third interpreter we investigate is the JamVM Java interpreter [Lougher 2011]. We compile the 'switch' version of the interpreter by disabling the threaded code mode. We evaluated our techniques on this interpreter using benchmarks from the Java Grande Forum [Mathew et al. 1999]. Figure 12 shows the misprediction rates of the baseline interpreter, the interpreter using NOP insertion, the interpreter using re-ordering and the interpreter using the hybrid technique on the Intel Core 2. The effect of reducing branch target address collisions is not as pronounced for this interpreter, perhaps because there are many more opcodes (>100), and consequently many more indirect branch targets. Figure 13 shows speedups attained from the reduction in indirect branch misses. Figure 14 shows the level 1 instruction cache results for Java. These have a similar scale to the Lua instruction cache results.
C++ Programs
In addition to multiway branches, indirect branches are also used to implement indirect function calls. Indirect function calls are used in programs that use function pointers, where the target cannot be resolved statically at compile time. They are also used to implement the run-time dispatch that is required for polymorphism in object-oriented languages.
For C++, the GCC compiler implements virtual method calls using dynamic dispatch with virtual function tables, which involve an indirect call instruction. Indeed, this may be one of the greatest motives for the addition of indirect branch prediction hardware to recent Intel and AMD CPUs.
To apply our techniques to this use-case of indirect calls, we adapted our assembly language optimization passes to work with function branch targets. We parse the virtual tables in our optimization pass and determine the start labels of all virtual functions. We also include thunks as these are possible targets of indirect calls.
GCC puts functions into separate sections in the assembly file, and these are usually 16-byte aligned by the linker. In our first implementation, we took this constraint into account, and inserted enough NOP instructions to ensure each branch target had unique lower bits. However, we did not observe a significant decrease in indirect branch misses, so we changed the implementation to move all the virtual functions into one section before making targets unique. To evaluate our optimization pass, we created a C++ version of our simple interpreter, using an abstract base Instruction class which has a virtual execop method that child classes override to implement the various opcodes. Thus the simple interpreter loop: while (1){ (*ip)->execop(); }
The results for our C++ simple interpreter are enumerated in Table III. The baseline has a branch misprediction rate of 13%, the version with NOPs inserted has a misprediction rate of 0.004%, and our re-ordering technique gives a misprediction rate of 0.006%. Both techniques are therefore much better than the baseline. Both also show a reduction in the number of L1 i-cache reads: with fewer mispredictions, fewer instructions are executed, and consequently the i-cache is accessed less.
Hardware Configuration
We evaluate our optimization on the following system: Intel Core 2 Q6600 (Core Kentsfield) 2.4 GHz CPU with 6 GB memory, running Ubuntu GNU/Linux 11.04 with kernel version 2.6.38-8 and GCC version 4.5.2. Performance counters were collected using the perf tool, an interface to the Linux kernel perf events, and PAPI [Browne et al. 2000]. For indirect branch misses, we recorded the BR_IND_MISSP counter.
RELATED WORK
Reducing Branch Interference
Chen and King [1999] adjust addresses to reduce collisions in 2-bit counters for conditional branches. They also add NOP instructions to perform the address adjustment. They present a constrained and a relaxed method, where the constrained method will only insert NOPs following an unconditional branch and the relaxed method will insert NOPs where they may get executed by the processor. They also describe branch classification, which maps branches with the same history pattern to the same counter. Our work similarly inserts NOP instructions to move branch addresses, but we instead consider indirect branches, and we also present a technique using code re-ordering. Jiménez [2005] builds upon that work, describing pattern history table partitioning, a feedback-directed technique that moves conditional branch addresses to avoid destructive interference. We have not considered any feedback techniques for the current work. Krall [1994] uses profiling to collect information about branches and then uses code replication to reduce the branch misprediction rate. We are concerned with indirect branches, and do not investigate profiling methods. Uh and Whalley [1999] describe a transformation that coalesces conditional branches into indirect jumps. They show that this transformation to indirect jumps can improve prediction. We are not concerned with how indirect jumps are generated; we always try to maximize their predictability. Yang et al. [2002] re-order branches using profile data to reduce the number of conditional branches executed by a program. Our work is concerned with indirect branches whose execution can generally never be avoided by code rearrangement.
Code Placement for Branches
There has been much research on code rearrangement to improve cache behavior. Related to branches, Calder and Grunwald [1994] present a profile-based technique that 'aligns' branches: It will minimize the number of taken branches, therefore ensuring frequently executed code is closer together, thereby improving cache performance.
Target Predictability for High-Level Languages
In previous work [McCandless and Gregg 2011], we demonstrated that re-ordering the cases of a switch statement has a significant impact on indirect branch prediction for interpreters. We hypothesized this to be due in part to aliasing in the indirect branch predictor. Li et al. [2005] improve the target predictability of Java, particularly for virtual methods, by proposing a rehashable branch target buffer that dynamically adapts branch target storage for polymorphic branches. They show improvements over traditional techniques, including target caches. Our work is focused on improving the performance of such code on already existing hardware.
Compiler Assisted Branch Prediction
Deitrich et al. [1998] concern themselves with static branch prediction from the compiler's perspective. They present heuristics, evaluate them for a selection of benchmarks and provide insights about their efficacy. Mahlke and Natarajan [1996] present a compiler technique for branch prediction by using profile feedback to insert a prediction function that captures information about the current context that is then used to make predictions. They show performance comparable to that of two-level hardware branch predictors.
"Weird" Behavior
The effect of linking order as a source of measurement bias has been described by Mytkowicz et al. [2009]. They attribute measurement bias due to link order to an alignment issue, and using the m5 simulator they see instruction cache miss variances when the link order is changed. They suggest branch prediction may also be a cause in general. Our work investigates optimizing code by maximizing information available in branch predictor history. We do not propose that measurement bias may be eradicated, but that bias produced by branch predictor aliasing may be decreased. In a later paper [Knights et al. 2009], the authors propose "blind optimization". We believe this will be a useful technique, but there remain traditional hardware-aware optimizations to be discovered and exploited. Hundt et al. [2011] describe changes in execution time from nopinizer, a random NOP insertion pass they implemented using their assembly language optimization framework. They also observed performance variations from a nop killer pass. We do not insert NOPs at random, but instead attempt to utilize them to provide as much information to the predictor as possible. However, for any program, there will always be an unknown optimal address layout that maximizes entropy for a particular history. Indeed, constructive interference [Young et al. 1995] causes branches that would have been mispredicted to be correctly predicted.
CONCLUSIONS
In this article we presented an optimization opportunity where the indirect branch predictor does not perform as well as possible, due to indirect branch target aliasing that can be removed or reduced. We described our optimization to improve the branch prediction of indirect jumps and calls. We presented two techniques to increase the uniqueness of the bits in target addresses that are used by the predictor, the first based on NOP insertion and the second based on re-ordering, and we described them formally. We provided heuristics to approximate these techniques and implemented two as assembly language optimizations. Experimental results, from both simulation and hardware, were provided for C programs using SPEC2006 and our own indirect-branch-intensive programs, showing that providing more information to the predictor by making targets unique can reduce the number of indirect branch mispredictions. We also provided preliminary results for virtual function calls.
In our future work, we intend to further experiment with branch target addresses and implement a profile-based optimization that will automatically determine good target addresses for a given profile to maximize entropy in the global history.
