Abstract
Introduction
Branch misprediction latency is the most important component of performance degradation as microarchitectures become more deeply pipelined [20]. Branch predictors must improve to avoid the increasing penalties of mispredictions. Branch predictors based on neural learning are the most accurate predictors in the literature [12, 10], but they are impractical because the advantage of the extra accuracy is nullified by high access latency, even when latency-sensitive predictor organizations are used [7]. This latency is due primarily to the complex computation that must be carried out to determine the excitation of an artificial neuron.
We present a new, practical neural branch predictor. Its latency is much lower than that of previous designs and is comparable to that of conventional predictors used in industrial designs, making it practical for implementation in a high-frequency microprocessor. At the same time, its accuracy is superior to that of previous highly accurate predictors. Figure 1 illustrates how our new predictor achieves low latency by beginning well ahead of time. The predictor staggers computations in time, predicting a branch using a neuron selected dynamically along the path to that branch, rather than selecting the neuron all at once based solely on the branch address. A happy side-effect of this selection process is improved accuracy because the predictor is able to correlate with path history as well as pattern history.
We show that our path-based neural predictor has a misprediction rate 7% lower than that of the original perceptron predictor, and because of its improved latency it delivers an IPC 16% higher than that predictor at a 64KB hardware budget. This paper is organized as follows: Section 2 briefly discusses related work. Section 3 gives background in neural branch prediction and explains the new prediction algorithm. Section 4 describes our experimental methodology. Section 5 gives the accuracy and performance results of our experiments. Finally, Section 6 concludes the paper.
Related Work

Neural Prediction
Calder et al. use neural networks to perform static branch prediction [4] at compile time. Features such as control-flow information are used to train a neural network to distinguish branches that are likely to be biased taken from branches that are likely to be biased not taken.
Dynamic branch prediction with neural methods was first proposed by Vintan et al. [23] who explore the use of learning vector quantization, a neural method. The resulting branch predictor achieves an accuracy comparable to a table-based branch predictor.
Branch Prediction with Perceptrons
The original perceptron predictor [9] uses a simple linear neuron known as a perceptron [1] to perform branch prediction. Perceptrons achieve better accuracy than two-level adaptive branch prediction because of their ability to exploit long history lengths, which have been shown to provide additional correlation for branch predictors [6]. Another study suggests ways to implement the predictor using techniques from high-speed arithmetic [10], but the latency of the predictor is more than 4 cycles with an aggressive clock rate. Despite its drawbacks, neural prediction has been suggested as a promising technology for future microprocessors [16]. It has become part of one of Intel's IA-64 simulators for researching future microarchitectures [2]. It has been used as a component in studies of hybrid predictors [12, 22] and is the most accurate single-component branch predictor in the literature [12, 10].
Path-Based Prediction
Our path-based neural predictor achieves superior accuracy and low latency by choosing the neural weights based on the path taken to reach a branch rather than the branch address itself. Branch outcomes are highly correlated both with path and pattern histories [14, 21]. Previous work has also explored the use of path information to improve branch predictor accuracy.
Latency-Sensitive Prediction
As hardware budgets for branch predictors expand, research has begun to focus on the tradeoff between accuracy and latency, which becomes important for large predictors with high latencies. Jiménez et al. survey several techniques for mitigating branch predictor delay [8]. The most common technique is overriding, in which a quick but relatively inaccurate predictor guides instruction fetch in a single cycle, and may be corrected by a slower but more accurate multi-cycle predictor. This approach was used for the Alpha EV6 and EV7 cores [11] and was proposed for the Alpha EV8 [16]. The overriding technique does not scale well as branch predictor latency increases because the penalty for an overriding event becomes substantial [7].
Other studies propose pipelined branch predictors [21, 7, 17] to mitigate latency. The main source of latency for most large branch predictors is the access delay to the memories used to implement the pattern history tables. In contrast, the latency of the perceptron predictor is dominated by computation time.
A Path-Based Neural Predictor
In this section, we review the relevant details of previous work on neural branch prediction. In this context, we give the intuition behind the path-based neural predictor. We then give a detailed explanation of the path-based neural predictor.
Branch Prediction with Perceptrons
The perceptron predictor uses perceptron learning [15, 1] to predict the directions of conditional branches [9, 10]. We review the design of the perceptron predictor, describing algorithms using an Algol-like pseudocode with keywords in boldface and comments in italics. We use taken and not taken as meaningful names for Boolean constants.
The perceptron predictor is similar to other predictors in that it keeps a global history shift register that records the outcomes of branches as they are executed, or speculatively as they are predicted. The width of this register is the history length for the predictor, hereafter referred to as $h$.
The perceptron predictor keeps an $n \times (h+1)$ matrix $W$ of integer weights, where $n$ is the number of weights vectors. Each row of $W$ is a weights vector holding the weights of one perceptron, and the first weight in each vector is known as the bias weight. The Boolean vector $G[1..h]$ represents the global history shift register. Figure 2 gives pseudocode for the prediction and update algorithms for the original perceptron predictor. The prediction algorithm returns a Boolean value predicting the branch at address pc.
Prediction and Update Algorithms
Figure 2. Perceptron prediction and update algorithm
When a branch outcome becomes known, the train algorithm is invoked to update the predictor. The training algorithm takes an integer parameter $\theta$ that controls the tradeoff between long-term accuracy and the ability to adapt to phase behavior. It has been empirically determined that choosing
$$\theta = \lfloor 1.93h + 14 \rfloor$$
gives the best accuracy [10]. Thus, $\theta$ is a constant for a given history length. The update algorithm takes as parameters the outcome of the branch as well as the values of the weights-vector index $i$, prediction, and $y_{out}$ computed during the prediction phase.
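Since the pseudocode of Figure 2 is not reproduced in this text, the following Python sketch illustrates the prediction and training steps as described above. It is a minimal illustration under stated assumptions, not the exact algorithm of the figure: the class and method names are hypothetical, the weights vector is selected as pc mod n, weights are left unbounded rather than saturating, and the history is updated non-speculatively at training time.

```python
import math


class PerceptronPredictor:
    """Sketch of a global-history perceptron predictor (weights unbounded here)."""

    def __init__(self, n, h):
        self.n = n                               # number of weights vectors
        self.h = h                               # history length
        self.theta = math.floor(1.93 * h + 14)   # training threshold
        # W[i][0] is the bias weight; W[i][j] pairs with the j-th most recent outcome.
        self.W = [[0] * (h + 1) for _ in range(n)]
        # G[j - 1] is the outcome of the j-th most recent branch (True = taken).
        self.G = [False] * h

    def predict(self, pc):
        i = pc % self.n                          # assumed indexing: pc mod n
        w = self.W[i]
        y_out = w[0] + sum(w[j] if self.G[j - 1] else -w[j]
                           for j in range(1, self.h + 1))
        return i, y_out >= 0, y_out

    def train(self, i, prediction, y_out, outcome):
        # Train on a misprediction or when the output magnitude is within theta.
        if prediction != outcome or abs(y_out) <= self.theta:
            w = self.W[i]
            w[0] += 1 if outcome else -1
            for j in range(1, self.h + 1):
                w[j] += 1 if self.G[j - 1] == outcome else -1
        # Shift the resolved outcome into the history.
        self.G = [outcome] + self.G[:-1]


def run_alternating(steps=200):
    """Feed one branch a taken/not-taken alternation; return per-step hits."""
    p = PerceptronPredictor(n=8, h=4)
    hits = []
    for t in range(steps):
        outcome = (t % 2 == 0)
        i, prediction, y_out = p.predict(0x40)
        hits.append(prediction == outcome)
        p.train(i, prediction, y_out, outcome)
    return hits
```

With a history length of 4 the threshold evaluates to 21, and on the alternating pattern the sketch stops training once the output magnitude exceeds that threshold.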
Implementation
We review some of the suggestions for a practical implementation of the perceptron predictor. The matrix $W$ should be implemented as a tagless direct-mapped memory of weights vectors, from which the weights vector for the current branch is read.
Instead of negating the weights to produce summands for the computation of $y_{out}$, they can be bitwise complemented with very little impact on accuracy. This speeds the computation of the summands.
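To see why complementing is a close approximation, recall that in two's complement the bitwise complement of w equals -w - 1. The sketch below, with hypothetical function names and example weights chosen only for illustration, compares the exact sum with the complemented one; the two differ by exactly the number of not-taken history bits.

```python
def summand_exact(w, taken):
    """Exact summand: the weight, or its negation for a not-taken history bit."""
    return w if taken else -w


def summand_complement(w, taken):
    """Approximate summand: bitwise complement (~w == -w - 1) replaces negation."""
    return w if taken else ~w


def perceptron_output(weights, history, summand):
    """weights[0] is the bias weight; weights[1:] pair with the history bits."""
    return weights[0] + sum(summand(w, g) for w, g in zip(weights[1:], history))


weights = [3, 5, -2, 7]
history = [True, False, False]
exact = perceptron_output(weights, history, summand_exact)
approx = perceptron_output(weights, history, summand_complement)
```

Here the exact output is 3 and the approximate output is 1, a gap of 2 that matches the two not-taken bits in the history.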
The computation of $y_{out}$ can be arranged as a Wallace-tree [5] of adders to add the summands. This allows the circuit performing this computation to have a depth of $O(\log h)$ gate delays, as opposed to $O(h)$ gate delays with a naive summing algorithm.
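The depth argument can be illustrated in software. The sketch below is a word-level model, not a circuit description: it reduces a list of summands with layers of 3:2 carry-save adders until two values remain, then performs one ordinary addition. The function names are hypothetical, and the grouping of operands within a layer is one simple choice among several.

```python
def carry_save_add(a, b, c):
    """3:2 compressor on whole integers: a + b + c == partial + carry."""
    partial = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial, carry


def wallace_sum(summands):
    """Return (total, levels), where levels counts the carry-save layers used."""
    values = list(summands)
    levels = 0
    while len(values) > 2:
        full = len(values) - len(values) % 3
        reduced = []
        for k in range(0, full, 3):
            reduced.extend(carry_save_add(*values[k:k + 3]))
        reduced.extend(values[full:])   # leftovers pass through to the next layer
        values = reduced
        levels += 1
    return sum(values), levels          # final carry-propagate addition


total, levels = wallace_sum(range(1, 33))
```

For 32 summands this model needs 8 carry-save layers, whereas summing them one after another would chain 31 additions.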
Disadvantage of the Perceptron Predictor
The main disadvantage of the perceptron predictor is its high latency. Even using the high-speed arithmetic tricks mentioned above, the latency of the computation of $y_{out}$ is high relative to the clock period of a deeply pipelined microarchitecture. It has been shown that performance is highly sensitive to branch predictor latency [8], even when special techniques are used to mitigate latency [7].
A Path-Based Neural Predictor
Our alternative to the perceptron predictor is a neural predictor that chooses its weights vector according to the path leading up to a branch, rather than according to the branch address alone. This technique has two advantages. First, latency is mitigated because computation of $y_{out}$ can begin in advance of the prediction, with each step proceeding as soon as a new element of the path is executed. Second, accuracy is improved because the predictor incorporates path information into the prediction.
Intuitive Description
Our new predictor has much the same structure as the perceptron predictor. It keeps a matrix $W$ of weights vectors. Each time a branch is fetched and requires a prediction, one of the weights vectors from $W$ is read. However, only the first weight, i.e. the bias weight, is used to help predict the current branch. Its value is added to a running total that has been kept for the last $h$ branches, with each summand added during the processing of a previous branch. The global history is kept in two shift registers that hold speculative and non-speculative global history, respectively.
The Prediction Algorithm
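The pseudocode for this algorithm is not reproduced in this text, so the following Python sketch illustrates the staggered computation under stated assumptions. The names (PathBasedPredictor, running, direct_sum) are hypothetical, the weights vector is selected as pc mod n, only a single speculative copy of the running totals is modeled, and weights are unbounded. The entry running[h] holds the partial sum for the next branch to be predicted; each prediction adds one weight from the current weights vector to every other partial sum and shifts them one position closer to use.

```python
class PathBasedPredictor:
    """Sketch of path-based neural prediction with staggered partial sums."""

    def __init__(self, n, h):
        self.n, self.h = n, h
        self.W = [[0] * (h + 1) for _ in range(n)]
        self.running = [0] * (h + 1)   # running[h] belongs to the next branch

    def predict(self, pc):
        i = pc % self.n                             # assumed indexing: pc mod n
        # Critical path: read the bias weight and add it to one partial sum.
        y_out = self.running[self.h] + self.W[i][0]
        prediction = y_out >= 0
        # Off the critical path: advance the sums for the next h branches.
        advanced = [0] * (self.h + 1)
        for j in range(1, self.h + 1):
            k = self.h - j
            w = self.W[i][j]
            advanced[k + 1] = self.running[k] + (w if prediction else -w)
        self.running = advanced
        return i, prediction, y_out


def direct_sum(W, indices, predictions, h):
    """Reference: the same output computed all at once for the latest branch."""
    t = len(indices) - 1
    total = W[indices[t]][0]
    for j in range(1, h + 1):
        if t - j < 0:
            break
        w = W[indices[t - j]][j]
        total += w if predictions[t - j] else -w
    return total
```

The reference function makes the path dependence explicit: the weight in column j comes from the weights vector selected by the j-th most recent branch, not from the vector of the branch being predicted.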
Update Algorithm
Updating the path-based neural predictor is conceptually similar to updating the original perceptron predictor. However, the new update algorithm has to deal with the fact that each weights vector is associated with $h$ branches, rather than one branch as in the original predictor. When a branch completes and its outcome is ready to be used to update the predictor, most of the weights in the weights vector associated with that branch cannot be updated because they are being used to predict future branches that have not completed yet. Thus, we design the matrix $W$ so that its columns can be accessed independently, and the update algorithm uses an array holding the index of the weights vector used for each of the $h$ most recent branch instructions. This array can be implemented as a small circular buffer global to all invocations of the training procedure, with speculative and non-speculative versions as with the prediction algorithm. Note that the index, i.e. the branch address modulo the number of weights vectors $n$, was computed in the prediction algorithm, so it can be recorded in the circular buffer at that time. Also, the modulo operation need not be expensive: it is simply a masking operation if the number of weights vectors is chosen to be a power of two.
Some of the details of these algorithms have been omitted for clarity and brevity, e.g., details of the maintenance of the circular buffer of weights vector indices and the maintenance of the non-speculative running totals, which are kept by what is simply a non-speculative copy of the circuitry that maintains the speculative ones. A detailed Java implementation of the algorithm will be made available upon request.
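As an illustration of the update described above, the Python sketch below keeps a small circular buffer of weights-vector indices and outcomes, uses a mask in place of the modulo operation, and adjusts one weight per column. It is a sketch under assumptions rather than the omitted algorithm: the names PathHistory and train are hypothetical, only a non-speculative buffer is modeled, and weights do not saturate.

```python
from collections import deque


class PathHistory:
    """Circular buffer of the indices and outcomes of the h most recent branches."""

    def __init__(self, n, h):
        assert n > 0 and n & (n - 1) == 0, "n must be a power of two for masking"
        self.mask = n - 1
        self.indices = deque(maxlen=h)    # position 0 is the most recent branch
        self.outcomes = deque(maxlen=h)

    def index_of(self, pc):
        return pc & self.mask             # same result as pc % n

    def record(self, index, outcome):
        self.indices.appendleft(index)
        self.outcomes.appendleft(outcome)


def train(W, theta, i, y_out, prediction, outcome, history):
    """Update the bias of vector i and one column in each earlier vector."""
    if prediction != outcome or abs(y_out) <= theta:
        W[i][0] += 1 if outcome else -1
        for j, (k, past) in enumerate(zip(history.indices, history.outcomes), 1):
            # Column j of the vector used by the j-th most recent branch.
            W[k][j] += 1 if past == outcome else -1


W = [[0] * 3 for _ in range(4)]           # n = 4 vectors, h = 2
hist = PathHistory(n=4, h=2)
hist.record(1, True)                      # older branch
hist.record(2, False)                     # most recent branch
train(W, theta=17, i=3, y_out=0, prediction=True, outcome=False, history=hist)
```

Because each column is touched through a different row index, the update maps naturally onto columns stored as independently addressable memories.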
Recovery After Misprediction
When the path-based neural predictor predicts incorrectly, the speculative vector of partial sums is restored to the value stored in its non-speculative copy during the predictor update for the last committed branch. Since all of the branches up to the last committed branch were correctly predicted and committed in-order, the restored value is as it was when the mispredicted branch was fetched, and prediction will continue normally. The recovery takes less than one cycle, and its latency is completely hidden by the latency of other actions taken by the microarchitecture to recover from the misprediction.
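The recovery step amounts to copying committed state over speculative state. The following Python sketch shows that idea for the vector of partial sums; the class and method names are hypothetical, and the same pattern would apply to the global history registers.

```python
class PartialSumState:
    """Speculative partial sums plus the copy saved at the last committed branch."""

    def __init__(self, h):
        self.speculative = [0] * (h + 1)
        self.committed = [0] * (h + 1)

    def on_commit(self, sums_after_branch):
        # Saved during the predictor update for a committed branch.
        self.committed = list(sums_after_branch)

    def on_misprediction(self):
        # Discard wrong-path work by restoring the committed copy.
        self.speculative = list(self.committed)


state = PartialSumState(h=2)
state.on_commit([1, 2, 3])
state.speculative = [9, 9, 9]     # wrong-path updates
state.on_misprediction()
```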
Area and Latency
Clearly, the prediction algorithm uses a slower method for computing $y_{out}$ than the original perceptron method. However, since it begins the summation process $h$ branches before the prediction is needed, the latency is almost completely hidden. The only elements on the critical path to making a prediction are reading the bias weight and adding it to the current partial sum. This is much faster than computing $y_{out}$ all at once with a Wallace-tree and also consumes less area. The Wallace-tree for the original perceptron predictor is built from many carry-save adders as well as a carry-lookahead adder for the final addition, while the new algorithm requires only a number of independent adders proportional to the history length $h$ for updating the partial sums at each prediction step. For reasonably sized predictors and history lengths, we estimate that the path-based neural predictor would take approximately two clock cycles to produce a prediction given a branch address. This is the same latency tolerated by branch predictors from industrial designs [16]. We give details of these estimates later in Section 4.
Methodology
In this section, we describe our experimental methodology for evaluating the path-based neural predictor.
Microarchitectural Framework
We use 17 SPEC CPU integer benchmarks running under a version of SimpleScalar/Alpha [3], a cycle-accurate out-of-order execution simulator that has been enhanced to include our branch predictors, simulate overriding predictors at various latencies, and simulate deep pipelines. We simulate all of the SPEC CPU 2000 integer benchmarks, and all of the SPEC CPU 95 integer benchmarks that are not duplicated in SPEC CPU 2000. The benchmarks are compiled with the Compaq GEM compiler with the optimization flags -fast -O4 -arch ev6.
To better capture the steady-state performance behavior of the programs, our experiments skip the first billion instructions, as several of the benchmarks have an initialization period lasting fewer than one billion instructions during which program behavior is not characteristic of the many billions of subsequent instructions. After skipping those instructions, each benchmark executes 500 million instructions on the ref inputs before the simulation ends. 
Branch Predictors Simulated
We simulate the following predictors to compare with the path-based neural predictor:
2Bc-gskew
We simulate a 2Bc-gskew predictor, which is a McFarling-style [13] hybrid predictor combining a bimodal predictor with an e-gskew predictor that predicts using the majority prediction of three components: the bimodal predictor and two gshare-like predictors indexed by special hash functions so as to minimize the chance that both predictors will suffer destructive interference at the same time. A version of this predictor would have been used in the Alpha EV8 processor [16]. In our latency-sensitive simulation, 2Bc-gskew takes more than one cycle to return a result. We use a two-level overriding organization [8] to mitigate this latency: a first-level 2K-entry bimodal predictor gives a prediction in a single cycle and instructions are fetched down the predicted path. If the second-level 2Bc-gskew predictor disagrees with the initial prediction, the instructions fetched so far are dropped and fetching continues from the other path. This technique closely reflects the design of the EV8 predictor, in which 2Bc-gskew overrides a less accurate instruction cache line predictor.
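The overriding organization can be sketched in a few lines of Python. This is an illustration under assumptions, not the simulator's code: the names Bimodal and overriding_predict are hypothetical, counters are initialized to weakly taken, the second-level predictor is passed in as a plain function, and an override is charged a number of wasted fetch cycles equal to the second-level latency.

```python
class Bimodal:
    """First-level predictor: a table of 2-bit saturating counters."""

    def __init__(self, entries=2048):
        self.counters = [2] * entries          # assumed start: weakly taken

    def predict(self, pc):
        return self.counters[pc % len(self.counters)] >= 2

    def update(self, pc, taken):
        k = pc % len(self.counters)
        c = self.counters[k]
        self.counters[k] = min(3, c + 1) if taken else max(0, c - 1)


def overriding_predict(pc, fast, slow_predict, slow_latency):
    """Return (final prediction, fetch cycles wasted by an override)."""
    first = fast.predict(pc)                   # available after one cycle
    final = slow_predict(pc)                   # available after slow_latency cycles
    wasted = slow_latency if final != first else 0
    return final, wasted


fast = Bimodal()
disagree = overriding_predict(0x10, fast, lambda pc: False, slow_latency=2)
agree = overriding_predict(0x10, fast, lambda pc: True, slow_latency=2)
```

The cost of disagreement grows with the second-level latency, which is why the technique scales poorly for slow predictors.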
Perceptron Predictor
We simulate a recent, highly accurate version of the perceptron predictor [10] that combines global and per-branch history information in a manner reminiscent of the alloyed branch predictors of Skadron et al. [19, 10]. We again use an overriding organization with a first-level 2K-entry bimodal predictor, this time backed up with a second-level perceptron predictor. We note that this predictor has been shown to be more accurate than even the most aggressive multi-component hybrid predictor [10]. Thus, including other combined global and per-branch hybrid predictors in this study would be superfluous.
gshare.fast
We simulate a specialized version of the gshare predictor that has been pipelined to return a result in a single cycle. By using older branch history to prefetch a portion of the pattern history table in a previous cycle and then using the exclusive-OR of more recent history and the low bits of the current branch address to select from that portion, gshare.fast has an effective latency of one cycle [7]. It has been shown to yield higher instructions-per-cycle rates than highly accurate predictors such as 2Bc-gskew and the perceptron predictor at large hardware budgets [7]. For this study, our simulation of gshare.fast is idealized, assuming that there is no overlap or missing gap between the older history and more recent history.
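The two-step table selection can be sketched as follows in Python. The function name and the split of the history into row_bits and col_bits are hypothetical choices made for illustration; bit 0 of the history is taken to be the most recent outcome, and, as in the idealized simulation, the older and more recent parts of the history neither overlap nor leave a gap.

```python
def gshare_fast_index(pc, history, row_bits, col_bits):
    """Split a (row_bits + col_bits)-bit history into a prefetched row and a column."""
    col_mask = (1 << col_bits) - 1
    row_mask = (1 << row_bits) - 1
    recent = history & col_mask               # most recent col_bits outcomes
    older = (history >> col_bits) & row_mask  # the row_bits outcomes before those
    row = older                               # selects the portion fetched a cycle early
    col = (recent ^ pc) & col_mask            # selects a counter within that portion
    return row, col


row, col = gshare_fast_index(pc=0x1235, history=0b10110110, row_bits=4, col_bits=4)
```

In this sketch the table holds 2^(row_bits + col_bits) counters, so the history length equals the base-2 logarithm of the table size.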
Fixed-Length Path Predictor
We simulate a fixed-length path branch predictor that forms a hash of the history of branch target addresses leading up to the branch to be predicted [21]. The hash function XORs the addresses, first rotating each address by a number of bits equal to its position in the branch history. The hash is used to index a table of two-bit saturating counters as in a two-level scheme. We use the same fixed length for each benchmark, as opposed to using a variable-length path branch predictor, which requires expensive profiling [21]. (Note that none of the schemes used for this paper require profiling.)
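A minimal Python sketch of such a hash follows. The details are assumptions made for illustration and are not taken from [21]: addresses are truncated to the index width before rotation, rotation is to the left, and positions count from zero at the most recent target. The function names are hypothetical.

```python
def rotate_left(x, r, bits):
    """Rotate the low `bits` bits of x left by r positions."""
    mask = (1 << bits) - 1
    x &= mask
    r %= bits
    return ((x << r) | (x >> (bits - r))) & mask


def path_hash(targets, bits):
    """XOR the targets, each rotated by its position (targets[0] is most recent)."""
    value = 0
    for position, address in enumerate(targets):
        value ^= rotate_left(address, position, bits)
    return value


same_target_three_times = path_hash([0b0001, 0b0001, 0b0001], bits=4)
wraparound = path_hash([0b1000, 0b1000], bits=4)
```

Rotating by position keeps repeated targets from cancelling under XOR: three visits to the same target hash to 0b0111 here rather than collapsing to a single address.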
Path-Based Neural Predictor
We simulate the path-based neural predictor as described above, using an overriding organization with a first-level 2K-entry bimodal predictor as with the other overriding predictors.
Each simulated predictor is pipelined so that it can be accessed on every cycle, e.g. for a predictor with a latency of 2 cycles, the prediction requested 2 cycles ago is available in the current cycle. Each predictor's history registers are updated speculatively and corrected on a misprediction.
Tuning The Predictors
Using the train inputs of the benchmarks and trace-driven simulation, we find the history lengths that minimize the average misprediction rate for each hardware budget and branch predictor, exploring hardware budgets from 1KB to 64KB. We use these history lengths in the execution-driven simulations on the ref inputs. Table 2 shows the tuned history lengths for each hardware budget. Note that gshare.fast is not shown, as its history length is fully constrained by the details of its implementation, and is equal to the base-2 logarithm of the number of elements in the pattern history table.
Estimating Branch Predictor Latency
We use CACTI 3.0 [18] to estimate the latency of the various memories accessed by the predictors. We use HSPICE along with a custom logic design program to estimate the latency of the circuits used to compute the perceptron output for the perceptron predictor as well as the latency of the adders used for the path-based neural predictor. Table 3 shows the latencies we derived for each branch predictor and hardware budget except for gshare.fast, giving the amount of time it takes from the time a branch address is known to the time a prediction becomes available. For gshare.fast, the latency is always at most one cycle. For 2Bc-gskew, we estimate the latency of the predictor as the delay in accessing the slowest table plus one fan-out-of-four (FO4) delay for taking the majority and choosing the hybrid prediction from the two component predictions. For the global/local perceptron predictor, the latency is the sum of the access delay to the table of weights vectors measured by CACTI and the worst-case delay of the perceptron output circuit as measured by HSPICE. We optimistically ignore the access time to the first-level table of per-branch histories. The fixed-length path branch predictor is computationally expensive to implement because it requires hashing many addresses to produce one prediction. Nevertheless, we optimistically assume that it can be pipelined to produce a result with the same latency as 2Bc-gskew. For the path-based neural predictor, the latency is the sum of the access delay to the table of bias weights and the worst-case delay of the adder that adds the bias weight to the next partial sum in the vector. For consistency, we use the same adder circuits that were used in the original perceptron predictor study [10].
Experimental Results
In this section, we give the results of our experimental studies. We discuss the misprediction rates of the various branch predictors. We then discuss the performance achieved by the predictors in terms of instructions-per-cycle (IPC).
Misprediction Rates
Figure 6 shows the arithmetic mean misprediction rates for the four predictors ranging over hardware budgets from 1KB to 64KB over all benchmarks as measured by the microarchitectural simulator. Clearly, the path-based neural predictor has the lowest misprediction rate of all the predictors for all hardware budgets. Figure 7 shows the misprediction rates for each benchmark at an 8KB hardware budget. The path-based neural predictor achieves an average misprediction rate of 5.7%, which is 7% lower than that of the global/local perceptron predictor at 6.1%, 13% lower than that of 2Bc-gskew at 6.6%, and 40% lower than that of the fixed-length path branch predictor at 9.4%. The path-based neural predictor has the lowest misprediction rate of all the predictors in 9 out of the 17 benchmarks. Ignoring the global/local perceptron predictor, the path-based neural predictor is the best predictor for 14 of the benchmarks.
Instructions Per Cycle
Figure 8 shows the number of instructions executed per cycle (IPC) for each branch predictor and hardware budget. Clearly, the path-based neural predictor yields the best performance at every hardware budget. The key reason is the combination of superior accuracy and low latency. For instance, the global/local perceptron predictor, which is the second most accurate of all the branch predictors, yields the worst performance at higher hardware budgets because of its high latency. At the same time, 2Bc-gskew, a McFarling-style hybrid with approximately the same latency as the path-based neural predictor, delivers less accuracy and performance than the single-component path-based neural predictor.
At a 64KB hardware budget, the path-based neural predictor delivers an IPC 16% higher than that of the perceptron predictor because of that predictor's high latency. Figure 9 shows the IPC for each benchmark and each predictor at an 8KB hardware budget. The path-based neural predictor yields the best IPC in 15 out of the 17 benchmarks. It achieves a harmonic mean IPC of 1.06, giving a speedup of 12% over the global/local perceptron predictor at 0.95 IPC, 4% over 2Bc-gskew at 1.02 IPC, 18% over gshare.fast at 0.90 IPC, and 18% over the fixed-length path branch predictor at 0.90 IPC. At this hardware budget, both 2Bc-gskew and the path-based neural predictor have a latency of 2 cycles, while gshare.fast has a single-cycle latency. The global/local perceptron predictor has a latency of 6 cycles at this hardware budget. Although it is more accurate than gshare.fast and 2Bc-gskew, its higher latency cancels any advantage it might have for performance.
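The relative figures quoted for the 8KB budget can be checked with a few lines of Python. The helper names below are hypothetical. Because the inputs are the rounded rates and IPCs printed in the text, the recomputed percentages agree with the reported ones only to within about one percentage point.

```python
def relative_reduction(new, old):
    """Fractional reduction of a misprediction rate relative to a baseline."""
    return (old - new) / old


def speedup(new_ipc, old_ipc):
    """Fractional IPC improvement over a baseline."""
    return new_ipc / old_ipc - 1


def harmonic_mean(values):
    """Harmonic mean, the average used for the per-benchmark IPC figures."""
    return len(values) / sum(1 / v for v in values)


# Misprediction rates (%) at 8KB: path-based neural versus each baseline.
reductions = {name: 100 * relative_reduction(5.7, rate)
              for name, rate in [("perceptron", 6.1), ("2Bc-gskew", 6.6),
                                 ("fixed-length path", 9.4)]}

# Harmonic mean IPC at 8KB: path-based neural (1.06) versus each baseline.
speedups = {name: 100 * speedup(1.06, ipc)
            for name, ipc in [("perceptron", 0.95), ("2Bc-gskew", 1.02),
                              ("gshare.fast", 0.90), ("fixed-length path", 0.90)]}
```

The harmonic mean weights slow benchmarks more heavily than the arithmetic mean does; for example, IPCs of 1.0 and 2.0 have a harmonic mean of about 1.33 rather than 1.5.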
Area vs. Hardware Budget
Although standard for branch prediction research, equating the term hardware budget with number of bits of predictor state is problematic in our case. As described in Section 3.2.3, an implementation of the path-based neural predictor may use independently addressable memories, each with its own selection logic, to facilitate the update algorithm. The path-based neural predictor also requires a number of adder circuits proportional to the history length. We estimate that a naive implementation of a path-based neural predictor using 8KB of state could require 80% more area than an 8KB 2Bc-gskew predictor. Even so, the path-based neural predictor is still the best choice. A path-based neural predictor with a hardware budget of 4KB, consuming approximately 10% less total area than an 8KB 2Bc-gskew, achieves a harmonic mean IPC of 1.05, which is less than 1% lower than that of an 8KB path-based neural predictor and 3% higher than that of an 8KB 2Bc-gskew. Indeed, a path-based neural predictor with only 2KB of state achieves the same IPC as an 8KB 2Bc-gskew.
Conclusion
We have presented a new neural branch predictor that has lower latency and superior accuracy to previous neural branch predictors. Our new predictor achieves high accuracy and low latency by predicting a branch using a neuron selected dynamically along the path to that branch. This work is only the beginning of path-based neural prediction; we have yet to fully exploit the potential of this technique. We have shown that our predictor has better accuracy and yields higher performance than conventional predictors. By incorporating our path-based neural predictor into new microarchitectures, designers will be able to improve IPC rates while increasing pipeline depths and clock frequencies.
