The main aim of this research is to propose a new Two-Level 
Introduction
As the average instruction issue rate and depth of the pipeline in multiple-instruction-issue (MII) processors increase, accurate dynamic branch prediction becomes more and more essential. Very high prediction accuracy is required because an increasing number of instructions are lost before a branch misprediction can be corrected. As a result even a misprediction rate of a few percent involves a substantial performance loss.
If branch prediction is to improve performance, branches must be detected within the dynamic instruction stream, the direction taken by each branch must be correctly predicted and the branch target address must be correctly predicted. Furthermore, all of the above must be completed in time to fetch instructions from the branch target address without interrupting the flow of new instructions to the processor pipeline. A classic Branch Target Cache (BTC) [Hen96] achieves these objectives by holding the following information for previously executed branches: the address of the branch instruction, the branch target address and information on the previous outcomes of the branch. Branches are then predicted by using the PC address to access the BTC in parallel with the normal instruction fetch process. As a result each branch is predicted while the branch instruction itself is being fetched from the instruction cache. Whenever a branch is detected and predicted as taken, the appropriate branch target is then available at the end of the instruction fetch cycle, and instructions can be fetched from the branch target in the cycle immediately after the branch itself is fetched. Straightforward prediction mechanisms based on the previous history of each branch give a prediction accuracy of around 80 to 95% [Hen96] . This success rate proved adequate for scalar processors, but is generally regarded as inadequate for MII architectures.
The requirement for higher branch prediction accuracy in MII systems and the availability of additional silicon area led to a dramatic breakthrough in the early 1990s with branch prediction success rates as high as 97% [Yeh92] being reported. These high success rates were obtained using a new set of prediction techniques known collectively as Two-Level Adaptive Branch Prediction that were developed independently by Yale Patt's group at the University of Michigan [Yeh91] and by Pan, So and Rahmeh from IBM and the University of Texas [Pan92] . Two-Level Adaptive Branch Prediction uses two levels of branch history information to make a branch prediction. The first level consists of a History Register (HR) that records the outcomes of the last k branches encountered. The HR may be a single global register, HRg, that records the outcomes of the last k branches executed in the dynamic instruction stream, or multiple local history registers, HRl, that record the last k outcomes of the specific branch being predicted. The second level of the predictor, known as the Pattern History Table (PHT), records the behaviour of a branch during previous occurrences of the current first-level history pattern.
It consists of an array of two-bit saturating counters, one for each possible bit pattern in the HR. A global PHT therefore requires 2 ** k entries, and many times this number are required if a separate PHT is provided for each branch PC. Although a single term is usually applied to the new predictors, this is misleading. Since the first-level predictor can record either global or local branch history information, two distinct prediction techniques have in fact been developed. The global method exploits correlation between the outcome of a branch and the outcomes of the neighbouring branches that are executed immediately prior to the branch. In contrast, the local method depends on the assertion that the outcome of a specific instance of a branch is determined not simply by the past history of the branch, but also by the previous outcomes of the branch when a particular branch history was observed.
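The global form of the scheme described above can be sketched in a few lines of Python. This is only an illustrative model, not the simulator used in this paper: the class name, the initial counter value and the `accuracy` helper are assumptions made for the example, and the PHT is indexed by HRg alone.

```python
class GlobalTwoLevelPredictor:
    """Illustrative model: a k-bit HRg indexing a PHT of two-bit counters."""

    def __init__(self, k):
        self.k = k
        self.hrg = 0                   # last k outcomes, most recent in bit 0
        self.pht = [1] * (1 << k)      # 2 ** k counters; initial value 1 is an assumption

    def predict(self):
        # Counter values 2 and 3 predict "taken", 0 and 1 predict "not taken".
        return self.pht[self.hrg] >= 2

    def update(self, taken):
        # Second level: saturating increment or decrement of the selected counter.
        c = self.pht[self.hrg]
        self.pht[self.hrg] = min(c + 1, 3) if taken else max(c - 1, 0)
        # First level: shift the new outcome into the history register.
        self.hrg = ((self.hrg << 1) | int(taken)) & ((1 << self.k) - 1)


def accuracy(predictor, outcomes):
    """Fraction of outcomes predicted correctly, updating after each branch."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict() == taken
        predictor.update(taken)
    return correct / len(outcomes)
```

With a strictly alternating taken/not-taken sequence, a one-level counter keeps mispredicting, whereas the history pattern lets this model settle after a short training period.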
The main aim of this paper is to propose an improved global Two-Level Adaptive Branch Prediction scheme. Conventional global Two-Level Adaptive Branch Predictors [Pan92] exploit the correlation between the outcome of a branch and the dynamic path followed through a program to reach the branch. The program paths are identified by recording in the HRg whether each branch on the path is taken or not. Unfortunately, this information is insufficient to uniquely identify a program path. In our branch predictor we therefore record both the outcome and the address of each branch on a program path. This additional information makes it possible to retrace the path taken to reach a branch and therefore identifies a unique path through the code. We use trace-driven simulation to compare our improved predictor, which uses this additional path information, with a conventional global predictor.
An Improved Branch Predictor
Two-level adaptive branch prediction significantly reduces the number of incorrect branch predictions. Unfortunately, however, some branches are still difficult to predict correctly. For example, in the Stanford benchmarks there are a significant number of "hard-to-predict" branches whose direction cannot be determined by examining either the HRg or HRl bit patterns. In these cases, with identical HR values, the branch is almost equally likely to be taken or not taken. Furthermore, increasing the length of HR has little impact.
In theory, the adaptive nature of two-level adaptive branch prediction should help. Ideally, with a given HR pattern the predictor should correctly predict taken in some phases of the program and then adapt to predict not taken in other phases. Unfortunately, in practice, the extent of this dynamic adaptation appears to be minimal. As observed by Sechrist et al. [Sec95] , "The role of adaptivity at the second level of two-level branch prediction schemes is more limited than has been thought." It therefore appears that in these difficult-to-predict cases insufficient correlation information is fed to the predictor.
Earlier, we observed that the values stored in HRg do not identify a unique program path leading to each branch. Suppose, for example, that the final bit in HRg is set to logic "1", indicating that the branch executed immediately before the branch being predicted was taken. This final bit only indicates that one of perhaps several branches targeting the basic block containing the next branch has been taken. Since only the fall-through path from the immediately preceding basic block has been eliminated, the actual program path is indeterminate. As a result multiple program paths can map into a single HR bit pattern.
The correlation information available can be improved by recording not only the outcome of each branch but also the address of each branch instruction.
In this way additional correlation information can be provided for the predictor. Note that simply recording the address of each branch executed is insufficient to uniquely identify each path as can be seen from the following simple example:
        Bcc  label
        :
label:  Bcc  loop

Provided there are no intervening branches, the outcome of the first branch must also be recorded if the path is to be correctly identified.
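The two observations above can be checked with a small Python sketch. The branch addresses and the helper names are hypothetical and chosen only for illustration; each path is a list of (branch address, taken) pairs.

```python
def outcome_history(path):
    """Conventional HRg contents: one outcome bit per branch."""
    return tuple(taken for _pc, taken in path)


def address_history(path):
    """Addresses only: the low eight bits of each branch PC."""
    return tuple(pc & 0xFF for pc, _taken in path)


def path_history(path):
    """Path-based HRg contents: low eight PC bits plus the outcome."""
    return tuple((pc & 0xFF, taken) for pc, taken in path)


# Hypothetical addresses. Two different taken branches target the same basic block.
via_first_branch = [(0x1010, True)]
via_second_branch = [(0x1024, True)]

# The same branch reaches the next branch either by jumping or by falling through.
jumped = [(0x1010, True)]
fell_through = [(0x1010, False)]
```

Here `outcome_history` cannot separate the first pair of paths and `address_history` cannot separate the second pair, while `path_history` distinguishes both.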
Our improved branch predictor (MGAp), shown in Figure 1 , makes full use of improved path information. A single global History Register (HRg) records both the outcome and the address of the last k branches; however, only the eight least significant bits of each PC are recorded to limit the length of the register. A fully associative Pattern History Table (PHT) is accessed by concatenating the PC address with the HRg. Each PHT entry holds the branch target address, prediction bits in the form of a two-bit saturating counter and LRU (Least Recently Used) bits used by our replacement algorithm.
A conventional two-level adaptive predictor would use the PC plus HRg to directly index a PHT consisting of an array of two-bit counters. Not surprisingly, the size of these PHT arrays is often excessive. With HRg extended to record full path information, the size of such a PHT would have been prohibitive. We have therefore chosen to implement our PHT as a fully-associative cache. As a result, although the size of each individual entry is increased, the total cost of the PHT is significantly reduced. Clearly a direct-mapped or a set-associative implementation would have been equally appropriate.
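As a rough illustration of the path-based history register and the associative search tag, the following Python sketch packs nine bits per branch into an integer. The function names and the bit layout within each record are assumptions; the text fixes only that eight PC bits and one outcome bit are kept for each branch and that the PC is concatenated with HRg.

```python
BITS_PER_BRANCH = 9  # eight low PC bits plus one outcome bit


def push_branch(hrg, pc, taken, k):
    """Shift one (PC, outcome) record into a history register holding k branches."""
    record = ((pc & 0xFF) << 1) | int(taken)   # assumed layout: PC bits above the outcome bit
    mask = (1 << (BITS_PER_BRANCH * k)) - 1    # keep only the k most recent records
    return ((hrg << BITS_PER_BRANCH) | record) & mask


def pht_tag(pc, hrg, k):
    """Concatenate the PC of the branch being predicted with HRg."""
    return (pc << (BITS_PER_BRANCH * k)) | hrg
```

Because the PHT is searched associatively on this tag, its capacity is set by the number of entries provided rather than by the width of the tag.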
Figure 1. A fully associative modified GAp (MGAp) scheme
We have implemented an MPP (Minimum Performance Potential) replacement algorithm similar to the one presented in [Per93] . The algorithm replaces the entry having the minimum product of the probability of reference, as given by the LRU field, and the probability of the branch being taken, as given by the prediction counter. As a result branches that are predicted as "not taken" tend to be replaced. Since any branch not held in the PHT is predicted as "not taken" by default, the effect is to minimise mispredictions caused by PHT replacements.
We compare our modified predictor (MGAp) with a conventional two-level adaptive scheme (Figure 2 ). This predictor would be classified as a GAp predictor in the Patt classification [Yeh92] . Again, to reduce the size of the PHT and to provide a realistic comparison, we have chosen to implement the PHT as a fully-associative cache.
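The victim selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of [Per93] or the implementation used in our simulator: the `Entry` fields, the use of an integer age in place of the LRU bits, and the two formulas that turn the age and the counter into probability estimates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    tag: int
    counter: int  # two-bit saturating counter, 0..3
    age: int      # 0 = most recently used; larger = older (stand-in for the LRU bits)


def select_victim(entries):
    """Return the entry with the smallest product of the two estimates."""
    span = max(e.age for e in entries) + 1

    def potential(e):
        p_reference = (span - e.age) / span   # assumed mapping from LRU age
        p_taken = (e.counter + 1) / 4         # assumed mapping from counter value
        return p_reference * p_taken

    return min(entries, key=potential)
```

Under these assumed estimates, a fairly recent entry whose counter predicts "not taken" can be evicted in preference to an older entry that predicts "taken", which is the behaviour described above.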
Figure 2. A fully associative GAp scheme

Evaluation of Branch Predictor Schemes
Benchmark Programs
Our simulation work uses the Stanford integer benchmark suite, a collection of eight C programs designed by Professor John Hennessy to be representative of non-numeric code, while at the same time being compact. The benchmarks are computationally intensive with an average dynamic instruction count of 273,000. About 18% of the instructions are branches, of which around 76% are taken. Some of the branches in these benchmarks are known to be particularly difficult to predict; see for example Mudge's detailed analysis [Mud96] of the branches in quicksort.
The benchmarks were compiled using a C compiler developed at the University of Hertfordshire for the HSA (Hatfield Superscalar Architecture) [Ste97] . Instruction traces were then obtained using the HSA instruction-level simulator, with each trace entry providing information on the branch address, branch type and target address. These traces were used to drive a stand-alone branch prediction simulator, developed at the University of Sibiu, that models the branch predictors investigated in this paper. The trace-driven simulator is highly configurable, the most important parameters being the number of HRg bits and the size of the PHT. As output the simulator generates the overall prediction accuracy, the number of incorrect target addresses and other useful statistics; see for example Table 1 .
Simulation Results
First, we evaluated our MGAp predictor using a fully-associative PHT with 100 entries; see Table 1 . HRg records the history of from one to five branches (k = 1 to 5). Since a total of nine bits are used to record each branch (eight bits for the least significant bits of the PC address and one bit for the branch outcome), the size of HRg varies from 9 to 45 bits. As well as recording the number of correct and incorrect branch predictions, Table 1 also records the number of mispredictions caused by incorrect branch targets. This source of mispredictions could be almost completely eliminated by adding an address stack [Kae91] to hold the return addresses for subroutine return instructions. The last column in Table 1 records the total number of replacements, NR, that have taken place in the PHT during each simulation run. NR is a useful metric of branch interference within the PHT. Not surprisingly NR increases as the number of branch patterns recorded in the PHT increases. As more branches are added to each path, the number of paths associated with each branch increases. This in turn increases the pressure for entries in the PHT and the total number of replacements.
Table 2 is derived from Table 1 and records the most successful configuration for each benchmark. The average prediction accuracy, using the most successful configuration in each case, is 87.12%, a figure that rises to around 90% if incorrect branch targets are removed. Interestingly, with three exceptions, the highest success rates are obtained with paths consisting of only a single branch (k = 1). Furthermore, in all cases the highest prediction accuracy is achieved in the absence of a large number of replacements (NR), and in all but two cases with an NR of zero.
Two opposing factors are at work here. First, the misprediction rate would be expected to fall as the length of the path recorded is increased. However, as path lengths are increased, more entries are required in the PHT. As a result more paths are evicted, and the number of mispredictions increases. This explanation is strongly supported by our results. In general, as the path length is increased the prediction success rate improves until the number of replacements in the PHT becomes significant. For example, permute experiences no PHT replacements and is the only benchmark to achieve its highest prediction accuracy with a path length of five.
We repeated our simulations with a conventional GAp prediction scheme (Table 3) . In order to compare configurations with identical hardware costs, the length of HRg was increased in increments of nine bits rather than one. The average prediction success rates for the two predictors are compared in Table 4 . The MGAp scheme achieves a maximum average success rate of 86.03% with a path length of two, while the conventional GAp scheme achieves an 83.74% success rate with a path length of one. Furthermore at every level of hardware complexity, the MGAp scheme outperforms the GAp scheme.
In our second set of experiments we concentrated exclusively on those branches in the Stanford benchmarks that are inherently "difficult to predict." We consider a branch to be difficult to predict if both the local and global branch contexts used in conventional two-level predictors provide insufficient information to avoid a high misprediction rate. By the local context we mean HRl, or the previous history of the branch being predicted, while by the global context we mean HRg, or the history of the branches executed immediately prior to the branch being predicted. Consider, for example, a specific branch from the program perm (Table 5) . Here neither the local context, HRl, nor the global context, HRg, provides sufficient information to allow accurate prediction. In general we consider branches that are mispredicted at least 20-30% of the time by conventional two-level techniques to be difficult to predict.
In Figure 3 to Figure 10 we compare the performance of the two predictors, GAp and MGAp, on the difficult-to-predict branches using seven of our benchmarks. The eighth benchmark, matrix, is excluded since it contains no difficult-to-predict branches. In the case of our MGAp predictor, the index in each figure represents the number of branches (PC + outcome) that make up the path recorded in HRg. In contrast, in the case of the conventional GAp predictor, each unit of the index represents nine bits of branch history information. As can be observed from the figures, our MGAp scheme generally outperforms the conventional GAp predictor that has identical hardware costs.
Conclusions and Discussion
In this paper we have simulated the performance of a modified GAp branch predictor based on complete program path information. Complete path information allows the branch predictor to uniquely identify the program path used to reach each branch and therefore potentially reduces the number of mispredictions.
Conventional two-level adaptive branch predictors implement the PHT as an array of two-bit counters that increases exponentially in size as the length of HRg is increased. In many configurations this can lead to very large storage arrays. For example, combining an 18-bit HRg with the 8 least significant bits of the PC address would require a PHT of 2 ** 26, or roughly 67 million, entries. By configuring our PHT as a cache rather than as a huge array, we significantly reduce the cost of the whole branch predictor. Furthermore, our configuration makes it possible to utilise full-path information at a reasonable cost, since the length of HRg is no longer the critical factor in determining the size of the PHT. Instead the size of the PHT is largely determined by the number of distinct program paths that must be stored to ensure accurate branch prediction. Finally, the more precise full-path information should allow different paths to be identified using a smaller number of distinct bit patterns in HRg.
The preliminary results presented in this paper are most encouraging, with our modified predictor generally outperforming the conventional GAp predictor. The prediction accuracies are lower than those observed by other researchers using the SPEC benchmarks. There are two reasons for this. Branches in very large benchmarks such as SPEC tend to suffer a smaller percentage of mispredictions while the predictor is being trained. Furthermore, other researchers postulate unrealistically large PHT arrays, while the associative PHT simulated in this paper has only a hundred entries.
Particularly useful information has been gleaned regarding the interaction between path length and the number of replacements required in the PHT. The next stage of our research is to investigate our MGAp predictor using a wider range of parameters in our trace-driven simulator and, in particular, to investigate increasing the size of our PHT to reduce the number of entry replacements.
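The size arithmetic behind this comparison can be reproduced with a short Python sketch. The function names are illustrative only; the figures of nine bits per recorded branch, an 18-bit HRg, eight PC bits and two-bit counters are taken from the text.

```python
def path_hrg_bits(k, bits_per_branch=9):
    """Length of HRg when k branches are recorded at nine bits each."""
    return k * bits_per_branch


def direct_pht_entries(hrg_bits, pc_bits):
    """Entries needed when the PHT is indexed directly by HRg plus PC bits."""
    return 2 ** (hrg_bits + pc_bits)


def direct_pht_bytes(hrg_bits, pc_bits, counter_bits=2):
    """Storage for the counters alone, ignoring targets and tags."""
    return direct_pht_entries(hrg_bits, pc_bits) * counter_bits // 8


entries = direct_pht_entries(path_hrg_bits(2), 8)  # 18-bit HRg, 8 PC bits
storage = direct_pht_bytes(path_hrg_bits(2), 8)    # two-bit counters
```

For the 18-bit case this gives 67,108,864 directly indexed counters, or 16 MiB of counter storage, against the 100 entries of the associative PHT simulated here.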
