Recent studies of dynamic branch prediction schemes rely almost exclusively on user-only simulations to evaluate performance. We find that an evaluation of these schemes with user and kernel references often leads to different conclusions. By analyzing our own Atom-generated system traces and the system traces from the Instruction Benchmark Suite, we quantify the effects of kernel and user interactions on branch prediction accuracy. We find that user-only traces yield accurate prediction results only when the kernel accounts for less than 5% of the total executed instructions. Schemes that appear to predict well under user-only traces are not always the most effective on full-system traces: the recently-proposed two-level adaptive schemes can suffer from higher aliasing than the original per-branch 2-bit counter scheme. We also find that flushing the branch history state at fixed intervals does not accurately model the true effects of user/kernel interaction.
1 Introduction
With the explosion of new superscalar microarchitectures, there has been mounting pressure on microprocessor architects to improve the predictability of the conditional branches in the program flow. With the trend toward larger branch misprediction penalties due to the use of deeper pipelines, breaks in the program flow can quickly throttle the performance of these wide-issue microprocessors. Several recent studies [11, 14, 20] have proposed new hardware branch prediction schemes to address this problem. To date, the evaluation of these new techniques has been done almost exclusively with user-level traces of applications such as those found in the SPEC92 benchmark suite. This study goes beyond that work to use full-system traces (i.e. traces with user and kernel references) to evaluate the effectiveness of several two-level adaptive branch prediction schemes. This study also analyzes the performance of these dynamic branch prediction schemes on kernel-intensive applications such as an HTTP server and those found in the IBS benchmark suite [18].
All dynamic branch prediction schemes in this study are similar in that they use a table of two-bit, up/down, saturating counters. A 2-bit counter summarizes the past outcomes of a branch stream, using this information to predict the next branch outcome [10, 17]. The method of selecting a 2-bit counter from this table defines the type of dynamic branch prediction implemented. We evaluate four dynamic schemes that have been shown to be particularly successful at predicting user-level branches: simple per-branch dynamic [17], GAs [14, 21], gshare [11], and PAs [21]. The last three schemes are two-level adaptive schemes which exploit patterns in the recent local or global branch history to improve prediction accuracy.
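To make the counter mechanics concrete, the following sketch (our own illustration, not code from any cited implementation) models a 2-bit, up/down, saturating counter: the two upper states predict taken, the two lower states predict not-taken, and each resolved branch nudges the counter one step toward the observed direction.

    #include <stdbool.h>

    /* 2-bit saturating counter states:
     * 0 = strongly not-taken, 1 = weakly not-taken,
     * 2 = weakly taken,       3 = strongly taken.    */
    typedef unsigned char counter2;

    static bool predict_taken(counter2 c) {
        return c >= 2;                          /* upper states predict taken */
    }

    static counter2 train(counter2 c, bool taken) {
        if (taken) return c < 3 ? c + 1 : 3;    /* count up, saturate at 3 */
        else       return c > 0 ? c - 1 : 0;    /* count down, saturate at 0 */
    }

Because the counter saturates rather than wraps, a single anomalous outcome in a strongly biased branch stream moves the state but not the prediction, which is why a 2-bit counter tolerates occasional deviations better than a single history bit would.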
While recent studies have demonstrated the benefit of two-level adaptive schemes on benchmarks such as SPEC92, Young et al. [23] point out some potential problems with these approaches as the number of static branches to predict increases. Since a large number of programs in the SPEC92 benchmark suite contain very few static branch sites, these benchmarks do not stress the size of the hardware branch prediction tables in most studies. We evaluate two-level adaptive schemes on larger applications, such as those found in the Instruction Benchmark Suite (IBS) [18]. Since these benchmarks do not cover the entire spectrum of applications, we also evaluate the two-level adaptive schemes using our own system traces. We gathered these traces with the Atom tool-building system [5]. Overall, our Atom traces include a selection of the SPEC92 benchmarks and several large, system-intensive applications like an HTTP server. Unlike the SPEC92 benchmarks, the HTTP server spends a significant amount of its execution time in kernel routines. In summary, through the use of the IBS traces and our own system traces, we are able to analyze the performance of two-level adaptive branch prediction schemes under three operating systems and on a wide spectrum of applications.
Different workloads spend different amounts of time in user and kernel code. We find that user-level traces of applications that spend the vast majority of their time in user code (for example, the SPEC92 benchmarks) give good approximations of overall prediction accuracy. However, the prediction accuracy on benchmarks with even a relatively small amount of kernel activity (just 10% of instructions) is not modeled well by user-only traces. Schemes that appear the best in user-only traces (e.g. gshare with a large branch history depth) do not always perform best on full-system traces. Our results show that including kernel branches in the branch trace can greatly increase the number of static branches predicted and thus worsen the effects of aliasing. The negative effect of aliasing on prediction accuracy is more pronounced in the two-level schemes with large history depths than in locally-oriented schemes that rely on small history depths [16]. We also find that flushing the branch history state [13, 15] at fixed intervals does not accurately model the true effects of user/kernel interactions: some schemes are more sensitive than others to periodic flushing.

Section 2 summarizes the recent advances in branch prediction, and it describes the major reasons for poor prediction accuracy in a dynamic branch prediction scheme. Section 3 presents our simulation methodology and our benchmark applications. Section 4 discusses our experimental results. Section 5 presents the conclusions of this work.
2 Understanding Branch Prediction Schemes

In the last five years, researchers have made steady improvements in the accuracy of static and dynamic branch prediction schemes by exploiting the relationships between program branches and the patterns of behavior of individual branches. To understand the operation and to compare the performance of these schemes, Young et al. [23] introduced an analytical framework for today's branch prediction schemes. Figure 1 summarizes the main components of that framework. As illustrated by this figure, the recently proposed branch prediction schemes predict the future outcome of a branch by accessing a predictor which summarizes some portion of the past outcomes of this branch. For example, most dynamic branch prediction schemes summarize the past history of a branch through the use of a simple finite-state machine implemented as a 2-bit, up/down, saturating counter. The divider in Figure 1 selects the predictor, e.g. a 2-bit counter, used for each prediction. Before 1991, the divider in the best branch prediction schemes chose a predictor based on the address of the branch to predict [10, 12, 17]. The dynamic versions of these schemes maintained a table of 2-bit counters, referred to as a branch history table (BHT), indexed by the branch address. Figure 2a illustrates the hardware for this approach, which we refer to as 2bc.
[Figure 1 (components: divider, predictors, substreams): Framework for describing a branch prediction scheme [23]. The divider mechanism splits the program execution stream into substreams.]

Recently, several researchers have empirically shown that we can improve branch prediction accuracy by building more elaborate divider mechanisms [11, 14, 20]. By appropriately dividing a program's dynamic branch stream into many substreams, we can produce substreams that are more predictable. For dynamic schemes, Yeh and Patt [20] introduced the concept of "two-level adaptive" branch prediction schemes, whose dividers include branch history shift registers (BHSRs) that record the recent directions of program branches. Their divider mechanisms use the contents of these shift registers in addition to branch address information to create highly predictable substreams. Krall [9] and Young and Smith [22] describe code transformations that yield similar results for static branch prediction approaches.

In this paper, we focus on three two-level adaptive branch prediction schemes that have been shown to be effective on user-level code [11, 14, 21]. Figure 2 depicts each of these schemes. The first is called GAs, and it uses a single, global BHSR to record the outcomes of the past k branches. As discussed by Pan, So, and Rahmeh [14], GAs exploits the correlation between branch executions in a program; correlation occurs when the outcome of one or more branch executions helps to determine the outcome of a future branch. GAs chooses a 2-bit counter from the BHT by concatenating the contents of the global BHSR with the current branch address. McFarling [11] proposes a modification to this scheme where the BHSR contents are exclusive-ORed with the branch address. McFarling refers to this new scheme as gshare. The exclusive-OR function permits the use of longer history and more address bits for a fixed-size BHT than GAs. Ideally, this extra information results in more substreams that are more predictable. The final two-level adaptive branch prediction scheme that we consider is called PAs [21]. The PAs scheme maps each program branch into a table of BHSRs; the contents of the selected BHSR are concatenated with a portion of the branch address to select a 2-bit counter from the BHT. This scheme exploits repeating patterns in the execution of a single program branch (e.g. loop branches that iterate a constant number of times), but not correlation between distinct static branches. GAs and gshare may be able to capture some of the same looping patterns as PAs on short loop branches, but their use of global history prevents them from exploiting patterns in longer loops.
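As a concrete illustration of these divider mechanisms, the sketch below computes the BHT index for each scheme. The parameter names follow the convention used in Section 4 (j branch-address bits, k bits of branch history, i address bits selecting a BHSR); the exact widths and the shift-register update are our assumptions for illustration, not a description of any shipped design.

    #include <stdint.h>

    /* 2bc: index the BHT by the j low-order bits of the branch word address. */
    uint32_t index_2bc(uint32_t addr, int j) {
        return addr & ((1u << j) - 1);
    }

    /* GAs: concatenate k global-history bits with j address bits. */
    uint32_t index_gas(uint32_t addr, uint32_t ghist, int j, int k) {
        return ((ghist & ((1u << k) - 1)) << j) | (addr & ((1u << j) - 1));
    }

    /* gshare: XOR the global history into the address, letting history
     * and address bits overlap in a max(j, k)-bit index. */
    uint32_t index_gshare(uint32_t addr, uint32_t ghist, int j, int k) {
        int m = (j > k) ? j : k;
        return (addr ^ ghist) & ((1u << m) - 1);
    }

    /* PAs: i address bits select a per-branch BHSR whose k history bits
     * (k <= 8 in this sketch) are concatenated with j address bits. */
    uint32_t index_pas(uint32_t addr, const uint8_t *bhsr, int i, int j, int k) {
        uint32_t h = bhsr[addr & ((1u << i) - 1)] & ((1u << k) - 1);
        return (h << j) | (addr & ((1u << j) - 1));
    }

    /* After each resolved branch, its outcome is shifted into the
     * relevant history register. */
    void update_history(uint32_t *hist, int k, int taken) {
        *hist = ((*hist << 1) | (taken & 1)) & ((1u << k) - 1);
    }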
The analysis performed by Young et al. [23] suggests that the prediction accuracies generated by the current implementations of dynamic prediction schemes like those in Figure 2 are negatively affected by problems of aliasing and training overhead. Aliasing occurs when the hardware divider assigns streams from different branches to the same 2-bit counter. Though the intermingling of the individual branch streams can constructively, destructively, or neutrally impact the prediction accuracy of the individual branches, Young et al. showed that destructive aliasing occurs more frequently and with larger magnitude than constructive aliasing, especially if the working set of the application is large or the BHT is small. Training overhead refers to the fact that a 2-bit counter needs to be "primed" for a particular conditional branch by observing a few executions of that branch. Young et al. did not discuss the effects of training overhead in detail, but this effect is observable in some of their shorter benchmark runs. For these runs, the schemes with finer dividers did not always achieve better prediction accuracies than simpler schemes because the training overhead of many substreams became a noticeable percentage of the total number of predictions. In a simple scheme with a small number of substreams, the few predictions made during 2-bit counter training amount to a negligible number of mispredictions. As we will see in Section 4, the problem of aliasing can become even more pronounced for traces of system activity.
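A small worked example of destructive aliasing, continuing the hypothetical index functions sketched above: with j = 10, two distinct static branches collide in a 1K-entry BHT whenever their word addresses agree in the low 10 bits.

    /* Continuing the sketch above: byte addresses 0x404 and 0x1404
     * differ only above the indexed bits. */
    void aliasing_example(void) {
        uint32_t a = index_2bc(0x404  >> 2, 10);   /* 0x101 & 0x3FF = 0x101 */
        uint32_t b = index_2bc(0x1404 >> 2, 10);   /* 0x501 & 0x3FF = 0x101 */
        /* a == b: both branch streams share one 2-bit counter.  If one
         * branch is almost always taken and the other almost always
         * not-taken, each execution of one untrains the counter for the
         * other, i.e. destructive aliasing. */
        (void)a; (void)b;
    }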
Though there have been several studies exploring the effects of system references on instruction cache performance, the vast majority of the work in branch prediction has focused on user-only traces. Nair [13] and Perleberg and Smith [15] attempt to model the effects of context switches on the user-level component of branch misprediction by regularly flushing the BHT during a user-only trace. We are familiar with only one study that has performed branch prediction simulations with system-level traces. The study by Lee and Smith [10] contains three traces of the MVS operating system executing a commercial workload. Since this work occurred before the invention of two-level adaptive branch prediction schemes, Lee and Smith report only the performance of these traces on a 2bc scheme (in addition to other 2bc-like schemes).
3 Methodology
We use trace-driven simulation of user and kernel activity to evaluate prediction accuracy on a range of branch prediction techniques. We use traces that were collected by two different measurement systems, one hardware-based and the other software-based, on three different operating systems. By using traces from two independent sources, we can benefit from the complementary advantages of hardware and software systems and achieve a higher overall degree of confidence in the quality of our simulations.
For our first set of traces, we used the IBS traces from the University of Michigan [18]. These traces were generated on a DECstation 3100 with a MIPS R2000 processor. The traces are designed to provide a realistic instruction reference stream, overcoming limitations of benchmark suites such as SPEC92, which fit in most on-chip instruction caches and do not induce significant operating system activity. IBS contains traces for two different operating systems, ULTRIX from Digital Equipment Corporation [19] and Mach 3.0 from Carnegie Mellon University.

For our simulations, we used the following items from the IBS trace record: the memory address referenced; the flag that indicates whether the reference was to instruction or data space; the flag that indicates user or kernel mode; and the opcode fetched by an instruction reference. With this information, we generated a branch stream (as illustrated in Figure 1) that we used as input for our branch prediction scheme simulator. The top of Table 1 gives a description of the IBS benchmarks. Table 2 presents some general statistics for each IBS trace.
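The following sketch shows the kind of filtering this involves. The record layout and field names are hypothetical (the actual IBS format differs), is_cond_branch stands in for a real MIPS R2000 opcode decoder, and the single-delay-slot handling is simplified.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical trace record; the real IBS layout differs. */
    struct trace_rec {
        uint32_t addr;       /* memory address referenced           */
        uint8_t  is_instr;   /* 1 = instruction fetch, 0 = data ref */
        uint8_t  is_kernel;  /* 1 = kernel mode, 0 = user mode      */
        uint32_t opcode;     /* opcode fetched by an ifetch         */
    };

    /* Stand-in for a MIPS R2000 conditional-branch decoder. */
    extern int is_cond_branch(uint32_t opcode);

    static size_t next_ifetch(const struct trace_rec *t, size_t n, size_t p) {
        for (p++; p < n; p++)
            if (t[p].is_instr) return p;
        return n;
    }

    /* Emit one branch-stream entry per conditional branch. */
    void emit_branch_stream(const struct trace_rec *t, size_t n, FILE *out) {
        for (size_t p = 0; p < n; p++) {
            if (!t[p].is_instr || !is_cond_branch(t[p].opcode))
                continue;
            size_t slot = next_ifetch(t, n, p);   /* MIPS delay slot */
            size_t next = (slot < n) ? next_ifetch(t, n, slot) : n;
            if (next == n)
                break;
            /* Taken iff control did not fall through past the slot. */
            int taken = (t[next].addr != t[p].addr + 8);
            fprintf(out, "%08x %d %d\n", t[p].addr, taken, t[p].is_kernel);
        }
    }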
We collected additional traces using Atom [5] on a Digital AlphaStation 400/233 running Digital Unix (formerly OSF/1), release 3.2. With the Atom tool-building system, it is possible to instrument both user programs and the Digital Unix kernel, thereby collecting complete data for the simulation with no special-purpose hardware and no source-code modifications to the operating system.
When software-based measurements of system activity are used for architectural simulation, care must be taken to avoid excessive distortion in measured behavior due to the overhead of the measurement system. Two kinds of distortion occur: space dilation and time dilation [2]. To remove space dilation effects, we ran our experiments on a machine with enough physical memory that additional system activity due to virtual memory effects did not occur.

[Table 1: Description of our benchmark programs. We replaced the SPECint92 version of gcc with version 2.6.3 because we had trouble compiling it on our Alpha machines. The descriptions of the IBS benchmarks are based on those provided in [18].]
The bottom of Table 1 includes a brief description of the benchmarks traced under Atom. Table 3 presents the statistics for these benchmarks. They represent a range of applications with differing degrees of kernel and user activity. Most of the recent previous work in branch prediction has focused on the SPEC92 benchmark suite. We chose a sample of these benchmarks, and as shown in Table 3, they spend very little of their total instruction count in the kernel. In addition to these benchmarks, we evaluated four benchmarks chosen for their high level of kernel activity.
From the Atom trace, we construct a branch stream that is identical in format to the stream produced from the IBS benchmarks. We then feed this stream to our simulator code. Though Section 4 presents the results from only a single simulation run of each benchmark, we ran each benchmark in Table 3 three times to verify the stability of our results. We found that the maximum difference in the prediction accuracy between two runs with the same branch prediction scheme was always less than 0.3% (typically less than 0.1%). For a particular benchmark, the prediction accuracy difference between schemes and sizes was always much greater than the difference between runs.
4 Experimental Results
Our experiments concentrated on two basic questions: are the simulation results of user-level traces representative of the prediction accuracy of a dynamic branch prediction scheme on a full-system trace, and does periodic flushing of the BHT during a user-level trace accurately reflect the effect of kernel branches on the user-level component of prediction accuracy? Sections 4.1 and 4.2, respectively, discuss our findings for these two questions.

[Table caption fragment: the switch counts include switches from a user process to the kernel plus switches from the kernel to a user process. Note that the UNIX server is a user process under Mach, and its activity is counted in the user categories.]
Throughout this section, we report the mispredict rates for prediction schemes with hardware state of 4K bits, 16K bits, and 64K bits.^1 We refer to a particular scheme with the identifier "name.size", where "size" is the number of bits in that scheme's hardware state. For example, the identifier "2bc.4K" indicates that the simulator used the hardware 2bc branch prediction scheme with a BHT size of 2K 2-bit counters. The smallest hardware sizes correspond roughly to the amount of branch prediction hardware found in today's microprocessors [6, 7]. We chose the largest scheme size because it has the same number of storage bits as an 8-kilobyte cache.

The address bits used are the lower i and j bits of the branch's word address, where i is the number of address bits used to select a BHSR, j is the number of branch-address bits used to index the BHT, and k is the branch history depth. The BHT size is 2^(max(j,k)+1) bits for gshare schemes and 2^(j+k+1) bits for the other schemes. Table 4 lists the specific parameters of each hardware scheme. The gas.4K entry matches the size and organization of the branch prediction hardware in the NexGen Nx586 [7]. The larger GAs schemes were chosen by scaling up the NexGen parameters.^2 For PAs, we initially experimented with an organization that corresponded to the reported parameters used by the branch prediction unit in the Pentium Pro processor (i = 9, j = 9, k = 4) [6]. However, this organization did not achieve misprediction rates as low as the PAs configurations in Table 4. Similarly, PAs implementations with the same bit cost and a longer history depth (k = 6) also performed worse than the selected PAs schemes at the 4K and 16K sizes.

1. For 2bc, GAs, and gshare, these sizes correspond to BHTs with 2K, 8K, and 32K 2-bit counters. We omit the relatively small hardware costs of the BHSRs on the schemes evaluated in this study.

2. One might be tempted to run a set of GAs simulations to determine the best tradeoff between the j and k parameters for our benchmarks. However, as will be seen later in this section, the best "GAs" scheme for some of our benchmarks has a k value of zero (i.e. a 2bc scheme). Hence, we use the NexGen design parameters as a reasonable starting point for our experiments.
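The size arithmetic above can be checked mechanically. A minimal sketch, assuming 2-bit counters and ignoring BHSR cost as the text does; the example parameters are illustrative, not the exact Table 4 values.

    /* BHT cost in bits: two bits per counter, one counter per index. */
    unsigned long bht_bits_gshare(int j, int k) {   /* 2^(max(j,k)+1) */
        int idx = (j > k) ? j : k;
        return 2ul << idx;
    }
    unsigned long bht_bits_concat(int j, int k) {   /* 2^(j+k+1) */
        return 2ul << (j + k);
    }
    /* Example: j = 7, k = 4 under a concatenating scheme gives
     * 2 << 11 = 4096 bits, i.e. 2K 2-bit counters (a "4K" scheme). */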
The SPEC92 benchmarks have been criticized because a significant portion of their dynamic branch execution count is due to a very small number of static branches. The data in Table 5 supports this criticism. Table 5 also provides evidence that our non-SPEC benchmarks are a challenging workload for the sizes of our branch prediction schemes. Except for a few benchmarks (e.g. u.jpeg and u.nroff), it takes well over 400 static branches to account for 95% of the dynamic branch executions in the full-system trace. Often, the user branches alone are a large portion of this total.
4.1 Predicting user and kernel branches
Researchers have evaluated two-level adaptive branch prediction schemes using single-process traces of user-only activity. Their studies repeatedly concluded that the addition of extra hardware to exploit more specific patterns in the branch stream, as found in gshare or PAs, achieved better branch prediction accuracies than the simpler hardware found in the GAs or 2bc schemes. These results have encouraged researchers to develop even more elaborate hardware and convinced microprocessor vendors to implement these two-level adaptive schemes in new superscalar processors [6, 7, 8]. The results in this section show that the mispredict rate obtained from a trace of user-level branches is often a poor indicator of a scheme's mispredict rate on user and kernel branches. We also find that 2bc can provide the lowest mispredict rate in some cases.

Figure 3 summarizes the results of our simulations, counting how many times each scheme showed the best mispredict rate at a particular scheme size, for user-only and full-system traces (the figure summarizes the data in Tables 8, 9, and 10). Even from this summary, we can make a number of interesting observations. First, the best dynamic prediction scheme for a trace of user-only branches is not always the same as the best scheme for a full-system trace of user and kernel branches. Second, for both user-only and full-system traces, a scheme like gshare that uses a long branch history predicts better as the scheme size increases. The PAs scheme is the best prediction scheme at the small (4K) scheme size; PAs and GAs do best at the middle (16K) scheme size, while GAs and gshare are best at the large (64K) scheme size. To make the same point a different way, the use of long branch histories appears to penalize schemes at small scheme sizes. In addition, the inclusion of kernel branches appears to have a similar effect to that of decreasing the size: schemes with shorter histories do better. As we
will show later in this section, these observed trends are due to the effects of aliasing [23].

Tables 8, 9, and 10 (at the end of the paper) present the user-only and full-system mispredict rates for each benchmark under our range of 2bc, PAs, GAs, and gshare schemes. The data in these tables demonstrates that a mispredict rate as measured in a simulation of user-only activity is not necessarily a reliable indicator of the true mispredict rate on the full-system trace. For example, u.video at gas.64K achieves a user-only mispredict rate of 1.17% while the full-system trace mispredict rate is 3.71%, more than a factor of three worse. Fortunately, the user-only mispredict rates for the SPEC92 benchmarks under Digital Unix match fairly well with the mispredict rates achieved under full-system tracing, providing some credibility to the results of previous studies. As illustrated by the results for o.sc, the match becomes worse as the scheme size decreases or as the history depth of the prediction scheme increases. Furthermore, as the percentage of total instructions executed in user mode decreases, the mispredict rate from user-only experiments quickly deviates from that achieved under full-system tracing. This observation makes sense intuitively, provided that the kernel-only and user-only mispredict rates differ.
The scatter plots in Figures 4 and 5 plot the user-only mispredict rate against the full-system mispredict rate for each benchmark. If the user-only mispredict rate always matched the full-system mispredict rate, then we would expect all points to appear on the diagonal. As expected, this is true for the SPEC benchmarks (the solid circles in Figures 4 and 5). The IBS and the Other Digital Unix benchmarks show some significant deviations from the diagonal, although these deviations decrease with larger scheme sizes.

[Table 6: Arithmetic mean of distortion for each benchmark group. Distortion is calculated using the formula |u - f| / (u + f), where u is the user-only mispredict rate and f is the full-system mispredict rate. A value of 0 means no distortion, while a value of 1 means that one rate dwarfs the other.]
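A minimal sketch of the distortion metric defined in the Table 6 caption:

    #include <math.h>

    /* Normalized distortion between the user-only (u) and full-system (f)
     * mispredict rates: 0 means identical, and the value approaches 1 as
     * one rate dwarfs the other. */
    double distortion(double u, double f) {
        return fabs(u - f) / (u + f);
    }
    /* e.g. distortion(1.17, 3.71) ~= 0.52 for u.video under gas.64K. */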
Focusing on Figure 4, the 2bc graphs appear similar across all scheme sizes. This suggests that the 2bc scheme approaches the point of diminishing returns at BHT sizes of 4K bits; enlarging the table does not significantly reduce the mispredict rate since little aliasing is occurring. This is intuitively borne out by the static branch percentiles in Table 5; few of our benchmarks use more than 4K static branches at the 95th percentile of static branches. However, the IBS benchmarks, which tend to have smaller overall mispredict rates and thus are less visually striking in the scatter plots, still show reasonable improvements from larger scheme sizes.
In Figures 4 and 5, the two-level adaptive schemes do not appear to reach the point of diminishing returns for the scheme sizes that we examined. With increasing scheme size, each graph looks like a scaled-down version of its predecessor, corresponding to better overall mispredict rates from the larger sizes. It also appears that the deviations from the diagonal decrease with larger sizes. Both of these trends make intuitive sense, since the larger sizes should reduce aliasing within the combined set of user and kernel branches.
To quantify our intuitions of reduced distortion at larger scheme sizes, we examined the normalized distortion: the difference between the user-only and full-system mispredict rates divided by their sum. This metric ranges from 0 to 1, where 0 indicates no distortion and 1 indicates that one of the mispredict rates is a tiny fraction of the other (high distortion). Table 6 summarizes this metric for each grouping of benchmark, scheme, and size. The distortion for the SPEC benchmarks is always below 0.03. The distortion for the IBS benchmarks hovers around 0.1, while the Other Digital Unix benchmarks have a slightly higher distortion ranging from 0.1 to 0.2. As we suspected from visual examination of Figures 4 and 5, the distortion generally decreases with larger scheme sizes. It sometimes increases due to the fact that we measure the distortion between the user-only mispredict rate and the full-system mispredict rate, while several of the IBS and the Other Digital Unix benchmarks execute a substantial fraction of their instructions in the kernel.

The addition of kernel branches to the simulation has increased aliasing (contention) in the prediction scheme hardware. Table 11 presents the full-system mispredict rates where all aliasing (both BHSR and BHT address aliasing) has been removed. Unsurprisingly, unaliased mispredict rates are always better than the mispredict rates of the corresponding schemes in our study. As we observed earlier, for the same scheme, smaller sizes suffer more aliasing than larger sizes. This is borne out by the larger differences between the mispredict rates of unaliased and practical implementations at smaller scheme sizes. For example, under gas.4K, o.ht shows a mispredict rate of 10.27%, while the equivalent k = 7 unaliased GAs scheme achieves 2.65%. Aliasing adds almost 300% more mispredictions. The gas.64K scheme shows a mispredict rate of 3.78%; the unaliased k = 9 GAs scheme achieves a rate of 2.36%. Aliasing adds just 60% more mispredictions. Similarly, schemes with deeper branch histories suffer more aliasing than schemes with shallow branch histories. For example, u.video under gas.16K shows a mispredict rate of 7.26%, while the equivalent k = 8 unaliased GAs scheme achieves 2.79%. Aliasing adds almost 160% more mispredictions.
The gsh.16K scheme shows a mispredict rate of 7.51%; the corresponding unaliased k = 13 GAs scheme^3 achieves a rate of 2.03%. Aliasing adds 270% more mispredictions.

From our data, it appears that a user-only mispredict rate accurately reflects the full-system mispredict rate if the percentage of the total instruction count spent in user mode is greater than 95%. If the percentage is less than 90%, the results of a user-only trace cannot be trusted. The 90-95% range is a grey area. The reverse of this observation is demonstrated by u.sdet, which spends less than 2% of the total instruction count in user mode. In this case, we found that the mispredict rate of the kernel-only trace is a good indicator of the full-system mispredict rate.

3. Unaliased gshare schemes are GAs schemes of the same history depth.
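The "aliasing adds N% more mispredictions" figures above are simple ratios of the aliased and unaliased rates; a sketch:

    /* Extra mispredictions attributable to aliasing, as a percentage of
     * the mispredictions the unaliased scheme would make. */
    double aliasing_overhead_pct(double aliased, double unaliased) {
        return 100.0 * (aliased - unaliased) / unaliased;
    }
    /* e.g. aliasing_overhead_pct(7.51, 2.03) ~= 270 (the gsh.16K case),
     * and aliasing_overhead_pct(10.27, 2.65) ~= 288, "almost 300%". */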
4.2 Simulating the effect of kernel branches
To date, very few branch prediction studies have considered the effects of user/kernel interaction on prediction accuracy. Nair [13] and Perleberg and Smith [15] each attempt to model the effects of context switching on the user-only branch mispredict rates by flushing the BHT at a fixed interval of instructions. This method is inexact, because interactions with the kernel or other processes do not necessarily flush the branch history state. A short switch may have little effect on the state, and a large table may suffer less contention and thus less ill effect. Nair, Perleberg, and Smith use traces that omit system activity, and hence they are not able to verify the true effects of kernel branches on the user-only component of the mispredict rate.

To evaluate the validity of this approach, we modified our simulator to flush the BHTs and BHSRs at fixed intervals of instructions during a user-only trace. We then compared the resulting mispredict rates to the user component of the full-system trace simulation. Some flush interval will produce the same mispredict rate as the user component of the full-system trace; we call this number the effective flush interval (EFI). If flushing at fixed intervals is an accurate methodology, then we would expect the EFI to remain constant across different prediction scheme organizations and sizes. Our full-system traces include both user and kernel references, so we can evaluate the accuracy of periodic flushing as a methodology for estimating the effect of user and kernel contention on the branch prediction hardware.

It suffices to show one example where periodic flushing produces inaccurate and misleading results. Our u.video benchmark is such an example. This benchmark spends more than two-thirds of its time in the kernel, so one might think that the "pollution" of the branch prediction scheme caused by kernel branches would resemble flushing. Table 7 compares the full-system mispredict rate and the mispredict rates generated by periodic flushing for u.video. We can see two crucial problems. First, under each scheme, the EFI increases with increasing scheme size. This means that periodic flushing cannot be used to compare different scheme sizes for the same scheme, because it can overly penalize the larger sizes. The 2bc entries at a flush interval of 10,000 instructions give a concrete example of this: the periodic flushing results imply that larger 2bc tables result in only small improvements in prediction accuracy. But the user component of the full-system trace shows significant improvements with increasing scheme size. Larger scheme sizes remove aliasing between user and kernel branches in this case. Since periodic flushing provides no model for the other branches contending for the table, it cannot model the benefits from reduced aliasing.
The second important problem is that the EFI for u.video varies between prediction schemes even at a constant scheme size. This makes periodic flushing a useless methodology for comparing different branch prediction schemes. The scheme with the larger EFI will be unfairly penalized by the effects of periodic flushes. For example, using a flush interval of 10,000 instructions to compare 64K-bit implementations would lead one to believe that PAs gives the best mispredict rate, followed by GAs, 2bc, and gshare. The ordering from the full-system trace simulation is different: gshare and GAs achieve the best mispredict rates, followed by PAs and 2bc.
The previous discussion proves that we cannot trust the numerical values produced by periodic flushing. Table 7 demonstrates that we cannot even trust the overall trends implied by periodic flushing results. Using a flush interval of 10,000 instructions, periodic flushing reports that gshare predicts worse with increasing scheme size, exactly the opposite of the truth.
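For reference, the periodic-flushing methodology evaluated above can be sketched as follows. The branch-record layout and the helpers reset_predictor and predict_and_update are our assumptions, standing in for whichever predictor organization is being simulated.

    #include <stdint.h>
    #include <stddef.h>

    struct branch_rec {
        uint32_t addr;     /* branch address                 */
        int      taken;    /* actual outcome                 */
        long     icount;   /* instructions executed so far   */
    };

    extern void reset_predictor(void);   /* clear the BHT and BHSRs */
    extern int  predict_and_update(const struct branch_rec *b);

    /* Run a user-only trace, flushing all prediction state every
     * `interval` instructions; returns the mispredict rate in percent. */
    double mispredict_rate_with_flushing(const struct branch_rec *s,
                                         size_t n, long interval) {
        long next_flush = interval, wrong = 0;
        reset_predictor();
        for (size_t i = 0; i < n; i++) {
            while (s[i].icount >= next_flush) {   /* model a "switch" */
                reset_predictor();
                next_flush += interval;
            }
            if (predict_and_update(&s[i]) != s[i].taken)
                wrong++;
        }
        return n ? 100.0 * wrong / n : 0.0;
    }

The effective flush interval is then the interval value for which this rate matches the user component of the full-system simulation, located for example by sweeping the interval over a logarithmic range.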
5 Conclusions
Using full-system (i.e. combined user/kernel) traces gives realistic results that lead to different conclusions about the effectiveness of existing dynamic branch prediction schemes than do the results from user-only traces. We find that including kernel references often increases aliasing, and this effect may cause schemes with short branch histories to achieve better prediction accuracies than those with deep branch histories. While SPEC92 is user-dominated (so prior work in branch prediction retains value), system designers and customers probably want to match their test workloads to a wider range of user/kernel mixes. Simulations that ignore kernel activity risk dangerous inaccuracy: elaborate two-level schemes that appear good under user-only traces may turn out to be less attractive when the whole system is considered. These problems appear even worse for small scheme sizes. As a rule of thumb, if both the kernel and the user account for more than 5% of the instruction mix, then combined user and kernel traces should be used.

[Table 7: Comparison of the user component of the mispredict rate from a full-system trace with the mispredict rates derived from periodic flushing intervals. The benchmark in these simulations is u.video. The effective flush interval is the periodic flush interval that achieves the same mispredict rate as the full-system trace simulation.]
Flushing at fixed intervals poorly models the effect of kernel branches on dynamic branch prediction schemes. It is misleading to use periodic flushing to compare different schemes with the same amount of hardware or to compare the same scheme with varying amounts of hardware. More specifically, periodic flushing fails to capture differences in the organization and size of schemes. It assumes the same amount of contention exists in a 4K-bit scheme as in a 64K-bit scheme, and it assumes the same amount of contention in a 2bc scheme as in a gshare scheme of the same hardware size. These underlying fallacies in the periodic flushing model will persist and yield inaccurate results when periodic flushing is used to model multitasking workloads.

[Table 9: User-only and full-system mispredict rates for the Ultrix IBS benchmarks. For each benchmark, we highlight the lowest overall mispredict rate for each set of scheme simulations.]

[Table 10: User-only and full-system mispredict rates for the Digital Unix benchmarks. For each benchmark, we highlight the lowest overall mispredict rate for each set of scheme simulations.]

[Table 11: Full-system mispredict rates where all aliasing (both BHSR and BHT address aliasing) has been removed.]