Accurate static branch prediction is the key to many techniques for exposing, enhancing, and exploiting Instruction Level Parallelism (ILP). The initial work on static correlated branch prediction (SCBP) demonstrated improvements in branch prediction accuracy, but did not address overall performance. In particular, SCBP expands the size of executable programs, which negatively affects the performance of the instruction memory hierarchy. Using the profile information available under SCBP, we can minimize these negative performance effects through the application of code layout and branch alignment techniques. We evaluate the performance effect of SCBP and these profile-driven optimizations on instruction cache misses, branch mispredictions, and branch misfetches for a number of recent processor implementations. We find that SCBP improves performance over (traditional) per-branch static profile prediction. We also find that SCBP improves the performance benefits gained from branch alignment. As expected, SCBP gives larger benefits on machine organizations with high mispredict/misfetch penalties and low cache miss penalties. Finally, we find that the application of profile-driven code layout and branch alignment techniques (without SCBP) can improve the performance of the dynamic correlated branch prediction techniques.
Introduction
Recent work in branch prediction [3, 18, 25, 30, 31, 32, 33] has led to the development of both hardware and software schemes that achieve high prediction accuracy by exploiting branch correlation. The motivation for this work stems from the fact that the performance of superscalar and deeply pipelined processors can benefit significantly from a small improvement (a couple of percentage points) in prediction accuracy. As with any technique, however, there is a point of diminishing returns where the incremental costs of the technique begin to outweigh the further improvements. In static correlated branch prediction (SCBP) techniques [33], the cost of better prediction accuracy is code expansion, and thus the point of diminishing returns is defined primarily by the relationship between the changes in the average access time of the instruction memory subsystem and the cycle count changes enabled by the improvements in prediction accuracy. In this paper, we explore this relationship to determine how parameters such as the pipeline structure and the cache organization of a processor affect the viability of SCBP techniques. Using this framework, we also examine the performance benefits of two profile-driven optimizations on gshare [18], a dynamic branch prediction scheme that efficiently exploits branch correlation.
SCBP improves prediction accuracy for a particular branch by creating multiple copies of that branch, effectively encoding information on the outcome of previous branches in the program counter. It is clear that this will affect the instruction cache behavior of the resulting program. First, by enlarging the cache footprint of the program, SCBP increases the number of compulsory and capacity misses. Second, by increasing the code size of individual procedures and causing these procedures to shift relative to each other in memory, it can significantly change the number of conflict misses.
1072-4451/95 $4.00 © 1995 IEEE. Proceedings of MICRO-28.
Though Young and Smith [33] present code expansion numbers in their initial paper on SCBP, they do not quantify the magnitude of the performance effects of this code expansion on the memory subsystem, and hence it is difficult to say when further improvements in prediction accuracy are outweighed by the memory system penalties due to greater code expansion. In this paper, we consider the first-order effects on program performance: total cycles due to branch mispredictions, branch misfetches, primary cache misses, and additional dynamic instructions due to code layout considerations. As defined by Calder and Grunwald [2], a misfetch penalty refers to any penalty associated with a correctly predicted branch. Since SCBP is a software scheme, the cycle time of the processor is not affected and hence that component of the performance equation is unchanged. The true performance benefit of SCBP is dependent upon the effectiveness of compile-time optimizations, such as global instruction scheduling [15, 17] and code optimization [4], since these optimizations attempt to reduce the total number of cycles required to execute a program. Since the design and evaluation of sophisticated compile-time optimizations is an endless task, we do not attempt to give a definitive answer to the question of how much SCBP can ultimately improve application performance.
Even in this limited study, comparing the cache behavior of different versions of the same program can be problematic unless measures are taken to minimize unnecessary conflict misses. We have observed that small changes in the relative locations of different parts of a program can cause very significant changes in the number of cache misses, and these fluctuations can completely obscure the relation between code expansion and changes in the miss rate. Fortunately, the profiling information that is necessary to implement SCBP is sufficient for the implementation of code layout algorithms that keep instruction fetch penalties to a minimum. In particular, we have implemented three previously published techniques [24] that use profiling information to optimize the instruction cache behavior and minimize the instruction misfetch penalties of a program. We apply these techniques together with the code transformations that implement SCBP. For the vast majority of our benchmarks on all of our machine microarchitectures, SCBP gives better performance than per-branch profiled static branch prediction, and a large component of the overall benefits comes from the code layout optimizations. In fact, we find that, in addition to our SCBP scheme, the profile-driven code layout optimizations also help to improve the performance of a dynamic branch prediction scheme. Section 2 reviews the previous work in code expanding optimizations, and it relates this work to the domain of branch prediction. Section 3 introduces our experimental methodology, and it briefly describes the code layout optimizations used in this study. Section 4 presents the results of our simulations. We conclude with a summary of our findings in Section 5.
Previous work
In the last five years, interest in branch prediction has been re-ignited by schemes that exploit branch correlation. Before 1990, the best branch prediction schemes used the recent history of a branch to predict the future direction of that branch. The most effective dynamic schemes used a table of 2-bit, saturating, up/down counters [26] (often referred to as the branch history table (BHT)), while the most effective static schemes relied on profiles from previous runs of the program [8, 16, 20] to determine a fixed prediction per branch. In 1991, Yeh and Patt [30] introduced two-level adaptive schemes which record the direction of the recently executed branches and use this information (in addition to the branch address) to index into a BHT. This hardware organization allows the prediction scheme to exploit patterns of related branches, increasing the overall prediction accuracy. Pan, So, and Rahmeh [25] appear to have been the first to use the term "correlation." In their two-level adaptive scheme, they use a single hardware shift register of length k to record the directions of the last k branches (Yeh and Patt [31] refer to this scheme as GAS and to k as the history depth). The contents of the shift register are concatenated with some bits from the branch address to select one of the 2-bit counters in the BHT. Under the constraint of a fixed size BHT, McFarling [18] was able to achieve prediction accuracies better than those from GAS by exclusive-ORing (rather than concatenating) bits from the branch address with the bits of the branch history shift register. McFarling refers to this new scheme as gshare. Figure 1 illustrates the essential features of GAS and gshare. The results of these hardware studies are appealing because better branch prediction rates translate directly into fewer cycles wasted due to branch mispredictions.
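The difference between concatenating and exclusive-ORing the history with the branch address can be made concrete with a small sketch. The class below is an illustration only, not the simulator used in this study: the class name, the `mode` parameter, the choice of a 4-byte instruction width, and the initial counter value are all assumptions, and the "concat" mode is only a GAS-style approximation.

```python
class TwoLevelPredictor:
    """Global-history predictor over a table of 2-bit saturating counters.

    Hypothetical sketch: mode="concat" approximates a GAS-style index,
    mode="gshare" XORs the history into the address bits.
    """

    def __init__(self, index_bits, history_bits, mode="gshare"):
        assert mode in ("gshare", "concat")
        assert 0 <= history_bits <= index_bits
        self.index_bits = index_bits
        self.history_bits = history_bits
        self.mode = mode
        self.history = 0                          # global shift register
        self.counters = [1] * (1 << index_bits)   # assumed start: weakly not-taken

    def _index(self, pc):
        word = pc >> 2                            # assumes 4-byte instructions
        if self.mode == "gshare":
            return (word ^ self.history) & ((1 << self.index_bits) - 1)
        # concat: address bits in the upper field, history in the lower field
        addr_bits = self.index_bits - self.history_bits
        addr = word & ((1 << addr_bits) - 1)
        return (addr << self.history_bits) | self.history

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_mispredictions(trace, predictor):
    """trace is a list of (pc, taken) pairs; returns the number of wrong predictions."""
    wrong = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


# Two branches: A alternates taken/not-taken, and B always follows A's outcome.
trace = []
for i in range(100):
    outcome = (i % 2 == 0)
    trace.append((0x100, outcome))   # branch A
    trace.append((0x104, outcome))   # branch B, correlated with A

no_history = count_mispredictions(trace, TwoLevelPredictor(4, 0))
with_history = count_mispredictions(trace, TwoLevelPredictor(4, 2))
```

On this artificial trace, a predictor with no history thrashes its counters, while two bits of global history are enough to capture the pattern in either indexing mode.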
However, compile-time optimizations that benefit from improvements in prediction accuracy, such as global instruction scheduling, cannot take advantage of these sophisticated dynamic branch prediction schemes. Inspired by the GAS scheme, Young and Smith [33] developed a static correlated branch prediction (SCBP) scheme that exploits the correlation found in a branch profile to improve overall branch accuracy using only compiler-specified branch prediction bits. SCBP works by encoding branch history into the program counter. As shown in Fig ... Young and Smith [33] showed that SCBP does improve overall prediction accuracy over that achievable with simple profiling with reasonable (30-110%) code expansion. They also showed that, by increasing the history depth k and thus allowing for greater code expansion, one can achieve even better prediction accuracies. What was beyond the scope of that initial paper was the effect of prediction accuracy and code expansion on performance.
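The prediction-accuracy side of SCBP can be illustrated with a profile-based sketch: for each branch and each pattern of the k preceding branch outcomes, pick the direction seen most often in the training profile. This is a simplified model under stated assumptions, not the algorithm of Young and Smith [33]. It ignores the code duplication that actually encodes the history in the program counter, and the function names, the tie-breaking rule, and the default prediction for unseen histories are hypothetical.

```python
from collections import Counter, defaultdict


def train_static(trace, k):
    """Build {(branch, history): predicted_direction} from a training trace.

    trace is a list of (branch_id, taken) pairs; history is the tuple of the
    k most recent outcomes. k=0 reduces to per-branch profile prediction.
    """
    counts = defaultdict(Counter)
    history = ()
    for branch, taken in trace:
        counts[(branch, history)][taken] += 1
        history = (history + (taken,))[-k:] if k else ()
    # Assumed tie-break: predict taken when the profile is evenly split.
    return {key: c[True] >= c[False] for key, c in counts.items()}


def static_mispredictions(trace, table, k, default=True):
    """Count wrong predictions on a (possibly different) testing trace."""
    history = ()
    wrong = 0
    for branch, taken in trace:
        if table.get((branch, history), default) != taken:
            wrong += 1
        history = (history + (taken,))[-k:] if k else ()
    return wrong


# Branch "b" always repeats the outcome of the preceding branch "a".
pattern = [True, True, False, True, False, False, True, False] * 10
profile = []
for outcome in pattern:
    profile.append(("a", outcome))
    profile.append(("b", outcome))

uncorrelated = static_mispredictions(profile, train_static(profile, 0), 0)
correlated = static_mispredictions(profile, train_static(profile, 1), 1)
```

With k=0 neither branch is predictable from its own profile, whereas with k=1 branch "b" is predicted perfectly from the outcome of "a".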
[Figure 1: The essential features of GAS and gshare.]
Code expansion during compile-time optimizations is not a new problem. Many compile-time optimizations aimed at exploiting instruction-level parallelism also increase the size of the program text. Loop optimizations, including loop peeling and loop unrolling [23], and software pipelining [5] produce reordered code that is larger than the original. Aggressive function inlining [13] increases the overall code size, even though some savings in code space are realized through the removal of the procedure call overhead and through the enabling of further intra-procedural optimizations. Speculative execution and global instruction scheduling [1, 15, 17, 21, 27] move instructions across basic block boundaries, and this code motion may result in code duplication that expands the size of the program executable. All of these methods increase both the static size and the dynamic memory footprint of the optimized program, placing greater demands on the instruction memory system. Surprisingly, very few of these studies examine the interaction between the code-expanding optimizations and the instruction memory system. Often, studies of these techniques simply assume a perfect instruction memory system and examine only the change in CPU cycle count due to the compile-time optimization (and possibly data cache effects).
A few studies have considered the impact of object code size on instruction memory performance. The earliest of these studies, e.g. Steenkiste [29] and Davidson and Vaughan [7], investigated the relationship between instruction cache performance and code density due to instruction encoding. A later study by Chen et al. [6] fixed the instruction set and examined the impact of code expanding optimizations on the design of instruction caches. They found that several code expanding optimizations noticeably increased the miss ratio of 8 kilobyte and 16 kilobyte caches, and this change resulted in an effective loss of performance after program transformation.
To improve the performance of instruction caches, a small number of papers [14, 19, 24] subsequently examined how programs use the instruction memory system and proposed methods to improve its overall performance. Each of the proposed methods uses profiles of previous program runs either to exclude certain portions of the instruction stream from the instruction cache [19] or to reorganize the code layout to avoid conflict misses and improve the spatial and temporal locality of the cache [14, 24]. Since it is often difficult to selectively exclude code from today's instruction caches, we concentrate on the code layout techniques, and in particular on the approach described by Pettis and Hansen [24].
Pettis and Hansen's approach is based on finding an ordering of the procedures of a program such that groups of procedures with frequent calls between them are placed at nearby addresses. They introduce the term "fluff" to refer to code that is not reached during the profiling run or runs. Such code is viewed as error-handling code (or code that handles very rare cases), and they recommend that this fluff be moved to the end of the program in order to compact the part of the program that is actually executed. By compacting the executed part of the program, their approach improves spatial locality and reduces the potential for conflict misses. Pettis and Hansen also describe a method for setting the branch conditions for the taken and fall-through paths of branches such that each branch falls through more frequently than it is taken. If correctly-predicted taken branches still result in a misfetch penalty (as in the DEC Alpha 21164 [10]), this branch alignment step results in fewer cycles lost due to misfetch penalties and an increase in the average length of straight-line executed code, which improves spatial locality. To offset the cache effects of code expansion in SCBP, we have implemented each of these code layout techniques in our experimental system.
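The idea of placing procedures with frequent calls between them at nearby addresses can be sketched as a greedy merge over a weighted call graph. The following is a deliberately simplified illustration, not the published algorithm of Pettis and Hansen [24]: it always appends the callee's chain to the caller's chain and does not choose the best orientation when joining chains. The function name and the example call counts are hypothetical.

```python
def order_procedures(call_counts):
    """Greedy procedure ordering from profiled call counts.

    call_counts maps (caller, callee) to the number of calls seen in the
    profile. Returns a list of procedure names in layout order.
    """
    chain_of = {}
    for caller, callee in call_counts:
        for proc in (caller, callee):
            chain_of.setdefault(proc, [proc])

    # Visit edges from heaviest to lightest and merge the two chains involved.
    for (caller, callee), _ in sorted(call_counts.items(), key=lambda kv: -kv[1]):
        a, b = chain_of[caller], chain_of[callee]
        if a is b:
            continue                  # already in the same chain
        a.extend(b)                   # simplification: no orientation choice
        for proc in b:
            chain_of[proc] = a

    layout, seen = [], set()
    for chain in chain_of.values():
        if id(chain) not in seen:
            seen.add(id(chain))
            layout.extend(chain)
    return layout


# Hypothetical profile: two hot caller/callee pairs and two cold edges.
calls = {
    ("main", "parse"): 10,
    ("parse", "lex"): 900,
    ("main", "report"): 5,
    ("report", "fmt"): 400,
}
layout = order_procedures(calls)
```

In this example the hot pairs (parse, lex) and (report, fmt) end up adjacent in the layout; procedures never reached in the profile would simply not appear in `call_counts` and could be appended afterwards as fluff.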
Both SCBP and Pettis and Hansen's layout technique rely on good training data. Fisher and Freudenberger [8] found that different data sets are reasonable predictors of other data sets of a program. Needless to say, bad training sets which exercise only small subsets of program features make for bad results.
We briefly describe the structure of the system that performs our SCBP and code layout transformations, and we discuss the way in which we obtain our measurements. We also outline the influence of the pipeline structure and the instruction cache organization on the performance tradeoffs in branch prediction.
Experimental system
We use one of two production-quality compilers to generate object files for the DEC Alpha architecture. We then use the ATOM instrumentation tool [28] to generate traces of basic blocks and branch conditions. These traces are needed for both the SCBP and code layout algorithms. Next, we build a procedure call graph and control flow graphs for each procedure using the information in the object files. This step poses some problems for us since it is not always easy to determine the targets of dynamic jumps (i.e. jumps where the target address is computed at runtime), which arise from procedure calls. As a result, the procedure call graph used for the procedure ordering algorithm may not always be complete, and the code layout based on this information may not be optimal in all cases. Finally, the procedure call graph, the control flow graphs, and the profile information are fed to our SCBP and code layout algorithms. After applying SCBP, we use the layout techniques described by Pettis and Hansen [24]. In Section 4, we lump procedure positioning, procedure ordering, procedure placement, and procedure splitting (fluff removal) under the term "code layout", and we refer to basic block placement with the term "branch alignment" (after Calder and Grunwald [2], who solved a similar problem).
Table 1: Benchmark and data set descriptions.
The results in this paper were derived from trace-driven simulations. We collected the traces using ATOM v1.1 [28]. We compiled the SPECint92 benchmarks using cc version 2.0.0 and the optimization level specified in the SPEC makefiles. The additional benchmarks were compiled using gcc v2.6.0 (-O3). All of the experiments were performed on a DEC 3000/400 running OSF/1 version 2.0.
The output of the various code transformation algorithms is a set of basic blocks with addresses, static prediction information, and CFG information linking these blocks together. This information allows a trace-driven simulator to generate the statistics on branch and instruction cache behavior that we present in Section 4. We use a simulator instead of running the transformed code on an actual machine for two reasons. From a pragmatic point of view, there are currently very few commercially available systems that have a processor with static prediction bits. The PowerPC architecture [22] is one of the few incorporating this general functionality, and its most recent processors implement a dynamic branch prediction scheme that takes precedence over the static prediction bits (PowerPC designers believe that dynamic branch prediction schemes perform better than static ones). From an experimental point of view, we want to have the freedom to evaluate performance under several different machine organizations where we vary only the cache and branch penalties.
Measuring the influence on performance
To measure the impact of SCBP (and code layout) on performance, we present a metric quantifying the average number of cycles saved per 1000 instructions executed. Unless stated otherwise, the baseline for these numbers is the identical machine microarchitecture under test with profiled branch prediction and no code layout (and hence no code expansion since SCBP was not performed). All of our profile-driven experiments train and test on different inputs (footnote 2). Our performance metric is computed as a weighted sum of the number of mispredicted branches, the number of misfetched branches, and the number of first level (L1) cache misses. The weights in this equation are the branch misprediction and branch misfetch penalties, which are related to the pipeline organization, and the L1 cache miss penalty, which is assumed to be the average amount of time that it takes to fetch the missing block from the rest of the memory system. The larger the ratio of the branch mispredict penalty to the cache miss penalty, the more code expansion the system can tolerate for an improvement in prediction accuracy. For hardware schemes, a larger ratio would shift the optimal balance (all other things being equal) of prediction table size versus cache size in favor of larger prediction tables.
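The weighted sum described above can be written out directly. The sketch below is one reading of the metric, with hypothetical type and function names; it normalizes by the baseline instruction count, which is an assumption. The penalties in the example are those quoted in the text for the 21164-like model (5-cycle mispredict, 1-cycle misfetch, 6-cycle L1 miss), while the event counts are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Penalties:
    mispredict: float   # cycles per mispredicted branch
    misfetch: float     # cycles per misfetched branch
    l1_miss: float      # cycles per L1 instruction cache miss


@dataclass
class EventCounts:
    instructions: int
    mispredicts: int
    misfetches: int
    l1_misses: int


def stall_cycles(counts, penalties):
    """Weighted sum of the three event counts."""
    return (counts.mispredicts * penalties.mispredict
            + counts.misfetches * penalties.misfetch
            + counts.l1_misses * penalties.l1_miss)


def cycles_saved_per_1000(baseline, variant, penalties):
    """Cycles saved by `variant` relative to `baseline`, per 1000 instructions.

    Assumption: both runs are normalized by the baseline instruction count.
    """
    saved = stall_cycles(baseline, penalties) - stall_cycles(variant, penalties)
    return 1000.0 * saved / baseline.instructions


alpha_like = Penalties(mispredict=5, misfetch=1, l1_miss=6)
# Invented counts: the variant trades extra cache misses for fewer branch stalls.
base = EventCounts(instructions=1_000_000, mispredicts=20_000,
                   misfetches=50_000, l1_misses=15_000)
scbp = EventCounts(instructions=1_000_000, mispredicts=17_000,
                   misfetches=30_000, l1_misses=16_000)
saved = cycles_saved_per_1000(base, scbp, alpha_like)   # 29.0 for these counts
```

A negative result would mean that the extra cache misses cost more cycles than the avoided mispredictions and misfetches recover.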
The calculation of our metric assumes that the processor stalls during an instruction cache miss (i.e. the processor does not overlap branch stalls with instruction cache stalls). Even though our metric does not represent overall performance, it is a much better metric than code expansion or even change in instruction cache miss rate. Furthermore, our metric is independent of the rest of the processor organization. It does not matter if the processor issues one instruction per cycle or four instructions per cycle, though obviously, a four-issue machine will benefit more from improvements in prediction accuracy since the cycles saved will be a larger percentage of the total cycles it takes to execute 1000 instructions on a four-issue machine than on a single-issue machine.
For all of the experiments in Section 4, we simulate either an 8 kilobyte or 16 kilobyte direct-mapped instruction cache, each with a 32-byte line size. We chose these design points because the vast majority of high-speed microprocessors include a direct-mapped L1 instruction cache of one of these two sizes. Our results do improve with increasing line size, as expected from the results of the previous papers on code layout, and thus we do not include these simulations in this paper. Since conflict misses occur more often in a direct-mapped cache than in a set-associative cache, our results are conservative for an organization with a set-associative L1 instruction cache.
2. For the results in Section 4, we report the result obtained by running on one data set (the testing data set), after having trained on the other (the training data set). The data set listed in the label on the result is the testing data set, e.g. "eq.fx" indicates that the "fx" data set was the testing data set and that "tb" was the training data set for this experiment.
Results
To illustrate the combined effects of cache behavior and branch prediction on processor performance, we will present results for three different machine organizations that closely correspond to several recently announced commercial systems. Before we present these performance results however, Sections 4.1 and 4.2 report the effect of SCBP on the code size, the instruction cache miss rate, and the branch misprediction rate. These results provide the background information necessary to understand the performance results presented in Section 4.3.
Code expansion and cache miss rates
We have measured the code expansion both in terms of the increase of the total program size (Figure 3) and in terms of the increase in the size of the code that is executed during the profiling run (Figure 4). We see that, as the history depth k increases, code expansion increases. Since SCBP does not expand those parts of the code that were not executed (i.e. those parts without profile information), the relative increase in the size of the code that is actually fetched into the instruction cache is often much greater than the overall code expansion ratio, especially at large values of k. It is this code expansion effect that actually impacts performance.
There are some anomalies in the cache miss rate data that can be explained by conflict misses. Even though we try to remove hot spots from the cache, they still occur occasionally, especially since the behavior of the profiling inputs is different from that of the testing inputs. Overall, the cache miss rate drops after code layout, but then generally increases as k increases. The increase, however, is not as dramatic as the code expansion numbers in Figure 4. This effect is due in large part to the code layout routines. Figure 5 demonstrates the significant benefits of code layout via profiling information; often the cache miss rate after code expansion with a large value of k is less than the cache miss rate of the original program with procedures in source code order (the first data point in each series of Figure 5). Finally, Table 2 shows the size of the cache footprint (the number of compulsory misses times the line size) for no layout optimizations and for the endpoint values of k with code layout and branch alignment. This table (in combination with Figure 5) shows that, even when the cache footprint is several times larger than our primary instruction cache, code layout is a more important determinant of the cache miss rate (and thus performance) than is the executable size.
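The sensitivity of a direct-mapped cache to code placement, and the footprint definition used for Table 2, can be illustrated with a minimal trace-driven sketch. This is not the simulator used in the study; the function name and the two-address trace are hypothetical, and the default parameters follow the 8 kilobyte, 32-byte-line configuration described earlier.

```python
def simulate_direct_mapped(addresses, cache_bytes=8 * 1024, line_bytes=32):
    """Return (misses, compulsory_misses, footprint_bytes) for a fetch trace.

    A compulsory miss is the first reference to a cache block; the footprint
    is the number of compulsory misses times the line size.
    """
    num_lines = cache_bytes // line_bytes
    resident = [None] * num_lines      # block currently held by each cache line
    touched = set()                    # blocks referenced at least once
    misses = compulsory = 0
    for addr in addresses:
        block = addr // line_bytes
        index = block % num_lines
        if resident[index] != block:
            misses += 1
            if block not in touched:
                compulsory += 1
            resident[index] = block
        touched.add(block)
    return misses, compulsory, compulsory * line_bytes


# Two hot code blocks exactly one cache size apart map to the same line...
conflicting = [0x0000, 0x2000] * 100
# ...while shifting the second block by one line removes the conflict.
shifted = [0x0000, 0x2020] * 100

bad = simulate_direct_mapped(conflicting)    # every access misses
good = simulate_direct_mapped(shifted)       # only the two compulsory misses
```

Both traces have the same 64-byte footprint, yet their miss counts differ by two orders of magnitude, which is the kind of layout-induced fluctuation that can obscure the effect of code expansion.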
Branch prediction accuracy
Figure 6 shows the misprediction rates for two sets of branch prediction schemes: our SCBP scheme, ranging from uncorrelated (k=0) to various degrees of correlation (k=2 through k=14), and the dynamic gshare scheme, with tables ranging in size from 256 bytes (k=10) to 8 kilobytes (k=15). We chose these prediction table sizes to cover the spectrum of design trade-offs. At 256 bytes, a hardware prediction table is an insignificant hardware cost when compared to the cost of a typical L1 instruction cache. On the other hand, a hardware branch prediction table of 8 kilobytes is the point where an area tradeoff between the branch prediction table and the L1 cache reportedly becomes relevant [12]. Figure 6 illustrates that the prediction accuracy achieved by SCBP is generally not as good as that of gshare. In fact, there are some cases, such as awk, eqntott, and xlisp, where the prediction accuracy is much worse under SCBP. Young et al. [34] discuss a range of reasons why SCBP and gshare achieve different prediction accuracies, and hence, we do not repeat that discussion here. The misprediction rates under SCBP do not monotonically decrease as k increases because we train and test on different data sets. In fact, the use of more specific information from the training data set (i.e. larger k values) can sometimes result in increasingly worse prediction accuracies (e.g. eq.tb). For this paper, the important aspect of Figure 6 is that the component of the performance metric due to prediction accuracy will typically be greater under gshare than under SCBP since the prediction accuracy of gshare is typically better than SCBP.
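The gshare table sizes quoted for Figure 6 are consistent with a table of 2^k two-bit counters indexed by k bits. The short check below makes that arithmetic explicit; the function name is hypothetical, and the two-bit counter width is taken from the BHT description in Section 2 rather than stated alongside the sizes.

```python
def counter_table_bytes(index_bits, bits_per_counter=2):
    """Storage for a table of 2**index_bits saturating counters, in bytes."""
    return (1 << index_bits) * bits_per_counter // 8


small = counter_table_bytes(10)   # 256 bytes
large = counter_table_bytes(15)   # 8192 bytes, i.e. 8 kilobytes
```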

Evaluating performance
Section 3.2 describes a performance metric, cycles saved per 1000 instructions executed, that focuses on the performance effects of branch prediction, code layout, and code expansion. Through the experiments in the previous subsection, our system can generate the total number of mispredictions saved over a static profiled scheme without branch correlation, the total number of misfetches saved over an executable without branch alignment, and the increase in the number of cache misses over an executable without code expansion and code layout.
To evaluate the effects of these changes, we need to choose values for the branch mispredict penalty, the branch misfetch penalty, and the L1 cache miss penalty. Table 3 presents the values that we choose for our simulations. We chose these machine models because their high misprediction penalties demand aggressive branch prediction schemes. Given our limited compile-time use of the branch prediction information, we hypothesized that processors with low branch mispredict penalties will not benefit from the few percentage point improvement in prediction accuracy generated by SCBP. Preliminary studies based on a MIPS R2000-like machine, which has less than one cycle of branch misprediction penalty (depending on how the branch delay slot is filled), verified this hypothesis, and so we concentrated our efforts on the next generation of machine models.
[Table 3: Branch mispredict, branch misfetch, and L1 cache miss penalties for the three machine models: DEC Alpha 21164-like, Intel P6-like, and HP PA-8000-like.]
The 21164 has a five cycle mispredict penalty and a one cycle misfetch penalty (the penalty for correctly predicted taken branches). It incorporates a small 8KB L1 instruction cache and a 96KB on-chip L2 cache with an L1 miss penalty of 6 cycles. The P6 has a branch misprediction penalty of at least 11 cycles and 256KB of on-module, requested-word-first, L2 cache, which reportedly results in an L1 miss penalty of 3 cycles (footnote 3). Other recently announced processors like the HP PA-8000 also benefit from an SCBP scheme since their very large (greater than 256KB) L1 cache is only slightly influenced by the code expansions listed above. Since the cache footprints of almost all of our benchmarks fit completely into a 256KB cache, the few additional cache misses that occur are almost exclusively compulsory misses due to the differences between the last two columns (labeled k=0 and k=14) in Table 2.
3. Our experiments assume that the processor stalls for the entire cache miss penalty. Chen et al. [6] show that the use of requested-word-first miss handling and sequential prefetching can overcome a significant portion of the negative cache effects of code-expanding optimizations. In our P6-like simulations, we assume that the requested-word-first technique hides all but the 3 cycles required to access the L2 cache.
Table 4 (at the end) shows all of the detail for our performance calculation using the DEC Alpha 21164-like model in Table 3; Table 5 shows the same for the P6-like machine model. The detailed tables for the PA-8000-like model are similar and therefore omitted from this paper (they are available in the associated technical report [9]). The baseline simulation in Table 4 is a 21164-like machine model with profiled branch prediction (no branch correlation) and no code layout or branch alignment. The numbers in each row correspond to the change in cycles per 1000 instructions due to the component in the row label. Rows labeled "Cache" show the cycles saved (or lost if negative) due to fewer (or more) cache misses. The code layout (procedure ordering and fluff code removal) algorithms lead to cycles saved, while code expansion due to the SCBP algorithm potentially leads to cycles lost. Overall, the cycles saved due to fewer cache misses typically begin positive at k=0 since there is no code expansion and code layout improves the performance of the instruction cache. As k increases though, the "Cache" numbers decrease and often become negative at high values of k. This trend corresponds to the increasing cost of SCBP's code expansion.
Rows labeled "Predict" show the cycles saved due to fewer branch mispredictions. As expected, the number of cycles saved typically increases as we increase k. In benchmarks that exhibit only weak branch correlation (e.g. diff), there is very little benefit from SCBP. Furthermore, the benefit of SCBP may fluctuate as k increases due to the fact that we train and test on different data sets (the "Predict" row always improves with increasing k when training and testing on the same data set).
Rows labeled "Align" show the benefit due to rearranging the code to make taken branches less frequent. This row contains non-zero numbers only when a machine model, such as the 21164-like model of Table 3, has a non-zero misfetch penalty. In Table 4, branch alignment contributes a large improvement in almost all cases (even though the misfetch penalty is only a single cycle), and the improvement appears to be weakly correlated with k. This suggests that the improved predictions under SCBP also improve the effect of branch alignment.
The "Total" row shows the sum of the previous three rows. Overall, the combination of SCBP, layout, and branch alignment often improves performance of the DEC Alpha 21164-like machine model of Table 3. The maximum value in the "Total" row often occurs at a k greater than 0, i.e. performance benefits from SCBP. For an 8 kilobyte instruction cache, just the six experiments aw.c2, es.z5, gc.*, and sc.* exhibit their best performance tradeoff at k=0. The large cache footprints and the large code expansion values for these programs are the main reasons why performance does not improve under SCBP even though the mispredict rates improve with increasing k. By enlarging the instruction cache size to 16 kilobytes, the maximum performance benefit occurs at k=2 for aw.c2 and occurs at k=4 for es.z5. Unfortunately, gcc1 and sc do not benefit from SCBP even at this larger cache size.
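Reading a table of this form amounts to summing the three component rows for each history depth and locating the maximum. The sketch below illustrates that bookkeeping only; the function name, the tie-breaking rule (prefer the smaller k, and hence less code expansion), and all of the numbers are invented for illustration and are not taken from Table 4.

```python
def best_history_depth(rows):
    """Pick the history depth with the largest total cycles saved.

    rows maps k to a (cache, predict, align) tuple of cycles saved per 1000
    instructions. Returns (k, total). Ties go to the smaller k (an assumption).
    """
    totals = {k: sum(parts) for k, parts in rows.items()}
    k_best = max(totals, key=lambda k: (totals[k], -k))
    return k_best, totals[k_best]


# Invented numbers with the typical shape: "Cache" falls as k grows,
# while "Predict" and "Align" rise.
example = {
    0: (4.0, 0.0, 6.0),
    2: (3.5, 1.5, 6.5),
    4: (2.0, 2.5, 7.0),
    8: (-3.0, 3.0, 7.0),
}
k_best, total = best_history_depth(example)   # k=2 with a total of 11.5
```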
Figures 7 through 9 plot three rows from Table 4 that are representative of the types of behavior exhibited by our benchmarks. The three columns per k value correspond to the values in each "Cache", "Predict", and "Align" row. The line shows the total of the three components. In general, we see a bell-shaped curve that attains a maximum at some particular value of k. In Figure 7, which plots the values for li.n, the best performance occurs at k=6. In other benchmarks, like es.ml (Figure 8), the total line goes negative at high values of k. At these high k values, the penalty due to code expansion greatly outweighs the benefits of the other components. As mentioned above, a few experiments, like sc.l1 (Figure 9), perform best at k=0.
For the Intel P6-like machine model, we found that the maximum value in the "Total" row occurred at k greater than zero for all benchmark runs (Table 5). This result is not surprising given that the misprediction penalty has increased while the cache miss rates and miss penalties have gotten smaller. Figure 10 re-plots the sc.l1 results under the P6-like machine model. The total line now peaks at k=2. Also, Figure 10 shows that the magnitude of benefits due to improved branch prediction is comparable to the benefits from code layout in the P6-like model. This trend is a consequence of the bigger ratio of the P6 branch misprediction penalty to cache miss penalty.
In the PA-8000-like simulation, we found that the maximum value in the "Total" row occurred at k greater than zero for all experiments except diff*, where the performance is fairly uniform for all values of k. Because of the large misfetch penalty and the fact that our executables can all fit into the L1 cache after code layout, we found that the majority of the benefit in the PA-8000-like simulations comes from the contributions of the code layout techniques and avoided misfetches, rather than from the avoided mispredictions.
Profiling for performance
It appears that the combination of code layout, SCBP, and branch alignment gives performance benefits at a number of different points in the pipeline and cache design space. Since dynamic correlated branch prediction schemes often achieve better prediction accuracies than SCBP, it is interesting to investigate the performance of a scheme where we replace SCBP by a dynamic correlated branch prediction scheme, such as gshare, and yet retain the benefits of code layout and branch alignment. Figure 11 presents the results of this study for each of our three machine models.
For the 21164-like model, Figure 11 shows that gshare performs significantly better when the executable is first processed by the code layout and branch alignment routines. When gshare runs without these profile-driven optimizations, the best SCBP scheme (from Table 4) always outperforms it. Note that this performance comparison does not penalize gshare for the cost of the 8 kilobyte BHT.
For the P6-like model, Figure 11 shows some, but not much, benefit for gshare when the executable is first processed for code layout and branch alignment. This is a result of the small miss rate (due to a large L1 instruction cache) in the P6-like model. Because code layout is relatively unimportant in the P6-like model, the slightly better branch prediction accuracies under gshare result in noticeably better performance figures than under SCBP.
For the PA-8000-like model, Figure 11 shows that gshare still benefits from code layout, but not quite as much as in the 21164-like model. This is a result of the smaller miss rate in the PA-8000-like model. In all benchmarks except awk, the best SCBP scheme outperforms gshare without code-layout optimizations.
The key to the effective use of SCBP is the ability to select the proper value of k. We believe that it is possible to build a compile-time algorithm that is able to select a value of k that is close to the best value of k for each particular application, and thus find a balance between code expansion and prediction accuracy that maximizes performance. In this study, we go beyond prediction accuracy to evaluate the performance of SCBP and to quantify the negative effects of code expansion under SCBP. We find that SCBP can improve application performance, especially when coupled with profile-driven code layout and branch alignment techniques. These layout techniques control and minimize the effects of code expansion on the performance of an instruction cache. In fact, we find a synergistic relationship between SCBP and branch alignment in that SCBP also increases the performance improvements resulting from branch alignment. As expected, SCBP achieves the biggest performance gains on machine organizations with high mispredict/misfetch penalties and low cache miss rates/penalties.
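To make the tradeoff behind the choice of k concrete, the following Python sketch weights per-k event counts by machine penalties and picks the k with the fewest penalty cycles. It is not the compile-time algorithm envisioned above, which this study does not specify. It assumes that mispredict, misfetch, and instruction cache miss counts are already available for each candidate k (for example, from profile-driven simulation), and all names, counts, and penalties are hypothetical rather than taken from Table 3.

```python
from dataclasses import dataclass

# Illustrative sketch only: every name and number here is a hypothetical
# placeholder, not a parameter or result from the paper.

@dataclass
class Machine:
    mispredict_penalty: int   # cycles lost per branch mispredict
    misfetch_penalty: int     # cycles lost per branch misfetch
    icache_miss_penalty: int  # cycles lost per instruction cache miss

@dataclass
class Events:
    mispredicts: int
    misfetches: int
    icache_misses: int

def penalty_cycles(m: Machine, e: Events) -> int:
    """Total stall cycles under a simple additive penalty model."""
    return (m.mispredict_penalty * e.mispredicts
            + m.misfetch_penalty * e.misfetches
            + m.icache_miss_penalty * e.icache_misses)

def choose_k(m: Machine, events_by_k: dict) -> int:
    """Pick the k with the fewest penalty cycles; ties go to the smaller k."""
    return min(sorted(events_by_k), key=lambda k: penalty_cycles(m, events_by_k[k]))

# Hypothetical counts: mispredicts fall with k, cache misses rise with k.
events_by_k = {
    0: Events(mispredicts=1000, misfetches=300, icache_misses=100),
    2: Events(mispredicts=800, misfetches=280, icache_misses=150),
    4: Events(mispredicts=700, misfetches=270, icache_misses=300),
}

cache_bound = Machine(mispredict_penalty=4, misfetch_penalty=1, icache_miss_penalty=20)
branch_bound = Machine(mispredict_penalty=12, misfetch_penalty=1, icache_miss_penalty=4)

print("cache-bound machine picks k =", choose_k(cache_bound, events_by_k))
print("branch-bound machine picks k =", choose_k(branch_bound, events_by_k))
```

Under this additive model, the same event counts lead to different choices of k on the two hypothetical machines, mirroring the observation that SCBP pays off most when branch penalties are high relative to cache miss penalties. A real selector would also have to estimate the event counts at compile time, which this sketch takes as given.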
Conclusion
In summary, compile-time transformations that maximize prediction accuracy do not necessarily maximize application performance. When small incremental improvements in prediction accuracy result in large amounts of code expansion, there is the potential to improve application performance by limiting the amount of branch correlation exploited by SCBP. To achieve even better performance improvements from incremental changes in prediction accuracy, the next step is to couple SCBP with aggressive ILP techniques like global instruction scheduling, which were not employed in the results of this study.
We also find that a dynamic branch prediction scheme like gshare can benefit significantly from the application of profile-driven code layout and branch alignment techniques. Without the benefit of these profile-driven layout techniques, the performance of gshare may drop markedly. In fact, we find that SCBP with code layout and branch alignment can perform better than gshare without profile-driven layout and alignment. This result is true even when gshare achieves a noticeably lower branch misprediction rate. 
