Abstract
Introduction
The number of instructions that superscalar microprocessors can issue every cycle is increasing steadily. Current processors can, or soon will be able to, issue six instructions simultaneously. Nevertheless, a typical 6-issue superscalar architecture, on current applications, hardly sustains 1.5 instructions per cycle. Such low hardware utilization is mostly due to three constraints: the dependencies between instructions, the breaks in the control flow due to branches, and the wait states generated by cache misses.
Simultaneous multithreading (SMT) [8] is a promising technique for improving pipeline utilization, in which several instructions issued from different threads are executed concurrently. These threads may be independent processes or processes issued by a single application. In this way, dependencies tend to disappear and long-latency operations, such as cache misses or divide operations, can be overlapped by useful execution of instructions from the other available threads [1]. Moreover, the processor utilization is enhanced by the ability of SMT to schedule resources dynamically among the available threads. In [8], Tullsen et al. showed that a simultaneous multithreaded architecture can achieve four times the instruction throughput of a singlethreaded wide superscalar with the same issue width.
Extensive research studies have been conducted on software and hardware mechanisms to deal with the branch problem on singlethreaded architectures. Branch-prediction strategies for superscalar architectures now achieve more than 90% accuracy.
The purpose of this paper is to study the behavior of different branch prediction strategies when several threads are executing simultaneously. Only summary results are presented here (a full paper is available on the web at http://www.irisa.fr/caps/HTML/Architecture.html).
We explore the impact on the branch-prediction accuracy of the simultaneous use of prediction tables by several threads. We particularly try to characterize whether the threads take advantage of bigger tables or if the number of mispredictions increases due to pollution, for multiprogramming processing as well as for parallel applications. We also examine the usefulness of providing one private Return Address Stack per active thread.
Branch Prediction
Most applications exhibit 15% to 30% of branch instructions. The way these instructions are handled is therefore a critical issue. In a pipelined microprocessor, the precise target address of a branch is known only several cycles after the issue of the branch instruction. Waiting for the branch resolution before issuing new instructions would lead to poor performance. Thus, many studies have been conducted on predicting the future address in order to continue the instruction fetch. The best performing technique, dynamic prediction, uses run-time history; predictions are therefore not known until runtime.
Dynamic branch prediction can be split into two different problems: predicting the direction of the branch, and, for taken branches, predicting the target instruction address. The common way to predict the target address is to use a BTB (Branch Target Buffer), i.e. a small cache memory. Each entry consists of the address of a branch, its target address and some state information.
Prediction of the direction is usually based on branch history. It consists of using the previous sequence of taken/not taken information for each branch to predict whether or not the branch will be taken the next time it occurs. J. E. Smith proposed using a two-bit saturating up/down counter scheme to collect history [7]. Yeh and Patt have shown that substantially higher accuracy can be achieved by accumulating more branch history [11]. They proposed using two levels of branch-history information (two-level adaptive branch predictor) to make predictions (figure 1). The first level records the history of the last k branches encountered, while the second (PHT, for Pattern History Table) records the branch behavior observed for each pattern of these k branches. Their less complex implementation, called GAg, uses only a global history vector for the first level. This model is interesting as it should have small delays and be easy to pipeline. However, global history information is less efficient at identifying the current branch than simply using the branch address. History vectors work well for workloads dominated by loops. A more efficient prediction can be made by hashing both the branch address and the global history, as proposed by Pan, So and Rahmeh [6]. This kind of bit selection identifies the branches correctly.
Figure 1. two-level adaptive branch prediction
Another method is to do an exclusive OR of the history with the branch address; because more bits from each of the two bit vectors are in use, it yields slightly better performance than selection. Finally, for small-size predictors, still higher prediction accuracy can be reached by combining branch predictors [5].
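The two-bit saturating counter and the global-history (GAg) organisation described above can be illustrated with a minimal sketch. The class and function names are hypothetical, and the initial counter state (weakly not taken) is an assumption, not something specified in the text.

```python
class TwoBitCounter:
    """Two-bit saturating up/down counter: states 0..3, predict taken when >= 2."""

    def __init__(self, state=1):  # assumed initial state: weakly not taken
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move towards 3 on taken, towards 0 on not taken, saturating at both ends.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


class GAgPredictor:
    """First level: one global k-bit history register.
    Second level: a PHT of 2**k two-bit counters indexed by that history."""

    def __init__(self, k):
        self.k = k
        self.history = 0
        self.pht = [TwoBitCounter() for _ in range(1 << k)]

    def predict(self):
        return self.pht[self.history].predict()

    def update(self, taken):
        self.pht[self.history].update(taken)
        # Shift the outcome into the history, keeping only the last k bits.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.k) - 1)


def run_predictor(predictor, outcomes):
    """Return, for each branch outcome, whether the prediction was correct."""
    correct = []
    for taken in outcomes:
        correct.append(predictor.predict() == taken)
        predictor.update(taken)
    return correct
```

On a loop-like pattern such as three taken branches followed by one not-taken branch, a 4-bit history distinguishes every position in the pattern, so this sketch stops mispredicting once the counters have warmed up.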
Simultaneous Multithreading & Branch Prediction
Although extensive research has been conducted on branch prediction for singlethreaded architectures, to our knowledge none has been done for multithreaded ones. This study evaluates the new constraints of multithreaded execution on branch prediction and the potential performance improvements that we could expect.
A parallel can be drawn between access to classical cache memories [9] and to prediction tables. In a singlethreaded environment, access to a prediction table can result in different types of misses: initialization misses, which occur the first time a branch is encountered, and intrinsic misses, where branches conflict with each other due to the limited size of the tables. Multithreading introduces a new type of miss, the extrinsic miss, generated when branches of different threads contend for table space. The frequency of this kind of miss should vary considerably, depending on the simultaneously executing applications. It should be maximal when all processes are independent, as in multiprogramming processing. On the other hand, it should diminish as soon as there is sharing (prefetch effect).
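The three kinds of misses can be made concrete with a small classifier. This is only a sketch under stated assumptions: the table is direct-mapped and tagged with the pair (thread, branch address), the index is the address modulo the table size, and a miss is attributed to whichever thread evicted the entry. The function name and these modelling choices are hypothetical and are not taken from the simulator used in this study.

```python
def classify_misses(accesses, num_entries):
    """Classify accesses to a shared, direct-mapped, tagged prediction table.

    accesses: iterable of (thread_id, branch_address) pairs in execution order.
    Returns a dict counting hits and the three kinds of misses.
    """
    table = [None] * num_entries   # each entry holds a (thread_id, address) tag
    seen = set()                   # branches that have already been encountered
    evicted_by = {}                # evicted branch -> thread that evicted it
    counts = {"hit": 0, "initialization": 0, "intrinsic": 0, "extrinsic": 0}

    for thread_id, address in accesses:
        key = (thread_id, address)
        index = address % num_entries
        if table[index] == key:
            counts["hit"] += 1
            continue
        if key not in seen:
            counts["initialization"] += 1   # first occurrence of this branch
            seen.add(key)
        elif evicted_by[key] == thread_id:
            counts["intrinsic"] += 1        # displaced by the same thread
        else:
            counts["extrinsic"] += 1        # displaced by another thread
        victim = table[index]
        if victim is not None:
            evicted_by[victim] = thread_id
        table[index] = key
    return counts
```

For example, with a 4-entry table, the sequence (0, 0), (1, 4), (0, 0) produces two initialization misses and one extrinsic miss, because thread 1 displaced the entry that thread 0 reuses.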
Having a prediction table and branch target buffer per thread is a solution to suppress the extrinsic misses. However, it is unlikely to be cost effective, as tables require a lot of space. Moreover, the benefit of sharing totally disappears. When a branch target is computed, it is placed in the BTB. Threads issued from the same application generally share the same instruction code. Their simultaneous execution should therefore induce a prefetch-like effect, one thread placing in the BTB addresses used later by other threads. However, for caches, Gupta [9] has shown that the number of intrinsic misses generated by different processes generally overwhelms the prefetch effect.
The study presented here evaluates the behavior of some of the prediction strategies which perform best for a single thread, in a multithreaded environment, with shared prediction tables. This work was done through simulations and will be the subject of the following sections.
Methodology for Performance Evaluation
We used the Spy program to conduct our simulations. Spy is part of the SPA package, a set of tools written by Gordon Irlam to analyze the performance of SPARC binaries. We modified Spy to support the simultaneous tracing of several programs. To handle parallelism (process creation, synchronization, locks, ...), we have also integrated support for PARMACS. One should note that our traces are user-only and that larger branch-predictor structures would probably be needed to achieve the same prediction rate when a complete program including user and kernel activities is traced [3].
In addition, we developed a fully-configurable simulator, which reads the instruction streams generated by Spy and integrates the feedback control necessary for handling parallelism.
Simulated Architecture
Our study was focused on branch prediction, so we simulated only the branch-prediction mechanisms of a multithreaded processor based on the SPARC7 instruction set. We compared three different strategies: 2bit, gselect and gshare. Our model was inspired by the GAg predictor in [12]. It has the advantage of being simple, and should allow the prediction in one cycle. It should reduce the risk of bad predictions due to the use of non-updated tables, which is important as we do not deal with the number of cycles normally necessary for the target address calculation. A schematic representation of the prediction mechanisms is given in figure 2. The 2bit algorithm is a slightly modified version of the 2-bit saturating counter scheme proposed by Smith [4]. In gselect, the PHT is indexed with the concatenation of the low-order bits of the address with the history register [6], while in gshare, the index is the XOR of the history register with the low-order bits of the address [5].
Figure 2. representation of the 2bit (A) and gshare/gselect (B) prediction
We used 12-entry Return Address Stacks (RAS) as in the DEC 21164, implemented as circular register files. Thus, the last 12 return addresses are always available.
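The two PHT indexing schemes can be sketched as follows. The function names, the order in which gselect concatenates the two fields, and the split of index bits between address and history are assumptions made for illustration; `address` is assumed to be a word index, with the alignment bits already removed.

```python
def gselect_index(address, history, addr_bits, hist_bits):
    """Concatenate the low-order address bits with the history register.

    Assumed layout: address bits in the upper part of the index,
    history bits in the lower part.
    """
    low_addr = address & ((1 << addr_bits) - 1)
    hist = history & ((1 << hist_bits) - 1)
    return (low_addr << hist_bits) | hist


def gshare_index(address, history, index_bits):
    """XOR the history register with the low-order address bits."""
    mask = (1 << index_bits) - 1
    return (address ^ history) & mask
```

For a 4096-entry PHT the index is 12 bits wide. gshare can then use 12 bits of both the address and the history, whereas gselect has to divide the 12 bits between the two (for instance 6 and 6), which is why more bits of each vector are in use with the XOR.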
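A return address stack held in a circular register file can be sketched as below. The class name and the behaviour on underflow (a stale or empty entry is returned) are assumptions; the text only states the depth and the circular organisation.

```python
class ReturnAddressStack:
    """Fixed-depth return address stack kept in a circular register file."""

    def __init__(self, depth=12):
        self.depth = depth
        self.entries = [None] * depth
        self.top = 0  # index of the next free slot

    def push(self, return_address):
        # On overflow the oldest entry is silently overwritten.
        self.entries[self.top] = return_address
        self.top = (self.top + 1) % self.depth

    def pop(self):
        # The pointer simply wraps around; entries are never cleared.
        self.top = (self.top - 1) % self.depth
        return self.entries[self.top]
```

Because the pointer wraps, pushing a thirteenth address overwrites the oldest one and the twelve most recent return addresses remain available, at the cost of mispredicting the deepest return.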
We assumed that at the prediction stage of the pipeline, we know the type of a branch. This is true if the information is kept in the BTB after the first occurrence of the branch.
The ratio of the size of the predictor to the number of threads is kept constant for all the simulations, with 512 BTB entries per thread. The BTB is 4-way set-associative, with an LRU replacement policy. The PHT, when present, has a size of 4096 entries multiplied by the number of threads and is direct-mapped. Each entry in the PHT corresponds to a 2-bit counter. In the 2bit branch prediction algorithm, the PHT is part of the BTB, and thus has the same number of entries and is 4-way set-associative.
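The sizing rule and the BTB organisation can be sketched as follows. This is an illustrative model only: the names are hypothetical, the set index is assumed to be the address modulo the number of sets, and the state information kept in each entry is reduced to the target address.

```python
from collections import OrderedDict


def table_sizes(num_threads, btb_per_thread=512, pht_per_thread=4096):
    """Table sizes when the predictor is scaled with the number of threads."""
    return {"btb": btb_per_thread * num_threads,
            "pht": pht_per_thread * num_threads}


class BranchTargetBuffer:
    """Set-associative BTB with LRU replacement within each set."""

    def __init__(self, num_entries, ways=4):
        self.ways = ways
        self.num_sets = num_entries // ways
        # Each set maps branch address -> target, ordered from LRU to MRU.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def lookup(self, address):
        entries = self.sets[address % self.num_sets]
        if address in entries:
            entries.move_to_end(address)   # mark as most recently used
            return entries[address]
        return None                        # BTB miss

    def insert(self, address, target):
        entries = self.sets[address % self.num_sets]
        if address in entries:
            entries.move_to_end(address)
        elif len(entries) >= self.ways:
            entries.popitem(last=False)    # evict the least recently used
        entries[address] = target
```

With four threads, `table_sizes(4)` gives a 2048-entry BTB and a 16384-entry PHT, which matches the 8-to-1 ratio between PHT and BTB sizes used throughout the simulations.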
Benchmarks
We used two different kinds of benchmark suites. To simulate problems of sharing in a multithreaded environment, we used applications from SPLASH2 [10], a set of parallel applications for shared-memory architectures written at Stanford University and using PARMACS. For multiprogramming purposes, we used applications from the SPLASH2 and SPEC92 benchmarks [2]. The programs were compiled on Sparcstations (2, 10 or 20) using gcc or f77, with the standard optimization -O.
Multiprogrammed Workload
We first examine the behavior of the three branch prediction strategies on a workload corresponding to multiprogramming processing. We used ten applications taken from the SPEC92 (compress, ear, espresso, tomcatv, wave5 and xlisp) and SPLASH2 (barnes, cholesky, fft and lu) packages. The SPLASH2 applications were run in their single-processor configuration. For each application, the first 50 million instructions were ignored and we simulated the following 10 million instructions. For the two-degree and the four-degree multithreaded architectures, we used ten workloads, each running 2 or 4 of the ten previous applications simultaneously.
In figure 3, the average misprediction ratios are given for the three branch prediction strategies, according to the number of simultaneous threads and the sizes of the prediction tables. Each ratio is calculated as the misprediction ratio that would have been obtained if the corresponding ten workloads were executed sequentially. The term misprediction covers both the direction mispredictions and the misfetches. R-x-y refers to the misprediction ratio exhibited by the simultaneous execution of x threads on a shared predictor with a BTB of y entries (8 x y entries for the PHT). The suffix -s means that the 12-deep RAS is not implemented.
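The key convention and the misprediction metric can be written down as a short sketch. The helper names and the returned field names are hypothetical; only the R-x-y format, the 8-to-1 PHT/BTB ratio and the -s suffix come from the text.

```python
def parse_key(key):
    """Decode a configuration key such as 'R-4-2048' or 'R-4-2048-s'."""
    parts = key.split("-")
    btb_entries = int(parts[2])
    return {
        "threads": int(parts[1]),
        "btb_entries": btb_entries,
        "pht_entries": 8 * btb_entries,
        # A trailing 's' marks a configuration without the RAS.
        "ras": not (len(parts) > 3 and parts[3] == "s"),
    }


def misprediction_ratio(direction_mispredictions, misfetches, branches):
    """Mispredictions cover both wrong directions and misfetches."""
    return (direction_mispredictions + misfetches) / branches
```

For instance, `parse_key("R-4-512")` describes four threads sharing a 512-entry BTB and a 4096-entry PHT with the RAS present.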
As expected, for a singlethreaded execution, the gshare strategy is the best performing, with less than 6% of bad predictions. The simplest 2bit scheme is outperformed, with average misprediction rates greater than 10%. gselect stands right between the two previous predictors.
The simultaneous execution of 2 and 4 threads shows the same behavior. There is no significant increase or decrease of the misprediction ratios for the three prediction schemes. This means that if the sizes of the tables (PHT, BTB) of the branch prediction unit are kept proportional to the number of threads, there are very few interactions between different threads in the tables, either constructive or destructive.
However, it is interesting to note that with gshare, the interaction is slightly positive. Most of the mispredictions, as illustrated in figure 4, come from wrong-direction predictions, which are represented by the two lower parts of the bars. The name of a workload is the concatenation of the first two letters of the involved applications. Capital and small letters stand, respectively, for the true program behavior and the predicted branch direction. T/N means the branch is taken/not taken, hit/miss refers to the BTB, and bad @ means a misfetch occurred. bad return @ signifies that the RAS did not return the correct address.
To evaluate the capacity of the branch prediction schemes to withstand pollution, we ran simulations while keeping the size of the predictor constant when the number of threads increases. The key R-4-512 in figure 3 shows the branch misprediction ratios obtained with four threads and the BTB and the PHT having, respectively, 512 and 4096 entries. The observed degradation of accuracy is significant; with gshare, it is as high as 17%. The distribution of the mispredictions resulting from the simultaneous execution of four threads and corresponding to the gshare scheme is given in figure 4. These distributions are quite similar for the two other prediction schemes. From this figure, it is clear that the higher number of mispredictions comes mostly from an increase in the BTB misses (white section on the bars).
The key R-4-2048-s in figure 3 refers to the misprediction ratio exhibited by the execution of four threads without a RAS. The lack of stacks causes a substantial decrease in accuracy for the three branch prediction schemes. For the gshare scheme, the penalty reaches 60%, which confirms the need for a RAS per context. A private stack per context is needed since a stack is a resource that cannot be shared between several threads. Indeed, in a stack, data manipulations are reduced to push and pop operations on the top. The coherence of the stack is given by the order of call and ret instructions in the program. By mixing such instructions from several threads, one can no longer maintain the consistency between the data order in the stack and the instruction order in each individual thread. Other studies have shown that the gain when shifting to a 32-deep stack is very small. A 12-deep RAS already works quite well and is cheap to implement.
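The reason a return address stack cannot be shared can be shown with a toy trace. This is a sketch under simplifying assumptions: stacks are unbounded Python lists rather than 12-entry register files, and the function name and trace format are hypothetical.

```python
def count_return_mispredictions(trace, private):
    """Count wrongly predicted returns for an interleaved call/ret trace.

    trace: list of (thread_id, op, address) tuples, where op is 'call'
    (address is the return address pushed) or 'ret' (address is the actual
    return target). With private=True each thread has its own stack,
    otherwise all threads push and pop on a single shared stack.
    """
    stacks = {}
    wrong = 0
    for thread_id, op, address in trace:
        stack = stacks.setdefault(thread_id if private else "shared", [])
        if op == "call":
            stack.append(address)
        else:
            predicted = stack.pop() if stack else None
            if predicted != address:
                wrong += 1
    return wrong


# Two threads each perform one call and one return, interleaved.
trace = [
    (0, "call", 0x100),
    (1, "call", 0x200),
    (0, "ret", 0x100),
    (1, "ret", 0x200),
]
```

With a shared stack, thread 0 pops the address pushed by thread 1 and both returns are mispredicted; with one stack per thread, both are predicted correctly.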
Parallel Workload
A multithreaded processor may also execute parallel threads from a single application. Thus, it is essential to evaluate the impact of sharing on the predictions. For this purpose, we used applications from the SPLASH2 series (barnes, cholesky, fft, lu, radix, raytrace, volrend and watern). We ran the programs with 1, 2 or 4 threads. To obtain balanced threads, the parallel activity of each application must be studied, as it depends on the configuration (especially the length of the initialization phase, which is generally done by a single thread). We use the same conventions for the figures as in the previous section. As illustrated by figure 5, results are less consistent than those obtained with a multiprogramming workload but confirm what we know about the behavior of the predictors. The gshare scheme still performs better than gselect, which is better than 2bit. Moreover, it is clear that the three prediction strategies do not have the same capacity to take advantage of the execution of parallel applications.
Figure 5. average misprediction ratios
The 2bit scheme shows a significant decrease in prediction accuracy as the number of threads increases. The misprediction ratio for four simultaneous threads is 25% worse than for one thread.
gselect benefits from parallelism, as there is a decrease in mispredictions when the number of threads increases. gshare is the prediction strategy that appears to take the greatest advantage of parallelism. There is a large decrease in the average misprediction rate when two threads of the same application are executing simultaneously. However, this gain is smaller with four threads. For all predictors, the mispredictions come mostly from bad direction (figure 6, for 4 threads). A parallel version of an application distributes the workload over several processes, so one can expect the sizes of the executed loops to be smaller. For gselect and gshare, the resulting patterns are also smaller and therefore give better prediction accuracy.
The threads issued by the parallel execution of an application are expected to produce similar branch patterns. To evaluate the capacity of the predictors to take advantage of this potential sharing, one can reduce the sizes of the prediction tables. The keys Rm-2-512 and Rm-4-512 in figure 5 show that when 2 or 4 threads are executing simultaneously and the BTB/PHT have, respectively, only 512/2048 entries, the misprediction ratios are slightly worse. The increase in misprediction is tiny for 2bit and stays very small for the two other predictors (except for gselect with 4 threads). gshare exhibits very good behavior, with misprediction ratios for 2 and 4 threads smaller than those of 1 thread even with a one-thread sized predictor. Figure 6 gives the distribution of the mispredictions when the PHT/BTB are sized, or not, for four threads executing simultaneously. This shows that the slight misprediction increase comes mostly from BTB misses (represented in white on the bars). In figure 5, the key Rm-4-2048-s corresponds to the misprediction ratios obtained while executing four threads without having one RAS per thread. The misprediction ratios appear to be very bad, and for gshare, the accuracy is more than two times worse than the one obtained with RAS. This confirms the necessity of having one RAS per context in a simultaneous multithreaded microprocessor.
Concluding Remarks
In this paper, we examined the behavior of different branch-prediction strategies while executing several threads of instructions simultaneously. We simulated a simple mechanism based on 2-bit counters associated with entries of a BTB and two more complex strategies based on the branch prediction mechanism proposed by Yeh & Patt. These last two strategies use a PHT index constructed as either the concatenation or the XOR of a history register and the branch address. We explored the behavior of the branch predictors when independent applications are running simultaneously and when the workload is a parallel program.
Our simulations showed that in a multiprogramming environment, if the sizes of the tables (PHT/BTB) are proportional to the number of active threads, there are very few interactions, be they destructive or constructive.
If the sizes of the tables are kept small, there is a significant increase in mispredictions, which is mostly due to an increase in conflicts in the BTB and reaches 17% for the best performing gshare scheme.
With parallel workloads, we might have expected a beneficial sharing effect. In fact, it is very dependent on the branch predictors. The simple 2-bit predictor behaves very badly when the number of threads increases. gselect and gshare seem to gain a small advantage from the execution of threads from the same application. Decreasing the size of the tables has a negative effect on prediction accuracy by increasing the number of BTB misses. However, with the gshare scheme, the resulting misprediction ratios for 2 or 4 threads stay below those exhibited by 1 thread.
We also studied the impact of adding one Return Address Stack per context. A 12-deep stack per thread appears to greatly enhance branch prediction at a minimal implementation cost.
The next phase of our study will be to evaluate the real penalty induced by mispredictions. With several threads, it could be substantially reduced compared to a conventional superscalar architecture. We also plan to use traces including OS kernel activity.
