High performance computer implementation
If the fetch address hits in the BAC, there is a branch in the sequence of instructions just fetched. The BAC entry records the branch type (conditional, unconditional, or return) and the target and fall-through basic block starting addresses of the primary branch. The same entry also contains the branch type and fetch addresses of basic blocks for each of the expected number of branches we will make predictions for, and all the known potential fetch addresses of their targets. If the number of basic blocks predicted and fetched per cycle is limited to 2, we get 6 fetch addresses: 2 for the two primary basic block addresses, and 4 for the four possible secondary basic blocks. If the basic block prediction and fetch limit is 3, we get 14 possible fetch addresses: 2 for the primary basic blocks, 4 for the secondary, and 8 for the tertiary basic blocks. Each entry in a 512-entry, 4-way set associative Branch Address Cache which supports two branch predictions per cycle has a TAG field plus the valid, type, and address fields described below. A "BAC hit" occurs if the tag matches the upper address bits of the current fetch address and the primary branch is valid.
valid bits - The valid bits for the corresponding branch entries. P refers to the primary branch, ST refers to the secondary branch if the primary branch is taken, and SN is the secondary branch if the primary branch is not taken.
type fields - The branch type of the corresponding branch. The type can be conditional, unconditional, or return. Each type field consists of 2 bits.
addr fields - The address of the corresponding basic block. Each address field consists of 30 bits.
A BAC supporting two branch predictions per cycle would have a total of 212 bits per entry. A BAC supporting three branches would have an additional eight address fields and four additional valid bits for the four possible tertiary branches, making each entry 464 bits wide. We are investigating several possible ways of reducing the number of fields needed per entry, such as storing only the addresses of more likely-taken path(s). When a fetch address misses in the BAC, a large basic block is assumed and the entire instruction cache bandwidth is devoted to fetching sequential instructions. If a branch is discovered once the instructions are decoded and the branch is predicted taken or is an unconditional branch, the prefetched instructions after the branch are discarded.
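The fetch-address counts and entry widths quoted above can be checked with a short calculation. The sketch below is illustrative only: the 23-bit tag is an assumption (a 30-bit address minus 7 set-index bits for 512 entries / 4 ways = 128 sets), since the text does not state the tag width, and it also assumes that each tertiary branch carries a 2-bit type field like the primary and secondary branches. Under those assumptions the totals agree with the 212-bit and 464-bit figures.

```python
# Sketch: reproduce the fetch-address counts and BAC entry widths from the text.
ADDR_BITS = 30   # width of each address field (from the text)
TYPE_BITS = 2    # width of each type field (from the text)
VALID_BITS = 1   # one valid bit per tracked branch
TAG_BITS = 23    # ASSUMPTION: 30 address bits - 7 set-index bits (128 sets)


def fetch_addresses(predictions_per_cycle: int) -> int:
    """Potential fetch addresses per entry: 2 + 4 + ... + 2**k."""
    return sum(2 ** level for level in range(1, predictions_per_cycle + 1))


def branches_tracked(predictions_per_cycle: int) -> int:
    """Branches per entry: 1 primary + 2 secondary + 4 tertiary + ..."""
    return sum(2 ** level for level in range(predictions_per_cycle))


def entry_bits(predictions_per_cycle: int) -> int:
    """Total entry width under the assumptions stated above."""
    branches = branches_tracked(predictions_per_cycle)
    return (TAG_BITS
            + branches * (VALID_BITS + TYPE_BITS)
            + fetch_addresses(predictions_per_cycle) * ADDR_BITS)


if __name__ == "__main__":
    for k in (2, 3):
        print(k, fetch_addresses(k), entry_bits(k))
```

The address fields dominate the entry width (180 of 212 bits for two predictions per cycle), which is why storing only the addresses of the more likely paths is an attractive way to shrink the entry.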
The fall-through and target addresses are calculated in the cycle after decode. The branch is then allocated a primary branch entry in the BAC. The higher order bits of the fetch address are entered in the tag field, the primary branch valid bit is set, the secondary (and tertiary) branch valid bits are cleared, and the primary fall-through and target addresses are entered. If the branch is an indirect branch, however, the target address is not calculated until the operands are ready, and the valid bit is not set until that time.
The branch will also be entered as a secondary branch in the BAC entry of the previous branch if the previous fetch address had a valid primary branch entry in the BAC but did not predict a secondary branch, the basic block of the previous fetch address was not oversized (i.e., there was enough instruction cache bandwidth for another basic block fetch), and the previous branch was not a return.
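The condition for recording a newly discovered branch as a secondary branch can be written as a single predicate. This is a minimal sketch of the rule as stated in the text; the `PrevFetch` record and its field names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PrevFetch:
    """State of the previous fetch address (hypothetical field names)."""
    primary_valid: bool        # previous fetch had a valid primary branch entry
    predicted_secondary: bool  # a secondary branch was already predicted
    oversized: bool            # primary basic block used up the fetch bandwidth
    primary_is_return: bool    # the previous branch was a return


def should_enter_as_secondary(prev: PrevFetch) -> bool:
    """True when the new branch is also entered in the previous branch's entry."""
    return (prev.primary_valid
            and not prev.predicted_secondary
            and not prev.oversized
            and not prev.primary_is_return)


if __name__ == "__main__":
    print(should_enter_as_secondary(PrevFetch(True, False, False, False)))
```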
The Instruction Cache
The ability of the instruction cache to provide enough instructions becomes critical when multiple possibly non-consecutive basic blocks are fetched each cycle. The instruction cache must have high bandwidth, low miss rate, and the ability to fetch from multiple addresses in parallel.
To satisfy the high bandwidth requirement, the cache must either have a large number of banks, or have wide banks. Also, due to off-chip bandwidth and pin limitations, the instruction cache should be on-chip. The ability to fetch from multiple addresses in parallel implies a cache with either interleaved or multi-ported banks, or both. With interleaved banks, each independently addressable, multiple fetch addresses can access the instruction cache simultaneously provided that their accesses are not to the same bank. If there is a bank conflict, priority is given to the earlier (relative to the dynamic instruction stream) fetch address. Therefore it is important to have enough banks to make the probability of bank conflicts low.
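The bank-conflict rule above can be illustrated with a small arbiter. This is a simplified sketch, not the simulated hardware: it assumes consecutive cache lines map to consecutive banks (low-order line-address interleaving), it lets each fetch address touch only one bank, and the names `bank_of` and `arbitrate` are made up for the example.

```python
LINE_BYTES = 16   # line size of each bank
NUM_BANKS = 8     # interleaved, single-ported banks


def bank_of(addr: int) -> int:
    """ASSUMPTION: consecutive lines are spread across consecutive banks."""
    return (addr // LINE_BYTES) % NUM_BANKS


def arbitrate(fetch_addrs):
    """Grant banks to fetch addresses given in dynamic-stream order.

    The earlier address wins a conflict; later ones are returned as stalled.
    """
    busy = set()
    granted, stalled = [], []
    for addr in fetch_addrs:
        bank = bank_of(addr)
        if bank in busy:
            stalled.append(addr)
        else:
            busy.add(bank)
            granted.append(addr)
    return granted, stalled


if __name__ == "__main__":
    print(arbitrate([0x1000, 0x1080]))  # both map to bank 0: second one stalls
    print(arbitrate([0x1000, 0x1010]))  # banks 0 and 1: both proceed
```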
A multi-ported cache eliminates the bank conflict problem.
For example, a dual-ported cache allows the simultaneous access of two fetch addresses, and a triported cache allows the simultaneous access of three fetch addresses. Unfortunately, multi-ported memories are expensive in terms of semiconductor chip area. It is critical for the instruction cache miss rate to be low.
Each instruction cache miss stalls the fetch sequence. Since multiple basic blocks can be fetched each cycle, the opportunity cost can be (up to) the number of cycles it takes to service the miss multiplied by the number of instructions we could have fetched during those idle fetch cycles.
Also, since more instructions are fetched each cycle, there are fewer cycles between instruction cache misses. Therefore more time is spent waiting for instruction cache misses to be satisfied.
Commonly used ways to minimize instruction cache miss rates are to increase the associativity, to increase the size of the cache, and to prefetch instructions. We chose several cache configurations which gave us reasonably high bandwidth, the ability to fetch multiple addresses in parallel, and a relatively low miss rate.
Most of our simulations were done with a 32K cache which was 2-way set associative with 8 interleaved single-ported banks, each bank having a line size of 16 bytes. Each fetch address can access two banks so that we guarantee between 5 and 8 instructions per fetch address (due to basic block alignment).
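The "between 5 and 8 instructions" figure follows from where the fetch address falls within its first line. The sketch below assumes 4-byte instructions (16-byte lines holding 4 instructions), a word-aligned fetch address, and that the second bank access reads the next sequential line; the function name is illustrative.

```python
LINE_BYTES = 16
INSTR_BYTES = 4          # ASSUMPTION: 4-byte instructions, 4 per line
LINES_PER_FETCH = 2      # each fetch address can access two banks


def instructions_available(fetch_addr: int) -> int:
    """Instructions from fetch_addr to the end of the second line."""
    per_line = LINE_BYTES // INSTR_BYTES
    offset = (fetch_addr % LINE_BYTES) // INSTR_BYTES
    return LINES_PER_FETCH * per_line - offset


if __name__ == "__main__":
    for addr in (0x100, 0x104, 0x108, 0x10C):
        print(hex(addr), instructions_available(addr))
```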
This configuration and several others are compared in Section 5.
4 Simulation Methodology
Simulation Environment
We used a trace-driven simulator to evaluate the performance of a machine front-end which implements the Multiple Branch Two-level Adaptive Branch Predictor, a 512-entry 4-way set associative Branch Address Cache (BAC), and a high-bandwidth instruction cache. Unless otherwise specified, the instruction cache configuration used was 32K bytes, 2-way set associative, 8-way interleaved, single-ported, and with a line size of 16 bytes (4 instructions).
For the multiple basic block mechanisms, we can fetch two cache lines (a maximum of 8 instructions) per basic block fetch address because most basic blocks contain 4 to 8 instructions.
In order to do a fair comparison, we allow the single basic block prediction and fetch algorithm to fetch up to 4 cache lines.
The maximum number of instructions issued, passed to the back-end of the machine, is constrained to 16 instructions per cycle.
The benchmarks written in C were compiled with the Motorola Apogee C compiler for the Motorola 88100 instruction set, and the ones written in Fortran were compiled with the Green Hills Fortran compiler.
A Motorola 88100 instruction level simulator generated the instruction traces. The first 50 million instructions from each trace were used rather than the entire trace due to simulation time constraints. Nine benchmarks were selected from the SPEC89 benchmark suite. These included 4 integer and 5 floating point benchmarks.
Since we do not simulate the rest of the machine, the exact mispredicted branch penalty is approximated. A 6-cycle mispredicted branch penalty is assumed; therefore, the instructions following an incorrectly predicted branch will not be fetched until 6 cycles after the branch is fetched.
The I-cache miss penalty is assumed to be 10 cycles. We also show how the machine performance changes as the branch misprediction penalty and I-cache miss penalty are varied.
5 Simulation Results
5.1 Effect on Prediction Accuracy and IPC_f of History Register Length
Figure 4 shows how the prediction accuracy changes as we increase the number of bits in the global history register of the MGAg scheme for two branch predictions per cycle. The prediction accuracy is the number of correctly predicted branches over the total number of branches in the dynamic instruction stream. Longer branch histories give better prediction accuracy, which is reflected in the rising curves. The hardware cost goes up exponentially with the number of history bits due to the number of pattern history table (PHT) entries required.
The prediction accuracies varied between 91.5 and 98.4% for a branch history register (BHR) length of 14 bits, and between 93.5 and 98.7% for a history register length of 16 bits.
The knees of the curves for most benchmarks are reached at a BHR length of 14 bits. We used a 14-bit BHR length for the other experiments reported in this paper. A 14-bit BHR length means that a PHT has 2^14 x 2 bits, or 32K bits.
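The exponential growth of PHT cost with history length is easy to tabulate. This minimal sketch assumes 2 bits per PHT entry, as the 32K-bit figure above implies; `pht_bits` is an illustrative name.

```python
def pht_bits(bhr_length: int, bits_per_entry: int = 2) -> int:
    """Storage for one PHT indexed by a BHR of the given length."""
    return (2 ** bhr_length) * bits_per_entry


if __name__ == "__main__":
    for length in (12, 14, 16):
        print(length, pht_bits(length) // 1024, "Kbits")
```

Going from a 14-bit to a 16-bit BHR quadruples the table size, which is the cost side of the modest accuracy gain reported above.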
5.2 Tradeoff between the Number of Pattern History Tables and History Register Length
We simulated several MGAg, MGAs, and MGAp configurations to determine how the performance changes with the number of PHTs for two branch predictions per cycle. Figure 5 for integer benchmarks and Figure 6 for floating point benchmarks show the IPC_f for 1 to 512 PHTs.
Each configuration shown has the same hardware cost, which was achieved by decreasing the number of entries in each PHT as the number of PHTs is increased.
Since the entries in the PHTs are addressed by the BHR, the BHR length is reduced when we decrease the number of entries in each PHT.
The PHT used to make the predictions is determined by the primary branch address. The experiments shown in Figures 5 and 6 use the address bits starting at bit 10 to select a PHT. This allows branches within the same 256-instruction block in the static code to map to the same PHT.
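The equal-cost tradeoff and the PHT selection can be sketched as follows. Two assumptions are made here that the text does not state outright: the fixed budget equals one PHT with a 14-bit BHR (2^14 entries in total), and addresses are byte addresses with 4-byte instructions, so that bit 10 corresponds to a 256-instruction (1024-byte) block. The function names are illustrative.

```python
TOTAL_ENTRIES = 2 ** 14   # ASSUMPTION: budget of one PHT with a 14-bit BHR
SELECT_SHIFT = 10         # PHT select bits start at address bit 10


def bhr_length(num_phts: int) -> int:
    """BHR length that keeps total PHT entries constant (num_phts a power of 2)."""
    entries_per_pht = TOTAL_ENTRIES // num_phts
    return entries_per_pht.bit_length() - 1


def pht_index(primary_branch_addr: int, num_phts: int) -> int:
    """Select a PHT from the primary branch address bits at and above bit 10."""
    return (primary_branch_addr >> SELECT_SHIFT) % num_phts


if __name__ == "__main__":
    for n in (1, 8, 512):
        print(n, "PHTs ->", bhr_length(n), "history bits")
    print(pht_index(0x0000, 8), pht_index(0x03FC, 8), pht_index(0x0400, 8))
```

Under these assumptions, 512 PHTs leave only 5 history bits per prediction, which illustrates why the configurations with many tables give up the benefit of long histories.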
The prediction accuracies shown in Figures 5 and 6 tend to be higher for configurations with one to eight pattern history tables, then decrease when the number of pattern history tables is increased beyond 8. Longer branch history helps to increase the prediction accuracy. Increasing the number of PHTs reduces the interference between branches, but since the second branch is predicted using the PHT of the first branch, the probability of mapping two branches predicted together into different PHTs is higher when more PHTs are used. The average IPC_f when one basic block is predicted and fetched per cycle is 3.0 and 6.6 for integer and floating point benchmarks, respectively. Two predictions per cycle increases this to 4.2 for integer and 7.1 for floating point. Three predictions per cycle increases IPC_f further to 4.9 for integer and 8.9 for floating point.
Table 2: Branch prediction utilization of an instruction fetch mechanism which is able to provide fetch addresses of two basic blocks in each cycle.
For ... the Branch Address Cache, and a branch is found in the sequence of instructions after the instructions are decoded.
Fpppp has a high percentage of cycles with no predictions due to the extremely long sequential code segment which is repeatedly executed.
The percentage of cycles in which no predictions were made is 10% for integer and 44% for floating point benchmarks.
Only a single branch is predicted when the primary branch is a return, or when the primary basic block is large (oversized), in which case the instruction fetch bandwidth of the secondary basic block is usurped. About 24% of the single basic block fetches are due to oversized basic blocks, and about 5% are due to the primary branch being a return.
Two branch predictions are made and two basic blocks are fetched 62% of the time for integer and 24% of the time for floating point benchmarks.
Table 3 shows the percentage of fetches that cause the machine front-end to stall. The machine front-end stalls only due to instruction cache misses, mispredicted branches, and branch decode penalties.
To investigate the effect of branch misprediction penalty on machine performance, we varied the time to resolve a branch from 4 cycles to 12 cycles. Floating point programs have flatter curves because they contain fewer branches and the prediction accuracy of those branches is higher. For these programs, the performance degradation when the branch resolution time is increased from 4 cycles to 12 cycles is less than 10%. Integer programs have about 20% to 30% performance degradation.
When two and three basic blocks are predicted and fetched per cycle, the average IPC_f improves from 3.0 to 4.2 and 4.9, respectively, for integer benchmarks.
For floating point benchmarks, the IPC_f went from 6.6 to 7.1 and 8.9. These improvements were achieved by providing the hardware mechanisms to predict and fetch multiple basic blocks without specific compiler optimizations.
Acknowledgement
This paper is one result of our ongoing research in high performance computer implementation at the University of Michigan. The support of our industrial partners: Intel, Motorola, NCR, HaL, Hewlett-Packard, and Scientific and Engineering Software is greatly appreciated.
In addition, we wish to gratefully acknowledge the other members of our HPS research group for the stimulating environment they provide, and in particular, for their comments and suggestions on this work. We are particularly grateful to Intel and Motorola for technical and financial support, and to NCR for the gift of an NCR 3550, which is a useful compute server in much of our work.
