Instruction fetching is critical to the performance of a superscalar microprocessor. We develop a mathematical model for three different cache techniques and evaluate their performance both in theory and in simulation using the SPEC95 suite of benchmarks. In all the techniques, the fetching performance is dramatically lower than ideal expectations. To help remedy the situation, we also evaluate their performance using prefetching. Nevertheless, fetching performance is fundamentally limited by control transfers. To solve this problem, we introduce a new fetching mechanism called a dual branch target buffer. The dual branch target buffer enables fetching performance to leap beyond the limitation imposed by conventional methods and achieve a high instruction fetching rate.
Introduction
The goal of a superscalar microprocessor is to execute multiple instructions per cycle. It relies on instruction-level parallelism (ILP) to achieve this goal [9]. Depending on the type of programs and the assumptions used, researchers have shown that parallelism anywhere from 4 to 90 is available [10, 13, 12, 15]. Unfortunately, all of this potential parallelism will never be utilized if the instructions are not delivered for decoding and execution at a sufficient rate.
The underlying problem in fetching instructions using a control flow architecture is control transfers. Even with perfect branch prediction, conditional and unconditional branches disrupt the sequential addressing of instructions. The non-sequential accessing of instructions causes difficulty with fetching instructions in hardware. As a result, the instruction fetcher restricts the amount of concurrency available to the processor [14].
In fact, as will be shown in this paper, it can be the greatest factor limiting performance. For example, an 8-way superscalar processor with simple fetching hardware can expect to fetch fewer than four instructions per cycle on the programs included in SPECint95. This accounts for over 50% of the loss in potential speedup regardless of any other issues. Thus, the performance is severely reduced even if the ILP in the program and the execution pipeline would be able to sustain eight instructions per cycle.
Branch prediction foretells the outcome of conditional branch instructions. Instruction fetch prediction determines the next instruction to fetch from the memory subsystem [3]. Instruction fetch mechanisms involve the process of how instructions are fetched from memory and delivered to the decoder. This paper focuses on hardware instruction fetching mechanisms. Hence, only instruction fetching performance is evaluated; we do not attempt to evaluate any other performance issues (such as branch prediction, cache, execution, etc.). Also, our study does not include the effects due to system interference. Our goal is to describe, evaluate, and provide solutions to the first step in a series of hurdles for exploiting high levels of ILP.
Although we will only discuss hardware techniques, we cannot ignore the potential benefit of software techniques. Using software techniques, the probability of a control transfer instruction can be reduced. Loop unrolling is one method. A relatively new technique proposed by Calder and Grunwald is most promising [2]. By rearranging basic blocks, conditional branches become more likely to be 'not taken'. This means that the probability of a control transfer instruction is reduced because a 'not taken' branch is not a control transfer. Nevertheless, software will only be able to make limited improvements, and the hardware techniques presented in this paper will be able to boost instruction fetching performance after software improvements. Furthermore, unlike software techniques, hardware techniques are able to address limitations created by control transfers.
To begin with, we describe our fetching model and the terms we use in our analysis. Then, we show why and how much performance is currently limited by control transfers. Three different cache options are then briefly described: a simple cache type, an extended cache type, and a self-aligned cache type. The way in which prefetching is applied in hardware is described. Next, the dual branch target buffer is described.
The theory behind the fetching techniques gives insight into fetching problems and can give expected performance under given conditions. Therefore, a probabilistic model based on the probability of a control transfer is presented for all combinations of the fetching techniques described. The models are evaluated under several different conditions. To verify that these models predict accurately and to show what real conditions provide, the SPEC95 suite of benchmarks is simulated using the different fetching techniques presented.
Fetching Model
This section describes the fetching model used in the rest of the paper. The cache line size is defined to be the size of a row in the instruction cache. The terms 'line' and 'row' are used interchangeably. The line size determines the maximum number of instructions that can be accessed simultaneously in one cycle. Also, a block is defined to be a group of sequential instructions. A block's width is the maximum number of instructions allowed in it. Figure 1 is a block diagram showing the different fetching steps. The instruction cache reads the requested fetch block of width q and returns it to the instruction fetcher. The instruction decoder receives a decode block of width n. If prefetching is applied, up to q new instructions from the instruction fetcher go into the prefetch buffer FIFO queue and n instructions come out. This implies q > n in the diagram.
Otherwise, if prefetching is not used, the fetch and decode widths are equal, and the instruction fetcher delivers instructions directly to the decoder. The instruction fetcher is responsible for determining the new starting PC each cycle and sending it to the instruction cache. It cooperates with a branch predictor or branch target buffer, if employed. Calder and Grunwald [1] describe different techniques for fast PC calculation. Whichever technique is used, the new PC must be determined in the same cycle. Also, after the instruction fetcher receives the fetch block from the instruction cache, it performs preliminary decoding to determine the instruction type (or uses prediction/pre-decoding methods). Instructions after the first instruction that transfers control are invalidated.
Johnson defines an instruction run to be the sequentially fetched instructions between branches [9]. In this paper, an instruction run is further specified to be between instructions that transfer control. A control transfer instruction includes unconditional jumps and calls, conditional branches that are taken, and any other instruction that transfers control, such as a trap. The run length is the number of instructions in a run. In addition, a block run is defined to be the instructions from the start of the block to the end of the block or the first instruction that transfers control. The block run length is the number of instructions in a block run.
[Figure 1: Block diagram of the fetching steps.]
If a control transfer requires another cycle to reach the target address, then only one block of instructions can be fetched in a cycle. With b denoting the probability that an instruction transfers control, 1/b is the limit for the average number of instructions fetched per cycle, regardless of the type of software scheduling or hardware techniques used to improve fetching. Under these conditions, 1/b is also the maximum average number of instructions per cycle that can be executed on any single-threaded control-flow architecture.
Here is an example to illustrate this fundamental fetching limitation. Suppose a program executes a million instructions, and one hundred thousand of these instructions transfer control. The probability of a control transfer instruction is therefore one tenth, and an average of ten instructions fetched per cycle is the theoretical limit. Since each control transfer instruction requires one cycle, executing this program would require a minimum of one hundred thousand cycles.
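The arithmetic of this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `fetch_limit` and its return layout are our own and do not come from the paper.

```python
def fetch_limit(total_instructions: int, control_transfers: int):
    """Return (b, max instructions per cycle, min cycles) for one program run.

    Assumes, as in the text, that at most one block run is fetched per cycle,
    so every control transfer ends a cycle.
    """
    # b: probability that an executed instruction transfers control
    b = control_transfers / total_instructions
    # The average run length 1/b bounds the average fetch rate
    max_per_cycle = total_instructions / control_transfers
    # At least one cycle is needed per control transfer
    min_cycles = control_transfers
    return b, max_per_cycle, min_cycles


b, max_per_cycle, min_cycles = fetch_limit(1_000_000, 100_000)
print(b, max_per_cycle, min_cycles)
```

For the numbers in the text this gives b of one tenth, a limit of ten instructions per cycle, and a minimum of one hundred thousand cycles.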
Hardware Techniques
This section describes hardware techniques which perform instruction fetching. To begin with, three cache types are described: a simple cache, an extended cache, and a self-aligned cache. Next, prefetching is described. Finally, a new mechanism to fetch two blocks per cycle, a dual branch target buffer, is introduced.
Simple Cache
A straightforward approach to fetching instructions from the instruction cache is to have the line size equal the width of the fetch block. If the starting PC address is not the first position in the corresponding row of the instruction cache, then the appropriate instructions are invalidated and fewer than the fetch width are returned. As with all fetching techniques, if there is an instruction that transfers control, instructions after it are invalidated. Figure 2 shows an example for the simple fetching mechanism. In this example, the second instruction in the first block is a taken branch, so the third and fourth instructions are invalidated. Also, only two instructions from the second block are valid. Altogether, only four out of a potential eight instructions are used for instruction decoding and execution, which illustrates the problem with this simple approach.
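The counting in this example can be sketched in Python as follows. The helper `simple_fetch_valid` is a hypothetical name, positions are 0-based, and the assumption that the branch target lands at the third position of the second row is ours (the text only states that two instructions of that row are valid).

```python
from typing import Optional


def simple_fetch_valid(start_offset: int, taken_pos: Optional[int], n: int) -> int:
    """Count the valid instructions returned from one n-wide row of a simple cache.

    start_offset: position in the row where fetching starts (0-based).
    taken_pos: position of the first control transfer at or after start_offset,
               or None if the rest of the row is sequential.
    """
    if not 0 <= start_offset < n:
        raise ValueError("start_offset must lie inside the row")
    if taken_pos is not None and not start_offset <= taken_pos < n:
        raise ValueError("taken_pos must lie between start_offset and the row end")
    # Instructions before start_offset and after the control transfer are invalidated
    last = n - 1 if taken_pos is None else taken_pos
    return last - start_offset + 1


# First row: starts aligned, taken branch in the second position
first = simple_fetch_valid(start_offset=0, taken_pos=1, n=4)
# Second row: assumed target in the third position, no further control transfer
second = simple_fetch_valid(start_offset=2, taken_pos=None, n=4)
print(first + second, "of", 2 * 4, "instructions are valid")
```

Both sources of loss appear here: the first row loses instructions after the taken branch, and the second row loses instructions before the unaligned target.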
Extended Cache
One way to reduce the chance that instructions will be lost from an unaligned target address of a control transfer instruction is to extend the instruction cache line size beyond the width of the fetch block. To avoid lost instructions on sequential reads that are not block aligned, the instruction fetcher must be able to save the last n - 1 instructions in a row and combine them with instructions that are read the next cycle. Instructions are lost due to an unaligned target address only when there is a control transfer to the last n - 1 instructions in a cache row. There is no need to save any instructions this cycle because the line can be re-read and still be able to return four instructions.
Self-Aligned Cache
The target alignment problem can be solved completely in hardware with a self-aligned instruction cache. The instruction cache reads and concatenates two consecutive rows within one cycle so as to always be able to return n instructions. To implement a self-aligned cache, the hardware must either use a dual-port instruction cache, perform two separate cache accesses in a single cycle, or split the instruction cache into two banks. Using a two-way interleaved (i.e., two banks) instruction cache is preferred for both space and timing reasons [5, ...].
Prefetching
All of the above cache types can be used in conjunction with prefetching. Prefetching helps improve fetching performance, but fetching is still limited because instructions after a control transfer must be invalidated.
The fetch width q, with q ≥ n, is the number of instructions that are examined for a control transfer. Let p be the size of the prefetch buffer. After the instruction fetcher searches up to q instructions for a control transfer, valid instructions are stored into a prefetch buffer. Each cycle, the instruction decoder removes the oldest n instructions from the prefetch buffer. In essence, the prefetch buffer enables an average performance closer to the larger expected run length of q instructions compared to n instructions.
Figure 5 shows an example using prefetching with n = 4, q = 8, and p = 4. Starting with an empty prefetch buffer, there are seven valid instructions before a branch (this example shows a complete block of q = 8 instructions returned by the instruction cache to the instruction fetcher). Four instructions are used in this cycle, while the remaining three valid instructions are put in the prefetch buffer for later use. In the next cycle, a block of instructions is read starting with the target address of the branch. Only two instructions are valid because a call instruction was detected. As a result, three instructions from the buffer and the first add instruction are used, while the remaining call instruction is put into the prefetch buffer.
Dual Branch Target Buffer
The purpose of a BTB is to predict the target address of the next instruction given the address of the current instruction. The dual branch target buffer (DBTB) takes this idea one step further. Given the current PC, the DBTB predicts the starting address of the following two lines, which overcomes the limitation of a single prediction of the BTB. Using the predicted addresses for the next two lines, a dual-ported instruction cache is used to simultaneously read them. Hence, the first line may have a control transfer without requiring another cycle to fetch the subsequent line.
The DBTB is indexed by the starting address of the last row currently being accessed in the instruction cache (i.e., the current PC). The entry read from the DBTB can be viewed as two BTB entries, BTB1 and BTB2. The DBTB entry indexed may match both in BTB1 and BTB2, in one or the other, or in neither. This allows a single DBTB entry to be shared between two different source PCs. Although physically they are one entry, logically they are separate. Figure 6 is a block diagram of a DBTB entry and shows how it is used in determining the starting addresses of the following two rows, PC1 and PC2. The tag of the current PC is checked against the PC tag found in BTB1. If it matches, then the predicted PC1 found in BTB1 is used. Otherwise, the prediction is to follow through to the next row of the instruction cache. If the value predicted for PC1 matches the value in BTB2, then the prediction for PC2 in BTB2 is used; otherwise, PC2 is predicted to be the next row after PC1. The exit position in a DBTB entry indicates where the control transfer (or follow through) is predicted to occur. The DBTB entry also contains branch prediction information about all the potential branches in the referenced line. It may contain no information at all, a one-bit prediction, a two-bit saturating prediction, or information for other branch prediction mechanisms.
To save space, an alternative design of the DBTB would be to logically unify BTB1 and BTB2. A block diagram is shown in Figure 7. Only one PC source can be valid, so only one PC tag needs to be stored. In addition to the space savings, the time it takes for PC2 to be ready is reduced because the predicted PC1 does not need to be checked against the tagged PC1 in BTB2. As a result, the logically unified DBTB's critical path is the same as that of a standard BTB. This improvement may be critical to a processor's cycle time.
The drawback is that BTB2 must be invalidated to reflect a follow-through prediction when BTB1 is updated, which can reduce prediction accuracy. On the other hand, a BTB2 misprediction does not need to invalidate BTB1.
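The two-tagged lookup described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the hardware design itself: the names `DBTBEntry`, `predict`, and `next_row` are hypothetical, addresses are counted in instructions, a row holds `ROW = 4` instructions, the table is direct-mapped by `pc % entries`, and the full PC is stored as the tag. Exit positions and branch prediction bits are omitted.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

ROW = 4  # instructions per cache row (assumed for illustration)


@dataclass
class DBTBEntry:
    btb1_tag: Optional[int] = None  # source PC this BTB1 half was trained for
    btb1_target: int = 0            # predicted PC1
    btb2_tag: Optional[int] = None  # PC1 value this BTB2 half was trained for
    btb2_target: int = 0            # predicted PC2


def next_row(pc: int) -> int:
    """Follow-through prediction: start of the row after the one containing pc."""
    return (pc // ROW + 1) * ROW


def predict(dbtb: Dict[int, DBTBEntry], entries: int, pc: int) -> Tuple[int, int]:
    """Predict the starting addresses (PC1, PC2) of the next two rows."""
    entry = dbtb.get(pc % entries, DBTBEntry())
    # BTB1: use the stored target only if the current PC matches its tag
    pc1 = entry.btb1_target if entry.btb1_tag == pc else next_row(pc)
    # BTB2: use the stored target only if it was trained for this PC1
    pc2 = entry.btb2_target if entry.btb2_tag == pc1 else next_row(pc1)
    return pc1, pc2


table: Dict[int, DBTBEntry] = {}
print(predict(table, 256, 8))  # no entry: both predictions follow through
table[8] = DBTBEntry(btb1_tag=8, btb1_target=0, btb2_tag=0, btb2_target=12)
print(predict(table, 256, 8))  # both halves match
```

Because the two halves are checked independently, an entry can hit in BTB2 while missing in BTB1, which is how one physical entry can serve two different source PCs.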
The DBTB has many different configurations, many similar to those of the traditional BTB. Its options include the number of entries, associativity, branch prediction, and a one- or two-tagged system. A DBTB can be used with a simple, extended, or self-aligned cache, and with or without prefetching. Figure 8 is a fetching example without prefetching using the DBTB. In the previous cycle, BTB1 predicted PC1 to be at Address 0, and BTB2 predicted Line 0 to exit at position 1 to PC2 at Address 12. While Line 0 and Line 3 are being read, PC2 is used to index into the DBTB to predict the next PC1 and PC2. Although Line 0 has a jump, a full fetch block of four instructions is returned.
A mathematical model for each type of fetching mechanism from the previous section is presented in this section. The model allows the expected instruction fetching performance to be calculated. It accounts for invalid instructions, but not for mispredicted instructions. Only the fetching bandwidth is considered, and not other effects from a processor's pipeline. In the next section, the expected performance from this model will be compared with results from simulation.
Simple Cache
Let $L_i$ be the probability that a control transfer occurs at position i, and $E_i$ be the probability that the starting address in the block is at position i.
where $c(n, b)$ is the probability of a control transfer in a block,
The total expected instructions fetched per cycle for simple fetching is 
Equation 6 is the weighted sum of the expected number of instructions at each possible starting position.
Extended Cache
The probability the starting address in the block is at position i for the extended cache is 
The probability of a control transfer in a block for the extended cache, given the extended cache line size m, with m ≥ n, is
The expected instructions fetched per cycle is 
With the cache line size extended beyond the desired n instructions, if there is a control transfer, n out of m times it is expected to transfer into the last n instructions of the block, which behaves as the simple fetching case where fewer than n instructions are available. The rest of the time, n instructions will be available.
Self-Aligned Cache
The probability of a control transfer in a block for the self-aligned cache is $c_{align}(n, b) = 1 - (1 - b)^n$.
The expected instructions fetched per cycle for the self-aligned cache is the expected block run length of width n, r(n, b), because n instructions will always be read from the instruction cache.
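As an illustration, the expected block run length can be computed numerically. The sketch below assumes that every instruction independently transfers control with probability b, which is the assumption implied by the expression for the probability of a control transfer in a self-aligned block. The function names and the closed form checked at the end are our own derivation under that assumption, not formulas quoted from the paper.

```python
def c_align(n: int, b: float) -> float:
    """Probability that a block of n instructions contains a control transfer."""
    return 1 - (1 - b) ** n


def block_run_length(n: int, b: float) -> float:
    """Expected instructions up to and including the first control transfer,
    capped at the block width n.

    The i-th instruction of the block is valid when none of the i - 1
    instructions before it transferred control, i.e. with probability
    (1 - b) ** (i - 1). Summing over i gives the expectation.
    """
    return sum((1 - b) ** (i - 1) for i in range(1, n + 1))


b = 1 / 8
for n in (2, 4, 8, 16, 32):
    # Under the independence assumption the sum collapses to c_align(n, b) / b
    print(n, round(block_run_length(n, b), 3), round(c_align(n, b) / b, 3))
```

Under this assumption the expected fetch rises toward 1/b as n grows but always stays below both n and 1/b.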
Prefetching
All three cache techniques can be used in combination with prefetching. The fetch and decode widths are not equal with prefetching. As a result, q, the fetch width, may now be substituted for n, the decode width, as a parameter to some of the equations previously defined that did not use prefetching, as will be indicated.
Let $I^{type}_i$ be the probability that exactly i instructions are available up to and including a control transfer instruction or the end of the block, where type is one of the three different cache types: simple, extend, or align. The equations for the three types are:
Let $P_i$ be the probability that the prefetch buffer contains i instructions. Figure 9 illustrates the transition from one buffer state to another. It does not show all possible transitions. The prefetch buffer increases in size when the number of new instructions is greater than n. It will remain in the same state if exactly n new instructions are available. It decreases in size when fewer than n new instructions are available.
The zero and full boundary states have additional possible transitions.
Notice that Equation 16 depends only on the last n - 2 prefetch buffer state sizes, since if there are n - 1 or more instructions in the prefetch buffer, n instructions are guaranteed for that cycle.
A problem can arise with prefetching and the simple cache type. The prefetch buffer can be full, and instructions from the fetch block go unused. If this happens, the starting address of the next cycle will not be the first position, so q instructions will not be available. Therefore, Equation 3 needs to be modified to include this effect, unless a hardware solution similar to that of the extended cache is included. The hardware would need to save instructions left over on a prefetch buffer overflow for the following cycle. If this is done, Equation 12 is an accurate model.
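The buffer state transition, including the two boundary states, can be sketched as a single step function. This is an illustrative sketch: `prefetch_step` is a hypothetical name, and the treatment of overflow (instructions beyond the buffer size p are simply not stored this cycle) is an assumption on our part.

```python
from typing import Tuple


def prefetch_step(buffered: int, new_valid: int, n: int, p: int) -> Tuple[int, int]:
    """Advance the prefetch buffer by one cycle.

    buffered:  instructions already in the prefetch buffer (0..p)
    new_valid: valid instructions found in this cycle's fetch block
    n:         decode width
    p:         prefetch buffer size
    Returns (instructions delivered to the decoder, new buffer occupancy).
    """
    available = buffered + new_valid
    delivered = min(n, available)   # the decoder takes the oldest n, if present
    leftover = available - delivered
    new_buffered = min(p, leftover)  # full boundary: the buffer cannot exceed p
    return delivered, new_buffered


# The two cycles of the prefetching example with n = 4 and p = 4
print(prefetch_step(buffered=0, new_valid=7, n=4, p=4))
print(prefetch_step(buffered=3, new_valid=2, n=4, p=4))
```

With seven valid instructions and an empty buffer, four are delivered and three are buffered; with two valid instructions in the next cycle, four are again delivered and one remains, matching the example given earlier for n = 4, q = 8, and p = 4.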
Dual Block Fetching
Fetching two blocks per cycle (via the DBTB) with the simple, extended, or self-aligned cache without prefetching is simply twice the expected value for half the block size, 
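For the self-aligned cache, this relationship can be illustrated numerically. The sketch below again assumes that each instruction independently transfers control with probability b. The function names are hypothetical, and the simple and extended cases are not covered because they need their own expected-fetch formulas.

```python
def block_run_length(n: int, b: float) -> float:
    """Expected valid instructions in one self-aligned block of width n,
    assuming each instruction transfers control independently with probability b."""
    return sum((1 - b) ** (i - 1) for i in range(1, n + 1))


def dual_block_fetch(n: int, b: float) -> float:
    """Expected fetch when the decode width n is filled by two blocks of width n / 2."""
    if n % 2 != 0:
        raise ValueError("n must be even so that it splits into two blocks")
    return 2 * block_run_length(n // 2, b)


b = 1 / 8
for n in (8, 16, 32):
    print(n, round(block_run_length(n, b), 2), round(dual_block_fetch(n, b), 2))
```

In this sketch the dual-block value exceeds the single-block value for the same total width and is bounded by 2/b rather than 1/b.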
Evaluation
Table 1 lists the evaluation of the simple, extended, and self-aligned cache types without prefetching for b = 1/8 and for different values of the decode block width n. The value chosen for b, the probability of
Although this large fetching width achieves excellent fetching performance, it may not be practical to implement in hardware. Figure 10 shows the expected instruction fetch for the simple, extended, and self-aligned cases without prefetching for b = 1/8. Although a fetching rate of n instructions per cycle is ideally desired for a block size of n, the difference between this ideal and the actual rate increases as n increases. Instead, the rate approaches 1/b (8 in this instance) in each case. The disadvantage of the simple and extended cache techniques is the lower rate at which they reach the limit: it takes a significantly larger value of n to reach the same expected fetch performance. For the extended case with m = 2n, the value is the average of the values for the align and simple cases for each n.
Figure 11 shows the expected instruction fetch for the self-aligned cache with prefetching for b = 1/8 and n = 8, varying p and q. The curves for the different values of q are identical for p ≤ q - n. After that point, each curve branches out and approaches its r(q, b) limit. To reach the ultimate limit of 1/b, both q and p need to increase. Figure 12 shows the expected instruction fetch for the simple cache with prefetching for b = 1/8 and n = 8, varying p and q. Unlike the self-aligned case, each q curve is distinct and greater than the previous q curve. Even without prefetching (p = 0), the values are not identical because the increase in the line size to q reduces the chance that an unaligned target address will not be able to return n instructions. Figure 13 shows the expected instruction fetch for the simple cache, extended cache, and self-aligned cache with prefetching for b = 1/8, n = 8, q = p + n, and m = 2q (extended only) versus p. Similar to the cases without prefetching, the extended cache's fetching performance is between the simple and self-aligned cache techniques. Figure 14 shows the expected instruction fetch for the simple cache, extended cache, and self-aligned cache for dual block fetching with prefetching. The parameters are b = 1/8, n = 16, q = p + n, and m = 2q (extended only) versus p. The plot shows that a simple cache performs significantly less well than the self-aligned and extended caches.
The plots presented show that prefetching can significantly increase expected fetching. As the fetch width, q, increases, the expected fetch rate reaches a higher plateau. Unfortunately, with b = 1/8 and a decode width of eight, an extensive amount of hardware (a fetch width of sixteen, a prefetch buffer size of thirty-two, and a self-aligned cache) is required to reach almost 7 instructions fetched per cycle, still noticeably below the goal of 8 instructions fetched per cycle. It is difficult to achieve a high fetching rate
Results and Discussion
This section compares the expected instruction fetch with the actual performance of simulations from the SPEC95 benchmark suite running on the SPARC architecture [6]. The suite was compiled using the SunPro compiler with standard optimizations (-O). Programs were simulated using the Shade instruction-set simulator [4] and ran until completion or the first four billion instructions. Table 2 shows the predicted and observed instruction fetch count results of these programs using the three cache techniques without prefetching (n = 4). Table 3 and Table 4 show the predicted and observed instruction fetch count results using the three cache techniques with prefetching (n = 4, q = 8, p = 8 and n = 8, q = 16, p = 16, respectively). The first column in each table shows the value observed for 1/b, the average run length. The average dynamic run length of a program is the total number of instructions executed divided by the number of instructions that transferred control. The observed value of b for each program was used in its calculation of the expected fetch.
A concern with the fetching model presented is that the distribution of run lengths is expected to be uniform, but in observing actual program behavior, the distribution is not uniform. It does, however, generally follow the expected distribution. When the expected fetch is calculated via a weighted sum, the outcome is reasonably accurate. As can be seen in the tables, the difference between the predicted and observed fetch count is usually within a few percent.
The expected and observed performance for dual block fetching without prefetching is exactly twice the values listed in Table 2 for n = 8. Table 5 lists the performance of SPEC95 for dual block fetching with prefetching (n = 8, q = 16, p = 8). The instructions fetched per cycle (IFPC) is listed as well as the instructions per fetch block (IPB). The results show that a fetching rate close to ideal (n = 8) is possible when a two-block fetching mechanism, such as the dual branch target buffer, is used with an extended or self-aligned cache and prefetching. In this case, the fetching hardware mechanism no longer restricts instruction fetching, and therefore no longer restricts the possibility of exploiting instruction-level parallelism and achieving a high instructions-per-cycle execution rate. Using a 256-entry, direct-mapped, two-tagged DBTB, we observed that the miss rate was between 10% and 20% for most of the SPEC95 benchmarks. Also, the miss rate for BTB2 was usually slightly higher than for BTB1. BTB1 and BTB2 each behaved similarly to a standard BTB. Although perfect branch accuracy was assumed in Table 5 (to make a fair comparison to the other data), it is important to realize that accurate branch prediction becomes critical since more branches need to be predicted accurately per fetch block.
The overall performance will be much lower than the fetching rates shown when branch prediction, cache misses, execution, etc., of a real microprocessor are simulated. In addition, the difference between the values will be much smaller. These facts do not devalue the results presented. These results show the upper limit achievable using the different fetching mechanisms presented, both in theory and in simulation. No doubt, as branch prediction accuracy, cache performance, and execution performance continue to improve in the future, the demand for adequate fetching mechanisms will increase.
Conclusion
Many programs have sufficient ILP to execute eight instructions per cycle, and a superscalar microprocessor can be designed to decode and execute eight instructions per cycle. Unfortunately, because of control transfers in programs, a simple fetching mechanism cannot meet this high demand. In fact, it falls far short.
The extended and self-aligned cache techniques also showed extremely poor instruction fetching performance, although the self-aligned cache always performed better than the other two. Prefetching helped the situation, and made it possible to approach the upper limit imposed by the probability of a control transfer in a particular program. Using the dual branch target buffer, simulations showed that it is possible to achieve performance beyond the 1/b limit. Nevertheless, the fetching performance of a dual branch target buffer is limited to 2/b instructions per cycle.
Our use of models that predict the behavior of fetching performance has proven invaluable in the study of instruction fetching. The models enable the production of graphs that clearly show the relationship between different fetching options without running hundreds of simulations. They can be helpful in the design of new superscalar microprocessors to determine which technique will meet the performance objective.
