Abstract
Introduction
cache for instruction fetch. A serial instruction cache separates its tag and data lookups into separate cycles, so only the correct way needs to be driven. We show that this achieves the same energy savings as way prediction, with only a 1% loss in performance when compared to way prediction fetch architectures.
First, we present a way prediction architecture for a high bandwidth decoupled fetch architecture, where multiple way predictions are provided per cycle. In comparison, prior way prediction architectures are coupled to the instruction cache, providing a single way prediction per cycle. Then, we compare the performance of the serial cache architecture to the NLS style of way prediction used in Alpha processors, as well as the multiple way prediction architecture.
Since we examine results using a processor simulator, we arrive at a different conclusion than prior serial instruction fetch research by Inoue et al. [9]. They examined instruction cache miss rates and concluded that way prediction was advantageous over a serial cache design due to the additional pipeline stage it introduces from serializing the tag and data access. Our results show that for a moderately pipelined architecture (8 cycles from fetch to execute), the performance loss of using a serial design over way prediction is only 1% on average.
Section 2 presents the serial fetch architecture and the way prediction architecture we evaluate in this study. Section 3 covers the methodology of our simulations. Section 4 presents results for these architectures, and we conclude in Section 5.
Fetch Architectures
In this section, we outline two different instruction cache configurations for use with our high-bandwidth, decoupled front-end architecture.
The Decoupled Front-End
In our prior work we explored an architecture that decoupled the branch prediction architecture from the instruction fetch unit (including the instruction cache). The branch predictor and instruction fetch unit are separated by a queue of fetch addresses (branch predictions) called the Fetch Target Queue (FTQ) [15]. The FTQ has two primary functions: it provides latency tolerance between the branch prediction architecture and the instruction fetch unit, and it provides a glimpse at the future fetch stream of the processor.
The ability of the FTQ to tolerate latency between the branch prediction architecture and instruction cache enables a multilevel branch predictor hierarchy called the Fetch Target Buffer (FTB) [14]. The FTB combines a small first level predictor that scales well to future technology sizes with a larger, pipelined second level structure, which provides the capacity needed for accurate branch prediction.
With sufficient branch predictions stored in the FTQ, the architecture is able to tolerate the latency of the second level branch predictor access while the instruction fetch unit continues consuming predictions already stored in the FTQ.
This decoupled design provides us with great flexibility in the selection of an instruction cache.
We are able to pipeline the instruction cache without impacting the branch prediction architecture.
Moreover, our design allows us to easily scale the number of ports on the instruction cache to provide more instruction fetch bandwidth.
Instruction Cache Design
Instruction cache performance is vital to the processor pipeline. Associativity is a useful technique to improve cache performance by reducing conflict misses in the cache. The conventional set-associative cache design probes the tag and data components of the cache in parallel to reduce the cache access time. However, this approach wastes energy in the bitlines and sense amps of the cache as it must drive all associative ways of the data component on every access, hit or miss. We refer to this design as a parallel cache.
One alternative to this is to access the tag component of the cache first to determine what associative way of the data component should be driven. Such a cache is commonly known as a serial cache.
This design has been used for L2 caches, and recently for data caches on graphics cards [11, 8]. The Alpha 21264 [10] splits the tag and data components of its direct-mapped second level cache, effectively creating a serial L2 cache design.
Figure 1 shows one design of a serial cache that we call the multicomponent (MC) serial cache. This cache has the same tag component arrangement as a regular set-associative instruction cache, but rather than a single set-associative data component, there are a number of direct mapped data components. The 16KB 2-way associative configuration shown in Figure 1 has two direct mapped data components, each only 8KB in size. A 16KB 4-way set associative MC cache would have four 4KB direct mapped data components. Each direct mapped data component has its own decoder, sense amps, and other auxiliary structures. At most one data component is enabled on each access, depending on tag information. This determination is entirely nonspeculative: the tag access is a fully associative lookup to determine
In recent work [16], we explored selectively accessing cache ways using a decoupled MC cache to create an energy efficient instruction prefetch architecture. In this submission, we expand on that research by (1) examining the use of a serial cache design just for instruction fetch, and (2) comparing our serial fetch design to way predicted fetch architectures. These designs and their analysis are contributions over the work in [16].
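The MC serial cache lookup described above can be sketched as a toy model. This is a minimal illustration, not the simulator used in this study: the class name `MCSerialCache`, its methods, and the `data_reads` counter are hypothetical, and replacement, timing, and ports are not modelled. It only shows that the tag probe covers all ways while at most one direct mapped data component is driven.

```python
LINE_BYTES = 32  # instruction cache line size used in this study


class MCSerialCache:
    """Toy multicomponent (MC) serial cache; names are illustrative."""

    def __init__(self, size_bytes=16 * 1024, ways=2):
        self.ways = ways
        # Each way is its own direct mapped data component.
        self.component_bytes = size_bytes // ways
        self.sets = self.component_bytes // LINE_BYTES
        # tags[way][set] holds the stored tag, or None if the entry is empty.
        self.tags = [[None] * self.sets for _ in range(ways)]
        self.data_reads = 0  # number of data components driven so far

    def _split(self, addr):
        block = addr // LINE_BYTES
        return block % self.sets, block // self.sets  # (set index, tag)

    def fill(self, addr, way):
        index, tag = self._split(addr)
        self.tags[way][index] = tag

    def lookup(self, addr):
        """First cycle: probe the tags of every way.
        Second cycle: drive only the data component that hit."""
        index, tag = self._split(addr)
        for way in range(self.ways):
            if self.tags[way][index] == tag:
                self.data_reads += 1  # exactly one component enabled
                return way
        return None  # tag miss: no data component is driven
```

With the default arguments, `MCSerialCache().component_bytes` is 8192, matching the two 8KB data components of the 16KB 2-way configuration in Figure 1.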
Serial Fetch Architecture

Next Line and Set Prediction
An existing branch prediction architecture that uses an energy efficient cache was proposed by Calder and Grunwald [3]. Their Next cache Line and Set (NLS) prediction architecture predicts an index into the instruction cache rather than a branch target address. This predictor provides a pointer into the instruction cache, indicating the branch target at which to resume fetch. This form of predictor is used in the Alpha 21264 [10]. The authors refer to way prediction as set prediction in [3], but essentially the predictor determines which associative way contains the desired cache block. This way prediction, combined with an energy efficient cache design (like an MC cache), allows a single cache way to be selectively enabled, and can provide significant energy savings. They examined NLS in terms of reducing the access time for the instruction cache, but not in terms of its energy efficiency.
In this paper, we implement an NLS predictor that provides one cache block per cycle, and compare its performance to the serial cache design and to a generalized way predictor (described in the next section).
Way Prediction for High Fetch Bandwidth Architecture
One alternative to tolerating the latency of a serial cache access is to use way prediction, as done with the NLS predictor. Figure 3 demonstrates the addition of a multi-block way predictor to the high bandwidth FTB architecture [14]. To provide a high fetch bandwidth architecture with way prediction, we first need the means to compress the tag check and data lookup stages of the instruction cache access into a single stage while still using the MC cache. Calder et al. [4] have suggested using a separate way predictor with a data cache to perform such an access in parallel.
The way predictor that we use in this paper is a simple last n-bit state counter. Since each FTB prediction can potentially span up to 5 cache blocks (due to the default fetch distance stored in the FTB), we would like to provide up to 5 way predictions per cycle. Our multiple way predictor is direct mapped and indexed by the branch prediction fetch PC. Each cycle, 5 sequential predictions are read out of an entry in the way predictor. Our tagless predictor has 32K entries, and each entry has five 2-bit counters. The 2-bit counter is used to provide hysteresis for predicting the given way, since there can be conflicts in the table.
The way predictor and FTB are accessed in parallel, and the way prediction is stored in the FTQ until it can be consumed by the instruction cache. One alternative to the use of a separate structure is to place the way prediction directly in the FTB, trading the additional area required by the way predictor for the access time impact that would result from widening the FTB.
In Figure 3, the tag component of the instruction cache grabs the current cache block to be fetched from the FTQ. The way prediction is consumed by a data component and the way compare hardware.
The data component selected by the way predictor will drive the cache block corresponding to the fetch prediction. The tag component searches all cache ways of the line corresponding to the current cache block. The way compare hardware will determine whether or not the way indicated by the tag component matches the way prediction (i.e. whether or not the correct data component was enabled).
If a misprediction has occurred, the correct data component will be accessed in the following cycle (determined via the tag component access). If the way prediction was correct, the instruction cache access will have only taken a single cycle. On a tag miss, the block is brought in from the L2 cache.
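The predictor organization and the verify step described above can be sketched as follows. This is an illustrative model only. The table size (32K tagless entries, five 2-bit counters each) comes from the text; the index hash, the initial counter value, the use of the counter's high bit as the predicted way of a 2-way cache, and all names (`MultiWayPredictor`, `access_cycles`) are assumptions made for the sketch.

```python
ENTRIES = 32 * 1024  # tagless, direct mapped table (from the text)
SLOTS = 5            # one 2-bit counter per cache block of a fetch block


class MultiWayPredictor:
    """Hypothetical multiple way predictor for a 2-way MC cache."""

    def __init__(self):
        # Assumed initial state: 1, i.e. weakly predicting way 0.
        self.table = [[1] * SLOTS for _ in range(ENTRIES)]

    def _index(self, fetch_pc):
        # Assumed hash: drop the instruction-offset bits, keep the low bits.
        return (fetch_pc >> 2) % ENTRIES

    def predict(self, fetch_pc):
        """Read five sequential way predictions (0 or 1) from one entry."""
        return [counter >> 1 for counter in self.table[self._index(fetch_pc)]]

    def update(self, fetch_pc, slot, actual_way):
        """Saturating 2-bit update; one conflicting update does not flip
        a strongly biased prediction (hysteresis)."""
        entry = self.table[self._index(fetch_pc)]
        if actual_way == 1:
            entry[slot] = min(3, entry[slot] + 1)
        else:
            entry[slot] = max(0, entry[slot] - 1)


def access_cycles(predicted_way, tag_way):
    """Way compare: 1 cycle if the enabled data component was correct,
    2 cycles if the correct component must be accessed the next cycle.
    A tag miss (tag_way is None) goes to the L2 and is not modelled."""
    if tag_way is None:
        raise LookupError("tag miss: block is brought in from the L2 cache")
    return 1 if predicted_way == tag_way else 2
```

For example, after two `update(pc, 0, 1)` calls the counter for slot 0 saturates at 3, so a single later `update(pc, 0, 0)` leaves the prediction for that slot at way 1.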
Prior Way Prediction Research
Inoue et al. [9] examined using an MRU algorithm to predict what way of an associative cache to access.
They also compared their results to a serial access cache (what they refer to as a phased cache). They provided results using a cache simulator, not a processor simulator, and therefore did not examine the impact of branch prediction, cache pipelining, or fetch bandwidth on their architecture. They also did not address the impact of complexity on the way prediction architecture. Finally, their way predictor, based on an MRU counter, provides only a single way prediction each cycle. They concluded from their cache simulation results that a way prediction architecture was advantageous over the serial approach.
Our results show that this is not the case when examining results from a detailed processor simulation.
Solomon et al. [18] examined the use of a serial cache along with their Micro-Operation Cache, but did not compare the performance of such a cache to a parallel or way predicted cache configuration, which is the focus of our paper. The main thrust of their study was the evaluation of the Micro-Operation Cache, and they provide no results on the relative performance of their serial cache organization to other cache structures.
Powell et al. [13] have also examined the impact of way prediction on the energy consumption of the instruction cache, making use of a conventional, coupled branch prediction architecture. Their branch prediction architecture is similar to the NLS predictor in that it also only provides a single way prediction (cache block) per cycle, whereas the way prediction architecture we examine in this paper can perform multiple way predictions (fetch multiple blocks) per cycle. They associate a way prediction with the PC of the previous cache address to account for the branch predictor/instruction cache coupling. In comparison, the way prediction architecture shown in Figure 3 features a way predictor that can scale to match the bandwidth of a more aggressive fetch architecture, without sacrificing predictor accuracy.
The decoupled design also provides more flexibility in the design of such a way predictor (as in our decoupled branch predictor [14] ), and could even provide an opportunity to check mispredicted ways using idle cache tag ports (but this is not explored in this paper). A decoupled architecture is also able to take advantage of way misprediction stalls, as it provides an opportunity for the branch prediction architecture to continue ahead of the instruction cache.
Methodology
The simulator used in this study was derived from the SimpleScalar/Alpha 3.0 tool set [2], a suite of functional and timing simulation tools for the Alpha AXP ISA. The timing simulator executes only user-level instructions, performing a detailed timing simulation of an aggressive 8-way dynamically scheduled microprocessor with two levels of instruction and data cache memory. Simulation is execution-driven, including execution down any speculative path until the detection of a fault, TLB miss, or branch misprediction.
To perform our evaluation, we collected results for 19 of the SPEC2000 benchmarks (selected at random). The programs were compiled on a DEC Alpha AXP-21164 processor using the DEC C and C++ compilers under the OSF/1 V4.0 operating system using full compiler optimization (-O4 -ifo). For each benchmark, two billion instructions were executed (fast forwarded) before actual simulation. We report results for simulating each program for 200 million instructions after fast forwarding. In all cases, we use the reference data sets to simulate results. Table 1 shows the benchmarks used in this study.
Baseline Architecture
Our baseline simulation configuration models a next generation out-of-order processor microarchitecture. We have selected parameters to capture underlying trends in microarchitectural design. The processor has a large window of execution; it can fetch up to 8 instructions per cycle. It has a 128 entry re-order buffer with a 32 entry load/store buffer. We simulated perfect memory disambiguation (perfect Store Sets [5]). Therefore, a load only waits on a store it is truly data dependent upon. To compensate for the added complexity of disambiguating loads and stores in a large execution window, we increased the store forward latency to 3 cycles.
There is an 8 cycle minimum branch misprediction penalty. The processor has 8 integer ALU units. We use a 128 entry 4-way associative FTB with a 2K entry 4-way associative second level FTB.
Each fetch block stored in the FTB can span up to five sequential cache blocks. We use the McFarling bi-modal gshare predictor [12], with an 8K entry gshare table and a 64 entry return address stack in combination with the FTB. We use a 32 entry FTQ in conjunction with the FTB.
Memory Hierarchy
We rewrote the memory hierarchy in SimpleScalar to model bus occupancy, bandwidth, and pipelining of the second level cache and main memory. This study makes use of a 32KB 4-way set associative data cache and a 16KB 2-way set associative instruction cache (each with 32-byte lines). Both caches are dual-ported.
The second level cache is a unified 1 MB 4-way set associative pipelined L2 cache with 64-byte lines. The L2 hit latency is 12 cycles, and the round-trip cost to memory is 100 cycles. The L2 cache has only a single port. The L2 cache is pipelined to allow a new request every 4 cycles, so the L2 bus can transfer 8 bytes/cycle. The L2 bus is shared between instruction cache block requests and data cache block requests.
Energy Model
The energy data we need to generate results is gathered using the CACTI cache model version 2.0 developed by Reinman and Jouppi [17]. CACTI 2.0 contains a detailed model of the wire and transistor structure of on-chip memories, verified by HSPICE. We modified CACTI 2.0 to model the timing and energy consumption of the front-end structures of our architecture. CACTI 2.0 uses data from 0.80 µm process technology and can then scale timing data by a constant factor to generate timings for other process technology sizes. We examine timings for the 0.10 µm process technology size, which makes use of a 1.1 V Vdd.
CACTI 2.0 reports energy data for successful cache accesses. We modified CACTI 2.0 to report energy data for successful accesses, misses, tag probes, and writes. We further modified CACTI 2.0 to estimate the power consumption of all front-end structures, including the FTB, FTQ, instruction cache, L2 cache (a unified cache, but we only counted power from instruction cache misses, not from data cache misses), and other auxiliary structures. For each, we modified the BITOUT, ADDRESS BITS, and block size parameters appropriately. Then we track the number of hits, misses, tag probes, and writes to these structures in SimpleScalar to compute the overall energy dissipation for these structures.
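The accounting step described above, combining per-event energies from the cache model with event counts from the simulator, can be sketched as a simple weighted sum. This is a minimal sketch under assumptions: the function name, the dictionary layout, and the nanojoule unit are hypothetical, and the numbers in the usage note are placeholders rather than CACTI output or measured counts.

```python
def front_end_energy_joules(event_counts, event_energy_nj):
    """Sum count * per-event energy over every structure and event type.

    event_counts:    {structure: {event: number of occurrences}}
    event_energy_nj: {structure: {event: energy per occurrence, in nJ}}
    Both layouts are illustrative, not the format used by the authors.
    """
    total_nj = 0.0
    for structure, counts in event_counts.items():
        for event, occurrences in counts.items():
            total_nj += occurrences * event_energy_nj[structure][event]
    return total_nj * 1e-9  # convert nanojoules to joules
```

As a usage example with placeholder values, counts of `{"icache": {"hit": 1000, "miss": 10}, "ftb": {"hit": 500}}` and per-event energies of `{"icache": {"hit": 0.5, "miss": 1.0}, "ftb": {"hit": 0.2}}` nJ give 610 nJ in total.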
When we report energy dissipation results in Joules, this includes the energy dissipated by all the above listed front-end structures.
Figure 4: IPC results for five architectures: a Parallel cache pipelined over two cycles and a parallel cache with a single cycle access, both using a 2-level FTB; the Serial MC instruction cache with a 2-level FTB; the NLS architecture that provides a single cache block each cycle using an MC cache; and a way prediction architecture using an MC instruction cache with a 2-level FTB.
given benchmark on Serial and Way Prediction will depend on whether branch mispredictions or way mispredictions are more prevalent. A benchmark like equake, which has a relatively low branch misprediction rate and a relatively high way misprediction rate, actually sees better performance from the Serial architecture. The opposite is true of a benchmark like galgel, which has an extremely low number of way mispredictions. The way predictor achieves nearly 95% accuracy on average for the benchmarks we examined. Figure 6 shows the results for perfect way prediction (i.e. no mispredictions) compared to serial and the way prediction architecture, all for a smaller, but more associative instruction cache. Results show that perfect way prediction provides only a 1% improvement in performance over the way prediction results.
Results
It is interesting to note that a serial instruction cache actually has the potential to outperform a way predicted instruction cache (i.e. in the case of equake). On a way misprediction, the WayPred architecture will also take 2 cycles to access the instruction cache, but it will have wasted a cycle (and a data component cache port) on a mispredicted access. If way mispredictions are frequent and branch mispredictions are infrequent, then a serial instruction cache can outperform the more complex design of a way predicted instruction cache.
Inoue et al. [9] found similar way prediction accuracy for the instruction cache, reporting an average 96% accuracy for the benchmarks they examined. They concluded that using a way prediction architecture was advantageous over the serial approach, since the serial architecture would introduce an additional pipeline stage to provide the tag and data component serialization. They only examined the performance of serial and way prediction in terms of miss rates. In comparison, our results show that adding the additional pipeline stage for the serial fetch architecture degrades performance by less than 1% in comparison with the way prediction architectures. These results assume an 8 cycle minimum branch misprediction penalty, which is a very conservative misprediction penalty considering the pipeline depth of current and future processors. As the pipeline depth increases, the IPC difference between serial and way prediction will be even smaller.
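The tradeoff described above can be illustrated with a first-order cycle count. This is a deliberate simplification, not the detailed processor simulation used for the results: it assumes the serial cache costs one extra front-end stage on every branch misprediction, that each way misprediction costs one replay cycle, and that nothing overlaps. The function names and example inputs are hypothetical.

```python
def extra_cycles_serial(branch_mispredicts, extra_stages=1):
    """Serial cache: the pipelined tag-then-data access lengthens the
    branch misprediction pipeline by one stage per misprediction."""
    return branch_mispredicts * extra_stages


def extra_cycles_way_pred(block_fetches, way_accuracy, replay_cycles=1):
    """Way prediction: each mispredicted way wastes one access and
    re-reads the correct data component in the following cycle."""
    return block_fetches * (1.0 - way_accuracy) * replay_cycles


def serial_wins(branch_mispredicts, block_fetches, way_accuracy):
    """True when the simplified model favors the serial cache."""
    return (extra_cycles_serial(branch_mispredicts)
            < extra_cycles_way_pred(block_fetches, way_accuracy))
```

Under this model, a program with few branch mispredictions but a below-average way prediction accuracy favors the serial cache, which is the qualitative behavior reported for equake; the reverse case corresponds to galgel.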
Although the Serial and Way Prediction architectures have similar energy savings and performance, the Way Prediction architecture has more complexity. Way Prediction requires the use of an auxiliary prediction structure (the way predictor) that must be accurate enough to avoid the performance impact of way mispredictions. It is a speculative technique, and therefore requires verification and recovery hardware, even if predictor accuracy is extremely high. Not only does this additional hardware impact the timing and area of the cache, but it also impacts the complexity of the architecture, as the front-end must be able to stall on a way misprediction. Despite a relatively small loss in performance (around 1%) the serial fetch architecture is still an attractive energy efficient design, which does not have the added complexity or hardware of way prediction and misprediction recovery.
Summary
In this paper we have compared the performance of several different front-end architectures. Both our high bandwidth way prediction architecture and our serial fetch architecture successfully combine high bandwidth branch prediction with a scalable and energy efficient instruction cache.
While way prediction can reduce the length of the misprediction pipeline of an architecture, this benefit may not outweigh the architectural costs required to implement way prediction for instruction fetch.
The reduction of a single pipeline stage from the front-end of the machine provides an improvement in IPC of only around 1% (2% for perfect way prediction, i.e. no way mispredicts). However, this small improvement carries with it the additional complexity that is required to verify predictions and recover from way mispredictions. The energy benefits of way prediction and the serial fetch architecture are similar. In fact, the way prediction architecture has the potential to expend slightly more energy, as one must consider the energy dissipated by the way predictor and by mispredicted instruction cache data component accesses.
The way predictor architecture includes a number of additional structures and implementation complexity that do not occur in the serial cache architecture. First, there must be extra hardware to detect and recover from way mispredictions, including additional control hardware to determine whether a data component access is to obtain an address from the branch prediction engine or from the way misprediction recovery controller. Second, there must be extra hardware to selectively stall results from different data component ports. It may be the case that in a dual ported cache, one port suffered a mispredicted way and the other had a correctly predicted way. Finally, one must consider the extra hardware required to perform the actual way predictions and perform the updates to the way predictor tables.
Our results differ from the prior work [9], where the use of a way prediction cache architecture was recommended over a serial access cache architecture based on the additional access time for the serial access. However, this additional access time can be pipelined, resulting in what we have found to be a relatively minimal effect on performance. The serial cache is an attractive design for offering simple and energy efficient instruction fetching. Future work will explore tradeoffs between energy efficient data cache techniques. The data cache is not as tolerant of latency (i.e. additional pipeline stages) as the instruction cache, and cache misses can be better tolerated through the out-of-order window.
