The performance of superscalar processors is more sensitive to memory system delay than that of their single-issue predecessors.
Introduction
Superscalar processors can potentially deliver more than a fivefold speedup over conventional single-issue processors [1].
With the total execution cycle count dramatically reduced, each cycle becomes more significant to the overall performance.
Because each data cache miss can introduce many extra execution cycles, a superscalar processor can easily lose the majority of its performance to the memory hierarchy.
Out-of-order execution can partially hide the miss penalty [2]. In this paper, we focus on the design of data access microarchitectures to support data prefetching. More specifically, we address the problem of cache pollution due to data prefetching.
Cache pollution can translate into cache misses, which in turn defeat the purpose of prefetching.
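The sensitivity argument above can be made concrete with a little arithmetic. The following is only a sketch: it assumes that miss stalls do not overlap with useful execution, and the cycle counts, miss count, and miss penalty are hypothetical values chosen for illustration, not measurements from this paper.

```python
def memory_stall_fraction(base_cycles: float, misses: int, miss_penalty: int) -> float:
    """Fraction of total execution time spent stalled on data cache misses.

    Assumes every miss stalls the processor for the full penalty
    (no overlap with useful work), which is a simplification.
    """
    stall_cycles = misses * miss_penalty
    return stall_cycles / (base_cycles + stall_cycles)


# Hypothetical workload: 10,000 misses at 10 cycles each in both cases.
# The superscalar case executes the same work in one fifth of the base cycles.
single_issue = memory_stall_fraction(base_cycles=1_000_000, misses=10_000, miss_penalty=10)
superscalar = memory_stall_fraction(base_cycles=200_000, misses=10_000, miss_penalty=10)
```

Under these assumed numbers, the same miss traffic accounts for roughly one eleventh of the single-issue execution time but one third of the superscalar execution time.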
The ability to minimize the degrading effect of cache pollution by increasing the cache dimension (cache size and/or set associativity) is compared against that of a separate prefetch buffer. Hardware issues concerning both approaches are discussed. The ability of different issue rates to hide the data prefetching overhead will be judged in terms of the net processor speedup.
Compiler Issues
The main concept behind compiler-assisted data prefetching is to utilize the compiler to insert prefetch operations in advance so that the needed datum will be available in the data cache when the actual memory operation is executed.
Since the address contained in register r1 is the result of another memory load, the value in r1 will not be available until the first memory load completes its execution. Therefore, the earliest point to insert the prefetch operation is right after the first memory load operation. The resulting code with data prefetching is shown in Figure 3. Due to address loads, the number of cycles between the prefetch operation and the corresponding memory operation can be smaller than L. In this case, prefetching cannot completely hide the cache miss penalty. But since the data cache refill has already been initiated by the prefetch operation, the penalty cycles are reduced.
[...] may decrease the effectiveness of data prefetching.
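The partial-hiding case described above can be sketched as a one-line model. This assumes that L denotes the cache miss latency in cycles and that the refill proceeds uninterrupted from the moment the prefetch issues; the function name `residual_miss_penalty` and its parameters are hypothetical names introduced for illustration.

```python
def residual_miss_penalty(miss_latency: int, prefetch_distance: int) -> int:
    """Stall cycles still seen by the memory operation.

    miss_latency      -- cycles needed to refill the block (L in the text, assumed)
    prefetch_distance -- cycles between the prefetch and the memory operation
    """
    if miss_latency < 0 or prefetch_distance < 0:
        raise ValueError("latency and distance must be non-negative")
    # The refill has already run for prefetch_distance cycles when the
    # memory operation executes; only the remainder is exposed.
    return max(0, miss_latency - prefetch_distance)
```

For example, with an assumed latency of 10 cycles, a prefetch issued 3 cycles ahead leaves 7 penalty cycles, while a prefetch issued 10 or more cycles ahead hides the miss entirely.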
On the other hand, if we wish the data cache to service both the memory operations and the prefetch operations simultaneously, the bandwidth of the cache will have to increase to keep up with the service rate.
The datapath to the cache, therefore, must be duplicated to perform simultaneous reads and writes to/from the cache. Prefetched data can be placed into a separate prefetch buffer with the same block size as that of the data cache [4]. When a memory operation is executed, both the data cache and the prefetch buffer are checked. The CPU retrieves the proper entry from the data cache in preference to the prefetch buffer.
If the entry is not in the cache and is in the prefetch buffer, the data is forwarded to both the CPU and the data cache.
If a miss occurs in both the cache and the prefetch buffer, a cache miss is assumed and is handled normally. If a load hits in both the cache and the prefetch buffer, the data from the data cache and the prefetch buffer are multiplexed to give priority to the data cache. For all stores that hit in the data cache, the corresponding buffer entries are invalidated.
Otherwise, the prefetched block is transferred to the data cache before the invalidation of the buffer entry. The data cache does not have to be non-blocking, but the prefetch buffer must be able to handle multiple outstanding memory requests. The bandwidth of the data cache remains the same. An extra communication channel is needed between the data cache and the prefetch buffer for data transfer.
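The lookup rules above can be summarized in a small behavioral model. This is only a sketch under several assumptions that the text does not state: capacity and replacement are ignored, a buffer entry is dropped once its block has been forwarded to the cache, "Otherwise" is read as a store that misses in the cache but hits in the buffer, and a store that misses in both allocates in the cache. The class and method names are hypothetical.

```python
class PrefetchBufferModel:
    """Toy model of a data cache paired with a separate prefetch buffer.

    Blocks are tracked by block address only; sizes, replacement, and
    timing are deliberately left out (assumptions of this sketch).
    """

    def __init__(self):
        self.cache = set()   # block addresses resident in the data cache
        self.buffer = set()  # block addresses resident in the prefetch buffer

    def prefetch(self, block):
        # Prefetched data is placed in the buffer, not in the data cache.
        self.buffer.add(block)

    def load(self, block):
        if block in self.cache:
            # Hit in the cache (possibly in the buffer too): cache has priority.
            return "cache"
        if block in self.buffer:
            # Forward the block to the CPU and to the data cache.
            self.cache.add(block)
            self.buffer.discard(block)  # assumption: buffer copy is dropped
            return "buffer"
        # Miss in both: handled as a normal cache miss.
        self.cache.add(block)
        return "miss"

    def store(self, block):
        if block in self.cache:
            # Store hit in the cache: invalidate any matching buffer entry.
            self.buffer.discard(block)
            return "cache"
        if block in self.buffer:
            # Assumed reading of "Otherwise": transfer first, then invalidate.
            self.cache.add(block)
            self.buffer.discard(block)
            return "buffer"
        self.cache.add(block)  # assumption: allocate on a store miss
        return "miss"
```

In this model a block enters the data cache only when a load or store actually references it, which is the property that keeps unused prefetched data out of the cache.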
Since the prefetch buffer is accessed by the CPU concurrently with the data cache, the prefetch buffer suffers the same bandwidth problem as the prefetch cache. The difference is that the prefetch buffer has no dirty-block state, and thus it does not have to handle simultaneous stores from both the memory and prefetch operations. Therefore, the prefetch buffer states are simpler than those of the prefetch cache. With a prefetch buffer, we can guarantee that all data entering the cache will be used at least once. In comparison, when prefetching into the data cache, the cache size and/or associativity can be increased to reduce the degrading effect of pollution; however, useless data cannot be prevented from entering the cache.
Simulation Environment
The basic distribution can be separated into two categories. The average result for eqntott and espresso is shown in Figure 4. The average result for tbl, xlisp, and yacc is similarly shown in Figure 5. The main disparity between the two figures is that for eqntott and espresso, the maximum prefetch distances tend to be evenly distributed across the entire spectrum.
For the other three benchmarks, the maximum prefetch distance is concentrated at distances of 3 or less. The effectiveness of the prefetching approach used in this paper is limited by the constraints of the address loads. For example, we can see that xlisp has a high percentage of memory operations that require the use of an address resulting from another memory load (Table 1). In conjunction with the results from Figure 5 (where prefetch distances are small), we can expect prefetching to be less effective for xlisp than for a less constrained program such as espresso. This behavior is indeed observed for xlisp.
Hardware Tradeoffs
Since data prefetching has the potential to increase cache pollution, we wish to minimize the degrading effect of cache pollution through different data access considerations. Figures 6 and 7 show the effectiveness of increasing the cache size and/or set associativity versus using a prefetch buffer. Several prefetch buffer configurations are evaluated in comparison with a perfect cache (indicated by per in each figure) and a 1K direct-mapped data cache with no prefetching (indicated by 1dx in each figure). The first number in each configuration label represents the cache size in units of $2^{10}$ bytes. The second character represents the cache associativity (e.g., d is for direct mapped, and 2 is for 2-way associative).
The third character represents the prefetch buffer size as a power of two in entries (e.g., n is for prefetch into the cache, and 3 is for a $2^3$-entry buffer).
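The label notation described above can be decoded mechanically. The sketch below is one possible reading of it: it assumes the no-prefetch baseline is written with a trailing x (as in 1dx), that associativity is either d or a single digit, and that a digit in the third position is the base-2 exponent of the buffer size. The names `CacheConfig` and `parse_config` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class CacheConfig(NamedTuple):
    cache_bytes: Optional[int]     # None for the perfect cache ("per")
    associativity: Optional[int]   # 1 means direct mapped
    prefetching: bool              # False for "x"; not applicable for "per"
    buffer_entries: Optional[int]  # None when no prefetch buffer is used


# size in K bytes, then associativity, then prefetch target (assumed grammar)
_LABEL = re.compile(r"(\d+)(d|\d)([nx]|\d)")


def parse_config(label: str) -> CacheConfig:
    """Decode a figure label such as '1dx', '8d3', or 'per'."""
    label = label.strip().lower()
    if label == "per":
        return CacheConfig(None, None, False, None)
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognized configuration label: {label!r}")
    size_k, assoc, target = match.groups()
    cache_bytes = int(size_k) * 2**10
    associativity = 1 if assoc == "d" else int(assoc)
    if target == "x":   # no prefetching at all (assumed meaning of x)
        return CacheConfig(cache_bytes, associativity, False, None)
    if target == "n":   # prefetch directly into the data cache
        return CacheConfig(cache_bytes, associativity, True, None)
    return CacheConfig(cache_bytes, associativity, True, 2 ** int(target))
```

For instance, `parse_config("8d3")` describes an 8K direct-mapped cache with an 8-entry prefetch buffer, and `parse_config("22n")` describes a 2K 2-way set-associative cache that receives prefetched blocks directly.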
We plot speedup versus the issue rates of 2, 4, and 8. Each data point represents the average of the 5 benchmarks given in 
