Abstract: Three-dimensional (3-D) integration technology provides various architectural opportunities, including huge memory bandwidth. This paper proposes a versatile stream buffer architecture that works as a secondary victim cache as well as a conventional stream buffer. The versatile stream buffer utilizes its empty spaces to exploit the massive memory bandwidth provided by 3-D integration technology and to reduce memory access frequency. Performance evaluation results show that the proposed mechanism with a 16 KB stream buffer and a 4 KB victim cache achieves 10% and 3% better performance than conventional L2 caches with capacities of 256 KB and 2 MB, respectively. The proposed mechanism also reduces the miss rate by about 12% more than the conventional 256 KB L2 cache.
Introduction
The semiconductor and computer system communities are investigating three-dimensional (3-D) integration technology as an approach to meeting tight performance and power constraints. This technology offers huge memory bandwidth and flexible placement of memory components, which makes it possible to devise new system architectures that exploit these characteristics. One of the mechanisms that can exploit this advantage is data prefetching, a well-known key technique for effectively improving the performance of the memory system. This paper proposes a versatile stream buffer architecture that works as a secondary victim cache as well as a conventional stream buffer. The proposed versatile stream buffer architecture with a victim cache utilizes the empty space of the stream buffer to store the data evicted from the victim cache, in order to exploit high memory bandwidth and to reduce memory traffic.
The performance evaluation results show that the proposed 16 KB stream buffer with a 4 KB victim cache achieves about 10% and 3% better performance than the conventional 256 KB L2 cache and 2 MB L2 cache, respectively. The proposed versatile stream buffer with a victim cache reduces the miss rate by about 12% more than the conventional 256 KB L2 cache on average. This paper is organized as follows. Section 2 explains related work. The proposed versatile stream buffer architecture with a victim cache is described in Section 3. Section 4 presents the simulation environment and the performance evaluation results. Section 5 concludes the paper.
Related work
Various studies have been proposed to utilize the memory bandwidth of computer systems efficiently. The stream buffer structure and the victim cache, a small buffer for the direct-mapped L1 cache, are introduced in [1]. In [2], Palacharla et al. adopted the concept of the original stream buffer in the L2 cache in order to reduce the long off-chip memory access latency.
Adaptive prefetching mechanisms have been proposed as well. Inoue et al. proposed a mechanism that adjusts the fetch size based on the access pattern of the cache block in [3]. Recently, various schemes have been investigated to exploit the plentiful memory bandwidth provided by 3-D integration technology. Ono et al. [4] proposed a software-controllable mechanism for 3-D integration technology that adjusts fetch sizes dynamically based on profiling information gathered at compile time. In [5], Woo et al. proposed the SMART-3D architecture, which efficiently utilizes the TSVs to reduce the long latency of fetching and write-back in an L2 cache. Liu et al. [6] investigated various conventional schemes, including a stream buffer, to bridge the processor and memory performance gap based on 3-D ICs.
This paper introduces a versatile stream buffer architecture with a victim cache to exploit massive memory bandwidth and to reduce memory traffic. The versatile stream buffer and the victim cache store the prefetched blocks and the victim blocks from the L1 cache, as do the usual stream buffer and victim cache proposed in [1] and [2]. The stream buffer is made larger than the conventional stream buffer so that it can prefetch more blocks and exploit the high memory bandwidth of 3-D integration technology. Although this aggressive prefetching usually provides better performance, incorrect prefetching is interrupted by other cache misses. Fig. 2 shows the number of invalidated blocks when the stream buffer capacity is 16 KB. As shown in Fig. 2, the invalidated blocks account for about 70% of the total 256 entries on average. These invalidated blocks remain useless empty space until new prefetched blocks are inserted. To utilize this empty space, the versatile stream buffer stores the data evicted from the victim cache in the invalidated entries of the stream buffer.
Operation model of proposed architecture
The operation model of the proposed versatile stream buffer architecture is shown in Algorithm 1.
When an L1 cache miss occurs, the stream buffer and the victim cache are searched. If the access hits in the stream buffer or the victim cache, the cache block is transferred from that structure to the L1 cache and the processor.
In the case of a miss in both the stream buffer and the victim cache, the stream buffer way to be replaced is selected by the LRU policy, and prefetching is performed for the data following the cache block on which the miss occurred. If the access hits in the victim cache, the cache block is swapped between the L1 cache and the victim cache. In addition, the proposed mechanism stores the entry replaced from the victim cache in an empty stream buffer entry, which is selected by the Most Recently Used (MRU) policy among the invalidated entries. This insertion policy guarantees that the MRU entry can be stored as long as possible.
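The operation model described above can be sketched in a few lines of Python. This is only an illustrative toy model under stated assumptions, not the authors' Algorithm 1: the class and function names (`VersatileStreamBuffer`, `VictimCache`, `handle_l1_miss`) are hypothetical, block addresses are plain integers, a stream buffer hit is assumed to invalidate the entry it consumes, and "MRU among the invalidated entries" is interpreted here as the most recently invalidated slot.

```python
from collections import OrderedDict


class VictimCache:
    """Fully associative victim cache with LRU replacement (toy model)."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest block first

    def remove(self, addr):
        """Take addr out of the cache; return True if it was present."""
        return self.blocks.pop(addr, None) is not None

    def insert(self, addr):
        """Insert addr; return the LRU block it pushes out, or None."""
        replaced = None
        if len(self.blocks) >= self.capacity:
            replaced, _ = self.blocks.popitem(last=False)
        self.blocks[addr] = True
        return replaced


class VersatileStreamBuffer:
    """`ways` streams of `depth` block slots each (toy model)."""

    def __init__(self, ways=16, depth=16):
        self.depth = depth
        # slots[w][i] holds a block address, or None when the slot is invalid
        self.slots = [[None] * depth for _ in range(ways)]
        self.way_lru = list(range(ways))  # front = least recently used way
        # invalid slots; the most recently invalidated slot is last
        self.invalid = [(w, i) for w in range(ways) for i in range(depth)]

    def _touch(self, w):
        self.way_lru.remove(w)
        self.way_lru.append(w)

    def lookup(self, addr):
        """On a hit, hand the block over and invalidate its slot (assumption)."""
        for w, way in enumerate(self.slots):
            if addr in way:
                i = way.index(addr)
                way[i] = None
                self.invalid.append((w, i))
                self._touch(w)
                return True
        return False

    def prefetch_after(self, addr):
        """Overwrite the LRU way with the blocks that follow addr."""
        w = self.way_lru[0]
        self._touch(w)
        for i in range(self.depth):
            self.slots[w][i] = addr + 1 + i
        self.invalid = [s for s in self.invalid if s[0] != w]

    def store_victim(self, addr):
        """Keep a block evicted from the victim cache in the MRU invalid slot."""
        if not self.invalid:
            return False  # no empty space: the block is simply dropped
        w, i = self.invalid.pop()
        self.slots[w][i] = addr
        return True


def handle_l1_miss(addr, l1_victim, sb, vc):
    """Serve an L1 miss; return 'victim', 'stream' or 'memory'."""
    if vc.remove(addr):
        source = "victim"  # swap: l1_victim takes the freed place below
    elif sb.lookup(addr):
        source = "stream"
    else:
        source = "memory"
        sb.prefetch_after(addr)
    if l1_victim is not None:
        replaced = vc.insert(l1_victim)
        if replaced is not None:
            sb.store_victim(replaced)  # reuse empty stream buffer space
    return source
```

In this sketch, calling `handle_l1_miss(addr, l1_victim, sb, vc)` with the missing block address and the block displaced from the L1 cache reports where the block was found. A block pushed out of the victim cache is lost only when the stream buffer has no invalidated entry left.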
Simulation environments and performance evaluation
Performance evaluation was performed using the M-Sim simulator [7] with the SPEC CPU 2000 [8] and MiBench [9] benchmarks. The system parameters for the simulation are presented in Table I. The access latencies of the L1 and L2 cache memories, the stream buffer and the victim cache are obtained from [...]
Fig. 3 shows the performance of victim caches and stream buffers with various capacities. As shown in Fig. 3, both the victim cache and the stream buffer perform best when the capacity is 128 KB. Even though a capacity of 128 KB achieves the best performance, we select 4 KB for the victim cache and 16 KB for the stream buffer, mainly because the performance difference from the 128 KB capacity is negligible and the access latency of a fully associative 128 KB buffer is too large.
We have analyzed the performance of the proposed architecture according to the number of ways and entries of the stream buffer. The performance differences among the 8-way 32-entry, 16-way 16-entry and 32-way 8-entry stream buffers are only about 3-4% in CPI. The 16-way 16-entry stream buffer delivers the best performance among these configurations, so the stream buffer is configured as a 16-way 16-entry buffer based on this analysis. The victim buffer is constructed as a one-way fully associative buffer as proposed in [1].
The proposed mechanism can achieve a performance improvement over the conventional L2 cache with capacities of 256 KB and 2 MB by about 10% and 3%, respectively. Fig. 5 shows the performance of the proposed versatile stream buffer architecture with a victim cache and of the conventional stream buffer with a victim cache. The proposed architecture with a 16 KB stream buffer and a 4 KB victim cache achieves a performance improvement of about 6% over the conventional stream buffer and victim cache with the same capacities.
The proposed architecture also achieves a performance improvement of about 3% over the conventional stream buffer and victim cache with twice the capacity, i.e., 32 KB and 8 KB, respectively.
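The buffer geometry used in this evaluation can be checked with a few lines of arithmetic. The sketch below is only a consistency check on figures stated in the text (16 KB stream buffer, 256 total entries, about 70% invalidated entries). The 64-byte block size is derived from those figures rather than stated in the paper, and applying the same block size to the 4 KB victim cache is an assumption.

```python
KB = 1024

STREAM_BUFFER_BYTES = 16 * KB
TOTAL_ENTRIES = 256  # total stream buffer entries quoted in the text

# Derived, not stated in the paper: bytes held by one entry.
block_bytes = STREAM_BUFFER_BYTES // TOTAL_ENTRIES

# The three evaluated (ways, entries per way) configurations.
configs = [(8, 32), (16, 16), (32, 8)]
same_capacity = all(ways * depth == TOTAL_ENTRIES for ways, depth in configs)

# Assumption: the victim cache uses the same block size as the stream buffer.
victim_entries = (4 * KB) // block_bytes

# About 70% of the entries are invalidated on average (Fig. 2).
avg_invalid_entries = round(0.70 * TOTAL_ENTRIES)

print(block_bytes, same_capacity, victim_entries, avg_invalid_entries)
```

All three configurations hold the same 256 entries, so the 3-4% CPI spread among them reflects only how the entries are divided into streams. On average, roughly 179 of those entries would be available for blocks evicted from the victim cache.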
Conclusion
This paper proposed a versatile stream buffer with a victim cache to exploit the enormous memory bandwidth provided by 3-D integration technology and to reduce memory traffic. The proposed versatile stream buffer architecture can be adopted into a conventional L2 cache structure as well. The proposed mechanism can achieve performance improvement over the conventional L2 cache with the capacity of 256 KB and 2 MB by about 10% and 3%, respectively. The proposed architecture reduced the miss rate more than the conventional 256 KB L2 cache by about 12% on average.
The results indicate that a more sophisticated mechanism is required to exploit the enormous memory bandwidth and to reduce memory loss and effective memory traffic, rather than just extending a conventional mechanism, for example by increasing the cache block size. One of the most important directions for future research will be the analysis and adaptation of the versatile stream buffer for multi-core system architectures.
