This paper proposes a novel cache architecture for low power consumption, called the "Adaptive Way-Predicting Cache (AWP cache)." The AWP cache has multiple operation modes and dynamically adapts the operation mode based on the accuracy of way-prediction results. A confidence counter for way prediction is added to each cache set. In order to analyze the effectiveness of the AWP cache, we perform an SRAM design using 0.18 µm CMOS technology and cycle-accurate processor simulations. As a result, for a benchmark program (179.art), it is observed that a performance-aware AWP cache reduces the 49% performance overhead caused by an original way-predicting cache to 17%. Furthermore, an energy-aware AWP cache achieves a 73% energy reduction, whereas that obtained from the original way-predicting scheme is only 38%, compared to a non-optimized conventional cache. In terms of energy-performance efficiency, the energy-aware AWP cache produces better results: the energy-delay product of the conventional organization is reduced to only 35% on average, which is 6% better than the original way-predicting scheme.
Introduction
Cache memory is one of the major contributors to energy consumption in state-of-the-art microprocessors. This is because the trend is to increase the size of on-chip caches to alleviate the negative and growing impact of poor memory performance. Although increasing the associativity of caches dramatically improves cache-hit rates, it also increases energy consumption because all the ways are accessed in parallel. To solve the energy issue of set-associative caches, researchers have proposed a number of techniques.
The way-predicting cache is a well-known approach to achieving both high performance and low energy consumption, in which only one way is activated based on MRU-based prediction [10], [15]. If the prediction is correct, the cache access can be completed in one cycle by activating only the predicted way. Otherwise, the cache then accesses the remaining ways, so that one more cycle is spent and no energy reduction is achieved. The main drawback of the way-predicting cache is this prediction-miss penalty. Namely, if the way prediction is incorrect, the cache degrades microprocessor performance without producing any energy reduction.
† The author is with the Department of Informatics, Kyushu University, Fukuoka-shi, 816-8580 Japan. a) E-mail: inoue@i.kyushu-u.ac.jp DOI: 10.1093/ietfec/e88-a.12.3274
To improve the energy-performance efficiency of the way-predicting cache, in this paper we propose a run-time management technique to support multiple operation modes. Our approach measures the accuracy of way predictions at run time and adaptively changes the operation mode. A confidence counter for way prediction is added to each cache set. If the counter indicates that the current way prediction is likely to be correct, the cache works in the way-predicting mode. Otherwise, a normal operation mode for high-performance requirements, or another low-energy mode for low-power requirements, is selected.
This paper is organized as follows. In Sect. 2, we explain the energy issue of conventional caches and show a low-energy technique called way-predicting cache. Section 3 proposes a run-time management scheme called adaptive way-predicting cache, and Sect. 4 evaluates its performance-energy efficiency. Section 5 shows related work, and finally in Sect. 6 we conclude this paper.
Way-Predicting Cache

Conventional and Phased Accesses
Compared to direct-mapped caches with the same cache size and line size, set-associative caches usually achieve higher cache-hit rates because of fewer conflict misses. This is the major reason why a number of microprocessors employ set-associative caches. Figure 1(a) shows the general organization of a 4-way set-associative cache. In each way, we see a pair of tag and data memory subarrays. The group of cache lines that have the same cache index is called a set. In conventional set-associative caches, all the ways are examined in parallel even though only one way can hold the data required by the microprocessor. Therefore, activating the remaining ways wastes energy.
To solve the energy issue of set-associative caches, a phased cache has been proposed [7]. In the phased cache, the cache access is partitioned into two phases as depicted in Fig. 1(b). First, the tag check is performed at each way. Then only the data subarray that includes the referenced data is accessed if there is a tag match. Although the phased cache reduces the energy consumed for data-subarray accesses (or cache-line accesses), the cache-access time is increased due to the sequential operation. On cache misses, the phased cache can complete the access in one clock cycle, because the cache does not need to access the data subarray. However, since many programs exhibit high cache-hit rates, e.g., 90%, the sequential scheme degrades processor performance.
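The per-access cost of the two schemes described above can be sketched as follows. This is a minimal illustration only: the function names and the (tag subarrays, data subarrays, cycles) tuple are hypothetical conventions introduced here, and the cycle counts simply restate the one-cycle and two-cycle behavior described in the text.

```python
def conventional_access(n_ways, hit):
    """Cost of one access to a conventional set-associative cache.

    All tag and data subarrays are activated in parallel, so the cost
    is the same whether the access hits or misses.
    Returns (tag subarrays activated, data subarrays activated, cycles).
    """
    return n_ways, n_ways, 1


def phased_access(n_ways, hit):
    """Cost of one access to a phased cache.

    Phase 1 checks all tags. Phase 2 reads only the matching data
    subarray and is skipped entirely on a cache miss.
    """
    if hit:
        return n_ways, 1, 2
    return n_ways, 0, 1


if __name__ == "__main__":
    print(conventional_access(4, hit=True))
    print(phased_access(4, hit=True))
    print(phased_access(4, hit=False))
```

In this sketch the phased scheme trades one extra cycle on every hit for reading a single data subarray instead of all of them.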
Way-Predicting Access
In order to achieve high performance and low energy consumption at the same time, a way-predicting scheme (the WP cache) has been proposed [10], [15]. In the WP cache, only one way is speculatively selected by MRU-based way prediction. Figure 1(c) presents the operation of a 4-way set-associative WP cache. If the prediction is correct, only the cache-hit way is activated, and the access can be completed in one cycle. In this scenario, we achieve both the high-speed access of the conventional scheme and the low energy consumption of the phased access. However, if the way prediction is incorrect, the remaining ways are activated in the following cycle. In that case, the WP cache saves no energy relative to the conventional cache, and the cache-access time becomes longer, as in the phased cache. Namely, the disadvantage of the conventional approach and that of the phased scheme both take place. Figure 2 shows the accuracy of MRU-based way prediction on a 16 KB 4-way set-associative WP cache. We followed the organization proposed in [10]. The details of the experimental environment are explained in Sect. 4.1. There are two types of way-prediction miss: on cache hits and on cache misses. Even with an oracle way predictor, it is impossible to correctly predict the hit way on cache misses, because the target data does not exist in the cache. From the figure, we see that the accuracy is very low for some benchmarks: 181.mcf, 179.art, and 188.ammp. For these programs, the main reason for the inaccurate prediction is the large number of cache misses. Furthermore, for the other programs, about 10% of cache accesses cause a wrong way prediction on cache hits. In order to improve the performance-energy efficiency of WP caches, we can consider at least two challenges as follows:
• Improving the accuracy of way prediction.
• Alleviating the impact of misprediction.
The former covers only mispredictions on cache hits, whereas the latter can handle any incorrect prediction. Therefore, in this paper, we focus on the latter approach. Namely, our purpose is to hide the negative effects caused by incorrect way predictions.
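The behavior of a single set in an MRU-based WP cache, as described in this section, can be sketched as follows. This is an illustrative model, not the organization of [10]: the class name, the cost tuple, and in particular the refill victim choice (the way after the MRU way) are assumptions made only to keep the example self-contained, since the replacement policy is not specified here.

```python
class WayPredictingSet:
    """One cache set with MRU-based way prediction (illustrative model)."""

    def __init__(self, n_ways):
        self.n_ways = n_ways
        self.tags = [None] * n_ways  # tag stored in each way
        self.mru = 0                 # most recently used way = predicted way

    def access(self, tag):
        """Return (tag subarrays, data subarrays, cycles) for one access."""
        predicted = self.mru
        hit_way = self.tags.index(tag) if tag in self.tags else None
        correct = hit_way == predicted  # always False on a cache miss

        if hit_way is None:
            # Cache miss: refill a victim way.
            # ASSUMPTION: victim is the way after the MRU way.
            hit_way = (self.mru + 1) % self.n_ways
            self.tags[hit_way] = tag
        self.mru = hit_way

        if correct:
            return 1, 1, 1  # only the predicted way is activated
        # Misprediction: the remaining ways are activated in a second cycle.
        return self.n_ways, self.n_ways, 2


if __name__ == "__main__":
    s = WayPredictingSet(4)
    for t in (10, 10, 20, 10, 10):
        print(t, s.access(t))
```

The model makes the two misprediction sources visible: a cache miss always costs the full two-cycle access, and a hit to a way other than the MRU way does too.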
Adaptive Way-Predicting Cache
Supporting Multi-Operation Modes
The main drawback of the WP cache is the penalty caused by way-prediction misses, as explained in Sect. 2. In order to improve the performance-energy efficiency of WP caches, we propose an adaptive way-predicting cache (AWP cache). The AWP cache supports not only the way-predicting mode but also one other mode. Based on the accuracy of way prediction, the cache selects one of the supported operation modes. Here, we can consider two alternatives for implementing the AWP cache as follows:
• Performance-aware AWP cache (PaAWP): The cache supports two access modes: normal mode and way-predicting mode. If the prediction accuracy is lower than a given threshold, the current access is performed in the conventional manner. Although no energy reduction is achieved relative to the conventional way-predicting scheme, the negative performance impact of the WP cache can be alleviated.
• Energy-aware AWP cache (EaAWP): The cache can select one of two operation modes: phased mode and way-predicting mode. The cache works in the same way as the phased cache if the way-prediction accuracy is low. Therefore, since only the data subarray that includes the target data is activated for a number of cache-hit accesses, we can improve the energy efficiency of the conventional WP cache. On cache misses, only the tag subarrays are activated when the cache operates in the phased mode.
It should be noted that the WP cache, the phased cache, and the AWP cache do not affect the cache-hit rate of the conventional scheme if the memory-access order is the same.
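The mode selection of the two AWP variants can be sketched as follows. This is a minimal illustration under the latencies described in this paper. The function names, the string mode labels, and the (tag subarrays, data subarrays, cycles) cost tuple are hypothetical, and `confident` stands for the outcome of the per-set accuracy check.

```python
def select_mode(variant, confident):
    """Pick the operation mode for the current access."""
    if confident:
        return "way-predicting"
    if variant == "PaAWP":
        return "normal"   # performance-aware fallback
    if variant == "EaAWP":
        return "phased"   # energy-aware fallback
    raise ValueError(f"unknown AWP variant: {variant}")


def access_cost(mode, n_ways, hit, prediction_correct):
    """Return (tag subarrays, data subarrays, cycles) for one access."""
    if mode == "way-predicting":
        if prediction_correct:
            return 1, 1, 1
        return n_ways, n_ways, 2
    if mode == "normal":
        return n_ways, n_ways, 1
    if mode == "phased":
        return (n_ways, 1, 2) if hit else (n_ways, 0, 1)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for variant in ("PaAWP", "EaAWP"):
        mode = select_mode(variant, confident=False)
        print(variant, mode, access_cost(mode, 4, hit=True,
                                         prediction_correct=False))
```

For a low-confidence access that hits but would have been mispredicted, the sketch shows the trade-off: PaAWP saves the extra cycle at full energy, while EaAWP keeps the two-cycle latency but reads a single data subarray.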
When we apply the proposed AWP scheme to a multi-port cache, there are at least two design alternatives as follows:
• Sharing the MRU information bits for way prediction and the associated confidence counter among all ports
• Supporting independent MRU bits and a counter for each port
The former allows the cache to operate in only one operation mode at a time. In other words, the cache needs to choose a common operation mode for all concurrent accesses. Therefore, the accuracy of way prediction may be degraded. On the other hand, the latter avoids the negative impact of the sharing approach, because with dedicated MRU bits and a dedicated confidence counter the cache can assign an appropriate operation mode to each port. To make this possible, the cache needs to activate word lines and bit lines independently for each port. The efficiency of these two approaches depends on the characteristics of application programs. A detailed evaluation of the multi-port AWP cache is left as future work.
Measuring the Accuracy of Way Prediction
In the AWP cache, the operation mode for each cache access is determined based on the accuracy of way prediction. In order to measure the accuracy at run time, we add an n-bit confidence counter (CC) to each cache set. Namely, based on the history of way-prediction results, the cache estimates the accuracy of way prediction for the current access. This kind of dynamic optimization is commonly exploited for high performance, e.g., branch prediction [1] and load-value prediction [4]. Figure 3 depicts the organization of a 4-way set-associative AWP cache. When a cache set is accessed, both the associated CC and MRU flag are looked up. Based on the value of the CC, the mode controller decides the access operation mode. At the same time, the cache performs way prediction regardless of the decided operation mode. If the mode controller selects the way-predicting mode, the cache works in the same way as the conventional WP cache. Otherwise, the access is completed in the other defined operation mode, i.e., phased mode or normal mode. Regardless of the selected operation mode, the way-prediction results are used to update the CC. Figure 4 shows two types of finite-state machine for updating the 2-bit CC. Although a variety of state machines can be considered, here we focus on these two schemes. The first scheme uses the 2-bit CC as a simple saturating counter. On each cache access, if the way prediction is correct, the counter is incremented. Otherwise, it is decremented, as shown in Fig. 4(a). The mode controller decides that the current way prediction is accurate if the value of the counter is equal to or greater than two in this example. In that case, the AWP cache works in the way-predicting mode. Figure 4(b) depicts the finite-state machine that is commonly used for dynamic branch prediction [8]. In this scheme, a prediction must miss twice before it is changed to the low-accuracy state.
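The simple saturating scheme of Fig. 4(a) can be sketched as follows. The class and method names are hypothetical. The initial counter value of zero and the generalization of the threshold to half the counter range for widths other than two bits are assumptions, since the text states only the 2-bit case with a threshold of two. The branch-prediction-style machine of Fig. 4(b) is not modeled here.

```python
class ConfidenceCounter:
    """n-bit saturating confidence counter for one cache set (sketch)."""

    def __init__(self, bits=2, value=0):
        self.maximum = (1 << bits) - 1
        # ASSUMPTION: threshold is half the range (2 for the 2-bit case).
        self.threshold = 1 << (bits - 1)
        self.value = value  # ASSUMPTION: counters start at zero

    def is_confident(self):
        """True if the way-predicting mode should be selected."""
        return self.value >= self.threshold

    def update(self, prediction_correct):
        """Update after the tag check, whatever mode was used."""
        if prediction_correct:
            self.value = min(self.value + 1, self.maximum)
        else:
            self.value = max(self.value - 1, 0)


if __name__ == "__main__":
    cc = ConfidenceCounter()
    for outcome in (True, True, True, False, False):
        mode = "way-predicting" if cc.is_confident() else "fallback"
        print(cc.value, mode)
        cc.update(outcome)
```

Note that the mode decision reads the counter before the access, while the update uses the way-prediction result of that same access, mirroring the order described above.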
Negative Impact on Cache-Access Time
Before starting SRAM array activation, the CC and the MRU information associated with the cache set accessed by the microprocessor are read out. Based on the state of the obtained CC, the cache decides which operation mode, normal (or phased) or way-predicting, should be selected. When the cache works in the way-predicting mode, the MRU information is used for the way prediction. Regardless of the operation mode selected for the current access, the cache updates the CC and the MRU information bits in parallel after the tag checks. The cache does not consider whether or not the issued cache access will be squashed due to mis-speculation. Accessing the MRU information bits (and also the CC) may lengthen the cache-access time compared with a conventional organization. This drawback inherently exists in the original way-predicting cache. However, some studies have proposed ways to compensate for the look-up penalty, in which the cache attempts to access the MRU table in an early stage of the cache access [5], [9].
Evaluation
In this section, we discuss the performance-energy efficiency of the proposed adaptive WP cache. In Sect. 4.1, we show the experimental environment. We explore the configuration of the confidence counter (CC) added to each cache set in Sect. 4.2. Then the evaluation results for energy consumption and those for performance impact are reported in Sects. 4.3, 4.4, and 4.5. In Sect. 4.6, we discuss the obtained simulation results from both the performance and energy points of view.
Experimental Setup
In this section, we evaluate the performance-energy efficiency of the AWP cache architecture. A 16 KB 4-way set-associative cache with a 32-byte line size is assumed. We investigate the details of the energy consumption of the AWP cache based on a 0.18 µm CMOS design. We have designed the SRAM memory array from scratch to meet 300 MHz operation. For the added components, such as the CC, update logic, and mode controller, we have performed a cell-based design. The function of these components was described in Verilog-HDL, and then we obtained layout data through logic synthesis and automatic layout CAD tools.
For the processor simulations, we used the SimpleScalar tool set version 3.0. We executed seven integer programs and four floating-point programs from the SPEC 2000 benchmark suite. The MIRV precompiled binaries built with the −O2 option were used in this evaluation [19]. The SPEC reference data set was used. We fast-forwarded the first 20 billion instructions and collected measurements for the following 1 billion committed instructions. We assumed that the L1 data cache access latency is one clock cycle if the access is performed in the normal mode or in the way-predicting mode with a correct prediction. In the phased mode or the way-predicting mode with a wrong prediction, one more clock cycle is required to complete the cache access. Note that the phased cache completes the cache access in one clock cycle if the access causes a cache miss, as explained in Sect. 2.1. The number of ports of the data cache is assumed to be one. We modeled a 4-way out-of-order microprocessor as follows: one load/store unit, a fetch/issue width of 4, a 512-entry bimodal branch predictor, a 512-entry 4-way set-associative BTB, four integer ALUs, four FP ALUs, and a unified 256 KB L2 cache with an access latency of 6 clock cycles.
Efficiency of Confidence Counter
Before discussing the details of the energy-performance efficiency of the AWP cache, we determine the configuration of the CC. As explained in Sect. 3.2, there are two considerations in implementing the CC: the bit-width of each CC and the finite-state machine for accurate prediction. Figure 5 shows how correctly the two types of 2-bit CC indicate the accuracy of way prediction. An accuracy of 100% in this figure means that the CC perfectly tells us whether or not each way prediction is correct. The left and right bars for each benchmark are the results with the simple scheme and the branch-prediction scheme explained in Sect. 3.2, respectively. From this figure, we see that for many programs these two types of CC achieve almost the same degree of accuracy. However, for 188.ammp, the simple-type scheme produces an 8% better result. Therefore, in the rest of this paper, we assume that our AWP caches employ the simple-type scheme for confidence counters.
Next, we evaluate the bit-width of the simple-type CC. Figure 6 reports the effect of bit-width on the accuracy of the CC. From these simulation results, we see that increasing the bit-width from 1 to 2 clearly improves the accuracy of the CC. Beyond that, however, the effect of a wider bit-width is very small. Since the AWP cache adds a CC to each cache set, increasing its bit-width worsens the energy overhead. Accordingly, we conclude that the 2-bit simple counter is enough to capture the accuracy of way predictions. Based on these observations, we focus only on the simple 2-bit saturating counter in this evaluation.
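One way to compute the CC accuracy discussed above for a single cache set is sketched below. This is an illustration of the metric as described in the text (the fraction of accesses for which the counter's high or low confidence matches the actual way-prediction outcome), not the simulator used in the evaluation. The function name, the zero initial value, and the half-range threshold for widths other than two bits are assumptions.

```python
def cc_accuracy(outcomes, bits=2, initial=0):
    """Fraction of accesses where the CC agrees with the actual outcome.

    outcomes: sequence of booleans, True if the MRU way prediction
              for that access was correct.
    """
    if not outcomes:
        raise ValueError("need at least one way-prediction outcome")
    maximum = (1 << bits) - 1
    threshold = 1 << (bits - 1)  # ASSUMPTION: half the counter range
    value = initial              # ASSUMPTION: counter starts at zero
    agree = 0
    for correct in outcomes:
        confident = value >= threshold  # decision made before the access
        if confident == correct:
            agree += 1
        # Saturating update with the observed way-prediction result.
        value = min(value + 1, maximum) if correct else max(value - 1, 0)
    return agree / len(outcomes)


if __name__ == "__main__":
    trace = [True, True, True, False, True, True, False, False, False]
    for width in (1, 2, 3):
        print(width, cc_accuracy(trace, bits=width))
```

Running the same trace with several widths gives a per-set analogue of the bit-width sweep, although the trace here is made up for illustration.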
So far, we have discussed the accuracy of the CC under the assumption that the cache associativity is four. Here, we analyze the sensitivity of the 2-bit simple-type CC to cache associativity. Figure 7 shows the accuracy of the CC where the cache associativity is varied from 2 to 16. The simulation results for five representative benchmarks are presented in this figure. For the five benchmarks, we see that the accuracy degrades as associativity increases, and the degree of the negative impact is application dependent. For instance, 176.gcc maintains almost the same accuracy as at lower associativity. On the other hand, for the two benchmarks 175.vpr and 181.mcf, the accuracy is reduced by 15% and 9%, respectively. Therefore, we should optimize the organization of the CC based on the cache configuration employed in the target microprocessor chip.
Energy Analysis Based on a Cache Design
From the discussion in Sect. 4.2, we have decided to add the 2-bit simple saturating counter as the CC to each cache set. Furthermore, the AWP cache requires three main hardware components in addition to the conventional organization, as shown in Fig. 3, i.e., the WP table, which includes the 2-bit CC and the 2-bit MRU information, the WP table update unit, and the access controller. In order to evaluate the energy overhead caused by the extended hardware components, we have designed a 16 KB 4-way set-associative AWP cache with a 0.18 µm CMOS technology. Figure 8 is the layout image of the designed AWP cache, and Table 1 shows its energy consumption on a normal-mode access. From the design results, we see that the total cache energy strongly depends on the SRAM array accesses for obtaining tags and cache lines, which are required in both the non-optimized conventional cache and the WP cache operations. The energy dissipated for supporting our adaptive operations (WP Table, Access Controller, and WP Table Update in Table 1) accounts for less than 1% of the total energy. Therefore, we do not consider the energy overhead caused by the AWP scheme.
Energy Consumption
By integrating the design results reported in Sect. 4.3 and the simulation results obtained from cycle-accurate processor simulations, we evaluate the energy efficiency of the AWP cache. The total cache energy, Ecache, can be estimated as follows:
Ecache = Edec + Etag + Eline + Elogics
where Edec is the energy consumed for address decoding, Etag is that for tag SRAM accesses, Eline is that for cache-line SRAM accesses, and Elogics is that for peripheral logic such as tag comparison, access control, and so on. Based on the design results reported in Sect. 4.3, we do not consider Edec and Elogics. The energy consumed for SRAM array accesses can be expressed as follows:
Etag = Ntag × Etag access, Eline = Nline × Eline access
Ntag (or Nline) and Etag access (or Eline access) are the total number of tags (or cache lines) accessed in the whole program execution and the average energy for a tag access (or a cache-line access), respectively. We have calculated Etag access and Eline access from the design results reported in Sect. 4.3. On the other hand, Ntag and Nline depend on the low-energy approaches shown in Fig. 1, and we have obtained these values from processor simulations. Figure 9 shows the average cache-access energy normalized to the 16 KB non-optimized conventional 4-way set-associative cache. The purpose of PaAWP is to compensate for the negative performance impact of the WP cache. Since PaAWP works in the high-energy normal operation mode whenever the way prediction is likely to be incorrect, we cannot achieve any energy reduction relative to the conventional WP scheme. On the other hand, EaAWP attempts to improve the energy efficiency of the WP cache by exploiting the phased scheme. In the best case, 179.art, a 73% energy reduction can be achieved, while that of the WP cache is only 38%, compared with the non-optimized conventional cache. On average (geometric mean), EaAWP achieves 5% more energy reduction than the WP cache.
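The energy model above, with Edec and Elogics neglected, can be sketched as follows. The per-access energies and the access counts used here are hypothetical placeholders in arbitrary units; they are not the values of Table 1 or of the simulations, and the function name is introduced only for this illustration.

```python
def cache_energy(n_tag, n_line, e_tag_access, e_line_access):
    """Ecache approximated as Etag + Eline.

    Etag  = Ntag  * (average energy per tag access)
    Eline = Nline * (average energy per cache-line access)
    """
    return n_tag * e_tag_access + n_line * e_line_access


# HYPOTHETICAL values in arbitrary units, for illustration only.
E_TAG_ACCESS = 1.0
E_LINE_ACCESS = 4.0
N_WAYS = 4
ACCESSES = 1000
WP_CORRECT = 900                  # accesses with a correct way prediction
WP_WRONG = ACCESSES - WP_CORRECT  # accesses that activate all ways

# Conventional cache: every access activates all tags and all lines.
conventional = cache_energy(N_WAYS * ACCESSES, N_WAYS * ACCESSES,
                            E_TAG_ACCESS, E_LINE_ACCESS)

# WP cache: one way on a correct prediction, all ways otherwise.
wp_ways = WP_CORRECT * 1 + WP_WRONG * N_WAYS
wp = cache_energy(wp_ways, wp_ways, E_TAG_ACCESS, E_LINE_ACCESS)

normalized_wp = wp / conventional

if __name__ == "__main__":
    print(f"normalized WP energy: {normalized_wp:.3f}")
```

The normalization by the conventional cache corresponds to how the energy results are presented, so only the ratio of the two per-access energies matters in this sketch.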
Performance
Based on the cycle-accurate processor simulations, we have measured the average cache-access time in terms of clock cycles, as shown in Fig. 10. From this figure, we see that PaAWP can effectively hide the misprediction penalty. For all benchmarks, PaAWP reduces the average cache-access time relative to the WP cache. For 179.art, the performance overhead caused by the WP cache is reduced from 49% to 17%. On average (geometric mean), PaAWP worsens the average cache-access time by only 9%, while the WP cache incurs a 15% performance overhead, compared to the non-optimized energy-consuming conventional cache.
For all but one benchmark in Fig. 10, EaAWP worsens the average cache-access time of the conventional WP cache. This is because EaAWP assigns the low-power, low-speed phased mode to some cache accesses that are likely to cause a way misprediction. However, we see the opposite result for 179.art, i.e., EaAWP improves the cache performance. This is because, as explained in Sect. 2.1, the phased cache can complete the access in one clock cycle on cache misses. Therefore, when EaAWP operates in the phased mode and the current access causes a cache miss, it takes only one clock cycle. In contrast, the conventional WP scheme consumes two clock cycles for any cache miss, one for activating the predicted way and one for the remaining ways. Therefore, EaAWP has the potential to improve the performance of the conventional WP cache for application programs that produce high cache-miss rates.
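The effect described above can be checked with a back-of-the-envelope model. This is a sketch under the latency assumptions of Sect. 4.1 only; the function names and the example rates are hypothetical, both rates are fractions of all accesses, and a real EaAWP cache mixes the two modes rather than using the phased mode for every access.

```python
def wp_avg_cycles(miss_rate, hit_mispredict_rate):
    """Average cycles per access for the WP cache.

    Every access takes one cycle, plus one extra cycle for each wrong
    way prediction. Every cache miss is a wrong prediction.
    """
    return 1.0 + miss_rate + hit_mispredict_rate


def phased_avg_cycles(miss_rate):
    """Average cycles per access for accesses run in the phased mode.

    Two cycles on a hit, one cycle on a miss.
    """
    return 2.0 * (1.0 - miss_rate) + 1.0 * miss_rate


if __name__ == "__main__":
    # HYPOTHETICAL rates: a miss-heavy workload and a hit-heavy one.
    for miss, hit_mp in ((0.60, 0.10), (0.05, 0.10)):
        print(miss, wp_avg_cycles(miss, hit_mp), phased_avg_cycles(miss))
```

With these made-up rates, the phased mode is faster than way prediction only for the miss-heavy case, which is consistent with the explanation given for 179.art.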
Discussion
In this section, we discuss the performance-energy efficiency of the two proposed AWP caches, PaAWP and EaAWP. First, we consider the relationship between the AWP approach and the characteristics of application programs. In order to analyze the behavior of the AWP caches, we show the breakdown of cache accesses in terms of the operation mode selected. Figure 11 shows the simulation results. Since PaAWP and EaAWP employ the same CC organization, there is no difference in the behavior of mode transitions. We can broadly classify the benchmarks into two groups. The first group is not suited to our adaptive optimization. For instance, 176.gcc, 177.mesa, and 183.equake fall into this category. In these benchmarks, almost all the cache accesses are performed in the way-predicting mode. Thus, the adaptive mode optimization cannot show its potential, as shown in Figs. 9 and 10. On the other hand, the second group, 175.vpr, 181.mcf, 179.art, and 188.ammp, benefits from the adaptive optimization. For these benchmarks, the AWP scheme aggressively chooses the normal or phased operation mode. For 181.mcf and 179.art, the conventional WP cache produces high cache-miss rates, as shown in Fig. 2, and the AWP scheme effectively detects the way unpredictability on cache misses, as reported in Fig. 11. On the other hand, for 175.vpr and 188.ammp, these figures show that the AWP cache accurately finds the unpredictability on both cache hits and misses. Figure 12 depicts the accuracy of the confidence counter in each operation mode. For all but one benchmark, our counter works very well. Namely, when the AWP cache selects the way-predicting mode, almost all accesses can be completed with a correct MRU way prediction.
Next, we compare the abilities of PaAWP and EaAWP. Figure 13 shows energy-delay products (EDP) normalized to the non-optimized conventional cache. Compared to the WP cache, EaAWP achieves an EDP reduction for all but one benchmark, whereas PaAWP produces better results for only two benchmarks. On average (geometric mean), the WP cache reduces EDP by 58% relative to the non-optimized conventional cache. Although PaAWP does not improve the performance-energy efficiency of the WP cache, EaAWP achieves 6% more EDP reduction. From these observations, we conclude that the energy-aware mode optimization is superior to the performance-aware approach.
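A normalized energy-delay product can be computed from an energy reduction and a performance overhead as sketched below. The function name is hypothetical, and the sketch assumes that the delay term is the normalized average cache-access time. The two inputs for the WP cache on 179.art (38% energy reduction, 49% performance overhead) are the figures reported in this paper.

```python
def normalized_edp(energy_reduction, delay_overhead):
    """Energy-delay product normalized to the conventional cache.

    energy_reduction: fraction of energy saved (0.38 means 38% saved).
    delay_overhead:   fractional increase in delay (0.49 means +49%).
    ASSUMPTION: delay is the normalized average cache-access time.
    """
    normalized_energy = 1.0 - energy_reduction
    normalized_delay = 1.0 + delay_overhead
    return normalized_energy * normalized_delay


# WP cache on 179.art, using the energy and overhead figures above.
wp_art_edp = normalized_edp(0.38, 0.49)

if __name__ == "__main__":
    print(f"WP cache, 179.art: normalized EDP = {wp_art_edp:.4f}")
    print(f"EDP reduction = {100.0 * (1.0 - wp_art_edp):.1f}%")
```

Under this assumption the WP cache saves only about 8% of EDP on 179.art, in line with the figure quoted in the conclusions: the large delay overhead cancels most of the energy saving.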
Related Work
Recently, a number of techniques to reduce cache-leakage energy have been proposed [3], [6], [12]. Although the impact of leakage energy has been increasing with the advances of VLSI technology, dynamic energy consumption still strongly affects the total cache energy.
In order to alleviate the negative effect of set-associative caches, researchers have proposed many low-energy cache architectures [2], [17]. Witchel et al. [16] presented the Direct Addressed scheme, which allows software to access cache data without hardware tag checks. The key idea is to store the tag-check results in the DA-register file and then reuse them with compiler support. Ma et al. [13] suggested dynamic way-memoization for eliminating the cache-search operation. The idea is to record both the tag-check results and the valid bits within the I-cache. The HBTC cache proposed in [11] exploits the BTB for recording the tag-check results. Although the purpose of these techniques is to improve the energy efficiency of set-associative caches, they do not have multiple operation modes.
Huang et al. [9] proposed several access schemes in order to improve the performance-energy efficiency of the phased or the way-predicting caches. However, their work does not discuss run-time mode management. Zhu and Zhang [18] showed a mode-prediction technique, which is very close to our approach. They attempt to predict the occurrence of cache misses to adaptively change the operation mode. In contrast, our approach focuses on the accuracy of way prediction, so that the AWP cache can detect way mispredictions not only on cache misses but also on cache hits.
So far, confidence estimation has been exploited for high-performance applications such as branch prediction [1] and load-value prediction [4] . Manne et al. [14] proposed to exploit the confidence information for low-power purposes, in which data-path execution is gated based on the accuracy of branch predictions. In this paper, we use the confidence information to reduce cache energy consumption.
Conclusions
In this paper, we have proposed a run-time management technique to improve the performance-energy efficiency of way-predicting caches. Our cache supports two operation modes: the way-predicting mode and either the normal or the phased mode.
Based on the accuracy of way prediction, our AWP cache attempts to compensate for the negative impact of the way-misprediction penalty. In our evaluation, it has been observed that an energy-aware AWP cache reduces the ED product by 67% for a benchmark (179.art) and by 65% on average, while those for the original way-predicting cache are 8% and 59%, compared to a non-optimized conventional cache. Our future work is to improve the ability of the confidence counter by exploring other points in the design space.
