Abstract-Power consumption is a major concern in today's processor design. As technology shrinks, leakage power dominates the overall power consumption of the processor, although dynamic power is expected to gain relevance in future semiconductor technologies. This is particularly relevant for the cache hierarchy, which contains a significant percentage of the microprocessor's transistors. In this work we propose the use of a phase adaptive cache design to reduce both leakage and dynamic power consumption with very little impact on the overall performance. We take advantage of the overwhelming preference of memory accesses for the most recently used blocks, and of the fact that these blocks are placed in a fast A partition of the cache. The B partition of the cache, on the other hand, is placed in a drowsy mode in order to reduce the leakage power consumption of a large portion of the whole storage capacity. We test our design on a private second-level cache, reaching average dynamic energy savings of almost 20% of the conventional cache design's dynamic energy, and leakage energy savings close to 45% of the conventional cache design's leakage energy. These results were achieved with minimal performance losses that stay within 2% of the original values.
I. INTRODUCTION
Power consumption has become a major concern in current microprocessor design, and it was the main reason for the microprocessor industry's move to multicore designs. Power consumption can be broken down into dynamic power, due to transistor switching, and static power. The latter, also referred to as leakage power since it arises from transistors' leakage currents, dominates the overall consumption in current nanometer-scale technologies. This problem worsens as technology shrinks and the number of transistors per chip increases, which is the expected trend in future technologies. Therefore, much research work has addressed leakage power [1], [2].
Dynamic power has also been the focus of significant research. An interesting approach is the accounting cache [3], a phase adaptive cache design that reduces dynamic power consumption. These savings are achieved by separating the ways of the cache into two partitions, namely A and B. The A partition is accessed first, while the B partition is only accessed if the data is not found in the A partition. That way, the dynamic power due to the logic surrounding the B partition is avoided in a number of cases, reducing the overall dynamic power consumption. However, this concept does not save any leakage power, since the cache ways (regardless of the partition) are always kept active and storing data. One important feature of this design is that it always keeps the most recently used (MRU) blocks in the A partition, by swapping cache lines between the B and A partitions when necessary.
The original accounting cache design was phase adaptive. That is, not only were the cache ways split into two partitions, but the number of ways assigned to each partition also varied dynamically depending on the demands of the different phases of the application. With some extra logic that calculated a cost function, the best configuration was selected for each phase of the execution of a given application. The cost function of the original implementation was based on dynamic power consumption only.
On the other hand, Petit et al. [4] showed that most of the accesses to L1 caches (nearly 98%) are to blocks in MRU state 0 and MRU state 1. Locality is not as high in L2 caches; however, even in this case, [5] showed that these two MRU states capture over 73% of the accesses in highly associative caches (16 ways).
Based on these concepts, we propose to adapt the accounting cache in order to save not only dynamic power consumption, but also leakage, by placing the B partition in a low-power state or drowsy mode. The proposed architecture has one A partition holding a number of ways, powered with full voltage level. The rest of the cache ways belong to the B partition, and are placed at a lower voltage level. This drowsy B partition will be the source of our leakage savings.
The contributions of this paper are:
• We place the B partition of the accounting cache in drowsy mode in order to save not only dynamic energy but also leakage energy.
• We use both energy and energy-delay product cost functions to make reconfiguration decisions.
• We show how leakage and smaller technology nodes completely change the original concept of the accounting cache. Based on experimental measurements, the proposed design reduces the circuit complexity of the original phase adaptive cache design. We extended the cost function of the original accounting cache to account for leakage power, and we found that, due to the drowsy mode of the B partition and the overwhelming preference of accesses for the most recently used blocks, the preferred size for the A partition is a single way in all our tests. This means that there is no need to implement any cost function, which greatly simplifies the hardware of our cache. A cache with one way in the A partition and all the other ways in the drowsy B partition provides important dynamic and static power savings with hardly any damage to the overall performance.
II. DROWSY ACCOUNTING CACHE
Although the proposed techniques can be applied at any level of the cache hierarchy, this work concentrates on the L2 caches of tiled CMPs (chip multiprocessors). We assume that each tile consists of a computational core with private L1 and L2 caches. First-level data and instruction caches are critical for performance, so they are designed to be fast and small. Because of their reduced size compared to L2 caches, they account for a much smaller fraction of the dissipated energy. For these reasons, this work treats L1 caches as conventional caches.
A. Baseline Cache Architecture
The baseline accounting cache employed in this work was originally proposed in [3], and it is divided into two partitions, each one containing a subset of the total ways. The number of ways contained in each partition depends on the cache configuration, and in our case ranges from one to eight. To reduce the design space exploration (i.e. the number of possible configurations), we restrict the resizing to 1/7, 2/6, 4/4, and 8/0 ways in the A/B partitions (see Section V, Table III). These configurations will be referred to as C0, C1, C2 and C3, respectively.
The design space starts with C0 (1/7) configuration where the A partition acts as a direct-mapped cache. Then, we progressively upsize the A partition by increasing its associativity up to 8 ways corresponding to C3 (8/0) configuration.
The mechanism to access the accounting cache is represented in Figure 1 with an example based on a four-way cache. The A partition is accessed first, and an access to the B partition takes place only when the requested line is not found in A. In that case, an access to the next-level cache is initiated at the same time as the B partition access. If the data is found in the B partition, the access to the next level is canceled, and the data is swapped between the A and B partitions. If the data cannot be found in the B partition, then it is brought from lower levels of the memory hierarchy (e.g. L3 or main memory), and placed in the A partition.
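The lookup sequence described above can be sketched in a few lines of Python. This is a minimal, illustrative model of a single cache set, not the hardware described in [3]: the class name `AccountingSet`, the LRU ordering inside each partition, and the choice of A's least recently used block as the swap victim are assumptions made for the example. The overlap between the B probe and the next-level request is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class AccountingSet:
    """One cache set split into an A partition (probed first) and a B partition."""
    a_ways: int
    b_ways: int
    a: list = field(default_factory=list)  # tags in A, most recently used first
    b: list = field(default_factory=list)  # tags in B, most recently used first

    def access(self, tag):
        """Return 'hit_A', 'hit_B' or 'miss' and update both partitions."""
        if tag in self.a:
            self.a.remove(tag)
            self.a.insert(0, tag)
            return "hit_A"
        # A missed, so B is probed. In the text the next-level request starts
        # in parallel and is cancelled on a B hit; that timing is not modelled.
        outcome = "hit_B" if tag in self.b else "miss"
        if outcome == "hit_B":
            self.b.remove(tag)
        self._install_in_a(tag)
        return outcome

    def _install_in_a(self, tag):
        """Place tag in A. Assumed policy: A's LRU block is demoted to B,
        and B's LRU block leaves the cache if B overflows."""
        self.a.insert(0, tag)
        if len(self.a) > self.a_ways:
            demoted = self.a.pop()
            if self.b_ways > 0:
                self.b.insert(0, demoted)
                if len(self.b) > self.b_ways:
                    self.b.pop()  # evicted from this cache level
```

With `a_ways=1` and `b_ways=3`, a block that was pushed into B is reported as `hit_B` on its next access and moves back into A, while the previous occupant of A takes its place in B.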
B. Proposed Cache Architecture
In the devised implementation, the B partition is placed in drowsy mode in an effort to reduce the leakage energy of the device.
Three design issues should be highlighted regarding the proposed accounting cache. First, notice that, as in the original design, dynamic power is saved by avoiding the use of the circuitry necessary to extract data from the ways belonging to the B partition. Second, due to the line swapping between the A and B partitions, the most recently used blocks are always placed in the A partition: the MRU line in configuration C0, the most and second most recently used lines in configuration C1, and so on. More details on the logic required for swapping blocks between the partitions can be found in [3]. Finally, the latency of accessing drowsy blocks is higher than that of accessing fully powered blocks.
Splitting the cache ways into two partitions is also a source of extra latency. In the baseline approach, all the cache ways are accessed in parallel, so the way where the block is placed has no impact on the access latency. In the devised scheme, however, the A partition is accessed first, and the B partition is accessed only after a miss in the A partition, which adds some extra delay to that access. These two sources of extra latency, drowsy wake-up and sequential partition access, could affect the overall performance of our architecture. However, as mentioned above, the majority of accesses go to the most recently used blocks, which the accounting cache keeps in the A partition. Therefore, we will show that our arrangement has very little impact on the overall performance of the application.
III. PHASE ADAPTIVE COST FUNCTIONS
As stated in Section I, the original accounting cache was phase adaptive. That is, a configuration (C0, C1, C2, or C3) is chosen depending on the requirements of the different phases of the workload. At the end of each phase, a cost function is used to determine the optimal configuration for the next phase of the execution. One important feature of this cache scheme is that only a few counters, associated with the MRU state of the accessed blocks, provide sufficient information to calculate the cost function for each possible configuration.
Statistics on the number of hits in each partition and the number of misses are collected in those counters at intervals of 15K instructions, and evaluated using the cost function shown in equation 1.
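$$\text{Cost} = \text{hits}_A \times \text{cost}_A + \text{hits}_B \times \text{cost}_B + \text{misses} \times \text{cost}_{\text{misses}} \qquad (1)$$
where $\text{hits}_A$ and $\text{hits}_B$ refer to the number of hits in the A and B partitions, $\text{misses}$ is the number of misses in that cache level, and $\text{cost}_A$, $\text{cost}_B$ and $\text{cost}_{\text{misses}}$ are the costs of accessing the A partition, accessing the B partition, and a miss, respectively. These costs could be dynamic power [7] or latency [6].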
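As an illustration of how Equation 1 can be evaluated for every configuration from a handful of MRU counters, the sketch below assumes that a hit to MRU position p would have been served by the A partition whenever p is smaller than the number of A ways, and by the B partition otherwise. The function names, the counter values, and the unit costs are all hypothetical placeholders chosen for the example, not values from our experiments.

```python
# Ways assigned to the A partition in each configuration (8-way cache).
CONFIGS = {"C0": 1, "C1": 2, "C2": 4, "C3": 8}


def interval_cost(mru_hits, misses, a_ways, cost_a, cost_b, cost_miss):
    """Equation 1 for one configuration.

    mru_hits[p] counts hits to the block in MRU position p (0 = most recent).
    Assumption: positions below a_ways would have hit in A, the rest in B.
    """
    hits_a = sum(mru_hits[:a_ways])
    hits_b = sum(mru_hits[a_ways:])
    return hits_a * cost_a + hits_b * cost_b + misses * cost_miss


def cheapest_config(mru_hits, misses, unit_costs):
    """Return (best configuration name, cost per configuration).

    unit_costs maps a configuration name to (cost_a, cost_b, cost_miss);
    the values are placeholders, not measurements.
    """
    totals = {
        name: interval_cost(mru_hits, misses, a_ways, *unit_costs[name])
        for name, a_ways in CONFIGS.items()
    }
    return min(totals, key=totals.get), totals


# Hypothetical counters for one interval and made-up unit costs.
mru_hits = [70, 10, 5, 5, 4, 3, 2, 1]
unit_costs = {"C0": (1, 8, 20), "C1": (2, 7, 20), "C2": (4, 5, 20), "C3": (8, 0, 20)}
best, totals = cheapest_config(mru_hits, misses=20, unit_costs=unit_costs)
```

The same counters serve all four configurations, which is why no per-configuration bookkeeping is needed during the interval; only the unit costs change from one configuration to the next.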
The energy cost function proposed in this work also accounts for leakage power and, for the sake of completeness, the cost of swaps, as shown in equation 2.
where the leakage dissipates equally over time for all phase-adaptive simulations, whereas for the drowsy-adaptive simulations its value varies with the configuration. The energy penalty of a miss, $\text{energy}_{\text{misses}}$, has been calculated as the dynamic energy associated with accessing the L3 cache. Tag costs are not included in the cost function since the tag array is always accessed; the tag energy is, however, added to the final energy calculations so that the total energy usage can be compared against the baseline. We assume the processor runs at 3 GHz, allowing for a consistent conversion from leakage power in mW to leakage energy in nJ.
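The conversion from leakage power to leakage energy mentioned above is simple arithmetic, sketched here for clarity. Only the 3 GHz clock comes from the text; the function name and the example values are illustrative assumptions.

```python
CLOCK_HZ = 3e9  # 3 GHz processor clock, as assumed in the text


def leakage_energy_nj(leakage_mw, cycles, clock_hz=CLOCK_HZ):
    """Leakage energy in nJ for a block leaking leakage_mw over a number of cycles."""
    seconds = cycles / clock_hz
    # mW * s = mJ, and 1 mJ = 1e6 nJ.
    return leakage_mw * seconds * 1e6


# Hypothetical example: 300 mW of leakage sustained for 3000 cycles.
example_nj = leakage_energy_nj(300.0, 3000)
```

At 3 GHz, 3000 cycles last one microsecond, so each mW of leakage contributes 1 nJ per 3000 cycles; the hypothetical 300 mW above therefore corresponds to about 300 nJ.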
In addition to the energy cost function, we calculated the delay cost function illustrated in equation 3, using the same values collected in the counters associated with the MRU control bits.
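$$\text{DelayCost} = \text{hits}_A \times \text{delay}_A + \text{hits}_B \times \text{delay}_B + \text{misses} \times \text{delay}_{\text{misses}} \qquad (3)$$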
In an effort to keep the performance as close to the baseline as possible, we investigated a cache design that utilizes the energy-delay product (EDP) as the cost function. While this requires additional logic, the goal is to improve upon the energy cost function.
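The difference between the two selection criteria can be illustrated with a short sketch. It assumes that an energy value and a delay value have already been computed for each candidate configuration over the last interval; the function name and all numbers are hypothetical and chosen only to show a case where the two criteria disagree.

```python
def pick_configuration(candidates, use_edp):
    """Return the configuration with the lowest cost.

    candidates maps a configuration name to (energy_nj, delay_cycles).
    With use_edp=False only energy is compared; with use_edp=True the
    energy-delay product is compared instead.
    """
    def score(name):
        energy, delay = candidates[name]
        return energy * delay if use_edp else energy

    return min(candidates, key=score)


# Made-up interval totals: C0 uses less energy but takes longer than C3.
candidates = {"C0": (100.0, 130.0), "C3": (120.0, 100.0)}
by_energy = pick_configuration(candidates, use_edp=False)
by_edp = pick_configuration(candidates, use_edp=True)
```

In this made-up case the energy criterion selects C0, while the energy-delay product (13000 for C0 against 12000 for C3) selects C3, which is the kind of trade-off the EDP cost function is meant to capture.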
IV. POWER-AWARE CACHE SCHEMES
This work investigates different variants of adaptive caches. To this end, two main design issues are explored: i) the cost function used to determine the configuration for the next execution interval, and ii) whether partition B is placed in drowsy mode. Two different cost functions have been studied: the energy (E) function discussed above, which is aware of both static and dynamic energy, and the EDP function. Combining both design issues yields the four cache schemes discussed next.
• Phase. This scheme assumes that the cache is organized as the original phase adaptive cache, but the cache configuration is determined by using the E function.
• Drowsy. This scheme keeps the B partition in drowsy mode for the whole execution and uses the E function.
• PhaseED. This scheme is similar to Phase, but the EDP function is applied to determine the best configuration.
• DrowsyED. This scheme is similar to Drowsy, but uses EDP as the cost function.
V. EVALUATION METHODOLOGY
Our evaluation methodology uses Multi2Sim [8], an application-only simulation framework, to model a superscalar, out-of-order processor with three levels of cache. The architectural parameters of our design are shown in Table I.
All of the cache parameters (i.e. access time and energy for a given cache geometry) were obtained by using CACTI 6.5 [9]. We assumed a 32nm technology node and configured the cache for uniform cache access (UCA) with multiple banks to speed up access time. To reduce energy consumption, we used a sequential access mode that first accesses the tag array and then the data array. While a parallel lookup of both the tag and data arrays would be faster, our approach uses the least energy of all the access methods.
For the drowsy cache, we simulated the electrical behavior of the cell with HSPICE and obtained a $V_{DD}$ of 0.7V in order to avoid faulty bits due to manufacturing imperfections. In other words, voltage levels below this value can induce faulty bits due to process variation. Of course, if this value were reduced by assuming a perfect cell, our proposal would achieve higher energy savings. The associated energy and power values for 0.7V are shown in Table II.
The SPEC2006-CPU benchmark suite was used to run the simulations; each test was run using the ref input set for two billion instructions, with the first 500,000 instructions fast-forwarded to warm up the caches. The instruction cache (IL1) was set to be perfect to avoid interference and focus on the data cache. The second-level cache (L2) was modeled to be phase-adaptive with four different configurations, as seen in Table III. In addition to the number of cycles needed to access the B partition, when the cache is operating in drowsy mode, one additional cycle is required to wake up the requested cache line before the data can be swapped [10].
The baseline configuration used the same CACTI values as the phase-adaptive cache, and all processor architecture parameters were the same across simulations.
VI. EXPERIMENTAL RESULTS
This section compares the energy and performance results of our simulations of the phase-adaptive cache against a typical baseline cache configuration. Figure 2 shows the relative performance of the studied cache schemes for each benchmark. As observed, the drowsy cache schemes incur negligible performance losses: 1.3% on average, and always below 4%. The reasons for this behavior are best understood by looking at the phase behavior of the different cache schemes.
Figure 3 shows the percentage of time spent in each configuration for the PhaseED cache scheme. In some cases, the time is shared between C0 (1 way in the A partition, 7 ways in the B partition) and C3 (8 ways in the A partition). We do not show similar figures for the other schemes for the simple reason that they exhibit no phase behavior: Phase uses only the largest A partition (C3), while Drowsy and DrowsyED use only the smallest A partition (C0). The first conclusion that can be drawn is that configurations C1 and C2 are not necessary in our approach. Due to the shrinking technology, the differences in dynamic energy among the four configurations are small, and leakage gains relevance in the decision. Since Phase does not consider delay, and the total number of ways is the same across all configurations, the selected configuration is C3, which is exactly the same cache as the one used in the baseline (no B partition, 8-way A partition). PhaseED, having the product of energy and delay as its cost function, does consider time in the decision making. However, the impact of the variations in dynamic energy on the total energy for the C1 and C2 configurations is so small that the selection alternates directly between C0 and C3.
Since C3 is selected the majority of the time, and the performance penalty of accessing the B partition without drowsy mode is very small (for the fraction of time spent in C0), the performance stays almost exactly equal to that of the baseline. The drowsy schemes, on the other hand, have a constant preference for the smallest A partition. Because leakage energy plays such an important role in the cache, placing the B partition in drowsy mode has a very high impact on the total energy. For that reason, a configuration with a large B partition placed in drowsy mode is strongly preferred. The high temporal locality of the benchmarks mentioned in Section I also plays an important role in this decision. As explained in Section II, the accounting cache keeps the most recently used blocks in the A partition. A 1-way A partition will hold the most recently used blocks and will have a high percentage of hits. The fraction of accesses to the B partition is small, and neither the latency impact of these accesses to drowsy ways nor the energy due to swaps is significant. For that reason, the performance of both Drowsy and DrowsyED is only around 4% lower than the baseline in the worst case, and only around 1% lower on average. Whether we consider energy only or energy-delay has no impact on the configuration selection or the performance.
Figure 4 shows the percentage of leakage savings in L2 compared to the baseline. Since Phase and PhaseED have no drowsy B partition, there are no leakage savings for those cache schemes. Since both drowsy architectures consistently select configuration C0 (1-way A partition, 7-way drowsy B partition), both have equal results. Placing the B partition in drowsy mode reduces static energy by 45% on average.
In fact, the small differences across the benchmarks (all of them very close to 45%) come from the small differences in relative performance compared to the baseline, which slightly increase the execution time and, consequently, the leakage energy.
Figure 5 shows the dynamic energy savings (in percentage) with respect to the conventional cache. As the figure shows, the dynamic savings depend much more on the specific cache behavior of each benchmark. Since the Phase scheme selects the 8-way A partition, which is effectively the same cache as the baseline, there are no dynamic savings in this case. PhaseED shows some phase behavior, and the smallest A partition is selected in a number of cases, providing some dynamic savings due to the accesses that avoid the 7 ways of the B partition. The two drowsy schemes consistently select the smallest A partition, avoiding accesses to the 7 ways of the B partition in a high number of cases. For that reason, these two cache schemes provide the best results. The dynamic energy depends on avoiding the logic to access the ways, as well as on the cost of swaps. On average, the dynamic savings are almost 20% of the baseline's dynamic energy, and very similar for the two drowsy cache schemes (Drowsy and DrowsyED). One benchmark worth mentioning is milc, which shows no savings in dynamic energy. If we compare its number of hits in the A and B partitions with those of tonto, we see that milc has over 6 times more hits in the B partition than in the A partition, while tonto has 1.7 times more hits in the A partition than in the B partition. tonto behaves excellently when it comes to avoiding accesses to the B partition, and it shows great dynamic energy savings. On the other hand, milc's behavior is the worst of all the benchmarks, with a very high number of accesses to the B partition, and for that reason it does not show any dynamic savings.
The smallest A partition is nevertheless selected for it, due to the very important leakage savings that we obtain with the large drowsy B partition.
Finally, Figure 6 shows the total amount of energy saved in comparison to the baseline. On average, the phase-adaptive schemes saved just over 1%, while the drowsy-adaptive cache schemes saved over 45%.
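To put the milc and tonto hit ratios from the discussion of Figure 5 in perspective, they can be converted into the share of L2 hits served by the A partition. The helper below is only a worked calculation on the two ratios quoted in the text; the function name is an assumption, and since milc is reported as having "over 6 times" more B hits, its value is an upper bound.

```python
def a_hit_share(hits_a, hits_b):
    """Fraction of all hits that are served by the A partition."""
    return hits_a / (hits_a + hits_b)


# Only the ratios are known, so one side is normalised to 1.
milc_share = a_hit_share(1.0, 6.0)   # 6x more hits in B than in A
tonto_share = a_hit_share(1.7, 1.0)  # 1.7x more hits in A than in B
```

Under these ratios, at most about 14% of milc's hits avoid the B partition, against roughly 63% for tonto, which is consistent with milc showing no dynamic savings.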
It is important to notice that the Drowsy and DrowsyED cache schemes obtain good energy results with hardly any damage to performance, and that these two schemes consistently select the smallest A partition. This leads us to an important conclusion: a constantly partitioned cache with a 1-way A partition and a 7-way B partition, capable of swapping blocks between these two partitions, is what needs to be implemented in hardware. This greatly simplifies the hardware, since there is no need to implement any cost functions or counters to collect statistics.
VII. RELATED WORK
Sundararajan et al. [11] presented the Smart cache, a reconfigurable cache architecture that modifies both the size and the associativity. This design relies on a decision tree and machine learning to determine the optimal configuration. Our approach uses a simple comparison to determine the optimal cache configuration, requiring minimal additional hardware. To save leakage, the Smart cache turns off unused sets and ways, and any modified data residing in those sets and ways needs to be written back, whereas our design is state-preserving and does not require costly and time-consuming write backs. Chen et al. [12] looked at a three-dimensional reconfigurable cache organization for embedded systems that modifies the capacity, line size and associativity. The authors introduce a reconfiguration management algorithm (RMA) to dynamically adjust the cache. This technique requires a search heuristic to reduce the search space from 48 possible configurations to an average of 10. That work only considered dynamic energy, whereas our work reduces both dynamic and static energy. Our work also does not require additional logic to search for the best configuration, reducing the amount of additional logic and the number of cycles it takes to compute the next configuration.
Jiongyao et al. [13] presented an adaptive width data cache (AWDC) as an alternative means to save dynamic and static power. The authors exploited the inherent differences in data value length to dynamically turn off portions of the cache blocks by using Gated-$V_{DD}$ to reduce power within the SRAM cells. A data type detector is used to determine the length of the data and which sections of the cache block can be turned off accordingly. This requires an additional two bits for every cache block, adding to the total power consumption of the SRAM. On reads, the lowest four bits of the cache block are initially read along with the two control bits and, depending on the value, additional pieces of the cache block can be read out. Our design puts entire ways into a drowsy state, reducing the logic required compared to the AWDC.
Recent works have also focused on combining different CMOS-compatible technologies (i.e. embedded DRAM, or eDRAM, and SRAM) to attack leakage power. In this context, Valero et al. [2] proposed a new memory cell design. In [1], Lira et al. extended this concept to the cache bank level and applied it to NUCA caches. However, these works ignore the fact that the eDRAM manufacturing process is still much more expensive than the SRAM process.
VIII. CONCLUSIONS
In this paper we proposed the use of a phase adaptive cache in order to save both dynamic and leakage energy. To do so, we take advantage of the high temporal locality of most tests and the fact that the employed accounting cache places the most recently used blocks in a quick to access A partition, enabling the possibility of placing the B partition in a low powered drowsy mode. The phase adaptive cache dynamically changes the number of ways between the A and B partitions by evaluating a hardware implemented cost function. We tested this approach on an 8 way L2 cache. After several tests, with and without drowsy B partition, and considering both energy and energy-delay for the adaptive decisions, we conclude that there is no need to implement cost functions to make dynamic decisions, since good energy results are achieved only through a constantly partitioned cache with one way in the A partition and 7 ways in the B partition. The average dynamic energy savings are almost 20% of that of the baseline processor, and 45% savings are achieved in leakage energy thanks to the drowsy B partition. And thanks to the high temporal locality of the accesses and the fact that the accounting cache keeps most recently used blocks in the A partition, performance stayed almost the same, with an average loss of 1% across all benchmarks when compared with the baseline.
IX. ACKNOWLEDGMENTS
This work was supported by the Spanish MINECO under Grant TIN2012-38341-C04-01.
