Valero Bresó, A.; Petit Martí, SV.; Sahuquillo Borrás, J.; Kaeli, DR.; Duato Marín, JF. (2015). A reuse-based refresh policy for energy-aware eDRAM caches. Microprocessors and Microsystems. 39(1):37-48.
Introduction
Capacitors in Dynamic Random-Access Memory (DRAM) cells store data as different levels of charge, and this charge leaks out over time. The elapsed time since the capacitor was last charged until data contents are lost is referred to as the retention time. To avoid data loss due to capacitive discharge, DRAM cell contents are periodically read out and written back in a process known as refresh. Refresh operations consume a significant amount of dynamic energy and can negatively impact performance, since refresh requests compete for memory with regular processor read and write requests. This overhead associated with refresh is expected to grow larger in future technologies given their growing memory densities. For instance, refresh energy consumption is expected to reach nearly half the total consumption of future 64Gb DRAM devices [1].
Prior work has concentrated on reducing refresh energy by avoiding issuing unnecessary refresh accesses in off-chip DRAM devices. Regular memory accesses implicitly trigger a refresh operation since DRAM contents are written back after they are read. Prior work [2] has shown we can exploit this behavior by delaying periodic refreshes of frequently requested data. Related work considered inter-cell variation in retention time in order to adapt the refresh period to each memory row [1, 3, 4]. Finally, Error Correcting Codes (ECC) have also been used to recover data lost due to extended refresh periods [5].

Fairly recently, DRAM cells started to be embedded in CMOS technology [6]. [...] caches in the IBM Power8 [7] is around 6MB. This value is expected to grow larger beyond 16MB in Intel's Knights Landing prototype [8].
Similar to off-chip DRAM devices, refresh operations in eDRAM technology represent an important fraction of the total dynamic energy, as shown in Figure 1¹. As observed, refresh energy increases with the growth in aggregated cache capacity implemented with eDRAM and grows by up to 67% for an overall capacity of 16MB. Considering all the L2 processor caches, more lines have to be refreshed during the same retention time, which in eDRAM caches is typically a thousand times shorter than in off-chip DRAM memories [9]. Nevertheless, eDRAM refresh consumption continues to be significant in embedded single-core processors such as the Samsung S5L8900 used in Apple's iPhone devices [10], or even in devices with relatively lower eDRAM capacities (i.e., 2MB), such as Sony's PlayStation Portable (PSP) [11].

¹ Results have been obtained with the machine parameters and methodology described in [...]

The overhead of performing refresh in eDRAM caches has been recently noted by other researchers. A common approach to reduce this overhead is to lower the impact of inter-cell variability on refresh energy [12, 9, 13]. Other work has considered using time-based dead-block predictors [14] and cache block state [15] for filtering refresh requests.

It is generally accepted that only a small subset of L2 cache lines are reused (see Section 3), which means that data locality in L2 caches is much lower than in L1 caches. This behavior has been successfully exploited in recently proposed smart replacement strategies [16, 17, 18, 19]. These mechanisms perform better than Least Recently Used (LRU) replacement, selecting a victim block that is not likely to be reused again, which is normally chosen (e.g., randomly) from a set of candidates. In other words, this class of algorithms speculatively identifies those blocks within a cache set that exhibit poor locality and can be considered as candidates for eviction.
This paper describes a novel strategy that reduces refresh energy in low-end single-core processors. Our approach uses the same information that is used by replacement policies to discern whether a block should be refreshed or not in the L2 cache. In this context, the proposed selective refresh policy is integrated with a state-of-the-art Most Recently Used-Tour (MRUT) replacement algorithm [18].
To further increase energy savings, the refresh policy is evaluated with our proposed energy-aware cache, hereafter referred to as the enaw architecture. As part of this strategy, we allow blocks in different banks to be swapped. This enables the MRU blocks to be placed in the same bank. Cache lines stored in this bank are always refreshed and accessed first by using a technique referred to as bank-prediction.
Experimental results show that, compared to a conventional eDRAM cache using a typical refresh mechanism, the proposed enaw approach with selective refresh reduces refresh energy on average by 72%, whereas the overall on-chip memory hierarchy energy savings are up to 30% on average. These benefits come at the cost of minimal performance degradation and area overhead. Moreover, compared to an energy-aware phased eDRAM cache with typical refresh, the proposed enaw cache reduces the refresh energy consumption by more than 50%, while also improving performance. Finally, the proposed cache architecture achieves the best Energy-Delay-squared Product (ED²P) among the studied schemes.
The remainder of this paper is organized as follows. Section 2 provides related research. Section 3 presents the MRUT replacement algorithm. Section 4 introduces the enaw eDRAM cache architecture with the selective refresh policy. Section 5 analyzes our experimental results, including energy, performance, ED²P, and area. Finally, Section 6 summarizes the paper and discusses directions for future work.
Related Work
Refresh performance for on-chip eDRAM caches is not as easy to optimize as for off-chip DRAM devices, since we cannot adopt most existing off-chip refresh techniques. First, the access time of external Dual In-Line Memory Modules (DIMMs) is at least 6 orders of magnitude faster (from ns to ms) than the next level of the hierarchy (e.g., disks). Thus, less aggressive techniques should be used since a very long disk access is required on a misspeculation. Second, main memory is not organized as a cache, so optimizations such as way-prediction cannot be applied. Third, external DRAM memory works at a coarser (row or page) granularity, where the size is typically several KBytes.
The refresh overhead problem in on-chip eDRAM caches has been previously addressed, taking into account inter-cell feature variations [12, 9, 13] . This prior work pursued solutions that are orthogonal to our proposed selective refresh.
In [12], the authors propose to learn the appropriate refresh period for each cache set via a regressive process. Initially, this process assumes the worst-case refresh period for the entire cache. Then, refresh periods are increased step-by-step until ECC detects data losses. In this way, the proper refresh period for each cache set is selected.
In [9] , an ECC optimization approach is proposed to identify expired data in enlarged refresh periods. This approach provides both single-bit and multi-bit failure detection. The single-bit error can be corrected, while those cache sections with multi-bit errors are disabled to avoid the high latency and complexity of multi-bit error correction.
The Mosaic [13] architecture minimizes the number of refresh operations required by exploiting the fact that cell retention times of eDRAM caches exhibit spatial correlation. The cache is divided into regions with different retention time requirements, and the contents of each region are refreshed at different rates using counters in the cache controller.
Chang et al. [14] describe a mechanism that skips refresh operations in those blocks marked as dead by a time-based dead-block predictor. The refresh mechanism requires four control bits per cache block and three additional bits per cache set. Blocks are refreshed depending on the predictor accuracy, which is controlled by a state machine. If the accuracy for a given block is low, then the elapsed time from the last access to the block until the time it is considered useless is increased. On the other hand, high accuracy means that a reasonable elapsed time has been reached. In contrast to this work, instead of making decisions based on the elapsed time from the last access, our refresh policy exploits reuse information in terms of how many times a block has been accessed, and requires neither predictor assistance nor a state machine.
In [15], periodic refreshes are delayed taking into account the implicit refresh incurred by regular accesses. This mechanism is orthogonal to our refresh policy. Prior work has also proposed a refresh policy that is aware of the block state. Refrint requires a 5-bit counter per cache block. The counter is set to its maximum value when a block is accessed or written back, and it is decremented on each periodic refresh to that block. When the counter reaches zero, the block is written back and refreshed if dirty, or it is invalidated in case it is clean.
The 3T1D-based L1 data cache [20] makes use of the information given by the LRU replacement algorithm to save refresh energy. This approach modifies the cache controller to allocate the MRU data blocks in those lines identified with the longest retention time.
The Cache Decay [21] scheme was proposed for L1 SRAM caches to reduce leakage energy. This approach turns off those blocks that are not being used according to a given value (threshold) of processor cycles. The mechanism works based on the fact that due to L1 data locality, hits in a cache block concentrate in bursts of accesses, which are followed by a long period where the block is not accessed until it is evicted. However, this behavior is uncommon in lower-level caches because L1 caches filter many accesses to L2. Unlike Cache Decay, the goal of our work is to reduce refresh energy in eDRAM caches that minimize leakage by design.
Finally, the Drowsy Cache [22] scheme also focuses on L1 SRAM cache lines. This approach reduces the supply voltage of selected cache lines while preserving their state. This technique is periodically applied in those lines that have not been accessed during a sampling period of time (measured in processor cycles). In contrast, in our proposed adaptive refresh policy, the sampling period is established by tracking cache misses, while the decision to skip the refresh of a line is determined based on the number of misses that hit in the tag array and by the specific information given by the MRUT replacement algorithm.
Exploiting Reuse Information: MRU-Tour
Reuse information has been widely investigated in the past to improve cache performance, especially in L1 data caches [23, 24, 25] . This section describes the MRUT replacement policy that was originally devised for highly associative SRAM low-level caches [18] . This policy is used in this work to drive refresh decisions, that is, to discern which blocks do not need to be refreshed.
The MRUT algorithm works on the MRUT concept, which is defined as the number of times that a block becomes the MRU while it is in cache. Figure 2 illustrates this concept for a generic block A during its generation time. This time starts when A is fetched into the cache (t1), and finishes when [...]

[...] the block is fetched into the cache, its MRUT-bit is set to zero to reflect that an MRUT period has started. If the block is accessed while it is in a non-MRU position, it returns to the MRU location and a new MRUT period starts. This is indicated by setting the MRUT-bit to one, meaning that the block possesses good locality (multiple MRUTs). The victim block is randomly selected among those that have their MRUT-bit set to zero, excluding the last y accessed blocks. This filters stale information, using only recent information, since the LRU stack order is kept for just these y blocks. In case all the blocks in the same set exhibit multiple MRUTs, the victim block is selected among all the blocks (except the last y referenced blocks). Figure 3 shows our implementation of this algorithm. In addition, to adapt the replacement algorithm to dynamic changes in the working set, the MRUT-bits of all the cache blocks are periodically set to zero at runtime in those applications exhibiting few cache misses (e.g., MPKI < 10). Please refer to [18] for further details.
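The replacement rules described above can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the hardware implementation of Figure 3: the class name `MRUTSet`, its fields, and the default `y = 1` are hypothetical, and the sketch keeps a full recency list for simplicity even though the text keeps LRU order for only the last y blocks.

```python
import random
from dataclasses import dataclass, field


@dataclass
class MRUTSet:
    """One cache set managed with an MRUT-style policy (illustrative model only)."""
    ways: int
    y: int = 1  # hypothetical value: how many most recently accessed blocks are protected
    recency: list = field(default_factory=list)   # block tags, most recent first
    mrut_bit: dict = field(default_factory=dict)  # tag -> 0 (one MRUT) or 1 (multiple MRUTs)

    def access(self, tag, rng=random):
        """Access a block; return True on a hit and False on a miss."""
        if tag in self.mrut_bit:
            if self.recency[0] != tag:
                # Hit in a non-MRU position: the block starts a new MRUT.
                self.mrut_bit[tag] = 1
                self.recency.remove(tag)
                self.recency.insert(0, tag)
            return True
        if len(self.recency) == self.ways:
            self._evict(rng)
        # A newly fetched block becomes the MRU block with its MRUT-bit cleared.
        self.recency.insert(0, tag)
        self.mrut_bit[tag] = 0
        return False

    def _evict(self, rng):
        """Pick a random victim outside the last y accessed blocks."""
        unprotected = self.recency[self.y:]
        single_mrut = [t for t in unprotected if self.mrut_bit[t] == 0]
        # If every unprotected block shows multiple MRUTs, choose among all of them.
        victim = rng.choice(single_mrut or unprotected)
        self.recency.remove(victim)
        del self.mrut_bit[victim]

    def reset_mrut_bits(self):
        """Periodic reset used to follow working-set changes."""
        for tag in self.mrut_bit:
            self.mrut_bit[tag] = 0
```

In this sketch, a block that is re-referenced from a non-MRU position is shielded from eviction as long as some unprotected block still has its MRUT-bit at zero, which is the reuse signal the refresh policies later build on.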
Enaw Cache Architecture
This section describes the proposed enaw eDRAM architecture, which relies on two main design features to save energy: i) bank-prediction cache and ii) selective refresh. In addition, we consider swap operations of blocks in different banks to make bank-prediction more performance effective. These enhancements are discussed below.
Bank-Prediction
Parallel access to all the tags and cache ways of the selected set helps enhance associative eDRAM caches where all the ways must be written back on each access after a (destructive) read. Previous work has addressed this problem by predicting which way should be accessed first. Some of these mechanisms [26, 27, 28] have been deployed in commercial microprocessors. These schemes, especially when applied in L1 caches, suffer minor performance losses since data locality is high in these memories [29]. However, data locality is much less predictable in L2, so way-prediction schemes lead to unacceptable performance losses. To deal with this drawback, we propose a method to predict a group of ways (instead of only one) to be accessed first.
In order to discern how many cache ways should be accessed (i.e., predicted) in a first step, we analyzed the hit distribution across the cache ways for a 2MB-16way L2 cache with the LRU algorithm. Figure 4 shows the results.
Labels way-0 and way-1 refer to the cache way storing the MRU block and the [...] barrier if we predict the access to more cache ways. Effectively, the percentage of hits increases to 80% when considering an additional way (considering both way-0 and way-1). Using two ways, the hit percentage is above 90% in 15 of 26 benchmarks, though only 6 benchmarks have a 90% or higher hit ratio in the case where only a single way is considered. We analyzed the benefits of predicting an increasing number of ways, but each additional way provides only a marginal benefit as seen in Figure 4. For instance, including way-2 and way-3 only increases the hit ratio by 3%. Based on these results, we can conclude [...] with the MRU and second MRU blocks in the same bank (we will refer to this bank as the MRU bank), which is accessed first on every cache access. The bank-prediction technique works as follows. On each cache access, both the tag array and the MRU bank of the data array are accessed in parallel in a first step.
Therefore, on a hit in the MRU bank, the access time of the cache is the same as that of a conventional cache. Otherwise, in case of a tag hit associated with any other bank, only the bank containing the target block is accessed in the second step, which starts right after the tag comparison. On a cache miss, just the MRU bank is accessed during the first step and the second step is skipped.

Finally, the tag array is assumed to be built with SRAM technology, since implementing this structure with slow eDRAM would negatively affect the performance when performing the second step and on a cache miss. Besides, as the tag array is much smaller than the data array, much less leakage and area savings can come from it.

In order to keep the pair of ways that hold the MRU data blocks of each set in the same bank, the cache controller is enhanced to implement the swap of blocks between different banks. The design assumes that tags are not swapped. For 16-way caches, four status bits per tag are required to maintain the mapping between tags and cache ways, which is similar to the technique used in [30]. Note that the access to these status bits is not in the critical path since they are read together with the tag array and the MRU bank in the first step of the enaw cache access (see Figure 5). The area overhead introduced by these control bits, as well as the auxiliary buffers, is minimal as analyzed in Section 5.3.
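The two-step access described above can be summarized with a small sketch. It is a simplified model under assumptions: the function name `enaw_access`, its parameters, and the returned tuple are hypothetical, and the swap flag simply reflects that a swap arises on each non-MRU hit.

```python
def enaw_access(tag_hit, target_bank=None, mru_bank=0):
    """Model which data banks a bank-predicted access touches.

    Returns (banks_accessed, steps, swap_needed). Names and the return
    shape are illustrative assumptions, not part of the original design.
    """
    # Step 1: the tag array and the MRU bank are always accessed in parallel.
    banks = [mru_bank]
    if not tag_hit:
        # Cache miss: the second step is skipped.
        return banks, 1, False
    if target_bank == mru_bank:
        # MRU-bank hit: same access time as a conventional cache.
        return banks, 1, False
    # Tag hit in another bank: only that bank is accessed in step 2,
    # and the block is then swapped into the MRU bank.
    banks.append(target_bank)
    return banks, 2, True


print(enaw_access(tag_hit=True, target_bank=0))   # MRU-bank hit
print(enaw_access(tag_hit=True, target_bank=5))   # non-MRU hit
print(enaw_access(tag_hit=False))                 # miss
```

The sketch makes the energy argument visible: at most two data banks are touched per access, instead of all banks as in a conventional parallel lookup.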
Refresh Mechanism
The cache memory accesses and swap operations implicitly refresh the eDRAM contents. However, as the storage cells lose their charge over time, if some form of refresh is not performed, rarely accessed data will be lost. In such a case, if the data are later requested, a clean copy can be fetched from the main memory (assuming a write-through policy), but this negatively impacts both performance and energy consumption. To avoid data losses, refresh cycles are scheduled using either a distributed or burst method in typical DRAM memories [31]. This work assumes a distributed refresh method as the baseline since it is the most commonly used method.
Refresh operations are performed at a cache block granularity, so that the refresh period can be established by computing the shortest charge retention time divided by the number of blocks in the cache. This guarantees that all the blocks are refreshed ahead of losing their contents. This work simulates logic-compatible eDRAM technology using 10fF trench capacitors [32] in CACTI [33], which gives a retention time of 190K processor cycles for a 3GHz processor clock and 45nm technology. Notice that such a retention time value is reasonable, since it is a factor of 1.3x to 2.1x larger than the retention times assumed in previous work [13, 15, 9]. In order to reduce bank conflicts and contention, refresh operations are interleaved across cache banks and follow a round-robin pattern.
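As a worked example of the refresh-period computation above, the following sketch divides the retention time by the number of blocks. The 190K-cycle retention time and 3GHz clock come from the text; the 2MB capacity matches the L2 size analyzed earlier, while the 64-byte block size is an assumption made only for illustration.

```python
RETENTION_CYCLES = 190_000     # retention time from the text (processor cycles)
CLOCK_HZ = 3e9                 # 3GHz processor clock from the text
CACHE_BYTES = 2 * 1024 * 1024  # 2MB L2 capacity
BLOCK_BYTES = 64               # assumed block size, not stated in this section


def refresh_interval_cycles(retention_cycles, cache_bytes, block_bytes):
    """Cycles between consecutive block refreshes in a distributed scheme."""
    num_blocks = cache_bytes // block_bytes
    return retention_cycles / num_blocks


interval = refresh_interval_cycles(RETENTION_CYCLES, CACHE_BYTES, BLOCK_BYTES)
retention_us = RETENTION_CYCLES / CLOCK_HZ * 1e6

print(f"blocks: {CACHE_BYTES // BLOCK_BYTES}")
print(f"one block refresh every {interval:.2f} cycles")
print(f"retention time: {retention_us:.1f} microseconds")
```

Under these assumptions a block refresh must be issued roughly every six cycles, which illustrates why refresh competes so visibly with regular accesses in eDRAM caches.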
In addition to the baseline distributed refresh policy (referred to as Alw ), two additional selective refresh policies exploiting the MRUT concept have been devised for saving energy. We will refer to these as i) Cond and ii) Adp. These policies work as follows:
• Alw . This policy always refreshes the target block regardless of the value of the MRUT-bit and the LRU-stack position of the block.
• Cond. Refresh is applied whenever the following condition is satisfied: the target block is stored in the MRU bank or its MRUT-bit='1' (multiple MRUTs). Otherwise the block is marked as invalid and written back to main memory if dirty (early writeback).
• Adp. This policy dynamically adapts between Alw and Cond at runtime.
The Cond policy aims to reduce energy waste due to refresh with respect to Alw. This policy allows data losses in those blocks less likely to be accessed again (i.e., these blocks are not refreshed). Since Cond is a speculative approach, it can lead to performance degradation caused by misspeculation (a block is requested after capacitive discharge). The block must then be fetched from the main memory. This policy uses the position in the LRU stack and the MRUT-bit to decide whether the block should be refreshed or not. If the prediction accuracy is high, the Cond policy can achieve substantial energy savings with minimal performance loss. However, if the prediction accuracy is low, this policy can severely impact performance.
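The per-block decision taken by the Alw and Cond policies on a periodic refresh can be sketched as a small function. The function name, the string labels for the returned actions, and the argument names are hypothetical and chosen only for illustration.

```python
def refresh_action(policy, in_mru_bank, mrut_bit, dirty):
    """Decide what a periodic refresh does to the target block.

    policy is "Alw" or "Cond"; the returned strings are illustrative labels.
    """
    if policy == "Alw":
        # Always refresh, regardless of MRUT-bit and LRU-stack position.
        return "refresh"
    if policy == "Cond":
        if in_mru_bank or mrut_bit == 1:
            return "refresh"
        # Block is predicted useless: invalidate it, writing back first if dirty.
        return "writeback_and_invalidate" if dirty else "invalidate"
    raise ValueError(f"unknown policy: {policy}")


print(refresh_action("Cond", in_mru_bank=False, mrut_bit=0, dirty=True))
print(refresh_action("Cond", in_mru_bank=False, mrut_bit=1, dirty=False))
```

A misspeculation in this model corresponds to a later access to a block for which the function returned one of the invalidation actions.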
Because Cond can threaten performance in the case of low prediction accuracy, the Adp policy has been devised to deal with this drawback. The Adp policy dynamically selects which of the previous policies will be applied depending on the previous cache behavior. It uses a pair of counters for the whole cache to track the number of standard cache misses (miss counter), as well as the number of misses that hit in the tag array when the associated data line has lost its contents (tag-hit-data-miss counter). These misses occur when the requested block has been previously invalidated when working under the Cond policy, and they are a subset of the number of standard misses.
Initially, the Cond policy is selected by default, both counters are reset to zero, and a sampling period starts. This period finishes when the miss counter reaches a given value (e.g., 128). At that point, if the tag-hit-data-miss counter exceeds a given threshold (e.g., 8), the Alw refresh policy is selected during the next sampling period. Otherwise the Cond policy continues to be applied. Then, both counters are reset and a new sampling period starts. In the case where the Alw policy is used during a given sampling period and the tag-hit-data-miss counter does not surpass its threshold, the Cond policy is chosen for the next sampling period.
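The sampling logic of the Adp policy can be sketched as a small controller. This is a software illustration under assumptions: the class and method names are hypothetical, the limits 128 and 8 are the example values given in the text, and the comparison `>=` is an assumption that mirrors checking a single counter bit (the text says the counter "exceeds" the threshold).

```python
class AdpController:
    """Chooses between the Alw and Cond refresh policies per sampling period."""

    def __init__(self, miss_limit=128, thdm_threshold=8):
        self.miss_limit = miss_limit          # misses per sampling period (example value)
        self.thdm_threshold = thdm_threshold  # tag-hit-data-miss threshold (example value)
        self.policy = "Cond"                  # Cond is selected by default
        self.misses = 0
        self.thdm = 0

    def record_miss(self, tag_hit_data_miss=False):
        """Count one cache miss and return the policy now in force."""
        self.misses += 1
        if tag_hit_data_miss:
            # The tag matched but the data line had been invalidated under Cond.
            self.thdm += 1
        if self.misses >= self.miss_limit:
            # End of the sampling period: pick the policy for the next one.
            # Assumption: ">=" models the single-bit check of the counter.
            self.policy = "Alw" if self.thdm >= self.thdm_threshold else "Cond"
            self.misses = 0
            self.thdm = 0
        return self.policy


ctrl = AdpController()
for i in range(128):
    ctrl.record_miss(tag_hit_data_miss=(i < 8))
print(ctrl.policy)  # policy chosen for the next sampling period
```

Because the period is delimited by misses rather than cycles, applications with few misses switch policies rarely, while miss-heavy phases are re-evaluated often.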
Threshold values of the miss and tag-hit-data-miss counters are defined as powers of two, in order to keep the hardware simple. That is, it is enough to check just a single bit of each counter to choose the policy to be applied during the next sampling period. Note that the sampling period is independent of the refresh period, which is determined by the retention time and number of blocks.

[...] in their second access. Finally, according to the round-robin policy followed by the refresh mechanism, the seventh periodic refresh to this cache set (not shown) would correspond to the first MRU way (assuming in this example that each cache bank implements a single way in the cache).
Experimental Evaluation
Next, we present the simulation environment used to evaluate the studied cache schemes. An extended version of the SimpleScalar simulation framework [34] has been used to model the MRUT replacement algorithm and the devised enaw architecture with the selective refresh policies. Leakage and dynamic energy were estimated from the execution time and the required memory events (i.e., cache hits, misses, writebacks, swaps, and refreshes) of the benchmarks, respectively. Bank conflicts and contention due to all these memory events have also been considered. Accesses to different banks can be issued concurrently, but an access to a given bank must wait until the previous access to [...]

Experiments have been performed for the Alpha ISA with the ref input set and running the SPEC CPU benchmark suite [36]. Results were collected simulating 500M instructions after skipping the initial 1B instructions. Table 1 summarizes the main architectural parameters.
Impact on Dynamic Energy
We want to evaluate the dynamic energy savings of the proposed enaw eDRAM cache and the refresh policies applied to it. Dynamic consumption has been divided into four major components, according to the type of operation: [...] the main memory and L2 writebacks to the main memory (Miss and writeback consumption). The first component covers the energy spent in the access to the L2 cache, including the tag array, data array, and cache controller logic (e.g., decoders, multiplexers, and sense amplifiers). Remember that a swap operation arises in each non-MRU hit. Its consumption has been obtained as the sum of the cost of a read access to the MRU bank, a read access to the target non-MRU bank, a write access to that bank (i.e., step 2 in Figure 7(a)), and a write access to the MRU bank (step 3). The unidirectional transfer performed on each cache miss (step 1 in Figure 7(b)) has also been included in this category as a write access to a non-MRU bank. The L2 refresh consumption considers the energy cost due to refreshing the contents after reads and periodic refresh operations.
Finally, a commodity 2GB DRAM main memory is assumed to estimate energy due to L2 misses and L2 writebacks. The number of these L2 memory events differs among refresh policies since they result in different prediction accuracies.
First, we analyze the effects of the devised bank-prediction technique and swap operations. Figure 9(a) and Figure 9([...] mJ) consumed, for Integer (Int) and Floating-Point (FP) benchmarks, respectively. We are evaluating the enaw cache working with the previously described Alw policy (labelled as Alw in the graph). Two additional cache schemes have been considered for comparison purposes. Label Conv refers to a conventional eDRAM cache that accesses the tag array in parallel with all the data cache ways, while Psd refers to a phased eDRAM cache that serializes accesses to the tag array and the data array [30]. Thus, the access time includes the tag array latency plus the data array latency, and only the target bank is accessed after the tag comparison. For the sake of fairness, it is assumed that these caches are implemented with the same technology and replacement policy as the proposal. That is, the tag array is built with SRAM technology and the MRUT replacement algorithm is used. In addition, both cache schemes implement the baseline Alw refresh policy.
Results differ wildly across benchmarks for a given refresh policy due to two main factors. First, all the analyzed types of consumption depend on the number of accesses to L2 or main memory. The higher the number of accesses, the higher the energy consumption. Second, applications have different execution times.
Those benchmarks with longer execution times significantly increase the refresh energy (e.g., mcf and ammp).
Compared to Conv, the consumption of the enaw cache is significantly reduced. Note that in some benchmarks such as gcc and vortex, the Access consumption of the conventional cache alone exceeds the total dynamic energy consumption of the proposed enaw approach. This is because fewer accesses and refresh operations are carried out in the enaw cache thanks to the bank-prediction technique, which enables the proposed cache to access just the target non-MRU bank after the tag comparison. In addition, although enaw wastes energy in swap operations, this consumption is minimal because most of the hits concentrate in the MRU bank (see Figure 11). In fact, the swap energy represents on average only 11% of the overall dynamic consumption.
The highest swap energy overhead can be found in art, which is the benchmark with the highest number of non-MRU hits.
Compared to Psd, the enaw cache slightly increases the overall dynamic energy. Such energy waste, which can be observed in some of the applications like mcf, comes from the bank-prediction inaccuracy (useless MRU bank accesses) that produces a subsequent swap operation. On the other hand, Psd consumes slightly more refresh energy than Alw because the target bank must be refreshed on each cache access (Psd does not include swaps) and it increases the execution time (see Section 5.2). Finally, notice that for a given benchmark, the Miss and writeback energy is the same regardless of the cache design. This is due to the fact that all the studied schemes prevent data loss by refreshing all the cache blocks.

[...] for miss and tag-hit-data-miss, respectively, since this pair is the most energy-efficient⁴.

⁴ Increasing the tag-hit-data-miss threshold reduces the Refresh consumption, but the Miss and writeback energy increases much more due to additional induced misses and writebacks. In contrast, a smaller threshold value makes the behavior of Adp closer to that of Alw.

For [...]

The MRU hit ratio is on average 70%, while this percentage is above 80% in half of the applications. These results illustrate the effectiveness of the bank-prediction and the swap mechanism, since most of the cache accesses hit the MRU bank at the first step.
Performance
An interesting observation is that the MRU hit ratio remains constant for each benchmark regardless of the refresh policy used. This is because the refresh methods considered here assume the same placement/replacement strategies, and blocks stored in the MRU bank are always refreshed.
Regarding the non-MRU hit ratio, it is lower for the Cond policy than for the Alw policy. This is because Cond refreshes a small percentage of non-MRU blocks that are reused later. For instance, in a number of the benchmarks (9 of [...]

[...] the miss threshold, larger values allow the Cond refresh to be applied for a long period, which may lead to severe performance losses. On the contrary, smaller values shorten the sampling period in such a way that the mechanism does not have enough information to decide which is the most appropriate policy for the next period.

[...] of Alw is roughly the same as that of Conv, although differences can be found in some specific benchmarks. For example, in applications such as crafty and vortex, Conv outperforms Alw. This is mainly due to the fact that the latter scheme has to wait for the tag comparison before accessing the target non-MRU bank. In other benchmarks like mcf and bzip2, Alw obtains better results than the conventional cache. This is because the latter approach experiences higher bank contention, since on each cache access, all the cache banks are accessed in parallel. Thus, a given memory request may be stalled by a previous access, independent of the target bank. In mcf and art, this factor severely impacts Conv, since both selective refresh policies obtain better performance.
As explained above, the performance differences between Alw and the selective [...]

This section quantifies the total on-chip memory hierarchy consumption, Energy-Delay-squared Product (ED²P), and area for the studied L2 caches and both SRAM-based L1 data and L1 instruction caches.

[...] mm²) include all the components considered for leakage, plus the overhead of the status bits required to keep the mapping between tags and data blocks, the MRU-bits, and the MRUT-bits. The ED²P values for each cache architecture were obtained by multiplying the corresponding total energy (in mJ) by the squared execution time (in ms).
Compared to the L1 caches, although the proposed L2 caches are eDRAM-based, the amount of leakage current significantly increases in the eDRAM caches, mainly due to their larger capacity. Both conventional and the enaw schemes using Alw consume the same amount of leakage energy, closely followed by Adp. Leakage differences appear because this energy is proportional to the execution time, and longer execution time (see Figure 12) implies higher consumption. Depending on the cache architecture, the ratio of leakage with respect to the overall energy varies from 28% (Conv) to 49% (enaw caches with selective refresh).
The dynamic energy also increases with the storage capacity, although the proposed caches save energy by applying bank-prediction. Compared to the L1 data cache, the L1 instruction cache consumes a larger amount of dynamic energy because it is more frequently accessed. As discussed above, compared to Conv, the proposed cache reduces dynamic energy by implementing bank-prediction and swap operations. In addition, the selective policies considerably minimize the refresh costs. This allows Adp to be the most energy conservative cache among all the studied schemes. Overall, this approach reduces the total energy by 41% with respect to the conventional cache. Taking into account all the on-chip memory hierarchy, this percentage is 30%.
The enaw cache using the Adp policy is also the best cache design choice from the perspective of the ED²P (lower is better). Despite the fact that Adp increases the execution time when compared to Conv and Alw, its overall energy savings allow this scheme to obtain the lowest ED²P among all the studied caches, even though this metric gives more weight to performance than to energy. Note that although Alw consumes more energy than Psd (mainly due to the savings of predicting the access to the MRU bank), Alw also reduces the ED²P compared to Psd because it performs better.
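The ED²P metric used in this comparison is the total energy multiplied by the squared execution time. The short sketch below shows how a scheme can win on ED²P despite running slower; the function name and all numeric values are hypothetical and are not results from this paper.

```python
def ed2p(energy_mj, exec_time_ms):
    """Energy-Delay-squared Product: energy (mJ) times execution time (ms) squared."""
    return energy_mj * exec_time_ms ** 2


# Hypothetical numbers for illustration only (not measured results).
baseline = ed2p(energy_mj=10.0, exec_time_ms=100.0)
# A scheme using 30% less energy while running 5% slower.
candidate = ed2p(energy_mj=7.0, exec_time_ms=105.0)

print(f"baseline ED2P:  {baseline:.1f}")
print(f"candidate ED2P: {candidate:.1f}")
print(f"ratio: {candidate / baseline:.4f}")
```

Because the delay term is squared, the energy savings must outweigh the square of the slowdown for the candidate to come out ahead, which is the situation described above for Adp.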
Finally, the scant area overhead (0.4%) that we see in the enaw caches with respect to both conventional and phased approaches is due to the use of intermediate buffers and status bits. Regarding the additional hardware incurred by the proposed Adp refresh method, remember that it only requires two counters for the entire cache, and the policy to be applied is selected by checking just a single bit of each counter, yielding negligible overhead.
Conclusions and Future Work
This paper has shown that the information used by recent replacement algorithms for highly-associative caches can help designers to efficiently reduce the refresh overhead of eDRAM caches. We have studied selective refresh mechanisms that rely on a state-of-the-art MRU-Tour replacement algorithm, which leverages reuse information to identify useless blocks exhibiting poor locality and considers them as candidates for eviction. The proposed refresh policy prevents these blocks from being refreshed, reducing the overall energy consumption.
To further obtain energy benefits, the refresh policies have been evaluated on an energy-aware (enaw) cache architecture, which reduces dynamic energy by applying bank-prediction and swap operations. Compared to a conventional cache with the same storage capacity, the enaw cache reduces refresh energy by 72% and dynamic energy by 58%, which translates into a 30% reduction of the overall on-chip memory hierarchy (leakage and dynamic) consumption. These energy benefits are achieved with minimal impact on performance and area.
Compared to an energy-aware phased cache, the devised enaw cache cuts the refresh energy consumption in half, while improving the performance. Finally, the energy-delay-squared product analysis further supports that the enaw cache with selective refresh is the best design option among the studied schemes.
Our evaluation here has focused on L2 caches of single-core processors. For future work we plan to extend the selective refresh design to much larger shared low-level caches and multithreaded workloads in chip multi-processors.
