EDRAM cells require periodic refresh, which ends up consuming substantial energy for large last-level caches. In practice, it is well known that different eDRAM cells can exhibit very different charge-retention properties. Unfortunately, current systems pessimistically assume worst-case retention times, and end up refreshing all the cells at a conservatively high rate. In this paper, we propose an alternative approach. We use known facts
Introduction
An attractive approach to reduce the energy wasted to leakage in the cache hierarchy of multicores is to use embedded DRAM (eDRAM) for the lower levels of caches. EDRAM is a capacitor-based RAM that is compatible with a logic process, has high density and leaks very little [13]. While it has higher access times than SRAM, this is not a big concern for large lower-level caches. As a result, eDRAM is being adopted into mainstream products. For example, the IBM POWER7 processor includes a 32 MB on-chip eDRAM L3 cache [32], while the POWER8 processor will include a 96 MB on-chip eDRAM L3 cache and, potentially, an up to 128 MB off-chip eDRAM L4 cache [28]. Similarly, Intel has announced a 128 MB off-chip eDRAM L4 cache for its Haswell processor [2].
EDRAM cells require periodic refresh, which can also consume substantial energy for large caches [1, 34]. In reality, it is well known that different eDRAM cells can exhibit very different charge-retention properties and, therefore, have different refresh needs. However, current designs pessimistically assume worst-case retention times, and end up refreshing all the eDRAM cells in a module at the same, conservatively high rate. For example, they use a refresh period of around 40 µs [3]. This naive approach is wasteful.

978-1-4799-3097-5/14/$31.00 ©2014 IEEE
Since eDRAM refresh is an important problem, there is significant work trying to understand the characteristics of eDRAM charge retention (e.g., [11, 16, 17]). Recent experimental work from IBM has shown that the retention time of an eDRAM cell strongly depends on the threshold voltage (Vt) of its access transistor [17].
In this paper, we note that, since the values of Vt within a die have spatial correlation, eDRAM retention times will also necessarily exhibit spatial correlation. This suggests that architectural mechanisms designed to exploit such correlation can easily save refresh energy. There is prior work on solutions that exploit the non-uniformity of the retention time of cells in dynamic memories to reduce refreshes. Examples include RAPID [31], Hi-ECC [34], the 3T1D-based cache [19], and RAIDR [21]. We discuss them in detail in a later section. Fundamentally, our contribution is at a different level, in that we investigate and identify the main source of this variation, and build a mathematical model of the variation. The model shows the presence of spatial correlation in retention times. Building on this novel observation, we propose a targeted solution to minimize refreshes.
Our results show that the Mosaic tiled architecture is both inexpensive and very effective. An eDRAM L3 cache augmented with Mosaic tiles increases its area by 2% and reduces the number of refreshes by 20 times. This reduction is 5 times the one obtained by taking the RAIDR scheme for main memory DRAM [21] and applying it to cache eDRAM. With Mosaic, we get very close to the lower bound in refresh energy, and end up saving 43% of the total energy in the L3 cache.

This paper is organized as follows: Section 2 discusses the problem addressed; Section 3 introduces the Mosaic model; Sections 4 and 5 present the Mosaic architecture; Sections 6 and 7 evaluate them; and Section 8 covers related work.
Problem Addressed
In this section, we discuss how eDRAM cells retain charge. We observe that the expected retention time and the one assumed in practice are off by orders of magnitude. We then present the distribution of the retention time and discuss its sources.

Therefore, an eDRAM cell requires periodic refresh to maintain the correct logic state. The leakage through the transistor depends on the threshold voltage (Vt) of the transistor. The higher the Vt is, the lower the leakage is and, therefore, the cell retains its logic value for longer. Conversely, a low Vt results in more leakage and, hence, the cell loses its logic value sooner. On the other hand, a higher Vt reduces the overdrive of the transistor and increases the access time of the cell. Therefore, there is a tradeoff between the cell access time and how long it retains its value.
eDRAM Cell Retention Time
We now derive a closed-form mathematical equation relating the parameters of the cell to its retention time. Let C be the storage capacitance, W and L the width and length of the access transistor, V the voltage applied to the gate of the access transistor, St the subthreshold slope (defined below), Ioff the off drain current through the access transistor, and Tret the retention time of the eDRAM cell. Tret is defined as the time until the capacitor loses 6/10th of the stored charge [17], that is,

$$T_{ret} = \frac{0.6 \times C \times V}{I_{off}} \qquad (1)$$
The definition of Vt is empirical. The definition varies from foundry to foundry, and across technology nodes. Kong et al. [17] define it as the gate voltage at which the current becomes the expression on the right in Eq. 2.
The subthreshold slope is defined as the inverse of the slope of the semi-logarithmic Ioff-V curve, that is,

$$S_t = \left( \frac{d \log_{10} I_{off}}{dV} \right)^{-1} \qquad (3)$$
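Re-arranging and substituting,

$$T_{ret} = 0.6 \times C \times V \times \frac{L}{W} \times 10^{V_t / S_t} \times \frac{10^{9}}{300} \ \text{sec} \qquad (5)$$

From [17], at 65 nm technology, we get C = 20 fF, L = W = 100 nm, Vt = 0.65 V and St = 112 mV/dec. Substituting these values in Eq. 5, we get Tret = 25.44 ms.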
Therefore, we expect eDRAM cell retention times to be of the order of a few tens of milliseconds. However, in practice, eDRAM cells are refreshed with a period of the order of a few tens of microseconds. For example, Barth et al. [3] report a time of 40 µs. This is because manufacturing process variations result in a distribution of retention times and, to attain a high yield, manufacturers choose the retention time for the entire memory module to be the one of the leakiest cells.
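The arithmetic behind the 25.44 ms estimate can be reproduced with a short sketch. It rests on two assumptions that the text does not state explicitly: a cell voltage of V = 1.0 V, and a reference current of 300 nA x W/L at a gate voltage equal to Vt, which is how the 10^9/300 factor in Eq. 5 is read here. The function name and argument names are illustrative only.

```python
# Sketch of Eq. 5: retention time of an eDRAM cell from its access-transistor Vt.
# Assumptions (not stated in the text): V = 1.0 V, and a reference current of
# 300 nA * W/L when the gate voltage equals Vt.

def retention_time_s(c_farad, v_volt, w_nm, l_nm, vt_volt, st_volt_per_dec):
    """Time (seconds) for the capacitor to lose 6/10th of its charge."""
    i_ref_amp = 300e-9 * (w_nm / l_nm)               # current at Vgate = Vt
    # The off current drops by one decade for every St volts below Vt.
    i_off_amp = i_ref_amp * 10 ** (-vt_volt / st_volt_per_dec)
    return 0.6 * c_farad * v_volt / i_off_amp

# Values quoted from [17] at 65 nm: C = 20 fF, L = W = 100 nm,
# Vt = 0.65 V, St = 112 mV/dec.
t_ret = retention_time_s(20e-15, 1.0, 100, 100, 0.65, 0.112)
print(f"Tret = {t_ret * 1e3:.2f} ms")   # a few tens of milliseconds
```

Under these assumptions the result lands within rounding of the value quoted above, roughly three orders of magnitude above the 40 µs refresh period used in practice.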
Retention Time Variation
It is well known that there is variation in the retention time of eDRAM and DRAM cells (e.g., [11, 16, 17]). The overall distribution and the sources of variation have also been identified.

[Figure 2: Typical eDRAM retention time distribution [17]. Horizontal axis: retention time (µs).]
fraction of the cells, as given by the area under the curve of a normal distribution from -4σ to ∞. In addition, the fact that it appears as a straight line in the log-normal plot of Fig. 2 indicates that log10 Tret follows a normal distribution for the Bulk, or that Tret follows a log-normal one for the Bulk.

Based on experimental data, Kong et al. [17] from IBM say "We demonstrate that the Tret (Bulk) Distribution can be attributed to array (i.e., access transistor) Vt variation". This is a key observation, and is consistent with what we know about Vt's process variation distribution. Indeed, it is accepted that process variation in Vt follows a normal distribution [17]. If we take the log of Eq. 5, we obtain,

$$\log_{10} T_{ret} = \frac{V_t}{S_t} + \text{expression} \qquad (6)$$

which shows that a normal distribution of Vt results in a normal distribution of log10 Tret and, hence, a log-normal distribution of Tret. This agrees with the straight line in Fig. 2.
The Tail Distribution includes very few cells. Since it covers the area under the curve from -∞ to approx. -4σ in Fig. 2
The Mosaic Retention Time Model
We want to develop a new model of eDRAM retention time that can help us to understand and optimize eDRAM refreshing.
This section describes our model, which we call Mosaic.
Extracting the Values of Retention Parameters
To build the model, we first need to obtain the values for the key parameters of the Tret Bulk and Tail Distributions in Fig. 2. Specifically, we need: (i) the mean and sigma of the Bulk Distribution (μBulk, σBulk), (ii) the mean and sigma of the Tail Distribution (μTail, σTail), and (iii) the fraction of cells that follow the Tail Distribution (p). From Kong et al. [17], we obtain that μ(Vt) = 0.65 V, σ(Vt) = 0.042 V, and St = 112 mV/dec. Therefore, from Eq. 6, and computing expression based on Eq. 5, we get the parameter values for the Bulk Distribution:

$$\mu_{Bulk}(\log_{10} T_{ret}) = \log_{10}(25.44 \times 10^{-3}) = -1.594, \qquad \sigma_{Bulk}(\log_{10} T_{ret}) = \frac{\sigma(V_t)}{S_t} = 0.375$$
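The Bulk parameters follow mechanically from Eq. 6, so they can be checked with a few lines. The sketch below assumes that Tret is expressed in seconds and that the mean of log10 Tret corresponds to the 25.44 ms obtained for Vt = 0.65 V; the variable names are illustrative. It also evaluates the area of a normal distribution below -4σ, which is the first-cut estimate of the Tail fraction.

```python
import math

# Values quoted in the text from Kong et al. [17].
MU_VT, SIGMA_VT, ST = 0.65, 0.042, 0.112   # volts, volts, volts/decade
TRET_AT_MEAN_VT_S = 25.44e-3               # Tret at Vt = 0.65 V (Eq. 5)

# Eq. 6 is linear in Vt, so a normal Vt maps to a normal log10(Tret).
sigma_bulk = SIGMA_VT / ST                       # slope 1/St scales sigma
mu_bulk = math.log10(TRET_AT_MEAN_VT_S)          # log10 of Tret in seconds
expression = mu_bulk - MU_VT / ST                # constant term of Eq. 6

def log10_tret(vt_volt):
    """Eq. 6: log10(Tret) for a given access-transistor Vt."""
    return vt_volt / ST + expression

# Area under a standard normal curve from -infinity to -4 sigma.
tail_fraction = 0.5 * math.erfc(4 / math.sqrt(2))

print(f"mu_bulk = {mu_bulk:.3f}, sigma_bulk = {sigma_bulk:.3f}")
print(f"cells below -4 sigma: {tail_fraction * 1e6:.1f} ppm")
```

The last value is a few tens of parts per million, which is consistent with the Tail Distribution covering very few cells.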
Kim and Lee [16] 
hence, μTail(log10 Tret) = -2.719
We obtain the last two parameters, namely σTail(log10 Tret) and p, by curve-fitting the data in Fig. 2. We obtain, σTail(log10 Tret) = 1.8
This value of p is more accurate than our initial estimation of 31 ppm for the Tail Distribution in Sec. 2.2. The final parameter values are summarized in Table 2. With these values, we generate the curve shown in Fig. 3. On this same graph, we superpose the experimental data from

[Table 2: Parameter values extracted from the data in [17].]
Generating a Spatial Map of Retention Times
The parameter values for the Bulk and Tail Distributions extracted in the previous section are not enough to generate a spatial map of the Tret values in a given memory module. The reason is that σBulk(log10 Tret) has a random and a systematic component, and the latter has a spatial correlation function. These effects are caused by the Vt distribution, as per Eq. 6. Different values for the breakdown of σ into systematic and random components, and for the spatial correlation, do not change the Bulk line in Fig. 2, as long as the total σ stays constant. However, they produce very different spatial maps of Tret values. For example, if the fraction of σ coming from its systematic component is high and the correlation distance is long, there will be spatial clusters of eDRAM cells with similar Tret. For the Tail Distribution, since it does not have any systematic component, we do not need any more information.

Overall, to generate a spatial map of the Tret in an eDRAM memory module, we need to know: (i) the values in Table 2, (ii) the breakdown of σBulk(log10 Tret) into random and systematic components, and (iii) the correlation function for the systematic component. Our observation is that we can obtain (ii) and (iii) based on published data on the Vt variation of the access transistors. Specifically, as per Eq. 6, the breakdown of σBulk(log10 Tret) into random and systematic components is the same as the breakdown of σ(Vt). Similarly, the correlation of log10 Tret's systematic component follows the correlation of Vt's systematic component.

Table 2. For the Vt variation parameters of the access transistors, we will assume the following. First, following Karnik et al. [14], σ(Vt) has equal systematic and random components, i.e., σsys = σrand = σ/√2. Second, the correlation function for Vt's systematic component follows the Spherical function described in VARIUS [26] with a correlation distance φ of 0.4.
To generate the spatial Tret map, we first generate the one for its Bulk Distribution and then superpose the one for its Tail Distribution. For the Bulk one, we first generate the spatial map for the systematic component and then superpose the one for the random component. To generate the Tret's Bulk Distribution maps, we will first proceed with the intermediate step of generating the Vt maps and then use Equation 6 to generate Tret's Bulk maps. This is done for pedagogical reasons, since we could instead directly generate Tret's Bulk maps using (i) the μBulk and σBulk of Table 2, (ii) the breakdown of σ(Vt), and (iii) the correlation of Vt's systematic component.
3.3.1. Spatial Map for log10 Tret's Bulk Distribution. Following the VARIUS methodology [26], we lay out an imaginary grid of Nx x Ny points on top of the eDRAM memory module. We then invoke VARIUS with μ(Vt) = 0.65 V, σ(Vt) = 0.042 V (these two values are from Kong et al. [17]), and correlation distance φ = 0.4 (this is one of our assumptions). We obtain a spatial map of Vt's systematic component, as shown in Fig. 4a. We then obtain Nx x Ny samples of a normal distribution with μ = 0 and σrand(Vt), without correlation, as shown in Fig. 4b. As expected, the spatial map looks like white noise. We then superpose both maps, point per point, and obtain the total Vt map in Fig. 4c. Finally, the spatial map for log10 Tret's Bulk Distribution is obtained from the spatial map of the total Vt by applying Eq. 6 to every point in the grid. The resulting map is shown in Fig. 4d.
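The map-generation steps can be illustrated with a small, self-contained sketch. This is not the VARIUS tool: it is a stand-in that assumes the spherical correlation function has the usual form 1 - 1.5(r/φ) + 0.5(r/φ)^3 for r < φ and 0 beyond, that the module is normalized to a unit square so that φ = 0.4 is a fraction of its side, and that correlated samples are drawn through a Cholesky factorization of the correlation matrix. All function names are hypothetical, and the constant term of Eq. 6 is anchored to the 25.44 ms value.

```python
import math
import random

ST = 0.112                                          # volts/decade, from [17]
EXPRESSION = math.log10(25.44e-3) - 0.65 / ST       # constant term of Eq. 6

def spherical_corr(r, phi):
    """Spherical correlation: 1 at r = 0, falling to 0 for r >= phi."""
    if r >= phi:
        return 0.0
    x = r / phi
    return 1.0 - 1.5 * x + 0.5 * x ** 3

def cholesky(a):
    """Lower-triangular factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(max(a[i][i] - s, 1e-12))
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def vt_map(nx, ny, mu=0.65, sigma=0.042, phi=0.4, seed=0):
    """Total Vt map: mean + correlated systematic part + white-noise part."""
    rng = random.Random(seed)
    sigma_sys = sigma_rand = sigma / math.sqrt(2)   # equal split, as assumed
    pts = [(x / (nx - 1), y / (ny - 1)) for y in range(ny) for x in range(nx)]
    n = len(pts)
    corr = [[spherical_corr(math.dist(p, q), phi) + (1e-9 if i == j else 0.0)
             for j, q in enumerate(pts)] for i, p in enumerate(pts)]
    low = cholesky(corr)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    systematic = [sigma_sys * sum(low[i][k] * z[k] for k in range(i + 1))
                  for i in range(n)]
    total = [mu + s + rng.gauss(0.0, sigma_rand) for s in systematic]
    return [total[r * nx:(r + 1) * nx] for r in range(ny)]

def log10_tret_map(vt_rows):
    """Apply Eq. 6 to every grid point of a Vt map."""
    return [[vt / ST + EXPRESSION for vt in row] for row in vt_rows]

bulk_map = log10_tret_map(vt_map(8, 8))
print([round(v, 2) for v in bulk_map[0]])
```

Because the factorization cost grows with the cube of the number of grid points, this approach only suits small illustrative grids; it is meant to show the superposition of the systematic and random components, not to replace the methodology of [26].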
Using the Mosaic Retention Model Across Generations
As we move from one eDRAM generation to the next, we need 
The Mosaic Architecture

Insight and Opportunity
The analysis in the previous section has provided a useful insight: since the retention time of an eDRAM cell is highly dependent on its access transistor's Vt, and Vt has well-known spatial correlation properties, then the retention time also has spatial correlation. This fact offers an opportunity to save refresh energy. Specifically, we can logically group cells into regions, profile their retention time, and set up time counters to refresh the regions only at the frequency that each one requires. With reasonable spatial correlation, the hardware cost of the counters will be minimal.

As an example, consider the eDRAM of Fig. 4f. Let us organize it as a 4-way set associative memory with 256-bit lines. Since eDRAM reads, writes and refreshes are performed at line granularity, we need the line-level distribution of log10 Tret. For this, we set a line's Tret to the minimum of the Tret of the bit cells constituting the line. The result is shown in Fig. 4g.
If Nlines is the number of lines in the memory module, the per-line retention time map gives an absolute lower bound of the number of refreshes required to prevent data loss:

$$\text{Min. refreshes/sec} = \sum_{i=1}^{N_{lines}} \frac{1}{T_{ret\_line\_i}} \qquad (7)$$
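As a small illustration of the bound, the sketch below first reduces per-cell retention times to per-line ones by taking the minimum over each line, and then sums the reciprocal line retention times. The cell values and the two-cell line width are made-up numbers chosen only to keep the arithmetic visible.

```python
def line_retention(cell_tret_s, bits_per_line):
    """Per-line Tret: the minimum Tret over the cells of each line."""
    return [min(cell_tret_s[i:i + bits_per_line])
            for i in range(0, len(cell_tret_s), bits_per_line)]

def min_refreshes_per_sec(line_tret_s):
    """Lower bound: each line refreshed exactly once per its own Tret."""
    return sum(1.0 / t for t in line_tret_s)

# Hypothetical per-cell retention times in seconds, two cells per line.
cells = [0.020, 0.025, 0.010, 0.040]
lines = line_retention(cells, bits_per_line=2)   # weakest cell sets the line
print(lines, min_refreshes_per_sec(lines))
```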
This lower bound would be hard to attain. First, the memory has a limited number of ports, and only a number of lines equal to the number of ports can be refreshed simultaneously. Hence, some of the refreshes may need to be delayed. As a result, to ensure correctness when multiple lines need to be refreshed at the same time, and some refreshes need to be delayed, we need to provide a timing guard band.
In addition, providing a counter per line is too expensive.
Kaxiras et al. [15] have estimated the cost of an N -bit counter to be 40N + 20 transistors. Therefore, even a 2-bit counter per 256-bit line would amount to a 40% area overhead. Solutions using Bloom filters lose accuracy.
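The 40% figure can be checked with the counter cost model quoted above. The sketch assumes that area scales with transistor count and that each eDRAM bit cell contributes a single access transistor; both are simplifications introduced here, not statements from [15].

```python
def counter_transistors(n_bits):
    """Cost model quoted from Kaxiras et al. [15]: 40N + 20 transistors."""
    return 40 * n_bits + 20

def counter_area_overhead(n_bits, line_bits, transistors_per_cell=1):
    """Counter transistors relative to the transistors of one cache line.

    Assumes area is proportional to transistor count and one transistor
    per eDRAM bit cell (an assumption made for this sketch).
    """
    return counter_transistors(n_bits) / (line_bits * transistors_per_cell)

overhead = counter_area_overhead(n_bits=2, line_bits=256)
print(f"{counter_transistors(2)} transistors, {overhead:.0%} overhead")
```

Sharing one counter among many lines divides this overhead by the number of lines that share it, which is the motivation for grouping lines.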
The striations seen in Fig. 4g With this design, the hardware cost of a counter is amortized over a whole tile. However, we have to refresh the whole tile whenever the counter rolls down to zero.
Profiling the Retention Times
Mosaic needs to profile the retention times of the tiles, for example at boot time. Literature from industry such as IBM [13, [20] and [13], profiling in the presence of DPD can be best done by using a variety of manufacturer-provided test patterns, e.g., all 0s/1s, checkerboard, walk and random. One of the papers [20] also points out that VRT changes are slow (in the order of hours and sometimes a day) and, therefore, one could profile periodically. Note that a tile in Mosaic will include over 10,000 cells. With many cells, it is possible that, macroscopically, these effects exhibit relatively less external variation across measurements. In reality, sophisticated profiling techniques are still a subject of research in this area, and paramount to dynamic memory manufacturers. Once the tiles are profiled at boot time, the per-tile Tret count in cycles is stored in a small SRAM in the cache controller.
Temperature Adaptation
The Vt of a transistor is a function of temperature (T) [33]. Therefore, the access transistor's leakage current (Eq. 4b) and Tret (Eq. 5) vary strongly with T. Empirical data [5, 6]
The boot-time profiling temperature (To) and the corresponding Tret_o values are stored in the SRAM. At run time, a thermal sensor measures T. If the T is T', then when Mosaic reads the SRAM, it computes the new T'ret as f(To, Tret_o, T' + Guardband), using Eqn. 8. We add a small T guardband as a safeguard against T changes between consecutive refreshes of the tile. This guardband is only a few degrees, as the time between refreshes is at most a few ms, and T changes slowly.
For these computations, we need a lookup table (LUT) and an 8-bit adder and multiplier. The exponential portion of Eqn. 8 is stored in a 32-entry LUT (from -40 °C to 120 °C in 5 °C steps). Each LUT entry is 8 bits.
In an advanced design, we may want to consider a per-tile T, rather than a single T for the whole eDRAM module. In this case, we would need to store a per-tile reference temperature To in addition to a per-tile Tret_o. At run time, T sensors would provide the T' of each tile. Then, for each tile, the algorithm would use the local T' and the local To to apply the correction to the local Tret_o.
Designing the Refresh Counters
Assuming that we have accurately profiled each tile's Tret, we now consider the design of the refresh counters. A refresh operation is akin to a read or a write access to the cache line. However, the refresh operation takes precedence over normal accesses. In addition, we assume that a refresh operation takes one cycle when done in a pipelined fashion. If a memory module is organized into banks, then the banks can be refreshed in parallel. In a given bank, the number of ports sets the number of refreshes that can proceed in parallel; if the bank has a single port, then only one refresh can be done in the bank per cycle.
In this paper, we assume one port per bank. Therefore, the minimum number of cycles required to refresh a whole bank is Nlines, which is the number of lines per tile times the number of tiles per bank. In the worst case, all the lines of the bank might require a refresh at the same time. To handle this corner case, we need to use a time guardband equal to the maximum time between requesting a refresh and being serviced. This guardband is equal to the time required to refresh all the lines of the bank, that is,

$$Guardband = \frac{N_{lines}}{f} \qquad (9)$$

where f is the frequency of the cache. Therefore, the corrected value of the retention time to be used for a tile, T'ret, is

$$T'_{ret} = T_{ret} - Guardband \qquad (10)$$

A second correction comes from the fact that Mosaic uses counters to track time. A counter increments in units of Step. Therefore, T'ret has to be rounded down to the highest multiple of Step that does not exceed T'ret. Combining guardbanding and rounding off, the value of the retention time to be used, T''ret, is

$$T''_{ret} = n \times step \quad | \quad n \times step \le T'_{ret} < (n+1) \times step$$
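The two corrections can be combined in a few lines of integer arithmetic. The sketch below works in nanoseconds to avoid floating-point rounding; the bank size, clock frequency and profiled retention time in the example are hypothetical values, and the function name is illustrative.

```python
def usable_retention_ns(tret_ns, n_lines, freq_hz, step_ns):
    """Apply the guardband (Eq. 9 and 10) and round down to a multiple of Step.

    Returns (n, n * step_ns), where n is the counter value for the tile.
    """
    # Eq. 9: time to refresh every line of the bank, one line per cycle.
    # The negated floor division rounds the guardband up to a whole ns.
    guardband_ns = -(-n_lines * 1_000_000_000 // freq_hz)
    t_prime_ns = tret_ns - guardband_ns          # Eq. 10
    n = max(t_prime_ns // step_ns, 0)            # largest n with n*step <= T'
    return n, n * step_ns

# Hypothetical bank: 16384 lines at 1 GHz, Step = 50 us, profiled Tret = 180 us.
n, t_final_ns = usable_retention_ns(180_000, 16_384, 1_000_000_000, 50_000)
print(n, t_final_ns)
```

A result of n = 0 would mean that the tile cannot be kept alive at this Step size, a case that the sketch reports but does not handle.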
At every Step, the sequencer rolls down all the counters one by one. For a given counter, it first decrements it and then compares its value to zero. If it is zero, the sequencer schedules refreshes for all the lines in the corresponding tile. Next, it reloads the value of the counter after reading the SRAM and adjusting it for T and the other corrections. Then, the sequencer moves to the next tile. The process continues until all the counters get decremented.
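The behavior of the sequencer can be mimicked in software to see which tiles are refreshed at each Step. The class below is a simplified model: the list named `reload` stands in for the SRAM of profiled per-tile counts, every count is assumed to be at least 1, and the temperature adjustment and port scheduling are left out.

```python
class Sequencer:
    """Simplified model of the per-tile refresh counters."""

    def __init__(self, tile_counts):
        self.reload = list(tile_counts)     # stands in for the profiled SRAM
        self.counters = list(tile_counts)   # one down-counter per tile

    def step(self):
        """Advance one Step; return the tiles whose lines must be refreshed."""
        refreshed = []
        for tile in range(len(self.counters)):
            self.counters[tile] -= 1                     # decrement first
            if self.counters[tile] == 0:                 # then compare to zero
                refreshed.append(tile)                   # schedule the tile
                self.counters[tile] = self.reload[tile]  # reload from "SRAM"
        return refreshed

# Three hypothetical tiles that need a refresh every 1, 2 and 3 Steps.
seq = Sequencer([1, 2, 3])
history = [seq.step() for _ in range(6)]
print(history)
```

In this model, a tile with count 1 is refreshed at every Step, which matches the behavior of a conventional periodic refresh.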
Area and Energy Overheads
Mosaic induces little area and energy overhead. To see why, assume that

Finally, consider the 32-byte LUT and the 8-bit adder and multiplier. Using McPAT [18], we estimate that their area is 0.2% of the area of the cache bank, which is negligible. Also, the energy overhead of these structures is very small, as they are accessed only every few µs, as is the SRAM.
Discussion
We can approach the lower bound of refresh energy (Eq. 7) if we can afford per-line counters with an arbitrary number of bits and Step sizes. However, this solution is very expensive in both area and energy. On the other hand, we can avoid the area and energy overheads altogether if we refresh the entire eDRAM module at a constant rate, as it is currently done. There is a clear tradeoff between the potential refresh-energy savings and the overheads we are willing to incur.
As the number of lines per tile increases, the area and energy overheads of the counters decrease. However, the lines constituting the tile are refreshed at the rate needed by the line with the lowest retention in the tile. Therefore, we perform more refreshes than required and move away from the lower bound.
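The effect of tile size on the refresh count can be seen with a toy example. The sketch below groups consecutive lines into tiles, refreshes every line of a tile at the rate of its weakest line, and ignores guardbanding and rounding to Step; the retention times are made-up values.

```python
def refreshes_per_sec(line_tret_s, lines_per_tile):
    """Refresh rate when each tile follows its lowest-retention line."""
    total = 0.0
    for start in range(0, len(line_tret_s), lines_per_tile):
        tile = line_tret_s[start:start + lines_per_tile]
        total += len(tile) / min(tile)   # all lines pay the weakest line's rate
    return total

# Hypothetical per-line retention times in seconds.
lines = [0.010, 0.020, 0.020, 0.040]
for size in (1, 2, 4):
    print(size, refreshes_per_sec(lines, size))
```

With one line per tile the result equals the lower bound, and it grows as the tiles get larger, while the number of counters shrinks in proportion.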
As the number of bits in the counter increases, the area and energy overhead increases, but in return we create more bins and therefore reduce the number of refreshes. The benefits saturate as soon as the range of the counter approaches the higher retention times of the distribution.
We do not experiment with counter Step sizes. A Step size larger than the -4.5σ point, such as 100 µs, would require disabling the lines with retention times in the range 45 µs to 100 µs. It would allow us to investigate solutions that trade off cache capacity for refresh energy gains. A Step size less than the -4.5σ point would increase the granularity of the counter, but decrease the range of the counter. Initial results suggested that, given the same number of bits, the benefits of a larger range with coarser granularity outweigh the benefits of a smaller range with finer granularity.
We can potentially attain higher energy savings by using variable-sized tiles. For example, if there is a line or a small region of lines that have very different Tret than their neighbors, we can save refreshes by placing them in a tile of their own. In contrast, if there is a very large region of lines that have similar Tret, then we can save counter overhead and energy by placing all of them in a giant tile. However, while having variable-sized tiles may be attractive, it has hardware reconfiguration costs and design complexity. Further, as we will see in Section 7, Mosaic with fixed-sized tiles already eliminates the majority of refresh energy at a very modest hardware cost. Hence, there is little room to improve.
Finally, given an eDRAM module, the fraction of refresh power that Mosaic eliminates does not depend on the application running or its input data size. Mosaic's refresh savings depend on the Tret variation profile, and the hardware parameters of the eDRAM module and Mosaic.
Evaluation Setup
We evaluate Mosaic with simulations of a chip multiprocessor (CMP) with 16 cores. Each core is a dual-issue, out-of-order engine. The cache hierarchy consists of per-core private L1 instruction and data caches, a per-core private L2 cache, and a shared L3 cache. The L3 is divided into 16 banks. Each bank is close to one core and has a statically-mapped range of addresses. A 4x4 torus on-chip network connects the 16 nodes. The CMP uses a MESI directory coherence protocol maintained at the L3. The architectural parameters are shown in Table 3.
The L1 and L2 caches are modeled as SRAMs. The L3 cache is modeled as an eDRAM. Each L3 bank has a Mosaic module as in Fig. 5 . A refresh operation is like a cache access, consuming an energy equal to a cache line hit. It takes one cycle when done in a pipelined fashion.
To evaluate Mosaic, we use a variety of tools. To estimate performance, we use the SESC [25] architecture simulator. To estimate the area and the dynamic and leakage energies of cores, [29] for fast statistical prototyping and analysis. We target a platform where energy efficiency is critical. Hence, we use a modest frequency. We use area, energy, and timing estimates for 32 nm technology, although the Tret distribution from IBM used in this paper (Fig. 2) is for 65 nm technology. We use this distribution for lack of corresponding data at 32 nm. However, at 32 nm, the distribution changes only marginally [16]. To generate spatial maps for Tret, we use the distribution parameters of Table 2. For each experiment, we average out the results of 20 Tret maps. The Vt variation parameters used are shown in Table 3. The μ(Vt) and σ(Vt) values are obtained from Kong et al. [17]. We assume an equal contribution of systematic and random components in σ(Vt), but later we vary the breakdown.
We evaluate Mosaic designs with different combinations of tile size (Tsize) and counter size. The parameter sweep is summarized in Table 4. A 1-bit counter rolls down every Step, and corresponds to the baseline (i.e., conventional) implementation of periodic refresh every Step. We assume a constant T of 330 K. We do not experiment with T variation (spatial or temporal) and the corresponding refresh rate adaptation.

We compare Mosaic against: (i) the baseline (i.e., conventional) periodic refresh, (ii) a proposed scheme that uses multiple refresh periods (RAIDR [21]), and (iii) an ideal design with the lower-bound refresh as given by Eq. 7. The ideal design is subjected to the guardband of Section 4.4, but not to the rounding-off constraint. Guardbanding is required because of port constraints, but rounding-off is an artifact of using counters. For simplicity, we use a Step size of 50 µs. The baseline design refreshes at every Step and has no counter overhead.

RAIDR [21] was proposed to refresh pages in DRAM main memories with bins that are in a geometric progression with a ratio of 2, i.e., either 64 ms, 128 ms or 256 ms, depending on the pages' retention times. It uses a Bloom filter per bin. Since we apply it to eDRAMs, we set it to refresh cache lines at the Step size, 2 x the Step size, or 4 x the Step size, namely, 50 µs, 100 µs or 200 µs. As we discuss in Section 8, we do not enable more bins to be faithful to the original design. In practice, more bins could be supported but, with many bins, the algorithm becomes inefficient: the bins for the higher refresh times quickly become too coarse-grained to be useful. Moreover, with many
Evaluation
In Section 7.1, we evaluate the merit of several combinations of tile sizes and counter sizes in saving refresh energy. We choose the combination that minimizes the area and energy overheads of the counters, and maximizes refresh energy savings. In Section 7.2, we compare the resulting best Mosaic against the baseline, RAIDR, and ideal designs. We examine reduction in refreshes, system performance, and L3 energy savings. Finally, in Section 7.3, we perform a sensitivity analysis of the breakdown of σ(Vt) into systematic and random components.
Finding the Best Mosaic
In both plots, across all tile sizes, a 1-bit counter is equivalent to not having a counter at all, and corresponds to the baseline. Therefore, in the area overhead plot, the 1-bit counter is marked zero. Likewise, in the power plot, its counter power component is zero. All the 1-bit combinations are equivalent and correspond to the baseline.

For a fixed Tsize, the area overhead of the counters increases as the size of the counters increases. For a given counter size, its area overhead decreases as the Tsize increases. This is because the same counter is now being shared amongst more lines.
For a given Tsize, the refresh power decreases with the counter size. This is because the retention time of the tiles can be tracked at a much finer granularity. However, the benefits flatten out as the range of the counter approaches the maximum retention time of the distribution. As the Tsize increases, the refresh power goes up. This is because all the lines in a tile are refreshed at the rate of the weakest line in the tile. We also see that the counter power is negligible compared to the L3 refresh power.

Mosaic, and ideal designs, and the result is normalized to the application's baseline design. In the L3 energy plot, each bar is broken down into dynamic, leakage and refresh energies from bottom to top. The dynamic energy is too small to be seen.

and is within one percent of the ideal design.
Sensitivity Analysis
Up until now, we have assumed that σ(Vt) has equal systematic and random components, i.e., σrand : σsys is 1:1. In future technology nodes, the breakdown into systematic and random components may be different. Hence, we perform a sensitivity analysis, keeping the total σ(Vt) constant, and varying its breakdown into the σrand and σsys components. We measure the power consumed by the Mosaic configuration chosen in Section 7.1, as it refreshes L3 and operates the counters.

As examples of the first class of approaches targeting SRAMs, we have Gated-Vdd [24] and Cache Decay [15, 35]. These schemes turn off cache lines that are not likely to be accessed in the near future, and thereby save leakage power. Cache Decay relies on fine-grained logic counters, which are expensive, especially for large lower-level caches. Drowsy Caches [8, 23] periodically move inactive lines to a low power mode in which they cannot be read or written. However, this scheme is less applicable in deep-nm technology nodes, where the difference between Vdd and Vt will be smaller.
Ghosh et al. [10] propose SmartRefresh, which reduces re

As part of the second class of approaches, there is work focused on reducing the refresh power of dynamic memories by exploiting variation in retention time. It includes RAPID [31], the 3T1D-based cache [19], and RAIDR [21]. RAPID [31] proposes a software-based mechanism that allocates blocks with longer retention time before allocating the ones with a shorter retention time. With RAPID, the refresh period of the whole cache is determined only by the used portion.
The 3T1D-based cache [19] is an L1 cache proposal that uses a special type of dynamic memory cell where device variations manifest as variations in the data retention time. To track retention times, the authors use a 3-bit counter per line, which introduces a 10% area overhead. Using this counter, they propose refresh and line replacement schemes to reduce refreshes.
RAIDR [21] is a technique to reduce the refresh power in DRAM main memories. The idea is to profile the retention time of DRAM rows and classify the rows into bins. A Bloom filter is used to group the rows with similar retention times. There are several differences between Mosaic and RAIDR. First, Mosaic observes and exploits the spatial correlation of retention times, while RAIDR does not. In DRAMs, an access or a refresh operates on a row that is spread over multiple chips, which have unknown correlation. Mosaic can be applied to DRAMs if the interface is augmented to support per-chip refresh.
Second, RAIDR classifies rows in a coarse manner, working with bins that are powers of 2 of the baseline (i.e., bins of t, tx2, tx4, tx8, etc.). Therefore, many bins are not helpful because the bins for the higher retention times quickly become too coarse-grained to be useful. Mosaic tracks the retention time of lines in a fine-grained manner, using fixed-distance bins (i.e., t, tx2, tx3, tx4, etc.). This allows it to have tens of bins (64 with a 6-bit counter) and hence enables more savings in refresh power.
Finally, the RAIDR algorithm takes longer to execute with increasing numbers of bins. With 8 bins, in the worst case, it requires 7 Bloom filter checks for every line. Hence, RAIDR only uses 3 bins. The Mosaic implementation using a counter is simple and scalable.
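The difference between geometric and fixed-distance bins can be made concrete with a short comparison. The sketch below assumes a 50 µs Step, three geometric bins for the RAIDR-style policy, and up to 64 fixed-distance bins for a 6-bit counter; it ignores guardbanding, Bloom-filter inaccuracy and tiling, and the function names are illustrative rather than taken from either design.

```python
def raidr_period_us(tret_us, step_us=50, n_bins=3):
    """Largest geometric bin (step, 2*step, 4*step, ...) not exceeding Tret."""
    if tret_us < step_us:
        raise ValueError("retention time below the smallest bin")
    period = step_us
    for k in range(n_bins):
        if step_us * 2 ** k <= tret_us:
            period = step_us * 2 ** k
    return period

def mosaic_period_us(tret_us, step_us=50, counter_bits=6):
    """Largest fixed-distance bin (n * step) not exceeding Tret."""
    if tret_us < step_us:
        raise ValueError("retention time below the smallest bin")
    n = min(tret_us // step_us, 2 ** counter_bits)   # at most 64 bins
    return n * step_us

for tret in (390, 1000):   # hypothetical retention times in microseconds
    print(tret, raidr_period_us(tret), mosaic_period_us(tret))
```

For a line that retains data for 390 µs, the geometric bins settle on a 200 µs period while the fixed-distance bins reach 350 µs, and the gap widens for longer retention times.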
The third class of approaches involves using ECC to enable a reduction in the refresh power [7] . ECC can tolerate some failures and, hence, allow an increase in the refresh time - 
