A Technique for Write-endurance aware Management of Resistive RAM Last
  Level Caches by Mittal, Sparsh
arXiv:1311.0041v1 [cs.AR] 31 Oct 2013
A Technique for Write-endurance aware
Management of Resistive RAM Last Level Caches
Sparsh Mittal
Department of Electrical and Computer Engineering
Iowa State University, Ames, Iowa 50011, USA
Email: sparsh0mittal@gmail.com
Abstract
Due to increasing cache sizes and the large leakage consumption of SRAM devices, conventional SRAM caches contribute
significantly to processor power consumption. Recently, researchers have used non-volatile memory devices to design caches,
since they provide high density, comparable read latency and low leakage power dissipation. However, their high write latency
may increase the execution time and hence the leakage energy consumption. Also, since their write endurance is small, a conventional
energy saving technique may further aggravate the problem of write-variation, thus reducing their lifetime. In this paper, we present
a cache energy saving technique for non-volatile caches, which also attempts to improve their lifetime by distributing writes evenly
across the cache. Our technique uses dynamic cache reconfiguration to adjust the cache size to meet the program requirement
and turns off the remaining cache to save energy. Microarchitectural simulations performed using an x86-64 simulator, SPEC2006
benchmarks and a resistive-RAM LLC (last level cache) show that over an 8MB baseline cache, our technique saves 17.55% of
memory subsystem (last level cache + main memory) energy and improves the lifetime by 1.33×. Over the same resistive-RAM
baseline, an SRAM of similar area with no cache reconfiguration leads to an energy loss of 186.13%.
Index Terms
Resistive RAM (RRAM or ReRAM), non-volatile memory (NVM), last level cache, leakage energy saving, dynamic cache
reconfiguration, energy efficiency.
I. INTRODUCTION
To fulfill the performance requirements of state-of-the-art programs [1–3] and feed data to an increasingly large number of
processor cores, modern processors use large last level caches (LLCs), e.g. 24MB to 32MB LLCs [4, 5]. Since
SRAM offers very high write endurance and fast read/write times, it has been conventionally used to design on-chip caches.
However, SRAM also has large leakage power dissipation, low density and poor scalability. Thus, caches designed with SRAM
may consume a large fraction of chip area and chip power budget.
Recently, researchers have explored the use of non-volatile memory (NVM) devices, such as RRAM (resistive RAM), STT-RAM
(spin torque transfer RAM) and PCM (phase change memory), for designing on-chip caches [6, 7]. NVMs offer high
density, low leakage power, better scalability and comparable read energy. Still, several important issues remain to be addressed
before they can be used as the universal memory solution. Although NVMs have low leakage power dissipation, their high write
latency increases the execution time, which may also increase leakage energy consumption, thus necessitating the use of power
management techniques. Also, benchmarks with a small working set size do not benefit from the large cache capacity provided
by an NVM cache and hence waste a significant amount of leakage energy. Further, the write endurance of NVMs such as RRAM
is several orders of magnitude smaller than that of SRAM [8]. Since previous energy saving techniques do not take the write
endurance of the device into account, they may further reduce the device lifetime by increasing the write-variation in cache
accesses. Thus, to ensure widespread adoption of RRAM caches, addressing their limitations is extremely important.
In this paper, we present a technique for saving energy in non-volatile last level caches. While we take the example of
RRAM cache in this paper, our technique can also be applied to other NVMs, such as STT-RAM (spin torque transfer RAM)
and PCM (phase change memory). Our technique uses cache coloring approach [9] to achieve dynamic cache reconfiguration.
Using this, the LLC size is adapted to the workload requirement on a per-interval basis. The rest of the cache is turned off to
save leakage energy such that the performance loss is minimal. Thus, large cache capacity is provided to the programs with
large working set size and vice versa.
Sparsh Mittal is currently working as a postdoctoral research associate at Oak Ridge National Laboratory, USA.
To achieve wear-leveling, our technique keeps a count of the number of writes to each color and, while turning on (resp.
turning off) colors, first selects the colors with the lowest (resp. highest) number of writes. Also, if no cache reconfiguration
is performed in an interval, a few hottest¹ colors are turned off and an equal number of colors are turned on
for use in the subsequent intervals. Thus, the write traffic is channeled to different cache colors over
time to minimize write-variation.
We conduct microarchitectural simulations using Sniper x86-64 simulator [10] and benchmarks from SPEC2006 suite. The
results show that our technique saves large amount of energy while keeping performance loss negligible and also improving
the cache lifetime. On average, over an 8MB RRAM baseline, our technique saves 17.55% energy and increases MPKI by
only 0.09 misses. Also, the average speedup is 0.99× and the average improvement in cache lifetime is 1.33×. Over
the same baseline, an SRAM cache of equal area (which has 1MB cache capacity and does not use cache reconfiguration)
incurs a 186.13% loss in energy, a speedup of 0.96× and an increase in MPKI of 4.22.
The rest of the paper is organized as follows. Section II provides a brief background on RRAM and existing techniques
for managing cache power and the write endurance problem. Section III presents our methodology. Section IV discusses the energy
saving algorithm and its implementation. Section V discusses the simulation platform, workloads, evaluation metrics and energy
model. Section VI presents the experimental results and finally, Section VII provides the conclusion.
II. BACKGROUND AND RELATED WORK
A. A Brief Background on RRAM
Resistive RAM with unipolar switching uses an insulating dielectric [11]. On applying a sufficiently high voltage, a filament
or conducting path is formed in the dielectric. Once the filament is formed, by applying appropriate voltages, the
filament may be set (which leads to a low resistance) or reset (which leads to a high resistance). Several RRAM prototypes have
been recently developed [12] which demonstrate that RRAM offers significantly higher density than SRAM and comparable
read times.
B. Power Management in Last Level Caches
Caches occupy a significant fraction of chip area. Further, due to their large size, caches designed with SRAM dissipate
a large portion (e.g. more than 80%) of their energy in the form of leakage energy. As power consumption becomes the
first-class design constraint in computing systems [13], improving the energy efficiency of caches has become important. To
address this, several architecture-level techniques have been proposed [14, 15]. Some techniques have also been proposed to
address the power consumption of caches designed with non-volatile memories. Sun et al. [16] propose a technique which
uses STT-RAM cache banks with different retention periods. Their technique uses data migration to move the cache blocks
from banks with lower retention period to those with higher retention period to save energy. Chen et al. [17] use cache
reconfiguration approach to save energy in SRAM-NVM hybrid cache. Some researchers have also proposed techniques for
architecting eDRAM (embedded DRAM) last level caches [18, 19], however, the requirement of refresh presents a major
bottleneck in the use of eDRAM caches.
C. Write Endurance Management in RRAM Caches
Recent RRAM prototype designs show best-case write endurance values of up to $10^{11}$ [20]. Since existing cache
management schemes and energy saving schemes are not write-endurance aware, they may lead to failure of a few cells while
most of the cache blocks can still endure many more writes.
¹ We refer to the ‘hottest’ (or ‘coldest’) color as the one which has the highest (or lowest) number of writes to it. Also, activating a color is synonymous with
turning it on, and vice versa.
Recently, researchers have proposed techniques to improve write endurance of caches designed with RRAM and other non-
volatile devices. Chen et al. [21] propose a technique for improving cache lifetime of STT-RAM caches by reducing inter-set
write-variation. Their technique uses a register, called remap register, which is XORed with the set-index bit of the cache
address. Periodically, this register is updated which enables introducing randomization in writes to different cache sets.
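The XOR-based remapping idea described above can be illustrated with a minimal sketch. This is not the implementation of [21]; the function name `remapped_set`, the 4-bit index width and the register value are assumptions chosen only for illustration.

```python
def remapped_set(set_index: int, remap_register: int, index_bits: int) -> int:
    """XOR the set-index bits with the remap register to pick the physical set."""
    mask = (1 << index_bits) - 1
    return (set_index ^ remap_register) & mask


# Hypothetical 16-set cache: for any fixed register value, the mapping is a
# permutation of the sets, so no two logical sets collide.
INDEX_BITS = 4
mapping = [remapped_set(s, 0b1010, INDEX_BITS) for s in range(1 << INDEX_BITS)]
```

Changing the register value periodically moves write-intensive logical sets onto different physical sets, which is the source of the randomization mentioned above.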
III. METHODOLOGY
A. Key Idea
It is well-known that the cache requirements of different applications and even different phases of the same application are
different. Hence, processor-designers tend to use a cache size to fulfill the peak demand and meet the requirements of SLAs
(service level agreements). This is especially important in high-performance computing domain [22], where performance-critical
programs are routinely executed. However, this leads to large wastage of energy in average case. Thus, by dynamically adapting
the cache size to the program requirement, saving in leakage energy can be achieved with minimum performance loss. NVM
caches provide even more opportunity of cache reconfiguration than conventional caches due to their large size.
Since cache reconfiguration based energy saving techniques turn off a large number of blocks, they may worsen the cache
lifetime, since the write traffic of the program is redirected to only a few active blocks. This problem is especially severe for NVM
caches, which have small write endurance. To address this, we utilize the fact that for the same number of active colors, our
technique can keep different combinations of colors active; for example, when 4 colors are active, the color numbers {0,1,2,3}
or {11,15,27,33} etc. can be active. This fact can be used to improve the lifetime of the NVM cache through wear-leveling
by selecting different colors to keep active in different intervals, based on the number of writes they have already
experienced.
B. Achieving Cache Reconfiguration
To dynamically reconfigure the cache, we use cache coloring scheme [9]. This scheme logically divides the cache into
multiple groups, called cache colors. If S denotes the L2 cache size, P denotes the physical page size and Q denotes the L2
cache associativity, then the number of cache colors, N is given by
\[ N = \frac{S}{P \times Q} \tag{1} \]
Also, the physical memory pages are divided into N groups, based on the least significant bits of the physical page number.
By mapping a particular memory region to a cache color, all the physical pages in that memory region are mapped to that
color and thus, they start using that color. To keep the record of the mapping between memory region and cache color, a small
table is kept, called mapping table.
Clearly, multiple memory regions can be mapped to the same cache color, but one memory region cannot be mapped to
multiple cache colors at any given time. By controlling the mapping table, the amount of cache allocated to a program can be
controlled. For this, all the regions can be mapped to only a few colors. The remaining colors can be turned off for saving
leakage energy.
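A minimal sketch of such a mapping table follows. The class name `MappingTable`, the initial identity mapping and the round-robin policy in `restrict_to` are assumptions made for illustration; the text above does not specify how regions are distributed over the active colors.

```python
from typing import List


class MappingTable:
    """Maps each memory region (group of physical pages) to one cache color."""

    def __init__(self, num_colors: int) -> None:
        self.num_colors = num_colors
        # Assumption: initially region i uses color i.
        self.region_to_color: List[int] = list(range(num_colors))

    def region_of(self, physical_page_number: int) -> int:
        # The region is selected by the least significant bits of the page number.
        return physical_page_number % self.num_colors

    def color_of(self, physical_page_number: int) -> int:
        return self.region_to_color[self.region_of(physical_page_number)]

    def restrict_to(self, active_colors: List[int]) -> None:
        # Assumption: regions are spread round-robin over the active colors.
        for region in range(self.num_colors):
            self.region_to_color[region] = active_colors[region % len(active_colors)]


table = MappingTable(256)
table.restrict_to([0, 1, 2, 3])
colors_in_use = {table.color_of(page) for page in range(1024)}
```

After `restrict_to`, every page resolves to one of the four active colors, so the other 252 colors receive no traffic and could be turned off.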
To see the typical number of colors, we take the example of an 8MB, 8-way cache with 64B block size and assume a page
size of 4KB. Then, the total number of cache colors is 256. The benefit of cache coloring scheme is that it provides finer
cache allocation granularity than selective-sets and selective-ways approaches [23, 24]. Also, this granularity is coarse-enough
such that the overhead of cache management (e.g. keeping hardware counters, etc.) remains small.
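The arithmetic of this example can be checked with a short sketch of Equation (1). The helper names `num_colors` and `num_sets` are illustrative only.

```python
KB = 1024
MB = 1024 * KB


def num_colors(cache_bytes: int, page_bytes: int, associativity: int) -> int:
    """Equation (1): N = S / (P x Q)."""
    return cache_bytes // (page_bytes * associativity)


def num_sets(cache_bytes: int, block_bytes: int, associativity: int) -> int:
    """Number of sets in a set-associative cache."""
    return cache_bytes // (block_bytes * associativity)


colors = num_colors(8 * MB, 4 * KB, 8)
sets = num_sets(8 * MB, 64, 8)
sets_per_color = sets // colors
print(colors, sets, sets_per_color)  # 256 16384 64
```

With these parameters each color covers 64 sets, which is why a color is a finer unit than a way but much coarser than a single set.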
C. Dynamic Profiling
To estimate the program miss-rate under different cache sizes, we use auxiliary tags, and refer to this structure as a profiling
unit [25]. Profiling unit works on the principle of set-sampling [26] and thus, it emulates the miss-rate of the entire cache by
merely using a few sets. The ratio of the number of sets of cache and that of profiling unit is referred to as the sampling
ratio, which is taken as 64 in our experiments. Because of the use of large sampling ratio and data-less (tag only) design, the
overhead of profiling unit is small. In our experiments, we use 5 profiling units, each profiling a cache size of X, X/2, X/4,
X/8 and X/16 respectively. Despite the use of multiple levels, the total overhead of profiling units is less than 0.5% of the L2
cache size. Also, by adding extra levels (such as X/32 etc.), profiling information about smaller sized configurations can also
be obtained which can be used to reduce the lower limit of cache allocation in the energy saving algorithm (Section IV).
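As a rough plausibility check of the overhead figure, the sketch below estimates the storage of the five tag-only profiling units relative to the L2 data array. The 40-bit size of a tag entry is an assumption (it is not given in the text), and the function name `profiling_overhead` is hypothetical.

```python
def profiling_overhead(levels: int, sampling_ratio: int,
                       tag_entry_bits: int, block_bits: int) -> float:
    """Storage of the profiling units as a fraction of the L2 data array.

    Level i profiles a cache of size X / 2**i and keeps one out of every
    `sampling_ratio` sets, storing tags only (no data).
    """
    sampled_fraction = sum(1 / 2 ** i for i in range(levels)) / sampling_ratio
    return sampled_fraction * tag_entry_bits / block_bits


# 5 levels (X ... X/16), sampling ratio 64, 64-byte blocks,
# and an assumed 40 bits of tag and state per profiled block.
overhead = profiling_overhead(levels=5, sampling_ratio=64,
                              tag_entry_bits=40, block_bits=64 * 8)
```

Under these assumptions the result is about 0.24% of the cache size, which is in line with the bound of 0.5% stated above.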
D. Estimating Execution Time using CPI Stack
To estimate the program execution time under different cache configurations, we use the CPI stack technique [10, 27, 28]. A
CPI stack shows the contribution of different miss-events (e.g. branch mispredictions, memory stall cycles etc.) to the overall
CPI, taking into account their possible overlap. We use the memory stall cycle component of the CPI stack. We assume that the
number of memory stall cycles in an interval varies linearly with the number of load misses. Thus, by estimating the load misses for
a configuration, the number of memory stall cycles for that configuration can be estimated and, using this, the execution time
under that configuration can be estimated. For estimating load misses for a configuration, we use extra counters in the profiling
cache. Using the execution time estimate for a configuration, the leakage energy consumption with that configuration can also
be estimated.
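The linear-scaling assumption can be written down as a small sketch. The function name `estimate_cycles`, its parameters and the handling of the zero-miss case are assumptions for illustration; the numbers in the example are made up.

```python
def estimate_cycles(total_cycles: float, mem_stall_cycles: float,
                    load_misses_now: int, load_misses_cfg: int) -> float:
    """Estimate the cycles of an interval under another cache configuration.

    Only the memory-stall component of the CPI stack is scaled, in proportion
    to the estimated number of load misses of the candidate configuration.
    """
    if load_misses_now == 0:
        # Assumption: with no observed misses there is nothing to scale.
        return float(total_cycles)
    non_memory_cycles = total_cycles - mem_stall_cycles
    scaled_stalls = mem_stall_cycles * load_misses_cfg / load_misses_now
    return non_memory_cycles + scaled_stalls


# Hypothetical interval: 20M cycles, 5M of them memory stalls, 100K load misses.
# A smaller configuration is estimated (by the profiling unit) to see 120K misses.
cycles = estimate_cycles(20_000_000, 5_000_000, 100_000, 120_000)
```

In this example the estimate is 21M cycles, i.e. a 5% slowdown, which would exceed a 2% performance-loss threshold.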
IV. ENERGY SAVING ALGORITHM
A. Algorithm Details
In this section, we discuss our energy saving algorithm. The algorithm runs after a large interval (e.g. 10M instructions) and
can run as a kernel module. The algorithm has the following steps.
1) First, the candidate configurations are selected which satisfy the following criteria.
• To keep the algorithm overhead small while also covering a large number of candidate color values, cache allocation
is done at the granularity of two colors.
• To avoid the possibility of cache thrashing, we allocate at least α colors to a program. In our experiments, we take
α = N/16.
• To keep the reconfiguration overhead small, in each interval, a maximum of β colors can be turned-off or turned-on.
In our experiments, we take β = 16.
2) For each of the selected configurations, the execution time is estimated. Also, the execution time of the full-size cache
(i.e. one having N colors) is estimated. Let ∆i denote the percentage extra time that a configuration Ci takes over the
full-size configuration. Then, those configurations for which ∆i exceeds a threshold (γ) are rejected. This ensures that
the configurations which may cause large performance degradation are avoided. In our experiments, we take γ = 2%.
3) For the remaining configurations, memory subsystem energy is computed, as shown in Section V-C. To make more
accurate estimate of the energy of a configuration, we keep two counters, viz. nClean and nDirty. As the name shows,
nClean (resp. nDirty) shows the number of clean (resp. dirty) blocks in the cache. To compute these counters, scanning
the cache is not required, instead, nClean is increased each time a clean block is inserted in the cache and decreased
each time a clean block is evicted from the cache (and similarly for nDirty). Assuming that the dirty blocks are
uniformly distributed in the cache and the number of currently active colors is C, turning off p colors would lead
to write-back of (p × nDirty)/C dirty blocks. Also, (p × nClean)/C clean blocks would be evicted and we assume
that half of these blocks may be accessed again in the next interval. Using this, the number of extra main
memory accesses due to cache reconfiguration for any configuration can be estimated.
4) Finally, the configuration with the smallest value of energy is selected for the next interval.
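The four steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the exact implementation: `est_time` and `est_energy` are hypothetical callables standing in for the CPI-stack estimate and the energy model, and the function names are invented for this example.

```python
from typing import Callable, List, Optional


def candidate_colors(current: int, total: int, alpha: int, beta: int,
                     step: int = 2) -> List[int]:
    """Step 1: color counts reachable in this interval.

    At least `alpha` colors stay allocated, at most `beta` colors change,
    and changes are made in multiples of `step` colors.
    """
    low = max(alpha, current - beta)
    high = min(total, current + beta)
    return [c for c in range(low, high + 1) if (c - current) % step == 0]


def flush_traffic(p: int, active: int, n_dirty: int, n_clean: int) -> float:
    """Step 3: extra main-memory accesses caused by turning off p colors."""
    writebacks = p * n_dirty / active        # dirty blocks written back
    refetches = 0.5 * p * n_clean / active   # half of the evicted clean blocks
    return writebacks + refetches


def select_configuration(candidates: List[int],
                         est_time: Callable[[int], float],
                         full_time: float,
                         est_energy: Callable[[int], float],
                         gamma: float = 2.0) -> Optional[int]:
    """Steps 2 and 4: reject slow configurations, then pick the lowest energy."""
    allowed = [c for c in candidates
               if 100.0 * (est_time(c) - full_time) / full_time <= gamma]
    return min(allowed, key=est_energy) if allowed else None


# With N = 256, alpha = N/16 = 16 and beta = 16, an interval starting at
# 64 active colors considers 48, 50, ..., 80 colors.
cands = candidate_colors(current=64, total=256, alpha=16, beta=16)
```

A caller would typically fold `flush_traffic` into `est_energy` for configurations that shrink the cache, so that the cost of the flush is charged to the candidate that causes it.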
To achieve wear-leveling, our algorithm works as follows.
1) We maintain a count of number of writes to each color. Clearly, if a color is turned-off, no writes are performed on it
and hence, its write-count is not updated.
2) For a given number of colors to be turned off, the algorithm selects those colors which have highest write counts to
them.
3) For a given number of colors to be turned on, the algorithm selects those colors which have lowest write counts to them.
4) If in an interval, no reconfiguration is performed, then φ hottest active colors are turned off and an equal number of
coldest inactive (i.e. turned-off) colors are made active. The value of φ depends on the number of colors (C) which are
currently active and it is computed simply as:
if N > C ≥ N/2, φ = 1;
if N/2 > C ≥ N/8, φ = 2;
if C < N/8, φ = 3.
The reasoning behind this is that, when only a small number of colors are active, the entire write traffic is directed to
only a few colors and hence, a larger number of colors need to be shuffled, and vice versa. Also, the value of φ is kept
small to avoid performance loss due to cache flushing.
We refer to this algorithm as Endurance Aware algorithm.
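The wear-leveling rules can be sketched as below. The function names are hypothetical, and returning φ = 0 when all colors are active is an assumption (no inactive color exists to swap in that case, and the rule above leaves C = N undefined).

```python
from typing import Iterable, List, Sequence, Set


def phi(active: int, total: int) -> int:
    """Number of colors to swap when no reconfiguration is performed."""
    if active >= total:
        return 0  # assumption: nothing can be swapped in when C = N
    if active >= total / 2:
        return 1
    if active >= total / 8:
        return 2
    return 3


def hottest(write_counts: Sequence[int], colors: Iterable[int], k: int) -> List[int]:
    """The k colors with the highest write counts (candidates to turn off)."""
    return sorted(colors, key=lambda c: write_counts[c], reverse=True)[:k]


def coldest(write_counts: Sequence[int], colors: Iterable[int], k: int) -> List[int]:
    """The k colors with the lowest write counts (candidates to turn on)."""
    return sorted(colors, key=lambda c: write_counts[c])[:k]


def rotate_colors(write_counts: Sequence[int], active: Set[int], total: int) -> Set[int]:
    """Rule 4: swap the hottest active colors with the coldest inactive ones."""
    inactive = [c for c in range(total) if c not in active]
    k = min(phi(len(active), total), len(inactive))
    turned_off = hottest(write_counts, active, k)
    turned_on = coldest(write_counts, inactive, k)
    return (set(active) - set(turned_off)) | set(turned_on)


# Hypothetical 8-color cache with 4 active colors.
counts = [50, 10, 40, 5, 0, 70, 20, 30]
new_active = rotate_colors(counts, {0, 1, 2, 3}, total=8)
```

Here color 0 (50 writes) is the hottest active color and color 4 (0 writes) is the coldest inactive one, so the active set becomes {1, 2, 3, 4}.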
To show the need of endurance-awareness, we also show the results with our technique without the above mentioned steps
for achieving wear-leveling. We term this version of algorithm as Endurance Unaware algorithm. That is, while increasing
the colors, it does not sort the colors based on write intensity and while reducing the colors, it preferentially turns-off colors
with least number of regions in it (and not based on write intensity) to minimize reconfiguration overhead. Also, when the
number of colors remains unchanged, Endurance Unaware algorithm does not perform any action. Note that, unless otherwise
mentioned, we refer to the Endurance Aware algorithm simply as our technique.
B. Algorithm Features and Limitations
Our technique offers the flexibility to control the aggressiveness of cache reconfiguration by choosing suitable values of the
parameters, viz. α, β, φ and γ. In each interval, the algorithm examines only β + 1 configurations and hence, its overhead is
small. Since the number of colors is much smaller than the number of sets in the cache (e.g. a 16MB cache has only 256
colors and 32768 sets), maintaining the write count at color granularity incurs much smaller overhead than keeping it at set
granularity. Also, the number of turned-on (or turned-off) colors is smaller than the total number of colors and hence, sorting
the list of turned-on (or turned-off) colors incurs small overhead.
A limitation of our technique is that, since it works at the granularity of colors, it cannot achieve more fine-grain wear-
leveling, for example at the level of sets. It may be possible that the cache block with the highest number of writes to it may
not be in the cache color with the highest number of writes to it. However, in general, this is not expected to be very common
and hence, our technique is expected to work well for most cases.
C. Algorithm Implementation
We assume that the cache blocks are turned-off using power gating, as suggested in previous works also [29]. Cache
reconfiguration is handled as follows. When the number of cache colors is reduced, the data in those colors are flushed, i.e.
the dirty blocks are written back and the clean blocks are discarded. Also, the memory regions mapped to those colors are
mapped to some other active colors. When the number of cache colors is increased, the extra colors are turned on and some
memory regions which were mapped to other colors are now mapped to these cache colors. Since the algorithm uses a large
interval, the reconfiguration overhead is easily amortized over the phase length. This is also confirmed by the experimental
results (Section VI).
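The handling of a cache shrink can be sketched as follows. The data structures (a dictionary from region to color and a per-color count of dirty blocks), the function name `turn_off` and the round-robin remapping are assumptions made for illustration only.

```python
from typing import Dict, Set


def turn_off(region_to_color: Dict[int, int], dirty_blocks: Dict[int, int],
             colors_off: Set[int]) -> int:
    """Flush the given colors and remap their regions to the remaining colors.

    Returns the number of dirty blocks written back to main memory.
    Clean blocks are simply discarded, so they are not counted here.
    """
    # Assumption: the active colors are exactly those used in the mapping.
    remaining = sorted(set(region_to_color.values()) - colors_off)
    if not remaining:
        raise ValueError("at least one color must stay active")

    writebacks = 0
    for color in colors_off:
        writebacks += dirty_blocks.get(color, 0)
        dirty_blocks[color] = 0

    moved = 0
    for region, color in list(region_to_color.items()):
        if color in colors_off:
            # Assumption: displaced regions are spread round-robin.
            region_to_color[region] = remaining[moved % len(remaining)]
            moved += 1
    return writebacks


mapping = {0: 0, 1: 1, 2: 2, 3: 3}
dirty = {0: 5, 1: 7, 2: 1, 3: 0}
written_back = turn_off(mapping, dirty, {1, 3})
```

Growing the cache would be the reverse operation: newly activated colors start empty and some regions are reassigned to them, so no flush is needed for the new colors themselves.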
V. EVALUATION METHODOLOGY
A. Simulation Infrastructure
We conduct microarchitectural simulations using the Sniper simulator. We use the interval core model in Sniper with a processor
frequency of 2GHz. All caches use a block size of 64 bytes. The details of the L2 cache are provided in Section V-C below.
The memory queue has a bandwidth of 10 GB/s and queue contention is also modeled. The latency of main memory is 160
cycles. The interval length of the algorithm is 15M instructions.
B. Workloads
We use all 29 SPEC2006 benchmarks with ref inputs as the workloads. For brevity, we use three-letter acronyms of the
benchmarks, as shown in Table I. We fast-forward the benchmarks for 10B instructions and then simulate them for 750M
instructions.
TABLE I
SPEC2006 BENCHMARKS AND THEIR ACRONYMS
Benchmark    Acronym    Benchmark     Acronym
astar        Ast        bwaves        Bwa
bzip2        Bzi        cactusADM     Cac
calculix     Cal        dealII        Dea
gamess       Gam        gcc           Gcc
gemsFDTD     Gem        gobmk         Gob
gromacs      Gro        h264ref       H26
hmmer        Hmm        lbm           Lbm
leslie3d     Les        libquantum    Lib
mcf          Mcf        milc          Mil
namd         Nam        omnetpp       Omn
perlbench    Per        povray        Pov
sjeng        Sje        soplex        Sop
sphinx       Sph        tonto         Ton
wrf          Wrf        xalancbmk     Xal
zeusmp       Zeu
C. Energy Model
We model the energy consumption of L2 cache, main memory and algorithm overhead. We compute the L2 cache latency and
energy values for both SRAM and RRAM using nvsim simulator [30], which has been verified against real-world prototypes.
In nvsim simulator, we search for designs optimized for write EDP (energy delay product). We assume 32 nm CMOS process
and 8-way set-associative design with sequential tag and data access. We choose SRAM and RRAM capacities to achieve
nearly similar area values. These values are shown in Table II. We assume that the L1 cache is designed using SRAM due to
performance considerations.
TABLE II
ENERGY VALUES FOR SRAM AND RRAM CACHES
Parameter        RRAM         SRAM
Capacity         8MB          1MB
Area             1.96 mm^2    1.89 mm^2
Hit Latency      6.25 ns      0.70 ns
Miss Latency     3.4 ns       0.22 ns
Write Latency    21.77 ns     0.30 ns
Hit energy       0.423 nJ     0.278 nJ
Miss energy      0.085 nJ     0.006 nJ
Write energy     0.688 nJ     0.270 nJ
Leakage Power    0.740 W      2.239 W
The dynamic energy of accessing main memory is taken as 70 nJ/access and its leakage power consumption is taken as
0.18W [31, 32]. The energy cost of each block transition is taken as 2 pJ [25]. We ignore the energy consumption of profiling
cache, since this is negligibly small compared to that of memory subsystem [28].
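The parameters above can be combined into a simple energy estimate. The exact formula is not spelled out in this section, so the sketch below is an assumption: it sums L2 leakage (scaled linearly by the active ratio), L2 dynamic energy, main memory leakage and dynamic energy, and the block-transition overhead. All names are hypothetical; the constants are taken from Table II and the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class CacheParams:
    hit_nj: float      # energy per hit (nJ)
    miss_nj: float     # energy per miss (nJ)
    write_nj: float    # energy per write (nJ)
    leakage_w: float   # leakage power (W)


RRAM_8MB = CacheParams(hit_nj=0.423, miss_nj=0.085, write_nj=0.688, leakage_w=0.740)
SRAM_1MB = CacheParams(hit_nj=0.278, miss_nj=0.006, write_nj=0.270, leakage_w=2.239)

MEM_ACCESS_NJ = 70.0   # dynamic energy per main memory access
MEM_LEAKAGE_W = 0.18   # main memory leakage power
TRANSITION_PJ = 2.0    # energy per block transition


def memory_subsystem_energy(cache: CacheParams, seconds: float, hits: int,
                            misses: int, writes: int, mem_accesses: int,
                            active_ratio: float = 1.0, transitions: int = 0) -> float:
    """Energy in joules of L2 cache + main memory over one interval."""
    # Assumption: leakage of the L2 scales with the fraction of active cache.
    l2_leakage = cache.leakage_w * active_ratio * seconds
    l2_dynamic = (hits * cache.hit_nj + misses * cache.miss_nj
                  + writes * cache.write_nj) * 1e-9
    memory = MEM_LEAKAGE_W * seconds + mem_accesses * MEM_ACCESS_NJ * 1e-9
    overhead = transitions * TRANSITION_PJ * 1e-12
    return l2_leakage + l2_dynamic + memory + overhead


# Leakage-only comparison over one second with no accesses at all.
idle_rram = memory_subsystem_energy(RRAM_8MB, 1.0, 0, 0, 0, 0)
idle_sram = memory_subsystem_energy(SRAM_1MB, 1.0, 0, 0, 0, 0)
```

Even in this idle case the 1MB SRAM configuration consumes more than 2.5 times the energy of the 8MB RRAM configuration, which reflects the leakage values in Table II.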
D. Evaluation Metrics
We show the results on percentage energy saving and relative performance (also called speedup). Also, we show the cache active
ratio and the absolute increase in miss-rate, measured in misses per kilo instructions (MPKI). We also show the improvement (I) in
cache lifetime, which is defined as follows. Let the subscripts technique and baseline denote any quantity for our technique
and the baseline, respectively. Let L denote the lifetime of the cache and Wmax denote the maximum (worst) number of writes to
any cache block during the entire execution, which takes T amount of time (in seconds). Also, the maximum write endurance
of the block is Γ, which depends on the device property. Then,
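\[ L_{\text{technique}} = \frac{T_{\text{technique}} \times \Gamma}{W_{\max,\text{technique}}} \tag{2} \]
\[ L_{\text{baseline}} = \frac{T_{\text{baseline}} \times \Gamma}{W_{\max,\text{baseline}}} \tag{3} \]
\[ I = \frac{L_{\text{technique}}}{L_{\text{baseline}}} \tag{4} \]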
Here, we have defined the lifetime as the time it takes for the first block to fail, which will be a block with the largest
number of writes to it.
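Equations (2) to (4) translate directly into the following sketch. The function names and the example numbers are hypothetical; note that Γ cancels in the ratio, so its value does not affect I.

```python
def lifetime(exec_seconds: float, endurance: float, max_writes: int) -> float:
    """Equations (2) and (3): time until the most-written block fails."""
    return exec_seconds * endurance / max_writes


def lifetime_improvement(t_technique: float, wmax_technique: int,
                         t_baseline: float, wmax_baseline: int,
                         endurance: float = 1e11) -> float:
    """Equation (4): ratio of the two lifetimes (the endurance cancels out)."""
    return (lifetime(t_technique, endurance, wmax_technique)
            / lifetime(t_baseline, endurance, wmax_baseline))


# Hypothetical run: the technique takes 1% longer but lowers the worst-case
# number of writes to any block from 4000 to 3000.
improvement = lifetime_improvement(1.01, 3000, 1.00, 4000)
```

In this example the improvement is about 1.35×: the reduction in the worst-case write count dominates the small increase in execution time.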
Across the workloads, the average values of improvement in performance and lifetime are computed as the geometric mean of
the per-workload values. For the remaining metrics, the average value is computed using the arithmetic mean.
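The difference between the two averages can be seen in a minimal sketch using Python's standard `statistics` module. The per-workload values below are made up for illustration.

```python
from statistics import geometric_mean, mean

# Hypothetical relative lifetimes of two workloads: one halves, one doubles.
relative_lifetimes = [0.5, 2.0]

geo = geometric_mean(relative_lifetimes)   # appropriate for ratios
arith = mean(relative_lifetimes)           # would overstate the average gain
```

For ratio metrics such as speedup and relative lifetime, the geometric mean treats a 2× gain and a 2× loss symmetrically (their average is 1×), whereas the arithmetic mean of the same values is 1.25.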
VI. RESULTS
Figure 1 shows the experimental results. Note that the baseline is an 8MB RRAM LLC which does not use cache
reconfiguration. SRAM LLC denotes use of a 1MB SRAM LLC. Our technique denotes use of our endurance-aware energy
saving algorithm with an 8MB LLC. The figures for energy saving are drawn separately for SRAM LLC and our technique,
since the range of values are widely different. We now analyze the results in detail.
A. Comparison of Baseline RRAM LLC with SRAM LLC
On average, SRAM LLC incurs a 186.13% loss in energy, a speedup of 0.96× and an increase in MPKI of 4.22. Thus, it
is clear that, on average, the large capacity provided by RRAM LLC easily overcomes its limitation of large write latency.
However, for benchmarks with small working set size² (WSS), such as povray, sjeng, tonto etc., the large capacity does not
provide any benefit; hence, the miss-rate remains unchanged and RRAM incurs a small loss in performance due to its
higher read and write latency. This is also true for streaming benchmarks such as libquantum and milc. The higher energy
efficiency of RRAM cache for these benchmarks is due to its lower leakage power consumption.
Several SPEC benchmarks, such as soplex, omnetpp etc. are very sensitive to L2 cache size and hence, their miss-rates are
significantly reduced due to the extra cache capacity provided by RRAM cache. This leads to a significant reduction in off-chip
accesses and hence, despite the large write latency, the performance of the program is improved (i.e. execution time is reduced).
This is also reflected in reduced leakage energy consumption. Notice that for omnetpp, soplex, sphinx and xalancbmk, the relative
performance with SRAM cache is less than 0.7×. Also, for the omnetpp benchmark, SRAM LLC incurs a loss in energy of 665%;
for xalancbmk, the loss in energy is 612%. Thus, for benchmarks with large WSS, the extra capacity provided by RRAM leads
to a significant reduction in miss-rate (refer to the figures on MPKI increase), which is reflected in improved performance and energy
efficiency.
The benefit of SRAM cache is its high write endurance, although for the same area, NVM caches have larger size than
conventional SRAM caches and this helps in reducing the write pressure per unit cache block. Still, since the write endurance
of RRAM is several orders of magnitude less than that of SRAM, wear-leveling is required for RRAM to achieve reasonable
lifetimes.
² The working set size of an application shows the number of unique cache lines accessed during a given execution interval.
[Figure 1: bar charts over the 29 benchmarks (Ast to Zeu) and their average. Panels: % energy saving for RRAM LLC (Our Technique); % energy saving for SRAM LLC (SRAM); relative performance (SRAM, Our Technique); increase in MPKI (SRAM, Our Technique); active ratio (Our Technique); relative lifetime (Endurance Unaware, Endurance Aware (Our Technique)).]
Fig. 1. Results on energy saving, relative performance, increase in MPKI, cache active ratio and relative lifetime
B. Comparison of Baseline with Our Technique for RRAM LLC
On average, our algorithm provides a 17.55% saving in memory subsystem energy, a speedup of 0.99× and a
1.33× improvement in lifetime. Also, the average active ratio is 70.1% and the increase in MPKI is 0.09.
From active ratio values, we observe that for some benchmarks, our technique does not perform any reconfiguration and for
them active ratio remains 100% (e.g. lbm, mcf, sphinx, soplex, xalancbmk) or close to 100% (e.g. cactusADM, gcc, gromacs).
For these benchmarks, the leakage energy saved from turning-off the cache would be nullified by the increase in DRAM
dynamic energy. This is evident from the large value of increase in MPKI observed for SRAM LLC of 1MB capacity.
For some benchmarks, our technique reduces the active ratio to very small value (e.g. less than 25% for gamess, libquantum
and povray) and thus, a large amount of energy is saved. Clearly, our technique adapts itself well to the cache requirements
of workloads to achieve a balance between energy saving and performance loss.
The increase in miss-rate with our technique is very small compared to that with SRAM LLC and hence, it is nearly invisible
in Figure 1. Since the allowed performance loss (γ) in each interval is 2%, the average speedup value of 0.99× is reasonable.
Clearly, our technique keeps a tight control over the performance loss.
Since RRAM baseline already provides large improvement over SRAM cache, on taking SRAM cache as the baseline (which
is used in conventional caches), the improvement of our technique will be much larger. Thus, use of our algorithm on top of
RRAM LLC provides significant advantage over conventional caches.
C. Results on Cache Lifetime
For the Endurance Unaware algorithm, we show the per-workload figures only for relative lifetime and state the average for
the remaining metrics. The average saving in energy is 17.14%, the speedup is 0.99×, the active ratio is 69.3% and the increase
in MPKI is 0.12. Most importantly, the improvement in lifetime is 0.99×. Thus, while on other metrics, the Endurance Unaware
algorithm provides matching results to the Endurance Aware algorithm, on the metric of lifetime improvement, it performs
poorly. The reason for this is that, it does not try to spread the writes to different cache colors and hence, it may worsen the
problem of limited cache lifetime of NVM caches. Thus, our technique does not compromise energy efficiency or performance
for achieving lifetime improvement through wear-leveling.
The improvement achieved in cache lifetime through wear-leveling in the context of our cache reconfiguration based technique
depends on several factors, such as amount of write variation present, amount of turned-off cache, L2 cache sensitivity of the
workload etc. When nearly all the cache is active, our technique does not aggressively shuffle cache colors in order to avoid
energy loss. Also, if the write variation in the original program itself is small, the scope of wear-leveling also remains small.
For some benchmarks, such as h264ref, gamess, tonto, povray, etc., our technique achieves large improvement in lifetime by
shuffling cache colors and/or deactivating hot colors. The maximum improvement in lifetime of 4.05× is achieved for sjeng.
For two benchmarks, viz. libquantum and milc, our technique degrades the lifetime by a large value. These benchmarks
show streaming property, and hence, by using methods such as cache bypass, the number of writes can be reduced for lifetime
enhancement. This, however, is outside the scope of this paper and is planned as a future work. However, notice that for these
two benchmarks, our endurance aware technique works better than the endurance-unaware version of the technique.
The results presented in this section show that our technique is effective in saving energy with negligible performance loss
and also provides improvement in cache lifetime.
VII. CONCLUSION
While the emerging non-volatile memory devices offer low leakage power and high capacity, they also have limitations
such as limited write endurance and high write latency. In this paper, we presented an energy saving technique for caches
designed with resistive RAM devices. Our technique uses dynamic cache reconfiguration to adjust the cache size to the
requirements of the running application, while taking into account the limited write endurance of the NVM device. The unused
cache is turned off to save energy with minimum performance loss. The experimental results confirm that the use of RRAM for
designing last level caches (LLCs) can provide significant advantage over SRAM LLCs. Also, taking the limited write endurance
of NVMs into account is important for designing effective policies for managing them.
Our technique can be synergistically integrated with other methods to increase their effectiveness. To increase the lifetime
of the RRAM cache even further, methods that reduce write traffic (e.g., a write buffer, a filter cache, or compare-and-write)
and methods that correct errors can be used. Our future efforts will focus on extending our technique to multicore processors.
REFERENCES
[1] A. Agrawal et al., “A new heuristic for multiple sequence alignment,” in IEEE International Conference on
Electro/Information Technology, 2008, pp. 215–217.
[2] S. K. Khaitan and J. D. McCalley, High performance computing for power system dynamic simulation. Springer Berlin
Heidelberg, 2013, pp. 43–69.
[3] M. Raju et al., “High Performance Computing of Three-Dimensional Finite Element Codes on a 64-bit Machine,” Journal
of Applied Fluid Mechanics, vol. 5, no. 2, pp. 123–132, 2012.
[4] B. Stackhouse et al., “A 65 nm 2-billion transistor quad-core Itanium processor,” IEEE Journal of Solid-State Circuits,
vol. 44, no. 1, pp. 18–31, 2009.
[5] R. Riedlinger, R. Bhatia, L. Biro, B. Bowhill, E. Fetzer, P. Gronowski, and T. Grutkowski, “A 32nm 3.1 billion transistor
12-wide-issue Itanium® processor for mission-critical servers,” in IEEE International Solid-State Circuits Conference
Digest of Technical Papers (ISSCC), 2011, pp. 84–86.
[6] P. Mangalagiri, K. Sarpatwari, A. Yanamandra, V. Narayanan, Y. Xie, M. J. Irwin, and O. A. Karim, “A low-power phase
change memory based hybrid cache architecture,” in 18th ACM Great Lakes symposium on VLSI, 2008, pp. 395–398.
[7] A. Jog et al., “Cache revive: architecting volatile STT-RAM caches for enhanced performance in CMPs,” in 49th Annual
Design Automation Conference, 2012, pp. 243–252.
[8] S. Mittal, Architectural Techniques For Managing Non-volatile Caches. Germany: Lambert Academic Publishing (LAP),
2013.
[9] S. Mittal, Z. Zhang, and Y. Cao, “CASHIER: A Cache Energy Saving Technique for QoS Systems,” 26th International
Conference on VLSI Design and 12th International Conference on Embedded Systems (VLSID), pp. 43–48, 2013.
[10] T. E. Carlson, W. Heirman, and L. Eeckhout, “Sniper: Exploring the level of abstraction for scalable and accurate parallel
multi-core simulations,” in International Conference for High Performance Computing, Networking, Storage and Analysis
(SC), Nov. 2011.
[11] H. Li and Y. Chen, “An overview of non-volatile memory technology and the implication for tools and architectures,” in
Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2009, pp. 731–736.
[12] S.-S. Sheu, M.-F. Chang, K.-F. Lin, C.-W. Wu, Y.-S. Chen, P.-F. Chiu, C.-C. Kuo, Y.-S. Yang, P.-C. Chiang, W.-P. Lin
et al., “A 4Mb embedded SLC resistive-RAM macro with 7.2 ns read-write random-access time and 160ns MLC-access
capability,” in IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011, pp. 200–202.
[13] S. Mittal, “A Survey of Architectural Techniques For DRAM Power Management,” International Journal of High
Performance Systems Architecture, vol. 4, no. 2, pp. 110–119, 2012.
[14] S. Mittal, Y. Cao, and Z. Zhang, “MASTER: A Multicore Cache Energy Saving Technique using Dynamic Cache
Reconfiguration,” IEEE Transactions on VLSI Systems, 2013.
[15] S. K. Khaitan and J. D. McCalley, “A hardware-based approach for saving cache energy in multicore simulation of power
systems,” in IEEE PES General Meeting, 2013.
[16] Z. Sun, X. Bi, H. H. Li, W.-F. Wong, Z.-L. Ong, X. Zhu, and W. Wu, “Multi retention level STT-RAM cache designs
with a dynamic refresh scheme,” in 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011, pp.
329–338.
[17] Y.-T. Chen et al., “Dynamically reconfigurable hybrid cache: An energy-efficient last-level cache design,” in Design,
Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2012, pp. 45–50.
[18] S. Mittal, “A Cache Reconfiguration Approach for Saving Leakage and Refresh Energy in Embedded DRAM Caches,”
Iowa State University, Ames, Iowa, USA, Tech. Rep., 2013.
[19] A. Agrawal, P. Jain, A. Ansari, and J. Torrellas, “Refrint: Intelligent refresh to minimize power in on-chip multiprocessor
cache hierarchies,” International Symposium on High-Performance Computer Architecture (HPCA), 2013.
[20] Y.-B. Kim et al., “Bi-layered RRAM with unlimited endurance and extremely uniform switching,” in Symposium on VLSI
Technology (VLSIT). IEEE, 2011, pp. 52–53.
[21] Y. Chen, W.-F. Wong, H. Li, C.-K. Koh, Y. Zhang, and W. Wen, “On-chip caches built on multilevel spin-transfer torque
RAM cells and its optimizations,” J. Emerg. Technol. Comput. Syst., vol. 9, no. 2, pp. 16:1–16:22, May 2013.
[22] M. P. Raju and S. K. Khaitan, “Domain decomposition based high performance parallel computing,” International Journal
of Computer Science Issues, vol. 5, 2009.
[23] M. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. Vijaykumar, “Gated-Vdd: a circuit technique to reduce leakage in
deep-submicron cache memories,” in International Symposium on Low Power Electronics and Design (ISLPED), 2000,
pp. 90–95.
[24] D. H. Albonesi, “Selective cache ways: on-demand cache resource allocation,” in 32nd International Symposium on
Microarchitecture (MICRO), Washington, DC, USA, 1999, pp. 248–259.
[25] S. Mittal and Z. Zhang, “Palette: A cache leakage energy saving technique for green computing,” in HPC: Transition
Towards Exascale Processing, ser. Advances in Parallel Computing. IOS Press, 2013.
[26] T. Puzak, “Cache memory design,” Ph.D. dissertation, University of Massachusetts, 1985.
[27] S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A performance counter architecture for computing accurate
CPI components,” in 12th International Conference on Architectural Support for Programming Languages and Operating
Systems (ASPLOS). New York, NY, USA: ACM, 2006, pp. 175–184.
[28] S. Mittal and Z. Zhang, “EnCache: Improving cache energy efficiency using a software-controlled profiling cache,” in
IEEE International Conference On Electro/Information Technology, Indianapolis, USA, May 2012.
[29] J. Zhao and Y. Xie, “Optimizing bandwidth and power of graphics memory with hybrid memory technologies and adaptive
data migration,” in International Conference on Computer-Aided Design. ACM, 2012, pp. 81–87.
[30] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, “NVSim: A circuit-level performance, energy, and area model for emerging
nonvolatile memory,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 7,
pp. 994–1007, 2012.
[31] H. Zheng, J. Lin, Z. Zhang, and Z. Zhu, “Decoupled DIMM: building high-bandwidth memory system using low-speed
DRAM devices,” in International Symposium on Computer Architecture (ISCA), 2009, pp. 255–266.
[32] S. Mittal and Z. Zhang, “MANAGER: A Multicore Shared Cache Energy Saving Technique for QoS Systems,” Iowa
State University, Tech. Rep., 2013.
