A Cache Energy Optimization Technique for STT-RAM Last Level Cache by Mittal, Sparsh
arXiv:1312.2207v1 [cs.AR] 8 Dec 2013
A Cache Energy Optimization Technique for
STT-RAM Last Level Cache
Sparsh Mittal
Department of Electrical and Computer Engineering
Iowa State University, Ames, Iowa 50011, USA
Email: sparsh0mittal@gmail.com
Abstract
Last level caches (LLCs) occupy a large chip area, and their size is expected to grow further to offset the limitations of memory
bandwidth and speed. Due to the high leakage consumption of SRAM devices, caches designed with SRAM consume a large amount of
energy. To address this, the use of emerging technologies such as spin torque transfer RAM (STT-RAM), which has lower leakage
power dissipation, has been investigated. However, its high write latency and write power may lead to large energy consumption,
which presents challenges for its use. In this report, we propose a cache reconfiguration based technique for improving the energy
efficiency of STT-RAM based LLCs. Our technique dynamically adjusts the active cache size to reduce the cache leakage energy
consumption with minimum performance loss. We choose a suitable value of STT-RAM retention time for avoiding refresh
overhead and gaining performance. Single-core simulations have been performed using SPEC2006 benchmarks and the Sniper x86-64
simulator. The results show that, compared to an STT-RAM LLC of similar area, an SRAM LLC incurs nearly 100% loss
in energy and 7.3% loss in performance, whereas our technique using an STT-RAM cache saves 21.8% energy and incurs only 1.7%
loss in performance.
I. INTRODUCTION
Modern processors employ large LLCs, and their size is expected to grow even further. In fact, Rogers et al. [1] have
shown that due to bandwidth limitations, SRAM-based caches may occupy 90% of the chip area in the upcoming fourth CMOS
generation. Thus, due to the increasing size of LLCs, along with the large leakage energy consumption of SRAM devices, the energy
consumption of LLCs is becoming a significant fraction of processor energy consumption [2].
Recent developments in non-volatile memory devices such as STT-RAM have presented them as a scalable and energy-efficient
SRAM alternative. Compared to SRAM, STT-RAM has much higher density (3× to 4×), competitive read times and
lower leakage power consumption [3]. Thus, STT-RAM can provide denser memories at a lower area footprint. These features
make it suitable for use in LLCs [4]. However, the limitation of STT-RAM is that it has high write latency and write
energy consumption. As an example, for a frequency of 2GHz, the write latency values of caches designed with SRAM
and STT-RAM having similar area are 2 cycles and 22 cycles, respectively [5]. Due to the high write latency of STT-RAM,
the program execution time with STT-RAM caches may become large, which may increase the leakage energy consumption and
offset the advantage of lower leakage power dissipation. The large write energy may also increase the dynamic
energy of the cache. Thus, improving its energy efficiency further is extremely important to meet the demands of the chip power
budget and ensure wide adoption of STT-RAM.
To address this, in this report, we present SMART, an STT-RAM cache reconfiguration technique for saving energy. SMART
is a microarchitectural technique for saving leakage energy in STT-RAM based LLCs. Since the leakage energy consumption of a
cache is a function of the actual leakage power, the active fraction of the cache and the execution time, a leakage energy saving
technique can work by targeting one or more of these factors. The techniques which trade off the non-volatility of STT-RAM for
reducing its write latency (e.g. [5–7]) aim at reducing the program execution time. In contrast, the SMART technique uses cache
reconfiguration to dynamically tune the active fraction of the cache to reduce leakage energy consumption, while choosing a
reasonably small write latency. The SMART technique chooses a suitable value of the STT-RAM retention period which strikes a fine
balance between gaining performance and avoiding the refresh overhead. (The data retention period of a device indicates how long
data can be retained in a nonvolatile memory cell after it is written. Thus, the retention period is a measure of the
nonvolatility of a memory cell.) SMART uses dynamic profiling and hence, it is suitable for
production systems which execute trillions of instructions of arbitrary programs. The SMART technique directly optimizes energy
and does not work by controlling other parameters (e.g. miss-rate) for saving energy. This feature gives the designer the
flexibility to also take into account other components of the processor, such as main memory, while optimizing for cache energy.
Single-core simulations have been conducted using SPEC2006 benchmarks and the Sniper x86-64 simulator. The results show
that, compared to an STT-RAM LLC of similar area, an SRAM LLC incurs nearly 100% loss in energy and 7.3% loss
in performance, whereas the SMART technique using an STT-RAM cache saves 21.8% energy and incurs only 1.7% loss in performance.
Clearly, the use of SMART on top of an STT-RAM LLC can provide significant improvement in energy efficiency over a conventional
SRAM LLC.
The rest of the report is organized as follows. Section II provides a brief background on STT-RAM and existing cache
power management techniques. Section III discusses our methodology. Section IV presents the energy saving algorithm and
Section V discusses the hardware implementation. Section VI discusses the simulation methodology, workloads, energy model
and evaluation metrics. Section VII discusses the experimental results and finally, Section VIII concludes this report.
II. BACKGROUND AND RELATED WORK
A. A Brief Background on STT-RAM
STT-RAM [8] uses a Magnetic Tunnel Junction (MTJ) as the memory storage. The MTJ contains two ferromagnetic layers
that are separated by an oxide barrier layer (e.g., MgO). The magnetization direction of one ferromagnetic layer (called reference
layer) is fixed while that of the other ferromagnetic layer (called the free layer) can be altered by passing a current that is
polarized by the magnetization of the reference layer. The relative magnetization direction of these two layers determines the
resistance of the MTJ. If the two layers have different (resp. same) directions, the resistance of the MTJ is high (resp. low),
indicating a “0” (resp. “1”) state. For more details, we refer the reader to previous work [6, 7].
STT-RAM offers unique advantages compared to other emerging technologies, such as eDRAM (embedded DRAM) and
PCM (phase change memory). Compared to eDRAM, it has smaller leakage power consumption and the ability to scale well
beyond 10nm [8]. It is non-volatile, and even on relaxing its retention period to improve write performance, retention period
values in milliseconds can be achieved [5]. In contrast, eDRAM has a retention period in microseconds [9, 10], which decreases
with technology scaling [6]. Thus, refresh energy becomes a major source of energy consumption in eDRAM caches.
Compared to PCM, STT-RAM has much better write endurance ($10^{15}$ writes compared to $10^{8}$ for PCM), although its density is
smaller than that of PCM [11]. Thus, STT-RAM is more suitable for caches, while PCM is more suitable as main memory.
B. Power Management in Last Level Caches
Energy efficiency is now becoming the most crucial bottleneck in the design of computing systems [12, 13]. Several
researchers have proposed techniques to save energy in non-volatile memory devices at both architecture level and device level
[14]. Zhou et al. [11] propose early write termination to save STT-RAM energy by avoiding redundant bit-writes,
i.e. writes of a value that is already stored in the bit. Jog et al. [5] propose trading off the retention period
of STT-RAM to reduce its write energy. They also propose a technique which selectively refreshes the diminishing blocks (i.e.
those blocks which are about to lose their data) by temporarily storing them in a buffer. Chen et al. [15] propose a technique
for saving energy by dynamically turning off some cache ways in a hybrid LLC composed of SRAM and STT-RAM (e.g.
1MB SRAM and 3MB STT-RAM). Sun et al. [16] propose using STT-RAM cache banks of different retention periods and
moving dying blocks from banks with a lower retention period to those with a higher retention period.
Given the large design space of STT-RAM and the ability to trade off and tune its write latency and volatility, power management
of STT-RAM requires detailed exploration. Hence, in this report, we propose and evaluate a cache reconfiguration based energy
saving technique for STT-RAM.
III. METHODOLOGY
In this report, we assume that the LLC is the L2 cache. Based on the explanation here, the SMART technique can be easily
applied to the case where the LLC is an L3 cache.
A. Key Idea
Since different applications, and even different phases of an application, have different cache requirements, designers typically
provision the cache to fulfill the requirement of performance-critical applications/phases. However, this leads to wastage of energy
for applications/phases which have a small cache requirement. To address this, if the amount of cache allocated to a program can
be dynamically adjusted, the remaining cache can be turned off to save energy with little performance loss. The SMART technique
works on this key idea and utilizes dynamic cache reconfiguration to save cache energy.
Previous works have explored trading-off non-volatility of STT-RAM to improve its write latency and write energy [5, 7].
In this work, we use cache reconfiguration to improve the energy efficiency of STT-RAM cache.
B. Cache Coloring Scheme
For implementing cache reconfiguration, several schemes have been proposed, such as selective-sets [17], selective-ways
[18, 19], hybrid (selective-sets and selective-ways) [20] and cache coloring [21, 22]. SMART uses the cache coloring scheme, which
provides larger cache reconfiguration granularity than the selective-sets or selective-ways approach and is easier to implement
than the hybrid approach. The cache coloring scheme divides the cache into multiple bins, called cache colors. The number of cache
colors is given by
$$N = \frac{S_{L2}}{Z \times V} \tag{1}$$
Here $S_{L2}$ denotes the size of L2, $Z$ denotes the system page size (= 4 KB in our experiments) and $V$ denotes the L2 cache
associativity. As an example, for a 4MB cache with 16-way set-associativity, the number of cache colors ($N$) is 64.
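As a quick check of Equation 1, the following minimal Python sketch computes the number of cache colors. The function name and the divisibility check are illustrative assumptions, not part of the original report.

```python
KB = 1024
MB = 1024 * KB


def num_cache_colors(cache_size_bytes: int, page_size_bytes: int, associativity: int) -> int:
    """Number of cache colors N = S_L2 / (Z * V), following Equation 1."""
    bytes_per_color = page_size_bytes * associativity
    # Assumption: the cache size is an exact multiple of (page size * associativity).
    if cache_size_bytes % bytes_per_color != 0:
        raise ValueError("cache size must be a multiple of page_size * associativity")
    return cache_size_bytes // bytes_per_color


# The example from the text: 4MB cache, 4 KB pages, 16-way set-associativity.
n_colors = num_cache_colors(4 * MB, 4 * KB, 16)
print(n_colors)
```

For the same cache size and page size, halving the associativity doubles the number of colors, since each color then covers half as many bytes.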
Further, based on the least significant bits (LSBs) of their physical page number, the physical pages are divided into N
memory regions. The cache coloring scheme works by mapping the memory regions to the colors inside the cache. To record the
mapping of regions to cache colors, a small “mapping table” is used [21]. Since the mapping table is extremely small,
its access latency and energy consumption are also extremely small.
To change the amount of cache allocated to a running program, the region-to-color mapping can be controlled. For this, all
the regions of the program can be mapped to only a few cache colors. The remaining colors are then effectively unused
and can be turned off to save energy. Reconfiguration decisions are taken only at the end of a large interval, and hence the
mapping table is changed only at the end of an interval.
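The region-to-color mapping described above can be sketched as follows. This is a minimal illustration, not the hardware design of the report: the function names are hypothetical, and the round-robin assignment of regions to active colors is an assumption made here only to have a concrete policy.

```python
from typing import List, Sequence

PAGE_SIZE = 4096  # Z = 4 KB, as in the experiments


def region_of(phys_addr: int, num_regions: int) -> int:
    """Memory region of an address, taken from the LSBs of its physical page number."""
    page_number = phys_addr // PAGE_SIZE
    # For a power-of-two number of regions, the modulo keeps the low-order bits.
    return page_number % num_regions


def build_mapping(num_regions: int, active_colors: Sequence[int]) -> List[int]:
    """Mapping table: entry r holds the cache color used by memory region r.

    Assumption: regions are spread round-robin over the active colors.
    """
    if not active_colors:
        raise ValueError("at least one color must be active")
    return [active_colors[r % len(active_colors)] for r in range(num_regions)]


# 64 regions mapped onto only 16 active colors; the other 48 colors are unused.
table = build_mapping(64, list(range(16)))
color = table[region_of(0x12345000, 64)]
```

With such a table, colors that no entry points to receive no data and are the ones that can be turned off.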
C. Dynamic Profiling
To estimate the miss-rate of a program for different cache sizes at runtime, we use multiple profiling units [23]. These units do
not store data, but only auxiliary tags. Each profiling unit works on the principle of set-sampling [20, 24]. In our experiments,
we use five profiling units which profile caches of size 1X, X/2, X/4, X/8 and X/16, respectively, where X is the L2 cache size.
As an example, when the L2 cache size is 4MB, the profiling unit marked as 1X emulates an L2 cache of size 4MB, the profiling
unit marked as X/2 emulates an L2 cache of size 2MB, and so on. The profiling units use a sparse sampling ratio (1/64 in our
experiments) and hence, their overhead is small. Assuming a tag size of 30 bits, the overhead of all profiling units is merely
0.16% of the L2 cache size.
D. Estimating Execution Time using CPI Stack
To estimate the leakage energy consumption under different cache sizes, an estimate of the execution time under those cache
sizes is required. For this purpose, we use the CPI stack technique [25, 26]. The CPI stack has various components, out of which
we use the memory stall cycle component. It is assumed that the number of memory stall cycles in an interval varies linearly with
the number of load misses. Also, using extra counters in the profiling units, the load misses for different cache sizes are recorded.
Thus, by observing the actual load misses and stall cycles in an interval, we can estimate the memory stall cycles for different
load miss values (and hence, for different cache sizes). Using this, the execution time for different cache sizes can be estimated.
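The estimation step can be illustrated with a small Python sketch. It assumes the simplest linear model, in which stall cycles are directly proportional to load misses, and that all non-memory cycles are unchanged by the cache size; both are simplifying assumptions made here, and the function and parameter names are hypothetical.

```python
def estimate_cycles(observed_cycles: float,
                    observed_stall_cycles: float,
                    observed_load_misses: int,
                    profiled_load_misses: int) -> float:
    """Estimate the cycles of an interval for another cache size.

    observed_*            : measured in the interval with the current cache size
    profiled_load_misses  : load misses reported by the profiling unit that
                            emulates the candidate cache size
    """
    if observed_load_misses == 0:
        # No misses observed: no per-miss stall cost can be derived.
        return observed_cycles
    stall_per_miss = observed_stall_cycles / observed_load_misses
    non_memory_cycles = observed_cycles - observed_stall_cycles
    return non_memory_cycles + stall_per_miss * profiled_load_misses


# 1000 cycles observed, 400 of them memory stalls caused by 100 load misses.
# A smaller candidate cache is profiled to incur 150 load misses.
estimate = estimate_cycles(1000.0, 400.0, 100, 150)
```

Dividing the estimated cycles by the clock frequency gives the execution time used for the leakage energy estimate.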
E. Trading-off Non-volatility to Gain Performance
STT-RAM is a non-volatile device with a retention time of several years [5]. Relaxing this nonvolatility can make
the memory cells easier to program, and offers the opportunity to achieve a lower write current or faster switching
speed. Smullen et al. [7] relax the retention time of STT-RAM cells by shrinking the MTJ planar area. Jog et al. [5] relax
the retention time by reducing the thermal barrier of the MTJ. This is achieved by decreasing the thickness of the free layer and
lowering the saturation magnetization. In our experiments, we assume that a suitable value of retention time is achieved by
the method proposed by Jog et al.
Jog et al. [5] have shown that, for a 2GHz frequency, the write latency values of a 4MB STT-RAM for retention periods of 10
years, 1 second and 10 milliseconds are 22, 12 and 6 cycles, respectively. Of these, we choose the retention period of 1 second.
This choice is motivated by several reasons. Firstly, it provides a reasonable value of write latency which is significantly better
than the latency for the retention period of 10 years. We did not choose the retention period of 10 years, since its write latency
of 22 cycles is much worse than that of an SRAM cache of similar area (which is 2 cycles [5]). Secondly, due to the instruction
level parallelism present in modern processors, a slightly higher last level cache latency can be easily hidden and
hence, we did not choose the latency value of 6 cycles (i.e. retention period of 10 milliseconds). Finally, since the inter-write
times of the L2 cache for workloads have been shown to be in the range of 10 milliseconds, using a retention period of 1
second ensures that a DRAM-style refresh will not be required for practically any of the L2 blocks. This avoids the refresh
overhead which is incurred in previous approaches (e.g. [5, 7]).
IV. ENERGY SAVING ALGORITHM
We now provide details of our energy saving algorithm. The algorithm runs periodically and can be a kernel module. The
algorithm has the following steps.
1) Select candidate configurations which satisfy the following criteria.
• The configuration must have at least $C_{low}$ colors. This ensures that cache thrashing is avoided. In our experiments,
we take $C_{low} = N/16$.
• In any interval, a maximum of $Q$ colors can be turned off or turned on. This ensures that the reconfiguration overhead
is kept small. In our experiments, we take $Q = 16$.
• Cache allocation is done at the granularity of two colors. This ensures that the algorithm overhead is kept small
while covering a large number of candidate configurations.
2) For each configuration in the set of selected configurations, the execution time is estimated. Also, the execution time of
a full-size cache (i.e. one with $N$ colors) is estimated. If $T_i$ denotes the execution time estimate of a given configuration
$C_i$ and $T_f$ denotes that of the full-size configuration, then the percentage extra time ($\Delta_i$) that $C_i$ takes over the
full-size configuration is given by
$$\Delta_i = \frac{T_i - T_f}{T_f} \times 100 \tag{2}$$
The value $\Delta_i$ is computed for all the selected configurations, and those configurations for which $\Delta_i > \lambda$
are rejected. This ensures that the configurations which cause large performance loss are avoided. In our experiments,
we take $\lambda = 2.5\%$.
3) For the remaining configurations, the memory subsystem energy is computed as shown in Equation 3. Finally, the
configuration with the minimum energy consumption is selected for the next interval.
Notice that, since the energy saving algorithm runs after a large interval (e.g. 10M instructions), its overhead is easily
amortized over the phase length. Each time, the algorithm examines only Q + 1 configurations. Also, cache reconfiguration,
and hence block switching, takes place only at the end of an interval and not on the critical access path of the cache.
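The three steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the time and energy estimators are passed in as hypothetical callables, Q is assumed to be a multiple of the two-color granularity, and the fallback used when every candidate exceeds the threshold is an assumption, since the text does not describe that case.

```python
from typing import Callable, List


def candidate_configs(current: int, n_colors: int, q: int = 16, step: int = 2) -> List[int]:
    """Step 1: configurations within Q colors of the current one, in steps of two colors."""
    c_low = max(1, n_colors // 16)  # C_low = N/16
    return [c for c in range(current - q, current + q + 1, step)
            if c_low <= c <= n_colors]


def percent_extra_time(t_i: float, t_full: float) -> float:
    """Equation 2: percentage extra time over the full-size configuration."""
    return (t_i - t_full) / t_full * 100.0


def select_config(current: int, n_colors: int,
                  est_time: Callable[[int], float],
                  est_energy: Callable[[int], float],
                  lam: float = 2.5, q: int = 16) -> int:
    """Steps 2 and 3: reject slow configurations, then pick the minimum-energy one."""
    candidates = candidate_configs(current, n_colors, q)
    t_full = est_time(n_colors)
    allowed = [c for c in candidates
               if percent_extra_time(est_time(c), t_full) <= lam]
    if not allowed:
        # Assumption (not specified in the text): grow to the largest reachable size.
        return max(candidates)
    return min(allowed, key=est_energy)


# Toy estimators: fewer than 20 colors costs 10% extra time; energy grows with size.
toy_time = lambda c: 100.0 if c >= 20 else 110.0
toy_energy = lambda c: float(c)
chosen = select_config(32, 64, toy_time, toy_energy)
```

With N = 64 and 32 active colors, `candidate_configs` returns 17 configurations, which matches the Q + 1 configurations examined per interval for Q = 16.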
V. IMPLEMENTATION
To turn off cache blocks, we use a power-gating scheme, as used by previous works [15]. Consistent with previous work,
we assume that power-gating reduces the leakage power to nearly 3% of its normal value. When the number of active colors
of an application is changed, the following actions are taken. When the number of colors is reduced, the data in those colors
are flushed. Also, the regions which were mapped to those colors are mapped to some other active color. When the number
of colors is increased, some memory regions of the application which were mapped to other colors are now mapped to the
newly allocated colors. Notice that the latency of flushing the data can often be hidden by using write-back buffers and MSHR
(miss status handling register) techniques.
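The shrinking case described above can be sketched as follows. Only the bookkeeping is shown: the function name is hypothetical, the actual flush and power-gating are hardware actions outside this sketch, and the round-robin choice of the new color for each displaced region is an assumption.

```python
from typing import List, Sequence, Set, Tuple


def turn_off_colors(mapping: Sequence[int], colors_off: Set[int]) -> Tuple[List[int], List[int]]:
    """Remap regions away from the colors being turned off.

    mapping[r] is the color currently used by memory region r.
    Returns the new mapping and the list of colors whose data must be
    flushed before they are power-gated.
    """
    remaining = sorted(set(mapping) - colors_off)
    if not remaining:
        raise ValueError("at least one color must stay active")
    new_mapping = list(mapping)
    for region, color in enumerate(mapping):
        if color in colors_off:
            # Assumption: displaced regions are spread round-robin over remaining colors.
            new_mapping[region] = remaining[region % len(remaining)]
    to_flush = sorted(colors_off & set(mapping))
    return new_mapping, to_flush


# Eight regions over four colors; colors 2 and 3 are turned off.
old_mapping = [0, 1, 2, 3, 0, 1, 2, 3]
new_mapping, to_flush = turn_off_colors(old_mapping, {2, 3})
```

Regions whose color stays active keep their mapping, so only data in the colors being turned off needs to be flushed.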
VI. EXPERIMENTAL METHODOLOGY
A. Simulation Platform
We use the interval core model in the Sniper x86-64 simulator with a processor frequency of 2GHz. For all caches, the block size is
64B and the LRU replacement policy is used. The L1 I/D cache is 32 KB, 4-way, with a latency of 2 cycles. The L2 cache is an 8-way
cache. The latency and energy consumption values for SRAM and STT-RAM caches are taken from Jog et al. [5], and they
are shown in Table I. We use a 4MB STT-RAM cache with 1 sec retention time, as discussed earlier. Unlike [5], we did not
consider an STT-RAM with even lower write latency, since it has a much smaller retention time and thus requires either refresh
or buffers to retain data integrity [5].
TABLE I
PARAMETERS OF SRAM AND STT-RAM CACHE
Cache                  Read Latency   Write Latency   Read Energy   Write Energy   Leakage Power
SRAM 1MB               1.012 ns       1.012 ns        0.578 nJ      0.578 nJ       4542 mW
STT-RAM 4MB (1 sec)    0.973 ns       5.571 ns        1.015 nJ      1.036 nJ       2235 mW
The main memory latency is 160 cycles and memory queue contention is also modeled. Memory bandwidth is 12 GB/s.
Interval length is 15M instructions.
We use all 29 SPEC2006 benchmarks with ref inputs. For the sake of clarity, we use three-letter acronyms of the benchmarks,
as shown in Table II. We fast-forward them for 10B instructions and then simulate for 400M instructions.
B. Energy Model
We model the energy consumption of the L2 cache and main memory, and the energy cost of algorithm execution. Our notation is as
follows. Let $E_{L2}$ and $E_{Mem}$ denote the energy consumption of the L2 cache and main memory, respectively. $E_{Algo}$ denotes
the energy overhead of the algorithm. $LE_{L2}$ and $DE_{L2}$ denote the leakage energy and dynamic energy consumption of the L2
cache, respectively. For any interval, $F_A$ denotes the active ratio of the cache and $T$ denotes the execution time. $E^{R}_{L2}$
and $E^{W}_{L2}$ denote the energy consumed in a single L2 read and a single L2 write operation, respectively, and $R_{L2}$ and
$W_{L2}$ denote the corresponding numbers of L2 reads and writes. $P^{leak}_{L2}$ denotes the leakage power dissipation of the L2
cache. $P^{leak}_{Mem}$ denotes the leakage power dissipation of main memory and $E^{dyn}_{Mem}$ denotes the dynamic energy
consumed in each main memory access. $A_{Mem}$ denotes the number of main memory accesses. $E_{\chi}$ denotes the energy consumed
in a single block transition and $B$ denotes the total number of block transitions. Using this notation, our energy model is as follows.
TABLE II
SPEC2006 BENCHMARKS AND THEIR ACRONYMS
Name        Acronym   Name         Acronym
astar       ast       libquantum   lib
bwaves      bwa       mcf          mcf
bzip2       bzi       milc         mil
cactusADM   cac       namd         nam
calculix    cal       omnetpp      omn
dealII      dea       perlbench    per
gamess      gam       povray       pov
gcc         gcc       sjeng        sje
gemsFDTD    gem       soplex       sop
gobmk       gob       sphinx       sph
gromacs     gro       tonto        ton
h264ref     h26       wrf          wrf
hmmer       hmm       xalancbmk    xal
lbm         lbm       zeusmp       zeu
leslie3D    les
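\begin{align}
E &= E_{L2} + E_{Mem} + E_{Algo} \tag{3}\\
E_{L2} &= LE_{L2} + DE_{L2} \tag{4}\\
LE_{L2} &= P^{leak}_{L2} \times F_A \times T \tag{5}\\
DE_{L2} &= E^{R}_{L2} \times R_{L2} + E^{W}_{L2} \times W_{L2} \tag{6}\\
E_{Mem} &= P^{leak}_{Mem} \times T + E^{dyn}_{Mem} \times A_{Mem} \tag{7}\\
E_{Algo} &= E_{\chi} \times B \tag{8}
\end{align}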
For main memory, we take $E^{dyn}_{Mem} = 70$ nJ and $P^{leak}_{Mem} = 0.18$ Watt [27, 28]. Also, we take $E_{\chi} = 0.002$ nJ
[29].
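Equations 3 to 8 can be evaluated with a short Python sketch such as the one below. The class and field names are hypothetical, and the Table I energies are assumed here to be per-access values; the main memory and block transition parameters are the ones quoted above.

```python
from dataclasses import dataclass

NJ = 1e-9  # joules per nanojoule


@dataclass
class EnergyParams:
    p_leak_l2_w: float            # P^leak_L2 in watts
    e_read_l2_nj: float           # E^R_L2 in nJ per read
    e_write_l2_nj: float          # E^W_L2 in nJ per write
    p_leak_mem_w: float = 0.18    # P^leak_Mem in watts
    e_dyn_mem_nj: float = 70.0    # E^dyn_Mem in nJ per access
    e_chi_nj: float = 0.002       # E_chi in nJ per block transition


@dataclass
class IntervalStats:
    active_ratio: float      # F_A, between 0 and 1
    exec_time_s: float       # T in seconds
    l2_reads: int            # R_L2
    l2_writes: int           # W_L2
    mem_accesses: int        # A_Mem
    block_transitions: int   # B


# STT-RAM 4MB row of Table I (2235 mW = 2.235 W).
STT_RAM_4MB = EnergyParams(p_leak_l2_w=2.235, e_read_l2_nj=1.015, e_write_l2_nj=1.036)


def memory_subsystem_energy_j(s: IntervalStats, p: EnergyParams) -> float:
    """Total energy E in joules for one interval (Equation 3)."""
    le_l2 = p.p_leak_l2_w * s.active_ratio * s.exec_time_s                     # Eq. 5
    de_l2 = (p.e_read_l2_nj * s.l2_reads + p.e_write_l2_nj * s.l2_writes) * NJ  # Eq. 6
    e_mem = p.p_leak_mem_w * s.exec_time_s + p.e_dyn_mem_nj * s.mem_accesses * NJ  # Eq. 7
    e_algo = p.e_chi_nj * s.block_transitions * NJ                             # Eq. 8
    return le_l2 + de_l2 + e_mem + e_algo                                      # Eq. 3 and 4


# One second of execution with half the cache active and no accesses.
idle = IntervalStats(active_ratio=0.5, exec_time_s=1.0, l2_reads=0,
                     l2_writes=0, mem_accesses=0, block_transitions=0)
idle_energy = memory_subsystem_energy_j(idle, STT_RAM_4MB)
```

In this idle case only the two leakage terms contribute, which makes the effect of the active ratio on the L2 leakage energy easy to see.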
It has been shown that a 4MB STT-RAM cache occupies similar chip area as a 1MB SRAM cache [5, 30]. For this reason, we
compare a 1MB SRAM with a 4MB STT-RAM. We compare the results obtained using SMART with those obtained using both
an SRAM LLC baseline and an STT-RAM LLC baseline, since this enables us to quantify how much improvement occurs due to our
cache reconfiguration based algorithm and how much improvement occurs due to the use of STT-RAM. The cache energy values
are different for SRAM and STT-RAM, and they are shown in Table I.
The energy saving algorithm uses counters for recording the number of misses, execution time and energy estimate of
different candidate configurations. Also, the profiling cache consumes a small amount of energy. However, since these overheads
are much smaller than the energy consumed by the memory subsystem (LLC and main memory), we ignore them.
C. Evaluation Metrics
We present results on the saving in total memory subsystem energy and in L2 leakage energy alone. Further, we show results
on the change in execution time and the absolute increase in MPKI (misses per kilo-instruction). For the SMART technique, we also
show results on cache ActiveRatio (the fraction of the cache which is active, averaged over the entire execution). The change in
MPKI helps in estimating the increase in off-chip traffic. The cache active ratio gives insight into the aggressiveness of cache
reconfiguration.
VII. RESULTS
Figure 1 shows the energy saving and execution time results for different techniques. For the sake of clarity, the figure uses
only the first three letters of each SPEC2006 benchmark (refer to Table II). The baseline is a 4MB STT-RAM and the “SMART” bar
shows the use of our dynamic reconfiguration technique on top of a 4MB STT-RAM. The “SRAM LLC” bar shows results with a 1MB
SRAM LLC.
[Figure 1: four bar-chart panels over the 29 SPEC2006 benchmarks and their average (Avg). Panels: % Energy saved (SRAM_LLC, SMART); % Simulation Cycle Increase (SRAM_LLC, SMART); Increase in MPKI (SRAM_LLC, SMART); ActiveRatio (SMART).]
Fig. 1. Results on percentage energy saved, simulation cycle increase, MPKI increase and ActiveRatio
Compared to the STT-RAM baseline, SRAM incurs nearly 100% loss in memory subsystem energy, 7.3% loss in performance
and an increase in MPKI of 2.48. This is because, for the same area, SRAM has a smaller cache capacity, which leads to a higher
miss-rate and number of off-chip accesses. Also, SRAM has higher L2 cache leakage energy consumption, which also contributes to
the large energy consumption. For the libquantum benchmark, the increased capacity provided by the STT-RAM baseline does not reduce
the miss-rate, since libquantum is a streaming benchmark with 100% miss-rate. This is clear from the results on MPKI increase and
percentage simulation cycle increase. For some benchmarks, such as soplex, xalancbmk and omnetpp, the increased capacity
provided by STT-RAM significantly reduces the miss-rate, since these benchmarks are very sensitive to the L2 cache size. In
fact, with omnetpp, the SRAM LLC incurs an energy loss of more than 300% compared to the STT-RAM baseline. This shows
that the use of high density technologies such as STT-RAM can lead to significant improvement in the performance of several
applications. Since STT-RAM has higher write latency, some benchmarks such as povray and gamess show slight improvement
in performance on using SRAM compared to using STT-RAM. The reason for this is that these benchmarks use the L2 cache
minimally and, due to their small working set size, they do not benefit from the increased capacity provided by the STT-RAM
cache. Rather, the smaller write latency of SRAM helps them to achieve better performance.
Compared to the SRAM cache, the improvement shown by the SMART technique comes from two sources. Firstly, the larger cache
size helps in reducing the miss-rate and hence improves the performance of several applications. The improved performance is also
reflected in a saving of leakage energy of both the LLC and main memory. Secondly, the lower leakage energy consumption of STT-RAM,
along with the dynamic cache reconfiguration exercised by the SMART technique, helps in further reducing the active cache area and
thus minimizing the leakage energy, with only a slight increase in miss-rate. Note that streaming applications such as libquantum do
not benefit from the increased cache capacity provided by the STT-RAM cache over and above the SRAM cache. For these applications,
the energy saving comes from cache reconfiguration only.
Compared to the STT-RAM baseline, on average, the SMART technique provides 21.8% saving in energy with only 1.7% loss
in performance. Also, the average increase in MPKI is only 0.33 and the average ActiveRatio is 48.4%. Thus, the SMART technique
turns off nearly half the cache while still keeping the increase in off-chip accesses small. By intelligently reconfiguring
the cache, the SMART technique keeps tight control over the performance loss. For five benchmarks, the SMART technique provides
nearly 50% energy saving and for nine benchmarks, it provides more than 30% energy saving. For four benchmarks, the SMART
technique reduces the ActiveRatio to nearly 15% or less.
The SMART technique uses a large interval size and by virtue of this, the cost of reconfiguration is amortized over the interval
length. The energy saving algorithm accounts for the energy consumption of both the L2 cache and main memory and hence, it
does not attempt to save cache energy at the expense of increasing the main memory energy. Note that a technique which
accounts only for L2 energy may overlook this fact and thus may increase the energy consumption of other components of the
processor. This increase may even nullify the energy saved in the cache.
The aggressiveness of cache reconfiguration of the SMART technique can be controlled by changing the values of $\lambda$, $Q$,
$C_{low}$ and the interval length. Increasing the value of $Q$ enables the algorithm to evaluate a larger number of candidate
configurations and thus reach a suitable cache size much faster. However, it may also increase the oscillations in some benchmarks
due to frequent changes in the cache size. Increasing the value of $\lambda$ allows choosing configurations which may incur larger
performance loss. However, increasing this value too much may lead to high performance degradation, which may increase the
execution time and thus increase the leakage energy consumption. Changing the interval length allows exploiting the opportunity
of cache reconfiguration at different time granularities. Very large interval values may lead to losing the opportunity of cache
reconfiguration, while very small values may greatly increase the reconfiguration overhead. Thus, a balance is required to
achieve the largest possible energy saving. Reducing the value of $C_{low}$ requires adding extra levels of profiling units in the
profiling cache. A very small value of $C_{low}$ is suitable for programs which have a very small working set size. However, for
other programs, reducing $C_{low}$ to a very small value may lead to thrashing. For this reason, we have chosen $C_{low} = N/16$ in
our experiments.
The results with a large number of benchmarks have shown that the SMART technique is effective in saving cache energy. It
has been shown that for each watt of power dissipated in computing systems, another watt of power is lost in the cooling
system and power delivery. Hence, the improved energy efficiency provided by the SMART technique will also result in a saving in
the cost of cooling and power delivery. Further, it may enable the designer to increase the cache capacity even further while
still staying within the limits of the chip power budget.
VIII. CONCLUSION
To ensure widespread adoption of STT-RAM, improving its energy efficiency is vital. In this report, we have presented a
technique for improving the energy efficiency of STT-RAM LLCs by using dynamic cache reconfiguration. For a similar chip
area, the SMART technique improves memory subsystem energy efficiency over both an SRAM LLC and a baseline STT-RAM LLC.
Our future efforts will focus on further evaluating the SMART technique for different design parameters of the STT-RAM cache and
synergistically integrating the SMART technique with other techniques for saving an even larger amount of energy.
REFERENCES
[1] B. M. Rogers, A. Krishna, G. B. Bell, K. Vu, X. Jiang, and Y. Solihin, “Scaling the bandwidth wall: challenges in and
avenues for CMP scaling,” in ACM SIGARCH Computer Architecture News, vol. 37, no. 3, 2009, pp. 371–382.
[2] S. Mittal, “A Survey of Architectural Techniques For Improving Cache Power Efficiency,” Sustainable Computing:
Informatics and Systems, 2013.
[3] S. Mittal, Architectural Techniques For Managing Non-volatile Caches. Germany: Lambert Academic Publishing (LAP),
2013.
[4] W. Xu, H. Sun, X. Wang, Y. Chen, and T. Zhang, “Design of last-level on-chip cache using spin-torque transfer ram (stt
ram),” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 19, no. 3, pp. 483–493, 2011.
[5] A. Jog et al., “Cache revive: architecting volatile STT-RAM caches for enhanced performance in CMPs,” in 49th Annual
Design Automation Conference, 2012, pp. 243–252.
[6] M.-T. Chang, P. Rosenfeld, S.-L. Lu, and B. Jacob, “Technology Comparison for Large Last-Level Caches (L3Cs):
Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized eDRAM,” International Symposium on
High-Performance Computer Architecture (HPCA), 2013.
[7] C. W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. R. Stan, “Relaxing non-volatility for fast and energy-efficient
STT-RAM caches,” in 17th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2011,
pp. 50–61.
[8] A. Driskill-Smith, “Latest Advances in STT-RAM,” in 2nd Annual Non-Volatile Memories Workshop, vol. 4, 2011, p. 25.
[9] J. Barth, W. R. Reohr, P. Parries, G. Fredeman, J. Golz, S. E. Schuster, R. E. Matick, H. Hunter, C. C. Tanner, J. Harig
et al., “A 500 MHz random cycle, 1.5 ns latency, SOI embedded DRAM macro featuring a three-transistor micro sense
amplifier,” IEEE Journal of Solid-State Circuits, vol. 43, no. 1, pp. 86–95, 2008.
[10] S. Mittal, “A cache reconfiguration approach for saving leakage and refresh energy in embedded dram caches,” Iowa
State University, Tech. Rep., 2013.
[11] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, “Energy reduction for STT-RAM using early write termination,” in IEEE/ACM
International Conference on Computer-Aided Design-Digest of Technical Papers, 2009, pp. 264–268.
[12] S. Mittal, “A Survey of Architectural Techniques For DRAM Power Management,” International Journal of High
Performance Systems Architecture, vol. 4, no. 2, pp. 110–119, 2012.
[13] J. Li, C. J. Xue, and Y. Xu, “STT-RAM based energy-efficiency hybrid cache for CMPs,” in IEEE/IFIP 19th International
Conference on VLSI and System-on-Chip (VLSI-SoC), 2011, pp. 31–36.
[14] S. Mittal, “Energy Saving Techniques for Phase Change Memory (PCM),” Iowa State University, Tech. Rep., 2013.
[15] Y.-T. Chen et al., “Dynamically reconfigurable hybrid cache: An energy-efficient last-level cache design,” in Design,
Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2012, pp. 45–50.
[16] Z. Sun, X. Bi, H. H. Li, W.-F. Wong, Z.-L. Ong, X. Zhu, and W. Wu, “Multi retention level STT-RAM cache designs
with a dynamic refresh scheme,” in 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011, pp.
329–338.
[17] S. Yang, M. D. Powell, B. Falsafi, K. Roy, and T. Vijaykumar, “An integrated circuit/architecture approach to reducing
leakage in deep-submicron high-performance I-caches,” in International Symposium on High Performance Computer
Architecture (HPCA), 2001, pp. 147–157.
[18] S. Mittal, Z. Zhang, and J. Vetter, “FlexiWay: A Cache Energy Saving Technique Using Fine-grained Cache
Reconfiguration,” in 31st IEEE International Conference on Computer Design (ICCD), 2013, pp. 100–107.
[19] D. H. Albonesi, “Selective cache ways: on-demand cache resource allocation,” in IEEE/ACM International Symposium
on Microarchitecture (MICRO), 1999, pp. 248–259.
[20] S. Mittal and Z. Zhang, “EnCache: Improving cache energy efficiency using a software-controlled profiling cache,” in
IEEE International Conference On Electro/Information Technology, Indianapolis, USA, May 2012.
[21] S. Mittal, Z. Zhang, and Y. Cao, “CASHIER: A Cache Energy Saving Technique for QoS Systems,” in 26th International
Conference on VLSI Design and 12th International Conference on Embedded Systems (VLSID), India, January 2013, pp.
43–48.
[22] S. Mittal, “Dynamic cache reconfiguration based techniques for improving cache energy efficiency,” Ph.D. dissertation,
Iowa State University, 2013.
[23] S. Mittal and Z. Zhang, “ESTO: A Performance Estimation Approach for Efficient Design Space Exploration ,” Design
Contest at 26th International Conference for VLSI Design, January 2013.
[24] T. Puzak, “Cache memory design,” Ph.D. dissertation, University of Massachusetts, 1985.
[25] S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A performance counter architecture for computing accurate cpi
components,” in Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2006, pp.
175–184.
[26] T. E. Carlson, W. Heirman, and L. Eeckhout, “Sniper: Exploring the level of abstraction for scalable and accurate parallel
multi-core simulations,” in International Conference for High Performance Computing, Networking, Storage and Analysis
(SC), Nov. 2011.
[27] S. Mittal, Y. Cao, and Z. Zhang, “MASTER: A Multicore Cache Energy Saving Technique using Dynamic Cache
Reconfiguration,” IEEE Transactions on VLSI Systems, 2013.
[28] H. Zheng, J. Lin, Z. Zhang, and Z. Zhu, “Decoupled DIMM: building high-bandwidth memory system using low-speed
dram devices,” in International Symposium on Computer Architecture (ISCA). New York, NY, USA: ACM, 2009, pp.
255–266.
[29] S. Mittal and Z. Zhang, “Palette: A cache leakage energy saving technique for green computing,” in HPC: Transition
Towards Exascale Processing (To be published), ser. Advances in Parallel Computing, C. Catlett, W. Gentzsch,
L. Grandinetti, G. Joubert, and J. Vazquez-Poletti, Eds. IOS Press, 2013.
[30] X. Dong et al., “Circuit and microarchitecture evaluation of 3D stacking magnetic RAM (MRAM) as a universal memory
replacement,” in 45th ACM/IEEE Design Automation Conference (DAC), 2008, pp. 554–559.
