A Study on Performance and Power Efficiency of Dense Non-Volatile Caches
  in Multi-Core Systems by Jadidi, Amin et al.
Amin Jadidi, Mohammad Arjomand, Mahmut T. Kandemir, Chita R. Das
School of Electrical Engineering and Computer Science, Pennsylvania State University, USA
{axj945,mxa51,kandemir,das}@cse.psu.edu
In this paper, we present a novel cache design based on
Multi-Level Cell Spin-Transfer Torque RAM (MLC STT-RAM)
that dynamically adapts the set capacity and associativity
to efficiently use the full potential of MLC STT-RAM. We
exploit the asymmetric nature of the MLC storage scheme to
build cache lines with heterogeneous performance: half of
the cache lines are read-friendly, while the other half are
write-friendly. Furthermore, we propose to opportunistically
deactivate ways in underutilized sets to convert MLC to
Single-Level Cell (SLC) mode, which offers better overall
performance and lifetime. Our ultimate goal is to build a
cache architecture that combines the capacity advantages of
MLC with the performance/energy advantages of SLC. Our
experiments show an improvement of 43% in the total number
of conflict misses, 27% in memory access latency, 12% in
system performance (i.e., IPC), and 26% in L3 access energy,
with a slight degradation in cache lifetime (about 7%)
compared to an SLC cache.
1. INTRODUCTION
The ever-increasing number of cores in chip multiprocessors
(CMPs), coupled with the trend toward larger working set
sizes in emerging workloads (such as web-search algorithms,
social networking and modern databases), stresses the demand
for a large, multi-level, on-die cache hierarchy to hide the
long latency of off-chip main memory. During the last few
decades, SRAM-based cache memories successfully kept pace
with this capacity demand through an exponential reduction
in cost per bit. However, in the sub-20nm technology era,
where leakage power becomes dominant, it is very challenging
to continue expanding the cache hierarchy while
avoiding the power wall. One promising approach to address
this problem is to replace SRAM with a non-volatile mem-
ory (NVM) technology [41, 12]. Recently, various NVMs
have been prototyped in nano-technology regime [14, 16]
and many of them are expected to be commercially avail-
able by the end of the decade [31]. Among such memory
technologies, Spin-Transfer Torque RAM (STT-RAM) is the
best candidate for use within the processor; STT-RAM has
zero leakage power, accommodates almost 4× more density
than SRAM, and has small read access latency and high
endurance.
The storage element in a STT-RAM cell is a Magnetic
Tunnel Junction (MTJ) which stores binary data in form of
either parallel magnetic direction (set) or anti-parallel mag-
netic direction (reset). Two types of STT-RAM cell proto-
types can be realized: Single-Level Cell (SLC) STT-RAM
and Multi-Level Cell (MLC) STT-RAM. The SLC STT-RAM cell
consists of one MTJ component, which is used to store one
bit of information. The MLC STT-RAM device, on the other
hand, is typically composed of multiple MTJs, which are
connected either serially or in parallel, and are used to
store more than one bit of information in a single cell.
Such increased density in MLCs comes at the cost of linear
increase in access latency and energy with respect to the cell
storage level (i.e., the number of bits stored). For instance,
the read (or write) latency and energy consumption of a
2-bit STT-RAM cell is twice that of an SLC STT-RAM device
under the same fabrication technology. Furthermore, MLC
STT-RAM usually has lower endurance (in
terms of write cycles) compared to SLC. In short, SLC
and MLC storage elements show two different characteristics:
SLC is fast, power-efficient, and has a long lifetime,
whereas MLC trades off these metrics for high density.
Over the past few years, several device-level and
architecture-level optimizations have been proposed to
address the high write latency/energy [41, 12, 26, 11, 37,
38] and limited endurance [39, 20] of SLC STT-RAM caches.
However, little attention has been paid to exploring the
potential of MLC STT-RAM caches in multicore systems. Such
a study is necessary because feature scaling is continuing
and employing MLC devices seems to be the only way of
increasing cache capacity in a cost-efficient and
power-efficient manner. In this
paper, we focus on the MLC STT-RAM cache design space
exploration and related optimizations, when it is used as last
level cache (LLC) in CMPs. Specifically, this paper tries to
answer the following research questions:
1. What is the design of an MLC STT-RAM cache? Can
we reduce its read and write access latency/energy by
employing logic-level optimizations?
2. What are the performance implications of an MLC
STT-RAM cache compared to an SLC STT-RAM cache
in an iso-area design? What kinds of workloads
benefit from an MLC STT-RAM cache configuration?
And what kinds of workloads exhibit performance
degradation in a system with an MLC STT-RAM
cache?
3. What types of architecture-level optimizations can be
employed to further improve the performance and energy
efficiency of an MLC STT-RAM cache? After all
logic-level and architecture-level optimizations,
does the resultant MLC STT-RAM cache perform better
than its SLC counterpart for a wide range of workload
categories?
arXiv:1704.05044v1 [cs.AR] 17 Apr 2017
This paper answers the above research questions in detail
and introduces the following novel mechanisms to tackle the
challenges brought by an MLC STT-RAM cache:
• To reduce the read/write access latency and energy
of an MLC-based cache, we introduce a stripped
data-to-cell mapping scheme as a logic-level optimization.
This design relies on the asymmetric behavior
of reading or writing different bits of an MLC: one bit
can be read fast, while using the other saves more
energy during writes. Instead of storing cache lines
next to one another in independent memory cells, this
mapping enables data blocks to be “stacked” on top
of each other, storing bits of two cache lines in the
same cell, one as MSB and the other as LSB. With
this data layout arrangement, we demonstrate that, for
half of the cache lines, the read/write access latency
is comparable to that of an SLC cache; and for the rest
of the cache lines, the read/write access energy is in
the range of an SLC cache.
• To further improve the performance of the cache, we
propose an associativity adjustment scheme. This scheme
adjusts the associativity of each set independently
by switching on or off the ways stacked
onto others. With this mechanism in place, the cache
acts like a highly-associative cache for sets that
benefit from more associativity and behaves like a
low-associative cache for sets where extra capacity is
not useful, thereby reducing both energy consumption and
access latency. The main feature of this dynamic design
is that it can easily determine cache associativity
with limited hardware overhead. Indeed, this scheme
only requires (1) a mechanism to detect workload
behavior within each set and (2) a read- and write-aware
inter-set data movement, which does not alter any
address and data path in the cache hierarchy.
• We also propose a swapping policy to enhance the
performance and energy of the cache, on top of the other
two schemes. When used with a stripped cache design, this
scheme tries to place read-dominant cache blocks
in the ways with small read access latency, and
write-dominant cache blocks in the ways
with small write energy.
Compared to a Single-Level Cell (SLC) STT-RAM cache
with limited associativity, the proposed design reduces
conflict misses in sets with large working sets. And,
compared to an MLC STT-RAM cache with conventional
stacked data-to-cell mapping, it improves read performance
and write energy by converting MLC lines to SLC when a
set does not need extra cache capacity.
2. OVERVIEW OF STT-RAM
2.1 Single-Level Cell (SLC) Device
STT-RAM is a scalable generation of Magnetic Random
Access Memory (MRAM). Figure 1a shows the basic struc-
ture of an SLC STT-RAM cell which is composed of a stan-
dard NMOS access transistor and a Magnetic Tunnel Junc-
tion (MTJ) as the information carrier – This forms a “1T1J”
memory element. Each MTJ consists of two ferromagnetic
layers and one oxide barrier layer. One of the ferromagnetic
Figure 1: (a) SLC STT-RAM cell consisting of one access
transistor and one MTJ storage carrier (“1T1J”); (b) Binary
states of an MTJ: two ferromagnetic layers with anti-parallel
(or parallel) direction indicate a logical ‘1’ (or ‘0’) state;
(c) MLC STT-RAM cell with serial MTJs: the soft-domain on
top of the hard-domain; (d) Resistance levels for 2-bit
STT-RAM: four resistance levels are obtained by combining the
states of two MTJs having different threshold currents.
layers (i.e., the reference) has a fixed magnetic direction and
the magnetic direction of the other (i.e., the free layer) can
be changed by passing a spin-polarized current into the cell.
If these two ferromagnetic layers have anti-parallel (or paral-
lel) direction, the MTJ resistance is high (or low), indicating
a binary value of ‘1’ (or ‘0’) state (Figure 1b). Compared to
conventional MRAM, STT-RAM exhibits superior scalabil-
ity since the threshold current required to make the status
reversal (from ‘0’ to ‘1’ or vice versa) decreases as the MTJ
size shrinks. Furthermore, the STT-RAM technology has
reached a good level of maturity and the fabrication cost of
cells in this technology is small – the number of additional
mask steps beyond a standard CMOS process is no more
than two (less than 3% added cost) [14].
2.2 Multi-Level Cell (MLC) Device
The MLC capability can be realized by modeling four or
more resistance levels within one cell. With the current
VLSI technologies, the MLC STT-RAM devices are fabri-
cated by using multiple MTJs structured either in parallel
or series connections. Figure 1c depicts these two configu-
rations for a 2-bit MLC device – the 2-bit series STT-RAM
is composed of two vertically stacked MTJs, and the 2-bit
parallel cell is composed of one common reference layer and
two independent free layers which are arranged side-by-side
(parallel). Under the same fabrication conditions and pro-
cess variation, the serial MLC has advantages over the par-
allel design: (1) it exhibits much lower bit error rates re-
lated to read and write disturbance issues [44]; (2) it re-
sults in smaller cells; and (3) while it has full-compatibility
with modern perpendicular STT-RAM technology, fabricat-
ing parallel MTJs with this technology is very challenging.
We consider serial MTJ-based 2-bit STT-RAM cells in this
work, although our ideas apply equally to parallel MTJ-
based devices and MLCs with higher bit densities.
Table 1: 2-bit MLC STT-RAM compared to SLC cell circuit
model [16] at the 45 nm PTM.
Parameters      MLC, LSB (MTJ1)   MLC, MSB (MTJ2)   SLC
Dimensions      70×140 nm         75×150 nm         75×140 nm
Read Latency    0.962 ns          0.962 ns          0.856 ns
Read Energy     0.0115 pJ         0.0115 pJ         0.0112 pJ
Write Latency   10 ns             10 ns             10 ns
Write Energy    1.92 pJ           3.192 pJ          3.192 pJ
Endurance       10^12 writes (per MLC cell)         10^10 writes
In the 2-bit serial STT-RAM, MTJs have different layer
thickness and area. This results in different threshold cur-
rent, I±c and resistance variation, ∆R for the two MTJs.
Four levels of resistance are obtained by combining the bi-
nary states of both MTJs (Figure 1d). In such a cell, the
layer that requires a small current to switch (MTJ1) is re-
ferred to as soft-domain and the layer that requires a large
current to switch (MTJ2) is referred to as hard-domain. As-
suming 2-bit information, the Least Significant Bit (LSB)
and the Most Significant Bit (MSB) are stored into the soft-
domain and hard-domain, respectively. Below, we describe
details of the write and read operations for a 2-bit series
MLC.
2.2.1 Two-step Write Operation
The circuit schematic for access operations on a 2-bit se-
rial STT-RAM is depicted in Figure 2. Two MTJs are ac-
cessed by an access transistor controlled by a word-line (WL)
signal and a write current always passes through both MTJs.
Accordingly, when the hard-domain is written, the state of
the soft-domain is also switched to the same direction,
because the current needed to switch the hard-domain exceeds
the soft-domain's smaller threshold current (Ic). In order to write
the LSB, the direction and amplitude of the current pulse
are defined by the logical value of the MSB and LSB as well
as the characteristics of both MTJs. If the desired LSB value
is equal to the currently stored MSB, a second current pulse
is not necessary, since the soft-domain was already switched
to the proper state. Otherwise, a second current pulse is
required to write the LSB into the soft-domain (Figure 2c).
This second current pulse should be set between the Ic val-
ues of the two MTJs to prevent bit flip of the hard-domain
(Figure 2b). Consequently, the number, direction, and am-
plitude of write current pulses in this two-step write scheme
vary depending on the written data.
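
The two-step write described above can be sketched as a small behavioral model. This is only an illustration under simplifying assumptions: the pulse labels `large` and `small` and the function names are hypothetical, and current directions and amplitudes are abstracted into the value each pulse leaves behind.

```python
def write_pulses(msb, lsb):
    """Return the pulse sequence that stores (msb, lsb) in a 2-bit serial cell.

    A 'large' pulse exceeds the hard-domain threshold and therefore sets BOTH
    domains to `value`; a 'small' pulse lies between the two thresholds and
    sets only the soft-domain. (Labels are hypothetical, for illustration.)
    """
    pulses = [("large", msb)]          # step 1: write MSB; soft-domain follows
    if lsb != msb:
        pulses.append(("small", lsb))  # step 2: correct the soft-domain only
    return pulses


def apply_pulses(pulses):
    """Simulate the resulting (hard-domain, soft-domain) state."""
    hard = soft = None
    for strength, value in pulses:
        if strength == "large":
            hard = soft = value
        else:
            soft = value
    return hard, soft


for msb in (0, 1):
    for lsb in (0, 1):
        seq = write_pulses(msb, lsb)
        print(f"target '{msb}{lsb}': {len(seq)} pulse(s), state {apply_pulses(seq)}")
```

In this model, targets ‘00’ and ‘11’ need a single pulse, while ‘01’ and ‘10’ need two, which matches the data-dependent pulse count noted in the text.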
2.2.2 Two-step Read Operation
On a read, the access transistor is turned on and a voltage
difference is applied between the bit-line (BL) and source-
line (SL) terminals. This voltage difference causes a current
to pass through the cell and is small enough to prevent
either MTJ from switching its magnetic direction. The value
of the current is a function of the MTJs’ resistance and is input into
a multi-reference sense amplifier. The sense amplifier unit
has three resistance references, each between two neighbor-
ing resistance states (Figure 2d). The cell content is read
through two read cycles in a binary search fashion. In the
first step, the sense amplifier uses R2 as a reference to iden-
tify the value of MSB stored in the hard-domain. In the
second step, depending on the MSB value, the reference re-
sistance is switched to R1 or R3, and the LSB is read from
the soft-domain.
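
As a rough illustration of the binary-search read, the sketch below models the three references as midpoints between neighboring resistance states. The resistance values are invented for the example, and the assumption that the states are ordered ‘00’ < ‘01’ < ‘10’ < ‘11’ (so that the MSB dominates) is ours, not a device measurement.

```python
# Hypothetical cell resistances in ohms; only their ordering matters here.
STATE_RESISTANCE = {"00": 2000.0, "01": 2600.0, "10": 3400.0, "11": 4000.0}

levels = sorted(STATE_RESISTANCE.values())
# Each reference sits midway between two neighboring resistance states.
R1, R2, R3 = ((a + b) / 2 for a, b in zip(levels, levels[1:]))


def read_msb(resistance):
    """First read cycle: compare against R2 to resolve the MSB."""
    return int(resistance > R2)


def read_cell(resistance):
    """Two read cycles: MSB against R2, then LSB against R1 or R3."""
    msb = read_msb(resistance)
    reference = R3 if msb else R1   # second reference depends on the MSB
    lsb = int(resistance > reference)
    return msb, lsb


for state, resistance in STATE_RESISTANCE.items():
    print(state, "->", read_cell(resistance))
```

Note that `read_msb` needs only one comparison, which is the property the stripped mapping later exploits for fast-read lines.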
Figure 2: (a) Schematic of the read and write access circuits
in a 2-bit MLC cache array; (b,c) write operation transition
model: first, the MSB is written to the hard-domain
and if the LSB differs from MSB a small current is driven
to switch its direction; (d) three resistance references in a
sense amplifier, each between the two neighboring resistance
states.
2.3 SLC versus MLC: Device-Level Comparison
Table 1 gives the typical latency, energy and lifetime
parameters of a 2-bit MLC and an SLC STT-RAM cell. These
parameters are taken from state-of-the-art prototypes (e.g., [16])
at the 45nm PTM technology. Based on the parameters in this
table, we can compare SLC and MLC in four ways:
• Cell Area – In STT-RAM, the MTJ size is larger
than the access transistor and determines the cell size.
As a result, under the same technology constraints, the
SLC has the same area as the smaller MTJ of the MLC
(MTJ1), i.e., 70×140 nm, and the MLC area is 75×150 nm,
which is slightly larger (due to the larger MTJ2). As
such, in an iso-area design, the capacity of the 2-bit
MLC cache is less than twice that of the SLC cache.
• Read Operation – Both devices use the same read
voltage difference (i.e., -0.1 V [16]), which results in
a read latency of 0.856 ns for the SLC cell and a total
read latency of 1.9 ns for the MLC cell (about 0.9 ns per
bit, which is in the same range as SLC). The same holds
for energy consumption.
• Write Operation – In order to meet the write
performance requirement of the LLC, the write pulse width
is set to 10 ns [10] (which is in the range of today’s SRAM
caches). For the 2-bit MLC, this results in write energies
of 1.92 pJ and 3.192 pJ for MTJ1 (soft-domain) and MTJ2
(hard-domain), respectively [40]. Also, writing
into an SLC cell consumes the same energy as writing into
MTJ2, as both employ MTJs of the same size.
• Cell Endurance – Due to the need for larger write
currents and the two-step write operation in the MLC cell,
it generally has lower cell endurance than SLC (about
100× based on the model in Table 1).
Throughout this paper, we use the values in this table for
performance, energy and endurance analysis of the STT-
RAM-based cache designs.
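
The iso-area remark in the Cell Area comparison can be checked with a short calculation. The sketch assumes that cell area is set entirely by the MTJ footprint quoted in that bullet (70×140 nm for SLC, 75×150 nm for the 2-bit MLC) and ignores peripheral circuitry; the function names are ours.

```python
SLC_CELL_NM = (70, 140)   # SLC footprint quoted in the Cell Area bullet
MLC_CELL_NM = (75, 150)   # 2-bit MLC footprint (set by the larger MTJ)
BITS_PER_MLC_CELL = 2


def cell_area(dims_nm):
    """Footprint in nm^2 for a (width, height) pair."""
    width, height = dims_nm
    return width * height


def iso_area_capacity_ratio():
    """MLC capacity divided by SLC capacity for the same array area."""
    cells_ratio = cell_area(SLC_CELL_NM) / cell_area(MLC_CELL_NM)
    return BITS_PER_MLC_CELL * cells_ratio


ratio = iso_area_capacity_ratio()
print(f"MLC/SLC capacity at equal area: {ratio:.3f}x")
```

Under these assumptions the ratio comes out at roughly 1.74, i.e., below the ideal factor of two, which is what the text states qualitatively.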
Figure 3: 2-bit STT-RAM layout and a schematic view of
the cache array, read and write circuits.
3. MLC STT-RAM CACHE: THE BASELINE
We assume a chip multiprocessor (CMP) with an MLC-based
STT-RAM last-level cache (LLC). Figure 3 shows the
architectural model of this system where, because STT-RAM
is not CMOS-compatible, the LLC is integrated into the
processor die using 3D VLSI (see footnote 1). This system
is a 2-tier 3D chip. Cores and all lower-level caches are
placed at the processor tier. The STT-RAM cache (LLC) is
mounted at the top tier of the 3D die; it is logically
shared among all cores and physically structured as a
static NUCA. Note that prior studies assume the same integration
model for NVM caches in future processors [37, 6, 26]. The
modeled STT-RAM cache has two characteristics, namely,
a NUCA structure and serial-lookup access, which are
explained below.
NUCA structure – The LLC in our CMP model is a
large cache, which has to be structured as a NUCA
(Non-Uniform Cache Architecture) for scalability and high
performance. In NUCA, a large cache is divided into multiple
banks (one bank per core) connected by an on-chip
network for data and address transfer between the banks.
NUCA exhibits non-uniform access latencies depending on
the physical location of the cache bank being accessed [27].
Footnote 1: Today, 3D ICs are commercially available [29]
and have been receiving immense research interest since the
early 2000s. Besides their latency and bandwidth benefits,
one of the advantages of 3D ICs over 2D ICs is that they
provide a platform to integrate different (non-compatible)
technologies on the same die, with less concern about noise
and fabrication costs.
Two main types of NUCA have been proposed: static NUCA
and dynamic NUCA. In static NUCA, a cache line is stati-
cally mapped into banks, with the lower bits of the index de-
termining the bank. In dynamic NUCA, on the other hand,
any given line can be mapped into several banks based on
a placement policy. Although dynamic NUCA can drasti-
cally reduce average access latency (compared to the static
NUCA) by putting the frequently-accessed data into banks
that are closer to the source of request [23], it is not a good
design choice in non-volatile caches (like our design), since it
increases the cache write traffic (due to frequent movement
of cache lines between banks), which can in turn accelerate
the wear-out problem. Therefore, we assume that the LLC
is built as a static NUCA.
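
A minimal sketch of static NUCA bank selection follows. The bank count of 8 and the 64B line size are assumptions made for the example, and the exact bit positions used by the authors' simulator are not given in the text; the point is only that the bank is a fixed function of the low index bits, so a line never migrates between banks.

```python
LINE_BYTES = 64   # assumed cache line size
NUM_BANKS = 8     # assumed bank count (one bank per core)


def bank_of(address):
    """Static NUCA: the low bits of the line index pick the bank."""
    line_index = address // LINE_BYTES
    return line_index % NUM_BANKS


def set_in_bank(address, sets_per_bank):
    """The remaining index bits select the set inside the chosen bank."""
    line_index = address // LINE_BYTES
    return (line_index // NUM_BANKS) % sets_per_bank


for addr in (0x0000, 0x0040, 0x0080, 0x0200, 0x0240):
    print(hex(addr), "-> bank", bank_of(addr), "set", set_in_bank(addr, 512))
```

Because the mapping is fixed, no line is ever rewritten into another bank, which avoids the extra write traffic that makes dynamic NUCA unattractive for a wear-limited cache.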
Serial-lookup access – The modeled LLC is a serial-
lookup cache. In serial caches, tag and data arrays are ac-
cessed sequentially, saving energy at the expense of increased
delay. The serial cache access latency relies heavily on the
tag array latency, and consequently, we choose SLC for the
tag array to minimize the latency. Figure 3b shows the mod-
eled LLC with SLC STT-RAM tag array and 2-bit MLC
STT-RAM data array. As shown in this figure, STT-RAM
has the same peripheral interfaces used in SRAM caches:
each bank consists of a number of cache lines, decoders for
set index, read circuits (RDs), write circuits (WRs), and
output multiplexers. Unlike SRAM, however, the current
sense amplifiers in the STT-RAM read circuit are shared
and multiplexed across bit-lines due to their large size com-
pared to the cell array. For the read and write operations, a
decoder circuit selects a cache set and connects the selected
physical line to RDs for reading or WRs for writing.
3.1 Stripped Data-to-Cell Mapping
In the discussed 2-bit MLC model (Section 2.3), both bits
are assumed to be written or read together, although sequen-
tially. Applied to a cache context, both bits of a cell would
normally be mapped to the same cache line, as illustrated
on the left side of Figure 4. In this stacked data-to-cell map-
ping, reading a cache line always takes two read cycles (i.e.,
1.9ns in our 2-bit model), while writing takes two write cy-
cles at most (i.e., 20ns). In other words, with the stacked
mapping, the access latency to an MLC STT-RAM cache
is roughly twice that of an SLC cache. The same discus-
sion applies to the energy consumption of each access. As
opposed to this design, we propose to exploit the read and
write asymmetry of the two MTJs in order to simultaneously
optimize overall access latency and energy consumption.
As discussed earlier, when reading or writing bits sepa-
rately in a 2-bit STT-RAM, we observe a different latency
and energy consumption: for read operations, the MSB can
be read from the hard-domain in a single read cycle (0.96 ns
for the 2-bit model in Table 1), whereas reading the LSB
requires a second read cycle. For write operations, on the
other hand, one can write into either the hard-domain or the
soft-domain independently, by using a single current pulse.
Unlike the soft-domain, writing into the hard-domain has
two effects:
1. Writing into the hard-domain may cause the LSB to
flip. Thus, each write request to the MSB must be
preceded by an LSB read, which takes an extra 1.9 ns
(or, in general, two read cycles). Following that, both
MTJs are written sequentially: first the MSB
and then the previously read LSB.
Figure 4: An illustration of stacked versus stripped
data-to-cell mapping for an 8-bit data array (four 2-bit MLC
cells) and 2-bit tag arrays (in SLC). In the stacked
data-to-cell mapping scheme, the data bits of a cache block
(each block has 8 bits) are mapped to 4 independent memory
cells (i.e., 2 bits in each cell). In the stripped mapping,
each memory cell contains only one bit of each cache block,
so each 8-bit cache block spans 8 memory cells. For
instance, lines ‘B’ and ‘D’ are mapped to hard-domains
(MTJ2), whereas ‘A’ and ‘C’ use soft-domains (MTJ1).
2. Writing into the hard-domain dissipates 1.6 times the
energy required for the soft-domain. The larger the
write current, the shorter the cell lifetime.
Accordingly, although the hard-domain emulates SLC read
access in latency, the cost of writing into the soft-domain
is much lower, primarily because only a small current is re-
quired to switch its polarity.
Based on the different read/write characteristics of both
MTJs, we propose a stripped data-to-cell mapping, which
groups the hard-domains together to form Fast Read High-
Energy write (FRHE) lines, and groups the soft-domains to
form Slow Read Low-Energy write (SRLE) lines. The right
portion of Figure 4 depicts the logical arrangement of the
tag and data arrays for the stripped mapping. Within each
cache set, half of the cache lines will be mapped to the hard-
domains and the other half to the soft-domains. Table 2
summarizes the sequence of transactions required to read
and write the FRHE or SRLE lines in a stripped 2-bit cache.
In addition to its performance efficiency, the stripped data-
to-cell mapping provides some other opportunities for op-
timizing energy consumption and lifetime of a 2-bit MLC
cache. Before describing these advantages, we first
evaluate the performance efficiency of an MLC-based cache
(with stripped mapping) when running real workloads.
Table 2: Sequence of transactions when accessing a 2-bit
MLC cache with stripped data-to-cell mapping. FRHE (Fast
Read High-Energy write) allocates hard-domains (MTJ2),
and SRLE (Slow Read Low-Energy write) allocates soft-
domains (MTJ1).
Access        Sequence of transactions
FRHE Read     (1) Read hard-domains
FRHE Write    (1) Read hard-domains; (2) Read soft-domains; (3) Write hard-domains; (4) Write soft-domains
SRLE Read     (1) Read hard-domains; (2) Read soft-domains
SRLE Write    (1) Write soft-domains
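
To make the cost of each transaction sequence in Table 2 concrete, the sketch below adds up per-step latencies taken from Table 1 (0.962 ns per read cycle, 10 ns per write pulse). It assumes the steps are strictly serialized and ignores peripheral and tag-lookup delays, so the totals are only first-order estimates.

```python
READ_CYCLE_NS = 0.962    # one MLC read cycle (Table 1)
WRITE_PULSE_NS = 10.0    # one write pulse (Table 1)

# Transaction sequences from Table 2, reduced to step types.
SEQUENCES = {
    "FRHE read":  ["read"],
    "FRHE write": ["read", "read", "write", "write"],
    "SRLE read":  ["read", "read"],
    "SRLE write": ["write"],
}


def latency_ns(access):
    """Sum the step latencies, assuming no overlap between steps."""
    step_cost = {"read": READ_CYCLE_NS, "write": WRITE_PULSE_NS}
    return sum(step_cost[step] for step in SEQUENCES[access])


for access in SEQUENCES:
    print(f"{access:10s} ~ {latency_ns(access):.3f} ns")
```

The estimate shows the trade-off behind the two line types: FRHE lines read in one cycle but pay for a read-modify-write on stores, whereas SRLE lines need two read cycles but a single write pulse.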
3.2 Performance Analysis
Employing an MLC STT-RAM cache has two opposing
impacts on the performance of a multi-core system. On the
one hand, thanks to its large capacity, it can improve per-
formance by reducing the cache miss rate and so the need
for accessing off-chip main memory if employed as an LLC
(like our model). This is especially true for emerging work-
loads (like social networking applications and new database
workloads) that usually have very large working set sizes.
On the other hand, due to its high read and write access
latencies, such a cache architecture degrades the system per-
formance for workloads with low or negligible miss rates for
on-chip caches. Although the stripped data-to-cell mapping
partially addresses the problem of high access latency, the
problem still exists for half of the cache lines which always
exhibit the maximum MLC STT-RAM read and write la-
tencies.
To examine the performance efficiency of a 2-bit STT-RAM
cache with the stripped data-to-cell mapping, Figure 5
presents the LLC miss rate and Instructions-per-Cycle
(IPC) of a single-core system for a one-second execution of
four typical applications from SPEC-CPU 2006 [36]. The
applications are chosen to cover a wide range of scenarios
with low, moderate, and high LLC miss rates. In this
experiment, we skip the first 100 million instructions as
the initialization stage and collect results for the next
200 million instructions. We assume two LLC configurations:
(1) a 256KB SLC-based STT-RAM cache with 8 ways
per set, and (2) a 512KB 2-bit STT-RAM cache with
16 ways per set and the proposed stripped data-to-cell
mapping. In both configurations, the cache line size is set
to 64B. Note also that both configurations have almost
the same LLC area (i.e., this is an iso-area analysis) (see footnote 2).
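
A quick calculation shows what the two configurations have in common: with a 64B line, both organize the cache into the same number of sets, so the MLC configuration differs only in offering twice the ways per set. The helper name below is ours.

```python
def num_sets(capacity_bytes, ways, line_bytes=64):
    """Number of sets in a set-associative cache."""
    return capacity_bytes // (ways * line_bytes)


slc_sets = num_sets(256 * 1024, ways=8)    # configuration (1)
mlc_sets = num_sets(512 * 1024, ways=16)   # configuration (2)

print("SLC 256KB, 8-way :", slc_sets, "sets")
print("MLC 512KB, 16-way:", mlc_sets, "sets")
```

Both calls return 512 sets, so an address maps to the same set index in either configuration and only the associativity changes.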
One can make two main observations from the results
in this figure. First, if an application exhibits a low LLC
miss rate during some of its phases or its entire execution,
the SLC cache results in better performance (i.e., higher
IPC), thanks to its lower access latency than the MLC-based
cache. Second, during the phases where the SLC cache’s miss
rate is high, the MLC cache can increase IPC if it can hold
the whole or a major part of the application’s working set.
There are also cases where the application’s
memory footprint is very large (even larger than the MLC
cache size) or the application has a streaming behavior
(i.e., accessing a large set of addresses in a sequential
fashion without reuse), during which both the SLC and MLC
caches have high miss rates and consequently low IPC values.
Note that we have observed the same behavior for the energy
consumption of the SLC and MLC caches but, due to lack of
space, we defer the energy consumption results to the
evaluation section.
To generalize our observation to large LLC sizes and eval-
uate the performance efficiency of the stripped data-to-cell
mapping, Figure 6 compares memory access latency of a
4MB 16-way stripped LLC (L3) with two extreme baselines:
(1) a 2MB 8-way SLC cache with nearly the same die area
(i.e., fast cache), and (2) a 4MB 16-way 2-bit cache with
stacked data-to-cell mapping (i.e., dense cache). For this
experiment, we use the same evaluation methodology de-
scribed later in Section 4 and the workload set in Table 4.
Footnote 2: For this experiment, we used the same simulation
platform described in Section 4.
[Figure 5 panels: (a) gcc, (b) xalancbmk, (c) hmmer, (d) omnetpp; each panel plots Miss Rate and IPC over time (one second) for the SLC cache and the MLC cache.]
Figure 5: Performance comparison of the SLC and MLC-based STT-RAM caches in terms of the LLC miss ratio and IPC (as
a system-level metric) for four workloads from the SPEC-CPU 2006 benchmark suite.
[Figure 6 plot: access latency at L3 ports for 2MB-8W-SLC, 4M-16W-MLC, and 4MB-16W-Stripped, grouped into Low-Miss, Medium-Miss, and High-Miss workloads.]
Figure 6: Comparison of the stripped cache with SLC array
and the stacked 2-bit MLC, each with the same die area.
Stripped MLC is better than SLC in applications with high
and medium L3 miss rates, since it increases the effective
cache capacity in terms of lines and associativity on-demand.
It is also better than the conventional stacked MLC cache
in applications with low L3 miss rates as it constructs the
fast read lines.
This figure confirms that the stripped MLC lets us have our
cake and eat it too: in applications with high and medium LLC
miss ratios (first two columns), the performance improvement
over the SLC baseline is mainly due to the increase in
effective cache capacity; while in applications with a low L3
miss ratio, it reduces the access time of the MLC cache by
constructing separate fast read and write lines.
Based on this discussion, one can conclude that the MLC
STT-RAM cache is not always beneficial: it can be harmful
to the cache latency and energy in certain cases. Therefore,
we need a mechanism to “dynamically” shape-shift the MLC
cache to an SLC cache or vice versa (depending on the
applications’ dynamic cache requirements) in order to
consistently achieve low memory access latency.
3.3 Enhancements for the Stripped MLC Cache
3.3.1 The Need for Dynamic Associativity
It is not beneficial for performance to shape-shift the cache
configuration from SLC to MLC (or vice versa) at the
granularity of the entire cache. Memory references in general
purpose applications are often non-uniformly distributed
across the sets of a set-associative cache. This non-uniformity
creates a heavy demand on some sets, which can lead to a
high number of local conflict misses, while lines in some other
sets can remain underutilized. This fact, which is the base
of our proposal, can be illustrated with some examples.
Figure 7 plots the absolute number of conflict misses that each
set exhibits in an 8-way 512KB MLC-based cache in a
single-core system during a one-second execution of four
workloads (the same workloads as in Figure 5). For each workload,
one can see that there are some sets that have few conflict
misses, while some others are very much stressed. More-
over, the numbers of these opposite-behaving sets vary from
one program phase to another. This non-uniformity of miss
counts across cache sets indicates that, in the stripped cache,
high associativity not only brings no benefits for sets with
low-utilized lines, but it can also result in great degrada-
tion in performance, lifetime and energy consumption of the
cache. As a result, the new cache architecture we propose
involves, besides the explained stripped data-to-cell map-
ping, an on-demand associativity policy which dynamically
modulates the associativity of each set.
To this end, we propose an on-demand associativity ad-
justment policy which determines the associativity of sets
in the stripped cache considering their local miss rates. We
initialize the associativity to the lowest level, which corre-
sponds to half of the full capacity in our case. Then, the
[Figure 7, panels (a) gcc, (b) xalancbmk, (c) hmmer, (d) omnetpp: Miss Count (y-axis) versus Set Index (x-axis, 0 to 512)]
Figure 7: Distribution of the missed accesses over LLC’s sets for 200 million instructions in four applications from the SPEC-
CPU 2006 benchmark suite. We see that, for each application, there are some sets that have few conflict misses, while some
others are very stressed.
associativity of a set will grow and decrease over time
depending on the dynamic utilization. To mitigate the effects
of slow reads and high-energy writes, when a cache line needs
to be turned off, an FRHE and SRLE pair is merged into an
SLC line, which uses exclusively the soft-domains, while all
hard-domains are fixed at the same value (‘0’ or ‘1’)3. As
a result, an SLC line will be read in a single cycle since the
hard-domains are known and a single resistance reference
will be required. This causes an SLC line to feature fast
reads and low-energy writes.
We first describe how the decision to grow the associativity
is taken. As replacement policies are not ideal, it
is clear that the associativity should not be prematurely
increased after every miss. To achieve better cache
performance, we introduce two saturating counters for each
set: a miss counter (Mcnt) and a weight counter (Wcnt). The
miss counter tracks the number of misses that a set exhibits.
The weight counter filters out short-term variations in misses
and makes a set with a large weight value less likely to
increase its associativity. Wcnt reflects
the associativity of a set and is initialized to the minimum
associativity. In each program epoch, Mcnt is initialized to
Wcnt×N and counts down as the set misses. When Mcnt reaches
zero, the hardware increases the associativity by one and, on
the next miss, the fetched memory block is placed into the
newly-allocated way.
In an effort to balance the wear between blocks and maxi-
mize lifetime, we introduce a circular pointer (a 3-bit counter
for the eight cache-line pairs of our stripped cache) indicat-
ing which cache-line pair should be selected for the next
increase in associativity. Therefore, cache lines are switched
to the MLC mode in a round-robin fashion and writes are
well distributed among MLC cache lines.
At the end of an epoch, Mcnt is compared with SLC-
Associativity×N to decide whether the cache set associativ-
ity should be reduced. If Mcnt is larger than SLC-Associativity
×N, this indicates that the utilization is low enough to re-
duce the associativity by one for the next epoch. In this
case, the replacement policy is triggered and the associativ-
ity is reduced by evicting a cache line and converting the
corresponding cache-line pair into SLC.
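The grow and shrink decisions described above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' hardware: the class name `SetController`, its method names, and the value of N are hypothetical. It also makes two assumptions that the text does not state: Mcnt decrements by one per miss, and a set grows at most once per epoch (once Mcnt hits zero it stays there until the next epoch). The victim pair for shrinking is supplied by the caller, standing in for the replacement policy.

```python
from dataclasses import dataclass, field

MIN_WAYS = 8    # SLC-mode associativity (minimum, from the text)
NUM_PAIRS = 8   # eight cache-line pairs per set in the stripped cache


@dataclass
class SetController:
    n: int                                   # threshold N (value is an assumption)
    pair_is_mlc: list = field(default_factory=lambda: [False] * NUM_PAIRS)
    next_pair: int = 0                       # 3-bit circular pointer
    mcnt: int = 0                            # miss counter, reloaded each epoch

    def __post_init__(self):
        self.start_epoch()

    @property
    def wcnt(self):
        # Wcnt mirrors the current associativity of the set.
        return MIN_WAYS + sum(self.pair_is_mlc)

    def start_epoch(self):
        self.mcnt = self.wcnt * self.n

    def on_miss(self):
        """Return the pair index switched to MLC mode, or None."""
        if self.mcnt == 0:                   # assumption: one growth per epoch
            return None
        self.mcnt -= 1                       # assumption: one decrement per miss
        if self.mcnt > 0 or all(self.pair_is_mlc):
            return None
        # Round-robin selection of the next pair still in SLC mode.
        while self.pair_is_mlc[self.next_pair]:
            self.next_pair = (self.next_pair + 1) % NUM_PAIRS
        pair = self.next_pair
        self.pair_is_mlc[pair] = True
        self.next_pair = (pair + 1) % NUM_PAIRS
        return pair

    def end_epoch(self, victim_pair):
        """Shrink by one way if few misses occurred; return True if shrunk."""
        shrink = self.mcnt > MIN_WAYS * self.n and self.pair_is_mlc[victim_pair]
        if shrink:
            self.pair_is_mlc[victim_pair] = False
        self.start_epoch()
        return shrink
```

Under these assumptions, a set at the minimum associativity can never shrink further, because its Mcnt starts the epoch at exactly SLC-Associativity×N and can only decrease.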
3 In our experiments, the hard-domains are set to logic ‘0’
since bits “00” and “01” have less overlap under severe pro-
cess variation [44].
3.3.2 The Need for a Cache Line Swapping Policy
The stripped data-to-cell mapping provides an opportunity to
further enhance the performance, energy efficiency, and
lifetime of an MLC-based STT-RAM cache. More precisely, fast
read lines (FRHE) greatly speed up read operations compared to
the stacked mapping. On the other hand, if write-dominated
blocks can be directed to low write-energy lines (SRLE), the
write energy and cell lifetime can be kept close to those of
SLC. To maximize the benefits provided by stripping, we
propose a swapping policy to dynamically promote
write-dominated data blocks to SRLE lines and read-dominated
ones to FRHE lines.
The swapping mechanism in the stripped mapping must be used
with care. Looking into the cache traces, we find many blocks
that are neither clearly read-dominated nor clearly
write-dominated. In Figure 8, we report the read/write
intensity of all memory blocks in the LLC for a set of
workloads from the multi-threaded PARSEC-2 benchmark programs
[4] and the SPEC-CPU 2006 benchmark programs [36]. If we
consider a memory block that is written (or read) in the LLC
90% of the time to be write-dominated (or read-dominated), we
observe that 32.7% of memory blocks on average fall into one
of these two classes, and as few as 2% for some programs. In
this situation, even with a near-optimal data swap, the
remaining non-dominated memory blocks might keep swapping
between the FRHE and SRLE lines. This might incur more cache
contention and consume more energy. Worse, the cache lifetime
would be considerably reduced by the write amplification
provoked by the swaps.
Therefore, to avoid this scenario, a swap policy
is introduced. For read- and write-aware swap, each cache
line is associated with a swap counter (Scnt) and a swap
weight counter (SWcnt). At each epoch, SWcnt is initialized
to ‘1’; it increments when a swap happens and saturates
when it reaches M. Scnt is initialized at each epoch with
SWcnt×N, and decrements when its FRHE line is written
or when its SRLE line is read. When Scnt reaches zero, either
its SRLE or its FRHE line replaces the block selected for
eviction by the replacement policy in use: if the victim line
is an FRHE line, it is replaced by the SRLE line; otherwise,
it is replaced by the FRHE line. Here, M is the maximum value
that a hardware counter can represent (in our experiments 256
for an 8-bit counter), and N is the swap threshold that we
study in the experiments. Note that, on a miss,
Scnt and SWcnt are reinitialized.
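The counter bookkeeping of the swap policy can be sketched as follows. This is a minimal sketch, not the authors' implementation: the class `SwapCounters` and its method names are hypothetical, and the sketch assumes that Scnt is reloaded with the updated SWcnt×N right after a swap, so that each further swap within an epoch needs more mismatched accesses than the previous one. The text only states that Scnt is loaded with SWcnt×N at the start of an epoch.

```python
class SwapCounters:
    """Per-line Scnt/SWcnt bookkeeping (illustrative names)."""

    def __init__(self, n, m=256):
        self.n = n          # swap threshold N (studied parameter)
        self.m = m          # saturation value M, as given in the text
        self.reset()

    def reset(self):
        """Reinitialize at the start of an epoch or on a miss."""
        self.swcnt = 1
        self.scnt = self.swcnt * self.n

    def on_access(self, line_kind, is_write):
        """Return True when a swap should be triggered.

        line_kind is 'FRHE' or 'SRLE'. Only mismatched accesses count:
        a write to an FRHE line or a read from an SRLE line.
        """
        mismatched = (line_kind == "FRHE" and is_write) or \
                     (line_kind == "SRLE" and not is_write)
        if not mismatched:
            return False
        self.scnt -= 1
        if self.scnt > 0:
            return False
        # A swap happens: bump the weight (saturating at M) and reload Scnt.
        self.swcnt = min(self.swcnt + 1, self.m)
        self.scnt = self.swcnt * self.n   # assumption: reload after each swap
        return True
```

With this reload rule, a block that keeps alternating between read-heavy and write-heavy behavior swaps less and less often within an epoch, which is the damping effect the weight counter is meant to provide.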
[Figure 8 plot: % of Total (y-axis, 0 to 100) per workload (blackscholes, bodytrack, bzip2, canneal, dealII, dedup, facesim, ferret, fluidanimate, freqmine, gcc, GemsFDTD, gromacs, lbm, leslie3d, mcf, milc, namd, omnetpp, parser, perlbench, povray, raytrace, soplex, streamcluster, swaptions, vips, X264, xalancbmk); series: RD-to-WR < 0.1, 0.1 < RD-to-WR < 0.9, RD-to-WR > 0.9]
Figure 8: Percent of memory blocks in cache with read-dominated, write-dominated, and non-dominated properties for a set
of workloads from PARSEC-2 and SPEC-CPU 2006 programs.
4. EXPERIMENTAL METHODOLOGY
In this section, we describe our simulation platform as well
as the design methodology used for our evaluation through-
out this paper.
4.1 Infrastructure
We perform a microarchitectural, execution-driven simu-
lation of an out-of-order processor model with Alpha ISA
using the Gem5 simulator [5]. The simulated CMP runs
at 2.5 GHz frequency. We use McPAT [24] to obtain the
timing, area, energy and thermal estimation for the CMP
we model, and use CACTI 6.5 [28] for detailed cache area,
power and timing models. For STT-RAM LLC, NVSim [13]
is used and is parametrized with the cell latency and energy
parameters from Table 1. We use 45 nm ITRS models, with
a High-Performance (HP) process for all the components of
the chip except for LLC, which uses a Low-Operating-Power
(LOP) process.
4.2 System
We model the 8-core CMP system detailed in Table 3.
The system has three levels of caches and, because STT-
Table 3: Main characteristics of our simulated CMPs. The
latencies shown assume a 32 nm process at 2 GHz.
Processor Layer (Tier 1)
  Cores:       8-cores, SPARC-III ISA, out-of-order, 2 GHz, Solaris 10 OS
  L1 caches:   32 kB private, 4-way, 64 B, LRU, Write through, 2-port, 1-cycle Access time, MSHR: 4 instruction & 32 data
  L2 cache:    2 MB private, NUCA, Unified, Inclusive, 16-way, 64 B, LRU, Write-back, 10 cycle, MSHR: 32 (instruction and data)
  Coherency:   MOESI directory; 2×4 grid packet switched NoC; XY routing; 1-cycle router; 1-cycle link.
L3 Cache Layer (Tier 2)
  L3 cache:    NUCA, 8 banks, Shared, Inclusive, STT-RAM, 64B, LRU, Write-back, 1-port, 32×64 B write buffer, 8×64 B read buffer, 4-cycle L2-to-L3 latency, MSHR: 128 (instruction and data)
  L3 Config.:  cf. Table 5
Off-Chip Main Memory
  Controller:  4 on-chip, FR-FCFS scheduling policy
  DRAM:        DDR3 1333 MHz (MICRON), 8 B data bus, tRP-tRCD-CL: 15-15-15 ns, 8 DRAM banks, 16 kB row buffer per bank, Row buffer hit: 36 ns, Row buffer miss: 66 ns
RAM is not compatible with CMOS, it is built using 2-tier
3D integration. At the processor tier (near the heat sink),
the cache hierarchy has split private L1 instruction and
data caches for each core. Each core also has a private L2
cache that is kept exclusive to the L1 cache. The STT-RAM
L3 cache (LLC) is logically shared among all the cores while
physically structured as static NUCA and mounted on the
top tier of the 3D die [37, 6, 26]. At 45 nm, the CMP area is
estimated at 60 mm² and has a TDP of 77 W at 2 GHz and
1.1 V supply voltage.
Using McPAT [24], the processor layer in our 3D IC (i.e.,
tier 1) has a 5.1 mm² die area. By assuming an SLC STT-RAM
cell size of 14 F² [37], we derived that a 5 MB SLC cache
can fit in the same die area in tier 2. Since the area of an
STT-RAM cell is dominated by its access transistor and
we use an SLC tag array for our MLC data array, an 8 MB
MLC STT-RAM cache can fit within the 5.1 mm² area. Table 5
summarizes the configuration of the LLC in the proposed
system as well as in three reference configurations: a 5 MB
SLC cache (fast cache), an 8 MB 2-bit MLC cache with
stacked mapping (dense cache), both with the same die area,
and an 8 MB SLC cache (fast-dense cache) with double the area.
The performance of a static STT-RAM NUCA is highly
sensitive to write operations, which can block
subsequent read requests. To alleviate this inefficiency, each
bank has a separate 8-entry Read Queue (RDQ) and 32-
entry Write Queue (WRQ) that queue all pending requests.
A read request to a line pending in the WRQ is serviced by
the WRQ. When a bank is idle and either RDQ or WRQ
(but not both) is non-empty, the oldest request from that
queue is serviced. If both RDQ and WRQ are non-empty,
then a read request is serviced unless the WRQ is more
than 80% full, in which case a write request is serviced.
This ensures that read requests are given priority for service
in the common case, and write requests eventually have a
chance to get serviced.
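The arbitration rule above can be sketched as a small per-bank scheduler. The queue sizes and the 80% threshold come from the text; the class name `BankScheduler`, the method names, and the return conventions are hypothetical and only serve the illustration.

```python
from collections import deque


class BankScheduler:
    RDQ_CAP = 8          # read queue entries per bank
    WRQ_CAP = 32         # write queue entries per bank
    WRQ_DRAIN = 0.8      # writes win once the WRQ is more than 80% full

    def __init__(self):
        self.rdq = deque()
        self.wrq = deque()

    def enqueue_read(self, addr):
        """Return 'forwarded' if the WRQ already holds the line."""
        if addr in self.wrq:
            return "forwarded"           # read serviced by the WRQ
        if len(self.rdq) >= self.RDQ_CAP:
            return "full"
        self.rdq.append(addr)
        return "queued"

    def enqueue_write(self, addr):
        if len(self.wrq) >= self.WRQ_CAP:
            return "full"
        self.wrq.append(addr)
        return "queued"

    def next_request(self):
        """Pick the request to service when the bank is idle."""
        if not self.rdq and not self.wrq:
            return None
        wrq_pressure = len(self.wrq) > self.WRQ_DRAIN * self.WRQ_CAP
        if self.rdq and (not self.wrq or not wrq_pressure):
            return ("R", self.rdq.popleft())   # reads have priority
        return ("W", self.wrq.popleft())       # only writes pending, or WRQ pressure
```

With a 32-entry WRQ, "more than 80% full" means 26 or more pending writes, so a single write is drained as soon as the queue crosses that mark and reads regain priority immediately afterwards.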
4.3 Workloads
For multi-threaded workloads, we use the complete set of
parallel programs in the PARSEC-2 suite [4]. For
multi-program evaluation, we use the SPEC-CPU 2006
benchmarks. We classify a benchmark as memory-intensive if
its L3 cache Misses Per 1K Instruction (MPKI) is greater than
three; otherwise, we refer to it as memory non-intensive. Also,
Table 4: Characteristics of the evaluated workloads.
Multi-Threaded PARSEC-2 Workloads (8 Threads)
  Workload        MPKI         HPKI
  blackscholes    0.78 (L)     37.33 (H)
  bodytrack       0.96 (L)     11.90 (M)
  canneal         15.19 (H)    27.13 (H)
  dedup           3.04 (M)     9.072 (M)
  facesim         10.66 (H)    14.26 (M)
  ferret          7.80 (M)     23.21 (M)
  fluidanimate    5.54 (M)     10.51 (M)
  freqmine        0.51 (L)     7.30 (M)
  raytrace        0.45 (L)     0.92 (L)
  streamcluster   0.51 (L)     5.35 (M)
  swaptions       0.15 (L)     4.34 (M)
  vips            2.24 (M)     15.69 (M)
  x264            1.22 (M)     12.92 (M)
8-Application Multi-Programmed (SPEC-CPU 2006)
  Workload                                              MPKI         HPKI
  MP1: 2 Copies of (xalancbmk,omnetpp,bzip2,mcf)        20.16 (H)    39.51 (H)
  MP2: 2 Copies of (milc,leslie3d,GemsFDTD,lbm)         33.01 (H)    38.69 (H)
  MP3: 2 Copies of (mcf,xalancbmk,GemsFDTD,lbm)         24.23 (H)    37.55 (H)
  MP4: 2 Copies of (mcf,GemsFDTD,povray,perlbench)      14.89 (H)    23.41 (H)
  MP5: 2 Copies of (mcf,xalancbmk,perlbench,gcc)        18.33 (H)    49.41 (H)
  MP6: 2 Copies of (GemsFDTD,lbm,povray,namd)           6.99 (M)     11.68 (M)
  MP7: 2 Copies of (gromacs,namd,dealII,povray)         1.85 (M)     7.94 (M)
  MP8: 2 Copies of (perlbench,gcc,dealII,povray)        5.21 (M)     25.63 (H)
  MP9: 2 Copies of (namd,povray,perlbench,gcc)          2.22 (M)     8.87 (M)
  MP10: 2 Copies of (milc,soplex,bzip2,mcf)             22.94 (H)    38.89 (H)
  MP11: 2 Copies of (parser,gcc,namd,povray)            4.27 (M)     26.39 (H)
Table 5: Evaluated LLC configurations.
8 MB 8-to-16way stripped MLC
  Latency [cycles]:      Lookup: 3, hard-R-hit: 3, soft-R-hit: 5, hard-W-hit: 19, soft-W-hit: 42
  Dynamic Energy [nJ]:   hard-R: 0.34, soft-R: 0.38, hard-W: 1.93, soft-W: 1.28
  Leakage Power [W]:     1.52
5MB 8way SLC
  Latency [cycles]:      Lookup: 1, R-hit: 3, W-hit: 19
  Dynamic Energy [nJ]:   R: 0.32, W: 1.29
  Leakage Power [W]:     0.156
8MB 16way stacked MLC
  Latency [cycles]:      Lookup: 3, R-hit: 5, W-hit: 37
  Dynamic Energy [nJ]:   R: 0.64, W: 1.58
  Leakage Power [W]:     0.152
8MB 16way SLC
  Latency [cycles]:      Lookup: 2, R-hit: 3, W-hit: 19
  Dynamic Energy [nJ]:   R: 0.32, W: 1.29
  Leakage Power [W]:     0.217
we say a benchmark has cache locality if the number of L3
cache Hits Per 1K Instruction (HPKI) for the benchmark is
greater than five. Each benchmark is classified by measuring
the hits and misses when running alone in the 8-core system
given in Table 3. For the multi-program workload selection,
we used eleven 8-core-application workloads that are chosen
such that each workload consists of at least six memory-
intensive applications and two applications with good cache
locality.
Each application is simulated to completion and the results
are taken from the instructions in the parallel region (i.e.,
the region of interest). Regarding the input sets, we use the
Large set for the PARSEC-2 applications and sim-large for
the SPEC-CPU 2006 workloads. All applications are compiled
using ICC (Intel C Compiler) and IFORT (Intel Fortran
Compiler) at the O3 optimization level. Table 4 characterizes
the evaluated workloads based on L3 MPKI and
HPKI for the SLC reference system of Table 3. To help
interpret the evaluation results, our workloads are classified
based on their L3 miss and hit intensity: considering L3
MPKI, each workload is either high-missed (H) if MPKI is
greater than 10, medium-missed (M) if MPKI is between 1
and 10, or low-missed (L) if MPKI is less than 1. Similarly, a
workload is either high-hit (H) if HPKI is greater than 20,
medium-hit (M) if HPKI is between 1 and 20, or low-hit (L) if
HPKI is less than 1.
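The classification rules of this section can be written down directly. The sketch below uses the thresholds stated in the text; the function names are hypothetical, and treating values that fall exactly on a boundary (e.g., MPKI equal to 1 or 10) as "medium" is an assumption, since the text does not say which side the boundaries belong to.

```python
def classify(value, low, high):
    """Return 'H', 'M', or 'L' for a per-kilo-instruction metric."""
    if value > high:
        return "H"
    if value < low:
        return "L"
    return "M"   # assumption: boundary values count as medium


def characterize(mpki, hpki):
    """Apply the Section 4.3 rules to one workload's L3 MPKI and HPKI."""
    return {
        "miss_class": classify(mpki, low=1, high=10),
        "hit_class": classify(hpki, low=1, high=20),
        "memory_intensive": mpki > 3,   # MPKI greater than three
        "cache_locality": hpki > 5,     # HPKI greater than five
    }
```

For example, `characterize(15.19, 27.13)` (the canneal row of Table 4) yields a high-missed, high-hit, memory-intensive workload with cache locality.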
5. EVALUATION RESULTS
Our dynamic stripped cache has a minimum and maxi-
mum associativity of 8 ways and 16 ways, respectively. The
first and second baselines are the SLC and stacked MLC
cache with the same die area and line size, but different asso-
ciativities: 8 ways for SLC baseline and 16 ways for stacked
MLC. The last baseline is a cache built from SLC devices with
the same capacity and associativity as the MLC cache, but with
double the die area. For our dynamic configuration, on the other hand,
a read hit or a write hit can be serviced by either MSB lines
(i.e., MSB read bit or LSB read hit) or LSB lines (i.e., MSB
write hit or LSB write miss).
The proposed cache organization centers around the use
of SLC devices in applications with low misses and MLC
devices in applications with medium and high misses. Ulti-
mately, this mechanism attempts to reduce the miss penalty
by the same measures as an MLC cache. Thus, an upper
bound on the miss reduction for the proposed mechanism
is provided by stacked MLC cache of the same size. Our
cache is expected to approach this upper bound for high
missed and medium missed applications. This upper bound
can result in our scheme outperforming the SLC baseline as
seen in the results. The two other upper bounds for the
proposed cache are determined by read hits at fast read lines
(i.e., FRHE) and write hits at low write-energy lines (i.e.,
SRLE). These second and third upper bounds determine the
reduction in access latency compared to MLC arrays and
are provided by the SLC baseline with double capacity (last
baseline).
5.1 Performance Evaluation
For programs with high and medium L3 MPKI, we
expect a larger effect on latency when increasing cache
associativity. This is also observed in Figure 9, which plots
the CPI improvement for a system with the proposed cache
structure with respect to the studied baselines. For each
benchmark, the results are normalized to the SLC baseline
for ease of comparison. This figure shows an improvement of
up to 29% in CPI of the high-associativity caches (i.e., MLC
cache configurations and the 8 MB SLC cache) with respect to
the 5 MB SLC baseline. Our scheme also outperforms the
8 MB stacked 2-bit cache by 10% on average, thanks to being
able to construct FRHE and SRLE lines without generally
losing the maximum associativity that a set requires.
Comparing the results with the 8 MB SLC cache baseline,
it can be seen that the performance of the system with the
proposed cache structure is within 5% of the maximum
performance observed.
For applications with a low miss ratio in the LLC, one can
see that our cache configuration behaves like the SLC baseline
in most applications. Only in some applications is there
a slight degradation in overall system performance (up to
 0
 0.2
 0.4
 0.6
 0.8
 1
 1.2
 1.4
canneal
facesim
M
P-H
1
M
P-H
2
M
P-H
3
M
P-H
4
M
P-H
5
M
P-H
6
dedup
ferret
fluidanim
ate
vips
x264
M
P-M
1
M
P-M
2
M
P-M
3
M
P-M
4
M
P-M
5
blackscholes
bodytrack
freqm
ine
raytrace
stream
cluster
sw
aptions
G
m
ean
N
o
r m
a l
i z
e d
 I
n
s t
r u
c t
i o
n
P
e r
 C
y
c l
e  
( I
P
C
)
5MB SLC
8MB Stacked MLC
8MB SLC
8MB Stripped MLC
Figure 9: Percentage of IPC improvement for the proposed cache architecture with respect to the baselines. The proposed
architecture has capacity advantage of MLCs in applications with high misses (9 first) programs and medium misses (13
second) programs. It also has the SLC access latency in 8 applications with low misses (at left side).
[Figure 10 plot: Normalized Total Energy Consumption (y-axis, 0 to 1.4) per workload (canneal, facesim, MP-H1 to MP-H6, dedup, ferret, fluidanimate, vips, x264, MP-M1 to MP-M5, blackscholes, bodytrack, freqmine, raytrace, streamcluster, swaptions, Gmean); bars: 5MB SLC, 8MB Stacked MLC, 8MB SLC, 8MB Stripped MLC]
Figure 10: Total energy consumption of the cache architectures normalized to the SLC baselines. It shows that proposed
architecture uses low read and low write energy of FRHE and SRLE lines.
4% in blackscholes), which is due to the higher latency of the
8 MB NUCA access circuit compared to the 5 MB SLC
configuration. In short, as the last group of bars in Figure 9 shows,
the cache architecture with dynamic associativity achieves
10% CPI improvement (on average) for all applications with
different miss ratio behavior.
5.2 Energy Consumption
The percentage of reduction in total memory energy com-
pared to the baseline STT-RAM cache configurations is shown
in Figure 10. This evaluation includes the energy consump-
tions of both the STT-RAM LLC and off-chip main mem-
ory, and uses the energy model given in Table 5. Gener-
ally speaking, compared to the SLC baselines and the MLC
baseline, the percentage reduction in energy consumption
follows the same trend observed for the system performance
improvement. In other words, we can see that the energy
consumption of the proposed scheme is much better than
the SLC baseline in all applications with high and medium
misses due to the higher hit ratio of the on-chip memory
hierarchy. On the other hand, the energy consumption of our
scheme is better than the MLC baseline with the stacked
data-to-cell mapping, as it constructs lines with low write
energy and tries to allocate write-dominated blocks to them.
Also, with respect to the SLC baseline cache, the proposed
cache architecture results in 17% reduction in the total mem-
ory energy on average (and 29% to 46% for programs such
as canneal, facesim, and art with high MPKI).
5.3 Lifetime Evaluation
In this work, we assume that the number of reliable writes
into an SLC STT-RAM cell is limited to 10^12 cycles [37], and
that it is linearly scaled down for 2-bit STT-RAMs (i.e.,
exactly one-tenth).
For the lifetime evaluation, the main memory traces are ex-
tracted from the full-system simulator and are fed into a sim-
ulation tool. To avoid cache failure on wear-out of limited
cells, each cache line is augmented with an ECC correcting
up to 5 faulty bits. After each cache write, the read access
circuit is used to read it out and a cache line (or equally
a physical way) is assumed to be dead if the read-out data
block has more than 5 bit mismatches with the original data.
Finally, the simulator keeps track of the write counts on each
coding set until a cache set has more than four dead physi-
cal ways. We measure this duration and use it to estimate
and analyze the lifetime. The lifetime of the complete cache
architecture is compared with the 5 MB SLC cache and an
8 MB stacked MLC cache in Figure 11. Overall, the pro-
posed scheme provides a lifetime larger than 70% of an SLC
cache with identical ECC strength.
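The failure criteria used in this lifetime evaluation can be expressed as two small checks. This is only a sketch of the stated rules (a line is dead with more than 5 mismatched bits, and the cache fails once a set has more than four dead physical ways); the function names and the byte-string representation of a cache line are hypothetical, and the wear model that produces the faulty read-out data is not modeled here.

```python
ECC_CORRECTABLE_BITS = 5   # ECC corrects up to 5 faulty bits per line
MAX_DEAD_WAYS = 4          # a set fails with more than four dead ways


def count_bit_mismatches(written: bytes, read_back: bytes) -> int:
    """Count differing bits between the written and the read-out line."""
    if len(written) != len(read_back):
        raise ValueError("lines must have the same length")
    return sum(bin(a ^ b).count("1") for a, b in zip(written, read_back))


def line_is_dead(written: bytes, read_back: bytes) -> bool:
    """A line (physical way) is dead if ECC can no longer repair it."""
    return count_bit_mismatches(written, read_back) > ECC_CORRECTABLE_BITS


def set_has_failed(dead_ways) -> bool:
    """dead_ways is an iterable of booleans, one per physical way."""
    return sum(1 for dead in dead_ways if dead) > MAX_DEAD_WAYS
```

In a trace-driven simulator, `line_is_dead` would be evaluated after every write, and the simulated time at which `set_has_failed` first returns true for any set would be taken as the cache lifetime.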
6. RELATED WORK
In this section, we go over the most relevant pieces of work
on STT-RAM cache memories as well as cache associativity.
6.1 STT-RAM Cache Memories
Owing to real concerns of SRAM power in nanometer
regime, resistive memories such as STT-RAM are presented
[Figure 11 plot: Normalized Lifetime (y-axis, 0 to 1.2) per workload (canneal, facesim, MP-H1 to MP-H6, dedup, ferret, fluidanimate, vips, x264, MP-M1 to MP-M5, blackscholes, bodytrack, freqmine, raytrace, streamcluster, swaptions, Gmean); bars: 5MB SLC, 8MB Stacked MLC, 8MB Stripped MLC]
Figure 11: Lifetime of the cache configurations normalized to the SLC baselines. Our proposed stripped scheme tries to act
like SLCs to reach maximum lifetime.
to offer a highly-scalable low-leakage alternative for large
cache arrays. Compared to competitive non-volatile mem-
ories (such as ReRAM, PcRAM and FeRAM), STT-RAM
benefits from the best attributes of fast nanosecond access
time, CMOS process compatibility, high density, and better
write endurance. Dong et al. give a detailed circuit-level
comparison between SRAM cache and STT-RAM cache in
a single-core microprocessor [12].
Based on the findings given in this study, Sun et al. ex-
tended the application of STT-RAM to NUCA cache sub-
strate in CMPs and studied the impact of the costly write
operation in STT-RAM on power and performance [25]. To
address the slow write speed and high write energy of STT-
RAM, many proposals have been made in recent years. Zhou
et al. proposed an early write termination scheme that uses
write current as a read bias current and cancels an ongo-
ing write if the write data is unnecessary [45]. Alternative
approaches are SRAM/STT-RAM hybrid cache hierarchies
and some enhancements, such as write buffering [37], data
migration [37, 17, 42], and data encoding [2]. As a cache
solution with uniform technology, previous work proposed
to trade off the non-volatility of STT-RAM for write perfor-
mance and power improvement [35, 38, 19]. To ensure data
integrity in these architectures, some DRAM-style refresh
schemes are introduced which may not scale well for large
cache capacities.
Regarding MLC STT-RAM, Chen et al. proposed a dense
cache architecture using devices with parallel MTJs [11].
Although they use the MLCs with parallel MTJs to have
lower write power (compared to series MTJs) suitable for
cache [10], a reliability comparison of these two devices shows
that parallel devices confront serious challenges in nanome-
ter technologies with large process variations [44]. Finally,
some recent proposals studied the effect of decoupling bits of
an MLC device (STT-RAM [18] or PcRAM [43]) in perfor-
mance, energy, and reliability improvement of non-volatile
memories.
6.2 Reducing Conflict Misses in Caches
There are several prior works targeting conflict misses in
set-associative caches, which can be generally categorized as
follows:
Category 1: Hashing block address – Instead of us-
ing a subset of the address bits to index the cache, one can
use a better hash function on the address to reduce con-
flict misses by spreading out accesses. Although hashing
relaxes the miss pressure on some sets with large working
set [22], it slightly increases access latency as well as area
and power overheads due to this additional circuitry. More-
over, it needs to store full block address in tag which adds
to tag store overheads.
Category 2: Skew-associative caches – Skew-associ-
ative cache [34] indexes each way with a different hash func-
tion. Then, a block address conflicts with a fixed set of
blocks, but those blocks conflict with other addresses. This
spreads out the conflicts. Based on skew-associative design,
ZCache [33] decouples the notion of ways and associativity.
Ways represent the number of tags that must be searched
when looking up a cache line, while associativity refers
to the number of blocks that can be evicted to make room
for an incoming line. The ZCache keeps the number of
ways small, but it has a large associativity. Although skew-
associative caches typically exhibit lower conflict misses than
a set-associative cache with the same number of ways [7],
they break the concept of a set, so they cannot use replace-
ment policy implementations that rely on set ordering.
Category 3: Mapping multiple locations to one
physical way – Column-associative caches [1] extend the
concept of direct-mapped caches to allow a block to reside
in two locations based on a primary and a secondary hash
functions. In this approach, tag lookup is a two-step mech-
anism. First, it uses the primary hash function to search for
a block, and second, it uses the secondary function on the
cache miss. To improve access latency, a hit on the second
step triggers a swapping logic to reorder first and second
search functions. Similar techniques predict which location
should be searched first [9], and there are also schemes that
try merging the less used sets with more used ones to balance
utilization of the sets [32].
The main shortcomings of these approaches are the vari-
able hit latency, reduced cache bandwidth due to multiple
lookups, and additional energy required to do swaps on hit
accesses.
Category 4: Victim cache – A victim cache is a fully-
associative small cache that keeps blocks evicted from the
main cache for future reuse [21]. It can avoid conflict
misses on blocks that are re-referenced after a short period.
However, due to its limited capacity, it is not particularly
useful when the number of sets with many local misses is
large [8]. Inspired by this scheme, Scavenger [3]
[Figure 12 plot: % of Reduction in Miss Rate (y-axis, 0 to 25) per workload category (Low-Missed, Medium-Missed, High-Missed, Gmean); bars: Our-Proposal, V-Way, Scavenger, SBC]
Figure 12: Comparison of our proposed cache with V-
Way [30], Scavenger [3], and SBC [32] caches in terms of
the percentage reduction in cache misses relative to the SLC
cache in previous configuration. Note that cache sizes are
set to have same die area. This figure shows that our tech-
nique is better than its counterparts in miss ratio, especially
when the application requires large associativity.
divides cache space into two equally large parts, a conven-
tional set-associative cache and a victim cache organized as a
heap. Although victim cache results in better performance
in applications with moderate miss ratios, it suffers from
the additional latency and energy consumption needed for
checking the victim cache, regardless of the hit or miss on
the victim cache.
Category 5: Employing pointer-like tag array –
An alternative strategy is to implement tag and data arrays
separately, making the tag array highly associative, and us-
ing it as pointers to an array of data blocks. Examples
in this line are Indirect Index Cache (IIC) [15] and V-Way
cache [30]. IIC implements the tag array as a hash table
using open-chained hashing for high associativity. The V-
Way cache, on the other hand, implements a conventional
set-associative tag array, but makes it larger than the data
array to reduce conflict misses. Tag indirection schemes suffer
from two problems. First, they usually increase hit latency,
as they have to serialize tag lookup and data access. Sec-
ond, the tag array overheads in these two schemes are large
(around 2×), and may not be acceptable for large cache ar-
rays.
6.2.1 Comparative Analysis
Several proposals increase cache associativity relying on
techniques requiring either heaps [3], hash table [15] or pre-
diction mechanism [9]. This may increase energy and latency
of cache hits and the resulting cache design may be much
more complex than conventional cache arrays. On the other
hand, this paper focuses on multi-bit capability of cutting-
edge STT-RAM technology and proposes a workload-aware
per-set associativity regulation using its SLC to MLC (and
vice versa) shape-shifting property.
Here, we compare the proposed MLC STT-RAM cache
system against state-of-the-art works on cache associativ-
ity applied to the same platform (i.e., STT-RAM cache).
We use three schemes for comparison, including the V-Way
cache [30], the Scavenger cache [3], and the dynamic SBC
cache [32]. For fair analysis, these three approaches are ap-
plied to SLC with the same die size. Figure 12 compares the
LLC miss rates of these schemes for the configuration used
previously (Table 3). The results are shown as an average
reduction in miss rate for different workloads categories in
Table 4, i.e., workloads with low LLC miss rate, workloads
with moderate miss rate, and workloads with high miss rate.
The results are normalized to the miss rate of the baseline
SLC configuration. We can observe that the results vary
between the 12.1% reduction for our proposed solution and the
8.6% reduction for Scavenger. SBC achieves a 10.8%
reduction, better than the 9% obtained by V-Way. More
precisely, the proposed cache is the best one in the high- and
medium-missed categories and SBC is the best for workloads
with few misses. We must take into account that the
V-Way cache turns misses into hits, while the other three
turn them into secondary hits, which suffer the delay of a
second access to the tag array. On the other hand, the du-
plication of the tag-store entries, the addition of one pointer
to each entry and a mux to choose the correct pointer in-
creases the V-Way tag access time by around 39%, while our
solution along with SBC involves very light structures, thus
having a negligible impact on access time.
7. CONCLUSION
The emerging technology of SLC STT-RAM has been
shown to be a promising candidate for building large last-
level caches. The natural next step would be to use MLC
STT-RAM, but its advantage in doubling the storage density
comes with a number of serious shortcomings in terms
of lifetime, performance, and energy consumption. In this
paper, we have shown that, by operating MLC STT-RAM
in SLC mode when the additional density is not required,
one can achieve the best of both worlds and improve perfor-
mance and energy with only a minimal impact on lifetime.
This improvement requires more than a naive shut down
of unused ways in the cache (which are thus used in SLC
mode instead of MLC mode) and we have shown how one
should actively migrate data across physical ways to max-
imize the benefits of this technique. This work shows that
emerging memory technologies can be efficiently accommo-
dated in traditional memory technologies, but they require
some new techniques for the integration to be successful.
8. REFERENCES
[1] A. Agarwal and S. D. Pudar. Column-associative
caches: a technique for reducing the miss rate of
direct-mapped caches. In ISCA, pages 179–190, May
1993.
[2] M. Arjomand, A. Jadidi, and H. Sarbazi-Azad.
Relaxing writes in non-volatile processor cache using
frequent value locality. In DAC, June 2012.
[3] A. Basu, N. Kirman, M. Kirman, M. Chaudhuri, and
J. Martinez. Scavenger: A new last level cache
architecture with global block priority. In MICRO,
pages 421–432, December 2007.
[4] C. Bienia and K. Li. PARSEC 2.0: A new benchmark
suite for chip-multiprocessors. In MoBS, June 2009.
[5] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt,
A. Saidi, A. Basu, J. Hestness, D. R. Hower,
T. Krishna, S. Sardashti, R. Sen, K. Sewell,
M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The
Gem5 simulator. SIGARCH Computer Architecture
News, 39(2):1–7, Aug. 2011.
[6] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale,
L. Jiang, G. H. Loh, D. McCaule, P. Morrow, D. W.
Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar,
J. Shen, and C. Webb. Die stacking (3D)
microarchitecture. In MICRO, pages 469–479,
December 2006.
[7] F. Bodin and A. Seznec. Skewed associativity
enhances performance predictability. In ISCA, pages
265–274, June 1995.
[8] M. W. Brehob. On the mathematics of caching. Ph.D.
dissertation, Michigan State University, 2003.
[9] B. Calder, D. Grunwald, and J. Emer. Predictive
sequential associative cache. In HPCA, pages 244–253,
February 1996.
[10] Y. Chen, X. Wang, W. Zhu, H. Li, Z. Sun, G. Sun, and
Y. Xie. Access scheme of multi-level cell spin-transfer
torque random access memory and its optimization. In
MWSCAS, pages 1109–1112, August 2010.
[11] Y.-T. Chen, J. Cong, H. Huang, C. Liu, R. Prabhakar,
and G. Reinman. Static and dynamic co-optimizations
for blocks mapping in hybrid caches. In ISLPED,
pages 237–242, March 2012.
[12] X. Dong, X. Wu, G. Sun, Y. Xie, H. Li, and Y. Chen.
Circuit and microarchitecture evaluation of 3D
stacking magnetic ram (MRAM) as a universal
memory replacement. In DAC, pages 554–559, June
2008.
[13] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi. NVSim: A
circuit-level performance, energy, and area model for
emerging nonvolatile memory. IEEE TCAD,
31(7):994–1007, July 2012.
[14] A. Driskill-Smith. Latest advances and future
prospects of STT-RAM. In NVM Workshop, 2010.
[15] E. G. Hallnor and S. K. Reinhardt. A fully associative
software-managed cache design. In ISCA, pages
107–116, June 2000.
[16] T. Ishigaki, T. Kawahara, R. Takemura, K. Ono,
K. Ito, H. Matsuoka, and H. Ohno. A multi-level-cell
spin-transfer torque memory with series-stacked
magnetotunnel junctions. In VLSIT, pages 47–48,
June 2010.
[17] A. Jadidi, M. Arjomand, and H. Sarbazi-Azad.
High-endurance and performance-efficient design of
hybrid cache architectures through adaptive line
replacement. In ISLPED, 2011.
[18] L. Jiang, B. Zhao, Y. Zhang, and J. Yang.
Constructing large and fast multi-level cell
STT-MRAM based cache for embedded processors. In
DAC, pages 907–912, June 2012.
[19] A. Jog, A. K. Mishra, C. Xu, Y. Xie, V. Narayanan,
R. Iyer, and C. R. Das. Cache revive: architecting
volatile STT-RAM caches for enhanced performance
in CMPs. In DAC, pages 243–252, June 2012.
[20] M. R. Jokar, M. Arjomand, and H. Sarbazi-Azad.
Sequoia: A high-endurance NVM-based cache
architecture. IEEE TVLSI, 24(3):954–967, 2016.
[21] N. Jouppi. Improving direct-mapped cache
performance by the addition of a small
fully-associative cache and prefetch buffers. In ISCA,
pages 364–373, June 1990.
[22] M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee. Using
prime numbers for cache indexing to eliminate conflict
misses. In HPCA, February 2004.
[23] C. Kim, D. Burger, and S. W. Keckler. An adaptive,
non-uniform cache structure for wire-delay dominated
on-chip caches. In ASPLOS, pages 211–222, October
2002.
[24] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M.
Tullsen, and N. P. Jouppi. McPAT: an integrated
power, area, and timing modeling framework for
multicore and manycore architectures. In MICRO,
pages 469–480, December 2009.
[25] Sun Microsystems. ULTRASPARC T2 supplement to
the ULTRASPARC architecture 2007. Tech. rep.,
2007.
[26] A. K. Mishra, X. Dong, G. Sun, Y. Xie,
N. Vijaykrishnan, and C. R. Das. Architecting on-chip
interconnects for stacked 3D STT-RAM caches in
CMPs. In ISCA, pages 69–80, June 2011.
[27] N. Muralimanohar and R. Balasubramonian.
Interconnect design considerations for large NUCA
caches. In ISCA, pages 369–380, June 2007.
[28] N. Muralimanohar, R. Balasubramonian, and
N. Jouppi. Optimizing NUCA organizations and
wiring alternatives for large caches with CACTI 6.0.
In MICRO, pages 3–14, December 2007.
[29] J. T. Pawlowski. Hybrid memory cube (HMC). In
IEEE Hot Chips Symposium, pages 1–24, 2011.
[30] M. Qureshi, D. Thompson, and Y. Patt. The V-Way
cache: Demand-based associativity via global
replacement. In ISCA, pages 544–555, June 2005.
[31] M. K. Qureshi, M. M. Franceschini, L. A.
Lastras-Montaño, and J. P. Karidis. Morphable
memory system: A robust architecture for exploiting
multi-level phase change memories. In ISCA, pages
153–162, 2010.
[32] D. Rolan, B. Fraguela, and R. Doallo. Adaptive line
placement with the set balancing cache. In MICRO,
pages 529–540, December 2009.
[33] D. Sanchez and C. Kozyrakis. The ZCache:
Decoupling ways and associativity. In MICRO, pages
187–198, December 2010.
[34] A. Seznec. A case for two-way skewed-associative
caches. In ISCA, pages 169–178, May 1993.
[35] C. W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi,
and M. R. Stan. Relaxing non-volatility for fast and
energy-efficient STT-RAM caches. In HPCA, pages
50–61, February 2011.
[36] C. Spradling. SPEC CPU2006 benchmark tools. ACM
CAN, 35(1):130–134, March 2007.
[37] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen. A novel
architecture of the 3D stacked MRAM L2 cache for
CMPs. In HPCA, February 2009.
[38] Z. Sun, X. Bi, H. H. Li, W.-F. Wong, Z.-L. Ong,
X. Zhu, and W. Wu. Multi retention level STT-RAM
cache designs with a dynamic refresh scheme. In
MICRO, pages 329–338, December 2011.
[39] J. Wang, X. Dong, Y. Xie, and N. P. Jouppi. i2WAP:
improving non-volatile cache lifetime by reducing
inter- and intra-set write variations. In HPCA, pages
234–245, 2013.
[40] M. Wang, S. Peng, Y. Zhang, Y. Zhang, Y. Zhang,
Q. Zhang, D. Ravelosona, and W. Zhao.
Demonstration of multilevel cell spin transfer
switching in MgO magnetic tunnel junctions. Applied
Physics Letters, 93(24):242502, December 2008.
[41] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and
Y. Xie. Hybrid cache architecture with disparate
memory technologies. In ISCA, pages 34–45, 2009.
[42] X. Wu, J. Li, L. Zhang, E. Speight, and Y. Xie. Power
and performance of read-write aware hybrid caches
with non-volatile memories. In DATE, pages 737–742,
March 2009.
[43] H. Yoon, J. Meza, N. Muralimanohar, N. P. Jouppi,
and O. Mutlu. Efficient data mapping and buffering
techniques for multilevel cell phase-change memories.
ACM TACO, 11(4):40:1–40:25, Dec. 2014.
[44] Y. Zhang, L. Zhang, W. Wen, G. Sun, and Y. Chen.
Multi-level cell STT-RAM: Is it realistic or just a
dream? In ICCAD, pages 526–532, November 2012.
[45] P. Zhou, B. Zhao, J. Yang, and Y. Zhang. Energy
reduction for STT-RAM using early write termination.
In ICCAD, pages 264–268, November 2009.
