Exploiting Inter- and Intra-Memory Asymmetries for Data Mapping in
  Hybrid Tiered-Memories by Song, Shihao et al.
Exploiting Inter- and Intra-Memory Asymmetries for
Data Mapping in Hybrid Tiered-Memories
Shihao Song
shihao.song@drexel.edu
Drexel University
Philadelphia, PA, USA
Anup Das
anup.das@drexel.edu
Drexel University
Philadelphia, PA, USA
Nagarajan Kandasamy
nk78@drexel.edu
Drexel University
Philadelphia, PA, USA
Abstract
Modern computing systems are embracing hybrid memory
comprising DRAM and non-volatile memory (NVM) to
combine the best properties of both memory technologies,
achieving low latency, high reliability, and high density. A
prominent characteristic of DRAM-NVM hybrid memory is
that it has NVM access latency much higher than DRAM
access latency. We call this inter-memory asymmetry. We
observe that parasitic components on a long bitline are a
major source of high latency in both DRAM and NVM, and
a significant factor contributing to high-voltage operations
in NVM, which impact their reliability. We propose an ar-
chitectural change, where each long bitline in DRAM and
NVM is split into two segments by an isolation transistor.
One segment can be accessed with lower latency and oper-
ating voltage than the other. By introducing tiers, we enable
non-uniform accesses within each memory type (which we
call intra-memory asymmetry), leading to performance and
reliability trade-offs in DRAM-NVM hybrid memory.
We show that our hybrid tiered-memory architecture has
tremendous potential to improve performance and reliability
if exploited by an efficient page management policy at
the operating system (OS). Modern OSes are already aware
of inter-memory asymmetry. They migrate pages between
the two memory types during program execution, starting
from an initial allocation of the page to a randomly-selected
free physical address in the memory. We extend existing OS
awareness in three ways. First, we exploit both inter- and
intra-memory asymmetries to allocate and migrate memory
pages between the tiers in DRAM and NVM. Second, we im-
prove the OS’s page allocation decisions by predicting the
access intensity of a newly-referenced memory page in a
program and placing it in a matching tier during its initial
allocation. This minimizes page migrations during program
execution, lowering the performance overhead. Third, we
propose a solution to migrate pages between the tiers of
the same memory without transferring data over the memory
channel, minimizing channel occupancy and improving
performance. Our overall approach, which we call MNEME,
enables and exploits asymmetries in DRAM-NVM hybrid
tiered memory, improving both performance and reliability
for both single-core and multi-programmed workloads.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from permissions@acm.org.
ISMM ’20, June 16, 2020, London, UK
© 2020 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
CCS Concepts: • Computer systems organization → Processors
and memory architectures; • Hardware → Memory
and dense storage; Aging of circuits and systems;
• Software and its engineering → Main memory; Virtual
memory.
Keywords: phase change memory (PCM), DRAM, tiered
memory, bitline parasitic, hybrid memory, non volatile mem-
ory (NVM), NBTI, endurance
ACM Reference Format:
Shihao Song, Anup Das, and Nagarajan Kandasamy. 2020. Exploit-
ing Inter- and Intra-Memory Asymmetries for Data Mapping in
Hybrid Tiered-Memories. In Proceedings of the 2020 ACM SIG-
PLAN International Symposium on Memory Management (ISMM
’20), June 16, 2020, London, UK. ACM, New York, NY, USA, 15 pages.
https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 Introduction
DRAM has long been the choice substrate for architecting
main memory subsystems due to its low cost per bit. How-
ever, DRAM is a fundamental performance and energy bot-
tleneck in almost all computer systems [53, 56, 82, 84], and is
experiencing significant technology scaling challenges [35,
53–55]. DRAM-compatible emerging non-volatile memory
(NVM) technologies such as Flash [13], oxide-based RAM
(OxRAM) [52], phase-change memory (PCM) [83], and spin
transfer torque magnetic RAM (STT-MRAM) [4] can address
some of these challenges [41, 44, 47, 61, 67, 74, 75]. How-
ever, they are usually slower than DRAM and have limited
endurance.¹ Modern computing systems are therefore embracing
hybrid memory designs comprising DRAM and
NVM [5]. These systems combine the best properties of both
memory technologies to improve latency, reliability, capacity,
and cost. The non-volatile 3D XPoint memory [11] is one
example of a hybrid memory, with DRAM and NVM connected
to separate channels, interfacing with a multi-core
CPU chip using JEDEC’s new NVDIMM specification [1].
The IBM POWER9 architecture [69] is another example, which
uses embedded DRAM (eDRAM) as a write cache to NVM-based
main memory. Figure 1 illustrates both of these hybrid
architectures, and we evaluate them in Section 6.
¹ NVM’s endurance ranges from 10^5 writes for Flash to 10^10 writes for
OxRAM, with PCM in between at 10^7 writes.
Figure 1. DRAM-NVM hybrid memory architecture of (a)
3D XPoint Memory [11] and (b) IBM POWER9 [69].
Modern operating systems (OSes) such as Nimble [86] are
already aware of the performance and reliability asymmetry
in hybrid memory. They migrate write-intensive pages to
DRAM (which has practically infinite endurance) and read-
intensive pages to NVM (which has a read latency compa-
rable to DRAM), starting from an initial random page place-
ment [9]. There are two key limitations in these OSes. First, if
pages are not placed in their matching memory (i.e., NVM or
DRAM) at their initial allocation, they can incur significant
performance and energy overhead during program execution
due to the high bank and channel occupancy in moving
page data between the two memories. Second, limited write
endurance is not the only reliability issue in NVM. In fact, a
recent study has shown that even read accesses can lead to
high-voltage-related aging (another key reliability issue) in
an NVM’s peripheral circuit [7].
Our objective is to improve performance and reliability
(both endurance and aging) of DRAM-NVM hybrid memory.
We achieve this goal by exploiting the following three major
observations in this paper.
Observation 1: A significant number of pages are mi-
grated more than once during a program execution.
Figure 2 plots the fraction of memory pages with no mi-
gration, exactly one migration, and more than one migration
using Nimble’s dynamic page migration policy for the evalu-
ated workloads, which are detailed in Section 5.
We observe a wide variation in behavior across these pro-
grams. For instance, over 93% of memory pages in perlbench
are migrated at most once, whereas 95% of all pages in roms
are migrated more than once. On average, 67% of memory
pages in these programs suffer more than one migration
during their execution. These migrations lead to high energy
and performance overhead.
[Figure 2 plot. x-axis: workloads (blender, bwaves, cactuBSSN, cam4, gcc, imagick, nab, namd, omnetpp, perlbench, povray, roms, wrf, xalancbmk, xz, AVERAGE); y-axis: memory pages (%), 0 to 100; legend: with no migration, one migration, more than one migration.]
Figure 2. Fraction of memory pages that suffer no migration,
exactly one migration, and more than one migration using
Nimble for our evaluated workloads.
Observation 1 leads to our first key idea that if a memory
page is placed in a matching memory during its initial allo-
cation, many of these migrations can be eliminated, leading
to performance and energy improvements (Section 6). This
idea also leads to our next observation.
Observation 2: There are typically only a few first-touch
instructions (FTIs) in a program and only a small percentage
of these instructions induce the most memory accesses.
Modern OSes implement a first-touch page allocation policy,
where a virtual-to-physical address translator allocates a
random physical memory page from the free pool to a virtual
page address when the virtual page is first touched by a
memory instruction in the program. We call this memory
instruction a first-touch instruction (FTI).
Figure 3 plots the number of FTIs and referenced pages per
billion instructions of the evaluated workloads. We report
total FTIs (outer first bar in each set) and the number of FTIs
that touch pages which serve over 90% of memory accesses
(inner first bar). We also report the pages referenced per
billion instructions (second bar). We observe that 1) there are
very few FTIs per billion instructions of each workload, and
2) on average, only 17% of FTIs in a program touch pages
which serve over 90% of memory accesses.
[Figure 3 plot. x-axis: workloads (blender, bwaves, cactuBSSN, cam4, gcc, imagick, nab, namd, omnetpp, perlbench, povray, roms, wrf, xalancbmk, xz); left y-axis: FTIs recorded per billion instructions, 0 to 100; right y-axis: pages referenced per billion instructions, 0 to 15K; legend: total FTIs, access-inducing FTIs, total pages. Bar labels (total FTIs, access-inducing FTIs), in the order extracted: (30,4), (5,1), (63,5), (64,8), (58,7), (3,1), (11,2), (14,2), (17,3), (20,2), (10,2), (37,7), (69,10), (17,2), (6,2).]
Figure 3. First-touch instructions (FTIs) per billion instruc-
tions of the evaluated workloads.
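The 17% figure can be checked against the bar labels of Figure 3. The sketch below is a minimal check, assuming that each label reads as (total FTIs, access-inducing FTIs) and that the reported average is the mean of the per-workload ratios; both readings are our assumptions, and the names `FTI_LABELS` and `mean_access_inducing_share` are illustrative only.

```python
# (total FTIs, access-inducing FTIs) per billion instructions, copied from the
# data labels of Figure 3 in the order they appear. The pairing with workload
# names is not needed to compute the average.
FTI_LABELS = [
    (30, 4), (5, 1), (63, 5), (64, 8), (58, 7),
    (3, 1), (11, 2), (14, 2), (17, 3), (20, 2),
    (10, 2), (37, 7), (69, 10), (17, 2), (6, 2),
]


def mean_access_inducing_share(labels):
    """Average, over workloads, of access-inducing FTIs / total FTIs."""
    shares = [hot / total for total, hot in labels]
    return sum(shares) / len(shares)


share = mean_access_inducing_share(FTI_LABELS)
print(f"mean share of access-inducing FTIs: {share:.1%}")
```

Under these assumptions the mean share rounds to 17%, which agrees with the average stated in the text.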
Observation 2 leads to our second key idea of profiling
FTIs based on the number of accesses to memory pages they
touch and using it to predict the access intensity of a newly-
referenced memory page, thereby placing it in a matching
memory during its initial allocation.
Observations 1 and 2 are related to OS-based page manage-
ment in DRAM-NVM hybrid memory. Our final observation
is related to the internal architecture of memory.
Observation 3: Parasitic components on a long bitline are
a major source of high latency in both DRAM and NVM, and
a significant factor contributing to high-voltage operations in
NVM, which impact reliability.
This is observed and reported for DRAM [49]. We expand
this observation for NVM, where each bit is represented by
the resistance of a cell (low resistance represents logic ‘1’ and
high resistance logic ‘0’). An NVM cell’s resistance is read
or programmed by driving current through the cell using
a peripheral circuit, which consists of sense amplifiers (to
read) and write drivers (to write). We analyze the internal
architecture of an NVM bank and find that its peripheral cir-
cuit is several orders of magnitude larger than the size of an
NVM cell [17, 28, 70, 74].² To amortize this large size, memory
designers connect a peripheral circuit to many (typically
4096) NVM cells through a wire called a bitline. Numerous
bitlines are laid in parallel to form an NVM array (called a
tile). A row of NVM cells is called a wordline.
Figure 4 (a) illustrates an NVM tile with bitlines and word-
lines. Many such tiles make a partition (see our simulation
parameters in Table 4). Figure 4(b) illustrates the lumped RC
circuit of a bitline to model its parasitic components. The
voltage drop (called the IR drop) on the bitline parasitic needs
to be compensated by a peripheral circuit to access the NVM
cells on its bitline. As we can see from this figure, the farther a
cell is from the peripheral circuit, the higher the IR drop.
Figure 4. (a) An NVM tile with bitlines and wordlines and
(b) Lumped RC model of a bitline.
Figure 5 plots the design analysis performed while architecting
main memory. We show such analysis for PCM, an
emerging NVM, based on Micron’s 45nm design [8].³ The
left y-axis plots the IR drop on a bitline for SET, RESET, and
READ operations as a function of the number of bitline cells.
The right y-axis plots the normalized cost per bit. We observe
that the normalized cost decreases as the number of bitline
cells increases. However, the higher the number of bitline cells,
the higher the IR drop. The trade-off point is typically set to
4096 cells in most PCM designs [8, 51, 66, 79]. Similar analysis
conducted on Micron’s 45nm DRAM design suggests
that 512 cells per bitline in a DRAM subarray gives the best
latency and cost trade-offs [26, 49]. In this paper, we assume,
without loss of generality, that each bitline in NVM and DRAM
contains 4096 and 512 cells, respectively.
² NVM, like DRAM, is organized hierarchically. An example NVM of 128GB
capacity can have 4 channels, with 4 ranks per channel, and 8 banks per rank.
A bank can have 8 partitions, which are similar to subarrays in DRAM [38].
³ We expect the values to be of similar orders of magnitude for other designs.
[Figure 5 plot. x-axis: cells per bitline in a PCM partition, 1000 to 8000; left y-axis: IR drop (V), 0 to 10, for SET, RESET, and READ; right y-axis: normalized cost per bit, 0.25 to 1.00.]
Figure 5. Architecting the number of PCM cells per bitline.
Table 1 reports the biasing voltage needed to access the
nearest cell (1st cell), the farthest cell (4096th cell) and an
intermediate cell (512th) on a bitline in PCM. Higher voltages
are needed to access cells that are farther from their periph-
eral circuit. This has two implications. First, cells that are
nearer to their peripheral circuit can be accessed faster. This
is because the on-chip voltage regulator, which supplies the
biasing voltage for a peripheral circuit, has faster response
time and higher energy efficiency to generate lower voltages.
Second, operating a peripheral circuit at a lower voltage in-
curs lower circuit aging, which improves reliability (see our
reliability formulation in Section 4.1.2). We conclude that
PCM (and in general, NVM) has asymmetric latency and
reliability in accessing its content.
Table 1. PCM’s biasing voltages.
Cell Op.   Bias voltage
           Nearest cell   Farthest cell   Intermediate cell
           (1st cell)     (4096th cell)   (512th cell)
SET        2.1 V          3.7 V           2.3 V
RESET      6.8 V          7.1 V           6.9 V
READ       0.96 V         2.85 V          1.2 V
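To make the voltage asymmetry in Table 1 concrete, the sketch below computes how much the worst-case bias voltage drops when accesses are confined to the first 512 cells of a bitline instead of all 4096. The 512-cell cutoff is taken from the intermediate column of the table; treating it as the boundary of a shorter segment is our assumption for illustration, and `BIAS_V` and `worst_case_reduction` are hypothetical names.

```python
# Bias voltages in volts, copied from Table 1.
BIAS_V = {
    "SET":   {"nearest": 2.1,  "intermediate": 2.3, "farthest": 3.7},
    "RESET": {"nearest": 6.8,  "intermediate": 6.9, "farthest": 7.1},
    "READ":  {"nearest": 0.96, "intermediate": 1.2, "farthest": 2.85},
}


def worst_case_reduction(op, near_limit="intermediate"):
    """Fractional drop in worst-case bias if accesses stop at near_limit."""
    volts = BIAS_V[op]
    return 1.0 - volts[near_limit] / volts["farthest"]


for op in BIAS_V:
    print(f"{op}: {worst_case_reduction(op):.1%} lower worst-case bias")
```

With the table's values, the reduction is largest for READ and smallest for RESET, whose bias voltage is dominated by the programming requirement rather than by the bitline position.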
Observation 3 leads to our third key idea of introducing
an isolation transistor on each bitline in DRAM and NVM,
to allow its length to appear shorter when accessing cells
nearer to its peripheral circuit, thereby achieving low latency
in DRAM and NVM and additionally, high reliability in NVM.
Segmented bitlines create latency and reliability asymmetry,
i.e., tiers within both DRAM and NVM.
We introduce MNEME⁴, a mechanism that builds on the
three ideas above to enable additional tiers in hybrid memory
and exploit these tiers through an efficient OS-level page
allocation policy. MNEME places a newly-referenced memory
page to the best tier during its initial allocation, minimizing
channel and bank occupancy associated with migration of
page data during execution. Through MNEME, we make the
following key contributions.
• We introduce a new memory architecture with segmented
bitlines within DRAM and NVM to improve
performance and reliability during data accesses.
• We propose an approach to predict access intensity
of a newly-referenced memory page using the page’s
first-touch instruction (FTI) and place it in a matching
memory tier during its initial allocation, reducing page
migrations during program execution.
• We develop our FTI-based page allocation for the entire
program duration to adapt to and make correct allocation
decisions for different phases of execution in a
program with potentially distinct working sets.
• We show how to reduce channel occupancy during page
migration between tiers of the same memory, thereby
improving performance.
• We introduce an efficient hardware implementation of
MNEME using Bloom filters.
• We implement MNEME for two hybrid memory architectures
and also for commodity DRAM-based mainstream
architecture, and show significant performance
and reliability improvements for both single-program
and multi-programmed workloads.
⁴ In Greek mythology, MNEME is the muse of memory. MNEME means persistent
effect of memory of past events, which are the first-touch instructions
in the context of this paper.
2 New Segmented Bitline Architecture of
MNEME
Figure 6(a) shows the proposed segmented bitline architec-
ture, where each long bitline in DRAM and NVM is split
into two segments using an isolation transistor: the segment
connected directly to the peripheral circuit is called the near
segment, whereas the other is called the far segment. Cells
in the near segment can be accessed faster using lower bias
voltages due to the reduced parasitic on the current path (see
Figure 6(b)). This improves performance. Additionally, by
using lower voltages, circuit aging when accessing the near
segment is minimized, which improves reliability. Aging-
related reliability is particularly critical for NVM, which
requires higher operating voltages than DRAM [75].
Figure 6. (a) Bitlines partitioned into segments. (b) Access-
ing a near segment cell. (c) Accessing a far segment cell.
However, the isolation transistor increases access latency
of the far segment and introduces its ON resistance in the
current path, which imposes additional bias requirement for
the peripheral circuit (see Figure 6(c)). This lowers reliability
of the peripheral circuit in accessing the far segment.
Segmented bitlines were previously proposed for DRAM subarrays
[49], where they lead to performance trade-offs in accessing
near versus far segments. We propose segmented
bitlines for DRAM-NVM hybrid memory, which introduces
reliability trade-offs in addition to the performance ones.
To evaluate the performance improvement using this new
memory architecture, Figure 7 plots the execution time of
15 workloads (see Section 5) on a hybrid memory with seg-
mented bitlines. Results are normalized to the execution time
of a Baseline design, where bitlines are not segmented. We
observe that simply introducing memory tiers by creating
segments in each bitline is just not enough to guarantee
performance improvement; in fact, performance improves
by only 2% on average for these workloads. We believe that
our hybrid tiered-memory architecture can only deliver on
its promises if the inter- and intra-memory asymmetries are
exploited efficiently by an operating system (OS)-level page
management policy, which we introduce next.
[Figure 7 plot. x-axis: workloads (blender, bwaves, cactuBSSN, cam4, gcc, imagick, nab, namd, omnetpp, perlbench, povray, roms, wrf, xalancbmk, xz, AVERAGE); y-axis: execution time normalized to Baseline, 0.00 to 1.00; legend: without segmented bitlines, with segmented bitlines.]
Figure 7. Execution time of our evaluated workloads nor-
malized to Baseline where bitlines are not segmented.
Our hybrid tiered-memory architecture is shown in Fig-
ure 8, where the memory tiers are arranged with increasing
access latency from the CPU. The figure also shows how
our architecture differs from the two state-of-the-art ap-
proaches: TL-DRAM [49], which only uses memory tiers
within DRAM-based main memory and Nimble [86], which
uses DRAM-NVM hybrid main memory like ours but the
bitlines are not segmented. We evaluate both these state-of-
the-art approaches in Section 6.
Figure 8. Proposed hybrid tiered-memory architecture.
3 New Page Management Policy of
MNEME
Figure 9 shows a high-level overview of our page allocation
policy to exploit our tiered architecture. A program is broken
down into fixed intervals (called phases). At each phase, we
profile FTIs based on accesses to pages they touch. This
information is then used to decide the initial placement of all
newly-referenced memory pages of the subsequent phases.
Figure 9. Proposed program execution.
Figure 10 shows an example of using FTIs to predict the access
intensity of newly-referenced memory pages in MNEME.
a1: for (...) {
a2:   ld $r0, <addr1>
      ...
      ...
a3:   sw $r2, <addr2>
      ...
    }
    ...
[Figure 10 diagram: page faults raised by the instructions above load Pages A, E, Q, F, and R.]
Figure 10. Examples of subsequent memory accesses created
by first-touch load and store instructions.
Assume that no profile information is available in the
beginning. The program counter a1 causes the initial page
fault, loading Page A into the far memory segment. During
the course of program execution, the load instruction at a2
causes a page fault, loading page E into the far memory
segment. Similarly, the store instruction at a3 also loads
page Q into the far segment. The instructions at addresses
a1, a2, and a3 are the FTIs. Now assume that as we iterate
through the for loop, the load and store instructions, located
at addresses a2 and a3, respectively, generate numerous
accesses to E and Q, respectively. MNEME records them
as FTIs that induce large numbers of accesses to any page
that these instructions might load in the future. Therefore,
later on, when the load instruction references an address that
requires page F to be loaded, MNEME will load this page into
the near memory segment. Similarly, R will also be loaded
into the near segment when accessed by the store instruction.
In addition to load and store statements, implementations
of branch or jump tables will also contain FTIs that can be
predicted to load frequently-accessed pages; for example,
when a jump instruction is the FTI that causes the page con-
taining the corresponding function to be loaded. Fig. 3 shows
that such FTIs are a major source of memory references.
A conceptual overview of MNEME is shown in Figure 11.
At a high level, the memory controller maintains a table
containing the memory addresses of access-intensive first-touch
instructions (i.e., their program counter values). We
call this the FTI table. We split the execution of an application
into phases, with each phase comprising 100 million instructions
(see Section 6.6 for an evaluation of the size of an
execution phase).
Figure 11. A conceptual overview of MNEME.
To allocate a newly referenced memory page in an execu-
tion phase, the OS page-fault handler runs a custom instruc-
tion to check if the virtual address corresponding to the FTI
of the page hits in the FTI table. If a match is found, the mem-
ory page is predicted to be access-intensive. The page-fault
handler allocates this new memory page to the near memory
segment. We explain later how to choose between DRAM
and NVM. The memory controller then uses reduced timing
and voltage parameters to access this page, improving per-
formance and reliability. Otherwise, the FTI is considered to
be unknown and possibly referencing a non-access-intensive
memory page. The OS page-fault handler allocates the memory
page to the far segment, while tracking the number of accesses
this page generates within the phase, leveraging OS
page tracking structures, and recording it inside a table. We
call this table the Access Intensity Record (AIR). At the end of each
phase, the top access-intensive unknown FTIs of AIR (with a
number of accesses higher than a threshold) are inserted into
the FTI table to predict and place all new memory pages to
near or far segments. In this way, the FTI table is constantly
updated with new access-inducing FTIs that are uncovered
during program execution.
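The interplay between the FTI table and the AIR described above can be sketched in software as follows. This is a toy model under our own assumptions, not the hardware implementation: the class name `FtiProfiler`, its methods, the use of a plain set and counter, and the threshold value are all hypothetical, and accesses are credited only to FTIs that are still being tracked in the current phase.

```python
from collections import Counter


class FtiProfiler:
    """Toy model of the FTI table and the Access Intensity Record (AIR)."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical per-phase access threshold
        self.fti_table = set()      # PCs of known access-intensive FTIs
        self.air = Counter()        # accesses per unknown FTI in this phase
        self.page_fti = {}          # page -> PC of the FTI that first touched it

    def on_page_fault(self, pc, page):
        """Return the segment chosen for a newly referenced page."""
        self.page_fti[page] = pc
        if pc in self.fti_table:
            return "near"
        self.air.setdefault(pc, 0)  # start tracking this unknown FTI
        return "far"

    def on_access(self, page):
        """Credit an access to the FTI that first touched the page."""
        pc = self.page_fti.get(page)
        if pc in self.air:
            self.air[pc] += 1

    def end_phase(self):
        """Promote unknown FTIs whose access count exceeds the threshold."""
        for pc, count in self.air.items():
            if count > self.threshold:
                self.fti_table.add(pc)
        self.air.clear()


# Replay of the Figure 10 scenario: the load at a2 first touches page E,
# page E turns out to be hot, so a later page F touched by a2 goes near.
profiler = FtiProfiler(threshold=100)
first = profiler.on_page_fault("a2", "E")
for _ in range(1000):
    profiler.on_access("E")
profiler.end_phase()
second = profiler.on_page_fault("a2", "F")
print(first, second)
```

The replay places page E in the far segment and page F in the near segment, mirroring the walkthrough of Figure 10.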
Using the FTI table and AIR, MNEME can place a new
memory page to a specific segment in DRAM or NVM. To
select the specific memory type (i.e., DRAM vs. NVM), we
introduce the following changes: 1) we maintain two FTI ta-
bles: one holding those FTIs that touch more write-intensive
pages (we call this FTI_W ), and another holding those FTIs
that touch more read-intensive pages (we call this FTI_R),
and 2) extend the AIR to record the number of read-inducing
and write-inducing pages that each FTI touches.
For a new memory page in a program phase, there can be
four possibilities with the corresponding FTI.
• Hit in FTI_W and miss in FTI_R: the FTI is predicted as
write-access inducing. So, allocate the memory page to
a near segment in DRAM.
• Miss in FTI_W and hit in FTI_R: the FTI is predicted as
read-access inducing. So, allocate the memory page to
a near segment in NVM.
• Hit in FTI_W and hit in FTI_R: the FTI is predicted as
both read and write inducing. So, allocate the memory
page to a near segment in DRAM (conservative).
• Miss in FTI_W and miss in FTI_R: the FTI is predicted
as non-access inducing. So, allocate the memory page
this FTI touches to a far segment in NVM if space is
available there, otherwise allocate it to a far segment
in DRAM. Additionally, make an entry for the FTI in
AIR and start recording accesses to the page.
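The four cases above reduce to a small decision function. The sketch below is one way to write it, assuming the two table lookups are already available as booleans; `choose_tier` and the `nvm_far_has_space` flag are hypothetical names introduced here for illustration.

```python
def choose_tier(hit_w, hit_r, nvm_far_has_space=True):
    """Map FTI_W / FTI_R lookup results to a (memory, segment) pair."""
    if hit_w:
        # Write-inducing, or both read- and write-inducing (conservative).
        return ("DRAM", "near")
    if hit_r:
        # Read-inducing only.
        return ("NVM", "near")
    # Unknown FTI: far segment, NVM first, DRAM as the fallback.
    return ("NVM", "far") if nvm_far_has_space else ("DRAM", "far")


for hit_w in (True, False):
    for hit_r in (True, False):
        print(hit_w, hit_r, choose_tier(hit_w, hit_r))
```

In the miss/miss case the caller would additionally create an AIR entry for the FTI and start recording accesses to the page, which this function leaves out.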
If MNEME predicts the access intensity of a new memory
page correctly (which we evaluate in Section 6), the page
will be placed in the correct memory tier during its initial
allocation, reducing run-time page migration overhead. Oth-
erwise, the page will be placed in an incorrect tier and will be
migrated by tracking its accesses during program execution.
Page migration in MNEME: Figure 12 shows the jour-
ney of hot and cold pages through tiers of the proposed
hybrid tiered-memory architecture. MNEME supports two
types of migrations: 1) page migrations between tiers of the
same memory unit, and 2) page migrations across tiers of
different memory units. For within-memory migrations, we
propose an approach where a page can be migrated from one
tier to another in the same memory bank without utilizing
the memory channels. This can be achieved for DRAM us-
ing two back-to-back activates (see Sec. 4.1.1) utilizing row
buffers, which are shared between read and write operations
(see RowClone [72] for instance).
Figure 12. Data migration in MNEME.
However, for PCM, and NVM in general, this is not straightforward
because the peripheral circuit consists of separate
hardware to read and write. Figure 13 shows the architecture
of a peripheral circuit in an NVM (e.g., PCM) bank [74]. The
peripheral circuit consists of the sense amplifier (to read) and
the write driver (to write), which are connected to a bitline
using transistors M1 and M2. From the write driver’s internal
circuit diagram shown in Figure 13, we observe that the write
driver can be viewed as a collection of two components –
the write pulse shaper logic, which generates the current
pulses necessary for the cell’s SET and RESET operations,
and the verify logic, which verifies the correctness of these
operations. These two circuit components together serve
write requests from the bank using a write scheme known
as program-and-verify (P&V) [30, 57].
Figure 13. Internal circuit of a bank’s peripheral structure.
Based on this observation, we propose simple circuit mod-
ifications to introduce the decoupling transistor M (see Fig.
13), which can be configured when needed, to transfer data
from the sense amplifier to the verify logic. As a result of
this modification, we can program the data read in the sense
amplifier to a different row in the bank using the write dri-
ver. This facilitates data migration between tiers of the same
memory bank without utilizing external memory channels.
To migrate pages across tiers of different memory units,
we still use the memory channel, which leads to performance
overhead. However, our OS-level page allocation policy min-
imizes these migrations significantly (see Section 6.5).
4 Implementation of MNEME
MNEME consists of two key components: 1) interface to sup-
port efficient data tiering within and across memory units in
hybrid memory, and 2) intensity prediction via FTI table and
AIR. We discuss how to implement each of these components
in order to design an efficient implementation of MNEME.
4.1 Interface to Support Tiered Hybrid Memory
Figure 14 shows the handshaking between OS, CPU and
memory, via the memory controller. Inside the memory controller,
there are separate read and write queues to buffer
requests to the memory. The scheduler schedules requests
from these queues using its access scheduling policy. We
use the FR-FCFS policy [68], where the scheduler prioritizes
requests that hit in the row buffer in a memory bank.
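As a rough illustration of the selection rule in FR-FCFS, the sketch below picks the oldest request that hits an open row and falls back to the oldest request overall when there is no hit. It is a simplification under our own assumptions: it ignores command timing constraints and the separate read and write queues, and the `Request` type and `fr_fcfs_pick` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: int  # arrival order (smaller means older)
    bank: int
    row: int


def fr_fcfs_pick(queue, open_rows):
    """Pick the next request: oldest row-buffer hit, else oldest request.

    open_rows maps a bank index to the row currently in its row buffer.
    """
    if not queue:
        return None
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    candidates = hits if hits else queue
    return min(candidates, key=lambda r: r.arrival)


queue = [Request(arrival=0, bank=0, row=5), Request(arrival=1, bank=0, row=7)]
picked = fr_fcfs_pick(queue, open_rows={0: 7})
print(picked)
```

Here the younger request is served first because row 7 is already open in bank 0; with no open row, the older request would be chosen.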
Figure 14. A full-system overview.
The address mapping block is responsible for selecting the
desired timing and voltage parameters based on the memory
segment (near or far) that a request hits. To explain this, we
use an example of 128GB NVM with 4 channels, 4 ranks
per channel, 8 banks per rank, 8 partitions per bank, 128
tiles per partition, 4096 wordlines and 2048 bitlines per tile.
Considering bank interleaving, the address mapping scheme
is as follows: [36:35] = rank address, [34:32] = partition ad-
dress, [31:25] = tile address, [24:13] = row address, [12:11]
= column address, [10:8] = bank address, [7:6] = channel
address, and [5:0] = byte address. We assume, without loss
of generality, that an isolation transistor divides each bitline
into a near segment with 512 cells and a far segment with 3584
cells (= 4096 - 512).
To decode the segment that a request hits, we use bit slic-
ing on the wordline address bits [24:13]. Therefore, address
bit 22 is used as the segment select bit (‘0’ ⇒ near segment
and ‘1’ ⇒ far segment). The segment select bit is used to
select segment-specific timing parameters stored in a lookup
table (LUT) inside the memory controller. The memory re-
quest and the selected timing parameters are forwarded to a
command generator, implemented as a state machine, which
issues memory-specific commands at appropriate intervals.
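The address mapping and the segment select bit can be illustrated with a small decoder. The sketch below follows the bit ranges and the bit position stated in the text; the names `FIELDS`, `decode`, and `segment` are ours, and how the single select bit corresponds to the 512/3584 physical split is not spelled out in the text, so the code simply tests the stated bit.

```python
# Bit fields of the example mapping, as (high bit, low bit), inclusive.
FIELDS = {
    "rank": (36, 35), "partition": (34, 32), "tile": (31, 25),
    "row": (24, 13), "column": (12, 11), "bank": (10, 8),
    "channel": (7, 6), "byte": (5, 0),
}
SEGMENT_SELECT_BIT = 22  # bit position stated in the text


def decode(addr):
    """Split a 37-bit physical address into its fields."""
    out = {}
    for name, (hi, lo) in FIELDS.items():
        width = hi - lo + 1
        out[name] = (addr >> lo) & ((1 << width) - 1)
    return out


def segment(addr):
    """Return 'near' or 'far' based on the segment select bit."""
    return "far" if (addr >> SEGMENT_SELECT_BIT) & 1 else "near"


addr = (3 << 35) | (5 << 32) | (100 << 25) | (7 << 13) | (2 << 11) | (6 << 8) | (1 << 6) | 33
print(decode(addr), segment(addr))
```

The field widths add up to 37 bits, which matches the 128GB capacity of the example.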
The description above applies to DRAM as well, with the
exception of the on-chip voltage regulator, which is needed
only for NVM to drive current through its cells. Furthermore,
we use the DRAM configuration of Lee et al. [49], where a
bitline contains 512 cells and is divided using an isolation
transistor into a near segment with 128 cells and a far segment
with 384 cells. Bit slicing is performed accordingly.
We now provide latency and reliability analysis of near and
far segments in DRAM and NVM. We consider the DRAM
architecture of Lee et al. [49] and the PCM architecture of
Redaelli [65], both from Micron.
4.1.1 Latency Analysis. To understand the latency im-
pact due to bitline segmentation, we briefly review memory-
timing parameters. The following discussion applies to both
DRAM and PCM. To serve a memory request that accesses
data at a particular row and column address within a bank,
a memory controller issues three commands to the bank.
• ACTIVATE: activate the wordline and enable the periph-
eral circuit for the memory cells to be accessed.
• READ/WRITE: drive read or write current through the
cell (PCM) or share charge from the cell (DRAM). After
this command executes, the data stored in the cell is
available at the output terminal of peripheral circuit, or
the write data is programmed to the cell.
• PRECHARGE: deactivate the wordline and bitline, and
prepare the bank for the next access.
Figure 15. Memory timings for read requests.
Figure 15 shows different memory timing parameters
when serving two read requests. Table 2 reports these tim-
ing parameters for the near and far segment of DRAM and
PCM. The parameters for DRAM are obtained from Lee et
al. [49], scaled to 45nm technology nodes using predictive
technology scaling [14]. The timing parameters for PCM
are obtained via SPICE simulations [3] with 45nm PDK [77].
Table 2 is stored in a LUT in our memory controller.
Table 2. Latency incurred by read and write requests, respectively,
to near and far segments of DRAM and PCM.
Memory  Segment  Request  tRCD     tCL       tBL     tRP     tRC
DRAM    near     Read     9.3ns    5.5ns     7.5ns   5.5ns   27.8ns
DRAM    near     Write    9.3ns    5.5ns     7.5ns   5.5ns   27.8ns
DRAM    far      Read     15ns     15ns      7.5ns   15ns    52.5ns
DRAM    far      Write    15ns     15ns      7.5ns   15ns    52.5ns
PCM     near     Read     3.75ns   22.5ns    15ns    0ns     41.25ns
PCM     near     Write    3.75ns   101ns     15ns    0ns     119.75ns
PCM     far      Read     3.75ns   37.5ns    15ns    0ns     56.25ns
PCM     far      Write    3.75ns   142.8ns   15ns    0ns     161.55ns
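The segment-specific lookup described above can be sketched as a small table keyed by memory type, segment select bit, and request type. This is only an illustrative sketch: the names `Timing`, `TIMING_LUT`, `lookup_timing`, and `row_sum` are hypothetical, and the encoding of the key is an assumption. The values are copied from Table 2.

```python
from typing import NamedTuple


class Timing(NamedTuple):
    """Timing parameters in nanoseconds, copied from Table 2."""
    tRCD: float
    tCL: float
    tBL: float
    tRP: float
    tRC: float


NEAR, FAR = 0, 1  # segment select bit: 0 => near segment, 1 => far segment

# Hypothetical LUT layout: (memory, segment select bit, request) -> timings.
TIMING_LUT = {
    ("DRAM", NEAR, "read"):  Timing(9.3, 5.5, 7.5, 5.5, 27.8),
    ("DRAM", NEAR, "write"): Timing(9.3, 5.5, 7.5, 5.5, 27.8),
    ("DRAM", FAR, "read"):   Timing(15.0, 15.0, 7.5, 15.0, 52.5),
    ("DRAM", FAR, "write"):  Timing(15.0, 15.0, 7.5, 15.0, 52.5),
    ("PCM", NEAR, "read"):   Timing(3.75, 22.5, 15.0, 0.0, 41.25),
    ("PCM", NEAR, "write"):  Timing(3.75, 101.0, 15.0, 0.0, 119.75),
    ("PCM", FAR, "read"):    Timing(3.75, 37.5, 15.0, 0.0, 56.25),
    ("PCM", FAR, "write"):   Timing(3.75, 142.8, 15.0, 0.0, 161.55),
}


def lookup_timing(memory: str, segment_bit: int, request: str) -> Timing:
    """Return the timing parameters the command generator would use."""
    return TIMING_LUT[(memory, segment_bit, request)]


def row_sum(t: Timing) -> float:
    """Sum of the four component latencies of one table row."""
    return t.tRCD + t.tCL + t.tBL + t.tRP
```

In every row of Table 2, the listed tRC equals tRCD + tCL + tBL + tRP, which `row_sum` makes easy to check.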
4.1.2 Reliability Analysis. Table 3 summarizes the sources
of reliability concerns in NVM. In this work, we consider
two dominant reliability issues in PCM: 1) finite endurance
of PCM cells and 2) high voltage-related aging of CMOS
devices in a peripheral circuit. We formulate these next.
Table 3. Reliability issues in NVM.
Reliability Issues                   NVM
High-voltage related circuit aging   PCM, Flash
High-current related circuit aging   OxRAM, STT-MRAM
Read disturbance                     All
Limited endurance                    All
Endurance-related lifetime: Endurance-related lifetime dep-
ends on: 1) how many times a PCM cell can be programmed
($N_e$) and 2) how frequently the cells are programmed ($N_f$) [64].
If $N_{WL}$ is the total number of wordlines in a PCM bank, the
endurance-related lifetime can be estimated as
$$L_e = \frac{N_{WL} \cdot N_e}{N_f}. \tag{1}$$
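As a quick illustration of Equation 1, the sketch below evaluates the endurance-related lifetime for two programming frequencies. The function name and all input values are hypothetical and are not taken from the paper; the point is only that lowering the programming frequency raises the lifetime proportionally.

```python
def endurance_lifetime(n_wl: float, n_e: float, n_f: float) -> float:
    """Equation 1: L_e = N_WL * N_e / N_f."""
    if n_f <= 0:
        raise ValueError("programming frequency n_f must be positive")
    return n_wl * n_e / n_f


# Hypothetical inputs, chosen only to show the trend.
base = endurance_lifetime(n_wl=4096, n_e=1e8, n_f=1.0e6)
fewer_writes = endurance_lifetime(n_wl=4096, n_e=1e8, n_f=0.8e6)

# Programming the cells 20% less often stretches the lifetime by 1/0.8.
improvement = fewer_writes / base
```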
Aging-related lifetime: High-voltage operations lead to reli-
ability issues such as negative-bias temperature instability
(NBTI), hot carrier injection (HCI), and time-dependent di-
electric breakdown (TDDB) [7, 19–25, 76]. We illustrate NBTI,
which is a dominant reliability issue in scaled technology
nodes. NBTI-induced aging of a CMOS device in a peripheral
ISMM ’20, June 16, 2020, London, UK Shihao Song, Anup Das, and Nagarajan Kandasamy
circuit at temperature T is calculated as
$$A(T) = \sum_{i=0}^{N_a - 1} g_0(T) \cdot V_{bias}^{a} \cdot t_{RC}^{b}, \tag{2}$$
where $N_a$ is the number of PCM accesses, and $g_0(T)$, $a$, and $b$
are material-dependent constants [7]. The bias voltage for a
specific PCM operation, $V_{bias}$, is obtained from Table 1, and
the PCM timing parameter, $t_{RC}$, from Table 2.
Using Equation 2, the reliability ($R(T)$) and lifetime ($L_a$)
can be computed as
$$R(T) = e^{-A(T)^{\beta}} \quad \text{and} \quad L_a = \int R(T). \tag{3}$$
From Equations 2 and 3, we can conclude that aging can be
used as a measure of lifetime. In Section 6, we show improve-
ments of MNEME in terms of Le and A.
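To make Equation 2 concrete, the following sketch accumulates NBTI aging over a list of PCM accesses. The constants `G0`, `A_EXP`, `B_EXP` and the two bias voltages are placeholders chosen for illustration only (they are not the values of Table 1 or of [7]), and g0 is treated as fixed at one temperature. The tRC values are the PCM read entries of Table 2.

```python
def nbti_aging(accesses, g0, a, b):
    """Equation 2: sum over accesses of g0 * V_bias**a * t_RC**b.

    accesses: iterable of (v_bias, t_rc) pairs, one pair per PCM access.
    g0, a, b: material-dependent constants (g0 taken at a fixed temperature).
    """
    return sum(g0 * (v_bias ** a) * (t_rc ** b) for v_bias, t_rc in accesses)


# Placeholder constants and voltages (assumptions, not from the paper).
G0, A_EXP, B_EXP = 1e-3, 2.0, 0.5
V_NEAR, V_FAR = 2.0, 3.0                      # volts, illustrative only
T_RC_NEAR_READ, T_RC_FAR_READ = 41.25, 56.25  # ns, PCM reads in Table 2

n_accesses = 1000
aging_near = nbti_aging([(V_NEAR, T_RC_NEAR_READ)] * n_accesses, G0, A_EXP, B_EXP)
aging_far = nbti_aging([(V_FAR, T_RC_FAR_READ)] * n_accesses, G0, A_EXP, B_EXP)
```

With positive exponents, both a lower bias voltage and a shorter tRC reduce the per-access contribution, so the same number of accesses ages the peripheral circuit less when they go to the near segment.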
4.2 Efficient Implementation of FTI Table and AIR
We now discuss an efficient implementation of MNEME.
4.2.1 Implementing FTI Table. MNEME stores the most acc-
ess-inducing FTIs (i.e., their program counters) from an ex-
ecution phase in the FTI Table. A naive way to implement
the FTI table is to use a row for each FTI. However, the exact
number of rows within this table will vary depending on
the number of FTIs, which is program-specific. If a table’s
capacity is inadequate to store all high-access inducing FTIs,
MNEME will start predicting every new FTI as non access-
intensive, once the table is full. The OS page fault handler
will then allocate all new pages to the far memory segment,
providing no significant performance improvement. There-
fore, the FTI table must be sized conservatively (i.e. assuming
the maximum number of FTIs), leading to a large hardware
cost for table storage and lookup. We propose to use a Bloom
filter to implement the FTI table.
The Bloom filter is a memory-efficient data structure to
represent set membership [10]. The price paid for this effi-
ciency is that this filter is a probabilistic data structure: it
tells if an element is either definitely not in the set (zero false
negatives) or may be in the set (non-zero false positives).
Figure 16. Operation of a Bloom filter.
The filter is implemented as a bit array of length m with
k distinct hash functions. Figure 16 shows a filter in which
m = 16 and k = 3. To insert an element x in the filter, it is
hashed with all three functions, and all of the bits in the
corresponding positions are set to 1. Conversely, to test if
an element y is in the filter, it is again hashed using all three
functions; and if all of the bits at the corresponding bit po-
sitions are 1, the element is declared to be present in the
Bloom filter. In Fig. 16, y is declared to be not present.
Because bits are never reset: 1) an element once inserted
cannot be deleted from the filter (see footnote 5), and 2) a false
negative can never occur. The rate of false positives is approximately
$$\left(1 - e^{-kn/m}\right)^{k}, \tag{4}$$
where n is the number of elements expected to be inserted
in the filter. MNEME uses two Bloom filters for the FTI_R
and FTI_W tables, each implemented with m = 128, k = 3.
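The insert and test operations described above can be sketched in a few lines. This is a minimal illustration with m = 128 and k = 3, not the hardware implementation: the paper does not specify the hash functions, so deriving k hashes by salting SHA-256 is an assumption made here, as are the names `BloomFilter`, `insert`, `may_contain`, and `false_positive_rate`.

```python
import hashlib
import math


class BloomFilter:
    """Bit array of length m probed by k hash functions."""

    def __init__(self, m: int = 128, k: int = 3):
        self.m = m
        self.k = k
        self.bits = 0  # the m-bit array, packed into one integer

    def _positions(self, item: int):
        # Assumption: k hash functions obtained by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: int) -> None:
        """Set all k bit positions for this item; bits are never reset."""
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def may_contain(self, item: int) -> bool:
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def false_positive_rate(n: int, m: int = 128, k: int = 3) -> float:
    """Equation 4: approximate false-positive rate after n insertions."""
    return (1.0 - math.exp(-k * n / m)) ** k
```

For example, inserting the program counters of a few FTIs and then calling `may_contain` on each of them always returns True, which is the no-false-negative property used by the page fault handler. `false_positive_rate(n)` shows how quickly the 128-bit filter saturates as n grows.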
4.2.2 AIR Implementation. The AIR is implemented as
a table with D rows, each having five fields: a valid field, the
program counter value of the FTI, the number of read and
write inducing pages touched by this FTI, and the number of
accesses that go to these pages. Within a program phase, the
least-frequently used entry in the AIR is overwritten with
a new FTI; the field recording the total number of accesses
is used as the frequency estimate. Upon completion of a
phase, any entry having frequency higher than a threshold
is inserted into the Bloom filter. Subsequently, the AIR is
reset by setting the valid field of all its entries to 0.
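The AIR behavior described above can be sketched as follows. This is an illustrative software model under stated assumptions: the class and method names (`AirEntry`, `Air`, `record`, `end_phase`) are hypothetical, the way accesses and page counts are reported to `record` is assumed, and the choice between the FTI_R and FTI_W filters is abstracted into a single `insert` callback.

```python
from dataclasses import dataclass


@dataclass
class AirEntry:
    """One AIR row with the five fields described in the text."""
    valid: bool = False
    pc: int = 0           # program counter of the FTI
    read_pages: int = 0   # read-inducing pages touched by this FTI
    write_pages: int = 0  # write-inducing pages touched by this FTI
    accesses: int = 0     # accesses to those pages (frequency estimate)


class Air:
    def __init__(self, rows: int = 8):
        self.rows = [AirEntry() for _ in range(rows)]

    def _find(self, pc: int):
        for entry in self.rows:
            if entry.valid and entry.pc == pc:
                return entry
        return None

    def _allocate(self, pc: int) -> AirEntry:
        # Use a free row if one exists; otherwise overwrite the
        # least-frequently used entry (smallest access count).
        victim = next((e for e in self.rows if not e.valid), None)
        if victim is None:
            victim = min(self.rows, key=lambda e: e.accesses)
        victim.valid, victim.pc = True, pc
        victim.read_pages = victim.write_pages = victim.accesses = 0
        return victim

    def record(self, pc, accesses=1, new_read_page=False, new_write_page=False):
        entry = self._find(pc) or self._allocate(pc)
        entry.accesses += accesses
        entry.read_pages += int(new_read_page)
        entry.write_pages += int(new_write_page)

    def end_phase(self, threshold: int, insert) -> None:
        """Promote hot FTIs via insert(pc), then invalidate every row."""
        for entry in self.rows:
            if entry.valid and entry.accesses > threshold:
                insert(entry.pc)
            entry.valid = False
```

In a usage sketch, `insert` could be the insert method of a Bloom filter or simply `set.add` for testing; the threshold value is a tuning parameter that this sketch leaves to the caller.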
5 Evaluation Methodology
To evaluate MNEME, we develop a cycle-accurate DRAM-
PCM hybrid memory simulator with the following:
• A cycle-level x86 multi-core simulator, whose front-
end is based on Pin [50]. We configure this to simulate
8 out-of-order cores.
• A main memory simulator, closely matching the JEDEC
Nonvolatile Dual In-line Memory Module (NVDIMM)-
N/F/P Specifications [1]. This simulator is composed of
Ramulator [39], to simulate DRAM, and a cycle-level
PCM simulator based on NVMain [58].
• Power and latency for DRAM and PCM are based on
Intel/Micron’s 3D Xpoint specification [11, 65]. Energy
is modeled for DRAM using DRAMPower [15] and for
NVM using NVMain with parameters from [65].
Table 4 summarizes the various simulation parameters.
We evaluate the following main memory architectures.
• M1: DRAM-PCM hybrid memory with DRAM and PCM
placed on separate DIMMs, sharing a common main
memory address space. This is the primary hybrid mem-
ory architecture that we evaluate in this paper, and is
similar to Intel Optane [11] using PCM instead of SSD.
• M2: PCM main memory with DRAM as write cache.
This is similar to the architecture of IBM Power9 [69].
• M3: Mainstream DRAM-based main memory architec-
ture similar to Intel Skylake [27].
Footnote 5: Certain modifications to the Bloom filter allow for deletion of elements.
One example is the Cuckoo filter [29].
Table 4. Major simulation parameters.
Processor          8 cores, 3 GHz, out-of-order
L1-I/D cache       Private 64KB per core, 4-way
L2 cache           Shared, 4MB, 8-way
DRAM Main Memory   64GB, Micron DDR3
                   2 channels, 4 ranks/channel, 8 banks/rank,
                   128 sub-arrays/bank, 512 rows/sub-array
                   Memory clock = 1066MHz
                   Near bitline segment = 128 cells
                   Far bitline segment = 384 cells (= 512 - 128)
PCM Main Memory    128GB, Micron DDR3 [65]
                   4 channels, 4 ranks/channel, 8 banks/rank,
                   8 partitions/bank, 128 tiles/partition, 4096 rows/tile
                   Memory clock = 1066MHz
                   Near bitline segment = 512 cells
                   Far bitline segment = 3584 cells (= 4096 - 512)
We evaluate architectures M2 and M3 to show that MNEME
improves performance of other memory architectures.
We evaluate the following techniques.
• Baseline allocates a page randomly to a free physical ad-
dress. Pages are not migrated between DRAM and PCM
during program execution. Bitlines are not segmented.
– for M1, the Baseline is Intel Optane [11].
– for M2, the Baseline is IBM Power9 [69].
– for M3, the Baseline is Intel Skylake [27].
• Nimble [86] supports M1-type hybrid memory. It mi-
grates pages between DRAM and PCM during program
execution, starting from a random physical address al-
location. Bitlines are not segmented.
• TL-DRAM [49] supports DRAM (M3). It uses the page
management policy of Baseline. Each bitline in a DRAM
bank is partitioned into near and far segments.
• MNEME supports both M1 and M2-type hybrid memory
architectures. It 1) uses segmented bitlines for DRAM
and PCM, 2) controls initial page allocation to correct
memory tiers, and 3) minimizes channel occupancy dur-
ing page migrations between tiers of the same memory,
to improve performance and reliability.
We evaluate all single-core and multi-programmed work-
loads from the SPEC CPU2017 suite [12]. Table 5 reports the
workloads that we present in Section 6. These workloads
are chosen because they have at least 1 cache Miss Per Kilo
Instructions (MPKI) (see Fig. 17). For other workloads with
low MPKI (those not presented in Sec. 6), MNEME neither
significantly improves nor hurts performance and reliability.
Table 5. Evaluated workloads.
single-core        8 copies each of blender, bwaves, cactuBSSN, cam4,
                   gcc, imagick, nab, namd, omnetpp, perlbench, povray,
                   roms, wrf, xalancbmk, xz
multi-programmed   MP1 (2 copies each of blender, bwaves, cactuBSSN,
                   and cam4) and MP2 (2 copies each of perlbench, wrf,
                   xalancbmk, and xz)
[Figure 17: bar chart. x-axis: blender, bwaves, cactuBSSN, cam4, gcc, imagick, nab, namd, omnetpp, perlbench, povray, roms, wrf, xalancbmk, xz, AVERAGE; y-axis: MPKI (0 to 300).]
Figure 17. MPKI for our evaluated workloads.
All workloads are executed for 10 billion instructions.
6 Results and Discussions
6.1 Summary of Key Results
Table 6 summarizes MNEME’s improvements.
Table 6. Summary of key results.
MNEME vs.       System Perf.   Energy Consump.   Migration Overhead   Lifetime (Sec. 6.9)
                (Sec. 6.2)     (Sec. 6.8)        (Sec. 6.5)           Endurance   Aging
Intel Optane    21% ↑          19% ↓             –                    –           –
Nimble [86]     16% ↑          18% ↓             71.2% ↓              20% ↑       33% ↓
IBM Power9      15% ↑
Intel Skylake   15% ↑
TL-DRAM [49]    13% ↑
6.2 Overall System Performance
We report overall system performance for three configu-
rations: 1) M1: DRAM-PCM hybrid memory with DRAM
and PCM on separate DIMMs, 2) M2: DRAM-PCM hybrid
memory with DRAM as write cache to PCM, and 3) M3:
DRAM-based system.
6.2.1 Hybrid Main Memory Architecture M1. Figure 18
reports the execution time of each workload for our evalu-
ated systems normalized to Baseline. The simulator is con-
figured for our primary DRAM-PCM hybrid main memory
architecture with a common address space. We make the
following three main observations.
[Figure 18: bar chart. x-axis: blender, bwaves, cactuBSSN, cam4, gcc, imagick, nab, namd, omnetpp, perlbench, povray, roms, wrf, xalancbmk, xz, AVERAGE; y-axis: execution time normalized to Baseline (0.0 to 1.5); series: Nimble, MNEME.]
Figure 18. Execution time, normalized to Baseline for
DRAM-PCM hybrid main memory with shared address.
First, Nimble achieves better performance than Baseline by
an average of 7% due to Nimble’s policy to migrate hot pages
from PCM to DRAM, which reduces execution time (DRAM
has lower access latency than PCM). Second, for workloads
such as blender and bwaves, performance of Nimble is, in
fact, worse than Baseline because of the high overhead of
page migrations in Nimble. Third, MNEME’s performance
is the best among all three systems. On average, the execu-
tion time of MNEME is 21% lower than Baseline and 16%
lower than Nimble. This improvement is due to 1) MNEME’s
segmented bitline architecture and 2) MNEME’s intelligent
initial page allocation policy to exploit performance asym-
metries in memory tiers.
6.2.2 Hybrid Main Memory Architecture M2. Figure 19
reports the execution time of each workload for our evalu-
ated systems normalized to Baseline with the simulator con-
figured for DRAM-PCM hybrid main memory with DRAM
configured as write cache to PCM. We make the following
two main observations.
[Figure 19: bar chart. x-axis: blender, bwaves, cactuBSSN, cam4, gcc, imagick, nab, namd, omnetpp, perlbench, povray, roms, wrf, xalancbmk, xz, AVERAGE; y-axis: execution time normalized to Baseline (0.0 to 1.0); series: Baseline + Segmented Bitline, MNEME.]
Figure 19. Execution time, normalized to Baseline for
DRAM-PCM hybrid memory with DRAM as write cache.
First, using the proposed segmented bitline architecture,
performance of Baseline improves only marginally, by an
average of 2% (first bar in each set). See also Observation
1. Second, MNEME’s performance is the highest among all
three systems. On average, MNEME’s execution time is 15%
lower than Baseline and 14% lower than segmented bitlines.
6.2.3 DRAM-based Main Memory Architecture M3.
Figure 20 reports the execution time of each workload for our
evaluated systems normalized to Baseline with the simulator
configured for DRAM-based main memory. We observe that
performance of TL-DRAM is marginally better than Base-
line. MNEME has the highest performance among all three
systems. On average, MNEME’s execution time is 15% lower
than Baseline and 13% lower than TL-DRAM.
[Figure 20: bar chart. x-axis: blender, bwaves, cactuBSSN, cam4, gcc, imagick, nab, namd, omnetpp, perlbench, povray, roms, wrf, xalancbmk, xz, AVERAGE; y-axis: execution time normalized to Baseline (0.0 to 1.0); series: TL-DRAM, MNEME.]
Figure 20. Execution time, normalized to Baseline for
DRAM-based main memory.
Unless otherwise stated, the following results are for our pri-
mary memory architecture, i.e., DRAM-PCM hybrid memory
with a common memory address space (M1).
6.3 Multi-Programmed Workloads
Figure 21 plots the execution time of MNEME normalized
to Nimble for 2 multi-programmed workloads on 2-core (2-
channel), 4-core (4-channel), and 8-core (8-channel) systems.
[Figure 21: bar chart. x-axis: MP1, MP2; y-axis: execution time normalized to Nimble (0.0 to 1.0); series: 2-core (2 ch.), 4-core (4 ch.), 8-core (8 ch.).]
Figure 21. Execution time normalized to Nimble for multi-
programmed workloads on 2-core, 4-core, and 8-core.
We observe that for 2, 4, and 8 cores in the system, MNEME
provides 18%, 24% and 31% performance improvement for
MP1, and 26%, 36% and 38% performance improvement for
MP2, compared to Nimble. The performance improvement
of MNEME increases with increasing number of channels.
This is because, with more channels, bank access latency
becomes the primary performance bottleneck. Therefore,
MNEME, which reduces the average bank access latency,
provides better performance with more channels.
6.4 Memory Access Distribution
Figure 22 plots the memory accesses to near and far memory
segments of each workload for Nimble and MNEME. We
make the following two main observations.
First, on average, only 13% of accesses go to near segments
in memory banks using the initial page allocation and hot
page migration policy of Nimble. Second, MNEME directs
an average of 64% of accesses to near memory segments
using its intelligent initial page allocation policy. This leads
to significant performance improvement (see Section 6.2).
[Figure 22: bar chart. x-axis: blender, bwaves, cactuBSSN, cam4, gcc, imagick, nab, namd, omnetpp, perlbench, povray, roms, wrf, xalancbmk, xz, AVERAGE; y-axis: memory access (%) (0 to 100); series: Nimble (near), Nimble (far), MNEME (near), MNEME (far).]
Figure 22. Memory accesses to near and far segments.
6.5 Migration Overhead
Figure 23 plots the page migration-related accesses in MNEME
normalized to Nimble for each workload. We observe that
page migration-related accesses in MNEME are lower than
Nimble by an average of 71.2%. This reduction is because 1)
MNEME places new pages in correct memory tiers during
their initial allocation using access profiles of observed first-
touch instructions in the program, which reduces the average
number of inter-memory page migrations, and 2) MNEME
uses its new peripheral circuit design to eliminate channel
usage for page migrations within each memory bank.
[Figure 23: bar chart. x-axis: blender, bwaves, cactuBSSN, cam4, gcc, imagick, nab, namd, omnetpp, perlbench, povray, roms, wrf, xalancbmk, xz, AVERAGE; y-axis: migration-related accesses normalized to Nimble (0.0 to 1.0); series: Nimble, MNEME. Bar labels, in the order extracted: 29.3%, 50.4%, 33.0%, 47.6%, 86.1%, 93.4%, 83.0%, 96.1%, 96.1%, 99.8%, 75.5%, 75.4%, 77.3%, 93.6%, 30.7%, 71.2%.]
Figure 23. Migration-related accesses normalized to Nimble.
6.6 Length of Execution Phases
Figure 24 reports the execution time of MNEME normalized
to Nimble for each workload. The first bar in each set is
for the default phase length of 100 million instructions. The
second and third bars are for phase length of 250 million
and 500 million instructions, respectively. We observe that the
execution time of MNEME increases with the length of the
phase interval. This is because shorter phase intervals allow
finer control of page allocation, resulting in higher perfor-
mance (i.e., lower execution time) than Nimble. However,
shorter phase intervals also incur higher overhead due to
1) frequent updates to the FTI Table and AIR and 2) frequent
page migrations and updates to the page table, impacting per-
formance. Among the evaluated settings, a phase interval of
100 million instructions gives the best trade-off between
performance and overhead.
[Figure 24: bar chart. x-axis: blender, bwaves, cactuBSSN, cam4, gcc, imagick, nab, namd, omnetpp, perlbench, povray, roms, wrf, xalancbmk, xz, AVERAGE; y-axis: execution time normalized to Nimble (0.0 to 1.0); series: phase interval (instructions) = 100M, 250M, 500M.]
Figure 24. Execution time, normalized to Nimble for pro-
gram phase intervals of 100M, 250M, and 500M instructions.
6.7 FTI-based Access Intensity Prediction
Figure 25 illustrates how MNEME improves its page alloca-
tion decisions over time, improving overall performance. The
bottom subfigure shows the increase in stored FTIs during
the execution of cam4. The top subfigure shows the fraction
of memory pages (i.e., their program counters) that hit in
the FTI Table. We observe that cam4 undergoes a change in
behavior after executing ≈ 5.5 billion instructions and then
again after ≈ 8.7 billion instructions, potentially due to dis-
tinct working sets. At these points, we see a dip in the number
of pages that hit in the FTI Table (top subfigure). Consequently,
the number of stored FTIs increases sharply around these times
(bottom subfigure) because MNEME starts inserting the newly observed
FTIs into the FTI table to improve its allocation decisions.
This results in an increase in the number of page hits (top
subfigure) during subsequent execution of cam4.
Figure 25. Illustration of the FTI-based access intensity pre-
diction for cam4.
6.8 Energy Consumption
Figure 26 reports the total energy consumption (demand ac-
cesses and page migrations) of each workload for our evalu-
ated systems normalized to Baseline. We observe that Nimble
has lower energy consumption than Baseline by an average
of only 2%. Although energy consumption of demand ac-
cesses is lower in Nimble, the potential energy savings are
overshadowed by the page migrations. MNEME has the low-
est energy consumption (on average, 19% lower than Baseline
and 18% lower than Nimble). These savings are achieved in
MNEME because it reduces page migrations on the memory
channel significantly by 1) initially allocating a page to a
correct tier, and 2) facilitating inter-segment data transfers,
which reduce channel occupancy.
[Figure 26: bar chart. x-axis: blender, bwaves, cactuBSSN, cam4, gcc, imagick, nab, namd, omnetpp, perlbench, povray, roms, wrf, xalancbmk, xz, AVERAGE; y-axis: energy consumption normalized to Baseline (0.0 to 1.0); series: Nimble, MNEME.]
Figure 26. Energy consumption, normalized to Baseline.
6.9 Reliability
We evaluate two reliability issues: endurance and NBTI.
6.9.1 Endurance-related Lifetime. Figure 27 reports the
endurance-related lifetime (computed using Equation 1) of
each workload for our evaluated systems normalized to Nim-
ble. We make the following observation.
[Figure 27: bar chart. x-axis: blender, bwaves, cactuBSSN, cam4, gcc, imagick, nab, namd, omnetpp, perlbench, povray, roms, wrf, xalancbmk, xz, AVERAGE; y-axis: endurance lifetime normalized to Nimble (0.0 to 1.5); series: Nimble, MNEME.]
Figure 27. Endurance lifetime, normalized to Nimble.
MNEME improves endurance-related lifetime by an aver-
age of 20% compared to Nimble. This improvement is because
of the extra tiers that MNEME creates inside each memory
using isolation transistors. Pages that are not frequently ref-
erenced can now stay in DRAM far segments for some time,
before they are migrated to PCM. This reduces PCM writes,
which improves endurance.
6.9.2 NBTI-related Aging. Figure 28 reports the NBTI-
related aging (computed using Equation 2) of each workload
for our evaluated systems normalized to Nimble. We observe
that MNEME has 33% lower NBTI-related aging than Nimble.
This is because MNEME uses lower bias voltages to access
near PCM segments (Table 1) due to its segmented bitline
architecture. Also, the CMOS devices in PCM’s peripheral
circuit are stressed for a shorter duration than in Nimble
due to the lower timing requirements of the near segments (Table
2). Both these factors contribute to lower NBTI aging [7].
[Figure 28: bar chart. x-axis: blender, bwaves, cactuBSSN, cam4, gcc, imagick, nab, namd, omnetpp, perlbench, povray, roms, wrf, xalancbmk, xz, AVERAGE; y-axis: NBTI aging normalized to Nimble (0.0 to 1.0); series: Nimble, MNEME.]
Figure 28. NBTI-related aging, normalized to Nimble.
6.10 Design Area Analysis
MNEME introduces two changes: increase in memory die
size due to isolation transistors and overhead in the memory
controller to maintain FTI Table and AIR.
6.10.1 Area Overhead on Memory-side.
Adding an isolation transistor to each bitline increases the
area of each bank. We estimate this for DRAM and PCM
as follows. In DRAM, the sense amplifier and the isolation
transistor are respectively 115.2x and 11.5x taller than an
individual DRAM cell. For a subarray of 512 DRAM cells per
bitline, the area overhead is $\frac{11.5}{115.2 + 512} = 1.83\%$ [49].
In PCM, the peripheral circuit and the isolation transistor
are respectively 384x and 9.6x taller than an individual PCM
cell. For a PCM partition of 4096 PCM cells per bitline, the
area overhead is $\frac{9.6 + 9.6}{384 + 4096} = 0.43\%$ (including the change to
support in-memory migrations).
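The two area estimates above follow the same ratio: the height added per bitline divided by the height of the peripheral circuit plus the cells on the bitline, all in units of one cell height. The short sketch below reproduces the arithmetic; the function name `isolation_area_overhead` is hypothetical, and the inputs are the figures quoted in the text.

```python
def isolation_area_overhead(added_height: float,
                            peripheral_height: float,
                            cells_per_bitline: int) -> float:
    """Fractional area overhead, with all heights in units of one cell."""
    return added_height / (peripheral_height + cells_per_bitline)


# DRAM: one isolation transistor (11.5x) against a 115.2x sense amplifier
# and 512 cells per bitline.
dram_overhead = isolation_area_overhead(11.5, 115.2, 512)

# PCM: 9.6x + 9.6x (the second term covers the change that supports
# in-memory migrations) against a 384x peripheral circuit and 4096 cells.
pcm_overhead = isolation_area_overhead(9.6 + 9.6, 384, 4096)

dram_percent = round(dram_overhead * 100, 2)
pcm_percent = round(pcm_overhead * 100, 2)
```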
6.10.2 Area Overhead on CPU-side.
FTI Tables are implemented as two 128-bit Bloom filters with
a total size of 256 bits. AIR is implemented as an 8-entry table
with 1-bit valid field, a 32-bit field for the program counter,
a 32-bit field for counting accesses, and two 16-bit fields
for counting pages. The total area overhead of AIR is 97B.
The LUT stores 4 extra rows in Table 2 (2 for PCM read
and write latency of near segment and 2 for DRAM read
and write latency of near segment). The extra area overhead
is 160 bits (= 4 * 5 entries per row * 8 bits per entry). So,
MNEME introduces a total of 149 bytes in storage in the
memory controller, corresponding to an area overhead of
$6 \times 10^{-4}\,\mathrm{mm}^2$ at 45nm. Given the cost sensitivity of memory de-
signs, designers can still benefit from MNEME's standalone
page allocation policy, without segmented bitlines.
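The 149-byte storage figure quoted above can be checked by tallying the three structures in bits. The sketch below uses only the field widths stated in the text; the function and variable names are hypothetical.

```python
def controller_storage_bits(bloom_filters: int = 2, bloom_bits: int = 128,
                            air_entries: int = 8,
                            lut_rows: int = 4, lut_cols: int = 5,
                            lut_entry_bits: int = 8) -> dict:
    """Tally the storage MNEME adds to the memory controller, in bits."""
    # One AIR entry: valid (1) + program counter (32) + access count (32)
    # + two page counters (16 each).
    air_entry_bits = 1 + 32 + 32 + 2 * 16
    return {
        "fti_tables": bloom_filters * bloom_bits,
        "air": air_entries * air_entry_bits,
        "lut": lut_rows * lut_cols * lut_entry_bits,
    }


storage = controller_storage_bits()
air_bytes = storage["air"] // 8
total_bytes = sum(storage.values()) // 8
```

Under these widths, the AIR accounts for most of the added storage, with the two Bloom filters and the extra LUT rows making up the rest.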
Figure 29 plots the execution time of each workload for
Nimble and MNEME, normalized to Baseline. The simulator
is configured for DRAM-PCM hybrid memory architecture
with shared address space. Bitlines are not segmented either
in DRAM or in PCM. We observe that MNEME’s perfor-
mance is still better. On average, MNEME’s execution time
is 15% lower than Baseline and 8% lower than Nimble.
[Figure 29: bar chart. x-axis: blender, bwaves, cactuBSSN, cam4, gcc, imagick, nab, namd, omnetpp, perlbench, povray, roms, wrf, xalancbmk, xz, AVERAGE; y-axis: execution time normalized to Baseline (0.00 to 1.25); series: Nimble, MNEME.]
Figure 29. Execution time, normalized to Baseline for main-
stream memory without segmented bitlines.
7 Related Works
To our knowledge, this is the first work that 1) enables seg-
mented bitline architecture for non-volatile memory and
analyzes its performance and reliability impacts, 2) develops a
strategy to intelligently place pages in near and far segments
of different memory units during their initial allocation, re-
ducing run-time migration overhead, and 3) introduces page
migrations within tiers of the same memory, without moving
data on the memory channels.
7.1 Performance, Energy, and Endurance
Optimizations
Many prior works optimize performance, energy, and en-
durance of PCM [2, 6, 18, 43, 46, 61, 64, 71, 74, 75, 87]. Song et
al. propose exploiting partition-level parallelism in each PCM
bank to improve performance of DRAM-PCM hybrid mem-
ory [74]. Song et al. propose data content aware PCM writes
to reduce write latency in PCM, improving performance of
DRAM-PCM hybrid memory [75]. Cho et al. propose Flip-
N-Write to improve PCM performance by first reading the
memory content and then programming only the bits that
need to be altered [18]. Qureshi et al. propose PreSET, an
architectural technique that SETs the PCM cells of a mem-
ory location in the background before programming them
during write [64]. There are also techniques to consolidate
multiple write operations, saving energy and improving per-
formance [85]. As MNEME addresses performance and
reliability bottlenecks by tackling them at their source, it can
be combined with these and similar techniques.
7.2 Writeback Optimization
Several prior works propose line-level writeback, where for
each evicted DRAM cache block, processor cache blocks that
become dirty are tracked and selectively written back to
PCM [43, 45, 46, 59–61]. Various works propose dynamic
write consolidation where PCM writes to the same row are
consolidated into one write operation [48, 73, 78, 81, 85].
Other works propose write activity reduction in PCM using
CPU registers [31, 32]. Yet other works propose multi-
stage write operations where a write request is served in sev-
eral steps rather than in one shot to improve performance [88,
89]. Qureshi et al. propose a morphable PCM system, which
dynamically adapts between high-density and high-latency
MLC PCM and low-density and low-latency single-level cell
PCM [62, 63]. Jiang et al. propose write truncation where a
write operation is truncated to allow read operations, com-
pensating for data loss using ECC [34]. MNEME is comple-
mentary to all these approaches.
7.3 Page Allocation
Many modern OSes are already aware of performance and re-
liability characteristics of different memory technologies [42].
There are two different approaches for page placement in
hybrid memory. The first approach is to monitor the memory
access patterns to pages, and migrate access-intensive pages
to the faster memory, e.g., [16, 33, 36, 40, 80, 86]. We com-
pare MNEME with Nimble [86] and find that MNEME performs
better (Section 6). The disadvantages of this migration-based
approach are that it incurs high performance and energy
overhead and increases bank occupancy due to the movement
of data in memory. An alternative approach is to predict the memory
access patterns of pages and place them in matching mem-
ory during their initial allocation [37]. However, it requires
accurate predictions of the memory read-write characteris-
tics of pages to be allocated. We not only introduce a new
prediction scheme based on first-touch instruction, but also
a novel memory architecture that can be exploited using
this allocation policy. Furthermore, we discuss methods to
reduce the migration overhead inside the memory bank.
8 Conclusions
We introduce MNEME, a new mechanism that enables a seg-
mented bitline architecture in DRAM-NVM hybrid memory,
introducing intra- and inter-memory performance and relia-
bility asymmetries, and exploits them using an efficient page
management policy at the OS, improving both performance
and reliability. Previous architectural solutions exist to tackle
page migration between different heterogeneous units in hy-
brid memory. However, they lead to significant performance
and energy overhead due to high bank and channel occu-
pancy during data migration. In this paper, we first introduce
an architectural solution involving the use of isolation tran-
sistors in long bitlines to create tiers with different latency
and reliability characteristics. Next, we expose the asym-
metric performance and reliability properties of memory
tiers, both within and across heterogeneous memory units
of hybrid memory to the OS. The OS exploits these asymme-
tries by placing every newly-referenced memory page in the
best tier during its initial allocation by predicting its data
access intensity. This minimizes run-time page migrations,
which leads to performance improvements. Finally, we pro-
pose a simple approach to facilitate page migrations within
tiers of the same memory, eliminating the need to move
data over memory channels, thereby further improving per-
formance. We evaluate MNEME with single-core and multi-
programmed workloads from the SPEC CPU2017 Benchmark
suites. Our results show that MNEME significantly improves
performance and reliability of state-of-the-art hybrid mem-
ory systems as well as mainstream DRAM-based systems.
Additionally, MNEME’s standalone page allocation policy
can also be applied to improve performance of computing
systems, where the proposed segmented bitline architecture
is too costly to incorporate.
We conclude that MNEME is a simple yet powerful mech-
anism for hybrid memory systems.
References
[1] “Non-Volatile Dual In-line Memory Module (NVDIMM) – N/F/P Spec-
ification,” JEDEC Solid State Technology Association, 2019.
[2] S. Akram, J. B. Sartor, K. S. McKinley, and L. Eeckhout, “Write-rationing
garbage collection for hybrid memories,” in PLDI, 2018.
[3] P. Antognetti and G. Massobrio, Semiconductor device modeling with
SPICE. McGraw-Hill, Inc., 1990.
[4] D. Apalkov, A. Khvalkovskiy, S. Watts, V. Nikitin, X. Tang, D. Lottis,
K. Moon, X. Luo, E. Chen, A. Ong et al., “Spin-transfer torque magnetic
random access memory (STT-MRAM),” JETC, 2013.
[5] M. Arafa and R. K. Ramanujan, “Memory card with volatile and non
volatile memory space having multiple usage model configurations,”
US Patent 10,095,618, 2018.
[6] M. Arjomand, M. T. Kandemir, A. Sivasubramaniam, and C. R. Das,
“Boosting access parallelism to PCM-based main memory,” in ISCA,
2016.
[7] A. Balaji, S. Song, A. Das, N. Dutt, J. Krichmar, N. Kandasamy, and
F. Catthoor, “A framework to explore workload-specific performance
and lifetime trade-offs in neuromorphic computing,” CAL, 2019.
[8] A. Bhattacharyya, “Memory arrays,” US Patent 10,374,101, 2019.
[9] S. Blagodurov, G. H. Loh, and M. R. Meswani, “Hot page selection in
multi-level memory hierarchies,” US Patent 10,235,290, 2019.
[10] B. H. Bloom, “Space/time trade-offs in hash coding with allowable
errors,” CSUR, 1970.
[11] K. Bourzac, “Has Intel created a universal memory technology?” IEEE
Spectrum, 2017.
[12] J. Bucek, K.-D. Lange et al., “SPECCPU2017: Next-Generation Compute
Benchmark,” in ICPE, 2018.
[13] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Error character-
ization, mitigation, and recovery in flash-memory-based solid-state
drives,” Proceedings of the IEEE, 2017.
[14] Y. Cao and C. McAndrew, “MOSFET modeling for 45nm and beyond,”
in ICCAD, 2007.
[15] K. Chandrasekar, C. Weis, Y. Li, B. Akesson, N. Wehn, and K. Goossens,
“DRAMPower: Open-source DRAM power & energy estimation tool,”
http://www.drampower.info, 2012.
[16] Y.-M. Chang, Y.-H. Chang, H.-C. Chen, and T.-W. Kuo, “Memory system
and memory management method thereof,” US Patent 10,108,555, 2018.
[17] B.-H. Cho, W.-Y. Cho, H.-R. Oh, and B.-G. Choi, “Programming method
of controlling the amount of write current applied to phase change
memory device and write driver circuit therefor,” US Patent 6,885,602,
2005.
[18] S. Cho and H. Lee, “Flip-N-Write: a simple deterministic technique to
improve PRAM write performance, energy and endurance,” in MICRO,
2009.
[19] A. Das and A. Kumar, “Fault-aware task re-mapping for throughput
constrained multimedia applications on NoC-based MPSoCs,” in RSP,
2012.
[20] A. Das, A. Kumar, and B. Veeravalli, “Aging-aware hardware-software
task partitioning for reliable reconfigurable multiprocessor systems,”
in CASES, 2013.
[21] A. Das, A. Kumar, and B. Veeravalli, “Reliability-driven task mapping
for lifetime extension of networks-on-chip based multiprocessor sys-
tems,” in DATE, 2013.
[22] A. Das, A. Kumar, and B. Veeravalli, “Communication and migration
energy aware task mapping for reliable multiprocessor systems,” FGCS,
2014.
[23] A. Das, A. Kumar, and B. Veeravalli, “Energy-aware task mapping and
scheduling for reliable embedded computing systems,” TECS, 2014.
[24] A. Das, A. Kumar, B. Veeravalli, C. Bolchini, and A. Miele, “Combined
DVFS and mapping exploration for lifetime and soft-error susceptibil-
ity improvement in MPSoCs,” in DATE, 2014.
[25] A. Das, A. Kumar, and B. Veeravalli, “Reliability and energy-aware
mapping and scheduling of multimedia applications on multiprocessor
systems,” TPDS, 2015.
[26] A. Das, H. Hassan, and O. Mutlu, “VRL-DRAM: Improving DRAM
performance via variable refresh latency,” in DAC, 2018.
[27] J. Doweck, W.-F. Kao, A. K.-y. Lu, J. Mandelblat, A. Rahatekar, L. Rappoport,
E. Rotem, A. Yasin, and A. Yoaz, “Inside 6th-generation Intel
Core: New microarchitecture code-named Skylake,” IEEE Micro, 2017.
[28] C. Dray and L. Wei, “High voltage tolerant word-line driver,” US Patent
9,875,783, 2018.
[29] B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher,
“Cuckoo filter: Practically better than bloom,” in CONEXT, 2014.
[30] A. Goda, T. Vali, C. Miccoli, and P. Kalavade, “Programming memory
devices,” US Patent 10,217,515, 2019.
[31] J. Hu, C. J. Xue, Q. Zhuge, W.-C. Tseng, and E. H.-M. Sha, “Write
activity reduction on non-volatile main memories for embedded chip
multiprocessors,” TECS, 2013.
[32] Y. Huang, T. Liu, and C. J. Xue, “Register allocation for write activity
minimization on non-volatile main memory,” in ASPDAC, 2011.
[33] N. S. Jayasena, G. H. Loh, J. M. O’Connor, and N. Chatterjee, “Page
migration in a hybrid memory device,” US Patent 9,910,605, 2018.
[34] L. Jiang, Y. Zhang, B. R. Childers, and J. Yang, “FPB: Fine-grained power
budgeting to improve write throughput of multi-level cell phase change
memory,” in MICRO, 2012.
[35] U. Kang, H.-s. Yu, C. Park, H. Zheng, J. Halbert, K. Bains, S. Jang, and
J. S. Choi, “Co-architecting controllers and DRAM to enhance DRAM
process scaling,” in The Memory Forum, 2014.
[36] S. Kannan, A. Gavrilovska, V. Gupta, and K. Schwan, “HeteroOS: OS
design for heterogeneous memory management in datacenters,” OSR,
2018.
[37] H. A. Khouzani, C. Yang, and J. Hu, “Improving performance and
lifetime of DRAM-PCM hybrid main memory through a proactive
page allocation strategy,” in ASP-DAC, 2015.
[38] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A case for exploiting
subarray-level parallelism (SALP) in DRAM,” in ISCA, 2012.
[39] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible
DRAM Simulator,” CAL, 2016.
[40] A. Kokolis, D. Skarlatos, and J. Torrellas, “PageSeer: Using page walks
to trigger page swaps in hybrid memory systems,” in HPCA, 2019.
[41] E. Kültürsay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Eval-
uating STT-RAM as an energy-efficient main memory alternative,” in
ISPASS, 2013.
[42] C. Lameter, “NUMA (non-uniform memory access): An overview,”
Queue, 2013.
[43] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and
D. Burger, “Phase-change technology and the future of main memory,”
IEEE Micro, 2010.
[44] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting phase change
memory as a scalable DRAM alternative,” in ISCA, 2009.
[45] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting phase change
memory as a scalable DRAM alternative,” in ISCA, 2009.
[46] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Phase change memory
architecture and the quest for scalability,” CACM, 2010.
[47] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and
D. Burger, “Phase-change technology and the future of main memory,”
IEEE Micro, 2010.
[48] C. J. Lee, V. Narasiman, E. Ebrahimi, O. Mutlu, and Y. N. Patt, “DRAM-
aware last-level cache writeback: Reducing write-caused interference
in memory systems,” 2010.
[49] D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu,
“Tiered-latency DRAM: A low latency and low cost DRAM architecture,”
in HPCA, 2013.
[50] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace,
V. J. Reddi, and K. Hazelwood, “Pin: Building customized program
analysis tools with dynamic instrumentation,” in PLDI, 2005.
[51] H.-L. Lung, C. P. Miller, C.-J. Chen, S. C. Lewis, J. Morrish, T. Perri, R. C.
Jordan, H.-Y. Ho, T.-S. Chen, W.-C. Chien et al., “A double-data-rate
2 (DDR2) interface phase-change memory with 533MB/s read-write
data rate and 37.5 ns access latency for memory-type storage class
memory applications,” in IMW, 2016.
[52] A. Mallik, D. Garbin, A. Fantini, D. Rodopoulos, R. Degraeve, J. Stu-
ijt, A. Das, S. Schaafsma, P. Debacker, G. Donadio et al., “Design-
technology co-optimization for OxRRAM-based synaptic processing
unit,” in VLSI Technology, 2017.
[53] O. Mutlu, “Memory scaling: A systems architecture perspective,” in
IMW, 2013.
[54] O. Mutlu, “The RowHammer problem and other issues we may face as
memory becomes denser,” in DATE, 2017.
[55] O. Mutlu and J. S. Kim, “Rowhammer: A retrospective,” TCAD, 2019.
[56] O. Mutlu and L. Subramanian, “Research problems and opportunities
in memory systems,” Supercomputing Frontiers and Innovations, 2015.
[57] T. Nirschl, J. Philipp, T. Happ, G. W. Burr, B. Rajendran, M.-H. Lee,
A. Schrott, M. Yang, M. Breitwisch, C.-F. Chen et al., “Write strategies
for 2 and 4-bit multi-level phase-change memory,” in IEDM, 2007.
[58] M. Poremba, T. Zhang, and Y. Xie, “NVMain 2.0: A user-friendly memory
simulator to model (non-)volatile memory systems,” CAL, 2015.
[59] B. Pourshirazi, M. V. Beigi, Z. Zhu, and G. Memik, “WALL: A writeback-
aware LLC management for PCM-based main memory systems,” in
DATE, 2018.
[60] B. Pourshirazi, M. V. Beigi, Z. Zhu, and G. Memik, “Writeback-aware
LLC management for PCM-based main memory systems,” TODAES,
2019.
[61] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high perfor-
mance main memory system using phase-change memory technology,”
in ISCA, 2009.
[62] M. K. Qureshi, M. M. Franceschini, L. A. Lastras-Montaño, and J. P.
Karidis, “Morphable memory system: A robust architecture for exploit-
ing multi-level phase change memories,” in ISCA, 2010.
[63] M. K. Qureshi, M. M. Franceschini, and L. A. Lastras-Montaño, “Improving
read performance of phase change memories via write cancellation
and write pausing,” in HPCA, 2010.
[64] M. K. Qureshi, M. M. Franceschini, A. Jagmohan, and L. A. Lastras, “PreSET:
improving performance of phase change memories by exploiting
asymmetry in write times,” in ISCA, 2012.
[65] A. Redaelli, “Phase Change Memory: Device Physics, Reliability and
Applications,” Phase Change Memory, 2018.
[66] A. Redaelli and C. Perrone, “Semiconductor constructions and memory
arrays,” US Patent 9,748,480, 2017.
[67] J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutlu, “ThyNVM:
Enabling software-transparent crash consistency in persistent memory
systems,” in MICRO, 2015.
[68] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory
access scheduling,” in ISCA, 2000.
[69] S. K. Sadasivam, B. W. Thompto, R. Kalla, and W. J. Starke, “IBM Power9
processor architecture,” IEEE Micro, 2017.
[70] B. S. Sandhu, C. Pietrzyk, and G. M. Lattimore, “Memory write driver,
method and system,” US Patent 10,529,420, 2020.
[71] N. H. Seong, D. H. Woo, and H.-H. S. Lee, “Security Refresh: Prevent
malicious wear-out and increase durability for phase-change memory
with dynamically randomized address mapping,” in ISCA, 2010.
[72] V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhi-
menko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch et al., “RowClone:
fast and energy-efficient in-DRAM bulk data copy and initialization,”
in MICRO, 2013.
[73] V. Seshadri, A. Bhowmick, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and
T. C. Mowry, “The dirty-block index,” in ISCA, 2014.
[74] S. Song, A. Das, O. Mutlu, and N. Kandasamy, “Enabling and exploiting
partition-level parallelism (PALP) in phase change memories,” TECS,
2019.
[75] S. Song, A. Das, O. Mutlu, and N. Kandasamy, “Improving phase change
memory performance with data content aware access,” in ISMM, 2020.
[76] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, “The case for lifetime
reliability-aware microprocessors,” in ISCA, 2004.
[77] J. E. Stine, I. Castellanos, M. Wood, J. Henson, F. Love, W. R. Davis,
P. D. Franzon, M. Bucher, S. Basavarajaiah, J. Oh et al., “FreePDK: An
open-source variation-aware design kit,” in MSE, 2007.
[78] J. Stuecheli, D. Kaseridis, D. Daly, H. C. Hunter, and L. K. John, “The
virtual write queue: Coordinating DRAM and last-level cache policies,”
in ISCA, 2010.
[79] C. Villa, “PCM array architecture and management,” in Phase Change
Memory, 2018.
[80] X. Wang, H. Liu, X. Liao, J. Chen, H. Jin, Y. Zhang, L. Zheng, B. He,
and S. Jiang, “Supporting superpages and lightweight page migration
in hybrid memory systems,” TACO, 2019.
[81] Z. Wang, S. Shan, T. Cao, J. Gu, Y. Xu, S. Mu, Y. Xie, and D. A. Jiménez,
“WADE: Writeback-aware dynamic cache management for NVM-based
main memory system,” TACO, 2013.
[82] M. V. Wilkes, “The memory gap and the future of high performance
memories,” Computer Architecture News, 2001.
[83] H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran,
M. Asheghi, and K. E. Goodson, “Phase change memory,” Proceedings
of the IEEE, 2010.
[84] W. A. Wulf and S. A. McKee, “Hitting the memory wall: implications
of the obvious,” Computer Architecture News, 1995.
[85] F. Xia, D. Jiang, J. Xiong, M. Chen, L. Zhang, and N. Sun, “DWC:
Dynamic write consolidation for phase change memory systems,” in
ICS, 2014.
[86] Z. Yan, D. Lustig, D. Nellans, and A. Bhattacharjee, “Nimble page
management for tiered memory systems,” in ASPLOS, 2019.
[87] H. Yoon, J. Meza, N. Muralimanohar, N. P. Jouppi, and O. Mutlu, “Effi-
cient data mapping and buffering techniques for multilevel cell phase-
change memories,” TACO, 2015.
[88] J. Yue and Y. Zhu, “Accelerating write by exploiting PCM asymmetries,”
in HPCA, 2013.
[89] L. Zhang, B. Neely, D. Franklin, D. Strukov, Y. Xie, and F. T. Chong,
“Mellow writes: Extending lifetime in resistive memories through se-
lective slow write backs,” in ISCA, 2016.
