SAWL:A Self-adaptive Wear-leveling NVM Scheme for High Performance
  Storage Systems by Huang, Jianming et al.
Jianming Huang*, Yu Hua*, Pengfei Zuo*, Wen Zhou*, Fangting Huang*
*Huazhong University of Science and Technology
Abstract
In order to meet the needs of high performance computing
(HPC) in terms of large memory, high throughput and energy
savings, the non-volatile memory (NVM) has been widely
studied due to its salient features of high density, near-zero
standby power, byte-addressable and non-volatile properties.
In HPC systems, the multi-level cell (MLC) technique is used
to significantly increase device density and decrease the cost,
which however leads to much weaker endurance than the
single-level cell (SLC) counterpart. Although wear-leveling
techniques can mitigate this weakness in MLC, their benefit
for MLC-based NVM is very limited, because some cells are
worn out before a uniform write distribution is achieved.
To address this problem, this paper proposes a
self-adaptive wear-leveling (SAWL) scheme for MLC-based
NVM. The idea behind SAWL is to dynamically tune the
wear-leveling granularity and balance the writes across the
cells of the entire memory, thus achieving a suitable tradeoff
between the lifetime and cache hit rate. Moreover, to reduce
the size of the address-mapping table, SAWL maintains a
few recently-accessed mappings in a small on-chip cache.
Experimental results demonstrate that SAWL significantly
improves the NVM lifetime and the performance for HPC
systems, compared with state-of-the-art schemes.
1 Introduction
High performance computing (HPC) systems generally require
large memory, high I/O throughput and significant
energy savings. Because it meets all these needs, non-volatile
memory (NVM) has been widely used in high performance
systems [17, 25, 27, 39]. For HPC applications, NVM can hold
larger workloads than DRAM, which alleviates the constraints
of memory space and reduces the data movements between
high-speed memory and low-speed disks, thereby delivering
high performance [25]. Moreover, recent measurements
of the Intel Optane DC Persistent Memory Module
demonstrate significant performance improvements on
typical real-world applications [15]. Existing studies [23, 37]
have also shown that leakage energy grows with the memory
capacity, dissipating as much heat as dynamic energy and
becoming a main contributor to operational costs. NVM
technologies [3,4,7,18,19], such as STT-RAM, PCM, and RRAM,
hence become promising and important for HPC applications.
In practice, NVM fails to achieve high performance and
actually increases the complexity of management due to
its limited lifetime, which causes frequent updates and
reconfigurations. Limited lifetime has become
the performance bottleneck of storage systems, especially
for HPC applications that usually contain large numbers of
write operations. Moreover, in order to offer large capacity
at relatively low cost, device vendors often provide
multi-level-cell (MLC)-based NVM for real-world applications.
Compared with single-level-cell (SLC)-based NVM,
MLC-based NVM exhibits higher storage density, lower costs
and comparable read latency, thus achieving better
performance in memory-sensitive HPC applications. Unfortunately,
the lifetime problem is exacerbated in MLC, since MLC stores
more bits in a single cell, which results in weak endurance. The
MLC technique used in NVM (PCM and RRAM) is able to
support the rapid growth in device capacity and density but
at the cost of much weaker endurance than the SLC coun-
terpart. The advanced fabrication technique in MLC packs
more than one bit in a single cell [22], thus allowing NVM
to achieve ultra-high density. However, due to the iterative
program-and-verify (P&V) technique, the MLC technology
produces remarkable variations on access latency and cell
endurance. Table 1 summarizes the characteristic parameters
of SLC- and MLC-based NVM technologies.
Compared with SLC, the MLC-based NVM increases access
latency by 2∼4 times and decreases endurance by 100
times due to unavoidable over-programming operation. For
example, the SLC PCM devices are expected to last for
10^7 ∼ 10^8 writes per cell [9, 44], and the RRAM technology
has a per-cell write limit between 10^8 and 10^12 in the
SLC mode. But the cell endurance of MLC PCM only reaches
10^5 ∼ 10^6 writes per cell [11], and that of the MLC RRAM
decreases to 10^7 writes per cell [22].
arXiv:1905.02871v1 [cs.AR] 8 May 2019
Table 1: Key features of SLC- and MLC-based NVM technologies.

                 SLC PCM        SLC RRAM        MLC PCM        MLC RRAM
Read latency     150ns          10ns            250ns          50ns
Write latency    450ns          50ns            1.5us          350ns
Cell endurance   10^7 ∼ 10^8    10^8 ∼ 10^12    10^5 ∼ 10^6    ∼ 10^7
In order to extend the lifetime of MLC-based NVM,
wear-leveling techniques attempt to distribute write operations
uniformly by frequently remapping logical lines to
new physical positions, which can also prevent brute-force
attacks on a particular physical line. However, we observe that
existing wear-leveling algorithms [29, 33, 35, 44, 45], initially
designed for SLC-based NVM, become inefficient in MLC-based
NVM systems. Specifically, to prevent malicious
attacks [33] that guess the physical location and continuously
wear a given line, existing algorithms perform the remapping
in a randomized manner without recording the accurate
write counts of memory cells. Hence, they attempt to
achieve wear leveling by randomly shuffling logical-physical
address mappings via algebraic functions, so as to evenly
disperse the most frequently written logical lines over as many
physical lines as possible. This requires a huge number of
rounds of data exchanges before a probabilistically uniform
distribution of write counts over all physical lines in an NVM
can be achieved [30, 42]. However, in practice, the low endurance of
MLC-based NVM implies that some cells can be worn out
long before this uniform distribution is achieved. As a result,
existing work fails to attain a long lifetime for MLC-based NVM
(the quantitative analysis is shown in Section 2).
There are two straightforward solutions to accelerate data
exchanges and avoid some lines being worn out before being
swapped. One is to increase the exchange frequency. However,
frequent content exchanges increase write amplification and
block the data access, which in turn significantly decrease
performance and increase energy consumption.
The other is to decrease the wear-leveling granularities
(e.g., region size) to mitigate the imbalanced writes across
the entire memory, which however significantly increases the
size of address mapping table (e.g., hundreds of megabytes).
Therefore, the mapping table is too large to be fully held
in the on-chip cache, which leads to severe performance
degradation due to the long latency of address translation.
To address this problem, a tiered architecture can be con-
sidered, which stores the entire address mapping table in
the main memory (DRAM or NVM devices) and holds the
recently-accessed entries in a small on-chip SRAM cache. In
fact, this intuitive solution often fails to provide sufficient per-
formance improvements for the applications with substantial
random access patterns due to the low cache hit rate of HPC
applications. Hence, we propose a self-adaptive wear-leveling
scheme (SAWL) that dynamically changes the wear-leveling
granularities to accommodate more useful addresses in the
cache, thus significantly improving the cache hit rate. As a
result, SAWL is able to achieve both long lifetime and high
performance. Our main contributions are summarized as follows:
1. Insights for wear-leveling schemes on MLC-based
NVM. We investigate how effectively state-of-the-art
wear-leveling algorithms work on MLC-based NVM,
including table-based wear-leveling (TBWL) [46],
algebraic-based wear-leveling (AWL) [29, 33], and
hybrid wear-leveling (HWL) schemes [35,42]. We observe
that TBWL and AWL schemes either are vulnerable to the
Repeated Address Attack (RAA) or suffer significant NVM
lifetime reduction. HWL is able to achieve a long lifetime
but incurs significant on-chip storage overhead to store
address mappings.
2. An efficient wear-leveling scheme for MLC-based
NVM. We propose a Self-Adaptive Wear-Leveling
(SAWL) scheme for MLC-based NVM. SAWL main-
tains recently-accessed address mappings in a small on-
chip cache managed by the memory controller. To im-
prove the cache hit rate, SAWL dynamically changes the
wear-leveling granularities by means of region-merge
and region-split operations, as shown in Section 3.2. As
a result, SAWL is able to achieve both long lifetime and
high performance.
3. Implementation and evaluation. We have implemented
SAWL and evaluated it using gem5 [1]. Experimental
results show that SAWL improves the lifetime by 25%∼51%
(50%∼78%) of the ideal lifetime for the MLC-based NVM system
with 10^6 (10^5) cell endurance, compared with
state-of-the-art wear-leveling schemes. Moreover, existing
wear-leveling schemes incur a 25% IPC decrease on average,
while SAWL only decreases the IPC by 5%
on average, compared with a baseline system without
any wear-leveling algorithm.
The rest of the paper is organized as follows. Section 2
introduces the background and motivation. The design of
SAWL is described in Section 3. Section 4 presents the evalu-
ation results and analysis. Section 5 presents the related work.
We conclude this paper in Section 6.
2 Background and Motivation
In this section, we present the background on wear leveling in
NVM to facilitate our discussion and analyze the important
observations that motivate our SAWL design.
2.1 Existing Wear-Leveling Algorithms
Wear-leveling schemes are proposed to extend the lifetime
of NVM and defend against security attacks by uniformly
distributing writes among all NVM cells. When a region
has been written a certain number of times, the wear-leveling
algorithm is performed to exchange the data in/beyond this
region. The number of writes that triggers wear leveling
is called the swapping period. According to the mapping
relationship between the logical and physical addresses, existing
wear-leveling schemes can be classified into three categories:
table-based wear-leveling (TBWL), algebraic-based wear-leveling
(AWL), and hybrid wear-leveling (HWL) schemes. Wear
leveling is transparent to upper-level applications due to the
mapping relationship between the logical and physical
addresses. Applications simply access the same contents
through the same logical addresses and can ignore the
physical addresses where data are actually stored.

Figure 1: Table-based and algebraic wear-leveling schemes
((a) Segment Swapping is a TBWL scheme; (b) Start-Gap and
(c) Security Refresh are AWL schemes).

TBWL schemes, e.g., Segment Swapping [46], record the
mapping relationship between each logical line address (LA)
and its physical counterpart (PA). A line is the atomic
memory-access unit, whose size is equal to that of the
last-level cache line. When the write count (WC) of one PA
triggers the wear-leveling, Segment Swapping exchanges the
data between this PA and the least used PA in the same
region, as shown in Fig. 1(a). The mapping table, however,
results in a huge space overhead, since the mapping
information must be tracked for all memory lines.
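
The table-based exchange described above can be sketched as follows. This is a minimal illustration for a single region, not the Segment Swapping implementation of [46]: the class name, the trigger rule (a swap fires whenever a line's write count reaches a multiple of a hypothetical swap_period) and the choice to charge one extra write to each exchanged line are all assumptions.

```python
class SegmentSwapRegion:
    """Toy table-based wear leveling for one region (illustrative only)."""

    def __init__(self, write_counts, swap_period):
        n = len(write_counts)
        self.la_to_pa = list(range(n))   # mapping table: logical -> physical line
        self.wc = list(write_counts)     # write count of each physical line
        self.data = [None] * n           # contents of each physical line
        self.swap_period = swap_period   # hypothetical trigger threshold

    def read(self, la):
        return self.data[self.la_to_pa[la]]

    def write(self, la, value):
        pa = self.la_to_pa[la]
        self.data[pa] = value
        self.wc[pa] += 1
        if self.wc[pa] % self.swap_period == 0:
            self._swap_with_least_used(la)

    def _swap_with_least_used(self, hot_la):
        hot_pa = self.la_to_pa[hot_la]
        cold_pa = min(range(len(self.wc)), key=self.wc.__getitem__)
        if cold_pa == hot_pa:
            return
        cold_la = self.la_to_pa.index(cold_pa)
        # Exchanging the contents costs one extra write on each physical line.
        self.data[hot_pa], self.data[cold_pa] = self.data[cold_pa], self.data[hot_pa]
        self.wc[hot_pa] += 1
        self.wc[cold_pa] += 1
        self.la_to_pa[hot_la], self.la_to_pa[cold_la] = cold_pa, hot_pa


region = SegmentSwapRegion([150, 519, 115, 210], swap_period=520)
region.write(1, "x")   # the 520th write to physical line 1 triggers a swap
```

Note that the table needs one entry and one counter per line, which is the space overhead the paragraph above refers to.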
AWL schemes leverage algebraic mapping functions to
randomly generate the physical address for a given logical
address. The space overhead is extremely low, since an algebraic
function implemented in a space-efficient hardware structure
replaces the address-mapping table of the table-based
wear-leveling algorithms. The representative AWL schemes include
region-based Start-Gap (RBSG) [29] and two-level Security Refresh
(TLSR) [33], as shown in Fig. 1(b) and 1(c). RBSG always
swaps a memory line with its neighboring line, which is
easily attacked by maliciously-contrived code through simple
buffer-overflow detection [33]. To defend against such
malicious attacks, TLSR uses dynamically generated random
keys and XOR operations to change address mappings in a
more unpredictable way. However, as the number of regions
increases, a pure AWL scheme usually fails to balance write
traffic among the regions, which causes the lines of the
heavily-written regions to be worn out much earlier than others.
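
To make the neighbor-swapping behavior concrete, the following is a minimal sketch of the basic Start-Gap mechanism within one region, written from the published Start-Gap description as we understand it rather than taken from this paper. The class and the parameter psi (the number of writes between gap movements) are illustrative names, and the per-region randomization of RBSG is omitted.

```python
class StartGap:
    """Toy Start-Gap remapping over n lines plus one spare (gap) line."""

    def __init__(self, n, psi):
        self.n, self.psi = n, psi
        self.start, self.gap = 0, n      # the gap starts at the spare line n
        self.mem = [None] * (n + 1)
        self.writes = 0

    def pa(self, la):
        # Algebraic mapping: only two registers (start, gap), no table.
        p = (la + self.start) % self.n
        return p + 1 if p >= self.gap else p

    def read(self, la):
        return self.mem[self.pa(la)]

    def write(self, la, value):
        self.mem[self.pa(la)] = value
        self.writes += 1
        if self.writes % self.psi == 0:
            self._move_gap()

    def _move_gap(self):
        if self.gap == 0:
            # Wrap around: the last line moves to line 0 and start advances.
            self.mem[0] = self.mem[self.n]
            self.mem[self.n] = None
            self.gap = self.n
            self.start = (self.start + 1) % self.n
        else:
            # The neighbor just before the gap moves into the gap.
            self.mem[self.gap] = self.mem[self.gap - 1]
            self.mem[self.gap - 1] = None
            self.gap -= 1


sg = StartGap(n=4, psi=1)
for la, value in enumerate("ABCD"):
    sg.write(la, value)
for _ in range(10):
    sg.write(0, "A")
```

Because every movement shifts a line to an adjacent position, the physical location of a logical line is predictable, which is the weakness the paragraph above attributes to RBSG.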

Figure 2: The state-of-the-art hybrid wear-leveling schemes
((a) PCM-S, (b) MWSR).

HWL schemes, such as PCM-S [35] and MWSR [42] shown in
Fig. 2, combine the algebraic and table-based wear-leveling
algorithms. They use a mapping table to keep track of
the mapping relationship between the logical region address
of a line and the physical region address of its corresponding
physical line, and leverage an algebraic function to obtain the
physical location of lines within each region according to the
given logical address. In general, the physical address offset
(pao) of the memory lines within the region can be obtained
through pao = lao ⊕ key, where lao represents the logical
address offset and key denotes the offset parameter within a
region. The HWL algorithms disperse writes across the entire
memory by randomly exchanging the regions and shifting the
location of its lines simultaneously [35].
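
The intra-region mapping pao = lao ⊕ key can be checked with a few lines of code. This is a small sketch under the assumption that the region size is a power of two and the key is smaller than the region size; the constants are illustrative values, not parameters from the paper.

```python
def xor_offset(lao: int, key: int) -> int:
    """Physical address offset inside a region: pao = lao XOR key."""
    return lao ^ key


REGION_LINES = 4   # hypothetical region size; must be a power of two
KEY = 0b11         # hypothetical per-region key, 0 <= KEY < REGION_LINES

layout = [xor_offset(lao, KEY) for lao in range(REGION_LINES)]
# XOR with a fixed key is a bijection on [0, REGION_LINES): no two logical
# offsets collide, and applying the same key again recovers the logical offset.
recovered = [xor_offset(pao, KEY) for pao in layout]
```

Changing the key of a region therefore relocates every line inside the region while the region-level mapping table needs only one (region address, key) pair per region.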
2.2 Problems of Wear-leveling Algorithms on
MLC-based NVM
The above-mentioned wear-leveling algorithms work well for
SLC-based NVM. However, we observe that these algorithms
exhibit serious security vulnerabilities and a shortened lifetime
on MLC-based NVM, due to the decreased cell endurance and
increased device capacity of MLC-based NVM, as elaborated
next.
• Decreased cell endurance. The MLC technique de-
creases NVM cell endurance by two orders of magnitude
[11, 22]. This weakened endurance leads to insufficient
numbers of data exchanges across the entire memory
for the existing wear-leveling algorithms because the
number of data exchanges is proportional to the cell en-
durance, which results in serious write imbalance and
severely reduces the lifetime of MLC-based NVM sys-
tems.
• Increased device capacity. Current trends suggest
that the capacity of a single bank and of an NVM system
is likely to keep increasing, potentially exponentially, with
advances in manufacturing technology and the requirements
of multithreaded applications. Thus, to ensure sufficient data
exchanges within and among regions in the entire memory
space, the wear-leveling algorithms must increase
the number of regions, a number that is proportional to
NVM capacity. However, as the number of regions
increases, the hardware overhead increases proportionally.
The space and hardware overhead can become unacceptably
high for practical systems.

Figure 3: The normalized lifetime of a 64GB NVM system
with TLSR algorithm under the BPA program, for (a) Wmax = 10^6
and (b) Wmax = 10^5. (x-axis: the number of regions, 16K to 2M;
y-axis: normalized lifetime (%); legend: swapping period
(write overhead): 8 (15.6%), 16 (9.4%), 32 (6.25%), 64 (4.7%).)

To quantitatively analyze and understand the security
vulnerability problem of the existing wear-leveling algorithms,
we conduct an experiment to evaluate the lifetime of MLC-based
NVM devices under the Repeated Address Attack
(RAA) [30] and Birthday Paradox Attack (BPA) [34]. RAA
is an attack program that writes data to the same address
repeatedly. BPA randomly selects logical addresses and
repeatedly writes to each one until it is remapped
to another physical address.
1) RAA risk for Segment Swapping and RBSG. Since
Segment Swapping does not change the inter-segment
offset address, the data written by an RAA program always
land on the physical memory lines with the same offset address
among the segments. These memory lines are worn out at an
early stage. RBSG, which adopts a static address mapping
algorithm, fails to defend against the RAA program since the
attacked physical address cannot be migrated across the entire
address space. The attacked region then receives a
disproportionately large number of writes and fails within
several hours. Therefore, we do not run the experiments on
RBSG and Segment Swapping, as they are obviously unsuitable
for large-capacity MLC-based NVM.
Since the TLSR, PCM-S and MWSR algorithms can effectively
migrate the attacked memory lines across the entire space to
resist the RAA program, we use the BPA program to evaluate the
lifetime of the MLC-based NVM system. We simulate a 64GB
MLC-based NVM with 32 2GB banks and 256M memory
lines, including 4M spare lines that tolerate some worn-out
memory lines and prevent early failures. A line fails
when its write count reaches its write limit. The NVM fails
when there are not enough spare lines to replace the failing
lines. With an assumed write limit of 10^5 and 10^6 for each
cell [26,40], the ideal lifetime of this NVM system can be
derived to be 2.5 months and 25 months, respectively, with 1GBps
write traffic. For TLSR, the outer-level swapping period
is fixed at 32 and the inner-level swapping period varies from
8 to 64, while the accumulated number of regions increases
from 16K to 64M.

Figure 4: The normalized lifetime of a 64GB NVM system
with PCM-S and MWSR under the BPA program ((a) 10^6
endurance, (b) 10^5 endurance).

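
The ideal lifetimes quoted above follow from simple arithmetic, sketched here as a sanity check. The calculation assumes perfectly uniform wear, ignores the spare lines and any wear-leveling write overhead, and uses a 30-day month; these simplifications are ours.

```python
CAPACITY_BYTES = 64 * 2**30           # 64GB NVM
WRITE_RATE = 1 * 2**30                # 1GBps write traffic, in bytes per second
SECONDS_PER_MONTH = 30 * 24 * 3600    # assumption: 30-day month


def ideal_lifetime_months(write_limit: int) -> float:
    # With uniform wear, every byte can be written write_limit times
    # before the first cell reaches its limit.
    total_writable_bytes = CAPACITY_BYTES * write_limit
    return total_writable_bytes / WRITE_RATE / SECONDS_PER_MONTH


lifetime_1e5 = ideal_lifetime_months(10**5)   # about 2.5 months
lifetime_1e6 = ideal_lifetime_months(10**6)   # about 25 months
```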
2) Lifetime shortening for TLSR. Fig. 3 shows the
normalized lifetime (i.e., relative to the ideal lifetime) of an
MLC-based NVM system with the TLSR algorithm under the BPA
program. The experimental results indicate that the lifetime of
the MLC-based NVM system first increases and then
decreases as the number of regions grows. When
the number of regions is 32K, the MLC-based NVM system
achieves the best lifetime, which means that the write counts
of the memory lines within and among the regions are well
balanced. In addition, the swapping period has a great
impact on NVM lifetime. When the number of regions is small
(i.e., a region contains many memory lines), a short swapping
period increases the number of data exchanges, and thus
achieves better wear leveling. However, when the number
of regions is large, a short swapping period improves NVM
lifetime only slightly. On the contrary, it
greatly increases the number of data exchanges, incurring
many extra writes and thus reducing the lifetime of the NVM
system. Furthermore, as the number of regions increases,
the write distribution among the regions becomes more uneven,
and the heavily written regions are easily worn out.
As shown in Fig. 3(a), the best lifetime of the NVM system
is 42% of the ideal lifetime when the number of regions is 32K
and the swapping period is equal to 8. However, this comes
at the cost of a 15.6% extra write overhead, which results in
a severe performance degradation. As the swapping period
increases to 32, the write overhead decreases to 6.25%, but the
system lifetime decreases to no more than 25.4% of the ideal
lifetime with the configuration of 64K regions. When the cell
endurance decreases to 10^5, the NVM system using TLSR
lasts for only 4.6% of the ideal lifetime, as shown in Fig. 3(b).
Thus, TLSR is not suitable for wear leveling
on MLC-based NVM.
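
The write overheads reported for Fig. 3 (15.6%, 9.4%, 6.25% and 4.7% for inner-level swapping periods of 8, 16, 32 and 64) are consistent with a simple model in which each of the two levels adds one extra line write per swapping period. The text does not state this formula, so the sketch below is our inference and should be read as an assumption.

```python
OUTER_PERIOD = 32   # outer-level swapping period, fixed in the experiment


def write_overhead(inner_period: int, outer_period: int = OUTER_PERIOD) -> float:
    # Assumption: each level performs one extra line write
    # for every `period` demand writes.
    return 1 / inner_period + 1 / outer_period


overheads = {inner: write_overhead(inner) for inner in (8, 16, 32, 64)}
for inner, overhead in overheads.items():
    print(f"inner period {inner:2d}: {overhead * 100:.3f}% extra writes")
```

Under this model, halving the inner swapping period from 16 to 8 raises the overhead by 6.25 percentage points, which illustrates why very short periods cost both performance and lifetime.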
3) Significant on-chip storage overhead for PCM-S and
MWSR. Hybrid wear-leveling schemes need to store all ad-
dress mappings in an on-chip cache. Specifically, PCM-S
needs to record the physical address and internal offset of
each logical region. MWSR needs to store two physical ad-
dresses (i.e., the physical addresses of the previous and current
rounds), two offset addresses (i.e., the internal offsets of the
previous and current rounds) and a write counter, for each log-
ical region. Therefore, the space overheads of on-chip cache
in PCM-S and MWSR algorithms are proportional to the num-
ber of regions. Using smaller wear-leveling granularity is able
to increase the NVM lifetime but increases the number of
regions and thus needs a larger on-chip cache. We evaluate
the NVM lifetime when PCM-S and MWSR are performed
on MLC-based NVM with different on-chip cache sizes, as
shown in Fig. 4. We observe that PCM-S only achieves 72%
of the ideal lifetime for the MLC-based NVM with 10^6 endurance,
and 41% of the ideal lifetime for the MLC-based NVM with 10^5
endurance, even with a very large on-chip cache, i.e., 4MB.
MWSR achieves a lower lifetime than PCM-S due to the larger
storage overhead of its address mappings.
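
A rough size estimate shows why the mapping tables scale with the number of regions. The sketch below assumes power-of-two region counts and minimal-width binary fields; the exact entry encodings of PCM-S and MWSR are not given in the text, and counter_bits in particular is a hypothetical width.

```python
TOTAL_LINES = 256 * 2**20   # 256M memory lines, as in the simulated 64GB NVM


def field_bits(count: int) -> int:
    """Bits needed to index `count` items (count must be a power of two)."""
    return count.bit_length() - 1


def pcm_s_entry_bits(num_regions: int) -> int:
    # One physical region address plus one internal offset per logical region.
    return field_bits(num_regions) + field_bits(TOTAL_LINES // num_regions)


def mwsr_entry_bits(num_regions: int, counter_bits: int = 16) -> int:
    # Two physical addresses, two offsets and a write counter per logical region.
    return 2 * pcm_s_entry_bits(num_regions) + counter_bits


def table_bytes(num_regions: int, entry_bits: int) -> float:
    return num_regions * entry_bits / 8


regions = 2**20   # hypothetical configuration with 1M regions
pcm_s_size = table_bytes(regions, pcm_s_entry_bits(regions))
mwsr_size = table_bytes(regions, mwsr_entry_bits(regions))
```

Under these assumptions a PCM-S entry is 28 bits regardless of the region count, so the table grows linearly with the number of regions, and an MWSR entry is more than twice as large.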
In summary, Segment Swapping and RBSG are vulnerable
to RAA. TLSR causes significant NVM lifetime reduction.
Hybrid wear-leveling algorithms, including PCM-S and
MWSR, have the potential to achieve a long lifetime but
cause significant on-chip storage overhead.
3 Design and Implementation
To improve the NVM lifetime and reduce the on-chip storage
overhead of hybrid wear-leveling algorithms, a naive solution,
called the naive wear-leveling scheme (NWL), is to store all
address mappings in the NVM and maintain recently-accessed
mapping entries in an on-chip cache. Nevertheless, NWL
often exhibits poor cache utilization under applications with
substantial random access patterns, resulting in severe system
performance degradation due to the long latency of accessing
address mappings in NVM. To effectively address this problem,
we propose a self-adaptive wear-leveling scheme, SAWL, to
significantly improve the cache hit rate by dynamically and
adaptively tuning the wear-leveling granularities at runtime
based on the workload. The SAWL scheme enables the MLC-
based NVM systems to attain high performance and long
lifetime simultaneously. In what follows, we describe the
tiered architecture and the self-adaptive wear-leveling scheme
in detail.
3.1 An Architectural Overview
SAWL is a tiered wear-leveling architecture consisting of a
data exchange module, an address translation module and a
region reconfiguration module, as shown in Fig. 5. The data
exchange module is capable of implementing arbitrary
hybrid wear-leveling algorithms. Since the address translation
and region reconfiguration of PCM-S are relatively simple,
we adopt the PCM-S algorithm in the data exchange module. The
detailed data exchange algorithms are described in Section 2,
and the relevant addresses, depending on their temporal and
spatial properties, are stored in an Integrated Mapping Table
(IMT), a Cached Mapping Table (CMT) and a Global
Translation Directory (GTD), which are managed by the address
translation module.
SAWL uses translation lines to record the locations in
which the user data are actually stored. To prevent the
translation lines from being worn out, the NVM system must
independently perform hybrid wear leveling for the translation
lines. Hence, a GTD table is needed to record the relationship
between the logical translation line memory address (tlma)
and its physical counterpart (tpma). The GTD table can be
entirely stored in the SRAM due to its extremely low space
overhead. In the meantime, to prevent the loss or corruption
of the metadata (e.g., data stored in the CMT, GTD and IMT
tables) due to power failures, the updated metadata are
written back to the NVM devices. With a long swapping
period, such updates are infrequent and have negligible
influence on NVM performance. Ensuring crash consistency
is an important and challenging problem that has been
discussed in [24, 38, 47] and is beyond the scope
of this paper. We assume that there is a battery backup in the
memory controller to refresh metadata during a power failure,
like existing schemes [24, 36].
IMT records the relationship between a logical region
number (lrn) and its corresponding physical region number (prn),
where lrn represents the N Most Significant Bits (MSB) of
the logical memory address and an lrn can be mapped to
any physical region. In addition, IMT records the offset
parameter (key) of each region, through which we obtain the
corresponding intra-regional physical address offset. The lrn
is implicitly indicated by the position of an entry in IMT.
Assuming a translation line in IMT contains 6 translation
entries (determined by the size of a translation entry), the
first line initially contains lrn0 to lrn5. After several
translation-line remappings, the first line may contain
lrn6k to lrn6k+5, where k is an integer. We obtain the tpma
from the GTD table using the tlma and finally get the user data
line information from the IMT table. The size of the IMT
table, e.g., tens to hundreds of megabytes, is
proportional to the NVM capacity and too large to be entirely
held in the memory controller. Therefore, the IMT table is
stored in a reserved space of the NVM devices with its entries
packed into memory lines that are called translation lines, in
contrast to the data lines that hold users’ data. The entries are
placed in ascending order of the lrn to facilitate address
lookup. To alleviate the performance degradation induced
by long address translation latency, a naive scheme is to use
DRAM or NVM to hold the complete IMT table and a CMT
table in the SRAM to buffer the recently-used IMT entries.
The entries in CMT are organized in an LRU stack, and a new
entry cached from NVM evicts the least-recently-used
entry in the CMT.
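
The lookup path through CMT, GTD and IMT described above can be sketched as follows. This is a simplified software model, not the hardware design: the class name, the dictionary-based tables and the fixed region size LINES_PER_REGION are assumptions, and merged regions (wlg) are ignored here.

```python
from collections import OrderedDict

ENTRIES_PER_TLINE = 6   # translation entries per translation line (example above)
LINES_PER_REGION = 4    # hypothetical region size; must be a power of two


class Translator:
    def __init__(self, imt_lines, gtd, cmt_capacity):
        self.imt_lines = imt_lines    # tpma -> list of (prn, key); stands in for NVM
        self.gtd = gtd                # tlma -> tpma; small enough for SRAM
        self.cmt = OrderedDict()      # lrn -> (prn, key), kept in LRU order
        self.cmt_capacity = cmt_capacity
        self.hits = self.misses = 0

    def translate(self, logical_line):
        lrn, lao = divmod(logical_line, LINES_PER_REGION)
        if lrn in self.cmt:
            self.hits += 1
            self.cmt.move_to_end(lrn)             # mark as most recently used
        else:
            self.misses += 1
            tlma, slot = divmod(lrn, ENTRIES_PER_TLINE)
            tpma = self.gtd[tlma]                 # locate the translation line
            self.cmt[lrn] = self.imt_lines[tpma][slot]
            if len(self.cmt) > self.cmt_capacity:
                self.cmt.popitem(last=False)      # evict the least recently used
        prn, key = self.cmt[lrn]
        return prn * LINES_PER_REGION + (lao ^ key)


# One translation line holding lrn0..lrn5, stored at tpma 3 (made-up values).
imt = {3: [(3, 0), (8, 1), (2, 1), (0, 0), (1, 0), (4, 0)]}
t = Translator(imt_lines=imt, gtd={0: 3}, cmt_capacity=2)
results = [t.translate(line) for line in (0, 1, 5, 8)]
```

Each miss costs an extra access to a translation line in NVM, which is why the hit rate of the CMT dominates the translation latency.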
Moreover, we use a parameter, called wear-leveling granu-
larity (wlg), to represent the range of the address space cov-
ered by each entry. When SAWL changes the wear-leveling
granularities, this parameter needs to be updated.

Figure 5: The self-adaptive tiered wear-leveling architecture
with NVM-based main memory.

Figure 6: An overview of self-adaptive wear leveling.

3.2 Self-Adaptive Wear Leveling
With the limited cache space, only a relatively small number of
mapping entries can be held in the cache. When applications
exhibit very weak locality and the requested addresses are
sparsely dispersed over the entire address space, the NVM
system exhibits very poor cache hit rate and performance. To
address this problem, we propose the SAWL scheme.
Based on the experimental results shown in Fig. 4, we ob-
serve that under a hybrid wear-leveling algorithm, the lifetime
of an NVM system is generally positively correlated to the
number of regions. In other words, the larger the number of
regions is, the closer the NVM system approaches its ideal
lifetime. However, with an increasing number of regions, the
number of memory lines within a region is reduced. Hence,
the address space covered by each of the Cached Mapping
Table (CMT) entries, i.e., wear-leveling granularity, decreases
accordingly. Since the number of CMT entries is fixed, the
whole address space covered by the SRAM cache decreases,
which reduces the cache hit rate.
To address this performance problem, the design goal of
SAWL is to automatically tune the region size to improve
NVM performance whenever the SRAM cache demonstrates
poor hit rate under some applications, as shown in Fig. 6. To
achieve this goal, SAWL carries out a region-merge operation
to merge two or more regions into a single larger region, thus
allowing an Integrated Mapping Table (IMT) entry to cover
more addresses and increasing the wear-leveling granularity.
On the other hand, since a coarse wear-leveling granularity
reduces wear leveling, SAWL counters this by carrying out a
region-split operation to divide a large region into multiple
smaller regions when the cache hit rate continues to climb be-
yond a predefined threshold and the hits have become severely
unbalanced within the region. In principle, the NVM lifetime
could also be used as an indicator to tune wear-leveling
granularities. However, the lifetime is difficult to measure at
runtime, since it is generally calculated by running many
requests until the NVM cells are worn out. Since the cache hit
rate is easy to capture, we adopt it as the indicator, which
reflects the performance decrease of the NVM system.
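
The adaptation rule described above can be summarized as a small decision function. This is only a sketch: the threshold values, the use of a max/mean ratio as the imbalance measure and the function name are hypothetical, since the text specifies only that merging follows a poor hit rate and splitting follows a high hit rate with severe imbalance.

```python
LOW_HIT_RATE = 0.90     # hypothetical: below this, regions are merged
HIGH_HIT_RATE = 0.99    # hypothetical: above this, a split is considered
IMBALANCE_LIMIT = 4.0   # hypothetical max/mean ratio for "severely unbalanced"


def choose_action(hits, accesses, per_subregion_counts):
    """Return 'merge', 'split' or 'keep' for one monitoring window."""
    hit_rate = hits / accesses if accesses else 1.0
    if hit_rate < LOW_HIT_RATE:
        return "merge"   # coarser granularity: each CMT entry covers more addresses
    if not per_subregion_counts:
        return "keep"
    mean = sum(per_subregion_counts) / len(per_subregion_counts)
    imbalance = max(per_subregion_counts) / mean if mean else 1.0
    if hit_rate > HIGH_HIT_RATE and imbalance > IMBALANCE_LIMIT:
        return "split"   # finer granularity: restore wear-leveling quality
    return "keep"


decisions = [
    choose_action(50, 100, [1, 1, 1, 1]),
    choose_action(999, 1000, [100, 1, 1, 1, 1, 1, 1, 1]),
    choose_action(999, 1000, [10, 10, 10, 10]),
]
```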
1) Region-merge operation. To perform the region-merge
operation, SAWL first picks the physical location for the
new region. This physical location must be mapped by one
of the non-merged logical locations, to avoid choosing
physical locations that are occupied by other, already merged
regions. Then, SAWL chooses the closest non-merged logical
location and merges them.
SAWL further swaps the data of the new region with the
data of the new location, ensuring that the logical addresses
and their physical counterparts of the memory lines within the
newly merged region satisfy the algebraic mapping. Finally,
we update the address-mapping table on the NVM and the
relevant CMT entries on the SRAM. Fig. 7 depicts an example
of the region-merge operation. As shown in Fig. 7 (a), there
are three logical regions, e.g., lrn0, lrn1 and lrn5, which are
mapped to prn3, prn8 and prn2, respectively. To merge lrn0
and its closest logical neighbour lrn1 into one super region, we
pick out a large physical space for the newly merged region
(e.g., prn2 and prn3). We then move out the lines E and F
from prn2 (the data can be temporarily stored on memory
controller or DRAM cache). We also migrate the lines C, D
of lrn1 to prn2, and rotate all the memory lines within prn2
and prn3 to ensure addresses of the logical memory lines
within the two regions satisfy the algebraic mapping function.
Finally, we write back the data of lines E and F to prn8 and
update the corresponding entries in IMT and CMT tables, as
shown in Fig. 7 (b). After this, the wlg parameter of lrn0 is
changed to 4, which means the lrn0 entry covers four memory
addresses at present. Since lrn0 and lrn1 belong to the same
large region, the physical region address and internal offset
in IMT are identical. During address translation, the NVM
obtains the real wear-leveling granularity of a region based on
the number of adjacent regions which have the same address
information.
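The merge walk-through above can be sketched as follows. This is a minimal model rather than the authors' implementation: the entry layout `(q, prn, key)`, the convention that `prn` counts units of `q` lines, and the helper names `line_addr`, `read_region`, `write_region` and `merge` are all assumptions made for illustration. The values mirror our reading of Fig. 7 (two-line regions, lines A to F).

```python
def line_addr(entry, lao):
    """Physical line address for logical offset `lao` of a region entry."""
    q, prn, key = entry              # assumption: prn counts units of q lines
    return prn * q + (lao ^ key)     # algebraic (XOR) mapping inside the region

def read_region(mem, entry):
    """Return the region's lines in logical order."""
    return [mem[line_addr(entry, lao)] for lao in range(entry[0])]

def write_region(mem, entry, lines):
    """Place lines so that they satisfy the XOR mapping of `entry`."""
    for lao, data in enumerate(lines):
        mem[line_addr(entry, lao)] = data

def merge(mem, imt, lrn_a, lrn_b, victim, new_key):
    """Merge adjacent logical regions lrn_a and lrn_b = lrn_a + 1.

    `victim` is the logical region occupying the physical neighbour of
    lrn_a; its data moves to the space freed by lrn_b.
    """
    ea, eb, ev = imt[lrn_a], imt[lrn_b], imt[victim]
    q = ea[0]
    assert ev[1] == ea[1] ^ 1, "victim must occupy lrn_a's physical neighbour"
    merged_lines = read_region(mem, ea) + read_region(mem, eb)
    victim_lines = read_region(mem, ev)       # buffered, e.g. in the controller
    big = (2 * q, ea[1] // 2, new_key)        # aligned pair of physical regions
    write_region(mem, big, merged_lines)      # rotate lines into XOR order
    moved = (q, eb[1], ev[2])                 # victim takes lrn_b's old place
    write_region(mem, moved, victim_lines)
    imt[lrn_a] = imt[lrn_b] = big             # identical info marks one region
    imt[victim] = moved

# Initial state as read from Fig. 7 (a): lrn0 -> prn3, lrn1 -> prn8, lrn5 -> prn2.
imt = {0: (2, 3, 0), 1: (2, 8, 1), 5: (2, 2, 1)}
mem = {}
write_region(mem, imt[0], ["A", "B"])
write_region(mem, imt[1], ["C", "D"])
write_region(mem, imt[5], ["E", "F"])
merge(mem, imt, 0, 1, victim=5, new_key=3)
```

After the call, physical lines 4 to 7 (prn2 and prn3) hold the merged region and lines 16 and 17 (prn8) hold the displaced lines E and F.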
2) Region-split operation. In contrast to the region-merge
operation, the region-split operation splits a large region
into multiple smaller regions by migrating the memory lines
within the large region. More specifically, if we use the XOR
operation to conduct address mapping, there is no need to
migrate the memory lines within the old large region since
the XOR operation makes the memory lines within each post-split
smaller region contiguous in the physical space. We only
need to update the address-mapping table and the CMT entries,
and the memory lines have already satisfied the algebraic
mapping function.

Figure 7: An example of the region-merge and address update
operations: (a) region-merge operation, (b) address update
operation. lrn is implicitly indicated by IMT.

Figure 8: An example of the region-split and address update
operations: (a) region-split operation, (b) address update
operation. lrn is implicitly indicated by IMT.

Fig. 8 describes an example of a simple
region-split operation. As shown in Fig. 8 (a), the large region
lrn0 is split into two sub-regions (lrn0 and lrn1). Given that
the memory lines in lrn0 and lrn1 remain mapped to the same
physical sub-regions, there is no need to migrate the memory
lines if we keep the original mapping relationship. Since the
wear-leveling granularities of lrn0 and lrn1 change, we only
update the relevant entries in the IMT and CMT tables. The new
physical address of each sub-region is obtained by XORing the
region address with the most significant bit (MSB) of the
offset parameter, i.e., the key of lrn0 and lrn1. For example,
the physical address of lrn0 is calculated as 2⊕1, where
'1' denotes the MSB of the old key of lrn0. The new key is
given by the least significant bits (LSB) of the old key, e.g.,
the old key of lrn0 and lrn1 is 3 ('11') and its LSB is '1', as
shown in Fig. 8 (b). After the region-split completes, lrn0
and lrn1 no longer belong to a large region because their physical
addresses are different. In contrast to region-merge operations,
the overhead of region-split operations is extremely low.
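To see why no data migration is needed, the split can be checked numerically. The sketch below is illustrative only: `split_region` and `pma` are hypothetical names, regions are assumed to hold a power-of-two number of lines, and `prn` is assumed to count units of the region size.

```python
def pma(q, prn, key, lao):
    """Physical line address under the XOR mapping (prn in units of q lines)."""
    return prn * q + (lao ^ key)

def split_region(q, prn, key):
    """Split a merged region of q lines (q a power of two) into two halves.

    Returns [(prn0, key0), (prn1, key1)] for the lower and upper logical
    half, with each new prn counted in units of q // 2 lines.
    """
    half = q // 2
    msb = key // half      # most significant bit of the old key
    low = key % half       # remaining low bits become the new key
    return [(2 * prn + (i ^ msb), low) for i in (0, 1)]

# A 4-line region with key 3 ('11'), occupying physical lines 4..7.
subs = split_region(4, 1, 3)
# Every line keeps its physical address after the split.
unchanged = all(
    pma(4, 1, 3, lao) == pma(2, subs[lao // 2][0], subs[lao // 2][1], lao % 2)
    for lao in range(4)
)
```

With these inputs the two halves land in two-line regions 3 and 2, each with key 1, which agrees with the 2⊕1 calculation in the text.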
To make the region-split operation efficient, we employ
two registers to record the cache hit counts of the first and the
second half of the CMT entries queue, respectively. Since the
entries in CMT are organized in an LRU stack, usually the hit
count of the first sub-queue is larger than that of the second
one. If the first one is far larger than the second one, it means
that the addresses in the second sub-queue are rarely accessed,
and splitting the region is beneficial. Otherwise, the current
region size is of a satisfactory wear-leveling granularity.
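The two-register heuristic can be sketched as below. This is a software model under stated assumptions: the class name `HalfCountingCMT`, the `load_entry` callback and the `ratio` used to decide that one count is "far larger" are hypothetical, since the text does not give a concrete threshold here. A hardware design would know an entry's stack depth directly instead of searching for it.

```python
from collections import OrderedDict

class HalfCountingCMT:
    """LRU-ordered mapping cache with hit counters for each half of the stack."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # most recently used entry is last
        self.first_half_hits = 0       # hits in the MRU half of the stack
        self.second_half_hits = 0      # hits in the LRU half of the stack

    def access(self, lrn, load_entry):
        """Look up lrn; return True on a hit, False on a miss."""
        if lrn in self.entries:
            # Depth 0 is the MRU position.
            depth = len(self.entries) - 1 - list(self.entries).index(lrn)
            if depth < self.capacity // 2:
                self.first_half_hits += 1
            else:
                self.second_half_hits += 1
            self.entries.move_to_end(lrn)
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the LRU entry
        self.entries[lrn] = load_entry(lrn)
        return False

    def split_is_beneficial(self, ratio=10):
        """Assumed rule: the first half must collect `ratio` times more hits."""
        return self.first_half_hits > ratio * max(self.second_half_hits, 1)
```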
To avoid performing the region-split and region-merge operations
too frequently, SAWL tunes the region size only
when the cache hit rate stays over the high threshold or below
the low threshold for a certain number of requests. Considering
the relatively large region-merge overhead, we only merge
the cached regions rather than all the regions in the entire
memory.

Figure 9: An example of size scaling for IMT entries: (a) initial
configuration (region address in m bits, internal offset in n bits),
(b) region-merge operation (m−1 bits and n+1 bits), (c) region-split
operation (m bits and n bits).
3) Implementation details. For the concrete implementa-
tion, the NVM systems use the reserve space to store the IMT
table. The capacity of IMT is determined by an initial wear-
leveling granularity (P), i.e., the number of IMT entries equals
M/P, where M denotes the number of lines in the entire mem-
ory. In the working process, the size of IMT doesn’t change.
Otherwise, the address migration incurs massive space over-
head, and the address translation becomes extremely complex.
Fig. 9 shows an example of address update for IMT. Each IMT
entry records the address information (including the region
address and offset parameter) according to the initial configuration.
For example, m bits keep the region address, and n bits
record the offset parameter. The sum of m and n is fixed and
determined by the NVM capacity, i.e., $m + n = \log_2 M$. After
a region-merge operation completes, the region size increases
and the number of regions decreases. Thus we use fewer
bits to record the region address and more
bits to record the offset parameter. As shown in Fig. 9 (b),
NVM uses m−1 and n+1 bits to record the region address
and offset parameter, respectively. To indicate the sub-regions
belonging to a large region, their address information is identical.
After region splitting completes, the number of regions
increases and more bits are required to record the region address,
while fewer bits are used to keep the intra-regional offset,
as shown in Fig. 9 (c). In addition, the address information
of the adjacent regions is different. It is worth noting that the
minimum wear-leveling granularity cannot be smaller than
the initial configuration, because the shortened wear-leveling
granularity will significantly increase the size of IMT table
and the NVM does not have sufficient reserved space to store
the increased address-mapping table. The region-split and
region-merge operations result in a dynamic tuning of the
wear-leveling granularities. Given that the number of adjacent
regions that have the same address information is n, the real
wear-leveling granularity (Q) is calculated by the formula of
Q= n×P.
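A minimal sketch of how Q could be recovered from the IMT, assuming the table is a list indexed by lrn whose elements are opaque address-information tuples. The function name `real_granularity` and the sample table are illustrative and not taken from the paper.

```python
def real_granularity(imt, lrn, P):
    """Return Q = n * P, where n is the run of adjacent identical entries."""
    info = imt[lrn]
    lo = lrn
    while lo > 0 and imt[lo - 1] == info:
        lo -= 1
    hi = lrn
    while hi + 1 < len(imt) and imt[hi + 1] == info:
        hi += 1
    return (hi - lo + 1) * P

# Illustrative table: entries 0 and 1 share address information (merged),
# the remaining entries are distinct (non-merged). Values are made up.
example_imt = [(2, 3), (2, 3), (9, 0), (4, 1), (7, 0), (8, 1)]
q_merged = real_granularity(example_imt, 1, P=2)
q_single = real_granularity(example_imt, 5, P=2)
```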
3.3 Adaptive Address Mapping Algorithm
Figure 10: The workflow of the address translation (Steps 1 to 7).

The wear-leveling process makes the relationship between a
logical address and its physical counterpart dynamic. When
a request arrives at the memory controller, the requested address
must be translated into a physical address that is used
to access the underlying NVM devices. This address translation
in SAWL is facilitated by the Global Translation Directory
(GTD), Cached Mapping Table (CMT) and Integrated
Mapping Table (IMT). The 7-step workflow of the address-mapping
algorithm is shown in Fig. 10. SAWL first computes
the logical region number ($lrn = lma/P$) according to
the given logical memory address ($lma$), and then obtains the
logical translation line address ($tlma = lrn/(P \times K)$), where P
represents the initial wear-leveling granularity and K is the number
of entries within a translation line, which is 6 in our design
(Step 1). The variable lrn is used to check the CMT table
to see if the translation entry is cached. If yes, the values of
the real wear-leveling granularity (Q) and the address information
(D, the combination of the physical region number and offset
parameter) are obtained by accessing the SRAM cache
(Step 2). If it is a miss, the physical translation line address
(tpma) of the corresponding translation line is obtained
from the GTD table. Using the tpma value, the translation
line can be read from the IMT table in DRAM or NVM devices
and placed at the top of the LRU stack (Step 3). From this
line, the Q and D values of the requested address are found
(Step 4). For example, if the lrn we compute is 6k+m, the
mth entry in this line is needed. The physical region number
(prn) and the offset parameter (key) are obtained based on
the formulas $prn = D/Q$ and $key = D \bmod Q$, respectively (Step
5). Thus, the logical address offset (lao) and physical address
offset (pao) are $lao = lma \bmod Q$ and $pao = lao \oplus key$ (Step 6).
Finally, SAWL obtains the physical memory address (pma)
by combining the prn and pao using $pma = prn \times Q + pao$
(Step 7).
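The arithmetic of Steps 1 to 7 can be sketched as follows. This is a simplified model, not the authors' implementation: the CMT and IMT are plain dictionaries keyed by lrn, the GTD indirection and LRU replacement are omitted, and the function name `translate` is hypothetical. Q is assumed to be a power of two so that the XOR stays inside the region.

```python
def translate(lma, P, cmt, imt):
    """Translate a logical line address; return (pma, cmt_hit)."""
    lrn = lma // P                   # Step 1: logical region number
    entry = cmt.get(lrn)             # Step 2: probe the cached mapping table
    hit = entry is not None
    if not hit:
        entry = imt[lrn]             # Steps 3-4: fetch the entry from the IMT
        cmt[lrn] = entry             # cache it (no eviction in this sketch)
    Q, D = entry
    prn, key = divmod(D, Q)          # Step 5: prn = D / Q, key = D % Q
    lao = lma % Q                    # Step 6: logical offset in the region
    pao = lao ^ key                  #         physical offset via XOR
    return prn * Q + pao, hit        # Step 7: pma = prn * Q + pao

# A merged 4-line region (Q = 4) placed at prn = 1 with key = 3, i.e. D = 7.
# Both initial regions (P = 2) carry the same entry, as described in the text.
imt = {0: (4, 7), 1: (4, 7)}
cmt = {}
results = [translate(lma, 2, cmt, imt) for lma in range(4)]
```

The first access to each initial region misses and fills the cache; the second access to the same region then hits.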
Address translation adds extra access latency to main
memory. The main overhead comes from looking up the CMT,
GTD and IMT tables. In general, the access
latency to the CMT and GTD tables is about 5 ns
due to their residence in SRAM, while a DRAM/NVM read
operation takes at least 50 ns. For the CMT table, the entries
cached on SRAM are organized in an LRU list. Given that
an SRAM query consumes 3 ns, we set 5 ns on average
for the address translation latency. Our SAWL dynamically tunes
the wear-leveling granularities (i.e., region size) to increase
the number of cached addresses, which improves the cache
hit rate and I/O performance significantly.
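As a rough worked example of why the hit rate matters, the expected translation latency can be written as a weighted average. The 5 ns hit cost is stated above and the 55 ns miss cost is the value used in Section 4.4; the function name and the linear model itself are our simplifying assumptions.

```python
def avg_translation_latency_ns(hit_rate, hit_ns=5.0, miss_ns=55.0):
    """Expected translation latency for a given CMT hit rate (0.0 to 1.0)."""
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns

latency_at_90 = avg_translation_latency_ns(0.90)   # about 10 ns
latency_at_50 = avg_translation_latency_ns(0.50)   # about 30 ns
```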
4 Performance Evaluation
4.1 Methodology
In our experiments, we use the Gem5 simulator [1] to evaluate
various wear-leveling schemes and NVMain [28] to examine
the lifetime and cache hit rate in a time-efficient way. We
evaluate state-of-the-art hybrid wear-leveling algorithms, in-
cluding the basic non-tiered architecture (BWL), i.e., PCM-S
and MWSR, naive tiered architecture (NWL) and compare
them with our SAWL algorithm on the tiered architecture.
In the following experiments, the initial wear-leveling granularity
of BWL, NWL and SAWL is set to 4 memory lines
to ensure that the lifetime of MLC-based NVM systems lasts
for a long time under the worst-case attacks. In addition, we
use NWL-4 and NWL-64 to represent the naive
wear-leveling algorithm on the tiered architecture with a region
consisting of 4 and 64 memory lines, respectively. For
the SAWL scheme, the lowest region-merge threshold is set
to 90% based on the experimental observation that the cache
hit rate of 90% marks a turning point below which the performance
of the NVM system decreases significantly. For the
region-split operation, the highest cache-hit-rate threshold
is set to 95%, because the performance evaluation indicates
that the wear-leveling algorithm has only a slight impact on NVM
performance within the boundary. The SAWL algorithm automatically
tunes the wear-leveling granularities when the
cache hit rate is above or below these thresholds for a long time.
Moreover, if the hit ratio of the first queue OR the hit ratio of
the second queue ≥ 99%, the NVM system splits the region
for endurance, thus avoiding the decrease of cache hit rate
after region-split completes.
To evaluate the performance of NVM system under general
applications, we use 14 representative applications from the
SPEC CPU2006 suite [12], which contain high memory ac-
cessing frequency with at least 100 million read/write requests
in each application. These applications have been widely used
in existing lifetime analysis experiments [16, 29, 46]. We per-
form evaluations by executing the benchmark in rate mode,
where all the eight cores execute the same benchmark [41].
4.2 Parameter Training via Sensitivity Study
The key to SAWL is to dynamically adjust the region size,
or wear-leveling granularities, by applying a combination of
region-merge and region-split operations based on the work-
load behaviors that are monitored using the observed runtime
cache hit rate. To accurately capture the runtime cache hit rate
and adjust the region size in a reliable and cost-efficient way,
SAWL relies on two critical parameters, the size of the obser-
vation window for capturing runtime cache hit rate and the
size of the settling window for reliable and efficient region-size
adjustment. In what follows we first define these parameters
and then experimentally determine their values.
Figure 11: Cache hit rate as a function of runtime obtained
from different observation window sizes SOW when running
the SPEC CPU2006 soplex benchmark in a 512KB cache.
Panels: (a) SOW = $2^{20}$, (b) SOW = $2^{22}$, (c) SOW = $2^{24}$,
(d) SOW = $2^{26}$. Each panel plots the cache hit rate (%) against
runtime (# of requests, ×$10^8$).

Figure 12: The region size adjustments as a function of the
runtime with different settling window sizes under the soplex
benchmark. Panels: (a) SSW = $2^{20}$, (b) SSW = $2^{22}$, (c) SSW = $2^{24}$,
(d) SSW = $2^{26}$. Each panel plots the region size against runtime
(# of requests, ×$10^8$) and is annotated with its average cache hit
rate (the annotations read 85.5%, 96.1%, 97.7% and 98.0%).
1) Observation Window Size. SAWL measures the cur-
rent runtime cache hit rate by calculating the percentage of
memory access requests that hit the cache out of a certain total
number of requests observed, including the most recent one.
This total number SOW of observed requests is called the size
of the observation window. We sample the runtime cache
hit rate every 100,000 requests, since our experiments show that
the measurement accuracy is not very sensitive to this interval.
However, our experiments revealed that SOW is a sensitive
parameter for the accuracy of the sampled cache hit rate. To
find an optimal value for SOW , we examine how the sam-
pled cache hit rate changes with the size of the observation
window.
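The observation window described above amounts to a sliding-window hit rate. A minimal sketch, assuming one boolean outcome is recorded per request; the class name `ObservationWindow` is hypothetical and the window is kept small in the usage below only for illustration.

```python
from collections import deque

class ObservationWindow:
    """Hit rate over the most recent `size` requests."""

    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.hits = 0

    def record(self, hit):
        if len(self.window) == self.window.maxlen:
            self.hits -= self.window[0]    # oldest outcome is about to drop out
        value = 1 if hit else 0
        self.window.append(value)
        self.hits += value

    def hit_rate(self):
        return self.hits / len(self.window) if self.window else 0.0

w = ObservationWindow(4)
for outcome in (True, True, False, False):
    w.record(outcome)
rate_full = w.hit_rate()      # 2 hits out of the last 4 requests
w.record(False)
rate_after = w.hit_rate()     # 1 hit out of the last 4 requests
```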
Fig. 11 shows the cache hit rates for different sizes of the
observation window as a function of runtime, which is defined
as the total number of requests issued thus far, when running
the SPEC CPU2006 soplex benchmark in a 512KB cache.
Specifically, as shown in Fig. 11(a), when the window size is
$2^{20}$, the cache hit rate fluctuates significantly, causing SAWL
to adjust the region size too frequently to be efficient. As
the observation window size (SOW) increases, the sampled
cache hit rate becomes more stable, but SAWL then misses the
important time points at which it needs to split or merge
regions, as indicated
by the small green circles in Fig. 11 (c) and (d). Consequently,
we choose $2^{22}$ as the size of the observation window.
2) Settling Window Size. SAWL waits for a certain number
of requests to ensure that the observed runtime cache hit rate
is sufficiently stable, so as to avoid unnecessary or
frequent region adjustments. This waiting period is called
the settling window and the number of requests to wait is
called the size of the settling window (SSW). Fig. 12 shows
the adjustments of region size as a function of the runtime
(i.e., the number of requests) with different SSW values under
the soplex workload. Specifically, as shown in Fig. 12(a), a
small settling window size, i.e., $2^{20}$, results in frequent region
size adjustments and incurs high write overhead. On the contrary,
Fig. 12(d) indicates that a large settling window size
prevents SAWL from sufficiently adjusting the region size and
obtaining high performance, since SAWL misses important time
points for splitting and merging regions. In fact, the cache hit
rate decreases to 85.5%. As a result, we argue that the settling
window sizes in Fig. 12 (b) and (c) are much better. By
training the parameters in Fig. 11 and 12, we experimentally
determine that the best SOW and SSW values are both $2^{22}$.
In order to validate the efficiency and effectiveness of the
values of SOW and SSW determined experimentally above,
we evaluate the average cache hit rates of SAWL under the
three representative benchmarks of bzip2, cactusADM and
gcc respectively. As shown in Fig. 13, the average cache
hit rates of the three workloads are 94.5%, 88% and 91.3%,
respectively, which are close to those of NWL-64. SAWL
improves the hit rates via increasing the region size when the
hit rate becomes too low. Furthermore, the average region size
of SAWL is about 16 memory lines in all workloads, which
means that the BPA lifetime of NVM is about 20 months even
under the worst-case workload.
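The settling-window logic of this section can be sketched as a small state machine over sampled hit rates. The 90% and 95% thresholds come from Section 4.1; the class name `SettlingController`, the unit of `settle_samples` (consecutive samples rather than raw requests) and the omission of the half-queue check before a split are simplifying assumptions.

```python
LOW, HIGH = 0.90, 0.95   # merge below LOW, split above HIGH (Section 4.1)

class SettlingController:
    """Request a region-size change only after the hit rate has settled."""

    def __init__(self, settle_samples):
        self.settle_samples = settle_samples
        self.below = 0    # consecutive samples under LOW
        self.above = 0    # consecutive samples over HIGH

    def update(self, hit_rate):
        """Feed one sampled hit rate; return 'merge', 'split' or None."""
        self.below = self.below + 1 if hit_rate < LOW else 0
        self.above = self.above + 1 if hit_rate > HIGH else 0
        if self.below >= self.settle_samples:
            self.below = 0
            return "merge"    # larger regions cover more addresses per entry
        if self.above >= self.settle_samples:
            self.above = 0
            return "split"    # smaller regions favour endurance
        return None

ctrl = SettlingController(settle_samples=3)
decisions = [ctrl.update(r) for r in (0.80, 0.85, 0.70, 0.97, 0.96, 0.99, 0.92)]
```

A hit rate that keeps crossing a threshold never accumulates enough consecutive samples, so it triggers no adjustment.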
4.3 NVM Lifetime
1) NVM lifetime under the BPA program. We use the BPA
program to simulate the lifetime of an NVM system under
the worst-case scenario and use the result to evaluate the robustness
of the NVM system. We use a 1MB on-chip cache
and vary the swapping period from 8 to 64. The NVM lifetimes
of PCM-S, MWSR and SAWL are shown in Fig. 14.
We observe that a smaller swapping period increases the NVM
lifetime for PCM-S and MWSR, but at the cost of high write
overhead. SAWL achieves much higher lifetime than PCM-S
and MWSR, due to storing all address mappings in NVM
and placing no limitation on the number of regions. Fig. 14 shows
that SAWL improves the lifetime by 25%∼51% (50%∼78%) of the
ideal lifetime for the MLC-based NVM system with $10^6$ ($10^5$) cell
endurance, compared with PCM-S and MWSR.

Figure 13: The runtime hit rates and region size adjustments
under the three representative benchmarks. Each panel plots the
region size and the cache hit rate (%) against runtime (# of
requests, ×$10^8$). Average cache hit rates: (a) bzip2: NWL-4 86.4%,
NWL-64 98.9%, SAWL 94.5%; (b) cactusADM: NWL-4 63%, NWL-64 95.2%,
SAWL 88%; (c) gcc: NWL-4 58.3%, NWL-64 98.9%, SAWL 91.3%.

Figure 14: The normalized lifetime of the MLC-based NVM
system with PCM-S, MWSR and SAWL under different swapping
periods: (a) $10^6$ endurance, (b) $10^5$ endurance.
2) NVM lifetime under general applications. We evaluate
the lifetime of an MLC-based NVM system under general
applications. Since the requested address of a real-world
workload changes every time, to evaluate the lifetime of an
NVM system, the simulation must trace each request until the
NVM system fails, which takes an impractically long time.
In order to reduce the running time, we simulate a 2GB
NVM system with an endurance of $10^5$. The normalized lifetime
results also apply to other large-capacity NVM systems.
The entire space is divided into 4K ∼ 1M regions, and the
exchange periods of the TLSR, RBSG and SAWL algorithms are
fixed at 128. Note that 4K regions are the standard configuration
for the TLSR and RBSG algorithms, and 1M regions are
beneficial to our SAWL scheme.
Figure 15: The lifetime of the MLC-based NVM system with
RBSG, TLSR and SAWL under general applications: (a) 4096 regions
(wear-leveling granularity being 2048), (b) 1M regions
(wear-leveling granularity being 8). Each panel plots the
normalized lifetime (%) of Baseline, RBSG, TLSR and SAWL for
bzip2, gcc, mcf, milc, gromacs, cactusADM, leslie3d, namd, gobmk,
soplex, hmmer, sjeng, libquantum and lbm, plus the harmonic mean
(Hmean).

Fig. 15 shows the normalized lifetime of the MLC-based
NVM under the general benchmarks. The baseline system
without any wear-leveling algorithm suffers from poor lifetime
due to the non-uniform distribution of the underlying writes. For
the RBSG algorithm, the average lifetime (harmonic mean)
of the MLC-based NVM system under all the benchmarks
is 15% of the ideal lifetime (ranging from 5% to 81%).
RBSG performs unsteadily since the static address mapping
fails to balance the inter-regional write distribution under
various benchmarks. In contrast to RBSG, the results of the
TLSR algorithm are much more stable,
achieving an average lifetime that is 43.1% of the ideal lifetime.
However, under the gromacs and hmmer benchmarks,
the lifetime of the MLC-based NVM system decreases to 10% of
the ideal lifetime, because the writes concentrate on a fraction of
the address space. These experimental results clearly show
that both the static and dynamic random address-mapping
schemes are inadequate for most benchmarks. Compared to
the existing wear-leveling algorithms, SAWL improves NVM
lifetime to 85.1% of the ideal lifetime. Under the most
non-uniform benchmarks, e.g., gromacs and hmmer,
SAWL still achieves 82% and 70% of the ideal lifetime,
respectively. In addition, the extra write overhead of the SAWL
algorithm is less than 1% and can be ignored. With more
regions (1M regions), the SAWL algorithm
obtains a higher lifetime, while the lifetimes of RBSG and
TLSR are lower, as shown in Fig. 15 (b). The average lifetime
of MLC-based NVM reaches 9.8%, 40.5% and 92.5% of the ideal
lifetime under the RBSG, TLSR and SAWL schemes, respectively.
In summary, the
experimental results in Section 2.1 and 4.2 illustrate that the
SAWL algorithm significantly improves the lifetime of the
MLC-based NVM system under both malicious attacks and
general applications.
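The averages above are harmonic means, which are pulled towards the worst-performing benchmarks. The short example below illustrates this with two made-up lifetime values; they are hypothetical and not measurements from this paper.

```python
from statistics import harmonic_mean, mean

# Hypothetical normalized lifetimes (% of ideal) for two workloads.
lifetimes = [10.0, 90.0]

hmean = harmonic_mean(lifetimes)   # dominated by the low value
amean = mean(lifetimes)            # arithmetic mean, for comparison
```

Here the harmonic mean is 18% while the arithmetic mean is 50%, so a scheme that performs poorly on a few benchmarks is penalized heavily.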
Figure 16: The average cache hit rates of NWL-4 and SAWL
with different cache sizes (64KB to 1MB).
4.4 Performance Impact
In general, a fine-grained wear-leveling region could im-
prove lifetime but degrade performance. We compare NWL-
4 (4-memory-line wear-leveling granularity on PCM-S and
MWSR) with SAWL on cache hit rate and also compare
NWL-4 and BWL with SAWL on IPC performance.
1) Cache hit rate. To demonstrate the robustness of SAWL
with different cache sizes, we evaluate the cache hit rate of
SAWL as a function of the cache size ranging from 64KB
to 1MB, compared with NWL-4. Fig. 16 shows the average
cache hit rates of the 14 applications from SPEC CPU2006.
We observe that as the cache size increases from 64KB to
1MB, the average cache hit rate of NWL-4 increases from
52% to 73%, and that of SAWL increases from 88% to 94%.
Therefore, SAWL improves the cache hit rate by 21%∼36%
(in absolute terms) via adaptively tuning the region sizes.
2) IPC performance. We used the Gem5 simulator [1] to
evaluate the performance impact of SAWL. In our experimen-
tal platform, the system consists of an 8-core processor (3.2
GHz), private 32KB L1 cache, shared 256 KB L2 cache, and
a 128 MB L3 DRAM cache. The read and write latencies
of DRAM are both 50 ns, while those of MLC-based NVM
(e.g., RRAM) are 50 ns and 350 ns, respectively. We use a
queue length of 128 and the FR-FCFS scheduling scheme in
the memory controller. The address translation requires 5 ns
when the address hits in the cache. Otherwise, it consumes
55 ns. We run the 14 SPEC CPU2006 applications mentioned above
and normalize the IPC to the
Baseline (without any wear-leveling scheme). The swapping
period of the SAWL algorithm is set to 128. As shown in
Fig. 17, the average IPC of the BWL, NWL-4 and
SAWL schemes decreases by 23%, 10% and 5%, respectively.
Some applications, such as bzip2 and milc, show
only slight IPC degradation. This is because the memory accesses
in these applications are relatively sparse and most of the
requested addresses hit in the cache. Therefore, the
results demonstrate that the performance impact of SAWL is
arguably negligible.
Figure 17: IPC degradation (%) of the NVM system (normalized
to the baseline without wear leveling) with various wear-leveling
schemes (BWL, NWL-4 and SAWL) under the SPEC CPU2006 applications.
5 Related Work
Existing work relevant to our SAWL research can be broadly
summarized into two categories: wear-leveling algorithms
and write reduction technologies.
Wear Leveling Algorithms. Wear-leveling techniques at-
tempt to balance write counts on physical devices. The con-
ventional table-based wear-leveling algorithms include Fine-
Grained Wear Leveling (FGWL) [31], row shifting and seg-
ment swapping [46], page allocation and page swapping [2,8],
and line swapping [10,43]. To achieve long lifetime, the granularity
of the mapping unit should be sufficiently small, which
incurs huge space overhead. In addition, most of these algorithms
adopt a deterministic exchanging policy, which makes them
vulnerable to malicious attacks. To address
these problems, algebraic-based wear-leveling algorithms,
such as randomized region-based Start-Gap (RBSG) [29],
multi-level Security Refresh (TLSR) [33] and Online Attack
Detection (OAD) [30], have been proposed. RBSG and TLSR overcome
the space overhead problem of table-based wear-leveling (TBWL)
algorithms, but
the lifetimes under them are shortened for MLC-based NVM
systems due to their insufficient data exchanges. OAD is used
to tune the swapping frequency of algebraic wear-leveling (AWL)
algorithms to improve NVM lifetime by distinguishing general
applications from malicious attacks. Different from the pure
algebraic algorithms, SAWL prolongs the lifetime of the NVM
system towards the ideal lifetime
by tuning the wear-leveling granularities.
In addition, there are two hybrid wear-leveling algorithms
combining table- and algebraic-based wear leveling, PCM-
S [34] and MWSR [42]. PCM-S gathers multiple lines into a
region and tracks the mapping information without recording
the access frequencies of each region. During a write to a
region, the region is swapped with a randomly picked region
in memory with a small probability. A random amount of
lines within the region are rotated during region exchange.
However, the PCM-S and MWSR algorithms require a large
number of regions to achieve uniform write distribution. The
basic architecture that stores the entire address mapping table
on the NVM devices leads to severe performance degradation.
To overcome these problems, our SAWL leverages an
SRAM cache in the memory controller and improves the cache
hit rate at runtime by tuning the region size dynamically, thus
significantly alleviating performance degradation.
Write Reduction Technologies. Write reduction
is an architectural approach to improving NVM lifetime.
A hybrid design that consists of NVM-based main
memory and a small-sized DRAM buffer is widely studied
[6, 13, 14, 21, 31, 32]. The DRAM buffers frequently
rewritten data and reduces the write traffic to NVM. Moreover,
some techniques extend the lifetime via reducing the bit flip-
ping, including Flip-N-Write redundant bit-write removal [5],
partial writes [20], line level write-back [31], lazy write [31]
and silent store removal [46]. These techniques remove redun-
dant writes by adopting read-before-write and novel coding
schemes to extend NVM lifetime, which are orthogonal to
our proposed wear-leveling technique.
6 Conclusion
MLC technique can be used in the NVM systems, which leads
to their rapid growth in device capacity but at the cost of much
weaker endurance than their single-level-cell versions. The
existing wear-leveling algorithms are shown to have their re-
spective shortcomings for MLC-based NVM systems. While
hybrid wear leveling has the potential to improve NVM life-
time, it incurs huge on-chip space overhead. The basic ar-
chitecture, which stores the entire address mapping table on
the NVM devices, leads to unacceptably severe performance
degradation due to the very long address translation latency.
To thoroughly address this problem, we propose a tiered wear-
level architecture and a self-adaptive wear-leveling (SAWL)
algorithm that dynamically tunes the wear-leveling granulari-
ties to accommodate more useful addresses in the cache, thus
improving cache hit rate and system performance. Experi-
mental results demonstrate that SAWL is effective and robust.
References
[1] BINKERT, N., BECKMANN, B., BLACK, G., REIN-
HARDT, S. K., SAIDI, A., BASU, A., HESTNESS, J.,
HOWER, D. R., KRISHNA, T., AND SARDASHTI, S.
The gem5 simulator. ACM SIGARCH Computer Archi-
tecture News 39, 2 (2011), 1–7.
[2] CHEN, C.-H., HSIU, P.-C., KUO, T.-W., YANG, C.-L.,
AND WANG, C.-Y. M. Age-based pcm wear leveling
with nearly zero search cost. In Proceedings of Design
Automation Conference (DAC) (2012), pp. 453–458.
[3] CHEN, E., APALKOV, D., DIAO, Z., DRISKILL-SMITH,
A., DRUIST, D., LOTTIS, D., NIKITIN, V., TANG, X.,
WATTS, S., AND WANG. Advances and future prospects
of spin-transfer torque random access memory. IEEE
Transactions on Magnetics 46, 6 (2010), 1873–1878.
[4] CHEN, Y.-C., CHEN, C., CHEN, C., YU, J., WU, S.,
LUNG, S., LIU, R., AND LU, C.-Y. An access-
transistor-free (0t/1r) non-volatile resistance random
access memory (rram) using a novel threshold switch-
ing, self-rectifying chalcogenide device. In IEEE Inter-
national Electron Devices Meeting (IEDM’03) (2003),
pp. 37–4.
[5] CHO, S., AND LEE, H. Flip-n-write: a simple de-
terministic technique to improve pram write perfor-
mance, energy and endurance. In Proceedings of In-
ternational Symposium on Microarchitecture (MICRO)
(2009), pp. 347–357.
[6] DHIMAN, G., AYOUB, R., AND ROSING, T. PDRAM:
a hybrid PRAM and DRAM main memory system. In
Proceedings of Design Automation Conference (DAC)
(2009), pp. 664–669.
[7] ENDOH, T., KOIKE, H., IKEDA, S., HANYU, T., AND
OHNO, H. An overview of nonvolatile emerging memo-
ries—spintronics for working memories. IEEE Journal
on Emerging and Selected Topics in Circuits and Sys-
tems 6, 2 (2016), 109–119.
[8] FERREIRA, A. P., ZHOU, M., BOCK, S., CHILDERS,
B., MELHEM, R., AND MOSSÉ, D. Increasing pcm
main memory lifetime. In Proceedings of the conference
on design, automation and test in Europe (DATE) (2010),
European Design and Automation Association, pp. 914–
919.
[9] FREITAS, R. F., AND WILCKE, W. W. Storage-class
memory: The next storage system technology. IBM
Journal of Research and Development 52, 4.5 (2008),
439–447.
[10] GAL, E., AND TOLEDO, S. Algorithms and data struc-
tures for flash memories. ACM Computing Surveys
(CSUR) 37, 2 (2005), 138–163.
[11] GLEIXNER, B., PELLIZZER, F., AND BEZ, R. Reliabil-
ity characterization of phase change memory. In Annual
Non-Volatile Memory Technology Symposium (NVMTS)
(2009), pp. 7–11.
[12] GOVE, D. Cpu2006 working set size. ACM SIGARCH
Computer Architecture News 35, 1 (2007), 90–96.
[13] HU, J., XIE, M., PAN, C., XUE, C. J., ZHUGE, Q.,
AND SHA, E. H.-M. Low overhead software wear leveling
for hybrid pcm + dram main memory on embedded
systems. IEEE Transactions on Very Large Scale Inte-
gration (VLSI) Systems 23, 4 (2015), 654–663.
[14] HU, J., ZHUGE, Q., XUE, C. J., TSENG, W.-C., AND
SHA, E. H.-M. Software enabled wear-leveling for
hybrid PCM main memory on embedded systems. In
Proceedings of Design, Automation & Test in Europe
Conference & Exhibition (DATE) (2013), pp. 599–602.
[15] IZRAELEVITZ, J., YANG, J., ZHANG, L., KIM, J., LIU,
X., MEMARIPOUR, A., SOH, Y. J., WANG, Z., XU, Y.,
DULLOOR, S. R., ET AL. Basic performance measure-
ments of the intel optane dc persistent memory module.
arXiv preprint arXiv:1903.05714 (2019).
[16] JIANG, L., DU, Y., ZHANG, Y., CHILDERS, B. R., AND
YANG, J. LLS: Cooperative integration of wear-leveling
and salvaging for PCM main memory. In Proceedings
of International Conference on Dependable Systems &
Networks (DSN) (2011), pp. 221–232.
[17] KIM, J., LEE, S., AND VETTER, J. S. Papyruskv: a
high-performance parallel key-value store for distributed
nvm architectures. In Proceedings of the International
Conference for High Performance Computing, Network-
ing, Storage and Analysis (2017), ACM, p. 57.
[18] KULTURSAY, E., KANDEMIR, M., SIVASUBRAMA-
NIAM, A., AND MUTLU, O. Evaluating stt-ram as an
energy-efficient main memory alternative. In Proceed-
ings of International Symposium on Performance Analy-
sis of Systems and Software (ISPASS) (2013), pp. 256–
267.
[19] LEE, B. C., IPEK, E., AND MUTLU, O. Phase change
memory architecture and the quest for scalability. Com-
munications of the ACM 53, 7 (2010), 99–106.
[20] LEE, B. C., IPEK, E., MUTLU, O., AND BURGER, D.
Architecting phase change memory as a scalable dram
alternative. ACM SIGARCH Computer Architecture
News 37, 3 (2009), 2–13.
[21] LEE, E., BAHN, H., AND NOH, S. H. A unified buffer
cache architecture that subsumes journaling function-
ality via nonvolatile memory. ACM Transactions on
Storage (ToS) 10, 1 (2014), 1.
[22] LEE, S. R., KIM, Y.-B., CHANG, M., KIM, K. M.,
LEE, C. B., HUR, J. H., PARK, G.-S., LEE, D., LEE,
M.-J., AND KIM, C. J. Multi-level switching of triple-
layered taox rram with excellent reliability for storage
class memory. In Symposium on VLSI Technology (VL-
SIT) (2012), pp. 71–72.
[23] LEFURGY, C., RAJAMANI, K., RAWSON, F., FELTER,
W., KISTLER, M., AND KELLER, T. W. Energy man-
agement for commercial servers. IEEE Computer 36,
12 (2003), 39–48.
[24] LIU, S., KOLLI, A., REN, J., AND KHAN, S. Crash
consistency in encrypted non-volatile main memory sys-
tems. In 2018 IEEE International Symposium on High
Performance Computer Architecture (HPCA) (2018),
IEEE, pp. 310–323.
[25] MARKTHUB, P., BELVIRANLI, M. E., LEE, S., VET-
TER, J. S., AND MATSUOKA, S. Dragon: breaking gpu
memory capacity limits with direct nvm access. In Pro-
ceedings of the International Conference for High Per-
formance Computing, Networking, Storage, and Analy-
sis (2018), IEEE Press, p. 32.
[26] PALANGAPPA, P. M., AND MOHANRAM, K. Compex:
Compression-expansion coding for energy, latency, and
lifetime improvements in mlc/tlc nvm. In IEEE Inter-
national Symposium on High Performance Computer
Architecture (HPCA) (2016).
[27] PENG, I. B., AND VETTER, J. S. Siena: exploring the
design space of heterogeneous memory systems. In Pro-
ceedings of the International Conference for High Per-
formance Computing, Networking, Storage, and Analy-
sis (2018), IEEE Press, p. 33.
[28] POREMBA, M., ZHANG, T., AND XIE, Y. Nvmain
2.0: A user-friendly memory simulator to model
(non-)volatile memory systems. IEEE Computer Architecture
Letters 14, 2 (2015), 140–143.
[29] QURESHI, M. K., KARIDIS, J., FRANCESCHINI, M.,
SRINIVASAN, V., LASTRAS, L., AND ABALI, B. En-
hancing lifetime and security of PCM-based main mem-
ory with start-gap wear leveling. In Proceedings of In-
ternational Symposium on Microarchitecture (MICRO)
(2009), pp. 14–23.
[30] QURESHI, M. K., SEZNEC, A., LASTRAS, L. A., AND
FRANCESCHINI, M. M. Practical and secure pcm sys-
tems by online detection of malicious write streams.
In Proceedings of International Symposium on High
Performance Computer Architecture (HPCA) (2011),
pp. 478–489.
[31] QURESHI, M. K., SRINIVASAN, V., AND RIVERS, J. A.
Scalable high performance main memory system us-
ing phase-change memory technology. In Proceedings
of International Symposium on Computer Architecture
(ISCA) (2009), pp. 24–33.
[32] RAMOS, L. E., GORBATOV, E., AND BIANCHINI, R.
Page placement in hybrid memory systems. In Proceed-
ings of the international conference on Supercomputing
(2011), pp. 85–95.
[33] SEONG, N. H., WOO, D. H., AND LEE, H.-H. S. Se-
curity refresh: prevent malicious wear-out and increase
durability for phase-change memory with dynamically
randomized address mapping. In Proceedings of Inter-
national Symposium on Computer Architecture (ISCA)
(2010), pp. 383–394.
[34] SEZNEC, A. A phase change memory as a secure main
memory. Computer Architecture Letters 9, 1 (2010),
5–8.
[35] SEZNEC, A. Towards Phase Change Memory as a Se-
cure Main Memory. In Workshop on the Use of Emerg-
ing Storage and Memory Technologies (WEST) (2010).
[36] SHIN, S., TIRUKKOVALLURI, S. K., TUCK, J., AND
SOLIHIN, Y. Proteus: A flexible and fast software sup-
ported hardware logging approach for nvm. In Pro-
ceedings of the 50th Annual IEEE/ACM International
Symposium on Microarchitecture (2017), ACM, pp. 178–
190.
[37] THOZIYOOR, S., AHN, J.-H., MONCHIERO, M.,
BROCKMAN, J., AND JOUPPI, N. A Comprehensive
Memory Modeling Tool and Its Application to the
Design and Analysis of Future Memory Hierarchies. In
Proceedings of International Symposium on Computer
Architecture (ISCA) (June 2008), pp. 51–62.
[38] WEI, Q., WANG, C., CHEN, C., YANG, Y., YANG, J.,
AND XUE, M. Transactional nvm cache with high per-
formance and crash consistency. In Proceedings of the
International Conference for High Performance Com-
puting, Networking, Storage and Analysis (2017), ACM,
p. 56.
[39] WU, K., REN, J., AND LI, D. Runtime data man-
agement on non-volatile memory-based heterogeneous
memory for task-parallel programs. In Proceedings
of the International Conference for High Performance
Computing, Networking, Storage, and Analysis (2018),
IEEE Press, p. 31.
[40] XU, C., NIU, D., MURALIMANOHAR, N., JOUPPI,
N. P., AND XIE, Y. Understanding the trade-offs in
multi-level cell reram memory design. In Proceedings
of 2013 50th ACM/EDAC/IEEE Design Automation Con-
ference (DAC) (2013), IEEE, pp. 1–6.
[41] YOUNG, V., NAIR, P. J., AND QURESHI, M. K. Deuce:
Write-efficient encryption for non-volatile memories. In
Proceedings of Architectural Support for Programming
Languages and Operating Systems (ASPLOS) (2015).
[42] YU, H., AND DU, Y. Increasing endurance and security
of phase-change memory with multi-way wear-leveling.
IEEE Transactions on Computers 63, 5 (2014), 1157–
1168.
[43] YUN, J., LEE, S., AND YOO, S. Bloom filter-based dy-
namic wear leveling for phase-change ram. In Proceed-
ings of the Conference on Design, Automation and Test
in Europe (DATE) (2012), EDA Consortium, pp. 1513–
1518.
[44] ZHAO, M., JIANG, L., ZHANG, Y., AND XUE, C. J. Slc-
enabled wear leveling for mlc pcm considering process
variation. In Proceedings of the 51st Annual Design
Automation Conference (DAC) (2014), pp. 1–6.
[45] ZHAO, M., SHI, L., YANG, C., AND XUE, C. J. Lev-
eling to the last mile: Near-zero-cost bit level wear lev-
eling for pcm-based main memory. In Proceedings of
International Conference on Computer Design (ICCD)
(2014), pp. 16–21.
[46] ZHOU, P., ZHAO, B., YANG, J., AND ZHANG, Y. A
durable and energy efficient main memory using phase
change memory technology. In Proceedings of Inter-
national Symposium on Computer Architecture (ISCA)
(2009), pp. 14–23.
[47] ZUO, P., AND HUA, Y. Secpm: a secure and persis-
tent memory system for non-volatile memory. In 10th
USENIX Workshop on Hot Topics in Storage and File
Systems (HotStorage 18) (2018).
