SoftWear: Software-Only In-Memory Wear-Leveling for Non-Volatile Main
  Memory by Hakert, Christian et al.
arXiv:2004.03244v2 [cs.OS] 8 Apr 2020
SoftWear: Software-Only In-Memory Wear-Leveling
for Non-Volatile Main Memory
Christian Hakert∗, Kuan-Hsun Chen∗, Paul R. Genssler†, Georg von der Brüggen∗, Lars Bauer†, Hussam Amrouch†,
Jian-Jia Chen∗, Jörg Henkel†
∗ Design Automation for Embedded Systems Group, TU Dortmund University, Germany
† Chair for Embedded Systems, KIT, Germany
Abstract—Several emerging technologies for byte-addressable
non-volatile memory (NVM) have been considered to replace
DRAM as the main memory in computer systems during the last
years. The disadvantage of a lower write endurance, compared
to DRAM, of NVM technologies like Phase-Change Memory
(PCM) or Ferroelectric RAM (FeRAM) has been addressed in
the literature. As a solution, in-memory wear-leveling techniques
have been proposed, which aim to balance the wear-level over
all memory cells to achieve an increased memory lifetime.
Generally, to apply such advanced aging-aware wear-leveling
techniques proposed in the literature, additional special hardware
is introduced into the memory system to provide the necessary
information about the cell age and thus enable aging-aware wear-
leveling decisions.
This paper proposes software-only aging-aware wear-leveling
based on common CPU features and does not rely on any
additional hardware support from the memory subsystem. Specif-
ically, we exploit the memory management unit (MMU), per-
formance counters, and interrupts to approximate the memory
write counts as an aging indicator. Although the software-only
approach may lead to slightly worse wear-leveling, it is applicable
on commonly available hardware. We achieve page-level coarse-
grained wear-leveling by approximating the current cell age
through statistical sampling and performing physical memory
remapping through the MMU. This method results in non-
uniform memory usage patterns within a memory page. Hence,
we further propose a fine-grained wear-leveling in the stack
region of C / C++ compiled software.
By applying both wear-leveling techniques, we achieve up
to 78.43% of the ideal memory lifetime, which is a lifetime
improvement of more than a factor of 900 compared to the
lifetime without any wear-leveling.
I. INTRODUCTION
Emerging technologies for non-volatile memory (NVM),
like Phase-Change-Memory (PCM) or Ferroelectric RAM
(FeRAM), have been considered as a replacement for DRAM
as the main memory over the last years. Most NVM tech-
nologies feature advantages like low energy consumption and
high integration density, which makes them a desired main
memory replacement. One of the major disadvantages of some
NVM technologies is the lower write-endurance. While classic
DRAM endures more than $10^{15}$ write cycles, PCM only
endures $10^{8}$ to $10^{9}$ write cycles per cell [7]. Thus, to wear out
a DRAM cell within 10 years, an application would have to
write the same memory cell every 900th CPU cycle on average
on a 3GHz CPU. Applying the same application to PCM, the
memory would wear out within 5 minutes. Although typical
applications do not cause such an extreme write pattern, they
still cause a highly non-uniform write pattern to the memory
[13], [21], [24]. Accordingly, the problem has been tackled in
the literature and several in-memory wear-leveling techniques
have been proposed. A majority of these techniques is aging-
aware [3], [8], [11], [13], [14], [16], [18], [19], [22], [26],
which means that the current cell age or the current write count
is taken into account for the wear-leveling decisions. The wear-
leveling itself is mostly realized through an abstraction layer,
which remaps the physical location of logical memory regions.
However, as current memory hardware does not provide a
write-count, which is necessary to determine the cell age,
additional hardware is introduced. This hardware requires
additional chip-space, and might be hard to realize in a way
that meets the desired granularity and clock-frequency.
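The endurance figures quoted above can be checked with a short calculation. The following sketch is only an illustration; the function names are hypothetical and not taken from the paper, and a year is assumed to have 365 days.

```python
# Back-of-the-envelope check of the endurance figures quoted in the text.
# All names are illustrative assumptions, not part of the paper.
CPU_HZ = 3e9                          # 3 GHz CPU, as stated in the text
SECONDS_PER_YEAR = 365 * 24 * 3600    # assumption: 365-day year


def cycles_between_writes(endurance_writes, lifetime_years, cpu_hz=CPU_HZ):
    """CPU cycles between two writes to one cell so that the cell
    reaches its endurance limit exactly after lifetime_years."""
    total_cycles = lifetime_years * SECONDS_PER_YEAR * cpu_hz
    return total_cycles / endurance_writes


def seconds_until_worn_out(endurance_writes, cycles_per_write, cpu_hz=CPU_HZ):
    """Time until one cell is worn out if it is written every
    cycles_per_write CPU cycles."""
    return endurance_writes * cycles_per_write / cpu_hz


# DRAM: 10^15 writes in 10 years needs a write roughly every 900 to 1000 cycles.
dram_interval = cycles_between_writes(1e15, 10)

# PCM under the same write rate (every 900th cycle):
pcm_upper = seconds_until_worn_out(1e9, 900)   # 300 s, i.e. 5 minutes
pcm_lower = seconds_until_worn_out(1e8, 900)   # 30 s
```

The 5-minute figure corresponds to the upper PCM endurance bound; with the lower bound the same write pattern would wear out a cell within seconds.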
To allow aging-aware wear-leveling in the absence of such
special hardware, this paper proposes software-only wear-
leveling techniques. The term software-only here means that
we do not require any additional hardware from the memory
subsystem and only use hardware features which are widely
available. We provide the necessary write-count through a
statistical online approximation of the write distribution, which
only requires a memory management unit (MMU), performance
counters, and an interrupt mechanism. The performance
counter can generate an interrupt on every nth memory write
access, which achieves an equidistant sampling of the write
distribution. A special configuration of the memory access
permissions then makes it possible to record the target address
of the single memory write that follows. The approximated write
distribution subsequently enables an arbitrary aging-aware
wear-leveling algorithm.
In this paper, we implement a simple wear-leveling algorithm
on the granularity of virtual memory pages, which achieves
the necessary physical memory remapping through the MMU.
Since the resulting memory write distribution is still
highly non-uniform due to the granularity of memory pages,
we introduce an additional software-only, fine-grained wear-
leveling technique, which balances the write accesses to the
stack region by relocating the stack in a circular manner. This
is achieved by regularly copying the current stack content to
a new location and adjusting the stack pointer accordingly. A
special virtual memory configuration allows a hardware-aided
wraparound to achieve a circular movement.
Our contributions:
• We deliver a software-only coarse-grained in-memory
wear-leveling system, consisting of an online approxima-
tion mechanism for the write-distribution and an MMU-
based wear-leveling algorithm.
• We further provide an extending software-only fine-
grained wear-leveling technique, which targets the stack
region of C/C++ compiled applications and relocates the
stack in a circular manner in a bounded memory region.
We aim to balance the write-count to each memory byte
in the flat memory space equally to achieve a high memory
lifetime. We note that other factors impact the memory en-
durance as well, e.g., process variation in PCM [24], but the
write-count is a major factor. Our approaches can be extended
according to physical models (e.g. process variation domains)
to also respect advanced physical memory properties.
After giving an overview of the related wear-leveling
approaches in the literature in Section II, we present the memory
write distribution of our benchmark applications in Section III
and our method to analyze the write pattern of applications,
which is also used for our evaluations, in Section IV. After this,
our novel wear-leveling techniques are described in detail in
Section V and Section VI. Each section contains an evaluation,
which uses the write pattern analysis mechanism. The paper
concludes with a short summary in Section VIII.
II. RELATED WORK
During the last years, several approaches for in-memory
wear-leveling for NVM have been proposed. These approaches
can be categorized along different criteria. First, there are
aging-aware approaches [3], [8], [11], [13], [14], [16], [18],
[19], [22], [26], which take the current cell age into account
to apply wear-leveling. In contrast there are random-based
approaches [13], [21], [26], which apply wear-leveling in a
circular or random-based manner. Both approaches are often
combined to achieve a random-based wear-leveling on fine
granularities inside memory blocks, while an aging-aware
approach is used to target these coarse-grained memory blocks.
The granularity also varies from single bits [9], [25] over
cache-lines [21], [26] for fine-grained approaches to memory
pages [3], [8], [13], [14], [22] or even bigger memory segments
[24], [26] for coarse-grained approaches.
Some approaches are not based on remapping the physical
memory content through an abstraction layer, but hook into
the memory allocation process of the operating system to
apply wear-leveling to the memory allocator [3], [18], [22].
Li et al. [18] also propose to use an allocated memory portion
whenever a function is called for the function’s stack memory
to wear-level the stack region.
Gogte et al. propose a software-only coarse-grained wear-
leveling approach by using a sampled approximation of the
write distribution [14]. They make use of advanced debug-
ging capabilities, e.g. Intel Processor Event Based Sampling
(PEBS), which allows them to sample the write requests
from the CPU. These debugging capabilities, however, can
rarely be found in embedded systems and resource constrained
hardware.
All other mentioned aging-aware approaches rely on
the current write-count information of the memory. Most
approaches introduce specialized hardware into the memory
controller to collect the write-count information, which is not
available in commonly available systems and might be hard to
realize. Dong et al. [11] use an offline recorded memory trace
to estimate the write distribution, which limits the approach
to a subset of well-known applications only.
III. PROBLEM DESCRIPTION
When considering non-volatile memory as the main memory
for program executions, the system may suffer from the low
write-endurance of the underlying memory technology. Even if
the system is also equipped with DRAM, certain applications
may be desired to only run on the non-volatile memory to
reach energy saving states as fast as possible. To understand
the impact of program executions on main memory with low
write-endurance, the precise write distribution from a program
should be recorded and analyzed. Separating the program’s
memory into the text, data, bss, and stack regions
makes it possible to analyze the write pattern of each region
separately and determine the impact on the memory lifetime. This section
presents the write distribution for our benchmark applications
and points out the influence on the write-endurance.
To determine the influence of code executions on the mem-
ory write-patterns of applications, especially on the different
memory regions, we run four benchmark applications. We
aggregate the resulting memory trace file on the granularity of
64 bytes (a cache-line is assumed to always be written entirely)
to a write-count distribution and present it graphically. As
the benchmark applications, we chose the following programs:
1) bitcount: A simple implementation, which iterates over
an array of data and counts the 1 bits. The resulting
count is stored in a global counter and returned at the
end.
2) pfor: A simulation of a data decompression scenario. A
big set of data is available in a lightweight compressed
format, namely Patched Frame of Reference (PFOR)
[27]. The data is decompressed and aggregated in fixed
size windows, which simulates the processing of a
stream of compressed data.
3) sha: This application is part of the MiBench security
suite [15] and calculates the SHA sum of a given dataset.
4) dijkstra: This application is part of the MiBench
network suite [15] and calculates a fixed number of
shortest paths in a network, using Dijkstra’s algorithm.
We chose these benchmarks, because they are simple enough
to understand the connection between the code and the mem-
ory usage of the different segments. The limitation to four
benchmarks is due to the high time consumption of the
required full system simulations.
[Figure 1: four panels (bitcount, 76 kB; pfor, 148 kB; sha, 52 kB; dijkstra, 148 kB). x axis: main memory address, divided into the text, data, and stack regions; y axis: write count.]
Fig. 1. Memory write-count distribution - baseline (see footnote 1)
Figure 1 shows the resulting illustration of the write-count
distributions of the benchmark applications. Note that the four
applications face different execution times and thus the total
amount of writes is different. Thus, the scaling of the y axes is
different. Considering the different memory regions, different
observations can be made:
• text: As the text segment only contains the compiled
binary code, it is never written during the normal applica-
tion execution. This behavior is also shown in the result.
In the context of wear-leveling, read-only memory regions
have to be targeted just like heavily written memory
regions to distribute the wear-levels equally.
• data/bss: The data and the bss segments store
global program variables, such as global attributes or
arrays. Naturally, these variables are written from time
to time, depending on the application logic. The dijkstra
benchmark has a heavy, non-uniform usage of the bss
segment, since the benchmark manages the steps of the
algorithm in a queue.
• stack: The stack segment causes the most non-
uniform write access to the main memory. This results
from the way the stack is typically used: Local variables
are stored on top of the stack and are removed when
they are no longer used. Depending on the application
logic, this makes the beginning of the stack a heavily
used area with a lot of memory writes, while the rest of
the stack region is used less. A wear-leveling algorithm
has to distribute the memory writes to this region to all
other, less written memory regions.
These results point out the need for aging-aware wear level-
ing. The memory writes to hot memory regions have to be
redirected mainly to unused memory regions, but also to less
used memory regions. This requires a monitoring of the current
write-count and an incremental redistribution according to the
current write-count distribution.
IV. MEMORY WRITE-PATTERN ANALYSIS
Section III presents the memory write-count distribution of
four benchmark applications. In a usual computation platform,
the memory accesses of a program cannot be captured and
analyzed without special techniques. Debugging mechanisms
can overcome the problem but introduce a large overhead.
Using a hardware analyzer, which basically plugs an FPGA
between the CPU and the memory DIMM, is considered
by Bao et al. [4]. Such an analyzer is reasonably fast but
requires a complex hardware setup. In this paper, we use a
full system cycle-accurate simulator (including CPU, memory,
buses, peripherals, etc.) on top of a Linux host instead. This
section introduces our simulation environment, which is also
used for the results in Section III.
We chose gem5 [6] as the full system simulator, since it
can be combined with a memory simulator for non-volatile
memories, namely NVMain2.0 [20], due to its modular structure.
This setup makes it possible to obtain all memory accesses of a
running program in a logfile, analyze them afterwards, and
perform detailed evaluations of our methods by comparing
the captured logfiles. To simulate the properties of NVMs,
several simulators can be considered (e.g. [23] and [12]),
which precisely simulate, for instance, the timing and energy
behavior. However, the methods in this paper analyze and
change the write behavior of applications only, which is
independent from the physical properties of the underlying
memory. Thus, we do not involve them in our analysis.
A. Simulation Setup Details
NVMain2.0 provides an option to generate a memory trace
file, which contains detailed information for every main memory
access. Using this information, we can extract the memory
address for each write access and aggregate them for each
64-byte cache-line (see footnote 2), which results in a write-count
distribution. This method is also independent from the CPU internal
cache configuration, since writes to the main memory are
recorded. Even if a write is caused by a logical read operation
(cache preemption), this write is captured in our simulation.
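The aggregation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the trace is modeled as a list of hypothetical (operation, address) tuples, which is not the actual NVMain2.0 trace file format, and the function name is made up for this example.

```python
from collections import Counter

CACHE_LINE_BYTES = 64  # aggregation granularity used in the text


def write_count_distribution(accesses):
    """Aggregate write accesses per 64-byte cache-line.

    accesses: iterable of (op, address) tuples, where op is "W" for a
    write and "R" for a read. This tuple layout is an assumption made
    for this sketch only.
    Returns a Counter mapping the cache-line base address to its write count.
    """
    counts = Counter()
    for op, address in accesses:
        if op == "W":
            line_base = address // CACHE_LINE_BYTES * CACHE_LINE_BYTES
            counts[line_base] += 1
    return counts


# Small hand-made trace: two writes hit the line at 0x1000, one hits 0x1040.
example_trace = [("W", 0x1000), ("R", 0x1008), ("W", 0x103F), ("W", 0x1040)]
distribution = write_count_distribution(example_trace)
```

Reads are ignored on purpose, since only writes contribute to the wear of a cell in this model.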
We simulate an ARMv8 CPU architecture, the DerivO3CPU
implementation, and the VExpress GEM5 V2 machine. This
system includes an advanced CPU with pipelining and out-
of-order execution as well as a set of controllers, which are
typically found in ARM based systems (e.g., the GIC interrupt
controller, PL011 UART controller, etc.).
Footnote 1: The gray lines indicate boundaries of 4 kB virtual memory pages. The
data and bss segment is marked as a big data segment in the picture.
Footnote 2: The simulation model of gem5 assumes cache-lines to be written to the
memory entirely, hence we also use this assumption in the analysis.
[Figure 2: block diagram with the components Linux host system, Gem5 / NVMain2.0, Runtime System, and Application; labels: system services, memory accesses, hardware initialization, memory trace.]
Fig. 2. Overview of the simulation setup
Two simulation modes are supported by gem5: The
systemcall-emulation and the full system simulation. As we
want to reduce the influence of the runtime infrastructure
(libraries, operating system services, etc.) on the application
as much as possible, we run bare-metal full system mode
simulations. This requires an operating system to be started
in gem5, handling the hardware initialization and providing
required services for the running application. We developed
a small bare-metal runtime system, which takes the place of
the operating system in the simulation setup. Thus, we can
initialize the hardware in a flexible way with low overhead
(compared to Linux kernel modifications), and only provide
the required operating system services. Even if the analyzed
application is directly compiled into the binary file of the run-
time system, which is started in gem5 afterwards, the runtime
system can be seen as part of the simulation environment
and not as part of the application. The simulation setup is
illustrated in Figure 2.
B. Application Separation
The full system simulation mode of gem5 combined with
a small, customized runtime system in place of the operating
system allows us to highly control the hardware behavior and
the memory placement. In this section, we aim to analyze the
write access behavior of an application, without interference
of an operating system, and separately analyze the memory
regions of the application. To achieve this, we apply two
separation techniques:
Spatial Separation: During the linking process of the runtime
system, the application’s memory regions (i.e. text, data,
bss, and stack) are placed in a static separate memory
location, which resides apart from the memory locations of the
runtime system. Thus, the memory accesses of the application
target a separated memory region, which can be analyzed sep-
arately in the recorded write-count distribution. Furthermore,
the concrete memory addresses of the memory regions can
be determined after the linking process, which makes it possible to
analyze the recorded write-count distribution separately for each
memory region. Hence, the runtime system has to establish an
identity mapping (or at least a constant, well-known mapping)
from virtual memory addresses to physical memory addresses
to be able to determine the different memory regions in the
recorded write-count distribution.
Interrupt Separation: The handling of interrupts is separated
from the application’s stack. Usually, the operating system,
respectively our runtime system, saves the current register set
on the stack when handling an interrupt. An interrupt during
the running application would cause the application’s stack to
be used for the register backup, which would influence the
application’s write pattern to the stack region. To overcome
this, we let the hardware switch to a separate stack for interrupt
handling instead of using the application’s stack. For ARMv8 architectures,
this can be achieved by using two different exception levels
[1]. When taking an interrupt to a higher exception level, an
ARMv8 CPU can be configured to switch the stack pointer
to a dedicated stack pointer for the higher exception level.
We run the runtime system on exception level 1 (EL1),
using a stack, allocated for the runtime system only. The
application is executed on exception level 0 (EL0) with the
application’s stack. Thus, whenever an interrupt occurs during
the application execution, the interrupt is handled on EL1 on
the stack of the runtime system. Accordingly, the application’s
stack is not influenced by interrupts at all.
Both techniques make it possible to analyze the memory write-pattern
of isolated applications. Based on this, required wear-leveling
actions are deduced and proposed subsequently. In this paper,
we only focus on wear-leveling for the test applications. In a
real world setup, the runtime system / operating system also
requires wear-leveling to be applied to its memory regions,
because its implementation uses the main memory similarly
to the test applications. The solutions presented
here can also be applied to the runtime system, but require
some additional implementation effort, since they are provided
as a service by the runtime system itself.
V. AGING-AWARE COARSE-GRAINED WEAR-LEVELING
Section III points out the need for aging-aware in-memory
wear-leveling, when the write-endurance is low. If the current
write behavior cannot be tracked by the hardware and no
memory trace is known for the running application, aging-
aware techniques cannot be applied. To overcome this issue,
in this section we propose a software-only write distribution
approximation technique, which estimates the memory write
distribution (i.e., the write count to fixed sized memory
regions) using only commonly available hardware support
(i.e., MMU, performance counters, and interrupts). The write
distribution approximation can be used subsequently to enable
an arbitrary aging-aware wear-leveling algorithm. However,
to keep our implementation software-only, we developed a
simple aging-aware wear-leveling algorithm, which adjusts
the virtual memory mapping of the MMU to exchange the
physical location of hot (heavy written) and cold (less often
written) virtual memory pages. Thus, the entire wear-leveling
is coarse-grained with a 4 kB granularity. To avoid the need to
store the aging state of the memory as a persistent object, we
design our wear-leveling solution to be incremental. Hence, at every
point in time the algorithm aims to achieve an overall write-
count balance in the memory. After a reboot, for instance,
the memory can be assumed to be wear-leveled and the
incremental wear-leveling can be continued. This furthermore
removes the requirement to know the exact age of the
memory at any time. Therefore, the approximation does not
need to estimate absolute numbers; a relative representation
of the write distribution is sufficient. At the end of this
section, we evaluate the resulting wear-leveling quality on the
previously mentioned benchmark applications.
A. Write Distribution Approximation
Several steps are required to record an approximation of the
real write distribution of an application at runtime. To achieve
an equidistant sampling of write accesses, i.e. every nth write
access is sampled, the target of every nth memory write of
the application is captured and stored in an appropriate data
structure. The number n determines the temporal granularity
of the approximation technique, allowing a trade-off between
accuracy and introduced overhead. After capturing the write,
the spatial granularity of the data structure has to be considered
as well. Storing the estimated write count for every byte
introduces a big storage overhead and leads to imprecise
results, when the temporal granularity is coarse. Instead, bytes
can be related to larger memory blocks and the write counts
are aggregated for every write access into these blocks. For
our implementation, we aggregate the write counts for 4 kB
memory blocks, because the wear-leveling algorithm considers
this granularity, i.e., the decision is based on memory pages.
Using an 8 byte counter for every block, $\frac{1}{512} \cdot \text{memory-size}$
bytes are required to store the approximated write distribution
(e.g., 2 MB when 1GB of main memory is tracked).
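The storage requirement and the per-block counter table can be illustrated with a small sketch. The class and function names are hypothetical, and the table is modeled as a plain Python list rather than an array of 8 byte counters in the runtime system's memory.

```python
PAGE_BYTES = 4096      # spatial granularity: one counter per 4 kB block
COUNTER_BYTES = 8      # size of one counter, as stated in the text


def table_bytes(memory_bytes):
    """Storage needed for the counter table: memory-size / 512."""
    return memory_bytes // PAGE_BYTES * COUNTER_BYTES


class WriteCountTable:
    """Hypothetical counter table for a tracked memory region."""

    def __init__(self, base_address, memory_bytes):
        self.base_address = base_address
        self.counters = [0] * (memory_bytes // PAGE_BYTES)

    def record(self, address):
        """Attribute one sampled write to the 4 kB block containing address."""
        index = (address - self.base_address) // PAGE_BYTES
        self.counters[index] += 1


overhead_for_1_gib = table_bytes(1 << 30)   # 2 MiB, matching the example above

table = WriteCountTable(base_address=0x80000000, memory_bytes=16 * PAGE_BYTES)
table.record(0x80000010)   # falls into block 0
table.record(0x80001FFF)   # falls into block 1
```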
The detailed flow of capturing the target of every nth memory
write access requires two techniques to be implemented.
First, an interrupt has to be generated after every nth write
access, so that the runtime system can take action. Secondly, the
target of the next memory write access has to be determined
and stored in the data structure. Both implementations are
described in detail subsequently. Although the approach by Gogte
et al. makes it possible to directly capture CPU write requests at sampled
intervals [14], their approach relies on a specialized debugging
capability. Our method provides an alternative, which makes
use of more widely available hardware features.
1) Temporal Write Distribution Sampling: To generate an
interrupt after every nth write access of the application, we
use the CPU internal performance counting mechanism. In
ARMv8, each performance counter can be configured to only
record events triggered on EL0, thus there is no interference
from executed interrupt handlers. The BUS_ACCESS_ST event
counts the total number of store requests on the memory
bus, thus the number of write accesses of the application
is recorded. For Intel CPUs, the same behavior could be
achieved by using a performance counter for writebacks of the
last-level-cache. If no such performance counter is available
in a system, an approximation (e.g. the cycle counter)
can still be considered. The performance counting mechanism
can generate an interrupt when the performance counter
overflows (i.e., exceeds the value of $2^{32} - 1$). To establish
interrupts on every nth write access, the performance counter
is set to $2^{32} - n$ during the handling of the overflow interrupt.
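The preload scheme can be illustrated with a purely software model of a 32-bit counter. This is only a sketch: the class name is hypothetical, and no real performance monitoring registers are accessed.

```python
COUNTER_MAX = 2**32 - 1   # largest value of a 32-bit performance counter


class OverflowCounter:
    """Software model of a 32-bit event counter that raises an
    'interrupt' on overflow and is preloaded with 2^32 - n by the
    (modeled) overflow handler."""

    def __init__(self, n):
        self.n = n
        self.value = 2**32 - n   # initial preload
        self.interrupts = 0

    def count_event(self):
        """Model one counted write access."""
        if self.value == COUNTER_MAX:
            # The increment would exceed 2^32 - 1: overflow interrupt.
            self.interrupts += 1
            self.value = 2**32 - self.n   # handler re-arms the counter
        else:
            self.value += 1


counter = OverflowCounter(n=5000)
for _ in range(4999):
    counter.count_event()
interrupts_before_nth = counter.interrupts   # still 0 after 4999 events
counter.count_event()                        # the 5000th event overflows
interrupts_after_nth = counter.interrupts
```

Starting from $2^{32} - n$, the counter reaches $2^{32} - 1$ after $n - 1$ events, so the nth event causes the overflow.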
2) Write Access Trapping: As the last written memory
address cannot be determined during the interrupt handling
of the performance counter overflow, a second technique is
implemented to track the target address of the next memory
write. During the handling of the overflow interrupt, the
memory access permission for the tracked memory region
is set to READ_ONLY. Note that the ARMv8 architecture
allows hierarchical memory access permissions, which makes it
possible to configure memory regions of 1 GB size to READ_ONLY by
only modifying one page-table entry. Due to the READ_ONLY
permission, the next write access causes a permission violation
trap, which is handled as an interrupt. The address that caused
the violation is available to the interrupt handler in a dedicated
register, which is then used to increment the corresponding
counter in the write distribution approximation (see footnote 3). During the
handling of the trap, the access permissions are set back to
READ_WRITE (see footnote 4). Note that this mechanism does not strictly
require an MMU; it could also be implemented with a very
lightweight MPU on a microcontroller.
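The interplay of the overflow interrupt and the permission trap can be sketched as a simulation over a stream of write addresses. This is a simplified model with hypothetical names: the performance counter is reduced to a plain write counter, the READ_ONLY state is a boolean flag, and the trapped write is assumed to count as one regular write once it is re-executed.

```python
PAGE_BYTES = 4096   # spatial granularity of the approximation


def approximate_write_distribution(write_addresses, n):
    """Sample the target page of the write that follows every nth write.

    write_addresses: iterable of byte addresses written by the application.
    Returns a dict mapping page number to the sampled write count.
    """
    counts = {}
    writes_since_overflow = 0
    read_only = False   # models the READ_ONLY permission set by the handler
    for address in write_addresses:
        if read_only:
            # Permission violation trap: record the faulting address
            # and restore READ_WRITE access.
            page = address // PAGE_BYTES
            counts[page] = counts.get(page, 0) + 1
            read_only = False
        writes_since_overflow += 1
        if writes_since_overflow == n:
            # Performance counter overflow interrupt: arm the trap.
            writes_since_overflow = 0
            read_only = True
    return counts


# Synthetic stream: nine of ten writes hit page 0, the tenth hits page 1.
stream = [0x0000 if i % 10 else 0x1000 for i in range(100000)]
sampled = approximate_write_distribution(stream, n=7)
```

In this synthetic stream the sampled counts keep roughly the 9:1 ratio of the real distribution, which is the relative representation the wear-leveling decision needs.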
B. Wear-leveling Algorithm
As mentioned before, the write distribution approximation
enables arbitrary aging-aware wear-leveling algorithms. When
this technique is used, the integration of the approximation
system and the wear-leveling algorithm has to be considered
as well. To provide a common interface, the approximation
implementation could provide the estimated write-counts in a
table inside the runtime system’s memory and a notification
mechanism to trigger the wear-leveling algorithm when a
special event occurs (e.g., one estimated counter exceeds a
configured threshold). However, to reduce the overhead further,
we interleave our wear-leveling algorithm with the
approximation implementation and thereby avoid redundantly stored
data. Our wear-leveling algorithm uses a red-black tree [5] to
maintain all managed virtual memory pages along with their
estimated age. As the estimated age is already present inside
of the tree nodes, there is no need to store these values in the
approximation implementation as well.
1) Management of Memory Pages: Our wear-leveling al-
gorithm is based on a red-black tree as the management data
structure, which contains all managed physical memory pages
together with their estimated cell age. Whenever a virtual
memory page should be relocated to another physical memory
page, the current minimum is extracted from the tree as the
target physical page and the estimated ages are adjusted accordingly.
Regarding the overhead, the wear-leveling algorithm
is only called in this setup when a memory page has to be
relocated. Regarding the selection policy of the wear-leveling
decisions, the estimated age of all physical pages is balanced
equally over time, because every page will be the current
minimum page at a certain time when the estimated age is
updated properly.
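The selection policy can be sketched as follows. This is an illustration under several assumptions and not the implementation from the paper: the minimum is found by a linear scan over a dictionary instead of a red-black tree, the class and method names are hypothetical, and the age adjustment is assumed to add the estimated writes to the physical page that is being left.

```python
class PageWearLeveler:
    """Hypothetical sketch: relocate a hot virtual page to the physical
    page with the minimum estimated age."""

    def __init__(self, physical_pages, mapping):
        self.age = {page: 0 for page in physical_pages}   # estimated age
        self.mapping = dict(mapping)   # virtual page -> physical page

    def relocate(self, virtual_page, estimated_writes):
        """Called when virtual_page has received estimated_writes
        sampled writes since its last relocation."""
        old = self.mapping[virtual_page]
        self.age[old] += estimated_writes   # assumed age adjustment
        # Linear scan stands in for extracting the tree minimum.
        target = min(self.age, key=lambda page: (self.age[page], page))
        if target == old:
            return old
        # If another virtual page occupies the target, it moves to old.
        for other, physical in self.mapping.items():
            if physical == target:
                self.mapping[other] = old
                break
        self.mapping[virtual_page] = target
        return target


leveler = PageWearLeveler(physical_pages=range(4),
                          mapping={"stack": 0, "data": 1})
for _ in range(8):
    leveler.relocate("stack", estimated_writes=10)
```

Because the hot page always moves to the currently youngest physical page, the estimated ages of all four pages end up equal in this small run.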
Footnote 3: The semantics of the performance counter and of the write access trapping
mechanism differ slightly. While the performance counter counts every
write to the memory, including cache writebacks and other indirect memory
accesses, the write access trapping only applies to CPU write operations,
which require a fetch of a TLB line. However this only implies that not the
target of every nth write is recorded, but that sometimes the distance between
two recorded writes is n + x, where x is a small integer.
Footnote 4: For our runtime system implementation, memory permissions are not used
for any protection purposes. If this is the case, the modified permissions might
have to be backed up and restored later on.
[Figure 3: four panels (bitcount, 76 kB; pfor, 148 kB; sha, 52 kB; dijkstra, 148 kB). x axis: main memory address, divided into the text, data, and stack regions; y axis: approximated write-count.]
Fig. 3. Memory write-count approximation n = 5000
As a result, this integration of the wear-leveling algorithm
and the approximation system leads to an additional configuration
parameter, besides the temporal and spatial granularity
of the write-count approximation. The threshold, i.e., the
number of estimated writes after which a relocation is performed,
is maintained by the approximation system, because the wear-
leveling algorithm is called from the approximation system in
that case. This configuration parameter provides a trade-off
between the overhead of page relocation and the frequency,
and thus the resulting quality, of wear-leveling actions
without influencing the quality of the write-count
approximation.
2) Memory Page Relocation: Once the wear-leveling algo-
rithm determined a pair of two virtual memory pages to swap,
two steps are required to perform the relocation. First, the
virtual memory mapping in the page-table has to be adjusted
accordingly, such that the physical pages of both virtual
memory pages are exchanged. A Translation Lookaside Buffer
(TLB) maintenance operation is required afterwards to make
sure the exchanged mapping is applied. Note that the ARMv8
virtual memory system allows single entries to be invalidated
in the TLB, thus a total TLB flush is not necessary. After the
new page mapping is established, the physical content has to
be exchanged to maintain the application’s view on the virtual
memory. This is achieved by copying the first page to a spare
buffer, copying the second page to the first page, and copying
the buffer content to the second page. The size of the buffer is
chosen as 4 kB for two reasons: First, copying sequential
memory content can be done more efficiently in most systems
than copying single bytes or words from different regions.
Second, the write access pattern to the buffer memory page
is completely uniform and thus has no negative influence on
the memory lifetime if it is also handled by the wear-leveling
system.
Fig. 4. Memory write-count approximation, n = 20000. Panels: bitcount (76 kB), pfor (148 kB), sha (52 kB), dijkstra (148 kB); x-axis: main memory address, with the text, data, and stack segments marked; y-axis: approximated write-count.
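The two relocation steps described above can be illustrated with a minimal Python sketch. It is a simulation under stated assumptions, not the authors' implementation: physical memory is modelled as a `bytearray`, the page table as a plain dictionary, and the names `swap_pages` and `read_virtual` are hypothetical. The TLB maintenance operation is only indicated by a comment.

```python
PAGE_SIZE = 4096  # 4 kB pages, as in the text


def swap_pages(memory, page_table, vpage_a, vpage_b):
    """Exchange the physical pages behind two virtual pages.

    memory:     bytearray modelling the physical memory (assumption)
    page_table: dict mapping virtual page number -> physical page number
    """
    pa, pb = page_table[vpage_a], page_table[vpage_b]

    # Step 1: exchange the mapping. A real system would now invalidate
    # the two affected TLB entries.
    page_table[vpage_a], page_table[vpage_b] = pb, pa

    # Step 2: exchange the physical content through a page-sized buffer.
    a = slice(pa * PAGE_SIZE, (pa + 1) * PAGE_SIZE)
    b = slice(pb * PAGE_SIZE, (pb + 1) * PAGE_SIZE)
    spare = bytes(memory[a])   # first page  -> spare buffer
    memory[a] = memory[b]      # second page -> first page
    memory[b] = spare          # spare buffer -> second page


def read_virtual(memory, page_table, vaddr):
    """Translate a virtual address and read one byte."""
    ppage = page_table[vaddr // PAGE_SIZE]
    return memory[ppage * PAGE_SIZE + vaddr % PAGE_SIZE]
```

After `swap_pages`, every virtual address still reads the same value as before, while the two physical pages have exchanged their roles; this is the property the relocation has to preserve.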
C. Evaluation
To point out how the previously presented techniques can
be used to improve the balance of wear-levels, the write-count
approximation system is evaluated first. The four benchmark
applications shown in Figure 1 are executed again with enabled
write-count approximation. Instead of triggering the wear-
leveling algorithm, the write-counts are simply aggregated,
resulting in an analyzable distribution. The spatial granularity
is fixed to 4 kB sized memory regions (virtual memory
page size), while the temporal granularity is evaluated for
two different values. For the first experiment, a sample is
recorded every n = 5000th memory write access, for the
second experiment a sample is recorded every n = 20000th
memory write access. The resulting approximated write-count
distributions are illustrated in Figure 3 and Figure 4.
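The sampling scheme used in both experiments can be sketched as follows. This is a simplified simulation, not the evaluated system: the write trace is synthetic, the function names are hypothetical, and the `i % n == 0` check merely stands in for the performance counter overflow interrupt.

```python
import random
from collections import Counter

PAGE_SIZE = 4096  # spatial granularity: 4 kB regions


def approximate_write_counts(write_addresses, n):
    """Record only the target page of every n-th write access."""
    samples = Counter()
    for i, addr in enumerate(write_addresses, start=1):
        if i % n == 0:  # stands in for the counter-overflow interrupt
            samples[addr // PAGE_SIZE] += 1
    return samples


def synthetic_trace(num_writes, hot_page=3, num_pages=8, seed=0):
    """Hypothetical workload: about 90% of the writes hit one hot page."""
    rng = random.Random(seed)
    for _ in range(num_writes):
        page = hot_page if rng.random() < 0.9 else rng.randrange(num_pages)
        yield page * PAGE_SIZE + rng.randrange(PAGE_SIZE)


if __name__ == "__main__":
    for n in (5000, 20000):
        samples = approximate_write_counts(synthetic_trace(400_000), n)
        # Scaling each sample by n gives an estimate of the real write count.
        print(n, {page: count * n for page, count in sorted(samples.items())})
```

A larger n produces fewer samples for the same trace, which mirrors the trade-off between approximation quality and overhead discussed in the text.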
1) Write-Count Approximation Evaluation: The characteristic
of the real write-count distribution (compare Figure 1) is
reflected properly in both experiments. The main peaks of the
distribution are reproduced with respect to their height
relative to the rest of the distribution. The variation of
the temporal granularity shows up as the different
scaling of the y axes. Since our approach performs incremental
wear-leveling, the total memory lifetime is not considered.
Hence, the absolute scaling of the write approximation does
not matter. The coarser temporal granularity
does not reduce the precision of the approximation in
this setup, because enough samples are still recorded, even
for n = 20000. If the application runs only for a relatively short
time or the temporal granularity is configured too coarsely, not enough
samples might be available to reflect the characteristic of
the distribution properly. This trade-off should be taken into
account when choosing the temporal granularity.
Fig. 5. Coarse-grained full wear-leveling result for sha, n = 5000. Panels: sha - baseline, sha - wear-leveling; x-axis: main memory (52 kB); y-axis: write count (10^0 to 10^6).
TABLE I
CPU OVERHEAD FOR THE WRITE-COUNT APPROXIMATION
            bitcount   pfor     sha     dijkstra
n = 5000    5.72%      11.50%   4.94%   7.20%
n = 20000   1.50%      3.24%    1.77%   1.89%
When choosing a temporal granularity, the introduced overhead
should also be considered. To evaluate the overhead, the
necessary additional CPU cycles are calculated as a percentage
of the baseline execution without write-count approximation.
Table I lists the calculated CPU overhead of both experiments.
The relative overhead is of a similar magnitude for all benchmarks, because
the approximation system is triggered in proportion to the total write
count and thus to the execution time.
2) Full Wear-Leveling Evaluation: To determine whether the
estimation is precise enough to enable aging-aware wear-
leveling, the approximation and the wear-leveling algorithm are
combined and evaluated again. The red-black tree based
wear-leveling algorithm is activated and triggered from the
approximation system. The spatial granularity remains at
4 kB, while the temporal granularity of the approximation is again
chosen as n = 5000 and n = 20000. A remapping of a page
is requested whenever the write-count estimation exceeds the
value of 4 (for n = 5000) or the value of 1 (for n = 20000). Since
4 · 5000 = 1 · 20000, both thresholds correspond to the same number
of estimated writes, which leads to mostly the same total number
of page relocations in both experiments. Thus, they can be
compared regarding the quality of the write-count approximation.
Fig. 6. Coarse-grained full wear-leveling result for sha, n = 20000. Panels: sha - baseline, sha - wear-leveling; x-axis: main memory (52 kB); y-axis: write count (10^0 to 10^6).
Figure 5 and Figure 6 show the resulting write distribution
of our simulation under coarse-grained wear-leveling for the
sha benchmark. The results from the other benchmarks are
only presented by their calculated improvement later due to
space limitation. Note that due to the logarithmic scale of the y
axes memory bytes with a write-count of 0 are not displayed.
The estimated write-count distribution is precise enough to
perform aging-aware relocations and balance the wear-levels
across the target memory region.
3) Memory Lifetime Improvement: Considering the gained
improvement of the memory lifetime requires some assump-
tions. First, the system is considered dead once the first
memory cell is worn out. Thus, the maximum write count
to the memory determines the memory lifetime. Assuming
that the target of each write access could be shuffled through
the memory arbitrarily, the theoretical best memory lifetime
could be achieved when every memory cell is written equally
often, thus the mean write count would be applied to each
cell. Combining both considerations, Equation (1) calculates
the achieved endurance (AE), which is the fraction of the ideal
memory lifetime, which is achieved by the analyzed execution.
A value of 1 means that the experiment already achieves the
maximum memory lifetime, while a value of, for instance, 0.5
means that the memory lifetime could be doubled in the ideal
case.
\[ \mathrm{AE} = \frac{\text{mean write count}}{\text{max write count}} \tag{1} \]
Comparing the achieved endurance of an execution with en-
abled wear-leveling to the baseline without any wear-leveling
leads to an endurance improvement (EI), which can be de-
termined according to Equation (2). The maximum endurance
improvement thus depends on the achieved endurance of the
baseline.
\[ \mathrm{EI} = \frac{\mathrm{AE}_{\text{analyzed}}}{\mathrm{AE}_{\text{baseline}}} \tag{2} \]
The endurance improvement describes how many additional
write accesses can be performed before the memory wears out
while using the analyzed wear-leveling technique, compared
to the baseline, but does not give any insight into whether the
application profits from the additional writes. For instance, an EI of
2 means that twice as many writes can be performed
compared to the situation without wear-leveling. If the
wear-leveling causes 100% overhead, all the additional writes
would be consumed by the wear-leveling and no real benefit
would be achieved. Therefore, the introduced write overhead
WO (expressed relative to the total number of writes of the
baseline execution) has to be considered as well to determine
the lifetime improvement (LI) according to Equation (3). An
LI value of, for instance, 2 implies that the application can
perform twice as many writes, and thus can run twice as
long, after accounting for the overhead and the writes introduced
by the wear-leveling.
\[ \mathrm{LI} = \frac{\mathrm{EI}}{\mathrm{WO} + 1} \tag{3} \]
Similarly, the achieved endurance can be related to the write
overhead, which leads to the normalized endurance (NE).
\[ \mathrm{NE} = \frac{\mathrm{AE}}{\mathrm{WO} + 1} \tag{4} \]
The write overhead is determined in this evaluation on the
simulation results by comparing the total number of memory
writes for each benchmark execution with the corresponding
baseline. For the four benchmark applications, the achieved
endurance, the write overhead, and the lifetime improvement
are calculated for both wear-leveling experiments, and the results
are collected in Table II.
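Equations (1) to (4) can be written down as a short Python sketch. The function names and the two write-count lists are illustrative assumptions and are not taken from the evaluation; WO is passed as a fraction (0.05 for 5%).

```python
def achieved_endurance(write_counts):
    """Equation (1): mean write count divided by max write count."""
    mean = sum(write_counts) / len(write_counts)
    return mean / max(write_counts)


def endurance_improvement(ae_analyzed, ae_baseline):
    """Equation (2)."""
    return ae_analyzed / ae_baseline


def write_overhead(writes_analyzed, writes_baseline):
    """Additional writes relative to the baseline total, as a fraction."""
    return sum(writes_analyzed) / sum(writes_baseline) - 1.0


def lifetime_improvement(ei, wo):
    """Equation (3)."""
    return ei / (wo + 1)


def normalized_endurance(ae, wo):
    """Equation (4)."""
    return ae / (wo + 1)


# Hypothetical per-cell write counts for an 8-cell memory (not from the paper).
baseline = [1000, 10, 10, 10, 10, 10, 10, 10]
leveled = [150, 150, 150, 150, 150, 150, 140, 140]

ae_b = achieved_endurance(baseline)
ae_l = achieved_endurance(leveled)
wo = write_overhead(leveled, baseline)
li = lifetime_improvement(endurance_improvement(ae_l, ae_b), wo)
print(f"AE baseline={ae_b:.3f} AE leveled={ae_l:.3f} WO={wo:.1%} LI={li:.2f}")
```

When both executions use the same number of memory cells, Equations (1) to (3) reduce LI to the ratio of the maximum write counts (here 1000 / 150), which is a convenient sanity check for such a calculation.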
TABLE II
LIFETIME IMPROVEMENT (LI) FOR COARSE-GRAINED WEAR-LEVELING
Configuration   Benchmark   AE      WO      NE      LI
n = 5000        bitcount    0.016   5.10%   0.015   18.90
n = 5000        pfor        0.043   5.10%   0.041   40.01
n = 5000        sha         0.022   5.05%   0.021   11.20
n = 5000        dijkstra    0.022   5.10%   0.021   28.65
n = 20000       bitcount    0.016   5.11%   0.015   18.93
n = 20000       pfor        0.044   5.12%   0.042   40.06
n = 20000       sha         0.019   5.10%   0.018   9.72
n = 20000       dijkstra    0.022   5.11%   0.021   28.26
We observe the following properties. First, the memory
write overhead is mostly independent of the configuration
of the approximation system, because the approximation in
general does not cause many additional memory writes. Second,
the lifetime improvement depends on the total amount
of memory which is used for the wear-leveling, since the
write pattern of the application mostly targets
a single memory page anyway. The more cold pages this page can
be remapped to, the higher the improvement. Third, although the
lifetime is improved by a considerable factor, the achieved
endurance remains at no more than ≈ 4% of the ideal lifetime in
all benchmarks. This stems from the high non-uniformity
within memory pages, which is caused by the applications.
As memory pages are only relocated to other 4 kB aligned
memory pages, the non-uniformity within pages is not resolved
by the wear-leveling system.
To summarize this section, aging-aware wear-leveling at
the coarse granularity of 4 kB sized memory pages performs
reasonably in a software-only manner due to the statistical
write-count approximation. Nevertheless, a coarse-grained
wear-leveling technique alone is not sufficient to achieve an
equal balance of the wear-levels all over the memory due to
the high non-uniformity within memory pages.
VI. FINE-GRAINED STACK WEAR-LEVELING
To overcome the problem of intra-page non-uniformity,
solutions in the literature are extended with a finer-grained wear-
leveling technique, which resolves the non-uniformity within
the coarse-grained memory regions that are subsequently
targeted by the coarse-grained technique [21], [26]. To the best
of our knowledge, all fine-grained extensions are
realized either in hardware, by remapping single bytes or groups of
bytes with an additional abstraction, or by functional data
remapping [17], which requires at least compiler support. In
this section, we propose a software-only fine-grained extension
to the coarse-grained wear-leveling system (Section V), which
resolves non-uniform write accesses in the memory pages
of the stack region. These pages are subsequently targeted by the
coarse-grained wear-leveling system and are remapped
to other physical pages.
Since all fine-grained wear-leveling extensions are hardware
based, we most likely cannot propose a generic fine-grained
wear-leveling approach based on commonly available hard-
ware. Instead, we propose a specialized technique, which only
targets the stack region of C / C++ compiled applications.
The concept to target the stack with a specialized wear-
leveling system in a software-based manner is also considered
by Li et al. [18]. The basic idea is to allocate every stack
frame for a new function call on the heap through an aging-
aware memory allocator. This approach has two major
disadvantages: First, the wear-leveling quality relies on the
application performing enough fine-grained function calls
to apply sufficient wear-leveling actions. Second, the amount
of required stack memory might not be known in advance^5,
which leads to a certain fragmentation and to worse wear-
leveling results. Due to these disadvantages, we instead
relocate the entire stack memory without the application’s
cooperation.
As the stack is used by the compiled code relative to the
stack pointer (sp)^6, the application can be instructed to use
another memory location as the stack by adjusting the sp.
As the stack is the main cause of non-uniform write
accesses anyway (see Section V-C), we focus our fine-grained wear-
leveling extension on relocating the stack to other memory
locations and thus resolve the non-uniform write access pattern
inside the stack.
A. Circular Stack Relocation
To evenly distribute the write accesses to the stack, we move
the stack region in a circular manner through the memory. In
essence, the physical memory content is relocated with a fixed
offset in one direction, always with overflow semantics at
the end of the memory. For the Start-gap approach, this can
be achieved by a corresponding remapping function, because
an additional abstraction layer maintains the logical view on
the memory. In our approach, the runtime system allocates a memory region of
the size of multiple memory pages for the application’s stack.
The stack is relocated from time to time by moving the sp
further by an offset and copying the old stack content to the
corresponding new location. The logical view of the application
always expects free memory bytes left (negative offset) of the
sp and the already created stack content directly right (positive
offset) of the sp. As long as the stack is only relocated in one
direction, this view can be maintained easily. A wraparound
at the end of the reserved memory region cannot be achieved
trivially when the stack should be relocated by the same offset
in each step, since the stack content cannot be split. Thus,
we install a mechanism, called shadow stack, which helps to
implement the wraparound at the end of the reserved memory
region.
^5 C99 allows dynamically sized local arrays [2]. However, this could also be
achieved in assembly.
^6 Depending on the application logic, concrete pointer values may also be
calculated and stored in variables. These pointers are also considered when the
memory location of the stack is changed.
Fig. 7. Shadow stack (labels: shadow stack, reserved stack, sp, valid stack content).
1) Shadow Stack: The basic concept of the shadow stack
is to allow one part of the stack to remain at the end of the
reserved memory region, while the rest of the stack has already
wrapped around to the beginning. At any point in time, the
entire stack content must be accessible by addressing memory
contents right of the sp (with a positive offset). Furthermore,
at any point in time the same amount of free memory should
be available left of the sp (with a negative offset). Only by
maintaining these two properties, the application can continue
the execution at any time.
The setup of the shadow stack is illustrated in Figure 7.
Technically, the real stack is present as a consecutive virtual
memory region, which is shown in the right half of Figure 7.
For the shadow stack, the same amount of virtual memory
space is allocated to the left of the real stack and is mapped to
exactly the same physical memory pages as the real stack.
Thus, given an arbitrary virtual address A of the real stack,
the same physical content is accessed at the virtual address
S(A) = A − stacksize. This also implies that setting the
sp from some virtual address S(A) inside the shadow stack
to the corresponding real stack address A does not change
the application’s perspective on the stack at all. Using this
mechanism, the stack relocation is implemented in two steps.
First, the stack is moved down the memory periodically. At
any time, the application can access the same amount of
memory left of the sp, because the writes can target the
shadow stack. Once the currently used stack (including all
valid stack content) is entirely moved to the shadow stack,
the sp is set back to the corresponding real stack address.
As mentioned before, the virtual memory at the new location
of the sp contains exactly the same content as at the old
location. Hence, the application’s perspective is maintained
and the entire stack is wrapped around back to the real stack
(right half). Repeating these two steps regularly, the stack is
relocated in a circular manner with the same offset in each
relocation step.
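The two-step circular relocation can be illustrated with a small Python simulation. It is a sketch under stated assumptions rather than the authors' implementation: the double mapping of the real and the shadow stack is modelled by a modulo operation on a single `bytearray`, and `STACK_SIZE`, `BASE`, and all function names are hypothetical. The sketch also assumes that the used stack plus one relocation offset fits into the reserved region.

```python
STACK_SIZE = 4096        # bytes of physical stack memory (assumed value)
BASE = 0x1_0000_0000     # first virtual address of the real stack (assumed)


def shadow(addr):
    """S(A) = A - stacksize: the shadow alias of a real stack address."""
    return addr - STACK_SIZE


def to_physical(vaddr):
    """Real and shadow stack are mapped to the same physical bytes."""
    assert BASE - STACK_SIZE <= vaddr < BASE + STACK_SIZE
    return (vaddr - BASE) % STACK_SIZE


def read_stack(phys, sp, used):
    """Read the valid stack content located right of the sp."""
    return bytes(phys[to_physical(sp + i)] for i in range(used))


def relocate(phys, sp, used, offset):
    """Move the valid content [sp, sp + used) down by `offset` bytes."""
    new_sp = sp - offset
    # Ascending copy: each source byte is read before it is overwritten.
    for i in range(used):
        phys[to_physical(new_sp + i)] = phys[to_physical(sp + i)]
    if new_sp + used <= BASE:   # the whole valid stack lies in the shadow
        new_sp += STACK_SIZE    # same physical bytes, seen from the real stack
    return new_sp
```

Calling `relocate` repeatedly with a fixed offset moves the sp through the reserved region in a circle, while `read_stack` returns the same content after every step.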
2) Combination with Coarse-grained Wear-Leveling: As
stated before, the fine-grained wear-leveling is designed as
an extension to the previously presented coarse-grained wear-
leveling system (Section V). Both systems can work together
nearly out of the box. Since the stack relocation only operates
in the virtual memory space, a stack relocation can only be
interrupted by the remapping of the page to another physical
memory page. Nevertheless, when remapping hot and cold
pages, the coarse-grained wear-leveling system has to be aware
of the special shadow stack configuration and has to maintain
it during remapping. Furthermore, the statistical write-count
approximation has to aggregate the captured write accesses
from the shadow stack and from the real stack to the same
physical page. Finally, we set up a periodic stack relocation
by using the same performance counter overflow interrupt
mechanism as the coarse-grained wear-leveling system. This
ensures that stack relocations are triggered after a certain
number of writes to the memory. Additionally, the overhead
can be reduced by combining the interrupt mechanisms and
using only one interrupt service routine (ISR).
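The aggregation of sampled writes from the shadow stack and the real stack can be sketched as follows. The page numbers, the constant names, and the choice to fold shadow pages onto their real-stack alias are assumptions made for illustration only.

```python
from collections import Counter

PAGE_SIZE = 4096
STACK_PAGES = 4                                  # assumed stack size in pages
REAL_BASE_PAGE = 0x100000                        # assumed first real stack page
SHADOW_BASE_PAGE = REAL_BASE_PAGE - STACK_PAGES  # shadow lies directly below


def canonical_page(vpage):
    """Fold a shadow-stack page onto its real-stack alias."""
    if SHADOW_BASE_PAGE <= vpage < REAL_BASE_PAGE:
        return vpage + STACK_PAGES
    return vpage


def record_sample(counts, vaddr):
    """Account a sampled write to the page that shares its physical memory."""
    counts[canonical_page(vaddr // PAGE_SIZE)] += 1


counts = Counter()
real_addr = REAL_BASE_PAGE * PAGE_SIZE + 128
record_sample(counts, real_addr)                             # real stack
record_sample(counts, real_addr - STACK_PAGES * PAGE_SIZE)   # shadow alias
```

Both samples end up in the same counter, so the wear-leveling decision sees the combined write count of the shared physical page.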
B. Address Consistency
The concept of moving the stack in a circular manner
(Section VI-A) is based on the sp relative access of the
stack region by C / C++ compiled applications. However,
the sp relative access is not the only way to access memory
contents within the stack memory. Sometimes, the application
needs to create pointers to variables inside the stack to pass
them to subsequent function calls or to store them in a
central variable. Furthermore, pointers to variables on the stack
may also be moved out of the stack to some global or heap
data structures. During a relocation of the stack, the memory
address of the variables on the stack changes, while the content
of the pointers stays unchanged. This leads to invalid pointers
and to a wrong behavior of the application. To overcome
this problem, we equip the fine-grained relocation system
with two pointer adjustment mechanisms, which maintain the
correctness of pointer contents over stack relocations.
1) In-memory Pointer Adjustment: First, an in-memory
pointer adjustment technique targets pointers to stack contents
which are stored inside the stack itself. This is the usual case
when pointers to local variables are passed to subsequent
function calls or positions inside local arrays need to be
remembered. For the relocation of the stack, the entire valid
stack content has to be copied to the new memory location
anyway, so every memory word of the current
valid stack is loaded into the CPU and stored back to
memory. During this process, each memory word is checked,
and a pointer to a stack variable is adjusted by the relocation
offset. To identify a memory word as a pointer into the stack,
a strong constraint needs to be put on the memory usage of
the application. As a memory word is just seen as an 8 byte
number by the relocation routine, the application has to make
sure not to use any regular variable content that has the same
value as a pointer into the stack would have. We
ensure this by allocating the virtual memory pages of the
stack at a memory location above 4 GB and only allowing the
application to use 64 bit aligned data types in which only the
32 lower bits are set.
Fig. 8. Fine-grained wear-leveling result for sha (page relocation every t = 64th stack relocation). Panels: sha - coarse-grained, sha - fine-grained; x-axis: main memory (52 kB); y-axis: write count (10^0 to 10^6).
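The in-memory pointer adjustment can be sketched in Python as a word-by-word copy. The stack address range, the little-endian word layout, and the function name are assumptions for illustration; the sketch is not the authors' relocation routine.

```python
import struct

STACK_LOW = 0x1_0000_0000           # assumed: stack region starts at 4 GB
STACK_HIGH = STACK_LOW + 2 * 4096   # assumed extent (shadow plus real stack)


def relocate_words(stack_bytes, offset):
    """Copy the stack in 8 byte words and shift every word that looks
    like a pointer into the stack by `offset` (negative = moving down)."""
    out = bytearray()
    for (word,) in struct.iter_unpack("<Q", stack_bytes):
        if STACK_LOW <= word < STACK_HIGH:
            word += offset  # treated as a pointer into the stack
        out += struct.pack("<Q", word)
    return bytes(out)


# Two plain values that fit into 32 bits and one pointer into the stack.
words = [42, 0xFFFF_FFFF, STACK_LOW + 4096 + 128, 7]
moved = relocate_words(struct.pack("<4Q", *words), -64)
print([hex(w) for w in struct.unpack("<4Q", moved)])
```

The check is purely numeric, which is why the text requires regular data to stay within the lower 32 bits: a data word that happened to fall into the stack address range would be shifted as well.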
2) Smart-Pointer Adjustment: As the previous technique
only targets pointers which are stored inside the stack, pointers
which are stored in global or heap data structures are still
corrupted after a stack relocation. To solve this problem, the
fine-grained wear-leveling system ships with a smart-pointer
implementation, which checks the current relocation of the
stack during dereferencing. The internally stored raw pointer
is adjusted accordingly and then dereferenced. The smart-pointer
implementation only hands out copies of variables, never
the internal raw pointer. Whenever the application aims to
move a pointer out of the stack, it has to use the smart-pointer
implementation instead of a raw pointer.
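The behavior of such a smart pointer can be sketched as follows. The classes `StackRelocator` and `StackPtr` are hypothetical and only model the idea: memory is a dictionary from address to value, the relocator tracks the accumulated displacement of the stack, and the wraparound through the shadow stack is deliberately ignored to keep the sketch short.

```python
class StackRelocator:
    """Hypothetical bookkeeping of the total stack displacement so far."""

    def __init__(self):
        self.total_offset = 0

    def relocate(self, offset):
        self.total_offset += offset


class StackPtr:
    """Smart pointer that re-bases its raw address on every dereference."""

    def __init__(self, relocator, memory, raw_address):
        self._relocator = relocator
        self._memory = memory
        self._raw = raw_address
        self._offset_at_creation = relocator.total_offset

    def _current_address(self):
        moved = self._relocator.total_offset - self._offset_at_creation
        return self._raw + moved

    def load(self):
        # Hands out a copy of the value, never the raw address.
        return self._memory[self._current_address()]

    def store(self, value):
        self._memory[self._current_address()] = value


memory = {1000: 5}
relocator = StackRelocator()
ptr = StackPtr(relocator, memory, 1000)

memory[936] = memory.pop(1000)  # the stack content moves down by 64 bytes
relocator.relocate(-64)
print(ptr.load())               # still reads the relocated variable
```

Because the raw address never leaves the object, a stale copy of it cannot survive a relocation, which is the reason the text forbids handing out the internal raw pointer.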
To summarize, maintaining the consistency of pointers during
stack relocations puts strong constraints on the application
and inflates in-memory data structures. Nevertheless, the
constraints can be met by reimplementing applications
accordingly, which enables software-only fine-grained in-
memory wear-leveling.
C. Evaluation
The technical details of the combined implementation of
the fine-grained stack relocation technique and the coarse-
grained aging-aware wear-leveling system are explained in
Section VI-A2. The movement of the stack by an offset
of 64 bytes^7 is triggered periodically by the performance
counter overflow mechanism. In this evaluation, the performance
counter overflow is configured to trigger after every
n = 1000th memory write access, thus the stack is relocated
every 1000th memory write. Accordingly, the write-count
approximation works on the same temporal granularity. The
coarse-grained wear-leveling system is triggered whenever a
page exceeds an approximated write-count of t = 64 and
thus, on average, on every 64th stack relocation. Considering the
relocation offset of 64 bytes, a coarse-grained page relocation
is triggered whenever the stack has been relocated by 4096 bytes,
which is the size of one memory page. A second experiment is
executed with the trigger for the coarse-grained wear-leveling
system set to t = 32. This increases the total number of
page relocations at the cost of higher memory overhead.
Furthermore, in this scenario page relocations are performed
when the stack has only passed half of a memory page, thus
the internal non-uniformity is higher.
^7 In our simulation setup, 64 byte cache-lines are assumed to be written
entirely. A finer movement than 64 bytes has no further effect on the
wear-leveling result in this case.
Fig. 9. Fine-grained wear-leveling result for sha (page relocation every t = 32nd stack relocation). Panels: sha - coarse-grained, sha - fine-grained; x-axis: main memory (52 kB); y-axis: write count (10^0 to 10^6).
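The relation between the trigger value t and the distance the stack has travelled can be written out as a short worked calculation. It uses only the numbers stated above and assumes, as the text does, that a page relocation occurs on average once per t stack relocations.

```latex
\begin{align*}
  t = 64:&\quad 64 \cdot 64\,\text{bytes} = 4096\,\text{bytes} = 1 \text{ memory page} \\
  t = 32:&\quad 32 \cdot 64\,\text{bytes} = 2048\,\text{bytes} = \tfrac{1}{2} \text{ memory page}
\end{align*}
```

With n = 1000 writes per stack relocation, this corresponds to roughly 64000 and 32000 memory writes between two page relocations, respectively.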
Figure 8 and Figure 9 show the resulting memory write-
count distribution for the sha benchmark for both
configurations, compared to the coarse-grained wear-leveling
system alone (Figure 5). The results show that the non-
uniformity within virtual memory pages can be resolved by
the fine-grained stack wear-leveling technique and thus the
overall write pattern to the main memory is more uniform.
Even though the total number of page relocations is higher
in the second experiment (Figure 9), the results of the
first experiment are slightly better, because a page
relocation is only performed when the stack has moved by an
offset of an entire memory page.
1) Memory Lifetime Improvement: To finalize the evaluation,
the improvement of the memory lifetime is calculated
in the same way as in Section V-C3. The corresponding
results are collected in Table III. First of all, it can be observed
that the write overhead WO varies strongly between the
benchmarks. This is caused by the different
stack usage of each benchmark. The sha application, for
instance, uses a big part of the stack memory and thus has a
very high write overhead. The total write distribution of the
application in the end determines the lifetime improvement
LI. The dijkstra application, for instance, also exhibits a highly
non-uniform memory usage within the bss segment, which
is not resolved by our fine-grained wear-leveling technique.
Thus, the results for dijkstra are relatively poor.
TABLE III
LIFETIME IMPROVEMENT (LI) FOR FINE-GRAINED WEAR-LEVELING
Configuration   Benchmark   AE      WO        NE      LI
t = 64          bitcount    0.788   0.47%     0.784   953.52
t = 64          pfor        0.698   9.17%     0.639   614.45
t = 64          sha         0.746   111.59%   0.353   187.55
t = 64          dijkstra    0.018   2.90%     0.017   23.64
t = 32          bitcount    0.592   0.79%     0.587   713.98
t = 32          pfor        0.462   10.78%    0.417   400.96
t = 32          sha         0.693   112.91%   0.328   173.09
t = 32          dijkstra    0.020   4.50%     0.019   25.87
In conclusion, the memory lifetime can be improved significantly
if the intra-page non-uniformity can be resolved
by the fine-grained stack wear-leveling, e.g., by a factor of more than 900
for the bitcount application. Note that the memory lifetime
improvement strongly depends on the available memory size.
In this evaluation, only the minimal required amount of
memory for each benchmark is considered. If a system offers
additional spare memory, the memory lifetime can be further
improved. The improvement is determined mostly by the
resulting uniformity of the memory access distribution (AE)
and the write overhead.
2) Comparison to the Literature: Several techniques for in-
memory wear-leveling for NVM have been proposed over the
last years. In this section, we compare our evaluation results
with the following related techniques: Start-gap was proposed by
Qureshi et al. [21] and relocates the entire memory space in
a circular manner on the granularity of 256 byte cache-lines
through special hardware. To resolve non-uniformity within
cache-lines, a finer-grained address space randomization is
introduced. Khouzani et al. [3] proposed a wear-leveling
scheme, which hooks into the page allocation process of the
operating system. Due to knowledge about the current write-
count and the write characteristic to each memory region,
wear-leveling actions are decided and performed. Chen et al.
[8] proposed a similar scheme with advanced management data
structures to make the wear-leveling algorithm more efficient.
This approach only operates on the coarse granularity of
virtual memory pages.
As a metric, we adopted the term normalized endurance
(NE) from the Start-gap approach, which is our achieved
endurance value related to the memory write overhead. As
a concrete lifetime or a relative improvement always highly
depends on the considered benchmark and the memory size,
we use the normalized endurance as a fraction of the possible
ideal memory usage, respectively the memory lifetime. Unfor-
tunately only a few works consider the possible ideal lifetime
in their evaluation. The previously mentioned works [3], [8],
[21] all report to achieve almost the ideal memory lifetime in
the best case (i.e., in the range of ≈ 87% to ≈ 98%). Our best
result achieves 78.43% of the ideal memory lifetime.
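The best result can be traced back to Table III with Equation (4). The following worked calculation uses the rounded table values for bitcount at t = 64, so the last digit is only approximate.

```latex
\[
  \mathrm{NE} = \frac{\mathrm{AE}}{\mathrm{WO} + 1}
              = \frac{0.788}{0.0047 + 1}
              \approx 0.7843
\]
```

This corresponds to the reported 78.43% of the ideal memory lifetime.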
As our system requires no additional hardware and can be
tuned regarding the write overhead, it enables a trade-off in
the design process of a hardware platform. The costs
of the hardware support required for in-memory wear-leveling
can be traded for a slightly worse wear-leveling quality and
a possibly larger runtime overhead.
VII. OUTLOOK ON FURTHER FINE-GRAINED EXTENSIONS
The final evaluation results in Table III show that the overall
wear-leveling quality can be good if the non-uniformity
of write accesses within memory pages can be resolved.
However, not only the stack has to be targeted by a fine-
grained specific extension, but also the data/bss and, if it exists,
the heap segment. For instance, the dijkstra application has
a highly non-uniform memory usage inside the bss segment,
leading to a poor wear-leveling result. The text segment requires no
special wear-leveling, because all accesses are read-only by
definition. While specific wear-leveling for the heap has been
targeted in form of aging-aware memory allocations in the
literature [10], [18], the data/bss segment requires another
special technique. For future work, we propose to relocate
elements of the data/bss segment by using the feature of
dynamically linked code. If the application is not statically linked,
the addresses or an access offset for the data/bss segment are
determined and set while the application is loaded. During a
maintenance phase, i.e., an interrupt, the text segment could
be re-loaded with relocated addresses of the data/bss segment
and thus these segments can be relocated. This could achieve
a circular movement, similar to the movement for the stack,
for the data/bss segment.
VIII. CONCLUSION
Recently, several in-memory wear-leveling techniques have
been proposed to tackle a major disadvantage, namely the
lower write endurance, of NVM technologies, which might
replace classic DRAM in the near future. Advanced, aging-
aware wear-leveling techniques rely on hardware-provided age
information, such as a write-count per cell / byte / domain, to
achieve good wear-leveling results. As the necessary hardware
support is not available in common or commercial off-the-shelf
(COTS) hardware, it introduces additional costs. The hardware
at least requires additional chip-space, but also might be very
complex to build to meet a certain clock-speed and granularity.
To overcome the need for this hardware and offer the
possibility to use the chip-space for other features, this paper
introduced a software-only, aging-aware wear-leveling system,
which only makes use of widely available hardware features.
The final evaluations show that we are able to achieve up
to 78.43% of the theoretically ideal possible memory lifetime
with our wear-leveling system without any additional hardware
costs. During the design process of a system, it might be
totally reasonable to only achieve roughly 80% of the possible
memory lifetime (e.g. 8 instead of 10 years), but to equip the
system with advanced hardware controllers to improve energy
consumption, for instance.
As we believe it is important to offer the possibility
of such software-only in-memory wear-leveling, we
release all our sources, including benchmark applications
and wear-leveling implementations:
https://github.com/tu-dortmund-ls12-rt/NVMSimulator.
ACKNOWLEDGEMENT
This paper is supported in parts by the German Re-
search Foundation (DFG) Project OneMemory (Project num-
ber 405422836).
REFERENCES
[1] “Arm architecture reference manual armv8, for armv8-a architecture
profile,” https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile.
[2] “Using the gnu compiler collection (gcc) - 6.20 arrays of variable
length,” https://gcc.gnu.org/onlinedocs/gcc/Variable-Length.html.
[3] H. Aghaei Khouzani, Y. Xue, C. Yang, and A. Pandurangi, “Prolonging
pcm lifetime through energy-efficient, segment-aware, and wear-
resistant page allocation,” in Proceedings of the 2014 International
Symposium on Low Power Electronics and Design, ser. ISLPED ’14.
New York, NY, USA: ACM, 2014, pp. 327–330. [Online]. Available:
http://doi.acm.org/10.1145/2627369.2627667
[4] Y. Bao, M. Chen, Y. Ruan, L. Liu, J. Fan, Q. Yuan, B. Song, and J. Xu,
“Hmtt: A platform independent full-system memory trace monitoring
system,” in Proceedings of the 2008 ACM SIGMETRICS International
Conference on Measurement and Modeling of Computer Systems, ser.
SIGMETRICS ’08. New York, NY, USA: ACM, 2008, pp. 229–240.
[Online]. Available: http://doi.acm.org/10.1145/1375457.1375484
[5] R. Bayer, “Symmetric binary b-trees: Data structure and maintenance
algorithms,” Acta Informatica, vol. 1, no. 4, pp. 290–306, Dec 1972.
[Online]. Available: https://doi.org/10.1007/BF00289509
[6] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu,
J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell,
M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,”
SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011.
[Online]. Available: http://doi.acm.org/10.1145/2024716.2024718
[7] J. Boukhobza, S. Rubini, R. Chen, and Z. Shao, “Emerging nvm:
A survey on architectural integration and research challenges,” ACM
Trans. Des. Autom. Electron. Syst., vol. 23, no. 2, pp. 14:1–14:32, Nov.
2017. [Online]. Available: http://doi.acm.org/10.1145/3131848
[8] C.-H. Chen, P.-C. Hsiu, T.-W. Kuo, C.-L. Yang, and C.-Y. M. Wang,
“Age-based pcm wear leveling with nearly zero search cost,” in
Proceedings of the 49th Annual Design Automation Conference, ser.
DAC ’12. New York, NY, USA: ACM, 2012, pp. 453–458. [Online].
Available: http://doi.acm.org/10.1145/2228360.2228439
[9] S. Cho and H. Lee, “Flip-n-write: A simple deterministic technique
to improve pram write performance, energy and endurance,” in
Proceedings of the 42nd Annual IEEE/ACM International Symposium
on Microarchitecture, ser. MICRO 42. New York, NY, USA: ACM,
2009, pp. 347–357. [Online]. Available:
http://doi.acm.org/10.1145/1669112.1669157
[10] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala,
and S. Swanson, “Nv-heaps: making persistent objects fast and safe
with next-generation, non-volatile memories,” in Proceedings of the 16th
International Conference on Architectural Support for Programming
Languages and Operating Systems, ASPLOS 2011, Newport Beach, CA,
USA, March 5-11, 2011, 2011, pp. 105–118.
[11] J. Dong, L. Zhang, Y. Han, Y. Wang, and X. Li, “Wear rate leveling:
Lifetime enhancement of pram with endurance variation,” in Proceed-
ings of the 48th Design Automation Conference. ACM, 2011, pp.
972–977.
[12] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, “Nvsim: A circuit-level
performance, energy, and area model for emerging nonvolatile memory,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, vol. 31, no. 7, pp. 994–1007, 2012.
[13] A. P. Ferreira, M. Zhou, S. Bock, B. Childers, R. Melhem, and
D. Mossé, “Increasing pcm main memory lifetime,” in Proceedings
of the Conference on Design, Automation and Test in Europe, ser.
DATE ’10. Leuven, Belgium: European Design
and Automation Association, 2010, pp. 914–919. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1870926.1871147
[14] V. Gogte, W. Wang, S. Diestelhorst, A. Kolli, P. M. Chen,
S. Narayanasamy, and T. F. Wenisch, “Software wear management
for persistent memories,” in 17th USENIX Conference on File
and Storage Technologies (FAST 19). Boston, MA: USENIX
Association, Feb. 2019, pp. 45–63. [Online]. Available: https:
//www.usenix.org/conference/fast19/presentation/gogte
[15] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and
R. B. Brown, “Mibench: A free, commercially representative embedded
benchmark suite,” in Proceedings of the Workload Characterization,
2001. WWC-4. 2001 IEEE International Workshop, ser. WWC ’01.
Washington, DC, USA: IEEE Computer Society, 2001, pp. 3–14.
[Online]. Available: https://doi.org/10.1109/WWC.2001.15
[16] Y. Han, J. Dong, K. Weng, Y. Wang, and X. Li, “Enhanced wear-rate
leveling for pram lifetime improvement considering process variation,”
IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 24, no. 1, pp. 92–102, Jan 2016.
[17] A. Jacobvitz, “Coset coding to extend the lifetime of non-volatile
memory,” Ph.D. dissertation, Duke University, 2014.
[18] W. Li, Z. Shuai, C. J. Xue, M. Yuan, and Q. Li, “A wear leveling aware
memory allocator for both stack and heap management in pcm-based
main memory systems,” in Proceedings of the 2019 Design, Automation
& Test in Europe (DATE), 2019.
[19] D. Liu, T. Wang, Y. Wang, Z. Shao, Q. Zhuge, and E. Sha, “Curling-
pcm: Application-specific wear leveling for phase change memory
based embedded systems,” in 2013 18th Asia and South Pacific Design
Automation Conference (ASP-DAC), Jan 2013, pp. 279–284.
[20] M. Poremba, T. Zhang, and Y. Xie, “Nvmain 2.0: A user-friendly
memory simulator to model (non-)volatile memory systems,” IEEE
Computer Architecture Letters, vol. 14, no. 2, pp. 140–143, July 2015.
[21] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras,
and B. Abali, “Enhancing lifetime and security of pcm-based main
memory with start-gap wear leveling,” in 2009 42nd Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO), Dec 2009, pp.
14–23.
[22] S. Yu, N. Xiao, M. Deng, Y. Xing, F. Liu,
Z. Cai, and W. Chen, “Walloc: An efficient wear-aware allocator
for non-volatile main memory,” in 2015 IEEE 34th International
Performance Computing and Communications Conference (IPCCC), Dec
2015, pp. 1–8.
[23] H. Volos, G. Magalhaes, L. Cherkasova, and J. Li, “Quartz: A
lightweight performance emulator for persistent memory software,” in
Proceedings of the 16th Annual Middleware Conference. ACM, 2015,
pp. 37–49.
[24] W. Zhang and T. Li, “Characterizing and mitigating the impact of
process variations on phase change based memory systems,” in 2009
42nd Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO), Dec 2009, pp. 2–13.
[25] M. Zhao, L. Shi, C. Yang, and C. J. Xue, “Leveling to the last mile:
Near-zero-cost bit level wear leveling for pcm-based main memory,” in
2014 IEEE 32nd International Conference on Computer Design (ICCD),
Oct 2014, pp. 16–21.
[26] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, “A durable and energy efficient
main memory using phase change memory technology,” in Proceedings
of the 36th Annual International Symposium on Computer Architecture,
ser. ISCA ’09. New York, NY, USA: ACM, 2009, pp. 14–23. [Online].
Available: http://doi.acm.org/10.1145/1555754.1555759
[27] M. Zukowski, S. Heman, N. Nes, and P. A. Boncz, “Super-scalar ram-
cpu cache compression,” in ICDE, vol. 6, 2006, p. 59.
