High Performance Data Persistence in Non-Volatile Memory for Resilient
  High Performance Computing by Huang, Yingchao et al.
Yingchao Huang
University of California, Merced
yhuang46@ucmerced.edu
Kai Wu
University of California, Merced
kwu42@ucmerced.edu
Dong Li
University of California, Merced
dli35@ucmerced.edu
ABSTRACT
Resilience is a major design goal for HPC. Checkpoint is the most
common method to enable resilient HPC. Checkpoint periodically
saves critical data objects to non-volatile storage to enable data per-
sistence. However, using checkpoint, we face dilemmas between
resilience, recomputation and checkpoint cost. The reason that
accounts for the dilemmas is the cost of data copying inherent in
checkpoint. In this paper we explore how to build resilient HPC
with non-volatile memory (NVM) as main memory and address
the dilemmas. We introduce a variety of optimization techniques
that leverage high performance and non-volatility of NVM to en-
able high performance data persistence for data objects in appli-
cations. With NVM we avoid data copying; we optimize cache
flushing needed to ensure consistency between caches and NVM.
We demonstrate that using NVM is feasible to establish data persis-
tence frequently with small overhead (4.4% on average) to achieve
highly resilient HPC and minimize recomputation.
1 INTRODUCTION
Resilience is one of the major design goals for extreme-scale HPC
systems. Looking forward to future HPC with shrinking hardware feature
sizes and aggressive power management techniques, the mean time
between failures (MTBF) in HPC could be shortened because of more
frequent soft and hard errors; application execution could be
interrupted more frequently; and the correctness of application
results could be corrupted more often.
To address the above resilience challenge, checkpoint (or more
specically, application-level checkpoint) is the most common
method deployed in current production supercomputers. Appli-
cation level checkpoint periodically saves application critical data
objects to non-volatile storage to enable data persistence. Once
a failure happens, the application can restart from the last valid
state of the data objects without restarting from the beginning.
However, checkpoint faces two dilemmas. First, there is a dilemma
between HPC resilience and checkpoint overhead. On one hand,
as MTBF may become shorter in the future, we have to increase
checkpoint frequency to improve HPC fault tolerance. On the other
hand, the frequent checkpoint results in larger runtime overhead.
We call this the resilience dilemma. Second, there is a
dilemma between recomputation cost and checkpoint overhead. On
one hand, we want to increase checkpoint frequency to minimize
recomputation cost and reduce data loss. On the other hand, the
frequent checkpoint results in larger runtime overhead. We call
this the recomputation dilemma.
The fundamental reason that accounts for the above two dilemmas is
the cost of data copying inherent in the checkpoint mechanism. The
data copying operations can be expensive, because checkpoint data has
to be stored in a remote or local durable hard drive. Although
disk-less checkpoint reduces data copying overhead [4, 26, 33, 34, 41]
by using main memory, this technique has to encode data across
multiple nodes to create redundancy and only tolerates up to a certain
number of node failures, because of the volatility of memory. Other
techniques, such as multi-level checkpoint [3, 13, 28] and incremental
checkpoint [2, 4, 33, 47], partially remove expensive data copying off
the critical path of application execution, but a checkpoint with a
large data size can still cause large runtime overhead.
The emergence of non-volatile memories (NVM), such as phase change
memory (PCM) and RRAM, is poised to revolutionize memory systems
[5, 43]. The performance of NVM is much better than that of hard
drives, and is even close to or matches that of DRAM [14, 40].
Furthermore, NVM has better scalability than DRAM while remaining
non-volatile. These features make it possible to merge the traditional
two layers of the memory hierarchy (i.e., memory plus back-end
storage) into one layer (i.e., memory without back-end storage) [29].
Given NVM as main memory and its non-volatile nature, is it possible
to change or even remove checkpoint to enable data persistence
frequently, thus fundamentally addressing the above two dilemmas in
future HPC systems? How can NVM be used to address the resilience
challenge for HPC?
This paper aims to answer the above questions, and explores how
to build resilient HPC with emerging NVM as main memory. We
introduce a variety of optimization techniques to leverage high per-
formance and non-volatility of NVM to establish data persistence
for application critical data objects frequently.
We start from a preliminary design that uses NVM as either main
memory or storage to implement checkpoint. We expect that the
superior performance of NVM would allow us to achieve frequent
checkpoint with small runtime overhead and hence address the two
dilemmas. To improve checkpoint performance, we introduce a
couple of optimizations, including parallelization of cache flushing
and using SIMD-based, non-temporal load/store instructions (e.g.,
MOVDQU) to bypass CPU caches and minimize data movement
between caches and memory. However, we reveal that even based
on an optimistic assumption on NVM performance, NVM-based
checkpoint can still lead to large runtime overhead (up to 46%),
because of data copying in checkpoint.
We further study how to leverage non-volatility of NVM to cre-
ate a copy of the data objects. We aim to replace traditional data
copying in checkpoint, which is the fundamental reason that ac-
counts for expensive checkpoint. We introduce a technique, named
in-place versioning. is technique hides programmers from ap-
plication and algorithm details, and leverages application-inherent
memory write operations to create a new version of the data objects
in NVM without extra data copying. We derive a set of rules to
enable automatic transformation of programs to achieve in-place
versioning.
To ensure proper recovery based on the new version of the data
objects, we must guarantee that the data of the new version is
consistent between caches and NVM. Hence, we must flush data
blocks of the new version out of caches, after the new version is
created by the in-place versioning technique. Such cache flushing
operations can be expensive, because there is no mechanism that
allows us to track which data blocks of the new version are in
caches and whether data blocks in caches are clean. As a result, we
must flush all data blocks of the new version as if all data blocks
are in caches, which brings large performance loss.
To minimize the cache flushing cost, we propose to use a privileged
instruction and make it accessible to the application to flush
the entire cache hierarchy, instead of flushing all data blocks of
the new version. For a large data object, flushing the entire cache
hierarchy is often much cheaper. Furthermore, we introduce an
asynchronous and proactive cache flushing mechanism to remove
cache flushing cost off the critical path of application execution
while enabling data consistency in NVM.
In general, in-place versioning plus the optimized cache flushing
allows us to establish data persistence with consistency for
application critical data objects in NVM. The establishment of data
persistence can happen much more frequently than with the traditional
checkpoint mechanism, with high performance. With the evaluation of
six representative HPC benchmarks and one production HPC application
(Nek5000), we show that the runtime overhead is 4.4% on average (up
to 9%) when data persistence is established at every iteration of the
main computation loop. Such frequent and high performance data
persistence allows us to minimize recomputation cost and tolerate a
high error rate in future HPC.
Our major contributions are summarized as follows.
• We explore how to use NVM to enable resilient HPC. We
demonstrate that using NVM (either as main memory or
storage) to implement frequent checkpoint based on data
copying to address the two dilemmas may not be feasible,
because of large data copying overhead, even though NVM
is expected to have superior performance.
• We explore how to enable data persistence with consistency
in NVM with minimized runtime overhead. Without data copying
and with the optimization of cache flushing, using NVM has
the potential to address the resilience and recomputation
dilemmas rooted in the traditional checkpoint.
2 BACKGROUND
In this paper, we focus on HPC applications. Those applications
are typically characterized by iterative structures. In particular,
there is usually a main computation loop in an HPC application.
With the traditional checkpoint mechanism, at every n iterations
of the loop (n is much larger than 1), the application saves its
critical data objects into non-volatile storage. In the rest of the
paper, we call those critical data objects target data objects.
Checkpoint usually happens near the end of an iteration. We call the
execution point where checkpoint happens the persistence
establishment point.
We also distinguish cache line and cache block in this paper.
The cache line describes a location in the cache, and the cache
block refers to the data that go into a cache line. We review NVM
background in this section.
2.1 Non-Volatile Memory Usage Model
There are at least two existing usage models to integrate the emerging
NVM into HPC systems. In the first model, NVM is built as
NVDIMM modules and installed into DDR slots. NVM is physically
attached to the high-speed memory bus and managed by a memory
controller [8]. In the second model, NVM connects to the host by
an I/O controller and I/O bus (e.g., PCI-E) [7].
From the perspective of software, the OS can regard NVM as regular
memory (the first model), similar to DRAM, and NVM provides
the capability of being byte addressable to the OS and applications.
Also, NVM is accessed through load and store instructions.
Alternatively, NVM can be exposed as a block device in the OS [37].
NVM is then accessed via a read/write block I/O interface. A file
system can be built on top of NVM to provide the convenience of
naming schemes and data protection [37].
2.2 Data Consistency in NVM
To build a consistent state for target data objects in NVM (as main
memory) and ensure proper recovery, the target data objects in
NVM must be updated with the most recent data in caches at the
persistence establishment point. However, the prevalence of volatile
caches introduces randomness into write operations in NVM. The time
at which data is written from caches to NVM is subject to the cache
management policy of the hardware and OS.
There are “interfaces” that enable explicit data flushing from
caches to NVM. Those interfaces are presented as processor
instructions or system calls. Using those interfaces, it is possible
to enforce data consistency at the persistence establishment point.
We discuss the common cache flushing instructions as follows.
• clflush instruction: This is the most common cache
flushing instruction. Given a cache block, this instruction
invalidates it from all levels of the processor cache hierarchy.
If the cache line at any level of the cache hierarchy
is dirty, the cache line is written to memory before invalidation.
clflush is a blocking instruction, meaning that
the instruction waits until the data flushing is done [38].
• WBINVD instruction: This is a privileged instruction used
by the OS to flush and invalidate the entire cache hierarchy.
Enabling data consistency with clflush and other cache block-based
cache flushing instructions (particularly CLWB and clflushopt, which
are discussed next) can cause a performance problem for a data object
with a large data size. Because we do not have a mechanism to track
which cache line is dirty and whether a specific cache block is in
caches, we have to flush all cache blocks of target data objects, as
if all cache blocks are in caches. Figure 1 shows how we flush cache
blocks based on cache block-based cache flushing.
Flushing clean cache blocks in caches and flushing cache blocks
not in caches have a performance cost of the same order as flushing
dirty cache blocks. Table 1 shows the performance of flushing cache
blocks in different status in caches. The performance is measured
on a platform with two eight-core Intel Xeon E5-2630 v3 processors
(2.4 GHz, 20MB L3, 256KB L2, and 32KB L1) attached to 32GB DDR4.
Based on the results, we conclude that the cost of flushing all cache
blocks of a data object is roughly proportional to the data object
size.

    #include <stddef.h>
    #include <stdint.h>

    /* FLUSH_ALIGN is the cache line size; flush() stands for one of
       clflush / clflushopt / clwb. Both are defined elsewhere. */

    /* Loop through cache-line-size aligned chunks
       covering the given range of the target data object */
    void cache_block_flush(const void *addr, size_t len)
    {
        uintptr_t ptr;

        for (ptr = (uintptr_t)addr & ~((uintptr_t)FLUSH_ALIGN - 1);
             ptr < (uintptr_t)addr + len;
             ptr += FLUSH_ALIGN)
            flush((char *)ptr); /* clflush / clflushopt / clwb */
    }

Figure 1: Using cache block-based cache flushing instructions to
flush cache blocks of the target data object.

Table 1: Performance of flushing cache blocks in different status in
caches using clflush.

                          Flush dirty cache   Flush clean cache   Flush cache blocks
                          blocks in caches    blocks in caches    not in caches
  Cycles per cache block  228                 254                 350
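The proportionality can be made concrete with a small cost model. The sketch below is only an estimate under stated assumptions: the 64-byte block size and the function names are ours, and the per-block costs are the single-threaded clflush figures from Table 1.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache block size in bytes (typical for x86; not stated in the text). */
#define CACHE_BLOCK_BYTES 64u

/* Number of cache blocks touched by a flush loop over [addr, addr + len)
   that starts at the aligned-down address and advances one block at a time. */
uint64_t blocks_to_flush(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t mask  = ~(uintptr_t)(CACHE_BLOCK_BYTES - 1);
    uintptr_t first = addr & mask;
    uintptr_t last  = (addr + len - 1) & mask;
    return (uint64_t)((last - first) / CACHE_BLOCK_BYTES) + 1;
}

/* Estimated cycles to flush an object, given a per-block cost taken from
   Table 1 (228 for dirty, 254 for clean, 350 for blocks not in caches). */
uint64_t flush_cycles(uintptr_t addr, size_t len, uint64_t cycles_per_block)
{
    return blocks_to_flush(addr, len) * cycles_per_block;
}
```

Under these assumptions, a 20MB object spans 327,680 blocks, so a full flush costs roughly 74.7 million cycles if every block is dirty and about 114.7 million cycles if none is cached, that is, about 31 ms and 48 ms at 2.4 GHz. The model ignores any overlap between flushes, so it should be read as a rough estimate only.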
To support NVM, there are two very new instructions, clflushopt
and CLWB. clflushopt maximizes the concurrency of multiple
clflush within individual threads. The CLWB instruction maximizes
the concurrency of multiple cache line flushing without
cache line invalidation (i.e., leaving data in the cache after cache
line flushing). clflushopt is only available in the most recent
Intel SkyLake microarchitecture. To our knowledge, there is
no hardware available in the market that supports CLWB. We cannot
evaluate them in this paper. However, using these two instructions
should lead to better performance with the method proposed in
this paper. More importantly, these two instructions use cache
block-based cache flushing, hence they have the same problem as
discussed above for large target data objects. Our proposed method
can help them improve performance.
3 PRELIMINARY SYSTEM DESIGNS
The performance of NVM is much better than that of a traditional
hard drive, and is even close to or matches that of DRAM. Given such
performance characteristics of NVM, it is promising to enable
frequent checkpoint with a small overhead. Frequent checkpoint will
enable better HPC resilience and minimize recomputation, hence
addressing the two dilemmas for future HPC.
Our preliminary designs aim to improve the existing checkpoint
mechanism and optimize its performance on NVM. We want to
answer a fundamental question: can the NVM-based checkpoint
(with optimization) happen frequently, such that we address the
two dilemmas rooted in the current checkpoint mechanism?
3.1 Preliminary Design 1: NVM-based,
Frequent Checkpoint
In our rst design, we employ an NVM-based checkpoint, and the
checkpoint happens at each iteration of the main loop, which is
much more frequent than the traditional checkpoint. Also, the
NVM-based checkpoint happens locally. is means that no maer
what usage model NVM is used (either as main memory or as a
local block device), the checkpoint is stored locally in NVM. By
removing networking overhead, this local NVM-based checkpoint
represents the best performance we can get out of NVM. In fact,
from the architecture point of view, such local NVM-based system
has been shown to be possible for HPC [15, 20, 21].
We compare two cases of hard drive-based, frequent checkpoint
with two cases of NVM-based, frequent checkpoint. For hard
drive-based, frequent checkpoint, the hard drive resides either
locally (annotated as “hard drive based chkp (local)”) or in a remote
storage node (annotated as “hard drive based chkp (remote)”). For
NVM-based, frequent checkpoint, NVM is used as either main memory
(annotated as “NVM based chkp (mem)”) or a local block device
(annotated as “NVM based chkp (block)”). If NVM is used as main
memory, checkpointing is the same as making a data copy in memory
plus the necessary cache flushing. To emulate NVM as a block
device, we use a ramdisk with a file system (tmpfs). Hence, such
emulation includes the overhead of the file system and system calls,
but does not emulate the internal overhead of I/O controllers, such as
interface command decoding and ECC.
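When NVM is used as main memory, the checkpoint described above reduces to a memory copy followed by a flush of the copy. A minimal sketch, assuming an x86 processor with SSE2 and a 64-byte cache line; the function names and the trailing fence are our additions and are not taken from the evaluated implementation:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_clflush, _mm_mfence */
#endif

#define FLUSH_ALIGN 64u /* assumed cache line size */

/* Flush every cache block overlapping [addr, addr + len).
   On targets without SSE2 this sketch degrades to a no-op. */
static void flush_range(const void *addr, size_t len)
{
#if defined(__SSE2__)
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(FLUSH_ALIGN - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += FLUSH_ALIGN)
        _mm_clflush((const void *)p);
    _mm_mfence(); /* conservative: order the flushes before later stores */
#else
    (void)addr;
    (void)len;
#endif
}

/* Checkpoint with NVM as main memory: copy the target data object into
   the checkpoint region, then push the copy out of the CPU caches. */
void checkpoint_in_memory(void *ckpt, const void *target, size_t len)
{
    memcpy(ckpt, target, len);
    flush_range(ckpt, len);
}
```

Both steps touch every byte of the target data object, which is why the cost grows with the object size.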
We run six NAS parallel benchmarks and one production code
(Nek5000). The details for those applications are summarized in
Table 2. In our study, NVM has either the same performance
characteristics (bandwidth and latency) as DRAM, which is a rather
optimistic assumption on NVM performance, or inferior performance
to DRAM, which is a more practical assumption.
(1) NVM has the same performance as DRAM. We emulate
NVM with local DRAM, similar to [49], and assume that NVM has
the same latency and bandwidth as DRAM. After data copying
in checkpoint, we flush cache blocks of the new data copy out of
caches to build a consistent state in NVM, using clflush.
Figure 2 shows the results on a production supercomputer, Edison
at Lawrence Berkeley National Lab. For NPB benchmarks, we
use CLASS D as input; for Nek5000, we use the eddy problem as
input (256 × 256). We use 4 nodes with 16 MPI tasks per node.
Performance (execution time) in the figure is normalized by that of
the native execution without checkpoint. The figure reveals that
with frequent checkpoint, hard drive based checkpoint (local) has
283% overhead on average (up to 1062%), which is unacceptable.
NVM-based checkpoint has much better performance. For some
benchmarks (e.g., BT and LU), the overhead of NVM-based checkpoint
(NVM as main memory) is smaller than 10%. But there is still
high overhead for some benchmarks (more than 40% for MG, FT,
and Nek5000). Also, NVM-based checkpoint (main memory) shows
better performance (26% performance loss on average and up to
46%) than NVM-based checkpoint (block device) (89% performance
loss on average and up to 401%).
(2) NVM has worse performance than DRAM. Since NVM
technologies have a range of performance characteristics, we
change NVM performance to make our evaluation more practical,
and redo the above tests in (1). Since checkpoint performance is
sensitive to memory bandwidth, we change NVM bandwidth based
on Quartz (a DRAM-based, lightweight performance emulator for
NVM [44]) for our study. Because using Quartz requires loading a
kernel driver, which needs privileged access to the system, we
run Quartz on a local cluster (see Section 5 for more details on
the cluster). We choose 1/8 and 1/32 DRAM bandwidth as NVM
bandwidth based on [14, 40]. We use CLASS C as input for NPB
and the eddy problem (256 × 256) as input for Nek5000; we use 4
nodes with 4 MPI tasks per node. Figures 3 and 4 show the results.

Figure 2: Preliminary design 1 with NVM-based, frequent checkpoint
and hard drive-based, frequent checkpoint on Edison. NVM
has the same performance characteristics as DRAM. Performance
(execution time) is normalized by that of native execution on DRAM
without checkpoint.

Figure 3: Preliminary design 1 with NVM-based, frequent checkpoint
and hard drive-based, frequent checkpoint on a local cluster.
Performance (execution time) is normalized to that of the native
execution without checkpoint on the heterogeneous NVM/DRAM
system. NVM has 1/8 bandwidth of DRAM.
Note that given a lower NVM bandwidth, the application performance
on an NVM-only system is worse than on a DRAM-only system.
To bridge the performance gap between NVM and DRAM, the existing
work introduces a small DRAM cache [14, 27, 35] to hold recent
write-intensive data in DRAM and build a heterogeneous NVM/DRAM
system. To study the impact of such a small DRAM cache
on checkpoint performance, we allocate a small DRAM space to
implement a heterogeneous NVM/DRAM system based on Quartz.
The existing work chooses a DRAM cache size between 32MB
and 1GB [12, 14, 17, 27, 35, 46, 48]. We choose a medium DRAM
size in our test, which is 256MB.
With the DRAM cache, the overhead of NVM-based checkpoint
(NVM as main memory) must include flushing cache blocks of the
target data objects from this DRAM space to NVM, besides the
overhead of memory copying in NVM and CPU cache flushing.
Which data objects are in the DRAM cache at the persistence
establishment point depends on the DRAM cache management strategy.
We implement a recent software-based approach [14] to manage
the DRAM cache. Furthermore, because of the software-based
approach, we know which target data objects (or data blocks of the
target data objects) are in the DRAM cache. Hence, we do not need
to flush all cache blocks of target data objects for DRAM cache
flushing. Also, we do not invalidate data in the DRAM cache after
DRAM cache flushing, to optimize the performance of DRAM cache
flushing.
Figures 3 and 4 show the results. Similar to Figure 2, the two
figures show that NVM-based, frequent checkpoint (NVM as main
memory) can result in large performance loss (22% on average and
up to 52% for NVM with 1/8 DRAM bandwidth, and 32% on average
and up to 60% for NVM with 1/32 DRAM bandwidth).

Figure 4: Preliminary design 1 with NVM-based, frequent checkpoint
and hard drive-based, frequent checkpoint on a local cluster.
Performance is normalized to that of the native execution without
checkpoint on the heterogeneous NVM/DRAM system. NVM has
1/32 bandwidth of DRAM.

Figure 5: Performance of parallelizing clflush with multi-threading.
Conclusions. Using NVM as main memory for checkpoint is
promising, but still comes with large performance overhead for
some benchmarks, even though we take an optimistic assumption
on NVM performance.
The performance loss of NVM-based checkpoint (NVM as main
memory) comes from data copying during checkpointing and cache
flushing. To improve the performance of NVM-based checkpoint
(NVM as main memory), we focus on improving the performance
of cache flushing in the next section. We consider removing data
copying in Section 4. In the rest of this paper, we focus on NVM
with 1/8 DRAM bandwidth, which is a more practical assumption
on NVM performance [14, 40].
3.2 Preliminary Design 2: Optimization of
NVM-based Checkpoint
To improve the performance of cache flushing, we explore the
parallelization of clflush instructions by multi-threading. Although
clflush is blocking, there is no guaranteed order for clflush
instructions [1] across threads. It is possible to use multiple threads
for cache flushing, each of which flushes non-overlapping cache
blocks. To verify the above idea, we use OpenMP parallel
for to parallelize a for loop for cache flushing, with each iteration
of the loop flushing a single cache block. We change the number of
threads and measure performance for flushing a 20MB data buffer
with dirty cache blocks on an Intel Xeon E5-2630 v3 processor
(20MB L3, 256KB L2, and 32KB L1) attached to 32GB DDR4. The
processor has 8 cores with 16 hardware threads. Figure 5 shows
the performance (average cycles per cache line).
Figure 5 shows that using multi-threading does improve the
performance of clflush, but the performance is not scalable beyond
a certain number of threads. In fact, as we increase the number of
threads, they compete for resources in cache controllers
and read/write ports of main memory, which limits the scalability
of parallel clflush. Based on this observation, we use up to 16
threads to parallelize cache flushing, depending on the availability
of idle cores in a node.

Figure 6: Performance (execution time) of NVM-based checkpoint
with optimization (NVM is used as main memory). Performance is
normalized by that of the native performance without checkpoint.
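The parallel flushing described above can be sketched as follows. This is an illustration under assumptions rather than the evaluated code: the function name, the 64-byte FLUSH_ALIGN, and the trailing fence are ours, and the clflush intrinsic is only used when the compiler targets SSE2. Compile with OpenMP enabled (e.g., -fopenmp); without it the pragma is ignored and the loop runs on one thread.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_clflush, _mm_mfence */
#endif

#define FLUSH_ALIGN 64u /* assumed cache line size */

/* Flush all cache blocks overlapping [addr, addr + len) with up to
   nthreads threads, one cache block per loop iteration.
   Returns the number of cache blocks visited. */
size_t parallel_cache_flush(const void *addr, size_t len, int nthreads)
{
    if (len == 0)
        return 0;
    if (nthreads < 1)
        nthreads = 1;

    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(FLUSH_ALIGN - 1);
    uintptr_t end   = (uintptr_t)addr + len;
    ptrdiff_t nblocks = (ptrdiff_t)((end - start + FLUSH_ALIGN - 1) / FLUSH_ALIGN);
    ptrdiff_t i;

    /* Static scheduling gives each thread a disjoint, contiguous run of
       cache blocks, so no two threads flush the same block. */
#pragma omp parallel for num_threads(nthreads) schedule(static)
    for (i = 0; i < nblocks; i++) {
        const void *block = (const void *)(start + (uintptr_t)i * FLUSH_ALIGN);
#if defined(__SSE2__)
        _mm_clflush(block);
#else
        (void)block; /* no flush instruction available in this build */
#endif
    }

#if defined(__SSE2__)
    _mm_mfence(); /* conservative ordering point after the flushes */
#endif
    return (size_t)nblocks;
}
```

The thread count is a parameter because, as noted above, the benefit saturates once threads start competing for cache controller and memory port resources.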
To further improve the performance of NVM-based checkpoint (NVM
as main memory), we explore special instructions and use SIMD-based,
non-temporal instructions (particularly MOVDQU and
MOVNTDQ), which bypass caches to make a data copy. Using those
instructions removes the necessity of cache flushing, but those
instructions are only available on a processor with SSE support.
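A copy loop of this kind can be written with SSE2 intrinsics, as in the sketch below. It is a hedged illustration, not the evaluated code: the function name is hypothetical, `_mm_loadu_si128` compiles to MOVDQU (an unaligned load) and `_mm_stream_si128` to MOVNTDQ (a non-temporal store that requires a 16-byte aligned destination). When the destination is not aligned, or SSE2 is unavailable, the sketch falls back to memcpy, which does go through the caches.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Copy len bytes from src to dst, writing dst with non-temporal stores
   where possible so the copy does not have to be flushed afterwards. */
void copy_bypassing_cache(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
    if (((uintptr_t)dst & 15u) == 0) {
        size_t nvec = len / 16;
        const __m128i *s = (const __m128i *)src;
        __m128i *d = (__m128i *)dst;

        for (size_t i = 0; i < nvec; i++) {
            __m128i v = _mm_loadu_si128(s + i); /* MOVDQU load */
            _mm_stream_si128(d + i, v);         /* MOVNTDQ store */
        }
        _mm_sfence(); /* make the non-temporal stores globally visible */

        /* The remaining tail (fewer than 16 bytes) is written through the
           caches; a real checkpoint would still have to flush that block. */
        memcpy((char *)dst + nvec * 16, (const char *)src + nvec * 16,
               len - nvec * 16);
        return;
    }
#endif
    memcpy(dst, src, len); /* fallback: cached copy, needs explicit flushing */
}
```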
Figure 6 shows the performance for the above two optimization
techniques. Within the figure, the preliminary design 1 (i.e.,
NVM-based checkpoint with NVM as main memory), the parallelized
clflush, and non-temporal instructions are labeled as
“checkpoint clflush”, “checkpoint par clflush”, and “cache bypassing”
respectively. “Native execution” is the one without checkpoint.
e gure shows that the parallelized clflush has up to 5%
performance improvement (for FT) over the preliminary design 1.
Non-temporal instructions lead to the best performance in all cases.
Comparing with the preliminary design 1 (checkpoint clush in
Figure 6), non-temporal instructions result in 9.6% performance
improvement on average and up to 16%. If a platform supports those
instructions, they should be the preferred method for NVM-based
checkpoint.
However, even if we use the above optimizations on CPU cache
flushing, we still see big performance loss on some benchmarks
(e.g., 36% for Nek5000 and 13% for CG). To investigate the reason,
we break down the checkpoint time. For “checkpoint clflush” (the
preliminary design 1) and “checkpoint par clflush”, the checkpoint
time includes DRAM cache flushing, data copying, and CPU cache
flushing; for “cache bypassing”, the checkpoint time includes DRAM
cache flushing and data copying. Figure 7 shows the results.
The results reveal that data copying contributes the most to the
performance loss. Except for BT and LU with the preliminary design 1,
in all other cases more than 50% of the performance loss comes from
data copying.
Conclusions. To establish frequent data persistence in NVM
with high performance and address the dilemmas in checkpoint,
we must address the data copying overhead.
4 HIGH PERFORMANCE DATA PERSISTENCE
We introduce a technique, called “in-place versioning”, to remove
data copying. Because in-place versioning has to come with
cache flushing, we introduce asynchronous and proactive cache
flushing to improve performance.
4.1 In-Place Versioning
Basic Idea. The in-place versioning is based on the idea of the dual
version [48]. Both the in-place versioning and the dual version
aim to remove data copying by leveraging application-inherent
memory write operations to create a new version of the target data
objects. But the dual version heavily relies on numerical algorithm
knowledge, and is only applicable to those algorithms with specific
characteristics. The implementation of the dual version for an
algorithm requires the programmer to manually change the code
based on algorithm knowledge.
The in-place versioning significantly improves on the dual version.
The in-place versioning works for any numerical algorithm, and is
algorithm-agnostic. We generalize a couple of rules to implement
the in-place versioning. Based on the rules, we can use a compiler
to automatically transform the application into a new one that
implements in-place versioning. The new application
creates the data copy at runtime without programmer intervention. In
the following, we describe the basic idea of the dual version in an
algorithm-agnostic way and give an example. Based on the example,
we derive a basic rule for the in-place versioning.
Before the main computation loop, the dual version allocates
an extra copy of the target data objects (a new version). Then,
in each iteration of the main computation loop, both versions of
the data objects are involved in the computation, but memory
write operations only happen to one version of the data objects
(which we call the “working version”); the other version (which we
call the “consistent version”) remains unchanged until the next
iteration. At the end of each iteration, the working version is
flushed out of the cache and becomes consistent in NVM. This version
will not be changed in the next iteration, and becomes the consistent
version from then on. The previous consistent version becomes the
working version, and is updated by memory write operations of the
application. The two versions alternate roles across iterations, with
one version being consistent and the other being updated. Hence, we
ensure that there is always a consistent version in NVM for restart.
The recomputation is limited to at most one iteration, equivalent
to the recomputation in the frequent checkpoint we discuss in
Section 3.
Figure 8 shows an example to further explain the basic idea. In
this example, the array u is the target data object. In the main loop
(Lines 13-17) of the original code, all elements of u are updated, and
those elements are both read and written in each iteration of the
main loop. In the dual version, we allocate an extra copy of u (u_e)
and rename the original copy as u_o. u_o is enforced to be consistent
in NVM (Lines 4-6) before the computation loop. In the main loop,
both u_o and u_e participate in the computation. However, at any
iteration, only one version of u is updated, and the other version
is read. The update to one version of u is naturally embedded in
the place of write operations (Line 12). Also, at any iteration, we
always maintain a consistent version of u in NVM. Depending on
the iteration number (odd or even), we decide which version should
be updated and which one should be consistent. The two versions
switch their roles (either write or read) after each iteration (Lines
19-25).
Based on the above example and description in an algorithm-
agnostic way, we derive a basic rule for our in-place versioning.
Figure 7: The breakdown of performance loss for NVM-based checkpoint after optimization (i.e., the preliminary design 2).
     1  ...
     2  // initialization of u[]
     3  init(u);
     4
     5  void update(u) {
     6    ...
     7    for (i = 0; i < Nu; i++)
     8      u[i] = u[i] + e;
     9    ...
    10  }
    11
    12  // main computation loop
    13  for (it = 0; it < Nit; it++) {
    14    ...
    15    update(u);
    16    ...
    17  }
    18  ...
    (a) The original code

     1  ...
     2  /* u_o and u_e are
     3     two versions of u */
     4  u_e = malloc(..);
     5  init(u_o);
     6  flush_cache(u_o);
     7
     8  void update(u_new, u_old)
     9  {
    10    ...
    11    for (i = 0; i < Nu; i++)
    12      u_new[i] = u_old[i] + e;
    13    ...
    14  }
    15
    16  // main computation loop
    17  for (it = 0; it < Nit; it++) {
    18    ...
    19    if (it % 2 == 0) {
    20      update(u_e, u_o);
    21      flush_cache(u_e);
    22    } else {
    23      update(u_o, u_e);
    24      flush_cache(u_o);
    25    }
    26    ...
    27  }
    28  ...
    (b) The dual version

Figure 8: An example to explain the basic idea of the dual version
described in an algorithm-agnostic way. u (an array) is the target
data object. u has Nu elements.
• Basic rule: within each iteration of the main computation
loop, write operations happen to one version of the target
data objects and read operations happen on the other
version. Alternate the role of the two versions, and flush
data blocks of the updated version out of caches after each
iteration.
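The basic rule can also be expressed by swapping two pointers instead of branching on the iteration parity. The sketch below is a variation for illustration, not the transformation described in the text: `run_dual_version` is a hypothetical driver, and `flush_cache` is a stub standing in for whatever cache flushing mechanism is used.

```c
#include <stddef.h>

/* Stub for the cache flushing step (hypothetical; a real version would
   flush the cache blocks of u out to NVM). */
static void flush_cache(const double *u, size_t n)
{
    (void)u;
    (void)n;
}

/* One iteration's update: writes go to u_new, reads come from u_old. */
static void update(double *u_new, const double *u_old, size_t n, double e)
{
    for (size_t i = 0; i < n; i++)
        u_new[i] = u_old[i] + e;
}

/* Run nit iterations over the two versions u_o and u_e and return the
   version that is consistent when the loop ends. */
double *run_dual_version(double *u_o, double *u_e, size_t n, double e, int nit)
{
    double *consistent = u_o; /* initialized and flushed before the loop */
    double *working = u_e;

    flush_cache(consistent, n);
    for (int it = 0; it < nit; it++) {
        update(working, consistent, n, e); /* write one version, read the other */
        flush_cache(working, n);           /* working version becomes consistent */

        double *tmp = consistent;          /* alternate the roles */
        consistent = working;
        working = tmp;
    }
    return consistent;
}
```

After every iteration the returned pointer names the most recent flushed version, while the other buffer still holds the state of the previous iteration, so at most one iteration has to be recomputed.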
Although the basic rule is straightforward, it can be applied to
many target data objects (see Table 2). However, the basic rule is
also very restrictive. There are two special cases violating the basic
rule. In the first case, within one iteration, read operations
reference one version (i.e., the consistent version) before any update
happens to the target data object. However, after the first update,
read operations should reference the updated version (i.e., the
working version) for program correctness. Read operations should not
use the same version before and after the first update. We call this
case post-update version switch for read operations. We use an
example to further explain it.
Special case I: post-update version switch for read opera-
tions. See Figure 9. In this example, we only show the routine
where the updates to the target data object (the array u) happen
(the routine update), but ignore the main computation loop which
is already shown in Figure 8.
In this example, for the first update of u (Line 4 in Figure 9.b),
we can use the basic rule correctly. The read operations use u_old.
However, after the first update (Line 6 in Figure 9.b), we should
read the most recent update from u_new, not from u_old as suggested
by the basic rule (see Line 6 in Figure 9.c for a correct version). The
read operations in Lines 4 and 6 in Figure 9.c use different versions
of u: u_old before the first update in Line 4, and u_new after it.
The other case violating the basic rule is that elements of the
target data object are not updated uniformly within an iteration.
As a result, read operations should reference one version for some
elements of the target data object, but reference the other version
for the other elements. We use an example to further explain it.
Special case II: nonuniform updates. Figure 10 gives an example.
There are two loops in the figure, each of which updates u.
In the first loop (Figure 10.a), the elements from 1 to Nu − 2 of u
are updated, while the elements 0 and Nu − 1 are not updated. In
the second loop, all elements are updated. Hence, across the two
loops, the elements are not updated uniformly.
Based on the basic rule, we replace u in the first loop with the
two versions of u (Line 4 in Figure 10.b), which is correct. In the
second loop, we do the same thing (Line 7 in Figure 10.b) based on
the basic rule. However, the program will not run correctly. For
the elements u[0] and u[Nu − 1] that have not been updated in the
first loop, we should use u_old for read operations in the second
loop (Line 8 in Figure 10.c), while for the other elements that have
been updated, we should use u_new for read operations (Line 10 in
Figure 10.c).
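One way to check such a transformation is to run the original in-place routine and the two-version routine on the same input and compare the results. The sketch below does this for the nonuniform case; it mirrors the loops of Figure 10, but the function names and the concrete double element type are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Original in-place code: interior elements first, then all elements. */
static void update_in_place(double *u, size_t n, double e, double f) {
    assert(n >= 2);
    for (size_t i = 1; i + 1 < n; i++)
        u[i] = u[i] + e;
    for (size_t i = 0; i < n; i++)
        u[i] = u[i] + f;
}

/* Two-version code with a per-element choice of the version to read. */
static void update_two_versions(double *u_new, const double *u_old,
                                size_t n, double e, double f) {
    assert(n >= 2 && u_new != u_old);
    for (size_t i = 1; i + 1 < n; i++)
        u_new[i] = u_old[i] + e;
    for (size_t i = 0; i < n; i++) {
        /* The boundary elements were not written by the first loop, so
           their latest value still lives in the consistent version. */
        const double *src = (i == 0 || i == n - 1) ? u_old : u_new;
        u_new[i] = src[i] + f;
    }
}
```

If the second loop always read from u_old, the interior elements would lose the + e update; if it always read from u_new, the boundary elements would read uninitialized or stale data.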
To handle the above two cases and enable automatic code
transformation to implement the in-place versioning, we introduce a
profile-guided code transformation. This method uses the results of
a profiling test to detect the first update and nonuniform updates,
and then transforms the application into the in-place versioning
accordingly. We particularly target arrays, the most common
target data object in HPC applications. We explain our method in
detail as follows.
Our method rst leverages an LLVM compiler [23] instrumenta-
tion pass [39] to generate a set of dynamic LLVM instruction traces
 1 void update(u) {
 2   ...
 3   for (i = 0; i < Nu; i++) {
 4     u[i] = u[i] + e;
 5     ...
 6     u[i] = u[i] + f;
 7     ...
 8   }
 9   ...
10 }
(a) The original code

 1 void update(u_new, u_old) {
 2   ...
 3   for (i = 0; i < Nu; i++) {
 4     u_new[i] = u_old[i] + e;
 5     ...
 6     u_new[i] = u_old[i] + f;
 7     ...
 8   }
 9   ...
10 }
(b) The wrong code based on the basic rule

 1 void update(u_new, u_old) {
 2   ...
 3   for (i = 0; i < Nu; i++) {
 4     u_new[i] = u_old[i] + e;
 5     ...
 6     u_new[i] = u_new[i] + f;
 7     ...
 8   }
 9   ...
10 }
(c) The correct code
Figure 9: Special case I: post-update version switch for read operations. The target data object is u. The main computation loop is ignored in
this figure. u has Nu number of elements. Line 6 in Figure 9.b is the incorrect code.
 1 void update(u) {
 2   ...
 3   // The first collective
 4   // update to u
 5   for (i = 1; i < Nu - 1; i++)
 6     u[i] = u[i] + e;
 7   ...
 8   // The second collective
 9   // update to u
10   for (i = 0; i < Nu; i++)
11     u[i] = u[i] + f;
12   ...
13 }
(a) The original code

 1 void update(u_new, u_old) {
 2   ...
 3   for (i = 1; i < Nu - 1; i++)
 4     u_new[i] = u_old[i] + e;
 5   ...
 6   for (i = 0; i < Nu; i++)
 7     u_new[i] = u_old[i] + f;
 8   ...
 9 }
(b) The wrong code based on the basic rule

 1 void update(u_new, u_old) {
 2   ...
 3   for (i = 1; i < Nu - 1; i++)
 4     u_new[i] = u_old[i] + e;
 5   ...
 6   for (i = 0; i < Nu; i++) {
 7     if (i == 0 || i == Nu - 1)
 8       u_new[i] = u_old[i] + f;
 9     else
10       u_new[i] = u_new[i] + f;
11   }
12   ...
13 }
(c) The correct code
Figure 10: Special case II: the elements of the data object u are not updated uniformly. The main computation loop is ignored in this figure. u
has Nu number of elements. Line 7 in Figure 10.b is the incorrect code.
for the rst iteration of the main computation loop. ose traces in-
clude dynamic register values and memory addresses referenced in
each instruction. Each of the traces corresponds to either a loop or
instructions between two neighbor loops. For example, the update
routine in Figure 10.a has three traces: Two of them correspond
to for loops and the third one corresponds to the instructions
between the two loops. We also record the whole memory address
ranges of the target data objects in the beginning of each trace,
based on the LLVM instrumentation.
Furthermore, we develop a trace analysis tool. Given the traces
and memory address ranges of the target data objects as input, this
tool tracks register allocation and memory references to determine
which elements are updated in each trace. Based on the analysis
results across and within the traces, we identify the first update
for each target data object; we also determine the coverage of each
loop-based update (e.g., Lines 7-8 in Figure 8.a) and whether the
coverages of the loop-based updates differ. This information is used
to detect nonuniform updates.
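The coverage check can be illustrated with a short C sketch. It assumes a simplified trace format that is not described in the text: each trace is reduced to the list of store addresses it contains, and target_range, trace_coverage, and coverage_differs are hypothetical names. The actual tool also tracks register allocation, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Address range of one target array, recorded at the start of a trace. */
typedef struct {
    uintptr_t base;      /* address of element 0 */
    size_t elem_size;    /* size of one element in bytes */
    size_t n_elems;      /* number of elements (Nu) */
} target_range;

/* Marks covered[i] for every element i written by the trace and returns
   the number of distinct elements written. Stores that fall outside the
   target array are ignored. */
static size_t trace_coverage(const target_range *t, const uintptr_t *stores,
                             size_t n_stores, bool *covered) {
    assert(t->elem_size > 0);
    for (size_t i = 0; i < t->n_elems; i++)
        covered[i] = false;
    size_t count = 0;
    for (size_t k = 0; k < n_stores; k++) {
        if (stores[k] < t->base)
            continue;
        size_t idx = (size_t)(stores[k] - t->base) / t->elem_size;
        if (idx >= t->n_elems)
            continue;
        if (!covered[idx]) {
            covered[idx] = true;
            count++;
        }
    }
    return count;
}

/* Two loop-based updates are nonuniform if they cover different elements. */
static bool coverage_differs(const bool *a, const bool *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return true;
    return false;
}
```

For the routine in Figure 10.a, the first loop trace would cover elements 1 to Nu − 2 and the second would cover all elements, so coverage_differs would report a nonuniform update.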
Based on the trace analysis results, we use a static LLVM pass to
replace the references to the target data objects with the references
to either the working version or the consistent version. In particular,
any read reference to the target data object before the first update
will be replaced with the reference to the old version of the target
data object (i.e., the consistent version); after the first update, any
read reference to the target data object will be replaced with the
reference to the new version (i.e., the working version). Any write
reference to the target data object is always replaced with the
reference to the new version, based on the basic rule. Figure 9.c is
an example of such replacement.
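The replacement rule amounts to a single pass over the references in program order. The sketch below is an assumption-laden illustration rather than the LLVM pass itself: it models the references to one element of the target data object as an array of read/write kinds, and the names ref_kind, ref_version, and assign_versions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { REF_READ, REF_WRITE } ref_kind;
typedef enum { VER_CONSISTENT, VER_WORKING } ref_version;

/* Assigns a version to each reference, given in program order:
   writes always go to the working version; reads use the consistent
   version until the first write, and the working version afterwards. */
static void assign_versions(const ref_kind *refs, size_t n, ref_version *out) {
    assert(refs != NULL && out != NULL);
    bool updated = false;
    for (size_t k = 0; k < n; k++) {
        if (refs[k] == REF_WRITE) {
            out[k] = VER_WORKING;
            updated = true;
        } else {
            out[k] = updated ? VER_WORKING : VER_CONSISTENT;
        }
    }
}
```

For the loop body of Figure 9.a, the reference sequence is read, write, read, write, which maps to u_old, u_new, u_new, u_new, matching Figure 9.c.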
If nonuniform updates are detected, then for a loop-based
structure we need to add control flow constructs within the loop to
control which version of the data objects should be used. Figure
10.c (Lines 7-10) is such an example. However, in practice, we
find that such control flow constructs can be rather sophisticated,
especially for a loop statement that references multiple elements of
the data objects. Furthermore, the prevalence of such control flow
constructs in loops can bring large performance overhead. Hence,
we do not apply the in-place versioning to data objects with
nonuniform updates. Instead, we use our preliminary design 2 (i.e.,
data copying based on non-temporal load/store) at the persistence
establishment point for those target data objects.
Discussion. We prole the rst iteration to detect the rst
update and nonuniform updates. is method aims to generate
a short trace and make the trace analysis time manageable. is
method is based on an assumption that the rst iteration and the
rest of iterations in the main loop have the same read and write
paerns for the target data objects. Based on our experience with
10 data objects from six NPB benchmarks (24 input problem sizes)
and 7 data objects from a large-scale production code (Nek5000),
we nd such assumption is true in all cases.
Furthermore, we nd that dierent input problems (not dierent
input problem size) can have dierent read and write paerns to
the data objects, and hence needs to generate dierent code for
the in-place versioning. However, proling the rst iteration and
generating the code is quick, based on our compiler-based approach.
In-place versioning vs. checkpoint. There is a significant
difference between the in-place versioning and the checkpoint
mechanism. Creating a data copy in the checkpoint mechanism is an
extra operation, and the data copy is not involved in the
computation. Creating a data copy in the in-place versioning leverages
inherent memory write operations in the application, and is part
of the computation (not an extra operation). Hence, the in-place
versioning significantly reduces the data copying overhead from which
the checkpoint mechanism suffers.
However, the in-place versioning can bring performance loss
from two perspectives. First, the in-place versioning has to allocate
one extra data copy before the main computation loop. However,
this cost happens just once, and can be easily amortized by the main
computation. Second, the in-place versioning increases the memory
footprint of the application, because the two versions of the target
data object are involved in the computation. This may increase the CPU
cache miss rate, which hurts performance. This may also consume
more DRAM cache space, reducing the DRAM space for other data
objects. However, we see a small performance difference (less than
8.2%, and 2.7% on average) between the in-place versioning and the
native execution without it. The reason is as follows.
For the DRAM cache problem, the software-based cache
management we use in our study [14] treats each extra data copy as
a new data object and chooses the best data placement in DRAM
and NVM for optimal performance, which effectively reduces the
impact of the larger working set in the in-place versioning. For the
CPU cache problem, we study it based on performance counters,
but do not find a significant increase in cache miss rates because of
the “streaming-like” memory access patterns in target data objects.
We discuss it further in the performance evaluation section.
4.2 Optimization of Cache Flushing
The in-place versioning avoids memory copying. However, to make
data consistent between NVM and caches at the persistence
establishment point, we need to flush caches. As shown in Figure 7,
periodically flushing caches accounts for a large portion of the total
overhead. The fundamental reason for such large overhead is that
we cannot know which cache blocks of the data objects are in the
cache hierarchy and whether they are dirty, and have to issue cache
flushing instructions on every single cache block of the target data
objects.
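Per-block flushing can be sketched as follows, assuming an x86 processor with SSE2, where the _mm_clflush and _mm_mfence intrinsics from emmintrin.h map to CLFLUSH [1] and MFENCE. The function name flush_blocks and the 64-byte block size are assumptions; on other architectures the sketch only counts the blocks it would flush.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0
#endif

#define CACHE_BLOCK 64   /* assumed cache block size in bytes */

/* Issues one flush per cache block that overlaps [buf, buf + len),
   whether or not the block is actually cached or dirty.
   Returns the number of flush instructions issued. */
static size_t flush_blocks(const void *buf, size_t len) {
    assert(buf != NULL);
    if (len == 0)
        return 0;
    uintptr_t p = (uintptr_t)buf & ~(uintptr_t)(CACHE_BLOCK - 1);
    uintptr_t end = (uintptr_t)buf + len;
    size_t issued = 0;
    for (; p < end; p += CACHE_BLOCK) {
#if HAVE_CLFLUSH
        _mm_clflush((const void *)p);
#endif
        issued++;
    }
#if HAVE_CLFLUSH
    _mm_mfence();        /* order the flushes before later stores */
#endif
    return issued;
}
```

The return value makes the cost visible: the number of flush instructions grows linearly with the size of the target data objects, independent of how many blocks are resident in the caches.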
To reduce the cache flushing cost, we propose two optimization
techniques: whole cache flushing and proactive cache flushing.
Whole cache flushing. The basic idea of the whole cache flushing
is to use the WBINVD instruction to flush the entire cache hierarchy,
instead of flushing individual cache blocks of the target data objects.
If the size of the target data objects is much larger than the last
level cache size, it is highly possible that most of the cache blocks
are not in caches, and flushing the entire cache hierarchy is cheaper
than flushing all cache blocks of the target data objects.
However, WBINVD is a privileged instruction, and only kernel-level
code can issue it. Hence we introduce a kernel module that allows
the application to issue the instruction indirectly. The drawback of
using WBINVD is that cache blocks that do not belong to the target
data objects are also flushed out of the caches. If those cache blocks
are to be reused, they have to be reloaded, which costs performance.
However, when the total size of the target data objects is large
enough, flushing all cache blocks of the target data objects, most of
which are not resident in caches, is much more expensive than the
data reloading caused by WBINVD. We empirically decide that if the
total size of the target data objects is ten times larger than the
last level cache size, it is beneficial to use WBINVD.
Figure 11: The proactive cache flushing scheme.
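The empirical threshold can be expressed as a small policy function. This is only a sketch of the decision, not of the kernel module: choose_flush_policy and flush_policy are hypothetical names, and treating "ten times larger" as a ratio of at least ten is an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FLUSH_PER_BLOCK, FLUSH_WHOLE_CACHE } flush_policy;

/* Empirical ratio between the total target size and the LLC size above
   which flushing the whole cache hierarchy is considered beneficial. */
#define WHOLE_FLUSH_RATIO 10

/* Chooses between flushing individual cache blocks of the target data
   objects and flushing the entire cache hierarchy. */
static flush_policy choose_flush_policy(size_t target_bytes,
                                        size_t llc_bytes) {
    assert(llc_bytes > 0);
    /* Integer division avoids overflow in llc_bytes * WHOLE_FLUSH_RATIO. */
    return (target_bytes / llc_bytes >= WHOLE_FLUSH_RATIO)
               ? FLUSH_WHOLE_CACHE
               : FLUSH_PER_BLOCK;
}
```

With a 20 MB last level cache, for example, this policy would select the whole cache flush once the target data objects total 200 MB or more.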
Asynchronous and proactive cache flushing. In the in-place
versioning, we trigger cache flushing (including CPU cache flushing
with WBINVD and DRAM cache flushing) at the persistence
establishment point to make the working version consistent in NVM.
To improve cache flushing performance, we want to move cache
flushing off the execution critical path as much as possible. Also, we
can trigger cache flushing ahead of the persistence establishment
point under certain conditions (discussed below). We introduce
a helper-thread-based mechanism to implement asynchronous and
proactive cache flushing.
In particular, we do not wait until the persistence establishment
point to flush caches. Instead, as soon as the working version is no
longer updated in the current iteration, a helper thread proactively
flushes caches. Furthermore, the cache flushing does not have to
be finished at the end of each iteration. As long as the working
version from the last iteration is not read in the current iteration,
the cache flushing can continue. However, the helper thread must finish
cache flushing at the point where the working version from the last
iteration is read for the first time. Figure 11 describes the idea.
To implement the above proactive cache flushing, we develop a
lightweight library for HPC applications and a set of APIs. To use
the library, the programmer needs to insert a thread creation API
(flush_init() in Figure 11) before the main loop to create a helper
thread and a FIFO queue shared between the helper thread and the main
thread. The programmer also needs to insert an API (flush_async()
in Figure 11) into the program to specify where the cache flush
can happen within each iteration; the cache flush point does not
have to be the same as the persistence establishment point. Using
this API inserts a cache flush request into the FIFO queue. The
programmer also needs to insert an API to specify where the cache
flush must finish within each iteration (flush_barrier() in Figure 11).
This API works as a synchronization between the helper thread and
the main thread to ensure that the working version is completely
flushed before it becomes the consistent version and is read by the
application.
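A minimal sketch of such a library, using POSIX threads, is shown below. Only the three API names come from the text; their signatures, the queue capacity, the do_flush placeholder, and the extra flush_shutdown routine are assumptions made so that the example is self-contained.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 16

typedef struct { const void *addr; size_t len; } flush_req;

static struct {
    pthread_mutex_t lock;
    pthread_cond_t not_empty;   /* signalled when a request is enqueued */
    pthread_cond_t drained;     /* signalled when no request is pending */
    flush_req queue[QUEUE_CAP]; /* FIFO shared by main and helper thread */
    size_t head, tail;          /* dequeue and enqueue counters */
    size_t pending;             /* queued plus in-flight requests */
    size_t completed;
    bool stop;
    pthread_t helper;
} fq = { .lock = PTHREAD_MUTEX_INITIALIZER,
         .not_empty = PTHREAD_COND_INITIALIZER,
         .drained = PTHREAD_COND_INITIALIZER };

/* Placeholder for the real flush (per-block flush or whole cache flush). */
static void do_flush(const flush_req *r) { (void)r; }

static void *helper_main(void *arg) {
    (void)arg;
    pthread_mutex_lock(&fq.lock);
    for (;;) {
        while (fq.head == fq.tail && !fq.stop)
            pthread_cond_wait(&fq.not_empty, &fq.lock);
        if (fq.head == fq.tail)
            break;                          /* stop requested, queue empty */
        flush_req r = fq.queue[fq.head % QUEUE_CAP];
        fq.head++;
        pthread_mutex_unlock(&fq.lock);
        do_flush(&r);                       /* off the main thread's path */
        pthread_mutex_lock(&fq.lock);
        fq.completed++;
        if (--fq.pending == 0)
            pthread_cond_broadcast(&fq.drained);
    }
    pthread_mutex_unlock(&fq.lock);
    return NULL;
}

/* Creates the helper thread; call once before the main loop. */
static int flush_init(void) {
    return pthread_create(&fq.helper, NULL, helper_main, NULL);
}

/* Enqueues a flush request; returns 0, or -1 if the queue is full. */
static int flush_async(const void *addr, size_t len) {
    int rc = -1;
    pthread_mutex_lock(&fq.lock);
    if (fq.tail - fq.head < QUEUE_CAP) {
        fq.queue[fq.tail % QUEUE_CAP] = (flush_req){ addr, len };
        fq.tail++;
        fq.pending++;
        pthread_cond_signal(&fq.not_empty);
        rc = 0;
    }
    pthread_mutex_unlock(&fq.lock);
    return rc;
}

/* Blocks until every request issued so far has been flushed. */
static void flush_barrier(void) {
    pthread_mutex_lock(&fq.lock);
    while (fq.pending != 0)
        pthread_cond_wait(&fq.drained, &fq.lock);
    pthread_mutex_unlock(&fq.lock);
}

/* Hypothetical cleanup routine: stops and joins the helper thread. */
static void flush_shutdown(void) {
    pthread_mutex_lock(&fq.lock);
    fq.stop = true;
    pthread_cond_signal(&fq.not_empty);
    pthread_mutex_unlock(&fq.lock);
    pthread_join(fq.helper, NULL);
}
```

In a main loop, flush_async would be called right after the last write to the working version in an iteration, and flush_barrier right before the first read of that version in the next iteration, so that the flush overlaps with the computation in between.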
Table 2: Target data objects for checkpointing and the in-place versioning. IPV in the table stands for the in-place versioning.
Benchmark        Data obj                          IPV (basic rule)     IPV (post-update version switch)   IPV (nonuniform update)
FT               u0, u1, u2                        u1, u2               u0                                 -
CG               p, r, z                           p                    r, z                               -
BT               u                                 u                    -                                  -
SP               u                                 u                    -                                  -
LU               u                                 u                    -                                  -
MG               r                                 -                    -                                  r
Nek5000 (eddy)   vx, vy, vz, pr, xm1, ym1, zm1     pr, xm1, ym1, zm1    vx, vy, vz                         -
Discussion. Similar to any helper-thread-based approach [24,
25, 30, 42], our approach depends on the availability of idle cores
for helper threads. We expect that future many-core platforms
can provide such core abundance. Note that even without the
helper thread, the in-place versioning with WBINVD already
provides significant performance improvement over checkpoint, as shown
in Figure 12 in the evaluation section.
5 EVALUATION
We evaluate the in-place versioning (IPV) in this section. Unless
indicated otherwise, IPV includes optimized cache flushing and
the helper thread in this section. Also, the data persistence
establishment happens at every iteration of the main computation loop,
which aims to build high resilience and minimize recomputation
for future HPC. We use the native execution, which has neither
checkpoint nor IPV, as our baseline. Ideally, the performance of IPV
should be as close as possible to that of the native execution.
We study the performance on two test platforms. One test
platform is a local cluster. Each node of it has two eight-core Intel
Xeon E5-2630 processors (2.4 GHz) and 32GB DDR4. We use this
platform for tests in all figures except Figure 2 in Section 3. We
deploy Quartz on this platform to emulate a heterogeneous
NVM/DRAM system, with NVM configured with 1/8 of the DRAM bandwidth
and DRAM configured with 256MB capacity, to enable a practical
emulation of NVM [14, 40]. The other test platform is the Edison
supercomputer at Lawrence Berkeley National Lab (LBNL). We use
this platform for tests in Figure 2. Each Edison node has two 12-core
Intel Ivy Bridge processors (2.4 GHz) with 64GB DDR3. We cannot
install Quartz on Edison to enable a practical emulation of NVM,
because Quartz requires privileged access to the system. Hence,
we perform most of the tests on the local cluster.
We use six NPB benchmarks (CLASS C) and one production
application (Nek5000) with the eddy input problem (256 × 256).
Table 2 gives more information on the benchmarks and application.
The table also lists how the target data objects are transformed
into IPV based on either the basic rule, post-update version switch, or
nonuniform update. For NPB benchmarks, the target data objects
are chosen based on typical checkpoint cases, algorithm knowledge,
and benchmark information. For Nek5000, the target data objects
are determined by the checkpoint mechanism in Nek5000.
Figure 12 compares the performance of the baseline, the
preliminary design 2 (i.e., checkpoint with cache bypassing), IPV with
neither cache flushing nor helper thread, IPV with cache flushing
(no helper thread), and IPV with everything. Compared with the
baseline, IPV achieves rather small runtime overhead (4.4% on average
and no larger than 9.5%). Most of the performance improvement
comes from the removal of data copying. In particular, neither
IPV (no cache flushing and helper thread) nor the preliminary
design 2 has cache flushing, but IPV (no cache
flushing and helper thread) performs 9% better on average because
it has no data copying. This is especially pronounced in Nek5000,
where IPV (no cache flushing and helper thread) performs 26%
better than the preliminary design 2.
Furthermore, IPV cannot be applied to MG because of nonuniform
updates (see Table 2). Hence MG does not have performance
data for any IPV. However, MG with the helper thread to enable
proactive and asynchronous data copying in the figure has 5.4%
performance improvement over the preliminary design 2.
To further study the performance of IPV, we focus on the
performance difference between IPV without cache flushing and IPV. We
aim to study the effectiveness of proactive and asynchronous cache
flushing. In Figure 13, we measure the performance of WBINVD and
DRAM cache flushing, and quantify their contribution to the total
overhead (i.e., WBINVD plus DRAM cache flushing) in IPV. The
table below the figure quantifies how much of the total overhead
is overlapped with the application execution by the proactive and
asynchronous cache flushing.
Figure 13 reveals that the proactive and asynchronous cache
flushing is pretty effective at hiding the cache flushing overhead (or
data copying for MG). At least 41% of the total overhead is
overlapped in all benchmarks. The non-overlapped cache flushing time
is exposed to the application critical path and causes the
performance difference between IPV and the native execution in Figure
12.
IPV can cause extra CPU cache misses for two reasons.
(1) The two versions of the target data objects increase the working
set size of the application; (2) WBINVD flushes the entire cache
hierarchy.
We measure the system-wide last level CPU cache miss rate
for the native execution and IPV. Figure 14 shows the results. In
general, we do not see a big difference (up to 4%) between the two
cases in terms of the last level cache miss rate. This further explains
the small performance loss between IPV and the native execution
in Figure 12.
The reason that accounts for such a small difference in the last
level miss rate is as follows. WBINVD happens only once in each
iteration, hence its impact on cache misses is not frequent. The
two versions do increase the working set size of the application.
However, within the original application, the target data objects
are typically updated in a loop (e.g., the loop structure in the update
routine in Figures 9 and 10) and there is little data reuse across
iterations of the loop. Such updates tend to be “streaming-like”,
which is not sensitive to the increase of working set size.
6 RELATED WORK
Persistent memory. NVM as main memory has been explored to
implement checkpoint. Kannan et al. [21] use NVM only for
checkpoint (not computation). To improve performance, they proactively
move checkpoint data from DRAM to NVM before checkpoint is
started. Gao et al. [15] use a hardware-based approach to utilize
runtime idling to write checkpoint and spread it across memory
banks for load balance. Ren et al. [36] dynamically determine
checkpoint granularity (cache block level or page level) based on memory
update density. Dong et al. [13] introduce 3D stacked NVM and
incremental checkpoint to reduce checkpoint overhead. Those prior
efforts focus on good performance of NVM to establish persistence
(checkpoint) in NVM, while we focus on how to maximize the
benefit of non-volatility of NVM. Different from those prior efforts, our
work avoids data copying, and does not require hardware assistance.
Figure 12: Performance difference between the native execution (baseline), the preliminary design 2 (checkpoint with cache bypassing), and
different IPV cases. Performance is normalized to that of the native execution. MG does not have the results for IPV. The dotted bar in MG is
the case of checkpoint with a helper thread for asynchronous and proactive data copying.
Figure 13: Breakdown of the performance difference between the
in-place versioning and in-place versioning without cache flushing.
Figure 14: Last level CPU cache miss rate difference between the
baseline and the in-place versioning (no cache flushing).
To enable data consistency in NVM, many research efforts
explore how to enforce write-ordering with minimum overhead. The
epoch-based approach [10, 18, 22, 32] is one of those research
efforts. This approach divides program execution into epochs, within
which stores are allowed to happen concurrently without disturbing
data consistency in NVM. In fact, our proactive cache flushing
(Section 4.2) is one variation of epoch. From the point where the cache
flush happens to the point where the working version becomes the
consistent version is an epoch where concurrent, persistent writes
can happen. However, most of the existing work is hardware-based
and requires hardware support to implicitly identify epochs. Also,
applying the existing work to establish data persistence in HPC
still needs a mechanism to maintain two versions of the target data
objects. Our work requires no hardware support and the in-place
versioning provides the two versions.
Some work explores redo-log and undo-log based approaches
to build transaction semantics for data consistency in NVM. This
includes hardware logging [19, 27, 31]. However, those approaches
come with extensive architecture modifications.
ere are also soware-based approaches that introduce certain
program constructs to enable data persistence in NVM [6, 9, 11,
16, 37, 45]. To use those program constructs, one have to make
changes to OS and applications. e application can suer from
large overhead because of frequent runtime checking or data log-
ging. Our experiences with [16] show that CG and dense matrix
multiplication suer from 52% and 103% performance loss because
of frequent data logging operations. Our work in this paper has
very small runtime overhead and does not require changes to OS.
Checkpoint mechanism. Diskless checkpoint is a technique
that uses DRAM-based main memory and available processors
to encode and store the encoded checkpoint data [26, 33, 34, 41].
Because of the DRAM usage and the limitation of encoding
techniques, diskless checkpoint has to leverage multiple nodes to create
redundancy and only tolerates up to a certain number of node
failures. Our method is also diskless, but leverages the
non-volatility of NVM. It does not need the node-level
redundancy of diskless checkpoint, and is independent of the number
of node failures.
Incremental checkpoint is a method that only checkpoints
modified data to save checkpoint size and improve checkpoint
performance [2, 4, 33, 47]. However, for those applications with intensive
modifications between checkpoints (e.g., HPL [41]), the
effectiveness of the incremental checkpoint method can be limited.
Multi-level checkpoint is a method that saves checkpoint to fast
devices (e.g., PCM and local SSD) in a short interval and to slower
devices in a long interval [3, 13, 28]. By leveraging the good
performance of fast devices, the multi-level checkpoint removes expensive
memory copy on slower devices. However, it can still suffer from
large data copy overhead on fast devices, when the checkpoint data
size is large. Our work introduces the in-place versioning to remove
data copy by leveraging application-inherent write operations to
update checkpoint data. Hence, our method does not have the
limitation of incremental and multi-level checkpoints.
7 CONCLUSIONS
With the emergence of NVM, how to leverage the performance and
non-volatility characteristics of NVM for future HPC is largely unknown.
In this paper, we study how to use NVM to build data persistence for
critical data objects of applications to replace traditional checkpoint.
Our study enables the frequent establishment of data persistence
on NVM with small overhead, which enables highly resilient HPC and
minimizes recomputation.
REFERENCES
[1] Accessed on April 3, 2017. x86 Instruction Set Reference: CLFLUSH. hp://x86.
renejeschke.de/html/le module x86 id 30.html. (Accessed on April 3, 2017).
[2] Saurabh Agarwal, Rahul Garg, Meeta S. Gupta, and Jose E. Moreira. 2004. Adap-
tive Incremental Checkpointing for Massively Parallel Systems. In International
Conference on Supercomputing (ICS).
[3] Leonardo Arturo Bautista-Gomez, Seiji Tsuboi, Dimitri Komatitsch, Franck Cap-
pello, Naoya Maruyama, and Satoshi Matsuoka. 2011. FTI: high performance
fault tolerance interface for hybrid systems. In Conference on High Performance
Computing Networking, Storage and Analysis (SC).
[4] Greg Bronevetsky, Daniel Marques, Keshav Pingali, Sally A. McKee, and Radu
Rugina. 2009. Compiler-enhanced incremental checkpointing for OpenMP ap-
plications. In International Symposium on Parallel and Distributed Processing
(IPDPS).
[5] Adrian M. Cauleld, Joel Coburn, Todor I. Mollov, Arup De, Ameen Akel, Ji-
ahua He, Arun Jagatheesan, Rajesh K. Gupta, Allan Snavely, and Steven Swan-
son. 2010. Understanding the Impact of Emerging Non-Volatile Memories on
High-Performance, IO-Intensive Computing. In Conference on High Performance
Computing Networking, Storage and Analysis (SC).
[6] Andreas Chatzistergiou, Marcelo Cintra, and Stratis D. Viglas. 2015. REWIND:
Recovery Write-ahead System for In-Memory Non-Volatile Data Structures.
Proceedings of the VLDB Endowment 8, 5 (2015).
[7] Feng Chen, Michael P. Mesni, and Sco Hahn. 2014. A Protected Block Device for
Persistent Memory. In IEEE Symposium on Mass Storage Systems and Technologies
(MSST).
[8] R. Chen, Z. Shao, and T. Li. 2016. Bridging the I/O performance gap for big
data workloads: A new NVDIMM-based approach. In IEEE/ACM International
Symposium on Microarchitecture (MICRO).
[9] Joel Coburn, Adrian Cauleld, Ameen Akel, Laura Grupp, Rajesh Gupta, Ranjit
Jhala, and Steve Swanson. 2011. NV-heaps: Making Persistent Objects Fast and
Safe with Next-generation, Non-volatile Memories. In Architectural Support for
Programming Languages and Operating Systems (ASPLOS).
[10] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Ben-
jamin Lee, Doug Burger, and Derrick Coetzee. 2009. Beer I/O rough Byte-
addressable, Persistent Memory. In Symposium on Operating Systems Principles
(SOSP).
[11] Joel E. Denny, Seyong Lee, and Jerey S. Veer. 2016. NVL-C: Static Analysis
Techniques for Ecient, Correct Programming of Non-Volatile Main Memory
Systems. In International Symposium onHigh-Performance Parallel and Distributed
Computing (HPDC).
[12] G. Dhiman, R. Ayoub, and T. Rosing. 2006. PDRAM: A hybrid PRAM and DRAM
main memory system. In ACM/IEEE Design Automation Conference.
[13] Xiangyu Dong, Naveen Muralimanohar, Norm Jouppi, Richard Kaufmann, and
Yuan Xie. 2009. Leveraging 3D PCRAM Technologies to Reduce Checkpoint
Overhead for Future Exascale Systems. In International Conference on High Per-
formance Computing Networking, Storage and Analysis (SC).
[14] Subramanya R. Dulloor, Amitabha Roy, Zheguang Zhao, Narayanan Sundaram,
Nadathur Satish, Rajesh Sankaran, Je Jackson, and Karsten Schwan. 2016. Data
Tiering in Heterogeneous Memory Systems. In Proceedings of the Eleventh Euro-
pean Conference on Computer Systems (EuroSys).
[15] Shen Gao, Bingsheng He, and Jianliang Xu. 2015. Real-Time In-Memory Check-
pointing for Future Hybrid Memory Systems. In International Conference on
Supercomputing (ICS).
[16] Intel. Accessed on April 3, 2017. Intel NVM Library. hp://pmem.io/nvml/
libpmem/. (Accessed on April 3, 2017).
[17] X. Jiang, N. Madan, L. Zhao, M. Upton, R. Iyer, S. Makineni, D. Newell, Y. Solihin,
and R. Balasubramonian. 2011. CHOP: Integrating DRAMCaches for CMP Server
Platforms. IEEE Micro 31, 1 (2011), 99–108.
[18] Arpit Joshi, Vijay Nagarajan, Marcelo Cintra, and Stratis Viglas. 2015. Ecient
Persist Barriers for Multicores. In International Symposium on Microarchitecture.
[19] Arpit Joshi, Vijay Nagarajan, Stratis Viglas, and Marcelo Cintra. 2017. ATOM:
Atomic Durability in Non-volatile Memory through Hardware Logging. In Inter-
national Symposium on High-Performance Computer Architecture (HPCA).
[20] Myoungsoo Jung, Ellis H. Wilson, III, Wonil Choi, John Shalf, Hasan Metin
Aktulga, Chao Yang, Erik Saule, Umit V. Catalyurek, and Mahmut Kandemir.
2013. Exploring the Future of Out-of-core Computing with Compute-local
Non-volatile Memory. In Proceedings of the International Conference on High
Performance Computing, Networking, Storage and Analysis (SC).
[21] Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan, and Dejan Milojicic. 2013.
Optimizing Checkpoints Using NVM As Virtual Memory. In International Sym-
posium on Parallel and Distributed Processing (IPDPS).
[22] A. Kolli, J. Rosen, S. Diestelhorst, A. Saidi, S. Pelley, S. Liu, P. M. Chen, and
T. F. Wenisch. 2016. Delegated persist ordering. In International Symposium on
Microarchitecture (MICRO).
[23] C. Laner. 2002. LLVM: An Infrastructure for Multi-Stage Optimization. Ph.D.
Dissertation. Computer Science Dept., Univ. of Illinois at Urbana-Champaign.
[24] J. Lee, C. Jung, D. Lim, and Y. Solihin. 2009. Prefetching with Helper reads
for Loosely Coupled Multiprocessor Systems. IEEE Transactions on Parallel and
Distributed Systems 20, 9 (2009).
[25] Dong Li, Dimitrios S. Nikolopoulos, Kirk W. Cameron, Bronis de Supinski, and
Martin Schulz. 2011. Scalable Memory Registration for High-Performance Net-
works Using Helper reads. In International Conference on Computer Frontier.
[26] Charng-Da Lu. 2005. Scalable Diskless Checkpointing for Large Parallel Systems.
Ph.D. Dissertation. Advisor(s) Reed, Daniel A.
[27] Youyou Lu, JiWu Shu, Long Sun, and Onur Mutlu. 2014. Loose-Ordering Con-
sistency for Persistent Memory. In Loosing-Ordering Consistency for Persistent
Memory.
[28] Adam Moody, Greg Bronevetsky, Kathryn Mohror, and Bronis R. de Supinski.
2010. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing
System. In Conference on High Performance Computing Networking, Storage and
Analysis (SC).
[29] Onur Mutlu. 2013. Memory Scaling: A Systems Architecture Perspective. In 5th
International Memory Workshop (IMW).
[30] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Pa. 2003. Runahead Execution: An
Alternative to Very Large Instruction Windows for Out-of-Order Processors. In
International Symposium on High-Performance Computer Architecture (HPCA).
[31] Matheus A. Ogleari, Ethan L. Miller, and Jishen Zhao. 2016. Relaxing Persis-
tent Memory Constraints with Hardware-Driven Undo+Redo Logging. (2016).
hps://users.soe.ucsc.edu/ jzhao/les/HardwareLogging-techreport2016.pdf.
[32] Steven Pelley, Peter M. Chen, and Thomas F. Wenisch. 2014. Memory Persistency.
In ISCA.
[33] J. S. Plank and Kai Li. 1994. Faster checkpointing with N+1 parity. In International
Symposium on Fault-Tolerant Computing.
[34] James S. Plank, Kai Li, and Michael A. Puening. 1998. Diskless Checkpointing.
IEEE Transactions on Parallel and Distributed Systems 9, 10 (1998).
[35] Luiz Ramos, Eugene Gorbatov, and Ricardo Bianchini. 2011. Page Placement in
Hybrid Memory Systems. In International Conference on Supercomputing (ICS).
[36] Jinglei Ren, Jishen Zhao, Samira Khan, Jongmoo Choi, Yongwei Wu, and Onur
Mutlu. 2015. yNVM: Enabling Soware-transparent Crash Consistency in
Persistent Memory Systems. In International Symposium on Microarchitecture
(MICRO).
[37] Andy Rudoff. 2013. Programming Models for Emerging Non-Volatile Memory
Technologies. ;login: The USENIX Magazine 38, 3 (2013).
[38] Andy Rudoff. 2016. Processor Support for NVM Programming. NVM Summit
(2016).
[39] Yakun Sophia Shao and David Brooks. 2013. ISA-Independent Workload Charac-
terization and its Implications for Specialized Architectures.
[40] Kosuke Suzuki and Steven Swanson. 2015. The Non-Volatile Memory Technology
Database (NVMDB). Technical Report CS2015-1011. Department of Computer Sci-
ence & Engineering, University of California, San Diego. http://nvmdb.ucsd.edu.
[41] Xiongchao Tang, Jidong Zhai, Bowen Yu, Wenguang Chen, and Weimin Zheng.
2017. Self-Checkpoint: An In-Memory Checkpoint Method Using Less Space
and Its Practice on Fault-Tolerant HPL. In Symposium on Principles and Practice
of Parallel Programming (PPoPP).
[42] D. Tiwari, S. Lee, J. Tuck, and Y. Solihin. 2010. MMT: Exploiting fine-grained
parallelism in dynamic memory management. In International Symposium on
Parallel and Distributed Processing (IPDPS).
[43] J. S. Veer and S. Mial. 2015. Opportunities for Nonvolatile Memory Systems in
Extreme-Scale High-Performance Computing. Computing in Science Engineering
17, 2 (2015).
[44] Haris Volos, Guilherme Magalhaes, Ludmila Cherkasova, and Jun Li. 2015. Quartz:
A Lightweight Performance Emulator for Persistent Memory Software. In Annual
Middleware Conference (Middleware).
[45] H. Volos, A. J. Tack, and M. M. Swift. 2011. Mnemosyne: Lightweight Persistent
Memory. In Architectural Support for Programming Languages and Operating
Systems (ASPLOS).
[46] Bin Wang, Bo Wu, Dong Li, Xipeng Shen, Weikuan Yu, Yizheng Jiao, and Jef-
frey S. Vetter. 2013. Exploring Hybrid Memory for GPU Energy Efficiency
through Software-Hardware Co-Design. In International Conference on Parallel
Architectures and Compilation Techniques (PACT).
[47] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. 2010. Hybrid Checkpointing
for MPI Jobs in HPC Environments. In International Conference on Parallel and
Distributed Systems (ICPADS).
[48] Panruo Wu, Dong Li, Zizhong Chen, Jeffrey Vetter, and Sparsh Mittal. 2016.
Algorithm-Directed Data Placement in Explicitly Managed Non-Volatile Memory.
In ACM Symposium on High-Performance Parallel and Distributed Computing
(HPDC).
[49] Yiying Zhang, Jian Yang, Amirsaman Memaripour, and Steven Swanson. 2015.
Mojim: A Reliable and Highly-Available Non-Volatile Memory System. In In-
ternational Conference on Architectural Support for Programming Languages and
Operating Systems (ASPLOS).
