Resilience is a major design goal for HPC. Checkpoint is the most common method to enable resilient HPC. Checkpoint periodically saves critical data objects to non-volatile storage to enable data persistence. However, using checkpoint, we face dilemmas between resilience, recomputation and checkpoint cost.
INTRODUCTION
Resilience is one of the major design goals for extreme-scale HPC systems. As hardware feature sizes shrink and power management becomes more aggressive, the mean time between failures (MTBF) of future HPC systems could become shorter because of more frequent soft and hard errors; application execution could be interrupted more frequently; and application results could be corrupted more often.
To address the above resilience challenge, checkpoint (or more specifically, application-level checkpoint) is the most common method deployed in current production supercomputers. Application-level checkpoint periodically saves application critical data objects to non-volatile storage to enable data persistence. Once a failure happens, the application can restart from the last valid state of the data objects instead of restarting from the beginning. However, checkpoint faces two dilemmas. First, there is a dilemma between HPC resilience and checkpoint overhead. On one hand, as MTBF may become shorter in the future, we have to increase checkpoint frequency to improve HPC fault tolerance. On the other hand, frequent checkpoint results in larger runtime overhead. We call this the resilience dilemma. Second, there is a dilemma between recomputation cost and checkpoint overhead. On one hand, we want to increase checkpoint frequency to minimize recomputation cost and reduce data loss. On the other hand, frequent checkpoint results in larger runtime overhead. We call this the recomputation dilemma.
The fundamental reason for the above two dilemmas is the cost of data copying inherent in the checkpoint mechanism. The data copying operations can be expensive, because checkpoint data has to be stored on a remote or local durable hard drive. Although diskless checkpoint reduces data copying overhead [4, 26, 33, 34, 41] by using main memory, this technique has to encode data across multiple nodes to create redundancy and only tolerates up to a certain number of node failures, because of the volatility of memory. Other techniques, such as multi-level checkpoint [3, 13, 28] and incremental checkpoint [2, 4, 33, 47], partially remove expensive data copying off the critical path of application execution, but a checkpoint with a large data size can still cause large runtime overhead.
The emergence of non-volatile memories (NVM), such as phase change memory (PCM) and RRAM, is poised to revolutionize memory systems [5, 43]. The performance of NVM is much better than that of hard drives, and is even close to or matches that of DRAM [14, 40]. Furthermore, NVM has better scalability than DRAM while remaining non-volatile. These features make it possible to merge the traditional two layers of the memory hierarchy (i.e., memory plus back-end storage) into one layer (i.e., memory without back-end storage) [29]. Given NVM as main memory and its non-volatile nature, is it possible to change or even remove checkpoint to enable data persistence frequently, thus fundamentally addressing the above two dilemmas in future HPC systems? How can NVM be used to address the resilience challenge for HPC? This paper aims to answer the above questions, and explores how to build resilient HPC with emerging NVM as main memory. We introduce a variety of optimization techniques that leverage the high performance and non-volatility of NVM to establish data persistence for application critical data objects frequently.
We start from a preliminary design that uses NVM as either main memory or storage to implement checkpoint. We expect that the superior performance of NVM would allow us to achieve frequent checkpoint with small runtime overhead and hence address the two dilemmas. To improve checkpoint performance, we introduce a couple of optimizations, including parallelization of cache flushing and using SIMD-based, non-temporal load/store instructions (e.g., MOVDQU) to bypass CPU caches and minimize data movement between caches and memory. However, we reveal that even under an optimistic assumption on NVM performance, NVM-based checkpoint can still lead to large runtime overhead (up to 46%), because of data copying in checkpoint.
We further study how to leverage the non-volatility of NVM to create a copy of the data objects. We aim to replace the traditional data copying in checkpoint, which is the fundamental reason that checkpoint is expensive. We introduce a technique named in-place versioning. This technique shields programmers from application and algorithm details, and leverages application-inherent memory write operations to create a new version of the data objects in NVM without extra data copying. We derive a set of rules to enable automatic transformation of programs to achieve in-place versioning.
To ensure proper recovery based on the new version of the data objects, we must guarantee that the data of the new version is consistent between caches and NVM. Hence, we must flush data blocks of the new version out of caches after the new version is created by the in-place versioning technique. Such cache flushing operations can be expensive, because there is no mechanism that allows us to track which data blocks of the new version are in caches and whether data blocks in caches are clean. As a result, we must flush all data blocks of the new version as if all data blocks are in caches, which brings large performance loss.
To minimize the cache flushing cost, we propose to use a privileged instruction and make it accessible to the application to flush the entire cache hierarchy, instead of flushing all data blocks of the new version. For a large data object, flushing the entire cache hierarchy is often much cheaper. Furthermore, we introduce an asynchronous and proactive cache flushing mechanism to remove the cache flushing cost off the critical path of application execution while enabling data consistency in NVM.
In general, the in-place versioning plus the optimized cache flushing allow us to establish data persistence with consistency for application critical data objects in NVM. The establishment of data persistence can happen much more frequently than with the traditional checkpoint mechanism, with high performance. With the evaluation of six representative HPC benchmarks and one production HPC application (Nek5000), we show that the runtime overhead is 4.4% on average (up to 9%) when data persistence is established at every iteration of the main computation loop. Such frequent and high-performance data persistence allows us to minimize recomputation cost and tolerate high error rates in future HPC.
Our major contributions are summarized as follows.
• We explore how to use NVM to enable resilient HPC. We demonstrate that using NVM (either as main memory or storage) to implement frequent checkpoint based on data copying to address the two dilemmas may not be feasible, because of large data copying overhead, even though NVM is expected to have superior performance.
• We explore how to enable data persistence with consistency in NVM with minimized runtime overhead. Without data copying and with the optimization of cache flushing, using NVM has the potential to address the resilience and recomputation dilemmas rooted in the traditional checkpoint.
BACKGROUND
In this paper, we focus on HPC applications. Those applications are typically characterized by iterative structures. In particular, there is usually a main computation loop in an HPC application. With the traditional checkpoint mechanism, every n iterations of the loop (n is much larger than 1), the application saves its critical data objects into non-volatile storage. In the rest of the paper, we refer to those critical data objects as target data objects. Checkpoint usually happens near the end of an iteration. We call the execution point where checkpoint happens the persistence establishment point.
We also distinguish cache line and cache block in this paper. The cache line describes a location in the cache, and the cache block refers to the data that goes into a cache line. We review NVM background in this section.
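The traditional scheme described above can be sketched as follows. This is a minimal illustration in Python rather than code from the paper: `run_with_checkpoint`, `step`, and the `fail_at` failure-injection parameter are hypothetical names, and a deep copy of the state stands in for writing the target data objects to non-volatile storage.

```python
import copy


def run_with_checkpoint(state, step, num_iters, n, fail_at=None):
    """Run `num_iters` iterations, checkpointing every `n` iterations.

    `state` holds the target data objects and `step` updates it in place.
    `fail_at` (hypothetical) injects a single failure right after the update
    of that iteration, before its checkpoint would be taken.
    Returns the final state and the number of recomputed iterations.
    """
    saved_iter, saved_state = 0, copy.deepcopy(state)  # initial checkpoint
    recomputed = 0
    it = 0
    while it < num_iters:
        step(state, it)
        it += 1
        if fail_at is not None and it == fail_at:
            fail_at = None                      # fail only once
            recomputed += it - saved_iter       # work lost since last checkpoint
            state = copy.deepcopy(saved_state)  # restart from last valid state
            it = saved_iter
            continue
        if it % n == 0:                         # persistence establishment point
            saved_iter, saved_state = it, copy.deepcopy(state)
    return state, recomputed


def step(state, it):
    # Stand-in for one iteration of the main computation loop.
    state["u"] = [x + 1 for x in state["u"]]
```

With `n = 4` and a failure injected at iteration 7, three iterations are recomputed; with `n = 1` only one is, which is the recomputation benefit that frequent persistence aims for, at the price of a copy in every iteration.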
Non-Volatile Memory Usage Model
There are at least two existing usage models to integrate the emerging NVM into HPC systems. In the first model, NVM is built as NVDIMM modules and installed into DDR slots. NVM is physically attached to the high-speed memory bus and managed by a memory controller [8]. In the second model, NVM connects to the host by an I/O controller and I/O bus (e.g., PCI-E) [7].
From the perspective of software, the OS can regard NVM as regular memory (the first model), similar to DRAM, and NVM is byte addressable to the OS and applications. In this case, NVM is accessed through load and store instructions. Alternatively, NVM can be exposed as a block device in the OS [37]. In this case, NVM is accessed via a read/write block I/O interface. A file system can be built on top of NVM to provide the convenience of naming schemes and data protection [37].
Data Consistence in NVM
To build a consistent state for target data objects in NVM (as main memory) and ensure proper recovery, the target data objects in NVM must be updated with the most recent data in caches at the persistence establishment point. However, the prevalence of volatile caches introduces randomness into write operations in NVM. When data is written from caches to NVM depends on the cache management policy of the hardware and OS.
There are "interfaces" that enable explicit data flushing from caches to NVM. Those interfaces are presented as processor instructions or system calls. Using those interfaces, it is possible to enforce data consistence at the persistence establishment point. We discuss the common cache flushing instructions as follows.
• clflush instruction: This is the most common cache flushing instruction. Given a cache block, this instruction invalidates it from all levels of the processor cache hierarchy. If the cache line at any level of the cache hierarchy is dirty, the cache line is written to memory before invalidation. clflush is a blocking instruction, meaning that the instruction waits until the data flushing is done [38].
• WBINVD instruction: This is a privileged instruction used by the OS to flush and invalidate the entire cache hierarchy.
To enable data consistence based on clflush and other cache block-based cache flushing instructions (particularly CLWB and clflushopt, which will be discussed next), we may face a performance problem for a large data object. Because we do not have a mechanism to track which cache line is dirty and whether a specific cache block is in caches, we have to flush all cache blocks of the target data objects, as if all cache blocks are in caches. Figure 1 shows how we flush cache blocks based on cache block-based cache flushing.
Figure 1: Using cache block-based cache flushing instructions to flush cache blocks of the target data object.
Flushing clean cache blocks and flushing cache blocks that are not in caches have a performance cost on the same order as flushing dirty cache blocks. Table 1 shows the performance of flushing cache blocks in different states in caches. The performance is measured on a platform with two eight-core Intel Xeon E5-2630 v3 processors (2.4 GHz, 20MB L3, 256KB L2, and 32KB L1) attached to 32GB DDR4. Based on the results, we conclude that the cost of flushing all cache blocks of a data object is roughly proportional to the data object size.
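To make the proportionality concrete, the sketch below enumerates the cache blocks that a flush-everything strategy has to visit for one data object. It is only an illustration in Python: `blocks_to_flush` is a hypothetical helper, and the 64-byte line size is an assumption that the text does not state.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size; not given in the text


def blocks_to_flush(base_addr, size_bytes, line=CACHE_LINE_BYTES):
    """Return the line-aligned address of every cache block covering the object.

    Without knowing which blocks are cached or dirty, each of these addresses
    would be passed to a cache block-based flush instruction.
    """
    if size_bytes <= 0:
        return []
    first = base_addr - (base_addr % line)             # align start down
    last = (base_addr + size_bytes - 1) // line * line  # block of the last byte
    return list(range(first, last + line, line))
```

Under this assumption, a 20MB object requires 327,680 flush operations, and an unaligned 8-byte object that straddles a line boundary still requires two.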
To support NVM, there are two very new instructions, clflushopt and CLWB. clflushopt maximizes the concurrency of multiple cache line flushes within individual threads. CLWB maximizes the concurrency of multiple cache line flushes without cache line invalidation (i.e., it leaves data in the cache after cache line flushing). clflushopt is only available in the most recent Intel Skylake microarchitecture. To our knowledge, there is no hardware available in the market that supports CLWB. Hence, we cannot evaluate them in this paper. However, using these two instructions should lead to better performance with the method proposed in this paper. More importantly, these two instructions use cache block-based cache flushing, hence they have the same problem as discussed above for large target data objects. Our proposed method can help them improve performance.
PRELIMINARY SYSTEM DESIGNS
The performance of NVM is much better than that of the traditional hard drive, and is even close to or matches that of DRAM. Given such performance characteristics of NVM, it is promising to enable frequent checkpoint with a small overhead. Frequent checkpoint will enable better HPC resilience and minimize recomputation, hence addressing the two dilemmas for future HPC.
Our preliminary designs aim to improve the existing checkpoint mechanism and optimize its performance on NVM. We want to answer a fundamental question: can the NVM-based checkpoint (with optimization) happen frequently, such that we address the two dilemmas rooted in the current checkpoint mechanism?
3.1 Preliminary Design 1: NVM-based, Frequent Checkpoint
In our first design, we employ an NVM-based checkpoint, and the checkpoint happens at each iteration of the main loop, which is much more frequent than the traditional checkpoint. Also, the NVM-based checkpoint happens locally. This means that no matter which usage model is adopted for NVM (either as main memory or as a local block device), the checkpoint is stored locally in NVM. By removing networking overhead, this local NVM-based checkpoint represents the best performance we can get out of NVM. In fact, from the architecture point of view, such a local NVM-based system has been shown to be possible for HPC [15, 20, 21].
We compare two cases of hard drive-based, frequent checkpoint with two cases of NVM-based, frequent checkpoint. For hard drive-based, frequent checkpoint, the hard drive is resident either locally (annotated as "hard drive based chkp (local)") or in a remote storage node (annotated as "hard drive based chkp (remote)"). For NVM-based, frequent checkpoint, NVM is used as either main memory (annotated as "NVM based chkp (mem)") or a local block device (annotated as "NVM based chkp (block)"). If NVM is used as main memory, checkpointing is the same as making a data copy in memory plus the necessary cache flushing. To emulate NVM as a block device, we use a ramdisk with a file system (tmpfs). Hence, such emulation includes the overhead of the file system and system calls, but does not emulate the internal overhead of I/O controllers, such as interface command decoding and ECC.
We run six NAS parallel benchmarks and one production code (Nek5000). The details of those applications are summarized in Table 2. In our study, NVM has either the same performance characteristics (bandwidth and latency) as DRAM, which is a rather optimistic assumption on NVM performance, or inferior performance to DRAM, which is a more practical assumption.
(1) NVM has the same performance as DRAM. We emulate NVM with local DRAM, similar to [49], and assume that NVM has the same latency and bandwidth as DRAM. After data copying in checkpoint, we flush cache blocks of the new data copy out of caches to build a consistent state in NVM, using clflush. Figure 2 shows the results on a production supercomputer, Edison at Lawrence Berkeley National Lab. For NPB benchmarks, we use CLASS D as input; for Nek5000, we use the eddy problem as input (256 × 256). We use 4 nodes with 16 MPI tasks per node. Performance (execution time) in the figure is normalized by that of the native execution without checkpoint. The figure reveals that with frequent checkpoint, hard drive-based checkpoint (local) has 283% overhead on average (up to 1062%), which is unacceptable. NVM-based checkpoint has much better performance. For some benchmarks (e.g., BT and LU), the overhead of NVM-based checkpoint (NVM as main memory) is smaller than 10%. But there is still high overhead for some benchmarks (more than 40% for MG, FT, and Nek5000). Also, NVM-based checkpoint (main memory) shows better performance (26% performance loss on average and up to 46%) than NVM-based checkpoint (block device) (89% performance loss on average and up to 401%).
(2) NVM has worse performance than DRAM. Since NVM technologies have a range of performance characteristics, we change NVM performance to make our evaluation more practical, and redo the above tests in (1). Since checkpoint performance is sensitive to memory bandwidth, we change NVM bandwidth based on Quartz (a DRAM-based, lightweight performance emulator for NVM [44]) for our study. Because using Quartz requires loading a kernel driver, which needs privileged access to the system, we run Quartz on a local cluster (see Section 5 for more details on the cluster). We choose 1/8 and 1/32 of DRAM bandwidth as NVM bandwidth based on [14, 40]. We use CLASS C as input for NPB and the eddy problem (256 × 256) as input for Nek5000; we use 4 nodes with 4 MPI tasks per node. Figures 3 and 4 show the results.
Note that given a lower NVM bandwidth, the application performance on an NVM-only system is worse than on a DRAM-only system. To bridge the performance gap between NVM and DRAM, the existing work introduces a small DRAM cache [14, 27, 35] to place recent write-intensive data into NVM and build a heterogeneous NVM/DRAM system. To study the impact of such a small DRAM cache on checkpoint performance, we allocate a small DRAM space to implement a heterogeneous NVM/DRAM system based on Quartz.
The existing work chooses a DRAM cache size between 32MB and 1GB [12, 14, 17, 27, 35, 46, 48]. We choose a medium DRAM size in our test, which is 256MB.
With the DRAM cache, the overhead of NVM-based checkpoint (NVM as main memory) must include flushing cache blocks of the target data objects from this DRAM space to NVM, besides the overhead of memory copying in NVM and CPU cache flushing. Which data objects are in the DRAM cache at the persistence establishment point depends on the DRAM cache management strategy. We implement a recent software-based approach [14] to manage the DRAM cache. Because of the software-based approach, we know which target data objects (or data blocks of the target data objects) are in the DRAM cache. Hence, we do not need to flush all cache blocks of target data objects for DRAM cache flushing. Also, to optimize the performance of DRAM cache flushing, we do not invalidate data in the DRAM cache after flushing it. Figures 3 and 4 show the results. Similar to Figure 2, the two figures show that NVM-based, frequent checkpoint (NVM as main memory) can result in large performance loss (22% on average and up to 52% for NVM with 1/8 DRAM bandwidth, and 32% on average and up to 60% for NVM with 1/32 DRAM bandwidth).
Conclusions. Using NVM as main memory for checkpoint is promising, but still comes with large performance overhead for some benchmarks, even under an optimistic assumption on NVM performance. The performance loss of NVM-based checkpoint (NVM as main memory) comes from data copying during checkpointing and from cache flushing. To improve the performance of NVM-based checkpoint (NVM as main memory), we focus on improving the performance of cache flushing in the next section. We consider removing data copying in Section 4. In the rest of this paper, we focus on NVM with 1/8 DRAM bandwidth, which is a more practical assumption on NVM performance [14, 40].
Preliminary Design 2: Optimization of NVM-based Checkpoint
To improve the performance of cache flushing, we explore the parallelization of clflush instructions by multi-threading. Although clflush is blocking, there is no guaranteed order for clflush instructions [1] across threads. It is therefore possible to use multiple threads for cache flushing, each of which flushes non-overlapping cache blocks. To verify the above idea, we use OpenMP parallel for to parallelize a for loop for cache flushing, with each iteration of the loop flushing a single cache block. We change the number of threads and measure the performance of flushing a 20MB data buffer with dirty cache blocks on an Intel Xeon E5-2630 v3 processor (20MB L3, 256KB L2, and 32KB L1) attached to 32GB DDR4. The processor has 8 cores with 16 hardware threads. Figure 5 shows the performance (average cycles per cache line). The figure shows that using multi-threading does improve the performance of clflush, but the performance does not scale beyond a certain number of threads. In fact, as we increase the number of threads, they compete for resources in cache controllers and read/write ports of main memory, which limits the scalability of parallel clflush. Based on this observation, we use up to 16 threads to parallelize cache flushing, depending on the availability of idle cores in a node.
To further improve the performance of NVM-based checkpoint (NVM as main memory), we explore special instructions and use SIMD-based, non-temporal instructions (particularly MOVDQU and MOVNTDQ), which bypass caches to make a data copy. Using those instructions removes the necessity of cache flushing, but those instructions are only available on a processor with SSE support. Figure 6 shows the performance of the above two optimization techniques. Within the figure, the preliminary design 1 (i.e., NVM-based checkpoint with NVM as main memory), the parallelized clflush, and non-temporal instructions are labeled as "checkpoint clflush", "checkpoint par clflush", and "cache bypassing" respectively. "Native execution" is the one without checkpoint.
The figure shows that the parallelized clflush has up to 5% performance improvement (for FT) over the preliminary design 1. Non-temporal instructions lead to the best performance in all cases. Compared with the preliminary design 1 (checkpoint clflush in Figure 6), non-temporal instructions result in 9.6% performance improvement on average and up to 16%. If a platform supports those instructions, they should be the preferred method for NVM-based checkpoint.
However, even if we use the above optimizations on CPU cache flushing, we still see large performance loss on some benchmarks (e.g., 36% for Nek5000 and 13% for CG). To investigate the reason, we break down the checkpoint time. For "checkpoint clflush" (the preliminary design 1) and "checkpoint par clflush", the checkpoint time includes DRAM cache flushing, data copying, and CPU cache flushing; for "cache bypassing", the checkpoint time includes DRAM cache flushing and data copying. Figure 7 shows the results.
The results reveal that data copying contributes the most to the performance loss. Except for BT and LU with the preliminary design 1, more than 50% of the performance loss comes from data copying in all cases.
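The non-overlapping work split behind the parallel flush can be sketched as below. This is an illustration in Python, not the OpenMP code used in the study: `partition_blocks` is a hypothetical helper that hands each thread one contiguous range of cache block indices, which resembles (but is not guaranteed to match) what an OpenMP static schedule would produce.

```python
def partition_blocks(num_blocks, num_threads):
    """Split block indices [0, num_blocks) into contiguous, disjoint ranges.

    Returns one (start, end) half-open range per thread. Ranges never overlap,
    so no cache block is flushed by two threads.
    """
    base, extra = divmod(num_blocks, num_threads)
    ranges, start = [], 0
    for t in range(num_threads):
        count = base + (1 if t < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + count))
        start += count
    return ranges
```

Each worker would then issue one flush per block in its own range; the measured scalability limit comes from shared hardware resources, which this sketch does not model.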
Conclusions. To establish frequent data persistence in NVM with high performance and address the dilemmas in checkpoint, we must address the data copying overhead.
HIGH PERFORMANCE DATA PERSISTENCE
We introduce a technique, called "in-place versioning", to remove data copying. Because the in-place versioning has to come with cache flushing, we introduce an asynchronous and proactive cache flushing mechanism to improve performance.
In-Place Versioning
Basic Idea. The in-place versioning is based on the idea of the dual version [48]. Both the in-place versioning and the dual version aim to remove data copying by leveraging application-inherent memory write operations to create a new version of the target data objects. But the dual version heavily relies on knowledge of the numerical algorithm, and is only applicable to algorithms with specific characteristics. The implementation of the dual version for an algorithm requires the programmer to manually change the code based on algorithm knowledge. The in-place versioning significantly improves on the dual version. The in-place versioning works for any numerical algorithm, and is algorithm-agnostic. We generalize a couple of rules to implement the in-place versioning. Based on the rules, we can use a compiler to automatically transform the application into a new one that implements in-place versioning. The new application creates the data copy at runtime without programmer intervention. In the following, we describe the basic idea of the dual version in an algorithm-agnostic way and give an example. Based on the example, we derive a basic rule for the in-place versioning.
Before the main computation loop, the dual version allocates an extra copy of the target data objects (a new version).
Then, in each iteration of the main computation loop, both versions of the data objects are involved in the computation, but memory write operations only happen to one version of the data objects (which we call the "working version"); the other version (which we call the "consistent version") remains unchanged until the next iteration. At the end of each iteration, the working version is flushed out of the cache and becomes consistent in NVM. This version will not be changed in the next iteration, and serves as the consistent version from then on. The previous consistent version becomes the working version, and is updated by memory write operations of the application. The two versions alternate roles across iterations, with one version being consistent and the other being updated. Hence, we ensure that there is always a consistent version in NVM for restart.
The recomputation is limited to at most one iteration, equivalent to the recomputation in the frequent checkpoint we discuss in Section 3. Figure 8 shows an example to further explain the basic idea. In this example, the array u is the target data object. In the main loop (Lines 13-17) of the original code, all elements of u are updated, and those elements are both read and written in each iteration of the main loop. In the dual version, we allocate an extra copy of u (u_e) and rename the original copy as u_o. u_o is enforced to be consistent in NVM (Lines 4-6) before the computation loop. In the main loop, both u_o and u_e participate in the computation. However, at any iteration, only one version of u is updated, and the other version is read. The update to one version of u is naturally embedded in the place of write operations (Line 12). Also, at any iteration, we always maintain a consistent version of u in NVM. Depending on the iteration number (odd or even), we decide which version should be updated and which one should be consistent. The two versions switch their roles (either write or read) after each iteration (Lines 19-25).
Based on the above example and the algorithm-agnostic description, we derive a basic rule for our in-place versioning.
• Basic rule: within each iteration of the main computation loop, write operations happen to one version of the target data objects and read operations happen to the other version. Alternate the roles of the two versions, and flush data blocks of the updated version out of caches after each iteration.
Although the basic rule is straightforward, it can be applied to many target data objects (see Table 2). However, the basic rule is also very restrictive. There are two special cases violating the basic rule. In the first case, within one iteration, read operations reference one version (i.e., the consistent version) before any update happens to the target data object. However, after the first update, read operations should reference the updated version (i.e., the working version) for program correctness. Read operations should not use the same version before and after the first update. We name this case post-update version switch for read operations. We use an example to further explain it.
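The basic rule can be sketched in a few lines. The following Python illustration is not the code of Figure 8: `smooth` is a hypothetical update kernel, the two lists stand in for the two versions resident in NVM, the comment at the persistence establishment point stands in for the actual cache flush, and `crash_in` is a hypothetical failure-injection parameter.

```python
def smooth(src, dst):
    """Hypothetical update step: reads only `src`, writes only `dst`."""
    n = len(src)
    for i in range(n):
        left, right = src[max(i - 1, 0)], src[min(i + 1, n - 1)]
        dst[i] = left + src[i] + right


def run_dual_version(u_init, num_iters, crash_in=None):
    """Alternate two versions of u; return the recoverable version and its iteration."""
    versions = [list(u_init), list(u_init)]  # both versions modeled as living in NVM
    consistent, done = 0, 0                  # which version is valid, and as of when
    for it in range(1, num_iters + 1):
        work = 1 - consistent                # working version for this iteration
        if crash_in == it:
            versions[work][0] = -1           # torn write hits the working version only
            break
        smooth(versions[consistent], versions[work])
        # Persistence establishment point: the working version would be flushed
        # out of caches here, after which the two versions switch roles.
        consistent, done = work, it
    return versions[consistent], done


def run_single_version(u_init, num_iters):
    """Reference execution with a fresh output array per iteration."""
    u = list(u_init)
    for _ in range(num_iters):
        nxt = [0] * len(u)
        smooth(u, nxt)
        u = nxt
    return u
```

A crash injected in iteration 4 leaves the result of iteration 3 intact in the consistent version, so at most one iteration is recomputed and no explicit copy is ever made.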
Special case I: post-update version switch for read operations. See Figure 9 . In this example, we only show the routine where the updates to the target data object (the array u) happen (the routine update), but ignore the main computation loop which is already shown in Figure 8 .
In this example, for the rst update of u (Line 4 in Figure 9 .b), we can use the basic rule correctly. e read operations use u old. However, a er the rst update (Line 6 in Figure 9 .b), we should read the most recent update from u new, not u old suggested by the basic rule (see Line 6 in Figure 9 .c for a correct version). e read operations in Lines 4 and 6 in Figure 9 .c use di erent version of u a er the rst update in Line 4 in Figure 9 .c. e other case violating the basic rule is that elements of the target data object are not updated uniformly within an iteration. As a result, read operations should reference one version for some elements of the target data object, but reference the other version for the other elements. We use an example to further explain it.
Special case II: nonuniform updates. Figure 10 gives an example. ere are two loops in the gure, each of which updates u. In the rst loop (Figure 10 .a), the elements from 1 to Nu − 2 of u are updated, while the elements 0 and Nu − 1 are not updated. In the second loop, all elements are updated. Hence, across two loops, all elements are not updated uniformly.
Based on the basic rule, we replace u in the rst loop with the two versions of u (Line 4 in Figure 10 .b), which is correct. In the second loop, we do the same thing (Line 7 in Figure 10 .b) based on the basic rule. However, the program will not run correctly. For the elements u[0] and u[Nu − 1] that have not been updated in the rst loop, we should use u old for read operations in the second loop (Line 8 in Figure 10 .c), while for the other elements that have been updated, we should use u new for read operations (Line 10 in Figure 10 .c).
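The nonuniform-update problem can be reproduced with a small sketch. The loop bodies below are hypothetical (Figure 10 is not reproduced here); only the update coverage follows the description: the first loop touches elements 1 to Nu − 2 and the second loop touches all elements.

```python
def original(u):
    """Single-version code: both loops read and write u in place."""
    u = list(u)
    n = len(u)
    for i in range(1, n - 1):   # first loop: interior elements only
        u[i] = u[i] + 1
    for i in range(n):          # second loop: all elements
        u[i] = 2 * u[i]
    return u


def basic_rule_only(u_old):
    """Incorrect transformation: every read uses u_old, every write uses u_new."""
    n = len(u_old)
    u_new = [0] * n
    for i in range(1, n - 1):
        u_new[i] = u_old[i] + 1
    for i in range(n):
        u_new[i] = 2 * u_old[i]  # misses the first loop's updates
    return u_new


def with_control_flow(u_old):
    """Correct transformation: choose the read version per element."""
    n = len(u_old)
    u_new = [0] * n
    for i in range(1, n - 1):
        u_new[i] = u_old[i] + 1
    for i in range(n):
        if i == 0 or i == n - 1:
            u_new[i] = 2 * u_old[i]  # not yet updated: read the old version
        else:
            u_new[i] = 2 * u_new[i]  # already updated: read the new version
    return u_new
```

For the input `[1, 2, 3, 4]`, the original code and the control-flow variant both produce `[2, 6, 8, 8]`, while applying the basic rule alone produces `[2, 4, 6, 8]`. The extra branch inside the second loop is the kind of construct whose cost is of concern in real loops.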
To handle the above two cases and enable automatic code transformation to implement the in-place versioning, we introduce a pro le-guided code transformation. is method uses the results of a pro ling test to detect the rst update and nonuniform updates, and then transforms the application into the in-place versioning accordingly. We particularly target on arrays, the most common target data object in HPC applications. We explain our method in details as follows.
Our method rst leverages an LLVM compiler [23] instrumentation pass [39] to generate a set of dynamic LLVM instruction traces Figure 9 : Special case I: post-update version switch for read operations. e target data object is u. e main computation loop is ignored in this gure. u has N u number of elements. Line 6 in Figure 9 .b is the incorrect code. for the rst iteration of the main computation loop. ose traces include dynamic register values and memory addresses referenced in each instruction. Each of the traces corresponds to either a loop or instructions between two neighbor loops. For example, the update routine in Figure 10 .a has three traces: Two of them correspond to for loops and the third one corresponds to the instructions between the two loops. We also record the whole memory address ranges of the target data objects in the beginning of each trace, based on the LLVM instrumentation.
Furthermore, we develop a trace analysis tool. Given the traces and memory address ranges of the target data objects as input, this tool tracks register allocation and memory references to determine which elements are updated in each trace. Based on the analysis results across and within the traces, we identify the rst update for each target data object; we also determine the coverage of each loop-based update (e.g., Lines 7-8 in Figure 8 .a) and whether the coverages in all loop-based updates are di erent. is will be used to detect non-uniform update.
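A much-simplified version of this analysis can be sketched as follows. The trace format is an assumption made for illustration: each trace is a list of `(op, addr)` pairs with `op` being `"load"` or `"store"`, whereas the real traces are LLVM instruction traces with register values. `analyze_traces` is a hypothetical name, and "nonuniform" is interpreted here as two updating traces covering different sets of elements.

```python
def analyze_traces(traces, base, elem_size, num_elems):
    """Find the first update and per-trace update coverage of one target array.

    `base`, `elem_size`, and `num_elems` describe the array's address range.
    Returns (first_update, coverage, nonuniform), where first_update is a
    (trace index, position) pair or None, and coverage holds one set of
    updated element indices per trace.
    """
    coverage, first_update = [], None
    for t, trace in enumerate(traces):
        written = set()
        for pos, (op, addr) in enumerate(trace):
            offset = addr - base
            if not 0 <= offset < elem_size * num_elems:
                continue                      # reference outside the target object
            if op == "store":
                written.add(offset // elem_size)
                if first_update is None:
                    first_update = (t, pos)
        coverage.append(written)
    updating = [c for c in coverage if c]     # traces that update the object
    nonuniform = any(c != updating[0] for c in updating)
    return first_update, coverage, nonuniform
```

In this simplified form, the first-update position tells a later transformation where reads must switch versions, and the nonuniform flag tells it when to fall back to copying.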
Based on the trace analysis results, we use a static LLVM pass to replace the references to the target data objects with the references to either the working version or the consistent version. In particular, any read reference to the target data object before the rst update will be replaced with the reference to the old version of the target data object (i.e., the consistent version); a er the rst update, any read reference to the target data object will be replaced with the reference to the new version (i.e., the working version). Any write reference to the target data object is always replaced with the reference to the new version, based on the basic rule. Figure 9 .c is an example of such replacement.
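The replacement rule can be written as a small state machine over the references to one target data object in program order. This Python sketch is illustrative only: `assign_versions` is a hypothetical name, and it operates on an abstract list of `"read"`/`"write"` labels rather than on LLVM IR.

```python
def assign_versions(refs):
    """Map each reference to the version it should use within one iteration.

    `refs` lists the references to one target data object in program order.
    Reads before the first write use the old (consistent) version; reads after
    it use the new (working) version; writes always use the new version.
    """
    out, updated = [], False
    for op in refs:
        if op == "write":
            out.append(("write", "new"))
            updated = True
        elif op == "read":
            out.append(("read", "new" if updated else "old"))
        else:
            raise ValueError(f"unknown reference kind: {op}")
    return out
```

For two consecutive read-then-write statements, the first read maps to the old version and the second to the new one, which is the post-update version switch of special case I.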
If nonuniform updates are detected, then for a loop-based structure we need to add control flow constructs within the loop to control which version of the data objects should be used. Figure 10.c (Lines 7-10) is such an example. However, in practice, we find that such control flow constructs can be rather sophisticated, especially for a loop statement that references multiple elements of the data objects. Furthermore, the prevalence of such control flow constructs in loops can bring large performance overhead. Hence, we do not apply the in-place versioning to data objects with nonuniform updates. Instead, we use our preliminary design 2 (i.e., data copying based on non-temporal load/store) at the persistence establishment point for those target data objects.
Discussion. We profile the first iteration to detect the first update and nonuniform updates. This method aims to generate a short trace and keep the trace analysis time manageable. This method is based on the assumption that the first iteration and the rest of the iterations in the main loop have the same read and write patterns for the target data objects. Based on our experience with 10 data objects from six NPB benchmarks (24 input problem sizes) and 7 data objects from a large-scale production code (Nek5000), we find that this assumption holds in all cases.
Furthermore, we find that different input problems (not different input problem sizes) can have different read and write patterns to the data objects, and hence require different generated code for the in-place versioning. However, profiling the first iteration and generating the code is quick with our compiler-based approach.
In-place versioning vs. checkpoint. There is a significant difference between the in-place versioning and the checkpoint mechanism. In the checkpoint mechanism, creating the data copy is an extra operation, and the data copy is not involved in the computation. In the in-place versioning, creating the data copy leverages the memory write operations inherent in the application and is part of the computation (not an extra operation). Hence, the in-place versioning significantly reduces the data copying overhead from which the checkpoint mechanism suffers.
However, the in-place versioning can bring performance loss from two perspectives. First, the in-place versioning has to allocate one extra data copy before the main computation loop. However, this cost is paid just once and is easily amortized by the main computation. Second, the in-place versioning increases the memory footprint of the application, because the two versions of the target data object are involved in the computation. This may increase the CPU cache miss rate, which hurts performance. This may also consume more DRAM cache space, reducing the DRAM space for other data objects. However, we see a small performance difference (less than 8.2% in all cases, and 2.7% on average) between the in-place versioning and the native execution without it. The reason is as follows.
For the DRAM cache problem, the software-based cache management we use in our study [14] treats each extra data copy as a new data object and chooses the best data placement in DRAM and NVM for optimal performance, which effectively reduces the impact of the larger working set in the in-place versioning. For the CPU cache problem, we study it based on performance counters, but do not find a significant increase in cache miss rates because of the "streaming-like" memory access patterns in the target data objects. We discuss it further in the performance evaluation section.
Optimization of Cache Flushing
The in-place versioning avoids memory copying. However, to make data consistent between NVM and caches at the persistence establishment point, we need to flush caches. As shown in Figure 7, periodically flushing caches accounts for a large portion of the total overhead. The fundamental reason for such large overhead is that we cannot know which cache blocks of the data objects are in the cache hierarchy and whether they are dirty, and therefore have to issue cache flushing instructions on every single cache block of the target data objects.
To reduce the cache flushing cost, we propose two optimization techniques: whole cache flushing and proactive cache flushing.
Whole cache flushing. The basic idea of the whole cache flushing is to use the WBINVD instruction to flush the entire cache hierarchy, instead of flushing individual cache blocks of the target data objects. If the size of the target data objects is much larger than the last-level cache size, it is highly likely that most of the cache blocks are not in caches, and flushing the entire cache hierarchy is cheaper than flushing all cache blocks of the target data objects.
However, WBINVD is a privileged instruction, and only kernel-level code can issue it. Hence we introduce a kernel module that allows the application to indirectly issue the instruction. The drawback of using WBINVD is that the cache blocks that do not belong to the target data objects are also flushed out of the caches. If those cache blocks are to be reused, they have to be reloaded, which costs performance. However, when the total size of the target data objects is large enough, issuing flush instructions for all cache blocks of the target data objects, most of which are not resident in caches, is much more expensive than the data reloading caused by WBINVD. We empirically determine that using WBINVD is beneficial if the total size of the target data objects is ten times larger than the last-level cache size.
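The empirical rule above can be written as a small decision helper. This is a sketch under stated assumptions: the function names, the 64-byte cache line, and the 20 MiB last-level cache in the example are illustrative values, and "ten times larger" is interpreted here as "at least ten times the last-level cache size".

```python
CACHE_LINE_BYTES = 64  # assumed cache line size, typical on x86


def lines_to_flush(total_bytes, line_bytes=CACHE_LINE_BYTES):
    """Number of per-block flush instructions needed to cover the objects."""
    return -(-total_bytes // line_bytes)  # ceiling division


def use_whole_cache_flush(total_bytes, llc_bytes, factor=10):
    """Apply the empirical rule: prefer WBINVD once the target data objects
    are at least `factor` times the last-level cache size."""
    return total_bytes >= factor * llc_bytes


MiB = 1 << 20
llc = 20 * MiB  # hypothetical last-level cache size

large = use_whole_cache_flush(512 * MiB, llc)  # well above 10x the LLC
small = use_whole_cache_flush(64 * MiB, llc)   # below 10x the LLC
blocks = lines_to_flush(512 * MiB)
```

With these assumed sizes, flushing 512 MiB block by block would require millions of flush instructions, which is the case where a single whole-cache flush is expected to win.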
Asynchronous and proactive cache flushing. In the in-place versioning, we trigger cache flushing (including CPU cache flushing with WBINVD and DRAM cache flushing) at the persistence establishment point to make the working version consistent in NVM. To improve cache flushing performance, we want to move cache flushing off the critical path of execution as much as possible. Also, we can trigger cache flushing ahead of the persistence establishment point under certain conditions (discussed below). We introduce a helper thread-based mechanism to implement asynchronous and proactive cache flushing.
In particular, we do not wait until the persistence establishment point to flush caches. Instead, as soon as the working version is no longer updated in the current iteration, a helper thread proactively flushes caches. Furthermore, the cache flushing does not have to be finished by the end of each iteration. As long as the working version from the last iteration is not read in the current iteration, the cache flushing can continue. However, the helper thread must finish cache flushing by the point where the working version from the last iteration is read for the first time. Figure 11 describes the idea.
To implement the above proactive cache flushing, we develop a lightweight library for HPC applications and a set of APIs. To use the library, the programmer needs to insert a thread creation API (flush_init() in Figure 11) before the main loop to create a helper thread and a FIFO queue shared between the helper thread and the main thread. The programmer also needs to insert an API (flush_async() in Figure 11) into the program to specify where the cache flush can happen within each iteration; the cache flush point does not have to be the same as the persistence establishment point. Using this API inserts a cache flush request into the FIFO queue. The programmer also needs to insert an API to specify where the cache flush must finish within each iteration (flush_barrier() in Figure 11). This API works as a synchronization point between the helper thread and the main thread to ensure that the working version is completely flushed before it becomes the consistent version and is read by the application.
Discussion. Similar to any helper thread-based approach [24, 25, 30, 42], our approach depends on the availability of an idle core for the helper thread. We expect that future many-core platforms can provide such core abundance. Note that even without the helper thread, the in-place versioning with WBINVD already provides significant performance improvement over checkpoint, as shown in Figure 12 in the evaluation section.
EVALUATION
We evaluate the in-place versioning (IPV) in this section. Unless indicated otherwise, IPV includes optimized cache flushing and the helper thread in this section. Also, data persistence is established at every iteration of the main computation loop, which aims to build high resilience and minimize recomputation for future HPC. We use the native execution, which has neither checkpoint nor IPV, as our baseline. Ideally, the performance of IPV should be as close as possible to that of the native execution.
We study the performance on two test platforms. One test platform is a local cluster. Each node has two eight-core Intel Xeon E5-2630 processors (2.4 GHz) and 32GB DDR4. We use this platform for the tests in all figures except Figure 2 in Section 3. We deploy Quartz on this platform to emulate a heterogeneous NVM/DRAM system, with NVM configured with 1/8 of the DRAM bandwidth and DRAM configured with 256MB capacity to enable a practical emulation of NVM [14, 40]. The other test platform is the Edison supercomputer at Lawrence Berkeley National Lab (LBNL). We use this platform for the tests in Figure 2. Each Edison node has two 12-core Intel Ivy Bridge processors (2.4 GHz) with 64GB DDR3. We cannot install Quartz on Edison to enable a practical emulation of NVM, because Quartz requires privileged access to the system. Hence, we perform most of the tests on the local cluster.
We use six NPB benchmarks (CLASS C) and one production application (Nek5000) with the eddy input problem (256 × 256). Table 2 gives more information on the benchmarks and application.
The table also lists how the target data objects are transformed into IPV based on either the basic rule, the post-update version switch, or the nonuniform update. For NPB benchmarks, the target data objects are chosen based on typical checkpoint cases, algorithm knowledge, and benchmark information. For Nek5000, the target data objects are determined by the checkpoint mechanism in Nek5000.
Figure 12 compares the performance of the baseline, the preliminary design 2 (i.e., checkpoint with cache bypassing), IPV with neither cache flushing nor helper thread, IPV with cache flushing (no helper thread), and IPV with everything. Compared with the baseline, IPV achieves rather small runtime overhead (4.4% on average and no larger than 9.5%). Most of the performance improvement comes from the removal of data copying. In particular, neither IPV (no cache flushing and helper thread) nor the preliminary design 2 performs cache flushing, but IPV (no cache flushing and helper thread) performs 9% better on average because it avoids data copying. This is especially pronounced in Nek5000, where IPV (no cache flushing and helper thread) performs 26% better than the preliminary design 2.
Furthermore, IPV cannot be applied to MG because of nonuniform updates (see Table 2). Hence MG does not have performance data for any IPV case. However, MG with the helper thread, which enables proactive and asynchronous data copying (shown in the figure), has 5.4% performance improvement over the preliminary design 2.
To further study the performance of IPV, we focus on the performance difference between IPV without cache flushing and IPV. We aim to study the effectiveness of proactive and asynchronous cache flushing. In Figure 13, we measure the performance of WBINVD and DRAM cache flushing, and quantify their contribution to the total overhead (i.e., WBINVD plus DRAM cache flushing) in IPV. The table below the figure quantifies how much of the total overhead is overlapped with the application execution by the proactive and asynchronous cache flushing. Figure 13 reveals that the proactive and asynchronous cache flushing is quite effective at hiding the cache flushing overhead (or data copying for MG). At least 41% of the total overhead is overlapped in all benchmarks. The non-overlapped cache flushing time is exposed on the application critical path and causes the performance difference between IPV and the native execution in Figure 12.
IPV can cause extra CPU cache misses for two reasons: (1) the two versions of the target data objects increase the working set size of the application; (2) WBINVD flushes the entire cache hierarchy.
We measure the system-wide last-level CPU cache miss rate for the native execution and IPV. Figure 14 shows the results. In general, we do not see a big difference (up to 4%) between the two cases in terms of the last-level cache miss rate. This further explains the small performance loss between IPV and the native execution in Figure 12.
The reason for such a small difference in the last-level miss rate is as follows. WBINVD happens only once in each iteration, hence it affects cache misses infrequently. The two versions do increase the working set size of the application. However, within the original application, the target data objects are typically updated in a loop (e.g., the loop structure in the update routine in Figures 9 and 10) and there is little data reuse across iterations of the loop. Such updates tend to be "streaming-like", and are therefore not sensitive to the increase in working set size.
RELATED WORK
Persistent memory. NVM as main memory has been explored to implement checkpoint. Kannan et al. [21] use NVM only for checkpoint (not computation). To improve performance, they proactively move checkpoint data from DRAM to NVM before the checkpoint is started. Gao et al. [15] use a hardware-based approach to utilize runtime idling to write the checkpoint and spread it across memory banks for load balance. Ren et al. [36] dynamically determine checkpoint granularity (cache block level or page level) based on memory update density. Dong et al. [13] introduce 3D stacked NVM and incremental checkpoint to reduce checkpoint overhead. Those prior efforts focus on the good performance of NVM to establish persistence (checkpoint) in NVM, while we focus on how to maximize the benefit of the non-volatility of NVM. Different from those prior efforts, our work avoids data copying and does not require hardware assistance.
Figure 12: Performance difference between the native execution (baseline), the preliminary design 2 (checkpoint with cache bypassing), and different IPV cases. Performance is normalized to that of the native execution. MG does not have results for IPV. The dotted bar in MG is the case of checkpoint with a helper thread for asynchronous and proactive data copying.
To enable data consistency in NVM, many research efforts explore how to enforce write ordering with minimum overhead. The epoch-based approach [10, 18, 22, 32] is one of those research efforts. This approach divides program execution into epochs, within which stores are allowed to happen concurrently without disturbing data consistency in NVM. In fact, our proactive cache flushing (Section 4.2) is one variation of the epoch. The interval from the point where the cache flush happens to the point where the working version becomes the consistent version is an epoch in which concurrent, persistent writes can happen. However, most of the existing work is hardware-based and requires hardware support to implicitly identify epochs. Also, applying the existing work to establish data persistence in HPC still requires a mechanism to maintain two versions of the target data objects. Our work requires no hardware support, and the in-place versioning provides the two versions.
Some work explores redo-log and undo-log based approaches to build transaction semantics for data consistency in NVM. This includes hardware logging [19, 27, 31]. However, those approaches come with extensive architecture modifications.
There are also software-based approaches that introduce certain program constructs to enable data persistence in NVM [6, 9, 11, 16, 37, 45]. To use those program constructs, one has to make changes to the OS and applications. The application can suffer from large overhead because of frequent runtime checking or data logging. Our experience with [16] shows that CG and dense matrix multiplication suffer from 52% and 103% performance loss, respectively, because of frequent data logging operations. Our work in this paper has very small runtime overhead and does not require changes to the OS.
Checkpoint mechanism. Diskless checkpoint is a technique that uses DRAM-based main memory and available processors to encode and store the encoded checkpoint data [26, 33, 34, 41]. Because of the DRAM usage and the limitation of encoding techniques, diskless checkpoint has to leverage multiple nodes to create redundancy and only tolerates up to a certain number of node failures. Our method is also a diskless approach, but it leverages the non-volatility of NVM. Our method does not need the node-level redundancy of diskless checkpoint, and is independent of the number of node failures.
Incremental checkpoint is a method that only checkpoints modified data to reduce checkpoint size and improve checkpoint performance [2, 4, 33, 47]. However, for applications with intensive modifications between checkpoints (e.g., HPL [41]), the effectiveness of the incremental checkpoint method can be limited.
Multi-level checkpoint is a method that saves checkpoints to fast devices (e.g., PCM and local SSD) at short intervals and to slower devices at long intervals [3, 13, 28]. By leveraging the good performance of fast devices, the multi-level checkpoint removes expensive memory copies to slower devices. However, it can still suffer from large data copy overhead on fast devices when the checkpoint data size is large. Our work introduces the in-place versioning to remove data copying by leveraging application-inherent write operations to update checkpoint data. Hence, our method does not have the limitations of incremental and multi-level checkpoints.
CONCLUSIONS
With the emergence of NVM, how to leverage the performance and non-volatility characteristics of NVM for future HPC is largely unknown. In this paper, we study how to use NVM to build data persistence for critical data objects of applications to replace traditional checkpoint. Our study enables the frequent establishment of data persistence on NVM with small overhead, which enables highly resilient HPC and minimized recomputation.
