833 research outputs found
CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
In order to efficiently use the future generations of supercomputers, fault
tolerance and power consumption are two of the prime challenges anticipated by
the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has
been and still is the most widely used technique to deal with hard failures.
Application-level CR is the most effective CR technique in terms of overhead
efficiency but it takes a lot of implementation effort. This work presents the
implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic
Fault Tolerance), which serves two purposes. First, it provides an extendable
library that significantly eases the implementation of application-level
checkpointing. The most basic and frequently used checkpoint data types are
already part of CRAFT and can be directly used out of the box. The library can
be easily extended to add more data types. As means of overhead reduction, the
library offers a build-in asynchronous checkpointing mechanism and also
supports the Scalable Checkpoint/Restart (SCR) library for node level
checkpointing. Second, CRAFT provides an easier interface for User-Level
Failure Mitigation (ULFM) based dynamic process recovery, which significantly
reduces the complexity and effort of failure detection and communication
recovery mechanism. By utilizing both functionalities together, applications
can write application-level checkpoints and recover dynamically from process
failures with very limited programming effort. This work presents the design
and use of our library in detail. The associated overheads are thoroughly
analyzed using several benchmarks
Fine-Grain Checkpointing with In-Cache-Line Logging
Non-Volatile Memory offers the possibility of implementing high-performance,
durable data structures. However, achieving performance comparable to
well-designed data structures in non-persistent (transient) memory is
difficult, primarily because of the cost of ensuring the order in which memory
writes reach NVM. Often, this requires flushing data to NVM and waiting a full
memory round-trip time.
In this paper, we introduce two new techniques: Fine-Grained Checkpointing,
which ensures a consistent, quickly recoverable data structure in NVM after a
system failure, and In-Cache-Line Logging, an undo-logging technique that
enables recovery of earlier state without requiring cache-line flushes in the
normal case. We implemented these techniques in the Masstree data structure,
making it persistent and demonstrating the ease of applying them to a highly
optimized system and their low (5.9-15.4\%) runtime overhead cost.Comment: In 2019 Architectural Support for Programming Languages and Operating
Systems (ASPLOS 19), April 13, 2019, Providence, RI, US
- …