CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
In order to efficiently use the future generations of supercomputers, fault
tolerance and power consumption are two of the prime challenges anticipated by
the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has
been and still is the most widely used technique to deal with hard failures.
Application-level CR is the most effective CR technique in terms of overhead,
but it requires considerable implementation effort. This work presents the
implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic
Fault Tolerance), which serves two purposes. First, it provides an extendable
library that significantly eases the implementation of application-level
checkpointing. The most basic and frequently used checkpoint data types are
already part of CRAFT and can be directly used out of the box. The library can
be easily extended to add more data types. As a means of overhead reduction, the
library offers a built-in asynchronous checkpointing mechanism and also
supports the Scalable Checkpoint/Restart (SCR) library for node-level
checkpointing. Second, CRAFT provides an easier interface for User-Level
Failure Mitigation (ULFM) based dynamic process recovery, which significantly
reduces the complexity and effort of implementing failure detection and
communication recovery. By utilizing both functionalities together, applications
can write application-level checkpoints and recover dynamically from process
failures with very limited programming effort. This work presents the design
and use of our library in detail. The associated overheads are thoroughly
analyzed using several benchmarks.
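
To make the idea of application-level checkpointing concrete, the following is a minimal sketch in plain C++. It does not use CRAFT and does not reflect CRAFT's actual interface; the helper names (`write_checkpoint`, `read_checkpoint`, `run_with_checkpoints`), the file layout, and the `fail_at` parameter used to simulate a failure are all assumptions made for illustration. The application itself decides what to save (an iteration counter and a state vector) and when to save it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: store the iteration counter and the state in a file.
// The data goes to a temporary file that is renamed afterwards, so an
// interrupted write does not damage the previous checkpoint (std::rename
// replaces an existing file on POSIX systems).
bool write_checkpoint(const std::string& path, std::size_t iter,
                      const std::vector<double>& state) {
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        const std::size_t n = state.size();
        out.write(reinterpret_cast<const char*>(&iter), sizeof iter);
        out.write(reinterpret_cast<const char*>(&n), sizeof n);
        out.write(reinterpret_cast<const char*>(state.data()),
                  static_cast<std::streamsize>(n * sizeof(double)));
        if (!out) return false;
    }
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Hypothetical helper: load a checkpoint if one exists. The outputs are
// only modified when the whole file was read successfully.
bool read_checkpoint(const std::string& path, std::size_t& iter,
                     std::vector<double>& state) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::size_t it = 0, n = 0;
    in.read(reinterpret_cast<char*>(&it), sizeof it);
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    if (!in) return false;
    std::vector<double> tmp(n);
    in.read(reinterpret_cast<char*>(tmp.data()),
            static_cast<std::streamsize>(n * sizeof(double)));
    if (!in) return false;
    iter = it;
    state.swap(tmp);
    return true;
}

// Toy iteration loop: resumes from a checkpoint if present, checkpoints
// every `freq` iterations, and stops early at `fail_at` to mimic a failure.
// Returns the iteration count that was reached.
std::size_t run_with_checkpoints(const std::string& path,
                                 std::vector<double>& state,
                                 std::size_t total, std::size_t freq,
                                 std::size_t fail_at) {
    assert(freq > 0);
    std::size_t iter = 0;
    read_checkpoint(path, iter, state);  // restart path; no-op on a fresh run
    while (iter < total) {
        if (iter == fail_at) return iter;  // simulated failure
        for (double& x : state) x += 1.0;  // stand-in for real computation
        ++iter;
        if (iter % freq == 0) write_checkpoint(path, iter, state);
    }
    return iter;
}
```

With `total = 10`, `freq = 4` and a simulated failure at iteration 6, a second call with a fresh state vector resumes from the checkpoint written at iteration 4 rather than from iteration 0, so only two iterations of work are repeated.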
Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery
Efficient utilization of today's high-performance computing (HPC) systems
with complex hardware and software components requires that the HPC
applications are designed to tolerate process failures at runtime. With the low
mean time to failure (MTTF) of current and future HPC systems, long running
simulations on these systems require capabilities for gracefully handling
process failures by the applications themselves. In this paper, we explore the
use of fault tolerance extensions to the Message Passing Interface (MPI) called
user-level failure mitigation (ULFM) for handling process failures without the
need to discard the progress made by the application. We explore two
alternative recovery strategies, which use ULFM along with application-driven
in-memory checkpointing. In the first case, the application is recovered with
only the surviving processes, and in the second case, spares are used to
replace the failed processes, such that the original configuration of the
application is restored. Our experimental results demonstrate that graceful
degradation is a viable alternative for recovery in environments where spares
may not be available.
Comment: 26th Euromicro International Conference on Parallel, Distributed and
Network-based Processing (PDP 2018)
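
The difference between the two recovery strategies can be illustrated with a small rank-bookkeeping model in plain C++. This is only a sketch under stated assumptions: it contains no MPI or ULFM calls, a communicator is modeled as a vector of process ids indexed by rank, and the function names (`shrink`, `substitute`, `items_per_rank`) are hypothetical. In a real ULFM code, the communicator of survivors would typically come from a call such as `MPIX_Comm_shrink`.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <stdexcept>
#include <vector>

// Shrink: drop failed processes. Survivors keep their relative order but
// are renumbered contiguously, so the communicator becomes smaller.
std::vector<int> shrink(const std::vector<int>& comm,
                        const std::set<int>& failed) {
    std::vector<int> survivors;
    for (int proc : comm) {
        if (failed.count(proc) == 0) survivors.push_back(proc);
    }
    return survivors;
}

// Substitute: put a spare process into each failed rank slot, so the
// original size and rank layout are restored. Used spares are removed
// from `spares`. Throws if there are not enough spares.
std::vector<int> substitute(const std::vector<int>& comm,
                            const std::set<int>& failed,
                            std::vector<int>& spares) {
    std::vector<int> repaired = comm;
    for (int& proc : repaired) {
        if (failed.count(proc) != 0) {
            if (spares.empty()) {
                throw std::runtime_error("not enough spare processes");
            }
            proc = spares.front();
            spares.erase(spares.begin());
        }
    }
    return repaired;
}

// Work items per rank under a simple block distribution (rounded up).
// After a shrink, each survivor has to take on a larger share.
std::size_t items_per_rank(std::size_t n_items, std::size_t comm_size) {
    assert(comm_size > 0);
    return (n_items + comm_size - 1) / comm_size;
}
```

In this model, shrinking a four-process communicator after one failure leaves three ranks that each own a larger share of the data, while substitution keeps four ranks but consumes one spare. The checkpointed state of the failed rank still has to be restored in either case; that step is not modeled here.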