1,789 research outputs found
Extending Message Passing Interface Windows to Storage
This work presents an extension to MPI supporting the one-sided communication
model and window allocations in storage. Our design transparently integrates
with the current MPI implementations, enabling applications to target MPI
windows in storage, memory or both simultaneously, without major modifications.
Initial performance results demonstrate that the presented MPI window extension
could potentially be helpful for a wide-range of use-cases and with
low-overhead
A Generic Checkpoint-Restart Mechanism for Virtual Machines
It is common today to deploy complex software inside a virtual machine (VM).
Snapshots provide rapid deployment, migration between hosts, dependability
(fault tolerance), and security (insulating a guest VM from the host). Yet, for
each virtual machine, the code for snapshots is laboriously developed on a
per-VM basis. This work demonstrates a generic checkpoint-restart mechanism for
virtual machines. The mechanism is based on a plugin on top of an unmodified
user-space checkpoint-restart package, DMTCP. Checkpoint-restart is
demonstrated for three virtual machines: Lguest, user-space QEMU, and KVM/QEMU.
The plugins for Lguest and KVM/QEMU require just 200 lines of code. The Lguest
kernel driver API is augmented by 40 lines of code. DMTCP checkpoints
user-space QEMU without any new code. KVM/QEMU, user-space QEMU, and DMTCP need
no modification. The design benefits from other DMTCP features and plugins.
Experiments demonstrate checkpoint and restart in 0.2 seconds using forked
checkpointing, mmap-based fast-restart, and incremental Btrfs-based snapshots
Algorithm-Directed Crash Consistence in Non-Volatile Memory for HPC
Fault tolerance is one of the major design goals for HPC. The emergence of
non-volatile memories (NVM) provides a solution to build fault tolerant HPC.
Data in NVM-based main memory are not lost when the system crashes because of
the non-volatility nature of NVM. However, because of volatile caches, data
must be logged and explicitly flushed from caches into NVM to ensure
consistence and correctness before crashes, which can cause large runtime
overhead.
In this paper, we introduce an algorithm-based method to establish crash
consistence in NVM for HPC applications. We slightly extend application data
structures or sparsely flush cache blocks, which introduce ignorable runtime
overhead. Such extension or cache flushing allows us to use algorithm knowledge
to \textit{reason} data consistence or correct inconsistent data when the
application crashes. We demonstrate the effectiveness of our method for three
algorithms, including an iterative solver, dense matrix multiplication, and
Monte-Carlo simulation. Based on comprehensive performance evaluation on a
variety of test environments, we demonstrate that our approach has very small
runtime overhead (at most 8.2\% and less than 3\% in most cases), much smaller
than that of traditional checkpoint, while having the same or less
recomputation cost.Comment: 12 page
CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
In order to efficiently use the future generations of supercomputers, fault
tolerance and power consumption are two of the prime challenges anticipated by
the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has
been and still is the most widely used technique to deal with hard failures.
Application-level CR is the most effective CR technique in terms of overhead
efficiency but it takes a lot of implementation effort. This work presents the
implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic
Fault Tolerance), which serves two purposes. First, it provides an extendable
library that significantly eases the implementation of application-level
checkpointing. The most basic and frequently used checkpoint data types are
already part of CRAFT and can be directly used out of the box. The library can
be easily extended to add more data types. As means of overhead reduction, the
library offers a build-in asynchronous checkpointing mechanism and also
supports the Scalable Checkpoint/Restart (SCR) library for node level
checkpointing. Second, CRAFT provides an easier interface for User-Level
Failure Mitigation (ULFM) based dynamic process recovery, which significantly
reduces the complexity and effort of failure detection and communication
recovery mechanism. By utilizing both functionalities together, applications
can write application-level checkpoints and recover dynamically from process
failures with very limited programming effort. This work presents the design
and use of our library in detail. The associated overheads are thoroughly
analyzed using several benchmarks
- …