Checkpointing as a Service in Heterogeneous Cloud Environments
A non-invasive, cloud-agnostic approach is demonstrated for extending
existing cloud platforms to include checkpoint-restart capability. Most cloud
platforms currently rely on each application to provide its own fault
tolerance. A uniform mechanism within the cloud itself serves two purposes: (a)
direct support for long-running jobs, which would otherwise require a custom
fault-tolerant mechanism for each application; and (b) the administrative
capability to manage an over-subscribed cloud by temporarily swapping out jobs
when higher priority jobs arrive. An advantage of this uniform approach is that
it also supports parallel and distributed computations, over both TCP and
InfiniBand, thus allowing traditional HPC applications to take advantage of an
existing cloud infrastructure. Additionally, an integrated health-monitoring
mechanism detects when long-running jobs either fail or incur exceptionally low
performance, perhaps due to resource starvation, and proactively suspends the
job. The cloud-agnostic feature is demonstrated by applying the implementation
to two very different cloud platforms: Snooze and OpenStack. The use of a
cloud-agnostic architecture also enables, for the first time, migration of
applications from one cloud platform to another.
Comment: 20 pages, 11 figures, appears in CCGrid, 201
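Purpose (b) in this abstract, swapping out jobs when higher-priority jobs arrive, can be pictured as a small scheduling policy. The sketch below is a hypothetical illustration and not the paper's implementation: the `Job` type, the `pick_victims` function, and the slot-based capacity model are all assumptions made for the example. It selects the lowest-priority running jobs to checkpoint and suspend until the arriving job fits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    priority: int  # larger means more important (assumed convention)
    slots: int     # resource slots the job occupies (assumed capacity model)


def pick_victims(running: List[Job], free_slots: int, arriving: Job) -> List[Job]:
    """Choose running jobs to checkpoint and suspend so that `arriving` fits.

    Only jobs with strictly lower priority than `arriving` are candidates,
    and the lowest-priority candidates are taken first. An empty list is
    returned when nothing has to be suspended, or when suspending every
    candidate would still not free enough slots.
    """
    needed = arriving.slots - free_slots
    if needed <= 0:
        return []  # the arriving job already fits

    candidates = sorted(
        (job for job in running if job.priority < arriving.priority),
        key=lambda job: job.priority,
    )
    victims: List[Job] = []
    for job in candidates:
        victims.append(job)
        needed -= job.slots
        if needed <= 0:
            return victims
    return []  # cannot make room without touching equal or higher priority jobs
```

In a real deployment each selected job would be checkpointed before its resources are released, and restarted from that checkpoint once capacity returns.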
Collective Vector Clocks: Low-Overhead Transparent Checkpointing for MPI
Taking snapshots of the state of a distributed computation is useful for
off-line analysis of the computational state, for later restarting from the
saved snapshot, for cloning a copy of the computation, and for migration to a
new cluster. The problem is made more difficult when supporting collective
operations across processes, such as barrier, reduce operations, scatter and
gather, etc. Some processes may have reached the barrier or other collective
operation, while other processes wait a long time to reach that same barrier or
collective operation. At least two solutions are well-known in the literature:
(i) draining in-flight network messages and then freezing the network at
checkpoint time; and (ii) adding a barrier prior to the collective operation,
and either completing the operation or aborting the barrier if not all
processes are present. Both solutions suffer important drawbacks. The code in
the first solution must be updated whenever one ports to a newer network. The
second solution implies additional barrier-related network traffic prior to
each collective operation. This work presents a third solution that avoids both
drawbacks. There is no additional barrier-related traffic, and the solution is
implemented entirely above the network layer. The work is demonstrated in the
context of transparent checkpointing of MPI libraries for parallel computation,
where each of the first two solutions has already been used in prior systems,
and then abandoned due to the aforementioned drawbacks. Experiments demonstrate
the low runtime overhead of this new, network-agnostic approach. The approach
is also extended to non-blocking, collective operations in order to handle
overlapping of computation and communication.
Comment: 16 pages, 6 figure
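The difficulty described in this abstract is that a checkpoint request may arrive while some processes have already entered a collective operation and others have not. The sketch below illustrates only that consistency condition and is not the paper's algorithm: it assumes each rank keeps a hypothetical counter of how many collectives it has entered on one communicator, and it reports which ranks would still need to advance before a snapshot is taken.

```python
from typing import Dict, List


def lagging_ranks(entered: Dict[int, int]) -> List[int]:
    """Return the ranks that have entered fewer collectives than the leader.

    `entered` maps a rank to the number of collective operations it has
    entered so far on a single communicator (an assumed bookkeeping scheme).
    """
    target = max(entered.values())
    return sorted(rank for rank, count in entered.items() if count < target)


def safe_to_checkpoint(entered: Dict[int, int]) -> bool:
    """A snapshot is treated as safe when no rank lags behind the others."""
    return not lagging_ranks(entered)
```

For example, with counters `{0: 3, 1: 3, 2: 2}`, rank 2 has not yet entered the third collective, so a snapshot taken at that moment would capture the operation half-started.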
Failure Avoidance in MPI Applications Using an Application-Level Approach
Execution times of large-scale parallel applications in computational science and engineering are usually longer than the mean time between failures. For this reason, applications must tolerate hardware failures so that a machine failure does not wipe out all completed computation. Checkpointing and rollback recovery is one of the most popular techniques for providing fault tolerance to parallel applications. However, when a failure occurs, most checkpointing mechanisms require a complete restart of the parallel application from the last checkpoint. Advances in the prediction of hardware failures have led to proactive process migration approaches, in which tasks are migrated preventively when node failures are anticipated, avoiding a restart of the whole application. The work presented in this paper extends an application-level checkpointing framework to proactively migrate Message Passing Interface (MPI) processes when it is notified of impending failures, without having to restart the entire application. The main features of the proposed solution are: low overhead in failure-free executions, avoiding the checkpoint dumping associated with rollback strategies; low overhead at migration time, through a light, asynchronous protocol for reaching a consistent global state; transparency for the user, thanks to a compiler tool and a runtime library; and portability, as it is not locked into a particular architecture, operating system, or MPI implementation.
Ministerio de Ciencia e Innovación; TIN2010-16735
Galicia. Consellería de Economía e Industria; 10PXIB105180P
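The contrast between a full restart and proactive migration can be shown with a small placement example. This is a minimal sketch under stated assumptions, not the framework's protocol: `plan_migration`, the node names, and the round-robin choice of spare nodes are hypothetical. Only the ranks on the node predicted to fail are moved; every other rank keeps its placement and continues running.

```python
from typing import Dict, List


def plan_migration(
    placement: Dict[int, str], failing: str, spares: List[str]
) -> Dict[int, str]:
    """Return a new rank-to-node placement that vacates the `failing` node.

    Ranks on the failing node are assigned to spare nodes in round-robin
    order (an assumed policy); all other ranks stay where they are.
    """
    if not spares:
        raise ValueError("no spare node available for migration")

    affected = sorted(rank for rank, node in placement.items() if node == failing)
    new_placement = dict(placement)  # leave the caller's mapping untouched
    for i, rank in enumerate(affected):
        new_placement[rank] = spares[i % len(spares)]
    return new_placement
```

With four ranks on three nodes and a failure predicted on `n1`, only the two ranks hosted on `n1` change nodes, which is the saving the abstract attributes to migrating instead of rolling back.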
Implementation-Oblivious Transparent Checkpoint-Restart for MPI
This work presents experience with traditional use cases of checkpointing on
a novel platform. A single codebase (MANA) transparently checkpoints production
workloads for major available MPI implementations: "develop once, run
everywhere". The new platform enables application developers to compile their
application against any of the available standards-compliant MPI
implementations, and test each MPI implementation according to performance or
other features.
Comment: 17 pages, 4 figure