Toward Message Passing Failure Management
As machine sizes have increased and application runtimes have lengthened, research into fault tolerance has evolved alongside them. Moving from result checking, to rollback recovery, to algorithm-based fault tolerance, the type of recovery being performed has changed, but the programming model in which it executes has remained virtually static since the publication of the original Message Passing Interface (MPI) Standard in 1992. Since that time, applications have used a message passing paradigm to communicate between processes, but they could not perform process recovery within an MPI implementation due to limitations of the MPI Standard. This dissertation describes a new protocol built on the existing MPI Standard, called Checkpoint-on-Failure, to perform limited fault tolerance within the current framework of MPI, and proposes a new platform, titled User Level Failure Mitigation (ULFM), for building more complete and complex fault tolerance solutions on top of a truly fault-tolerant MPI implementation. We demonstrate the overhead involved in using these fault-tolerant solutions and give examples of applications and libraries which construct other fault tolerance mechanisms based on the constructs provided in ULFM.
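For readers unfamiliar with the proposal, the following is a minimal sketch of the shrink-and-continue recovery pattern that ULFM enables. It assumes the MPIX_* extensions shipped with the ULFM prototype of Open MPI (exposed through mpi-ext.h); it illustrates the recovery style the dissertation builds on, not code from the dissertation itself.

```cpp
#include <mpi.h>
#include <mpi-ext.h>   // MPIX_Comm_shrink, MPIX_ERR_PROC_FAILED (ULFM prototype)
#include <cstdio>

// Replace a communicator on which an operation reported a process
// failure with a shrunken communicator of the surviving processes.
static bool shrink_on_failure(MPI_Comm *comm, int rc)
{
    int eclass = MPI_SUCCESS;
    MPI_Error_class(rc, &eclass);
    if (eclass != MPIX_ERR_PROC_FAILED && eclass != MPIX_ERR_REVOKED)
        return false;
    MPI_Comm survivors;
    MPIX_Comm_revoke(*comm);              // interrupt any pending operations
    MPIX_Comm_shrink(*comm, &survivors);  // agree on the set of alive ranks
    MPI_Comm_free(comm);
    *comm = survivors;
    return true;                          // caller should retry or rebalance
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm work;
    MPI_Comm_dup(MPI_COMM_WORLD, &work);
    // Errors must be returned, not fatal, or recovery is impossible.
    MPI_Comm_set_errhandler(work, MPI_ERRORS_RETURN);

    int value = 1, sum = 0;
    int rc = MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, work);
    if (shrink_on_failure(&work, rc))     // a rank died: retry on survivors
        MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, work);

    int rank;
    MPI_Comm_rank(work, &rank);
    if (rank == 0) std::printf("sum over surviving ranks = %d\n", sum);

    MPI_Comm_free(&work);
    MPI_Finalize();
    return 0;
}
```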
Resilient MPI applications using an application-level checkpointing framework and ULFM
This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-016-1629-7

[Abstract] Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need to make use of new fault tolerance techniques to ensure successful execution completion. The Fault Tolerance Working Group, within the MPI Forum, has presented the User Level Failure Mitigation (ULFM) proposal, providing new functionalities for the implementation of resilient MPI applications. In this work, the CPPC checkpointing framework is extended to exploit the new ULFM functionalities. The proposed solution transparently obtains resilient MPI applications by instrumenting the original application code. In addition, a multithreaded multilevel checkpointing, in which the checkpoint files are saved in different memory levels, improves the scalability of the solution. The experimental evaluation shows a low overhead when tolerating failures in one or several MPI processes.

Ministerio de Economía y Competitividad; TIN2013-42148-P. Ministerio de Economía y Competitividad; TIN2014-53522-REDT. Ministerio de Economía y Competitividad; BES-2014-068066. Galicia. Consellería de Cultura, Educación e Ordenación Universitaria; GRC2013/05.
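As a rough illustration of the resilience pattern described in the abstract above, the following sketch shows an iterative solver that takes periodic application-level checkpoints and, on a peer failure, shrinks the communicator and rolls back instead of aborting. The save_checkpoint/load_checkpoint helpers are hypothetical stand-ins for CPPC's generated instrumentation (which, per the paper, also spreads checkpoint copies across memory levels), and the MPIX_* calls assume a ULFM-enabled MPI.

```cpp
#include <mpi.h>
#include <mpi-ext.h>   // ULFM extensions (assumption: ULFM-enabled MPI)

// Hypothetical helpers standing in for CPPC's instrumentation: persist
// and restore the application state plus the loop counter.
static void save_checkpoint(double state, int iter) { /* write to memory/disk */ }
static void load_checkpoint(double *state, int *iter) { /* read newest copy */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    double state = 0.0, local = 1.0;
    for (int iter = 0; iter < 1000; ++iter) {
        int rc = MPI_Allreduce(&local, &state, 1, MPI_DOUBLE, MPI_SUM, comm);
        if (rc != MPI_SUCCESS) {             // a peer failed mid-iteration
            MPI_Comm survivors;
            MPIX_Comm_revoke(comm);          // unblock everyone
            MPIX_Comm_shrink(comm, &survivors);
            comm = survivors;
            load_checkpoint(&state, &iter);  // roll back to the last snapshot
            continue;                        // resume without restarting the job
        }
        if (iter % 50 == 0)
            save_checkpoint(state, iter);    // periodic application-level snapshot
    }
    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}
```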
PartRePer-MPI: Combining Fault Tolerance and Performance for MPI Applications
As we have entered the era of exascale computing, faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher frequency, resulting in excessive overhead that would not be sustainable for many scientific applications. Replication allows for fast recovery from failures by simply dropping the failed processes and using their replicas to continue the regular operation of the application. In this paper, we have implemented PartRePer-MPI, a novel fault-tolerant MPI library that adopts partial replication of some of the launched MPI processes in order to provide resilience against failures. The novelty of our work is that it combines both fault tolerance, due to the use of the User Level Failure Mitigation (ULFM) framework in the Open MPI library, and high performance, due to the use of communication protocols in the native MPI library, which is generally fine-tuned for specific HPC platforms. We have implemented efficient and parallel communication strategies between computational and replica processes, and our library can seamlessly provide fault tolerance support to an existing MPI application. Our experiments using seven NAS Parallel Benchmarks and two scientific applications show that the failure-free overheads in PartRePer-MPI, compared to the baseline MVAPICH2, are at most 6.4% for the NAS Parallel Benchmarks and at most 9.7% for the scientific applications.
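The division of ranks into computational and replica processes can be pictured with a toy split like the one below. This is only an illustration of the general idea of partial replication, not PartRePer-MPI's actual protocol, and NUM_REPLICAS is an arbitrary choice.

```cpp
#include <mpi.h>
#include <cstdio>

static const int NUM_REPLICAS = 2;  // assumption: replicate the first 2 ranks

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // The last NUM_REPLICAS ranks mirror compute ranks 0..NUM_REPLICAS-1;
    // all other ranks do regular computation.
    bool is_replica = world_rank >= world_size - NUM_REPLICAS;
    int  shadow_of  = is_replica ? world_rank - (world_size - NUM_REPLICAS) : -1;

    // Separate communicator for the compute processes, so the application's
    // collectives run only over them.
    MPI_Comm compute;
    MPI_Comm_split(MPI_COMM_WORLD, is_replica ? MPI_UNDEFINED : 0,
                   world_rank, &compute);

    if (is_replica) {
        // Execute the same work as the shadowed rank; if that rank fails,
        // this process takes over its role in the application.
        std::printf("rank %d replicates compute rank %d\n", world_rank, shadow_of);
    } else {
        int r, n;
        MPI_Comm_rank(compute, &r);
        MPI_Comm_size(compute, &n);
        std::printf("compute rank %d of %d\n", r, n);
        MPI_Comm_free(&compute);
    }
    MPI_Finalize();
    return 0;
}
```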
Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications
This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-016-1863-z

[Abstract] The Message Passing Interface (MPI) standard is the most popular parallel programming model for distributed systems. However, it lacks fault-tolerance support and, traditionally, failures are addressed with stop-and-restart checkpointing solutions. The proposal of User Level Failure Mitigation (ULFM) for the inclusion of resilience capabilities in the MPI standard provides new opportunities in this field, allowing the implementation of resilient MPI applications, i.e., applications that are able to detect and react to failures without stopping their execution. This work compares the performance of a traditional stop-and-restart checkpointing solution with its equivalent resilience proposal. Both approaches are built on top of ComPiler for Portable Checkpointing (CPPC), an application-level checkpointing tool for MPI applications, and they allow transparently obtaining fault-tolerant MPI applications from generic MPI Single Program Multiple Data (SPMD) applications. The evaluation is focused on the scalability of the two solutions, comparing both proposals using up to 3072 cores.

Ministerio de Economía y Competitividad; TIN2013-42148-P. Ministerio de Economía y Competitividad; BES-2014-068066. Galicia. Consellería de Cultura, Educación e Ordenación Universitaria; GRC2013/05.
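The stop-and-restart side of the comparison can be sketched as follows: the job aborts on a failure and, on relaunch, each rank resumes from its last on-disk snapshot, in contrast to the resilient version, which recovers inside the running job. The per-rank file layout here is hypothetical; CPPC's actual checkpoint format and coordination protocol differ.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Hypothetical per-rank checkpoint file.
    char path[64];
    std::snprintf(path, sizeof path, "ckpt.%d.dat", rank);

    int start_iter = 0;
    double state = 0.0;
    if (FILE *f = std::fopen(path, "rb")) {
        // A previous run was killed by a failure: resume from its snapshot.
        std::fread(&start_iter, sizeof start_iter, 1, f);
        std::fread(&state, sizeof state, 1, f);
        std::fclose(f);
    }

    for (int iter = start_iter; iter < 1000; ++iter) {
        // ... computation and MPI communication ...
        if (iter % 50 == 0) {
            // Coordinated periodic checkpoint; on failure the whole job
            // stops and is resubmitted, restarting from these files.
            FILE *f = std::fopen(path, "wb");
            std::fwrite(&iter, sizeof iter, 1, f);
            std::fwrite(&state, sizeof state, 1, f);
            std::fclose(f);
        }
    }
    MPI_Finalize();
    return 0;
}
```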
CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
In order to efficiently use future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been, and still is, the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency, but it requires considerable implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be used directly out of the box, and the library can be easily extended to add further data types. As a means of overhead reduction, the library offers a built-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node-level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of implementing failure detection and communication recovery mechanisms. By utilizing both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail, and the associated overheads are thoroughly analyzed using several benchmarks.
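To make the extendable-data-types idea concrete, here is a hypothetical C++ interface in the spirit of the description above. It is not CRAFT's real API, only an illustration of how registered checkpointable types and a checkpoint object could compose.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Base class: anything that can serialize itself into a checkpoint.
struct CheckpointData {
    virtual void write(std::ofstream &out) const = 0;
    virtual void read(std::ifstream &in) = 0;
    virtual ~CheckpointData() = default;
};

// One built-in data type; users extend the library by subclassing.
struct PodVector : CheckpointData {
    std::vector<double> data;
    void write(std::ofstream &out) const override {
        std::size_t n = data.size();
        out.write(reinterpret_cast<const char*>(&n), sizeof n);
        out.write(reinterpret_cast<const char*>(data.data()), n * sizeof(double));
    }
    void read(std::ifstream &in) override {
        std::size_t n = 0;
        in.read(reinterpret_cast<char*>(&n), sizeof n);
        data.resize(n);
        in.read(reinterpret_cast<char*>(data.data()), n * sizeof(double));
    }
};

// Groups registered objects and checkpoints them together as one unit.
class Checkpoint {
    std::vector<CheckpointData*> members_;
    std::string path_;
public:
    explicit Checkpoint(std::string path) : path_(std::move(path)) {}
    void add(CheckpointData *d) { members_.push_back(d); }
    void write() const {
        std::ofstream out(path_, std::ios::binary);
        for (auto *d : members_) d->write(out);
    }
    bool read() {  // returns false when no prior checkpoint exists
        std::ifstream in(path_, std::ios::binary);
        if (!in) return false;
        for (auto *d : members_) d->read(in);
        return true;
    }
};
```

A user would register, say, a PodVector holding the solver state, call read() at startup to resume if a file exists, and call write() at each checkpoint interval.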
Epidemic failure detection and consensus for extreme parallelism
Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a failure detection and consensus algorithm. This paper presents three novel failure detection and consensus algorithms based on gossiping. Stochastic pinging is used to quickly detect failures during the execution of the algorithm; failures are then disseminated to all the fault-free processes in the system, and consensus on the failures is reached using the three consensus techniques. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that stochastic pinging detects all the failures in the system. In all the algorithms, the number of gossip cycles needed to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage, as well as perfect synchronization in achieving global consensus. The third approach is a three-phase distributed failure detection and consensus algorithm that provides consistency guarantees even in very large and extreme-scale systems while remaining memory and bandwidth efficient.
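As a rough single-process illustration of why gossip-based dissemination converges in logarithmically many cycles, the toy simulation below lets each alive process merge its failure knowledge with one random peer per cycle until all views agree. It is a simplified model of the dissemination idea, not any of the paper's three algorithms; the system size, failed ranks, and initial detectors are arbitrary choices.

```cpp
#include <bitset>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    const int N = 64;                       // system size (assumption)
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick(0, N - 1);

    std::vector<bool> failed(N, false);
    failed[3] = failed[17] = true;          // injected failures

    // view[p]: failures process p knows of; initially only the pinging
    // neighbours detect them (simplified stand-in for stochastic pinging).
    std::vector<std::bitset<64>> view(N);
    view[4].set(3);
    view[18].set(17);

    std::bitset<64> all;                    // ground-truth failure set
    all.set(3);
    all.set(17);

    int cycles = 0;
    for (;; ++cycles) {
        bool done = true;
        for (int p = 0; p < N; ++p)
            if (!failed[p] && view[p] != all) { done = false; break; }
        if (done) break;                    // every survivor knows all failures

        // Each alive process gossips with one random peer per cycle.
        for (int p = 0; p < N; ++p) {
            if (failed[p]) continue;
            int q = pick(rng);
            if (failed[q] || q == p) continue;  // ping to a dead peer times out
            view[p] |= view[q];                 // merge failure knowledge
            view[q] |= view[p];
        }
    }
    std::printf("global failure knowledge reached after %d gossip cycles\n", cycles);
    return 0;
}
```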
What does fault tolerant Deep Learning need from MPI?
Deep Learning (DL) algorithms have become the de facto choice of Machine Learning (ML) algorithm for large-scale data analysis. DL algorithms are computationally expensive: even distributed DL implementations which use MPI require days of training (model learning) time on commonly studied datasets. Long-running DL applications become susceptible to faults, requiring the development of a fault-tolerant system infrastructure in addition to fault-tolerant DL algorithms. This raises an important question: what is needed from MPI for designing fault-tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault-tolerant MPI specification by an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion on the suitability of different parallelism types (model, data, and hybrid); the need (or lack thereof) for checkpointing of any critical data structures; and, most importantly, consideration of several fault tolerance proposals in MPI (User Level Failure Mitigation (ULFM), Reinit) and their applicability to fault-tolerant DL implementations. We leverage a distributed-memory implementation of Caffe, currently available under the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches by extending MaTEx-Caffe to use a ULFM-based implementation. Our evaluation using the ImageNet dataset and the AlexNet and GoogLeNet neural network topologies demonstrates the effectiveness of the proposed fault-tolerant DL implementation using Open MPI-based ULFM.
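The recovery style this line of work argues for in data-parallel training can be sketched as below. It assumes ULFM's MPIX_* extensions and is not MaTEx-Caffe code: because every worker holds a full model replica, a failed rank can simply be dropped, the communicator shrunk, and the batch re-partitioned among survivors, with no rollback required.

```cpp
#include <mpi.h>
#include <mpi-ext.h>   // ULFM extensions (assumption: ULFM-enabled Open MPI)
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    std::vector<float> grad(1024, 0.0f), sum(1024, 0.0f);
    for (int step = 0; step < 100; ++step) {
        // ... forward/backward pass over this rank's data shard fills grad ...
        int rc = MPI_Allreduce(grad.data(), sum.data(),
                               static_cast<int>(grad.size()),
                               MPI_FLOAT, MPI_SUM, comm);
        if (rc != MPI_SUCCESS) {             // a worker died mid-step
            MPI_Comm survivors;
            MPIX_Comm_revoke(comm);          // release ranks blocked in the collective
            MPIX_Comm_shrink(comm, &survivors);
            comm = survivors;
            // Re-partition the dataset over the surviving ranks and redo the
            // step: every rank already holds the current model replica.
            continue;
        }
        // ... average sum, apply the gradient to the local model replica ...
    }
    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}
```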