teaMPI: Replication-based resiliency without the (performance) pain
In an era where we cannot afford to checkpoint frequently, replication is a generic way forward to construct numerical simulations that can continue to run even if hardware parts fail. Yet replication is often not employed on larger scales, as naïvely mirroring a computation once effectively halves the machine size, and as keeping replicated simulations consistent with each other is not trivial. We demonstrate for the ExaHyPE engine, a task-based solver for hyperbolic equation systems, that it is possible to realise resiliency without major code changes on the user side, and we introduce a novel algorithmic idea where replication reduces the time-to-solution. The redundant CPU cycles are not burned “for nothing”. Our work employs a weakly consistent data model where replicas run independently yet inform each other through heartbeat messages whether they are still up and running. Our key performance idea is to let the tasks of the replicated simulations share some of their outcomes, while we shuffle the actual task execution order per replica. This way, replicated ranks can skip some local computations and automatically start to synchronise with each other. Our experiments with a production-level seismic wave-equation solver provide evidence that this novel concept has the potential to make replication affordable for large-scale simulations in high-performance computing.
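The task-sharing idea in the abstract above can be illustrated with a small, idealised simulation. This is only a sketch under stated assumptions, not teaMPI's implementation: the function name `run_replicas`, the round-robin scheduling, the instantaneous sharing of outcomes through a single dictionary, and the placeholder computation `task * task` are all hypothetical. In a real run, outcomes would arrive asynchronously, so some tasks would presumably still be computed more than once.

```python
import random


def run_replicas(task_ids, num_replicas=2, seed=0):
    """Idealised model: each replica walks its own shuffled task order and
    skips every task whose outcome another replica has already published."""
    rng = random.Random(seed)
    orders = []
    for _ in range(num_replicas):
        order = list(task_ids)
        rng.shuffle(order)  # per-replica execution order (assumption: uniform shuffle)
        orders.append(order)

    shared = {}                      # task id -> outcome, stands in for exchanged outcomes
    computed = [0] * num_replicas    # tasks each replica actually executed
    positions = [0] * num_replicas   # progress of each replica through its order

    while any(positions[r] < len(orders[r]) for r in range(num_replicas)):
        for r in range(num_replicas):
            # Skip tasks whose outcome is already available.
            while positions[r] < len(orders[r]) and orders[r][positions[r]] in shared:
                positions[r] += 1
            if positions[r] < len(orders[r]):
                task = orders[r][positions[r]]
                shared[task] = task * task  # placeholder for the real task body
                computed[r] += 1
                positions[r] += 1
    return shared, computed


if __name__ == "__main__":
    outcomes, work = run_replicas(range(8))
    print("tasks executed per replica:", work)
```

Because sharing is instantaneous in this model, every task is executed exactly once across all replicas, which is the best case; the point is only to show how differing execution orders let replicas divide the work without explicit coordination.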
PartRePer-MPI: Combining Fault Tolerance and Performance for MPI Applications
As we have entered Exascale computing, the faults in high-performance systems
are expected to increase considerably. To compensate for a higher failure rate,
the standard checkpoint/restart technique would need to create checkpoints at a
much higher frequency, resulting in an excessive amount of overhead that would
not be sustainable for many scientific applications. Replication allows for
fast recovery from failures by simply dropping the failed processes and using
their replicas to continue the regular operation of the application.
In this paper, we have implemented PartRePer-MPI, a novel fault-tolerant MPI
library that adopts partial replication of some of the launched MPI processes
in order to provide resilience from failures. The novelty of our work is that
it combines both fault tolerance, due to the use of the User Level Failure
Mitigation (ULFM) framework in the Open MPI library, and high performance, due
to the use of communication protocols in the native MPI library that is
generally fine-tuned for specific HPC platforms. We have implemented efficient
and parallel communication strategies with computational and replica processes,
and our library can seamlessly provide fault tolerance support to an existing
MPI application. Our experiments using seven NAS Parallel Benchmarks and two
scientific applications show that the failure-free overheads of PartRePer-MPI,
compared to the baseline MVAPICH2, are at most 6.4% for the NAS
Parallel Benchmarks and at most 9.7% for the scientific applications.
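The recovery rule described in the abstract above (drop the failed process and continue with its replica) can be sketched as simple bookkeeping. This is a minimal illustration under stated assumptions, not the PartRePer-MPI code or API: the function `drop_failed`, the dictionaries mapping logical ranks to process ids, and the choice to raise an error when an unreplicated process fails are all hypothetical.

```python
def drop_failed(primary, replica, failed_pid):
    """Return updated (primary, replica) maps after process `failed_pid` dies.

    primary: logical rank -> process id doing the computation
    replica: logical rank -> process id of its replica (only some ranks have one)
    """
    primary = dict(primary)
    replica = dict(replica)

    # Case 1: a replica failed. Drop it; the computation is unaffected.
    for rank, pid in list(replica.items()):
        if pid == failed_pid:
            del replica[rank]
            return primary, replica

    # Case 2: a computational process failed. Promote its replica if one exists.
    for rank, pid in primary.items():
        if pid == failed_pid:
            if rank not in replica:
                # Assumption: without a replica, another mechanism
                # (for example a restart) would be needed.
                raise RuntimeError(f"rank {rank} failed and has no replica")
            primary[rank] = replica.pop(rank)
            return primary, replica

    raise KeyError(f"unknown process id {failed_pid}")


if __name__ == "__main__":
    primary = {0: 100, 1: 101, 2: 102, 3: 103}
    replica = {0: 200, 1: 201}  # partial replication: ranks 2 and 3 are unprotected
    primary, replica = drop_failed(primary, replica, 100)
    print(primary, replica)
```

With partial replication only ranks 0 and 1 survive a failure here; the sketch makes visible the trade-off between the number of replicas launched and the set of failures that can be absorbed.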