Future extreme-scale high-performance computing systems will be required
to work under frequent component failures. The MPI Forum’s User
Level Failure Mitigation proposal has introduced an operation,
MPI_Comm_shrink, to synchronize the alive processes on the list of failed
processes, so that applications can continue to execute even in the presence
of failures by adopting algorithm-based fault tolerance techniques. This
MPI_Comm_shrink operation requires a failure detection and consensus
algorithm. This paper presents three novel failure detection and consensus
algorithms using Gossiping. Stochastic pinging is used to quickly detect
failures during the execution of the algorithm. Detected failures are then
disseminated to all the fault-free processes in the system, and consensus
on the failures is detected using the three consensus techniques. The proposed
algorithms were implemented and tested using the Extreme-scale Simulator.
The results show that the stochastic pinging detects all the failures in
the system. In all the algorithms, the number of Gossip cycles to achieve
global consensus scales logarithmically with system size. The second algorithm
also shows better scalability in terms of memory and network
bandwidth usage and a perfect synchronization in achieving global consensus.
The third approach is a three-phase distributed failure detection
and consensus algorithm and provides consistency guarantees even in very
large and extreme-scale systems while at the same time being memory and
bandwidth efficient.
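
The general idea of detecting failures by stochastic pinging and spreading them by Gossiping can be illustrated with a small, self-contained simulation. This is only a sketch under simplifying assumptions, not any of the three proposed algorithms: cycles are synchronous, a ping to a failed process is always detected, no new failures occur during the run, and consensus is checked by a global observer rather than by the processes themselves. The function name `simulate_gossip_consensus` and its parameters are hypothetical.

```python
import random


def simulate_gossip_consensus(n, failed, seed=0, max_cycles=10_000):
    """Toy model: count Gossip cycles until every alive process knows all failures.

    Assumptions (not taken from the paper): synchronous cycles, perfect
    detection when a failed process is pinged, and a static failure set.
    """
    rng = random.Random(seed)
    failed = frozenset(failed)
    alive = [p for p in range(n) if p not in failed]
    # Each alive process keeps its own local list of known failed processes.
    views = {p: set() for p in alive}

    for cycle in range(1, max_cycles + 1):
        for p in alive:
            # Stochastic ping: choose a random peer other than p itself.
            target = rng.randrange(n - 1)
            if target >= p:
                target += 1
            if target in failed:
                # No reply: p records the target as failed.
                views[p].add(target)
            else:
                # Reply received: p pushes its failure list to the peer.
                views[target] |= views[p]
        # Global observer check: all local lists match the true failure set.
        if all(view == failed for view in views.values()):
            return cycle, views

    raise RuntimeError("no consensus within max_cycles")
```

Calling `simulate_gossip_consensus(64, {3, 17})` returns the number of cycles needed together with the final local failure lists. Repeating the call for increasing `n` gives a rough feel for how the cycle count grows with system size in this toy model.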