3 research outputs found

Proposal of MPI operation level checkpoint/rollback and one implementation

Authors: Jack J. Dongarra, Graham E. Fagg, Yuan Tang
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006

The University of Manchester - Institutional Repository

Proposal of MPI operation level checkpoint/rollback and one implementation

Authors: Graham E. Fagg, Jack J. Dongarra, Yuan Tang

With the increasing number of processors in modern HPC (High Performance Computing) systems, two problems have emerged: scalability and fault tolerance. In our previous work, we extended the MPI specification to handle fault tolerance by specifying a systematic framework for the recovery methods, communicators, message modes, etc. that define the behavior of MPI when an error occurs. These extensions not only specify how the implementation of the MPI library and RTE (Run Time Environment) handle failures at the system level, but also give ordinary HPC application developers a range of recovery choices with varying performance and cost. In this paper, we continue the work on extending MPI’s capabilities in this direction. First, we propose an MPI operation level checkpoint/rollback library to recover the user’s data. More importantly, we argue that the future programming model of a fault tolerant MPI application should be recover-and-continue rather than the more traditional stop-and-restart model. Recover-and-continue means that when an error occurs, only the failed processes are re-spawned. All surviving processes remain on their original processors and in memory. The main benefits of recover-and-continue are a much lower cost for system recovery and the opportunity to employ in-memory checkpoint/rollback techniques. Compared with stable or local disk techniques, which are the only choices for stop-and-restart, the in-memory approach significantly reduces the performance penalty of checkpoint/rollback. Additionally, it makes it possible to establish a concurrent multiple level checkpoint/rollback framework. With the progress of our work, a picture of the hierarchy of futur

CiteSeerX