3 research outputs found

    A Checkpointing Strategy for Scalable Recovery on Distributed Parallel Systems

    In this paper, we describe a new scheme for checkpointing parallel applications on message-passing scalable distributed memory systems. The novelty of our scheme is that a checkpointed application can be restored, from its checkpointed state, in a reconfigured form. Thus, a parallel application may be checkpointed while executing with t_1 tasks on p_1 processors, and then restarted from the checkpointed state with t_2 tasks on p_2 processors. As a result, applications can recover from partial failures in the underlying system. Also, the reconfigurable checkpointed states can be migrated from one parallel system to another even if they do not have the same number of processors. We describe a new programming model for implementing a reconfigurable checkpointing scheme for parallel programs. This new model is derived from the DRMS programming model, developed in the context of run-time reconfiguration of parallel applications. A key component of our implementation is the distri..
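
    The idea of checkpointing with one task count and restarting with another can be illustrated with a minimal sketch. It is not the paper's implementation and does not use the DRMS API. It assumes a one-dimensional array that is block-distributed across tasks and saved in a distribution-independent (global) order. All function names (block_bounds, distribute, checkpoint, restart) are hypothetical and chosen only for this illustration.

```python
def block_bounds(n, tasks, rank):
    """Return the [start, stop) range of an n-element array owned by `rank`
    under a simple block distribution (an assumed layout, not DRMS's)."""
    base, extra = divmod(n, tasks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def distribute(global_data, tasks):
    """Split a global array into one local block per task."""
    n = len(global_data)
    return [global_data[slice(*block_bounds(n, tasks, r))] for r in range(tasks)]


def checkpoint(local_blocks):
    """Gather per-task blocks into a state that no longer depends on the
    number of tasks that produced it."""
    return [x for block in local_blocks for x in block]


def restart(saved_state, tasks):
    """Redistribute a saved state over a possibly different number of tasks."""
    return distribute(saved_state, tasks)


# Run with 4 tasks, checkpoint, then restart with 3 tasks.
data = list(range(10))
blocks_t1 = distribute(data, 4)
saved = checkpoint(blocks_t1)
blocks_t2 = restart(saved, 3)
```

    Because the saved state is stored in global order, the restart step needs only the new task count. This is the property that would let a run continue on fewer processors after a partial failure.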