3 research outputs found

SRS: A Framework for Developing Malleable and Migratable Parallel Applications for Distributed Systems

Authors: Arabe J. N. C.; Foster I.; Jack J. Dongarra; Koo R.; Sathish S. Vadhiyar; Tannenbaum T.
Publication venue: 'World Scientific Pub Co Pte Lt'

Crossref

Deploying fault tolerance and task migration with NetSolve

Authors: Amza; Bakken; Beguelin; Boley; Henri Casanova; Huang; Jack J. Dongarra; James S. Plank; Micah Beck; Nichols; Plank; Vaidya
Publication venue: 'Elsevier BV'

Crossref

A Checkpointing Strategy for Scalable Recovery on Distributed Parallel Systems

Authors: Jose E. Moreira; Samuel P. Midkiff; Vijay K. Naik
Publication venue: ACM Press
Publication date: 01/01/1997

Abstract: In this paper, we describe a new scheme for checkpointing parallel applications on message-passing scalable distributed memory systems. The novelty of our scheme is that a checkpointed application can be restored, from its checkpointed state, in a reconfigured form. Thus, a parallel application may be checkpointed while executing with t_1 tasks on p_1 processors, and then restarted from the checkpointed state with t_2 tasks on p_2 processors. As a result, applications can recover from partial failures in the underlying system. Also, the reconfigurable checkpointed states can be migrated from one parallel system to another even if they do not have the same number of processors. We describe a new programming model for implementing a reconfigurable checkpointing scheme for parallel programs. This new model is derived from the DRMS programming model, developed in the context of run-time reconfiguration of parallel applications. A key component of our implementation is the distri..
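
The reconfigurable restart described in this abstract can be illustrated with a small sketch. This is not the DRMS implementation or the authors' scheme. It is a toy, single-process Python model that assumes a one-dimensional array with a block distribution, and all function names (`block_bounds`, `distribute`, `checkpoint`, `restart`) are hypothetical. The idea shown is only that a checkpoint stored independently of the task count can be re-partitioned over a different number of tasks at restart.

```python
from typing import List, Tuple


def block_bounds(n: int, tasks: int, rank: int) -> Tuple[int, int]:
    """Return the [start, stop) index range of a length-n array owned by `rank`.

    Assumes a simple block distribution in which the first (n % tasks)
    tasks each hold one extra element.
    """
    base, extra = divmod(n, tasks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def distribute(global_data: List[float], tasks: int) -> List[List[float]]:
    """Split the global array into one local block per task."""
    n = len(global_data)
    blocks = []
    for rank in range(tasks):
        start, stop = block_bounds(n, tasks, rank)
        blocks.append(global_data[start:stop])
    return blocks


def checkpoint(local_blocks: List[List[float]]) -> List[float]:
    """Save the state in a form that does not depend on the task count.

    Here that simply means concatenating the blocks in rank order.
    """
    return [x for block in local_blocks for x in block]


def restart(saved: List[float], tasks: int) -> List[List[float]]:
    """Restore the saved state onto a possibly different number of tasks."""
    return distribute(saved, tasks)


# Hypothetical run: checkpoint with t_1 = 4 tasks, restart with t_2 = 3 tasks.
data = [float(i) for i in range(10)]
blocks_t1 = distribute(data, 4)
saved = checkpoint(blocks_t1)
blocks_t2 = restart(saved, 3)
```

In this toy model the block sizes change between the two configurations, but concatenating the restarted blocks gives back the original array. A real message-passing system would also have to capture control state and perform the redistribution with parallel I/O or communication, which this sketch does not attempt.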

CiteSeerX

Crossref