A Flexible Checkpoint/Restart Model in Distributed Systems

By Mohamed-Slim Bouguerra, Thierry Gautier, Denis Trystram, Jean-Marc Vincent, Saint Martin and Saint Ismier

Abstract

Large-scale applications running on new computing platforms with thousands of processors face reliability problems: the failure of a single processor causes the entire execution to fail. Most existing approaches to guaranteeing reliable executions are based on fault tolerance mechanisms, and coordinated checkpointing is one of the most popular techniques for dealing with failures on such platforms. This work presents a new model of a coordinated Checkpoint/Restart mechanism for several types of computing platforms. The model is parametrized by the process failure distribution, the cost of saving a globally consistent state of the processes, and the number of computational resources. Through a mathematical analysis of reliability, we apply this new model to compute the optimal interval between checkpoint times in order to minimize the average completion time. The model's independence from the type of failure law makes it completely flexible. We show that such a model may be used to reduce the checkpoint rate by up to 20% in some cases and the total overhead by up to a factor of 4 in some cases. Finally, we report some experiments based on simulations for random failure distributions corresponding to the two most popular laws, namely the Poisson process and the Weibull law.

Keywords: Fault tolerance, Reliability modeling, Checkpointing.
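To make the trade-off behind the "optimal interval between checkpoint times" concrete, here is a minimal Python sketch. It does not reproduce the model of this paper, which is not given on this page. Instead it uses Young's classical first-order approximation for exponential failures, sqrt(2 * C * MTBF), together with a small Monte Carlo simulation in which the failure law is a pluggable function. All names (`young_interval`, `simulate_completion`, `mean_completion`) and all numeric parameters are hypothetical, and the simulation assumes that failures form a renewal process at the platform level and that no failure occurs during a restart.

```python
import math
import random


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the checkpoint interval
    for exponentially distributed failures (not the paper's model)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def simulate_completion(work, interval, checkpoint_cost, restart_cost,
                        draw_ttf, rng):
    """Simulate one execution and return its total wall-clock time.

    draw_ttf(rng) returns a time to the next platform failure; it is
    redrawn after each failure (renewal assumption).
    """
    done = 0.0    # work already protected by a checkpoint
    clock = 0.0   # elapsed wall-clock time
    next_failure = draw_ttf(rng)
    while done < work:
        segment = min(interval, work - done)
        # A checkpoint follows every segment except the last one.
        is_last = done + segment >= work
        cost = segment + (0.0 if is_last else checkpoint_cost)
        if clock + cost <= next_failure:
            clock += cost
            done += segment
        else:
            # Failure: the current segment is lost, pay the restart cost.
            clock = next_failure + restart_cost
            next_failure = clock + draw_ttf(rng)
    return clock


def mean_completion(work, interval, checkpoint_cost, restart_cost,
                    draw_ttf, runs=200, seed=0):
    """Average completion time over several simulated executions."""
    rng = random.Random(seed)
    total = sum(
        simulate_completion(work, interval, checkpoint_cost,
                            restart_cost, draw_ttf, rng)
        for _ in range(runs)
    )
    return total / runs


if __name__ == "__main__":
    mtbf, cost, restart, work = 2000.0, 10.0, 5.0, 10000.0
    laws = {
        # Poisson failure process: exponential inter-failure times.
        "exponential": lambda rng: rng.expovariate(1.0 / mtbf),
        # Weibull law with shape 0.7 (illustrative value only).
        "weibull": lambda rng: rng.weibullvariate(mtbf, 0.7),
    }
    candidates = [50.0, young_interval(cost, mtbf), 800.0]
    for name, law in laws.items():
        for interval in candidates:
            avg = mean_completion(work, interval, cost, restart, law)
            print(f"{name:12s} interval={interval:7.1f} mean={avg:10.1f}")
```

Because the failure law is passed in as a function, the same simulation can be rerun with any distribution, which mirrors in a very loose way the abstract's point that the interval should not be tied to one particular failure law.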

Year: 2011
OAI identifier: oai:CiteSeerX.psu:10.1.1.186.8522
Provided by: CiteSeerX
Download PDF:
Sorry, we are unable to provide the full text but you may find it at the following location(s):
  • http://citeseerx.ist.psu.edu/v... (external link)
  • http://graal.ens-lyon.fr/%7Elm... (external link)