Cooperative checkpointing: a robust approach to large-scale systems reliability

By Adam J. Oliner, Larry Rudolph and Ramendra K. Sahoo

Abstract

Cooperative checkpointing increases the performance and robustness of a system by allowing checkpoints requested by applications to be dynamically skipped at runtime. A robust system must be more than merely resilient to failures; it must be adaptable and flexible in the face of new and evolving challenges. A simulation-based experimental analysis using both probabilistic and harvested failure distributions reveals that cooperative checkpointing enables an application to make progress under a wide variety of failure distributions that periodic checkpointing lacks the flexibility to handle. Cooperative checkpointing can be easily implemented on top of existing application-initiated checkpointing mechanisms and may be used to enhance other reliability techniques like QoS guarantees and fault-aware job scheduling. The simulations also support a number of theoretical predictions related to cooperative checkpointing, including the non-competitiveness of periodic checkpointing.
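The central idea in the abstract, that an application requests checkpoints and the runtime decides whether each one is performed or skipped, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Gatekeeper` class, its fields, and the expected-loss heuristic are hypothetical and are not taken from the paper, which may use a different decision rule.

```python
from dataclasses import dataclass


@dataclass
class Gatekeeper:
    """Hypothetical runtime component that grants or skips checkpoint requests."""

    checkpoint_cost: float        # time needed to write one checkpoint (assumed known)
    last_checkpoint: float = 0.0  # time of the last checkpoint actually performed

    def request_checkpoint(self, now: float, p_fail: float) -> bool:
        """Return True if the requested checkpoint is performed, False if skipped.

        `p_fail` is an assumed estimate of the probability that a failure
        occurs before the application's next checkpoint request.
        """
        # Work at risk: everything computed since the last performed checkpoint.
        at_risk = now - self.last_checkpoint
        # Illustrative heuristic only: perform the checkpoint when the expected
        # loss from a failure is at least the cost of writing the checkpoint.
        if p_fail * at_risk >= self.checkpoint_cost:
            self.last_checkpoint = now
            return True
        return False


# The application requests checkpoints at its own pace; the gatekeeper
# may skip those that are not worth their cost under the current estimate.
gate = Gatekeeper(checkpoint_cost=10.0)
decisions = [
    gate.request_checkpoint(now=100.0, p_fail=0.01),  # little risk: skipped
    gate.request_checkpoint(now=200.0, p_fail=0.5),   # high risk: performed
    gate.request_checkpoint(now=210.0, p_fail=0.5),   # little work at risk: skipped
]
```

In this sketch, periodic checkpointing corresponds to a gatekeeper that always returns True; the flexibility described in the abstract comes from allowing the answer to depend on runtime information.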

Topics: fault-tolerance; C.4 [Performance of Systems]: Fault Tolerance; General Terms: Algorithms, Experimentation, Measurement, Reliability; Keywords: Cooperative checkpointing, RAS, high-performance computing, supercomputing, parallel computing, simulations
Publisher: ACM
Year: 2006
OAI identifier: oai:CiteSeerX.psu:10.1.1.353.2307
Provided by: CiteSeerX
Sorry, we are unable to provide the full text but you may find it at the following location(s):
  • http://citeseerx.ist.psu.edu/v... (external link)
  • http://adam.oliner.net/files/o... (external link)