
Cooperative Checkpointing for Supercomputing Systems

By Adam Jamison Oliner

Abstract

A system-level checkpointing mechanism, with global knowledge of the state and health of the machine, can improve performance and reliability by dynamically deciding when to skip checkpoint requests made by applications. This thesis presents such a technique, called cooperative checkpointing, and models its behavior as an online algorithm. Where C is the checkpoint overhead and I is the request interval, a worst-case analysis proves a lower bound of (2 + ⌊C/I⌋)-competitiveness for deterministic cooperative checkpointing algorithms, and proves that a number of simple algorithms meet this bound. Using an expected-case analysis, this thesis proves that an optimal periodic checkpointing algorithm that assumes an exponential failure distribution may be arbitrarily bad relative to an optimal cooperative checkpointing algorithm that permits a general failure distribution. Calculations suggest that, under realistic conditions, an application using cooperative checkpointing may mak
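
As a small worked illustration of the worst-case bound, reading it as 2 + ⌊C/I⌋ with C the checkpoint overhead and I the request interval, the sketch below evaluates the expression for a few values. The function name, the argument names, and the sample values are hypothetical and chosen only for illustration; nothing here is taken from the thesis beyond the expression itself.

```python
import math


def competitive_lower_bound(checkpoint_overhead: float, request_interval: float) -> int:
    """Evaluate 2 + floor(C / I) for overhead C and request interval I.

    Both arguments must use the same time unit. This is only arithmetic on
    the expression quoted in the abstract, not a checkpointing algorithm.
    """
    if checkpoint_overhead < 0:
        raise ValueError("checkpoint overhead C must be non-negative")
    if request_interval <= 0:
        raise ValueError("request interval I must be positive")
    return 2 + math.floor(checkpoint_overhead / request_interval)


if __name__ == "__main__":
    # Hypothetical (C, I) pairs, in arbitrary but matching time units.
    for overhead, interval in [(1, 10), (6, 4), (10, 2)]:
        bound = competitive_lower_bound(overhead, interval)
        print(f"C={overhead}, I={interval}: bound = {bound}")
```

When the overhead is smaller than the request interval, the floor term vanishes and the expression evaluates to 2; it grows by one for each additional whole interval that a single checkpoint spans.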

Year: 2005
OAI identifier: oai:CiteSeerX.psu:10.1.1.353.2491
Provided by: CiteSeerX