Consistent Checkpointing in Message Passing Distributed Systems

Abstract

A global checkpoint of a distributed computation is a set of local checkpoints (local states), one per process. Determining consistent global checkpoints is an important problem for many distributed applications (e.g., fault tolerance, distributed debugging, property detection). This paper concentrates on such determinations. A precedence relation on checkpoint intervals (an interval is the set of events produced by a process between two successive local checkpoints) is introduced and analyzed. It is shown that a local checkpoint is useless (i.e., it cannot participate in any consistent global checkpoint) iff some pattern appears in this precedence relation. Then an adaptive checkpointing algorithm is introduced. Assuming processes take local checkpoints independently, this algorithm requires them to take additional checkpoints (as few as possible) so that none of the previously taken checkpoints is useless. It is based on the prevention of the previously mentioned pattern.
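
As an illustration of what "consistent" means here, the following minimal sketch checks the standard no-orphan-message condition: a global checkpoint is inconsistent if some message is recorded as received by one local checkpoint but not yet sent according to the sender's local checkpoint. It does not reproduce the paper's precedence relation on checkpoint intervals or its adaptive algorithm. The names `Message`, `is_consistent`, and the event-index encoding are assumptions made for this example only.

```python
from typing import List, NamedTuple


class Message(NamedTuple):
    """Hypothetical record of one message, using local event indices."""
    sender: int     # id of the sending process
    send_pos: int   # index of the send event in the sender's local history
    receiver: int   # id of the receiving process
    recv_pos: int   # index of the receive event in the receiver's local history


def is_consistent(cut: List[int], messages: List[Message]) -> bool:
    """Return True if the global checkpoint has no orphan message.

    cut[p] is the number of local events of process p that precede its
    local checkpoint (assumed encoding, not taken from the paper).
    """
    for m in messages:
        received_inside = m.recv_pos < cut[m.receiver]
        sent_inside = m.send_pos < cut[m.sender]
        if received_inside and not sent_inside:
            return False  # orphan: receipt recorded, send not recorded
    return True


# Process 0 sends at its event 2; process 1 receives at its event 1.
messages = [Message(sender=0, send_pos=2, receiver=1, recv_pos=1)]

ok = is_consistent([3, 2], messages)      # send and receive both recorded
orphan = is_consistent([2, 2], messages)  # receive recorded, send is not
```

With this encoding, `ok` is `True` and `orphan` is `False`. A message that is sent before the sender's checkpoint but not yet received (in transit) does not violate consistency, which is why only the received-but-not-sent case is rejected.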


This paper was published in CiteSeerX.
