thesis

Dynamic interval determination for pagelevel incremental checkpointing

Abstract

A distributed system is composed of multiple independent machines that communicate using messages. Faults are common events in a large distributed system. Without fault tolerance mechanisms, an application running on such a system has to be restarted from scratch if a fault occurs in the middle of its execution, resulting in the loss of useful computation. Checkpoint and recovery mechanisms are used in distributed systems to provide fault tolerance for such applications. A checkpoint of a process is the information about the state of that process at some instant of time. A checkpoint of a distributed application is a set of checkpoints, one from each of its processes, satisfying certain constraints. If a fault occurs, the application is restarted from an earlier checkpoint instead of from scratch, which saves some of the computation. Several checkpoint and recovery protocols have been proposed in the literature. The performance of a checkpoint and recovery protocol depends on the amount of computation it can save relative to the overhead it incurs, so a checkpointing protocol should add little overhead to the system. Checkpointing overhead is mainly due to coordination among processes and to saving their context to stable storage. In coordinated checkpointing, the process that initiates a checkpoint coordinates with the other processes through messages every time a checkpoint is taken. The more messages are used for coordination, the more network traffic increases, which is undesirable. It is therefore better to reduce the number of messages needed for checkpoint coordination. In this thesis, we present an algorithm that reduces the number of messages per process needed for checkpoint coordination, thereby decreasing network traffic. The total running time of an application depends on its execution time and on the checkpointing overhead it incurs, so this overhead should be minimized. Checkpointing overhead is the combination of context-saving overhead and coordination overhead. Storing the context of the application on stable storage also increases the overhead. In periodic-interval checkpointing, processes sometimes take checkpoints even when they are of little use. These unnecessary checkpoints increase the application's running time. We propose an algorithm that determines the checkpointing interval dynamically, based on expected recovery time, to avoid unnecessary checkpoints. By eliminating unnecessary checkpoints, we can significantly reduce the running time of a process.
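The idea of choosing the checkpointing interval from the expected recovery time can be illustrated with a minimal sketch. This is not the algorithm of the thesis, which the abstract does not spell out. It assumes a hypothetical rule: at each candidate checkpoint, compare the expected recovery time (work since the last checkpoint, weighted by a first-order failure probability) with the cost of saving the dirty pages, and skip the checkpoint when saving would cost more. The names `CheckpointPolicy`, `failure_rate`, and `page_save_cost` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class CheckpointPolicy:
    """Hypothetical decision rule, not the thesis's algorithm."""
    failure_rate: float    # assumed failures per second
    page_save_cost: float  # assumed seconds to write one dirty page

    def expected_recovery_time(self, elapsed: float) -> float:
        # Work lost on a failure is the time since the last checkpoint.
        # Weight it by a first-order failure probability for that window,
        # capped at 1.0.
        p_fail = min(1.0, self.failure_rate * elapsed)
        return p_fail * elapsed

    def checkpoint_cost(self, dirty_pages: int) -> float:
        # Page-level incremental checkpointing only saves modified pages.
        return dirty_pages * self.page_save_cost

    def should_checkpoint(self, elapsed: float, dirty_pages: int) -> bool:
        # Checkpoint only when the expected loss exceeds the saving cost.
        return self.expected_recovery_time(elapsed) > self.checkpoint_cost(dirty_pages)


if __name__ == "__main__":
    policy = CheckpointPolicy(failure_rate=0.001, page_save_cost=0.01)
    for elapsed in (10.0, 50.0, 100.0):
        print(elapsed, policy.should_checkpoint(elapsed, dirty_pages=100))
```

Under this rule the interval is not fixed: a process with few dirty pages or a long stretch of unsaved work checkpoints sooner, while one with many dirty pages and little work at risk defers the checkpoint.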
