2 research outputs found

    Design and Analysis of a Hardware-Assisted Checkpointing and Recovery Scheme for Distributed Applications

    An integrated checkpointing and recovery scheme that exploits the low latency and high coverage of a concurrent error detection scheme is presented. Message dependency, which is the main source of multi-step rollback in distributed systems, is minimized by using a new message validation technique derived from the notion of concurrent error detection. The concept of a new global state matrix is introduced to track error checking and message dependency in a distributed system and to assist in recovery. The analytical model, algorithms, and data structures to support an easy implementation of the new scheme are presented. The completeness and correctness of the algorithms are proved. A number of scenarios and illustrations that give the details of the analytical model are presented. The benefits of the integrated checkpointing scheme are quantified by means of simulation using an object-oriented test framework.

    Design and Analysis of a Hardware-Assisted Checkpointing and Recovery Scheme for Distributed Applications

    A checkpointing and recovery scheme that exploits the low latency and high coverage of a hardware error detection scheme is presented. Message dependency, which is the main source of multi-step rollback in distributed systems, is minimized by using a new message validation technique derived from hardware-assisted error detection. The main contribution of this paper is the development of an analytical model to establish the completeness and correctness of the new scheme. A novel concept, the global state matrix, is defined to keep track of the global state in a distributed system and to assist in recovery. An illustration is given to show the distinction between the conventional and the new recovery schemes.