23 research outputs found

    Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing

    Message logging and checkpointing can provide fault tolerance in distributed systems in which all process communication is through messages. This paper presents a general model for reasoning about recovery in these systems. Using this model, we prove that the set of recoverable system states that have occurred during any single execution of the system forms a lattice, and that therefore, there is always a unique maximum recoverable system state, which never decreases. Based on this model, we present an algorithm for determining this maximum recoverable state, and prove its correctness. Our algorithm utilizes all logged messages and checkpoints, and thus always finds the maximum recoverable state possible. Previous recovery methods using optimistic message logging and checkpointing have not considered the existing checkpoints, and thus may not find this maximum state. Furthermore, by utilizing the checkpoints, some messages received by a process before it was checkpointed may not need to be logged. Using our algorithm also adds less communication overhead to the system than do previous methods. Our model and algorithm can be used with any message logging protocol, whether pessimistic or optimistic, but their full generality is only required with optimistic logging protocols.

    Operative Merest-undertaking Impeccable Reclamation Line Accretion Ordering for Deterministic Mobile Distributed Computing Systems

    Impeccable-RL-accretion (Impeccable Reclamation Line accretion) is one of the ordinarily familiarized approaches to present failing resilience in Distributed Computing setup (DCS) so that the setup can operate even if one or more components have abdicated. However, Mobile DCSs are constrained by small transmittal potentiality, suppleness, and dearth of stabilized repository, recurrent disruptions and imperfect battery life. From this time Impeccable-RL-accretion orderings which have reduced reestablishment-dots are favored in mobile environments. In this paper, we contemplate a merest-undertaking synchronic ordering for Impeccable-RL-accretion for mobile DCS. We eliminate inoperable reestablishment-dots as well as stalling of undertakings amidst reestablishment-dots at the striving of registering contra-dispatches of very few dispatches amidst Impeccable-RL-accretion. We also organize an effort to subside the depletion of Impeccable-RL-accretion work when any undertaking collapses to stockpile its reestablishment-dot in a founding. In this mode, we handle excessive failings amidst Impeccable-RL-accretion. We organize registering of contra-dispatches of very few dispatches only amidst Impeccable-RL-accretion. We also strive to subside depletion of Impeccable-RL-accretion work.

    About logical clocks for distributed systems

    Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit

    Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and fast output commit, and the advantage of optimistic message logging, namely low failure-free overhead. These advantages come at the expense of a complex recovery scheme.

    Optimal Message Log Reclamation for Independent Checkpointing

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory. National Aeronautics and Space Administration / NASA NAG 1-613; Department of the Navy managed by the Office of the Chief of Naval Research / N00014-91-J-128

    Lazy Checkpoint Coordination for Bounding Rollback Propagation

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory. National Aeronautics and Space Administration / NASA NAG 1-613; Department of the Navy managed by the Office of the Chief of Naval Research / N00014-91-J-128

    Transparent Adaptive Parallelism on NOWs using OpenMP

    We present a system that allows OpenMP programs to execute on a network of workstations with a variable number of nodes. The ability to adapt to a variable number of nodes allows a program to take advantage of additional nodes that become available after it starts execution, or to gracefully scale down when the number of available nodes is reduced. We demonstrate that the cost of adaptation is modest; the system allows a program to adapt at a moderate rate without much performance loss. Two ideas underlie the efficiency of our design. First, we recognize that OpenMP programs exhibit convenient adaptation points during their execution, points at which the cost of adaptation can be much reduced. Second, by allowing a process a certain grace period before it must leave a node, we ensure that most adaptations can occur at these adaptation points, and thus at low cost. Migration of a process, a much more expensive method for providing adaptivity, is used only as a back-up solution, when the process cannot reach an adaptation point within the grace period. Our implementation consists of an OpenMP pre-processor that generates TreadMarks distributed shared memory (DSM) programs, and a version of TreadMarks modified to adapt to a variable number of nodes. Using a DSM as the underlying substrate facilitates the data (re-)distribution necessary after an adaptation.

    Performance comparison of hierarchical checkpoint protocols grid computing

    Grid infrastructure is a large set of geographically distributed nodes connected by a communication network. In this context, fault tolerance is a necessity imposed by the distribution, which poses a number of problems related to the heterogeneity of hardware, operating systems, networks, middleware, and applications, the dynamic resources, the scalability, the lack of common memory, the lack of a common clock, and the asynchronous communication between processes. To improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resistance to these faults of the system. Fault tolerance is intended to allow the system to provide service as specified in spite of occurrences of faults. It appears as an indispensable element in distributed systems. To meet this need, several techniques have been proposed in the literature. We study the protocols based on rollback recovery. These protocols are classified into two categories: coordinated checkpointing and rollback protocols, and log-based independent checkpointing protocols, or message logging protocols. However, the performance of a protocol depends on the characteristics of the system, the network, and the applications running. Faced with the constraints of large-scale environments, many algorithms from the literature have proved inadequate. Given an application environment and a system, it is not easy to identify the recovery protocol that is most appropriate for a cluster or hierarchical environment, like grid computing. While some protocols have been used successfully at small scale, they are not suitable for use at large scale. Hence there is a need to implement these protocols in a hierarchical fashion to compare their performance in grid computing. In this paper, we propose hierarchical versions of four well-known protocols. We have implemented these protocols and compared their performance in clusters and grid computing using the OMNeT++ simulator.

    Security and Privacy for Partial Order Time

    Reducing Space Overhead for Independent Checkpointing

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory. National Aeronautics and Space Administration / NASA NAG 1-613; Department of the Navy managed by the Office of the Chief of Naval Research / N00014-91-J-128