37 research outputs found

    Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems

    Get PDF
    Checkpointing and rollback recovery are techniques that can provide efficient recovery from transient process failures. In a message-passing system, the rollback of a message sender may cause the rollback of the corresponding receiver, and the system needs to roll back to a consistent set of checkpoints called recovery line. If the processes are allowed to take uncoordinated checkpoints, the above rollback propagation may result in the domino effect which prevents recovery line progression. Traditionally, only obsolete checkpoints before the global recovery line can be discarded, and the necessary and sufficient condition for identifying all garbage checkpoints has remained an open problem. A necessary and sufficient condition for achieving optimal garbage collection is derived and it is proved that the number of useful checkpoints is bounded by N(N+1)/2, where N is the number of processes. The approach is based on the maximum-sized antichain model of consistent global checkpoints and the technique of recovery line transformation and decomposition. It is also shown that, for systems requiring message logging to record in-transit messages, the same approach can be used to achieve optimal message log reclamation. As a final topic, a unifying framework is described by considering checkpoint coordination and exploiting piecewise determinism as mechanisms for bounding rollback propagation, and the applicability of the optimal garbage collection algorithm to domino-free recovery protocols is demonstrated

    A survey of checkpointing algorithms for parallel and distributed computers

    Get PDF
    Checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving the status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area by relaxing the assumptions made in this paper and by extending it to minimise the overheads of coordination and context saving. Checkpointing for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They however also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is however felt that in development of parallel programs the user has to do a fair amount of work in distributing tasks and this information can be effectively used to simplify checkpointing and rollback recovery

    A Survey of Checkpointing Algorithms in Mobile Ad Hoc Network

    Get PDF
    Checkpoint is defined as a fault tolerant technique that is a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. If there is a failure, computation may be restarted from the current checkpoint instead of repeating the computation from beginning. Checkpoint based rollback recovery is one of the widely used technique used in various areas like scientific computing, database, telecommunication and critical applications in distributed and mobile ad hoc network. The mobile ad hoc network architecture is one consisting of a set of self configure mobile hosts capable of communicating with each other without the assistance of base stations. The main problems of this environment are insufficient power and limited storage capacity, so the checkpointing is major challenge in mobile ad hoc network. This paper presents the review of the algorithms, which have been reported for checkpointing approaches in mobile ad hoc network

    Rollback recovery with low overhead for fault tolerance in mobile ad hoc networks

    Get PDF
    AbstractMobile ad hoc networks (MANETs) have significantly enhanced the wireless networks by eliminating the need for any fixed infrastructure. Hence, these are increasingly being used for expanding the computing capacity of existing networks or for implementation of autonomous mobile computing Grids. However, the fragile nature of MANETs makes the constituent nodes susceptible to failures and the computing potential of these networks can be utilized only if they are fault tolerant. The technique of checkpointing based rollback recovery has been used effectively for fault tolerance in static and cellular mobile systems; yet, the implementation of existing protocols for MANETs is not straightforward. The paper presents a novel rollback recovery protocol for handling the failures of mobile nodes in a MANET using checkpointing and sender based message logging. The proposed protocol utilizes the routing protocol existing in the network for implementing a low overhead recovery mechanism. The presented recovery procedure at a node is completely domino-free and asynchronous. The protocol is resilient to the dynamic characteristics of the MANET; allowing a distributed application to be executed independently without access to any wired Grid or cellular network access points. We also present an algorithm to record a consistent global snapshot of the MANET

    Visões progressivas de computações distribuidas

    Get PDF
    Orientador : Luiz Eduardo BuzatoTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Um checkpoint é um estado selecionado por um processo durante a sua execução. Um checkpoint global é composto por um checkpoint de cada processo e é consistente se representa urna foto­grafia da computação que poderia ter sido capturada por um observador externo. Soluções para vários problemas em sistemas distribuídos necessitam de uma seqüência de checkpoints globais consistentes que descreva o progresso de urna computação distribuída. Corno primeira contri­buição desta tese, apresentamos um conjunto de algoritmos para a construção destas seqüências, denominadas visões progressivas. Outras contribuições provaram que certas suposições feitas na literatura eram falsas utilizando o argumento de que algumas propriedades precisam ser válidas ao longo de todo o progresso da computação. Durante algumas computações distribuídas, todas as dependências de retrocesso entre check­points podem ser rastreadas em tempo de execução. Esta propriedade é garantida através da indução de checkpoints imediatamente antes da formação de um padrão de mensagens que poderia dar origem a urna dependência de retrocesso não rastreável. Estudos teóricos e de simu­lação indicam que, na maioria das vezes, quanto mais restrito o padrão de mensagens, menor o número de checkpoints induzidos. Acreditava-se que a caracterização minimal para a obtenção desta propriedade estava estabelecida e que um protocolo baseado nesta caracterização precisa­ria da manutenção e propagação de informações de controle com complexidade O(n2), onde n é o número de processos na computação. A complexidade quadrática tornava o protocolo base­ado na caracterização mimimal menos interessante que protocolos baseados em caracterizações maiores, mas com complexidade linear.A segunda contribuição desta tese é uma prova de que a caracterização considerada minimal podia ser eduzida, embora a complexidade requerida por um protocolo baseado nesta nova caracterização minimal continuasse indicando ser quadrática. A terceira contribuição desta tese é a proposta de um pequeno relaxamento na caracterização minimal que propicia a implementação de um protocolo com complexidade linear e desempenho semelhante à solução quadrática. Como última contribuição, através de um estudo detalhado das variações da informação de controle durante o progresso de urna computação, propomos um protocolo que implementa exatamente a caracterização minimal, mas com complexidade linearAbstract: A checkpoint is a state selected by a process during its execution. A global checkpoint is composed of one checkpoint from each process and it is consistent if it represents a snapshot of the computation that could have been taken by an external observer. The solution to many problems in distributed systems requires a sequence of consistent global checkpoints that describes the progress of a distributed computation. As the first contribution of this thesis, we present a set of algorithms to the construction of these sequences, called progressive views. Additionally, the analysis of properties during the progress of a distributed computation allowed us to verify that some assumptions made in the literature were false. Some checkpoint patterns present only on-line trackable rollback-dependencies among check­points. This property is enforced by taking a checkpoint immediately before the formation of a message pattern that can produce a non-trackable rollback-dependency. Theoretical and simula­tion studies have shown that, most often, the more restricted the pattern, the more efficient the protocol. The minimal characterization was supposed to be known and its implementation was supposed to require the processes of the computation to maintain and propagate O(n2) control information, where n is the number of processes in the computation. The quadratic complexity makes the protocol based on the minimal characterization less interesting than protocols based on wider characterizations, but with a linear complexity. The second contribution of this thesis is a proof that the characterization that was supposed to be minimal could be reduced. However, the complexity required by a protocol based on the new minimal characterization seemed to be also quadratic. The third contribution of this thesis is a protocol based on a slightly weaker condition than the minimal characterization, but with linear complexity and performance similar to the quadratic solution. As the last contribution, through a detailed analysis of the control information computed and transmitted during the progress of distributed computations, we have proposed a protocol that implements exactly the minimal characterization, but with a linear complexityDoutoradoDoutor em Ciência da Computaçã

    Trust Based Node Recovery and Checkpointing Techniques in Manets

    Get PDF
    Checkpointing is a process of determining the vulnerability of node in case of any attack occurs in the network. It depends on the cluster change count value of the node. If the measure of the hop exchanges required to reach the destination node from the current node, is above the previously specified value, the node under consideration is unsafe and safe points must be implemented in between the path and different subnetworks within that network must have their own implemented safe points. The message must commits to the safe points as it reaches the respective sub networks. The message in the networks evolve over the certain subnetworks. The each subnetwork has the checkpoint node, that serves the purpose for communication between different subnetworks, or between the hops in different subnetworks. This phenomenon supports the system efficiency and preserves the robustness. The process retrieval methods, therefore, should be implemented with the use of the safe points to prevent system degradation. In this research paper, an efficient recovery protocol is designed for distributed transactions in MANETs so that failures can be minimised. Dynamic analysis has also been done and it is compared with other existing protocols to validate the attained result

    Study and Design of Global Snapshot Compilation Protocols for Rollback-Recovery in Mobile Distributed System

    Get PDF
    Checkpoint is characterized as an assigned place in a program at which ordinary process is intruded on particularly to protect the status data important to permit resumption of handling at a later time. A conveyed framework is an accumulation of free elements that participate to tackle an issue that can't be separately comprehended. A versatile figuring framework is a dispersed framework where some of procedures are running on portable hosts (MHs). The presence of versatile hubs in an appropriated framework presents new issues that need legitimate dealing with while outlining a checkpointing calculation for such frameworks. These issues are portability, detachments, limited power source, helpless against physical harm, absence of stable stockpiling and so forth. As of late, more consideration has been paid to giving checkpointing conventions to portable frameworks. Least process composed checkpointing is an alluring way to deal with present adaptation to internal failure in portable appropriated frameworks straightforwardly. This approach is without domino, requires at most two recovery_points of a procedure on stable stockpiling, and powers just a base number of procedures to recovery_point. In any case, it requires additional synchronization messages, hindering of the basic calculation or taking some futile recovery_points. In this paper, we complete the writing review of some Minimum-process Coordinated Checkpointing Algorithms for Mobile Computing System
    corecore