Search CORE

673 research outputs found

Rollback recovery with low overhead for fault tolerance in mobile ad hoc networks

Author: Jaggi Parmeet Kaur
Singh Awadhesh Kumar
Publication venue: The Authors. Production and hosting by Elsevier B.V.
Publication date: 01/10/2015
Field of study

AbstractMobile ad hoc networks (MANETs) have significantly enhanced the wireless networks by eliminating the need for any fixed infrastructure. Hence, these are increasingly being used for expanding the computing capacity of existing networks or for implementation of autonomous mobile computing Grids. However, the fragile nature of MANETs makes the constituent nodes susceptible to failures and the computing potential of these networks can be utilized only if they are fault tolerant. The technique of checkpointing based rollback recovery has been used effectively for fault tolerance in static and cellular mobile systems; yet, the implementation of existing protocols for MANETs is not straightforward. The paper presents a novel rollback recovery protocol for handling the failures of mobile nodes in a MANET using checkpointing and sender based message logging. The proposed protocol utilizes the routing protocol existing in the network for implementing a low overhead recovery mechanism. The presented recovery procedure at a node is completely domino-free and asynchronous. The protocol is resilient to the dynamic characteristics of the MANET; allowing a distributed application to be executed independently without access to any wired Grid or cellular network access points. We also present an algorithm to record a consistent global snapshot of the MANET

Elsevier - Publisher Connector

Directory of Open Access Journals

CHECKPOINTING AND RECOVERY IN DISTRIBUTED AND DATABASE SYSTEMS

Author: Wu Jiang
Publication venue: UKnowledge
Publication date: 01/01/2011
Field of study

A transaction-consistent global checkpoint of a database records a state of the database which reflects the effect of only completed transactions and not the re- sults of any partially executed transactions. This thesis establishes the necessary and sufficient conditions for a checkpoint of a data item (or the checkpoints of a set of data items) to be part of a transaction-consistent global checkpoint of the database. This result would be useful for constructing transaction-consistent global checkpoints incrementally from the checkpoints of each individual data item of a database. By applying this condition, we can start from any useful checkpoint of any data item and then incrementally add checkpoints of other data items until we get a transaction- consistent global checkpoint of the database. This result can also help in designing non-intrusive checkpointing protocols for database systems. Based on the intuition gained from the development of the necessary and sufficient conditions, we also de- veloped a non-intrusive low-overhead checkpointing protocol for distributed database systems. Checkpointing and rollback recovery are also established techniques for achiev- ing fault-tolerance in distributed systems. Communication-induced checkpointing algorithms allow processes involved in a distributed computation take checkpoints independently while at the same time force processes to take additional checkpoints to make each checkpoint to be part of a consistent global checkpoint. This thesis develops a low-overhead communication-induced checkpointing protocol and presents a performance evaluation of the protocol

University of Kentucky

Concurrent Checkpointing and Recovery in Distributed Systems

Author: Bhargava Bharat
Leu Pei-Jyun
Publication venue: 'Purdue University (bepress)'
Publication date: 11/06/1987
Field of study

Purdue E-Pubs

Fault tolerance at system level based on RADIC architecture

Author: Castro León Marcela
Luque Emilio
Meyer Hugo Daniel
Rexachs del Rosario Dolores Isabel
Publication venue: 'Elsevier BV'
Publication date: 01/01/2015
Field of study

The increasing failure rate in High Performance Computing encourages the investigation of fault tolerance mechanisms to guarantee the execution of an application in spite of node faults. This paper presents an automatic and scalable fault tolerant model designed to be transparent for applications and for message passing libraries. The model consists of detecting failures in the communication socket caused by a faulty node. In those cases, the affected processes are recovered in a healthy node and the connections are reestablished without losing data. The Redundant Array of Distributed Independent Controllers architecture proposes a decentralized model for all the tasks required in a fault tolerance system: protection, detection, recovery and masking. Decentralized algorithms allow the application to scale, which is a key property for current HPC system. Three different rollback recovery protocols are defined and discussed with the aim of offering alternatives to reduce overhead when multicore systems are used. A prototype has been implemented to carry out an exhaustive experimental evaluation through Master/Worker and Single Program Multiple Data execution models. Multiple workloads and an increasing number of processes have been taken into account to compare the above mentioned protocols. The executions take place in two multicore Linux clusters with different socket communications libraries

Elsevier - Publisher Connector

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Diposit Digital de Documents de la UAB

CIC : an integrated approach to checkpointing in mobile agent systems

Author: Cao J
Wu W
Yang J
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 11/12/2014
Field of study

Internet and Mobile Computing Lab (in Department of Computing)Refereed conference paper2006-2007 > Academic research: refereed > Refereed conference paperVersion of RecordPublishe

PolyU Institutional Repository

An Infrastructure-less Vehicle Counting without Disruption

Author: Jiang Zhen
Sabatino Paul
Tsan Jennifer
Wu Jie
Publication venue: Digital Commons @ West Chester University
Publication date: 01/01/2014
Field of study

Crossref

Digital Commons @ West Chester University

Evaluation of Communication Induced Checkpointing Approaches for Reconfiguration-Based Fault-Tolerance in Embedded Systems

Author: . Belal H. Sababha
A. Rawashdeh Osamah
Publication venue: GSTF Journal on Computing (JoC)
Publication date: 16/09/2014
Field of study

Reconfiguration-Based Fault-Tolerance is an approach to developing dependable safety-critical embedded applications, where redundant active or standby resources are used to cope with faults through a system reconfiguration at run-time. Compared to traditional hardware and software redundancy, it is a promising technique that may achieve dependability with a significant reduction in cost, size, weight, and power requirements. Reconfiguration necessitates using proper checkpointing protocols to support state reservation to ensure correct task restarts after a system reconfiguration. Communication Induced Checkpointing (CIC) protocols are well developed and understood for large parallel and information systems, but not much has been done for resource limited embedded systems. This paper implements four common CIC protocols in a resource constrained distributed embedded system with a Controller Area Network (CAN) backbone. An example feedback control system implementation is used for a case study. The four implemented protocols are described and performances are contrasted. The paper compares the protocols in terms of network bandwidth consumptions, CPU usages, checkpointing times, and checkpoint sizes in additional to the traditional measures of forced to local checkpoint rations and total number of checkpoints

GSTF Digital Library (GSTF-DL): Open Journal Systems (Global Science and Technology Forum)

ALGORITHMS FOR FAULT TOLERANCE IN DISTRIBUTED SYSTEMS AND ROUTING IN AD HOC NETWORKS

Author: Jiang Qiangfeng
Publication venue: UKnowledge
Publication date: 01/01/2013
Field of study

Checkpointing and rollback recovery are well-known techniques for coping with failures in distributed systems. Future generation Supercomputers will be message passing distributed systems consisting of millions of processors. As the number of processors grow, failure rate also grows. Thus, designing efficient checkpointing and recovery algorithms for coping with failures in such large systems is important for these systems to be fully utilized. We presented a novel communication-induced checkpointing algorithm which helps in reducing contention for accessing stable storage to store checkpoints. Under our algorithm, a process involved in a distributed computation can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes involved in the computation come to know about the consistent global checkpoint initiation through information piggy-backed with the application messages or limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message. The tentative checkpoints taken can be flushed to stable storage when there is no contention for accessing stable storage. The tentative checkpoints together with the message logs stored in the stable storage form a consistent global checkpoint. Ad hoc networks consist of a set of nodes that can form a network for communication with each other without the aid of any infrastructure or human intervention. Nodes are energy-constrained and hence routing algorithm designed for these networks should take this into consideration. We proposed two routing protocols for mobile ad hoc networks which prevent nodes from broadcasting route requests unnecessarily during the route discovery phase and hence conserve energy and prevent contention in the network. One is called Triangle Based Routing (TBR) protocol. The other routing protocol we designed is called Routing Protocol with Selective Forwarding (RPSF). Both of the routing protocols greatly reduce the number of control packets which are needed to establish routes between pairs of source nodes and destination nodes. As a result, they reduce the energy consumed for route discovery. Moreover, these protocols reduce congestion and collision of packets due to limited number of nodes retransmitting the route requests

University of Kentucky