Search CORE

70,365 research outputs found

Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing

Author: Johnson D.B
Zwaenepoel W
Publication venue: 'Elsevier BV'
Publication date: 17/10/2005
Field of study

Message logging and check pointing can provide fault tolerance in distributed systems in which all process communication is through messages. This paper presents a general model for reasoning about recovery in these systems. Using this model_ we prove that the set of recoverable system states that have occurred during any single execution of the system forms a lattice, and that therefore, there is always a unique maximum recoverable system state, which never decreases. Based on this model, we present an algorithm for determining this maximum recoverable state, and prove its correctness. Our algorithm utilizes all logged messages and checkpoints, and thus always finds the maximum recoverable state possible. Previous recovery methods using optimistic message logging and checkpointing have not considered the existing checkpoints, and thus may not find this maximum state. Furthermore, by utilizing the checkpoints, some messages received by a process before it was checkpointed may not need to be logged. Using our algorithm also adds less communication overhead to the system than do previous methods. Our model and algorithm can be used with any message logging protocol, whether pessimistic or optimistic, but their full generality is only required with optimistic logging protocols

Infoscience - École polytechnique fédérale de Lausanne

Checkpoint-based forward recovery using lookahead execution and rollback validation in parallel and distributed systems

Author: Long Junsheng
Publication venue
Publication date
Field of study

This thesis studies a forward recovery strategy using checkpointing and optimistic execution in parallel and distributed systems. The approach uses replicated tasks executing on different processors for forwared recovery and checkpoint comparison for error detection. To reduce overall redundancy, this approach employs a lower static redundancy in the common error-free situation to detect error than the standard N Module Redundancy scheme (NMR) does to mask off errors. For the rare occurrence of an error, this approach uses some extra redundancy for recovery. To reduce the run-time recovery overhead, look-ahead processes are used to advance computation speculatively and a rollback process is used to produce a diagnosis for correct look-ahead processes without rollback of the whole system. Both analytical and experimental evaluation have shown that this strategy can provide a nearly error-free execution time even under faults with a lower average redundancy than NMR

NASA Technical Reports Server

Recommended from our members

On designing dependable services with diverse off-the-shelf SQL servers

Author: A. Avizienis
A. Avizienis
A. Vaysburd
B. Kemme
C. Babbage
D. Powell
F. Pedone
F. Schneider
I. Gashi
J. Gray
J. Gray
J.C. Laprie
M. Patino-Martinez
M. Weismann
P. Popov
P.A. Bernstein
P.E. Ammann
P.J. Traverse
P.M. Chen
R. Jimenez-Peris
R. Jimenez-Peris
S. Chandra
S. Chandra
S. Poledna
S. Poledna
T. Anderson
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2004
Field of study

City Research Online

Crossref

Autonomic State Management for Optimistic Simulation Platforms

Author: Pellegrini Alessandro
Quaglia Francesco
Vitali Roberto
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2015
Field of study

We present the design and implementation of an autonomic state manager (ASM) tailored for integration within optimistic parallel discrete event simulation (PDES) environments based on the C programming language and the executable and linkable format (ELF), and developed for execution on x8664 architectures. With ASM, the state of any logical process (LP), namely the individual (concurrent) simulation unit being part of the simulation model, is allowed to be scattered on dynamically allocated memory chunks managed via standard API (e.g., malloc/free). Also, the application programmer is not required to provide any serialization/deserialization module in order to take a checkpoint of the LP state, or to restore it in case a causality error occurs during the optimistic run, or to provide indications on which portions of the state are updated by event processing, so to allow incremental checkpointing. All these tasks are handled by ASM in a fully transparent manner via (A) runtime identification (with chunk-level granularity) of the memory map associated with the LP state, and (B) runtime tracking of the memory updates occurring within chunks belonging to the dynamic memory map. The co-existence of the incremental and non-incremental log/restore modes is achieved via dual versions of the same application code, transparently generated by ASM via compile/link time facilities. Also, the dynamic selection of the best suited log/restore mode is actuated by ASM on the basis of an innovative modeling/optimization approach which takes into account stability of each operating mode with respect to variations of the model/environmental execution parameters

ART

Archivio della ricerca- Università di Roma La Sapienza

A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

Author: Treaster Michael
Publication venue
Publication date: 31/12/2004
Field of study

Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques.Comment: 11 page

arXiv.org e-Print Archive

CiteSeerX

A distributed file service based on optimistic concurrency control

Author: Mullender Sape J.
Tanenbaum Andrew S.
Publication venue: ACM Press
Publication date: 01/01/1985
Field of study

The design of a layered file service for the Amoeba Distributed System is discussed, on top of which various applications can easily be intplemented. The bottom layer is formed by the Amoeba Block Services, responsible for implementing stable storage and repficated, highly available disk blocks. The next layer is formed by the Amoeba File Service which provides version management and concurrency control for tree-structured files. On top of this layer, the appficafions, ranging from databases to source code control systems, determine the structure of the file trees and provide an interface to the users

VU Research Portal

CWI's Institutional Repository

University of Twente Research Information

An approach to rollback recovery of collaborating mobile agents

Author: Bargiela A
Osman T
Wagealla W
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 24/02/2004
Field of study

Fault-tolerance is one of the main problems that must be resolved to improve the adoption of the agents' computing paradigm. In this paper, we analyse the execution model of agent platforms and the significance of the faults affecting their constituent components on the reliable execution of agent-based applications, in order to develop a pragmatic framework for agent systems fault-tolerance. The developed framework deploys a communication-pairs independent check pointing strategy to offer a low-cost, application-transparent model for reliable agent- based computing that covers all possible faults that might invalidate reliable agent execution, migration and communication and maintains the exactly-one execution property

Crossref

Nottingham Trent Institutional Repository (IRep)

Transparently Mixing Undo Logs and Software Reversibility for State Recovery in Optimistic PDES

Author: Cingolani Davide
Pellegrini Alessandro
Quaglia Francesco
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2015
Field of study

The rollback operation is a fundamental building block to support the correct execution of a speculative Time Warp-based Parallel Discrete Event Simulation. In the literature, several solutions to reduce the execution cost of this operation have been proposed, either based on the creation of a checkpoint of previous simulation state images, or on the execution of negative copies of simulation events which are able to undo the updates on the state. In this paper, we explore the practical design and implementation of a state recoverability technique which allows to restore a previous simulation state either relying on checkpointing or on the reverse execution of the state updates occurred while processing events in forward mode. Differently from other proposals, we address the issue of executing backward updates in a fully-transparent and event granularity-independent way, by relying on static software instrumentation (targeting the x86 architecture and Linux systems) to generate at runtime reverse update code blocks (not to be confused with reverse events, proper of the reverse computing approach). These are able to undo the effects of a forward execution while minimizing the cost of the undo operation. We also present experimental results related to our implementation, which is released as free software and fully integrated into the open source ROOT-Sim (ROme OpTimistic Simulator) package. The experimental data support the viability and effectiveness of our proposal

ART

Archivio della ricerca- Università di Roma La Sapienza