
1,683 research outputs found

Concurrent Checkpointing and Recovery in Distributed Systems

Author: Bhargava Bharat
Leu Pei-Jyun
Publication venue: 'Purdue University (bepress)'
Publication date: 11/06/1987

Transparent Fault-tolerance in Parallel Orca Programs

Author: Bal H.E.
Kaashoek M.F.
Michiels R.
Tanenbaum A.S.
Publication date: 01/01/1992

With the advent of large-scale parallel computing systems, making parallel programs fault-tolerant becomes an important problem, because the probability of a failure increases with the number of processors. In this paper, we describe a very simple scheme for rendering a class of parallel Orca programs fault-tolerant. Also, we discuss our experience with implementing this scheme on Amoeba. Our approach works for parallel applications that are not interactive. The approach is based on making a globally consistent checkpoint from time to time and rolling back to the last checkpoint when a processor fails. Making a consistent global checkpoint is easy in Orca, because its implementation is based on reliable broadcast. The advantages of our approach are its simplicity, ease of implementation, low overhead, and transparency to the Orca programmer.

CiteSeerX
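
The checkpoint-and-rollback idea described in the abstract above can be illustrated with a minimal single-process sketch. This is not the Orca or Amoeba implementation; the `Worker` class, `take_checkpoint`, and `roll_back` are hypothetical names introduced only for illustration, and the reliable broadcast that makes the checkpoint consistent in Orca is replaced here by the simplifying assumption that all workers are paused at the same point when the snapshot is taken.

```python
import copy


class Worker:
    """Hypothetical stand-in for one process of a parallel program."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.state = {"counter": 0}

    def step(self):
        # One unit of computation that mutates local state.
        self.state["counter"] += 1


def take_checkpoint(workers):
    # Assumption: all workers are paused at the same logical point,
    # so copying every state yields a globally consistent snapshot.
    return {w.worker_id: copy.deepcopy(w.state) for w in workers}


def roll_back(workers, checkpoint):
    # After a failure, every worker restarts from the saved snapshot.
    for w in workers:
        w.state = copy.deepcopy(checkpoint[w.worker_id])


workers = [Worker(i) for i in range(3)]

for w in workers:
    w.step()

checkpoint = take_checkpoint(workers)

# Work done after the checkpoint is lost when a failure forces a rollback.
for w in workers:
    w.step()
    w.step()

roll_back(workers, checkpoint)
counters = [w.state["counter"] for w in workers]
```

Rolling back every worker, not only the failed one, is what keeps the restored state consistent in this sketch; the cost is that all work since the last checkpoint is repeated.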

Fine-Grain Checkpointing with In-Cache-Line Logging

Author: Aksun David T.
Avni Hillel
Cohen Nachshon
Larus James R.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 02/02/2019

Non-Volatile Memory offers the possibility of implementing high-performance, durable data structures. However, achieving performance comparable to well-designed data structures in non-persistent (transient) memory is difficult, primarily because of the cost of ensuring the order in which memory writes reach NVM. Often, this requires flushing data to NVM and waiting a full memory round-trip time. In this paper, we introduce two new techniques: Fine-Grained Checkpointing, which ensures a consistent, quickly recoverable data structure in NVM after a system failure, and In-Cache-Line Logging, an undo-logging technique that enables recovery of earlier state without requiring cache-line flushes in the normal case. We implemented these techniques in the Masstree data structure, making it persistent and demonstrating the ease of applying them to a highly optimized system and their low (5.9-15.4%) runtime overhead cost.
Comment: In 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS 19), April 13, 2019, Providence, RI, US

arXiv.org e-Print Archive

Implicit transactional memory in kilo-instruction multiprocessors

Author: Beivide Palacio Julio Ramon
Cristal Kestelman Adrián
Galluzzi Marco
Smith James E.
Stenström Per
Valero Cortés Mateo
Vallejo Enrique
Vallejo Fernando
Publication date: 01/01/2007

Although they have been the main server technology for many years, multiprocessors are undergoing a renaissance due to multi-core chips and the attractive scalability properties of combining a number of such multi-core chips into a system. The widespread use of multiprocessor systems will make performance losses due to consistency models and synchronization styles of popular programming models even more evident than they already are. Known architectural approaches to combat these losses are generally too complex, too specialized, or not transparent to software. In this article, we introduce implicit transactional memory as a generalized architectural concept to remove unnecessary performance losses caused by consistency models and synchronization styles. We show how the concept of implicit transactions can be implemented with low complexity by leveraging the multi-checkpoint mechanism of the Kilo-Instruction Processor. By relying on a general speculation substrate, this method supports even the strictest consistency model – sequential consistency – potentially as effectively as weaker models, and it allows multiple threads to speculatively execute critical sections, beyond barriers and event synchronizations.
Postprint (published version)

Rollback recovery with low overhead for fault tolerance in mobile ad hoc networks

Author: Jaggi Parmeet Kaur
Singh Awadhesh Kumar
Publication venue: The Authors. Production and hosting by Elsevier B.V.
Publication date: 01/10/2015

Mobile ad hoc networks (MANETs) have significantly enhanced wireless networks by eliminating the need for any fixed infrastructure. Hence, they are increasingly being used for expanding the computing capacity of existing networks or for implementing autonomous mobile computing Grids. However, the fragile nature of MANETs makes the constituent nodes susceptible to failures, and the computing potential of these networks can be utilized only if they are fault tolerant. The technique of checkpointing-based rollback recovery has been used effectively for fault tolerance in static and cellular mobile systems; yet, the implementation of existing protocols for MANETs is not straightforward. The paper presents a novel rollback recovery protocol for handling the failures of mobile nodes in a MANET using checkpointing and sender-based message logging. The proposed protocol utilizes the routing protocol existing in the network for implementing a low-overhead recovery mechanism. The presented recovery procedure at a node is completely domino-free and asynchronous. The protocol is resilient to the dynamic characteristics of the MANET, allowing a distributed application to be executed independently without access to any wired Grid or cellular network access points. We also present an algorithm to record a consistent global snapshot of the MANET.

Directory of Open Access Journals

Reversible Multiparty Sessions with Checkpoints

Author: Dezani-Ciancaglini Mariangiola
Giannini Paola
Publication venue: 'Open Publishing Association'
Publication date: 01/01/2016

Reversible interactions model different scenarios, like biochemical systems and human as well as automatic negotiations. We abstract interactions via multiparty sessions enriched with named checkpoints. Computations can either go forward or roll back to some checkpoints, where possibly different choices may be taken. In this way communications can be undone and different conversations may be tried. Interactions are typed with global types, which also control rollbacks. Typeability of session participants in agreement with global types ensures session fidelity and progress of reversible communications.
Comment: In Proceedings EXPRESS/SOS 2016, arXiv:1608.0269

arXiv.org e-Print Archive

Directory of Open Access Journals