
    A survey of checkpointing algorithms for parallel and distributed computers

    A checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving that status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area that relax the assumptions made in that article and extend it to minimise the overheads of coordination and context saving. Checkpointing algorithms for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory; all of them assume that the main memory is safe for storing the context. Recently, algorithms have been published for distributed shared memory systems which extend the cache coherence protocols used in shared memory systems; these, however, also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is felt, however, that in developing parallel programs the user has to do a fair amount of work in distributing tasks, and this information can be used effectively to simplify checkpointing and rollback recovery.
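
    Since the survey singles out Chandy and Lamport's marker-based snapshot protocol as the basis for most message-passing checkpointing algorithms, a compact illustration may help. The C sketch below simulates the protocol's two rules for two processes joined by FIFO channels inside a single address space; all names (Channel, Process, ch_send, MARKER) are illustrative assumptions rather than anything from the surveyed papers, and the toy ring-buffer channels do no capacity checking.

        /* Toy single-process simulation of the Chandy-Lamport snapshot
         * protocol for two processes linked by FIFO channels. */
        #include <stdio.h>

        #define MARKER -1              /* distinguished marker message */
        #define QCAP   16

        typedef struct {               /* FIFO channel as a ring buffer */
            int buf[QCAP];
            int head, tail;
        } Channel;

        static void ch_send(Channel *c, int m) { c->buf[c->tail++ % QCAP] = m; }
        static int  ch_recv(Channel *c)        { return c->buf[c->head++ % QCAP]; }
        static int  ch_empty(const Channel *c) { return c->head == c->tail; }

        typedef struct {
            int state;                 /* local state: a running sum */
            int recorded_state;        /* state saved for the snapshot */
            int recording;             /* still recording the in-channel? */
            int channel_log[QCAP];     /* in-flight messages captured */
            int logged;
        } Process;

        /* Rule 1: the initiator records its state, then emits a marker. */
        static void initiate(Process *p, Channel *out) {
            p->recorded_state = p->state;
            p->recording = 1;
            ch_send(out, MARKER);
        }

        /* Rule 2: a normal message is processed (and logged as channel
         * state while recording); a first marker triggers a local
         * snapshot and is relayed; any marker ends channel recording. */
        static void on_receive(Process *p, int msg, Channel *out) {
            if (msg == MARKER) {
                if (!p->recording) {          /* first marker seen */
                    p->recorded_state = p->state;
                    ch_send(out, MARKER);     /* relay downstream */
                }
                p->recording = 0;             /* channel state complete */
            } else {
                p->state += msg;              /* normal processing */
                if (p->recording)
                    p->channel_log[p->logged++] = msg;
            }
        }

        int main(void) {
            Process A = { .state = 100 }, B = { .state = 50 };
            Channel ab = {0}, ba = {0};

            ch_send(&ab, 5);           /* pre-marker message A -> B */
            initiate(&A, &ab);         /* A snapshots; marker follows */
            ch_send(&ba, 7);           /* pre-marker message B -> A */

            while (!ch_empty(&ab)) on_receive(&B, ch_recv(&ab), &ba);
            while (!ch_empty(&ba)) on_receive(&A, ch_recv(&ba), &ab);

            printf("A=%d B=%d, %d message(s) logged on channel B->A\n",
                   A.recorded_state, B.recorded_state, A.logged);
            return 0;
        }

    The recorded cut is consistent: A's state before B's message arrived, B's state after A's message was delivered, and the still-in-flight message captured as the state of channel B->A.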

    Autonomic State Management for Optimistic Simulation Platforms

    We present the design and implementation of an autonomic state manager (ASM) tailored for integration within optimistic parallel discrete event simulation (PDES) environments based on the C programming language and the Executable and Linkable Format (ELF), and developed for execution on x86-64 architectures. With ASM, the state of any logical process (LP), namely an individual (concurrent) simulation unit within the simulation model, is allowed to be scattered across dynamically allocated memory chunks managed via the standard API (e.g., malloc/free). Also, the application programmer is not required to provide any serialization/deserialization module in order to take a checkpoint of the LP state or to restore it in case a causality error occurs during the optimistic run, nor to indicate which portions of the state are updated by event processing, so as to allow incremental checkpointing. All these tasks are handled by ASM in a fully transparent manner via (A) runtime identification (with chunk-level granularity) of the memory map associated with the LP state, and (B) runtime tracking of the memory updates occurring within chunks belonging to the dynamic memory map. The co-existence of the incremental and non-incremental log/restore modes is achieved via dual versions of the same application code, transparently generated by ASM via compile/link-time facilities. Also, the dynamic selection of the best-suited log/restore mode is actuated by ASM on the basis of an innovative modeling/optimization approach which takes into account the stability of each operating mode with respect to variations of the model/environment execution parameters.
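
    To make the chunk-level bookkeeping concrete, here is a heavily simplified sketch of a per-LP memory map with dirty bits and shadow copies. The names (lp_alloc, lp_touch, lp_checkpoint, lp_restore) are invented for illustration and are not ASM's API; in particular, ASM flags updated chunks transparently through its compile/link-time instrumentation, whereas this toy requires an explicit lp_touch call, and deallocation handling is omitted.

        /* Per-LP memory map: an incremental checkpoint copies only dirty
         * chunks; restore copies every shadow back after a rollback. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #define MAX_CHUNKS 256

        typedef struct {
            void  *addr;       /* live chunk in the LP state */
            void  *shadow;     /* copy taken at the last checkpoint */
            size_t size;
            int    dirty;      /* modified since the last checkpoint? */
        } Chunk;

        static Chunk map[MAX_CHUNKS];
        static int   nchunks;

        /* Allocate a chunk and register it in the LP's memory map
         * (born dirty so the first checkpoint captures it). */
        static void *lp_alloc(size_t size) {
            void *p = malloc(size);
            map[nchunks++] = (Chunk){ p, malloc(size), size, 1 };
            return p;
        }

        /* Mark a chunk as updated; ASM does this automatically on
         * every write via instrumented code. */
        static void lp_touch(void *p) {
            for (int i = 0; i < nchunks; i++)
                if (map[i].addr == p) map[i].dirty = 1;
        }

        /* Incremental checkpoint: save only the dirty chunks. */
        static void lp_checkpoint(void) {
            for (int i = 0; i < nchunks; i++)
                if (map[i].dirty) {
                    memcpy(map[i].shadow, map[i].addr, map[i].size);
                    map[i].dirty = 0;
                }
        }

        /* Rollback after a causality error: restore every chunk. */
        static void lp_restore(void) {
            for (int i = 0; i < nchunks; i++)
                memcpy(map[i].addr, map[i].shadow, map[i].size);
        }

        int main(void) {
            int *x = lp_alloc(sizeof *x);
            *x = 42; lp_touch(x);
            lp_checkpoint();            /* shadow now holds 42 */
            *x = 7;  lp_touch(x);       /* speculative event update */
            lp_restore();               /* causality error: roll back */
            printf("x = %d\n", *x);     /* prints 42 */
            return 0;
        }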

    Building Resilient Cloud Over Unreliable Commodity Infrastructure

    Cloud Computing has emerged as a successful computing paradigm for efficiently utilizing managed compute infrastructure: high-speed rack-mounted servers connected by high-speed networking and backed by reliable storage. Usually such infrastructure is dedicated, physically secured, and has reliable power and networking. However, much of our idle compute capacity is present in unmanaged infrastructure such as idle desktops, lab machines, physically distant server machines, and laptops. We present a scheme to utilize this idle compute capacity on a best-effort basis and provide high availability even in the face of failure of individual components or facilities. We run virtual machines on the commodity infrastructure and present a cloud interface to our end users. The primary challenge is to maintain availability in the presence of node failures, network failures, and power failures. We run multiple copies of a Virtual Machine (VM) redundantly on geographically dispersed physical machines to achieve availability. If one of the running copies of a VM fails, we seamlessly switch over to another running copy. We use VM Record/Replay to implement this redundancy and switchover. So far, we have implemented VM Record/Replay for uniprocessor machines over Linux/KVM and are currently working on VM Record/Replay for shared-memory multiprocessor machines. We report initial experimental results based on our implementation.
    Comment: Oral presentation at IEEE "Cloud Computing for Emerging Markets", Oct. 11-12, 2012, Bangalore, India
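
    The switchover decision itself reduces to a failure detector at each backup site. The sketch below uses a tick-based heartbeat timeout purely as an illustrative assumption; the paper's actual redundancy mechanism is deterministic VM record/replay over Linux/KVM, which keeps the backup copy in lockstep and is not modeled here.

        /* Toy failure detector: a backup replica promotes itself when
         * the primary's heartbeat goes stale for TIMEOUT ticks. */
        #include <stdio.h>

        #define TIMEOUT 3

        typedef struct {
            const char *name;
            int primary;          /* currently serving clients? */
            int last_heartbeat;   /* tick of the last heartbeat seen */
        } Replica;

        static void check_primary(Replica *backup, int now) {
            if (!backup->primary && now - backup->last_heartbeat > TIMEOUT) {
                backup->primary = 1;   /* take over the VM's clients */
                printf("[tick %d] %s promoted to primary\n", now, backup->name);
            }
        }

        int main(void) {
            Replica backup = { "site-B", 0, 0 };
            for (int now = 1; now <= 6; now++) {
                if (now <= 2) backup.last_heartbeat = now; /* alive */
                check_primary(&backup, now);  /* primary dies at tick 3 */
            }
            return 0;
        }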

    Transparent Fault-tolerance in Parallel Orca Programs

    With the advent of large-scale parallel computing systems, making parallel programs fault-tolerant becomes an important problem, because the probability of a failure increases with the number of processors. In this paper, we describe a very simple scheme for rendering a class of parallel Orca programs fault-tolerant, and we discuss our experience with implementing this scheme on Amoeba. Our approach works for parallel applications that are not interactive. It is based on making a globally consistent checkpoint from time to time and rolling back to the last checkpoint when a processor fails. Making a consistent global checkpoint is easy in Orca, because its implementation is based on reliable broadcast. The advantages of our approach are its simplicity, ease of implementation, low overhead, and transparency to the Orca programmer.
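
    Why reliable broadcast makes consistent checkpointing simple can be shown in a few lines, assuming (as in Orca's runtime) that broadcasts are delivered in the same total order at every process: a CHECKPOINT message then reaches every process at the same point in the single global message order, so saving local state on its delivery yields a consistent cut with no marker protocol and no channel recording. The toy broadcast bus below (one function delivering to simulated processes in turn) is an illustrative stand-in, not Amoeba's group communication.

        /* Consistent checkpointing on top of totally ordered broadcast:
         * every process saves state at the same point in the order. */
        #include <stdio.h>

        #define NPROC 3
        #define CHECKPOINT -1          /* distinguished control message */

        static int state[NPROC];       /* each process's local state */
        static int saved[NPROC];       /* its last checkpoint */

        /* Deliver one broadcast to all processes in the global order. */
        static void deliver(int msg) {
            for (int p = 0; p < NPROC; p++) {
                if (msg == CHECKPOINT)
                    saved[p] = state[p];   /* same cut point for all */
                else
                    state[p] += msg;       /* normal processing */
            }
        }

        int main(void) {
            deliver(10);
            deliver(CHECKPOINT);       /* cut falls between 10 and 5 */
            deliver(5);
            for (int p = 0; p < NPROC; p++)
                state[p] = saved[p];   /* a processor failed: roll back */
            printf("after rollback: %d %d %d\n",
                   state[0], state[1], state[2]);   /* 10 10 10 */
            return 0;
        }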

    The construction of recoverable multi-level systems

    PhD Thesis. System structures and data structures which make possible the state restoration of user objects are described in this thesis. Recovery is linked with types, which suggests making a distinction between recoverable and unrecoverable types. For convenience, recovery is discussed in terms of recovery blocks as developed at the University of Newcastle upon Tyne; recovery is taken to mean restoring the values of recoverable types. Recoverable multi-level systems are considered: on the one hand, levels in such systems can be backed out; on the other hand, these levels provide explicit recovery for the new types they introduce, and so can be called on to restore the states of objects used in higher levels. The concepts and issues are discussed and explained, and mechanisms and techniques for building such systems are presented. Recovery techniques for complex global data structures, and techniques to maintain consistency at all times, even when recovery is impossible such as after a crash, are described and compared. Many of the presented techniques are employed in an implemented recoverable two-level system with a recoverable filing system; this two-level system is described in detail. It is argued that, in order to implement recoverability in multi-level systems with efficiency and flexibility, the interfaces of the system should provide both recoverable and unrecoverable types. It is also shown that the way in which complex data structures are updated is of major importance if recovery is to be provided in a "reasonably" efficient way and consistency is to be guaranteed after a crash.
    Netherlands Organisation for the Advancement of Pure Research.
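
    A minimal illustration of a recoverable type in the recovery-block style may be useful here: each in-place update first pushes the prior value onto an undo log, so entering a recovery region, accepting its results, or backing it out correspond to marking, keeping, or replaying the log. The names (rec_assign, rec_enter, rec_restore) are invented for illustration and not taken from the thesis, which tackles the much harder cases of complex global structures and crash consistency that this toy ignores.

        /* A recoverable integer via an undo log, recovery-block style. */
        #include <stdio.h>

        #define LOG_CAP 64

        typedef struct { int *addr; int old; } UndoEntry;

        static UndoEntry undo_log[LOG_CAP];
        static int log_top;

        /* Update a recoverable variable, logging its prior value. */
        static void rec_assign(int *v, int newval) {
            undo_log[log_top++] = (UndoEntry){ v, *v };
            *v = newval;
        }

        /* Enter a recovery block: remember the current log position. */
        static int rec_enter(void) { return log_top; }

        /* Back out: pop the log to the mark, restoring prior states. */
        static void rec_restore(int mark) {
            while (log_top > mark) {
                UndoEntry e = undo_log[--log_top];
                *e.addr = e.old;
            }
        }

        int main(void) {
            int balance = 100;
            int mark = rec_enter();        /* begin recovery block */
            rec_assign(&balance, 40);      /* primary alternate runs */
            if (balance < 50)              /* acceptance test fails */
                rec_restore(mark);         /* so back out the update */
            printf("balance = %d\n", balance);  /* prints 100 */
            return 0;
        }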