1,303 research outputs found
A survey of checkpointing algorithms for parallel and distributed computers
Checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving the status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area by relaxing the assumptions made in this paper and by extending it to minimise the overheads of coordination and context saving. Checkpointing for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They however also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is however felt that in development of parallel programs the user has to do a fair amount of work in distributing tasks and this information can be effectively used to simplify checkpointing and rollback recovery
Autonomic State Management for Optimistic Simulation Platforms
We present the design and implementation of an autonomic state manager (ASM) tailored for integration within optimistic parallel discrete event simulation (PDES) environments based on the C programming language and the executable and linkable format (ELF), and developed for execution on x8664 architectures. With ASM, the state of any logical process (LP), namely the individual (concurrent) simulation unit being part of the simulation model, is allowed to be scattered on dynamically allocated memory chunks managed via standard API (e.g., malloc/free). Also, the application programmer is not required to provide any serialization/deserialization module in order to take a checkpoint of the LP state, or to restore it in case a causality error occurs during the optimistic run, or to provide indications on which portions of the state are updated by event processing, so to allow incremental checkpointing. All these tasks are handled by ASM in a fully transparent manner via (A) runtime identification (with chunk-level granularity) of the memory map associated with the LP state, and (B) runtime tracking of the memory updates occurring within chunks belonging to the dynamic memory map. The co-existence of the incremental and non-incremental log/restore modes is achieved via dual versions of the same application code, transparently generated by ASM via compile/link time facilities. Also, the dynamic selection of the best suited log/restore mode is actuated by ASM on the basis of an innovative modeling/optimization approach which takes into account stability of each operating mode with respect to variations of the model/environmental execution parameters
Building Resilient Cloud Over Unreliable Commodity Infrastructure
Cloud Computing has emerged as a successful computing paradigm for
efficiently utilizing managed compute infrastructure such as high speed
rack-mounted servers, connected with high speed networking, and reliable
storage. Usually such infrastructure is dedicated, physically secured and has
reliable power and networking infrastructure. However, much of our idle compute
capacity is present in unmanaged infrastructure like idle desktops, lab
machines, physically distant server machines, and laptops. We present a scheme
to utilize this idle compute capacity on a best-effort basis and provide high
availability even in face of failure of individual components or facilities.
We run virtual machines on the commodity infrastructure and present a cloud
interface to our end users. The primary challenge is to maintain availability
in the presence of node failures, network failures, and power failures. We run
multiple copies of a Virtual Machine (VM) redundantly on geographically
dispersed physical machines to achieve availability. If one of the running
copies of a VM fails, we seamlessly switchover to another running copy. We use
Virtual Machine Record/Replay capability to implement this redundancy and
switchover. In current progress, we have implemented VM Record/Replay for
uniprocessor machines over Linux/KVM and are currently working on VM
Record/Replay on shared-memory multiprocessor machines. We report initial
experimental results based on our implementation.Comment: Oral presentation at IEEE "Cloud Computing for Emerging Markets",
Oct. 11-12, 2012, Bangalore, Indi
Transparent Fault-tolerance in Parallel Orca Programs
With the advent of large-scale parallel computing systems, making parallel programs fault-tolerant becomes an important problem, because the probability of a failure increases with the number of processors. In this paper, we describe a very simple scheme for rendering a class of parallel Orca programs fault-tolerant. Also, we discuss our experience with implementing this scheme on Amoeba. Our approach works for parallel applications that are not interactive. The approach is based on making a globally consistent checkpoint from time to time and rolling back to the last checkpoint when a processor fails. Making a consistent global checkpoint is easy in Orca, because its implementation is based on reliable broadcast. The advantages of our approach are its simplicity, ease of implementation, low overhead, and transparency to the Orca programmer. 1
The construction of recoverable multi-level systems
PhD ThesisSystems structures and data structures which make
possible the state restoration of user objects, are
described in this thesis. Recovery is linked with types,
which suggests making a distinction between recoverable and
unrecoverable types. For convenience, recovery is discussed
in terms of recovery blocks as developed at the University
of Newcastle upon Tyne. Recovery is taken to mean restoring
the values of recoverable types.
Recoverable multi-level systems are considered. On the
one hand levels in such systems can be backed out. On the
other hand these levels provide explicit recovery for new
types they introduce, and so can be called on to restore
states of objects used in higher levels. The concepts and
issues are discussed and explained; mechanisms and
techniques for building such systems are presented.
Recovery techniques for complex global data structures
and techniques to maintain consistency at any time, even
when recovery is impossible such as after a crash, are
described and compared.
Many of the presented techniques are employed in an
implemented recoverable two-level system, with a recoverable
filing system. This two-level system is described in detail.
It is argued that in order to implement recoverability
in multi-level systems with efficiency and flexibility, the
interfaces of the system should provide both recoverable and
unrecoverable types.
It is also shown that the way in which complex data
structures are updated is of major importance if recovery is
to be provided in a "reasonably" efficient way and
consistency is to be guaranteed after a crash.Netherlands Organisation
for the Advancement of Pure Research
- …