A survey of checkpointing algorithms for parallel and distributed computers
A checkpoint is a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving this status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area by relaxing the assumptions made in that article and by extending it to minimise the overheads of coordination and context saving. Checkpointing algorithms for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently, algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They also, however, include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is, however, felt that in the development of parallel programs the user has to do a fair amount of work in distributing tasks, and this information can be effectively used to simplify checkpointing and rollback recovery.
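As a rough illustration of the checkpoint and rollback-recovery idea described in this abstract, the following single-process Python sketch saves loop state periodically and resumes from the last saved state after a simulated failure. It is not taken from the surveyed paper: the names `Checkpointer` and `run_sum` are hypothetical, and an in-memory object stands in for stable storage.

```python
import copy


class Checkpointer:
    """Hypothetical in-memory stand-in for stable storage."""

    def __init__(self):
        self._saved = None

    def save(self, state):
        # Deep copy so later mutation of the live state cannot alter the checkpoint.
        self._saved = copy.deepcopy(state)

    def restore(self):
        # Returns None when no checkpoint has been taken yet.
        return copy.deepcopy(self._saved)


def run_sum(n, store, interval=10, crash_at=None):
    """Sum 0..n-1, checkpointing every `interval` steps.

    Resumes from the last checkpoint in `store` if one exists.
    `crash_at` simulates a failure just before that step is processed.
    """
    state = store.restore() or {"i": 0, "total": 0}
    while state["i"] < n:
        if crash_at is not None and state["i"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % interval == 0:
            store.save(state)
    return state["total"]
```

Work done between the last checkpoint and the failure is lost and recomputed on restart, which is the cost that checkpoint interval selection trades against the overhead of saving state.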
Recoverable Distributed Shared Memory Under Sequential and Relaxed Consistency
Coordinated Science Laboratory was formerly known as Control Systems Laboratory. Office of Naval Research / N00014-90-J-1270 and N00014-91-J-1283. National Aeronautics and Space Administration / NASA NAG 1-61
Implementing fault tolerance in a 64-bit distributed operating system
This thesis explores the potential of 64-bit processors for providing a different style of distributed operating system. Rather than providing another reworking of the UNIX model, the use of the large address space for unifying volatile memory (virtual memory), persistent memory (file systems) and distributed network access is examined and a novel operating system, Arius, is proposed.
The concepts behind the design of ARIUS are briefly reviewed, and then the reliability of such a system is examined in detail. The unified nature of the architecture makes it possible to use a reliable single address space to provide a completely reliable system without the addition of other mechanisms. Protocols are proposed to provide locally scalable distributed shared memory and these are then augmented to handle machine failures transparently through the use of distributed checkpoints and rollback.
The checkpointing system makes use of the caching mechanism in DSM to provide data duplication for failure recovery. By using distributed memory for checkpoints, recovery from machine faults may be handled seamlessly. To cope with more “complete” failures, persistent storage is also included in the failure mechanism.
These protocols are modelled to show their operability and to determine the cost they incur in various types of parallel and serial programs. Results are presented to demonstrate these costs
Portable Checkpointing for Parallel Applications
High Performance Computing (HPC) systems represent the peak of modern computational capability. As
ever-increasing demands for computational power have fuelled the demand for ever-larger computing systems,
modern HPC systems have grown to incorporate hundreds, thousands or as many as 130,000 processors. At these
scales, the huge number of individual components in a single system makes the probability that a single
component will fail quite high, with today's large HPC systems featuring mean times between failures on the
order of hours or a few days. As many modern computational tasks require days or months to complete, fault
tolerance becomes critical to HPC system design.
The past three decades have seen significant amounts of research on parallel system fault tolerance. However,
as most of it has been either theoretical or has focused on low-level solutions that are embedded into a
particular operating system or type of hardware, this work has had little impact on real HPC systems. This
thesis attempts to address this lack of impact by describing a high-level approach for implementing
checkpoint/restart functionality that decouples the fault tolerance solution from the details of the
operating system, system libraries and the hardware and instead connects it to the APIs implemented by the
above components. The resulting solution enables applications that use these APIs to become
self-checkpointing and self-restarting regardless of the software/hardware platform that may implement
the APIs.
The particular focus of this thesis is on the problem of checkpoint/restart of parallel applications. It
presents two theoretical checkpointing protocols, one for the message passing communication model and one for
the shared memory model. The former is the first protocol to be compatible with application-level
checkpointing of individual processes, while the latter is the first protocol that is compatible with
arbitrary shared memory models, APIs, implementations and consistency protocols. These checkpointing
protocols are used to implement checkpointing systems for applications that use the MPI and OpenMP parallel
APIs, respectively, and are first in providing checkpoint/restart to arbitrary implementations of these
popular APIs. Both checkpointing systems are extensively evaluated on multiple software/hardware platforms
and are shown to feature low overheads.
RGLock: Recoverable Mutual Exclusion for Non-Volatile Main Memory Systems
Mutex locks have traditionally been the most popular concurrent programming mechanisms for inter-process synchronization in the rapidly advancing field of concurrent computing systems that support high-performance applications. However, the concept of recoverability of these algorithms in the event of a crash failure has not been studied thoroughly. Popular techniques like transaction roll-back are widely known for providing fault-tolerance in modern Database Management Systems. In the context of mutual exclusion in shared memory systems, however, none of the prominent lock algorithms (e.g., Lamport’s Bakery algorithm, MCS lock, etc.) are designed to tolerate crash failures, especially in operations carried out in the critical sections. Each of these algorithms may fail to maintain mutual exclusion, or sacrifice some of the liveness guarantees, in the presence of crash failures. Storing application data and recovery information in primary storage built from conventional volatile memory limits the development of efficient crash-recovery mechanisms, since a failure of any component in the system causes a loss of program data. With the advent of Non-Volatile Main Memory technologies, opportunities have opened up to redefine the problem of Mutual Exclusion in the context of a crash-recovery model where processes may recover from crash failures and resume execution. When the main memory is non-volatile, an application’s entire state can be recovered from a crash using the in-memory state near-instantaneously, making a process’s failure appear as a suspend/resume event. This thesis proceeds to envision a solution for the problem of mutual exclusion in such systems. The goal is to provide a first-of-its-kind mutex lock that guarantees mutual exclusion and starvation freedom in emerging shared-memory architectures that incorporate non-volatile main memory (NVMM).
Fault-tolerant parallel applications using a network of workstations
PhD thesis. It is becoming common to employ a Network Of Workstations, often referred to as a NOW, for
general purpose computing since the allocation of an individual workstation offers good interactive
response. However, there may still be a need to perform very large scale computations which exceed
the resources of a single workstation. It may be that the amount of processing implies an inconveniently
long duration or that the data manipulated exceeds available storage. One possibility is to employ a
more powerful single machine for such computations. However, there is growing interest in seeking a
cheaper alternative by harnessing the significant idle time often observed in a NOW and also possibly
employing a number of workstations in parallel on a single problem. Parallelisation permits use of the
combined memories of all participating workstations, but also introduces a need for communication, and
success in any hardware environment depends on the amount of communication relative to the amount
of computation required. In the context of a NOW, much success is reported with applications which
have low communication requirements relative to computation requirements.
Here it is claimed that there is reason for investigation into the use of a NOW for parallel execution
of computations which are demanding in storage, potentially even exceeding the sum of memory in
all available workstations. Another consideration is that where a computation is of sufficient scale,
some provision for tolerating partial failures may be desirable. However, generic support for storage
management and fault-tolerance in computations of this scale for a NOW is not currently available and
the suitability of a NOW for solving such computations has not been investigated to any large extent.
The work described here is concerned with these issues.
The approach employed is to make use of an existing distributed system which supports nested
atomic actions (atomic transactions) to structure fault-tolerant computations with persistent objects.
This system is used to develop a fault-tolerant "bag of tasks" computation model, where the bag and
shared objects are located on secondary storage.
In order to understand the factors that affect the performance of large parallel computations on a
NOW, a number of specific applications are developed. The performance of these applications is
analysed using a semi-empirical model. The same measurements underlying these performance predictions
may be employed in estimation of the performance of alternative application structures. Using services
provided by the distributed system referred to above, each application is implemented. The
implementation allows verification of predicted performance and also permits identification of issues regarding
construction of components required to support the chosen application structuring technique. The work
demonstrates that a NOW certainly offers some potential for gain through parallelisation and that for
large grain computations, the cost of implementing fault tolerance is low. Engineering and Physical Sciences Research Council
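The fault-tolerant "bag of tasks" model mentioned in the abstract above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the thesis's system: the class `TaskBag` and its method `run_atomically` are hypothetical names, the bag lives in memory rather than on secondary storage, and a try/except block stands in for an atomic action that either commits a result or aborts and returns the task to the bag.

```python
class TaskBag:
    """Hypothetical bag of tasks with all-or-nothing task execution."""

    def __init__(self, tasks):
        self.pending = list(tasks)
        self.results = {}

    def run_atomically(self, worker):
        """Take one task, run `worker` on it, and commit the result.

        If the worker fails, the action aborts: the task goes back
        into the bag and no partial result is recorded.
        """
        task = self.pending.pop()
        try:
            result = worker(task)
        except Exception:
            self.pending.append(task)  # abort: task is available again
            return False
        self.results[task] = result  # commit
        return True
```

A driver would simply call `run_atomically` until `pending` is empty; a worker failure costs only the re-execution of the task that was in progress.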
Replication of non-deterministic objects
This thesis discusses replication of non-deterministic objects in distributed systems to achieve fault tolerance against crash failures. The objects replicated are the virtual nodes of a distributed application. Replication is viewed as an issue that is to be dealt with only during the configuration of a distributed application and that should not affect the development of the application. Hence, replication of virtual nodes should be transparent to the application. Like all measures to achieve fault tolerance, replication introduces redundancy in the system. Not surprisingly, the main difficulty is guaranteeing the consistency of all replicas such that they behave in the same way as if the object was not replicated (replication transparency). This is further complicated if active objects (like virtual nodes) are replicated, and these objects themselves can be clients of still further objects in the distributed application. The problems of replication of active non-deterministic objects are analyzed in the context of distributed Ada 95 applications. The ISO standard for Ada 95 defines a model for distributed execution based on remote procedure calls (RPC). Virtual nodes in Ada 95 use this as their sole communication paradigm, but they may contain tasks to execute activities concurrently, thus making the execution potentially non-deterministic due to implicit timing dependencies. Such non-determinism cannot be avoided by choosing deterministic tasking policies. I present two different approaches to maintain replica consistency despite this non-determinism. In a first approach, I consider the run-time support of Ada 95 as a black box (except for the part handling remote communications). This corresponds to a non-deterministic computation model. I show that replication of non-deterministic virtual nodes requires that remote procedure calls are implemented as nested transactions. 
Unfortunately, effects of failures are not local to the replicas of a virtual node: when a failure occurs, nested remote calls made to other virtual nodes must be undone. Also, using transactional semantics for RPCs necessitates a compromise regarding transparency: the application must identify global state for it cannot be determined reliably in an automatic way. Further study reveals that this approach cannot be implemented in a transparent way at all because the consistency criterion of Ada 95 (linearizability) is much weaker than that of transactions (serializability). An execution of remote procedure calls as transactions may thus lead to incompatibilities with the semantics of the programming language. If remotely called subprograms on a replicated virtual node perform partial operations, i.e., entry calls on global protected objects, deadlocks that cannot be broken can occur in certain cases. Such deadlocks do not occur when the virtual node is not replicated. The transactional semantics of RPCs must therefore be exposed to the application. A second approach is based on a piecewise deterministic computation model, i.e., the execution of a virtual node is seen as a sequence of deterministic state intervals. Whenever a non-deterministic event occurs, a new state interval is started. I study replica organization under this computation model (semi-active replication). In this model, all non-deterministic decisions are made on one distinguished replica (the leader), while all other replicas (the followers) are forced to follow the same sequence of non-deterministic events. I show that it suffices to synchronize the followers with the leader upon each observable event, i.e., when the leader sends a message to some other virtual node. It is not necessary to synchronize upon each and every non-deterministic event — which would incur a prohibitively high overhead. 
Non-deterministic events occurring on the leader between observable events are logged and sent to the followers just before the leader executes an observable event. Consequently, it is guaranteed that the followers will reach the same state as the leader, and thus the effects of failures remain mostly local to the replicas. A prototype implementation called RAPIDS (Replicated Ada Partitions In Distributed Systems) serves as a proof of concept for this second approach, demonstrating its feasibility. RAPIDS is an Ada 95 implementation of a replication manager for semi-active replication for the GNAT development system for Ada 95. It is entirely contained within the run-time support and hence largely transparent for the application.
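The semi-active replication scheme described in this abstract can be illustrated with a small Python sketch. It is not RAPIDS and does not model Ada 95 partitions: the classes `Replica`, `Follower`, and `Leader` are hypothetical, non-determinism is simulated with a random number generator, and "sending a message" is reduced to appending to a list. What it does show is the key ordering: the leader logs its non-deterministic choices and forwards the log to the followers just before each observable event.

```python
import random


class Replica:
    """State machine whose state depends on a sequence of non-deterministic choices."""

    def __init__(self):
        self.state = 0

    def apply(self, choice):
        # The transition is deterministic once the choice is fixed.
        self.state = self.state * 31 + choice


class Follower(Replica):
    def replay(self, events):
        # Follow exactly the choices the leader made, in the same order.
        for choice in events:
            self.apply(choice)


class Leader(Replica):
    def __init__(self, followers, seed=0):
        super().__init__()
        self.followers = followers
        self.rng = random.Random(seed)  # simulated source of non-determinism
        self.log = []   # choices made since the last observable event
        self.sent = []  # messages that have left the replica group

    def nondeterministic_step(self):
        choice = self.rng.randint(0, 9)
        self.log.append(choice)
        self.apply(choice)

    def observable_event(self, message):
        # Synchronize followers first, then let the message leave.
        for follower in self.followers:
            follower.replay(self.log)
        self.log = []
        self.sent.append(message)
```

Between observable events the followers may lag behind the leader; they are only guaranteed to match it at the points where the outside world can see the leader's state.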
Simulation Modelling of Distributed-Shared Memory Multiprocessors
Institute for Computing Systems Architecture. Distributed shared memory (DSM) systems have been recognised as a compelling platform for parallel computing due to their programming advantages and scalability. DSM systems allow applications to access data in a logically shared address space by abstracting away the distinction of physical memory location. As the location of data is transparent, the sources of overhead caused by accessing distant memories are difficult to analyse. This memory locality problem has been identified as crucial to DSM performance. Many researchers have investigated the problem using simulation as a tool for conducting experiments, resulting in the progressive evolution of DSM systems. Nevertheless, both the diversity of architectural configurations and the rapid advance of DSM implementations impose constraints on simulation model design in two respects: the simulation framework limits model extensibility, and the lack of verification applicability during a simulation run delays the verification process.
This thesis studies simulation modelling techniques for memory locality analysis of various DSM systems implemented on top of a cluster of symmetric multiprocessors. The thesis presents a simulation technique to promote model extensibility and proposes a technique for verification applicability, called Specification-based Parameter Model Interaction (SPMI). The proposed techniques have been implemented in a new interpretation-driven simulation called DSiMCLUSTER on top of a discrete event simulation (DES) engine known as HASE. Experiments have been conducted to determine which factors are most influential on the degree of locality and to determine whether the stability of performance can be maximised.
DSiMCLUSTER has been validated against a SunFire 15K server and has achieved similar cache miss results, within ±6% on average and with a worst-case difference of less than 15%. These results confirm that the techniques used in developing DSiMCLUSTER can contribute ways to achieve both (a) a highly extensible simulation framework to keep up with the ongoing innovation of the DSM architecture, and (b) the verification applicability resulting in an efficient framework for memory analysis experiments on DSM architecture.
Speculative execution by using software transactional memory
Dissertation presented at the Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa to obtain the degree of Master in Informatics Engineering (Mestre em Engenharia Informática). Many programs sequentially execute operations that take a long time to complete. Some of these operations may return a highly predictable result. If this is the case, speculative execution can improve the overall performance of the program.
Speculative execution is the execution of code whose result may not be needed. Generally it is used as a performance optimization. Instead of waiting for the result of a costly operation, speculative execution can be used to predict the operation's most probable result and continue
executing based on this speculation. If the speculation is later confirmed to be correct, time has been gained. Otherwise, if the speculation is incorrect, the execution based on the speculation must abort and re-execute with the correct result.
In this dissertation we propose the design of an abstract process to add speculative execution to a program by doing source-to-source transformation. This abstract process is used in the definition of a mechanism and methodology that enable programmers to add speculative execution to the source code of programs. The abstract process is also used in the design of an automatic source-to-source transformation process that adds speculative execution to existing programs without user intervention. Finally, we also evaluate the performance impact of introducing speculative execution in database clients.
Existing proposals for the design of mechanisms to add speculative execution sacrificed portability in favor of performance. Some were designed to be implemented at kernel or hardware level. The process and mechanisms we propose in this dissertation can add speculative execution to the source of a program, independently of the kernel or hardware that is used.
From our experiments we have concluded that database clients can improve their performance by using speculative execution. Nothing in the system we propose limits it to the scope of database clients. Although this was the scope of the case study, we strongly believe that other programs can benefit from the proposed process and mechanisms for introduction of speculative execution.
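The speculate, validate, and re-execute cycle described in this abstract can be sketched in a few lines of Python. This is a simplified illustration, not the dissertation's mechanism: the function `speculate` is a hypothetical name, threads from the standard `concurrent.futures` module replace the source-to-source transformation, and the continuation is assumed to be free of side effects, so a wrong speculation can simply be discarded rather than rolled back with software transactional memory.

```python
from concurrent.futures import ThreadPoolExecutor


def speculate(slow_op, predicted, continuation):
    """Run `continuation` on a predicted result while `slow_op` is still running.

    Returns (value, hit), where `hit` tells whether the prediction was correct.
    Assumes `continuation` has no side effects, so a mispredicted run
    can be thrown away without any undo step.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        actual_future = pool.submit(slow_op)
        speculative_future = pool.submit(continuation, predicted)
        actual = actual_future.result()
        if actual == predicted:
            # Speculation confirmed: the continuation's work overlapped slow_op.
            return speculative_future.result(), True
        # Misprediction: drop the speculative result and re-execute.
        speculative_future.cancel()  # has no effect if it is already running
        return continuation(actual), False
```

For example, `speculate(fetch_row, cached_row, render)` with hypothetical `fetch_row` and `render` functions would overlap rendering with the database round trip whenever the cached row turns out to still be correct.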