
    Reliability Issues in Distributed Operating Systems

    Distributed systems span a wide spectrum in the design space. In this paper we will look at the various kinds and discuss some of the reliability issues involved. In the first half of the paper we will concentrate on the causes of unreliability, illustrating these with some general solutions and examples. Among the issues treated are interprocess communication, machine crashes, server redundancy, and data integrity. In the second half of the paper, we will examine one distributed operating system, Amoeba, to see how reliability issues have been handled in at least one real system, and how the pieces fit together. 1. INTRODUCTION It is difficult to get two computer scientists to agree on what a distributed system is. Rather than attempt to formulate a watertight definition, which is probably impossible anyway, we will divide these systems into three broad categories: - Closely coupled systems - Loosely coupled systems - Barely coupled systems The key issue that distinguishes these syst..

    Optimal Message Log Reclamation for Independent Checkpointing

Coordinated Science Laboratory was formerly known as Control Systems Laboratory. National Aeronautics and Space Administration / NASA NAG 1-613. Department of the Navy, managed by the Office of the Chief of Naval Research / N00014-91-J-128.

    Sender-Based Message Logging

Sender-based message logging, a low-overhead mechanism for providing transparent fault tolerance in distributed systems, is described. It differs from conventional message logging mechanisms in that each message is logged in volatile memory on the machine from which the message is sent. Keeping the message log in the sender's local memory allows recovery from a single failure at a time without the expense of synchronously logging each message to stable storage. The message log is then asynchronously written to stable storage, without delaying the computation, as part of the sender's periodic checkpoint. Maintaining the sender-based message log requires at most one extra network packet over non-fault-tolerant reliable message communication and imposes little additional synchronization delay. It can be applied transparently to existing distributed applications and does not require specialized hardware. It is currently being implemented on a network of Sun workstations.
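    The following is a minimal sketch of the mechanism this abstract describes: each outgoing message stays in the sender's volatile memory, tagged with the receive sequence number the receiver reports back, and reaches stable storage only as part of the sender's periodic checkpoint. The class and the `channel` interface are illustrative assumptions, not the paper's implementation.

    ```python
    import pickle
    import time


    class SenderLog:
        """Volatile, sender-side message log flushed only at checkpoint time (sketch)."""

        def __init__(self, checkpoint_path):
            self.volatile_log = []              # (dest, payload, rsn) tuples kept in RAM
            self.checkpoint_path = checkpoint_path

        def send(self, channel, dest, payload):
            channel.deliver(dest, payload)      # hypothetical transport call
            rsn = channel.ack_with_rsn(dest)    # receiver returns its receive sequence number
            self.volatile_log.append((dest, payload, rsn))   # no synchronous disk write here

        def checkpoint(self):
            # The log reaches stable storage only here, piggybacked on the periodic
            # checkpoint, so no individual send waits on stable storage.
            with open(self.checkpoint_path, "wb") as f:
                pickle.dump({"time": time.time(), "log": self.volatile_log}, f)
            self.volatile_log.clear()

        def replay_for(self, dest):
            # On a single failure of `dest`, resend the messages logged since its
            # last checkpoint, in receive-sequence-number order.
            return sorted((m for m in self.volatile_log if m[0] == dest), key=lambda m: m[2])
    ```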

    Performance comparison of hierarchical checkpoint protocols in grid computing

A grid infrastructure is a large set of geographically distributed nodes connected by a communication network. In this context, fault tolerance is a necessity imposed by the distribution, which raises a number of problems related to the heterogeneity of hardware, operating systems, networks, middleware, and applications, as well as resource dynamism, scalability, the lack of common memory, the lack of a common clock, and asynchronous communication between processes. To improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these system faults. Fault tolerance is intended to allow the system to provide service as specified in spite of the occurrence of faults, and it is an indispensable element of distributed systems. To meet this need, several techniques have been proposed in the literature. We study protocols based on rollback recovery. These protocols fall into two categories: coordinated checkpointing and rollback protocols, and log-based independent checkpointing protocols, also known as message logging protocols. However, the performance of a protocol depends on the characteristics of the system, the network, and the applications running. Faced with the constraints of large-scale environments, many algorithms from the literature have proved inadequate. Given an application environment and a system, it is not easy to identify the recovery protocol that is most appropriate for a cluster or a hierarchical environment such as grid computing. While some protocols have been used successfully at small scale, they are not suitable for use at large scale. Hence there is a need to implement these protocols in a hierarchical fashion and compare their performance in grid computing. In this paper, we propose hierarchical versions of four well-known protocols. We have implemented these protocols and compared their performance in clusters and grid computing using the OMNeT++ simulator.
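    As a point of reference for the first protocol family named in this abstract, here is a minimal, blocking sketch of coordinated checkpointing: a coordinator quiesces every process, collects one local checkpoint from each, and only then lets the computation resume, so the saved states form a single consistent recovery line. The coordinator/process API is purely illustrative and not taken from the paper.

    ```python
    class Coordinator:
        """Blocking coordinated-checkpointing sketch over a set of hypothetical processes."""

        def __init__(self, processes):
            self.processes = processes

        def take_global_checkpoint(self):
            for p in self.processes:            # phase 1: stop application message traffic
                p.quiesce()
            snapshots = {p.pid: p.local_checkpoint() for p in self.processes}  # phase 2
            for p in self.processes:            # phase 3: resume once all states are saved
                p.resume()
            return snapshots                    # consistent global state: no in-flight messages
    ```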

    Lazy Checkpoint Coordination for Bounding Rollback Propagation

Coordinated Science Laboratory was formerly known as Control Systems Laboratory. National Aeronautics and Space Administration / NASA NAG 1-613. Department of the Navy, managed by the Office of the Chief of Naval Research / N00014-91-J-128.

    On the Implementation and Use of Message Logging

We present a number of experiments showing that for compute-intensive applications executing in parallel on clusters of workstations, message logging has higher failure-free overhead than coordinated checkpointing. Message logging protocols, however, result in much shorter output latency than coordinated checkpointing. Therefore, message logging should be used for applications involving substantial interactions with the outside world, while coordinated checkpointing should be used otherwise. We also present an unorthodox message logging design that combines message logging with coordinated checkpointing, departing from the conventional approaches that use independent checkpointing. This combination offers several advantages, including improved failure-free performance, bounded recovery time, simplified garbage collection, and reduced complexity, while the new protocols retain the advantages of conventional message logging protocols with respect to output commit. Finally, we discuss three “lessons learned” from an implementation of various message logging protocols.
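    The output-latency comparison in this abstract rests on the output-commit rule: before anything is released to the outside world, every received message the output depends on must already be recoverable from stable storage. With message logging only a few log records need flushing; with coordinated checkpointing alone a full checkpoint would be required, hence the longer latency. The sketch below illustrates that rule under assumed names; it is not the paper's implementation.

    ```python
    import json


    class OutputCommitter:
        """Flush the volatile log of received messages before releasing any output (sketch)."""

        def __init__(self, stable_log_path):
            self.stable_log_path = stable_log_path
            self.unlogged = []                  # received messages not yet on stable storage

        def on_receive(self, msg):
            self.unlogged.append(msg)           # stays volatile until an output forces a flush

        def commit_output(self, output):
            if self.unlogged:                   # flush only the records this output depends on
                with open(self.stable_log_path, "a") as f:
                    for msg in self.unlogged:
                        f.write(json.dumps(msg) + "\n")
                self.unlogged.clear()
            print(output)                       # release to the outside world only after the flush
    ```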

    Keeping checkpoint/restart viable for exascale systems

Next-generation exascale systems, those capable of performing a quintillion operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults, like checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches across a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.
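    A small sketch of the hash-based incremental checkpointing idea mentioned in this abstract: the application state is split into fixed-size blocks, each block is hashed, and only blocks whose hash changed since the last checkpoint are written, shrinking the checkpoint commit time (probabilistic because a hash collision could mask an update). The paper offloads hashing to GPUs; this CPU-only version and its names are illustrative assumptions.

    ```python
    import hashlib


    def incremental_checkpoint(state: bytes, prev_hashes: dict, block_size: int = 4096):
        """Return (dirty_blocks, new_hashes): only blocks whose hash changed need writing."""
        new_hashes, dirty_blocks = {}, {}
        for offset in range(0, len(state), block_size):
            block = state[offset:offset + block_size]
            digest = hashlib.sha256(block).hexdigest()   # collision => missed update (probabilistic)
            new_hashes[offset] = digest
            if prev_hashes.get(offset) != digest:
                dirty_blocks[offset] = block             # only dirty blocks reach stable storage
        return dirty_blocks, new_hashes
    ```

    On the first call, with `prev_hashes` empty, every block is written (a full checkpoint); subsequent calls write only the blocks that changed since the previous one.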

    Reducing Space Overhead for Independent Checkpointing

Coordinated Science Laboratory was formerly known as Control Systems Laboratory. National Aeronautics and Space Administration / NASA NAG 1-613. Department of the Navy, managed by the Office of the Chief of Naval Research / N00014-91-J-128.