Search CORE

1,894 research outputs found

Computing in the RAIN: a reliable array of independent nodes

Author: Bohossian Vasken
Bruck Jehoshua
Fan Chenggong C.
LeMahieu Paul S.
Riedel Marc D.
Xu Lihao
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1998
Field of study

The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN-technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly-available video server, a highly-available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology

CiteSeerX

Caltech Authors

FRIENDS - A flexible architecture for implementing fault tolerant and secure distributed applications

Author: Fabre Jean-Charles
Pérennou Tanguy
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/1996
Field of study

FRIENDS is a software-based architecture for implementing fault-tolerant and, to some extent, secure applications. This architecture is composed of sub-systems and libraries of metaobjects. Transparency and separation of concerns is provided not only to the application programmer but also to the programmers implementing metaobjects for fault tolerance, secure communication and distribution. Common services required for implementing metaobjects are provided by the sub-systems. Metaobjects are implemented using object-oriented techniques and can be reused and customised according to the application needs, the operational environment and its related fault assumptions. Flexibility is increased by a recursive use of metaobjects. Examples and experiments are also described

CiteSeerX

Open Archive Toulouse Archive Ouverte

Operating System Support for Redundant Multithreading

Author: Döbel Björn
Publication venue
Publication date: 25/11/2014
Field of study

Failing hardware is a fact and trends in microprocessor design indicate that the fraction of hardware suffering from permanent and transient faults will continue to increase in future chip generations. Researchers proposed various solutions to this issue with different downsides: Specialized hardware components make hardware more expensive in production and consume additional energy at runtime. Fault-tolerant algorithms and libraries enforce specific programming models on the developer. Compiler-based fault tolerance requires the source code for all applications to be available for recompilation. In this thesis I present ASTEROID, an operating system architecture that integrates applications with different reliability needs. ASTEROID is built on top of the L4/Fiasco.OC microkernel and extends the system with Romain, an operating system service that transparently replicates user applications. Romain supports single- and multi-threaded applications without requiring access to the application's source code. Romain replicates applications and their resources completely and thereby does not rely on hardware extensions, such as ECC-protected memory. In my thesis I describe how to efficiently implement replication as a form of redundant multithreading in software. I develop mechanisms to manage replica resources and to make multi-threaded programs behave deterministically for replication. I furthermore present an approach to handle applications that use shared-memory channels with other programs. My evaluation shows that Romain provides 100% error detection and more than 99.6% error correction for single-bit flips in memory and general-purpose registers. At the same time, Romain's execution time overhead is below 14% for single-threaded applications running in triple-modular redundant mode. The last part of my thesis acknowledges that software-implemented fault tolerance methods often rely on the correct functioning of a certain set of hardware and software components, the Reliable Computing Base (RCB). I introduce the concept of the RCB and discuss what constitutes the RCB of the ASTEROID system and other fault tolerance mechanisms. Thereafter I show three case studies that evaluate approaches to protecting RCB components and thereby aim to achieve a software stack that is fully protected against hardware errors

Technische Universität Dresden: Qucosa

Lessons from FTM: an Experiment in the Design and Implementation of a Low Cost Fault Tolerant System

Author: Banâtre Michel
Hue Mireille
Muller Gilles
Peyrouze Nadine
Rochat Bruno
Publication venue: HAL CCSD
Publication date: 01/01/1995
Field of study

This report describes an experiment in the design of a general purpose fault tolerant system, FTM. The main objective of the FTM design was to implement a "low-cost" fault tolerant system that could be used on standard workstations. At the operating system level, our goal was to provide a methodology for the design of modular reliable operating systems, while offering fault tolerance transparency to user applications. In other words, porting an application to FTM had only to require compiling the source code without having to modify it. These objectives were achieved using the Mach micro-kernel and a modular set of reliable servers which implement application checkpoints and provide continuous system functions despite machine crashes. At the architectural level, our approach relies on a high performance stable storage implementation, called Stable Transactional Memory (STM), which can be implemented either by hardware or software. We first motivate our design choices, then we detail the FTM implementation at both architectural and operating system level. We comment on the reasons for the evolution of our stable memory technology from hardware to software. Finally, we present a performance evaluation of the FTM prototype. We conclude with lessons learned and give some assessments

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

HAL-Rennes 1

Validation of a software dependability tool via fault injection experiments

Author: Benso Alfredo
Di Carlo Stefano
Di Natale Giorgio
Prinetto Paolo Ernesto
Tagliaferri Luca
Publication venue: IEEE Computer Society
Publication date: 01/01/2001
Field of study

Presents the validation of the strategies employed in the RECCO tool to analyze a C/C++ software; the RECCO compiler scans C/C++ source code to extract information about the significance of the variables that populate the program and the code structure itself. Experimental results gathered on an Open Source Router are used to compare and correlate two sets of critical variables, one obtained by fault injection experiments, and the other applying the RECCO tool, respectively. Then the two sets are analyzed, compared, and correlated to prove the effectiveness of RECCO's methodology

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Jiko kaifukugata operetingu shisutemu kochiku furemu waku

Author: Ishikawa Hiroo
Publication venue
Publication date: 01/01/2009
Field of study

制度:新 ; 報告番号:甲2786号 ; 学位の種類:博士(工学) ; 授与年月日:2009/2/25 ; 早大学位記番号:新500

Waseda University Repository

Efficient Reliable Group Communication for Distributed Systems

Author: Kaashoek M.F.
Tanenbaum A.S.
Publication venue
Publication date: 01/01/1992
Field of study

Many applications can profit from broadcast communication, but few operating systems provide primitives that make broadcast communication available to user applications. In this paper we introduce primitives for broadcast communication that have been integrated with the Amoeba distributed operating system. The semantics of the broadcast primitives are simple, powerful, and easy to understand. Our primitives, for example, guarantee total ordering of broadcast messages. The proposed primitives are also efficient: if a network supports physical multicast, a reliable broadcast can be done in just slightly more than two messages on the average, so, the performance of a reliable broadcast is roughly comparable to that of a remote procedure call. In addition, the primitives are flexible: user applications can, for example, trade performance against fault tolerance. 1

CiteSeerX

VU Research Portal