
365 research outputs found

A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

Author: Treaster Michael
Publication date: 31/12/2004

Supercomputing systems today often take the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. To improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these fault-tolerance techniques. Comment: 11 pages

arXiv.org e-Print Archive

CiteSeerX

Chameleon: A Software Infrastructure and Testbed for Reliable High-Speed Networked Computing

Author: Bagchi S.
Iyer R.K.
Kalbarczyk Z.
Publication venue: Coordinated Science Laboratory, University of Illinois at Urbana-Champaign
Publication date: 01/07/1997

Coordinated Science Laboratory was formerly known as Control Systems Laboratory. NASA / NAG 1-61

Illinois Digital Environment for Access to Learning and Scholarship Repository

A Smart Voting Subsystem for Distributed Fault Tolerance

Author: Iyer R.K.
Rotondi G.
Publication venue: Coordinated Science Laboratory, University of Illinois at Urbana-Champaign
Publication date: 01/11/2000

Coordinated Science Laboratory was formerly known as Control Systems Laboratory.

Illinois Digital Environment for Access to Learning and Scholarship Repository

Fault-Injection-Based Assessment of Fail-Silence Provided by Process Duplication versus Internal Error Detection in Scientific-Based Applications

Author: Bagchi Saurabh
Iyer Ravishankar K.
Kalbarczyk Zbigniew
Speirs Neil A.
Stott David T.
Whisnant Keith
Xu Jun
Publication venue: Coordinated Science Laboratory, University of Illinois at Urbana-Champaign
Publication date: 01/11/2000

Coordinated Science Laboratory was formerly known as Control Systems Laboratory.

Illinois Digital Environment for Access to Learning and Scholarship Repository

An Analysis of Failure Handling in Chameleon, A Framework for Supporting Cost-Effective Fault Tolerant Services

Author: Haakensen Erik Edward

The desire for low-cost reliable computing is increasing. Most current fault-tolerant computing solutions are not very flexible, i.e., they cannot adapt to the reliability requirements of newly emerging applications in business, commerce, and manufacturing. It is important that users have a flexible, reliable platform to support both critical and noncritical applications. Chameleon, under development at the Center for Reliable and High-Performance Computing at the University of Illinois, is a software framework for supporting cost-effective, adaptable, networked fault-tolerant services. This thesis details a simulation of fault injection, detection, and recovery in Chameleon. The simulation was written in C++ using the DEPEND simulation library. The results obtained from the simulation included the amount of overhead incurred by the fault detection and recovery mechanisms supported by Chameleon. In addition, information was gained about fault scenarios from which Chameleon cannot recover. The results of the simulation showed that both critical and noncritical applications can be executed in the Chameleon environment with a fairly small amount of overhead. No single point of failure from which Chameleon could not recover was found. Chameleon was also found to be capable of recovering from several multiple-failure scenarios.

NASA Technical Reports Server

ITR/SY: a distributed programming infrastructure for integrating smart sensors

Author: DeWeerth Stephen P.
Hutto Phil
Mackenzie Kenneth
Ramachandran Umakishore
Rehg James M.
Starner Thad
Wolenetz Matt
Publication venue: Georgia Institute of Technology
Publication date: 30/11/2009

Issued as final report. National Science Foundation (U.S.)

Scholarly Materials And Research @ Georgia Tech

Enhancing Planning-Based Adaptation Middleware with Support for Dependability: a Case Study

Author: Eliassen Frank
Rouvoy Romain
Vitenberg Roman
Publication venue: European Association of Software Science and Technology
Publication date: 02/06/2008

Recent advances in mobile devices have opened up new opportunities for building advanced mobile applications. In particular, these applications are capable of discovering and exploiting software and hardware resources that are made available in their environment. A possible approach for supporting these ubiquitous interactions consists in adapting the mobile application to reflect the functionalities that are provided by the environment. However, these approaches often fail to offer a sufficient degree of resilience to potential device, network, and software failures, which are particularly frequent in ubiquitous environments. Therefore, the contribution of this paper is to integrate the dependability concern into the process of mobile application adaptation. In particular, we propose to reflect dependability mechanisms as alternative configurations for a given application. This reflection allows the planning-based adaptation middleware to decide automatically, based on contextual information, whether or not to enable the support for dependability.

Electronic Communications of the EASST (European Association of Software Science and Technology)


Fault tolerance via diversity for off-the-shelf products: A study with SQL database servers

Author: Gashi I.
Popov P. T.
Strigini L.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/10/2007

If an off-the-shelf software product exhibits poor dependability due to design faults, then software fault tolerance is often the only way available to users and system integrators to alleviate the problem. Thanks to low acquisition costs, even using multiple versions of software in a parallel architecture, a scheme formerly reserved for a few highly critical applications, may become viable for many applications. We have studied the potential dependability gains from these solutions for off-the-shelf database servers. We based the study on the bug reports available for four off-the-shelf SQL servers plus later releases of two of them. We found that many of these faults cause systematic noncrash failures, a category ignored by most studies and standard implementations of fault tolerance for databases. Our observations suggest that diverse redundancy would be effective for tolerating design faults in this category of products. Only in very few cases would demands that triggered a bug in one server cause failures in another one, and there were no coincident failures in more than two of the servers. Use of different releases of the same product would also tolerate a significant fraction of the faults. We report our results and discuss their implications, the architectural options available for exploiting them, and the difficulties that they may present.

City Research Online

Crossref

Run Time Models in Adaptive Service Infrastructure

Author: A. Bertolino
D. Hirsch
D. Le Metayer
F. Budinsky
G. Taentzer
G.C. Necula
I. Georgiadis
J. Aldrich
J. Magee
J.-Y. Hong
L. Baresi
M. Autili
M. Baldauf
M. Caporuscio
R. Allen
R. Hirschfeld
S. Balsamo
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2010

Software in the near ubiquitous future will need to cope with variability, as software systems get deployed on an increasingly large diversity of computing platforms and operate in different execution environments. Heterogeneity of the underlying communication and computing infrastructure, mobility inducing changes to the execution environments and therefore changes to the availability of resources, and continuously evolving requirements all require software systems to be adaptable according to the context changes. Software systems should also be reliable and meet the user's requirements and needs. Moreover, due to their pervasiveness, software systems must be dependable. Supporting the validation of these self-adaptive systems to ensure dependability requires a complete rethinking of the software life cycle. The traditional division between static analysis and dynamic analysis is blurred by the need to validate dynamic system adaptation. Models play a key role in the validation of dependable systems; dynamic adaptation calls for the use of such models at run time. In this paper we describe the approach we have undertaken in recent projects to address the challenge of assessing dependability for adaptive software systems.

Crossref

Hal-Diderot