Search CORE

15,347 research outputs found

On Integrating Error Detection into a Fault Diagnosis Algorithm for Massively Parallel Computers

Author: Altmann Jorn
Bartha Tamas
Pataricza Andras
Publication venue: IPDS1995
Publication date: 01/04/1995
Field of study

Scalable fault diagnosis is necessary for constructing fault tolerance mechanisms in large massively parallel multiprocessor systems. The diagnosis algorithm must operate efficiently even if the system consists of several thousand processors. In this paper we introduce an event-driven, distributed system-level diagnosis algorithm. It uses a small number of messages and is based on a general diagnosis model without the limitation of the number of simultaneously existing faults (an important requirement for massively parallel computers). The algorithm integrates both error detection techniques like messages, and built in hardware mechanisms. The structure of the implemented algorithm is presented, and the essential program modules are described. The paper also discusses the use of test results generated by error detection mechanisms for fault localization. Measurement results illustrate the effect of the diagnosis algorithm, in particular the error detection mechanism by messages, on the application performance.Supported by the EU (European Unit) as part of the Esprit Project 6731, Fault Tolerance for Massively Parallel Systems, and the Hungarian-German Joint Scientific Research Project #70 with additional support from OTKA-F007414

SNU Open Repository and Archive

A Pattern Language for High-Performance Computing Resilience

Author: Chung Jinsuk
Mohror Kathryn
Saridakis Titos
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 30/10/2017
Field of study

High-performance computing systems (HPC) provide powerful capabilities for modeling, simulation, and data analytics for a broad class of computational problems. They enable extreme performance of the order of quadrillion floating-point arithmetic calculations per second by aggregating the power of millions of compute, memory, networking and storage components. With the rapidly growing scale and complexity of HPC systems for achieving even greater performance, ensuring their reliable operation in the face of system degradations and failures is a critical challenge. System fault events often lead the scientific applications to produce incorrect results, or may even cause their untimely termination. The sheer number of components in modern extreme-scale HPC systems and the complex interactions and dependencies among the hardware and software components, the applications, and the physical environment makes the design of practical solutions that support fault resilience a complex undertaking. To manage this complexity, we developed a methodology for designing HPC resilience solutions using design patterns. We codified the well-known techniques for handling faults, errors and failures that have been devised, applied and improved upon over the past three decades in the form of design patterns. In this paper, we present a pattern language to enable a structured approach to the development of HPC resilience solutions. The pattern language reveals the relations among the resilience patterns and provides the means to explore alternative techniques for handling a specific fault model that may have different efficiency and complexity characteristics. Using the pattern language enables the design and implementation of comprehensive resilience solutions as a set of interconnected resilience patterns that can be instantiated across layers of the system stack.Comment: Proceedings of the 22nd European Conference on Pattern Languages of Program

arXiv.org e-Print Archive

Crossref

A fault-tolerant intelligent robotic control system

Author: Marzwell Neville I.
Tso Kam Sing
Publication venue
Publication date
Field of study

This paper describes the concept, design, and features of a fault-tolerant intelligent robotic control system being developed for space and commercial applications that require high dependability. The comprehensive strategy integrates system level hardware/software fault tolerance with task level handling of uncertainties and unexpected events for robotic control. The underlying architecture for system level fault tolerance is the distributed recovery block which protects against application software, system software, hardware, and network failures. Task level fault tolerance provisions are implemented in a knowledge-based system which utilizes advanced automation techniques such as rule-based and model-based reasoning to monitor, diagnose, and recover from unexpected events. The two level design provides tolerance of two or more faults occurring serially at any level of command, control, sensing, or actuation. The potential benefits of such a fault tolerant robotic control system include: (1) a minimized potential for damage to humans, the work site, and the robot itself; (2) continuous operation with a minimum of uncommanded motion in the presence of failures; and (3) more reliable autonomous operation providing increased efficiency in the execution of robotic tasks and decreased demand on human operators for controlling and monitoring the robotic servicing routines

NASA Technical Reports Server

Immunotronics - novel finite-state-machine architectures with built-in self-test using self-nonself differentiation

Author: Bradley D.W.
Tyrrell A.M.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2002
Field of study

A novel approach to hardware fault tolerance is demonstrated that takes inspiration from the human immune system as a method of fault detection. The human immune system is a remarkable system of interacting cells and organs that protect the body from invasion and maintains reliable operation even in the presence of invading bacteria or viruses. This paper seeks to address the field of electronic hardware fault tolerance from an immunological perspective with the aim of showing how novel methods based upon the operation of the immune system can both complement and create new approaches to the development of fault detection mechanisms for reliable hardware systems. In particular, it is shown that by use of partial matching, as prevalent in biological systems, high fault coverage can be achieved with the added advantage of reducing memory requirements. The development of a generic finite-state-machine immunization procedure is discussed that allows any system that can be represented in such a manner to be "immunized" against the occurrence of faulty operation. This is demonstrated by the creation of an immunized decade counter that can detect the presence of faults in real tim

CiteSeerX

Crossref

White Rose Research Online

Architectural mismatch tolerance

Author: A. Egyed
C. Gacek
D. Compare
D. Garlan
D.E. Perry
D.S. Roseblum
J.-C. Laprie
N. Medvidovic
P. Inverardi
R. Lemos de
R.N. Taylor
T. Anderson
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2003
Field of study

The integrity of complex software systems built from existing components is becoming more dependent on the integrity of the mechanisms used to interconnect these components and, in particular, on the ability of these mechanisms to cope with architectural mismatches that might exist between components. There is a need to detect and handle (i.e. to tolerate) architectural mismatches during runtime because in the majority of practical situations it is impossible to localize and correct all such mismatches during development time. When developing complex software systems, the problem is not only to identify the appropriate components, but also to make sure that these components are interconnected in a way that allows mismatches to be tolerated. The resulting architectural solution should be a system based on the existing components, which are independent in their nature, but are able to interact in well-understood ways. To find such a solution we apply general principles of fault tolerance to dealing with arch itectural mismatche

CiteSeerX

City Research Online

Crossref

Kent Academic Repository

Recommended from our members

Fault Tolerance Against Design Faults

Author: Strigini L.
Publication venue: 'Wiley'
Publication date: 01/01/2005
Field of study

City Research Online

Plug-and-Play Fault Detection and control-reconfiguration for a class of nonlinear large-scale constrained systems

Author: Boem F
Ferrari-Trecate G
Parisini T
Riverso S
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

This paper deals with a novel Plug-and-Play (PnP) architecture for the control and monitoring of Large-Scale Systems (LSSs). The proposed approach integrates a distributed Model Predictive Control (MPC) strategy with a distributed Fault Detection (FD) architecture and methodology in a PnP framework. The basic concept is to use the FD scheme as an autonomous decision support system: once a fault is detected, the faulty subsystem can be unplugged to avoid the propagation of the fault in the interconnected LSS. Analogously, once the issue has been solved, the disconnected subsystem can be re-plugged-in. PnP design of local controllers and detectors allow these operations to be performed safely, i.e. without spoiling stability and constraint satisfaction for the whole LSS. The PnP distributed MPC is derived for a class of nonlinear LSSs and an integrated PnP distributed FD architecture is proposed. Simulation results in two paradigmatic examples show the effectiveness and the potential of the general methodology

Archivio istituzionale della ricerca - Università di Trieste

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Archivio Istituzionale della Ricerca - Università degli Studi di Pavia

Spiral - Imperial College Digital Repository

Experimental analysis of computer system dependability

Author: Iyer Ravishankar, K.
Tang Dong
Publication venue
Publication date
Field of study

This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software-dependability, and fault diagnosis. The discussion involves several important issues studies in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance

NASA Technical Reports Server