Critical fault patterns determination in fault-tolerant computer systems

Author: Losq J.
Mccluskey E. J.

The proposed method tries to enumerate all the critical fault-patterns (successive occurrences of failures) without analyzing every single possible fault. The conditions for the system to be operating in a given mode can be expressed in terms of the static states. Thus, one can find all the system states that correspond to a given critical mode of operation. The next step consists of analyzing the fault-detection mechanisms, the diagnosis algorithm, and the process of switch control. From them, one can find all the possible system configurations that can result from a failure occurrence. Thus, one can list all the characteristics, with respect to detection, diagnosis, and switch control, that failures must have to constitute critical fault-patterns. Such an enumeration of the critical fault-patterns can be directly used to evaluate the overall system tolerance to failures. Present research is focused on how to efficiently make use of these system-level characteristics to enumerate all the failures that satisfy these characteristics.

NASA Technical Reports Server

Online Fault Classification in HPC Systems through Machine Learning

Author: A Gainaru
Alessio Netti
C Engelmann
F Cappello
I Cohen
M Snir
O Tuncer
Z Lan
Publication date: 01/01/2019

As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures will be essential for continued operation. In this paper, we propose a fault classification method for HPC systems based on machine learning that has been designed specifically to operate with live streamed data. We cast the problem and its solution within realistic operating constraints of online use. Our results show that almost perfect classification accuracy can be reached for different fault types with low computational overhead and minimal delay. We have based our study on a local dataset, which we make publicly available, that was acquired by injecting faults into an in-house experimental HPC system. Comment: Accepted for publication at the Euro-Par 2019 conference

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Pisa

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Case study: Bio-inspired self-adaptive strategy for spike-based PID controller

Author: Harkin Jim
Jiménez Fernández Ángel Francisco
Linares Barranco Alejandro
Liu Junxiu
McDaid Liam
McElholm Malachy
Publication venue: IEEE Computer Society
Publication date: 01/01/2015

A key requirement for modern large-scale neuromorphic systems is the ability to detect and diagnose faults and to explore self-correction strategies. In particular, this must be done under area constraints that meet the scalability requirements of large neuromorphic systems. A bio-inspired online fault detection and self-correction mechanism for neuro-inspired PID controllers is presented in this paper. This strategy employs a fault detection unit for online testing of the PID controller, a fault detection manager to perform the detection procedure across multiple controllers, and a controller selection mechanism to select an available fault-free controller to provide a corrective step in restoring system functionality. The novelty of the proposed work is that the fault detection method, using synapse models with excitatory and inhibitory responses, is applied to a robotic spike-based PID controller. The results are presented for robotic motor controllers and show that the proposed bio-inspired self-detection and self-correction strategy can detect faults and re-allocate resources to restore the controller’s functionality. In particular, the case study demonstrates the compactness (~1.4% area overhead) of the fault detection mechanism for large-scale robotic controllers. Ministerio de Economía y Competitividad TEC2012-37868-C04-0

idUS. Depósito de Investigación Universidad de Sevilla

Fault isolation detection expert (FIDEX). Part 1: Expert system diagnostics for a 30/20 Gigahertz satellite transponder

Author: Durkin John
Schlegelmilch Richard
Tallo Donald

LeRC has recently completed the design of a Ka-band satellite transponder system, as part of the Advanced Communication Technology Satellite (ACTS) System. To enhance the reliability of this satellite, NASA funded the University of Akron to explore the application of an expert system to provide the transponder with an autonomous diagnosis capability. The result of this research was the development of a prototype diagnosis expert system called FIDEX (fault-isolation and diagnosis expert). FIDEX is a frame-based expert system that was developed in the NEXPERT Object development environment by Neuron Data, Inc. It is a Microsoft Windows version 3.0 application, and was designed to operate on an Intel i80386 based personal computer system.

NASA Technical Reports Server

Parallel Architectures for Planetary Exploration Requirements (PAPER)

Author: Cezzar Ruknet
Sen Ranjan K.

The Parallel Architectures for Planetary Exploration Requirements (PAPER) project is essentially research oriented towards technology insertion issues for NASA's unmanned planetary probes. It was initiated to complement and augment the long-term efforts for space exploration with particular reference to NASA/LaRC's (NASA Langley Research Center) research needs for planetary exploration missions of the mid and late 1990s. The requirements for space missions as given in the somewhat dated Advanced Information Processing Systems (AIPS) requirements document are contrasted with the new requirements from JPL/Caltech involving sensor data capture and scene analysis. It is shown that more stringent requirements have arisen as a result of technological advancements. Two possible architectures, the AIPS Proof of Concept (POC) configuration and the MAX Fault-tolerant dataflow multiprocessor, were evaluated. The main observation was that the AIPS design is biased towards fault tolerance and may not be an ideal architecture for planetary and deep space probes due to high cost and complexity. The MAX concept appears to be a promising candidate, except that more detailed information is required. The feasibility of adding neural computation capability to this architecture needs to be studied. Key impact issues for architectural design of computing systems meant for planetary missions were also identified.

NASA Technical Reports Server

Improving root cause analysis through the integration of PLM systems with cross supply chain maintenance data

Author: Broome S
Madenas N
Peachey S
Tiwari Ashutosh
Turner Christopher J.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 21/09/2015

The purpose of this paper is to demonstrate a system architecture for integrating Product Lifecycle Management (PLM) systems with cross supply chain maintenance information to support root-cause analysis. By integrating product data from PLM systems with warranty claims, vehicle diagnostics and technical publications, engineers were able to improve the root-cause analysis and close the information gaps. Data collection was achieved via in-depth semi-structured interviews and workshops with experts from the automotive sector. Unified Modelling Language (UML) diagrams were used to design the proposed system architecture. A user scenario is also presented to demonstrate the functionality of the system.

Crossref

Cranfield CERES

Surrey Research Insight

Design methods for fault-tolerant navigation computers

Author: Avizienis A.

Design methods for fault tolerant navigation computer

NASA Technical Reports Server

Experimental analysis of computer system dependability

Author: Iyer Ravishankar, K.
Tang Dong

This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion involves several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.

NASA Technical Reports Server
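
The abstract above names importance sampling as a way to accelerate Monte Carlo simulation. The following is a minimal sketch of the general idea only, not the method or tooling of the reviewed paper. It assumes a toy rare event (a standard normal variable exceeding a threshold `t`) and a shifted proposal distribution N(t, 1); the function names are hypothetical and chosen for illustration.

```python
import math
import random


def normal_tail_exact(t):
    """Exact P(Z > t) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def naive_estimate(t, n, rng):
    """Plain Monte Carlo: fraction of standard normal draws that exceed t."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > t)
    return hits / n


def importance_estimate(t, n, rng):
    """Importance sampling: draw from N(t, 1) so the rare event is common,
    then reweight each hit by the density ratio N(0, 1) / N(t, 1)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(t, 1.0)
        if x > t:
            # phi(x) / phi(x - t) simplifies to exp(-t*x + t^2/2).
            total += math.exp(-t * x + 0.5 * t * t)
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)
    t, n = 4.0, 20000
    print("exact     :", normal_tail_exact(t))
    print("naive MC  :", naive_estimate(t, n, rng))
    print("importance:", importance_estimate(t, n, rng))
```

With a rare event of this size, the plain estimator typically sees zero or one hit in 20,000 draws, whereas the reweighted estimator uses roughly half of its draws, which is the acceleration the abstract refers to.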

Advanced flight control system study

Author: Hartmann G. L.
Lee H. P.
Ng W. K.
Rang E. R.
Schulte R. W.
Wall J. E., Jr.

A fly-by-wire flight control system architecture designed for high reliability includes spare sensor and computer elements to permit safe dispatch with failed elements, thereby reducing unscheduled maintenance. A methodology capable of demonstrating that the architecture does achieve the predicted performance characteristics consists of a hierarchy of activities ranging from analytical calculations of system reliability and formal methods of software verification to iron bird testing followed by flight evaluation. Interfacing this architecture to the Lockheed S-3A aircraft for flight test is discussed. This testbed vehicle can be expanded to support flight experiments in advanced aerodynamics, electromechanical actuators, secondary power systems, flight management, new displays, and air traffic control concepts.

NASA Technical Reports Server
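
The abstract above mentions spare elements and analytical calculations of system reliability. As a hedged illustration of what such a calculation can look like, the sketch below uses the textbook k-out-of-n model with independent, identical units. This model, the function name, and the example numbers are assumptions made here for illustration; the abstract does not state which reliability model the study used.

```python
from math import comb


def k_of_n_reliability(k, n, r):
    """Probability that at least k of n independent units are working,
    when each unit works with probability r (binomial model)."""
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    r = 0.9  # hypothetical per-unit reliability
    print("2-of-3:", k_of_n_reliability(2, 3, r))
    print("2-of-4:", k_of_n_reliability(2, 4, r))  # one extra spare
```

Under these assumptions, adding a spare (2-of-4 instead of 2-of-3) raises the probability that enough units remain, which is the intuition behind dispatching safely with a failed element.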

Fault Tolerance Against Design Faults

Author: Strigini L.
Publication venue: 'Wiley'
Publication date: 01/01/2005

City Research Online