Study of fault-tolerant software technology
Presented is an overview of the current state of the art of fault-tolerant software and an analysis of the quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the implications of using fault-tolerant software in real-time aerospace applications for computer architecture and design, including hardware, operating systems, and programming languages (such as Ada). It concludes that fault-tolerant software has progressed beyond the pure research stage. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to implement software fault tolerance effectively and efficiently.
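The abstract itself contains no code, but the classic language-level construct that such work builds on is the recovery block: run a primary routine, check its result with an acceptance test, and fall back to an alternate if the test fails. A minimal sketch in C (the routines and the acceptance test below are illustrative assumptions, not taken from the paper):

    #include <stdio.h>
    #include <math.h>

    /* Acceptance test: the result must be finite and non-negative. */
    static int acceptable(double result) {
        return isfinite(result) && result >= 0.0;
    }

    /* Primary algorithm: fast, but may fail for some inputs. */
    static double primary(double x) {
        return sqrt(x);            /* returns NaN for x < 0 */
    }

    /* Alternate algorithm: slower but more defensive. */
    static double alternate(double x) {
        return x < 0.0 ? 0.0 : sqrt(x);
    }

    /* Recovery block: try the primary, fall back to the alternate
     * if the acceptance test rejects the primary's result. */
    static double recovery_block(double x) {
        double r = primary(x);
        if (acceptable(r))
            return r;
        return alternate(x);       /* a caller could also signal failure here */
    }

    int main(void) {
        printf("%f\n", recovery_block(4.0));   /* primary succeeds */
        printf("%f\n", recovery_block(-1.0));  /* alternate used */
        return 0;
    }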
Software safety : a definition and some preliminary thoughts
Software safety is the subject of a research project in its initial stages at the University of California, Irvine. This research deals with critical real-time software where the cost of an error is high, e.g. the loss of human life. In this paper, software techniques having a bearing on safety are described and evaluated. Initial definitions of software safety concepts are presented, along with some preliminary thoughts and research questions.
Intelligent fault management for the Space Station active thermal control system
The Thermal Advanced Automation Project (TAAP) approach and architecture are described for automating the Space Station Freedom (SSF) Active Thermal Control System (ATCS). The baseline functionality and advanced automation techniques for Fault Detection, Isolation, and Recovery (FDIR) are compared and contrasted. Advanced automation techniques such as rule-based systems and model-based reasoning should be utilized to efficiently control, monitor, and diagnose this extremely complex physical system. TAAP is developing advanced FDIR software for use on the SSF thermal control system. The goal of TAAP is to join Knowledge-Based System (KBS) technology, using a combination of rules and model-based reasoning, with conventional monitoring and control software in order to maximize autonomy of the ATCS. TAAP's predecessor was NASA's Thermal Expert System (TEXSYS) project, which was the first large real-time expert system to use both extensive rules and model-based reasoning to control and perform FDIR on a large, complex physical system. TEXSYS showed that a method is needed for safely and inexpensively testing all possible faults of the ATCS, particularly those potentially damaging to the hardware, in order to develop a fully capable FDIR system. TAAP therefore includes the development of a high-fidelity simulation of the thermal control system. The simulation provides realistic, dynamic ATCS behavior and fault-insertion capability for software testing without hardware-related risks or expense. In addition, thermal engineers will gain greater confidence in the KBS FDIR software than was possible prior to this kind of simulation testing. The TAAP KBS will initially be a ground-based extension of the baseline ATCS monitoring and control software and could be migrated on-board as additional computation resources become available.
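As an illustration of the rule-based FDIR style described above, the sketch below checks simulated thermal-loop telemetry against simple limits and reports a fault; the injected samples mimic fault insertion in a simulation. The signal names, limits, and rules are invented for illustration and are not TAAP's:

    #include <stdio.h>

    /* Hypothetical telemetry sample from a thermal-loop simulation. */
    struct sample {
        double outlet_temp_c;   /* coolant outlet temperature */
        double pump_speed_rpm;  /* pump speed */
    };

    /* Simple rules: a hot outlet while the pump is running suggests a
     * degraded heat exchanger; a stopped pump is a fault on its own
     * (illustrative rules only). */
    static const char *diagnose(const struct sample *s) {
        if (s->outlet_temp_c > 45.0 && s->pump_speed_rpm > 1000.0)
            return "FAULT: possible heat-exchanger degradation";
        if (s->pump_speed_rpm < 100.0)
            return "FAULT: pump stopped or sensor loss";
        return "nominal";
    }

    int main(void) {
        struct sample runs[] = {
            { 30.0, 2400.0 },   /* nominal behaviour */
            { 52.0, 2400.0 },   /* injected over-temperature fault */
            { 31.0,   20.0 },   /* injected pump failure */
        };
        for (int i = 0; i < 3; i++)
            printf("sample %d: %s\n", i, diagnose(&runs[i]));
        return 0;
    }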
Distributed systems status and control
Concepts are investigated for an automated status and control system for a distributed processing environment. System characteristics, data requirements for health assessment, data acquisition methods, system diagnosis methods, and control methods were investigated in an attempt to determine the high-level requirements for a system which can be used to assess the health of a distributed processing system and implement control procedures to maintain an accepted level of health for the system. A potential concept for automated status and control includes the use of expert system techniques to assess the health of the system, detect and diagnose faults, and initiate or recommend actions to correct the faults. Therefore, this research also included an investigation of methods by which expert systems are developed for real-time environments and distributed systems. The focus is on the features required by real-time expert systems and the tools available to develop them.
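A minimal sketch of one way such health assessment might be performed, assuming a heartbeat-style check over the distributed nodes (the node names and the timeout are illustrative, not from the report):

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical per-node health record kept by the status monitor. */
    struct node_status {
        const char *name;
        time_t last_heartbeat;   /* time the last heartbeat was received */
    };

    #define HEARTBEAT_TIMEOUT 5  /* seconds without a heartbeat => suspect */

    /* Health assessment rule: a node is healthy while heartbeats are recent;
     * otherwise flag it and trigger diagnosis/control actions. */
    static void assess(const struct node_status *n, time_t now) {
        if (now - n->last_heartbeat > HEARTBEAT_TIMEOUT)
            printf("%s: SUSPECT (no heartbeat for %ld s) -> initiate diagnosis\n",
                   n->name, (long)(now - n->last_heartbeat));
        else
            printf("%s: healthy\n", n->name);
    }

    int main(void) {
        time_t now = time(NULL);
        struct node_status nodes[] = {
            { "node-a", now - 1 },    /* recent heartbeat */
            { "node-b", now - 12 },   /* stale heartbeat */
        };
        for (int i = 0; i < 2; i++)
            assess(&nodes[i], now);
        return 0;
    }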
Seal integrity and the impact on food waste
An investigation into the contribution that inadequate heat sealing of food packaging might make to the generation of food waste in the supply chain and the household.
A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems
Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques.
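Checkpoint/restart is one of the most widely used of the surveyed resilience techniques; the following is a minimal, illustrative sketch of the idea (the file name and state layout are assumptions, not taken from the survey):

    #include <stdio.h>

    /* Hypothetical application state to protect with checkpointing. */
    struct state {
        long iteration;
        double partial_sum;
    };

    /* Write the state to stable storage so a restarted process can resume. */
    static int save_checkpoint(const struct state *s, const char *path) {
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        size_t ok = fwrite(s, sizeof *s, 1, f);
        fclose(f);
        return ok == 1 ? 0 : -1;
    }

    /* Try to resume from an earlier checkpoint; return 0 if none exists. */
    static int load_checkpoint(struct state *s, const char *path) {
        FILE *f = fopen(path, "rb");
        if (!f) return 0;
        int ok = fread(s, sizeof *s, 1, f) == 1;
        fclose(f);
        return ok;
    }

    int main(void) {
        struct state s = { 0, 0.0 };
        load_checkpoint(&s, "demo.ckpt");           /* resume if possible */
        for (; s.iteration < 1000000; s.iteration++) {
            s.partial_sum += 1.0 / (s.iteration + 1);
            if (s.iteration % 100000 == 0)          /* periodic checkpoint */
                save_checkpoint(&s, "demo.ckpt");
        }
        printf("sum = %f\n", s.partial_sum);
        return 0;
    }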
Correct and Control Complex IoT Systems: Evaluation of a Classification for System Anomalies
In practice, there are deficiencies in precise inter-team communication about system anomalies when troubleshooting and postmortem analysis are performed across the different teams operating complex IoT systems. We evaluate the quality in use of an adaptation of IEEE Std. 1044-2009 whose objective is to differentiate the handling of fault detection and fault reaction from the handling of a defect and the options for its correction. We extended the scope of IEEE Std. 1044-2009 from anomalies related to software only to anomalies related to complex IoT systems. To evaluate the quality in use of our classification, a study was conducted at Robert Bosch GmbH. We applied our adaptation to a postmortem analysis of an IoT solution and evaluated the quality in use by conducting interviews with three stakeholders. Our adaptation was applied effectively, and it enhanced inter-team communication as well as iterative and inductive learning for product improvement; further training and practice are nevertheless required.

Comment: Submitted to QRS 2020 (IEEE Conference on Software Quality, Reliability and Security).
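To make the intended distinction concrete, an anomaly record that separates the run-time fault reaction from the development-time defect decision might look like the sketch below; the fields and enumerations are our illustration of the idea, not the authors' adapted IEEE 1044 schema:

    #include <stdio.h>

    /* Illustrative anomaly record separating run-time fault handling
     * from development-time defect correction. */
    enum fault_reaction  { REACTION_NONE, REACTION_FAILOVER, REACTION_RESTART };
    enum defect_decision { DEFECT_OPEN, DEFECT_FIX_PLANNED, DEFECT_WONT_FIX };

    struct anomaly {
        const char *id;
        const char *detected_by;            /* which team or monitor saw it */
        enum fault_reaction  reaction;      /* what the running system did */
        enum defect_decision correction;    /* what development decided */
    };

    int main(void) {
        struct anomaly a = {
            "IOT-0042", "edge-gateway watchdog",
            REACTION_RESTART,      /* fault reaction: service restarted */
            DEFECT_FIX_PLANNED     /* defect correction: patch scheduled */
        };
        printf("%s detected by %s: reaction=%d, correction=%d\n",
               a.id, a.detected_by, a.reaction, a.correction);
        return 0;
    }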
Instrumenting self-modifying code
Adding small code snippets at key points to existing code fragments is called instrumentation. It is an established technique for debugging certain otherwise hard-to-solve faults, such as memory management issues and data races. Dynamic instrumentation can already be used to analyse code that is loaded or even generated at run time. With the advent of environments such as the Java Virtual Machine with optimizing Just-In-Time compilers, a new obstacle arises: self-modifying code. In order to instrument this kind of code correctly, one must be able to detect modifications and adapt the instrumentation code accordingly, preferably without incurring a high speed penalty. In this paper we propose an innovative technique that uses the hardware page-protection mechanism of modern processors to detect such modifications. We also show how an instrumentor can adapt the instrumented version depending on the kind of modification, and we present an experimental evaluation of these techniques.

Comment: In M. Ronsse, K. De Bosschere (eds.), Proceedings of the Fifth International Workshop on Automated Debugging (AADEBUG 2003), September 2003, Ghent. cs.SE/030902
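The page-protection idea can be illustrated with a minimal Linux/C sketch: write-protect the region holding generated code, catch the SIGSEGV raised when that code is modified, record that re-instrumentation is needed, and let the write proceed. This is a simplified illustration of the mechanism, not the authors' instrumentor:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Page-aligned buffer standing in for a region of generated code. */
    static unsigned char *code;
    static size_t page_size;

    /* Set when a write to the protected "code" region is detected. */
    static volatile sig_atomic_t code_modified = 0;

    /* On a write fault inside the protected region, record the modification
     * and make the page writable again so the faulting store can complete. */
    static void on_segv(int sig, siginfo_t *info, void *ctx) {
        (void)sig; (void)ctx;
        unsigned char *addr = (unsigned char *)info->si_addr;
        if (addr >= code && addr < code + page_size) {
            code_modified = 1;                      /* re-instrumentation needed */
            mprotect(code, page_size, PROT_READ | PROT_WRITE);
            return;                                 /* retry the faulting write */
        }
        _exit(1);                                   /* unrelated fault */
    }

    int main(void) {
        page_size = (size_t)sysconf(_SC_PAGESIZE);
        code = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (code == MAP_FAILED) return 1;

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_segv;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* Write-protect the region: any later modification traps. */
        mprotect(code, page_size, PROT_READ);

        code[0] = 0x90;   /* simulated self-modification -> SIGSEGV -> handler */

        printf("modification detected: %d\n", (int)code_modified);
        return 0;
    }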