548 research outputs found

Recovery blocks for communicating systems

Authors: Ciciani B., Velardi P.
Publication date: 01/01/1983

In many practical applications of real-time computing (avionics, switching systems), a message-passing inter-process communication approach is adopted for both modularity and reliability. This paper examines the problem of adding fault tolerance to a message-passing multiprocess environment. Recovery block implementation schemes for both asynchronous and synchronous communication are proposed, with the aim of avoiding domino effects and exploiting the message-oriented system structure. When a sender process produces a message, system procedures perform an acceptance test on the message and then, in sequence: i) transfer the message to the receiving process's working memory, ii) save the present process status or, in case of error, restore some previous process status, and iii) discard status information that is no longer needed.
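
The three-step sequence described in this abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Process` class, the `send` function, the `keep` parameter, and the use of a list as the receiver's working memory are all hypothetical names and simplifications chosen for illustration, and the sketch ignores concurrency entirely.

```python
import copy


class Process:
    """Hypothetical process with explicit state and a list of saved statuses."""

    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.inbox = []  # stands in for the receiver's working memory
        self.checkpoints = [copy.deepcopy(state)]  # initial recovery point


def send(sender, receiver, message, acceptance_test, keep=1):
    """Deliver `message` only if it passes the acceptance test.

    Returns True on delivery, False if the sender was rolled back.
    `keep` (assumed >= 1) is how many saved statuses are retained.
    """
    if acceptance_test(message):
        receiver.inbox.append(message)                           # i) transfer
        sender.checkpoints.append(copy.deepcopy(sender.state))   # ii) save status
        del sender.checkpoints[:-keep]                           # iii) discard old status
        return True
    # ii) on error, restore the most recent saved status instead
    sender.state = copy.deepcopy(sender.checkpoints[-1])
    return False
```

Because a message is tested before it reaches the receiver, a rejected message never enters the receiver's working memory, so in this simplified model only the sender has to roll back.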

Archivio della ricerca- Università di Roma La Sapienza

An evaluation of software fault tolerance techniques in real-time safety-critical applications

Authors: Leveson Nancy G., Yemini Shaula
Publication venue: eScholarship, University of California
Publication date: 01/01/1982

The usefulness of three software fault tolerance techniques (n-version programming, recovery blocks, and exception handling) is examined within the context of real-time safety-critical environments. The general requirements of such application systems are presented, and the techniques are evaluated with regard to how well they satisfy these requirements.

eScholarship - University of California

Analysis of backward error recovery for concurrent processes with recovery blocks

Authors: Lee Y. H., Shin K. G.

Three different methods of implementing recovery blocks (RBs) are considered: the asynchronous, the synchronous, and the pseudo recovery point (PRP) implementations. Pseudo recovery points are proposed so that unbounded rollback may be avoided while maintaining process autonomy. Probabilistic models were developed for analyzing these three methods under standard assumptions in computer performance analysis, i.e., exponential distributions for the related random variables. The interval between two successive recovery lines for asynchronous RBs, the mean loss in computation power for the synchronized method, and the additional overhead and rollback distance when PRPs are used were estimated.

NASA Technical Reports Server

Improving the Reliability of Decision-Support Systems for Nuclear Emergency Management by Leveraging Software Design Diversity

Authors: Tudor B. Ionescu, Walter Scheuermann
Publication venue: 'Faculty of Electrical Engineering and Computing, Univ. of Zagreb'
Publication date: 01/01/2016

This paper introduces a novel method of continuous verification of simulation software used in decision-support systems for nuclear emergency management (DSNE). The proposed approach builds on methods from the field of software reliability engineering, such as N-Version Programming, Recovery Blocks, and Consensus Recovery Blocks. We introduce a new acceptance test for dispersion simulation results and a new voting scheme based on taxonomies of simulation results rather than individual simulation results. The acceptance test and the voter are used in a new scheme, which extends the Consensus Recovery Block method with a database of result taxonomies to support machine learning. This enables the system to learn how to distinguish correct from incorrect results with respect to the implemented numerical schemes. Because decision-support systems for nuclear emergency management are used in a safety-critical application context, the methods introduced in this paper help improve the reliability of the system and the trustworthiness of the simulation results used by emergency managers in the decision-making process. The effectiveness of the approach has been assessed using the atmospheric dispersion forecasts of two test versions of the widely used RODOS DSNE system.
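
For readers unfamiliar with the Consensus Recovery Block method that this abstract extends, the generic two-stage idea (vote first, fall back to an acceptance test) might be sketched as follows. This is the textbook scheme only, not the taxonomy-based voter or the machine-learning extension the paper proposes. The function name, the `quorum` parameter, and the assumption that results are hashable and compared by exact equality are all illustrative choices.

```python
from collections import Counter


def consensus_recovery_block(versions, acceptance_test, x, quorum=2):
    """Generic consensus recovery block over independently written versions.

    Stage 1: run every version and accept a result that at least `quorum`
    versions agree on. Stage 2: if there is no agreement, return the first
    result (in version order) that passes the acceptance test.
    Results are assumed hashable and compared by exact equality.
    """
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            continue  # a failed version simply contributes no result

    if results:
        value, count = Counter(results).most_common(1)[0]
        if count >= quorum:
            return value  # stage 1: the voter found agreement

    for result in results:
        if acceptance_test(result):
            return result  # stage 2: acceptance-test fallback

    raise RuntimeError("no version produced an acceptable result")
```

Exact-equality voting is rarely adequate for floating-point simulation output, which is one reason a scheme for numerical results would need a more tolerant voter than the one assumed here.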

Crossref

Directory of Open Access Journals

HRČAK - Portal of Croatian Scientific and Professional Journals

Development of software fault-tolerance techniques

Author: Melliar-Smith P. M.

As computers become more widely used, and in particular as they are used in more safety-critical applications, the reliability of the computer system and its software becomes more important. There is also an increasing need for high levels of reliability in applications involving very large numbers of inexpensive units, where recall of the units would be disproportionately expensive. The nature of faults and the assumptions made by different approaches to correct operation are considered. The recovery block approach is described, and a probabilistic analysis of its effectiveness, with and without correlated design errors, is provided. Mechanisms for generating acceptance tests from specifications, and for providing recovery in the presence of asynchrony, are described. An analysis of, and design for, the provision of recovery blocks in the microprogram of the Bendix BDX930 processor is provided. An example of the use of recovery blocks in a simple operating system is also provided.

NASA Technical Reports Server

Software safety : a definition and some preliminary thoughts

Author: Leveson Nancy G.
Publication venue: eScholarship, University of California
Publication date: 01/01/1981

Software safety is the subject of a research project in its initial stages at the University of California, Irvine. This research deals with critical real-time software where the cost of an error is high, e.g., human life. In this paper, software techniques that have a bearing on safety are described and evaluated. Initial definitions of software safety concepts are presented, along with some preliminary thoughts and research questions.

eScholarship - University of California

Integrated analysis of error detection and recovery

Authors: Lee Y. H., Shin K. G.

An integrated modeling and analysis of error detection and recovery is presented. When fault latency and/or error latency exist, the system may suffer from multiple faults or error propagations which seriously deteriorate the fault-tolerant capability. Several detection models that enable analysis of the effect of detection mechanisms on the subsequent error handling operations and the overall system reliability were developed. Following detection of the faulty unit and reconfiguration of the system, the contaminated processes or tasks have to be recovered. The strategies of error recovery employed depend on the detection mechanisms and the available redundancy. Several recovery methods including the rollback recovery are considered. The recovery overhead is evaluated as an index of the capabilities of the detection and reconfiguration mechanisms

NASA Technical Reports Server

Study of fault-tolerant software technology

Authors: Broglio C., Goldberg J., Hitt E., Levitt K., Slivinski T., Webb J., Wild C.

Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance

NASA Technical Reports Server

Performance and evaluation of real-time multicomputer control systems

Author: Shin K. G.

New performance measures, detailed examples, modeling of error detection process, performance evaluation of rollback recovery methods, experiments on FTMP, and optimal size of an NMR cluster are discussed

NASA Technical Reports Server