548 research outputs found

    Recovery blocks for communicating systems

    Get PDF
    In many practical applications of real-time computing (avionics, switching systems) a message-passing inter-processes communication approach is adopted for both modularity and reliability aims. In the present paper, the problem of adding fault-tolerance in a message passing multiprocesses environment is examined. Recovery blocks implementation schemes for both asynchronous and synchronous communications are proposed, with the aim of avoiding domino-effects and exploiting the message oriented system structure. When a sender process produces a message, an acceptance test is performed on the message by system procedures, which in sequence: i) transfer the message on the receiving process working memory, ii) save present process status, or in case of error, restore some previous process status, and iii) discard no longer needed status informations

    Analysis of backward error recovery for concurrent processes with recovery blocks

    Get PDF
    Three different methods of implementing recovery blocks (RB's). These are the asynchronous, synchronous, and the pseudo recovery point implementations. Pseudo recovery points so that unbounded rollback may be avoided while maintaining process autonomy are proposed. Probabilistic models for analyzing these three methods under standard assumptions in computer performance analysis, i.e., exponential distributions for related random variables were developed. The interval between two successive recovery lines for asynchronous RB's mean loss in computation power for the synchronized method, and additional overhead and rollback distance in case PRP's are used were estimated

    Improving the Reliability of Decision-Support Systems for Nuclear Emergency Management by Leveraging Software Design Diversity

    Get PDF
    This paper introduces a novel method of continuous verification of simulation software used in decision-support systems for nuclear emergency management (DSNE). The proposed approach builds on methods from the field of software reliability engineering, such as N-Version Programming, Recovery Blocks, and Consensus Recovery Blocks. We introduce a new acceptance test for dispersion simulation results and a new voting scheme based on taxonomies of simulation results rather than individual simulation results. The acceptance test and the voter are used in a new scheme, which extends the Consensus Recovery Block method by a database of result taxonomies to support machine-learning. This enables the system to learn how to distinguish correct from incorrect results, with respect to the implemented numerical schemes. Considering that decision-support systems for nuclear emergency management are used in a safety-critical application context, the methods introduced in this paper help improve the reliability of the system and the trustworthiness of the simulation results used by emergency managers in the decision making process. The effectiveness of the approach has been assessed using the atmospheric dispersion forecasts of two test versions of the widely used RODOS DSNE system

    Development of software fault-tolerance techniques

    Get PDF
    As computers become more widely used, and in particular as they become used in more safety critical applications, the reliability of the computer system and its software becomes more important. There is also an increasing need for high levels of reliability in applications involving very large numbers of inexpensive units where recall of the units would be disproportionately expensive. The nature of faults and the assumptions made by different approaches to correct operation are considered. The recovery block approach is described and a probabilistic analysis of its effectiveness, with and without correlated design errors is provided. Mechanisms for generating acceptance tests from specifications, and for providing recovery in the presence of asynchrony, are described. An analysis of, and design for, the provision of recovery blocks in the microprogram of the Bendix BDX930 processor is provided. An example of the use of recovery blocks in a simple operating system is also provided

    Integrated analysis of error detection and recovery

    Get PDF
    An integrated modeling and analysis of error detection and recovery is presented. When fault latency and/or error latency exist, the system may suffer from multiple faults or error propagations which seriously deteriorate the fault-tolerant capability. Several detection models that enable analysis of the effect of detection mechanisms on the subsequent error handling operations and the overall system reliability were developed. Following detection of the faulty unit and reconfiguration of the system, the contaminated processes or tasks have to be recovered. The strategies of error recovery employed depend on the detection mechanisms and the available redundancy. Several recovery methods including the rollback recovery are considered. The recovery overhead is evaluated as an index of the capabilities of the detection and reconfiguration mechanisms

    Study of fault-tolerant software technology

    Get PDF
    Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance

    Performance and evaluation of real-time multicomputer control systems

    Get PDF
    New performance measures, detailed examples, modeling of error detection process, performance evaluation of rollback recovery methods, experiments on FTMP, and optimal size of an NMR cluster are discussed
    corecore