13,532 research outputs found

    Redundancy management for efficient fault recovery in NASA's distributed computing system

    Get PDF
    The management of redundancy in computer systems was studied and guidelines were provided for the development of NASA's fault-tolerant distributed systems. Fault recovery and reconfiguration mechanisms were examined. A theoretical foundation was laid for redundancy management by efficient reconfiguration methods and algorithmic diversity. Algorithms were developed to optimize the resources for embedding of computational graphs of tasks in the system architecture and reconfiguration of these tasks after a failure has occurred. The computational structure represented by a path and the complete binary tree was considered and the mesh and hypercube architectures were targeted for their embeddings. The innovative concept of Hybrid Algorithm Technique was introduced. This new technique provides a mechanism for obtaining fault tolerance while exhibiting improved performance

    Evaluating the reliability of NAND multiplexing with PRISM

    Get PDF
    Probabilistic-model checking is a formal verification technique for analyzing the reliability and performance of systems exhibiting stochastic behavior. In this paper, we demonstrate the applicability of this approach and, in particular, the probabilistic-model-checking tool PRISM to the evaluation of reliability and redundancy of defect-tolerant systems in the field of computer-aided design. We illustrate the technique with an example due to von Neumann, namely NAND multiplexing. We show how, having constructed a model of a defect-tolerant system incorporating probabilistic assumptions about its defects, it is straightforward to compute a range of reliability measures and investigate how they are affected by slight variations in the behavior of the system. This allows a designer to evaluate, for example, the tradeoff between redundancy and reliability in the design. We also highlight errors in analytically computed reliability bounds, recently published for the same case study

    Evaluating the reliability of NAND multiplexing with PRISM

    Get PDF
    Probabilistic-model checking is a formal verification technique for analyzing the reliability and performance of systems exhibiting stochastic behavior. In this paper, we demonstrate the applicability of this approach and, in particular, the probabilistic-model-checking tool PRISM to the evaluation of reliability and redundancy of defect-tolerant systems in the field of computer-aided design. We illustrate the technique with an example due to von Neumann, namely NAND multiplexing. We show how, having constructed a model of a defect-tolerant system incorporating probabilistic assumptions about its defects, it is straightforward to compute a range of reliability measures and investigate how they are affected by slight variations in the behavior of the system. This allows a designer to evaluate, for example, the tradeoff between redundancy and reliability in the design. We also highlight errors in analytically computed reliability bounds, recently published for the same case study

    Study of fault tolerant software technology for dynamic systems

    Get PDF
    The major aim of this study is to investigate the feasibility of using systems-based failure detection isolation and compensation (FDIC) techniques in building fault-tolerant software and extending them, whenever possible, to the domain of software fault tolerance. First, it is shown that systems-based FDIC methods can be extended to develop software error detection techniques by using system models for software modules. In particular, it is demonstrated that systems-based FDIC techniques can yield consistency checks that are easier to implement than acceptance tests based on software specifications. Next, it is shown that systems-based failure compensation techniques can be generalized to the domain of software fault tolerance in developing software error recovery procedures. Finally, the feasibility of using fault-tolerant software in flight software is investigated. In particular, possible system and version instabilities, and functional performance degradation that may occur in N-Version programming applications to flight software are illustrated. Finally, a comparative analysis of N-Version and recovery block techniques in the context of generic blocks in flight software is presented
    corecore