13 research outputs found

    A Primer on Architectural Level Fault Tolerance

    Get PDF
    This paper introduces the fundamental concepts of fault tolerant computing. Key topics covered are voting, fault detection, clock synchronization, Byzantine Agreement, diagnosis, and reliability analysis. Low level mechanisms such as Hamming codes or low level communications protocols are not covered. The paper is tutorial in nature and does not cover any topic in detail. The focus is on rationale and approach rather than detailed exposition

    A Test Generation Framework for Distributed Fault-Tolerant Algorithms

    Get PDF
    Heavyweight formal methods such as theorem proving have been successfully applied to the analysis of safety critical fault-tolerant systems. Typically, the models and proofs performed during such analysis do not inform the testing process of actual implementations. We propose a framework for generating test vectors from specifications written in the Prototype Verification System (PVS). The methodology uses a translator to produce a Java prototype from a PVS specification. Symbolic (Java) PathFinder is then employed to generate a collection of test cases. A small example is employed to illustrate how the framework can be used in practice

    Is Model-Based Development a Favorable Approach for Complex and Safety-Critical Computer Systems on Commercial Aircraft?

    Get PDF
    A system is safety-critical if its failure can endanger human life or cause significant damage to property or the environment. State-of-the-art computer systems on commercial aircraft are highly complex, software-intensive, functionally integrated, and network-centric systems of systems. Ensuring that such systems are safe and comply with existing safety regulations is costly and time-consuming as the level of rigor in the development process, especially the validation and verification activities, is determined by considerations of system complexity and safety criticality. A significant degree of care and deep insight into the operational principles of these systems is required to ensure adequate coverage of all design implications relevant to system safety. Model-based development methodologies, methods, tools, and techniques facilitate collaboration and enable the use of common design artifacts among groups dealing with different aspects of the development of a system. This paper examines the application of model-based development to complex and safety-critical aircraft computer systems. Benefits and detriments are identified and an overall assessment of the approach is given

    A Fault-Tolerant Clock Synchronization and Geometry Determination Protocol

    Get PDF
    A fault-tolerant distributed protocol (algorithm) is presented that achieves optimum timing precision (clock synchronization) among the nodes and, simultaneously, determines the network's geometry (shape) - locations and distances of the nodes relative to each other - in a wireless distributed system. This protocol is based on the assumption of initial coarse synchrony of nodes' local clocks. The proposed solution assumes no prior knowledge of the nodes' locations, the distances between the nodes, or network's geometry, but assumes an ordered geometry where nodes have unique identifiers. This protocol accommodates large variations in the communication latencies among the nodes; thus, it applies equally to both wireless and wired networks

    High-Intensity Radiated Field Fault-Injection Experiment for a Fault-Tolerant Distributed Communication System

    Get PDF
    Safety-critical distributed flight control systems require robustness in the presence of faults. In general, these systems consist of a number of input/output (I/O) and computation nodes interacting through a fault-tolerant data communication system. The communication system transfers sensor data and control commands and can handle most faults under typical operating conditions. However, the performance of the closed-loop system can be adversely affected as a result of operating in harsh environments. In particular, High-Intensity Radiated Field (HIRF) environments have the potential to cause random fault manifestations in individual avionic components and to generate simultaneous system-wide communication faults that overwhelm existing fault management mechanisms. This paper presents the design of an experiment conducted at the NASA Langley Research Center's HIRF Laboratory to statistically characterize the faults that a HIRF environment can trigger on a single node of a distributed flight control system

    Model Checking A Self-Stabilizing Synchronization Protocol for Arbitrary Digraphs

    Get PDF
    This report presents the mechanical verification of a self-stabilizing distributed clock synchronization protocol for arbitrary digraphs in the absence of faults. This protocol does not rely on assumptions about the initial state of the system, other than the presence of at least one node, and no central clock or a centrally generated signal, pulse, or message is used. The system under study is an arbitrary, non-partitioned digraph ranging from fully connected to 1-connected networks of nodes while allowing for differences in the network elements. Nodes are anonymous, i.e., they do not have unique identities. There is no theoretical limit on the maximum number of participating nodes. The only constraint on the behavior of the node is that the interactions with other nodes are restricted to defined links and interfaces. This protocol deterministically converges within a time bound that is a linear function of the self-stabilization period. A bounded model of the protocol is verified using the Symbolic Model Verifier (SMV) for a subset of digraphs. Modeling challenges of the protocol and the system are addressed. The model checking effort is focused on verifying correctness of the bounded model of the protocol as well as confirmation of claims of determinism and linear convergence with respect to the self-stabilization period

    A Self-Stabilizing Hybrid Fault-Tolerant Synchronization Protocol

    Get PDF
    This paper presents a strategy for solving the Byzantine general problem for self-stabilizing a fully connected network from an arbitrary state and in the presence of any number of faults with various severities including any number of arbitrary (Byzantine) faulty nodes. The strategy consists of two parts: first, converting Byzantine faults into symmetric faults, and second, using a proven symmetric-fault tolerant algorithm to solve the general case of the problem. A protocol (algorithm) is also present that tolerates symmetric faults, provided that there are more good nodes than faulty ones. The solution applies to realizable systems, while allowing for differences in the network elements, provided that the number of arbitrary faults is not more than a third of the network size. The only constraint on the behavior of a node is that the interactions with other nodes are restricted to defined links and interfaces. The solution does not rely on assumptions about the initial state of the system and no central clock nor centrally generated signal, pulse, or message is used. Nodes are anonymous, i.e., they do not have unique identities. A mechanical verification of a proposed protocol is also present. A bounded model of the protocol is verified using the Symbolic Model Verifier (SMV). The model checking effort is focused on verifying correctness of the bounded model of the protocol as well as confirming claims of determinism and linear convergence with respect to the self-stabilization period

    A Self-Stabilizing Hybrid-Fault Tolerant Synchronization Protocol

    Get PDF
    In this report we present a strategy for solving the Byzantine general problem for self-stabilizing a fully connected network from an arbitrary state and in the presence of any number of faults with various severities including any number of arbitrary (Byzantine) faulty nodes. Our solution applies to realizable systems, while allowing for differences in the network elements, provided that the number of arbitrary faults is not more than a third of the network size. The only constraint on the behavior of a node is that the interactions with other nodes are restricted to defined links and interfaces. Our solution does not rely on assumptions about the initial state of the system and no central clock nor centrally generated signal, pulse, or message is used. Nodes are anonymous, i.e., they do not have unique identities. We also present a mechanical verification of a proposed protocol. A bounded model of the protocol is verified using the Symbolic Model Verifier (SMV). The model checking effort is focused on verifying correctness of the bounded model of the protocol as well as confirming claims of determinism and linear convergence with respect to the self-stabilization period. We believe that our proposed solution solves the general case of the clock synchronization problem

    Overview of Risk Mitigation for Safety-Critical Computer-Based Systems

    Get PDF
    This report presents a high-level overview of a general strategy to mitigate the risks from threats to safety-critical computer-based systems. In this context, a safety threat is a process or phenomenon that can cause operational safety hazards in the form of computational system failures. This report is intended to provide insight into the safety-risk mitigation problem and the characteristics of potential solutions. The limitations of the general risk mitigation strategy are discussed and some options to overcome these limitations are provided. This work is part of an ongoing effort to enable well-founded assurance of safety-related properties of complex safety-critical computer-based aircraft systems by developing an effective capability to model and reason about the safety implications of system requirements and design

    On-line health monitoring of passive electronic components using digitally controlled power converter

    Get PDF
    This thesis presents System Identification based On-Line Health Monitoring to analyse the dynamic behaviour of the Switch-Mode Power Converter (SMPC), detect, and diagnose anomalies in passive electronic components. The anomaly detection in this research is determined by examining the change in passive component values due to degradation. Degradation, which is a long-term process, however, is characterised by inserting different component values in the power converter. The novel health-monitoring capability enables accurate detection of passive electronic components despite component variations and uncertainties and is valid for different topologies of the switch-mode power converter. The need for a novel on-line health-monitoring capability is driven by the need to improve unscheduled in-service, logistics, and engineering costs, including the requirement of Integrated Vehicle Health Management (IVHM) for electronic systems and components. The detection and diagnosis of degradations and failures within power converters is of great importance for aircraft electronic manufacturers, such as Thales, where component failures result in equipment downtime and large maintenance costs. The fact that existing techniques, including built-in-self test, use of dedicated sensors, physics-of-failure, and data-driven based health-monitoring, have yet to deliver extensive application in IVHM, provides the motivation for this research ... [cont.]
    corecore