3,552 research outputs found

    Verification of fault-tolerant clock synchronization systems

    Get PDF
    A critical function in a fault-tolerant computer architecture is the synchronization of the redundant computing elements. The synchronization algorithm must include safeguards to ensure that failed components do not corrupt the behavior of good clocks. Reasoning about fault-tolerant clock synchronization is difficult because of the possibility of subtle interactions involving failed components. Therefore, mechanical proof systems are used to ensure that the verification of the synchronization system is correct. In 1987, Schneider presented a general proof of correctness for several fault-tolerant clock synchronization algorithms. Subsequently, Shankar verified Schneider's proof by using the mechanical proof system EHDM. This proof ensures that any system satisfying its underlying assumptions will provide Byzantine fault-tolerant clock synchronization. The utility of Shankar's mechanization of Schneider's theory for the verification of clock synchronization systems is explored. Some limitations of Shankar's mechanically verified theory were encountered. With minor modifications to the theory, a mechanically checked proof is provided that removes these limitations. The revised theory also allows for proven recovery from transient faults. Use of the revised theory is illustrated with the verification of an abstract design of a clock synchronization system

    A Byzantine-Fault Tolerant Self-Stabilizing Protocol for Distributed Clock Synchronization Systems

    Get PDF
    Embedded distributed systems have become an integral part of safety-critical computing applications, necessitating system designs that incorporate fault tolerant clock synchronization in order to achieve ultra-reliable assurance levels. Many efficient clock synchronization protocols do not, however, address Byzantine failures, and most protocols that do tolerate Byzantine failures do not self-stabilize. Of the Byzantine self-stabilizing clock synchronization algorithms that exist in the literature, they are based on either unjustifiably strong assumptions about initial synchrony of the nodes or on the existence of a common pulse at the nodes. The Byzantine self-stabilizing clock synchronization protocol presented here does not rely on any assumptions about the initial state of the clocks. Furthermore, there is neither a central clock nor an externally generated pulse system. The proposed protocol converges deterministically, is scalable, and self-stabilizes in a short amount of time. The convergence time is linear with respect to the self-stabilization period. Proofs of the correctness of the protocol as well as the results of formal verification efforts are reported

    Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer

    Get PDF
    SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct, processor to processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness

    Experimental validation of clock synchronization algorithms

    Get PDF
    The objective of this work is to validate mathematically derived clock synchronization theories and their associated algorithms through experiment. Two theories are considered, the Interactive Convergence Clock Synchronization Algorithm and the Midpoint Algorithm. Special clock circuitry was designed and built so that several operating conditions and failure modes (including malicious failures) could be tested. Both theories are shown to predict conservative upper bounds (i.e., measured values of clock skew were always less than the theory prediction). Insight gained during experimentation led to alternative derivations of the theories. These new theories accurately predict the behavior of the clock system. It is found that a 100 percent penalty is paid to tolerate worst-case failures. It is also shown that under optimal conditions (with minimum error and no failures) the clock skew can be as much as three clock ticks. Clock skew grows to six clock ticks when failures are present. Finally, it is concluded that one cannot rely solely on test procedures or theoretical analysis to predict worst-case conditions

    Rapid Recovery for Systems with Scarce Faults

    Full text link
    Our goal is to achieve a high degree of fault tolerance through the control of a safety critical systems. This reduces to solving a game between a malicious environment that injects failures and a controller who tries to establish a correct behavior. We suggest a new control objective for such systems that offers a better balance between complexity and precision: we seek systems that are k-resilient. In order to be k-resilient, a system needs to be able to rapidly recover from a small number, up to k, of local faults infinitely many times, provided that blocks of up to k faults are separated by short recovery periods in which no fault occurs. k-resilience is a simple but powerful abstraction from the precise distribution of local faults, but much more refined than the traditional objective to maximize the number of local faults. We argue why we believe this to be the right level of abstraction for safety critical systems when local faults are few and far between. We show that the computational complexity of constructing optimal control with respect to resilience is low and demonstrate the feasibility through an implementation and experimental results.Comment: In Proceedings GandALF 2012, arXiv:1210.202

    Validation of a fault-tolerant clock synchronization system

    Get PDF
    A validation method for the synchronization subsystem of a fault tolerant computer system is investigated. The method combines formal design verification with experimental testing. The design proof reduces the correctness of the clock synchronization system to the correctness of a set of axioms which are experimentally validated. Since the reliability requirements are often extreme, requiring the estimation of extremely large quantiles, an asymptotic approach to estimation in the tail of a distribution is employed

    The Second NASA Formal Methods Workshop 1992

    Get PDF
    The primary goal of the workshop was to bring together formal methods researchers and aerospace industry engineers to investigate new opportunities for applying formal methods to aerospace problems. The first part of the workshop was tutorial in nature. The second part of the workshop explored the potential of formal methods to address current aerospace design and verification problems. The third part of the workshop involved on-line demonstrations of state-of-the-art formal verification tools. Also, a detailed survey was filled in by the attendees; the results of the survey are compiled

    Design verification of SIFT

    Get PDF
    A SIFT reliable aircraft control computer system, designed to meet the ultrahigh reliability required for safety critical flight control applications by use of processor replications and voting, was constructed for SRI, and delivered to NASA Langley for evaluation in the AIRLAB. To increase confidence in the reliability projections for SIFT, produced by a Markov reliability model, SRI constructed a formal specification, defining the meaning of reliability in the context of flight control. A further series of specifications defined, in increasing detail, the design of SIFT down to pre- and post-conditions on Pascal code procedures. Mechanically checked mathematical proofs were constructed to demonstrate that the more detailed design specifications for SIFT do indeed imply the formal reliability requirement. An additional specification defined some of the assumptions made about SIFT by the Markov model, and further proofs were constructed to show that these assumptions, as expressed by that specification, did indeed follow from the more detailed design specifications for SIFT. This report provides an outline of the methodology used for this hierarchical specification and proof, and describes the various specifications and proofs performed

    Formal verification of a software countermeasure against instruction skip attacks

    Get PDF
    Fault attacks against embedded circuits enabled to define many new attack paths against secure circuits. Every attack path relies on a specific fault model which defines the type of faults that the attacker can perform. On embedded processors, a fault model consisting in an assembly instruction skip can be very useful for an attacker and has been obtained by using several fault injection means. To avoid this threat, some countermeasure schemes which rely on temporal redundancy have been proposed. Nevertheless, double fault injection in a long enough time interval is practical and can bypass those countermeasure schemes. Some fine-grained countermeasure schemes have also been proposed for specific instructions. However, to the best of our knowledge, no approach that enables to secure a generic assembly program in order to make it fault-tolerant to instruction skip attacks has been formally proven yet. In this paper, we provide a fault-tolerant replacement sequence for almost all the instructions of the Thumb-2 instruction set and provide a formal verification for this fault tolerance. This simple transformation enables to add a reasonably good security level to an embedded program and makes practical fault injection attacks much harder to achieve
    corecore