10,367 research outputs found

    Prototype of Fault Adaptive Embedded Software for Large-Scale Real-Time Systems

    Get PDF
    This paper describes a comprehensive prototype of large-scale fault adaptive embedded software developed for the proposed Fermilab BTeV high energy physics experiment. Lightweight self-optimizing agents embedded within Level 1 of the prototype are responsible for proactive and reactive monitoring and mitigation based on specified layers of competence. The agents are self-protecting, detecting cascading failures using a distributed approach. Adaptive, reconfigurable, and mobile objects for reliablility are designed to be self-configuring to adapt automatically to dynamically changing environments. These objects provide a self-healing layer with the ability to discover, diagnose, and react to discontinuities in real-time processing. A generic modeling environment was developed to facilitate design and implementation of hardware resource specifications, application data flow, and failure mitigation strategies. Level 1 of the planned BTeV trigger system alone will consist of 2500 DSPs, so the number of components and intractable fault scenarios involved make it impossible to design an `expert system' that applies traditional centralized mitigative strategies based on rules capturing every possible system state. Instead, a distributed reactive approach is implemented using the tools and methodologies developed by the Real-Time Embedded Systems group.Comment: 2nd Workshop on Engineering of Autonomic Systems (EASe), in the 12th Annual IEEE International Conference and Workshop on the Engineering of Computer Based Systems (ECBS), Washington, DC, April, 200

    Event-B Patterns for Specifying Fault-Tolerance in Multi-Agent Interaction

    No full text
    Interaction in a multi-agent system is susceptible to failure. A rigorous development of a multi-agent system must include the treatment of fault-tolerance of agent interactions for the agents to be able to continue to function independently. Patterns can be used to capture fault-tolerance techniques. A set of modelling patterns is presented that specify fault-tolerance in Event-B specifications of multi-agent interactions. The purpose of these patterns is to capture common modelling structures for distributed agent interaction in a form that is re-usable on other related developments. The patterns have been applied to a case study of the contract net interaction protocol

    Rapid Recovery for Systems with Scarce Faults

    Full text link
    Our goal is to achieve a high degree of fault tolerance through the control of a safety critical systems. This reduces to solving a game between a malicious environment that injects failures and a controller who tries to establish a correct behavior. We suggest a new control objective for such systems that offers a better balance between complexity and precision: we seek systems that are k-resilient. In order to be k-resilient, a system needs to be able to rapidly recover from a small number, up to k, of local faults infinitely many times, provided that blocks of up to k faults are separated by short recovery periods in which no fault occurs. k-resilience is a simple but powerful abstraction from the precise distribution of local faults, but much more refined than the traditional objective to maximize the number of local faults. We argue why we believe this to be the right level of abstraction for safety critical systems when local faults are few and far between. We show that the computational complexity of constructing optimal control with respect to resilience is low and demonstrate the feasibility through an implementation and experimental results.Comment: In Proceedings GandALF 2012, arXiv:1210.202

    Incremental bounded model checking for embedded software

    Get PDF
    Program analysis is on the brink of mainstream usage in embedded systems development. Formal verification of behavioural requirements, finding runtime errors and test case generation are some of the most common applications of automated verification tools based on bounded model checking (BMC). Existing industrial tools for embedded software use an off-the-shelf bounded model checker and apply it iteratively to verify the program with an increasing number of unwindings. This approach unnecessarily wastes time repeating work that has already been done and fails to exploit the power of incremental SAT solving. This article reports on the extension of the software model checker CBMC to support incremental BMC and its successful integration with the industrial embedded software verification tool BTC EMBEDDED TESTER. We present an extensive evaluation over large industrial embedded programs, mainly from the automotive industry. We show that incremental BMC cuts runtimes by one order of magnitude in comparison to the standard non-incremental approach, enabling the application of formal verification to large and complex embedded software. We furthermore report promising results on analysing programs with arbitrary loop structure using incremental BMC, demonstrating its applicability and potential to verify general software beyond the embedded domain

    A method for rigorous development of fault-tolerant systems

    Get PDF
    PhD ThesisWith the rapid development of information systems and our increasing dependency on computer-based systems, ensuring their dependability becomes one the most important concerns during system development. This is especially true for the mission and safety critical systems on which we rely not to put signi cant resources and lives at risk. Development of critical systems traditionally involves formal modelling as a fault prevention mechanism. At the same time, systems typically support fault tolerance mechanisms to mitigate runtime errors. However, fault tolerance modelling and, in particular, rigorous de nitions of fault tolerance requirements, fault assumptions and system recovery have not been given enough attention during formal system development. The main contribution of this research is in developing a method for top-down formal design of fault tolerant systems. The re nement-based method provides modelling guidelines presented in the following form: a set of modelling principles for systematic modelling of fault tolerance, a fault tolerance re nement strategy, and a library of generic modelling patterns assisting in disciplined integration of error detection and error recovery steps into models. The method supports separation of normal and fault tolerant system behaviour during modelling. It provides an environment for explicit modelling of fault tolerance and modal aspects of system behaviour which ensure rigour of the proposed development process. The method is supported by tools that are smoothly integrated into an industry-strength development environment. The proposed method is demonstrated on two case studies. In particular, the evaluation is carried out using a medium-scale industrial case study from the aerospace domain. The method is shown to provide support for explicit modelling of fault tolerance, to reduce the development e orts during modelling, to support reuse of fault tolerance modelling, and to facilitate adoption of formal methods.DEPLOY: The TrAmS Grant: The School of Computing Science, Newcastle University

    The ISIS project: Fault-tolerance in large distributed systems

    Get PDF
    The semi-annual status report covers activities of the ISIS project during the second half of 1989. The project had several independent objectives: (1) At the level of the ISIS Toolkit, ISIS release V2.0 was completed, containing bypass communication protocols. Performance of the system is greatly enhanced by this change, but the initial software release is limited in some respects. (2) The Meta project focused on the definition of the Lomita programming language for specifying rules that monitor sensors for conditions of interest and triggering appropriate reactions. This design was completed, and implementation of Lomita is underway on the Meta 2.0 platform. (3) The Deceit file system effort completed a prototype. It is planned to make Deceit available for use in two hospital information systems. (4) A long-haul communication subsystem project was completed and can be used as part of ISIS. This effort resulted in tools for linking ISIS systems on different LANs together over long-haul communications lines. (5) Magic Lantern, a graphical tool for building application monitoring and control interfaces, is included as part of the general ISIS releases
    corecore