11,223 research outputs found

    Survivable algorithms and redundancy management in NASA's distributed computing systems

    Get PDF
    The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are, by contrast, little understood, and much more work in this field is required. We outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design. We introduce our consensus-based framework for fault-tolerant system design. This is followed by a description of a hierarchical partitioning method for efficient consensus. A scheduler for redundancy management is introduced, and application-specific fault tolerance is described. We give an overview of our hybrid algorithm technique, which is an alternative to the formal approach given

    Low-energy standby-sparing for hard real-time systems

    No full text
    Time-redundancy techniques are commonly used in real-time systems to achieve fault tolerance without incurring high energy overhead. However, reliability requirements of hard real-time systems that are used in safety-critical applications are so stringent that time-redundancy techniques are sometimes unable to achieve them. Standby sparing as a hardwareredundancy technique can be used to meet high reliability requirements of safety-critical applications. However, conventional standby-sparing techniques are not suitable for lowenergy hard real-time systems as they either impose considerable energy overheads or are not proper for hard timing constraints. In this paper we provide a technique to use standby sparing for hard real-time systems with limited energy budgets. The principal contribution of this work is an online energymanagement technique which is specifically developed for standby-sparing systems that are used in hard real-time applications. This technique operates at runtime and exploits dynamic slacks to reduce the energy consumption while guaranteeing hard deadlines. We compared the low-energy standby-sparing (LESS) system with a low-energy timeredundancy system (from a previous work). The results show that for relaxed time constraints, the LESS system is more reliable and provides about 26% energy saving as compared to the time-redundancy system. For tight deadlines when the timeredundancy system is not sufficiently reliable (for safety-critical application), the LESS system preserves its reliability but with about 49% more energy consumptio

    Avionics systems integration technology

    Get PDF
    A very dramatic and continuing explosion in digital electronics technology has been taking place in the last decade. The prudent and timely application of this technology will provide Army aviation the capability to prevail against a numerically superior enemy threat. The Army and NASA have exploited this technology explosion in the development and application of avionics systems integration technology for new and future aviation systems. A few selected Army avionics integration technology base efforts are discussed. Also discussed is the Avionics Integration Research Laboratory (AIRLAB) that NASA has established at Langley for research into the integration and validation of avionics systems, and evaluation of advanced technology in a total systems context

    Synthesis of Fault-Tolerant Embedded Systems

    Get PDF
    This work addresses the issue of design optimization for faulttolerant hard real-time systems. In particular, our focus is on the handling of transient faults using both checkpointing with rollback recovery and active replication. Fault tolerant schedules are generated based on a conditional process graph representation. The formulated system synthesis approaches decide the assignment of fault-tolerance policies to processes, the optimal placement of checkpoints and the mapping of processes to processors, such that multiple transient faults are tolerated, transparency requirements are considered, and the timing constraints of the application are satisfied. 1

    Design Optimization of Time- and Cost-Constrained Fault-Tolerant Embedded Systems with Checkpointing and Replication

    Get PDF
    We present an approach to the synthesis of fault-tolerant hard real-time systems for safety-critical applications. We use checkpointing with rollback recovery and active replication for tolerating transient faults. Processes and communications are statically scheduled. Our synthesis approach decides the assign-ment of fault-tolerance policies to processes, the optimal place-ent of checkpoints and the mapping of processes to processors such that multiple transient faults are tolerated and the timing con-straints of the application are satisfied. We present several design optimization approaches which are able to find fault-tolerant im-plementations given a limited amount of resources. The developed algorithms are evaluated using extensive experiments, including a real-life example

    Optimized Surface Code Communication in Superconducting Quantum Computers

    Full text link
    Quantum computing (QC) is at the cusp of a revolution. Machines with 100 quantum bits (qubits) are anticipated to be operational by 2020 [googlemachine,gambetta2015building], and several-hundred-qubit machines are around the corner. Machines of this scale have the capacity to demonstrate quantum supremacy, the tipping point where QC is faster than the fastest classical alternative for a particular problem. Because error correction techniques will be central to QC and will be the most expensive component of quantum computation, choosing the lowest-overhead error correction scheme is critical to overall QC success. This paper evaluates two established quantum error correction codes---planar and double-defect surface codes---using a set of compilation, scheduling and network simulation tools. In considering scalable methods for optimizing both codes, we do so in the context of a full microarchitectural and compiler analysis. Contrary to previous predictions, we find that the simpler planar codes are sometimes more favorable for implementation on superconducting quantum computers, especially under conditions of high communication congestion.Comment: 14 pages, 9 figures, The 50th Annual IEEE/ACM International Symposium on Microarchitectur

    Measurement of SIFT operating system overhead

    Get PDF
    The overhead of the software implemented fault tolerance (SIFT) operating system was measured. Several versions of the operating system evolved. Each version represents different strategies employed to improve the measured performance. Three of these versions are analyzed. The internal data structures of the operating systems are discussed. The overhead of the SIFT operating system was found to be of two types: vote overhead and executive task overhead. Both types of overhead were found to be significant in all versions of the system. Improvements substantially reduced this overhead; even with these improvements, the operating system consumed well over 50% of the available processing time

    An assessment of the real-time application capabilities of the SIFT computer system

    Get PDF
    The real-time capabilities of the SIFT computer system, a highly reliable multicomputer architecture developed to support the flight controls of a relaxed static stability aircraft, are discussed. The SIFT computer system was designed to meet extremely high reliability requirements and to facilitate a formal proof of its correctness. Although SIFT represents a significant achievement in fault-tolerant system research it presents an unusual and restrictive interface to its users. The characteristics of the user interface and its impact on application system design are assessed
    • …
    corecore