45 research outputs found

    Survivable algorithms and redundancy management in NASA's distributed computing systems

    Get PDF
    The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are, by contrast, little understood, and much more work in this field is required. We outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design. We introduce our consensus-based framework for fault-tolerant system design. This is followed by a description of a hierarchical partitioning method for efficient consensus. A scheduler for redundancy management is introduced, and application-specific fault tolerance is described. We give an overview of our hybrid algorithm technique, which is an alternative to the formal approach given

    Failover in cellular automata

    Get PDF
    A cellular automata (CA) configuration is constructed that exhibits emergent failover. The configuration is based on standard Game of Life rules. Gliders and glider-guns form the core messaging structure in the configuration. The blinker is represented as the basic computational unit, and it is shown how it can be recreated in case of a failure. Stateless failover using primary-backup mechanism is demonstrated. The details of the CA components used in the configuration and its working are described, and a simulation of the complete configuration is also presented.Comment: 16 pages, 15 figures and associated video at http://dl.dropbox.com/u/7553694/failover_demo.avi and simulation at http://dl.dropbox.com/u/7553694/failover_simulation.ja

    Study of fault-tolerant software technology

    Get PDF
    Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance

    Application Integration Control System for Multi-Scale and Multi-PhysicsSimulation

    Get PDF
    In the case of long-period and large-scale simulation, unexpected stop which is caused by execution time excess, outage of computers, outage of a network during file transfer and so on, become major issues. To avoid the stop of job execution and file transfer, we have developed Task Flow Control System that is a new control system for application integration with a fault tolerant API. If the computer is outage, the system designates an alternate computer, gathers necessary files and submits a new job. Each scheduler, file transfer and job condition can be flexibly defined in XML. This time, we applied the system to fluid-structure interaction analysis simulation. The result indicates that the system enables a user to easily execute multi-scale and multi-physics simulation using application integration

    A New Concurrent Checkpoint Mechanism for Embeded Multi-Core Systems

    Get PDF
    his paper presents a new transparent, incremental, concurrent checkpoint mechanism for embedded multi-core systems. It allows the checkpointed process (also called checkpointee) to continue running without stopping while checkpoints are set to a large extent. Through tracing TLB misses to block the accesses to target memory pages first time while dumping memory pages (the most time-consuming step when setting a checkpoint). At that time, a kernel thread, called checkpointer, copies the memory access target pages to the designated memory buffer for constructing a consistent state of the checkpointee, and then resumes the memory accesses. From the experimental results, in contrast to a traditional concurrent checkpoint system, the proposed mechanism reduces the downtime of the checkpointed process by more than 10.1 %. Moreover, the incremental checkpointing functionality has been implemented in this new concurrent checkpoint mechanism as well. Compared with full checkpointing, incremental checkpointing can reduce the checkpoint time more than 95.5 % and 89.2 % while the benchmark is the matrix multiplication at the checkpoint intervals of 10 seconds and 20 seconds, respectively

    TIARA: Trust Management, Intrusion-tolerance, Accountability, and Reconstitution Architecture

    Get PDF
    The last 20 years have led to unprecedented improvements in chipdensity and system performance fueled mainly by Moore's Law. Duringthe same time, system and application software have bloated, leadingto unmanageable complexity, vulnerability to attack, rigidity and lackof robustness and accountability. These problems arise from the factthat all key elements of the computational environment, from hardwarethrough system software and middleware to application code regard theworld as consisting of unconstrained ``raw seething bits''. No elementof the entire stack is responsible for enforcing over-archingconventions of memory structuring or access control. Outsiders mayeasily penetrate the system by exploiting vulnerabilities (e.g. bufferoverflows) arising from this lack of basic constraints. Attacks arenot easily contained, whether they originate from the clever outsiderwho penetrates the defenses or from the insider who exploits existingprivileges. Finally, because there are no facilities for tracing theprovenance of data, even when an attack is detected, it is difficultif not impossible to tell which data are traceable to the attack andwhat data may still be trusted. We have abundant computational resources allowing us to fix thesecritical problems using a combination of hardware, system software,and programming language technology: In this report, we describe theTIARAproject, which is using these resources to design a newcomputer system thatis less vulnerable, more tolerant of intrusions, capable of recoveryfrom attacks, and accountable for their actions. TIARA provides thesecapabilities without significant impact on overall system performance. Itachieves these goals through the judicious use of a modest amountof extra, but reasonably generable purpose, hardware that is dedicatedto tracking the provenance of data at a very fine grained level, toenforcing access control policies, and to constructing a coherentobject-oriented model of memory. This hardware runs in parallel withthe main data-paths of the system and operates on a set of extra bitstagging each word with data-type, bounds, access control andprovenance information. Operations that violate the intendedinvariants are trapped, while normal results are tagged withinformation derived from the tags of the input operands.This hardware level provides fine-grained support for a series ofsoftware layers that enable a variety of comprehensive access controlpolicies, self-adaptive computing, and fine-grained recoveryprocessing. The first of these software layers establishes aconsistent object-oriented level of computing while higher layersestablish wrappers that may not be bypassed, access controls, dataprovenance tracking. At the highest level we create the ``planlevel'' of computing in which code is executed in parallel with anabstract model (or executable specification) of the system that checkswhether the code behaves as intended
    corecore