741 research outputs found

    Redundancy management for efficient fault recovery in NASA's distributed computing system

    Get PDF
    The management of redundancy in computer systems was studied and guidelines were provided for the development of NASA's fault-tolerant distributed systems. Fault recovery and reconfiguration mechanisms were examined. A theoretical foundation was laid for redundancy management by efficient reconfiguration methods and algorithmic diversity. Algorithms were developed to optimize the resources for embedding of computational graphs of tasks in the system architecture and reconfiguration of these tasks after a failure has occurred. The computational structure represented by a path and the complete binary tree was considered and the mesh and hypercube architectures were targeted for their embeddings. The innovative concept of Hybrid Algorithm Technique was introduced. This new technique provides a mechanism for obtaining fault tolerance while exhibiting improved performance

    Quantifying fault recovery in multiprocessor systems

    Get PDF
    Various aspects of reliable computing are formalized and quantified with emphasis on efficient fault recovery. The mathematical model which proves to be most appropriate is provided by the theory of graphs. New measures for fault recovery are developed and the value of elements of the fault recovery vector are observed to depend not only on the computation graph H and the architecture graph G, but also on the specific location of a fault. In the examples, a hypercube is chosen as a representative of parallel computer architecture, and a pipeline as a typical configuration for program execution. Dependability qualities of such a system is defined with or without a fault. These qualities are determined by the resiliency triple defined by three parameters: multiplicity, robustness, and configurability. Parameters for measuring the recovery effectiveness are also introduced in terms of distance, time, and the number of new, used, and moved nodes and edges

    Hyperswitch communication network

    Get PDF
    The Hyperswitch Communication Network (HCN) is a large scale parallel computer prototype being developed at JPL. Commercial versions of the HCN computer are planned. The HCN computer being designed is a message passing multiple instruction multiple data (MIMD) computer, and offers many advantages in price-performance ratio, reliability and availability, and manufacturing over traditional uniprocessors and bus based multiprocessors. The design of the HCN operating system is a uniquely flexible environment that combines both parallel processing and distributed processing. This programming paradigm can achieve a balance among the following competing factors: performance in processing and communications, user friendliness, and fault tolerance. The prototype is being designed to accommodate a maximum of 64 state of the art microprocessors. The HCN is classified as a distributed supercomputer. The HCN system is described, and the performance/cost analysis and other competing factors within the system design are reviewed

    A fault-tolerant multiprocessor architecture for aircraft, volume 1

    Get PDF
    A fault-tolerant multiprocessor architecture is reported. This architecture, together with a comprehensive information system architecture, has important potential for future aircraft applications. A preliminary definition and assessment of a suitable multiprocessor architecture for such applications is developed

    Advanced software techniques for space shuttle data management systems Final report

    Get PDF
    Airborne/spaceborn computer design and techniques for space shuttle data management system

    Autonomous spacecraft maintenance study group

    Get PDF
    A plan to incorporate autonomous spacecraft maintenance (ASM) capabilities into Air Force spacecraft by 1989 is outlined. It includes the successful operation of the spacecraft without ground operator intervention for extended periods of time. Mechanisms, along with a fault tolerant data processing system (including a nonvolatile backup memory) and an autonomous navigation capability, are needed to replace the routine servicing that is presently performed by the ground system. The state of the art fault handling capabilities of various spacecraft and computers are described, and a set conceptual design requirements needed to achieve ASM is established. Implementations for near term technology development needed for an ASM proof of concept demonstration by 1985, and a research agenda addressing long range academic research for an advanced ASM system for 1990s are established

    Survivable algorithms and redundancy management in NASA's distributed computing systems

    Get PDF
    The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are, by contrast, little understood, and much more work in this field is required. We outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design. We introduce our consensus-based framework for fault-tolerant system design. This is followed by a description of a hierarchical partitioning method for efficient consensus. A scheduler for redundancy management is introduced, and application-specific fault tolerance is described. We give an overview of our hybrid algorithm technique, which is an alternative to the formal approach given

    An Adaptive Design Methodology for Reduction of Product Development Risk

    Full text link
    Embedded systems interaction with environment inherently complicates understanding of requirements and their correct implementation. However, product uncertainty is highest during early stages of development. Design verification is an essential step in the development of any system, especially for Embedded System. This paper introduces a novel adaptive design methodology, which incorporates step-wise prototyping and verification. With each adaptive step product-realization level is enhanced while decreasing the level of product uncertainty, thereby reducing the overall costs. The back-bone of this frame-work is the development of Domain Specific Operational (DOP) Model and the associated Verification Instrumentation for Test and Evaluation, developed based on the DOP model. Together they generate functionally valid test-sequence for carrying out prototype evaluation. With the help of a case study 'Multimode Detection Subsystem' the application of this method is sketched. The design methodologies can be compared by defining and computing a generic performance criterion like Average design-cycle Risk. For the case study, by computing Average design-cycle Risk, it is shown that the adaptive method reduces the product development risk for a small increase in the total design cycle time.Comment: 21 pages, 9 figure
    • …
    corecore