1,116 research outputs found

    Efficient diagnosis of multiprocessor systems under probabilistic models

    Get PDF
    The problem of fault diagnosis in multiprocessor systems is considered under a probabilistic fault model. The focus is on minimizing the number of tests that must be conducted in order to correctly diagnose the state of every processor in the system with high probability. A diagnosis algorithm that can correctly diagnose the state of every processor with probability approaching one in a class of systems performing slightly greater than a linear number of tests is presented. A nearly matching lower bound on the number of tests required to achieve correct diagnosis in arbitrary systems is also proven. Lower and upper bounds on the number of tests required for regular systems are also presented. A class of regular systems which includes hypercubes is shown to be correctly diagnosable with high probability. In all cases, the number of tests required under this probabilistic model is shown to be significantly less than under a bounded-size fault set model. Because the number of tests that must be conducted is a measure of the diagnosis overhead, these results represent a dramatic improvement in the performance of system-level diagnosis techniques

    Survivable algorithms and redundancy management in NASA's distributed computing systems

    Get PDF
    The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are, by contrast, little understood, and much more work in this field is required. We outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design. We introduce our consensus-based framework for fault-tolerant system design. This is followed by a description of a hierarchical partitioning method for efficient consensus. A scheduler for redundancy management is introduced, and application-specific fault tolerance is described. We give an overview of our hybrid algorithm technique, which is an alternative to the formal approach given

    Probabilistic diagnostics with P-graphs

    Get PDF
    This paper presents a novel approach for solving the probabilistic diagnosis problem in multiprocessor systems. The main idea of the algorithm is based on the reformulation of the diagnostic procedure as a P-graph model. The same, well-elaborated mathematical paradigm - originally used to model material flow - can be applied in our approach to model information flow. This idea is illustrated by deriving a maximum likelihood diagnostic decision procedure. The diagnostic accuracy of the solution is considered on the basis of simulation measurements, and a method of constructing a general framework for different aspects of a complex problem is demonstrated with the use of P-graph models

    Parallel Architectures for Planetary Exploration Requirements (PAPER)

    Get PDF
    The Parallel Architectures for Planetary Exploration Requirements (PAPER) project is essentially research oriented towards technology insertion issues for NASA's unmanned planetary probes. It was initiated to complement and augment the long-term efforts for space exploration with particular reference to NASA/LaRC's (NASA Langley Research Center) research needs for planetary exploration missions of the mid and late 1990s. The requirements for space missions as given in the somewhat dated Advanced Information Processing Systems (AIPS) requirements document are contrasted with the new requirements from JPL/Caltech involving sensor data capture and scene analysis. It is shown that more stringent requirements have arisen as a result of technological advancements. Two possible architectures, the AIPS Proof of Concept (POC) configuration and the MAX Fault-tolerant dataflow multiprocessor, were evaluated. The main observation was that the AIPS design is biased towards fault tolerance and may not be an ideal architecture for planetary and deep space probes due to high cost and complexity. The MAX concepts appears to be a promising candidate, except that more detailed information is required. The feasibility for adding neural computation capability to this architecture needs to be studied. Key impact issues for architectural design of computing systems meant for planetary missions were also identified

    Redundancy management for efficient fault recovery in NASA's distributed computing system

    Get PDF
    The management of redundancy in computer systems was studied and guidelines were provided for the development of NASA's fault-tolerant distributed systems. Fault recovery and reconfiguration mechanisms were examined. A theoretical foundation was laid for redundancy management by efficient reconfiguration methods and algorithmic diversity. Algorithms were developed to optimize the resources for embedding of computational graphs of tasks in the system architecture and reconfiguration of these tasks after a failure has occurred. The computational structure represented by a path and the complete binary tree was considered and the mesh and hypercube architectures were targeted for their embeddings. The innovative concept of Hybrid Algorithm Technique was introduced. This new technique provides a mechanism for obtaining fault tolerance while exhibiting improved performance

    Gradient based system-level diagnosis

    Get PDF
    Traditional approaches in system-level diagnosis in multiprocessor systems are usually based on the oversimplified PMC test invalidation model, however Blount introduced a more general model containing conditional probabilities as parameters for different test invalidation situations. He suggested a lookup table based approach, but no algorithmic solution has been elaborated until our P-graph based solution introduced in previous publications. In this approach the diagnostic process is formulated as an optimization problem and the optimal solution is determined. Although the average behavior of the algorithm is quite good, the worst case complexity is exponential. In this paper we introduce a novel group of fast diagnostic algorithms that we named gradient based algorithms. This approach only approximates the optimal maximum likelihood or maximum a posteriori solution, but it has a polynomial complexity of the magnitude of O\left (N \cdot NbCount + N^2\right ), where N is the size of the system and NbCount is number of neighbors of a single unit. The idea of the base algorithm is that it takes an initial fault pattern and iterates till the likelihood of the actual fault pattern can be increased with a single state-change in the pattern. Improvements of this base algorithm, complexity analysis and simulation results are also presented. The main, although not exclusive application field of the algorithms is wafer-scale diagnosis, since the accuracy and the performance is still good even if relative large number of faults are present

    Reliability Analysis of the Hypercube Architecture.

    Get PDF
    This dissertation presents improved techniques for analyzing network-connected (NCF), 2-connected (2CF), task-based (TBF), and subcube (SF) functionality measures in a hypercube multiprocessor with faulty processing elements (PE) and/or communication elements (CE). These measures help study system-level fault tolerance issues and relate to various application modes in the hypercube. Solutions discussed in the text fall into probabilistic and deterministic models. The probabilistic measure assumes a stochastic graph of the hypercube where PE\u27s and/or CE\u27s may fail with certain probabilities, while the deterministic model considers that some system components are already failed and aims to determine the system functionality. For probabilistic model, MIL-HDBK-217F is used to predict PE and CE failure rates for an Intel iPSC system. First, a technique called CAREL is presented. A proof of its correctness is included in an appendix. Using the shelling ordering concept, CAREL is shown to solve the exact probabilistic NCF measure for a hypercube in time polynomial in the number of spanning trees. However, this number increases exponentially in the hypercube dimension. This dissertation, then, aims to more efficiently obtain lower and upper bounds on the measures. Algorithms, presented in the text, generate tighter bounds than had been obtained previously and run in time polynomial in the cube dimension. The proposed algorithms for probabilistic 2CF measure consider PE and/or CE failures. In attempting to evaluate deterministic measures, a hybrid method for fault tolerant broadcasting in the hypercube is proposed. This method combines the favorable features of redundant and non-redundant techniques. A generalized result on the deterministic TBF measure for the hypercube is then described. Two distributed algorithms are proposed to identify the largest operational subcubes in a hypercube C\sb{n} with faulty PE\u27s. Method 1, called LOS1, requires a list of faulty components and utilizes the CMB operator of CAREL to solve the problem. In case the number of unavailable nodes (faulty or busy) increases, an alternative distributed approach, called LOS2, processes m available nodes in O(mn) time. The proposed techniques are simple and efficient

    Diagnosis of static Topology MANETs in faulty environment

    Get PDF
    Mobile Ad-Hoc Networks (MANETs) are set of mobile nodes that communicates wirelessly without a centralized supporting system. Faulty nodes aect the reliable transmission of messages across the network. In this thesis we deal with the fault identication problem in static topology MANETs. A comparison based approach is used where a set of tasks is given to the nodes and outcomes are compared. Based on these comparisons the nodes are classied either as faulty or fault free. Our new diagnosis model is based on the spanning tree concept in which the testing of the nodes as well as the construction of the spanning tree takes place simultaneously. As a result of which the maintenance and the repairing overhead of the spanning tree is completely avoided thus reducing the number of messages exchanged. We have also developed a simulator which can be applied to a network with large number of nodes.We have carried out the simulation in-order to nd out the total number of messages exchanged and the total diagnosis time. On analysing the results we have seen that our model performs better than its previous counterparts. The correctness and complexity proofs are also being provided which also shows that our model performs better from a communication as well as latency viewpoint
    corecore