2,095 research outputs found
Quantifying fault recovery in multiprocessor systems
Various aspects of reliable computing are formalized and quantified with emphasis on efficient fault recovery. The mathematical model which proves to be most appropriate is provided by the theory of graphs. New measures for fault recovery are developed and the value of elements of the fault recovery vector are observed to depend not only on the computation graph H and the architecture graph G, but also on the specific location of a fault. In the examples, a hypercube is chosen as a representative of parallel computer architecture, and a pipeline as a typical configuration for program execution. Dependability qualities of such a system is defined with or without a fault. These qualities are determined by the resiliency triple defined by three parameters: multiplicity, robustness, and configurability. Parameters for measuring the recovery effectiveness are also introduced in terms of distance, time, and the number of new, used, and moved nodes and edges
Redundancy management for efficient fault recovery in NASA's distributed computing system
The management of redundancy in computer systems was studied and guidelines were provided for the development of NASA's fault-tolerant distributed systems. Fault recovery and reconfiguration mechanisms were examined. A theoretical foundation was laid for redundancy management by efficient reconfiguration methods and algorithmic diversity. Algorithms were developed to optimize the resources for embedding of computational graphs of tasks in the system architecture and reconfiguration of these tasks after a failure has occurred. The computational structure represented by a path and the complete binary tree was considered and the mesh and hypercube architectures were targeted for their embeddings. The innovative concept of Hybrid Algorithm Technique was introduced. This new technique provides a mechanism for obtaining fault tolerance while exhibiting improved performance
Doing-it-All with Bounded Work and Communication
We consider the Do-All problem, where cooperating processors need to
complete similar and independent tasks in an adversarial setting. Here we
deal with a synchronous message passing system with processors that are subject
to crash failures. Efficiency of algorithms in this setting is measured in
terms of work complexity (also known as total available processor steps) and
communication complexity (total number of point-to-point messages). When work
and communication are considered to be comparable resources, then the overall
efficiency is meaningfully expressed in terms of effort defined as work +
communication. We develop and analyze a constructive algorithm that has work
and a nonconstructive
algorithm that has work . The latter result is close to the
lower bound on work. The effort of each of
these algorithms is proportional to its work when the number of crashes is
bounded above by , for some positive constant . We also present a
nonconstructive algorithm that has effort
Counter Attack on Byzantine Generals: Parameterized Model Checking of Fault-tolerant Distributed Algorithms
We introduce an automated parameterized verification method for
fault-tolerant distributed algorithms (FTDA). FTDAs are parameterized by both
the number of processes and the assumed maximum number of Byzantine faulty
processes. At the center of our technique is a parametric interval abstraction
(PIA) where the interval boundaries are arithmetic expressions over parameters.
Using PIA for both data abstraction and a new form of counter abstraction, we
reduce the parameterized problem to finite-state model checking. We demonstrate
the practical feasibility of our method by verifying several variants of the
well-known distributed algorithm by Srikanth and Toueg. Our semi-decision
procedures are complemented and motivated by an undecidability proof for FTDA
verification which holds even in the absence of interprocess communication. To
the best of our knowledge, this is the first paper to achieve parameterized
automated verification of Byzantine FTDA
Secondary techniques for increasing fault coverage of fault detection test sequences for asynchronous sequential networks
The generation of fault detection sequences for asynchronous sequential networks is considered here. Several techniques exist for the generation of fault detection sequences on combinational and clocked sequential networks. Although these techniques provide closed solutions for combinational and clocked networks, they meet with much less success when used as strategies on asynchronous networks. It is presently assumed that the general asynchronous problem defies closed solution. For this reason, a secondary procedure is presented here to facilitate increased fault coverage by a given fault detection test sequence. This procedure is successful on all types of logic networks but is, perhaps, most useful in the asynchronous case since this is the problem on which other techniques fail. The secondary procedure has been designed to improve the fault coverage accomplished by any fault detection sequence regardless of the origin of the sequence. The increased coverage is accomplished by a minimum amount of additional internal hardware and/or a minimum of additional package outputs. The procedure presented here will function as part of an overall digital fault detection system, which will be composed of: 1) a compatible digital logic simulator, 2) a set of fault detection sequence generators, 3) secondary procedures for increasing fault coverage, 4) procedures to allow for diagnosis to a variable level. This research is directed at presenting a complete solution to the problems involved with developing secondary procedures for increasing the fault coverage of fault detection sequences --Abstract, pages ii-iii
- …