209 research outputs found

    Quantifying fault recovery in multiprocessor systems

    Get PDF
    Various aspects of reliable computing are formalized and quantified with emphasis on efficient fault recovery. The mathematical model which proves to be most appropriate is provided by the theory of graphs. New measures for fault recovery are developed and the value of elements of the fault recovery vector are observed to depend not only on the computation graph H and the architecture graph G, but also on the specific location of a fault. In the examples, a hypercube is chosen as a representative of parallel computer architecture, and a pipeline as a typical configuration for program execution. Dependability qualities of such a system is defined with or without a fault. These qualities are determined by the resiliency triple defined by three parameters: multiplicity, robustness, and configurability. Parameters for measuring the recovery effectiveness are also introduced in terms of distance, time, and the number of new, used, and moved nodes and edges

    Fault-tolerant building-block computer study

    Get PDF
    Ultra-reliable core computers are required for improving the reliability of complex military systems. Such computers can provide reliable fault diagnosis, failure circumvention, and, in some cases serve as an automated repairman for their host systems. A small set of building-block circuits which can be implemented as single very large integration devices, and which can be used with off-the-shelf microprocessors and memories to build self checking computer modules (SCCM) is described. Each SCCM is a microcomputer which is capable of detecting its own faults during normal operation and is described to communicate with other identical modules over one or more Mil Standard 1553A buses. Several SCCMs can be connected into a network with backup spares to provide fault-tolerant operation, i.e. automated recovery from faults. Alternative fault-tolerant SCCM configurations are discussed along with the cost and reliability associated with their implementation

    Quadruplex digital flight control system assessment

    Get PDF
    Described are the development and validation of a double fail-operational digital flight control system architecture for critical pitch axis functions. Architectural tradeoffs are assessed, system simulator modifications are described, and demonstration testing results are critiqued. Assessment tools and their application are also illustrated. Ultimately, the vital role of system simulation, tailored to digital mechanization attributes, is shown to be essential to validating the airworthiness of full-time critical functions such as augmented fly-by-wire systems for relaxed static stability airplanes

    A bibliography on formal methods for system specification, design and validation

    Get PDF
    Literature on the specification, design, verification, testing, and evaluation of avionics systems was surveyed, providing 655 citations. Journal papers, conference papers, and technical reports are included. Manual and computer-based methods were employed. Keywords used in the online search are listed

    Experimental analysis of computer system dependability

    Get PDF
    This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software-dependability, and fault diagnosis. The discussion involves several important issues studies in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance

    Fault-tolerant interconnection networks for multiprocessor systems

    Get PDF
    Interconnection networks represent the backbone of multiprocessor systems. A failure in the network, therefore, could seriously degrade the system performance. For this reason, fault tolerance has been regarded as a major consideration in interconnection network design. This thesis presents two novel techniques to provide fault tolerance capabilities to three major networks: the Baseline network, the Benes network and the Clos network. First, the Simple Fault Tolerance Technique (SFT) is presented. The SFT technique is in fact the result of merging two widely known interconnection mechanisms: a normal interconnection network and a shared bus. This technique is most suitable for networks with small switches, such as the Baseline network and the Benes network. For the Clos network, whose switches may be large for the SFT, another technique is developed to produce the Fault-Tolerant Clos (FTC) network. In the FTC, one switch is added to each stage. The two techniques are described and thoroughly analyzed

    NASA space station automation: AI-based technology review

    Get PDF
    Research and Development projects in automation for the Space Station are discussed. Artificial Intelligence (AI) based automation technologies are planned to enhance crew safety through reduced need for EVA, increase crew productivity through the reduction of routine operations, increase space station autonomy, and augment space station capability through the use of teleoperation and robotics. AI technology will also be developed for the servicing of satellites at the Space Station, system monitoring and diagnosis, space manufacturing, and the assembly of large space structures

    Yield Analysis on Embedded Memory with Redundancy

    Get PDF

    Testability and redundancy techniques for improved yield and reliability of CMOS VLSI circuits

    Get PDF
    The research presented in this thesis is concerned with the design of fault-tolerant integrated circuits as a contribution to the design of fault-tolerant systems. The economical manufacture of very large area ICs will necessitate the incorporation of fault-tolerance features which are routinely employed in current high density dynamic random access memories. Furthermore, the growing use of ICs in safety-critical applications and/or hostile environments in addition to the prospect of single-chip systems will mandate the use of fault-tolerance for improved reliability. A fault-tolerant IC must be able to detect and correct all possible faults that may affect its operation. The ability of a chip to detect its own faults is not only necessary for fault-tolerance, but it is also regarded as the ultimate solution to the problem of testing. Off-line periodic testing is selected for this research because it achieves better coverage of physical faults and it requires less extra hardware than on-line error detection techniques. Tests for CMOS stuck-open faults are shown to detect all other faults. Simple test sequence generation procedures for the detection of all faults are derived. The test sequences generated by these procedures produce a trivial output, thereby, greatly simplifying the task of test response analysis. A further advantage of the proposed test generation procedures is that they do not require the enumeration of faults. The implementation of built-in self-test is considered and it is shown that the hardware overhead is comparable to that associated with pseudo-random and pseudo-exhaustive techniques while achieving a much higher fault coverage through-the use of the proposed test generation procedures. The consideration of the problem of testing the test circuitry led to the conclusion that complete test coverage may be achieved if separate chips cooperate in testing each other's untested parts. An alternative approach towards complete test coverage would be to design the test circuitry so that it is as distributed as possible and so that it is tested as it performs its function. Fault correction relies on the provision of spare units and a means of reconfiguring the circuit so that the faulty units are discarded. This raises the question of what is the optimum size of a unit? A mathematical model, linking yield and reliability is therefore developed to answer such a question and also to study the effects of such parameters as the amount of redundancy, the size of the additional circuitry required for testing and reconfiguration, and the effect of periodic testing on reliability. The stringent requirement on the size of the reconfiguration logic is illustrated by the application of the model to a typical example. Another important result concerns the effect of periodic testing on reliability. It is shown that periodic off-line testing can achieve approximately the same level of reliability as on-line testing, even when the time between tests is many hundreds of hours

    Definition and trade-off study of reconfigurable airborne digital computer system organizations

    Get PDF
    A highly-reliable, fault-tolerant reconfigurable computer system for aircraft applications was developed. The development and application reliability and fault-tolerance assessment techniques are described. Particular emphasis is placed on the needs of an all-digital, fly-by-wire control system appropriate for a passenger-carrying airplane
    corecore