8,171 research outputs found

    Fault-tolerance techniques for hybrid CMOS/nanoarchitecture

    Get PDF
    The authors propose two fault-tolerance techniques for hybrid CMOS/nanoarchitecture implementing logic functions as look-up tables. The authors compare the efficiency of the proposed techniques with recently reported methods that use single coding schemes in tolerating high fault rates in nanoscale fabrics. Both proposed techniques are based on error correcting codes to tackle different fault rates. In the first technique, the authors implement a combined two-dimensional coding scheme using Hamming and Bose-Chaudhuri-Hocquenghem (BCH) codes to address fault rates greater than 5. In the second technique, Hamming coding is complemented with bad line exclusion technique to tolerate fault rates higher than the first proposed technique (up to 20). The authors have also estimated the improvement that can be achieved in the circuit reliability in the presence of Don-t Care Conditions. The area, latency and energy costs of the proposed techniques were also estimated in the CMOS domain

    Fault Secure Encoder and Decoder for NanoMemory Applications

    Get PDF
    Memory cells have been protected from soft errors for more than a decade; due to the increase in soft error rate in logic circuits, the encoder and decoder circuitry around the memory blocks have become susceptible to soft errors as well and must also be protected. We introduce a new approach to design fault-secure encoder and decoder circuitry for memory designs. The key novel contribution of this paper is identifying and defining a new class of error-correcting codes whose redundancy makes the design of fault-secure detectors (FSD) particularly simple. We further quantify the importance of protecting encoder and decoder circuitry against transient errors, illustrating a scenario where the system failure rate (FIT) is dominated by the failure rate of the encoder and decoder. We prove that Euclidean geometry low-density parity-check (EG-LDPC) codes have the fault-secure detector capability. Using some of the smaller EG-LDPC codes, we can tolerate bit or nanowire defect rates of 10% and fault rates of 10^(-18) upsets/device/cycle, achieving a FIT rate at or below one for the entire memory system and a memory density of 10^(11) bit/cm^2 with nanowire pitch of 10 nm for memory blocks of 10 Mb or larger. Larger EG-LDPC codes can achieve even higher reliability and lower area overhead

    Fault tolerant architectures for integrated aircraft electronics systems, task 2

    Get PDF
    The architectural basis for an advanced fault tolerant on-board computer to succeed the current generation of fault tolerant computers is examined. The network error tolerant system architecture is studied with particular attention to intercluster configurations and communication protocols, and to refined reliability estimates. The diagnosis of faults, so that appropriate choices for reconfiguration can be made is discussed. The analysis relates particularly to the recognition of transient faults in a system with tasks at many levels of priority. The demand driven data-flow architecture, which appears to have possible application in fault tolerant systems is described and work investigating the feasibility of automatic generation of aircraft flight control programs from abstract specifications is reported

    Fault-tolerant computer study

    Get PDF
    A set of building block circuits is described which can be used with commercially available microprocessors and memories to implement fault tolerant distributed computer systems. Each building block circuit is intended for VLSI implementation as a single chip. Several building blocks and associated processor and memory chips form a self checking computer module with self contained input output and interfaces to redundant communications buses. Fault tolerance is achieved by connecting self checking computer modules into a redundant network in which backup buses and computer modules are provided to circumvent failures. The requirements and design methodology which led to the definition of the building block circuits are discussed

    On Fault Tolerance Methods for Networks-on-Chip

    Get PDF
    Technology scaling has proceeded into dimensions in which the reliability of manufactured devices is becoming endangered. The reliability decrease is a consequence of physical limitations, relative increase of variations, and decreasing noise margins, among others. A promising solution for bringing the reliability of circuits back to a desired level is the use of design methods which introduce tolerance against possible faults in an integrated circuit. This thesis studies and presents fault tolerance methods for network-onchip (NoC) which is a design paradigm targeted for very large systems-onchip. In a NoC resources, such as processors and memories, are connected to a communication network; comparable to the Internet. Fault tolerance in such a system can be achieved at many abstraction levels. The thesis studies the origin of faults in modern technologies and explains the classification to transient, intermittent and permanent faults. A survey of fault tolerance methods is presented to demonstrate the diversity of available methods. Networks-on-chip are approached by exploring their main design choices: the selection of a topology, routing protocol, and flow control method. Fault tolerance methods for NoCs are studied at different layers of the OSI reference model. The data link layer provides a reliable communication link over a physical channel. Error control coding is an efficient fault tolerance method especially against transient faults at this abstraction level. Error control coding methods suitable for on-chip communication are studied and their implementations presented. Error control coding loses its effectiveness in the presence of intermittent and permanent faults. Therefore, other solutions against them are presented. The introduction of spare wires and split transmissions are shown to provide good tolerance against intermittent and permanent errors and their combination to error control coding is illustrated. At the network layer positioned above the data link layer, fault tolerance can be achieved with the design of fault tolerant network topologies and routing algorithms. Both of these approaches are presented in the thesis together with realizations in the both categories. The thesis concludes that an optimal fault tolerance solution contains carefully co-designed elements from different abstraction levelsSiirretty Doriast

    Preserving Stabilization while Practically Bounding State Space

    Full text link
    Stabilization is a key dependability property for dealing with unanticipated transient faults, as it guarantees that even in the presence of such faults, the system will recover to states where it satisfies its specification. One of the desirable attributes of stabilization is the use of bounded space for each variable. In this paper, we present an algorithm that transforms a stabilizing program that uses variables with unbounded domain into a stabilizing program that uses bounded variables and (practically bounded) physical time. While non-stabilizing programs (that do not handle transient faults) can deal with unbounded variables by assigning large enough but bounded space, stabilizing programs that need to deal with arbitrary transient faults cannot do the same since a transient fault may corrupt the variable to its maximum value. We show that our transformation algorithm is applicable to several problems including logical clocks, vector clocks, mutual exclusion, leader election, diffusing computations, Paxos based consensus, and so on. Moreover, our approach can also be used to bound counters used in an earlier work by Katz and Perry for adding stabilization to a non-stabilizing program. By combining our algorithm with that earlier work by Katz and Perry, it would be possible to provide stabilization for a rich class of problems, by assigning large enough but bounded space for variables.Comment: Moved some content from the Appendix to the main paper, added some details to the transformation algorithm and to its descriptio

    Characterizing a Detection Strategy for Transient Faults in HPC

    Get PDF
    Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger detection intervals and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day, and that they will propagate to generate errors that will range from process crashes to corrupted results, with undetected errors in applications that are still running. In this article, we analyze a methodology for transient fault detection (called SMCV) for MPI applications. The methodology is based on software replication, and it assumes that data corruption is made apparent producing different messages between replicas. SMCV allows obtaining reliable executions with correct results, or, at least, leading the system to a safe stop. This work presents a complete characterization, formally defining the behavior in the presence of faults and experimentally validating it in order to show its efficacy and viability to detect transient faults in HPC systems.Red de Universidades con Carreras en Informática (RedUNCI

    Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer

    Get PDF
    SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct, processor to processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness
    corecore