45,143 research outputs found

    Efficient protection of the pipeline core for safety-critical processor-based systems

    Get PDF
    The increasing number of safety-critical commercial applications has generated a need for components with high levels of reliability. As CMOS process sizes continue to shrink, the reliability of ICs is negatively affected since they become more sensitive to transient faults. New circuit designs must take this fact into consideration, and incorporate adequate protection against the effects of transient faults. This paper presents a novel method for protecting the pipelined execution unit of an embedded processor. It is based on a self-configured architecture with hybrid redundancy that can mask single and multiple errors, which can occur on storage elements due to transient or permanent faults. This concept can be easily applied to any processing architecture of this nature with a high safety integrity level. Results from error-injection experiments are also reported that show that this design can maintain a non-interrupted and failure-free operation under single and double errors with a probability that exceeds 99.4%

    What does fault tolerant Deep Learning need from MPI?

    Full text link
    Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithm for large scale data analysis. DL algorithms are computationally expensive - even distributed DL implementations which use MPI require days of training (model learning) time on commonly studied datasets. Long running DL applications become susceptible to faults - requiring development of a fault tolerant system infrastructure, in addition to fault tolerant DL algorithms. This raises an important question: What is needed from MPI for de- signing fault tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification by an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion on the suitability of different parallelism types (model, data and hybrid); a need (or lack thereof) for check-pointing of any critical data structures; and most importantly, consideration for several fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI and their applicability to fault tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available under the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches by ex- tending MaTEx-Caffe for using ULFM-based implementation. Our evaluation using the ImageNet dataset and AlexNet, and GoogLeNet neural network topologies demonstrates the effectiveness of the proposed fault tolerant DL implementation using OpenMPI based ULFM

    Reasoning about the Reliability of Diverse Two-Channel Systems in which One Channel is "Possibly Perfect"

    Get PDF
    This paper considers the problem of reasoning about the reliability of fault-tolerant systems with two "channels" (i.e., components) of which one, A, supports only a claim of reliability, while the other, B, by virtue of extreme simplicity and extensive analysis, supports a plausible claim of "perfection." We begin with the case where either channel can bring the system to a safe state. We show that, conditional upon knowing pA (the probability that A fails on a randomly selected demand) and pB (the probability that channel B is imperfect), a conservative bound on the probability that the system fails on a randomly selected demand is simply pA.pB. That is, there is conditional independence between the events "A fails" and "B is imperfect." The second step of the reasoning involves epistemic uncertainty about (pA, pB) and we show that under quite plausible assumptions, a conservative bound on system pfd can be constructed from point estimates for just three parameters. We discuss the feasibility of establishing credible estimates for these parameters. We extend our analysis from faults of omission to those of commission, and then combine these to yield an analysis for monitored architectures of a kind proposed for aircraft

    A Self-Repairing Execution Unit for Microprogrammed Processors

    Get PDF
    Describes a processor which dynamically reconfigures its internal microcode to execute each instruction using only fault-free blocks from the execution unit. Working without redundant or spare computational blocks, this self-repair approach permits a graceful performance degradatio

    Model-based dependability analysis : state-of-the-art, challenges and future outlook

    Get PDF
    Abstract: Over the past two decades, the study of model-based dependability analysis has gathered significant research interest. Different approaches have been developed to automate and address various limitations of classical dependability techniques to contend with the increasing complexity and challenges of modern safety-critical system. Two leading paradigms have emerged, one which constructs predictive system failure models from component failure models compositionally using the topology of the system. The other utilizes design models - typically state automata - to explore system behaviour through fault injection. This paper reviews a number of prominent techniques under these two paradigms, and provides an insight into their working mechanism, applicability, strengths and challenges, as well as recent developments within these fields. We also discuss the emerging trends on integrated approaches and advanced analysis capabilities. Lastly, we outline the future outlook for model-based dependability analysis

    DFT and BIST of a multichip module for high-energy physics experiments

    Get PDF
    Engineers at Politecnico di Torino designed a multichip module for high-energy physics experiments conducted on the Large Hadron Collider. An array of these MCMs handles multichannel data acquisition and signal processing. Testing the MCM from board to die level required a combination of DFT strategie

    Fault Secure Encoder and Decoder for NanoMemory Applications

    Get PDF
    Memory cells have been protected from soft errors for more than a decade; due to the increase in soft error rate in logic circuits, the encoder and decoder circuitry around the memory blocks have become susceptible to soft errors as well and must also be protected. We introduce a new approach to design fault-secure encoder and decoder circuitry for memory designs. The key novel contribution of this paper is identifying and defining a new class of error-correcting codes whose redundancy makes the design of fault-secure detectors (FSD) particularly simple. We further quantify the importance of protecting encoder and decoder circuitry against transient errors, illustrating a scenario where the system failure rate (FIT) is dominated by the failure rate of the encoder and decoder. We prove that Euclidean geometry low-density parity-check (EG-LDPC) codes have the fault-secure detector capability. Using some of the smaller EG-LDPC codes, we can tolerate bit or nanowire defect rates of 10% and fault rates of 10^(-18) upsets/device/cycle, achieving a FIT rate at or below one for the entire memory system and a memory density of 10^(11) bit/cm^2 with nanowire pitch of 10 nm for memory blocks of 10 Mb or larger. Larger EG-LDPC codes can achieve even higher reliability and lower area overhead
    corecore