1,158 research outputs found

    Design of a fault tolerant airborne digital computer. Volume 1: Architecture

    Get PDF
    This volume is concerned with the architecture of a fault tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable. contractible, and (4) The design is to appropriate to post 1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition SIFT is particularly simple and believable. The other candidates, Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor are potentially more efficient than SIFT in the use of redundancy, but otherwise are not as attractive

    Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery

    Get PDF
    The trend of downsizing transistors and operating voltage scaling has made the processor chip more sensitive against radiation phenomena making soft errors an important challenge. New reliability techniques for handling soft errors in the logic and memories that allow meeting the desired failures-in-time (FIT) target are key to keep harnessing the benefits of Moore's law. The failure to scale the soft error rate caused by particle strikes, may soon limit the total number of cores that one may have running at the same time. This paper proposes a light-weight and scalable architecture to eliminate silent data corruption errors (SDC) and detected unrecoverable errors (DUE) of a core. The architecture uses acoustic wave detectors for error detection. We propose to recover by confining the errors in the cache hierarchy, allowing us to deal with the relatively long detection latencies. Our results show that the proposed mechanism protects the whole core (logic, latches and memory arrays) incurring performance overhead as low as 0.60%. © 2014 IEEE.Peer ReviewedPostprint (author's final draft

    Fault tolerance issues in nanoelectronics

    Get PDF
    The astonishing success story of microelectronics cannot go on indefinitely. In fact, once devices reach the few-atom scale (nanoelectronics), transient quantum effects are expected to impair their behaviour. Fault tolerant techniques will then be required. The aim of this thesis is to investigate the problem of transient errors in nanoelectronic devices. Transient error rates for a selection of nanoelectronic gates, based upon quantum cellular automata and single electron devices, in which the electrostatic interaction between electrons is used to create Boolean circuits, are estimated. On the bases of such results, various fault tolerant solutions are proposed, for both logic and memory nanochips. As for logic chips, traditional techniques are found to be unsuitable. A new technique, in which the voting approach of triple modular redundancy (TMR) is extended by cascading TMR units composed of nanogate clusters, is proposed and generalised to other voting approaches. For memory chips, an error correcting code approach is found to be suitable. Various codes are considered and a lookup table approach is proposed for encoding and decoding. We are then able to give estimations for the redundancy level to be provided on nanochips, so as to make their mean time between failures acceptable. It is found that, for logic chips, space redundancies up to a few tens are required, if mean times between failures have to be of the order of a few years. Space redundancy can also be traded for time redundancy. As for memory chips, mean times between failures of the order of a few years are found to imply both space and time redundancies of the order of ten

    Survey of Soft Error Mitigation Techniques Applied to LEON3 Soft Processors on SRAM-Based FPGAs

    Get PDF
    Soft-core processors implemented in SRAM-based FPGAs are an attractive option for applications to be employed in radiation environments due to their flexibility, relatively-low application development costs, and reconfigurability features enabling them to adapt to the evolving mission needs. Despite the advantages soft-core processors possess, they are seldom used in critical applications because they are more sensitive to radiation than their hard-core counterparts. For instance, both the logic and signal routing circuitry of a soft-core processor as well as its user memory are susceptible to radiation-induced faults. Therefore, soft-core processors must be appropriately hardened against ionizing-radiation to become a feasible design choice for harsh environments and thus to reap all their benefits. This survey henceforth discusses various techniques to protect the configuration and user memories of an LEON3 soft processor, which is one of the most widely used soft-core processors in radiation environments, as reported in the state-of-the-art literature, with the objective of facilitating the choice of right fault-mitigation solution for any given soft-core processor

    Fault and Defect Tolerant Computer Architectures: Reliable Computing With Unreliable Devices

    Get PDF
    This research addresses design of a reliable computer from unreliable device technologies. A system architecture is developed for a fault and defect tolerant (FDT) computer. Trade-offs between different techniques are studied and yield and hardware cost models are developed. Fault and defect tolerant designs are created for the processor and the cache memory. Simulation results for the content-addressable memory (CAM)-based cache show 90% yield with device failure probabilities of 3 x 10(-6), three orders of magnitude better than non fault tolerant caches of the same size. The entire processor achieves 70% yield with device failure probabilities exceeding 10(-6). The required hardware redundancy is approximately 15 times that of a non-fault tolerant design. While larger than current FT designs, this architecture allows the use of devices much more likely to fail than silicon CMOS. As part of model development, an improved model is derived for NAND Multiplexing. The model is the first accurate model for small and medium amounts of redundancy. Previous models are extended to account for dependence between the inputs and produce more accurate results

    Single event upset hardened embedded domain specific reconfigurable architecture

    Get PDF

    Self-healing concepts involving fine-grained redundancy for electronic systems

    Get PDF
    The start of the digital revolution came through the metal-oxide-semiconductor field-effect transistor (MOSFET) in 1959 followed by massive integration onto a silicon die by means of constant down scaling of individual components. Digital systems for certain applications require fault-tolerance against faults caused by temporary or permanent influence. The most widely used technique is triple module redundancy (TMR) in conjunction with a majority voter, which is regarded as a passive fault mitigation strategy. Design by functional resilience has been applied to circuit structures for increased fault-tolerance and towards self-diagnostic triggered self-healing. The focus of this thesis is therefore to develop new design strategies for fault detection and mitigation within transistor, gate and cell design levels. The research described in this thesis makes three contributions. The first contribution is based on adding fine-grained transistor level redundancy to logic gates in order to accomplish stuck-at fault-tolerance. The objective is to realise maximum fault-masking for a logic gate with minimal added redundant transistors. In the case of non-maskable stuck-at faults, the gate structure generates an intrinsic indication signal that is suitable for autonomous self-healing functions. As a result, logic circuitry utilising this design is now able to differentiate between gate faults and faults occurring in inter-gate connections. This distinction between fault-types can then be used for triggering selective self-healing responses. The second contribution is a logic matrix element which applies the three core redundancy concepts of spatial- temporal- and data-redundancy. This logic structure is composed of quad-modular redundant structures and is capable of selective fault-masking and localisation depending of fault-type at the cell level, which is referred to as a spatiotemporal quadded logic cell (QLC) structure. This QLC structure has the capability of cellular self-healing. Through the combination of fault-tolerant and masking logic features the QLC is designed with a fault-behaviour that is equal to existing quadded logic designs using only 33.3% of the equivalent transistor resources. The inherent self-diagnosing feature of QLC is capable of identifying individual faulty cells and can trigger self-healing features. The final contribution is focused on the conversion of finite state machines (FSM) into memory to achieve better state transition timing, minimal memory utilisation and fault protection compared to common FSM designs. A novel implementation based on content-addressable type memory (CAM) is used to achieve this. The FSM is further enhanced by creating the design out of logic gates of the first contribution by achieving stuck-at fault resilience. Applying cross-data parity checking, the FSM becomes equipped with single bit fault detection and correction

    Adaptive-Hybrid Redundancy for Radiation Hardening

    Get PDF
    An Adaptive-Hybrid Redundancy (AHR) mitigation strategy is proposed to mitigate the effects of Single Event Upset (SEU) and Single Event Transient (SET) radiation effects. AHR is adaptive because it switches between Triple Modular Redundancy (TMR) and Temporal Software Redundancy (TSR). AHR is hybrid because it uses hardware and software redundancy. AHR is demonstrated to run faster than TSR and use less energy than TMR. Furthermore, AHR allows space vehicle designers, mission planners, and operators the flexibility to determine how much time is spent in TMR and TSR. TMR mode provides faster processing at the expense of greater energy usage. TSR mode uses less energy at the expense of processing speed. AHR allows the user to determine the optimal balance between these modes based on their mission needs and changes can be made even after the space vehicle is operational. Radiation testing was performed to determine the SEU injection rate for simulations and analyses. A Field Programmable Gate Array (FPGA) was used to expedite testing in hardware

    Resilience of an embedded architecture using hardware redundancy

    Get PDF
    In the last decade the dominance of the general computing systems market has being replaced by embedded systems with billions of units manufactured every year. Embedded systems appear in contexts where continuous operation is of utmost importance and failure can be profound. Nowadays, radiation poses a serious threat to the reliable operation of safety-critical systems. Fault avoidance techniques, such as radiation hardening, have been commonly used in space applications. However, these components are expensive, lag behind commercial components with regards to performance and do not provide 100% fault elimination. Without fault tolerant mechanisms, many of these faults can become errors at the application or system level, which in turn, can result in catastrophic failures. In this work we study the concepts of fault tolerance and dependability and extend these concepts providing our own definition of resilience. We analyse the physics of radiation-induced faults, the damage mechanisms of particles and the process that leads to computing failures. We provide extensive taxonomies of 1) existing fault tolerant techniques and of 2) the effects of radiation in state-of-the-art electronics, analysing and comparing their characteristics. We propose a detailed model of faults and provide a classification of the different types of faults at various levels. We introduce an algorithm of fault tolerance and define the system states and actions necessary to implement it. We introduce novel hardware and system software techniques that provide a more efficient combination of reliability, performance and power consumption than existing techniques. We propose a new element of the system called syndrome that is the core of a resilient architecture whose software and hardware can adapt to reliable and unreliable environments. We implement a software simulator and disassembler and introduce a testing framework in combination with ERA’s assembler and commercial hardware simulators
    • …
    corecore