10,318 research outputs found

    Design of a fault tolerant airborne digital computer. Volume 1: Architecture

    Get PDF
    This volume is concerned with the architecture of a fault tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable. contractible, and (4) The design is to appropriate to post 1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition SIFT is particularly simple and believable. The other candidates, Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor are potentially more efficient than SIFT in the use of redundancy, but otherwise are not as attractive

    DeSyRe: on-Demand System Reliability

    No full text
    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect and fault-free system, would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault-tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints

    Fast Convergence and Reduced Complexity Receiver Design for LDS-OFDM System

    Get PDF
    Low density signature for OFDM (LDS-OFDM) is able to achieve satisfactory performance in overloaded conditions, but the existing LDS-OFDM has the drawback of slow convergence rate for multiuser detection (MUD) and high receiver complexity. To tackle these problems, we propose a serial schedule for the iterative MUD. By doing so, the convergence rate of MUD is accelerated and the detection iterations can be decreased. Furthermore, in order to exploit the similar sparse structure of LDS-OFDM and LDPC code, we utilize LDPC codes for LDS-OFDM system. Simulations show that compared with existing LDS-OFDM, the LDPC code improves the system performance

    Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

    Get PDF
    In this paper, we evaluate the error criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as long as imprecise computing is concerned, the simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications’ output correlating the number of corrupted elements with their spatial locality. Also, we provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude. We apply the selected metrics to experimental results obtained in various radiation test campaigns for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while Xeon Phi is more reliable when executing particles interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.This work was supported by the STIC-AmSud/CAPES scientific cooperation program under the EnergySFE research project grant 99999.007556/2015-02, EU H2020 Programme, and MCTI/RNP-Brazil under the HPC4E Project, grant agreement n° 689772. Tested K40 boards were donated thanks to Steve Keckler, Timothy Tsai, and Siva Hari from NVIDIA.Postprint (author's final draft

    Designing Low Cost Error Correction Schemes for Improving Memory Reliability

    Get PDF
    abstract: Memory systems are becoming increasingly error-prone, and thus guaranteeing their reliability is a major challenge. In this dissertation, new techniques to improve the reliability of both 2D and 3D dynamic random access memory (DRAM) systems are presented. The proposed schemes have higher reliability than current systems but with lower power, better performance and lower hardware cost. First, a low overhead solution that improves the reliability of commodity DRAM systems with no change in the existing memory architecture is presented. Specifically, five erasure and error correction (E-ECC) schemes are proposed that provide at least Chipkill-Correct protection for x4 (Schemes 1, 2 and 3), x8 (Scheme 4) and x16 (Scheme 5) DRAM systems. All schemes have superior error correction performance due to the use of strong symbol-based codes. In addition, the use of erasure codes extends the lifetime of the 2D DRAM systems. Next, two error correction schemes are presented for 3D DRAM memory systems. The first scheme is a rate-adaptive, two-tiered error correction scheme (RATT-ECC) that provides strong reliability (10^10x) reduction in raw FIT rate) for an HBM-like 3D DRAM system that services CPU applications. The rate-adaptive feature of RATT-ECC enables permanent bank failures to be handled through sparing. It can also be used to significantly reduce the refresh power consumption without decreasing the reliability and timing performance. The second scheme is a two-tiered error correction scheme (Config-ECC) that supports different sized accesses in GPU applications with strong reliability. It addresses the mismatch between data access size and fixed sized ECC scheme by designing a product code based flexible scheme. Config-ECC is built around a core unit designed for 32B access with a simple extension to support 64B and 128B accesses. Compared to fixed 32B and 64B ECC schemes, Config-ECC reduces the failure in time (FIT) rate by 200x and 20x, respectively. It also reduces the memory energy by 17% (in the dynamic mode) and 21% (in the static mode) compared to a state-of-the-art fixed 64B ECC scheme.Dissertation/ThesisDoctoral Dissertation Electrical Engineering 201

    An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration

    Get PDF
    We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power-efficiency of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to excessive circuit latency increase. We evaluate the reliability-power trade-off for such accelerators. Specifically, we experimentally study the reduced-voltage operation of multiple components of real FPGAs, characterize the corresponding reliability behavior of CNN accelerators, propose techniques to minimize the drawbacks of reduced-voltage operation, and combine undervolting with architectural CNN optimization techniques, i.e., quantization and pruning. We investigate the effect of environmental temperature on the reliability-power trade-off of such accelerators. We perform experiments on three identical samples of modern Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification CNN benchmarks. This approach allows us to study the effects of our undervolting technique for both software and hardware variability. We achieve more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain is the result of eliminating the voltage guardband region, i.e., the safe voltage region below the nominal level that is set by FPGA vendor to ensure correct functionality in worst-case environmental and circuit conditions. 43% of the power-efficiency gain is due to further undervolting below the guardband, which comes at the cost of accuracy loss in the CNN accelerator. We evaluate an effective frequency underscaling technique that prevents this accuracy loss, and find that it reduces the power-efficiency gain from 43% to 25%.Comment: To appear at the DSN 2020 conferenc
    • …
    corecore