198,214 research outputs found

    GPGPU Reliability Analysis: From Applications to Large Scale Systems

    Get PDF
    Over the past decade, GPUs have become an integral part of mainstream high-performance computing (HPC) facilities. Since applications running on HPC systems are usually long-running, any error or failure could result in significant loss in scientific productivity and system resources. Even worse, since HPC systems face severe resilience challenges as progressing towards exascale computing, it is imperative to develop a better understanding of the reliability of GPUs. This dissertation fills this gap by providing an understanding of the effects of soft errors on the entire system and on specific applications. To understand system-level reliability, a large-scale study on GPU soft errors in the field is conducted. The occurrences of GPU soft errors are linked to several temporal and spatial features, such as specific workloads, node location, temperature, and power consumption. Further, machine learning models are proposed to predict error occurrences on GPU nodes so as to proactively and dynamically turning on/off the costly error protection mechanisms based on prediction results. To understand the effects of soft errors at the application level, an effective fault-injection framework is designed aiming to understand the reliability and resilience characteristics of GPGPU applications. This framework is effective in terms of reducing the tremendous number of fault injection locations to a manageable size while still preserving remarkable accuracy. This framework is validated with both single-bit and multi-bit fault models for various GPGPU benchmarks. Lastly, taking advantage of the proposed fault-injection framework, this dissertation develops a hierarchical approach to understanding the error resilience characteristics of GPGPU applications at kernel, CTA, and warp levels. In addition, given that some corrupted application outputs due to soft errors may be acceptable, we present a use case to show how to enable low-overhead yet reliable GPU computing for GPGPU applications

    Terrestrial Cosmic Ray Induced Soft Errors and Large-Scale FPGA Systems in the Cloud

    Get PDF
    Radiation from outer space can cause soft errors in microelectronic devices deployed at terrestrial altitudes on Earth. Cosmic rays entering the Earth’s atmosphere create a complex cascade of radioactive particles. The most likely form of cosmic radiation to cause soft errors in microelectronics at terrestrial levels are neutrons. SRAM-based FPGAs are susceptible to terrestrial cosmic ray induced soft errors. These soft errors occur infrequently for a single device deployed at terrestrial altitudes. When many FPGAs are deployed in a large-scale system, the impact of these soft errors on reliability can be significant. This study examines terrestrial cosmic ray induced soft errors and the effects they can have on large-scale deployment of FPGAs in cloud computing. Fifteen data-center-like designs were tested for sensitivity through fault injecting. Sensitivities ranged from less than 1% to about 12% of randomly injected faults resulting in unacceptable behavior. A hypothetical but realistic large-scale FPGA system, with 100,000 node deployed at a high-altitude, running the most sensitive design would experience the dominant failure mode of silent data corruption every 3.8 hours on average. This system would only be able to retain reliability level above 0.99 for about two minutes. Some soft error detection and recover approaches are discussed

    Dependable Embedded Systems

    Get PDF
    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. This book introduces the most prominent reliability concerns from today’s points of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level such circuit level or system level alone, the focus of this book is to deal with the different reliability challenges across different levels starting from the physical level all the way to the system level (cross-layer approaches). The book aims at demonstrating how new hardware/software co-design solution can be proposed to ef-fectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, soft errors, etc. Provides readers with latest insights into novel, cross-layer methods and models with respect to dependability of embedded systems; Describes cross-layer approaches that can leverage reliability through techniques that are pro-actively designed with respect to techniques at other layers; Explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many core systems

    Fault Tolerant Electronic System Design

    Get PDF
    Due to technology scaling, which means reduced transistor size, higher density, lower voltage and more aggressive clock frequency, VLSI devices may become more sensitive against soft errors. Especially for those devices used in safety- and mission-critical applications, dependability and reliability are becoming increasingly important constraints during the development of system on/around them. Other phenomena (e.g., aging and wear-out effects) also have negative impacts on reliability of modern circuits. Recent researches show that even at sea level, radiation particles can still induce soft errors in electronic systems. On one hand, processor-based system are commonly used in a wide variety of applications, including safety-critical and high availability missions, e.g., in the automotive, biomedical and aerospace domains. In these fields, an error may produce catastrophic consequences. Thus, dependability is a primary target that must be achieved taking into account tight constraints in terms of cost, performance, power and time to market. With standards and regulations (e.g., ISO-26262, DO-254, IEC-61508) clearly specify the targets to be achieved and the methods to prove their achievement, techniques working at system level are particularly attracting. On the other hand, Field Programmable Gate Array (FPGA) devices are becoming more and more attractive, also in safety- and mission-critical applications due to the high performance, low power consumption and the flexibility for reconfiguration they provide. Two types of FPGAs are commonly used, based on their configuration memory cell technology, i.e., SRAM-based and Flash-based FPGA. For SRAM-based FPGAs, the SRAM cells of the configuration memory highly susceptible to radiation induced effects which can leads to system failure; and for Flash-based FPGAs, even though their non-volatile configuration memory cells are almost immune to Single Event Upsets induced by energetic particles, the floating gate switches and the logic cells in the configuration tiles can still suffer from Single Event Effects when hit by an highly charged particle. So analysis and mitigation techniques for Single Event Effects on FPGAs are becoming increasingly important in the design flow especially when reliability is one of the main requirements

    Single-Event Upset Analysis and Protection in High Speed Circuits

    Get PDF
    The effect of single-event transients (SETs) (at a combinational node of a design) on the system reliability is becoming a big concern for ICs manufactured using advanced technologies. An SET at a node of combinational part may cause a transient pulse at the input of a flip-flop and consequently is latched in the flip-flop and generates a soft-error. When an SET conjoined with a transition at a node along a critical path of the combinational part of a design, a transient delay fault may occur at the input of a flip-flop. On the other hand, increasing pipeline depth and using low power techniques such as multi-level power supply, and multi-threshold transistor convert almost all paths in a circuit to critical ones. Thus, studying the behavior of the SET in these kinds of circuits needs special attention. This paper studies the dynamic behavior of a circuit with massive critical paths in the presence of an SET. We also propose a novel flip-flop architecture to mitigate the effects of such SETs in combinational circuits. Furthermore, the proposed architecture can tolerant a single event upset (SEU) caused by particle strike on the internal nodes of a flip-flo

    耐ソフトエラーラッチにおける欠陥の分析、検出及び評価に関する研究

    Get PDF
    The development of modern integrated circuits (ICs) has greatly changed the life of humankind. Nowadays, IC s are also indispensable to mission-critical applications, such as medical devices, autonomous cars, aircraft navigating systems, and satellites. The reliability of these mission-critical applications is a major concern. A soft-error occurring in an IC is a severe threat to its reliability, especially for mission-critical applications. The continuous trend of shrinking technology feature sizes makes modern ICs more and more vulnerable to soft errors. Soft-errors are caused by radiation particles striking an IC and generating current pulses to disturb its functionality. A soft-error can cause data corruption and may eventually lead to system failure s If a soft-error occurs in an operational medical device during surgery, it may cause a malfunction of this device and interrupt the surgery process. A soft-error may change the control data of an autonomous car which may lead to an accident. A soft-error may corrupt the aircraft navigating systems. No one would take the chance to let it happen even though malfunction s caused by soft errors can be solved by resetting these devices. Because reset takes time and severe results may happen during the resetting. If a soft-error causes a malfunction in the control system of a satellite, it may not be able to maintain its height and eventually burn up as it falls into the Earth’s atmosphere. Hence, it is important to protect ICs from soft errors. Many soft-error tolerance methods have been proposed to protect ICs against soft-errors. In an IC, memory elements and storage elements (e.g., latches and flip flops) are the most vulnerable to soft-errors, and data stored in them are crucial to the operation of a circuit. Error correction codes (ECCs) can be u sed to protect memories. Register-level soft-error tolerance methods can be used to detect soft-errors in latches by using parity checking and correct them by resetting. Hardened designs protect latches against soft-errors by using redundant feedback loops to store the same input data and using a voter to select the correct output. The advantage of using hardened designs is that they can prevent soft-errors from reaching outputs while ECCs and register-level soft-error tolerance methods must detect soft-errors and then correct them by restoring the data. For protecting storage elements in mission-critical applications, hardened latch design is the best option because it has high reliability and can save the resetting time. Many state-of-the-art hardened latch designs have been proposed to tolerate soft errors and they are believed to have good soft-error tolerability. Defects (physical flaws due to imperfect production (production defects) and physical changes caused by aging effects after a long operation time (aging-related defects) can also cause a malfunction of a circuit and cause a system failure eventually. Different from the temporal state change of a circuit caused by soft errors, defects are permanent damages to a circuit and can disturb the behavior of a circuit from its desired manner. Defects in storage elements should be detected to make sure a system/device operating correctly and stably. Scan test is a commonly used defect detection method, which connects reconfigured storage elements to form a shift register with external access and the internal states of these storage elements can be easily controlled and checked. However, the impact of defects on existing state of the art hardened latch design has not been considered. This impact requires consideration because added redundancy in hardened latch designs can not only mask soft-errors but also mask the effects of defects and it can lead to two serious problems: Problem-1 (Low Testability): Production defects in hardened latch designs are difficult to detect with conventional scan tests, in which the observability (an important metric to evaluate a circuit’s testability) of defects in hardened latch designs can be greatly reduced. Therefore, existing state-of-the-art hardened latches have low observability and thus low testability. Furthermore, defects that escaped the production test (undetected defects) may become more and more serious and cause a system failure eventually. Problem-2 (Low Soft-Error Tolerability): Undetected defects and aging-related defects can make hardened latch designs vulnerable to soft-errors while defect-free ones do not. The soft-error tolerability of hardened latch designs may be compromise d by undetected defects or aging related defects. This research is the first to consider Problem-1 of low testability of hardened latches and Problem-2 of defects reducing the reliability of hardened latches. Furthermore, this research is the first to pro pose a comprehensive solution to solve these two problems with the following five major contributions: Contribution-1: A first of its kind metric for quantifying the impact of defects on hardened latches, called Post-Test Vulnerability Factor (PTVF). It is used to analyze the residual soft-error tolerability of hardened latches after testing. Problem-2 is solved by this first major contribution. Contribution-2: A novel design called Scan-Test-Aware Hardened Latch (STAHL) that provides the highest defect coverage in comparison with all existing hardened latches. Problem-1 is solved by using STAHL to build a scan c ell to perform a scan test. Contribution-3: A novel scan test procedure is proposed to solve Problem-1 by fully testing the STAHL based scan cell. Contribution-4: A novel High-Performance Scan-Test-Aware Hardened Latch (HP-STAHL) design can also solve Problem-1 and has similar defect coverage as STAHL but has lower power consumption and higher propagation speed. Contribution-5: A novel scan test procedure is proposed to fully test the HP STAHL-based scan cell to solve Problem-1. Comprehensive simulation results demonstrate the accuracy of the PTVF metric and the effectiveness of the STAHL-based scan test and HP-STAHL-based scan test. As the first comprehensive study bridging the gap between hardened latch design s and IC testing, the findings of this research are expected to significantly improve the soft-error-related reliability of IC designs for mission-critical applications. Furthermore, the two proposed hardened latches and the scan test procedures can not only be use d to detect defects after production but also can be applied to detect aging related defects in the field through performing built-in self-test (BIST). In Chapter 1, an example is introduced to indicate Problem-1 and Problem-2. Chapter 2 shows the background information of soft-errors and defects. Chapter 3 shows some typical soft-error mitigation methods and details of a scan test. Chapter 4 describes the detailed information of PTVF Contribution-1). Chapter 5 shows the structure of STAHL (Contribution-2) and Chapter 6 shows the scan test procedure of testing the STAHL-based scan cell (Contribution-3). Chapter 7 shows the structure of HP-STAHL (Contribution-4) and Chapter 8 shows the scan test procedure of testing the HP-STAHL based scan cell (Contribution-5). Chapter 9 shows the experimental results of comparing STAHL and HP-STAHL with state-of-the-art hardened latch designs. Chapter 10 concludes this thesis.九州工業大学博士学位論文 学位記番号:情工博甲第371号 学位授与年月日:令和4年9月26日1. Introduction|2. Background|3. Related Works|4. Post-Test Vulnerability Factor (PTVF)|5. Scan-Test Aware Hardened Latch (STAHL)|6. Scan Test Based on STAHL|7. High Performance Scan-Test-Aware Hardened Latch (HP STAHL)|8. Scan Test Based on HP STAHL|9. Experimental Evaluation|10. Conclusions and Future Works九州工業大学令和4年

    Statistical Reliability Estimation of Microprocessor-Based Systems

    Get PDF
    What is the probability that the execution state of a given microprocessor running a given application is correct, in a certain working environment with a given soft-error rate? Trying to answer this question using fault injection can be very expensive and time consuming. This paper proposes the baseline for a new methodology, based on microprocessor error probability profiling, that aims at estimating fault injection results without the need of a typical fault injection setup. The proposed methodology is based on two main ideas: a one-time fault-injection analysis of the microprocessor architecture to characterize the probability of successful execution of each of its instructions in presence of a soft-error, and a static and very fast analysis of the control and data flow of the target software application to compute its probability of success. The presented work goes beyond the dependability evaluation problem; it also has the potential to become the backbone for new tools able to help engineers to choose the best hardware and software architecture to structurally maximize the probability of a correct execution of the target softwar
    corecore