
    Havens: Explicit Reliable Memory Regions for HPC Applications

    Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in the system memory. Existing hardware-based detection and recovery techniques will be inadequate to manage the presence of high memory fault rates. In this paper we propose a partial memory protection scheme based on region-based memory management. We define the concept of regions called havens that provide fault protection for program objects. We provide reliability for the regions through a software-based parity protection mechanism. Our approach enables critical program objects to be placed in these havens. The fault coverage provided by our approach is application agnostic, unlike algorithm-based fault tolerance techniques. (Comment: 2016 IEEE High Performance Extreme Computing Conference (HPEC '16), September 2016, Waltham, MA, US)
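    The core mechanism, keeping critical objects in a region whose integrity is verified by software-computed parity, can be sketched as follows. This is a minimal illustration in Python, assuming one XOR parity byte per fixed-size group of bytes; the group size, class name, and absence of a correction step are assumptions, not the paper's design.

```python
import numpy as np

class Haven:
    """Toy 'haven': a byte region guarded by one XOR parity byte per group.

    Granularity, method names, and the lack of a recovery step are
    assumptions for illustration, not the paper's implementation.
    """

    GROUP = 8  # bytes per parity group (assumed)

    def __init__(self, size):
        size = ((size + self.GROUP - 1) // self.GROUP) * self.GROUP
        self.data = np.zeros(size, dtype=np.uint8)
        self.parity = self._compute_parity()

    def _compute_parity(self):
        # One parity byte per GROUP-byte chunk: XOR of the chunk's bytes.
        return np.bitwise_xor.reduce(self.data.reshape(-1, self.GROUP), axis=1)

    def write(self, offset, payload):
        buf = np.frombuffer(payload, dtype=np.uint8)
        self.data[offset:offset + len(buf)] = buf
        self.parity = self._compute_parity()        # refresh parity after the write

    def check(self):
        # Indices of parity groups whose contents no longer match, i.e.
        # candidate locations of a transient error inside the haven.
        return np.flatnonzero(self._compute_parity() != self.parity)


h = Haven(64)
h.write(0, b"critical simulation state")
h.data[3] ^= 0x10                                   # simulate a single-bit upset
print(h.check())                                    # -> [0]: corrupted group flagged
```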

    Robust Face Recognition Against Soft-errors Using a Cross-layer Approach

    Recently, soft errors, temporary bit toggles in memory systems, have become increasingly important. Although soft errors are not critical to the stability of recognition or multimedia systems, they can significantly degrade system performance. Considering these facts, in this paper we propose a novel method for robust face recognition against soft errors using a cross-layer approach. To attenuate the effect of soft errors on the face recognition system, they are detected in the embedded system layer by a parity bit checker and compensated in the application layer using a mean face. We present the soft-error detection module for face recognition and the compensation module based on the mean face of the facial images. Simulation results show that the proposed system effectively compensates for the performance degradation due to soft errors, improving performance on average by 2.11% on the Yale database and by 10.43% on the ORL database compared to the performance with the soft errors induced.
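    A rough sketch of the cross-layer flow, with parity checking standing in for the embedded-layer detector and mean-face substitution for the application-layer compensation; the per-row parity granularity and the function names are illustrative assumptions, not the paper's modules.

```python
import numpy as np

def row_parity(img):
    # Embedded-layer view: one parity bit per image row, covering every bit
    # in the row (the granularity is an assumption for illustration).
    return np.bitwise_xor.reduce(np.unpackbits(img, axis=1), axis=1)

def compensate(received, stored_parity, mean_face):
    """Application-layer repair: rows whose parity no longer matches are
    replaced by the corresponding rows of the mean face."""
    bad_rows = np.flatnonzero(row_parity(received) != stored_parity)
    repaired = received.copy()
    repaired[bad_rows] = mean_face[bad_rows]
    return repaired, bad_rows

rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
mean_face = np.full((32, 32), 128, dtype=np.uint8)   # stand-in for the dataset mean

parity = row_parity(face)            # computed before the image is stored
face[5, 7] ^= 0x40                   # soft error: flip one bit of one pixel
fixed, rows = compensate(face, parity, mean_face)
print(rows)                          # -> [5]: the corrupted row was replaced
```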

    Techniques for Aging, Soft Errors and Temperature to Increase the Reliability of Embedded On-Chip Systems

    This thesis investigates the challenge of providing an abstracted, yet sufficiently accurate, reliability estimation for embedded on-chip systems. In addition, it proposes new techniques to increase the reliability of register files within processors against aging effects and soft errors. It also introduces a novel thermal measurement setup that clearly captures infrared images of modern multi-core processors.

    Wireless industrial monitoring and control networks: the journey so far and the road ahead

    While traditional wired communication technologies have played a crucial role in industrial monitoring and control networks over the past few decades, they are increasingly proving inadequate to meet the highly dynamic and stringent demands of today's industrial applications, primarily due to the very rigid nature of wired infrastructures. Wireless technology, through its increasing pervasiveness, has the potential to revolutionize the industry, not only by mitigating the problems faced by wired solutions but also by introducing a completely new class of applications. While present-day wireless technologies have made some preliminary inroads in the monitoring domain, they still have severe limitations, especially where real-time, reliable distributed control operations are concerned. This article provides the reader with an overview of existing wireless technologies commonly used in the monitoring and control industry. It highlights the pros and cons of each technology and assesses the degree to which each is able to meet the stringent demands of industrial monitoring and control networks. Additionally, it summarizes mechanisms proposed by academia, particularly those serving critical applications by addressing the real-time and reliability requirements of industrial process automation. The article also describes certain key research problems, from the perspectives of physical-layer communication for sensor networks and wireless networking, that have yet to be addressed to allow the successful use of wireless technologies in industrial monitoring and control networks.

    Dependable Computing on Inexact Hardware through Anomaly Detection.

    Reliability of transistors is on the decline as transistors continue to shrink in size, and aggressive voltage scaling is making the problem even worse. Scaled-down transistors are more susceptible to transient faults as well as permanent in-field hardware failures. In order to continue reaping the benefits of technology scaling, it has become imperative to tackle the challenges arising from the decreasing reliability of devices for the mainstream commodity market. Along with worsening reliability, achieving energy efficiency and performance improvement through scaling is providing increasingly diminishing marginal returns. More than at any other time in history, the semiconductor industry faces the crossroads of unreliability and the need to improve energy efficiency. These challenges of technology scaling can be tackled by dividing the target applications into two categories: traditional applications that have relatively strict correctness requirements on outputs, and an emerging class of soft applications, from domains such as multimedia, machine learning, and computer vision, that are inherently tolerant of some inaccuracy. Traditional applications can be protected against hardware failures by low-cost detection and protection methods, while soft applications can trade off output quality to achieve better performance or energy efficiency. For traditional applications, I propose an efficient, software-only application analysis and transformation solution to detect data and control flow transient faults. The intelligence of the data flow solution lies in the use of dynamic application information such as control flow, memory, and value profiling. The control flow protection technique achieves its efficiency by simplifying signature calculations in each basic block and by performing checking at a coarse-grain level. For soft applications, I develop a quality control technique that employs continuous, lightweight checkers to ensure that the approximation is controlled and the application output is acceptable. Overall, I show that the use of low-cost checkers to produce dependable results on commodity systems constructed from inexact hardware components is efficient and practical. (PhD dissertation, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113341/1/dskhudia_1.pd)
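    The control-flow protection idea, cheap per-block signature updates with checking deferred to coarse-grain points, might look roughly like the following sketch; the signature values, the XOR update rule, and the placement of the single check are assumptions for illustration rather than the dissertation's actual transformation.

```python
# Minimal sketch of coarse-grain control-flow signature checking: each basic
# block XORs its compile-time signature into a running value, and the value
# is checked only at a coarse-grain point (here, function exit).

BLOCK_SIG = {"entry": 0x1, "loop_body": 0x2, "exit": 0x4}   # assigned "at compile time"

class SignatureMonitor:
    def __init__(self):
        self.sig = 0

    def visit(self, block):
        self.sig ^= BLOCK_SIG[block]        # cheap per-block update

    def check(self, expected):
        # Single coarse-grain check instead of one check per basic block.
        if self.sig != expected:
            raise RuntimeError("control-flow error detected")

def protected_sum(n, monitor):
    monitor.visit("entry")
    total = 0
    for i in range(n):
        monitor.visit("loop_body")
        total += i
    monitor.visit("exit")
    return total

m = SignatureMonitor()
result = protected_sum(4, m)
# For n = 4 the four loop_body updates cancel under XOR, so the expected
# signature for this path is entry ^ exit.
m.check(BLOCK_SIG["entry"] ^ BLOCK_SIG["exit"])
print(result)                               # 6
```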

    Effect of Soft Errors in Iterative Learning Control and Compensation using Cross-layer Approach

    In this paper, we study the effects of radiation-induced soft errors in iterative learning control (ILC) and present compensation techniques to make ILC systems robust against soft errors. Soft errors are transient faults that occur in memories when energetic particles strike sensitive regions of transistors, mainly under abnormal conditions such as high radiation, high temperature, or high pressure. These soft errors can change bit values without any notification to the controller, affect the stability of the system, and result in catastrophic consequences. First, we investigate and analyze the effects of soft errors in ILC systems. Our analytical study shows that when a single soft error occurs in the output data from the ILC, the performance of the learning control is significantly degraded. Second, we propose novel learning methods that incorporate existing techniques across the system abstraction levels of the ILC to compensate for soft-error-induced incorrect outputs. The occurrence of soft errors is estimated by using the monotonic convergence property of the outputs in a cross-layer manner, and our proposed methods can significantly reduce these negative impacts on system performance. Under the assumption of soft error occurrence, our analytical study proves the convergence of the proposed methods in ILC systems, and our simulation results show the effectiveness of the proposed methods in efficiently reducing the impact of soft-error-induced outputs in ILC systems.
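    A minimal sketch of the detection idea, assuming a scalar memoryless plant and a P-type ILC law; the guard that discards a trial whose error norm breaks monotonic convergence is a simplified stand-in for the cross-layer estimation described in the paper.

```python
import numpy as np

def plant(u):
    return 0.8 * u                      # toy memoryless plant (assumed)

r = np.ones(50)                         # reference trajectory
u = np.zeros(50)                        # learned input, refined trial by trial
L = 0.5                                 # learning gain
prev_err = np.inf

for k in range(15):
    y = plant(u)                        # measured output, stored in memory
    if k == 5:
        y[10] += 100.0                  # inject a soft error into the stored output
    e = r - y
    err = np.linalg.norm(e)
    if err > prev_err:                  # ILC error norms should shrink monotonically
        print(f"trial {k}: suspected soft error, discarding this trial")
        continue                        # keep the previous input and error estimate
    u, prev_err = u + L * e, err        # standard P-type ILC update

print(round(prev_err, 3))               # convergence continues despite the upset
```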

    Fault Tolerant Integer Data Computations: Algorithms and Applications

    As computing units move to higher transistor integration densities and computing clusters become highly heterogeneous, studies have begun to predict that, rather than being exceptions, data corruptions in memory and processor failures are likely to become more prevalent. It has therefore become imperative to improve the reliability of systems in the face of increasing soft error probabilities in the memory and computing logic units of silicon CMOS integrated chips. This thesis introduces a new class of algorithms for fault tolerance in compute-intensive linear and sesquilinear (“one-and-a-half-linear”) data computations on integer data inputs within high-performance computing systems. The key difference between the proposed algorithms and existing fault tolerance methods is the elimination of the traditional requirement for additional hardware resources for system reliability. The first contribution of this thesis is in the detection of hardware-induced errors in integer matrix products. The proposed method of numerical packing for detecting a single error within a quadruple of matrix outputs is described in Chapter 2. The chapter includes analytic calculations of the proposed method's computational complexity and reliability. Experimental results show that the proposed algorithm incurs execution time overhead comparable to existing algorithms for the detection and correction of a limited number of errors within generic matrix multiplication (GEMM) outputs. On the other hand, numerical packing becomes substantially more efficient in the mitigation of multiple errors. The achieved execution time gain of numerical packing is further analyzed with respect to its energy-saving equivalent, thus paving the way for a new class of silent data corruption (SDC) mitigation methods for integer matrix products that are fast, energy efficient, and highly reliable. A further advancement of the proposed numerical packing approach for the mitigation of core/processor failures in computing clusters (a.k.a. fail-stop failures) is described in Chapter 3. The key advantage of this new packing approach is the ability to tolerate processor failures for all classes of sum-of-product computations. Because multimedia applications running on cloud computing platforms are now required to mitigate an increasing number of failures and outages at runtime, we analyze the efficiency of numerical packing within an image retrieval framework deployed over a cluster of AWS EC2 spot (i.e., low-cost albeit terminable) instances. Our results show that a cost reduction of more than 70% can be achieved in comparison to conventional failure-intolerant processing based on AWS EC2 on-demand (i.e., higher-cost albeit guaranteed) instances. Finally, beyond numerical packing, we present a second approach for reliability in the case of linear and sesquilinear integer data computations by generalizing the recently proposed concept of numerical entanglement. The proposed approach is capable of recovering from multiple fail-stop failures in a parallel/distributed computing environment. We present a theoretical analysis of the computational and bit-width requirements of the proposed method in comparison to existing methods of checksum generation and processing. Our experiments with integer matrix products show that the proposed approach incurs a 1.72% to 37.23% reduction in processing throughput in comparison to failure-intolerant processing while allowing for the mitigation of multiple fail-stop failures without the use of additional computing resources.
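    For context, the conventional checksum baseline that the thesis compares against can be illustrated as below; this is a generic ABFT-style check for integer matrix products, not the numerical packing or entanglement methods themselves.

```python
import numpy as np

# Sketch of the conventional checksum (ABFT-style) baseline for C = A @ B:
# a column checksum of A and a row checksum of B allow a corrupted element
# of C to be detected and located after the multiplication.

rng = np.random.default_rng(1)
A = rng.integers(-100, 100, size=(4, 4)).astype(np.int64)
B = rng.integers(-100, 100, size=(4, 4)).astype(np.int64)

C = A @ B
row_check = A.sum(axis=0) @ B            # expected column sums of C: (1^T A) B
col_check = A @ B.sum(axis=1)            # expected row sums of C: A (B 1)

C[2, 1] += 7                             # simulate a silent data corruption

bad_cols = np.flatnonzero(C.sum(axis=0) != row_check)
bad_rows = np.flatnonzero(C.sum(axis=1) != col_check)
print(bad_rows, bad_cols)                # -> [2] [1]: the faulty element is located
```

    The row and column checksums intersect at the corrupted element, which is how a single error can be both detected and corrected in this style of scheme.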

    Development of low-overhead soft error mitigation technique for safety critical neural networks applications

    Deep Neural Networks (DNNs) have been widely applied in healthcare applications. DNN-based healthcare applications are safety-critical systems that require high-reliability implementations due to the high risk of human death or injury in case of malfunction. Several DNN accelerators are used to execute these DNN models, and GPUs are currently the most prominent and dominant DNN accelerators. However, GPUs are prone to soft errors that dramatically affect GPU behavior; such errors may corrupt data values or logic operations, resulting in Silent Data Corruption (SDC). SDC occurring in the GPU's hardware components propagates from the physical level to the application level, causing misclassification of objects in DNN models and leading to disastrous consequences. The Food and Drug Administration (FDA) reported that 1078 of the adverse events (10.1%) involved unintended errors (i.e., soft errors), including 52 injuries and two deaths. Several traditional techniques have been proposed to protect electronic devices from soft errors by replicating the DNN models; however, these techniques cause significant area, performance, and energy overheads, making them challenging to implement in healthcare systems with strict deadlines. To address this issue, this study developed a Selective Mitigation Technique based on standard Triple Modular Redundancy (S-MTTM-R) that determines the model's vulnerable parts, distinguishing Malfunction and Light-Malfunction errors. A comprehensive vulnerability analysis was performed with the SASSIFI fault injector on the AlexNet and DenseNet201 CNN models at the layer, kernel, and instruction levels while running on NVIDIA GPUs, in order to characterize the resilience of both models, identify their most vulnerable portions, and harden them. The experimental results showed that S-MTTM-R achieved a significant improvement in error masking. For AlexNet, the No-Malfunction rate improved from 54.90%, 67.85%, and 59.36% to 62.80%, 82.10%, and 80.76% in the RF, IOA, and IOV modes, respectively. For DenseNet, the No-Malfunction rate improved from 43.70%, 67.70%, and 54.68% to 59.90%, 84.75%, and 83.07% in the RF, IOA, and IOV modes, respectively. Importantly, S-MTTM-R decreased the percentage of errors that cause misclassification (Malfunction) from 3.70% to 0.38% and from 5.23% to 0.23% for AlexNet and DenseNet, respectively. The performance analysis showed that S-MTTM-R achieves lower overhead than the well-known protection techniques: Algorithm-Based Fault Tolerance (ABFT), Double Modular Redundancy (DMR), and Triple Modular Redundancy (TMR). In light of these results, the study provides strong evidence that the developed S-MTTM-R successfully mitigates soft errors for DNN models on GPUs with low energy, performance, and area overheads, indicating a remarkable improvement in model reliability for the healthcare domain.
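    The selective-redundancy principle behind such low-overhead schemes, protecting only the portions found vulnerable by fault injection, can be sketched as follows; the toy layers, the voting rule, and the protected-layer set are assumptions and not the S-MTTM-R implementation from the thesis.

```python
import numpy as np

# Illustrative-only sketch of selective redundancy: only layers flagged as
# vulnerable by a fault-injection campaign are executed three times and
# majority-voted, while the remaining layers run unprotected.

def relu_layer(x, w):
    return np.maximum(x @ w, 0)

def majority_vote(a, b, c):
    # Element-wise 2-of-3 vote; returns b when all three copies disagree.
    return np.where(a == b, a, np.where(a == c, a, b))

def run_layer(x, w, vulnerable):
    if not vulnerable:
        return relu_layer(x, w)              # unprotected: zero overhead
    r1, r2, r3 = (relu_layer(x, w) for _ in range(3))
    return majority_vote(r1, r2, r3)         # protected: triple execution + vote

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))
weights = [rng.standard_normal((8, 8)) for _ in range(3)]
VULNERABLE = {1}                             # layer indices deemed most vulnerable

for i, w in enumerate(weights):
    x = run_layer(x, w, vulnerable=(i in VULNERABLE))
print(x.shape)                               # (1, 8)
```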

    Reliable Software for Unreliable Hardware - A Cross-Layer Approach

    A novel cross-layer reliability analysis, modeling, and optimization approach is proposed in this thesis that leverages multiple layers in the system design abstraction (i.e., hardware, compiler, system software, and application program) to exploit the reliability-enhancing potential available at each system layer and to exchange this information across multiple system layers.