
    The impact of the soft errors in convolutional neural network on GPUs: Alexnet as case study

    Convolutional Neural Networks (CNNs) have been increasingly deployed in many applications, including safety-critical systems such as healthcare and autonomous vehicles. Meanwhile, the vulnerability of CNN models to soft errors (e.g., caused by radiation-induced bit flips) rapidly increases, so reliability is crucial, especially in real-time systems. There are many traditional techniques for improving the reliability of a system, e.g., Triple Modular Redundancy, but these techniques incur high overheads, which makes them hard to deploy. In this paper, we experimentally evaluate the vulnerable parts of the AlexNet model using a fault injector. Results show that FADD and LD are the instructions most vulnerable to soft errors for the AlexNet model; both instructions turn at least 84% of injected faults into SDC errors. Thus, these are the only parts of the AlexNet model that need to be hardened, instead of using full duplication solutions.
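    To make the kind of analysis described above concrete, the sketch below is an illustrative Python example, not the paper's actual fault-injection setup: it flips one bit in the result of a floating-point add (the "FADD" under study) inside a small dot product and labels the run an SDC when the final output silently differs from the fault-free result. The helper names and the dot-product workload are assumptions made for the illustration.

```python
import random
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a value's IEEE-754 float32 representation."""
    word = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", word ^ (1 << bit)))[0]

def dot(a, b, inject_at=None, bit=0):
    """Dot product whose running sum (the "FADD" result) can be corrupted once."""
    acc = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        acc = acc + x * y                 # the floating-point add under study
        if i == inject_at:
            acc = flip_bit(acc, bit)      # single-bit fault in the FADD result
    return acc

a, b = [0.5, 1.5, -2.0, 3.0], [1.0, 2.0, 0.5, -1.0]
golden = dot(a, b)                        # fault-free reference run
faulty = dot(a, b, inject_at=random.randrange(len(a)), bit=random.randrange(32))
# silent data corruption: the program finishes normally but the output differs
print("SDC" if faulty != golden else "masked", golden, faulty)
```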

    Evaluation of the Suitability of NEON SIMD Microprocessor Extensions Under Proton Irradiation

    This paper analyzes the suitability of single-instruction multiple-data (SIMD) extensions of current microprocessors under radiation environments. SIMD extensions are intended for software acceleration, focusing mostly on applications that require high computational effort, which are common in many fields such as computer vision. SIMD extensions use a dedicated coprocessor that makes it possible to pack several operations into one single extended instruction. Applications that require high performance could benefit from the use of SIMD coprocessors, but their reliability needs to be studied. In this paper, NEON, the SIMD coprocessor of ARM microprocessors, has been selected as a case study to explore the behavior of SIMD extensions under radiation. Radiation experiments on ARM Cortex-A9 microprocessors have been carried out with the objective of determining how the use of this kind of coprocessor can affect system reliability.

    Analyzing the Resilience of Convolutional Neural Networks Implemented on GPUs: Alexnet as a Case Study

    There has been extensive use of Convolutional Neural Networks (CNNs) in healthcare applications. Presently, GPUs are the most prominent and dominant DNN accelerators used to increase the execution speed of CNN algorithms and improve their performance as well as their latency. However, GPUs are prone to soft errors. These errors can dramatically impact the behavior of the GPU: the generated fault may corrupt data values or logic operations and cause errors such as Silent Data Corruption (SDC). Unfortunately, soft errors propagate from the physical level (microarchitecture) to the application level (CNN model). This paper analyzes the reliability of the AlexNet model based on two metrics: (1) critical kernel vulnerability (CKV), used to identify malfunction and light-malfunction errors in each kernel, and (2) critical layer vulnerability (CLV), used to track malfunction and light-malfunction errors through the layers. To achieve this, we injected faults into the AlexNet model, which is widely used in healthcare applications, running on NVIDIA GPUs, using the SASSIFI fault injector as the main evaluation tool. The experiments demonstrate that the average percentage of errors causing model malfunction was reduced from 3.7% to 0.383% by hardening only the vulnerable parts, with an overhead of only 0.2923%. This is a substantial improvement in model reliability for healthcare applications.
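    The CKV/CLV idea can be sketched as simple ratios over fault-injection outcomes. The Python example below is a minimal illustration under assumed outcome labels ("masked", "malfunction", "light_malfunction") and dummy records; it is not the paper's exact definition or data.

```python
from collections import Counter

# (layer, kernel, outcome) records from a fault-injection campaign (dummy data)
records = [
    ("conv1", "k0", "masked"),            ("conv1", "k0", "malfunction"),
    ("conv1", "k1", "light_malfunction"), ("conv2", "k2", "masked"),
    ("conv2", "k2", "malfunction"),       ("conv2", "k2", "light_malfunction"),
]

CRITICAL = {"malfunction", "light_malfunction"}

def vulnerability(group_index):
    """Fraction of injected faults per group that end as a critical outcome."""
    totals, critical = Counter(), Counter()
    for record in records:
        key = record[group_index]
        totals[key] += 1
        if record[2] in CRITICAL:
            critical[key] += 1
    return {key: critical[key] / totals[key] for key in totals}

print("CKV per kernel:", vulnerability(1))   # critical kernel vulnerability
print("CLV per layer: ", vulnerability(0))   # critical layer vulnerability
```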

    MPGemmFI: A Fault Injection Technique for Mixed Precision GEMM in ML Applications

    Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). To meet such demand, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google TPUs is support for mixed-precision GEMM. For DNN models, lower-precision floating-point data formats and computation offer acceptable correctness with significant improvements in performance, area, and memory footprint. While promising, the impact of mixed-precision computation on error resilience remains unexplored. To this end, we develop a fault injection framework that systematically injects faults into mixed-precision computation results. We investigate how the faults affect the accuracy of machine learning applications. Based on the error resilience characteristics, we offer lightweight error detection and correction solutions that significantly improve overall model accuracy when the models experience hardware faults. The solutions can be efficiently integrated into the accelerator's pipelines.
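    A rough sense of what such a fault injector does can be given with NumPy, as in the hedged sketch below: it emulates a mixed-precision GEMM (float16 inputs accumulated in float32), flips one bit of a float16 output element, and applies a cheap range check as a lightweight detector. The matrix sizes, the bit-flip site, and the detection bound are assumptions for illustration, not MPGemmFI's actual interface.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8)).astype(np.float16)
B = rng.standard_normal((8, 8)).astype(np.float16)

# mixed-precision GEMM: float16 inputs, float32 accumulation, float16 output
C = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float16)

# inject one bit flip into one element of the GEMM result
i, j, bit = rng.integers(8), rng.integers(8), rng.integers(16)
C_faulty = C.copy()
C_faulty.view(np.uint16)[i, j] ^= np.uint16(1 << bit)

# lightweight detection: a result far outside the expected dynamic range
# (|sum of 8 products of ~N(0,1) values| rarely exceeds this bound)
BOUND = 32.0
value = float(C_faulty[i, j])
detected = abs(value) > BOUND or not np.isfinite(value)
print(f"flipped bit {bit} at ({i},{j}):", value, "-> detected:", detected)
```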

    Detecting Silent Data Corruptions in Aerospace-Based Computing Using Program Invariants

    Soft errors caused by single-event upsets have been a severe challenge for aerospace-based computing. Silent data corruption (SDC) is one of the outcomes of a soft error. SDC occurs when a program generates erroneous output with no indication. SDC is the most insidious type of outcome and is very difficult to detect. To address this problem, we design and implement an invariant-based system called Radish. Invariants describe certain properties of a program; for example, the value of a variable equals a constant. Radish first extracts invariants at key program points and converts the invariants into assertions. It then hardens the program by inserting the assertions into the source code. When a soft error occurs, an assertion will be found to be false at run time and warn the user of the soft error. To increase SDC coverage, we further propose an extension of Radish, named Radish_D, which applies a software-based instruction duplication mechanism to protect the code sections not covered by assertions. Experiments using architectural fault injection show that Radish achieves high SDC coverage with very low overhead. Furthermore, Radish_D provides higher SDC coverage than either Radish or pure instruction duplication.
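    The invariant-then-assertion idea can be illustrated with a short Python sketch, shown below; the learned range invariant, the program point, and the profiling runs are assumptions for illustration rather than Radish's actual output.

```python
def profile_invariant(runs):
    """Learn a [low, high] range invariant for a variable at a program point."""
    low, high = float("inf"), float("-inf")
    for data in runs:
        total = sum(data)                      # value observed at the key point
        low, high = min(low, total), max(high, total)
    return low, high

# profiling (fault-free) runs yield the invariant
LOW, HIGH = profile_invariant([[1, 2, 3], [2, 2, 2], [0, 4, 1]])

def hardened_sum(data):
    total = sum(data)
    # assertion inserted into the source at the key program point
    assert LOW <= total <= HIGH, "possible silent data corruption detected"
    return total

print(hardened_sum([1, 3, 2]))          # 6 lies inside the learned range: passes
try:
    hardened_sum([100, 200])            # a corrupted value violates the invariant
except AssertionError as warning:
    print("warning:", warning)
```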

    Development of low-overhead soft error mitigation technique for safety critical neural networks applications

    Deep Neural Networks (DNNs) have been widely applied in healthcare applications. DNN-based healthcare applications are safety-critical systems that require high-reliability implementations because of the high risk of human death or injury in case of malfunction. Several DNN accelerators are used to execute these DNN models, and GPUs are currently the most prominent and dominant among them. However, GPUs are prone to soft errors that dramatically impact GPU behavior; such errors may corrupt data values or logic operations, resulting in Silent Data Corruption (SDC). SDC propagates from the physical level to the application level: an SDC that occurs in the GPU's hardware components results in misclassification of objects in DNN models, leading to disastrous consequences. The Food and Drug Administration (FDA) reported that 1078 of the adverse events (10.1%) were unintended errors (i.e., soft errors), including 52 injuries and two deaths. Several traditional techniques have been proposed to protect electronic devices from soft errors by replicating the DNN models. However, these techniques cause significant overheads in area, performance, and energy, making them challenging to implement in healthcare systems with strict deadlines. To address this issue, this study developed a Selective Mitigation Technique based on standard Triple Modular Redundancy (S-MTTM-R) that determines the model's vulnerable parts, distinguishing Malfunction and Light-Malfunction errors. A comprehensive vulnerability analysis was performed with the SASSIFI fault injector on the AlexNet and DenseNet201 CNN models at the layer, kernel, and instruction levels to characterize both models' resilience, identify their most vulnerable portions, and harden them, with faults injected while the models ran on NVIDIA GPUs. The experimental results showed that S-MTTM-R achieved a significant improvement in error masking. For AlexNet, the No-Malfunction rate improved from 54.90%, 67.85%, and 59.36% to 62.80%, 82.10%, and 80.76% in the RF, IOA, and IOV modes, respectively. For DenseNet201, the No-Malfunction rate improved from 43.70%, 67.70%, and 54.68% to 59.90%, 84.75%, and 83.07% in the RF, IOA, and IOV modes, respectively. Importantly, S-MTTM-R decreased the percentage of errors that cause misclassification (Malfunction) from 3.70% to 0.38% and from 5.23% to 0.23% for AlexNet and DenseNet201, respectively. The performance analysis showed that S-MTTM-R incurs lower overhead than the well-known protection techniques Algorithm-Based Fault Tolerance (ABFT), Double Modular Redundancy (DMR), and Triple Modular Redundancy (TMR). In light of these results, the study provides strong evidence that the developed S-MTTM-R successfully mitigates soft errors for DNN models on GPUs with low overheads in energy, performance, and area, indicating a remarkable improvement in model reliability for the healthcare domain.
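    A minimal sketch of the selective-hardening idea, assuming a toy two-layer model and per-layer triplication with element-wise median voting (these choices are illustrative, not the exact S-MTTM-R granularity), is given below.

```python
import numpy as np

def conv_layer(x):
    """Stand-in for the layer the analysis flags as most vulnerable."""
    return x * 2.0

def fc_layer(x):
    """Stand-in for a resilient layer that is left unprotected."""
    return x + 1.0

def tmr(layer, x):
    """Run the layer three times and vote element-wise (median = 2-of-3 majority)."""
    runs = np.stack([layer(x) for _ in range(3)])
    return np.median(runs, axis=0)

VULNERABLE = {conv_layer}                 # outcome of the vulnerability analysis

def run_model(x):
    for layer in (conv_layer, fc_layer):
        x = tmr(layer, x) if layer in VULNERABLE else layer(x)
    return x

print(run_model(np.ones(4)))              # [3. 3. 3. 3.]
```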

    Criticality analysis of transient faults in parallel processors for selective hardening

    This document describes the work carried out over the one-year period dedicated to the undergraduate thesis. The work consists of analyzing the criticality of the errors in the output of parallel applications that experience faults during their execution. The fault model considered in this work is that of faults caused by cosmic radiation. As a result of this analysis, it is possible to obtain a classification of which variables are candidates for the selective hardening technique. Additionally, a tool that provides the infrastructure required for the automatic application of selective hardening was implemented.
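    The criticality ranking described above can be sketched as follows; the variable names, error measure, and criticality threshold are assumptions made for the illustration, not the thesis's actual tool.

```python
# each record: (variable faulted, relative error observed in the program output)
campaign = [
    ("acc", 0.00), ("acc", 0.42), ("idx", 1.00), ("idx", 0.97),
    ("tmp", 0.01), ("tmp", 0.00), ("acc", 0.10),
]

CRITICAL = 0.05   # relative output error above this counts as a critical failure

def criticality_ranking(records):
    """Rank variables by the fraction of injections that caused critical errors."""
    stats = {}
    for var, err in records:
        total, crit = stats.get(var, (0, 0))
        stats[var] = (total + 1, crit + (err > CRITICAL))
    return sorted(((crit / total, var) for var, (total, crit) in stats.items()),
                  reverse=True)

for score, var in criticality_ranking(campaign):
    print(f"{var}: {score:.2f}")   # best candidates for selective hardening first
```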

    Using precision reduction to efficiently improve mixed-precision GPUs reliability

    Duplication With Comparison (DWC) is a traditional and accepted method for improving system reliability. DWC consists of duplicating critical regions at the software or hardware level by creating redundant operations in order to decrease the probability of an unwanted event. However, this technique introduces an expensive overhead in power consumption, processing time, and resource allocation, because the critical operations are computed at least twice. Reduced Precision Duplication With Comparison (RP-DWC) is an effective software-level solution that improves on conventional DWC. RP-DWC aims to mitigate these overheads by enabling parallel processing in underused Floating-Point Units (FPUs) of mixed-precision Graphics Processing Units (GPUs). By using precision reduction to efficiently improve the reliability of mixed-precision GPUs, RP-DWC extends the DWC technique, introducing proper ways to handle redundancy across operations of different precision. Improving GPU reliability is an extremely valuable challenge in the fault tolerance field, since GPUs are adopted both in High-Performance Computing (HPC) and in automotive real-time applications. When GPUs operate in a natural environment, such as the surface of the Earth at sea level, they are exposed to terrestrial radiation. This exposure can be critical, given that radiation particles may strike the GPU's internal circuits, corrupt sensitive data, and consequently generate undesired outputs. Introducing duplication with reduced precision in a trustworthy manner, so as to maintain reliability in safety-critical systems, is an arduous task that we propose to investigate further in this work.
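    A hedged Python sketch of the RP-DWC idea follows: the critical computation runs in float32, its redundant duplicate runs in float16, and a mismatch beyond the error expected from precision reduction alone flags a possible fault. The tolerance value and the matrix-multiply workload are assumptions for illustration.

```python
import numpy as np

def critical_op(a, b):
    """The critical computation, performed in full (float32) precision."""
    return a @ b

def redundant_op(a, b):
    """The duplicate, performed in reduced (float16) precision on spare FPUs."""
    return a.astype(np.float16) @ b.astype(np.float16)

def rp_dwc(a, b, tol=1e-1):
    full = critical_op(a, b)
    reduced = redundant_op(a, b).astype(np.float32)
    # a disagreement larger than the precision-reduction error flags a fault
    if not np.allclose(full, reduced, rtol=tol, atol=tol):
        raise RuntimeError("RP-DWC mismatch: possible radiation-induced fault")
    return full

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 4)).astype(np.float32)
b = rng.standard_normal((4, 4)).astype(np.float32)
print(rp_dwc(a, b))                       # the two precisions agree within tol
```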