9 research outputs found

    Characterizing Deep Neural Networks Neutrons-Induced Error Model

    We characterize the fault models for Deep Neural Networks (DNNs) on GPUs exposed to neutrons. We observe both tolerable and critical errors, and show that ECC is not effective in reducing critical errors.

    Deep learning optimization for drug-target interaction prediction in COVID-19 using graphic processing unit

    The exponential growth of bioinformatics data has raised a new problem: long computation times. The amount of data to be processed has not been matched by an increase in hardware performance, which burdens researchers with long computation times, especially in drug-target interaction prediction, where the computational complexity is exponential. One focus of high-performance computing research is the use of the graphics processing unit (GPU) to perform many computations in parallel. This study evaluates how well the GPU performs when used for deep learning models that predict drug-target interactions. It uses the gold-standard drug-target interaction (DTI) data and a coronavirus disease (COVID-19) dataset. The stages of the research are data acquisition, data preprocessing, model building, hyperparameter tuning, performance evaluation, and COVID-19 dataset testing. The results indicate that using a GPU can speed up the training of deep learning models by 100 times. Hyperparameter tuning also benefits greatly from the GPU, which makes the process up to 55 times faster. When tested on the COVID-19 dataset, the model performed well, with 76% accuracy, a 74% F-measure, and a speed-up of 179.

    Revealing GPUs Vulnerabilities by Combining Register-Transfer and Software-Level Fault Injection

    The complexity of both hardware and software makes GPU reliability evaluation extremely challenging. A low-level fault injection on a GPU model, despite being accurate, would take a prohibitively long time (months to years), while software fault injection, despite being quick, cannot access critical GPU resources and typically uses synthetic fault models (e.g., single bit-flips) that could result in unrealistic evaluations. This paper proposes to combine the accuracy of Register-Transfer Level (RTL) fault injection with the efficiency of software fault injection. First, on an RTL GPU model (FlexGripPlus), we inject over 1.5 million faults in low-level resources that are unprotected and hidden from the programmer, and characterize their effects on the output of common instructions. We create a pool of possible fault effects on the operation output based on the instruction opcode and input characteristics. We then inject these fault effects, at the application level, using an updated version of a software framework (NVBitFI). Our strategy reduces the fault injection time from the tens of years an RTL evaluation would need to tens of hours, making it possible, for the first time on GPUs, to track fault propagation from the hardware to the output of complex applications. Additionally, we provide a more realistic fault model and show that single bit-flip injection would underestimate the error rate of six HPC applications and two convolutional neural networks by up to 48% (18% on average). The RTL fault models and the injection framework we developed are made available in a public repository to enable third-party evaluations and ease the reproducibility of results.
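
For readers unfamiliar with the synthetic fault model the abstract refers to, a single bit-flip on a 32-bit floating-point value can be sketched as below. This is only an illustration of the general idea, not the NVBitFI or FlexGripPlus implementation; the function name `flip_bit` is hypothetical.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as an IEEE 754 float32, with one bit inverted.

    Hypothetical helper for illustration; bit 0 is the least significant
    mantissa bit and bit 31 is the sign bit.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Invert the selected bit and reinterpret the pattern as a float32.
    (faulty,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return faulty


print(flip_bit(1.0, 31))  # sign bit flipped: -1.0
print(flip_bit(1.0, 30))  # top exponent bit flipped: inf
```

The two calls show why the affected bit position matters: the same single fault can negate a value or turn it into infinity.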

    Development of low-overhead soft error mitigation technique for safety critical neural networks applications

    Deep Neural Networks (DNNs) have been widely applied in healthcare applications. DNN-based healthcare applications are safety-critical systems that require high-reliability implementations because a malfunction carries a high risk of human death or injury. Several DNN accelerators are used to execute these DNN models, and GPUs are currently the most prominent and dominant of them. However, GPUs are prone to soft errors that dramatically affect GPU behavior; such errors may corrupt data values or logic operations, resulting in Silent Data Corruption (SDC). An SDC that occurs in the GPU's hardware components propagates from the physical level to the application level and results in misclassification of objects in DNN models, leading to disastrous consequences. The Food and Drug Administration (FDA) reported that 1078 of the adverse events (10.1%) were unintended errors (i.e., soft errors), including 52 injuries and two deaths. Several traditional techniques have been proposed to protect electronic devices from soft errors by replicating the DNN models. However, these techniques incur significant area, performance, and energy overheads, making them challenging to implement in healthcare systems that have strict deadlines. To address this issue, this study developed a Selective Mitigation Technique based on the standard Triple Modular Redundancy (S-MTTM-R) that determines the model's vulnerable parts, distinguishing Malfunction from Light-Malfunction errors. A comprehensive vulnerability analysis was performed with the SASSIFI fault injector on the AlexNet and DenseNet201 CNN models, at the layer, kernel, and instruction levels and while running on NVIDIA GPUs, to show both models' resilience and to identify and harden the most vulnerable portions. The experimental results showed that S-MTTM-R achieved a significant improvement in error masking. For AlexNet, No-Malfunction outcomes improved from 54.90%, 67.85%, and 59.36% to 62.80%, 82.10%, and 80.76% in the three modes RF, IOA, and IOV, respectively. For DenseNet, No-Malfunction outcomes improved from 43.70%, 67.70%, and 54.68% to 59.90%, 84.75%, and 83.07% in the three modes RF, IOA, and IOV, respectively. Importantly, S-MTTM-R decreased the percentage of errors that cause misclassification (Malfunction) from 3.70% to 0.38% for AlexNet and from 5.23% to 0.23% for DenseNet. The performance analysis showed that S-MTTM-R achieved lower overhead than the well-known protection techniques Algorithm-Based Fault Tolerance (ABFT), Double Modular Redundancy (DMR), and Triple Modular Redundancy (TMR). In light of these results, the study provides strong evidence that S-MTTM-R successfully mitigates soft errors for DNN models on GPUs with low overheads in energy, performance, and area, a remarkable improvement in model reliability for the healthcare domain.
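
The idea of applying Triple Modular Redundancy only to vulnerable parts can be sketched generically as below. This is not the S-MTTM-R implementation described in the abstract, only a minimal illustration of selective TMR with majority voting; the names `majority_vote`, `tmr`, `run_selectively_hardened`, and the layer names are hypothetical, and outputs are assumed to be hashable values.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise RuntimeError("replicas disagree: no majority")
    return value


def tmr(fn, x):
    """Run `fn` three times on `x` and vote on the results (classic TMR)."""
    return majority_vote([fn(x) for _ in range(3)])


def run_selectively_hardened(layers, x, vulnerable):
    """Apply `layers` in order, triplicating only those named in `vulnerable`.

    `layers` is a list of (name, function) pairs; unlisted layers run once,
    which is where the overhead saving over full TMR comes from.
    """
    for name, fn in layers:
        x = tmr(fn, x) if name in vulnerable else fn(x)
    return x


# Toy pipeline: only the hypothetical "fc" layer is treated as vulnerable.
layers = [("conv", lambda v: v + 1), ("fc", lambda v: v * 2)]
print(run_selectively_hardened(layers, 3, {"fc"}))  # 8
```

With three replicas, a single corrupted result is outvoted by the two correct ones, while three mutually different results raise an error instead of silently returning a wrong value.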

    Characterizing a Neutron-Induced Fault Model for Deep Neural Networks

    The reliability evaluation of Deep Neural Networks (DNNs) executed on Graphics Processing Units (GPUs) is a challenging problem, since the hardware architecture is highly complex and the software frameworks are composed of many layers of abstraction. While software-level fault injection is a common and fast way to evaluate the reliability of complex applications, it may produce unrealistic results, since it has limited access to the hardware resources and the adopted fault models may be too naive (i.e., single and double bit flips). Conversely, physical fault injection with a neutron beam provides realistic error rates but lacks fault propagation visibility. This paper proposes a characterization of the DNN fault model that combines neutron beam experiments with fault injection at the software level. We exposed GPUs running General Matrix Multiplication (GEMM) and DNNs to a neutron beam to measure their error rate. On DNNs, we observe that the percentage of critical errors can be up to 61%, and show that ECC is ineffective in reducing critical errors. We then performed a complementary software-level fault injection, using fault models derived from RTL simulations. Our results show that, when complex fault models are injected, the YOLOv3 misdetection rate is very close to the rate measured in the beam experiments, which is 8.66× higher than the rate measured with fault injection using only single-bit flips.

    New Techniques for On-line Testing and Fault Mitigation in GPUs

    The abstract is in the attachment.

    A checkpointing mechanism for GPU intensive HPC applications

    Please refer to the PDF. James Watt Scholarship; Engineering and Physical Sciences Research Council (EPSRC) grants EP/N028201/1 and EP/L00058X/

    GPU Behavior on a Large HPC Cluster

    No full text