1,324 research outputs found

    New Techniques for On-line Testing and Fault Mitigation in GPUs

    The abstract is in the attachment.

    Analysis and Mitigation of Soft-Errors on High Performance Embedded GPUs

    Multiprocessor systems-on-chip, such as embedded GPUs, are becoming very popular in safety-critical applications such as autonomous and semi-autonomous vehicles. However, these devices can suffer from soft errors, such as those induced by radiation, which can cause unpredictable misbehavior. Fault tolerance for multi-threaded software introduces severe performance degradation due to the redundancy, voting, and correction thread operations. In this paper, we propose a new fault injection environment for NVIDIA GPGPU devices and a fault tolerance approach based on error detection and correction threads executed during data transfer operations on embedded GPUs. The fault injection environment automatically injects faults into instructions at the SASS level by instrumenting the CUDA binary executable file. The mitigation approach is based on concurrent error detection threads that run simultaneously with the device-to-host memory stream data transfer operations. With several benchmark applications, we evaluate the impact of soft errors, classifying outcomes as Silent Data Corruption, Detection, Unrecoverable Error, or Hang. Finally, the proposed mitigation approach has been validated by soft-error fault injection campaigns on an NVIDIA Pascal architecture GPU controlled by a quad-core ARM A57 processor (Jetson TX2), demonstrating an advantage of more than 37% with respect to the state-of-the-art solution.
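The four outcome classes named in this abstract can be illustrated with a small sketch. It is not the authors' tooling: the `classify` function, its flags, the priority order of the checks, and the extra `MASKED` class (a fault with no visible effect) are all assumptions made for illustration.

```python
from enum import Enum


class Outcome(Enum):
    MASKED = "masked"                      # assumption: not listed in the abstract
    SDC = "silent data corruption"
    DETECTED = "detection"
    UNRECOVERABLE = "unrecoverable error"
    HANG = "hang"


def classify(golden, observed, *, timed_out=False, crashed=False,
             detector_fired=False):
    """Classify one fault-injection run against a fault-free (golden) result.

    The order of the checks is an assumed convention: a hang or crash
    is reported before the output is compared at all.
    """
    if timed_out:
        return Outcome.HANG
    if crashed:
        return Outcome.UNRECOVERABLE
    if detector_fired:
        return Outcome.DETECTED
    if observed != golden:
        # Wrong output with no error signalled: silent data corruption.
        return Outcome.SDC
    return Outcome.MASKED


if __name__ == "__main__":
    golden = [1.0, 2.0, 3.0]
    print(classify(golden, [1.0, 2.0, 3.5]))
    print(classify(golden, None, timed_out=True))
```

A campaign would call such a function once per injected fault and tally the results per benchmark.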

    Analyzing the Reliability of Alternative Convolution Implementations for Deep Learning Applications

    Convolution represents the core of Deep Learning (DL) applications, enabling the automatic extraction of features from raw input data. Several implementations of the convolution have been proposed, and their impact on the performance of DL applications has been studied. However, no specific reliability-related analysis has been carried out. In this paper, we apply the CLASSES cross-layer reliability analysis methodology for an in-depth study aimed at: i) analyzing and characterizing the effects of Single Event Upsets occurring in Graphics Processing Units while executing the convolution operators; and ii) identifying whether one convolution implementation is more robust than the others. The outcomes can then be exploited to tailor better hardening schemes for DL applications to improve reliability and reduce overhead.
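As a rough illustration of what a Single Event Upset does to a convolution operand, the sketch below flips one bit of a float32 weight and compares the output of a direct 1-D convolution with the fault-free result. It does not reproduce the CLASSES methodology. The single-bit-flip fault model, the helper names `flip_bit_float32` and `conv1d`, and the toy data are assumptions.

```python
import struct


def flip_bit_float32(value, bit):
    """Return `value` with one bit (0 = LSB, 31 = sign) of its float32 encoding flipped."""
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in 0..31")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def conv1d(signal, kernel):
    """Direct 'valid' 1-D convolution (cross-correlation form, as used in DL)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]
    kernel = [1.0, 0.0, -1.0]
    golden = conv1d(signal, kernel)

    # Assumed fault: flip the sign bit of the first kernel weight.
    faulty_kernel = [flip_bit_float32(kernel[0], 31)] + kernel[1:]
    faulty = conv1d(signal, faulty_kernel)

    deviation = max(abs(g - f) for g, f in zip(golden, faulty))
    print(golden, faulty, deviation)
```

Repeating the injection over every bit position and every operand, and over alternative implementations of the same operator, would give a simple per-implementation comparison of output deviation.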