New Techniques for On-line Testing and Fault Mitigation in GPUs
The abstract is in the attachment.
Analysis and Mitigation of Soft-Errors on High Performance Embedded GPUs
Multiprocessor systems-on-chip, such as embedded GPUs, are becoming very popular in safety-critical applications such as autonomous and semi-autonomous vehicles. However, these devices are susceptible to soft errors, such as those induced by radiation, which can cause unpredictable misbehavior. Fault tolerance oriented to multi-threaded software introduces severe performance degradation because of the redundancy, voting and correction thread operations. In this paper, we propose a new fault injection environment for NVIDIA GPGPU devices and a fault tolerance approach based on error detection and correction threads executed during data transfer operations on embedded GPUs. The fault injection environment automatically injects faults into instructions at the SASS level by instrumenting the CUDA binary executable file. The mitigation approach is based on concurrent error detection threads that run simultaneously with the memory-stream device-to-host data transfer operations. Using several benchmark applications, we evaluate the impact of soft errors, classifying the outcomes as Silent Data Corruption, Detection, Unrecoverable Error and Hang. Finally, the proposed mitigation approach has been validated by soft-error fault injection campaigns on an NVIDIA Pascal architecture GPU controlled by a quad-core ARM A57 processor (Jetson TX2), demonstrating an advantage of more than 37% with respect to the state-of-the-art solution.
Analyzing the Reliability of Alternative Convolution Implementations for Deep Learning Applications
Convolution represents the core of Deep Learning (DL) applications, enabling the automatic extraction of features from raw input data. Several implementations of convolution have been proposed, and their impact on the performance of DL applications has been studied. However, no specific reliability-related analysis has been carried out. In this paper, we apply the CLASSES cross-layer reliability analysis methodology for an in-depth study aimed at: i) analyzing and characterizing the effects of Single Event Upsets occurring in Graphics Processing Units while executing the convolution operators; and ii) identifying whether one convolution implementation is more robust than the others. The outcomes can then be exploited to tailor better hardening schemes for DL applications, improving reliability and reducing overhead.
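As a rough illustration of what a Single Event Upset means for a convolution, the sketch below flips one bit in the float32 encoding of a kernel weight and compares the result against a fault-free ("golden") run. This is not the CLASSES methodology and does not model a GPU: the helper names `flip_bit` and `conv1d`, the input values, and the chosen fault site (bit 30 of the first weight) are all assumptions made for the example.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 float32 encoding inverted."""
    (word,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", word ^ (1 << bit)))
    return faulty


def conv1d(signal, kernel):
    """Direct 'valid' 1-D convolution (correlation form, kernel not flipped)."""
    n = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n)
    ]


# Hypothetical inputs, chosen so every value is exactly representable.
signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25]

# Fault-free reference run.
golden = conv1d(signal, kernel)

# Assumed fault site: bit 30 (an exponent bit) of the first kernel weight.
faulty_kernel = list(kernel)
faulty_kernel[0] = flip_bit(kernel[0], 30)
faulty = conv1d(signal, faulty_kernel)

# Indices of output elements that differ from the golden run.
corrupted = [i for i, (g, f) in enumerate(zip(golden, faulty)) if g != f]
print("golden:", golden)
print("corrupted output indices:", corrupted)
```

Because a kernel weight is reused for every output position, a single upset in it corrupts the whole output here, whereas an upset in one input sample would only reach the outputs whose window covers that sample. Differences of this kind in how faults spread are one plausible reason for implementations to differ in robustness.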
- …