
    A Multi-level Approach to Evaluate the Impact of GPU Permanent Faults on CNN's Reliability

    Graphics processing units (GPUs) are widely used to accelerate Artificial Intelligence applications, such as those based on Convolutional Neural Networks (CNNs). In some domains where CNNs are heavily employed (e.g., automotive and robotics), the expected lifetime of GPUs exceeds ten years, so it is of paramount importance to study the impact of permanent faults (e.g., due to aging). Crucially, while the impact of transient faults on GPUs running CNNs has been widely studied, an accurate evaluation of the impact of permanent faults is still lacking. Performing this evaluation is challenging due to the complexity of both GPU devices and the software implementing a CNN. In this work, we propose a methodology that combines the accuracy of gate-level fault simulation with the speed and flexibility of software fault injection to evaluate the effects of permanent hardware faults affecting a GPU. First, we profile the low-level GPU instructions executed during CNN inference. Then, using extensive gate-level fault injection campaigns, we provide an accurate analysis of the effects of permanent faults on the internal modules executing the targeted instructions. Finally, we propagate these effects using fast software-based fault injection. The method makes it possible, for the first time, to estimate the percentage of permanent faults leading the CNN to produce wrong results (i.e., changing the result of its work). The feasibility of the method, which allows accuracy to be flexibly traded off against the required computational effort, is shown using LeNet running on an NVIDIA Ampere GPU as a case study. The method reduces the computational effort for the evaluation by several orders of magnitude with respect to plain gate- and RTL-level fault simulation.
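    The final step of the methodology, software-based propagation of a permanent fault, can be illustrated with a minimal sketch. This is not the authors' tool. It assumes a hypothetical error model in which a permanent fault forces one bit of every float32 multiplication result to a fixed value, and it uses a toy two-class linear classifier in place of LeNet. All function names (`stuck_at`, `faulty_mul`, `classify`, `critical_fraction`) are illustrative assumptions.

```python
import struct


def stuck_at(x, bit, value):
    """Force one bit of the float32 encoding of x to 0 or 1 (assumed error model)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    mask = 1 << bit
    bits = (bits | mask) if value else (bits & ~mask)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def faulty_mul(a, b, fault=None):
    """Multiply; a permanent fault corrupts every result, not just one."""
    product = a * b
    if fault is not None:
        bit, value = fault
        product = stuck_at(product, bit, value)
    return product


def classify(weights, x, fault=None):
    """Toy linear classifier: index of the highest score."""
    scores = [
        sum(faulty_mul(w, xi, fault) for w, xi in zip(row, x))
        for row in weights
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def critical_fraction(weights, x):
    """Fraction of the 64 stuck-at faults (32 bits x 2 values) that change the prediction."""
    golden = classify(weights, x)
    faults = [(bit, value) for bit in range(32) for value in (0, 1)]
    wrong = sum(classify(weights, x, f) != golden for f in faults)
    return wrong / len(faults)


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical toy weights
    x = [2.0, 1.0]                      # hypothetical input
    print("fault-free class:", classify(weights, x))
    print("fraction of faults changing the class:", critical_fraction(weights, x))
```

    In the methodology described above, the error model would instead be derived from the gate-level fault injection campaigns on the modules executing the profiled instructions. The single stuck-at bit used here is only a stand-in for that step.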