109 research outputs found

    Analyzing the Sensitivity of GPU Pipeline Registers to Single Events Upsets

    Graphics processing units (GPUs) are readily available solutions for high-performance safety-critical applications, such as self-driving cars. In this application domain, functional safety and reliability are major concerns. Thus, the adoption of fault tolerance techniques is mandatory to detect or correct faults, since these devices must work properly even when faults are present. GPUs are designed and implemented with cutting-edge technologies, which makes them sensitive to faults caused by radiation interference, such as single event upsets. These effects can lead the system to a failure, which is unacceptable in safety-critical applications. Therefore, effective detection and mitigation strategies must be adopted to harden the GPU operation. In this paper, we analyze transient effects in the pipeline registers of a GPU architecture. We run four applications on three GPU configurations, considering the source of the fault, its effect on the GPU, and the use of software-based hardening techniques. The evaluation was performed using a general-purpose soft-core GPU based on the NVIDIA G80 architecture. Results can guide designers in building more resilient GPU architectures.

    Comparison of parallel implementation strategies in GPU-accelerated System-on-Chip under proton irradiation

    Commercial off-the-shelf (COTS) systems-on-chip (SoCs) are becoming widespread in embedded systems. Many of them include a multicore central processing unit (CPU) and a high-end graphics processing unit (GPU). They combine high computational performance with low power consumption and flexible multilevel parallelism. This kind of device is also being considered for radiation environments where large amounts of data must be processed or compute-intensive applications must be executed. In this article, we compare three different strategies to perform matrix multiplication in the GPU of a Tegra TK1 SoC. Our aim is to analyze how the different use of the resources of the GPU influences not only the computational performance of the algorithm, but also its radiation sensitivity. Radiation experiments with protons were performed to compare the behavior of the three strategies. Experimental results show that most of the errors force a reboot of the platform. The number of errors is directly related to how the algorithms use the internal memories of the GPU and increases with the matrix size. It is also related to the number of transactions with the global memory, which in our experiments is not affected by the radiation. Results show that the smallest cross section is obtained with the fastest algorithm, even if it uses the cores of the GPU more intensively. This work was supported in part by the Valencian Regional Government under Grant PROMETEO/2019/109, in part by Jaume I University under Project UJIB2019-36, and in part by the Spanish Ministry of Science and Innovation under Project PID2019-106455GB-C21 and Project PID2020-113656RB-C21.
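
    The abstract does not include the three kernels themselves. As a purely illustrative sketch (not the paper's code), the two CUDA kernels below show the kind of difference in GPU resource usage that such a study correlates with radiation sensitivity: one reads every operand from global memory, while the other stages tiles in on-chip shared memory, trading global-memory transactions for longer residence of data in on-chip SRAM. The kernel names and the TILE parameter are assumptions made for the example; only the kernels are shown.

```cuda
// Illustrative sketch, not the paper's code: two matrix-multiplication
// strategies that use the GPU memory hierarchy differently.
#define TILE 16

// Strategy A: every operand is read straight from global memory.
__global__ void matmul_naive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

// Strategy B: operands are staged in on-chip shared memory, which reduces
// global-memory transactions but keeps data longer in SRAM structures that
// are exposed to upsets. Assumes n is a multiple of TILE and the launch
// grid exactly covers the matrix.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```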

    Analysis of Kernel Redundancy for Soft Error Mitigation on Embedded GPUs

    The use of state-of-the-art commercial processors such as graphics processing units (GPUs) is becoming increasingly common in the New Space industry in order to ensure high performance and power efficiency. However, commercial GPUs are not designed to operate in a harsh environment, and therefore different protection techniques need to be applied to mitigate the effects of radiation, including those produced by single events. This paper assesses the effectiveness of redundant kernel execution on tightly constrained embedded GPUs under proton irradiation, with results suggesting a significant improvement in the silent data corruption (SDC) cross section without penalizing the stability of the whole system. In addition, the subsequent error analysis shows that the CPU is the source of the majority of the events, which are mainly dominated by functional interrupts. This work has been supported by the Spanish Ministry of Science and Innovation as part of the PID2019-106455GB-C22 project.
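
    The abstract does not spell out the duplication scheme, so the following is a minimal, hedged sketch of what time-redundant kernel execution with a host-side check can look like in CUDA: the same kernel is launched twice into separate output buffers, and any disagreement between the copies is flagged as detected silent data corruption. The saxpy kernel and all buffer names are illustrative, not the benchmark used in the irradiation campaign.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative sketch of time-redundant kernel execution (not the evaluated
// setup): launch the same kernel twice, compare the two result buffers.
__global__ void saxpy(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y0, *y1;
    cudaMallocManaged(&x,  n * sizeof(float));
    cudaMallocManaged(&y0, n * sizeof(float));
    cudaMallocManaged(&y1, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y0[i] = y1[i] = 2.0f; }

    dim3 block(256), grid((n + 255) / 256);
    saxpy<<<grid, block>>>(x, y0, 3.0f, n);   // primary execution
    saxpy<<<grid, block>>>(x, y1, 3.0f, n);   // redundant execution
    cudaDeviceSynchronize();

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (y0[i] != y1[i]) ++mismatches;     // any disagreement -> detected SDC
    printf("mismatching elements: %d\n", mismatches);

    cudaFree(x); cudaFree(y0); cudaFree(y1);
    return 0;
}
```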

    Body of Knowledge for Graphics Processing Units (GPUs)

    Graphics Processing Units (GPUs) have emerged as a proven technology that enables high-performance computing and parallel processing in a small form factor. GPUs enhance the traditional computer paradigm by permitting acceleration of complex mathematics and providing the capability to perform weighted calculations, such as those in artificial intelligence systems. Despite the performance enhancements provided by this type of microprocessor, there exist tradeoffs with regard to reliability and radiation susceptibility, which may impact mission success. This report provides an insight into GPU architecture and its potential applications in space and other similar markets. It also discusses reliability, qualification, and radiation considerations for testing GPUs.

    GPU devices for safety-critical systems: a survey

    Graphics Processing Unit (GPU) devices and their associated software programming languages and frameworks can deliver the computing performance required to facilitate the development of next-generation high-performance safety-critical systems such as autonomous driving systems. However, the integration of complex, parallel, and computationally demanding software functions with different safety-criticality levels on GPU devices with shared hardware resources contributes to several safety certification challenges. This survey categorizes and provides an overview of research contributions that address GPU devices’ random hardware failures, systematic failures, and independence of execution. This work has been partially supported by the European Research Council with Horizon 2020 (grant agreements No. 772773 and 871465), the Spanish Ministry of Science and Innovation under grant PID2019-107255GB, the HiPEAC Network of Excellence, and the Basque Government under grant KK-2019-00035. The Spanish Ministry of Economy and Competitiveness has also partially supported Leonidas Kosmidis with a Juan de la Cierva Incorporación postdoctoral fellowship (FJCI-2020-045931-I).

    Digital design techniques for dependable High-Performance Computing

    The abstract is provided in the attachment.

    New Techniques for On-line Testing and Fault Mitigation in GPUs

    The abstract is provided in the attachment.

    Using precision reduction to efficiently improve mixed-precision GPUs reliability

    Duplication With Comparison (DWC) is a traditional and accepted method for improving system reliability. DWC consists of duplicating critical regions at the software or hardware level by creating redundant operations in order to decrease the probability of an unwanted event. However, this technique introduces an expensive overhead in power consumption, processing time, and resource allocation, because the critical operations are computed at least twice. Reduced Precision Duplication With Comparison (RP-DWC) is an effective software-level solution that improves on conventional DWC. RP-DWC aims to mitigate these overheads by enabling parallel processing in underused Floating Point Units (FPUs) in mixed-precision Graphics Processing Units (GPUs). By making use of precision reduction to efficiently improve the reliability of mixed-precision GPUs, RP-DWC extends the DWC technique, introducing proper ways to handle redundancy across operations of different precision. Improving GPU reliability is an extremely valuable challenge in the fault tolerance field, since GPUs are adopted in both High-Performance Computing (HPC) and automotive real-time applications. When GPUs are exposed to a natural environment, such as the surface of the Earth at sea level, they are also exposed to the Earth’s surface radiation. This exposure can be critical, given that radiation particles may hit the GPU’s internal circuit, corrupt sensitive data, and consequently generate undesired outputs. Introducing duplication with reduced precision in a trustworthy manner, so as to maintain reliability in safety-critical systems, is an arduous task that we propose to further investigate in this work.
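
    As a rough illustration of the idea (not the RP-DWC implementation evaluated in the work), the CUDA sketch below duplicates an FP32 operation in FP16, which on mixed-precision GPUs can run in otherwise underused reduced-precision FPUs, and compares the two copies within a tolerance meant to absorb FP16 rounding. The kernel name, the threshold, and the workload are assumptions made for the example; a real deployment would derive the comparison bound from the application's numerical range.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Illustrative sketch of Reduced Precision Duplication With Comparison:
// each thread computes the critical operation in FP32 and redundantly in
// FP16, then flags a fault when the copies diverge beyond the rounding
// bound. Requires a GPU with native FP16 arithmetic (compute capability 5.3+).
__global__ void rp_dwc_mul(const float* a, const float* b, float* out,
                           int* fault_flag, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float full = a[i] * b[i];                            // primary, full precision
    __half redundant = __hmul(__float2half(a[i]),        // duplicate, reduced precision
                              __float2half(b[i]));

    // Accept differences explained by FP16 rounding; anything larger is a fault.
    float diff = fabsf(full - __half2float(redundant));
    if (diff > 1e-2f * fabsf(full) + 1e-3f)
        atomicExch(fault_flag, 1);

    out[i] = full;
}

int main() {
    const int n = 1024;
    float *a, *b, *out; int *flag;
    cudaMallocManaged(&a,   n * sizeof(float));
    cudaMallocManaged(&b,   n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    cudaMallocManaged(&flag, sizeof(int));
    for (int i = 0; i < n; ++i) { a[i] = 0.5f + i * 1e-3f; b[i] = 1.5f; }
    *flag = 0;

    rp_dwc_mul<<<(n + 255) / 256, 256>>>(a, b, out, flag, n);
    cudaDeviceSynchronize();
    printf("fault detected: %s\n", *flag ? "yes" : "no");

    cudaFree(a); cudaFree(b); cudaFree(out); cudaFree(flag);
    return 0;
}
```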

    Exploiting non-constant safe memory in resilient algorithms and data structures

    We extend the Faulty RAM model by Finocchi and Italiano (2008) by adding a safe memory of arbitrary size S, and we then derive tradeoffs between the performance of resilient algorithmic techniques and the size of the safe memory. Let δ and α denote, respectively, the maximum number of faults that can occur during the execution of an algorithm and the actual number of occurred faults, with α ≤ δ. We propose a resilient algorithm for sorting n entries which requires O(n log n + α(δ/S + log S)) time and uses Θ(S) safe memory words. Our algorithm outperforms previous resilient sorting algorithms, which do not exploit the available safe memory and require O(n log n + αδ) time. Finally, we exploit our sorting algorithm to derive a resilient priority queue. Our implementation uses Θ(S) safe memory words and Θ(n) faulty memory words for storing n keys, and requires O(log n + δ/S) amortized time for each insert and deletemin operation. Our resilient priority queue improves on the O(log n + δ) amortized time required by the state of the art. Comment: To appear in Theoretical Computer Science, 201

    An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration

    We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power efficiency of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to excessive circuit latency increase. We evaluate the reliability-power trade-off for such accelerators. Specifically, we experimentally study the reduced-voltage operation of multiple components of real FPGAs, characterize the corresponding reliability behavior of CNN accelerators, propose techniques to minimize the drawbacks of reduced-voltage operation, and combine undervolting with architectural CNN optimization techniques, i.e., quantization and pruning. We investigate the effect of environmental temperature on the reliability-power trade-off of such accelerators. We perform experiments on three identical samples of modern Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification CNN benchmarks. This approach allows us to study the effects of our undervolting technique for both software and hardware variability. We achieve more than a 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain is the result of eliminating the voltage guardband region, i.e., the safe voltage region below the nominal level that is set by the FPGA vendor to ensure correct functionality in worst-case environmental and circuit conditions. 43% of the power-efficiency gain is due to further undervolting below the guardband, which comes at the cost of accuracy loss in the CNN accelerator. We evaluate an effective frequency underscaling technique that prevents this accuracy loss, and find that it reduces the power-efficiency gain from 43% to 25%. Comment: To appear at the DSN 2020 conference