9 research outputs found

Increasing the robustness of CUDA Fermi GPU-based systems

Author: Di Carlo Stefano
Gambardella G.
Indaco M.
Martella I.
Prinetto Paolo Ernesto
Rolfo D.
Trotta P.
Publication venue: IEEE
Publication date: 01/01/2013

Graphics Processing Units (GPUs) have become increasingly popular due to their high computational power and low prices. This makes them particularly suitable for high-performance computing applications such as data processing and image processing. In these fields, the ability to work properly even in the presence of faults is mandatory. This paper presents an innovative approach that combines a Software Based Self Test & Diagnosis (SBSTD) methodology with a fault mitigation strategy to increase the robustness of a CUDA Fermi GPU-based system.

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

An improved fault mitigation strategy for CUDA Fermi GPUs

Author: Di Carlo S.
Gambardella G.
Martella I.
Prinetto P.
Rolfo D.
Trotta P.
Publication date: 01/01/2014

High computational power is a predominant requirement in many applications, and Graphics Processing Units (GPUs) are increasingly adopted to provide it. Low prices and high parallelism make GPUs attractive, even in safety-critical applications. Nonetheless, new methodologies must be studied and developed to increase the dependability of GPUs. This paper presents an improved fault mitigation strategy against permanent faults for CUDA Fermi GPUs. The proposed approach exploits the reverse engineering of the block scheduling policy in CUDA Fermi GPUs in order to minimize the fault mitigation timing overhead. The graceful performance degradation achieved by the proposed technique outperforms multithreaded CPU implementations and other fault mitigation strategies for CUDA GPUs, even in the presence of multiple permanent faults.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

On the Functional Test of Special Function Units in GPUs

Author: Guerrero-Balaguera Juan-David
Reorda Matteo Sonza
Rodriguez Condia Josie E.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2021

The use of Graphics Processing Units (GPUs) has extended from graphics applications to others that exploit their high computational power (e.g., to implement Artificial Intelligence algorithms). These complex applications usually need highly intensive computations based on floating-point transcendental functions. GPUs may efficiently compute these functions in hardware using ad hoc Special Function Units (SFUs). However, a permanent fault in such units could be very critical (e.g., in safety-critical automotive applications). Thus, test methodologies for SFUs are strictly required to achieve the target reliability and safety levels. In this work, we present a functional test method based on a Software-Based Self-Test (SBST) approach targeting the SFUs in GPUs. This method uses different approaches to build a test program and applies several optimization strategies that exploit the GPU parallelism to speed up the test procedure and reduce the required memory. The effectiveness of this methodology was demonstrated on an open-source GPU model (FlexGripPlus) compatible with NVIDIA GPUs. The experimental results show that the proposed technique achieves 90.75% fault coverage and up to 94.26% Testable Fault Coverage, reducing the required memory and test duration with respect to pseudorandom strategies proposed by other authors.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

On the testing of special memories in GPGPUs

Author: Reorda Matteo Sonza
Rodriguez Condia Josie E.
Publication venue: IEEE
Publication date: 01/01/2020

Nowadays, data-intensive applications such as multimedia, high-performance computing, and safety-critical ones (e.g., in automotive) employ General Purpose Graphics Processing Units (GPGPUs) due to their parallel processing capabilities and high performance. These devices employ multiple levels of memory to hide latency and increase performance during the execution of a kernel. Moreover, modern GPGPU architectures are built with cutting-edge semiconductor technologies that reduce their size and power consumption. However, some studies have shown that these technologies are prone to faults during the operative life of a device, thus compromising reliability. In this work, we developed functional test techniques based on parallel Software-Based Self-Test routines to test memory structures in the memory hierarchy of a GPGPU (FlexGripPlus) implementing the G80 architecture of NVIDIA.
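
The abstract above does not say which algorithm the self-test routines apply. As a rough illustration only, the sketch below runs a MATS+-style march test (write 0 everywhere; ascending read 0 and write 1; descending read 1 and write 0) over a simulated one-bit-per-cell memory. The `FaultyMemory` class, its stuck-at fault model, and the function name `mats_plus` are hypothetical and are not taken from the paper or from FlexGripPlus.

```python
class FaultyMemory:
    """Hypothetical memory model: one bit per cell, optional stuck-at faults."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        # Maps a faulty address to the value it is stuck at (assumption).
        self.stuck_at = dict(stuck_at or {})

    def write(self, addr, value):
        # A stuck-at cell ignores the written value.
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


def mats_plus(mem, size):
    """Run a MATS+-style march test and return the sorted failing addresses."""
    failing = set()
    # Element 1: write 0 to every cell (any address order).
    for addr in range(size):
        mem.write(addr, 0)
    # Element 2: ascending order, read 0 then write 1.
    for addr in range(size):
        if mem.read(addr) != 0:
            failing.add(addr)
        mem.write(addr, 1)
    # Element 3: descending order, read 1 then write 0.
    for addr in reversed(range(size)):
        if mem.read(addr) != 1:
            failing.add(addr)
        mem.write(addr, 0)
    return sorted(failing)


if __name__ == "__main__":
    healthy = FaultyMemory(16)
    broken = FaultyMemory(16, stuck_at={3: 1, 9: 0})
    print(mats_plus(healthy, 16))
    print(mats_plus(broken, 16))
```

On a GPGPU, a parallel version would presumably give each thread its own slice of the address range; the sequential loops here only show the read and write order of the march elements.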

Crossref

ZENODO

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

An On-Line Testing Technique for the Scheduler Memory of a GPGPU

Author: Josie E. Rodriguez Condia
Matteo Sonza Reorda
Stefano Di Carlo
Publication date: 01/01/2020

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Open Access Repository

Dynamic Partial Reconfiguration for Dependable Systems

Author: Gambardella Giulio
Publication venue: Politecnico di Torino
Publication date: 01/01/2015

Moore’s law has served as a goal and motivation for consumer electronics manufacturers in the last decades. The increase in processing power of consumer electronics devices has been achieved mainly through cost reduction and technology shrinking. However, reducing physical geometries affects the dependability of electronic devices, making them more sensitive to soft errors such as Single Event Transients (SETs) or Single Event Upsets (SEUs) and to hard (permanent) faults, e.g. due to aging effects. Accordingly, safety-critical systems often rely on the adoption of old technology nodes, even if they introduce longer design time w.r.t. consumer electronics. In fact, functional safety requirements are increasingly pushing industry to develop innovative methodologies for designing highly dependable systems with the required diagnostic coverage. On the other hand, the adoption of commercial off-the-shelf (COTS) devices has begun to be considered for safety-related systems due to real-time requirements, the need to implement computationally hungry algorithms, and lower design costs. In this field, the FPGA market share has constantly increased thanks to their flexibility and low non-recurrent engineering costs, which make them suitable for a set of safety-critical applications with low production volumes. The work presented in this thesis addresses new dependability issues in modern reconfigurable systems, exploiting their special features, namely Dynamic Partial Reconfiguration, to take proper countermeasures with low impact on performance.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Innovative Techniques for Testing and Diagnosing SoCs

Author: DE CARVALHO Mauricio
Publication venue: Politecnico di Torino
Publication date: 01/01/2015

We rely upon the continued functioning of many electronic devices for our everyday welfare. These devices usually embed integrated circuits that are becoming ever cheaper and smaller, with improved features. Nowadays, microelectronics can integrate a working computer with CPU, memories, and even GPUs on a single die, namely a System-on-Chip (SoC). SoCs are also employed in automotive safety-critical applications, but need to be tested thoroughly to comply with reliability standards, in particular the ISO 26262 functional safety standard for road vehicles. The goal of this PhD thesis is to improve SoC reliability by proposing innovative techniques for testing and diagnosing its internal modules: CPUs, memories, peripherals, and GPUs. The proposed approaches, in the sequence in which they appear in the thesis, are described as follows:
1. Embedded Memory Diagnosis: Memories are dense and complex circuits which are susceptible to design and manufacturing errors. Hence, it is important to understand the fault occurrence in the memory array. In practice, the logical and physical array representations differ due to design optimizations that add enhancements to the device, namely scrambling. This part proposes an accurate memory diagnosis based on a software tool able to analyze test results, unscramble the memory array, map failing syndromes to cell locations, perform cumulative analysis, and produce a final fault model hypothesis. Several SRAM failing syndromes, gathered on an industrial automotive 32-bit SoC developed by STMicroelectronics, were analyzed as case studies. The tool displayed defects virtually, and the results were confirmed by real photos taken with a microscope.
2. Functional Test Pattern Generation: The key to a successful test is the pattern applied to the device. Patterns can be structural or functional; the former usually benefit from embedded test modules targeting manufacturing errors and are only effective before shipping the component to the client. The latter, on the other hand, can be applied during mission with minimal impact on performance, but are penalized by high generation time. However, functional test patterns may benefit from having different goals in functional mission mode. Part III of this PhD thesis proposes three different functional test pattern generation methods for CPU cores embedded in SoCs, targeting different test purposes, described as follows:
a. Functional Stress Patterns: Are suitable for optimizing functional stress during I Operational-life Tests and Burn-in Screening for an optimal device reliability characterization.
b. Functional Power Hungry Patterns: Are suitable for determining functional peak power in order to strictly limit the power of structural patterns during manufacturing tests, thus reducing premature device over-kill while delivering high test coverage.
c. Software-Based Self-Test Patterns: Combine the potential of structural patterns with that of functional ones, allowing periodic execution during mission. In addition, external hardware communicating with a devised SBST was proposed. It helps increase the fault coverage by 3% by testing critical Hardly Functionally Testable Faults not covered by conventional SBST patterns.
An automatic functional test pattern generation approach, exploiting an evolutionary algorithm that maximizes metrics related to stress, power, and fault coverage, was employed in the above-mentioned approaches to quickly generate the desired patterns. The approaches were evaluated on two industrial cases developed by STMicroelectronics: an 8051-based SoC and a 32-bit Power Architecture SoC. Results show that generation time was reduced by up to 75% in comparison to older methodologies while significantly increasing the desired metrics.
3. Fault Injection in GPGPU: Fault injection mechanisms in semiconductor devices are suitable for generating structural patterns, testing and activating mitigation techniques, and validating robust hardware and software applications. GPGPUs are known for fast parallel computation used in high-performance computing and advanced driver assistance, where reliability is the key point. Moreover, GPGPU manufacturers do not provide design description code due to content secrecy. Therefore, using commercial fault injectors on a GPGPU model is unfeasible, which leaves radiation tests as the only available resource, and these are costly. In the last part of this thesis, we propose a software-implemented fault injector able to inject bit-flips in memory elements of a real GPGPU. It exploits a software debugger tool and the C-CUDA grammar to wisely determine fault spots and apply bit-flip operations to program variables. The goal is to validate robust parallel algorithms by studying fault propagation or activating the redundancy mechanisms they possibly embed. The effectiveness of the tool was evaluated on two robust applications: redundant parallel matrix multiplication and floating-point Fast Fourier Transform.
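
The bit-flip injection idea from the last part of this abstract can be imitated on the host side. The following is a minimal sketch, assuming a 32-bit floating-point variable and a duplicate-and-compare redundancy scheme; neither detail is stated in the abstract, and the thesis tool works through a debugger on a real GPGPU rather than in Python. All function names (`flip_bit_float32`, `matmul`, `redundant_matmul`) are hypothetical.

```python
import struct


def flip_bit_float32(value, bit):
    """Return value with one bit (0 = LSB, 31 = sign) of its float32 encoding flipped."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def matmul(a, b, inject=None):
    """Plain matrix product; inject=(row, col, bit) corrupts one output element."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
         for i in range(rows)]
    if inject is not None:
        i, j, bit = inject
        c[i][j] = flip_bit_float32(c[i][j], bit)
    return c


def redundant_matmul(a, b, inject=None):
    """Compute the product twice (assumed scheme) and report mismatching positions."""
    first = matmul(a, b, inject)   # copy that may receive the injected fault
    second = matmul(a, b)          # fault-free reference copy
    mismatches = [(i, j)
                  for i, row in enumerate(first)
                  for j, v in enumerate(row)
                  if v != second[i][j]]
    return second, mismatches


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(redundant_matmul(a, identity))
    print(redundant_matmul(a, identity, inject=(0, 1, 31)))
```

Flipping bit 31 changes only the sign, so the corrupted element differs from the reference copy and the comparison flags its position. Flips in low mantissa bits produce much smaller deviations, which is why a study of fault propagation would sweep the bit position as well as the target variable.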

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

New Techniques for On-line Testing and Fault Mitigation in GPUs

Author: RODRIGUEZ CONDIA JOSIE ESTEBAN
Publication venue: country:Italy
Publication date: 24/09/2021

The abstract is in the attachment.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Fault mitigation strategies for CUDA GPUs

Author: DI CARLO STEFANO
Gambardella G.
Martella I.
PRINETTO Paolo Ernesto
Rolfo D.
Trotta P.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2013

High computational power is a predominant requirement in many applications, and Graphics Processing Units (GPUs) are increasingly adopted to provide it. Low prices and high parallelism make GPUs attractive, even in safety-critical applications. Nonetheless, new methodologies must be studied and developed to increase the dependability of GPUs. This paper presents effective fault mitigation strategies for CUDA-based GPUs against permanent faults. The methodology for applying these strategies to the software to be executed is fully described and verified. The graceful performance degradation achieved by the proposed technique outperforms a multithreaded CPU implementation, even in the presence of multiple permanent faults.

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)