Search CORE

61,295 research outputs found

Evaluating Software-based Hardening Techniques for General-Purpose Registers on a GPGPU

Author: Azambuja Jose Rodrigo
Goncalves Marcio M.
Rodriguez Condia Josie E.
Sonza Reorda Matteo
Sterpone Luca
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 14/05/2020
Field of study

Graphics Processing Units (GPUs) are considered a promising solution for high-performance safety-critical applications, such as self-driving cars. In this application domain, the use of fault tolerance techniques is mandatory to detect or correct faults, since they must work properly even in the presence of faults. GPUs are designed with aggressive technology scaling, which makes them susceptible to faults caused by radiation interference, such as the Single Event Upsets (SEUs), which can lead the system to fail, and that is unacceptable in safety-critical applications. In this paper, we evaluate different software-based hardening techniques developed to detect SEUs in GPUs general-purpose registers and propose optimizations to improve performance and memory utilization. The techniques are implemented in three case-study applications and evaluated in a general-purpose soft-core GPU based on the NVIDIA G80 architecture. A fault injection campaign is performed at register transfer level to assess the fault detection potential of the implemented techniques. Results show that the proposed improvements can be tailored for different scenarios, helping engineers in navigating the design space of hardened GPGPU applications

Crossref

ZENODO

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

DeSyRe: on-Demand System Reliability

Author: Armato Antonino
Bouganis Christos-Savvas
Falsafi Babak
Gaydadjiev Georgi
Isaza Sebastian
Malek Alirad
Mariani Riccardo
Pnevmatikatos Dionisios N
Pradhan Dhiraj K
Rauwerda Gerard
Seepers Robert
Shafik Rishad Ahmed
Sourdis Ioannis
Strydis Christos
Sunesen Kim
Theodoropoulos Dimitris
Tzilis Stavros
Vavouras Michail
Publication venue: 'Elsevier BV'
Publication date: 01/01/2013
Field of study

The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect and fault-free system, would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault-tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints

Southampton (e-Prints Soton)

Crossref

EUR Research Repository

Chalmers Research

Chalmers Publication Library

Explore Bristol Research

Automated Synthesis of SEU Tolerant Architectures from OO Descriptions

Author: Chiusano Silvia Anna
Di Carlo Stefano
Prinetto Paolo Ernesto
Publication venue: IEEE Computer Society
Publication date: 01/01/2002
Field of study

SEU faults are a well-known problem in aerospace environment but recently their relevance grew up also at ground level in commodity applications coupled, in this frame, with strong economic constraints in terms of costs reduction. On the other hand, latest hardware description languages and synthesis tools allow reducing the boundary between software and hardware domains making the high-level descriptions of hardware components very similar to software programs. Moving from these considerations, the present paper analyses the possibility of reusing Software Implemented Hardware Fault Tolerance (SIHFT) techniques, typically exploited in micro-processor based systems, to design SEU tolerant architectures. The main characteristics of SIHFT techniques have been examined as well as how they have to be modified to be compatible with the synthesis flow. A complete environment is provided to automate the design instrumentation using the proposed techniques, and to perform fault injection experiments both at behavioural and gate level. Preliminary results presented in this paper show the effectiveness of the approach in terms of reliability improvement and reduced design effort

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Improving reconfigurable systems reliability by combining periodical test and redundancy techniques: a case study

Author: Bezerra Eduardo Augusto
Gough Michael Paul
Vargas Fabian
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/04/2001
Field of study

This paper revises and introduces to the field of reconfigurable computer systems, some traditional techniques used in the fields of fault-tolerance and testing of digital circuits. The target area is that of on-board spacecraft electronics, as this class of application is a good candidate for the use of reconfigurable computing technology. Fault tolerant strategies are used in order for the system to adapt itself to the severe conditions found in space. In addition, the paper describes some problems and possible solutions for the use of reconfigurable components, based on programmable logic, in space applications

Sussex Research Online

Study of fault tolerant software technology for dynamic systems

Author: Caglayan A. K.
Zacharias G. L.
Publication venue
Publication date
Field of study

The major aim of this study is to investigate the feasibility of using systems-based failure detection isolation and compensation (FDIC) techniques in building fault-tolerant software and extending them, whenever possible, to the domain of software fault tolerance. First, it is shown that systems-based FDIC methods can be extended to develop software error detection techniques by using system models for software modules. In particular, it is demonstrated that systems-based FDIC techniques can yield consistency checks that are easier to implement than acceptance tests based on software specifications. Next, it is shown that systems-based failure compensation techniques can be generalized to the domain of software fault tolerance in developing software error recovery procedures. Finally, the feasibility of using fault-tolerant software in flight software is investigated. In particular, possible system and version instabilities, and functional performance degradation that may occur in N-Version programming applications to flight software are illustrated. Finally, a comparative analysis of N-Version and recovery block techniques in the context of generic blocks in flight software is presented

NASA Technical Reports Server

Havens: Explicit Reliable Memory Regions for HPC Applications

Author: Engelmann Christian
Hukerikar Saurabh
Publication venue
Publication date: 26/10/2016
Field of study

Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in the system memory. Existing hardware-based detection and recovery techniques will be inadequate to manage the presence of high memory fault rates. In this paper we propose a partial memory protection scheme based on region-based memory management. We define the concept of regions called havens that provide fault protection for program objects. We provide reliability for the regions through a software-based parity protection mechanism. Our approach enables critical program objects to be placed in these havens. The fault coverage provided by our approach is application agnostic, unlike algorithm-based fault tolerance techniques.Comment: 2016 IEEE High Performance Extreme Computing Conference (HPEC '16), September 2016, Waltham, MA, US

arXiv.org e-Print Archive

Crossref

PROMON: a profile monitor of software applications

Author: Benso Alfredo
Di Carlo Stefano
Di Natale Giorgio
Prinetto Paolo Ernesto
Tagliaferri Luca
Tibaldi Clara
Publication venue: IEEE Computer Society
Publication date: 01/01/2005
Field of study

Software techniques can be efficiently used to increase the dependability of safety-critical applications. Many approaches are based on information redundancy to prevent data and code corruption during the software execution. This paper presents PROMON, a C++ library that exploits a new methodology based on the concept of "Programming by Contract" to detect system malfunctions. Resorting to assertions, pre- and post-conditions, and marginal programmer interventions, PROMON-based applications can reach high level of dependabilit

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino