Search CORE

39 research outputs found

A novel co-design approach for soft errors mitigation in embedded systems

Author: Aguirre Echanove Miguel Ángel
Cuenca-Asensi Sergio
Guzmán Miranda Hipólito
Martínez-Álvarez Antonio
Palomo Pinto Francisco Rogelio
Restrepo Calle Felipe
Publication venue: Austrian Institute of Technology (AIT)
Publication date: 20/09/2010
Field of study

Comunicación presentada en the 11th European Conference on Radiation and its Effects on Components and Systems RADECS 2010, Längenfeld, Austria, September 20-24, 2010.A novel proposal to design radiation-tolerant embedded systems combining hardware and software mitigation techniques is presented. Two suites of tools are developed to automatically apply the techniques and to facilitate the trade-offs analyses.This work makes part of RENASER project (ESP2007-65914-C03-03) funded by the 2007 Spain Research National Plan of the Ministry of Science and Education in which context this work has been possible. The work presented here has been carried out thanks to the support of the research project ’Aceleración de algoritmos industriales y de seguridad en entornos críticos mediante hardware’ (GV/2009/098) (Generalitat Valenciana, Spain)

Repositorio Institucional de la Universidad de Alicante

Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery

Author: González Colás Antonio María
Upasani Gaurang
Vera Rivera Francisco Javier
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2014
Field of study

The trend of downsizing transistors and operating voltage scaling has made the processor chip more sensitive against radiation phenomena making soft errors an important challenge. New reliability techniques for handling soft errors in the logic and memories that allow meeting the desired failures-in-time (FIT) target are key to keep harnessing the benefits of Moore's law. The failure to scale the soft error rate caused by particle strikes, may soon limit the total number of cores that one may have running at the same time. This paper proposes a light-weight and scalable architecture to eliminate silent data corruption errors (SDC) and detected unrecoverable errors (DUE) of a core. The architecture uses acoustic wave detectors for error detection. We propose to recover by confining the errors in the cache hierarchy, allowing us to deal with the relatively long detection latencies. Our results show that the proposed mechanism protects the whole core (logic, latches and memory arrays) incurring performance overhead as low as 0.60%. © 2014 IEEE.Peer ReviewedPostprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC

Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

Author: Carro Luigi
Cela Jose M.
Fernandes Fernando
Fratin Vinicius
Hanzich Mauricio
Lunardi Caio
Navaux Philippe
Oliveira Daniel
Pilla Laercio
Rech Paolo
Publication venue
Publication date: 01/03/2016
Field of study

In this paper, we evaluate the error criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as long as imprecise computing is concerned, the simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications’ output correlating the number of corrupted elements with their spatial locality. Also, we provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude. We apply the selected metrics to experimental results obtained in various radiation test campaigns for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while Xeon Phi is more reliable when executing particles interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.This work was supported by the STIC-AmSud/CAPES scientific cooperation program under the EnergySFE research project grant 99999.007556/2015-02, EU H2020 Programme, and MCTI/RNP-Brazil under the HPC4E Project, grant agreement n° 689772. Tested K40 boards were donated thanks to Steve Keckler, Timothy Tsai, and Siva Hari from NVIDIA.Postprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC

Online error detection and correction of erratic bits in register files

Author: Abella Ferrer Jaume
Carretero Casado Javier Sebastián
Chaparro Valero Pedro Alonso
González Colás Antonio María
Vera Rivera Francisco Javier
Publication venue
Publication date: 01/06/2009
Field of study

Aggressive voltage scaling needed for low power in each new process generation causes large deviations in the threshold voltage of minimally sized devices of the 6T SRAM cell. Gate oxide scaling can cause large transient gate leakage (a trap in the gate oxide), which is known as the erratic bits phenomena. Register file protection is necessary to prevent errors from quickly spreading to different parts of the system, which may cause applications to crash or silent data corruption. This paper proposes a simple and cost-effective mechanism that increases the resiliency of the register files to erratic bits. Our mechanism detects those registers that have erratic bits, recovers from the error and quarantines the faulty register. After the quarantine period, it is able to detect whether they are fully operational with low overhead.Postprint (published version

UPCommons. Portal del coneixement obert de la UPC

Evaluation of the Suitability of NEON SIMD Microprocessor Extensions Under Proton Irradiation

Author: Entrena Arrontes Luis Alfonso
García Valderas Mario
Lindoso Muñoz Almudena
Martín Holgado P.
Morilla Y.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 05/04/2018
Field of study

This paper analyzes the suitability of single-instruction multiple data (SIMD) extensions of current microprocessors under radiation environments. SIMD extensions are intended for software acceleration, focusing mostly in applications that require high computational effort, which are common in many fields such as computer vision. SIMD extensions use a dedicated coprocessor that makes possible packing several instructions in one single extended instruction. Applications that require high performance could benefit from the use of SIMD coprocessors, but their reliability needs to be studied. In this paper, NEON, the SIMD coprocessor of ARM microprocessors, has been selected as a case study to explore the behavior of SIMD extensions under radiation. Radiation experiments of ARM CORTEX-A9 microprocessors have been accomplished with the objective of determining how the use of this kind of coprocessor can affect the system reliability

Universidad Carlos III de Madrid e-Archivo

Digital.CSIC

Special session: Operating systems under test: An overview of the significance of the operating system in the resiliency of the computing continuum

Author: Bosio A.
Casseau E.
Di Carlo S.
Dobias P.
Kastensmidt F.
Rebaudengo M.
Rodrigues G. S.
Savino A.
Sinnen O.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2021
Field of study

The computing continuum's actual trend is facing a growth in terms of devices with any degree of computational capability. Those devices may or may not include a full-stack, including the Operating System layer and the Application layer, or just facing pure bare-metal solutions. In either case, the reliability of the full system stack has to be guaranteed. It is crucial to provide data regarding the impact of faults at all system stack levels and potential hardening solutions to design highly resilient systems. While most of the work usually concentrates on the application reliability, the special session aims to provide a deep comprehension of the impact on the reliability of an embedded system when faults in the hardware substrate of the system stack surface at the Operating System layer. For this reason, we will cover a comparison from an application perspective when hardware faults happen in bare metal vs. real-time OS vs. general-purpose OS. Then we will go deeper within a FreeRTOS to evaluate the contribution of all parts of the OS. Eventually, the Special Session will propose some hardening techniques at the Operating System level by exploiting the scheduling capabilities

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

A Hybrid Fault-Tolerant LEON3 Soft Core Processor Implemented in Low-End SRAM FPGA

Author: Entrena Arrontes Luis Alfonso
García Valderas Mario
Lindoso Muñoz Almudena
Parra Avellaneda Luis Isaías
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017
Field of study

In this work we implemented a hybrid fault-tolerant LEON3 soft-core processor in a low-end FPGA (Artix-7) and evaluated its error detection capabilities through neutron irradiation and fault injection in an incremental manner. The error mitigation approach combines the use of SEC/DED codes for memories, a hardware monitor to detect control-flow errors, software-based techniques to detect data errors and configuration memory scrubbing with repair to avoid error accumulation. The proposed solution can significantly improve fault tolerance and can be fully embedded in a low-end FPGA, with reduced overhead and low intrusiveness

Universidad Carlos III de Madrid e-Archivo

Recommended from our members

Runtime asynchronous fault tolerance via speculation

Author: August David I
Ghosh Soumyadeep
Huang Jialu
Lee Jae W
Mahlke Scott A
Zhang Yun
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2012
Field of study

Transient faults are emerging as a critical reliability concern in modern microprocessors. Redundant hardware solutions are commonly deployed to detect transient faults, but they are less flexible and cost-effective than software solutions. However, software solutions are rendered impractical because of high performance overheads. To address this problem, this paper presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the fastest transient fault detection technique known to date. Serving as a layer between the application and the underlying platform, RAFT automatically generates two symmetric program instances from a program binary. It detects transient faults in a non-invasive way and exploits high-confidence value speculation to achieve low runtime overhead. Evaluation on a commodity multicore system demonstrates that RAFT delivers a geomean performance overhead of 2.83% on a set of 30 SPEC CPU benchmarks and STAMP benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or the software applications

Princeton University Open Access Repository

Crossref