27 research outputs found

SafeSoftDR: A library to enable software-based diverse redundancy for safety-critical tasks

Author: Abella Ferrer Jaume
Alcaide Portet Sergi
Bas Jalón Francisco
Benedicte Illescas Pedro
Cabo Pitarch Guillem
Chang Feng
Fuentes Díaz Francisco Javier
Mazzocchetti Fabio
Publication venue: 'Center for Open Science'
Publication date: 01/01/2022

Applications with safety requirements have become ubiquitous and can be found in edge devices of all kinds. However, microcontrollers in those devices, despite offering moderate performance by implementing multicores and cache hierarchies, may fail to offer adequate support for some safety measures needed for the highest integrity levels, such as lockstepped execution to avoid so-called common cause failures (i.e., a fault affecting redundant components and causing the same error in all of them). To respond to this limitation, an approach based on a software monitor that enforces a form of software-based lockstepped execution across cores has been proposed recently in [2], providing a proof of concept. This paper presents SafeSoftDR, a library providing a standard interface to deploy software-based lockstepped execution across non-natively lockstepped cores, relieving end users of the burden of creating redundant processes, copying input/output data, and comparing results. Our library has been tested on x86-based Linux and is currently being integrated on top of an open-source RISC-V platform targeting safety-related applications, hence offering a convenient environment for safety-critical applications.

This work is part of the project PCI2020-112010, funded by MCIN/AEI/10.13039/501100011033 and the European Union “NextGenerationEU”/PRTR, and the European Union’s Horizon 2020 Programme under project ECSEL Joint Undertaking (JU) under grant agreement No 877056. This work has also been partially supported by the Spanish Ministry of Science and Innovation under grant PID2019-107255GB-C21 funded by MCIN/AEI/10.13039/501100011033.

Peer Reviewed. Postprint (published version)
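The workflow named in this abstract (create redundant replicas, copy the input data, compare the results) can be illustrated with a minimal Python sketch. This is not the SafeSoftDR interface: `run_redundant` and `RedundancyMismatch` are hypothetical names, and both replicas run one after the other in a single process, so the sketch shows only the copy-and-compare pattern, not execution across separate cores.

```python
import copy
from typing import Any, Callable


class RedundancyMismatch(Exception):
    """Raised when the two replicas disagree on their output (hypothetical name)."""


def run_redundant(task: Callable[[Any], Any], data: Any) -> Any:
    """Run `task` twice on private copies of `data` and compare the results.

    Assumption: the replicas run sequentially in one process. A real
    deployment would place them on different cores.
    """
    # Each replica gets its own copy of the input, so a corruption of one
    # copy cannot silently propagate to the other replica.
    first_input = copy.deepcopy(data)
    second_input = copy.deepcopy(data)

    first_output = task(first_input)
    second_output = task(second_input)

    # Result comparison: any disagreement is reported instead of returned.
    if first_output != second_output:
        raise RedundancyMismatch(f"{first_output!r} != {second_output!r}")
    return first_output
```

A caller would wrap the safety-relevant computation, for example `run_redundant(sum, [1, 2, 3])`, and treat `RedundancyMismatch` as a detected error.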

UPCommons. Portal del coneixement obert de la UPC

FIMSIM: A fault injection infrastructure for microarchitectural simulators

Author: Cristal Kestelman Adrián
Unsal Osman Sabri
Valero Cortés Mateo
Yalcin Gulay
Publication date: 01/01/2011

Fault injection is a widely used approach for experiment-based dependability evaluation in which faults can be injected into the hardware, the simulator or the software. Simulation-based fault injection is more appealing for researchers, since it can be used at the early design stage of the processor. As such, it enables a preliminary analysis of the correlation between the criticality of circuit-level faults and their impact on applications. However, the lack of publicly available fault injectors for microarchitecture-level simulators places the extra burden of designing and implementing fault injectors on researchers who evaluate microarchitecture dependability. In this study, we present FIMSIM, to the best of our knowledge the first publicly available fault injection simulator at the microarchitecture level. FIMSIM is a compact tool capable of injecting transient, permanent, intermittent and multi-bit faults. FIMSIM therefore provides the opportunity to comprehensively evaluate the vulnerability of different microarchitectural structures against different fault models.

Postprint (published version)
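Some of the fault models listed in this abstract can be illustrated on a plain integer standing in for a register value. The sketch below is not FIMSIM code: `flip_bits` and `stuck_at` are hypothetical helpers, and the 32-bit default width is an assumption.

```python
from typing import Sequence


def flip_bits(value: int, positions: Sequence[int], width: int = 32) -> int:
    """Model a transient fault: invert the bits at `positions` once.

    One position models a single-bit upset; several model a multi-bit fault.
    """
    mask = 0
    for position in positions:
        if not 0 <= position < width:
            raise ValueError(f"bit {position} is outside a {width}-bit value")
        mask |= 1 << position
    return (value ^ mask) & ((1 << width) - 1)


def stuck_at(value: int, position: int, stuck_value: int, width: int = 32) -> int:
    """Model a permanent fault: force one bit to 0 or 1 on every access."""
    if not 0 <= position < width:
        raise ValueError(f"bit {position} is outside a {width}-bit value")
    word_mask = (1 << width) - 1
    if stuck_value:
        return (value | (1 << position)) & word_mask
    return value & ~(1 << position) & word_mask
```

An intermittent fault could be approximated by applying `stuck_at` only during selected time windows of a simulation.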

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Online error detection and correction of erratic bits in register files

Author: Abella Ferrer Jaume
Carretero Casado Javier Sebastián
Chaparro Valero Pedro Alonso
González Colás Antonio María
Vera Rivera Francisco Javier
Publication date: 01/06/2009

Aggressive voltage scaling needed for low power in each new process generation causes large deviations in the threshold voltage of minimally sized devices of the 6T SRAM cell. Gate oxide scaling can cause large transient gate leakage (a trap in the gate oxide), which is known as the erratic bits phenomenon. Register file protection is necessary to prevent errors from quickly spreading to different parts of the system, which may cause application crashes or silent data corruption. This paper proposes a simple and cost-effective mechanism that increases the resiliency of register files to erratic bits. Our mechanism detects registers that have erratic bits, recovers from the error and quarantines the faulty register. After the quarantine period, it can determine with low overhead whether the quarantined registers are fully operational again.

Postprint (published version)

UPCommons. Portal del coneixement obert de la UPC

Thesis proposal: handling transient faults in multicore cluster environments

Author: Montezanti Diego Miguel
Publication date: 08/08/2012

The goal of improving the performance of current computers has created the challenge of using more transistors (higher density) and raising the operating frequency, along with lowering the supply voltage. All of this translates into higher temperatures and more interference from the environment affecting processors. In addition, with the advent of multicores and manycores, several processing cores have been integrated on the same chip. The combination of all these factors makes computers increasingly less robust against transient faults. This thesis work focuses on handling transient faults that occur in the internal registers of the cores of a current processor, in the context of a multicore cluster running a compute-intensive scientific application. These faults can affect data as well as instructions or addresses. The focus is on silent faults, those that produce data corruptions that alter program execution without causing violations detectable at the operating system level. When such faults occur, the program executes with erroneous parameters and therefore delivers incorrect results. In this context, the goal of the thesis work is to design and develop a middleware system that detects and tolerates transient faults in a multicore cluster environment, transparently to the user, maintaining a specified level of robustness and optimizing resource utilization in the multicores to minimize the inefficiency of replicating and comparing the entire execution.

Presented at the Encuentro de Tesistas de Postgrado. Red de Universidades con Carreras en Informática (RedUNCI)

Servicio de Difusión de la Creación Intelectual

Radiation Testing of a Multiprocessor Macrosynchronized Lockstep Architecture With FreeRTOS

Author: Avilés Pablo M.
Belloch Rodríguez José Antonio
Entrena Arrontes Luis Alfonso
García Valderas Mario
Lindoso Muñoz Almudena
Morilla Yolanda
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 17/11/2021

High-performance microprocessors are now in demand in many fields, including those with high-reliability requirements. Commercial microprocessors offer a good tradeoff between cost, size, and performance, but they must be adapted to satisfy reliability requirements when used in harsh environments. This work presents a high-end multiprocessor hardened with macrosynchronized lockstep and additional protections. A commercial dual-core Advanced RISC Machine (ARM) Cortex-A9 has been used as a case study and a complete hardened system has been developed. The proposed hardened system has been evaluated with exhaustive fault injection campaigns and proton irradiation. The hardening approach has been applied to both bare-metal and operating system (OS)-based applications. The hardened system has demonstrated high reliability in all performed experiments, with error coverage up to 99.3% in the irradiation experiments. Experimental irradiation results demonstrate a cross-sectional reduction of two orders of magnitude.

This work was supported in part by the Spanish Ministry of Science and Innovation under Project PID2019-106455GB-C21 and in part by the Community of Madrid under Project 49.520608.9.18.

Published

Universidad Carlos III de Madrid e-Archivo

Understanding the performance of concurrent error detecting superscalar microarchitectures

Author: Falsafi Babak
Hoe James C.
Jangwoo Kim
Smolens Jared C.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 06/04/2009

Superscalar out-of-order microarchitectures can be modified to support redundant execution of a program as two concurrent threads for soft-error detection. However, the extra workload from redundant execution incurs a performance penalty due to increased contention for resources throughout the datapath. We present four key parameters that affect performance of these designs, namely 1) issue and functional unit bandwidth, 2) issue queue and reorder buffer capacity, 3) decode and retirement bandwidth, and 4) coupling between redundant threads' instantaneous resource requirements. We then survey existing work in concurrent error detecting superscalar microarchitectures and evaluate these proposals with respect to the four factors. © 2005 IEEE

Infoscience - École polytechnique fédérale de Lausanne

Runtime asynchronous fault tolerance via speculation

Author: August David I
Ghosh Soumyadeep
Huang Jialu
Lee Jae W
Mahlke Scott A
Zhang Yun
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2012

Transient faults are emerging as a critical reliability concern in modern microprocessors. Redundant hardware solutions are commonly deployed to detect transient faults, but they are less flexible and cost-effective than software solutions. However, software solutions are rendered impractical because of high performance overheads. To address this problem, this paper presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the fastest transient fault detection technique known to date. Serving as a layer between the application and the underlying platform, RAFT automatically generates two symmetric program instances from a program binary. It detects transient faults in a non-invasive way and exploits high-confidence value speculation to achieve low runtime overhead. Evaluation on a commodity multicore system demonstrates that RAFT delivers a geomean performance overhead of 2.83% on a set of 30 SPEC CPU benchmarks and STAMP benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or the software applications.

Princeton University Open Access Repository

Crossref

A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters

Author: De Giusti Armando Eduardo
Luque Fadón Emilio
Montezanti Diego Miguel
Naiouf Marcelo
Rexachs del Rosario Dolores
Rucci Enzo
Publication date: 01/10/2013

Transient faults are becoming a critical concern in current design trends for general-purpose multiprocessors. Because they can corrupt program outputs, their impact gains importance for long-running parallel scientific applications, given the high cost of relaunching execution from the beginning when results are incorrect. This paper introduces the SMCV tool, which improves reliability for high-performance systems. SMCV replicates application processes and validates the contents of the messages to be sent, preventing the propagation of errors to other processes and bounding detection and notification latency. To assess its utility, the overhead of the SMCV tool is evaluated with three computationally intensive, representative parallel scientific applications. The results demonstrate the efficiency of the SMCV tool in detecting transient fault occurrences.

WPDP - XIII Workshop Procesamiento Distribuido y Paralelo. Red de Universidades con Carreras en Informática (RedUNCI)

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

CONICET Digital

Servicio de Difusión de la Creación Intelectual

Balancing soft error coverage with lifetime reliability in redundantly multithreaded processors

Author: Sudhanva Gurumurthi
Taniya Siddiqua
Publication date: 01/01/2009

Silicon reliability is a key challenge facing the microprocessor industry. Processors need to be designed so that they are resilient against both soft errors and lifetime reliability phenomena. However, techniques developed to address one class of reliability problems may impact other aspects of silicon reliability. In this paper, we show that Redundant Multi-Threading (RMT), which provides soft error protection, degrades lifetime reliability. We then explore two different architectural approaches to tackle this problem, namely Dynamic Voltage Scaling (DVS) and partial RMT. We show that each approach has certain strengths and weaknesses with respect to performance, soft error coverage, and lifetime reliability. We then propose and evaluate a hybrid approach that combines DVS and partial RMT. We show that this approach improves lifetime reliability more than DVS or partial RMT alone, buys back a significant amount of the performance lost due to DVS, and provides nearly complete soft error coverage.

CiteSeerX

Crossref