24 research outputs found
Multi-Threaded Mitigation of Radiation-Induced Soft Errors in Bare-Metal Embedded Systems
This article presents a software protection technique against radiation-induced faults which is based on a multi-threaded strategy. Data triplication and instructions flow duplication or triplication techniques are used to improve system reliability and thus, ensure a correct system operation. To achieve this objective, a relaxed lockstep model to synchronize the execution of both, redundant threads and variables under protection on different processing units is defined. The evaluation was performed by means of simulated fault injection campaigns in a multi-core ARM system. Results show that despite being considered techniques that imply an evident overhead in memory and instructions (Duplication With Comparison and Re-Execution – DWC-R and Triple Modular Redundancy – TMR), spreading the replicas in different instruction flows not only produce similar results than classic techniques, but also improves the computational and recovery time in presence of soft-errors. In addition, this paper highlights the importance of protecting memory-allocated data, since the instruction flow triplication is not enough to improve the overall system reliability.This work was funded by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund through the following projects: ‘Evaluación temprana de los efectos de radiación mediante simulación y virtualización. Estrategias de mitigación en arquitecturas de microprocesadores avanzados’, (Ref: ESP2015-68245-C4-3-P, MINECO/FEDER, UE)
Leveraging Hardware QoS to Control Contention in the Xilinx Zynq UltraScale+ MPSoC
The interference co-running tasks generate on each other’s timing behavior continues to be one of the main challenges to be addressed before Multi-Processor System-on-Chip (MPSoCs) are fully embraced in critical systems like those deployed in avionics and automotive domains. Modern MPSoCs like the Xilinx Zynq UltraScale+ incorporate hardware Quality of Service (QoS) mechanisms that can help controlling contention among tasks. Given the distributed nature of modern MPSoCs, the route a request follows from its source (usually a compute element like a CPU) to its target (usually a memory) crosses several QoS points, each one potentially implementing a different QoS mechanism. Mastering QoS mechanisms individually, as well as their combined operation, is pivotal to obtain the expected benefits from the QoS support. In this work, we perform, to our knowledge, the first qualitative and quantitative analysis of the distributed QoS mechanisms in the Xilinx UltraScale+ MPSoC. We empirically derive QoS information not covered by the technical documentation, and show limitations and benefits of the available QoS support. To that end, we use a case study building on neural network kernels commonly used in autonomous systems in different real-time domains.This work has been partially supported by the Spanish Ministry of Science and Innovation under grant PID2019-107255GB; the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 878752 (MASTECS) and the European Research Council (ERC) grant agreement No. 772773 (SuPerCom).Peer ReviewedPostprint (published version
Analysis of Kernel Redundancy for Soft Error Mitigation on Embedded GPUs
The use of state-of-the-art commercial processors such as graphical processing units (GPUs) is becoming increasingly common in the New Space industry in order to ensure high performance and power efficiency. However, commercial GPUs are not designed to operate in a harsh environment and therefore different protection techniques need to be applied to mitigate the effects of radiation, including those produced by single events. This paper assesses the effectiveness of redundant kernel execution on tightly constrained embedded GPUs under proton irradiation, with results suggesting a significant improvement in the SDC cross-section without penalizing the stability of the whole system. In addition, the posterior error analysis shows that the CPU is the source of the majority of the events, which are mainly dominated by functional interrupts.This work has been supported by the Spanish Ministry of Science and Innovation as part of the PID2019-106455GB-C22 project
Empirical Mathematical Model of Microprocessor Sensitivity and Early Prediction to Proton and Neutron Radiation-Induced Soft Errors
A mathematical model is described to predict microprocessor fault tolerance under radiation. The model is empirically trained by combining data from simulated fault-injection campaigns and radiation experiments, both with protons (at the National Center of Accelerators (CNA) facilities, Seville, Spain) and neutrons [at the Los Alamos Neutron Science Center (LANSCE) Weapons Neutron Research Facility at Los Alamos, USA]. The sensitivity to soft errors of different blocks of commercial processors is identified to estimate the reliability of a set of programs that had previously been optimized, hardened, or both. The results showed a standard error under 0.1, in the case of the Advanced RISC Machines (ARM) processor, and 0.12, in the case of the MSP430 microcontroller.This work was supported in part by Spanish MINECO under Project ESP-2015-68245-C4-3-P and Project ESP-2015-68245-C4-4-P
Dependability Evaluation of COTS Microprocessors via On-Chip debugging facilities
Este artículo presenta una herramienta de inyección de fallos y la metodología para la realización de campañas de inyección de Single-Event-Upsets (SEUs) en microprocesadores Commercial-off-the-shelf (COTS). Este método utiliza las ventajas que ofrecen las infraestructuras de depuración de los microprocesadores actuales, además del depurador estándar de GNU (GDB) para la ejecución y depuración de los programas de pruebas. Los experimentos desarrollados sobre microprocesadores reales, así como en las máquinas virtuales, demuestran la viabilidad y la flexibilidad de la propuesta como una solución de bajo costo para evaluar la fiabilidad de los microprocesadores COTS.This paper presents a fault injection tool and methodology for performing Single-Event-Upsets (SEUs) injection campaigns on Commercial-off-the-shelf (COTS) microprocessors. This method takes advantage of the debug facilities of modern microprocessors along with standard GNU Debugger (GDB) for executing and debugging benchmarks. The developed experiments on real boards, as well as on virtual machines, demonstrate the feasibility and flexibility of the proposal as a low-cost solution for assessing the reliability of COTS microprocessors.Este trabajo fue financiado por el Ministerio español de Economía y Competitividad y el Fondo Europeo de Desarrollo Regional con el proyecto "Evaluación Temprana de los Efectos de radiación Mediante simulación y virtualización. Estrategias de mitigación en arquitecturas de microprocesadores Avanzados "(Ref: ESP2015-68245-C4-3-P MINECO / FEDER, UE), y la Universidad Nacional de Colombia con el proyecto "Desarrollo de software Una métrica de Vulnerabilidad de registros en Microprocesadores COTS" (Cod HERMES: 28433)
Quasi Isolation QoS Setups to Control MPSoC Contention in Integrated Software Architectures
The use of integrated architectures, such as integrated modular avionics (IMA) in avionics, IMA-SP in space, and AUTOSAR in automotive, running on Multi-Processor System-on-Chip (MPSoC) is on the rise. Timing isolation among the different software partitions or applications thereof in an integrated architecture is key to simplifying software integration and its timing validation by ensuring the performance of each partition has no or very limited impact on others despite they share MPSoC’s hardware resources. In this work, we contend that the increasing hardware support for Quality of Service (QoS) guarantees in modern MPSoCs can be leveraged via specific setups to provide strong, albeit not full, isolation among different software partitions. We introduce the concept of Quasi Isolation QoS (QIQoS) setups and instantiate it in the Xilinx Zynq UltraScale+. To that end, out of the millions of setups offered by the different QoS mechanisms, we identify specific QoS configurations that isolate the traffic of time-critical software partitions executing in the core cluster from that generated by contender partitions in the programmable logic. Our results show that the selected isolation setup results in performance variations of the partitions run in the computing cores that are below 6 percentage points, even under scenarios with extremely high traffic coming from the programmable logic.This work has been supported by Grant PID2019-107255GB-C21 funded by MCIN/AEI/ 10.13039/501100011033 and the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 772773).Peer ReviewedPostprint (published version
A Compact Model to Evaluate the Effects of High Level C++ Code Hardening in Radiation Environments
A high-level C++ hardening library is designed for the protection of critical software against the harmful effects of radiation environments that can damage systems. A mathematical and empirical model to predict system behavior in the presence of radiation induced faults is also presented. This model generates a quick evaluation and adjustment of several reliability vs. performance trade-offs, to optimize radiation hardening based on the proposed C++ hardening library. Several simulations and irradiation campaigns with protons and neutrons are used to build the model and to tune it. Finally, the effects of our hardening approach are compared with other hardened and non-hardened approaches.This work was funded by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund through the following projects: ‘Evaluación temprana de los efectos de radiación mediante simulación y virtualización. Estrategias de mitigación en arquitecturas de microprocesadores avanzados’ and ‘Centro de Ensayos Combinados de Irradiación’, (Refs: ESP2015-68245-C4-3-P and ESP2015-68245-C4-4-P, MINECO/FEDER, UE)
Evaluation of Fault Mitigation Techniques Based on Approximate Computing Under Radiation
A software technique based on approximate computing and redundancy is presented to mitigate radiation-induced soft errors in COTS microprocessors. Approximate Computing relies on the capability of certain applications to accept imprecise results to improve efficiency by sacrificing its results in a controlled manner. Our approach avoids the time overhead derived from hardening while preserving the detection and correction rate. Experimental results show that we can detect and correct SDC events improving the cross-section up to 160×, and keeping accuracy under control without compromising performance. In addition, an accuracy-aware layer is included to improve error mitigation and to provide a trade-off between the number of tolerable errors and the necessary accuracy.MultiRad (funded by Région Auvergne-Rhône-Alpes, France); IRT Nanoelec (French National Research Agency ANR-10-AIRT-05 project funded through the Program d’investissement d’avenir); UGA/LPSC/-GENESIS platform and PID2022-138696OB-C22 (funded by the Spanish Ministry of Science and Innovation)
A software-only approach to enable diverse redundancy on Intel GPUs for safety-related kernels
Autonomous Driving (AD) systems rely on object detection and tracking algorithms that require processing high volumes of data at high frequency. High-performance graphics processing units (GPUs) have been shown to provide the required computing performance. AD also carries functional safety requirements such as diverse redundancy for critical software tasks like object detection. This implies that software must be executed redundantly (in a single GPU for efficiency reasons), and with some form of diversity so that a single fault does not cause the same error in both redundant executions. Unfortunately, high-performance GPUs lack explicit hardware means for diverse redundancy, and software-based solutions with limited guarantees have only been provided for NVIDIA GPUs. This paper presents a software-only solution to enable diverse redundancy on Intel GPUs achieving, for the first time, strong guarantees on the diversity provided. By smartly tailoring workload geometry and managing workload allocation to execution units with thread-level wrappers, we guarantee that redundant threads use physically diverse execution units, hence meeting diverse redundancy requirements with affordable performance overheads.This work has been partially supported by the Spanish Ministry of Science and Innovation under grant PID2019-107255GB-C21/AEI/ 10.13039/501100011033, and by the project AUTOtech.agil of the German Federal Ministry of Education and Research (support code 01IS22088I)Peer ReviewedPostprint (author's final draft