
    SafeSoftDR: A library to enable software-based diverse redundancy for safety-critical tasks

    Applications with safety requirements have become ubiquitous and can be found in edge devices of all kinds. However, the microcontrollers in those devices, despite offering moderate performance through multicores and cache hierarchies, may fail to offer adequate support for some safety measures needed for the highest integrity levels, such as lockstepped execution to avoid so-called common cause failures (i.e., a fault affecting redundant components and causing the same error in all of them). To address this limitation, an approach based on a software monitor that enforces a form of software-based lockstepped execution across cores was recently proposed in [2] as a proof of concept. This paper presents SafeSoftDR, a library providing a standard interface to deploy software-based lockstepped execution across non-natively lockstepped cores, relieving end users of the burden of creating redundant processes, copying input/output data, and comparing results. Our library has been tested on x86-based Linux and is currently being integrated on top of an open-source RISC-V platform targeting safety-related applications, hence offering a convenient environment for safety-critical applications. This work is part of the project PCI2020-112010, funded by MCIN/AEI/10.13039/501100011033 and the European Union “NextGenerationEU”/PRTR, and the European Union’s Horizon 2020 Programme under project ECSEL Joint Undertaking (JU) under grant agreement No 877056. This work has also been partially supported by the Spanish Ministry of Science and Innovation under grant PID2019-107255GB-C21 funded by MCIN/AEI/10.13039/501100011033. Peer reviewed. Postprint (published version).
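The redundancy pattern this abstract describes (replicate the task, give each replica its own copy of the input, compare the outputs) can be illustrated with a minimal sketch. This is not the SafeSoftDR interface: `run_redundant` and `checksum` are hypothetical names, and Python threads stand in for the redundant processes on separate cores that the abstract mentions. The monitor that enforces staggered execution is not modelled.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


def run_redundant(task, data, replicas=2):
    """Run `task` on independent copies of `data` and compare the results.

    Hypothetical helper for illustration only. Returns (agreed, result),
    where `result` is None when the replicas disagree.
    """
    # Each replica gets its own deep copy so they share no mutable input.
    inputs = [copy.deepcopy(data) for _ in range(replicas)]
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        results = list(pool.map(task, inputs))
    # Result comparison: every replica must match the first one.
    agreed = all(r == results[0] for r in results[1:])
    return agreed, (results[0] if agreed else None)


def checksum(values):
    """Example workload: sum of squares of the input values."""
    return sum(v * v for v in values)


if __name__ == "__main__":
    print(run_redundant(checksum, [1, 2, 3]))
```

A caller would treat `agreed == False` as a detected error and fall back to whatever recovery action the application defines.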

    Runtime Management of Multiprocessor Systems for Fault Tolerance, Energy Efficiency and Load Balancing

    The efficiency of modern multiprocessor systems is hurt by unpredictable events: aging causes permanent faults that disable components; applications spawning and terminating at arbitrary times affect energy proportionality, causing energy waste; and load imbalances reduce resource utilization, penalizing performance. This thesis demonstrates how runtime management can mitigate the negative effects of unpredictable events by making decisions guided by a combination of static information known in advance and parameters that only become known at runtime. We propose techniques for three different objectives: graceful degradation of aging-prone systems; energy efficiency of heterogeneous adaptive systems; and load balancing by means of work stealing. Managing aging-prone systems for graceful efficiency degradation is based on a high-level system description that encapsulates hardware reconfigurability and workload flexibility and makes it possible to quantify system efficiency and use it as an objective function. Different custom heuristics, as well as simulated annealing and a genetic algorithm, are proposed to optimize this objective function in response to component failures. Custom heuristics are one to two orders of magnitude faster, provide better efficiency for the first 20% of system lifetime and are less than 13% worse than a genetic algorithm at the end of this lifetime. Custom heuristics occasionally fail to satisfy reconfiguration cost constraints. As all algorithms' execution time scales well with respect to system size, a genetic algorithm can be used as a backup in these cases.
Managing heterogeneous multiprocessors capable of Dynamic Voltage and Frequency Scaling is based on a model that accurately predicts performance and power: performance is predicted by combining static, application-specific profiling information and dynamic, runtime performance monitoring data; power is predicted using these performance estimates and a set of platform-specific, static parameters, determined only once and used for every application mix. Three runtime heuristics are proposed that use this model to perform a partial search of the configuration space, evaluating a small set of configurations and selecting the best one. When best-effort performance is adequate, the proposed approach achieves 3% higher energy efficiency compared to the powersave governor and 2x better compared to the interactive and ondemand governors. When individual applications' performance requirements are considered, the proposed approach is able to satisfy them, giving up 18% of the system's energy efficiency compared to the powersave governor, which, however, misses the performance targets by 23%; at the same time, the proposed approach maintains an efficiency advantage of about 55% compared to the other governors, which also satisfy the requirements. Lastly, to improve load balancing of multiprocessors, a partial and approximate view of the current load distribution among system cores is proposed, which consists of lightweight data structures and is maintained by each core through cheap operations. A runtime algorithm is developed that uses this view whenever a core becomes idle to perform victim core selection for work stealing, also considering system topology and memory hierarchy. Among 12 diverse imbalanced workloads, the proposed approach achieves better performance than random, hierarchical and local stealing for six workloads.
Furthermore, it is at most 8% slower among the other six workloads, while competing strategies incur a penalty of at least 89% on some workload.
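To make the victim-selection idea concrete, here is a minimal sketch under stated assumptions. The abstract does not give the thesis's data structures or scoring rule, so everything here is hypothetical: `load_view` is a plain dictionary standing in for the approximate load view, `distance` is a stand-in for topology and memory-hierarchy cost, and the rule "most loaded first, nearest on ties" is only one plausible policy.

```python
def select_victim(idle_core, load_view, distance):
    """Pick a core to steal work from, or None if nobody has work.

    Hypothetical policy for illustration: prefer the most loaded core,
    and among equally loaded cores prefer the one closest to the thief.

    load_view: dict mapping core id -> approximate number of queued tasks
    distance:  function (core_a, core_b) -> topology cost (lower is closer)
    """
    candidates = [
        core for core, load in load_view.items()
        if core != idle_core and load > 0
    ]
    if not candidates:
        return None
    # Sort key: higher load wins; negative distance makes closer cores win ties.
    return max(
        candidates,
        key=lambda core: (load_view[core], -distance(idle_core, core)),
    )


if __name__ == "__main__":
    view = {0: 0, 1: 5, 2: 5, 3: 1}
    print(select_victim(0, view, lambda a, b: abs(a - b)))
```

Because the view is approximate, a real scheduler would still have to handle the case where the chosen victim turns out to have no work left when the steal is attempted.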

    Failure Prognosis of Embedded Systems Based on Temperature Drift Assessment

    Systems-on-Chip provide a large capacity for computation and monitoring, so they are increasingly integrated into high-risk processes such as aeronautical and power generation systems. However, embedded systems are subject to degradation caused by wear, which can be accelerated by the often hostile environment. This paper proposes a method for failure prognosis of embedded systems based on estimating the temperature drift under reference operating conditions, then modelling the drift trend using a support vector regression model. The remaining useful life is estimated using the integral of the probability density function of the time to failure. Experimental results, evaluated by performance analysis techniques, show the effectiveness of the proposed approach.
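The trend-extrapolation step can be illustrated with a deliberately simplified sketch. It swaps the paper's support vector regression for an ordinary least-squares line, and it returns a single point estimate rather than integrating a time-to-failure probability density. The function names, the sample data and the failure threshold are all hypothetical.

```python
def fit_line(times, drifts):
    """Least-squares fit of drift = slope * time + intercept."""
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(drifts) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, drifts))
    slope = sxy / sxx
    return slope, mean_d - slope * mean_t


def remaining_useful_life(times, drifts, threshold):
    """Point estimate of the time left until the drift reaches `threshold`.

    Simplification for illustration: linear trend instead of SVR, and no
    uncertainty model. Returns None if the fitted trend is not increasing.
    """
    slope, intercept = fit_line(times, drifts)
    if slope <= 0:
        return None  # the trend never crosses the threshold
    time_of_failure = (threshold - intercept) / slope
    # Clamp at zero if the threshold has already been crossed.
    return max(0.0, time_of_failure - times[-1])


if __name__ == "__main__":
    hours = [0, 1, 2, 3]
    drift_celsius = [0.0, 0.5, 1.0, 1.5]
    print(remaining_useful_life(hours, drift_celsius, threshold=5.0))
```

A probabilistic variant would replace the single crossing time with a distribution over crossing times, which is where the integral mentioned in the abstract would enter.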

    Behind the Last Line of Defense -- Surviving SoC Faults and Intrusions

    Today, leveraging the enormous modular power, diversity and flexibility of manycore systems-on-a-chip (SoCs) requires careful orchestration of complex resources, a task left to low-level software, e.g. hypervisors. In current architectures, this software forms a single point of failure and a worthwhile target for attacks: once it is compromised, adversaries gain access to all information and full control over the platform and the environment it controls. This paper proposes Midir, an enhanced manycore architecture, effecting a paradigm shift from SoCs to distributed SoCs. Midir changes the way platform resources are controlled by retrofitting tile-based fault containment through well-known mechanisms, while securing low-overhead quorum-based consensus on all critical operations, in particular privilege management and, thus, management of containment domains. Allowing versatile redundancy management, Midir promotes resilience for all software levels, including the lowest. We explain this architecture, its associated algorithms and hardware mechanisms and show, for the example of a Byzantine fault tolerant microhypervisor, that it outperforms the highly efficient MinBFT by one order of magnitude.
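The general idea of gating a critical operation on a quorum of matching votes can be sketched as follows. This is a generic voter, not Midir's hardware mechanism or its protocol: the function name, the vote values and the quorum size are hypothetical choices for illustration.

```python
from collections import Counter


def quorum_decision(votes, quorum):
    """Return the value backed by at least `quorum` matching votes, else None.

    Generic illustration of quorum voting; the vote format and quorum size
    are assumptions, not taken from the paper.
    """
    if not votes:
        return None
    # most_common(1) yields the single most frequent vote and its count.
    value, count = Counter(votes).most_common(1)[0]
    return value if count >= quorum else None


if __name__ == "__main__":
    # Three replicas vote on a privilege change; two matching votes are required.
    print(quorum_decision(["grant", "grant", "deny"], quorum=2))
```

With 2f + 1 voters of which at most f are faulty, requiring f + 1 matching votes ensures that at least one correct voter supports the accepted value.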