21 research outputs found

    Network-on-Chip -based Multi-Processor System-on-Chip: Towards Mixed-Criticality System Certification

    Get PDF
    L'abstract è presente nell'allegato / the abstract is in the attachmen

    Fault-Tolerant Nanosatellite Computing on a Budget

    Get PDF
    Computer Systems, Imagery and Medi

    Mixed-Criticality Systems on Commercial-Off-the-Shelf Multi-Processor Systems-on-Chip

    Get PDF
    Avionics and space industries are struggling with the adoption of technologies like multi-processor system-on-chips (MPSoCs) due to strict safety requirements. This thesis propose a new reference architecture for MPSoC-based mixed-criticality systems (MCS) - i.e., systems integrating applications with different level of criticality - which are a common use case for aforementioned industries. This thesis proposes a system architecture capable of granting partitioning - which is, for short, the property of fault containment. It is based on the detection of spatial and temporal interference, and has been named the online detection of interference (ODIn) architecture. Spatial partitioning requires that an application is not able to corrupt resources used by a different application. In the architecture proposed in this thesis, spatial partitioning is implemented using type-1 hypervisors, which allow definition of resource partitions. An application running in a partition can only access resources granted to that partition, therefore it cannot corrupt resources used by applications running in other partitions. Temporal partitioning requires that an application is not able to unexpectedly change the execution time of other applications. In the proposed architecture, temporal partitioning has been solved using a bounded interference approach, composed of an offline analysis phase and an online safety net. The offline phase is based on a statistical profiling of a metric sensitive to temporal interference’s, performed in nominal conditions, which allows definition of a set of three thresholds: 1. the detection threshold TD; 2. the warning threshold TW ; 3. the α threshold. Two rules of detection are defined using such thresholds: Alarm rule When the value of the metric is above TD. Warning rule When the value of the metric is in the warning region [TW ;TD] for more than α consecutive times. ODIn’s online safety-net exploits performance counters, available in many MPSoC architectures; such counters are configured at bootstrap to monitor the selected metric(s), and to raise an interrupt request (IRQ) in case the metric value goes above TD, implementing the alarm rule. The warning rule is implemented in a software detection module, which reads the value of performance counters when the monitored task yields control to the scheduler and reset them if there is no detection. ODIn also uses two additional detection mechanisms: 1. a control flow check technique, based on compile-time defined block signatures, is implemented through a set of watchdog processors, each monitoring one partition. 2. a timeout is implemented through a system watchdog timer (SWDT), which is able to send an external signal when the timeout is violated. The recovery actions implemented in ODIn are: • graceful degradation, to react to IRQs of WDPs monitoring non-critical applications or to warning rule violations; it temporarily stops non-critical applications to grant resources to the critical application; • hard recovery, to react to the SWDT, to the WDP of the critical application, or to alarm rule violations; it causes a switch to a hot stand-by spare computer. Experimental validation of ODIn was performed on two hardware platforms: the ZedBoard - dual-core - and the Inventami board - quad-core. A space benchmark and an avionic benchmark were implemented on both platforms, composed by different modules as showed in Table 1 Each version of the final application was evaluated through fault injection (FI) campaigns, performed using a specifically designed FI system. There were three types of FI campaigns: 1. HW FI, to emulate single event effects; 2. SW FI, to emulate bugs in non-critical applications; 3. artificial bug FI, to emulate a bug in non-critical applications introducing unexpected interference on the critical application. Experimental results show that ODIn is resilient to all considered types of faul

    Fault-tolerant satellite computing with modern semiconductors

    Get PDF
    Miniaturized satellites enable a variety space missions which were in the past infeasible, impractical or uneconomical with traditionally-designed heavier spacecraft. Especially CubeSats can be launched and manufactured rapidly at low cost from commercial components, even in academic environments. However, due to their low reliability and brief lifetime, they are usually not considered suitable for life- and safety-critical services, complex multi-phased solar-system-exploration missions, and missions with a longer duration. Commercial electronics are key to satellite miniaturization, but also responsible for their low reliability: Until 2019, there existed no reliable or fault-tolerant computer architectures suitable for very small satellites. To overcome this deficit, a novel on-board-computer architecture is described in this thesis.Robustness is assured without resorting to radiation hardening, but through software measures implemented within a robust-by-design multiprocessor-system-on-chip. This fault-tolerant architecture is component-wise simple and can dynamically adapt to changing performance requirements throughout a mission. It can support graceful aging by exploiting FPGA-reconfiguration and mixed-criticality.  Experimentally, we achieve 1.94W power consumption at 300Mhz with a Xilinx Kintex Ultrascale+ proof-of-concept, which is well within the powerbudget range of current 2U CubeSats. To our knowledge, this is the first COTS-based, reproducible on-board-computer architecture that can offer strong fault coverage even for small CubeSats.European Space AgencyComputer Systems, Imagery and Medi

    The Road towards Predictable Automotive High-Performance Platforms

    Get PDF
    Due to the trends of centralizing the E/E architecture and new computing-intensive applications, high-performance hardware platforms are currently finding their way into automotive systems. However, the SoCs currently available on the market have significant weaknesses when it comes to providing predictable performance for time-critical applications. The main reason for this is that these platforms are optimized for averagecase performance. This shortcoming represents one major risk in the development of current and future automotive systems. In this paper we describe how high-performance and predictability could (and should) be reconciled in future HW/SW platforms. We believe that this goal can only be reached in a close collaboration between system suppliers, IP providers, semiconductor companies, and OS/hypervisor vendors. Furthermore, academic input will be needed to solve remaining challenges and to further improve initial solutions

    Leveraging Hardware QoS to Control Contention in the Xilinx Zynq UltraScale+ MPSoC

    Get PDF
    The interference co-running tasks generate on each other’s timing behavior continues to be one of the main challenges to be addressed before Multi-Processor System-on-Chip (MPSoCs) are fully embraced in critical systems like those deployed in avionics and automotive domains. Modern MPSoCs like the Xilinx Zynq UltraScale+ incorporate hardware Quality of Service (QoS) mechanisms that can help controlling contention among tasks. Given the distributed nature of modern MPSoCs, the route a request follows from its source (usually a compute element like a CPU) to its target (usually a memory) crosses several QoS points, each one potentially implementing a different QoS mechanism. Mastering QoS mechanisms individually, as well as their combined operation, is pivotal to obtain the expected benefits from the QoS support. In this work, we perform, to our knowledge, the first qualitative and quantitative analysis of the distributed QoS mechanisms in the Xilinx UltraScale+ MPSoC. We empirically derive QoS information not covered by the technical documentation, and show limitations and benefits of the available QoS support. To that end, we use a case study building on neural network kernels commonly used in autonomous systems in different real-time domains.This work has been partially supported by the Spanish Ministry of Science and Innovation under grant PID2019-107255GB; the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 878752 (MASTECS) and the European Research Council (ERC) grant agreement No. 772773 (SuPerCom).Peer ReviewedPostprint (published version

    Applying Hypervisor-Based Fault Tolerance Techniques to Safety-Critical Embedded Systems

    Get PDF
    This document details the work conducted through the development of this thesis, and it is structured as follows: • Chapter 1, Introduction, has briefly presented the motivation, objectives, and contributions of this thesis. • Chapter 2, Fundamentals, exposes a series of concepts that are necessary to correctly understand the information presented in the rest of the thesis, such as the concepts of virtualization, hypervisors, or software-based fault tolerance. In addition, this chapter includes an exhaustive review and comparison between the different hypervisors used in scientific studies dealing with safety-critical systems, and a brief review of some works that try to improve fault tolerance in the hypervisor itself, an area of research that is outside the scope of this work, but that complements the mechanism presented and could be established as a line of future work. • Chapter 3, Problem Statement and Related Work, explains the main reasons why the concept of Hypervisor-Based Fault Tolerance was born and reviews the main articles and research papers on the subject. This review includes both papers related to safety-critical embedded systems (such as the research carried out in this thesis) and papers related to cloud servers and cluster computing that, although not directly applicable to embedded systems, may raise useful concepts that make our solution more complete or allow us to establish future lines of work. • Chapter 4, Proposed Solution, begins with a brief comparison of the work presented in Chapter 3 to establish the requirements that our solution must meet in order to be as complete and innovative as possible. It then sets out the architecture of the proposed solution and explains in detail the two main elements of the solution: the Voter and the Health Monitoring partition. • Chapter 5, Prototype, explains in detail the prototyping of the proposed solution, including the choice of the hypervisor, the processing board, and the critical functionality to be redundant. With respect to the voter, it includes prototypes for both the software version (the voter is implemented in a virtual machine) and the hardware version (the voter is implemented as IP cores on the FPGA). • Chapter 6, Evaluation, includes the evaluation of the prototype developed in Chapter 5. As a preliminary step and given that there is no evidence in this regard, an exercise is carried out to measure the overhead involved in using the XtratuM hypervisor versus not using it. Subsequently, qualitative tests are carried out to check that Health Monitoring is working as expected and a fault injection campaign is carried out to check the error detection and correction rate of our solution. Finally, a comparison is made between the performance of the hardware and software versions of Voter. • Chapter 7, Conclusions and Future Work, is dedicated to collect the conclusions obtained and the contributions made during the research (in the form of articles in journals, conferences and contributions to projects and proposals in the industry). In addition, it establishes some lines of future work that could complete and extend the research carried out during this doctoral thesis.Programa de Doctorado en Ciencia y Tecnología Informática por la Universidad Carlos III de MadridPresidente: Katzalin Olcoz Herrero.- Secretario: Félix García Carballeira.- Vocal: Santiago Rodríguez de la Fuent

    Timing Predictable and High-Performance Hardware Cache Coherence Mechanisms for Real-Time Multi-Core Platforms

    Get PDF
    Multi-core platforms are becoming primary compute platforms for real-time systems such as avionics and autonomous vehicles. This adoption is primarily driven by the increasing application demands deployed in real-time systems, and the cost and performance benefits of multi-core platforms. For real-time applications, satisfying safety properties in the form of timing predictability, is the paramount consideration. Providing such guarantees on safety properties requires applying some timing analysis on the application executing on the compute platform. The timing analysis computes an upper bound on the application’s execution time on the compute platform, which is referred to as the worst-case execution time (WCET). However, multi-core platforms pose challenges that complicate the timing analysis. Among these challenges are timing challenges caused due to simultaneous accesses from multiple cores to shared hardware resources such as shared caches, interconnects, and off-chip memories. Supporting timing predictable shared data communication between real-time applications further compounds this challenge as a core’s access to shared data is dependent on the simultaneous memory activity from other cores on the shared data. Although hardware cache coherence mechanisms are the primary high-performance data communication mechanisms in current multi-core platforms, there has been very little use of these mechanisms to support timing predictable shared data communication in real-time multi-core platforms. Rather, current state-of-the-art approaches to timing predictable shared data communication sidestep hardware cache coherence. These approaches enforce memory and execution constraints on the shared data to simplify the timing analysis at the expense of application performance. This thesis makes the case for timing predictable hardware cache coherence mechanisms as viable shared data communication mechanisms for real-time multi-core platforms. A key takeaway from the contributions in this thesis is that timing predictable hardware cache coherence mechanisms offer significant application performance over prior state-of-the-art data communication approaches while guaranteeing timing predictability. This thesis has three main contributions. First, this thesis shows how a hardware cache coherence mechanism can be designed to be timing predictable by defining design invariants that guarantee timing predictability. We apply these design invariants and design timing predictable variants of existing conventional cache coherence mechanisms. Evaluation of these timing predictable cache coherence mechanisms show that they provide significant application performance over state-of-the-art approaches while delivering timing predictability. Second, we observe that the large worst-case memory access latency under timing predictable hardware cache coherence mechanisms questions their applicability as a data communication mechanism in real-time multi-core platforms. To this end, we present a systematic framework to design better timing predictable cache coherence mechanisms that balance high application performance and low worst-case memory access latency. Our systematic framework concisely captures the design features of timing predictable cache coherence mechanisms that impacts their WCET, and identifies a spectrum of approaches to reduce the worst-case memory access latency. We describe one approach and show that this approach reduces the worst-case memory access latency of timing predictable cache coherence mechanisms to be the same as alternative approaches while trading away minimal performance in the original cache coherence mechanisms. Third, we design a timing predictable hardware cache coherence mechanism for multi-core platforms used in mixed-critical real-time systems (MCS). Applications in MCS have varying performance and timing predictability requirements. We design a timing predictable cache coherence mechanism that considers these differing requirements and ensures that applications with no timing predictability requirements do not impact applications with strict predictability requirements
    corecore