9 research outputs found

    On the tailoring of CAST-32A certification guidance to real COTS multicore architectures

    The use of Commercial Off-The-Shelf (COTS) multicores in the real-time industry is on the rise because of their potential for increased performance and reduced energy consumption. Yet the unpredictable impact of contention in shared hardware resources on timing challenges certification. Furthermore, most safety certification standards target single-core architectures and do not provide explicit guidance for multicore processors. Recently, however, CAST-32A has been released, providing guidance for software planning, development, and verification on multicores. In this paper, at a theoretical level, we provide a detailed review of the CAST-32A objectives and of the difficulty of meeting them under current COTS multicore design trends; at an experimental level, we assess the difficulties of applying CAST-32A to a real multicore processor, the NXP P4080.
    This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal grant RYC-2013-14717.
    Peer Reviewed. Postprint (author's final draft).

    Development and certification of mixed-criticality embedded systems based on probabilistic timing analysis

    An increasing variety of emerging systems relentlessly replaces or augments the functionality of mechanical subsystems with embedded electronics. Given their number, complexity, and purpose, the safety of these subsystems is an increasingly important matter. Accordingly, such systems are subject to safety certification, which demonstrates the system's safety through rigorous development processes and hardware/software constraints. The massive increase in embedded processor complexity makes the already arduous certification task significantly harder. This thesis addresses the certification challenges of multicore architectures: despite their potential to integrate several applications on a single platform, their inherent complexity jeopardizes their timing predictability and certification. Recently, Measurement-Based Probabilistic Timing Analysis (MBPTA) has emerged as an alternative for dealing with hardware/software complexity. The innovation that MBPTA brings, however, represents a major departure from current certification procedures and standards. The contributions of this thesis are threefold. (i) The definition of certification arguments for mixed-criticality integration on multicore processors; in particular, we propose a set of safety mechanisms, diagnostic techniques, and fault-reaction procedures, defined on a reference multicore architecture, as required to comply with functional safety standards such as IEC 61508. (ii) For timing predictability, a quantitative approach to assess the likelihood of execution-time exceedance events with respect to the risk-reduction requirements of safety standards; to this end, we build on MBPTA and present the design of a safety-related source of randomization (SoR), which plays a key role in the platform-level randomization that MBPTA requires. (iii) An extrapolation of current multicore certification guidance to a commercial 8-core platform, evaluated against emerging high-performance design trends such as caches. Overall, this thesis pushes the certification limits for the use of multicore and MBPTA technology in Critical Real-Time Embedded Systems (CRTES) and paves the way towards their adoption in industry.
    Postprint (published version).
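MBPTA derives probabilistic execution-time bounds from measurements using Extreme Value Theory. The Python sketch below illustrates only the general idea, under stated assumptions: measured execution times are grouped into blocks, a Gumbel distribution is fitted to the block maxima by the method of moments (one simple fitting choice, not necessarily the estimator used in the thesis), and the bound for a target exceedance probability is read from the fitted quantile. The function names and the synthetic input data are hypothetical, and the independence and identical-distribution tests that MBPTA requires before fitting are omitted.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def block_maxima(samples, block_size):
    """Keep the maximum of each consecutive block; a trailing partial block is dropped."""
    n_blocks = len(samples) // block_size
    return [max(samples[i * block_size:(i + 1) * block_size]) for i in range(n_blocks)]

def fit_gumbel_moments(maxima):
    """Method-of-moments Gumbel fit; returns (location mu, scale beta)."""
    beta = statistics.stdev(maxima) * math.sqrt(6) / math.pi
    mu = statistics.mean(maxima) - EULER_GAMMA * beta
    return mu, beta

def pwcet(mu, beta, exceedance_prob):
    """Bound x with P(block maximum > x) = exceedance_prob under the fitted Gumbel:
    x = mu - beta * ln(-ln(1 - p)); log1p keeps precision for tiny p."""
    return mu - beta * math.log(-math.log1p(-exceedance_prob))

# Synthetic stand-in for execution times (cycles) measured on a randomized platform
rng = random.Random(42)
times = [10_000 + rng.expovariate(1 / 150) for _ in range(3_000)]
mu, beta = fit_gumbel_moments(block_maxima(times, 50))
for p in (1e-6, 1e-9, 1e-12):
    print(f"p = {p:g}: pWCET ~ {pwcet(mu, beta, p):,.0f} cycles")
```

The exceedance probability in this sketch is per block of runs. Relating it to the per-run or per-hour targets of a safety standard, which is the link contribution (ii) quantifies, requires further steps that are not shown here.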

    Multi-core devices for safety-critical systems: a survey

    Multi-core devices are envisioned to support the development of next-generation safety-critical systems, enabling the on-chip integration of functions of different criticality. This integration offers multiple potential system-level benefits, such as reductions in cost, size, power, and weight. However, safety certification becomes a challenge, and several fundamental technical safety requirements must be addressed, such as temporal and spatial independence, reliability, and diagnostic coverage. This survey categorizes and gives an overview, at different device abstraction levels (nanoscale, component, and device), of selected key research contributions that support compliance with these fundamental safety requirements.
    This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under grant TIN2015-65316-P, the Basque Government under grant KK-2019-00035 and the HiPEAC Network of Excellence. The Spanish Ministry of Economy and Competitiveness has also partially supported Jaume Abella under Ramon y Cajal postdoctoral fellowship (RYC-2013-14717).
    Peer Reviewed. Postprint (author's final draft).

    Virtual Timing Isolation Safety-Net for Multicore Processors

    Multicore processors promise to offer the performance, as well as the reduced space, weight, and power, needed by future aircraft. However, commercial off-the-shelf (COTS) multicore processors suffer from timing interference between cores, which complicates their use in hard real-time systems such as avionics applications. In this thesis, a safety-net system is proposed that enables virtual timing isolation of the applications running on one core from all other cores. The technique is based on hardware external to the multicore processor and is completely transparent to the applications, i.e., no modification of the observed software is necessary. The basic idea is to perform a worst-case execution time (WCET) analysis based on single-core execution and to accept a predefined slowdown during multicore execution. If the slowdown exceeds the acceptable bounds, interference is reduced by controlling the behavior of the low-criticality cores so that the main application's progress stays within the given bounds. The progress of the applications running on the main core is measured by tracking the application's fingerprint. A fingerprint is created by reading the performance counters of the critical core at very small time steps, which yields a characteristic curve for every execution of a periodic program. In standalone mode, without any applications running on the other cores, a model of an application is created by clustering and combining the extracted curves. At runtime, the extracted performance counter values are compared with the model to determine the progress of the critical application. If the progress of an application is unacceptably delayed, the cores causing the interference are throttled. These interfering cores are identified from their accesses to the shared resources. A controller that takes into account both the progress of the critical application and the time remaining until its final deadline throttles the low-priority cores. Throttling is performed either by frequency scaling of the interfering cores or by halting and resuming them according to a pulse-width modulation scheme. The complete safety-net system was evaluated on a TACLeBench benchmark running on an NXP P4080 multicore processor, observed by a Xilinx FPGA implementing a MicroBlaze soft-core microcontroller. The results show that progress can be measured by fingerprinting with a final deviation of less than 1% for a TACLeBench execution with active opponent cores, and they indicate that the approach is non-intrusive. Several experiments demonstrate the effectiveness of the different throttling mechanisms. Evaluations using a real-world avionics application show that the approach can be applied to integrated modular avionics (IMA) applications. The safety net does not ensure robust partitioning in the conventional sense: the applications on the different cores can influence each other in the timing domain, but the external safety net ensures that the interference on the high-criticality application is low enough for it to meet its deadlines. This allows efficient utilization of the multicore processor. Every critical application is treated individually, and because each relies on its own model recorded in standalone mode, the critical as well as the non-critical applications running on the other cores can be exchanged without recreating a fingerprint model. This eases the porting of legacy applications to multicore processors and allows applications to be exchanged without recertification.
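The fingerprint-based progress tracking and throttling described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under explicit assumptions, not the thesis's implementation: the standalone model is reduced to a single cumulative counter curve (the thesis clusters and combines several curves), progress is read off that curve with a binary search, and the controller is a hypothetical slack-based rule that returns a pulse-width-modulation duty cycle for the low-criticality cores. All names and numbers are invented for illustration.

```python
from bisect import bisect_left

def progress_from_fingerprint(reference, observed_count):
    """Return the standalone time step at which the reference curve first
    reached `observed_count` (the critical task's equivalent progress).

    `reference` holds cumulative counter values per time step recorded in
    standalone mode, so it must be non-decreasing."""
    return bisect_left(reference, observed_count)

def interferer_duty_cycle(elapsed_steps, progress_steps, total_steps,
                          deadline_steps, min_duty=0.0):
    """Hypothetical controller: fraction of each PWM period during which the
    low-criticality cores may run (1.0 = unthrottled, min_duty = halted)."""
    remaining_work = total_steps - progress_steps    # standalone steps still needed
    remaining_time = deadline_steps - elapsed_steps  # steps left before the deadline
    if remaining_work <= 0:
        return 1.0                                   # critical job finished: release interferers
    if remaining_time <= remaining_work:
        return min_duty                              # no slack left: halt interferers
    # Fraction of the remaining time that is slack; less slack means harder throttling
    slack = (remaining_time - remaining_work) / remaining_time
    return max(min_duty, min(1.0, slack))

# Hypothetical standalone fingerprint: cumulative retired instructions per time step
reference = [0, 100, 210, 330, 450, 560, 680, 800, 910, 1000]
total_steps = len(reference) - 1

# Under contention, after 6 time steps the critical core has retired 450 instructions
progress = progress_from_fingerprint(reference, 450)
duty = interferer_duty_cycle(elapsed_steps=6, progress_steps=progress,
                             total_steps=total_steps, deadline_steps=12)
print(f"progress={progress} of {total_steps} steps, interferer duty cycle={duty:.2f}")
```

Tying the duty cycle to the remaining slack, rather than to the instantaneous slowdown alone, reflects the abstract's point that the controller considers both progress and the time left until the deadline; the specific formula is only one plausible choice.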

    On the reliability of hardware event monitors in MPSoCs for critical domains

    Performance Monitoring Units (PMUs) are at the heart of the most advanced timing analysis techniques used to control and bound the impact of contention in Commercial Off-The-Shelf (COTS) SoCs with shared resources (e.g., GPUs and multicore CPUs). In this paper, we report discrepancies between the values obtained from PMU event monitors and the number of events expected from the PMU event descriptions in the processor's official documentation. Discrepancies, which may be due either to actual errors or to inaccurate specifications, make PMU readings unreliable. This is particularly problematic given the critical role that event monitors play in timing analysis in domains such as automotive and avionics. This paper proposes a systematic procedure for event monitor validation. We apply it to validate event monitors in the NVIDIA Xavier and TX2 and in the Zynq UltraScale+ MPSoC. We show that, while some event monitors count as expected, others do not, and we analyze their discrepancies with respect to the expected values.
    This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P, the SELENE European Union's Horizon 2020 (H2020) research and innovation programme under grant agreement No 871467, and the HiPEAC Network of Excellence. MINECO partially supported Jaume Abella under Ramon y Cajal postdoctoral fellowship (RYC-2013-14717), Enrico Mezzetti under Juan de la Cierva-Incorporacion postdoctoral fellowship (IJCI-2016-27396), and Leonidas Kosmidis under Juan de la Cierva-Formacion postdoctoral fellowship (FJCI-2017-34095).
    Peer Reviewed. Postprint (author's final draft).
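The paper's systematic validation procedure is not reproduced here. The Python sketch below shows only its core comparison step, under assumptions: a microbenchmark is built so that the expected count of an event can be derived from its code and the vendor's event description, the counter is read over repeated runs, and the median reading is compared with the expectation against a relative tolerance. The event names, tolerance, and readings are made up for illustration and are not results from the paper.

```python
import statistics
from dataclasses import dataclass, field

@dataclass
class EventCheck:
    event: str                 # event name as listed in the vendor documentation
    expected: int              # count derived from the microbenchmark + event description
    readings: list = field(default_factory=list)  # counter values from repeated runs

def validate(check, rel_tolerance=0.01):
    """Compare the median reading with the expected count.

    Returns (verdict, relative_error); relative_error > 0 means over-counting."""
    median = statistics.median(check.readings)
    rel_error = (median - check.expected) / check.expected
    verdict = "OK" if abs(rel_error) <= rel_tolerance else "DISCREPANCY"
    return verdict, rel_error

# Hypothetical microbenchmark: a loop executing exactly 1,000,000 loads that stream
# through 1,000,000 bytes (15,625 cache lines of 64 bytes). Readings are invented.
checks = [
    EventCheck("LOADS_RETIRED (hypothetical)", 1_000_000, [1_000_212, 1_000_198, 1_000_205]),
    EventCheck("L2_REFILL (hypothetical)", 15_625, [31_240, 31_251, 31_249]),
]
for c in checks:
    verdict, err = validate(c)
    print(f"{c.event:30s} {verdict:12s} {err:+.2%}")
```

Using the median over repeated runs filters out occasional perturbations, so a flagged discrepancy points at a systematic mismatch (for example, an event that counts twice what its description suggests), which is the kind of case the abstract says must then be analyzed.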

    SAFEXPLAIN: Safe and Explainable Critical Embedded Systems Based on AI

    Deep Learning (DL) techniques are at the heart of most future advanced software functions in Critical Autonomous AI-based Systems (CAIS), where they also represent a major competitive factor. Hence, the economic success of CAIS industries (e.g., automotive, space, railway) depends on their ability to design, implement, qualify, and certify DL-based software products with bounded effort and cost. However, there is a fundamental gap between Functional Safety (FUSA) requirements on CAIS and the nature of DL solutions. This gap stems from the development process of DL libraries and affects high-level safety concepts such as (1) explainability and traceability, (2) suitability for varying safety requirements, (3) FUSA-compliant implementations, and (4) real-time constraints. In fact, the data-dependent and stochastic nature of DL algorithms clashes with current FUSA practice, which instead builds on deterministic, verifiable, pass/fail test-based software. The SAFEXPLAIN project tackles these challenges by providing a flexible approach that enables the certification, and hence the adoption, of DL-based solutions in CAIS, building on: (1) DL solutions that provide end-to-end traceability, with specific approaches to explain whether predictions can be trusted and strategies to reach (and prove) correct operation, in accordance with certification standards; (2) alternative and increasingly sophisticated design safety patterns for DL with varying criticality and fault-tolerance requirements; (3) DL library implementations that adhere to safety requirements; and (4) computing platform configurations, to regain determinism, and probabilistic timing analyses, to handle the remaining non-determinism.
    The research leading to these results has received funding from the Horizon Europe Programme under the SAFEXPLAIN Project (www.safexplain.eu), grant agreement no. 101069595. BSC authors have also been supported by the Spanish Ministry of Science and Innovation under grant PID2019-107255GBC21/AEI/10.13039/501100011033.
    Peer Reviewed. Article signed by 22 authors: Jaume Abella, Jon Perez, Cristofer Englund, Bahram Zonooz, Gabriele Giordana, Carlo Donzella, Francisco J. Cazorla, Enrico Mezzetti, Isabel Serra, Axel Brando, Irune Agirre, Fernando Eizaguirre, Thanh Hai Bui, Elahe Arani, Fahad Sarfraz, Ajay Balasubramaniam, Ahmed Badar, Ilaria Bloise, Lorenzo Feruglio, Ilaria Cinelli, Davide Brighenti, Davide Cunial.
    Postprint (author's final draft).

    Non-Simultaneity as a Design Constraint

    Whether one or multiple hardware execution units (i.e., CPU cores) are active, invalid resource sharing, notably due to simultaneous accesses, is problematic, as it can lead to unexpected runtime behaviors with negative implications such as security or safety issues. The growing interest in off-the-shelf multi-core architectures for sensitive applications motivates the need for safe resource sharing. Although critical sections are a well-known solution in imperative, untimed programming models, they fail to provide safety guarantees. By leveraging the time-triggered programming model, this paper aims to enforce that identified critical computation windows can never be executed simultaneously. We achieve this by determining, before an application is compiled, the exact times at which a task accesses a shared resource, which enables offline validation of non-simultaneity constraints.
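Once the access times are known offline, checking a non-simultaneity constraint reduces to an interval-overlap test. The Python sketch below is a minimal illustration under assumptions, not the paper's toolchain: each critical window is a half-open interval of ticks, windows are assumed to be already unrolled over one hyperperiod, and a constraint is violated when two different tasks hold the same shared resource at overlapping times. The class, function, task, and resource names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class AccessWindow:
    task: str
    resource: str
    start: int  # first tick of the access window within the hyperperiod (inclusive)
    end: int    # tick at which the window closes (exclusive)

def overlaps(a, b):
    """Half-open intervals [start, end) overlap iff each starts before the other ends."""
    return a.start < b.end and b.start < a.end

def non_simultaneity_violations(windows):
    """Return (task, task, resource) for every pair of windows of different tasks that
    use the same shared resource at overlapping times. O(n^2), fine for small tables."""
    violations = []
    for a, b in combinations(windows, 2):
        if a.task != b.task and a.resource == b.resource and overlaps(a, b):
            violations.append((a.task, b.task, a.resource))
    return violations

# Hypothetical access windows, already unrolled over one hyperperiod
windows = [
    AccessWindow("nav", "uart0", 0, 10),
    AccessWindow("log", "uart0", 10, 15),   # starts exactly when nav's window ends: allowed
    AccessWindow("ctrl", "uart0", 12, 20),  # overlaps log's window on uart0
    AccessWindow("ctrl", "spi1", 0, 5),
]
print(non_simultaneity_violations(windows))  # → [('log', 'ctrl', 'uart0')]
```

The pairwise check is chosen for clarity; for large window tables, sorting the windows of each resource by start time and comparing neighbours gives the same answer in O(n log n).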

    GPU devices for safety-critical systems: a survey

    Graphics Processing Unit (GPU) devices and their associated programming languages and frameworks can deliver the computing performance required to develop next-generation high-performance safety-critical systems such as autonomous driving systems. However, integrating complex, parallel, and computationally demanding software functions of different safety-criticality levels on GPU devices with shared hardware resources raises several safety certification challenges. This survey categorizes and provides an overview of research contributions that address GPU devices' random hardware failures, systematic failures, and independence of execution.
    This work has been partially supported by the European Research Council with Horizon 2020 (grant agreements No. 772773 and 871465), the Spanish Ministry of Science and Innovation under grant PID2019-107255GB, the HiPEAC Network of Excellence and the Basque Government under grant KK-2019-00035. The Spanish Ministry of Economy and Competitiveness has also partially supported Leonidas Kosmidis with a Juan de la Cierva Incorporación postdoctoral fellowship (FJCI-2020-045931-I).
    Peer Reviewed. Postprint (author's final draft).

    Assessment of a microcontroller for safety-critical avionics and automotive systems

    Nowadays, microcontrollers used in critical real-time embedded systems mostly have a single core, but they are being replaced by more powerful multicore hardware platforms. In the space domain, for instance, these include the Cobham Gaisler NGMP developed for the European Space Agency (ESA), a SPARC quad-core processor with a two-level cache hierarchy. In the automotive and avionics domains, highly flexible platforms such as the Zynq UltraScale+ EG are regarded as very powerful platforms for high-performance safety-critical systems. In fact, the Zynq board implements two multicore clusters, an ARM dual-core Cortex-R5 and an ARM quad-core Cortex-A53, as well as a GPU and an FPGA. Given the industrial trend towards autonomous driving in the automotive domain and unmanned vehicles in the avionics domain, boards with such multicore systems are very promising. The use of multicores raises concerns about contention (interference) in accesses to shared hardware resources, which challenges the timing verification needed to prove that all critical real-time tasks complete by their respective deadlines. In particular, Worst-Case Execution Time (WCET) estimates must account for the impact that contention in shared resources may have on execution time. While such analysis has been performed on relatively simple multicores such as the NGMP, it still needs to be carried out on the more powerful and complex Zynq UltraScale+ EG platform. In particular, it is necessary to analyze the different sources of interference in the multicore clusters and how tasks should be consolidated so that resources are shared efficiently across tasks, thus minimizing the impact on the execution time of the most critical real-time tasks.
    In this Master's thesis, the measurement-based methodology developed at the Barcelona Supercomputing Center (BSC) to quantify the interference that arises across cores due to contention in shared hardware resources is ported from the (simple) NGMP platform to each of the computing clusters of the Zynq UltraScale+ EG platform. The methodology uses small microbenchmarks that stress specific shared hardware resources to create very high contention. Hence, this thesis investigates how to produce high contention in the shared hardware resources of the Zynq UltraScale+ EG platform, transferring the concepts developed for the SPARC V8 instruction set of the NGMP to the ARMv7 and ARMv8 instruction sets of the Zynq platform. This requires porting and adapting microbenchmarks written partly in assembly code, verifying the Performance Monitoring Unit, and analyzing the sources of contention. As a final step, guidelines are devised for consolidating software on the target platform so as to contain, as far as possible, the interference on critical tasks.
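To make the arithmetic behind such contention measurements concrete, the Python sketch below shows one simple way to summarize them; it is an assumption for illustration, not a reproduction of the BSC methodology. A microbenchmark with a known number of shared-resource accesses is timed alone and then alongside stressing microbenchmarks on the other cores; the extra cycles are divided by the number of accesses to obtain a per-access delay, and a task's isolation WCET is padded with that delay for each of its own shared accesses. The additive model and all numbers are hypothetical, and the resulting bound is only as trustworthy as the claim that the stressing microbenchmarks really produced worst-case contention.

```python
def slowdown(isolation_cycles, contention_cycles):
    """Execution time under maximum contention relative to isolation."""
    return contention_cycles / isolation_cycles

def per_access_delay(isolation_cycles, contention_cycles, shared_accesses):
    """Average extra cycles per shared-resource access observed under contention."""
    return (contention_cycles - isolation_cycles) / shared_accesses

def contention_aware_bound(wcet_isolation, task_accesses, delay_per_access):
    """Simple additive bound: isolation WCET plus a penalty for each shared access."""
    return wcet_isolation + task_accesses * delay_per_access

# Hypothetical measurements of a microbenchmark issuing 50,000 shared-resource accesses,
# run alone and then alongside stressing microbenchmarks on the other cores
iso, cont, accesses = 1_000_000, 1_600_000, 50_000
delay = per_access_delay(iso, cont, accesses)
print(f"slowdown x{slowdown(iso, cont):.2f}, {delay:.1f} extra cycles per access")

# Hypothetical critical task: 2,000,000-cycle WCET in isolation, 30,000 shared accesses
print(contention_aware_bound(2_000_000, 30_000, delay))
```

The per-access delay is what makes the measurement reusable: once known for a given resource, it can be combined with each task's access count (read, for example, from the PMU) when deciding how to consolidate tasks on the platform.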