77 research outputs found

    SafeLS: An open source implementation of a lockstep NOEL-V RISC-V core

    Get PDF
    Microcontrollers running safety-critical applications with high integrity requirements must provide appropriate safety measures to manage random hardware faults. For instance, automotive safety regulations (e.g., ISO26262) impose the use of diverse redundancy for items at the highest automotive safety integrity level (ASIL), ASIL-D. In the case of computing cores, this is realized with dual core lockstep (DCLS). The advent of the RISC-VISA has made open source hardware gain popularity. However, there are few industrial open source SoCs meeting the requirements of safety-critical systems, and, to our knowledge, none of them provides lockstep cores. This paper presents the realization of a RISC-V open source lockstep core based on Gaisler's NOEL-V core for the space domain, as well as its integration in the SELENE SoC that provides a complete microcontroller synthesizable on FPGA successfully assessed against space, automotive and railway safety-critical applications in the past.This work is part of the European Union’s Horizon 2020 Programme under project KDT Joint Undertaking (JU) under grant agreement No 101112274 (ISOLDE). This work has also been partially supported by the Spanish Ministry of Science and Innovation under grant PID2019-107255GB-C21 funded by MCIN/AEI/10.13039/501100011033.Peer ReviewedPostprint (author's final draft

    Modeling RTL Fault Models Behavior to Increase the Confidence on TSIM-based Fault Injection

    Get PDF
    Future high-performance safety-relevant applications require microcontrollers delivering higher performance than the existing certified ones. However, means for assessing their dependability are needed so that they can be certified against safety critical certification standars (e.g ISO26262). Dependability assessment analyses performed at high level of abstraction inject single faults to investigate the effects these have in the system. In this work we show that single faults do not comprise the whole picture, due to fault multiplicities and reactivations. Later we prove that, by injecting complex fault models that consider multiplicities and reactivations in higher levels of abstraction, results are substantially different, thus indicating that a change in the methodology is needed.The research leading to these results has received funding from the Ministry of Science and Technology of Spain under contract TIN2015-65316-P and the HiPEAC Network of Excellence. Carles Hern´andez is jointly funded by the Spanish Ministry of Economy and Competitiveness (MINECO) and FEDER funds through grant TIN2014-60404-JIN. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717.Postprint (author's final draft

    Software-only diverse redundancy on GPUs for autonomous driving platforms

    Get PDF
    Autonomous driving (AD) builds upon high-performance computing platforms including (1) general purpose CPUs as well as (2) specific accelerators, being GPUs one of the main representatives. Microcontrollers have reached ASIL-D compliance by implementing diverse redundancy with lockstep execution. However, ASIL-D compliant GPUs rely on either fully redundant lockstep GPUs (i.e. 2 GPUs), which doubles hardware costs, or fully redundant systems with a GPU and another accelerator, which virtually doubles design and validation/verification (V&V) costs. In this paper we analyze the degree of diversity achieved when implementing redundancy on a single GPU, showing that diverse redundancy is not achieved in many cases, and propose software strategies that guarantee achieving diverse redundancy for any kernel on systems using commercial off-the-shelf (COTS) GPUs, thus showing how to achieve ASIL-D compliance on a single COTS GPU in controlled scenarios.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal postdoctoral fellowship number RYC2013-14717Peer ReviewedPostprint (author's final draft

    Software-only triple diverse redundancy on GPUs for autonomous driving platforms

    Get PDF
    Autonomous driving (AD) imposes the need for safe computations in high-performance computing (HPC) components such as GPUs, thus with capabilities to detect and recover from errors since a safe state may not exist anymore. This can be achieved with Triple Modular Redundancy (TMR) for computation components. Furthermore, error detection capabilities need to provide some form of diversity to avoid the case where a single fault leads all redundant executions lead to the same error, which would go undetected. In our past work, we assessed GPUs against dual modular redundancy (DMR) with diversity, showing their potential and limitations to provide diverse redundancy building on reset and restart for recovery. However, such recovery scheme may be too slow for some applications. This paper proposes a software-only solution to deliver diverse TMR on commercial off-the-shelf (COTS) GPUs. Our work details how staggered execution can be achieved and assesses the performance of TMR on COTS GPUs. Moreover, we identify those elements where diversity cannot be guaranteed and provide some discussion comparing the case of DMR and TMR for those elements.This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871467 (SELENE). Leonidas Kosmidis has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under a Juan de la Cierva Formacion postdoctoral fellowship with number FJCI-2017-34095.Peer ReviewedPostprint (author's final draft

    SafeDE: A low-cost hardware solution to enforce diverse redundancy in multicores

    Get PDF
    Failure risk must be tiny in high-integrity systems, such as those in cars, satellites and aircraft. Hence, safety measures must be deployed to avoid a single fault leading to a failure. Redundancy has been often used to address this concern, but it has been proven insufficient if a single fault can cause the same error in all redundant elements, which defeats the purpose of redundancy for error detection. Hence, to avoid this scenario, diversity is implemented along with redundancy, being lockstep execution the most popular diverse redundancy solution for computing cores. However, classic lockstep solutions have non-negligible limitations if implemented in hardware (e.g., half of the cores can only be used for redundant execution and are not even visible at user level), or in software (e.g., the software loop to enforce staggering is long and costs performance). This paper tackles the limitations of classic lockstep solutions by providing an extended analysis and evaluation of SafeDE, a Diversity Enforcement hardware module combining the short loop to enforce diversity of hardware solutions, and the nonintrusiveness of software solutions. Hence, cores can operate in lockstep mode efficiently or run independent tasks. In this paper, we present SafeDE and its rationale, its application to N-modular systems, its hardware and software integration, and an evaluation showing its performance and area efficiency, and its behavior in the presence of faults.This work was supported in part by the European Union’s Horizon 2020 Research and Innovation Programme under Grant 871467, and in part by the Spanish Ministry of Science and Innovation under Grant PID2019-107255GB-C21/AEI/10.13039/501100011033.Peer ReviewedPostprint (author's final draft

    Design of a diversity enforcement module for safety critical processing systems

    Get PDF
    Safety-critical systems must adhere to specific functional safety standards describing the development process for those systems. One key requirement is the ability to avoid a single fault from causing a system failure, or in other words, avoiding Common Cause Failures (CCFs). Redundancy is a usual solution against CCFs. However, some specific CCFs may affect redundant components identically (e.g., voltage droops, clock interferences), hence potentially leading to identical errors that may go unnoticed and cause a failure. Diversity is often deployed along with redundancy to avoid also those CCFs. In the particular case of computing elements (e.g., cores), this is usually realized with some form of lockstep execution where two identical cores execute the same software, but with some time shift among them (aka staggering). Therefore, both cores have different state at any point in time and faults affecting both cores lead to different errors, which can be detected by comparing the outputs. Unfortunately, existing solutions have some non-negligible costs: (i) hardware-only solutions hide half of the cores making them non-user visible, hence halving platform performance even for non-critical tasks. Conversely, (ii) software-only solutions are much more flexible but impose the use of a third core to run the lockstep monitor, and require large staggering which has significant impact in performance for short programs. This thesis devises a new solution aiming at combining the advantages of existing solutions. Our proposal, a hardware diversity-enforcement module (referred to as SafeDE), is an efficient hardware realization of the software monitor. Therefore, it does not hide any core to the end user, it does not require a third core for monitoring purposes, and allows operating with tiny staggering (e.g., few tens of cycles instead of hundreds of thousands as required for the software-only solution). We implement and integrate SafeDE in a space multicore prototype in an FPGA and validate that it effectively achieves its requirements with negligible hardware costs. Moreover, this work has already led to the publication of two peer-reviewed articles in especialized conferences and journals

    Parallel error detection using heterogeneous cores

    Get PDF
    Microprocessor error detection is increasingly important, as the number of transistors in modern systems heightens their vulnerability. In addition, many modern workloads in domains such as the automotive and health industries are increasingly error intolerant, due to strict safety standards. However, current detection techniques require duplication of all hardware structures, causing a considerable increase in power consumption and chip area. Solutions in the literature involve running the code multiple times on the same hardware, which reduces performance significantly and cannot capture all errors. We have designed a novel hardware-only solution for error detection, that exploits parallelism in checking code which may not exist in the original execution. We pair a high-performance out-of-order core with a set of small low-power cores, each of which checks a portion of the out-of-order core's execution. Our system enables the detection of both hard and soft errors, with low area, power and performance overheads.This work was supported by the Engineering and Physical Sciences Research Council (EPSRC), through grant references EP/K026399/1 and EP/M506485/1, and Arm Ltd

    최신 ECU보드를 활용하여 소프트에러들을 실시간 복구하는 기법

    Get PDF
    학위논문 (석사) -- 서울대학교 대학원 : 공과대학 컴퓨터공학부, 2020. 8. 이창건.This dissertation presents the fault-tolerant real-time scheduling using dynamic mode switch support of modern ECU hardware. This dissertation first describes the optimal capacity of the Periodic Resource which contains harmonic periodic task set using the exact time supply function.We show that the optimal capacity can be represented as sum of the each individual utilization of the task in the harmonic periodic task set for both normal state(i.e. no faults) and faulty state. Then, this dissertation proposes non-critical task overlapping technique by only using the idle time intervals of the Periodic Resource in order to overlap the non-critical tasks which ensures no additional capacity increase. Finally, this dissertation proposes the basic form of the Periodic Resources in order to efficiently use the dynamic mode switch support. Next, we also proposes the bin-packing heuristic algorithm that considers both making sub-taskset as a one Periodic Resource and Periodic Resource wide bin-packing which has the pseudo-polynomial time complexity. Experimental results show that the proposed algorithm performs better than the traditional partitioned fixed-priority scheduling approach and partitioned mixed-criticality scheduling approach. Also, the achievement is made up to 18% in terms of the total needed cores compared to traditional partitioned fixed-priority approach for making the given input task set schedulable.본 논문에서는 효율적인 재구성가능 시스템 사용을 위한 계층기반 실시간 결함 감내 스케줄링 기법을 제안한다. 본 연구는 주기 자원 모델을 기반으로, 최적 주기 자원 서버의 용량을 주기 자원 모델이 가지는 실시간 주기 태스크 셋의 유틸라이제이션의 합으로 제시한다. 본 논문은 해당 최적 서버 용량을 시스템이 정상 동작할때와 오동작 할때 모두에 대해서 제시한다. 다음으로, 비중요 태스크 셋들을 중요 주기 자원 서버의 여분 공백 시간을 활용해 서버 용량의 증가 없이 비중요 태스크를 중요 주기 자원 서버에 할당하는 방법론을 제시한다. 마지막으로 본 논문은 주기 자원 서버 단위의 파티션 기법과 주기 태스크를 하나의 주기 자원 서버로 만드는 빈패킹 휴리스틱 알고리즘을 제시한다. 실험 결과, 본 논문에서 제시한 알고리즘은 기존에 사용되었던 파티션 기반 우선순위 스케줄링 알고리즘과 파티션 기반 우선순위 혼잡 중요도 알고리즘보다 더 작은 수의 코어의 개수를 도출 할 수 있음을 보인다. 실험결과를 기반으로, 본 연구에서 제안한 알고리즘을 재구성가능 시스템에 활용한다면 기존 방법 대비 최대 18%의 코어절감효과를 기대할수 있다.1 Introduction 1 1.1 Motivation and Objective 1 1.2 Approach 2 1.3 Organization 6 2 System Model 7 3 Schedulability Analysis 10 3.1 Background 10 3.2 Optimal Capacity Analysis During Normal State 14 3.3 Optimal Capacity Analysis During Fault State 16 3.4 Periodic Resource Wide Schedulability Test 20 3.5 Non-Critical Task Overlapping 24 4 Proposed Approach 26 4.1 Minimum Harmonic Partitions of the Task Set 26 4.2 Proposed Heuristic Algorithm 28 4.2.1 Choosing Detection method 28 4.2.2 Packing Minimum Harmonic Partitions 29 4.2.3 Packing Free Tasks 30 4.2.4 Packing Non-Critical Tasks 31 4.3 Algorithm Description 32 5 Evaluation 35 5.1 Experimental Setup 35 5.2 Simulation Results 36 5.2.1 Free Task Bin-Packing 38 5.2.2 Minimum Harmonic Partitions Bin-Packing 40 5.2.3 Effect of Non-Critical Task Overlapping 43 5.2.4 Effect of State-Wise Computation 45 6 Related Works 46 6.1 Hierarchical Fault-Tolerant Real-Time Scheduling 46 6.2 Error Detection Method 46 7 Conclusion 48 References 50Maste

    Dynamic lockstep processors for applications with functional safety relevance

    Get PDF
    © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Lockstep processing is a recognized technique for helping to secure functional-safety relevant processing against, for instance, single upset errors that might cause faulty execution of code. Lockstepping processors does however bind processing resources in a fashion not beneficial to architectures and applications that would benefit from multi-core/-processors. We propose a novel on-demand synchronizing of cores/processors for lock-step operation featuring post-processing resource release, a concept that facilitates the implementation of modularly redundant core/processor arrays. We discuss the fundamentals of the design and some implementation notes on work achieved to date

    Envisioning a Safety Island to Enable HPC Devices in Safety-Critical Domains

    Full text link
    HPC (High Performance Computing) devices increasingly become the only alternative to deliver the performance needed in safety-critical autonomous systems (e.g., autonomous cars, unmanned planes) due to deploying large and powerful multicores along with accelerators such as GPUs. However, the support that those HPC devices offer to realize safety-critical systems on top is heterogeneous. Safety islands have been devised to be coupled to HPC devices and complement them to meet the safety requirements of an increased set of applications, yet the variety of concepts and realizations is large. This paper presents our own concept of a safety island with two goals in mind: (1) offering a wide set of features to enable the broadest set of safety applications for each HPC device, and (2) being realized with open source components based on RISC-V ISA to ease its use and adoption. In particular, we present our safety island concept, the key features we foresee it should include, and its potential application beyond safety.Comment: White pape
    corecore