4,983 research outputs found

    EXFI: a low cost Fault Injection System for embedded Microprocessor-based Boards

    Get PDF
    Evaluating the faulty behavior of low-cost embedded microprocessor-based boards is an increasingly important issue, due to their adoption in many safety critical systems. The architecture of a complete Fault Injection environment is proposed, integrating a module for generating a collapsed list of faults, and another for performing their injection and gathering the results. To address this issue, the paper describes a software-implemented Fault Injection approach based on the Trace Exception Mode available in most microprocessors. The authors describe EXFI, a prototypical system implementing the approach, and provide data about some sample benchmark applications. The main advantages of EXFI are the low cost, the good portability, and the high efficienc

    Experimental analysis of computer system dependability

    Get PDF
    This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software-dependability, and fault diagnosis. The discussion involves several important issues studies in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance

    A small-scale testbed for large-scale reliable computing

    Get PDF
    High performance computing (HPC) systems frequently suffer errors and failures from hardware components that negatively impact the performance of jobs run on these systems. We analyzed system logs from two HPC systems at Purdue University and created statistical models for memory and hard disk errors. We created a small-scale error injection testbed—using a customized QEMU build, libvirt, and Python—for HPC application programmers to test and debug their programs in a faulty environment so that programmers can write more robust and resilient programs before deploying them on an actual HPC system. The deliverables for this project are the fault injection program, the modified QEMU source code, and the statistical models used for driving the injection

    Mixed-Criticality Systems on Commercial-Off-the-Shelf Multi-Processor Systems-on-Chip

    Get PDF
    Avionics and space industries are struggling with the adoption of technologies like multi-processor system-on-chips (MPSoCs) due to strict safety requirements. This thesis propose a new reference architecture for MPSoC-based mixed-criticality systems (MCS) - i.e., systems integrating applications with different level of criticality - which are a common use case for aforementioned industries. This thesis proposes a system architecture capable of granting partitioning - which is, for short, the property of fault containment. It is based on the detection of spatial and temporal interference, and has been named the online detection of interference (ODIn) architecture. Spatial partitioning requires that an application is not able to corrupt resources used by a different application. In the architecture proposed in this thesis, spatial partitioning is implemented using type-1 hypervisors, which allow definition of resource partitions. An application running in a partition can only access resources granted to that partition, therefore it cannot corrupt resources used by applications running in other partitions. Temporal partitioning requires that an application is not able to unexpectedly change the execution time of other applications. In the proposed architecture, temporal partitioning has been solved using a bounded interference approach, composed of an offline analysis phase and an online safety net. The offline phase is based on a statistical profiling of a metric sensitive to temporal interference’s, performed in nominal conditions, which allows definition of a set of three thresholds: 1. the detection threshold TD; 2. the warning threshold TW ; 3. the α threshold. Two rules of detection are defined using such thresholds: Alarm rule When the value of the metric is above TD. Warning rule When the value of the metric is in the warning region [TW ;TD] for more than α consecutive times. ODIn’s online safety-net exploits performance counters, available in many MPSoC architectures; such counters are configured at bootstrap to monitor the selected metric(s), and to raise an interrupt request (IRQ) in case the metric value goes above TD, implementing the alarm rule. The warning rule is implemented in a software detection module, which reads the value of performance counters when the monitored task yields control to the scheduler and reset them if there is no detection. ODIn also uses two additional detection mechanisms: 1. a control flow check technique, based on compile-time defined block signatures, is implemented through a set of watchdog processors, each monitoring one partition. 2. a timeout is implemented through a system watchdog timer (SWDT), which is able to send an external signal when the timeout is violated. The recovery actions implemented in ODIn are: • graceful degradation, to react to IRQs of WDPs monitoring non-critical applications or to warning rule violations; it temporarily stops non-critical applications to grant resources to the critical application; • hard recovery, to react to the SWDT, to the WDP of the critical application, or to alarm rule violations; it causes a switch to a hot stand-by spare computer. Experimental validation of ODIn was performed on two hardware platforms: the ZedBoard - dual-core - and the Inventami board - quad-core. A space benchmark and an avionic benchmark were implemented on both platforms, composed by different modules as showed in Table 1 Each version of the final application was evaluated through fault injection (FI) campaigns, performed using a specifically designed FI system. There were three types of FI campaigns: 1. HW FI, to emulate single event effects; 2. SW FI, to emulate bugs in non-critical applications; 3. artificial bug FI, to emulate a bug in non-critical applications introducing unexpected interference on the critical application. Experimental results show that ODIn is resilient to all considered types of faul

    Study and development of a software implemented fault injection plug-in for the Xception tool/powerPC 750

    Get PDF
    Estágio realizado na Critical Software e orientado pelo Eng.º Ricardo BarbosaTese de mestrado integrado. Engenharia Informática e Computação. Faculdade de Engenharia. Universidade do Porto. 200

    Enhancing Real-time Embedded Image Processing Robustness on Reconfigurable Devices for Critical Applications

    Get PDF
    Nowadays, image processing is increasingly used in several application fields, such as biomedical, aerospace, or automotive. Within these fields, image processing is used to serve both non-critical and critical tasks. As example, in automotive, cameras are becoming key sensors in increasing car safety, driving assistance and driving comfort. They have been employed for infotainment (non-critical), as well as for some driver assistance tasks (critical), such as Forward Collision Avoidance, Intelligent Speed Control, or Pedestrian Detection. The complexity of these algorithms brings a challenge in real-time image processing systems, requiring high computing capacity, usually not available in processors for embedded systems. Hardware acceleration is therefore crucial, and devices such as Field Programmable Gate Arrays (FPGAs) best fit the growing demand of computational capabilities. These devices can assist embedded processors by significantly speeding-up computationally intensive software algorithms. Moreover, critical applications introduce strict requirements not only from the real-time constraints, but also from the device reliability and algorithm robustness points of view. Technology scaling is highlighting reliability problems related to aging phenomena, and to the increasing sensitivity of digital devices to external radiation events that can cause transient or even permanent faults. These faults can lead to wrong information processed or, in the worst case, to a dangerous system failure. In this context, the reconfigurable nature of FPGA devices can be exploited to increase the system reliability and robustness by leveraging Dynamic Partial Reconfiguration features. The research work presented in this thesis focuses on the development of techniques for implementing efficient and robust real-time embedded image processing hardware accelerators and systems for mission-critical applications. Three main challenges have been faced and will be discussed, along with proposed solutions, throughout the thesis: (i) achieving real-time performances, (ii) enhancing algorithm robustness, and (iii) increasing overall system's dependability. In order to ensure real-time performances, efficient FPGA-based hardware accelerators implementing selected image processing algorithms have been developed. Functionalities offered by the target technology, and algorithm's characteristics have been constantly taken into account while designing such accelerators, in order to efficiently tailor algorithm's operations to available hardware resources. On the other hand, the key idea for increasing image processing algorithms' robustness is to introduce self-adaptivity features at algorithm level, in order to maintain constant, or improve, the quality of results for a wide range of input conditions, that are not always fully predictable at design-time (e.g., noise level variations). This has been accomplished by measuring at run-time some characteristics of the input images, and then tuning the algorithm parameters based on such estimations. Dynamic reconfiguration features of modern reconfigurable FPGA have been extensively exploited in order to integrate run-time adaptivity into the designed hardware accelerators. Tools and methodologies have been also developed in order to increase the overall system dependability during reconfiguration processes, thus providing safe run-time adaptation mechanisms. In addition, taking into account the target technology and the environments in which the developed hardware accelerators and systems may be employed, dependability issues have been analyzed, leading to the development of a platform for quickly assessing the reliability and characterizing the behavior of hardware accelerators implemented on reconfigurable FPGAs when they are affected by such faults
    corecore