Enhancing Real-time Embedded Image Processing Robustness on Reconfigurable Devices for Critical Applications
Nowadays, image processing is increasingly used in several application fields, such as biomedical, aerospace, or automotive. Within these fields, image processing is used to serve both non-critical and critical tasks. As an example, in automotive, cameras are becoming key sensors in increasing car safety, driving assistance and driving comfort. They have been employed for infotainment (non-critical), as well as for some driver assistance tasks (critical), such as Forward Collision Avoidance, Intelligent Speed Control, or Pedestrian Detection.
The complexity of these algorithms poses a challenge for real-time image processing systems, requiring high computing capacity that is usually not available in processors for embedded systems. Hardware acceleration is therefore crucial, and devices such as Field Programmable Gate Arrays (FPGAs) best fit the growing demand for computational capabilities. These devices can assist embedded processors by significantly speeding up computationally intensive software algorithms.
Moreover, critical applications introduce strict requirements not only in terms of real-time constraints, but also from the device reliability and algorithm robustness points of view. Technology scaling is highlighting reliability problems related to aging phenomena, and to the increasing sensitivity of digital devices to external radiation events that can cause transient or even permanent faults. These faults can lead to wrong information being processed or, in the worst case, to a dangerous system failure. In this context, the reconfigurable nature of FPGA devices can be exploited to increase system reliability and robustness by leveraging Dynamic Partial Reconfiguration features.
The research work presented in this thesis focuses on the development of techniques for implementing efficient and robust real-time embedded image processing hardware accelerators and systems for mission-critical applications. Three main challenges have been faced and will be discussed, along with proposed solutions, throughout the thesis: (i) achieving real-time performance, (ii) enhancing algorithm robustness, and (iii) increasing overall system dependability.
In order to ensure real-time performance, efficient FPGA-based hardware accelerators implementing selected image processing algorithms have been developed. The functionalities offered by the target technology and the algorithms' characteristics have been constantly taken into account while designing such accelerators, in order to efficiently tailor the algorithms' operations to the available hardware resources.
On the other hand, the key idea for increasing the robustness of image processing algorithms is to introduce self-adaptivity features at the algorithm level, in order to maintain, or improve, the quality of results for a wide range of input conditions that are not always fully predictable at design time (e.g., noise level variations). This has been accomplished by measuring some characteristics of the input images at run time, and then tuning the algorithm parameters based on such estimations. The dynamic reconfiguration features of modern reconfigurable FPGAs have been extensively exploited in order to integrate run-time adaptivity into the designed hardware accelerators.
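The measure-then-tune loop described above can be illustrated with a minimal software sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the noise estimator (standard deviation of horizontal neighbour differences, which presumes a mostly flat scene), the function names `estimate_noise` and `choose_kernel_size`, and the threshold values. In the thesis the adaptation is realized in hardware through dynamic reconfiguration; this sketch only shows the control idea.

```python
import statistics


def estimate_noise(image):
    """Crude noise estimate (assumed estimator, not the thesis's method).

    Takes the population standard deviation of differences between
    horizontally adjacent pixels and divides by sqrt(2), which recovers
    the per-pixel noise sigma when the underlying scene is flat and the
    noise is independent between pixels.
    """
    diffs = [row[i + 1] - row[i] for row in image for i in range(len(row) - 1)]
    return statistics.pstdev(diffs) / 2 ** 0.5


def choose_kernel_size(noise_sigma, levels=((2.0, 1), (8.0, 3)), fallback=5):
    """Map the estimated noise level to a smoothing kernel size.

    `levels` holds hypothetical (upper noise limit, kernel size) pairs in
    increasing order; `fallback` is used above the last limit.
    """
    for limit, size in levels:
        if noise_sigma < limit:
            return size
    return fallback


# Run-time adaptation step: measure the input, then tune the parameter.
frame = [[10, 12, 9, 11], [10, 8, 11, 10]]
kernel = choose_kernel_size(estimate_noise(frame))
```

A hardware implementation would replace `choose_kernel_size` with the selection of a pre-built accelerator configuration, but the decision structure stays the same.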
Tools and methodologies have also been developed in order to increase the overall system dependability during reconfiguration processes, thus providing safe run-time adaptation mechanisms. In addition, taking into account the target technology and the environments in which the developed hardware accelerators and systems may be employed, dependability issues have been analyzed, leading to the development of a platform for quickly assessing the reliability and characterizing the behavior of hardware accelerators implemented on reconfigurable FPGAs when they are affected by such faults.
Methodologies and Toolflows for the Predictable Design of Reliable and Low-Power NoCs
There is today an unmistakable need to evolve design methodologies and
toolflows for Network-on-Chip based embedded systems. In particular,
meeting low-power requirements is now more urgent than ever.
Modern circuits feature billions of transistors, and neither power management
techniques nor battery capacity can keep pace with the ever-increasing
integration capability of digital devices. Besides, power concerns come together
with modern nanoscale silicon technology design issues.
On one hand, system failure rates are expected to increase exponentially at
every technology node when integrated circuit wear-out failure mechanisms
are not compensated for. However, error detection and/or correction mechanisms
have a non-negligible impact on the network power.
On the other hand, to meet the stringent time-to-market deadlines, the design
cycle of such a distributed and heterogeneous architecture must not be
prolonged by unnecessary design iterations.
Overall, there is a clear need to better discriminate reliability strategies and
interconnect topology solutions upfront, by ranking designs based on power
metrics. In this thesis, we tackle this challenge by proposing power-aware
design technologies.
Finally, we take into account the most aggressive and disruptive methodology
for embedded systems with ultra-low power constraints, by migrating
NoC basic building blocks to an asynchronous (or clockless) design style. We
deal with this challenge by delivering a standard cell design methodology and
mainstream CAD toolflows, in this way partially relaxing the requirement
of using asynchronous blocks only as hard macros.
Applying Hypervisor-Based Fault Tolerance Techniques to Safety-Critical Embedded Systems
This document details the work conducted through the development of this thesis, and it
is structured as follows:
• Chapter 1, Introduction, briefly presents the motivation, objectives, and contributions
of this thesis.
• Chapter 2, Fundamentals, exposes a series of concepts that are necessary to correctly
understand the information presented in the rest of the thesis, such as the
concepts of virtualization, hypervisors, or software-based fault tolerance. In addition,
this chapter includes an exhaustive review and comparison between the different
hypervisors used in scientific studies dealing with safety-critical systems, and a
brief review of some works that try to improve fault tolerance in the hypervisor itself,
an area of research that is outside the scope of this work, but that complements
the mechanism presented and could be established as a line of future work.
• Chapter 3, Problem Statement and Related Work, explains the main reasons why
the concept of Hypervisor-Based Fault Tolerance was born and reviews the main
articles and research papers on the subject. This review includes both papers related
to safety-critical embedded systems (such as the research carried out in this thesis)
and papers related to cloud servers and cluster computing that, although not directly
applicable to embedded systems, may raise useful concepts that make our solution
more complete or allow us to establish future lines of work.
• Chapter 4, Proposed Solution, begins with a brief comparison of the work presented
in Chapter 3 to establish the requirements that our solution must meet in order to
be as complete and innovative as possible. It then sets out the architecture of the
proposed solution and explains in detail the two main elements of the solution: the
Voter and the Health Monitoring partition.
• Chapter 5, Prototype, explains in detail the prototyping of the proposed solution,
including the choice of the hypervisor, the processing board, and the critical functionality
to be made redundant. With respect to the voter, it includes prototypes for both
the software version (the voter is implemented in a virtual machine) and the hardware
version (the voter is implemented as IP cores on the FPGA).
• Chapter 6, Evaluation, includes the evaluation of the prototype developed in Chapter
5. As a preliminary step and given that there is no evidence in this regard, an
exercise is carried out to measure the overhead involved in using the XtratuM hypervisor
versus not using it. Subsequently, qualitative tests are carried out to check that
Health Monitoring is working as expected and a fault injection campaign is carried
out to check the error detection and correction rate of our solution. Finally, a comparison
is made between the performance of the hardware and software versions of
the Voter.
• Chapter 7, Conclusions and Future Work, collects the conclusions
drawn and the contributions made during the research (in the form of articles in
journals, conferences and contributions to projects and proposals in the industry).
In addition, it establishes some lines of future work that could complete and extend
the research carried out during this doctoral thesis.
Doctoral Programme in Computer Science and Technology, Universidad Carlos III de Madrid.
President: Katzalin Olcoz Herrero. Secretary: Félix García Carballeira. Member: Santiago Rodríguez de la Fuent
Mixed-Criticality Systems on Commercial-Off-the-Shelf Multi-Processor Systems-on-Chip
Avionics and space industries are struggling with the adoption of technologies
like multi-processor system-on-chips (MPSoCs) due to strict safety requirements.
This thesis proposes a new reference architecture for MPSoC-based mixed-criticality
systems (MCS) - i.e., systems integrating applications with different levels of criticality - which are a common use case for the aforementioned industries.
The proposed system architecture is capable of guaranteeing partitioning,
that is, the property of fault containment. It is based on the detection
of spatial and temporal interference, and has been named the online detection of
interference (ODIn) architecture.
Spatial partitioning requires that an application is not able to corrupt resources
used by a different application. In the architecture proposed in this thesis, spatial
partitioning is implemented using type-1 hypervisors, which allow definition of
resource partitions. An application running in a partition can only access resources
granted to that partition, therefore it cannot corrupt resources used by applications
running in other partitions.
Temporal partitioning requires that an application is not able to unexpectedly
change the execution time of other applications. In the proposed architecture, temporal partitioning has been solved using a bounded interference approach, composed of
an offline analysis phase and an online safety net.
The offline phase is based on a statistical profiling of a metric sensitive to
temporal interference, performed in nominal conditions, which allows definition of
a set of three thresholds:
1. the detection threshold TD;
2. the warning threshold TW;
3. the α threshold.
Two rules of detection are defined using such thresholds:
Alarm rule: triggered when the value of the metric is above TD.
Warning rule: triggered when the value of the metric is in the warning region [TW; TD] for
more than α consecutive times.
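The two rules can be sketched in software as follows. This is an illustrative model only: the class name `InterferenceDetector`, the string return values, and the choice to reset the consecutive-sample counter after an alarm or a warning are assumptions, not details given in the abstract. The abstract states that the alarm rule is raised by performance counters through an interrupt; here both rules are modelled in one function for clarity.

```python
from dataclasses import dataclass


@dataclass
class InterferenceDetector:
    """Software model of the alarm and warning rules (hypothetical names)."""

    t_w: float       # warning threshold TW
    t_d: float       # detection threshold TD
    alpha: int       # α threshold: tolerated consecutive samples in [TW; TD]
    streak: int = 0  # consecutive samples seen so far in the warning region

    def observe(self, value: float) -> str:
        # Alarm rule: the metric is above TD.
        if value > self.t_d:
            self.streak = 0  # assumption: an alarm clears the streak
            return "alarm"
        # Warning rule: the metric stays in [TW; TD] more than α times in a row.
        if self.t_w <= value <= self.t_d:
            self.streak += 1
            if self.streak > self.alpha:
                self.streak = 0  # assumption: a warning clears the streak
                return "warning"
        else:
            self.streak = 0  # a sample below TW breaks the sequence
        return "ok"


# Example with made-up thresholds: the third in-region sample exceeds α = 2.
detector = InterferenceDetector(t_w=10.0, t_d=20.0, alpha=2)
results = [detector.observe(v) for v in (12.0, 15.0, 18.0, 25.0)]
```

With these made-up values, the first two samples are tolerated, the third triggers the warning rule, and the fourth triggers the alarm rule.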
ODIn’s online safety net exploits performance counters, available in many MPSoC architectures; such counters are configured at bootstrap to monitor the selected
metric(s), and to raise an interrupt request (IRQ) in case the metric value goes above
TD, implementing the alarm rule. The warning rule is implemented in a software detection module, which reads the value of the performance counters when the monitored
task yields control to the scheduler and resets them if there is no detection.
ODIn also uses two additional detection mechanisms:
1. a control flow check technique, based on compile-time defined block signatures, is implemented through a set of watchdog processors (WDPs), each monitoring
one partition;
2. a timeout is implemented through a system watchdog timer (SWDT), which is
able to send an external signal when the timeout is violated.
The recovery actions implemented in ODIn are:
• graceful degradation, to react to IRQs of WDPs monitoring non-critical applications or to warning rule violations; it temporarily stops non-critical applications
to grant resources to the critical application;
• hard recovery, to react to the SWDT, to the WDP of the critical application, or
to alarm rule violations; it causes a switch to a hot stand-by spare computer.
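The mapping from detection events to recovery actions listed above can be written as a small lookup. The enum and function names are hypothetical, and the rule that hard recovery takes precedence when several events occur together is an assumption not stated in the abstract.

```python
from enum import Enum, auto


class Event(Enum):
    WDP_NON_CRITICAL = auto()  # IRQ from a WDP monitoring a non-critical application
    WARNING_RULE = auto()      # warning rule violation
    SWDT_TIMEOUT = auto()      # system watchdog timer expired
    WDP_CRITICAL = auto()      # IRQ from the WDP of the critical application
    ALARM_RULE = auto()        # alarm rule violation


class Recovery(Enum):
    GRACEFUL_DEGRADATION = auto()  # temporarily stop non-critical applications
    HARD_RECOVERY = auto()         # switch to the hot stand-by spare computer


# Event-to-action table, following the two recovery bullets above.
RECOVERY_FOR = {
    Event.WDP_NON_CRITICAL: Recovery.GRACEFUL_DEGRADATION,
    Event.WARNING_RULE: Recovery.GRACEFUL_DEGRADATION,
    Event.SWDT_TIMEOUT: Recovery.HARD_RECOVERY,
    Event.WDP_CRITICAL: Recovery.HARD_RECOVERY,
    Event.ALARM_RULE: Recovery.HARD_RECOVERY,
}


def select_recovery(events):
    """Return the action for a set of pending events, or None if there are none.

    Assumption: hard recovery wins whenever any pending event requires it.
    """
    actions = {RECOVERY_FOR[event] for event in events}
    if Recovery.HARD_RECOVERY in actions:
        return Recovery.HARD_RECOVERY
    if Recovery.GRACEFUL_DEGRADATION in actions:
        return Recovery.GRACEFUL_DEGRADATION
    return None
```

Keeping the table separate from the selection logic makes it easy to check the mapping against the two bullets by inspection.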
Experimental validation of ODIn was performed on two hardware platforms: the
ZedBoard - dual-core - and the Inventami board - quad-core.
A space benchmark and an avionic benchmark were implemented on both platforms, composed of different modules, as shown in Table 1.
Each version of the final application was evaluated through fault injection (FI)
campaigns, performed using a specifically designed FI system. There were three
types of FI campaigns:
1. HW FI, to emulate single event effects;
2. SW FI, to emulate bugs in non-critical applications;
3. artificial bug FI, to emulate a bug in non-critical applications introducing
unexpected interference on the critical application.
Experimental results show that ODIn is resilient to all considered types of faul
Design of a fault tolerant airborne digital computer. Volume 1: Architecture
This volume is concerned with the architecture of a fault tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable and contractible. (4) The design is to be appropriate to post-1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition, SIFT is particularly simple and believable. The other candidates, Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor, are potentially more efficient than SIFT in the use of redundancy, but otherwise are not as attractive.
Fault tolerant programmable digital attitude control electronics study
The attitude control electronics mechanization study to develop a fault tolerant autonomous concept for a three axis system is reported. Programmable digital electronics are compared to general purpose digital computers. The requirements, constraints, and tradeoffs are discussed. It is concluded that: (1) general fault tolerance can be achieved relatively economically, (2) recovery times of less than one second can be obtained, (3) the number of faulty behavior patterns must be limited, and (4) adjoined processes are the best indicators of faulty operation.
Resilience of an embedded architecture using hardware redundancy
In the last decade, the dominance of general computing systems in the market has been displaced by embedded systems, with billions of units manufactured every year. Embedded systems appear in contexts where continuous operation is of utmost importance and the consequences of failure can be profound.
Nowadays, radiation poses a serious threat to the reliable operation of safety-critical systems. Fault avoidance techniques, such as radiation hardening, have been commonly used in space applications. However, radiation-hardened components are expensive, lag behind commercial components with regard to performance, and do not provide 100% fault elimination. Without fault tolerant mechanisms, many of these faults can become errors at the application or system level, which, in turn, can result in catastrophic failures.
In this work we study the concepts of fault tolerance and dependability and extend these concepts providing our own definition of resilience. We analyse the physics of radiation-induced faults, the damage mechanisms of particles and the process that leads to computing failures. We provide extensive taxonomies of 1) existing fault tolerant techniques and of 2) the effects of radiation in state-of-the-art electronics, analysing and comparing their characteristics. We propose a detailed model of faults and provide a classification of the different types of faults at various levels. We introduce an algorithm of fault tolerance and define the system states and actions necessary to implement it. We introduce novel hardware and system software techniques that provide a more efficient combination of reliability, performance and power consumption than existing techniques. We propose a new element of the system called syndrome that is the core of a resilient architecture whose software and hardware can adapt to reliable and unreliable environments. We implement a software simulator and disassembler and introduce a testing framework in combination with ERA’s assembler and commercial hardware simulators.