63 research outputs found
RASA: Reliability-Aware Scheduling Approach for FPGA-Based Resilient Embedded Systems in Extreme Environments
Field-programmable gate arrays (FPGAs) offer the flexibility of general-purpose processors along with the performance efficiency of dedicated hardware that essentially renders it as a platform of choice for modern-day robotic systems for achieving real-time performance. Such robotic systems when deployed in harsh environments often get plagued by faults due to extreme conditions. Consequently, the real-time applications running on FPGA become susceptible to errors which call for a reliability-aware task scheduling approach, the focus of this article. We attempt to address this challenge using a hybrid offline-online approach. Given a set of periodic real-time tasks that require to be executed, the offline component generates a feasible preemptive schedule with specific preemption points. At runtime, these preemption events are utilized for fault detection. Upon detecting any faulty execution at such distinct points, the reliability-aware scheduling approach, RASA, orchestrates the recovery mechanism to remediate the scenario without jeopardizing the predefined schedule. Effectiveness of the proposed strategy has been verified through simulation-based experiments and we observed that the RASA is able to achieve 72% of task acceptance rate even under 70% of system workloads with high fault occurrence rates
Multi-core devices for safety-critical systems: a survey
Multi-core devices are envisioned to support the development of next-generation safety-critical systems, enabling the on-chip integration of functions of different criticality. This integration provides multiple system-level potential benefits such as cost, size, power, and weight reduction. However, safety certification becomes a challenge and several fundamental safety technical requirements must be addressed, such as temporal and spatial independence, reliability, and diagnostic coverage. This survey provides a categorization and overview at different device abstraction levels (nanoscale, component, and device) of selected key research contributions that support the compliance with these fundamental safety requirements.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under grant TIN2015-65316-P, Basque Government under grant KK-2019-00035 and the HiPEAC Network of Excellence. The Spanish Ministry of Economy and Competitiveness has also partially supported Jaume Abella under Ramon y Cajal postdoctoral fellowship (RYC-2013-14717).Peer ReviewedPostprint (author's final draft
DeSyRe: on-Demand System Reliability
The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect and fault-free system, would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault-tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints
Fault Tolerant Nanosatellite Computing on a Budget
In this contribution, we present a CubeSat-compatible on-board computer (OBC) architecture that offers strong fault tolerance to enable the use of such spacecraft in critical and long-term missions. We describe in detail the design of our OBC’s breadboard setup, and document its composition from the component-level, all the way down to the software level. Fault tolerance in this OBC is achieved without resorting to radiation hardening, just intelligent through software. The OBC ages graceful, and makes use of FPGA-reconfiguration and mixed criticality. It can dynamically adapt to changing performance requirements throughout a space mission.
We developed a proof-of-concept with several Xilinx Ultrascale and Ultrascale+ FPGAs. With the smallest Kintex Ultrascale+ KU3P device, we achieve 1.94W total power consumption at 300Mhz, well within the power budget range of current 2U CubeSats. To our knowledge, this is the first scalable and COTS-based, widely reproducible OBC solution which can offer strong fault coverage even for small CubeSats. To reproduce this OBC architecture, no custom-written, proprietary, or protected IP is needed, and the needed design tools are available free-of-charge to academics. All COTS components required to construct this architecture can be purchased on the open market, and are affordable even for academic and scientific CubeSat developers
A Survey of Research into Mixed Criticality Systems
This survey covers research into mixed criticality systems that has been published since Vestal’s seminal paper in 2007, up until the end of 2016. The survey is organised along the lines of the major research areas within this topic. These include single processor analysis (including fixed priority and EDF scheduling, shared resources and static and synchronous scheduling), multiprocessor analysis, realistic models, and systems issues. The survey also explores the relationship between research into mixed criticality systems and other topics such as hard and soft time constraints, fault tolerant scheduling, hierarchical scheduling, cyber physical systems, probabilistic real-time systems, and industrial safety standards
Dynamic partial reconfiguration management for high performance and reliability in FPGAs
Modern Field-Programmable Gate Arrays (FPGAs) are no longer used to implement
small “glue logic” circuitries. The high-density of reconfigurable logic resources in
today’s FPGAs enable the implementation of large systems in a single chip. FPGAs
are highly flexible devices; their functionality can be altered by simply loading a new
binary file in their configuration memory. While the flexibility of FPGAs is
comparable to General-Purpose Processors (GPPs), in the sense that different
functions can be performed using the same hardware, the performance gain that can
be achieved using FPGAs can be orders of magnitudes higher as FPGAs offer the
ability for customisation of parallel computational architectures.
Dynamic Partial Reconfiguration (DPR) allows for changing the functionality of
certain blocks on the chip while the rest of the FPGA is operational. DPR has
sparked the interest of researchers to explore new computational platforms where
computational tasks are off-loaded from a main CPU to be executed using dedicated
reconfigurable hardware accelerators configured on demand at run-time. By having a
battery of custom accelerators which can be swapped in and out of the FPGA at runtime,
a higher computational density can be achieved compared to static systems
where the accelerators are bound to fixed locations within the chip. Furthermore, the
ability of relocating these accelerators across several locations on the chip allows for
the implementation of adaptive systems which can mitigate emerging faults in the
FPGA chip when operating in harsh environments. By porting the appropriate fault
mitigation techniques in such computational platforms, the advantages of FPGAs can
be harnessed in different applications in space and military electronics where FPGAs
are usually seen as unreliable devices due to their sensitivity to radiation and extreme
environmental conditions.
In light of the above, this thesis investigates the deployment of DPR as: 1) a method
for enhancing performance by efficient exploitation of the FPGA resources, and 2) a
method for enhancing the reliability of systems intended to operate in harsh
environments. Achieving optimal performance in such systems requires an efficient
internal configuration management system to manage the reconfiguration and
execution of the reconfigurable modules in the FPGA. In addition, the system needs
to support “fault-resilience” features by integrating parameterisable fault detection
and recovery capabilities to meet the reliability standard of fault-tolerant
applications. This thesis addresses all the design and implementation aspects of an
Internal Configuration Manger (ICM) which supports a novel bitstream relocation
model to enable the placement of relocatable accelerators across several locations on
the FPGA chip. In addition to supporting all the configuration capabilities required to
implement a Reconfigurable Operating System (ROS), the proposed ICM also
supports the novel multiple-clone configuration technique which allows for cloning
several instances of the same hardware accelerator at the same time resulting in much
shorter configuration time compared to traditional configuration techniques. A faulttolerant
(FT) version of the proposed ICM which supports a comprehensive faultrecovery
scheme is also introduced in this thesis. The proposed FT-ICM is designed
with a much smaller area footprint compared to Triple Modular Redundancy (TMR)
hardening techniques while keeping a comparable level of fault-resilience.
The capabilities of the proposed ICM system are demonstrated with two novel
applications. The first application demonstrates a proof-of-concept reliable FPGA
server solution used for executing encryption/decryption queries. The proposed
server deploys bitstream relocation and modular redundancy to mitigate both
permanent and transient faults in the device. It also deploys a novel Built-In Self-
Test (BIST) diagnosis scheme, specifically designed to detect emerging permanent
faults in the system at run-time. The second application is a data mining application
where DPR is used to increase the computational density of a system used to
implement the Frequent Itemset Mining (FIM) problem
Towards the development of a reliable reconfigurable real-time operating system on FPGAs
In the last two decades, Field Programmable Gate Arrays (FPGAs) have been
rapidly developed from simple “glue-logic” to a powerful platform capable of
implementing a System on Chip (SoC). Modern FPGAs achieve not only the high
performance compared with General Purpose Processors (GPPs), thanks to hardware
parallelism and dedication, but also better programming flexibility, in comparison to
Application Specific Integrated Circuits (ASICs). Moreover, the hardware
programming flexibility of FPGAs is further harnessed for both performance and
manipulability, which makes Dynamic Partial Reconfiguration (DPR) possible. DPR
allows a part or parts of a circuit to be reconfigured at run-time, without interrupting
the rest of the chip’s operation. As a result, hardware resources can be more
efficiently exploited since the chip resources can be reused by swapping in or out
hardware tasks to or from the chip in a time-multiplexed fashion. In addition, DPR
improves fault tolerance against transient errors and permanent damage, such as
Single Event Upsets (SEUs) can be mitigated by reconfiguring the FPGA to avoid
error accumulation. Furthermore, power and heat can be reduced by removing
finished or idle tasks from the chip. For all these reasons above, DPR has
significantly promoted Reconfigurable Computing (RC) and has become a very hot
topic. However, since hardware integration is increasing at an exponential rate, and
applications are becoming more complex with the growth of user demands, highlevel
application design and low-level hardware implementation are increasingly
separated and layered. As a consequence, users can obtain little advantage from DPR
without the support of system-level middleware.
To bridge the gap between the high-level application and the low-level hardware
implementation, this thesis presents the important contributions towards a Reliable,
Reconfigurable and Real-Time Operating System (R3TOS), which facilitates the
user exploitation of DPR from the application level, by managing the complex
hardware in the background. In R3TOS, hardware tasks behave just like software
tasks, which can be created, scheduled, and mapped to different computing resources
on the fly. The novel contributions of this work are: 1) a novel implementation of an efficient task scheduler and allocator; 2) implementation of a novel real-time
scheduling algorithm (FAEDF) and two efficacious allocating algorithms (EAC and
EVC), which schedule tasks in real-time and circumvent emerging faults while
maintaining more compact empty areas. 3) Design and implementation of a faulttolerant
microprocessor by harnessing the existing FPGA resources, such as Error
Correction Code (ECC) and configuration primitives. 4) A novel symmetric
multiprocessing (SMP)-based architectures that supports shared memory programing
interface. 5) Two demonstrations of the integrated system, including a) the K-Nearest
Neighbour classifier, which is a non-parametric classification algorithm widely used
in various fields of data mining; and b) pairwise sequence alignment, namely the
Smith Waterman algorithm, used for identifying similarities between two biological
sequences.
R3TOS gives considerably higher flexibility to support scalable multi-user, multitasking
applications, whereby resources can be dynamically managed in respect of
user requirements and hardware availability. Benefiting from this, not only the
hardware resources can be more efficiently used, but also the system performance
can be significantly increased. Results show that the scheduling and allocating
efficiencies have been improved up to 2x, and the overall system performance is
further improved by ~2.5x. Future work includes the development of Network on
Chip (NoC), which is expected to further increase the communication throughput; as
well as the standardization and automation of our system design, which will be
carried out in line with the enablement of other high-level synthesis tools, to allow
application developers to benefit from the system in a more efficient manner
Design Space Exploration and Resource Management of Multi/Many-Core Systems
The increasing demand of processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips as they facilitate parallel processing. However, there is a desire for these platforms to be energy-efficient and reliable, and they need to perform secure computations for the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends
Development and Qualification of an FPGA-Based Multi-Processor System-on-Chip On-Board Computer for LEO Satellites
九州工業大学博士学位論文 学位記番号:工博甲第374号 学位授与年月日:平成26年9月26日Chapter 1: Introduction||Chapter 2: Background and Literature Review||Chapter 3: Multi-Processor System-on-Chip On-Borad Computer Design||Chapter 4: Space and Time Redundancy Trade-offs||Chapter 5: Radiation and Fault Injection Testing||Chapter 6: Thermal Vacuum Testing||Chapter 7: Results and Discussion||Chapter 8: Conclusion and Future Perspectives||ReferencesDeveloping small satellites for scientific and commercial purposes is emerging rapidly in the last decade. The future is still expected to carry more challenging services and designs to fulfill the growing needs for space based services. Nevertheless, there exists a big challenge in developing cost effective and highly efficient small satellites yet with accepted reliability and power consumption that is adequate to the mission capabilities. This challenge mandates the use of the recent developments in digital design techniques and technologies to strike the required balance between the four basic parameters: 1) Cost, 2) Performance, 3) Reliability and 4) Power consumption. This balance becomes even more stringent and harder to
reach when the satellite mass reduces significantly. Mass reduction puts strict constraints on the power system in terms of the solar panels and the batteries. That fact creates the need to miniaturize the design of the subsystems as much as possible which can be viewed as the fifth parameter in the design balance dilemma.
At Kyuhsu Institute of Technology-Japan we are investigating the use of SRAMbased Field Programmable Gate Arrays (FPGA) in building: 1) High performance, 2)Low cost, 3) Moderate power consumption and 4) Highly reliable Muti-Processor System-on-Chip (MPSoC) On-Board Computers (OBC) for future space missions and applications. This research tries to investigate how commercial grade SRAMbased FPGAs would perform in space and how to mitigate them against the space environment. Our methodology to answer that question depended on following formal design procedure for the OBC according to the space environment requirements then qualifying the design through extensive testing. We developed the
MPSoC OBC with 4 complete embedded processor systems. The Inter Processor Communication (IPC) takes place through hardware First-In-First-Out (FIFO) mailboxes. One processor acts as the system master controller which monitors the operation and controls the reset and restore of the system in case of faults and the other three processors form Triple Modular Redundancy (TMR) fault tolerance architecture with each other. We used Dynamic Partial Reconfiguration (DPR) in scrubbing the configuration memory frames and correcting the faults that might exist. The system is implemented using a Virtex-5 LX50 commercial grade FPGA from Xilinx. The research also qualifies the design in the ground-simulated space environment conditions. We tested the implemented MPSoC OBC in Thermal Vacuum Chambers (TVC) at the Center of Nano-Satellite Testing (CeNT) at Kyushu Institute of Technology. Also we irradiated the design with proton accelerated beam at 65 MeV with fluxes of 10e06 and 3e06 particle/cm2/sec at the Takasaki Advanced Radiation Research Institute (TARRI).
The TVC test results showed that the FPGA design exceeded the limits of normal operation for the commercial grade package at about 105 C°. Therefore, we mitigated the package using: 1) heat sink, 2) dynamic temperature management through operating frequency reduction from 100 MHz to 50 MHz and 3) reconfiguration to reduce the number of working processors to 2 instead of 4 by replacing the spaceredundancy TMR with time-redundancy TMR during the sunlight section of the orbit. The mitigation proved to be efficient and it even reduced the temperature from 105 C° to about 66 C° when the heat sink, frequency reduction, and reconfiguration techniques were used together.
The radiation and the fault injection tests showed that mitigating the FPGA configuration frames through scrubbing are efficient when Single Bit Upsets (SBU) are recorded. Multiple Bit Upsets (MBU) are not well mitigated using the scrubbing with Single Error Correction Double Error Detection (SECDED) technique and the FPGA needs to be totally reset and reloaded when MBUs are detected in its configuration frames. However, as MBUs occurrence in space is very seldom and rare compared to SBUs, we consider that SECDED scrubbing is very efficient in decreasing the soft error rate and increasing the reliability of having error-free bitstreams. The reliability was proven to be at 0.9999 when the scrubbing rate was continuous at a period of 7.1 msec between complete scans of the FPGA bitstream. In the proton radiation tests we managed to develop a new technique to estimate the static cross section using internal scrubbing only without using external monitoring, control and scrubbing device. Fault injection was used to estimate the dynamic cross section in a cost effective alternative for estimating it through radiation test.
The research proved through detailed testing that the 65 nm commercial grade SRAM-based FPGA can be used in future space missions. The MPSoC OBC design achieved an adequate balance between the performance, power, mass, and reliability requirements. Extensive testing and applying carefully crafted mitigation techniques were the key points to verify and validate the MPSoC OBC design. In-orbit validation through a scientific demonstration mission would be the next step for the future research
- …