211 research outputs found

    Fault-tolerant fpga for mission-critical applications.

    Get PDF
    One of the devices that play a great role in electronic circuits design, specifically safety-critical design applications, is Field programmable Gate Arrays (FPGAs). This is because of its high performance, re-configurability and low development cost. FPGAs are used in many applications such as data processing, networks, automotive, space and industrial applications. Negative impacts on the reliability of such applications result from moving to smaller feature sizes in the latest FPGA architectures. This increases the need for fault-tolerant techniques to improve reliability and extend system lifetime of FPGA-based applications. In this thesis, two fault-tolerant techniques for FPGA-based applications are proposed with a built-in fault detection region. A low cost fault detection scheme is proposed for detecting faults using the fault detection region used in both schemes. The fault detection scheme primarily detects open faults in the programmable interconnect resources in the FPGAs. In addition, Stuck-At faults and Single Event Upsets (SEUs) fault can be detected. For fault recovery, each scheme has its own fault recovery approach. The first approach uses a spare module and a 2-to-1 multiplexer to recover from any fault detected. On the other hand, the second approach recovers from any fault detected using the property of Partial Reconfiguration (PR) in the FPGAs. It relies on identifying a Partially Reconfigurable block (P_b) in the FPGA that is used in the recovery process after the first faulty module is identified in the system. This technique uses only one location to recover from faults in any of the FPGA’s modules and the FPGA interconnects. Simulation results show that both techniques can detect and recover from open faults. In addition, Stuck-At faults and Single Event Upsets (SEUs) fault can also be detected. Finally, both techniques require low area overhead

    A new architecture for single-event upset detection & reconfiguration of SRAM-based FPGAs

    Get PDF
    Field Programmable Gate Arrays (FPGA) are used in a variety of applications, ranging from consumer electronics to devices in spacecrafts because of their flexibility in achieving requirements such as low cost, high performance, and fast turnaround. SRAM-based FPGAs can experience single bit flips in the configuration memory due to high-energy neutrons or alpha particles hitting critical nodes in the SRAM cells, by transferring enough energy to effect the change. High energy particles can be emitted by cosmic radiation or traces of radioactive elements in device packaging. The result of this could range from unwanted functional or data modification, data loss in the system, to damage to the cell where the charged particle makes impact. This phenomenon is known as a Single Event Upset (SEU) and makes fault tolerance a critical requirement in FPGA design. This research proposes a shift in architecture from current SRAM-based FPGAs such as Xilinx Virtex. The proposed architecture includes an inherent SEU detection through parity checking of the configuration memory. The inherent SEU detection sets a syndrome flag when an odd number of bit flips occur within a data frame of the configuration memory. To correct a fault, the FPGA the affected data frame is partially reconfigured. Existing and proposed solutions include: Triple Modular Redundancy (TMR) systems; readback and compare the configuration memory; and periodically reprogramming the entire configuration memory, also known as scrubbing. The advantages afforded by the proposed architecture over existing solutions include: faster error detection and correction latency over the readback method and better area and power overhead over TMR

    An Adaptive Modular Redundancy Technique to Self-regulate Availability, Area, and Energy Consumption in Mission-critical Applications

    Get PDF
    As reconfigurable devices\u27 capacities and the complexity of applications that use them increase, the need for self-reliance of deployed systems becomes increasingly prominent. A Sustainable Modular Adaptive Redundancy Technique (SMART) composed of a dual-layered organic system is proposed, analyzed, implemented, and experimentally evaluated. SMART relies upon a variety of self-regulating properties to control availability, energy consumption, and area used, in dynamically-changing environments that require high degree of adaptation. The hardware layer is implemented on a Xilinx Virtex-4 Field Programmable Gate Array (FPGA) to provide self-repair using a novel approach called a Reconfigurable Adaptive Redundancy System (RARS). The software layer supervises the organic activities within the FPGA and extends the self-healing capabilities through application-independent, intrinsic, evolutionary repair techniques to leverage the benefits of dynamic Partial Reconfiguration (PR). A SMART prototype is evaluated using a Sobel edge detection application. This prototype is shown to provide sustainability for stressful occurrences of transient and permanent fault injection procedures while still reducing energy consumption and area requirements. An Organic Genetic Algorithm (OGA) technique is shown capable of consistently repairing hard faults while maintaining correct edge detector outputs, by exploiting spatial redundancy in the reconfigurable hardware. A Monte Carlo driven Continuous Markov Time Chains (CTMC) simulation is conducted to compare SMART\u27s availability to industry-standard Triple Modular Technique (TMR) techniques. Based on nine use cases, parameterized with realistic fault and repair rates acquired from publically available sources, the results indicate that availability is significantly enhanced by the adoption of fast repair techniques targeting aging-related hard-faults. Under harsh environments, SMART is shown to improve system availability from 36.02% with lengthy repair techniques to 98.84% with fast ones. This value increases to five nines (99.9998%) under relatively more favorable conditions. Lastly, SMART is compared to twenty eight standard TMR benchmarks that are generated by the widely-accepted BL-TMR tools. Results show that in seven out of nine use cases, SMART is the recommended technique, with power savings ranging from 22% to 29%, and area savings ranging from 17% to 24%, while still maintaining the same level of availability

    Fault Tolerant Nanosatellite Computing on a Budget

    Get PDF
    In this contribution, we present a CubeSat-compatible on-board computer (OBC) architecture that offers strong fault tolerance to enable the use of such spacecraft in critical and long-term missions. We describe in detail the design of our OBC’s breadboard setup, and document its composition from the component-level, all the way down to the software level. Fault tolerance in this OBC is achieved without resorting to radiation hardening, just intelligent through software. The OBC ages graceful, and makes use of FPGA-reconfiguration and mixed criticality. It can dynamically adapt to changing performance requirements throughout a space mission. We developed a proof-of-concept with several Xilinx Ultrascale and Ultrascale+ FPGAs. With the smallest Kintex Ultrascale+ KU3P device, we achieve 1.94W total power consumption at 300Mhz, well within the power budget range of current 2U CubeSats. To our knowledge, this is the first scalable and COTS-based, widely reproducible OBC solution which can offer strong fault coverage even for small CubeSats. To reproduce this OBC architecture, no custom-written, proprietary, or protected IP is needed, and the needed design tools are available free-of-charge to academics. All COTS components required to construct this architecture can be purchased on the open market, and are affordable even for academic and scientific CubeSat developers

    Autonomous Recovery Of Reconfigurable Logic Devices Using Priority Escalation Of Slack

    Get PDF
    Field Programmable Gate Array (FPGA) devices offer a suitable platform for survivable hardware architectures in mission-critical systems. In this dissertation, active dynamic redundancy-based fault-handling techniques are proposed which exploit the dynamic partial reconfiguration capability of SRAM-based FPGAs. Self-adaptation is realized by employing reconfiguration in detection, diagnosis, and recovery phases. To extend these concepts to semiconductor aging and process variation in the deep submicron era, resilient adaptable processing systems are sought to maintain quality and throughput requirements despite the vulnerabilities of the underlying computational devices. A new approach to autonomous fault-handling which addresses these goals is developed using only a uniplex hardware arrangement. It operates by observing a health metric to achieve Fault Demotion using Recon- figurable Slack (FaDReS). Here an autonomous fault isolation scheme is employed which neither requires test vectors nor suspends the computational throughput, but instead observes the value of a health metric based on runtime input. The deterministic flow of the fault isolation scheme guarantees success in a bounded number of reconfigurations of the FPGA fabric. FaDReS is then extended to the Priority Using Resource Escalation (PURE) online redundancy scheme which considers fault-isolation latency and throughput trade-offs under a dynamic spare arrangement. While deep-submicron designs introduce new challenges, use of adaptive techniques are seen to provide several promising avenues for improving resilience. The scheme developed is demonstrated by hardware design of various signal processing circuits and their implementation on a Xilinx Virtex-4 FPGA device. These include a Discrete Cosine Transform (DCT) core, Motion Estimation (ME) engine, Finite Impulse Response (FIR) Filter, Support Vector Machine (SVM), and Advanced Encryption Standard (AES) blocks in addition to MCNC benchmark circuits. A iii significant reduction in power consumption is achieved ranging from 83% for low motion-activity scenes to 12.5% for high motion activity video scenes in a novel ME engine configuration. For a typical benchmark video sequence, PURE is shown to maintain a PSNR baseline near 32dB. The diagnosability, reconfiguration latency, and resource overhead of each approach is analyzed. Compared to previous alternatives, PURE maintains a PSNR within a difference of 4.02dB to 6.67dB from the fault-free baseline by escalating healthy resources to higher-priority signal processing functions. The results indicate the benefits of priority-aware resiliency over conventional redundancy approaches in terms of fault-recovery, power consumption, and resource-area requirements. Together, these provide a broad range of strategies to achieve autonomous recovery of reconfigurable logic devices under a variety of constraints, operating conditions, and optimization criteria

    Analyse und Erweiterung eines fehler-toleranten NoC für SRAM-basierte FPGAs in Weltraumapplikationen

    Get PDF
    Data Processing Units for scientific space mission need to process ever higher volumes of data and perform ever complex calculations. But the performance of available space-qualified general purpose processors is just in the lower three digit megahertz range, which is already insufficient for some applications. As an alternative, suitable processing steps can be implemented in hardware on a space-qualified SRAM-based FPGA. However, suitable devices are susceptible against space radiation. At the Institute for Communication and Network Engineering a fault-tolerant, network-based communication architecture was developed, which enables the construction of processing chains on the basis of different processing modules within suitable SRAM-based FPGAs and allows the exchange of single processing modules during runtime, too. The communication architecture and its protocol shall isolate non SEU mitigated or just partial SEU mitigated modules affected by radiation-induced faults to prohibit the propagation of errors within the remaining System-on-Chip. In the context of an ESA study, this communication architecture was extended with further components and implemented in a representative hardware platform. Based on the acquired experiences during the study, this work analyses the actual fault-tolerance characteristics as well as weak points of this initial implementation. At appropriate locations, the communication architecture was extended with mechanisms for fault-detection and fault-differentiation as well as with a hardware-based monitoring solution. Both, the former measures and the extension of the employed hardware-platform with selective fault-injection capabilities for the emulation of radiation-induced faults within critical areas of a non SEU mitigated processing module, are used to evaluate the effects of radiation-induced faults within the communication architecture. By means of the gathered results, further measures to increase fast detection and isolation of faulty nodes are developed, selectively implemented and verified. In particular, the ability of the communication architecture to isolate network nodes without SEU mitigation could be significantly improved.Instrumentenrechner für wissenschaftliche Weltraummissionen müssen ein immer höheres Datenvolumen verarbeiten und immer komplexere Berechnungen ausführen. Die Performanz von verfügbaren qualifizierten Universalprozessoren liegt aber lediglich im unteren dreistelligen Megahertz-Bereich, was für einige Anwendungen bereits nicht mehr ausreicht. Als Alternative bietet sich die Implementierung von entsprechend geeigneten Datenverarbeitungsschritten in Hardware auf einem qualifizierten SRAM-basierten FPGA an. Geeignete Bausteine sind jedoch empfindlich gegenüber der Strahlungsumgebung im Weltraum. Am Institut für Datentechnik und Kommunikationsnetze wurde eine fehlertolerante netzwerk-basierte Kommunikationsarchitektur entwickelt, die innerhalb eines geeigneten SRAM-basierten FPGAs Datenverarbeitungsmodule miteinander nach Bedarf zu Verarbeitungsketten verbindet, sowie den Austausch von einzelnen Modulen im Betrieb ermöglicht. Nicht oder nur partiell SEU mitigierte Module sollen bei strahlungsbedingten Fehlern im Modul durch das Protokoll und die Fehlererkennungsmechanismen der Kommunikationsarchitektur isoliert werden, um ein Ausbreiten des Fehlers im restlichen System-on-Chip zu verhindern. Im Kontext einer ESA Studie wurde diese Kommunikationsarchitektur um Komponenten erweitert und auf einer repräsentativen Hardwareplattform umgesetzt. Basierend auf den gesammelten Erfahrungen aus der Studie, wird in dieser Arbeit eine Analyse der tatsächlichen Fehlertoleranz-Eigenschaften sowie der Schwachstellen dieser ursprünglichen Implementierung durchgeführt. Die Kommunikationsarchitektur wurde an geeigneten Stellen um Fehlerdetektierungs- und Fehlerunterscheidungsmöglichkeiten erweitert, sowie um eine hardwarebasierte Überwachung ergänzt. Sowohl diese Maßnahmen, als auch die Erweiterung der Hardwareplattform um gezielte Fehlerinjektions-Möglichkeiten zum Emulieren von strahlungsinduzierten Fehlern in kritischen Komponenten eines nicht SEU mitigierten Prozessierungsmoduls werden genutzt, um die tatsächlichen auftretenden Effekte in der Kommunikationsarchitektur zu evaluieren. Anhand der Ergebnisse werden weitere Verbesserungsmaßnahmen speziell zur schnellen Detektierung und Isolation von fehlerhaften Knoten erarbeitet, selektiv implementiert und verifiziert. Insbesondere die Fähigkeit, fehlerhafte, nicht SEU mitigierte Netzwerkknoten innerhalb der Kommunikationsarchitektur zu isolieren, konnte dabei deutlich verbessert werden

    Proxy Circuits for Fault-Tolerant Primitive Interfacing in Reconfigurable Devices Targeting Extreme Environments

    Get PDF
    Continuous interface access to device-level primitives in reconfigurable devices in extreme environments is key to reliable operation. However, it is possible for a primitive's interface controller, which is static to be rendered non-operational by a permanent damage in the controller's circuitry. In order to mitigate this, this paper proposes the use of relocatable proxy circuits to provide remote interfacing capability to primitives from anywhere on a reconfigurable device. A demonstration with device register read controller shows that an improvement in fault-tolerance can be achieved

    Enhancing Real-time Embedded Image Processing Robustness on Reconfigurable Devices for Critical Applications

    Get PDF
    Nowadays, image processing is increasingly used in several application fields, such as biomedical, aerospace, or automotive. Within these fields, image processing is used to serve both non-critical and critical tasks. As example, in automotive, cameras are becoming key sensors in increasing car safety, driving assistance and driving comfort. They have been employed for infotainment (non-critical), as well as for some driver assistance tasks (critical), such as Forward Collision Avoidance, Intelligent Speed Control, or Pedestrian Detection. The complexity of these algorithms brings a challenge in real-time image processing systems, requiring high computing capacity, usually not available in processors for embedded systems. Hardware acceleration is therefore crucial, and devices such as Field Programmable Gate Arrays (FPGAs) best fit the growing demand of computational capabilities. These devices can assist embedded processors by significantly speeding-up computationally intensive software algorithms. Moreover, critical applications introduce strict requirements not only from the real-time constraints, but also from the device reliability and algorithm robustness points of view. Technology scaling is highlighting reliability problems related to aging phenomena, and to the increasing sensitivity of digital devices to external radiation events that can cause transient or even permanent faults. These faults can lead to wrong information processed or, in the worst case, to a dangerous system failure. In this context, the reconfigurable nature of FPGA devices can be exploited to increase the system reliability and robustness by leveraging Dynamic Partial Reconfiguration features. The research work presented in this thesis focuses on the development of techniques for implementing efficient and robust real-time embedded image processing hardware accelerators and systems for mission-critical applications. Three main challenges have been faced and will be discussed, along with proposed solutions, throughout the thesis: (i) achieving real-time performances, (ii) enhancing algorithm robustness, and (iii) increasing overall system's dependability. In order to ensure real-time performances, efficient FPGA-based hardware accelerators implementing selected image processing algorithms have been developed. Functionalities offered by the target technology, and algorithm's characteristics have been constantly taken into account while designing such accelerators, in order to efficiently tailor algorithm's operations to available hardware resources. On the other hand, the key idea for increasing image processing algorithms' robustness is to introduce self-adaptivity features at algorithm level, in order to maintain constant, or improve, the quality of results for a wide range of input conditions, that are not always fully predictable at design-time (e.g., noise level variations). This has been accomplished by measuring at run-time some characteristics of the input images, and then tuning the algorithm parameters based on such estimations. Dynamic reconfiguration features of modern reconfigurable FPGA have been extensively exploited in order to integrate run-time adaptivity into the designed hardware accelerators. Tools and methodologies have been also developed in order to increase the overall system dependability during reconfiguration processes, thus providing safe run-time adaptation mechanisms. In addition, taking into account the target technology and the environments in which the developed hardware accelerators and systems may be employed, dependability issues have been analyzed, leading to the development of a platform for quickly assessing the reliability and characterizing the behavior of hardware accelerators implemented on reconfigurable FPGAs when they are affected by such faults

    Dynamic partial reconfiguration management for high performance and reliability in FPGAs

    Get PDF
    Modern Field-Programmable Gate Arrays (FPGAs) are no longer used to implement small “glue logic” circuitries. The high-density of reconfigurable logic resources in today’s FPGAs enable the implementation of large systems in a single chip. FPGAs are highly flexible devices; their functionality can be altered by simply loading a new binary file in their configuration memory. While the flexibility of FPGAs is comparable to General-Purpose Processors (GPPs), in the sense that different functions can be performed using the same hardware, the performance gain that can be achieved using FPGAs can be orders of magnitudes higher as FPGAs offer the ability for customisation of parallel computational architectures. Dynamic Partial Reconfiguration (DPR) allows for changing the functionality of certain blocks on the chip while the rest of the FPGA is operational. DPR has sparked the interest of researchers to explore new computational platforms where computational tasks are off-loaded from a main CPU to be executed using dedicated reconfigurable hardware accelerators configured on demand at run-time. By having a battery of custom accelerators which can be swapped in and out of the FPGA at runtime, a higher computational density can be achieved compared to static systems where the accelerators are bound to fixed locations within the chip. Furthermore, the ability of relocating these accelerators across several locations on the chip allows for the implementation of adaptive systems which can mitigate emerging faults in the FPGA chip when operating in harsh environments. By porting the appropriate fault mitigation techniques in such computational platforms, the advantages of FPGAs can be harnessed in different applications in space and military electronics where FPGAs are usually seen as unreliable devices due to their sensitivity to radiation and extreme environmental conditions. In light of the above, this thesis investigates the deployment of DPR as: 1) a method for enhancing performance by efficient exploitation of the FPGA resources, and 2) a method for enhancing the reliability of systems intended to operate in harsh environments. Achieving optimal performance in such systems requires an efficient internal configuration management system to manage the reconfiguration and execution of the reconfigurable modules in the FPGA. In addition, the system needs to support “fault-resilience” features by integrating parameterisable fault detection and recovery capabilities to meet the reliability standard of fault-tolerant applications. This thesis addresses all the design and implementation aspects of an Internal Configuration Manger (ICM) which supports a novel bitstream relocation model to enable the placement of relocatable accelerators across several locations on the FPGA chip. In addition to supporting all the configuration capabilities required to implement a Reconfigurable Operating System (ROS), the proposed ICM also supports the novel multiple-clone configuration technique which allows for cloning several instances of the same hardware accelerator at the same time resulting in much shorter configuration time compared to traditional configuration techniques. A faulttolerant (FT) version of the proposed ICM which supports a comprehensive faultrecovery scheme is also introduced in this thesis. The proposed FT-ICM is designed with a much smaller area footprint compared to Triple Modular Redundancy (TMR) hardening techniques while keeping a comparable level of fault-resilience. The capabilities of the proposed ICM system are demonstrated with two novel applications. The first application demonstrates a proof-of-concept reliable FPGA server solution used for executing encryption/decryption queries. The proposed server deploys bitstream relocation and modular redundancy to mitigate both permanent and transient faults in the device. It also deploys a novel Built-In Self- Test (BIST) diagnosis scheme, specifically designed to detect emerging permanent faults in the system at run-time. The second application is a data mining application where DPR is used to increase the computational density of a system used to implement the Frequent Itemset Mining (FIM) problem
    corecore