139 research outputs found

    SSIVP: Spacecraft Supercomputing Experiment for STP-H6

    Get PDF
    The Department of Defense Space Test Program (STP) provides spaceflight opportunities for conducting on-orbit research and technology demonstrations to advance the future of spacecraft. STP-H6, the next mission of the program to the International Space Station (ISS), will include a prototype spacecraft supercomputing experiment and framework, called Spacecraft Supercomputing for Image and Video Processing (SSIVP), developed at the National Science Foundation (NSF) Center for High-Performance Reconfigurable Computing (CHREC) at the University of Pittsburgh. SSIVP introduces scalable, high-performance computing (HPC) principles to a CubeSat form-factor to advance the state of the art in space computing. SSIVP adopts the CHREC Space Processor (CSP) concept, a multifaceted design philosophy for a hybrid system of commercial and radiation-hardened (rad-hard) components supplemented with fault-tolerant computing, and a hybrid processor combining fixed-logic CPU and reconfigurable-logic FPGA. SSIVP features five flight-qualified CSPv1 computers as compute nodes, to facilitate this supercomputing concept, and one μCSP smart module, for running a Gallium Nitride (GaN)-based power converter sub-experiment. SSIVP is a versatile, heterogenous platform capable of processing application workloads in the processor or on runtime-reconfigurable FPGA accelerators. In this paper, we present the flight hardware and software, frameworks for parallel and dependable computing, and mission objectives for SSIVP

    Autonomous Recovery Of Reconfigurable Logic Devices Using Priority Escalation Of Slack

    Get PDF
    Field Programmable Gate Array (FPGA) devices offer a suitable platform for survivable hardware architectures in mission-critical systems. In this dissertation, active dynamic redundancy-based fault-handling techniques are proposed which exploit the dynamic partial reconfiguration capability of SRAM-based FPGAs. Self-adaptation is realized by employing reconfiguration in detection, diagnosis, and recovery phases. To extend these concepts to semiconductor aging and process variation in the deep submicron era, resilient adaptable processing systems are sought to maintain quality and throughput requirements despite the vulnerabilities of the underlying computational devices. A new approach to autonomous fault-handling which addresses these goals is developed using only a uniplex hardware arrangement. It operates by observing a health metric to achieve Fault Demotion using Recon- figurable Slack (FaDReS). Here an autonomous fault isolation scheme is employed which neither requires test vectors nor suspends the computational throughput, but instead observes the value of a health metric based on runtime input. The deterministic flow of the fault isolation scheme guarantees success in a bounded number of reconfigurations of the FPGA fabric. FaDReS is then extended to the Priority Using Resource Escalation (PURE) online redundancy scheme which considers fault-isolation latency and throughput trade-offs under a dynamic spare arrangement. While deep-submicron designs introduce new challenges, use of adaptive techniques are seen to provide several promising avenues for improving resilience. The scheme developed is demonstrated by hardware design of various signal processing circuits and their implementation on a Xilinx Virtex-4 FPGA device. These include a Discrete Cosine Transform (DCT) core, Motion Estimation (ME) engine, Finite Impulse Response (FIR) Filter, Support Vector Machine (SVM), and Advanced Encryption Standard (AES) blocks in addition to MCNC benchmark circuits. A iii significant reduction in power consumption is achieved ranging from 83% for low motion-activity scenes to 12.5% for high motion activity video scenes in a novel ME engine configuration. For a typical benchmark video sequence, PURE is shown to maintain a PSNR baseline near 32dB. The diagnosability, reconfiguration latency, and resource overhead of each approach is analyzed. Compared to previous alternatives, PURE maintains a PSNR within a difference of 4.02dB to 6.67dB from the fault-free baseline by escalating healthy resources to higher-priority signal processing functions. The results indicate the benefits of priority-aware resiliency over conventional redundancy approaches in terms of fault-recovery, power consumption, and resource-area requirements. Together, these provide a broad range of strategies to achieve autonomous recovery of reconfigurable logic devices under a variety of constraints, operating conditions, and optimization criteria

    An Adaptive Modular Redundancy Technique to Self-regulate Availability, Area, and Energy Consumption in Mission-critical Applications

    Get PDF
    As reconfigurable devices\u27 capacities and the complexity of applications that use them increase, the need for self-reliance of deployed systems becomes increasingly prominent. A Sustainable Modular Adaptive Redundancy Technique (SMART) composed of a dual-layered organic system is proposed, analyzed, implemented, and experimentally evaluated. SMART relies upon a variety of self-regulating properties to control availability, energy consumption, and area used, in dynamically-changing environments that require high degree of adaptation. The hardware layer is implemented on a Xilinx Virtex-4 Field Programmable Gate Array (FPGA) to provide self-repair using a novel approach called a Reconfigurable Adaptive Redundancy System (RARS). The software layer supervises the organic activities within the FPGA and extends the self-healing capabilities through application-independent, intrinsic, evolutionary repair techniques to leverage the benefits of dynamic Partial Reconfiguration (PR). A SMART prototype is evaluated using a Sobel edge detection application. This prototype is shown to provide sustainability for stressful occurrences of transient and permanent fault injection procedures while still reducing energy consumption and area requirements. An Organic Genetic Algorithm (OGA) technique is shown capable of consistently repairing hard faults while maintaining correct edge detector outputs, by exploiting spatial redundancy in the reconfigurable hardware. A Monte Carlo driven Continuous Markov Time Chains (CTMC) simulation is conducted to compare SMART\u27s availability to industry-standard Triple Modular Technique (TMR) techniques. Based on nine use cases, parameterized with realistic fault and repair rates acquired from publically available sources, the results indicate that availability is significantly enhanced by the adoption of fast repair techniques targeting aging-related hard-faults. Under harsh environments, SMART is shown to improve system availability from 36.02% with lengthy repair techniques to 98.84% with fast ones. This value increases to five nines (99.9998%) under relatively more favorable conditions. Lastly, SMART is compared to twenty eight standard TMR benchmarks that are generated by the widely-accepted BL-TMR tools. Results show that in seven out of nine use cases, SMART is the recommended technique, with power savings ranging from 22% to 29%, and area savings ranging from 17% to 24%, while still maintaining the same level of availability

    Toward Biologically-Inspired Self-Healing, Resilient Architectures for Digital Instrumentation and Control Systems and Embedded Devices

    Get PDF
    Digital Instrumentation and Control (I&C) systems in safety-related applications of next generation industrial automation systems require high levels of resilience against different fault classes. One of the more essential concepts for achieving this goal is the notion of resilient and survivable digital I&C systems. In recent years, self-healing concepts based on biological physiology have received attention for the design of robust digital systems. However, many of these approaches have not been architected from the outset with safety in mind, nor have they been targeted for the automation community where a significant need exists. This dissertation presents a new self-healing digital I&C architecture called BioSymPLe, inspired from the way nature responds, defends and heals: the stem cells in the immune system of living organisms, the life cycle of the living cell, and the pathway from Deoxyribonucleic acid (DNA) to protein. The BioSymPLe architecture is integrating biological concepts, fault tolerance techniques, and operational schematics for the international standard IEC 61131-3 to facilitate adoption in the automation industry. BioSymPLe is organized into three hierarchical levels: the local function migration layer from the top side, the critical service layer in the middle, and the global function migration layer from the bottom side. The local layer is used to monitor the correct execution of functions at the cellular level and to activate healing mechanisms at the critical service level. The critical layer is allocating a group of functional B cells which represent the building block that executes the intended functionality of critical application based on the expression for DNA genetic codes stored inside each cell. The global layer uses a concept of embryonic stem cells by differentiating these type of cells to repair the faulty T cells and supervising all repair mechanisms. Finally, two industrial applications have been mapped on the proposed architecture, which are capable of tolerating a significant number of faults (transient, permanent, and hardware common cause failures CCFs) that can stem from environmental disturbances and we believe the nexus of its concepts can positively impact the next generation of critical systems in the automation industry

    Dynamic Partial Reconfiguration for Dependable Systems

    Get PDF
    Moore’s law has served as goal and motivation for consumer electronics manufacturers in the last decades. The results in terms of processing power increase in the consumer electronics devices have been mainly achieved due to cost reduction and technology shrinking. However, reducing physical geometries mainly affects the electronic devices’ dependability, making them more sensitive to soft-errors like Single Event Transient (SET) of Single Event Upset (SEU) and hard (permanent) faults, e.g. due to aging effects. Accordingly, safety critical systems often rely on the adoption of old technology nodes, even if they introduce longer design time w.r.t. consumer electronics. In fact, functional safety requirements are increasingly pushing industry in developing innovative methodologies to design high-dependable systems with the required diagnostic coverage. On the other hand commercial off-the-shelf (COTS) devices adoption began to be considered for safety-related systems due to real-time requirements, the need for the implementation of computationally hungry algorithms and lower design costs. In this field FPGA market share is constantly increased, thanks to their flexibility and low non-recurrent engineering costs, making them suitable for a set of safety critical applications with low production volumes. The works presented in this thesis tries to face new dependability issues in modern reconfigurable systems, exploiting their special features to take proper counteractions with low impacton performances, namely Dynamic Partial Reconfiguration

    Methodology of highly reliable systems design

    Get PDF
    Práce představuje alternativní metodiku k již existujícím technikám pro návrh číslicových systémů se zvýšenou spolehlivostí implementovaných do obvodů FPGA a doplňuje některé nové vlastnosti při realizaci a testování těchto systémů. Práce se opírá o využití částečné dynamické rekonfigurace obvodu FPGA při návrhu systémů odolných proti poruchám, kde může být částečná rekonfigurace využita jako mechanizmus pro opravu a zotavení systému po výskytu poruchy. Práce nejprve představuje obecné principy diagnostiky, testování a spolehlivosti číslicových systémů včetně stručného popisu programovatelných obvodů FPGA a jejich architektury. Dále pokračuje přehledem současných metod a technik při návrhu a implementaci systémů odolných proti poruchám do obvodů FPGA, kde jsou popsány zejména techniky z oblasti detekce a lokalizace poruch, opravy a posuzování kvality návrhu. Nejdůležitější částí práce je popis metodiky pro návrh, implementaci a testování systémů odolných proti poruchám, která byla vytvořena pro obvody FPGA, jejichž konfigurační paměť je založena na pamětech typu SRAM. Nejprve je prezentována technika pro vytváření a automatizované generování hlídacích obvodů pro číslicové systémy a komunikační protokoly v FPGA, následně je prezentovaná referenční architektura spolehlivého systému implementovaného do FPGA včetně několika odolných architektur využívajících principu částečné dynamické rekonfigurace jako mechanizmu opravy a zotavení po výskytu poruchy. Dále je popsán způsob řízení rekonfiguračního procesu a testovací platforma pro snadné testovaní a ověření kvality systémů odolných proti poruchám implementovaných dle navržené metodiky. V závěru jsou diskutovány experimentální výsledky a přínos práce.In the thesis, a methodology alternative to existing methods of digital systems design with increased dependability implemented into FPGA is presented, new features which can be used in the implementation and testing of these systems are demonstrated. The research is based on the use of FPGA partial dynamic reconfiguration for the design of fault tolerant systems. In these applications, the partial dynamic reconfiguration can be used as a mechanism to correct the fault and recover the system after the fault occurrence. First, the general principles of diagnostics, testing and digital systems dependability are presented including a brief description of FPGA components and their architectures. Next, a survey of currently used methods and techniques used for the design and implementation of fault tolerant systems into FPGA is described, especially the methods used for fault detection and localization, their correction, together with the principles of evaluating fault tolerant systems design quality.  The most important part of the thesis is seen in the description of the design methodology, implementation and testing of fault tolerant systems implemented into FPGAs which uses SRAMs as the configuration memory. First, the methodology of developing and automated checker components design for digital systems and communication protocols is presented. Then, a reference architecture of a dependable system implemented into FPGA is demonstrated including several fault tolerant architectures based on the use of partial dynamic reconfiguration as the mechanism of fault correction and the recovery from it. The principles of controlling the reconfiguration process are described together with the description of the test platform which allows to test and verify the design of fault tolerant systems based on the methodology presented in the thesis. The experimental results and the contribution of the thesis are discussed in the conclusions.

    Using embedded hardware monitor cores in critical computer systems

    Get PDF
    The integration of FPGA devices in many different architectures and services makes monitoring and real time detection of errors an important concern in FPGA system design. A monitor is a tool, or a set of tools, that facilitate analytic measurements in observing a given system. The goal of these observations is usually the performance analysis and optimisation, or the surveillance of the system. However, System-on-Chip (SoC) based designs leave few points to attach external tools such as logic analysers. Thus, an embedded error detection core that allows observation of critical system nodes (such as processor cores and buses) should enforce the operation of the FPGA-based system, in order to prevent system failures. The core should not interfere with system performance and must ensure timely detection of errors. This thesis is an investigation onto how a robust hardware-monitoring module can be efficiently integrated in a target PCI board (with FPGA-based application processing features) which is part of a critical computing system. [Continues.

    Injecting Intermittent Faults for the Dependability Assessment of a Fault-Tolerant Microcomputer System

    Full text link
    © 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.As scaling is more and more aggressive, intermittent faults are increasing their importance in current deep submicron complementary metal-oxide-semiconductor (CMOS) technologies. This work shows the dependability assessment of a fault-tol- erant computer system against intermittent faults. The applied methodology lies in VHDL-based fault injection, which allows the assessment in early design phases, together with a high level of observability and controllability. The evaluated system is a duplex microcontroller system with cold stand-by sparing. A wide set of intermittent fault models have been injected, and from the simulation traces, coverages and latencies have been measured. Markov models for this system have been generated and some dependability functions, such as reliability and safety, have been calculated. From these results, some enhancements of detection and recovery mechanisms have been suggested. The methodology presented is general to any fault-tolerant computer system.This work was supported in part by the Universitat Politecnica de Valencia under the Research Project SP20120806, and in part by the Spanish Government under the Research Project TIN2012-38308-C02-01. Associate Editor: J. Shortle.Gil Tomás, DA.; Gracia Morán, J.; Baraza Calvo, JC.; Saiz Adalid, LJ.; Gil Vicente, PJ. (2016). Injecting Intermittent Faults for the Dependability Assessment of a Fault-Tolerant Microcomputer System. IEEE Transactions on Reliability. 65(2):648-661. https://doi.org/10.1109/TR.2015.2484058S64866165
    corecore