4 research outputs found
Dependability modeling and optimization of triple modular redundancy partitioning for SRAM-based FPGAs
SRAM-based FPGAs are popular in the aerospace industry for their field
programmability and low cost. However, they suffer from cosmic
radiation-induced Single Event Upsets (SEUs). Triple Modular Redundancy (TMR)
is a well-known technique to mitigate SEUs in FPGAs that is often used with
another SEU mitigation technique known as configuration scrubbing. Traditional
TMR provides protection against a single fault at a time, while partitioned TMR
provides improved reliability and availability. In this paper, we present a
methodology to analyze TMR partitioning at early design stage using
probabilistic model checking. The proposed formal model can capture both single
and multiple-cell upset scenarios, regardless of any assumption of equal
partition sizes. Starting with a high-level description of a design, a Markov
model is constructed from the Data Flow Graph (DFG) using a specified number of
partitions, a component characterization library and a user defined scrub rate.
Such a model and exhaustive analysis captures all the considered failures and
repairs possible in the system within the radiation environment. Various
reliability and availability properties are then verified automatically using
the PRISM model checker exploring the relationship between the scrub frequency
and the number of TMR partitions required to meet the design requirements.
Also, the reported results show that based on a known voter failure rate, it is
possible to find an optimal number of partitions at early design stages using
our proposed method.Comment: Published in Reliability Engineering & System Safety Volume 182,
February 2019, Pages 107-11
Early Dependability Analysis of FPGA-Based Space Applications Using Formal Verification
SRAM-based FPGAs are increasingly attractive in the aerospace industry for their field programmability and low cost. Unfortunately, they suffer from cosmic radiation induced Single Event Effects (SEEs). In safety-critical applications, the dependability of the design is a prime concern since failures may have catastrophic consequences. Hence, an early analysis of dependability of such safety-critical applications will enable designers to develop systems that meet high dependability requirements, such as the DO-254 standard. In this thesis, we propose a high-level dependability and performability analysis methodology based on probabilistic model checking. Compared to the pen-and-pencil and discrete-event simulation approach, our methodology is more accurate due to the use of an automated formal verification technique. Moreover, compared to fault injection or beam testing, analysis at early design stages can guide designers to build more reliable designs reducing the overall cost and effort. The proposed methodology can perform three different types of analysis: evaluation of available design options, optimization of scrub intervals while satisfying its design assurance level requirements, and optimal partitioning of Triple-Modular Redundant (TMR) Systems. Such analysis can also guide designers to adopt proper mitigation technique(s), such as rescheduling, TMR, TMR with less frequent scrubs, or even can help to decide the number of TMR partitions for a given scrub intervals.
Starting from a high-level description of a system, based on the preferred analysis, a Markov model or Markov (reward) model is constructed from the extracted Control Data Flow Graph (CDFG) and the failure/mitigation parameters for the targeted FPGA. Such modeling and exhaustive analysis elaborated using a probabilistic model checking technique can capture all the failures and repairs possible (according to some general model) in the system within the radiation environment. To illustrate the applicability of the proposed approach, we present our quantitative analysis obtained from DSP benchmark circuits
Operating System Support for Redundant Multithreading
Failing hardware is a fact and trends in microprocessor design indicate that the fraction of hardware suffering from permanent and transient faults will continue to increase in future chip generations. Researchers proposed various solutions to this issue with different downsides: Specialized hardware components make hardware more expensive in production and consume additional energy at runtime. Fault-tolerant algorithms and libraries enforce specific programming models on the developer. Compiler-based fault tolerance requires the source code for all applications to be available for recompilation. In this thesis I present ASTEROID, an operating system architecture that integrates applications with different reliability needs.
ASTEROID is built on top of the L4/Fiasco.OC microkernel and extends the system with Romain, an operating system service that transparently replicates user applications. Romain supports single- and multi-threaded applications without requiring access to the application's source code. Romain replicates applications and their resources completely and thereby does not rely on hardware extensions, such as ECC-protected memory. In my thesis I describe how to efficiently implement replication as a form of redundant multithreading in software. I develop mechanisms to manage replica resources and to make multi-threaded programs behave deterministically for replication.
I furthermore present an approach to handle applications that use shared-memory channels with other programs. My evaluation shows that Romain provides 100% error detection and more than 99.6% error correction for single-bit flips in memory and general-purpose registers. At the same time, Romain's execution time overhead is below 14% for single-threaded applications running in triple-modular redundant mode. The last part of my thesis acknowledges that software-implemented fault tolerance methods often rely on the correct functioning of a certain set of hardware and software components, the Reliable Computing Base (RCB).
I introduce the concept of the RCB and discuss what constitutes the RCB of the ASTEROID system and other fault tolerance mechanisms. Thereafter I show three case studies that evaluate approaches to protecting RCB components and thereby aim to achieve a software stack that is fully protected against hardware errors
Dependable Embedded Systems
This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. This book introduces the most prominent reliability concerns from today’s points of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level such circuit level or system level alone, the focus of this book is to deal with the different reliability challenges across different levels starting from the physical level all the way to the system level (cross-layer approaches). The book aims at demonstrating how new hardware/software co-design solution can be proposed to ef-fectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, soft errors, etc. Provides readers with latest insights into novel, cross-layer methods and models with respect to dependability of embedded systems; Describes cross-layer approaches that can leverage reliability through techniques that are pro-actively designed with respect to techniques at other layers; Explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many core systems