    Software Fault Tolerance in Real-Time Systems: Identifying the Future Research Questions

    Tolerating hardware faults in modern architectures is becoming a prominent problem due to the miniaturization of hardware components, their increasing complexity, and the need to reduce costs. Software-Implemented Hardware Fault Tolerance approaches have been developed to improve system dependability against hardware faults without resorting to custom hardware solutions. However, they come at the expense of making the timing constraints of the applications/activities harder to satisfy from a scheduling standpoint. This paper surveys the current state of the art of fault tolerance approaches used in the context of real-time systems, identifying the main challenges and the cross-links between these two topics. We propose a joint scheduling-failure analysis model that highlights the formal interactions among software fault tolerance mechanisms and timing properties. This model allows us to present and discuss many open research questions, with the final aim of spurring future research activities.
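    As a rough illustration of how software fault tolerance interacts with schedulability, the sketch below (not taken from the paper; the re-execution recovery model and the task parameters are assumptions) runs a standard fixed-priority response-time analysis in which every job may be re-executed after a detected fault.

```python
# Hypothetical sketch: response-time analysis for fixed-priority tasks
# where each job may need up to k re-executions to recover from a
# detected hardware fault. Task fields and the recovery model are
# illustrative assumptions, not the paper's model.

def response_time(tasks, i, max_reexecutions=1):
    """Iterative response-time analysis with re-execution overhead.

    tasks: list of (C, T) tuples sorted by descending priority,
           C = WCET, T = period (= deadline).
    i:     index of the task under analysis.
    """
    C_i, T_i = tasks[i]
    # Pessimistically assume the task and every preempting job may be
    # re-executed up to max_reexecutions times after an error is detected.
    C_i_faulty = C_i * (1 + max_reexecutions)
    R = C_i_faulty
    while True:
        interference = sum(
            -(-R // T_j) * C_j * (1 + max_reexecutions)  # ceil(R / T_j)
            for C_j, T_j in tasks[:i]
        )
        R_next = C_i_faulty + interference
        if R_next == R:
            return R           # converged: worst-case response time
        if R_next > T_i:
            return None        # deadline miss under the fault model
        R = R_next

# Example: two tasks (C, T); check the lower-priority one.
tasks = [(2, 10), (4, 20)]
print(response_time(tasks, 1, max_reexecutions=1))
```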

    Impact of Transient Faults on Timing Behavior and Mitigation with Near-Zero WCET Overhead

    As time-critical systems require timing guarantees, Worst-Case Execution Times (WCET) have to be employed. However, WCET estimation methods usually assume fault-free hardware. If proper actions are not taken, such fault-free WCET approaches become unsafe when faults impact the hardware during execution. The majority of approaches dealing with hardware faults address the impact of faults on the functional behavior of an application, i.e., denial of service and binary correctness. Few approaches address the impact of faults on the application's timing behavior, i.e., the time to finish the application, and those target faults occurring in memories. However, as the transistor size in modern technologies is significantly reduced, faults in cores cannot be considered negligible anymore. This work shows that faults not only affect the functional behavior, but can also have a significant impact on the timing behavior of applications. To expose the overall impact of faults, we enhance vulnerability analysis to include not only functional but also timing correctness, and show that faults impact WCET estimations. As common techniques to deal with faults, such as watchdog timers and re-execution, have large timing overheads for error detection and correction, we propose a mechanism with near-zero and bounded timing overhead. A RISC-V core is used as a case study. The obtained results show that faults can lead to an increase of almost 700% in the maximum observed execution time between fault-free and faulty execution without protection, affecting the WCET estimations. In contrast, the proposed mechanism is able to restore fault-free WCET estimations with a bounded overhead of 2 execution cycles.
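    The following toy calculation (all numbers are hypothetical, not the paper's measurements) illustrates how a fault-induced slowdown can push the maximum observed execution time past a fault-free WCET bound, and what a bounded 2-cycle correction overhead means for that bound.

```python
# Toy numbers only: how a fault-induced slowdown erodes a fault-free WCET
# margin, and what a bounded correction overhead does to the bound.

moet_fault_free = 1_000        # max observed execution time, cycles (assumed)
wcet_estimate   = 1_200        # fault-free WCET bound, cycles (assumed)

# Assume an unprotected transient fault stretches the longest observed run
# by a factor of 8 (i.e., a 700% increase).
moet_faulty = moet_fault_free * 8

print(f"fault-free margin: {wcet_estimate - moet_fault_free} cycles")
print(f"faulty run ({moet_faulty} cycles) exceeds the WCET bound: "
      f"{moet_faulty > wcet_estimate}")

# A detection/correction mechanism with a bounded overhead of 2 cycles
# (the figure quoted in the abstract) restores the fault-free bound plus
# that constant.
print(f"protected WCET bound: {wcet_estimate + 2} cycles")
```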

    Reasoning about the Reliability of Diverse Two-Channel Systems in which One Channel is "Possibly Perfect"

    This paper considers the problem of reasoning about the reliability of fault-tolerant systems with two "channels" (i.e., components) of which one, A, supports only a claim of reliability, while the other, B, by virtue of extreme simplicity and extensive analysis, supports a plausible claim of "perfection." We begin with the case where either channel can bring the system to a safe state. We show that, conditional upon knowing pA (the probability that A fails on a randomly selected demand) and pB (the probability that channel B is imperfect), a conservative bound on the probability that the system fails on a randomly selected demand is simply pA·pB. That is, there is conditional independence between the events "A fails" and "B is imperfect." The second step of the reasoning involves epistemic uncertainty about (pA, pB), and we show that, under quite plausible assumptions, a conservative bound on the system probability of failure on demand (pfd) can be constructed from point estimates for just three parameters. We discuss the feasibility of establishing credible estimates for these parameters. We extend our analysis from faults of omission to those of commission, and then combine these to yield an analysis for monitored architectures of a kind proposed for aircraft.
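    A compact restatement of the bound described above, with notation assumed here (F_A = "channel A fails on the demand", I_B = "channel B is imperfect"); the system can fail only if both channels fail, and a perfect B never fails:

```latex
% Sketch of the conservative bound, with assumed notation:
%   F_A = "channel A fails on the randomly selected demand"
%   I_B = "channel B is imperfect"
\begin{align*}
P(\text{system fails} \mid p_A, p_B)
  &\le P(F_A \wedge I_B \mid p_A, p_B) \\
  &=   P(F_A \mid p_A)\,P(I_B \mid p_B) && \text{(conditional independence)} \\
  &=   p_A\, p_B .
\end{align*}
```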

    Timing Predictability in Future Multi-Core Avionics Systems

    A fault-tolerant multiprocessor architecture for aircraft, volume 1

    A fault-tolerant multiprocessor architecture is reported. This architecture, together with a comprehensive information system architecture, has important potential for future aircraft applications. A preliminary definition and assessment of a suitable multiprocessor architecture for such applications is developed.

    2020 NASA Technology Taxonomy

    This document is an update (new photos used) of the PDF version of the 2020 NASA Technology Taxonomy that will be available to download on the OCT Public Website. The updated 2020 NASA Technology Taxonomy, or "technology dictionary", uses a technology-discipline-based approach that realigns like technologies independent of their application within the NASA mission portfolio. This tool is meant to serve as a common technology-discipline-based communication tool across the agency and with its partners in other government agencies, academia, industry, and across the world.

    Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing

    Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The sources of these problems are node failures and the need for dynamic configuration over extensive runtime. This paper presents two fault-tolerance mechanisms called Theft-Induced Checkpointing and Systematic Event Logging. These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multithreaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications with a need for adaptive or reactive configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocol is very small, and the maximum work lost by a crashed process is small and bounded. Index Terms—Grid computing, rollback recovery, checkpointing, event logging.
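    A minimal sketch of the idea of checkpointing dataflow-graph state rather than per-processor state (class and method names are illustrative assumptions, not the paper's protocol): only tasks still pending at the last checkpoint need to be re-executed, on however many workers remain.

```python
# Hypothetical sketch: persist the dataflow-graph state (completed vs.
# pending tasks) so recovery does not depend on the number of processors.

import pickle

class DataflowCheckpoint:
    def __init__(self):
        self.completed = {}   # task_id -> result (logged events)
        self.pending = set()  # task_ids whose results are not yet recorded

    def register_task(self, task_id):
        self.pending.add(task_id)

    def log_completion(self, task_id, result):
        """Event-logging style record: each finished task is logged once."""
        self.completed[task_id] = result
        self.pending.discard(task_id)

    def save(self, path):
        """Checkpoint the graph state, not per-processor stacks, so the
        restored execution can run on a different processor count."""
        with open(path, "wb") as f:
            pickle.dump((self.completed, self.pending), f)

    @classmethod
    def restore(cls, path):
        cp = cls()
        with open(path, "rb") as f:
            cp.completed, cp.pending = pickle.load(f)
        return cp

# After a crash, only tasks still in `pending` are re-executed, on however
# many workers are currently available; the work lost is bounded by the
# checkpoint interval.
```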

    A Survey of Research into Mixed Criticality Systems

    This survey covers research into mixed criticality systems that has been published since Vestal's seminal paper in 2007, up until the end of 2016. The survey is organised along the lines of the major research areas within this topic. These include single processor analysis (including fixed priority and EDF scheduling, shared resources, and static and synchronous scheduling), multiprocessor analysis, realistic models, and systems issues. The survey also explores the relationship between research into mixed criticality systems and other topics such as hard and soft time constraints, fault tolerant scheduling, hierarchical scheduling, cyber-physical systems, probabilistic real-time systems, and industrial safety standards.
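    As a pointer to the modelling style this line of work builds on, the sketch below gives a minimal Vestal-style task model with one WCET estimate per criticality level and a simple per-mode utilisation computation (an illustration only, not any specific analysis from the survey).

```python
# Minimal sketch of a Vestal-style mixed-criticality task model: each task
# carries a WCET estimate per criticality level, with higher levels using
# more pessimistic estimates. Names and the drop-LO-in-HI-mode policy are
# assumptions drawn from the general literature, not from the survey itself.

from dataclasses import dataclass

LO, HI = 0, 1  # two criticality levels, as in much of the surveyed work

@dataclass
class MCTask:
    name: str
    period: int
    criticality: int   # LO or HI
    wcet: tuple        # (C_LO, C_HI); C_HI >= C_LO for HI tasks

def utilisation(tasks, mode):
    """System utilisation in a given mode: in HI mode, LO tasks are dropped
    (one common policy in the literature) and HI tasks use C_HI."""
    total = 0.0
    for t in tasks:
        if mode == HI and t.criticality == LO:
            continue
        total += t.wcet[mode] / t.period
    return total

tasks = [
    MCTask("control", 10, HI, (2, 4)),
    MCTask("logging", 20, LO, (5, 5)),
]
print(utilisation(tasks, LO), utilisation(tasks, HI))
```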