

A Survey of Fault-Tolerance Techniques for Embedded Systems from the Perspective of Power, Energy, and Thermal Issues

Author: Ansari M.; Ejlali A.; Henkel J.; Hessabi S.; Khdr H.; Nazari P. G.; Safari S.; Yari-Karin S.; Yeganeh-Khaksar A.
Publication venue: Institute of Electrical and Electronics Engineers
Publication date: 02/02/2022

The relentless technology scaling has provided a significant increase in processor performance, but on the other hand, it has led to adverse impacts on system reliability. In particular, technology scaling increases the processor susceptibility to radiation-induced transient faults. Moreover, technology scaling with the discontinuation of Dennard scaling increases the power densities, and thereby the temperatures, on the chip. High temperature, in turn, accelerates transistor aging mechanisms, which may ultimately lead to permanent faults on the chip. To assure reliable system operation despite these potential reliability concerns, fault-tolerance techniques have emerged. Specifically, fault-tolerance techniques employ some form of redundancy to satisfy specific reliability requirements. However, the integration of fault-tolerance techniques into real-time embedded systems complicates preserving timing constraints. As a remedy, many task mapping/scheduling policies have been proposed to consider the integration of fault-tolerance techniques and enforce both timing and reliability guarantees for real-time embedded systems. More advanced techniques aim additionally at minimizing power and energy while at the same time satisfying timing and reliability constraints. Recently, some scheduling techniques have started to tackle a new challenge, which is the temperature increase induced by employing fault-tolerance techniques. These emerging techniques aim at satisfying temperature constraints besides timing and reliability constraints. This paper provides an in-depth survey of the emerging research efforts that exploit fault-tolerance techniques while considering timing, power/energy, and temperature from the real-time embedded systems’ design perspective. In particular, the task mapping/scheduling policies for fault-tolerant real-time embedded systems are reviewed and classified according to their considered goals and constraints. Moreover, the employed fault-tolerance techniques, application models, and hardware models are considered as additional dimensions of the presented classification. Lastly, this survey gives deep insights into the main achievements and shortcomings of the existing approaches and highlights the most promising ones.

KITopen

A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

Author: Treaster Michael
Publication date: 31/12/2004

Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques.

arXiv.org e-Print Archive

CiteSeerX

Study of fault-tolerant software technology

Author: Broglio C.; Goldberg J.; Hitt E.; Levitt K.; Slivinski T.; Webb J.; Wild C.

Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance.

NASA Technical Reports Server


Ratatoskr wide-area actuator RPC over GridStat with timeliness, redundancy, and safety

Author: Viddal Erlend
Publication venue: Washington State University
Publication date: 01/12/2007

The development of the communication infrastructure for the North American electrical power grid has failed to fully incorporate important developments in the field of computer science, affecting the stability and efficiency of the power grid as a whole. The current power-grid communication standard, SCADA, utilizes protocols specialized for centralized communication, hampering communication between field sites key for envisioned improvements of power grid safety and efficiency. Further, a number of different proprietary communication protocols are in use, making communication between power utility companies very difficult. GridStat is a communication infrastructure designed for a power grid environment that solves many of the problems with the current situation. GridStat uses a specialization of the publish-subscribe middleware paradigm, status dissemination, that takes advantage of the semantics of status data to provide flexible acquisition of power-grid data with multiple dimensions of QoS semantics. The middleware approach enables communication between utilities independent of proprietary network protocols, and allows enhanced network features such as forwarding data through multiple redundant paths. While GridStat provides excellent support for data acquisition, the publish-subscribe architecture supports only one-way communication and provides syntax and semantics unsuitable for control communications. This thesis presents Ratatoskr, a novel scheme for control of actuators using GridStat communication. It constructs a two-way communication channel on top of GridStat publish/subscribe paths, and utilizes the QoS semantics and middleware properties GridStat provides. For control communication Ratatoskr uses remote procedure call (RPC), providing programmer friendliness and familiarity. The QoS semantics of GridStat are drawn upon to provide the timeliness required for power-grid operation. Reliability concerns are addressed by providing three redundancy schemes: ACK/resend, transmitting multiple copies of a single packet, and spatial redundancy through GridStat’s redundant routing paths feature. Additionally, pre- and post-condition expressions over GridStat status variables are built into call semantics. The architecture and design of Ratatoskr are presented, along with results from an evaluation of a prototype implementation.

Washington State University institutional repository

The integration of on-line monitoring and reconfiguration functions using IEEE1149.4 into a safety critical automotive electronic control unit.

Author: Cutajar R.; Jeffery C.; Lickess M.; Prosser S.; Richardson Andrew M. D.; Riches S.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2005

This paper presents an innovative application of IEEE 1149.4 and the integrated diagnostic reconfiguration (IDR) as tools for the implementation of an embedded test solution for an automotive electronic control unit, implemented as a fully integrated mixed signal system. The paper describes how the test architecture can be used for fault avoidance, with results from a hardware prototype presented. The paper concludes that fault avoidance can be integrated into mixed signal electronic systems to handle key failure modes.

Lancaster E-Prints

Design of a fault tolerant airborne digital computer. Volume 1: Architecture

Author: Goldberg J.; Green M. W.; Levitt K. N.; Neumann P. G.; Wensley J. H.

This volume is concerned with the architecture of a fault tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable or contractible. (4) The design is to be appropriate to post-1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition, SIFT is particularly simple and believable. The other candidates, Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor, are potentially more efficient than SIFT in the use of redundancy, but otherwise are not as attractive.

NASA Technical Reports Server

Dependability of the NFV Orchestrator: State of the Art and Research Challenges

Author: Gonzalez Andres Javier; Heegaard Poul Einar; Helvik Bjarne Emil; Kamisinski Andrzej; Nencioni Gianfranco
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2018

© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The introduction of network function virtualisation (NFV) represents a significant change in networking technology, which may create new opportunities in terms of cost efficiency, operations, and service provisioning. Although not explicitly stated as an objective, the dependability of the services provided using this technology should be at least as good as conventional solutions. Logical centralisation, off-the-shelf computing platforms, and increased system complexity represent new dependability challenges relative to the state of the art. The core function of the network, with respect to failure and service management, is orchestration. The failure and misoperation of the NFV orchestrator (NFVO) will have huge network-wide consequences. At the same time, NFVO is vulnerable to overload and design faults. Thus, the objective of this paper is to give a tutorial on the dependability challenges of the NFVO, and to give insight into the required future research. This paper provides necessary background information, reviews the available literature, outlines the proposed solutions, and identifies some design and research problems that must be addressed.

NORA - Norwegian Open Research Archives

UiS Brage