
6,781 research outputs found

A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

Author: Treaster Michael
Publication date: 31/12/2004

Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques. Comment: 11 pages

arXiv.org e-Print Archive

CiteSeerX

Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems

Author: Balladini Javier
Moran Marina
Rexachs Dolores
Rucci Enzo
Publication date: 14/11/2023

Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to reduce energy consumption should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints avoid having every process re-execute when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work is an extension of a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. In this work, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that get blocked by communicating indirectly with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios it was possible to achieve significant ones; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure. Comment: This is the accepted version of the manuscript that was sent for review to Journal of Parallel and Distributed Computing (ISSN 1096-0848). arXiv admin note: text overlap with arXiv:2012.1139

arXiv.org e-Print Archive

Fault-Tolerance by Graceful Degradation for Car Platoons

Author: Glesner Sabine
Grabowski Markus
Zarrouki M. Baha E.
Publication venue: OASIcs - OpenAccess Series in Informatics. Workshop on Autonomous Systems Design (ASD 2019)
Publication date: 01/01/2019

The key advantage of autonomous car platoons is their short inter-vehicle distances, which increase traffic flow and reduce fuel consumption. However, this is challenging for operational and functional safety. If a failure occurs, the affected vehicles cannot suddenly stop driving but instead should continue their operation with reduced performance until a safe state can be reached or, in the case of temporal failures, full functionality can be guaranteed again. To achieve this degradation, platoon members have to be able to compensate for sensor and communication failures and have to adjust their inter-vehicle distances to ensure safety. In this work, we describe a systematic design of degradation cascades for sensor and communication failures in autonomous car platoons using the example of an autonomous model car. We describe our systematic design method and the resulting degradation modes, and formulate contracts for each degradation level. We model and test our resulting degradation controller in Simulink/Stateflow.

Dagstuhl Research Online Publication Server

Advanced instrumentation concepts for environmental control subsystems

Author: Gyorki J. R.
Schubert F. H.
Wynveen R. A.
Yang P. Y.

Design, evaluation and demonstration of advanced instrumentation concepts for improving performance of manned spacecraft environmental control and life support systems were successfully completed. Concepts to aid maintenance following fault detection and isolation were defined. A computer-guided fault correction instruction program was developed and demonstrated in a packaged unit which also contains the operator/system interface.

NASA Technical Reports Server

Muller C-element based Decoder (MCD): A Decoder Against Transient Faults

Author: Boutillon E.
Jego C.
Jezequel M.
Tang Y.
Winstead Chris J.
Publication venue: Hosted by Utah State University Libraries
Publication date: 01/05/2013

This work extends the analysis and application of a digital error correction method called Muller C-element Decoding (MCD), which has been proposed for fault masking in logic circuits composed of unreliable elements. The proposed technique employs cascaded Muller C-elements and XOR gates to achieve efficient error correction in the presence of internal upsets. The error-correction analysis of the MCD architecture and the investigation of the C-element's robustness are first introduced. We demonstrate that the MCD is able to produce an error-correction benefit at a high error rate of internal faults. Significantly, for a (3,6) short-length LDPC code, when the decoding process is internally error-free, the MCD also achieves a gain in decoding performance compared with the well-known Gallager Bit-Flipping method. We further consider application of MCD to a general-purpose fault-tolerant model, coded Dual Modular Redundancy (cDMR), which offers low-redundancy error resilience for contemporary logic systems as well as future nanoelectronic architectures.

HAL-Université de Bretagne Occidentale

DigitalCommons@USU