A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems
Supercomputing systems today often come in the form of large numbers of
commodity systems linked together into a computing cluster. These systems, like
any distributed system, can have large numbers of independent hardware
components cooperating or collaborating on a computation. Unfortunately, any of
this vast number of components can fail at any time, resulting in potentially
erroneous output. In order to improve the robustness of supercomputing
applications in the presence of failures, many techniques have been developed
to provide resilience to these kinds of system faults. This survey provides an
overview of these various fault-tolerance techniques. Comment: 11 page
Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems
Nowadays, improving the energy efficiency of high-performance computing (HPC)
systems is one of the main drivers in scientific and technological research. As
large-scale HPC systems require some fault-tolerance method, the opportunities
to reduce energy consumption should be explored. In particular,
rollback-recovery methods using uncoordinated checkpoints prevent all processes
from re-executing when a failure occurs. In this context, it is possible to
take actions to reduce the energy consumption of the nodes whose processes do
not re-execute. This work is an extension of a previous one, in which we
proposed a series of strategies to manage energy consumption at failure-time.
In this work, we have enriched our simulator and the experimentation by
including non-blocking communications (with and without system buffering) and a
larger number of candidate processes to be analyzed. We call the latter
\textit{cascade analysis} because it includes processes that become blocked
through indirect communication with the failed process. The simulations show that
the savings were negligible in the worst case, but in some scenarios, it was
possible to achieve significant ones; the maximum saving achieved was 90\% in a
time interval of 16 minutes. As a result, we show the feasibility of improving
energy efficiency in HPC systems in the presence of a failure. Comment: This is the accepted version of the manuscript that was sent for
review to the Journal of Parallel and Distributed Computing (ISSN 1096-0848).
arXiv admin note: text overlap with arXiv:2012.1139
Fault-Tolerance by Graceful Degradation for Car Platoons
The key advantage of autonomous car platoons is their short inter-vehicle distances, which increase traffic flow and reduce fuel consumption. However, these short distances are challenging for operational and functional safety. If a failure occurs, the affected vehicles cannot suddenly stop driving; instead, they should continue operating with reduced performance until a safe state can be reached or, in the case of temporary failures, full functionality can be guaranteed again. To achieve this degradation, platoon members have to be able to compensate for sensor and communication failures and have to adjust their inter-vehicle distances to ensure safety. In this work, we describe a systematic design of degradation cascades for sensor and communication failures in autonomous car platoons, using the example of an autonomous model car. We describe our systematic design method and the resulting degradation modes, and we formulate contracts for each degradation level. We model and test our resulting degradation controller in Simulink/Stateflow.
Advanced instrumentation concepts for environmental control subsystems
Design, evaluation and demonstration of advanced instrumentation concepts for improving performance of manned spacecraft environmental control and life support systems were successfully completed. Concepts to aid maintenance following fault detection and isolation were defined. A computer-guided fault correction instruction program was developed and demonstrated in a packaged unit which also contains the operator/system interface
Muller C-element based Decoder (MCD): A Decoder Against Transient Faults
This work extends the analysis and application of a digital error-correction method called Muller C-element Decoding (MCD), which has been proposed for fault masking in logic circuits composed of unreliable elements. The proposed technique employs cascaded Muller C-elements and XOR gates to achieve efficient error correction in the presence of internal upsets. The error-correction analysis of the MCD architecture and an investigation of the C-element's robustness are introduced first. We demonstrate that the MCD provides an error-correction benefit even at a high rate of internal faults. Significantly, for a (3,6) short-length LDPC code, when the decoding process is internally error-free, the MCD also achieves a gain in decoding performance compared with the well-known Gallager Bit-Flipping method. We further consider the application of MCD to a general-purpose fault-tolerant model, coded Dual Modular Redundancy (cDMR), which offers low-redundancy error resilience for contemporary logic systems as well as future nanoelectronic architectures.
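As a rough illustration of why a C-element can mask transient upsets, the sketch below models the standard two-input Muller C-element: the output follows the inputs when they agree and holds its previous value when they disagree. This is not the MCD architecture from the abstract. The function names `c_element` and `run` and the input sequences are hypothetical, chosen only for this example.

```python
def c_element(a: int, b: int, prev: int) -> int:
    """Two-input Muller C-element: follow the inputs when they agree,
    otherwise hold the previous output."""
    return a if a == b else prev


def run(inputs, initial=0):
    """Feed a sequence of (a, b) input pairs through one C-element
    and return the output after each step."""
    out = initial
    trace = []
    for a, b in inputs:
        out = c_element(a, b, out)
        trace.append(out)
    return trace


# Both inputs carry the same signal. In `glitched`, a transient upset
# flips input b for a single step (third pair).
clean = [(1, 1), (1, 1), (1, 1), (0, 0)]
glitched = [(1, 1), (1, 1), (1, 0), (0, 0)]

print(run(clean))
print(run(glitched))
```

Because the element holds its state whenever the inputs disagree, the single-step upset on one input does not reach the output in this toy case.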