6,781 research outputs found

    A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

    Supercomputing systems today often take the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. To improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these fault-tolerance techniques. Comment: 11 pages

    Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems

    Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to reduce energy consumption should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints prevent all processes from re-executing when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work is an extension of a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. In this work, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that get blocked by communicating indirectly with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios it was possible to achieve significant ones; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure. Comment: This is the accepted version of the manuscript that was sent for review to the Journal of Parallel and Distributed Computing (ISSN 1096-0848). arXiv admin note: text overlap with arXiv:2012.1139

    Fault-Tolerance by Graceful Degradation for Car Platoons

    The key advantage of autonomous car platoons is their short inter-vehicle distances, which increase traffic flow and reduce fuel consumption. However, this is challenging for operational and functional safety. If a failure occurs, the affected vehicles cannot suddenly stop driving but instead should continue their operation with reduced performance until a safe state can be reached or, in the case of temporary failures, full functionality can be guaranteed again. To achieve this degradation, platoon members have to be able to compensate for sensor and communication failures and have to adjust their inter-vehicle distances to ensure safety. In this work, we describe a systematic design of degradation cascades for sensor and communication failures in autonomous car platoons using the example of an autonomous model car. We describe our systematic design method and the resulting degradation modes, and formulate contracts for each degradation level. We model and test our resulting degradation controller in Simulink/Stateflow.

    Advanced instrumentation concepts for environmental control subsystems

    Design, evaluation, and demonstration of advanced instrumentation concepts for improving the performance of manned spacecraft environmental control and life support systems were successfully completed. Concepts to aid maintenance following fault detection and isolation were defined. A computer-guided fault correction instruction program was developed and demonstrated in a packaged unit which also contains the operator/system interface.

    Muller C-element based Decoder (MCD): A Decoder Against Transient Faults

    This work extends the analysis and application of a digital error correction method called Muller C-element Decoding (MCD), which has been proposed for fault masking in logic circuits composed of unreliable elements. The proposed technique employs cascaded Muller C-elements and XOR gates to achieve efficient error correction in the presence of internal upsets. The error-correction analysis of the MCD architecture and the investigation of the C-element's robustness are introduced first. We demonstrate that the MCD is able to produce an error-correction benefit at a high error rate of internal faults. Significantly, for a (3,6) short-length LDPC code, when the decoding process is internally error-free, the MCD also achieves a gain in decoding performance compared with the well-known Gallager Bit-Flipping method. We further consider the application of MCD to a general-purpose fault-tolerant model, coded Dual Modular Redundancy (cDMR), which offers low-redundancy error resilience for contemporary logic systems as well as future nanoelectronic architectures.