
    Autonomic log/restore for advanced optimistic simulation systems

    In this paper we address state recoverability in optimistic simulation systems by presenting an autonomic log/restore architecture. Our proposal is unique in that it jointly provides the following features: (i) log/restore operations are carried out in a manner completely transparent to the application programmer, (ii) the simulation-object state can be scattered across dynamically allocated, non-contiguous memory chunks, (iii) two differentiated operating modes, incremental vs. non-incremental, coexist via transparent, optimized run-time management of dual versions of the same application layer, with dynamic selection of the best-suited operating mode in different phases of the optimistic simulation run, and (iv) determination of the best-suited mode for any time frame is carried out on the basis of an innovative modeling/optimization approach that takes into account the stability of each operating mode under variations of the model execution parameters. © 2010 IEEE
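
    The mode-selection idea can be sketched in a few lines of Python. All names below (`SimObjectState`, `choose_mode`, the cost model) are hypothetical illustrations, not the paper's architecture: non-incremental logging snapshots every chunk, incremental logging saves only dirty chunks, and a crude cost model picks between them per time frame.

```python
# A minimal sketch of dual-mode state saving for scattered, non-contiguous
# state chunks; names and the cost model are assumptions for illustration.
import copy

class SimObjectState:
    """Simulation-object state scattered across non-contiguous chunks."""
    def __init__(self):
        self.chunks = {}    # chunk_id -> bytearray (dynamically allocated)
        self.dirty = set()  # chunk ids written since the last log

    def write(self, chunk_id, data):
        self.chunks[chunk_id] = bytearray(data)
        self.dirty.add(chunk_id)

def log_full(state):
    """Non-incremental mode: snapshot every chunk."""
    state.dirty.clear()
    return ('full', copy.deepcopy(state.chunks))

def log_incremental(state):
    """Incremental mode: snapshot only chunks touched since the last log."""
    delta = {cid: bytes(state.chunks[cid]) for cid in state.dirty}
    state.dirty.clear()
    return ('incr', delta)

def choose_mode(state, restore_rate, chunk_cost=1.0):
    """Toy cost model: incremental logs are cheap to take, but restoring
    from them means reconstructing state from a chain of deltas, so full
    snapshots win when rollbacks are frequent."""
    incr_log_cost = len(state.dirty) * chunk_cost
    full_log_cost = len(state.chunks) * chunk_cost
    if incr_log_cost + restore_rate * full_log_cost < full_log_cost:
        return log_incremental
    return log_full

obj = SimObjectState()
obj.write('lp_state', b'\x00' * 64)
kind, snapshot = choose_mode(obj, restore_rate=0.1)(obj)
```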

    Improving redundant multithreading performance for soft-error detection in HPC applications

    Graduation thesis (Master's in Computing), Instituto Tecnológico de Costa Rica, Escuela de Computación, 2018. As HPC systems move towards extreme scale, soft errors leading to silent data corruptions become a major concern. In this thesis, we propose a set of three optimizations to the classical Redundant Multithreading (RMT) approach to allow faster soft-error detection. First, we leverage Simultaneous Multithreading (SMT) to collocate sibling replicated threads on the same physical core so they can efficiently exchange data to expose errors. Some HPC applications cannot fully exploit SMT for performance improvement; instead, we propose to use these additional resources for fault tolerance. Second, we present variable aggregation, which groups several values together and uses the merged value to speed up detection of soft errors. Third, we introduce selective checking to decrease the number of checked values to a minimum. The last two techniques reduce the overall performance overhead by relaxing the soft-error detection scope. Our experimental evaluation, executed on recent multicore processors with representative HPC benchmarks, shows that the use of SMT for fault tolerance can enhance RMT performance. It also shows that, at a constant computing-power budget and with the optimizations applied, the overhead of the technique can be significantly lower than that of classical RMT replicated execution. These results indicate that RMT can be a viable solution for soft-error detection at extreme scale.
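
    As a rough illustration of variable aggregation and grouped comparison, the sketch below folds several replica outputs into one merged value and compares only the aggregates. Names are assumptions, and the two replicas run sequentially here for simplicity; the thesis pairs them as SMT sibling threads.

```python
# Variable aggregation for redundant multithreading (illustrative sketch):
# compare one checksum per group of values instead of every value.
import struct
import zlib

def aggregate(values):
    """Fold a group of floats into a single checksum (the 'merged value')."""
    buf = b''.join(struct.pack('<d', v) for v in values)
    return zlib.crc32(buf)

def rmt_step(compute, inputs, group_size=8):
    """Run both replicas and compare aggregated outputs per group."""
    leading  = [compute(x) for x in inputs]   # leading replica
    trailing = [compute(x) for x in inputs]   # trailing replica (SMT sibling)
    for i in range(0, len(inputs), group_size):
        if aggregate(leading[i:i+group_size]) != aggregate(trailing[i:i+group_size]):
            raise RuntimeError(f'soft error exposed in value group {i // group_size}')
    return leading

# Usage: any mismatch inside a group of 8 outputs triggers detection.
results = rmt_step(lambda x: x * x + 1.0, [float(i) for i in range(32)])
```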

    Approaches to multiprocessor error recovery using an on-chip interconnect subsystem

    For future multicores, a dedicated interconnect subsystem for on-chip monitors was found to be highly beneficial in terms of scalability, performance, and area. In this thesis, such a monitor network (MNoC) is used in multicores to support selective error identification and recovery and to maintain target chip reliability in the context of dynamic voltage and frequency scaling (DVFS). A selective shared-memory multiprocessor recovery is performed using the MNoC in which, when an error is detected, only the group of processors sharing an application with the affected processors is recovered. Although the use of DVFS in contemporary multicores provides significant protection from unpredictable thermal events, a potential side effect is increased processor exposure to soft errors. To address this issue, a flexible fault prevention and recovery mechanism has been developed to selectively enable a small amount of per-core dual modular redundancy (DMR) in response to increased vulnerability, as measured by the processor's architectural vulnerability factor (AVF). Our new algorithm for DMR deployment aims to provide a stable effective soft error rate (SER) by deploying DMR in response to DVFS caused by thermal events. The algorithm is implemented in real time on the multicore using the MNoC and a controller that evaluates thermal information and multicore performance statistics in addition to error information. DVFS experiments with a multicore simulator using standard benchmarks show an average 6% improvement in overall power consumption and a stable SER when using selective DMR versus continuous DMR deployment.
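
    A hedged sketch of the selective-DMR control loop follows. The `Core` record, `rebalance_dmr`, and the SER model are assumptions made for illustration, not the thesis interfaces: after a DVFS event raises soft-error exposure, DMR is enabled only on the most vulnerable cores until the chip-wide effective SER is back under budget.

```python
# Selective per-core DMR driven by AVF (illustrative model only).
from dataclasses import dataclass

@dataclass
class Core:
    core_id: int
    avf: float          # architectural vulnerability factor, 0..1
    raw_ser: float      # raw soft error rate at the current V/f point
    dmr_enabled: bool = False

def effective_ser(core):
    # DMR is modeled as suppressing a core's contribution to the
    # effective SER, since detected errors can be recovered.
    return 0.0 if core.dmr_enabled else core.avf * core.raw_ser

def rebalance_dmr(cores, target_ser):
    """Enable DMR on the most vulnerable cores until the chip-wide
    effective SER drops back under the target."""
    for c in cores:
        c.dmr_enabled = False
    for c in sorted(cores, key=lambda c: c.avf * c.raw_ser, reverse=True):
        if sum(effective_ser(x) for x in cores) <= target_ser:
            break
        c.dmr_enabled = True
    return [c.core_id for c in cores if c.dmr_enabled]

cores = [Core(i, avf=0.1 * (i + 1), raw_ser=2.0) for i in range(4)]
print(rebalance_dmr(cores, target_ser=0.5))   # -> [1, 2, 3]
```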

    ParaDox: Eliminating Voltage Margins via Heterogeneous Fault Tolerance.

    Providing reliability is becoming a challenge for chip manufacturers, faced with simultaneously trying to improve miniaturization, performance and energy efficiency. This leads to very large margins on voltage and frequency, designed to avoid errors even in the worst case, along with significant hardware expenditure on eliminating voltage spikes and other forms of transient error, causing considerable inefficiency in power consumption and performance. We flip traditional ideas about reliability and performance around, by exploring the use of error resilience for power and performance gains. ParaMedic is a recent architecture that provides a solution for reliability with low overheads via automatic hardware error recovery. It works by splitting up checking onto many small cores in a heterogeneous multicore system with hardware logging support. However, its design is based on the idea that errors are exceptional. We transform ParaMedic into ParaDox, which shows high performance in both error-intensive and error-scarce scenarios, thus allowing correct execution even when undervolted and overclocked. Evaluation within error-intensive simulation environments confirms the error resilience of ParaDox and the low associated recovery cost. We estimate that compared to a non-resilient system with margins, ParaDox can reduce energy-delay product by 15% through undervolting, while completely recovering from any induced errors.
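
    The undervolting trade-off can be illustrated with a back-of-envelope energy-delay-product model. The formulas and numbers below are purely illustrative assumptions, not ParaDox's evaluation methodology: dynamic power scales roughly with V^2 at fixed frequency, while induced errors add a small recovery penalty to runtime.

```python
# Toy EDP model: undervolting pays off when the power saving outweighs
# the extra runtime spent recovering from induced errors.
def edp(v_rel, errors_per_s, recovery_s, base_time=100.0, base_power=50.0):
    power = base_power * v_rel ** 2                   # dynamic power ~ V^2
    time = base_time * (1.0 + errors_per_s * recovery_s)  # recovery overhead
    return power * time * time                        # energy * delay = P * t^2

nominal = edp(1.00, errors_per_s=0.0, recovery_s=0.0)
undervolted = edp(0.90, errors_per_s=0.5, recovery_s=0.01)
print(f'EDP change from undervolting: {undervolted / nominal - 1:+.1%}')
# With these made-up parameters the model prints roughly -18%, in the same
# ballpark as the 15% reduction the paper reports.
```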

    Mitigation of failures in high performance computing via runtime techniques

    As machines increase in scale, it is predicted that failure rates of supercomputers will correspondingly increase. Even though the mean time to failure (MTTF) of an individual component is high, the large number of components significantly decreases the system MTTF. Meanwhile, the decreasing size of transistors has been critical to the increase in capacity of supercomputers, but as transistors shrink, silent data corruptions (SDCs) are likely to occur more frequently. SDCs do not inhibit execution, but may silently lead to incorrect results. In this thesis, we leverage runtime-system and compiler techniques to mitigate a significant fraction of failures automatically with low overhead. The main goals of the various system-level fault tolerance strategies designed in this thesis are: reducing the extra cost added to application execution while improving system reliability; automatically adjusting fault tolerance decisions without user intervention based on environmental changes; and protecting applications not only from fail-stop failures but also from silent data corruptions. The main contributions of this thesis are the development of a semi-blocking checkpoint protocol that overlaps application execution with fault tolerance operations to reduce the overhead of checkpointing, a runtime-system technique for automatic checkpoint and restart without user intervention, a holistic framework (ACR) for automatically detecting and recovering from silent data corruptions, and a framework called FlipBack that provides targeted protection against silent data corruption with low cost.
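
    Reduced to its essence, the semi-blocking idea looks roughly like the sketch below. The names are hypothetical and the thesis implements this inside a parallel runtime system, not with `pickle` and threads: the application blocks only for an in-memory copy of its state, then keeps computing while a background thread persists the copy.

```python
# Semi-blocking checkpointing sketch: short blocking copy, long
# persistence phase overlapped with application execution.
import pickle
import threading

def checkpoint_semi_blocking(state, path):
    snapshot = pickle.dumps(state)        # blocking phase: copy state
    def persist():
        with open(path, 'wb') as f:       # overlapped phase: write to storage
            f.write(snapshot)
    t = threading.Thread(target=persist)
    t.start()
    return t                              # caller joins before the next checkpoint

def restart(path):
    with open(path, 'rb') as f:
        return pickle.loads(f.read())

state = {'iteration': 42, 'grid': [0.0] * 1024}
writer = checkpoint_semi_blocking(state, '/tmp/ckpt.bin')
state['iteration'] += 1                   # application continues immediately
writer.join()
assert restart('/tmp/ckpt.bin')['iteration'] == 42
```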

    Advanced Concepts for Automatic Differentiation based on Operator Overloading

    Using the technique of Automatic Differentiation (AD), derivative information can be computed efficiently, and with little effort from the user, for any function that is given as source code in a supported programming language. One basic implementation strategy is based on the concept of operator overloading, which is available in many modern programming languages. By exploiting overloading, an internal representation of the function, the so-called tape, is generated at runtime and can then be used for computing derivatives. In this thesis, new techniques are introduced that allow more efficient tape creation and the parallel evaluation of tapes. The advantages of the new techniques are demonstrated by means of runtime analyses for numerical examples.
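
    Taping via operator overloading is a textbook pattern; a minimal reverse-mode sketch in Python (illustrative only, not the thesis code) shows the core mechanism: each overloaded operation records an entry on a global tape, and a reverse sweep over the tape accumulates adjoints.

```python
# Minimal operator-overloading AD: record a tape, then sweep it backwards.
tape = []  # entries: (output, [(input, local_partial), ...])

class Var:
    def __init__(self, value):
        self.value = value
        self.adjoint = 0.0
    def __add__(self, other):
        out = Var(self.value + other.value)
        tape.append((out, [(self, 1.0), (other, 1.0)]))
        return out
    def __mul__(self, other):
        out = Var(self.value * other.value)
        tape.append((out, [(self, other.value), (other, self.value)]))
        return out

def backward(result):
    result.adjoint = 1.0
    for out, deps in reversed(tape):      # reverse sweep over the tape
        for var, partial in deps:
            var.adjoint += partial * out.adjoint

x, y = Var(3.0), Var(2.0)
f = x * y + x                             # builds the tape as a side effect
backward(f)
print(x.adjoint, y.adjoint)               # df/dx = y + 1 = 3.0, df/dy = x = 3.0
```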

    ParaMedic: Heterogeneous Parallel Error Correction

    Processor error detection can be made significantly cheaper by exploiting the parallelism that exists in a repeated copy of an execution, which may not exist in the original code, to split the redundant work across a large number of small, highly efficient cores. However, such schemes do not provide a method for automatic error recovery. We develop ParaMedic, an architecture that allows efficient automatic correction of errors detected in a system by using parallel heterogeneous cores, providing a full fail-safe system that does not propagate errors to other systems and can recover without manual intervention. It uses logging to roll back any computation that occurred after a detected error, along with a set of techniques to provide error-checking parallelism while still preventing the escape of incorrect processor values in multicore environments, where ordering of individual processors' logs is not enough to roll back execution. Across a set of single- and multi-threaded benchmarks, we achieve 3.1% and 1.5% overhead respectively, compared with 1.9% and 1% for error detection alone.
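
    A heavily simplified software analogue of log-based parallel checking follows (assumed structure; ParaMedic does this in hardware with per-core logs and heterogeneous checker cores): a fast core executes work in blocks and logs each block's input and claimed output, while small checker workers re-execute blocks independently and flag mismatches, which would trigger a rollback to the last verified state.

```python
# Parallel checking of a logged execution, with rollback on mismatch.
from concurrent.futures import ThreadPoolExecutor

def run_with_parallel_checking(blocks, state, inject_error_at=None):
    log = []                                  # (block_index, input, output)
    for i, block in enumerate(blocks):        # fast, unchecked main execution
        out = block(state)
        if i == inject_error_at:
            out = out + 1                     # simulate a soft error
        log.append((i, state, out))
        state = out
    def check(entry):                         # small checker cores, in parallel
        i, inp, claimed = entry
        return i if blocks[i](inp) != claimed else None
    with ThreadPoolExecutor(max_workers=4) as pool:
        bad = [i for i in pool.map(check, log) if i is not None]
    if bad:
        first = min(bad)
        return ('rollback-to-block', first, log[first][1])  # last good state
    return ('verified', state)

blocks = [lambda s: s + 1, lambda s: s * 2, lambda s: s - 3]
print(run_with_parallel_checking(blocks, 10, inject_error_at=1))
```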

    From detection to optimization: impact of soft errors on high-performance computing applications

    As high-performance computing (HPC) continues to progress, constraints on HPC system design force the handling of errors to higher levels in the software stack. Among the error types facing HPC, soft errors that silently corrupt system or application state are among the most severe. Understanding the behavior of HPC applications in the presence of soft errors is critical for effective utilization of HPC systems, and it can guide the development of algorithm-based error detection informed by application characteristics obtained from fault-injection and error-propagation studies. Furthermore, the realization that applications are tolerant to small errors enables optimizations such as lossy compression on high-cost data transfers. Lossy compression adds small, user-controllable amounts of error when compressing data to reduce data size before expensive data transfers, saving time. This dissertation investigates and improves the resiliency of HPC applications to soft errors, and explores lossy compression as a new form of optimization for expensive, time-consuming data transfers.
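
    Fault-injection studies of this kind commonly model a soft error as a single bit flip in a value's IEEE-754 encoding. A minimal sketch of that idiom (methodology illustration only, not the dissertation's tooling) also shows why applications tolerate small errors: which bit flips matters enormously.

```python
# Flip one bit of a double's IEEE-754 encoding and observe the damage.
import struct

def flip_bit(x, bit):
    (bits,) = struct.unpack('<Q', struct.pack('<d', x))
    (y,) = struct.unpack('<d', struct.pack('<Q', bits ^ (1 << bit)))
    return y

x = 3.141592653589793
for bit in (0, 30, 52, 62):   # low mantissa, high mantissa, low/high exponent
    print(f'bit {bit:2d}: {flip_bit(x, bit)!r}')
# Low mantissa bits barely perturb the value (often tolerable at the
# application level); exponent bits change it by orders of magnitude.
```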

    Exploiting Inherent Program Redundancy for Fault Tolerance

    Technology scaling has led to growing concerns about reliability in microprocessors. Current fault tolerance studies rely on creating explicitly redundant execution for fault detection or recovery, which usually involves significant costs in performance, power, or hardware. In our study, we find that exploiting a program's inherent redundancy achieves a better trade-off between reliability, performance, and hardware cost. This work proposes two approaches to enhance program reliability. The first approach investigates the additional fault resilience available at the application level. We explore a program correctness definition that views correctness from the application's standpoint rather than the architecture's standpoint. Under application-level correctness, multiple numerical outputs can be deemed correct as long as they are acceptable to users, so faults that cause the program to produce such outputs can also be tolerated. We find that programs which produce inexact and/or approximate outputs can be very resilient at the application level. We call such programs soft computations, and find that they are common in multimedia workloads as well as artificial intelligence (AI) workloads. Programs that compute only exact numerical outputs offer less error resilience at the application level. However, all programs that we have studied exhibit some enhanced fault resilience at the application level, including those traditionally considered exact computations, e.g., SPECInt CPU2000. We conduct fault injection experiments and evaluate the additional fault tolerance at the application level compared to the traditional architectural level. We also exploit the relaxed requirements for numerical integrity under application-level correctness to reduce checkpoint cost: our lightweight recovery mechanism checkpoints a minimal set of program state consisting of the program counter, the architectural register file, and the stack; our soft-checkpointing technique identifies computations that are resilient to errors and excludes their output state from the checkpoint. Both techniques incur much smaller runtime overhead than traditional checkpointing, yet can successfully recover from all, or a major part of, program crashes in soft computations. The second approach studies value predictability for reducing the fault rate. Value prediction is treated as additional execution, and its results are compared with the corresponding computational outputs; any mismatch between them is taken as a symptom of a potential fault and triggers a restoration process. To reduce the misprediction rate caused by limitations of the predictor itself, we characterize fault vulnerability at the instruction level and apply value prediction only to instructions that are highly susceptible to faults. We also vary the confidence-estimation threshold according to an instruction's vulnerability: instructions with high vulnerability are assigned a low confidence threshold, while instructions with low vulnerability are assigned a high confidence threshold. Our experimental results show that such selective prediction and adaptive confidence thresholds improve the balance between reliability and performance.
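
    A schematic sketch of selective, vulnerability-adaptive value prediction follows; the class, names, and thresholds are invented purely to illustrate the mechanism described above: only fault-prone instructions are predicted, and mismatches count as fault symptoms only once the predictor has earned enough confidence, with a lower bar for highly vulnerable instructions.

```python
# Selective value prediction as an error-symptom detector (toy model).
class SelectivePredictor:
    def __init__(self):
        self.last = {}    # instr_id -> last observed value
        self.conf = {}    # instr_id -> saturating confidence counter

    def observe(self, instr_id, value, vulnerability, cutoff=0.5):
        """Return True if this value looks like a fault symptom."""
        if vulnerability < cutoff:
            return False  # selective: skip low-vulnerability instructions
        # Adaptive threshold: highly vulnerable instructions get a lower
        # confidence bar, so more mismatches trigger restoration.
        threshold = 2 if vulnerability >= 0.8 else 4
        predicted = self.last.get(instr_id)
        match = (predicted == value)
        symptom = (not match) and self.conf.get(instr_id, 0) >= threshold
        self.conf[instr_id] = min(self.conf.get(instr_id, 0) + 1, 7) if match else 0
        self.last[instr_id] = value
        return symptom

p = SelectivePredictor()
for v in [5, 5, 5, 5]:                          # predictor builds confidence
    p.observe('i1', v, vulnerability=0.9)
print(p.observe('i1', 6, vulnerability=0.9))    # confident mismatch -> True
```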