182 research outputs found

    Reliability for exascale computing : system modelling and error mitigation for task-parallel HPC applications

    Get PDF
    As high performance computing (HPC) systems continue to grow, their fault rate increases. Applications running on these systems have to deal with rates on the order of hours or days. Furthermore, some studies for future Exascale systems predict the rates to be on the order of minutes. As a result, efficient fault tolerance solutions are needed to be able to tolerate frequent failures. A fault tolerance solution for future HPC and Exascale systems must be low-cost, efficient and highly scalable. It should have low overhead in fault-free execution and provide fast restart because long-running applications are expected to experience many faults during the execution. Meanwhile task-based dataflow parallel programming models (PM) are becoming a popular paradigm in HPC applications at large scale. For instance, we see the adaptation of task-based dataflow parallelism in OpenMP 4.0, OmpSs PM, Argobots and Intel Threading Building Blocks. In this thesis we propose fault-tolerance solutions for task-parallel dataflow HPC applications. Specifically, first we design and implement a checkpoint/restart and message-logging framework to recover from errors. We then develop performance models to investigate the benefits of our task-level frameworks when integrated with system-wide checkpointing. Moreover, we design and implement selective task replication mechanisms to detect and recover from silent data corruptions in task-parallel dataflow HPC applications. Finally, we introduce a runtime-based coding scheme to detect and recover from memory errors in these applications. Considering the span of all of our schemes, we see that they provide a fairly high failure coverage where both computation and memory is protected against errors.A medida que los Sistemas de Cómputo de Alto rendimiento (HPC por sus siglas en inglés) siguen creciendo, también las tasas de fallos aumentan. Las aplicaciones que se ejecutan en estos sistemas tienen una tasa de fallos que pueden estar en el orden de horas o días. Además, algunos estudios predicen que los fallos estarán en el orden de minutos en los Sistemas Exascale. Por lo tanto, son necesarias soluciones eficientes para la tolerancia a fallos que puedan tolerar fallos frecuentes. Las soluciones para tolerancia a fallos en los Sistemas futuros de HPC y Exascale tienen que ser de bajo costo, eficientes y altamente escalable. El sobrecosto en la ejecución sin fallos debe ser bajo y también se debe proporcionar reinicio rápido, ya que se espera que las aplicaciones de larga duración experimenten muchos fallos durante la ejecución. Por otra parte, los modelos de programación paralelas basados en tareas ordenadas de acuerdo a sus dependencias de datos, se están convirtiendo en un paradigma popular en aplicaciones HPC a gran escala. Por ejemplo, los siguientes modelos de programación paralela incluyen este tipo de modelo de programación OpenMP 4.0, OmpSs, Argobots e Intel Threading Building Blocks. En esta tesis proponemos soluciones de tolerancia a fallos para aplicaciones de HPC programadas en un modelo de programación paralelo basado tareas. Específicamente, en primer lugar, diseñamos e implementamos mecanismos “checkpoint/restart” y “message-logging” para recuperarse de los errores. Para investigar los beneficios de nuestras herramientas a nivel de tarea cuando se integra con los “system-wide checkpointing” se han desarrollado modelos de rendimiento. Por otra parte, diseñamos e implementamos mecanismos de replicación selectiva de tareas que permiten detectar y recuperarse de daños de datos silenciosos en aplicaciones programadas siguiendo el modelo de programación paralela basadas en tareas. Por último, se introduce un esquema de codificación que funciona en tiempo de ejecución para detectar y recuperarse de los errores de la memoria en estas aplicaciones. Todos los esquemas propuestos, en conjunto, proporcionan una cobertura bastante alta a los fallos tanto si estos se producen el cálculo o en la memoria.Postprint (published version

    Using migratable objects to enhance fault tolerance schemes in supercomputers

    Get PDF
    Supercomputers have seen an exponential increase in their size in the last two decades. Such a high growth rate is expected to take us to exascale in the timeframe 2018-2022. But, to bring a productive exascale environment about, it is necessary to focus on several key challenges. One of those challenges is fault tolerance. Machines at extreme scale will experience frequent failures and will require the system to avoid or overcome those failures. Various techniques have recently been developed to tolerate failures. The impact of these techniques and their scalability can be substantially enhanced by a parallel programming model called migratable objects. In this paper, we demonstrate how the migratable-objects model facilitates and improves several fault tolerance approaches. Our experimental results on thousands of cores suggest fault tolerance schemes based on migratable objects have low performance overhead and high scalability. Additionally, we present a performance model that predicts a significant benefit of using migratable objects to provide fault tolerance at extreme scale

    Improving Performance of Iterative Methods by Lossy Checkponting

    Get PDF
    Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks in parallel, they have to checkpoint the dynamic variables periodically in case of unavoidable fail-stop errors, requiring fast I/O systems and large storage space. To this end, significantly reducing the checkpointing overhead is critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging lossy compressors. (2) We formulate a lossy checkpointing performance model and derive theoretically an upper bound for the extra number of iterations caused by the distortion of data in lossy checkpoints, in order to guarantee the performance improvement under the lossy checkpointing scheme. (3) We analyze the impact of lossy checkpointing (i.e., extra number of iterations caused by lossy checkpointing files) for multiple types of iterative methods. (4)We evaluate the lossy checkpointing scheme with optimal checkpointing intervals on a high-performance computing environment with 2,048 cores, using a well-known scientific computation package PETSc and a state-of-the-art checkpoint/restart toolkit. Experiments show that our optimized lossy checkpointing scheme can significantly reduce the fault tolerance overhead for iterative methods by 23%~70% compared with traditional checkpointing and 20%~58% compared with lossless-compressed checkpointing, in the presence of system failures.Comment: 14 pages, 10 figures, HPDC'1

    Resiliency in numerical algorithm design for extreme scale simulations

    Get PDF
    This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 1023 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.Peer Reviewed"Article signat per 36 autors/es: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik G ̈oddeke, Marco Heisig, Fabienne Jezequel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortiz, Francesco Rizzi, Ulrich Rude, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thonnes, Andreas Wagner and Barbara Wohlmuth"Postprint (author's final draft

    Project Final Report: HPC-Colony II

    Full text link

    Scalable and Reliable Sparse Data Computation on Emergent High Performance Computing Systems

    Get PDF
    Heterogeneous systems with both CPUs and GPUs have become important system architectures in emergent High Performance Computing (HPC) systems. Heterogeneous systems must address both performance-scalability and power-scalability in the presence of failures. Aggressive power reduction pushes hardware to its operating limit and increases the failure rate. Resilience allows programs to progress when subjected to faults and is an integral component of large-scale systems, but incurs significant time and energy overhead. The future exascale systems are expected to have higher power consumption with higher fault rates. Sparse data computation is the fundamental kernel in many scientific applications. It is suitable for the studies of scalability and resilience on heterogeneous systems due to its computational characteristics. To deliver the promised performance within the given power budget, heterogeneous computing mandates a deep understanding of the interplay between scalability and resilience. Managing scalability and resilience is challenging in heterogeneous systems, due to the heterogeneous compute capability, power consumption, and varying failure rates between CPUs and GPUs. Scalability and resilience have been traditionally studied in isolation, and optimizing one typically detrimentally impacts the other. While prior works have been proved successful in optimizing scalability and resilience on CPU-based homogeneous systems, simply extending current approaches to heterogeneous systems results in suboptimal performance-scalability and/or power-scalability. To address the above multiple research challenges, we propose novel resilience and energy-efficiency technologies to optimize scalability and resilience for sparse data computation on heterogeneous systems with CPUs and GPUs. First, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes, and develop and prototype performance optimization and power management strategies to improve scalability for sparse linear solvers. Our results quantitatively reveal that each resilience scheme has its own advantages depending on the fault rate, system size, and power budget, and the forward recovery can further benefit from our performance and power optimizations for large-scale computing. Second, we design a novel resilience technique that relaxes the requirement of synchronization and identicalness for processes, and allows them to run in heterogeneous resources with power reduction. Our results show a significant reduction in energy for unmodified programs in various fault situations compared to exact replication techniques. Third, we propose a novel distributed sparse tensor decomposition that utilizes an asynchronous RDMA-based approach with OpenSHMEM to improve scalability on large-scale systems and prove that our method works well in heterogeneous systems. Our results show our irregularity-aware workload partition and balanced-asynchronous algorithms are scalable and outperform the state-of-the-art distributed implementations. We demonstrate that understanding different bottlenecks for various types of tensors plays critical roles in improving scalability

    Benefits from using mixed precision computations in the ELPA-AEO and ESSEX-II eigensolver projects

    Get PDF
    We first briefly report on the status and recent achievements of the ELPA-AEO (Eigenvalue Solvers for Petaflop Applications - Algorithmic Extensions and Optimizations) and ESSEX II (Equipping Sparse Solvers for Exascale) projects. In both collaboratory efforts, scientists from the application areas, mathematicians, and computer scientists work together to develop and make available efficient highly parallel methods for the solution of eigenvalue problems. Then we focus on a topic addressed in both projects, the use of mixed precision computations to enhance efficiency. We give a more detailed description of our approaches for benefiting from either lower or higher precision in three selected contexts and of the results thus obtained

    Scalable Energy-efficient Microarchitectures with Computational Error Tolerance

    Get PDF
    Dennard scaling of conventional semiconductor technology has reached its limit resulting in issues pertaining to leakage current and threshold voltage. Energy-savings found at the transistor level by simply lowering supply voltage are no longer available for these devices (e.g., MOSFETs) and has reached the Landauer-Shannon limit. Recent proposals of minivolt switch technologies aim to extend the technology scaling roadmap by maintaining a high on/off ratio of drain current with a much lower supply voltage. However, high intermittent error probabilities in millivolt switches constraints their Vdd reduction for traditional architectures. Thus, there is an urgent need for scalable and energy-efficient micro-architectures with computational error-tolerance. This thesis systematically leverages the error detection and correction properties of the Redundant Residue Number System (RRNS) by varying the number of non-redundant (n) and redundant (r) components (residues), and selects and discusses trade-offs about configuration points from a two-dimensional (n, r)-RRNS design plane that meet certain capabilities of error detection and/or correction. Being able to efficiently handle resilience in this (n, r)-RRNS plane significantly improves reliability, allowing further Vdd reduction and energy savings. First, the necessary implementation details of RRNS cores are discussed. Second, scalable RRNS micro-architectures that simultaneously support both error-correction and checkpointing with restart capabilities for uncorrectable errors are proposed. Third, novel RRNS-based adaptive checkpointing&restart mechanisms are designed that automatically guarantee reliability while minimizing the energy-delay product (EDP). Finally, the RRNS design space is explored to find the optimal (n, r) configuration points. For similar reliability when compared to a conventional binary core (running at high Vdd) without computational error tolerance, the proposed RRNS scalable micro-architecture reduces EDP by 53% on average for memory-intensive workloads and by 67% on average for non-memory-intensive workloads. This thesis's second topic is to alleviate fault rate and power consumption issues of exascale computing. Faults in High-Performance Computing (HPC) have become an urgent challenge with estimated Mean Time Between Failures (MTBF) of exascale system projected as only several minutes with contemporary methodologies. Unfortunately, existing error-tolerance technologies in the context of HPC systems have serious deficiencies such as insufficient error-tolerance coverage, high power consumption, and difficult integration with existing workloads. Considering Department of Energy (DOE) guidelines that limit exascale power consumption to 20 MW, this thesis highlights the issue of energy usage and proposes a thread-level fault tolerance mechanism compatible with current state-of-the art exascale programming models while simultaneously meeting the requirements of full system error protection. Additionally, an efficient micro-architecture and corresponding mechanisms that can support thread level RRNS are discussed. Experimental results show that this strategy reduces energy consumption by 62.25% and the Energy-Delay-Product by 58.67% on average when compared with state-of-the-art black box resilience techniques.Ph.D
    corecore