5 research outputs found

    A fault-tolerance protocol for parallel applications with communication imbalance

    Get PDF
    ArticuloThe predicted failure rates of future supercomputers loom the groundbreaking research large machines are expected to foster. Therefore, resilient extreme-scale applications are an absolute necessity to effectively use the new generation of supercomputers. Rollback-recovery techniques have been traditionally used in HPC to provide resilience. Among those techniques, message logging provides the appealing features of saving energy, accelerating recovery, and having low performance penalty. Its increased memory consumption is, however, an important downside. This paper introduces memory-constrained message logging (MCML), a general framework for decreasing the memory footprint of message-logging protocols. In particular, we demonstrate the effectiveness of MCML in maintaining message logging feasible for applications with substantial communication imbalance. This type of applications appear in many scientific fields. We present experimental results with several parallel codes running on up to 4,096 cores. Using those results and an analytical model, we predict MCML can reduce execution time up to 25% and energy consumption up to 15%, at extreme scale

    Prediction of energy consumption by checkpoint/restart in HPC

    Get PDF
    The fault tolerance method most used today in high-performance computing (HPC) is coordinated checkpointing. This, like any other fault tolerance method, adds additional energy consumption to that of the execution of the application. Currently, knowing and minimizing this energy consumption is a challenge. The objective of this paper is to propose a model to estimate the energy consumption of checkpoint and restart operations and a method for its construction. These estimates allow the evaluation of different scenarios in order to minimize energy consumption. We focus on coordinated checkpoint/restart at the system level, in single-program multiple-data (SPMD) applications, on homogeneous clusters. We study the behavior of the power dissipated by the compute node during a checkpoint/restart operation, as well as its execution time, considering different parameters of the system and the application. The experimentation carried out on two platforms shows the validity of the proposal. We also evaluate the impact on power and energy consumption of the processor's C states, the configuration of the network file system (NFS), where the checkpoint files are stored, and the compression of the checkpoint files. This paper contributes to the objective of predicting energy consumption in the execution of applications that use checkpoint/restart. Not counting the outliers, we can estimate the energy consumed by checkpoint/restart operations with errors lower than 7.5%

    Resource management for extreme scale high performance computing systems in the presence of failures

    Get PDF
    2018 Summer.Includes bibliographical references.High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale computation of applications over tens or hundreds of thousands of multicore processors. Unfortunately, as the size of HPC systems continues to grow towards exascale complexities, these systems experience an exponential growth in the number of failures occurring in the system. These failures reduce performance and increase energy use, reducing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer from decreased performance and increased energy use as a result of applications being forced to share resources, in particular, the contention from multiple application threads sharing the last-level cache causes performance degradation. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems. To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through the (a) optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for predictions. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention. We investigate how to better characterize and model the negative effects from system failures as well as application co-location on large-scale HPC computing systems. Our analysis of application and system behavior also investigates: the interrelated effects of network usage of applications and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of various promising strategies for fault resilience in exascale-sized systems
    corecore