89 research outputs found

    Resilient N-Body Tree Computations with Algorithm-Based Focused Recovery: Model and Performance Analysis

    This paper presents a model and performance study for Algorithm-Based Focused Recovery (ABFR) applied to N-body computations subject to latent errors. We make a detailed comparison with the classical Checkpoint/Restart (CR) approach. While the model applies to general frameworks, the performance study is limited to perfect binary trees, due to the inherent difficulty of the analysis. With ABFR, the crucial parameter is the detection interval, which bounds the error latency. We show that the detection interval has a dramatic impact on the overhead, and that choosing its value optimally leads to significant gains over the CR approach.
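    The trade-off above can be made concrete with a toy first-order overhead model. The sketch below is illustrative only: the parameter values and both cost functions are assumptions, not taken from the paper, but they show why the detection interval plays the same role for ABFR that the checkpoint period plays for CR, and why a well-chosen interval can make ABFR much cheaper.

```python
# Toy first-order overhead model contrasting Checkpoint/Restart (CR) with
# Algorithm-Based Focused Recovery (ABFR). All parameter values and both
# cost functions are illustrative assumptions, not taken from the paper.

lam = 1e-4   # latent-error rate per unit of work (assumed)
C   = 50.0   # cost of writing one checkpoint (assumed)
R   = 2000.0 # full rollback + restart cost for CR (assumed)
V   = 5.0    # cost of one error-detection sweep (assumed)

def cr_overhead(T):
    """Overhead per unit of work with checkpoint period T: checkpoint cost
    amortized over T, plus expected loss of half a period plus restart."""
    return C / T + lam * (T / 2.0 + R)

def abfr_overhead(d, frac=0.05):
    """Overhead per unit of work with detection interval d: detection cost
    amortized over d, plus recomputation of only the small fraction of the
    task graph that an error of latency <= d can have contaminated."""
    return V / d + lam * frac * d

# Scan a grid for the near-optimal setting of each scheme.
T_best = min((cr_overhead(T), T) for T in range(10, 5001, 10))
d_best = min((abfr_overhead(d), d) for d in range(1, 2001))
print(f"CR  : period   {T_best[1]:5d} -> overhead {T_best[0]:.4f}")
print(f"ABFR: interval {d_best[1]:5d} -> overhead {d_best[0]:.4f}")
```

    With these assumed numbers both schemes have an interior optimum, and ABFR's focused recomputation keeps its minimum overhead well below CR's, mirroring the paper's qualitative conclusion.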

    Scalable and Reliable Sparse Data Computation on Emergent High Performance Computing Systems

    Heterogeneous systems with both CPUs and GPUs have become important architectures in emergent High Performance Computing (HPC) systems. Heterogeneous systems must address both performance-scalability and power-scalability in the presence of failures. Aggressive power reduction pushes hardware to its operating limit and increases the failure rate. Resilience allows programs to progress when subjected to faults and is an integral component of large-scale systems, but it incurs significant time and energy overhead. Future exascale systems are expected to have higher power consumption and higher fault rates. Sparse data computation is the fundamental kernel in many scientific applications, and its computational characteristics make it well suited to studies of scalability and resilience on heterogeneous systems. To deliver the promised performance within a given power budget, heterogeneous computing demands a deep understanding of the interplay between scalability and resilience. Managing scalability and resilience is challenging in heterogeneous systems because of the heterogeneous compute capability, power consumption, and varying failure rates between CPUs and GPUs. Scalability and resilience have traditionally been studied in isolation, and optimizing one typically impacts the other detrimentally. While prior works have proven successful at optimizing scalability and resilience on CPU-based homogeneous systems, simply extending current approaches to heterogeneous systems results in suboptimal performance-scalability and/or power-scalability. To address these research challenges, we propose novel resilience and energy-efficiency techniques to optimize scalability and resilience for sparse data computation on heterogeneous systems with CPUs and GPUs. First, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes, and we develop and prototype performance-optimization and power-management strategies to improve scalability for sparse linear solvers. Our results quantitatively reveal that each resilience scheme has its own advantages depending on the fault rate, system size, and power budget, and that forward recovery can further benefit from our performance and power optimizations for large-scale computing. Second, we design a novel resilience technique that relaxes the requirement of synchronization and identicalness for processes and allows them to run on heterogeneous resources with reduced power. Our results show a significant reduction in energy for unmodified programs in various fault situations compared to exact-replication techniques. Third, we propose a novel distributed sparse tensor decomposition that uses an asynchronous RDMA-based approach with OpenSHMEM to improve scalability on large-scale systems, and we show that our method works well in heterogeneous systems. Our results show that our irregularity-aware workload partition and balanced-asynchronous algorithms are scalable and outperform state-of-the-art distributed implementations. We demonstrate that understanding the different bottlenecks for various types of tensors plays a critical role in improving scalability.
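    Since the dissertation builds on sparse kernels, a minimal sketch may help fix ideas. The code below shows a CSR sparse matrix-vector product guarded by a column-checksum test in the spirit of algorithm-based fault tolerance; the function names and the recompute-on-detection policy are illustrative assumptions, not the dissertation's actual recovery schemes.

```python
# Minimal sketch of the sparse kernel at the heart of the study: a CSR
# sparse matrix-vector product (SpMV) wrapped with a simple checksum test
# in the spirit of algorithm-based fault tolerance. The actual recovery
# schemes in the dissertation are more elaborate; this only illustrates
# the detect-then-correct idea for sparse linear solvers.
import numpy as np

def csr_spmv(vals, col_idx, row_ptr, x):
    """y = A @ x for A stored in CSR (values, column indices, row pointers)."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

def checked_spmv(vals, col_idx, row_ptr, x, tol=1e-8):
    """Detect a silent error via the identity sum(y) == colsum(A) @ x,
    and recover by recomputing (a stand-in for forward recovery)."""
    colsum = np.zeros(len(x))
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            colsum[col_idx[k]] += vals[k]
    y = csr_spmv(vals, col_idx, row_ptr, x)
    if abs(y.sum() - colsum @ x) > tol * (np.abs(colsum) @ np.abs(x) + 1.0):
        y = csr_spmv(vals, col_idx, row_ptr, x)  # recompute on detection
    return y

# 3x3 example: A = [[4,1,0],[0,3,2],[1,0,5]]
vals, col_idx, row_ptr = [4.0, 1, 3, 2, 1, 5], [0, 1, 1, 2, 0, 2], [0, 2, 4, 6]
print(checked_spmv(np.array(vals), col_idx, row_ptr, np.ones(3)))  # [5. 5. 6.]
```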

    Development and Application of High-Performance Software for Mantle Convection Simulations

    The Earth's mantle convects on a global scale, instantaneously coupling the stress field at every point to every other location. Thus, any change in the buoyancy field has an immediate impact on convection patterns worldwide. At the same time, mantle convection couples to processes at scales of a few kilometers or even a few hundred meters. Dynamic topography and the geoid are examples of such small-scale expressions of mantle convection. The depth of phase transitions also varies locally, with strong influences on the buoyancy and thus on the global stress field. In order to understand these processes dynamically, it is essential to resolve the whole mantle at very high numerical resolution. At the same time, geodynamicists are trying to answer new questions with their models, for example about the rheology of the mantle, which is most likely highly nonlinear. Moreover, because of the extremely long timescales involved, we cannot observe past mantle states, which calls for simulations backwards in time. All these issues lead to an extreme demand for computing power. To meet this demand, the physical models of the mantle have to be matched with efficient solvers and fast algorithms, so that we can exploit the enormous computing power of current and future high-performance systems. Here, we first give an extensive overview of the physical models and introduce some numerical concepts for solving the equations. We present a new two-dimensional software package as a testbed and elaborate on the implications of realistic mineralogical models for efficient mantle convection simulations. We find that phase transitions present a major challenge and suggest procedures to incorporate them into mantle convection modeling. We then introduce the high-performance mantle convection prototype HHG, a multigrid-based software framework that scales to some of the fastest computers currently available. We adapt this framework to a spherical geometry and present first application examples that answer geodynamic questions. In particular, we show that a very thin and very weak asthenosphere is dynamically plausible and consistent with direct and indirect geological observations.
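    HHG itself is a large hierarchical-hybrid-grids framework; as a stand-in for its core algorithmic idea, the following textbook-style sketch implements a geometric multigrid V-cycle for the 1D Poisson problem. Everything here (grid size, smoother weight, injection restriction) is a simplifying assumption for illustration, not HHG's implementation.

```python
# Textbook-style geometric multigrid V-cycle for -u'' = f on [0, 1] with
# homogeneous Dirichlet boundaries, illustrating the multigrid idea that
# frameworks such as HHG scale up to extreme resolutions.
import numpy as np

def relax(u, f, h, sweeps=3):
    """Weighted-Jacobi smoothing on interior points of -u'' = f."""
    for _ in range(sweeps):
        u[1:-1] += 0.8 * (0.5 * (u[:-2] + u[2:] + h * h * f[1:-1]) - u[1:-1])
    return u

def v_cycle(u, f, h):
    u = relax(u, f, h)
    if len(u) <= 3:                       # coarsest grid: smoothing suffices
        return u
    r = np.zeros_like(u)                  # residual r = f + u''
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    rc = r[::2].copy()                    # restrict by injection
    ec = v_cycle(np.zeros_like(rc), rc, 2 * h)
    e = np.zeros_like(u)                  # prolongate the correction linearly
    e[::2] = ec
    e[1:-1:2] = 0.5 * (ec[:-1] + ec[1:])
    return relax(u + e, f, h)

n = 129                                   # 2^7 + 1 points so coarsening works
x = np.linspace(0.0, 1.0, n)
f = np.pi**2 * np.sin(np.pi * x)          # exact solution: sin(pi x)
u = np.zeros(n)
for _ in range(10):
    u = v_cycle(u, f, 1.0 / (n - 1))
print("max error:", np.abs(u - np.sin(np.pi * x)).max())
```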

    Tasks Makyth Models: Machine Learning Assisted Surrogates for Tipping Points

    We present a machine learning (ML)-assisted framework bridging manifold learning, neural networks, Gaussian processes, and Equation-Free multiscale modeling, for (a) detecting tipping points in the emergent behavior of complex systems, and (b) characterizing probabilities of rare events (here, catastrophic shifts) near them. Our illustrative example is an event-driven, stochastic agent-based model (ABM) describing the mimetic behavior of traders in a simple financial market. Given high-dimensional spatiotemporal data -- generated by the stochastic ABM -- we construct reduced-order models for the emergent dynamics at different scales: (a) mesoscopic Integro-Partial Differential Equations (IPDEs); and (b) mean-field-type Stochastic Differential Equations (SDEs) embedded in a low-dimensional latent space, targeted to the neighborhood of the tipping point. We contrast the uses of the different models and the effort involved in learning them.
    Comment: 29 pages, 8 figures, 6 tables
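    To make the second ingredient concrete, the sketch below estimates a tipping (escape) probability from a low-dimensional SDE by Monte Carlo. The saddle-node normal form used here is a hand-written stand-in for the learned mean-field SDE; all coefficients are assumed for illustration.

```python
# Estimate the probability of a catastrophic shift (escape from a
# metastable state) for the saddle-node normal form
#   dx = (mu + x^2) dt + sigma dW,
# a stand-in for the learned mean-field SDE, via Euler-Maruyama.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = -0.25, 0.3           # mu < 0: stable state at -sqrt(-mu) (assumed)
dt, T, trials = 1e-2, 50.0, 2000
barrier = np.sqrt(-mu)           # unstable state; crossing it = tipping

x = np.full(trials, -barrier)               # start in the metastable well
escaped = np.zeros(trials, dtype=bool)
for _ in range(int(T / dt)):
    x = x + (mu + x**2) * dt + sigma * np.sqrt(dt) * rng.standard_normal(trials)
    escaped |= x > barrier                  # latch first-passage events
    x[escaped] = barrier                    # park escaped paths at the barrier
print(f"estimated P(tip by t={T}): {escaped.mean():.3f}")
```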

    New approaches for efficient on-the-fly FE operator assembly in a high-performance mantle convection framework


    Software for Exascale Computing - SPPEXA 2016-2019

    This open access book summarizes the research done and the results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG), presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer's series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA's first funding phase, and provides an overview of SPPEXA's contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.