18,771 research outputs found

    Towards a Mini-App for Smoothed Particle Hydrodynamics at Exascale

    Full text link
    The smoothed particle hydrodynamics (SPH) technique is a purely Lagrangian method, used in numerical simulations of fluids in astrophysics and computational fluid dynamics, among many other fields. SPH simulations with detailed physics represent computationally-demanding calculations. The parallelization of SPH codes is not trivial due to the absence of a structured grid. Additionally, the performance of the SPH codes can be, in general, adversely impacted by several factors, such as multiple time-stepping, long-range interactions, and/or boundary conditions. This work presents insights into the current performance and functionalities of three SPH codes: SPHYNX, ChaNGa, and SPH-flow. These codes are the starting point of an interdisciplinary co-design project, SPH-EXA, for the development of an Exascale-ready SPH mini-app. To gain such insights, a rotating square patch test was implemented as a common test simulation for the three SPH codes and analyzed on two modern HPC systems. Furthermore, to stress the differences with the codes stemming from the astrophysics community (SPHYNX and ChaNGa), an additional test case, the Evrard collapse, has also been carried out. This work extrapolates the common basic SPH features in the three codes for the purpose of consolidating them into a pure-SPH, Exascale-ready, optimized, mini-app. Moreover, the outcome of this serves as direct feedback to the parent codes, to improve their performance and overall scalability.Comment: 18 pages, 4 figures, 5 tables, 2018 IEEE International Conference on Cluster Computing proceedings for WRAp1

    SPH-EXA: Enhancing the Scalability of SPH codes Via an Exascale-Ready SPH Mini-App

    Full text link
    Numerical simulations of fluids in astrophysics and computational fluid dynamics (CFD) are among the most computationally-demanding calculations, in terms of sustained floating-point operations per second, or FLOP/s. It is expected that these numerical simulations will significantly benefit from the future Exascale computing infrastructures, that will perform 10^18 FLOP/s. The performance of the SPH codes is, in general, adversely impacted by several factors, such as multiple time-stepping, long-range interactions, and/or boundary conditions. In this work an extensive study of three SPH implementations SPHYNX, ChaNGa, and XXX is performed, to gain insights and to expose any limitations and characteristics of the codes. These codes are the starting point of an interdisciplinary co-design project, SPH-EXA, for the development of an Exascale-ready SPH mini-app. We implemented a rotating square patch as a joint test simulation for the three SPH codes and analyzed their performance on a modern HPC system, Piz Daint. The performance profiling and scalability analysis conducted on the three parent codes allowed to expose their performance issues, such as load imbalance, both in MPI and OpenMP. Two-level load balancing has been successfully applied to SPHYNX to overcome its load imbalance. The performance analysis shapes and drives the design of the SPH-EXA mini-app towards the use of efficient parallelization methods, fault-tolerance mechanisms, and load balancing approaches.Comment: arXiv admin note: substantial text overlap with arXiv:1809.0801

    Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

    Full text link
    We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely-used and important matrix factorizations: NMF (for physical plausability), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling and bioimaging. The data matrices are tall-and-skinny which enable the algorithms to map conveniently into Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance

    Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

    Full text link
    Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations---in particular, many linear algebra computations that are the basis for solving common machine learning problems---are significantly slower in Spark than when done using libraries written in a high-performance computing framework such as the Message-Passing Interface (MPI). To remedy this, we introduce Alchemist, a system designed to call MPI-based libraries from Apache Spark. Using Alchemist with Spark helps accelerate linear algebra, machine learning, and related computations, while still retaining the benefits of working within the Spark environment. We discuss the motivation behind the development of Alchemist, and we provide a brief overview of its design and implementation. We also compare the performances of pure Spark implementations with those of Spark implementations that leverage MPI-based codes via Alchemist. To do so, we use data science case studies: a large-scale application of the conjugate gradient method to solve very large linear systems arising in a speech classification problem, where we see an improvement of an order of magnitude; and the truncated singular value decomposition (SVD) of a 400GB three-dimensional ocean temperature data set, where we see a speedup of up to 7.9x. We also illustrate that the truncated SVD computation is easily scalable to terabyte-sized data by applying it to data sets of sizes up to 17.6TB.Comment: Accepted for publication in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, 201

    Exascale Deep Learning for Climate Analytics

    Full text link
    We extract pixel-level masks of extreme weather patterns using variants of Tiramisu and DeepLabv3+ neural networks. We describe improvements to the software frameworks, input pipeline, and the network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained throughput of 21.0 PF/s and parallel efficiency of 79.0%. DeepLabv3+ scales up to 27360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel efficiency of 90.7% in single precision. By taking advantage of the FP16 Tensor Cores, a half-precision version of the DeepLabv3+ network achieves a peak and sustained throughput of 1.13 EF/s and 999.0 PF/s respectively.Comment: 12 pages, 5 tables, 4, figures, Super Computing Conference November 11-16, 2018, Dallas, TX, US
    corecore