2,007 research outputs found

    Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

    GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of parallelization on all levels, combined with a constant focus on absolute performance. Release 4.6 of GROMACS uses SIMD acceleration on a wide range of architectures, GPU offloading acceleration, and both OpenMP and MPI parallelism within and between nodes, respectively. The recent work on acceleration made it necessary to revisit the fundamental algorithms of molecular simulation, including the concept of neighbor searching, and we discuss the present and future challenges we see for exascale simulation - in particular, a very fine-grained task parallelism. We also discuss the software management, code peer review, and continuous integration testing required for a project of this complexity. Comment: EASC 2014 conference proceedings
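    The multi-level parallelism this abstract describes (MPI between nodes, OpenMP threads within a node, SIMD within a core) can be pictured with a minimal C++ sketch; the array sizes and the placeholder kernel below are invented for illustration, and this is not GROMACS code.

    // hybrid_levels.cpp -- illustrative sketch, not GROMACS source.
    // MPI distributes work across ranks, OpenMP spreads it over cores,
    // and the inner loop is written so the compiler can emit SIMD code.
    #include <mpi.h>
    #include <omp.h>
    #include <vector>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, nranks = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const int n_local = 1 << 16;                 // elements owned by this rank (made up)
        std::vector<float> x(n_local, 1.0f), f(n_local, 0.0f);

        // Thread + SIMD level: parallel, vectorisable loop within the node.
        #pragma omp parallel for simd schedule(static)
        for (int i = 0; i < n_local; ++i)
            f[i] += 0.5f * x[i];                     // placeholder for a force kernel

        // Node level: combine a global quantity across MPI ranks.
        float local_sum = 0.0f, global_sum = 0.0f;
        for (int i = 0; i < n_local; ++i) local_sum += f[i];
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) std::printf("ranks=%d global=%g\n", nranks, global_sum);
        MPI_Finalize();
        return 0;
    }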

    SPH-EXA: Enhancing the Scalability of SPH codes Via an Exascale-Ready SPH Mini-App

    Numerical simulations of fluids in astrophysics and computational fluid dynamics (CFD) are among the most computationally demanding calculations, in terms of sustained floating-point operations per second, or FLOP/s. It is expected that these numerical simulations will significantly benefit from future Exascale computing infrastructures, which will perform 10^18 FLOP/s. The performance of smoothed particle hydrodynamics (SPH) codes is, in general, adversely impacted by several factors, such as multiple time-stepping, long-range interactions, and/or boundary conditions. In this work, an extensive study of three SPH implementations (SPHYNX, ChaNGa, and XXX) is performed to gain insights and to expose any limitations and characteristics of the codes. These codes are the starting point of an interdisciplinary co-design project, SPH-EXA, for the development of an Exascale-ready SPH mini-app. We implemented a rotating square patch as a joint test simulation for the three SPH codes and analyzed their performance on a modern HPC system, Piz Daint. The performance profiling and scalability analysis conducted on the three parent codes allowed us to expose their performance issues, such as load imbalance in both MPI and OpenMP. Two-level load balancing has been successfully applied to SPHYNX to overcome its load imbalance. The performance analysis shapes and drives the design of the SPH-EXA mini-app towards the use of efficient parallelization methods, fault-tolerance mechanisms, and load balancing approaches. Comment: arXiv admin note: substantial text overlap with arXiv:1809.0801
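    One common source of the thread-level load imbalance mentioned above is that the work per particle varies with its neighbour count; a minimal C++/OpenMP sketch of countering this with dynamic scheduling follows. The particle data and cost model are invented, and this is not code from SPHYNX, ChaNGa, or the SPH-EXA mini-app.

    // sph_balance_sketch.cpp -- illustrative sketch of dynamic work distribution.
    #include <omp.h>
    #include <vector>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const int n = 1 << 20;
        std::vector<int> neighbours(n);              // per-particle work varies widely
        for (int i = 0; i < n; ++i) neighbours[i] = 10 + std::rand() % 290;

        std::vector<double> density(n, 0.0);

        // With a static schedule, threads that draw expensive particles finish late;
        // dynamic chunks let idle threads pick up remaining work, one ingredient of
        // the load balancing discussed in the abstract.
        #pragma omp parallel for schedule(dynamic, 256)
        for (int i = 0; i < n; ++i) {
            double d = 0.0;
            for (int j = 0; j < neighbours[i]; ++j)  // stand-in for a kernel sum
                d += 1.0 / (1.0 + j);
            density[i] = d;
        }

        std::printf("density[0] = %g\n", density[0]);
        return 0;
    }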

    A compiler extension for parallelizing arrays automatically on the cell heterogeneous processor

    This paper describes the approaches taken to extend an array programming language compiler using a Virtual SIMD Machine (VSM) model for parallelizing array operations on the Cell Broadband Engine heterogeneous machine. This development is part of ongoing work at the University of Glasgow on developing array compilers that are beneficial for applications in many areas such as graphics, multimedia, image processing, and scientific computation. Our extended compiler, which is built upon the VSM interface, eases the parallelization process by enabling automatic parallelization without the need for any annotations or process directives. The preliminary results demonstrate significant improvements, especially on data-intensive applications.
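    The idea of compiling whole-array operations to parallel code without annotations can be sketched in C++ as follows; the helper below merely mimics what such a compiler/runtime might generate and does not reflect the actual VSM interface or the Glasgow compiler's output.

    // array_op_sketch.cpp -- illustrative sketch of an automatically parallelized
    // whole-array operation; not the VSM interface itself.
    #include <algorithm>
    #include <thread>
    #include <vector>
    #include <cstdio>

    // c = a + b, split into strided chunks that a compiler/runtime could map
    // onto SIMD units or accelerator cores without user annotations.
    static void array_add(const std::vector<float>& a, const std::vector<float>& b,
                          std::vector<float>& c) {
        const std::size_t n = a.size();
        const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < nthreads; ++t)
            workers.emplace_back([&, t] {
                for (std::size_t i = t; i < n; i += nthreads) c[i] = a[i] + b[i];
            });
        for (auto& w : workers) w.join();
    }

    int main() {
        std::vector<float> a(1 << 20, 1.0f), b(1 << 20, 2.0f), c(1 << 20);
        array_add(a, b, c);                          // user code: just an array expression
        std::printf("c[0] = %g\n", c[0]);
        return 0;
    }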

    DPP-PMRF: Rethinking Optimization for a Probabilistic Graphical Model Using Data-Parallel Primitives

    We present a new parallel algorithm for probabilistic graphical model optimization. The algorithm relies on data-parallel primitives (DPPs), which provide portable performance across hardware architectures. We evaluate results on CPUs and GPUs for an image segmentation problem. Compared to a serial baseline, we observe runtime speedups of up to 13X (CPU) and 44X (GPU). We also compare our performance to a reference, OpenMP-based algorithm, and find speedups of up to 7X (CPU). Comment: LDAV 2018, October 2018
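    Data-parallel primitives express an algorithm as architecture-agnostic building blocks such as map and reduce; a rough C++17 analogue using the standard parallel algorithms is sketched below. The pixel data and the thresholding step are invented, and this is neither the DPP library nor the segmentation algorithm used in the paper.

    // dpp_sketch.cpp -- map/reduce expressed with C++17 parallel algorithms as a
    // stand-in for data-parallel primitives; illustrative only.
    #include <algorithm>
    #include <execution>
    #include <numeric>
    #include <vector>
    #include <cstdio>

    int main() {
        std::vector<float> pixels(1 << 20, 0.25f);
        std::vector<int>   labels(pixels.size());

        // "Map": label each pixel independently; the same expression can target
        // CPU or GPU back ends when written against data-parallel primitives.
        std::transform(std::execution::par, pixels.begin(), pixels.end(), labels.begin(),
                       [](float p) { return p > 0.5f ? 1 : 0; });

        // "Reduce": aggregate a global quantity, e.g. the number of foreground pixels.
        long foreground = std::reduce(std::execution::par, labels.begin(), labels.end(), 0L);

        std::printf("foreground pixels: %ld\n", foreground);
        return 0;
    }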

    DISPATCH: A Numerical Simulation Framework for the Exa-scale Era. I. Fundamentals

    We introduce a high-performance simulation framework that permits the semi-independent, task-based solution of sets of partial differential equations, typically manifesting as updates to a collection of 'patches' in space-time. A hybrid MPI/OpenMP execution model is adopted, where work tasks are controlled by a rank-local 'dispatcher' which selects, from a set of tasks generally much larger than the number of physical cores (or hardware threads), tasks that are ready for updating. The definition of a task can vary, for example, with some solving the equations of ideal magnetohydrodynamics (MHD), others non-ideal MHD, radiative transfer, or particle motion, and yet others applying particle-in-cell (PIC) methods. Tasks do not have to be grid-based; those that are may use either Cartesian or orthogonal curvilinear meshes. Patches may be stationary or moving. Mesh refinement can be static or dynamic. A feature of decisive importance for the overall performance of the framework is that time steps are determined and applied locally; this allows potentially large reductions in the total number of updates required in cases where the signal speed varies greatly across the computational domain, and therefore a corresponding reduction in computing time. Another feature is a load balancing algorithm that operates 'locally' and aims to simultaneously minimise load and communication imbalance. The framework generally relies on already existing solvers, whose performance is augmented when run under the framework, due to more efficient cache usage, vectorisation, local time-stepping, plus near-linear and, in principle, unlimited OpenMP and MPI scaling. Comment: 17 pages, 8 figures. Accepted by MNRAS
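    The combination of a dispatcher and locally determined time steps can be pictured with a small C++ sketch: the task furthest behind in time is always selected next, and each task advances by its own step. The Patch structure, step sizes, and loop below are invented for illustration and are not the DISPATCH framework's code.

    // dispatcher_sketch.cpp -- toy dispatcher with per-task (local) time steps.
    #include <queue>
    #include <vector>
    #include <cstdio>

    struct Patch {
        int    id;
        double t;    // local simulation time
        double dt;   // local time step, set by the local signal speed
    };

    struct Later {   // orders the queue so the smallest local time is on top
        bool operator()(const Patch& a, const Patch& b) const { return a.t > b.t; }
    };

    int main() {
        std::priority_queue<Patch, std::vector<Patch>, Later> ready;
        ready.push({0, 0.0, 0.10});   // three patches with very different time steps
        ready.push({1, 0.0, 0.01});
        ready.push({2, 0.0, 0.05});

        const double t_end = 0.3;
        while (!ready.empty()) {
            Patch p = ready.top(); ready.pop();      // task furthest behind in time
            p.t += p.dt;                             // stand-in for an MHD/RT/PIC update
            std::printf("update patch %d to t=%.2f (dt=%.2f)\n", p.id, p.t, p.dt);
            if (p.t < t_end) ready.push(p);          // re-queue until it reaches t_end
        }
        return 0;
    }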