2,256 research outputs found

    Petascale turbulence simulation using a highly parallel fast multipole method on GPUs

    Full text link
    This paper reports large-scale direct numerical simulations of homogeneous-isotropic fluid turbulence, achieving sustained performance of 1.08 petaflop/s on GPU hardware using single precision. The simulations use a vortex particle method to solve the Navier-Stokes equations, with a highly parallel fast multipole method (FMM) as numerical engine, and match the current record in mesh size for this application, a cube of 4096^3 computational points solved with a spectral method. The standard numerical approach in this field is the pseudo-spectral method, which relies on the FFT algorithm as numerical engine. The particle-based simulations presented in this paper quantitatively match the kinetic energy spectrum obtained with a trusted pseudo-spectral code. In terms of parallel performance, weak scaling results show the FMM-based vortex method achieving 74% parallel efficiency on 4096 processes (one GPU per MPI process, 3 GPUs per node of the TSUBAME-2.0 system). The FFT-based spectral method achieves just 14% parallel efficiency on the same number of MPI processes (using only CPU cores), due to the all-to-all communication pattern of the FFT algorithm. Under these conditions, the calculation time for one time step was 108 seconds for the vortex method and 154 seconds for the spectral method. Computing with 69 billion particles, this work exceeds by an order of magnitude the largest vortex method calculations to date.
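    The velocity evaluation that the FMM accelerates in a vortex particle method is, at its core, an N-body sum over the Biot-Savart kernel. Below is a minimal C++ sketch of the direct O(N^2) summation, with an algebraic smoothing radius to regularize the singularity; the FMM replaces this double loop with a hierarchical evaluation. Names and the particular smoothing are illustrative assumptions, not the paper's implementation.

        #include <vector>
        #include <cmath>
        #include <cstddef>

        struct Vec3 { double x, y, z; };

        // Direct Biot-Savart summation for vortex particles:
        //   u_i = -1/(4*pi) * sum_j (r_ij x alpha_j) / (|r_ij|^2 + eps^2)^(3/2),
        // with r_ij = x_i - x_j, alpha_j the vector-valued particle strengths,
        // and eps a smoothing radius (regularized kernel).
        std::vector<Vec3> biot_savart_direct(const std::vector<Vec3>& pos,
                                             const std::vector<Vec3>& alpha,
                                             double eps) {
            const double c = 1.0 / (4.0 * std::acos(-1.0));  // 1/(4*pi)
            std::vector<Vec3> u(pos.size(), {0.0, 0.0, 0.0});
            for (std::size_t i = 0; i < pos.size(); ++i) {
                for (std::size_t j = 0; j < pos.size(); ++j) {
                    const double rx = pos[i].x - pos[j].x;
                    const double ry = pos[i].y - pos[j].y;
                    const double rz = pos[i].z - pos[j].z;
                    const double r2 = rx*rx + ry*ry + rz*rz + eps*eps;
                    const double w  = c / (r2 * std::sqrt(r2));
                    // cross product r_ij x alpha_j, scaled by the kernel
                    u[i].x -= w * (ry * alpha[j].z - rz * alpha[j].y);
                    u[i].y -= w * (rz * alpha[j].x - rx * alpha[j].z);
                    u[i].z -= w * (rx * alpha[j].y - ry * alpha[j].x);
                }
            }
            return u;
        }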

    Strong scaling of general-purpose molecular dynamics simulations on GPUs

    Get PDF
    We describe a highly optimized implementation of MPI domain decomposition in a GPU-enabled, general-purpose molecular dynamics code, HOOMD-blue (Anderson and Glotzer, arXiv:1308.5587). Our approach is inspired by a traditional CPU-based code, LAMMPS (Plimpton, J. Comp. Phys. 117, 1995), but is implemented within a code that was designed for execution on GPUs from the start (Anderson et al., J. Comp. Phys. 227, 2008). The software supports short-ranged pair force and bond force fields and achieves optimal GPU performance using an autotuning algorithm. We demonstrate equivalent or superior scaling on up to 3,375 GPUs in Lennard-Jones and dissipative particle dynamics (DPD) simulations of up to 108 million particles. GPUDirect RDMA capabilities in recent GPU generations provide better performance in full double precision calculations. For a representative polymer physics application, HOOMD-blue 1.0 provides an effective GPU vs. CPU node speed-up of 12.5x. Comment: 30 pages, 14 figures
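    The autotuning algorithm is only named in the abstract; a common pattern for this technique is to benchmark each kernel over a set of candidate launch configurations and keep the fastest. Here is a minimal sketch of that pattern, assuming CUDA events for timing; the candidate list, functor interface, and single-sample policy are illustrative, not HOOMD-blue's actual tuner.

        #include <cuda_runtime.h>
        #include <vector>
        #include <cfloat>

        // Time one launch per candidate block size and return the fastest.
        // `launch(block)` is a user-provided functor that enqueues the kernel
        // on the default stream; a production tuner would average several runs.
        template <typename Launch>
        unsigned int autotune_block_size(Launch launch,
                                         const std::vector<unsigned int>& candidates) {
            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);
            unsigned int best = candidates.front();
            float best_ms = FLT_MAX;
            for (unsigned int block : candidates) {
                cudaEventRecord(start);
                launch(block);
                cudaEventRecord(stop);
                cudaEventSynchronize(stop);
                float ms = 0.0f;
                cudaEventElapsedTime(&ms, start, stop);
                if (ms < best_ms) { best_ms = ms; best = block; }
            }
            cudaEventDestroy(start);
            cudaEventDestroy(stop);
            return best;
        }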

    Multi-Level Parallelism for Incompressible Flow Computations on GPU Clusters

    Get PDF
    We investigate multi-level parallelism on GPU clusters with MPI-CUDA and hybrid MPI-OpenMP-CUDA parallel implementations, in which all computations are done on the GPU using CUDA. We explore the efficiency and scalability of incompressible flow computations using up to 256 GPUs on a problem with approximately 17.2 billion cells. Our work addresses some of the unique issues faced when merging fine-grain parallelism on the GPU using CUDA with coarse-grain parallelism that uses either MPI or MPI-OpenMP for communication. We present three different strategies to overlap computations with communications, and systematically assess their impact on parallel performance on two different GPU clusters. Our results for strong and weak scaling analysis of incompressible flow computations demonstrate that GPU clusters offer significant benefits for large data sets, and that a dual-level MPI-CUDA implementation with maximum overlapping of computation and communication provides substantial benefits in performance. We also find that our tri-level MPI-OpenMP-CUDA parallel implementation does not offer a significant performance advantage over the dual-level implementation on GPU clusters with two GPUs per node. On clusters with higher GPU counts per node, or with different domain decomposition strategies, a tri-level implementation may exhibit higher efficiency than a dual-level implementation; this needs to be investigated further.
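    The overlap strategies compared in the paper share one basic idea: update the interior of each subdomain on the GPU while ghost-cell data is in flight. A minimal sketch of one such dual-level MPI-CUDA step follows, assuming pinned host buffers, two CUDA streams, and nonblocking MPI; the kernel wrappers and buffer layout are illustrative assumptions, not the paper's code.

        #include <mpi.h>
        #include <cuda_runtime.h>

        // Hypothetical kernel wrappers, assumed to enqueue work on the given stream.
        void launch_interior_update(double* d_field, cudaStream_t s);
        void launch_boundary_update(double* d_field, cudaStream_t s);

        // One step: (1) interior update runs asynchronously, (2) ghost layers are
        // exchanged with nonblocking MPI meanwhile, (3) boundary cells finish last.
        // h_send/h_recv are pinned (cudaMallocHost) staging buffers; offsets into
        // the actual ghost layers are omitted for brevity.
        void step_overlapped(double* d_field, double* h_send, double* h_recv,
                             size_t ghost_bytes, int left, int right,
                             cudaStream_t interior, cudaStream_t boundary) {
            launch_interior_update(d_field, interior);

            cudaMemcpyAsync(h_send, d_field, ghost_bytes,
                            cudaMemcpyDeviceToHost, boundary);
            cudaStreamSynchronize(boundary);

            MPI_Request reqs[2];
            MPI_Irecv(h_recv, (int)(ghost_bytes / sizeof(double)), MPI_DOUBLE,
                      left, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(h_send, (int)(ghost_bytes / sizeof(double)), MPI_DOUBLE,
                      right, 0, MPI_COMM_WORLD, &reqs[1]);
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

            cudaMemcpyAsync(d_field, h_recv, ghost_bytes,
                            cudaMemcpyHostToDevice, boundary);
            cudaStreamSynchronize(interior);   // interior update is done
            launch_boundary_update(d_field, boundary);
            cudaStreamSynchronize(boundary);
        }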

    From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

    Full text link
    Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the use of large-scale CPU/GPU systems in an efficient manner for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms. Comment: 18 pages, 4 figures, accepted for publication in Scientific Programming
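    As a concrete instance of the higher-order finite differences such a framework generates, the fourth-order central approximation to a first derivative on a uniform grid is f'(x_i) ~ (-f_{i+2} + 8 f_{i+1} - 8 f_{i-1} + f_{i-2}) / (12 h). A hand-written C++ version of that stencil is sketched below for illustration; it is not Chemora's generated code.

        #include <vector>
        #include <cstddef>

        // Fourth-order central difference d/dx on a uniform grid with spacing h.
        // Interior points only; the two cells at each end need one-sided stencils
        // or ghost zones (as multi-block frameworks such as Cactus provide).
        std::vector<double> ddx_4th_order(const std::vector<double>& f, double h) {
            std::vector<double> df(f.size(), 0.0);
            for (std::size_t i = 2; i + 2 < f.size(); ++i) {
                df[i] = (-f[i + 2] + 8.0 * f[i + 1]
                         - 8.0 * f[i - 1] + f[i - 2]) / (12.0 * h);
            }
            return df;
        }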

    Efficient CFD code implementation for the ARM-based Mont-Blanc architecture

    Get PDF
    Since 2011, the European project Mont-Blanc has been focused on enabling ARM-based technology for HPC, developing both hardware platforms and system software. The latest Mont-Blanc prototypes use system-on-chip (SoC) devices that combine a CPU and a GPU sharing a common main memory. For such a heterogeneous architecture, dedicated parallel software and well-suited implementation approaches are of crucial importance to exploit its potential efficiently. This paper is devoted to the optimizations carried out in the TermoFluids CFD code to run it efficiently on the Mont-Blanc system. The underlying numerical method is based on an unstructured finite-volume discretization of the Navier–Stokes equations for the numerical simulation of incompressible turbulent flows. It is implemented using a portable and modular operational approach based on a minimal set of linear algebra operations. An architecture-specific heterogeneous multilevel MPI+OpenMP+OpenCL implementation of these kernels is proposed, including optimizations of the storage formats, dynamic load balancing between the CPU and GPU devices, and hiding of communication overheads by overlapping computations and data transfers. A detailed performance study shows time reductions on the kernels' execution with the new heterogeneous implementation, its scalability on up to 128 Mont-Blanc nodes, and the energy savings achieved with the Mont-Blanc system versus the high-end hybrid supercomputer MinoTauro.
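    The minimal set of linear algebra operations in such codes centers on the sparse matrix-vector product, and the storage-format optimizations concern laying out its data for wide SIMD/GPU execution. Below is a sketch of SpMV in an ELLPACK-style format often chosen for such devices; the format choice and all names are illustrative assumptions, not necessarily what TermoFluids uses.

        #include <vector>
        #include <cstddef>

        // ELLPACK storage: every row is padded to max_nnz entries, stored
        // column-major (entry k of all rows contiguous) so consecutive threads
        // or SIMD lanes touch consecutive memory -- the property GPUs reward.
        struct EllMatrix {
            std::size_t rows, max_nnz;
            std::vector<double> val;  // size rows * max_nnz, padded with 0.0
            std::vector<int>    col;  // same layout; padding points at column 0
        };

        void spmv_ell(const EllMatrix& A, const std::vector<double>& x,
                      std::vector<double>& y) {
            for (std::size_t r = 0; r < A.rows; ++r) {
                double acc = 0.0;
                for (std::size_t k = 0; k < A.max_nnz; ++k) {
                    const std::size_t idx = k * A.rows + r;  // column-major
                    acc += A.val[idx] * x[A.col[idx]];
                }
                y[r] = acc;
            }
        }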

    Multiscale Universal Interface: A Concurrent Framework for Coupling Heterogeneous Solvers

    Full text link
    Concurrently coupled numerical simulations using heterogeneous solvers are powerful tools for modeling multiscale phenomena. However, major modifications to existing codes are often required to enable such simulations, posing significant difficulties in practice. In this paper we present a C++ library, the Multiscale Universal Interface (MUI), which is capable of facilitating the coupling effort for a wide range of multiscale simulations. The library adopts a header-only form with minimal external dependencies and hence can be easily dropped into existing codes. A data sampler concept is introduced, combined with a hybrid dynamic/static typing mechanism, to create an easily customizable framework for solver-independent data interpretation. The library integrates MPI MPMD support and an asynchronous communication protocol to handle inter-solver information exchange irrespective of the solvers' own MPI awareness. Template metaprogramming is heavily employed to simultaneously improve runtime performance and code flexibility. We validated the library by solving three different multiscale problems, which also serve to demonstrate the flexibility of the framework in handling heterogeneous models and solvers. In the first example, a Couette flow was simulated using two concurrently coupled Smoothed Particle Hydrodynamics (SPH) simulations of different spatial resolutions. In the second example, we coupled the deterministic SPH method with the stochastic Dissipative Particle Dynamics (DPD) method to study the effect of surface grafting on the hydrodynamic properties of the surface. In the third example, we considered conjugate heat transfer between a solid domain and a fluid domain by coupling the particle-based energy-conserving DPD (eDPD) method with the Finite Element Method (FEM). Comment: The library source code is freely available under the GPLv3 license at http://www.cfm.brown.edu/repo/release/MUI
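    The data sampler concept is what keeps such a coupling interface solver-agnostic: one code pushes scattered point-value pairs, the other fetches an interpolated value through a sampler of its choosing. The following is a minimal C++ sketch of the idea with a Gaussian-weighted sampler; the class and method names are illustrative, not MUI's actual API (see the repository linked above for that).

        #include <vector>
        #include <cmath>
        #include <cstddef>

        struct Point { double x, y, z; };

        // A tiny stand-in for a coupling interface: solver A pushes (point, value)
        // samples, solver B fetches through a spatial sampler interpreting them.
        class Interface {
            std::vector<Point>  pts_;
            std::vector<double> vals_;
        public:
            void push(const Point& p, double v) {
                pts_.push_back(p);
                vals_.push_back(v);
            }

            // Gaussian sampler: weight each sample by exp(-|r|^2 / (2 sigma^2))
            // and return the normalized weighted average at query point q.
            double fetch_gauss(const Point& q, double sigma) const {
                double num = 0.0, den = 0.0;
                for (std::size_t i = 0; i < pts_.size(); ++i) {
                    const double dx = q.x - pts_[i].x;
                    const double dy = q.y - pts_[i].y;
                    const double dz = q.z - pts_[i].z;
                    const double w  = std::exp(-(dx*dx + dy*dy + dz*dz)
                                               / (2.0 * sigma * sigma));
                    num += w * vals_[i];
                    den += w;
                }
                return den > 0.0 ? num / den : 0.0;
            }
        };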

    A Gas Kinetic Scheme Approach to Modeling and Simulating Fire on Massively Parallel Hardware

    Get PDF
    This work presents a simulation approach based on a Gas Kinetic Scheme (GKS) for the simulation of fire, implemented on massively parallel hardware in the form of Graphics Processing Units (GPUs) in the framework of General Purpose computing on Graphics Processing Units (GPGPU). Gas kinetic schemes belong to the class of kinetic methods because their governing equation is the mesoscopic Boltzmann equation rather than the macroscopic Navier-Stokes equations. Formally, kinetic methods have the advantage of a linear advection term, which simplifies discretization. GKS inherently contains the full energy equation, which is required for compressible flows. GKS provides a flux formulation derived from kinetic theory and is usually implemented as a finite volume method on cell-centered grids. In this work, we consider an implementation on nested Cartesian grids. To that end, a coupling algorithm for uniform grids of varying resolution was developed and is presented in this work. The limitation to locally uniform Cartesian grids allows an efficient implementation on GPUs, which belong to the class of many-core processors, i.e. massively parallel hardware. Multi-GPU support is also implemented, and efficiency is enhanced by communication hiding. The fluid solver is validated for several two- and three-dimensional test cases, including natural convection, turbulent natural convection and turbulent decay. It is subsequently applied to a study of boundary layer stability of natural convection in a cavity with differentially heated walls and large temperature differences. The fluid solver is further augmented by a simple combustion model for non-premixed flames. It is validated by comparison to experimental data for two different fire plumes. The results are further compared to the industry standard for fire simulation, the Fire Dynamics Simulator (FDS). While the accuracy of GKS appears slightly reduced compared to FDS, a substantial speedup in terms of time to solution is found. Finally, GKS is applied to the simulation of a compartment fire. This work shows that GKS has a large potential for efficient high performance fire simulations.
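    The finite volume update underlying the scheme is the standard cell-centered balance u_i^(n+1) = u_i^n - (dt/dx) * (F_{i+1/2} - F_{i-1/2}); what distinguishes GKS is that the interface fluxes come from a kinetic (Boltzmann-based) reconstruction rather than a macroscopic Riemann solver. A one-dimensional sketch of the update loop follows, with the kinetic flux left as an assumed callback since its derivation is beyond an abstract.

        #include <vector>
        #include <functional>
        #include <cstddef>

        // One explicit finite-volume step on a 1D cell-centered grid:
        //   u_i <- u_i - dt/dx * (F_{i+1/2} - F_{i-1/2}).
        // In a GKS the interface flux is a kinetic reconstruction from the
        // Boltzmann equation; here it is an assumed callback taking the two
        // neighboring cell states.
        void fv_step(std::vector<double>& u, double dt, double dx,
                     const std::function<double(double, double)>& kinetic_flux) {
            const std::size_t n = u.size();
            std::vector<double> flux(n + 1, 0.0);
            for (std::size_t i = 1; i < n; ++i)
                flux[i] = kinetic_flux(u[i - 1], u[i]);  // interface i-1/2
            // flux[0] and flux[n] left at 0.0 (e.g. solid-wall boundaries)
            for (std::size_t i = 0; i < n; ++i)
                u[i] -= dt / dx * (flux[i + 1] - flux[i]);
        }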