12 research outputs found

    Enhancing speed and scalability of the ParFlow simulation code

    Regional hydrology studies are often supported by high-resolution simulations of subsurface flow that require expensive and extensive computations. Efficient usage of the latest high-performance parallel computing systems therefore becomes a necessity. The simulation software ParFlow has been demonstrated to meet this requirement, showing excellent solver scalability for up to 16,384 processes. In the present work we show that the code requires further enhancements in order to fully take advantage of current petascale machines. We identify ParFlow's parallelization of the computational mesh as a central bottleneck. We propose to reorganize this subsystem using the fast mesh-partition algorithms provided by the parallel adaptive mesh refinement library p4est. We realize this in a minimally invasive manner by modifying selected parts of the code to reinterpret the existing mesh data structures. We evaluate the scaling performance of the modified version of ParFlow, demonstrating good weak and strong scaling on up to 458k cores of the Juqueen supercomputer, and test an example application at large scale.
    Comment: The final publication is available at link.springer.co
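The reorganization rests on space-filling-curve partitioning as provided by p4est. As a rough illustration of the underlying idea (a hypothetical toy sketch in Python, not ParFlow or p4est code), the cells of a uniform 2D grid can be ordered along the Morton (Z-order) curve and the curve cut into one nearly equal segment per process:

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of (x, y) to get the cell's position
    on the Morton (Z-order) space-filling curve."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits at odd positions
    return z

def partition(nx, ny, nprocs):
    """Assign each cell of an nx-by-ny grid to a process by cutting
    the Morton curve into nprocs nearly equal segments."""
    cells = sorted((morton_index(x, y), (x, y))
                   for x in range(nx) for y in range(ny))
    n = len(cells)
    owner = {}
    for rank in range(nprocs):
        lo = rank * n // nprocs
        hi = (rank + 1) * n // nprocs
        for _, cell in cells[lo:hi]:
            owner[cell] = rank
    return owner

owner = partition(4, 4, 4)   # 16 cells over 4 ranks -> 4 cells each
```

Because the curve preserves spatial locality, each curve segment tends to map to a compact set of cells, which keeps the communication surface between processes small.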

    Strong scaling for numerical weather prediction at petascale with the atmospheric model NUMA

    Numerical weather prediction (NWP) has proven to be computationally challenging due to its inherent multiscale nature. Currently, the highest resolution NWP models use a horizontal resolution of approximately 15 km. At this resolution, many important processes in the atmosphere are not resolved, which inevitably introduces errors. In order to increase the resolution of NWP models, highly scalable atmospheric models are needed. The Non-hydrostatic Unified Model of the Atmosphere (NUMA), developed by the authors at the Naval Postgraduate School, was designed to achieve this purpose. NUMA is used by the Naval Research Laboratory, Monterey, as the engine inside its next-generation weather prediction system NEPTUNE. NUMA solves the fully compressible Navier-Stokes equations by means of high-order Galerkin methods (both spectral element and discontinuous Galerkin methods can be used). Mesh generation is done using the p4est library. NUMA is capable of running middle and upper atmosphere simulations since it does not make use of the shallow-atmosphere approximation. This paper presents the performance analysis and optimization of the spectral element version of NUMA. The performance at different optimization stages is analyzed using hardware counters with the help of the Hardware Performance Monitor Toolkit as well as the PAPI library. Machine-independent optimization is compared to machine-specific optimization using BG/Q vector intrinsics. By using vector intrinsics, the main computations reach 1.2 PFlops on the entire machine Mira. The paper also presents scalability studies for two idealized test cases that are relevant for NWP applications. The atmospheric model NUMA delivers an excellent strong scaling efficiency of 99% on the entire supercomputer Mira using a mesh with 1.8 billion grid points.
    This allows us to run a global forecast of a baroclinic wave test case at 3 km uniform horizontal resolution and double precision within the time frame required for operational weather prediction.
    Financial support for the work presented in this paper was provided by the Office of Naval Research through Program Element PE-0602435N, the Air Force Office of Scientific Research through the Computational Mathematics program, and the National Science Foundation (Division of Mathematical Sciences) through program element 121760. AM, MK, and SM are grateful to the National Research Council of the National Academies. Approved for public release; distribution is unlimited.
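The quoted 99% strong scaling efficiency follows the standard definition: achieved speedup divided by the ideal linear speedup at a fixed problem size. A small Python helper makes this concrete (the numbers below are illustrative, not measurements from the paper):

```python
def strong_scaling_efficiency(t_base, p_base, t, p):
    """Strong scaling: fixed problem size, growing process count.
    Efficiency = achieved speedup / ideal linear speedup."""
    speedup = t_base / t          # how much faster the run actually got
    ideal = p / p_base            # how much faster it would ideally get
    return speedup / ideal

# illustrative: quadrupling the processes cuts runtime by ~3.96x
eff = strong_scaling_efficiency(100.0, 1024, 25.25, 4096)  # ~0.99
```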

    Lattice-Boltzmann simulations on multiple GPUs

    This thesis concerns the implementation of the lattice-Boltzmann method on multiple graphics cards. The lattice-Boltzmann method is a well-known and popular method for computing hydrodynamic simulations. Space is discretized by a lattice, and the computation operates on populations, described by a distribution function, that move across the lattice. The work builds on the software package ESPResSo, which is extended by the presented implementation, and on the library p4est, which is used to create and manage an octree-based grid. The grid created by p4est is additionally refined into patches, which are then processed in parallel on the graphics cards by CUDA code. The presented implementation uses the Message Passing Interface and is designed to run on large machines and to achieve good scaling.
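As a rough sketch of the numerics involved (a self-contained toy in pure Python, not the ESPResSo/CUDA implementation described above), one time step of the D2Q9 lattice-Boltzmann method with BGK collision looks as follows:

```python
# D2Q9 lattice: nine discrete velocities and their weights
C = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4/9] + [1/9] * 4 + [1/36] * 4

def equilibrium(rho, ux, uy):
    """Second-order expansion of the Maxwell-Boltzmann equilibrium."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(C, W):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1 + 3 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq

def step(f, nx, ny, tau=1.0):
    """One BGK collision + streaming step on a periodic nx*ny grid.
    f[i][x][y] holds the population moving with velocity C[i]."""
    # collision: relax each cell's populations toward equilibrium
    for x in range(nx):
        for y in range(ny):
            rho = sum(f[i][x][y] for i in range(9))
            ux = sum(C[i][0] * f[i][x][y] for i in range(9)) / rho
            uy = sum(C[i][1] * f[i][x][y] for i in range(9)) / rho
            feq = equilibrium(rho, ux, uy)
            for i in range(9):
                f[i][x][y] += (feq[i] - f[i][x][y]) / tau
    # streaming: move populations along their lattice velocities
    g = [[[0.0] * ny for _ in range(nx)] for _ in range(9)]
    for i, (cx, cy) in enumerate(C):
        for x in range(nx):
            for y in range(ny):
                g[i][(x + cx) % nx][(y + cy) % ny] = f[i][x][y]
    return g
```

On GPUs, the loops over lattice sites are what the CUDA kernels parallelize, typically one thread per cell.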

    Strong scaling for numerical weather prediction at petascale with the atmospheric model NUMA

    The article of record as published may be found at https://doi.org/10.1177/1094342018763966
    Numerical weather prediction (NWP) has proven to be computationally challenging due to its inherent multiscale nature. Currently, the highest resolution global NWP models use a horizontal resolution of 9 km. At this resolution, many important processes in the atmosphere are not resolved. Needless to say, this introduces errors. In order to increase the resolution of NWP models, highly scalable atmospheric models are needed. The non-hydrostatic unified model of the atmosphere (NUMA), developed by the authors at the Naval Postgraduate School, was designed to achieve this purpose. NUMA is used by the Naval Research Laboratory, Monterey as the engine inside its next generation weather prediction system NEPTUNE. NUMA solves the fully compressible Navier–Stokes equations by means of high-order Galerkin methods (both spectral element as well as discontinuous Galerkin methods can be used). NUMA is capable of running middle and upper atmosphere simulations since it does not make use of the shallow-atmosphere approximation. This article presents the performance analysis and optimization of the spectral element version of NUMA. The performance at different optimization stages is analyzed using a theoretical performance model as well as measurements via hardware counters. Machine-independent optimization is compared to machine-specific optimization using Blue Gene (BG)/Q vector intrinsics. The best portable version of the main computations was found to be about two times slower than the best non-portable version. By using vector intrinsics, the main computations reach 1.2 PFlops on the entire IBM Blue Gene supercomputer Mira (12% of the theoretical peak performance). The article also presents scalability studies for two idealized test cases that are relevant for NWP applications.
    The atmospheric model NUMA delivers an excellent strong scaling efficiency of 99% on the entire supercomputer Mira using a mesh with 1.8 billion grid points. This allows running a global forecast of a baroclinic wave test case at a 3-km uniform horizontal resolution and double precision within the time frame required for operational weather prediction.
    This work was supported by the Office of Naval Research (PE-0602435 N), the Air Force Office of Scientific Research (Computational Mathematics program), and the National Science Foundation (Division of Mathematical Sciences; 121670). This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

    Coupling of particle simulation and lattice Boltzmann background flow on adaptive grids

    The lattice-Boltzmann method as well as classical molecular dynamics are established and widely used methods for the simulation and study of soft matter. Molecular dynamics is a computer simulation technique on microscopic scales that solves the multi-body kinetic equations of the involved particles. The lattice-Boltzmann method describes the hydrodynamic interactions of fluids, gases, or other soft matter on a coarser scale. Many applications, however, are multi-scale problems and require a coupling of both methods. A basic concept for short-ranged interactions in molecular dynamics is the linked cells algorithm, which runs in O(N) time for homogeneously distributed particles. Spatially adaptive methods for the lattice-Boltzmann scheme are used in order to reduce costly scaling effects on runtime and memory for large-scale simulations. As the basis for this work, the highly flexible simulation software ESPResSo is used and extended. The adaptive lattice-Boltzmann scheme implemented in ESPResSo uses a domain decomposition with tree-based grids along the space-filling Morton curve using the p4est software library. However, coupling the regular particle simulation with the adaptive lattice-Boltzmann method on highly parallel computer architectures is a challenging issue that raises several problems. In this work, an approach for the domain decomposition of the linked cells algorithm based on space-filling curves and the p4est library is presented. In general, the grids for molecular dynamics and fluid simulations are not equal. Thus, strategies to distribute differently refined grids on parallel processes are explained, including a parallel algorithm to construct the finest common tree using p4est. Furthermore, a method for interpolation and extrapolation on adaptively refined grids, which is needed for the viscous coupling of particles with the fluid, is discussed.
    The ESPResSo simulation software is augmented by the developed methods for particle-fluid coupling as well as the Morton curve based domain decompositions in a minimally invasive manner. The original ESPResSo implementation for regular particle and fluid simulations is used as a reference for the developed algorithms.
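The linked cells algorithm mentioned above achieves O(N) complexity by binning particles into cells no smaller than the interaction cutoff, so only particles in the same or adjacent cells need to be tested. Below is a toy serial Python version for a cubic periodic box (the ESPResSo implementation is of course parallel and considerably more elaborate):

```python
def dist2(a, b, box):
    """Squared minimum-image distance in a cubic periodic box."""
    s = 0.0
    for u, v in zip(a, b):
        d = u - v
        d -= box * round(d / box)   # wrap to the nearest periodic image
        s += d * d
    return s

def linked_cells_pairs(positions, box, cutoff):
    """O(N) short-range pair search: bin particles into cells of edge
    >= cutoff, then test only same-cell and adjacent-cell particles."""
    ncell = max(1, int(box // cutoff))
    size = box / ncell
    cells = {}
    for idx, (x, y, z) in enumerate(positions):
        key = (int(x / size) % ncell, int(y / size) % ncell,
               int(z / size) % ncell)
        cells.setdefault(key, []).append(idx)
    pairs = set()
    for (cx, cy, cz), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    nb = ((cx + dx) % ncell, (cy + dy) % ncell,
                          (cz + dz) % ncell)
                    for i in members:
                        for j in cells.get(nb, ()):
                            if i < j and dist2(positions[i], positions[j],
                                               box) <= cutoff ** 2:
                                pairs.add((i, j))
    return pairs
```

Each particle is tested only against the bounded population of its 27 neighboring cells, which is what makes the cost linear in N for homogeneous distributions.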

    Large-Scale Simulations of Complex Turbulent Flows: Modulation of Turbulent Boundary Layer Separation and Optimization of Discontinuous Galerkin Methods for Next-Generation HPC Platforms

    The separation of spatially evolving turbulent boundary layer flow near regions of adverse pressure gradients has been the subject of numerous studies in the context of flow control. Although many studies have demonstrated the efficacy of passive flow control devices, such as vortex generators (VGs), in reducing the size of the separated region, the interactions between the salient flow structures produced by the VG and those of the separated flow are not fully understood. Here, wall-resolved large-eddy simulation of a model problem of flow over a backward-facing ramp is studied, with a submerged, wall-mounted cube serving as a canonical VG. In particular, the turbulent transport that results in the modulation of the separated flow over the ramp is investigated by varying the size and location of the VG and the spanwise spacing between multiple VGs, which in turn are expected to modify the interactions between the VG-induced flow structures and those of the separated region. The horseshoe vortices produced by the cube entrain the freestream turbulent flow towards the plane of symmetry. These localized regions of high vorticity correspond to turbulent kinetic energy production regions, which effectively transfer energy from the freestream to the near-wall regions.
    Numerical simulations indicate that: (i) the gradients and the fluctuations scale with the size of the cube and thus lead to more effective modulation for large cubes; (ii) for a given cube height, the upstream position of the cube affects the behavior of the horseshoe vortex---when placed too close to the leading edge, the horseshoe vortex is not sufficiently strong to affect the large-scale structures of the separated region, and when placed too far, the dispersed core of the streamwise vortex is unable to modulate the flow over the ramp; (iii) if the spanwise spacing between neighboring VGs is too small, the counter-rotating vortices are not sufficiently strong to affect the large-scale structures of the separated region, and if the spacing is too large, the flow modulation is similar to that of an isolated VG. Turbulent boundary layer flows are inherently multiscale, and numerical simulations of such systems often require high spatial and temporal resolution to capture the unsteady flow dynamics accurately. While innovations in computer hardware and distributed computing have enabled advances in the modeling of such large-scale systems, computations of many practical problems of interest remain infeasible, even on the largest supercomputers. The need for high accuracy and the evolving heterogeneous architecture of next-generation high-performance computing centers have spurred interest in the development of high-order methods. While the new class of recovery-assisted discontinuous Galerkin (RADG) methods can provide arbitrarily high orders of accuracy, the large number of degrees of freedom increases the costs associated with the arithmetic operations performed and the amount of data transferred on-node. The purpose of the second part of this thesis is to explore optimization strategies to improve the parallel efficiency of RADG.
    A cache data-tiling strategy is investigated for polynomial orders 1 through 6, which enhances the arithmetic intensity of RADG to make better use of on-node floating-point capability. In addition, a power-aware compute framework is suggested by analyzing the power-performance trade-offs when changing from double- to single-precision floating-point types---energy savings of 5 W per node are observed---which suggests that a transprecision framework will likely offer a better power-performance balance on modern HPC platforms.
    PhD, Mechanical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/163206/1/suyashtn_1.pd
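The cache data-tiling idea can be illustrated with the classic blocked matrix multiply (a generic Python sketch, not the actual RADG kernels): processing small tiles keeps the working set resident in cache, so each byte loaded from memory participates in more floating-point operations, i.e. the arithmetic intensity rises.

```python
def matmul_tiled(A, B, n, tile=4):
    """Blocked (tiled) n x n matrix multiply. Looping over tile x tile
    blocks reuses each loaded block many times before eviction, which
    is the mechanism behind cache data-tiling."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # multiply one pair of blocks; all operands fit in cache
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C
```

In Python the reordering changes nothing observable, but in a compiled kernel the same loop structure is what turns a memory-bound triple loop into one limited by floating-point throughput.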

    Scalable Algorithms for Parallel Tree-based Adaptive Mesh Refinement with General Element Types

    In this thesis, we develop, discuss and implement algorithms for scalable parallel tree-based adaptive mesh refinement (AMR) using space-filling curves (SFCs). We create an AMR software that works independently of the element type used, such as lines, triangles, tetrahedra, quadrilaterals, hexahedra, and prisms. For triangular and tetrahedral elements (simplices) with red-refinement (1:4 in 2D, 1:8 in 3D), we develop a new SFC, the tetrahedral Morton space-filling curve (TM-SFC). Its construction is similar to the Morton index for quadrilaterals/hexahedra, as it is also based on bitwise interleaving the coordinates of a certain vertex of the simplex, the anchor node. Additionally, we interleave with a new piece of information, the so-called type. For these simplices, we develop element-local algorithms such as constructing the parent, children, or face-neighbors of a simplex, and show that most of them are constant-time operations independent of the refinement level. With SFC-based partitioning it is possible that the mesh elements partitioned to one process do not form a face-connected domain. We prove the following upper bounds for the number of face-connected components of segments of the TM-SFC: with a maximum refinement level of L, the number of face-connected components is bounded by 2(L − 1) in 2D and 2L + 1 in 3D. Additionally, we perform a numerical investigation of the distribution of lengths of SFC segments. Furthermore, we develop a new approach to partition and repartition a coarse (input) mesh among the processes. Compared to previous methods, it optimizes for fine-mesh load balance and reduces the parallel communication of coarse mesh data. We discuss the coarse mesh repartitioning algorithm and demonstrate that our method repartitions a coarse mesh of 371e9 trees on 917,504 processes (405,000 trees per process) on the Juqueen supercomputer in 1.2 seconds.
    We develop an AMR concept that works independently of the element type, achieving this independence by strictly distinguishing between functions that operate on the whole mesh (high-level) and functions that locally operate on a single element or a small set of elements (low-level). We discuss a new approach to generate and manage ghost elements that fits into our element-type independent approach. We define and describe the necessary low-level algorithms. Our main idea is the computation of tree-to-tree face-neighbors of an element via the explicit construction of the element's face as a lower-dimensional element. In order to optimize the runtime of this method, we enhance the algorithm with a top-down search method from Isaac, Burstedde, Wilcox, and Ghattas, and demonstrate how it speeds up the computation by factors of 10 to 20, achieving runtimes comparable to state-of-the-art implementations with fixed element types. With the ghost algorithm we build a straightforward ripple version of the 2:1 balance algorithm. This is not an optimized version, but it serves as a feasibility study for our element-type independent approach. We implement all algorithms that we develop in this thesis in the new AMR library t8code. Our modular approach allows us to reuse existing software, which we demonstrate by using the library p4est for quadrilateral and hexahedral elements. In a concurrent Bachelor's thesis by David Knapp (INS, Bonn), the necessary low-level algorithms for prisms were developed. With t8code we demonstrate that we can create, adapt, (re-)partition, and balance meshes, as well as create and manage a ghost layer. In various tests we show excellent strong and weak scaling behavior of our algorithms on up to 917,504 parallel processes on the Juqueen and Mira supercomputers using up to 858e9 mesh elements. We conclude this thesis by demonstrating how an application can be coupled with the AMR routines.
    We implement a finite-volume based advection solver using t8code and show applications with triangular, quadrilateral, tetrahedral, and hexahedral elements, as well as 2D and 3D hybrid meshes, the latter consisting of tetrahedra, hexahedra, and prisms. Overall, we develop and demonstrate a new simplicial SFC and create a fast and scalable tree-based AMR software that offers a flexibility and generality that was previously unavailable.
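The constant-time element-local operations that tree-based AMR relies on can be sketched for the simple quadrilateral (1:4 refinement) case, using p4est-style integer anchor coordinates on a 2^max_level grid (a simplified illustration; the actual t8code and p4est APIs differ, and the thesis's TM-SFC additionally carries a type for simplices):

```python
def children(x, y, level, max_level):
    """Constant-time construction of the four children of a quadrant
    under 1:4 (red) refinement: each child anchor is the parent anchor
    shifted by half the parent's edge length."""
    h = 1 << (max_level - level - 1)   # child edge length in integer coords
    return [(x, y, level + 1),
            (x + h, y, level + 1),
            (x, y + h, level + 1),
            (x + h, y + h, level + 1)]

def parent(x, y, level, max_level):
    """Constant-time parent: mask off the anchor bits below the
    parent's refinement level."""
    h = 1 << (max_level - level)       # this element's edge length
    return (x & ~(2 * h - 1), y & ~(2 * h - 1), level - 1)
```

Because both operations are pure bit arithmetic on the anchor coordinates, their cost is independent of the refinement level, which is the property the thesis establishes for its simplicial algorithms as well.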