726 research outputs found

    A GPU-accelerated package for simulation of flow in nanoporous source rocks with many-body dissipative particle dynamics

    Full text link
    Mesoscopic simulations of hydrocarbon flow in source shales are challenging, in part due to the heterogeneous shale pores with sizes ranging from a few nanometers to a few micrometers. Additionally, the sub-continuum fluid-fluid and fluid-solid interactions in nano- to micro-scale shale pores, which are physically and chemically sophisticated, must be captured. To address those challenges, we present a GPU-accelerated package for simulation of flow in nano- to micro-pore networks with a many-body dissipative particle dynamics (mDPD) mesoscale model. Based on a fully distributed parallel paradigm, the code offloads all intensive workloads on GPUs. Other advancements, such as smart particle packing and no-slip boundary condition in complex pore geometries, are also implemented for the construction and the simulation of the realistic shale pores from 3D nanometer-resolution stack images. Our code is validated for accuracy and compared against the CPU counterpart for speedup. In our benchmark tests, the code delivers nearly perfect strong scaling and weak scaling (with up to 512 million particles) on up to 512 K20X GPUs on Oak Ridge National Laboratory's (ORNL) Titan supercomputer. Moreover, a single-GPU benchmark on ORNL's SummitDev and IBM's AC922 suggests that the host-to-device NVLink can boost performance over PCIe by a remarkable 40\%. Lastly, we demonstrate, through a flow simulation in realistic shale pores, that the CPU counterpart requires 840 Power9 cores to rival the performance delivered by our package with four V100 GPUs on ORNL's Summit architecture. This simulation package enables quick-turnaround and high-throughput mesoscopic numerical simulations for investigating complex flow phenomena in nano- to micro-porous rocks with realistic pore geometries

    Analyzing and Modeling the Performance of the HemeLB Lattice-Boltzmann Simulation Environment

    Get PDF
    We investigate the performance of the HemeLB lattice-Boltzmann simulator for cerebrovascular blood flow, aimed at providing timely and clinically relevant assistance to neurosurgeons. HemeLB is optimised for sparse geometries, supports interactive use, and scales well to 32,768 cores for problems with ~81 million lattice sites. We obtain a maximum performance of 29.5 billion site updates per second, with only an 11% slowdown for highly sparse problems (5% fluid fraction). We present steering and visualisation performance measurements and provide a model which allows users to predict the performance, thereby determining how to run simulations with maximum accuracy within time constraints.Comment: Accepted by the Journal of Computational Science. 33 pages, 16 figures, 7 table

    Developing Efficient Discrete Simulations on Multicore and GPU Architectures

    Get PDF
    In this paper we show how to efficiently implement parallel discrete simulations on multicoreandGPUarchitecturesthrougharealexampleofanapplication: acellularautomatamodel of laser dynamics. We describe the techniques employed to build and optimize the implementations using OpenMP and CUDA frameworks. We have evaluated the performance on two different hardware platforms that represent different target market segments: high-end platforms for scientific computing, using an Intel Xeon Platinum 8259CL server with 48 cores, and also an NVIDIA Tesla V100GPU,bothrunningonAmazonWebServer(AWS)Cloud;and on a consumer-oriented platform, using an Intel Core i9 9900k CPU and an NVIDIA GeForce GTX 1050 TI GPU. Performance results were compared and analyzed in detail. We show that excellent performance and scalability can be obtained in both platforms, and we extract some important issues that imply a performance degradation for them. We also found that current multicore CPUs with large core numbers can bring a performance very near to that of GPUs, and even identical in some cases.Ministerio de Economía, Industria y Competitividad, Gobierno de España (MINECO), and the Agencia Estatal de Investigación (AEI) of Spain, cofinanced by FEDER funds (EU) TIN2017-89842

    Evolution of a double-front Rayleigh-Taylor system using a GPU-based high resolution thermal Lattice-Boltzmann model

    Full text link
    We study the turbulent evolution originated from a system subjected to a Rayleigh-Taylor instability with a double density at high resolution in a 2 dimensional geometry using a highly optimized thermal Lattice Boltzmann code for GPUs. The novelty of our investigation stems from the initial condition, given by the superposition of three layers with three different densities, leading to the development of two Rayleigh-Taylor fronts that expand upward and downward and collide in the middle of the cell. By using high resolution numerical data we highlight the effects induced by the collision of the two turbulent fronts in the long time asymptotic regime. We also provide details on the optimized Lattice-Boltzmann code that we have run on a cluster of GPU

    Accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit, and customized 16-bit number formats

    Get PDF
    Fluid dynamics simulations with the lattice Boltzmann method (LBM) are very memory-intensive. Alongside reduction in memory footprint, significant performance benefits can be achieved by using FP32 (single) precision compared to FP64 (double) precision, especially on GPUs. Here, we evaluate the possibility to use even FP16 and Posit16 (half) precision for storing fluid populations, while still carrying arithmetic operations in FP32. For this, we first show that the commonly occurring number range in the LBM is a lot smaller than the FP16 number range. Based on this observation, we develop novel 16-bit formats - based on a modified IEEE-754 and on a modified Posit standard - that are specifically tailored to the needs of the LBM. We then carry out an in-depth characterization of LBM accuracy for six different test systems with increasing complexity: Poiseuille flow, Taylor-Green vortices, Karman vortex streets, lid-driven cavity, a microcapsule in shear flow (utilizing the immersed-boundary method) and finally the impact of a raindrop (based on a Volume-of-Fluid approach). We find that the difference in accuracy between FP64 and FP32 is negligible in almost all cases, and that for a large number of cases even 16-bit is sufficient. Finally, we provide a detailed performance analysis of all precision levels on a large number of hardware microarchitectures and show that significant speedup is achieved with mixed FP32/16-bit.Comment: 30 pages, 20 figures, 4 tables, 2 code listing

    Optimization of lattice Boltzmann simulations on heterogeneous computers

    Get PDF
    High-performance computing systems are more and more often based on accelerators. Computing applications targeting those systems often follow a host-driven approach, in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting overall performances. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper, we consider exactly this problem for a class of applications based on lattice Boltzmann methods, widely used in computational fluid dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts enabling the code to exploit the different parallel and vector options of the various accelerators efficiently, and matching the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads among host and accelerator, and the optimally achievable overall performance level. We test the performance of our codes and their scaling properties using, as testbeds, HPC clusters incorporating different accelerators: Intel Xeon Phi many-core processors, NVIDIA GPUs, and AMD GPUs
    corecore