A GPU-accelerated package for simulation of flow in nanoporous source rocks with many-body dissipative particle dynamics
Mesoscopic simulations of hydrocarbon flow in source shales are challenging,
in part due to the heterogeneous shale pores with sizes ranging from a few
nanometers to a few micrometers. Additionally, the sub-continuum fluid-fluid
and fluid-solid interactions in nano- to micro-scale shale pores, which are
physically and chemically complex, must be captured. To address these
challenges, we present a GPU-accelerated package for simulating flow in
nano- to micro-pore networks with a many-body dissipative particle dynamics
(mDPD) mesoscale model. Based on a fully distributed parallel paradigm, the
code offloads all compute-intensive workloads onto GPUs. Other advancements,
such as smart particle packing and a no-slip boundary condition for complex
pore geometries, are also implemented to construct and simulate realistic
shale pores from 3D nanometer-resolution stack images. Our code is
validated for accuracy and compared against the CPU counterpart for speedup. In
our benchmark tests, the code delivers nearly perfect strong scaling and weak
scaling (with up to 512 million particles) on up to 512 K20X GPUs on Oak Ridge
National Laboratory's (ORNL) Titan supercomputer. Moreover, a single-GPU
benchmark on ORNL's SummitDev and IBM's AC922 suggests that the host-to-device
NVLink can boost performance over PCIe by a remarkable 40%. Lastly, we
demonstrate, through a flow simulation in realistic shale pores, that the CPU
counterpart requires 840 Power9 cores to rival the performance delivered by our
package with four V100 GPUs on ORNL's Summit architecture. This simulation
package enables quick-turnaround and high-throughput mesoscopic numerical
simulations for investigating complex flow phenomena in nano- to micro-porous
rocks with realistic pore geometries.
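To make the mDPD model concrete, the sketch below shows the conservative part of the force the abstract refers to: a density-independent attractive term plus a density-dependent repulsive term, computed after a first pass that accumulates each particle's local weighted density. This is a minimal illustration with a naive O(N^2) neighbour loop; the names and parameters are our own, and the dissipative/random force terms and cell lists of a production code are deliberately omitted. It is not the package's actual implementation.

```cuda
// Hypothetical minimal sketch of the conservative mDPD force (Warren-style):
// naive O(N^2) neighbour search; dissipative/random terms and cell lists
// omitted for brevity. Names and parameters are illustrative assumptions.
#include <cuda_runtime.h>
#include <math.h>

struct Float3 { float x, y, z; };

// Pass 1: local weighted density  rho_i = sum_j w_rho(r_ij),
// with w_rho(r) = 15/(2*pi*rd^3) * (1 - r/rd)^2 for r < rd.
__global__ void mdpdDensity(const Float3 *pos, float *rho, int n, float rd) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float norm = 15.0f / (2.0f * 3.14159265f * rd * rd * rd);
    float acc = 0.0f;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[i].x - pos[j].x;
        float dy = pos[i].y - pos[j].y;
        float dz = pos[i].z - pos[j].z;
        float r = sqrtf(dx*dx + dy*dy + dz*dz);
        if (r < rd) { float w = 1.0f - r / rd; acc += norm * w * w; }
    }
    rho[i] = acc;
}

// Pass 2: F_ij = [ A*(1 - r/rc) + B*(rho_i + rho_j)*(1 - r/rd) ] * e_ij,
// with A < 0 (attraction), B > 0 (density-dependent repulsion), rd < rc.
__global__ void mdpdForce(const Float3 *pos, const float *rho, Float3 *frc,
                          int n, float rc, float rd, float A, float B) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Float3 f = {0.0f, 0.0f, 0.0f};
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[i].x - pos[j].x;
        float dy = pos[i].y - pos[j].y;
        float dz = pos[i].z - pos[j].z;
        float r = sqrtf(dx*dx + dy*dy + dz*dz);
        if (r >= rc || r < 1e-6f) continue;
        float mag = A * (1.0f - r / rc);
        if (r < rd) mag += B * (rho[i] + rho[j]) * (1.0f - r / rd);
        f.x += mag * dx / r; f.y += mag * dy / r; f.z += mag * dz / r;
    }
    frc[i] = f;
}
```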
Analyzing and Modeling the Performance of the HemeLB Lattice-Boltzmann Simulation Environment
We investigate the performance of the HemeLB lattice-Boltzmann simulator for
cerebrovascular blood flow, aimed at providing timely and clinically relevant
assistance to neurosurgeons. HemeLB is optimised for sparse geometries,
supports interactive use, and scales well to 32,768 cores for problems with ~81
million lattice sites. We obtain a maximum performance of 29.5 billion site
updates per second, with only an 11% slowdown for highly sparse problems (5%
fluid fraction). We present steering and visualisation performance measurements
and provide a model which allows users to predict the performance, thereby
determining how to run simulations with maximum accuracy within time
constraints.

Comment: Accepted by the Journal of Computational Science. 33 pages, 16 figures, 7 tables.
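The abstract mentions a model that lets users predict performance and stay within time constraints. As a rough illustration of the idea (not the published HemeLB model), one might estimate achievable lattice steps per second from a measured per-core site-update rate, the core count, and a sparsity penalty. All constants and the penalty form below are assumptions, chosen only to echo the figures quoted above.

```cuda
// Hypothetical back-of-the-envelope performance model, in the spirit of the
// paper's prediction model but not the model it actually publishes.
#include <cstdio>

double stepsPerSecond(double fluidSites, int cores,
                      double updatesPerCoreSec,    // measured on the target machine
                      double fluidFraction,        // e.g. 0.05 for a very sparse domain
                      double sparsePenalty = 0.11) // ~11% slowdown quoted for sparse cases
{
    double rate = cores * updatesPerCoreSec;       // aggregate site updates per second
    if (fluidFraction < 0.10) rate *= (1.0 - sparsePenalty);
    return rate / fluidSites;                      // lattice steps per wall-clock second
}

int main() {
    // ~81 million fluid sites on 32,768 cores at ~0.9 million updates/core/s
    // (0.9e6 * 32768 ~ 29.5 billion site updates/s, matching the abstract).
    printf("%.1f steps/s\n", stepsPerSecond(81e6, 32768, 0.9e6, 0.05));
    return 0;
}
```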
Developing Efficient Discrete Simulations on Multicore and GPU Architectures
In this paper we show how to efficiently implement parallel discrete simulations on multicore and GPU architectures through a real example of an application: a cellular automata model of laser dynamics. We describe the techniques employed to build and optimize the implementations using the OpenMP and CUDA frameworks. We have evaluated the performance on two hardware platforms that represent different target market segments: high-end platforms for scientific computing, using an Intel Xeon Platinum 8259CL server with 48 cores and an NVIDIA Tesla V100 GPU, both running on the Amazon Web Services (AWS) Cloud; and a consumer-oriented platform, using an Intel Core i9-9900K CPU and an NVIDIA GeForce GTX 1050 Ti GPU. Performance results were compared and analyzed in detail. We show that excellent performance and scalability can be obtained on both platforms, and we identify some important issues that cause performance degradation on them. We also found that current multicore CPUs with large core counts can deliver performance very close to that of GPUs, and even identical in some cases.

Funding: Ministerio de Economía, Industria y Competitividad, Gobierno de España (MINECO), and the Agencia Estatal de Investigación (AEI) of Spain, cofinanced by FEDER funds (EU), grant TIN2017-89842.
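The standard GPU pattern for such cellular automata is one thread per cell with double-buffered grids, so that every cell reads a consistent previous state. The sketch below illustrates that pattern; the threshold rule is a simplified stand-in for the paper's laser-dynamics transition function, and all names are illustrative assumptions.

```cuda
// Hypothetical one-thread-per-cell CA step with double buffering and a
// Moore neighbourhood. The update rule is a simplified stand-in, not the
// paper's actual laser-dynamics model.
__global__ void caStep(const int *in, int *out, int w, int h, int threshold) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy)        // 3x3 Moore neighbourhood,
        for (int dx = -1; dx <= 1; ++dx) {  // periodic boundaries
            int nx = (x + dx + w) % w;
            int ny = (y + dy + h) % h;
            sum += in[ny * w + nx];
        }
    // Illustrative rule: a cell activates when enough neighbours are active.
    out[y * w + x] = (sum >= threshold) ? 1 : 0;
}
// Host side: launch with a 2D grid and swap the in/out pointers each step.
```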
Evolution of a double-front Rayleigh-Taylor system using a GPU-based high resolution thermal Lattice-Boltzmann model
We study the turbulent evolution originating from a system subjected to a
Rayleigh-Taylor instability with a double density front, at high resolution in
a two-dimensional geometry, using a highly optimized thermal Lattice Boltzmann
code for GPUs. The novelty of our investigation stems from the initial condition,
given by the superposition of three layers with three different densities,
leading to the development of two Rayleigh-Taylor fronts that expand upward and
downward and collide in the middle of the cell. By using high resolution
numerical data we highlight the effects induced by the collision of the two
turbulent fronts in the long-time asymptotic regime. We also provide details on
the optimized Lattice-Boltzmann code that we have run on a cluster of GPUs.
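As a sketch of the three-layer setup described above, one might initialize the density field as three horizontal slabs with a small sinusoidal perturbation seeding both interfaces. The layer positions, the density ordering, and the perturbation form below are illustrative assumptions, not the authors' exact initial condition.

```cuda
// Hypothetical three-layer initial condition: bottom/middle/top slabs with
// three densities and perturbed interfaces. Choosing e.g.
// rhoTop > rhoMiddle > rhoBottom makes both interfaces Rayleigh-Taylor
// unstable, so two fronts develop and later collide near mid-cell.
#include <math.h>

__global__ void initThreeLayers(float *density, int nx, int ny,
                                float rhoBottom, float rhoMiddle, float rhoTop,
                                float amp /* interface perturbation amplitude */) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    // Sinusoidal displacement seeds the instability at both interfaces.
    float dy = amp * sinf(2.0f * 3.14159265f * x / nx);
    float y1 = ny / 3.0f + dy;          // lower interface
    float y2 = 2.0f * ny / 3.0f + dy;   // upper interface
    float rho = (y < y1) ? rhoBottom : (y < y2) ? rhoMiddle : rhoTop;
    density[y * nx + x] = rho;
}
```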
Accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit, and customized 16-bit number formats
Fluid dynamics simulations with the lattice Boltzmann method (LBM) are very
memory-intensive. Alongside reduction in memory footprint, significant
performance benefits can be achieved by using FP32 (single) precision compared
to FP64 (double) precision, especially on GPUs. Here, we evaluate the
possibility of using even FP16 and Posit16 (half) precision for storing fluid
populations, while still carrying out arithmetic operations in FP32. For this, we
first show that the commonly occurring number range in the LBM is much smaller
than the FP16 number range. Based on this observation, we develop novel 16-bit
formats - based on a modified IEEE-754 and on a modified Posit standard - that
are specifically tailored to the needs of the LBM. We then carry out an
in-depth characterization of LBM accuracy for six different test systems with
increasing complexity: Poiseuille flow, Taylor-Green vortices, Kármán vortex
streets, lid-driven cavity, a microcapsule in shear flow (utilizing the
immersed-boundary method) and finally the impact of a raindrop (based on a
Volume-of-Fluid approach). We find that the difference in accuracy between FP64
and FP32 is negligible in almost all cases, and that for a large number of
cases even 16-bit is sufficient. Finally, we provide a detailed performance
analysis of all precision levels on a large number of hardware
microarchitectures and show that significant speedup is achieved with mixed
FP32/16-bit.

Comment: 30 pages, 20 figures, 4 tables, 2 code listings.
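The core storage trick can be sketched as follows: populations live in memory as 16-bit values, shifted by the lattice weights so they cluster near zero, where half-precision resolution is highest, and are widened to FP32 for the arithmetic. The BGK collision with a precomputed equilibrium array below is our own simplification, not the paper's actual kernels or its customized 16-bit formats.

```cuda
// Hypothetical mixed-precision LBM storage: FP16 in memory (stored as
// f_i - w_i so values cluster near zero), FP32 arithmetic. D2Q9, SoA layout.
#include <cuda_fp16.h>

__constant__ float w[9] = {                 // D2Q9 lattice weights
    4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
    1.f/36, 1.f/36, 1.f/36, 1.f/36 };

__global__ void collideFp16(__half *f, int nCells, float omega,
                            const float *feq /* precomputed equilibria, FP32 */) {
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= nCells) return;
    for (int i = 0; i < 9; ++i) {
        int idx = i * nCells + cell;                 // structure-of-arrays index
        float fi = __half2float(f[idx]) + w[i];      // unpack and un-shift
        fi += omega * (feq[idx] - fi);               // BGK collision in FP32
        f[idx] = __float2half(fi - w[i]);            // re-shift and pack to FP16
    }
}
```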
Optimization of lattice Boltzmann simulations on heterogeneous computers
High-performance computing systems are increasingly based on accelerators. Computing applications targeting those systems often follow a host-driven approach, in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting overall performance. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper, we consider exactly this problem for a class of applications based on lattice Boltzmann methods, widely used in computational fluid dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts enabling the code to exploit the different parallel and vector options of the various accelerators efficiently, and matching the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads among host and accelerator, and the optimally achievable overall performance level. We test the performance of our codes and their scaling properties using, as testbeds, HPC clusters incorporating different accelerators: Intel Xeon Phi many-core processors, NVIDIA GPUs, and AMD GPUs.
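A minimal sketch of the partitioning idea: split the lattice between host and accelerator in proportion to their measured update rates, so that both finish a time step at the same moment. The function and its inputs are illustrative assumptions, not the models and metrics the paper defines.

```cuda
// Hypothetical static load balancer: assign lattice sites so that
// hostSites/hostRate == accSites/accRate, i.e. equal time per step.
struct Split { long hostSites; long accSites; };

Split partition(long totalSites, double hostRate, double accRate) {
    double frac = accRate / (hostRate + accRate);  // accelerator's share
    Split s;
    s.accSites  = (long)(frac * totalSites + 0.5);
    s.hostSites = totalSites - s.accSites;
    return s;
}
// Example: 1e8 sites, host at 0.8 GLUPS, GPU at 3.2 GLUPS
// -> the GPU gets 80% of the lattice, the host the remaining 20%.
```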