Performance and portability of accelerated lattice Boltzmann applications with OpenACC
An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power-efficient accelerators. Designing efficient applications for these systems has been troublesome in the past, as accelerators could usually be programmed only with accelerator-specific programming languages, threatening maintainability, portability, and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directives that mark regions of existing C, C++, or Fortran code to run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper, we address precisely this issue, using as a test-bench a massively parallel lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the-art architectures.
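The directive-based approach the abstract describes can be illustrated with a minimal sketch: a BGK-style relaxation loop over a 1-D lattice, annotated with an OpenACC directive. This is not the paper's actual kernel; the function name and arguments are illustrative, and with a non-OpenACC compiler the pragma is simply ignored and the loop runs sequentially on the host.

```c
#include <stddef.h>

/* Hypothetical sketch of a directive-annotated kernel: relax each
 * population f[i] toward its local equilibrium feq[i] with relaxation
 * parameter omega. The pragma asks an OpenACC compiler to offload the
 * loop; other compilers ignore it and run the loop on the CPU. */
void relax(size_t n, double omega, const double *feq, double *f) {
    #pragma acc parallel loop copy(f[0:n]) copyin(feq[0:n])
    for (size_t i = 0; i < n; i++) {
        /* BGK-style relaxation toward local equilibrium */
        f[i] = f[i] - omega * (f[i] - feq[i]);
    }
}
```

The same source thus compiles for CPUs and for accelerators, which is the portability property the paper evaluates.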
Particle-resolved thermal lattice Boltzmann simulation using OpenACC on multi-GPUs
We utilize the Open Accelerator (OpenACC) approach for graphics processing
unit (GPU) accelerated particle-resolved thermal lattice Boltzmann (LB)
simulation. We adopt the momentum-exchange method to calculate fluid-particle
interactions to preserve the simplicity of the LB method. To address load
imbalance issues, we extend the indirect addressing method to collect
fluid-particle link information at each timestep and store the indices of
fluid-particle links in a fixed index array. We simulate the sedimentation of
4,800 hot particles in cold fluids with a domain size of , and the
simulation achieves 1750 million lattice updates per second (MLUPS) on a single
GPU. Furthermore, we implement a hybrid OpenACC and message passing interface
(MPI) approach for multi-GPU accelerated simulation. This approach incorporates
four optimization strategies, including building domain lists, utilizing
request-answer communication, overlapping communications with computations, and
executing computation tasks concurrently. By reducing data communication
between GPUs, hiding communication latency through overlapping computation, and
increasing the utilization of GPU resources, we achieve improved performance,
reaching 10846 MLUPS using 8 GPUs. Our results demonstrate that the
OpenACC-based GPU acceleration is promising for particle-resolved thermal
lattice Boltzmann simulation.

Comment: 45 pages, 18 figures
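The indirect-addressing idea described above can be sketched in a few lines: rather than scanning every lattice cell for fluid-particle links in each kernel, the indices of linked cells are gathered once per timestep into a compact, fixed array, and later kernels loop only over that array. This is a simplified illustration, not the paper's implementation; the flag predicate and names are assumptions.

```c
#include <stddef.h>

/* Gather the indices i in [0, n) for which flag[i] != 0 (here standing
 * in for "cell i carries a fluid-particle link") into links[], and
 * return how many were found. Subsequent kernels then iterate over
 * links[0..count), so their work scales with the number of links
 * rather than with the domain size, mitigating load imbalance. */
size_t collect_links(size_t n, const int *flag, size_t *links) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (flag[i]) links[count++] = i;
    return count;
}
```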
Design and optimization of a portable LQCD Monte Carlo code using OpenACC
The present panorama of HPC architectures is extremely heterogeneous, ranging
from traditional multi-core CPU processors, supporting a wide class of
applications but delivering moderate computing performance, to many-core GPUs,
exploiting aggressive data-parallelism and delivering higher performances for
streaming computing applications. In this scenario, code portability (and
performance portability) becomes necessary for easy maintainability of
applications; this is very relevant in scientific computing, where code changes
are frequent, making it tedious and error-prone to keep different code
versions aligned. In this work we present the design and optimization of a
state-of-the-art production-level LQCD Monte Carlo application, using the
directive-based OpenACC programming model. OpenACC abstracts parallel
programming to a descriptive level, relieving programmers from specifying how
codes should be mapped onto the target architecture. We describe the
implementation of a code fully written in OpenACC, and show that we are able to
target several different architectures, including state-of-the-art traditional
CPUs and GPUs, with the same code. We also measure performance, evaluating the
computing efficiency of our OpenACC code on several architectures, comparing
with GPU-specific implementations and showing that a good level of
performance-portability can be reached.

Comment: 26 pages, 2 png figures; preprint of an article submitted for consideration in International Journal of Modern Physics
Optimization of lattice Boltzmann simulations on heterogeneous computers
High-performance computing systems are more and more often based on accelerators. Computing applications targeting those systems often follow a host-driven approach, in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting overall performance. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper, we consider exactly this problem for a class of applications based on lattice Boltzmann methods, widely used in computational fluid dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts enabling the code to exploit the different parallel and vector options of the various accelerators efficiently, and matching the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads among host and accelerator, and the optimally achievable overall performance level. We test the performance of our codes and their scaling properties using, as testbeds, HPC clusters incorporating different accelerators: Intel Xeon Phi many-core processors, NVIDIA GPUs, and AMD GPUs.
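A minimal sketch can convey the flavor of the partitioning models mentioned above (the paper's actual models are more detailed): for a memory-bound kernel executed concurrently on host and accelerator, both devices finish at the same time when each one's share of the lattice is proportional to its sustained memory bandwidth. The function name and the bandwidth-only assumption are illustrative.

```c
/* Balanced-split sketch for a memory-bound kernel: if the host sustains
 * bandwidth bw_host and the accelerator bw_acc on the kernel's access
 * pattern, giving the host the fraction bw_host / (bw_host + bw_acc)
 * of the sites makes both finish simultaneously. */
double host_fraction(double bw_host, double bw_acc) {
    return bw_host / (bw_host + bw_acc);
}
```

For example, a host sustaining 100 GB/s next to an accelerator sustaining 400 GB/s would be assigned 20% of the lattice under this model.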
Portable multi-node LQCD Monte Carlo simulations using OpenACC
This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code
for staggered fermions, purposely designed to be portable across different
computer architectures, including GPUs and commodity CPUs. Portability is
achieved using the OpenACC parallel programming model, used to develop a code
that can be compiled for several processor architectures. The paper focuses on
parallelization on multiple computing nodes using OpenACC to manage parallelism
within the node, and OpenMPI to manage parallelism among the nodes. We first
discuss the strategies adopted to maximize performance; we then describe
selected relevant details of the code, and finally measure the level of
performance and scaling performance that we are able to achieve. The work
focuses mainly on GPUs, which offer significantly higher performance for this
application, but also compares with results measured on other processors.

Comment: 22 pages, 8 png figures
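Multi-node codes of this kind typically cut the lattice into one slab per MPI rank along one axis, with any remainder sites spread over the first ranks. The following sketch computes such slab bounds; MPI itself is omitted, and the function name is an assumption, not taken from the paper.

```c
#include <stddef.h>

/* 1-D slab decomposition sketch: compute the half-open range
 * [*begin, *end) of lattice sites owned by `rank` out of `ranks`,
 * distributing the remainder of n_sites / ranks over the lowest
 * ranks so that slab sizes differ by at most one site. */
void slab_bounds(size_t n_sites, int ranks, int rank,
                 size_t *begin, size_t *end) {
    size_t base = n_sites / ranks;
    size_t rem  = n_sites % ranks;
    *begin = (size_t)rank * base + ((size_t)rank < rem ? (size_t)rank : rem);
    *end   = *begin + base + ((size_t)rank < rem ? 1 : 0);
}
```

Each rank would then run the OpenACC kernels on its own slab and exchange halo sites with its neighbors via MPI.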
Porting of DSMC to multi-GPUs using OpenACC
The Direct Simulation Monte Carlo (DSMC) method has become the method of choice for studying gas flows characterized by variable rarefaction and non-equilibrium effects, raising interest in industry for simulating flows in micro- and nano-electromechanical systems.
However, rarefied gas dynamics represents an open research challenge from the computer science perspective, due to its computational expense compared to continuum computational fluid dynamics methods.
Fortunately, over the last decade, high-performance computing has seen exponential growth in performance. In particular, with the breakthrough of general-purpose GPU computing, heterogeneous systems have become widely used for scientific computing, especially in large-scale clusters and supercomputers.
Nonetheless, developing efficient, maintainable and portable applications for hybrid systems is, in general, a non-trivial task.
Among the possible approaches, directive-based programming models, such as OpenACC, are considered the most promising for porting scientific codes to hybrid CPU/GPU systems, both for their simplicity and portability.
This work is an attempt to port a simplified version of the fm dsmc code developed at FLOW Matters Consultancy B.V., a start-up company supporting this project, to a multi-GPU distributed hybrid system, such as Marconi100 hosted at CINECA, using OpenACC.
Finally, we perform a detailed performance analysis of our DSMC application on a Volta (NVIDIA V100 GPU) architecture based computing platform, as well as a comparison with previous results obtained on x86_64 (Intel Xeon CPU) and ppc64le (IBM Power9 CPU) architectures.
Massively parallel lattice–Boltzmann codes on large GPU clusters
This paper describes a massively parallel code for a state-of-the-art thermal lattice–Boltzmann method. Our code has been carefully optimized for performance on one GPU and to have a good scaling behavior extending to a large number of GPUs. Versions of this code have been already used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task as codes must adapt to increasingly parallel architectures, and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally we compare the results of our GPU code with those measured on other currently available high performance processors. Our results include a production-grade code able to deliver a sustained performance of several tens of Tflops, as well as a design and optimization methodology that can be used for the development of other high performance applications for computational physics.
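A recurring design choice in GPU lattice–Boltzmann codes of this kind is the memory layout of the populations. The sketch below illustrates the common structure-of-arrays (SoA) indexing, under which consecutive GPU threads (one per site) touch consecutive addresses and loads coalesce; this is a generic illustration, not necessarily the layout chosen in the paper.

```c
#include <stddef.h>

/* Structure-of-arrays index for a lattice with n_sites sites:
 * all values of one population are stored contiguously, so threads
 * processing neighboring sites access neighboring addresses. The
 * array-of-structures alternative would be site * n_pop + pop,
 * which scatters accesses of a warp across memory. */
size_t soa_index(size_t n_sites, size_t pop, size_t site) {
    return pop * n_sites + site;
}
```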
A holistic scalable implementation approach of the lattice Boltzmann method for CPU/GPU heterogeneous clusters
This is the author accepted manuscript. The final version is available from MDPI via the DOI in this record.

Heterogeneous clusters are a widely utilized class of supercomputers assembled from
different types of computing devices, for instance CPUs and GPUs, providing a huge computational
potential. Programming them in a scalable way exploiting the maximal performance introduces
numerous challenges such as optimizations for different computing devices, dealing with multiple
levels of parallelism, the application of different programming models, work distribution, and hiding
of communication with computation. We utilize the lattice Boltzmann method for fluid flow as
a representative of a scientific computing application and develop a holistic implementation for
large-scale CPU/GPU heterogeneous clusters. We review and combine a set of best practices and
techniques ranging from optimizations for the particular computing devices to the orchestration
of tens of thousands of CPU cores and thousands of GPUs. Eventually, we come up with
an implementation using all the available computational resources for the lattice Boltzmann
method operators. Our approach shows excellent scalability behavior making it future-proof for
heterogeneous clusters of the upcoming architectures on the exaFLOPS scale. Parallel efficiencies of
more than 90% are achieved, leading to 2,604.72 GLUPS utilizing 24,576 CPU cores and 2,048 GPUs of
the CPU/GPU heterogeneous cluster Piz Daint and computing more than 6.8 · 10^9
lattice cells.

This work was supported by the German Research Foundation (DFG) as part of the
Transregional Collaborative Research Centre “Invasive Computing” (SFB/TR 89). In addition, this work was
supported by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID d68. We further
thank the Max Planck Computing & Data Facility (MPCDF) and the Global Scientific Information and Computing
Center (GSIC) for providing computational resources.
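The figures quoted in the abstract above combine two standard metrics, which can be pinned down with two small helpers; the definitions used here (GLUPS as billions of lattice-cell updates per second, and parallel efficiency as speedup divided by node count) are the conventional ones, not formulas taken from the paper.

```c
/* GLUPS: lattice-cell updates per second, in billions. */
double glups(double cells, double steps, double seconds) {
    return cells * steps / seconds / 1e9;
}

/* Parallel efficiency: speedup t1 / tn over n nodes, divided by n;
 * 1.0 means perfect linear scaling. */
double parallel_efficiency(double t1, double tn, double n_nodes) {
    return (t1 / tn) / n_nodes;
}
```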