A Flexible Patch-Based Lattice Boltzmann Parallelization Approach for Heterogeneous GPU-CPU Clusters
Sustaining a large fraction of single-GPU performance in parallel
computations is considered the major challenge of GPU-based clusters. In
this article, this topic is addressed in the context of a lattice Boltzmann
flow solver that is integrated in the WaLBerla software framework. We propose a
multi-GPU implementation using a block-structured MPI parallelization, suitable
for load balancing and heterogeneous computations on CPUs and GPUs. The
overhead required for multi-GPU simulations is discussed in detail and it is
demonstrated that the kernel performance can be sustained to a large extent.
With our GPU implementation, we achieve nearly perfect weak scalability on
InfiniBand clusters. However, in strong scaling scenarios, multi-GPU systems make
less efficient use of the hardware than IBM BG/P and x86 clusters. Hence, a cost
analysis must determine the best course of action for a particular simulation
task. Additionally, weak scaling results of heterogeneous simulations conducted
on CPUs and GPUs simultaneously are presented using clusters equipped with
varying node configurations.
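To make the communication overhead concrete, the following is a minimal sketch of the ghost-layer exchange that a block-structured multi-GPU LB solver typically performs each time step. It is illustrative only: the function and buffer names are assumptions, not WaLBerla code, and it assumes one block per GPU with two neighbors along one axis.

    // Hypothetical per-step halo exchange for a multi-GPU LB solver
    // (illustrative names; not the WaLBerla implementation).
    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange_halos(const double *d_send, double *d_ghost,
                        double *h_send, double *h_recv, size_t n,
                        int left, int right, MPI_Comm comm) {
        // Stage the outgoing boundary layer into pinned host memory.
        cudaMemcpy(h_send, d_send, n * sizeof(double), cudaMemcpyDeviceToHost);
        // Exchange with both neighbors; MPI_Sendrecv avoids deadlock.
        MPI_Sendrecv(h_send, (int)n, MPI_DOUBLE, right, 0,
                     h_recv, (int)n, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
        // Copy the received ghost layer back onto the device.
        cudaMemcpy(d_ghost, h_recv, n * sizeof(double), cudaMemcpyHostToDevice);
    }

The two extra PCIe copies per exchange are pure overhead on top of the kernel time, which is why strong scaling (shrinking per-GPU blocks) suffers more than weak scaling.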
Energy-efficiency evaluation of Intel KNL for HPC workloads
Energy consumption is increasingly becoming a limiting factor in the design
of faster large-scale parallel systems, and the development of energy-efficient and
energy-aware applications is today a relevant issue for HPC code-developer
communities. In this work we focus on energy performance of the Knights Landing
(KNL) Xeon Phi, the latest many-core architecture processor introduced by Intel
into the HPC market. We consider the 64-core Xeon Phi 7230 and
analyze its energy performance using both the on-chip MCDRAM and the regular
DDR4 system memory as the main storage for the application data domain. As a
benchmark application we use a Lattice Boltzmann code heavily optimized for
this architecture and implemented using different memory data layouts to store
its lattice. We then assess the energy consumption using different memory
data layouts, memory types (DDR4 or MCDRAM), and numbers of threads per core.
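On a flat-mode KNL node, the choice between MCDRAM and DDR4 can be made at allocation time. A minimal sketch, assuming the memkind library's high-bandwidth-memory interface is used (the abstract does not state which mechanism the authors chose):

    // Hedged sketch: place the lattice in MCDRAM or DDR4 on a flat-mode KNL.
    #include <stdlib.h>
    #include <hbwmalloc.h>   // memkind's high-bandwidth-memory allocator

    double *alloc_lattice(size_t nsites, int use_mcdram) {
        if (use_mcdram && hbw_check_available() == 0)
            return (double *)hbw_malloc(nsites * sizeof(double)); // MCDRAM
        return (double *)malloc(nsites * sizeof(double));         // DDR4
    }

Alternatively, an unmodified binary can be bound to MCDRAM wholesale with numactl --membind on a flat-mode node.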
A GPU-accelerated package for simulation of flow in nanoporous source rocks with many-body dissipative particle dynamics
Mesoscopic simulations of hydrocarbon flow in source shales are challenging,
in part due to the heterogeneous shale pores with sizes ranging from a few
nanometers to a few micrometers. Additionally, the sub-continuum fluid-fluid
and fluid-solid interactions in nano- to micro-scale shale pores, which are
physically and chemically sophisticated, must be captured. To address those
challenges, we present a GPU-accelerated package for simulation of flow in
nano- to micro-pore networks with a many-body dissipative particle dynamics
(mDPD) mesoscale model. Based on a fully distributed parallel paradigm, the
code offloads all computationally intensive workloads onto GPUs. Other advancements,
such as smart particle packing and a no-slip boundary condition in complex pore
geometries, are also implemented to construct and simulate realistic shale pores
from 3D nanometer-resolution stack images. Our code is
validated for accuracy and compared against the CPU counterpart for speedup. In
our benchmark tests, the code delivers nearly perfect strong scaling and weak
scaling (with up to 512 million particles) on up to 512 K20X GPUs on Oak Ridge
National Laboratory's (ORNL) Titan supercomputer. Moreover, a single-GPU
benchmark on ORNL's SummitDev and IBM's AC922 suggests that the host-to-device
NVLink can boost performance over PCIe by a remarkable 40%. Lastly, we
demonstrate, through a flow simulation in realistic shale pores, that the CPU
counterpart requires 840 Power9 cores to rival the performance delivered by our
package with four V100 GPUs on ORNL's Summit architecture. This simulation
package enables quick-turnaround and high-throughput mesoscopic numerical
simulations for investigating complex flow phenomena in nano- to micro-porous
rocks with realistic pore geometries.
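As an illustration of the kind of kernel such a package offloads, here is a hedged CUDA sketch of the local-density pass that distinguishes many-body DPD from standard DPD. The neighbor-list layout and all names are assumptions, not the package's actual code; the weight function is the commonly used mDPD form.

    // Hypothetical mDPD local-density kernel: rho_i = sum_j w_rho(r_ij),
    // with the standard weight w_rho(r) = 15/(2*pi*rd^3) * (1 - r/rd)^2.
    __global__ void mdpd_density(const double3 *pos, double *rho,
                                 const int *nbr, const int *nbr_cnt,
                                 int max_nbr, double rd, int n) {
        const double PI = 3.14159265358979323846;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        double acc = 0.0, norm = 15.0 / (2.0 * PI * rd * rd * rd);
        for (int k = 0; k < nbr_cnt[i]; ++k) {
            int j = nbr[i * max_nbr + k];
            double dx = pos[i].x - pos[j].x;
            double dy = pos[i].y - pos[j].y;
            double dz = pos[i].z - pos[j].z;
            double r = sqrt(dx * dx + dy * dy + dz * dz);
            if (r < rd) {
                double w = 1.0 - r / rd;
                acc += norm * w * w;   // many-body density contribution
            }
        }
        rho[i] = acc;   // feeds the density-dependent repulsive force
    }

The resulting density enters the conservative force as a density-dependent repulsion, which is what lets mDPD capture liquid-vapor coexistence inside nanoscale pores.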
Evolution of a double-front Rayleigh-Taylor system using a GPU-based high resolution thermal Lattice-Boltzmann model
We study the turbulent evolution originating from a system subject to a
Rayleigh-Taylor instability with a double density interface, at high resolution, in a
two-dimensional geometry, using a highly optimized thermal Lattice Boltzmann code
for GPUs. The novelty of our investigation stems from the initial condition,
given by the superposition of three layers with three different densities,
leading to the development of two Rayleigh-Taylor fronts that expand upward and
downward and collide in the middle of the cell. Using high-resolution
numerical data, we highlight the effects induced by the collision of the two
turbulent fronts in the long-time asymptotic regime. We also provide details on
the optimized Lattice-Boltzmann code that we have run on a cluster of GPUs.
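The three-layer initial condition can be written down compactly. A minimal sketch, with assumed grid layout and parameter names; the ordering rho_top > rho_mid > rho_bot shown here is one choice that makes both interfaces unstable, and may differ from the paper's actual configuration.

    // Hypothetical setup of the three-layer density profile that seeds two
    // Rayleigh-Taylor fronts: with rho_top > rho_mid > rho_bot both
    // interfaces are unstable, so the mixing fronts grow toward each other.
    #include <stddef.h>

    void init_density(double *rho, size_t nx, size_t ny,
                      double rho_top, double rho_mid, double rho_bot) {
        for (size_t y = 0; y < ny; ++y) {
            double v = (y >= 2 * ny / 3) ? rho_top
                     : (y >=     ny / 3) ? rho_mid
                                         : rho_bot;
            for (size_t x = 0; x < nx; ++x)
                rho[y * nx + x] = v;   // row-major 2D lattice
        }
    }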
Using accelerators to speed up scientific and engineering codes: perspectives and problems
Accelerators are quickly emerging as the leading technology to further boost
computing performance; their main feature is a massively parallel on-chip architecture. NVIDIA
and AMD GPUs and the Intel Xeon-Phi are examples of accelerators available today. Accelerators
are power-efficient and deliver up to one order of magnitude more peak performance than
traditional CPUs. However, existing codes for traditional CPUs require substantial changes to
run efficiently on accelerators, including rewriting with specific programming languages.
In this contribution we present our experience in porting large codes to NVIDIA GPU and Intel
Xeon-Phi accelerators. Our reference application is a CFD code based on the Lattice
Boltzmann (LB) method. The regular structure of LB algorithms makes them suitable for
processor architectures with a large degree of parallelism. However, the challenge of
exploiting a large fraction of the theoretically available performance is not easy to
meet. We consider a state-of-the-art two-dimensional LB model based on 37 populations (a
D2Q37 model) that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the
equation-of-state of a perfect gas.
We describe in detail how we implement and optimize our LB code for the Xeon-Phi and
for GPUs, and then analyze performance on single- and multi-accelerator systems. We
finally compare these results with those measured on recent traditional multi-core CPUs.
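As a concrete example of why the regular LB structure maps well to massively parallel hardware, here is a hedged CUDA sketch of the streaming (propagate) step for a D2Q37 lattice in a structure-of-arrays layout; kernel and array names are illustrative, not the authors' code.

    // Hypothetical D2Q37 pull-scheme streaming step. The SoA index
    // (p * ny + y) * nx + x keeps loads coalesced across a warp.
    #define NPOP 37

    __global__ void propagate(const double *src, double *dst,
                              const int *cx, const int *cy, int nx, int ny) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= nx || y >= ny) return;
        for (int p = 0; p < NPOP; ++p) {
            // Pull population p from its upstream site (periodic wrap).
            int xs = (x - cx[p] + nx) % nx;
            int ys = (y - cy[p] + ny) % ny;
            dst[(p * ny + y) * nx + x] = src[(p * ny + ys) * nx + xs];
        }
    }

Every site executes the same instructions with no data-dependent branching, which is the regularity that makes LB codes a good match for GPUs and wide vector units alike.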
Massively parallel lattice–Boltzmann codes on large GPU clusters
This paper describes a massively parallel code for a state-of-the-art thermal lattice–Boltzmann method. Our code has been carefully optimized for performance on one GPU and to have a good scaling behavior extending to a large number of GPUs. Versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task, as codes must adapt to increasingly parallel architectures and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally, we compare the results of our GPU code with those measured on other currently available high-performance processors. The outcome of this work is a production-grade code able to deliver a sustained performance of several tens of Tflops, as well as a design and optimization methodology that can be used for the development of other high-performance applications for computational physics.
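One key design choice of this kind, common to large-scale LB codes, is overlapping node-to-node communication with computation on the bulk of the lattice. The sketch below assumes hypothetical update_bulk / update_border kernels, one neighbor pair, and an already-packed send buffer; it illustrates the pattern, not the paper's code.

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Stub kernels standing in for the real collide-and-stream updates.
    __global__ void update_bulk(double *f, int n) { /* stub: interior sites */ }
    __global__ void update_border(double *f, const double *ghost, int n) { /* stub: edge sites */ }

    void lb_step(double *d_f, double *d_ghost, double *h_send, double *h_recv,
                 size_t halo_n, int left, int right, int n_bulk, MPI_Comm comm) {
        cudaStream_t bulk, halo;
        cudaStreamCreate(&bulk);
        cudaStreamCreate(&halo);
        // Interior sites need no remote data: launch them immediately so the
        // kernel runs while the MPI exchange is in flight.
        update_bulk<<<(n_bulk + 255) / 256, 256, 0, bulk>>>(d_f, n_bulk);
        MPI_Sendrecv(h_send, (int)halo_n, MPI_DOUBLE, right, 0,
                     h_recv, (int)halo_n, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
        cudaMemcpyAsync(d_ghost, h_recv, halo_n * sizeof(double),
                        cudaMemcpyHostToDevice, halo);
        // Border sites run once the ghost layer is on the device.
        update_border<<<((int)halo_n + 255) / 256, 256, 0, halo>>>(d_f, d_ghost, (int)halo_n);
        cudaDeviceSynchronize();
        cudaStreamDestroy(bulk);
        cudaStreamDestroy(halo);
    }

With this overlap, the step time tends toward max(T_bulk, T_comm) instead of their sum, which is what preserves scaling as the node count grows.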
A Simulation Suite for Lattice-Boltzmann based Real-Time CFD Applications Exploiting Multi-Level Parallelism on Modern Multi- and Many-Core Architectures
We present a software approach to hardware-oriented numerics which builds upon an augmented, previously published open-source set of libraries facilitating portable code development and optimisation on a wide range of modern computer architectures. In order to maximise efficiency, we exploit all levels of parallelism, including vectorisation within CPU cores, the Cell BE and GPUs, shared memory thread-level parallelism between cores, and parallelism between heterogeneous distributed memory resources in clusters. To evaluate and validate our approach, we implement a collection of modular building blocks for the easy and fast assembly and development of CFD applications based on the shallow water equations: we combine the Lattice-Boltzmann method with fluid-structure interaction techniques in order to achieve real-time simulations targeting interactive virtual environments. Our results demonstrate that recent multi-core CPUs outperform the Cell BE, while GPUs are significantly faster than conventional multi-threaded SSE code. In addition, we verify good scalability properties of our application on small clusters.
Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications
Energy efficiency is becoming increasingly important for computing systems,
in particular for large-scale HPC facilities. In this work we evaluate, from a
user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS)
techniques, assisted by the power and energy monitoring capabilities of modern
processors in order to tune applications for energy efficiency. We run selected
kernels and a full HPC application on two high-end processors widely used in
the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate
the available trade-offs between energy-to-solution and time-to-solution,
attempting a function-by-function frequency tuning. We finally estimate the
benefits obtainable by running the full code on an HPC multi-GPU node, with respect
to default clock frequency governors. We instrument our code to accurately
monitor power consumption and execution time without the need for any additional
hardware, and we enable it to change CPUs and GPUs clock frequencies while
running. We analyze our results on the different architectures using a simple
energy-performance model, and derive a number of energy-saving strategies that
can be easily adopted on recent high-end HPC systems for generic applications.
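For reference, both the clock control and the power readings on the GPU side are exposed to user code through NVIDIA's NVML library. A minimal sketch follows; error handling is omitted, setting application clocks may require administrative permissions, and the clock pair shown is one valid K80 setting, not necessarily the one used in the paper.

    #include <stdio.h>
    #include <nvml.h>

    int main(void) {
        nvmlDevice_t dev;
        unsigned int mw;
        nvmlInit();
        nvmlDeviceGetHandleByIndex(0, &dev);
        // Pin the application clocks: memory MHz, then SM MHz.
        nvmlDeviceSetApplicationsClocks(dev, 2505, 562);
        // Sample the instantaneous board power draw (milliwatts).
        nvmlDeviceGetPowerUsage(dev, &mw);
        printf("GPU power: %.1f W\n", mw / 1000.0);
        nvmlShutdown();
        return 0;
    }

Sampling power around individual functions while changing clocks per function is enough to map out the energy-to-solution versus time-to-solution trade-off without external meters.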