702 research outputs found
Three Dimensional Pseudo-Spectral Compressible Magnetohydrodynamic GPU Code for Astrophysical Plasma Simulation
This paper presents the benchmarking and scaling studies of a GPU accelerated
three dimensional compressible magnetohydrodynamic code. The code is developed
keeping an eye to explain the large and intermediate scale magnetic field
generation is cosmos as well as in nuclear fusion reactors in the light of the
theory given by Eugene Newman Parker. The spatial derivatives of the code are
pseudo-spectral method based and the time solvers are explicit. GPU
acceleration is achieved with minimal code changes through OpenACC
parallelization and use of NVIDIA CUDA Fast Fourier Transform library (cuFFT).
NVIDIAs unified memory is leveraged to enable over-subscription of the GPU
device memory for seamless out-of-core processing of large grids. Our
experimental results indicate that the GPU accelerated code is able to achieve
upto two orders of magnitude speedup over a corresponding OpenMP parallel, FFTW
library based code, on a NVIDIA Tesla P100 GPU. For large grids that require
out-of-core processing on the GPU, we see a 7x speedup over the OpenMP, FFTW
based code, on the Tesla P100 GPU. We also present performance analysis of the
GPU accelerated code on different GPU architectures - Kepler, Pascal and Volta
Adaptive Mesh Fluid Simulations on GPU
We describe an implementation of compressible inviscid fluid solvers with
block-structured adaptive mesh refinement on Graphics Processing Units using
NVIDIA's CUDA. We show that a class of high resolution shock capturing schemes
can be mapped naturally on this architecture. Using the method of lines
approach with the second order total variation diminishing Runge-Kutta time
integration scheme, piecewise linear reconstruction, and a Harten-Lax-van Leer
Riemann solver, we achieve an overall speedup of approximately 10 times faster
execution on one graphics card as compared to a single core on the host
computer. We attain this speedup in uniform grid runs as well as in problems
with deep AMR hierarchies. Our framework can readily be applied to more general
systems of conservation laws and extended to higher order shock capturing
schemes. This is shown directly by an implementation of a magneto-hydrodynamic
solver and comparing its performance to the pure hydrodynamic case. Finally, we
also combined our CUDA parallel scheme with MPI to make the code run on GPU
clusters. Close to ideal speedup is observed on up to four GPUs.Comment: Submitted to New Astronom
Production Level CFD Code Acceleration for Hybrid Many-Core Architectures
In this work, a novel graphics processing unit (GPU) distributed sharing model for hybrid many-core architectures is introduced and employed in the acceleration of a production-level computational fluid dynamics (CFD) code. The latest generation graphics hardware allows multiple processor cores to simultaneously share a single GPU through concurrent kernel execution. This feature has allowed the NASA FUN3D code to be accelerated in parallel with up to four processor cores sharing a single GPU. For codes to scale and fully use resources on these and the next generation machines, codes will need to employ some type of GPU sharing model, as presented in this work. Findings include the effects of GPU sharing on overall performance. A discussion of the inherent challenges that parallel unstructured CFD codes face in accelerator-based computing environments is included, with considerations for future generation architectures. This work was completed by the author in August 2010, and reflects the analysis and results of the time
Semi‐automatic porting of a large‐scale CFD code to multi‐graphics processing unit clusters
A typical large‐scale CFD code based on adaptive, edge‐based finite‐element formulations for the solution of compressible and incompressible flow is taken as a test bed to port such codes to graphics hardware (graphics processing units, GPUs) using semi‐automatic techniques. In previous work, a GPU version of this code was presented, in which, for many run configurations, all mesh‐sized loops required throughout time stepping were ported. This approach simultaneously achieves the fine‐grained parallelism required to fully exploit the capabilities of many‐core GPUs, completely avoids the crippling bottleneck of GPU–CPU data transfer, and uses a transposed memory layout to meet the distinct memory access requirements posed by GPUs. The present work describes the next step of this porting effort, namely to integrate GPU‐based, fine‐grained parallelism with Message‐Passing‐Interface‐based, coarse‐grained parallelism, in order to achieve a code capable of running on multi‐GPU clusters. This is carried out in a semi‐automated fashion: the existing Fortran–Message Passing Interface code is preserved, with the translator inserting data transfer calls as required. Performance benchmarks indicate up to a factor of 2 performance advantage of the NVIDIA Tesla M2050 GPU (Santa Clara, CA, USA) over the six‐core Intel Xeon X5670 CPU (Santa Clara, CA, USA), for certain run configurations. In addition, good scalability is observed when running across multiple GPUs. The approach should be of general interest, as how best to run on GPUs is being presently considered for many so‐called legacy codes
Computational fluid dynamics using Graphics Processing Units: Challenges and opportunities
A new paradigm for computing fluid flows is the use of Graphics Processing Units (GPU), which have recently become very powerful and convenient to use. In the past three years, we have implemented five different fluid flow algorithms on GPUs and have obtained significant speed-ups over a single CPU. Typically, it is possible to achieve a factor of 50-100 over a single CPU. In this review paper, we describe our experiences on the various algorithms developed and the speeds achieved
An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters
Modern graphics processing units (GPUs) with many-core architectures have emerged as general-purpose parallel computing platforms that can accelerate simulation science applications tremendously. While multi-GPU workstations with several TeraFLOPS of peak computing power are available to accelerate computational problems, larger problems require even more resources. Conventional clusters of central processing units (CPU) are now being augmented with multiple GPUs in each compute-node to tackle large problems. The heterogeneous architecture of a multi-GPU cluster with a deep memory hierarchy creates unique challenges in developing scalable and efficient simulation codes. In this study, we pursue mixed MPI-CUDA implementations and investigate three strategies to probe the efficiency and scalability of incompressible flow computations on the Lincoln Tesla cluster at the National Center for Supercomputing Applications (NCSA). We exploit some of the advanced features of MPI and CUDA programming to overlap both GPU data transfer and MPI communications with computations on the GPU. We sustain approximately 2.4 TeraFLOPS on the 64 nodes of the NCSA Lincoln Tesla cluster using 128 GPUs with a total of 30,720 processing elements. Our results demonstrate that multi-GPU clusters can substantially accelerate computational fluid dynamics (CFD) simulations
- …