Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs
Many problems in geophysical and atmospheric modelling require the fast
solution of elliptic partial differential equations (PDEs) in "flat" three
dimensional geometries. In particular, an anisotropic elliptic PDE for the
pressure correction has to be solved at every time step in the dynamical core
of many numerical weather prediction models, and equations of a very similar
structure arise in global ocean models, subsurface flow simulations and gas and
oil reservoir modelling. The elliptic solve is often the bottleneck of the
forecast, and an algorithmically optimal method has to be used and implemented
efficiently. Graphics Processing Units have been shown to be highly efficient
for a wide range of applications in scientific computing, and recently
iterative solvers have been parallelised on these architectures. We describe
the GPU implementation and optimisation of a Preconditioned Conjugate Gradient
(PCG) algorithm for the solution of a three dimensional anisotropic elliptic
PDE for the pressure correction in NWP. Our implementation exploits the strong
vertical anisotropy of the elliptic operator in the construction of a suitable
preconditioner. As the algorithm is memory bound, performance can be improved
significantly by reducing the amount of global memory access. We achieve this
by using a matrix-free implementation which does not require explicit storage
of the matrix and instead recalculates the local stencil. Global memory access
can also be reduced by rewriting the algorithm using loop fusion and we show
that this further reduces the runtime on the GPU. We demonstrate the
performance of our matrix-free GPU code by comparing it to a sequential CPU
implementation and to a matrix-explicit GPU code which uses existing libraries.
The absolute performance of the algorithm for different problem sizes is
quantified in terms of floating point throughput and global memory bandwidth.
Comment: 18 pages, 7 figures
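As a rough illustration of the matrix-free idea in the abstract above, the following sketch applies a simplified two-dimensional model operator by recomputing its stencil on the fly instead of storing a matrix, and uses it inside a preconditioned conjugate gradient loop. Everything specific here is an assumption made for illustration and is not taken from the paper: the model operator A = I - alpha*D_xx - beta*D_zz with zero Dirichlet boundaries, the vertical line (tridiagonal) preconditioner, the function names, and the use of plain sequential Python rather than CUDA.

```python
def apply_operator(u, nx, nz, alpha, beta):
    """Matrix-free y = A u for the assumed model operator
    A = I - alpha*D_xx - beta*D_zz with zero Dirichlet boundaries.
    The stencil is recomputed at every point; no matrix is stored."""
    y = [0.0] * (nx * nz)
    centre = 1.0 + 2.0 * alpha + 2.0 * beta
    for i in range(nx):
        for k in range(nz):
            c = i * nz + k  # vertical index k runs fastest
            left = u[c - nz] if i > 0 else 0.0
            right = u[c + nz] if i < nx - 1 else 0.0
            down = u[c - 1] if k > 0 else 0.0
            up = u[c + 1] if k < nz - 1 else 0.0
            y[c] = centre * u[c] - alpha * (left + right) - beta * (down + up)
    return y


def vertical_line_solve(r, nx, nz, alpha, beta):
    """Assumed preconditioner: solve the vertical (tridiagonal) part of A
    exactly in each column with the Thomas algorithm."""
    diag = 1.0 + 2.0 * alpha + 2.0 * beta
    z = [0.0] * (nx * nz)
    for i in range(nx):
        base = i * nz
        cp = [0.0] * nz
        dp = [0.0] * nz
        cp[0] = -beta / diag
        dp[0] = r[base] / diag
        for k in range(1, nz):  # forward elimination
            denom = diag + beta * cp[k - 1]
            cp[k] = -beta / denom
            dp[k] = (r[base + k] + beta * dp[k - 1]) / denom
        z[base + nz - 1] = dp[nz - 1]
        for k in range(nz - 2, -1, -1):  # back substitution
            z[base + k] = dp[k] - cp[k] * z[base + k + 1]
    return z


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pcg(b, nx, nz, alpha, beta, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient; returns (solution, iterations)."""
    x = [0.0] * len(b)
    b_norm = dot(b, b) ** 0.5
    if b_norm == 0.0:
        return x, 0
    r = list(b)
    z = vertical_line_solve(r, nx, nz, alpha, beta)
    p = list(z)
    rz = dot(r, z)
    for it in range(1, max_iter + 1):
        Ap = apply_operator(p, nx, nz, alpha, beta)
        step = rz / dot(p, Ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        z = vertical_line_solve(r, nx, nz, alpha, beta)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

Choosing beta much larger than alpha mimics a strong vertical anisotropy, which is the regime where a column-wise preconditioner of this kind would be expected to help. On a GPU the separate vector updates in the loop would be natural candidates for the loop fusion mentioned in the abstract, though this sketch does not attempt it.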
A GPU-accelerated Direct-sum Boundary Integral Poisson-Boltzmann Solver
In this paper, we present a GPU-accelerated direct-sum boundary integral
method to solve the linear Poisson-Boltzmann (PB) equation. In our method, a
well-posed boundary integral formulation is used to ensure the fast convergence
of Krylov subspace based linear algebraic solvers such as GMRES. The
molecular surfaces are discretized with flat triangles and centroid
collocation. To speed up our method, we take advantage of the parallel nature
of the boundary integral formulation and parallelize the schemes within the CUDA
shared memory architecture on the GPU. The schemes use only
size-of-double device memory for a biomolecule with triangular surface
elements and partial charges. Numerical tests of these schemes show
well-maintained accuracy and fast convergence. The GPU implementation using one
GPU card (Nvidia Tesla M2070) achieves a 120-150X speed-up over the implementation
using one CPU (Intel L5640 2.27GHz). With our approach, solving PB equations on
well-discretized molecular surfaces with up to 300,000 boundary elements
takes less than about 10 minutes; hence our approach is particularly suitable
for fast electrostatics computations on small to medium biomolecules.
Adaptive Mesh Fluid Simulations on GPU
We describe an implementation of compressible inviscid fluid solvers with
block-structured adaptive mesh refinement on Graphics Processing Units using
NVIDIA's CUDA. We show that a class of high resolution shock capturing schemes
can be mapped naturally on this architecture. Using the method of lines
approach with the second order total variation diminishing Runge-Kutta time
integration scheme, piecewise linear reconstruction, and a Harten-Lax-van Leer
Riemann solver, we achieve an overall speedup of approximately 10 times
on one graphics card compared to a single core on the host
computer. We attain this speedup in uniform grid runs as well as in problems
with deep AMR hierarchies. Our framework can readily be applied to more general
systems of conservation laws and extended to higher order shock capturing
schemes. This is shown directly by implementing a magneto-hydrodynamic
solver and comparing its performance to the pure hydrodynamic case. Finally, we
also combined our CUDA parallel scheme with MPI to make the code run on GPU
clusters. Close to ideal speedup is observed on up to four GPUs.
Comment: Submitted to New Astronomy
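For readers unfamiliar with the Harten-Lax-van Leer (HLL) Riemann solver named in the abstract above, a minimal sketch of the HLL interface flux for the one-dimensional Euler equations might look as follows. The adiabatic index of 1.4, the simple min/max wave speed estimate, and all function names are assumptions chosen for illustration; the abstract does not state which estimates or data layout the authors use, and this is sequential Python rather than CUDA.

```python
import math

GAMMA = 1.4  # assumed ideal-gas adiabatic index


def conserved(rho, u, p):
    """Conserved state (density, momentum, total energy) from primitives."""
    return (rho, rho * u, p / (GAMMA - 1.0) + 0.5 * rho * u * u)


def primitive(U):
    """Primitive variables (density, velocity, pressure) from a conserved state."""
    rho, mom, energy = U
    u = mom / rho
    p = (GAMMA - 1.0) * (energy - 0.5 * rho * u * u)
    return rho, u, p


def flux(U):
    """Physical flux of the 1D Euler equations."""
    rho, u, p = primitive(U)
    return (rho * u, rho * u * u + p, (U[2] + p) * u)


def hll_flux(UL, UR):
    """HLL flux at an interface with left state UL and right state UR."""
    rhoL, uL, pL = primitive(UL)
    rhoR, uR, pR = primitive(UR)
    cL = math.sqrt(GAMMA * pL / rhoL)  # sound speeds
    cR = math.sqrt(GAMMA * pR / rhoR)
    # Assumed wave speed estimate: fastest left- and right-going signals.
    sL = min(uL - cL, uR - cR)
    sR = max(uL + cL, uR + cR)
    FL, FR = flux(UL), flux(UR)
    if sL >= 0.0:  # all waves move right: upwind on the left state
        return FL
    if sR <= 0.0:  # all waves move left: upwind on the right state
        return FR
    return tuple(
        (sR * fl - sL * fr + sL * sR * (ur - ul)) / (sR - sL)
        for fl, fr, ul, ur in zip(FL, FR, UL, UR)
    )
```

Each interface flux depends only on the two neighbouring states, which is consistent with the abstract's remark that such schemes map naturally onto GPU threads. In a full scheme the inputs UL and UR would come from the piecewise linear reconstruction rather than directly from cell averages.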