A GPU Accelerated Aggregation Algebraic Multigrid Method
We present an efficient, robust and fully GPU-accelerated aggregation-based
algebraic multigrid preconditioning technique for the solution of large sparse
linear systems arising from the discretization of elliptic PDEs. The method
involves two stages, setup and solve. In the setup stage, hierarchical coarse
grids are constructed through aggregation of the fine grid nodes. These
aggregations are obtained from a maximal independent set of the fine grid
nodes, computed with a "fine-grain" parallel algorithm on the graph of strong
negative connections. The aggregations are combined with piecewise-constant
(unsmoothed) interpolation from the coarse grid solution to the fine grid
solution, ensuring low setup and interpolation cost. Grid-independent
convergence is achieved
by using recursive Krylov iterations (K-cycles) in the solve stage. An
efficient combination of K-cycles and standard multigrid V-cycles is used as
the preconditioner for Krylov iterative solvers such as generalized minimal
residual and conjugate gradient. We compare the solver performance with other
solvers based on smoothed aggregation and classical algebraic multigrid
methods. (18 pages, 11 figures)
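The aggregation step described above can be sketched as follows. This is a minimal serial analogue of the paper's fine-grain parallel MIS algorithm: the adjacency-list layout, function names, and chain graph are illustrative assumptions, not the paper's implementation.

```python
# Sketch (not the paper's parallel algorithm): serial greedy maximal
# independent set (MIS) on a strong-connection graph, then aggregation
# of the remaining nodes around MIS "roots". Each aggregate induces one
# coarse unknown; piecewise-constant interpolation just copies the
# coarse value to every fine node in the aggregate.

def greedy_mis(adj):
    """Return a maximal independent set of the graph `adj` (dict node -> set)."""
    mis, excluded = set(), set()
    for v in adj:                      # deterministic order stands in for
        if v in excluded or v in mis:  # the parallel algorithm's random order
            continue
        mis.add(v)
        excluded.update(adj[v])        # neighbors can no longer join the MIS
    return mis

def aggregate(adj, mis):
    """Assign every node to an aggregate rooted at an MIS node."""
    agg = {r: r for r in mis}
    for v in adj:
        if v not in agg:
            roots = [u for u in adj[v] if u in mis]
            agg[v] = roots[0] if roots else v  # isolated node: own aggregate
    return agg

# 1D chain 0-1-2-3-4 as the strong-connection graph
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
mis = greedy_mis(adj)
print(sorted(mis))  # [0, 2, 4] for this visiting order
```

Each non-root node joins a strongly connected root, so aggregates stay local, which is what keeps the piecewise-constant interpolation cheap.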
AMG based on compatible weighted matching for GPUs
We describe the main issues and design principles of an efficient
implementation, tailored to recent generations of Nvidia Graphics Processing
Units (GPUs), of an Algebraic Multigrid (AMG) preconditioner previously
proposed by one of the authors and available in the open-source package
BootCMatch: Bootstrap algebraic multigrid based on Compatible weighted
Matching for standard CPUs. The
AMG method relies on a new approach for coarsening sparse symmetric positive
definite (spd) matrices, named "coarsening based on compatible weighted
matching". It exploits maximum weight matching in the adjacency graph of the
sparse matrix, driven by the principle of compatible relaxation, providing a
suitable aggregation of unknowns which goes beyond the limits of the usual
heuristics applied in the current methods. We adopt an approximate solution of
the maximum weight matching problem, based on a recently proposed parallel
algorithm, referred to as the Suitor algorithm, and show that it allows us to
obtain good-quality coarse matrices for our AMG on GPUs. We exploit the inherent
parallelism of modern GPUs in all the kernels involving sparse matrix
computations both for the setup of the preconditioner and for its application
in a Krylov solver, outperforming preconditioners available in Nvidia AmgX
library. We report results for a large set of linear systems arising from the
discretization of scalar and vector partial differential equations (PDEs). (11
pages; submitted to the special issue of Parallel Computing on the 10th
International Workshop on Parallel Matrix Algorithms and Applications, PMAA18.)
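The matching-driven coarsening above can be illustrated with a simple serial greedy 1/2-approximation of maximum-weight matching. This is a stand-in for the parallel Suitor algorithm, and the edge weights here are illustrative numbers rather than values derived from compatible relaxation.

```python
# Sketch: greedy 1/2-approximation to maximum-weight matching (a serial
# stand-in for the Suitor algorithm) that pairs unknowns into size-2
# aggregates. In the actual method the weights come from the matrix
# entries, filtered by the compatible-relaxation principle.

def greedy_matching(edges):
    """edges: list of (weight, u, v). Returns dict node -> mate."""
    mate = {}
    for w, u, v in sorted(edges, reverse=True):  # heaviest edge first
        if u not in mate and v not in mate:
            mate[u], mate[v] = v, u
    return mate

def pair_aggregates(nodes, mate):
    """Matched pairs become one aggregate; unmatched nodes stay singletons."""
    aggs, seen = [], set()
    for v in nodes:
        if v in seen:
            continue
        group = [v] if v not in mate else sorted([v, mate[v]])
        seen.update(group)
        aggs.append(group)
    return aggs

edges = [(0.9, 0, 1), (0.8, 1, 2), (0.7, 2, 3), (0.1, 3, 4)]
m = greedy_matching(edges)
print(pair_aggregates(range(5), m))  # [[0, 1], [2, 3], [4]]
```

Note how the heavy (0.8) edge is skipped once node 1 is taken: the greedy rule trades a little matching weight for a single cheap pass, which is the flavor of approximation the paper accepts in exchange for GPU-friendly parallelism.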
Decoupled Block-Wise ILU(k) Preconditioner on GPU
This research investigates the implementation mechanism of block-wise ILU(k)
preconditioner on GPUs. The block-wise ILU(k) algorithm treats both the level
k and the block size as variables. The decoupled ILU(k) algorithm
consists of a symbolic phase and a factorization phase. In the symbolic phase,
an ILU(k) nonzero pattern is established from the point-wise structure extracted
from a block-wise matrix. In the factorization phase, the block-wise matrix
with a variable block size is factorized into a block lower triangular matrix
and a block upper triangular matrix. A further diagonal factorization must then
be performed on the block upper triangular matrix to adapt it to a parallel
triangular solver on GPU. We also present numerical experiments that study the
action of the preconditioner for different levels k and block sizes. (14 pages)
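The symbolic phase can be sketched with the standard level-of-fill rule; this is a generic textbook formulation on a point-wise pattern (as the decoupled algorithm extracts one from the block structure), not the paper's GPU code, and the arrow-shaped example pattern is an assumption for illustration.

```python
# Sketch of the symbolic phase: compute the ILU(k) level-of-fill on a
# point-wise sparsity pattern. The usual rule: lev(i,j) = 0 for
# structural nonzeros, and a fill entry created through pivot p gets
# level lev(i,p) + lev(p,j) + 1; entries with level <= k are kept.

INF = float("inf")

def ilu_k_pattern(pattern, n, k):
    """pattern: set of (i, j) nonzeros. Returns the ILU(k) fill pattern."""
    lev = {(i, j): 0 for (i, j) in pattern}
    for i in range(n):
        for p in range(i):                     # pivots usable in row i
            lpi = lev.get((i, p), INF)
            if lpi > k:                        # dropped entry: no fill from it
                continue
            for j in range(p + 1, n):          # update row i through pivot p
                fill = lpi + lev.get((p, j), INF) + 1
                if fill < lev.get((i, j), INF):
                    lev[(i, j)] = fill
    return {ij for ij, l in lev.items() if l <= k}

# Arrow pattern (dense first row/column plus diagonal): eliminating the
# first unknown fills everything, so ILU(0) keeps the original pattern
# while ILU(1) already produces a dense factor.
n = 4
arrow = ({(0, j) for j in range(n)} | {(j, 0) for j in range(n)}
         | {(i, i) for i in range(n)})
print(ilu_k_pattern(arrow, n, 0) == arrow)  # True: level 0 adds nothing
print(len(ilu_k_pattern(arrow, n, 1)))      # 16: level 1 fills the 4x4 matrix
```

The factorization phase then computes numerical values only on this precomputed pattern, which is what makes the two phases decouple cleanly on a GPU.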
Numerical Study of Geometric Multigrid Methods on CPU--GPU Heterogeneous Computers
The geometric multigrid method (GMG) is one of the most efficient solving
techniques for discrete algebraic systems arising from elliptic partial
differential equations. GMG utilizes a hierarchy of grids or discretizations
and reduces the error at a number of frequencies simultaneously. Graphics
processing units (GPUs) have recently burst onto the scientific computing scene
as a technology that has yielded substantial performance and energy-efficiency
improvements. A central challenge in implementing GMG on GPUs, though, is that
computational work on coarse levels cannot fully utilize the capacity of a GPU.
In this work, we perform numerical studies of GMG on CPU--GPU heterogeneous
computers. Furthermore, we compare our implementation with an efficient CPU
implementation of GMG and with the most popular fast Poisson solver, the Fast
Fourier Transform, as implemented in the cuFFT library developed by NVIDIA.
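The core GMG idea of reducing error at several frequencies at once can be sketched with a two-grid cycle for the 1D Poisson problem. This is a minimal textbook illustration, not the paper's implementation; grid size, smoother, and transfer operators are all assumed for the example.

```python
# Sketch: one two-grid cycle for -u'' = f on (0,1) with zero boundary
# values. Damped Jacobi kills the oscillatory error components; the
# coarse-grid correction (here an exact tridiagonal solve) removes the
# smooth remainder that the smoother barely touches.

def apply_A(u, h):
    """Matrix-free 1D Laplacian at interior points, Dirichlet zeros outside."""
    n = len(u)
    return [(2*u[i] - (u[i-1] if i else 0) - (u[i+1] if i < n-1 else 0)) / h**2
            for i in range(n)]

def jacobi(u, f, h, sweeps, omega=2/3):
    for _ in range(sweeps):
        r = [fi - ai for fi, ai in zip(f, apply_A(u, h))]
        u = [ui + omega * (h**2 / 2) * ri for ui, ri in zip(u, r)]
    return u

def thomas(f, h):
    """Direct tridiagonal solve of the 1D Laplacian system (Thomas algorithm)."""
    n, b, off = len(f), 2 / h**2, -1 / h**2
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = off / b, f[0] / b
    for i in range(1, n):
        m = b - off * cp[i-1]
        cp[i], dp[i] = off / m, (f[i] - off * dp[i-1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i+1]
    return x

def prolong(ec, nf):
    """Linear interpolation from coarse (fine odd indices) to fine grid."""
    e = [0.0] * nf
    for j, v in enumerate(ec):
        e[2*j + 1] = v
    for i in range(0, nf, 2):
        left = e[i-1] if i > 0 else 0.0
        right = e[i+1] if i < nf - 1 else 0.0
        e[i] = 0.5 * (left + right)
    return e

def two_grid(u, f, h):
    u = jacobi(u, f, h, sweeps=3)                        # pre-smooth
    r = [fi - ai for fi, ai in zip(f, apply_A(u, h))]
    rc = [0.25*r[2*j] + 0.5*r[2*j+1] + 0.25*r[2*j+2]     # full weighting
          for j in range((len(r) - 1) // 2)]
    e = prolong(thomas(rc, 2 * h), len(u))               # coarse correction
    u = [ui + ei for ui, ei in zip(u, e)]
    return jacobi(u, f, h, sweeps=3)                     # post-smooth

# Model problem: -u'' = 2, exact solution u(x) = x(1-x).
n, h = 7, 1/8
f = [2.0] * n
u_exact = [(i + 1) * h * (1 - (i + 1) * h) for i in range(n)]
u = two_grid([0.0] * n, f, h)
print(max(abs(a - b) for a, b in zip(u, u_exact)))  # error after one cycle
```

The GPU difficulty the paper highlights shows up immediately in this sketch: the coarse problem here has only 3 unknowns, far too little work to occupy a GPU, which motivates the CPU-GPU division of labor.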
On Parallel Solution of Sparse Triangular Linear Systems in CUDA
The acceleration of sparse matrix computations on modern many-core
processors, such as graphics processing units (GPUs), has been recognized
and studied for over a decade. Significant performance enhancements have been
achieved for many sparse matrix computational kernels such as sparse
matrix-vector products and sparse matrix-matrix products. Solving linear
systems with sparse triangular structured matrices is another important sparse
kernel as demanded by a variety of scientific and engineering applications such
as sparse linear solvers. However, the development of efficient parallel
algorithms in CUDA for solving sparse triangular linear systems remains a
challenging task due to the inherently sequential nature of the computation. In
this paper, we will revisit this problem by reviewing the existing
level-scheduling methods and proposing algorithms with self-scheduling
techniques. Numerical results have indicated that the CUDA implementations of
the proposed algorithms can outperform the state-of-the-art solvers in cuSPARSE
by a factor of up to for structured model problems and general sparse
matrices.
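The level-scheduling idea the paper starts from can be sketched directly: unknown i depends on the unknowns appearing in its row, so its level is one more than its deepest dependency, and all unknowns in a level are independent. The dict-of-rows matrix layout and the tiny example system are assumptions for illustration.

```python
# Sketch of level scheduling for a sparse lower-triangular solve.
# On a GPU, each level would be one kernel launch with all rows of
# that level solved by parallel threads; here the levels run serially.

def levels(L):
    """L: dict row -> {col: val}, lower triangular. Level of each row."""
    lev = {}
    for i in sorted(L):                         # rows in increasing order
        deps = [lev[j] for j in L[i] if j < i]  # rows this row reads from
        lev[i] = 1 + max(deps, default=-1)
    return lev

def solve_by_levels(L, b):
    lev = levels(L)
    x = [0.0] * len(b)
    for l in range(max(lev.values()) + 1):      # one "kernel" per level
        for i in (r for r, lv in lev.items() if lv == l):
            s = sum(v * x[j] for j, v in L[i].items() if j < i)
            x[i] = (b[i] - s) / L[i][i]
    return x

# Bidiagonal L: every row depends on the previous one, so each level
# holds a single row -- the worst case that motivates self-scheduling.
L = {0: {0: 2.0}, 1: {0: 1.0, 1: 2.0}, 2: {1: 1.0, 2: 2.0}}
print(solve_by_levels(L, [2.0, 5.0, 8.0]))  # [1.0, 2.0, 3.0]
```

The example's chain-like dependency structure is exactly the pathology that makes level scheduling degenerate to a sequential sweep, which is the gap the paper's self-scheduling techniques target.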
Accelerating Multigrid-based Hierarchical Scientific Data Refactoring on GPUs
Rapid growth in scientific data and a widening gap between computational
speed and I/O bandwidth makes it increasingly infeasible to store and share all
data produced by scientific simulations. Multigrid-based hierarchical data
refactoring is a class of promising approaches to this problem. These
approaches decompose data hierarchically; the decomposed components can then be
selectively and intelligently stored or shared, based on their relative
importance in the original data. Efficient data refactoring design is one key
to making these methods truly useful. In this paper, we describe highly
optimized data refactoring kernels on GPU accelerators that are specialized for
refactoring scientific data. We demonstrate that our optimized design can
achieve 45.42 TB/s aggregated data refactoring throughput when using 4,096 GPUs
of the Summit supercomputer. Finally, we showcase our optimized design by
applying it to a large-scale scientific visualization workflow and the MGARD
lossy compression software.
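One level of the hierarchical decomposition idea can be sketched on a 1D array; this is a generic interpolation-based split chosen for illustration, not the MGARD transform or the paper's GPU kernels.

```python
# Sketch: split a 1D array into a coarse half (even samples) plus
# correction coefficients (deviation of odd samples from linear
# interpolation). Smooth data yields small corrections that can be
# stored at lower priority or compressed harder; the transform is
# exactly invertible. Input length must be odd for this sketch.

def decompose(data):
    coarse = data[::2]
    detail = [data[2*i + 1] - 0.5 * (data[2*i] + data[2*i + 2])
              for i in range((len(data) - 1) // 2)]
    return coarse, detail

def reconstruct(coarse, detail):
    out = []
    for i, d in enumerate(detail):
        out += [coarse[i], d + 0.5 * (coarse[i] + coarse[i + 1])]
    out.append(coarse[len(detail)])
    return out

data = [0.0, 1.0, 4.0, 9.0, 16.0]        # samples of x**2
coarse, detail = decompose(data)
print(coarse, detail)                    # [0.0, 4.0, 16.0] [-1.0, -1.0]
assert reconstruct(coarse, detail) == data   # lossless round trip
```

Applying `decompose` recursively to the coarse part yields the multi-level hierarchy whose components can be stored or shared selectively by importance.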
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Hardware-aware design and optimization is crucial in exploiting emerging
architectures for PDE-based computational fluid dynamics applications. In this
work, we study optimizations aimed at acceleration of OpenFOAM-based
applications on emerging hybrid heterogeneous platforms. OpenFOAM uses MPI to
provide parallel multi-processor functionality, which scales well on
homogeneous systems but does not fully utilize the potential per-node
performance on hybrid heterogeneous platforms. In our study, we use two
OpenFOAM applications, icoFoam and laplacianFoam, both based on Krylov
iterative methods. We propose a number of optimizations of the dominant kernel
of the Krylov solver, aimed at acceleration of the overall execution of the
applications on modern GPU-accelerated heterogeneous platforms. Experimental
results show that the proposed hybrid implementation significantly outperforms
the state-of-the-art implementation. (Presented at ParCFD 2014; prepared for
submission to Computers & Fluids. 12 pages, 9 figures, 2 tables.)
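The dominant kernel of a Krylov solver is typically the sparse matrix-vector product; a minimal CSR version is sketched below. This is the generic kernel, not the paper's optimized hybrid implementation, and the 2x2 example matrix is an assumption.

```python
# Sketch: sparse matrix-vector product y = A @ x in CSR form, usually
# the dominant kernel inside a Krylov solver. On a GPU, a thread (or a
# warp) would handle each row; here a plain serial loop.

def spmv_csr(row_ptr, col_idx, vals, x):
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):  # nonzeros of row i
            s += vals[k] * x[col_idx[k]]
        y.append(s)
    return y

# 2x2 example: [[2, 1], [0, 3]] @ [1, 1] = [3, 3]
print(spmv_csr([0, 2, 3], [0, 1, 1], [2.0, 1.0, 3.0], [1.0, 1.0]))  # [3.0, 3.0]
```

Because this kernel is memory-bandwidth bound, per-node acceleration hinges on data layout and transfer, which is where hybrid CPU-GPU optimizations pay off.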
Accelerating Geometric Multigrid Preconditioning with Half-Precision Arithmetic on GPUs
With the hardware support for half-precision arithmetic on NVIDIA V100 GPUs,
high-performance computing applications can benefit from lower precision at
appropriate spots to speed up the overall execution time. In this paper, we
investigate a mixed-precision geometric multigrid method to solve large sparse
systems of equations stemming from discretization of elliptic PDEs. While the
final solution is always computed with high-precision accuracy, an iterative
refinement approach with multigrid preconditioning in lower precision and
residuum scaling is employed. We compare the FP64 baseline for Poisson's
equation to purely FP16 multigrid preconditioning and to the employment of
FP16-FP32-FP64 combinations within a mesh hierarchy. While the iteration count
is almost unaffected by the lower precision, the solver runtime decreases
considerably due to the reduced memory traffic, and a speedup of up
to 2.5x is gained for the overall solver. We investigate the performance of
selected kernels with the hierarchical Roofline model.
Complete PISO and SIMPLE solvers on Graphics Processing Units
We implemented the pressure-implicit with splitting of operators (PISO) and
semi-implicit method for pressure-linked equations (SIMPLE) solvers of the
Navier-Stokes equations on Fermi-class graphics processing units (GPUs) using
the CUDA technology. We also introduced a new format of sparse matrices
optimized for performing elementary CFD operations, like gradient or divergence
discretization, on GPUs. We verified the validity of the implementation on
several standard, steady and unsteady problems. Computational efficiency of the
GPU implementation was examined by comparing its double precision run times
with those of essentially the same algorithms implemented in OpenFOAM. The
results show that a GPU (Tesla C2070) can outperform a server-class 6-core,
12-thread CPU (Intel Xeon X5670) by a factor of 4.2.
GPU accelerated spectral finite elements on all-hex meshes
This paper presents a spectral element finite element scheme that efficiently
solves elliptic problems on unstructured hexahedral meshes. The discrete
equations are solved using a matrix-free preconditioned conjugate gradient
algorithm. An additive Schwarz two-scale preconditioner is employed that
yields h-independent convergence. An extensible multi-threading programming
API is used as a common kernel language that allows runtime selection of
different computing devices (GPU and CPU) and different threading interfaces
(CUDA, OpenCL and OpenMP). Performance tests demonstrate that problems with
over 50 million degrees of freedom can be solved in a few seconds on an
off-the-shelf GPU. (23 pages, 7 figures)
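The matrix-free structure of such a solver can be sketched as a conjugate gradient routine that receives the operator as a function; the preconditioner callback, operator, and example system below are illustrative assumptions, not the paper's spectral-element kernels.

```python
# Sketch: preconditioned conjugate gradient where the operator is a
# callback (matrix-free, as in the spectral element scheme) and the
# preconditioner is another callback (identity by default).

def pcg(apply_A, b, precond=lambda r: list(r), tol=1e-10, max_iter=200):
    x = [0.0] * len(b)
    r = list(b)                                   # residual for x = 0
    z = precond(r)
    p = list(z)
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rz / sum(pi * qi for pi, qi in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        if max(map(abs, r)) < tol:
            break
        z = precond(r)
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# 1D Laplacian applied matrix-free; diagonal (Jacobi) preconditioner
# stands in for the two-scale additive Schwarz preconditioner.
def lap(u):
    n = len(u)
    return [2*u[i] - (u[i-1] if i else 0) - (u[i+1] if i < n-1 else 0)
            for i in range(n)]

x = pcg(lap, [1.0, 0.0, 0.0], precond=lambda r: [ri / 2 for ri in r])
print([round(v, 6) for v in x])  # [0.75, 0.5, 0.25]
```

Keeping the operator as a callback is what lets a single solver body run unchanged over different backends (CUDA, OpenCL, OpenMP): only `apply_A` and `precond` need device-specific implementations.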