A GPU Accelerated Aggregation Algebraic Multigrid Method
We present an efficient, robust and fully GPU-accelerated aggregation-based
algebraic multigrid preconditioning technique for the solution of large sparse
linear systems arising from the discretization of elliptic PDEs. The method
involves two stages, setup and solve. In the setup stage, hierarchical coarse
grids are constructed through aggregation of the fine grid nodes. These
aggregations are obtained from a maximal independent set of the fine grid
nodes, computed with a "fine-grain" parallel algorithm on the graph of strong
negative connections. The aggregations are combined with piecewise-constant
(unsmoothed) interpolation from the coarse grid solution to the fine grid
solution, ensuring low setup and interpolation cost. Grid-independent
convergence is achieved
by using recursive Krylov iterations (K-cycles) in the solve stage. An
efficient combination of K-cycles and standard multigrid V-cycles is used as
the preconditioner for Krylov iterative solvers such as generalized minimal
residual and conjugate gradient. We compare the solver performance with other
solvers based on smoothed aggregation and classical algebraic multigrid
methods. (18 pages, 11 figures)
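The aggregation step described above can be sketched as follows. This is a minimal serial analogue of the paper's fine-grain parallel MIS algorithm: the adjacency-list layout, function names, and chain graph are illustrative assumptions, not the paper's implementation.

```python
# Sketch (not the paper's parallel algorithm): serial greedy maximal
# independent set (MIS) on a strong-connection graph, then aggregation
# of the remaining nodes around MIS "roots". Each aggregate induces one
# coarse unknown; piecewise-constant interpolation just copies the
# coarse value to every fine node in the aggregate.

def greedy_mis(adj):
    """Return a maximal independent set of the graph `adj` (dict node -> set)."""
    mis, excluded = set(), set()
    for v in adj:                      # deterministic order stands in for
        if v in excluded or v in mis:  # the parallel algorithm's random order
            continue
        mis.add(v)
        excluded.update(adj[v])        # neighbors can no longer join the MIS
    return mis

def aggregate(adj, mis):
    """Assign every node to an aggregate rooted at an MIS node."""
    agg = {r: r for r in mis}
    for v in adj:
        if v not in agg:
            roots = [u for u in adj[v] if u in mis]
            agg[v] = roots[0] if roots else v  # isolated node: own aggregate
    return agg

# 1D chain 0-1-2-3-4 as the strong-connection graph
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
mis = greedy_mis(adj)
print(sorted(mis))  # [0, 2, 4] for this visiting order
```

Each non-root node joins a strongly connected root, so aggregates stay local, which is what keeps the piecewise-constant interpolation cheap.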
AMG based on compatible weighted matching for GPUs
We describe the main issues and design principles of an efficient
implementation, tailored to recent generations of Nvidia Graphics Processing
Units (GPUs), of an Algebraic Multigrid (AMG) preconditioner previously
proposed by one of the authors and available in the open-source package
BootCMatch: Bootstrap algebraic multigrid based on Compatible weighted
Matching for standard CPUs. The
AMG method relies on a new approach for coarsening sparse symmetric positive
definite (spd) matrices, named "coarsening based on compatible weighted
matching". It exploits maximum weight matching in the adjacency graph of the
sparse matrix, driven by the principle of compatible relaxation, providing a
suitable aggregation of unknowns which goes beyond the limits of the usual
heuristics applied in the current methods. We adopt an approximate solution of
the maximum weight matching problem, based on a recently proposed parallel
algorithm, referred to as the Suitor algorithm, and show that it allows us to
obtain good-quality coarse matrices for our AMG on GPUs. We exploit the inherent
parallelism of modern GPUs in all the kernels involving sparse matrix
computations both for the setup of the preconditioner and for its application
in a Krylov solver, outperforming preconditioners available in Nvidia AmgX
library. We report results for a large set of linear systems arising from the
discretization of scalar and vector partial differential equations (PDEs). (11
pages; submitted to the special issue of Parallel Computing on the 10th
International Workshop on Parallel Matrix Algorithms and Applications, PMAA18.)
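The matching-driven coarsening above can be illustrated with a simple serial greedy 1/2-approximation of maximum-weight matching. This is a stand-in for the parallel Suitor algorithm, and the edge weights here are illustrative numbers rather than values derived from compatible relaxation.

```python
# Sketch: greedy 1/2-approximation to maximum-weight matching (a serial
# stand-in for the Suitor algorithm) that pairs unknowns into size-2
# aggregates. In the actual method the weights come from the matrix
# entries, filtered by the compatible-relaxation principle.

def greedy_matching(edges):
    """edges: list of (weight, u, v). Returns dict node -> mate."""
    mate = {}
    for w, u, v in sorted(edges, reverse=True):  # heaviest edge first
        if u not in mate and v not in mate:
            mate[u], mate[v] = v, u
    return mate

def pair_aggregates(nodes, mate):
    """Matched pairs become one aggregate; unmatched nodes stay singletons."""
    aggs, seen = [], set()
    for v in nodes:
        if v in seen:
            continue
        group = [v] if v not in mate else sorted([v, mate[v]])
        seen.update(group)
        aggs.append(group)
    return aggs

edges = [(0.9, 0, 1), (0.8, 1, 2), (0.7, 2, 3), (0.1, 3, 4)]
m = greedy_matching(edges)
print(pair_aggregates(range(5), m))  # [[0, 1], [2, 3], [4]]
```

Note how the heavy (0.8) edge is skipped once node 1 is taken: the greedy rule trades a little matching weight for a single cheap pass, which is the flavor of approximation the paper accepts in exchange for GPU-friendly parallelism.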
Decoupled Block-Wise ILU(k) Preconditioner on GPU
This research investigates the implementation mechanism of block-wise ILU(k)
preconditioner on GPUs. The block-wise ILU(k) algorithm treats both the level
k and the block size as variables. The decoupled ILU(k) algorithm
consists of a symbolic phase and a factorization phase. In the symbolic phase,
an ILU(k) nonzero pattern is established from the point-wise structure extracted
from a block-wise matrix. In the factorization phase, the block-wise matrix
with a variable block size is factorized into a block lower triangular matrix
and a block upper triangular matrix. A further diagonal factorization must then
be performed on the block upper triangular matrix to adapt it to a parallel
triangular solver on GPU. We also present numerical experiments that study the
action of the preconditioner for different levels k and block sizes. (14 pages)
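The symbolic phase can be sketched with the standard level-of-fill rule; this is a generic textbook formulation on a point-wise pattern (as the decoupled algorithm extracts one from the block structure), not the paper's GPU code, and the arrow-shaped example pattern is an assumption for illustration.

```python
# Sketch of the symbolic phase: compute the ILU(k) level-of-fill on a
# point-wise sparsity pattern. The usual rule: lev(i,j) = 0 for
# structural nonzeros, and a fill entry created through pivot p gets
# level lev(i,p) + lev(p,j) + 1; entries with level <= k are kept.

INF = float("inf")

def ilu_k_pattern(pattern, n, k):
    """pattern: set of (i, j) nonzeros. Returns the ILU(k) fill pattern."""
    lev = {(i, j): 0 for (i, j) in pattern}
    for i in range(n):
        for p in range(i):                     # pivots usable in row i
            lpi = lev.get((i, p), INF)
            if lpi > k:                        # dropped entry: no fill from it
                continue
            for j in range(p + 1, n):          # update row i through pivot p
                fill = lpi + lev.get((p, j), INF) + 1
                if fill < lev.get((i, j), INF):
                    lev[(i, j)] = fill
    return {ij for ij, l in lev.items() if l <= k}

# Arrow pattern (dense first row/column plus diagonal): eliminating the
# first unknown fills everything, so ILU(0) keeps the original pattern
# while ILU(1) already produces a dense factor.
n = 4
arrow = ({(0, j) for j in range(n)} | {(j, 0) for j in range(n)}
         | {(i, i) for i in range(n)})
print(ilu_k_pattern(arrow, n, 0) == arrow)  # True: level 0 adds nothing
print(len(ilu_k_pattern(arrow, n, 1)))      # 16: level 1 fills the 4x4 matrix
```

The factorization phase then computes numerical values only on this precomputed pattern, which is what makes the two phases decouple cleanly on a GPU.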
Numerical Study of Geometric Multigrid Methods on CPU--GPU Heterogeneous Computers
The geometric multigrid method (GMG) is one of the most efficient solving
techniques for discrete algebraic systems arising from elliptic partial
differential equations. GMG utilizes a hierarchy of grids or discretizations
and reduces the error at a number of frequencies simultaneously. Graphics
processing units (GPUs) have recently burst onto the scientific computing scene
as a technology that has yielded substantial performance and energy-efficiency
improvements. A central challenge in implementing GMG on GPUs, though, is that
computational work on coarse levels cannot fully utilize the capacity of a GPU.
In this work, we perform numerical studies of GMG on CPU--GPU heterogeneous
computers. Furthermore, we compare our implementation with an efficient CPU
implementation of GMG and with the most popular fast Poisson solver, the Fast
Fourier Transform, as implemented in the cuFFT library developed by NVIDIA.
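The core GMG idea of reducing error at several frequencies at once can be sketched with a two-grid cycle for the 1D Poisson problem. This is a minimal textbook illustration, not the paper's implementation; grid size, smoother, and transfer operators are all assumed for the example.

```python
# Sketch: one two-grid cycle for -u'' = f on (0,1) with zero boundary
# values. Damped Jacobi kills the oscillatory error components; the
# coarse-grid correction (here an exact tridiagonal solve) removes the
# smooth remainder that the smoother barely touches.

def apply_A(u, h):
    """Matrix-free 1D Laplacian at interior points, Dirichlet zeros outside."""
    n = len(u)
    return [(2*u[i] - (u[i-1] if i else 0) - (u[i+1] if i < n-1 else 0)) / h**2
            for i in range(n)]

def jacobi(u, f, h, sweeps, omega=2/3):
    for _ in range(sweeps):
        r = [fi - ai for fi, ai in zip(f, apply_A(u, h))]
        u = [ui + omega * (h**2 / 2) * ri for ui, ri in zip(u, r)]
    return u

def thomas(f, h):
    """Direct tridiagonal solve of the 1D Laplacian system (Thomas algorithm)."""
    n, b, off = len(f), 2 / h**2, -1 / h**2
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = off / b, f[0] / b
    for i in range(1, n):
        m = b - off * cp[i-1]
        cp[i], dp[i] = off / m, (f[i] - off * dp[i-1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i+1]
    return x

def prolong(ec, nf):
    """Linear interpolation from coarse (fine odd indices) to fine grid."""
    e = [0.0] * nf
    for j, v in enumerate(ec):
        e[2*j + 1] = v
    for i in range(0, nf, 2):
        left = e[i-1] if i > 0 else 0.0
        right = e[i+1] if i < nf - 1 else 0.0
        e[i] = 0.5 * (left + right)
    return e

def two_grid(u, f, h):
    u = jacobi(u, f, h, sweeps=3)                        # pre-smooth
    r = [fi - ai for fi, ai in zip(f, apply_A(u, h))]
    rc = [0.25*r[2*j] + 0.5*r[2*j+1] + 0.25*r[2*j+2]     # full weighting
          for j in range((len(r) - 1) // 2)]
    e = prolong(thomas(rc, 2 * h), len(u))               # coarse correction
    u = [ui + ei for ui, ei in zip(u, e)]
    return jacobi(u, f, h, sweeps=3)                     # post-smooth

# Model problem: -u'' = 2, exact solution u(x) = x(1-x).
n, h = 7, 1/8
f = [2.0] * n
u_exact = [(i + 1) * h * (1 - (i + 1) * h) for i in range(n)]
u = two_grid([0.0] * n, f, h)
print(max(abs(a - b) for a, b in zip(u, u_exact)))  # error after one cycle
```

The GPU difficulty the paper highlights shows up immediately in this sketch: the coarse problem here has only 3 unknowns, far too little work to occupy a GPU, which motivates the CPU-GPU division of labor.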
On Parallel Solution of Sparse Triangular Linear Systems in CUDA
The acceleration of sparse matrix computations on modern many-core
processors, such as graphics processing units (GPUs), has been recognized
and studied for over a decade. Significant performance enhancements have been
achieved for many sparse matrix computational kernels such as sparse
matrix-vector products and sparse matrix-matrix products. Solving linear
systems with sparse triangular structured matrices is another important sparse
kernel as demanded by a variety of scientific and engineering applications such
as sparse linear solvers. However, the development of efficient parallel
algorithms in CUDA for solving sparse triangular linear systems remains a
challenging task due to the inherently sequential nature of the computation. In
this paper, we will revisit this problem by reviewing the existing
level-scheduling methods and proposing algorithms with self-scheduling
techniques. Numerical results have indicated that the CUDA implementations of
the proposed algorithms can outperform the state-of-the-art solvers in cuSPARSE
by a factor of up to for structured model problems and general sparse
matrices.
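The level-scheduling idea the paper starts from can be sketched directly: unknown i depends on the unknowns appearing in its row, so its level is one more than its deepest dependency, and all unknowns in a level are independent. The dict-of-rows matrix layout and the tiny example system are assumptions for illustration.

```python
# Sketch of level scheduling for a sparse lower-triangular solve.
# On a GPU, each level would be one kernel launch with all rows of
# that level solved by parallel threads; here the levels run serially.

def levels(L):
    """L: dict row -> {col: val}, lower triangular. Level of each row."""
    lev = {}
    for i in sorted(L):                         # rows in increasing order
        deps = [lev[j] for j in L[i] if j < i]  # rows this row reads from
        lev[i] = 1 + max(deps, default=-1)
    return lev

def solve_by_levels(L, b):
    lev = levels(L)
    x = [0.0] * len(b)
    for l in range(max(lev.values()) + 1):      # one "kernel" per level
        for i in (r for r, lv in lev.items() if lv == l):
            s = sum(v * x[j] for j, v in L[i].items() if j < i)
            x[i] = (b[i] - s) / L[i][i]
    return x

# Bidiagonal L: every row depends on the previous one, so each level
# holds a single row -- the worst case that motivates self-scheduling.
L = {0: {0: 2.0}, 1: {0: 1.0, 1: 2.0}, 2: {1: 1.0, 2: 2.0}}
print(solve_by_levels(L, [2.0, 5.0, 8.0]))  # [1.0, 2.0, 3.0]
```

The example's chain-like dependency structure is exactly the pathology that makes level scheduling degenerate to a sequential sweep, which is the gap the paper's self-scheduling techniques target.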
Accelerating Multigrid-based Hierarchical Scientific Data Refactoring on GPUs
Rapid growth in scientific data and a widening gap between computational
speed and I/O bandwidth makes it increasingly infeasible to store and share all
data produced by scientific simulations. Multigrid-based hierarchical data
refactoring is a class of promising approaches to this problem. These
approaches decompose data hierarchically; the decomposed components can then be
selectively and intelligently stored or shared, based on their relative
importance in the original data. Efficient data refactoring design is one key
to making these methods truly useful. In this paper, we describe highly
optimized data refactoring kernels on GPU accelerators that are specialized for
refactoring scientific data. We demonstrate that our optimized design can
achieve 45.42 TB/s aggregated data refactoring throughput when using 4,096 GPUs
of the Summit supercomputer. Finally, we showcase our optimized design by
applying it to a large-scale scientific visualization workflow and the MGARD
lossy compression software.
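One level of the hierarchical decomposition idea can be sketched on a 1D array; this is a generic interpolation-based split chosen for illustration, not the MGARD transform or the paper's GPU kernels.

```python
# Sketch: split a 1D array into a coarse half (even samples) plus
# correction coefficients (deviation of odd samples from linear
# interpolation). Smooth data yields small corrections that can be
# stored at lower priority or compressed harder; the transform is
# exactly invertible. Input length must be odd for this sketch.

def decompose(data):
    coarse = data[::2]
    detail = [data[2*i + 1] - 0.5 * (data[2*i] + data[2*i + 2])
              for i in range((len(data) - 1) // 2)]
    return coarse, detail

def reconstruct(coarse, detail):
    out = []
    for i, d in enumerate(detail):
        out += [coarse[i], d + 0.5 * (coarse[i] + coarse[i + 1])]
    out.append(coarse[len(detail)])
    return out

data = [0.0, 1.0, 4.0, 9.0, 16.0]        # samples of x**2
coarse, detail = decompose(data)
print(coarse, detail)                    # [0.0, 4.0, 16.0] [-1.0, -1.0]
assert reconstruct(coarse, detail) == data   # lossless round trip
```

Applying `decompose` recursively to the coarse part yields the multi-level hierarchy whose components can be stored or shared selectively by importance.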
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Hardware-aware design and optimization is crucial in exploiting emerging
architectures for PDE-based computational fluid dynamics applications. In this
work, we study optimizations aimed at acceleration of OpenFOAM-based
applications on emerging hybrid heterogeneous platforms. OpenFOAM uses MPI to
provide parallel multi-processor functionality, which scales well on
homogeneous systems but does not fully utilize the potential per-node
performance on hybrid heterogeneous platforms. In our study, we use two
OpenFOAM applications, icoFoam and laplacianFoam, both based on Krylov
iterative methods. We propose a number of optimizations of the dominant kernel
of the Krylov solver, aimed at acceleration of the overall execution of the
applications on modern GPU-accelerated heterogeneous platforms. Experimental
results show that the proposed hybrid implementation significantly outperforms
the state-of-the-art implementation. (Presented at ParCFD 2014; prepared for
submission to Computers & Fluids. 12 pages, 9 figures, 2 tables.)
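The dominant kernel of a Krylov solver is typically the sparse matrix-vector product; a minimal CSR version is sketched below. This is the generic kernel, not the paper's optimized hybrid implementation, and the 2x2 example matrix is an assumption.

```python
# Sketch: sparse matrix-vector product y = A @ x in CSR form, usually
# the dominant kernel inside a Krylov solver. On a GPU, a thread (or a
# warp) would handle each row; here a plain serial loop.

def spmv_csr(row_ptr, col_idx, vals, x):
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):  # nonzeros of row i
            s += vals[k] * x[col_idx[k]]
        y.append(s)
    return y

# 2x2 example: [[2, 1], [0, 3]] @ [1, 1] = [3, 3]
print(spmv_csr([0, 2, 3], [0, 1, 1], [2.0, 1.0, 3.0], [1.0, 1.0]))  # [3.0, 3.0]
```

Because this kernel is memory-bandwidth bound, per-node acceleration hinges on data layout and transfer, which is where hybrid CPU-GPU optimizations pay off.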
Accelerating Geometric Multigrid Preconditioning with Half-Precision Arithmetic on GPUs
With the hardware support for half-precision arithmetic on NVIDIA V100 GPUs,
high-performance computing applications can benefit from lower precision at
appropriate spots to speed up the overall execution time. In this paper, we
investigate a mixed-precision geometric multigrid method to solve large sparse
systems of equations stemming from discretization of elliptic PDEs. While the
final solution is always computed with high-precision accuracy, an iterative
refinement approach with multigrid preconditioning in lower precision and
residuum scaling is employed. We compare the FP64 baseline for Poisson's
equation to purely FP16 multigrid preconditioning and to the employment of
FP16-FP32-FP64 combinations within a mesh hierarchy. While the iteration count
is almost unaffected by the lower precision, the solver runtime decreases
considerably due to the reduced memory traffic, and a speedup of up
to 2.5x is gained for the overall solver. We investigate the performance of
selected kernels with the hierarchical Roofline model.
Complete PISO and SIMPLE solvers on Graphics Processing Units
We implemented the pressure-implicit with splitting of operators (PISO) and
semi-implicit method for pressure-linked equations (SIMPLE) solvers of the
Navier-Stokes equations on Fermi-class graphics processing units (GPUs) using
the CUDA technology. We also introduced a new format of sparse matrices
optimized for performing elementary CFD operations, like gradient or divergence
discretization, on GPUs. We verified the validity of the implementation on
several standard, steady and unsteady problems. Computational efficiency of the
GPU implementation was examined by comparing its double precision run times
with those of essentially the same algorithms implemented in OpenFOAM. The
results show that a GPU (Tesla C2070) can outperform a server-class 6-core,
12-thread CPU (Intel Xeon X5670) by a factor of 4.2.
GPU accelerated spectral finite elements on all-hex meshes
This paper presents a spectral element finite element scheme that efficiently
solves elliptic problems on unstructured hexahedral meshes. The discrete
equations are solved using a matrix-free preconditioned conjugate gradient
algorithm. An additive Schwarz two-scale preconditioner is employed that
yields h-independent convergence. An extensible multi-threading programming
API is used as a common kernel language that allows runtime selection of
different computing devices (GPU and CPU) and different threading interfaces
(CUDA, OpenCL and OpenMP). Performance tests demonstrate that problems with
over 50 million degrees of freedom can be solved in a few seconds on an
off-the-shelf GPU. (23 pages, 7 figures)
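The matrix-free structure of such a solver can be sketched as a conjugate gradient routine that receives the operator as a function; the preconditioner callback, operator, and example system below are illustrative assumptions, not the paper's spectral-element kernels.

```python
# Sketch: preconditioned conjugate gradient where the operator is a
# callback (matrix-free, as in the spectral element scheme) and the
# preconditioner is another callback (identity by default).

def pcg(apply_A, b, precond=lambda r: list(r), tol=1e-10, max_iter=200):
    x = [0.0] * len(b)
    r = list(b)                                   # residual for x = 0
    z = precond(r)
    p = list(z)
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rz / sum(pi * qi for pi, qi in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        if max(map(abs, r)) < tol:
            break
        z = precond(r)
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# 1D Laplacian applied matrix-free; diagonal (Jacobi) preconditioner
# stands in for the two-scale additive Schwarz preconditioner.
def lap(u):
    n = len(u)
    return [2*u[i] - (u[i-1] if i else 0) - (u[i+1] if i < n-1 else 0)
            for i in range(n)]

x = pcg(lap, [1.0, 0.0, 0.0], precond=lambda r: [ri / 2 for ri in r])
print([round(v, 6) for v in x])  # [0.75, 0.5, 0.25]
```

Keeping the operator as a callback is what lets a single solver body run unchanged over different backends (CUDA, OpenCL, OpenMP): only `apply_A` and `precond` need device-specific implementations.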