Finite Element Integration with Quadrature on the GPU
We present a novel, quadrature-based finite element integration method for
low-order elements on GPUs, using a pattern we call \textit{thread
transposition} to avoid reductions while vectorizing aggressively. On the
NVIDIA GTX580, which has a nominal single precision peak flop rate of 1.5 TF/s
and a memory bandwidth of 192 GB/s, we achieve close to 300 GF/s for element
integration on a first-order discretization of the Laplacian operator with
variable coefficients in two dimensions, and over 400 GF/s in three dimensions.
From our performance model we find that this corresponds to 90\% of our
measured achievable bandwidth peak of 310 GF/s. Further experimental results
also match the predicted performance when used with double precision (120 GF/s
in two dimensions, 150 GF/s in three dimensions). Results obtained for the
linear elasticity equations (220 GF/s and 70 GF/s in two dimensions, 180 GF/s
and 60 GF/s in three dimensions) also demonstrate the applicability of our
method to vector-valued partial differential equations.
Comment: 14 pages, 6 figures
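The quadrature-based element integration described above can be sketched in plain numpy as an illustrative CPU analogue: batched evaluation of the variable-coefficient P1 Laplacian residual, vectorized over elements the way a GPU vectorizes over threads. This is not the authors' CUDA kernel (and does not reproduce their thread-transposition pattern); all names and the data layout are our assumptions.

```python
import numpy as np

def p1_laplacian_element_vectors(coords, u, kappa_q, qweights):
    """Batched element residuals f_i = int_T kappa grad(phi_i).grad(u_h)
    for P1 triangles (illustrative sketch, not the paper's kernel).

    coords:   (E, 3, 2) vertex coordinates per element
    u:        (E, 3)    coefficients of u_h per element
    kappa_q:  (E, Q)    variable coefficient at quadrature points
    qweights: (Q,)      reference-cell quadrature weights (sum to 1/2)
    returns   (E, 3)    element residual vectors
    """
    # Affine map: per-element Jacobian and its determinant.
    J = np.stack([coords[:, 1] - coords[:, 0],
                  coords[:, 2] - coords[:, 0]], axis=-1)       # (E, 2, 2)
    detJ = J[:, 0, 0] * J[:, 1, 1] - J[:, 0, 1] * J[:, 1, 0]   # (E,)
    Jinv = np.linalg.inv(J)                                    # (E, 2, 2)
    # Reference gradients of the three P1 basis functions (constant on T).
    gref = np.array([[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]])    # (3, 2)
    # Physical gradients: g[e, b, :] = Jinv[e].T @ gref[b].
    g = np.einsum('bd,edk->ebk', gref, Jinv)                   # (E, 3, 2)
    # grad(u_h) is constant per P1 element.
    gradu = np.einsum('eb,ebk->ek', u, g)                      # (E, 2)
    # Quadrature only weights the coefficient; the rest is constant.
    kw = kappa_q @ qweights                                    # (E,)
    return (kw * np.abs(detJ))[:, None] * np.einsum('ebk,ek->eb', g, gradu)
```

On a GPU the interesting part is how the gather of `u`, the quadrature loop, and the scatter of the result are mapped to threads; the batched einsums above only convey the arithmetic structure of the kernel.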
Manycore parallel computing for a hybridizable discontinuous Galerkin nested multigrid method
We present a parallel computing strategy for a hybridizable discontinuous
Galerkin (HDG) nested geometric multigrid (GMG) solver. Parallel GMG solvers
require a combination of coarse-grain and fine-grain parallelism to improve
time-to-solution performance. In this work we focus on fine-grain parallelism.
We use Intel's second generation Xeon Phi (Knights Landing) many-core
processor. The GMG method achieves ideal convergence rates for high polynomial
orders. A matrix-free (assembly-free) technique is exploited to reduce memory
usage considerably and increase arithmetic intensity.
HDG enables static condensation, and due to the discontinuous nature of the
discretization, we developed a matrix-vector multiply routine that does not
require any costly synchronizations or barriers. Our algorithm is able to
attain 80\% of peak bandwidth performance for higher order polynomials. This is
possible due to the data locality inherent in the HDG method. Very high
performance is realized for high-order schemes thanks to their good arithmetic
intensity, which declines as the polynomial order is reduced.
Comment: 23 pages, 10 figures
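The matrix-free (assembly-free) application described above can be sketched with a gather/apply/scatter loop over precomputed per-element condensed blocks, never forming the global matrix. This is a serial numpy emulation under our own assumptions about the data layout, not the paper's barrier-free Knights Landing routine; `np.add.at` stands in for the accumulation that the HDG data locality lets the authors perform without costly synchronization.

```python
import numpy as np

def matrix_free_matvec(element_mats, dofmap, x):
    """Compute y = A @ x without ever assembling A (illustrative sketch).

    element_mats: (E, n, n) local (statically condensed) dense blocks
    dofmap:       (E, n)    global trace-dof indices per element
    x:            (N,)      input vector
    """
    y = np.zeros_like(x)
    xe = x[dofmap]                                  # gather local dofs: (E, n)
    ye = np.einsum('eij,ej->ei', element_mats, xe)  # apply local blocks
    np.add.at(y, dofmap, ye)                        # scatter-add into y
    return y
```

The matrix-free form trades storage for recomputation: only the small local blocks (or even just the data needed to form them on the fly) are kept, which raises arithmetic intensity relative to a stored sparse matrix.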
A performance spectrum for parallel computational frameworks that solve PDEs
Important computational physics problems are often large-scale in nature, and
it is highly desirable to have robust and high performing computational
frameworks that can quickly address these problems. However, it is no trivial
task to determine whether a computational framework is performing efficiently
or is scalable. The aim of this paper is to present various strategies for
better understanding the performance of parallel computational frameworks for
solving PDEs. Important performance issues that negatively impact
time-to-solution are discussed, and we propose a performance spectrum analysis
that can deepen one's understanding of these critical issues. As proof of
concept, we examine commonly used finite element simulation packages and apply
the performance spectrum to quickly analyze the
performance and scalability across various hardware platforms, software
implementations, and numerical discretizations. It is shown that the proposed
performance spectrum is a versatile performance model that is not only
extendable to more complex PDEs such as hydrostatic ice sheet flow equations,
but also useful for understanding hardware performance in a massively parallel
computing environment. Potential applications and future extensions of this
work are also discussed.
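Two standard ingredients of the kind of hardware-aware analysis sketched above are a roofline bound (performance capped by either compute or memory) and a strong-scaling efficiency. These helpers are a generic illustration with hypothetical numbers, not the paper's performance-spectrum model itself.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Roofline model: attainable rate is the lesser of the compute peak
    and bandwidth * arithmetic intensity (flops per byte moved)."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

def strong_scaling_efficiency(t1, tp, p):
    """Strong scaling: ratio of ideal time t1/p to measured time tp on p
    processes; 1.0 means perfect speedup."""
    return t1 / (p * tp)
```

For example, a kernel with an arithmetic intensity of 1 flop/byte on a 192 GB/s device is memory-bound at 192 GF/s regardless of the compute peak, which is the regime in which low-order discretizations typically sit.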