Finite Element Integration with Quadrature on the GPU
We present a novel, quadrature-based finite element integration method for
low-order elements on GPUs, using a pattern we call \textit{thread
transposition} to avoid reductions while vectorizing aggressively. On the
NVIDIA GTX580, which has a nominal single precision peak flop rate of 1.5 TF/s
and a memory bandwidth of 192 GB/s, we achieve close to 300 GF/s for element
integration on a first-order discretization of the Laplacian operator with
variable coefficients in two dimensions, and over 400 GF/s in three dimensions.
From our performance model we find that this corresponds to 90\% of our
measured achievable bandwidth peak of 310 GF/s. Further experimental results
also match the predicted performance when used with double precision (120 GF/s
in two dimensions, 150 GF/s in three dimensions). Results obtained for the
linear elasticity equations (220 GF/s and 70 GF/s in two dimensions, 180 GF/s
and 60 GF/s in three dimensions) also demonstrate the applicability of our
method to vector-valued partial differential equations.
Comment: 14 pages, 6 figures
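The quadrature-based element integration described above can be sketched in plain numpy as an illustrative CPU analogue: batched evaluation of the variable-coefficient P1 Laplacian residual, vectorized over elements the way a GPU vectorizes over threads. This is not the authors' CUDA kernel (and does not reproduce their thread-transposition pattern); all names and the data layout are our assumptions.

```python
import numpy as np

def p1_laplacian_element_vectors(coords, u, kappa_q, qweights):
    """Batched element residuals f_i = int_T kappa grad(phi_i).grad(u_h)
    for P1 triangles (illustrative sketch, not the paper's kernel).

    coords:   (E, 3, 2) vertex coordinates per element
    u:        (E, 3)    coefficients of u_h per element
    kappa_q:  (E, Q)    variable coefficient at quadrature points
    qweights: (Q,)      reference-cell quadrature weights (sum to 1/2)
    returns   (E, 3)    element residual vectors
    """
    # Affine map: per-element Jacobian and its determinant.
    J = np.stack([coords[:, 1] - coords[:, 0],
                  coords[:, 2] - coords[:, 0]], axis=-1)       # (E, 2, 2)
    detJ = J[:, 0, 0] * J[:, 1, 1] - J[:, 0, 1] * J[:, 1, 0]   # (E,)
    Jinv = np.linalg.inv(J)                                    # (E, 2, 2)
    # Reference gradients of the three P1 basis functions (constant on T).
    gref = np.array([[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]])    # (3, 2)
    # Physical gradients: g[e, b, :] = Jinv[e].T @ gref[b].
    g = np.einsum('bd,edk->ebk', gref, Jinv)                   # (E, 3, 2)
    # grad(u_h) is constant per P1 element.
    gradu = np.einsum('eb,ebk->ek', u, g)                      # (E, 2)
    # Quadrature only weights the coefficient; the rest is constant.
    kw = kappa_q @ qweights                                    # (E,)
    return (kw * np.abs(detJ))[:, None] * np.einsum('ebk,ek->eb', g, gradu)
```

On a GPU the interesting part is how the gather of `u`, the quadrature loop, and the scatter of the result are mapped to threads; the batched einsums above only convey the arithmetic structure of the kernel.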
Manycore parallel computing for a hybridizable discontinuous Galerkin nested multigrid method
We present a parallel computing strategy for a hybridizable discontinuous
Galerkin (HDG) nested geometric multigrid (GMG) solver. Parallel GMG solvers
require a combination of coarse-grain and fine-grain parallelism to improve
time-to-solution performance. In this work we focus on fine-grain parallelism.
We use Intel's second generation Xeon Phi (Knights Landing) many-core
processor. The GMG method achieves ideal convergence rates for high polynomial
orders. A matrix-free (assembly-free) technique is exploited to reduce memory
usage considerably and increase arithmetic intensity.
HDG enables static condensation, and due to the discontinuous nature of the
discretization, we developed a matrix-vector multiply routine that does not
require any costly synchronizations or barriers. Our algorithm is able to
attain 80\% of peak bandwidth performance for higher order polynomials. This is
possible due to the data locality inherent in the HDG method. Very high
performance is realized for high-order schemes thanks to their good arithmetic
intensity, which declines as the polynomial order is reduced.
Comment: 23 pages, 10 figures
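The matrix-free (assembly-free) application described above can be sketched with a gather/apply/scatter loop over precomputed per-element condensed blocks, never forming the global matrix. This is a serial numpy emulation under our own assumptions about the data layout, not the paper's barrier-free Knights Landing routine; `np.add.at` stands in for the accumulation that the HDG data locality lets the authors perform without costly synchronization.

```python
import numpy as np

def matrix_free_matvec(element_mats, dofmap, x):
    """Compute y = A @ x without ever assembling A (illustrative sketch).

    element_mats: (E, n, n) local (statically condensed) dense blocks
    dofmap:       (E, n)    global trace-dof indices per element
    x:            (N,)      input vector
    """
    y = np.zeros_like(x)
    xe = x[dofmap]                                  # gather local dofs: (E, n)
    ye = np.einsum('eij,ej->ei', element_mats, xe)  # apply local blocks
    np.add.at(y, dofmap, ye)                        # scatter-add into y
    return y
```

The matrix-free form trades storage for recomputation: only the small local blocks (or even just the data needed to form them on the fly) are kept, which raises arithmetic intensity relative to a stored sparse matrix.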
A performance spectrum for parallel computational frameworks that solve PDEs
Important computational physics problems are often large-scale in nature, and
it is highly desirable to have robust and high performing computational
frameworks that can quickly address these problems. However, it is no trivial
task to determine whether a computational framework is performing efficiently
or is scalable. The aim of this paper is to present various strategies for
better understanding the performance of parallel computational frameworks for
solving PDEs. Important performance issues that negatively impact
time-to-solution are discussed, and we propose a performance spectrum analysis
that can deepen one's understanding of these critical issues. As proof of
concept, we examine commonly used finite element simulation packages and apply
the performance spectrum to quickly analyze the
performance and scalability across various hardware platforms, software
implementations, and numerical discretizations. It is shown that the proposed
performance spectrum is a versatile performance model that is not only
extendable to more complex PDEs such as hydrostatic ice sheet flow equations,
but also useful for understanding hardware performance in a massively parallel
computing environment. Potential applications and future extensions of this
work are also discussed.
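Two standard ingredients of the kind of hardware-aware analysis sketched above are a roofline bound (performance capped by either compute or memory) and a strong-scaling efficiency. These helpers are a generic illustration with hypothetical numbers, not the paper's performance-spectrum model itself.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Roofline model: attainable rate is the lesser of the compute peak
    and bandwidth * arithmetic intensity (flops per byte moved)."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

def strong_scaling_efficiency(t1, tp, p):
    """Strong scaling: ratio of ideal time t1/p to measured time tp on p
    processes; 1.0 means perfect speedup."""
    return t1 / (p * tp)
```

For example, a kernel with an arithmetic intensity of 1 flop/byte on a 192 GB/s device is memory-bound at 192 GF/s regardless of the compute peak, which is the regime in which low-order discretizations typically sit.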