Global finite element matrix construction based on a CPU-GPU implementation
The finite element method (FEM) involves several computational steps in the
numerical solution of a given problem, and many efforts have been directed at
accelerating the solution stage of the resulting linear system of equations.
However, the construction of the finite element matrix, which is also
time-consuming for unstructured meshes, has received less attention. The generation of the
global finite element matrix is performed in two steps, computing the local
matrices by numerical integration and assembling them into a global system,
which has traditionally been done in serial computing. This work presents a
fast technique to construct the global finite element matrix that arises by
solving Poisson's equation in a three-dimensional domain. The proposed
methodology computes the numerical integration on the graphics processing unit
(GPU), exploiting its intrinsic parallelism, and performs the matrix assembly
on the central processing unit (CPU), owing to its intrinsically serial
operations. In the numerical integration, only the lower
triangular part of each local stiffness matrix is computed thanks to its
symmetry, which saves GPU memory and computing time. As a result of symmetry,
the global sparse matrix also contains non-zero elements only in its lower
triangular part, which reduces the assembly operations and memory usage. This
methodology allows the global sparse matrix to be generated from an
unstructured finite element mesh of any size on GPUs with little memory
capacity, limited only by the CPU memory.
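The symmetry exploited above can be sketched in a few lines: for a linear (P1) tetrahedron, the local stiffness matrix is symmetric, so only its 10 lower-triangular entries need to be computed and stored instead of all 16. The helper below is a hypothetical CPU-side illustration in numpy (the paper's version runs this step as a GPU kernel), for the Laplace operator arising from Poisson's equation.

```python
import numpy as np

def tet_stiffness_lower(verts):
    """Lower-triangular part of the 4x4 local stiffness matrix for
    linear (P1) basis functions on a tetrahedron, for the Laplace
    operator. `verts` is a (4, 3) array of vertex coordinates.
    Returns the 10 packed entries K[i, j], j <= i (row-major order).
    Hypothetical helper; in the paper this step runs on the GPU."""
    # Shape functions are affine, N_i(x) = a_i + g_i . x, with
    # N_i(v_j) = delta_ij; solving A P = I gives (a_i, g_i) per column.
    A = np.hstack([np.ones((4, 1)), verts])   # rows: [1, x_j, y_j, z_j]
    P = np.linalg.inv(A)
    grads = P[1:, :].T                        # (4, 3) shape-function gradients
    vol = abs(np.linalg.det(A)) / 6.0         # tetrahedron volume
    K = vol * grads @ grads.T                 # full symmetric 4x4 matrix
    # Pack only the lower triangle: 10 floats instead of 16.
    return np.array([K[i, j] for i in range(4) for j in range(i + 1)])
```

Storing the packed triangle roughly halves GPU memory traffic per element, and the global matrix assembled from these entries is likewise stored as a lower-triangular sparse matrix.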
Numerical integration on GPUs for higher order finite elements
The paper considers the problem of implementation on graphics processors of
numerical integration routines for higher order finite element approximations.
The design of suitable GPU kernels is investigated in the context of general
purpose integration procedures, as well as particular example applications. The
most important characteristic of the problem investigated is the large
variation of required processor and memory resources associated with different
degrees of the approximating polynomials. The questions we try to answer are
whether it is possible to design a single integration kernel for different GPUs
and different orders of approximation, and what performance can be expected in
such a case.
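The resource variation described above comes largely from the quadrature rule: the number of integration points, and hence the work and register pressure per element, grows rapidly with the polynomial degree. The sketch below (an illustrative numpy stand-in for a GPU kernel's point loop) builds a tensor-product Gauss-Legendre rule on a reference square whose point count grows as (p+1)^2 with the approximation order p.

```python
import numpy as np

def reference_quad_rule(p):
    """Tensor-product Gauss-Legendre rule on the reference square
    [0,1]^2, exact for polynomials of degree 2p+1 per direction, as
    needed when assembling matrices for order-p elements. The point
    count, (p+1)^2, is the per-element resource that grows with the
    approximation order. (Illustrative sketch.)"""
    x, w = np.polynomial.legendre.leggauss(p + 1)  # rule on [-1, 1]
    x = 0.5 * (x + 1.0)                            # map points to [0, 1]
    w = 0.5 * w                                    # rescale weights
    X, Y = np.meshgrid(x, x)
    W = np.outer(w, w)
    return X.ravel(), Y.ravel(), W.ravel()

def integrate(f, p):
    """Approximate the integral of f over [0,1]^2 with the order-p rule."""
    xq, yq, wq = reference_quad_rule(p)
    return float(np.sum(wq * f(xq, yq)))
```

A single kernel serving all orders must therefore cope with per-element loop counts ranging from a handful of points to hundreds, which is exactly the design tension the paper investigates.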
Finite element numerical integration for first order approximations on multi-core architectures
The paper presents investigations on the implementation and performance of
the finite element numerical integration algorithm for first order
approximations on three processor architectures popular in scientific
computing: classical CPUs, the Intel Xeon Phi and the NVIDIA Kepler GPU. A unifying
programming model and portable OpenCL implementation is considered for all
architectures. Variations of the algorithm due to different problems solved and
different element types are investigated, and several optimizations aimed at
properly mapping the algorithm to the target architectures are
demonstrated. Performance models of execution are developed for different
processors and tested in practical experiments. The results show the varying
levels of performance for different architectures, but indicate that the
algorithm can be effectively ported to all of them. The general conclusion is
that the finite element numerical integration can achieve sufficient
performance on different multi- and many-core architectures and should not
become a performance bottleneck for finite element simulation codes. Specific
observations lead to practical advice on how to optimize the kernels and what
performance can be expected for the tested architectures.
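For first-order approximations, a common mapping on all three architectures is to vectorize over elements: every SIMD lane or GPU thread handles one element, and all per-element geometry factors are computed in a single batched pass. The numpy sketch below illustrates that structure for linear triangles and the Laplace operator; it is an illustrative stand-in, not the paper's OpenCL kernel.

```python
import numpy as np

def batched_tri_stiffness(verts):
    """Local Laplace stiffness matrices for a batch of linear (P1)
    triangles, computed in one vectorized pass over all elements --
    the one-lane-per-element mapping used for first-order elements.
    `verts` has shape (n_elems, 3, 2); returns (n_elems, 3, 3).
    (Illustrative numpy sketch of the data-parallel structure.)"""
    e1 = verts[:, 1] - verts[:, 0]                   # edge vectors, (n, 2)
    e2 = verts[:, 2] - verts[:, 0]
    det = e1[:, 0] * e2[:, 1] - e1[:, 1] * e2[:, 0]  # Jacobian determinants
    area = 0.5 * np.abs(det)
    inv_det = 1.0 / det
    # Inverse Jacobian for J = [e1 e2] (as columns), per element.
    Jinv = np.empty((verts.shape[0], 2, 2))
    Jinv[:, 0, 0] = e2[:, 1] * inv_det
    Jinv[:, 0, 1] = -e2[:, 0] * inv_det
    Jinv[:, 1, 0] = -e1[:, 1] * inv_det
    Jinv[:, 1, 1] = e1[:, 0] * inv_det
    # Reference gradients of the three P1 shape functions.
    ref_grads = np.array([[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]])
    grads = ref_grads @ Jinv                         # (n, 3, 2) physical gradients
    return area[:, None, None] * (grads @ grads.transpose(0, 2, 1))
```

Because every element executes the identical instruction sequence, this mapping keeps lanes fully occupied on wide-vector CPUs, the Xeon Phi, and GPUs alike, which is why first-order integration ports well to all three.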
Vectorized OpenCL implementation of numerical integration for higher order finite elements
In our work we analyze computational aspects of the problem of numerical
integration in finite element calculations and consider an OpenCL
implementation of related algorithms for processors with wide vector registers.
As a platform for testing the implementation we choose the PowerXCell
processor, being an example of the Cell Broadband Engine (CellBE) architecture.
Although the processor is considered old by today's standards (its design
dates back to 2001), we investigate its performance because of two features
it shares with the more recent Xeon Phi family of coprocessors: wide vector
units and a relatively slow connection between the computing cores and main
global memory. The
performed analysis of parallelization options can also be used for designing
numerical integration algorithms for other processors with vector registers,
such as contemporary x86 microprocessors.
Comment: published online in Computers and Mathematics with Applications:
http://www.sciencedirect.com/science/article/pii/S089812211300521
GPU Accelerated Finite Element Assembly with Runtime Compilation
In recent years, high-performance scientific computing on graphics processing
units (GPUs) has gained widespread acceptance. These devices are designed to
offer massively parallel threads for running general-purpose code. Much
research has focused on the finite element method on GPUs, but most of this
work is specific to certain problems and applications. Some works propose
finite element assembly methods that are general across a wide range of finite
element models; however, the development of finite element code remains
dependent on the hardware architecture, and using the libraries provided by
hardware vendors is usually complicated and error-prone. In this paper, we
present the architecture and implementation of finite element assembly for
partial differential equations (PDEs) based on symbolic computation and a
runtime compilation technique on the GPU. A user-friendly programming
interface with symbolic computation is provided, while high computational
efficiency is achieved through runtime compilation. To the best of our
knowledge, this is the first work to use this technique to accelerate finite
element assembly for solving PDEs. Experiments show a speedup of one to two
orders of magnitude for the problems studied in the paper.
Comment: 6 pages, 8 figures, conference
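The idea of combining a symbolic problem description with runtime compilation can be sketched in plain Python: a kernel's source is generated as a string from a user-supplied coefficient expression and compiled on the fly with `compile`/`exec`. This is only a hypothetical stand-in for the paper's pipeline, which emits GPU code rather than Python; the function and parameter names are invented for illustration.

```python
import numpy as np

def build_kernel(coeff_expr):
    """Generate and compile, at run time, a per-quadrature-point kernel
    for the form  integral( coeff * grad(u) . grad(v) ), where the
    variable coefficient is given as a source string in `x` and `y`.
    A plain-Python stand-in for symbolic input + runtime compilation;
    the paper's system generates GPU kernels instead."""
    src = (
        "def kernel(x, y, gi, gj, w):\n"
        "    # one quadrature point's contribution to K[i, j]:\n"
        "    # weight * coefficient * (grad N_i . grad N_j)\n"
        f"    return w * ({coeff_expr}) * (gi * gj).sum(axis=-1)\n"
    )
    namespace = {"np": np}
    exec(compile(src, "<generated-kernel>", "exec"), namespace)
    return namespace["kernel"]
```

The benefit mirrored here is that the generated kernel is specialized to the exact coefficient expression before it runs, so no interpretive overhead (or generic branching, in the GPU case) remains in the inner loop.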
GPU accelerated spectral finite elements on all-hex meshes
This paper presents a spectral finite element scheme that efficiently
solves elliptic problems on unstructured hexahedral meshes. The discrete
equations are solved using a matrix-free preconditioned conjugate gradient
algorithm. An additive Schwarz two-scale preconditioner is employed that
yields h-independent convergence. An extensible multi-threading programming
API is used as a common kernel language, allowing runtime selection among
different computing devices (GPU and CPU) and different threading interfaces
(CUDA, OpenCL and OpenMP). Performance tests demonstrate that problems with
over 50 million degrees of freedom can be solved in a few seconds on an
off-the-shelf GPU.
Comment: 23 pages, 7 figures
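The matrix-free structure mentioned above means the solver only ever needs two callables: one applying the operator and one applying the preconditioner, so the global matrix is never stored. The sketch below shows that skeleton as generic preconditioned CG; the paper's `apply_A` is a GPU spectral-element kernel and its `apply_M` a two-scale additive Schwarz preconditioner, both replaced here by placeholders.

```python
import numpy as np

def matrix_free_pcg(apply_A, b, apply_M, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradients where the operator and the
    preconditioner are callables, so the matrix A is never assembled
    or stored. Generic sketch of the matrix-free structure; the paper
    supplies GPU kernels for both callables."""
    x = np.zeros_like(b)
    r = b - apply_A(x)          # initial residual
    z = apply_M(r)              # preconditioned residual
    p = z.copy()                # search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = apply_M(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

Because each iteration touches only vectors plus one operator application, memory scales with the number of unknowns rather than with the matrix, which is what makes 50-million-DOF problems fit on a single GPU.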
A New Sparse Matrix Vector Multiplication GPU Algorithm Designed for Finite Element Problems
Recently, graphics processors (GPUs) have been increasingly leveraged in a
variety of scientific computing applications. However, architectural
differences between CPUs and GPUs necessitate the development of algorithms
that take advantage of GPU hardware. As sparse matrix vector multiplication
(SPMV) operations are commonly used in finite element analysis, a new SPMV
algorithm and several variations are developed for unstructured finite element
meshes on GPUs. The effective bandwidth of current GPU algorithms and the newly
proposed algorithms are measured and analyzed for 15 sparse matrices of varying
sizes and varying sparsity structures. The effects of optimization, and the
differences between the new GPU algorithm and its variants, are then studied.
Lastly, both the new and the current SPMV GPU algorithms are used within the
GPU CG solver in GPU finite element simulations of the heart.
These results are then compared against results from a parallel PETSc finite
element implementation. The effective bandwidth tests indicate that the new
algorithms compare very favorably with current algorithms for a wide variety of
sparse matrices and can yield very notable benefits. GPU finite element
simulation results demonstrate the benefit of using GPUs for finite element
analysis, and also show that the proposed algorithms can yield speedup factors
up to 12-fold for real finite element applications.
Comment: 35 pages, 22 figures
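For reference, the baseline every such SPMV algorithm must beat is the textbook compressed sparse row (CSR) product, where each row's nonzeros are contiguous and a GPU kernel typically assigns rows (or row chunks) to threads. The serial loop below states the arithmetic those kernels reproduce; it is a generic sketch, not the paper's new algorithm, whose storage layout is tuned to finite element sparsity patterns.

```python
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """y = A @ x for a matrix A in CSR storage: `indptr[row]` points to
    the first stored entry of `row` in the parallel arrays `indices`
    (column ids) and `data` (values). A GPU kernel parallelizes the
    outer loop over rows; the per-row dot product is unchanged."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        start, end = indptr[row], indptr[row + 1]
        y[row] = data[start:end] @ x[indices[start:end]]
    return y
```

The performance pitfall this exposes on GPUs is load imbalance and uncoalesced access when row lengths vary, which is precisely the irregularity unstructured finite element meshes produce and that specialized layouts try to tame.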
A curved-element unstructured discontinuous Galerkin method on GPUs for the Euler equations
In this work we consider Runge-Kutta discontinuous Galerkin methods (RKDG)
for the solution of hyperbolic equations enabling high order discretization in
space and time. We aim at an efficient implementation of DG for Euler equations
on GPUs. A mesh curvature approach is presented for the proper resolution of
the domain boundary. This approach is based on the linear elasticity equations
and enables a boundary approximation of arbitrarily high order. To
demonstrate the performance of the boundary curvature approach, a massively
parallel solver on graphics processors is implemented and used to solve
the Euler equations of gas dynamics.
Tensor B-Spline Numerical Methods for PDEs: a High-Performance Alternative to FEM
Tensor B-spline methods are a high-performance alternative to solve partial
differential equations (PDEs). This paper gives an overview of the principles
of the tensor B-spline methodology, demonstrates its use, analyzes its
performance in application examples, and discusses its merits. Tensors preserve the
dimensional structure of a discretized PDE, which makes it possible to develop
highly efficient computational solvers. B-splines provide high-quality
approximations, lead to a sparse structure of the system operator represented
by shift-invariant separable kernels in the domain, and are mesh-free by
construction. Further, high-order bases can easily be constructed from
B-splines. In order to demonstrate the advantageous numerical performance of
tensor B-spline methods, we studied the solution of a large-scale heat-equation
problem (roughly 0.8 billion nodes) on a heterogeneous workstation with a
multi-core CPU and GPUs. Our experimental results confirm the excellent
numerical approximation properties of tensor B-splines and their combination
of high computational efficiency and low memory consumption, showing
substantial improvements over standard finite-element methods (FEM).
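The tensor structure referred to above has a concrete computational payoff: a separable operator of the form kron(B, C) can be applied to a discretized 2-D field without ever forming the Kronecker product, using the identity kron(B, C) @ vec(X) = vec(B @ X @ C.T) (with row-major vec). The numpy sketch below illustrates this; it is a generic illustration of the principle, not the paper's solver.

```python
import numpy as np

def apply_separable(B, C, x):
    """Apply the separable operator kron(B, C) to a flattened 2-D field
    without forming the Kronecker product, via
    kron(B, C) @ vec(X) = vec(B @ X @ C.T) for row-major vec(X).
    Storage drops from the size of the full matrix to the two small
    factors, which is the efficiency tensor B-spline discretizations
    preserve. (Illustrative sketch.)"""
    X = x.reshape(B.shape[1], C.shape[1])  # unflatten into a 2-D field
    return (B @ X @ C.T).ravel()
```

Applied dimension-by-dimension, the same identity turns one huge matrix-vector product into a sequence of small dense ones, which is how a problem with on the order of a billion nodes can remain tractable on a single workstation.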
Finite Element Integration with Quadrature on the GPU
We present a novel, quadrature-based finite element integration method for
low-order elements on GPUs, using a pattern we call "thread transposition"
to avoid reductions while vectorizing aggressively. On the
NVIDIA GTX580, which has a nominal single precision peak flop rate of 1.5 TF/s
and a memory bandwidth of 192 GB/s, we achieve close to 300 GF/s for element
integration on a first-order discretization of the Laplacian operator with
variable coefficients in two dimensions, and over 400 GF/s in three dimensions.
From our performance model we find that this corresponds to 90% of our
measured achievable bandwidth peak of 310 GF/s. Further experimental results
also match the predicted performance when used with double precision (120 GF/s
in two dimensions, 150 GF/s in three dimensions). Results obtained for the
linear elasticity equations (220 GF/s and 70 GF/s in two dimensions, 180 GF/s
and 60 GF/s in three dimensions) also demonstrate the applicability of our
method to vector-valued partial differential equations.
Comment: 14 pages, 6 figures