Global finite element matrix construction based on a CPU-GPU implementation
The finite element method (FEM) involves several computational steps in the
numerical solution of a given problem, and many efforts have been directed at
accelerating the solution stage of the resulting linear system of equations.
However, the construction of the finite element matrix, which is also
time-consuming for unstructured meshes, has received less attention. The generation of the
global finite element matrix is performed in two steps, computing the local
matrices by numerical integration and assembling them into a global system,
which has traditionally been done in serial computing. This work presents a
fast technique to construct the global finite element matrix that arises by
solving Poisson's equation in a three-dimensional domain. The proposed
methodology computes the numerical integration on the graphics processing unit
(GPU), exploiting its intrinsic parallelism, and performs the matrix assembly
on the central processing unit (CPU), owing to its intrinsically serial
operations. In the numerical integration, only the lower
triangular part of each local stiffness matrix is computed thanks to its
symmetry, which saves GPU memory and computing time. As a result of symmetry,
the global sparse matrix also contains non-zero elements only in its lower
triangular part, which reduces the assembly operations and memory usage. This
methodology allows the global sparse matrix to be generated from an
unstructured finite element mesh of any size on GPUs with little memory
capacity, limited only by the CPU memory.
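The symmetry exploited above can be sketched in a few lines: for a linear (P1) tetrahedron, the local stiffness matrix is symmetric, so only its 10 lower-triangular entries need to be computed and stored instead of all 16. The helper below is a hypothetical CPU-side illustration in numpy (the paper's version runs this step as a GPU kernel), for the Laplace operator arising from Poisson's equation.

```python
import numpy as np

def tet_stiffness_lower(verts):
    """Lower-triangular part of the 4x4 local stiffness matrix for
    linear (P1) basis functions on a tetrahedron, for the Laplace
    operator. `verts` is a (4, 3) array of vertex coordinates.
    Returns the 10 packed entries K[i, j], j <= i (row-major order).
    Hypothetical helper; in the paper this step runs on the GPU."""
    # Shape functions are affine, N_i(x) = a_i + g_i . x, with
    # N_i(v_j) = delta_ij; solving A P = I gives (a_i, g_i) per column.
    A = np.hstack([np.ones((4, 1)), verts])   # rows: [1, x_j, y_j, z_j]
    P = np.linalg.inv(A)
    grads = P[1:, :].T                        # (4, 3) shape-function gradients
    vol = abs(np.linalg.det(A)) / 6.0         # tetrahedron volume
    K = vol * grads @ grads.T                 # full symmetric 4x4 matrix
    # Pack only the lower triangle: 10 floats instead of 16.
    return np.array([K[i, j] for i in range(4) for j in range(i + 1)])
```

Storing the packed triangle roughly halves GPU memory traffic per element, and the global matrix assembled from these entries is likewise stored as a lower-triangular sparse matrix.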
Numerical integration on GPUs for higher order finite elements
The paper considers the problem of implementation on graphics processors of
numerical integration routines for higher order finite element approximations.
The design of suitable GPU kernels is investigated in the context of general
purpose integration procedures, as well as particular example applications. The
most important characteristic of the problem investigated is the large
variation of required processor and memory resources associated with different
degrees of the approximating polynomials. The questions we try to answer are
whether it is possible to design a single integration kernel for different GPUs
and different orders of approximation, and what performance can be expected in
such a case.
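The resource variation described above comes largely from the quadrature rule: the number of integration points, and hence the work and register pressure per element, grows rapidly with the polynomial degree. The sketch below (an illustrative numpy stand-in for a GPU kernel's point loop) builds a tensor-product Gauss-Legendre rule on a reference square whose point count grows as (p+1)^2 with the approximation order p.

```python
import numpy as np

def reference_quad_rule(p):
    """Tensor-product Gauss-Legendre rule on the reference square
    [0,1]^2, exact for polynomials of degree 2p+1 per direction, as
    needed when assembling matrices for order-p elements. The point
    count, (p+1)^2, is the per-element resource that grows with the
    approximation order. (Illustrative sketch.)"""
    x, w = np.polynomial.legendre.leggauss(p + 1)  # rule on [-1, 1]
    x = 0.5 * (x + 1.0)                            # map points to [0, 1]
    w = 0.5 * w                                    # rescale weights
    X, Y = np.meshgrid(x, x)
    W = np.outer(w, w)
    return X.ravel(), Y.ravel(), W.ravel()

def integrate(f, p):
    """Approximate the integral of f over [0,1]^2 with the order-p rule."""
    xq, yq, wq = reference_quad_rule(p)
    return float(np.sum(wq * f(xq, yq)))
```

A single kernel serving all orders must therefore cope with per-element loop counts ranging from a handful of points to hundreds, which is exactly the design tension the paper investigates.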
Finite element numerical integration for first order approximations on multi-core architectures
The paper presents investigations on the implementation and performance of
the finite element numerical integration algorithm for first order
approximations on three processor architectures popular in scientific
computing: classical CPUs, the Intel Xeon Phi and the NVIDIA Kepler GPU. A unifying
programming model and portable OpenCL implementation is considered for all
architectures. Variations of the algorithm due to different problems solved and
different element types are investigated, and several optimizations aimed at
properly mapping the algorithm to the target architectures are
demonstrated. Performance models of execution are developed for different
processors and tested in practical experiments. The results show the varying
levels of performance for different architectures, but indicate that the
algorithm can be effectively ported to all of them. The general conclusion is
that the finite element numerical integration can achieve sufficient
performance on different multi- and many-core architectures and should not
become a performance bottleneck for finite element simulation codes. Specific
observations lead to practical advice on how to optimize the kernels and what
performance can be expected for the tested architectures.
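For first-order approximations, a common mapping on all three architectures is to vectorize over elements: every SIMD lane or GPU thread handles one element, and all per-element geometry factors are computed in a single batched pass. The numpy sketch below illustrates that structure for linear triangles and the Laplace operator; it is an illustrative stand-in, not the paper's OpenCL kernel.

```python
import numpy as np

def batched_tri_stiffness(verts):
    """Local Laplace stiffness matrices for a batch of linear (P1)
    triangles, computed in one vectorized pass over all elements --
    the one-lane-per-element mapping used for first-order elements.
    `verts` has shape (n_elems, 3, 2); returns (n_elems, 3, 3).
    (Illustrative numpy sketch of the data-parallel structure.)"""
    e1 = verts[:, 1] - verts[:, 0]                   # edge vectors, (n, 2)
    e2 = verts[:, 2] - verts[:, 0]
    det = e1[:, 0] * e2[:, 1] - e1[:, 1] * e2[:, 0]  # Jacobian determinants
    area = 0.5 * np.abs(det)
    inv_det = 1.0 / det
    # Inverse Jacobian for J = [e1 e2] (as columns), per element.
    Jinv = np.empty((verts.shape[0], 2, 2))
    Jinv[:, 0, 0] = e2[:, 1] * inv_det
    Jinv[:, 0, 1] = -e2[:, 0] * inv_det
    Jinv[:, 1, 0] = -e1[:, 1] * inv_det
    Jinv[:, 1, 1] = e1[:, 0] * inv_det
    # Reference gradients of the three P1 shape functions.
    ref_grads = np.array([[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]])
    grads = ref_grads @ Jinv                         # (n, 3, 2) physical gradients
    return area[:, None, None] * (grads @ grads.transpose(0, 2, 1))
```

Because every element executes the identical instruction sequence, this mapping keeps lanes fully occupied on wide-vector CPUs, the Xeon Phi, and GPUs alike, which is why first-order integration ports well to all three.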
Vectorized OpenCL implementation of numerical integration for higher order finite elements
In our work we analyze computational aspects of the problem of numerical
integration in finite element calculations and consider an OpenCL
implementation of related algorithms for processors with wide vector registers.
As a platform for testing the implementation we choose the PowerXCell
processor, being an example of the Cell Broadband Engine (CellBE) architecture.
Although the processor is considered old by today's standards (its design
dates back to 2001), we investigate its performance because of two features
it shares with the more recent Xeon Phi family of coprocessors: wide vector
units and a relatively slow connection between the computing cores and main
global memory. The
performed analysis of parallelization options can also be used for designing
numerical integration algorithms for other processors with vector registers,
such as contemporary x86 microprocessors.
Comment: published online in Computers and Mathematics with Applications:
http://www.sciencedirect.com/science/article/pii/S089812211300521
GPU Accelerated Finite Element Assembly with Runtime Compilation
In recent years, high-performance scientific computing on graphics processing
units (GPUs) has gained widespread acceptance. These devices are designed to
offer massively parallel threads for running general-purpose code. Much
research has focused on the finite element method on GPUs, but most of this
work is specific to certain problems and applications. Some works propose
finite element assembly methods that are general across a wide range of finite
element models; however, the development of finite element code remains
dependent on the hardware architecture, and using the libraries provided by
hardware vendors is usually complicated and error-prone. In this paper, we
present the architecture and implementation of finite element assembly for
partial differential equations (PDEs) based on symbolic computation and a
runtime compilation technique on the GPU. A user-friendly programming
interface with symbolic computation is provided, while high computational
efficiency is achieved through runtime compilation. To the best of our
knowledge, this is the first work to use this technique to accelerate finite
element assembly for solving PDEs. Experiments show a speedup of one to two
orders of magnitude for the problems studied in the paper.
Comment: 6 pages, 8 figures, conference
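The idea of combining a symbolic problem description with runtime compilation can be sketched in plain Python: a kernel's source is generated as a string from a user-supplied coefficient expression and compiled on the fly with `compile`/`exec`. This is only a hypothetical stand-in for the paper's pipeline, which emits GPU code rather than Python; the function and parameter names are invented for illustration.

```python
import numpy as np

def build_kernel(coeff_expr):
    """Generate and compile, at run time, a per-quadrature-point kernel
    for the form  integral( coeff * grad(u) . grad(v) ), where the
    variable coefficient is given as a source string in `x` and `y`.
    A plain-Python stand-in for symbolic input + runtime compilation;
    the paper's system generates GPU kernels instead."""
    src = (
        "def kernel(x, y, gi, gj, w):\n"
        "    # one quadrature point's contribution to K[i, j]:\n"
        "    # weight * coefficient * (grad N_i . grad N_j)\n"
        f"    return w * ({coeff_expr}) * (gi * gj).sum(axis=-1)\n"
    )
    namespace = {"np": np}
    exec(compile(src, "<generated-kernel>", "exec"), namespace)
    return namespace["kernel"]
```

The benefit mirrored here is that the generated kernel is specialized to the exact coefficient expression before it runs, so no interpretive overhead (or generic branching, in the GPU case) remains in the inner loop.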
GPU accelerated spectral finite elements on all-hex meshes
This paper presents a spectral finite element scheme that efficiently
solves elliptic problems on unstructured hexahedral meshes. The discrete
equations are solved using a matrix-free preconditioned conjugate gradient
algorithm. An additive Schwarz two-scale preconditioner is employed that
yields h-independent convergence. An extensible multi-threading programming
API is used as a common kernel language, allowing runtime selection among
different computing devices (GPU and CPU) and different threading interfaces
(CUDA, OpenCL and OpenMP). Performance tests demonstrate that problems with
over 50 million degrees of freedom can be solved in a few seconds on an
off-the-shelf GPU.
Comment: 23 pages, 7 figures
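The matrix-free structure mentioned above means the solver only ever needs two callables: one applying the operator and one applying the preconditioner, so the global matrix is never stored. The sketch below shows that skeleton as generic preconditioned CG; the paper's `apply_A` is a GPU spectral-element kernel and its `apply_M` a two-scale additive Schwarz preconditioner, both replaced here by placeholders.

```python
import numpy as np

def matrix_free_pcg(apply_A, b, apply_M, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradients where the operator and the
    preconditioner are callables, so the matrix A is never assembled
    or stored. Generic sketch of the matrix-free structure; the paper
    supplies GPU kernels for both callables."""
    x = np.zeros_like(b)
    r = b - apply_A(x)          # initial residual
    z = apply_M(r)              # preconditioned residual
    p = z.copy()                # search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = apply_M(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

Because each iteration touches only vectors plus one operator application, memory scales with the number of unknowns rather than with the matrix, which is what makes 50-million-DOF problems fit on a single GPU.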
A New Sparse Matrix Vector Multiplication GPU Algorithm Designed for Finite Element Problems
Recently, graphics processors (GPUs) have been increasingly leveraged in a
variety of scientific computing applications. However, architectural
differences between CPUs and GPUs necessitate the development of algorithms
that take advantage of GPU hardware. As sparse matrix vector multiplication
(SPMV) operations are commonly used in finite element analysis, a new SPMV
algorithm and several variations are developed for unstructured finite element
meshes on GPUs. The effective bandwidth of current GPU algorithms and the newly
proposed algorithms are measured and analyzed for 15 sparse matrices of varying
sizes and varying sparsity structures. The effects of optimization, and the
differences between the new GPU algorithm and its variants, are then studied.
Lastly, both the new and the current SPMV GPU algorithms are used within the
GPU CG solver in GPU finite element simulations of the heart.
These results are then compared against results from a parallel PETSc finite
element implementation. The effective bandwidth tests indicate that the new
algorithms compare very favorably with current algorithms for a wide variety of
sparse matrices and can yield very notable benefits. GPU finite element
simulation results demonstrate the benefit of using GPUs for finite element
analysis, and also show that the proposed algorithms can yield speedup factors
up to 12-fold for real finite element applications.
Comment: 35 pages, 22 figures
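For reference, the baseline every such SPMV algorithm must beat is the textbook compressed sparse row (CSR) product, where each row's nonzeros are contiguous and a GPU kernel typically assigns rows (or row chunks) to threads. The serial loop below states the arithmetic those kernels reproduce; it is a generic sketch, not the paper's new algorithm, whose storage layout is tuned to finite element sparsity patterns.

```python
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """y = A @ x for a matrix A in CSR storage: `indptr[row]` points to
    the first stored entry of `row` in the parallel arrays `indices`
    (column ids) and `data` (values). A GPU kernel parallelizes the
    outer loop over rows; the per-row dot product is unchanged."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        start, end = indptr[row], indptr[row + 1]
        y[row] = data[start:end] @ x[indices[start:end]]
    return y
```

The performance pitfall this exposes on GPUs is load imbalance and uncoalesced access when row lengths vary, which is precisely the irregularity unstructured finite element meshes produce and that specialized layouts try to tame.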
A curved-element unstructured discontinuous Galerkin method on GPUs for the Euler equations
In this work we consider Runge-Kutta discontinuous Galerkin methods (RKDG)
for the solution of hyperbolic equations enabling high order discretization in
space and time. We aim at an efficient implementation of DG for Euler equations
on GPUs. A mesh curvature approach is presented for the proper resolution of
the domain boundary. This approach is based on the linear elasticity equations
and enables a boundary approximation of arbitrarily high order. To
demonstrate the performance of the boundary curvature approach, a massively
parallel solver on graphics processors is implemented and used to solve
the Euler equations of gas dynamics.
Tensor B-Spline Numerical Methods for PDEs: a High-Performance Alternative to FEM
Tensor B-spline methods are a high-performance alternative to solve partial
differential equations (PDEs). This paper gives an overview of the principles
of the tensor B-spline methodology, demonstrates its use, analyzes its
performance in application examples, and discusses its merits. Tensors preserve the
dimensional structure of a discretized PDE, which makes it possible to develop
highly efficient computational solvers. B-splines provide high-quality
approximations, lead to a sparse structure of the system operator represented
by shift-invariant separable kernels in the domain, and are mesh-free by
construction. Further, high-order bases can easily be constructed from
B-splines. In order to demonstrate the advantageous numerical performance of
tensor B-spline methods, we studied the solution of a large-scale heat-equation
problem (roughly 0.8 billion nodes) on a heterogeneous workstation with a
multi-core CPU and GPUs. Our experimental results confirm the excellent
numerical approximation properties of tensor B-splines and their combination
of high computational efficiency and low memory consumption, showing
substantial improvements over standard finite-element methods (FEM).
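The tensor structure referred to above has a concrete computational payoff: a separable operator of the form kron(B, C) can be applied to a discretized 2-D field without ever forming the Kronecker product, using the identity kron(B, C) @ vec(X) = vec(B @ X @ C.T) (with row-major vec). The numpy sketch below illustrates this; it is a generic illustration of the principle, not the paper's solver.

```python
import numpy as np

def apply_separable(B, C, x):
    """Apply the separable operator kron(B, C) to a flattened 2-D field
    without forming the Kronecker product, via
    kron(B, C) @ vec(X) = vec(B @ X @ C.T) for row-major vec(X).
    Storage drops from the size of the full matrix to the two small
    factors, which is the efficiency tensor B-spline discretizations
    preserve. (Illustrative sketch.)"""
    X = x.reshape(B.shape[1], C.shape[1])  # unflatten into a 2-D field
    return (B @ X @ C.T).ravel()
```

Applied dimension-by-dimension, the same identity turns one huge matrix-vector product into a sequence of small dense ones, which is how a problem with on the order of a billion nodes can remain tractable on a single workstation.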
Finite Element Integration with Quadrature on the GPU
We present a novel, quadrature-based finite element integration method for
low-order elements on GPUs, using a pattern we call "thread transposition"
to avoid reductions while vectorizing aggressively. On the
NVIDIA GTX580, which has a nominal single precision peak flop rate of 1.5 TF/s
and a memory bandwidth of 192 GB/s, we achieve close to 300 GF/s for element
integration on a first-order discretization of the Laplacian operator with
variable coefficients in two dimensions, and over 400 GF/s in three dimensions.
From our performance model we find that this corresponds to 90% of our
measured achievable bandwidth peak of 310 GF/s. Further experimental results
also match the predicted performance when used with double precision (120 GF/s
in two dimensions, 150 GF/s in three dimensions). Results obtained for the
linear elasticity equations (220 GF/s and 70 GF/s in two dimensions, 180 GF/s
and 60 GF/s in three dimensions) also demonstrate the applicability of our
method to vector-valued partial differential equations.
Comment: 14 pages, 6 figures