This article is devoted to graphics processing unit (GPU) kernel optimization and performance analysis of three tensorproduct operations arising in finite element methods. We provide a mathematical background to these operations and implementation details. Achieving close to peak performance for these operators requires extensive optimization because of the operators' properties: low arithmetic intensity, tiered structure, and the need to store intermediate results during the kernel execution. We give a guided overview of optimization strategies and we present a performance model that allows us to compare the efficacy of these optimizations against an empirically calibrated roofline.
Introduction
State-of-the-art high-order finite element method (FEM)based simulation codes now execute calculations on parallel machines with up to a million CPU cores. For example, the Nek5000 incompressible fluid flow simulator has successfully scaled up to a million CPU cores using an Message Passing Interface (MPI)-distributed computing model (Fischer et al., 2015) . The current shift toward greater onchip parallelism makes it imperative to focus on finer grain parallel implementation of high-order operators on next generation accelerators, such as graphics processing units. Efficient high-order finite element implementations rely heavily on tensor-product contractions (Deville et al., 2002) . These elemental tensor-product operations are memory bound and require repeated sweeps through the data for each finite element operation. In this work, we demonstrate that despite their complexity, it is possible to achieve nearly maximum throughput on a current generation GPU.
In this article, we focus in particular on GPU implementations of three commonplace matrix-vector multiplications (further abbreviated as matvecs) arising in finite element codes: mass matvec (BK1.0), stiffness matvec (BK5.0), and stiffness matvec that involves de-aliasing integration (BK3.0). These problems have been formulated as a part of Center for Efficient Exascale Discretizations (CEED) codesign effort (CEED, 2017) . BK1.0, BK5.0, and BK3.0 appear to be potentially well-suited for fine-grain parallelism but they are challenging to fully optimize. These benchmarks are characterized by low arithmetic intensity, namely that the ratio of floating point operations (FLOPS) to data transfer is low. The benchmarks also require non-unitarily strided memory access patterns since all threads that are processing a single element need the data computed by other threads that process this element. Finally, the operations consist of several concatenated tensor-product contractions and all threads simultaneously process the contractions in a prescribed order.
We detail several strategies that are collectively applied to maximize the computational throughput of these operations. In a sequence of computational experiments, we progressively optimize the operations by varying the amount of global memory accesses, shared memory usage, register variable usage, data padding, and loop unrolling. In this article, we use an NVIDIA Tesla P100 as the computational platform, as this is the target architecture for the CEED benchmarks. We note, however, that the same analysis may be applied for newer general-purpose graphics processing units (GP-GPUs), such as the NVIDIA Tesla V100, with only minor modifications.
To guide the optimization process, we construct a performance model that serves as a tool to evaluate the efficiency of the kernels. Because of the nature of the finite element operations, the performance of our kernels is limited by the bandwidth of the data streaming from both the global device memory and shared memory as well as finite register file and shared memory capacity. While a theoretical performance roofline, as presented by Lo et al. (2014) , gives an upper bound on the capability of the hardware under ideal circumstances, we instead rely on an empirically calibrated roofline model that provides a realistic estimate of the best achievable performance.
Overview of published literature
Early optimization studies of FEM operations date back to 2005 when Göddeke et al. (2005 Göddeke et al. ( , 2007 solved elliptic Partial Differential Equations (PDEs) on two-dimensional (2-D) domains using a cluster equipped with GPU accelerators. In a later work, Göddeke et al. (2009) focused mostly on accelerating a Navier-Stokes solver for a liddriven cavity and laminar flow over a cylinder. The study presented by Fu et al. (2014) targeted the entire FEM pipeline using an elliptic Helmholtz problem to show how the FEM codes are ported to the GPU. In addition, Fu et al. (2014) discussed strategies for accelerating conjugate gradient and algebraic multigrid solvers.
Other authors usually choose to work on improving the performance of one part of the FEM pipeline. For example, Cecka et al. (2011) and Markall et al. (2013) focused on the global assembly phase of FEMs and showed how to optimize this phase for GPU execution. Markall and coauthors emphasize that the choice of the most efficient algorithm strongly depends on the available hardware and selected programming model. In their work, Markall and coauthors considered the CPU and the GPU with OpenCL and CUDA parallel implementations. Moreover, Markall et al. (2010 Markall et al. ( , 2013 argue that making the code portable between different threading systems (such as many-core and multi-core architectures) requires a high-level programming language.
In all of the above work, the greatest concern was efficiently pipelining GPU execution of FEMs with low-order discretizations. In contrast, Remacle et al. (2016) present algorithms for solving elliptic equations using high-order continuous hexahedral finite elements on GPUs. The issues associated with efficiently handling the greater complexity of high-order FEM operations highlighted in that work are amplified in the current article and refined with the use of a new roofline model.
The evaluation of the matvec typically accounts for the highest cost of the elliptic FEM solver (Remacle et al., 2016 ; Tables 2 and 3) . Optimization of matvec performance on the GPUs appears in the literature mostly in the context of general applications. Yet a few papers focused directly on the low-order FEM matvec. For example, Dziekoński et al. (2017) used a conjugate gradient solver and optimized matvec as its important part. Dehnavi et al. (2010) and Grigoras et al. (2016) presented similar findings and provided optimization strategies. However, in our case the action of the operator is local to the element. We never assemble the global matrix and we perform the elementwise matvec. In the literature, this approach is known as the matrix-free approach. As the action of the element-wise matvec is a bandwidth-bound operation, assembling the matrix explicitly would require more memory usage inside the kernel and offer little cache reusage opportunities. Thus, in this work in order to optimize performance we focus solely on the matrix-free approach.
The matvecs used in this article can be expressed as a concatenation of tensor contractions. Related work on efficient implementation of tensor contractions for the GPUs consists of several papers. Nelson et al. (2015) applied a high-level language to formulate tensor contractions and used GPU code auto-tuning to enhance the performance. The findings were tested using several benchmark problems, including a problem derived from Nek5000 (see Fischer et al., 2008) . The optimized code reached 120 GFLOPS/s on a Maxwell class GPU. Abdelfattah et al. (2016) published similar work where, in addition to specifying a high-level language to formulate tensor contractions, the authors introduced a performance model based on both the number of floating point operations and the number of bytes of global memory that need to be read and written. The performance results were compared to CUBLAS and to a parallel CPU code using 16 cores. The most efficient version of the code reached 180 GFLOPS/s on an NVIDIA Tesla K40 GPU. While those two papers focused on small tensor contractions, Liu et al. (2017) optimized large tensor contractions and used auto-tuning for performance optimization. Their paper also presented implementation details and reported global memory usage. Shi et al. (2016) also demonstrated BLAS-like optimizations of large tensor product contractions.
The performance model developed in this article is rooted in earlier work by several authors, and is most similar to the roofline model in Lo et al. (2014) . Other performance models, such as those considered by Stratton et al. (2012) and Zhang and Owens (2011) , modeled the GPU performance using benchmarks. These efforts focused on a semiautomatic way of identifying performance bottlenecks, are specifically designed to obtain a more complete understanding of the underlying hardware properties of the GPU. Such studies have appeared frequently in the literature due to lack of available complete GPU hardware documentation even though it is well-known that hardware organization strongly impacts performance (Hong and Kim, 2009; Lee et al., 2010; Wong et al., 2010) . Volkov and Demmel (2008) and Volkov (2010) employed a different approach; they optimized matrix-matrix multiplication on the GPU and explained the code performance while improving the implementation. In his 2010 paper, Volkov pointed out that, contrary to a common belief, high GPU occupancy is not necessary to obtain performance close to the peak declared by the GPU manufacturer. This observation is important for the current article. Our most efficient kernels are characterized by high usage of registers and shared memory. Thus, achieved occupancy can often be as low as 10% even though the performance of these kernels is close to the empirical roofline. Volkov's observation that the aggregate number of shared memory accesses per thread should be reduced to avoid bottlenecks motivates a final optimization for the high-order FEM operations considered in this work. Specifically, we exploit the intrinsic structure of the interpolation matrix used in the construction of the action of the mass and stiffness matrices to improve throughput by reducing the aggregate number of shared memory accesses.
While there exist multiple GPU performance models (see Konstantinidis and Cotronis, 2015 and Abdelfattah et al., 2016) which are based on various properties of the GPU hardware and the code characteristics, in this article we develop a simple and practical model. The model provides an upper bound on achievable performance for the bandwidthbound FEM operations. Our model is independent of implementation and, to our knowledge, there is no model of this type (i.e. based on the achievable throughput of device to device data copy) in the literature.
In this article, we show a full optimization process for FEM tensor product codes on NVIDIA Tesla P100. We start from a simple implementation and we improve it through a sequence of optimization steps. While many of the optimization strategies for improving the performance of GPU kernels used in this article have been discussed in multiple places (Kim et al., 2012; Li, 2016; Volkov, 2010; Volkov and Demmel, 2008) , in this work we apply these optimizations in a systematic fashion in order to design and create fully optimized GPU kernels for a complex practical problem. In particular, we reduce the number of shared memory fetches and thus improve the performance of our code by using the row symmetry of the interpolation matrix I in BK1.0 and BK3.0 optimization. This property has been exploited before (see Don and Solomonoff, 1995) , however, not in the context of the GPU code optimizations. Later in this article, we show how using this property results in a substantial performance boost for BK1.0 and BK3.0.
The remainder of this article is organized as follows. We first present the mathematical description of the three benchmark problems mentioned earlier. We then introduce the performance model we use to evaluate our kernels. We continue by explaining the implementation and optimization choices for each of the benchmark problems. Finally, we present concluding remarks and future goals. In the Appendix we explain how to obtain the source code and reproduce the results.
Problem description
The three benchmark problems we discuss below were formulated by the CEED (see CEED, 2017) . The problems are listed as benchmark kernels. These problems are motivated by numerical approximation of the screened Poisson equation
using a high-order FEM on hexahedral elements. In the equation, f is a given forcing function and l is a constant. To begin we consider an unstructured mesh of a domain D & R 3 into K hexahedral elements D e , where e ¼ 1; . . . ; K, such that
We assume that each hexahedral element D e with vertices fx n ; y n ; z n g n ¼ 8
n ¼ 1 is an image of the reference bi-unit cubeD under a tri-linear map. We take the reference cube to be the bi-unit cube, that iŝ D ¼ fðr; s; tÞ : À1 r; s; t 1g
On each element we consider the variational form of the screened Poisson problem (1) on element D e by requiring that u satisfies
for all test functions v 2 H 1 ðD e Þ. We map the integrals to the reference cubeD in order to write the variational form as
where jJ e j is the determinant of the Jacobian J e of the mapping from element D e to the reference cubeD and G e ¼ jJ e jðJ e Þ T J e is the scaled elemental metric tensor. The r operators in these integrals are now understood to be in ðr; s; tÞ-space.
On the reference cubeD, we construct u e , a high-order finite element approximation of the function u. Note that u e is a degree N polynomial in each dimension. We use Q N ðDÞ to describe a space of all such polynomials. As a polynomial basis of Q N ðDÞ, we choose the tensor product of 1-D Lagrange interpolation polynomials fl i g i ¼ N i ¼ 0 based on N þ 1 Gauss-Lobatto-Legendre (GLL) nodes on the interval ½À1; 1. We use fr i g i ¼N i ¼ 0 to denote the GLL nodes in ½À1; 1. We employ an analogous notation for the GLL nodes in the s and t dimensions. Using a multi-index, we define the multidimensional basis polynomials l ijk ðr; s; tÞ to be the tensor product of the 1-D Lagrange interpolating polynomials, that is l ijk ðr; s; tÞ ¼ l i ðrÞl j ðsÞl k ðtÞ for all 0 i; j; k N . Hence, onD we can express the polynomial u e as Consequently, the coefficients u e ijk are the nodal values of the polynomial u e at the GLL interpolation points ðr i ; s j ; t k Þ, that is, u e ijk ¼ u e ðr i ; s j ; t k Þ. We enforce the variational form (3) for all test functions v in the finite element space Q N ðDÞ by taking v to be each of the basis Lagrange interpolating polynomials, that is, v ¼ l i 0 j 0 k 0 ðr; s; tÞ for all 0 i 0 ; j 0 ; k 0 N . We then denote by u e the vector of the polynomial coefficients of u e , that is u e ¼ ½u e 000 ; u e 001 ; . . . ; u e NNN T . Evaluating (3) for each test function v we obtain that u e satisfies the following linear system S e u e þ lM e u e ¼ f e where the local stiffness matrix S e , the local mass matrix M e , and the local load vector f e are defined as
We concatenate the local stiffness and mass matrix operators as well as local load vectors to form global unassembled versions which we denote, S, M, and f, respectively. Note that since the global approximation of u is continuous, the nodal values of u e along edges and vertices shared by neighboring elements will appear multiple local stiffness and mass matrix operations. A fully assembled system for the global degrees of freedom can be assembled by gathering the local contributions at each nodal point. In this article, we neglect the action of the global assembly and focus on the optimization of the local operators. The concatenated system of the local operators form the following block diagonal system, which operates on the concatenated vector of solution coefficients u Su þ lMu ¼ f As we neglect the global assembly, these global stiffness and mass matrix operators can be applied in a matrix-free (element-wise) way and no communication is required between elements.
The benchmarks presented below consider the efficient action of either just the mass matrix M or the complete screened Poisson operator S þ lM on each element. Benchmark Problems 1.0 and 3.0 consider the case where the integrals (4)-(5) are evaluated using a full Gauss-Legendre (GL) quadrature and Benchmark Problem 5.0 considers the case where the integrals are approximated using simply a quadrature at the GLL interpolation points.
Benchmark Problem 1.0 (BK1.0): Mass matrix multiplication
The first benchmark we present involves a matrix-vector product of the high-order finite element mass matrix M and a corresponding vector. This operation is a component of finite element elliptic operators and some preconditioning strategies. Thus, it serves as a useful initial performance benchmark. The operation requires relatively little data transfers compared with more demanding differential operators which necessitate loading the elemental metric tensor for each node of the element, as discussed in later benchmarks.
Beginning from the integral definition of the entries of the local mass matrix M e in (5) we use the fact thatD is the reference cube to evaluate the integral in each dimension separately via an N GL q -node GL quadrature. We denote the quadrature weights and nodes in the r dimension as fw a g a ¼N GL q a ¼ 1 and fr a g a ¼N GL q a ¼1 , respectively, and use an analogous notation for quadrature nodes in the s and t dimensions. Thus we obtain
We rewrite this expression in a matrix form by defining the interpolation operator I abc;ijk ¼ l ijk ðr a ;s b ;t c Þ for all i; j; k ¼ 0; . . . ; N and a; b; c ¼ 1; . . . ; N GL q , the diagonal matrix J e of weights, and the geometric data J e abc;abc ¼w awbwc jJ e ðr a ;s b ;t c Þj for all a; b; c ¼ 1; . . . ; N GL q . Hence, we write the local mass matrix (7) compactly as M e ¼ I T J e I Note that since the basis interpolation polynomials are tensor products of 1-D polynomials, we can define the 1-D interpolation matrix I 1D as
for all i ¼ 0; . . . ; N and a ¼ 1; . . . ; N GL q . We then express the interpolation operator I as a tensor product of the 1-D operators, that is
Thus, the interpolation operation from the GLL interpolation nodes to the GL quadrature nodes can be applied using three tensor contractions, while the projection back to the GLL nodes via the transpose interpolation operator comprises three additional tensor contractions. Since the remaining operation is simply the multiplication by the diagonal matrix J e , no further tensor contractions are required. Since we will make use of the interpolation and projection operations again in BK3.0 below, we detail their pseudo-code in Algorithms 1 and 2. We then detail the full matrix-free action of the mass matrix in the pseudo-code in Algorithm 3.
Benchmark Problem 5.0 (BK5.0): Stiffness matrix with collocation differentiation For our second benchmark, we consider the matrix-vector product of the full high-order finite element screened Poisson operator S þ lM and a corresponding vector. In this benchmark, the operators S and M are evaluated using a collocation GLL quadrature rather than the more accurate GL quadrature used in the other two benchmark problems. This operation is central to many elliptic finite element codes and is usually a part of a discrete operator we wish to invert. For example, incompressible flow solvers such as Nek5000 (see Fischer et al., 2008) require solving a Poisson potential problem at each time step. Consequently, this matvec is potentially evaluated many times in each time step of a flow simulation, making its optimization a significant factor for good performance.
To introduce the application of the full screened Poisson operator we begin by describing the local stiffness matrix S e defined in (4). We evaluate the integral in (4) in each Algorithm 1. Interpolation from GLL to GL nodes.
Algorithm 2. Projection from GL to GLL nodes.
Algorithm 3. BK1.0: mass matrix multiplication. dimension separately, this time using the N þ 1 GLL interpolation nodes as the quadrature. Recall that we denote the GLL quadrature weights and nodes in the r dimension as fw a g a ¼N a ¼ 0 and fr a g a ¼N a ¼ 0 , respectively, and similarly for the s and t dimensions. Thus we obtain
To write this expression in a more compact matrix form we begin by defining the differentiation operators
for i; j; k; a; b; c ¼ 0; . . . N . We then define the gradient operator D to be the vector of these three derivative operators, that is, D ¼ ½D r ; D s ; D t T . Next, since G e is the scaled metric tensor on element D e defined by G e ¼ jJ e jðJ e Þ T J e , we write Next, we define a matrix G e of operators where each entry of G e is a diagonal matrix of weights and geometric data from G e . That is, we define where each entry, say G e rr , is defined as a diagonal matrix, that is
We use analogous definitions for the remaining entries of G e . Thus, the local stiffness matrix (9) is expressed in a compact form as follows
To simplify the action of this local stiffness matrix we again use the fact that the basis interpolation polynomials are tensor products of 1-D polynomials. We define the 1-D differentiation matrix D 1D as
for all i; a ¼ 0; . . . ; N . Using this 1-D derivative operator, and the fact that the GLL quadrature nodes collocate with the interpolation nodes using the Lagrange basis polynomials l i , we write the partial derivative matrices D r ; D s ;
and D t as tensor products of D 1D and the identity matrix I as follows
Thus, differentiation along each dimension can be computed using single tensor contractions.
Finally, to write the full action of the local screened Poisson operator S e þ lM e we note that since we have used the collocation GLL quadrature in the evaluation of the integrals (4)-(5), we can follow the description of the mass matrix operator above to find that no interpolation is required and the mass matrix can be written simply as M e ¼ J e where the matrix of geometric factors J e is now defined using the GLL weights and quadrature points, that is
This operator can be applied using only six tensor contractions. First, we apply the D operator by differentiating along each dimension using three tensor contractions. We then multiply by the necessary geometric factors G e and apply the transpose operator D T with three more tensor contractions. Finally we add the mass matrix contribution which requires no tensor contractions. We detail the full matrix-free action of the screened Poisson operator in the pseudo-code in Algorithm 4.
Benchmark Problem 3.0 (BK3.0): Stiffness matrix evaluated with quadrature
The final benchmark we consider is the same matrix-vector product of the high-order screened Poisson operator S þ lM as in BK5.0. This time, however, we use the full GL quadrature to approximate the integrals in (4) and (5). This benchmark combines computational elements from BK1.0 and BK5.0 which makes for a more arithmetically intense kernel. Thus, maximizing its performance on GPUs is challenging.
We again begin by describing the local stiffness matrix S e defined in (4). We evaluate the integral in (4) in each dimension separately, this time applying the full GL interpolation nodes as the quadrature. Using the notation introduced in BK1.0 above, we write
Note here that the quadrature requires the gradients of the basis polynomials l ijk to be evaluated at the GL quadrature nodes. Simply composing the interpolation and differentiation operators I and D defined in BK1.0 and BK5.0 above requires nine tensor contractions to evaluate this quantity. Indeed, for each of the r, s, and t derivatives of l ijk , we need an operation which combines differentiation and interpolation to the GL quadrature along one dimension and only interpolation to the GL quadrature along the remaining two dimensions.
We instead reduce the number of required tensor contractions by considering the Lagrange interpolating polynomialsl a ðrÞ for a ¼ 1; . . . ; N GL q which interpolate the GL quadrature nodes. We define the tensor product basis polynomials as it is done for the GLL interpolating basis, that is l abc ðr; s; tÞ ¼l a ðrÞl b ðsÞl c ðtÞ for a; b; c ¼ 1; . . . ; N GL q . We then define the derivative operators, as done above for the GLL interpolating Lagrange basis functions, on this set of polynomials as followsD
We can then construct the gradient operation on this set of basis function as
Thus, if we view the interpolation operators defined in (8) as transforming a polynomial from the basis of Lagrange interpolating polynomials on the GLL nodes to the basis of Lagrange interpolating polynomials on the GL quadrature, then we can use the operatorD to evaluate the derivatives of this polynomial on the GL node basis. With these derivative operators, we continue as in BK5.0 by defining the matrix of weights and geometric data G e in (11), and this time ðG e rr Þ abc;abc ¼w awbwc G e rr ðr a ;s b ;t c Þ for a; b; c ¼ 1; . . . ; N GL q . We use analogous definitions for the remaining entries of G e . Using these matrix operators we can write the local stiffness matrix (9) compactly as follows
Note that, as with the GLL interpolation basis polynomials, we can define the 1-D differentiation operatorD 1D defined such thatD 1D ai ¼l i ðr a Þ for i; a ¼ 1; . . . N GL q , so that each differentiation operation inD can be written as a tensor product ofD 1D and the identity matrix I. Therefore, the differentiation operators on the GL Lagrange interpolation basis polynomials along each dimension can be applied using a single tensor contraction. Finally, to express the full action of the local screened Poisson operator S e þ lM e we combine the discussion in BK1.0 regarding using the GL quadrature to evaluate the local mass matrix M e in order to write the local screened Poisson operator as
where J e is defined as in BK1.0 above. This operator can be applied using a total of 12 tensor contractions. We first interpolate to the GL quadrature nodes by applying the interpolation operator I using three tensor contractions. We then differentiate along each dimension by applying theD operator using another three tensor contractions. We then multiply by the necessary geometric factors and multiply by the transpose derivative operator using three additional tensor contractions and add the mass matrix contributions. Finally, we use three tensor contractions to multiply by the transpose interpolation operator in each dimension to project the result back to the GLL interpolation nodes. We detail the full matrix-free action of the stiffness matrix evaluated with the numerical quadrature in pseudo-code in Algorithm 5.
Empirical performance model
The performance for all three benchmark problems is limited by global memory bandwidth. In BK5.0 and BK3.0, we load seven geometric factors for every node in the FEM element in addition to loading and storing the field vector q itself. Moreover, as we emphasize in the introduction, for all three operations the data-to-FLOP ratio is high which limits our ability to hide memory latency behind computation. Konstantinidis and Cotronis (2015) proposed a GPU performance model that accounts for the various data-to-FLOP ratios. However, the kernels used in the model achieve occupancy of around 97% while, for our kernels, extensive use of register file and shared memory results in achieved occupancy rarely exceeding 25%. Hence, we cannot apply this model for our kernels in a meaningful way.
A different idea was presented in Abdelfattah et al. (2016) , where the authors used the size of global memory transfers as a means of comparison. The advantage of such approach is that it is independent of implementation. In this work, we use a model which is similar to Abdelfattah et al. (2016) . However, we transfer comparatively more data due to the geometric factors. Our model is also based on an assumption that the bandwidth of the global device data transfer is the limiting factor. Thus, we compare the performance of our kernels to the performance of copying data from global GPU memory to global GPU memory. The size of the data transfer we compare to is equivalent to the size of the data moved from and to the global memory in the particular benchmark problems, see Table 2 .
For example, for every element in the FEM mesh the BK1.0 code needs to read
doubles and needs to write Algorithm 5. BK3.0: Differentiation for 3D Hexahedral Elements.
W ¼ N p doubles. Hence, the total global memory transfer is
doubles per element. Since the memory bus is bidirectional and a double variable consists of eight bytes of data, we compare the performance of our code for BK1.0 to transferring
bytes of data. Each data transfer is executed 10 times using a standard cudaMemcpy function and the performance in terms of GB/s is measured by taking an average of the 10 measurements.
Let d r denote the number of bytes read from the global GPU memory and d w denote the number of bytes written to global GPU memory by a GPU kernel. Let B gl denote global memory bandwidth for copying drþdw 2 bytes of data. Let F denote the number of floating point operations that must be executed in this kernel. In our model, the maximal GFLOPS/s are estimated using a formula
For BK1.0, BK5.0, and BK3.0, the roofline R global is evaluated as a function of polynomial degree N . Figure 1 shows maximum TFLOPS/s for BK1.0, BK5.0, and BK5.0, respectively.
The efficiency of data transfer depends on the size of data, with throughput maximized if sufficiently large amounts of data are being transferred. We therefore expect higher bandwidth for a mesh with more elements, and for higher-degree polynomial approximations. We note that for the GPU used in this article, the NVIDIA Tesla P100 12 GB PCI-E, the mesh containing 512 elements is too small to effectively hide data transfer latencies. To see this, we note that our approach is to parallelize the problem by assigning each element to a thread block. The NVIDIA Tesla P100 has 56 SMs and two blocks of threads can be processed simultaneously on every SM. The GPU is therefore capable of processing 112 blocks of threads concurrently, which for the mesh of 512 elements means only 5 thread blocks are executed per SM. The result of this small workload per SM is that the overhead cost of kernel launch is not offset by the execution time of the kernel itself and also the empirical bandwidths which we observe in these data transfers are noisy due to overhead costs.
Inspired by Volkov and Demmel (2008) , for several kernels tested for BK1.0 and BK3.0., we devised a supplementary theoretical roofline based on shared memory bandwidth. We observe that in addition to copying the data to/ from global memory, we also copy the data to/from shared memory. 1 Since the memory bandwidth of shared memory is lower than the register bandwidth, shared memory transactions can limit performance. We denote the shared memory bandwidth by B sh and and estimate it with Figure 1 . Performance roofline bounds for the BK1.0 (top), BK5.0 (middle), and BK3.0 (bottom) benchmarks. In each chart, the upper plot (line with diamond-shaped ticks) shows a theoretical bound obtained using theoretical peak bandwidth of 549 GB/ s for the NVIDIA P100 PCI-E 12-GB GPU. The lower plots show the empirical peak bandwidth bound obtained using measured bandwidth attained when performing a device memory to device memory copy (upper line for a hexahedral mesh with 4096 elements and lower line for a hexahedral mesh with 512 elements). GPU: graphics processing unit.
With this ansatz we estimate that the shared memory for the NVIDIA Tesla P100 12 GB PCI-E GPU is limited to B sh ¼ 56 Á 32 Á 4 bytes Á 1:126 Ghz ¼ 8:071 TB=s. The shared memory roofline model is then given by the equation
where F denotes the number of floating point operations performed (per thread block), s r is the number of bytes read from the shared memory (per thread block), and s w is the number of bytes written to the shared memory (per thread block).
Notation
The notation used throughout this article is shown in Table 1 . In addition, we define 1 TFLOPS ¼ 10 9 FLOPS
The floating point throughput rates for compute kernels in this article are reported using TFLOPS/s. We use double precision arithmetic in all the tests.
Hardware and software
All computational studies in this article were performed on a single Tesla P100 (Pascal class) PCI-E GPU with 12 GB RAM and maximum theoretical bandwidth of 549 GB/s. The GPU is hosted by a server node equipped with Intel Xeon E5-2680 v4 processor, 2.40 Ghz with 14 cores. The code was compiled using the gcc 5.2.0 compiler and the nvcc CUDA compiler release 8.0, V8.0.61 managed by the OCCA library, version 0.2 (see Medina et al., 2014) .
We compute the reference times (needed for copy bandwidth in roofline plots) using CUDA events. The kernels are timed using MPI_Wtime. In our experiments, we test the code on two hexahedral cube-shaped meshes. The small mesh consists of 512 elements and the large mesh consists of 4096 elements.
Optimizing Benchmark Problem 1.0
In this section, we describe the sequence of optimization strategies that were applied to the implementation of the BK1.0 operation.
Kernel design
Recall that the tensor contraction operations for each element in the FEM mesh can be performed independently of other elements. Thus, to parallelize the FEM operations we assign each element to a block of threads on the GPU. In a previous work, Remacle et al. (2016) associated a single node of an element with a single thread. This is, however, impossible for high-order interpolation because if N q ! 9, we exceed the maximum number of threads per block of threads (currently limited by CUDA to 1024). Hence, for higher-order approximations we need to assign multiple nodes to a thread. We can subdivide the nodes in each element using either a 3-D thread structure or a 2-D thread structure. Figure 2 shows two such approaches. For BK1.0, we use a 2-D thread structure since we found it more effective. We also investigate a 3-D thread structure for N q 9 for BK5.0 and N q < 9 for BK3.0.
In BK1.0, the action of the mass matrix M on each element can be applied using six tensor contraction operations. Hence, Algorithm 3 consists of six contractions wherein we cycle through the entire q e . Block-wise synchronization is needed between the contractions to ensure that the previous operation has completed. At the minimum, we need to enforce this block synchronize five times. Using a 2-D thread structure requires more thread synchronizations because we process only a slice of the nodes at a time.
BK1.0 thread memory optimization
Because the 1-D interpolation matrix I 1D is used by all the threads in the block, we load I 1D into the shared memory. 
Number of GL quadrature nodes:
Number of GL nodes in a hexahedral element.
Number of elements in the mesh GLL: Gauss-Lobatto-Legendre; GL: Gauss-Legendre. a The first column shows the symbols used in pseudocode listings and the second column shows the symbols used in the derivations. For the field variable q, we can either load q e to shared memory, fetch it to registers, or fetch it piece by piece from global memory when needed. Note that we also need a placeholder array to store the partial results between the loops. There are several options to choose from, however, only a single array of size N GL p can be stored in shared memory. Storing two such arrays is not feasible since we exceed the limit of 48 KB shared memory for thread block for a large N .
BK1.0 kernel optimization
We show the performance results of eight GPU kernels in Figure 3 . The eight kernels were constructed in sequence beginning from a direct implementation of the pseudo-code in Algorithm 3 and applying successive optimizations. The results shown in Figure 3 help to demonstrate the effect each optimization has on the overall performance of BK1.0. We list below some details of each kernel here as well as any optimization made in that kernel. Unless otherwise noted, Kernel n contains all the optimizations contained in Kernel n À 1.
Kernel 1. This kernel serves as a reference implementation. It uses a 2-D thread structure associated with horizontal ði; jÞ slices (see right side of Figure 2 ). We declare two additional global memory variables for storing intermediate results. Kernel 1 only uses shared memory for the interpolation matrix. In all the loops, it reads from and writes to global memory. As a result, the reference kernel is highly inefficient, reaching only 80 GFLOPS/s even for the larger mesh.
Kernel 2. In this kernel, the global auxiliary arrays are replaced by two shared memory arrays (with ðN GL q Þ 2 elements in each array). While processing q e , we read directly from the input arrays, without caching to shared memory or registers. Due to the reduced number of global memory fetches, the performance improves by approximately a factor of two.
Kernel 3. In this kernel, each thread allocates a register array of size N ðGLÞ q and copies a section of q e to the array at the beginning of the kernel. This kernel reaches one TFLOP/s in the best case.
Kernel 4. In this kernel, all the input variables, except the variable to which the output is saved, are declared as const. This yields only a marginal improvement in performance.
Kernel 5. For this kernel, if N q ¼ 8 or 16 or N GL q ¼ 8 or 16, we pad the shared memory arrays used for storing I 1D and partial results to avoid bank conflicts. There is only a noticeable improvement for N ¼ 15 for the smaller mesh and N ¼ 14 for the larger mesh.
Kernel 6. In this kernel, all loops including the main loop in which we process the ði; jÞ slices are unrolled. Unrolling the ði; jÞ loop is important for the performance, since it gives the scheduler more freedom and more opportunity for instruction-level parallelism. At this point, the performance exceeds 1 TFLOP/s for the small mesh and 1.4 TFLOPS/s for the large mesh.
Kernel 7. All kernels presented thus far have required 5N GL q þ 1 thread synchronizations in a block. For example, with N ¼ 12, for which N GL q ¼ 14, the number of synchronizations is 71. In this kernel, the number of thread synchronizations is reduced to five by allocating more shared memory. That is, we load the entire q e array to shared memory as opposed to only loading q e slice by slice. Figure 4 illustrates the idea behind reducing synchronizations. Kernel 7 brings the performance up to 2.25 TFLOPS/s. Kernel 8. Volkov (2010) emphasized that shared memory is much slower than using registers. Hence, reducing total number of read and write requests to/from shared memory and replacing them with register read/write instructions can significantly improve performance. Kernel 8 exploits the structure of the matrix I 1D to reduce the number of shared memory transactions. Specifically, each entry in I 1D appears twice: I 1D nm ¼ I 1D N GL q Ànþ1;N q Àmþ1 . Exploiting this symmetry lets us halve the number of loads from I 1D . This kernel loads the entire I 1D into shared memory and inside each loop an appropriate entry is copied from shared memory to a register once and used twice. Since in BK1.0 we need to multiply by I 1D and by ðI 1D Þ T , pre-fetching only a half of matrix I 1D would complicate the code and require a set of additional if-statements, which cause thread divergence. The resulting reduction in shared memory operations brings the measured performance close to the empirical roofline. The optimization strategies are summarized in Table 3 .
BK1.0 results
The performance numbers presented in Figure 3 reveal that the kernels considered can be categorized into three groups. Kernels 1 and 2 form the first group, Kernels 3-6 form the second group, and Kernels 7 and 8 form the third group. Between each of these three groups we observe significant performance improvements. The first major improvement occurs for Kernel 3 and is the result of caching q e to register arrays prior to contraction. The second significant improvement occurs for Kernel 7 and is due to reducing the number of barriers. The performance of our reference kernel, Kernel 1, in the best case reaches only 80 GFLOPS/s, whereas our best kernel, Kernel 8, achieves 2.5 TFLOPS/s, yielding a 31-fold speedup. To obtain more than two TFLOPS/s, we needed to change the algorithm to account for the limitations intrinsic to the GPU, namely the cost of many thread synchronizations.
To explain the improvement between Kernel 6 and Kernel 8, we consider the roofline model based on shared memory bandwidth. We generate a new roofline plot by taking a minimum of the empirical roofline based on device to device copy bandwidth (13) and the upper limit computed based on shared memory bandwidth (14). We show these new rooflines in Figure 5 . For Kernel 8, this roofline is identical with the global memory roofline as in Figure 3 as the shared memory roofline is higher than the global memory roofline due to eliminating almost a half of shared memory transactions. This indicated that the shared memory bandwidth is not the limiting factor of the performance of Kernel 8 for high N where performance begins to degrade. For Kernels 6 and 7 however, we see that this shared memory roofline model indeed provides a reliable performance bound and these kernels are likely limited by shared memory performance. Overall, our best performing kernel is Kernel 8. The performance results for this kernel are very close to the roofline model for N ¼ 11. Output from the NVIDIA profiler indicates that this kernel's performance is bound almost entirely by memory bandwidth. Table 4 presents performance details for all the kernels with N ¼ 11. We use N ¼ 11 because our code achieved The idea behind reducing synchronizations in kernel 7. We fetch pieces of q e to registers from shared memory and then write the result to shared memory. This does not create race conditions because we do use a 2-D thread structure and interpolate only in one direction at a time.
close to peak performance for this degree. The table shows differences in resource utilization for all the kernels.
Optimizing Benchmark Problem 5.0
Kernel design
As in BK1.0, we parallelize the action of the screened Poisson operator by assigning each element to a separate thread block on the GPU. In this optimization procedure, we investigate a 2-D thread structure as well as test a 3-D thread structure for N < 10.
The action of the screened Poisson operator presents a difficult optimization challenge. In particular, hiding global memory latency is significantly more important than in BK1.0 because we transfer seven geometric factors (J e and six factors of G e ) per node from global memory in every element. The majority of these values are required during the action of the stiffness matrix. To hide global memory latency, we aim to overlap data transfer with computations.
The increased amount of data that we are required to load increases both the number of global memory read transactions and the number of registers needed per thread. In theory, the geometric factors can be computed "on the fly" inside the kernel using the element's vertices and the each node's ðr i ; s j ; t k Þ coordinates. This approach reduces the amount of double precision values loaded from global memory to 18 þ 2N q per block but increases the number of registers. We implemented and extensively tested this approach but concluded that it was impractical. Indeed, it is faster to simply load geometric factors from global memory than use extra registers and FLOPS.
BK5.0 kernel optimization: 2-D thread structure
We show the performance results of 10 GPU kernels in Figure 7 . The 10 kernels were constructed in sequence beginning from a direct implementation of the pseudocode in Algorithm 6 and applying successive optimizations. The results in Figure 7 help to demonstrate the effect each optimization has on the overall performance of BK5.0. As done with BK1.0, we list below some details of each kernel as well as any optimization made in that kernel.
Kernel 1. This kernel is a direct implementation of Algorithm 6. The 1-D differentiation matrix D 1D is fetched to shared memory at the beginning of the kernel. This kernel also uses a shared memory array of size N p to store partial sums. Entries of q e are fetched from global memory as needed. In the last loop, the partial sums are successively added directly to the global memory array to store the final result. The kernel achieves 200 GFLOPS/s, which is one sixth of the predicted empirical roofline.
Kernel 2. In this kernel, all the variables, except the array used for storing the final result, are declared using const. This optimization has only a marginal influence on the performance for N ! 8.
Kernel 3. In this kernel, all loops with k are unrolled. Unrolling loops improves the performance for the larger mesh only if N ! 8 and for all N for the smaller mesh. Note that a kernel with unrolled loops uses more registers per thread and, hence, the achieved occupancy decreases. The measured TFLOPS/s increase, however, as shown in Figure 6 . This behavior can be explained if unrolling has increased instruction-level parallelism. Kernel 4. In this kernel, we place the k loop (lines 7-17 in Algorithm 6) on the exterior of i and j loops. Doing this, k becomes the slowest running index. This loop structure is justified because we iterate through ði; jÞ slices.
Kernel 5. In this kernel, the auxiliary shared memory array of size N p is replaced by two auxiliary shared memory array of size N 2 q each.
Kernel 6. In this kernel, the field variable q e is fetched to the shared memory. The overall improvement is modest and the performance reaches 0.5 TFLOP/s at this point.
Kernel 7. This kernel reduces total global memory transactions by caching the necessary data at the beginning of the kernel and only writing the output variable once. This produces a significant optimization and improves performance by about 40%. Kernel 8. For this kernel, if N ¼ 7 or 15, the arrays are padded to avoid shared memory bank conflicts. The improvements are significant for the larger mesh and N ¼ 15.
Kernel 9. In this kernel, each thread allocates a register array with N q elements and the field variable q e is fetched to these registers instead of shared memory. For both meshes, this approach slightly improves the performance for N ¼ 12 and N ¼ 15.
Kernel 10. This kernel uses three 2-D shared memory arrays for partial results and fetches the field variable q e to register arrays. The achieved TFLOPS/s for this kernel are aligned with the empirical roofline. The optimization strategies are summarized in Table 5 .
BK5.0 results: 2-D thread structure
Many of the performance improvements we observe when optimizing the kernels for BK5.0 are due to subtle changes in the code, for example, declaring variables as constant, adding padding, or unrolling the loops. But global and shared memory usage has the biggest impact on the performance. Once we reduce the use of global memory (starting in Kernel 7), the performance improves substantially, especially for larger N . Reducing the amount of shared memory and caching the data to registers result in the second most important factor which is visible in Figure 7 for the mesh with 4096 elements and N ! 10. Overall, we find the best performing kernel is Kernel 10. The performance of this kernel aligns well with the roofline model for N ¼ 10, N ¼ 11, and N ¼ 15. For the reference kernel, Kernel 1, the performance barely reaches 200 GFLOPS/s, whereas the most optimized kernel we present, Kernel 10, achieved up to 1:2 TFLOPS/s. While this is only a six-fold speedup, data output from the NVIDIA profiler and comparison with our empirical roofline model based on streaming global device memory indicate that the performance of Kernel 10 is comparable to just streaming the minimally necessary data. Table 6 presents performance details for all the kernels with N ¼ 11. We use N ¼ 11 because our code achieved close to peak performance for this degree. The table shows differences in resource utilization for all the kernels.
BK5.0 kernel optimization: 3-D thread structure
We show the performance results of six GPU kernels in Figure 7 . We again construct these kernels as a sequence of successive optimizations beginning from the pseudo-code shown in Algorithm 7 which uses a 3-D thread structure. We present results for these kernels for N ¼ 1; 2; . . . ; 9, since associating one thread with one node as done in this Figure 6 . BK5.0: Comparison of performance for kernels 1 and 2. The only difference between these two kernels is loop unrolling. The number of registers per thread is shown in the bar chart in the background. 3-D thread structure would require more than the maximum of 1024 threads for higher-degree polynomials. We list below some details of each kernel as well as any optimization made in that kernel.
Kernel 1. This kernel serves as an initial reference kernel and is a direct implementation of Algorithm 7. The differentiation matrix D 1D is loaded into shared memory and partial sums are stored in global arrays. The performance of this kernel reaches 300 GFLOPS/s in the best case.
Kernel 2. In this kernel, all variables, except the output variable, are declared using const. All loops with index n are unrolled. The overall improvement is modest.
Kernel 3. In this kernel, the geometric factors are stored in registers and fetched only once from global memory. This reduces the number of global memory loads from nine to seven. This kernel reaches about 350 GFLOPS/s for the larger mesh. In case of the smaller mesh, performance of this kernel is similar to Kernel 2.
Kernel 4. Instead of writing the result directly to Aq, in this kernel the partial results are accumulated in a register variable. The performance of this kernel is only slightly better than that of Kernel 3.
Kernel 5. In this kernel, the field variable q e is cached to shared memory. The same shared array is used to store the partial result, hence there is no need to use Sqtemp. The performance increases significantly and reaches 610 GFLOPS/s. Kernel 6. In this kernel, partial results are stored in three shared memory arrays of size N p each. This strategy reduces occupancy but allows the kernel to use less synchronizations (the number of synchronizations reduced to two). However, there is almost no performance difference between this and the previous kernel due to extensive use of shared memory. The optimization strategies are summarized in Table 7. BK5.0 results: 3-D thread structure
The optimization results are shown in Figure 8 . The kernels adapting 3-D thread structure form two groups. First group consists of Kernels 1-4 and the second one consists of Kernels 5 and 6. The largest performance jump appears between Kernel 4 and Kernel 5. It is a result of caching the field variable to shared memory. Kernel 5 appears to be the best performing kernel. As a result of succesively reducing the number of shared memory fetches, we improved the performance by a factor of two, from approximately 300 GFLOPS/s to around 600 GFLOPS/s. a The stall reason column shows the most common reason for stalls. The best performing kernel, kernel 10, uses very little shared memory but uses it effectively (we perform two FLOPS per every 8 bytes loaded from the shared memory). High number of registers and low occupancy does not prevent the kernel from achieving high FLOP count. Table 5 . Summary of code optimization strategies applied for BK5.0 (2-D thread structure).
Kernel # Type of optimization
Kernel 1 Baseline kernel. The matrix D is fetched to shared memory. We declare one shared memory array for partial result. Kernel 2 Kernel 1 þ variables that are not being changed in the kernel are declared as const. Kernel 3 Kernel 2 þ loop unrolling. Kernel 4 Kernel 3 with restructured main loop (k is the slowest running index). Kernel 5 Kernel 4 with restructured shared memory. A single shared memory array of size N 3 q is replaced by two arrays of size N 2 q each. Kernel 6 Kernel 5 with q loaded to shared memory. Kernel 7 Kernel 6 þ reduced global memory usage. The necessary data are read from global memory only once. The output is written once as well. Kernel 8 Kernel 7 þ padding. All shared memory arrays of size N 2 q are padded if N q ¼ 8 or N q ¼ 16: Kernel 9 Kernel 8 with q cached to registers, not to shared memory. Kernel 10 Restructured shared memory. The kernel uses three shared memory arrays, size N 2 q each.
Optimizing Benchmark Problem 3.0 BK3.0 kernel design BK3.0 can be considered a fusion of BK1.0 and BK5.0. This benchmark shares its interpolation/projection steps with BK1.0 and the stiffness matrix action with BK5.0. Hence, BK3.0 inherits all the optimization challenges we have discussed thus far for BK1.0 and BK5.0. In particular, the kernel for this problem needs to synchronize threads multiple times and needs to load a large amount of data from global memory. Furthermore, compared to either BK1.0 or BK5.0 this benchmark requires even more shared memory usage in order to store partial results. Since differentiation is performed using a denser set of nodes, we also transfer more data per thread block compared with BK5.0. The action of the screened Poisson operator in this benchmark can be written as three distinct parts: interpolation to GL nodes, stiffness and mass matrix actions, and projection back to GLL nodes. Therefore it is possible to split the implementation into three kernels. This approach reduces memory requirements per kernel and makes the code more readable. We investigate this approach with 3-D thread structure.
BK3.0 kernel optimization: 2-D thread structure
We begin as in BK5.0 by first investigating a kernel implementation using a 2-D thread structure. The performance results for nine separate kernel are presented in Figure 9 . As in the previous benchmarks, we present a sequence of kernels and detail what optimizations we performed in each kernel.
Kernel 1. This kernel serves as a reference implementation. In this kernel, there are four shared memory arrays: one for I 1D , one for D 1D , and two arrays with ðN GL q Þ 2 elements for storing partial sums. Each thread also allocates an additional register array to store additional partial sums that do not require sharing between threads in a thread block. The field variable q e is read directly from global memory when needed. This kernel reaches approximately 280 GFLOPS/s. Kernel 2. In this kernel, all the input variables, except the pointer used for storing the final results, are declared as const. Each thread allocates a register array of size N q and caches a column of entries of q e to the register array.
The performance for N ! 12 for the larger mesh improves to over 1.25 TFLOP/s Kernel 3. In this kernel, all internal loops are unrolled. The performance improves to 600 GFLOPS/s for the smaller mesh and to 1.4 TFLOP/s for the larger mesh.
Kernel 4. In this kernel, we add padding to all the shared memory arrays to avoid bank conflicts. In particular, in case N GL q ¼ 8 or N GL q ¼ 16, the arrays with size N GL q Â N GL q change their size to N GL q Â ðN GL q þ 1Þ. Padding helps only for the smaller mesh.
Kernel 5. This kernel adapts the most efficient interpolation strategy devised for BK1.0 (see BK1.0: Kernel 7 and Figure 4 ). For the larger mesh, the performance reaches 1.5 TFLOP/s. Kernel 6. In this kernel, the field variable q e is loaded to a shared memory array. This kernel implements more efficient differentiation (as in BK5.0: Kernel 9). The improvement is very modest since in order to use this type of differentiation we need to combine it with less efficient interpolation.
Kernel 7. A version of Kernel 6 with one less shared memory array. We use only one shared memory array to store the partial result at a cost of additional synchronizations. Kernel 7 performs better than Kernel 6; however, in most cases, it is not more efficient than Kernel 5.
Kernel 8. This kernel is a version of Kernel 7 with both D 1D and ðD 1D Þ T fetched to shared memory. Now all the threads can access shared memory column-wise. Performance is very similar to Kernel 6.
Kernel 9. The kernel uses the same strategy as in the last kernel implemented for BK1.0, that is, it fetches each repeating entry of the interpolation matrix I 1D only once. The performance of this kernel is the best that we present here as it reaches 1.6 TFLOPS/s. The optimization strategies are summarized in Table 8 .
BK3.0 results: 2-D thread structure
For this benchmark problem, our best performing kernel for high-order FEM approximations, Kernel 9, performs approximately four times as fast as the reference kernel, Kernel 1. Although we did not achieve the peak performance as predicted by our empirical roofline model using global memory bandwidth, for the 4096 element mesh and N 12 our kernels are very close to that peak.
Output from the NVIDIA profiler suggest that the best kernel, Kernel 9, suffers from execution dependency stalls. As emphasized previously, the kernel has three distinct parts that have to be executed in an order, and hence, this finding is not surprising. The shared memory bandwidth for Kernel 9 is very high; in fact, much higher than for Kernel 10 for BK1.0, which uses the same interpolation strategy.
As in BK1.0, we can obtain a better understanding of the performance of our kernels by considering the roofline model based on shared memory bandwidth. Figure 10 shows updated rooflines for Kernels 8 and 9 in which we see that this shared memory bandwidth bound yields a better estimate for the achieved performance of these kernels with N ! 10. Table 9 presents performance details for all the kernels with N ¼ 11. We use N ¼ 11 because our code achieved close to peak performance for this degree. The table shows differences in resource utilization for all the kernels.
BK3.0 3-D thread structure
We perform analogous tuning process using a 3-D thread structure for N < 9. Figure 11 shows the performance of these kernels presented below. Kernel 1. This kernel starts from a direct implementation of the pseudo-code given in Algorithm 5, which is executed as three separate CUDA kernels. The field variable q e is read directly from global memory and partial results are stored in global memory as well. The code reaches approximately 280 GFLOPS/s in the best case.
Kernel 2. This kernel combines the three separate kernels as one CUDA kernel. There is almost no difference between Kernel 1 and Kernel 2. This suggests that the cost of additional kernel launches is small compared with the cost of data transfer.
Kernel 3. In this kernel, all internal loops are unrolled. Unrolling loops would appear to be more important for 2-D thread structure as for 3-D thread structure it has only a marginal influence on performance.
Kernel 4. This kernel differs from Kernel 2 by loading the geometric factors once and avoiding redundant reads. The gain in performance is small because of large amount of global memory transactions that have not yet been eliminated.
Kernel 5. While we access the field variable q e directly from global memory in the interpolation and projection parts, q e is fetched to a shared array in the differentiation step.
Kernel 6. This kernel is a version of Kernel 5 with intermediate results stored in registers in the differentiation step. Kernels 5 and 6 are approximately 1.5 times faster than Kernel 4.
Kernel 7. In this kernel, we store the partial results in shared memory and registers everywhere except between interpolation, first and second differentiation, and projection. Kernel 9. In this kernel, we declare three additional shared memory arrays (size N p each) for storing intermediate results. As a result, the performance reaches 900 GFLOPS/s. The optimization strategies are summarized in Table 10 . 
TFLOPS/s
empirical bound based on d2d copies BK3: kernel 1 (no opt) BK3: kernel 2 (one kernel) BK3: kernel 3 (const. var. + loop unrolling) BK3: kernel 4 (geometric factors in registers) BK3: kernel 5 (q in shared for differentiation) BK3: kernel 6 (registers storage in differentiation) BK3: kernel 7 (global storage only between parts) BK3: kernel 8 (no global storage) BK3: kernel 9 (shared variables for partial result) Figure 11 . BK3.0: Performance of 3-D kernels in various stages of optimization. The red line marked with crosses is the roofline computed based on device-to-device copies on a single NVIDIA P100 PCI-E 12-GB GPU. Left: TFLOPS/s for cubical mesh with 512 elements. Right: TFLOPS/s for cubical mesh with 4096 elements. Note: almost all kernels fail for N ¼ 3 on the smaller mesh. GPU: graphics processing unit.
BK3.0 Results: 3-D thread structure
Unlike in the case of BK5.0, in BK3.0 we observe a very gradual improvement in performance. However, as for BK5.0 the largest improvements appear when removing additional global memory transactions. It is interesting to note that the difference between Kernel 7 and Kernel 8 is rather small, despite the fact that Kernel 7 reads and writes three vectors of size N el Á N p to the global memory and Kernel 8 reads and writes only once. The lack of significant improvement may be a result of the additonal synchronizations enforced in Kernel 8. In Kernel 7, the global write serves as synchronization.
Conclusions and future work
In this article, we described the GPU optimization of three FEM-specific benchmark problems which have significant relevance in real-world large-scale code bases. We detailed how reducing factors such as global memory transactions and reducing pressure on shared memory usage can lead to efficient GPU kernels for these FEM operators. For BK5.0, the most-tuned kernel is aligned with the empirical roofline bound and its performance is limited only by global memory bandwidth. We obtained similar results for the two remaining problems but their performance does not scale perfectly for N ! 13. For each of the problems, we used a sequence of optimizations. One might argue that multiple authors (see, e.g. Abdelfattah et al., 2016; Garvey and Abdelrahman, 2015; Nelson et al., 2015) consider automatic performance tuning and our efforts should rather be replaced by a computational tuning process. However, in each of the benchmarks we consider large performance improvements are gained by changing the algorithm itself, which is not an automatic process. While the performance bottlenecks such too many barriers in BK1.0 and high data fetch to FLOP ratio in BK3.0 and BK5.0 are often not hard to identify, they can be troublesome to remove. We use a mixture of standard optimization strategies such as splitting the data between registers and shared memory and more advanced strategies, which resulted in redesigning the algorithm to make it more well-suited for fine-grain parallelism.
An obvious question remaining is how much more performance can be obtained in each of these benchmarks. In this article, we provide contributions toward the answer. Unlike Abdelfattah et al. (2016) , we consider the limitation of shared memory bandwidth in our approach. By no mean our model is exhaustive and our future work involves introducing a more advanced performance model that accounts for register use and other factors but retains the simplicity of the original model. The source code for all kernels and benchmarks considered in this article is publicly available at: https://github.com/kswirydo/CEED-Ax.
The results presented in this article were obtained on an NVIDIA Tesla P100 GPU. One might ask whether we should expect similar optimizations to yield good performance on the newer generation of NVIDIA GPUs (Volta) or on older generation NVIDIA GPUs (Fermi, Kepler, and Maxwell), and if not, what performance can one expect. Pascal cards, such as the NVIDIA Tesla P100, have significantly more streaming multiprocessors than its predecessors with less computational cores per multiprocessor. This allows us to split the problem into more blocks with less threads per block. Since each streaming multiprocessor has its own L1 cache, we can rely more heavily on the L1 cache in our optimizations. Furthermore, Pascal cards have an improved ratio of double-precision to single-precision performance, yielding significant performance gains in the benchmarks presented. Thus, while running the code as is using double precision on older generation of NVIDIA GPUs, such as Kepler, which has fewer streaming multiprocessors and four times as many cores per multiprocessor, we expect the performance to be far from the roofline. On the other hand, the newer generation of Volta cards has architecture similar to Pascal cards but also come with unified L1 cache and shared memory space. This feature impacts the performance of kernels that rely heavily on both L1 and shared memory. Table 10 . Optimization strategies applied to a baseline kernel with 3-D thread structure for BK3.0 problem.
Kernel # Type of optimization
Kernel 1 Baseline, no optimization. The kernel is executed as three separate CUDA kernels. Kernel 2 Kernel 1 combined into one CUDA kernel. Kernel 3 Kernel 2 þ loop unrolling. Kernel 4 Kernel 3 þ geometric factors are loaded only once and stored in registers. Kernel 5 Kernel 4 þ q is fetched to the shared memory in the differentiation step (but not in interpolation and projection). Kernel 6 Kernel 5 þ intermediate results stored in registers in the differentiation part. Kernel 7 Kernel 6 þ all partial results are stored in shared memory or registers -except the results between interpolation and differentiation and differentiation and projection. Kernel 8 Kernel 7 þ all partial results are stored either in shared memory or in registers. Kernel 9 Kernel 8 þ three additional shared memory arrays for partial results.
In addition, the authors would like to kindly acknowledge Advance Research Computing at Virginia Tech for providing readily accessible computational resources. Finally, this research was supported in part by the John K. Costain Faculty Chair in Science at Virginia Tech.
While the manuscript was under revision, Kasia Ś wirydowicz was working in National Renewable Energy Laboratory in Golden, Colorado. This work was authored in part by the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC, for the US Department of Energy (DOE) under contract no. DE-AC36-08GO28308. The views expressed in the article do not necessarily represent the views of the DOE or the US Government. The publisher, by accepting the article for publication, acknowledges that the US Government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work, or allow others to do so, for US Government purposes.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
