In this paper we unveil some performance and energy efficiency frontiers for sparse computations on GPU-based supercomputers. We compare the resource efficiency of different sparse matrix-vector products (SpMV) taken from libraries such as cuSPARSE and MAGMA for GPUs and Intel's MKL for multicore CPUs, and develop a GPU sparse matrix-matrix product (SpMM) implementation that handles the simultaneous multiplication of a sparse matrix with a set of vectors in block-wise fashion. While a typical sparse computation such as the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM succeeds in exceeding the memory-bound limitations of the SpMV. We integrate this kernel into a GPU-accelerated Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) eigensolver. LOBPCG is chosen as a benchmark algorithm for this study as it combines an interesting mix of sparse and dense linear algebra operations that is typical for complex simulation applications, and allows for hardware-aware optimizations. In a detailed analysis we compare the performance and energy efficiency against a multithreaded CPU counterpart. The reported performance and energy efficiency results are indicative of sparse computations on supercomputers.
Introduction
Building an Exascale machine using the technology employed in today's fastest supercomputers would result in a power dissipation of about half a gigawatt. Providing suitable infrastructure poses a significant challenge, and with a rough cost of US$1 million per megawatt-year, the running cost of the facility would quickly exceed the acquisition cost. This motivates replacing homogeneous CPU clusters with architectures that generate most of their performance with low-power processors or accelerators.
Indeed, the success of using GPU accelerators to reduce energy consumption is evident from the fact that 17 of the 18 greenest systems are accelerated by GPUs, according to the November 2014 Green500 list, 1 which ranks supercomputers by their performance/Watt ratio on the HPL benchmark. 2 A drawback of this ranking is that HPL reports GFlop/s numbers that are usually orders of magnitude above the performance achieved in real-world applications, and it is thus insufficient for understanding how energy efficient the accelerated systems are when running scientific applications (Dongarra and Heroux, 2013).
In this paper we unveil some performance and energy efficiency frontiers for sparse computations on GPU-based supercomputers. We focus in particular on GPU kernel optimizations for the memory-bound sparse matrix-vector product (SpMV), which serves as a building block of many sparse solvers. Standard Krylov methods, for example, are typically bounded by the SpMV performance, which reaches only a few percent of the peak of current GPUs, as quantified in Figure 1. One approach to improving Krylov methods is to break up the data dependency between the SpMV and the dot products, as in the recent work on s-step and communication-avoiding Krylov methods (Hoemmen, 2010; Yamazaki et al., 2014). The improvement in these methods comes from grouping SpMVs into a (so-called) Matrix Powers Kernel (MPK), which allows reuse of the sparse matrix and vector reads during the computation of a Krylov subspace. In this paper we consider a different approach, based on building a block-Krylov subspace. Higher computational intensity is achieved by multiplying several vectors simultaneously in a sparse matrix-matrix product (SpMM) kernel. Communication is reduced in this case by accessing the matrix entries only once and reusing them for the multiplication with all vectors, which results in significant performance benefits. As the SpMM's communication volume is less than the MPK's, the SpMM performance can be useful in modeling/predicting the upper performance bounds of s-step and communication-avoiding Krylov methods that rely on MPKs. Furthermore, SpMMs are needed in numerical algorithms such as discontinuous Galerkin and high-order finite element methods (FEM), in cases where vector quantities are simulated, and in specially designed block solvers, such as the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method used for finding a set of smallest/largest eigenstates of an SPD matrix (Knyazev, 2001). We chose the LOBPCG method as a benchmark algorithm as it provides an interesting mix of sparse and dense linear algebra operations that is typical for complex simulation applications. It also serves as a backbone for many scientific simulations, e.g. in quantum mechanics, where eigenstates and molecular orbitals are defined by eigenvectors, or in principal component analysis (PCA). Nonlinear Conjugate Gradient (CG) methods have been successfully used in predicting the electronic properties of large nanostructures (Tomov et al., 2006; Vömel et al., 2008), where the LOBPCG method in particular has shown higher performance and improved convergence (a reduced number of SpMV products compared with non-blocked CG methods). Applying this algorithm efficiently to multi-billion-size problems served as the backbone of two Gordon Bell Prize finalists who ran many-body simulations on the Japanese Earth Simulator (Yamada et al., 2005, 2006).
To put the GPU results in the context of traditional processors, we include results for multicore CPUs. We compare the GPU SpMV and SpMM implementations with MKL's counterparts, and the GPU-accelerated LOBPCG solver against the multithreaded CPU implementation available in BLOPEX, 3 for which the PETSc and Hypre libraries provide an interface (Knyazev et al., 2007). Our GPU implementations are available through the open-source MAGMA library starting from MAGMA 1.5 (Innovative Computing Laboratory, UTK, 2014b). The target platforms are K20 and K40 GPUs, and the Piz Daint supercomputer located at the Swiss National Supercomputing Centre (CSCS). 4
Related work
Energy efficiency
An overview of energy-efficient scientific computing on extreme-scale systems is provided in Donofrio et al. (2009), where the authors address both hardware and algorithmic efforts to reduce the energy cost of scientific applications. Jiménez et al. (2010) analyze the trend towards energy-efficient microprocessors. Kestor et al. (2013) break down the energy cost of the data movement in a scientific application. Targeting different mainstream architectures, Aliaga et al. (2015) present an energy comparison study for the CG solver. Concerning more complex applications, Charles et al. (2015) present an efficiency comparison when running the COSMO-ART simulation model (Knote, 2011) on different CPU architectures. For an agroforestry application, Padoin et al. (2013) investigate performance and power consumption on a hybrid CPU + GPU architecture, revealing that a changing workload can drastically improve energy efficiency. Krueger et al. (2011) compare the energy efficiency of a CPU and a GPU implementation of a seismic modeling application against a general-purpose manycore chip design called "Green Wave" that is optimized for high-order wave equations. Based on an analysis on Intel Sandy Bridge processors, Wittmann et al. (2013) extrapolate the energy efficiency of a lattice-Boltzmann computational fluid dynamics (CFD) simulation to a petascale-class machine.
Figure 1. NVIDIA K40 GPU computing efficiency on compute-intensive (left: dense LU in double precision) versus memory-bound computation (right: SpMV in double precision for various data formats) on state-of-the-art hardware and software libraries. CPU runs use MKL (denoted MKL-CSR) with the numactl --interleave=all policy (see also MKL's benchmarks (Intel Corporation, 2014)).
Blocked SpMV
As there exists a significant need for SpMM products, NVIDIA's cuSPARSE library provides this routine for the CSR format. 5 Aside from a straightforward implementation that assumes the set of vectors is stored in column-major order, the library also contains an optimized version that takes the block of vectors as a row-major matrix and can be used in combination with a preprocessing step transposing the matrix to achieve significantly higher performance (Naumov, 2012). Still, we show that the SpMM product we propose outperforms the cuSPARSE implementations.
Orthogonalizations for GPUs
Orthogonalization of vectors is a fundamental operation for both linear system and eigenproblem solvers, and for many applications. Therefore, there has been extensive research on both its acceleration and its stability. Besides the classical and modified Gram-Schmidt orthogonalizations (Golub and Van Loan, 1996) and orthogonalizations based on LAPACK (xGEQRF + xUNGQR) (Anderson et al., 1999), and correspondingly MAGMA for GPUs (Innovative Computing Laboratory, UTK, 2014a), recent work includes communication-avoiding QR (Demmel et al., 2008), also developed for GPUs (Agullo et al., 2011; Anderson et al., 2010). For tall-and-skinny matrices, these orthogonalizations are, in general, memory bound. Higher performance, using Level 3 BLAS, is also possible with orthogonalizations such as Cholesky QR or SVD QR (Stathopoulos and Wu, 2002), but they are less stable (the error is bounded by the square of the condition number of the input matrix). Recently, more stable versions were developed using mixed-precision arithmetic (Yamazaki et al., 2015). For the LOBPCG method, the orthogonalizations are, after the SpMM kernel, the most time-consuming building block. In particular, LOBPCG contains two sets of orthogonalizations per iteration.
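To make the Level 3 BLAS character of Cholesky QR concrete, the following is a minimal sketch of the method for a tall-and-skinny m x n matrix V on the GPU, written against the public cuBLAS/cuSOLVER interfaces. It is our own illustration, not the MAGMA routine (error handling omitted): the Gram matrix G = V^T V is formed by one GEMM, factored as G = R^T R by a small Cholesky decomposition, and Q = V R^{-1} is obtained by a triangular solve.

```c
// Minimal Cholesky QR sketch for a tall-and-skinny m x n matrix V
// (column-major, device memory, n << m). Illustration only, not MAGMA code.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

void cholesky_qr(cublasHandle_t blas, cusolverDnHandle_t solver,
                 double *V, int m, int n, double *G /* n*n device workspace */)
{
    const double one = 1.0, zero = 0.0;

    // G = V^T * V : one large GEMM -- the compute-intensive, Level 3 part.
    cublasDgemm(blas, CUBLAS_OP_T, CUBLAS_OP_N, n, n, m,
                &one, V, m, V, m, &zero, G, n);

    // G = R^T R : Cholesky factorization of the small n x n Gram matrix.
    int lwork = 0, *devInfo; double *work;
    cusolverDnDpotrf_bufferSize(solver, CUBLAS_FILL_MODE_UPPER, n, G, n, &lwork);
    cudaMalloc(&work, sizeof(double) * lwork);
    cudaMalloc(&devInfo, sizeof(int));
    cusolverDnDpotrf(solver, CUBLAS_FILL_MODE_UPPER, n, G, n, work, lwork, devInfo);

    // V <- V * R^{-1} : triangular solve, again Level 3 BLAS; V now holds Q.
    cublasDtrsm(blas, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_UPPER,
                CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, m, n, &one, G, n, V, m);

    cudaFree(work); cudaFree(devInfo);
}
```

The stability caveat above is visible here: forming V^T V squares the condition number of V before the factorization ever sees it.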
LOBPCG implementations
The BLOPEX package maintained by A. Knyazev is state-of-the-art for CPU implementations of LOBPCG, and popular software libraries such as PETSc and Hypre provide an interface to it (Knyazev et al., 2007) . Also Scipy (Jones et al., 2001 (Jones et al., -2013 , Octopus (Castro et al., 2006) , and Anasazi (Baker et al., 2009) as part of the Trilinos library (Heroux et al., 2003) feature LOBPCG implementations. The first implementation of LOBPCG for GPUs has been available since 2011 in the ABINIT material science package (Gonze et al., 2002) . It benefits from utilizing the generic linear algebra routines available in the CUDA (NVIDIA, 2009) and MAGMA (Innovative Computing Laboratory, UTK, 2014a) libraries. More recently, NVIDIA announced that LOBPCG will be included in the GPUaccelerated Algebraic Multigrid Accelerator library AmgX. 6 This work extends the results presented by Anzt et al. (2015) by providing a more comprehensive analysis on energy and performance, introducing models for estimating hardware-imposed optimization bounds, and enhancing the LOBPCG solver with preconditioning.
Performance bounds for SpMV kernels
The performance of sparse computations, including the performance of standard Krylov iterative methods, is typically bounded by the performance of the SpMV. The performance of the SpMV itself is typically bounded by the memory bandwidth of the system at hand. To demonstrate this, we consider the SpMV performance on a state-of-the-art GPU for a set of sparse matrices from the University of Florida Matrix Collection (UFMC) 7 with characteristics listed in Table 1. Unless the non-zeros of a sparse matrix are equally distributed over the distinct rows, which would imply that all rows contain a very similar number of non-zeros, the compressed sparse row format (CSR (Barrett et al., 1994); see Section 5 for details) is the most compact format for storing the matrix. In this general case, CSR is also the format minimizing the data transfers in an SpMV kernel. Against this background, a lower bound on the data transfers in bytes (t_B) needed for a double precision (DP) SpMV product of the form Ax = y can be evaluated as

  t_B = 8 n_z + 4 n_z + 4 (n + 1) + 8 n + 8 n ≈ 12 n_z + 20 n,   (1)

where the terms account for the matrix values (8 n_z), the column indices (4 n_z), the row pointers (4(n + 1)), reading the input vector x (8n), and writing the output vector y (8n).
The number of floating point operations (Flops) for an SpMV is at most 2 n_z. Thus, taking into account equation (1), we can derive the Flops-per-byte (Flops/B) ratio for the SpMV. Note that this number is very low; moreover, we can use it to derive an upper performance bound for the SpMV. In particular, if we disregard the time for the SpMV Flops, the time to transfer the data gives an upper bound on the performance (P_max):

  P_max = (2 n_z / t_B) × bandwidth,   (2)
where the bandwidth is the achievable data transfer rate from/to the main memory. On a K40 GPU, for example, the achievable bandwidth using NVIDIA's bandwidthTest benchmark is about 180 GB/s, and thus the upper performance bound for the SpMV, according to equation (2), can be approximated by disregarding the 20n term in t_B as

  P_max ≈ (2 n_z / 12 n_z) × 180 GB/s = 30 GFlop/s.
For example, taking into account the 20n term for the bmw matrix yields P_max = 29.3 GFlop/s. Note that MAGMA achieves 23.6 GFlop/s (see Figure 1), which is 80% of the P_max peak. Furthermore, note that this performance is achieved using the SELL-P format, which in this particular case has a 13% storage overhead (see Table 1). Therefore, if we take the overhead into account, our implementation achieves 93% of the P_max peak. The efficiencies achieved as a percentage of the P_max peak for the other matrices are given in Table 2. These results illustrate the bandwidth limitations and the current state-of-the-art for the SpMV kernel and sparse computations in general. Furthermore, the comparison in Figure 1 shows that even if the SpMV were 100% efficient, the overall performance would still be only about 2.4% of the performance capabilities of the K40 GPU. This motivates the work in this paper to further improve, and uncover what is possible through, techniques for reducing the communication in sparse computations.
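As a worked example of equations (1) and (2), the small host-side helper below (our own illustration, with made-up matrix dimensions) evaluates the bound for a given n_z, n, and measured bandwidth:

```c
// Worked example of equations (1)-(2): evaluate P_max for a DP CSR SpMV.
// Hypothetical helper, not part of MAGMA; dimensions are illustrative only.
#include <stdio.h>

double spmv_pmax_gflops(long nz, long n, double bandwidth_gb_s)
{
    double t_B   = 12.0 * nz + 20.0 * n;   // bytes moved, equation (1)
    double flops = 2.0 * nz;               // one multiply-add per non-zero
    return flops / t_B * bandwidth_gb_s;   // GFlop/s, equation (2)
}

int main(void)
{
    // Illustrative values: 20 non-zeros per row, device sustaining 180 GB/s
    // (the measured K40 bandwidth). Prints P_max = 27.7 GFlop/s.
    long n = 1000000, nz = 20 * n;
    printf("P_max = %.1f GFlop/s\n", spmv_pmax_gflops(nz, n, 180.0));
    // Dropping the 20n term gives the asymptotic bound 180/6 = 30 GFlop/s.
    return 0;
}
```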
LOBPCG
The LOBPCG method (Knyazev, 1998, 2001) finds m of the smallest (or largest) eigenvalues λ and corresponding eigenvectors x of a symmetric and positive definite eigenvalue problem:

  A x = λ x.
Similarly to other CG-based methods, this is accomplished by the iterative minimization of the Rayleigh quotient:

  ρ(x) = (x^T A x) / (x^T x).
The minimization at each step is done locally, in the subspace of the current approximation x_i, the previous approximation x_{i-1}, and the preconditioned residual P(A x_i − λ_i x_i), where P is a preconditioner for A. The subspace minimization is done by the Rayleigh-Ritz method.
Note that the operations in the algorithm are blocked and therefore can be very efficient on modern architectures. Indeed, A X_i is the SpMM kernel, and the bulk of the computations in the Rayleigh-Ritz minimization are general matrix-matrix products (GEMMs). The direct implementation of this algorithm becomes unstable as the difference between X_{i-1} and X_i becomes smaller, but stabilization methods can provide an efficient workaround (Hetmaniuk and Lehoucq, 2006; Knyazev, 2001). While the LOBPCG convergence characteristics usually benefit from using an application-specific preconditioner (Arbenz and Geus, 2005; Benner and Mach, 2011; Knyazev and Neymeyr, 2001; Kolev and Vassilevski, 2006; Vömel et al., 2008), we refrain from including specialized preconditioners, as we are particularly interested in the performance of the top-level method and in general cases where preconditioners such as ILU would be applied.
Table 1. Matrix characteristics and storage overhead for the ELLPACK and SELL-P formats. SELL-P employs a block size of 8 with 4 threads assigned to each row. Here n_z^FORMAT refers to the explicitly stored elements (the n_z non-zero elements plus the explicitly stored zeros for padding). Columns: acronym, matrix, #nonzeros (n_z), size (n), n_z/n, n_z^row, and the n_z^FORMAT values for ELLPACK and SELL-P.
Table 2. Efficiency of the MAGMA SpMV in SELL-P format as a percentage of the P_max peak from equation (2). The numbers are obtained based on the performances in Figure 1 and Table 1.
The LOBPCG we develop is hybrid, using both the GPUs and CPUs available. All data resides in GPU memory, and the bulk of the computation (the preconditioned residual, the accumulation of the matrices for the Rayleigh-Ritz method, and the update transformations) is handled by the GPU. The small and not easily parallelizable Rayleigh-Ritz eigenproblem is solved on the CPU using LAPACK. More specifically, to find the new approximations X_{i+1} from the subspace spanned by S = [X_i, X_{i-1}, R_i] (the current and previous approximations and the preconditioned residuals),
the Rayleigh-Ritz method computes on the GPU

  Ã = S^T A S,   B̃ = S^T S,
and solves the small generalized eigenproblem Ã f = λ B̃ f on the CPU, to finally find (computed on the GPU) the updated approximations X_{i+1} = S f.
For stability, various orthogonalizations are performed, following the LOBPCG Matlab code from A. Knyazev. 8 We used our highly optimized GPU implementations based on Cholesky QR and obtained the same convergence rates as the reference CPU implementation from BLOPEX (in Hypre) on all our test matrices from the University of Florida sparse matrix collection (see Section 7).
Sparse matrix-vector-block product
To develop an efficient SpMM kernel, we use the recently proposed SELL-P (padded sliced ELLPACK) format. The performance numbers reported throughout this section are for DP arithmetic.
Implementation of SpMM for SELL-P
In addition to well-known sparse matrix formats such as CSR (Barrett et al., 1994), work on efficient SpMV products for GPUs has motivated a number of new formats. One that stands out is ELLPACK (Bell and Garland, 2008), in which the different rows are padded with zeros to a uniform row length, enabling coalesced memory access to the matrix and instruction parallelism. However, the ELLPACK format incurs a storage overhead, determined by the maximum number of non-zero elements aggregated in one row. Depending on the particular problem, the overhead of using ELLPACK may result in poor performance, despite the coalesced memory access that is highly favorable for streaming processors.
A workaround that reduces the memory and computational overhead is to split the original matrix into row blocks before converting them into ELLPACK format. In the resulting sliced ELLPACK format (SELL or SELL-C, where C denotes the size of the row blocks (Kreutzer et al., 2014; Monakov et al., 2010)), the overhead is no longer determined by the matrix row containing the largest number of non-zeros, but by the row with the largest number of non-zero elements in the respective block. While SELL-C reduces the overhead very efficiently (e.g. choosing C = 1 results in the storage-optimal CSR), assigning multiple threads to each row requires padding the rows with zeros so that each block has a row length divisible by this thread count. This is the underlying idea of the SELL-P format: partition the sparse matrix into row blocks, and convert the distinct blocks into ELLPACK format (Bell and Garland, 2008), with the row length of each block padded to a multiple of the number of threads assigned to each row when computing an SpMV or SpMM product.
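The following sketch illustrates one possible realization of this layout (field and function names are our own, not MAGMA's): the non-zeros of each slice are stored column-major, so that the rows of a slice read consecutive memory locations in the same step.

```c
// Sketch of a SELL-P container and its addressing scheme, under the paper's
// conventions (slice height b rows, row length padded to a multiple of the
// t threads per row). Hypothetical names, not the MAGMA data structure.
typedef struct {
    int     n;         // number of rows (padded to a multiple of b)
    int     b;         // slice height (rows per block), e.g. 8
    int     t;         // threads assigned to each row, e.g. 4
    int    *slice_ptr; // offset of each slice in val/colind, length n/b + 1
    int    *colind;    // column indices; padded entries hold a valid dummy index
    double *val;       // non-zeros, slice-local column-major, zero-padded
} sellp_matrix;

// Element j (0 <= j < padded row length of the slice) of row `row`:
// storing each slice column-major makes the b rows of a slice access
// consecutive memory locations in the same step (coalescing).
static inline double sellp_entry(const sellp_matrix *A, int row, int j)
{
    int slice = row / A->b;   // which row block
    int local = row % A->b;   // position of the row inside the block
    return A->val[A->slice_ptr[slice] + j * A->b + local];
}
```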
Although the padding requires storing some additional zeros, the comparison between the formats in Figure 3 reveals that the blocking strategy may still yield significant memory savings compared with plain ELLPACK (see also Table 1), which translates into reduced computational cost for the SpMV kernel. For the performance of the SpMM routine, it is not sufficient to reduce the computational overhead; it is essential to optimize the memory access pattern. This requires the accessed data to be aligned in memory whenever possible. For consecutive memory access, and with the motivation of processing multiple vectors simultaneously, we implement the SpMM assuming that the tall-and-skinny dense matrix composed of the vectors is stored in row-major order. Although this necessitates a preprocessing step transposing the dense matrix from column-major to row-major format, the aligned memory access to the vectors' values compensates for the extra work.
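One way to realize this preprocessing transpose is a single standard cuBLAS call, as in the sketch below (an illustration only; MAGMA ships its own transpose kernels). The row-major n x n_v view of the vectors is exactly the column-major n_v x n transpose:

```c
// Transpose the n x nv block of vectors X (column-major) into Xrm, which
// then holds the same data viewed as a row-major n x nv matrix.
#include <cublas_v2.h>

void vectors_to_row_major(cublasHandle_t handle, int n, int nv,
                          const double *X, double *Xrm)
{
    const double one = 1.0, zero = 0.0;
    // cublasDgeam computes C = alpha*op(A) + beta*op(B); here C = X^T.
    cublasDgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, nv, n,
                &one, X, n, &zero, Xrm, nv, Xrm, nv);
}
```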
The algorithmic kernel for the SpMM operation arises as a natural extension of the SpMV kernel for the SELL-P format proposed in Anzt et al. (2014). As in the SpMV kernel, the x-dimension of the thread block processes the distinct rows of one SELL-P block, while the y-dimension corresponds to the number of threads assigned to each row (see Figure 4). Partial products are written into shared memory and added in a local reduction phase. For the SpMM it is beneficial to process multiple vectors simultaneously, which provides the incentive for extending the thread block by a z-dimension that handles the distinct vectors. While assigning every z-layer of the block to one vector would provide a straightforward implementation, keeping the set of vectors in texture memory makes an enhanced approach more appealing: in CUDA (version 6.0) every texture read fetches 16 bytes, corresponding to two IEEE DP or four IEEE single precision floating point values. As using only part of them would waste performance, every z-layer processes two (DP case) or four (single precision case) vectors, respectively. This implies that, depending on the precision format, the z-dimension of the thread block equals half or a quarter of the column count of the tall-and-skinny dense matrix.
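A simplified version of this kernel is sketched below (our own illustration of the decomposition just described, not the tuned MAGMA code). For brevity it reads the vectors through ordinary global-memory loads instead of the texture path, fixes the slice height, threads per row, and vectors per block at compile time, and assumes n is a multiple of the slice height (SELL-P pads with zero rows) and the vector count a multiple of the per-block vector count:

```cuda
#define BLK 8  // slice height b (rows per SELL-P block)
#define THR 4  // threads per row t; padded row length is a multiple of THR
#define VPB 4  // vectors handled per thread block (z-dimension)

// Launch: dim3 block(BLK, THR, VPB); dim3 grid(n / BLK, num_vecs / VPB);
// X and Y are row-major n x num_vecs; padded matrix entries store a zero
// value and a valid dummy column index, so they contribute nothing.
__global__ void sellp_spmm(int n, int num_vecs,
                           const int    *__restrict__ slice_ptr,
                           const int    *__restrict__ colind,
                           const double *__restrict__ val,
                           const double *__restrict__ X,
                           double       *__restrict__ Y)
{
    __shared__ double part[VPB][THR][BLK];   // partial sums per vector/thread/row

    const int row = blockIdx.x * BLK + threadIdx.x;       // row of this thread
    const int vec = blockIdx.y * VPB + threadIdx.z;       // vector of this z-layer
    const int off = slice_ptr[blockIdx.x];                // start of this slice
    const int len = (slice_ptr[blockIdx.x + 1] - off) / BLK; // padded row length

    double sum = 0.0;
    // Each of the THR threads of a row strides through the padded row.
    for (int j = threadIdx.y; j < len; j += THR) {
        const int idx = off + j * BLK + threadIdx.x;      // column-major in slice
        sum += val[idx] * X[colind[idx] * num_vecs + vec];
    }
    part[threadIdx.z][threadIdx.y][threadIdx.x] = sum;
    __syncthreads();

    // Local fan-in over the THR partial results of each row.
    if (threadIdx.y == 0) {
        for (int k = 1; k < THR; ++k)
            sum += part[threadIdx.z][k][threadIdx.x];
        Y[row * num_vecs + vec] = sum;
    }
}
```

Setting THR = 1 removes the shared-memory reduction entirely, which is exactly the workaround discussed next.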
As assigning multiple threads to each row requires a local reduction of the partial products in shared memory (see Figure 4), the x-, y-, and z-dimensions are bounded by the characteristics of the GPU architecture (NVIDIA, 2009). An efficient workaround when processing a large number of vectors is to assign only one thread per z-layer to each row (i.e. choosing a y-dimension of 1), which removes the reduction step and the need for shared memory.
Performance of SpMM for SELL-P
For the runtime analysis we use a Kepler K40 GPU, which is newer than the K20 GPUs employed in the Piz Daint supercomputer. We benchmarked the SELL-P SpMM for a set of test matrices taken from the University of Florida Matrix Collection (UFMC). 9 With some key characteristics collected in Table 1, and sparsity plots shown in Figure 2, we tried to cover a large variety of systems common in scientific computing. The hardware we used is a two-socket Intel Xeon E5-2670 (Sandy Bridge) platform accelerated by an NVIDIA Tesla K40c GPU. The host system has a theoretical peak of 333 GFlop/s in DP and is equipped with 64 GB of main memory that can be accessed at a theoretical bandwidth of up to 51 GB/s. The K40 GPU has a theoretical peak performance of 1682 GFlop/s in DP. On the card, 12 GB of main memory are accessed at a theoretical bandwidth of 288 GB/s; using the bandwidth tests provided by NVIDIA, we achieved 180 GB/s. All GPU kernels are implemented in CUDA (NVIDIA, 2009) version 6.0, 10 and we also include performance comparisons with routines from NVIDIA's cuSPARSE library. On the CPU, Intel's MKL (Intel Corporation, 2007) is used (version 11.0, update 5).
In Figure 5 we visualize the performance scaling of the previously described SpMM kernel for different test matrices with respect to the number of columns of the dense matrix (equivalent to the number of vectors in a blocked SpMV). The results reveal that the SpMM performance exceeds 100 GFlop/s as soon as the number of columns in the dense matrix exceeds 30. The characteristic oscillation of the performance can be explained by more or less efficient memory access; note in particular the very good performance for the cases where the column count equals a multiple of 16. Using the SpMM kernel instead of a set of consecutive SpMV products (which typically achieve less than 25 GFlop/s on this architecture; see Figure 1, right) results in speedup factors of around 5×. A similar performance improvement (up to 6.1×) is observed on CPUs when replacing consecutive MKL SpMV kernels with the MKL SpMM routine; see the MKL results in Table 3.
Figure 3. Visualization of the CSR, ELLPACK, SELL-C, and SELL-P formats. The memory demand corresponds to the gray areas. Note that choosing block size 2 for SELL-C (SELL-2) and SELL-P requires adding a zero row to the original matrix. Furthermore, padding the SELL-P format to a row length divisible by 2 requires the explicit storage of a few additional zeros.
While these results assume the performance-beneficial row-major storage of the tall-and-skinny dense matrix, many applications and algorithms use dense matrices stored in column-major format to benefit from highly optimized BLAS implementations (available for matrices in column-major format). For this reason, when comparing the performance of the SELL-P implementation and the cuSPARSE CSRSpMM (Naumov, 2012) with their unblocked SpMV counterparts, the performance reported in Table 3 includes the time needed to transpose the tall-and-skinny dense matrix composed of the vectors.
The performance results reveal that the SELL-P-based SpMM kernel outperforms the cuSPARSE counterpart for all tested matrices. The speedup over the fastest SpMV ranges between 3.8× and 5.4×. The CPU results (using Intel's MKL) were obtained by running the CPU with the numactl --interleave=all policy, which is well known to improve performance; the reported performance is consistent with benchmarks provided by Intel (Intel Corporation, 2014). The results reveal a 3× to 5× acceleration from the CPU to the GPU implementation, which is expected from the compute and bandwidth capabilities of the two architectures.
Figure 4. Visualization of the SELL-P memory layout and the SELL-P SpMM kernel, including the reduction step, using block size b = 4 (corresponding to SELL-4) and two threads assigned to every row (t = 2). Adding a z-dimension to the thread block allows for processing multiple vectors simultaneously.
Experiment framework
The Piz Daint supercomputer at the Swiss National Supercomputing Centre in Lugano was listed in November 2014 as the sixth fastest supercomputer in the TOP500, while its energy efficiency ranked ninth in the Green500. A single node is equipped with an 8-core 64-bit Intel Sandy Bridge CPU (Intel Xeon E5-2670), an NVIDIA Tesla K20X with 6 GB GDDR5 memory, and 32 GB of host memory. The nodes are connected by Cray's proprietary "Aries" interconnect with a dragonfly network topology (CSCS, 2014). Piz Daint has 5272 compute nodes, corresponding to 42,176 CPU cores in total, with the ability to use up to 16 virtual cores per node when hyperthreading (HT) is enabled, i.e. 84,352 virtual cores in total, and 5272 GPUs. The peak performance is 7.8 Petaflops (CSCS, 2014). The Cray power management database (PMDB) enables the user to monitor power and energy usage for the host node and, separately, for the accelerator at a frequency of 10 Hz (Fourestey et al., 2014). For the BLOPEX code running exclusively on the CPUs of the host nodes, we obtain the pure CPU power by subtracting the power draw of the (idle) GPUs.
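The post-processing this implies is simple; a minimal sketch (our own, not the PMDB tooling) that integrates the 10 Hz samples into energy while subtracting the idle GPU draw could look as follows:

```c
// Integrate 10 Hz power samples into energy (Joules) by the rectangle rule,
// subtracting the idle GPU draw to isolate the CPU-only BLOPEX runs.
double energy_joules(const double *node_w, const double *gpu_idle_w,
                     int nsamples, double dt /* 0.1 s at 10 Hz */)
{
    double e = 0.0;
    for (int i = 0; i < nsamples; ++i)
        e += (node_w[i] - gpu_idle_w[i]) * dt;
    return e;
}
```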
Runtime and energy analysis of LOBPCG
In this section we quantify the runtime and energy efficiency of two LOBPCG implementations: the GPU-accelerated LOBPCG and the multithreaded CPU implementation from BLOPEX. All computations use DP arithmetic. For the runtime and energy analysis, we used the BLOPEX code via the Hypre interface, assigning 8 threads to each node (one thread per core) for up to 128 nodes (1024 cores) of the hardware platform described in Section 6. We note that we use the BLOPEX LOBPCG out of the box, without attempting any optimizations that are not included in the software library. The convergence rate of the GPU implementation matches that of BLOPEX. On the left-hand side of Figure 6 we visualize the convergence of 10 eigenvectors for the Audi test matrix. As convergence properties are not the focus of this research, all further results are based on 100 iterations of either implementation. The right-hand graph of Figure 6 shows the scaling of the Hypre implementation when computing a set of eigenvectors for the Audi problem. Taking one node (8 threads) as the baseline configuration, the algorithm scales almost linearly up to moderate node counts. The LOBPCG implementation in BLOPEX is matrix free, i.e. the user is allowed to provide their choice of SpMV/SpMM implementation. In these experiments we use the Hypre interface to BLOPEX, linked with the MKL library.
Figure 5. DP performance scaling with respect to the vector count of the SpMM kernel for the matrices listed in Table 1.
The number of operations executed in every iteration of LOBPCG can be approximated by

  2 n_z n_v + 36 n n_v^2,   (3)
where n_z denotes the number of non-zeros of the sparse matrix, n the dimension, and n_v the number of eigenvectors (equivalent to the number of columns in the tall-and-skinny dense matrix). The first term of the sum reflects the SpMM operation generating the Krylov vectors, while the second contains the remaining operations, including the orthogonalizations. Due to the n_v^2 term, the runtime is expected to increase superlinearly with the number of vectors. This can be observed in the left-hand graph of Figure 7, where we visualize the time needed to complete 100 iterations on the Audi problem using either the BLOPEX code via the Hypre interface on 4 nodes (32 threads), or the GPU implementation using either a sequence of SpMVs or the SpMM kernel. While the BLOPEX implementation also shows some variance for different numbers of vectors, the runtime pattern of the GPU LOBPCG reflects the efficiency of the orthogonalization routines, favoring cases where 16, 32, or 48 vectors are processed. This characteristic pattern is even amplified when replacing the consecutive SpMVs with the SpMM, as this kernel also promotes certain column counts of the tall-and-skinny dense matrix, as shown on the right of Figure 7. The right-hand side of Figure 7 shows the runtime of the SpMM-based GPU implementation of LOBPCG to complete 100 iterations for different test matrices. Comparing the results for the Audi problem, we are 1.3× and 1.2× faster when computing 32 and 48 eigenvectors, respectively, using the SpMM instead of the SpMV in the GPU implementation of LOBPCG. Note that although in this case the SpMM performance is about 5× the SpMV performance, the overall respective improvements of 30% and 20% reflect that only 12.5% and 8.7% of the overall LOBPCG flops are in SpMVs for the 32 and 48 eigenvector problems, respectively (see equation (3) and the matrix specifications in Table 1).
On the right-hand side of Figure 6 we identified almost linear scaling of the Hypre implementation. The associated energy need remains constant within the linear scaling range, but rises as soon as the speedup is smaller than the node increase. We further observe that the Hypre LOBPCG scales almost independently of the number of eigenvectors computed. We define the computation of a set of 16 and 32 eigenvectors as the two benchmarks for the further analysis. In a first step, we investigate how energy efficiency and runtime performance are related for the BLOPEX implementation. For this, we evaluate these metrics when running on 1, 2, 4, 8, 16, 32, 64, and 128 nodes, and identify the configurations providing the best runtime performance and energy efficiency (see Table 4). Due to the nonlinear scaling of the BLOPEX implementation, and the power draw increasing almost linearly with the hardware resources, optimizing for runtime results in a significant overhead in energy efficiency, and vice versa. While this is different for the hybrid implementation of LOBPCG using a single CPU + GPU node, the energy balance also has to account for the power draw of the GPU.
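As a plausibility check of these percentages (our own arithmetic, using the published dimensions of the audikw_1 matrix, n ≈ 943,695 and n_z ≈ 77.7 × 10^6), the SpMM share of equation (3) for n_v = 32 is

\[
\frac{2 n_z n_v}{2 n_z n_v + 36\, n\, n_v^2}
  = \frac{2 n_z}{2 n_z + 36\, n\, n_v}
  \approx \frac{1.55 \times 10^{8}}{1.55 \times 10^{8} + 1.09 \times 10^{9}}
  \approx 0.125,
\]

and the same expression with n_v = 48 gives approximately 0.087, matching the 12.5% and 8.7% quoted above.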
The results in Table 5 indicate that handling a larger set of eigenvalues increases the GPU energy need relative to the total energy budget. Depending on the system and the number of eigenvectors, up to 65% of the energy is consumed by the GPU. Each node of Piz Daint is equipped with a multicore Sandy Bridge CPU that provides a theoretical peak of 166.4 GFlop/s at a power consumption of 115 W (1.44 GFlop/W), whereas the K20X GPU consumes 225 W while providing 1311 GFlop/s (5.83 GFlop/W) (Fourestey et al., 2014). Hence, assuming full load, adding the accelerator theoretically increases the power need per node by a factor of around 1.67. However, the GPU-accelerated LOBPCG also uses the CPU for solving the Rayleigh-Ritz problem, independent of the system matrix. The energy improvement depends on the specific test case, but is in general smaller than the runtime improvement when comparing the BLOPEX implementation running on one node with the hybrid code. This difference is reflected in the first line of Table 6, where we list the runtime and energy requirements of the BLOPEX LOBPCG scaled to the MAGMA LOBPCG results (equivalent to the speedup and greenup of the MAGMA implementation over the BLOPEX implementation). When computing 16 eigenvectors (see the speedup shown on the top left of Table 6), increasing the computing resources enables the BLOPEX implementation to outperform the MAGMA LOBPCG for some test cases; however, this comes at the cost of significantly higher energy usage (see the greenup of MAGMA over BLOPEX on the top right of Table 6). With the greenup ranging between 4 and 180, the BLOPEX code is for no configuration even close to the energy efficiency of the GPU-accelerated solver (see Table 6).
To complete the performance analysis, we concentrate on the single-node performances in Table 6. Based on the SpMM kernel analysis in Table 3, the expectation for 16 vectors is that an optimized CPU code (blocking the SpMVs) is about 5× slower than the GPU code. The 10× acceleration indicates that the Hypre interface to BLOPEX is probably not blocking the SpMVs. Computing more eigenvectors (e.g. 32, shown at the bottom of Table 6) reduces the fraction of SpMV flops relative to the total flops (see equation (3)), making the SpMV implementation less critical for the overall performance. The fact that the speedup of the GPU over the CPU LOBPCG continues to grow, reaching about 30×, indicates that there are other missed optimization opportunities in the CPU implementation. For large eigenvector counts, the majority of the flops are in GEMMs used for assembling the matrix representations for the local Rayleigh-Ritz minimizations and for the orthogonalizations. In the GPU-accelerated LOBPCG implementation, these routines are highly optimized. In particular, the GEMMs are optimized for the tall-and-skinny shape of the matrices involved (Nath et al., 2010). More specifically, the matrix multiplication A^T · B is split into a set of smaller GEMMs that are tuned for the particular sizes. A batched routine is used for the simultaneous execution of the small GEMMs (Yamazaki et al., 2015), and the sum of the partial products is formed in a post-processing step.
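The sketch below illustrates this splitting idea using the stock cuBLAS strided-batched interface (our own simplified version, and one that requires a more recent CUDA toolkit than the 6.0 release used in the experiments; the MAGMA kernels are custom-tuned and differ in detail). The tall-and-skinny operands are cut into row chunks, the chunk products A_i^T B_i are computed in one batched GEMM, and the partial results are summed in a post-processing step:

```c
// C = A^T * B for tall-and-skinny A, B (m x n, column-major, m >> n),
// split into `batch` row chunks. Illustration only, not the MAGMA kernels.
#include <cublas_v2.h>

void gemm_tn_split(cublasHandle_t handle, int m, int n, int batch,
                   const double *A, const double *B,
                   const double *ones,  // device vector of length batch, all 1.0
                   double *C,           // n x n result
                   double *Cpart)       // workspace: batch partial n x n products
{
    const double one = 1.0, zero = 0.0;
    int chunk = m / batch;  // assume m divisible by batch for brevity

    // Cpart[i] = A_i^T * B_i for all row chunks i, in one batched call.
    cublasDgemmStridedBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                              n, n, chunk,
                              &one,  A, m, (long long)chunk,
                                     B, m, (long long)chunk,
                              &zero, Cpart, n, (long long)n * n,
                              batch);

    // Post-processing: C = sum_i Cpart[i], via a GEMV that treats Cpart as
    // an (n*n) x batch matrix multiplied by a vector of ones.
    cublasDgemv(handle, CUBLAS_OP_N, n * n, batch,
                &one, Cpart, n * n, ones, 1, &zero, C, 1);
}
```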
We now enhance LOBPCG with a preconditioner. For the developed GPU-accelerated version, we analyze in Tables 7 and 8 the runtime and energy when using the Jacobi and ILU preconditioners available in the MAGMA software package (Innovative Computing Laboratory, UTK, 2014a). As expected, enhancing LOBPCG with a preconditioner increases the runtime and energy consumption of every iteration. In a scientific application, the preconditioner's efficiency is determined by the tradeoff between increased resource usage and improved convergence behavior, which is outside the focus of this study. Comparing with the non-preconditioned LOBPCG in Table 5, we notice that the GPU fraction of the overall energy balance increases for the Jacobi preconditioner and decreases for the ILU preconditioner. This indicates that applying the Jacobi preconditioner is a compute-intensive operation that increases the pressure on the GPU, while the less compute-intensive forward-backward triangular solves in the ILU-preconditioned LOBPCG cause low GPU utilization and emphasize the CPU contribution to the energy balance.
In Table 9 we report the speedup and greenup of the Jacobi-preconditioned MAGMA LOBPCG over its CPU counterpart. Although the speedups are typically smaller than for the unpreconditioned LOBPCG, the BLOPEX LOBPCG used via the Hypre framework does not, for any configuration, achieve the energy efficiency level of the GPU-accelerated implementation.
The bottom line is that using more hardware resources may enable a scalable CPU-based algorithm to keep up with the performance of the hybrid implementation, but it increasingly fails to provide the desired energy efficiency.
Summary and outlook
In this study we compared the performance and energy efficiency of a CPU and a hybrid CPU + GPU LOBPCG eigensolver on the Piz Daint supercomputer. For the GPU-accelerated LOBPCG, we provided a comprehensive description of the GPU kernel used to generate the block-Krylov subspace via an SpMM product. As this building block also serves as the backbone of other block-Krylov methods, we included a runtime analysis that reveals significant speedups over a set of consecutive SpMVs, as well as over the equivalent routines provided in Intel's MKL and NVIDIA's cuSPARSE libraries. We compared the runtime and energy efficiency of the GPU-accelerated LOBPCG with the BLOPEX LOBPCG implementation used via the Hypre interface. Running the CPU code on one node, we reported more than an order of magnitude lower resource usage for the GPU-accelerated version. We showed that increasing the hardware resources for the BLOPEX code succeeds in improving the runtime performance, but results in ever higher energy usage. This indicates that, with tuned GPU code, it is possible to translate the resource efficiency of GPU supercomputers into improved energy efficiency of scientific applications. Future work includes simplifying the evaluation of a system's resource efficiency when running scientific applications by creating a set of basic benchmark kernels and extrapolating to more complex algorithms.
