In this paper we unveil some of the energy efficiency and performance frontiers for sparse computations on GPU-based supercomputers. To do this, we consider state-of-the-art implementations of the sparse matrix-vector (SpMV) product in libraries like cuSPARSE, MKL, and MAGMA, and their use in the LOBPCG eigensolver. LOBPCG is chosen as a benchmark for this study as it combines an interesting mix of sparse and dense linear algebra operations with potential for hardware-aware optimizations. Most notably, LOBPCG includes a blocking technique that is a common performance optimization for many applications. In particular, multiple memory-bound SpMV operations are blocked into a sparse matrix-matrix product (SpMM), which achieves significantly higher performance than a sequence of SpMVs. We provide details about the GPU kernels we use for the SpMV, the SpMM, and the LOBPCG implementation design, and study performance and energy consumption compared to CPU solutions. While a typical sparse computation like the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM achieves up to a 6× performance improvement over the GPU's SpMV, and that the GPU-accelerated LOBPCG based on this kernel is 3 to 5× faster than multicore CPUs with the same power draw, e.g., a K40 GPU vs. two Sandy Bridge CPUs (16 cores). In practice though, we show that currently available CPU implementations are much slower due to missed optimization opportunities. These performance results translate to similar improvements in energy consumption, and are indicative of today's frontiers in energy efficiency and performance for sparse computations on supercomputers.
Introduction
Building an Exascale machine using the technology employed in today's fastest supercomputers would result in a power dissipation of about half a gigawatt [1]. Providing suitable infrastructure poses a significant challenge, and with a rough cost of one million USD per megawatt-year, the running cost for the facility would quickly exceed the acquisition cost. This is the motivation for replacing homogeneous CPU clusters with architectures generating most of their performance with low-power processors or accelerators.
Indeed, the success of using GPU accelerators to reduce energy consumption is evident from the fact that the 15 greenest systems are accelerated by GPUs, according to the June 2014 Green500 list [2], which ranks supercomputers according to their performance/watt ratio on the HPL benchmark [3]. A drawback of this ranking is that HPL yields GFLOP/s numbers that are usually orders of magnitude above the performance achieved in real-world applications, and thus it is insufficient for understanding how energy-efficient the accelerated systems are for scientific computing [4].
Recent work has addressed this question by comparing the energy efficiency of a large set of current hardware platforms when running a Conjugate Gradient solver [5] , and the results revealed that when optimizing with respect to energy efficiency, low-power processors and the newer CPUs are able to keep up with the GPUs. Although [5] contains a comprehensive study on different optimization scenarios, considering an isolated CG fails to provide insight into the energy efficiency of more complex applications.
In this paper, we focus on GPU kernel optimizations for the memory-bound SpMV, which serves as a building block of many sparse solvers. Standard Krylov methods, for example, are typically bounded by the SpMV performance, which is only a few percent of the peak of current GPUs, as shown in Figure 1, right. One approach to improve Krylov methods is to break up the data dependency between the SpMV and the dot products, as in recent work on s-step and communication-avoiding Krylov methods [7, 8]. The improvement in these methods comes from grouping SpMVs together into a (so-called) matrix powers kernel, which allows reuse of the sparse matrix and vector reads during the computation of a Krylov subspace. Another approach, which we consider in this paper, is to generate a block-Krylov subspace. Higher performance for the numerical algorithm is achieved by multiplying several vectors simultaneously in an SpMM kernel. Communication is reduced in this case by accessing the matrix only once and reusing it for the multiplication with all vectors, which results in significant performance benefits (illustrated in the sketch below). SpMMs are needed in numerical algorithms like Discontinuous Galerkin methods, high-order FEM, cases where vector quantities are simulated, or specially designed block solvers, like the Locally Optimal Block PCG (LOBPCG) used for finding a set of the smallest or largest eigenstates of an SPD matrix [9].
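To make the data-reuse argument concrete, the following sketch (our illustration, not code from the referenced libraries) contrasts a textbook CSR SpMV kernel with a CSR SpMM kernel that multiplies the same matrix by K vectors at once; in the SpMM, each matrix entry is read once and reused for all K right-hand sides.

```cuda
// Textbook CSR SpMV and a blocked CSR SpMM, illustrating the data reuse that
// makes the SpMM faster: in the SpMM, each matrix entry (value + column index)
// is loaded once and applied to all K vectors. Simplified sketch; the tuned
// kernels discussed later use the SELL-P format and a different thread mapping.
__global__ void csr_spmv(int n, const int *rowptr, const int *colind,
                         const double *val, const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    double dot = 0.0;
    for (int j = rowptr[row]; j < rowptr[row + 1]; ++j)
        dot += val[j] * x[colind[j]];     // matrix re-read for every vector
    y[row] = dot;
}

// X and Y hold K vectors in row-major order (n rows, K columns), so the K
// entries touched per column index are contiguous in memory.
// Example launch: csr_spmm<16><<<(n+255)/256, 256>>>(n, rowptr, colind, val, X, Y);
template <int K>
__global__ void csr_spmm(int n, const int *rowptr, const int *colind,
                         const double *val, const double *X, double *Y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    double acc[K];
    for (int v = 0; v < K; ++v) acc[v] = 0.0;
    for (int j = rowptr[row]; j < rowptr[row + 1]; ++j) {
        double a = val[j];                      // matrix entry read once ...
        const double *xrow = X + colind[j] * K;
        for (int v = 0; v < K; ++v)
            acc[v] += a * xrow[v];              // ... and reused for all K vectors
    }
    for (int v = 0; v < K; ++v)
        Y[row * K + v] = acc[v];
}
```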
The LOBPCG method provides an interesting mix of sparse and dense linear algebra operations. It also serves as a backbone for many simulations, e.g., in quantum mechanics, where eigenstates and molecular orbitals are defined by eigenvectors, or in principal component analysis. Non-linear CG methods have been successfully used in predicting the electronic properties of large nanostructures [10, 11], where the LOBPCG method in particular has shown higher performance and improved convergence (a reduced number of SpMV products compared to non-blocked CG methods). Applying this algorithm efficiently to multi-billion-size problems served as the backbone of two Gordon Bell Prize finalists that ran many-body simulations on the Japanese Earth Simulator [12, 13].
In this paper, we compare the state-of-the-art multi-threaded CPU implementation of LOBPCG in BLOPEX [14], for which the PETSc and Hypre libraries provide an interface [15], with our GPU implementation, which is now available through the MAGMA library [16]. The target platform is the Piz Daint supercomputer located at the Swiss National Supercomputing Centre (CSCS).
Related Work
Energy Efficiency: An overview of energy-efficient scientific computing on extreme-scale systems is provided in [17], where the authors address both hardware and algorithmic efforts to reduce the energy cost of scientific applications. Jiménez et al. [18] analyze the trend towards energy-efficient microprocessors. Kestor et al. [19] break down the energy cost of data movement in a scientific application. Targeting different mainstream architectures, Aliaga et al. present in [5] an energy comparison study for the CG solver. Concerning more complex applications, Charles et al. [20] present an efficiency comparison when running the COSMO-ART simulation model [21] on different CPU architectures. For an agroforestry application, Padoin et al. [22] investigate performance and power consumption on a hybrid CPU+GPU architecture, revealing that a changing workload can drastically improve energy efficiency. In [23], Krueger et al. compare the energy efficiency of a CPU and a GPU implementation of a seismic modeling application against a general-purpose manycore chip design called "Green Wave" that is optimized for high-order wave equations. Based on an analysis on Intel Sandy Bridge processors, Wittmann et al. [24] extrapolate the energy efficiency of a Lattice-Boltzmann CFD simulation to a petascale-class machine.
(The authors would like to thank CSCS for access to Piz Daint and the support in running the experiments.)
Blocked SpMV: As there exists significant need for SpMM products, NVIDIA's cuSPARSE library provides this routine for the CSR format [25]. Aside from a straightforward implementation assuming the set of vectors is stored in column-major order, the library also contains an optimized version taking the block of vectors as a row-major matrix, which can be used in combination with a preprocessing step transposing the matrix to achieve significantly higher performance [26]. Still, we show that the SpMM product we propose outperforms the cuSPARSE implementations.
Orthogonalizations for GPUs: Orthogonalization of vectors is a fundamental operation for both linear system and eigenproblem solvers, and for many applications. Therefore, there has been extensive research on both its acceleration and its stability. Besides the classical and modified Gram-Schmidt orthogonalizations [27] and orthogonalizations based on LAPACK (xGEQRF + xUNGQR) [28] and, correspondingly, MAGMA for GPUs [16], recent work includes communication-avoiding QR [29], also developed for GPUs [30, 31]. For tall-and-skinny matrices, these orthogonalizations are, in general, memory bound. Higher performance, using Level 3 BLAS, is also possible in orthogonalizations like the Cholesky QR or SVD QR, but these are less stable (the error is bounded by the square of the condition number of the input matrix). For the LOBPCG method, after the SpMM kernel, the orthogonalizations are the most time-consuming building block. In particular, LOBPCG contains two sets of orthogonalizations per iteration.
LOBPCG implementations: The BLOPEX package maintained by A. Knyazev is the state-of-the-art CPU implementation of LOBPCG, and popular software libraries like PETSc and Hypre provide an interface to it [15]. SciPy [32], Octopus [33], and Anasazi [34], part of the Trilinos library [35], also feature LOBPCG implementations. The first implementation of LOBPCG for GPUs has been available since 2011 in the ABINIT material science package [36]. It benefits from utilizing the generic linear algebra routines available in the CUDA [37] and MAGMA [16] libraries. More recently, NVIDIA announced that LOBPCG will be included in the GPU-accelerated Algebraic Multigrid Accelerator AmgX.
LOBPCG
The LOBPCG method [9, 38] finds m of the smallest (or largest) eigenvalues λ and corresponding eigenvectors x of a symmetric and positive definite eigenvalue problem
Ax = λx.
Similarly to other CG-based methods, this is accomplished by the iterative minimization of the Rayleigh quotient
ρ(x) = (x^T A x) / (x^T x).
The minimization at each step is done locally, in the subspace of the current approximation x_i, the previous approximation x_{i-1}, and the preconditioned residual P(Ax_i − λ_i x_i), where P is a preconditioner for A. The subspace minimization is done by the Rayleigh-Ritz method.
Note that the operations in the algorithm are blocked and therefore can be very efficient on modern architectures. Indeed, AX_i is the SpMM kernel, and the bulk of the computations in the Rayleigh-Ritz minimization are general matrix-matrix products (GEMMs). The direct implementation of this algorithm becomes unstable as the difference between X_{i-1} and X_i becomes smaller, but stabilization methods can provide an efficient workaround [9, 39]. While the LOBPCG convergence characteristics usually benefit from using an application-specific preconditioner [11, 40-43], we refrain from including one as we are particularly interested in the performance of the top-level method. The LOBPCG we develop is hybrid, using both the GPUs and CPUs available. All data resides in GPU memory, and the bulk of the computation (the preconditioned residual, the accumulation of the matrices for the Rayleigh-Ritz method, and the update transformations) is handled by the GPU. The small and not easily parallelizable Rayleigh-Ritz eigenproblem is solved on the CPU using LAPACK. More specifically, to find the new approximation X_{i+1} in the subspace spanned by S = [X_i, X_{i-1}, P(AX_i − X_i Λ_i)], the Rayleigh-Ritz method computes on the GPU
Ã = S^T A S   and   B̃ = S^T S,
solves the small generalized eigenproblem Ã φ = λ̃ B̃ φ on the CPU, and finally forms (computed on the GPU) the updated approximation X_{i+1} = S Φ, where the columns of Φ are the eigenvectors φ associated with the m sought eigenvalues λ̃.
For stability, various orthogonalizations are performed, following the LOBPCG Matlab code from A. Knyazev (http://www.mathworks.com/matlabcentral/fileexchange/48-lobpcg-m). We used our highly optimized GPU implementations based on the Cholesky QR to get the same convergence rates as the reference CPU implementation from BLOPEX (in Hypre) on all our test matrices from the University of Florida sparse matrix collection (see Section 6).
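The Cholesky QR used for these orthogonalizations admits a very compact formulation: form the Gram matrix G = V^T V, factor G = R^T R, and replace V by V R^{-1}. The sketch below is our minimal host-side illustration of this scheme, assuming cuBLAS for the GPU-resident BLAS calls and LAPACK on the CPU for the small Cholesky factorization; the tuned MAGMA kernels actually used differ in implementation details.

```cuda
#include <cublas_v2.h>
#include <lapacke.h>
#include <vector>

// Cholesky QR of a tall-and-skinny n x k matrix dV (column-major, GPU memory):
// G = V^T V (GPU), G = R^T R (CPU Cholesky), V <- V R^{-1} (GPU). The error of
// this scheme is bounded by the square of the condition number of V, which is
// why LOBPCG implementations combine it with additional stabilization.
int cholesky_qr(cublasHandle_t handle, int n, int k, double *dV, int ldv,
                double *dG /* k*k workspace on the GPU */)
{
    const double one = 1.0, zero = 0.0;

    // G = V^T V (only the lower triangle is computed and referenced later)
    cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_T,
                k, n, &one, dV, ldv, &zero, dG, k);

    // Copy the small Gram matrix to the host and factor it: G = L L^T = R^T R
    std::vector<double> G(k * k);
    cublasGetMatrix(k, k, sizeof(double), dG, k, G.data(), k);
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', k, G.data(), k);
    if (info != 0) return info;   // V was numerically rank deficient

    // Copy the Cholesky factor back and solve V <- V R^{-1} = V L^{-T}
    cublasSetMatrix(k, k, sizeof(double), G.data(), k, dG, k);
    cublasDtrsm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                n, k, &one, dG, k, dV, ldv);
    return 0;
}
```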
Sparse Matrix-Vector-Block Product
To develop an efficient SpMM kernel, we use the recently proposed SELL-P format (padded sliced ELLPACK [44] ). The performance numbers reported throughout this section are for double precision (DP) arithmetic.
Implementation of SpMM for SELL-P
In addition to well-known sparse matrix formats like CSR [45], work on efficient SpMV products for GPUs has motivated a number of new formats. One format standing out in particular is ELLPACK [46], where the different rows are padded with zeros to a uniform row length, suitable for coalesced memory accesses to the matrix and for instruction parallelism. However, the ELLPACK format incurs a storage overhead, which is determined by the maximum number of nonzero elements aggregated in one row. Depending on the particular problem, the overheads of using ELLPACK may result in poor performance, despite the coalesced memory access that is highly favorable for streaming processors.
A workaround to reduce the memory and computational overhead is to split the original matrix into row blocks before converting them into ELLPACK format. In the resulting sliced ELLPACK format (SELL or SELL-C, where C denotes the size of the row blocks [47, 48]), the overhead is no longer determined by the matrix row containing the largest number of non-zeros, but by the row with the largest number of nonzero elements in the respective block. While SELL-C reduces the overhead very efficiently, e.g., choosing C=1 results in the storage-optimal CSR, assigning multiple threads to each row requires padding the rows with zeros, so that each block has a row length divisible by this thread number. This is the underlying idea of the SELL-P format: partition the sparse matrix into row blocks, and convert the distinct blocks into ELLPACK format [46], with the row length of each block padded to a multiple of the number of threads assigned to each row when computing an SpMV or SpMM product. A conversion sketch is given below.
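The following host-side sketch illustrates one plausible SELL-P conversion under the layout assumed throughout this section: rows are grouped into blocks of b rows, each block's row length is padded to a multiple of the t threads per row, and, within a block, entry j of local row r is stored at offset j*b + r so that consecutive threads read consecutive addresses. The actual MAGMA conversion routine may differ in details (e.g., the column indices used for padded entries).

```cuda
#include <vector>
#include <algorithm>

// Convert a CSR matrix into (padded) SELL-P with block size b and t threads
// per row. Within each block the values are stored "column-major": entry j
// of local row r lives at blockptr[blk] + j*b + r. Padded entries carry the
// value 0 and (here) column index 0. Sketch only -- assumes n is a multiple
// of b and ignores further SpMM-specific alignment tweaks.
void csr_to_sellp(int n, int b, int t,
                  const std::vector<int> &rowptr,
                  const std::vector<int> &colind,
                  const std::vector<double> &val,
                  std::vector<int> &blockptr,      // numblocks+1 offsets
                  std::vector<int> &sell_colind,
                  std::vector<double> &sell_val)
{
    int numblocks = n / b;
    blockptr.assign(numblocks + 1, 0);

    // Pass 1: padded row length per block = max row length, rounded up to t.
    for (int blk = 0; blk < numblocks; ++blk) {
        int maxlen = 0;
        for (int r = blk * b; r < (blk + 1) * b; ++r)
            maxlen = std::max(maxlen, rowptr[r + 1] - rowptr[r]);
        int padded = ((maxlen + t - 1) / t) * t;
        blockptr[blk + 1] = blockptr[blk] + padded * b;
    }
    sell_colind.assign(blockptr[numblocks], 0);
    sell_val.assign(blockptr[numblocks], 0.0);

    // Pass 2: scatter the CSR entries into the padded blocks.
    for (int blk = 0; blk < numblocks; ++blk) {
        for (int r = 0; r < b; ++r) {
            int row = blk * b + r;
            for (int j = rowptr[row]; j < rowptr[row + 1]; ++j) {
                int pos = blockptr[blk] + (j - rowptr[row]) * b + r;
                sell_colind[pos] = colind[j];
                sell_val[pos]    = val[j];
            }
        }
    }
}
```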
Although the padding introduces some zero fill-in, the comparison between the formats in Figure 3 reveals that the blocking strategy may still yield significant memory savings (also see Table 1), which translate into a reduced computational cost for the SpMV kernel. For the performance of the SpMM routine, it is not sufficient to reduce the computational overhead; it is essential to optimize the memory access pattern. Doing so requires the accessed data to be aligned in memory whenever possible. For consecutive memory accesses, and with the motivation of processing multiple vectors simultaneously, we implement the SpMM assuming that the tall-and-skinny dense matrix composed of the vectors is stored in row-major order. Although this requires a preprocessing step transposing the dense matrix from column-major to row-major format prior to the SpMM call, the aligned memory access to the vectors' values compensates for the extra work.
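The preprocessing step is an ordinary out-of-place transpose of the n × n_v tall-and-skinny matrix. One simple way to realize it is cuBLAS's geam routine, shown in the sketch below; this is only an illustration of the preprocessing step, and library implementations may use their own transpose kernels.

```cuda
#include <cublas_v2.h>

// Transpose a column-major n x nv dense matrix dX (the block of vectors)
// into dXrm, an nv x n column-major matrix, i.e., the same data viewed as a
// row-major n x nv matrix. geam computes C = alpha*op(A) + beta*op(B); with
// beta = 0 this is a pure transpose.
void transpose_to_row_major(cublasHandle_t handle, int n, int nv,
                            const double *dX, double *dXrm)
{
    const double one = 1.0, zero = 0.0;
    cublasDgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                nv, n,               // dimensions of the result
                &one, dX, n,         // A is n x nv, op(A) = A^T
                &zero, dXrm, nv,     // B unused because beta = 0
                dXrm, nv);
}
```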
The algorithmic kernel for the SpMM operation arises as a natural extension of the SpMV kernel for the SELL-P format proposed in [44]. Like in the SpMV kernel, the x-dimension of the thread block processes the distinct rows of one SELL-P block, while the y-dimension corresponds to the number of threads assigned to each row (see Figure 4). Partial products are written into shared memory and added in a local reduction phase. For the SpMM it is beneficial to process multiple vectors simultaneously, which motivates extending the thread block by a z-dimension handling the distinct vectors. While assigning every z-layer of the block to one vector would provide a straightforward implementation, keeping the set of vectors in texture memory makes an enhanced approach more appealing. The motivation is that in CUDA (version 6.0) every texture read fetches 16 bytes, corresponding to two IEEE double or four IEEE single precision floating point values. As using only part of them would waste performance, every z-layer may process two (double precision) or four (single precision) vectors, respectively. This implies that, depending on the precision format, the z-dimension of the thread block equals half or a quarter of the column count of the tall-and-skinny dense matrix.
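A simplified version of this kernel is sketched below (our illustration, not the tuned MAGMA code): it uses the block layout from the conversion sketch above, assumes the dense matrix X is stored in row-major order, omits the texture-memory reads and the packing of multiple vectors per z-layer, and maps threadIdx.x to the row inside a SELL-P block, threadIdx.y to the t partial products per row, and threadIdx.z to one vector each.

```cuda
// Simplified SELL-P SpMM. Launch configuration (assumptions of this sketch):
//   grid  = (n/b, nv/vz), block = (b, t, vz), shared = b*t*vz*sizeof(double),
// with n a multiple of b, nv a multiple of vz, and t a power of two, so that
// no thread exits before the __syncthreads() calls.
__global__ void sellp_spmm(int n, int nv, int b, int t,
                           const int *blockptr, const int *colind,
                           const double *val,
                           const double *X,   // n x nv, row-major
                           double *Y)         // n x nv, row-major
{
    extern __shared__ double part[];          // b * t * blockDim.z entries
    int blk = blockIdx.x;
    int r   = threadIdx.x;                    // row within the SELL-P block
    int tid = threadIdx.y;                    // partial-product index
    int v   = blockIdx.y * blockDim.z + threadIdx.z;   // vector index
    int row = blk * b + r;

    int offset = blockptr[blk];
    int rowlen = (blockptr[blk + 1] - offset) / b;      // padded, multiple of t

    // Each of the t threads accumulates a strided partial sum for (row, v).
    double dot = 0.0;
    for (int j = tid; j < rowlen; j += t) {
        int idx = offset + j * b + r;
        dot += val[idx] * X[colind[idx] * nv + v];      // zero padding is harmless
    }
    int s = (threadIdx.z * t + tid) * b + r;
    part[s] = dot;
    __syncthreads();

    // Tree reduction over the y-dimension (local reduction phase of Figure 4).
    for (int stride = t / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            part[s] += part[(threadIdx.z * t + tid + stride) * b + r];
        __syncthreads();
    }
    if (tid == 0)
        Y[row * nv + v] = part[(threadIdx.z * t) * b + r];
}
```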
As assigning multiple threads to each row requires a local reduction of the partial products in shared memory (see Figure 4), the x-, y-, and z-dimensions are bounded by the characteristics of the GPU architecture [37]. An efficient workaround when processing a large number of vectors is to assign only one thread per z-dimension to each row (choosing the y-dimension equal to 1), which removes the reduction step and the need for shared memory.
Table 1: Matrix characteristics and storage overhead for the ELLPACK and SELL-P formats. SELL-P employs a blocksize of 8 with 4 threads assigned to each row. n_z^FORMAT refers to the explicitly stored elements (n_z nonzero elements plus the explicitly stored zeros for padding).
Figure 3: Visualization of the CSR, ELLPACK, SELL-C, and SELL-P formats. The memory demand corresponds to the grey areas. Note that choosing block size 2 for SELL-C (SELL-2) and SELL-P requires adding a zero row to the original matrix. Furthermore, padding the SELL-P format to a row length divisible by 2 requires explicit storage of a few additional zeros.
Performance of SpMM for SELL-P
For the runtime analysis we use a Kepler K40 GPU, which is newer than the K20 GPUs in the Piz Daint supercomputer. We benchmarked the SELL-P SpMV for a larger set of test matrices taken from the University of Florida Matrix Collection (UFMC; see http://www.cise.ufl.edu/research/sparse/matrices/) than the ones we target with the LOBPCG algorithm in Section 6.
With some key characteristics collected in Table 1, and sparsity plots shown in Figure 2, we tried to cover a large variety of systems common in scientific computing. The hardware we used is a two-socket Intel Xeon E5-2670 (Sandy Bridge) platform accelerated by an NVIDIA Tesla K40c GPU with a theoretical peak performance of 1,682 GFLOP/s. The host system has a theoretical peak of 333 GFLOP/s, a main memory size of 64 GB, and a theoretical bandwidth of up to 51 GB/s. On the K40 GPU, 12 GB of main memory are accessed at a theoretical bandwidth of 288 GB/s. All GPU kernels are implemented in CUDA [37], version 6.0 [49], and we also include in the performance comparisons routines taken from NVIDIA's cuSPARSE library [25]. On the CPU, Intel's MKL [50] is used in version 11.0, update 5.
In Figure 5, for different test matrices, we visualize the performance scaling of the previously described SpMM kernel with respect to the number of columns in the dense matrix (equivalent to the number of vectors in a blocked SpMV). The results reveal that the SpMM performance exceeds 100 GFLOP/s as soon as the number of columns in the dense matrix exceeds 30. The characteristic oscillation of the performance can be explained by the varying efficiency of the memory accesses; in particular, the cases where the column count equals a multiple of 16 provide very good performance. Using the SpMM kernel instead of a set of consecutive SpMVs (which typically achieve less than 25 GFLOP/s on this architecture [44]; see Figure 1, right) results in speedup factors of around five, see Table 2. A similar performance improvement (up to 6.1×) is observed on CPUs when replacing consecutive MKL SpMV kernels by the MKL SpMM routine, see Table 3.
While these results are obtained assuming the performance-beneficial row-major storage of the tall-and-skinny dense matrix, many applications and algorithms use dense matrices stored in column-major format to benefit from highly optimized BLAS implementations (available for matrices in column-major format). For this reason, when comparing the performance of the SELL-P implementation against the cuSPARSE CSRSpMM [25], we include the preprocessing time needed to transpose the tall-and-skinny dense matrix (see Figure 6). Aside from the standard CSRSpMM assuming column-major storage, cuSPARSE also includes a highly tuned version assuming, like the MAGMA implementation, row-major storage [26]. Combining this with a preprocessing step transposing the matrix provides the same functionality at significantly higher GFLOP rates. While the standard SpMM achieves between 20 and 60 GFLOP/s for most matrices, the highly tuned row-major implementation often gets close to 100 GFLOP/s, sometimes even above (see the results labelled "cuSPARSE CSRSpMM v2" in Figure 6). Our SpMM based on the SELL-P format outperforms both cuSPARSE SpMM implementations. With significant speedup factors over the standard SpMM, the performance improvement compared to the cuSPARSE SpMM ranges between 13% and 41% (see the results for CANT and CRANK, respectively).
Figure 4: Visualization of the SELL-P memory layout and the SELL-P SpMM kernel, including the reduction step, using blocksize b = 4 (corresponding to SELL-4) and two threads assigned to every row (t = 2). Adding a z-dimension to the thread block allows processing multiple vectors simultaneously.
Table 2: Asymptotic DP performance [GFLOP/s] for a large number of vectors using consecutive SpMVs (cuSPARSE CSR, cuSPARSE HYB, MAGMA SELL-P SpMV) vs. the SpMM kernel on GPUs. The last column is the speedup of the SpMM against the respective best SpMV. See Table 1 for the matrix characteristics.
Figure 7: Performance of the GPU SpMM vs. the MKL SpMM on two eight-core Intel Xeon E5-2690s for selected matrices and numbers of vectors n. Both implementations assume the vectors to be multiplied by the sparse matrix to be stored in row-major data format. On both architectures, the row-major storage allows for significantly higher GFLOP rates. The CPU runs use the numactl --interleave=all policy, which is well known to improve performance. The performance obtained is consistent with benchmarks provided by Intel [6]. The results show a 3 to 5× acceleration from the CPU to the GPU implementation, which is expected from the compute and bandwidth capabilities of the two architectures.
Experiment Framework
The Piz Daint supercomputer at the Swiss National Supercomputing Centre in Lugano was listed in June 2014 as the sixth fastest supercomputer according to the TOP500 [3], while its energy efficiency ranked number five in the Green500. Power measurements are taken through the Cray power management database (PMDB), which enables the user to monitor power and energy usage for the host node and, separately, for the accelerator at a frequency of 10 Hz [52]. For the BLOPEX code running exclusively on the CPUs of the host nodes, we obtain the pure CPU power by subtracting the power draw of the (idle) GPUs.
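The paper does not show the post-processing of the samples, but the energy numbers follow directly from the 10 Hz power readings: integrate the sampled power over the run and, for the CPU-only runs, subtract the idle GPU draw. A minimal sketch of that bookkeeping (plain C++, our illustration):

```cuda
#include <vector>
#include <cstddef>

// Integrate 10 Hz power samples [W] into energy [J] and subtract the idle
// GPU draw for CPU-only runs, as described above. sample_dt is 0.1 s for the
// 10 Hz counters; node_power_w and gpu_power_w are assumed to be aligned.
double cpu_energy_joules(const std::vector<double> &node_power_w,
                         const std::vector<double> &gpu_power_w,
                         double sample_dt = 0.1)
{
    double energy = 0.0;
    for (std::size_t i = 0; i < node_power_w.size(); ++i)
        energy += (node_power_w[i] - gpu_power_w[i]) * sample_dt;
    return energy;
}
```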
Runtime and Energy Analysis of LOBPCG
In this section we quantify the runtime and energy efficiency of two LOBPCG implementations: the GPU-accelerated LOBPCG and the multithreaded CPU implementation from BLOPEX [14]. All computations use DP arithmetic. For the runtime and energy analysis, we used the BLOPEX code via the Hypre interface, assigning 8 threads to each node (one thread per core) for up to 128 nodes (1,024 cores) of the hardware platform described in Section 5. We note that we use the BLOPEX LOBPCG out of the box, without attempting any optimizations that are not included in the software library. The convergence rate of the GPU implementation matches the one from BLOPEX. In Figure 8 we visualize the convergence of 10 eigenvectors for the AUDI test matrix. As convergence properties are not the focus of this research, all further results are based on 100 iterations of either implementation. The LOBPCG implementation in BLOPEX is matrix-free, i.e., the user can provide their choice of SpMV/SpMM implementation. In these experiments we use the Hypre interface to BLOPEX, linked with the MKL library.
The number of operations executed in every iteration of LOBPCG can be approximated by
2 · nnz · n_v + 36 · n · n_v^2,        (1)
where nnz denotes the number of non-zeros of the sparse matrix, n the dimension, and n_v the number of eigenvectors (equivalent to the number of columns in the tall-and-skinny dense matrix). The first term of the sum reflects the SpMM operation generating the Krylov vectors, while the second contains the remaining operations, including the orthogonalizations. Due to the n_v^2 term, the runtime is expected to increase superlinearly with the number of vectors. This can be observed in Figure 9, where we visualize the time needed to complete 100 iterations on the AUDI problem using either the BLOPEX code via the Hypre interface on 4 nodes (32 threads), or the GPU implementation using either a sequence of SpMVs or the SpMM kernel. While the BLOPEX implementation also shows some variance for different numbers of vectors, the runtime pattern of the GPU LOBPCG reflects the efficiency of the orthogonalization routines, favoring cases where 16, 32, or 48 vectors are processed. This characteristic pattern is even amplified when replacing the consecutive SpMVs with the SpMM, as this kernel also promotes certain column counts of the tall-and-skinny dense matrix, as shown in Figure 10.
Figure 9: Runtime to complete 100 DP iterations on the AUDI problem using either BLOPEX via the Hypre interface on 4 nodes (32 threads), or the GPU implementation using either a sequence of SpMVs or the SpMM kernel.
Figure 10: Runtime needed to complete 100 iterations in DP using the LOBPCG GPU implementation based on the SpMM.
Figure 10 shows the runtime of the SpMM-based GPU implementation of LOBPCG to complete 100 iterations for different test matrices. Comparing the results for the AUDI problem, the GPU implementation of LOBPCG is 1.3× and 1.2× faster when computing 32 and 48 eigenvectors, respectively, using the SpMM instead of the SpMV. Note that although in this case the SpMM performance is about 5× the SpMV performance, the overall improvement of 30% and 20%, respectively, reflects that only 12.5% and 8.7% of the overall LOBPCG FLOPs are in SpMVs for the 32 and 48 eigenvector problems (see equation (1) and the matrix specifications in Table 1). Figure 11 shows the scaling of the Hypre implementation when computing a set of eigenvectors for the AUDI problem. Taking one node (8 threads) as the baseline configuration, the algorithm scales almost linearly up to 16 nodes. The associated energy need remains constant within the linear scaling range, but rises as soon as the speedup is smaller than the node increase. We further observe that the Hypre LOBPCG scales almost independently of the number of eigenvectors to compute. We define the computation of a set of 16 and of 32 eigenvectors as the two benchmarks for the further analysis. In a first step, we investigate how energy efficiency and runtime performance are related for the BLOPEX implementation.
For this, we evaluate these metrics when running on 1, 2, 4, 8, 16, 32, 64, and 128 nodes, and identify the configurations providing the best runtime performance and energy efficiency (see Table 4). Due to the non-linear scaling of the BLOPEX implementation, and the power draw increasing almost linearly with the hardware resources, optimizing for runtime results in a significant overhead in energy efficiency, and vice versa. While this is different for the hybrid implementation of LOBPCG using a single CPU+GPU node, the energy balance there has to also account for the power draw of the GPU (see Table 5). Each node of Piz Daint is equipped with a multicore Sandy Bridge CPU that provides a theoretical peak of 166.4 GFLOP/s at a power consumption of 115 W (1.44 GFLOP/W), whereas the K20X GPU consumes 225 W when providing 1,311 GFLOP/s (5.83 GFLOP/W) [52]. Hence, assuming full load, adding the accelerator theoretically increases the power need per node by a factor of around 1.67. However, the GPU-accelerated LOBPCG uses both the CPU host and the GPU, but not heavily. The power usage therefore scales according to the load. Hence, the energy improvement depends on the specific test case, but is in general smaller than the runtime improvement when comparing the BLOPEX implementation running on one node with the hybrid code. This is reflected in the first line of Table 6, where we list the runtime and energy requirements of the BLOPEX LOBPCG scaled to the MAGMA LOBPCG results (this is equivalent to the speedup and greenup of the MAGMA implementation over the BLOPEX implementation). When computing 16 eigenvectors (see the speedup shown on the top left of Table 6), increasing the computing resources enables the BLOPEX implementation to outperform the MAGMA LOBPCG for some test cases; however, this comes at the cost of a significantly higher energy usage (see the greenup of MAGMA over BLOPEX on the top right of Table 6). With the greenup ranging between 4 and 180, the BLOPEX code is, for no configuration, even close to the energy efficiency of the GPU-accelerated solver. Targeting the computation of 32 eigenvectors, speedup and greenup grow even more (see the bottom of Table 6).
To complete the performance analysis, we concentrate on the single-node performance in Table 6. Based on the SpMM kernel analysis in Figure 7, the expectation for 16 vectors is that an optimized CPU code (blocking the SpMVs) should be only about 5× slower than the GPU code. The 10× acceleration indicates that the Hypre interface to BLOPEX is probably not blocking the SpMVs. Computing more vectors (e.g., 32, shown in Table 6, bottom) reduces the fraction of SpMV FLOPs in the total FLOPs (see equation (1)), thus making the SpMV implementation less critical for the overall performance. The fact that the speedup of the GPU vs. the CPU LOBPCG continues to grow, reaching about 30×, shows that there are other missed optimization opportunities in the CPU implementation. In particular, these are where the majority of the FLOPs are: the GEMMs in assembling the matrix representations for the local Rayleigh-Ritz minimizations and the orthogonalizations. These routines are highly optimized in our GPU implementation, especially the GEMMs, which, due to the specific sizes of the matrices involved (tall-and-skinny matrices A and B with a small square resulting matrix A^T B), required modifications to the standard GEMM algorithm for large matrices [53]. What worked very well is splitting the A^T B GEMM into smaller GEMMs based on tuning the MAGMA GEMM [53] for the particular small sizes, all grouped for execution into a single batched GEMM, followed by the addition of the local results [54].
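A sketch of this splitting idea is given below. It is our illustration rather than the MAGMA code: it uses cuBLAS's strided-batched GEMM (a more recent API than the batched MAGMA kernels referenced above) to compute the per-chunk products C_i = A_i^T B_i in one batched call, followed by a small reduction kernel that sums the k × k partial results.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sum the 'batch' partial k x k products into the first one: C_0 <- sum_i C_i.
__global__ void reduce_partials(int k, int batch, double *C)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < k * k) {
        double s = 0.0;
        for (int b = 0; b < batch; ++b) s += C[b * k * k + idx];
        C[idx] = s;
    }
}

// C = A^T B for tall-and-skinny, column-major A, B (n x k, k small, n huge).
// The n rows are split into 'batch' chunks of 'chunk' rows each (n assumed
// divisible by chunk); dWork holds batch * k * k doubles.
void tall_skinny_AtB(cublasHandle_t handle, int n, int k, int chunk,
                     const double *dA, const double *dB,
                     double *dWork /* batch*k*k */, double *dC /* k*k */)
{
    const double one = 1.0, zero = 0.0;
    int batch = n / chunk;

    // Batched C_i = A_i^T * B_i, where A_i and B_i are the chunk x k row
    // blocks starting at row i*chunk (pointer stride 'chunk' in the
    // column-major arrays, leading dimension n).
    cublasDgemmStridedBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                              k, k, chunk,
                              &one,
                              dA, n, (long long)chunk,
                              dB, n, (long long)chunk,
                              &zero,
                              dWork, k, (long long)k * k,
                              batch);

    // Accumulate the partial results and copy the k x k sum into dC.
    int threads = 256, blocks = (k * k + threads - 1) / threads;
    reduce_partials<<<blocks, threads>>>(k, batch, dWork);
    cudaMemcpy(dC, dWork, sizeof(double) * k * k, cudaMemcpyDeviceToDevice);
}
```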
As a bottom line, we observe that using more hardware resources may enable a scalable CPU-based algorithm to keep up with the performance of the hybrid implementation, but this increasingly fails to provide the desired energy efficiency.
Summary and Outlook
In this paper we have compared the performance and energy efficiency of a CPU and a hybrid CPU+GPU LOBPCG eigensolver on the Piz Daint supercomputer. For the GPU-accelerated LOBPCG, we provide a comprehensive description of the GPU kernel used to generate the Krylov subspace by computing the SpMM product. As this building block also serves as the backbone for other block-Krylov methods, we include a runtime analysis that reveals significant speedups over a set of consecutive SpMVs and over an equivalent routine provided in NVIDIA's cuSPARSE library. Integrating it into the GPU implementation of LOBPCG, we outperform the BLOPEX CPU implementation running on one node by more than an order of magnitude in both runtime performance and energy efficiency when computing a set of eigenvectors. Increasing the resources for the BLOPEX code improves runtime performance at the cost of a rising energy balance. This indicates that the resource efficiency of scientific applications can be improved by utilizing GPU accelerators, lauded for their high GFLOP/W ratios. Various insights regarding energy efficiency can be extracted from the results presented. For example, as a general rule, Krylov subspace methods achieve (on systems comparable to the one tested) about 0.033 GFLOP/W on CPUs vs. 0.2 GFLOP/W on GPUs. This makes GPUs about six times more energy efficient. Performance-wise, two Sandy Bridge CPUs achieve around 5 GFLOP/s vs. around 20 GFLOP/s for a K20 GPU. Further, employing techniques to reduce communication has a significant impact. In particular, blocking of size 16 in kernels like the SpMM brings the GPU's energy efficiency to about 1 GFLOP/W and the performance to about 100 GFLOP/s. Compared to the baseline SpMV-based method, this is a 30× greenup. The greenup for an entire application depends on the other operations as well, e.g., being around 7× for LOBPCG with 16 vectors (see Table 6), and converging to the asymptotic 30× as the number of vectors grows (and more if the performance of the CPU implementation does not scale with increasing the number of nodes used).
The results presented were shown to be state-of-the-art, and are therefore indicative of today's frontiers in energy efficiency and performance for sparse computations on supercomputers. Indeed, the performance and energy efficiency of other optimization techniques can be evaluated based on their expected FLOP/byte ratio, and extrapolated from the results for the basic building blocks presented (like the SpMV and SpMM, and their interactions with other BLAS kernels). Future work includes simplifying this evaluation process by setting up MAGMA's basic kernels as benchmarks, and using them in the development of performance and energy efficiency evaluation tools for sparse computing applications.
