The solution of eigenproblems is often a key computational bottleneck that limits the tractable system size of numerical algorithms, among them electronic structure theory in chemistry and in condensed matter physics. Large eigenproblems can easily exceed the capacity of a single compute node, thus must be solved on distributed-memory parallel computers. We here present GPU-oriented optimizations of the ELPA two-stage tridiagonalization eigensolver (ELPA2). On top of its existing cuBLAS-based GPU offloading, we add a CUDA kernel to speed up the back-transformation of eigenvectors, which can be the computationally most expensive part of the two-stage tridiagonalization algorithm. We demonstrate the performance of this GPUaccelerated eigensolver by benchmark calculations on two hybrid CPU-GPU architectures, namely a compute cluster based on Intel Xeon Gold CPUs and NVIDIA Volta GPUs, and the Summit supercomputer based on IBM POWER9 CPUs and NVIDIA Volta GPUs. Consistent with previous bench- * Corresponding author. marks on CPU-only architectures, the GPU-accelerated two-stage solver exhibits a parallel performance superior to the one-stage solver.
Introduction
Finding the eigenvalues and eigenvectors of large dense matrices is a frequent problem in computational science and engineering. For example, in Kohn-Sham density-functional theory (KS-DFT) [1, 2] , the many-electron problem for the Born-Oppenheimer electronic ground state is reduced to a system of single particle equations that can be discretized into a generalized eigenproblem in the following matrix form HC = SCΣ.
(1)
Here the Hamiltonian matrix H and the overlap matrix S are real symmetric or complex Hermitian. The matrix C and the diagonal matrix Σ are the eigenvectors and eigenvalues, respectively, of this eigensystem. In the framework of KS-DFT, C and Σ (or the information they carry, at least) are needed for the construction of H. Therefore, Eq. 1 is a non-linear problem and must be solved self-consistently. It is possible, and has already been implemented in various codes [3, 4, 5, 6] , to restrict the computational cost of the construction of H to scale linearly with respect to the system size N for any semi-local and hybrid exchange-correlation functional. In contrast, the solution of a dense eigenproblem ("diagonalization") scales as O(N 3 ), quickly growing to become prohibitive as N increases to large values. Various strategies exist to facilitate or circumvent the solution of Eq. 1 [7, 8] . (1) The key idea in a conventional dense eigensolver [9, 10] is tridiagonalization, which brings the original matrix to a tridiagonal form by a series of Householder transformations. This algorithm suffers from the inefficiency of BLAS level-2 matrix-vector operations. New algorithms such as pentadiagonalization [11] and two-stage tridiagonalization [12, 13, 14, 15] have been developed, leading to enhanced performance over the conventional one-stage tridiagonalization approach. (2) Iterative eigensolvers [16, 17, 18, 19] are commonly employed by DFT codes, particularly those based on plane-wave (PW) basis functions and pseudopotentials. In that case, because of the large number of basis functions, i.e. the large dimension of the matrices in Eq. 1, needed in an accurate calculation, a direct solution of Eq. 1 is rather infeasible. Iterative eigensolvers are well suited to find a small fraction of low-lying eigenstates, commensurate with the needs of a PW-based code.
(3) When using spatially localized basis functions, such as linear combination of atomic orbitals (LCAO), the fraction of needed eigenpairs out of the full matrix dimension can be fairly large. In this scenario, iterative solvers no longer have an advantage over direct eigensolvers. With localized basis functions, locality in the physical system can be translated to sparsity in the H and S matrices. Methods exploiting this sparsity can be formulated as O(N ) ∼ O(N 2 ) [20, 21, 22, 23, 24, 25] by circumventing the explicit solution of Eq. 1. In particular, linear scaling algorithms in a density matrix formalism have been successfully applied to simulations of one million atoms [26, 27] . Despite the success in extreme-scale simulations, reduced scaling methods come with a computational prefactor that is much larger than that of the O(N 3 ) diagonalization method. Moreover, the applicability and optimal performance of reduced scaling methods are often limited to some certain problem types. As of today, dense eigensolvers, with their small computational prefactor and general applicability, remain the default method in most LCAO codes. Even in PW codes, the performance of a dense eigensolver is still crucial, because at some stage of an iterative solver there will be a dense eigenproblem, the size of which scales with the number of valence electrons in the system being simulated. Therefore, any improvement made to a dense eigensolver would benefit the entire electronic structure community, and the broader field of computational science in general.
The ubiquitous adoption of graphics processing units (GPU) in highperformance computing opens up new opportunities to accelerate dense eigensolvers. A GPU device consists of hundreds to thousands of parallel cores operating at a relatively low frequency. These cores are naturally suited for parallel computational tasks, such as vector and matrix operations found in dense linear algebra. On top of that, GPUs typically have a power efficiency superior to traditional CPUs, and therefore play an important role in supercomputing towards the exascale. According to the latest release of the TOP500 list at the time of writing (November 2019) [28], five of the top ten machines are GPU-accelerated, including Summit, the world's fastest computer based on IBM POWER9 CPUs and NVIDIA Tesla Volta V100 GPUs.
Eigensolvers operating on hybrid CPU-GPU machines have long been available in GPU-oriented linear algebra packages such as cuSOLVER [29] and MAGMA [30, 31] . These packages are designed and optimized for shared-memory host architectures. They can be very fast, but the problem size they can tackle is limited by the memory capacity of a single compute node. Fully exploiting the power of GPU-accelerated supercomputers would require a distributed-memory implementation. The MPI-parallel, distributed-memory ELPA library implements the conventional one-stage diagonalization method and the two-stage diagonalization proposed in Refs. [12, 13] , known as the "ELPA1" and "ELPA2" solvers, respectively. Both of them have been ported to GPUs by substituting BLAS calls with the corresponding cuBLAS functions [32] , making ELPA the only publicly available, distributed-memory, GPU-accelerated eigensolver to our knowledge. The CPU-only version of ELPA1 and ELPA2 and the GPU-accelerated version of ELPA1 and ELPA2 are hereafter referred to as CPU-ELPA1, CPU-ELPA2, GPU-ELPA1, and GPU-ELPA2, respectively.
In single-node tests published by Kůs et al. [32, 33] , the performance of GPU-ELPA1 is better than that of GPU-ELPA2. When using 2 IBM POWER8 CPUs (24 cores in total) and 4 NVIDIA Pascal P100 GPUs, GPU-ELPA1 delivers up to 11.9x performance boost compared to CPU-ELPA1 using 24 CPU cores. The efficiency of CPU-ELPA2 on various distributedmemory CPU platforms [15, 34, 35] indicates that GPU-ELPA2 may outperform GPU-ELPA1 on multiple nodes. Historically, the GPU porting of ELPA2 dates back to 2013, when Peter Messmer from NVIDIA programmed the first version of GPU-ELPA2. Then the code was refactored and merged into the mainline version of ELPA, and has been available in released versions of the ELPA eigensolver library since 2016. In this paper, we report our latest optimizations and developments that enable a performance improvement on distributed-memory, GPU-accelerated architectures. Specifically, several synchronizations and memory transfers between CPUs and GPUs have been optimized. Additionally, some kernels in one of the major computational steps, the tridiagonal-to-banded back-transformation of eigenvectors (Eq. 6d), have been rewritten. The rest of this paper is organized as follows. First, we briefly review the two-stage diagonalization algorithm, in particular the tridiagonal-to-banded back-transformation of eigenvectors and its CPU implementation in the ELPA library. Next, we outline the GPU acceleration strategies employed in GPU-ELPA2, and elaborate on our CUDA implementation of the tridiagonal-to-banded back-transformation of eigenvectors, which is essentially a GPU extension of the algorithm in Refs. [14, 36] . Finally, we demonstrate the performance and scalability of the GPU-ELPA2 solver by a set of benchmarks on two GPU-accelerated computers, namely the Talos cluster at Max Planck Computing and Data Facility in Garching, Germany, based on Intel Xeon Gold CPUs and NVIDIA Volta GPUs, and the Summit supercomputer at Oak Ridge National Laboratory in Tennessee, USA, based on IBM POWER9 CPUs and NVIDIA Volta GPUs.
Two-Stage Tridiagonalization in ELPA2

Overview of the Two-Stage Tridiagonalization
The textbook procedure [9, 10] to solve a dense generalized eigenproblem, like the one in Eq. 1, first computes the Cholesky factorization of S
then uses L to transform Eq. 1 to a standard eigenproblem HC =CΣ.
and the eigenvectorsC must be back-transformed in order to retrieve the eigenvectors of Eq. 1, i.e.
The direct solution of Eq. 3 is based on tridiagonalization, that is, the full matrixH is transformed to a tridiagonal matrix T . This is typically accomplished by individual Householder transformations, which take the shape of matrix-vector operations. Eigenvalues and eigenvectors of T can be easily (compared to the original problem) solved. Then, eigenvectors of T are back-transformed to obtain eigenvectors ofH. This algorithm is adopted by a variety of dense linear algebra packages, such as LAPACK [37] targeting sequential and shared-memory parallel architectures; cuSOLVER [29] and MAGMA [30, 31] targeting shared-memory architectures with GPU accelerators; ScaLAPACK [38] and Elemental [39] targeting distributed-memory parallel architectures. However, as mentioned above, the tridiagonalization step makes extensive use of memory-bound, BLAS level-2 matrix-vector operations, whose performance is limited on modern computer architectures.
The two-stage tridiagonalization algorithm proposed by Bischof, Sun, and Lang [12, 13] is an established alternative to the conventional one-stage method. As shown in Eq. 6 below and further illustrated in Fig. 1 , the tridiagonalization of the full matrixH is carried out in two transformations. The first transformation P reducesH to a banded matrix B, and the second transformation Q reduces B to a tridiagonal matrix T . The eigenvalues Σ and eigenvectors X of T are solved as done in the one-stage method. The back-transformation of eigenvectors is also carried out in two steps. X is first back-transformed to Y , the eigenvectors of B, then toC, the eigenvectors ofH. This two-stage tridiagonalization approach is implemented in several linear algebra software packages [14, 15, 30, 37, 40] , including a high-performance distributed-memory implementation in the ELPA library [14, 15] . The introduction of the banded matrix stage leads to faster computation compared to the one-stage method, for the transformation in Eq. 6a mostly involves highly efficient BLAS level-3 matrix-matrix operations, and the transformation in Eq. 6b only works on a sparse banded matrix B instead of a full matrix. The solution of Eq. 6c is accelerated in ELPA by extending the divide-andconquer symmetric tridiagonal eigensolver [41, 42, 43] such that unwanted eigenvectors are not computed [14, 36] . Regarding the back-transformation of eigenvectors, it is rather difficult to directly use BLAS routines for Eq. 6d. Manually optimized "kernels" written in architecture-specific instruction sets are employed for this particular step [14, 33, 36] . The ELPA2 solver is highly scalable on massively parallel, distributed-memory architectures. It avoids global MPI communications by using a 2-dimensional (2D) process grid and restricting the communication to take place within either the row direction or the column direction. Depending on the size of the matrix, ELPA2 scales to at least tens of thousands of CPU cores [15, 34, 35 ].
Tridiagonal-to-Banded Back-Transformation of Eigenvectors
We now summarize the tridiagonal-to-banded back-transformation algorithm and its CPU implementation in ELPA2. The reader is also referred to Refs. [12, 13, 14, 36] . ELPA2 relies on the "bulge chasing" algorithm [12, 13] to reduce a banded matrix B to a tridiagonal matrix T . Let N and b denote the dimension and semi-bandwidth of B, respectively. The banded-totridiagonal transformation is done in N − 2 stages, with (N − i)/b sweeps in the i th stage. The first sweep of the i th stage reduces the i th column of B to the target tridiagonal form, at the same time introducing fill-ins ("bulges") to the remainder of B. From the second sweep on, the first column of the fill-ins introduced in the previous sweep is eliminated, while introducing new fill-ins further down the matrix. The transformation applied in the j th sweep of the i th stage can be written as
where Q (i,j) is a Householder transformation matrix; I is the identity matrix; the scalar τ (i,j) and vector v (i,j) are computed following the standard Householder method [9, 44] . Most Householder vectors v (i,j) have a length equal to b, except that v (i,j) generated in the last sweep of each stage may be shorter. Matrix-vector operations are still needed in order to apply these Householder transformations, but the computational cost is much less than in the onestage tridiagonalization algorithm, as the vectors are much shorter. The left panel of Fig. 2 shows where vectors v (i,j) are generated with an example of N = 17, b = 4, and N − 2 = 15 stages of bulge chasing. In the actual code, the Householder matrices Q (i,j) are never explicitly constructed. Instead, τ (i,j) and v (i,j) are stored and used for the transformations in Eqs. 6b and 6d, where Q is the product of all Q (i,j) . Note that in the code these vectors are stored in a separate array. They cannot be stored in place, because the transformations generated in the full-to-banded transformation are stored there. The (i, j) pair inside a vector denotes its stage and sweep indexes. For symmetric and Hermitian eigenproblems, only the lower triangular part is considered. The right panel shows the eigenvectors X to be back-transformed. When using multiple processes, the local computation performed by each process is the application of a series of Householder transformations Q (i,j) to a part of the eigenvectors X. As a fictitious example, in the first step of the back-transformation, a process could be responsible for applying the colored Householder transformations sequentially, i.e. (8,1) → (7,1) → (6,1) → (5,1), to the colored part of the eigenvectors.
After the tridiagonal eigenproblem (Eq. 6c) is solved, we get the eigenvalues Σ and eigenvectors X of the tridiagonal matrix T . The main task of the tridiagonal-to-banded back-transformation is to apply all Q * (i,j) to X (Eq. 6d). It is obvious that the eigenvectors can be back-transformed independently, leading to a trivial parallelism over eigenvectors. Moreover, as the Householder vectors v (i,j) are shorter than the eigenvectors X, a single transformation Q * (i,j) only alters a few rows of X where v (i,j) has non-zero values. For instance, the Householder vector v (1, 4) at the left bottom corner of Fig. 2 only alters the last four rows of X. This leads to another level of parallelism to be exploited within an individual eigenvector. In ELPA2, the N p processes are organized in an N pr by N pc grid. The N ev eigenvectors are uniformly distributed across the N pc process columns. Within a process column, the eigenvectors are distributed across the N pr processes in a block manner. Each process applies a series of Householder transformations to its local part of the eigenvector matrix X. The colored part in Fig. 2 shows an example of the local computation by a process. This process applies four Householder transformations, Q * (8,1) , Q * (7,1) , Q * (6,1) , and Q * (5,1) , to its local part of the eigenvector matrix, referred to as X local hereafter. Apparently, Q * (4,1) , Q * (3,1) , and Q * (2,1) also modify the top part of X local . These three transformations are however applied by the upper adjacent process in the same process column. Therefore, the top three rows of X local must be exchanged with the upper adjacent process. Likewise, the bottom three rows of X local must be exchanged with the lower adjacent process in the same process column. The middle part of X local is not involved in any data exchange. It appears in Fig. 2 that almost the entire X local matrix needs to be exchanged between adjacent process rows, as there is only one row in the middle part. However, the height of the middle part actually increases with the matrix size N , whereas the height of the top and bottom parts can never exceed the semi-bandwidth b. Therefore, the amount of data that needs to be exchanged, i.e., data in the top and bottom parts, is limited by b, usually accounting for a small fraction of X local .
GPU Acceleration of ELPA2
In ELPA2, the two-stage tridiagonalization is implemented as five separate subroutines corresponding to the five steps in Eq. 6. The input, output, and internal working matrices of ELPA2 are distributed across CPU cores. ELPA2 relies on its own explicit MPI calls to handle distributed linear algebra operations. The efficient MPI communication pattern in the CPU version of ELPA2 is not altered in the GPU version, where GPU acceleration is mainly realized by substituting local BLAS calls with the corresponding cuBLAS functions, as is done for the full-to-banded transformation (Eq. 6a), the solution of the tridiagonal system (Eq. 6c), and the banded-to-full backtransformation (Eq. 6e). The tridiagonal-to-banded back-transformation (Eq. 6d) is GPU accelerated by a CUDA implementation of Eq. 7. The banded-to-tridiagonal transformation (Eq. 6b) has not been ported to GPUs, because of its low computational cost as shown in Fig. 7 .
GPU Offloading via cuBLAS
The API of cuBLAS is designed to be almost identical to that of the standard CPU BLAS, making cuBLAS-based GPU offloading straightforward. Here we only comment on two technical aspects. First, the communication and synchronization between CPUs and GPUs should be avoided as much as possible. Before calling a cuBLAS function, the input arrays must reside on the GPU memory, which often requires a copy of the data from CPU to GPU. In order for the CPU to access the result of cuBLAS, another copy from GPU to CPU is needed. In our GPU porting of ELPA2, CPU-GPU memory transfers are reduced to minimum by leaving data on the GPU as long as possible. Most often, GPU data is copied to the CPU in order to participate in an MPI communication call. In the version discussed in this paper, we have not yet explored GPU-aware MPI to directly communicate data on the GPU. In each of the GPU-accelerated subroutines, the allocation of GPU memory, which implies a CPU-GPU synchronization, is performed before the main work begins by precomputing the size of the allocation. The allocated GPU memory is reused wherever possible, and is deallocated after the main work finishes. Avoiding frequent allocation and deallocation of GPU memory in a loop or in a recursive routine greatly reduces the amount of implicit CPU-GPU synchronization.
Second, the CPU code of the GPU-accelerated version of ELPA2 operates in a pure MPI mode without threading, i.e., one MPI task for each CPU core. As most (if not all) mainstream computers today have more CPU cores than GPU devices, several MPI tasks would have to share one GPU device. When the size of an eigenproblem is relatively small, the amount of work assigned to each individual MPI task may not be able to fully saturate the GPU. In such cases, the NVIDIA Multi-Process Service (MPS) transparently allows work from different MPI tasks to be executed concurrently on the GPU, increasing the overall GPU utilization. It is thus recommended to use ELPA with MPS switched on.
CUDA Implementation of Parallel Householder Transformations
In this section, we present our CUDA implementation of the local computation in Fig. 2 , which is the key step in the tridiagonal-to-banded backtransformation (Eq. 6d). In a CUDA program, the large number of GPU cores are arranged into a grid of blocks, each of which in turn comprises a grid of threads. All the GPU cores work in a single instruction, multiple thread (SIMT) fashion, i.e., a single instruction is simultaneously executed by multiple threads with different data [45] . As explained in Sec. 2.2, in a given step of the tridiagonal-to-banded back-transformation, a process is responsible for applying N v Householder transformations to its local eigenvector matrix X local , which has N R rows and N C columns. A Householder transformation defined by τ and v is applied to an eigenvector x by
The N v Householder transformations are applied sequentially. The order in which they are applied in the tridiagonal-to-banded back-transformation is the reverse of the order in which they are generated in the banded-totridiagonal transformation. Consider the example in Fig. 2 again, the first transformation would be Q * (8,1) , then Q * (7,1) , Q * (6,1) , and finally Q * (5, 1) . It is obvious that from one transformation to the next, the rows of X local modified by the transformation are shifted upward by one row. This is also seen in Fig. 2 , where Q * (8,1) modifies the 4 th to 7 th rows of X local , Q * (7,1) modifies the 3 rd to 6 th rows of X local , and so forth. Each individual Q * (i,j) actually only modifies b rows of X local .
In order to map this local computation to the GPU cores, we choose a 1D block grid with a 1D thread grid within each block. The CUDA kernel is launched with N C blocks and b threads, so that each block works on an eigenvector, and each thread works on an element of this eigenvector. Specifically, the Householder transformations are applied to X local as follows:
1. Copy X local , as well as all v and τ that ever update X local , from CPU to GPU. Each GPU block is responsible for one column of X local , denoted as x. modified. Since the dot product v * x has already been computed and τ is a scalar, this update can be done in a straightforward element-wise fashion, i.e., each thread updates one element of x. 3. Among the b elements of x that are updated by the first Householder transformation, only the lowest element will not be affected by the next Householder transformation. Before applying the next transformation, the last thread in each block writes its element of x into X local . Then thread i (i >= 1) takes the element of x from thread (i − 1), while thread 0 loads a new element from X local . 4. Now all threads are ready to repeat steps 1, 2, and 3 for the next Householder transformation. 5. The CUDA kernel finishes when all transformations are applied. The updated X local is then copied back to CPU.
An example of this procedure is given in Fig. 3 , where four Householder transformations are applied to a local eigenvector matrix with N R = 7 rows and N C = 6 columns. The CUDA kernel is launched with six blocks and four threads, as indicated by the block id and thread id (both zero-based) in the figure. Throughout the execution of the kernel, block i is responsible for the (i + 1) th eigenvector. The four Householder transformations need to be applied in reverse order, i.e. from right to left. In the application of the rightmost transformation, threads 0, 1, 2, 3 in each block are responsible for the 4 th , 5 th , 6 th , and 7 th elements, respectively, of the eigenvector this block is responsible for. After computing Eq. 8, thread 3 immediately writes the 7 th element back to the X local array. Threads 1, 2, 3 take the updated 4 th , 5 th , and 6 th elements from threads 0, 1, 2, respectively. Then thread 0 loads the 3 rd element of the eigenvector from X local , and all the threads are ready for the next transformation. It is obvious that all the transformations can be applied in exactly the same way. After the kernel finishes, the final X local is copied back from GPU to CPU.
Performance and Scalability
The performance of the GPU-accelerated ELPA2 solver is benchmarked on the Talos cluster at Max Planck Computing and Data Facility and the Summit supercomputer at Oak Ridge National Laboratory. Each node of Talos has 2 Intel Xeon Gold 6148 CPUs (40 cores in total) and 2 NVIDIA Tesla Volta V100 (each has 32 GB high-bandwidth memory, double precision peak 7.0 TFLOP/s, PCIe 32 GB/s interconnect) GPUs. Benchmarks presented in Fig. 4 are performed on Talos. The ELPA code is compiled with Fig. 4 shows the total time to solution of CPU-ELPA1, CPU-ELPA2, GPU-ELPA1, and GPU-ELPA2 on the Talos cluster. All eigenvalues and eigenvectors of randomly generated real and complex matrices of size N = 40,000 to 100,000 are computed with up to 64 nodes. As already demonstrated in published benchmarks [15, 34, 35, 36] , CPU-ELPA2 greatly outperforms CPU-ELPA1 in terms of performance and scalability. The performance difference between GPU-ELPA1 and GPU-ELPA2 is rather small. For small node counts, GPU-ELPA1 is marginally faster than GPU-ELPA2, which agrees with the published single-node tests [33] . What has not been tested previously is the performance on multiple nodes. It turns out that GPU-ELPA2 becomes faster than GPU-ELPA1 as the node count increases. The crossover point depends on the dimension of the problem, e.g., 4 nodes in the N = 40,000 real case, and 16 nodes in the N = 100,000 complex case. When using 64 nodes, the speedup of GPU-ELPA2 over GPU-ELPA1, averaged over all matrix sizes in Fig. 4 , is 2.2x.
In Fig. 4 , the speedup enabled by the GPUs ranges from no speedup at all to 3.3x. Three general trends emerge: (1) For the same matrix size, the speedup becomes smaller as the node count increases. For small N , CPU-ELPA2 can even be faster than the GPU-accelerated solvers, thanks to the near-optimal strong scaling of CPU-ELPA2. (2) For the same number of nodes, the speedup becomes larger as the matrix size increases. (3) For the same node count and the same matrix size, the speedup is larger for complex matrices than for real matrices. The timing experiment in Fig. 4 is repeated on the Summit supercomputer. The results are shown in Fig. 5 . CPU-ELPA1 is omitted for simplicity. On Summit, the GPU-accelerated solvers GPU-ELPA1 and GPU-ELPA2 are always faster than the CPU-ELPA2 solver, with a maximum speedup of over 20x. The speedup of GPU-ELPA2 over CPU-ELPA2 remains 2.2x (N = 40,000 real) to 6.7x (N = 100,000 complex) even for 64 nodes. For the same matrix size and the same node count, the speedup on Summit appears greater than on Talos, which can be partially attributed to the difference in hardware. Summit has 6 GPUs per node, whereas Talos only has two GPUs per node. Data transfers on Summit take advantage of the NVLink technology [46] for high-bandwidth interconnect between CPUs and GPUs. Besides, a high-performance CPU kernel for the tridiagonal-to-banded backtransformation, written in AVX512 instructions [33] , is employed for the Intel Xeon Gold CPUs on Talos, rendering better performance of CPU-ELPA2 on Talos. Overall, we therefore observe that the absolute per-node timings of CPU-ELPA2 in Figs. 4 and 5 are already lower on Talos than on Summit. This difference in the CPU-only results, which are the baseline of the reported speedups, probably exaggerates the comparison of GPU-ELPA2 and CPU-ELPA2 on Summit somewhat, relative to Talos. Nevertheless, the three trends summarized from the tests on Talos are still valid on Summit, that is, the GPU speedup is larger for (1) larger matrix size, (2) fewer nodes, and (3) solving a complex problem instead of a real one. On both computers, the strong scaling of the GPU solvers is never as good as that of the CPU solvers. This can be explained by the workload assigned to the individual nodes. When using a large number of nodes or solving a small matrix, the workload on each node becomes so little that the many GPUs cannot be saturated, and the cost of CPU-GPU communications cannot be amortized. In contrast, when solving a large matrix or using a small number of nodes, a large amount of local work is offloaded to the GPUs, resulting in a significant speedup.
Note that in Figs. 4 and 5, all eigenvalues and eigenvectors are computed. In applications such as materials simulations based on KS-DFT, only a portion of the eigenspectrum, e.g., typically 20% to 60% for LCAO basis sets, is of interest. In this case, the advantage of ELPA2 over ELPA1 should be more significant, as the computational complexity of the back-transformation is proportional to the number of eigenvectors to be calculated. This is demonstrated in Fig. 6 , where the total timings (red circles) of the GPU-ELPA1 (solid) and GPU-ELPA2 (dashed) are decomposed into the forward tridiagonalization (blue squares), the solution of the tridiagonal problem (yellow diamonds), and the backward transformation (violet triangles). Two test cases, namely N = 40,000 real and N = 100,000 real, are shown as examples for a single node on Summit, i.e., the worst-case scenario for ELPA2. The two-stage tridiagonalization in ELPA2 is always faster than the one-stage tridiagonalization in ELPA1 by a factor of approximately three. The backtransformation accounts for a small fraction of the total time in ELPA1, but is the most time-consuming part in ELPA2 when all the eigenvectors are computed. When computing fewer eigenvectors, the burden of the two-stage back-transformation in ELPA2 can be greatly alleviated, making ELPA2 more favorable than ELPA1.
Given that the optimal performance may be achieved with GPU-ELPA1, GPU-ELPA2, or CPU-ELPA2, depending on the specifics of the problem and the architecture, we highlight the auto-tuning feature in the ELPA library [33] . When ELPA is called repeatedly, like e.g. in a self-consistent KS-DFT calculation, this auto-tuning feature automatically iterates over possible combinations of solvers and runtime parameters. The best solver for a given problem can be identified and utilized without any additional input from the user.
In Fig. 7 , we further decompose the timings of CPU-ELPA2 and GPU-ELPA2 into the five computational steps defined in Eq. 6. Again, the N = 40,000 real and N = 100,000 real tests are shown as examples. Steps that have been GPU-accelerated display an excellent speedup, namely 5.8x, 6.1x, 17.1x, and 11.3x (averaged over all data points in Fig. 5 ) for the fullto-banded transformation (Eq. 6a), the solution of the tridiagonal problem (Eq. 6c), the tridiagonal-to-banded back-transformation (Eq. 6d), and the banded-to-full back-transformation (Eq. 6e), respectively. The banded-totridiagonal transformation step (Eq. 6b) is not yet GPU-accelerated, as it never stands as a bottleneck. The tridiagonal-to-banded back-transformation, which uses the newly developed CUDA kernel described in Sec. 3.2, shows a strong scaling that is close to ideal. The scaling of the full-to-banded transformation and the banded-to-full back-transformation is not as good as the other steps. These two steps limit the overall parallel efficiency of GPU-ELPA2, therefore they would be the first target for further algorithmic and technical optimization. Figure 7 : Timing decomposition of CPU-ELPA2 (solid) and GPU-ELPA2 (dashed) for randomly generated, real symmetric matrices of size N = 40,000 and 100,000. Tests are performed on the Summit supercomputer. Node utilization is identical to that in Fig. 5 . All eigenvalues and eigenvectors are computed. Shown are the timings of the full-to-banded transformation (red circles, Eq. 6a), the banded-to-tridiagonal transformation (blue squares, Eq. 6b), the solution of the tridiagonal problem (yellow diamonds, Eq. 6c), the tridiagonal-to-banded back-transformation (violet up triangles, Eq. 6a), and the banded-to-full back-transformation (green down triangles, Eq. 6e).
Conclusions
In this paper, we report our GPU-oriented optimizations of the two-stage tridiagonalization eigensolver ELPA2 in the ELPA library, the only publicly available, distributed-memory, GPU-accelerated eigensolver library at the time of writing. The local BLAS operations in ELPA2 are offloaded to GPUs via the cuBLAS library. The tridiagonal-to-banded back-transformation of eigenvectors, which cannot be easily written as BLAS operations, is GPUaccelerated by a CUDA implementation. The overall performance of the GPU-accelerated ELPA2 solver is promising. It delivers a significant performance boost over the CPU-only version of ELPA2, as demonstrated by benchmarks on the Talos and Summit computers. Owing to the advanced two-stage tridiagonalization algorithm, the parallel scaling of the GPU-ELPA2 solver is superior to that of the GPU-ELPA1 solver. Based on an analysis of the individual computational steps in ELPA2, we identify the full-tobanded transformation and banded-to-full back-transformation steps as the next target for future optimization, as their strong scaling is not yet optimal. Nevertheless, the GPU-ELPA2 solver in its current form already unlocks a significant potential of GPU computations.
