Received <day> <Month>, 2017; Revised <day> <Month>, 2018; Accepted <day> <Month>, 2018
DOI: xxx/xxxx
RESEARCH ARTICLE
Accelerating the computation of FLAPW methods on heterogeneous architectures
Davor Davidović*1 | Diego Fabregat-Traver2 | Markus Höhnerbach2 | Edoardo di Napoli2,3
1Ruđer Bošković Institute, 10000 Zagreb,
Croatia
2Aachen Institute for Advanced Study in
Computational Engineering Science,
RWTH Aachen, 52062 Aachen, Germany
3Jülich Supercomputing Center and JARA,
Jülich, Germany
Correspondence
*Davor Davidović, Bijenička cesta 54, 10000
Zagreb, Croatia. Email: davor.davidovic@irb.hr
Abstract
Legacy codes in computational science and engineering have been very successful in providing
essential functionality to researchers. However, they are not capable of exploiting the massive
parallelism provided by emerging heterogeneous architectures. The lack of portable performance
and scalability puts them at high risk: either they evolve or they are doomed to disappear.
One example of legacy code which would heavily beneﬁt from a modern design is FLEUR, a
software for electronic structure calculations. In previous work, the computational bottleneck
of FLEUR was partially reengineered to have a modular design that relies on standard build-
ing blocks, namely BLAS and LAPACK. In this paper, we demonstrate how the initial redesign
enables the portability to heterogeneous architectures. More speciﬁcally, we study diﬀerent
approaches to port the code to architectures consisting of multi-core CPUs equipped with one
or more coprocessors such as Nvidia GPUs and Intel Xeon Phis. Our ﬁnal code attains over 70%
of the architectures’ peak performance, and outperforms Nvidia’s and Intel’s libraries. Finally, on
JURECA, the supercomputer where FLEUR is often executed, the code takes advantage of the
full power of the computing nodes, attaining 5× speedup over the sole use of the CPUs.
KEYWORDS:
FLEUR; FLAPW; multiGPU; Phi; hybrid BLAS; portability; scalability
1 INTRODUCTION
Modern software is typically developed with modularity, portability and scalability in mind, while older codes often rely on a direct implementation
of mathematical formulas. The latter approach limits the reuse of the code on emerging computing architectures and constitutes a substantial
portability challenge: each of these formulas must be rewritten and optimized for each new architecture. One such example is FLEUR, a software
for electronic structure calculations developed during the last three decades at the Forschungszentrum Jülich (Germany) 1.
Today, in a context where massively-parallel heterogeneous architectures have become ubiquitous, legacy code must be re-engineered and
adapted to make an eﬃcient use of these architectures. A critical insight in writing long-lasting scientiﬁc code is to have a modular design where, at
the bottom layers, the computational bottlenecks are written in terms of kernels 2,3,4. Examples of such kernels are fast Fourier transforms, matrix
products, and eigensolvers, provided by a number of commercial as well as academic libraries, such as Intel MKL 5, MAGMA 6, cuBLAS 7, cuFFT, and
ELPA 8,9. Over time, standardized libraries implement highly-optimized routines for the most critical kernels on various computing architectures,
from which applications can beneﬁt automatically, without the need to change the code.
The most outstanding example is the Basic Linear Algebra Subprograms (BLAS) library. In reality, the BLAS is an interface that defines a number
of common building blocks (including matrix products and linear systems), which was born after the realization that standardization was critical to
increase productivity and portability. Today, the BLAS is the ﬁrst library ported and optimized for each new architecture, and is used as a building
block for many other higher-level libraries. Therefore, writing software on top of the BLAS is a guarantee for performance portability and scalability.
In order to demonstrate the benefits of modern engineering and portability, Di Napoli et al. undertook a major effort to rework one of the
computational bottlenecks of FLEUR 10: the construction of the so-called Hamiltonian and Overlap matrices. The concepts of abstraction and
encapsulation were lacking in the original code, and the diﬀerent modules were tightly coupled. Therefore, the authors rewrote the code from
scratch. At this point, they decided to analyze the mathematics behind the code and rewrite the formulas at a higher level, resulting in a series of
matrix operations. These matrix operations were then cast to kernels provided by the BLAS and LAPACK libraries. The new algorithm (HSDLA)
attains speedups ranging from 1.5× to 2.5× with respect to the corresponding FLEUR code on multi-core CPUs. More importantly, the new code
enables performance portability with limited programming effort, as demonstrated in our previous work 11. We analyzed the performance
portability of the HSDLA code on heterogeneous architectures consisting of multi-core shared-memory CPUs and one or more Graphics Processing
Units (GPUs). In particular, we showed that the HSDLA code can be ported to emerging architectures with minor or relatively straightforward code
changes by employing highly-optimized libraries such as NVIDIA’s CUBLASXT or Intel’s MKL. However, this approach attained only up to 69% of
the theoretical system’s peak performance, thus still signiﬁcantly underutilizing the available resources.
In this paper we close the gaps still open in the previous work to demonstrate that the initial eﬀort striving for modularity and portability
indeed facilitates quick performance portability to complex heterogeneous architectures comprising multi-core shared-memory CPUs and either
GPUs or Phi coprocessors. We first attempt an (almost) plug-and-play solution and quantify its benefits and limitations in making the best of these
architectures. Then, we study alternative approaches to leverage existing optimized BLAS kernels for each component to obtain efficient hybrid
routines. Finally, we identify scalability bottlenecks in sequential or poorly-scalable sections of the algorithm and propose solutions to limit their
impact on the overall scalability. As our experimental results obtained on a number of architectures illustrate, the resulting code makes an efficient
use of the computing resources and provides the expected boost to the application code.
Contributions
We make the following key contributions. First, we perform a portability study of the code, once it has been rewritten in terms of a standardized
interface, and point out the remaining issues for a successful port to heterogeneous architectures. We then analyze diﬀerent approaches to over-
come these limitations, implement them and quantify the improvements. Finally, we contribute a high-performance and scalable implementation
to compute the Hamiltonian and Overlap matrices, a computational bottleneck in FLAPW methods.
Organization of the paper
The remainder of the paper is organized as follows. Section 2 provides the background on Density Functional Theory and the math behind the
computation to generate the Hamiltonian and Overlap matrices. Section 3 gives an overview of the HSDLA algorithm for the generation of these
matrices, and Section 4 provides an overview of the related work. The improvements to the HSDLA algorithm, as well as our implementation on
heterogeneous architectures, are described in Section 5. Section 6 presents experimental results for a collection of test cases and hybrid
architectures. Finally, Section 7 draws conclusions and discusses future research directions.
2 COMPUTING THE HAMILTONIAN AND OVERLAP MATRICES IN FLEUR
FLEUR is a scientiﬁc computing code, well-known in the community of condensed matter physicists, for calculating ground-state and excited-state
properties of solids. It is one of the few Density Functional Theory (DFT) codes which compute the electronic structure of an atomic compound using the
Full-potential Linearized Augmented Plane Wave (FLAPW) discretization approach. In this section we give a brief introduction to the fundamental
mathematical aspects of DFT, and to its practical realization by the FLAPW method. The material presented does not require any special background
knowledge in quantum physics and allows the non-specialist to understand the origin of one of the most computationally intensive parts of the
FLEUR code, the construction of the Hamiltonian and Overlap matrices.
2.1 DFT and the FLAPW method
In the last two decades, Density Functional Theory (DFT) 12,13 has become the “standard model” in Materials Science. Within the DFT framework,
it is possible to simulate the physical properties of complex quantum mechanical systems made of a few dozen up to a few hundred atoms. The
core of the method relies on the simultaneous solution of a set of partial diﬀerential equations. These equations are determined by a Hamiltonian
operator Hˆ containing a so-called eﬀective potential V0[n], where n(r) is a function of the radial coordinate r known as the one-particle electron
density.
In turn, the solutions of the equations ψi(r) determine the one-particle electron density n(r) used in calculating the eﬀective potential V0.
Because of the dependence of the Hamiltonian on the set of ψi(r) through n(r), the governing equations are highly non-linear and cannot be
solved directly.
$$\hat{H}\psi_i(\mathbf{r}) = \left( -\frac{\hbar^2}{2m}\nabla^2 + V_0(\mathbf{r}) \right)\psi_i(\mathbf{r}) = \epsilon_i\,\psi_i(\mathbf{r}), \qquad \epsilon_1 \le \epsilon_2 \le \dots$$
$$n(\mathbf{r}) = \sum_i^{N} f_i\,|\psi_i(\mathbf{r})|^2 \qquad (1)$$
In practice, the equations above, also known as Kohn-Sham (KS) 14, are solved by self-consistent iteration; an initial guess for n0(r) is used to
calculate the eﬀective potential V0 which, in turn, is inserted in Eq. (1) whose solutions, ψi(r), are used to compute a new charge density n1(r).
Convergence is checked by comparing the new density to the starting one. When convergence is not reached, an opportune mixing of the two
densities is selected as a new guess, and the process is repeated.
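Schematically (an illustrative sketch, not FLEUR code; `effective_potential` and `solve_ks` are hypothetical placeholders for the expensive steps, and the linear-mixing parameter is an assumption), the self-consistent cycle looks as follows:

```python
import numpy as np

def scf_loop(n0, effective_potential, solve_ks, tol=1e-6, beta=0.3, max_iter=100):
    """Schematic self-consistent field iteration (illustrative only).

    `effective_potential` and `solve_ks` are placeholders for the expensive
    steps: evaluating V0[n] and solving the Kohn-Sham equations (1);
    `beta` is a simple linear-mixing parameter for the density update.
    """
    n = n0
    for _ in range(max_iter):
        V0 = effective_potential(n)                 # effective potential V0[n]
        psi, occ = solve_ks(V0)                     # eigenfunctions and occupations
        n_new = sum(f * np.abs(p) ** 2 for f, p in zip(occ, psi))
        if np.linalg.norm(n_new - n) < tol:         # convergence check
            return n_new
        n = (1 - beta) * n + beta * n_new           # mix old and new densities
    return n
```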
In principle, the theory only requires as input the quantum numbers and the positions of the atoms that are part of the investigated system.
In practice, there is a plethora of DFT methods, which depend on the discretization used to parameterize the KS equations. The discretization in
the Full-potential Linearized Augmented Plane Wave (FLAPW) method 15,16 is based on a plane wave expansion of ψk,ν(r), where the momentum
vector $\mathbf{k}$ and the band index $\nu$ replace the generic index $i$. The $\mathbf{k}$-point wave function $\psi_{\mathbf{k},\nu}(\mathbf{r}) = \sum_{|\mathbf{G}+\mathbf{k}| \le K_{\max}} c^{\mathbf{G}}_{\mathbf{k},\nu}\,\varphi_{\mathbf{G}}(\mathbf{k},\mathbf{r})$ is expanded in terms of a finite basis set $\varphi_{\mathbf{G}}(\mathbf{k},\mathbf{r})$ indexed by the vectors $\mathbf{G}$ lying in the lattice reciprocal to configuration space, up to a chosen cut-off value $K_{\max}$. In
FLAPW, the physical (conﬁguration) space of the simulation cell is divided into spherical regions, called Muﬃn-Tin (MT) spheres, centered around
atomic nuclei, and interstitial areas between the MT spheres. The basis set ϕG(k, r) takes a diﬀerent expression depending on the region
$$\varphi_{\mathbf{G}}(\mathbf{k},\mathbf{r}) \propto \begin{cases} e^{i(\mathbf{k}+\mathbf{G})\mathbf{r}} & \text{Interstitial} \\ \sum_{l,m}\left[A^{a,\mathbf{G}}_{l,m}(\mathbf{k})\,u^a_l(r) + B^{a,\mathbf{G}}_{l,m}(\mathbf{k})\,\dot{u}^a_l(r)\right] Y_{l,m}(\hat{\mathbf{r}}_a) & a\text{th Muffin Tin} \end{cases} \qquad (2)$$
where the coefficients $A^{a,\mathbf{G}}_{l,m}(\mathbf{k})$ and $B^{a,\mathbf{G}}_{l,m}(\mathbf{k})$ are determined by imposing continuity of $\varphi_{\mathbf{G}}(\mathbf{k},\mathbf{r})$ and its derivative at the boundary of the MTs. Due to this expansion, the KS equations naturally translate to a set of generalized eigenvalue problems $\sum_{\mathbf{G}'} \left[ H_{\mathbf{G},\mathbf{G}'}(\mathbf{k}) - \lambda_{\mathbf{k}\nu} S_{\mathbf{G},\mathbf{G}'}(\mathbf{k}) \right] c^{\mathbf{G}'}_{\mathbf{k},\nu} = 0$ for the coefficients of the expansion $c^{\mathbf{G}'}_{\mathbf{k},\nu}$, where the Hamiltonian and Overlap matrices H and S are given by multiple integrals and sums
$$\{H(\mathbf{k}), S(\mathbf{k})\}_{\mathbf{G},\mathbf{G}'} = \sum_a \iint \varphi^{*}_{\mathbf{G}}(\mathbf{k},\mathbf{r})\,\{\hat{H}, I\}\,\varphi_{\mathbf{G}'}(\mathbf{k},\mathbf{r})\,d\mathbf{r}. \qquad (3)$$
Since the set of basis functions in Eq. (2) is implicitly labeled by the values the variable k takes in the Brillouin zone, there are multiple Hamiltonian
and Overlap matrices, one for each independent k-point.
2.2 Building H and S
Without loss of generality, we can abstract from the k-point index and recover an explicit formulation of the $H_{\mathbf{G},\mathbf{G}'}$ and $S_{\mathbf{G},\mathbf{G}'}$ matrices by
substituting Eq. (2) in Eq. (3) and carrying out the multiple integration. The computation is particularly complex within the MT regions where the
initialization of the Hamiltonian and Overlap matrices is by far the most computationally intensive task. By exploiting the properties of the basis
functions, the H and S matrices are directly expressed as functions of the set of A and B coeﬃcients.
$$(S)_{\mathbf{G}',\mathbf{G}} = \sum_{a=1}^{N_A} \sum_{l,m} \left(A^{a,\mathbf{G}'}_{l,m}\right)^{*} A^{a,\mathbf{G}}_{l,m} + \left(B^{a,\mathbf{G}'}_{l,m}\right)^{*} B^{a,\mathbf{G}}_{l,m}\,\|\dot{u}^a_l\|^2 \qquad (4)$$
$$(H)_{\mathbf{G}',\mathbf{G}} = \sum_{a=1}^{N_A} \sum_{L',L} \left(A^{a,\mathbf{G}'}_{L'}\right)^{*} T^{[AA]}_{L',L;a}\, A^{a,\mathbf{G}}_{L} + \left(A^{a,\mathbf{G}'}_{L'}\right)^{*} T^{[AB]}_{L',L;a}\, B^{a,\mathbf{G}}_{L} + \left(B^{a,\mathbf{G}'}_{L'}\right)^{*} T^{[BA]}_{L',L;a}\, A^{a,\mathbf{G}}_{L} + \left(B^{a,\mathbf{G}'}_{L'}\right)^{*} T^{[BB]}_{L',L;a}\, B^{a,\mathbf{G}}_{L}. \qquad (5)$$
Notice that in Eq. (5), for convenience, we have compacted the indexes l, m into L, and expressed the range of the index a over all the distinct
atom types $N_A$. The new matrices $T^{[\dots]}_{L',L;a} \in \mathbb{C}^{N_L \times N_L}$ are dense and their computation involves multiple integrals between the basis functions and the non-spherical part of the potential $V_0$ (see 10, Appendix A.2 for details). Due to the non-orthonormality of the basis function set (2), the matrix S
is non-diagonal, dense, and generically positive definite, with the exception of a few very small singular values. In contrast, H is always
non-definite, and both matrices are either complex Hermitian or real symmetric.
3 THE HSDLA ALGORITHM
In the original FLEUR code, the construction of matrices H and S was a direct implementation of mathematical formulae, with the focus on
functionality rather than scalability. The performance scalability was significantly improved in the HSDLA algorithm 10, in which the authors
reformulated Eqs. (4) and (5) in terms of the coefficients A and B. As a result, the entire construction of matrices H and S is now cast in terms of matrix
operations, as shown in Eqs. (6) and (7),
$$H = \sum_{a=1}^{N_A} \underbrace{A_a^H T^{[AA]} A_a}_{H_{AA}} + \underbrace{A_a^H T^{[AB]} B_a + B_a^H T^{[BA]} A_a + B_a^H T^{[BB]} B_a}_{H_{AB+BA+BB}} \qquad (6)$$
$$S = \sum_{a=1}^{N_A} A_a^H A_a + B_a^H U_a^H U_a B_a, \qquad (7)$$
where $A_a$ and $B_a \in \mathbb{C}^{N_L \times N_G}$, $T_a^{[\dots]} \in \mathbb{C}^{N_L \times N_L}$, H and S $\in \mathbb{C}^{N_G \times N_G}$ are Hermitian, and $U \in \mathbb{C}^{N_L \times N_L}$ is a diagonal matrix. Typical values for the
matrix sizes are $N_A \sim O(100)$, $N_G \sim O(1000)$ to $O(10000)$, and $N_L \sim O(100)$.
The HSDLA algorithm 10 takes advantage of the matrix formulation to cast the computation in terms of the highly optimized and portable BLAS
and LAPACK libraries. The algorithm is illustrated in Algorithm 1. The main ideas behind it are: 1) exploiting the symmetries in the operations to
reduce the computational cost, 2) casting the computation in terms of eﬃcient BLAS and LAPACK kernels, and 3) combining multiple operations
on small matrices together to increase performance and scalability. Although the algorithm exploits symmetry in matrices to reduce computational
costs, extra ﬂops are introduced compared to the original version. However, these are fast ﬂops, run in highly-optimized BLAS-3 kernels, leading
to signiﬁcantly lower execution time when compared to the original code.
Algorithm 1 : Computation of the H and S matrices in HSDLA.
 1: Backup Â = A, B̂ = B
 2: // H_{AB+BA+BB}
 3: for a := 1 → N_A do
 4:     Z_a = T_a^[BA] A_a                      (zgemm: 8 N_L^2 N_G flops)
 5:     Z_a = Z_a + (1/2) T_a^[BB] B_a          (zhemm: 8 N_L^2 N_G flops)
 6:     Stack Z_a to A
 7:     Stack B_a to B
 8: end for
 9: H = A^H B + B^H A                           (zher2k: 8 N_A N_L N_G^2 flops)
10: // S
11: Restore A = Â, B = B̂
12: S = A^H A                                   (zherk: 4 N_A N_L N_G^2 flops)
13: B = U B                                     (scaling: 2 N_A N_L N_G flops)
14: S = S + B^H B                               (zherk: 4 N_A N_L N_G^2 flops)
15: // H_AA
16: for a := 1 → N_A do
17:     try:
18:         C_a = Cholesky(T_a^[AA])            (zpotrf: 4/3 N_L^3 flops)
19:     success:
20:         Z_a = C_a^H A_a                     (ztrmm: 4 N_L^2 N_G flops)
21:         Stack Z_a to B_T
22:     failure:
23:         Z_a = T_a^[AA] A_a                  (zhemm: 8 N_L^2 N_G flops)
24:         Stack Z_a to B_B
25:         Stack A_a to A
26: end for
27: H = H + B_T^H B_T                           (zherk: 4 N_A^HPD N_L N_G^2 flops)
28: H = H + A^H B_B                             (zgemm: 8 N_A^¬HPD N_L N_G^2 flops)
For implementation purposes, the computation of H is split into two parts, $H_{AB+BA+BB}$ and $H_{AA}$ (Eq. 6), which can be cast completely in terms
of BLAS-3 operations. The key idea behind the calculation of $H_{AB+BA+BB}$ (lines 3–9) is to rewrite the expression as
$$\sum_{a=1}^{N_A} B_a^H (T^{[BA]} A_a) + (A_a^H T^{[AB]}) B_a + \frac{1}{2} B_a^H (T^{[BB]} B_a) + \frac{1}{2} (B_a^H T^{[BB]}) B_a,$$
where $T^{[BA]}$ is the Hermitian transpose of $T^{[AB]}$, and $T^{[BB]}$ is Hermitian. By doing so, the common expressions in parentheses can be grouped
together and substituted with
$$Z_a = T^{[BA]} A_a + \frac{1}{2} T^{[BB]} B_a,$$
for each atom a (lines 3–5). The matrices $Z_a$ and $B_a$ are then stacked together into larger matrices A and B (lines 6–7), allowing the remaining
computation
$$\sum_{a=1}^{N_A} B_a^H Z_a + Z_a^H B_a$$
to be performed as one single large matrix product (line 9), as depicted in Fig. 1.
FIGURE 1 Stacking of multiple small matrices $Z_a$ and $B_a$ into larger matrices A and B, respectively. As a result, multiple small matrix products are
combined into a larger, better-performing product.
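The effect of the stacking can be checked with a toy NumPy snippet (illustrative sizes, plain matrix products standing in for the zher2k kernel): the sum of the per-atom products equals one large product with the stacked matrices.

```python
import numpy as np

NA, NL, NG = 4, 8, 16                     # toy sizes, not realistic FLEUR sizes
rng = np.random.default_rng(0)
cplx = lambda m, n: rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
Z = [cplx(NL, NG) for _ in range(NA)]
B = [cplx(NL, NG) for _ in range(NA)]

# sum of NA small per-atom products ...
H_small = sum(B[a].conj().T @ Z[a] + Z[a].conj().T @ B[a] for a in range(NA))
# ... equals a single rank-2k-like update with the stacked matrices
Zbig, Bbig = np.vstack(Z), np.vstack(B)
H_big = Bbig.conj().T @ Zbig + Zbig.conj().T @ Bbig

assert np.allclose(H_small, H_big)
```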
To compute the second part of the matrix H (lines 16–28), the algorithm again takes advantage of the existing symmetries in $H_{AA}$. Since the
$T^{[AA]}$ matrix is Hermitian positive definite (HPD), HSDLA first attempts to compute the Cholesky factorization of $T^{[AA]}$ ($C_a C_a^H = T_a^{[AA]}$) (line 18),
which is then substituted in $H_{AA}$, yielding
$$H_{AA} = \sum_{a=1}^{N_A} A_a^H T^{[AA]} A_a \;\Rightarrow\; \sum_{a=1}^{N_A} (\underbrace{C_a^H A_a}_{Z_a})^H (C_a^H A_a).$$
However, while in theory $T^{[AA]}$ is HPD, in practice, due to numerical considerations, the factorization may fail. The algorithm thus divides the
computation of $H_{AA}$ in two parts. In case the factorization succeeds, the matrix $Z_a$ is computed (line 20) and stacked at the top part of B ($B_T$) (line
21). If, on the other hand, the factorization fails, the matrix $Z_a$ (line 23) is stacked at the bottom of matrix B ($B_B$), and the matrix $A_a$ is stacked in A.
At the completion of the loop, H is updated with the operation $B_T^H B_T + A^H B_B$ (lines 27–28), where the first term ($B_T^H B_T$) exploits the symmetry and computes only half of the output via the BLAS routine herk, while the second term ($A^H B_B$) is computed via the BLAS routine gemm, which
computes the full matrix.
Finally, the computation of S (lines 12–14) is more straightforward. First, the product $A^H A$ is computed as one single large product. Then B is
scaled by the norms stored in U, and a second large product $B^H B$ completes the computation of S.
4 RELATED WORK
Today, numerous optimized BLAS-like computational libraries exist. However, to the best of our knowledge, none of them implements
BLAS kernels capable of automatically redistributing the computational flops among different computational units, such as CPUs, GPUs and/or
Phis. The most popular multi-GPU commercial BLAS library, CUBLASXT, supports hybrid CPU-GPU computation, but so far only the gemm kernels are
implemented as hybrid. The bottom-right part of the input matrix is offloaded to the CPU, and the percentage (i.e. the number of rows/columns)
that will be offloaded to the CPU is set at compile time. An academic alternative to CUBLASXT, called BLASX 17, requires minor changes to the calls
to the BLAS routines and, similarly to the CUBLASXT library, takes care of data transfers between CPU and GPU transparently. Although the authors
reported a significant speedup and a decrease in communication volume compared to CUBLASXT, MAGMA and some other libraries and runtimes,
again, only the dgemm routine is implemented in a hybrid fashion, i.e. with the possibility to schedule the execution between CPUs and GPUs. The other
kernels of interest for this work have CPU-only and multi-GPU-only implementations.
Further research in the hybridization of codes has been done in 18, in which the authors developed a lightweight scheduler to offload tasks between
different computational units inside a single compute node. The scheduler is based on task parallelism and, as demonstrated in the case of the
Cholesky factorization, the factorization tasks are executed on the CPU while the updates are performed on the GPUs. The authors also demonstrated the portability
and the scalability of their approach across different accelerators, including multi-GPU and multi-Phi systems. However, there is no report on
the hybrid BLAS-3 kernels required by HSDLA.
At the time of writing this paper, only one study 19 reported on hybrid GPU-based algorithms for the generation of the Hamiltonian and Overlap
matrices in FLAPW methods. In that work, the authors aimed at systems of 1000+ atoms, too large to be executed on a single compute node; thus,
a distributed block-cyclic setup and distribution of the Hamiltonian and Overlap matrices were implemented. However, the authors based the
generation of H and S only on distributed versions of zgemm and did not exploit symmetries to decrease the computational cost. In addition, their
generation is a task-based heterogeneous implementation where particular tasks are scheduled between the CPU and the GPUs. In our approach to
heterogeneity, we aim at further exploiting data parallelism, not task parallelism.
5 HSDLA ON HETEROGENEOUS ARCHITECTURES
In this section, we concentrate on the behaviour of the algorithm when executed on hybrid architectures consisting of shared-memory CPUs and
one or more graphic processor units (GPUs) or Intel Xeon Phi coprocessors. As discussed in 10, once HSDLA is cast in terms of BLAS and LAPACK
routines, it attains high performance on multi-core architectures by simply linking to a multi-threaded implementation of these libraries. However,
as we demonstrate in this section, attaining such performance levels on hybrid architectures is not straightforward.
We ﬁrst reﬁne the original HSDLA algorithm, and signiﬁcantly reduce its computational cost as well as its memory footprint. Then, we identify
the limitations of a straightforward port to hybrid architectures and the shortcomings of the existing solutions, and study alternative approaches
to attain satisfactory performance levels on such architectures.
5.1 HSDLA reﬁned
In the effort to exploit symmetries in the problem, the designers of HSDLA overlooked the fact that the matrix H is Hermitian, and used the zgemm
routine (general matrix-matrix product) in Algorithm 1 (line 28) instead of a routine that takes advantage of the symmetry and computes only one
triangle of the output matrix. As a result, the algorithm performs $4 N_A^{\neg HPD} N_L N_G^2$ redundant flops computing the upper triangle of the matrix. With that in mind, and given that some vendor libraries such as NVIDIA's CUBLASXT and Intel's MKL provide a specialized routine to compute
only one half of a general matrix product, lines 16 through 28 can be greatly simplified, avoiding the Cholesky factorization altogether, by replacing
them with the computation in Algorithm 2. As a result, the total cost of computing $H_{AA}$ is significantly reduced and is dominated by the specialized
routine with a cost of $4 N_A N_L N_G^2$ flops.
Algorithm 2 : Refinement of the computation of H_AA.
15: // H_AA
16: for a := 1 → N_A do
17:     X_a = T_a^[AA] A_a                      (zhemm: 8 N_L^2 N_G flops)
18:     Stack X_a to X
19:     Stack A_a to A
20: end for
21: H = H + A^H X                               (zherkx: 4 N_A N_L N_G^2 flops)
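As a quick numerical sanity check (a NumPy sketch with toy sizes, not part of HSDLA), the zherkx-style update of Algorithm 2 yields the same contribution to H_AA as the Cholesky-based path of Algorithm 1 whenever T^[AA] is numerically HPD:

```python
import numpy as np

rng = np.random.default_rng(1)
NL, NG = 6, 10                                        # toy sizes
M = rng.standard_normal((NL, NL)) + 1j * rng.standard_normal((NL, NL))
T = M @ M.conj().T + NL * np.eye(NL)                  # Hermitian positive definite T^[AA]
A = rng.standard_normal((NL, NG)) + 1j * rng.standard_normal((NL, NG))

C = np.linalg.cholesky(T)                             # T = C C^H
Z = C.conj().T @ A                                    # Algorithm 1, line 20 (ztrmm)
H_cholesky = Z.conj().T @ Z                           # Algorithm 1, line 27 (zherk)
H_herkx = A.conj().T @ (T @ A)                        # Algorithm 2, lines 17 and 21 (zhemm + zherkx)

assert np.allclose(H_cholesky, H_herkx)
```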
By further examining Algorithm 1, one observes two sequential steps, a backup and a restore of matrices A and B, in lines 1 and 11, respectively.
Since A and B are overwritten (lines 3–9), one copy of each is stored in two additional storage spaces Â and B̂ (line 1) and reused to compute
S (lines 12–14) and the rest of H ($H_{AA}$) (lines 16–28). However, the need for the additional storage spaces Â and B̂ can be significantly reduced by
simply reordering the execution flow, as described in Algorithm 3. The computation of S, which only requires one temporary buffer, is moved to
the beginning of the algorithm (lines 2–4); the result of computing UB is stored in the temporary buffer X, and the original values of A and B are
preserved so far. The buffer X is then reused as auxiliary storage in the computation of $H_{AB+BA+BB}$ for stacking $Z_a$ (line 9); note that overwriting
B now in line 10 is harmless since its original contents are not needed anymore. Finally, in the computation of $H_{AA}$, the result of line 15 is also
stacked in X, and A is compressed into the A buffer itself. Since each of A, B, and X is of size $16 N_A N_L N_G$ bytes (in the order of gigabytes), cutting this
requirement in half has a major impact on the memory footprint of the entire algorithm.
Algorithm 3 : The refined HSDLA without Cholesky and with only one temporary buffer.
 1: // S
 2: S = A^H A                                   (zherk: 4 N_A N_L N_G^2 flops)
 3: X = U B                                     (scaling: 2 N_A N_L N_G flops)
 4: S = S + X^H X                               (zherk: 4 N_A N_L N_G^2 flops)
 5: // H_{AB+BA+BB}
 6: for a := 1 → N_A do
 7:     Z_a = T_a^[BA] A_a                      (zgemm: 8 N_L^2 N_G flops)
 8:     Z_a = Z_a + (1/2) T_a^[BB] B_a          (zhemm: 8 N_L^2 N_G flops)
 9:     Stack Z_a to X
10:     Stack B_a to B
11: end for
12: H = X^H B + B^H X                           (zher2k: 8 N_A N_L N_G^2 flops)
13: // H_AA
14: for a := 1 → N_A do
15:     X_a = T_a^[AA] A_a                      (zhemm: 8 N_L^2 N_G flops)
16:     Stack X_a to X
17:     Stack A_a to A
18: end for
19: H = H + A^H X                               (zherkx: 4 N_A N_L N_G^2 flops)
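For reference, the data flow of Algorithm 3 can be prototyped in a few lines of NumPy. The sketch below is illustrative only: per-atom matrices are passed as Python lists, full matrices are formed where the real code uses zherk/zher2k/zherkx to update a single triangle, and all names are ours, not FLEUR's.

```python
import numpy as np

def refined_hsdla(A, B, T_AA, T_AB, T_BB, U):
    """NumPy prototype of Algorithm 3 (illustrative, single node, no BLAS tuning).

    Per atom a: A[a], B[a] are (NL, NG) complex arrays; T_AA[a], T_AB[a],
    T_BB[a] are (NL, NL) Hermitian blocks; U[a] is the (NL,) vector of norms
    used to scale B (the diagonal of U in Eq. (7)).
    """
    NA = len(A)

    # S = A^H A + (U B)^H (U B)                         (lines 2-4, two zherk calls)
    Abig = np.vstack(A)
    X = np.vstack([U[a][:, None] * B[a] for a in range(NA)])
    S = Abig.conj().T @ Abig + X.conj().T @ X

    # H_{AB+BA+BB}: Z_a = T^[BA] A_a + 1/2 T^[BB] B_a, then one zher2k (lines 6-12)
    Z = np.vstack([T_AB[a].conj().T @ A[a] + 0.5 * T_BB[a] @ B[a] for a in range(NA)])
    Bbig = np.vstack(B)
    H = Z.conj().T @ Bbig + Bbig.conj().T @ Z

    # H_AA: X_a = T^[AA] A_a, then one zherkx-like update (lines 14-19)
    Xaa = np.vstack([T_AA[a] @ A[a] for a in range(NA)])
    H += Abig.conj().T @ Xaa
    return H, S
```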
5.2 Limitations of the straightforward approach
As we reported in our previous work 11, with very limited effort, HSDLA can be easily ported to various computing architectures such as multi-
core and multi-GPU systems. To achieve that goal, wrappers have to be implemented around calls to optimized architecture-specific libraries such
as cuBLAS, CUBLASXT or Intel MKL, and function calls replaced with those of the optimized libraries for a particular architecture. Although this
straightforward approach can quickly bring reasonable performance, it cannot utilize the underlying computational system at its full potential.
For example, if CUBLASXT is used on a multi-GPU system, most of the BLAS operations will be offloaded to the GPUs while the CPUs remain idle.
Thus, simply using the existing optimized libraries cannot yield the performance increase that is expected from combining the power of all CPUs and
GPUs of a system.
In order to overcome this issue, we developed heterogeneous BLAS kernels for the routines zher2k, zherk, and zgemm/zherkx that can efficiently
divide the workload between the CPUs and the accelerators (GPUs and Phis). In the rest of this section, we describe two different designs for
implementing the BLAS kernels required by the refined HSDLA on heterogeneous architectures. We denote these two approaches as:
• Static - The computation is divided between the CPUs and GPUs by pre-computing the number of rows/columns that will be offloaded to
the GPUs, and a highly-tuned multi-GPU library is then used to compute that part.
• Dynamic - The computation is scheduled dynamically, using the available on-device memory to determine block sizes and per-device task
queues.
5.3 Static
This approach targets hybrid architectures with one or more CUDA-based GPU devices, but can be easily extended to Intel Xeon Phi as well as
other accelerators, as long as optimized BLAS libraries for these architectures exist. The key idea of this approach is to re-use the existing, highly
optimized multi-GPU and multi-threaded libraries in building the hybrid code, with the goal of significantly reducing the programming effort. As such,
the code can be quickly tuned for new and emerging GPU architectures, thus improving performance portability across different platforms, while at
the same time exploiting concurrent execution on all available CPUs and GPUs of the system.

FIGURE 2 Rank-2k update of matrix C (zher2k). The blocks are statically split between the CPUs (dark grey) and GPUs (light grey). n_g is the size of
the principal submatrix offloaded to the GPUs.
For the sake of simplicity, only the zher2k routine will be described in detail; the same approach can be easily extended to the other two
routines. The zher2k routine performs the Hermitian rank-2k update of a given matrix C via one of the following operations:
$$C := \alpha A B^H + \alpha B A^H + \beta C,$$
$$C := \alpha A^H B + \alpha B^H A + \beta C,$$
where α and β are scalars, $C \in \mathbb{C}^{n \times n}$ is a Hermitian matrix, and A and B are general matrices of size n × k in the first case and k × n in the second
case. Hereafter, we consider only the first case; the same algorithm applies to the second, but with A and B of transposed dimensions.
Matrix C is split into four blocks and updated as described in Figure 2 . The workload is divided such that the principal submatrix C00 with
dimension ng × ng is updated on the GPU(s), while the blocks C10 and C11 (C01 is not required since C is Hermitian) are updated on the CPU. The
entire update of C can then be performed with only 4 function calls:
$$\begin{aligned}
C_{00} &= \alpha A_0 B_0^H + \alpha B_0 A_0^H + \beta C_{00} &&\longrightarrow \text{zher2k (GPU)}\\
C_{11} &= \alpha A_1 B_1^H + \alpha B_1 A_1^H + \beta C_{11} &&\longrightarrow \text{zher2k (CPU)}\\
C_{10} &= \alpha B_1 A_0^H + \beta C_{10} &&\longrightarrow \text{zgemm (CPU)}\\
C_{10} &= \alpha A_1 B_0^H + C_{10} &&\longrightarrow \text{zgemm (CPU)}.
\end{aligned}$$
Since these operations are independent from each other, they can be performed in parallel. For scheduling the workload between the CPU and
GPU devices, OpenMP with nested parallelism is used. Two OpenMP threads are created: one manages the computation on the CPU and invokes
multi-threaded MKL kernels, while the other manages the computation on the GPU and invokes CUBLASXT kernels.
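The structure of the static scheme can be pictured with the following Python sketch (pseudocode of the split, not the actual OpenMP/MKL/CUBLASXT implementation): `gpu_her2k` is a hypothetical stand-in for the offloaded multi-GPU rank-2k update, the two branches run concurrently much like the two nested OpenMP threads described above, and conj(α) is used on the second term as in the general BLAS definition (it equals α for the real scalars used here).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def static_hybrid_her2k(C, A, B, alpha, beta, ng, gpu_her2k):
    """Static CPU/GPU split of C := alpha*A*B^H + conj(alpha)*B*A^H + beta*C (sketch).

    A, B are n x k, C is n x n Hermitian. The leading ng x ng block C00 is
    offloaded via `gpu_her2k` (a placeholder for a multi-GPU zher2k such as
    the CUBLASXT one); C10 and C11 stay on the CPU. The real kernels update
    only one triangle of C; this sketch forms the blocks fully for clarity.
    """
    A0, A1 = A[:ng], A[ng:]
    B0, B1 = B[:ng], B[ng:]

    def cpu_part():
        # C11: zher2k on the CPU
        C[ng:, ng:] = (alpha * A1 @ B1.conj().T
                       + np.conj(alpha) * B1 @ A1.conj().T
                       + beta * C[ng:, ng:])
        # C10: two zgemm calls on the CPU
        C[ng:, :ng] = np.conj(alpha) * B1 @ A0.conj().T + beta * C[ng:, :ng]
        C[ng:, :ng] += alpha * A1 @ B0.conj().T

    def gpu_part():
        # C00: zher2k offloaded to the GPU(s)
        C[:ng, :ng] = gpu_her2k(C[:ng, :ng], A0, B0, alpha, beta)

    # the four updates are independent; the CPU and GPU parts overlap in time
    with ThreadPoolExecutor(max_workers=2) as pool:
        for f in [pool.submit(cpu_part), pool.submit(gpu_part)]:
            f.result()
    return C
```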
Although this approach is simple, the main challenge that remains is how to optimally set the size of the principal submatrix that will be offloaded
to the GPU. In the ideal case, the CPU and GPU execution times should completely overlap. Thus, knowing that the total number of flops for
zher2k is $8kn^2$, with n the number of rows and columns of C and k the number of columns of matrices A and B, the size $n_g$ of the principal submatrix
$C_{00}$ may be computed as:
$$\text{time}_{gpu} = \text{time}_{cpu} \;\Longrightarrow\; \frac{\text{flops}_{gpu}}{\text{Gflops}_{gpu}} = \frac{\text{flops}_{cpu}}{\text{Gflops}_{cpu}} \;\Longrightarrow\; \frac{8kn_g^2 + 4}{\text{Gflops}_{gpu}} = \frac{8k(n - n_g)^2 + 4 + 16kn_g(n - n_g)}{\text{Gflops}_{cpu}}$$
Note that updating block $C_{10}$ requires two additional calls to zgemm, which adds extra flops to the count. If $m = \text{Gflops}_{gpu} / \text{Gflops}_{cpu}$, then the size of the matrix offloaded to the GPU is:
$$n_g^2 = \frac{mn^2 + 4m}{m + 1}.$$
As we illustrate in Section 6, the optimal offload between the CPU and GPU devices can be computed analytically using this approach, or the
CPU-to-GPU performance ratio can be estimated on smaller problem sizes.
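A minimal helper (assuming the Gflops/s rates are supplied by the caller, either theoretical peaks or values measured on a small calibration run) evaluates the formula above; plugging in the RWTH-GPU figures from Section 6 (2.6 TFlops/s for the two GPUs, 345 GFlops/s for the CPUs) and the largest AuAg problem size assigns roughly 94% of the rows/columns of C to the GPUs.

```python
def principal_block_size(n, gflops_gpu, gflops_cpu):
    """Size n_g of the principal submatrix offloaded to the GPU(s), per the formula above.

    The rates are assumptions provided by the caller (theoretical peaks or
    values measured on a small calibration run); the '+4m' term is negligible
    for the problem sizes of interest.
    """
    m = gflops_gpu / gflops_cpu
    ng_squared = (m * n * n + 4.0 * m) / (m + 1.0)
    return int(round(ng_squared ** 0.5))

# e.g. principal_block_size(13379, 2600.0, 345.0) -> about 12571, i.e. ~94% of n
```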
5.4 Dynamic
While a static split of the matrices and a static assignment of blocks to devices lead to the lowest overheads and highest efficiency, they lack
the flexibility to adapt easily to new circumstances. An alternative is the use of dynamic scheduling that queries the devices to determine the
available on-device memory, calculates appropriate block sizes, and uses per-device queues to distribute the work. To maximize portability, the
implementation is split into a device-specific and a device-independent part. This allows the dynamic code to target both the Xeon Phi coprocessor
and CUDA GPUs. The device-independent part contains all the workload distribution, the scheduling, the BLAS operations, and the hybrid calculation logic.
As for the device-dependent code, the Xeon Phi implementation builds on the hStreams library and MKL, while the GPU implementation relies on
CUDA and cuBLAS.
The device-independent logic works as follows. First, each matrix is split into square blocks. The block size is chosen to split the matrices
evenly, ﬁt multiple blocks into GPU memory, and follow vendor recommendations (e.g. for the ﬁrst generation Xeon Phi accelerators the block size
should be divisible by 64 but not 256). Next, the scheduler determines the individual operations to be performed on each of the submatrices, and
schedules them to the devices in a round-robin fashion. Each block is packed into contiguous buﬀers and streamed to the devices; the results are
unpacked once the calculation and the reverse transfer complete. The packing of the blocks is necessary for the Xeon Phi, since it lacks support
for copies from 2D arrays, but it is avoided on GPUs. If the scheduler notices that a device is not yet ready to accept work because its queue is
full, it skips that device. Once the memory of the devices is filled up, and the CPU is not needed to drive the devices, the CPU starts computing
block operations using its own computational resources.
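The scheduling logic itself can be sketched in a few lines (an illustrative toy, not the actual implementation; `Device.ready` and `Device.queue` are hypothetical stand-ins for the hStreams/CUDA per-device queue state): blocks of the tiled output are handed out round-robin, a device whose queue is full is skipped, and whatever cannot be placed falls back to the CPU.

```python
def schedule_blocks(n_blocks, devices, max_queue=4):
    """Toy round-robin block scheduler (illustrative only).

    `devices` is a list of objects with a `queue` list and a `ready()` method,
    hypothetical stand-ins for per-device work queues. Block indices (i, j)
    of the lower triangle of the tiled Hermitian output are distributed
    round-robin; blocks that no device can take are returned for the CPU.
    """
    cpu_blocks = []
    d = 0
    for i in range(n_blocks):
        for j in range(i + 1):                   # lower triangle only
            placed = False
            for _ in range(len(devices)):
                dev = devices[d]
                d = (d + 1) % len(devices)
                if dev.ready() and len(dev.queue) < max_queue:
                    dev.queue.append((i, j))     # stream block (i, j) to this device
                    placed = True
                    break
            if not placed:
                cpu_blocks.append((i, j))        # CPU computes the block itself
    return cpu_blocks
```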
The remaining challenge is to improve utilization at the very start of a calculation, and at the very end. There, it is desirable to reduce block
sizes and sacriﬁce kernel eﬃciency for load balance. However, implementing such a scheme requires sophisticated models of transfer and compute
eﬃciency, and it is left to future work.
6 EXPERIMENTAL RESULTS
We now turn our attention to the experimental results. We compare the performance of the original multi-core (CPU-only) HSDLA algorithm 10 against
our refined CPU implementation from Section 5.1 and the two hybrid CPU-GPU implementations based on that refined algorithm. As mentioned
in Section 5, these include an implementation that offloads large matrix-matrix products to multiple GPUs using CUBLASXT, an implementation
using static work assignment (Section 5.3), and an implementation using dynamic work assignment (Section 5.4). Since the implementation using
dynamic work assignment supports offloading to Intel Xeon Phi accelerators, we also present results on that platform.
As test cases we use three inputs that describe distinct physical systems. We refer to them as NaCl, AuAg and TiO2, respectively. These
systems represent a heterogeneous sample (including both insulators and conductors) with different physical properties. In our tests, the code
generates the matrices H and S for one single k-point and different Kmax values. The actual problem sizes, that is, the values of NA, NL, and NG
for each case, are given in Tab. 1.
Test case   NA    NL    NG (Kmax = 2.5)   NG (Kmax = 3.0)   NG (Kmax = 3.5)   NG (Kmax = 4.0)
NaCl        512   49    2256              3893              6217              9273
AuAg        108   121   3275              5638              8970              13379
TiO2        384   81    7094              12293             19553             29144

TABLE 1 Problem sizes for NaCl, AuAg, and TiO2, for a variety of Kmax values. The value of NG varies with Kmax.
6.1 Experimental setup
We ran our experiments on a range of compute nodes. We used two compute nodes hosted by the IT Center of the RWTH Aachen University.
One of these nodes (RWTH-GPU) consists of two eight-core Sandy Bridge E5-2680 processors, running at a frequency of 2.7 GHz. The node is
equipped with 64 GBs of RAM and 2 Nvidia Tesla K20Xm GPUs. The peak performance of the 16 CPU cores in double precision is 345 GFlops/s,
while the peak performance of each GPU is 1.3 TFlops/s, for a combined peak of almost 3.0 TFlops/s. The second RWTH node (RWTH-Phi)
consists of two eight-core Sandy Bridge E5-2650 processors, running at a frequency of 2.0 GHz. The node is also equipped with 64 GBs of RAM
and 2 Intel Xeon Phi 5110p accelerators (Knight's Corner). The peak performance of the 16 CPU cores in double precision is 256 GFlops/s, while
the peak performance of each Xeon Phi is about 1 TFlops/s, for a combined peak of 2.3 TFlops/s.

[Figure 3 near here: speedup vs. total flop count of the test case (TFLOPS, log scale) for the Refined (CPU), cuBlasXt, Dynamic, and Static implementations.]
FIGURE 3 Speedup on RWTH-GPU for all implementations, relative to the original code.
We also ran experiments on the JURECA supercomputer at the Jülich Supercomputing Centre (JSC). More speciﬁcally, we used one JURECA
node consisting of two twelve-core Haswell E5-2680v3 processors, running at a nominal frequency of 2.5 GHz, and 2 NVIDIA K80 GPUs (each
of which consists of two GK210 devices). The node is equipped with 128 GBs of RAM. The combined peak performance for the 24 CPU cores
in double precision is 960 GFlops/s, while the theoretical peak performance in double precision of each GPU device is about 1.45 TFlops/s, for a
total of 6.7 TFlops/s.
In the following subsections, when presenting the efficiency achieved with a varying number of accelerators, we take as reference
value the peak performance corresponding to that same number of accelerators. For example, when presenting the results on JURECA with 1, 2
and 3 GPUs, the system peak performance is 2.4, 3.8, and 5.3 TFlops/s, respectively.
We compare several different improved HSDLA versions. We refer to the base HSDLA algorithm, as implemented in 10, as original. Our algorithmically refined multi-CPU HSDLA version is denoted as refined, while the same version for multi-GPU systems is named cuBlasXt. The accelerated
versions, denoted as static and dynamic, are the versions described in Sections 5.3 and 5.4, respectively. In all cases, the code was linked to Intel
MKL version 11.3.2 for the BLAS routines on the CPU and Xeon Phis; the GPU code was linked to NVIDIA CUBLASXT versions 7.5 and 8.0 on the
RWTH-GPU nodes and on the JURECA nodes, respectively.
6.2 RWTH-GPU
We provide two experiments for the RWTH-GPU system. Figure 3 presents the speedup over the original HSDLA algorithm attained by the
reﬁned, cuBlasXt, dynamic, and static implementations, for each of the diﬀerent test cases ordered by the amount of computation required by
each of them. The results show an average speedup of 1.13× for the refined algorithm over the original, when run on the CPU only. That clearly
demonstrates how small changes in the code, such as exploiting symmetries and reordering the execution flow (requiring limited programming
effort), can lead to higher performance, even without using the accelerators.
The figure also illustrates how the GPU code (cuBlasXt, static, and dynamic) achieves sizeable speedups no matter how large the computation
is. In most cases our custom hybrid implementations outperform cuBlasXt, thanks both to a better tuning of the block sizes and, especially, to a heavier
usage of the host CPU's computational power. The dip at the end for the dynamic code is due to the TiO2 case running out of memory when 2 GPUs are
used; the result using 1 GPU is presented instead.
Figure 4 showcases the total performance achieved by the diﬀerent implementations for the AuAg test case and a range of values for Kmax.
Again, the advantages of enabling the use of the entire compute nodes (CPU + GPUs) become apparent. There is a clear separation between CPU-
only and accelerated code, and also between GPU-only cuBlasXt and the hybrid static and dynamic codes. Both our hybrid codes outperform
GPU-only (cuBlasXt), for large enough problems, by more than 200 GFlops/s, which is very close to the peak performance of the CPU-only version
(Table 2). This clearly demonstrates that all CPUs are fully utilized in our hybrid approach. Furthermore, for large problems that allow for a
reasonable utilization of the multiple GPUs, we achieve about 2 TFlops/s compared to a peak performance of 3 TFlops/s, i.e. 66% efficiency. The
efficiency is even higher for larger problems such as TiO2 with Kmax = 4.0, where our code attains about 2.5 TFlops/s (83% of the peak).
The presented results include the timings of the CPU-only parts of the code (Algorithm 3, lines 6–11 and 14–18), during which the GPUs are
idle, thus decreasing the total performance and efficiency. The efficiency attained by our hybrid BLAS routines alone is even higher, as presented in
Table 2. The attained efficiency of the BLAS-only kernels is up to 79% for the AuAg case and slightly better for the TiO2 case, up to 81% of the peak
system performance.

[Figure 4 near here: GFlops/s vs. Kmax for Original (CPU), Refined (CPU), cuBlasXt, Dynamic, and Static with 1 and 2 GPUs.]
FIGURE 4 Performance on RWTH-GPU for all implementations, for the AuAg test case.
TABLE 2 GFlops/s and efficiency (in parentheses) for the AuAg test case and Kmax = 4.0. Results for the RWTH node using 2 GPUs.

Implementation   zherk           zher2k          zherkx/zgemmt
CPU              358 (11.93%)    304 (10.13%)    343 (11.43%)
cuBlasXt         2041 (68.30%)   2012 (67.06%)   2021 (67.07%)
Dynamic          2332 (77.73%)   2367 (78.90%)   2302 (76.73%)
Static           2337 (77.91%)   2315 (77.16%)   2315 (77.16%)
6.3 RWTH-Phi
For the RWTH-Phi system, we again present a speedup plot (Fig. 5 ) and a performance plot (Fig. 6 ). In this case, we present results for the reﬁned
algorithm and the dynamic version of the hybrid code, which is the only one with current support for the Xeon Phi (through the hStreams API).
While MKL can oﬄoad calculations automatically to the device, this is only true for the GEMM operation. Therefore, the automatic oﬄoad is not
applicable to our algorithm, which is dominated by the zherk, zher2k and zherkx routines.
Figure 5 shows how our hybrid code consistently achieves a speedup between 4 and 5 with respect to the original HSDLA algorithm. This is
consistent with both the relative power of the two accelerator cards and the CPU, as well as the expectation that the Phi accelerators do not get
as close to their theoretical peak performance as GPUs do.
In Figure 6, we illustrate the performance of the original, refined, and dynamic codes for the AuAg test case. For the larger case (Kmax = 4.0), the
dynamic version attains about 700 GFlops/s using one Phi, and 1200 GFlops/s using two Phis, which correspond to 55% and 52% of the system
peak performance, respectively. For the TiO2 test case we observed even higher performance, up to 950 GFlops/s (one Phi) and 1600 GFlops/s
(two Phis), corresponding to efficiencies of 75% and 69%. Similarly to the previous experiments on the RWTH-GPU system, the larger the problem
size, the higher the efficiency attained, which shows that even higher efficiency is to be expected when larger systems are simulated.
[Figure 5 near here: speedup vs. total flop count of the test case (TFLOPS, log scale) for the Refined (CPU) and Dynamic implementations.]
FIGURE 5 Speedup on RWTH-Phi for the reﬁned and the hybrid dynamic implementations, relative to original code.
[Figure 6 near here: GFlops/s vs. Kmax for Original (CPU), Refined (CPU), Dynamic (1 Phi), and Dynamic (2 Phis).]
FIGURE 6 Performance on RWTH-Phi for the original, reﬁned and dynamic implementations, for the AuAg test case.
6.4 JURECA-GPU
The JURECA-GPU nodes comprise two NVIDIA K80 GPUs, each of which consists of two devices. This allows us not only to not evaluate speedups
(Fig. 7 ) and performance (Fig. 8 ), but also scalability of the diﬀerent implementations (Fig. 9 ). Figure 7 illustrates the speedup of the various
GPU implementations relative to the original HSDLA. The speedup ranges most of the times between 3.5 and 5.5 (starting from a maximum of
about 7). The fact that the speedup of the dynamic code varies widely indicates that the scheduler struggles to fully utilize all four GPUs. With
regards to the comparison among the three GPU-hybrid implementations, there is again a clear separation between the static code and the cuBlasXt
implementation.
The performance plot in Fig. 8 illustrates the ﬂop rates achieved by each implementation. The clear separation between the static and the
cuBlasXt implementation is again very pronounced. However, as observed in Fig. 7 , the dynamic implementation struggles to utilize many GPUs.
It seems like further improvements would be necessary to properly adapt it to this use case. Overall, our static implementation achieves up to
3.2 TFlops/s for the AuAg test case, that is, a 15% higher performance than the cuBlasXt implementation. Similar results are achieved for the
TiO2 case, in which we observe a performance of up to 4.3 TFlops/s on 4 GPUs, resulting in a 16% higher performance compared to the cuBlasXt
implementation. As the system has a peak performance of 6.7 TFlops/s (4 GPUs included), the results for the AuAg and TiO2 cases correspond to
47% and 61% peak performance utilization, respectively. These much lower results compared to RWTH-GPU stem from the surprisingly
low performance of the CUBLASXT implementations (only 55% utilization) of the key BLAS-3 kernels, on which our accelerated HSDLA versions
depend.
The plot in Fig. 9 captures the scalability behaviour of our code for the AuAg and TiO2 test cases with Kmax = 3.5, for 1, 2, 3, and 4 GPUs. In this plot we
focus solely on our hybrid implementation of the offloaded BLAS routines, and consider only this part of the runtime. We extract three important
messages from the graph. First, the measurements for the static and dynamic code nearly coincide. This is not surprising, since both codes aim to
achieve the same goal: hybrid execution. Second, the measurements for all three implementations roughly share the same slope. This indicates
that all codes utilize the GPUs similarly well as additional devices are added. Third, there is a gap of almost 1 TFlop/s between the hybrid codes and the GPU-only code,
which is roughly the computational power of the CPU. This reflects that the hybrid codes not only reach a GPU utilization similar to the vendor-optimized CUBLASXT, but also fully utilize the CPU cores. This is of particular interest because modern CPUs
provide high computational power (e.g. 960 GFlops/s per JURECA node), even when compared to that of the coprocessors, and thus should not
be neglected nor underutilized. Table 3 shows that the performance gain of combining CPUs with GPUs in the BLAS-3 kernels of the HSDLA
algorithm can improve performance by up to 19% compared to the state-of-the-art cuBlasXt implementations using 4 GPUs.
TABLE 3 GFlops/s and speedup (in parentheses) for the static and dynamic implementations compared to cuBlasXt, for the TiO2 test case and Kmax = 3.5
on JURECA using 4 GPUs.

Implementation   zherk          zher2k         zherkx/zgemmt
cuBlasXt         3439           3258           3257
Dynamic          3862 (1.12×)   3590 (1.11×)   3266 (1.01×)
Static           4012 (1.19×)   3832 (1.19×)   3686 (1.13×)
[Figure 7 near here: speedup vs. total flop count of the test case (TFLOPS, log scale) for the Refined (CPU), cuBlasXt, Dynamic, and Static implementations.]
FIGURE 7 Speedup on JURECA for all implementations, relative to original code.
[Figure 8 near here: GFlops/s vs. Kmax for Original (CPU), Refined (CPU), and for cuBlasXt, Dynamic, and Static with 1–4 GPUs.]
FIGURE 8 Performance on JURECA for all implementations, for the AuAg test case.
[Figure 9 near here: two panels, a) AuAg and b) TiO2; GFlops/s vs. number of GPUs (1–4) for Peak, cuBlasXt, Dynamic, and Static.]
FIGURE 9 Scalability of the hybrid BLAS routines on JURECA going from one to many GPUs, for the AuAg (a) and TiO2 (b) test cases with Kmax = 3.5.
7 CONCLUSIONS AND FUTURE WORK
In this paper we demonstrated how legacy codes, such as the FLEUR code, can be modernized in order to exploit the massive parallelism of modern
computing architectures, and to improve code performance, portability and scalability. The starting point of our work was the HSDLA algorithm.
HSDLA encodes the computation of the Hamiltonian and Overlap matrices (H and S), one of the two computational bottlenecks of the FLEUR code,
in terms of multi-dimensional operations that map well onto BLAS operations. However, the straightforward porting of these BLAS operations to
heterogeneous architectures consisting of multi-core CPUs and one or more GPU or Phi accelerators does not attain the expected performance. The
reason behind the limited performance is that vendor-optimized BLAS libraries do not provide support for such hybrid architectures and exploit
either the CPUs or the accelerators, but not both.
We used the original BLAS-based HSDLA code as a starting point to explore both existing and custom implementations in order to take advantage
of the accelerators. First, we improved the original HSDLA algorithm by introducing changes in the execution flow; the result was a much lower
memory footprint and a reduced computational cost. Then, we showed that custom hybrid implementations can boost performance compared to
accelerator-only implementations, especially when the CPUs themselves provide large computational power.
We presented two approaches to implement hybrid BLAS: a dynamic approach that schedules chunks of the calculation on the devices using buffers
and queues, and a static approach that splits the matrices based on prescribed ratios. The dynamic approach is more generic, as demonstrated by
targeting both GPUs and Phis. The static code, on the other hand, needs to be tuned through the split ratios, but in exchange offers finer control over the
workload distribution. Both implementation strategies considerably boost performance thanks to their hybrid nature.
However, there is still room for improvement. By using mathematical equalities, the objects A and B can be compressed into smaller objects,
leading to fewer but more complex computations. More specifically, the compressed form of A and B leads to tensor contractions that do not naturally
map onto gemm-like operations and require the development of specialized routines. If efficient mappings and implementations of new kernels are
found, speedups of up to one order of magnitude may be achieved. Furthermore, in order to solve even larger problems (systems of up to 1000
atoms), the presented approaches exhibit certain limitations, such as the overall memory requirements and the limited size of the on-device memory.
To efficiently solve larger problems, new memory- and communication-avoiding approaches on distributed multi-CPU and multi-GPU systems are
required, including the redesign of the most time-consuming kernels.
ACKNOWLEDGEMENTS
This work was partially funded by the Ministry of Science and Education of the Republic of Croatia and the Deutsche Akademische Austauschdienst
(DAAD) from funds of the Bundesministerium für Bildung und Forschung (BMBF) through project “PPP Kroatien” ID 57216700. Financial support
from the Jülich Aachen Research Alliance-High Performance Computing and the Deutsche Forschungsgemeinschaft (DFG) through grant GSC 111
is also gratefully acknowledged. Furthermore, the authors thank the RWTH IT Center and the Jülich Supercomputing Centre for the computational
resources.
References
1. FLEUR. The Jülich FLAPW code family. http://www.flapw.de/pm/index.php [10 December 2017].
2. Deserno Markus, Holm Christian. How to mesh up Ewald sums I. A theoretical and numerical comparison of various particle mesh routines.
The Journal of Chemical Physics. 1998;109(18):7678-7693.
3. Johnson Jeff, Douze Matthijs, Jégou Hervé. Billion-scale similarity search with GPUs. 2017. http://arxiv.org/abs/1702.08734.
4. Fabregat-Traver Diego, Bientinesi Paolo. Computing Petaﬂops over Terabytes of Data: The Case of Genome-Wide Association Studies. ACM
Transactions on Mathematical Software (TOMS). 2014;40(4):27:1–27:22.
5. Intel Math Kernel Library. https://software.intel.com/en-us/mkl [10 December 2017] .
6. Tomov Stanimire, Dongarra Jack, Baboulin Marc. Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel
Computing. 2010;36(5-6):232–240.
7. NVIDIA cuBLAS. Dense Linear Algebra on GPUs. https://developer.nvidia.com/cublas [12 December 2017] .
8. Auckenthaler Thomas, Blum Volker, Bungartz Hans-Joachim, et al. Parallel solution of partial symmetric eigenvalue problems from electronic
structure calculations. Parallel Computing. 2011;37(12):783–794.
9. Marek Andreas, Blum Volker, Johanni Rainer, et al. The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and
computational science. Journal of Physics: Condensed Matter. 2014;26(21):213201.
10. Di Napoli Edoardo, Peise Elmar, Hrywniak Markus, Bientinesi Paolo. High-performance generation of the Hamiltonian and Overlap matrices
in FLAPW methods. Computer Physics Communications. 2017;211:61–72. DOI: 10.1016/j.cpc.2016.10.003.
11. Fabregat-Traver Diego, Davidovic Davor, Höhnerbach Markus, Napoli Edoardo Di. Hybrid CPU-GPU Generation of the Hamiltonian and Over-
lap Matrices in FLAPW Methods. In: Di Napoli Edoardo, Hermanns Marc-André, Iliev Hristo, Lintermann Andreas, Peyser Alexander, eds.
JHPCS, Lecture Notes in Computer Science, vol. 10164: 200–211. Springer; 2016.
12. Nogueira Fernando, Marques Miguel A L, Fiolhais Carlos. A primer in density functional theory. Lecture Notes in PhysicsBerlin: Springer; 2003.
13. Sholl David, Steckel Janice A. Density Functional Theory. A Practical IntroductionJohn Wiley & Sons; 2011.
14. Kohn W, Sham L J. Self-Consistent Equations Including Exchange and Correlation Eﬀects. Phys.Rev.. 1965;140:A1133–A1138.
15. Wimmer E, Krakauer H, Weinert M, Freeman A J. Full-Potential Self-Consistent Linearized-Augmented-Plane-Wave Method for Calculating
the Electronic-Structure of Molecules and Surfaces - O2 Molecule. Physical Review B. 1981;24(2):864–875.
16. Jansen H J F, Freeman A J. Total-Energy Full-Potential Linearized Augmented-Plane-Wave Method for Bulk Solids - Electronic and Structural-
Properties of Tungsten. Physical Review B. 1984;30(2):561–569.
17. Wang Linnan, Wu Wei, Xu Zenglin, Xiao Jianxiong, Yang Yi. BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU
Computing. In: ICS '16: 20:1–20:11. ACM; 2016; New York, NY, USA.
18. Haidar Azzam, Cao Chongxiao, Yarkhan Asim, et al. Uniﬁed Development for Mixed Multi-GPU and Multi-coprocessor Environments Using a
Lightweight Runtime Environment. In: IPDPS '14: 491–500. IEEE Computer Society; 2014; Washington, DC, USA.
19. Solcà Raﬀaele, Kozhevnikov Anton, Haidar Azzam, Tomov Stanimire, Dongarra Jack, Schulthess Thomas C.. Eﬃcient implementation of quan-
tum materials simulations on distributed CPU-GPU systems. In: Kern Jackie, Vetter Jeffrey S., eds. Proceedings of the International Conference
for High Performance Computing, Networking, Storage and Analysis: 10:1–10:12. ACM; 2015.
