Hybrid CPU-GPU generation of the Hamiltonian and overlap matrices in FLAPW methods
In this paper we focus on the integration of high-performance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations developed at Forschungszentrum Jülich over the course of two decades. The presented work follows up on a previous effort to modernize legacy code by re-engineering and rewriting it in terms of highly optimized libraries. We illustrate how this initial effort to obtain efficient and portable shared-memory code enables fast porting of the code to emerging heterogeneous architectures. More specifically, we port the code to nodes equipped with multiple GPUs. We divide our study into two parts. First, we show considerable speedups attained by minor and relatively straightforward code changes that off-load parts of the computation to the GPUs. Then, we identify further possible improvements to achieve even higher performance and scalability. On a system consisting of 16 cores and 2 GPUs, we observe speedups of up to 5x with respect to our optimized shared-memory code, which in turn means between 7.5x and 12.5x speedup with respect to the original FLEUR code.
Accelerating the computation of FLAPW methods on heterogeneous architectures
Legacy codes in computational science and engineering have been very successful in providing essential functionality to researchers. However, they are not capable of exploiting the massive parallelism provided by emerging heterogeneous architectures. The lack of portable performance and scalability puts them at high risk: either they evolve or they are destined to be executed on older platforms and small clusters. One example of a legacy code which would heavily benefit from a modern redesign is FLEUR, a software for electronic structure calculations. In previous work, the computational bottleneck of FLEUR was partially re-engineered to have a modular design that relies on standard building blocks, namely the BLAS and LAPACK libraries. In this paper, we demonstrate how the initial redesign enables portability to heterogeneous architectures. More specifically, we study different approaches to port the code to architectures consisting of multi-core CPUs equipped with one or more coprocessors such as Nvidia GPUs and Intel Xeon Phis. Our final code attains over 70% of the architectures' peak performance and outperforms Nvidia's and Intel's libraries. On JURECA, the large tier-0 cluster where FLEUR is often executed, the code takes advantage of the full power of the computing nodes, attaining 5x speedup over the sole use of the CPUs.
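The off-loading pattern both FLEUR papers describe, splitting a large matrix product between host and accelerator and running the two parts concurrently, can be sketched in plain Python. This is an illustration, not the papers' code: the "GPU" worker below is a stand-in (a real port would call cuBLAS or a similar library), and both workers share one naive kernel so the example stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

def gemm(a, b):
    # naive matrix product; a real code would call an optimized BLAS here
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def hybrid_gemm(a, b, gpu_fraction=0.75):
    # split A's rows: the first chunk is "off-loaded" (stand-in for a GPU
    # kernel), the rest stays on the CPU, and both run concurrently,
    # mirroring the CPU-GPU overlap described in the abstracts
    cut = int(len(a) * gpu_fraction)
    with ThreadPoolExecutor(max_workers=2) as pool:
        gpu_part = pool.submit(gemm, a[:cut], b)
        cpu_part = pool.submit(gemm, a[cut:], b)
        return gpu_part.result() + cpu_part.result()
```

The `gpu_fraction` knob is where the tuning described in both abstracts lives: balancing the split so neither device idles while the other finishes.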
The LAPW method with eigendecomposition based on the Hari--Zimmermann generalized hyperbolic SVD
In this paper we propose an accurate, highly parallel algorithm for the generalized eigendecomposition of a matrix pair $(A, B)$, given in the factored form $(F^{\ast} J F, G^{\ast} G)$. Matrices $A$ and $B$ are generally complex and Hermitian, and $B$ is positive definite. Matrices of this type emerge from the representation of the Hamiltonian of a quantum mechanical system in terms of an overcomplete set of basis functions. This expansion is part of a class of models within the broad field of Density Functional Theory, which is considered the gold standard in condensed matter physics. The overall algorithm consists of four phases, the second and the fourth being optional, where the last two phases are the computation of the generalized hyperbolic SVD of a complex matrix pair $(F, G)$, according to a given matrix $J$ defining the hyperbolic scalar product. If $J = I$, then these two phases compute the GSVD in parallel very accurately and efficiently.

Comment: The supplementary material is available at https://web.math.pmf.unizg.hr/mfbda/papers/sm-SISC.pdf due to its size. This revised manuscript is currently being considered for publication.
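To make the connection in the abstract concrete, the following is a sketch in our own notation (a hedged reconstruction, not a quotation from the paper) of how a hyperbolic SVD of the factors solves the eigenproblem of the pair:

```latex
\[
  A x = \lambda B x, \qquad (A, B) = (F^{\ast} J F,\; G^{\ast} G).
\]
If a generalized hyperbolic SVD of $(F, G)$ with respect to $J$ yields a
nonsingular $X$ with
\[
  F X = U \Sigma_F \quad (U^{\ast} J U = J), \qquad
  G X = V \Sigma_G \quad (V^{\ast} V = I),
\]
and $\Sigma_F, \Sigma_G$ diagonal, then by congruence
\[
  X^{\ast} A X = \Sigma_F^{\ast} J \Sigma_F, \qquad
  X^{\ast} B X = \Sigma_G^{\ast} \Sigma_G,
\]
both diagonal, so the columns of $X$ are generalized eigenvectors with
$\lambda_i = j_i\, \sigma_{F,i}^2 / \sigma_{G,i}^2$, where $j_i = \pm 1$ are
the diagonal entries of $J$. For $J = I$ this reduces to the ordinary GSVD
and all $\lambda_i \ge 0$.
```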
GPAW: open Python package for electronic-structure calculations
We review the GPAW open-source Python package for electronic structure
calculations. GPAW is based on the projector-augmented wave method and can
solve the self-consistent density functional theory (DFT) equations using three
different wave-function representations, namely real-space grids, plane waves,
and numerical atomic orbitals. The three representations are complementary and
mutually independent and can be connected by transformations via the real-space
grid. This multi-basis feature renders GPAW highly versatile and unique among
similar codes. By virtue of its modular structure, the GPAW code constitutes an
ideal platform for implementation of new features and methodologies. Moreover,
it is well integrated with the Atomic Simulation Environment (ASE) providing a
flexible and dynamic user interface. In addition to ground-state DFT
calculations, GPAW supports many-body GW band structures, optical excitations
from the Bethe-Salpeter Equation (BSE), variational calculations of excited
states in molecules and solids via direct optimization, and real-time
propagation of the Kohn-Sham equations within time-dependent DFT. A range of
more advanced methods to describe magnetic excitations and non-collinear
magnetism in solids are also now available. In addition, GPAW can calculate
non-linear optical tensors of solids, charged crystal point defects, and much
more. Recently, support of GPU acceleration has been achieved with minor
modifications of the GPAW code thanks to the CuPy library. We end the review
with an outlook describing some future plans for GPAW.
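The CuPy-based GPU support mentioned above is possible because CuPy mirrors NumPy's array API, so an array backend can be chosen once and the rest of the code left unchanged. A minimal sketch of that pattern follows; this is not GPAW's actual code, and the import-fallback chain is our assumption for machines without a GPU.

```python
# pick an array backend once; everything downstream is backend-agnostic
try:
    import cupy as xp   # GPU arrays, if CuPy and a device are available
except ImportError:
    import numpy as xp  # CPU fallback exposing the same API

def normalize(v):
    # runs unchanged on NumPy and CuPy arrays
    return v / xp.linalg.norm(v)
```

Because both libraries expose the same functions under `xp`, "minor modifications" in the sense of the review largely means routing array creation and linear algebra through such a backend handle.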
Commemorative Issue in Honor of Professor Karlheinz Schwarz on the Occasion of His 80th Birthday
A collection of 18 scientific papers written in honor of Professor Karlheinz Schwarz's 80th birthday. The main topics include spectroscopy, excited states, DFT developments, results analysis, solid states, and surfaces.
Performance Modeling and Prediction for Dense Linear Algebra
This dissertation introduces measurement-based performance modeling and
prediction techniques for dense linear algebra algorithms. As a core principle,
these techniques avoid executions of such algorithms entirely, and instead
predict their performance through runtime estimates for the underlying compute
kernels. For a variety of operations, these predictions make it possible to quickly select
the fastest algorithm configurations from available alternatives. We consider
two scenarios that cover a wide range of computations:
To predict the performance of blocked algorithms, we design
algorithm-independent performance models for kernel operations that are
generated automatically once per platform. For various matrix operations,
instantaneous predictions based on such models both accurately identify the
fastest algorithm, and select a near-optimal block size.
For performance predictions of BLAS-based tensor contractions, we propose
cache-aware micro-benchmarks that take advantage of the highly regular
structure inherent to contraction algorithms. At merely a fraction of a
contraction's runtime, predictions based on such micro-benchmarks identify the
fastest combination of tensor traversal and compute kernel.
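The core idea of the dissertation, predicting a kernel's runtime from small measurements instead of executing the full algorithm, can be illustrated with a toy model. The cubic default below is our assumption (it matches a GEMM-like kernel); the dissertation's models are considerably more refined than this single-exponent extrapolation.

```python
import time

def measure(kernel, n, reps=3):
    # best-of-reps wall-clock time for the kernel at problem size n
    best = float("inf")
    for _ in range(reps):
        start = time.perf_counter()
        kernel(n)
        best = min(best, time.perf_counter() - start)
    return best

def predict(kernel, n_small, n_target, exponent=3):
    # measurement-based prediction: run only a small instance, then scale
    # by the asymptotic cost ratio -- the large instance never executes
    return measure(kernel, n_small) * (n_target / n_small) ** exponent

def pick_fastest(kernels, n_small, n_target, exponent=3):
    # model-based selection among alternative kernels/configurations,
    # at a fraction of the cost of timing each one at full size
    return min(kernels, key=lambda k: predict(k, n_small, n_target, exponent))
```

`pick_fastest` is the payoff described in the abstract: choosing among algorithm configurations from predictions alone, without ever running the candidates at the target size.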