6 research outputs found

    Hybrid CPU-GPU generation of the Hamiltonian and overlap matrices in FLAPW methods

    In this paper we focus on the integration of high-performance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations developed at Forschungszentrum Jülich over the course of two decades. The presented work follows up on a previous effort to modernize legacy code by re-engineering and rewriting it in terms of highly optimized libraries. We illustrate how this initial effort to get efficient and portable shared-memory code enables fast porting of the code to emerging heterogeneous architectures. More specifically, we port the code to nodes equipped with multiple GPUs. We divide our study into two parts. First, we show considerable speedups attained by minor and relatively straightforward code changes to off-load parts of the computation to the GPUs. Then, we identify further possible improvements to achieve even higher performance and scalability. On a system consisting of 16 cores and 2 GPUs, we observe speedups of up to 5x with respect to our optimized shared-memory code, which in turn means between 7.5x and 12.5x speedup with respect to the original FLEUR code.

    Accelerating the computation of FLAPW methods on heterogeneous architectures

    Legacy codes in computational science and engineering have been very successful in providing essential functionality to researchers. However, they are not capable of exploiting the massive parallelism provided by emerging heterogeneous architectures. The lack of portable performance and scalability puts them at high risk, i.e., either they evolve or they are destined to be executed on older platforms and small clusters. One example of a legacy code which would heavily benefit from a modern redesign is FLEUR, a software for electronic structure calculations. In previous work, the computational bottleneck of FLEUR was partially re-engineered to have a modular design that relies on standard building blocks, namely, BLAS and LAPACK libraries. In this paper, we demonstrate how the initial redesign enables the portability to heterogeneous architectures. More specifically, we study different approaches to port the code to architectures consisting of multi-core CPUs equipped with one or more coprocessors such as Nvidia GPUs and Intel Xeon Phis. Our final code attains over 70% of the architectures' peak performance and outperforms Nvidia's and Intel's libraries. On JURECA, the large tier-0 cluster where FLEUR is often executed, the code takes advantage of the full power of the computing nodes, attaining 5× speedup over the sole use of the CPUs.

    The LAPW method with eigendecomposition based on the Hari--Zimmermann generalized hyperbolic SVD

    In this paper we propose an accurate, highly parallel algorithm for the generalized eigendecomposition of a matrix pair (H, S), given in a factored form (F∗JF, G∗G). Matrices H and S are generally complex and Hermitian, and S is positive definite. Matrices of this type emerge from the representation of the Hamiltonian of a quantum mechanical system in terms of an overcomplete set of basis functions. This expansion is part of a class of models within the broad field of Density Functional Theory, which is considered the gold standard in condensed matter physics. The overall algorithm consists of four phases, the second and the fourth being optional, where the last two phases compute the generalized hyperbolic SVD of a complex matrix pair (F, G), according to a given matrix J defining the hyperbolic scalar product. If J = I, then these two phases compute the GSVD in parallel very accurately and efficiently.
    Comment: The supplementary material is available at https://web.math.pmf.unizg.hr/mfbda/papers/sm-SISC.pdf due to its size. This revised manuscript is currently being considered for publication.
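
To make the problem statement in this abstract concrete, here is a minimal NumPy sketch of the generalized Hermitian eigenproblem H x = λ S x with H = F∗JF and S = G∗G. It is not the Hari-Zimmermann algorithm the abstract proposes: it uses the conventional route of forming H and S explicitly and reducing with a Cholesky factor of S. The function name, matrix sizes, and random test data are assumptions made for illustration only.

```python
import numpy as np

def generalized_eig_from_factors(F, J, G):
    """Solve H x = lambda S x with H = F* J F and S = G* G (baseline, not GHSVD).

    Hypothetical helper: forms H and S explicitly, then reduces the problem
    to a standard Hermitian one via the Cholesky factor of S.
    """
    H = F.conj().T @ J @ F              # Hermitian, generally indefinite
    S = G.conj().T @ G                  # Hermitian positive definite if G has full column rank
    L = np.linalg.cholesky(S)           # S = L L*
    Y = np.linalg.solve(L, H)           # Y = L^{-1} H
    C = np.linalg.solve(L, Y.conj().T)  # C = L^{-1} H L^{-*}
    C = (C + C.conj().T) / 2            # remove rounding-level asymmetry
    w, Z = np.linalg.eigh(C)            # standard Hermitian eigenproblem
    X = np.linalg.solve(L.conj().T, Z)  # back-transform: X = L^{-*} Z
    return w, X, H, S

# Illustrative random data (assumed sizes; J is a diagonal signature matrix).
rng = np.random.default_rng(0)
m, n = 12, 5
F = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
G = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
J = np.diag(rng.choice([-1.0, 1.0], size=m))

w, X, H, S = generalized_eig_from_factors(F, J, G)
residual = np.linalg.norm(H @ X - S @ X @ np.diag(w))
```

The eigenvectors returned this way are S-orthonormal (X∗SX = I). Forming S = G∗G explicitly squares the condition number of G, which is one reason to prefer methods that work on the factors directly.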

    GPAW: open Python package for electronic-structure calculations

    We review the GPAW open-source Python package for electronic structure calculations. GPAW is based on the projector-augmented wave method and can solve the self-consistent density functional theory (DFT) equations using three different wave-function representations, namely real-space grids, plane waves, and numerical atomic orbitals. The three representations are complementary and mutually independent and can be connected by transformations via the real-space grid. This multi-basis feature renders GPAW highly versatile and unique among similar codes. By virtue of its modular structure, the GPAW code constitutes an ideal platform for implementation of new features and methodologies. Moreover, it is well integrated with the Atomic Simulation Environment (ASE) providing a flexible and dynamic user interface. In addition to ground-state DFT calculations, GPAW supports many-body GW band structures, optical excitations from the Bethe-Salpeter Equation (BSE), variational calculations of excited states in molecules and solids via direct optimization, and real-time propagation of the Kohn-Sham equations within time-dependent DFT. A range of more advanced methods to describe magnetic excitations and non-collinear magnetism in solids are also now available. In addition, GPAW can calculate non-linear optical tensors of solids, charged crystal point defects, and much more. Recently, support of GPU acceleration has been achieved with minor modifications of the GPAW code thanks to the CuPy library. We end the review with an outlook describing some future plans for GPAW.

    Commemorative Issue in Honor of Professor Karlheinz Schwarz on the Occasion of His 80th Birthday

    A collection of 18 scientific papers written in honor of Professor Karlheinz Schwarz's 80th birthday. The main topics include spectroscopy, excited states, DFT developments, results analysis, solid states, and surfaces.

    Performance Modeling and Prediction for Dense Linear Algebra

    This dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions make it possible to quickly select the fastest algorithm configurations from available alternatives. We consider two scenarios that cover a wide range of computations: To predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform. For various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms. At merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.
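
The core idea in this abstract, predicting a blocked algorithm's runtime by summing estimates for its kernel calls instead of running it, can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the kernel sequence is that of a textbook right-looking blocked Cholesky factorization, and the per-kernel models (a fixed call overhead plus flops divided by a constant rate) use made-up numbers in place of the measured, size-dependent models the dissertation describes. All names here are hypothetical.

```python
def cholesky_kernel_calls(n, b):
    """Yield (kernel, flops) for a right-looking blocked Cholesky of size n, block size b."""
    for k in range(0, n, b):
        bs = min(b, n - k)              # current diagonal block size
        rest = n - k - bs               # rows below the diagonal block
        yield ("potf2", bs**3 / 3)      # unblocked factorization of the diagonal block
        if rest > 0:
            yield ("trsm", rest * bs**2)    # triangular solve for the panel
            yield ("syrk", rest**2 * bs)    # symmetric update of the trailing matrix

# ASSUMED stand-ins for measured models: (per-call overhead in s, flops per s).
KERNEL_MODELS = {
    "potf2": (2e-6, 2e9),
    "trsm": (5e-6, 4e10),
    "syrk": (5e-6, 5e10),
}

def predict_runtime(n, b, models=KERNEL_MODELS):
    """Predict total runtime by summing per-kernel estimates; nothing is executed."""
    total = 0.0
    for kernel, flops in cholesky_kernel_calls(n, b):
        overhead, rate = models[kernel]
        total += overhead + flops / rate
    return total

def best_block_size(n, candidates, models=KERNEL_MODELS):
    """Pick the candidate block size with the smallest predicted runtime."""
    return min(candidates, key=lambda b: predict_runtime(n, b, models))

candidates = [8, 16, 32, 64, 128, 256, 512]
chosen = best_block_size(2000, candidates)
```

Because every prediction is just a sum over a short list of kernel calls, comparing many block sizes costs far less than a single real factorization, which is the trade-off the abstract highlights.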