23 research outputs found

    Quantum MASALA: Quantum MAterialS Ab initio eLectronic-structure pAckage

    We present QuantumMASALA, a compact package that implements different electronic structure methods in Python. Within just 8000 lines of pure Python code, we have implemented Density Functional Theory (DFT), Time-Dependent Density Functional Theory (TD-DFT), and the GW method. The program can run across multiple processor cores and on Graphics Processing Units (GPUs) with the help of easily accessible Python libraries. With QuantumESPRESSO and BerkeleyGW I/O interfaces implemented, it can also be used as a substitute for those codes in small-scale calculations, making it a perfect learning tool for ab initio methods. Through its modular and simple code design, the package aims to provide a framework for rapidly building and testing new methods for first-principles calculations. Comment: 42 pages, 5 figures

    Fully Self-Consistent Finite-Temperature GW in Gaussian Bloch Orbitals for Solids

    We present algorithmic and implementation details for the fully self-consistent finite-temperature GW method in Gaussian Bloch orbitals for solids. Our implementation is based on the finite-temperature Green's function formalism in which all equations are solved on the imaginary axis, without resorting to analytical continuation during the self-consistency. No quasiparticle approximation is employed and all matrix elements of the self-energy are explicitly evaluated. The method is tested by evaluating the band gaps of selected semiconductors and insulators. We show agreement with other, differently formulated finite-temperature scGW implementations when finite-size corrections and basis set errors are taken into account. By migrating computationally intensive calculations to GPUs, we obtain scalable results on large supercomputers with nearly optimal performance. Our work demonstrates the applicability of Gaussian orbital based scGW for ab initio correlated materials simulations and provides a sound starting point for embedding methods built on top of GW. Comment: 17 pages, 10 figures, 2 tables

    Spinor GW/Bethe-Salpeter calculations in BerkeleyGW: Implementation, symmetries, benchmarking, and performance

    Computing the GW quasiparticle band structure and Bethe-Salpeter equation (BSE) absorption spectra for materials with spin-orbit coupling has commonly been done by treating GW corrections and spin-orbit coupling (SOC) as separate perturbations to density-functional theory. However, accurate treatment of materials with strong spin-orbit coupling (such as many topological materials of recent interest, and thermoelectrics) often requires a nonperturbative approach using spinor wave functions in the Kohn-Sham equation and GW/BSE. Such calculations have only recently become available, in particular for the BSE. We have implemented this approach in the plane-wave pseudopotential GW/BSE code BerkeleyGW, which is highly parallelized and widely used in the electronic-structure community. We present reference results for quasiparticle band structures and optical absorption spectra of solids with different strengths of spin-orbit coupling, including Si, Ge, GaAs, GaSb, CdSe, Au, and Bi₂Se₃. The calculated quasiparticle band gaps of these systems are found to agree with experiment to within a few tens of meV. SOC splittings are found to be generally in better agreement with experiment once quasiparticle corrections to band energies are included. The absorption spectrum of GaAs is not significantly impacted by the inclusion of spin-orbit coupling due to its relatively small value (0.2 eV) in the Λ direction, while the absorption spectrum of GaSb calculated with the spinor GW/BSE captures the large spin-orbit splitting of peaks in the spectrum. For the prototypical topological insulator Bi₂Se₃, we find a drastic change in the low-energy band structure compared to that of DFT, with the spinorial treatment of the GW approximation correctly capturing the parabolic nature of the valence and conduction bands after including off-diagonal self-energy matrix elements.
    We present the detailed methodology, approach to spatial symmetries for spinors, comparison against other codes, and performance compared to spinless GW/BSE calculations and perturbative approaches to SOC. This work aims to spur further development of spinor GW/BSE methodology in excited-state research software and enables a more accurate and detailed exploration of electronic and optical properties of materials containing elements with large atomic numbers.

    Accelerating Dynamical Density Response Code on Summit and Its Application for Computing the Density Response Function of Vanadium Sesquioxide

    This thesis details the process of porting the Eguiluz group dynamical density response computational platform to the hybrid CPU+GPU environment at the Summit supercomputer at Oak Ridge National Laboratory (ORNL) Leadership Computing Center. The baseline CPU-only version is a Gordon Bell-winning platform within the formally exact time-dependent density functional theory (TD-DFT) framework using the linearized augmented plane wave (LAPW) basis set. The code is accelerated using a combination of the OpenACC programming model and GPU libraries, namely the Matrix Algebra for GPU and Multicore Architectures (MAGMA) library, and by exploiting the sparsity pattern of the matrices involved in the matrix-matrix multiplication. Benchmarks show a 12.3x speedup compared to the CPU-only version. This performance boost should accelerate discovery in materials and condensed matter physics through computational means. After the hybrid CPU+GPU code has been sufficiently optimized, it is used to study the dynamical density response function of vanadium sesquioxide, and the results are compared with spectroscopic data from non-resonant inelastic X-ray scattering (NIXS) experiments.

    Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

    The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g. due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable portable data acquisition and subsequent analysis for programs with offload directives. At present, these interfaces are already part of the latest OpenACC and OpenMP API specifications. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process, which can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause as well as the critical-path share for each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace, enhanced with information about wait states, their cause, and the critical path. In addition, a ranking, based on the amount of waiting time a program region caused on the critical path, highlights program regions that are relevant for program optimization.
    The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to show the applicability of the analysis.

    Accelerating the computation of FLAPW methods on heterogeneous architectures

    Legacy codes in computational science and engineering have been very successful in providing essential functionality to researchers. However, they are not capable of exploiting the massive parallelism provided by emerging heterogeneous architectures. The lack of portable performance and scalability puts them at high risk, i.e., either they evolve or they are destined to be executed on older platforms and small clusters. One example of a legacy code which would heavily benefit from a modern redesign is FLEUR, a software package for electronic structure calculations. In previous work, the computational bottleneck of FLEUR was partially re-engineered to have a modular design that relies on standard building blocks, namely, BLAS and LAPACK libraries. In this paper, we demonstrate how the initial redesign enables the portability to heterogeneous architectures. More specifically, we study different approaches to port the code to architectures consisting of multi-core CPUs equipped with one or more coprocessors such as Nvidia GPUs and Intel Xeon Phis. Our final code attains over 70% of the architectures' peak performance and outperforms Nvidia's and Intel's libraries. On JURECA, the large tier-0 cluster where FLEUR is often executed, the code takes advantage of the full power of the computing nodes, attaining a 5× speedup over the sole use of the CPUs.

    Dataflow Programming Paradigms for Computational Chemistry Methods

    The transition to multicore and heterogeneous architectures has shaped the High Performance Computing (HPC) landscape over the past decades. With the increase in scale, complexity, and heterogeneity of modern HPC platforms, one of the grim challenges for traditional programming models is to sustain the expected performance at scale. By contrast, dataflow programming models have been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. This work introduces dataflow programming models for computational chemistry methods, and compares different dataflow executions in terms of programmability, resource utilization, and scalability. This effort is driven by computational chemistry applications, considering that they comprise one of the driving forces of HPC. In particular, many-body methods, such as Coupled Cluster (CC) methods, which are the gold standard to compute energies in quantum chemistry, are of particular interest for the applied chemistry community. On that account, the latest development for CC methods is used as the primary vehicle for this research, but our effort is not limited to CC and can be applied across other application domains. Two programming paradigms for expressing CC methods in a dataflow form, in order to make them capable of utilizing task scheduling systems, are presented. Explicit dataflow, the programming model in which the dataflow is explicitly specified by the developer, is contrasted with implicit dataflow, in which a task scheduling runtime derives the dataflow. An abstract model is derived to explore the limits of the different dataflow programming paradigms.

    Hybrid CPU-GPU generation of the Hamiltonian and overlap matrices in FLAPW methods

    In this paper we focus on the integration of high-performance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software package for electronic structure calculations developed at the Forschungszentrum Jülich over the course of two decades. The presented work follows up on a previous effort to modernize legacy code by re-engineering and rewriting it in terms of highly optimized libraries. We illustrate how this initial effort to get efficient and portable shared-memory code enables fast porting of the code to emerging heterogeneous architectures. More specifically, we port the code to nodes equipped with multiple GPUs. We divide our study into two parts. First, we show considerable speedups attained by minor and relatively straightforward code changes to off-load parts of the computation to the GPUs. Then, we identify further possible improvements to achieve even higher performance and scalability. On a system consisting of 16 cores and 2 GPUs, we observe speedups of up to 5x with respect to our optimized shared-memory code, which in turn means a speedup of between 7.5x and 12.5x with respect to the original FLEUR code.