23 research outputs found

Accelerating Dynamical Density Response Code on Summit and Its Application for Computing the Density Response Function of Vanadium Sesquioxide

Author: Phan Wileam Y
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/12/2021

This thesis details the process of porting the Eguiluz group dynamical density response computational platform to the hybrid CPU+GPU environment at the Summit supercomputer at Oak Ridge National Laboratory (ORNL) Leadership Computing Center. The baseline CPU-only version is a Gordon Bell-winning platform within the formally exact time-dependent density functional theory (TD-DFT) framework using the linearized augmented plane wave (LAPW) basis set. The code is accelerated using a combination of the OpenACC programming model and GPU libraries, namely the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library, as well as by exploiting the sparsity pattern of the matrices involved in the matrix-matrix multiplication. Benchmarks show a 12.3x speedup compared to the CPU-only version. This performance boost should accelerate discovery in materials and condensed matter physics through computational means. After the hybrid CPU+GPU code has been sufficiently optimized, it is used to study the dynamical density response function of vanadium sesquioxide, and the results are compared with spectroscopic data from non-resonant inelastic X-ray scattering (NIXS) experiments.

University of Tennessee, Knoxville: Trace

Hybrid CPU-GPU generation of the Hamiltonian and overlap matrices in FLAPW methods

Author: D Sholl
E Napoli Di
E Wimmer
F Nogueira
HJF Jansen
W Kohn
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

In this paper we focus on the integration of high-performance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations developed in the Forschungszentrum Jülich over the course of two decades. The presented work follows up on a previous effort to modernize legacy code by re-engineering and rewriting it in terms of highly optimized libraries. We illustrate how this initial effort to get efficient and portable shared-memory code enables fast porting of the code to emerging heterogeneous architectures. More specifically, we port the code to nodes equipped with multiple GPUs. We divide our study into two parts. First, we show considerable speedups attained by minor and relatively straightforward code changes to off-load parts of the computation to the GPUs. Then, we identify further possible improvements to achieve even higher performance and scalability. On a system consisting of 16 cores and 2 GPUs, we observe speedups of up to 5x with respect to our optimized shared-memory code, which in turn means between 7.5x and 12.5x speedup with respect to the original FLEUR code.
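
The two speedup figures quoted in the abstract above compose multiplicatively. As a rough sketch (the function name is hypothetical and not taken from the paper), dividing the total speedup by the GPU-stage speedup gives the gain that the earlier shared-memory rewrite must already have contributed, assuming the quoted 5x applies at both ends of the range:

```python
def implied_earlier_speedup(total_speedup: float, stage_speedup: float) -> float:
    """Speedups of successive optimization stages multiply, so the
    earlier stage's share is the total divided by the later stage."""
    return total_speedup / stage_speedup


# Figures quoted in the abstract: up to 5x over the optimized shared-memory
# code, and 7.5x to 12.5x over the original code.
gpu_stage = 5.0
low = implied_earlier_speedup(7.5, gpu_stage)
high = implied_earlier_speedup(12.5, gpu_stage)
print(f"Implied shared-memory rewrite speedup: {low}x to {high}x")
```

Under that assumption, the optimized shared-memory code would have been roughly 1.5x to 2.5x faster than the original before any GPU off-loading.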

arXiv.org e-Print Archive

Crossref

Full-text Institutional Repository of the Ruđer Bošković Institute

Juelich Shared Electronic Resources

Master of Science

Author: Lo Yu jung
Publication venue: University of Utah
Publication date: 01/01/2015

To address the need of understanding and optimizing the performance of complex applications and achieving sustained application performance across different architectures, we need performance models and tools that could quantify the theoretical performance and the resultant gap between theoretical and observed performance. This thesis proposes a benchmark-driven Roofline Model Toolkit to provide theoretical and achievable performance, and their resultant gap, for multicore, manycore, and accelerated architectures. Roofline micro benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these micro benchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism (TLP), instruction-level parallelism (ILP), and explicit Single Instruction, Multiple Data (SIMD) parallelism, measured in the context of the compilers and runtime environment on the target architecture. We also developed benchmarks to explore detailed memory subsystem behaviors and evaluate parallelization overhead. Beyond on-chip performance, we measure sustained Peripheral Component Interconnect Express (PCIe) throughput with four Graphics Processing Unit (GPU) memory management mechanisms. By combining results from the architecture characterization with the Roofline Model based solely on architectural specification, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline Model when run on a Blue Gene/Q architecture.
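
For readers unfamiliar with the Roofline Model named in the abstract above, the following is a minimal sketch of the standard roofline bound, not the toolkit described in the thesis. The function names and the machine figures (200 GFLOP/s peak, 25 GB/s bandwidth) are hypothetical values chosen only for illustration.

```python
def attainable_gflops(peak_gflops: float,
                      bandwidth_gb_per_s: float,
                      flops_per_byte: float) -> float:
    """Standard roofline bound: a kernel is limited either by peak compute
    or by memory bandwidth times its arithmetic intensity."""
    return min(peak_gflops, bandwidth_gb_per_s * flops_per_byte)


def ridge_point(peak_gflops: float, bandwidth_gb_per_s: float) -> float:
    """Arithmetic intensity (FLOPs per byte) at which a kernel stops being
    memory-bound and becomes compute-bound."""
    return peak_gflops / bandwidth_gb_per_s


# Hypothetical machine, for illustration only.
PEAK_GFLOPS = 200.0
BANDWIDTH_GB_PER_S = 25.0

for intensity in (0.5, 2.0, 8.0, 32.0):
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_PER_S, intensity)
    print(f"intensity {intensity:5.1f} FLOP/byte -> at most {bound:6.1f} GFLOP/s")
```

A benchmark-driven approach such as the one the abstract describes would replace the two hypothetical constants with measured values, so that the gap between the specification-based and the measured roofline becomes visible.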

The University of Utah: J. Willard Marriott Digital Library

Accelerating the computation of FLAPW methods on heterogeneous architectures

Author: Auckenthaler
Deserno
Di Napoli
Fabregat-Traver
Fabregat-Traver
Fiolhais
Jansen
Kohn
Marek
Sholl
Tomov
Wimmer
Publication venue: 'Wiley'
Publication date: 19/12/2017

Legacy codes in computational science and engineering have been very successful in providing essential functionality to researchers. However, they are not capable of exploiting the massive parallelism provided by emerging heterogeneous architectures. The lack of portable performance and scalability puts them at high risk, i.e., either they evolve or they are destined to be executed on older platforms and small clusters. One example of a legacy code which would heavily benefit from a modern redesign is FLEUR, a software for electronic structure calculations. In previous work, the computational bottleneck of FLEUR was partially re-engineered to have a modular design that relies on standard building blocks, namely, BLAS and LAPACK libraries. In this paper, we demonstrate how the initial redesign enables the portability to heterogeneous architectures. More specifically, we study different approaches to port the code to architectures consisting of multi-core CPUs equipped with one or more coprocessors such as Nvidia GPUs and Intel Xeon Phis. Our final code attains over 70% of the architectures' peak performance and outperforms Nvidia's and Intel's libraries. On JURECA, the large tier-0 cluster where FLEUR is often executed, the code takes advantage of the full power of the computing nodes, attaining 5× speedup over the sole use of the CPUs.

arXiv.org e-Print Archive

Crossref

Full-text Institutional Repository of the Ruđer Bošković Institute

Publikationsserver der RWTH Aachen University

Juelich Shared Electronic Resources

High performance BLAS formulation of the multipole-to-local operator in the Fast Multipole Method

Author: Coulaud Olivier
Fortin Pierre
Roman Jean
Publication venue: 'Elsevier BV'
Publication date: 01/01/2008

The multipole-to-local (M2L) operator is the most time-consuming part of the far field computation in the Fast Multipole Method for the Laplace equation. Its natural expression, though commonly used, does not respect a sharp error bound: we here first prove the correctness of a second expression. We then propose a matrix formulation implemented with BLAS (Basic Linear Algebra Subprograms) routines in order to speed up its computation for these two expressions. We also introduce special data storages in memory to gain greater computational efficiency. This BLAS scheme is finally compared, for uniform distributions, to other M2L improvements such as block FFT, rotations and plane wave expansions. When considering runtime, extra memory storage, numerical stability and common precisions for the Laplace equation, the BLAS version appears as the best one.

INRIA a CCSD electronic archive server

Fully Self-Consistent Finite-Temperature $GW$ in Gaussian Bloch Orbitals for Solids

Author: Gull Emanuel
Iskakov Sergei
Yeh Chia-Nan
Zgid Dominika
Publication venue: 'American Physical Society (APS)'
Publication date: 17/11/2022

We present algorithmic and implementation details for the fully self-consistent finite-temperature $GW$ method in Gaussian Bloch orbitals for solids. Our implementation is based on the finite-temperature Green's function formalism in which all equations are solved on the imaginary axis, without resorting to analytical continuation during the self-consistency. No quasiparticle approximation is employed and all matrix elements of the self-energy are explicitly evaluated. The method is tested by evaluating the band gaps of selected semiconductors and insulators. We show agreement with other, differently formulated finite-temperature sc$GW$ implementations when finite-size corrections and basis set errors are taken into account. By migrating computationally intensive calculations to GPUs, we obtain scalable results on large supercomputers with nearly optimal performance. Our work demonstrates the applicability of Gaussian orbital based sc$GW$ for ab initio correlated materials simulations and provides a sound starting point for embedding methods built on top of $GW$.

Comment: 17 pages, 10 figures, 2 tables

arXiv.org e-Print Archive