With the growing reliance of modern supercomputers on accelerator-based
architectures such as GPUs, the development and optimization of electronic
structure methods to exploit these massively parallel resources have recently
become a priority. While significant strides have been made in the development of
GPU-accelerated, distributed memory algorithms for many-body (e.g.
coupled-cluster) and spectral single-body (e.g. planewave, real-space and
finite-element density functional theory [DFT]) methods, the vast majority of
GPU-accelerated Gaussian atomic orbital methods have focused on shared memory
systems, with only a handful of examples pursuing massive parallelism on
distributed memory GPU architectures. In this work, we present a set of
distributed memory algorithms for the evaluation of the Coulomb and
exact-exchange matrices for hybrid Kohn-Sham DFT with Gaussian basis sets via
direct density-fitted (DF-J-Engine) and seminumerical (sn-K) methods,
respectively. The absolute performance and strong scalability of the developed
methods are demonstrated on systems ranging from a few hundred to over one
thousand atoms using up to 128 NVIDIA A100 GPUs on the Perlmutter
supercomputer.

Comment: 45 pages, 9 figures