An efficient MPI/OpenMP parallelization of the Hartree-Fock method for the second generation of Intel Xeon Phi processor
Modern OpenMP threading techniques are used to convert the MPI-only
Hartree-Fock code in the GAMESS program to a hybrid MPI/OpenMP algorithm. Two
separate implementations are considered, which differ in whether the key data
structures, the density and Fock matrices, are shared or replicated among
threads. Both implementations are benchmarked on a supercomputer of 3,000 Intel Xeon Phi
processors. With 64 cores per processor, scaling numbers are reported on up to
192,000 cores. The hybrid MPI/OpenMP implementation reduces the memory
footprint by a factor of approximately 200 compared to the legacy code. The
MPI/OpenMP code was shown to run up to six times faster than the original for a
range of molecular system sizes.
Comment: SC17 conference paper, 12 pages, 7 figures
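The following is a minimal sketch of the two threading strategies contrasted above, assuming a dense Fock matrix F, a density matrix D, and a flattened loop over integral work items; the function names and placeholder updates are illustrative and not taken from GAMESS, and the MPI rank-level distribution of work is omitted.

#include <cstddef>
#include <vector>

// Strategy 1: one Fock matrix shared by all threads; every update is guarded
// by an atomic, so memory stays O(n^2) but contention can grow with threads.
void fock_shared(std::vector<double>& F, const std::vector<double>& D,
                 int n, long nwork) {
    #pragma omp parallel for schedule(dynamic)
    for (long q = 0; q < nwork; ++q) {
        long idx = q % (static_cast<long>(n) * n);  // placeholder target element
        double contrib = 0.1 * D[idx];              // placeholder integral contribution
        #pragma omp atomic
        F[idx] += contrib;
    }
}

// Strategy 2: each thread accumulates into a private copy of F that is reduced
// at the end; synchronization is cheap, but memory grows with the thread count.
void fock_replicated(std::vector<double>& F, const std::vector<double>& D,
                     int n, long nwork) {
    #pragma omp parallel
    {
        std::vector<double> F_local(static_cast<std::size_t>(n) * n, 0.0);
        #pragma omp for schedule(dynamic)
        for (long q = 0; q < nwork; ++q) {
            long idx = q % (static_cast<long>(n) * n);
            F_local[idx] += 0.1 * D[idx];           // placeholder integral contribution
        }
        #pragma omp critical
        for (std::size_t i = 0; i < F.size(); ++i) F[i] += F_local[i];
    }
}

Compiled without OpenMP the pragmas are simply ignored; with it, the trade-off between the two routines mirrors the shared-versus-replicated choice benchmarked in the paper.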
Modern Approaches to Exact Diagonalization and Selected Configuration Interaction with the Adaptive Sampling CI Method.
Recent advances in selected configuration interaction methods have made them competitive with the most accurate techniques available, creating an increasingly powerful tool for solving quantum Hamiltonians. In this work, we build on recent advances from the adaptive sampling configuration interaction (ASCI) algorithm. We show that a useful paradigm for generating efficient selected CI/exact diagonalization algorithms is one driven by fast sorting algorithms, much in the same way that iterative diagonalization is based on the paradigm of matrix-vector multiplication. We present several new algorithms for all parts of performing a selected CI, including a new ASCI search, dynamic bit masking, fast orbital rotations, fast diagonal matrix elements, and residue arrays. The ASCI search algorithm can be used in several different modes, including an integral-driven search and a coefficient-driven search. The algorithms presented here are fast and scalable, and we find that, because they are built on fast sorting algorithms, they are more efficient than all other approaches we considered. After introducing these techniques, we present ASCI results for a wide range of systems and basis sets to demonstrate the types of simulations that can be treated practically at the full-CI level with modern methods and hardware, presenting double- and triple-ζ benchmark data for the G1 data set. The largest of these calculations is Si2H6, a simulation of 34 electrons in 152 orbitals. We also present preliminary results for fast deterministic perturbation theory simulations that use hash functions to maintain high efficiency when treating large basis sets.
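As a rough illustration of the sorting-driven paradigm described above, the sketch below keeps the k most important candidate determinants with a single partial-ordering pass; the Candidate struct and the importance weight are illustrative assumptions, not the actual ASCI data structures.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical candidate record: a determinant label plus a precomputed
// importance estimate (e.g. a perturbative weight accumulated over connections).
struct Candidate {
    std::uint64_t det;     // bit-string occupation label
    double        weight;  // importance estimate
};

// Keep the k most important candidates. nth_element runs in linear average
// time, so the selection step is dominated by fast (partial) sorting rather
// than by heap or map lookups.
std::vector<Candidate> select_top_k(std::vector<Candidate> cands, std::size_t k) {
    if (k < cands.size()) {
        std::nth_element(cands.begin(), cands.begin() + k, cands.end(),
                         [](const Candidate& a, const Candidate& b) {
                             return std::abs(a.weight) > std::abs(b.weight);
                         });
        cands.resize(k);
    }
    return cands;  // the new variational determinant space
}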
Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model
We present a method for parallel block-sparse matrix-matrix multiplication on
distributed memory clusters. By using a quadtree matrix representation, data
locality is exploited without prior information about the matrix sparsity
pattern. A distributed quadtree matrix representation is straightforward to
implement due to our recent development of the Chunks and Tasks programming
model [Parallel Comput. 40, 328 (2014)]. The quadtree representation combined
with the Chunks and Tasks model leads to favorable weak and strong scaling of
the communication cost with the number of processes, as shown both
theoretically and in numerical experiments.
Matrices are represented by sparse quadtrees of chunk objects. The leaves in
the hierarchy are block-sparse submatrices. Sparsity is dynamically detected by
the matrix library and may occur at any level in the hierarchy and/or within
the submatrix leaves. When graphics processing units (GPUs) are available,
both CPUs and GPUs are used for leaf-level multiplication work, thus making use
of the full computing capacity of each node.
The performance is evaluated for matrices with different sparsity structures,
including examples from electronic structure calculations. Compared to methods
that do not exploit data locality, our locality-aware approach reduces
communication significantly, achieving essentially constant communication per
node in weak scaling tests.
Comment: 35 pages, 14 figures
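A minimal serial sketch of multiplication over a sparse quadtree in the spirit of the representation described above, assuming matching tree depths and dense square leaves; the Chunks and Tasks distribution, the block-sparse leaf format, and the GPU leaf kernels are all omitted, and the names are illustrative.

#include <array>
#include <cstddef>
#include <memory>
#include <vector>

// Quadtree node: either empty (a null child pointer in its parent), an
// internal node with four children, or a leaf holding a dense submatrix.
struct QNode {
    std::array<std::unique_ptr<QNode>, 4> child;  // internal node (row-major 2x2)
    std::vector<double> leaf;                     // leaf data, size n*n
    int n = 0;                                    // leaf dimension (0 if internal)
};

// Leaf-level C += A * B; a real code would call a tuned CPU or GPU kernel here.
void multiply_leaf(const QNode& A, const QNode& B, QNode& C) {
    int n = A.n;
    if (C.leaf.empty()) { C.leaf.assign(static_cast<std::size_t>(n) * n, 0.0); C.n = n; }
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C.leaf[i * n + j] += A.leaf[i * n + k] * B.leaf[k * n + j];
}

// Recursive C += A * B; the recursion only descends into quadrant pairs that
// are both non-empty, so sparsity is exploited without a precomputed pattern.
void multiply(const QNode* A, const QNode* B, QNode& C) {
    if (!A || !B) return;                                // empty branch: nothing to do
    if (A->n > 0) { multiply_leaf(*A, *B, C); return; }  // assumes B is a leaf too
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k) {
                if (!A->child[2 * i + k] || !B->child[2 * k + j]) continue;
                if (!C.child[2 * i + j]) C.child[2 * i + j] = std::make_unique<QNode>();
                multiply(A->child[2 * i + k].get(), B->child[2 * k + j].get(),
                         *C.child[2 * i + j]);
            }
}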
Diagrammatic Coupled Cluster Monte Carlo
We propose a modified coupled cluster Monte Carlo algorithm that
stochastically samples connected terms within the truncated
Baker--Campbell--Hausdorff expansion of the similarity transformed Hamiltonian
by construction of coupled cluster diagrams on the fly. Our new approach --
diagCCMC -- allows propagation to be performed using only the connected
components of the similarity-transformed Hamiltonian, greatly reducing the
memory cost associated with the stochastic solution of the coupled cluster
equations. We show that for perfectly local, noninteracting systems, diagCCMC
is able to represent the coupled cluster wavefunction with a memory cost that
scales linearly with system size. This favorable memory cost is observed under
the sole assumption of fixed stochastic granularity and holds for arbitrary
levels of coupled cluster theory. A significant reduction in memory cost is also
shown to appear smoothly upon dissociation of a finite chain of helium atoms.
This approach is also shown not to break down in the presence of strong
correlation through the example of a stretched nitrogen molecule. Our novel
methodology moves the theoretical basis of coupled cluster Monte Carlo closer
to deterministic approaches.
Comment: 31 pages, 6 figures
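For context, the similarity-transformed Hamiltonian that diagCCMC samples is defined through the Baker--Campbell--Hausdorff expansion, which for a Hamiltonian containing at most two-body interactions terminates exactly at the fourfold nested commutator:

\bar{H} = e^{-\hat{T}} \hat{H} e^{\hat{T}}
        = \hat{H} + [\hat{H},\hat{T}]
        + \frac{1}{2!}\,[[\hat{H},\hat{T}],\hat{T}]
        + \frac{1}{3!}\,[[[\hat{H},\hat{T}],\hat{T}],\hat{T}]
        + \frac{1}{4!}\,[[[[\hat{H},\hat{T}],\hat{T}],\hat{T}],\hat{T}].

Only connected terms survive in \bar{H}, which is the property the on-the-fly diagram construction exploits to avoid building or storing disconnected contributions.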
Solution of the Skyrme-Hartree-Fock-Bogolyubov equations in the Cartesian deformed harmonic-oscillator basis. (VII) HFODD (v2.49t): a new version of the program
We describe the new version (v2.49t) of the code HFODD which solves the
nuclear Skyrme Hartree-Fock (HF) or Skyrme Hartree-Fock-Bogolyubov (HFB)
problem by using the Cartesian deformed harmonic-oscillator basis. In the new
version, we have implemented the following physics features: (i) the isospin
mixing and projection, (ii) the finite temperature formalism for the HFB and
HF+BCS methods, (iii) the Lipkin translational energy correction method, (iv)
the calculation of the shell correction. A number of specific numerical methods
have also been implemented in order to deal with large-scale multi-constraint
calculations and hardware limitations: (i) the two-basis method for the HFB
method, (ii) the Augmented Lagrangian Method (ALM) for multi-constraint
calculations, (iii) the linear constraint method based on the approximation of
the RPA matrix for multi-constraint calculations, (iv) an interface with the
axial and parity-conserving Skyrme-HFB code HFBTHO, (v) the mixing of the HF or
HFB matrix elements instead of the HF fields. Special care has been paid to
using the code on massively parallel leadership class computers. For this
purpose, the following features are now available with this version: (i) the
Message Passing Interface (MPI) framework, (ii) scalable input data routines,
(iii) multi-threading via OpenMP pragmas, (iv) parallel diagonalization of the
HFB matrix in the simplex breaking case using the ScaLAPACK library. Finally,
several errors of little significance in the previously published version were
corrected.
Comment: Accepted for publication in Computer Physics Communications. Program
files re-submitted to the Comp. Phys. Comm. Program Library after correction of
several minor bugs
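As a sketch of item (ii) in the list of numerical methods, the Augmented Lagrangian Method for multi-constraint calculations is conventionally formulated (this is the standard form, not quoted from the paper) as the minimization of a routhian containing both linear and quadratic constraint terms,

E' = E_{\mathrm{HFB}} + \sum_{j}\left[\lambda_j\left(\langle\hat{Q}_j\rangle - q_j\right) + C_j\left(\langle\hat{Q}_j\rangle - q_j\right)^2\right],

with the Lagrange multipliers updated between self-consistent iterations as \lambda_j \leftarrow \lambda_j + 2 C_j (\langle\hat{Q}_j\rangle - q_j), so that the constraints \langle\hat{Q}_j\rangle = q_j are met at convergence without requiring ever larger stiffness parameters C_j.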
Linear scaling computation of the Fock matrix. IX. Parallel computation of the Coulomb matrix
We present parallelization of a quantum-chemical tree code [J. Chem. Phys.
106, 5526 (1997)] for linear scaling computation of the Coulomb matrix.
Equal time partition [J. Chem. Phys. 118, 9128 (2003)] is used to load
balance computation of the Coulomb matrix. Equal time partition is a
measurement-based algorithm for domain decomposition that exploits the small
variation of the density between self-consistent-field cycles to achieve load
balance. The efficiency of equal time partition is illustrated by several tests
involving both finite and periodic systems. It is found that equal time
partition delivers 91--98% efficiency with 128 processors in the
most time-consuming part of the Coulomb matrix calculation. The current
parallel quantum-chemical tree code delivers 63--81% overall
efficiency on 128 processors with fine-grained parallelism (fewer than two heavy
atoms per processor).
Comment: 7 pages, 6 figures
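A minimal sketch of the measurement-based idea behind equal time partition, assuming that the per-work-item costs recorded during the previous self-consistent-field cycle predict the costs of the next one; the function name and data layout are illustrative and not taken from the parallel tree code.

#include <cstddef>
#include <numeric>
#include <vector>

// Cut an ordered list of work items into P contiguous segments of roughly
// equal measured cost; the p-th segment ends (exclusively) at index cuts[p].
std::vector<std::size_t> equal_time_cuts(const std::vector<double>& cost, int P) {
    double total = std::accumulate(cost.begin(), cost.end(), 0.0);
    double target = total / P;              // ideal time per processor
    std::vector<std::size_t> cuts;
    double acc = 0.0;
    for (std::size_t i = 0; i < cost.size(); ++i) {
        acc += cost[i];
        if (cuts.size() + 1 < static_cast<std::size_t>(P) &&
            acc >= target * (cuts.size() + 1))
            cuts.push_back(i + 1);          // close the current segment here
    }
    cuts.push_back(cost.size());            // last processor takes the remainder
    return cuts;
}

Each processor then evaluates its contiguous block of work items, and the cuts are recomputed from fresh timings at the next cycle, which is viable precisely because the density, and hence the cost distribution, changes little between cycles.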