An efficient MPI/OpenMP parallelization of the Hartree-Fock method for the second generation of Intel Xeon Phi processor
Modern OpenMP threading techniques are used to convert the MPI-only
Hartree-Fock code in the GAMESS program to a hybrid MPI/OpenMP algorithm. Two
separate implementations are considered, which differ in whether the key data
structures, the density and Fock matrices, are shared or replicated among
threads. Both implementations are benchmarked on a supercomputer of 3,000 Intel Xeon Phi
processors. With 64 cores per processor, scaling numbers are reported on up to
192,000 cores. The hybrid MPI/OpenMP implementation reduces the memory
footprint by a factor of approximately 200 compared to the legacy code. The
MPI/OpenMP code was shown to run up to six times faster than the original for a
range of molecular system sizes.
Comment: SC17 conference paper, 12 pages, 7 figures
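The following is a minimal sketch of the two threading strategies contrasted above, assuming a dense Fock matrix F, a density matrix D, and a flattened loop over integral work items; the function names and placeholder updates are illustrative and not taken from GAMESS, and the MPI rank-level distribution of work is omitted.

#include <cstddef>
#include <vector>

// Strategy 1: one Fock matrix shared by all threads; every update is guarded
// by an atomic, so memory stays O(n^2) but contention can grow with threads.
void fock_shared(std::vector<double>& F, const std::vector<double>& D,
                 int n, long nwork) {
    #pragma omp parallel for schedule(dynamic)
    for (long q = 0; q < nwork; ++q) {
        long idx = q % (static_cast<long>(n) * n);  // placeholder target element
        double contrib = 0.1 * D[idx];              // placeholder integral contribution
        #pragma omp atomic
        F[idx] += contrib;
    }
}

// Strategy 2: each thread accumulates into a private copy of F that is reduced
// at the end; synchronization is cheap, but memory grows with the thread count.
void fock_replicated(std::vector<double>& F, const std::vector<double>& D,
                     int n, long nwork) {
    #pragma omp parallel
    {
        std::vector<double> F_local(static_cast<std::size_t>(n) * n, 0.0);
        #pragma omp for schedule(dynamic)
        for (long q = 0; q < nwork; ++q) {
            long idx = q % (static_cast<long>(n) * n);
            F_local[idx] += 0.1 * D[idx];           // placeholder integral contribution
        }
        #pragma omp critical
        for (std::size_t i = 0; i < F.size(); ++i) F[i] += F_local[i];
    }
}

Compiled without OpenMP the pragmas are simply ignored; with it, the trade-off between the two routines mirrors the shared-versus-replicated choice benchmarked in the paper.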
Modern Approaches to Exact Diagonalization and Selected Configuration Interaction with the Adaptive Sampling CI Method.
Recent advances in selected configuration interaction methods have made them competitive with the most accurate techniques available, creating an increasingly powerful tool for solving quantum Hamiltonians. In this work, we build on recent advances from the adaptive sampling configuration interaction (ASCI) algorithm. We show that a useful paradigm for generating efficient selected CI/exact diagonalization algorithms is one driven by fast sorting algorithms, much in the same way that iterative diagonalization is based on the paradigm of matrix-vector multiplication. We present several new algorithms for all parts of performing a selected CI, including a new ASCI search, dynamic bit masking, fast orbital rotations, fast diagonal matrix elements, and residue arrays. The ASCI search algorithm can be used in several different modes, including an integral-driven search and a coefficient-driven search. The algorithms presented here are fast and scalable, and we find that, because they are built on fast sorting algorithms, they are more efficient than all other approaches we considered. After introducing these techniques, we present ASCI results for a wide range of systems and basis sets to demonstrate the types of simulations that can be treated practically at the full-CI level with modern methods and hardware, presenting double- and triple-ζ benchmark data for the G1 data set. The largest of these calculations is Si2H6, a simulation of 34 electrons in 152 orbitals. We also present preliminary results for fast deterministic perturbation theory simulations that use hash functions to maintain high efficiency when treating large basis sets.
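As a rough illustration of the sorting-driven paradigm described above, the sketch below keeps the k most important candidate determinants with a single partial-ordering pass; the Candidate struct and the importance weight are illustrative assumptions, not the actual ASCI data structures.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical candidate record: a determinant label plus a precomputed
// importance estimate (e.g. a perturbative weight accumulated over connections).
struct Candidate {
    std::uint64_t det;     // bit-string occupation label
    double        weight;  // importance estimate
};

// Keep the k most important candidates. nth_element runs in linear average
// time, so the selection step is dominated by fast (partial) sorting rather
// than by heap or map lookups.
std::vector<Candidate> select_top_k(std::vector<Candidate> cands, std::size_t k) {
    if (k < cands.size()) {
        std::nth_element(cands.begin(), cands.begin() + k, cands.end(),
                         [](const Candidate& a, const Candidate& b) {
                             return std::abs(a.weight) > std::abs(b.weight);
                         });
        cands.resize(k);
    }
    return cands;  // the new variational determinant space
}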
Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model
We present a method for parallel block-sparse matrix-matrix multiplication on
distributed memory clusters. By using a quadtree matrix representation, data
locality is exploited without prior information about the matrix sparsity
pattern. A distributed quadtree matrix representation is straightforward to
implement due to our recent development of the Chunks and Tasks programming
model [Parallel Comput. 40, 328 (2014)]. The quadtree representation combined
with the Chunks and Tasks model leads to favorable weak and strong scaling of
the communication cost with the number of processes, as shown both
theoretically and in numerical experiments.
Matrices are represented by sparse quadtrees of chunk objects. The leaves in
the hierarchy are block-sparse submatrices. Sparsity is dynamically detected by
the matrix library and may occur at any level in the hierarchy and/or within
the submatrix leaves. When graphics processing units (GPUs) are available,
both CPUs and GPUs are used for leaf-level multiplication work, thus making use
of the full computing capacity of each node.
The performance is evaluated for matrices with different sparsity structures,
including examples from electronic structure calculations. Compared to methods
that do not exploit data locality, our locality-aware approach reduces
communication significantly, achieving essentially constant communication per
node in weak scaling tests.
Comment: 35 pages, 14 figures
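A minimal serial sketch of multiplication over a sparse quadtree in the spirit of the representation described above, assuming matching tree depths and dense square leaves; the Chunks and Tasks distribution, the block-sparse leaf format, and the GPU leaf kernels are all omitted, and the names are illustrative.

#include <array>
#include <cstddef>
#include <memory>
#include <vector>

// Quadtree node: either empty (a null child pointer in its parent), an
// internal node with four children, or a leaf holding a dense submatrix.
struct QNode {
    std::array<std::unique_ptr<QNode>, 4> child;  // internal node (row-major 2x2)
    std::vector<double> leaf;                     // leaf data, size n*n
    int n = 0;                                    // leaf dimension (0 if internal)
};

// Leaf-level C += A * B; a real code would call a tuned CPU or GPU kernel here.
void multiply_leaf(const QNode& A, const QNode& B, QNode& C) {
    int n = A.n;
    if (C.leaf.empty()) { C.leaf.assign(static_cast<std::size_t>(n) * n, 0.0); C.n = n; }
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C.leaf[i * n + j] += A.leaf[i * n + k] * B.leaf[k * n + j];
}

// Recursive C += A * B; the recursion only descends into quadrant pairs that
// are both non-empty, so sparsity is exploited without a precomputed pattern.
void multiply(const QNode* A, const QNode* B, QNode& C) {
    if (!A || !B) return;                                // empty branch: nothing to do
    if (A->n > 0) { multiply_leaf(*A, *B, C); return; }  // assumes B is a leaf too
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k) {
                if (!A->child[2 * i + k] || !B->child[2 * k + j]) continue;
                if (!C.child[2 * i + j]) C.child[2 * i + j] = std::make_unique<QNode>();
                multiply(A->child[2 * i + k].get(), B->child[2 * k + j].get(),
                         *C.child[2 * i + j]);
            }
}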
Diagrammatic Coupled Cluster Monte Carlo
We propose a modified coupled cluster Monte Carlo algorithm that
stochastically samples connected terms within the truncated
Baker--Campbell--Hausdorff expansion of the similarity transformed Hamiltonian
by construction of coupled cluster diagrams on the fly. Our new approach --
diagCCMC -- allows propagation to be performed using only the connected
components of the similarity-transformed Hamiltonian, greatly reducing the
memory cost associated with the stochastic solution of the coupled cluster
equations. We show that for perfectly local, noninteracting systems, diagCCMC
is able to represent the coupled cluster wavefunction with a memory cost that
scales linearly with system size. This favorable memory cost is observed under
the sole assumption of fixed stochastic granularity and holds for arbitrary
levels of coupled cluster theory. A significant reduction in memory cost is also
shown to appear smoothly upon dissociation of a finite chain of helium atoms.
This approach is also shown not to break down in the presence of strong
correlation through the example of a stretched nitrogen molecule. Our novel
methodology moves the theoretical basis of coupled cluster Monte Carlo closer
to deterministic approaches.
Comment: 31 pages, 6 figures
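For context, the similarity-transformed Hamiltonian that diagCCMC samples is defined through the Baker--Campbell--Hausdorff expansion, which for a Hamiltonian containing at most two-body interactions terminates exactly at the fourfold nested commutator:

\bar{H} = e^{-\hat{T}} \hat{H} e^{\hat{T}}
        = \hat{H} + [\hat{H},\hat{T}]
        + \frac{1}{2!}\,[[\hat{H},\hat{T}],\hat{T}]
        + \frac{1}{3!}\,[[[\hat{H},\hat{T}],\hat{T}],\hat{T}]
        + \frac{1}{4!}\,[[[[\hat{H},\hat{T}],\hat{T}],\hat{T}],\hat{T}].

Only connected terms survive in \bar{H}, which is the property the on-the-fly diagram construction exploits to avoid building or storing disconnected contributions.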
Solution of the Skyrme-Hartree-Fock-Bogolyubov equations in the Cartesian deformed harmonic-oscillator basis. (VII) HFODD (v2.49t): a new version of the program
We describe the new version (v2.49t) of the code HFODD which solves the
nuclear Skyrme Hartree-Fock (HF) or Skyrme Hartree-Fock-Bogolyubov (HFB)
problem by using the Cartesian deformed harmonic-oscillator basis. In the new
version, we have implemented the following physics features: (i) the isospin
mixing and projection, (ii) the finite temperature formalism for the HFB and
HF+BCS methods, (iii) the Lipkin translational energy correction method, (iv)
the calculation of the shell correction. A number of specific numerical methods
have also been implemented in order to deal with large-scale multi-constraint
calculations and hardware limitations: (i) the two-basis method for the HFB
method, (ii) the Augmented Lagrangian Method (ALM) for multi-constraint
calculations, (iii) the linear constraint method based on the approximation of
the RPA matrix for multi-constraint calculations, (iv) an interface with the
axial and parity-conserving Skyrme-HFB code HFBTHO, (v) the mixing of the HF or
HFB matrix elements instead of the HF fields. Special care has been paid to
using the code on massively parallel leadership class computers. For this
purpose, the following features are now available with this version: (i) the
Message Passing Interface (MPI) framework, (ii) scalable input data routines,
(iii) multi-threading via OpenMP pragmas, (iv) parallel diagonalization of the
HFB matrix in the simplex breaking case using the ScaLAPACK library. Finally,
several errors of little significance in the previously published version were
corrected.
Comment: Accepted for publication in Computer Physics Communications. Program
files re-submitted to the Comp. Phys. Comm. Program Library after correction of
several minor bugs
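As a sketch of item (ii) in the list of numerical methods, the Augmented Lagrangian Method for multi-constraint calculations is conventionally formulated (this is the standard form, not quoted from the paper) as the minimization of a routhian containing both linear and quadratic constraint terms,

E' = E_{\mathrm{HFB}} + \sum_{j}\left[\lambda_j\left(\langle\hat{Q}_j\rangle - q_j\right) + C_j\left(\langle\hat{Q}_j\rangle - q_j\right)^2\right],

with the Lagrange multipliers updated between self-consistent iterations as \lambda_j \leftarrow \lambda_j + 2 C_j (\langle\hat{Q}_j\rangle - q_j), so that the constraints \langle\hat{Q}_j\rangle = q_j are met at convergence without requiring ever larger stiffness parameters C_j.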
Linear scaling computation of the Fock matrix. IX. Parallel computation of the Coulomb matrix
We present parallelization of a quantum-chemical tree code [J. Chem. Phys.
106, 5526 (1997)] for linear scaling computation of the Coulomb matrix.
Equal time partition [J. Chem. Phys. 118, 9128 (2003)] is used to load
balance computation of the Coulomb matrix. Equal time partition is a
measurement-based algorithm for domain decomposition that exploits the small
variation of the density between self-consistent-field cycles to achieve load
balance. The efficiency of equal time partition is illustrated by several tests
involving both finite and periodic systems. It is found that equal time
partition delivers 91--98% efficiency with 128 processors in the
most time-consuming part of the Coulomb matrix calculation. The current
parallel quantum-chemical tree code delivers 63--81% overall
efficiency on 128 processors with fine-grained parallelism (fewer than two heavy
atoms per processor).
Comment: 7 pages, 6 figures
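A minimal sketch of the measurement-based idea behind equal time partition, assuming that the per-work-item costs recorded during the previous self-consistent-field cycle predict the costs of the next one; the function name and data layout are illustrative and not taken from the parallel tree code.

#include <cstddef>
#include <numeric>
#include <vector>

// Cut an ordered list of work items into P contiguous segments of roughly
// equal measured cost; the p-th segment ends (exclusively) at index cuts[p].
std::vector<std::size_t> equal_time_cuts(const std::vector<double>& cost, int P) {
    double total = std::accumulate(cost.begin(), cost.end(), 0.0);
    double target = total / P;              // ideal time per processor
    std::vector<std::size_t> cuts;
    double acc = 0.0;
    for (std::size_t i = 0; i < cost.size(); ++i) {
        acc += cost[i];
        if (cuts.size() + 1 < static_cast<std::size_t>(P) &&
            acc >= target * (cuts.size() + 1))
            cuts.push_back(i + 1);          // close the current segment here
    }
    cuts.push_back(cost.size());            // last processor takes the remainder
    return cuts;
}

Each processor then evaluates its contiguous block of work items, and the cuts are recomputed from fresh timings at the next cycle, which is viable precisely because the density, and hence the cost distribution, changes little between cycles.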