1,393 research outputs found

    An efficient MPI/OpenMP parallelization of the Hartree-Fock method for the second generation of Intel Xeon Phi processor

    Modern OpenMP threading techniques are used to convert the MPI-only Hartree-Fock code in the GAMESS program to a hybrid MPI/OpenMP algorithm. Two separate implementations are considered, which differ in whether the key data structures, the density and Fock matrices, are shared or replicated among threads. All implementations are benchmarked on a supercomputer with 3,000 Intel Xeon Phi processors. With 64 cores per processor, scaling numbers are reported on up to 192,000 cores. The hybrid MPI/OpenMP implementation reduces the memory footprint by approximately 200 times compared to the legacy code. The MPI/OpenMP code is shown to run up to six times faster than the original for a range of molecular system sizes. Comment: SC17 conference paper, 12 pages, 7 figures
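
    The difference between the two threading strategies can be sketched in a few lines of OpenMP. The C++ fragment below is only an illustration of shared versus replicated Fock-matrix accumulation among threads (the integral work is a placeholder), not the GAMESS implementation.

    #include <omp.h>
    #include <vector>

    // Shared-Fock variant: a single matrix, concurrent updates guarded by atomics.
    void fock_shared(std::vector<double>& F, const std::vector<double>& D, int n) {
        #pragma omp parallel for schedule(dynamic)
        for (int ij = 0; ij < n * n; ++ij) {
            double contribution = 0.5 * D[ij];         // placeholder for two-electron integral work
            #pragma omp atomic
            F[ij] += contribution;
        }
    }

    // Replicated-Fock variant: each thread accumulates into a private copy,
    // reduced into the shared matrix at the end (more memory, no atomics).
    void fock_replicated(std::vector<double>& F, const std::vector<double>& D, int n) {
        #pragma omp parallel
        {
            std::vector<double> F_local(n * n, 0.0);   // per-thread replica
            #pragma omp for schedule(dynamic)
            for (int ij = 0; ij < n * n; ++ij)
                F_local[ij] += 0.5 * D[ij];            // placeholder integral work
            #pragma omp critical
            for (int ij = 0; ij < n * n; ++ij)
                F[ij] += F_local[ij];                  // reduction into the shared matrix
        }
    }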

    Modern Approaches to Exact Diagonalization and Selected Configuration Interaction with the Adaptive Sampling CI Method.

    Recent advances in selected configuration interaction methods have made them competitive with the most accurate techniques available, creating an increasingly powerful tool for solving quantum Hamiltonians. In this work, we build on recent advances in the adaptive sampling configuration interaction (ASCI) algorithm. We show that a useful paradigm for generating efficient selected CI/exact diagonalization algorithms is one driven by fast sorting algorithms, much as iterative diagonalization is built on the paradigm of matrix-vector multiplication. We present several new algorithms for all parts of a selected CI calculation, including a new ASCI search, dynamic bit masking, fast orbital rotations, fast diagonal matrix elements, and residue arrays. The ASCI search algorithm can be used in several different modes, including an integral-driven search and a coefficient-driven search. The algorithms presented here are fast and scalable, and because they are built on fast sorting algorithms we find them to be more efficient than all other approaches we considered. After introducing these techniques, we present ASCI results for a wide range of systems and basis sets to demonstrate the types of simulations that can be practically treated at the full-CI level with modern methods and hardware, presenting double- and triple-ζ benchmark data for the G1 data set. The largest of these calculations is Si2H6, a simulation of 34 electrons in 152 orbitals. We also present preliminary results for fast deterministic perturbation theory simulations that use hash functions to maintain high efficiency when treating large basis sets.
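
    The sorting-driven selection paradigm can be illustrated by its simplest ingredient: keeping the k determinants with the largest coefficient magnitudes via a partial sort. The C++ sketch below is purely illustrative; the names and data layout are assumptions, not the ASCI code.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Determinant {
        std::uint64_t alpha_string;  // occupation bit string, alpha spin
        std::uint64_t beta_string;   // occupation bit string, beta spin
        double coefficient;          // current CI coefficient (or a first-order estimate)
    };

    // Keep the k determinants with the largest |coefficient|; the partial sort
    // (average O(N)) is the kind of fast sorting primitive the search is built on.
    std::vector<Determinant> select_top_k(std::vector<Determinant> dets, std::size_t k) {
        k = std::min(k, dets.size());
        std::nth_element(dets.begin(), dets.begin() + k, dets.end(),
                         [](const Determinant& a, const Determinant& b) {
                             return std::abs(a.coefficient) > std::abs(b.coefficient);
                         });
        dets.resize(k);
        return dets;
    }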

    Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

    We present a method for parallel block-sparse matrix-matrix multiplication on distributed-memory clusters. By using a quadtree matrix representation, data locality is exploited without prior information about the matrix sparsity pattern. A distributed quadtree matrix representation is straightforward to implement thanks to our recent development of the Chunks and Tasks programming model [Parallel Comput. 40, 328 (2014)]. The quadtree representation combined with the Chunks and Tasks model leads to favorable weak and strong scaling of the communication cost with the number of processes, as shown both theoretically and in numerical experiments. Matrices are represented by sparse quadtrees of chunk objects; the leaves in the hierarchy are block-sparse submatrices. Sparsity is dynamically detected by the matrix library and may occur at any level in the hierarchy and/or within the submatrix leaves. If graphics processing units (GPUs) are available, both CPUs and GPUs are used for leaf-level multiplication work, thus making use of the full computing capacity of each node. The performance is evaluated for matrices with different sparsity structures, including examples from electronic structure calculations. Compared to methods that do not exploit data locality, our locality-aware approach reduces communication significantly, achieving essentially constant communication per node in weak scaling tests. Comment: 35 pages, 14 figures
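
    A quadtree matrix-matrix multiplication can be sketched as a recursion over quadrants that simply skips absent (zero) subtrees. The C++ fragment below is a serial illustration under assumed data structures; in the paper the nodes are distributed as chunks in the Chunks and Tasks model and the leaves are block-sparse submatrices.

    #include <array>
    #include <memory>
    #include <vector>

    // A node is either empty (null pointer), a dense leaf block, or four children.
    struct QuadNode {
        bool is_leaf = false;
        std::vector<double> block;                      // dense leaf of size bs x bs
        std::array<std::unique_ptr<QuadNode>, 4> child; // quadrants indexed 2*row + col
    };

    // Multiply-and-accumulate C += A * B, skipping zero (absent) subtrees.
    void multiply(const QuadNode* A, const QuadNode* B, QuadNode& C, int bs) {
        if (!A || !B) return;                           // zero operand: nothing to do
        if (A->is_leaf && B->is_leaf) {
            if (C.block.empty()) { C.is_leaf = true; C.block.assign(bs * bs, 0.0); }
            for (int i = 0; i < bs; ++i)
                for (int k = 0; k < bs; ++k)
                    for (int j = 0; j < bs; ++j)
                        C.block[i * bs + j] += A->block[i * bs + k] * B->block[k * bs + j];
            return;
        }
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                for (int k = 0; k < 2; ++k) {
                    const QuadNode* a = A->child[2 * i + k].get();
                    const QuadNode* b = B->child[2 * k + j].get();
                    if (!a || !b) continue;             // sparsity: skip empty quadrants
                    auto& c = C.child[2 * i + j];
                    if (!c) c = std::make_unique<QuadNode>();
                    multiply(a, b, *c, bs);
                }
    }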

    Diagrammatic Coupled Cluster Monte Carlo

    We propose a modified coupled cluster Monte Carlo algorithm that stochastically samples connected terms within the truncated Baker-Campbell-Hausdorff expansion of the similarity-transformed Hamiltonian by constructing coupled cluster diagrams on the fly. Our new approach, diagCCMC, allows propagation to be performed using only the connected components of the similarity-transformed Hamiltonian, greatly reducing the memory cost associated with the stochastic solution of the coupled cluster equations. We show that for perfectly local, noninteracting systems, diagCCMC is able to represent the coupled cluster wavefunction with a memory cost that scales linearly with system size. The favorable memory cost is observed with the sole assumption of fixed stochastic granularity and is valid for arbitrary levels of coupled cluster theory. A significant reduction in memory cost is also shown to appear smoothly with the dissociation of a finite chain of helium atoms. The approach is also shown not to break down in the presence of strong correlation, as demonstrated for a stretched nitrogen molecule. Our methodology moves the theoretical basis of coupled cluster Monte Carlo closer to deterministic approaches. Comment: 31 pages, 6 figures
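
    For context, the Baker-Campbell-Hausdorff expansion referred to above terminates exactly at fourth order when the Hamiltonian contains at most two-body operators:

        \bar{H} = e^{-\hat{T}} \hat{H} e^{\hat{T}}
                = \hat{H} + [\hat{H},\hat{T}]
                  + \tfrac{1}{2!}\,[[\hat{H},\hat{T}],\hat{T}]
                  + \tfrac{1}{3!}\,[[[\hat{H},\hat{T}],\hat{T}],\hat{T}]
                  + \tfrac{1}{4!}\,[[[[\hat{H},\hat{T}],\hat{T}],\hat{T}],\hat{T}],

    and only connected terms, in which every \hat{T} shares at least one index with \hat{H}, survive. diagCCMC samples these connected contributions stochastically by building the corresponding diagrams on the fly.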

    Solution of the Skyrme-Hartree-Fock-Bogolyubov equations in the Cartesian deformed harmonic-oscillator basis. (VII) HFODD (v2.49t): a new version of the program

    We describe the new version (v2.49t) of the code HFODD, which solves the nuclear Skyrme Hartree-Fock (HF) or Skyrme Hartree-Fock-Bogolyubov (HFB) problem using the Cartesian deformed harmonic-oscillator basis. In the new version, we have implemented the following physics features: (i) isospin mixing and projection, (ii) the finite-temperature formalism for the HFB and HF+BCS methods, (iii) the Lipkin translational energy correction method, (iv) the calculation of the shell correction. A number of specific numerical methods have also been implemented in order to deal with large-scale multi-constraint calculations and hardware limitations: (i) the two-basis method for the HFB method, (ii) the Augmented Lagrangian Method (ALM) for multi-constraint calculations, (iii) the linear constraint method based on the approximation of the RPA matrix for multi-constraint calculations, (iv) an interface with the axial and parity-conserving Skyrme-HFB code HFBTHO, (v) the mixing of the HF or HFB matrix elements instead of the HF fields. Special care has been taken to enable use of the code on massively parallel leadership-class computers. For this purpose, the following features are now available in this version: (i) the Message Passing Interface (MPI) framework, (ii) scalable input data routines, (iii) multi-threading via OpenMP pragmas, (iv) parallel diagonalization of the HFB matrix in the simplex-breaking case using the ScaLAPACK library. Finally, several minor errors of the previously published version have been corrected. Comment: Accepted for publication in Computer Physics Communications. Program files re-submitted to the Comp. Phys. Comm. Program Library after correction of several minor bugs
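
    As a schematic reminder (not necessarily the exact conventions used in HFODD), the Augmented Lagrangian Method of item (ii) above replaces plain linear constraints by an energy of the form

        E^{\mathrm{ALM}} = E^{\mathrm{HFB}} + \sum_j \left[ \lambda_j \left( \langle \hat{Q}_j \rangle - q_j \right) + \tfrac{C_j}{2} \left( \langle \hat{Q}_j \rangle - q_j \right)^2 \right],

    with the multipliers updated between iterations as \lambda_j \rightarrow \lambda_j + C_j \left( \langle \hat{Q}_j \rangle - q_j \right), which keeps the constrained moments \langle \hat{Q}_j \rangle close to the requested values q_j in multi-constraint calculations.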

    Linear scaling computation of the Fock matrix. IX. Parallel computation of the Coulomb matrix

    We present parallelization of a quantum-chemical tree-code [J. Chem. Phys. 106, 5526 (1997)] for linear scaling computation of the Coulomb matrix. Equal time partition [J. Chem. Phys. 118, 9128 (2003)] is used to load balance the computation of the Coulomb matrix. Equal time partition is a measurement-based algorithm for domain decomposition that exploits the small variation of the density between self-consistent-field cycles to achieve load balance. The efficiency of equal time partition is illustrated by several tests involving both finite and periodic systems. It is found that equal time partition is able to deliver 91-98% efficiency with 128 processors in the most time-consuming part of the Coulomb matrix calculation. The current parallel quantum-chemical tree-code is able to deliver 63-81% overall efficiency on 128 processors with fine-grained parallelism (fewer than two heavy atoms per processor). Comment: 7 pages, 6 figures
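
    The idea behind equal time partition can be captured in a few lines: timings measured during the previous self-consistent-field cycle are used to cut the list of work items into contiguous ranges of roughly equal total time. The C++ helper below is a hypothetical illustration, not the code from the paper.

    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Measurement-based domain decomposition: split work items into P contiguous
    // ranges so each process receives roughly the same total time as measured
    // in the previous SCF cycle. Returns P + 1 boundary indices.
    std::vector<std::size_t> equal_time_partition(const std::vector<double>& t_prev, std::size_t P) {
        const double total = std::accumulate(t_prev.begin(), t_prev.end(), 0.0);
        const double target = total / static_cast<double>(P);   // ideal time per process
        std::vector<std::size_t> bounds{0};                     // range of process p: bounds[p] .. bounds[p+1]
        double acc = 0.0;
        for (std::size_t i = 0; i < t_prev.size() && bounds.size() < P; ++i) {
            acc += t_prev[i];
            // Close the current range once its accumulated time reaches the running target.
            if (acc >= target * static_cast<double>(bounds.size()))
                bounds.push_back(i + 1);
        }
        while (bounds.size() <= P)                              // pad to exactly P ranges
            bounds.push_back(t_prev.size());
        return bounds;
    }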