
    Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds

    A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors beyond which the inter-processor communication no longer scales. We obtain memory-independent communication cost lower bounds for classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms. (Comment: 4 pages, 1 figure)
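
    For reference, the bounds discussed in this abstract take roughly the following form (a sketch using standard notation: n x n matrices, P processors, local memory of size M per processor; see the paper for the exact statements and constants):

        % Memory-independent communication lower bounds (words moved per processor)
        % for classical and Strassen-based parallel matrix multiplication.
        \[
          W_{\mathrm{classical}} = \Omega\!\left(\frac{n^2}{P^{2/3}}\right),
          \qquad
          W_{\mathrm{Strassen}} = \Omega\!\left(\frac{n^2}{P^{2/\omega_0}}\right),
          \quad \omega_0 = \log_2 7 .
        \]
        % Perfect strong scaling, T(P) = Theta(T(1)/P), is therefore possible only
        % while the memory-dependent bounds dominate, i.e. roughly up to
        % P = O(n^3 / M^{3/2}) for the classical algorithm and
        % P = O(n^{\omega_0} / M^{\omega_0/2}) for the Strassen-based one.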

    Gauge Field Generation on Large-Scale GPU-Enabled Systems

    Over the past years, GPUs have been successfully applied to the task of inverting the fermion matrix in lattice QCD calculations. Even strong scaling to capability-level supercomputers, corresponding to O(100) GPUs or more, has been achieved. However, strong scaling a whole gauge field generation algorithm to this regime requires significantly more functionality than just having the matrix inverter utilize the GPUs, and has not yet been accomplished. This contribution extends QDP-JIT, the migration of SciDAC QDP++ to GPU-enabled parallel systems, to help strong scale the whole Hybrid Monte Carlo algorithm to this regime. Initial results are shown for gauge field generation with Chroma simulating pure Wilson fermions on OLCF TitanDev. (Comment: The 30th International Symposium on Lattice Field Theory, June 24-29, 2012, Cairns, Australia; acknowledgment and citation added)
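
    A back-of-the-envelope Amdahl's-law estimate (hypothetical numbers, not from the paper) illustrates why off-loading only the inverter is not enough when strong scaling the whole update:

        # Hedged illustration (hypothetical numbers, not from the paper): if only the
        # solver part of an HMC trajectory is accelerated, the remaining work caps
        # the speedup of the whole gauge field generation.
        def max_speedup(solver_fraction: float, solver_speedup: float) -> float:
            """Overall speedup when only the solver portion is accelerated."""
            return 1.0 / ((1.0 - solver_fraction) + solver_fraction / solver_speedup)

        # Example: a solver that is 80% of the trajectory and runs 20x faster on GPUs
        # limits the whole update to about 4.2x.
        print(max_speedup(0.8, 20.0))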

    A Massively Parallel Algorithm for the Approximate Calculation of Inverse p-th Roots of Large Sparse Matrices

    We present the submatrix method, a highly parallelizable method for the approximate calculation of inverse p-th roots of large sparse symmetric matrices, which are required in different scientific applications. We follow the idea of Approximate Computing, allowing imprecision in the final result in order to utilize the sparsity of the input matrix and to allow massively parallel execution. For an n x n matrix, the proposed algorithm allows the calculations to be distributed over n nodes with little communication overhead. The approximate result matrix exhibits the same sparsity pattern as the input matrix, allowing for efficient reuse of allocated data structures. We evaluate the algorithm with respect to the error that it introduces into calculated results, as well as its performance and scalability. We demonstrate that the error is relatively limited for well-conditioned matrices and that, even for ill-conditioned matrices, the results remain valuable for error-resilient applications such as preconditioning. We discuss the execution time and scaling of the algorithm on a theoretical level and present a distributed implementation using MPI and OpenMP. We demonstrate the scalability of this implementation by running it on a high-performance compute cluster with 1024 CPU cores, showing a speedup of 665x compared to single-threaded execution.
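
    A minimal serial sketch of the idea described above (an illustration under assumptions, not the paper's MPI/OpenMP implementation): for each column, a dense inverse p-th root is applied only to the principal submatrix induced by that column's nonzero pattern, and the resulting column is copied back, so the result inherits the sparsity pattern of the input. The function names and the use of scipy.linalg.fractional_matrix_power are choices made here for illustration.

        # Hedged sketch of a submatrix-style approximation of A^(-1/p).
        # Assumes a symmetric sparse matrix with nonzero diagonal (so column i
        # appears in its own nonzero pattern); each loop iteration is independent
        # and could be assigned to its own node.
        import numpy as np
        from scipy.linalg import fractional_matrix_power

        def approx_inverse_pth_root(A: np.ndarray, p: int) -> np.ndarray:
            n = A.shape[0]
            R = np.zeros_like(A, dtype=float)
            for i in range(n):                      # one independent task per column
                idx = np.nonzero(A[:, i])[0]        # nonzero pattern of column i
                sub = A[np.ix_(idx, idx)]           # small dense principal submatrix
                sub_root = fractional_matrix_power(sub, -1.0 / p)
                local_i = np.where(idx == i)[0][0]  # position of column i inside sub
                # real part guards against tiny imaginary round-off
                R[idx, i] = np.real(sub_root[:, local_i])
            return R                                # keeps A's sparsity pattern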

    Linear scaling computation of the Fock matrix. IX. Parallel computation of the Coulomb matrix

    We present the parallelization of a quantum-chemical tree-code [J. Chem. Phys. 106, 5526 (1997)] for linear scaling computation of the Coulomb matrix. Equal time partition [J. Chem. Phys. 118, 9128 (2003)] is used to load balance computation of the Coulomb matrix. Equal time partition is a measurement-based algorithm for domain decomposition that exploits the small variation of the density between self-consistent-field cycles to achieve load balance. The efficiency of equal time partition is illustrated by several tests involving both finite and periodic systems. It is found that equal time partition is able to deliver 91--98% efficiency with 128 processors in the most time-consuming part of the Coulomb matrix calculation. The current parallel quantum-chemical tree-code is able to deliver 63--81% overall efficiency on 128 processors with fine-grained parallelism (less than two heavy atoms per processor). (Comment: 7 pages, 6 figures)
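
    The equal time partition idea lends itself to a compact illustration (a hedged sketch, not the paper's implementation): per-task times measured in the previous self-consistent-field cycle are used to cut the task list into chunks of roughly equal total time for the next cycle.

        # Hedged sketch of measurement-based load balancing in the spirit of
        # equal time partition: split consecutive work items into n_procs chunks
        # whose measured times (from the previous SCF cycle) sum to similar totals.
        def equal_time_partition(times, n_procs):
            target = sum(times) / n_procs
            parts, current, acc = [], [], 0.0
            for i, t in enumerate(times):
                current.append(i)
                acc += t
                if acc >= target and len(parts) < n_procs - 1:
                    parts.append(current)
                    current, acc = [], 0.0
            parts.append(current)
            return parts

        # Example: timings from the previous cycle drive the next domain decomposition.
        print(equal_time_partition([3.0, 1.0, 2.0, 2.0, 1.0, 3.0], 3))  # [[0, 1], [2, 3], [4, 5]]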

    Making extreme computations possible with virtual machines

    State-of-the-art algorithms generate scattering amplitudes for high-energy physics at leading order for high-multiplicity processes as compiled code (in Fortran, C or C++). For complicated processes the size of these libraries can become tremendous (many GiB). We show that amplitudes can be translated to byte-code instructions, which reduces the size by an order of magnitude. The byte-code is interpreted by a virtual machine with runtimes comparable to compiled code and better scaling with additional legs. We study the properties of this algorithm as an extension of the Optimizing Matrix Element Generator (O'Mega). The byte-code matrix elements are available as an alternative input for the event generator WHIZARD. The byte-code interpreter can be implemented very compactly, which will help with a future implementation on massively parallel GPUs. (Comment: 5 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:1411.383)
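
    The O'Mega/WHIZARD byte-code format itself is not reproduced here; the toy stack machine below (hypothetical opcodes and names) only illustrates the general idea of evaluating an expression from interpreted instructions instead of compiled code.

        # Hedged toy example: a minimal stack-based interpreter for arithmetic
        # expressions, standing in for the idea of byte-code matrix elements.
        def run(program, inputs):
            """Evaluate a list of (opcode, argument) instructions on a value stack."""
            stack = []
            for op, arg in program:
                if op == "LOAD":            # push a named input value
                    stack.append(inputs[arg])
                elif op == "CONST":         # push a literal constant
                    stack.append(arg)
                elif op == "ADD":
                    b, a = stack.pop(), stack.pop()
                    stack.append(a + b)
                elif op == "MUL":
                    b, a = stack.pop(), stack.pop()
                    stack.append(a * b)
            return stack.pop()

        # Toy "matrix element" 1 + 2*x, evaluated from byte-code-like instructions.
        prog = [("LOAD", "x"), ("CONST", 2.0), ("MUL", None), ("CONST", 1.0), ("ADD", None)]
        print(run(prog, {"x": 3.0}))        # prints 7.0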