Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds
A parallel algorithm has perfect strong scaling if its running time on P
processors is linear in 1/P, including all communication costs.
Distributed-memory parallel algorithms for matrix multiplication with perfect
strong scaling have only recently been found. One is based on classical matrix
multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's
fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz,
2012). Both algorithms scale perfectly, but only up to some number of
processors where the inter-processor communication no longer scales.
We obtain a memory-independent communication cost lower bound on classical
and Strassen-based distributed-memory matrix multiplication algorithms. These
bounds imply that no classical or Strassen-based parallel matrix multiplication
algorithm can strongly scale perfectly beyond the ranges already attained by
the two parallel algorithms mentioned above. The memory-independent bounds and
the strong scaling bounds generalize to other algorithms.
Comment: 4 pages, 1 figure
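The crossover described in this abstract can be sketched numerically. The two bound forms below follow the classical-matmul communication lower bounds the abstract refers to; the parameter names (n for matrix dimension, P for processors, M for words of local memory) are illustrative choices, not notation from the paper itself:

```python
# Sketch, assuming the classical lower-bound forms
#   memory-dependent:   Omega(n^3 / (P * sqrt(M)))
#   memory-independent: Omega(n^2 / P^(2/3))
def comm_lower_bounds(n, P, M):
    memory_dependent = n**3 / (P * M**0.5)
    memory_independent = n**2 / P**(2/3)
    # The effective lower bound is the larger of the two.
    return max(memory_dependent, memory_independent)

def strong_scaling_limit(n, M):
    # Perfect strong scaling (communication cost ~ 1/P) can only hold
    # while the memory-dependent term dominates, i.e. up to
    # P = n^3 / M^(3/2); beyond that the memory-independent bound
    # stops shrinking like 1/P.
    return n**3 / M**1.5
```

At P equal to the returned limit the two bounds coincide, which is exactly where perfect strong scaling stops.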
Gauge Field Generation on Large-Scale GPU-Enabled Systems
Over the past years GPUs have been successfully applied to the task of
inverting the fermion matrix in lattice QCD calculations. Even strong scaling
to capability-level supercomputers, corresponding to O(100) GPUs or more has
been achieved. However, strong scaling a whole gauge field generation algorithm
to this regime requires significantly more functionality than just having the
matrix inverter utilize the GPUs, and this has not yet been accomplished. This
contribution extends QDP-JIT, the migration of SciDAC QDP++ to GPU-enabled
parallel systems, to help to strong scale the whole Hybrid Monte-Carlo to this
regime. Initial results are shown for gauge field generation with Chroma
simulating pure Wilson fermions on OLCF TitanDev.
Comment: The 30th International Symposium on Lattice Field Theory, June 24-29,
2012, Cairns, Australia (Acknowledgment and Citation added)
A Massively Parallel Algorithm for the Approximate Calculation of Inverse p-th Roots of Large Sparse Matrices
We present the submatrix method, a highly parallelizable method for the
approximate calculation of inverse p-th roots of large sparse symmetric
matrices which are required in different scientific applications. We follow the
idea of Approximate Computing, allowing imprecision in the final result in
order to be able to utilize the sparsity of the input matrix and to allow
massively parallel execution. For an n x n matrix, the proposed algorithm
allows the calculations to be distributed over n nodes with little
communication overhead. The approximate result matrix exhibits the same
sparsity pattern as the input matrix, allowing for efficient reuse of allocated
data structures.
We evaluate the algorithm with respect to the error that it introduces into
calculated results, as well as its performance and scalability. We demonstrate
that the error is relatively limited for well-conditioned matrices and that
results are still valuable for error-resilient applications like
preconditioning even for ill-conditioned matrices. We discuss the execution
time and scaling of the algorithm on a theoretical level and present a
distributed implementation of the algorithm using MPI and OpenMP. We
demonstrate the scalability of this implementation by running it on a
high-performance compute cluster comprising 1024 CPU cores, showing a speedup
of 665x compared to single-threaded execution.
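The per-column structure the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: for each column, the principal submatrix induced by that column's sparsity pattern is solved densely (here via a symmetric eigendecomposition) and only the matching column entries are written back, so the result keeps the input's sparsity pattern and the column loop is embarrassingly parallel. All function names are hypothetical:

```python
import numpy as np

def inv_pth_root_dense(B, p):
    # Dense inverse p-th root of a symmetric positive definite block,
    # via eigendecomposition: B^(-1/p) = V diag(w^(-1/p)) V^T.
    w, V = np.linalg.eigh(B)
    return V @ np.diag(w ** (-1.0 / p)) @ V.T

def submatrix_inv_pth_root(A, p):
    # Approximate inverse p-th root: one independent submatrix per column.
    # A is a dense np.ndarray with zeros marking structural sparsity --
    # an illustrative simplification of a real sparse format.
    n = A.shape[0]
    X = np.zeros_like(A)
    for i in range(n):                       # embarrassingly parallel loop
        idx = np.nonzero(A[:, i])[0]         # nonzero rows of column i
        sub = A[np.ix_(idx, idx)]            # principal submatrix
        local_i = np.where(idx == i)[0][0]   # position of i inside idx
        X[idx, i] = inv_pth_root_dense(sub, p)[:, local_i]
    return X
```

For a diagonal matrix the submatrices are 1x1 and the method is exact; for general sparse matrices the quality of the approximation depends on conditioning, as the abstract notes.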
Linear scaling computation of the Fock matrix. IX. Parallel computation of the Coulomb matrix
We present parallelization of a quantum-chemical tree-code [J. Chem. Phys.
{\bf 106}, 5526 (1997)] for linear scaling computation of the Coulomb matrix.
Equal time partition [J. Chem. Phys. {\bf 118}, 9128 (2003)] is used to load
balance computation of the Coulomb matrix. Equal time partition is a
measurement based algorithm for domain decomposition that exploits small
variation of the density between self-consistent-field cycles to achieve load
balance. Efficiency of the equal time partition is illustrated by several tests
involving both finite and periodic systems. It is found that equal time
partition is able to deliver 91--98% efficiency with 128 processors in the
most time consuming part of the Coulomb matrix calculation. The current
parallel quantum chemical tree code is able to deliver 63 -- 81% overall
efficiency on 128 processors with fine grained parallelism (less than two heavy
atoms per processor).
Comment: 7 pages, 6 figures
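The core idea of equal time partition, splitting work by measured cost rather than by item count, can be sketched in a few lines. This is a generic greedy illustration of measurement-based domain decomposition, not the quantum-chemistry code itself, and the function name is hypothetical:

```python
def equal_time_partition(times, P):
    # Split a sequence of measured per-task times into P contiguous
    # blocks of roughly equal total time. Because the density (and hence
    # the cost profile) changes little between SCF cycles, times measured
    # in one cycle are a good predictor for the next.
    target = sum(times) / P
    blocks, current, acc = [], [], 0.0
    for t in times:
        current.append(t)
        acc += t
        if acc >= target and len(blocks) < P - 1:
            blocks.append(current)     # close this block, start the next
            current, acc = [], 0.0
    blocks.append(current)             # last block takes the remainder
    return blocks
```

A real implementation would rebalance each cycle using the latest measurements; the greedy cut above is the simplest contiguous variant.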
Making extreme computations possible with virtual machines
State-of-the-art algorithms generate scattering amplitudes for high-energy
physics at leading order for high-multiplicity processes as compiled code (in
Fortran, C or C++). For complicated processes the size of these libraries can
become tremendous (many GiB). We show that amplitudes can be translated into
byte-code instructions, which even reduces the size by an order of magnitude.
The byte-code is interpreted by a Virtual Machine with runtimes comparable to
compiled code and a better scaling with additional legs. We study the
properties of this algorithm, as an extension of the Optimizing Matrix Element
Generator (O'Mega). The bytecode matrix elements are available as alternative
input for the event generator WHIZARD. The bytecode interpreter can be
implemented very compactly, which will help with a future implementation on
massively parallel GPUs.
Comment: 5 pages, 2 figures. arXiv admin note: substantial text overlap with
arXiv:1411.383
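The interpreter pattern the abstract relies on can be illustrated with a toy stack-based virtual machine. The instruction set below (PUSH, ADD, MUL) is invented for illustration and is far simpler than O'Mega's actual byte-code, but it shows why the interpreter itself stays very compact while the program (the instruction list) carries all the process-specific complexity:

```python
def run_bytecode(code):
    # Minimal stack-based interpreter: each instruction is a tuple of
    # an opcode and optional operands. A single small dispatch loop
    # executes arbitrarily large instruction streams.
    stack = []
    for op, *args in code:
        if op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack[-1]

# Evaluate (2 + 3) * 4 as a byte-code program:
program = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]
```

Because the dispatch loop has no per-process code, the same tiny interpreter serves any amplitude, which is what makes a compact GPU port plausible.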