Search CORE

7,285 research outputs found

Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds

Author: Ballard Grey
Demmel James
Holtz Olga
Lipshitz Benjamin
Schwartz Oded
Publication venue
Publication date: 01/01/2012
Field of study

A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the inter-processor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms.Comment: 4 pages, 1 figur

arXiv.org e-Print Archive

CiteSeerX

Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication

Author: Azad Ariful
Ballard Grey
Buluc Aydin
Demmel James
Grigori Laura
Schwartz Oded
Toledo Sivan
Williams Samuel
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 01/01/2016
Field of study

Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdos-Renyi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first ever implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrencies. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

eScholarship - University of California

Hal-Diderot

Minimizing Communication for Eigenproblems and the Singular Value Decomposition

Author: Ballard Grey
Demmel James
Dumitriu Ioana
Publication venue
Publication date: 01/01/2010
Field of study

Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving data, either between levels of a memory hierarchy, or between processors over a network. Communication often dominates arithmetic and represents a rapidly increasing proportion of the total cost, so we seek algorithms that minimize communication. In \cite{BDHS10} lower bounds were presented on the amount of communication required for essentially all

O(n^3)

-like algorithms for linear algebra, including eigenvalue problems and the SVD. Conventional algorithms, including those currently implemented in (Sca)LAPACK, perform asymptotically more communication than these lower bounds require. In this paper we present parallel and sequential eigenvalue algorithms (for pencils, nonsymmetric matrices, and symmetric matrices) and SVD algorithms that do attain these lower bounds, and analyze their convergence and communication costs.Comment: 43 pages, 11 figure

arXiv.org e-Print Archive

CiteSeerX

Scalable Task-Based Algorithm for Multiplication of Block-Rank-Sparse Matrices

Author: Baruch E.
Cannon L. E.
Choi J
Choi J.
Choi J.
Solomonik E.
Szabo A.
van de Geijn R. A.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 09/10/2015
Field of study

A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC). The novel features of our formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and (2) fine-grained task-based composition. These features make it tolerant of the load imbalance due to the irregular matrix structure and eliminate all artifactual sources of global synchronization.Scalability of iterative computation of square-root inverse of block-rank-sparse QC matrices is demonstrated; for full-rank (dense) matrices the performance of our SUMMA formulation usually exceeds that of the state-of-the-art dense MM implementations (ScaLAPACK and Cyclops Tensor Framework).Comment: 8 pages, 6 figures, accepted to IA3 2015. arXiv admin note: text overlap with arXiv:1504.0504

arXiv.org e-Print Archive

Crossref