Search CORE

136,102 research outputs found

Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds

Author: Ballard Grey
Demmel James
Holtz Olga
Lipshitz Benjamin
Schwartz Oded
Publication venue
Publication date: 01/01/2012
Field of study

A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the inter-processor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms.Comment: 4 pages, 1 figur

arXiv.org e-Print Archive

CiteSeerX

Parallel Sparse Matrix-Matrix Multiplication

Author: Alexandrov Luben
Publication venue: Karlsruher Institut für Technologie
Publication date: 25/01/2021
Field of study

The thesis investigates the BLAS-3 routine of sparse matrix-matrix multiplication (SpGEMM) based on the outer product method. Sev- eral algorithmic approaches have been implemented and empirically an- alyzed. The experiments have shown that an algorithm presented by Gustavson [22] outperforms other alternatives. In this work we propose optimization techniques that improve the scalability and the cache efficiency of the Gustavson’s algorithm for large matrices. Our approach succeeded to reduce the cache misses by more than a factor of five and to improve the net running time by 30% with some instances. The thesis also presents an algorithm for flops estima- tion, which can be used to determine an upper bound for the density of the result matrix. Furthermore, the work analyzes and empirically evaluates techniques for parallelization of the multiplication in a shared memory model by using Intel TBB and OpenMP. We investigate the cache efficiency of the algorithm in a parallel setting and compare several approaches for load balancing of the computation

KITopen

Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms

Author: Hasanov Khalid
Lastovetsky Alexey
Quintin Jean-Noel
Publication venue
Publication date: 18/06/2013
Field of study

Matrix multiplication is a very important computation kernel both in its own right as a building block of many scientific applications and as a popular representative for other scientific applications. Cannon algorithm which dates back to 1969 was the first efficient algorithm for parallel matrix multiplication providing theoretically optimal communication cost. However this algorithm requires a square number of processors. In the mid 1990s, the SUMMA algorithm was introduced. SUMMA overcomes the shortcomings of Cannon algorithm as it can be used on a non-square number of processors as well. Since then the number of processors in HPC platforms has increased by two orders of magnitude making the contribution of communication in the overall execution time more significant. Therefore, the state of the art parallel matrix multiplication algorithms should be revisited to reduce the communication cost further. This paper introduces a new parallel matrix multiplication algorithm, Hierarchical SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the communication cost of SUMMA by introducing a two-level virtual hierarchy into the two-dimensional arrangement of processors. Experiments on an IBM BlueGene-P demonstrate the reduction of communication cost up to 2.08 times on 2048 cores and up to 5.89 times on 16384 cores.Comment: 9 page

arXiv.org e-Print Archive

Crossref