Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms

A. Aggarwal; C. Ashcraft; D. Irony; D. Irony; E. Dekel; J. Demmel; R.A. Geijn Van De; R.C. Agarwal; S. Kumar; S.L. Johnsson; W. Gropp; W.F. McColl

Publication date: 1 January 2011

Abstract

One can use extra memory to parallelize matrix multiplication by storing $p^{1/3}$ redundant copies of the input matrices on $p$ processors in order to do asymptotically less communication than Cannon’s algorithm [2], and be faster in practice [1]. We call this algorithm “3D” because it arranges the $p$ processors in a 3D array, and Cannon’s algorithm “2D” because it stores a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of “2.5D algorithms”. For matrix multiplication, we can take advantage of any amount of extra memory to store $c$ copies of the data, for any $c \in \{1, 2, \ldots, \lfloor p^{1/3} \rfloor\}$, to reduce the bandwidth cost of Cannon’s algorithm by a factor of $c^{1/2}$ and the latency cost by a factor of $c^{3/2}$. We also show that these costs reach the lower bounds [13, 3], modulo $\mathrm{polylog}(p)$ factors. We similarly generalize LU decomposition to 2.5D and 3D, including communication-avoiding pivoting, a stable alternative to partial pivoting [7]. We prove a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while $c$ copies of the data can also reduce the bandwidth by a factor of $c^{1/2}$, the latency must increase by a factor of $c^{1/2}$, so that the 2D LU algorithm ($c = 1$) in fact minimizes latency. Preliminary results of 2.5D matrix multiplication on a Cray XT4 machine also demonstrate a performance gain of up to 3X with respect to Cannon’s algorithm. Careful choice of $c$ also yields up to a 2.4X speed-up over 3D matrix multiplication, due to a better balance between communication costs.
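
The scaling claims in the abstract can be tabulated with a small Python sketch. It only evaluates the asymptotic factors stated above (the admissible range of the replication factor c, and the resulting changes in bandwidth and latency cost), ignoring constants and polylog(p) terms. The function names `max_replication`, `matmul_factors`, and `lu_factors` are hypothetical, chosen for illustration, and are not taken from the paper.

```python
import math


def max_replication(p: int) -> int:
    """Largest integer c with c**3 <= p, i.e. floor(p^(1/3)).

    Integer correction steps guard against floating-point rounding
    of the cube root.
    """
    if p < 1:
        raise ValueError("p must be a positive processor count")
    c = round(p ** (1.0 / 3.0))
    while c ** 3 > p:
        c -= 1
    while (c + 1) ** 3 <= p:
        c += 1
    return max(c, 1)


def _check(p: int, c: int) -> None:
    # The abstract restricts c to {1, 2, ..., floor(p^(1/3))}.
    if not 1 <= c <= max_replication(p):
        raise ValueError(f"c must lie in 1..{max_replication(p)} for p={p}")


def matmul_factors(p: int, c: int) -> dict:
    """Asymptotic savings of 2.5D matrix multiplication relative to Cannon (c = 1)."""
    _check(p, c)
    return {
        "bandwidth_reduction": math.sqrt(c),  # c^(1/2)
        "latency_reduction": c ** 1.5,        # c^(3/2)
    }


def lu_factors(p: int, c: int) -> dict:
    """Asymptotic trade-off of 2.5D LU relative to 2D LU (c = 1)."""
    _check(p, c)
    return {
        "bandwidth_reduction": math.sqrt(c),  # c^(1/2)
        "latency_increase": math.sqrt(c),     # c^(1/2), per the latency lower bound
    }


if __name__ == "__main__":
    p = 64
    for c in range(1, max_replication(p) + 1):
        print(c, matmul_factors(p, c), lu_factors(p, c))
```

For p = 64 the admissible values are c = 1 through 4. With c = 4 the sketch reports a 2x bandwidth and 8x latency reduction for matrix multiplication, while for LU the 2x bandwidth reduction is paid for with a 2x latency increase, which is why c = 1 minimizes LU latency.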
