Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms

Abstract

One can use extra memory to parallelize matrix multiplication by storing p 1/3 redundant copies of the input matrices on p processors in order to do asymptotically less communication than Cannon’s algorithm [2], and be faster in practice [1]. We call this algorithm “3D ” because it arranges the p processors in a 3D array, and Cannon’s algorithm “2D ” because it stores a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of “2.5D algorithms”. For matrix multiplication, we can take advantage of any amount of extra memory to store c copies of the data, for any c ∈{1, 2,..., ⌊p 1/3 ⌋}, to reduce the bandwidth cost of Cannon’s algorithm by a factor of c 1/2 and the latency cost by a factor c 3/2. We also show that these costs reach the lower bounds [13, 3], modulo polylog(p) factors. We similarly generalize LU decomposition to 2.5D and 3D, including communication-avoiding pivoting, a stable alternative to partial-pivoting [7]. We prove a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while c copies of the data can also reduce the bandwidth by a factor of c 1/2, the latency must increase by a factor of c 1/2, so that the 2D LU algorithm (c = 1) in fact minimizes latency. Preliminary results of 2.5D matrix multiplication on a Cray XT4 machine also demonstrate a performance gain of up to 3X with respect to Cannon’s algorithm. Careful choice of c also yields up to a 2.4X speed-up over 3D matrix multiplication, due to a better balance between communication costs

    Similar works

    Full text

    thumbnail-image

    Available Versions

    Last time updated on 14/03/2019