Second-generation polyalgorithms for parallel dense-matrix multiplication

Abstract

The polyalgorithm library, originally designed in 1991-1993 by Robert Falgout, Jin Li, and Anthony Skjellum, includes fourteen dense matrix multiplication algorithms mapped onto two-dimensional process grids using the Message Passing Interface (MPI). The goal of this thesis is to achieve optimized performance of parallel, dense linear algebra algorithms by varying the algorithm as a function of problem size, shape, data layout, concurrency, and architecture. We integrate these algorithms with an intra-node BLAS DGEMM kernel designed by Thomas Hines (Tennessee Tech), which improves DGEMM performance in the fat-by-thin region of dense matrix multiplication. We add a rank-k-based SUMMA algorithm, which performs better than rank-1-based SUMMA. We study performance on two cluster systems and report the performance and improvements achieved. We compare and contrast our results with COSMA, a recent, highly optimized approach, and verify that COSMA, using optimal 3D grid decompositions, has significant advantages provided its preferred data layouts can be used.
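To illustrate the difference between rank-1-based and rank-k-based SUMMA mentioned above, the following is a minimal serial sketch, not the thesis implementation. It simulates a two-dimensional process grid in plain Python without MPI. The function names `matmul_reference` and `summa_simulated` and the parameter `panel` are hypothetical, and the sketch assumes the matrix dimensions divide evenly by the grid dimensions. Each outer iteration stands in for one round in which a column panel of A and a row panel of B would be broadcast along process rows and columns. With `panel = 1` this is the rank-1 variant, and a larger `panel` gives a rank-k update with fewer rounds.

```python
def matmul_reference(A, B):
    """Plain triple-loop product, used only to check the sketch."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def summa_simulated(A, B, grid_rows, grid_cols, panel):
    """Serial simulation of SUMMA on a grid_rows x grid_cols process grid.

    Returns the product C and the number of panel rounds. In a real MPI
    code each round would involve broadcasts; here they are only implied.
    """
    m, k, n = len(A), len(B), len(B[0])
    if m % grid_rows or n % grid_cols:
        # Assumption made for brevity: blocks are all the same size.
        raise ValueError("matrix dimensions must divide evenly over the grid")
    mb, nb = m // grid_rows, n // grid_cols

    C = [[0] * n for _ in range(m)]
    rounds = 0
    for p in range(0, k, panel):
        hi = min(p + panel, k)  # the last panel may be narrower
        rounds += 1
        # Every simulated process (pi, pj) updates only its own block of C
        # using the current column panel of A and row panel of B.
        for pi in range(grid_rows):
            for pj in range(grid_cols):
                for i in range(pi * mb, (pi + 1) * mb):
                    for j in range(pj * nb, (pj + 1) * nb):
                        C[i][j] += sum(A[i][t] * B[t][j] for t in range(p, hi))
    return C, rounds
```

Calling `summa_simulated(A, B, 2, 2, 1)` and `summa_simulated(A, B, 2, 2, 3)` on the same inputs yields the same product, while the number of rounds drops from the inner dimension to its ceiling division by `panel`. Fewer, larger rounds are one commonly cited reason a rank-k formulation can outperform rank-1 updates, although this sketch does not measure performance.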
