High-performance and Memory-saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU
Optimizing High Performance Markov Clustering for Pre-Exascale Architectures
HipMCL is a high-performance distributed-memory implementation of the popular
Markov Cluster Algorithm (MCL) and can cluster large-scale networks within
hours using a few thousand CPU-equipped nodes. It relies on sparse matrix
computations and makes heavy use of the sparse matrix-sparse matrix
multiplication (SpGEMM) kernel. The existing parallel algorithms in HipMCL do
not scale to exascale architectures, both because their communication costs
dominate the runtime at large concurrencies and because they cannot take
advantage of increasingly popular accelerators.
In this work, we systematically remove scalability and performance
bottlenecks of HipMCL. We enable GPUs by performing the expensive expansion
phase of the MCL algorithm on the GPU. We propose a CPU-GPU joint distributed
SpGEMM algorithm called pipelined Sparse SUMMA and integrate a fast and
accurate probabilistic memory-requirement estimator. We develop a new merging
algorithm for incrementally processing the partial results produced by the
GPUs, which improves overlap efficiency and peak memory usage. We also
integrate a recent, faster algorithm for performing SpGEMM on CPUs. We
validate our new algorithms and optimizations with extensive evaluations. With
GPUs enabled and the new algorithms integrated, HipMCL is up to 12.4x faster,
clustering a network with 70 million proteins and 68 billion connections in
just under 15 minutes using 1024 nodes of ORNL's Summit supercomputer.
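To make the role of SpGEMM in MCL concrete, here is a minimal single-node sketch of the expansion step the abstract refers to: squaring a column-stochastic transition matrix, which is exactly a sparse matrix-sparse matrix product. This uses SciPy's CSR multiplication as a stand-in; HipMCL's distributed pipelined Sparse SUMMA algorithm, memory estimator, and GPU offload are far more involved, and the function names below are illustrative, not from HipMCL.

```python
# Toy MCL expansion via SpGEMM (sketch only; not HipMCL's implementation).
import numpy as np
from scipy.sparse import csr_matrix

def normalize_columns(m: csr_matrix) -> csr_matrix:
    """Make each column sum to 1, i.e. column-stochastic."""
    col_sums = np.asarray(m.sum(axis=0)).ravel()
    col_sums[col_sums == 0] = 1.0          # avoid division by zero
    return m @ csr_matrix(np.diag(1.0 / col_sums))

def mcl_expansion(m: csr_matrix) -> csr_matrix:
    """One MCL expansion step: square the matrix (the SpGEMM kernel)."""
    return m @ m

# Tiny graph with two disconnected 2-node clusters, each node self-looped.
rows = [0, 0, 1, 1, 2, 2, 3, 3]
cols = [0, 1, 0, 1, 2, 3, 2, 3]
vals = [1.0] * 8
a = normalize_columns(csr_matrix((vals, (rows, cols)), shape=(4, 4)))
sq = mcl_expansion(a)   # nonzeros stay within each cluster's block
```

Because the two clusters are disconnected, the squared matrix keeps its block structure: entries linking node 0 to node 2 remain zero, which is what MCL's alternating expansion and inflation ultimately exploits to separate clusters.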
A Systematic Survey of General Sparse Matrix-Matrix Multiplication
SpGEMM (general sparse matrix-matrix multiplication) has attracted much
attention from researchers in the fields of multigrid methods and graph
analysis. Many optimization techniques have been developed over the decades
for particular application fields and computing architectures. The objective
of this paper is to provide a structured and comprehensive overview of the
research on SpGEMM. Existing optimization techniques are grouped into
categories based on their target problems and architectures. Covered topics
include SpGEMM applications, size prediction of the result matrix, matrix
partitioning and load balancing, result accumulation, and target
architecture-oriented optimization. The rationale behind the algorithms in
each category is analyzed, and a wide range of SpGEMM algorithms are
summarized. This survey covers the progress and research status of SpGEMM
optimization from 1977 to 2019. More specifically, an experimental comparative
study of existing implementations on CPU and GPU is presented. Based on our
findings, we highlight future research directions and how future studies can
leverage our findings to encourage better design and implementation.
Comment: 19 pages, 11 figures, 2 tables, 4 algorithms