Optimizing High Performance Markov Clustering for Pre-Exascale Architectures
HipMCL is a high-performance distributed-memory implementation of the popular
Markov Cluster Algorithm (MCL) and can cluster large-scale networks within
hours using a few thousand CPU-equipped nodes. It relies on sparse matrix
computations and makes heavy use of the sparse matrix-sparse matrix
multiplication kernel (SpGEMM). The existing parallel algorithms in HipMCL do
not scale to exascale architectures, both because their communication costs
dominate the runtime at large concurrencies and because they cannot take
advantage of increasingly popular accelerators.
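The SpGEMM kernel at the heart of HipMCL multiplies two sparse matrices. Below is a minimal single-node sketch of Gustavson's classic row-wise algorithm, using a hypothetical dict-of-rows sparse format for readability; the actual HipMCL kernels operate on distributed compressed sparse matrices.

```python
def spgemm(A, B):
    """Gustavson's row-wise SpGEMM sketch (single node, not distributed).

    A, B: sparse matrices in a hypothetical {row: {col: value}} format.
    Returns C = A @ B in the same format, keeping only nonzero rows.
    """
    C = {}
    for i, row_a in A.items():
        acc = {}  # sparse accumulator for row i of C
        for k, a_ik in row_a.items():
            # scale row k of B by a_ik and add it into the accumulator
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C
```

The sparse accumulator per output row is the part that dominates memory and time at scale, which is why the output size must be estimated in advance.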
In this work, we systematically remove scalability and performance
bottlenecks of HipMCL. We enable GPUs by performing the expensive expansion
phase of the MCL algorithm on the GPU. We propose a CPU-GPU joint distributed
SpGEMM algorithm called pipelined Sparse SUMMA and integrate a probabilistic
memory requirement estimator that is fast and accurate. We develop a new
merging algorithm for the incremental processing of partial results produced by
the GPUs, which improves overlap efficiency and reduces peak memory usage. We
also integrate a recent and faster algorithm for performing SpGEMM on CPUs. We
validate our new algorithms and optimizations with extensive evaluations. With
GPUs enabled and the new algorithms integrated, HipMCL is up to 12.4x faster,
clustering a network with 70 million proteins and 68 billion connections in
just under 15 minutes using 1024 nodes of ORNL's Summit supercomputer.
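To illustrate the two MCL phases the abstract refers to: each iteration squares the column-stochastic matrix (expansion, the expensive phase offloaded to GPUs) and then raises entries to a power and renormalizes columns (inflation). A minimal dense sketch in plain Python, assuming a small column-stochastic matrix; the real implementation is distributed and sparse.

```python
def mcl_step(M, r=2.0):
    """One MCL iteration on a column-stochastic dense matrix M (list of rows).

    Expansion (M @ M) is the SpGEMM-heavy phase HipMCL runs on GPUs;
    inflation raises entries to the power r and renormalizes each column.
    """
    n = len(M)
    # expansion: E = M @ M
    E = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    # inflation: elementwise power, then column normalization
    P = [[E[i][j] ** r for j in range(n)] for i in range(n)]
    for j in range(n):
        s = sum(P[i][j] for i in range(n))
        for i in range(n):
            P[i][j] /= s
    return P
```

Inflation sharpens the probability distribution in each column, which is what makes the iteration converge toward cluster indicators.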
Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale
Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in
various graph, scientific computing and machine learning algorithms. In this
paper, we consider SpGEMMs performed on hundreds of thousands of processors
generating trillions of nonzeros in the output matrix. Distributed SpGEMM at
this extreme scale faces two key challenges: (1) high communication cost and
(2) inadequate memory to generate the output. We address these challenges with
an integrated communication-avoiding and memory-constrained SpGEMM algorithm
that scales to 262,144 cores (more than 1 million hardware threads) and can
multiply sparse matrices of any size as long as inputs and a fraction of output
fit in the aggregated memory. As we go from 16,384 cores to 262,144 cores on a
Cray XC40 supercomputer, the new SpGEMM algorithm runs 10x faster when
multiplying large-scale protein-similarity matrices.
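One way to satisfy the constraint that only a fraction of the output need fit in memory is to form C = A·B in batches of output rows, materializing one batch at a time. The following is a hypothetical single-node sketch of that batching idea only; the paper's algorithm is distributed and communication-avoiding, which this does not model.

```python
def spgemm_batched(A, B, rows_per_batch):
    """Memory-constrained SpGEMM sketch: yield C = A @ B in row batches.

    Only one batch of output rows is resident at a time; a real
    implementation would stream each batch to its consumer or to disk.
    Matrices use a hypothetical {row: {col: value}} sparse format.
    """
    rows = sorted(A)
    for start in range(0, len(rows), rows_per_batch):
        batch = {}
        for i in rows[start:start + rows_per_batch]:
            acc = {}  # sparse accumulator for output row i
            for k, a_ik in A[i].items():
                for j, b_kj in B.get(k, {}).items():
                    acc[j] = acc.get(j, 0.0) + a_ik * b_kj
            if acc:
                batch[i] = acc
        yield batch  # peak memory now scales with the batch, not with C
```

Choosing the batch size from an estimate of the output's nonzero count is what ties this idea to the memory-requirement estimation discussed above.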