The Reverse Cuthill-McKee Algorithm in Distributed-Memory
Ordering the vertices of a graph is key to minimizing fill-in and data structure
size in sparse direct solvers, maximizing locality in iterative solvers, and
improving performance in graph algorithms. Except for naturally parallelizable
ordering methods such as nested dissection, many important ordering methods
have not been efficiently mapped to distributed-memory architectures. In this
paper, we present the first-ever distributed-memory implementation of the
reverse Cuthill-McKee (RCM) algorithm for reducing the profile of a sparse
matrix. Our parallelization uses a two-dimensional sparse matrix decomposition.
We achieve high performance by decomposing the problem into a small number of
primitives and utilizing optimized implementations of these primitives. Our
implementation shows strong scaling up to 1024 cores for smaller matrices and
up to 4096 cores for larger matrices.
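The paper's distributed implementation builds on a 2D sparse matrix decomposition and a small set of optimized primitives, which are not reproduced here. The sketch below shows only the underlying RCM ordering logic in serial form, assuming an adjacency-list graph; the low-degree start-vertex heuristic stands in for the usual pseudo-peripheral vertex selection and is an illustrative choice, not the authors' implementation.

// Minimal serial sketch of reverse Cuthill-McKee (RCM) on an adjacency list.
#include <algorithm>
#include <queue>
#include <vector>

std::vector<int> reverse_cuthill_mckee(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> order;
    std::vector<bool> visited(n, false);

    // Consider vertices in order of increasing degree so each connected
    // component is entered from a low-degree vertex.
    std::vector<int> by_degree(n);
    for (int v = 0; v < n; ++v) by_degree[v] = v;
    std::sort(by_degree.begin(), by_degree.end(),
              [&](int a, int b) { return adj[a].size() < adj[b].size(); });

    for (int s : by_degree) {
        if (visited[s]) continue;
        std::queue<int> q;
        q.push(s);
        visited[s] = true;
        while (!q.empty()) {
            int v = q.front();
            q.pop();
            order.push_back(v);
            // Enqueue unvisited neighbors in order of increasing degree.
            std::vector<int> nbrs;
            for (int u : adj[v]) if (!visited[u]) nbrs.push_back(u);
            std::sort(nbrs.begin(), nbrs.end(),
                      [&](int a, int b) { return adj[a].size() < adj[b].size(); });
            for (int u : nbrs) { visited[u] = true; q.push(u); }
        }
    }
    // Reversing the Cuthill-McKee order yields the RCM ordering.
    std::reverse(order.begin(), order.end());
    return order;
}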
An Empirical Evaluation of Allgatherv on Multi-GPU Systems
Applications for deep learning and big data analytics have compute and memory
requirements that exceed the limits of a single GPU. However, effectively
scaling out an application to multiple GPUs is challenging due to the
complexities of communication between the GPUs, particularly for collective
communication with irregular message sizes. In this work, we provide a
performance evaluation of the Allgatherv routine on multi-GPU systems, focusing
on GPU network topology and the communication library used. We present results
from the OSU micro-benchmarks and conduct a case study of sparse tensor
factorization, an application that uses Allgatherv with highly irregular
message sizes. We extend our existing tensor factorization tool to run on
systems with different node counts and a varying number of GPUs per node. We then
evaluate the communication performance of our tool when using traditional MPI,
CUDA-aware MVAPICH, and NCCL across a suite of real-world data sets on three
different systems: a 16-node cluster with one GPU per node, NVIDIA's DGX-1 with
8 GPUs, and Cray's CS-Storm with 16 GPUs. Our results show that irregularity in
the tensor data sets produces trends that contradict those in the OSU
micro-benchmarks, as well as trends that are absent from the benchmarks.
Comment: 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
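The routine under evaluation is the standard MPI_Allgatherv collective, in which each rank contributes a different number of elements. The sketch below shows the basic pattern with plain MPI and placeholder, rank-dependent message sizes; it does not reproduce the paper's CUDA-aware MVAPICH or NCCL paths or its tensor factorization workload.

// Minimal sketch of an irregular Allgatherv exchange with plain MPI.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Each rank contributes a different number of elements (irregular sizes);
    // the rank-dependent count here is a placeholder for real workload data.
    int my_count = rank + 1;
    std::vector<double> sendbuf(my_count, static_cast<double>(rank));

    // Exchange per-rank counts, then build receive displacements.
    std::vector<int> counts(nprocs), displs(nprocs);
    MPI_Allgather(&my_count, 1, MPI_INT, counts.data(), 1, MPI_INT, MPI_COMM_WORLD);
    int total = 0;
    for (int i = 0; i < nprocs; ++i) { displs[i] = total; total += counts[i]; }

    // After the call, every rank holds the concatenation of all contributions.
    std::vector<double> recvbuf(total);
    MPI_Allgatherv(sendbuf.data(), my_count, MPI_DOUBLE,
                   recvbuf.data(), counts.data(), displs.data(), MPI_DOUBLE,
                   MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}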