GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU
Graph algorithms are challenging to implement efficiently on new parallel
hardware such as GPUs for three reasons: (1) the difficulty of devising
suitable graph building blocks, (2) load imbalance on parallel hardware, and
(3) the low arithmetic intensity of graph problems.
To address some of these challenges, GraphBLAS is an innovative, ongoing
effort by the graph analytics community to propose building blocks based on
sparse linear algebra, which will allow graph algorithms to be expressed in a
performant, succinct, composable and portable manner. In this paper, we examine
the performance challenges of a linear-algebra-based approach to building graph
frameworks and describe new design principles for overcoming these bottlenecks.
Among the new design principles is exploiting input sparsity, which allows
users to write graph algorithms without specifying push and pull direction.
Exploiting output sparsity allows users to tell the backend which values of
the output of a single vectorized computation they do not need computed.
Load balancing distributes work evenly among parallel workers; we describe
the load-balancing features needed to handle graphs with different
characteristics. The design principles described in this paper
have been implemented in "GraphBLAST", the first open-source high-performance
linear-algebra-based graph framework on NVIDIA GPUs. The results
show that on a single GPU, GraphBLAST has on average at least an order of
magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL,
comparable performance to the fastest GPU hardwired primitives and
shared-memory graph frameworks Ligra and Gunrock, and better performance than
any other GPU graph framework, while offering a simpler and more concise
programming model.
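As a rough illustration of the masking idea (a SciPy sketch, not GraphBLAST's actual API), the following expresses one BFS frontier expansion as a sparse matrix-vector product whose output is masked by the set of already-visited vertices:

```python
# A minimal sketch (SciPy stand-in, not GraphBLAST's API) of a masked
# sparse matrix-vector product used for BFS frontier expansion.
import numpy as np
import scipy.sparse as sp

# Small undirected example graph as a sparse adjacency matrix.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
rows, cols = zip(*edges)
n = 5
data = np.ones(2 * len(edges))
A = sp.coo_matrix((data, (rows + cols, cols + rows)), shape=(n, n)).tocsr()

visited = np.zeros(n, dtype=bool)
frontier = np.zeros(n, dtype=bool)
frontier[0] = True            # start BFS from vertex 0
visited[0] = True

level = 0
while True:
    # Input sparsity: only frontier vertices contribute to the product.
    reached = A @ frontier.astype(A.dtype)
    # Output mask ("exploiting output sparsity"): already-visited vertices are
    # masked out, so a masked backend never needs to compute their values.
    frontier = (reached > 0) & ~visited
    if not frontier.any():
        break
    visited |= frontier
    level += 1
    print(f"level {level}: vertices {np.flatnonzero(frontier)}")
```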
The Reverse Cuthill-McKee Algorithm in Distributed-Memory
Ordering vertices of a graph is key to minimizing fill-in and data structure
size in sparse direct solvers, maximizing locality in iterative solvers, and
improving performance in graph algorithms. Except for naturally parallelizable
ordering methods such as nested dissection, many important ordering methods
have not been efficiently mapped to distributed-memory architectures. In this
paper, we present the first-ever distributed-memory implementation of the
reverse Cuthill-McKee (RCM) algorithm for reducing the profile of a sparse
matrix. Our parallelization uses a two-dimensional sparse matrix decomposition.
We achieve high performance by decomposing the problem into a small number of
primitives and utilizing optimized implementations of these primitives. Our
implementation shows strong scaling up to 1024 cores for smaller matrices and
up to 4096 cores for larger matrices.
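For readers unfamiliar with RCM itself, the following serial SciPy sketch (not the paper's distributed 2D implementation) shows the effect of the ordering on the bandwidth of a sparse symmetric matrix:

```python
# Serial illustration of RCM using SciPy's built-in implementation:
# permute a sparse symmetric matrix to reduce its bandwidth/profile.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(m):
    """Maximum |i - j| over the nonzeros of a sparse matrix."""
    coo = m.tocoo()
    return int(np.abs(coo.row - coo.col).max())

# Random sparse symmetric matrix standing in for a solver matrix.
rng = np.random.default_rng(0)
B = sp.random(200, 200, density=0.02, random_state=rng, format="csr")
A = (B + B.T).tocsr()

perm = reverse_cuthill_mckee(A, symmetric_mode=True)   # RCM permutation
A_rcm = A[perm][:, perm]                               # symmetric reordering

print("bandwidth before:", bandwidth(A))
print("bandwidth after RCM:", bandwidth(A_rcm))
```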
Distributed-Memory Breadth-First Search on Massive Graphs
This chapter studies the problem of traversing large graphs using the
breadth-first search order on distributed-memory supercomputers. We consider
both the traditional level-synchronous top-down algorithm as well as the
recently discovered direction-optimizing algorithm. We analyze the performance
and scalability trade-offs in using different local data structures such as CSR
and DCSC, enabling in-node multithreading, and graph decompositions such as 1D
and 2D decomposition.
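A shared-memory Python sketch of the level-synchronous top-down baseline is shown below; the distributed variants studied in the chapter additionally partition the graph (1D by vertices or 2D by edge blocks) and exchange the frontier between processes at each level:

```python
# Shared-memory sketch of level-synchronous top-down BFS; the distributed
# versions partition the graph and communicate the frontier every level.
from collections import defaultdict

def level_sync_bfs(edges, source, n):
    """Return the BFS level of every vertex (-1 if unreachable)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    levels = [-1] * n
    levels[source] = 0
    frontier = [source]
    depth = 0
    while frontier:                      # one synchronized step per level
        depth += 1
        next_frontier = []
        for u in frontier:               # "top-down": expand frontier vertices
            for v in adj[u]:
                if levels[v] == -1:
                    levels[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return levels

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
print(level_sync_bfs(edges, source=0, n=5))   # [0, 1, 1, 2, 3]
```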
Implementing Push-Pull Efficiently in GraphBLAS
We factor Beamer's push-pull, also known as direction-optimized
breadth-first search (DOBFS), into three separable optimizations, and analyze them
for generalizability, asymptotic speedup, and contribution to overall speedup.
We demonstrate that masking is critical for high performance and can be
generalized to all graph algorithms where the sparsity pattern of the output is
known a priori. We show that these graph algorithm optimizations, which
together constitute DOBFS, can be neatly and separably described using linear
algebra and can be expressed in the GraphBLAS linear-algebra-based framework.
We provide experimental evidence that with these optimizations, a DOBFS
expressed in a linear-algebra-based graph framework attains competitive
performance with state-of-the-art graph frameworks on the GPU and on a
multi-threaded CPU, achieving 101 GTEPS on a Scale 22 RMAT graph.
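A minimal Python sketch of the direction-optimizing switch is given below; the threshold `alpha` is an illustrative assumption, not the tuned heuristic from the paper:

```python
# Sketch of Beamer's direction-optimizing switch: push (top-down) while the
# frontier is small, pull (bottom-up) once it touches many edges.
def direction_optimizing_bfs(adj, source, alpha=0.25):
    """adj: list of neighbor lists for an undirected graph."""
    n = len(adj)
    num_edges = sum(len(nbrs) for nbrs in adj)
    parent = [-1] * n
    parent[source] = source
    frontier = {source}

    while frontier:
        frontier_edges = sum(len(adj[u]) for u in frontier)
        next_frontier = set()
        if frontier_edges < alpha * num_edges:
            # Push (top-down): expand edges out of the current frontier.
            for u in frontier:
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        next_frontier.add(v)
        else:
            # Pull (bottom-up): each unvisited vertex scans its neighbors and
            # stops at the first one found in the frontier; skipping visited
            # vertices is what the GraphBLAS output mask expresses.
            for v in range(n):
                if parent[v] == -1:
                    for u in adj[v]:
                        if u in frontier:
                            parent[v] = u
                            next_frontier.add(v)
                            break
        frontier = next_frontier
    return parent

adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
print(direction_optimizing_bfs(adj, source=0))   # [0, 0, 0, 1, 3]
```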
Communication-Avoiding Optimization Methods for Distributed Massive-Scale Sparse Inverse Covariance Estimation
Across a variety of scientific disciplines, sparse inverse covariance
estimation is a popular tool for capturing the underlying dependency
relationships in multivariate data. Unfortunately, most estimators are not
scalable enough to handle the sizes of modern high-dimensional data sets (often
on the order of terabytes), and assume Gaussian samples. To address these
deficiencies, we introduce HP-CONCORD, a highly scalable optimization method
for estimating a sparse inverse covariance matrix based on a regularized
pseudolikelihood framework, without assuming Gaussianity. Our parallel proximal
gradient method uses a novel communication-avoiding linear algebra algorithm
and runs across a multi-node cluster with up to 1k nodes (24k cores), achieving
parallel scalability on problems with up to ~819 billion parameters (1.28
million dimensions); even on a single node, HP-CONCORD demonstrates
scalability, outperforming a state-of-the-art method. We also use HP-CONCORD to
estimate the underlying dependency structure of the brain from fMRI data, and
use the result to identify functional regions automatically. The results show
good agreement with a clustering from the neuroscience literature.
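As background for the optimization method, the following toy NumPy sketch shows a proximal gradient iteration with an l1 penalty handled by soft-thresholding; it is applied to a lasso-style problem rather than the CONCORD pseudolikelihood, and contains none of the paper's communication-avoiding distribution:

```python
# Toy serial proximal gradient (ISTA) sketch: smooth loss plus l1 penalty,
# with the penalty handled by elementwise soft-thresholding.
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proximal_gradient(grad_f, x0, step, lam, iters=200):
    """Minimize f(x) + lam * ||x||_1 given a gradient oracle for smooth f."""
    x = x0.copy()
    for _ in range(iters):
        x = soft_threshold(x - step * grad_f(x), step * lam)
    return x

# Example smooth loss: f(x) = 0.5 * ||A x - b||^2 for a sparse regression.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true + 0.1 * rng.standard_normal(100)

step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1 / Lipschitz constant
x_hat = proximal_gradient(lambda x: A.T @ (A @ x - b),
                          np.zeros(20), step, lam=5.0)
print(np.round(x_hat, 2))   # mostly zeros, large entries approximately recovered
```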
10 Years Later: Cloud Computing is Closing the Performance Gap
Can cloud computing infrastructures provide HPC-competitive performance for
scientific applications broadly? Despite prolific related literature, this
question remains open. Answers are crucial for designing future systems and
democratizing high-performance computing. We present a multi-level approach to
investigate the performance gap between HPC and cloud computing, isolating
different variables that contribute to this gap. Our experiments are divided
into (i) hardware and system microbenchmarks and (ii) user application proxies.
The results show that today's high-end cloud computing can deliver
HPC-competitive performance not only for computationally intensive applications
but also for memory- and communication-intensive applications, at least at
modest scales, thanks to the high-speed memory systems, interconnects, and
dedicated batch scheduling now available on some cloud platforms.
Accelerating Multi-GPU Embedding Retrieval with PGAS-Style Communication for Deep Learning Recommendation Systems
In this paper, we propose using Partitioned Global Address Space (PGAS) GPU
one-sided asynchronous small messages to replace the widely used collective
communication calls for sparse-input multi-GPU embedding retrieval in deep
learning recommendation systems. This GPU PGAS communication approach achieves
(1) better communication and computation overlap, (2) smoother network usage,
and (3) reduced overhead (by avoiding the data unpacking and rearrangement
steps associated with collective communication calls). We implement a CUDA
embedding retrieval backend for PyTorch that supports the proposed PGAS
communication scheme and evaluate it on deep learning recommendation inference
passes. Our backend outperforms the baseline using NCCL collective calls,
achieving 1.97x speedup for the weak scaling test and 2.63x speedup for the
strong scaling test in a 4-GPU NVLink-connected system.
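The following plain-NumPy sketch (a conceptual stand-in, not the paper's CUDA/PGAS backend) illustrates the access pattern: each embedding row is fetched directly from the shard that owns it, rather than being packed and exchanged through a collective call:

```python
# Conceptual sketch of PGAS-style embedding retrieval from a row-partitioned
# table; each lookup is a direct read from the owning partition.
import numpy as np

num_parts, rows_per_part, dim = 4, 1000, 16
rng = np.random.default_rng(0)
# One embedding shard per "GPU".
shards = [rng.standard_normal((rows_per_part, dim)) for _ in range(num_parts)]

def pgas_style_lookup(indices):
    """Gather embedding rows for global indices via per-owner direct reads."""
    out = np.empty((len(indices), dim))
    owners = indices // rows_per_part          # which partition owns each row
    local = indices % rows_per_part            # row offset within the shard
    for p in range(num_parts):
        mask = owners == p
        # In the PGAS scheme this is an asynchronous one-sided small-message
        # read into the destination buffer; here it is a plain array gather.
        out[mask] = shards[p][local[mask]]
    return out

batch = rng.integers(0, num_parts * rows_per_part, size=32)
print(pgas_style_lookup(batch).shape)          # (32, 16)
```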
Extreme Scale De Novo Metagenome Assembly
Metagenome assembly is the process of transforming a set of short,
overlapping, and potentially erroneous DNA segments from environmental samples
into an accurate representation of the underlying microbiomes' genomes.
State-of-the-art tools require large shared-memory machines and cannot handle
contemporary metagenome datasets that exceed terabytes in size. In this paper,
we introduce the MetaHipMer pipeline, a high-quality and high-performance
metagenome assembler that employs an iterative de Bruijn graph approach.
MetaHipMer leverages a specialized scaffolding algorithm that produces long
scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is
end-to-end parallelized using the Unified Parallel C language and therefore can
run seamlessly on shared and distributed-memory systems. Experimental results
show that MetaHipMer matches or outperforms the state-of-the-art tools in terms
of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and
is able to assemble previously intractable grand challenge metagenomes. We
demonstrate the unprecedented capability of MetaHipMer by computing the first
full assembly of the Twitchell Wetlands dataset, consisting of 7.5 billion
reads, totaling 2.6 TB.
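As a toy, single-node illustration of the de Bruijn graph idea the assembler iterates on (the real pipeline is a distributed UPC implementation with scaffolding), the following builds the graph whose edges are the k-mers observed in a handful of reads:

```python
# Toy de Bruijn graph construction: k-mers from the reads become edges
# between (k-1)-mer nodes.
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

reads = ["ACGTAC", "CGTACG", "GTACGT"]
for node, succs in de_bruijn_graph(reads, k=4).items():
    print(node, "->", sorted(succs))
# ACG -> ['CGT'], CGT -> ['GTA'], GTA -> ['TAC'], TAC -> ['ACG']
```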
Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices
Identifying similar protein sequences is a core step in many computational
biology pipelines, such as detection of homologous protein sequences,
generation of protein similarity graphs for downstream analysis, functional
annotation, and gene location. Performance and scalability of protein
similarity searches have
proven to be a bottleneck in many bioinformatics pipelines due to increases in
cheap and abundant sequencing data. This work presents new distributed-memory
software, PASTIS, which relies on sparse matrix computations for efficient
identification of possibly similar proteins. We use distributed sparse matrices
for scalability and show that the sparse matrix infrastructure is a great fit
for protein similarity searches when coupled with a fully-distributed
dictionary of sequences that allows remote sequence requests to be fulfilled.
Our algorithm incorporates the unique bias in amino acid sequence substitution
in searches without altering the basic sparse matrix model, and in turn,
achieves ideal scaling up to millions of protein sequences.
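A simplified SciPy sketch of the sparse-matrix formulation is shown below; it omits PASTIS's substitution-aware scoring and the distributed sequence dictionary, and simply counts shared k-mers with a sparse matrix product to produce candidate pairs:

```python
# Simplified sketch: build a sequences-by-k-mers sparse matrix A, so that
# A @ A.T (an SpGEMM) counts shared k-mers and its off-diagonal nonzeros
# are the candidate similar pairs.
import numpy as np
import scipy.sparse as sp

def kmer_matrix(seqs, k):
    """0/1 matrix with one row per sequence and one column per distinct k-mer."""
    vocab = {}
    rows, cols = [], []
    for i, s in enumerate(seqs):
        for j in range(len(s) - k + 1):
            col = vocab.setdefault(s[j:j + k], len(vocab))
            rows.append(i)
            cols.append(col)
    data = np.ones(len(rows))
    return sp.coo_matrix((data, (rows, cols)),
                         shape=(len(seqs), len(vocab))).tocsr()

seqs = ["MKVLAA", "MKVLGG", "PQRSTV"]
A = kmer_matrix(seqs, k=3)
overlap = A @ A.T                        # shared k-mer counts per sequence pair
overlap.setdiag(0)
overlap.eliminate_zeros()
print(sorted((int(i), int(j)) for i, j in zip(*overlap.nonzero())))
# [(0, 1), (1, 0)]
```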