TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs
Recently, graph neural networks (GNNs), as the backbone of graph-based
machine learning, have demonstrated great success in various domains (e.g.,
e-commerce). However, the performance of GNNs is usually unsatisfactory due to
their highly sparse and irregular graph-based operations. To this end, we
propose TC-GNN, the first GPU Tensor Core Unit (TCU) based GNN acceleration
framework. The core idea is to reconcile the "Sparse" GNN computation with the
"Dense" TCU. Specifically, we conduct an in-depth analysis of the sparse
operations in mainstream GNN computing frameworks. We introduce a novel sparse
graph translation technique to facilitate TCU processing of sparse GNN
workloads. We also implement an effective CUDA core and TCU collaboration
design to fully utilize GPU resources. We fully integrate TC-GNN with the
PyTorch framework for ease of programming. Rigorous experiments show an
average speedup of 1.70X over the state-of-the-art Deep Graph Library
framework across various GNN models and dataset settings.
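To make the sparse graph translation idea concrete, here is a minimal NumPy sketch of condensing the nonzero columns of each adjacency row window into a dense tile before a dense multiply. The tile height, function name, and the dense adjacency input are illustrative assumptions, not the paper's actual API; on a GPU the per-tile product would map onto a Tensor Core GEMM rather than a NumPy matmul.

    import numpy as np

    def translate_and_spmm(A, X, tile_h=16):
        # Compute A @ X window by window. A is the (sparse, here stored
        # dense for clarity) adjacency matrix; X holds node features.
        out = np.zeros((A.shape[0], X.shape[1]))
        for r0 in range(0, A.shape[0], tile_h):
            window = A[r0:r0 + tile_h]                    # one sparse row window
            nz_cols = np.flatnonzero(window.any(axis=0))  # neighbors of the window
            if nz_cols.size == 0:
                continue
            dense_tile = window[:, nz_cols]               # condensed dense tile
            # On a GPU this dense product would be issued to a Tensor Core;
            # a plain matmul stands in for the TCU GEMM here.
            out[r0:r0 + tile_h] = dense_tile @ X[nz_cols]
        return out

    # Sanity check against the direct product on a random sparse graph:
    A = (np.random.rand(64, 64) < 0.05).astype(np.float32)
    X = np.random.rand(64, 8).astype(np.float32)
    assert np.allclose(translate_and_spmm(A, X), A @ X)

The payoff is that the condensed tiles are dense, so the hardware never multiplies over the all-zero columns that dominate a sparse adjacency window.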
PopSparse: Accelerated block sparse matrix multiplication on IPU
Reducing the computational cost of running large-scale neural networks using
sparsity has attracted great attention in the deep learning community. While
much success has been achieved in reducing FLOP and parameter counts while
maintaining acceptable task performance, achieving actual speed improvements
has typically been much more difficult, particularly on general-purpose
accelerators (GPAs) such as NVIDIA GPUs using low-precision number formats. In
this work we introduce PopSparse, a library that enables fast sparse operations
on Graphcore IPUs by leveraging both the unique hardware characteristics of
IPUs and any block structure defined in the data. We target two different
types of sparsity: static, where the sparsity pattern is fixed at compile
time; and dynamic, where it can change each time the model is run. We present
matrix-multiplication benchmark results for both of these modes on IPU across
a range of block sizes, matrix sizes, and densities. The results indicate that
the PopSparse implementations are faster than dense matrix multiplication on
IPU at a range of sparsity levels for large matrix and block sizes.
Furthermore, static sparsity generally outperforms dynamic sparsity. While
previous work on GPAs has shown speedups only at very high sparsity (typically
99% and above), the present work demonstrates that our static sparse
implementation outperforms equivalent dense calculations in FP16 at lower
sparsity (around 90%). IPU code is available to view and run at
ipu.dev/sparsity-benchmarks; GPU code will be made available shortly.
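As a rough illustration of the block-sparse matmul being benchmarked, the following NumPy sketch multiplies a matrix stored as a list of dense b x b blocks (a block-COO layout) by a dense right-hand side. The layout and function name are assumptions for illustration only; PopSparse itself is an IPU library with its own storage format.

    import numpy as np

    def block_sparse_matmul(blocks, block_rows, block_cols, b, n_rows, dense):
        # y = A @ dense, where A (n_rows tall) is given as a list of b x b
        # nonzero blocks; block i sits at position (block_rows[i],
        # block_cols[i]) of the block grid. Zero blocks are simply skipped,
        # which is where the speedup over a dense matmul comes from.
        out = np.zeros((n_rows, dense.shape[1]))
        for blk, br, bc in zip(blocks, block_rows, block_cols):
            out[br * b:(br + 1) * b] += blk @ dense[bc * b:(bc + 1) * b]
        return out

    # Example: an 8x8 matrix holding two nonzero 4x4 blocks.
    b = 4
    blocks = [np.ones((b, b)), 2 * np.ones((b, b))]
    y = block_sparse_matmul(blocks, block_rows=[0, 1], block_cols=[1, 0],
                            b=b, n_rows=8, dense=np.random.rand(8, 3))

Because each stored block is dense, the inner products still run as dense GEMMs on wide hardware, which is why block structure recovers speedups at far lower sparsity than unstructured element-wise sparsity does.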
Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
Deep Neural Networks (DNNs) are becoming an important tool in modern
computing applications. Accelerating their training is a major challenge, and
techniques range from distributed algorithms to low-level circuit design. In
this survey, we describe the problem from a theoretical perspective, followed
by approaches for its parallelization. We present trends in DNN architectures
and the resulting implications for parallelization strategies. We then review
and model the different types of concurrency in DNNs: from the single operator,
through parallelism in network inference and training, to distributed deep
learning. We discuss asynchronous stochastic optimization, distributed system
architectures, communication schemes, and neural architecture search. Based on
these approaches, we extrapolate potential directions for parallelism in deep
learning.
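As one concrete instance of the concurrency types surveyed, here is a minimal NumPy sketch of synchronous data-parallel SGD: each worker computes a gradient on its own data shard, and the gradients are averaged (the all-reduce step) before a shared parameter update. The linear model, loss, and function names are illustrative assumptions, not taken from the survey.

    import numpy as np

    def worker_grad(w, x_shard, y_shard):
        # Gradient of mean squared error for a linear model on one shard.
        residual = x_shard @ w - y_shard
        return 2.0 * x_shard.T @ residual / len(y_shard)

    def data_parallel_step(w, shards, lr=0.01):
        # In a real system each gradient is computed on a separate device and
        # the mean is taken with an all-reduce; the loop stands in for that.
        grads = [worker_grad(w, x, y) for x, y in shards]
        return w - lr * np.mean(grads, axis=0)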
- …