TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs
Recently, graph neural networks (GNNs), as the backbone of graph-based
machine learning, have demonstrated great success in various domains (e.g.,
e-commerce). However, the performance of GNNs is usually unsatisfactory due to
their highly sparse and irregular graph-based operations. To this end, we
propose TC-GNN, the first GPU Tensor Core Unit (TCU) based GNN acceleration
framework. The core idea is to reconcile the "Sparse" GNN computation with the
"Dense" TCU. Specifically, we conduct an in-depth analysis of the sparse
operations in mainstream GNN computing frameworks. We introduce a novel sparse
graph translation technique to facilitate TCU processing of sparse GNN
workloads. We also implement an effective CUDA core and TCU collaboration
design to fully utilize GPU resources. We fully integrate TC-GNN with the
PyTorch framework for ease of programming. Rigorous experiments show an
average speedup of 1.70X over the state-of-the-art Deep Graph Library
framework across various GNN models and dataset settings.
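To make the sparse graph translation idea concrete, here is a minimal NumPy sketch of condensing the nonzero columns of each adjacency row window into a dense tile before a dense multiply. The tile height, function name, and the dense adjacency input are illustrative assumptions, not the paper's actual API; on a GPU the per-tile product would map onto a Tensor Core GEMM rather than a NumPy matmul.

    import numpy as np

    def translate_and_spmm(A, X, tile_h=16):
        # Compute A @ X window by window. A is the (sparse, here stored
        # dense for clarity) adjacency matrix; X holds node features.
        out = np.zeros((A.shape[0], X.shape[1]))
        for r0 in range(0, A.shape[0], tile_h):
            window = A[r0:r0 + tile_h]                    # one sparse row window
            nz_cols = np.flatnonzero(window.any(axis=0))  # neighbors of the window
            if nz_cols.size == 0:
                continue
            dense_tile = window[:, nz_cols]               # condensed dense tile
            # On a GPU this dense product would be issued to a Tensor Core;
            # a plain matmul stands in for the TCU GEMM here.
            out[r0:r0 + tile_h] = dense_tile @ X[nz_cols]
        return out

    # Sanity check against the direct product on a random sparse graph:
    A = (np.random.rand(64, 64) < 0.05).astype(np.float32)
    X = np.random.rand(64, 8).astype(np.float32)
    assert np.allclose(translate_and_spmm(A, X), A @ X)

The payoff is that the condensed tiles are dense, so the hardware never multiplies over the all-zero columns that dominate a sparse adjacency window.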
PopSparse: Accelerated block sparse matrix multiplication on IPU
Reducing the computational cost of running large-scale neural networks using
sparsity has attracted great attention in the deep learning community. While
much success has been achieved in reducing FLOP and parameter counts while
maintaining acceptable task performance, achieving actual speed improvements
has typically been much more difficult, particularly on general-purpose
accelerators (GPAs) such as NVIDIA GPUs using low-precision number formats. In
this work we introduce PopSparse, a library that enables fast sparse operations
on Graphcore IPUs by leveraging both the unique hardware characteristics of
IPUs and any block structure defined in the data. We target two different
types of sparsity: static, where the sparsity pattern is fixed at compile
time; and dynamic, where it can change each time the model is run. We present
matrix-multiplication benchmark results for both of these modes on IPU across
a range of block sizes, matrix sizes, and densities. The results indicate that
the PopSparse implementations are faster than dense matrix multiplication on
IPU at a range of sparsity levels for large matrix and block sizes.
Furthermore, static sparsity generally outperforms dynamic sparsity. While
previous work on GPAs has shown speedups only at very high sparsity (typically
99% and above), the present work demonstrates that our static sparse
implementation outperforms equivalent dense calculations in FP16 at lower
sparsity (around 90%). IPU code is available to view and run at
ipu.dev/sparsity-benchmarks; GPU code will be made available shortly.
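As a rough illustration of the block-sparse matmul being benchmarked, the following NumPy sketch multiplies a matrix stored as a list of dense b x b blocks (a block-COO layout) by a dense right-hand side. The layout and function name are assumptions for illustration only; PopSparse itself is an IPU library with its own storage format.

    import numpy as np

    def block_sparse_matmul(blocks, block_rows, block_cols, b, n_rows, dense):
        # y = A @ dense, where A (n_rows tall) is given as a list of b x b
        # nonzero blocks; block i sits at position (block_rows[i],
        # block_cols[i]) of the block grid. Zero blocks are simply skipped,
        # which is where the speedup over a dense matmul comes from.
        out = np.zeros((n_rows, dense.shape[1]))
        for blk, br, bc in zip(blocks, block_rows, block_cols):
            out[br * b:(br + 1) * b] += blk @ dense[bc * b:(bc + 1) * b]
        return out

    # Example: an 8x8 matrix holding two nonzero 4x4 blocks.
    b = 4
    blocks = [np.ones((b, b)), 2 * np.ones((b, b))]
    y = block_sparse_matmul(blocks, block_rows=[0, 1], block_cols=[1, 0],
                            b=b, n_rows=8, dense=np.random.rand(8, 3))

Because each stored block is dense, the inner products still run as dense GEMMs on wide hardware, which is why block structure recovers speedups at far lower sparsity than unstructured element-wise sparsity does.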
Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
Deep Neural Networks (DNNs) are becoming an important tool in modern
computing applications. Accelerating their training is a major challenge, and
techniques range from distributed algorithms to low-level circuit design. In
this survey, we describe the problem from a theoretical perspective, followed
by approaches for its parallelization. We present trends in DNN architectures
and the resulting implications for parallelization strategies. We then review
and model the different types of concurrency in DNNs: from the single operator,
through parallelism in network inference and training, to distributed deep
learning. We discuss asynchronous stochastic optimization, distributed system
architectures, communication schemes, and neural architecture search. Based on
these approaches, we extrapolate potential directions for parallelism in deep
learning.
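As one concrete instance of the concurrency types surveyed, here is a minimal NumPy sketch of synchronous data-parallel SGD: each worker computes a gradient on its own data shard, and the gradients are averaged (the all-reduce step) before a shared parameter update. The linear model, loss, and function names are illustrative assumptions, not taken from the survey.

    import numpy as np

    def worker_grad(w, x_shard, y_shard):
        # Gradient of mean squared error for a linear model on one shard.
        residual = x_shard @ w - y_shard
        return 2.0 * x_shard.T @ residual / len(y_shard)

    def data_parallel_step(w, shards, lr=0.01):
        # In a real system each gradient is computed on a separate device and
        # the mean is taken with an all-reduce; the loop stands in for that.
        grads = [worker_grad(w, x, y) for x, y in shards]
        return w - lr * np.mean(grads, axis=0)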
- …