
    Accelerating Reduction and Scan Using Tensor Core Units

    Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or 16x16) to accelerate the convolutional and recurrent neural networks in deep learning workloads. In this paper we leverage NVIDIA's TCU to express both reduction and scan with matrix multiplication and show the benefits -- in terms of program simplicity, efficiency, and performance. Our algorithm exercises the NVIDIA TCUs which would otherwise be idle, achieves 89%-98% of peak memory copy bandwidth, and is orders of magnitude faster (up to 100x for reduction and 3x for scan) than state-of-the-art methods for small segment sizes -- common in machine learning and scientific applications. Our algorithm achieves this while decreasing power consumption by up to 22% for reduction and 16% for scan.
    Comment: In Proceedings of the ACM International Conference on Supercomputing (ICS '19)
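    The core idea of the paper -- expressing reduction and scan as matrix multiplications so that TCUs can execute them -- rests on two standard linear-algebra identities. A minimal NumPy sketch of those identities (illustrative only; this is not the paper's TCU kernel, and the segment data is made up):

    ```python
    import numpy as np

    seg = np.array([1.0, 2.0, 3.0, 4.0])  # one input segment (hypothetical data)
    n = seg.size

    # Reduction: multiplying a segment by a column vector of ones sums it.
    ones = np.ones((n, 1))
    reduction = seg @ ones                 # array([10.])

    # Inclusive scan (prefix sum): multiplying by a lower-triangular matrix
    # of ones accumulates all prefixes in one matmul.
    L = np.tril(np.ones((n, n)))
    scan = L @ seg                         # array([ 1.,  3.,  6., 10.])
    ```

    On a TCU, these ones-filled operand matrices let the fixed-function matrix-multiply hardware perform what is otherwise a memory-bound elementwise pass, which is why small segment sizes benefit most.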

    Low-Level Functional GPU Programming for Parallel Algorithms

    We present a Functional Compute Language (FCL) for low-level GPU programming. FCL is functional in style, which allows for easy composition of program fragments and thus easy prototyping and a high degree of code reuse. In contrast with projects such as Futhark, Accelerate, Harlan, Nessie and Delite, the intention is not to develop a language providing fully automatic optimizations, but instead to provide a platform that supports absolute control of the GPU computation and memory hierarchies. The developer is thus required to have an intimate knowledge of the target platform, as is also required when using CUDA/OpenCL directly. FCL is heavily inspired by Obsidian. However, instead of relying on a multi-staged meta-programming approach for kernel generation using Haskell as meta-language, FCL is completely self-contained, and we intend it to be suitable as an intermediate language for data-parallel languages, including data-parallel parts of high-level array languages, such as R, Matlab, and APL. We present a type system and a dynamic semantics suitable for understanding the performance characteristics of both FCL and Obsidian-style programs. Our aim is that FCL will be useful as a platform for developing new parallel algorithms, as well as a target language for various code generators targeting GPU hardware.