
    Accelerating Reduction and Scan Using Tensor Core Units

    Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or 16x16) to accelerate the convolutional and recurrent neural networks in deep learning workloads. In this paper we leverage NVIDIA's TCUs to express both reduction and scan with matrix multiplication and show the benefits -- in terms of program simplicity, efficiency, and performance. Our algorithm exercises the NVIDIA TCUs, which would otherwise be idle, achieves 89%-98% of peak memory copy bandwidth, and is orders of magnitude faster (up to 100x for reduction and 3x for scan) than state-of-the-art methods for small segment sizes -- common in machine learning and scientific applications. Our algorithm achieves this while decreasing power consumption by up to 22% for reduction and 16% for scan. Comment: In Proceedings of the ACM International Conference on Supercomputing (ICS '19).
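
    The central trick is concrete enough to sketch. A warp-level WMMA multiply by an all-ones matrix turns a single 16x16 tile load into 16 segment sums: with C = ones x B, every entry of C's row i is a column sum of B. The kernel below is a minimal illustration of that mapping, not the paper's implementation; the kernel name, launch configuration (one warp per tile, segments of exactly 16 half-precision elements), and output layout are assumptions made for the sketch. It requires a Volta-or-later GPU (sm_70+).

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Sketch: launch with <<<num_tiles, 32>>>, one warp per 16x16 tile.
// Each tile packs 16 segments of 16 half values; segment j is the 16
// contiguous elements in[tile*256 + j*16 ... tile*256 + j*16 + 15].
__global__ void tcu_segmented_reduce(const half *in, float *out) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> ones;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> vals;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(ones, __float2half(1.0f)); // A = all-ones matrix
    wmma::fill_fragment(acc, 0.0f);

    // Column-major load places segment j into column j of B.
    wmma::load_matrix_sync(vals, in + blockIdx.x * 256, 16);

    // C = ones x B: every row of C holds the 16 column (segment) sums.
    wmma::mma_sync(acc, ones, vals, acc);

    // Spill the accumulator to shared memory and emit row 0.
    __shared__ float tile[256];
    wmma::store_matrix_sync(tile, acc, 16, wmma::mem_row_major);
    __syncthreads();
    if (threadIdx.x < 16)
        out[blockIdx.x * 16 + threadIdx.x] = tile[threadIdx.x];
}
```

    The same mapping extends to segmented scan: replacing the all-ones matrix with a lower-triangular matrix of ones makes row i of the product the inclusive prefix sum through element i of each segment.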

    Toward performance portability for CPUs and GPUs through algorithmic compositions

    The diversity of microarchitecture designs in heterogeneous computing systems allows programs to achieve high performance and energy efficiency, but results in substantial software redevelopment cost for each type or generation of hardware. To mitigate this cost, a performance-portable programming system is required. This work presents my solution to the performance portability problem. I argue that a new language is required to replace current programming practices and achieve practical performance portability. To support this argument, I first demonstrate the limited performance portability of current practices through quantitative and qualitative evidence, and identify the main limiting issues of conventional programming languages. To overcome these issues, I propose a new modular, composition-based programming language that can effectively express an algorithmic design space with functional polymorphism, together with a compiler that can effectively explore that design space and facilitate many high-level optimization techniques. The proposed approach achieves no less than 70% of the performance of highly optimized vendor libraries such as Intel MKL and NVIDIA CUBLAS/CUSPARSE on an Intel i7-3820 Sandy Bridge CPU, an NVIDIA C2050 Fermi GPU, and an NVIDIA K20c Kepler GPU.
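
    As a toy illustration of the composition idea (not the thesis's actual language or compiler), the sketch below assembles a tiled reduction from interchangeable pieces in plain C++ (also valid CUDA host code); the names reduce_serial and reduce_tiled are hypothetical. Because the combine operator and tiling factor are parameters rather than hard-coded choices, a compiler or autotuner can explore the algorithmic design space by recomposing parts instead of rewriting the routine for each device.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch of functional polymorphism over an algorithmic
// design space: the skeleton is fixed, while the combine operator and
// the tiling strategy are swappable parameters.
template <typename It, typename T, typename Op>
T reduce_serial(It first, It last, T init, Op op) {
    for (; first != last; ++first) init = op(init, *first);
    return init;
}

// Two-level composition: reduce each tile, then reduce the partials.
// Different tile sizes (or a parallel inner reduction in place of
// reduce_serial) are alternative points in the same design space,
// selected without rewriting callers.
template <typename It, typename T, typename Op>
T reduce_tiled(It first, It last, T init, Op op, std::size_t tile) {
    std::vector<T> partials;
    while (first != last) {
        std::size_t n = static_cast<std::size_t>(last - first);
        It stop = first + (n < tile ? n : tile);
        // T{} is assumed to be the identity element of op.
        partials.push_back(reduce_serial(first, stop, T{}, op));
        first = stop;
    }
    return reduce_serial(partials.begin(), partials.end(), init, op);
}

// Usage: reduce_tiled(v.begin(), v.end(), 0, std::plus<int>{}, 256);
```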