
    Accelerating Reduction and Scan Using Tensor Core Units

    Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or 16x16) to accelerate the convolutional and recurrent neural networks in deep learning workloads. In this paper we leverage NVIDIA's TCU to express both reduction and scan with matrix multiplication and show the benefits -- in terms of program simplicity, efficiency, and performance. Our algorithm exercises the NVIDIA TCUs which would otherwise be idle, achieves 89%-98% of peak memory copy bandwidth, and is orders of magnitude faster (up to 100x for reduction and 3x for scan) than state-of-the-art methods for small segment sizes -- common in machine learning and scientific applications. Our algorithm achieves this while decreasing power consumption by up to 22% for reduction and 16% for scan.
    Comment: In Proceedings of the ACM International Conference on Supercomputing (ICS '19)
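    The core idea of the paper -- expressing reduction and scan as matrix multiplications so that TCUs can execute them -- rests on two standard linear-algebra identities. A minimal NumPy sketch of those identities (illustrative only; this is not the paper's TCU kernel, and the segment data is made up):

    ```python
    import numpy as np

    seg = np.array([1.0, 2.0, 3.0, 4.0])  # one input segment (hypothetical data)
    n = seg.size

    # Reduction: multiplying a segment by a column vector of ones sums it.
    ones = np.ones((n, 1))
    reduction = seg @ ones                 # array([10.])

    # Inclusive scan (prefix sum): multiplying by a lower-triangular matrix
    # of ones accumulates all prefixes in one matmul.
    L = np.tril(np.ones((n, n)))
    scan = L @ seg                         # array([ 1.,  3.,  6., 10.])
    ```

    On a TCU, these ones-filled operand matrices let the fixed-function matrix-multiply hardware perform what is otherwise a memory-bound elementwise pass, which is why small segment sizes benefit most.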

    Low-Level Functional GPU Programming for Parallel Algorithms

    We present a Functional Compute Language (FCL) for low-level GPU programming. FCL is functional in style, which allows for easy composition of program fragments and thus easy prototyping and a high degree of code reuse. In contrast with projects such as Futhark, Accelerate, Harlan, Nessie and Delite, the intention is not to develop a language providing fully automatic optimizations, but instead to provide a platform that supports absolute control of the GPU computation and memory hierarchies. The developer is thus required to have an intimate knowledge of the target platform, as is also required when using CUDA/OpenCL directly. FCL is heavily inspired by Obsidian. However, instead of relying on a multi-staged meta-programming approach for kernel generation using Haskell as meta-language, FCL is completely self-contained, and we intend it to be suitable as an intermediate language for data-parallel languages, including data-parallel parts of high-level array languages, such as R, Matlab, and APL. We present a type system and a dynamic semantics suitable for understanding the performance characteristics of both FCL and Obsidian-style programs. Our aim is that FCL will be useful as a platform for developing new parallel algorithms, as well as a target language for various code generators targeting GPU hardware.