Matrix Multiplication in Multiword Arithmetic: Error Analysis and Application to GPU Tensor Cores

Abstract

In multiword arithmetic, a matrix is represented as the unevaluated sum of two or more lower-precision matrices, and a matrix product is formed by multiplying the constituents in low precision. We investigate the use of multiword arithmetic for improving the performance-accuracy tradeoff of matrix multiplication with mixed-precision block fused multiply-add (FMA) hardware, focusing especially on the tensor cores available on NVIDIA GPUs. Building on a general block FMA framework, we develop a comprehensive error analysis of multiword matrix multiplication. After confirming the theoretical error bounds experimentally by simulating low precision in software, we use the cuBLAS and CUTLASS libraries to implement a number of matrix multiplication algorithms using double-fp16 (double-binary16) arithmetic. When running the algorithms on NVIDIA V100 and A100 GPUs, we find that double-fp16 is not as accurate as fp32 (binary32) arithmetic despite satisfying the same worst-case error bound. Using probabilistic error analysis, we explain why this issue is likely caused by the rounding mode used by the NVIDIA tensor cores, and we propose a parameterized blocked summation algorithm that alleviates the problem and significantly improves the performance-accuracy tradeoff.
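To illustrate the multiword idea described above, the following minimal NumPy sketch simulates double-fp16 matrix multiplication in software: each fp32 matrix is split into the unevaluated sum of two fp16 matrices, the partial products are formed from fp16 inputs with fp32 accumulation (as a block FMA unit such as a tensor core would), and the results are summed in fp32. This is only an illustrative simulation under those assumptions, not the paper's GPU implementation, and the function names are hypothetical.

```python
import numpy as np

def split_fp16(A):
    """Split an fp32 matrix into high and low fp16 parts, so A ~ A_hi + A_lo."""
    A_hi = A.astype(np.float16)
    A_lo = (A - A_hi.astype(np.float32)).astype(np.float16)
    return A_hi, A_lo

def matmul_double_fp16(A, B):
    """Approximate the fp32 product A @ B using only fp16 multiplicands."""
    A_hi, A_lo = split_fp16(A)
    B_hi, B_lo = split_fp16(B)
    # Each partial product takes fp16 inputs but accumulates in fp32,
    # mimicking a mixed-precision block FMA. The lowest-order term
    # A_lo @ B_lo is dropped, since it falls below fp32 working accuracy.
    def gemm(X, Y):
        return X.astype(np.float32) @ Y.astype(np.float32)
    return gemm(A_hi, B_hi) + gemm(A_hi, B_lo) + gemm(A_lo, B_hi)

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256), dtype=np.float32)
B = rng.standard_normal((256, 256), dtype=np.float32)
C_ref = A @ B
C_2word = matmul_double_fp16(A, B)
C_1word = (A.astype(np.float16).astype(np.float32)
           @ B.astype(np.float16).astype(np.float32))
print("double-fp16 rel. error:", np.linalg.norm(C_2word - C_ref) / np.linalg.norm(C_ref))
print("single-fp16 rel. error:", np.linalg.norm(C_1word - C_ref) / np.linalg.norm(C_ref))
```

Running the sketch shows the double-fp16 product tracking the fp32 reference far more closely than a plain fp16 product, which is the performance-accuracy tradeoff the paper analyzes on actual tensor-core hardware.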


This paper was published in MIMS EPrints.
