Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs
General Matrix Multiplication (GEMM) is a crucial algorithm for various
applications such as machine learning and scientific computing, and an
efficient GEMM implementation is essential for the performance of these
systems. While researchers often strive for faster performance by using large
compute platforms, the increased scale of these systems can raise concerns
about hardware and software reliability. In this paper, we present a design for
a high-performance GEMM with algorithm-based fault tolerance for use on GPUs.
We describe fault-tolerant designs for GEMM at the thread, warp, and
threadblock levels, and also provide a baseline GEMM implementation that is
competitive with or faster than the state-of-the-art, proprietary cuBLAS GEMM.
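
The thread-, warp-, and threadblock-level designs build on classic algorithm-based fault tolerance (ABFT) for GEMM: A is extended with a row of column checksums and B with a column of row checksums, so a single extended multiply also produces checksums that verify C. The host-side sketch below illustrates only that invariant, using a plain triple loop and made-up data; it is not the paper's GPU kernel.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Plain reference GEMM: C(m x n) = A(m x k) * B(k x n), row-major.
    static void gemm(const double* A, const double* B, double* C,
                     int m, int k, int n) {
      for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
          double acc = 0.0;
          for (int p = 0; p < k; ++p) acc += A[i * k + p] * B[p * n + j];
          C[i * n + j] = acc;
        }
    }

    int main() {
      const int m = 4, k = 3, n = 5;
      // Extended operands: one extra checksum row for A, one extra
      // checksum column for B (sizes and data made up for illustration).
      std::vector<double> A((m + 1) * k, 0.0), B(k * (n + 1), 0.0);
      for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p) A[i * k + p] = 1.0 + i + p;
      for (int p = 0; p < k; ++p)
        for (int j = 0; j < n; ++j) B[p * (n + 1) + j] = 1.0 + p * j;
      // Encode: last row of A holds column sums, last column of B row sums.
      for (int p = 0; p < k; ++p) {
        for (int i = 0; i < m; ++i) A[m * k + p] += A[i * k + p];
        for (int j = 0; j < n; ++j) B[p * (n + 1) + n] += B[p * (n + 1) + j];
      }
      // One extended multiply yields C together with its own checksums.
      std::vector<double> C((m + 1) * (n + 1), 0.0);
      gemm(A.data(), B.data(), C.data(), m + 1, k, n + 1);
      // Verify: the checksum row must equal the column sums of C.
      for (int j = 0; j < n; ++j) {
        double col = 0.0;
        for (int i = 0; i < m; ++i) col += C[i * (n + 1) + j];
        if (std::fabs(col - C[m * (n + 1) + j]) > 1e-9)
          std::printf("fault detected in column %d of C\n", j);
      }
      return 0;
    }

Because (e^T A)B = e^T(AB), a checksum mismatch localizes a fault to a column of C, which is what makes detection possible online, during the multiply itself.
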
We present a kernel fusion strategy that overlaps the extra memory operations introduced by fault tolerance with the original GEMM computation, hiding their latency.
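
As a rough illustration of the fusion idea, the kernel below folds checksum accumulation into the epilogue of a deliberately naive one-thread-per-element GEMM, so checksums are produced while each result is still in registers rather than by a second pass over C in global memory. The kernel and its signature are our own illustrative assumptions; the paper fuses these steps into a fully tiled, high-performance kernel.

    #include <cuda_runtime.h>

    // Naive GEMM with a fused checksum epilogue (illustrative only):
    // each thread computes one element of C = A * B and immediately
    // folds it into a per-column checksum, avoiding a second kernel
    // that would re-read C from global memory.
    __global__ void gemm_fused_checksum(const float* A, const float* B,
                                        float* C, float* colChecksum,
                                        int m, int k, int n) {
      int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
      int j = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
      if (i >= m || j >= n) return;
      float acc = 0.0f;
      for (int p = 0; p < k; ++p) acc += A[i * k + p] * B[p * n + j];
      C[i * n + j] = acc;
      // Fused epilogue: update the column checksum while acc is in registers.
      atomicAdd(&colChecksum[j], acc);
    }

Comparing colChecksum against the checksum row produced by the encoded operands then flags a faulty column without any extra traversal of C.
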
To support a wide range of input matrix shapes and reduce development costs, we present a
template-based approach for automatic code generation for both fault-tolerant
and non-fault-tolerant GEMM implementations. We evaluate our work on NVIDIA Tesla T4 and A100 server GPUs.
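
A minimal sketch of what such template-based generation can look like: tile sizes become compile-time parameters, and the generator emits one instantiation per input-shape class. The tile values and names below are illustrative assumptions, not the paper's configurations.

    #include <cuda_runtime.h>

    // Shape-specialized tiled GEMM: the tile edge is a compile-time
    // parameter, so a code generator can stamp out one kernel per
    // input-shape class instead of hand-writing each variant.
    // Launch with dim3(TILE, TILE) thread blocks.
    template <int TILE>
    __global__ void gemm_tiled(const float* A, const float* B, float* C,
                               int m, int k, int n) {
      __shared__ float As[TILE][TILE];
      __shared__ float Bs[TILE][TILE];
      int row = blockIdx.y * TILE + threadIdx.y;
      int col = blockIdx.x * TILE + threadIdx.x;
      float acc = 0.0f;
      for (int t = 0; t < k; t += TILE) {
        // Stage one tile of A and one of B in shared memory (zero-padded
        // at the edges so irregular shapes are handled uniformly).
        As[threadIdx.y][threadIdx.x] =
            (row < m && t + threadIdx.x < k) ? A[row * k + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < k && col < n) ? B[(t + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();
        for (int p = 0; p < TILE; ++p)
          acc += As[threadIdx.y][p] * Bs[p][threadIdx.x];
        __syncthreads();
      }
      if (row < m && col < n) C[row * n + col] = acc;
    }

    // Instantiations a generator might emit for two shape classes.
    template __global__ void gemm_tiled<16>(const float*, const float*, float*,
                                            int, int, int);
    template __global__ void gemm_tiled<32>(const float*, const float*, float*,
                                            int, int, int);
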
Experimental results demonstrate that our baseline GEMM achieves performance comparable to or better than the closed-source cuBLAS. The fault-tolerant GEMM incurs only minimal overhead (8.89% on average) relative to cuBLAS, even with hundreds of errors injected per minute. For irregularly shaped inputs, the kernels produced by our code generator achieve remarkable speedups for both fault-tolerant and non-fault-tolerant GEMMs, outperforming cuBLAS.

Comment: 11 pages, 2023 International Conference on Supercomputing