A Framework for Fine-Grained Synchronization of Dependent GPU Kernels

Dehnavi, Maryam Mehri; Jangda, Abhinav; Maleki, Saeed; Musuvathi, Madan; Saarikivi, Olli

Publication date: 22 May 2023

Abstract

Machine Learning (ML) models contain highly parallel computations, such as matrix multiplication, convolution, and dropout. These computations are commonly executed on Graphics Processing Units (GPUs) by dividing the computation into independent processing blocks, known as tiles. Since the number of tiles is usually higher than the number of execution units of a GPU, tiles are executed on all execution units in waves. However, the tiles executed in the last wave can under-utilize the execution units because the number of tiles is not always a multiple of the number of execution units. This under-utilization can be reduced by executing multiple independent kernels concurrently on a GPU, but this is not currently possible for dependent kernels. In this paper, we present cuSync, a framework to write custom fine-grained synchronization policies for dependent kernels to improve GPU utilization. cuSync synchronizes tiles instead of kernels, which allows executing tiles of multiple dependent kernels concurrently. Using cuSync, we expressed several synchronization policies in a few lines of code and reduced the inference times of GPT-3 and ResNet-38 by up to 1.19x and 1.16x, respectively.
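The wave arithmetic behind the under-utilization argument can be illustrated with a small Python sketch. It is not cuSync's API and does not reproduce the paper's measurements. It assumes that every tile takes the same time, that each execution unit runs one tile at a time, and that dependencies never prevent a ready tile from filling an idle unit, so the tile-level figure is only a best-case lower bound. The function names and the tile and unit counts are hypothetical.

```python
import math


def wave_stats(num_tiles, num_units):
    """Return (waves, last-wave utilization, overall utilization) for one kernel.

    Assumes equal-duration tiles and one tile per execution unit at a time.
    """
    waves = math.ceil(num_tiles / num_units)
    last_wave_tiles = num_tiles - (waves - 1) * num_units
    last_wave_utilization = last_wave_tiles / num_units
    overall_utilization = num_tiles / (waves * num_units)
    return waves, last_wave_utilization, overall_utilization


def waves_with_kernel_sync(tiles_per_kernel, num_units):
    """Waves needed when each kernel must finish before the next one starts."""
    return sum(math.ceil(t / num_units) for t in tiles_per_kernel)


def waves_with_tile_sync_lower_bound(tiles_per_kernel, num_units):
    """Idealized lower bound when tiles of dependent kernels may share a wave.

    Assumption: a ready tile is always available to fill an idle unit,
    which real dependencies do not guarantee.
    """
    return math.ceil(sum(tiles_per_kernel) / num_units)


# Hypothetical numbers: two dependent kernels of 120 tiles each on 80 units.
HYPOTHETICAL_TILES = [120, 120]
HYPOTHETICAL_UNITS = 80

single_kernel = wave_stats(HYPOTHETICAL_TILES[0], HYPOTHETICAL_UNITS)
kernel_sync_waves = waves_with_kernel_sync(HYPOTHETICAL_TILES, HYPOTHETICAL_UNITS)
tile_sync_waves = waves_with_tile_sync_lower_bound(HYPOTHETICAL_TILES, HYPOTHETICAL_UNITS)
```

With these hypothetical numbers, each kernel needs two waves and its last wave keeps only half of the units busy, so kernel-level synchronization takes four waves in total. Under the idealized assumption above, letting tiles of the second kernel fill the idle half-wave brings the bound down to three waves.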

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2305.13450

Last time updated on 26/05/2023