Machine learning (ML) models contain highly parallel computations, such as
matrix multiplication, convolution, and dropout. These computations are
commonly executed on Graphics Processing Units (GPUs) by dividing the
computation into independent processing blocks, known as tiles. Since the number
of tiles is usually higher than the number of execution units of a GPU, tiles are
executed on all execution units in waves. However, the tiles executed in the
last wave can under-utilize the execution units because the number of tiles is
not always a multiple of the number of execution units. This under-utilization
can be reduced by executing multiple independent kernels concurrently on a GPU,
but this is not currently possible for dependent kernels.
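
The last-wave effect described above can be illustrated with simple arithmetic. The sketch below is not taken from the paper's evaluation: the function name `wave_stats` and the figures of 192 tiles and 80 execution units are hypothetical values chosen only for illustration, and every tile is assumed to occupy exactly one execution unit for one wave.

```python
import math


def wave_stats(num_tiles: int, num_units: int):
    """Return (waves, last-wave utilization, overall utilization).

    Assumes each tile occupies one execution unit for exactly one wave.
    """
    if num_tiles <= 0 or num_units <= 0:
        raise ValueError("num_tiles and num_units must be positive")
    # Tiles are issued in full waves of num_units until none remain.
    waves = math.ceil(num_tiles / num_units)
    # Whatever is left after the full waves runs in the last wave.
    last_wave_tiles = num_tiles - (waves - 1) * num_units
    last_wave_util = last_wave_tiles / num_units
    overall_util = num_tiles / (waves * num_units)
    return waves, last_wave_util, overall_util


# Hypothetical example: 192 tiles on a GPU with 80 execution units.
print(*wave_stats(num_tiles=192, num_units=80))  # prints "3 0.4 0.8"
```

Under these assumptions, the third wave keeps only 40% of the execution units busy, so the kernel as a whole reaches 80% utilization.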
In this paper, we present cuSync, a framework to write custom fine-grained
synchronization policies for dependent kernels to improve GPU utilization.
cuSync synchronizes tiles instead of kernels, which allows executing tiles of
multiple dependent kernels concurrently. Using cuSync, we expressed several
synchronization policies in a few lines of code and reduced the inference times
of GPT-3 and ResNet-38 by up to 1.19x and 1.16x, respectively.
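
To give a rough sense of why synchronizing tiles rather than kernels can help, the following toy model counts scheduling steps for a producer kernel followed by a dependent consumer kernel. It is a simplified sketch and not the cuSync API or its scheduling algorithm: the function names, the unit-time tiles, the greedy scheduler, and the one-to-one dependency between consumer tile `i` and producer tile `i` are all assumptions made for illustration.

```python
import math


def steps_kernel_sync(producer_tiles: int, consumer_tiles: int, units: int) -> int:
    """Kernel-level synchronization: the consumer starts only after every
    producer tile has finished, so the wave counts simply add up."""
    return math.ceil(producer_tiles / units) + math.ceil(consumer_tiles / units)


def steps_tile_sync(producer_tiles: int, consumer_deps: list, units: int) -> int:
    """Tile-level synchronization (toy model): a consumer tile may run as soon
    as the producer tiles it depends on finished in an earlier step.

    consumer_deps[c] lists the producer tile indices that consumer tile c needs;
    every index is assumed to be in range(producer_tiles).
    """
    if units < 1:
        raise ValueError("units must be at least 1")
    done = set()          # producer tiles finished in earlier steps
    next_producer = 0     # producer tiles are issued in index order
    pending = list(range(len(consumer_deps)))
    steps = 0
    while next_producer < producer_tiles or pending:
        free = units
        # Issue as many producer tiles as fit in this step.
        launched = list(range(next_producer, min(producer_tiles, next_producer + free)))
        next_producer += len(launched)
        free -= len(launched)
        # Fill the remaining units with consumer tiles whose inputs are ready.
        ready = [c for c in pending if all(d in done for d in consumer_deps[c])][:free]
        ready_set = set(ready)
        pending = [c for c in pending if c not in ready_set]
        # Producer tiles launched now become visible from the next step on.
        done.update(launched)
        steps += 1
    return steps


# Hypothetical example: 192 producer tiles, 192 consumer tiles, 80 execution units.
deps = [[i] for i in range(192)]  # assumed one-to-one tile dependency
print(steps_kernel_sync(192, 192, 80), steps_tile_sync(192, deps, 80))  # prints "6 5"
```

In this model, consumer tiles fill the 48 execution units left idle in the producer's last wave, which removes one of the six steps. Real speedups depend on tile runtimes, dependency patterns, and synchronization overhead, none of which this sketch captures.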