A practical performance model for compute and memory bound GPU kernels
Performance prediction of GPU kernels is generally a tedious procedure
with results of uncertain accuracy. In this paper, we provide a practical
model for estimating the performance of CUDA kernels on GPU hardware in
an automated manner.
First, we propose the quadrant-split model, an alternative to the
roofline visual performance model, which provides insight into the
performance limiting factors of multiple devices with different
compute-memory bandwidth ratios with respect to a particular kernel. We
elaborate on the compute- versus memory-bound characterization of
kernels. In addition, we developed a micro-benchmark program that
exposes peak compute and memory transfer performance at variable
operational intensities. Experimental results from executions on
different GPUs are presented.
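The roofline-style classification underlying this characterization can be sketched as follows. The device numbers and the helper names below are hypothetical illustrations, not values or code from the paper: a kernel whose operational intensity (flop/byte) falls below the device's compute-to-bandwidth ratio (the "ridge point") is memory bound, otherwise compute bound.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gb_s):
    """Attainable performance under the roofline model:
    capped by either peak compute or intensity * memory bandwidth."""
    return min(peak_gflops, intensity * bandwidth_gb_s)

def limiting_factor(intensity, peak_gflops, bandwidth_gb_s):
    """Classify a kernel as memory or compute bound on a given device."""
    ridge = peak_gflops / bandwidth_gb_s  # ridge point, in flop/byte
    return "memory" if intensity < ridge else "compute"

# Hypothetical device: 7000 GFLOP/s peak, 900 GB/s memory bandwidth.
# DAXPY performs 2 flops per 24 bytes moved (2 loads + 1 store of
# 8-byte doubles), i.e. intensity ~0.083 flop/byte -> memory bound.
print(limiting_factor(2 / 24, 7000, 900))    # -> "memory"
print(limiting_factor(20.0, 7000, 900))      # -> "compute"
```

Devices with different compute-memory bandwidth ratios shift the ridge point, which is what the quadrant-split model visualizes per kernel.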
In the proposed performance prediction procedure, a set of kernel
features is extracted through an automated profiling execution which
records a set of significant kernel metrics. Additionally, a small set
of device features for the target GPU is generated using
micro-benchmarking and architecture specifications. Combining the
kernel and device features, we determine the performance limiting
factor and generate an estimate of the kernel's execution time. We
performed experiments on DAXPY, DGEMM, FFT and stencil computation
kernels using 4 GPUs, showing an absolute prediction error of 10.1% in
the average case and 25.8% in the worst case.
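A minimal sketch of the kind of execution-time estimate such a procedure produces, assuming (hypothetically, not from the paper) that kernel time is bounded below by the slower of compute and memory transfer:

```python
def estimate_time_s(flops, bytes_moved, peak_flops_s, bandwidth_bytes_s):
    """Lower-bound execution time: the kernel can finish no faster than
    its compute work or its memory traffic allows, whichever dominates."""
    return max(flops / peak_flops_s, bytes_moved / bandwidth_bytes_s)

# DAXPY on n double-precision elements: 2 flops and 24 bytes
# (2 loads + 1 store) per element.  Device figures are hypothetical:
# 7 TFLOP/s peak compute, 900 GB/s memory bandwidth.
n = 1 << 24
t = estimate_time_s(flops=2 * n, bytes_moved=24 * n,
                    peak_flops_s=7e12, bandwidth_bytes_s=9e11)
print(f"estimated DAXPY time: {t * 1e3:.3f} ms")
```

For DAXPY the memory term dominates, consistent with its memory-bound classification; a full prediction model would refine this bound using the profiled kernel metrics and micro-benchmarked device features.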