Improving instruction scheduling in GPGPUs

Abstract

GPU architectures have become popular for executing general-purpose programs. They are also among the most efficient architectures for machine learning applications, which are some of the most popular and demanding workloads today. GPUs rely on running a large number of threads concurrently to hide the latency between dependent instructions. This work presents SOCGPU (Simple Out-of-order Core for GPU), a simple out-of-order execution mechanism that requires neither register renaming nor scoreboards. It uses a small Instruction Buffer and a tiny Dependence Matrix to track dependencies among instructions and avoid data hazards. Evaluations for an Nvidia GTX 1080 Ti-like GPU show that SOCGPU provides a speedup of up to 3.76x in some machine learning programs and 1.58x on average across a variety of benchmarks, while reducing energy consumption by 17.6%, with only 3.48% area overhead when using the same number of warps as the baseline. Moreover, we show that SOCGPU can reduce the number of concurrently running warps with hardly any effect on performance, which can provide significant reductions in area, especially in the register file and the instruction scheduler logic, as well as in other hardware structures of the GPU cores.
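
To make the dependence-matrix idea concrete, the following is a minimal software sketch of a generic matrix-based scheduler, not the SOCGPU hardware design, whose details the abstract does not give. All names (`Instr`, `DependenceMatrix`, `insert`, `ready`, `complete`) are hypothetical. It assumes one matrix row and column per buffer slot, and it marks a dependence for every RAW, WAW and WAR hazard against older instructions, since no register renaming is available to remove the last two.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    """A decoded instruction: one optional destination register, some sources."""
    name: str
    dst: Optional[int]
    srcs: Tuple[int, ...]


class DependenceMatrix:
    """dep[i][j] is True when the instruction in slot i must wait for slot j."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.slots: List[Optional[Instr]] = [None] * size
        self.dep = [[False] * size for _ in range(size)]

    def insert(self, instr: Instr) -> int:
        """Place instr in a free slot and record its hazards against older entries."""
        free = self.slots.index(None)  # raises ValueError if the buffer is full
        row = [False] * self.size
        for j, older in enumerate(self.slots):
            if older is None:
                continue
            raw = older.dst is not None and older.dst in instr.srcs
            waw = older.dst is not None and older.dst == instr.dst
            war = instr.dst is not None and instr.dst in older.srcs
            row[j] = raw or waw or war
        self.dep[free] = row
        self.slots[free] = instr
        return free

    def ready(self) -> List[int]:
        """Slots whose row is empty, i.e. with no pending dependence."""
        return [i for i, ins in enumerate(self.slots)
                if ins is not None and not any(self.dep[i])]

    def complete(self, i: int) -> None:
        """Free slot i and clear its column so that dependents can wake up."""
        self.slots[i] = None
        self.dep[i] = [False] * self.size
        for row in self.dep:
            row[i] = False
```

In this sketch, an instruction that does not depend on an older long-latency load becomes ready immediately, so it can be issued ahead of a stalled, dependent instruction from the same warp. For simplicity, issue and completion are collapsed into the single `complete` step.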
