Persistent Kernels for Iterative Memory-bound GPU Applications
Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU
implementations have a loop on the host side that invokes the GPU kernel once
per time/algorithm step. The termination of each kernel
implicitly acts as the barrier required after advancing the solution every time
step. We propose a scheme for running memory-bound iterative GPU kernels:
PERsistent KernelS (PERKS). In this scheme the time loop is moved inside a
persistent kernel, and device-wide barriers are used for synchronization. We
then reduce the traffic to device memory by caching a subset of the output in
each time step in registers and shared memory to be used as input for the
following time step. PERKS can be generalized to any iterative solver because
the scheme is largely independent of the solver's implementation. We explain the design
principle of PERKS and demonstrate the effectiveness of PERKS for a wide range
of iterative 2D/3D stencil benchmarks (geometric mean speedup of x in
small domains and x in large domains), and a Krylov subspace solver
(geometric mean speedup of x in smaller SpMV datasets from SuiteSparse
and x in larger SpMV datasets, for conjugate gradient).
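
The difference between the host-side loop and the persistent-kernel scheme can be illustrated with a small CPU-side model. The sketch below is not the authors' implementation and contains no GPU code: it is a plain Python accounting model for a 1D 3-point stencil, and the names `stencil_step`, `host_loop`, `persistent`, and the `cached` parameter are hypothetical. It assumes a single on-chip cache region and ignores the halo exchange between thread blocks that a real persistent kernel would need. Each list element read from or written to the `device` list counts as one unit of device-memory traffic.

```python
def stencil_step(u):
    """One 3-point averaging step; the two boundary cells stay fixed."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def host_loop(u0, steps):
    """Baseline model: the host launches one kernel per time step.

    Every launch loads the whole domain from device memory and stores the
    whole result back; the end of the launch acts as the barrier.
    """
    device = list(u0)
    traffic = 0
    launches = 0
    for _ in range(steps):
        inp = list(device)           # load n cells from device memory
        traffic += len(inp)
        device = stencil_step(inp)   # store n cells to device memory
        traffic += len(device)
        launches += 1                # kernel termination = implicit barrier
    return device, traffic, launches


def persistent(u0, steps, cached):
    """Persistent-kernel model: one launch, time loop inside the kernel.

    Assumption for illustration: the first `cached` cells stay on chip
    (registers / shared memory) for the whole run, so only the remaining
    cells travel to and from device memory in each step.
    """
    n = len(u0)
    device = list(u0)
    launches = 1
    cache = device[:cached]          # initial fill of the on-chip cache
    traffic = cached
    for _ in range(steps):
        inp = cache + device[cached:]    # uncached part comes from device memory
        traffic += n - cached
        out = stencil_step(inp)
        cache = out[:cached]             # cached output is reused next step
        device[cached:] = out[cached:]   # uncached output is written back
        traffic += n - cached
        # a device-wide barrier would separate consecutive steps here
    device[:cached] = cache          # final write-back of the cached cells
    traffic += cached
    return device, traffic, launches
```

In this model the baseline moves `2 * n * steps` cells, while the persistent variant moves `2 * cached + 2 * (n - cached) * steps`, so the saving grows with the fraction of the domain that fits on chip. This is consistent with the abstract reporting separate results for small and large domains, although the model itself says nothing about actual speedups.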