One of the key requirements for the Lattice QCD Application Development as
part of the US Exascale Computing Project is performance portability across
multiple architectures. Using the Grid C++ expression template as a starting
point, we report on the progress made with regard to the Grid GPU offloading
strategies. We present both the successes and issues encountered in using CUDA,
OpenACC and Just-In-Time compilation. We report experiments and performance
measurements on GPUs with an SU(3)×SU(3) streaming test. We also report
on the challenges of using current OpenMP 4.x for GPU offloading in the same
code.
Comment: 8 pages, 4 figures. Talk presented at the 35th International
Symposium on Lattice Field Theory, 18-24 June 2017, Granada, Spain