Optimization of Tensor-product Operations in Nekbone on GPUs
In the CFD solver Nek5000, the computation is dominated by the evaluation of
small tensor operations. Nekbone is a proxy app for Nek5000 and has previously
been ported to GPUs with a mixed OpenACC and CUDA approach. In this work, we
continue this effort and optimize the main tensor-product operation in Nekbone
further. Our optimization is written in CUDA and uses a different,
two-dimensional thread structure that performs the computations layer by layer.
This enables loop unrolling and efficient use of registers and shared memory. Our
implementation is then compared on both the Pascal and Volta GPU architectures
to previous GPU versions of Nekbone as well as a measured roofline. The results
show that our implementation outperforms previous GPU Nekbone implementations
by 6-10%. Compared to the measured roofline, we obtain 77-92% of the peak
performance on both Nvidia P100 and V100 GPUs for inputs with 1024-4096
elements and polynomial degree 9.
Comment: 4 pages, 4 figures
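
The layer-by-layer idea described in the abstract can be illustrated with a small CPU reference. This is only a sketch under stated assumptions, not the authors' CUDA kernel: it is written in plain Python, it covers only the gradient part of a tensor-product operation (applying an n x n derivative matrix along each axis of one element), and the names `grad3_layered`, `u`, `D`, `layer` and `column` are hypothetical. The two inner loops over `j` and `i` stand in for a 2D thread block, `layer` stands in for a slice held in shared memory, and `column` stands in for per-thread registers.

```python
def grad3_layered(u, D):
    """Apply the 1D derivative matrix D along each axis of one n x n x n element.

    u[k][j][i] is the nodal field and D[a][b] the n x n derivative matrix.
    Returns (ur, us, ut), each indexed as [k][j][i].
    """
    n = len(D)

    def zeros():
        return [[[0.0] * n for _ in range(n)] for _ in range(n)]

    ur, us, ut = zeros(), zeros(), zeros()

    # Stand-in for registers: each (i, j) "thread" keeps its own k-column of u,
    # stored so that column[j][i][l] == u[l][j][i].
    column = [[[u[l][j][i] for l in range(n)] for i in range(n)]
              for j in range(n)]

    for k in range(n):              # march through the element layer by layer
        layer = u[k]                # stand-in for the shared-memory k-slice
        for j in range(n):          # these two loops play the role of the
            for i in range(n):      # 2D thread block (one thread per i, j)
                r = s = t = 0.0
                for l in range(n):  # fixed trip count n, so unrollable in CUDA
                    r += D[i][l] * layer[j][l]      # derivative along i
                    s += D[j][l] * layer[l][i]      # derivative along j
                    t += D[k][l] * column[j][i][l]  # derivative along k
                ur[k][j][i], us[k][j][i], ut[k][j][i] = r, s, t
    return ur, us, ut
```

As a quick check, a field that is linear in each index, combined with a two-point difference matrix `D = [[-1.0, 1.0], [-1.0, 1.0]]`, gives a constant derivative in each direction. In an actual GPU kernel, the `layer` slice would need a synchronization point between layers, which this sequential sketch does not model.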