The rapid development in computing technology has paved the way for
directive-based programming models towards a principal role in maintaining
software portability of performance-critical applications. Efforts on such
models involve a least engineering cost for enabling computational acceleration
on multiple architectures while programmers are only required to add meta
information upon sequential code. Optimizations for obtaining the best possible
efficiency, however, are often challenging. The insertions of directives by the
programmer can lead to side-effects that limit the available compiler
optimization possible, which could result in performance degradation. This is
exacerbated when targeting multi-GPU systems, as pragmas do not automatically
adapt to such systems, and require expensive and time consuming code adjustment
by programmers.
This paper introduces JACC, an OpenACC runtime framework which enables the
dynamic extension of OpenACC programs by serving as a transparent layer between
the program and the compiler. We add a versatile code-translation method for
multi-device utilization by which manually-optimized applications can be
distributed automatically while keeping original code structure and
parallelism. We show in some cases nearly linear scaling on the part of kernel
execution with the NVIDIA V100 GPUs. While adaptively using multi-GPUs, the
resulting performance improvements amortize the latency of GPU-to-GPU
communications.Comment: Extended version of a paper to appear in: Proceedings of the 28th
IEEE International Conference on High Performance Computing, Data, and
Analytics (HiPC), December 17-18, 202