Abstract: Programming models like CUDA, OpenMP, OpenACC, and OpenCL are designed to offload compute-intensive workloads to accelerators efficiently. However, the naive offload model, which synchronously copies and executes in sequence, requires extensive hand-tuning with techniques such as pipelining to overlap computation and communication. Therefore, we propose an easy-to-use, directive-based pipelining extension for OpenMP to overlap data transfers and kernel computation. This extension can map data to a pre-allocated device buffer and can automate memory-constrained array indexing and sub-task scheduling. We evaluate a prototype implementation of our approach with three different applications. The experimental results show that our approach can reduce memory usage by 52% to 97% while delivering a 1.41× to 1.65× speedup over the naive offload model.
I. INTRODUCTION
Systems with accelerators, particularly GPUs, are becoming prominent on the Top500. Many purpose-built programming models have been created for accelerators, but rather than grapple with unfamiliar programming models, scientists often prefer to keep using their existing verified C or FORTRAN code. OpenMP 4 [1] and OpenACC [2] allow for the straightforward adoption of that existing code.
These models use a similar offload approach. Users ensure that the accelerator can access their data. They then launch their computation on the accelerator and ensure that the results are available on the host when needed. These data transfers can take a large portion of execution time under this naive offload model. Further, the data often does not fit in the device memory of the accelerator because scientific applications frequently use large data arrays or matrices. As a consequence, the user must manually split the data and the associated computation, which can involve significant code changes. To address these issues, we extend OpenMP to automate the partitioning of the data and the overlapping of data transfers with computation through pipelining. These extensions allow data to be mapped into a small buffer to reduce memory usage. This paper makes the following contributions: (1) a comprehensive initial study that identifies limitations of current programming extensions for GPU devices; (2) a new directive-based pipelined extension for OpenMP that automates the overlap of data transfers and kernel computation and the reduction of GPU memory consumption; (3) a prototype implementation of our approach; and (4) a detailed evaluation of our approach for three applications on an AMD Radeon 7970 GPU and an NVIDIA K40m GPU. Our results demonstrate that our approach can provide a 1.41× to 1.65× speedup while reducing memory usage by 52% to 97% relative to the naive offload model.
II. BACKGROUND
Supercomputers are increasingly being equipped with accelerators, such as GPUs, FPGAs, APUs, and co-processors like the Intel Xeon Phi. Programming these accelerators requires the use of alternative programming models or language extensions such as CUDA, OpenMP, OpenACC, and OpenCL.
OpenMP is a directive-based extension for FORTRAN, C and C++ that is best known for providing portable multithreading on shared-memory multicore systems [3]. Since OpenMP 4.0, OpenMP has included device constructs that target offload to an accelerator with a potentially distinct memory space. This OpenMP support for accelerators is still relatively nascent, offering opportunities for improvements [4].
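To make the naive offload model concrete, the following minimal sketch uses standard OpenMP 4 device constructs. The function name scale_naive and the scaling kernel are illustrative assumptions, not code from our applications. When compiled without offload support, the pragma is ignored and the loop runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: out[i] = 2 * in[i], offloaded naively. */
void scale_naive(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    /* map(to:) copies the whole input to the device before the kernel
     * starts; map(from:) copies the whole output back after it ends.
     * Copy-in, compute, and copy-out therefore run strictly in sequence. */
    #pragma omp target teams distribute parallel for \
            map(to: in[0:n]) map(from: out[0:n])
    for (size_t i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
}
```

Both arrays must fit in device memory in full, and no transfer overlaps with computation, which are the two limitations that the pipelining extension targets.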
CUDA is a parallel computing framework from NVIDIA, targeting only NVIDIA GPUs. It is currently one of the most widely used programming models for GPUs despite its lack of portability. CUDA often requires the programmer to re-factor their code significantly.
OpenCL is a low-level model that is similar to CUDA. Existing OpenCL implementations offer portability across GPUs, multicore CPUs, co-processors, and FPGAs. However, OpenCL provides a complex, very low-level API that requires significantly more code than even CUDA.
OpenACC provides directives to define compute and data regions in C, C++, and FORTRAN programs. Several studies have compared its directive-based approach to CUDA in terms of performance, portability, and programmability [5], [6].
III. DESIGN
Our extension and its implementation overlap computation and data transfers. Users need neither re-factor their code nor manually break down work. To support this goal, we must be able to partially allocate and free device arrays. We do this by dividing the loop into several smaller chunks and then launching each chunk on a different GPU stream along with its required data transfers. As soon as the data transfer of the first chunk finishes, its kernel begins execution. Each chunk's transfers are enqueued separately, and thus may run in parallel. We control the chunk size in the runtime system automatically to avoid exceeding available memory, and we auto-tune the chunk size and the number of chunks and streams for performance. Our framework calculates the dependencies of the current loop (computation) chunks and then removes the data that only previous chunks required. By mapping the data array to a small pre-allocated device buffer, we copy new data into the location of this stale data inside the buffer. Thus, by mapping the segment of data for a chunk into a small buffer, we can significantly reduce the live memory requirement of many kernels.
Figure 1 presents the clauses that our extension adds to OpenMP. The pipeline() clause specifies which schedule to use: static or adaptive. With the static schedule, the user must specify chunk_size and num_stream. With the adaptive schedule, the user can provide an initial chunk size and a maximum number of streams. Each pipeline_copy() clause defines characteristics of an input and/or output array for which <var> is the variable or base pointer. The <size> argument defines the size of each "item" in the array/matrix. If the <cond> argument is true, then the dimension is partitioned, while all other dimensions are copied in full for each chunk. Finally, <num> indicates the number of items in the dimension.
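As an illustration of the chunking step in this design, the following minimal sketch splits the iterations of the partitioned loop into chunks and assigns each chunk to a stream. The helper names (num_chunks, chunk_bounds, stream_for_chunk) and the round-robin stream assignment are assumptions made for this example; they are not the interface of our runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open iteration range [begin, end). */
typedef struct { size_t begin, end; } range_t;

/* Number of chunks needed to cover n iterations (rounds up). */
size_t num_chunks(size_t n, size_t chunk_size)
{
    assert(chunk_size > 0);
    return (n + chunk_size - 1) / chunk_size;
}

/* Iterations covered by chunk c; the last chunk may be shorter. */
range_t chunk_bounds(size_t n, size_t chunk_size, size_t c)
{
    assert(chunk_size > 0);
    range_t r;
    r.begin = c * chunk_size;
    if (r.begin > n)
        r.begin = n;
    r.end = (n - r.begin < chunk_size) ? n : r.begin + chunk_size;
    return r;
}

/* Assumed policy: chunks are handed to streams round-robin. */
size_t stream_for_chunk(size_t c, size_t num_stream)
{
    assert(num_stream > 0);
    return c % num_stream;
}
```

For example, 10 iterations with a chunk_size of 4 yield three chunks covering [0, 4), [4, 8), and [8, 10); with two streams, chunks 0 and 2 share one stream and chunk 1 uses the other.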
We must also consider data dependencies, so we integrate this information into the pipeline_shadow() clause, which specifies which inputs are required to produce the output of a given iterator value.
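The dependency information can be pictured with a stencil-like access pattern. The sketch below assumes that output item i reads input items i - halo_lo through i + halo_hi along the partitioned dimension; this pattern, and the names input_range and stale_range, are hypothetical and serve only to show how the inputs of a chunk and the stale data left behind by it could be derived.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open item range [begin, end). */
typedef struct { size_t begin, end; } range_t;

/* Input items needed to produce output items [out.begin, out.end),
 * clamped to the array bounds [0, n). */
range_t input_range(range_t out, size_t halo_lo, size_t halo_hi, size_t n)
{
    assert(out.begin <= out.end && out.end <= n);
    range_t in;
    in.begin = (out.begin >= halo_lo) ? out.begin - halo_lo : 0;
    in.end = (n - out.end >= halo_hi) ? out.end + halo_hi : n;
    return in;
}

/* Items held for the current chunk that the next chunk does not read.
 * Their buffer space can be overwritten with new data. */
range_t stale_range(range_t cur_in, range_t next_in)
{
    range_t s;
    s.begin = cur_in.begin;
    s.end = (next_in.begin < cur_in.end) ? next_in.begin : cur_in.end;
    if (s.end < s.begin)
        s.end = s.begin;
    return s;
}
```

With a halo of one item on each side and 16 items in total, the chunk producing outputs [4, 8) needs inputs [3, 9). Once the chunk producing [0, 4) has finished, its inputs [0, 3) are stale because the next chunk starts reading at item 3.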
IV. IMPLEMENTATION
We realize a prototype of our proposed extension for three applications: (1) a lattice QCD application; (2) the stencil benchmark from the Parboil benchmark suite [7]; and (3) the 3D convolution benchmark from the Polybenchmark set [8]. We split each loop into configurable-sized chunks. Each chunk has its own data dependencies, which must be present on the device before its kernel executes. Different streams handle each chunk. Because our extension already defines the number of streams, chunk size, and data dependencies, we pre-allocate a device buffer that conforms to the memory usage parameter. We then map the data from the original data space to the buffer data space and copy each chunk to its corresponding location in the GPU memory buffer. Once a data chunk is not needed for later partitions (kernels), we replace it with data required by a later partition.
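The buffer reuse described here can be sketched on the host as follows. This example assumes that the buffer is organized as a ring of fixed-size rows and that row r of the host array always lands in slot r modulo the buffer size; memcpy stands in for an asynchronous host-to-device copy. The function stage_rows and the modulo placement policy are assumptions for illustration and may differ from the mapping used by our prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy host rows [begin, end) into a buffer of buffer_rows rows, each
 * holding row_elems floats. Row r lands in slot r % buffer_rows and
 * overwrites the older row that previously occupied that slot. */
void stage_rows(float *buffer, size_t buffer_rows,
                const float *host, size_t row_elems,
                size_t begin, size_t end)
{
    assert(buffer_rows > 0);
    assert(begin <= end);
    assert(end - begin <= buffer_rows); /* a chunk must fit in the buffer */
    for (size_t r = begin; r < end; ++r) {
        size_t slot = r % buffer_rows;
        memcpy(buffer + slot * row_elems,
               host + r * row_elems,
               row_elems * sizeof(float));
    }
}
```

A kernel working on the buffer must apply the same mapping when it indexes a row, which is the array indexing overhead that the Pipelined-buffer version pays in exchange for the smaller footprint.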
We first transform the application and benchmarks into OpenACC as a baseline that we denote as "Naive." We realize a pipelined version ("Pipelined") of each benchmark that manually divides the iterations but does not alter array indices, and thus requires the full memory footprint in device memory. Finally, we use our extended runtime to map the chunks into a reduced memory space ("Pipelined-buffer").
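For comparison, a hand-pipelined version in the style of "Pipelined" might look like the sketch below, written with standard OpenACC directives. The kernel, the function name scale_pipelined, and the round-robin choice of async queue are illustrative assumptions rather than code from our benchmarks. Without OpenACC support the pragmas are ignored and the loop runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hand-pipelined kernel: out[i] = 2 * in[i], chunk by chunk. */
void scale_pipelined(const float *in, float *out, size_t n,
                     size_t chunk_size, int num_streams)
{
    assert(chunk_size > 0 && num_streams > 0);
    /* The full arrays are still allocated on the device, so the memory
     * footprint matches the naive version. */
    #pragma acc data create(in[0:n], out[0:n])
    {
        int c = 0;
        for (size_t lo = 0; lo < n; lo += chunk_size, ++c) {
            size_t len = (n - lo < chunk_size) ? n - lo : chunk_size;
            int q = c % num_streams; /* async queue used by this chunk */
            (void)q;                 /* unused when OpenACC is disabled */
            #pragma acc update device(in[lo:len]) async(q)
            #pragma acc parallel loop present(in[0:n], out[0:n]) async(q)
            for (size_t i = lo; i < lo + len; ++i)
                out[i] = 2.0f * in[i];
            #pragma acc update self(out[lo:len]) async(q)
        }
        /* Wait for every queue before leaving the data region. */
        #pragma acc wait
    }
}
```

Operations on the same queue execute in order, while operations on different queues may overlap, so the transfer of one chunk can proceed while the kernel of another chunk runs.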
V. EVALUATION AND DISCUSSION
We evaluate our approach on three applications: a 3D convolution benchmark; a stencil benchmark; and a lattice QCD application. We run our experiments on two types of GPUs: AMD Radeon 7970 and NVIDIA Tesla K40m.
A. 3D Convolution
We use the 3D convolution benchmark from the Polybenchmark suite as an example on which to evaluate our approach. The "Pipelined" version achieves 1.7× speedup over the "Naive" version while the "Pipelined-buffer" version delivers 1.41× speedup over the "Naive" version.
In terms of memory usage, since the default test case of the benchmark is relatively large, the Naive and Pipelined versions require about 3.5 GB of GPU memory. Our Pipelined-buffer version consumes only 93 MB of GPU memory, which translates to a 97% reduction in device memory usage. With this huge memory savings, we could potentially run much larger datasets or keep other useful data structures in device memory for a larger application. Table I shows that the number of overlapping streams affects the performance of the Pipelined version. Unlike for the stencil benchmark, two streams do not deliver the best performance here; we instead need up to six streams to achieve the best performance. As our results show with our other applications, the number of streams can significantly affect performance, but the ideal number of streams varies across applications.
Our Pipelined-buffer implementation is much less sensitive to the number of streams and provides a place to auto-tune the number of streams. As shown in Table I , the execution time for our Pipelined-buffer version is consistent, regardless of stream count. Ultimately, we save 96% of the memory space even with eight streams.
Next, we test the performance of the 3D convolution benchmark on the AMD Radeon 7970 GPU. At first, we find that the Pipelined version is 57% slower than the Naive version, which is significantly different from our NVIDIA K40m results. This difference is due to data transfer times that lead to significant performance degradation. Although the data volumes that are transferred are the same, the Pipelined version takes much longer to move them: the transfer rate for the Naive version is about 6 GB/s while it is only 2 GB/s for the Pipelined version.
To try to address this issue, we vary the chunk_size and num_stream arguments. We conclude that, although more chunks imply more API call overhead, this overhead is negligible on NVIDIA GPUs. However, it is more significant on the AMD GPU. The AMD APP Profiler indicates that the performance degradation arises because we split the task by the outer loop into small chunks, which requires many API calls and incurs high scheduling overhead. Splitting the task into small chunks also decreases the array size of each transfer, thus limiting bandwidth. To test our theory, we modify our code to decrease the number of chunks. Figure 2 shows that if we split the problem into only two chunks, we achieve a 1.2× speedup over the Naive version. Performance improves as we increase the number of chunks until we use nine chunks, after which it degrades sharply.
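The effect of chunk count on transfer bandwidth can be reasoned about with a simple cost model. The sketch below assumes that every transfer call pays a fixed overhead in addition to the time needed at peak bandwidth. This model and its parameters are assumptions for illustration; we did not fit them to our measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Seconds needed to move total_bytes in num_chunks equal transfers,
 * assuming a fixed per-call overhead plus bytes / peak bandwidth. */
double transfer_time(double total_bytes, size_t num_chunks,
                     double call_overhead_s, double peak_bytes_per_s)
{
    assert(num_chunks > 0 && peak_bytes_per_s > 0.0);
    return (double)num_chunks * call_overhead_s
         + total_bytes / peak_bytes_per_s;
}

/* Bandwidth observed by the application under the same model. */
double effective_bandwidth(double total_bytes, size_t num_chunks,
                           double call_overhead_s, double peak_bytes_per_s)
{
    return total_bytes / transfer_time(total_bytes, num_chunks,
                                       call_overhead_s, peak_bytes_per_s);
}
```

With made-up values of 6 GB of data, a 6 GB/s peak, and 2 ms of overhead per call, a single transfer takes about 1 s, whereas 1000 small transfers take about 3 s, which lowers the effective bandwidth to roughly 2 GB/s.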
B. Stencil
The Parboil Stencil benchmark represents an iterative Jacobi solver of the heat equation on a 3-D structured grid. We realize a prototype of the stencil benchmark using our approach. We then evaluate the performance of the stencil benchmark on the K40m GPU. The Pipelined version, which uses native OpenACC pragmas to pipeline the kernel computation and data transfer, achieves 1.57× speedup over the Naive version. Our Pipelined-buffer version is faster than the Pipelined version, even including the time to handle array indexing and function calls. Our analysis finds that we only use two streams to implement the Pipelined-buffer version. In contrast, the Pipelined version assigns one stream to each subtask with the OpenACC async() clause, which means that it uses the maximum number of available GPU streams by default. Although more GPU streams could potentially hide more bubbles in the pipeline, they require more scheduling and API calls and can create contention overhead. Overall, these effects cost more than the benefit gained from overlapping data transfer and kernel computation. Since these parameters are building blocks of our schedules, we evaluate their performance, keeping the stream constant, as Table II shows. We observe that the Pipelined version uses eight streams by default and that, as we increase the number of GPU streams, the execution time of the Pipelined version increases dramatically while our Pipelined-buffer version is quite stable. If we limit the number of streams to two instead of using the default eight streams, the Pipelined version performs the best.
Either pipelined version delivers at least 1.5× speedup over the Naive version. For memory usage, our Pipelined-buffer version reduces memory consumption by nearly 50% compared to the Pipelined version.
We then test the performance with the default number of chunks on the AMD HD 7970. For the stencil benchmark, the Naive version is 56% faster than the Pipelined version. We again verify that reduced effective transfer bandwidth leads to the performance loss. Figure 2 shows that with two chunks, the Pipelined version achieves 1.35× speedup over the Naive version. As we increase the number of chunks up to four chunks, performance improves slightly. With even more chunks, performance degrades until it is the same as the Naive version between 10 and 20 chunks, after which it becomes worse than the Naive version.
Since the AMD GPU device is sensitive to chunk_size, the trade-off between performance and memory usage is also an important building block of the auto-tuning scheduler that we propose as future work.
C. Lattice QCD
Our lattice QCD benchmark application is a relatively large application from the SciDAC and LLNL Lattice Group.
We evaluate the performance of the lattice QCD code. Although its speedup is not as large as that of the Pipelined version (up to 1.9×), our Pipelined-buffer prototype still delivers competitive performance. In the large test case, our prototype delivers 1.54× speedup over the Naive version. The extensive indexing work needed to map the high-dimensional space to the pre-allocated buffer probably explains the performance difference. Nonetheless, the Pipelined-buffer version significantly outperforms the Naive one.
In terms of memory consumption, our prototype significantly reduces GPU memory usage. As we increase the problem size, the memory savings also increase. For the largest test case, our approach reduces memory usage on the GPU by up to 79% and achieves competitive performance.
D. Evaluation Summary and Discussion
We realize a prototype using our approach for the Parboil stencil benchmark, a 3D convolution benchmark, and a lattice QCD application. We show that our approach can significantly reduce GPU memory usage for these applications while delivering performance that is competitive with a hand-written pipelined version. We compare the sensitivity to the number of streams with the standard OpenACC implementation and observe that the best choice varies with the application and its input. The complex relationship between concurrency in data transfers, kernel launching, and stream scheduling overhead makes optimal performance difficult to achieve with hand-coded approaches. The trade-off has no single best setting, and choosing the wrong value can adversely impact performance. In terms of memory consumption, our implementation can save a large portion of GPU memory for these benchmarks. Our results demonstrate that we can save more memory as the size of the test case increases.
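The trade-off between overlap and per-chunk overhead can be illustrated with an idealized two-stage pipeline model. The sketch below assumes equal-sized chunks, one transfer stage and one kernel stage per chunk, a fixed overhead added to each stage, and no copy-back stage. These simplifications are assumptions of this example, not a model that we validated.

```c
#include <assert.h>
#include <stddef.h>

/* Naive offload: transfer everything, then compute everything. */
double naive_time(double transfer_s, double kernel_s)
{
    return transfer_s + kernel_s;
}

/* Idealized two-stage pipeline over num_chunks equal chunks.
 * Each stage of each chunk pays overhead_s on top of its share of work. */
double pipelined_time(double transfer_s, double kernel_s,
                      size_t num_chunks, double overhead_s)
{
    assert(num_chunks > 0);
    double t = transfer_s / (double)num_chunks + overhead_s;
    double k = kernel_s / (double)num_chunks + overhead_s;
    double stage = (t > k) ? t : k; /* the slower stage sets the pace */
    /* First transfer, then (num_chunks - 1) overlapped steps, then the
     * last kernel. */
    return t + (double)(num_chunks - 1) * stage + k;
}
```

With 1 s of transfer and 1 s of kernel time, four chunks without overhead finish in 1.25 s instead of 2 s. If every stage pays 0.25 s of overhead, the same four chunks take 2.5 s and the pipeline becomes slower than the naive version, which mirrors the sensitivity to chunk and stream counts observed above.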
The implementation of our prototype revealed limitations of OpenACC for this pattern. The naive offloading model, which synchronously copies and executes in sequence, is inefficient. However, manually pipelining the kernel computation and data transfer significantly reduces programmability. Moreover, the current extension allocates GPU memory based on the host array pointers; no "partial array asynchronous copy" APIs are available. Thus, regardless of whether we use synchronous or asynchronous copies, we can only allocate the entire array in GPU memory, exactly the same as the one on the host. Our approach addresses this problem by mixing CUDA and OpenACC APIs, handling the indexing to map the host array to a pre-allocated device buffer, and scheduling data movement and kernels correctly. Our proposed extension provides good programmability and high performance and can significantly reduce memory requirements at the same time.
We also find that the AMD GPU is sensitive to the number of chunks that we create, which is significantly different from an NVIDIA GPU. More chunks incur more API calls and scheduling overhead. Moreover, if the chunk_size is too small, the data transfer cannot achieve full bandwidth.
VI. RELATED WORK
Task-based models like OmpSs [9] and StarPU [10] are based around constructing graphs of "tasks" composed of statically sized chunks of data and computation, which are then scheduled. Our extension dynamically generates a range of logical tasks from the representation provided by the user, ending with a similar result but giving the runtime more flexibility.
CoreTSAR [11], [12] explored automated coscheduling between devices with potentially disjoint memory spaces. CoreTSAR did this by including mapping functionality that could associate data with computation along a single dimension for certain specific patterns. Our specifications are similar to the array association pattern employed by CoreTSAR, in that they take in similar information, but CoreTSAR used this information to divide computation across devices rather than to overlap computation and communication, or reduce memory use, as we do.
Our extension maps high-dimensional arrays to a low-dimensional buffer, converting non-contiguous data and some specific data structures into a contiguous layout for data movement. Recent studies on MPI libraries such as MVAPICH2 [13], [14], MPICH2 [15], and OpenMPI [16] provide such support, pipelining data transfers over PCIe with data transfers on high-performance interconnects to optimize bandwidth. However, for the standard directive-based programming models, such as OpenMP and OpenACC, the lack of such support has hindered their adoption because of the resulting obstacles to programmability and performance.
VII. CONCLUSION
In this paper we propose a directive-based pipelining extension for offload models such as OpenMP 4.X and OpenACC. Our extension has an easy-to-use interface that allows GPU programmers to pipeline data transfers, thus automating overlap of computation and communication. Further, mapping the host array in subsections into a device buffer reduces memory requirements.
Our results show that our extension can significantly reduce memory consumption while still delivering competitive performance. The memory savings of our approach increase with problem size. Moreover, we find that our implementation is less sensitive to the number of streams used than a typical hand-coded pipelining solution.
Our prototype implementation already shows the benefit of our directive-based pipelining extension, in terms of programmability, performance and memory consumption for some specific applications. We will continue this work by investigating more benchmarks and use-cases for our extension on all kinds of accelerators. We are also considering a source-to-source translator based on our previous work [11], [12]. Finally, we will further study how the other parameters affect our design for specific applications on a specific system and integrate a performance model in an auto-tuning scheduler.
