We use OpenMP directives to target hardware accelerators (GPUs) on Summit, a newly deployed supercomputer at the Oak Ridge Leadership Computing Facility (OLCF), demonstrating simplified access to GPU devices for users of our astrophysics code GenASiS and useful speedup on a sample fluid dynamics problem. At a lower level, we use the capabilities of Fortran 2003 for C interoperability to provide wrappers to the OpenMP device memory runtime library routines (currently available only in C). At a higher level, we use C interoperability and Fortran 2003 type-bound procedures to modify our workhorse class for data storage to include members and methods that significantly streamline the persistent allocation of and on-demand association to GPU memory. Where the rubber meets the road, users offload computational kernels with OpenMP target directives that are rather similar to constructs already familiar from multi-core parallelization. In this initial example we demonstrate total wall time speedups of ∼ 4X in 'proportional resource tests' that compare runs with a given percentage of nodes' GPUs with runs utilizing instead the same percentage of nodes' CPU cores, and reasonable weak scaling up to 8000 GPUs vs. 56,000 CPU cores (1333 1/3 Summit nodes). These speedups increase to over 12X when pinned memory is used strategically. We make available the source code from this work at https://github.com/GenASiS/GenASiS_Basics.
Introduction
As of version 4.5, OpenMP directives provide an excellent opportunity to access the extraordinary computational power of the GPUs on machines like Summit at the Oak Ridge Leadership Computing Facility (OLCF). Most notably, the similarity to existing OpenMP multi- or many-core parallelization coding patterns allows relative ease of porting and minimizes divergence between versions of a code to be run with or without GPUs or other hardware acceleration devices.
That the programmer need not engage the CUDA programming model is particularly useful in Fortran applications. Given the broad acceptance and entrenched use of OpenMP, it can be expected that the facilities for devices now specified in its standard will receive wide support. This is in contrast, for instance, with CUDA Fortran, a non-standard Fortran extension provided by only two compiler vendors (PGI and IBM XL), excluding many widely-used compilers (e.g. GCC, Intel, and Cray). Moreover, OpenACC, an alternative directive-based approach, is currently actively supported only by the PGI compiler, apart from some ongoing efforts in GCC. On Summit in particular, excellent support for OpenMP 4.5 already exists with the IBM XL compiler. Our experience so far is that Fortran features and semantics such as multidimensional arrays, array sections, and pointer remapping (allowing for instance access to a given memory block as either a 3D or 1D array) are translated just as one would hope and expect.
We have recently begun using OpenMP directives to adapt our code GenASiS (General Astrophysical Simulation System) to the exploitation of GPUs on Summit. Aggressively deploying the features introduced in Fortran 2003 that facilitate an object-oriented approach, GenASiS is an extensible multiphysics simulation code aimed primarily at astrophysics applications [1]. Early versions have been used to study aspects of the post-bounce core-collapse supernova environment [2, 3, 4, 5, 6]. GenASiS is divided into three primary divisions: Basics, which includes utilitarian functionality generically needed by physics simulations on distributed-memory supercomputers [7, 8]; Mathematics, which encompasses implementations of manifolds (i.e. meshes), operations, and solvers [9]; and Physics, which comprises spaces governed by various theories of spacetime and gravity, specific forms of stress energy such as fluids and radiation, and the combination of these into universes.
In this paper we report our initial explorations, which include implementation of GPU-related infrastructure in GenASiS Basics and its use in a simple fluid dynamics problem, the RiemannProblem example built only upon Basics classes. In Section 2 we describe the functionality we have added to GenASiS to streamline the use of GPUs. In Section 3 we introduce the RiemannProblem example and describe its porting to GPU-aware code. Section 4 reports the performance gains we obtain on Summit by exploiting its GPU resources, as compared with proportional use of its CPU resources. In Section 5 we show further performance gains that can be achieved when future OpenMP specifications adopt more advanced memory management. Section 6 offers some concluding thoughts on the state of OpenMP implementations vis-à-vis GPU usage. In what follows we use the term 'host' to refer to the set of cores assigned to a Message Passing Interface (MPI) process on the CPU, and 'device' to refer to the GPU available for the same MPI process. (In its current version, our code can only use one GPU per MPI process.)
We make available the source code from this work at https://github.com/GenASiS/GenASiS_Basics.
Using an Accelerator Device in GenASiS
OpenMP directives enable the automatic transfer of data needed by offloaded kernels from the host to the device and back without intervention by the programmer; but because large data transfers incur significant overhead, it behooves applications to instead take affirmative control of significant data movement. In particular, it is possible to persistently allocate memory on the device for predictively repeated use; to flexibly associate Fortran variables with previously allocated memory locations on the device in an on-demand manner; and to command the transfer of data between the host and device. When data is present on the device, and host variables are associated with device locations, a computational kernel expressed in terms of those host variables can be executed on the device by enclosing the kernel code with the OpenMP target directive and its corresponding end target directive.
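As a minimal, self-contained illustration of these general mechanisms, the following toy program (not GenASiS code; the program, array names, and sizes are arbitrary) encloses an offloaded kernel in a directive-based target data region so that data stays resident on the device across the enclosed target construct(s). GenASiS instead takes control of device memory with the runtime library routines described in the following subsections.

program TargetData_Sketch

  use iso_fortran_env, only : real64
  implicit none

  integer :: iV
  real ( real64 ), dimension ( : ), allocatable :: A, B, C

  allocate ( A ( 2 ** 20 ), B ( 2 ** 20 ), C ( 2 ** 20 ) )
  A = 1.0_real64
  B = 2.0_real64

  ! Transfer A and B to the device once and keep all three arrays resident
  ! for the duration of the enclosed (possibly multiple) target constructs
  !$OMP target data map ( to : A, B ) map ( from : C )

  !$OMP target teams distribute parallel do
  do iV = 1, size ( C )
    C ( iV ) = A ( iV ) + B ( iV )
  end do
  !$OMP end target teams distribute parallel do

  !$OMP end target data

  print *, 'C ( 1 ) = ', C ( 1 )

end program TargetData_Sketch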
Lower-level GenASiS functionality
Because the OpenMP runtime library routines for device memory management, such as the persistent allocation of memory on the device and the association of a program variable with previously allocated device memory, are only provided in C, we have used the C interoperability capabilities of Fortran 2003 to write simple Fortran wrappers to make this functionality available in GenASiS. In our code, a call to the wrapper AllocateDevice() allocates a block of device memory and returns a device pointer (of type c_ptr) such as D_Value; a subsequent call to the wrapper AssociateHost(), with a Fortran array Value and the device pointer D_Value as arguments, causes references to Value appearing in an appropriate OpenMP directive to be interpreted as referring to the device memory location pointed to by D_Value. We have also provided the corresponding Fortran routines DeallocateDevice() (which frees memory on the device pointed to by an argument D_Value) and DisassociateHost() (which frees a host variable argument Value from its association with a memory location on the device).
Under the hood, these wrappers call the OpenMP C functions omp_target_alloc(), omp_target_associate_ptr(), omp_target_free(), and omp_target_disassociate_ptr() respectively, making use of the Fortran 2003 bind attribute. In each case the function omp_get_default_device() is used to identify the device, which is all that is needed with our convention of only one device (i.e. GPU) per MPI process.
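As an illustration of what such wrappers might look like, the following sketch declares interoperable interfaces to several of these C routines and wraps them in Fortran subroutines. The module and routine names, argument orders, and the restriction to rank-1 double precision arrays are assumptions made here for brevity; the actual GenASiS interfaces may differ (for instance, by being overloaded for several ranks and types). DeallocateDevice(), wrapping omp_target_free(), would follow the same pattern.

module DeviceMemory_Sketch

  ! Hypothetical wrappers in the spirit described in the text; the actual
  ! GenASiS interfaces, names, and argument conventions may differ

  use iso_c_binding
  implicit none

  interface

    function omp_target_alloc ( nBytes, DeviceNum ) &
               bind ( c, name = 'omp_target_alloc' ) result ( D_Pointer )
      use iso_c_binding
      integer ( c_size_t ), value :: nBytes
      integer ( c_int ), value :: DeviceNum
      type ( c_ptr ) :: D_Pointer
    end function omp_target_alloc

    function omp_target_associate_ptr ( H, D, nBytes, Offset, DeviceNum ) &
               bind ( c, name = 'omp_target_associate_ptr' ) result ( Error )
      use iso_c_binding
      type ( c_ptr ), value :: H, D
      integer ( c_size_t ), value :: nBytes, Offset
      integer ( c_int ), value :: DeviceNum
      integer ( c_int ) :: Error
    end function omp_target_associate_ptr

    function omp_target_disassociate_ptr ( H, DeviceNum ) &
               bind ( c, name = 'omp_target_disassociate_ptr' ) result ( Error )
      use iso_c_binding
      type ( c_ptr ), value :: H
      integer ( c_int ), value :: DeviceNum
      integer ( c_int ) :: Error
    end function omp_target_disassociate_ptr

    function omp_get_default_device ( ) &
               bind ( c, name = 'omp_get_default_device' ) result ( DeviceNum )
      use iso_c_binding
      integer ( c_int ) :: DeviceNum
    end function omp_get_default_device

  end interface

contains

  subroutine AllocateDevice ( Value, D_Value )
    ! Persistently allocate device memory sized to hold Value
    real ( c_double ), dimension ( : ), intent ( in ) :: Value
    type ( c_ptr ), intent ( out ) :: D_Value
    D_Value = omp_target_alloc &
                ( size ( Value, kind = c_size_t ) * c_sizeof ( 1.0_c_double ), &
                  omp_get_default_device ( ) )
  end subroutine AllocateDevice

  subroutine AssociateHost ( Value, D_Value )
    ! Henceforth interpret Value, in target constructs, as referring to the
    ! device memory pointed to by D_Value (Value assumed contiguous)
    real ( c_double ), dimension ( : ), target, intent ( in ) :: Value
    type ( c_ptr ), intent ( in ) :: D_Value
    integer ( c_int ) :: Error
    Error = omp_target_associate_ptr &
              ( c_loc ( Value ( 1 ) ), D_Value, &
                size ( Value, kind = c_size_t ) * c_sizeof ( 1.0_c_double ), &
                0_c_size_t, omp_get_default_device ( ) )
  end subroutine AssociateHost

  subroutine DisassociateHost ( Value )
    ! Remove the association established by AssociateHost
    real ( c_double ), dimension ( : ), target, intent ( in ) :: Value
    integer ( c_int ) :: Error
    Error = omp_target_disassociate_ptr &
              ( c_loc ( Value ( 1 ) ), omp_get_default_device ( ) )
  end subroutine DisassociateHost

end module DeviceMemory_Sketch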
We also provide Fortran wrappers in GenASiS to transfer data to and from the device, with the routines UpdateDevice() and UpdateHost() performing transfers from host to device and from device to host respectively. Similarly, under the hood these wrappers call the OpenMP C function omp_target_memcpy().
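A corresponding sketch of the host-to-device transfer wrapper, again with hypothetical naming and argument conventions, might be added to the module sketched above as follows; UpdateHost() would be analogous, with the source and destination arguments (and device numbers) swapped.

  ! Additional interoperable interfaces, placed inside the same interface
  ! block as above

  function omp_target_memcpy ( Dst, Src, nBytes, DstOffset, SrcOffset, &
                               DstDevice, SrcDevice ) &
             bind ( c, name = 'omp_target_memcpy' ) result ( Error )
    use iso_c_binding
    type ( c_ptr ), value :: Dst, Src
    integer ( c_size_t ), value :: nBytes, DstOffset, SrcOffset
    integer ( c_int ), value :: DstDevice, SrcDevice
    integer ( c_int ) :: Error
  end function omp_target_memcpy

  function omp_get_initial_device ( ) &
             bind ( c, name = 'omp_get_initial_device' ) result ( DeviceNum )
    use iso_c_binding
    integer ( c_int ) :: DeviceNum
  end function omp_get_initial_device

  ! Additional module procedure

  subroutine UpdateDevice ( Value, D_Value )
    ! Host-to-device transfer of Value into the device memory D_Value
    ! (Value assumed contiguous)
    real ( c_double ), dimension ( : ), target, intent ( in ) :: Value
    type ( c_ptr ), intent ( in ) :: D_Value
    integer ( c_int ) :: Error
    Error = omp_target_memcpy &
              ( D_Value, c_loc ( Value ( 1 ) ), &
                size ( Value, kind = c_size_t ) * c_sizeof ( 1.0_c_double ), &
                0_c_size_t, 0_c_size_t, &
                omp_get_default_device ( ), omp_get_initial_device ( ) )
  end subroutine UpdateDevice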
Higher-level GenASiS Functionality
The primary way we have integrated device functionality into GenASiS is by adding members and methods to our workhorse class StorageForm that make use of the wrappers discussed in Section 2.1. This class includes members for both data and metadata; it is used to group together a set of related variables, typically a set of physical fields on a computational domain. Code for tasks like I/O, ghost cell exchange, prolongation and restriction on multilevel meshes, and so on is simplified and rendered more generic by use of this class.
The data member Value of StorageForm is a rank-2 array whose first dimension (rows), the one corresponding to contiguous storage in Fortran, indexes different values of some variable (typically the values of a single physical field at different points in space), with the second dimension (columns) indexing different variables (typically different physical fields). Metadata members include such things as names and units associated with the columns of data.
Prior to the addition of device functionality, the only method of StorageForm was an overloaded Initialize procedure. One mode of initialization entails allocation of the data member Value according to a specified shape; this could be called a 'primary' instance of StorageForm. A second mode initializes what might be called an 'overlay' instance of StorageForm, which does not allocate new storage; instead, it points its Value member to that of an existing instance, and includes an integer array member iaSelected (the prefix ia- is our shorthand for 'index array') specifying a subset of variables, that is, the indices of selected columns of Value to be regarded as 'active' in the new instance. (The iaSelected member of a 'primary' instance of StorageForm automatically includes all columns of data.)
Adding just a handful of members and methods to our StorageForm class provides for the persistent allocation of device memory, and references thereto, in a manner that is simple to use, yet still quite generic and therefore powerful. A new method AllocateDevice() is used in conjunction with a new member D_Selected, an array of type ( c_ptr ), to allocate device memory in a manner that mirrors the standard access to variables in primary and overlay instances of StorageForm on the host. In the case of a primary instance, the method AllocateDevice() (a type-bound procedure of the class) calls the lower-level AllocateDevice() routine (the free-standing wrapper discussed in Section 2.1) to allocate a block of memory corresponding to the member Value of StorageForm. The elements of the new D_Selected member are set to point to the sections of this array on the device corresponding to the individual variables (columns) of the Value member on the host. In the case of an overlay instance of StorageForm, no call to the AllocateDevice() method is needed; instead, during initialization the D_Selected member is automatically populated by the appropriately selected values from the D_Selected member of the primary instance. Finally, the methods UpdateDevice() and UpdateHost() (which call the lower-level routines of corresponding name discussed in Section 2.1) allow all the selected fields of an instance of StorageForm to be transferred to or from the device en masse with a single call. (Alternatively, individual fields can be specified for transfer.)
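To make the pattern concrete, the following pared-down, hypothetical stand-in for StorageForm illustrates the device-related members and methods just described, built on the wrapper sketches of Section 2.1. Unlike the actual GenASiS class, it allocates one device block per variable rather than a single block with column pointers into it, and it omits metadata, overlay instances, and iaSelected.

module Storage_Sketch

  ! A pared-down, hypothetical stand-in for StorageForm, illustrating only
  ! the device-related members and methods described in the text

  use iso_c_binding
  use DeviceMemory_Sketch, only : AllocateDevice, UpdateDevice
  implicit none

  type :: Storage_Sketch_Form
    real ( c_double ), dimension ( :, : ), allocatable :: Value
    type ( c_ptr ), dimension ( : ), allocatable :: D_Selected
  contains
    procedure :: Initialize
    procedure :: AllocateDevice => AllocateDevice_SS
    procedure :: UpdateDevice => UpdateDevice_SS
  end type Storage_Sketch_Form

contains

  subroutine Initialize ( S, nValues, nVariables )
    class ( Storage_Sketch_Form ), intent ( inout ) :: S
    integer, intent ( in ) :: nValues, nVariables
    allocate ( S % Value ( nValues, nVariables ) )
    allocate ( S % D_Selected ( nVariables ) )
  end subroutine Initialize

  subroutine AllocateDevice_SS ( S )
    ! Simplification: one device allocation per variable (column); the
    ! actual GenASiS class allocates a single block and points the
    ! elements of D_Selected at its sections
    class ( Storage_Sketch_Form ), intent ( inout ) :: S
    integer :: iV
    do iV = 1, size ( S % D_Selected )
      call AllocateDevice ( S % Value ( :, iV ), S % D_Selected ( iV ) )
    end do
  end subroutine AllocateDevice_SS

  subroutine UpdateDevice_SS ( S )
    ! Transfer all variables (columns) to the device en masse;
    ! an UpdateHost counterpart would do the reverse
    class ( Storage_Sketch_Form ), intent ( inout ) :: S
    integer :: iV
    do iV = 1, size ( S % D_Selected )
      call UpdateDevice ( S % Value ( :, iV ), S % D_Selected ( iV ) )
    end do
  end subroutine UpdateDevice_SS

end module Storage_Sketch

In GenASiS itself the Initialize method is also overloaded to create the 'overlay' instances described above, whose D_Selected elements are simply copied from those of the primary instance.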
Offloading Computational Kernels in GenASiS
In contrast to the functionality added to our workhorse StorageForm class as discussed in Section 2.2 (which may be regarded as somewhat 'higher-level'), the offloading of computational kernels to the device implicates 'lower-level' functionality. This is to be expected given the inherent nature of computational kernels, defined as straightforward segments of code implementing repetitive basic operations, through which significant amounts of data are fed.
In line with common practice and in part for traditional purposes of modularity, our habit is to sequester computational kernels in dedicated subroutines, to which only a few changes are needed to achieve offloading. In GenASiS the sequestration of kernels in subroutines also serves, via arrays provided in argument lists, the additional purpose of exposing fundamental data types to both the programmer and the compiler (as opposed to confronting a human or algorithmic parser with data buried in more complicated structures, such as array sections of the rank-2 pointer member Value of instances of our StorageForm class).

The computational kernel shown in Listing 1 has three elements that distinguish it from a kernel not intended for offloading to the device. First, the argument list in line 1 (see also line 4) includes a device pointer (D_A, D_B, D_C) for each of the field variables (A, B, C); these device pointers allow us to address individual variables (columns of Value). Second, calls to the GenASiS Fortran wrapper AssociateHost() in lines 9-11 (paired subsequently with corresponding DisassociateHost() calls in lines 19-21) instruct the compiler to interpret the host variable arguments A, B, and C as referring to the device memory locations indicated by the device pointer arguments D_A, D_B, and D_C in subsequent OpenMP target constructs. (A one-time call, at initialization, to AssociateHost() for each column of Value should in principle suffice per the OpenMP specification, avoiding the need for AssociateHost() calls in every kernel; in practice, however, this results in runtime errors with the IBM XL compiler current at this writing.) Third, the kernel loop in lines 13-17 is accessorized with an OpenMP target directive rather than the similar construct !$OMP parallel do schedule ( static, 1 ) that would be used in a traditional CPU multithreading context. The target directive in line 13 tells OpenMP to offload the computational kernel (lines 14-16) to the device. The teams distribute directives in the same line create and distribute work to a league of teams of threads on the device. The manner in which the teams of threads are mapped to the device hardware capabilities is implementation-dependent.
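Listing 1 itself is not reproduced here; the following sketch is merely consistent with the description above (using the wrapper sketches of Section 2.1, with an assumed argument order for AssociateHost() and an arbitrarily chosen arithmetic operation), and should not be read as the actual GenASiS kernel.

subroutine AddKernel ( A, B, C, D_A, D_B, D_C )

  use iso_c_binding
  use DeviceMemory_Sketch, only : AssociateHost, DisassociateHost
  implicit none

  real ( c_double ), dimension ( : ), intent ( in )  :: A, B
  real ( c_double ), dimension ( : ), intent ( out ) :: C
  type ( c_ptr ), intent ( in ) :: D_A, D_B, D_C

  integer :: iV

  ! Interpret A, B, C in the target construct below as referring to the
  ! device memory indicated by D_A, D_B, D_C
  call AssociateHost ( A, D_A )
  call AssociateHost ( B, D_B )
  call AssociateHost ( C, D_C )

  !$OMP target teams distribute parallel do schedule ( static, 1 )
  do iV = 1, size ( C )
    C ( iV ) = A ( iV ) + B ( iV )
  end do
  !$OMP end target teams distribute parallel do

  call DisassociateHost ( A )
  call DisassociateHost ( B )
  call DisassociateHost ( C )

end subroutine AddKernel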
It may be asked why the transfer of data to and from the device is not included in the kernel itself (perhaps even left to be done implicitly and automatically by the compiler), and the answer is important to the simple and effective porting of sophisticated codes. Because of the nontrivial overhead involved in data transfer to and from the device, it is highly desirable to keep data on the device for as long as possible. The above simple example involves only a single kernel, but realistic cases may involve calls to many separate kernels between necessary updates to the host. The persistent allocation of device memory, together with device pointers allowing access to it, allows data to be left on the device for extended operations involving multiple separate kernel calls without disrupting the existing structure of thoroughly modularized codes.
Porting a Fluid Dynamics Application: RiemannProblem
As a concrete working example of targeting GPUs with OpenMP directives, we use our implementation in the RiemannProblem example built with GenASiS Basics. RiemannProblem is an extension of the classic 1D Sod shock tube, a standard computational fluid dynamics test problem, to 2D and 3D. Figure 1 shows the initial states of the 1D and 3D versions of RiemannProblem and their final states after evolution to time t = 0.25, at a resolution of 128 cells in each dimension.
Pseudocode outlining the solution in RiemannProblem (and other simple fluid dynamics examples distributed with GenASiS Basics) is displayed in Algorithm 1. Each iteration of the main loop enclosed by lines 4 and 23 is a second-order Runge-Kutta time step consisting of two substeps, each of which consists of several computational kernels (lines 8-12 and 16-20). Because the emphasis in the Basics examples as originally distributed [7, 8] was to demonstrate use of the utilitarian functionality provided by that division of the code, the kernels were written in the most simple-minded and compact manner possible, with essentially no regard for performance or even traditional OpenMP multi-core threading. Eschewing explicit loops and indexing, Fortran array syntax (whole-array operations, the where construct, the cshift() operation, and the like) had been used throughout. Therefore our first task, prior to beginning the porting required for exploitation of GPU devices, was to rewrite these kernels in terms of do loops over explicitly indexed arrays, accessorized with OpenMP directives implementing multi-core threading. This then constituted a code that provided a basis for a speedup comparison, with kernels readily adaptable to offloading via modification of the familiar OpenMP parallel do construct.
Given the infrastructure and approach to accelerator devices described in Section 2, porting Algorithm 1 to its device-enabled counterpart in Algorithm 2 was both straightforward and effective. Algorithm 2 differs from Algorithm 1 in its notation of whether steps are executed on the Host or Device, but otherwise matches almost line for line in this high-level perspective, the only additions being the Transfer operations explicitly commanded in lines 6, 15, 17, and 25. In the present state of our code these transfers are necessary to allow the host to exchange ghost data between MPI processes responsible for separate regions of the spatial domain. (Future exploitation of CUDA-aware MPI libraries with hardware-enabled GPUDirect technologies should allow this ghost exchange to proceed directly between devices.)
We emphasize that the basic structure of the code has been preserved, and that (with the exception of FluidCurrent) the data utilized by the several kernels is generated on the device and remains on the device throughout each of the Runge-Kutta substeps in lines 9-13 and 19-23 of Algorithm 2. Instances of our class StorageForm for FluidOld, Differences, Reconstruction, Fluxes and FluidUpdate are initialized as part of line 1 in both Algorithm 1 and Algorithm 2; and in the latter case, the addition of a call to the AllocateDevice() method of each of these instances to enable use of the device is rather trivial from a programming perspective. Without such use of designated persistent device memory and controlled transfers, OpenMP by default ensures that data would, with blissful ignorance on the part of the programmer but at the cost of debilitating wall time overhead, be automatically shipped back and forth between host and device between each of the several kernel calls. (Achieving this sort of control via OpenMP target enter data and target exit data directives with map clauses would be inconsistent with the spirit of our higher-level StorageForm functionality, where we have provided UpdateHost() and UpdateDevice() methods.)
It is worth noting a degree of flexibility available in the mapping of host program variables to device memory in computational kernels. As discussed in Section 2.2, we store the values of a physical field corresponding to different cells of a discretized computational domain in a column of a rank-2 Fortran array, the Value member of class StorageForm. However in our fluid dynamics problem (and as is of course the case in general), some of the kernels require knowledge of spatial relationships in the data, in particular, nearest-neighbor (or wider) stencils. In this Basics example problem a single-level rectangular mesh is employed, so that stencil relationships can be represented through appropriate indexing of rank-3 arrays embodying discretized 3D position space. Such 3D arrays can be obtained from a column of data with the Fortran pointer remapping facility. For instance, local rank-3 pointer variables V and dV can be declared with bounds expressed in terms of nX, nY, nZ, the numbers of 'proper' or active cells in each of the three dimensions of a subdomain assigned to an MPI task, with allowance for two layers of ghost cells, and pointer-remapped to single columns of the Value members of FluidCurrent and Differences (a schematic sketch of this remapping is given at the end of this section). The rank-3 arrays V and dV can then be sent to a kernel ComputeDifference_X shown in Listing 2, along with device pointers FluidCurrent % D_Selected ( iV ) and Differences % D_Selected ( iV ) as arguments corresponding to the kernel dummy variables D_V and D_dV. Even though the device pointers were originally set in connection with rank-1 arrays (single columns of a Value data array), their association in lines 9 and 10 to rank-3 aliases of these data columns works just as hoped and expected.

To compile our code, we use IBM XL Fortran for Linux, version 16.1.0, with the -qsmp=omp -qoffload -Ofast flags. The invocation xlf2008_r is used to compile Fortran 2008 source files with thread-safe compilation. The MPI library is provided by IBM Spectrum MPI.
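The pointer remapping and kernel call just described might look schematically as follows. This is a fragment, not the actual Listing 2: the names nX, nY, nZ, iV, V, dV, D_V, D_dV, ComputeDifference_X, and the two-ghost-cell convention are taken from the description above, while the bounds convention, the real kind, and everything else are assumptions; in particular, GenASiS's AssociateHost() is presumed to accept the rank-3 arguments inside the kernel (not shown).

real ( c_double ), dimension ( :, :, : ), pointer :: V, dV

! Rank-3 aliases of single columns of the rank-2 Value members, with two
! layers of ghost cells in each dimension
V  ( -1 : nX + 2, -1 : nY + 2, -1 : nZ + 2 ) => FluidCurrent % Value ( :, iV )
dV ( -1 : nX + 2, -1 : nY + 2, -1 : nZ + 2 ) => Differences % Value ( :, iV )

! Inside the kernel, the rank-3 aliases V and dV are associated with the
! device memory indicated by D_V and D_dV before the target construct that
! computes nearest-neighbor differences in the X direction
call ComputeDifference_X &
       ( V, dV, FluidCurrent % D_Selected ( iV ), &
         Differences % D_Selected ( iV ) )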
Performance Results
The test results described here were obtained on Summit, a newly deployed supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). A compute node on Summit is an IBM Power System AC922, which consists of two IBM Power9 processors and six NVIDIA Volta V100 GPUs. Each Power9 processor has 22 physical cores, one of which is reserved for operating system tasks, leaving 42 physical cores on a Summit compute node available for running a user's application. Three GPUs and a Power9 socket are interconnected with NVLink. The two Power9 sockets are connected by an X-Bus. Summit nodes are interconnected in a non-blocking fat tree topology with a dual-rail EDR InfiniBand network.
Summit's job launching system, jsrun, allows users to create logical abstractions of Summit nodes with the concept of a resource set. Here we create six resource sets per node, where each resource set has seven CPU cores (i.e. 1/6 of the 42 cores per node) and one GPU (i.e. 1/6 of the six GPUs per node). This allows for meaningful speedup measurements in 'proportional resource tests' that compare the performance of runs using either the GPUs or the CPU cores in the resource sets.
In the test runs reported here, we use the GPU in exclusive mode by assigning one GPU to each MPI process. In each resource set (described in the preceding paragraph) we run one MPI process with the possibility of running up to seven OpenMP threads. As alluded to in Section 3, the computational kernels in our code can run either with OpenMP threads or with OpenMP target directives for GPU, but not both simultaneously.
For the performance and scaling tests, we assign 256³ cells per MPI process for the 3D RiemannProblem to ensure that the problem size is large enough that each MPI process, whether utilizing multiple OpenMP threads or the GPU, is not computationally starved. For a sense of comparison, in previous production simulations [5, 6] we assigned 128³ cells per MPI process to achieve reasonable time to solution per simulation.

Figure 2 shows the weak scaling of the 3D RiemannProblem for both versions of the code: with OpenMP threads on the CPU, and with OpenMP target directives on the GPU. The figure demonstrates near-ideal speedup with increasing CPU thread counts, from 2 to 4 to 7; and from 7 CPU threads to the GPU, approximately another factor of 4 speedup is observed. This significant speedup is achieved even when the extra costs of data transfers between the host memory and GPU high-bandwidth memory are included.

Figure 2: Weak scaling of the 3D RiemannProblem with 256³ cells assigned to each MPI process. Wall time includes data transfers between host memory and GPU memory in the GPU version of the code. Colors correspond to OpenMP thread counts 2 (blue), 4 (red), and 7 (yellow) on the CPU, while the GPU version is shown in green. Each MPI process is assigned one GPU. With 8 MPI processes, the speedups from thread counts 2 to 4 to 7 to the GPU are 1.76X, 1.48X, and 4.8X respectively. The speedup from 7 CPU threads to the GPU ranges from 3.92X to 6.71X in these runs.
To better understand the computational costs of the different portions of our code, we heavily accessorize computational kernels and logical portions of the code with timers from our TimerForm class [8]. Figure 3 shows the timings for the computational kernels, MPI communication, and, for the GPU version of the code, data transfers to and from the GPU. The figure shows that in almost all computational kernels the GPU version outperforms the 7-thread CPU version. We do not expect significant differences in the timing of the MPI communication portion of the code, which is single-threaded and only runs on the host. We therefore attribute the timing differences seen in Figure 3 for the MPI portion to system noise and general variability. (Similar effects of system noise and variability can be seen in the weak scaling plot in Figure 2 for runs with 1000 and more MPI processes.) Figure 4 plots the speedup of the GPU version over the CPU version with 7 OpenMP threads for each computational kernel. Except for one kernel labeled RiemannSolverInput, speedups of more than 10X, ranging up to 40X, are achieved. The underperforming RiemannSolverInput kernel therefore warrants further investigation, which we have not yet undertaken.
Are we getting the speedups we can reasonably expect from offloading our kernels to the GPU via OpenMP target directives? To answer that question, we note the following facts. In this example fluid dynamics problem most of the kernels are memory-bandwidth bound, since most of the work involves vector-like operations across all cells. Therefore we can at least expect performance improvements proportional to the ratio of the bandwidth of the GPU's HBM2 memory to that of the CPU's DDR4 main memory. Summit's HBM2 theoretical peak bandwidth is 900 GB/s (https://www.olcf.ornl.gov/for-users/system-user-guides/summit/nvidia-v100-gpus/), while its DDR4 bandwidth is 135 GB/s (https://www.olcf.ornl.gov/for-users/system-user-guides/summit/system-overview/), implying an HBM2 to DDR4 bandwidth ratio of ∼ 6.66X. Since most of the GPU-offloaded kernels get speedups exceeding this ratio, it seems that we also benefit from the much higher computational power of the GPUs.

Figure 5 shows the timing proportions of the different portions of the GPU version of 3D RiemannProblem, revealing prime targets for further attention. Despite overall speedups of 4X and beyond, the data transfers between the GPU and host memory loom large, taking more wall time than anything else in our application. The primary cause of these data transfers is the need to move the fluid data for MPI communication (ghost exchanges and time step reduction) every time step, since our MPI communication is done on the host CPU. In the next section we discuss our plan to alleviate these costs.

Beyond OpenMP

Figure 5 suggests that further speedups can be achieved if we can reduce the time required for data transfers between host and device. Host data that are allocated in page-locked memory, more commonly known as pinned memory, can be transferred from host to device much more efficiently than data allocated in pageable host memory. However, the current OpenMP specifications (OpenMP 4.5) do not provide a mechanism to allocate host data in pinned memory. We therefore provide new Fortran routines AllocateHost() and DeallocateHost() as wrappers to the CUDA functions cudaHostAlloc() and cudaFreeHost(), respectively. We added an option to the initialization method of StorageForm to allocate the data member as pinned memory by calling the routine AllocateHost() instead of using the intrinsic Fortran statement allocate. Figure 6 shows the weak scaling of the 3D RiemannProblem for the GPU versions of the code with and without the use of pinned memory. In this figure, the line labeled "GPU" is the same as the one in Figure 2. The line labeled "GPU + Pinned Memory" shows the timings when FluidCurrent in Algorithm 2 is allocated in pinned memory. We observe speedups of ∼ 1.7-2.0X when pinned memory is used, resulting in overall speedups of over 9X relative to 7 CPU threads. We plot the timing distribution of the different computational kernels when pinned memory is used in Figure 7.
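A sketch of what the AllocateHost() wrapper might look like follows. The routine name is from the text, while the argument list, the restriction to rank-1 double precision data, and the omitted error handling are simplifications assumed here: cudaHostAlloc() returns the pinned allocation through its first argument, which is then exposed to Fortran with c_f_pointer.

subroutine AllocateHost ( Value, nValues )

  ! Allocate page-locked (pinned) host memory via CUDA and expose it as a
  ! Fortran array pointer (schematic; the actual GenASiS version differs at
  ! least in its argument conventions and error handling)

  use iso_c_binding
  implicit none

  real ( c_double ), dimension ( : ), pointer, intent ( out ) :: Value
  integer, intent ( in ) :: nValues

  interface
    function cudaHostAlloc ( pHost, nBytes, Flags ) &
               bind ( c, name = 'cudaHostAlloc' ) result ( Error )
      use iso_c_binding
      type ( c_ptr ) :: pHost
      integer ( c_size_t ), value :: nBytes
      integer ( c_int ), value :: Flags
      integer ( c_int ) :: Error
    end function cudaHostAlloc
  end interface

  integer ( c_int ), parameter :: cudaHostAllocDefault = 0
  type ( c_ptr ) :: Address
  integer ( c_int ) :: Error

  Error = cudaHostAlloc &
            ( Address, int ( nValues, c_size_t ) * c_sizeof ( 1.0_c_double ), &
              cudaHostAllocDefault )
  call c_f_pointer ( Address, Value, [ nValues ] )

end subroutine AllocateHost

DeallocateHost() would similarly wrap cudaFreeHost(), passing the address of the pinned block to be released.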

Conclusion and Discussion
In this paper we describe our use of OpenMP to implement GPU-related infrastructure in GenASiS Basics and port fluid dynamics kernels in the RiemannProblem example. This has proved to be simple and effective, yielding speedups of ∼ 4X for our RiemannProblem. In Section 5 we went beyond the current OpenMP 4.5 specifications to achieve even further speedups using pinned memory. These results further motivate the need for more advanced memory management with OpenMP, which is currently in draft form for future specifications.

Along the way, however, we encountered a few issues with the current state of OpenMP implementations that merit brief discussion. Our understanding is that it should be possible to have a single, portable code base that can either run on GPUs or other accelerators, or fall back to the more traditional multi-threading CPU context when no accelerator is available. In theory, this could be achieved by guarding the OpenMP target directive with an if clause whose logical expression is set based on the availability of an accelerator (e.g. by checking the return value of omp_get_num_devices()). However, in our tests the OpenMP runtime environment available to us still complains that a target is not available even when we remove the GPU from the resource set. Furthermore, even after manually removing the target directive, having the teams distribute directives seems to introduce deleterious effects on the performance of CPU multi-threading. For now we are therefore forced to either use preprocessor directives, or maintain a separate code base in which target teams distribute directives and schedule ( static, 1 ) clauses are manually removed for running with multi-threaded CPU cores.
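Schematically, the guarded construct we have in mind looks like the following; the kernel and the UseDevice flag are illustrative only, and, as noted above, this does not currently behave as hoped with the compilers available to us.

subroutine AddKernelGuarded ( A, B, C )

  use iso_fortran_env, only : real64
  use omp_lib
  implicit none

  real ( real64 ), dimension ( : ), intent ( in )  :: A, B
  real ( real64 ), dimension ( : ), intent ( out ) :: C

  integer :: iV
  logical :: UseDevice

  ! Intended fallback: offload only if an accelerator is present;
  ! otherwise execute the construct on the host
  UseDevice = ( omp_get_num_devices ( ) > 0 )

  !$OMP target teams distribute parallel do if ( target : UseDevice ) &
  !$OMP   schedule ( static, 1 )
  do iV = 1, size ( C )
    C ( iV ) = A ( iV ) + B ( iV )
  end do
  !$OMP end target teams distribute parallel do

end subroutine AddKernelGuarded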
In our earliest explorations we considered a pipelining approach, using OpenMP tasks and the nowait clause, in which data transfer between host and device could go on in the background while kernel calculations proceeded on data already received. However we observed no performance difference, and concluded that the implementation did not automatically take advantage of the opportunity to overlap data movement and computation that one might hope would be available through the nowait clause. Thus, there seems to be no way within OpenMP to ensure parallelism of data movement and kernel launch on the GPU. In the CUDA programming model this can be achieved by setting up separate CUDA streams for data movement and kernel launch. But at present no mechanism exists within OpenMP to select particular CUDA streams.
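For concreteness, the kind of overlap we attempted can be expressed as follows; this is a schematic with arbitrary array names and operations, not the code as it appears in GenASiS, and in our experiments the two deferred target tasks were in practice serialized rather than overlapped.

subroutine OverlappedTransfer ( A, B, C )

  use iso_fortran_env, only : real64
  implicit none

  real ( real64 ), dimension ( : ), intent ( in )  :: A, B
  real ( real64 ), dimension ( : ), intent ( out ) :: C

  integer :: iV

  ! Begin transferring B to the device in the background; B would be consumed
  ! by a subsequent kernel (not shown) and eventually released with a
  ! matching target exit data
  !$OMP target enter data map ( to : B ) nowait

  ! ... while a kernel involving only A and C executes
  !$OMP target teams distribute parallel do map ( to : A ) map ( from : C ) nowait
  do iV = 1, size ( C )
    C ( iV ) = 2.0_real64 * A ( iV )
  end do
  !$OMP end target teams distribute parallel do

  ! Wait for both deferred target tasks before using C on the host
  !$OMP taskwait

end subroutine OverlappedTransfer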
The availability of compilers able to produce high-performing executables for the GPU is another issue that may inhibit adoption of OpenMP's device-related features. It is only recently that, as part of Summit's procurement, the OpenMP target directive has come to be supported by the IBM XL compiler. For Fortran, the only other compiler supporting OpenMP 4.5 that we know of is the Cray Compiling Environment (CCE), which is only available on Cray platforms. Hopefully this will change soon.
