We introduce "Hybrid Fortran," a new approach that allows a high-performance GPGPU port for structured grid Fortran codes. This technique only requires minimal changes for a CPU targeted codebase, which is a significant advancement in terms of productivity. It has been successfully applied to both dynamical core and physical processes of ASUCA, a Japanese mesoscale weather prediction model with more than 150k lines of code. By means of a minimal weather application that resembles ASUCA's code structure, Hybrid Fortran is compared to both a performance model as well as today's commonly used method, OpenACC. As a result, the Hybrid Fortran implementation is shown to deliver the same or better performance than OpenACC, and its performance agrees with the model both on CPU and GPU. In a full-scale production run, using an ASUCA grid with 1581 × 1301 × 58 cells and real-world weather data in 2km resolution, 24 NVIDIA Tesla P100 running the Hybrid Fortran-based GPU port are shown to replace more than fifty 18-core Intel Xeon Broadwell E5-2695 v4 running the reference implementation-an achievement comparable to more invasive GPGPU rewrites of other weather models.
New GPGPU Code Transformation Framework Applied to Weather Prediction 7:3 (see Section 2.4), we conclude that industry standards alone are insufficient to fulfill the above requirements, which has motivated the invention of a new method for porting CPU-targeted structured grid applications in Fortran to GPU. In Section 3, we thus propose "Hybrid Fortran," a source-to-source transformation and language extension that supports different targets. It uses Modern Fortran together with higher-level parallelization and data specification abstractions as input and outputs OpenMP Fortran, CUDA Fortran, or OpenACC Fortran sources. Crucially, this tool introduces a new assisted splitting method for large kernels, a necessity to enable GPU compatibility of physical processes within ASUCA without requiring a rewrite. We use a reduced weather application (introduced in Section 3.1) as a basis to compare the features of Hybrid Fortran with the industry standard OpenACC. A performance model (discussed in Section 4) is introduced to evaluate the performance of Hybrid Fortran and OpenACC for this use case (see Section 5.1).
By applying Hybrid Fortran to both dynamical core and physical processes of ASUCA a GPGPU implementation of this weather model has been completed, resulting in only 5% increase in code size, even though the code now supports both CPU and GPU (see Section 5.2) . To our knowledge, this is the first time a complete production weather prediction model has been ported to GPGPU using a unified programming paradigm. This new implementation performs with a speedup of up to 4.9× comparing one GPU to one multi-core CPU socket (as discussed in Sections 5.3 and 5.4). The magnitude of these results are in line with previous GPU ports of ASUCA and other weather models using more invasive rewrites (as discussed in the related work section below).
Finally, we draw conclusions and point out future work in Sections 6 and 7, respectively.
Related Work
The following GPGPU ports of atmospheric models are relevant for the discussion of our work. Shimokawabe et al. In 2010, Shimokawabe et al. completed a GPUonly CUDA C port of the dynamical core and part of the physical core of ASUCA. Fourteen TFLOP/s was achieved in single precision using 528 Tesla S1070 GPUs [36] .
ASUCA GPU Port by
In 2011, 24 GFLOP/s was achieved in double precision for ASUCA dynamical core on a single NVIDIA Fermi-based Tesla M2050 GPU on TSUBAME 2.0, a sixfold speedup when compared to the system's Xeon X5670 CPUs running the original Fortran code. This work also included extensive work regarding the optimization of inter-node communication and an overlapping method to run communication and computation in parallel. By applying weak scaling (i.e., increasing the grid size together with the number of GPUs), up to 145 TFLOP/s were achieved in single precision on 3990 Tesla M2050 GPUs [35] .
In 2014, the previous work was generalized by employing a stencil library and DSL based on C++ templates that generates either CUDA kernels or CPU code. Employing previously researched optimization methods to this more generalized method, up to 209.6 TFLOP/s were achieved for ASUCA in single precision on 4108 Tesla K20x GPUs [37] .
COSMO.
In 2014, MeteoSwiss' Fuhrer et al. proposed a mixed approach for COSMO, another structured grid regional weather and climate model used in operation in several European nations. In this mixed approach, a C++ stencil library and DSL called "STELLA" is used for the dynamical core, while a "less disruptive" OpenACC-based method is applied to the physical processes to retain most of the original Fortran codebase [7] . In COSMO's case, the physical processes already expose fine-grained parallelism, and kernel fusion was employed in some instances to optimize for less kernel launches by Lapillonne et al. [21] . Thanks to the code refactoring a speedup of 1.5× was achieved on CPU. By employing GPU code generation an additional speedup of 2.9× was achieved (comparing Tesla K20x to 8-core Xeon E5-2670 on Piz Daint) [7] . STELLA has subsequently been generalized for use cases beyond COSMO and rectangular grids under the new name "GridTools" [6] . (WRF) . The U.S. National Center for Atmospheric Research's "Weather Research and Forecasting model" (WRF) is the most widely used weather model worldwide [24] . It too uses a structured grid. WRF consists of multiple hundreds of thousands of lines of Fortran code. GPU ports known to us have been only partial, leading to heavy pressure on the memory bus between CPU and GPU and thus a lower speedup compared to full GPU ports of other weather models [44] . The following partial GPU ports have been achieved among others:
Weather Research and Forecasting Model
• in 2009, Michalakes and Vachharajani ported the "WSM5" cloud microphysics to CUDA C, achieving an overall speedup of 1.25× for a benchmark workload [24] , • in 2010, Ruetsch et al. ported the longwave radiation physics to CUDA Fortran [33] , • in 2012, Mielikainen et al. ported the Goddard shortwave radiation scheme to CUDA C [26] , • in 2013, Vanderbauwhede and Takemi ported the advection scheme to OpenCL and integrated it together with the Fortran-based WRF. Doing so, an overall speedup of 2× was achieved [44] .
Non-Hydrostatic Icosahedral Model (NIM).
In 2010 the dynamical core of the Nonhydrostatic Icosahedral Model (NIM) developed at the U.S. National Oceanic and Atmospheric Administration (NOAA) was ported to GPU by Govett et al. by using a Fortran-to-CUDA-C transpiler called "F2C-ACC" [9] . In 2014 this work was compared to the industry standard "OpenACC" on Titan using Tesla K20x GPUs, showing 77% higher performance with F2C-ACC compared to Cray's OpenACC Fortran implementation and 220% higher performance compared to PGI's OpenACC. Overall however the GPU implementation was not able to achieve a speedup compared to CPU due to the missing physical processes and the thus high GPU-CPU data transfer overhead [10] . Since then, performance of both hardware and software have improved significantly however, resulting in a speedup of 2.5× when comparing NVIDIA GPUs to CPUs and 2.0× when comparing Xeon Phi to the same CPU in 2017. Govett et al. have also reported that Ope-nACC's performance issues and bugs have been fixed and F2C-ACC has thus been fully replaced with the industry standard [11] .
1.1.5 Accelerated Model for Climate and Energy (ACME). Norman et al. have investigated methods to create a hybrid GPU/CPU codebase for the U.S. Department of Energy's "Accelerated Model for Climate and Energy" (ACME). A previous CUDA Fortran-based port was dropped due to issues with having to maintain two separate codebases for CPU and GPU. For this reason, OpenACC was evaluated as a replacement, but limitations with resource pressure and inlining when porting large kernels to GPU were found. It was concluded that such large kernels need to be split up to be feasible on GPU with OpenACC, which aligns with our findings about ASUCA's physical processes (see Section 2.4.1). Since CPUs perform better with fused loops for caching reasons, for a pure OpenACC-based solution it was found that two separate code versions are required for CPU and GPU, resulting in similar maintainability issues as with a pure CUDA Fortran port [28] .
BACKGROUND AND MOTIVATION
This section discusses the background and motivation of this work. As such it introduces ASUCA (the main target application for the work presented in this article), the goals of an accelerator port, a discussion of existing methods for porting such applications to accelerators, as well as a problem statement. 
The ASUCA Weather Prediction Model
ASUCA is a mesoscale weather prediction model developed by the Japan Meteorological Agency (JMA) [17] . It is used in production since 2014 as one of the main operational weather forecast models in Japan and covers the area depicted in Figure 1 in two kilometer resolution, generating nine-hour-forecasts every hour [34] . ASUCA uses an Arakawa C-grid (one type of rectangular grid) [37] . Time discretization is implemented by employing a third order Runge-Kutta scheme, enabling long time steps [45] . Sound and gravity waves are treated separately using a second order Runge-Kutta scheme to enable a higher time resolution, employing the horizontally explicitvertically implicit (HEVE) scheme. Vertical advection of water substances are calculated using a separate time step for each column, based on the Courant-Friedrichs-Lewy convergence condition [17] . ASUCA employs a generalized coordinate system that is used in conjunction with Lambert conformal conic-as well as latitude/longitude projections [37] . It is discretized using a structured grid on a finite volume method in three spatial dimensions: I and J as interchangeable horizontal dimensions and K as the separately treated (largely sequentially implemented) vertical dimension.
ASUCA has been implemented in Fortran, employing multidimensional arrays as its main data structure due to their simplicity and performant implementation on hardware. In ASUCA these data objects tend to be directly stored in modules (i.e., only making rare use of derived types) and imported where necessary through use statements. The program structure can roughly be divided into two categories: Physical processes (later depicted under pp interface as well as phys. adjust in Figure 8 , Section 5.2) and dynamical core (all other modules depicted in Figure 8 , Section 5.2). This distinction is common in weather models such as WRF [25] and COSMO [2] .
ASUCA is parallelized for multi-core and multi-node in the horizontal domain (i.e., I and J dimensions). For the multi-node parallelization, given a full grid of nx · ny · nz size, with nx and ny being the horizontal boundaries farthest from the origin and nz being the upper vertical boundary, ASUCA's grid is decomposed into blocks and scattered across the available nodes, with each node computing a block of nx · ny · nz size with nx <= nx and ny <= ny, thus requiring halo communication for each of the employed Runge-Kutta schemes. For the on-chip parallelization, ASUCA's runtime is dominated by the dynamical core, which as a stencil code is heavily bounded by memory bandwidth [4, 36] . It therefore stands to reason to investigate hardware architectures with a high memory bandwidth as potential implementation targets.
GPGPU Computing
This section discusses the motivation for GPGPU support in a weather prediction codebase. See also Appendix C for a glossary of GPGPU specific terms. Table 1 sums up the architectural differences that are relevant to the use case at hand. Specific performance numbers from 18-core Broadwell CPUs and Tesla P100 are used here (see also Appendix B).
GPUs are designed to support the highest number of parallel threads possible for a given number of transistors, while still assigning each thread enough resources to support common use cases in scientific and graphics computing. This results in GPU programs operating on a much higher number of threads in parallel compared to CPU (supported by the high core count, high number of registers per core and high memory bandwidth), however the sequential computational performance is lower due to the threads operating on a shorter pipeline with no additional vectorization and with less cache per thread.
In terms of memory bandwidth, GPUs typically have a 5-7× advantage over CPUs, which makes GPUs interesting for bandwidth bounded applications such as stencil applications. It should be noted however that the CPU's larger caches per floating-point unit can lead to the CPU being faster for the bandwidth bounded case if and only if cached accesses dominate over main memory accesses in the runtime profile. Experience shows that this is usually not the case however, instead for most use cases the CPU's more prominent caches lead to a somewhat lesser speedup for the GPU implementation compared to the bandwidth difference between GPU and CPU.
A unified programming model for the on-chip parallelism (compared to vectorization and multicore, which need to be treated separately on CPU) as well as the low context switching cost for fine-grained parallelizations (thanks to keeping inactive thread context stored in the large register files where possible) are further advantages of GPUs.
The GPU's main disadvantage is a high occupancy requirement (discussed in Appendix C), which results in a minimum problem size per GPU as well as the need for being more economical with each thread's private resources to achieve a high performance. This has the following main effects on how GPGPU applications are constructed and run:
• The resource constraints generally lead to a higher parallelization granularity being optimal, i.e., the amount of code that can belong to a single GPU kernel is limited. Disregarding this leads to local resources (particularly registers) overflowing, resulting in excessive main memory accesses. • There is a limit to how few threads a GPU can be given without starving for work and subsequently giving poor performance. This, in turn, enforces a lower limit on time-to-solution (i.e., an upper limit to the number of GPUs a given problem scales to). This limitation is common in all high-throughput processor architectures, to various degrees.
However, even for a fixed given time to solution, assuming a large enough problem size per microchip, GPUs are usually more energy efficient thanks to their high computational and memory bandwidth "density" in terms of rack space (as shown in Table 1 ).
To acquire an understanding of the feasibility of GPU ports for memory bandwidth bounded applications, we can define the necessary and sufficient condition for a speedup on GPU as t H > t D + t comm , where t H is the computation time for the host implementation, t D is the computation time on the device and t comm is the time spent for communication between host and device. Assuming memory bandwidth of sequentially accessed values to be the bottleneck of an application, 2 
where m is the number of values to read and write from/to memory per point update, b is the byte length of each value, n i is the number of iterations over the data, m H toD is the number of values that need to be transferred from/to the device after n i iterations and BW H , BW D , and BW H toD are the bandwidths available on host, device and between host and device, respectively. After substitution the speedup condition becomes
(1)
The left-hand side of this inequality is dictated by the application while the right-hand side is purely hardware dependent. We can therefore determine that n i m m H toD must be larger than 5.4, 8.3, and 5.2 on the GPU supercomputers TSUBAME 2.5, Reedbush-H, and Piz Daint, respectively (using the performance numbers listed in Appendix B).
Existing Hybrid CPU/GPU Parallelization Methods
As discussed in Section 1, the main goals of this work is to find a solution for accelerating the ASUCA weather prediction model with GPUs with a unified code and without requiring a rewrite. An overview over existing software tools that allow unified GPU/CPU codes is provided in this section. Table 2 lists these options. For "STELLA" and "F2C-ACC" see also the discussion in Section 1.1.
The following criteria are important for the consideration of these candidates:
(1) Due to their invasive nature when applying such techniques to ASUCA, Non-Fortranbased solutions as well as domain-specific languages for stencils (stencil DSLs) have been (2) Since this work focuses on production weather prediction, the maturity of the pursued tools is important. Despite OpenMP allowing to target accelerators from 4.0 on, its support has not yet reached a mature level as of 2016, according to NVIDIA engineers [18] . For this reason, it has been ruled out for a further evaluation for Hybrid ASUCA at this point in time. (3) F2C-ACC and OpenACC share many common features, with OpenACC being the much more widely supported choice. Furthermore, the maintainers of F2C-ACC at NOAA have recently dropped it in favor of OpenACC [11] .
By this process of elimination, we arrive at OpenACC as the only viable option in terms of already-existing solutions for the problem at hand. Its limitations, and, finally, the decision against using it, will be discussed in the following Section 2.4.
Design Considerations Regarding Code Reusability and Portability
To design the method to allow for a high reusability of existing code structure, the considerations outlined in this section have been crucial.
Architecture-Dependent Parallelization Granularity.
While ASUCA's dynamical core is already implemented with a high parallelization granularity in the original code and is thus already "GPU friendly" (as discussed in Section 2.2), the physical processes are originally implemented with a very large kernels and thus low granularity. Since computations can be executed on each K-column separately, without affecting neighboring columns until the next run of the dynamical core, most of the physical processes are implemented using a single kernel with one thread per column to increase cache locality and decrease the amount of context switching and thread synchronization [3, 20] . This amounts to approximately 25k lines of code in a single parallel kernel. For a GPU implementation, this is problematic due to the following reasons (as discussed in Section 2.2 and found independently by Norman et al. [28] ):
(1) Deep call graphs under a single kernel are difficult to maintain, since kernel code relies on compiler inlining, which breaks easily and only supports a subset of language constructs. (2) On GPUs the memory, cache and register resources per thread are much more limited, in favor of supporting a much higher number of parallel threads.
For a GPGPU port it is therefore necessary to split this large kernel to achieve finer-grained tasks for each thread. Doing so manually would, however, require a complete rewrite of the physical processes code. Furthermore, it would make a performance portable solution impossible, since the CPU benefits from a high cache locality of the column-based code.
An automated or assisted approach for kernel fission is therefore highly desirable to allow for GPU acceleration while keeping cache locality and low overhead for the CPU case. OpenACC does not offer a coding facility that allows us to abstract the parallelization granularity; instead, two separate code versions are required, once with the parallelization applied to the call graph leafs and once applied to a coarse-grained level. In ASUCA's case, this would mean a doubling of most of code used for the physical processes. We note that Norman et al., as discussed in Section 1.1.5, have arrived at the same conclusion [28] .
Architecture-Dependent Privatization.
Due to the large amount of calculations being performed locally per thread in the reference physical processes as discussed in Section 2.4.1, there are also many data objects created locally. Kernel fission requires the sharing of such data objects between threads. If kernel fission is to be employed and cache locality on CPU is to be kept, then it is necessary to define the privatization depending on what architecture is targeted. OpenACC allows for automatic privatization by means of explicit clauses; however, this only works reliably within the same routine as the parallel loop is defined. For a rewrite with finer-grained parallelization granularity across a deep call graph it would still be necessary to rewrite most of the data objects for a higher number of domains (in case of ASUCA's physics three-dimensional (3D) grid data instead of 1D column data).
Architecture-Dependent Storage Order.
In a largely memory bandwidth application it is paramount to implement a storage order that matches the target hardware's cache architecture, since cache misses directly influence memory pressure [4] . Different hardware architectures with different cache architectures have different ideal storage orders. On CPU, the innermost loop should always be given stride-1 memory access to increase cache locality. Due to the characteristics of physical processes discussed in Section 2.4.1 the innermost loop is always operating on the Kdomain, and therefore the natural storage order choice is KIJ or KJI [36] . Meanwhile on GPUs the domain that is mapped to the first thread index (i.e., one of the parallel domains) should be given stride-1 access to enable coalesced memory access [13] , which requires either IJK or JIK orderings in case of ASUCA [36] . Storage order abstraction is therefore necessary to achieve performance portability. Section 5.1.1 shows this as well, based on the findings from a model application.
Storage order therefore strongly impacts performance portability. In OpenACC Fortran there is no provided method to specify multidimensional Fortran arrays in a way that the storage order is efficient on both GPU and CPU without requiring a rewrite of all array accesses and specifications.
Problem Statement
In Section 2.4, we have outlined what we consider the most important porting aspects regarding efficiency on GPU for hybrid codes, i.e., applications with the capability to run on either CPU or GPU. In existing methods used for NWP these issues have been solved using a combination of the following approaches:
• Maximum relearning: Through a complete reimplementation in a meta-programming capable language (so far for HPC this means C++), memory layout abstraction and parallelization on different target platforms has been solved. Examples for this approach are Fuhrer et al. 2014 [7] and Shimokawabe et al. 2014 [37] . • Efficiency loss: Reducing the scope of the problem by only porting the parts of the code that are already fine-grained (and thus suited for GPU) solves the issues described in Sections 2.4.1 and 2.4.2. This, however, leads to a significant overhead in GPU-CPU data exchange, since the remaining code runs on CPU and thus requires data synchronization for every time step. This approach has for example been taken by Govett et al. [9, 10] and by various projects involving GPU acceleration of WRF as outlined in Section 1.1.3. • Code divergence: If the entire logic is to be kept in a style of Fortran language that is familiar to domain scientists (by making use of directive-based approaches such as OpenMP and/or OpenACC), then it ultimately leads to code duplication and divergence, since no directivebased approaches have support for code granularity transformations or memory layout transformations. Norman et al. 2017 show an example of this approach [28] .
Our goal for ASUCA was to find a solution that has none of the aforementioned drawbacks, i.e., a full GPU port that remains CPU compatible, requires minimal code divergence and is able to reuse as much of the existing code as possible.
NEW APPROACH
A new approach called "Hybrid Fortran" has been developed to solve the problem described in Section 2.5. Hybrid Fortran employs a source transformation with the following characteristics:
(1) porting to this framework from existing CPU code requires a low amount of effort, i.e., existing computational code targeted at CPUs can be used as-is where possible; (2) as a consequence (see also Sections 2.4.1 and 2.4.3), storage order and parallelization granularity are to be made variable, such that CPU targeted user code can be transformed to match the architecture specific granularity and storage order; (3) the parallelization technology (here OpenMP, OpenACC, or CUDA) is transparent to the framework user such that with changing industry standards the backend implementation can easily be modified; (4) whenever architecture specific programming is needed, Hybrid Fortran uses an abstraction to centralize these specific settings in a codebase-wide configuration rather than leaking it to the user code; (5) however, control is not be taken away from the framework user, i.e., the implementation should be predictable and adjustable to allow for a highly productive environment to create performant implementations; (6) and, finally, Hybrid Fortran is suitable for both dynamical processes (stencil patterns) as well as column-only physical parametrizations.
In this section, we first introduce a model application used for the further discussion (Section 3.1). We then apply OpenACC parallelization for GPU and OpenMP parallelization for CPU and compare it to a Hybrid Fortran implementation from a user-centric viewpoint in Section 3.2, followed by an implementation-centric discussion in Section 3.3. Finally, the current limitations of this approach are discussed in Section 3.4.
New GPGPU Code Transformation Framework Applied to Weather Prediction 7:11 
Reduced Weather Application
Since ASUCA is too large to present and analyze here in its entirety (see also Figure 9 in Section 5.2), this section introduces a reduced, yet scientifically meaningful, weather application as a vehicle to show relevant code and performance characteristics. Equation (2) shows the main equation governing this application: the heat diffusion equation with T being the temperature defined in IJK-space, κ being the thermal diffusivity, and f being the rate of heat input through radiation.
This equation is solved numerically by using the explicit Euler method in 3D. In addition, the k = 0 and k = nz boundaries are modeled as infinite heat wells with temperatures 330K and 200K, respectively. Radiation f is assumed to be uniform and statically applied across the spatial domain. The KI and KJ boundaries are modeled cyclically. We initialize the temperature function to 0K with a 300K box from one quarter to three quarters of each spatial dimension and set a delta of 0.1s between time steps. Figure 2 shows slices at j = 100 for the resulting data with time steps at t = 0s, t = 30s, and t = 90s. This shows the heat diffusion, the radiation effect, and the surface (k = 0) and planetary boundary heat exchange (k = 200).
To adhere to ASUCA's code structure, the application is split into a dynamical core (which here only consists of the diffusion ∂T ∂t = κ∇ 2 T ) and physical processes with no interaction in the horizontal (radiation, surface-, and planetary boundary heat exchange). Analogously to ASUCA's physical processes interface (see Section 2.4.1), such processes are parallelized (here in run_physics) with each parallel thread executing all physical processes for a single K-column. The dynamical core consists of four stencil code kernels-the inner IJK region as well as the boundary regions for the IJ, KI, and KJ boundaries. Figure 3 shows this code structure including the computationally relevant call graph, loops, and data domains. The entire code is listed in Appendix I.
Design of the Hybrid Fortran Language Extension
In the following section, by using the previously introduced reduced weather application as an example, we discuss the application of Hybrid Fortran with regards to the design goals stated in the beginning of Section 3.
Basic
Parallelization. An example of a parallelizeable code is the following extract of the diffusion algorithm (see Appendix I.1 for a full listing): There are no loop carried dependencies for the sole modified data object energy_u, thus this inner loop region is inherently parallelizable. One can parallelize this code for CPU and GPU with OpenMP and OpenACC by adding directives as follows 3 :
In contrast, in Hybrid Fortran the loops to be parallelized are replaced by the directive @parallelRegion as follows 4,5 :
By replacing parallelizable loops with this explicit code construct the following is gained:
(1) It becomes immediately clear that loops are semantically different from parallel regions.
Loops are always sequential constructs and thus allow any dependencies on previous values, while in a parallel regions the order of execution is not defined. (2) For the GPU implementation there is a clear boundary between device code and host code (see also the glossary in Appendix C). All code inside the parallel region directive gets implemented as device code, while all other code remains on the host. Thus, Hybrid Fortran follows a more explicit approach for the parallelization, which more easily allows the programmer to have a mental model of how an algorithm is implemented on the device.
Through only the information provided in this one directive, Hybrid Fortran is able to generate OpenMP parallel loops for the CPU-as well as CUDA Fortran or OpenACC kernels for GPUparallelization. The generated code for GPU is listed in Appendix I.2 while the generated CPU parallelization corresponds to Listing 2. It is noteworthy that Hybrid Fortran automatically adds the required sharing clauses. For that purpose it is assumed by default that all arrays need to be shared while all scalars can be treated as thread private. 6 Data object specification-as well as usage within parallel regions is parsed to enable useful default sharing behaviour. This combination of design choices makes implementing data parallel code very simple for the framework user, i.e., only the domain boundaries of the parallelization need to be specified.
Device Data
Region. The GPU's advantage in terms of memory bandwidth can only be used effectively if the copying of data via the comparatively slow PCI Express bus can be minimized. To ensure this, it is necessary to gather device state information for each data object in each routine, i.e., whether an existing device memory version can be utilized or whether the data object first needs to be synchronized with their host versions.
To improve on this with OpenACC, data directives are added for the time integration loop (Appendix I.6) as follows:
In addition, present clauses need to be added to all OpenACC kernel directives for all the data objects that are already present thanks to this newly introduced data region.
By contrast, in Hybrid Fortran the device state of data objects is defined declaratively using @domainDependant directives that are added after the specification parts of each relevant subroutine:
Through the present attribute, Hybrid Fortran is instructed to treat data as device present. The autoDom attribute furthermore instructs Hybrid Fortran to use the dimensions from the user specification (which becomes relevant with the features discussed in Section 3.2.4). By applying analogous @domainDependant directives with transferHere attribute at the data region boundaries (e.g., at the subroutine that implements the time integration loop), Hybrid Fortran is instructed to generate the data copies at those boundaries (instead of the default behavior, which is to transfer at every kernel invocation). Again, the user specified intent is used for each data object to determine the type of memory transfer to be implemented, which minimizes the potential for bugs in comparison to OpenACC's explicit copyIn, copyOut, and copy clauses.
3.2.3
Architecture-Dependent Parallelization Granularity. As discussed in Section 2.4.1, one of the main problems to solve to gain performance portability and to allow code reusability is to support parallelizations with variable granularity to co-exist within the same user code. For this purpose, Hybrid Fortran allows a parallel region to be applied to only a partial set of hardware architectures.
For the physical processes in the example we originally have the following code (see the full listings in Appendixes I.3, I.4, and I.5):
Hybrid Fortran allows to write this as follows:
This shows the following:
(1) Using an appliesTo clause (e.g., for the physical processes), parallel regions can be applied partially to specific architectures, that is, for architectures not listed in this clause the region body will be run only once per call and the parallelization can be applied at a higher granularity (here at the radiation, surface and planetary boundary processes). See also Figure 4 for the resulting code structure and compare to Figure 3 in Section 3.1. (2) The user code within parallel region bodies can remain untouched compared to the sequential CPU version. Hybrid Fortran automatically extends data accesses and specifications, here from 1D to 3D for radiate (see also Section 3.2.4).
To be able to correctly generate the code for varying parallelization granularity, the transformation process requires positioning information for each routine towards relevant parallel regions. Each routine either directly contains a parallel region, has parallel regions within its call graph, or is itself called within a parallel region. This information can be visualized as a coloring of the call graph, with a separate coloring for each target architecture. Parallel region appliesTo clauses are the deciding factor for this coloring. Figure 5 shows this for the reduced weather application. This global positioning information is required by the code transformation to implement the correct dimensionality (see also Section 3.2.4) as well as the correct type of code in case of GPUs being the target, i.e., host-, kernel-, or device routine code (see also the relevant definitions in Appendix C).
3.2.4
Architecture-Dependent Privatization. Since Hybrid Fortran allows compile-time-defined parallelization granularity, as discussed in Section 3.2.3, it is necessary for the code transformation to change the dimensionality of data objects to expose their data parallelism at a finer-grained level. This requires hints from the framework user. The following listing shows the @domainDependant directives required to extend the data objects from 1D to 3D (adding the missing parallel dimensions). This is used in column-only physical parametrizations (radiation and boundary layer in case of ASUCA) where the original CPU code is parallelized in a single place using OpenMP-loops over the horizontal domain. Applying this technique allows the user to simply wrap the existing code at a fine-grained level with GPU-targeted @parallelRegion constructs to delegate the task of exposing data parallelism to the framework.
Storage
Order. As will be shown in Section 5.1.1, variable and architecture-dependent storage order is necessary for performance portability as well as code reuse. Hybrid Fortran therefore automatically adds default macro wrappings for multidimensional arrays. This allows a specification of the storage order depending on the target architecture in a centralized location. Unlike with other code hybridization methods supporting Fortran, this enables variable storage order without the need to rewrite any user code. In case more flexibility is needed (such us varying storage order in different parts of an application), Hybrid Fortran allows an explicit definition of the macro names in the @domainDependant directive.
As an example, consider again Listing 7, which shows the Hybrid Fortran version of the reduced weather application's physics code. Through the transformation processes, array accesses and specifications are rewritten as follows:
In an separate source file that is automatically included in all sources, the macro AT is specified as follows: 
Block Size.
Block size (see also Appendix C) can have a significant effect on GPU performance. Hybrid Fortran by default chooses a block size of 32x16x1, which is a good default for data parallel atmospheric applications (see also Section 5.1.4). This default can be changed through macros in a configuration source that gets automatically included in all other sources. In cases where kernels have varying optimal block sizes, Hybrid Fortran allows the specification of a macro name suffix in the parallel region directive.
In contrast, PGI Accelerator's OpenACC implementation by default uses a 128 × 1 block size configuration, which usually leads to a subpar performance due to a low occupancy (see also Appendix C). This therefore needs to be changed explicitly for each loop invocation by using vector clauses in $acc loop directives added to all parallel loops. Furthermore it is often advisable to use !$acc loop seq for loops over k to force the compiler not to assign the first thread index to k (which is sub-optimal for the chosen storage order).
By using macros, Hybrid Fortran avoids leaking this architecture-dependent information to the user code in favor of a centrally defined configuration.
Source Transformation Method
In this section we introduce the code transformation, involved in implementing the framework logic discussed. Figure 6 shows this process in nine distinct phases, with phases four through nine being applied separately depending on the target architecture. This process runs transparently from the user point of view, i.e., it is applied automatically by the means of a common Makefile provided together with the framework. Each of the numbered paragraphs in this section corresponds to one transformation phase.
Macro
Processing for Unified Sources. In addition to the automatically applied transformation process that implements the behavior described in Section 3.2, a further layer of meta-programming is supported by applying a macro processing step to Hybrid Fortran sources, 7 including directives. We use the GNU compiler tool-chain for this purpose (gcc -E).
Merging of Continued Lines.
To simplify the parsing in subsequent phases, Fortran continuation lines are merged.
Call Graph and Parallel Region
Parsing. Facilitating phase 4, the application's call graph and parallel region directives are parsed globally.
Call Graph Coloring.
A call graph coloring is created, as shown previously in Figure 5 .
Global Symbol
Parsing. To gather symbol information, the application is parsed twice in this phase: A first time to gather module data object specifications and a second time to link this information to the routines where such module data objects get imported, together with the information of data objects defined in each routine scope. This allows Hybrid Fortran to globally determine the domain information for data objects.
Global Model Generation.
All previous information is compiled into a global application model, with model classes representing the modules, routines and code regions. Code regions are modelled using subclasses for specification-, sequential-, call-, and parallel regions. For each code line this model contains a list of the data objects used within the line and their known meta information (gathered previously from local module scope, imported scope, routine scope as well as added information from Hybrid Fortran directives). Each routine model is also assigned an implementation class (see phase 7).
3.3.7 Implementation. One of the stated goals of Hybrid Fortran is to create a higher level toolchain that is reusable for various hardware-and software architectures. Thus, only a single transformation phase is dependent on the backend implementation, here called the "implementation phase." For this purpose, a Python implementation class hierarchy is used to facilitate the reuse or specialization of a set of implementation methods for the varying backends. Previously gathered information is used by instance methods of these implementation classes to create the sources from the previously built model. More specifically, using implementation classes for either standard Fortran, OpenMP Fortran, CUDA Fortran, or OpenACC Fortran, the following transformations are applied:
• data object specifications and accesses are transformed according to object type, implementation class and routine coloring, • data copies from and to the device are generated, • device-and host code boilerplate is generated. Since this requires separate routines for CUDA (one for the host code and one for each kernel in the user code), parallel region bodies are split off into their own generated routines beforehand.
CUDA Fortran and OpenACC GPU implementations can be mixed in the same project. This is mainly done to make use of OpenACC's reduction feature (which is not supported for the CUDA Fortran backend). Switching between the different backend architectures is done by specifying a default implementation in the configuration and wrapping routines in a @scheme directive where a different one needs to be applied. To facilitate interoperatibility, Hybrid Fortran uses CUDA Fortran allocated device pointers for all GPU implementations. 7 Hybrid Fortran's build system recognizes sources from four distinct formats: f90 (Modern Fortran), F90 (Modern Fortran with Macros), h90 (Hybrid Fortran), and H90 (Hybrid Fortran with Macros). Depending on the source format, the build process starts at phase one (H90), two (h90), or nine (f90 / F90).
Line Splitting.
To make the generated code compatible with standard Fortran compilers (which usually enforce a maximum line length), the generated lines are split at appropriate positions containing white space until the maximum line length is satisfied.
Final Macro Processing and Compilation.
The GNU preprocessor is applied again to process the generated macros (e.g., for storage reordering, see also Section 3.2.5). Subsequently, a user specified compiler and linker is used to create the CPU and GPU executables. For GPU, the generated code is compatible with PGI Accelerator in case of the CUDA Fortran implementation, and PGI Accelerator or Cray Fortran compiler in case of the OpenACC implementation.
Limitations
Hybrid Fortran has at this point in time mainly been tested to implement structured grid applications. We expect that unstructured grid applications would reveal currently unsupported use cases.
Additionally, some limitations apply to the code within GPU-applied parallel region bodies: Such code must not contain recursion, call other kernel routines, use data objects with DATA or SAVE attribute, contain I/O statements (except print) or contain array or slice expressions such as A(:,:,:) = 0.0 or C = A + B (i.e., all operations in GPU parallel region bodies are required to be scalar).
PERFORMANCE MODEL
In this section, a performance model is constructed for the application introduced in Section 3.1. For this purpose we simplify the problem as follows:
(1) By assuming an optimized storage order (see also Section 5.1.1) the bandwidth bounded model for sequentially accessed memory can be applied to the inner region of diffusion as well as radiation.
(2) Memory accesses from cache are assumed to not affect execution time at all (i.e., such accesses are assumed to be hidden behind host-or device memory accesses in the pipeline). (3) By assuming sufficiently large spatial dimensions we omit all other program parts from further modelling, since their runtime effect should be negligible. (4) We assume the compiler to optimize all operations that can be hoisted out of loops.
Per loop iteration, diffusion, 8 being a seven point stencil code, requires a maximum of eight memory accesses per point update as well as six add and two multiply instructions. The arithmetic intensity according to an adapted roofline model [46] is thus
where m is the number of values to read and write from/to memory per point update, b is the byte length of each value and c is the number of floating point cycles spent per point update. 9 The minimum arithmetic intensity for compute boundedness is 5.9[ F LOP B ] and 7.8[ F LOP B ] for Tesla K20x and P100, respectively (using the performance numbers from Appendix B). Thus, diffuse is clearly memory bandwidth bounded. This holds true even if six of the eight memory accesses are cached (see also the stencil access pattern in Appendix I.1). Due to the lower random access memory bandwidth, the boundary region JK is clearly also memory bandwidth bounded.
The same holds for the radiation process, since it requires two memory accesses and one FLOP per iteration. 10 We therefore conclude that the reduced weather application is memory bandwidth bounded for all processes.
The reduced weather application is thus highly sensitive to the cache performance of the target architecture. Applying the above parameters to the speedup condition (equation 1), a speedup on GPU is feasible if the output is not required at every time step.
With some simplification in the domain boundaries (by assuming sufficiently large domains for halos to be neglectable) the model GPU execution time becomes
and the model CPU execution becomes
where b is the byte length of each value, m sa is denoting the number of sequential memory accesses in the inner region of diffuse and radiation and m r a is denoting the number of random access updates in the boundary region JK, n x , n y , and n z are denoting the grid sizes in the x-, y-, and z-directions, respectively, BW H , BW D , and BW H toD are the bandwidths available on host, device, and between host and device, respectively, and RA H and RA D are the number of random memory accesses per second (measured in GU P/s) [22] . It holds that m sa = 10 without caching, m sa = 4 with caching and m r a = 4. See also Section 5.1 for a comparison of the measured performance with this model.
PERFORMANCE RESULTS AND DISCUSSION
This section discusses performance results obtained using the Hybrid Fortran approach. In Section 5.1 the reduced weather application (introduced in Section 3.1) is used to compare the Hybrid Fortran user code implementation to an OpenACC implementation and the performance model (introduced in Section 4) with respect to the overall performance as well as individual performance related implementation aspects. Hybrid ASUCA is presented as a new implementation in Section 5.2. Sections 5.3 and 5.4 subsequently compare its GPU performance to the reference implementation (CPU-only), for small and large scale grids, respectively.
Comparing Hybrid Fortran with OpenACC and Model
This section facilitates the previously discussed reduced weather application to compare Hybrid Fortran with OpenACC and a performance model. We focus here on the desired portability features discussed in Section 2.4. For the performance measurements in this section the configuration as listed in Appendix D has been used. Table 3 shows the impact, storage order has on execution time. In the case of the currently discussed application, choosing a sub-optimal storage order impacts CPU execution time negatively by 35%, while on GPU the slowdown is 7.7×. This verifies the necessity of a flexible storage order for applications with similar data structures as ASUCA, as discussed in Section 2.4.3. 
Performance Portable Storage Order.

"
Naive" Parallelization. To establish a baseline performance, we apply a basic parallelization to the reduced weather application with OpenACC and OpenMP directives as discussed in Listing 2, Section 3.2.1. Storage order is made variable across the whole application by using macros for accessing and specifying multi-dimensional arrays. We are using I-J-K order for GPU and K-I-J order for the CPU implementation. Table 4 shows the resulting CPU performance from this parallelization.
We conclude that the measured CPU performance is already well within the models with perfect and no cache. The GPU performance is, however, very slow-more than 400× slower than the six core CPU version, which is not in agreement with the models we have constructed. This is to be expected, however, since no data region has yet been defined, without which the data are being copied over the slow PCI express bus for every kernel invocation. Further discussion for the reduced weather application therefore focuses on GPU performance.
Data Region.
The amount of I/O to and from the GPU can be greatly reduced by using a data region to avoid host-to-device data copies at every kernel invocation, as discussed in Section 3.2.2. Table 5 shows the resulting performance with OpenACC. Table 6 shows, up to 38% performance improvement is possible over the version discussed in Section 5.1.3 for the OpenACC implementation by changing the default block size using vector clauses for each parallel loop (for a definition of block size, see also the GPGPU terms glossary in Appendix C). PGI accelerator uses a 128 × 1 block size for its OpenACC implementation by default while 32 × 16 has been experimentally determined to be the best block size for this application except for a small degradation for the smallest measured grid size. See also Section 5.1.5 for a performance comparison with Hybrid Fortran (which chooses this block size by default). Figure 7 shows the performance results for the reduced weather application with the Hybrid Fortran-based implementation (discussed in Section 3.2) in comparison with the models introduced in Section 4 as well as the OpenMP/OpenACC-based approach discussed in Section 3.2. For each of the implementations the highest performing version (32 × 16 block size with data region) has been selected for this comparison. This shows that measured CPU performance aligns well in between the two models (with and without cache) while for the TSUBAME 2.5 Kepler-based GPU architecture, L1 cache appears to be less effective for this application and the measured performance is closer to the model without cache. It is noteworthy that the Hybrid Fortran generated code performs as well or better than the equivalent OpenACC code for both target architectures in this example.
Block Size. As
Performance Comparison.
Overall we observe a speedup of 8× for the Hybrid Fortran version vs. six-core CPU, compared to the speedup of 7.7× for the OpenACC version. The Hybrid Fortran implementation's speedup is thus very close to the bandwidth increase 8.2× between the two architectures (see Appendix B).
On CPU the Hybrid Fortran implementation performs the same as the manually coded OpenMP implementation, which is unsurprising given that the CPU code generated by Hybrid Fortran is practically identical. 
Hybrid ASUCA Implementation
As Figure 8 shows, by employing Hybrid Fortran directives to both the dynamical core and the physical processes of the existing CPU-targeted ASUCA user code, nearly all modules required for an operative weather prediction have been ported to GPU while retaining CPU compatibility. All ported modules have been validated by comparing the output of all relevant variables to the original implementation and ensuring the normalized root mean square error to be smaller than 1E-10 after 10 time steps in the coarsest time discretization, both on CPU and GPU. Figure 9 shows the impact this new implementation has on the code size, as well as the overall distribution of code lines in the application. Additionally, in Reference [27] we have studied the productivity effects in more detail, which we describe here briefly: ASUCA's codebase encompasses more than 150k lines of code. By using our method, more than 85% of the original codebase was left unchanged and the overall code size has been extended by less than 5%, even though the new codebase targets two very different hardware architectures instead of one. This shows very promising productivity, maintainability and ease-of-adoption of the proposed method in an operational setting. Using an application model we have established a detailed estimate of the required code changes for a comparable OpenACC implementation. The result: OpenACC would require more than 73% of additional changes compared to the changes required using Hybrid Fortran. We reckon that the same would be the case when using OpenMP directives targeting GPU, since to our knowledge it only offers a subset of OpenACC's GPU features to this date.
Of the approximately of 163k lines of code, around 19.5% are used for the dynamical core and 20.8% are used for the physical processes, mainly to implement 133 and 205 kernels, 11 respectively.
By applying this unified programming model to both dynamical core and physical processes of ASUCA, host-to-device communication has been eliminated with the only exceptions being setup, halo communication as well as file output.
Hybrid ASUCA Kernel Performance
In this section, performance results for this new implementation are discussed for a 301 × 301 × 58 grid that is small enough for single GPU or single socket execution with the latest architecture (Tesla P100 on Reedbush-H), yet still allows a useful performance analysis in terms of occupancy (see Appendix C). This allows to draw conclusions for the kernel performance as opposed to the communication overhead (which impacts performance more strongly, the more nodes are used for the same grid size, i.e., when applying strong scaling as will be shown in Section 5.4). Additionally, an older system (TSUBAME 2.5 with Tesla K20x) is used for comparison purposes. Since the two compared GPUs differ in their available device memory (16GB for P100 vs. 6GB for K20x) we compare four K20x GPUs to one P100. See Appendix E for the software configuration used for this test. Hybrid ASUCA is also compared to previous GPGPU ports of weather models as discussed in Section 1.1. Figure 10 shows the execution times for this configuration on four hardware/software configurations as listed. A speedup of 4.9× has been achieved by the port on Kepler GPU vs. Westmere Fig. 11 . Total cloud cover result with ASUCA using a 2km resolution grid with a real world weather situation. six-core Xeon X5670. On newer hardware, however (Pascal vs. Broadwell), the speedup has so far been a more modest 3.1×, which is partly explained by the lower memory bandwidth difference between the two comparisons and the increased caching performance on Broadwell-versus Westmere CPU architecture (as caching generally has a lower impact on GPU compared to CPU).
Hybrid ASUCA Strong Scaling GPU Results
Using a full-scale production grid, Hybrid ASUCA has been tested on the new Tokyo University cluster "Reedbush H" [43] . This cluster has two 18-core Xeon E5-2695 v4 CPUs per node as well as two NVIDIA Tesla P100 GPUs per node. At least seven nodes (14 sockets) or 24 GPUs are required for this test due to the grid's memory requirements. A visualization of the resulting cloud cover from this simulation is depicted in Figure 11 . Figure 12 shows that 24 GPUs can replace more than fifty 18-core CPU sockets. When comparing the same number of GPUs and CPU sockets, the GPU is up to 76% faster. See Appendix F for the software configuration used in this test.
The following factors influence scalability and will need to be improved to achieve better strong scaling:
(1) MPI communications code has been used as-is, with no further optimizations applied with respect to the targeted cluster. When testing the impact of communication on performance on TSUBAME 3.0 using the minimally required 24 GPUs, as shown in Figure 13 , communication requires approximately 13.4s or 18.7% of the overall runtime of 71.5s (to compute a 600 seconds simulation of the full regional grid in 2km resolution). When doubling the number of GPUs, this increases to 20.9s while the compute time decreases from 58.1s to 34.0s, and thus the communication then takes 38% of the runtime. Overlapping communication and computations has been shown to be effective in enabling better scaling by Shimokawabe et al. [36] , thus this approach will be the first step to improve performance at larger scales. (2) Since GPUs require a large-enough problem size per chip to have a sufficient number of threads to fill all schedulers, strong scaling is limited when the problem size per GPU becomes too small. (3) In our ASUCA code version, as in the given reference, we do not have a distributed file I/O system as used in production. Due to the large amount of data, even though for this test the output is only run at the beginning and end of the simulation, it still has a strong impact on the overall execution time, resulting in part of the discrepancy between for the speedups between the 301 × 301 and 1581 × 1301 grid sizes. Figure 14 categorizes the performance impact of the different modules of ASUCA on performance, including communication. The simulation of fast-moving sound and gravity waves has the highest impact, followed by radiation-and boundary-layer physics. Since the physics calculations do not require communication, the impact of fast moving dynamics increases with the number of nodes, rendering it the most important optimization target for larger-scale simulations. For a listing of the configurations used to gather the data for Figures 13 and 14 , please refer to Appendix G.
Comparison to Previous Work
In Section 1.1.1, Dr. Shimokawabe's work on applying GPUs to ASUCA was briefly discussed. In this section, we compare that implementation to Hybrid ASUCA. Table 7 shows the differences in focus between Hybrid ASUCA and Shimokawabe et al. [37] . Dr. Shimokawabe's work was targeted at achieving computations on the largest scale available, i.e., a near-full utilization of TSUBAME 2.5 for a weak scaling result as well as using 512 GPUs for strong scaling. For comparison, Hybrid ASUCA was only scaled using strong scaling, reaching a performance peak at 42 GPUs, using a device model that is equivalent in theoretical performance to 159 GPUs used by Shimokawabe (P100 vs. K20x). However, with Hybrid ASUCA we were first able to port a complete model including operation-ready physical processes using a unified framework and we have demonstrated that only a minor rewrite is necessary for a previously CPUtargeted code to achieve full GPU compatibility (while Shimokawabe et al. relied on a full rewrite using a different user language and did not demonstrate radiation-as well as boundary-layer physics).
CONCLUSIONS
In this work, "Hybrid Fortran" has been introduced as a new approach to port structured grid Fortran applications to GPU. It provides abstractions over architecture specific programming while allowing existing CPU targeted user code to be reused as much as possible (see also Section 3.2). This approach allows using a mixture of OpenMP, OpenACC, and CUDA Fortran in the backend with a unified programming interface (see also Section 3.3). Storage order and parallelization granularity are made variable. Existing CPU targeted user code is transformed to match the architecture specific granularity and storage order.
A new implementation of ASUCA, the main mesoscale weather model used in weather prediction operations by the national Japanese weather agency, has been created by applying Hybrid Fortran to both dynamical core and physical processes. The resulting speedup on GPU (up to 4.9× on single GPU compared to a single Xeon socket) is comparable to architecture specific rewrites of similar weather models (see also Section 5.3). To our knowledge this is the first time a complete production weather prediction model has been ported to GPU using a single unified programming paradigm. The resulting user code is less than 5% larger compared to the reference CPU code, which is only made possible by sharing all of the user code between the two architectures (see also Figure 9 ). A speedup of up to 3.5× on GPU was achieved for a full scale production run (1581 × 1301 × 58 grid with 2km horizontal resolution using real world sample input data, see also Section 5.4).
FUTURE WORK
Future efforts will focus on three aspects: performance, productivity and scope.
To improve the performance on GPGPU, the following avenues will be explored:
(1) Communication will be optimized to improve larger scale runs on more than 30 nodes. Network topology-based optimization, as well as the overlapping of communication and computation will be the main focus for these optimizations. (2) GPU kernels will be more finely tuned as far as a unified user code allows. One possible avenue is the usage of launch bounds, a CUDA feature that supplies better hints to the compiler with regard to register usage (however, it requires re-compilations for varying grid sizes). (3) Automated cache blocking transformations that are applicable to GPU (shared memory, L1, texture cache) are an important path to explore. Such transformations also have the potential to auto-tune the code for different CPU architectures.
Further productivity gains are possible by automating more of the mechanical aspects of porting to Hybrid Fortran. More specifically, replacing the device data state attributes per routine and per data object with a generalized data region facility (and keeping track of all the involved data objects automatically) will greatly reduce the work needed to port to Hybrid Fortran. This in turn can make Hybrid Fortran viable for a GPGPU port of the WRF model (which is one order of magnitude larger in code size compared to ASUCA).
In terms of scope, applying Hybrid Fortran to unstructured grids will be explored.
APPENDIX A HYBRID FORTRAN LANGUAGE EXTENSION
This appendix section gives an overview of the main language extensions provided by Hybrid Fortran. For a complete listing, refer to the documentation in the repository. 12
A.1 Parallel Region Construct
Listing 11 shows the parallel region construct. It allows the framework to define parallelization as well as the thread granularity at compile-time.
The following attributes are supported for this directive:
appliesTo Specify one or more of the following attribute members to set this parallel region to apply to either the CPU code version, the GPU version or both. Omitting this attribute has the same effect as specifying all supported architectures.
(1) CPU (2) GPU domName (Required) Specify one or more domain names over which the code can be executed in parallel. These domain names are being used as iterator names for the respective loops or CUDA kernels. domSize (Required) Set of the domain dimensions in the same order as their respective domain names specified using the domName attribute. It is required that |domN ame | = |domSize |. startAt The lower boundary for each domain at which to start computations. Omitting this attribute will set all boundaries to 1. It is required that |startAt | = |domN ame |. endAt The upper boundary for each domain at which to end computations. Omitting this attribute will set all boundaries to domSize for each domain. It is required that |startAt | = |domN ame |. reduction Works in the same way as OpenMP reduction directives. This is only supported with the OpenACC-and OpenMP backends. For example reduction(+:result) sums up the results over all threads. template Defines a postfix that is to be applied to different attributes that are loaded using the preprocessor. Currently this only affects CUDA Blocksizes: These are loaded using CUDA_BLOCKSIZE_X_[Template-Name] from the preprocessor (storage_order.F90 is the suggested place to define these). The goal of this attribute is to hoist hardware dependent attributes outside of the code to more easily facilitate performance portability.
A.2 Domain Dependant Construct
Listing 12 shows the domain-dependent construct. It is used to give hints to the framework how data objects are to be adjusted in the transformation process. This allows passing on the responsibility of rewriting certain aspects of data specifications and accesses to the framework. It should be noted that the framework only operates on local information available to each subroutine. As an example, whether a data object has already been copied to the GPU is not being analyzed. Domain Dependant constructs are required to be specified between the specification and the implementation part of a Fortran subroutine.
It is important to consider that in case neither present nor transferHere flags are used, Hybrid Fortran will automatically transfer all domain dependents with their Fortran intent specified (as in, inout or out) from and/or to the device according to their intent in case of a routine containing a GPU parallel region. 
B HARDWARE PERFORMANCE
C GPGPU SPECIFIC PROGRAMMING TERMS
The following glossary gives an overview for terms commonly used for GPGPU programming with modern NVIDIA architectures.
Block size Size of a block of threads that is fetched and loaded onto one [SMX] simultaneously. Can be defined in 1D, 2D, or 3D. The first dimension should usually chosen as a multiple of the warp size in single precision or as a multiple of a half warp size in double precision. Each thread in a block requires register-and, if used, shared memory resources. Thus, there is a trade-off between the block size and optimizing for the required number of registers per thread. If too many registers are used, then spilling occurs in which data is offloaded to the device memory. CUDA NVIDIA's proprietary GPGPU programming language, available for C, C++, and Fortran. CUDA core NVIDIA's marketing term for a combination of general purpose ALU and FPU on the GPU, each able to operate on one CUDA thread. CUDA Fortran A Fortran interface to the CUDA tool chain, supported by the Portland Group (PGI). Device code Code that exclusively runs on the GPU. Two types of device code exist: Kernel routines and device routines. The latter type is required for routines called within kernels. CUDA requires the programmer to explicitly set the type of each routine. Grid size Set of blocks that is loaded onto the GPU simultaneously. See also [Block size]. Host code Code that exclusively runs on the CPU. Launch bounds Compile-time provided upper bounds for the problem size per block-this allows the CUDA compiler to be more conservative with the assigned resources per thread and thus optimize for a higher occupancy. This however requires the application to be recompiled for each new problem size to be executed. Occupancy The percentage of CUDA cores that are occupied on average during execution. Occupancy can get lowered to to a number of issues, such as device memory stalls, high register pressure, high shared memory pressure, misconfigured block size configurations or problem sizes that are too small for the architecture. GPU specific optimizations are usually targeted at increasing the occupancy, since this directly affects the runtime. OpenACC An industry standard for directive-based GPGPU computing, available for C, C++, and Fortran. Notable implementations are available from Cray and Portland Group (PGI). Cache Each streaming multiprocessor (see [SM/SMX]) offers the following caches: Shared Memory, texture memory and L1 cache. Shared memory is used through explicitly programmed memory operations, texture memory is used in a declarative way and L1 cache operates automatically for device memory accesses, analogous to CPU cache architectures. Additionally an L2 cache is shared among all multiprocessors on one GPGPU. SIMT Single instruction, multiple-threads execution model. An extension of the traditional Single instruction, multiple data (SIMD) model, in which branching within kernels is possible (with a loss of performance however, see also [Warp] ). SM/SMX Streaming multiprocessor, NVIDIA's marketing term for a set of CUDA cores that share certain resources, such as an L1 and shared memory cache as well as a register file. Thread One thread in the [SIMT] model. In GPU hardware such threads do not operate independently, however; see also [Warp] . Each thread occupies one CUDA core during execution.
Warp A set of threads that are executed using a single scheduler. Threads within a warp operate in lockstep on either the same operation or nop (in case of branching). In current and recent architectures the warp size has been 32. For double precision, only each second thread per warp is executed (i.e., the data length per CUDA core and warp scheduler is fixed).
D REDUCED WEATHER CONFIGURATION FOR PERFORMANCE TESTS
The following configuration has been used for the results discussed in Section 5.1:
(1) TSUBAME 2.5 has been used, i.e., Tesla K20x GPUs and Xeon X5670 CPUs.
(2) For the CPU performance measurements, ifort and icc with -fast option has been used.
(3) Thread affinity has been set to compact to ensure that the comparison is being done to exactly one of the two CPU socket. (4) For the GPU performance measurements, PGI Accelerator version 15.04 with -fast option and compute capability 3.× has been used. (1) 75 time steps (2) All ported modules enabled as depicted in Figure 8 (3) Real model input data (4) All floating point computations done in double precision (5) Intel Fortran compiler with -fast -no-ipo settings has been used for the reference code running on CPU. Versions 14.0.2 and 17.0.2 have been used for Westmere Xeon X5670 and Broadwell Xeon E5-2695 v4, respectively (inter-procedural optimization has resulted in compiler errors). (6) PGI accelerator has been used for the GPU measurements, with optimizations disabled (-O0) due to accuracy problems; Versions 16.9 and 16.10 have been used for K20x and P100, respectively.
F ASUCA CONFIGURATION FOR TESTS WITH 1581 × 1301 × 58 GRID ON REEDBUSH-H
The following configuration has been used for the results discussed in Section 5.4:
(1) All ported modules have been enabled as depicted in Figure 8 .
(2) 75 time steps have been executed.
(3) OpenMPI version 1.10.7 has been used. (4) The GPU implementation uses two MPI processes per node. (5) The CPU implementation uses one MPI process per node, with OpenMP core affinity settings to keep the threads in a compact manner on each socket (to optimize cache locality). (6) The horizontal domain (IJ) has been decomposed according to Appendix H. (7) Output to file has been limited to the beginning and the end of the simulation. 
I.6 Reduced Weather: Time Integration
