In this work we use the GPU porting task for the operative Japanese weather prediction model "ASUCA" as an opportunity to examine productivity issues with OpenACC when applied to structured grid problems. We then propose "Hybrid Fortran", an approach that combines the advantages of directive based methods (no rewrite of existing code necessary) with that of stencil DSLs (memory layout is abstracted). This gives the ability to define multiple parallelizations with different granularities in the same code. Without compromising on performance, this approach enables a major reduction in the code changes required to achieve a hybrid GPU/CPU parallelization -as demonstrated with our ASUCA implementation using Hybrid Fortran.
Introduction
With supercomputing shifting towards accelerators, the increasing need for manycore architecture support in software has created a divide between domain scientists who mainly care about modeling, and the supercomputers their applications are required to run on. The high rate of change in both hardware and software creates maintainability issues, especially with codes that are required to run on varying hardware architectures such as multi-core CPU and GPU. The aforementioned divide has widened especially in atmospheric sciences, a field with constant need for model adaptations and high demand for computing resources. Related publications suggest that productivity issues are what is holding back wider accelerator support in this field.
In this work we use the GPU porting task for a Japanese weather prediction model ("ASUCA") as an opportunity to examine productivity issues with the current GPGPU standard "OpenACC" when it is applied to structured grid problems written in Fortran. ASUCA is the main mesoscale weather prediction model developed at the Japan Meteorological Agency. It is used in operation, generating nine-hour-forecasts every hour [ ] [ ].
arXiv:1710.08616v1 [cs.DC] 24 Oct 2017
We then propose Hybrid Fortran, a solution that is designed to increase productivity when re-targeting structured grid Fortran applications to GPU. It is an improvement over OpenACC in two major aspects:
. Parallelization granularity is abstracted. This allows the user to have multiple granularities defined in the same codebase, depending on the targeted hardware architecture. This is a crucial advantage in order to implement ASUCA's physical processes on GPU -a code that originally has a very coarse granularity, which is ill-matched for GPUs. . Memory layout is abstracted while supporting the already existing user code.
More specifically, the layout is reordered at compile-time to match the target architecture, and extended with additional dimensions to match the specified parallelization granularity.
By investigating the necessary code changes with a completed implementation based on Hybrid Fortran we show that this method has enabled high productivity and performance for re-targeting ASUCA to GPU. More than % of the hybridized codebase is a one-to-one copy of the original CPU-only codewithout counting white-space, code comments and line continuations. An equivalent OpenACC-based solution of ASUCA is estimated to require more than ten thousand additional code lines, or . % of the reference codebase. The new implementation performs up to . x faster when comparing one GPU to one multi-core CPU socket. On a full-scale production run with x x grid size and km resolution, Tesla P GPUs are shown to replace more than -core Broadwell Xeon sockets.
This paper is structured as follows: Section . introduces the application our work focuses on. In Sections . and . we introduce the main difficulties that we face when porting this application to GPU. Section . outlines the related work in terms of existing methods to solve these difficulties. We then provide a problem summary in Section . . In Sections , and we discuss our solution to this problem, the underlying code transformation method, as well as productivity and performance results achieved with this solution, respectively. Finally, in Section we draw conclusions and point out future work.
.
ASUCA on GPU
The regional scale weather prediction model "ASUCA" is one of the main operational forecast models in Japan [ ]. It is developed by the Japan Meteorological Agency (JMA) and used in production since , covering all of Japan as well as relevant ocean areas and surrounding landmasses in East Asia on a rectangular grid with two kilometer resolution, using the finite volume method for the spatial discretization [ ].
ASUCA is implemented in Fortran, with multidimensional arrays stored in modules as its main data structure. It is structured as a dynamical core interfacing with physical processes, as commonly seen in other weather models such as WRF [ ] and COSMO [ ]. Parallelization is applied in the horizontal domain (iterated with I and J loop indices). In order to scale to multiple nodes, ASUCA's grid is decomposed into blocks in the horizontal domain that are scattered across nodes, thus requiring halo communication. OpenMP parallel loop directives are employed as an existing intra-node CPU parallelization. Time discretization is implemented by employing a third order Runge-Kutta scheme, enabling long time steps [ ]. Sound and gravity waves are treated separately using a second-order Runge-Kutta scheme to enable a higher time resolution, employing the HEVI scheme (Horizontally explicit -vertically implicit) [ ].
ASUCA's dynamical core is a stencil code and thus heavily bounded by memory bandwidth [ ] [ ]. Since the dynamical core also dominates the runtime, GPUs are an attractive target architecture, with a memory bandwidth that is typically to times higher than Intel Xeon architectures of a similar generation. This work is thus motivated by the task of achieving a GPU port for ASUCA, however with the additional goal of optimizing for minimal code changes in order to achieve better acceptance from the application owner (JMA).
ASUCA is a protected asset of the Japanese government and can not be published at this time -in this paper we thus refer to code snippets to discuss our implementation.
Parallelization Granularity
Compared to CPUs, GPUs support a very high number of parallel threads while having a very low thread switching overhead -however with the cost of small caches available per thread and a low single-threaded performance. Furthermore, the latency experienced when off-loading code to GPU results in a very high number of threads being optimal, preferably a multiple of the available arithmetic units (i.e. tens of thousands of threads or more). This in turn leads to register pressure being a major factor for the optimization of GPU code, since the number of utilized registers scales linearly with the number of active threads. These characteristics point to an important distinction in the GPGPU programming model: A fine-grained parallelization is strongly preferred or even necessary (as subroutine calls in kernels are required to be inlined, which becomes impractical with deep call graphs).
While ASUCA's original implementation already offers a fine-grained and thus GPU-friendly dynamical core, most of its physical processes use very large kernels and thus coarse granularity. Since each vertical column (K index) can be computed independently for many of the physical processes, these computations are implemented in a single kernel in order to increase cache locality and decrease the amount of context switching and thread synchronization [ ] [ ].
In order to achieve GPU support, a more fine-grained parallelization is thus required for the physical processes. An automated or assisted approach for kernel fission is desirable in order to allow for GPU acceleration while keeping cache locality and low overhead for the CPU case.
Memory Layout
There are two major aspects in which the memory layout on GPU differs from that on CPU:
Stride-Access On CPU, the memory layout is generally chosen such that the fastest varying dimension is mapped to stride-access. GPUs on the other hand require the first parallel dimension to be mapped to stride-in order to coalesce the memory operations, since a group of threads is executed in lockstep. In case of ASUCA these are not the same dimensions: K is the fastest varying dimension, yet it is executed sequentially in general (with a few possible exceptions in the dynamical core), while either I or J can be chosen as the first parallel dimension. The original codebase thus employs KIJ memory order. In an unpublished paper submitted for review, we show that KIJ order leads to a . x slowdown on GPU (versus IJK) while IJK ordering leads to a % slowdown on CPU (versus
Privatization To create a more fine-grained parallelization for GPU, as discussed in Section . , kernel fission is required. This in turn requires thread-local data structures and passed-in data slices to be extended in the parallel domain (IJ). Many scalars thus become D-arrays. Many D-become D-arrays.
Thus, in order to make performance portability possible, the memory layout is required to be changeable and extendable, depending on the target architecture and the parallelization granularity. To achieve this in pure Fortran however, all array specifications and accesses need to be modified and thus specialized for GPU (i.e. code duplication is necessary).
Related Work
The following works are related to this paper in that they describe alternative or similar methods to achieve a hybrid GPU/CPU capable port for atmospheric models. All of these works have achieved compelling speedups on GPU, this discussion therefore focuses on the productivity aspects.
Stencil Domain-Specific Languages Domain-specific languages applied to stencil algorithms have been one method to abstract parallelization boiler-plate and memory layout for hybrid GPU/CPU code. For this matter, direct data accesses with loop-or thread indices are abstracted in the point-wise stencil code. This generally requires a complete rewrite of existing code. We take note of the following implementations of weather models using this technique: Directive-Based Porting Methods Directives are used to steer compilers on how to optimize or parallelize already existing code for a specific hardware architecture. Privatization of thread-local data is generally supported in directive-based methods, while storage ordering is not. To switch between multiple parallelization granularities, code duplication is necessary. We are aware of the following implementations of atmospheric models with this method: Energy" (ACME) for the U.S. DOE using OpenACC. As with ASUCA, ACME's physical processes are problematic for GPU due to their coarsegrained parallelization. GPU-specific code duplication was the only solution found when using OpenACC [ ].
Granularity Optimization Methods Kernel fusion has been the main approach to granularity optimization applied to GPGPU programming we are aware of. We take note of the following related work:
-Wahib and Maruyama have shown the effectiveness of this approach in terms of performance when applied to CUDA C kernels [ ]. -Gysi and Hoefler have applied the same approach as well as loop fusion for the aforementioned STELLA stencil DSL library [ ].
-Clement et al. are applying kernel fusion and an OpenACC/OpenMP targeted transformation to Fortran code in an ongoing effort as part of the C SM project (as of yet unpublished ).
We are not aware of previous or ongoing work regarding kernel fission (and thus support for coarse-grained parallel programming) applied to GPUs.
Memory Layout Abstraction Methods
While stencil DSLs abstract the memory layout, they also require a full rewrite of the point-wise code. The following work allows for the reuse of existing code while keeping the performance portability gains of an abstracted memory layout: .
Problem Summary
No existing method, that we are aware of, combines memory layout abstraction and a flexible parallelization granularity with the ability to reuse existing Fortran code for GPGPU. In this work we aim at introducing such a method.
Hybrid Fortran Language Extension and Code Transformation
Hybrid Fortran has been developed as a method for porting structured grid Fortran applications to GPU. In recognition of the advantages and disadvantages of stencil DSL-and directive-based methods (outlined in Section . ) we have combined the advantages of both by employing the following characteristics in our approach:
. it does abstract the parallel loops in order to achieve multiple parallelization granularities with the same code, These open-sourced efforts can be found at https://github.com/C SM-RCM/clawcompiler.
. it does not abstract the point-wise code (i.e. the loop bodies) -allowing for code reuse, . it does separate the memory layout as defined in the user code from the layout that is effectively implemented for each architecture.
Hybrid Fortran is an open-source framework and can be accessed together with a library of sample applications .
In this Section we discuss how these characteristics have been achieved. Subsection . describes our approach to parallelization and granularity in Hybrid Fortran, while Subsection . discusses and compile-time defined memory layout and device memory handling.
Parallel Loop Abstraction
Consider the following kernel from JMA's ASUCA reference implementation.
As part of the dynamical core it is executed within the second-order Runge-Kutta scheme with high time resolution. It applies lateral and upper damping to ASUCA's grid point values.
Listing . . Lateral and upper damping kernel applied to grid point values.
Using Hybrid Fortran we replace the OpenMP directives, as well as the loop instructions to be parallelized, with our parallelization DSL:
Listing . . Lateral and upper damping kernel, modified with Hybrid Fortran.
@ p a r a l l e l R e g i o n { &
& domName( i , j ) , domSize (nx_mn : nx_mx , ny_mn : ny_mx) , & & s t a r t A t (nx_mn , ny_mn) , endAt (nx_mx , ny_mx) , t e m p l a t e (TIGHT_STENCIL)
a r a l l e l R e g i o n
We therefore have an explicit distinction between loops that are treated as parallelizeable (and are thus restricted in their access patterns, i.e. loop carried dependencies are not supported) and loops that are always executed sequentially.
Please refer to https://github.com/muellermichel/Hybrid-Fortran.
The attributes domName and domSize specify the relevant domain iterators and the domain size relevant to data objects accessed within the parallel region (this relevancy will later be discussed in more detail in Section . ). The attributes startAt and endAt explicitly state the region boundaries, which can be a subset of the domain size (however, if omitted, the domain size is also assumed as the region boundary).
For CPU targets, Hybrid Fortran generates an OpenMP code version very similar to the reference code shown in Listing . , with multi-core parallelization applied to the outermost loop. For GPU targets it defaults to CUDA Fortran kernels (thus generating all the necessary host-and device code boilerplate and data copy operations) with an option to use OpenACC kernels with CUDA compatible data structures (device pointers) . The attribute template specifies a macro suffix used for the generated block size parameters -this allows a central configuration for the block sizes used in an application, rather than leaking this architecture-specific optimization to the user code in each kernel. If omitted, configurable default block sizes are used. The main advantage of this parallelization DSL is the following: replacing parallelizeable loops with the @parallelRegion construct allows the user to specify multiple granularities in the same code. Consider ASUCA's code structure, shown in simplified form in Figure . It shows two selected kernels and their embedding in the call graph -the lateral and upper boundary damping already discussed Privatization is the main difference: Hybrid Fortran generated OpenMP code uses "firstprivate" as the default policy with an explicit "shared" clause for all arrays used in the kernel. OpenACC is mainly used for reduction support -Hybrid Fortran does not automatically generate reduction kernels, however it supports the "reduce" clause, which is forwarded to the generated OpenMP or OpenACC kernels.
in this section, as well as the physics kernel. Many physical processes are called within this single kernel (of which three sample processes are depicted here). This code therefore has a very coarse granularity, which is problematic on GPU as discussed in Section . . With Hybrid Fortran we can solve this problem as follows: An additional appliesTo attribute in the @parallelRegion statement allows the user to selectively apply parallel regions to either CPU or GPU. Applying the parallelization at different granularities therefore becomes possible , by enabling user-steered kernel fission. Figure shows the resulting code structure, with the physics run being split into many kernels for GPU while remaining a single coarse grained kernel for CPU. The later Listing . gives an example of how such a kernel fission works in practice.
Compile-time Defined Memory Layout and Device Data Region
As discussed in Section . , it is necessary to consider two major aspects for implementing the memory layout: storage order on one hand and the compiletime defined granularity requiring a varying dimensionalities of data objects on the other. Part of Hybrid Fortran is an additional language extension, the @domainDependant construct , as a declarative way for the user to specify thus obviating the need for code duplication and/or deep inlining of call trees Please note: While this paper follows American English, the Hybrid Fortran language extension has originally been developed following British English, which becomes additional required information concerning data objects. This concerns memory layout as well as device memory operations, which will be discussed in this section.
Storage Order
Revisiting the code sample from Section . , the following Listing shows the specification of the routine implementing the discussed lateral and upper damping kernel:
Listing . . Routine implementing the lateral and upper damping kernel with Hybrid Fortran. This shows the specification of the local module data object dens_ptb_bnd (density perturbation in the boundary layer) as well as the external module data objects dens_ref_f (reference density) and dens_ptb_damp (density perturbation in ASUCA grid).
The autoDom attribute is used to delegate the dimensions setup to the data object specification parser (which gathers this information in a separate pass from the source modules, here ref and svar), rather than having the user specify the dimensions explicitly again in the @domainDependant construct. The attributes accPP and domPP are employed to specify the macro names used to implement the dimension ordering for accesses and specification parts, respectively. These macros wrap all dimension lists in access expressions and specifications of respective data objects in the generated code. When accPP and domPP attributes are omitted, default macro names are used (for a code example after this conversion please refer to Listing . ). In case of Listing . we use explicit macro names for the dynamical core since the default macros apparent in the spelling of "domainDependant" (https://en.oxforddictionaries.com/ definition/dependant). We consider support for the American English spelling of this directive as part of our future efforts. are already used with different assumptions for the physical processes (see the paragraph on "Dimensionality Changes" below).
Device Data Region Similar to OpenACC, in Hybrid Fortran we implement data regions by adding state attributes to data objects. The present attribute, shown in Listing . , indicates that the respective objects are located on the device in case of GPU compilation. Analogous transferHere attributes are used in the main simulation routine in order to instruct Hybrid Fortran to implement the memory copy operations to-and from the device, once at the beginning and end of the simulation. For dummy variables with specified intent, Hybrid Fortran will use the Fortran intent information to determine the correct copy operation , which minimizes the potential for bugs in comparison to OpenACC's explicit copyIn, copyOut and copy clauses. Halo region updates, required for every timestep, are implemented explicitly in code sections guarded from CPU compilation.
Dimensionality Changes Due to the compile-time defined parallelization granularity, discussed in Section . , it is necessary to modify the dimensionality of data objects in certain cases in the source generation. This requires hints from the framework user. Consider the following surface flux code snippet:
Listing . . Surface flux code snippet. 
. s e a t i l e s code and v a r i a b l e summing o m i t t e d
Since this process is defined inside the call graph of the physics kernel, as shown in Figure , the relevant D-and D grid point values are already sliced and passed in as scalars or D-arrays, that is, data parallelism is not exposed at this level. Hybrid Fortran allows implementing this as a fine-grained kernel (as outlined in Figure ) u_f ( l t ) = s q r t ( s q r t ( t a u x _ t i l e _ e x ( l t ) ** + t a u y _ t i l e _ e x ( l t ) ** ) ) e l s e t a u x _ t i l e _ e x ( l t ) = . _ r _ s i z e t a u y _ t i l e _ e x ( l t ) = . _ r _ s i z e ! . .
. f u r t h e r t i l e v a r i a b l e s o m i t t e d end i f ! . . . s e a t i l e s code and v a r i a b l e summing o m i t t e d @end p a r a l l e l R e g i o n
Using our parallelization DSL to provide additional dimensionality information, Hybrid Fortran is able to rewrite this code into a D kernel. Dimensions missing from the user code are inserted at the beginning of the dimension lists in access expressions and data object specifications. As an example, the expression u_f(lt) is converted to u_f(AT(i,j,lt)), employing the default ordering macro already mentioned in the paragraph "Storage Order". Dimensions are extended whenever there is a match found for domName or domSize information between data objects and parallel regions within the same routine or in routines called within the call graph of the same routine. It is therefore necessary for Hybrid Fortran to gather global information about the application before implementing each routine.
Transformed Code
Revisiting Listing . , the following code is generated when applying Hybrid Fortran with the OpenACC backend:
Listing . . Surface flux code snippet after conversion with OpenACC backend.
! $ a c c k e r n e l s d e v i c e p t r ( t a u x _ t i l e _ e x ) d e v i c e p t r ( t a u y _ t i l e _ e x ) & ! $ a c c& d e v i c e p t r ( t l c v r ) d e v i c e p t r ( u_f ) ! $ a c c l o o p i n d e p e n d e n t v e c t o r (CUDA_BLOCKSIZE_Y) o u t e r P a r a l l e l L o o p : do j = , ny ! $ a c c l o o p i n d e p e n d e n t v e c t o r (CUDA_BLOCKSIZE_X)
do i = , nx ! *** l o o p body *** : l t = t i l e _ l a n d i f ( t l c v r ( AT( i , j , l t ) ) > . _ r _ s i z e ) then c a l l s f _ s l a b _ f l x _ l a n d _ r u n (& ! . . .
i n p u t s and f u r t h e r t i l e v a r i a b l e s o m i t t e d
& t a u x _ t i l e _ e x ( AT( i , j , l t ) ) , t a u y _ t i l e _ e x ( AT( i , j , l t ) ) & & ) u_f ( AT( i , j , l t ) )= s q r t ( s q r t ( t a u x _ t i l e _ e x ( AT( i , j , l t ) ) ** + & & t a u y _ t i l e _ e x ( AT( i , j , l t ) ) ** ) ) e l s e t a u x _ t i l e _ e x ( AT( i , j , l t ) )= . _ r _ s i z e t a u y _ t i l e _ e x ( AT( i , j , l t ) )= . _ r _ s i z e ! . .
. f u r t h e r t i l e v a r i a b l e s o m i t t e d end i f ! . . . s e a t i l e s code and v a r i a b l e summing o m i t t e d end do end do o u t e r P a r a l l e l L o o p ! $ a c c end k e r n e l s
Device data is interoperable with the CUDA Fortran backend, thus device pointers are used instead of passing the management to OpenACC. OpenACC directives together with this data type can thus be directly used in the user code as well, i.e. it remains interoperable with device code generated by Hybrid Fortran.
As noted in Section . , storage ordering macros (here AT()) are applied to all array access statements. For the thread block setup, the configurable default sizes CUDA_BLOCKSIZE_X/Y are used since no template is specified for the parallel region at hand. Parallel region loops (here for indices i and |verb|j| are set up explicitly to parallelize. Other loops, such as the loop over k, use a !$acc loop seq directive to explicitly avoid parallization and give the framework user full expressiveness over the desired granularity.
Applying CUDA Fortran backend to the same user code produces the following host code (here shown together with the routine header and footer):
Listing . . Surface flux host code snippet after conversion with CUDA Fortran backend. The prefix hfd_ is added to host routines that use device data. Hybrid Fortran also duplicates the code for a pure host version of these routines (without a name change in order to remain interoperable with code that is not passed through Hybrid Fortran). In contexts where the data is not residing on the device, such as the setup part of an application, Hybrid Fortran automatically chooses the host version when generating the call statements at compile-time. Code residing within parallel regions is moved within a separeted kernel routine (using prefix hfki_ with i representing the kernel number). In case of the surface flux sample shown here, the kernel routine is generated as follows:
Listing . . Surface flux device code snippet after conversion with CUDA Fortran backend. The specification part of these kernels is automatically generated, applying device state information and converting input scalars to pass-by-value , among other transformations.
It is notable that CUDA Fortran requires a fairly large amount of boiler plate code for grid setup, iterator setup, host-and device code separation as well as memory-and error handling -Hybrid Fortran allows the user to pass on the responsibility for that to the framework. Compared with the code generated by OpenACC however (assembly-like CUDA C or NVVM intermediate representation), the Hybrid Fortran generated CUDA Fortran code remains easily readable to programmers experienced with CUDA. Experience shows that this reduction kernels are thus not supported with this backend -we use the OpenACC backend selectively for this purpose, see also the discussion in the footnotes to Section . .
is a productivity boost, especially in the debugging and manual performance optimization phase of a project.
Regarding the OpenMP backend, since the surface flux example is parallelized at a much more coarse-grained level for CPU, the generated CPU code for the sample at hand is a one-to-one copy of the user code shown earlier in Listing . . The parallelization is generated at a higher level in the call graph (replacing the parallel region construct with appliesTo(CPU) clause) as follows:
Listing . . Surface flux device code snippet after conversion with CUDA Fortran backend. In this section we discuss code transformation method involved in implementing Hybrid Fortran's characteristics described earlier. This process is applied transparently for the user, i.e. it is applied automatically by the means of a provided common Makefile . Figure gives an overview of the process and the components involved. We discuss this process in order of execution -each of the following enumerated items corresponds to one transformation phase:
. To simplify the parsing in subsequent phases, Fortran continuation lines are merged. . Facilitating later phases, the application's call graph and parallel region directives are parsed globally ("parse" phase in Figure ) . . Using the appliesTo information in kernels and the call graph, the position of each routine in relation to kernels is computed. Possible positions are "has kernel(s) in its caller graph", "contains a kernel itself" and "is called inside a kernel" ("analyze" phase in Figure ) . . In two passes, module data object specifications are parsed and then linked against all routines with imports of such objects, together with the locally defined objects. . A global application model is generated, with model classes representing the modules, routines and code regions. This model can be regarded as a target hardware independent intermediate representation. . Each routine object is assigned an implementation class depending on the target architecture . For each coding pattern, a separate class method of the implementation class is called by the model objects -e.g. CUDA parallelization boilerplate is generated. Using the previously gathered global kernel positioning and data object dimension information, data objects are transformed according to the behavior discussed in Section . . Implementation class methods return strings that are concatenated by the model objects into source files ("transform" phase in Figure ) . . Code lines are split using Fortran line continuations in order to adhere to line limits imposed by Fortran compilers. . Macros generated by Hybrid Fortran (to implement storage reordering and configurable block sizes) are processed by using the GNU compiler toolchain. Subsequently, a user specified compiler and linker is employed in order to create the CPU and GPU executables. A common makefile is provided with the framework, however the build dependency graph is user-provided in the format of makefile rules ("make" phase in Figure ) .
This process makes it possible to have a unified source input and create executables targeted for either multi-core CPU or many-core GPU.
See also the "Getting Started" section in https://github.com/muellermichel/Hybrid-Fortran/blob/v . rc /doc/Documentation.pdf. Hybrid Fortran allows the user to switch between varying backend implementations per routine, such as OpenACC and CUDA Fortran -the user specified information as well as the defaults given by the build system call thus steers this implementation class. alternatively, a dependency generator script can be configured as well
Productivity-and Performance Results
In this section we discuss the productivity-and performance results that have been achieved when applying Hybrid Fortran to weather prediction models.
At the time of this writing an additional paper has been submitted for review, in which the following results are discussed: Hybrid Fortran has been applied to both dynamical core and physical processes of ASUCA. This implementation (here referred to as "Hybrid ASUCA") consists of kernels and covers almost all modules required for an operative weather prediction, with convection being the main exception due to development time constraints. Communication between host and device has been eliminated with the exception of setup, output and halo exchange. On a x x grid with real weather data, a . x speedup has been achieved when comparing the Hybrid Fortran port on four Tesla K x versus the JMA provided reference implementation on four -core Westmere Xeon X (TSUBAME . ). The same setup executes at a speedup of x when comparing a single Tesla P versus a single -core Broadwell Xeon Ev (Reedbush-H). On a full-scale production run with x x grid size and km resolution, Tesla P GPUs are shown to replace more than -core Broadwell Xeon sockets [ ].
In order to further examine the performance of Hybrid Fortran kernels and compare them to OpenACC and OpenMP user codes, a performance model has been constructed for a reduced weather application. The Hybrid Fortran implementation with unified code performs on par or better than OpenACC-and OpenMP-implementations separately optimized for GPU and CPU, respectively. In this application, whose runtime is dominated by the memory bandwidth used for a seven point stencil diffusion kernel, the Hybrid Fortran GPU version performs % better than the no-cache-model on Tesla K x. It achieves % of the theoretical limit given by a perfect-cache-model. The CPU version performs % better than the no-cache-model and achieves % of the theoretical limit on Westmere Xeon X [ ].
To examine the productivity of our solution we have analyzed the code and compare it against the reference implementation . The high-level results of this analysis is shown in Figure . In order to gain GPU support in addition to the already existing multi-core and multi-node parallelization, the code has grown by less than % in total, from k lines of code to k. Sanitizing the two code versions (removing white space, comments and merging continued lines), the code has grown by %, from k to k lines of code. % of the sanitized reference code is used as-is in the new implementation, while % or approximately k lines of code is replaced with approximately k new code lines.
Since the input to this analysis is the closed source ASUCA codebase, full reproducibility cannot be provided in this context. However the intermediate data, the method employed to gather this data as well as a sample input is provided and documented in https://github.com/muellermichel/hybrid-asuca-productivity-evidence/ blob/master/asuca_productivity.xlsx. Code changes and additions have the largest impact in terms of productivity. We have analyzed the additional k lines of code in more detail. Figure shows a breakdown of these changes and compares them to an estimate of what would be required with an OpenACC-based implementation. The following methodology has been used for this analysis:
. for the parallelization-and data layout DSL line count we have used information parsed for Hybrid ASUCA, as well as the OpenACC backend available in Hybrid Fortran, to acquire an accurate count for the directives required for an OpenACC-based implementation, . since OpenACC does not offer a granularity abstraction, we have used the Hybrid ASUCA's parsed global application information to arrive at a set of routines that require a kernel positioning change (see the discussion in Section ) -the resulting code lines require duplication for the CPU in an equivalent OpenACC implementation (shown as "CPU-only physics" in figure ), . we have used the OpenACC backend in Hybrid Fortran to count the lines of code where storage order macros are introduced in order to achieve a compile-time defined data layout.
"Parallelization & data layout DSL" refers to the number of code lines for @parallelRegion and @domainDependant directives in case of Hybrid Fortran, and OpenACC !$acc directivies in case of OpenACC. Hybrid Fortran replaces the requirement for code changes to implement a varying data layout ("storage order macros", lines of code, "LOC") as well as a code duplication for multiple parallelizations with varying granularities ("CPU-only physics", LOC) with a higher number of DSL code lines compared to OpenACC ( vs. LOC). "Modified data specifications / initializations" ( LOC) as well as "routine & call signature" ( LOC) refers to changes applied to the setup of data and call parameter lists, respectively. These changes are necessary due to device code limitations and optimizations and are largely required for both the Hybrid Fortran version as well as a potential OpenACC solution, so we use the result from Hybrid ASUCA as an estimate for what would be required with OpenACC. Finally, one physics module concerning long-wave radiation ( LOC), has been replaced with a version that uses less local memory per thread (a factor of improvement in that regard) to make it more GPU-friendly. We again estimate that an OpenACC version would have approximately the same code size.
This result shows that an equivalent OpenACC implementation of ASUCA can be estimated to require approximately k LOC in additional changes compared to the Hybrid Fortran-based implementation. When comparing to the sanitized reference codebase, an OpenACC user code would require approximately % of code lines to be changed or added, while Hybrid Fortran currently requires %.
In addition to the results regarding ASUCA presented here, a library of small applications and kernel samples has been created to demonstrate Hybrid Fortran general applicability to data parallel code .
Conclusion and Future Work
With this work we have shown that it is possible for large structured grid Fortran applications to . achieve a GPU implementation without rewriting major parts of the computational code, . abstract the memory layout and . allow for multiple parallelization granularities.
With our proposed method, a regional scale weather prediction model of significant importance to Japan's national weather service has been ported to GPU, showing a speedup of up to . x on single GPU compared to a single Xeon socket. When scaling up to Tesla P , less than half the number of GPUs is required compared to contemporary Xeon CPU sockets to achieve the same result. Approximately % of the existing codebase has been reused for this implementation and our implementation has grown by less than % in total, even though it is now supporting GPU as well as CPU. Through a library of results, the general applicability of our framework to data parallel Fortran code has been shown.
Considering that much of the changes still required in the user code are "mechanical" in nature, we expect additional productivity gains to be possible from further automation. We strive to achieve a solution where a Hybrid Fortran-based transformation can be applied to large structured grid applications wholesale with minimal input required by the user.
References
. Cumming, B., Osuna, C., Gysi, T., Bianco, M., Lapillonne, X., Fuhrer, O., Schulthess, T.C.: A review of the challenges and results of refactoring the community climate code COSMO for hybrid Cray HPC systems. Proceedings of Cray User Group ( ) . Douglas, C.C., Hu, J., Kowarschik, M., Rüde, U., Weiß, C.: Cache optimization for structured and unstructured grid multigrid. Electronic Transactions on Numerical Analysis , -( ) . Dursun, H., Nomura, K.i., Wang, W., Kunaseth, M., Peng, L., Seymour, R., Kalia, R.K., Nakano, A., Vashishta, P.: In-core optimization of high-order stencil computations. In: PDPTA. pp. Icosahedral models language extensions ( ) . Kwiatkowski, J.: Evaluation of parallel programs by measurement of its granularity.
In: International Conference on Parallel Processing and Applied Mathematics. pp.
-. Springer ( ) . Lapillonne, X., Fuhrer, O.: Using compiler directives to port large scientific applications to GPUs: An example from atmospheric science. Parallel Processing Letters ( ) ( ) . Mielikainen, J., Huang, B., Huang, A.: Using Intel Xeon Phi to accelerate the WRF TEMF planetary boundary layer scheme. In: SPIE Sensing Technology+ Applications. pp.
T-T. International Society for Optics and Photonics ( ) . Müller, M., Aoki, T.: New high performance GPGPU code transformation framework applied to large production weather prediction code ( ), unpublished paper submitted for review . Norman, M.R., Mametjanov, A., Taylor, M.: Exascale programming approaches for the accelerated model for climate and energy ( ) . Quinlan, D.: ROSE: Compiler support for object-oriented frameworks. Parallel Processing Letters ( n ), -( ) . Sakamoto, M., Ishida, J., Kawano, K., Matsubayashi, K., Aranami, K., Hara, T., Kusabiraki, H., Muroi, C., Kitamura, Y.: Development of yin-yang grid global model using a new dynamical core ASUCA ( ) . Sawyer, W., Zaengl, G., Linardakis, L.: Towards a multi-node openacc implementation of the icon model. In: EGU General Assembly Conference Abstracts. vol. ( ) . Shimokawabe, T., Aoki, T., Muroi, C., Ishida, J., Kawano, K., Endo, T., Nukada, A., Maruyama, N., Matsuoka, S.: An -fold speedup, . TFlops full GPU acceleration of non-hydrostatic weather model ASUCA production code. 
