New High Performance GPGPU Code Transformation Framework Applied to
  Large Production Weather Prediction Code by Müller, Michel & Aoki, Takayuki
New High Performance GPGPU Code Transformation Framework Applied to Large
Production Weather Prediction Code
MICHEL MU¨LLER, Tokyo Institute of Technology
TAKAYUKI AOKI, Tokyo Institute of Technology
We introduce “Hybrid Fortran”, a new approach that allows a high performance GPGPU port for structured grid Fortran codes. is
technique only requires minimal changes for a CPU targeted codebase, which is a signicant advancement in terms of productivity. It
has been successfully applied to both dynamical core and physical processes of ASUCA, a Japanese mesoscale weather prediction
model with more than 150k lines of code. By means of a minimal weather application that resembles ASUCA’s code structure, Hybrid
Fortran is compared to both a performance model as well as today’s commonly used method, OpenACC. As a result, the Hybrid
Fortran implementation is shown to deliver the same or beer performance than OpenACC and its performance agrees with the model
both on CPU and GPU. In a full scale production run, using an ASUCA grid with 1581 x 1301 x 58 cells and real world weather data in
2km resolution, 24 NVIDIA Tesla P100 running the Hybrid Fortran based GPU port are shown to replace more than 50 18-core Intel
Xeon Broadwell E5-2695 v4 running the reference implementation - an achievement comparable to more invasive GPGPU rewrites of
other weather models.
CCS Concepts: •Computing methodologies → Parallel programming languages; •General and reference → Performance;
•Applied computing→ Environmental sciences; •Soware and its engineering→ Source code generation; Preprocessors;
Additional Key Words and Phrases: CUDA, OpenACC, GPGPU, Performance models, Fortran, Weather Prediction
1 INTRODUCTION
With the growth in single threaded performance on CPUs slowing down [30], HPC applications today are increasingly
required to make eective use of on-chip parallelism. In 2005 Microso’s H. Suer called this paradigm shi “the free
lunch is over”: rather than relying on a steady and exponential increase of computational performance to automatically
benet already existing soware, it has become necessary for applications to specically target, among other aspects,
on-chip parallelism [38].
As a consequence of this paradigm shi, co-processors commonly referred to as “accelerators” have become increas-
ingly common in supercomputers since these architectures are generally designed to oer as much computational
throughput as possible for a given number of transistors. Top500 currently lists ve of the top ten systems to make
is research has been partly supported by the Japan Science and Technology Agency (JST) Core Research of Evolutional Science and Technology (CREST)
research program “Highly Productive, High Performance Application Frameworks for Post Peta-scale Computing”, partly by “KAKENHI”, Grant-in-Aid for
Scientic Research (S) 26220002 from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan, partly by “Joint Usage/Research
Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN)” and “High Performance Computing Infrastructure (HPCI)” in Japan and
partly by the Advanced Computation and I/O Methods for Earth-System Simulations (AIMES) project running under the multinational priority program
“Soware for Exascale Computing” (SPPEXA) .
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for prot or commercial advantage and that copies bear this notice and the full citation on the rst page. Copyrights for components
of this work owned by others than the author(s) must be honored. Abstracting with credit is permied. To copy otherwise, or republish, to post on
servers or to redistribute to lists, requires prior specic permission and/or a fee. Request permissions from permissions@acm.org.
© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM. Manuscript submied to ACM
Manuscript submied to ACM 1
ar
X
iv
:1
80
2.
05
83
9v
1 
 [c
s.D
C]
  1
6 F
eb
 20
18
2 M. Mu¨ller and T. Aoki
use either NVIDIA Tesla or Intel Xeon Phi based accelerators [41]. Ooading computationally intensive tasks to
co-processors has thus become an increasingly common task in the eld of HPC.
is need for architecture specic programming has, however, created a divide between domain scientists who mainly
care about application logic, and the supercomputers their applications run on. Due to this, implementations optimized
for hardware have oen been le to engineers. Both on the application - as well as on the hardware side - there is
oen a high rate of change, which creates maintainability issues. As an example, Oak Ridge National Laboratory’s
Norman et al. write about a recent accelerator implementation of the U.S. DOE’s “Accelerated Model for Climate and
Energy”: ”e original CUDA Fortran1 refactoring gave good performance, but it turned out that the codebase was still
somewhat in ux. is meant that changes needed to be propagated into the CUDA code, which was quite dicult to
support” [28]. See also Section 1.1 for a further discussion of this related work.
e work this paper is based on has initially been motivated by an evaluation of operative GPU support for the
Japanese production weather prediction model “ASUCA”, as discussed in Section 2. ASUCA is the main mesoscale
weather prediction model used in operation and developed at the Japan Meteorological Agency [17]. ASUCA uses a
structured grid and is implemented in Fortran. In order to gain accelerator support, the aforementioned maintainability
issues are a main point of contention. ASUCA is constantly being pushed further by the application owners, and it is
thus of fundamental importance to maintain ease-of-use and high productivity for the domain scientists when working
with the code, both on CPU and GPU. A single codebase for both architectures is therefore paramount.
In this work we therefore investigate methods to re-target structured grid applications in Fortran with low program-
ming eort for an ecient GPU implementation while at the same time keeping CPU compatibility. is implies the
following requirements for potential solutions:
(1) e same user code should be applicable to both CPU and GPU architectures. In current research the focus is
on NVIDIA GPUs and x86 CPU architectures, but the switch to others should be possible and streamlined.
(2) is user code (from here on called “Hybrid ASUCA”) should stay as close as possible to the original codebase
to ease the transition. Since the original code is wrien in Fortran, which does have some industry support for
accelerators, the language should thus be kept if possible.
Requirement 2 rules out applying domain-specic languages (such as stencil DSLs) to the computational code, to
avoid a complete rewrite. In an evaluation of existing directive based methods (see Section 2.4) we conclude that
industry standards alone are insucient to fulll the above requirements, which has motivated the invention of a new
method for porting CPU-targeted structured grid applications in Fortran to GPU.
In Section 3 we thus propose “Hybrid Fortran”, a source-to-source transformation and language extension that
supports dierent targets. It uses Modern Fortran together with higher level parallelization and data specication
abstractions as input, and outputs either OpenMP Fortran, CUDA Fortran or OpenACC Fortran sources. Crucially, this
tool introduces a new assisted spliing method for large kernels, a necessity in order to enable GPU compatibility of
physical processes within ASUCA without requiring a rewrite. We use a reduced weather application (introduced in
Section 3.1) as a basis to compare the features of Hybrid Fortran with the industry standard OpenACC. A performance
model (discussed in Section 4) is introduced in order to evaluate the performance of Hybrid Fortran and OpenACC for
this use case (see Section 5.1).
1see Appendix C for a denition of CUDA and CUDA Fortran
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 3
By applying Hybrid Fortran to both dynamical core and physical processes of ASUCA a GPGPU implementation of
this weather model has been completed, resulting in only 5% increase in code size, even though the code now supports
both CPU and GPU (see Section 5.2). To our knowledge this is the rst time a complete production weather prediction
model has been ported to GPGPU using a unied programming paradigm. is new implementation performs with a
speedup of up to 4.9x comparing one GPU to one multi-core CPU socket (as discussed in sections 5.3 and 5.4). e
magnitude of these results are in line with previous GPU ports of ASUCA and other weather models using more invasive
rewrites (as discussed in the related work section below).
Finally we draw conclusions and point out future work in sections 6 and 7, respectively.
1.1 Related Work
e following GPGPU ports of atmospheric models are relevant for the discussion of our work.
1.1.1 ASUCA GPU Port by Shimokawabe et al. In 2010, Shimokawabe et al. completed a GPU-only CUDA C port of
the dynamical core and part of the physical core of ASUCA. 14 TFLOP/s were achieved in single precision using 528
Tesla S1070 GPUs [36].
In 2011, 24 GFLOP/s was achieved in double precision for ASUCA dynamical core on a single NVIDIA Fermi based
Tesla M2050 GPU on TSUBAME 2.0, a 6-fold speedup when compared to the system’s Xeon X5670 CPUs running the
original Fortran code. is work also included extensive work regarding the optimization of inter-node communication
and an overlapping method in order to run communication and computation in parallel. By applying weak scaling (i.e.
increasing the grid size together with the number of GPUs), up to 145 TFLOP/s were achieved in single precision on
3990 Tesla M2050 GPUs [35].
In 2014, the previous work was generalized by employing a stencil library and DSL based on C++ templates that
generates either CUDA kernels or CPU code. Employing previously researched optimization methods to this more
generalized method, up to 209.6 TFLOP/s were achieved for ASUCA in single precision on 4108 Tesla K20x GPUs [37].
1.1.2 COSMO. In 2014, MeteoSwiss’ Fuhrer et al. proposed a mixed approach for COSMO, another structured
grid regional weather and climate model used in operation in several European nations. In this mixed approach, a
C++ stencil library and DSL called “STELLA” is used for the dynamical core, while a “less disruptive” OpenACC based
method is applied to the physical processes in order to retain most of the original Fortran codebase [7]. In COSMO’s
case, the physical processes already expose ne grained parallelism and kernel fusion was employed in some instances
in order to optimize for less kernel launches by Lapillonne et al. [21]. anks to the code refactoring a speedup of 1.5x
was achieved on CPU. By employing GPU code generation an additional speedup of 2.9x was achieved (comparing
Tesla K20x to 8-core Xeon E5-2670 on Piz Daint) [7]. STELLA has subsequently been generalized for use cases beyond
COSMO and rectangular grids under the new name “GridTools” [6].
1.1.3 Weather Research and Forecasting Model (WRF). e U.S. National Center for Atmospheric Research’s “Weather
Research and Forecasting model” (WRF) is the most widely used weather model worldwide [24]. It too uses a structured
grid. WRF consists of multiple hundreds of thousands of lines of Fortran code. GPU ports known to us have been only
partial, leading to heavy pressure on the memory bus between CPU and GPU and thus a lower speedup compared to
full GPU ports of other weather models [44]. e following partial GPU ports have been achieved among others:
Manuscript submied to ACM
4 M. Mu¨ller and T. Aoki
• in 2009, Michalakes and Vachharajani ported the “WSM5” cloud microphysics to CUDA C, achieving an overall
speedup of 1.25x for a benchmark workload [24],
• in 2010, Ruetsch et al. ported the longwave radiation physics to CUDA Fortran [33],
• in 2012, Mielikainen et al. ported the Goddard shortwave radiation scheme to CUDA C [26],
• in 2013, Vanderbauwhede and Takemi ported the advection scheme to OpenCL and integrated it together with
the Fortran based WRF. Doing so, an overall speedup of 2x was achieved [44].
1.1.4 Non-hydrostatic Icosahedral Model (NIM). In 2010 the dynamical core of the Non-hydrostatic Icosahedral
Model (NIM) developed at the U.S. National Oceanic and Atmospheric Administration (NOAA) was ported to GPU
by Gove et al. by using a Fortran-to-CUDA-C transpiler called ”F2C-ACC” [9]. In 2014 this work was compared to
the industry standard “OpenACC” on Titan using Tesla K20x GPUs, showing 77% higher performance with F2C-ACC
compared to Cray’s OpenACC Fortran implementation and 220% higher performance compared to PGI’s OpenACC.
Overall however the GPU implementation was not able to achieve a speedup compared to CPU due to the missing
physical processes and the thus high GPU-CPU data transfer overhead [10]. Since then, performance of both hardware
and soware have improved signicantly however, resulting in a speedup of 2.5x when comparing NVIDIA GPUs to
CPUs and 2.0x when comparing Xeon Phi to the same CPU in 2017. Gove et al. have also reported that OpenACC’s
performance issues and bugs have been xed and F2C-ACC has thus been fully replaced with the industry standard
[11].
1.1.5 Accelerated Model for Climate and Energy (ACME). Norman et al. have investigated methods to create a
hybrid GPU/CPU codebase for the U.S. Department of Energy’s “Accelerated Model for Climate and Energy” (ACME).
A previous CUDA Fortran based port was dropped due to issues with having to maintain two separate codebases for
CPU and GPU. For this reason, OpenACC was evaluated as a replacement, but limitations with resource pressure and
inlining when porting large kernels to GPU were found. It was concluded that such large kernels need to be split up to
be feasible on GPU with OpenACC, which aligns with our ndings about ASUCA’s physical processes (see Section
2.4.1). Since CPUs perform beer with fused loops for caching reasons, for a pure OpenACC based solution it was
found that two separate code versions are required for CPU and GPU, resulting in similar maintainability issues as with
a pure CUDA Fortran port [28].
2 BACKGROUND ANDMOTIVATION
is section discusses the background and motivation of this work. As such it introduces ASUCA (the main target
application for the work presented in this paper), the goals of an accelerator port, a discussion of existing methods for
porting such applications to accelerators, as well as a problem statement.
2.1 The ASUCAWeather Prediction Model
ASUCA is a mesoscale weather prediction model developed by the Japan Meteorological Agency (JMA) [17]. It is
used in production since 2014 as one of the main operational weather forecast models in Japan and covers the area
depicted in Figure 1 in two kilometer resolution, generating nine-hour-forecasts every hour [34]. ASUCA uses an
Arakawa C-grid (one type of rectangular grid) [37]. Time discretization is implemented by employing a third order
Runge-Kua scheme, enabling long time steps [45]. Sound and gravity waves are treated separately using a second
order Runge-Kua scheme to enable a higher time resolution, employing the HEVI scheme (Horizontally explicit -
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 5
Fig. 1. ASUCA’s model simulation boundaries
vertically implicit). Vertical advection of water substances are calculated using a separate time step for each column,
based on the Courant-Friedrichs-Lewy convergence condition [17]. ASUCA employs a generalized coordinate system
that is used in conjunction with Lambert conformal conic- as well as latitude/longitude projections [37]. It is discretized
using a structured grid on a nite volume method in three spatial dimensions: I and J as interchangeable horizontal
dimensions and K as the separately treated (largely sequentially implemented) vertical dimension.
ASUCA has been implemented in Fortran, employing multidimensional arrays as its main data structure due to
their simplicity and performant implementation on hardware. In ASUCA these data objects tend to be directly stored
in modules (i.e. only making rare use of derived types) and imported where necessary through use statements. e
program structure can roughly be divided into two categories: Physical processes (later depicted under pp interface
as well as phys. adjust in Figure 8, Section 5.2) and dynamical core (all other modules depicted in Figure 8, Section
5.2). is distinction is common in weather models such as WRF [25] and COSMO [2].
ASUCA is parallelized for multi-core and multi-node in the horizontal domain (i.e. I and J dimensions). For the
multi-node parallelization, given a full grid of nx · ny · nz size, with nx and ny being the horizontal boundaries farthest
from the origin and nz being the upper vertical boundary, ASUCA’s grid is decomposed into blocks and scaered across
the available nodes, with each node computing a block of nx ′ · ny′ · nz size with nx ′ <= nx and ny′ <= ny, thus
requiring halo communication for each of the employed Runge-Kua schemes. For the on-chip parallelization, OpenMP
parallel loop directives are used. e vertical domain is executed sequentially due to loop carried dependencies in the
physical processes.
Manuscript submied to ACM
6 M. Mu¨ller and T. Aoki
ASUCA’s runtime is dominated by the dynamical core, which as a stencil code is heavily bounded by memory
bandwidth [36] [4]. It therefore stands to reason to investigate hardware architectures with a high memory bandwidth
as potential implementation targets.
2.2 GPGPU Computing
is section discusses the motivation for GPGPU support in a weather prediction codebase. See also Appendix C for a
glossary of GPGPU specic terms.
Table 1. Comparison of the latest generation Intel CPUs and NVIDIA GPUs used in HPC
Characteristic CPU GPU
Number of parallelisms in 2: multi-core, Vector 1: CUDA core
programming model
Floating-point units (FPUs) ∼ 100 ∼ 3000
(core count multiplied with vector length)
Single-threaded performance high low
Context switching overhead high low
Worst case performance loss ∼ 80x ∼ 64x
with branching (pipeline length times vector size) (number of cores per SMX)
Double- (DP) to single-precision (SP) ∼ 0.5x ∼ 0.5x
FLOP/s ratio
Cache per FPU 8 KB (L1), 32 KB (L2), ∼ 300 KB (L3) ∼ 0.4 KB (L1), ∼ 1 KB (L2)
DP Registers per core ∼ 80 ∼ 1000
Memory bandwidth for sequential access ∼ 80 GB/s ∼ 500 GB/s
Memory bandwidth for random access ∼ 0.1 GB/s ∼ 0.9 GB/s
Table 1 sums up the architectural dierences that are relevant to the use case at hand. Specic performance numbers
from 18-core Broadwell CPUs and Tesla P100 are used here (see also Appendix B).
GPUs are designed to support the highest number of parallel threads possible for a given number of transistors, while
still assigning each thread enough resources to support common use cases in scientic and graphics computing. is
results in GPU programs operating on a much higher number of threads in parallel compared to CPU (supported by the
high core count, high number of registers per core and high memory bandwidth), however the sequential computational
performance is lower due to the threads operating on a shorter pipeline with no additional vectorization and with less
cache per thread.
In terms of memory bandwidth, GPUs typically have a 5-7x advantage over CPUs, which makes GPUs interesting
for bandwidth bounded applications such as stencil applications. It should be noted however that the CPU’s larger
caches per oating-point unit can lead to the CPU being faster for the bandwidth bounded case if and only if cached
accesses dominate over main memory accesses in the runtime prole. Experience shows that this is usually not the case
however, instead for most use cases the CPU’s more prominent caches lead to a somewhat lesser speedup for the GPU
implementation compared to the bandwidth dierence between GPU and CPU.
A unied programming model for the on-chip parallelism (compared to vectorization and multi-core, which need to
be treated separately on CPU) as well as the low context switching cost for ne grained parallelizations (thanks to
keeping inactive thread context stored in the large register les where possible) are further advantages of GPUs.
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 7
e GPU’s main disadvantage is a high occupancy requirement (discussed in Appendix C), which results in a
minimum problem size per GPU as well as the need for being more economical with each thread’s private resources in
order to achieve a high performance. is has the following main eects on how GPGPU applications are constructed
and run:
• e resource constraints generally lead to a higher parallelization granularity being optimal, i.e. the amount of
code that can belong to a single GPU kernel is limited. Disregarding this leads to local resources (particularly
registers) overowing, resulting in excessive main memory accesses.
• ere is a limit to how few threads a GPU can be given without starving for work and subsequently giving
poor performance. is, in turn, enforces a lower limit on time-to-solution (i.e. an upper limit to the number of
GPUs a given problem scales to). is limitation is common in all high-throughput processor architectures, to
various degrees.
However, even for a xed given time to solution, assuming a large enough problem size per microchip, GPUs are
usually more energy ecient thanks to their high computational and memory bandwidth “density” in terms of rack
space (as shown by table 1).
To acquire an understanding of the feasibility of GPU ports for memory bandwidth bounded applications, we can
dene the necessary and sucient condition for a speedup on GPU as tH > tD+tcomm , where tH is the computation time
for the host implementation, tD is the computation time on the device and tcomm is the time spent for communication
between host and device. Assuming memory bandwidth of sequentially accessed values to be the boleneck of an
application2, we substitute tH = ni m ·bBWH , tD = ni
m ·b
BWD , and tcomm =
mHtoD ·b
BWHtoD , where m is the number of values to
read and write from/to memory per point update, b is the byte length of each value, ni is the number of iterations
over the data, mHtoD is the number of values that need to be transferred from/to the device aer ni iterations and
BWH , BWD and BWHtoD are the bandwidths available on host, device and between host and device, respectively. Aer
substitution the speedup condition becomes
ni
m
mHtoD
>
BWH
BWHtoD
(1 − BWHBWD )
. (1)
e le-hand side of this inequality is dictated by the application while the right-hand side is purely hardware
dependent. We can therefore determine that ni mmHtoD must be larger than 5.4, 8.3 and 5.2 on the GPU supercomputers
TSUBAME 2.5, Reedbush-H and Piz Daint, respectively (using the performance numbers listed in appendix B).
2.3 Existing Hybrid CPU/GPU Parallelization Methods
As discussed in Section 1, the main goals of this work is to nd a solution for accelerating the ASUCA weather prediction
model with GPUs with a unied code and without requiring a rewrite. An overview over existing soware tools that
allow unied GPU/CPU codes is provided in this section. Table 2 lists these options. For “STELLA” and ”F2C-ACC” see
also the discussion in Section 1.1.
e following criteria are important for the consideration of these candidates:
2is assumption holds for typical stencil algorithms as the threads within a warp operate in lockstep (all threads in the warp execute the same instruction
except for data index osets), which leads to coalesced memory accesses as long as the data is ordered in memory as threads are ordered in the CUDA grid.
Manuscript submied to ACM
8 M. Mu¨ller and T. Aoki
Table 2. Existing tools for unified CPU/GPU code
Language Usage Implementations Notable Applications
OpenACC[29] C,C++,Fortran Directives PGI [12], COSMO
Cray [19] Physical Processes[2][7][21]
NIM[10]
OpenMP 4.0+[40] C,C++,Fortran Directives GCC 4.9.1+ , Currently no major
Clang[1], known applications
ROSE[32]
F2C-ACC[8] Fortran Directives NOAA NIM[9][10]
STELLA[7] C++ Stencil Library / DSL SCS/CSCS COSMO
Dynamical Processes[7][2]
Physis[23] C, Fortran Stencil Library / DSL RIKEN Sample applications provided
in the github repository
Shimokawabe C++ Stencil Library / DSL Tokyo Tech ASUCA
et al. Stencil Dynamical Processes[37]
Framework[37]
Kokkos[5] C++ Unied parallelization Sandia Albany/FELIX[39]
/ Mem. Layout
Transformation
(1) Due to their invasive nature when applying such techniques to ASUCA, Non-Fortran based solutions as well as
domain-specic languages for stencils (stencil DSLs) have been ruled out from this evaluation. One should
note that Shimokawabe et al. have already completed an investigation concerning the application of a C++
template based DSL to ASUCA [37].
(2) Since this work focuses on production weather prediction, the maturity of the pursued tools is important.
Despite OpenMP allowing to target accelerators from 4.0 on, its support has not yet reached a mature level as
of 2016, according to NVIDIA engineers [18]. For this reason it has been ruled out for a further evaluation for
Hybrid ASUCA at this point in time.
(3) F2C-ACC and OpenACC share many common features, with OpenACC being the much more widely supported
choice. Furthermore, the maintainers of F2C-ACC at NOAA have recently dropped it in favor of OpenACC [11].
By this process of elimination we arrive at OpenACC as the only viable option in terms of already existing solutions
for the problem at hand. Its limitations, and nally the decision against using it, will be discussed in the following
Section 2.4.
2.4 Design Considerations Regarding Code Reusability and Portability
In order to design the method to allow for a high reusability of existing code structure, the considerations outlined in
this section have been crucial.
2.4.1 Architecture-Dependant Parallelization Granularity. While ASUCA’s dynamical core is already implemented
with a high parallelization granularity in the original code and is thus already “GPU friendly” (as discussed in Section
2.2), the physical processes are originally implemented with a very large kernels and thus low granularity. Since
computations can be executed on each K-column separately, without aecting neighboring columns until the next
run of the dynamical core, most of the physical processes are implemented using a single kernel with one thread per
column, in order to increase cache locality and decrease the amount of context switching and thread synchronization
[20] [3]. is amounts to approximately 25k lines of code in a single parallel kernel. For a GPU implementation this is
problematic due to the following reasons (as discussed in Section 2.2 and found independently by Norman et al. [28]):
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 9
(1) Deep call graphs under a single kernel are dicult to maintain since kernel code relies on compiler inlining,
which breaks easily and only supports a subset of language constructs.
(2) On GPUs the memory, cache and register resources per thread are much more limited, in favor of supporting a
much higher number of parallel threads.
For a GPGPU port it is therefore necessary to split this large kernel in order to achieve ner grained tasks for each
thread. Doing so manually would however require a complete rewrite of the physical processes code. Furthermore
it would make a performance portable solution impossible since the CPU benets from a high cache locality of the
column based code.
An automated or assisted approach for kernel ssion is therefore highly desirable in order to allow for GPU
acceleration while keeping cache locality and low overhead for the CPU case. OpenACC does not oer a coding facility
that allows to abstract the parallelization granularity - instead, two separate code versions are required, once with the
parallelization applied to the call graph leafs and once applied to a coarse grained level. In ASUCA’s case this would
mean a doubling of most of code used for the physical processes. We note that Norman et al., as discussed in Section
1.1.5, have arrived at the same conclusion [28].
2.4.2 Architecture-Dependent Privatization. Due to the large amount of calculations being performed locally per
thread in the reference physical processes as discussed in Section 2.4.1, there are also many data objects created locally.
Kernel ssion requires the sharing of such data objects between threads. If kernel ssion is to be employed and cache
locality on CPU is to be kept it is necessary to dene the privatization depending on what architecture is targeted.
OpenACC allows for automatic privatization by means of explicit clauses, however this only works reliably within the
same routine as the parallel loop is dened. For a rewrite with ner grained parallelization granularity across a deep
call graph it would still be necessary to rewrite most of the data objects for a higher number of domains (in case of
ASUCA’s physics 3D grid data instead of 1D column data).
2.4.3 Architecture-Dependent Storage Order. In a largely memory bandwidth application it is paramount to implement
a storage order that matches the target hardware’s cache architecture since cache misses directly inuence memory
pressure [4]. Dierent hardware architectures with dierent cache architectures have dierent ideal storage orders. On
CPU, the innermost loop should always be given stride-1 memory access in order to increase cache locality. Due to the
characteristics of physical processes discussed in Section 2.4.1 the innermost loop is always operating on the K-domain,
therefore the natural storage order choice is KIJ or KJI [36]. Meanwhile on GPUs the domain that is mapped to the rst
thread index (i.e. one of the parallel domains) should be given stride-1 access to enable coalesced memory access [13],
which requires either IJK or JIK orderings in case of ASUCA [36]. Storage order abstraction is therefore necessary to
achieve performance portability. Section 5.1.1 shows this as well, based on the ndings from a model application.
Storage order therefore strongly impacts performance portability. In OpenACC Fortran there is no provided method
to specify multidimensional Fortran arrays in a way that the storage order is ecient on both GPU and CPU without
requiring a rewrite of all array accesses and specications.
Manuscript submied to ACM
10 M. Mu¨ller and T. Aoki
2.5 Problem Statement
In Section 2.4 we have outlined what we consider the most important porting aspects regarding eciency on GPU for
hybrid codes, i.e. applications with the capability to run on either CPU or GPU. In existing methods used for NWP
these issues have been solved using a combination of the following approaches:
• Maximum relearning: rough a complete reimplementation in a meta-programming capable language (so far
for HPC this means C++), memory layout abstraction and parallelization on dierent target platforms has been
solved. Examples for this approach are Fuhrer et al. 2014 [7] and Shimokawabe et al. 2014 [37].
• Eciency loss: Reducing the scope of the problem by only porting the parts of the code that are already
ne-grained (and thus suited for GPU) solves the issues described in Sections 2.4.1 and 2.4.2. is however leads
to a signicant overhead in GPU-CPU data exchange since the remaining code runs on CPU, thus requires data
synchronization for every time step. is approach has for example been taken by Gove et al. [9] [10] and by
various projects involving GPU acceleration of WRF as outlined in Section 1.1.3.
• Code divergence: If the entire logic is to be kept in a style of Fortran language that is familiar to domain
scientists (by making use of directive-based approaches such as OpenMP and/or OpenACC), it ultimately leads
to code duplication and divergence since no directive-based approaches have support for code granularity
transformations or memory layout transformations. Norman et al. 2017 show an example of this approach [28].
Our goal for ASUCA was to nd a solution that has none of the aforementioned drawbacks, i.e. a full GPU port that
remains CPU compatible, requires minimal code divergence and is able to reuse as much of the existing code as possible.
3 NEW APPROACH
A new approach called “Hybrid Fortran” has been developed in order to solve the problem described in Section 2.5.
Hybrid Fortran employs a source transformation with the following characteristics:
(1) porting to this framework from existing CPU code requires a low amount of eort, i.e. existing computational
code targeted at CPUs can be used as-is where possible,
(2) as a consequence (see also Sections 2.4.1 and 2.4.3), storage order and parallelization granularity are to be made
variable, such that CPU targeted user code can be transformed to match the architecture specic granularity
and storage order,
(3) the parallelization technology (here OpenMP, OpenACC or CUDA) is transparent to the framework user such
that with changing industry standards the backend implementation can easily be modied,
(4) whenever architecture specic programming is needed, Hybrid Fortran uses an abstraction in order to centralize
these specic seings in a codebase-wide conguration rather than leaking it to the user code,
(5) however, control is not be taken away from the framework user, i.e. the implementation should be predictable
and adjustable in order to allow for a highly productive environment to create performant implementations,
(6) and nally, Hybrid Fortran is suitable for both dynamical processes (stencil paerns) as well as column-only
physical parametrizations.
In this section we rst introduce a model application used for the further discussion (Section 3.1). We then apply
OpenACC parallelization for GPU and OpenMP parallelization for CPU and compare it to a Hybrid Fortran implemen-
tation from a user-centric viewpoint in Section 3.2, followed by an implementation-centric discussion in Section 3.3.
Finally, the current limitations of this approach are discussed in Section 3.4.
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 11
3.1 Reduced Weather Application
Since ASUCA is too large to present and analyze here in its entirety (see also Figure 9 in Section 5.2), this section
introduces a reduced, yet scientically meaningful, weather application as a vehicle to show relevant code and
performance characteristics. Equation 2 shows the main equation governing this application: the heat diusion equation
with T being the temperature dened in IJK-space, κ being the thermal diusivity and f being the rate of heat input
through radiation.
∂T
∂t
= κ∇2T + f (2)
is equation is solved numerically by using the explicit Euler method in 3D. In addition, the k = 0 and k = nz
boundaries are modeled as innite heat wells with temperatures 330K and 200K respectively. Radiation f is assumed
to be uniform and statically applied across the spatial domain. e KI and KJ boundaries are modeled cyclically. We
initialize the temperature function to 0K with a 300K box from one quarter to three quarters of each spatial dimension
and set a delta of 0.1s between time steps. Figure 2 shows slices at j = 100 for the resulting data with time steps at
t = 0s, t = 30s and t = 90s. is shows the heat diusion, the radiation eect and the surface (k = 0) and planetary
boundary heat exchange (k = 200).
Fig. 2. Output at j = 100.
In order to adhere to ASUCA’s code structure the application is split into a dynamical core (which here only consists
of the diusion ∂T∂t = κ∇2T ) and physical processes with no interaction in the horizontal (radiation, surface- and
planetary boundary heat exchange). Analogous to ASUCA’s physical processes interface (see Section 2.4.1), such
processes are parallelized (here in run_physics) with each parallel thread executing all physical processes for a single
K-column. e dynamical core consists of four stencil code kernels - the inner IJK region as well as the boundary
regions for the IJ, KI and KJ boundaries. Figure 3 shows this code structure including the computationally relevant call
graph, loops and data domains. e entire code is listed in Appendix I.
3.2 Design of the Hybrid Fortran Language Extension
In the following section, by using the previously introduced reduced weather application as an example, we discuss the
application of Hybrid Fortran with regards to the design goals stated in the beginning of section 3.
Manuscript submied to ACM
12 M. Mu¨ller and T. Aoki
simulate 
  for t ∈ [0,tend]:
routine
loop repeating  
.. statements.. 
for each x ∈ [a, b]
Legend
dif fuse 
  for j ∈ [1,ny]: 
    for i ∈ [1,nx]: 
      for k ∈ [2,nz-1]: 
        .. pointwise process .. 
    
.. boundary conditions ..
physics 
  for j ∈ [1,ny]: 
    for i ∈ [1,nx]: 
     
radiate 
      for k ∈ [1,nz]: 
        .. pointwise process ..
surface
planetary boundary
call
for x ∈ [a, b]: 
  .. statements ..
Fig. 3. Code structure of the reduced weather application.
3.2.1 Basic Parallelization. An example of a parallelizeable code is the following extract of the diusion algorithm
(see Appendix I.1 for a full listing):
Listing 1. Diusion, implemented sequentially.
subroutine diffuse(energy_u , energy)
real (8),intent(out):: energy_u (0:nx+1,0:ny+1,nz)
real (8),intent(in):: energy (0:nx+1,0:ny+1,nz)
! ... further specifications and initializations ...
do j = 1,ny
do i = 1,nx
do k = 2, nz -1
energy_u(i,j,k) = energy(i,j,k) &
& * (1.0d0 - 6 * diffusion_velocity) + diffusion_velocity &
& * ( energy(i-1,j,k) + energy(i+1,j,k) + ... ) &
end do
end do
end do
! ...
ere are no loop carried dependencies for the sole modied data object energy_u, thus this inner loop region is
inherently parallelizable. One can parallelize this code for CPU and GPU with OpenMP and OpenACC by adding
directives as follows3:
Listing 2. Diusion, naive parallel implementation on CPU and GPU.
!$omp parallel default(firstprivate) shared(energy , energy_u)
3It is notable that OpenACC’s kernel directive has symbol sharing defaults that are sucient for the described use case, i.e. the default behaviour is
equivalent to using default(rstprivate) and declaring all shared arrays explicitly. It will parallelize the stencil kernel based on heuristics as well as a
dependency analysis (i.e. recognizing loops as parallelizeable when there are no loop carried dependencies).
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 13
!$acc kernels
do j = 1,ny
do i = 1,nx
do k = 2, nz -1
! ... loop body ...
end do
end do
end do
In contrast, in Hybrid Fortran the loops to be parallelized are replaced by the directive @parallelRegion as follows45:
Listing 3. Diusion, implemented in parallel with Hybrid Fortran.
@parallelRegion{domName(i,j), domSize(nx,ny)}
do k = 2, nz -1
! ... loop body ...
end do
@end parallelRegion
By replacing parallelizable loops with this explicit code construct the following is gained:
(1) It becomes immediately clear that loops are semantically dierent from parallel regions. Loops are always
sequential constructs and thus allow any dependencies on previous values, while in a parallel regions the order
of execution is not dened.
(2) For the GPU implementation there is a clear boundary between device code and host code (see also the glossary
in Appendix C). All code inside the parallel region directive gets implemented as device code, while all other
code remains on the host. us, Hybrid Fortran follows a more explicit approach for the parallelization, which
more easily allows the programmer to have a mental model of how an algorithm is implemented on the device.
rough only the information provided in this one directive, Hybrid Fortran is able to generate OpenMP parallel
loops for the CPU- as well as CUDA Fortran or OpenACC kernels for GPU-parallelization. e generated code for
GPU is listed in Appendix I.2 while the generated CPU parallelization corresponds to Listing 2. It is noteworthy that
Hybrid Fortran automatically adds the required sharing clauses. For that purpose it is assumed by default that all
arrays need to be shared while all scalars can be treated as thread private6. Data object specication- as well as usage
within parallel regions is parsed in order to enable useful default sharing behaviour. is combination of design choices
makes implementing data parallel code very simple for the framework user, i.e. only the domain boundaries of the
parallelization need to be specied.
3.2.2 Device Data Region. e GPU’s advantage in terms of memory bandwidth can only be used eectively if the
copying of data via the comparatively slow PCI Express bus can be minimized. To ensure this, it is necessary to gather
device state information for each data object in each routine, i.e. whether an existing device memory version can be
utilized or whether the data object rst needs to be synchronized with their host versions.
In order to improve on this with OpenACC, data directives are added for the time integration loop (Appendix I.6) as
follows:
4Here we only parallelize in I and J as otherwise the kernel thread runtime becomes sub-optimally small, which creates scheduler overhead.
5Without further specication, Hybrid Fortran assumes arrays to start at 1, which can be changed by adding additional clauses to the parallel region
directive.
6Reductions thus require an explicit clause for the parallel region directive, analogous to OpenMP reductions, see also Appendix A.1.
Manuscript submied to ACM
14 M. Mu¨ller and T. Aoki
Listing 4. The data directives.
!$acc enter data copyin(energy), &
!$acc& $copyin(energy_u , energy_surf , energy_pbl) &
do while (.true.)
! .. time step code ..
time = time + timestep
if (time > end_time) then
!$acc exit data delete(energy , energy_u), &
!$acc& delete(energy_surf , energy_pbl)
return
end if
end do
!$acc exit data delete(energy , energy_u), &
!$acc& delete(energy_surf , energy_pbl)
In addition, present clauses need to be added to all OpenACC kernel directives for all the data objects that are
already present thanks to this newly introduced data region.
By contrast, in Hybrid Fortran the device state of data objects is dened declaratively using @domainDependant
directives that are added aer the specication parts of each relevant subroutine:
Listing 5. Specification of device present data
subroutine diffuse(energy_u , energy)
real (8),intent(out):: energy_u (0:nx+1,0:ny+1,nz)
real (8),intent(in):: energy (0:nx+1,0:ny+1,nz)
! ... further specifications
@domainDependant{attribute(autoDom , present)}
energy_u , energy
@end domainDependant
! ...
rough the present aribute, Hybrid Fortran is instructed to treat data as device present. e autoDom aribute
furthermore instructs Hybrid Fortran to use the dimensions from the user specication (which becomes relevant with
the features discussed in Section 3.2.4). By applying analogous @domainDependant directives with transferHere
aribute at the data region boundaries (e.g. at the subroutine that implements the time integration loop), Hybrid Fortran
is instructed to generate the data copies at those boundaries (instead of the default behavior, which is to transfer at
every kernel invocation). Again, the user specied intent is used for each data object to determine the type of memory
transfer to be implemented, which minimizes the potential for bugs in comparison to OpenACC’s explicit copyIn,
copyOut and copy clauses.
3.2.3 Architecture-Dependent Parallelization Granularity. As discussed in Section 2.4.1, one of the main problems to
solve in order to gain performance portability and to allow code reusability is to support parallelizations with variable
granularity to co-exist within the same user code. For this purpose, Hybrid Fortran allows a parallel region to be applied
to only a partial set of hardware architectures.
For the physical processes in the example we originally have the following code (see the full listings in Appendix I.3,
I.4 and I.5):
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 15
Listing 6. Physical processes, implemented sequentially
subroutine run_physics (...)
! ... specifications ...
do j = 0,ny+1
do i = 0,nx+1
call radiate(energy(i,j,:))
call exchange_heat_with_boundary( energy(i,j,:), energy_surf(i,j), 1 )
call exchange_heat_with_boundary( energy(i,j,:), energy_pbl(i,j), nz )
end do
end do
end subroutine
subroutine radiate(energy)
real (8), intent(inout), dimension(nz) :: energy
! ... more specifications and initializations ...
do k=1,nz
energy(k) = energy(k) + radiation_intensity
end do
end subroutine
! ... exchange_heat_with_boundary definition in 1D
Hybrid Fortran allows to write this as follows:
Listing 7. Physical processes, implemented with Hybrid Fortran
subroutine run_physics (...)
! ... specifications ...
@parallelRegion{ &
& appliesTo(CPU), domName(i,j), domSize (0:nx+1,0:ny+1), &
& startAt (0,0), endAt(nx+1,ny+1) &
& }
call radiate(energy(i,j,:))
call exchange_heat_with_boundary( energy(i,j,:), energy_surf(i,j), 1 )
call exchange_heat_with_boundary( energy(i,j,:), energy_pbl(i,j), nz )
@end parallelRegion
end subroutine
subroutine radiate(energy)
real (8), intent(inout), dimension(nz) :: energy
! ... more specifications and initializations ...
@parallelRegion{ &
& appliesTo(GPU), domName(i,j), domSize (0:nx+1,0:ny+1), &
& startAt (0,0), endAt(nx+1,ny+1) &
}
do k=1,nz
energy(k) = energy(k) + radiation_intensity
end do
@end parallelRegion
end subroutine
! ... exchange_heat_with_boundary definition in 1D, analogous to radiation ...
is shows the following:
Manuscript submied to ACM
16 M. Mu¨ller and T. Aoki
simulate 
  for t ∈ [0,tend]:
routine
loop repeating  
.. statements.. 
for each x ∈ [a, b]
Legend
dif fuse 
               for k ∈ [2,nz-1]: 
                 .. pointwise process ..  
    
.. boundary conditions ..
physics 
     
radiate 
             for k ∈ [1,nz]: 
               .. pointwise process ..
surface 
              .. pointwise process ..call
for x ∈ [a, b]: 
  .. statements ..
CPU&GPU
i,j ∈   
[1,nx], 
[1,ny]
CPU
i,j ∈   
[1,nx], 
[1,ny]
GPU
i,j ∈   
[1,nx], 
[1,ny]
GPU
i,j ∈   
[1,nx], 
[1,ny]
planetary boundary 
              .. pointwise process ..
GPU
i,j ∈   
[1,nx], 
[1,ny]
execute 
.. statements ..  
in parallel for each  
i,j ∈  [1,nx], [1,ny] 
if the executable is 
compiled for CPU. 
Otherwise run  
.. statements.. a 
single time.
CPU
i,j ∈   
[1,nx], 
[1,ny]
.. 
st
at
em
en
ts
 ..
Fig. 4. Code structure of the reduced weather application, adapted for Hybrid Fortran.
(1) Using an appliesTo clause (e.g. for the physical processes), parallel regions can be applied partially to specic
architectures, that is, for architectures not listed in this clause the region body will be run only once per call
and the parallelization can be applied at a higher granularity (here at the radiation, surface and planetary
boundary processes). See also Figure 4 for the resulting code structure and compare to Figure 3 in Section 3.1.
(2) e user code within parallel region bodies can remain untouched compared to the sequential CPU version.
Hybrid Fortran automatically extends data accesses and specications, here from 1D to 3D for radiate (see
also Section 3.2.4).
Fig. 5. Architecture specific call graph colorings.
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 17
In order to be able to correctly generate the code for varying parallelization granularity, the transformation process
requires positioning information for each routine towards relevant parallel regions. Each routine either directly contains
a parallel region, has parallel regions within its call graph, or is itself called within a parallel region. is information
can be visualized as a coloring of the call graph, with a separate coloring for each target architecture. Parallel region
appliesTo clauses are the deciding factor for this coloring. Figure 5 shows this for the reduced weather application. is
global positioning information is required by the code transformation in order to implement the correct dimensionality
(see also Section 3.2.4) as well as the correct type of code in case of GPUs being the target, i.e. host-, kernel- or device
routine code (see also the relevant denitions in Appendix C).
3.2.4 Architecture-Dependent Privatization. Since Hybrid Fortran allows compile-time dened parallelization granu-
larity, as discussed in Section 3.2.3, it is necessary for the code transformation to change the dimensionality of data
objects in order to expose their data parallelism at a ner-grained level. is requires hints from the framework user.
e following listing shows the @domainDependant directives required in order to extend the data objects from 1D
to 3D (adding the missing parallel dimensions). is is used in column-only physical parametrizations (radiation and
boundary layer in case of ASUCA) where the original CPU code is parallelized in a single place using OpenMP-loops
over the horizontal domain. Applying this technique allows the user to simply wrap the existing code at a ne-grained
level with GPU-targeted @parallelRegion constructs in order to delegate the task of exposing data parallelism to the
framework.
Listing 8. Extending data object dimensions with Hybrid Fortran
subroutine radiate(energy)
real (8), intent(inout), dimension(nz) :: energy
! ... more specifications ...
@domainDependant{attribute(autoDom , present), domName(i,j), domSize (0:nx+1,0:ny+1)}
energy
@end domainDependant
@parallelRegion{ &
& appliesTo(GPU), domName(i,j), domSize (0:nx+1,0:ny+1), &
& startAt (0,0), endAt(nx+1,ny+1) &
& }
do k=1,nz
energy(k) = energy(k) + radiation_intensity
end do
@end parallelRegion
end subroutine
3.2.5 Storage Order. As will be shown in Section 5.1.1, variable and architecture-dependent storage order is necessary
for performance portability as well as code reuse. Hybrid Fortran therefore automatically adds default macro wrappings
for multidimensional arrays. is allows a specication of the storage order depending on the target architecture
in a centralized location. Unlike with other code hybridization methods supporting Fortran, this enables variable
storage order without the need to rewrite any user code. In case more exibility is needed (such us varying storage
order in dierent parts of an application), Hybrid Fortran allows an explicit denition of the macro names in the
@domainDependant directive.
Manuscript submied to ACM
18 M. Mu¨ller and T. Aoki
As an example, consider again Listing 7, which shows the Hybrid Fortran version of the reduced weather application’s
physics code. rough the transformation processes, array accesses and specications are rewrien as follows:
Listing 9. Storage order and coarse-grained parallelization aer automated transformation for CPU
do j = 0,ny+1
do i = 0,nx+1
call radiate(energy(AT(i,j,:)))
!...
end do
end do
In an separate source le that is automatically included in all sources, the macro AT is specied as follows:
Listing 10. Storage order definition
#if (GPU)
#define AT(i,j,k) i, j, k
#else
#define AT(i,j,k) k, j, i
#endif
3.2.6 Block Size. Block size (see also Appendix C) can have a signicant eect on GPU performance. Hybrid Fortran
by default chooses a block size of 32x16x1, which is a good default for data parallel atmospheric applications (see also
Section 5.1.4). is default can be changed through macros in a conguration source that gets automatically included in
all other sources. In cases where kernels have varying optimal block sizes, Hybrid Fortran allows the specication of a
macro name sux in the parallel region directive.
In contrast, PGI Accelerator’s OpenACC implementation by default uses a 128 × 1 block size conguration, which
usually leads to a subpar performance due to a low occupancy (see also Appendix C). is therefore needs to be
changed explicitly for each loop invocation by using vector clauses in $acc loop directives added to all parallel loops.
Furthermore it is oen advisable to use !$acc loop seq for loops over k in order to force the compiler not to assign
the rst thread index to k (which is sub-optimal for the chosen storage order).
By using macros, Hybrid Fortran avoids leaking this architecture dependant information to the user code in favor of
a centrally dened conguration.
3.3 Source Transformation Method
In this section we introduce the code transformation, involved in implementing the framework logic discussed. Figure
6 shows this process in nine distinct phases, with phases four through nine being applied separately depending on the
target architecture. is process runs transparently from the user point of view, i.e. it is applied automatically by the
means of a common Makele provided together with the framework. Each of the numbered paragraphs in this section
corresponds to one transformation phase.
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 19
merge 
continued 
lines 
extract call 
graph & parallel 
regions
apply CPU 
coloring parse 
symbols 
globally in 
modules 
and 
routines
apply 
GPU 
coloring
preprocess 
H90 to h90
parse 
modules, 
routines 
and 
directives, 
create 
global 
model
impl. for 
CUDA 
Fortran
implement for 
OpenMP
implement for 
OpenACC
apply 
macros 
& 
compile
Phase
1 2 3 4 5 6 7 8
Legend
existing 
module
original 
module
applied 
generally
applied 
for CPU
applied 
for GPU
split host/
device 
code 
unified 
code
CPU 
exe.
GPU 
exe.
split lines for 
Fortran 
compatibility
9
Fig. 6. Steps involved in the code transformation method.
3.3.1 Macro Processing for Unified Sources. In addition to the automatically applied transformation process that
implements the behavior described in Section 3.2, a further layer of meta-programming is supported by applying a
macro processing step to Hybrid Fortran sources7, including directives. We use the GNU compiler tool-chain for this
purpose (gcc -E).
3.3.2 Merging of Continued Lines. In order to simplify the parsing in subsequent phases, Fortran continuation lines
are merged.
3.3.3 Call Graph and Parallel Region Parsing. Facilitating phase 4, the application’s call graph and parallel region
directives are parsed globally.
3.3.4 Call Graph Coloring. A call graph coloring is created, as shown previously in Figure 5.
3.3.5 Global Symbol Parsing. In order to gather symbol information, the application is parsed twice in this phase: A
rst time in order to gather module data object specications and a second time to link this information to the routines
where such module data objects get imported, together with the information of data objects dened in each routine
scope. is allows Hybrid Fortran to globally determine the domain information for data objects.
3.3.6 Global Model Generation. All previous information is compiled into a global application model, with model
classes representing the modules, routines and code regions. Code regions are modelled using subclasses for specication-
, sequential-, call- and parallel regions. For each code line this model contains a list of the data objects used within the
line and their known meta information (gathered previously from local module scope, imported scope, routine scope as
well as added information from Hybrid Fortran directives). Each routine model is also assigned an implementation class
(see phase 7).
3.3.7 Implementation. One of the stated goals of Hybrid Fortran is to create a higher level tool-chain that is reusable
for various hardware- and soware architectures. us, only a single transformation phase is dependent on the backend
implementation, here called the “implementation phase”. For this purpose, a Python implementation class hierarchy
is used in order to facilitate the reuse or specialization of a set of implementation methods for the varying backends.
7Hybrid Fortran’s build system recognizes sources from four distinct formats: f90 (Modern Fortran), F90 (Modern Fortran with Macros), h90 (Hybrid
Fortran) and H90 (Hybrid Fortran with Macros). Depending on the source format, the build process starts at phase one (H90), two (h90) or nine (f90 / F90).
Manuscript submied to ACM
20 M. Mu¨ller and T. Aoki
Previously gathered information is used by instance methods of these implementation classes to create the sources
from the previously built model. More specically, using implementation classes for either standard Fortran, OpenMP
Fortran, CUDA Fortran or OpenACC Fortran, the following transformations are applied:
• data object specications and accesses are transformed according to object type, implementation class and
routine coloring,
• data copies from and to the device are generated,
• device- and host code boilerplate is generated. Since this requires separate routines for CUDA (one for the host
code and one for each kernel in the user code), parallel region bodies are split o into their own generated
routines beforehand.
CUDA Fortran and OpenACC GPU implementations can be mixed in the same project. is is mainly done in order to
make use of OpenACC’s reduction feature (which is not supported for the CUDA Fortran backend). Switching between
the dierent backend architectures is done by specifying a default implementation in the conguration and wrapping
routines in a @scheme directive where a dierent one needs to be applied. In order to facilitate interoperatibility, Hybrid
Fortran uses CUDA Fortran allocated device pointers for all GPU implementations.
3.3.8 Line Spliing. In order to make the generated code compatible with standard Fortran compilers (which usually
enforce a maximum line length), the generated lines are split at appropriate positions containing white space until the
maximum line length is satised.
3.3.9 Final Macro Processing and Compilation. e GNU preprocessor is applied again in order to process the
generated macros (e.g. for storage reordering, see also Section 3.2.5). Subsequently, a user specied compiler and
linker is used in order to create the CPU and GPU executables. For GPU, the generated code is compatible with PGI
Accelerator in case of the CUDA Fortran implementation, and PGI Accelerator or Cray Fortran compiler in case of the
OpenACC implementation.
3.4 Limitations
Hybrid Fortran has at this point in time mainly been tested to implement structured grid applications. We expect that
unstructured grid applications would reveal currently unsupported use cases.
Additionally, some limitations apply to the code within GPU-applied parallel region bodies: Such code must not
contain recursion, call other kernel routines, use data objects with DATA or SAVE aribute, contain I/O statements
(except print) or contain array or slice expressions such as A(:,:,:) = 0.0 or C = A + B (i.e. all operations in GPU
parallel region bodies are required to be scalar).
4 PERFORMANCE MODEL
In this section a performance model is constructed for the application introduced in Section 3.1. For this purpose we
simplify the problem as follows:
(1) By assuming an optimized storage order (see also Section 5.1.1) the bandwidth bounded model for sequentially
accessed memory can be applied to the inner region of diffusion as well as radiation.
(2) Memory accesses from cache are assumed to not aect execution time at all (i.e. such accesses are assumed to
be hidden behind host- or device memory accesses in the pipeline).
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 21
(3) By assuming suciently large spatial dimensions we omit all other program parts from further modelling since
their runtime eect should be negligible.
(4) We assume the compiler to optimize all operations that can be hoisted out of loops.
Per loop iteration, diffusion8, being a seven point stencil code, requires a maximum of eight memory accesses per
point update as well as six add and two multiply instructions. e arithmetic intensity according to an adapted rooine
model[46] is thus
c
b ·m =
8
8 · 8
[
FLOP
B
]
= 0.125
[
FLOP
B
]
, (3)
wherem is the number of values to read and write from/to memory per point update, b is the byte length of each
value and c is the number of oating point cycles spent per point update9.
e minimum arithmetic intensity for compute boundedness is 5.9[ FLOPB ] and 7.8[ FLOPB ] for Tesla K20x and P100,
respectively (using the performance numbers from Appendix B). us, diffuse is clearly memory bandwidth bounded.
is holds true even if six of the eight memory accesses are cached (see also the stencil access paern in Appendix
I.1). Due to the lower random access memory bandwidth, the boundary region JK is clearly also memory bandwidth
bounded. e same holds for the radiation process, since it requires two memory accesses and one FLOP per iteration10.
We therefore conclude that the reduced weather application is memory bandwidth bounded for all processes.
e reduced weather application is thus highly sensitive to the cache performance of the target architecture. Applying
the above parameters to the speedup condition (equation 1), a speedup on GPU is feasible if the output is not required
at every time step.
With some simplication in the domain boundaries (by assuming suciently large domains for halos to be neglectable)
the model GPU execution time becomes
δtoutput
δtt imestep
(nx · ny · nz · (b ·msa
BWD
+
b ·mHtoD
BWHtoD
) + ny · nz · mra
RAD
), (4)
and the model CPU execution becomes
δtoutput
δtt imestep
(nx · ny · nz · b ·msa
BWH
+ ·ny · nz · mra
RAH
), (5)
where b is the byte length of each value,msa is denoting the number of sequential memory accesses in the inner
region of diffuse and radiation andmra is denoting the number of random access updates in the boundary region
JK, nx , ny and nz are denoting the grid sizes in x-, y- and z-direction, respectively, BWH , BWD and BWHtoD are the
bandwidths available on host, device and between host and device, respectively, and RAH and RAD are the number of
8See also the source in Appendix I.1
9In contrast to the rooine model we use oating point cycles rather than operations here in order to have a more generalized model that can be applied
to multi-cycle oating point operations such as division or exponentiation
10One should note however that this does not necessarily reect typical physical processes in atmospheric models - such processes tend to consist of a
mix of compute bounded and memory bandwidth bounded algorithms.
Manuscript submied to ACM
22 M. Mu¨ller and T. Aoki
random memory accesses per second (measured in GUP/s) [22]. It holds thatmsa = 10 without caching,msa = 4 with
caching andmra = 4. See also Section 5.1 for a comparison of the measured performance with this model.
5 PERFORMANCE RESULTS AND DISCUSSION
is section discusses performance results obtained using the Hybrid Fortran approach. In Section 5.1 the reduced
weather application (introduced in Section 3.1) is used to compare the Hybrid Fortran user code implementation to an
OpenACC implementation and the performance model (introduced in Section 4) with respect to the overall performance
as well as individual performance related implementation aspects. Hybrid ASUCA is presented as a new implementation
in Section 5.2. Sections 5.3 and 5.4 subsequently compare its GPU performance to the reference implementation
(CPU-only), for small and large scale grids, respectively.
5.1 Comparing Hybrid Fortran with OpenACC and Model
is section facilitates the previously discussed reduced weather application in order to compare Hybrid Fortran with
OpenACC and a performance model. We focus here on the desired portability features discussed in Section 2.4. For the
performance measurements in this section the conguration as listed in Appendix D has been used.
5.1.1 Performance Portable Storage Order. Table 3 shows the impact, storage order has on execution time. In the case
of the currently discussed application, choosing a sub-optimal storage order impacts CPU execution time negatively by
35%, while on GPU the slowdown is 7.7x. is veries the necessity of a exible storage order for applications with
similar data structures as ASUCA, as discussed in Section 2.4.3.
Table 3. Influence of Storage Order on Execution Time, nx = ny = nz = 128
IJK Order KIJ Order
CPU Single Core 1.73s 1.28s
GPU (OpenACC) 0.10s 0.77s
(Fastest Implementation)
Table 4. Execution Time with “Naive” Parallelization
nx · ny · nz 1283 2563
CPU Single Core Measurement 1.28s 8.20s
CPU Single Core Model w/ cache 0.74s 5.70s
CPU Single Core Model w/o cache 1.77s 13.91s
CPU 6 Core Measurement 0.40s 4.25s
CPU 6 Core Model w/ cache 0.38 2.84s
CPU 6 Core Model w/o cache 0.87 6.77s
GPU Measurement 163.13s n/a
5.1.2 “Naive” Parallelization. In order to establish a baseline performance we apply a basic parallelization to the
reduced weather application with OpenACC and OpenMP directives as discussed in Listing 2, Section 3.2.1. Storage
order is made variable across the whole application by using macros for accessing and specifying multi-dimensional
arrays. We are using I-J-K order for GPU and K-I-J order for the CPU implementation. Table 4 shows the resulting CPU
performance from this parallelization.
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 23
We conclude that the measured CPU performance is already well within the models with perfect and no cache.
e GPU performance is however very slow - more than 400x slower than the six core CPU version, which is not in
agreement with the models we have constructed. is is to be expected however, since no data region has yet been
dened, without which the data is being copied over the slow PCI express bus for every kernel invocation. Further
discussion for the reduced weather application therefore focuses on GPU performance.
Table 5. Execution Time with Data Region
nx · ny · nz 1283 2563
GPU Measurement 0.095s 0.63s
GPU Model w/ cache 0.027s 0.19s
GPU Model w/o cache 0.087s 0.66s
5.1.3 Data Region. e amount of I/O to and from the GPU can be greatly reduced by using a data region in order
to avoid host-to-device data copies at every kernel invocation, as discussed in Section 3.2.2. Table 5 shows the resulting
performance with OpenACC.
Table 6. Block Size Impact on Execution Time
nx · ny · nz 1283 2563 2562 · 512
GPU Automatic Block size 0.095s 0.63s 2.99s
GPU 32 x 16 Block size 0.10s 0.55s 2.16s
GPU Model w/ cache 0.027s 0.19s 0.69s
GPU Model w/o cache 0.087s 0.66 2.60s
5.1.4 Block Size. As Table 6 shows, up to 38% performance improvement is possible over the version discussed in
Section 5.1.3 for the OpenACC implementation by changing the default block size using vector clauses for each parallel
loop (for a denition of block size, see also the GPGPU terms glossary in Appendix C). PGI accelerator uses a 128 × 1
block size for its OpenACC implementation by default while 32 × 16 has been experimentally determined to be the best
block size for this application except for a small degradation for the smallest measured grid size. See also Section 5.1.5
for a performance comparison with Hybrid Fortran (which chooses this block size by default).
5.1.5 Performance Comparison. Figure 7 shows the performance results for the reduced weather application with the
Hybrid Fortran based implementation (discussed in Section 3.2) in comparison with the models introduced in Section 4
as well as the OpenMP/OpenACC based approach discussed in Section 3.2. For each of the implementations the highest
performing version (32 x 16 block size with data region) has been selected for this comparison. is shows that measured
CPU performance aligns well in between the two models (with and without cache) while for the TSUBAME 2.5 Kepler
based GPU architecture, L1 cache appears to be less eective for this application and the measured performance is
closer to the model without cache. It is noteworthy that the Hybrid Fortran generated code performs as well or beer
than the equivalent OpenACC code for both target architectures in this example.
Overall we observe a speedup of 8x for the Hybrid Fortran version vs. 6-core CPU, compared to the speedup of 7.7x
for the OpenACC version. e Hybrid Fortran implementation’s speedup is thus very close to the bandwidth increase
8.2x between the two architectures (see Appendix B).
Manuscript submied to ACM
24 M. Mu¨ller and T. Aoki
Fig. 7. Comparing performance with the reduced weather application for handwrien vs. Hybrid Fortran generated vs. model on
2563 Grid.
On CPU the Hybrid Fortran implementation performs the same as the manually coded OpenMP implementation,
which is unsurprising given that the CPU code generated by Hybrid Fortran is practically identical.
5.2 Hybrid ASUCA Implementation
As Figure 8 shows, by employing Hybrid Fortran directives to both the dynamical core and the physical processes of the
existing CPU-targeted ASUCA user code, nearly all modules required for an operative weather prediction have been
ported to GPU while retaining CPU compatibility. All ported modules have been validated by comparing the output of
all relevant variables to the original implementation and ensuring the normalized root mean square error to be smaller
than 1E-10 aer 10 time steps in the coarsest time discretization, both on CPU and GPU.
Figure 9 shows the impact this new implementation has on the code size, as well as the overall distribution of code
lines in the application. Additionally, in [27] we have studied the productivity eects in more detail, which we describe
here briey: ASUCA’s codebase encompasses more than 150k lines of code. By using our method, more than 85% of the
original codebase was le unchanged and the overall code size has been extended by less than 5%, even though the new
codebase targets two very dierent hardware architectures instead of one. is shows very promising productivity,
maintainability and ease-of-adoption of the proposed method in an operational seing. Using an application model
we have established a detailed estimate of the required code changes for a comparable OpenACC implementation.
e result: OpenACC would require more than 73% of additional changes compared to the changes required using
Hybrid Fortran. We reckon that the same would be the case when using OpenMP directives targeting GPU, since to our
knowledge it only oers a subset of OpenACC’s GPU features to this date.
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 25
diagnose
physics
surf. diag.
min/max/ave
output
diag. adjust
phys. adjust
hdiff
dynamics
sediment
dtlong∫
diagnose
monitflux
RKshort dtshort∫ RKshort dtshort∫
RKshort dtshort∫
HEVIdynamics
advection
coriolis
curvature
n. diffusion
rayleigh diff.
pp interface
lbc
diagnose
microphysics
lbc
rad pp
convection
surf. slab
pbl pp
qxdiff
pbl coupler
ported to GPU
not yet ported
Legend
Fig. 8. Simplified call graph of ASUCA and status of Hybrid Fortran based implementation
reference CPU code: 
156’533 LOC
hybrid code: 
163’226 LOC
205 
kernels
133 
kernels
Framework
Physical processes 
(Hybrid Sources)
Dynamical core 
(Hybrid Sources)
Dependencies of 
Hybrid Sources
Fig. 9. Hybrid ASUCA in numbers
Of the approximately of 163k lines of code, around 19.5% are used for the dynamical core and 20.8% are used for the
physical processes, mainly to implement 133 and 205 kernels11, respectively.
11We arrive at this number when counting the kernels from the point of view of the GPU - through kernel fusion this number is considerably lower for
the CPU.
Manuscript submied to ACM
26 M. Mu¨ller and T. Aoki
By applying this unied programming model to both dynamical core and physical processes of ASUCA, host-to-device
communication has been eliminated with the only exceptions being setup, halo communication as well as le output.
5.3 Hybrid ASUCA Kernel Performance
Fig. 10. Execution time measurements for 75 long time steps of ASUCA executed in four dierent configurations
In this section, performance results for this new implementation are discussed for a 301 x 301 x 58 grid that is small
enough for single GPU or single socket execution with the latest architecture (Tesla P100 on Reedbush-H), yet still
allows a useful performance analysis in terms of occupancy (see Appendix C). is allows to draw conclusions for the
kernel performance as opposed to the communication overhead (which impacts performance more strongly, the more
nodes are used for the same grid size, i.e. when applying strong scaling as will be shown in Section 5.4). Additionally,
an older system (TSUBAME 2.5 with Tesla K20x) is used for comparison purposes. Since the two compared GPUs dier
in their available device memory (16 GB for P100 vs. 6 GB for K20x) we compare four K20x GPUs to one P100. See
Appendix E for the soware conguration used for this test. Hybrid ASUCA is also compared to previous GPGPU ports
of weather models as discussed in Section 1.1.
Figure 10 shows the execution times for this conguration on four hardware/soware congurations as listed. A
speedup of 4.9x has been achieved by the port on Kepler GPU vs. Westmere 6-core Xeon X5670. On newer hardware
however (Pascal vs. Broadwell) the speedup has so far been a more modest 3.1x, which is partly explained by the lower
memory bandwidth dierence between the two comparisons and the increased caching performance on Broadwell-
versus Westmere CPU architecture (as caching generally has a lower impact on GPU compared to CPU).
5.4 Hybrid ASUCA Strong Scaling GPU Results
Using a full scale production grid, Hybrid ASUCA has been tested on the new Tokyo University cluster “Reedbush
H” [43]. is cluster has two 18-core Xeon E5-2695 v4 CPUs per node as well as two NVIDIA Tesla P100 GPUs per
node. At least seven nodes (14 sockets) or 24 GPUs are required for this test due to the grid’s memory requirements. A
visualization of the resulting cloud cover from this simulation is depicted in Figure 11.
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 27
Fig. 11. Total cloud cover result with ASUCA using a 2km resolution grid with a real world weather situation
Figure 12 shows that 24 GPUs can replace more than 50 18-core CPU sockets. When comparing the same number of
GPUs and CPU sockets, the GPU is up to 76% faster. See Appendix F for the soware conguration used in this test.
e following factors inuence scalability and will need to be improved in order to achieve beer strong scaling:
(1) MPI communications code has been used as-is, with no further optimizations applied with respect to the
targeted cluster. When testing the impact of communication on performance on TSUBAME 3.0 using the
minimally required 24 GPUs, as shown in Figure 13, communication requires approximately 13.4s or 18.7% of
the overall runtime of 71.5s (to compute a 600 seconds simulation of the full regional grid in 2km resolution).
When doubling the number of GPUs this increases to 20.9s while the compute time decreases from 58.1s to
34.0s, thus the communication then takes 38% of the runtime. Overlapping communication and computations
has been shown to be eective in enabling beer scaling by Shimokawabe et al. [36], thus this approach will
be the rst step to improve performance at larger scales.
(2) Since GPUs require a large enough problem size per chip in order to have a sucient number of threads to ll
all schedulers, strong scaling is limited when the problem size per GPU becomes too small.
(3) In our ASUCA code version, as in the given reference, we do not have a distributed le I/O system as used in
production. Due to the large amount of data, even though for this test the output is only run at the beginning
and end of the simulation, it still has a strong impact on the overall execution time, resulting in part of the
discrepancy between for the speedups between the 301 x 301 and 1581 x 1301 grid sizes.
Figure 14 categorizes the performance impact of the dierent modules of ASUCA on performance, including
communication. e simulation of fast moving sound- and gravity waves has the highest impact, followed by radiation-
Manuscript submied to ACM
28 M. Mu¨ller and T. Aoki
Fig. 12. Strong scaling speedup on 1581 x 1301 x 58 ASUCA Grid.
58.10
34.02
13.38
20.86
0.00
20.00
40.00
60.00
80.00
24 48
Ex
ec
ut
io
n 
Ti
me
 [
s]
Number of GPUs
Compute
Halo Communication
Fig. 13. Impact of communication for strong scaling on 1581 x 1301 x 58 ASUCA Grid.
and boundary layer physics. Since the physics calculations do not require communication, the impact of fast moving
dynamics increases with the number of nodes, rendering it the most important optimization target for larger scale
simulations. For a listing of the congurations used to gather the data for gures 13 and 14, please refer to Appendix G.
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 29
 
5.59 5.08
24.77
15.74
4.11
2.07
9.96
7.19
27.06
24.81
0.00
20.00
40.00
60.00
80.00
24 48
Ex
ec
ut
io
n 
Ti
me
 [
s]
Number of GPUs
Short Timestep
Dynamics
Precipitation
Long Timestep
Dynamics
Long Timestep
Physics
Other
Fig. 14. Impact of modules for strong scaling on 1581 x 1301 x 58 ASUCA Grid.
5.5 Comparison to Previous Work
In Section 1.1.1, Dr. Shimokawabe’s work on applying GPUs to ASUCA was briey discussed. In this section we
compare that implementation to Hybrid ASUCA.
Table 7. Comparison of Hybrid ASUCA to GPU implementation by Shimokawabe et al.
Characteristic Hybrid ASUCA Shimokawabe
Main focus Productivity Eciency at large scale
(# of GPUs)
User language Fortran C++
Parallelization DSL DSL
Memory layout Transformation DSL
Dynamical core Port of Ishida et al. 2010 [17] Port of Ishida et al. 2010 [17]
Cloud microphysics Warm rain (Kessler-type) Warm rain (Kessler-type)
Radiation MSM0705 based n/a
Boundary layer physics MSM0705 based n/a
Strong scaling 42 P100 512 K20x
(# of GPUs) (equiv. to 159 K20x)
Strong scaling n/a 4108 K20x
(# of GPUs)
Table 7 shows the dierences in focus between Hybrid ASUCA and Shimokawabe et al. [37]. Dr. Shimokawabe’s
work was targeted at achieving computations on the largest scale available, i.e. a near-full utilization of TSUBAME 2.5
for a weak scaling result as well as using 512 GPUs for strong scaling. For comparison, Hybrid ASUCA was only scaled
using strong scaling, reaching a performance peak at 42 GPUs, using a device model that is equivalent in theoretical
performance to 159 GPUs used by Shimokawabe (P100 vs. K20x). On the other hand, with Hybrid ASUCA we were
rst able to port a complete model including operation-ready physical processes using a unied framework and we
have demonstrated that only a minor rewrite is necessary for a previously CPU-targeted code to achieve full GPU
compatibility (while Shimokawabe et al. relied on a full rewrite using a dierent user language and did not demonstrate
radiation- as well as boundary layer physics).
Manuscript submied to ACM
30 M. Mu¨ller and T. Aoki
6 CONCLUSIONS
In this work, “Hybrid Fortran” has been introduced as a new approach to port structured grid Fortran applications to
GPU. It provides abstractions over architecture specic programming while allowing existing CPU targeted user code
to be reused as much as possible (see also Section 3.2). is approach allows using a mixture of OpenMP, OpenACC,
and CUDA Fortran in the backend with a unied programming interface (see also Section 3.3). Storage order and
parallelization granularity are made variable. Existing CPU targeted user code is transformed to match the architecture
specic granularity and storage order.
A new implementation of ASUCA, the main mesoscale weather model used in weather prediction operations by the
national Japanese weather agency, has been created by applying Hybrid Fortran to both dynamical core and physical
processes. e resulting speedup on GPU (up to 4.9x on single GPU compared to a single Xeon socket) is comparable to
architecture specic rewrites of similar weather models (see also Section 5.3). To our knowledge this is the rst time a
complete production weather prediction model has been ported to GPU using a single unied programming paradigm.
e resulting user code is less than 5% larger compared to the reference CPU code, which is only made possible by
sharing all of the user code between the two architectures (see also Figure 9). A speedup of up to 3.5x on GPU was
achieved for a full scale production run (1581 x 1301 x 58 grid with 2km horizontal resolution using real world sample
input data, see also Section 5.4).
7 FUTUREWORK
Future eorts will focus on three aspects: performance, productivity and scope.
In order to improve the performance on GPGPU, the following avenues will be explored:
(1) Communication will be optimized in order to improve larger scale runs on more than 30 nodes. Network
topology based optimization, as well as the overlapping of communication and computation will be the main
focus for these optimizations.
(2) GPU kernels will be more nely tuned as far as a unied user code allows. One possible avenue is the usage
of launch bounds, a CUDA feature that supplies beer hints to the compiler with regard to register usage
(however it requires re-compilations for varying grid sizes).
(3) Automated cache blocking transformations that are applicable to GPU (shared memory, L1, texture cache) are
an important path to explore. Such transformations also have the potential to auto-tune the code for dierent
CPU architectures.
Further productivity gains are possible by automating more of the mechanical aspects of porting to Hybrid Fortran.
More specically, replacing the device data state aributes per routine and per data object with a generalized data
region facility (and keeping track of all the involved data objects automatically) will greatly reduce the work needed to
port to Hybrid Fortran. is in turn can make Hybrid Fortran viable for a GPGPU port of the WRF model (which is one
order of magnitude larger in code size compared to ASUCA).
In terms of scope, applying Hybrid Fortran to unstructured grids will be explored.
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 31
A HYBRID FORTRAN LANGUAGE EXTENSION
is appendix section gives an overview of the main language extensions provided by Hybrid Fortran. For a complete
listing, refer to the documentation in the repository12.
A.1 Parallel Region Construct
Listing 11 shows the parallel region construct. It allows the framework to dene parallelization as well as the thread
granularity at compile-time.
Listing 11. Parallel region directive syntax.
@parallelRegion{ATTRIBUTE_NAME1(MEMBER1 , MEMBER2 , ...), ...}
! code to be executed in parallel !
@end parallelRegion
e following aributes are supported for this directive:
appliesTo Specify one or more of the following aribute members in order to set this parallel region to apply to either
the CPU code version, the GPU version or both. Omiing this aribute has the same eect as specifying all
supported architectures.
(1) CPU
(2) GPU
domName (Required) Specify one or more domain names over which the code can be executed in parallel. ese
domain names are being used as iterator names for the respective loops or CUDA kernels.
domSize (Required) Set of the domain dimensions in the same order as their respective domain names specied using
the domName aribute. It is required that |domName | = |domSize |.
startAt e lower boundary for each domain at which to start computations. Omiing this aribute will set all
boundaries to 1. It is required that |startAt | = |domName |.
endAt e upper boundary for each domain at which to end computations. Omiing this aribute will set all
boundaries to domSize for each domain. It is required that |startAt | = |domName |.
reduction Works in the same way as OpenMP reduction directives. is is only supported with the OpenACC- and
OpenMP backends. For example reduction(+:result) sums up the results over all threads.
template Denes a postx that is to be applied to dierent aributes that are loaded using the preprocessor. Currently
this only aects CUDA Blocksizes: ese are loaded using CUDA_BLOCKSIZE_X_[Template-Name] from the
preprocessor (storage_order.F90 is the suggested place to dene these). e goal of this aribute is to hoist
hardware dependent aributes outside of the code in order to more easily facilitate performance portability.
A.2 Domain Dependant Construct
Listing 12 shows the domain dependant construct. It is used to give hints to the framework how data objects are to be
adjusted in the transformation process. is allows passing on the responsibility of rewriting certain aspects of data
specications and accesses to the framework. It should be noted that the framework only operates on local information
available to each subroutine. As an example, whether a data object has already been copied to the GPU is not being
12Hybrid Fortran master: hps://github.com/muellermichel/Hybrid-Fortran
Manuscript submied to ACM
32 M. Mu¨ller and T. Aoki
analyzed. Domain Dependant constructs are required to be specied between the specication and the implementation
part of a Fortran subroutine.
Listing 12. Domain dependant construct syntax.
@domainDependant{ATTRIBUTE_NAME1(MEMBER1 , MEMBER2 , ...), ...}
! data objects that share the attributes !
! defined above to be defined here , separated !
! by commas !
...
@end domainDependant
e following aributes are supported for this construct:
domName Set of all domain names in which the data object needs to be privatized. is is required to be a superset of
the domains that are being declared as the data object’s dimensions in the specication part of the subroutine
(except if the autoDom aribute ag is used, see below). More specically, the domain names specied here
must be the set of domains from the specication part plus the parallel domains (as specied using the parallel
region construct, see section A.1) for which privatization is needed.
domSize Set of the domain dimensions in the same order as their respective domain names specied using the domName
aribute. It is required that |domName | = |domSize |.
accPP Preprocessor macro name that takes |domSize | arguments and outputs them comma separated in the current
storage order for data object accesses. is macro must be dened in the le storage_order.F90. In case the
autoDom aribute is being used, the accPP specication is not necessary - AT, AT4, AT5 (…) are assumed as
dened storage order macro names, depending on the number of array dimensions.
domPP Preprocessor macro name that takes |domSize | arguments and outputs them comma separated in the current
storage order for the data object declaration. is macro must be dened in the le storage_order.F90.
In case the autoDom aribute is being used, the domPP specication is not necessary - DOM, DOM4, DOM5 (…)
are assumed as dened storage order macro names, depending on the number of array dimensions. is
preprocessor macro is usually identical to the one dened in accPP.
attribute Aribute ags for these data objects. Currently the following ags are supported:
present In case this ag is specied, the framework assumes array data to be already present on the device
memory for GPU compilation and the data will not be transferred.
transferHere In case this ag is specied, all domain dependants with intent specied as in, inout or out
will be transferred to- and from the device (according to the intent). is ag may not be specied together
with the present ag as it has exactly opposite eects.
autoDom In case this ag is specied, the framework will use the array dimensions that have been declared
using standard Fortran syntax to determine the domains for each data object. In case the parallel domains
are omied from the Fortran specication (in order to allow dierent parallelization for CPU and GPU),
they still need to be specied using domName and domSize for data objects that are to be privatized for
each thread. e parallel domains will be inserted before any independent domains picked up through
the declaration, depending on the subroutine’s position towards the parallel region. In addition, using
autoDom will by default enable standard accPP and domPP seings, if not specied otherwise. Using this
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 33
ag then greatly simplies the @domainDependant specication part, since the construct can be reused by
data objects of dierent domains.
host In subroutines or modules that are not related to a parallel region in their call graph, Hybrid Fortran
would not know whether a data object resides on the host or the device. is aribute is thus used to
specify this property unambiguously.
It is important to consider that in case neither present nor transferHere ags are used, Hybrid Fortran will
automatically transfer all domain dependants with their Fortran intent specied (as in, inout or out) from and/or to
the device according to their intent in case of a routine containing a GPU parallel region.
B HARDWARE PERFORMANCE
Table 8. Hardware performance numbers referred to in this work
Term Hardware Sustained Unit System Comment
Performance
PH 1C Xeon X5670 9.50 GFLOP/s TSUBAME 2.5 In double precision [14]
(1 Core)
PH Xeon X5670 57 GFLOP/s TSUBAME 2.5 In double precision [14]
(1 Socket)
PD NVIDIA 1030 GFLOP/s TSUBAME 2.5 Linpack DP perf. [31]
Tesla K20x
PD NVIDIA 3900 GFLOP/s Reedbush-H, Linpack DP perf. [47]
Tesla P100 Piz Daint
BWH 1C Xeon X5670 9.80 GB/s TSUBAME 2.5 according to Intel specications[14]
(1 Core)
BWH Xeon X5670 20.5 GB/s TSUBAME 2.5 according to Intel specications [14],
(1 Socket) conrmed by STREAM benchmark
BWH Xeon E5-2670 51.2 GB/s Piz Daint according to Intel specications[15]
(1 Socket)
BWH Xeon E5-2695 v4 76.8 GB/s Reedbush-H according to Intel specications [16],
(1 Socket) conrmed by STREAM benchmark
BWD NVIDIA 108.6 GB/s TSUBAME 2.0 Reported by CUDA bandwidth test
Tesla M2050
BWD NVIDIA 169.4 GB/s TSUBAME 2.5, Reported by CUDA bandwidth test
Tesla K20x Piz Daint,
BWD NVIDIA 499.4 GB/s Reedbush-H, Reported by CUDA bandwidth test
Tesla P100 Piz Daint, on Reedbush-H
TSUBAME 3.0
BWHtoD PCI Express 2.x 4.32 GB/s TSUBAME 2.5 Reported by CUDA bandwidth test
BWHtoD PCI Express 3.x 10.96 GB/s Reedbush-H, Reported by CUDA bandwidth test
Piz Daint on Reedbush-H
RAH Xeon X5670 0.12 GUP/s TSUBAME 2.5 Reported by
(1 Socket) RandomAccess benchmark
RAD NVIDIA 0.88 GUP/s TSUBAME 2.5 Reported by
Tesla K20x RandomAccess benchmark
Table 8 lists the relevant performance metrics that are used for the performance models and eciency discussions in
this work. For a denition of the terms, see Section 4. For the random access memory performance, the test included
with the HPC Challenge[22] benchmark suite has been used (employing the unit “Giga Updates Per Second”). For
the GPU, a CUDA port13 has been used while for the CPU test an open source version by Underwood et al. has been
employed[42] (using 220 table size and 1000 update sets). e GPU bandwidth is specied as reported by the NVIDIA
provided CUDA bandwidth test.
13hps://github.com/muellermichel/cuda randomaccess
Manuscript submied to ACM
34 M. Mu¨ller and T. Aoki
C GPGPU SPECIFIC PROGRAMMING TERMS
e following glossary gives an overview for terms commonly used for GPGPU programming with modern NVIDIA
architectures.
Block size Size of a block of threads that is fetched and loaded onto one [SMX] simultaneously. Can be dened
in 1D, 2D or 3D. e rst dimension should usually chosen as a multiple of the warp size in single
precision or as a multiple of a half warp size in double precision. Each thread in a block requires
register- and, if used, shared memory resources. us, there is a trade-o between the block size and
optimizing for the required number of registers per thread. If too many registers are used, spilling
occurs in which data is ooaded to the device memory.
CUDA NVIDIA’s proprietary GPGPU programming language, available for C, C++ and Fortran.
CUDA core NVIDIA’s marketing term for a combination of general purpose ALU and FPU on the GPU, each
able to operate on one CUDA thread.
CUDA Fortran A Fortran interface to the CUDA tool chain, supported by the Portland Group (PGI).
Device code Code that exclusively runs on the GPU. Two types of device code exist: Kernel routines and
device routines. e laer type is required for routines called within kernels. CUDA requires the
programmer to explicitly set the type of each routine.
Grid size Set of blocks that is loaded onto the GPU simultaneously. See also [Block size].
Host code Code that exclusively runs on the CPU.
Launch bounds Compile-time provided upper bounds for the problem size per block - this allows the CUDA compiler
to be more conservative with the assigned resources per thread and thus optimize for a higher
occupancy. is however requires the application to be recompiled for each new problem size to be
executed.
Occupancy e percentage of CUDA cores that are occupied on average during execution. Occupancy can get
lowered to to a number of issues, such as device memory stalls, high register pressure, high shared
memory pressure, miscongured block size congurations or problem sizes that are too small for
the architecture. GPU specic optimizations are usually targeted at increasing the occupancy since
this directly aects the runtime.
OpenACC An industry standard for directive based GPGPU computing, available for C, C++ and Fortran.
Notable implementations are available from Cray and Portland Group (PGI).
Cache Each streaming multiprocessor (see [SM / SMX]) oers the following caches: Shared Memory,
texture memory and L1 cache. Shared memory is used through explicitly programmed memory
operations, texture memory is used in a declarative way and L1 cache operates automatically for
device memory accesses, analogous to CPU cache architectures. Additionally an L2 cache is shared
among all multiprocessors on one GPGPU.
SIMT Single instruction, multiple threads execution model. An extension of the traditional Single instruc-
tion, multiple data (SIMD) model, in which branching within kernels is possible (with a loss of
performance however, see also [Warp]).
SM / SMX Streaming multiprocessor - NVIDIA’s marketing term for a set of CUDA cores that share certain
resources, such as an L1 and shared memory cache as well as a register le.
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 35
read One thread in the [SIMT] model. In GPU hardware such threads do not operate independently
however - see also [Warp]. Each thread occupies one CUDA core during execution.
Warp A set of threads that are executed using a single scheduler. reads within a warp operate in lockstep
on either the same operation or nop (in case of branching). In current and recent architectures the
warp size has been 32. For double precision, only each second thread per warp is executed (i.e. the
data length per CUDA core and warp scheduler is xed).
D REDUCEDWEATHER CONFIGURATION FOR PERFORMANCE TESTS
e following conguration has been used for the results discussed in Section 5.1:
(1) TSUBAME 2.5 has been used, i.e. Tesla K20x GPUs and Xeon X5670 CPUs.
(2) For the CPU performance measurements, ifort and icc with -fast option has been used.
(3) read anity has been set to compact in order to ensure that the comparison is being done to exactly one of
the two CPU socket.
(4) For the GPU performance measurements, PGI Accelerator version 15.04 with -fast option and compute
capability 3.x has been used.
(5) GPU results include memory copy time from and to device unless stated otherwise.
(6) Hybrid Fortran uses the OpenACC based backend implementation.
(7) Hybrid Fortran uses the L1-cache-preferred seing for GPUs.
(8) e number of time steps NT has been set to 100. Output to disk is only performed once at the end of the
simulation.
(9) All code measured has been open-sourced in the Hybrid Fortran GitHub repository14.
E ASUCA CONFIGURATION FOR TESTS WITH 301 X 301 X 58 GRID
e following conguration has been used for the results discussed in Section 5.3:
(1) 75 time steps
(2) All ported modules enabled as depicted in Figure 8
(3) Real model input data
(4) All oating point computations done in double precision
(5) Intel Fortran compiler with -fast -no-ipo seings has been used for the reference code running on CPU.
Versions 14.0.2 and 17.0.2 have been used for Westmere Xeon X5670 and Broadwell Xeon E5-2695 v4 respectively
(inter-procedural optimization has resulted in compiler errors).
(6) PGI accelerator has been used for the GPU measurements, with optimizations disabled (-O0) due to accuracy
problems; Versions 16.9 and 16.10 have been used for K20x and P100 respectively.
F ASUCA CONFIGURATION FOR TESTS WITH 1581 X 1301 X 58 GRID ON REEDBUSH-H
e following conguration has been used for the results discussed in Section 5.4:
(1) All ported modules have been enabled as depicted in Figure 8.
(2) 75 time steps have been executed.
(3) OpenMPI version 1.10.7 has been used.
14Hybrid Fortran: hps://github.com/muellermichel/Hybrid-Fortran/tree/0.95
Manuscript submied to ACM
36 M. Mu¨ller and T. Aoki
(4) e GPU implementation uses two MPI processes per node.
(5) e CPU implementation uses one MPI process per node, with OpenMP core anity seings to keep the
threads in a compact manner on each socket (in order to optimize cache locality).
(6) e horizontal domain (IJ) has been decomposed according to Appendix H.
(7) Output to le has been limited to the beginning and the end of the simulation.
(8) A single MPI process has been used only for problem setup and output on a separate node. is process is not
included in the number of processes shown in the x-axis of the result as well as Appendix H.
(9) Setup time is not included in the measured computation time and thus the speedup shown in Figure 12.
(10) For the CPU implementation, Intel compiler version 17.0.2 has been used with -fast and -no-ipo ags
(inter-procedural optimization has resulted in compiler errors).
(11) For the GPU implementation, PGI Accelerator version 16.10 has been used with compiler optimizations
disabled (an unresolved issue with incorrect computations has been identied in conjunction with compiler
optimizations).
G ASUCA CONFIGURATION FOR TESTS WITH 1581 X 1301 X 58 GRID ON TSUBAME 3.0
e following conguration has been used for the module- and communication impact results discussed in Section 5.4:
(1) All ported modules have been enabled as depicted in Figure 8.
(2) 36 time steps have been executed, with ∆T = 16.6s , thus having a simulation of 10 minutes physical time.
(3) OpenMPI version 2.1.2 has been used.
(4) e GPU implementation uses two MPI processes (i.e. GPUs) per node.
(5) e horizontal domain (IJ) has been decomposed according to Appendix H.
(6) A single MPI process has been used only for problem setup and output on a separate node. is process is not
included in the number of processes shown in the x-axis of the result as well as Appendix H.
(7) Both setup- and le output time is not included in the measured computation time and thus the speedup shown
in the gures.
(8) PGI Accelerator version 17.10 has been used with -O2 optimizations.
(9) Memory error correction (ECC) was enabled on GPU.
H ASUCA IJ DOMAIN DECOMPOSITION
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 37
Table 9. MPI IxJ domain decomposition by number of processes
# processes IxJ decomposition
(CPU and GPU)
7 1 x 7
8 2 x 4
9 1 x 9
10 2 x 5
11 1 x 11
12 3 x 4
13 1 x 13
14 7 x 2
15 3 x 5
16 4 x 4
17 1 x 17
18 3 x 6
19 1 x 19
20 2 x 10
21 3 x 7
22 2 x 11
23 23 x 1
24 2 x 12
25 5 x 5
26 2 x 13
27 3 x 9
# processes IxJ decomposition
(GPU only)
28 2 x 14
30 3 x 10
32 2 x 16
34 2 x 17
36 3 x 12
38 2 x 19
40 4 x 10
42 2 x 21
44 4 x 11
46 2 x 23
48 4 x 12
50 2 x 25
52 4 x 13
I SOURCES
is section presents the sources for the reduced weather application15, as well as an extract from ASUCA’s source code,
as discussed in this work. Please note that ASUCA is a closed source application and can therefore not be presented in
full.
I.1 Reduced Weather: Diusion
Listing 13. Diusion.
subroutine diffuse( &
& energy_u , energy)
! ..more specifications ..
real (8),intent(out):: energy_u (0:nx+1,0:ny+1,nz)
real (8),intent(in):: energy (0:nx+1,0:ny+1,nz)
! ------- inner region IJK: ------------
do j = 1,ny
do i = 1,nx
do k = 2, nz -1
energy_u(i,j,k) = energy(i,j,k) * &
& (1.0d0 - 6 * diffusion_velocity) &
& + diffusion_velocity * ( &
& energy(i-1,j,k) + energy(i+1,j,k) + &
& energy(i,j-1,k) + energy(i,j+1,k) + &
& energy(i,j,k-1) + energy(i,j,k+1) &
& )
end do
end do
end do
15e complete source code for the reduced weather application can also be compiled and run from the Hybrid Fortran GitHub repository: hps:
//github.com/muellermichel/Hybrid-Fortran/tree/0.95/examples/simple weather
Manuscript submied to ACM
38 M. Mu¨ller and T. Aoki
! -- boundary regions IJ, IK, JK: -----
do j = 1,ny
do i = 1,nx
energy_updated(i,j,1) = (1.0d0 - 5 * diffusion_velocity) &
& * energy(i,j,1) + diffusion_velocity * ( &
& energy(i-1,j,1) + energy(i+1,j,1) + &
& energy(i,j-1,1) + energy(i,j+1,1) + &
& energy(i,j,2) &
& )
energy_updated(i,j,nz) = (1.0d0 - 5 * diffusion_velocity) * &
& energy(i,j,nz) + diffusion_velocity * ( &
& energy(i-1,j,nz) + energy(i+1,j,nz) + &
& energy(i,j-1,nz) + energy(i,j+1,nz) + &
& energy(i,j,nz -1) &
& )
end do
end do
do i = 0,nx+1
do k = 1,nz
energy_updated(i,0,k) = (1.0d0 - 2 * diffusion_velocity) &
& * energy(i,0,k) + diffusion_velocity * ( &
& energy(i,1,k) + energy(i,ny+1,k) &
& )
energy_updated(i,ny+1,k) = (1.0d0 - 2 * diffusion_velocity) &
& * energy(i,ny+1,k) + diffusion_velocity * ( &
& energy(i,ny,k) + energy(i,0,k) &
& )
end do
end do
do j = 0,ny+1
do k = 1,nz
energy_updated (0,j,k) = (1.0d0 - 2 * diffusion_velocity) &
& * energy(0,j,k) + diffusion_velocity * ( &
& energy(1,j,k) + energy(nx+1,j,k) &
& )
energy_updated(nx+1,j,k) = (1.0d0 - 2 * diffusion_velocity) &
& * energy(nx+1,j,k) + diffusion_velocity * ( &
& energy(nx,j,k) + energy(0,j,k) &
& )
end do
end do
end subroutine
I.2 Reduced Weather: Code Generated by Hybrid Fortran for Diusion
Listing 14. Diusion code generated for CUDA Fortran
attributes(global) subroutine hfk0_diffuse (...)
! ... use statements ...
real (8), device :: energy_u (0:nx+1,0:ny+1,nz)
real (8), device :: energy (0:nx+1,0:ny+1,nz)
! ... further specifications and initializations
i = (blockidx%x - 1) * blockDim%x + threadidx%x + 1 - 1
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 39
j = (blockidx%y - 1) * blockDim%y + threadidx%y + 1 - 1
if (i .GT. nx .OR. j .GT. ny) then
return
end if
! .... computational code within parallel region ...
end subroutine
subroutine diffuse(thermal_energy_updated , thermal_energy)
! ... use statements ...
real (8), intent(out) :: energy_u (0:nx+1,0:ny+1,nz)
real (8), device :: energy_u_hfdev (0:nx+1,0:ny+1,nz)
real (8), intent(in) :: energy (0:nx+1,0:ny+1,nz)
real (8), device :: energy_hfdev (0:nx+1,0:ny+1,nz)
! ... further specifications ...
energy_u_hfdev (:,:,:) = 0
if ( nx+1 - 0 + 1 .gt. 0 .and. ny+1 - 0 + 1 .gt. 0 .and. nz - 0 .gt. 0 ) then
energy_hfdev (:,:,:) = energy (:,:,:)
end if
! ... further initializations ...
cugridSizeX = ceiling(real(nx) / real (32))
cugridSizeY = ceiling(real(ny) / real (16))
cugridSizeZ = 1
cugrid = dim3(cugridSizeX , cugridSizeY , cugridSizeZ)
cublock = dim3(32, 16, 1)
call hfk0_diffuse <<< cugrid , cublock >>>( &
! ... passing in required scalars
energy_u_hfdev , energy_hfdev &
& )
! ... error handling code ...
! ... repeat the above for all boundary regions ...
if ( nx+1 - 0 + 1 .gt. 0 .and. ny+1 - 0 + 1 .gt. 0 .and. nz - 0 .gt. 0 ) then
energy_u (:,:,:) = energy_u_hfdev (:,:,:)
end if
end subroutine
I.3 Reduced Weather: Physical Processes
Listing 15. Physical processes.
subroutine run_physics (&
& energy , energy_surf , energy_pbl)
! ..more specifications ..
real (8),intent(inout):: energy (0:nx+1,0:ny+1,nz)
real (8),intent(in):: energy_surf (0:nx+1,0:ny+1)
real (8),intent(in):: energy_pbl (0:nx+1,0:ny+1)
do j = 0,ny+1
do i = 0,nx+1
call radiate(energy(i,j,:))
call exchange_heat_with_boundary( &
& energy(i,j,:), energy_surf(i,j), 1)
call exchange_heat_with_boundary( &
& energy(i,j,:), energy_pbl(i,j), nz)
end do
end do
end subroutine
Manuscript submied to ACM
40 M. Mu¨ller and T. Aoki
I.4 Reduced Weather: Radiation
Listing 16. Radiation process.
subroutine radiate(energy)
! ..more specifications ..
real (8), intent(inout), dimension(nz) :: energy
radiation_intensity = 0.1d0
do k=1,nz
energy(k) = energy(k) + radiation_intensity
end do
end subroutine
I.5 Reduced Weather: Heat Exchange
Listing 17. Surface / planetary heat exchange.
subroutine exchange_heat_with_boundary( &
& energy , boundary_energy , boundary_level)
! ..more specifications ..
real (8), intent(inout), dimension(nz) :: energy
real (8), intent(in) :: boundary_energy
integer (4), intent(in) :: boundary_level
transfer_velocity = 0.01d0
energy_transfer_to_boundary = &
& transfer_velocity * &
& (energy(boundary_level) - boundary_energy)
energy(boundary_level) = energy(boundary_level) &
& - energy_transfer_to_boundary
end subroutine
I.6 Reduced Weather: Time Integration
Listing 18. Main time integration loop.
module simple_weather
real (8),pointer :: energy (:,:,:)
real (8),pointer :: energy_u (:,:,:)
! .. initialization , cleanup and output code ..
subroutine simulate( &
& start_time , end_time , timestep , output_timestep)
! ..more specifications ..
real (8),pointer :: energy_temp (:,:,:)
time = start_time
do while (.true.)
if (modulo(time + 0.001d0, output_timestep) &
& < 0.01d0) then
call write_data(energy , "energy", time)
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 41
end if
call run_physics( &
& energy , energy_surf , energy_pbl)
call diffuse(energy_u , energy)
energy_temp => energy_u
energy_u => energy
energy => energy_temp
time = time + timestep
if (time > end_time) then
return
end if
end do
end subroutine
end module
ACKNOWLEDGMENTS
is work has been supported by the Japan Science and Technology Agency (JST) Core Research of Evolutional Science
and Technology (CREST) research program “Highly Productive, High Performance Application Frameworks for Post
Peta-scale Computing”, by KAKENHI Grant-in-Aid for Scientic Research (S) 26220002 from the Ministry of Education,
Culture, Sports, Science and Technology (MEXT) of Japan, by “Joint Usage/Research Center” for Interdisciplinary Large-
scale Information Infrastructures (JHPCN)” and “High Performance Computing Infrastructure (HPCI)” as well as by the
“Advanced Computation and I/O Methods for Earth-System Simulations” (AIMES) project running under the German-
Japanese priority program “Soware for Exascale Computing” (SPPEXA). e authors thank the Japan Meteorological
Agency for their extensive support, Tokyo University and the Global Scientic Information and Computing Center
at Tokyo Institute of Technology for the use of their supercomputers Reedbush-H and TSUBAME 2.5, as well as Dr.
Christian Conti and Dr. Mohamed Aia for their careful proof-reading.
REFERENCES
[1] Andrey Bokhanko. 2014. OpenMP 4 Support in Clang / LLVM. (2014). Retrieved 2017-12-22 from OpenMP.org: hp://mail.openmp.org/sc14/
BoF Intel Andrey Clang.pdf.
[2] Ben Cumming, Carlos Osuna, Tobias Gysi, Mauro Bianco, Xavier Lapillonne, Oliver Fuhrer, and omas C Schulthess. 2013. A review of the
challenges and results of refactoring the community climate code COSMO for hybrid Cray HPC systems. Proceedings of Cray User Group 2013
(2013), 1–11.
[3] Craig C Douglas, Jonathan Hu, Markus Kowarschik, Ulrich Ru¨de, and Christian Weiß. 2000. Cache optimization for structured and unstructured
grid multigrid. Electronic Transactions on Numerical Analysis 10 (2000), 21–40.
[4] Hikmet Dursun, Ken-ichi Nomura, Weiqiang Wang, Manaschai Kunaseth, Liu Peng, Richard Seymour, Rajiv K Kalia, Aiichiro Nakano, and Priya
Vashishta. 2009. In-Core Optimization of High-Order Stencil Computations. In PDPTA. CSCE, 533–538.
[5] H Carter Edwards, Christian R Tro, and Daniel Sunderland. 2014. Kokkos: Enabling manycore performance portability through polymorphic
memory access paerns. J. Parallel and Distrib. Comput. 74, 12 (2014), 3202–3216.
[6] Oliver Fuhrer. 2014. Grid Tools: Towards a library for hardware oblivious implementation of stencil based codes. Retrieved 2017-12-22 from
hp://www.pasc-ch.org/projects/2013-2016/grid-tools. (2014).
[7] Oliver Fuhrer, Carlos Osuna, Xavier Lapillonne, Tobias Gysi, Ben Cumming, Mauro Bianco, Andrea Arteaga, and omas Christoph Schulthess.
2014. Towards a performance portable, architecture agnostic implementation strategy for weather and climate models. Supercomputing frontiers
and innovations 1, 1 (2014), 45–62.
[8] Mark Gove. 2012. F2C-ACC Users Guide, Version 4.2. (2012). Retrieved 2017-12-22 from NOAA: hp://www.esrl.noaa.gov/gsd/ab/ac/Accelerators.
html.
[9] Mark Gove, Jacques Middleco, and Tom Henderson. 2010. Running the NIM next-generation weather model on GPUs. In Proceedings of the 2010
10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing. IEEE Computer Society, 792–796.
Manuscript submied to ACM
42 M. Mu¨ller and T. Aoki
[10] Mark Gove, Jacques Middleco, and Tom Henderson. 2014. Directive-based Parallelization of the NIM Weather Model for GPUs. In Proceedings of
the First Workshop on Accelerator Programming Using Directives (WACCPD ’14). IEEE Press, Piscataway, NJ, USA, 55–61. hps://doi.org/10.1109/
WACCPD.2014.9
[11] Mark Gove, Jim Rosinski, Jacques Middleco, Tom Henderson, Jin Lee, Alexander MacDonald, Ning Wang, Paul Madden, Julie Schramm, and
Antonio Duarte. 2017. Parallelization and Performance of the NIM Weather Model on CPU, GPU and MIC Processors. Bulletin of the American
Meteorological Society (2017).
[12] e Portland Group. 2012. PGI Accelerator Compilers with OpenACC Directives. (2012). Retrieved 2017-12-22 from e Portland Group:
hp://www.pgroup.com/resources/accel.htm.
[13] Mark Harris. 2007. Optimizing CUDA. SC07: High Performance Computing With CUDA (2007).
[14] Intel. 2010. Intel Xeon Processor X5670. (2010). Retrieved 2017-12-22 from Intel: hp://ark.intel.com/products/47920/
Intel-Xeon-Processor-X5670-12M-Cache-2 93-GHz-6 40-GTs-Intel-QPI.
[15] Intel. 2012. Intel Xeon Processor E5-2670. (2012). Retrieved 2017-12-22 from Intel: hp://ark.intel.com/products/64595/
Intel-Xeon-Processor-E5-2670-20M-Cache-2 60-GHz-8 00-GTs-Intel-QPI.
[16] Intel. 2016. Intel Xeon Processor E5-2695 v4. (2016). Retrieved 2017-12-22 from Intel: hp://ark.intel.com/products/91316/
Intel-Xeon-Processor-E5-2695-v4-45M-Cache-2 10-GHz.
[17] Junichi Ishida, Chiashi Muroi, Kohei Kawano, and Yuji Kitamura. 2010. Development of a new nonhydrostatic model ASUCA at JMA. CAS/JSC
WGNE Research Activities in Atmospheric and Oceanic Modelling 40 (2010), 0511–0512.
[18] NVIDIA Inc. James Beyer. 2016. Targeting GPUs with OpenMP 4.5 Device Directives. (2016). Retrieved 2017-12-22 from NVIDIA: hp:
//on-demand.gputechconf.com/gtc/2016/presentation/s6510-je-larkin-targeting-gpus-openmp.pdf.
[19] Cray Inc. James C. Beyer. 2013. e use of OpenACC and OpenMP Accelerator directives with the Cray Compilation Environment
(CCE). (2013). Retrieved 2017-12-22 from GPU Technology Conference: hp://on-demand.gputechconf.com/gtc/2013/presentations/
S3084-OpenACC-OpenMP-Directives-CCE.pdf.
[20] Jan Kwiatkowski. 2001. Evaluation of parallel programs by measurement of its granularity. In International Conference on Parallel Processing and
Applied Mathematics. Springer, 145–153.
[21] Xavier Lapillonne and Oliver Fuhrer. 2014. Using compiler directives to port large scientic applications to GPUs: An example from atmospheric
science. Parallel Processing Leers 24, 01 (2014).
[22] Piotr R Luszczek, David H Bailey, Jack J Dongarra, Jeremy Kepner, Robert F Lucas, Rolf Rabenseifner, and Daisuke Takahashi. 2006. e HPC
Challenge (HPCC) benchmark suite. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing. 213.
[23] Naoya Maruyama, Tatsuo Nomura, Kento Sato, and Satoshi Matsuoka. 2011. Physis: An Implicitly Parallel Programming Model for Stencil
Computations on Large-scale GPU-accelerated Supercomputers. In Proceedings of 2011 International Conference for High Performance Computing,
Networking, Storage and Analysis (SC ’11). ACM, New York, NY, USA, Article 11, 12 pages. hps://doi.org/10.1145/2063384.2063398
[24] John Michalakes and Manish Vachharajani. 2008. GPU acceleration of numerical weather prediction. In IPDPS (2009-06-29). IEEE, 1–7. hp:
//dblp.uni-trier.de/db/conf/ipps/ipdps2008.html#MichalakesV08
[25] Jarno Mielikainen, Bormin Huang, and Allen Huang. 2014. Using Intel Xeon Phi to accelerate the WRF TEMF planetary boundary layer scheme. In
SPIE Sensing Technology+ Applications. International Society for Optics and Photonics, 91240T–91240T.
[26] Jarno Mielikainen, Bormin Huang, Hung-Lung Allen Huang, and Mitchell D Goldberg. 2012. GPU acceleration of the updated Goddard shortwave
radiation scheme in the weather research and forecasting (WRF) model. IEEE Journal of Selected Topics in Applied Earth Observations and Remote
Sensing 5, 2 (2012), 555–562.
[27] Michel Mu¨ller and Takayuki Aoki. 2018. Hybrid Fortran: High Productivity GPU Porting Framework Applied to Japanese Weather Prediction Model.
In Accelerator Programming Using Directives, Sunita Chandrasekaran and Guido Juckeland (Eds.). Springer International Publishing, Cham, 20–41.
[28] Mahew R Norman, Azamat Mametjanov, and Mark Taylor. 2017. Exascale Programming Approaches for the Accelerated Model for Climate and
Energy. (2017).
[29] OpenACC. 2015. e OpenACC Application Programming Interface Version 2.5. (2015). Retrieved 2017-12-22 from OpenACC.org: hp:
//www.openacc.org/sites/default/les/inline-les/OpenACC 2pt5.pdf.
[30] Je Preshing. 2012. A Look Back at Single-readed CPU Performance. (2012). Retrieved 2017-12-22: hp://preshing.com/20120208/
a-look-back-at-single-threaded-cpu-performance.
[31] Timothy Pricke Morgan. 2012. NVIDIA launches not one but two Kepler2 GPU coprocessors. (2012). Retrieved 2017-12-22 from e Register:
hp://www.theregister.co.uk/2012/11/12/nvidia tesla k20 k20x gpu coprocessors/?page=2.
[32] Alistair P Rendell, B Ch Apm An, and Mahias S Mu¨ller. 2013. OpenMP in the Era of Low Power Devices and Accelerators. (2013).
[33] Greg Ruetsch, Evere Phillips, and Massimiliano Fatica. 2010. GPU acceleration of the long-wave rapid radiative transfer model in WRF using
CUDA Fortran. In Many-Core and Recongurable Supercomputing Conference.
[34] M Sakamoto, J Ishida, K Kawano, K Matsubayashi, K Aranami, T Hara, H Kusabiraki, C Muroi, and Y Kitamura. 2014. Development of Yin-Yang
Grid Global Model Using a New Dynamical Core ASUCA. (2014).
[35] Takashi Shimokawabe, Takayuki Aoki, Junichi Ishida, Kohei Kawano, and Chiashi Muroi. 2011. 145 TFlops performance on 3990 GPUs of TSUBAME
2.0 supercomputer for an operational weather prediction. Procedia Computer Science 4 (2011), 1535–1544.
Manuscript submied to ACM
New GPGPU Code Transformation Framework Applied to Weather Prediction 43
[36] Takashi Shimokawabe, Takayuki Aoki, Chiashi Muroi, Junichi Ishida, Kohei Kawano, Toshio Endo, Akira Nukada, Naoya Maruyama, and Satoshi
Matsuoka. 2010. An 80-fold speedup, 15.0 TFlops full GPU acceleration of non-hydrostatic weather model ASUCA production code. In Proceedings
of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society, 1–11.
[37] Takashi Shimokawabe, Takayuki Aoki, and Naoyuki Onodera. 2014. High-productivity Framework on GPU-rich Supercomputers for Operational
Weather Prediction Code ASUCA. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
(SC ’14). IEEE Press, Piscataway, NJ, USA, 251–261. hps://doi.org/10.1109/SC.2014.26
[38] Herb Suer. 2005. e free lunch is over: A fundamental turn toward concurrency in soware. Dr. Dobb’s journal 30, 3 (2005), 202–210.
[39] Irina Tezaur, Jerry Watkins, and Irina Demeshko. [n. d.]. Towards Performance-Portability of the Albany/FELIX Land-Ice Solver to New and
Emerging Architectures Using Kokkos. ([n. d.]).
[40] e OpenMP Architecture Review Board. 2013. OpenMP Application Program Interface Version 4.0. (July 2013). Retrieved 2017-12-22 from
OpenMP.org: hp://www.openmp.org/mp-documents/OpenMP4.0.0.pdf.
[41] TOP500. 2016. Top500 List November 2016. (2016). Retrieved 2017-12-22 from Top500.org: hps://www.top500.org/lists/2016/11.
[42] Keith D Underwood, Steven J Plimpton, Ronald B Brightwell, Courtenay T Vaughan, and Mike Davis. 2006. A Simple Synchronous Distributed-Memory
Algorithm for the HPCC RandomAccess Benchmark. Technical Report. Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States);
Sandia National Laboratories, Albuquerque, NM.
[43] Tokyo University. 2017. Reedbush Introduction (in Japanese). (2017). Retrieved 2017-12-22 from Tokyo University: hp://www.cc.u-tokyo.ac.jp/
system/reedbush/reedbush intro.html.
[44] Wim Vanderbauwhede and Tetsuya Takemi. 2013. An investigation into the feasibility and benets of GPU/multicore acceleration of the weather
research and forecasting model. In High Performance Computing and Simulation (HPCS), 2013 International Conference on. IEEE, 482–489.
[45] Louis J Wicker and William C Skamarock. 2002. Time-spliing methods for elastic models using forward time schemes. Monthly weather review
130, 8 (2002), 2088–2097.
[46] Samuel Williams, Andrew Waterman, and David Paerson. 2009. Rooine: An Insightful Visual Performance Model for Multicore Architectures.
Commun. ACM 52, 4 (April 2009), 65–76. hps://doi.org/10.1145/1498765.1498785
[47] Rengan Xu, Frank Han, and Nishanth Dandapanthu. 2017. Application Performance on P100-PCIe GPUs. (2017). Re-
trieved 2017-12-22 from Dell: hp://en.community.dell.com/techcenter/high-performance-computing/b/general hpc/archive/2017/03/14/
application-performance-on-p100-pcie-gpus.
Received July 2017; revised January 2018 (minor); accepted February 2018
Manuscript submied to ACM
