Hybrid architectures utilizing GPUs provide a unique opportunity in a high performance computing environment. However, there are many legacy codes, particularly written in Fortran, that can not take immediate advantage of GPUs. Furthermore, many of these codes are under active development and so completely rewriting the code may not be an option. The advanced circulation and storm surge finite element model (ADCIRC) is one such code base. In this paper we present our semi-automatic methodology for porting portions of ADCIRC to run on the GPU and some preliminary scaling results of these subroutines. We have implemented a C++ array class and pre-processor macros to create a type of application framework to simplify the conversion and maintenance tasks. This allows the C++ syntax to be similar to Fortran, to provide for a more straight forward syntactical conversion from the original Fortran to C++ and simplified calling conventions between the two. After the necessary subroutines are converted to the C++ framework, the CUDA library can be easily used and also we are able to provide a simplified abstraction layer for accessing basic GPU functionality. For example, the problem of transferring the correct data on/off the GPU is addressed by our framework by a one time code change and a script to resolve data dependencies. Although it is currently specific to ADCIRC, our framework provides a starting point for utilizing GPUs with legacy Fortran codes, from which more specific GPU optimizations can be implemented.
Introduction
ADCIRC is a computational model for solving time-dependent, free surface circulation and transport problems in two-and three-dimensions. The model employs a finite element method to solve the so-called shallow water equations, which are extensively used to model flows for processes such as tides, storm surge, and flows within ocean basins, on shelves, in bays, through inlets, in lakes, in rivers, and adjacent flood-plains. The "operational" version of the ADCIRC model, though still a research code, has been extensively used by a number of agencies including the U.S. Army Corps of Engineers, the Federal Emergency Management Agency, the National Oceanic and Atmospheric Agency, and the LSU Hurricane Center to name a few. ADCIRC model predictions have shown good agreement with measured field data for a number of applications, including hind-casts of high water marks for many historic and recent storms in Southern Louisiana; see, for example, [1, 2, 3] .
Over the past several years, a "next-generation" ADCIRC model has been in development that uses a discontinuous Galerkin (DG) finite element approach. The goal of the DG ADCIRC model development has been to incorporate and substantially improve upon the best features of the present "operational" ADCIRC model, which uses a continuous Galerkin (CG) approach. DG methods have been shown to be particularly robust and highly accurate in advection dominated flow scenarios, such as those encountered in flow through inlets and channels in coastal regions [4, 5] and have exhibited excellent parallel scalability and efficiency on up to thousands of processors [6, 7] . One notable enhancement in terms of numerics is a p refinement option that has been implemented using a family of orthogonal, hierarchical basis functions that allow the use of any order of approximation p over an element, which can vary element-by-element and which be adapted dynamically in time [5] ; dynamic p-adaptivity has been studied in [7] . The DG code has also been verified on a number of analytic test cases and validated on a number of field scale applications -most recently it has been used to simulate Hurricane Ike in the Gulf of Mexico [8] .
In this paper, we will describe our semi-automatic method for porting the depth-integrated, two-dimensional DG ADCIRC model using linear (p = 1) elements [5, 9] to a GPU. In the process of doing this, we have developed a framework from which we can further optimize the ADCIRC model on the GPU. We will also consider a naïve case of parallelization, to examine what might happen by automatically generating CUDA kernels from the computational subroutines in ADCIRC. To assess this parallelization, we consider the scalability of each of the different subroutines. Figure 1 shows an overview of the work-flow that we followed to ultimately generate our output files. The Fortran source code files of the computational subroutines that need to be run on the GPU are converted to C++ by the f2c.sh script and is discussed in Section 2. The Fortran source code files of the computational subroutines and the Fortran modules are used as input to the findeps.sh script to generate a series of header files to help automate variable declarations and data transfers to/from the GPU, and is discussed in Section 3. The other three header files help to form a part of our framework and simplify the development of the findeps.sh and f2c.sh scripts. The AllocatableArray.h header file is discussed in Section 2.2.1. All of these files are used to link a final binary of ADCIRC that will utilize the GPU and the overall strategy for how these files will work in conjunction is discussed in Section 4. Finally, the results of the simple parallelization example are discussed in Section 5. The work-flow that we followed to generate all the files necessary to begin executing ADCIRC on a GPU.
Porting Methodology
The overall porting strategy is divided into three phases: converting the relevant subroutines from Fortran to C++, adding the appropriate CUDA library calls, and finally a preliminary parallelization effort. At this point, the preexisting structure of ADCIRC is the guiding decision maker for how to accomplish these tasks. A simplified function call graph of ADCIRC is shown in in Figure 2 . ADCIRC has six computational subroutines that need to be ported to the GPU: ocean edge hydro, land edge hyd internal edge hydro, rhs dg hydro, numerical flux, llf flux 1 .
Alternative Methods
It should be emphasized that our goal is not to create a general method for automatically using Fortran code with a GPU. Rather, we are acknowledging the fact that to fully realize the GPU's potential requires problem-specific decisions and optimizations. Therefore, our goal is to create a framework (which is currently more specific than not to ADCIRC) that can be used so that the development time can be spent on optimizing the GPU code rather than developing infrastructure code. The following are some other methods that were considered for porting ADCIRC to use the GPU.
The PGI accelerator [10] is a commercial product that uses source code annotations (like OpenMP) to specify which loops will execute on the GPU. Initially, the use of this product seemed promising, however, it lacked the (essential) feature to share data across subroutine calls, resulting in poor performance due to excessive GPU/CPU data transfer 2 . Also, some subroutines simply could not be made to run faster on the GPU, and at that point it is not clear what to change because you do not really know what the compiler is doing behind the scenes and also because you are only provided the limited set of PGI compiler directives, until they release the next version of the compiler, hopefully supporting newer features. F2C-ACC [11] still requires annotations, but is unable to cope with certain Fortran syntax (especially relating to I/ O), does not support sharing data across subroutines, and the resulting C code becomes less legible (which is contrast to our method of attempting to hide certain syntax behind C++ classes and pre-processor macros). Therefore, it seemingly does not offer any advantage over the PGI accelerator, at least in our case.
PGI CUDA-Fortran [12] is another commercial product that allows CUDA kernels to be written and used directly in the Fortran code, removing the need to even translate the code to C/C++. Initially this product was unstable, but now it simply lacks any sort of cost benefit. Our methodology is simplified to such a degree that maintaining a commercial license for CUDA-Fortran has no cost (or time) benefit to our project. Furthermore, as heterogeneous architectures become increasingly popular, we are constantly looking for "standardized" solutions (such as OpenCL), so it is not even clear that CUDA will even be the viable long term solution, increasing the risk in investing in this type of software (and again, without any perceived benefit).
Corrigan et al. [13, 14] have implemented semi-automatic porting script for the legacy CFD code FEFLO. It is mentioned here because it is in nearly all ways relevant to our work. However, just as our methodology is somewhat proprietary to ADCIRC, theirs is specific to FEFLO. It relies upon the consistent syntax used in the FEFLO code and the OpenMP annotations, therefore it would not be possible to directly use their script, even with minor modifications. Additionally, their python script is not generally available, so it is not possible to even try to modify it. However, there is a good description of their methodology and they are acknowledged here for their inspiration and ideas.
Fortran to C++ Conversion
The relevant subroutines of ADCIRC were syntactically translated from Fortran to C++ by using a sed script. While perhaps this lacks some flexibility, its advantage is simplicity. More specifically, the script was created and used after a couple hours of development, and there are no subtle bugs to worry about because the translation is limited to simple (and easily extensible) search/replace cases and so the resulting output will generally either work or result in a syntax error during compilation. This technique is obviously not meant to cover all of the Fortran syntax, but ADCIRC does not utilize esoteric Fortran syntax. ADCIRC does, however, use a mixture of FORTRAN 77 and Fortran 90.
Briefly, we dealt with the translation in the following way:
1. Clean up the input source a bit to make regular expression processing easier. Remove line continuations (including the actual newline, to make it easier to add semicolons later). Trim excess line space. Convert comments ("!" and "C" to "//"). Remove unneeded syntax ("::"). Convert upper case to lower case. Convert operators (".eq." to "==", etc). 2. Convert do loops by parsing the arguments with regular expressions. 3. Convert if statements by translating "then" to "{", "else" to "} else {", "end if" to "}", etc. 4. Convert variable type names to pre-named typedefs ("real(8)" to "T DOUBLE", "real(4)" to "T SINGLE", etc). Also take care of "allocatable" arrays by converting "(:,:,:)" to " (3)", so that the dimension gets passed as an argument to the AllocatableArray class constructor. 5. Convert numbers (exponents in Fortran are specified after a "d", while in C they are specified after an "e"). 6. Attempt to deal with "**". This is difficult to do for the general case in regular expressions and without a proper parser. However, ADCIRC only uses pow a single time in the computational subroutines. 7. Convert remaining Fortran statements such as return/stop/exit, type casting such as dble/int, remaining program statements such as program/end/subroutine/end subroutine, etc. Most of these are just converted back to uppercase and a pre-processor function-like macro is defined to provide similar functionality (stop becomes STOP becomes abort()). Subroutine declarations have their arguments prefixed with & so that all of the variables are automatically passed by reference (to match Fortran's calling standard), and hence none of the call statements needs to be changed. The use statement gets converted to an #include statement, while implicit gets removed. 8. Add semicolons. Basically at this point anything without a # at the beginning of a line or a { at the end of a line would need a semicolon. A final pass can be made to remove extraneous semicolons. 9. (Optional). Run the resulting output through something like astyle to make it look pretty.
As previously stated, this script obviously will not deal with all of the Fortran syntax. However, this relatively short script was sufficient enough to convert the computational subroutines of ADCIRC. The point here is that other subroutines (such as I/O, which can not only have complicated syntax, but also different back end semantics) can be left to execute from the original Fortran code. Also, as it was shown, another thing that allowed us to simplify the translation was the (ab)use of C pre-processor macros.
AllocatableArray Class
AllocatableArray is the templated C++ class that we implemented to provide some specific key features to assist in porting ADCIRC to the GPU. The class itself is not very sophisticated and the primary aspect of the implementation is to overload all relevant operators so that the variables of type AllocatableArray can be used syntactically similar to how the variables are used in Fortran. Naïvely this means overloading the () operator so that the underlying memory is properly accessed in unit stride by automatically transposing the array indices, however, there are a few more advantages. The = operator was overloaded to provide Fortran 90 style assignment syntax, again, helping to simplify the earlier code conversion. Furthermore, the = operator was optimized for both array to array and scalar to array assignments, by iterating directly over the array memory, rather than computing for multi-dimensional array offset addresses. If desired, this class can also check array bounds to more closely mimic Fortran, and this support can also be left out for performance. Additionally, because the class functions are all inlined, there is no negative effect on performance. Finally, all of the class functions are declared with both host and device , because having a uniform C++ code that runs the same on both the CPU and GPU made debugging and verification easier.
CUDA Calls
Relevant CUDA library calls need to be added at the appropriate parts in the code. In our framework, this entails adding calls to properly transfer data to/from the GPU.
Automatic Data Transfer
A basic script was written to do a data dependency analysis to automatically determine which data would need to be copied to/from the GPU. This is necessesary because nearly all of the variables in ADCIRC (even ones that would be considered local to a subroutine) are declared in Fortran modules. If we simply copied all the variables named in the module to the GPU then this would cause two major problems. First, a lot of time would be wasted in doing unnecessary data transfers; for example, many temporary/intermediate variables and loop counters would be copied back and forth, in addition to unused variables, which the compiler would have no way of optimizing since these are explicit copies. Second, and perhaps more importantly, accessing these variables would always incur a global memory access (since they are pointers to globally allocated memory) on the GPU and even prevent further optimization; locally declared variables are generally assigned to registers in the GPU, which the nvcc compiler aggressively optimizes. Furthermore, the data dependency analysis was simplified because of the simplified call structure ( Figure  2 ) and also because the only data that are passed as explicit arguments are passed to the numerical flux subroutine (which is not a kernel itself, but rather is called as a device subroutine from an existing kernel). There we must consider how pointers to relevant data are to be passed to the CUDA kernel calls.
One of the limitations of calling CUDA kernels is that they can only receive up to 256 bytes of arguments. The way ADCIRC is organized is that there are many global data structures that must be accessed in the various subroutines. This can lead to two problems, the potential of running out of argument space and possibly the inefficiency of constantly passing pointers to the exact same memory locations. Therefore, struct Gpuargs was created to store pointers to all of the relevant data structures and the single pointer to this data structure is the only argument passed to all of the kernel calls. Two copies of this struct are created, one on the host (host gpuargs) and one on the GPU (dev gpuargs). The pointers in host gpuargs are set to point to the Fortran data, and the pointers in dev gpuargs are initialized via cudaMalloc. As the subroutine declarations are converted from Fortran to C++, references to these structs are inserted into the subroutine's argument list 4 . The final step is to insert calls to cudaMemcpy to transfer the data to/from the GPU. The initialized data will be transferred a single time to the GPU prior to the start of the time stepping. The data will only be transferred back to the host when they are needed to output time series data. For our testing we typically output every 172, 800 time steps, or one day of simulation time.
The script that does the data dependency actually helps generate a series of header files for performing one of four operations on all of the dependent data. These files basically generate code to perform the desired actions for each variable of interest. As part of creating these files, actual dependency information is considered, as a means of optimization. For example, if a variable is never written in any of the CUDA kernels, then no output is created for this variable in the dev2host.h file, to prevent unnecessarily transferring data (that will never change) back to the host.
declare.h
This file will only be included a single time in a ".cu" file and serves two purposes. The first is to create a device pointer to a similarly named variable (by appending gpu on to the variable name), which will be the pointer used for future calls to cudaMalloc and cudaMemcpy. The second is to create a similarly named variable (by appending host) to reference the original Fortran data for use by the host and by future calls to cudaMemcpy. Recalling that Fortran appends an underscore to exported names (and also has a fancy naming scheme for variables declared in modules), the declaration in this file is accomplished by creating a C++ reference to the identically named variable with an underscore appended to the name (this appending is done via pre-processor macros). The Fortran variable must also be declared as extern so the compiler does not complain, and of course the linker will resolve this. As an example: 
use.h
This header file is intended to be included inside every subroutine converted to C++ and it provides the named declaration of the variable so that the subroutine can use the same name as was used in Fortran, otherwise it would be necessary (and messy) to rename the Fortran variables in the C++ code, and hence would further complicated the Fortran to C++ translation. As an example: # i f d e f i n e d ( USE GPU ) A l l o c a t a b l e A r r a y <type , dimension > example ( g p u a r g s −>example , I , J , K ) ; # e l s e A l l o c a t a b l e A r r a y <type , dimension > example ( e x a m p l e h o s t ) , I , J , K ) ; # e n d i f would be used to bring the variable example, of type REAL * (i.e. an array, possibly multi-dimensional) from the Fortran module global into the current scope. The constructor for AllocatableArray uses the reference to the data and does not allocate anything, or perform any external functionality that would be disallowed by a device function.
host2dev.h
This header file is included once. It generates the code to allocate the memory on the GPU and it transfers the corresponding (preprocessed) data from the host to the GPU. A self explanatory example is:
Execution Strategy
The new execution scheme for ADCIRC on the GPU is shown in Figure 3 . The read input, prep, and write results subroutines are left to run in Fortran, because ADCIRC has standardized file formats for the input and output that must be adhered to; keeping these subroutines in Fortran significantly reduces the time spent on porting ADCIRC, and is one less thing that needs to be maintained. The strategy being employed here is to minimize the time spent transferring data between the CPU and GPU. That is, rather than using a strict "accelerator" approach, the data should be left on the GPU for as long as possible, and only copy data for I/O or inter-process communications [15] .
Since the timestep subroutine still resides on the CPU, obviously some amount of fine-grained "control" data must be transferred between the host and the GPU for every time step. Our framework does not currently deal with this and so the code to copy these data (there are currently only three such variables) to the GPU must be manually inserted into the timestep subroutine.
Parallelization
One issue that needs to be addressed is how to break down the computational subroutines in to CUDA kernels for parallelization. Our framework makes no assumptions as to how this should be done. However, to try and understand what would happen in a simple case of automation, a wrapper subroutine (as inspired from [13] ) is easily generated, at least for every named subroutine and an example is shown in Listing 1. This could even be generalized to generating a wrapper for all loops; it would be up to the developer to decide whether or not to actually turn the loop in to a CUDA kernel, but if not then the extra subroutine calls would be optimized out by the compiler and so there would be no loss in doing this type of automation.
Listing 1: A wrapper subroutine can be generated, at least for the original subroutine, so that the developer can manually implement the proper CUDA kernel. It can be possible to extend this method to generating wrappers for each loop as well. One difficulty with generating wrapper functions to kernel calls (as also discussed in [13] ) is that data dependencies must be tracked through subroutine calls, which is very difficult if standardized naming conventions cannot be relied upon. Generating wrapper subroutines for each loop would compound this problem. However, our framework helps simplify this problem, because of the struct Gpuargs that gets passed to any generated subroutines (and the corresponding use.h header file to bring variable names in to the current scope) contains pointers to any required data on the GPU -meaning all variables are technically accessible from all generated subroutines, and unused variable names will simply be optimized out by the compiler. Listing 2 is an example of what the generated code might actually look like. 
ze ( k , j , 1 ) = ze ( k , j , n r k + 1 ) ; qx ( k , j , 1 ) = qx ( k , j , n r k + 1 ) ; qy ( k , j , 1 ) = qy ( k , j , n r k + 1 ) ; } } }
Subroutine Scaling
As an example of how our framework can be useful as a basis for converting ADCIRC to the GPU, we will convert the subroutines to use our framework, naïvely parallelize them, and then assess the scaling properties of the subroutines on an NVIDIA GTX-480 GPU. We will consider the increase in run time per time step of the subroutines as a finite element mesh is refined; efficient use of the GPU is indicated by less than linear scaling.
Parallelizing The Subroutines 5.1.1. internal edge hydro
This subroutine is difficult to parallelize because each internal edge will be touching two elements, and so there is a data contention problem if two concurrently running threads attempt to update the same entry in the right-hand side. It turns out that, at least in our test case, the proper ordering of the nodes (and hence edges), will prevent most of the contention problems. However, this is not the case in general, especially for an unstructured mesh and also there are still problems with our numbering scheme that caused errors to increase, however the model still remained stable.
ocean edge hydro and land edge hydro
These two subroutines are very similar to parallelize because they only touch one element; they implement two different boundary conditions. Therefore, they do not suffer from the contention problem that internal edge hydro had. However, the layout of the mesh can have an impact on these assumptions, because in the test mesh that was used for these results, one element in each corner shared two different boundary conditions, which reduced the accuracy of the solution.
rhs dg hydro
This subroutine was the easiest to parallelize because it loops over all of the elements and at a coarse grained level there is no synchronization necessary because there is no contention when constructing the right-hand side.
numerical flux
This subroutine is not actually parallelized because it is called by an existing CUDA kernel.
The Problem
For testing purposes, we consider a simple test case proposed by Lynch and Gray [16] . The domain of the problem examined is a rectangle given by [
The boundaries at y = 0, y = L, and x = x 1 are land or no-normal flow boundaries (corresponding to "land edges" in the mesh), and at x = x 2 a surface elevation is prescribed by a periodic tidal forcing function (corresponding to "ocean edges" in the mesh). In our numerical tests, we take the width of the domain in the y-direction to be L = 45 km, and in the x-direction, we take x 2 − x 1 = 2L. We consider a quadratic bathymetric profile and use a tidal amplitude of ζ 0 = 0.5 m, with a frequency of ω ≈ 0.0001405 s −1 , which corresponds to an M 2 tide. A linear friction factor of τ b f = 0.0001 s −1 is used. For the purposes of scaling, the mesh is successively refined by dividing both the horizontal and vertical spacing in half and is summarized in Table 1 (b). 
Results
The scaling results are shown in Figure 4 . All of the subroutines were using 256 threads per block, and the sufficient number of blocks to perform the computation. Prior to any subroutine parallelization, all of the curves grew linearly as the number of elements increased. After parallelization, the subroutines have nearly constant scaling until approximately 4, 000 elements are used, at which point the rhs dg hydro and internal edge hydro subroutines begin to show linear scaling, while land edge hydro and ocean edge hydro remain flat. The poor results for the updatea1 and updatea2 subroutines indicates that manual intervention is required to tune these kernels. These subroutines are responsible for updating data structures related to the Runge-Kutta timestepper. One reason land edge hydro and ocean edge hydro remain flat is because they are computing on only twice as many edges as the mesh is refined by a factor of four, as shown in Table 2 . The other subroutines do better with the rudimentary parallelization, but still are going to require hand tuning. One factor to consider is that the Fermi architecture has a data cache, which is definitely helping with these subroutines since most of the memory accesses are not yet properly coalesced and pitched memory is not being used. 
Conclusions
We have shown a method to assist in the porting of a legacy Fortran finite element code (ADCIRC) to utilize a GPU. This includes the use of a specialized C++ framework and several supporting scripts. The C++ framework relaxes the syntax requirements of the C++ code to help simplify and hence automate the code translation from Fortran to C++, and provides for some optimizations. Additionally, the framework provides a mechanism to help transfer data to/from the GPU. The framework was developed specifically for ADCIRC, however there are some concepts that could easily be generalized to other numerical codes. It should be emphasized that the approach we have taken is not a substitute for hand crafting the final GPU design, but rather to supplement that effort and potentially allow for easier co-development between the main code branch and the GPU version; the scaling results verify this.
Acknowledgements

