Ocean numerical models are steadily moving towards multiple physical processes and higher resolution, driven by the growth of measured ocean data and deeper research in ocean science. General-purpose computing capability can no longer meet the needs of these models, so more powerful hardware and parallel software are required to run them. China has made great progress in the research and development of homegrown high-performance processors, and the sunway sw26010 many-core processor is the most outstanding representative. This paper addresses the lag of ocean numerical model software behind homegrown processors, and presents, for the first time, a parallel implementation and optimization of the regional ocean modeling system (ROMS) on the sunway sw26010 many-core processor. Three programming methods are used: OpenACC*, athread with Fortran, and athread with C. We compare them in terms of programming method, workload and execution efficiency, which offers practical guidance for programmers who use sunway sw26010 many-core processors. The evaluation measures the execution times and speedups of the model kernel and of the total ROMS with different optimizations, input datasets and numbers of computing processing elements (CPEs). The results show that, compared with the original ROMS, the speedup of the optimized hotspot program reaches up to 3.69x.
I. INTRODUCTION
Oceans have become the second territory of human survival, and the utilization of ocean resources plays an increasingly important role in sustainable human development. Therefore, we should pay more attention to the oceans, understand them better, and accelerate innovative ocean research. With the growth of measured ocean data and deeper research in ocean modeling, ocean numerical models are moving towards multiple physical processes and higher resolution, and their demand for computing resources has also grown geometrically. For existing ocean numerical models, the amount of calculation increases by 8 to 10 times each time the resolution is doubled. Furthermore, if the resolution is increased to one kilometer, the corresponding amount of calculation increases by 100 to 1000 times. Hence, the rapidly increasing calculation of ocean numerical models places higher demands on computing capacity, and employing high-performance computers for ocean numerical models has become more necessary.
The associate editor coordinating the review of this manuscript and approving it for publication was Ligang He.
China has made great progress in the development of supercomputer hardware, especially in homegrown high-performance processors. The sunway sw26010 many-core processor [1] is the most excellent representative. However, software performance cannot be improved by upgrading hardware alone. The software of ocean numerical models does not yet match the homegrown high-performance processors, so their powerful computing resources cannot be fully utilized, and this technical lag in software has restricted the development of China's ocean research.
To develop a mature high performance ocean numerical model software, it is necessary to make detailed software analysis, migration and parallel optimization based on the specific CPU or accelerator. To achieve the coordinated development of software and hardware, we should focus on the research and development of software in ocean numerical model based on homegrown CPU, for example, the sunway sw26010 many-core processor.
The main contributions of this paper are as follows:
(1) For the first time, we make parallel implementation and optimization to regional ocean modeling system (ROMS) [2] based on sunway sw26010 many-core processor, and the program efficiency is improved.
(2) Based on the sunway sw26010 many-core processor, three programming methods are used in this paper: OpenACC*, athread with Fortran and athread with C. Moreover, we compare these programming methods with each other in terms of programming method, workload and execution efficiency. This work offers practical guidance for programmers who use the sunway sw26010 many-core processor.
The rest of this paper is organized as follows: Section II shows the background of this paper that includes ROMS and sunway sw26010 many-core processor. Section III presents the parallel implementation of ROMS based on sunway sw26010 many-core processor. Section IV details the parallel optimization of ROMS. Evaluations are presented in Section V, and Section VI introduces related work. Finally, Section VII provides the conclusions.
II. BACKGROUND
A. REGIONAL OCEAN MODELING SYSTEM (ROMS)
The Regional Ocean Modeling System (ROMS) is an open source 3D regional ocean model system, jointly developed by Rutgers University and the University of California, Los Angeles. It was first known as SCRUM (S-Coordinate Rutgers University Model), was later renamed ROMS, and has been widely used to simulate the hydrodynamics and water environment in ocean and estuarine areas [2]. Currently, there are three major versions of ROMS: the Rutgers University version, the ROMS_AGRIF version and the UCLA version. This paper employs the Rutgers University version of ROMS.
ROMS utilizes a coarse-grained parallel computing paradigm and multi-threading to compute three-dimensional grids [2]. It offers multiple solution options for the same problem, uses high-precision difference algorithms, and couples with multiple models, including an ecological model, a sediment transport model, a sea ice model, and the WRF model [3]. It can also simulate movements at different scales, such as global circulations, mesoscale flows and small-scale fluid movements. ROMS is written in Fortran and is able to run on both serial and parallel computers.
B. SUNWAY SW26010 MANY-CORE PROCESSOR
The processor is the key component of a computer hardware system, and its research and development are extremely complicated. In recent years, China has invested heavily in processors and made remarkable progress. Many homegrown processors have been developed and put into use, and the sunway sw26010 many-core processor is an excellent representative of them. The sunway sw26010 many-core processor has an instruction set and architecture with independent intellectual property rights, and has been successfully applied in many major national science and technology projects.
The architecture of sunway sw26010 many-core processor is shown in figure 1. Each chip includes four core groups (CG). Each core group consists of the memory controller, management processing element (MPE), and computing processing element cluster (CPE cluster).
Both the MPE and the CPEs have their own memory, which forms a two-level on-chip storage hierarchy. The MPE is a 64-bit RISC processor and has 8 GB of memory. The CPE cluster contains 64 computing processing elements (CPEs), which are connected by an 8×8 mesh communication network. Each CPE is also a 64-bit RISC processor with 64 KB of local data memory (LDM). The data storage of each CPE is organized as scratch pad memory (SPM) [4] to avoid cache-consistency overhead and reduce design complexity. The system interface is used to connect with off-chip systems, such as PCIE, Ethernet and the maintenance interface.
Each core group of the sunway sw26010 many-core processor can be used independently. The data needs to be transmitted by the MPE from main memory to the CPE cluster for calculation. The CPEs in a CPE cluster do not share memory resources. Both the MPE and the CPE cluster support 256-bit vectorization instructions, as well as fused multiply-add (FMA) instructions. Each MPE supports one double type floating-point pipeline, while each CPE supports one floating-point pipeline.
The theoretical peak performance of one MPE is 23.2 Gflops, and that of the CPE cluster is 742.4 Gflops [5]. The CPE cluster therefore accounts for about 97% of the total peak performance of a core group (742.4 of 765.6 Gflops). Consequently, making the most of the CPE cluster is the key to using the sunway sw26010 many-core processor well.
III. PARALLEL IMPLEMENTATION OF ROMS BASED ON SW26010 MANY-CORE PROCESSOR
A. ROMS PROGRAM ANALYSIS AND MASTER-SLAVE ACCELERATION METHOD
Program hotspot analysis plays a key role in the parallel optimization of ocean numerical model programs. Only by identifying the hotspots can the program be optimized in the right places and achieve better performance. In general, program hotspot analysis relies on the built-in timing module of the program, on specialized tools such as sunway gprof [6] for the sunway sw26010 many-core processor, or on third-party tools such as TAU [7], HPCToolkit [8] and others [9]. Therefore, the first step in the parallel optimization of ROMS is to find the hotspots.
FIGURE 1. The architecture of sunway sw26010 many-core processor.
VOLUME 7, 2019
ROMS is a fully functional ocean numerical model program, and it provides a built-in timing program to count the running time of each module [2] . The running time of each module and its proportion in total program will be printed at the end after each execution.
We employ a ROMS case named drifter, whose purpose is to simulate the barotropic hydrodynamic field driven by the M2 tidal constituent in Jiaozhou Bay, China. The number of ROMS steps is set to 128. The running time is shown in figure 2. The total running time of ROMS is 393.550 seconds, of which the model 2D kernel takes 229.670 seconds, or 58.3585% of the total. The model 2D kernel is therefore clearly the hotspot of ROMS. The source code file corresponding to the model 2D kernel is step2d.f90, which we analyze in detail. It contains 51 loops that can be optimized in parallel, and most of them have no logical judgments (branches), which makes them very suitable for acceleration on this heterogeneous processor. In consideration of the architectural characteristics of the sunway sw26010 many-core processor, we adopt the MPE-CPE cluster two-level parallel method for parallel optimization.
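The timing figures above can be turned into a rough bound on what accelerating only the hotspot can achieve. The following is a minimal sketch using Amdahl's law, which is not part of the original analysis; it assumes that everything outside the model 2D kernel runs at unchanged speed, and the function names are hypothetical.

```c
#include <assert.h>

/* Fraction of the total run time spent in the hotspot. */
double hotspot_fraction(double hotspot_seconds, double total_seconds)
{
    assert(total_seconds > 0.0 && hotspot_seconds >= 0.0);
    return hotspot_seconds / total_seconds;
}

/* Amdahl's law: whole-program speedup when a fraction f of the run time
 * is accelerated by a factor s and the remainder is left unchanged.
 * (Assumption: the non-hotspot part is not affected by the optimization.) */
double amdahl_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

/* Limit of amdahl_speedup() as the hotspot speedup grows without bound. */
double amdahl_limit(double f)
{
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

With the measured values, `hotspot_fraction(229.670, 393.550)` is about 0.5836, so under this assumption the total speedup cannot exceed roughly 2.4 however fast the kernel becomes. This is only an estimate for the 128-step case, not a measured result.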
The MPE-CPE cluster two-level parallel method is shown in figure 3. Segment B is the hotspot, and it is loaded into the CPE cluster to execute in parallel. The MPE is responsible for communication, I/O and the execution of serial code, and the CPE cluster is responsible for the parallel part. While the CPE cluster is executing, the MPE waits until it completes. There are two ways for the MPE to call the CPE cluster. The first is OpenACC* [10], a general-purpose many-core programming method based on compiler directives; the second is the accelerated thread library (athread), a thread library for the MPE-CPE cluster architecture. Both OpenACC* and athread are programming methods specially developed for the sunway sw26010 many-core processor.
Most ocean numerical model systems are written in Fortran. With athread, the CPE program called by the MPE can be written in either Fortran or C. The support of the sunway sw26010 many-core processor for C is more complete than that for Fortran, and the reason will be given in Section V.
B. OpenACC*
OpenACC* is an appropriate modification and expansion of OpenACC 2.0 [11] for sunway sw26010 many-core processor. Many directives are customized for rapid program migration and parallel optimization on this processor.
The usage of OpenACC* is similar to that of OpenMP [12] or OpenACC [10], [11]. The programmer adds the necessary compiler directives (beginning with !$acc or #pragma acc) before the serial codes that need to be parallelized, and the OpenACC* compiler then automatically compiles the serial codes into parallel codes according to the directives. The execution of an OpenACC* program is guided by the MPE and carried out by its corresponding CPE cluster. Firstly, the MPE loads the annotated serial codes into the CPE cluster as a kernel (accelerated task). Then, the CPE cluster processes the kernel in parallel, and transmits the results back to the MPE and main memory. Finally, the CPE cluster destructs the variables and frees the LDM space. The common OpenACC* directives are listed in table 1. We take one loop in ROMS with OpenACC* as the example. As shown in figure 4, the codes in the frames are OpenACC* directives, and the others are the source codes of ROMS. The source codes can execute in parallel on the sunway sw26010 many-core processor after being annotated with a few OpenACC* directives, which makes this a simple and convenient parallel optimization method.
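The directive-based style described above can be illustrated with a small loop. This is a minimal sketch in C that uses standard OpenACC 2.0 syntax (`parallel loop` with `copyin` and `copy` data clauses); it is an assumption that the customized OpenACC* directives accept the same form, and the loop itself is a generic example rather than code from step2d.f90. A compiler without OpenACC support simply ignores the pragma and runs the loop serially.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
 * The directive asks the compiler to offload the loop: x is copied to the
 * accelerator before the loop, y is copied in and copied back afterwards. */
void scaled_add(int n, double a, const double *x, double *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The serial loop body is unchanged; only the single directive line is added, which is why the programming workload of this approach is small.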
Although OpenACC* is convenient to use, the underlying code is hidden from programmers, which makes it too abstract for fine tuning. Therefore, the sunway compiler provides the intermediate OpenACC* code to programmers to help them perform secondary optimization, which is optional. The intermediate OpenACC* code of the example is shown in figure 5.
C. ACCELERATED THREAD LIBRARY (ATHREAD)
The accelerated thread library (athread) is a set of interfaces specially designed for the MPE-CPE cluster two-level parallel method. It allows programmers to explicitly schedule the CPE threads in the CPE cluster so as to obtain good performance from the whole chip. Athread is mainly used to control thread creation, recycling, scheduling, interrupt and exception management, asynchronous mask support and other operations.
According to the analysis in part A of Section III, we should optimize the loops in step2d.f90. At the beginning of step2d.f90, we call athread_init() to initialize the CPE threads; at the end of step2d.f90, athread_halt() is invoked to halt them. Both athread_init() and athread_halt() are invoked only once during the parallel optimization. The athread_spawn() interface starts all available CPE threads in the CPE cluster, and athread_join() explicitly blocks the MPE until the CPE threads have finished. These four interfaces can only be invoked by the MPE.
The athread_get() interface carries data from main memory to the LDMs of the specified CPEs. When the CPEs finish the calculation, athread_put() is called to transmit the result from the LDMs back to main memory and the MPE. Both athread_get() and athread_put() are invoked by the CPEs.
Based on athread, we employ two languages, Fortran and C, for the parallel optimization of ROMS. The original code and the codes with athread(fortran) and athread(C) are shown in figure 6. The optimized MPE code is written in Fortran, while the CPE codes are written in Fortran and C respectively. The main flow of the CPE codes is to get data from main memory, calculate on the specified CPEs, and finally put the result back to the MPE and main memory.
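The get, calculate and put flow of the CPE code can be sketched as follows. This is a sequential emulation in plain C, not the real athread API: `mock_get` and `mock_put` are hypothetical stand-ins for athread_get() and athread_put() implemented with `memcpy`, the loop in `run_on_cluster` stands in for spawning and joining the CPE threads, and the buffer size `LDM_DOUBLES` is an assumed value chosen to fit well inside the 64 KB LDM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_CPES    64    /* CPEs in one cluster (from the text) */
#define LDM_DOUBLES 1024  /* assumed buffer: 8 KB of the 64 KB LDM */

/* Hypothetical stand-ins for athread_get()/athread_put(): a plain copy
 * between "main memory" and a CPE-local buffer. */
static void mock_get(double *ldm, const double *main_mem, int count)
{
    memcpy(ldm, main_mem, (size_t)count * sizeof(double));
}

static void mock_put(double *main_mem, const double *ldm, int count)
{
    memcpy(main_mem, ldm, (size_t)count * sizeof(double));
}

/* Work of one CPE: fetch its contiguous block in LDM-sized pieces,
 * compute locally, and write each piece back. */
static void cpe_kernel(int cpe_id, int n, double scale,
                       const double *in, double *out)
{
    double ldm[LDM_DOUBLES];
    int chunk = (n + NUM_CPES - 1) / NUM_CPES;  /* elements per CPE */
    int begin = cpe_id * chunk;
    int end = (begin + chunk < n) ? begin + chunk : n;

    for (int start = begin; start < end; start += LDM_DOUBLES) {
        int count = (end - start < LDM_DOUBLES) ? end - start : LDM_DOUBLES;
        mock_get(ldm, in + start, count);       /* main memory -> LDM */
        for (int i = 0; i < count; ++i)
            ldm[i] *= scale;                    /* compute in LDM */
        mock_put(out + start, ldm, count);      /* LDM -> main memory */
    }
}

/* MPE side: a sequential loop emulating "spawn all CPEs, then join". */
void run_on_cluster(int n, double scale, const double *in, double *out)
{
    assert(n >= 0 && in != NULL && out != NULL);
    for (int id = 0; id < NUM_CPES; ++id)
        cpe_kernel(id, n, scale, in, out);
}
```

On the real processor the 64 kernels would run concurrently and the copies would be DMA transfers; the sketch only shows how the work and the data movement are divided.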
In this paper, we compare the three parallel optimization methods: OpenACC*, athread(fortran) and athread(C). OpenACC* is simple and convenient to use, since the programmer only needs to add the applicable directives before the serial codes that need to be parallelized. However, the directives are so abstract that programmers cannot perform fine-grained control of the CPE threads. Athread allows programmers to explicitly schedule the CPE threads, which can therefore be well controlled, so the program may achieve better performance. The programming workload of athread is larger than that of OpenACC*, while the workloads of athread(fortran) and athread(C) are basically identical.
According to the evaluation in this paper, the program efficiency by using athread is higher than by OpenACC*. The detailed evaluation and reason are discussed in section V.
IV. PARALLEL OPTIMIZATION OF ROMS BASED ON SUNWAY SW26010 MANY-CORE PROCESSOR
The programmers of numerical model programs like ROMS are generally application domain experts, who focus on the logical derivation of physical phenomena and the implementation of functions in code, rather than on programming techniques and hardware. Hence, the codes may not match uncommon hardware such as the sunway sw26010 many-core processor.
Sunway sw26010 many-core processor is a typical heterogeneous processor. Unlike Intel many integrated core architecture (MIC) [13] and NVIDIA GPU [14] , the CPEs in CPE cluster do not share the memory resource. Each CPE has separate 64KB LDM, and can be used alone. The optimized program and data should be transmitted from main memory to the CPEs by MPE.
Because of the special architecture of the sunway sw26010 many-core processor, the "memory wall" problem [15] is more serious. Although the processor has powerful computing ability, the I/O bandwidth between the MPE and the CPE cluster is very limited. The frequencies of the MPE and the 64 CPEs are all 1.5 GHz, and the clock cycle is 0.67 ns. The latency of one CPE access to main memory is 278 clock cycles (186.26 ns), whereas an access to its own LDM takes just 4 clock cycles (2.68 ns). Moreover, when a CPE is accessing the main memory, the other 63 CPEs can only wait, which wastes computing resources. Consequently, frequent or small data transmissions between the CPEs and main memory should be reduced.
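The latency figures quoted above follow directly from the cycle counts and the clock period. The sketch below reproduces that arithmetic; it is a latency-only view under the stated numbers (0.67 ns per cycle, 278 and 4 cycles), it ignores bandwidth and overlap, and the function names are hypothetical.

```c
#include <assert.h>

#define CYCLE_NS        0.67  /* clock period quoted in the text */
#define MAIN_MEM_CYCLES 278   /* CPE access to main memory */
#define LDM_CYCLES      4     /* CPE access to its own LDM */

/* Convert a number of clock cycles to nanoseconds. */
double cycles_to_ns(long cycles)
{
    assert(cycles >= 0);
    return (double)cycles * CYCLE_NS;
}

/* Latency-only cost of a number of individual accesses, each paying
 * cycles_per_access (no overlap or bandwidth effects are modeled). */
double access_cost_ns(long accesses, long cycles_per_access)
{
    assert(accesses >= 0 && cycles_per_access >= 0);
    return cycles_to_ns(accesses * cycles_per_access);
}

/* How many times slower a main-memory access is than an LDM access. */
double main_to_ldm_latency_ratio(void)
{
    return (double)MAIN_MEM_CYCLES / (double)LDM_CYCLES;
}
```

The ratio 278 / 4 is 69.5, which is why moving data into the LDM once and reusing it there pays off as soon as an element is touched more than a few times.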
Three methods are utilized to optimize ROMS based on sunway sw26010 many-core processor, including DMA transmission optimization, array slices and SIMD vectorization [16] .
A. DMA TRANSMISSION OPTIMIZATION
We optimize DMA transmission by reducing the number of transmissions and increasing the amount of data carried by each transmission between the MPE and the CPE cluster.
1) PACKING THE CONSTANTS
Currently, the parallel optimizations are mainly aimed at the loops in the source code. In the preliminary, mechanical optimization, the data of each loop was transmitted from the main memory to the CPEs indiscriminately. However, the loops to be optimized may share the same data, such as the loop upper and lower bounds, constants and common input data. As shown in figure 7, the underlined data in the two loops is the same, and this situation is ubiquitous in step2d.f90. Consequently, we extract and package the data that does not change during execution, and send it to the CPEs once at the beginning. In this way, the frequent and small data transmissions between the CPEs and main memory are reduced.
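Packing the loop-invariant data can be sketched as grouping it into one structure that is moved in a single transfer. In the following minimal example, the struct fields (`istr`, `iend`, `jstr`, `jend`, `dt`, `g`) are hypothetical placeholders for loop bounds and constants, and `mock_dma` is a counting stand-in for a DMA transfer, not a real athread call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical loop-invariant data shared by several loops. */
typedef struct {
    int istr, iend, jstr, jend;  /* loop bounds */
    double dt, g;                /* constants */
} loop_consts;

static int transfer_count = 0;

/* Stand-in for one DMA transfer: copies the bytes and counts the call. */
static void mock_dma(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
    ++transfer_count;
}

/* Unpacked: every field is sent on its own. Returns transfers used. */
int send_unpacked(loop_consts *ldm, const loop_consts *mem)
{
    int before = transfer_count;
    assert(ldm != NULL && mem != NULL);
    mock_dma(&ldm->istr, &mem->istr, sizeof mem->istr);
    mock_dma(&ldm->iend, &mem->iend, sizeof mem->iend);
    mock_dma(&ldm->jstr, &mem->jstr, sizeof mem->jstr);
    mock_dma(&ldm->jend, &mem->jend, sizeof mem->jend);
    mock_dma(&ldm->dt, &mem->dt, sizeof mem->dt);
    mock_dma(&ldm->g, &mem->g, sizeof mem->g);
    return transfer_count - before;
}

/* Packed: the whole struct moves in one transfer. Returns transfers used. */
int send_packed(loop_consts *ldm, const loop_consts *mem)
{
    int before = transfer_count;
    assert(ldm != NULL && mem != NULL);
    mock_dma(ldm, mem, sizeof *mem);
    return transfer_count - before;
}
```

Both variants deliver the same values to the CPE-side copy; the packed variant does so with one transfer instead of six, and the copy can then be reused by every loop that needs it.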
2) MERGING THE LOOPS
The parallel optimization of each loop needs to spawn and join the corresponding CPE threads, and doing this too frequently degrades overall performance. Consequently, we should reduce the number of times CPE threads are spawned and joined. As shown in figure 8, in the preliminary code both loops need to spawn and join CPE threads, whereas in the further optimized code we merge the two loops into one, which needs only one spawn and one join. Note that the lower bounds of the two loops differ, and the merged loop uses the smaller lower bounds. Hence, one extra row is calculated for the first loop, and one extra column is calculated for the second loop. When sending the result to the main memory, the CPEs only copy the data within the corresponding original bounds. Although one unnecessary row and column are computed, the time is negligible relative to the cost of spawning and joining CPE threads. The optimized loops are invoked by the main program many times, so the saved time is considerable.
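The loop merging described above can be sketched on a small grid. In this minimal example the grid size, the loop bodies and the bounds (the first loop starts at row 1, the second at column 1) are hypothetical; the point is that the merged loop runs over the union of both ranges into scratch buffers and the write-back step restores the original bounds.

```c
#include <assert.h>
#include <stddef.h>

#define NX 6  /* hypothetical grid width  (i index) */
#define NY 5  /* hypothetical grid height (j index) */
#define IDX(j, i) ((j) * NX + (i))

/* Preliminary form: two loops with different lower bounds. On the real
 * processor each loop would need its own spawn and join. */
void separate_loops(const double *src, double *a, double *b)
{
    assert(src != NULL && a != NULL && b != NULL);
    for (int j = 1; j < NY; ++j)            /* first loop skips row 0 */
        for (int i = 0; i < NX; ++i)
            a[IDX(j, i)] = 2.0 * src[IDX(j, i)];
    for (int j = 0; j < NY; ++j)            /* second loop skips column 0 */
        for (int i = 1; i < NX; ++i)
            b[IDX(j, i)] = src[IDX(j, i)] + 1.0;
}

/* Merged form: one loop over the smaller lower bounds (one spawn/join).
 * Results go to scratch buffers, standing in for LDM, and only the
 * entries inside each loop's original bounds are written back. */
void merged_loop(const double *src, double *a, double *b)
{
    double ta[NX * NY], tb[NX * NY];
    assert(src != NULL && a != NULL && b != NULL);
    for (int j = 0; j < NY; ++j)
        for (int i = 0; i < NX; ++i) {
            ta[IDX(j, i)] = 2.0 * src[IDX(j, i)];  /* extra row j == 0 */
            tb[IDX(j, i)] = src[IDX(j, i)] + 1.0;  /* extra column i == 0 */
        }
    for (int j = 0; j < NY; ++j)
        for (int i = 0; i < NX; ++i) {
            if (j >= 1) a[IDX(j, i)] = ta[IDX(j, i)];
            if (i >= 1) b[IDX(j, i)] = tb[IDX(j, i)];
        }
}
```

The extra row and column are computed but never copied back, so the merged version leaves exactly the same values in `a` and `b` as the two separate loops.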
B. ARRAY SLICES
Moreover, to reduce the transmissions between the MPE and the CPE cluster, we apply array slices in this paper. In normal processing, each CPE receives and calculates one row or column at a time. Figure 9 illustrates this with a Fortran example.
With the original method, array A[4][12] is assigned to 4 CPEs column by column, and 12 transmissions are required. With the optimized method shown in the lower half of figure 9, the array is sliced according to the number of CPEs, so 3 contiguous columns are assigned to each CPE, and just 4 transmissions are needed. Compared with the original method, array slicing reduces the number of transmissions from 12 to 4, a reduction of about 67%. In actual applications, the arrays have many more rows and columns, and generally 32 or 64 CPEs are utilized, so the improvement is more significant.
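The slicing rule can be written down as a small index calculation. The sketch below is an illustration in C with hypothetical names; it splits the columns as evenly as possible into one contiguous block per CPE and counts transfers under the assumption that each contiguous block moves in a single transfer (adjacent columns are contiguous in memory in Fortran's column-major layout).

```c
#include <assert.h>

typedef struct {
    int first;  /* index of the first column owned by the CPE */
    int count;  /* number of contiguous columns owned by the CPE */
} slice;

/* Contiguous block of columns owned by cpe_id when ncols columns are
 * split as evenly as possible across ncpes CPEs. */
slice column_slice(int ncols, int ncpes, int cpe_id)
{
    slice s;
    int base, extra;
    assert(ncols >= 0 && ncpes > 0 && cpe_id >= 0 && cpe_id < ncpes);
    base = ncols / ncpes;   /* columns every CPE gets */
    extra = ncols % ncpes;  /* the first `extra` CPEs get one more */
    s.count = base + (cpe_id < extra ? 1 : 0);
    s.first = cpe_id * base + (cpe_id < extra ? cpe_id : extra);
    return s;
}

/* Transfers when every column is sent on its own. */
int transfers_per_column(int ncols)
{
    return ncols;
}

/* Transfers when each CPE receives its whole slice at once. */
int transfers_sliced(int ncols, int ncpes)
{
    return ncols < ncpes ? ncols : ncpes;
}
```

For the example in the text, `column_slice(12, 4, k)` gives each of the 4 CPEs 3 columns, and the transfer count drops from `transfers_per_column(12)` = 12 to `transfers_sliced(12, 4)` = 4.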
C. SIMD VECTORIZATION
Single instruction, multiple data (SIMD) describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously [17]. The sunway sw26010 many-core processor adopts the idea of SIMD and designs its floating-point instruction pipeline as a SIMD unit to enhance its floating-point capability. This processor is also equipped with specialized 256-bit SIMD units to process vector data. These SIMD units can not only reduce power consumption, but also improve program performance. Compared with the original scalar calculation, SIMD may deliver up to 4 times the performance. If the SIMD fused multiply-add units are utilized, the performance may be up to 8 times that of scalar calculation.
For the sunway sw26010 many-core processor, 6 SIMD data types are designed: intv8, uintv8, int256, uint256, floatv4 and doublev4. These SIMD data types must be used explicitly with the SIMD functions. One SIMD instruction processes several loop iterations at once, and can therefore significantly reduce the number of iterations in the loop.
In this paper, we illustrate the usage of SIMD through one loop of ROMS. The comparison of the non-SIMD code and the SIMD code is shown in figure 10, where the SIMD variables are italicized. In the non-SIMD code, the number of iterations is N; in the SIMD code, it is N / 4. The SIMD code thus removes 75% of the iterations, which can improve program performance.
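The reduction from N to N / 4 iterations can be sketched in portable C. The example below does not use the Sunway SIMD types or intrinsics: `vec4d` is a hypothetical four-lane struct that emulates a type such as doublev4, and the loop body (a multiply-add) is a generic illustration rather than the ROMS loop from figure 10. A scalar tail handles the case where N is not a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Portable stand-in for a 4-lane double vector (emulates e.g. doublev4). */
typedef struct { double lane[4]; } vec4d;

static vec4d vec4d_load(const double *p)
{
    vec4d v;
    for (int k = 0; k < 4; ++k) v.lane[k] = p[k];
    return v;
}

static void vec4d_store(double *p, vec4d v)
{
    for (int k = 0; k < 4; ++k) p[k] = v.lane[k];
}

static vec4d vec4d_splat(double s)
{
    vec4d v;
    for (int k = 0; k < 4; ++k) v.lane[k] = s;
    return v;
}

/* Per-lane a * b + c, the shape of a vector multiply-add. */
static vec4d vec4d_muladd(vec4d a, vec4d b, vec4d c)
{
    vec4d r;
    for (int k = 0; k < 4; ++k) r.lane[k] = a.lane[k] * b.lane[k] + c.lane[k];
    return r;
}

/* Scalar reference: n iterations. */
void axpy_scalar(int n, double a, const double *x, double *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Vector form: n / 4 vector iterations plus a scalar tail.
 * Returns the number of vector iterations executed. */
int axpy_vec4(int n, double a, const double *x, double *y)
{
    int vec_iters;
    vec4d va;
    assert(n >= 0 && x != NULL && y != NULL);
    vec_iters = n / 4;
    va = vec4d_splat(a);
    for (int k = 0; k < vec_iters; ++k) {
        int i = 4 * k;
        vec4d_store(y + i, vec4d_muladd(va, vec4d_load(x + i), vec4d_load(y + i)));
    }
    for (int i = 4 * vec_iters; i < n; ++i)  /* leftover elements */
        y[i] = a * x[i] + y[i];
    return vec_iters;
}
```

On real SIMD hardware each emulated helper would map to a single vector instruction, so the iteration count is what matters here, not the speed of the emulation.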
V. EVALUATION
We test the optimized ROMS on the sunway sw26010 many-core processor. This section first introduces the experimental setup, including the test methods, programs and data, and then presents the performance of ROMS with the different optimizations in comparison with the original ROMS.
A. EXPERIMENTAL SETUP
In this paper, we use one core group of the sunway sw26010 many-core processor to run ROMS. The main memory is 32 GB, and the compilers are swacc, sw5f90 and sw5cc, which compile the OpenACC*, athread(fortran) and athread(C) versions of ROMS respectively. The ROMS version is v.844, released on April 29, 2017. The ROMS case is named drifter, whose purpose is to simulate the barotropic hydrodynamic field driven by the M2 tidal constituent in Jiaozhou Bay, China.
We measure the execution times and speedups of the model kernel and of the total ROMS with the different optimizations. As mentioned in part A of Section III, the model 2D kernel is the hotspot of ROMS, and the corresponding source code file is step2d.f90. Step2d.f90 is optimized with OpenACC*, athread with Fortran and athread with C respectively. We test the programs with 16, 32 and 64 CPEs. Three datasets are used: small, middle and large, which correspond to 128, 512 and 1024 steps respectively. In addition, we compare the results of the different optimizations and input datasets with those of the original program, and the values are consistent.
B. RESULT
Figures 11 and 12 show the performance of ROMS with 16 CPEs. Figure 11 shows the performance of the original and optimized model 2D kernel programs with different datasets. The optimized programs are the OpenACC* model 2D kernel, the athread(fortran) model 2D kernel and the athread(C) model 2D kernel. Regardless of the dataset, the ranking of execution times is original > OpenACC* > athread(fortran) > athread(C); the athread(C) program is the fastest.
Figure 12 shows the total performance of the OpenACC*, athread(fortran), athread(C) and original ROMS. With any input dataset, the ranking of total execution times is original > OpenACC* > athread(fortran) > athread(C). The speedups of OpenACC*, athread(fortran) and athread(C) relative to the original total ROMS are labeled on the bars. As with the model 2D kernel, the size of the dataset has little influence on program performance, and the speedups are basically constant across input datasets. The ranking of total speedups is athread(C) > athread(fortran) > OpenACC*. Because these speedups are for the total ROMS, they are lower than those of the model 2D kernel.
Figures 13 and 14 show the performance of ROMS with 32 CPEs. Figure 13 shows the performance of the original and optimized model 2D kernel with different datasets. As with 16 CPEs, regardless of the dataset, the ranking of execution times is original > OpenACC* > athread(fortran) > athread(C). The athread(C) program is still the fastest.
The speedups of OpenACC*, athread(fortran) and athread(C) relative to the original model 2D kernel are labeled on the bars. Averaged over the input datasets, they are 1.854, 2.873 and 3.371 respectively. The speedups are basically constant across input datasets, and their ranking is athread(C) > athread(fortran) > OpenACC*.
Compared with the results for 16 CPEs, the average speedups of athread(C) and athread(fortran) are higher, whereas the speedup of OpenACC* is lower. OpenACC* is a kind of semi-automatic optimization, and it does not achieve higher efficiency when the number of CPEs is increased.
Figure 14 shows the total performance of the OpenACC*, athread(fortran), athread(C) and original ROMS. With any input dataset, the ranking of total execution times is original > OpenACC* > athread(fortran) > athread(C). The speedups relative to the original total ROMS are labeled on the bars. They are basically constant across input datasets, and their ranking is athread(C) > athread(fortran) > OpenACC*. The average speedups are higher than those with 16 CPEs, except for OpenACC*.
Figures 15 and 16 show the performance of ROMS with 64 CPEs. Figure 15 shows the performance of the original and optimized model 2D kernel with different datasets. As with 16 and 32 CPEs, regardless of the dataset, the ranking of execution times is original > OpenACC* > athread(fortran) > athread(C). The athread(C) program is the fastest.
The speedups of OpenACC*, athread(fortran) and athread(C) relative to the original model 2D kernel are labeled on the bars. Averaged over the input datasets, they are 1.599, 2.928 and 3.693 respectively. The speedups are still basically constant across input datasets, and their ranking is athread(C) > athread(fortran) > OpenACC*. Compared with 16 and 32 CPEs, the average speedups of athread(C) and athread(fortran) are the highest, but the speedup of OpenACC* is lower still than with 32 CPEs.
Figure 16 shows the total performance of the OpenACC*, athread(fortran), athread(C) and original ROMS. With any input dataset, the ranking of total execution times is original > OpenACC* > athread(fortran) > athread(C). The speedups relative to the original total ROMS are labeled on the bars. They are basically constant across input datasets, and their ranking is athread(C) > athread(fortran) > OpenACC*. The average speedups are higher than those with 16 and 32 CPEs, except for OpenACC*.
The above results show that, regardless of the dataset or the number of CPEs, the ranking of speedups is consistent: athread(C) > athread(fortran) > OpenACC*. OpenACC* is a kind of semi-automatic optimization whose underlying implementation is hidden from programmers, so the CPE threads cannot be precisely controlled. Furthermore, excessive and uncontrolled CPE threads may interfere with each other, which reduces program efficiency. The accelerated thread library (athread) is a set of interfaces specially designed for the MPE-CPE cluster two-level parallel method. It allows programmers to explicitly schedule the CPE threads in the CPE cluster, so performance can be improved further. The support of the sunway sw26010 many-core processor for athread(C) is better than that for athread(fortran): because SIMD cannot be used in athread(fortran), the efficiency of athread(C) is higher. Consequently, the ranking of optimization efficiency is athread(C) > athread(fortran) > OpenACC*.
With athread(C), the optimized program achieves the best performance. We also compare the optimized programs across different numbers of CPEs. As shown in Figures 11 to 16, the athread programs perform best with 64 CPEs, because 64 CPEs provide the most computing power and athread lets programmers control the CPE threads precisely.
VI. RELATED WORK
At present, dozens of ocean numerical model programs have been developed, such as ROMS (Regional Ocean Modeling System) [2], POP (Parallel Ocean Program) [18], POM (Princeton Ocean Model) [19], FVCOM (Finite Volume Community Ocean Model) [20] and HYCOM (Hybrid Coordinate Ocean Model) [21]. With deeper research in ocean modeling, ocean numerical models are gradually moving towards multiple physical processes and higher resolution, and need more computing resources.
There are many excellent optimizations of ocean numerical model programs based on various kinds of CPUs and accelerators. Lupo et al. [22] used OpenACC [11] to port ROMS to Intel Xeon Phi X200 [23], and achieved good acceleration. Yalavarthi and Kaginalkar [24] ported ROMS to Intel MIC [13], with the model 2D kernel executed as the hotspot on the co-processor. Hu et al. [25] developed P-CSI based on POP, and tested it on the "Yellowstone" supercomputer at the National Center for Atmospheric Research (NCAR) in the United States. In the high-resolution mode, the speedup can reach 1.7x. Panzer et al. [26] employed CUDA on the NVIDIA K20x GPU [27] and OpenMP on the Intel Xeon E5-2650 [28] to accelerate the ROMS hotspot in shared memory. The evaluation showed that the optimizations on both platforms obtain good performance. In [29], Xu et al. improved POM into gpuPOM, and implemented it on NVIDIA K20x and K40m GPUs. The work of [30] employed the earth system modeling framework (ESMF) [31] based on POP. It aims to construct a set of standard ocean model components, which can improve the efficiency of the ocean model program to a certain degree.
Many researchers have done excellent work based on the sunway sw26010 many-core processor and the "Sunway TaihuLight" supercomputer, which is equipped with 40,960 sw26010 processors [32]. Hundreds of complex applications have executed successfully on the "Sunway TaihuLight" supercomputer, involving 19 application fields, such as climate, aerospace, biomedicine and marine engineering [33]. "10M-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics" [34] won the 2016 ACM Gordon Bell Prize, which was a breakthrough for China in the high-performance computing field. Moreover, based on the "Sunway TaihuLight" supercomputer, "18.9-Pflops nonlinear earthquake simulation on Sunway TaihuLight: enabling depiction of 18-Hz and 8-meter scenarios" [35] won the 2017 ACM Gordon Bell Prize. This was the second time that the prize had been given to a Chinese research team. Qiao et al. [36] ran surface wave simulations on the "Sunway TaihuLight" supercomputer and achieved 45.43 PFlops at an ultra-high resolution of (1/100)°. The work of [37] presented a programming framework named PFSI.sw for sea ice model algorithms based on the sunway sw26010 many-core processor. By using this framework, programmers can exploit the parallelism of existing sea ice model algorithms and achieve good performance. Li et al. implemented k-means on the sw26010 many-core processor and sustained a double-precision performance of over 348.1 Gflops, which is 84% of the theoretical performance upper bound on a single core group [38]. Wang et al. employed OpenACC* to port GTC-P to the "Sunway TaihuLight" supercomputer and achieved a 2.5x speedup [39]. In addition, there is much other work based on sw26010 many-core processors, such as genetic algorithms [40], the DGTD algorithm [41], Lattice QCD [42] and AI [43].
In summary, there are various parallel optimizations of ocean numerical model programs based on multicore processors, MIC, and GPU. However, few optimizations are based on the sunway sw26010 many-core processor. Consequently, it is necessary to implement ocean numerical model programs on this homegrown processor. Through the migration of numerous ocean model programs, we can gradually accumulate development experience on the sunway sw26010 many-core processor, to meet the national needs in ocean scientific research.
VII. CONCLUSION
This paper presents, for the first time, a parallel implementation and optimization of the regional ocean modeling system (ROMS) on the sunway sw26010 many-core processor. Three programming methods are used: OpenACC*, athread with Fortran and athread with C. We compare these programming methods in terms of programming method, workload and execution efficiency, which offers practical guidance for programmers who use sunway sw26010 many-core processors. We utilize DMA transmission optimization, array slices and SIMD vectorization in the parallel optimization of ROMS. The evaluation measures the execution times and speedups of the model kernel and of the total ROMS with different optimizations, input datasets and numbers of computing processing elements (CPEs). The results show that, compared with the original ROMS, the speedup of the optimized hotspot program reaches up to 3.69x.
TAO LIU received the Ph.D. degree from the School of Computer Science and Engineering, Beihang University, in 2017. He is currently a Research Assistant with the Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences). He is involved in several national and provincial scientific projects of performance optimization for parallel applications and parallel program performance analysis. His research interests include parallel computing, high-performance computing, and performance optimization.
YUAN ZHUANG received the master's degree from the School of Computer Science and Technology, Shandong University, in 2017. He is currently a Senior Engineer with the Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences). His research interests include high-performance computing and parallel programming.
