Introduction
High-order schemes have received considerable attention in computational fluid dynamics (CFD) simulations because they can resolve complicated flow structures and provide much more accurate flow details while using fewer grid points [1]. As one kind of high-order scheme, WENO schemes are widely used in large eddy simulations (LES) [2] and direct numerical simulations (DNS) [3]. However, the high computing power required by WENO schemes greatly limits the performance and power efficiency of CFD applications. As a result, parallelizing WENO schemes on high-performance computers is necessary.
In recent years, since heterogeneous systems can achieve higher floating-point performance and lower power consumption than traditional CPU-only systems, new horizons have opened for utilizing accelerators, such as Graphics Processing Units (GPUs) and Many Integrated Core (MIC) coprocessors, to build high-performance computers. In the latest TOP500 list [4], 50 high-end supercomputer systems use NVIDIA GPUs and 25 systems use Intel MIC coprocessors. Among them, the No. 1 supercomputer TianHe-2, designed by China's National University of Defense Technology (NUDT), consists of 48,000 Intel Xeon Phi 31S1P MIC coprocessors (57 cores per coprocessor), and the No. 2 Titan, developed by Oak Ridge National Laboratory, contains 18,688 NVIDIA K20x GPUs (14 streaming multiprocessors per GPU).
Since GPUs with support for the Compute Unified Device Architecture (CUDA) programming model were first launched by NVIDIA in 2006, several groups have applied them to reduce the computational time of WENO schemes for CFD simulations. Athanasios S. Antoniou et al. [5] presented a highly accelerated implementation of the finite-difference WENO scheme for large-scale simulations and reported an average speedup of 53 over sequential execution for several mesh sizes. Michael Griebel et al. [6] implemented the 5th-order WENO scheme for a two-phase solver on CUDA-enabled GPUs and observed an impressive speedup of 69.6 on eight GPUs/CPUs in contrast to a single CPU. Diego Rossinelli et al. [7] presented an adaptive finite volume solver for multiphase compressible flows based on the 5th-order WENO scheme on a heterogeneous platform using OpenCL, and observed an overall speedup of 34 by employing six GPUs compared to single-core execution. Chen Meng et al. [8] developed a three-dimensional 5th-order WENO GPU code for large-scale cosmological hydrodynamic flow simulations and achieved a 12-19 times speedup in contrast to a single CPU. Vahid Esfahanian et al. [9] focused on the use of GPUs for solving hyperbolic equations with the 5th-order WENO scheme and showed that a single GPU can reach a several-hundred-times speedup over a single-core CPU. Lin Fu et al. [10] ported a multi-block viscous flow solver for steady and unsteady turbulent flows with a 5th-order WENO reconstruction scheme onto GPUs and obtained a speedup of 32.6× over a single-core CPU. Konstantinos I. Karantasis et al. [11] evaluated the performance of a high-order WENO scheme for the simulation of compressible flows on a GPU cluster and obtained a speedup of about 90 on eight GPUs compared to sequential execution. Hossein Mahmoodi Darian et al. [12] presented a detailed study of the GPU implementation of up to 9th-order WENO schemes for the multi-dimensional Euler equations and reported that the speedups roughly reach 65 and 105 for the GTX 550 and GTX 480 GPUs, respectively.
However, three major problems have not been fully addressed thus far. First, while the performance gains of GPUs are impressive in many of the pioneering studies, a fair performance comparison, in which the multi-core CPU code is also parallelized to make full use of the available hardware resources, has rarely been reported. Second, the new-generation Kepler GPUs increase the number of cores and the cache sizes, but it is unclear how different generations of GPUs affect end-to-end performance for the same source code. Third, the Intel MIC coprocessor, based on the x86 core architecture, is a new type of hardware accelerator and a strong competitor of the NVIDIA GPU, but its application to CFD simulations using WENO schemes is still absent from the current literature.
To address these problems, we have implemented an efficient parallel high-order WENO code for complicated flow structures on four platforms (Ivy Bridge CPUs, Fermi GPUs, Kepler GPUs and the Intel MIC). On the GPU platforms, we carefully consider memory use, occupancy and concurrency to explore performance optimization opportunities for our WENO code. For the Ivy Bridge CPUs and the MIC, we employ a series of optimization techniques to maximize parallelism and data locality and achieve the best performance. We also present a detailed comparison between the two GPU generations (Fermi and Kepler), and conduct a cross-platform performance analysis (focusing on the Kepler GPU and the MIC) to provide suggestions for developers who need to choose the best accelerators for their specific applications.
The remainder of this paper is organized as follows. Section 2 briefly introduces the Euler equations and the formulation of the WENO schemes. Section 3 details the GPU parallelization and optimization techniques, and the MIC parallelization is described in Section 4. Experimental results are presented and analyzed in Section 5. Finally, we give the conclusion in Section 6.
Euler Equations and WENO Schemes
Euler Equations.
The two-dimensional Euler equations of compressible gas dynamics in strong conservation form are
$$\frac{\partial U}{\partial t} + \frac{\partial F(U)}{\partial x} + \frac{\partial G(U)}{\partial y} = 0,$$
$$U = \begin{pmatrix} \rho \\ \rho u_1 \\ \rho u_2 \\ E \end{pmatrix}, \quad
F(U) = \begin{pmatrix} \rho u_1 \\ \rho u_1^2 + p \\ \rho u_1 u_2 \\ u_1 (E + p) \end{pmatrix}, \quad
G(U) = \begin{pmatrix} \rho u_2 \\ \rho u_1 u_2 \\ \rho u_2^2 + p \\ u_2 (E + p) \end{pmatrix}.$$
Here $\rho$ is the density, $(u_1, u_2)$ is the velocity, $E$ is the total energy, $p$ is the pressure and $\gamma$ is the ratio of specific heats, with $p = (\gamma - 1)\left(E - \tfrac{1}{2}\rho (u_1^2 + u_2^2)\right)$. For the nonlinear Euler equations, the corresponding Jacobian matrices $A = \partial F/\partial U$ and $B = \partial G/\partial U$ have the distinct eigenvalues
$$\lambda(A) = u_1 - c,\ u_1,\ u_1 + c, \qquad \lambda(B) = u_2 - c,\ u_2,\ u_2 + c.$$
The distinct eigenvalues of the Jacobians are all real and represent the wave speeds in each direction, where $c = \sqrt{\gamma p / \rho}$ is the sound speed. To handle both the negative and positive wave speeds, we choose the characteristic-wise Lax-Friedrichs flux splitting [13], defined as
$$f^{\pm}(u) = \tfrac{1}{2}\left(f(u) \pm \alpha u\right), \qquad \alpha = \max_{u} |f'(u)|.$$
WENO Schemes.
The WENO schemes for the semi-discrete conservative form of the Euler equations are written as
$$\frac{d u_{i,j}}{d t} = -\frac{1}{\Delta x}\left(\hat{f}_{i+1/2,j} - \hat{f}_{i-1/2,j}\right) - \frac{1}{\Delta y}\left(\hat{g}_{i,j+1/2} - \hat{g}_{i,j-1/2}\right) = L(u_{i,j}),$$
where the numerical flux $\hat{f}_{i+1/2,j}$ is approximated along the line $y = y_j$ (and likewise $\hat{g}_{i,j+1/2}$ along $x = x_i$). The WENO scheme we use in this paper is the 9th-order finite difference version developed by Balsara and Shu [14]. The time discretization is via the third-order TVD Runge-Kutta method developed in [15]:
$$\begin{aligned}
u^{(1)} &= u^{n} + \Delta t\, L(u^{n}),\\
u^{(2)} &= \tfrac{3}{4} u^{n} + \tfrac{1}{4} u^{(1)} + \tfrac{1}{4} \Delta t\, L(u^{(1)}),\\
u^{n+1} &= \tfrac{1}{3} u^{n} + \tfrac{2}{3} u^{(2)} + \tfrac{2}{3} \Delta t\, L(u^{(2)}),
\end{aligned}$$
where the operator $L$ is defined by the right-hand side of the semi-discrete form above.
GPU Parallelization and Optimization
Traditionally, parallel computing of WENO schemes on CPUs is based on domain decomposition using MPI and OpenMP programming, in which each CPU core processes one or more subdomains. However, there are architectural disparities between CPUs and GPUs. For example, GPUs can execute a large number of threads concurrently and hide memory latency by efficient thread switching. Therefore, we need to make full use of millions of lightweight threads on GPUs when designing GPU programs. Considering the features of CUDA C programming, we use one-dimensional arrays to represent the multi-dimensional indexing of the variables in both the CPU and GPU codes. Taking the pressure p as an example, its indexing is as follows:
where i and j are the computation domain point indexes in the x and y directions, respectively, IIX and IIY are the numbers of cells in the x and y directions, respectively, MT denotes the number of stencil nodes of the WENO scheme, and l is the equivalent index used in the codes. For fine-grained parallelism on the GPU, there are two kinds of CUDA thread block configurations (2D and 1D), chosen according to the data dependency among cells in the computation domain. If there is no data dependency in the solution process, we use a 2D configuration (each CUDA thread calculates one cell independently).
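A minimal sketch of such a flattened index is given below; the exact layout formula is not reproduced in the extracted text, so the ordering (x stored contiguously, as suggested by the coalescing discussion later) and the values of IIX, IIY and MT are illustrative assumptions.

```cuda
/* Assumed flattening of the 2D index (i, j), including MT ghost nodes on each
 * side, into the 1D index l; the x direction is taken as the contiguous one. */
#define IIX 3840   /* cells in the x direction (example value)           */
#define IIY 960    /* cells in the y direction (example value)           */
#define MT  5      /* ghost/stencil nodes per side (assumed for WENO-9)  */

__host__ __device__ inline int idx(int i, int j)
{
    return (j + MT) * (IIX + 2 * MT) + (i + MT);
}

/* usage: p[idx(i, j)] accesses the pressure at cell (i, j) */
```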
The 2D thread block is configured as $(b_x, b_y)$, and the GPU grid size is $(\lceil IIX/b_x \rceil, \lceil IIY/b_y \rceil)$, where $\lceil x \rceil$ is the smallest integer not less than $x$. The corresponding relation between a CUDA thread and a cell is

i = threadIdx.x + blockDim.x * blockIdx.x,  j = threadIdx.y + blockDim.y * blockIdx.y.
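As a minimal sketch (the 8x4 block size follows the Occupancy subsection below, while the kernel name, arguments and body are placeholders rather than the authors' code), the 2D launch configuration with the ceiling division could look like:

```cuda
__global__ void cell_kernel(double *p, int nx, int ny)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    if (i >= nx || j >= ny) return;   /* guard threads that fall outside the domain */
    /* ... per-cell work on p[idx(i, j)] ... */
}

void launch_2d(double *d_p, int nx, int ny)
{
    dim3 block(8, 4);                             /* 32 threads per block   */
    dim3 grid((nx + block.x - 1) / block.x,       /* = ceil(nx / block.x)   */
              (ny + block.y - 1) / block.y);      /* = ceil(ny / block.y)   */
    cell_kernel<<<grid, block>>>(d_p, nx, ny);
}
```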
Since the CUDA model lacks an efficient global synchronization mechanism, if there is data dependency in the x direction we use a 1D configuration (each CUDA thread computes all the cells in the x direction). In this case the thread block is configured as $(1, n_y)$ and the grid size is $\lceil IIY/n_y \rceil$.

Fig. 2. 2D CUDA thread block configuration.

After the preliminary parallel GPU implementation of our WENO code, we focus on three issues to improve the performance: memory usage, occupancy and concurrency.
Memory Use.
To improve performance, it is very important to make full use of the GPU's memory hierarchy. Therefore, based on the characteristics of the data structures in our code, flow parameters and constants such as IIX, IIY and MT are stored in constant memory to reduce data transfers between the CPU and the GPU, while flow variables such as the density ρ, the velocity (u1, u2), the total energy E and the pressure p are stored in global memory to facilitate data exchanges between the CPU and the GPU. Shared memory is used only to buffer temporary data coming from global memory during the computation of the largest eigenvalue λ_max in each direction, where the largest eigenvalues contributed by the individual threads are collected and combined using a parallel reduction algorithm [16]. All data structures are stored contiguously in one dimension, which is good for coalescing global memory accesses.
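As a minimal sketch (the block size of 256 threads, assumed to be a power of two, and the array layout are assumptions rather than the authors' kernel), the shared-memory parallel reduction for the largest eigenvalue could look like:

```cuda
__global__ void max_eigenvalue(const double *lambda, double *block_max, int n)
{
    __shared__ double s[256];                 /* one slot per thread in the block     */
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    s[tid] = (gid < n) ? lambda[gid] : 0.0;   /* local wave speed, or a neutral value */
    __syncthreads();

    /* tree reduction in shared memory; assumes blockDim.x is a power of two <= 256 */
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] = fmax(s[tid], s[tid + stride]);
        __syncthreads();
    }

    if (tid == 0)
        block_max[blockIdx.x] = s[0];         /* per-block maximum; a second pass combines blocks */
}
```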
Occupancy.
Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps [17]. The maximum number of possible active warps per multiprocessor is a constant determined only by the compute capability of the GPU. Using the NVIDIA occupancy spreadsheet [18], we determine the proper number of threads per block and the number of blocks per grid to maximize occupancy. In our code, when the number of threads per block is 32 (8 in the x direction and 4 in the y direction), the occupancy is 0.333 and the performance is best.
Concurrency.
Concurrency is generally the ability of a system to perform multiple operations simultaneously [19]. The greater the concurrency of the application, the better the performance obtained on the GPU. Following the solution process of the WENO schemes, we initially placed the computation of the largest eigenvalues in each direction, the local computation of the matrices of left and right eigenvectors, the flux splitting and the calculation of the right-hand-side (RHS) values in a single 1D kernel. The concurrency of this kernel is very low and does not make full use of the millions of lightweight threads on the GPU, so the performance is poor. Therefore, we decompose the large 1D kernel into several small kernels and use a 2D configuration to maximize the concurrency of our code. By carefully investigating the data dependency, the 1D kernel fx for the inviscid flux in the x direction is decomposed into six 2D kernels, as Fig. 3 shows: fx_1 computes the largest eigenvalue in each direction (am); fx_2 computes the matrices of left and right eigenvectors (evl and evr); fx_3 computes the negative and positive flux terms (gg1 and gg2); fx_4 computes the negative half-node flux terms (ff); fx_5 computes the positive half-node flux terms (fh); fx_6 computes the right-hand side (rhs).

Fig. 3. Data dependency in the x-direction inviscid flux term computation and kernel decomposition.

After the aforementioned optimization techniques are implemented, the procedure of the full GPU version of our WENO code is illustrated in Fig. 4. To remove redundant data transfers between CPU and GPU, there are only two CPU/GPU transfers: one from CPU to GPU and the other from GPU to CPU. The computation of the inviscid flux terms is split into two kernels: fx, which computes the finite difference in the x direction using the WENO scheme, and gy, which computes the finite difference in the y direction. After the RHS is obtained, the new values on the computation domain are calculated with the Runge-Kutta method. The entire solution process of the Euler equations is executed on the GPU; the CPU only checks the termination condition.
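A minimal sketch of this decomposition is given below; the kernel names follow the text, while the signatures, arguments and bodies are placeholders rather than the actual implementation:

```cuda
__global__ void fx_1(double *am)  { /* largest eigenvalue in each direction    */ }
__global__ void fx_2(double *ev)  { /* matrices of left and right eigenvectors */ }
__global__ void fx_3(double *gg)  { /* negative and positive split flux terms  */ }
__global__ void fx_4(double *ff)  { /* negative half-node flux terms           */ }
__global__ void fx_5(double *fh)  { /* positive half-node flux terms           */ }
__global__ void fx_6(double *rhs) { /* right-hand-side contribution            */ }

void inviscid_flux_x(dim3 grid, dim3 block, double *work)
{
    /* kernels launched on the same (default) stream execute in order,
       which respects the data dependencies shown in Fig. 3 */
    fx_1<<<grid, block>>>(work);
    fx_2<<<grid, block>>>(work);
    fx_3<<<grid, block>>>(work);
    fx_4<<<grid, block>>>(work);
    fx_5<<<grid, block>>>(work);
    fx_6<<<grid, block>>>(work);
}
```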
Finally, we visualize the flow and analyze the output results in the post-processing step.
MIC Parallelization and Optimization
Thanks to the common programming languages, models and tools shared by the Ivy Bridge CPU and the Intel MIC coprocessor, it is easy to port our WENO code from the traditional CPU to the newly available Intel MIC, and similar optimization techniques can be employed on both to achieve the best performance.
The solution process of the Euler equations can take 98% of the total CPU computation time, so we choose the "native" mode of the MIC, in which the WENO code runs natively on the coprocessor. The key to performance optimization on the MIC is maximizing parallel computation and minimizing data movement. These optimization techniques generally fall into three categories: intra-node parallelism for scaling, intra-core parallelism to exploit the SIMD (single-instruction multiple-data) mechanism, and improving data locality in the on-chip cache.
Intra-node Parallelism.
In this paper, we choose OpenMP as the programming model for intra-node parallelism. For OpenMP, the key optimization techniques are configuring the proper number of threads and choosing a suitable thread affinity strategy.
To make full use of the available hardware resources, it is usually practical to set the number of threads according to the number of physical cores in each device. The MIC has 57 physical cores placed on a ring interconnect; with four hardware threads per core, 228 hardware threads are available on a single MIC coprocessor. The hardware threads on the MIC are designed for in-order execution, which typically requires 2-4 threads per core for optimal performance [20]. Therefore, we set the number of threads to 24 and 171 for the Ivy Bridge CPU and the MIC, respectively.
A suitable thread affinity can minimize the migration of threads from one core to another, improve cache efficiency and balance the load across cores. Fig. 5 shows the three thread affinity strategies. The first strategy, called compact, binds each new thread to the next free thread context; that is, all four contexts of a physical MIC core are used before threads are placed in the contexts of another core. In the second strategy, called scatter, each new thread is allocated to the core with the lightest load. The last strategy, balanced, is a compromise: threads are balanced among cores, and threads with consecutive IDs are assigned to neighboring contexts [21]. In our WENO code, the most effective thread affinity strategy is compact for the Ivy Bridge CPU and balanced for the MIC.
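As a minimal sketch (the paper does not give the run scripts; the environment variables below are the standard Intel OpenMP controls, with values taken from the text), a native MIC run would typically be configured as follows:

```cuda
/* Typically set in the environment before launching the native MIC binary:
 *   export OMP_NUM_THREADS=171      (3 threads per core on 57 cores)
 *   export KMP_AFFINITY=balanced    (compact / scatter are the alternatives)
 * The thread count can equivalently be set from the host code: */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(171);            /* 171 threads on the MIC, 24 on the CPU */
    #pragma omp parallel
    {
        #pragma omp single
        printf("running with %d OpenMP threads\n", omp_get_num_threads());
    }
    return 0;
}
```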
Intra-core Parallelism.
Intra-core parallelism that efficiently exploits the SIMD mechanism is one of the most important aspects of achieving high performance for application code running on Intel MIC coprocessors [22]. The MIC uses a 512-bit vector unit, which means 16 single-precision or 8 double-precision floating-point operations can be executed at one time. We use several SIMD vectorization techniques: auto-vectorization by the compiler; pragmas to assist vectorization (#pragma vector always instructs the compiler to vectorize the following loop regardless of its efficiency heuristics, and #pragma vector nontemporal uses the streaming store capability to achieve higher effective memory bandwidth); and memory alignment optimization, using "_mm_malloc()" to ensure the base address of each array is aligned and padding to ensure the address of the first element of each row is also aligned, for more efficient intra-core parallelism.
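A minimal sketch of these vectorization aids is given below; the array name, sizes and loop body are illustrative, and the pragmas are Intel compiler directives that other compilers will simply ignore:

```cuda
#include <immintrin.h>    /* _mm_malloc / _mm_free */

#define ALIGN 64          /* 64-byte alignment matches the 512-bit vector unit */

void scale_rows(int nx, int ny, double factor)
{
    /* pad each row to a multiple of 8 doubles so that every row starts aligned */
    int stride = ((nx + 7) / 8) * 8;
    double *a = (double *)_mm_malloc((size_t)ny * stride * sizeof(double), ALIGN);

    for (int j = 0; j < ny; ++j) {
        double *row = a + (size_t)j * stride;
        #pragma vector aligned     /* promise the compiler the accesses are aligned  */
        #pragma vector always      /* vectorize even if the heuristics advise not to */
        for (int i = 0; i < nx; ++i)
            row[i] = factor * i;   /* illustrative per-element work */
    }
    _mm_free(a);
}
```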
Data Locality.
Normally, the two tiers of parallelism offer a significant performance improvement. However, the higher the parallelism, the larger the amount of data required within a short period of time, which may exceed the peak memory bandwidth. Since cache accesses are significantly faster than memory accesses, it is important to find techniques that reduce the memory access pressure by reusing cached data. Cache blocking is a well-known optimization technique that helps avoid memory bandwidth bottlenecks by exploiting data that remains in the cache from recent iterations. Fig. 6(a) shows an example in which the domain is updated row by row. For cache blocking, the 2D computation domain is partitioned into blocks in the x direction, as shown in Fig. 6(b). The best block size can in principle be computed theoretically from the amount of data accessed in each iteration and the cache details; here we perform an automated search over all possible sizes in one and two dimensions to determine it experimentally.

Fig. 6(a). Row-by-row computation domain traversal. Fig. 6(b). With cache blocking applied, one block after another is updated.
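A minimal sketch of the 1D blocking is shown below; the block width BX, the update performed and the array layout are illustrative assumptions, with the best BX found by the automated search mentioned above:

```cuda
#define BX 256   /* block width in cells; tuned by automated search in practice */

void update_blocked(double *u, const double *rhs, int nx, int ny, double dt)
{
    for (int ib = 0; ib < nx; ib += BX) {
        int iend = (ib + BX < nx) ? ib + BX : nx;
        for (int j = 0; j < ny; ++j)          /* traverse every row of this x block */
            for (int i = ib; i < iend; ++i)   /* before moving on to the next block */
                u[j * nx + i] += dt * rhs[j * nx + i];
    }
}
```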
Results

Computational Examples.
The double Mach reflection problem is used to evaluate the performance of the GPU and MIC implementations of the high-order WENO scheme. The problem was initially proposed and studied in detail by Woodward and Colella [23]. The computation domain follows the standard setup of [23]; the post-shock condition is imposed on the part of the bottom boundary ahead of the shock, and a solid wall boundary condition is used for the rest. At the top boundary, the solution is set to describe the exact motion of the Mach 10 shock. The left boundary is set as an inflow boundary condition, while the right boundary is set as an outflow boundary condition. The computed density contours on different grid sizes are displayed in Fig. 7. The CPU, GPU and MIC computations produce identical results.
Performance Evaluation on GPU and MIC.
To evaluate the performance of our implementation, we conducted experiments using different grid sizes on the latest multi-core and many-core architectures, including the Intel Ivy Bridge CPU, the NVIDIA Fermi M2050 GPU, the NVIDIA Kepler K20c GPU and the Intel MIC coprocessor. The important hardware features of these architectures are listed in Table 1. The GPU code is implemented in CUDA C with CUDA version 5.0. All code is compiled with the -O2 option, and double precision is used in all computations.

In Fig. 3 we present the results obtained with the Intel MIC for the three different thread affinity strategies. We find that the balanced specification works better with fewer threads, and the best thread configuration is 171 threads, i.e. three threads per core on the MIC. Fig. 4 shows the results of the kernel decomposition optimization used to improve the concurrency of our code. We achieve an 8.8 times speedup for the x direction, but for the y direction the speedup is only 4.0, because the decomposition in the x direction is favorable for coalescing global memory accesses.

Fig. 3. Execution time (s) of one iteration for a fixed grid size on a MIC in native mode: scaling with respect to the number of OpenMP threads for different affinity specifications.

Fig. 4. Execution time (s) of one iteration for a fixed grid size on a K20c GPU: performance comparison before and after kernel decomposition.

From Table 2 we see that significant improvements in performance over the serial CPU code are achieved on the CPU, the Fermi M2050 GPU, the Kepler K20c GPU and the MIC after the series of optimizations. The serial-code (SC) performance of the MIC is 15.66 times lower than that of the Ivy Bridge CPU. The Kepler GPU offers a 1.26 times speedup over the previous Fermi GPU with exactly the same source code. Furthermore, while the Kepler GPU can be 3.12 times faster than the MIC when the increasingly available SIMD computing power of the Vector Processing Unit (VPU) is not used, the MIC can provide computing capability equivalent to the Kepler GPU when the VPU is utilized. However, when the GPU and MIC are compared with an Ivy Bridge CPU that is also parallelized to make full use of the available hardware resources, we are surprised to find that the speedup is low. This is mainly because of poor vectorization: although vectorization affects CPU performance as well, Table 3 shows that the SIMD speedup on the MIC is only 2.86, far from tapping the capability of its 512-bit wide vector unit. The large L3 cache of the CPU, which can hold more data, might be another reason.

Table 2. Execution time (s) of one iteration for a fixed grid size (3840*960) on different processors.
"SC" means the performance of the serial code. "MC" means the performance of intra-node parallelism using OpenMP. "SIMD" means using vectorization for intra-core parallelism. "BLC" means using cache blocking to improve data locality in the on-chip cache. Table 3 . We can observe that the speedups on MIC for SC over MC and SC+SIMD over MC+SIMD using 171 threads are 62.06× and 75.67×, respectively. The speedups on CPU for SC over MC and SC+SIMD over MC+SIMD using 171 threads are 11.94× and 9.51×, respectively. It is disappointed to see that the speedups for SIMD is very low, thus tapping the available SIMD computing power on VPU using SIMD intrinsic programming is future work. The overall speedup of CPU and MIC is 15.88× and 192.19×, respectively. In Table 4 , we present the speedups of an Ivy Bridge CPU using 24 threads over a single-core CPU, a Fermi M2050 over a single-core CPU, a Kepler K20c over a single-core CPU and a MIC using 171 threads over a single-core CPU for different grid sizes. We see that Kepler GPU and MIC can achieve a speedup of more than 10 over a single-core CPU. We observe performance degradation for Ivy Bridge CPU when the gird size is increased. This can be explained by the fact that the performance improvement of L3 cache is flooded by the grid size increases. We also find slight performance improvement for both GPU and MIC. This is because higher workloads can better overlap the computation and memory accesses for GPU and MIC. If the memory of GPU and MIC is large enough, their performance may be better than the Ivy Bridge CPU. 
Conclusion
In this paper, we proposed methodologies for the efficient parallel implementation of a high-order CFD simulation using the WENO scheme for complicated flow structures on Ivy Bridge CPUs, Fermi GPUs, Kepler GPUs and the Intel MIC. For the GPU platforms, we carefully considered memory use, occupancy and concurrency to explore performance optimization opportunities for our WENO code. For the Ivy Bridge CPUs and the MIC, we employed a series of optimization techniques, such as compiler auto-vectorization, pragmas to assist vectorization, and cache blocking, to achieve the best performance.
We have also evaluated the performance of our parallel implementations of the WENO code. The GPU and MIC can achieve a speedup of more than 10 over a single-core CPU. The Kepler GPU offers a 1.26 times speedup over the previous Fermi GPU without special tuning. Furthermore, while the Kepler GPU can be 3.12 times faster than the MIC when the increasingly available SIMD computing power of the VPU is not utilized, the MIC can provide computing capability equivalent to the Kepler GPU when the VPU is utilized.
For future work, we plan to tap the capability of the VPU using SIMD intrinsic programming to further improve the implementation on the MIC.
