Accelerating CFD simulation with high order finite difference method on
  curvilinear coordinates for modern GPU clusters by Ye, Chuangchao et al.
Accelerating CFD simulation with high order finite difference
method on curvilinear coordinates for modern GPU clusters
Chuangchao Ye, Pengjunyi Zhang, Rui Yan, Dejun Sun, Zhenhua Wan1,∗
Department of modern mechanics, University of Science and Technology of China , Hefei, 230027, China
Abstract
A high fidelity flow simulation for complex geometries for high Reynolds number (Re) flow is
still very challenging, which requires more powerful computational capability of HPC system.
However, the development of HPC with traditional CPU architecture suffers bottlenecks due to
its high power consumption and technical difficulties. Heterogeneous architecture computation
is raised to be a promising solution of difficulties of HPC development. GPU accelerating
technology has been utilized in low order scheme CFD solvers on structured grid and high
order scheme solvers on unstructured meshes. The high order finite difference methods on
structured grid possess many advantages, e.g. high efficiency, robustness and low storage,
however, the strong dependence among points for a high order finite difference scheme still
limits its application on GPU platform. In present work, we propose a set of hardware-aware
technology to optimize the efficiency of data transfer between CPU and GPU, and efficiency
of communication between GPUs. An in-house multi-block structured CFD solver with high
order finite difference methods on curvilinear coordinates is ported onto GPU platform, and
obtain satisfying performance with speedup maximum around 2000x over a single CPU core.
This work provides efficient solution to apply GPU computing in CFD simulation with certain
high order finite difference methods on current GPU heterogeneous computers. The test shows
that significant accelerating effects can been achieved for different GPUs.
Keywords: hardware-aware, high order, finite difference methods, curvilinear coordinates,
GPU
1. Introduction
Computational fluid dynamics (CFD) is one of the most important research methods in
fluid mechanics, which highly relies on the computational capability of computer, especially for
accurately simulating realistic engineering flows. The deficiency of computational capability
of computer has now became one of the biggest obstacles for the future CFD development.
High order methods have been adopted in complex flow simulations like turbulence sim-
ulation, computational aeroacoustics (CAA), on the merit that it can obtain more accurate
results with less grid than low order methods due to their low dissipation property. However,
high order methods are more computational expensive comparing to low order methods. The
real flows in nature are usually with large Reynolds number. Chapman estimated in 1979 the
grid-point requirements (N) for wall-modeled large eddy simulation (LES) to be N ∼ Re2/5L ,
∗
Email address: wanzh@ustc.edu.cn (Zhenhua Wan)
Preprint submitted to Elsevier June 16, 2020
ar
X
iv
:2
00
6.
07
96
4v
1 
 [p
hy
sic
s.c
om
p-
ph
]  
14
 Ju
n 2
02
0
for wall-resolving LES, N ∼ Re9/5L , and for direct numerical simulation (DNS), N ∼ Re9/4L
[1]. In 2012, Choi & Moin [2] revised Chapman’s estimation to be ReL, Re13/7L and Re
37/14
L
respectively.
Since Gordon E. Moore, the co-founder of Intel, observed that the number of transistors
on a microchip doubles every two years though the cost of computers is halved, the Moore’s
law has governed the development of CPU for decades. However, in the past few years, there
are many voices about gradual failure of this golden law, because the high temperatures of
transistors eventually would make it harder to create smaller circuits. In the meantime, there
are many heterogeneous computational architectures arise, such as field-programmable gate
array (FPGA), automata processor (AP) and graphics processing unit (GPU). Among those
heterogeneous architectures, the GPU computing has grown to be the most popular one in
high performance computing (HPC). GPU, originally designed for graphics rendering, releases
great computational capability due to its "many cores" architecture. Compared to traditional
CPU which contains only several cores, GPU contains thousands of cores in a chip, which
provide more powerful computation capability. In additions, GPU has lower power and is
cheaper for equivalent computation capability as CPU. In the latest TOP500 list of HPC,
half of the 10 fastest HPC systems have been equipped with GPU devices. Summit, the
fastest supercomputer in the world located in the Oak Ridge National Laboratory (ORNL)
in USA, consists of 4608 compute nodes, and each node contains 2 IBM POWER9 processors
and 6 NVIDIA Volta V100 GPUs. The GPUs deliver over 95% performance of Summit.
Compared to traditional CPU architecture supercomputer Tianhe-2A, Summit delivers double
peak performance (200 PetaFlOPS) with nearly half power consumption (10KW).
The merits of GPU computing have attracted many attempts to port CFD simulation to
GPU platform. Tutkun, et al. [3] solve compressible Navier-Stokes equations with compact
scheme combining with filter on a Tesla C1060 GPU, and get speedup of 9x to 16x compared
to a AMD Phenom CPU. Esfahanian, et al. [4] perform 3rd and 5th order WENO scheme on
two dimensional uniform mesh solving Euler equations on GPU, and get maximum speedup
of 316x. Karantasis, et al. [5] perform 5th, 7th and 9th order WENO scheme on uniform grid
with 6.9 million grid points, accelerating with Tesla S1070 GPU, and get maximum speedup
of 90x. Xu, et al. [6] perform high order finite difference WCNS scheme and Hybrid cell-edge
and cell-node Dissipative Compact Scheme (HDCS) on curvilinear coordinates. The GPU-only
approach achieves a speedup of about 1.3 on a Tesla M2050 GPU compared with two Xeon
X5670 CPUs, while the hybrid CPU and GPU approach achieves maximum 2.3x speedup.
Lai, et al. [7, 8] develop a multiple GPU algorithm in hypersonic flow computations with the
second order finite volume method on curvilinear coordinates, and obtain 20x to 40x speedup
accelerating with a Nvidia GTX1070 GPU compared to an 8 cores Intel Xeon E2670 CPU.
Elsen, et al. [9] develop a first order to sixth order finite difference method code for complex
geometry on GPU platform, and obtain speedup 40x for a simple geometry and 20x for a com-
plex geometry on a Nvidia 8800GTX GPU compared to an Intel-Core-2-Duo. Very recently,
Lei, et al. [10] accelerate a compressible flow solver based on a second order MUSCL/NND
code on cartesian coordinates with GPU. Tesla P100 GPU is used for accelerating and gets
speedup 260x for MUSCL scheme and 144x for NND scheme over a E5-2640v4 CPU core. In
general, for flow solvers in cartesian coordinates and with low order schemes, the application of
GPU gets conspicuous acceleration. But for a complicated coordinate system and high order
schemes, the flow solvers get inferior accelerating effect, which is far from the computation
requirement for the higher fidelity and larger scale simulation.
2
High order methods based on unstructured mesh, e.g. flux reconstruction (FR) method and
discontinuous Galerkin (DG) method, are popular in accurately simulating flow with complex
geometry. GPU computing achieves great success in this branch of the high order method due
to its computational independence of elements. Crabill, et al. [11] simulated the flow over a
spinning golf ball at a Reynolds number of 150,000 with flux reconstruction (FR) solver on
GPU. Witherden, et al.[12] developed a solver based on the high order flux reconstruction
(FR) which has been used for a variety of flows. Although it is easy to improve precision, the
numerical stability of this type of method in solving high speed flows is poor, which limits its
application. Furthermore, this method requires a large amount of storage which also limits
its application in large scale simulations in GPU computing. In contrast, the finite difference
method is more storage saving and more robust when strong discontinuity exists. In general,
the finite difference method is still an important method in high order CFD simulations, and
deserves further development.
In the past few years, the hardware performance of GPU has been greatly improved, and a
series of technologies were developed, which make the heterogeneous computer more effective
and mature. Current GPU computer is not simply a traditional CPU computer equipped
with one or two GPUs, it is a highly integrated hardware and software system. One computer
has multiple GPUs, and GPUs become the main power of computing. To fully exploit the
performance of current GPU computer, programs must adjust themselves to adapt to the
complicated computer architectures.
In present work, based on architecture analysis of modern GPU servers, we propose a
set of hardware-aware technology to optimize the performance of data transfer between CPU
and GPU, and communication efficiency between GPUs. An in-house code is porting onto
GPU with careful memory planning and high-efficiency kernel design. Moreover, a modified
alternative formulation of WENO scheme which saves computation and is more efficient in
GPU computing is proposed. The rest of paper is organized as follows. The numerical methods
are presented in Section 2. Then, the hardware facility is introduced in Section 3. The
programming and optimization strategies are introduced in detail in Section 4. Furthermore,
the performance cases and practical case are presented in Section 5 and 6, respectively. Finally,
some conclusions are summarized in Section 7.
2. Numerical methods
The present study is based on an inhouse code "HiResX (High Resolution flow Solver)",
which aims to simulate compressible flows with complex geometries in high order accuracy with
finite difference methods. To treat complex geometries, we solve the Navier-Stokes equations
on curvilinear coordinates, which can be written as
∂U˜
∂t
+ ∂(E˜ − E˜v)
∂ξ
+ ∂(F˜ − F˜v)
∂η
+ ∂(G˜− G˜v)
∂ζ
= 0 (1)
where U˜ is the vector of conservative flow variables, E˜, F˜ , G˜ are inviscid flux vectors, and
E˜v, F˜v, G˜v are viscous flux vectors, defined as
U˜ = 1
J
(
ρ, ρu, ρv, ρw, e
)T
(2)
3
E˜ = 1
J

ρU
ρuU + ξxp
ρvU + ξyp
ρwU + ξzp
(e+ p)U
 ,
F˜ = 1
J

ρV
ρuV + ηxp
ρvV + ηyp
ρwV + ηzp
(e+ p)V
 ,
G˜ = 1
J

ρW
ρuW + ζxp
ρvW + ζyp
ρwW + ζzp
(e+ p)W

(3)
E˜v =
1
J

0
ξxτxx + ξyτxy + ξzτxz
ξxτyx + ξyτyy + ξzτyz
ξxτzx + ξyτzy + ξzτzz
ξxβx + ξyβy + ξzβz
 ,
F˜v =
1
J

0
ηxτxx + ηyτxy + ηzτxz
ηxτyx + ηyτyy + ηzτyz
ηxτzx + ηyτzy + ηzτzz
ηxβx + ηyβy + ηzβz
 ,
G˜v =
1
J

0
ζxτxx + ζyτxy + ζzτxz
ζxτyx + ζyτyy + ζzτyz
ζxτzx + ζyτzy + ζzτzz
ζxβx + ζyβy + ζzβz

(4)
τxx = λ(ux + vy + wz) + 2µux, τxy = τyx = µ(uy + vx)
τyy = λ(ux + vy + wz) + 2µvy, τxz = τzx = µ(uz + wx)
τzz = λ(ux + vy + wz) + 2µwz, τyz = τzy = µ(vz + wy)
βx = uτxx + vτxy + wτxz + q˙x
βy = uτxy + vτyy + wτyz + q˙y
βz = uτxz + vτyz + wτzz + q˙z
q˙x =
µ
Pr (γ − 1)
∂T
∂x
, q˙y =
µ
Pr (γ − 1)
∂T
∂y
, q˙z =
µ
Pr (γ − 1)
∂T
∂z
(5)
where (ξ, η, ζ)x,y,z are metrics, and J is the coordinate transformation Jacobian.
The derivatives of inviscid fluxes are approximately with the high order alternative for-
mulation of weighted essentially non-oscillatory scheme (AFWENO) [13]. Different from the
reconstruction idea of traditional high order WENO scheme [14], this new formulation of
4
WENO is based on the idea of interpolation. Similar to the traditional WENO scheme, the
derivative is written in a conservative form:
df(u)
dx
= 1∆x
(
fˆi+ 12
− fˆi− 12
)
(6)
The numerical fluxes are approximated as
fˆi+ 12
= fi+ 12 +
[(r−1)/2]∑
k=1
a2k∆x2k
∂2kf
∂x2k
∣∣∣∣∣∣
i+ 12
+O
(
∆xr+1
)
, (7)
where the coefficients a2k can be obtained by Taylor expansion. In order to achieve fifth order
accuracy, we here fix r = 5 and then
fˆi+ 12
= fi+ 12 −
1
24∆x
2fxx
∣∣∣∣
i+ 12
+ 75760∆x
4fxxxx
∣∣∣∣
i+ 12
, (8)
where fi+ 12 is the numerical flux at xi+ 12 , which can be approximated as
fi+ 12
= h
(
u−
i+ 12
, u+
i+ 12
)
. (9)
The values of u±
i+ 12
can be obtained with a linear combination of the low order interpolation
approximate values based on the small stencils, with the global stencil biases to left for “ + ”,
and right bias for “− ”.
ui+ 12
= w1u(1)i+ 12 + w2u
(2)
i+ 12
+ w3u(3)i+ 12 (10)
The approximations based on three sub-stencils are given explicitly as
u
(0)
i+ 12
= 38ui +
3
4ui+1 −
1
8ui+2
u
(1)
i+ 12
= −18ui−1 +
3
4ui +
3
8ui+1
u
(2)
i+ 12
= 38ui−2 −
5
4ui−1 +
15
8 ui
(11)
The form of nonlinear weights of WENO reads
wj =
w˜j
w˜1 + w˜2 + w˜3
, with w˜j =
γj
(+ βj)2
(12)
where γj (j = 1, 2, 3) are the ideal linear weights, given as
γ1 =
1
16 , γ2 =
5
8 , γ3 =
5
16 (13)
 is a small positive number to avoid the denominator to become zero, which is typically fixed
to be  = 10−6 in actual computations. The smoothness indicators are chosen as the scaled
5
sum of the square L2 norms of all the derivatives of interpolation polynomial, which can be
easily worked out as
β1 =
1
3(4u
2
i−2 − 19ui−2ui−1 + 25u2i−1
+ 11ui−2ui − 31ui−1ui + 10u2i )
β2 =
1
3(4u
2
i−1 − 13ui−1ui + 13u2i
+ 5ui−1ui+1 − 13uiui+1 + 4u2i+1)
β3 =
1
3(10u
2
i − 31uiui+1 + 25u2i+1
+ 11uiui+2 − 19ui+1ui+2 + 4u2i+2)
(14)
The two-argument function h is a monotone flux, which can be chosen from exact or any
approximate Riemann solvers, such as HLL, and Roe. In this work, the Steger-Warming flux
is utilized to construct the numerical flux fi+ 12 .
The second and fourth order derivatives in Eq.(8) can be simply explicitly approximated
with a central scheme as
fxx|i+ 12 =
1
48(−5fi−2 + 39fi−1 − 34fi − 34fi+1
+ 39fi+2 − 5fi+2)
fxxxx|i+ 12 =
1
2(fi−2 − 3fi−1 + 2fi + 2fi+1 − 3fi+2 + fi+3)
(15)
Substitute Eqs.(8) and (15) into Eq.(6), and we get the final expression as
fx|i = fi+ 12 − fi− 12 +
17
256(fi+1 − fi−1)−
13
320(fi+2 − fi−2)
+ 193840(fi+3 − fi−3)
(16)
The semi-discretized Eq.(6) can be explicitly solved with the 3 stage TVD Runge-Kutta
scheme as follows
Q(1) = Qn + ∆t R (Qn)
Q(2) = 34Q
n + 14Q
(1) + 14∆t R
(
Q(1)
)
Qn+1 = 13Q
n + 23Q
(2) + 23∆t R
(
Q(2)
) (17)
3. Hardware environment
A traditional CPU-based supercomputing cluster usually consists of hundreds of thousands
of computational nodes, and each node contains dozens of cores. These nodes are intercon-
nected with high speed network. Up to now, the number of cores in a single CPU is still limited.
Due to this limitation, the large task must be divided into a lot of partitions and distributed
to many nodes. One obvious drawback is that the complex communication between nodes
may lead to high latency. Furthermore, the price/performance ratio of traditional CPU-based
cluster is low, due to the fact that you have to pay for the frame of each computation node
by increasing CPU cores, which does not directly contribute to computational capabilities.
6
CPU 1 CPU 2
RAM
Server 1
CPU 1 CPU 2
RAM
Server 2
Ethenet
PCIe PCIe
Figure 1: Framework of modern GPU cluster. The computational nodes are connected with high speed
network. GPUs deliver the majority of performance of modern GPU cluster.
RAM-0
CPU-0
GPU0
PCIe Switch
QPI/UPI
RAM-1
CPU-1
GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
PCIe Switch PCIe Switch PCIe Switch
(a)
RAM-0
CPU-0
GPU0
PCIe Switch
QPI/UPI
RAM-1
CPU-1
PCIe Switch
GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
(b)
RAM-0
CPU-0
PCIe Switch
QPI/UPI
RAM-1
CPU-1
PCIe Switch
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
(c)
RAM-0
CPU-0
PCIe Switch
QPI/UPI
RAM-1
CPU-1
PCIe Switch
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
(d)
Figure 2: Several typical PCIe root architectures of GPU server. For type (a), each NUMA node has PCIe
switch attached on them, and there are two GPUs mounted on each PCIe switch. For type (b), only one PCIe
switch is mounted on each NUMA node, and there are 4 GPUs attached on one PCIe switch. For type (c), all
GPUs are mounted on one NUMA node by two PCIe switches. For type (d), all GPUs are mounted on one
NUMA node.
7
The computational capability of a computing device with specific space occupation can
be represented as "computing density". In the past few years, supercomputing has come
to heterogeneous computing era, and higher computing density supercomputers have been
developed. Those computers are accelerated with some devices called many-core processor
accelerators, such as GPU and FPGA. In the realm of heterogeneous computing, GPU is
the most popular one. Fig.1 shows the framework of modern GPU server cluster. Each
node contains two CPUs and much more GPUs than before. The GPUs in the same server
communicate with each other through PCIe slots, and further communicate with GPUs in
other servers by network. More sophistic technology called Nvlink has developed by NVIDIA
to enable direct data transfer between GPUs in the same server at faster speed than PCIe’s.
GPU Models RTX 2080 Ti Tesla P100 Titan V
CUDA Cores 4352 3584 5120
SM count 68 56 80
Based Clock (MHz) 1350 1190 1200
Total Memory 11 GB GDDR6 16 GB HBM2 12 GB HBM2
Memory Bandwidth (GB/sec) 616 732.2 652.8
FP32 Performance (TFlops) 13.45 9.3 14.90
FP64 Performance (TFlops) 0.42 4.7 7.45
Computation Capability 7.5 6.0 7.0
Architecture Turing Pascal Volta
Table 1: Present GPU spces. Titan V has highest double precision operation performance. RTX 2080 Ti
utilizes the newest architecture and provides highest single precision operation performance.
Intel Xeon E5 2680 v3
# of Cores 12
# of thread 24
Based Frequency (GHz) 2.50
Cache 30 MB
Max Memory Bandwidth 68 GB/sec
FP32 Performance 960 GFlops
FP64 Performance 480 GFlops
Table 2: Present spce of CPU. The double precision operation performance of E5-2680v3 is slightly higher
than RTX 2080 Ti’s.
The two GPU servers utilized in this work are equipped with 10 NVIDIA RTX 2080 Ti
GPUs and 10 NVIDIA Titan V GPUs, respectively. The RTX 2080 Ti GPU utilizes the newest
Turing architecture of NVIDIA GPU, while Titan V utilizes the architecture Volta. The RTX
2080 Ti GPU is a consumer-oriented product aiming at video and game, though it has up to
4352 CUDA cores, the double precision (DP) performance is only 1/32 of single precision (SP)
performance. However, the core GV100 of Titan V contains 5120 CUDA cores, and ratio of
DP/SP is up to 1/2, which makes it still the most powerful GPU core up to now. We also
utilize the Tesla P100 GPU, based on much older architecture Pascal. Although it contains
less CUDA cores than RTX 2080 Ti, it has much higher theoretical DP performance. The
8
detailed specs of GPUs mentioned above can be found in Tab.1. The CPU of the server is Intel
Xeon E5 2680 v3, with 12 cores in a single CPU, and the spec is given in Tab.2. The ideal
computational performance of dual-CPU in the server is 960 GFlops (FP64 performance).
With GPU acceleration, 10 RTX 2080 Ti GPUs deliver 134.5 TFlops SP float performance
and 4.2 TFlops DP float performance, while 10 Titan V GPUs deliver incredible 149.0 TFlops
SP performance and 74.5 TFlops DP performance. In this work, no Nvlink is applied.
4. Programming implementation and optimization of HiResX
4.1. Code introduction
HiResX is originally written in Fortran 90, aiming at simulating various compressible
flows with complex domains in high order schemes. HiResX solver is equipped with vari-
ous turbulence models such as Reynolds-Averaged Navier-Stokes (RANS) models, LES model
and detached eddy simulation (DES) model to simulate engineering realistic flows with high
Reynolds numbers. Message Passing Interface (MPI) is utilized to distribute computation
tasks to different CPU cores inter- or intra- node.
There are several strategies of performing computation on GPU devices. OpenACC of-
fers a user-driven lightweight solution that the user needs simply adding some directive-based
clauses in the code segment which needs to accelerate on GPU with significantly less program-
ming effort than required with a low-level model. CUDA Fortran based on PGI Fortran is
another solution provided by PGI company, which offers programming flexibility using For-
tran language. However, CUDA Fortran highly relies on the PGI compiler. CUDA C/C++
is a programming language based on standard C/C++ and developed by NVIDIA, and it
provides better affinity and flexibility to operate NVIDIA’s GPUs at low-level programming.
However, programming with CUDA C/C++ needs much higher skill. To port an existing
complex code onto GPU by completely rewriting it with another programming language is
not a wise choice. In order to fully exploit the performance of current GPU server without
rewriting whole the solver, mixed programming of Fortran and C/C++ is applied, and GPU
computing is achieved with CUDA C/C++.
GPU computing for each process is independent with each other whenHiResX is running
in parallel, thus HiResX supports three parallel modes: CPU-only mode, GPU-only mode
and collaborating (or hybrid) CPU/GPU mode. In CPU-only parallel mode, all processes
perform computation on CPU only, whether if the computation node has GPU devices or not.
This parallel mode is suitable for most of the traditional supercomputers which support CPU
computing only. In GPU-only parallel mode, all processes are running on GPU devices, hence
for supercomputing cluster, every computational node running the code must equipped with
a GPU device. The hybrid CPU/GPU mode is a parallel mode that can make fully use of all
CPU and GPU resource in the cluster, as mentioned in Ref.[15]. For this parallel mode, the
computational capability of CPU and GPU in one node should be roughly comparative.
The procedures of GPU computing of HiResX are illustrated in Fig.4. After the initial-
ization of program, the process which runs computation on GPU uploads the conservative
variables Q, primitive variables Qv, Jacobian J and metrics at cell node to GPU device. To
avoid frequent data transfer between CPU and GPU, all these variables in computation do-
main are uploaded to GPU and stored in GPU’s global memory. The Jacobian J and metrics
at cell edge which are computed by CPU at startup would not be uploaded with consideration
of three aspects. Firstly, if all these variables at cell edges in three directions are stored in
9
Figure 3: Main structure of HiResX summarized by pseudo code
configuration CPU init
GPU init
H2D
D2H
MPI
inviscid flux
viscid flux
BC
CV2PV
time advance
control
D
a
ta
 0
D
a
ta
 1
kernels
D2H
I/O
D
a
ta
 3
p
th
re
a
d
w
ri
te
CPU finalize
GPU finalize
End
1 iteration
other 
processes
Figure 4: A global glance of execution procedures of HiResX running on GPU. Except for initialization and
data performed on CPU, all computations are performed on GPU while CPU is used to schedule kernels of
GPU only. "Data 0" represents the initial data computed in CPU and uploaded to GPU. "Data 1" represents
data exhanged between processes. "Data 3" represents flowfield data to be written, and it should be downloaded
to CPU and written with a new thread.
10
memory of GPU, additional memory space for 12 variables is needed, which reduces the max-
imum grid number that a single GPU can deal with. Secondly, if only memory space of one
direction is allocated on GPU, though the memory space is reduced to a third of the former
case, frequent data transfer during direction switching is time-consuming. Thirdly, for GPU,
computational capability is much stronger than memory bandwidth, hence direct interpola-
tion from Jacobian and metrics at cell node is more time-saving. The calculation of residual
dQ on right hand side, time advancement and primitive variables updating from conservative
variables are completely running on GPU. When physical boundary condition is applied, very
little data is needed to upload onto GPU, such as the ID number of boundary points, which
takes very little time that can be ignored.
For traditional CPU computer, the user may not concern on where and how their program is
running. When the program runs in parallel, the user needs only to distribute the computation
to different CPU processes with MPI, or to different threads by tools (e.g. OpenMP). Users
rarely care about the hardware information when they develop the program. However in
current GPU computing, data exchanges at block connecting boundary are more complex,
due to the complex environment in heterogeneous computational architecture. Fig.5 shows
the communication structures of our solver. When the solver runs in CPU-only mode, data
exchange among blocks in the same process is straightforward by simply copying data without
communication with other process, as is depicted in the figure with the yellow pentagon pairs.
For blocks in different processes in the same node, data exchange among them is achieved
by use of shared memory communication technique in MPI-3 standard, as is depicted in the
figure with the red five-pointed star pairs. For blocks in different processes in different nodes,
data exchange is achieved by standard MPI communication, which is marked as the brown
triangle pairs.
In GPU computing, the data must be uploaded to GPU from CPU first, and downloaded
back to CPU when GPU computation finished. The data has to be transferred between memo-
ries of CPU and GPU through PCIe bus. In the early stage of development of GPU computing,
each GPU is installed in different PCIe slots and works independently, and communicates with
CPU only. If the program is running in parallel with multiple processes on GPUs, the data
to be sent has to be downloaded from GPU and then communication is fully performed in
CPU through MPI, when communications among processes are involved. After the data is
exchanged, the received data has to be uploaded to GPU. It is obvious in this procedure that
processes in different GPUs communicate by use of CPU memory and explicit data copying
between CPU and GPU. When data on GPU is downloaded to CPU, the communication re-
stores to general communication in CPU, and all techniques can be applied, as is depicted with
four-pointed star pair and seven-pointed star pair. However, the Unified Virtual Addressing
(UVA) technology and CUDA-aware MPI technology make this procedure simpler and more
efficient. There is no need to perform explicit data copying because the MPI interfaces can
recognize the location of buffer data and then find the optimal path to transfer data. For
blocks in the same process with GPU accelerating, data can be exchanged on GPU without
leaving device.
When the program running in hybrid CPU/GPU mode, the communications among pro-
cesses without GPU accelerating (CPU-only mode) are the same as the communications among
processes with GPU accelerating (GPU-only mode). The only thing to deal with is the com-
munication between processes with or without GPU accelerating, as is marked with diamond
pair in Fig.5. For the process with accelerating, data to be sent should be downloaded to CPU
11
RAM-1
GPU-1 GPU-2
RAM-2
Proc-3 Proc-4
(Remote Node)
P
C
I-
e
P
C
I-
e
IBIB
Figure 5: Framework of communication of HiResX without GPU peer to peer communication technology
support. For communication between processes that are all running on CPU, if processes are located in the
same node, they exchange data within RAM with MPI-3 shared memory technique, see red path. If processes
are located in different nodes, the standard MPI communication is utilized, see yellow path that connects
triangle pair. For processes with GPU acceleration, data on GPU should be downloaded back to CPU, and
then processes communicate in the same way as processes without acceleration.
memory explicitly if CUDA-aware MPI is not applied, while the process without accelerating
does nothing with GPU. When data is downloaded to CPU, the communication can be per-
formed similar to that in CPU-only mode. However, with CUDA-aware MPI technology and
UVA, the data in GPU and CPU can be exchanged directly.
4.2. Domain decomposition
TheHiResX solver is developed to simulate flows with complex domains with multi-block
structure grid. Fig.6 illustrates the domain decomposition strategies of HiResX . The whole
computation domain is splitted into a lot of connecting blocks, and these blocks are distributed
to different processors. For hybrid CPU/GPU parallel computing, the sizes of grid blocks can
be different. The size of grid block for CPU is usually much smaller than that for GPU, since
the latter has higher computational capability.
For HiResX , in order to balance the computational loads of processes of CPUs and
GPUs, the whole domain can be splitted arbitrarily without the limitation of one-to-one block
connection, and the block connecting faces and physical boundary faces can be also arbitrarily
defined on blocks.
4.3. Hardware-aware technique
In current GPU server, there are more GPUs installed in a single machine. Moreover,
a variety of PCIe root architectures have been designed for different performance needs, as
demonstrated in Fig.2. The whole hardware system needs to match the increasing performance
of current GPUs, and the solver also need to adapt to the hardware system for maximum
computational performance.
Complex PCIe root architecture will affect efficiencies of memory access of CPU-GPU and
even process unit (PU) to RAM. Figs.7(a) and 7(b) show the memory access models of GPU
to CPU and GPU to GPU. The server has two CPUs interconnected with QPI, and it is
a dual root system. In Fig.7(a), a process is running in process core PU0 in CPU-0, and
selects GPU0 for acceleration. The data of this process is allocated in the memory (RAM) in
the same root with CPU-0 and GPU0. As a result, the process visits its data in RAM with
12
Figure 6: Domain decomposition strategies of HiResX . The domain is divided into several blocks according
to performance of CPU and GPU in order to balance workloads of each process. For the blocks computed by
CPU, the whole block is computed by a CPU process. For the blocks computed by GPU, each grid cell is
computed by one CUDA thread.
(a) CPU-GPU memory access model (b) GPU-GPU memory access model
Figure 7: CPU-GPU and GPU-GPU memory access models. In (a), red path is the optimal one, while pink
path is inferior because CPU to GPU memory access across NUMA nodes is worse than local access. In
(b), if GPUs support peer-to-peer (P2P) communication technology, green path indicates that two GPUs
communicate by Nvlink, which is the fast path. Purple path indicates that two GPUs communicate by PCIe
switch (PLX), which is most efficient communication way without Nvlink. Yellow path means that two GPUs
communicate by host (CPU). Red path is the worst one because P2P is not supported across NUMA nodes,
and memory access between two GPU in this way must be transferred by both CPUs.
13
optimal bandwidth, and GPU0 also transfers data to CPU with optimal bandwidth. This is
the ideal situation. However, in fact, the PU in which the process is running is not settled,
and is managed by the system by default. For pink path in Fig.7(a), the process is running in
PU2 in CPU-1, but selects GPU2 as accelerator. Though it visits data allocated in the closest
RAM with optimal bandwidth, GPU2 is not in the same root with CPU-1. Data transfer
between CPU and GPU must traverse QPI because of unsupported direct memory access, so
it is very inefficient. The worse case is the blue path. The data is allocated in RAM close to
CPU-0 and GPU1, but the process is running in PU of CPU-1, resulting in poor efficiency of
both GPU memory access and RAM access.
PCIe root architecture will also affect GPU to GPU communication or memory access.
Fig.7(b) gives several possible models of GPU to GPU memory access in current GPU server.
The red path is the worst case since it traverses QPI, and the topology of GPU2 and GPU3
is usually marked as "SYS". For brown path, GPU4 and GPU5 are attached to the same IOH
chip, which provides better communication efficiency than "SYS". The topology is usually
called "PHB". The purple path connects two GPUs in the same PCIe switch, which provides
better efficiency than "PHB", and its topology is marked as "PIX". The green path is the best,
GPUs interconnect each other directly with Nvlink, which is the fast way of GPU to GPU
communication so far, labeled as "NV".
To exploit the optimal efficiency of the hardware system, users should know the topol-
ogy of the hardware system. The Hardware locality (HWLOC) software package provides
an approach to gather hierarchical topology information about modern increasingly complex
parallel computing platforms, including NUMA (Non-Unified Memory Access) memory node,
shared cache, cores and multithreading, as well as I/O devices such as GPUs, InfiniBand
HCAs. With the topology information, users can optimize their own programs to obtain the
best performance. It is should be noted that the optimizations of CPU-GPU communication
and GPU-GPU communication are interdependent, which will be introduced next.
4.3.1. CPU-GPU communication optimization
To optimize the performance of memory access between CPU and GPU, the only way is to
get them "closer". The word "close" is a iconic description of two device that has larger data
transfer bandwidth. As aforementioned, the best strategy is to make the RAM, PU and GPU
be utilized by a process located in the same root. There are many ways to bind a process to
specified PU, including process binding technique of HWLOC, APIs of NVIDIA management
Library (NVML) and process binding technique in Linux system.
Many commercial and open source software packages provide memory binding. MVA-
PICH allows data allocation when calls "MPI_Init()", and OpenMPI allocates it after calling
"MPI_Init()". The problem is that, the size of memory space to allocate before program
calling "MPI_Init()" and reading input files and grids is unkonwn, and we have to run MPI
initialization to get the process ID, so that process can determine which grid files to read.
What is more, to optimize the GPU-GPU communication, we should know the communica-
tion data load of each process, then we can decide which GPU to be selected for acceleration.
After the GPU of a process is chosen, we can bind process to the CPU that the selected GPU
attached to. Finally we allocate memory and bind it to the CPU. The steps of the procedure
are briefly outlined:
• Run the program first until the processes get its GPU ID. When the solver run in parallel,
each process gets its process ID, and accordingly reads its input files. Process calculates
the communication data load according to block connection information. Based on the
14
communication data load and the topology of GPUs of local machine, the GPU-GPU
communication optimization will select the optimal GPU for the process.
• Find the NUMA node (CPU) near given GPU ID, and bind current process or thread
to it. Here we utilize HWLOC software package to do it. Firstly, set the cpuset of given
GPU ID with API function hwloc_cudart_get_device_cpuset(). Secondly, bind current
process or thread to this cpuset with hwloc_set_cpubind(). Finally, get the NUMA node
of this cpuset with hwloc_get_obj_covering_cpuset().
• Allocate memories in RAM for data of the process. If there is allocated memory of
data, free it or migrate it to the CPU selected above. To allocate memory for data in
the RAM attached to the selected CPU, the API function hwloc_alloc_membind() in
HWLOC is utilized. This API function will allocate memory in given size in RAM of
local NUNA node, and return address of the this memory space. For memory of data
that has been allocated before GPU ID is got, we can migrate it to the right NUMA
node with API function hwloc_set_area_membind() when the address and NUMA node
ID are provided.
4.3.2. GPU-GPU communication optimization
GPU-GPU communication is relatively easier to perform, which is efficient if GPUDirect
for Peer-to-Peer (P2P) technology developed by Nvidia is applied. There are many Nvidia
GPUs support GPUDirect P2P, such as Tesla series and Quadro series GPUs. GPUDirect
P2P technique includes P2P memory access and P2P transfer and synchronization, and it
is an optimization technique of GPU-GPU communication in the same system. With it,
buffers between memories of two GPUs in the same computer can be copied directly through
PCIe lane or Nvlink. The efficiencies of GPU-GPU communication by use of GPUDirect P2P
technique vary with the path the peers is connected with. Nvlink offers maximum bandwidth,
thus it is the most efficient path. Nvidia has developed new Nvlink technology which enables
interconnection of any pair of GPUs in a single system that has multiple GPUs. This strategy
works well. However, this is a highly customized product which needs specified motherboards
and only supports Tesla P100 SXM and Tesla V100 SXM2. There are many GPU servers
do not support Nvlink and GPUs are still install in PCIe stlots. Therefore, the GPU-GPU
communication optimization is directed to those GPUs communicating through PCIe.
To optimize the GPU-GPU communication through PCIe, we need the topology of local
system and the sizes of buffers to be exchanged among processes located in this server. Then
we can distribute process groups which include dense communication load to GPU groups
whose communication is more efficient, thus improve global communication efficiency. The
details of optimization are introduced below:
• The solver runs in parallel, and each process reads its input file and gets its communi-
cation buffer sizes to other processes. Each process may communicate with zero or one
process, or multiple processes. Those processes may be distributed in one or multiple
servers.
• Gether processes located in the same server and let them know each other. MPI software
package offers APIs to achieve it. Firstly, new communicator in local server can be cre-
ated with API fucntionMPI_Comm_split_type() with split type "MPI_COMM_TYPE_SHARED",
for better distinguish below we call it "MPI_SHARED_LOCAL". The size of new com-
municator and local rank (or ID) in new communicator of current global process can be
obtained with APIs MPI_Comm_size and MPI_Comm_rank respectively. Secondly,
get the groups of global communicator "MPI_COMM_WORLD" and local communi-
15
cator "MPI_SHARED_LOCAL". Thirdly, gather the processes that current process
communicates with, and translate it into local communicator. A global process ID that
is not located in local communicator will be marked as "MPI_UNDEFINED", so that
we filter out processes to communicate with in local server.
• Each process in local communicator shares its information with all other local processes.
The information here is the sizes of communication buffers to target processes in local
communicator. Then each local process knows the local communication network and
buffer size of each connecting line. This procedure can be easy to implement with MPI
API function MPI_Allgather().
• Each local process inquires the topology of GPU pairs in local server. We can obtain
the connection type of two GPUs by use of NVML API function nvmlDeviceGetTopol-
ogyCommonAncestor(device1, device2, &topo_type) according to the returned variable
topo_type. The connection type of two GPUs in the same system should be "NV", "PIX",
"PXB", "PHB", or "SYS", which range in the order of efficiency decrease. We will not
consider "NV" as mentioned above. Then we organize GPUs in groups. Firstly, each
CPU (NUMA node) gathers GPUs that are attached to it. Any two GPUs in this group
communicate with each other with efficiency not worse than "PHB". Secondly, in each
CPU, gather the GPUs that in the branch in which any two GPUs communicate with
each other with efficiency not worse than "PXB". Thirdly, in each "PXB", gather the
GPUs that are in the same PCIe switch. GPUs in a single PCIe switch have connection
type "PIX" with each other, and there may be multiple "PIX" groups attached to a GPU.
• Filter out the busy GPUs in groups above, and get groups that contain idle GPUs only.
The utilization of GPU can be inquired with NVML API function nvmlDeviceGetUti-
lizationRates().
• Partition local processes according to number of GPUs of each CPU. If all local processes
is less than GPUs of any CPU, choose the CPU that can hold all local process. If
no single CPU can hold all processes, divide local processes into partitions according
number of GPUs in each CPU, and make sure that these partitions have minimal total
communication data with each other.
• In each CPU that gets a partition, divide the processes into partitions for "PXB" groups
in the fashion above. The "PIX" groups should be dealt with in similar fashion too.
With the procedure above, each process in local server is binded to the GPU that makes it
communicate efficiently with other GPUs attached to its target communicating processes. It
is worth to mention that, in modern GPU cluster, servers usually interconnect with infiniband
network. With GPUDirect Remote Direct Memory Access (RDMA) technology, buffers can
be directly sent from the GPU memory to a network adapter without staging through host
memory. In this case, the network adapter should be included during optimization above, and
we can simply regard it as a communication target like process.
4.4. Memory utilization
How to utilize memory of GPU is one of key points in programming GPU solver. Strate-
gies of memory utilization of GPU vary from code to code in different fields according to
algorithms applied and computing scales. In the early stage of GPU computing, GPUs were
mainly designed for image and video processing which usually involves simple algorithms like
matrix operations. Perform simple algorithms on a bunch of single dataset would not occupy
much memory of GPU, hence GPUs in several years ago equipped with very low memory
capacity, which is usually not higher than 4GB. For scientific computing involves large scale
16
computation, memory occupation is a challenge for CPU, thus it is hard to hold all data in
GPU memory. Due to limitation of memory capacity, GPU usually works as local accelerator,
where most of the data of program is fixed in memory of CPU, and upload a part of data
onto GPU for computation acceleration, and then download result back to CPU. This strategy
enables the solver to deal with much more data with limited memory of GPU. Data transfer
between CPU and GPU through PCIe lane is time-consuming, which must be compensated
in GPU computing, in order to get acceleration. However, with performance of GPU improv-
ing greatly, massive and reduplicative data transfer gradually becomes bottleneck of GPU
computing.
Bandwidth of PCIe has not increased significantly over the past few years, but the memory
capacities of GPUs increase a lot. In particular, some professional GPUs are designed for sci-
entific computing, which equipped with memory up to 32GB. Moreover, current GPU server
contains more GPUs, which make a single server equipped with powerful computational ca-
pacity. Hence, the strategy of memory utilization should be adjusted to adapt to current GPU
computing. We will introduce the detail of memory utilization of our solver in the following.
There are two basic and vital principles of memory utilization in our solver. On the one
hand, allocate fixed permanent memory space for key dataset. On the other hand, avoid
reduplicative memory allocation and deallocation by allocating a piece of permanent memory
space that is reasonably large enough for potential public usage.
In the whole procedure of solving Eq.(1), we select several fundamental and dominant
variables from all variables. A variable is fundamental when it is used throughout the whole
computation and many other variables can be obtained with them. The metrics (ξ, η, ζ)x,y,z
and Jacobian J are utilized all over the computation and they are constants if the grid is
stationary, so they are fundamental variables. The conservative flow variable U˜ in Eq.(2)
is fundamental as well, because it plays a key role in time advancement, in calculating the
interpolated variables in WENO interpolation, moreover, it is also the source to compute
primitive variable Qv = (ρ, u, v, w, p, T ). For convenience in utilization, we will not include
the Jacobian J in it, and only store Q = (ρ, ρu, ρv, ρw, e). Although Qv can be obtained from
Q, it is always used directly during the whole computation. So we have to sacrifice memory to
avoid unnecessary computation time. For viscous flow computations, molecular viscosity µ is
unavoidable, which is used in computing viscous flux and local time step. In time advancement
with Runge-Kutta scheme, as shown in Eq.(17), Qn is a variable that should be stored. For
time advancement in local time step, the time step ∆t at each grid point should also be stored.
R(Q) represents the term of right hand side and contains contribution of derivatives of inviscid
flux and viscous flux, and is used in time advancement and residual computation. Thus R(Q)
should be stored during the whole computation. In the following, we denote the size of original
grid of current process with N , and the size of grid containing ghost cell with M . According
to above analysis, we need to allocate memory for fundamental and dominant variables with
32 ∗M elements.
During computations of derivatives of inviscid and viscous fluxes, there are many temporary
variables at each grid point. According to numerical methods applied in inviscid terms, we
need memory space to store the physical flux fi and numerical flux fi± 12 , as shown in Eq.(16).The total size of temporary variables required is 10 ∗M . In the viscous term, the derivatives
of (u, v, w, T ) with respect to (x, y, z) are widely used. Moreover, viscous flux has to be
stored according to our kernel mentioned in the next section. In total, we need to allocate
memory space for temporary variables in the viscous term with size of 17 ∗M . However, the
17
Figure 8: Organization of CUDA threads for NVIDIA’s GPU and its mapping to hardware structure
inviscid term and viscous term are computed successively, and there is no need to allocate
memory space for temporary variables in inviscid term and viscous term, respectively. Thus
we allocate memory space with size of 17 ∗M elements, which is enough for current public
usage and potential usage elsewhere.
In short, we need permanent memory space with a size of 49 ∗M elements. The variables
mentioned above are all in float type. In our solver, arrays in integer type are widely applied as
well, such as in blocks communication and physical boundary conditions. We allocate memory
for arrays for public usage in the way mentioned above too. Our solver can avoid unnecessary
data transfers between CPU and GPU. Meanwhile, we save memory of GPU up to the hilt,
and increases the maximum grid size the solver can deal with. The maximum grid capacities
of our solver in different single GPU are shown in Tab.3. The data is stored in double precision
float type. The result is satisfying for a single GPU. The computation speeds in full load are
also well acceptable in general applications, because much more grid will lead to unacceptably
computation speed. Therefore, the memory strategy mentioned above is well matched with
our programming algorithm strategy.
GPU Memory capacity (GB) Maximum grid capacity (million)
RTX 2080 Ti 11 25
Tesla P100 16 35
Titan V 12 27
Table 3: Maximum grid capacity of HiResX solver in different GPUs
4.5. CUDA kernels
The efficiency of GPU computing relies heavily on the algorithm and programming skills.
To fully exploit the parallel capability of GPU, one needs deep understanding of hardware
of GPU and its execution mechanism. The GPU architecture consists of a scalable array of
Streaming Multiprocessors (SM). Fig.8 shows the structure of GPU in software and hardware
layers. Each SM in a GPU is designed to allow concurrent execution of hundreds of threads,
18
and generally there are multiple SMs per GPU, so it is possible to have thousands of threads
executing concurrently on a single GPU. In CUDA, all the threads are organized in group
called thread block. A thread block can only be scheduled on one SM, and remains on that
SM until execution finishes. Hence, how to utilize threads is one of the primary things in
kernel design.
In CUDA memory model, there are many types of memory that are programmable to
user, as shown in Fig.8. Global memory is the largest, highest-latency, and most commonly
used memory on a GPU. It is accessible to any SM throughout the lifetime of the application.
Registers are the fastest memory space on a GPU, and are partitioned among active warps
in an SM, so register variables are private to each thread. Registers store automatic variables
declared in a kernel without any other type qualifiers. Arrays declared in a kernel with constant
that can be determined at compile time, are also stored in registers. Local memory in nature
is a part of memory resides in global memory, and is used to store variables in a kernel that are
eligible for registers but cannot fit into the register space allocated for that kernel, including
arrays referenced with indices that cannot be determined when code is compiled. Shared
memory is a type of on-chip memory space that has much higher bandwidth and much lower
latency than local or global memory. Registers and shared memory are scarce resource, and
these limited resources impose a strict restriction on the number of active warps in an SM,
thus affect the parallel performance in an SM. Therefore, how to utilize these types of memory
is another thing that is vital to performance in kernel design.
In GPU computing, CPU is called host, and GPU is called device. Kernel function is a
piece of code that runs on device. Design of kernel functions is the core of GPU computing.
In current computation, most of the computation in solving Eq.(1) is spent on evaluation of
inviscid term and viscous term. Evaluating inviscid term and viscous term involves much more
complex operations than other sections of computation. Therefore we give will more details
of them in the following.
4.5.1. Inviscid fluxes
We consider a three dimensional grid block with sizes of (NI,NJ,NK) in I, J,K direc-
tions. In CPU source code, derivatives of convective fluxes with respect to ξ, η, ζ directions
are evaluated in sequence, and in each direction we perform procedure from Eq.(7) to Eq.(16).
Take the derivative in I direction as example, in the JK face of grid block, NI ∗ NJ one-
dimensional problems in I direction are solved. For each pipeline in I direction, conservative
variables at half-point or edge location (i + 1/2) are evaluated with fifth order WENO inter-
polation as is shown in Eq.(10) to Eq.(14). To minimize oscillation, interpolation is performed
in characteristic space. There are two interpolated values, u+i+1/2 and u−i+1/2, at each half-point
location, which are approximated with values of left-biased point stencil (i−2, i−1, i, i+1, i+2)
and right-biased point stencil (i− 1, i, i+ 1, i+ 2, i+ 3) respectively. Then at each half point,
Steger-Warming flux is used to construct numerical flux fi+1/2. The physical flux at the grid
point can be simply obtained with Eq.(3). Finally, the derivative at each point can be ap-
proximated with Eq.(16). Derivatives in the other directions can be done in the same way,
and the results are accumulated during the loop of direction to construct the term R(Q) in
Eq.(17) for Runge-Kutta integration.
All above computation is done by a single process when no multi-threads technique like
OpenMP is applied. The code running on GPU device should do all the work as it does on
CPU. The structure of GPU and its execution mechanism have been introduced earlier, and
it is noted that we can call tens of thousands of threads to complete certain computations.
19
Figure 9: Code structure of derivative of inviscid flux computed with original AFWENO
To port a code onto GPU, the first work to do is how many threads to be called and how to
arrange their works. In our early version of our GPU code, a thread is assigned to complete a
1D problem, so the number of threads to be called is the size of the sweeping face of grid block,
NJ ∗ NK for JK face. However, the acceleration performance is poor. The reason for this
problem is that, the number of active threads is severely restricted by the high occupation
of register in the kernel function. In addition, the number of threads to be called is not
large enough and is limited by sweeping faces of grid block which usually vary with sweeping
directions.
To solve this problem, two improvement measures are introduced. The first improvement
measure is that atomic level operation is applied. Other than assigning each thread to complete
a 1D problem of the whole grid line, we assign each thread to complete the work at only a
grid point, it is so called "atomic level operation". Thus the number of thread to be called
increases up to the number of total grid points, and it does not vary with sweeping directions
of derivatives. The CUDA kernel is therefore configured as
int blocksize = BLOCK_SIZE_INV ISC;
int nblock = (( NI ∗NJ ∗NK − 1) / blocksize ) + 1;
cuda_afweno5 <<< nblock, blocksize >>> (...);
Generally, optimizing the workloads to fit within the boundaries of a warp, which is a group
of 32 threads, will lead to more efficient utilization of GPU compute resources. Thus, the
value of BLOCK_SIZE_INV ISC should be divided evenly by 32.Improper value that is
too small or large may cause performance reduction. In practice applications, 256 is usually
proved to get satisfying performance. But the best value should be determined by testing.
The second improvement measure is utilization of kernel decomposition technique. By
decomposing a big kernel function into several small ones, the number of registers needed in
20
kernel function reduces and the number of active threads increases. Fig.9 shows the pseudo
code of computation of derivatives of convective flux. The procedure of evaluating derivative
is splitted into three parts, and the computation of each part is done with a kernel function. In
kernel cuda_afweno_split_flux(...), numerical flux fi+1/2 is computed at each half point
by a thread, and the result is written into public memory space. After the kernel finishes,
numerical fluxes at all half-point locations in that direction are already known. Then, the
second kernel function cuda_phys_flux(...) computes physical flux fi at each point by a
thread, and the result is written into public array in global memory too, and at the end of
it, physical fluxes are all known. At last, in the third kernel function cuda_cflux_df(...),
numerical flux and physical flux are obtained from global memory to compute derivatives at
each point by a thread, and the result is stored in array R(Q) in global memory. Theoreti-
cally, the second kernel can be merged with the first kernel. But the merged kernel function
occupies too much registers and the number of active warps is strictly restricted due to limited
registers. By separating the independent part out and computing it with a small kernel, more
warps are active, thus improve parallel performance. However, we do not divide the kernel
cuda_afweno_split_flux(...) into two parts to calculate interpolation of conser-
vative variables and numerical fluxes as is done in [6], because delivering data
from one kernel to another kernel involves access to global memory, which brings
high latency. Moreover, this two parts are complex as well, and the kernels still
occupy too much registers. So number of active threads does not increase ap-
parently. Our test shows that the time saved by increasing active threads does
not compensate the latency of access of global memory. It is worth mentioning that,
metrics and Jacobian at half point which are needed to compute numerical flux are neither
stored in global memory, nor transferred from CPU. If we save them in global memory per-
manently, more memory is required for solver to deal with a certain grid size. Although the
public memory space is large enough to hold them, we do not transfer them from CPU either,
because our test suggests that computing is much faster than transferring. Thus they are
interpolated from point values directly when they are needed.
We have known that in GPU, as processor-to-memory latency increases, the capacity of
that memory increases. Global memory is the largest, highest-latency memory space on a
GPU. So frequent access to global memory will increase the execution time of a kernel. There-
fore, reducing access to global memory provides possibility of optimization during program-
ming and numerical algorithm design. In current work, we also propose an improved WENO
interpolation, which reduces access to global memory and also reduces computation price.
Fig.10 shows the schematic diagrams of original WENO interpolation and improved WENO
interpolation. In original method, computing numerical flux at i + 1/2 needs points stencil
(i−2, i−1, i, i+1, i+2, i+3), and performs two complete WENO interpolations from Eq.(10)
to Eq.(14). The two WENO interpolations share the same characteristic space located at
i + 1/2. In CUDA kernel function cuda_afweno_split_flux(...) mentioned above, for
Steger-Warming splitting flux method, f+i+1/2 and f−i+1/2 are obtained simultaneously in the
same thread, thus numerical flux is obtained immediately with fi+1/2 = f+i+1/2 + f−i+1/2. In
our improved method, we build the characteristic space at i, and we utilize the points stencil
(i− 2, i− 1, i, i+ 1, i+ 2) to perform WENO interpolation to get u+i−1/2 and u−i+1/2. We do not
need to perform two complete independent WENO interpolations, because the characteristic
variables and the smoothness indicators are shared by two WENO interpolations. Since we
can not get f+i+1/2 and f−i+1/2 in the same thread, thus we have to compute numerical flux fi+1/2
21
Figure 10: Comparison of WENO interpolations. In original WENO interpolation, the characteristic space
is built at i + 1/2, and u±i+1/2 are computed simultaneous in a loop. In modified WENO interpolation, the
characteristic space is built at i, and u−i+1/2 and u
+
i−1/2 are computed simultaneous in a loop. In modified
WENO interpolation, point stencils are the same, so the characteristic variables and smoothness indicators
are shared, thus it reduces calculation and global memory access.
with a new kernel function cuda_sum_split_flux(...). The procedure is shown in Fig.11.
4.5.2. Viscous fluxes
Evaluation of viscous terms is slightly easier than convective terms. According to Eqs.
(4) and (5), we have to compute (u, v, w, T )x,y,z before computing viscous fluxes. Here, we
approximate the derivative of viscous flux with fluxes defined at point locations, so we need to
compute (u, v, w, T )x,y,z once and share them during derivatives in all directions. Computing
(u, v, w, T )x,y,z is easy by chain rule. In general curvilinear coordinate system, derivatives with
respect to (x, y, z) can be obtained with chain rule as follow
∂
∂x
= ξx
∂
∂ξ
+ ηx
∂
∂η
+ ζx
∂
∂ζ
∂
∂y
= ξy
∂
∂ξ
+ ηy
∂
∂η
+ ζy
∂
∂ζ
∂
∂z
= ξz
∂
∂ξ
+ ηz
∂
∂η
+ ζz
∂
∂ζ
(18)
(u, v, w, T )ξ,η,ζ can be approximated directly with sixth order central scheme. In CPU code,
the computation strategy of viscous term is similar with inviscid term. After (u, v, w, T )x,y,z
are computed, viscous flux and its can be obtained in each direction.
We port the code to GPU similar to convective term. Fig.12 shows the procedure of
evaluation of viscous term. There are two parts and each part involves loop in directions.
In the first part, contribution of (u, v, w, T )x,y,z in each direction are computed, accumu-
lated, and stored in public array in global memory. For example, in ξ direction, ξx ∂(u,v,w,T )∂ξ ,
ξy
∂(u,v,w,T )
∂ξ
and ξz ∂(u,v,w,T )∂ξ are computed. This part is complete by only one kernel function
22
Figure 11: Code structure of derivative of inviscid flux computed with modified AFWENO.
cuda_vars_derv_xyz(...). We have two strategies of thread schedule for this kernel. In the
first strategy, the kernel is designed to compute contribution in each direction for four variables
(u, v, w, T ) at one point by one thread. So the CUDA kernel cuda_vars_derv_xyz(...) is
therefore configured as
int blocksize = BLOCK_SIZE_V ISC;
int nblock = (( NI ∗NJ ∗NK − 1) / blocksize ) + 1;
cuda_vars_derv_xyz <<< nblock, blocksize >>> (...);
In the second strategy, the kernel is designed to compute contribution in each direction for only
one variable at one point by one thread. So the number of threads to be called is four times
of that of the first strategy, and the CUDA kernel cuda_vars_derv_xyz(...) is therefore
configured as
int blocksize = BLOCK_SIZE_V ISC;
int nblock = (( NI ∗NJ ∗NK ∗ 4 − 1) / blocksize ) + 1;
cuda_vars_derv_xyz <<< nblock, blocksize >>> (...);
Each thread must determine the indexes of grid point and the variable according to its thread
ID. The indexes of grid point is needed because of one-side scheme for derivative near bound-
aries of grid block. In convective term, we do not need to care about derivatives near bound-
aries of grid block.
In the second part, two kernels are applied which are similar to convective terms. In the
first kernel, viscous flux at each grid point is computed by one thread. After the first kernel
completes, the second kernel cuda_vflux_df(...) is ready to compute the derivative at each
grid point by one kernel, and the result is accumulated into R(Q) which has already included
the contribution of convective terms.
23
Figure 12: Code structure of the derivative of viscous flux summarized by a pseudo code
4.5.3. Kernels in other section of the solver
We have introduced the kernels of the convective and viscous terms which account for the
majority of total computation, and they are completely performed on GPU. In order to achieve
the whole computation performed on GPU, other parts of the solver have to be ported onto
GPU as well. Fortunately, these parts involve only simple algorithms and operations, which
are independent on grid. Therefore, we apply the "atomic level operation" when design the
kernels. For computation of residual, in which summation of all inner grid points is necessary,
although there is atomic operation in CUDA, we utilize "reduced-dimensional operation" to get
better efficiency. "Reduced-dimensional operation" here we mean that, for three dimensional
grid block, CUDA threads are mapped into grid points of the maximum face of grid block,
and each thread performs specified operation like accumulation over points of that grid line.
When the "3D → 2D" procedure is complete, the "2D → 1D" and "1D → SingleV alue" are
performed in similar fashion. The "reduced-dimensional operation" is also applied to find the
minimum time step for unsteady simulation controlled with the CFL number.
5. Performance result
5.0.1. Speedup varies with grid size
In this section we present the performance of HiResX running on a single GPU with
different gird sizes. We compare performances on different GPUs as well. Three different
GPUs are employed for test: RTX 2080 Ti, Tesla P100 and Titan V. The specs and the
characteristics of them are introduced before. The maximum grid capability of the solver on
these GPUs are also listed in Tab.3. Since the memory capability of RTX 2080 Ti GPU is
the smallest one, the maximum grid size tested here is bounded to 25 million. The CPU in
the test is Intel Xeon E5-2680v3. For CPU computing, solver runs on single CPU core and
all computation is performed on CPU. For GPU computing, because GPU code can not run
independently of CPU, the present solver uses one CPU core and one GPU device, but nearly
all the computation is performed on GPU device. The speedup here is defined as the ratio of
time of CPU computing to time of GPU computing for the same case. Global speedups of the
optimal GPU version of the code compared to the serial CPU version for grid sizes arranging
from one million to 25 million are shown in Fig.13. We get significant speedups for different
GPUs for all grid sizes. Generally, the speedups for all GPUs increase with the grid size when
grid size is smaller than 15 million, but after that RTX 2080 Ti and Tesla P100 gradually
reach their limits. Thus the maximum speedups for RTX 2080 Ti and Tesla P100 are about
640 and 1500 respectively. Titan V has not showed a drop when the grid size increases. Due
24
1.0 2.5 5.0 10.0 15.0 20.0 25.0
Grid Size (Million)
0
250
500
750
1000
1250
1500
1750
2000
Sp
ee
du
p 
(re
sp
ec
t t
o 
on
e 
CP
U 
co
re
)
1
566
1522
1268
1
557
1562
1278
1
581
1671
1292
1
614
1791
1404
1
637
1861
1466
1
638
1862
1500
1
632
1938
1493
CPU E5-2680v3
RTX 2080 Ti
Tesla P100
Titan V
Figure 13: Global performance varying with grid sizes. The speedup is defined as the ratio of elapsed time of
running on CPU with one core to lapsed time of running with a GPU.
to the limitation of memory capacity and the grid capacity, the maximum speedup of Titan
V is predicted to be about 1950.
The trend of actual measured performances are matched with their ideal DP performances
for these three GPUs, though the amplitudes deviate much from ideal DP ratios. According
to the specs of these GPUs in Tab.1, the bandwidths of RTX2080 Ti and Titan V is close, but
actual measured performance of Titan V is nearly three times as much as that of RTX 2080
Ti, which benefits from more CUDA cores and the higher ratio of number of double precision
process units to single precision process units (1:2) in Titan V. Tesla P100 has less CUDA
cores than RTX 2080 Ti, but has higher ratio of number of double precision process units
to single precision process units, 1:2 for Tesla P100 and 1:32 for RTX 2080 Ti, resulting in
more double precision process units. Hence RTX 2080 Ti is mainly bounded to computation
rather than memory bandwidth. According to performance analysis tools of NVIDIA, the
large deviation from ideal performances for these three GPUs comes from the relatively low
occupancy in convective fluxes and viscous fluxes which account for the majority of global
computation. Low occupancy is caused by the limitation of registers. If there are not enough
active threads, double precision capability would be excessive which causes large deviation from
ideal performance. Although we have decomposed the big kernel into smaller ones and cut
down number of registers in kernel to increase occupancy and number of active threads, high
latency of introduction of other type of memory (e.g. global memory) cancels the accelerating
effect of higher occupancy. So there is balance of occupancy and memory efficiency.
25
5.0.2. Performance of kernels
Global performance relies on performances of kernels in each part of the solver. Fig.14
shows performances of several main parts in which all computations are located. As shown in
the figure, the part with massive computation gets apparent accelerating, such as evaluation
of inviscid term, viscous term and time step, and procedure of time advance and computation
of primitive variable. In these parts, there are a lot of complicated calculation at each grid
point. Though evaluation of residual is performed at each point, the kernel for residual com-
puting is not designed to run with "atomic operation", and there is no complicated calculation
in the kernel. So the performance of residual evaluation is relatively low. For Runge-Kutta
integration in time advancement, there is only a little calculation in this part, but the calcula-
tion is performed at each grid point independently, so the "atomic operation" strategy brings
apparent acceleration. The speedup of Titan V is double of RTX 2080 Ti because Titan V has
more CUDA cores and higher bandwidth. For convective terms, in which a lot of complicated
calculation is involved, the speedup of Titan V is relatively high. But the speedup of RTX
2080 Ti is only one fifth of Titan V, because both number of double precision process units
of RTX 2080 Ti and number of CUDA core are less than that of Titan V. For viscous terms,
in which the computation is second only to the convective term, the accelerating effect is the
best. It is worth to mention that, the gap of speedups of both GPUs is much smaller than
that of the convective terms. It is the kernel decomposition used to increase the occupancy
and active thread that brings the effect. In each kernel in the viscous term, the amount of
calculation is medium, so for RTX 2080 Ti the number of double precision process units is not
that scarce, but for Titan V it may be excessive. The scenario is similar for time step.
5.0.3. Performance of running with multiple GPUs
In this section, the performance of our solver on multiple GPUs will be investigated. This
test aims to show the scalability of our solver when it is running on multiple GPUs. For tradi-
tional CPU code, when it runs in parallel, the parallel efficiency is affected by communication
among processes and computation loads of processes. In early GPU computing code, besides
communication on CPU and computation load, data transfer between CPU and GPU affects
parallel efficiency as well. For modern GPU computing, communication between GPUs is a
new factor.
The grid size of all cases in this test is fixed to be 25 million due to limitation of the
memory capacity of RTX2080 Ti. A single block with 25 million grid points is divided nearly
equally into several sub-blocks and each GPU computes computes one sub-block. Each GPU
is driven by a CPU core, so for multiple GPUs cases, solver runs in parallel with MPI actually.
As we have mentioned that for GPU computing nearly all computation are performed on
GPU, so the contribution of CPU to global computation is negligible. The elapsed times
of the solver running with number of GPUs varies from 1 to 10 are shown in Fig.15. The
scalability gradually deviates from linear value when the number of GPUs increases, but the
scalability efficiency is still larger than 75% when 10 GPUs are utilized. We observe that
performance of scalability of Titan V drops faster than RTX 2080 Ti. However, it is not
the parallel efficiency of CPU that causes performance drop, but the fact that when grid size
decreases, the speedup of GPU drops as well. As illustrated in Fig.13, the variation of speedup
of Titan V is greater than that of RTX 2080 Ti. Performance varies with the computation
load, which is a remarkable feature of GPU different from CPU. Therefore, compared with
CPU, the scalability performance of GPU is slightly lower.
26
0 200 400 600 800 1000
Speedup (respect to CPU serial core)
inv_flux
visc_flux
rk3
time_step
bc
Qc_2_Qv
res
110
513
491
716
248
580
75
444
13
43
298
298
102
115 RTX 2080 Ti
Titan V
Figure 14: Performance of main parts of the solver. Generally, the parts that contain large amount of compu-
tation get apparent acceleration. In the sections of time advance and viscous flux, due to the high occupancies
of kernels, both GPUs get higher speedups. For evaluation of time step and inviscid flux, Titan V gets higher
speedups than RTX 2080 Ti, because Titan V has more double double precision operation processing unit,
which means higher double precision operation performance, see Tab.1.
2 4 6 8 10
number of GPU
0
2
4
6
8
10
Sc
al
ab
ilit
y
Linear
RTX 2080 Ti
Titan V
Figure 15: Strong scaling result of different GPU. Both GPUs’ scalabilities deviate from linear result, but the
efficiencies are all larger than 75% when 10 GPUs are utilized. The drops of scalability come from the fact
that when grid size decreases, the speedup drops too, see Fig.13. Scalability of Titan V drops faster than RTX
2080 Ti’s, because the speedup of Titan V drops faster than RTX 2080 Ti’s when the grid size decreases.
27
Block 1 
Block 3
Block 2
Block 4
800 800
1
0
0
1
0
0
(a) Block structure of the test case.
1 
4 
2 
3
5
8
6
7
2
8
3 
7
2
3
7
8
2
8 
7
3
(Ⅰ)
(Ⅱ)
(Ⅲ)
(Ⅳ)
(Ⅴ)
(b) Cases of GPUs assignment.
Figure 16: Cases configuration. In figure (a), there are 100 points in the spanwise direction. Data exchange in
vertical direction (red) is larger than data exchange in horizontal direction (blue). In figure (b). five cases are
set to test the communication performance of different strategies of computations assigned to different GPUs.
The number in block presents the GPU ID in which computation is performed. Blocks with the same color
indicate that they are in the same PCIe switch. In cases I and II, computations are performed on GPUs that
are in two different PCIe switches respectively. In cases III to IV, computations are performed on GPUs belong
to both two PCIe switches. For case III, the maximum data transfers are performed across PCIe switch. In
case IV, the maximum data transfers are performed within the same PCIe switch. In case V, all data transfers
are performed across PCIe switch.
To investigate how the topology of GPUs in the server influences communication efficiency
of GPUs, a computation model with 4 blocks are assigned to be computed with 4 GPUs in
the server. The computation model is shown in Fig.16(a), the size of exchange data between
blocks in horizontal direction is 1/8 of the size of exchange data between blocks in vertical
direction. The topology of GPUs in our test server is similar to type (c) in Fig.2. There are
10 Titan V GPUs in our server, and each PCIe switch has 5 GPUs mounted on it. Hence
there are two "PIX" GPU groups, GPU 0 ∼ 4 and 5 ∼ 9. And GPUs between the "PIX"
groups are in relation of "PHB". As we have introduced before, GPUs in "PIX" relation
communicate faster than GPUs in "PHB" relation. So put processes that have large data
transfer between each other in "PIX" group of GPUs will get higher communication efficiency.
For better comparison, five cases are considered, as shown in Fig.16(b). For all cases, the
blocks with the same color means that they are in the same PCIe switch. Case I and case II
are used to test the communication performance of GPUs in each PCIe switch. In case III,
processes with largest exchange data are not distributed to GPUs in the same PCIe. In case
IV, processes with largest exchange data are distributed to GPUs in the same PCIe, and it is
expected to get higher efficiency than case III. Case V is designed to be the worst situation.
We only run the test on server with Titan V GPU, because RTX 2080 Ti GPU dose not
support P2P communication technique without Nvlink, while Titan V supports P2P by PCIe.
The result is shown in Fig.17. As expected, cases running in the same PCIe switch are the
best. Case IV also gets good performance as case I and II due to reasonable distribution of
communication load. Case V gets the worse performance because each GPU communicates
with GPUs that are all located in another PCIe switch, leading to the worst distribution of
communication load. The results indicate that the GPU-GPU communication optimization
technique is crucial for programs running on current GPU server in order to fully exploit the
performance.
28
(I) (II) (III) (IV) (V)
0.0
2.5
5.0
7.5
10.0
12.5
15.0
17.5
20.0
El
ap
se
d 
tim
e 
(m
illi
se
co
nd
)
11.0 11.0
13.0
11.0
14.0
Figure 17: Performance of HiResX running on different GPU topology. Cases in which all GPUs are in the
same PCIe switch get best communication efficiency, see case I and II. For case in which GPUs are located in
different PCIe switches, if GPU devices are optimally assigned according to communication load, communi-
cation efficiency can be also improved, see case IV. Without GPU to GPU optimization, the communication
efficiency is lower, see case III and V.
6. Case study
To further investigate performance of our solver in the practical application, a planar
supersonic jet is simulated with implicit large eddy simulation (ILES). The configuration
is similar to the Ref.[16], and we simulated an underexpanded planar jet with the ratio of
pressure pe/p∞ = 2.09. The computational domain is discretized by a Cartesian grid with total
67.7 million points. The physical domain, which excludes the sponge zone, has dimensions
64h × 30h × 5h, with a nozzle extending over 0.6h inside the domain. The jet height is
h = 3.5mm corresponding to a Reynolds number based on the jet height and acoustic speed
Re = ach
ν
= 8.15×104. The velocity profile inside the nozzle is a laminar Blasius solution with
boundary layer thickness of δ = 0.05h, and there are 12 points within the boundary layer, 70
points inside the jet height. The constant time step 4t = 2.5 × 10−3 is implemented. The
non-dimensional time for the whole computation is 360, and only the last 210 is outputted for
analysis. Three Titan V GPUs are utilized for computation, and the execution time for the
simulation is only about 41 hours.
Fig.18 shows the shock-cell spacing Ls/h as a function of the fully expanded jet Mach
number Mj, and various experimental results[18, 19], LES data of Berland et al.[16] and
theoretical data[17] are plotted for comparison for rectangular jets. It is clear that present
result is closer to experimental results than that of Berland et al.. In Fig.19, the Strouhal
number of the fundamental discrete frequency is given and plotted against the fully expanded
jet Mach number for several experiments on rectangular jets, where the experimental data of
Panda[18] and LES data of Berland et al.[16] are shown for comparison. Clearly, our result
is in good agreement with experimental result. Fig.20 illustrates the flow structure of the jet
29
Mj
1 1.2 1.4 1.6 1.8 2
L s
=h
0
1
2
3
4
5
6
Theory
Panda Exp.
Ramman Exp.
Berland LES
Present
Figure 18: Shock-cell spacing Lh/h as a function of the fully expanded jet Mach number Mj . Our result
is more close to theoretical[17] and experimental results [18, 19], compared to the LES results of Berland et
al.[16].
St
1 1.2 1.4 1.6 1.8
M
j
0.05
0.1
0.15
0.2
0.25
0.3
0.35
Panda Exp.
Berland LES
Present
Figure 19: Strouhal number St = fh/Uj of the fundamental screech tone as a function of the fully expanded
jet Mach number Mj . Our result is in good agreement with experimental result of Ref.[18], which is better
than the LES result of Ref.[16].
30
Figure 20: Instantaneous snapshot of spanwise vorticity ωz and the dilatation in the plane z/h = 2.5 as the
backround. The isosurface of vorticity is colored with the amplitude of velocity.
and its generated acoustic field, which are presented with spanwise vorticity ωz and dilatation
in the plane z/h = 2.5. The above results prove the reliability of our solver, and the short
execution time demonstrates the high efficiency of our solver running with GPUs.
7. Conclusions
Heterogeneous computing is changing our way of scientific computing. The boom of GPU
computing in the past several years shows the power and potential of GPU on computing,
and it attracts more and more researchers to exploit its application in their fields, including
CFD simulation. In the early years of GPU computing, many attempts have been made to
port CFD codes onto GPU, but the results are not that satisfying as expected. With the
development of GPU computing, more powerful GPUs and relating technologies are present,
which prompt us to further exploit its application in CFD.
In this work we analysis the characteristics of architectures of current GPU servers, and
propose a set of techniques to improve efficiency of data transfer between CPU and GPU, and
efficiency of communication between GPUs. An in-house compressible flow solver based on
high order finite difference method on curvilinear coordinates are successfully ported to GPU
with CUDA C. By carefully memory planning, we save time of unnecessary data transfer
between CPU and GPU without significant sacrifice of capability of grid size of our solver.
The "atomic operation" technique and kernel decomposition technology are applied to design
high-efficiency kernels. A modified AFWENO is proposed, which saves computation and
reduces memory access, and works better on GPU compared with original WENO. Tests have
shown that our solver gets maximum speedups almost 2000x on a Titan V GPU, 1500x on a
Tesla P100 and 650x on an RTX 2080 Ti GPU over a CPU core of E5-2680v3. The hardware-
aware technology is proved to be more efficient than unoptimizable scenarios. A test case of
a supersonic jet shows the practical application ability of present solver. This work provides
31
an systematic and efficient solution to apply GPU computing in CFD simulation with certain
high order finite difference methods on current GPU heterogeneous computers.
References
[1] D. R. Chapman, Computational aerodynamics development and outlook, AIAA journal
17 (1979) 1293–1313.
[2] H. Choi, P. Moin, Grid-point requirements for large eddy simulation: Chapman’s esti-
mates revisited, Physics of fluids 24 (2012) 011702.
[3] B. Tutkun, F. O. Edis, A gpu application for high-order compact finite difference scheme,
Computers & Fluids 55 (2012) 29–35.
[4] V. Esfahanian, H. M. Darian, S. I. Gohari, Assessment of weno schemes for numerical
simulation of some hyperbolic equations using gpu, Computers & Fluids 80 (2013) 260–
268.
[5] K. I. Karantasis, E. D. Polychronopoulos, J. A. Ekaterinaris, High order accurate sim-
ulation of compressible flows on gpu clusters over software distributed shared memory,
Computers & Fluids 93 (2014) 18–29.
[6] C. Xu, X. Deng, L. Zhang, J. Fang, G. Wang, Y. Jiang, W. Cao, Y. Che, Y. Wang,
Z. Wang, et al., Collaborating cpu and gpu for large-scale high-order cfd simulations
with complex grids on the tianhe-1a supercomputer, Journal of Computational Physics
278 (2014) 275–297.
[7] J. Lai, Z. Tian, H. Li, S. Pan, A cfd heterogeneous parallel solver based on collaborating
cpu and gpu, in: IOP Conference Series: Materials Science and Engineering, volume 326,
IOP Publishing, 2018, p. 012012.
[8] J. Lai, H. Li, Z. Tian, Y. Zhang, A multi-gpu parallel algorithm in hypersonic flow
computations, Mathematical Problems in Engineering 2019 (2019).
[9] E. Elsen, P. LeGresley, E. Darve, Large calculation of the flow over a hypersonic vehicle
using a gpu, Journal of Computational Physics 227 (2008) 10148–10161.
[10] J. Lei, D.-l. Li, Y.-l. Zhou, W. Liu, Optimization and acceleration of flow simulations for
cfd on cpu/gpu architecture, Journal of the Brazilian Society of Mechanical Sciences and
Engineering 41 (2019) 290.
[11] J. Crabill, F. D. Witherden, A. Jameson, A parallel direct cut algorithm for high-order
overset methods with application to a spinning golf ball, Journal of Computational
Physics 374 (2018) 692–723.
[12] F. D. Witherden, A. M. Farrington, P. E. Vincent, Pyfr: An open source framework
for solving advection–diffusion type problems on streaming architectures using the flux
reconstruction approach, Computer Physics Communications 185 (2014) 3028–3040.
32
[13] Y. Jiang, C.-W. Shu, M. Zhang, An alternative formulation of finite difference weighted
eno schemes with lax–wendroff time discretization for conservation laws, SIAM Journal
on Scientific Computing 35 (2013) A1137–A1160.
[14] G.-S. Jiang, C.-W. Shu, Efficient implementation of weighted eno schemes, Journal of
computational physics 126 (1996) 202–228.
[15] G. Patnaik, A. Corrigan, K. Obenschain, D. Schwer, D. Fyfe, Efficient Utiliza-
tion of a CPU-GPU Cluster, American Institute of Aeronautics and Astronautics,
2012. URL: https://arc.aiaa.org/doi/abs/10.2514/6.2012-563. doi:10.2514/6.
2012-563. arXiv:https://arc.aiaa.org/doi/pdf/10.2514/6.2012-563.
[16] J. Berland, C. Bogey, C. Bailly, Numerical study of screech generation in a planar
supersonic jet, Physics of fluids 19 (2007) 075105.
[17] C. Tam, The shock-cell structures and screech tone frequencies of rectangular and non-
axisymmetric supersonic jets, Journal of Sound and Vibration 121 (1988) 135–147.
[18] J. Panda, G. Raman, K. Zaman, J. Panda, G. Raman, K. Zaman, Underexpanded screech-
ing jets from circular, rectangular and elliptic nozzles, in: 3rd AIAA/CEAS Aeroacoustics
Conference, 1997, p. 1623.
[19] G. Raman, E. J. Rice, Instability modes excited by natural screech tones in a supersonic
rectangular jet, Physics of fluids 6 (1994) 3999–4008.
33
