Performance analysis and optimization of a CFD application by Zhang, Wentao
c© 2015 by Wentao Zhang. All rights reserved.
PERFORMANCE ANALYSIS AND OPTIMIZATION OF A CFD APPLICATION
BY
WENTAO ZHANG
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Mechanical Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2015
Urbana, Illinois
Adviser:
Professor Daniel J. Bodony
ABSTRACT
This thesis documents the analysis and optimization of a high-order finite difference com-
putational fluid dynamics (CFD) application (PlasComCM). Performance bottlenecks were
identified using performance tools and hardware counters. The performance analysis of Plas-
ComCM showed that the quantity of memory accesses and the lack of vectorization inhibited
optimal serial performance on a x86-based CPU. Optimizing techniques including pointer
dereferencing, loop transformation and Fortran SIMD directives were applied to the top 10
time-consuming subroutines to remove obstacles to vectorization and to improve the serial
performance. Details about the optimization techniques are presented and their impacts on
performance are discussed. A 63% reduction in the number of memory loads and a serial
speedup of 2.02 were obtained from the optimization efforts.
Using the optimized serial program as the codebase, further investigation was focused on
the analysis and optimization of parallel heterogeneous execution on a dual-socket node fitted
with an Intel Xeon Phi MIC card. To reduce the overhead created by host-accelerator copies
in heterogeneous execution, the data layout of the halo region was changed from a “star”
shape to a “box” shape to agglomerate small communications and to create a larger work
granularity. Preliminary results of running PlasComCM on Intel Xeon Phis in symmetric
mode are also presented, where it was found that a 20% reduction in wall-clock time can be
obtained for particular problem size when using 2 SandyBridge sockets + 1 Phi card vs 2
SandyBridge sockets.
ii
To my parents, for their love and support.
iii
ACKNOWLEDGMENTS
First and foremost, I would like to thank my advisor Daniel J. Bodony for all his support and
guidance. I am truly grateful to him for patiently leading me to the world of computational
fluid dynamics and parallel computing. I am very fortunate to have the opportunity to work
with him, and I have learned a tremendous amount from him.
I would like to thank John L. Larson, who taught me much about performance measure-
ment and optimization of parallel programs. Discussions with John greatly enhanced my
understanding in this work. I also want to thank Lucas A. Wilson at TACC for his help and
advice on using the Stampede cluster.
I want to thank Qi, Mahesh, Nek, Shakti and everyone else in my research group for
helping me during the tough times and filling my life with laughter and fun.
Many thanks to Congying Chen for always being there for me over the past two years.
Her care and inspiration across the ocean cheered me up when I was down.
I would like to thank my parents Fengling Zhang and Guihua Zhao for their love through-
out my life. The work presented in this thesis would not have been possible without their
support and encouragement.
This work is partially supported by the Department of Energy’s Predictive Science Aca-
demic Alliance Program (PSAAP), by the Department of Mechanical Science and Engi-
neering at the University of Illinois at Urbana-Champaign, and by NSF XSEDE Project
TG-CTS090004. Computational resources from the Stampede system at Texas Advanced
Computing Center (TACC) are greatly appreciated.
iv
TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
CHAPTER 2 BRIEF INTRODUCTION TO PLASCOMCM . . . . . . . . . . . . . 3
2.1 Physical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Program Control Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Numerical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Computational Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Tables for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 Figures for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
CHAPTER 3 PROFILING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 Timing Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Floating-Point Execution Rate . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Memory and Cache Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.5 Tables for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.6 Figures for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
CHAPTER 4 OPTIMIZATION TECHNIQUES AND RESULTS . . . . . . . . . . . 21
4.1 Pointer Dereferencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Loop Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3 Fortran SIMD Directives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.4 Optimization Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Tables for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.6 Figures for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
CHAPTER 5 BOX SHAPED HALO REGION . . . . . . . . . . . . . . . . . . . . . 33
5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Star Halo vs Box Halo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.3 Figures for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
v
CHAPTER 6 RUNNING ON INTEL MICS . . . . . . . . . . . . . . . . . . . . . . 36
6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.2 Porting Efforts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.4 Figures for Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
CHAPTER 7 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . 41
7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
vi
LIST OF TABLES
2.1 Processor specifications of the Stampede system at TACC. . . . . . . . . . . 8
3.1 Timing profile of the original PlasComCM. . . . . . . . . . . . . . . . . . . . 15
3.2 Tasks of top 10 subroutines in PlasComCM. . . . . . . . . . . . . . . . . . . 16
3.3 Flops and execution rate of the original PlasComCM. . . . . . . . . . . . . . 17
3.4 Cache and main memory usage of the original PlasComCM. . . . . . . . . . 17
3.5 Loads per flop ratios of the original PlasComCM. . . . . . . . . . . . . . . . 18
3.6 Vectorization status of the original PlasComCM. . . . . . . . . . . . . . . . . 18
3.7 Cause of failure in vectorization of the original PlasComCM. . . . . . . . . . 19
3.8 Vector instructions generated in the original PlasComCM. . . . . . . . . . . 19
4.1 Reduction of loads and serial speedup after pointer dereferencing. . . . . . . 25
4.2 Loads per flop ratios (before and after pointer dereferencing). . . . . . . . . . 25
4.3 Vector instructions generated in the optimized PlasComCM. . . . . . . . . . 26
4.4 Reduction of loads and serial speedup after all optimizations. . . . . . . . . . 26
vii
LIST OF FIGURES
2.1 A brief sketch of the program control flow of PlasComCM. . . . . . . . . . . 9
3.1 Accessing hardware counters through TAU and PAPI. . . . . . . . . . . . . . 20
4.1 A loop example from subroutine ARK2 (before pointer dereferencing). . . . . 27
4.2 A loop example from subroutine ARK2 (after pointer dereferencing). . . . . . 27
4.3 A loop example from subroutine CID (before transformation). . . . . . . . . 28
4.4 A loop example from subroutine CID (after transformation). . . . . . . . . . 28
4.5 A loop example from subroutine SPARSE NEW (before transformation). . . . . 29
4.6 A loop example from subroutine SPARSE NEW (after transformation). . . . . . 30
4.7 A loop example from subroutine VISCOUSTERMS STRONG (before transformation). 31
4.8 A loop example from subroutine VISCOUSTERMS STRONG (after transformation). 31
4.9 A loop example that cannot be vectorized without SIMD directive. . . . . . 32
4.10 A loop example gets vectorized with help from SIMD directive. . . . . . . . . 32
5.1 Layouts of star halo and box halo using a 2-D grid and the 5-point stencil. . 34
5.2 Execution pattern (star vs box) for one RK step. . . . . . . . . . . . . . . . 35
6.1 Configuration of a compute node of the Stampede system. . . . . . . . . . . 38
6.2 Decomposition and mapping for running on the MIC card in symmetric mode. 39
6.3 Speedup of symmetric runs with increasing M. . . . . . . . . . . . . . . . . . 40
6.4 Speedup of symmetric runs with increasing P. . . . . . . . . . . . . . . . . . 40
viii
CHAPTER 1
INTRODUCTION
Computational fluid dynamics is an important branch of scientific computing that utilizes
numerical methods and computers to model fluid-related phenomena arised in scientific and
engineering practices [1].
1.1 Background
The rapid development of high performance computing (HPC) technology has led to signif-
icant progresses in the field of CFD [2]. The ever-increasing computational capabilities (1)
enabled researchers to run Reynolds-averaged Navier-Stokes (RANS) [3] simulations with
complex geometries and coupled physics (e.g., complex RANS simulation of a real industrial
combustor [4]); (2) promoted the development and usage of large-eddy simulation (LES)
[5] (e.g., using large simulations to study the noise generated by turbulent jets [6]); and
(3) made direct numerical simulation (DNS) [7] feasible (e.g., direct numerical simulation
of homogeneous turbulent shear flow [8]). These applications, in turn, pushed CFD to be-
come a significant consumer of HPC resources. However, application scientists often care
more about the prediction results so that insufficient attention is paid to the performance
of the code, which results in a waste of computational resources. With HPC technology
changing rapidly, there is a growing need to analyze and optimize the performance of CFD
applications to effectively utilize emerging hardware capabilities [9].
This thesis documents the analysis and performance improvement of PlasComCM , a high-
order finite difference CFD application written primarily in Fortran 90 that uses the Message
Passing Interface (MPI) [10] for parallelization. Production runs typically have 50 million to
200 million grid points distributed to 1K to 3K MPI tasks after domain decomposition. Each
1
grid point has five degrees of freedom for the mass, momentum and total energy densities.
1.2 Organization of the Thesis
Chapter 2 introduces the main features of PlasComCM , including the physical models, the
program control flow and the numerical techniques, as well as the computational platform.
Chapter 3 presents the profiling work, which analyzes the timing, floating-point execution
rate, memory/cache usage and vectorization status of the original codebase of PlasComCM.
The optimization techniques and their impact on vectorization and serial performance are
discussed in detail in Chapter 4. Chapter 5 compares the new “box”-shaped halo region
with the old “star”-shaped halo region and shows how the “box” halo layout changes the
execution pattern and creates a larger work granularity. Chapter 6 documents our efforts
to port PlasComCM to the Intel Xeon Phis and summarizes preliminary results of running
PlasComCM on Intel Xeon Phis with a MPI-only model in symmetric mode. Conclusions
and recommended future work are presented in Chapter 7.
2
CHAPTER 2
BRIEF INTRODUCTION TO PLASCOMCM
PlasComCM is a multi-physics solver for investigating the behavior of compressible, viscous
gases arising in aerospace or mechanical engineering practices, with a focus on turbulence
and its generated sound. Fluid simulation examples include predicting and controlling the
noise produced by a high-speed turbulent jet [11], modeling the performance of the acoustic
liners installed in jet engines [12, 13] and evaluating the indirect combustion noise in a
turbine stage [14, 15]. Fluid-structure coupled investigations include a laminar/turbulent
boundary layer grazing over a flexible panel [16, 17], with application to duct noise reduction
and future hypersonic vehicle design, respectively.
Section 2.1 shows the physical models used in PlasComCM. Section 2.2 and section 2.3
present the program control flow from a high level view and the numerical methods used
in PlasComCM, respectively. Section 2.4 briefly introduces the computational platform on
which the performance measurement and optimization were conducted.
2.1 Physical Models
The governing equations solved by PlasComCM are the three-dimensional compressible
Navier-Stokes equations, written as
∂ρ
∂t
+
∂
∂xj
(ρuj) = 0 (2.1)
∂ρui
∂t
+
∂
∂xj
(ρuiuj + pδij − τji) = ρfi (2.2)
∂ρE
∂t
+
∂
∂xj
([ρE + p]uj + qj − uiτji) = ρuifi, (2.3)
3
where repeated indices imply summation. The meanings of the symbols are as follows:
• t — time
• xi — the ith spatial coordinate
• ρ — mass density
• ui — velocity in the ith direction
• ρui — momentum density in the ith direction
• p — thermodynamic pressure
• ρE — total energy density
• qj — rate of heat flux in the jth direction
• τij — viscous stress applied in jth direction due to a face with normal in ith direction
• fi — body force per unit mass applied in the ith direction.
The physical coordinates (xi, t) are further transformed to computational coordinates (ξi, τ)
via
ξi = Ξi(xj, t) (2.4)
τ = t (2.5)
and its inverse
xi = Xi(ξj, τ) (2.6)
t = τ. (2.7)
The Jacobian of the transformation is defined as J = det(∂Ξi/∂xj) and is always positive.
4
In the computational coordinates (ξi, τ) the three-dimensional compressible Navier-Stokes
equations are recast into
∂
∂τ
( ρ
J
)
= − ∂
∂ξj
(
ρUˆj
)
(2.8)
∂
∂τ
(ρui
J
)
= − ∂
∂ξj
(
ρuiUˆj + pξˆj,i − τkiξˆj,k
)
+
ρfi
J
(2.9)
∂
∂τ
(
ρE
J
)
= − ∂
∂ξj
(
[ρE + p]Uˆj − pξˆj,t + [qk − uiτki]ξˆk,j
)
+
ρuifi
J
, (2.10)
where ξˆi,j = J
−1∂ξi/∂xj, ξˆi,t = J−1∂ξi/∂t and Uˆi = ξˆi,t + ξˆi,juj. Equations (2.8), (2.9) and
(2.10) are discretized and solved in PlasComCM, with proper boundary conditions applied.
The terms on the right-hand-side of the equals sign are denoted by RHS.
2.2 Program Control Flow
Fig. 2.1 shows the control flow of PlasComCM from a high level view. Like other MPI-based
parallel programs, PlasComCM starts by initializing MPI processes for parallel execution.
The following execution path can be mainly divided into two parts: the initialization rep-
resented by green boxes and the computation core represented by red boxes. In the initial-
ization stage, grid and input files are first read and the domain decomposition is performed.
The user can explicitly control the decomposition or simply let PlasComCM use the default
decomposition strategy based on load balancing. Then the code uses the information stored
in the input file to initialize the finite difference stencils, construct the differential operators
and apply an initial condition to the flow field.
The computation core is contained within the time marching loop, in which the solution
is advanced by nstep (defined by the user in the input file) time steps. For each time step,
the RHS terms are first computed via sparse matrix-vector multiplication (SpMV) and then
used to advance solution in time. Solution files are written at a user-defined frequency. The
initialization part is only executed once for each simulation while the computation core is
executed nstep (can be as large as several millions) times and does most of the computational
work. Hence, it should be expected that most of the execution time is spent in computation
5
core.
The test case for our performance study solves a three-dimensional problem on the domain
(x, y, z) ∈ [−1, 1]× [−1, 1]× [−1, 1] using a uniform grid of size 48× 48× 48. The simulation
is initialized to have an isentropic (acoustic) pulse located at the domain center (0, 0, 0) with
zero mean flow. The test case is simple in physics but covers the main execution path of
PlasComCM , which ensures that the performance knowledge we learn can be applied to
production runs solving more complex problems at a much larger scale.
2.3 Numerical Methods
The spatial derivatives ∂/∂ξ are discretized with finite difference operators that have the
summation-by-parts (SBP) property [18]. The spatial approximation to ∂/∂ξ is P−1Q
where P is a symmetric, positive-definite matrix and Q has the property that Q + QT =
diag(−1, 0, . . . , 0, 1). For simplicity, consider computing ∂~u/∂ξ on a one-dimensional grid
of N grid points as a simple example. With the finite difference discretization, ∂~u/∂ξ|i=1,...,N
is approximated by the SpMV M~u = P−1Q~u using centered stencils for grid points far way
from boundaries and biased stencils for grid points close to the boundaries. In this way M
has the following structure,
M =

x
(1)
l x
(1)
l x
(1)
l x
(1)
l x
(1)
l x
(1)
l 0 · · · 0
x
(2)
l x
(2)
l x
(2)
l x
(2)
l x
(2)
l x
(2)
l 0 · · · 0
xi xi xi xi xi 0 · · · 0
0 xi xi xi xi xi · · · 0
0
. . . . . . . . . . . . . . . . . . . . . . . .
0 · · · 0 0 xi xi xi xi xi
0 · · · 0 x(N−1)r x(N−1)r x(N−1)r x(N−1)r x(N−1)r x(N−1)r
0 · · · 0 x(N)r x(N)r x(N)r x(N)r x(N)r x(N)r

. (2.11)
The interior stencils are represented by xi and the boundary stencils are represented by
6
xl and xr for left and right boundaries respectively. The SpMV M~u can be further split into
M~u = Mi~ui +Mb~ub, (2.12)
where Mi and Mb contain the interior and boundary stencils respectively. Mi~ui and Mb~ub
are completely independent and implemented in different subroutines in PlasComCM (see
Table 3.2 for the description of subroutine CID and subroutine SPARSE NEW). Repeated first
derivative operators are used to approximate the second derivatives (implemented in sub-
routine VISCOUSTERMS STRONG ). Halo regions (ghost cells) are used for accessing the stencils
at processor boundaries and MPI nonblocking sends/receives are used for sending/receiving
halo data.
The SBP operators are combined with the simultaneous-approximation-term (SAT) bound-
ary condition [19, 20] (implemented in subroutine NS BC) to yield a provably stable method
that has been shown to be accurate [21]. With the right-hand side (RHS) terms computed, a
fourth-order explicit Runge-Kutta (RK4) scheme (implemented in subroutine ARK2) is used
to advance solution in time.
2.4 Computational Platform
The performance analysis and optimization was conducted on Stampede [22], a NSF-supported
supercomputer at the Texas Advanced Computing Center that ranked 7th on the Top 500
site as of November 2014 [23]. Statistics show that PlasComCM approximately consumes
15 million Stampede SUs annually. Stampede has a total peak performance of 9.6 petaflop
(PF)/sec from its approximately 6400 nodes. The CPU-based portion of the cluster con-
tributes 2.2 PF/sec and the other 7.4 PF/sec come from attached Intel Xeon Phis MIC
cards. A typical compute node consists of 2 Intel Xeon E5-2680 (SandyBridge) processors
and one Intel Xeon Phi Coprocessor (MIC Architecture) card with 61 cores. The Sandy-
Bridge processor supports AVX-256 4-lane vectorization and the coprocessor supports 8-lane
vectorization. Details of the processor specifications are summarized in Table 2.1 [22]. All
of our performance experiments were carried out on one compute node of Stampede.
7
2.5 Tables for Chapter 2
Table 2.1: Processor specifications of the Stampede system at TACC.
Features Xeon E5 Xeon Phi
Number of cores 8 61
Base frequency 2.7GHz 1.1GHz
HW threads/core 2 4
Vector length 256 bits 512 bits
Instruction Set x86 + AVX x86 + vector instructions
L1 cache 32KB 32KB
L2 cache 256KB 512KB
L3 cache 20MB
Memory 2GB/core 128MB/core
8
2.6 Figures for Chapter 2
Initialize MPI
Read input
Decompose the domain
Initialize stencils
Apply the initial condition
Construct operators
Output solution files
Compute       and RHS terms
Finalize MPI
Apply the boundary condition
Advance the solution for one time step
M~u
Time marching loop
Initialization
Computation Core
Figure 2.1: A brief sketch of the program control flow of PlasComCM.
9
CHAPTER 3
PROFILING
Our performance study began with profiling PlasComCM using the performance tools TAU
and PAPI. The profiling work consists of four components that are presented in four sections
of this chapter. Section 3.1 shows the timing profile of PlasComCM and identifies the top
10 time-consuming subroutines. Section 3.2 determines the floating-point execution rate of
PlasComCM as a percentage of hardware peak performance. Section 3.3 investigates the
memory and cache usage of PlasComCM. Finally, section 3.4 analyzes the vectorization
status by examining the vectorization report and the assembly code.
3.1 Timing Profile
The Tuning and Analysis Utilities (TAU) Performance System [24] is a profiling and tracing
toolkit that can (1) automatically place measurement probes in source code at the subrou-
tine level, loop level, or even at the code segment level; (2) access hardware counters and
timers to gather performance information; and (3) post-process, visualize and interpret per-
formance results. On TACC’s Stampede, PlasComCM was instrumented with TAU 2.22.1
and compiled using the Intel Fortran Compiler 13.1.0. All of our experiments ran the test
case described in section 2.2 for nstep = 800 time steps.
Table 3.1 shows the timing profile of PlasComCM, with the top 10 time-consuming sub-
routines listed. The wall-clock time of PlasComCM was not concentrated in one subroutine
but distributed to different ones, which made the performance analysis and optimization
more challenging. Top 10 subroutines cumulatively took more than 90% of the wall-clock
time of PlasComCM and the most time-consuming one was NS BC (17.84%) which computes
the SAT terms that enforce the boundary conditions.. Table 3.2 describes the task per-
10
formed in each top 10 subroutine and how its floating-point computational work scales for a
3-D problem (N3 or N2). That is to say that a routine that computes terms in the interior
of the domain has computational work that scales as N3 while a subroutine that computes
terms on the boundary has computational work that scales as N2, both a problem of size
N ×N ×N .
3.2 Floating-Point Execution Rate
In this section, the serial performance of the top 10 subroutines is further investigated by
computing their double precision floating-point execution rate as a percentage of hardware
peak performance, defined by
R =
F
T ×HP × 100%. (3.1)
The meanings of the symbols in Eq. (3.1) are as follows:
• R — floating-point execution rate as a percentage of hardware peak
• F — number of floating-point operations (flops)
• T — wall-clock time in seconds
• HP — hardware peak performance (21.6 GFlops per second for one core of Xeon
processor of Stampede system).
Since the wall-clock time T spent in each subroutine is already recorded in the timing
profile, F is the only metric that needs to be measured for computing R. A natural way
is to count F using hardware counters accessed through the Performance Application Pro-
gramming Interface (PAPI) [25], which can distinguish vendor-specific hardware counters
and define generic events.
As shown in Fig. 3.1, TAU and PAPI work together to build a bridge between the user code
and the vendor-specific hardware counters. TAU-instrumented user code can directly select
PAPI-defined events as metrics to gather performance data from the hardware counters. The
11
number of floating-point operations performed in each subroutine of PlasComCM is counted
by the hardware counters and ideally we can access this performance information through
TAU and PAPI.
However, it has been reported that PAPI has a flop overcounting issue due to speculative
execution for floating-point operations on Sandy Bridge processors [26]. This means it is
necessary to go through each subroutine, manually accumulate the floating-point operations
performed in each loop, and finally obtain the formula of F for each subroutine as a function
of grid size and time steps. Table 3.3 presents the manually counted F and, with the timings
presented in the previous section, the value of R is computed. Observe that R varies from
0.00% (there were no floating-point operations performed in subroutine INTERNAL DERIV and
subroutine NS ALLOCATE MEMORY) to 6.34% (subroutine VISCOUSTERMS STRONG). The top 10
subroutines cumulatively ran at a low execution rate of 3.03%, which clearly indicates that
PlasComCM was not using the hardware efficiently and the serial performance should be
improved. This fact encouraged us to find the performance obstacles in two respects: the
investigation of memory and cache usage (section 3.3) and the study of vectorization (section
3.4).
3.3 Memory and Cache Usage
As the difference between the speed of memory access and the speed of floating-point com-
putation increases, the memory and cache usage plays an increasingly important role in
program’s serial performance. In this section, the memory access profile of PlasComCM is
generated and analyzed.
The following expression was assumed to be the relationship between the number of mem-
ory load instructions and the successful loads from levels L1–L3 cache and main memory,
NLoad = NL1 +NL2 +NL3 +NMM , (3.2)
where the meanings of the symbols are as follows:
• NLoad — total amount of load instructions
12
• NL1 — L1 cache hits
• NL2 — L2 cache hits
• NL3 — L3 cache hits
• NMM — amount of data loaded from the main memory.
TAU and PAPI were very useful tools to measure the quantities in Eq. (3.2) and helped gain
insight into both the quality and the quantity of memory usage. But note that not all of
the quantities in Eq. (3.2) have a directly corresponding PAPI event available on Stampede,
so that we need to make approximations. Based on the available hardware counters on
Stampede system [22], the following four PAPI-defined events were selected as TAU metrics
to be recorded:
• PAPI LD INS — load instructions
• PAPI L2 DCH — L2 data cache hits
• PAPI L3 DCA — L3 data cache accesses
• PAPI L3 TCM — L3 cache misses.
The formulas for converting the PAPI events into the quantities of our interest in Eq. (3.2)
are as follows:
NLoad = PAPI LD INS (3.3)
NL1 = NLoad −NL2 −NL3 −NMM (3.4)
NL2 = PAPI L2 DCH (3.5)
NL3 = PAPI L3 DCA− PAPI L3 TCM (3.6)
NMM = PAPI L3 TCM. (3.7)
Table 3.4 shows where the loads of PlasComCM were serviced (as a percentage of to-
tal amount of loads). Eight of the top 10 subroutines (see Table 3.2 for a description of
13
the task performed in each top 10 subroutine) have more than 90% of the data loads hit-
ting L1 cache and subroutine INTERNAL DERIV has more than 80%. Note that subroutine
NS ALLOCATE MEMORY only allocates memory thus its load information is not applicable. Hav-
ing such a high ratio of L1 cache hits clearly indicates that the quality of memory access of
the original PlasComCM codebase was not the performance bottleneck.
Therefore, we turned to investigating the quantity of data loads by computing the loads
per floating-point operation ratio. As shown in Table 3.5, the top 10 subroutines cumula-
tively have a very large loads per flop ratio of 4.04 and two subroutines (SPARSE NEW and
ARK2) need more than 10 data loads to perform one floating-point operation (note that
subroutine INTERNAL DERIV and subroutine NS ALLOCATE MEMORY have zero floating-point
operations performed thus their loads per flop ratios are not applicable). Hence, it is con-
cluded that PlasComCM was memory-bound and the large quantity of memory loads was a
major performance obstacle that resulted in the observed low floating-point execution rate.
3.4 Vectorization
The Intel Advanced Vector Extensions (Intel AVX) instruction set for performing floating-
point operations in Single Instruction Multiple Data (SIMD) mode are supported on the Intel
Xeon E5 (SandyBridge) processor of Stampede system. Vectorization is critical to obtaining
good performance since the 256-bit wide SIMD registers allow multiplying/adding 4 double
precision operands with only one vector instruction. This section presents our efforts in
investigating the vectorization status of the original PlasComCM, which was compiled using
Intel Fortran Compiler 13.1.0 with complier option -O2 to enable the auto-vectorizer.
According to the vector report generated by the Intel Fortran Compiler, Table 3.6 shows
a summary of the vectorization status of the innermost loops of the top 10 subroutines in
original PlasComCM. Note that subroutine NS BC is not included here because its complex
context makes it necessary to completely reorganize the whole subroutine to benefit from
vectorization, which is beyond the scope of this study. Subroutine NS ALLOCATE MEMORY is
also excluded because it only allocates memory and does not contain any loops operating
floating-point numbers. The second column of Table 3.6 shows the total amount of inner-
14
most loops in each subroutine and the third column shows the amount of vectorized loops.
The data in Table 3.6 reveal that there are only 10 out of 50 loops that were vectorized
by the auto-vectorizer. We referred to the vector report for the cause of failure in vector-
ization, the results of which can be found in Table 3.7. The most frequently cause given
by the compiler preventing vectorization was the existence of vector dependence, which was
surprising since all of the vectors in the innermost loops are independent and strictly vector-
izable. An examination of the assembly files generated by the Intel Fortran Compiler (see
Table 3.8 [27]) shows that some of the vectorized loops used the less-capable 128-bit-wide
SSE2 vector instructions rather than the 256-bit-wide AVX instructions, indicating that
further vectorization is possible.
Hence, it is concluded that the original PlasComCM, when compiled using Intel Fortran
Compiler 13.1.0 with -O2, had a poor status of vectorization. Most of the loops were not
vectorized and even the few vectorized ones were not benefitting from the AVX 4-lane vector
instructions. The lack of vectorization was another cause of the low serial performance of
PlasComCM. Chapter 4 presents our efforts on removing the vectorization obstacles by
pointer-dereferencing, loop transformations and applying Fortran SIMD directives.
3.5 Tables for Chapter 3
Table 3.1: Timing profile of the original PlasComCM.
Subroutine Wall-clock time (s) % of program
NS BC 133.555 17.84
VISCOUSTERMS STRONG 129.412 17.29
SPARSE NEW 114.627 15.32
CID 108.989 14.56
ARK2 46.524 6.22
NS RHS STRNRT 42.712 5.71
INTERNAL DERIV 39.959 5.34
NS RHS EULER 34.055 4.55
COMPUTEDV IDEALGAS 20.318 2.71
NS ALLOCATE MEMORY 16.066 2.15
Total of Top 10 686.216 91.68
Total of Program 748.461 100
15
Table 3.2: Tasks of top 10 subroutines in PlasComCM.
Subroutine Task Work scales as
NS BC Apply the boundary conditions N2
VISCOUSTERMS STRONG Compute the viscous and heat transfer
terms of RHS, coordinate the SpMV M~u
N3
SPARSE NEW Compute the boundary portion of the
SpMV, i.e., Mb~ub (see Eq. (2.12))
N2
CID Compute the interior portion of the
SpMV, i.e., Mi~ui (see Eq. (2.12))
N3
ARK2 Perform RK4 time integration to advance
solution for one time step using RHS
terms
N3
NS RHS STRNRT Compute the strain-rate tensor N3
INTERNAL DERIV Allocate temporary storage, perform halo
data exchange to fill halo regions needed
for M~u (supporting stencils at processors
boundaries), call subroutine CID and sub-
routine SPARSE NEW
N/A
NS RHS EULER Compute the inviscid terms of RHS, co-
ordinate the SpMV M~u
N3
COMPUTEDV IDEALGAS Compute the dependent variables N3
NS ALLOCATE MEMORY Allocate memory N/A
16
Table 3.3: Flops and execution rate of the original PlasComCM.
Subroutine Flops, F Execution rate, R (%)
NS BC 3.92× 1010 1.36
VISCOUSTERMS STRONG 1.77× 1011 6.34
SPARSE NEW 1.73× 1010 0.70
CID 1.36× 1011 5.79
ARK2 1.36× 1010 1.36
NS RHS STRNRT 2.23× 1010 2.42
INTERNAL DERIV 0 0.00
NS RHS EULER 3.19× 1010 4.33
COMPUTEDV IDEALGAS 1.10× 1010 2.50
NS ALLOCATE MEMORY 0 0.00
Total of Top 10 4.49× 1011 3.03
Table 3.4: Cache and main memory usage of the original PlasComCM.
Subroutine L1(%) L2(%) L3(%) Memory(%)
NS BC 97.81 2.01 0.13 0.05
VISCOUSTERMS STRONG 95.03 3.60 1.29 0.09
SPARSE NEW 97.07 1.12 0.72 0.10
CID 94.35 4.93 0.72 0.00
ARK2 98.03 1.72 0.19 0.06
NS RHS STRNRT 97.35 2.10 0.46 0.09
INTERNAL DERIV 82.27 5.67 11.61 0.44
NS RHS EULER 92.70 4.05 2.98 0.27
COMPUTEDV IDEALGAS 98.51 1.36 0.11 0.02
NS ALLOCATE MEMORY N/A N/A N/A N/A
Total of Top 10 95.78 2.88 1.13 0.21
Total of Program 95.61 2.99 1.19 0.22
17
Table 3.5: Loads per flop ratios of the original PlasComCM.
Subroutine Loads Flops Loads per flop
NS BC 3.69× 1011 3.92× 1010 9.40
VISCOUSTERMS STRONG 4.82× 1011 1.77× 1011 2.72
SPARSE NEW 2.23× 1011 1.73× 1010 12.88
CID 2.27× 1011 1.36× 1011 1.67
ARK2 1.51× 1011 1.36× 1010 11.09
NS RHS STRNRT 1.67× 1011 2.23× 1010 7.50
INTERNAL DERIV 5.97× 1010 0 N/A
NS RHS EULER 8.78× 1010 3.19× 1010 2.76
COMPUTEDV IDEALGAS 4.50× 1010 1.10× 1010 4.10
NS ALLOCATE MEMORY 0 0 N/A
Total of Top 10 1.81× 1012 4.49× 1011 4.04
Table 3.6: Vectorization status of the original PlasComCM.
Subroutine Number of total loops Number of vectorized loops
VISCOUSTERMS STRONG 17 2
SPARSE NEW 1 0
CID 3 3
ARK2 10 0
NS RHS STRNRT 5 2
INTERNAL DERIV 3 1
NS RHS EULER 9 2
COMPUTEDV IDEALGAS 2 0
Total 50 10
18
Table 3.7: Cause of failure in vectorization of the original PlasComCM.
Subroutine Cause
VISCOUSTERMS STRONG existence of vector dependence, vectorization pos-
sible but seems inefficient
SPARSE NEW existence of vector dependence
ARK2 existence of vector dependence
NS RHS STRNRT existence of vector dependence
INTERNAL DERIV existence of vector dependence
NS RHS EULER existence of vector dependence, vectorization pos-
sible but seems inefficient
COMPUTEDV IDEALGAS existence of vector dependence
Table 3.8: Vector instructions generated in the original PlasComCM.
Instruction Set Description
movhpd SSE2 Move High Packed Double-Precision
Floating-Point Value
movlpd SSE2 Move Low Packed Double-Precision
Floating-Point Value
mulpd SSE2 Multiply Packed Double-Precision
Floating-Point Values
addpd SSE2 Add Packed Double-Precision Floating-
Point Values
19
3.6 Figures for Chapter 3
User 
Code TAU PAPI
Hardware
Counters
INTEL
AMD
NVIDIA
EventsMetrics
Figure 3.1: Accessing hardware counters through TAU and PAPI.
20
CHAPTER 4
OPTIMIZATION TECHNIQUES AND RESULTS
As shown in the profiling statistics of PlasComCM in chapter 3, the quantity of memory
accesses and the lack of vectorization were the main causes of the code’s low serial per-
formance. This chapter presents the optimization techniques implemented to reduce the
number of loads and remove obstacles to vectorization. Three optimization techniques, i.e.,
pointer dereferencing, loop transformation and Fortran SIMD directives, are discussed in
detail in section 4.1, section 4.2 and section 4.3, respectively. Finally, section 4.4 shows the
improvement in serial performance.
4.1 Pointer Dereferencing
PlasComCM was originally developed where nested multiple-level pointers (high level point-
ers pointing to mid-level ones, lowest level pointers pointing to floating-point numbers and
integer numbers) are frequently used. This kind of data structure supports the flexibility of
applying operators in an overset grid context and simplifies the programming and mainte-
nance process. Fig. 4.1 shows a loop in subroutine ARK2 of the original PlasComCM. The
functionality of this loop is to update state%cv(:,:), which is the data container of the
fundamental variables ρ, ρ~u, and ρE, using the explicit RHS terms. Note that here cv(:,:),
dt(:) and rhs explicit(:,:,:) are all pointers and they belong to the same parent con-
tainer state, which itself is a pointer of user-defined type. The use of two level pointers gives
flexibility but has a tradeoff in that it increases the number of indirect memory references
and can also interfere with other optimizations.
Figure 4.2 shows the same loop as Fig. 4.1 but has the pointer dereferencing optimization
applied. Observe that state%cv(:,:), state%dt(:) and state%rhs explicit(:,:,:)
21
are dereferenced outside the innermost loop by locally-defined pointers, which reduces the
indirect memory references of the innermost loop from two to one.
Table 4.1 presents the reduction of memory loads and the measured serial speedup after
the pointer dereferencing optimization, where the loads reduction factor is defined as
Loads reduction factor =
Loads in the original program
Loads in the optimized program
. (4.1)
The whole program reduced by 33% the number of memory loads and had a serial speedup
of 1.23. The largest loads reduction factor was found in subroutine ARK2, where 85% of the
memory loads were removed. Subroutine NS ALLOCATE MEMORY only allocates memory so its
loads reduction factor is not applicable in Table 4.1. Table 4.2 further shows the effect of
pointer dereferencing on reducing the loads-per-flop ratios (subroutine INTERNAL DERIV and
subroutine NS ALLOCATE MEMORY are not applicable because they have zero flops). Without
pointer dereferencing, subroutine ARK2 needed 11.09 loads to perform one floating-point
operation and top 10 subroutines cumulatively needed 4.04. After pointer dereferencing,
subroutine ARK2 only needed 1.66 loads to perform one floating-point operation and the top
10 subroutines cumulatively needed 2.60.
The lesson to be learned is that the flexible data structure originally used in PlasComCM
has a performance tradeoff due to multiple indirect memory references. As the difference
between the speed of memory access and the speed of floating-point computation increases,
it is necessary to perform manual pointer dereferencing optimization to reduce the level of
indirections inside the innermost loops, if the compiler cannot do it automatically.
4.2 Loop Transformation
Loop transformation techniques were used to reduce memory references and remove obstacles
to vectorization. In this section, we present three examples to illustrate the transformation
techniques in detail. The code examples are from subroutine CID, subroutine SPARSE NEW
and subroutine VISCOUSTERMS STRONG, which are the 4th, 3rd and 2nd time-consuming
subroutines, respectively, in the timing profile of the original PlasComCM (see Table 3.1).
22
Figure 4.3 shows the loop example (before transformation) from subroutine CID. The
functionality of this loop is computing the interior part of the SpMV for applying the operator
in the first dimension. The three outer loops (k, j and i) loop over all of the interior points
while the innermost loop (ii) loops over the neighboring grid points needed for the interior
finite difference stencil. In our test case, the three outer loops have 40, 40 and 36 iterations
while the innermost loop has only 5 iterations due to the limited width of the finite difference
stencil. The amount of iterations of the innermost loop is so small that it cannot fully utilize
the cache line or benefit from vector instructions. In Fig. 4.4, by reorganizing the SpMV
from row order to diagonal order, the order of this loop nest is changed to k, j, ii, i to have
a larger iteration count in the innermost loop. The indexing in the innermost loop is also
simplified for the ease of vectorization. Similar transformations were extended to the second
and third dimensions and the same loop order k, j, ii, i was used for stride-1 computation.
Figure 4.5 shows the loop example (before transformation) from subroutine SPARSE NEW.
The functionality of this loop is computing the boundary part of the SpMV. Before the
computational statement, there are as many as 7 indirect memory references used for ac-
cessing the boundary stencils stored in the Compressed Row Storage (CRS) format and
obtaining the indices of the grid points affected by the boundary stencil. This agrees with
the data shown in Table 4.2 that even after pointer dereferencing optimization, subroutine
SPARSE NEW still has a high loads per flop ratio of 5.75. In Fig. 4.6, we get rid of the CRS stor-
age and split the loop nest into two separate ones for left and right boundaries respectively.
This transformation allows us to reduce loads by directly accessing the boundary stencils
from 3-D arrays (rhs block1(:,:,:) for left boundaries and rhs block2(:,:,:) for right
boundaries). The amount of loads measured by TAU and PAPI shows this optimization
reduced the loads per flop ratio of subroutine SPARSE NEW from 5.75 to 1.69.
Figure 4.7 shows the third loop example (before transformation). This loop is from sub-
routine VISCOUSTERMS STRONG and there is an indirect memory reference introduced by
accessing the index mapping from vMap(:). The transformed loop is shown in Fig. 4.8,
where the indirect memory reference is removed by explicitly specifying the index values
stored in vMap(:). Similar manual transformations were applied to other loops of subroutine
VISCOUSTERMS STRONG and other top 10 subroutines of PlasComCM . All of the transforma-
23
tions were guided by the same goal: reducing memory loads and simplifying the context for
the ease of vectorization.
4.3 Fortran SIMD Directives
Sometimes the auto-vectorizer is not able to confirm independence and fails in vectorizing
loops when no dependency issue exists. Figure. 4.9 shows such a loop example from subrou-
tine ARK2 with the remark from the vector report listed below. The vectors in the innermost
loop (line 2973 to line 2975) are all independent but the compiler still complains about vector
dependency and fails to vectorize this loop.
The Intel SIMD directives are designed to force the compiler to vectorize loops without
checking aliasing or correctness [28]. They are powerful mandates for vectorization but
should be used with care (programmers have to take more responsibility for the correctness
of the program). In Fig. 4.10, the Intel Fortran SIMD directive is added to the same
loop shown in Fig. 4.9 to eliminate the unnecessary worry of the compiler about vector
dependence. As the remark from the vector report indicates, now this loop is successfully
vectorized. Since all of the innermost loops of the top 10 subroutines work on independent
vectors and are strictly vectorizable, the Intel Fortran SIMD directive was applied to all of
them to assist vectorization.
4.4 Optimization Results
The vectorization status of PlasComCM was largely improved after applying the serial opti-
mizations (pointer dereferencing, loop transformations and Intel Fortran SIMD directive) to
the top 10 subroutines. The vectorization report showed that all of the innermost loops of
the top 10 subroutines were vectorized. The -xHost compiler option was also added to use
the highest instruction set AVX, rather than the older and less capable SSE2. A detailed
assembly check of the vectorized loops can be found in Table 4.3 [27], which confirms that
the AVX vector instructions were generated in the optimized PlasComCM.
24
Finally, with all of the serial optimizations implemented, Table 4.4 summarizes the re-
duction of loads and the serial speedup. The serial optimization efforts altogether led to
a reduction of 66% of the memory loads for top 10 subroutines as a whole (63% for Plas-
ComCM) and a serial speedup of 2.22 (2.02 for PlasComCM) on one core.
4.5 Tables for Chapter 4
Table 4.1: Reduction of loads and serial speedup after pointer dereferencing.
Subroutine Loads reduction factor Serial speedup
NS BC 1.01 1.00
VISCOUSTERMS STRONG 1.93 1.49
SPARSE NEW 2.24 1.23
CID 1.00 1.01
ARK2 6.69 3.52
NS RHS STRNRT 3.61 2.26
INTERNAL DERIV 1.00 1.06
NS RHS EULER 1.03 1.03
COMPUTEDV IDEALGAS 4.10 2.10
NS ALLOCATE MEMORY N/A 1.02
Total of Top 10 1.55 1.25
Total of Program 1.50 1.23
Table 4.2: Loads per flop ratios (before and after pointer dereferencing).
Subroutine Before After
NS BC 9.40 9.32
VISCOUSTERMS STRONG 2.72 1.41
SPARSE NEW 12.88 5.75
CID 1.67 1.67
ARK2 11.09 1.66
NS RHS STRNRT 7.50 2.08
INTERNAL DERIV N/A N/A
NS RHS EULER 2.76 2.67
COMPUTEDV IDEALGAS 4.10 1.00
NS ALLOCATE MEMORY N/A N/A
Total of Top 10 4.04 2.60
25
Table 4.3: Vector instructions generated in the optimized PlasComCM.
Instruction Set Description
vmovhpd AVX Move High Packed Double-Precision
Floating-Point Value
vmovlpd AVX Move Low Packed Double-Precision
Floating-Point Value
vmulpd AVX Multiply Packed Double-Precision
Floating-Point Values
vdivpd AVX Divide Packed Double-Precision
Floating-Point Values
vaddpd AVX Add Packed Double-Precision Floating-
Point Values
vsubpd AVX Subtract Packed Double-Precision
Floating-Point Values
Table 4.4: Reduction of loads and serial speedup after all optimizations.
Subroutine Loads reduction factor Serial speedup
NS BC 1.01 1.00
VISCOUSTERMS STRONG 13.02 3.85
SPARSE NEW 4.45 4.58
CID 2.38 3.76
ARK2 12.37 4.08
NS RHS STRNRT 15.00 5.43
INTERNAL DERIV 3.35 1.42
NS RHS EULER 5.22 1.69
COMPUTEDV IDEALGAS 6.43 4.83
NS ALLOCATE MEMORY N/A 1.02
Total of Top 10 2.96 2.22
Total of Program 2.67 2.02
26
4.6 Figures for Chapter 4
do k = 1, input%nCv
do i = 1, grid%nCells
state%cv(i,k) = state%cv(i,k) + state%dt(i) * \
b_vec_exp(j) * state%rhs_explicit(i,k,j)
end do
end do
Figure 4.1: A loop example from subroutine ARK2 (before pointer dereferencing).
nCv = input%nCv
Nc = grid%nCells
cv => state%cv
dt => state%dt
rhs_explicit => state%rhs_explicit
do k = 1, nCv
do i = 1, Nc
cv(i,k) = cv(i,k) + dt(i) * b_vec_exp(j) * \
rhs_explicit(i,k,j)
end do
end do
Figure 4.2: A loop example from subroutine ARK2 (after pointer dereferencing).
27
is = is + gOffset(1)
ie = ie + gOffset(1)
do k = ks, ke
do j = js, je
do i = is, ie
l0 = ( (k-1)*N2 + (j-1) ) * N1g + i
do ii = iis, iie
l1 = l0 + ii
dF(l0) = dF(l0) + vals(ii) * F(l1)
end do
end do
end do
end do
Figure 4.3: A loop example from subroutine CID (before transformation).
is = is + gOffset(1)
ie = ie + gOffset(1)
slice_k = N2 * N1g
do k = ks, ke
offset_k = (k-1) * slice_k
do j = js, je
offset_kj = offset_k + (j-1) * N1g
do ii = iis, iie
value = vals(ii)
offset_kjii = offset_kj + ii
do i = is, ie
dF(i+offset_kj) = dF(i+offset_kj) + \
value * F(i+offset_kjii)
end do
end do
end do
end do
Figure 4.4: A loop example from subroutine CID (after transformation).
28
do k = 1, vdsGhost
do jj = 1, nbc_row(k)
j = bc_row(k,jj)
l1 = vec_indGhost(k,j,dir)
is = row_ptr(k,j)
ie = row_ptr(k,j+1)-1
dF(l1) = 0.0_8
do i = is, ie
local_val = val(i,k)
local_ind = col_ind(i,k)
l0 = vec_indGhost(k,local_ind,dir)
dF(l1) = dF(l1) + local_val * F(l0)
end do
end do
end do
Figure 4.5: A loop example from subroutine SPARSE NEW (before transformation).
29
! ... if we have a left boundary
if (goffset(1) == 0) then
do k = 1, vdsGhost
do j = 1, bndry_depth
l1 = vec_indGhost(k, j, dir)
offset = l1 - j * dist
dF(l1) = 0.0_8
do i = 1, bndry_width
dF(l1) = dF(l1) + \
rhs_block1(order,j,i) * F(i * dist + offset)
end do
end do
end do
end if
! ... if we have a right boundary
if (goffset(2) == 0) then
do k = 1, vdsGhost
do j = 1, bndry_depth
l1 = vec_indGhost(k, offset_depth + j, dir)
dF(l1) = 0.0_8
offset = l1 - j * dist - offset_diff
do i = bndry_width, 1, -1
dF(l1) = dF(l1) + \
rhs_block2(order,j,i) * F(i * dist + offset)
end do
end do
end do
end if
Figure 4.6: A loop example from subroutine SPARSE NEW (after transformation).
30
do l = 1, ND
do ii = 1, Nc
UVW(ii,l) = cv(ii,vMap(l+1)) * dv(ii,3)
end do
end do
Figure 4.7: A loop example from subroutine VISCOUSTERMS STRONG (before
transformation).
if ( ND == 2 ) then
do ii = 1, Nc
UVW(ii,1) = cv(ii,2) * dv(ii,3)
UVW(ii,2) = cv(ii,3) * dv(ii,3)
end do
else if ( ND == 3 ) then
do ii = 1, Nc
UVW(ii,1) = cv(ii,2) * dv(ii,3)
UVW(ii,2) = cv(ii,3) * dv(ii,3)
UVW(ii,3) = cv(ii,4) * dv(ii,3)
end do
end if
Figure 4.8: A loop example from subroutine VISCOUSTERMS STRONG (after transformation).
31
2970 ! ... update state%cv with the explict RHS only
2971 do j = 1, ARK2_nStages
2972 do k = 1, nCv
2973 do i = 1, Nc
2974 cv(i,k)=cv(i,k)+dt(i)*b_vec_exp(j)*rhs_explicit(i,k,j)
2975 end do
2976 end do
2977 end do
src/ModRungeKutta.f90(2973): (col. 11) remark: loop was
not vectorized: existence of vector dependence.
Figure 4.9: A loop example that cannot be vectorized without SIMD directive.
2970 ! ... update state%cv with the explict RHS only
2971 do j = 1, ARK2_nStages
2972 do k = 1, nCv
2973 !DIR$ SIMD
2974 do i = 1, Nc
2975 cv(i,k)=cv(i,k)+dt(i)*b_vec_exp(j)*rhs_explicit(i,k,j)
2976 end do
2977 end do
2978 end do
src/ModRungeKutta.f90(2974): (col. 11) remark:
vectorization support: unroll factor set to 4.
src/ModRungeKutta.f90(2974): (col. 11) remark:
SIMD LOOP WAS VECTORIZED.
Figure 4.10: A loop example gets vectorized with help from SIMD directive.
32
CHAPTER 5
BOX SHAPED HALO REGION
This chapter presents our work on replacing the original star-shaped halo region with box-
shaped halo region to extend the time between synchronizations and increase the work
granularities between communications. These changes were for reasons independent of those
discussed in chapter 3.
5.1 Background
Our work was coordinated with XPACC (The Department of Energy Center for Exascale
Simulation of Plasma-Coupled Combustion at University of Illinois) multi-hardware MxPA
project which seeks to develop code retargeting techniques through source-to-source transla-
tion at the compiler stage to access heterogeneous architectures using a single codebase. The
MxPA project found that the overhead created by data movement to and from the device
dominated the execution time so that no significant speedup was obtained when running
on NVIDIA GPGPU cards. As discussed more in section 5.2, the cause of this was that
the original star-shaped halo region needed frequent sends/receives of small messages that
resulted in frequent data movement between the host and the device. It is discussed in detail
in section 5.2 that, to overcome this issue, the star-shaped halo region needed to be replaced
by the box-shaped halo region so that the MPI calls were agglomerated and moved to keep
the data on the device to do more work.
33
5.2 Star Halo vs Box Halo
Figure 5.1 compares the star-shaped halo region with the box-shaped halo region using a
2-D grid and the 5-point finite difference stencil. The region in blue is the unique data
owned and updated by each MPI rank. The region in red (green) is the original star halo
for applying operators in x (y) direction and the region in orange is the box halo.
Figure 5.2 compares the execution pattern (star vs box) for one RK step. The horizontal
axis is the time line and each MPI rank executes from left to right to finish one RK step.
Using the star halo, taking x direction as an example, the red halo region was allocated and
filled each time before applying operators in x direction and then deallocated. This kind
of implementation dispersed the computation (marked with green) into frequent passing of
small messages (marked with red) and resulted in small work granularities for oﬄoading to
accelerators.
Using the box halo, each MPI rank also carried the orange halo region, which permitted
agglomerating the communications into two large chunks and pushing them to the begin-
ning of the RK step. The time between synchronizations was extended and a larger work
granularity between communications for oﬄoading to the accelerators was created.
5.3 Figures for Chapter 5
22"
Star"Halo" Box"Halo"
Unique"data"owned"by"each"MPI"rank"
Star"halo"for"x=direc>on"operators"
Star"halo"for"y=direc>on"operators"
Box"halo"
x
y
Figure 5.1: Layouts of star halo and box halo using a 2-D grid and the 5-point stencil.
34
Time
Time
STAR
BOX
Communica1on Computa1on
large7work7granularity
small7work7granularity
Figure 5.2: Execution pattern (star vs box) for one RK step.
35
CHAPTER 6
RUNNING ON INTEL MICS
This chapter summarizes our preliminary results of running PlasComCM on the Intel Xeon
Phi MIC card installed on each of the compute nodes of the Stampede system using an MPI-
only programming model in symmetric mode, i.e., with MPI tasks simultaneously running
on both the CPUs and the MIC.
6.1 Background
As noted earlier, the Stampede system has a peak performance of 9.6 PF/sec, where the base
cluster contributes 2.2 PF/sec and the other 7.4 PF/sec come from the MIC cards. As shown
in Fig. 6.1 [22], a compute node of Stampede is installed with two SandyBridge processors on
the host side and a MIC coprocessor on the device side. The details of processor specifications
can be found in Table 2.1. The Intel Many Integrated Core (MIC) architecture is designed
to accelerate highly parallel applications by utilizing a degree of parallelism higher than
on the x86 CPU sockets. The MIC socket has “many” simplified cores and each of them
has a wider vector length and a larger hardware thread count. The optimized version of
PlasComCM discussed in chapter 4, with the obstacles to vectorization removed, is a much
better candidate for porting to the MIC since vectorization is critical to good performance on
MIC. We seek to accelerate the time-to-solution of the optimized PlasComCM by enabling
the MIC card that was previously left unused. The porting efforts are presented in section
6.2 and the performance results are summarized in section 6.3.
36
6.2 Porting Efforts
One key advantage of using the MIC is that we can employ familiar languages (e.g., Fortran)
and familiar parallel programming models (e.g., MPI), to simplify the porting and optimizing
process. As our initial attempt, the MPI-only framework of PlasComCM was retained and
the symmetric execution mode (MPI tasks simultaneously running on both the CPUs and the
MIC) was naturally employed. There was only MPI task-level parallelism on the MIC card
and each MIC core received one MPI task utilizing one hardware thread. Our experiments
were run on a single Stampede compute node and the problem distributed to that node had
a fixed size. When the MIC card was used in symmetric mode simultaneously with the CPU,
we had two complexities that affected performance: (1) the MIC cores are slower than the
CPU cores and (2) the link between the MIC and CPU is slower than between the CPU
cores.
The performance experiments still solved 3-D compressible Navier-Stokes equations with
mass and energy conservation. The analysis was guided by a two-dimensional parameteriza-
tion, M (0 ≤ M ≤ 60) and P (0 ≤ P ≤ 100), where M is the number of enabled MIC cores
and P is the percentage of the total work distributed to each MIC core.
6.3 Results
Intel Fortran compiler 14.0 and Intel MPI 4.1 were used to build the executables, with the
-xHost option for the CPU, the -mmic option for the MIC and the -g -O2 -openmp options
for both. For simplicity, the computational grid was designed to have a size of 11040×24×24,
with 1-D data decomposition along the first dimension. The grid was decomposed with 16
“heavy” MPI tasks for the CPU plus M “light” MPI tasks for the MIC (see Fig. 6.2).
Figure 6.3 shows the speedup, relative to using 16 CPU cores, of using PlasComCM with
symmetric execution utilizing 16 CPU cores + M MIC cores. The speedup was larger than
one and it was most beneficial to have at least 48 MIC cores running. In Fig. 6.4, M was
fixed at 60 and we varied the work distributed to each MIC core. A 20% reduction in wall-
clock time was achieved when distributing 0.33% of the total work to one MIC core (20% of
37
the total work to the MIC card).
6.4 Figures for Chapter 6
Figure 6.1: Configuration of a compute node of the Stampede system.
38
PCIe
Tw
o 
Sa
nd
yB
rid
ge
 so
ck
et
s 
(e
ac
h 
8 
co
re
s)
One MIC card (61 cores)
( (
16 “heavy” MPI tasks M “light” MPI tasks
16 CPU cores
M MIC cores
) Computational grid
ibrun.symm -m ./plascomcm.mic -c ./plascomcm.intel
Figure 6.2: Decomposition and mapping for running on the MIC card in symmetric mode.
39
Figure 6.3: Speedup of symmetric runs with increasing M.
Figure 6.4: Speedup of symmetric runs with increasing P.
40
CHAPTER 7
CONCLUSION AND FUTURE WORK
7.1 Conclusion
This thesis summarizes our analysis and optimization efforts to improve the performance
of PlasComCM, a high-order finite difference CFD application. Performance tools TAU
and PAPI were used to gather performance information by profiling and accessing hard-
ware counters. The performance analysis showed that the serial performance of the top 10
time-consuming subroutines of the original PlasComCM was poor (running at 3.03% of the
hardware peak) and caused by high quantity of memory accesses and a lack of vectorization.
Several optimizations including pointer dereferencing, loop transformations and Intel For-
tran SIMD directives were applied to reduce the number of loads and to remove obstacles to
vectorization. These optimizations were able to vectorize all of the innermost loops of the
top 10 time-consuming subroutines, reduce 63% of the loads of PlasComCM and speed up
PlasComCM by a factor of 2.02 when running on one core. The shape of the halo region of
PlasComCM was also changed from a “star” shape into a “box” shape to extend the time
between synchronizations and created a larger work granularity for oﬄoading. Finally, pre-
liminary results of running PlasComCM on Intel MICs in symmetric mode alongside MPI
tasks on CPU were presented. The best result was a 20% reduction in wall-clock time when
using 16 CPU cores + 60 MIC cores vs 16 CPU cores.
7.2 Future Work
Firstly, the work presented in this thesis was primarily focused on analyzing and improving
the serial performance. However, production runs usually require parallel execution where
41
we have wall-clock time spent in communication. Therefore, it would be helpful to evaluate
the efficiency of communication and its impact on the overall parallel performance.
Secondly, the preliminary results of running PlasComCM on Intel MICs presented in this
thesis were limited to a particular problem size. More investigations into how the gird size
and domain decomposition influence the performance behavior are recommended.
Thirdly, all of the performance information of PlasComCM was based on an MPI-only
programming model. It would be helpful to integrate OpenMP into PlasComCM and see
whether other parallel models (OpenMP-only, MPI-\OpenMP-hybrid) can obtain better
performance.
42
REFERENCES
[1] J.D. Anderson. Computational Fluid Dynamics The Basics with Applications. McGraw-
Hill Education, New York City, NY, first edition, 1995.
[2] S. Cant. High-performance computing in computational fluid dynamics: progress and
challenges. Philosophical Transactions. Series A, Mathematical, Physical, and Engi-
neering Sciences, 360(1795):1211–1225, 2002.
[3] S.B. Pope. Turbulent Flows. Cambridge University Press, Cambridge, England, first
edition, 2000.
[4] P. Birkby, R.S. Cant, W.N. Dawes, A.M. Savill, A.A.J. Demargne, P.C. Dhanasekeran,
W.P. Kellar, N.C. Rycroft, R.L.G.M. Eggels, and I.K. Jennions. CFD analysis of a
complete industrial lean premixed gas turbine combustor. In ASME International Gas
Turbine and Aeroengine Congress and Exposition 2000, volume 2, pp. V002T02A050,
Munich, Germany, May 2000.
[5] B. Galperin and S.A. Orszag. Large Eddy Simulation of Complex Engineering and
Geophysical Flows. Cambridge University Press, Cambridge, England, first edition,
2010.
[6] D.J. Bodony and S.K. Lele. On using large-eddy simulation for the prediction of noise
from cold and heated turbulent jets. Physics of Fluids, 17(8):085103, 2005.
[7] P. Moin and K. Mahesh. DIRECT NUMERICAL SIMULATION: A Tool in Turbulence
Research. Annual Review of Fluid Mechanics, 30(1):539–578, 1998.
[8] M.M. Rogers and P. Moin. The structure of the vorticity field in homogeneous turbulent
flows. Journal of Fluid Mechanics, 176:33–66, 1987.
[9] J. Slotnick, A. Khodadoust, J. Alonso, D. Darmofal, W. Gropp, E. Lurie, and
D. Mavriplis. CFD Vision 2030 Study: A Path to Revolutionary Computational Aero-
sciences. Prepared for NASA Langley Research Center, Hampton, Virginia, 2013.
[10] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with
the Message-Passing Interface. MIT Press, Cambridge, MA, second edition, 1999.
[11] J. Kim, D.J. Bodony, and J.B. Freund. Adjoint-based control of loud events in a
turbulent jet. Journal of Fluid Mechanics, 741:28–59, 2014.
43
[12] Q. Zhang and D.J. Bodony. Numerical simulation of two-dimensional acoustic liners
with high-speed grazing flow. AIAA Journal, 49(2):365–382, 2011.
[13] Q. Zhang and D.J. Bodony. Numerical investigation and modelling of acoustically
excited flow through a circular orifice backed by a hexagonal cavity. Journal of Fluid
Mechanics, 693:367–401, 2012.
[14] D.J. Bodony. Scattering of an entropy disturbance into sound by a symmetric thin
body. Physics of Fluids, 21(9):096101, 2009.
[15] A. Mishra and D.J. Bodony. Evaluation of actuator disk theory for predicting indirect
combustion noise. Journal of Sound and Vibration, 332(4):821–838, 2013.
[16] M.M. Sucheendran, D.J. Bodony, and P.H. Geubelle. Coupled structural-acoustic re-
sponse of a duct-mounted elastic plate with grazing flow. AIAA Journal, 52(1):178–194,
2014.
[17] C.M. Ostoich, D.J. Bodony, and P.H. Geubelle. Interaction of a Mach 2.25 turbulent
boundary layer with a fluttering panel using direct numerical simulation. Physics of
Fluids (1994-present), 25(11):110806, 2013.
[18] B. Strand. Summation by parts for finite difference approximations for d/dx. Journal
of Computational Physics, 110(1):47–67, 1994.
[19] M. Sva¨rd, M.H. Carpenter, and J. Nordstro¨m. A stable high-order finite difference
scheme for the compressible Navier–Stokes equations, far-field boundary conditions.
Journal of Computational Physics, 225(1):1020–1038, 2007.
[20] M. Sva¨rd and J. Nordstro¨m. A stable high-order finite difference scheme for the com-
pressible Navier–Stokes equations: no-slip wall boundary conditions. Journal of Com-
putational Physics, 227(10):4805–4824, 2008.
[21] D.J. Bodony. Accuracy of the simultaneous-approximation-term boundary condition
for time-dependent problems. Journal of Scientific Computing, 43(1):118–133, 2010.
[22] Texas Advanced Computing Center (TACC). Stampede User Guide.
http://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide, 2013.
[23] Top 500 Supercomputing Sites. Top 500 Lists: Top 10 Sites for November 2014.
http://www.top500.org/lists/2014/11/, 2014.
[24] S.S. Shende and A.D. Malony. The TAU parallel performance system. International
Journal of High Performance Computing Applications, 20(2):287–311, 2006.
[25] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci. A portable programming
interface for performance evaluation on modern processors. International Journal of
High Performance Computing Applications, 14(3):189–204, 2000.
44
[26] Performance Application Programming Interface (PAPI). Counting
Floating Point Operations on Intel Sandy Bridge and Ivy Bridge.
http://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops, 2015.
[27] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual, Vol-
ume 2, Instruction Set Reference, 2015.
[28] J. Jeffers and J. Reinders. Intel Xeon Phi Coprocessor High Performance Programming.
Morgan Kaufmann, Burlington, MA, first edition, 2013.
45
