An unstructured CFD mini-application for the performance prediction of a production CFD code by Owenson, A. M. B et al.
  
  
 
Manuscript version: Author’s Accepted Manuscript 
The version presented in WRAP is the author’s accepted manuscript and may differ from the 
published version or, Version of Record.  
 
Persistent WRAP URL: 
http://wrap.warwick.ac.uk/119788                            
 
Copyright and reuse: 
The Warwick Research Archive Portal (WRAP) makes this work of researchers of the 
University of Warwick available open access under the following conditions. 
 
This article is made available under the Creative Commons Attribution 4.0 International 
license (CC BY 4.0) and may be reused according to the conditions of the license.  For more 
details see: http://creativecommons.org/licenses/by/4.0/. 
 
 
 
Received 27 July, 2018; Revised 16 May, 2019; Accepted <day> <Month>, 2019
DOI: xxx/xxxx
SPECIAL ISSUE PAPER
An Unstructured CFD Mini-Application for the Performance
Prediction of a Production CFD Code
A. M. B. Owenson (1) | S. A. Wright (2) | R. A. Bunt (1) | Y. K. Ho (3) | M. J. Street (3) | S. A. Jarvis (1)
(1) Department of Computer Science, University of Warwick, Coventry, United Kingdom
(2) Department of Computer Science, University of York, York, United Kingdom
(3) Design Systems Engineering, Rolls-Royce plc, Derby, United Kingdom
Correspondence
Andrew Owenson, Department of Computer
Science, University of Warwick, Coventry,
United Kingdom.
Email: a.m.b.owenson@warwick.ac.uk
Funding Information
This research was supported by the EPSRC,
and by Intel (grant number 15220082).
Abstract
Maintaining the performance of large scientific codes is a difficult task. To aid in this task, a number of mini-applications have been developed that are more tractable to analyse than large-scale production codes while retaining their performance characteristics. These “mini-apps” also enable faster hardware evaluation and, for sensitive commercial codes, allow evaluation of code and system changes outside of access approval processes. In this paper we develop MG-CFD, a mini-application that represents a geometric multigrid, unstructured computational fluid dynamics (CFD) code, designed to exhibit similar performance characteristics without sharing commercially sensitive code. We detail our experiences of developing this application, using guidelines detailed in existing research and contributing further to these. Our application is validated against the inviscid flux routine of HYDRA, a CFD code developed by Rolls-Royce plc. for turbomachinery design. This paper (i) documents the development of MG-CFD; (ii) introduces an associated performance model with which it is possible to assess the performance of HYDRA on new HPC architectures; and (iii) demonstrates that it is possible to use MG-CFD and the performance models to predict the performance of HYDRA with a mean error of 9.2% for strong-scaling studies.
KEYWORDS:
scientific computing, computational fluid dynamics, performance analysis, high performance computing, performance modelling, mini-application
1 INTRODUCTION
The rapid development of new hardware and software in High Performance Computing (HPC) is greatly benefiting scientific discovery; with each new development come new opportunities for improving the performance of scientific applications. Evaluating the potential improvements offered by these developments is often a time-consuming process due to the complexity of the applications involved, and the learning curve for new machines, architectures and toolchains.
In recognition of these challenges, many HPC centres are turning to supporting tools and methodologies (e.g. predictive performance modelling [1,2,3,4] and hardware simulation [5,6]) to evaluate new systems ahead of procurement. Additionally, mini-applications have been shown to facilitate rapid evaluation of new hardware and programming techniques; these applications capture the key performance characteristics of a parent code in a much more concise form, making them easier to work with than full production codes but equally useful in performance engineering activities. The use of mini-applications has been well documented [7,8,9,10] and has spawned several suites of such applications [11,12] for the industry and research communities to examine. Recent uses of mini-apps include the recently established ASiMoV strategic partnership between Rolls-Royce plc. and five leading UK universities, whose aim is to achieve the first high-fidelity simulation of a complete and operating gas-turbine engine [13]. This will require several breakthroughs, including achieving exascale performance of a CFD simulation code.
This paper extends previous work on the development and early validation of a geometric multigrid, unstructured grid Computational Fluid Dynamics (CFD) mini-application [14]. In this paper we refine and conclude this development and show, through new research, how it can be used for performance prediction. In so doing, we address a limitation of our previous work, where the mini-application had a greater arithmetic intensity than the target kernel. This presented two significant issues that this paper seeks to address: (i) computational speed-ups identified with the mini-app, such as vectorisation, may not transfer to the target code; and (ii) the mini-app may not identify optimisations of, or improvements to, memory bandwidth that would benefit the target code.
Specifically, this paper makes the following contributions:
• This paper reports the development, refinement and testing of MG-CFD, the only multigrid unstructured finite-volume CFD mini-application;
• MG-CFD has been developed as part of a long-standing university/industry collaboration and, as a result, is representative of the production code HYDRA, which is the primary CFD code used by Rolls-Royce plc. for turbomachinery design;
• This paper presents a new performance projection model for HYDRA, with which it is possible to project from MG-CFD to HYDRA performance on a range of existing and emerging HPC architectures. This is highly significant for Rolls-Royce plc. as they increase their use of virtual certification and simulation-based engine design;
• This paper demonstrates that it is possible to use a mini-application and performance modelling to predict the performance of a production ‘target’ code, with a mean error of 9.2% for strong-scaling studies.
This paper is structured as follows: in Section 2 we discuss related work; in Section 3 we summarise the functionality of HYDRA which we aim to capture within the mini-application; in Section 4 we describe our experiences constructing MG-CFD; in Section 5 we validate the performance characteristics of MG-CFD when compared to the target kernel; in Section 6 we describe the proposed analytical model, and validate it on several HPC systems; finally, Section 7 concludes the paper.
2 RELATED WORK
There are numerous benchmarks and mini-applications representing the performance of different classes of HPC applications, some of which have been released as component parts of projects such as the Mantevo Project [11], the ECP Proxy Apps Suite [15], and the UK Mini-App Consortium [12]. Mini-applications from these repositories have been used in a variety of contexts.
One such example is miniMD, which has been used to explore the performance of molecular dynamics codes on the Intel Xeon Phi Knights Corner architecture [7]. Using a combination of AVX intrinsics and algorithmic optimisations, e.g. overlapping PCIe transfers with computation, the authors demonstrate a 5× speed-up for the gather-scatter bottleneck typically present in MD codes.
Mallinson et al. compare the performance of two PGAS programming models (OpenSHMEM and Co-Array Fortran) against MPI using CloverLeaf, a Lagrangian-Eulerian hydrodynamics mini-application [10]. The authors demonstrate that OpenSHMEM is able to outperform an equivalent MPI implementation by 7.78 iterations/sec, at 4096 sockets, when using proprietary nonblocking operations from Cray and 4 MB memory pages.
LULESH, a hydrodynamics mini-application representative of ALE3D, is used to assess the suitability of emerging parallel programming models (e.g. Liszt and Loci) along with more established models such as OpenMP [16], in terms of programmer productivity, runtime performance and ease of optimisation. The reduced size of LULESH when compared with ALE3D allowed the authors to examine eight parallel programming models. Their conclusion highlights that while emerging models such as Chapel and Loci enable a high level of productivity, they cannot match the performance of more established models such as MPI and OpenMP.
Similarly, Giles et al. examine the performance of OP2, a domain-specific framework for unstructured grid codes, using the AIRFOIL CFD mini-application [9]. The authors demonstrate that they are able to achieve performance within 6% of a hand-coded implementation.
The CFD code included in the Rodinia benchmark suite has been used to examine the performance of a Graphics Processing Unit (GPU) when running unstructured grid applications [17]. From the results, Corrigan et al. conclude that GPUs show promise for this class of code given an increase in double precision performance in the future.
The research in this paper similarly develops and makes use of a mini-application; however, our application additionally contains a geometric multigrid solver and supports mesh structures with variable node degree. The HPGMG-FV and LULESH mini-applications are most similar to our mini-application; however, the former operates on a structured mesh [18] and the latter does not have a multigrid solver.
Another body of work which is similar to our own, and that we build upon, deals with the validation of a mini-application’s performance in relation to that of the parent code. The technique employed by Tramm et al. involves comparing the correlation of parallel efficiency loss to performance counters for both the mini-application and the target code [8]. Previously this technique has been applied to mini-applications of a neutron transport code [8]; we employ and validate this technique on a different class of application. Messer et al. develop three mini-applications and use a comparison between the scalability of the mini-application and the original code as evidence of their similarity [19]. However, the authors focus on distributed memory scalability, whereas in this research we focus on intra-node shared memory scalability.
Finally, in this paper we explore how to project from mini-application performance to predict that of the target code. Sharkawi et al. propose a technique of identifying surrogate codes that are quantifiably similar to the target code according to 25 performance metrics [20]. These surrogates are executed, a genetic algorithm selects a weighting, and their weighted average is used as a performance prediction, achieving a mean error of 7.2% on an IBM Power6 and 10.5% on an Intel Core. Hoste et al. apply a similar technique but to microarchitecture-independent metrics, predicting the ranking of machine performance with 0.89 mean rank correlation [21]. In contrast, we show that an analytical model of the performance difference between a mini-application and its target code can provide projections of similar accuracy, with a mean error of 9.2% for strong-scaling studies.

Listing 1: HYDRA solver pseudo-code, with V-cycle geometric multigrid
    call jacob                            // Jacobi precondition
    for iter = 1 to niter do
        for level in [0, 1, 2, 3, 2, 1] do
            for time step = 1 to 5 do
                if dissipative flux update then
                    call grad             // Gradient
                    call vflux            // Viscous flux
                    if viscous wall then
                        call wfflux       // Viscous wall flux
                    end if
                    call wvflux           // Viscous near wall flux
                end if
                call iflux                // Inviscid flux
                call srcsa                // Spalart-Allmaras source term
                call update               // Update flow solution
            end for
            /* transfer solution up/down multigrid */
            if direction = up then
                call restrict
            else
                call prolong
            end if
        end for
    end for

3 BACKGROUND
3.1 HYDRA
The manufacturing industry is increasingly making use of CFD simulation codes to aid in the design and testing process of new products. One such code is HYDRA [22], a suite of nonlinear, linear and adjoint solvers developed by Rolls-Royce plc. in collaboration with a number of UK universities. These solvers target airflow within turbomachinery, where the flow must be modelled as compressible, viscous and turbulent. As such they solve the Reynolds-Averaged form of the compressible Navier-Stokes equations, which embody conservation of mass, momentum and energy. Turbulence modelling is enhanced with the Spalart-Allmaras one-equation model [23]. Equations are discretised using a MUSCL-based flux-differencing scheme, then block Jacobi preconditioned [24]. An explicit 5-stage Runge-Kutta scheme is applied to improve stability in high viscosity regions, and convergence of the multigrid method [25]. For more information on the background and numerical implementation of HYDRA we refer the reader to Lapworth et al. [22].

Loop    | Function                        | Runtime (%)
JACOB   | Jacobi preconditioner matrices  | 6.8
GRAD    | Gradient                        | 16.8
VFLUX   | Viscous fluxes                  | 35.8
IFLUX   | Inviscid fluxes                 | 10.7
SRCSA   | Spalart-Allmaras source term    | 14.6
UPDATE  | Update flow                     | 7.3
—       | Other routines                  | 8.0

TABLE 1 HYDRA runtime breakdown on a single Xeon Broadwell node with 28 MPI processes

In this paper, we focus on HYDRA’s nonlinear solver, which is summarised in Listing 1. A breakdown of HYDRA runtime by loop is given in Table 1, which shows that the two flux routines (vflux and iflux) account for almost half (46.5%) of the runtime on a single Xeon Broadwell node. Thus we direct the development of the mini-application towards the goal of understanding and improving the performance of these routines. The two routines are computationally very similar, performing an integration of cell volume surface fluxes, but the iflux kernel is much smaller and so easier to detach from HYDRA. Mini-application development time can therefore be reduced by initially targeting the iflux routine.
3.2 Multigrid
HYDRA employs multigrid methods, which are designed to increase the rate of convergence for iterative solvers and possess a useful computational property: the amount of computational work required is linear in the number of unknowns [26]. Multigrid applications operate on a hierarchy of grid levels; in this paper, we are concerned with geometric multigrid, wherein each grid level has its own explicit mesh geometry, and the coarse levels of the hierarchy are derived from the geometry of the finest level.
Starting at the finest level, multigrid applications use an iterative smoothing subroutine to reduce high frequency errors. Low frequency errors are then transferred to the next coarser level (restriction), where they appear as high frequency errors and can thus be more rapidly smoothed by the same subroutine. Error corrections from the smoothing of coarse levels are then transferred back to finer levels (prolongation). The order in which prolongations and restrictions are applied is known as a cycle, of which this paper considers a single type – the so-called V-cycle.
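The V-cycle can be sketched as a schedule of level visits. The following is a minimal illustration rather than HYDRA's implementation: the function names `v_cycle_levels` and `transfers` are hypothetical, and level 0 is assumed to be the finest grid, as in Listing 1.

```python
def v_cycle_levels(n_levels):
    """Return the order in which grid levels are visited in one V-cycle.

    Level 0 is assumed to be the finest grid. The finest level is visited
    once per cycle, so the returning sweep stops at level 1.
    """
    down = list(range(n_levels))            # 0, 1, ..., coarsest
    up = list(range(n_levels - 2, 0, -1))   # coarsest - 1, ..., 1
    return down + up


def transfers(levels):
    """Label the transfer applied after smoothing each visited level.

    Moving to a coarser level is a restriction; moving to a finer level
    is a prolongation. The schedule is treated as cyclic, so the last
    level transfers back to the first.
    """
    ops = []
    for k, level in enumerate(levels):
        nxt = levels[(k + 1) % len(levels)]
        ops.append("restrict" if nxt > level else "prolong")
    return ops


levels = v_cycle_levels(4)
ops = transfers(levels)
print(levels)
print(ops)
```

With four levels this reproduces the level sequence used in Listing 1, with three restrictions followed by three prolongations.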
There are several performance implications of using a geometric multigrid solver. First, there is the increased memory requirement of explicitly representing the geometries of all levels of the multigrid. Second, there are the additional irregular memory accesses from prolonging and restricting corrections between levels of the multigrid. Third, the coarsened meshes have reduced spatial locality.

FIGURE 1 Representation of a finite-volume decomposition mapped to an unstructured grid over two multigrid levels (L0 and L1). Labelled elements: node, edge, cell volume, multigrid edge and level boundary.

3.3 Unstructured Grid
HYDRA represents its aerospace models using an unstructured grid. With reference to Figure 1, an unstructured grid is a collection of nodes and edges, with the nodes being at arbitrary positions in space. For a cell-centered finite-volume decomposition, each node represents a cell and each edge represents a surface between two adjacent cells. Since HYDRA operates on multigrid datasets, there are also edges between related nodes of adjacent grid levels. The flexibility of the unstructured grid allows complex geometries to be represented and regions of interest to be denoted by increasing the density of nodes (cells) in these areas.
Unlike in a structured mesh code, where neighbours can be determined using offsets to the array indices, the neighbours of a node in an unstructured grid are not implicitly defined. This means that an explicit list of neighbours must be maintained so that when computation over nodes is performed (e.g. the accumulation of fluxes) data can be read from the required locations. This has implications for the memory access pattern, as there is no guarantee that a node’s neighbours directly and regularly succeed it in memory.
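To make this concrete, the sketch below stores a small unstructured grid as an explicit edge list and accumulates a per-node quantity by looping over edges. The mesh, the `flux` placeholder and all variable names are invented for illustration; the point is only that every update reads and writes node data through indices that need not be contiguous in memory.

```python
# A tiny unstructured grid: 5 nodes, edges stored as (node_a, node_b) pairs.
edges = [(0, 3), (3, 1), (1, 4), (4, 2), (2, 0), (0, 4)]
values = [1.0, 2.0, 4.0, 8.0, 16.0]   # one scalar property per node


def flux(a, b):
    """Placeholder flux between two node values (illustrative only)."""
    return 0.5 * (b - a)


def accumulate(edges, values):
    """Accumulate edge fluxes into both endpoint nodes of every edge."""
    residual = [0.0] * len(values)
    for a, b in edges:
        f = flux(values[a], values[b])   # indirect reads via the edge list
        residual[a] += f                 # indirect writes to both endpoints
        residual[b] -= f
    return residual


residual = accumulate(edges, values)
print(residual)
```

Because each edge adds a flux to one endpoint and subtracts it from the other, the accumulated values sum to zero, which is a convenient sanity check for this kind of loop.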
4 MINI-APPLICATION DEVELOPMENT
Although the benefits of mini-applications are increasingly documented (see Section 2), their development is not a well-defined process as it depends largely on their intended purpose [19]. This makes their development challenging as the purpose may differ on a project-by-project basis, limiting the reuse of technique and effort. However, considerations and guidelines are aggregated by Messer et al. and summarised here as a set of questions for reference [19].
1. Where does the application spend most of its execution time?
2. What performance characteristics will the mini-application capture?
3. Can any part of the development process be automated?
4. How can the build system be made as simple as possible?
The aim of these questions is to focus attention on (i) which aspects of the target code the mini-application should include, and (ii) the components of the supporting configuration (e.g. tools and datasets). We apply these guidelines to the development of MG-CFD and, because the development of each mini-application is essentially unique, we consider it a valuable exercise to document the findings of this approach. Additionally, we add our own considerations to this list, which come from our own experiences of developing this and other mini-apps.
We address the first question in Section 3 – the most time-consuming regions of code are contained within the two flux routines (vflux and iflux), and it is these routines we therefore focus on capturing within our mini-application. These kernels have the same computational structure – a single loop over edges, accumulating fluxes. In this work we focus on iflux as it captures the memory access behaviour of the unstructured grid, while containing less code than vflux.
The second question is addressed by considering the purpose of our mini-app – to evaluate the potential impact of code optimisations, new parallel programming models and new hardware features on the computational performance of HYDRA. This use case suggests constructing a mini-application which ignores I/O and inter-node communication costs and focuses only on computation, encouraging us to focus on more specific regions of the code.
Next, we propose our own consideration: which aspects of the target (e.g. unstructured grid, finite volume, multigrid) contribute to the compute behaviour within the most expensive regions of the code? This decomposition by simulation feature provides us with a route for including performance characteristics within the mini-application. Drawing upon others’ experiences with HYDRA along with our own, we know that it is the irregular memory accesses which contribute greatly to the difficulty of running on different compute architectures. These irregular memory accesses come from two main sources: the edge updates over the unstructured grid, and the restriction and prolongation of corrections between the multigrid levels (see Section 3.2).
4.1 Implementation

Hardware             | Broadwell              | Skylake                | Knights Landing
Model                | Intel Xeon E5-2660 v4  | Intel Xeon Silver 4116 | Intel Xeon Phi 7210
All-core turbo (GHz) | 2.4                    | 2.4                    | 1.3
Cores                | 14×2                   | 12×2                   | 64
Host ISA             | AVX-2                  | AVX-512                | AVX-512
Memory (GB)          | 128                    | 96                     | 16 HBM + 96 DDR

Software
Operating System     | Debian 8, Linux 4.9.0
Compiler             | Intel 19.0.2

TABLE 2 Hardware/software configurations

With these features in mind, we base our mini-app on an existing code as (i) it is open source, so it will not be restricted in terms of where it can be run; and (ii) it shares simulation features with HYDRA [17]. We base our code on the CFD application by Corrigan et al., now included in the Rodinia benchmark suite [27]. We extend this code with the addition of multigrid, hence we name our mini-app MG-CFD. The existing code, written in C++, implements a three-dimensional finite-volume discretisation of the Euler equations for inviscid, compressible flow over an unstructured grid. It performs a sweep over edges to accumulate fluxes, implemented as a loop over cell volumes with an inner loop over the edges between each cell and its neighbours. iflux also performs a sweep over edges, but implements this as a single loop over all edges in the grid. Although these two loop schemes implement the same numerical method, the different loop structures can lead to different performance characteristics, particularly regarding parallelisation. To increase similarity to iflux, we replace the existing nested loop with a single loop over all edges.
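The equivalence of the two loop schemes can be illustrated with a small sketch. This is not the MG-CFD or Rodinia source; the mesh, the `flux` placeholder and the function names are assumptions made for illustration. Both functions compute the same per-cell sums. The nested form visits each edge twice (once from each cell) and writes only to the cell being processed, whereas the single edge loop visits each edge once and writes to both endpoints, which can create write conflicts when edges are processed in parallel.

```python
from collections import defaultdict

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
values = [1.0, 2.0, 4.0, 8.0]


def flux(a, b):
    """Placeholder antisymmetric flux (illustrative only)."""
    return 0.5 * (b - a)


def nested_cell_loop(edges, values):
    """Outer loop over cells, inner loop over each cell's neighbours."""
    neighbours = defaultdict(list)
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    out = [0.0] * len(values)
    for cell in range(len(values)):
        for nb in neighbours[cell]:
            out[cell] += flux(values[cell], values[nb])
    return out


def single_edge_loop(edges, values):
    """One loop over all edges, updating both endpoints of each edge."""
    out = [0.0] * len(values)
    for a, b in edges:
        f = flux(values[a], values[b])
        out[a] += f
        out[b] -= f
    return out


nested = nested_cell_loop(edges, values)
single = single_edge_loop(edges, values)
print(nested == single)
```

The inputs are chosen so that all intermediate sums are exactly representable; with general floating-point data the two schemes may differ in the last bits because they add terms in a different order.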
The resulting kernel differs from iflux only in the exact arithmetic operations performed; it is not possible for MG-CFD to perform the same arithmetic as iflux as this would mean subjecting MG-CFD to the same commercial portability restrictions as HYDRA itself. We further extend our mini-app with additional simulation features present in HYDRA. It should be noted that we do not focus on verifying the correctness of the simulation against a standard problem in this paper, as we are primarily interested in performance characteristics, which we validate in Section 5.
Support for the computational behaviours of multigrid was implemented by augmenting the construction of the Euler solver presented by Corrigan et al. with crude operators to transfer the state of the simulation between the levels of the multigrid. These operators are defined by Equations 1 and 2, which serve as restriction (fine to coarse grid) and prolongation (coarse to fine grid) operators respectively [28]. Here $u_j^l$ represents simulation property $u$ of node $j$ at level $l$, and $N_j^l$ is the set of node indices at level $l-1$ of the grid which are linked to node $j$ at level $l$.

$$u_j^l = \frac{\sum_{i \in N_j^l} u_i^{l-1}}{|N_j^l|} \qquad (1)$$

$$u_{i \in N_j^l}^{l-1} = u_j^l \qquad (2)$$

The restriction operator (Equation 1) primes the simulation properties with an average across nodes from the finer grid level – this mapping between levels is defined as part of the input deck. The prolongation operator (Equation 2) reverses restriction by injecting the values from the coarse grid to the fine grid as dictated by the mapping.
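Equations 1 and 2 can be sketched in a few lines. The mapping format used here (a list giving, for each coarse node, the fine-node indices linked to it) is an assumption for illustration; the actual input deck format is not described in this paper, and the function names are hypothetical.

```python
def restrict(fine_values, mapping):
    """Equation 1: each coarse node takes the mean of its linked fine nodes."""
    return [sum(fine_values[i] for i in linked) / len(linked)
            for linked in mapping]


def prolong(coarse_values, mapping, n_fine):
    """Equation 2: inject each coarse value into all of its linked fine nodes."""
    fine = [0.0] * n_fine
    for j, linked in enumerate(mapping):
        for i in linked:
            fine[i] = coarse_values[j]
    return fine


# Six fine nodes grouped under three coarse nodes (hypothetical mapping).
mapping = [[0, 1], [2, 3, 4], [5]]
fine = [1.0, 3.0, 2.0, 4.0, 6.0, 5.0]

coarse = restrict(fine, mapping)
back = prolong(coarse, mapping, len(fine))
print(coarse)
print(back)
```

In this sketch a restrict-then-prolong round trip does not recover the original fine values: each fine value is replaced by the mean of its group, which is consistent with the operators being described as crude.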
The final code change made is to allow for an arbitrary number of neighbours rather than the fixed four in the flux summation. The summation is already weighted by the surface area of the interface between nodes in the mesh, so no correction to the underlying mathematics is necessary to support this change.
4.1.1 Supporting Tools
Part of what makes a mini-application a useful tool is its simplicity. This, however, is not restricted to the application itself and must also apply to the processes surrounding the mini-application and target application (e.g. building, job submission).
We opt to simplify the building process by removing all reliance on third-party libraries such as the Hierarchical Data Format 5 (HDF5) library and the communications library. These can both be safely removed as the purpose of this mini-application is not to investigate I/O performance, inter-node communication performance or the overheads introduced by library abstractions. Removing these dependencies allows the application to be built swiftly with minor adjustment of the compiler and its flags in the Makefile. Another challenge to consistent benchmarking is the need to create job submission scripts, so we include examples of these for several common schedulers: SLURM, LSF and Moab.
Utilities have been included to validate the final state of the simulation after changes to the configuration of the code (e.g. compiler flags, code optimisations, porting to accelerators). Additionally, we include tools to extract the geometries from the datasets used to prime HYDRA and transform them into a form which is understood by the mini-application. We do this to reduce the number of factors which could cause differences in runtime behaviour between HYDRA and its mini-application.
5 VALIDATION
In this section we present a validation of our mini-application against the target code’s behaviour on a dual-socket 24-core Xeon Skylake node. Full hardware details are provided in Table 2.

FIGURE 2 Visualisation of a rotor section from NASA’s SSME 2-stage fuel turbine. Blade geometry is similar to the mesh we use.

FIGURE 3 Parallel efficiency of MG-CFD and iflux on Xeon Skylake with AVX-512 auto-vectorisation (x-axis: threads, 1 to 24; y-axis: parallel efficiency, 0 to 1).

The unstructured grid used for validation is derived from the geometry of Whittle Laboratory’s low pressure axial flow turbine rotor cascade, a mesh of 105 K nodes and 305 K edges representing a single rotor root section (blade and hub connection) [29]. To aid visualisation, a rotor section of NASA’s SSME 2-stage fuel turbine is shown in Figure 2, consisting of multiple root sections with similar structure to the mesh we use [30]. The mesh is duplicated in memory by a factor of 120, producing a set of 120 disconnected meshes. Each kernel then processes each of the 120 meshes in turn. This ensures that the workload does not fit in the cache, and enables multi-threaded execution at particular process counts such that no two threads work on the same mesh. The nodes have been renumbered by the Cuthill-McKee (CM) algorithm, and then the edges reordered by the new values of their endpoints. This ensures consistency with other performance analyses of unstructured mesh compute.
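The renumbering step can be sketched as follows. This is a textbook Cuthill-McKee pass written for illustration, not the tool used to prepare the validation mesh; starting each component from a minimum-degree node and breaking ties by node index are assumptions of this sketch, as are the function names.

```python
from collections import deque


def cuthill_mckee(n_nodes, edges):
    """Return new_id[old_id] from a breadth-first Cuthill-McKee ordering."""
    adj = [[] for _ in range(n_nodes)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    degree = [len(nbrs) for nbrs in adj]

    order, seen = [], [False] * n_nodes
    # Visit every connected component, starting each from a low-degree node.
    for start in sorted(range(n_nodes), key=lambda v: (degree[v], v)):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Enqueue unvisited neighbours in order of increasing degree.
            for w in sorted(adj[v], key=lambda u: (degree[u], u)):
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)

    new_id = [0] * n_nodes
    for new, old in enumerate(order):
        new_id[old] = new
    return new_id


def renumber_and_sort_edges(edges, new_id):
    """Relabel edge endpoints, then sort edges by their new endpoint values.

    Edge orientation is normalised (low index first) for simplicity; a real
    solver would need to keep orientation consistent with its edge weights.
    """
    relabelled = [tuple(sorted((new_id[a], new_id[b]))) for a, b in edges]
    return sorted(relabelled)


def bandwidth(edges):
    """Largest index distance between the two endpoints of any edge."""
    return max(abs(a - b) for a, b in edges)


# A six-node path graph whose nodes were numbered in a scattered order.
edges = [(3, 0), (0, 5), (5, 1), (1, 4), (4, 2)]
new_id = cuthill_mckee(6, edges)
new_edges = renumber_and_sort_edges(edges, new_id)
print(bandwidth(edges), bandwidth(new_edges))
```

On this small path graph the renumbering places connected nodes next to each other, so consecutive edges in the sorted list touch neighbouring entries of the node arrays.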
In this paper, MG-CFD is validated using two existing methods. First we compare the OpenMP parallel efficiency of both iflux and MG-CFD for each level of the multigrid (MG) [19]. Figure 3 presents the scaling performance of both codes on the finest MG mesh, showing that both codes suffer similar parallel efficiency loss up to 12 threads, after which iflux suffers loss at a greater rate than MG-CFD (scaling of the other MG levels is similar so is not shown). This is a result of iflux having a significantly lower arithmetic intensity than MG-CFD, becoming memory-bound at a lower thread count. This emphasises the importance of considering and accounting for differences between a mini-application and its target code when interpreting performance data. Similarity between scaling behaviour does not imply that the underlying causes of the observed behaviour are the same, and so we strengthen the comparison using a second approach. This involves comparing the correlation of parallel efficiency loss to performance counters for both the mini-application and the target code [8].
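The second approach can be sketched as follows. The runtimes and counter values are made-up numbers for illustration, not measurements from this study, and the helper names are hypothetical. The sketch assumes the usual strong-scaling definition of parallel efficiency, t_1 / (T · t_T), takes parallel inefficiency as one minus this, and assumes a Pearson correlation coefficient; the exact statistic and any counter normalisation are not specified here.

```python
import math


def parallel_efficiency(threads, runtimes):
    """Strong-scaling efficiency relative to the first (baseline) entry."""
    t0, r0 = threads[0], runtimes[0]
    return [(r0 * t0) / (r * t) for t, r in zip(threads, runtimes)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measurements (not data from the paper).
threads = [1, 2, 4, 8]
runtimes = [8.0, 4.0, 2.5, 2.0]              # seconds
store_misses = [100.0, 100.0, 160.0, 300.0]  # an example counter

efficiency = parallel_efficiency(threads, runtimes)
inefficiency = [1.0 - e for e in efficiency]
r = pearson(inefficiency, store_misses)
print(efficiency, r)
```

Repeating this for every counter, and for both codes, gives one correlation per counter per code; the comparison then looks at how closely the two sets of correlations agree.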
It should be noted that we compare MG-CFD against a direct Fortran-to-C port of iflux, rather than the original Fortran implementation. We do this to ease the comparison process and to remove the effects of language from it; arguably this moves us further away from the true performance characteristics of the target code, but it still allows the examination of language-independent features, such as memory access patterns and arithmetic intensity.
The PAPI library is used to collect performance counter data; it provides easy access to available performance counters and additionally defines a set of 108 “preset” counters that include performance counters typically found in many processors [31]. Figure 4 shows the correlation between each PAPI preset performance counter and parallel inefficiency. To account for variance of performance counters between runs, the mean of three measurements is used. For most of these events the difference in correlation between MG-CFD and iflux is less than 0.1, indicating that both codes share many performance characteristics, but there are several differences in correlations which we address here. The correlations for the events PAPI_SR_INS, PAPI_LST_INS and PAPI_LD_INS differ by 0.2, with the correlation being stronger for iflux. These events count store and load micro-ops, so iflux being more sensitive to these is in agreement with it having the lower arithmetic intensity. A similar difference between correlations can be seen in the branching-related events (PAPI_BR_*), but neither code performs branching operations within the loop body so these are considered to be false positives. The only large difference is with events relating to L1 cache misses (PAPI_L1_DCM, PAPI_L1_TCM, PAPI_L2_DCW and PAPI_L2_TCW), for which a strong correlation is only present with MG-CFD. This is likely a consequence of the register spilling that occurs with MG-CFD but not iflux, effectively reserving some of the L1 cache for register values, which leaves less for reuse of mesh data.
Where the correlation between a performance counter and parallel efficiency loss is greater than 0.8, this indicates that the corresponding hardware activity that triggers the counter has a strong influence on scaling performance. The three events PAPI_L1_STM, PAPI_L2_STM and PAPI_CA_CLN measure Read For Ownership (RFO) events for cache levels 1, 2 and 3 respectively, for which the correlations are strong. In the context of unstructured compute, this is an indication that the memory hierarchy is less able to adequately prefetch the destination arrays in advance of the indirect writes at higher thread counts. This in turn is an indication of contention in the memory hierarchy that is present with both codes. Other notable events are PAPI_SR_INS, PAPI_LST_INS and PAPI_LD_INS, which count store and load micro-operations and so are another indication of pressure on the memory hierarchy.

FIGURE 4 Comparison between MG-CFD and iflux of their correlation between PAPI preset performance counters and parallel inefficiency. Panels (a) and (b) each plot PAPI event name against correlation with parallel inefficiency, with series for iflux, MG-CFD and their difference.

6 PERFORMANCE PREDICTION MODEL
The intended use case for MG-CFD is to assess the impact of new architectures and optimisations on HYDRA without necessarily executing HYDRA on these. As HYDRA is a sensitive code subject to commercial distribution restrictions, MG-CFD can assist in benchmarking, hardware evaluation and procurement decisions. A focus on individual HYDRA kernels is maintained, as accurate kernel performance predictions can be passed into the HYDRA performance model, which then predicts total walltime based on the predicted execution time of each kernel [3].
6.1 Model development
Accurate assessment of hardware and optimisations requires a model of the performance difference between MG-CFD and iflux. This model considers performance in terms of clock cycle consumption, termed $C_{\text{mini}}$ for MG-CFD and $C_{\text{iflux}}$ for iflux. To assist in the prediction of scaling performance, the model focuses on predicting the cycle consumption of a single loop iteration of iflux, termed $C_{l,\text{iflux}}$, from the empirical measurement of the equivalent quantity for MG-CFD, $C_{l,\text{mini}}$. Assuming that $C_{l,\text{iflux}}$ has been estimated, the runtime prediction of a single call to iflux at a thread count $T$ is formulated as:
$$\text{runtime} = \frac{\max_{t=1}^{T}\left[C_{l,\text{iflux}} \cdot \text{iters}_t\right]}{\text{Hz}} \tag{3}$$
The number of loop iterations performed by each thread, $\text{iters}_t$, is considered independent of hardware and is therefore based on prior knowledge. Processor frequency $\text{Hz}$ is calculated from MG-CFD’s measurements of cycle consumption and runtime.
A simple approach for predicting $C_{l,\text{iflux}}$ is to assume that the ratio of $C_{l,\text{iflux}}$ to $C_{l,\text{mini}}$, defined as $R_c$, is constant across architectures and ISAs (ignoring bounds imposed by memory performance). Then $C_{l,\text{iflux}}$ is predicted as:
$$C_{l,\text{iflux}} = R_c \, C_{l,\text{mini}} \tag{4}$$
It will be shown that the assumption of constant $R_c$ does not always hold true. Thus a working model must focus on the change in instruction content, and how this causes the observed change in cycle consumption. An additional approach is to assume that cycle consumption is directly proportional to the number of instructions executed, the latter termed $I_{\text{iflux}}$ for iflux and $I_{\text{mini}}$ for MG-CFD. This is equivalent to assuming that iflux executes at the same overall instructions-per-cycle (IPC) rate as MG-CFD. This provides the following formulation for $C_{l,\text{iflux}}$:
$$C_{l,\text{iflux}} = C_{l,\text{mini}} \, \frac{I_{\text{iflux}}}{I_{\text{mini}}} \tag{5}$$
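Equations 3 to 5 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names and all numeric inputs are hypothetical, chosen only to show how the quantities combine.

```python
def cycles_constant_ratio(c_l_mini, r_c):
    """Equation 4: scale MG-CFD cycles per iteration by a fixed ratio R_c."""
    return r_c * c_l_mini


def cycles_equal_ipc(c_l_mini, i_iflux, i_mini):
    """Equation 5: assume both codes run at the same overall IPC."""
    return c_l_mini * i_iflux / i_mini


def runtime_seconds(c_l_iflux, iters_per_thread, hz):
    """Equation 3: the thread with the most iterations sets the runtime."""
    return max(c_l_iflux * iters for iters in iters_per_thread) / hz


# Hypothetical inputs, for illustration only.
c_l_mini = 50.0                        # measured MG-CFD cycles per loop iteration
iters_per_thread = [1000, 1200, 1100]  # loop iterations assigned to each thread
hz = 2.0e9                             # clock frequency derived from MG-CFD

c_ratio = cycles_constant_ratio(c_l_mini, r_c=1.2)
c_ipc = cycles_equal_ipc(c_l_mini, i_iflux=300, i_mini=250)
t_ratio = runtime_seconds(c_ratio, iters_per_thread, hz)
```

With these made-up inputs both predictors give 60 cycles per iteration; on real data they generally differ, which is the motivation for comparing them.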
[Figure 5: block diagram of the scheduler feeding five ports. Port 0: Int ALU, Vec ALU, Vec Add, Vec Mul, Vec FMA, Divide. Port 1: Int ALU, Vec ALU, Vec Add, Vec Mul, Vec FMA. Port 5: Int ALU, Vec ALU, Vec Shuf. Port 6: Int ALU. Port 4: STD, connected to L1D.]
FIGURE 5 Instruction scheduling in Xeon Skylake.
6.1.1 Superscalar extension
Analysis of the compiler-generated assembly files of MG-CFD and iflux identifies that they differ in the proportion of particular categories of instructions. We define four categories by throughput and type: low-throughput floating-point (division and square root), high-throughput floating-point, integer, and memory data stores. Different proportions of high and low throughput floating-point operations are likely to result in the two codes having different overall IPC rates. Thus it is sensible to assume that iflux and MG-CFD execute at different IPC rates. Our instruction categorisation is applied to MG-CFD to produce $I_{\text{mini}}$, a vector of length 4 where $I_{i,\text{mini}}$ is the number of instructions in the $i$th category. It is also applied to iflux to produce $I_{\text{iflux}}$. Then the difference in instruction content between the two codes, termed the vector $\Delta I$, is the element-wise difference of $I_{\text{mini}}$ and $I_{\text{iflux}}$.
Predicting the change in cycle consumption that results from $\Delta I$ requires a model of superscalar execution that reflects the complexity of modern architectures. A simple model is initially adopted in which each instruction category is scheduled to a single dedicated execution port, and each category is executed in parallel with, and independently of, other categories. The change in instruction content is assumed to be large enough that, when added to a kernel, the compiler is able to optimise the placement of individual instructions to maintain instruction-level parallelism (ILP), and so we ignore inter-instruction dependencies. Throughput information is encoded in the vector $c$ of length 4, where $c_i$ is the cycles-per-instruction (CPI) estimate for category $i$ (later we detail how this estimate is calculated). The predicted change in total cycle consumption between MG-CFD and iflux is the maximum of the cycle consumption of each category, formulated as:
$$\Delta C = \max_{i=1}^{4}\left[(\Delta I_i)\, c_i\right] \tag{6}$$
Then $C_{l,\text{iflux}}$ is given by:
$$C_{l,\text{iflux}} = \frac{C_{\text{mini}} - \Delta C}{\text{iters}} \tag{7}$$
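The superscalar extension reduces to a maximum over per-category cycle costs. The sketch below illustrates this under stated assumptions: the function names and numbers are hypothetical, the instruction counts are taken as totals over one kernel call, and the difference in instruction content is assumed to be taken as MG-CFD minus iflux so that the cycle change is subtracted from the MG-CFD measurement.

```python
def delta_cycles(delta_i, cpi):
    """Equation 6: each category has its own port, so the slowest one dominates."""
    return max(n * c for n, c in zip(delta_i, cpi))


def cycles_per_iteration_iflux(c_mini, delta_i, cpi, iters):
    """Equation 7, assuming delta_i = I_mini - I_iflux (an assumption)."""
    return (c_mini - delta_cycles(delta_i, cpi)) / iters


# Hypothetical values. Category order: low-throughput FP,
# high-throughput FP, integer, memory data stores.
delta_i = [2.0e6, 10.0e6, 4.0e6, 1.0e6]  # instruction-count differences
cpi = [4.0, 1.0, 1.0, 1.0]               # cycles per instruction per category

d_c = delta_cycles(delta_i, cpi)
c_l_iflux = cycles_per_iteration_iflux(60.0e6, delta_i, cpi, iters=1.0e6)
```

Here the high-throughput floating-point category dominates (10 million cycles) even though division has the highest CPI, because far more of those instructions differ between the two kernels.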
[Figure 6: ports ordered P0, P1, P6, P5, with instruction categories placed in scheduling order: 1. FP Div, 2. Vec Shuf, 3. FP Add, Mul, FMA, 4. Vec ALU, 5. ALU.]
FIGURE 6 Ideal model of Xeon Skylake instruction scheduling.
6.2 Contention extension
We extend the model further by considering hardware contention between different instruction categories. Figure 5 shows the portion of the Skylake microarchitecture pipeline related to instruction scheduling. It shows five ports: four receive integer or floating-point instructions, and the fifth receives memory data store instructions. Technically these ports receive micro-ops, but for simplicity we refer to these as instructions. There are additional ports excluded from the diagram as they are not relevant to the performance difference between MG-CFD and iflux. On each clock cycle the scheduler can assign at most one instruction to each port, then on the next clock cycle these move on to the appropriate execution units. Ports 0 and 1 can receive both integer and floating-point instructions, invalidating the prior assumption that each instruction category is scheduled to a dedicated port. Thus there is the possibility of contention between different instruction categories. Accordingly, we extend the model to capture this contention, whilst retaining a high degree of flexibility.
The extended model makes two assumptions. The first is that while an execution unit is occupied then so is its resident port, blocking all other execution units on it. The second is that of an ideal instruction scheduler that can schedule all instructions in bulk, scheduling first those instructions with the fewest compatible ports and minimising the maximum clock cycle consumption across the ports. This process is visualised in Figure 6. The previous ‘integer’ instruction category is separated into three (ALU, Vec ALU, and Vec Shuf) to ensure that member instructions are scheduled to the same ports as well as having similar throughput.
As with the previous model, this model seeks to predict the change in performance that results from the change in instructions from MG-CFD to match iflux. The first stage is to predict how those modified instructions were scheduled to ports, using the previously stated assumptions. This produces an allocation matrix $A$ with the following structure:
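$$
A =
\begin{array}{cccccc|l}
\text{DIV} & \text{Vec Shuf} & \text{STD} & \text{FP} & \text{Vec ALU} & \text{ALU} & \\
\hline
d & 0 & 0 & f_2 & a_3 & i_4 & \text{P0} \\
0 & 0 & 0 & f_1 & a_2 & i_3 & \text{P1} \\
0 & 0 & 0 & 0 & a_1 & i_2 & \text{P5} \\
0 & v & 0 & 0 & 0 & i_1 & \text{P6} \\
0 & 0 & s & 0 & 0 & 0 & \text{P4}
\end{array}
$$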
Our ideal instruction scheduler fills this matrix from left to right, prioritising those instruction categories with the fewest available ports, seeking to equalise cycle consumption across the ports. Port cycle consumption is calculated by performing a matrix-vector multiply between $A$ and the CPI vector $c$ (now extended to accommodate the two extra integer instruction categories). This produces the vector $p$, where $p_i$ is the clock cycle consumption of port $i$:
$$p = A c \tag{8}$$
Then the overall change in clock cycle consumption is taken as the maximum port cycle consumption, where $N_p$ is the number of ports:
$$\Delta C = \max_{i=1}^{N_p} p_i \tag{9}$$
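The scheduling heuristic and Equations 8 and 9 can be illustrated with a greedy "water-filling" allocation. This is a sketch, not the authors' scheduler: the port compatibility table is an assumption read from Figure 5, the helper names are hypothetical, fractional instruction counts are allowed for simplicity, and the greedy order is not guaranteed to find the true minimum in every case.

```python
PORTS = ["P0", "P1", "P5", "P6", "P4"]
# Category -> compatible ports (assumed from Figure 5), listed with the
# fewest compatible ports first, which is the order they are scheduled in.
CATEGORIES = [
    ("DIV", ["P0"]),
    ("VecShuf", ["P5"]),
    ("STD", ["P4"]),
    ("FP", ["P0", "P1"]),
    ("VecALU", ["P0", "P1", "P5"]),
    ("ALU", ["P0", "P1", "P5", "P6"]),
]


def water_fill(loads, work):
    """Share `work` cycles among ports with existing `loads`, topping up the
    least-loaded ports first so the highest resulting load is minimised."""
    order = sorted(range(len(loads)), key=lambda k: loads[k])
    level, used, remaining = loads[order[0]], 1, work
    while used < len(order):
        step = (loads[order[used]] - level) * used
        if step >= remaining:
            break
        remaining -= step
        level = loads[order[used]]
        used += 1
    level += remaining / used
    return [max(level - load, 0.0) for load in loads]


def schedule(delta_i, cpi):
    """Build the allocation matrix A (rows: ports, columns: categories)."""
    a = [[0.0] * len(CATEGORIES) for _ in PORTS]
    load = {port: 0.0 for port in PORTS}
    for j, (cat, ports) in enumerate(CATEGORIES):
        added = water_fill([load[q] for q in ports], delta_i[cat] * cpi[cat])
        for port, cycles in zip(ports, added):
            load[port] += cycles
            a[PORTS.index(port)][j] = cycles / cpi[cat]  # cycles -> instructions
    return a


def port_cycles(a, cpi):
    """Equation 8: p = A c."""
    c = [cpi[cat] for cat, _ in CATEGORIES]
    return [sum(x * y for x, y in zip(row, c)) for row in a]


# Hypothetical instruction counts and CPI values, for illustration only.
delta_i = {"DIV": 2, "VecShuf": 4, "STD": 3, "FP": 20, "VecALU": 6, "ALU": 10}
cpi = {"DIV": 4.0, "VecShuf": 1.0, "STD": 1.0, "FP": 1.0, "VecALU": 1.0, "ALU": 1.0}

A = schedule(delta_i, cpi)
p = port_cycles(A, cpi)
delta_c = max(p)  # Equation 9
```

In this example the two divisions occupy P0 for 8 cycles, so most of the floating-point work is pushed to P1 until both ports level out at 14 cycles, which becomes the predicted cycle change.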
6.3 CPI estimation
The CPI values are treated as unknowns, as may be the case when evaluating novel architectures. To estimate these, we use eleven distinct variants of MG-CFD’s inviscid routine that result from the combinatorial enabling of four arithmetic optimisations missed by the Intel compiler. These variants contain different quantities of division, square-root, all other floating-point operations, integer operations, and register spills, while producing the same numerical result. The difference between the most and least expensive variant is equivalent to a difference of 43 AVX2 instructions per loop iteration. To extend this range we create an additional kernel, derived from the inviscid routine but with ≈ 50% of arithmetic instructions removed. The resulting kernel does not correctly implement the inviscid flux accumulation, so another correct variant must be executed to maintain solver convergence; however, it enables the range to be extended to 122 instructions. We find that this greater range produces better CPI estimates that greatly reduce the prediction error of $C_{l,\text{iflux}}$. To estimate CPI values we apply basin-hopping optimisation, constrained to minimum bounds of 1.0, fitting Equations 9 and 6 to the performance data of these MG-CFD variants.
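A minimal sketch of such a fit is given below, using SciPy's `basinhopping` with a bounded L-BFGS-B local minimiser. It is an illustration under several assumptions: only the simpler Equation 6 is fitted, the loss is a sum of squared errors, and the instruction counts and measured cycle changes are invented rather than taken from MG-CFD. Because basin-hopping takes random steps, the fitted values vary between runs.

```python
import numpy as np
from scipy.optimize import basinhopping

# Hypothetical data: one row per kernel variant, one column per category
# (low-throughput FP, high-throughput FP, integer, memory data stores).
delta_i = np.array([
    [4.0, 10.0, 6.0, 2.0],
    [0.0, 40.0, 8.0, 2.0],
    [2.0, 5.0, 30.0, 3.0],
    [1.0, 12.0, 10.0, 20.0],
    [6.0, 20.0, 12.0, 4.0],
])
# Hypothetical measured change in cycles per loop iteration for each variant.
measured_dc = np.array([18.0, 41.0, 33.0, 21.0, 26.0])


def predict_dc(cpi):
    """Equation 6 applied to every variant: the slowest category dominates."""
    return np.max(delta_i * cpi, axis=1)


def loss(cpi):
    """Sum of squared differences between predicted and measured cycles."""
    return float(np.sum((predict_dc(cpi) - measured_dc) ** 2))


x0 = np.ones(4)              # start every CPI at the lower bound
bounds = [(1.0, None)] * 4   # CPI constrained to be at least 1.0
result = basinhopping(
    loss,
    x0,
    niter=25,
    minimizer_kwargs={"method": "L-BFGS-B", "bounds": bounds},
)
cpi_fit = result.x
```

A category that never determines the maximum in any variant has no influence on the loss, so its fitted CPI is not constrained by the data; widening the range of variants, as described above, is one way to reduce this.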
6.4 Models comparison
We evaluate each model by its ability to accurately predict single-threaded cycle consumption of iflux. We assess across Xeon Skylake, Broadwell and KNL, and across the ISAs AVX-512, AVX2 and SSE4.2. $R_c$ is measured for Broadwell AVX2 and then assumed to be constant for all other combinations of systems and ISAs.
Figure 7 presents the prediction error of each model, with accompanying statistics listed in Table 3. The constant $R_c$ model has low error on most system configurations, with a mean error of 15.6%, but there are two significant exceptions. After enabling AVX-512 vectorisation on Skylake the prediction error increases from 5.2% to 55.8%, caused by the addition of a large quantity of instructions to both kernels that skews $R_c$ towards 1.0. Prediction error for AVX2 on Knights Landing (KNL) is 49.4%, while for the same instruction set architecture (ISA) on Skylake and Broadwell it is near zero. Thus although this model often generates predictions with error below 16%, occasionally the error is significantly greater, which undermines model reliability. The equal IPC model has a high mean error of 31.0% and the highest worst-case error of 74.7%, clear evidence that MG-CFD and iflux can, and often do, execute at different overall IPC rates. The superscalar model has a similarly high error, demonstrating that the assumption that each instruction category executes entirely in parallel is incorrect. The final model that incorporates both superscalar execution and instruction scheduling generates predictions with the lowest mean error of 10.2% with SD 6.2%, and the lowest worst-case error of 17.8%. Overall, our technique of port scheduling delivers reliable results across architecture types.
[Figure 7: bar charts of prediction error (0% to 100%) per ISA for (a) Xeon Broadwell (SSE42, AVX2), (b) Xeon Skylake (SSE42, AVX2, AVX512x1, AVX512x8) and (c) Xeon KNL (SSE42, AVX2, AVX512x1, AVX512x8); series: Constant Rc, Equal IPC, + superscalar, + port scheduling.]
FIGURE 7 Prediction error of described models of iflux cycle consumption.
Model            mean (%)   SD (%)   worst (%)
Constant Rc      15.6       20.4     55.8
Equal IPC        31.0       25.4     74.7
+ superscalar    34.4       21.3     69.6
+ contention     10.2       6.2      17.8
TABLE 3 Single-thread iflux runtime prediction errors of the four described projection models.
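Summary statistics of the kind reported in Table 3 can be computed as sketched below. The error definition (absolute error relative to the measured value) and the use of the population standard deviation are assumptions, as the text does not state either; the input values are hypothetical.

```python
import statistics


def prediction_errors(predicted, measured):
    """Percentage error of each prediction relative to its measurement."""
    return [abs(p - m) / m * 100.0 for p, m in zip(predicted, measured)]


def summarise(errors):
    """Mean, standard deviation and worst case of a list of errors."""
    return {
        "mean": statistics.mean(errors),
        "sd": statistics.pstdev(errors),  # population SD (an assumption)
        "worst": max(errors),
    }


# Hypothetical predicted and measured cycle counts for three configurations.
predicted = [110.0, 95.0, 100.0]
measured = [100.0, 100.0, 100.0]
summary = summarise(prediction_errors(predicted, measured))
```

Reporting the worst case alongside the mean matters here because a model with a low mean can still fail badly on a single configuration, as the constant ratio model does.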
[Figure 8: prediction error (0% to 20%) against thread count for (a) Xeon Broadwell (2 to 28 threads), (b) Xeon Skylake (2 to 24 threads) and (c) Xeon KNL DDR (2 to 120 threads); series: SSE4.2, AVX2, AVX512x1.]
FIGURE 8 Model prediction errors of iflux strong scaling.
6.5 Predicting strong scaling
A necessary component for predicting scaling performance of any kernel in a single node is accounting for limits imposed by the memory hierarchy. This is measured empirically by a new data throughput (DT) kernel, which performs the same data movement as iflux but with minimal arithmetic (the least possible without the compiler optimising away the data accesses). This kernel is executed at the target thread count to produce its cycle consumption $C_{\text{dt},t}$. As this performs minimal computation it may run at a different turbo clock frequency than MG-CFD, so scaling of $C_{\text{dt},t}$ is necessary to produce a lower bound for $C_{\text{iflux},t}$. This is formulated in Equation 10 to produce the lower bound $C_{\text{min},t}$.
$$C_{\text{min},t} = C_{\text{dt},t} \, \frac{\text{GHz}_{\text{mini},t}}{\text{GHz}_{\text{dt},t}} \tag{10}$$
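One plausible reading of how this lower bound is applied is sketched below: the compute-bound prediction is used unless it falls below the memory bound, in which case the bound is taken instead. The function names and values are hypothetical, and taking the maximum of the two is an assumption about how the bound enters the prediction.

```python
def memory_lower_bound(c_dt, ghz_mini, ghz_dt):
    """Equation 10: rescale DT kernel cycles to MG-CFD's clock frequency."""
    return c_dt * ghz_mini / ghz_dt


def bounded_cycles(c_compute, c_dt, ghz_mini, ghz_dt):
    """Predicted cycles at a thread count: never below the memory bound."""
    return max(c_compute, memory_lower_bound(c_dt, ghz_mini, ghz_dt))


# Hypothetical measurements at one thread count.
c_min = memory_lower_bound(c_dt=4.0e8, ghz_mini=2.0, ghz_dt=2.5)
memory_bound_case = bounded_cycles(3.0e8, c_dt=4.0e8, ghz_mini=2.0, ghz_dt=2.5)
compute_bound_case = bounded_cycles(5.0e8, c_dt=4.0e8, ghz_mini=2.0, ghz_dt=2.5)
```

The switch between the two regimes is abrupt in this formulation, so a kernel that is only partly limited by memory performance would be under-predicted near the crossover.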
Figure 8 presents prediction errors for iflux strong scaling on a Skylake node with 1-way SMT, a Broadwell node with 1-way SMT, and a KNL with 2-way SMT, summing runtime across the four MG levels. Prediction error is low on the Broadwell node, never exceeding 10% with mean errors of 3.7% for SSE4.2 and 4.4% for AVX2. Prediction error is lower still on the Skylake node for the AVX2 and AVX-512 instruction sets, averaging 2.5%, but for SSE4.2 the model error of predicting compute-bound performance increases to ≈ 10%. Prediction error on KNL is low at low thread counts, but increases gradually to a peak at 80 threads of 16.1% for SSE4.2 and 12.2% for AVX2. This indicates that as iflux scales up to near the thread count at which it becomes fully bandwidth-bound, it is partially and increasingly limited by memory performance, rather than the bound acting in a binary manner as the model assumes. Once iflux exceeds 80 threads it becomes fully memory-bound under AVX2 and AVX-512 with error falling to near-zero, and under SSE4.2 iflux is very close to this limit, as evidenced by the error falling by half to 8%. This discrepancy is explained by newer instruction sets having more sophisticated instructions that perform more operations simultaneously, such as fused multiply-accumulate (FMA) and gather/scatter instructions, that allow a compiler to reduce arithmetic intensity of loops.
System              mean (%)   SD (%)   worst (%)
Broadwell AVX2      10.6       4.8      16.3
Broadwell SSE4.2    5.0        4.1      9.8
Skylake AVX2        12.2       3.1      17.1
TABLE 4 Model prediction errors of vflux compute strong scaling.
6.6 Predicting performance of HYDRA
Having validated the predictive ability of MG-CFD and the performance model, we now direct attention to the most significant HYDRA kernel, vflux. This is the single most expensive loop in HYDRA; for 28 MPI processes on Xeon Broadwell, it accounts for 35.8% of the walltime. Accordingly, its arithmetic intensity is several times that of MG-CFD, posing a significant challenge to our projection model. In contrast, its data access pattern is very similar to that of iflux, performing the same single loop over edges, and only differing significantly in the quantity of data associated with each node (cell), a reflection of the increased complexity of the equations needed for viscous and turbulent flow.
For this prediction task we use a different dataset that is typical for a HYDRA workload, the NASA Rotor 37 mesh of an axial compressor rotor [32]. This contains ≈ 8.1 M nodes and ≈ 24 M edges, with an additional three MG meshes that result in a total count of ≈ 15.7 M nodes and ≈ 53 M edges. Thus, this is a significant size for assessing single-node performance.
As performed for the iflux predictions, we extract the assembly code of the vflux loop from the compiler-generated object file and categorise its constituent instructions. The set of MG-CFD variants is executed on each target system, providing empirical data for estimation of CPI rates. The projection model is applied to provide an estimate of $C_{l,\text{vflux}}$, the cycle consumption of a single vflux loop iteration. MG-CFD also measures clock speed, allowing $C_{l,\text{vflux}}$ to be converted into ‘grind time’, the runtime of a single loop iteration. The grind time is passed into a pre-existing performance model of HYDRA, which combines it with knowledge of mesh partitioning and a function call trace to produce a prediction of total compute runtime for each HYDRA kernel [3].
Predictions are made of MPI strong scaling of vflux compute on a Xeon Broadwell node and a Xeon Skylake node, using the AVX2 and SSE4.2 instruction sets, presented in Figure 9, with accompanying statistics listed in Table 4. Prediction error for Broadwell with SSE4.2 is the lowest, achieving a mean prediction error of 5.0% and not exceeding 10%. At full node utilisation the error falls to 1.3%. Switching to AVX2 on Broadwell, the prediction error increases to ≈ 16% for low process counts, but reduces steadily as process count increases to a minimum of 2.6% at 20 processes. At full node utilisation the prediction error is 7.1%. On Skylake with AVX2 the prediction error is similar to that on Broadwell AVX2 for low process counts, and initially shows the same trend of decreasing with increasing process count, but at 12 processes and above the error stabilises at ≈ 10.5%.
[Figure 9: prediction error (0% to 20%) against number of MPI processes (1 to 28); series: Broadwell AVX2, Broadwell SSE4.2, Skylake AVX2.]
FIGURE 9 Prediction error of HYDRA vflux() compute strong scaling.
We conclude by focusing on prediction error at the maximum process count on each system. System procurement decision-making typically considers performance in terms of fully-utilised nodes, so achieving accurate predictions of these is particularly important. Our model predicts fully-utilised Broadwell node performance with error 7.1% for AVX2 and 1.3% for SSE4.2, and fully-utilised Skylake node performance with error 10.7%; thus our model has the capability to meaningfully inform procurement decisions with high accuracy.
7 CONCLUSIONS
This paper reports the development of MG-CFD, the only multigrid unstructured finite-volume CFD mini-application. MG-CFD has been developed as part of a long-standing university / industry collaboration and, as a result, is representative of the Rolls-Royce plc. production code HYDRA, their primary CFD code used for turbomachinery design.
We applied two mini-application validation techniques, demonstrat-
ing that MG-CFD is similar to HYDRA’s iflux routine. Further analysis
highlighted that the scaling behaviour achieved was similar, and that
the hardware was being stressed in a similar way by both iflux and
MG-CFD according to hardware counters.
In addition we construct an analytical performance projection model, targeted towards predicting the performance difference between MG-CFD and HYDRA. This enables projection from MG-CFD to HYDRA performance on a range of existing and emerging HPC architectures. This is highly significant for Rolls-Royce plc. as they increase their use of virtual certification and simulation-based engine design [13]. We also demonstrate that it is possible to use a mini-application and performance modelling to predict the performance of a production ‘target’ code, with a mean error of 9.2% for strong-scaling studies.
In future research we plan to add MPI functionality to MG-CFD
through integration of the OP2 library. After validating that multi-
node strong scaling performance is similar to HYDRA, we intend to
use MG-CFD to explore alternative communication patterns such as
partitioning-aware rank placement.
The source code for MG-CFD, and scripts for assembly analysis and projection modelling, are available as open-source software on GitHub (footnotes 1, 2 and 3).
ACKNOWLEDGMENTS
This research is supported by Rolls-Royce plc., the EU Horizon 2020
Clean Sky Project, the UK Engineering and Physical Sciences Research
Council (EPSRC), and Intel Corporation: (EP/S005072/1 - Strategic
Partnership in Computational Science for Advanced Simulation and
Modelling of Engineering Systems - ASiMoV; EPSRC Industrial CASE
award 15220082). The authors would like to thank Rolls-Royce plc. for
granting permission to publish this work.
References
1. Mudalige G. R., Vernon M. K., Jarvis S. A.. A Plug-And-Play Model
for Evaluating Wavefront Computations on Parallel Architectures.
Proceedings of the 22nd International Parallel and Distributed Processing
Symposium 2008 (IPDPS’08). 2008;:1–14.
2. Barker K. J., Davis K., Hoisie A., et al. Using Performance Modeling
to Design Large-Scale Systems. Computer. 2009;42(10):0042–49.
3. Bunt R. A., Pennycook S. J., Jarvis S. A., Lapworth L., Ho Y. K..
Model-Led Optimisation of a Geometric Multigrid Application. Pro-
ceedings of the 15th High Performance Computing and Communica-
tions (HPCC’13). 2013;:742–753.
4. Bunt R. A., Wright S. A., Jarvis S. A., Street M., Ho Y. K.. Predictive
Evaluation of Partitioning Algorithms Through Runtime Modelling.
Proceedings of The 23rd IEEE International Conference on High Perfor-
mance Computing, Data, and Analytics. 2016;:351–361.
5. Hammond S. D., Mudalige G. R., Smith J. A., Jarvis S. A., Herdman A. J.,
Vadgama A.. WARPP: A Toolkit for Simulating High-Performance
Parallel Scientific Codes. Proceedings of the 2nd
International Conference on Simulation Tools and Techniques 2009
(ICSTT’09). 2009;:1–10.
1 https://github.com/warwick-hpsc/MG-CFD-app-plain
2 https://github.com/warwick-hpsc/MG-CFD-performance-model
3 https://github.com/warwick-hpsc/assembly-loop-extractor
6. Janssen C. L., Adalsteinsson H., Cranford S., et al. A Simulator for
Large-Scale Parallel Computer Architectures. International Journal
of Distributed Systems and Technologies. 2010;1(2):57–73.
7. Pennycook S. J., Hughes C. J., Smelyanskiy M., Jarvis S. A.. Exploring
SIMD for Molecular Dynamics, Using Intel® Xeon® Processors
and Intel® Xeon Phi Coprocessors. Proceedings of the 27th
International Parallel and Distributed Processing Symposium 2013
(IPDPS’13). 2013;:1085–1097.
8. Tramm J. R., Siegel A. R., Islam T., Schulz M.. XSBench-the Development
and Verification of a Performance Abstraction for Monte
Carlo Reactor Analysis. Proceedings of the Role of Reactor Physics
Toward a Sustainable Future 2014 (PHYSOR’14). 2014;:1–12.
9. Reguly I. Z., Mudalige G. R., Giles M. B.. Design and Development
of Domain Specific Active Libraries With Proxy Applications.
Proceedings of Cluster Computing 2015 (CLUSTER’15). 2015;:738–745.
10. Mallinson A. C., Jarvis S. A., Gaudin W. P., Herdman A. J.. Experiences
at Scale With PGAS Versions of a Hydrodynamics Application.
Proceedings of the 8th International Conference on Partitioned
Global Address Space Programming Models 2014 (PGAS’14). 2014;:9–20.
11. Heroux M., Barrett R.. Mantevo Project https://mantevo.org/
(accessed March 3, 2016). 2016.
12. UK Mini-App Consortium. UK Mini-App Consortium
http://uk-mac.github.io/papers.html (accessed March 6, 2016). 2016.
13. Strategic Partnership in Computational Science for Advanced Simulation
and Modelling of Engineering Systems - ASiMoV
https://gtr.ukri.org/projects?ref=EP/S005072/1. 2018.
14. Owenson A., Wright S. A., Jarvis S. A., Bunt R. A., Ho Y. K., Street M..
Developing and Using a Geometric Multigrid, Unstructured Grid
Mini-Application to Assess Many-Core Architecture. Proceedings of
the 26th Euromicro International Conference on Parallel, Distributed
and Network-based Processing. 2018;.
15. ECP Proxy Apps Suite
https://proxyapps.exascaleproject.org/ecp-proxy-apps-suite/ (accessed March 13, 2019).
16. Karlin I., Bhatele A., Keasler J., et al. Exploring Traditional and
Emerging Parallel Programming Models using a Proxy Application.
Proceedings of the 27th IEEE International Parallel and Distributed
Processing Symposium 2013 (IPDPS’13). 2013;:919–932.
17. Corrigan A., Camelli F. F., Löhner R., Wallin J.. Running Unstructured
Grid-Based CFD Solvers on Modern Graphics Hardware. International
Journal for Numerical Methods in Fluids. 2011;66(2):221–229.
18. Adams M. F., Brown J., Shalf J., Van Straalen B., Strohmaier E.,
Williams S.. HPGMG 1.0: A Benchmark for Ranking High Perfor-
mance Computing Systems. : Lawrence Berkeley National Labora-
tory; 2014.
19. Messer O. E. B., D’Azevedo E., Hill J., Joubert W., Laosooksathit S.,
Tharrington A.. Developing MiniApps on Modern Platforms Using
Multiple Programming Models. Proceedings of Cluster Computing
2015 (CLUSTER’15). 2015;:753–759.
20. Sharkawi S., DeSota D., Panda R., et al. Performance projection
of HPC applications using SPEC CFP2006 benchmarks. 2009 IEEE
International Symposium on Parallel Distributed Processing. 2009;:1–
12.
21. Hoste K., Phansalkar A., Eeckhout L., Georges A., John L. K., De
Bosschere K.. Performance Prediction Based on Inherent Program
Similarity. Proceedings of the 15th International Conference on Paral-
lel Architectures and Compilation Techniques (PACT’06). 2006;:114–
122.
22. Lapworth L.. HYDRA-CFD: A Framework for Collaborative CFD
Development. Proceedings of the International Conference on Scientific
and Engineering Computation 2004 (IC-SEC’04). 2004;.
23. Spalart P. R., Allmaras S. R.. A one-equation turbulence model for
aerodynamic flows. AIAA Journal. 1994;:5–21.
24. Moinier P., Müller J., Giles M. B.. Edge-Based Multigrid and Preconditioning
for Hybrid Grids. AIAA Journal. 2002;40(10):1945–1953.
25. Martinelli L., Jameson A.. Validation of a multigrid method for the
Reynolds averaged equations. AIAA Journal. 1988;.
26. Trottenberg U., Oosterlee C. W., Schuller A.. Multigrid. Elsevier,
Amsterdam, The Netherlands; 2001.
27. Che S., Boyer M., Meng J., et al. Rodinia: A benchmark suite for
heterogeneous computing. 2009 IEEE International Symposium on
Workload Characterization (IISWC). 2009;:44-54.
28. Briggs W. L.. Multigrid Tutorial. SIAM, Philadelphia, PA; 1987.
29. Hodson H. P., Dominy R. G.. Three-Dimensional Flow in a Low-
Pressure Turbine Cascade at Its Design Condition. Journal of Turbo-
machinery. 1987;109(2):177–185.
30. NASA. TCGRID v. 400
https://www.grc.nasa.gov/www/5810/rvc/tcgrid.htm (accessed November, 8th 2017).
31. Browne S., Deane C., Ho G., Mucci P.. PAPI: A Portable Interface
to Hardware Performance Counters. Proceedings of Department of
Defense HPCMP Users Group Conference. 1999;.
32. Reid L., Moore R. D.. Design and overall performance of four highly
loaded, high speed inlet stages for an advanced high-pressure-ratio
core compressor. 1978;.
How to cite this article: Owenson A.M.B., Wright S.A., Bunt R.A., Ho
Y.K., Street M.J., and Jarvis S.A. (2019), An Unstructured CFD Mini-
Application for the Performance Prediction of a Production CFD Code,
Concurrency Computat: Pract Exper., 2019;x:x–y.
