Optimization of Lattice Boltzmann Simulations on Heterogeneous Computers by Calore, E. et al.
Optimization of Lattice Boltzmann
Simulations on Heterogeneous
Computers
Journal Title
XX(X):1–18
c©The Author(s) 0000
Reprints and permission:
sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/ToBeAssigned
www.sagepub.com/
E. Calore1, A. Gabbana1, S. F. Schifano1, R. Tripiccione1
Abstract
High-performance computing systems are more and more often based on accelerators. Computing applications
targeting those systems often follow a host-driven approach in which hosts offload almost all compute-intensive
sections of the code onto accelerators; this approach only marginally exploits the computational resources available
on the host CPUs, limiting performance and energy efficiency. The obvious step forward is to run compute-intensive
kernels in a concurrent and balanced way on both hosts and accelerators. In this paper we consider exactly this problem
for a class of applications based on Lattice Boltzmann Methods, widely used in computational fluid-dynamics. Our
goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts
and accelerators. To reach this goal, we define common data layouts enabling the code to exploit efficiently the
different parallel and vector options of the various accelerators, and matching the possibly different requirements of
the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the
best partitioning of workloads among host and accelerator, and the optimally achievable overall performance level. We
test the performance of our codes and their scaling properties using as testbeds HPC clusters incorporating different
accelerators: Intel Xeon-Phi many-core processors, NVIDIA GPUs and AMD GPUs.
Keywords
Lattice Boltzmann methods, Accelerators, Performance modeling, Heterogeneous systems, Performance portability
1 Background and related works
The architecture of high-performance computing
(HPC) systems is increasingly based on accelerators;
typical HPC systems today – including several leading
entries of the TOP500 list ([1]) – are large clusters of
processing nodes interconnected by a fast low-latency
network, each node containing standard processors
coupled with one or more accelerator units.
Accelerators promise higher computing performance
and better energy efficiency, improving on traditional
processors up to one order of magnitude; in fact, they
i) allow massive parallel processing by a combination
of a large number of processing cores and vector units,
ii) allow for massive multi-threading, in order to hide
memory access latencies and, iii) trade a streamlined
control structure (e.g., executing instructions in-order
only) for additional data-path processing for a given
device or energy budget.
Typical accelerated applications usually follow a
host-driven approach in which the host processor
offloads (almost) all compute-intensive sections of
the code onto accelerators; the host itself typically
1University of Ferrara and INFN Ferrara, via Saragat 1, I-44122
Ferrara, ITALY.
Corresponding author:
Sebastiano Fabio Schifano, University of Ferrara and INFN Ferrara,
via Saragat 1, I-44122 Ferrara, ITALY.
Email: schifano@fe.infn.it
Prepared using sagej.cls [Version: 2015/06/09 v1.01]
ar
X
iv
:1
70
3.
04
59
4v
1 
 [c
s.D
C]
  1
4 M
ar 
20
17
2 Journal Title XX(X)
only orchestrates global program flow or processes
sequential segments of the application; this approach
wastes a non negligible amount of the available
computational resources, reducing overall performance
and energy efficiency.
The obvious step forward is that even compute-
intensive application kernels should be executed in a
balanced way on both hosts and accelerators. This
improvement has been hampered so far by several
non trivial obstacles, especially because CPUs and
accelerators often present different architectures, so
efficient accelerated codes may involve different data
structures and operation schedules. Moreover, the lack
of well established performance-portable programming
heterogeneous frameworks has so far required the use of
specific programming languages (or at least proprietary
variants of standard languages) on each different
accelerator, harming portability, maintainability and
possibly even correctness of the application.
Improvements on this latter aspect come with
the recent evolution of directive-based programming
environments, allowing programmers to annotate their
codes with hints to the compiler about available
parallelization options. Several such frameworks have
been proposed, such as the Hybrid Multi-core Parallel
Programming model (HMPP) proposed by CAPS,
hiCUDA ([2]), OpenMPC ([3]) and StarSs ([4]).
However, the most common compiler frameworks
currently used for scientific codes are OpenMP ([5])
and OpenACC ([6]). Both frameworks allow to
annotate codes written in standard languages (e.g., C,
C+ and Fortran) with appropriate pragma directives
characterizing the available parallelization space of
each code section. This approach leaves to compilers
to apply all optimization steps specific to each different
target architecture and consistent with the directives,
allowing in principle portability of codes between any
supported host and accelerator device. This process
is still immature, and significant limits to portability
still exist. The OpenMP standard, version 4 ([7]), has
introduced support for – in principle – any kind of
accelerator, but compilers supporting GPUs are not yet
available, and the Intel Xeon Phi is de-facto the only
supported accelerator. On the other hand OpenACC
supports several different GPUs and more recently also
multi-core CPU architectures, but not the Xeon-Phi,
and does not allow to compile codes able to spread
parallel tasks concurrently on both GPU and CPU
cores. Also, both standards do not address processor-
specific hardware features so non-portable proprietary
directives and instructions are often necessary; for
example performance optimization on Intel CPU
processors often requires Intel-proprietary compiler
directives.
In spite of these weaknesses, and in the hope that
a converging trend is in progress, one would like
to i) understand how difficult it is to design one
common code using common domain data structures
running concurrently on host(s) and accelerator(s),
that is portable and also performance-portable across
traditional processors and different accelerators, and
ii) quantify the performance gains made possible by
concurrent execution on host and accelerator.
In order to explore this problem, one has to
understand the impact on performance that different
data layouts and execution schedules have for different
accelerator architectures. One can then define a
common data layout and write a common code for a
given application with optimal (or close to optimal)
performance on several combinations of host processors
and accelerators. Assuming we can identify a common
data layout giving good performance on several
combinations of host processors and accelerators, then
one has to find efficient partitioning criteria to split the
execution of the code among hosts and processors.
In this paper we tackle exactly these issues for a
class of applications based on Lattice Boltzmann (LB)
methods, widely used in computational fluid-dynamics.
This class of applications offers a large amount of
easily identified available parallelism, making LB an
ideal target for accelerator-based HPC systems. We
consider alternate data layouts, processing schedules
and optimal ways to compute concurrently on host and
accelerator. We quantify the impact on performance,
and use these findings to develop production-grade
massively parallel codes. We run benchmarks and test
Prepared using sagej.cls
Calore, Gabbana, Schifano, Tripiccione 3
our codes on HPC systems whose nodes have dual Intel
Xeon processors and a variety of different accelerators,
namely the NVIDIA K80 ([8]) and AMD Hawaii GPUs
([9]), and the Intel Xeon-Phi ([10]).
Over the years, LB codes have been written and
optimized for large clusters of commodity CPUs
([11]) and for application-specific machines ([12, 13,
14]). More recent work has focused on exploiting
the parallelism of powerful traditional many-core
processors ([15]), and of power-efficient accelerators
such as GP-GPU ([16, 17, 18]) and Xeon-Phi processors
([19]), and even FPGAs ([20]).
Recent analyses of optimal data layouts for LB have
been made in [21, 22, 23]. However, [21] focuses only
on the propagate step, one of the two key kernels
in LB codes, while [22] does not take into account
vectorization; in [23] vectorization is considered using
intrinsic functions only. None of these papers considers
accelerators. In [24] we have started preliminary
investigation considering only the Xeon-Phi as an
accelerator. Here, we extend these results in several
ways: first, we take into account both propagate and
collision steps used in LB simulations. Then we use
a high level approach based on compiler directives,
and we take into account also NVIDIA and AMD
accelerators commonly used in HPC communities. Very
recently, [25, 26] have explored the benefits of LB
solvers on heterogeneous systems considering different
memory layouts and system based both on NVIDIA
GPUs and Xeon-Phi accelerators. In our contribution
we consider a more complex LB solver (D2Q37 instead
of D2Q9) a wider analysis of data layouts and an
automatic analytic way to find the optimal partitioning
of lattice domains between host-CPU and accelerators.
This paper is structured as follows: section 2
introduces the main hardware features of CPUs and
GPUs that we have taken into account in this paper
and section 3 gives an overview of LB methods. Section
4 discusses the design and implementation options for
data layouts suitable for LB codes and measures the
impact of different choices in terms of performances,
while section 5 describes the implementation of codes
that we have developed concurrently running on both
host and accelerator. Section 6 describes two important
optimization steps for our heterogeneous code and
defines a performance model, while section 7 analyzes
our performance results on two different clusters, one
with K80 GPUs and one with Xeon-Phi accelerators.
Finally, section 8 wraps up our main results and
highlights our conclusions.
2 Architectures of HPC systems
Heterogeneous systems have recently enjoyed growing
popularity in the HPC landscape. These systems
combine within a single processing node commodity
multi-core architectures with off-chip accelerators,
GPUs, Xeon-Phi many-core units and (sometimes)
FPGAs. This architectural choice comes from an
attempt to boost overall performance adding an
additional processor (the accelerator) that exploits
massively parallel data-paths to increase performance
(and energy efficiency) at the cost of reduced
programming flexibility. In this section we briefly
review the architectures of the state-of-the-art
accelerators that we have considered – GPUs and
Xeon-Phi many-core processors – focusing on the
impact that their diverging architectures have on the
possibility to develop a common HPC code able to run
efficiently on both of them.
GPUs are multi-core processors with a large number
of processing units, all executing in parallel. The
NVIDIA K80 is a dual GK210 GPU, each containing
13 processing units, called streaming multiprocessors
(SMX). Each processing unit has 192 compute units,
called CUDA-cores, concurrently executing groups
of 32 operations in a SIMT (Single Instruction,
Multiple Thread) fashion; much like traditional SIMD
processors, cores within a group execute the same
instruction at the same time but are allowed to take
different branches (at a performance penalty). The
AMD FirePro W9100 is conceptually similar to the
K80 NVIDIA GPU; it has 44 processing units, each
one with 64 compute units (stream processors).
The clock of an NVIDIA K80 has a frequency of
573 MHz which can be boosted up to 875 MHz. The
Prepared using sagej.cls
4 Journal Title XX(X)
Xeon E5-2630 v3 Xeon-Phi 7120P Tesla K80 FirePro W9100
#physical-cores 8 61 2 × 13 SMX 44
#logical-cores 16 244 2 × 2496 2816
Clock (GHz) 2.4 1.238 0.560 0.930
Peak perf. (DP/SP GF) 307/614 1208/2416 1870/5600 2620/5240
SIMD unit AVX2 256-bit AVX2 512-bit N/A N/A
LL cache (MB) 20 30.5 1.68 1.00
#Mem. Channels 4 16 – –
Max Memory (GB) 768 16 2 × 12 16
Mem BW (GB/s) 59 352 2 × 240 320
Table 1. Selected hardware features of the systems tested in
this work: Xeon E5-2630 is a commodity processor adopting
the Intel Haswell micro-architecture; Xeon-Phi 7120P is
based on the Intel MIC architecture; Tesla K80 is a NVIDIA
GPU with two Tesla GK210 accelerators; FirePro W9100 is
an AMD Hawaii GPU
aggregate peak performance is then of 5.6 TFlops in
single precision and 1.87 TFlops in double precision
(only one third of the SMXs work concurrently when
performing double precision operations). Working at
930MHz the processing units of the AMD FirePro
W9100 delivers up to 5.2 TFlops in SP and 2.6 in DP.
In general, GPUs sustain their huge potential
performance thanks to large memory bandwidth –in
order to avoid starving the processors – and massive
multi-threading – to hide memory access latency.
Consequently, register files are huge in GPUs, as they
have to store the states of many different threads,
while data caches are less important. For example
(see also Table 1), a K80 GPU has a combined peak
memory bandwidth of 480 GB/s, while each SMX
has a register file of 512KB and just 128KB L1
cache/shared memory; SMX units share a 1536KB L2
cache. Similarly, the AMD W9100 has a peak memory
bandwidth of 320 GB/s and its last level cache is just
1MB.
The architecture of Xeon-Phi processors – the other
class of accelerators that we consider – builds on
a very large number of more traditional X86 cores,
each optimized for streaming parallel processing, with
a streamlined and less versatile control part and
enhanced vector processing facilities. For instance, the
currently available version of this processor family –
the Knights Corner (KNC) – integrates up to 61 CPU
cores, each supporting the execution of 4 threads, for
an aggregate peak performance of 1 TFlops in double
precision, with a clock running at 1.2 GHz. Each core
has a 32 KB L1-cache and a 512 KB L2-cache. The
L2-caches, private at the core level, are interconnected
through a bi-directional ring and data is kept coherent
and indirectly accessible by all cores. The ring also
connects to a GDDR5 memory bank of 16 GB, with a
peak bandwidth of 352 GB/s. Each core has a Vector
Processing Unit (VPU), executing SIMD operations on
vectors of 512 bits.
Present generation GPUs and Xeon-Phi processors
are connected to their host through a PCI-express
interface, allowing data exchange between the two
processors. Typically 16 PCI-express lanes are used,
with an aggregate bandwidth of 8 GB/s, much
smaller that the typical memory bandwidth of these
processors, so processor-accelerator data exchanges
may easily become serious performance bottlenecks.
For both processor classes, in this work we consider
a host-driven heterogeneous programming model, with
applications executing both on the host and on
the accelerator ∗. So far, different programming
languages have been available for each specific
accelerator; indeed, NVIDIA GPUs have a proprietary
programming language, and AMD GPUs are supported
by the OpenCL programming environment. On the
contrary, the Xeon-Phi uses the same programming
environment as its Xeon multi-core counterpart, so
one can develop codes following a directive-based
approach (e.g. OpenMP), slightly reducing the effort
of application migration. On the other hand, many
works have shown that extracting a large fraction of
the performance capabilities of the Knights Corner
(KNC), the first generation Xeon Phi co-processors,
still requires significant efforts and fine restructuring
of the code.
Following recent improvements in directive-based
programming environments, the present work wants
to explore ways of writing common codes that i)
have optimization features that can be exploited by
∗the Xeon-Phi also support “native mode” thus acting as an
independent node capable of running applications independently.
This approach will be enhanced in the next generation Xeon-
Phi system, the Knights Landing, that will also be available as a
stand alone processor. We do not consider this mode of operation
in this paper.
Prepared using sagej.cls
Calore, Gabbana, Schifano, Tripiccione 5
Figure 1. Left: Velocity vectors for the LB populations of the D2Q37 model. Right: populations are identified by an arbitrary
label, associated to the lattice hop that they perform in the propagate phase.
traditional CPUs and both accelerator architectures,
and ii) are written using a common directive-based
(e.g. OpenMP, OpenACC) programming environment.
3 Lattice Boltzmann methods
In this section, we sketchily introduce the computa-
tional method that we adopt, based on an advanced
Lattice Boltzmann (LB) scheme. LB methods (see
[27] for an introduction) are discrete in position and
momentum spaces; they are based on the synthetic
dynamics of populations sitting at the sites of a discrete
lattice. At each time step, populations hops from
lattice-site to lattice-site and then they collide, mixing
and changing their values accordingly.
Over the years, several LB models have been
developed, describing flows in 2 or 3 dimensions,
and using sets of populations of different size (a
model in x dimensions based on y populations is
labeled as DxQy). Populations (fl(x, t)), each having
a given lattice velocity cl, are defined at the sites of
a discrete and regular grid; they evolve in (discrete)
time according to the Bhatnagar-Gross-Krook (BGK)
equation:
fl(x + cl∆t, t+ ∆t) =
fl(x, t)− ∆t
τ
(
fl(x, t)− f (eq)l
)
(1)
The macroscopic physics variables, density ρ, velocity
u and temperature T are defined in terms of the fl(x, t)
and cl variables:
ρ =
∑
l
fl ρu =
∑
l
clfl DρT =
∑
l
|cl − u|2 fl;
the equilibrium distributions (f
(eq)
l ) are themselves
functions of these macroscopic quantities ([27]). With
an appropriate choice of the set of lattice velocities
cl and of the equilibrium distributions f
(eq)
l , one
shows that, performing an expansion in ∆t and re-
normalizing the values of the physical velocity and
temperature fields, the evolution of the macroscopic
variables obeys the thermo-hydrodynamical equations
of motion and the continuity equation:
∂tρ+ ρ∂iui = 0,
ρDtui = −∂ip− ρgδi,2 + ν∂jjui, (2)
ρcvDtT + p∂iui = k∂iiT ;
Dt = ∂t + uj∂j is the material derivative and we
neglect viscous heating; cv is the specific heat at
constant volume for an ideal gas, p = ρT , and ν and k
are the transport coefficients; g is the acceleration of
gravity, acting in the vertical direction. Summation of
repeated indexes is implied.
In our case we study a 2-dimensional system (D = 2
in the following), and the set of populations has 37
elements (hence the D2Q37 acronym) corresponding
to (pseudo-)particles moving up to three lattice points
away, as shown in Figure 1 ([28, 29]). The main
advantage of this recently developed LB method is
Prepared using sagej.cls
6 Journal Title XX(X)
that it automatically enforces the equation of state of
a perfect gas (p = ρT ). Our optimization efforts have
made it possible to perform large scale simulations of
convective turbulence in several physics conditions (see
e.g., [30, 31]);
An LB code starts with an initial assignment
of the populations, corresponding to a given initial
condition at t = 0 on some spatial domain, and iterates
Equation 1 for each population and lattice site and for
as many time steps as needed; boundary conditions are
enforced at the edges of the domain after each time
step by appropriately modifying population values at
and close to the boundary.
From the computational point of view, the LB
approach offers a huge degree of easily identified
available parallelism. Defining y = x + cl∆t and
rewriting the main evolution equation as:
fl(y, t+ ∆t) =
fl(y − cl∆t, t)− ∆t
τ
(
fl(y − cl∆t, t)− f (eq)l
)
(3)
one easily identifies the overall structure of the
computation that evolves the system by one time
step ∆t: for each point y in the discrete grid the
code: i) gathers from neighboring sites the values
of the fields fl corresponding to populations drifting
towards y with velocity cl and then, ii) performs all
mathematical steps needed to compute the quantities
in the r.h.s. of Equation 3. One quickly sees that there
is no correlation between different lattice points, so
both steps can proceed in parallel on all grid points
according to any convenient schedule, with the only
constraint that step 1 precedes step 2.
As already remarked, our D2Q37 model correctly
and consistently describes the thermo-hydrodynamical
equations of motion and the equation of state of
a perfect gas; the price to pay is that, from a
computational point of view, its implementation is
more complex than for simpler LB models. This
translates in demanding requirements for memory
bandwidth and floating-point throughput. Indeed, step
1 implies accessing 37 neighbor cells to gather all
populations, while step 2 implies ≈ 7000 double-
precision floating point operations per lattice point,
some of which can be optimized away e.g. by the
compiler.
4 Data-layout optimization for LB kernels
Our goal is to design a performance-portable code
capable of running efficiently on recent Intel multi-core
CPUs as well as on Xeon-Phi and GPU accelerators.
Our intendedly common application is written in
plain C and annotated with compiler directives for
parallelization. For Intel architectures, we have used
OpenMP and proprietary Intel directives, using the
offload pragmas to run kernels on the Xeon-Phi.
For NVIDIA and AMD GPUs we have annotated
the code with OpenACC directives, implemented by
the PGI compiler, which supports both architectures.
Mapping OpenACC directives on OpenMP directives
is almost straightforward, so code divergence is
limited at this point in time; it is expected that
OpenMP implementations supporting both classes of
accelerators will become available in the near future,
so we hope to be able soon to merge the two versions
into one truly common code.
As remarked in section 1, data layout has a critical
role in extracting performance from accelerators.
Data layouts for LB methods, as for many other
stencil applications, have been traditionally based
either on array of structures (AoS) or on structure of
arrays (SoA) schemes. Figure 2 shows how population
data are stored in the two cases; in the AoS layout,
population data for each lattice site are stored one
after the other at neighboring memory locations. This
scheme exhibits locality of populations at a given
lattice site, while populations of common index i at
different lattice sites are stored in memory at non
unit-stride addresses. Conversely, in the SoA scheme
for a given index i, populations of all lattice sites are
stored contiguously, while the various populations of
each lattice site are stored far from each other at non
unit-stride addresses.
Prepared using sagej.cls
Calore, Gabbana, Schifano, Tripiccione 7
#define N ( LX∗LY )
typedef struct {
data_t p1 ; // population 1
data_t p2 ; // population 2
. . .
data_t p37 ; // population 37
} pop_t ;
pop_t lattice [ N ] ;
#define N ( LX∗LY )
typedef struct {
data_t p1 [ N ] ; // population 1
data_t p2 [ N ] ; // population 2
. . .
data_t p37 [ N ] ; // population 37
} pop_t ;
pop_t lattice ;
f0[0]
f37[N]
f37[N]
f0[0] f0[1]
f1[0]
f0[N]
f0[1]f37[0] f0[2] f0[N]f1[1] f37[1] f1[2] f37[2] f1[N]
f1[0] f1[1] f1[N] f2[0] f2[1] f2[N] f37[1]f37[0]
Figure 2. Top and Middle: C definitions of lattice
populations using the AoS and SoA data layouts. Bottom:
Graphic representation of the AoS and SoA layout.
In our implementation we have arbitrarily chosen
to store the lattice in column-major order (Y spatial
direction); we keep in memory two copies that
are alternatively read and written by each kernel
routine. Although demanding in terms of memory
requirements, this choice has the key advantage that it
allows in principle to process all lattice sites in parallel.
4.1 Data-layout optimization for propagate
In our LB code the propagate kernel is applied to
each lattice site, moving populations according to the
patterns of Figure 1. For each site, propagate reads
and writes populations from lattice cells at distance
up to 3 in the physical lattice and for this reason
locality of memory accesses plays an important role
for performance.
This kernel can be implemented using either a
push or a pull scheme, [21]. The former moves all
populations of a lattice site towards appropriate
neighboring sites, while the latter gathers to one
site populations stored at neighbor sites. Relative
advantages and disadvantages of these schemes are not
obvious and depend to some extent on the hardware
features of the target processor. While the push scheme
performs an aligned-read followed by a misaligned-
write, the opposite happens if the pull scheme is
used, and it is well known that reading/writing
from/to (non-)aligned memory addresses may have
a large impact on the sustained memory bandwidth
of modern processors, [32]. However, on cache-based
Intel architectures (both standard CPUs and Xeon-
Phi), aligned data can be stored directly to memory
using non-temporal write instructions. If data to be
stored is not resident at any cache level, standard
semantic of ordinary memory-writes require a prior
read for ownership (RFO); the non-temporal version of
store avoids the RFO read, improving effective memory
bandwidth and saving time. This feature can be used
in the pull scheme reducing the overall memory traffic
by a factor 1/3, so we adopt it.
The propagate kernel can in principle be vectorized
applying each move shown in Figure 1 to several
lattice sites in parallel, e.g. moving populations with
the same index i and belonging to two or more sites.
The number of sites processed in parallel depends
on the size of the vector instructions of the target
processor; vector size is 4 double words for the Xeon-
CPU, 8 for the Xeon-Phi, 32 or multiples thereof for
GPUs. However, using the AoS scheme, populations
of different sites are stored at non-contiguous memory
addresses preventing vectorization. Conversely, when
using the SoA layout, access to several populations of
index i has unit stride, allowing to move in parallel
populations of index i for several sites and allowing
vectorization. This discussion suggests that the SoA
layout should perform better than the AoS one.
Figure 3 (left) shows the C-code written for Intel
processors, adopting the SoA layout. The code sweeps
all lattice sites with two loops in the X and Y spatial
directions. The inner loop is on Y , as elements are
stored in column-major order. We have annotated this
loop with the #pragma omp simd OpenMP pragma
to introduce SIMD vector instructions. Since values
of populations written into nxt array are not re-used
within this kernel, we also enable non-temporal stores;
since this feature is not yet part of the OpenMP
Prepared using sagej.cls
8 Journal Title XX(X)
typedef data_t double ;
typedef struct { data_t s [ LX∗LY ] ; } pop_soa_t ;
pop_soa_t prv [ NPOP ] , nxt [ NPOP ] ;
#pragma omp parallel for
for ( ix = STARTX ; ix < ENDX ; ix++ ) {
#pragma vector nontemporal
for ( iy = STARTY ; iy < ENDY ; iy++ ) {
idx = IDX ( ix , iy ) ;
for ( ip = 0 ; ip < NPOP ; ip++ ) {
nxt [ ip ] . s [ idx ] = prv [ ip ] . s [ idx + OFF [ ip ] ] ;
}
}
}
typedef data_t double ;
typedef struct { data_t s [ LX∗LY ] ; } pop_soa_t ;
pop_soa_t prv [ NPOP ] , nxt [ NPOP ] ;
#pragma acc kernels present ( prv , nxt )
#pragma acc loop gang independent
for ( ix = STARTX ; ix < ENDX ; ix++ ) {
#pragma acc loop vector independent
for ( iy = STARTY ; iy < ENDY ; iy++ ) {
idx = IDX ( ix , iy ) ;
for ( ip = 0 ; ip < NPOP ; ip++ ) {
nxt [ ip ] . s [ idx ] = prv [ ip ] . s [ idx + OFF [ ip ] ] ;
}
}
}
Figure 3. Codes of the propagate kernel for Intel architectures (left) and GPUs (right). This kernel moves populations as
shown in Figure 1. OFF is a vector containing the memory address offsets associated to each population hop. The prv and
nxt arrays use the SoA layout.
Data Structure Haswell Xeon Phi Tesla K80 AMD Hawaii
propagate
AoS 408 194 326 649
SoA 847 224 36 57
CSoA 247 78 32 45
CAoSoA 286 89 33 50
collide
AoS 1232 631 767 2270
SoA 1612 1777 171 1018
CSoA 955 445 165 452
CAoSoA 812 325 166 402
Table 2. Execution time (milliseconds per iteration) of the
propagate and collide kernels on several architectures
using different data-layouts. The size of the lattice is
2160× 8192 points.
standard we have used the specific Intel directive
#pragma vector nontemporal. The equivalent code
for GPUs, replacing Intel and OpenMP directives with
corresponding OpenACC ones, is shown in Figure 3
(right).
The first two rows of Table 2 compare the
performance obtained with the two layouts on all the
processors that we consider. As expected the SoA
layout is much more efficient on GPUs, but this is
not true for both Intel processors; inspection of the
assembly codes and compiler logs shows that read
operations are not vectorized on Intel processors due
to unaligned load addresses. Lacking vectorization, the
AoS layout exhibits a better performance as it has a
better cache hit rate.
The reason why compilers fail to vectorize the code
is that load addresses are computed as the sum of the
destination-site address – which is memory-aligned –
and an offset, so they point to the neighbor sites from
which populations are read. This does not guarantee
that the resulting address is properly aligned to the
vector size, i.e. 32 Bytes for CPUs and 64 Bytes for
the Xeon-Phi. Store addresses on the other hand are
always aligned if the lattice base address is properly
aligned and Y is a multiple of 32 or 64.
A simple modification to the data layout solves this
problem. Starting from the lattice stored in the SoA
scheme, we cluster VL consecutive elements of each
population array, with VL a multiple of the hardware
vector size supported by the processor (e.g. 4 for the
Haswell CPU, 8 for the Xeon-Phi and 32 for GPUs). We
call this scheme Cluster Structure of Array (CSoA); in
Figure 4 we show the corresponding C type definitions,
vdata t and vpop soa t. vdata t holds VL data words
corresponding to the same population of index i at VL
different sites that can be processed in SIMD fashion.
vpop soa t is the type definition for the full lattice
data. Using this scheme, move operations generated by
propagate apply to clusters of populations and not to
individual population elements. Since clusters have the
same size as hardware vectors, all read operations are
now properly aligned. As in the case of the SoA data
layout, write operations always have aligned accesses
and non temporal stores can be used. Figure 4 shows
Prepared using sagej.cls
Calore, Gabbana, Schifano, Tripiccione 9
#define LYOVL ( LY / VL )
typedef data_t double ;
typedef struct { data_t c [ VL ] ; } vdata_t ;
typedef struct { vdata_t s [ LX∗LYOVL ] ; } vpop_soa_t ;
vpop_soa_t prv [ NPOP ] , nxt [ NPOP ] ;
for ( ix = startX ; ix < endX ; ix++ ) {
idx = ix∗LYOVL ;
for ( ip = 0 ; ip < NPOP ; ip++) {
for ( iy = startY ; iy < endY ; iy++ ) {
#pragma vector aligned nontemporal
for ( k = 0 ; k < VL ; k++) {
nxt [ ip ] . s [ idx + iy ] . c [ k ] =
prv [ ip ] . s [ idx + iy + OFF [ ip ] ] . c [ k ] ;
}
}
}
}
Figure 4. Source code of the propagate kernel for Intel
architectures using the CSoA data layout. OFF is a vector
containing the memory-address offsets associated to each
population hop. VL is the size of a cluster (see text for
details). To properly vectorize the inner loop with SIMD
instructions, value of VL should match the width of
vector-registers supported by the target architecture.
the corresponding code; in this case we have also
rearranged the order of the loops in a way which
reduces the pressure on the TLB cache. As before, code
for GPUs can be obtained replacing directives with
OpenAcc ones.
Table 2 (see first three rows of the propagate
section) quantifies the impact of the data layout on
performance, showing benchmark results using the
three different data layouts, AoS, SoA and CSoA. The
advantages of using the CSoA data layout are large for
Intel architectures, while improvements are marginal
for GPUs, as they are less sensitive to misaligned
memory reads [32]. The relevant result is however that
using the CSoA format we have one common data
layout that maximizes performance for propagate on
all processors.
4.2 Data-layout optimization for collide
The collide kernel is computed after the propagate
step, reading at each lattice site populations gathered
by the propagate phase. It updates their values
applying the collisional operator and performing all
mathematical operations associated with Equation 1.
Metric CSoA CAoSoA Threshold
Xeon-Phi
L1 TLB Miss Ratio 2.66% 0.06% 1.0%
L2 TLB Miss Ratio 2.00% 0.00% 0.1%
Xeon-CPU
LLC Miss Count 787,647,256 177,010,620 n/a
Average Latency (cycles) 13 9 n/a
Table 3. Profiling results provided by the Intel VTune
profiler for the collide kernel on a lattice of 2160× 8192
points comparing the CSoA and the CAoSoA scheme on the
Xeon-CPU and Xeon-Phi processors.
For each lattice site, this floating point intensive
kernel uses only population data associated to the
site on which it operates; lattice sites are processed
independently from each other making processing
of the lattice fully parallelizable. In this case, in
contrast with the propagate kernel, locality of
populations plays an important role for performances.
Vectorization of this step is implemented, as for
propagate, trying to process different sites in parallel.
Following results obtained in section 4.1 we consider
first the CSoA data layout which – as seen before
– gives very good performance results with the
propagate kernel. The log files of the compiler show
that the CSoA scheme allows to vectorize the code,
but profiling the execution of the code on Xeon CPUs
and Xeon Phi accelerators, we have observed a large
number of TLB and LLC misses, suggesting that
further improvements could be put in place. Table 3
shows the results (see CSoA column) provided by
Intel VTune profiler for both Xeon-CPU and Xeon-
Phi; for the latter the miss ratios are much bigger
than threshold values provided by the profiler itself.
These penalties arise because in the CSoA scheme
different populations associated to the same lattice site
are stored at memory addresses far from each other, so
several non-unit stride memory accesses are necessary
to gather all relevant data words.
To overcome these problems, we further modify
the CSoA scheme, and define a new data layout
which takes into account locality requirements of both
propagate and collide kernels. In this scheme, for
each population array, we divide each Y -column in VL
partitions each of size LY/VL; all elements sitting at the
Prepared using sagej.cls
10 Journal Title XX(X)
f0[0] f0[1] f1[0] f37[1]f37[0] f0[2] f0[3] f1[2] f37[3]f37[2]
f37[N]f0[0] f0[1] f0[N] f1[0] f1[1] f1[N]f0[2] f1[2] f37[1]f37[0] f37[2]f0[3] f1[3] f37[3]
f37[N]f37[N-1]f0[N-1] f0[N] f1[N-1]
#define LYOVL ( LY / VL )
// cluster definition
typedef struct { double c [ VL ] ; } vdata_t ;
// CAoSoA type definition
typedef struct { vdata_t p [ NPOP ] ; } caosoa_t ;
vpop_soa_t prv [ LX∗LYOVL ] , nxt [ LX∗LYOVL ] ;
// snippet of collide code to compute density rho
vdata_t rho ;
for ( ip = 0 ; ip < NPOP ; ip++)
#pragma vector aligned
for ( k = 0 ; k < VL ; k++)
rho . c [ k ] = rho . c [ k ] + prv [ idx ] . p [ ip ] . c [ k ] ;
Figure 5. Top: data arrangement for the CAoSoA layout; for
illustration purposes, we take VL=2. Bottom: sample code
for collide using this layout.
ith position of each partition are then packet together
into an array of VL elements called cluster. We call
this layout a Clustered Array of Structure of Array
(CAoSoA) and Figure 5 shows how data are arranged
in memory. This data layout still allows vectorization
of inner structures (clusters) of size VL, and at the
same time improves locality – w.r.t the CSoA – of
populations, as it keeps all population data needed to
process each lattice site at close and aligned addresses.
Figure 5 shows the definition of the vdata t data
type, corresponding to a cluster, and a representative
small sections of the collide code for Intel processors.
Cluster variables are processed iterating on all elements
of the cluster through a loop over VL; pragma vector
aligned instructs the compiler to fully vectorize the
loop since all iterations are independent and memory
accesses are aligned. This data-layout combines the
benefits of the CSoA scheme, allowing aligned memory
accesses and vectorization (relevant for the propagate
kernel) and at the same providing population locality
( together relevant for the collide kernel).
Table 3 shows the impact of the CAoSoA data-layout
on memory misses: on Xeon-Phi the TLB misses have
been reduced well below the threshold values, and on
Xeon-CPU have been reduced by a factor 4.5X w.r.t.
3
3
3
M
SI
ZE
X
 - 
2M
HALOACCELERATORHOST
SIZEY
M
HX
HX
3
3
3
Figure 6. Logic partitioning of the lattice domain among
host and accelerator. The central (dark-green) region is
allocated on the accelerator, side (orange and gray) regions
on the host. Checkerboard textures flag lattice regions
involved in MPI communications with neighbor nodes.
the CSoA scheme. Table 2 shows the execution time
of the collide kernel run using all data data-layouts
defined so far. As wee see, the CAoSoA improves
performances over the CSoA on Intel and AMD
processors while for NVIDIA GPU the two layouts
give marginal differences in performance. These gains
in the performance of collide come at a limited cost
(12− 16%) for propagate on all architectures except
for the K80, so CAoSoA maximizes the combined
performances of the two kernels.
5 Heterogeneous implementation
In this section we describe the implementation of
our code designed to involve and exploit compute
capabilities of both host and accelerators. We only
consider the CAoSoA layout, as it grants the best
overall performances on all processors and accelerators
that we have studied.
Our implementation uses MPI libraries and each
MPI-process manages one accelerator. The MPI-
process runs on the host CPU; part of the lattice
domain is processed on the host itself, and part is
offloaded and processed by the accelerator. Using one
MPI-process per accelerator makes it easy to extend
the implementation to a cluster of accelerators installed
either on the same or on different hosts.
Prepared using sagej.cls
Calore, Gabbana, Schifano, Tripiccione 11
ACCELERATOR
HOST
t1 t2
propagate bc
propagate bc collideMPI comm
 a
sy
nc
 (Q
1,
Q2
)
t3 t4
collide
 a
sy
nc
 (Q
1)
swap host -> device 
swap host <- device 
Q2
Q1
 w
ai
t (
Q1
,Q
2)
 w
ai
t (
Q1
)
 a
sy
nc
 (Q
2)
Figure 7. Control flow executed by each MPI-process. The
schedule executed on the accelerator is on the upper band,
while the one executed by the host is on the lower band.
Execution on the accelerator runs on two concurrent queues.
Synchronization-points are marked with red lines.
The lattice is partitioned among the MPI-processes,
along one direction, in our case the X-direction, and
each slice is assigned to a different MPI-process.
Within each MPI-process each partition is further
divided between host and accelerator. We define three
regions, namely left border, bulk, and right border,
as shown in Figure 6. The right and left borders
include M columns and are allocated on the host
memory while the remaining SIZEX - 2M columns
stay on the accelerator memory. As propagate stencils
require to access neighbor sites at distance up to 3,
see Figure 1, each region is surrounded by a halo of
three columns and rows. Each halo stores a copy of
lattice sites of the adjacent region either allocated on
the host, on the accelerator, or on the neighbor MPI-
process. This arrangement allows to uniformly apply
the propagate kernel to all sites avoiding divergences
in the computation. An added advantage of this
layout is that data involved in MPI communications
is resident on the host and not on the accelerators,
slightly increasing data locality.
Each MPI-process performs a loop over time steps,
and at each iteration it launches in sequence on the
accelerator the propagate, bc and collide kernels,
processing the bulk region. In order to allow the CPU
to operate in overlapped mode, kernels are launched
asynchronously on the same logical execution queue to
ensure in-order execution. After launching the kernels
on the accelerator, the host first updates halos with
adjacent MPI-processes and then starts to process its
left and right borders applying the same sequence of
kernels. After the processing of borders completes,
the host updates the halo regions shared with the
accelerator; this step moves data between host and
accelerator. The control-flow of the code executed by
the MPI-process is shown in Figure 7, where the bc
kernel applies the physical boundary conditions at the
three uppermost and lowermost rows of the lattice.
The code for KNC is implemented using the offload
features available in the Intel compiler and runtime
framework. In this case we have a unique code
and the compiler produces the executable codes for
both Xeon-Phi and CPU. For GPUs the situation is
somewhat different. In fact, we have to use two different
compilers, one for GPUs and one for CPUs. We have
written the code for GPUs, both NVIDIA and AMD,
using OpenACC directives [33] and compiled using the
PGI 16.5 compiler which supports both architectures.
The kernels running on CPUs are written using
standard C and compiled using Intel compiler ICC
v16.0. Then we have linked the two codes in one single
executable.
6 Parameter optimization
In this section we describe two important optimization
steps for our heterogeneous code. The first is about the
optimal partitioning of the computation between host
and accelerator, and the latter is about the optimal
cluster size for the CAoSoA data layout.
6.1 Workload partitioning
Hosts and accelerators have different peak (and sus-
tained) performance, so a careful workload balancing
between the two concurrent processors is necessary. We
model the execution time Texe of our code with the
Prepared using sagej.cls
12 Journal Title XX(X)
0.0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5
2M / LX
20
30
40
50
60
70
80
90
M
LU
PS
1024x4096 HSW + KNC
2048x4096 HSW + KNC
4096x4096 HSW + KNC
1024x4096 HSW + K80
2048x4096 HSW + K80
4096x4096 HSW + K80
1024x4096 HSW + Hawaii
2048x4096 HSW + Hawaii
4096x4096 HSW + Hawaii
Figure 8. Performance of the heterogeneous code (measured
in MLUPS, see the text for definition) for all three platforms,
as a function of the fraction of lattice sites (2M/LX)
mapped on the Haswell (HSW) host CPU. KNC is the Intel
Knights Corner accelerator, K80 is the NVIDIA Tesla GPU
and Hawaii is the AMD GPU. Dots are measured values,
dashed lines are the prediction of our model.
following set of equations:
Texe = max{Tacc, Thost + Tmpi}+ Tswap(4)
Tacc = (LX − 2M)LY · τd (5)
Thost = (2M)LY · τh (6)
Tmpi = τc (7)
where Tacc and Thost are the execution times of the
accelerator and host respectively, Tswap is the time
required to exchange data between host and accelerator
at the end of each iteration, and Tmpi is the time
to move data between two MPI-processes in a multi-
accelerator implementation. As Tswap is independent
of M , Texe is minimal for a value M
∗ for which the
following equation holds:
Tacc(M
∗) = Thost(M
∗) + Tmpi(M
∗) (8)
Our code has an initial auto-tuning phase, in
which it runs a set of mini-benchmarks to estimate
approximate values for τd, τh, τc. These are then
inserted in Equation 8 to find M∗, an estimate of the
value of M that minimizes time to solution.
In Figure 8 we show the performance of our code for
three different lattice sizes as a function of 2M/LX,
the fraction of lattice sites that we map on the host
CPU. We have run our tests on two different machines,
the Galileo HPC system installed at CINECA and
the Etna machine. Galileo has two different partitions,
0.0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 0.22 0.24
2M / LX
0
20
40
60
80
100
M
LU
PS
VL = 1 
VL = 2 
VL = 4 
VL = 8 
VL = 16
VL = 32
VL = 64
Figure 9. Performance (measured in MLUPS) for different
values of the VL parameter in the CaoSoA data layout;
results are for a Tesla K80 GPU.
one with K80 GPUs and one with KNC accelerators.
The Etna machine is part of the COKA experimental
cluster at Universita` di Ferrara, and has two AMD
Hawaii GPUs. Both machines use as host processor
an 8-core Intel Xeon E5-2630v3 CPU based on the
Haswell micro-architecture, and each host has one
attached accelerator. Performance is measured using
the MiLlion Updates per Second (MLUPS) metric,
a common option for this class of applications. In
Figure 8 dots are measured values, while lines are
our predictions. Our auto-tuning strategy predicts
performance with good accuracy, and estimates the
workload distribution between host and device for
which the execution time reaches its minimum. An
interesting features of this plot is the fact that the
optimal point is – for each accelerator – a function
of 2M/LX. As expected, for values of M < M∗ and
M > M∗ performances decrease because the workload
is unbalanced either on the accelerator or on the
host side; results at 2M/LX = 0 correspond to earlier
implementations in which critical kernels are fully
offloaded to accelerators; we see that running these
kernels concurrently on host and accelerators (KNC
and K80) performances increases approximately by
10− 20%. Finally, as M becomes much larger than
M∗, all lines in the plot fall on top of each other, as in
this limit the host CPU handles the largest part of the
overall computation.
Prepared using sagej.cls
Calore, Gabbana, Schifano, Tripiccione 13
6.2 Fine-tuning of data-layout cluster size
An important parameter of the CAoSoA layout is the
cluster size VL, as performance depends significantly
on its value. This parameter, whose optimal value
correlates with the hardware features of the target
processor, affects data allocation in memory and must
be fixed at compile time.
In Figure 9 we show the impact on performances
(measured again in MLUPS) of our code running on
a node with K80 GPUs and using different choices for
VL. We see that, for this processor, a wrong choice may
reduce performance by large factors (' 5); good news
is that there is a reasonably large interval VL=16,32,64
for which performance is close to its largest value. We
have done the same measurements for all kind of nodes
available with KNC and AMD GPUs, and then picked,
for each node and each value of VL, the best operating
point in terms of 2M/LX. In this way, we select the
highest performance for each combination of host and
accelerator. These results are collected in Figure 10
as a function of VL; on top of each bar we report the
corresponding value of 2M/LX. We see that GPUs
are more robust than the KNC against a non optimal
choice of VL: for the former processors performance
remains stable as long as VL is large enough, while for
the latter only one or two VL values allow to reach
the highest performance. Fortunately enough, there is
a window of VL values for which all systems are close
to their best performance.
6.3 Performance prediction on new hardware
A further interesting result of the model developed
in section 6.1, is that we can use it to predict
to which extent the performance of our codes is
affected if either the host CPU or the accelerator
is replaced by a different processor; in particular
one may ask what happens if announced but not
yet available processors or accelerators are adopted.
One such exercise replaces the host processor that
we have used for our previous tests with the new
Intel multicore Xeon E5-2697v4, based on the latest
Broadwell micro-architecture. We have run our code
1 2 4 8 16 32 64
VL
0
20
40
60
80
100
M
LU
Ps
0.41
0.26
0.30
0.34
0.450.34
0.31
0.23
0.14
0.13 0.12
0.11
0.31
0.34
0.35 0.26
0.23
0.23
0.18
HSW + KNC
HSW + K80
HSW + Hawaii
Figure 10. Impact on performance of different cluster sizes
(VL) in the CaoSoA data-layout for several accelerator
choices.
0.0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5
2M / LX
20
30
40
50
60
70
80
90
100
M
LU
PS
1024x4096 HSW + KNC
2048x4096 HSW + KNC
4096x4096 HSW + KNC
1024x4096 BRD + KNC
2048x4096 BRD + KNC
4096x4096 BRD + KNC
1024x4096 HSW + K80
2048x4096 HSW + K80
4096x4096 HSW + K80
1024x4096 BRD + K80
2048x4096 BRD + K80
4096x4096 BRD + K80
Figure 11. Performance predictions (dashed-lines) of our
model for a would-be system using as host the
recently-released Broadwell (BRD) CPU compared with
measured data on a Haswell (HSW) CPU (dots).
Measurements refer to three different lattice sizes.
on a Broadwell processor with no attached accelerators
and measured the host-related performance parameters
used in Equation 4; we have then used these parameters
to estimate the expected performance of a would-
be machine whose nodes combine Broadwell hosts
with either K80 or KNC accelerators. Results are
shown in Figure 11 where we compares the measured
performance on the present hardware (dots) with the
predictions of our model (dashed lines). We see that
using this new more powerful processor, performance
for our code would improve by approximately 20%,
when perfectly balancing the workload between host
and accelerator. As expected the improvement is the
same for both types of accelerators, since we have
only virtually replaced the host processor. Another
interesting feature of the plot is that the model
predictions overlap with measured data when 2M/LX
Prepared using sagej.cls
14 Journal Title XX(X)
values tends to zero; this is expected, since in this case
the fraction of lattice sites mapped on the host-CPU
tends to zero and the execution time is dominated by
the accelerator. A similar analysis might be performed,
for instance, to assess the overall performance gains
to be expected when next generation GPUs become
available.
7 Scalability performances
In this section we analyze scalability performances of
our codes running on the Galileo machine, both on the
K80 and on the KNC partition.
In Figure 12 top-left we show the performance of
our code running on larger and larger KNC partitions
of the Galileo cluster, and for several physically
relevant sizes of the physical lattice, showing the
scaling results of this code. We compare with a
previous v1 implementation of the code running all
kernels on the KNC accelerator. Figure 12 top-right
shows the same data as the previous picture showing
however the speed-up factor as a function of the
number of accelerators. One easily sees that the new
heterogeneous version of the code is not only faster
than its accelerator-only counterpart but also has a
remarkably better behavior from the point of view
of hard scaling. This is due to the fact that data
moved through MPI communications between different
processing nodes is always resident on the host, saving
time to move them to and from KNC accelerator. All
in all, for massively parallel runs on many accelerators,
the heterogeneous code extracts from the same KNC-
based hardware system roughly 2× larger performance
than the v1 version.
In Figure 12 bottom-left we show the performance
of our code running on the K80 partition. In this
case, due to the larger difference in performance
between the host-CPU and the accelerator we have a
different behavior. Version v2 is faster than version
v1, but the gain is less compared to KNC because
the gap in performance between the K80 GPU and
the host CPU is larger. Scalability, see Figure 12
bottom-right, is good as version v1 because in
both implementations MPI communications a fully
overlapped with computation ([34]).
8 Analysis of results and conclusions
In this section we analyze our results and outline our
conclusions.
The first important contribution of this work
highlights the critical role played by data layouts
in the development of a common LB code that
runs concurrently on CPUs and accelerators and is
also performance-portable onto different accelerator
architectures. The crucial finding in this respect
is that data memory organization should support
at the same time efficient memory accesses and
vector processing. This challenge is made more
difficult because different kernels (in our case the
critical propagate and collide routines) have conflicting
requirements. Table 2 substantiates the relevance of
this problem and quantifies the improvements that
we have achieved. Previously used data structures are
the AoS layout supporting data-locality, and the SoA
layout already known to exploit more vectorization
especially for GPUs. The problem with these two
layouts is that the former allows better performances
for Intel architectures but is dramatically bad for
GPUs, with performance losses that are up to 10X for
propagate and 2X − 5X for collide. The latter scheme is
very efficient on GPUs but it fails on CPUs, as it limits
code vectorization and causes memory management
overheads (like TLB and cache misses).
This paper introduces two slightly more complex
data layouts, the CSoA and the CAoSoA to exploit
vectorization on both classes of architecture and
still guaranteeing efficient data memory accesses and
vector processing for both critical kernels. These data-
structures differ from those used in [23] and [25] in
the way populations are packed into SIMD vectors;
this allows to properly translate operations involved
in the propagate kernel into SIMD instructions, and
to perform aligned memory accesses.
The CSoA layout improves performance on Intel
architectures by factors 2X over the AoS for propagate
Prepared using sagej.cls
Calore, Gabbana, Schifano, Tripiccione 15
2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32
Number of MIC devices
0
200
400
600
800
1000
1200
M
LU
P
S
2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32
Number of MIC devices
2
4
6
8
10
12
14
16
18
20
22
24
26
28
30
32
R
e
la
ti
v
e
 S
p
e
e
d
U
p
L1680x8192 - v1 L3360x8192 - v1 L6720x8192 - v1 L1680x8192 - v2 L3360x8192 - v2 L6720x8192 - v2
0 4 8 12 16 20 24 28 32 36 40
Number of GPUs
0
200
400
600
800
1000
1200
1400
1600
M
LU
P
S
0 4 8 12 16 20 24 28 32 36 40
Number of GPUs
0
4
8
12
16
20
24
28
32
36
40
R
e
la
ti
v
e
 S
p
e
e
d
U
p
L2880x8192 - v1 L5760x8192 - v1 L11520x8192 - v1 L2880x8192 - v2 L5760x8192 - v2 L11520x8192 - v2
Figure 12. Multi-node scalability results measured on the KNC (top) and K80 (down) partitions of the Galileo cluster. We
compare performances – MLUPS and relative speedup – on three different lattice sizes of the heterogeneous code described
in this paper (v2) with an earlier version (v1) with all critical computational kernels running on accelerators only.
and 1.5X for collide. On GPUs it has also a marginal
improvements compared to the already very good
results that GPUs have with the original SoA layout.
A final improvements is given by the CAoSoA layout,
that further increases data locality without introducing
vectorization penalties. Again, performance remains
substantially stable on GPUs in this case, while there is
still a further improvements on Intel architectures for
collide (≈ 10− 20%) and a corresponding penalty for
propagate; since the former routine has a bigger impact
on overall performance, the CAoSoA layout is the most
efficient to be used for the whole code.
The definition of the appropriate layouts has allowed
to code our LB application in a common program that
executes concurrently on host CPUs and all kinds of
accelerators, obviously improving the portability and
long-term support of this code.
Another important contribution of this paper
is the development of an analytic model (see
section 6.1) able to predict the optimal partitioning
of the workload among host-CPU and accelerators;
using this model we can automatically tune this
parameter for best performance on running systems,
or predict performances for not yet available hardware
configurations (see section 6.3).
The final result of this contribution is that single
node performances can be improved by ≈ 10− 20%
w.r.t. earlier implementations that simply wasted
the computational power offered by the host CPU.
Our implementation also significantly improves the
(hard)-scaling behavior on relatively large clusters, see
Figure 12. This follows from the fact that node-to-
node communications in our code do not imply host-
accelerator transfers. This improvement is very large
on KNC-accelerated clusters, while on GPUs, due to
the larger performance unbalance between host and
accelerator, the observed improvement is smaller.
Prepared using sagej.cls
16 Journal Title XX(X)
In conclusion, an important result of this work is that
it is possible to design and implement directive-based
codes that are performance-portable across different
present – and hopefully future – architectures. This
requires an appropriate choice of the data layout, able
to meet conflicting requirements of different parts of
the code and to match hardware features of different
architectures. Defining these data layout is today in
the hands of programmers, and still out of the scope
of currently available and stable compilers commonly
used in the HPC context. For this reason, on a longer
time horizon, we look forward to further progress in
compilers allowing data definitions that abstract from
the actual in-memory implementation, and of tools
able to make appropriate allocation choices for specific
target architectures.
Acknowledgements
This work was done in the framework of the COKA,
COSA and Suma projects of INFN. We thank CINECA
(Bologna, Italy) for access to the Galileo computing cluster.
AG has been supported by the European Union’s Horizon
2020 research and innovation programme under the Marie
Sklodowska-Curie grant agreement No 642069.
References
[1] TOP500. Top500 the list, 2016. URL http://top500.
org. Last visited on 2016-09-01.
[2] Han T and Abdelrahman T. hiCUDA: High-Level
GPGPU Programming. Parallel and Distributed
Systems, IEEE Transactions on 2011; 22(1): 78–90.
DOI:10.1109/TPDS.2010.62.
[3] Lee S and Eigenmann R. OpenMPC: Extended
OpenMP Programming and Tuning for GPUs. In
Proceedings of the 2010 ACM/IEEE International
Conference for High Performance Computing, Net-
working, Storage and Analysis. SC ’10, Washington,
DC, USA: IEEE Computer Society. ISBN 978-1-4244-
7559-9, pp. 1–11. DOI:10.1109/SC.2010.36.
[4] Ayguade´ E, Badia RM, Bellens P et al. Extending
OpenMP to Survive the Heterogeneous Multi-
Core Era. International Journal of Parallel
Programming 2010; 38(5-6): 440–459. DOI:10.1007/
s10766-010-0135-4.
[5] OpenMP. The openmp api specification for parallel
programming, 2016. URL http://www.openmp.org/.
Last visited on 2016-09-01.
[6] OpenACC. Openacc directives for accelerators,
2016. URL http://www.openacc-standard.org. Last
visited on 2016-09-01.
[7] OpenMP. Openmp application program interface
version 4.0, 2016. URL http://www.openmp.org/
mp-documents/OpenMP4.0.0.pdf. Last visited on
2016-09-01.
[8] NVIDIA. Tesla K80 GPU Accelerator - Board
Specification. Technical report, 2015.
[9] AMD. AMD FirePro W9100 Workstation Graphics.
Technical report, 2016.
[10] George Chrysos IC. Intel Xeon Phi X100 Family
Coprocessor - the Architecture. Technical report, 2012.
[11] Pohl T, Deserno F, Thurey N et al. Performance
Evaluation of Parallel Large-Scale Lattice Boltzmann
Applications on Three Supercomputing Architectures.
In Proceedings of the 2004 ACM/IEEE Conference on
Supercomputing. SC ’04, Washington, DC, USA: IEEE
Computer Society. ISBN 0-7695-2153-3, pp. 21–. DOI:
10.1109/SC.2004.37.
[12] Belletti F, Biferale L, Mantovani F et al. Multiphase
lattice boltzmann on the cell broadband engine. Nuovo
Cimento della Societa Italiana di Fisica C 2009; 32(2):
53–56. DOI:10.1393/ncc/i2009-10379-6.
[13] Biferale L, Mantovani F, Pivanti M et al. Lattice
Boltzmann fluid-dynamics on the QPACE supercom-
puter. Procedia Computer Science 2010; 1(1): 1075–
1082. DOI:10.1016/j.procs.2010.04.119. ICCS 2010.
[14] Pivanti M, Mantovani F, Schifano SF et al. An
optimized lattice boltzmann code for BlueGene/Q.
In Wyrzykowski R, Dongarra J, Karczewski K et al.
(eds.) Parallel Processing and Applied Mathematics:
10th International Conference, PPAM 2013, Warsaw,
Poland, September 8-11, 2013, Revised Selected
Papers, Part II. Lecture Notes in Computer Science,
Berlin, Heidelberg: Springer Berlin Heidelberg. ISBN
978-3-642-55195-6, 2014. pp. 385–394. DOI:10.1007/
Prepared using sagej.cls
Calore, Gabbana, Schifano, Tripiccione 17
978-3-642-55195-6 36.
[15] Mantovani F, Pivanti M, Schifano SF et al.
Performance issues on many-core processors: A D2Q37
lattice boltzmann scheme as a test-case. Computers &
Fluids 2013; 88: 743 – 752. DOI:10.1016/j.compfluid.
2013.05.014.
[16] Bailey P, Myre J, Walsh SD et al. Accelerating
lattice Boltzmann fluid flow simulations using graphics
processors. In Parallel Processing, 2009. ICPP’09.
International Conference on. IEEE, pp. 550–557. DOI:
10.1109/ICPP.2009.38.
[17] Biferale L, Mantovani F, Pivanti M et al. An
optimized D2Q37 lattice boltzmann code on GP-
GPUs. Computers & Fluids 2013; 80: 55 – 62. DOI:
10.1016/j.compfluid.2012.06.003.
[18] Calore E, Gabbana A, Kraus J et al. Massively parallel
latticeboltzmann codes on large GPU clusters. Parallel
Computing 2016; 58: 1 – 24. DOI:10.1016/j.parco.
2016.08.005.
[19] Crimi G, Mantovani F, Pivanti M et al. Early
Experience on Porting and Running a Lattice
Boltzmann Code on the Xeon-phi Co-Processor.
Procedia Computer Science 2013; 18: 551–560. DOI:
10.1016/j.procs.2013.05.219.
[20] Sano K, Pell O, Luk W et al. FPGA-based Streaming
Computation for Lattice Boltzmann Method. In
Field-Programmable Technology, 2007. ICFPT 2007.
International Conference on. pp. 233–236. DOI:10.
1109/FPT.2007.4439254.
[21] Wittmann M, Zeiser T, Hager G et al. Comparison of
different propagation steps for the lattice boltzmann
method. CoRR 2011; abs/1111.0922.
[22] Shet AG, Sorathiya SH, Krithivasan S et al. Data
structure and movement for lattice-based simulations.
Phys Rev E 2013; 88: 013314. DOI:10.1103/
PhysRevE.88.013314.
[23] Shet AG, Siddharth K, Sorathiya SH et al.
On vectorization for lattice based simulations.
International Journal of Modern Physics C 2013; 24.
DOI:10.1142/S0129183113400111.
[24] Calore E, Demo N, Schifano SF et al. Experience on
vectorizing lattice boltzmann kernels for multi- and
many-core architectures. In Wyrzykowski R, Deelman
E, Dongarra J et al. (eds.) Parallel Processing and
Applied Mathematics: 11th International Conference,
PPAM 2015, Krakow, Poland, September 6-9, 2015.
Revised Selected Papers, Part I. Lecture Notes in
Computer Science, Cham: Springer International
Publishing. ISBN 978-3-319-32149-3, 2016. pp. 53–
62. DOI:10.1007/978-3-319-32149-3 6.
[25] Valero-Lara P, Igual FD, Prieto-Matas M et al. Accel-
erating fluidsolid simulations (lattice-boltzmann &
immersed-boundary) on heterogeneous architectures.
Journal of Computational Science 2015; 10: 249 – 261.
DOI:10.1016/j.jocs.2015.07.002.
[26] Valero-Lara P and Jansson J. Heterogeneous cpu+gpu
approaches for mesh refinement over lattice-boltzmann
simulations. Concurrency and Computation: Practice
and Experience 2016; : 1 – 20DOI:10.1002/cpe.3919.
[27] Succi S. The Lattice-Boltzmann Equation. Oxford
university press, Oxford, 2001.
[28] Sbragaglia M, Benzi R, Biferale L et al.
Lattice Boltzmann method with self-consistent
thermo-hydrodynamic equilibria. Journal of
Fluid Mechanics 2009; 628: 299–309. DOI:
10.1017/S002211200900665X.
[29] Scagliarini A, Biferale L, Sbragaglia M et al. Lattice
Boltzmann methods for thermal flows: Continuum
limit and applications to compressible Rayleigh–
Taylor systems. Physics of Fluids (1994-present) 2010;
22(5): 055101. DOI:10.1063/1.3392774.
[30] Biferale L, Mantovani F, Sbragaglia M et al. Second-
order closure in stratified turbulence: Simulations
and modeling of bulk and entrainment regions.
Physical Review E 2011; 84(1): 016305. DOI:10.1103/
PhysRevE.84.016305.
[31] Biferale L, Mantovani F, Sbragaglia M et al. Reactive
Rayleigh-Taylor systems: Front propagation and non-
stationarity. EPL 2011; 94(5): 54004. DOI:10.1209/
0295-5075/94/54004.
[32] Kraus J, Pivanti M, Schifano SF et al. Benchmarking
GPUs with a parallel Lattice-Boltzmann code. In
Computer Architecture and High Performance Com-
puting (SBAC-PAD), 25th International Symposium
Prepared using sagej.cls
18 Journal Title XX(X)
on. IEEE, pp. 160–167. DOI:10.1109/SBAC-PAD.
2013.37.
[33] Calore E, Gabbana A, Kraus J et al. Performance
and portability of accelerated lattice boltzmann
applications with openacc. Concurrency and
Computation: Practice and Experience 2016; 28(12):
3485–3502. DOI:10.1002/cpe.3862.
[34] Calore E, Marchi D, Schifano SF et al. Optimizing
communications in multi-gpu lattice boltzmann
simulations. In High Performance Computing
Simulation (HPCS), 2015 International Conference
on. pp. 55–62. DOI:10.1109/HPCSim.2015.7237021.
Prepared using sagej.cls
