Early Experience on Using Knights Landing Processors for Lattice
  Boltzmann Applications by Calore, Enrico et al.
Enrico Calore (1,2), Alessandro Gabbana (1,2), Sebastiano Fabio Schifano (1,2), and
Raffaele Tripiccione (1,2)
(1) Università di Ferrara, Ferrara, ITALY
(2) INFN Ferrara, ITALY
Abstract. Knights Landing (KNL) is the second generation of Intel processors
based on the Many Integrated Cores (MIC) architecture, targeting the
HPC application segment. It delivers massive thread and data parallelism
together with high on-chip memory bandwidth in a standalone
processor that can boot an off-the-shelf Linux operating system. KNL
provides more than 3 TFlops of computing power for double-precision
computation, doubling to 6 TFlops for single-precision. In this work we
assess the performance of this new processor for Lattice Boltzmann codes,
widely used in computational fluid-dynamics. We design and implement
an OpenMP code, and evaluate the impact of several data memory layouts
in meeting the different computing requirements of distinct parts of the
application, aiming to exploit a large fraction of the available peak computing
throughput. We also perform a preliminary analysis of energy efficiency,
evaluating the time-to-solution and average power consumption for each
memory layout, and make some comparisons with other processors and
accelerators.
1 Introduction
High-end processors, the building blocks of HPC computer systems, have seen a
steady increase in the number of processing cores, with cores able to perform
more and more operations per clock-cycle. This trend has been pushed further
in accelerators, such as GPUs and Many Integrated Cores (MIC) processors,
offering large computing power together with a significant computing
efficiency, i.e. a high ratio of computing power per Watt. However, the use of
accelerated systems is not without problems. The link between host CPU and
accelerator, usually based on a PCIe interface, creates a data bottleneck that
reduces the sustained performance of most applications. Reducing the impact of
this bottleneck in heterogeneous systems requires complex implementations [1,2]
with a non-negligible impact on development and maintenance efforts.
The latest generation Intel MIC accelerator, the Knights Landing (KNL)
Xeon-Phi processor, offers a way out of this problem: it is a self-hosted system,
running a standard Linux operating system, so it can be used alone to assemble
homogeneous clusters.
arXiv:1804.01918v1 [cs.DC] 5 Apr 2018
In this work we present an early assessment of the performance of the KNL
processor, using as test-case a state-of-the-art Lattice Boltzmann (LB) code. This
application is very interesting for benchmarking purposes, as its two main crit-
ical compute-intensive kernels, propagate and collide, are respectively strongly
memory bound and compute bound. For regular applications like LB codes,
task parallelism is easily achieved by assigning tiles of the physical lattice to different
cores. However, exploiting data-parallelism through vectorization requires
additional care, and in particular a careful design of the data layout is criti-
cal to allow an efficient use of vector instructions. Our code uses OpenMP to
manage task parallelism, and we experiment with different data-layouts trying
to find a compromise between the conflicting requirements of the propagate and
collide kernels. We then assess the impact of several layout choices in terms of
computing and energy performance.
Recent works have studied the performance of the KNL [3,4,5] with several
applications, but as far as we know none of them investigates the impact of data
layouts on performance and energy efficiency.
The rest of this paper is organized as follows: section 2 gives a short
overview of the KNL architecture, highlighting the main features relevant for
this work; section 3 briefly sketches an outline of the Lattice Boltzmann method,
while section 4 presents the various options for data-layout that we have studied;
section 5 analyzes our results and ends with some concluding remarks.
2 Overview of Knights Landing Architecture
The Knights Landing (KNL) is the second generation of Intel processors based on
the MIC architecture, and the first self-bootable processor in this family. It has an
array of 64, 68 or 72 cores and four high speed memory banks based on the Multi-
Channel DRAM (MCDRAM) technology providing an aggregate bandwidth of
more than 450 GB/s [6]; it also integrates 6 DDR4 channels supporting up to
384 GB of memory with a peak raw bandwidth of 115.2 GB/s. Two cores
form a tile and share an L2-cache of 1 MB; tiles are connected by a 2D-mesh
of rings and can be clustered in several NUMA configurations. In this work we
only consider the Quadrant cluster configuration, in which tiles are divided in four
quadrants, each directly connected to one MCDRAM bank. This configuration
is the recommended one to use the KNL as a symmetric multi-processor, as it
reduces the latency of L2-cache misses, and the 4 blocks of MCDRAM appear as
a contiguous block of addresses. For more details on clustering see [7]. MCDRAM
on a KNL can be configured at boot time in Flat, Cache or Hybrid mode. The
Flat mode defines the whole MCDRAM as addressable memory allowing explicit
data allocation, whereas Cache mode uses the MCDRAM as a last-level cache
between the L2-caches and the on-platform DDR4 memory. In Hybrid mode, the
MCDRAM is used partly as addressable memory and partly as cache. For more
details on memory configuration see [8]. In this work we only consider Flat and
Cache modes. Parallelism is exploited at two levels on the KNL: task parallelism
builds on the large number of integrated cores, while data parallelism uses the
512-bit AVX-512 vector (SIMD) instructions. Each core has two out-of-order vector
processing units (VPUs) and supports the execution of up to 4 threads. The KNL
has a peak theoretical performance of 6 TFlops in single precision and 3 TFlops
in double precision. Typical thermal design power (TDP) is 215 W including
MCDRAM memories (but not the Omni-Path interface). For more details on
the KNL architecture see [9].
Fig. 1. Left: Velocity vectors for the LB populations in the D2Q37 model. Right:
populations are identified by an arbitrary label, associated with the lattice hop that they
perform in the propagate phase.
3 Lattice Boltzmann Methods
Lattice Boltzmann (LB) methods [10] are widely used in computational
fluid-dynamics, in science and engineering, to accurately model single and
multi-phase flows, and can easily accommodate irregular boundary conditions.
This is why they are commonly used in the oil and gas industry to study the
dynamics of oil and shale-gas reservoirs and to maximize their yield.
This class of applications, discrete in time and momenta and living on a dis-
crete and regular grid of points, offers a large amount of available parallelism, so
they are an ideal target for multi- and many-core processors. They are based on
the synthetic dynamics of populations sitting at the sites of a discrete lattice. At
each time step, populations propagate from lattice site to lattice site, and then
collide, mixing and changing their values accordingly. In these processes, there
is no data dependency between different lattice points, so both the propagate
and collide steps can be performed in parallel on all grid points following any
convenient schedule.
A model describing flows in x dimensions and using y populations is labeled
as DxQy. In this work we study a D2Q37 model, a 2-dimensional system with
37 populations associated with each lattice site, corresponding to (pseudo-)particles
moving up to three lattice points away, as shown in figure 1. This recently devel-
oped [11,12] LB model automatically enforces the equation of state of a perfect
gas (p = ρT ); it has been recently used to perform large scale simulations of con-
vective turbulence in several physics regimes [13,14]. The D2Q37 model is com-
putationally more demanding than earlier methods; indeed, propagate implies
accessing 37 neighbor cells to gather all populations, while collide executes
≈ 6600 double-precision floating point operations per lattice point.
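As a rough consistency check on why the two kernels sit at opposite ends of the compute/memory balance, the per-site figures above can be turned into arithmetic intensities. This is only a sketch under assumptions not stated in the source: each kernel reads and writes every double-precision population exactly once, and cache reuse and halo traffic are ignored. The symbols B_site, I_collide and I_KNL are introduced here for illustration; the peak figures are those quoted in section 2.

```latex
B_{\mathrm{site}} = 2 \times 37 \times 8~\mathrm{B} = 592~\mathrm{B}

I_{\mathrm{propagate}} = \frac{0~\mathrm{flop}}{592~\mathrm{B}} = 0,
\qquad
I_{\mathrm{collide}} \approx \frac{6600~\mathrm{flop}}{592~\mathrm{B}} \approx 11~\mathrm{flop/B}

I_{\mathrm{KNL}} \approx \frac{3 \times 10^{12}~\mathrm{flop/s}}{450 \times 10^{9}~\mathrm{B/s}} \approx 6.7~\mathrm{flop/B}
```

Under these assumptions collide lies above the machine balance of the KNL (compute bound), while propagate, which only moves data, is limited by memory bandwidth.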
Fig. 2. Top to bottom, AoS, SoA, CSoA and CAoSoA data memory layouts for a 2×8
lattice with two populations (red and blue) per site and VL=2. For CSoA and CAoSoA
each grey-box is a cluster. Memory addresses increase left-to-right top-to-bottom.
4 Implementation and Optimization of D2Q37 LB model
In all LB methods propagate and collide use most of the compute cycles of the
whole application, so optimization efforts have to focus largely on these two
kernels. Data allocation policies and memory-layout decisions are becoming more
and more important for performance on recent many-core processors; this is
even more so, as we have to find a data layout that matches the conflicting
requirements of these two kernels. In this section we focus mostly on this point,
discussing several possible choices and showing that they have very large effects
on the obtained performance (and energy-performance) for the KNL processor;
here, we extend previous works [15,2], where additional details on other aspects
of the code structure are available.
Array of Structures (AoS) and Structure of Arrays (SoA) offer a starting point
for considering more complex data memory organizations. In the AoS scheme,
population data associated with each lattice site are stored one after the other at
contiguous memory addresses. In this arrangement all data associated with one
lattice point are at close memory locations, but same-index populations of
different lattice sites are stored in memory at non-unit strided addresses; this makes
it more difficult to process them using vector SIMD instructions. Conversely, the
SoA scheme stores same-index populations of all sites one after the other; this
is appropriate for vector SIMD instructions, as it allows several lattice
sites (8 for the KNL) to be moved in parallel. Figure 2 (first two designs at the top)
visualizes the AoS and SoA layouts, for a sample case of a lattice of 2 × 8 sites
with only two populations (red and blue). The SoA layout still has a potential
inefficiency associated with unaligned memory accesses; in fact, the read-address for
population values is computed as the sum of the address of the current site plus an
offset, and the resulting address is in general not aligned to a 64 Byte boundary,
preventing direct memory copies to vector registers.
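The difference between the two basic layouts can be made concrete with a pair of index functions. The following is a minimal sketch, not the authors' code: the function names idx_aos and idx_soa are hypothetical, and the sizes are those of the 2 × 8 sample lattice of Figure 2, with y taken as the fastest-running coordinate.

```c
#include <stddef.h>

/* Illustrative sizes only: the 2 x 8 lattice with two populations of Figure 2. */
enum { LX = 2, LY = 8, NPOP = 2 };

/* AoS: the NPOP populations of one site are adjacent in memory,
 * so the same population of neighbouring sites is NPOP elements apart. */
static size_t idx_aos(size_t x, size_t y, size_t p) {
    return (x * LY + y) * NPOP + p;
}

/* SoA: population p of all sites forms one contiguous array,
 * so the same population of neighbouring sites is at unit stride. */
static size_t idx_soa(size_t x, size_t y, size_t p) {
    return p * (LX * LY) + (x * LY + y);
}
```

In AoS a vector load of one population across 8 consecutive sites would have to gather elements NPOP apart, whereas in SoA the same 8 values are contiguous; the price is that the populations of a single site are now LX × LY elements apart.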
In order to circumvent this problem, we start from the SoA layout and, for
a lattice of size LX × LY, we cluster together VL elements of each population
at a distance LY/VL, with VL a multiple of the KNL vector size. We call this
data layout a Cluster Structure of Arrays (CSoA); see Figure 2 (third design
from top) for the case of VL = 2, corresponding to a hypothetical processor
using vectors of length 2. Using CSoA, propagate, whose main task is to read
the same population elements at all sites and move them to different sites, is
able to use vector instructions to process clusters of properly memory-aligned
items. Figure 3 shows the corresponding C type definitions and code implementation
for propagate. The loop on X is parallelized among the threads using the
OpenMP pragma parallel for, making each thread work on a slice of the
lattice; the inner loop, copying elements of a cluster into another cluster, can
be unrolled and vectorized since both read and write pointers are now properly
aligned. A further optimization can be applied in this case with the use of
non-temporal write operations, saving time and reducing the overall memory traffic
by 1/3 [2]. We instruct the compiler to use these optimizations using the pragmas
unroll and vector aligned nontemporal. Figure 4 shows the measured bandwidth
for our data structures, using the FLAT memory mode with either
off-chip DDR4 or MCDRAM memory, and the CACHE memory mode. Data refer to a
64 core Xeon-Phi 7230 running at 1.4 GHz.
#define LYOVL (LY / VL)
typedef struct { double c[VL]; } vdata_t;             // one cluster: VL doubles
typedef struct { vdata_t s[LX*LYOVL]; } vpop_csoa_t;  // all clusters of one population
vpop_csoa_t prv[NPOP], nxt[NPOP];                     // source and destination lattices

// NYOVL, HYOVL, SIZEYOVL, startX and endX are defined elsewhere in the code.
#pragma omp parallel for num_threads(NTHREAD) schedule(dynamic)
for (int ix = startX; ix < endX; ix++) {
  int idx = (NYOVL*ix) + HYOVL;          // base cluster index for column ix
  for (int p = 0; p < NPOP; p++) {
    for (int iy = 0; iy < SIZEYOVL; iy++) {
      #pragma unroll
      #pragma vector aligned nontemporal
      for (int k = 0; k < VL; k++)       // copy one cluster, lane by lane
        nxt[p].s[idx+iy].c[k] = prv[p].s[idx+iy+OFF[p]].c[k];
    }
  }
}
Fig. 3. Source code of the propagate kernel using the CSoA data layout. OFF is a
vector containing the memory-address offsets associated with each population hop. VL is the
size of a cluster.
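The CSoA clustering can be sketched as an index function mapping a site (x, y) and population p to an element offset. This is an illustration under assumptions, not the authors' implementation: the name idx_csoa is hypothetical, halo columns are ignored, and the lane assignment k = y / LYOVL is one reading of "VL elements at a distance LY/VL". Sizes again follow the 2 × 8, VL = 2 example of Figure 2.

```c
#include <stddef.h>

/* Illustrative sizes only: Figure 2 example, clusters of VL = 2 elements. */
enum { LX = 2, LY = 8, NPOP = 2, VL = 2, LYOVL = LY / VL };

/* CSoA: element offset (in doubles) of population p at site (x, y).
 * Sites y, y + LYOVL, y + 2*LYOVL, ... of one column share a cluster. */
static size_t idx_csoa(size_t x, size_t y, size_t p) {
    size_t iy = y % LYOVL;  /* position of the cluster inside the column */
    size_t k  = y / LYOVL;  /* lane of this site inside the cluster      */
    return ((p * LX + x) * LYOVL + iy) * VL + k;
}
```

Every cluster starts at a multiple of VL elements, so with VL = 8 doubles and a 64 Byte aligned base address each cluster is aligned to a 64 Byte boundary, which is what allows aligned vector loads and stores in propagate. Populations of the same site remain LX × LY elements apart, as in plain SoA.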
Fig. 4. Performance of propagate (top) and collide (bottom) using the AoS, SoA, CSoA
and CAoSoA data layouts. Performance for propagate is shown in MLUPS, defined in
the text. All data for a 64 core Xeon-Phi 7230 running at 1.4 GHz. For the FLAT
configuration we use a 2304 × 8192 lattice that fits into MCDRAM; for the CACHE
configuration, the lattice is 4608×12288, twice the size of MCDRAM. For each layout,
3 groups of 4 bars correspond respectively to FLAT-DDR4, FLAT-MCDRAM and
CACHE. Within each group, bars correspond respectively to 1, 2, 3 and 4 threads per
core.
The collide kernel can be vectorized using the same strategy as propagate,
so one expects the CSoA layout to be an efficient choice; however, profiling
the execution of this kernel, we found that a large number of TLB misses are
generated. This happens because the different populations associated with each lattice
site are stored at memory addresses far from each other, and several non-unit
stride reads are necessary to load all the population values needed to compute
the collisional operator. We then introduce yet another data layout, in order to
reduce this penalty. We start again from the SoA layout, and for each population
array, we divide each Y-column into VL partitions, each of size LY/VL; all elements
sitting at the i-th position of each partition are then packed together into an array
of VL elements called a cluster. For each index i we then store in memory, one after
the other, the 37 clusters (one for each population) associated with it. This defines
the Clustered Array of Structure of Arrays (CAoSoA); the main improvement on
CSoA is that it still allows vectorization of clusters of size VL, and at the same
time improves locality of populations, keeping all population data associated with
each lattice site at close and aligned addresses; see again Figure 2 for a visual
description. This data layout combines the benefits of the CSoA scheme, allowing
aligned memory accesses and vectorization (relevant for the propagate kernel),
and at the same time providing population locality (the two together being relevant
for the collide kernel). Figure 4 shows the measured performance of the collide kernel,
expressed in Million Lattice UPdates per Second (MLUPS), a common figure of
performance for these codes, for all data layouts considered so far. For a lattice
of 2560 × 8192 sites, using CAoSoA we have reduced to zero the number of TLB misses
of collide (measured using the hardware counter PAGE_WALKS.D_SIDE_WALKS), compared
with almost 2 billion misses for CSoA; correspondingly, the number of clock ticks
(counter CPU_CLK_UNHALTED.REF_TSC) decreases by approximately 25%. The figure also
shows that the performance of the propagate kernel is unchanged between the CSoA
and CAoSoA layouts.
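The CAoSoA arrangement described above can likewise be sketched as an index function. As before this is an illustration under assumptions rather than the authors' code: idx_caosoa is a hypothetical name, halos are ignored, the lane assignment k = y / LYOVL is assumed, and the sizes are those of the Figure 2 example.

```c
#include <stddef.h>

/* Illustrative sizes only: Figure 2 example, clusters of VL = 2 elements. */
enum { LX = 2, LY = 8, NPOP = 2, VL = 2, LYOVL = LY / VL };

/* CAoSoA: for each column x and cluster position iy, the NPOP clusters
 * (one per population) are stored one after the other. */
static size_t idx_caosoa(size_t x, size_t y, size_t p) {
    size_t iy = y % LYOVL;  /* position of the cluster inside the column */
    size_t k  = y / LYOVL;  /* lane of this site inside the cluster      */
    return ((x * LYOVL + iy) * NPOP + p) * VL + k;
}
```

Consecutive populations of one site are now only VL elements apart instead of LX × LY, while each cluster still starts at a multiple of VL elements. With NPOP = 37 and, as an assumption, VL = 8, the 37 clusters of one index occupy 37 × 64 B = 2368 contiguous bytes, which is consistent with the drop in TLB misses reported for collide.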
5 Analysis of Results and Conclusions
We start by summarizing our performance data. See again Figure 4, showing results
for propagate and collide, using the FLAT and CACHE memory configurations.
For the propagate kernel, performance is almost independent of the number
of threads per core, while the impact of the various data layouts is large; indeed,
using a FLAT MCDRAM configuration the measured bandwidth increases from
138 GB/s for AoS to 314 GB/s for SoA and to 433 GB/s for CSoA. The trend is
similar using the DDR4 memory banks but performance is much lower, ranging
from 54 GB/s for AoS to 56 GB/s for SoA and to 81 GB/s for CSoA. We see a
similar behavior with the CACHE configuration, measuring in this case a
bandwidth of 59, 60 and 62 GB/s for the AoS, SoA and CSoA memory layouts
for a lattice size that does not fit into MCDRAM. Using the CAoSoA layout,
performance does not further improve, for either the FLAT or the CACHE
configuration.
For the collide kernel, using a FLAT MCDRAM configuration we obtain a good
level of performance, 114 MLUPS, using the AoS layout with 4 threads per core;
the SoA layout does not allow efficient vectorization, so performance
goes down to 62 MLUPS with one thread per core, further decreasing if we use
2, 3 and 4 threads per core. Enforcing memory alignment with the CSoA layout,
we obtain again a properly vectorized code and performance increases up to
135 MLUPS using 4 threads per core. Performance further improves with the
CAoSoA layout, as we remove the overhead associated with TLB misses, and we
reach the level of 165 MLUPS with 4 threads per core, corresponding to a factor
1.4X and 1.2X w.r.t. the AoS and CSoA layouts. As the collide kernel performs
approximately 6600 floating-point operations per lattice site, our KNL processor,
using the CAoSoA layout, delivers a sustained performance of approximately 1
TFlops, that is about 30% of the available raw peak. If one uses DDR4 memory,
performance is limited by memory bandwidth, but results follow the same
trend as in the MCDRAM case, reaching 89 MLUPS with the CAoSoA layout.
The same is true with the CACHE configuration, where collide reaches a peak of
98 MLUPS for the CAoSoA layout.
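The step from MLUPS to sustained floating-point throughput quoted above is a simple product. A minimal sketch of the conversion follows; the helper name collide_gflops is hypothetical, and the 6600 operations per site is the approximate count given in the text.

```c
/* Sustained GFlop/s of collide from its rate in MLUPS
 * (million lattice updates per second) and the flop count per site. */
static double collide_gflops(double mlups, double flops_per_site) {
    return mlups * 1.0e6 * flops_per_site / 1.0e9;
}
```

For the CAoSoA result, collide_gflops(165.0, 6600.0) gives about 1089 GFlop/s, consistent with the figure of approximately 1 TFlops.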
Fig. 5. Energy-to-Solution for propagate (top) and collide (bottom), for all data
layouts, using the FLAT configuration. For each layout we plot two groups of bars
corresponding to the use of either DDR4 off-chip memory or on-chip MCDRAM. Within
each group the bars correspond respectively to 1, 2, 3 and 4 threads per core. All
values are computed as the sum of the Package and DRAM RAPL energy counters, per
iteration.
We now consider energy efficiency for our code; we use data from the RAPL
(Running Average Power Limit) registers available in the KNL processor, for
both Package and DRAM counters, that we read using the custom library
described in [16]. Results are shown in Figure 5 for both FLAT memory
configurations, highlighting the impact of data layouts on energy consumption. All
figures refer to Energy-to-Solution (ES) and are the sum of Package (in-chip)
and DRAM (off-chip memory) contributions. For propagate, we see that using
MCDRAM increases the average power drain (≈ 35%) compared to the use of
off-chip DDR4, but ES is lower since a slightly higher power gets integrated over
a much shorter (≈ 4×) time. Also, the CSoA and CAoSoA data layouts halve
ES w.r.t. the AoS and SoA layouts as a result of their shorter execution times
and slightly lower power drain. For the collide kernel the SoA layout has a rather
low power drain (≈ 30% less than CSoA and CAoSoA) because vector units are
not used; however, the code also runs much slower (≈ 3×), translating into the
worst performance figure in terms of ES. Conversely, the CAoSoA layout gives
the best result in terms of energy efficiency, with energy-to-solution decreasing
as the number of threads per core increases, thanks to a constant power
drain and an increasing performance. Using the CACHE configuration, the average
power drain is in between the values recorded for the DDR4 and MCDRAM
cases. As shown in Figure 4, performance is similar to the case of DDR4, with
a slight performance decrease for propagate and a slight increase for collide
when using the CSoA and CAoSoA data layouts. Thus, from the energy consumption
point of view, using the CACHE configuration leads to a behavior similar to that
of DDR4.
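The energy arguments above rest on energy-to-solution being the product of average power and time-to-solution. As a rough worked check using only the approximate ratios quoted in the text (the symbols \bar{P} and T_S are notation introduced here, not the authors'):

```latex
E_S = \bar{P} \cdot T_S

\text{propagate:}\quad
\frac{E_S^{\mathrm{MCDRAM}}}{E_S^{\mathrm{DDR4}}}
  \approx \frac{1.35\,\bar{P}}{\bar{P}} \cdot \frac{T_S/4}{T_S}
  \approx 0.34

\text{collide:}\quad
\frac{E_S^{\mathrm{SoA}}}{E_S^{\mathrm{CSoA}}}
  \approx \frac{0.7\,\bar{P}}{\bar{P}} \cdot \frac{3\,T_S}{T_S}
  \approx 2.1
```

So a 35% higher power drain is more than compensated by a run that is about four times shorter, while a 30% lower power drain cannot compensate for a run that is about three times longer.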
We finally compare our performance results with those of other recent multi-
and many-core processors [17,18,19]. Our comparison is shown in Table 1 for
both critical kernels and also for the complete code; we adopt the CAoSoA layout
throughout, as it offers the best performance. Let us first discuss the case of a
lattice of size 1024×8192, requiring a memory footprint of ≈ 4.6 GB that fits in the
16 GB on-chip MCDRAM. This data size also fits most other accelerator boards, so we
can perform a meaningful comparison. Comparing the KNL in FLAT mode with the
KNC 7120P, the previous generation MIC processor, we see that propagate and
collide are respectively ≈ 4X and ≈ 3.5X faster. Comparing
with NVIDIA GPUs, propagate is ≈ 2.5X faster than on
a GK210 GPU (hosted on a K80 board), and as fast as on a P100 Pascal board.
Collide is 1.4X faster than on a GK210, and runs at approximately
half the speed of a P100. Comparing with a more traditional Intel
E5-2697v4 CPU, based on the Broadwell micro-architecture, propagate is 7.8X faster
and collide is 3.5X faster. Using the KNL in CACHE mode with a lattice that
does not fit into MCDRAM, the performance of the processor is much lower.
In the last column on the right of Table 1 we see the results for a lattice with a
memory footprint twice the size of MCDRAM. In this case, comparing with the
E5-2697v4 CPU, for which the 1024 × 8192 lattice does not fit in the last-level
cache, the performance of propagate is more or less the same, and that of collide
is ≈ 2X higher.
Table 1. Performance comparison among several processors. We consider the
propagate and collide kernels and the full code (Global), using the CAoSoA data layout. We
compare the KNL against the MIC KNC, the NVIDIA GK210 and P100 GPUs, and
the Intel E5-2697v4 CPU. The row labeled Global reports the performance of the
full code.
                       KNC 7120P  GK210      P100       E5-2697v4  KNL 7230   KNL 7230    KNL 7230
                                                                   flat/quad  cache/quad  cache/quad
Lattice size           1024×8192  1024×8192  1024×8192  1024×8192  1024×8192  1024×8192   4608×12288
Memory footprint [GB]  ≈ 4.6      ≈ 4.6      ≈ 4.6      ≈ 4.6      ≈ 4.6      ≈ 4.6       ≈ 30
Tprop [ms]             49.9       32.3       12.5       98.06      12.5       19.65       506.64
Tcoll [ms]             180.9      71.1       24.1       173.42     50.3       51.42       550.25
Propagate [GB/s]       100        155        396        51         398        253         66
Collide [GF/s]         307        764        2253       320        1100       1079        680
Collide [MLUPS]        46         115        340        48         166        163         103
Global [MLUPS]         35         73         232        31         119        106         67
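The derived rows of Table 1 can be cross-checked from the lattice size and the measured kernel times. The helpers below are a sketch under assumptions not spelled out in the source: two full copies of the lattice are kept in memory (as prv and nxt in Figure 3), halo columns are ignored, and propagate reads and writes each of the 37 double-precision populations once per site. All function names are hypothetical.

```c
#include <stddef.h>

enum { NPOP = 37 };  /* populations per site in the D2Q37 model */

/* Memory footprint in GiB of two lattice copies (source and destination). */
static double footprint_gib(double lx, double ly) {
    return 2.0 * lx * ly * NPOP * sizeof(double) / (1024.0 * 1024.0 * 1024.0);
}

/* Effective propagate bandwidth in GB/s: every site reads and writes
 * NPOP doubles during a kernel execution lasting t_ms milliseconds. */
static double propagate_gbs(double lx, double ly, double t_ms) {
    return 2.0 * lx * ly * NPOP * sizeof(double) / (t_ms * 1.0e-3) / 1.0e9;
}

/* Lattice update rate in MLUPS for a kernel lasting t_ms milliseconds. */
static double mlups(double lx, double ly, double t_ms) {
    return lx * ly / (t_ms * 1.0e-3) / 1.0e6;
}
```

With the timings of the KNL in FLAT mode (12.5 ms and 50.3 ms on the 1024 × 8192 lattice) these give roughly 397 GB/s and 167 MLUPS, close to the tabulated 398 GB/s and 166 MLUPS. The footprint comes out at about 4.6 GiB, which suggests that the GB of the table are binary units.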
In summary, based on our experience with our application, some concluding
remarks are in order: i) the KNL architecture makes it easy to port and
run codes previously developed for standard x86 CPUs; however, performance
is strongly affected by the massive level of parallelism that must necessarily be
exploited on the processor, lest the level of performance drop to the value of
standard multi-core CPUs or even worse; ii) for this reason data layouts play
a relevant role in enabling efficient vectorization; at least for
LB applications, appropriate data structures are necessary to allow the different
vectorization strategies needed in different parts of the application; iii) the
KNL processor improves on the KNC, the previous generation MIC processor,
by a factor ≈ 3-4X; iv) if application data fits within the MCDRAM, performance
is very competitive with that of GPU accelerators; however, if this
is not the case, performance drops to levels similar to those of multi-core CPUs,
with the further drawback that codes and operations (editing, compilation, I/O,
etc.) not exploiting task and data parallelism run much slower.
In the future, we plan to further analyze the energy performance of the KNL
in comparison with other processors, and to design and develop a parallel hybrid
MPI+OpenMP code able to run on a cluster of KNLs, in order to investigate
scalability.
Acknowledgements. This work was done in the framework of the COKA, COSA
and SUMA projects of INFN. We would like to thank CINECA (Italy) for access
to their HPC systems. AG has been supported by the EU Horizon 2020 research and
innovation programme under the Marie Sklodowska-Curie grant agreement No 642069.
References
1. Tang, P., et al.: An implementation and optimization of lattice Boltzmann method
based on the multi-node CPU+MIC heterogeneous architecture. Int. Conf. on
Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC) (2016)
315–320 doi:10.1109/CyberC.2016.67
2. Calore, E., et al.: Optimization of lattice Boltzmann simulations on heterogeneous
computers. The International Journal of High Performance Computing Applications
(2017) 1–16 doi:10.1177/1094342017703771
3. Rosales, C., et al.: A Comparative Study of Application Performance and
Scalability on the Intel Knights Landing Processor. LNCS (2016) 307–318
doi:10.1007/978-3-319-46079-6_22
4. Li, S., et al.: Enhancing application performance using heterogeneous memory
architectures on a many-core platform. Int. Conf. on High Performance Computing
Simulation (HPCS) (2016) 1035–1042 doi:10.1109/HPCSim.2016.7568455
5. Rucci, E., et al.: First Experiences Optimizing Smith-Waterman on Intel’s Knights
Landing Processor. ArXiv e-prints (February 2017)
6. McCalpin, J.D.: STREAM: Sustainable memory bandwidth in high performance
computers (2017)
7. Colfax: Clustering modes in Knights Landing processors (2017)
8. Colfax: MCDRAM as high-bandwidth memory (HBM) in Knights Landing processors:
Developers guide (2017)
9. Sodani, A., et al.: Knights Landing: Second-generation Intel Xeon Phi product.
IEEE Micro 36(2) (Mar 2016) 34–46 doi:10.1109/MM.2016.25
10. Succi, S.: The Lattice-Boltzmann Equation. Oxford University Press, Oxford (2001)
11. Sbragaglia, M., et al.: Lattice Boltzmann method with self-consistent
thermo-hydrodynamic equilibria. Journal of Fluid Mechanics 628 (2009) 299–309
doi:10.1017/S002211200900665X
12. Scagliarini, A., et al.: Lattice Boltzmann methods for thermal flows: Continuum
limit and applications to compressible Rayleigh–Taylor systems. Physics of Fluids
22(5) (2010) 055101 doi:10.1063/1.3392774
13. Biferale, L., et al.: Second-order closure in stratified turbulence: Simulations and
modeling of bulk and entrainment regions. Physical Review E 84(1) (2011) 016305
doi:10.1103/PhysRevE.84.016305
14. Biferale, L., et al.: Reactive Rayleigh-Taylor systems: Front propagation and
non-stationarity. EPL 94(5) (2011) 54004 doi:10.1209/0295-5075/94/54004
15. Calore, E., et al.: Experience on vectorizing lattice Boltzmann kernels for
multi- and many-core architectures. LNCS Part I (2016) 53–62
doi:10.1007/978-3-319-32149-3_6
16. Calore, E., et al.: Evaluation of DVFS techniques on modern HPC processors and
accelerators for energy-aware applications. Concurrency and Computation: Practice
and Experience (2017) doi:10.1002/cpe.4143
17. Mantovani, F., et al.: Performance issues on many-core processors: A D2Q37 lattice
Boltzmann scheme as a test-case. Computers & Fluids 88 (2013) 743–752
doi:10.1016/j.compfluid.2013.05.014
18. Biferale, L., et al.: An optimized D2Q37 lattice Boltzmann code on GP-GPUs.
Computers & Fluids 80 (2013) 55–62 doi:10.1016/j.compfluid.2012.06.003
19. Calore, E., et al.: Massively parallel lattice Boltzmann codes on large GPU clusters.
Parallel Computing 58 (2016) 1–24 doi:10.1016/j.parco.2016.08.005
