Performance and energy footprint assessment of FPGAs and GPUs on HPC
  systems using Astrophysics application by Goz, David et al.
March 2020
Performance and energy footprint
assessment of FPGAs and GPUs on HPC
systems using Astrophysics application
David GOZ a,1, Georgios IERONYMAKIS b , Vassilis PAPAEFSTATHIOU b ,
Nikolaos DIMOU b , Sara BERTOCCO a , Giuliano TAFFONI a , Francesco SIMULA c
, Antonio RAGAGNIN a , Luca TORNATORE a and Igor CORETTI a
a INAF-Osservatorio Astronomico di Trieste, Italy
bFORTH-ICS, Heraklion, Crete, Greece
c INFN-Sezione di Roma, Italy
Abstract.
New challenges in Astronomy and Astrophysics (AA) are driving the need for a
large number of exceptionally computationally intensive simulations. "Exascale"
(and beyond) computational facilities are mandatory to address the size of theoretical
problems and of the data coming from the new generation of observational facilities
in AA. Currently, the High Performance Computing (HPC) sector is undergoing
a profound phase of innovation, in which the primary challenge to achieving
"Exascale" is power consumption. The goal of this work is to give
some insights into the performance and energy footprint of contemporary architectures,
using a real astrophysical application in an HPC context. We use a state-of-the-art
N-body application that we re-engineered and optimized to fully exploit the
underlying heterogeneous hardware. We quantitatively evaluate the impact of
computation on energy consumption when running on four different platforms. Two of
them represent current HPC systems (Intel-based and equipped with NVIDIA
GPUs), one is a micro-cluster based on ARM MPSoCs, and one is a "prototype
towards Exascale" equipped with ARM MPSoCs tightly coupled with FPGAs. We
investigate the behaviour of the different devices: high-end GPUs excel in
terms of time-to-solution, while MPSoC-FPGA systems outperform GPUs in power
consumption. Our experience suggests that FPGAs are very promising for computationally
intensive applications, as their performance is improving to
meet the requirements of scientific applications. This work can serve as a reference for
the development of future platforms for astrophysics applications that require
computationally intensive calculations.
Keywords.
Astrophysics, HPC, N-body, ARM MPSoC, GPUs, FPGAs, Hardware Acceleration,
Acceleration Architectures, Exascale, Energy-Delay-Product
1Corresponding Author: David Goz, ORCID: 0000-0001-9808-2283, INAF-Osservatorio Astronomico Di
Trieste, Via G.B. Tiepolo 11, 34131 Trieste, Italy; E-mail: david.goz@inaf.it
arXiv:2003.03283v1 [astro-ph.IM] 6 Mar 2020
February 2019
1. Introduction and motivation
In the last decade, energy efficiency has become the primary concern in the High Perfor-
mance Computing (HPC) sector. HPC systems constructed from conventional multicore
Central Processing Units (CPUs) have to face, on one side, the reduction in year-on-year
performance gain for CPUs and on the other side, the increasing cost of cooling and
power supply as HPC clusters grow larger.
Some technological solutions have already been identified to address the energy
issue in HPC [9]; one of them is the use of power-efficient Multiprocessor Systems-
on-Chip (MPSoC) [10,12,11,1,2,4]. These hardware platforms are integrated circuits
composed of multicore CPUs combined with accelerators like Graphic-Processing-Units
(GPUs) and/or Field-Programmable-Gate-Arrays (FPGAs). Such hardware accelerators
can offer higher throughput and energy-efficiency compared to traditional multicore
CPUs. The main drawback of those platforms is the complexity of their programming
model, requiring a new set of skills for software developers and hardware design con-
cepts, leading to increased development time for accelerated applications.
The Astronomy and Astrophysics (AA) sector is one of the research areas in Physics
that most needs higher-performing software, as well as Exascale
supercomputers (and beyond) [13]. In AA, HPC numerical simulations are the most
effective instruments to model complex dynamic systems, to interpret observations and
to make theoretical predictions, advancing scientific knowledge. They are mandatory to
help capture and analyze the torrent of complex observational data that the new genera-
tion of observatories produce, providing new insights into astronomical phenomena, the
formation and evolution of the universe, and the fundamental laws of physics.
The research presented in this paper arises in the framework of the EuroEXA
European-funded project [20], which aims at the design and development of a prototype
of an exascale HPC machine. EuroEXA is achieving that through the use of low-power
ARM processors accelerated by tightly-coupled FPGAs.
Focusing on performance and energy-efficiency, in this work we exploit four plat-
forms:
• (I-II) two Linux x86 HPC clusters that represent the state-of-the-art of HPC archi-
tectures (Intel-based and equipped with NVIDIA GPUs);
• (III) a Multiprocessor SoC micro-cluster that represents a low purchase-cost and
low-power approach to HPC;
• (IV) an exascale prototype that represents a possible future for supercomputers.
This prototype was developed by the ExaNeSt European project (footnote 2) [24,5,23] and
customized by the EuroEXA project (footnote 3).
The platforms are probed using a direct N-body solver for astrophysical simulations,
widely used for scientific production in AA, e.g. for simulations of star clusters up to ~8
million bodies [21,22].
The goal of this paper is to investigate the performance-consumption plane, namely
the parameter space where time-to-solution and energy-to-solution are combined, ex-
ploiting the different devices hosted on the platforms. We include the comparison among
2 https://exanest.eu/
3 https://euroexa.eu/
high-end CPUs, GPUs and MPSoCs tightly coupled with FPGAs systems. To the best
of our knowledge, this paper provides one of the first comprehensive evaluations of a
real AA application on an exascale prototype, comparing the results with today's HPC
hardware.
The paper is organized as follows. In Section 2 we describe the computing platforms
used for the analysis. In Sections 3 and 4 we discuss the methodology employed for
the performance and energy measurement experiments, including considerations on the
usage of the different platforms and the configuration of the parallel runs. Section 5
presents the scientific application used to benchmark the platforms. Our results
are presented in Section 6. Finally, Section 7 is devoted to the conclusions and the
perspectives for future work.
2. Computing platforms
In this section, we describe the four platforms used in our tests. In Table 1, we list the
devices, and we highlight in bold the ones exploited in this paper.
Table 1. The computing node and the associated devices. The devices exploited are highlighted in bold.
Node    CPU                          GPU                  FPGA
mC      4x(ARM A53) + 2x(ARM A72)    ARM Mali-T864        None
IC      40x(Xeon Haswell E5-4627v3)  None                 None
ExaBed  16x(ARM A53) + 8x(ARM R5)    4x(ARM Mali-400)     4x(Zynq-US+)
GPUC    32x(Xeon Gold 6130)          8x(Tesla-V100-SXM2)  None
2.1. ExaNeSt HPC testbed prototype
The ExaNeSt HPC testbed prototype [24] (hereafter ExaBed) is a liquid-cooled cluster
composed of the proprietary Quad-FPGA daughterboard (QFDB) [25] computing nodes,
interconnected with a custom network and equipped with a BeeGFS parallel filesystem.
In Figure 1, we present a block diagram of the computing node of the platform.
The compute-node board includes 4 Xilinx Zynq Ultrascale+ MPSoC devices
(ZCU9EG), each featuring 4x(ARM-A53) and 2x(ARM-R5) cores, along with a rich set
of hard IPs and Reconfigurable Logic. Each Zynq device has a 16GB DDR4 (SODIMM)
attached and a 32MB Flash (QSPI) memory. Also, as shown in Figure 1, within the
QFDB the FPGAs are connected to one another through 2 HSSL and 24 LVDS pairs (12
in each direction). Out of the 4, only the "Network" FPGA is directly connected to the
outside world, while the "Storage" FPGA has an additional 250 GB M.2 SSD attached to
it. The maximum sustained power of the board is 120 Watts, while the power dissipation
during normal operation is usually around 50 Watts. Targeting a compact design, the
board measures 120 x 130 mm, and no component on top of or below the printed
circuit board (PCB) is taller than 10mm.
These compute nodes are sealed within a blade enclosure, each hosting 4 QFDBs.
Currently, the ExaNeSt prototype HPC testbed consists of 12 fully functional blades. The
rack provides connectivity between the blades, while each QFDB is managed through a
Manager VM and runs a customized version of Linux based on Gentoo Linux, which is
called Carvoonix.
Figure 1. The Quad-FPGA daughterboard block diagram and interconnects.
When running the matrix-matrix multiplication benchmark (DGEMM), the ARM-A53x4
CPU of a single Zynq device can execute up to 7.9 GFLOP/s. The memory bandwidth
measured with the STREAM benchmark is 6488 MB/s, 5886 MB/s,
4269 MB/s and 4032.9 MB/s for the Copy, Scale, Add and Triad tests, respectively.
In the QFDB, current and power are measured using a set of TI INA226 sensors
coupled with high-power shunt resistors. The minimal capture time of the INA226
is 140 µs. However, the Linux driver default (and the power-on set-up) sets the capture
time to 1.1 ms. The Linux driver also enables averaging over 16 samples, and captures
both the shunt and the bus voltages. To collect the data, each board includes
15 I2C power sensors, which allow the measurement of power consumption by major
subsystems.
2.2. Intel cluster
Each node of the Intel cluster (hereafter IC) is equipped with 4 INTEL Haswell
E5-4627v3 sockets at 2.60 GHz with 10 cores each and 256 GB of RAM (about 6 GB per
core). The interconnect is Infiniband ConnectX-3 Pro Dual QSFP+ (54 Gb/s), and the
storage system is a BeeGFS parallel file system, with 4 IO servers offering 350 TB of
disk space [6,7]. The cluster has a peak performance of 27 TeraFLOPS, measured using
the HPL benchmark. Running the STREAM benchmark, we measured memory bandwidths of
62408.0 MB/s, 56592.6 MB/s, 73716.3 MB/s and 69170.1 MB/s for the Copy, Scale, Add
and Triad tests, respectively. Each computing node is equipped with an iLO4 management
controller, which can be used to measure the node's instantaneous power consumption
(1 sample every second).
2.3. ARM-Micro-Cluster
We built our ARM-Micro-Cluster (hereafter mC) starting from the open-source MPSoC
Firefly-RK3399 [26]. This single-board computer features the big.LITTLE architecture:
a cluster of 4x(Cortex-A53) cores with 32kB L1 cache and 512kB L2 cache, and a cluster of
2x(Cortex-A72) high-performance cores with 32kB L1 cache and 1MB L2 cache. Each
cluster operates at an independent frequency, ranging from 200MHz up to 1.4GHz for the
LITTLE and up to 1.8GHz for the big. The MPSoC contains 4GB of DDR3-1333MHz
RAM. The MPSoC also features the OpenCL-compliant Mali-T864 embedded GPU, which
operates at 800 MHz.
The ARM-Micro-Cluster, composed of 8 Firefly-RK3399 single boards, runs
Ubuntu 18.04 Linux and is scheduled using SLURM [27]. The interconnect is based on
Gigabit Ethernet, and the storage system is a device shared via NFS.
Regarding the DGEMM and STREAM benchmarks, the A72x2 cores and the A53x4
cores offer a performance of 9.5 GFLOP/s and 7.5 GFLOP/s respectively, while the
measured bandwidths are 5939 MB/s, 5912 MB/s, 5451 MB/s and 5547 MB/s for the Copy,
Scale, Add and Triad tests, respectively.
2.4. GPU cluster
Each node of the GPU cluster (hereafter GPUC) is equipped with 2 INTEL Xeon
Gold 6130 sockets at 2.10 GHz with 16 cores each, along with 8 NVIDIA Tesla-V100-SXM2
GPUs. The GPUs are hosted by a SuperServer 4029GP-TVRT system by SuperMicro®, which
integrates a Baseboard Management Controller (BMC) that, through the Intelligent Platform
Management Interface (IPMI), provides out-of-band access to the sensors embedded in
the system. Among the physical parameters that these sensors are able to measure
(temperature, cooling fan speed, chassis intrusion, etc.), this system is also able to
continuously monitor the amperage and voltage of the different rails within the redundant
power supply units, in order to give at least a ballpark figure of its wattage. Right after
booting and with all GPUs idle, the system wattage is about 440 Watts. To measure the
GPU power under full load, we rely on the built-in sensors queried by the
NVIDIA nvidia-smi tool (footnote 4).
3. Methodology and considerations
The platforms behave differently as far as power policies are concerned.
On the ARM sockets, frequency scaling is absent, meaning that the idle and
performance modes are mutually exclusive. Our code, described in Section 5,
is not able to exploit the highly heterogeneous big.LITTLE ARM socket, whose
architecture couples relatively power-saving and slower processor cores (LITTLE) with
relatively more powerful and power-hungry ones (big). This MPSoC is conceived to migrate
the more demanding threads onto the more powerful cores of the big socket (A72 in the case
of the mC). Hence, in order to disentangle performance and power consumption,
we pin all MPI processes and Open Multi-Processing (OpenMP) threads to either the big or
the LITTLE socket by explicitly setting the CPU affinity. It is worth noting that both
mC and ExaBed are equipped with A53x4 sockets, which lets us run our CPU simulations
only on the former and extrapolate the results to the latter. Hence, on ExaBed
we focus on the performance of the FPGA in the Xilinx Zynq UltraScale+
MPSoC, which is the most important topic of this paper.
4 The readings are accurate to within +/- 5 Watts, as stated by the NVIDIA documentation.
That accuracy does not affect our results.
To carry out a meaningful comparison, we use the same number of computational
units on every platform. Given the heterogeneity of the platforms in
terms of the underlying devices (see Table 1), we define the computational
unit as a group of four cores for CPUs (footnote 5), and as either one GPU or one FPGA
for accelerators. Table 2 summarizes the compute units (hereafter CUs), as defined
above, available on each platform.
Table 2. The compute units (CUs), as defined in Section 3, available on the platforms.
CU-type  IC    mC                 ExaBed  GPUC
CPU      10    1 (A53) - 1 (A72)  None    None
GPU      None  1                  None    8
FPGA     None  None               4       None
One of the aims of this work is to shed some light on the crucial comparison between
the energy consumption and performance of current platforms and (possibly) exascale-
like ones.
4. Power consumption measurements
Since the IC, ExaBed and GPUC have built-in sensors, for those three platforms we rely
on the power measurements returned by the diagnostic infrastructure. The mC, on the
contrary, does not have any sensor, so we obtain the energy consumption by measuring
the actual absorption using a Yokogawa WT310E Digital Power Meter.
We assess that using different methods of energy measurement does not affect
the results. To make power measurements, we set up the simulations so that their runtime
is much larger than the sampling time of the on-board sensors, so that fluctuations are
averaged out.
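As an illustration of why a long runtime averages out fluctuations, the following minimal sketch turns a series of sampled power readings into an energy and an average power. It is not the measurement code used in this work: the function names are hypothetical, and the trapezoidal integration is an assumption about how one might post-process readings such as the 1 sample per second of the iLO4 controller.

```python
def energy_from_samples(times_s, power_w):
    """Integrate sampled power (W) over time (s) with the trapezoidal rule.

    Returns the energy in joules. Both sequences must have the same length
    and contain at least two samples.
    """
    if len(times_s) != len(power_w) or len(times_s) < 2:
        raise ValueError("need at least two (time, power) samples of equal length")
    energy_j = 0.0
    for k in range(1, len(times_s)):
        dt = times_s[k] - times_s[k - 1]
        # Area of the trapezoid between two consecutive readings.
        energy_j += 0.5 * (power_w[k] + power_w[k - 1]) * dt
    return energy_j


def average_power(times_s, power_w):
    """Mean power (W) over the whole sampled window."""
    return energy_from_samples(times_s, power_w) / (times_s[-1] - times_s[0])
```

With many samples per run, a single noisy reading changes the average only marginally, which is the reason for choosing runtimes much longer than the sensor sampling time.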
On the IC, the ExaBed, the mC and the GPUC respectively, the smallest units for
which energy consumption can be measured are a 4-socket node (with 10 cores per socket),
a QFDB (4 MPSoCs, with 4 cores and one FPGA each), a single board (dual socket and
one GPU), and one GPU. For each platform, we estimate both the energy consumed under
no workload ($E_{idle}$) and the total energy consumption under 100% workload
($E_{full}$) using the CUs highlighted in Table 3.
5 Two cores for the ARM A72, since that is the maximum available on the big socket hosted by the mC.
Table 3. The energy consumption of the platforms. $E_{idle}$ is the average power of the idle
platform over 3 minutes; $E_{full}$ is the average power over 3 minutes of continuous HY-NBODY
execution with 100% load, using one CU of type either CPU, GPU, or FPGA ($E_{full}^{CU-CPU}$,
$E_{full}^{CU-GPU}$, and $E_{full}^{CU-FPGA}$, respectively).
Platforms                 IC   mC                       ExaBed  GPUC
$E_{idle}$ [W]            160  3.15                     42.5    440
$E_{full}^{CU-CPU}$ [W]   223  4.55 (A53) - 7.35 (A72)  N/A     N/A
$E_{full}^{CU-GPU}$ [W]   N/A  4.75                     N/A     710
$E_{full}^{CU-FPGA}$ [W]  N/A  N/A                      53.5    N/A
In the following we report the energy-to-solution (total energy required to perform
the calculation) excluding $E_{idle}$, i.e. $E_{work} = E_{full} - E_{idle}$, in order to
focus on the power consumption of the different CUs. The idle energy and the energy
consumed by the processing units (CPUs, GPUs, FPGAs) are distinct targets for engineering
and improvement, and it seems useful to disentangle them while considering what is most
promising in the Exascale perspective.
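The subtraction of the idle contribution can be sketched with the figures of Table 3. This is only an illustration: the dictionary layout and function names are hypothetical, and it assumes that the tabulated values (quoted in watts) are average powers that stay constant during a run, so that the energy is the net power multiplied by the time-to-solution.

```python
# Average power draw in watts, copied from Table 3.
P_IDLE = {"IC": 160.0, "mC": 3.15, "ExaBed": 42.5, "GPUC": 440.0}
P_FULL = {
    ("IC", "CPU"): 223.0,
    ("mC", "CPU-A53"): 4.55,
    ("mC", "CPU-A72"): 7.35,
    ("mC", "GPU"): 4.75,
    ("ExaBed", "FPGA"): 53.5,
    ("GPUC", "GPU"): 710.0,
}


def work_power(platform, cu_type):
    """Net power (W) attributed to one CU: full-load power minus idle power."""
    return P_FULL[(platform, cu_type)] - P_IDLE[platform]


def work_energy(platform, cu_type, time_to_solution_s):
    """E_work in joules, assuming constant net power over the whole run."""
    return work_power(platform, cu_type) * time_to_solution_s
```

For example, `work_power("ExaBed", "FPGA")` gives 11 W and `work_power("GPUC", "GPU")` gives 270 W for a single CU.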
Finally, we estimate the energy impact of the application also in terms of the Energy
Delay Product (EDP). The EDP proposed by Cameron [28] is a "fused" metric to evaluate
the trade-off between time-to-solution and energy-to-solution. It is defined as:
$$\mathrm{EDP} = E_{CU} \times T_{CU}^{w} \qquad (1)$$
where $E_{CU}$ is the $E_{work}$ consumed during the run by the CU, $T_{CU}$ is the
time-to-solution of the given CU and $w$ (usually $w = 1, 2, 3$) is a parameter to weight
performance versus power. The larger $w$ is, the greater the weight we assign to performance.
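The effect of the weight $w$ in Eq. (1) can be seen with a small sketch. The function name and the two devices in the usage note are hypothetical, chosen only to show how the ranking can change with $w$; they are not measurements from this work.

```python
def edp(e_work_j, t_cu_s, w=1):
    """Energy Delay Product of Eq. (1): E_CU * T_CU**w (lower is better)."""
    if e_work_j < 0 or t_cu_s <= 0:
        raise ValueError("energy must be non-negative and time positive")
    return e_work_j * t_cu_s ** w
```

Consider a fast but power-hungry device (400 J in 10 s) and a slow but frugal one (50 J in 40 s). With `w=1` the frugal device has the lower EDP (2000 against 4000), while with `w=2` the fast device wins (40000 against 80000), because time-to-solution is weighted more heavily.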
5. Astrophysical code
As mentioned above, we compare both time-to-solution and energy-to-solution
performance using a real scientific application coming from the astrophysical domain: the
HY-NBODY code [29,30].
In Astrophysics, the N-body problem consists of predicting the individual motion
of celestial bodies interacting purely gravitationally. Since every body interacts with all
the others, the computational cost scales as $O(N^2)$, where $N$ is the number of bodies.
HY-NBODY is a modified version of a GPU-based N-body code [31,32,33]; it has been
developed in the framework of the ExaNeSt project [24], and it is currently optimized
for exascale-like machines within the FET HPC H2020 EuroEXA project (footnote 6). The code
relies on the 6th-order Hermite integration scheme [34], which consists of three stages: a
predictor step that predicts the particles' positions and velocities; an evaluation step that
evaluates the new accelerations and their first-order (jerk), second-order (snap), and
third-order (crackle) derivatives; and a corrector step that corrects the predicted positions
and velocities using the results of the previous steps.
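To make the $O(N^2)$ structure of the evaluation step concrete, the sketch below computes accelerations and jerks by direct summation. It is a simplified illustration and not the HY-NBODY kernel: it stops at the jerk (the 6th-order scheme also needs snap and crackle), it uses units with $G = 1$, and the function name and the optional softening length `eps` are assumptions introduced here.

```python
import math


def evaluate_acc_jerk(mass, pos, vel, eps=0.0):
    """Direct-summation accelerations and jerks for N bodies (G = 1).

    mass: list of N floats; pos, vel: lists of N [x, y, z] lists.
    eps is a hypothetical softening length (0 means pure Newtonian gravity).
    """
    n = len(mass)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    jerk = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):          # every body ...
        for j in range(n):      # ... feels every other body: N*(N-1) pairs
            if i == j:
                continue
            dr = [pos[j][k] - pos[i][k] for k in range(3)]
            dv = [vel[j][k] - vel[i][k] for k in range(3)]
            r2 = dr[0] * dr[0] + dr[1] * dr[1] + dr[2] * dr[2] + eps * eps
            rv = dr[0] * dv[0] + dr[1] * dv[1] + dr[2] * dv[2]
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            alpha = 3.0 * rv / r2
            for k in range(3):
                # a_i   += m_j * r_ij / |r_ij|^3
                acc[i][k] += mass[j] * dr[k] * inv_r3
                # jerk_i += m_j * (v_ij - 3 (r_ij . v_ij) r_ij / |r_ij|^2) / |r_ij|^3
                jerk[i][k] += mass[j] * (dv[k] - alpha * dr[k]) * inv_r3
    return acc, jerk
```

The double loop touches each pair of bodies once per evaluated body, while the data read per body is only its mass, position and velocity, which is why the kernel performs many floating-point operations per byte loaded.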
Code profiling shows that the Hermite scheme spends more than 90% of the time in
the evaluation step, which is characterized by an arithmetic intensity $I \simeq 10^4$
[FLOPs/byte] (ratio of FLOPs to memory traffic) using $32^3$ particles. In the following,
time-to-solution and energy-to-solution measurements refer to that compute-bound
kernel.
6 https://euroexa.eu/
Three versions of the code are available:
(i) Standard C code: cache-aware designed for CPUs and parallelized with hybrid
MPI+OpenMP programming;
(ii) OpenCL code: conceived to target accelerators like GPGPUs or embedded GPUs.
All the stages of the Hermite integrator are performed on the OpenCL-compliant
device(s). The kernel implementation exploits local memory (OpenCL terminol-
ogy) of device(s), which is generally accepted as the best method to reduce global
memory latency in discrete GPUs. However, on ARM embedded GPUs, the global
and local OpenCL address spaces are mapped to main host memory (as reported
by the ARM developer guide, footnote 7). So, a specific ARM-GPU-optimized version of all
kernels of HY-NBODY, in which local memory is not used, has been implemented
and used for the results shown in the paper. The impact of such an optimization is
shown in [29].
Regarding the host parallelization scheme, a one-to-one correspondence between
MPI processes and computational nodes is established, and each MPI process manages
all the OpenCL-compliant devices available per node (the number of such
devices is user-defined). Inside each shared-memory computational node, the
parallelization is achieved by means of OpenMP. Such an implementation requires that
particle data is communicated between the host and the device at each time-step,
which gives rise to synchronization points between host and device(s). Accelerations
and time-steps computed by the device(s) are retrieved by the host on every
computational node, reduced and then sent back again to the device(s);
(iii) Standard C targeting HLS tool: the Xilinx Vivado High Level Synthesis tool (footnote 8)
was used to develop a highly optimized hardware accelerator for the QFDB's FPGAs.
The kernel was designed to be parameterizable, in order to experiment with different
area vs performance implementations and to provide the capability of deploying
it to any Xilinx FPGA with any amount of reconfigurable resources.
Vivado HLS provides a directive-oriented style of programming where the tool
transforms the high-level code (C, C++, SystemC, OpenCL) into a Hardware
Description Language (HDL) according to the directives provided by the programmer.
Some of the optimizations performed in this kernel are described below:
• calculation in chunks: given the finite resources of the FPGA and the need
to accelerate the Hermite algorithm on large arrays that exceed the amount of
internal memory inside the FPGA (BRAM), we followed a tiled approach
where the kernel loops over the corresponding tiles of the original arrays and
the core Hermite algorithm is performed on chunks of data stored internally.
• burst memory mode: this directive was used in order to request and fetch the
data in bursts instead of one by one, reducing latency while communicating
with the DRAM. The burst size selected was the maximum burst size allowed
by the AXI4 protocol, which is 4 KB.
7 https://bit.ly/2T1yrrw
8 https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html
• loop pipeline and loop unroll: the core Hermite algorithm has been pipelined,
achieving an initiation interval of 1 clock cycle. In order to achieve this, we
increased the number of the kernel's read/write interfaces so as to fetch
data from many arrays simultaneously and independently and perform
calculations. Also, by applying the loop unrolling directive we allowed the
algorithm to be performed on more particles per cycle, with a corresponding
increase in the reconfigurable resources needed.
In our previous work [30] we demonstrated a kernel showing the full potential of a
single QFDB FPGA. Due to its extra connectivity capabilities, the "Network"
FPGA suffers from higher congestion of reconfigurable resources.
Thus, the previous kernel's high demand for resources made it infeasible to deploy
it to the "Network" FPGA, so in order to demonstrate the application running on
many FPGAs and split the computational load evenly inside the QFDB, we chose a
different kernel size for this work. This kernel has 75% of the throughput of the
previous one and operates at a slightly higher frequency (320 MHz compared to 300
MHz).
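The "calculation in chunks" idea described above can be sketched in software as follows. This is not the HLS kernel: it is a plain Python illustration in which copying a slice stands in for a burst transfer from DRAM into on-chip memory, only accelerations are computed, units with $G = 1$ are used, and the function name and the `tile_size` parameter are hypothetical.

```python
import math


def accelerations_tiled(mass, pos, tile_size=4):
    """Direct-summation accelerations, looping over source bodies tile by tile."""
    n = len(mass)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for start in range(0, n, tile_size):
        # Stand-in for a burst copy of one tile into fast local storage.
        tile_mass = list(mass[start:start + tile_size])
        tile_pos = [list(p) for p in pos[start:start + tile_size]]
        for i in range(n):
            xi, yi, zi = pos[i]
            for offset in range(len(tile_mass)):
                if start + offset == i:
                    continue  # skip self-interaction
                dx = tile_pos[offset][0] - xi
                dy = tile_pos[offset][1] - yi
                dz = tile_pos[offset][2] - zi
                r2 = dx * dx + dy * dy + dz * dz
                scale = tile_mass[offset] / (r2 * math.sqrt(r2))
                acc[i][0] += scale * dx
                acc[i][1] += scale * dy
                acc[i][2] += scale * dz
    return acc
```

The result does not depend on the tile size: only the amount of data held locally at any moment changes, which is the property that lets a kernel with limited internal memory process arrays of arbitrary length.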
5.1. Floating point arithmetic considerations
Arithmetic precision plays a key role during the integration of the equations of motion of
an N-body system. Generally, the Hermite integration scheme requires double-precision
arithmetic in order to minimize the accumulation of round-off error, preserving both
the total energy and the angular momentum during the simulation. We have already
demonstrated that extended-precision arithmetic [35] can speed up the calculation
on GPUs, while it performs poorly on both CPUs and FPGAs [30], due to its higher
arithmetic intensity compared to the double-precision algorithm (additional
accumulations etc.).
Given that, we obtain the results shown in Section 6 using double-precision
arithmetic on both CPUs and FPGAs, while extended-precision arithmetic is
employed on GPUs.
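The text does not spell out the extended-precision scheme of [35]. One common family of such schemes carries each value as an unevaluated sum of two floating-point numbers; assuming that kind of scheme purely for illustration, the sketch below shows where the "additional accumulations" come from. It uses Python floats (IEEE doubles) rather than the single-precision pairs one would use on a GPU, and the function names are hypothetical.

```python
def two_sum(a, b):
    """Error-free addition: returns (s, err) with s = fl(a + b) and a + b = s + err."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def add_pair(x, y):
    """Add two (hi, lo) pairs, keeping the rounding error in the low word."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)  # renormalise so that the low word stays small
```

Adding `1e-17` to `1.0` ten times in plain floating point leaves `1.0` unchanged, whereas accumulating with `add_pair` keeps the lost part (about `1e-16`) in the low word. Each pair addition costs several floating-point operations instead of one, which is consistent with the higher arithmetic intensity mentioned above.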
6. Computational performances and energy consumption
In all simulations, in the case of CPUs, the cores composing the CUs are exploited by
means of OpenMP threads, and multiple CUs by means of MPI; for GPUs, instead, we use a
fixed work-group-size (also called block-size in CUDA terminology)
of 64 (footnote 9).
We investigate the time-to-solution of HY-NBODY by running two different test series.
First, keeping the number of CUs constant, we increase the number of particles, running
simulations with $32^3$, $64^3$ and $128^3$ particles. Then, for the $64^3$-particle
simulation, we vary the number of CUs used, from 1 to 4.
In Figure 2, we report the computational performance, expressed in terms of
time-to-solution, for the first and second test series (left and right panel, respectively).
Using 4 CUs, the GPUC performs almost 18 times better than ExaBed, which in turn
performs 5 times better than IC.
9 We have already shown that the performance on ARM embedded GPUs is not driven by any specific
work-group-size, regardless of the usage of local memory [29].
Figure 2. Left: time-to-solution as a function of the number of particles using 4 CUs. Right: time-to-solution
as a function of the number of CUs using $64^3$ particles.
Figure 3. The performance-consumption plane for $64^3$ particles varying the number of CUs.
Figure 3 shows the performance-consumption plane (energy-to-solution, $E_{work}$, vs
time-to-solution) using $64^3$ particles and varying the number of CUs from 1 to 4. Different
symbols refer to different numbers of CUs.
HY-NBODY is a compute-bound application, as stated in Section 5. Hence, in these
tests we measure the computing performance of the platforms but not the network
contribution, so the results on mC are not affected by the latency of MPI communication
across different computational nodes.
Figure 4. EDP as a function of the CUs using $64^3$ particles.
Not surprisingly, the CPUs of both IC and mC consume more energy than the
accelerators in either GPUC or ExaBed. The most interesting point is the equivalence
of the energy-to-solution between GPUC and ExaBed, which indicates a definite
trend toward the Exascale prototype. We can also see the effect of the energy consumption
overhead when a node uses only a subset of its cores or sockets. This effect is evident for
IC when using 1 or 2 CUs.
In Figure 4, we present the results of the EDP for $w = 1$. We note that, for the same
number of CUs, the GPUC has a better EDP than the other platforms. When comparing the
ExaBed and IC configurations with the same time-to-solution, the ExaBed has a better EDP:
the configuration with 1 CU on ExaBed has the same time-to-solution as the configuration
with 4 CUs on the IC (compare the violet circle with the black pentagram in Figure 3).
In Figure 5, we show the ratio between the total energy required to perform the
calculation and the energy consumed by the CUs, i.e. $E_{full}/E_{work}$, as a function of the
number of CUs, using $64^3$ particles. For the mC the trend is almost constant, since the
smallest unit for which energy consumption has been measured is the single board, as stated
in Section 4. For the other platforms, the effect of the energy consumption overhead when
a node uses only a subset of its CUs is shown.
Figure 5. Ratio between the total energy-to-solution ($E_{full} = E_{work} + E_{idle}$) and the
energy-to-solution consumed by the CUs ($E_{work}$) as a function of the CUs using $64^3$ particles.
7. Conclusion and future work
In this work, we evaluate the performance of four platforms in terms of both
time-to-solution and energy-to-solution for a code coming from the AA sector: two
platforms that represent the current status of HPC systems, the former Intel-based (IC) and
the latter equipped with NVIDIA-Tesla-V100 GPUs (GPUC); an ARM MPSoC micro-cluster
(mC) that could represent a low-budget HPC solution; and the ExaNeSt exascale
prototype (ExaBed) that (possibly) represents the next generation of HPC systems.
Our analysis has been conducted using a code for scientific production, exploiting
the multi-CPUs, GPUs and FPGAs of the aforementioned platforms. The compute-bound
nature of our application allows us to focus on the assessment of the computational
power and energy-efficiency of the devices, without dealing with the interplay of
different key factors, like memory bandwidth, network latency and application execution
pattern.
The overall picture, where accelerators outperform CPUs in terms of both performance
and energy-efficiency, is not surprising. Exploiting CPUs, when we set up a run
on the ExaBed in order to achieve the same time-to-solution as a run on the IC (ARM-A53
cores equip both the ExaBed and the mC), our results show that the former proves
to be more power-efficient than the latter, which supports the exascale perspective of
tailoring single compute units to a better FLOP/W ratio rather than to pure FLOPs
performance.
Regarding accelerators, the NVIDIA-Tesla-V100 GPUs are faster than the Xilinx
US+ FPGAs; however, the latter demonstrate superior power-efficiency (the
energy-to-solution is the same). We found that FPGA programming continues to be
challenging for HPC software developers, even using the high-level-synthesis technique,
which allows the conversion of an algorithm description in high-level languages (e.g.
C/C++, OpenCL) into a digital circuit. In comparison, GPU programming is fairly
straightforward using the latest frameworks like CUDA, OpenCL or OpenACC, but a
great deal of our effort has been devoted to optimizing the kernel using extended-precision
arithmetic. In the end, the development effort in terms of design time
and programmer training was comparable.
Our conclusion is that, when performance alone is the priority, CPUs or embedded
GPUs on MPSoCs are not a valid option, despite their power-efficiency. ARM-based
exascale prototypes may soon evolve to become a viable option for exascale-class HPC
production machines if their performance improves while still maintaining a favorable
power consumption. Furthermore, in order to reduce programmer effort, the software
environment should provide a clear, high-level, abstract interface to the programmer to
efficiently execute functionality on the coupled FPGAs, opening the path for successful and
cost-effective use of such devices in HPC.
Our future activity will aim at exploiting more computational nodes, offering a
more comprehensive benchmark of both the computational power and the interconnect
network of the platforms.
8. Acknowledgments
This work was carried out within the EuroEXA FET-HPC and ESCAPE projects (grant
no. 754337 and no. 824064), funded by the European Union’s Horizon 2020 research and
innovation program. We thank the INAF Trieste Astronomical Observatory Information
Technology Framework. We thank Piero Vicini and the INFN APE Roma Group for the
support and for the use of INFN computational infrastructure. We also thank Giuseppe
Murante and Stefano Borgani for the fruitful discussions on the energy and performance
optimization of our codes.
References
[1] Calore E., Schifano S. F., Tripiccione R.: Energy-Performance Tradeoffs for HPC Applications on Low
Power Processors. Euro-Par 2015: Parallel Processing Workshops. Springer International Publishing
(2015). doi: 10.1007/978-3-319-27308-2_59
[2] V. P. Nikolskiy, V. V. Stegailov and V. S. Vecher: Efficiency of the Tegra K1 and X1 systems-on-chip
for classical molecular dynamics. (2016) International Conference on High Performance Computing &
Simulation (HPCS), Innsbruck, 2016, pp. 682-689.
[3] Nikolskii V., and Stegailov, V.: Domain-Decomposition Parallelization for Molecular Dynamics Algo-
rithm with Short-Ranged Potentials on Epiphany Architecture. Lobachevskii Journal of Mathematics
(2018). doi:10.1134/S1995080218090159
[4] Morganti L, Cesini D, Ferraro A: Evaluating Systems on Chip through HPC Bioinformatics and Astro-
physics Applications. 24th Euromicro International Conference on Parallel, Distributed, and Network-
Based Processing (PDP) 2016, 541-544, 2016. doi: 10.1109/PDP.2016.82
[5] Ammendola R., Biagioni A. , Cretaro P., Frezza O., Cicero FL et al.: The Next Generation of Exascale-
Class Systems: The ExaNeSt Project. In Euromicro Conference on Digital System Design (DSD), Vi-
enna, pp. 510-515 (2017) http://dx.doi.org/10.1109/DSD.2017.20
[6] Bertocco, S.; Goz, D.; Tornatore, L.; et al. “INAF Trieste Astronomical Observatory Information Tech-
nology Framework” doi: arXiv:1912.05340 [astro-ph.IM] 4 pages, conference, ADASS 2019
[7] Giuliano, Taffoni; Ugo, Becciani; Bianca, Garilli; et al. “CHIPP: INAF pilot project for HTC, HPC and
HPDA” doi:arXiv:2002.01283 [astro-ph.IM] 4 pages, conference, ADASS 2019
[8] Pawel, Czarnul; Jerzy, Proficz; and Adam Krzywaniak, “Energy-Aware High-Performance Computing:
Survey of State-of-the-Art Tools, Techniques, and Environments”, Scientific Programming 2019, Article
ID:8348791, Hindawi
[9] P. Dutot, Y. Georgiou, D. Glesser, L. Lefevre, M. Poquet and I. Rais, ”Towards Energy Budget Con-
trol in HPC,” 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
(CCGRID), Madrid, 2017, pp. 381-390.
[10] Daniele, Cesini; Elena, Corni; Antonio, Falabella; et al., “Power-Efficient Computing: Experiences
from the COSA Project” Scientific Programming 2017, Article ID:7206595, Hindawi
[11] Simula, F. et al., “Real-Time Cortical Simulations: Energy and Interconnect Scaling on Distributed Sys-
tems” 2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Pro-
cessing (PDP), Pavia, Italy, 2019, pp. 283-290.
[12] Ammendola, R.; Biagioni, A; Capuani F.; Cretaro, P.; De Bonis, G.; Lo Cicero, F.; Lonardo, A.; Mar-
tinelli, M.; Paolucci, P.S.; Pastorelli, E.; Pontisso, L.; Simula, F; and Vicini, P., “The Brain on Low
Power Architectures - Efficient Simulation of Cortical Slow Waves and Asynchronous States”, ParCo
2017, IOS Press, Advances in Parallel Computing Volume 32: Parallel Computing is Everywhere, pp.
760-769. doi:10.3233/978-1-61499-843-3-760
[13] Taffoni, Giuliano; Murante, Giuseppe; Tornatore, Luca; Katevenis, Manolis; Chrysos, Nikolaos;
Marazakis, Manolis: Shall Numerical Astrophysics Step Into the Era of Exascale Computing?,
2019ASPC..521..567T
[14] P. E. Dewdney, P. J. Hall, R. T. Schilizzi, and T. J. L. Lazio, “The Square Kilometre Array”, IEEE
Proceedings, 97, 1482, August 2009
[15] B. S. Acharya, M. Actis, T. Aghajani, et al., “Introducing the CTA concept”, in Astroparticle Physics,
43, 3, March 2013
[16] T. de Zeeuw, R. Tamai, and J. Liske, “Constructing the E-ELT”, The Messenger, 158, 3, December 2014
[17] J.P. Gardner, J.C. Mather, M. Clampin, et al. “The James Webb Space Telescope”, Space Sci.Rev.,
123, 485. April 2006
[18] L. Amendola, S. Appleby, D. Bacon, T. Baker, M. Baldi, N. Bartolo, et al., “Cosmology and Fundamental
Physics with the Euclid Satellite,” Living Reviews in Relativity, 16:6, September 2013
[19] A. Kolodzig, M. Gilfanov, R. Sunyaev, S. Sazonov, and M. Brusa. AGN and QSOs in the eROSITA
All-Sky Survey. I. Statistical properties. A&A, 558:A89, October 2013.
[20] EuroEXA: European Exascale System Interconnect and Storage. https://euroexa.eu/
[21] Spera M., Capuzzo-Dolcetta R., ”Rapid mass segregation in small stellar clusters”,
2017Ap&SS.362..233S, doi:10.1007/s10509-017-3209-6
[22] Spera, M., Mapelli, M., Bressan, A., ”The mass spectrum of compact remnants from the PARSEC
stellar evolution tracks”. 2015MNRAS.451.4086S, doi: 10.1093/mnras/stv1161
[23] M. Katevenis, R. Ammendola, A. Biagioni, P. Cretaro, O. Frezza, F. Lo Cicero, et al., “Next generation
of Exascale-class systems: ExaNeSt project and the status of its interconnect and storage development,”
Microprocessors and Microsystems, 61, 58, 2018
[24] Katevenis M., Chrysos N., Marazakis M., Mavroidis I., Chaix F., Kallimanis N., et al.: The ExaNeSt
Project: Interconnects, Storage, and Packaging for Exascale Systems, 2016 Euromicro Conference on
Digital System Design (DSD), Limassol, pp. 60-67 (2016)
[25] F. Chaix, A.D. Ioannou, N. Kossifidis, N. Dimou, G. Ieronymakis, M. Marazakis, V. Papaefstathiou, V.
Flouris, M. Ligerakis, G. Ailamakis, T.C. Vavouris, A. Damianakis, M. G.H. Katevenis and I. Mavroidis,
”Implementation and impact of an ultra-compact multi-FPGA board for large system prototyping”, 5th
International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC′19),
held in conjunction with SC’19, 2019
[26] S. Bertocco, D. Goz, L. Tornatore, G. Taffoni, “INCAS: INtensive Clustered ARM SoC – Cluster
Deployment,” in INAF Technical Reports doi:10.20371/INAF/PUB/2018 0000, 2018
[27] J.A. Pascual, J. Navaridas, J. Miguel-Alonso, “Effects of Topology-Aware Allocation Policies on
Scheduling Performance,” Job Scheduling Strategies for Parallel Processing, Lecture Notes in Computer
Science. 5798, pp. 138–144, 2009
[28] Cameron K.W., Ge R., Feng X., Varner D., Jones C.: High-performance, power-aware distributed com-
puting framework. In Proceedings of the International Conference on High Performance Computing,
Networking, Storage, and Analysis (SC), ACM/IEEE, (2004)
[29] Goz, D., Bertocco, S., Tornatore, L., and Taffoni, G., “Direct N-body Code on Low-Power Embedded
ARM GPUs,” Intelligent Computing, Springer International Publishing, Cham, pp. 179–193, 2019
[30] Goz, D., Ieronymakis, G., Papaefstathiou, V., Dimou, N., Bertocco, S., Ragagnin, A., Tornatore, L., Taffoni,
G., and Coretti, I., ”Direct N-body application on low-power and energy-efficient parallel architectures”,
2019, arXiv, arXiv:1910.14496
[31] Capuzzo-Dolcetta, R., Spera, M., Punzo, D.: A fully parallel, high precision, N-body code running on
hybrid computing platforms. Journal of Computational Physics 236 (2013) 580–593
[32] Capuzzo-Dolcetta R., Spera M.: A performance comparison of different graphics processing units run-
ning direct N-body simulations. Computer Physics Communications 184:2528–2539 (2013)
[33] Spera M.: Using Graphics Processing Units to solve the classical N-body problem in physics and astro-
physics. ArXiv e-prints 1411.5234 (2014)
[34] Nitadori K., Makino J.: Sixth- and eighth-order Hermite integrator for N-body simulations. New As-
tronomy 13:498–507, (2008)
[35] Thall A.: Extended-precision floating-point numbers for GPU computation. p 52, (2006) DOI
10.1145/1179622.1179682
