Technology emerging from the DEEP & DEEP-ER projects by Suarez, Estela
Estela Suarez  
Jülich Supercomputing Centre 
 
03.06.2016 
The research leading to these results has received funding from the European Community's Seventh 
Framework Programme (FP7/2007-2013) under Grant Agreement n° 287530 and n° 610476  
Topics

DEEP
• Cluster-Booster architecture
• Software stack
• Programming environment
• Energy efficiency
• Applications:
– Co-design
– Evaluation/demonstration
– Code modernisation

DEEP-ER
• Extend memory hierarchy
• High-performance I/O
• Scalable resiliency
• Applications:
– Co-design
– Evaluation/demonstration
– Code modernisation
CLUSTER-BOOSTER ARCHITECTURE
“Standard” heterogeneity
[Figure: compute nodes (CN) connected by InfiniBand, each with GPUs attached]
• Flat topology
• Simple management of resources
• Static assignment of accelerators to CPUs
• Accelerators cannot act autonomously
Cluster-Booster architecture
[Figure: Cluster of compute nodes (CN) on InfiniBand, connected through Booster Interface (BI) nodes to a Booster of Booster Nodes (BN) on EXTOLL]
• Flexible assignment of resources (CPUs, accelerators)
• Direct communication between accelerators
• “Offload” of large and complex parts of applications
DEEP Architecture
[Figure: DEEP architecture with Intel® Xeon® and Intel® Xeon Phi™ nodes]
DEEP System
• Installed at JSC
• 1.5 racks
• 500 TFlop/s peak performance
• 3.5 GFlop/s/W
• Water cooled
• Cluster (128 Xeon)
• Booster (384 Xeon Phi KNC)
[Figure: Node Card and Interface Card]
Booster main components

Booster measurements
MPI Linktest: ping-pong
[Figure: pairwise ping-pong results, Booster Nodes (KNC) from 1 to 384 on both axes]
• Latency: 870 us
• BW: 1.3 MB/s
• PCIe BW issue in 3 nodes

Booster measurements
MPI Linktest: ping-pong
[Figure: pairwise ping-pong results, Booster Nodes (KNC) from 1 to 384 on both axes]
• Stddev < 10%
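A ping-pong test bounces a fixed-size message between two nodes and times the round trips. The sketch below shows how one-way time, bandwidth and relative standard deviation are commonly derived from such timings. It is not the MPI Linktest implementation, and the function names and sample numbers are illustrative assumptions.

```python
import statistics


def pingpong_metrics(total_time_s, iterations, message_bytes):
    """Derive one-way time and bandwidth from a timed ping-pong loop.

    Assumes each iteration is one full round trip (ping + pong), so the
    one-way time is half of the average round-trip time.
    """
    if iterations <= 0 or total_time_s <= 0:
        raise ValueError("iterations and total_time_s must be positive")
    one_way_s = total_time_s / (2 * iterations)
    bandwidth_bytes_per_s = message_bytes / one_way_s
    return one_way_s, bandwidth_bytes_per_s


def relative_stddev(samples):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(samples) / statistics.mean(samples)


# Hypothetical numbers, for illustration only: 1000 round trips of a
# 1 MiB message that took 2.0 s in total.
latency_s, bw = pingpong_metrics(2.0, 1000, 1024 * 1024)
print(f"one-way time: {latency_s * 1e6:.1f} us, bandwidth: {bw / 1e6:.1f} MB/s")
```

Running the same calculation for every node pair and feeding the results to `relative_stddev` gives a spread figure comparable in kind to the "Stddev < 10%" quoted above.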
Network EXTOLL Tourmalet

Main EXTOLL characteristics
– Direct network: no switches required
– Integrates network interface controller
– Supports 6+1 links
– Capable of tunneling PCIe (allows remote-booting KNC from the network)

Current (A3) version of EXTOLL ASIC
– 270 million transistors
– Link bandwidth: 100 G
– MPI latency: 850 ns
– MPI bandwidth: 8.5 GB/s
– Message rate: 70 million msgs/sec
– PCIe Gen3 x16

[Figures: Tourmalet PCI Express Board; Tourmalet Chip and Wafer]
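The MPI latency and bandwidth quoted for the A3 ASIC can be combined in a first-order transfer-time model, t(n) = latency + n / bandwidth. The sketch below assumes this linear model holds, which real networks only approximate; the helper names are illustrative.

```python
MPI_LATENCY_S = 850e-9                 # 850 ns, as quoted on the slide
MPI_BANDWIDTH_BYTES_PER_S = 8.5e9      # 8.5 GB/s, as quoted on the slide


def transfer_time_s(message_bytes,
                    latency_s=MPI_LATENCY_S,
                    bandwidth=MPI_BANDWIDTH_BYTES_PER_S):
    """First-order model: fixed start-up latency plus size over bandwidth."""
    return latency_s + message_bytes / bandwidth


def effective_bandwidth(message_bytes, **kwargs):
    """Bandwidth seen by a message of the given size under the model."""
    return message_bytes / transfer_time_s(message_bytes, **kwargs)


def half_bandwidth_size_bytes(latency_s=MPI_LATENCY_S,
                              bandwidth=MPI_BANDWIDTH_BYTES_PER_S):
    """Message size at which the effective bandwidth is half the peak."""
    return latency_s * bandwidth


for size in (64, 4096, 1024 * 1024):
    print(size, f"{effective_bandwidth(size) / 1e9:.2f} GB/s")
print("n_1/2 =", half_bandwidth_size_bytes(), "bytes")
```

Under this model, messages much smaller than about 7 kB are dominated by latency, and larger ones by bandwidth.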
GreenICE Booster

GreenICE system: alternative Booster implementation
• Interconnect EXTOLL ASIC “Tourmalet”
• 32 KNC-node system
• Implements a 4×4×2 topology, with the Z dimension open

2-phase immersion cooling
• NOVEC liquid from 3M
• Evaporates at about 50 °C
• Condenses again on a water cooling pipe
• Allows very high-density integration
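The GreenICE topology can be pictured as a small 3D grid. The sketch below enumerates the direct neighbours of a node in a 4×4×2 grid where X and Y wrap around and Z does not. Reading "442" as 4×4×2 and "Z dimension open" as "no wrap-around link in Z" is an assumption, as is the coordinate numbering.

```python
DIMS = (4, 4, 2)              # assumed reading of the "442" topology
WRAPS = (True, True, False)   # Z dimension open: no wrap-around link


def neighbours(node, dims=DIMS, wraps=WRAPS):
    """Return the set of grid coordinates directly linked to `node`."""
    result = set()
    for axis, (size, wrap) in enumerate(zip(dims, wraps)):
        for step in (-1, 1):
            coord = node[axis] + step
            if wrap:
                coord %= size
            elif not 0 <= coord < size:
                continue  # open dimension: no link past the edge
            candidate = node[:axis] + (coord,) + node[axis + 1:]
            if candidate != node:
                result.add(candidate)
    return result


node_count = DIMS[0] * DIMS[1] * DIMS[2]
print(node_count, "nodes")
print(sorted(neighbours((0, 0, 0))))
```

With these assumptions the grid holds 32 nodes, matching the 32 KNC nodes of the system, and every node has five direct neighbours because the Z dimension has only two layers and no wrap-around.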
MPI performance: Latency
MPI performance: Bandwidth
Factor of 3× achieved by EXTOLL TOURMALET (5 Gbit/s version A2)
DEEP Architecture
[Figure: DEEP architecture with Xeon and Xeon Phi nodes]

DEEP-ER Architecture
Innovation:
• Simplified Interconnect
• On-Node NVM
• Self-Booting Nodes
• Network Attached Memory
[Figure: DEEP-ER architecture with Xeon and Xeon Phi nodes]
DEEP-ER Aurora Blade prototype
• Eurotech’s Aurora technology
• Direct water cooled, high density

Aurora Blade DEEP-ER Booster (in construction)
• Rootcard: 18x EXTOLL, 18x NVMe
• Chassis: 18x KNL, 94GB Mem, 1x backplane
[Figure: Aurora Blade Chassis (dimension labels: 19 inch, 7U, 4U, 3U) with EXTOLL Tourmalet, NVMe and Intel Xeon Phi (KNL)]
Network Attached Memory
NAM architecture
• Xilinx Virtex 7 FPGA
• Hybrid Memory Cube (HMC)
– Bandwidth HMC↔FPGA: 40+ GByte/s
– HMC controller: open-source development
• Attached to TOURMALET NIC
libNAM (libc based) for ease of use
Use cases:
• global (shared) storage
• compute node for an XOR checkpoint/restart (C/R) application
• “active memory”, etc.
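The XOR checkpoint/restart use case relies on parity: the XOR of all nodes' checkpoint blocks is stored in one place (for instance on the NAM), and any single lost block can be rebuilt from the parity and the surviving blocks. The sketch below shows the parity arithmetic only. It does not use libNAM, and all names are illustrative.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_lost_block(surviving_blocks, parity):
    """Recover the single missing block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


checkpoints = [b"node0-ck", b"node1-ck", b"node2-ck"]
parity = xor_blocks(checkpoints)
recovered = rebuild_lost_block([checkpoints[0], checkpoints[2]], parity)
print(recovered == checkpoints[1])
```

The scheme tolerates the loss of one block per parity group; losing two blocks of the same group cannot be repaired from a single parity block.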
[Figure: NAM Board (UHEI Aspin v2 Board). Labels: EXTOLL Tourmalet NIC, 12-lane EXTOLL cable, HDI6 connectors, Xilinx Virtex 7 FPGA, 16 HMC lanes, HMC DRAM. Inside the FPGA: EXTOLL endpoint (12 EXTOLL lanes), RMA engine, NAM functions, HMC controller (16 HMC lanes).]
SOFTWARE
Programming environment
[Figure: Cluster (InfiniBand) and Booster (EXTOLL) joined through the Booster Interface using the Cluster Booster Protocol; ParaStation MPI spans both sides via MPI_Comm_spawn]
OmpSs on top of MPI provides pragmas to ease the offload process
Cluster-Booster Protocol
[Figure: throughput [MB/s] (0 to 1000) versus message size; series: CBP, Message-based, RMA limit]
Application running on DEEP
[Figure: Source code → Compiler → Application binaries → DEEP Runtime]
#pragma omp task in(…) out (…) onto (com, size*rank+1)
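As we read the pragma, the `onto` clause names a communicator (`com`) and the rank within it that should execute the task. The sketch below only evaluates the target-rank expression `size*rank+1` for a few host ranks, to show how the offload targets are spread out. The value chosen for `size` is hypothetical and the function name is illustrative.

```python
def offload_target(rank, size):
    """Evaluate the target-rank expression from the pragma: size*rank+1."""
    return size * rank + 1


size = 4  # hypothetical value, chosen only for illustration
targets = [offload_target(rank, size) for rank in range(3)]
for rank, target in enumerate(targets):
    print(f"host rank {rank} -> remote rank {target}")
```

Each host rank addresses a distinct remote rank, spaced `size` apart, so tasks offloaded from different host ranks do not land on the same remote process.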
DEEP Offload (with OmpSs)
Performance & Scalability evaluation
[Figure 7: FWI hierarchical MPI architecture. A master (rank 0) sits above 16 slave groups, slave0 to slave15, covering ranks 0-15, 16-31, …, 240-255. Each slave has worker ranks 0 to 3, with workers numbered up to wk63, wk127, …, wk1023.]
[Figure 8: Scalability of FWI application on up to 1024 nodes. Speed-up versus # nodes (16 cores), both axes from 64 to 1024; series: Ideal, OmpSs Offload, OmpSs Offload (no I/O).]
VI. Conclusions and future work
This paper presents the OmpSs Offload model that was originally developed to ease the porting of complex applications to the highly heterogeneous cluster architecture proposed on the DEEP Exascale project. The OmpSs Offload model has completely fulfilled its design goals, combining the ease of use of Intel Offload with the flexibility, performance and scalability of the native MPI_Comm_spawn API. Moreover, our approach is fully integrated with the rest of features provided by OmpSs, such as support for OpenMP codes and CUDA or OpenCL kernels. Although it was originally conceived for heterogeneous clusters we have also successfully used it to develop hierarchical MPI applications such as FWI. We think that these hierarchical MPI architectures will play an important role in exploiting future Exascale systems. Hence, tools such as OmpSs Offload will be essential for designing such architectures and helping with their implementation for complex and large applications.
As future work, we plan to integrate our allocation API with a resource manager/job scheduler to avoid the need to reserve all the resources that will be required before the program is launched. We also plan to investigate the potential of OmpSs Offload to improve the malleability of existing MPI applications, as well as the implications of using this offload model from the resilience point of view.
Published in: “Collective Offload for Heterogeneous Clusters”, HiPC 2015  
FWI (full waveform inversion) code
Scalable I/O 
• Improve I/O scalability on all usage-levels 
• Used also for checkpointing 
Filesystem 
• Two instances: 
– Global FS on HDD server 
– Cache FS on NVM at node 
• API for cache domain handling 
– Synchronous version 
– Asynchronous version 
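The two flavours of cache-domain handling can be pictured as a blocking and a non-blocking flush from the node-local cache to the global file system. The sketch below imitates that with plain directories and a thread pool. It is not the DEEP-ER or BeeGFS API, and every name in it is hypothetical.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class CacheDomain:
    """Toy write cache: files land in `cache_dir` and are flushed to `global_dir`."""

    def __init__(self, cache_dir, global_dir):
        self.cache_dir = Path(cache_dir)
        self.global_dir = Path(global_dir)
        self._pool = ThreadPoolExecutor(max_workers=1)

    def write(self, name, data):
        (self.cache_dir / name).write_bytes(data)

    def flush_sync(self, name):
        """Synchronous version: returns once the file is on the global FS."""
        shutil.copyfile(self.cache_dir / name, self.global_dir / name)

    def flush_async(self, name):
        """Asynchronous version: returns a future so compute can continue."""
        return self._pool.submit(self.flush_sync, name)

    def close(self):
        self._pool.shutdown(wait=True)


with tempfile.TemporaryDirectory() as cache, tempfile.TemporaryDirectory() as global_fs:
    domain = CacheDomain(cache, global_fs)
    domain.write("a.dat", b"alpha")
    domain.write("b.dat", b"beta")
    domain.flush_sync("a.dat")
    pending = domain.flush_async("b.dat")
    pending.result()  # wait before relying on the global copy
    flushed = sorted(p.name for p in Path(global_fs).iterdir())
    domain.close()
print(flushed)
```

The asynchronous variant lets an application overlap the flush with computation, at the price of having to wait on the returned future before the global copy can be trusted.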
Resiliency 
• Develop a hierarchical, distributed checkpoint/restart mechanism leveraging the DEEP-ER architecture
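One building block for distributed checkpointing is buddy checkpointing, which this deck lists among the SIONlib extensions: each node keeps its own checkpoint locally and a copy on a partner node. The toy sketch below uses in-memory dictionaries in place of node-local storage. The ring-shaped buddy assignment and all names are assumptions for illustration, not the project's implementation.

```python
def buddy_of(node, node_count):
    """Ring assignment: each node's buddy is its right-hand neighbour."""
    return (node + 1) % node_count


def write_checkpoints(states):
    """Store every node's state locally and on its buddy."""
    n = len(states)
    storage = [{} for _ in range(n)]
    for node, state in enumerate(states):
        storage[node][node] = state                 # local copy
        storage[buddy_of(node, n)][node] = state    # buddy copy
    return storage


def restore(node, storage, failed):
    """Fetch `node`'s checkpoint, using the buddy copy if the node's storage is lost."""
    if node not in failed:
        return storage[node][node]
    buddy = buddy_of(node, len(storage))
    if buddy in failed:
        raise RuntimeError("node and its buddy both failed")
    return storage[buddy][node]


storage = write_checkpoints(["s0", "s1", "s2", "s3"])
print(restore(2, storage, failed={2}))
```

A hierarchical scheme would typically pair such fast local and buddy copies with less frequent checkpoints to the global file system, which covers the case where a node and its buddy fail together.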
APPLICATIONS
Application-driven approach 
• DEEP+DEEP-ER applications: 
– Brain simulation (EPFL) 
– Space weather simulation (KULeuven) 
– Climate simulation (Cyprus Institute) 
– Computational fluid engineering (CERFACS) 
– High temperature superconductivity (CINECA) 
– Seismic imaging (CGG) 
– Human exposure to electromagnetic fields (INRIA) 
– Geoscience (LRZ Munich) 
– Radio astronomy (Astron)  
– Oil exploration (BSC) 
– Lattice QCD (University of Regensburg) 
• Goals: 
– Co-design and evaluation of architecture and its programmability 
– Analysis of the I/O and resiliency requirements of HPC codes 
Cluster-Booster Advantages 
• More flexible than a standard architecture 
→ This enables different use models: 
1. Dynamic ratio of processors/coprocessors 
2. Use Booster as pool of accelerators (globally shared) 
3. Discrete use of the Booster 
4. Discrete use + I/O offload 
5. Specialized symmetric mode 
• Enables a more efficient use of system resources 
– Only resources actually needed are blocked by applications 
– Dynamic allocation further increases system utilization 
Code optimisations
BSC: Enhancing Oil Exploration (FWI, wave propagator)
• 1 XeonPhi (60 cores), 180 OpenMP threads
[Figure: Impact of different optimizations of wave propagator on Xeon Phi; axes: speedup (0 to 12) and Gflops/s (0 to 160)]
Using SIONlib
LRZ: Rapid crustal deformation & earthquake source equation (SeisSol)
• 1 process per node, 16 threads per process
• writing 20 checkpoint files (4GB/checkpoint)
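SIONlib's basic idea is to map many task-local files onto one (or a few) shared physical files, so the file system sees a handful of large files instead of thousands of small ones. The sketch below shows only the underlying offset arithmetic, with each task's chunk aligned to a file-system block. It does not call SIONlib, and the block size, sample sizes and function name are assumptions.

```python
def chunk_offsets(chunk_sizes, block_size=4096):
    """Return the start offset of each task's chunk in one shared file.

    Every chunk is rounded up to a multiple of `block_size` so that no two
    tasks write into the same file-system block.
    """
    offsets = []
    position = 0
    for size in chunk_sizes:
        offsets.append(position)
        blocks = -(-size // block_size)  # ceiling division
        position += blocks * block_size
    return offsets


# Hypothetical per-task checkpoint sizes in bytes.
offsets = chunk_offsets([100, 5000, 4096])
print(offsets)
```

Aligning chunks to block boundaries costs some padding but avoids contention between tasks that would otherwise share a block.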
 
[Figure: Accumulated bandwidth on SuperMUC; bandwidth [GB/s] (0.1 to 100.0, logarithmic scale) versus # of nodes (16 to 1,024); series: SIONlib, POSIX, MPI I/O, HDF5]
Using NVMe
Inria: Assessment of Human exposure to EM fields
• 24 MPI processes, 1 thread per process
[Figure: I/O performance of MAXW-DGTD; writing time [s] (0 to 60) for models P1 to P4, comparing sdv-work and NVMe; model precision increases P1<P2<P3<P4]
DEEP/-ER  
Emerging Technologies 
• Cluster-Booster Architecture:  
– Alternative approach to heterogeneity 
– High flexibility enabling various use modes 
• Hardware components: 
– Booster (new kind of cluster of accelerators) 
– GreenICE Booster (2-phase immersion cooling) 
– EXTOLL network tested at scale 
– Warm-water cooling 
– Memory hierarchy based on NVM 
– Network Attached Memory 
 
 
• Software 
– Cluster-Booster Protocol: low-level communication protocol 
between different high-speed networks 
– Programming environment for future heterogeneous systems 
– ParaStation Global MPI supporting EXTOLL and CBP 
– OmpSs extensions for DEEP Offload 
– Resiliency extensions for OmpSs (task recovery) and ParaStation 
– BeeGFS extension for local caches (on NVM) 
– SIONlib extensions for buddy-checkpointing, integration with SCR 
and use of BeeGFS functionality 
– E10 scalability optimisations for MPI-I/O 
– Extrae/Paraver support for DEEP Offload 
– Applications modernisation and optimisation 
DEEP and DEEP-ER
www.deep-project.eu   www.deep-er.eu
EU-Exascale projects
20 partners
Total budget: 28.3 M€
EU-funding: 14.5 M€
Nov 2011 – Mar 2017

Visit us @ ISC’16, Frankfurt (Germany), 20.-22.06.2016
– Booth #1340
– BoF #11
– Workshop
