


















An Introduction to HPC
3 Nov 2015 Alexander Peyser Simulation Lab Neuroscience
Jülich Supercomputing Centre
























3 Nov 2015 | Alexander Peyser | 2/31
HPC definition HPC capabilities What? How? HiQ


















What is High Performance Computing?
HPC definition
HPC capabilities
What do we have?
How do we use them?
Fully scaled projects

What is High Performance Computing?
What is a Supercomputer?
Wikipedia: A supercomputer is a computer with a high-level
computational capacity compared to a general-purpose computer.
Performance of a supercomputer is measured in floating point
operations per second (FLOPS) instead of million instructions per
second (MIPS). As of 2015, there are supercomputers which can
perform up to quadrillions of FLOPS.

What is it in short?
A computer at the very limit of what is currently available

What do we do with Supercomputers?
Supercomputers are massively parallel machines using the most
advanced nodes, interconnects and memory to tackle problems that
are too large or take too long for commodity computers
Felix Schürmann (EPFL)

Typical workgroup cluster (10-15 TFlops)
High-end desktop (ca. 250 GFlops)
JURECA (2 PFlops)

What do we have?

How do we use them?





















Parallelizing out of the bottlenecks
Parallelism speedup
General implementation
GPUs: data parallel problems
Serial Bottlenecks Parallelism Speedup Implementation GPUs





















Arithmetic logic unit (ALU)










Flops: floating point operations per second
Bandwidth: number of bytes of memory transferred per second
Latency: time delay to completion of a single computational or memory access operation
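The three quantities above can be combined into a rough cost estimate. The following is a minimal sketch, not a performance model of any real machine: the function names and the machine parameters are hypothetical and chosen only for illustration.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Seconds to move n_bytes: one fixed latency plus size over bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def compute_time(n_flop, flops):
    """Seconds to execute n_flop floating point operations at rate `flops`."""
    return n_flop / flops


if __name__ == "__main__":
    # Hypothetical machine parameters, for illustration only.
    latency = 1e-6     # seconds per memory or network access
    bandwidth = 10e9   # bytes per second
    flops = 250e9      # floating point operations per second

    # Adding two vectors of 10**8 doubles: one flop per element,
    # three 8-byte values moved per element (two loads, one store).
    n = 10**8
    print("compute :", compute_time(n, flops), "s")
    print("transfer:", transfer_time(3 * 8 * n, latency, bandwidth), "s")

    # A single 8-byte access is dominated by latency, not bandwidth.
    print("one value:", transfer_time(8, latency, bandwidth), "s")
```

With these assumed numbers the vector addition spends far longer moving data than computing, which is the kind of bottleneck the following slides address.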

Parallelizing out of the bottlenecks
Hardware architectures
One pacman eats nine ghosts in 3 seconds...
... but three pacmen eat 9 ghosts in 1 second
... which is strong scaling
... or three pacmen eat 27 ghosts in 3 seconds
... which is weak scaling
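The pacman numbers can be turned into scaling efficiencies. This is a small sketch using the usual textbook definitions; the function names are illustrative and not taken from the talk.

```python
def strong_scaling_efficiency(t1, tn, n):
    """Fixed total problem size: the ideal time on n workers is t1 / n."""
    return t1 / (n * tn)


def weak_scaling_efficiency(t1, tn):
    """Fixed problem size per worker: the ideal time on n workers stays t1."""
    return t1 / tn


if __name__ == "__main__":
    # Strong scaling: 9 ghosts, 1 pacman needs 3 s, 3 pacmen need 1 s.
    print("strong:", strong_scaling_efficiency(t1=3.0, tn=1.0, n=3))

    # Weak scaling: 9 ghosts per pacman, 3 s with 1 pacman (9 ghosts)
    # and still 3 s with 3 pacmen (27 ghosts).
    print("weak  :", weak_scaling_efficiency(t1=3.0, tn=3.0))
```

Both pacman cases are ideal, so both efficiencies come out as 1. Real codes fall below that once serial parts and communication enter.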

Amdahl’s law: $F_A(n) = \left[ S + (1 - S)/n \right]^{-1}$
If $S > 0$, $\lim_{n \to \infty} F_A(n) = 1/S$
Gustafson’s law: $F_G(n) = S + n (1 - S)$
$(F_G(n) - S)/n = 1 - S$
[Figure: scaling plot, x-axis from 16 to 2048]
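The two laws are easy to evaluate numerically. The sketch below uses the formulas exactly as given, with S as the serial fraction and n as the number of parallel workers. The value S = 0.05 is an arbitrary example, not a measurement.

```python
def amdahl_speedup(s, n):
    """Amdahl's law: F_A(n) = 1 / (S + (1 - S) / n)."""
    return 1.0 / (s + (1.0 - s) / n)


def gustafson_speedup(s, n):
    """Gustafson's law: F_G(n) = S + n * (1 - S)."""
    return s + n * (1.0 - s)


if __name__ == "__main__":
    s = 0.05  # assumed serial fraction, for illustration only
    for n in (16, 32, 64, 128, 256, 512, 1024, 2048):
        print(n, round(amdahl_speedup(s, n), 2), round(gustafson_speedup(s, n), 2))
    print("Amdahl limit 1/S =", 1.0 / s)
```

Amdahl's speedup saturates below 1/S however many workers are added, while Gustafson's scaled speedup keeps growing linearly in n because the problem size grows with the machine.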






















[Figure: scaling plot continued, x-axis from 4096 to 65536]

























GPUs: data parallel problems
Hardware architectures
GPU: Graphics Processing Unit
A large number of arithmetic/floating point units with reduced control
logic, operating in parallel to the CPU. Originally developed for 3D
graphics rendering.
NVIDIA, AMD

Performance Architecture Accelerators JUQUEEN JURECA

















































[Figure: performance development and projection, 1993 to 2019. Source: top500.org]

[Figure: TOP500 chart, 1993 to 2015. Source: top500.org]

JUQUEEN
CPU core:                    PowerPC A2, 16 (x4)
Cores:                       458,752
Linpack performance (Rmax):  5.009 PFlop/s
Theoretical peak (Rpeak):    5.872 PFlop/s
Power:                       2.3 MW
Memory:                      458 TB
Interconnect:                Custom interconnect network
Topology:                    5D torus
Operating system:            Linux / CNK
Batch system:                LoadLeveler
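A few derived figures make these numbers easier to compare with other machines. The sketch below uses only the values listed above, except for the clock rate (1.6 GHz) and the 8 floating point operations per core per cycle, which are assumptions not stated on the slide.

```python
R_MAX = 5.009e15    # Linpack performance in Flop/s
R_PEAK = 5.872e15   # theoretical peak in Flop/s
POWER_W = 2.3e6     # power in watts
CORES = 458_752


def linpack_efficiency(r_max, r_peak):
    """Fraction of the theoretical peak reached in the Linpack benchmark."""
    return r_max / r_peak


def flops_per_watt(r_max, power_w):
    """Energy efficiency in Flop/s per watt."""
    return r_max / power_w


def peak_from_cores(cores, clock_hz, flop_per_cycle):
    """Theoretical peak as cores x clock x floating point operations per cycle."""
    return cores * clock_hz * flop_per_cycle


if __name__ == "__main__":
    print("Rmax / Rpeak:", round(linpack_efficiency(R_MAX, R_PEAK), 3))
    print("GFlop/s per W:", round(flops_per_watt(R_MAX, POWER_W) / 1e9, 2))
    # Assumed: 1.6 GHz clock and 8 Flop per core per cycle.
    print("peak estimate:", peak_from_cores(CORES, 1.6e9, 8) / 1e15, "PFlop/s")
```

Under those two assumptions the estimated peak agrees with the listed Rpeak to within rounding.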

JURECA
Manufacturer:                T-Platforms, ParTec, Intel, Mellanox
Design:                      Intel / GPU
CPU core:                    2 Intel Haswell 12-core & K80 GPUs
Cores:                       45,216
GPUs:                        75 nodes w/ 2 K80 (4992 CUDA)
Linpack performance (Rmax):  1.4 PFlop/s (preliminary)
Theoretical peak (Rpeak):    1.8 PFlop/s + 0.44 PFlop/s (GPU)
Power:                       ???
Memory:                      281 TB
Interconnect:                Mellanox EDR InfiniBand
Operating system:            CentOS 7 Linux
Batch system:                Slurm
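The node count is not listed, but it can be estimated from the core count. This sketch assumes that "2 Intel Haswell 12-core" means two 12-core sockets per node and that "4992 CUDA" is the CUDA core count of one K80 card. Both readings are assumptions about the slide.

```python
CORES = 45_216
SOCKETS_PER_NODE = 2       # assumption: two CPU sockets per node
CORES_PER_SOCKET = 12
GPU_NODES = 75
K80_PER_GPU_NODE = 2
CUDA_CORES_PER_K80 = 4992  # assumption: figure is per K80 card


def node_count(cores, sockets_per_node, cores_per_socket):
    """Number of nodes implied by a total core count."""
    nodes, rest = divmod(cores, sockets_per_node * cores_per_socket)
    if rest:
        raise ValueError("core count is not a whole number of nodes")
    return nodes


def total_cuda_cores(gpu_nodes, cards_per_node, cores_per_card):
    """CUDA cores summed over all GPU nodes."""
    return gpu_nodes * cards_per_node * cores_per_card


if __name__ == "__main__":
    print("nodes:", node_count(CORES, SOCKETS_PER_NODE, CORES_PER_SOCKET))
    print("CUDA cores:", total_cuda_cores(GPU_NODES, K80_PER_GPU_NODE, CUDA_CORES_PER_K80))
```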

Hardware Cluster LL Commands Script




















3. Compute Card (“Node”): one BQC module, 16 GB DDR3 memory
4. Node Card (“Node Board”): 32 compute cards (2x2x2x2x2), optical modules, BQL link chips, torus
5a. Midplane: 16 node cards
5b. I/O drawer (1, 2 or 4 per rack): 8 I/O cards @ 16 GB, 8 PCIe gen2 x8 slots (IB, 10GbE)
6. Rack: 2 midplanes, 1, 2 or 4 I/O drawers
7. System: e.g. 28 racks = 5.9 PF/s, e.g. 96 racks = 20 PF/s
Source: IBM
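Multiplying through the packaging hierarchy gives the size of a full system. The sketch below takes the per-level counts from the list above. The figure of 16 cores per node is read from the "PowerPC A2 16" entry in the JUQUEEN specifications and is treated here as an assumption.

```python
from math import prod

COMPUTE_CARDS_PER_NODE_CARD = prod([2, 2, 2, 2, 2])  # 2x2x2x2x2 = 32
NODE_CARDS_PER_MIDPLANE = 16
MIDPLANES_PER_RACK = 2
CORES_PER_NODE = 16    # assumption: from the "PowerPC A2 16" entry
MEMORY_PER_NODE_GB = 16


def nodes_per_rack():
    """Compute cards (nodes) in one rack."""
    return COMPUTE_CARDS_PER_NODE_CARD * NODE_CARDS_PER_MIDPLANE * MIDPLANES_PER_RACK


def system_size(racks):
    """Return (nodes, cores, memory in GB) for a system of `racks` racks."""
    nodes = racks * nodes_per_rack()
    return nodes, nodes * CORES_PER_NODE, nodes * MEMORY_PER_NODE_GB


if __name__ == "__main__":
    nodes, cores, memory_gb = system_size(racks=28)
    print("nodes:", nodes)
    print("cores:", cores)
    print("memory (GB):", memory_gb)
```

For 28 racks the core count and memory computed this way match the 458,752 cores and 458 TB quoted for JUQUEEN.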

Michael Stephan & Jutta Doctor (JSC)

Supercomputers are shared resources

llsubmit <jobfile>   Send a job to the queuing system
llq                  List all queued and running jobs
llq -l <job ID>      Show detailed information about the specified job
llq -s <job ID>      Show details about a specific queued job, such as its start time
llq -u <user>        List all jobs from one user
llcancel <job ID>    Kill the specified job
llstatus             Display the status of LoadLeveler
llclass              List existing classes and their properties
llqx                 Show detailed information about all jobs

# @job_name = weakScale_nmr00032_nthr04_N16667_dryRun
# @job_type = bluegene
# @bg_size = 32
# @bg_connectivity = TORUS
# @environment = COPY_ALL
# @wall_clock_limit = 00:30:00
# @output = $(HOME)/log/juqueen/$(job_name).$(jobid).out
# @error = $(HOME)/log/juqueen/$(job_name).$(jobid).err
# @notification = error
# @queue

# NEST_EXE, NPROCS, SLI_SCRIPT and OMP_NUM_THREADS must be set before this
# point (their definitions are not shown here).
runjob --exe $NEST_EXE \
    --np $NPROCS \
    --ranks-per-node 1 \
    --verbose 1 --exp-env OMP_NUM_THREADS \
    --args $SLI_SCRIPT

Overview Languages Tools





















MPI: at the cluster level, message passing standard via network between nodes (application)
OpenMP: at the node level, thread programming standard using shared memory (process)
OpenCL: at the GPU level, work groups of parallel kernels
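The first two levels can be pictured as a nested split of the work. The sketch below is plain Python and calls neither MPI nor OpenMP. It only shows how a range of items might be divided first across ranks (the cluster level) and then across threads within each rank (the node level). The function names and the contiguous block decomposition are illustrative choices.

```python
def block_range(n_items, n_parts, part):
    """Half-open range [start, stop) of block `part` when n_items are split
    into n_parts contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_parts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop


def two_level_ranges(n_items, n_ranks, n_threads):
    """Map (rank, thread) to an index range: split across ranks first,
    then split each rank's block across its threads."""
    ranges = {}
    for rank in range(n_ranks):
        r_start, r_stop = block_range(n_items, n_ranks, rank)
        for thread in range(n_threads):
            t_start, t_stop = block_range(r_stop - r_start, n_threads, thread)
            ranges[(rank, thread)] = (r_start + t_start, r_start + t_stop)
    return ranges


if __name__ == "__main__":
    for key, value in sorted(two_level_ranges(10, n_ranks=2, n_threads=2).items()):
        print(key, value)
```

In a real hybrid code the outer split would correspond to MPI ranks exchanging messages over the network and the inner split to OpenMP threads sharing the memory of one process.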

Which programming languages can I use?
Scripting: run over parameter spaces with scripting languages and JUBE, and run almost any code
MPI: C/C++/Fortran/Python/Java... (special library)
OpenMP: C/C++/Fortran (special directives)
pthreads: native threads for most languages
GPU: C/C++ with other experimental kernel mappings, bindings on CPUs to almost everything... (kernels compiled and shipped)
Write the outer loops and surrounding code as you want, and focus on optimizing the core kernels
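Running over a parameter space with a scripting language can be as simple as generating one job file per parameter combination. The sketch below is a hand-rolled illustration in plain Python, not JUBE. The job name pattern, the helper names and the choice of swept parameters are hypothetical, and nothing is submitted.

```python
from itertools import product

# LoadLeveler-style header; the keywords follow the job script shown earlier
# in the talk, and the directives are closed with a queue statement.
HEADER = """\
# @job_name = {name}
# @job_type = bluegene
# @bg_size = {bg_size}
# @wall_clock_limit = {wall}
# @queue
export OMP_NUM_THREADS={threads}
"""


def job_name(bg_size, threads):
    """Hypothetical naming scheme encoding the swept parameters."""
    return f"sweep_bg{bg_size:05d}_nthr{threads:02d}"


def render_jobs(bg_sizes, thread_counts, wall="00:30:00"):
    """Return {job name: job file text} for every parameter combination."""
    jobs = {}
    for bg_size, threads in product(bg_sizes, thread_counts):
        name = job_name(bg_size, threads)
        jobs[name] = HEADER.format(
            name=name, bg_size=bg_size, wall=wall, threads=threads
        )
    return jobs


if __name__ == "__main__":
    for name, text in render_jobs([32, 64], [1, 4]).items():
        print(f"--- {name}.job ---")
        print(text)
```

Each rendered text would still need the actual launch command appended before being written to a file and passed to the batch system.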

[Figure: profile & trace analysis. Source: Introduction to Parallel Performance Engineering, VI-HPS]

Questions?
HPC Arch Systems Practical Programming Conclusions
