High-performance computing hardware for high data rates by Chilingaryan, S. et al.
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association (www.kit.edu)
High-performance computing 
hardware for high data rates
Agenda
Parallel Computing: Possibilities & Challenges
Handling Data I/O at High Rates
Accelerating Synchrotron Tomography
Scaling to Cluster
Authors
Suren A. Chilingaryan, KIT
Michele Caselle, KIT
Thomas van de Kamp, KIT
Andreas Kopmann, KIT
Alessandro Mirone, ESRF
Uros Stevanovic, KIT
Tomy dos Santos Rolo, KIT
Matthias Vogelgesang, KIT
S. Chilingaryan et al., Institute for Data Processing and Electronics
Karlsruhe Institute of Technology
UFO: Ultra Fast X-ray Imaging of Scientific Processes with On-Line Assessment and Data-Driven Process Control
[Diagram: ANKA beam line; optics and sample manipulators; smart high-speed camera; online monitoring and evaluation; offline storage]
Goals
High speed tomography
Increase sample throughput
Tomography of temporal processes
Allow interactive quality assessment
Enable data driven control
Auto-tuning optical system
Tracking dynamic processes
Finding area of interest
Reconstruction Problem
Resolution: 2560 x 2160
Dynamic Range: 16 bit
Frame Rate: 100 fps
PCO.edge
Tomographic Reconstruction
3D image: 2000^3
Projections: 2000
Acquisition time: 20 seconds
FBP Complexity: 144 Tflops
Xeon Performance: ~ 100 Gflops
Minimum time: ~ 15 minutes on DP
Actually: ~ 1 hour
20 seconds acquisition
1 hour reconstruction
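The camera and reconstruction figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes 2 bytes per 16-bit pixel and counts one back-projection update per voxel per projection; the variable names are illustrative and do not come from the slides.

```python
# Cross-check of the camera and reconstruction figures quoted on this slide.
width, height = 2560, 2160   # PCO.edge resolution, pixels
bytes_per_pixel = 2          # assumption: 16-bit dynamic range stored as 2 bytes
fps = 100                    # frame rate
projections = 2000
volume_side = 2000           # 3D image of 2000^3 voxels
fbp_flops = 144e12           # FBP complexity quoted on the slide

frame_bytes = width * height * bytes_per_pixel
camera_rate_gb_s = frame_bytes * fps / 1e9       # sustained camera output
acquisition_s = projections / fps                # time to record all projections
raw_data_gb = frame_bytes * projections / 1e9    # size of one raw data set

updates = volume_side ** 3 * projections         # voxel updates during back projection
flops_per_update = fbp_flops / updates

print(f"camera rate:      {camera_rate_gb_s:.2f} GB/s")
print(f"acquisition time: {acquisition_s:.0f} s")
print(f"raw data set:     {raw_data_gb:.1f} GB")
print(f"flops per update: {flops_per_update:.1f}")
```

The camera rate of roughly 1.1 GB/s and the 20 s acquisition agree with the slide, and the 144 Tflops figure corresponds to about 9 floating-point operations per voxel update.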
[Figure: heads of a newt larva showing bone formation and muscle insertions (top) and a stick insect (bottom); acquisition time 2 s]
Parallel Architectures
[Figure: rise of GPU performance as compared to Xeons, 2007 to 2012; peak GFlops on a log scale (10 to 10000) for Xeon/SP, Xeon/DP, Tesla/SP, Tesla/DP; labelled values: Xeon 192 SP and 96 DP, Tesla 3950 SP and 1170 DP]
                 E7-8870    Xeon/Phi   Tesla K20  GeForce Titan  AMD HD7970  Power7+
SP (GFlops)      192        2020       3950       4500           3790        265
DP (GFlops)      96         1010       1170       1300           950         132
Mem (GB/s)       34.11      320        250        288            264         68
Max dev.         8          ?          8          8              >=4         32
Price            $4,800.00  $2,800.00  $3,200.00  $1,000.00      $400.00     $$$$$$$$
Efficiency
[Figure: peak vs. measured performance for GTX Titan, GTX680, GTX580, HD7970, and Core i7-980X. Panels: Matrix Multiplication (GFlops), 1D Fast Fourier Transform (GFlops), Memory Bandwidth (GB/s). Bar labels as extracted: 98%, 69%, 86%, 78%, 94%, 63%, 63%, 43%, 72%, 52%, 3%, 18%, 9%, 10%]
GTX Titan performance is taken from Anandtech
GPU-programming considerations
• Special programming tools and techniques are required
– Multiple different and ever-changing architectures
– Branching, non-fp operations are very expensive
– Optimized mathematical libraries are sometimes missing
• Limited amount of memory and expensive data transfers
– x16 PCIe gen2 (8 GB/s), gen3 (16 GB/s)
– Specially allocated (pinned) memory is required for full performance and to
overlap computations and data transfers
• Reduced caches, low memory to computation ratio, strict access patterns
– 177 GB/s per Teraflop for Xeon, 60 - 70 GB/s  per Teraflop for GPUs
– Varying cache hierarchies on different architectures
– Special access patterns are required for better performance. For instance, 
bandwidth of matrix transpose (GTX280 with 142 GB/s memory bandwidth)
• 2 GB/s – for naive approach
• 17 GB/s – if shared memory is used
• 80 GB/s – if care is taken for shared memory banks and global memory partitions
• I/O problem
– 100 MB/s sequential write while camera produces ~ 1 GB/s
– Handling big data sets not fitting in the memory (up to 500 GB)
• Various problems with growing number of GPUs connected to a system
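The memory-to-computation ratios quoted above can be reproduced from the Parallel Architectures table. The sketch assumes that the "Mem" row is memory bandwidth in GB/s and that the ratio is taken against single-precision peak; the function name is illustrative.

```python
# Memory bandwidth available per teraflop of single-precision peak,
# using the SP (GFlops) and Mem rows of the Parallel Architectures table.
devices = {
    "E7-8870":       (192, 34.11),
    "Tesla K20":     (3950, 250),
    "GeForce Titan": (4500, 288),
    "AMD HD7970":    (3790, 264),
}

def gb_per_s_per_teraflop(sp_gflops, mem_gb_s):
    """Bandwidth (GB/s) divided by peak performance expressed in Tflops."""
    return mem_gb_s / (sp_gflops / 1000.0)

ratios = {name: gb_per_s_per_teraflop(*spec) for name, spec in devices.items()}
for name, ratio in ratios.items():
    print(f"{name:14s} {ratio:6.1f} GB/s per Tflop")
```

Under these assumptions the Xeon lands near 177 GB/s per Tflop and the three GPUs between 60 and 70 GB/s per Tflop, which matches the numbers in the bullet.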
Parallel Programming Environments
• CUDA – The oldest GPU programming technology from NVIDIA
• OpenCL – Open standard technology close to CUDA, but working with a
wide range of hardware platforms including CPUs and GPUs
• OpenACC – Declarative technology similar to OpenMP
• MATLAB and other mathematical packages with integrated GPU
support. Only some operations are parallelized, and the need to
transfer data over the slow PCIe bus to execute non-parallelized operations
kills the performance.
CUDA
• Supports latest NVIDIA technologies
– GPUDirect – direct transfers between GPU and IB, etc. Integration 
with MPI frameworks
– Dynamic parallelism – GPUs are able to spawn new jobs
• NVIDIA provides a set of highly optimized libraries (BLAS, FFT, 
Lapack, Reduction, etc.)
• Only NVIDIA GPUs are supported
Programming Environments
OpenCL
• Syntax is very similar to CUDA (easy porting)
• Well written code is as fast as CUDA
• Works with CPUs and GPUs from multiple vendors (Intel, AMD, IBM, 
NVIDIA)
• Still no way to run code simultaneously on NVIDIA GPU and CPU, but 
possible with AMD cards
• Many libraries exist, but they are generally slightly slower than their CUDA
counterparts. Some libraries are only available commercially
• No GPUDirect, significantly limited options to use pinned memory (i.e. 
slower data transfers)
OpenACC
• Existing applications may be easily parallelized. Developing new code
is also easy compared to OpenCL/CUDA
• No free compilers exist at the moment, though there is a similar
technology, OmpSs, developed at the Barcelona Supercomputing Center
• At its current level, the technology does not support shared memory and some
other features available with direct programming (i.e. it is slower)
[Figure: cuFFT vs. oclFFT performance, GFlops (axis 0 to 300)]
GPUDirect and Frame Grabbing
[Figure: bandwidth, GB/s (axis 0 to 35), using 1 core and all cores, for Core i7 950, Core i7-980X, Core i7 3820, and 2 x E5-2640]
And we get ~ 1 GB/s from the camera. With 3 memcpy calls it is already at the limit.
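A rough way to see how much copy throughput a single core offers is to time a plain in-memory copy. The sketch below is a minimal illustration using only the Python standard library, under the assumption that the chart refers to in-memory copy rates; the function name, buffer size, and the "headroom" ratio are illustrative and not part of the original measurements.

```python
import time

def copy_rate_gb_s(size_mb=32, repeats=5):
    """Best observed rate of a plain in-memory copy on one core, in GB/s."""
    src = bytearray(size_mb * 1024 * 1024)
    dst = bytearray(len(src))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src                  # same-size slice assignment: one bulk copy
        best = min(best, time.perf_counter() - start)
    return len(src) / best / 1e9

rate = copy_rate_gb_s()
camera_gb_s = 1.0                     # camera output quoted on the slide
copies = 3                            # number of copies quoted on the slide
headroom = rate / (camera_gb_s * copies)

print(f"single-core copy rate: {rate:.1f} GB/s")
print(f"headroom over {copies} copies of the camera stream: {headroom:.1f}x")
```

A headroom close to 1 means that copying alone saturates one core, which is the motivation for avoiding intermediate copies between the frame grabber and the GPU.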
Memory: Space vs. Speed
[Diagram: CPU with three memory channels (channel 1, channel 2, channel 3), up to three DIMMs per channel]
DPC (DIMMs per Channel)   Xeon X5500   Xeon E5-2600
1 DIMM                    10.6 GB/s    12.8 GB/s
2 DIMMs                   8.5 GB/s     12.8 GB/s
3 DIMMs                   6.4 GB/s     8.5 GB/s
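The per-channel figures in the table coincide with the peak rates of the standard DDR3 speed grades, which suggests (an interpretation, not stated on the slide) that populating more DIMMs per channel forces the memory to run at a lower speed grade. A minimal check, with an illustrative function name:

```python
# Peak bandwidth of one 64-bit DDR3 channel: transfers per second times 8 bytes.
BYTES_PER_TRANSFER = 8   # a DDR3 channel is 64 bits wide

def channel_bandwidth_gb_s(mega_transfers_per_s):
    """Peak rate of a single memory channel in GB/s."""
    return mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9

for grade in (800, 1066, 1333, 1600):
    print(f"DDR3-{grade}: {channel_bandwidth_gb_s(grade):.2f} GB/s per channel")
```

DDR3-800, DDR3-1066, and DDR3-1600 give 6.4, about 8.5, and 12.8 GB/s; DDR3-1333 gives about 10.66 GB/s, which the slide quotes as 10.6 GB/s.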
NUMA Architecture and Data Transfers
Bandwidth
QPI: 14.4 GB/s
DDR3: 10.6 GB/s  (PC1333)
PCIe: 16 GB/s (gen3 x16)
System: Xeon E5-2640
The QPI bus is not even enough to feed both GPU cards
[Figure: host-to-device and device-to-host transfer rates, GB/s. GTX590 (gen2): Non-NUMA, NUMA, Pinned. AMD HD7970 (gen3): Pageable, Pinned]
Pinned memory support is limited with OpenCL
NVIDIA does not support gen3 mode on X79 boards (a workaround exists for Windows, but not for Linux)
Filtered Back Projection on GPU
[Diagram: Image Loader; Pool of Sinograms (host memory); Pool of CPU and GPU processing threads; Pool of Vertical Slices (host memory); Data Storage. GPU thread: fetch slices for processing, 1st stage and 2nd stage with double buffering (PCIe data transfer, filtering, texture of size W x H, PCIe data transfer), store results]
Ratio of operations: 90% Back Projection, Filtering, PCIe Transfer
[Figure: overlapping efficiency, transfer vs. compute, ms per slice (axis 50 to 55)]
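The two computational steps named on this slide, filtering and back projection, can be written down in a few lines. The following is a minimal CPU sketch in pure Python for a parallel-beam geometry, not the GPU implementation described in the talk: it uses a spatial-domain Ram-Lak kernel instead of an FFT-based filter, and all function names are illustrative. On a GPU the linear interpolation in the inner loop is the part that the texture engine can perform.

```python
import math

def ram_lak(k):
    """Spatial-domain ramp (Ram-Lak) kernel for unit detector spacing."""
    if k == 0:
        return 0.25
    if k % 2 == 0:
        return 0.0
    return -1.0 / (math.pi * k) ** 2

def filter_projection(row):
    """Convolve one projection with the ramp kernel (the filtering step)."""
    n = len(row)
    return [sum(row[i] * ram_lak(j - i) for i in range(n)) for j in range(n)]

def back_project(filtered, angles, size):
    """Accumulate filtered projections into a size x size slice."""
    n_det = len(filtered[0])
    det_center = (n_det - 1) / 2.0
    img_center = (size - 1) / 2.0
    scale = math.pi / len(angles)
    image = [[0.0] * size for _ in range(size)]
    for q, theta in zip(filtered, angles):
        c, s = math.cos(theta), math.sin(theta)
        for iy in range(size):
            y = iy - img_center
            for ix in range(size):
                x = ix - img_center
                pos = x * c + y * s + det_center   # detector coordinate of this pixel
                i0 = math.floor(pos)
                if 0 <= i0 < n_det - 1:
                    w = pos - i0                   # linear interpolation weight
                    image[iy][ix] += scale * ((1 - w) * q[i0] + w * q[i0 + 1])
    return image

# Test object: a centred disk of density 1 and radius 10, whose projection
# 2*sqrt(R^2 - s^2) is the same from every angle.
n_det, n_angles, radius = 33, 36, 10.0
angles = [k * math.pi / n_angles for k in range(n_angles)]
center = (n_det - 1) / 2.0
profile = [2.0 * math.sqrt(max(radius ** 2 - (j - center) ** 2, 0.0))
           for j in range(n_det)]
sinogram = [profile] * n_angles
filtered = [filter_projection(row) for row in sinogram]
slice_image = back_project(filtered, angles, n_det)
print(f"value at the centre of the disk: {slice_image[16][16]:.3f}")
```

For this test object the reconstructed value at the centre comes out close to the true density of 1. The triple loop in `back_project` performs one interpolation per pixel per projection, which is why back projection dominates the run time.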
Tuning for hardware architectures
GT200
– Base version
– Uses texture engine
Fermi
– High computation power, but low speed of texture unit
– Reduce load on texture engine: use shared memory to cache the fetched data and then perform linear interpolation using computation units
Kepler
– Low bandwidth of integer instructions, but high register count
– Uses texture engine, but processes 16 projections at once and 16 points per thread to enhance cache hit rate
GCN
– High performance of texture engine and computation nodes
– Balance usage of texture engine and computation nodes to get highest performance
VLIW
– Executes 5 independent operations per thread
– Computes 16 points per thread in order to provide sufficient flow of independent instructions to VLIW engine
(Speed-up labels on the slide: +100%, +530%, +95%, +75%)
Back Projection: Evolution of GPU architectures 
[Figure: back projection performance in giga-interpolations per second compared with texture fill rate (GT/s), Linear and Oversample modes, for HD7970, GTX680, GTX580, HD5970, GTX280, and E5-2640 (axis 0 to 180). Bar labels as extracted: 133%, 126%, 92%, 107%, 157%, 195%, 78%, 144%, 39%]
Adding more GPU devices
[Figure: initialization time in seconds (axis 5 to 25) vs. number of GTX590 cores used (1 to 8)]
NVIDIA
• Maximum 8-9 GPU cores (not cards) per system. The system will not turn on otherwise
• LAN Option ROM has to be turned off in the BIOS
• The PCIe slots where storage adapters are inserted have to be disabled
• ASTRA Lab reported running 13 GPU cores with a modified BIOS
• To run more than 5 GPUs, the NVIDIA driver has to be forced to use MSI interrupts. Crashes will occur otherwise
AMD
• 4 GPU cards (single core) work fine, no configuration modifications required
• Dual-core cards work in single-core mode only
Handling large data sets
[Figure: random access throughput, MB/s, log scale (1 to 10000)]
A7K2000             28.71
Raid 12 Drives      29.74
Intel X25E          140
2 x Crucial C300    417.39
4 x Crucial C300    965.21
Samsung 840 Pro     1575.85
RamDisk             2859.88
Using SSD drives may significantly increase random access performance
for data sets that do not fit completely in memory. Big arrays
of magnetic hard drives will not help unless multiple readers are involved.
Streaming data: file system caches
Optimizing I/O for maximum streaming performance using a single data source/receiver
[Diagram: default data flow in Linux: Data, Buffer Cache, Storage]
• The buffer cache significantly limits maximal write performance
• Kernel AIO may be used to program the I/O scheduler to issue read requests without delays
[Figure: read and write throughput, MB/s (axis 0 to 4500), for Buffered, Direct, and AIO access]
Data Streaming
[Figure: read and write throughput at the start and at the end of the partition, MB/s, for XFS, Ext4, Ext2, JFS, Btrfs, and Reiser3]
Test system: OpenSuSE 12.1 / Kernel 3.3.1, 32 disks per 2 raid controllers, raid 60
• The file system used matters, and it should be adapted to the raid configuration (stripe, read-ahead)
• Unless a really big number of disks is used, the start of the partition will be faster than the end
• fallocate may significantly improve performance (the allocation unit may be increased during
FS creation/mount; XFS supports allocation sizes up to 1 GB)
• Ext4 does not support partitions larger than 16 TB yet
• The real-time feature of XFS is unstable; data loss is likely
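Preallocation can be sketched as follows: the file is reserved in full before streaming starts, so the file system does not have to extend it frame by frame. This is a minimal illustration using `os.posix_fallocate` from the Python standard library (available on Unix-like systems); the function name, frame size, and file name are hypothetical, and the sketch silently skips preallocation where it is not supported.

```python
import os
import tempfile

def stream_frames(path, frames, frame_bytes, n_frames):
    """Reserve space for all frames up front, then write them sequentially."""
    total = 0
    with open(path, "wb") as f:
        if hasattr(os, "posix_fallocate"):
            try:
                # Ask the file system to allocate all blocks before streaming.
                os.posix_fallocate(f.fileno(), 0, frame_bytes * n_frames)
            except OSError:
                pass  # file system without preallocation support: just write
        for frame in frames:
            total += f.write(frame)
        f.flush()
        os.fsync(f.fileno())          # make sure the data reached the device
    return total

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frames.raw")        # hypothetical file name
    frame = bytes(1024 * 1024)                    # stand-in for one camera frame
    written = stream_frames(path, (frame for _ in range(8)), len(frame), 8)
    size_on_disk = os.path.getsize(path)

print(f"wrote {written} bytes, file size {size_on_disk} bytes")
```

Whether this helps in practice depends on the file system and its allocation settings, as noted in the bullets above.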
Processing Pipeline
Moderate size data-sets are stored in memory
Huge data-sets are cached on SSD Raid
[Diagram: stages numbered 1 to 4, with labels Preprocessing, Reconstruction, and Real-time storage]
4 stage pipeline: I/O + Computations
1. Reading data from fast  SSD Raid-0 (random reads are effective)
2. Scheduling and preprocessing using SIMD instructions of x86 CPUs
3. Reconstructing on GPUs
4. Storing to Raid on magnetic disks (sequential writes are effective)
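The four stages above can be chained with bounded queues so that reading, preprocessing, reconstruction, and storing run concurrently. The sketch below is a generic producer/consumer pipeline using Python threads, not the UFO framework; the stage functions are placeholders, and a queue depth of 2 is chosen to mimic double buffering.

```python
import queue
import threading

STOP = object()   # sentinel that tells the next stage to shut down

def stage(func, inbox, outbox):
    """Apply func to every item from inbox and pass the result on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        outbox.put(func(item))

def run_pipeline(items, funcs, depth=2):
    """Run funcs as concurrent stages connected by bounded queues."""
    queues = [queue.Queue(maxsize=depth) for _ in funcs] + [queue.Queue()]
    threads = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(funcs)]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)            # blocks while the first stage is busy
    queues[0].put(STOP)
    for t in threads:
        t.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            return results
        results.append(item)

# Placeholder stages standing in for the four steps listed above.
def read_slice(index):       # 1. read from the SSD raid
    return [float(index)] * 4

def preprocess(sinogram):    # 2. CPU preprocessing
    return [v * 0.5 for v in sinogram]

def reconstruct(sinogram):   # 3. GPU reconstruction
    return sum(sinogram)

def store(result):           # 4. write to the disk raid
    return result

results = run_pipeline(range(5), [read_slice, preprocess, reconstruct, store])
print(results)
```

Because each stage is a single thread, results keep their input order. In Python, real overlap is only obtained when the stages spend their time in I/O or in library calls that release the interpreter lock.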
Building a server
• Too much external hardware is required
– High speed network
– Storage system (preferably with a separate SSD cache)
– High speed Frame Grabber for Camera
– Normally 4-6 high speed PCIe slots per server
– Space for 1-2 GPUs only
• System cooling is complicated
– GPUs, HDDs, and SSDs all produce a lot of heat
• Extensibility
– There is no space to add more storage / computing power
UFO Computing Infrastructure
Server: SuperMicro 7046GT-TRF (Dual Intel 5520 Chipset)
CPU: 2 x Xeon X5650 (total 12 cores at 2.66 GHz)
GPUs: 4 x GTX590 External
Memory: 96 GB / 12 DDR3 slots (192 GB max)
Network: Intel 82598EB (10 Gb/s)
Camera Link Frame Grabber (850 MB/s)
Storage: Areca ARC-1880-ix-12 SAS Raid
   16 x Hitachi A7K200 (Raid6)
     8 x Samsung 840 Pro 510 (Raid0)
Connections
External GPU Box: External PCIe x16 (8 GB/s)
SAS Attached Storage: SFF8088 (2.4 GB/s)
Cameras (PCO.edge, PCO.dimax, PCO.4000): CameraLink, 850 MB/s
LSDF (Large Scale Data Facility): Ethernet, 10 Gb/s
[Figure: internal vs. external GPU performance, GT/s (axis 0 to 600)]
[Figure: camera storage, sequential write, MB/s (axis 0 to 2000), 32 disks vs. 16 disks]
[Figure: SSD Raid, read and write, MB/s (axis 0 to 4000)]
Filtered Back Projection Performance
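[Figure: reconstruction throughput, MB/s (axis 0 to 1200), with data held in memory and on SSD, for data sets: small (4 GB), standard (11 GB), big (120 GB), huge (300 GB)]
[Figure: CPU vs. GPU throughput on the 11 GB data set, log scale: CPU 17.93 MB/s, GPU 1016.26 MB/s]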
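The two throughput labels translate into reconstruction times as follows. The sketch assumes that the MB/s figures refer to input data processed per second and that 11 GB is counted as 11,000 MB; both are assumptions, not statements from the slide.

```python
# Reconstruction time for the 11 GB data set at the quoted throughputs.
data_set_mb = 11 * 1000                 # assumption: 1 GB = 1000 MB
cpu_mb_s, gpu_mb_s = 17.93, 1016.26     # chart labels

speed_up = gpu_mb_s / cpu_mb_s
cpu_time_s = data_set_mb / cpu_mb_s
gpu_time_s = data_set_mb / gpu_mb_s

print(f"GPU speed-up over CPU: {speed_up:.1f}x")
print(f"CPU time: {cpu_time_s / 60:.1f} min")
print(f"GPU time: {gpu_time_s:.1f} s")
```

Under these assumptions the GPU version is about 57 times faster, reducing roughly ten minutes of CPU time to about eleven seconds.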
PCIe Extension Box
External GPU Enclosure by One Stop Systems
1 x PCIe x16 2.0
4 x GTX590
8 GPU cores
[Figure: transfer time for 1 slice, ms (axis 0 to 9), NUMA vs. Non-NUMA]
[Figure: with external box configuration: time per 8 slices, ms (axis 0 to 70), for 1 to 8 GPUs, showing compute time and NUMA / Non-NUMA transfers]
[Figure: scalability: speed-up vs. number of GPUs (1 to 8), internal vs. external]
[Figure: speed-up with 8 GPUs, internal vs. external (axis 7 to 7.8)]
[Figure: speed-up with 8 GPUs, NUMA vs. Non-NUMA (axis 0 to 8)]
Scaling up to Cluster
[Diagram: Camera; Readout PC (CUDA + GPUDirect, real-time control loop); SSD Cache; Master Node with Task Scheduler (lots of memory); two OpenCL Nodes; two Storage Nodes; Distributed Storage; Infiniband (optical) with IB router; LSDF (Large Scale Data Facility); outer control loop]
Transfer rates: up to 8 GB/s, 4 GB/s, <2 GB/s, 0.25 GB/s
Infiniband: Connection and Protocols
Adapter: Mellanox ConnectX 3 VPI
[Figure: bandwidth, GB/s (axis 0 to 6), by connection: 2 x QDR Direct (gen3), QDR Direct (gen3), QDR Switched (gen3), QDR Switched (gen2), QDR Optical (gen2)]
[Figure: bandwidth, GB/s (axis 0 to 5), by protocol: RDMA, MVAPICH, SDP, TCPoIB; legend: Optical, Switch]
[Figure: latency (ns) (axis 0 to 14), by protocol: RDMA, MVAPICH, SDP, TCPoIB]
SDP has been declared obsolete by the OpenFabrics Alliance,
but we have patches for the latest kernels.
Storage Protocols
[Figure: write and read throughput, MB/s (axis 0 to 4500), for Raw, iSER, XFS/Local, XFS/iSER, OCFS2/Local, OCFS2/iSER, FhGFS (over XFS), Gluster (over XFS)]
Network FS (slow)
• NFS
• Samba
• SSHFS
Cluster FS (slow if few nodes)
• Lustre (patched kernel)
• Gluster
• FhGFS (closed-source)
Network Devices
• iSCSI (slow)
• iSER
• OCFS2
UFO Storage Subsystem
[Diagram]
Storage Node 1 – Raid6: 16 Hitachi 7K300, 28 TB, split into 20 TB and 8 TB
Storage Node 2 – Raid6: 16 Hitachi 7K300, 28 TB, split into 20 TB and 8 TB
Camera PC – the two 8 TB volumes attached over iSER and joined by SoftRaid Level 0: 16 TB, XFS
High-speed streaming storage – Read: 2.3 GB/s, Write: 2.5 GB/s
GlusterFS – Compute Node 1, Compute Node 2, Compute Node 3
NFS – Client 1, Client 2
High-speed Programmable Camera
First Prototype
[Diagram: optical body and lens; CMOS pixel sensor in vacuum or N2; daughter card ("made in house"); flex electrical cable (max f = 183 MHz); readout mother card with FPGA Virtex 6; heat exchanger and Peltier cells; link]
• High speed CMOS sensor: 1 Mpix, 5000 fps, 10 bits
• Self-trigger & data compression
• On-line processing and control
• Full programmability
• Direct connection to Infiniband cluster
Summary
• We are in the age of parallel architectures
• Getting good performance is rather easy; getting ultimate performance while
managing multi-gigabyte streams of data is complicated and needs care on
multiple levels. The hardware should be carefully selected according to the
planned tasks and data rates. The software should be tuned to the selected
hardware.
• Streams of about 500 MB/s may be processed with a single reconstruction
station; a cluster is required to handle more data in near real-time.
• A hybrid CUDA/OpenCL system is probably the best approach.
• The UFO Parallel Processing Framework is provided to help you overcome
some of these difficulties and will be presented in the next talk.
Open source: http://ufo.kit.edu
Features
➢ Easy Algorithm Exchange
➢ Camera Abstraction
➢ Pipelined Processing
➢ Glib/GObject, scripting language support with introspection
➢ OpenCL + automated management of OpenCL buffers
