High-speed, low-latency readout system with real-time trigger based on GPUs by Caselle, M. et al.
KIT – University of the State of Baden-Württemberg and
National Research Center of the Helmholtz Association
KIT, Institut für Prozessdatenverarbeitung und Elektronik 
M. Caselle 
www.kit.edu 
M. Caselle, L.E. Ardila, M. Balzer, S. Chilingaryan, T. Dritschler,  A. Kopmann, H. Mohr, L. Rota, 
M. Vogelgesang, M. Weber 
High-speed, low-latency readout system with 
real-time trigger based on GPUs 
 
 IEEE- 20th Real Time Conference, 5-10 June 2016. Padova, Italy  
 
KIT, Institut für Prozessdatenverarbeitung und Elektronik 2 20th IEEE- Real Time Conference, 5-10 June 2016. Padova - Italy. M. Caselle 
Outline
Motivations
Hardware implementation based on “Direct-GPU” technology
Performance: bandwidth & latency
Track finding algorithm based on GPU
Results and GPU limitations
Conclusions & what’s next
The L1 trigger will require reconstruction of charged particles with transverse momentum > ~2 GeV/c.
See: Thomas Schuh, this conference, ID: RTA1_59
CMS – L1 track trigger system
Associative Memory
  Pattern recognition: large bank of patterns stored in a dedicated AM chip
  Fitting: PCA, Hough transform, Retina (FPGA)
  Slices: 48 = 8x6 (φxη); load balancing in time
Tracklet algorithm
  Pattern recognition: conventional road-based track search (FPGA)
  Fitting: linearized χ2 fit (FPGA)
  Slices: 168 = 28x6 (φxη); 4 x (BX)
Time-multiplexed architecture
  Pattern recognition: Hough transform (FPGA)
  Fitting: (FPGA)
  Slices: 324 = 36x9 (φxη) and time multiplexing of 24 x (BX)
  See: Thomas Schuh
[Figure labels: data rates O(100 Tb/s); processed off-module in the back-end; L1 track finding system: track reconstruction and fitting of primary particles with pT > 2 GeV, latency ~5 µs; tracks from muon system and calorimeter information]
What about GPUs for L1 track finding?
How does a GPU-based L1 track system perform compared to current hardware systems (AMs + FPGAs)?
How can the tracks be found in ~5 µs with high efficiency and acceptable fake rates?
[Figure labels: data rates O(100 Tb/s); processed off-module in the back-end; L1 track finding system: track reconstruction and fitting of primary particles with pT > 2 GeV, latency ~5 µs; Global Trigger, with tracks from muon system and calorimeter information; L1 accept to front-end]
Hough Transform on GPU
Each stub corresponds to a line in the Hough space.
Accumulation points with a high vote count correspond to real tracks.
[Figure: Hough transform illustration; surviving text fragments: "All stubs from a real track have", "∗ r"]
Why Hough transform on GPU?
The Hough transform is naturally amenable to a high degree of parallelization, because the parameter-space calculation for each hit (stub) is independent of all other hits in the event.
Consequently, it is a natural candidate for implementation on a GPU: highly parallel hardware optimized to execute the same operation simultaneously on many different data (Single Instruction, Multiple Data).
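The independence of the per-stub calculation can be illustrated with a minimal Python sketch. It is not the CUDA kernel used in this work: the line parametrization phi0 = phi + m * r (with m standing for a quantity proportional to q/pT), the bin counts, the ranges and the layer radii are all assumptions chosen for illustration. The body of the outer loop touches only one stub, which is what makes a one-thread-per-stub mapping possible on a GPU.

```python
import math

def hough_accumulate(stubs, n_m=32, n_phi0=64, m_max=0.002,
                     phi0_min=-0.2, phi0_max=0.2):
    """Fill a (m, phi0) accumulator from stubs given as (r, phi) pairs.

    Assumed parametrization (illustrative only): a stub at radius r and
    azimuth phi votes along the line phi0 = phi + m * r.
    """
    acc = [[0] * n_phi0 for _ in range(n_m)]
    m_step = 2.0 * m_max / n_m
    phi0_step = (phi0_max - phi0_min) / n_phi0
    for r, phi in stubs:              # independent per stub (one GPU thread each)
        for i in range(n_m):          # scan the m axis at bin centres
            m = -m_max + (i + 0.5) * m_step
            phi0 = phi + m * r
            j = math.floor((phi0 - phi0_min) / phi0_step)
            if 0 <= j < n_phi0:
                acc[i][j] += 1        # on a GPU this would be an atomic add
    return acc

def find_candidates(acc, min_votes):
    """Return (m_bin, phi0_bin, votes) for every cell reaching min_votes."""
    return [(i, j, votes)
            for i, row in enumerate(acc)
            for j, votes in enumerate(row)
            if votes >= min_votes]

# Six stubs generated from one hypothetical track (radii in cm are assumed).
radii = [25.0, 35.0, 50.0, 68.0, 88.0, 108.0]
m_true, phi0_true = 0.0005625, 0.053125
stubs = [(r, phi0_true - m_true * r) for r in radii]
candidates = find_candidates(hough_accumulate(stubs), min_votes=len(radii))
print(candidates)
```

In this toy example the accumulator cell that contains the generated track parameters collects one vote per stub, while neighbouring cells collect fewer because the lines diverge away from the crossing point.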













Data transfer: Detector – GPU
[Figure: data-transfer path with two measured latencies: ~3.4 µs for 4 KBytes and ~4.6 µs for 4 KBytes]
~8 µs in total >> CMS low-level trigger specification
Advanced data transfer with “GPU-Direct”
With “Direct-GPU” technology, FPGA devices can directly read and write CUDA/OpenCL host and device memory, eliminating unnecessary memory copies, dramatically lowering CPU overhead and reducing latency.
Total latency = latency (FPGA <-> GPU) + latency (GPU processing)
One-sided data transfer latency: 1.15 µs (average), jitter < 100 ns
Total time = latency + data transfer
[Figure labels: system memory; GPU; latency + data transfer]
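As a worked example of "Total time = latency + data transfer", the sketch below combines the 1.15 µs one-sided latency with a 4 KByte block and the 6.5 GB/s throughput reported on the performance slide of this talk. Combining these three figures is an assumption made for illustration; it is not a measurement from the source.

```python
def total_transfer_time(latency_s, size_bytes, throughput_bytes_per_s):
    """Total time = fixed latency + size / sustained throughput."""
    return latency_s + size_bytes / throughput_bytes_per_s

ONE_SIDED_LATENCY_S = 1.15e-6   # average one-sided latency quoted above
BLOCK_SIZE_BYTES = 4096         # assumed block size (4 KBytes)
THROUGHPUT_B_PER_S = 6.5e9      # throughput quoted elsewhere in this talk

total_s = total_transfer_time(ONE_SIDED_LATENCY_S, BLOCK_SIZE_BYTES,
                              THROUGHPUT_B_PER_S)
print(f"total time for 4 KBytes: {total_s * 1e6:.2f} us")
```

Under these assumptions the fixed latency dominates for small blocks, so reducing the per-transfer latency matters more than raising the throughput.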
“GPUDirect” and “DirectGMA” concepts
[Figure labels: readout card; system memory; GPU; latency + data transfer]
High-flexibility readout card
High pin count FMC connectors
Data throughput up to 130
Ref: A PCIe DMA Architecture for Multi-Gigabyte Per Second Data Transmission, DOI: 10.1109/TNS.2015.2426877, IEEE Real Time 2014, 26-30 May, Nara, Japan
- 2 x High Pin Count FMC connectors:
  - VITA 57 compliant
  - 320 single-ended or 160 differential signals @ 9 GHz
  - 12 MGT I/O @ 13.1 Gb/s
- Processing unit:
  - Xilinx Virtex 7 FPGA (XC7VX330T-2 FFG1761)
- High-performance memory: DDR3
  - 64 lanes @ 1866 Mb/s -> 119 Gb/s
  - 4 GByte
- PCIe Gen 3 x 16 lanes
- PCB:
  - 16-layer metal stack / Nelco N4000-13 EP SI
  - Picosecond time-controlled transmission lines
Firmware architecture
- KIT Direct Memory Access (DMA) engine operating in both bus-master and bus-slave modes
- Scatter-gather mechanism: the descriptors are located inside the FPGA; both ring-buffer and dynamic memory allocation are possible
- Compatible with GPUs (NVIDIA, AMD) and with system memory
- PCI Express/DMA Linux driver (32- and 64-bit): ready
[Figure: firmware block diagram; labels: PCIe link; To DDR / Trigger; From DDR / Detector; DDR3; IPE controller]
Readout system – performance / comparison
[Figure panels: NVIDIA Tesla K40 (CUDA ver. 7.5); AMD FirePro W9100 (OpenCL ver. 2.0)]
Data throughput over 6.5 GB/s, very close to the maximum theoretical limit for PCIe Gen 3 (max payload limited to 128 bytes by the GPUs)
PCIe Gen 3 x 8 lanes
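A rough estimate shows why 6.5 GB/s is close to the limit for 8 lanes with a 128-byte payload. The sketch below is only an estimate: the per-packet overhead of 24 bytes (framing, header and link CRC) is an assumed value, and flow-control and acknowledgement traffic are ignored.

```python
def pcie_gen3_payload_rate(lanes=8, max_payload=128, tlp_overhead=24):
    """Estimate the usable payload rate of a PCIe Gen 3 link in bytes/s.

    tlp_overhead is an assumption: bytes of framing, header and link CRC
    sent with every packet in addition to the payload.
    """
    line_rate_bits = 8e9          # Gen 3: 8 GT/s per lane
    encoding = 128 / 130          # 128b/130b line encoding
    raw_bytes_per_s = lanes * line_rate_bits * encoding / 8
    packet_efficiency = max_payload / (max_payload + tlp_overhead)
    return raw_bytes_per_s * packet_efficiency

estimate = pcie_gen3_payload_rate()
print(f"estimated limit: {estimate / 1e9:.2f} GB/s")
```

With these assumptions the small payload size costs roughly 16% of the raw link rate, which is why the payload limit imposed by the GPUs is called out alongside the measured throughput.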
Readout system – performance / comparison
NVIDIA: latency < 2 µs, jitter < 30 ns
AMD: latency < 1.3 µs, jitter < 50 ns
[Figure labels: three PCIe transactions; ping-pong latency; send data; data received; GPU; High-Flex readout card]
NVIDIA: FPGA as bus master (FPGA -> GPU), GPU as bus master (GPU -> FPGA)
AMD: FPGA as bus master (FPGA -> GPU), FPGA as bus master (GPU -> FPGA)
Both GPU vendors show excellent latency performance.
GPU limitations – kernel latency
How much time is needed to synchronize the FPGA data with the launch of the kernel?
Expected GPU limitation for real-time applications
[Figure labels: data ready; latency?; candidate tracks; kernel latency ~140 µs]
GPU limitations – kernel latency
How much time is needed to synchronize the FPGA data with the launch of the kernel?
[Figure labels: data ready; launch data processing kernel (Hough transform); latency?; candidate tracks; kernel latency ~140 µs; kernel latency ~25 µs]
NVIDIA shows a very low kernel latency. A drastic reduction is expected with the new CUDA release.
Tracker detector segmentation: 288 sectors (32 in Φ and 9 in η), like the FPGA Hough transform implementation by Thomas Schuh (this conference, ID: CR_RTA1_59)
[Figure panels: Hough space (GPU processing); Hough space (after filtering). Labels: # 45 stubs; track at 3 GeV/c; tracks at 5 GeV/c; fitting (η)]
The current CUDA implementation processes 500 stubs.
pT track finding by GPUs – first demonstrator
[Figure labels: High-Flex readout card: pattern generator fed by Monte Carlo data (CMSSW), move to DDR, write data to GPU; GPU (NVIDIA) performs Hough transform track finding and track fitting, then writes candidate tracks to the FPGA by “GPUDirect”; CPU: start; to central trigger]
Measurement sequence (total time measured by an FPGA counter):
1) Load the stubs to DDR (High-Flex)
2) Start data transfer FPGA -> GPU (start FPGA counter)
3) GPU launches the Hough transform track finding
4) GPU sends candidate tracks to FPGA (stop FPGA counter)
Total latency = 1.91 µs (data latency) + ~7 µs (data processing) + 23.19 µs (kernel latency) = ~32.1 µs with NVIDIA
Total latency > 150 µs with AMD; launching the kernel is the major penalty
Improvements are expected with the new CUDA 8
The total latency of ~30 µs is well above the 5 µs required by CMS, but very promising
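The latency budget above can be re-added in a few lines to see which term dominates. The numbers are the NVIDIA figures quoted on this slide; the dictionary keys and helper names are illustrative only.

```python
CMS_TRACK_FINDING_BUDGET_US = 5.0   # latency required by CMS, as quoted above

# NVIDIA contributions quoted on this slide, in microseconds.
nvidia_terms_us = {
    "data latency": 1.91,
    "data processing": 7.0,
    "kernel latency": 23.19,
}

def total_latency(terms):
    """Sum all contributions to the measured latency."""
    return sum(terms.values())

def dominant_term(terms):
    """Name of the largest single contribution."""
    return max(terms, key=terms.get)

total_us = total_latency(nvidia_terms_us)
without_launch_us = total_us - nvidia_terms_us["kernel latency"]
print(f"total: {total_us:.1f} us, dominated by {dominant_term(nvidia_terms_us)}")
print(f"without the kernel-launch term: {without_launch_us:.2f} us")
print(f"ratio to CMS budget: {total_us / CMS_TRACK_FINDING_BUDGET_US:.1f}x")
```

The kernel-launch term accounts for most of the total; transfer and processing alone sum to about 9 µs, which is still above the 5 µs budget.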
Conclusions & What’s next
- A first demonstrator of the L1 track trigger for CMS based on the Hough transform on GPUs has been developed for both GPU vendors (NVIDIA and AMD):
  - A total latency of ~30 µs has been achieved with KIT-DMA and NVIDIA GPUs
Room for improvement:
- Characterization/optimization of the Hough transform algorithm:
  - Merging of the DMA and Hough transform kernels: a total time of ~9 µs is expected
  - Comparison between OpenCL, CUDA and FPGAs (Thomas Schuh, ID: RTA1_59)
- Significant technological evolution can be expected in the coming years; GPUs and FPGAs can obtain the full benefit with a timely development schedule.
What’s next:
- Develop a next demonstrator based on the Xilinx UltraScale+ family:
  - Next generation of GPUs and NVIDIA NVLink, a high-speed bidirectional bus protocol to exchange up to 320 Gb/s
  - NVLink in FPGA for high-bandwidth, zero-latency communication
Thank you for your attention 
Backup slides  
[Figure labels: DTC: data retrieval and distribution to trigger towers; L1 track system: track reconstruction; data rates ~104 Tb/s, ~470 Tb/s, ~50 Tb/s, ~50 Tb/s, ~1 Tb/s]
Associative Memory
  Pattern recognition: large bank of patterns stored in a dedicated AM chip
  Fitting: PCA, Hough transform, Retina (FPGA)
  Slices: 48 = 8x6 (φxη); load balancing in time
Tracklet algorithm
  Pattern recognition: conventional road-based track search (FPGA)
  Fitting: linearized χ2 fit (FPGA)
Hough transform (FPGA)
  Slices: ? 5 sectors in φ and time multiplexing of 24
Extreme challenge: reconstruct O(100) tracks from O(10k) stubs at 40 MHz
What about GPUs for L1 track finding?
To reduce the number of hardware devices (AMs + FPGAs) and increase the trigger flexibility
CMS and new tracker detector for fast pT discrimination
[Figure labels: pixel detector; Pixel-Strip (PS) module; 2S (two strip sensors) module; forward detector; barrel detector]
Barrel detector: 4130 PS modules (three layers) and 4464 2S modules (three layers).
The whole outer tracker has 15 508 modules in total.
Each bunch crossing produces on the order of 10,000 stubs (PU 140). Only about 5 to 10% of these stubs actually belong to primary tracks with pT > 2 GeV.
The goal of the L1 Track Finding system is to reconstruct the tracks of primary particles with pT > 2 GeV and discard as many of the other stubs as possible.
L1 stubs are processed off-module, in the back end, to build L1 track primitives.
Max latency = 12.5 µs
Track finding latency = 5 µs
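The stub numbers above translate into the following back-of-the-envelope figures. The 40 MHz bunch-crossing rate is taken from the "extreme challenge" statement elsewhere in these slides; the function and constant names are illustrative.

```python
BUNCH_CROSSING_RATE_HZ = 40_000_000   # 40 MHz, quoted elsewhere in the slides
STUBS_PER_CROSSING = 10_000           # order of magnitude at PU 140

def primary_stub_range(stubs_per_crossing, low_percent=5, high_percent=10):
    """Stubs per crossing that belong to primary tracks with pT > 2 GeV."""
    return (stubs_per_crossing * low_percent // 100,
            stubs_per_crossing * high_percent // 100)

low, high = primary_stub_range(STUBS_PER_CROSSING)
stub_rate_per_s = STUBS_PER_CROSSING * BUNCH_CROSSING_RATE_HZ
print(f"primary-track stubs per crossing: {low} to {high}")
print(f"stub rate into the back end: {stub_rate_per_s:.1e} stubs/s")
```

The bulk of the stubs therefore has to be discarded per crossing, while the whole system sustains the full stub rate.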
GPUDirect vs NVLINK (NVIDIA)
[Figure panels: GPUDirect communication; NVLINK communication]
NVIDIA® NVLink™ is a high-bandwidth, energy-efficient interconnect that enables ultra-fast communication between the CPU and GPU, and between GPUs. The technology allows data sharing at rates 5 to 12 times faster than the traditional PCIe Gen3 interconnect, resulting in dramatic speed-ups in application performance and creating a new breed of high-density, flexible servers for accelerated computing.
See more at: http://www.nvidia.com/object/nvlink.html#sthash.7RlpyR8X.dpu
K40 (NVIDIA) 
Novel concept of DMA (KIT)
Operations:
1. GPU memory allocation (nvidia_p2p_get_pages() for NVIDIA or clCreateBuffer() for OpenCL). Write the bus addresses into the FPGA descriptor memory (addr.surface_bus_address() for OpenCL).
2. Start the DMA data transfer.
3. The DMA loads the descriptor from the memory and fetches the data.
4. Data are transferred from the FPGA to the GPU memory block defined by the descriptor.
5. The DMA updates the status for the GPU kernel: number of blocks written, current descriptor address.
6. The DMA receives the current descriptor read by the driver, which is therefore free for the next block transfers.
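The descriptor handling in the steps above can be mimicked with a small software model. This is a toy illustration and not the KIT firmware or driver: the class name, its fields, the dictionary standing in for GPU memory and the example bus addresses are all hypothetical.

```python
class DescriptorRing:
    """Toy model of a scatter-gather descriptor ring (hypothetical)."""

    def __init__(self, bus_addresses, block_len):
        # Step 1: one (bus address, block length) descriptor per memory block.
        self.descriptors = [(addr, block_len) for addr in bus_addresses]
        self.write_idx = 0        # next descriptor the DMA engine will use
        self.read_idx = 0         # next descriptor the consumer will release
        self.in_flight = 0        # descriptors written but not yet released
        self.blocks_written = 0   # status visible to the consumer (step 5)

    def dma_write(self, memory, data):
        """Steps 3 to 5: fetch a descriptor, copy one block, update status."""
        if self.in_flight == len(self.descriptors):
            return False          # no free descriptor: the transfer must wait
        addr, length = self.descriptors[self.write_idx]
        memory[addr] = data[:length]
        self.write_idx = (self.write_idx + 1) % len(self.descriptors)
        self.in_flight += 1
        self.blocks_written += 1
        return True

    def release(self):
        """Step 6: the consumer reports a block as read, freeing its descriptor."""
        if self.in_flight == 0:
            return False
        self.read_idx = (self.read_idx + 1) % len(self.descriptors)
        self.in_flight -= 1
        return True

gpu_memory = {}                                   # stands in for GPU memory
ring = DescriptorRing([0x1000, 0x2000], block_len=4)
ring.dma_write(gpu_memory, b"abcdefgh")           # only the first 4 bytes fit
ring.dma_write(gpu_memory, b"ijkl")
ring_was_full = not ring.dma_write(gpu_memory, b"mnop")
ring.release()                                    # frees the first descriptor
ring.dma_write(gpu_memory, b"mnop")               # reuses address 0x1000
```

The point of the model is the back-pressure: a block is only overwritten after the consumer has reported it as read.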
[Figure labels: CPU - Linux; Kernel (NVIDIA); descriptors memory: device bus address & block length; data transfer (engine); status WR/RD update]
Readout system – performance / comparison
[Figure, first panel labels: (FPGA -> GPU); GPU master (GPU -> FPGA)]
“clEnqueueWaitSignalAMD”, used to synchronize the GPU with the remote device, is not optimized.
[Figure, second panel labels: (FPGA -> GPU); FPGA master (FPGA -> GPU); RD request; FPGA -> AMD]
Latency < 1.3 µs, three PCIe transactions
Multiple “GPU-Direct” architecture
[Figure labels: system memory; 4x AMD FirePro W9100; High-Flex]
Only 20 ns are needed to switch from GPU 1 to GPU N.
Ref: Ultra-fast computer tomography real-time 3D reconstruction (data rate 50 Gb/s). DOI: 10.1109/TNS.2015.2425911. Presented at Real Time 2014, Nara.
Ref: Streaming Camera Platform for Scientific Applications. DOI: 10.1109/TNS.2013.2252528. Presented at Real Time 2012, Lawrence Berkeley Laboratory.
