Evaluation of GPUs for high-level triggers in high energy physics by Mohr, Hannes et al.
EVALUATION OF GPUS FOR HIGH-LEVEL 
TRIGGERS IN HIGH ENERGY PHYSICS
Presenter: Hannes Mohr - KIT, IPE 
Contributors: L. Ardilla Perez, M. Balzer,  
M. Caselle, S. Chilingaryan, T. Dritschler,  
A. Kopmann, L. Rota, T. Schuh, M. Weber 
29TH SEPTEMBER, TWEPP 2016
OUTLINE
▸ Implement track trigger using GPUs 
▸ Use established methods for seeding 
▸ Present our own version of the Hough transformation 
▸ Compare different GPUs/vendors 
▸ Investigate data transfer/latencies 
▸ Estimate impact of technological advances

























▸ Current CMS trigger won’t be able to handle: 
▸ Increased data rates 
▸ Increased pile-up 
▸ Currently proposed solution: 
▸ Data reduction on detector 
▸ Raise latency of trigger from 3.4 to 12.5 μs 





▸ Readout at 40 MHz, BX every 25 ns  
▸ 6 μs each for L1 Trigger and Global Trigger 
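The latency figures translate directly into how many bunch crossings the front-end buffers must hold while a trigger decision is pending. A minimal sketch of that arithmetic, using only the numbers quoted above (the function name is illustrative):

```python
# 40 MHz readout corresponds to one bunch crossing (BX) every 25 ns.
BX_SPACING_NS = 1_000_000_000 // 40_000_000

def crossings_in_flight(latency_ns: int, bx_spacing_ns: int = BX_SPACING_NS) -> int:
    """Number of bunch crossings that occur while one trigger decision is pending."""
    return latency_ns // bx_spacing_ns

current = crossings_in_flight(3_400)     # 3.4 us trigger latency
proposed = crossings_in_flight(12_500)   # 12.5 us trigger latency
print(BX_SPACING_NS, current, proposed)
```

Under the proposed latency, the detector has to buffer several hundred crossings instead of a little over a hundred.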













▸ Applies momentum cut to hits 
▸ Delivers estimate on track bend 
▸ Drastically decreases number of hits by a factor of 100
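The momentum cut can be illustrated with the geometry of a track in a solenoid field: a low-momentum track crosses the two closely spaced sensors of a module at a larger angle, so its two hits are further apart. The sketch below is an illustration only. The 3.8 T field, the 1.8 mm sensor gap, the 2 GeV threshold and both function names are assumptions, not values taken from the slides, and the track is assumed to originate at the beam line.

```python
import math

def expected_offset_mm(pt_gev, r_m, sensor_gap_mm=1.8, b_tesla=3.8):
    """Offset between the hits in the two sensors of a module at radius r_m."""
    curvature_radius_m = pt_gev / (0.3 * b_tesla)   # R [m] = pT [GeV] / (0.3 * B [T])
    sin_alpha = r_m / (2.0 * curvature_radius_m)    # angle between track and radial direction
    if sin_alpha >= 1.0:
        return math.inf                             # track curls up before reaching r_m
    return sensor_gap_mm * math.tan(math.asin(sin_alpha))

def passes_pt_cut(measured_offset_mm, r_m, pt_min_gev=2.0,
                  sensor_gap_mm=1.8, b_tesla=3.8):
    """Accept a hit pair if its offset is compatible with pT >= pt_min_gev."""
    limit = expected_offset_mm(pt_min_gev, r_m, sensor_gap_mm, b_tesla)
    return abs(measured_offset_mm) <= limit

stiff = expected_offset_mm(10.0, 0.5)   # high-pT track: small offset
soft = expected_offset_mm(1.0, 0.5)     # low-pT track: large offset
print(passes_pt_cut(stiff, 0.5), passes_pt_cut(soft, 0.5))
```

The measured offset doubles as the bend estimate that later stages can use.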
CURRENT APPROACHES
▸ Associative Memory approach (ASICs) 
▸ Time-multiplexed FPGA Hough transformation 
▸ …
CURRENT APPROACHES USE SPECIALIZED HARDWARE
WHAT ABOUT GPUS?
Nvidia Tesla K40c vs. Xilinx Virtex-7 XC7VX1140T (both 28 nm)




















QUALITATIVE COMPARISON OF STRENGTHS
GPU: Rapid development cycles and high flexibility
FPGA: Huge I/O bandwidth









HOUGH TRANSFORMATION
Equivalent to the FPGA approach used by the collaboration of KIT and the UK Track Trigger Group*
▸ Uncompress data
▸ Perform Hough transformation
➤ Uses module bend information
▸ Apply layer condition
▸ Reject or return track candidates
Partitioning of tracker: 9 in η x 32 in ɸ
*An FPGA-Based Track Finder for the L1 Trigger of the CMS Experiment at the 
High Luminosity LHC (DOI: 10.1109/RTC.2016.7543102)
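The layer condition and the final accept/reject step can be sketched as a bit-mask test per accumulator cell. This is a minimal illustration, not the implementation from the talk: the threshold of 5 layers, the dictionary layout and both function names are assumptions.

```python
def passes_layer_condition(layer_mask: int, min_layers: int = 5) -> bool:
    """True if stubs from at least min_layers distinct layers fell into the cell."""
    return bin(layer_mask).count("1") >= min_layers

def select_candidates(cells, min_layers: int = 5):
    """cells maps (phi0_bin, qpt_bin) to a bit mask with bit i set for layer i."""
    return sorted(cell for cell, mask in cells.items()
                  if passes_layer_condition(mask, min_layers))

cells = {
    (3, 4): 0b111110,   # stubs in 5 of 6 layers: returned as a candidate
    (3, 5): 0b000111,   # stubs in 3 layers only: rejected
}
print(select_candidates(cells))
```

Counting distinct layers rather than stubs keeps several stubs in one layer from faking a track.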
HOUGH TRANSFORMATION
Calculate possible  
(φ0, q/pt) pairs for each hit
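For one hit, this step amounts to scanning the q/pt bins and computing the matching φ0 for each. The sketch below assumes the small-angle relation φ0 = φ + K·r·q/pt with K = 0.15·B for a 3.8 T field, one particular sign convention, and an arbitrary 64 x 32 grid. The binning, ranges, radii and function names are illustrative assumptions, not the parameters used in the talk.

```python
import math
from collections import defaultdict

K = 0.15 * 3.8   # rad per (m / GeV); small-angle approximation, assumed 3.8 T field

def hough_cells(r_m, phi, n_phi0=64, n_qpt=32, qpt_max=0.5,
                phi_min=-0.1, phi_max=0.1):
    """Yield the (phi0_bin, qpt_bin) cells compatible with one hit at (r, phi)."""
    qpt_width = 2.0 * qpt_max / n_qpt
    phi_width = (phi_max - phi_min) / n_phi0
    for qpt_bin in range(n_qpt):                 # bins are independent of each other
        qpt = -qpt_max + (qpt_bin + 0.5) * qpt_width
        phi0 = phi + K * r_m * qpt
        phi0_bin = math.floor((phi0 - phi_min) / phi_width)
        if 0 <= phi0_bin < n_phi0:
            yield phi0_bin, qpt_bin

def fill_accumulator(hits):
    """hits: iterable of (layer, r_m, phi). Returns cell -> bit mask of layers."""
    acc = defaultdict(int)
    for layer, r_m, phi in hits:
        for cell in hough_cells(r_m, phi):
            acc[cell] |= 1 << layer
    return acc

# Toy track placed at the centre of cell (20, 10); radii are illustrative.
qpt_true = -0.5 + 10.5 * (1.0 / 32)
phi0_true = -0.1 + 20.5 * (0.2 / 64)
radii = [0.25, 0.37, 0.52, 0.69, 0.86, 1.08]
hits = [(layer, r, phi0_true - K * r * qpt_true) for layer, r in enumerate(radii)]
acc = fill_accumulator(hits)
print(bin(acc[(20, 10)]))
```

Because each q/pt bin is computed independently, the inner loop maps naturally onto parallel threads.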




























TTBar event - PU 140





OUR IMPLEMENTATION
GPU  implementation specifics: 
▸ Optimized for minimum latency 
▸ Computes q/pt-bins in parallel 
▸ Almost no dependence on number of stubs
▸ Kernel scheduling 
▸ Kernel launch time 




Invocation and setup of kernels are too costly;
we need to keep the kernel running continuously






CURRENTLY NOT POSSIBLE IN OPENCL: 
CACHE CAN’T BE FLUSHED FROM KERNEL!
CUDA - Tesla K40c
WHAT ABOUT DATA TRANSFER?




























At the moment we don’t write back into the FPGA,  














DMA BENCHMARK: POLLING - SPINNING KERNEL
➤ Start transfer ➤ Poll for data ➤ Write back result
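The three steps of the polling benchmark can be mimicked in a single-threaded simulation. The sketch below is only an analogy: `FakeDmaBuffer` is a hypothetical stand-in for a DMA target buffer, and the convention that a sequence word is written after the payload is an assumption, not a description of the hardware used in the talk.

```python
class FakeDmaBuffer:
    """Hypothetical DMA target: payload words followed by a sequence word written last."""

    def __init__(self, n_words, polls_until_done):
        self.words = [0] * (n_words + 1)
        self._delay = polls_until_done
        self._countdown = 0
        self._pending = None

    def start_transfer(self, payload, seq):
        self._pending = (list(payload), seq)
        self._countdown = self._delay

    def tick(self):
        """Advance the simulated transfer by one poll interval."""
        if self._pending is None:
            return
        self._countdown -= 1
        if self._countdown <= 0:
            payload, seq = self._pending
            self.words[:len(payload)] = payload
            self.words[-1] = seq          # completion marker arrives after the data
            self._pending = None

def poll_for_data(buf, expected_seq, max_polls=1000):
    """Spin until the sequence word matches; return (payload, number of polls)."""
    for polls in range(1, max_polls + 1):
        buf.tick()
        if buf.words[-1] == expected_seq:
            return buf.words[:-1], polls
    raise TimeoutError("transfer did not complete")

buf = FakeDmaBuffer(n_words=4, polls_until_done=3)
buf.start_transfer([1, 2, 3, 4], seq=1)       # start transfer
data, polls = poll_for_data(buf, 1)           # poll for data
result = sum(data)                            # stand-in for the written-back result
print(data, polls, result)
```

Polling on a marker that arrives after the payload lets a continuously running consumer detect new data without a new launch per event.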

















DMA BENCHMARK - HOUGH TRANSFORMATION
➤ Read/Uncompress data ➤ Compute ➤ Poll





▸ Computation time is higher than data transfer time
▸ We can hide the transfer behind the computation
INTERLEAVED APPROACH
▸ Start data transfer for current data set 
▸ Do calculations on previous data set (held in register memory)
▸ Poll new data (should take less time)
INCREASES THROUGHPUT 
AT COST OF LATENCY
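The trade-off can be made concrete with a simple timing model. The sketch below assumes fixed transfer and compute times, negligible polling overhead, and latency measured from the transfer request to the finished result. The values of 2 μs and 4 μs are the approximate figures quoted in this talk; the function names are illustrative.

```python
def sequential_schedule(n_events, t_transfer, t_compute):
    """Transfer then compute each event; returns (request_time, done_time) pairs."""
    schedule, now = [], 0
    for _ in range(n_events):
        schedule.append((now, now + t_transfer + t_compute))
        now += t_transfer + t_compute
    return schedule

def interleaved_schedule(n_events, t_transfer, t_compute):
    """Request event i, compute event i-1 meanwhile, then poll for event i."""
    schedule, requested, now = [], [], 0
    for i in range(n_events + 1):
        if i < n_events:
            requested.append(now)                        # start transfer of event i
        if i > 0:
            now += t_compute                             # compute event i-1
            schedule.append((requested[i - 1], now))
        if i < n_events:
            now = max(now, requested[i] + t_transfer)    # poll until event i arrived
    return schedule

def steady_state(schedule):
    """(period between results, latency of the last event), both in input units."""
    period = schedule[-1][1] - schedule[-2][1]
    latency = schedule[-1][1] - schedule[-1][0]
    return period, latency

print(steady_state(sequential_schedule(4, 2, 4)))    # times in microseconds
print(steady_state(interleaved_schedule(4, 2, 4)))
```

In this model the period between results drops from 6 μs to 4 μs, while the per-event latency rises from 6 μs to 8 μs, because each event waits for the previous computation to finish.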













➤ (poll) Read/Uncompress data ➤ Ask for data ➤ Compute




NEW APPROACH - HEXAGONAL HOUGH-SPACE
▸ Hexagonal bins in Hough space
▸ Suppresses fake candidates
▸ Runtime comparable
▸ Only 1 possible bin per row
▸ Less algorithmic branching
CAVEAT: NEEDS MORE BINS 
(FACTOR OF 2)
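How a point is assigned to a hexagonal bin can be sketched with the standard axial-coordinate rounding for a pointy-top hexagonal grid. This is a generic binning routine, not the implementation from the talk. The function names and the `size` parameter are assumptions, and the two Hough-space axes would first have to be rescaled to common units.

```python
import math

def hex_center(q, r, size=1.0):
    """Centre (x, y) of the pointy-top hexagon with axial coordinates (q, r)."""
    x = size * math.sqrt(3.0) * (q + r / 2.0)
    y = size * 1.5 * r
    return x, y

def hex_bin(x, y, size=1.0):
    """Axial coordinates (q, r) of the hexagon that contains the point (x, y)."""
    qf = (math.sqrt(3.0) / 3.0 * x - y / 3.0) / size
    rf = (2.0 / 3.0 * y) / size
    sf = -qf - rf                          # cube coordinates satisfy q + r + s = 0
    q, r, s = round(qf), round(rf), round(sf)
    dq, dr, ds = abs(q - qf), abs(r - rf), abs(s - sf)
    # Recompute the coordinate with the largest rounding error from the other two.
    if dq > dr and dq > ds:
        q = -r - s
    elif dr > ds:
        r = -q - s
    return q, r

cx, cy = hex_center(2, -1)
print(hex_bin(cx, cy), hex_bin(cx + 0.3, cy - 0.3))
```

Rows of hexagons are offset by half a bin against each other, which is what distinguishes this layout from the regular rectangular grid.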
COMPARISON OF FAKE RATES - REGULAR VS. HEXAGONAL
Results for TTBar Dataset PU140, whole detector, 1 event
[Bar chart: number of track candidates (axis from 0 to 110) for REGULAR and HEXAGONAL binning; number of true tracks: 20]
Preliminary Results




















‣ Computational time of around 4 μs 
‣ Transfer time of around 2 μs 
Surpassed our expectations
Development is faster 
More complex algorithms are possible: 
‣ Example: hexagonal approach 
Data transfer using standard interfaces is challenging 
OUTLOOK
▸ Need to process multiple sectors per card in future 
▸ Look at performance of newer cards  
▸ High Bandwidth Memory, already in consumer model cards, promises 2-4x better throughput
▸ Investigate new transfer technologies 
▸ PCIe 4.0 (2x faster) 
▸ NVLink (5-10x faster)
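A rough feel for the impact of these interconnects comes from scaling the transfer time of around 2 μs quoted in this talk. The sketch below is a deliberately naive estimate: it assumes the transfer time scales inversely with link speed, which ignores any fixed per-transfer latency, so the results are best read as optimistic lower bounds. The function name is illustrative.

```python
import math

TRANSFER_US = 2.0   # approximate transfer time quoted in this talk

# (low, high) speed-up factors as listed in the outlook
SPEEDUPS = {"PCIe 4.0": (2, 2), "NVLink": (5, 10)}

def projected_transfer_us(base_us: float, speedup: float) -> float:
    """Naive projection: transfer time assumed inversely proportional to link speed."""
    return base_us / speedup

for link, (low, high) in SPEEDUPS.items():
    slowest = projected_transfer_us(TRANSFER_US, low)
    fastest = projected_transfer_us(TRANSFER_US, high)
    print(f"{link}: {fastest:.1f} to {slowest:.1f} us")
```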





CUDA 7.5 - TESLA K40C
CUDA 8 - TESLA K40C
