Presenter: Hannes Mohr - KIT, IPE

Contributors: L. Ardilla Perez, M. Balzer, M. Caselle, S. Chilingaryan, T. Dritschler, A. Kopmann, L. Rota, T. Schuh, M. Weber

29th September, TWEPP 2016





## **EVALUATION OF GPUS FOR HIGH-LEVEL TRIGGERS IN HIGH ENERGY PHYSICS**

- Implement track trigger using GPUs
- Use established methods for seeding
- Present our own version of the Hough transformation
- Compare different GPUs/vendors
- Investigate data transfer/latencies
- Estimate impact of technological advances

#### OUR GOAL IS TO ACHIEVE COMPETITIVE RESULTS WHILE GAINING FLEXIBILITY



## CMS DETECTOR

# **SILICON TRACKER**

#### **BASELINE GEOMETRY - 6 LAYERS 5 DISKS**



\*Image CMS Collaboration




- Current CMS trigger won't be able to handle:
  - Increased data rates
  - Increased pile-up
- Currently proposed solution:
  - Data reduction on detector
  - Raise latency of trigger from 3.4 to 12.5 μs
  - L1 track trigger





- Readout at 40 MHz, BX every 25 ns
- 6 μs each for L1 Trigger and Global Trigger
- L1 tracking combines track seeding and fitting
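A quick check of the timing budget, using only the numbers quoted on these slides (40 MHz readout, latency raised from 3.4 to 12.5 μs). The helper name `crossings_in_flight` is made up for this sketch; it counts how many bunch crossings elapse while one trigger decision is pending.

```python
# Bunch-crossing arithmetic from the numbers on the slides.
BX_RATE_HZ = 40e6        # readout rate: 40 MHz
LATENCY_OLD_US = 3.4     # current trigger latency in microseconds
LATENCY_NEW_US = 12.5    # proposed trigger latency in microseconds

# Time between two bunch crossings in nanoseconds (25 ns at 40 MHz).
bx_period_ns = 1e9 / BX_RATE_HZ


def crossings_in_flight(latency_us, period_ns=bx_period_ns):
    """Number of bunch crossings that elapse during the given latency."""
    return round(latency_us * 1000.0 / period_ns)


if __name__ == "__main__":
    print("BX period [ns]:", bx_period_ns)
    print("crossings at old latency:", crossings_in_flight(LATENCY_OLD_US))
    print("crossings at new latency:", crossings_in_flight(LATENCY_NEW_US))
```

With these inputs the proposed latency spans 500 bunch crossings, compared with 136 for the current one.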

#### **STUB BUILDING**



- Applies momentum cut to hits
- Delivers estimate on track bend
- Reduces the number of hits by a factor of 100

#### **CURRENT APPROACHES USE SPECIALIZED HARDWARE**

- Associative Memory approach (ASICs)
- Time-multiplexed FPGA Hough transformation

#### **COMPARISON GPU VS. FPGA**

NVIDIA Tesla K40c vs. Xilinx Virtex-7 XC7VX1140T (both 28 nm)



| GPU                                           | FPGA                           |
|-----------------------------------------------|--------------------------------|
| Rapid development cycles and high flexibility | Huge I/O bandwidth             |
| Large bandwidth to external memory            | Deterministic timings/runtimes |
| High floating-point performance               | High bit-level performance     |

Algorithm equivalent to the FPGA approach used by the collaboration of KIT and the UK Track Trigger Group\*

- Uncompress data
- Perform Hough transformation
  - Uses module bend information
- Apply layer condition
- Reject or return track candidates

\*An FPGA-Based Track Finder for the L1 Trigger of the CMS Experiment at the High Luminosity LHC (DOI: 10.1109/RTC.2016.7543102)

#### **HOUGH TRANSFORMATION**



- Calculate possible $(\phi_0, q/p_T)$ pairs for each hit
- Make histogram in Hough space
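The two steps above can be sketched as follows. This is a minimal illustration, not the implementation from the talk: it assumes the small-angle track model $\phi_0 = \phi + 0.15 \cdot B \cdot r \cdot (q/p_T)$ (r in m, $p_T$ in GeV, B = 3.8 T for CMS), and the bin counts, the $q/p_T$ range, the $\phi_0$ sector, the layer radii and the sign convention are all hypothetical choices.

```python
import math

B_FIELD_T = 3.8                  # CMS solenoid field in tesla
K = 0.15 * B_FIELD_T             # from pT = 0.3 * B * R, bend angle ~ r / (2R)
N_QPT_BINS = 32                  # hypothetical number of q/pT rows
N_PHI_BINS = 64                  # hypothetical number of phi0 columns
QPT_MAX = 1.0 / 3.0              # hypothetical: |q/pT| <= 1/(3 GeV)
PHI_MIN, PHI_MAX = 0.0, 0.5      # hypothetical phi0 sector in radians
QPT_WIDTH = 2.0 * QPT_MAX / N_QPT_BINS
PHI_WIDTH = (PHI_MAX - PHI_MIN) / N_PHI_BINS
# Hypothetical barrel layer radii in metres (6 layers).
LAYER_RADII_M = {1: 0.25, 2: 0.37, 3: 0.52, 4: 0.69, 5: 0.89, 6: 1.08}


def qpt_center(i_qpt):
    """Central q/pT value of row i_qpt."""
    return -QPT_MAX + (i_qpt + 0.5) * QPT_WIDTH


def hough_accumulate(stubs):
    """Fill the Hough histogram.

    stubs: iterable of (r_m, phi_rad, layer_id).
    Returns {(i_qpt, i_phi): set of layer ids with a stub in that cell}.
    """
    cells = {}
    for r, phi, layer in stubs:
        for i_qpt in range(N_QPT_BINS):          # one phi0 per q/pT row
            phi0 = phi + K * r * qpt_center(i_qpt)
            i_phi = math.floor((phi0 - PHI_MIN) / PHI_WIDTH)
            if 0 <= i_phi < N_PHI_BINS:
                cells.setdefault((i_qpt, i_phi), set()).add(layer)
    return cells


def stubs_from_track(qpt, phi0):
    """Generate one ideal stub per layer for a track with the given parameters."""
    return [(r, phi0 - K * r * qpt, layer) for layer, r in LAYER_RADII_M.items()]


if __name__ == "__main__":
    track_stubs = stubs_from_track(qpt_center(20), PHI_MIN + 32.5 * PHI_WIDTH)
    cells = hough_accumulate(track_stubs)
    best = max(cells, key=lambda c: len(cells[c]))
    print("most populated cell:", best, "layers:", sorted(cells[best]))
```

Each stub contributes exactly one cell per $q/p_T$ row, so the rows are independent of each other.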


#### **CLUSTERING POINTS ARE TRACK CANDIDATES**



#### FILTERING BY LAYER CONDITION
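A minimal sketch of a layer condition, assuming it means "a cell is kept only if its stubs come from enough distinct layers". The threshold of 5 layers and the function name `apply_layer_condition` are hypothetical; the slides do not state the value used.

```python
MIN_LAYERS = 5  # hypothetical threshold, not taken from the slides


def apply_layer_condition(cells, min_layers=MIN_LAYERS):
    """Keep only Hough cells whose stubs span at least min_layers distinct layers.

    cells: {cell_index: iterable of layer ids, one entry per stub in the cell}.
    Returns {cell_index: sorted list of distinct layer ids} for surviving cells.
    """
    candidates = {}
    for cell, layers in cells.items():
        distinct = set(layers)          # several stubs in one layer count once
        if len(distinct) >= min_layers:
            candidates[cell] = sorted(distinct)
    return candidates


if __name__ == "__main__":
    example = {
        (0, 0): [1, 1, 1, 2, 2],        # 5 stubs but only 2 layers: rejected
        (3, 4): [1, 2, 3, 4, 5, 6],     # 6 layers: kept
        (5, 5): [1, 2, 3, 4, 6],        # 5 layers: kept
    }
    print(apply_layer_condition(example))
```

Counting distinct layers rather than stubs is the point of the filter: a cell crowded with stubs from one or two layers is not a track candidate.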



GPU implementation specifics:

- Optimized for minimum latency
- Computes q/pt-bins in parallel
- Almost no dependence on number of stubs

### OVERHEADS

- Kernel scheduling
- Kernel launch time
- Allocation of shared memory

Invocation and setup of kernels is too costly; we need to keep the kernel running continuously.



#### **BENCHMARK KERNEL RUNTIME – SPINNING**



### WHAT ABOUT DATA TRANSFER?

#### DMA SETUP – CPU ONLY STARTS THE KERNEL



Red: conventional transfer. Green: RDMA transfer.







At the moment we don't write back into the FPGA.

#### DMA BENCHMARK: POLLING – SPINNING KERNEL

Read and write 160 stubs (64 bits each)

Start transfer > Poll for data > Write back result
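The payload size of this benchmark follows directly from the numbers above. The helper `effective_bandwidth_gb_s` is a made-up name for this sketch, and the 2 μs passed to it in the example is only an illustrative input, not a measured value for this benchmark.

```python
N_STUBS = 160         # stubs per transfer, from the slide
BITS_PER_STUB = 64    # stub word size, from the slide

# Payload of one transfer in bytes: 160 * 64 bits = 1280 bytes.
payload_bytes = N_STUBS * BITS_PER_STUB // 8


def effective_bandwidth_gb_s(n_bytes, transfer_time_us):
    """Effective bandwidth in GB/s for n_bytes moved in transfer_time_us."""
    return n_bytes / (transfer_time_us * 1e-6) / 1e9


if __name__ == "__main__":
    print("payload [bytes]:", payload_bytes)
    print("bandwidth at 2 us [GB/s]:", effective_bandwidth_gb_s(payload_bytes, 2.0))
```

A payload this small sits far below what PCIe can sustain for bulk transfers, so fixed per-transfer overheads are likely to dominate over raw bandwidth.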



#### **DMA BENCHMARK – HOUGH TRANSFORMATION**

Read/Uncompress data > Compute > Poll



- Computation time is higher than data transfer
- We can hide the transfer behind the computation

#### **INTERLEAVED APPROACH**

- Start data transfer for current data set
- Do calculations on the previous data set (held in register memory)
- Poll new data (should take less time)

#### INCREASES THROUGHPUT AT THE COST OF LATENCY
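The trade-off can be illustrated with a toy timing model. It assumes that transfer and computation overlap perfectly in the interleaved case and ignores polling and write-back overheads; the function names are hypothetical, and the inputs of 2 μs (transfer) and 4 μs (computation) are the approximate figures quoted in the summary of this talk.

```python
def sequential(t_transfer_us, t_compute_us):
    """Transfer, then compute, one data set at a time."""
    period = t_transfer_us + t_compute_us
    return {"period_us": period, "latency_us": period}


def interleaved(t_transfer_us, t_compute_us):
    """Transfer data set n while computing on data set n-1.

    A new result is produced every max(transfer, compute). Each data set
    waits one such period while it is being transferred and is then
    computed in the following iteration.
    """
    period = max(t_transfer_us, t_compute_us)
    latency = period + t_compute_us
    return {"period_us": period, "latency_us": latency}


if __name__ == "__main__":
    for name, model in (("sequential", sequential), ("interleaved", interleaved)):
        result = model(2.0, 4.0)
        rate_khz = 1000.0 / result["period_us"]
        print(name, result, "rate [kHz]: %.1f" % rate_khz)
```

Under these assumptions the period drops from 6 μs to 4 μs per data set, while the latency of an individual data set grows from 6 μs to 8 μs.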

#### **DMA BENCHMARK: INTERLEAVED HT**

Data older, throughput higher

(poll) Read/Uncompress data > Ask for data > Compute



#### **NEW APPROACH – HEXAGONAL HOUGH-SPACE**

- Hexagonal bins in Hough space
- Suppresses fake candidates
- Runtime comparable
- Only 1 possible bin per row
- Less algorithmic branching

CAVEAT: NEEDS MORE BINS (FACTOR OF 2)
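The slides do not spell out the binning rule. As a generic illustration only, a point can be assigned to a hexagonal cell by a nearest-center search over the two adjacent offset rows. The sketch assumes both Hough-space axes have already been scaled to a common unit; `hex_bin` and its `size` parameter are hypothetical names.

```python
import math


def hex_bin(x, y, size=1.0):
    """Return (col, row) of the hexagonal cell containing the point (x, y).

    Cells are pointy-top regular hexagons with circumradius `size`.
    Row centers are 1.5 * size apart; odd rows are shifted by half a column.
    """
    dx = math.sqrt(3.0) * size      # horizontal distance between centers
    dy = 1.5 * size                 # vertical distance between rows
    row_below = math.floor(y / dy)
    best = None
    # The containing hexagon has its center in one of the two bracketing rows.
    for row in (row_below, row_below + 1):
        shift = 0.5 * (row % 2)     # odd rows are offset by half a column
        col = round(x / dx - shift)
        cx = (col + shift) * dx
        cy = row * dy
        d2 = (x - cx) ** 2 + (y - cy) ** 2
        if best is None or d2 < best[0]:
            best = (d2, col, row)
    return best[1], best[2]


if __name__ == "__main__":
    print(hex_bin(0.0, 0.0))    # center of cell (0, 0)
    print(hex_bin(0.0, 0.9))    # still inside cell (0, 0), near its top vertex
    print(hex_bin(0.8, 0.9))    # falls into the offset row above
```

Because the nearest lattice center defines the cell, every point lands in exactly one bin, and only two candidate centers need to be compared.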





Results for the TTbar data set at PU140, whole detector, 1 event. Number of true tracks: 20.

#### **DMA BENCHMARK: HEXAGONAL HT**

(poll) Read/Uncompress data > Ask for data > Compute



Performance surpassed our expectations:

- Computation time of around 4 μs
- Transfer time of around 2 μs

**Development is faster** 

- More complex algorithms are possible:
  - Example: hexagonal approach

Data transfer using standard interfaces is challenging

Need to process multiple sectors per card in future

- Look at performance of newer cards
  - High Bandwidth Memory, already in consumer model cards, promises 2-4x better throughput
- Investigate new transfer technologies
  - PCIe 4.0 (2x faster)
  - NVLink (5-10x faster)

#### MOORE'S LAW IS OUR FRIEND!

#### **THANK YOU!**

## **QUESTIONS?**



Karlsruher Institut für Technologie



