- PREPARED FOR SUBMISSION TO JINST
- TOPICAL WORKSHOP ON ELECTRONICS FOR PARTICLE PHYSICS
- September 26<sup>th</sup> to 30<sup>th</sup>, 2016
- KARLSRUHE, GERMANY

# Evaluation of GPUs as a level-1 track trigger for the High-Luminosity LHC

H. Mohr,<sup>1</sup> T. Dritschler, L. E. Ardila, M. Balzer, M. Caselle, S. Chilingaryan, A. Kopmann, L. Rota, T. Schuh, M. Vogelgesang, M. Weber

Karlsruhe Institute of Technology,

Hermann-von-Helmholtz-Platz 1, 76344, Eggenstein-Leopoldshafen, Germany

*E-mail:* h.mohr.hd@googlemail.com

ABSTRACT: In this work, we investigate the use of GPUs as a way of realizing a low-latency, high-throughput track trigger, using CMS as a showcase example. The CMS detector at the Large Hadron Collider (LHC) will undergo a major upgrade after the long shutdown from 2024 to 2026, when it will enter the high luminosity era. During this upgrade, the silicon tracker will have to be completely replaced. In the High Luminosity operation mode, luminosities of $5 - 7 \times 10^{34}$ cm<sup>-2</sup>s<sup>-1</sup> and pileups averaging 140 events, with a maximum of up to 200 events, will be reached. These changes will require a major update of the triggering system. The demonstrated systems rely on dedicated hardware such as associative memory ASICs and FPGAs. We investigate the use of GPUs as an alternative way of realizing the requirements of the L1 track trigger. To this end we implemented a Hough transformation track finding step on GPUs and established a low-latency RDMA connection using the PCIe bus. To showcase the benefits of floating point operations, made possible by the use of GPUs, we present a modified algorithm. It uses hexagonal bins for the parameter space and leads to a more truthful representation of the possible track parameters of the individual hits in Hough space. This leads to fewer duplicate candidates and reduces fake track candidates compared to the regular approach. With data-transfer latencies of 2 $\mu$s and processing times for the Hough transformation as low as 3.6 $\mu$s, we can show that latencies are not as critical as expected. However, computing throughput proves to be challenging due to hardware limitations.

KEYWORDS: Trigger concepts and systems (hardware and software), Trigger algorithms, Computing (architecture, farms, GRID for recording, storage, archiving, and distribution of data), Data processing methods

<sup>1</sup>Corresponding author.

# Contents

- 1 Introduction
  - 1.1 Motivation
  - 1.2 Graphics processing units
- 2 The CMS trigger system at the high luminosity LHC
- 3 Hough transformation
  - 3.1 Hexagonal Hough transform
  - 3.2 Correctness and implications on the parameter space
  - 3.3 Benefits
    - 3.3.1 Tracking efficiency and rates of duplicate and fake candidates
- 4 Implementation of the GPU track trigger
  - 4.1 RDMA interconnect and low latency requirements
  - 4.2 Implementation details
- 5 Benchmarks
- 6 Conclusion

#### 1 Introduction

#### 1.1 Motivation

The increased luminosity and the large number of pile-up events at the High Luminosity Large Hadron Collider (HL-LHC) make triggering a major challenge. To address this challenge, CMS is implementing a first-level track trigger. Starting with the upgrade during long shutdown 3, information from the silicon tracker will be used for the first time in the level-1 triggering decision, in what is called the L1 track trigger. The latency will be raised from 3.4 $\mu$s to 12 $\mu$s, leaving around 4 $\mu$s for the L1 track trigger processing. CMS is pursuing several approaches based on ASICs and FPGAs to meet the increased trigger requirements.

However, modern Graphics Processing Units (GPUs) have considerably increased in computing performance and interconnect capabilities in recent years. Their widespread availability and easy programming, combined with their high computational power, make the use of GPUs as a possible level-1 track trigger platform an option worth considering.

#### 1.2 Graphics processing units

GPUs are massively parallel, multi-core processing units. They consist of streaming multiprocessors that are specialized in performing the same operation on different data. This is called a SIMD (Single Instruction Multiple Data) model. Computational tasks are arranged in so-called thread blocks. Each thread block consists of numerous threads that perform the same operation simultaneously. If there are different logical paths, they have to be computed sequentially. This is called *algorithmic branching*. The concept of a thread block is especially important for GPU scheduling as it allows threads residing in the same block to share data among one another using fast internal memory.

There are different memory types. The biggest but also slowest one is global memory, which is commonly connected to the GPU's die externally and can be accessed by all threads at all times. The aforementioned fast internal memory is called *shared memory*. It is considerably faster than global memory (by a factor of 100), but also much smaller, with only some kilobytes of storage. It is only visible to the threads within one block. The fastest form of memory is register memory. It is available only to a single thread and only holds a few bytes of data.

In this work, we use a late 2013 Tesla K40c and the CUDA 8 framework, developed by NVIDIA to facilitate general purpose computing on their GPUs [1].

#### 2 The CMS trigger system at the high luminosity LHC

After the upgrade, the Phase-2 CMS detector will have to face luminosities of $5 - 7 \times 10^{34}$ cm<sup>-2</sup>s<sup>-1</sup> and pileups (PUs) averaging 140 events, up to a maximum of 200 events. The trigger system currently used cannot handle the resulting data rates, nor can it discriminate the background from the physics interactions of interest. In Phase-2, data from the silicon tracker will be used in the online triggering procedure for the first time. This Level-1 track trigger will provide the global trigger system with a list of found track candidates, their fitted track parameters and their respective hits. To reduce the data flow from the silicon tracker and to cope with the increased data rates, an on-detector front-end data reduction scheme will be used [2]. It will only allow detector hits to pass if they are above a minimum transverse momentum. This process is called stub building [3].

The design goal of the new trigger system is to maintain physics acceptance while dealing with a trigger acceptance rate of 750 kHz, compared to the current rate of 100 kHz. The involved challenges, and thus the need for new advances in both technology and algorithms, are apparent.

Currently three hardware concepts are being discussed. The Associative Memory approach uses pattern banks to find track candidates [4]. The tracklet approach finds promising pairs of stubs, called tracklets. Combining two stubs in this way improves their momentum resolution [5] and allows the initial track segments to be extrapolated to neighboring layers, thus facilitating the identification of full tracks [6]. The time-multiplexed track-trigger approach performs a Hough transformation in FPGAs followed by a fitting step, such as a linear regression or a Kalman filter [7]. The track finding procedure discussed below is based on the time-multiplexed track-trigger approach.

#### 3 Hough transformation

The Hough transformation is a global method for detecting image features and has wide applications in both image analysis and particle physics. It scales linearly with the number of detector hits, since each hit gets transformed into Hough space independently. The underlying assumption for the transformation is a helix track model with the impact parameter $d_0$ assumed to be zero, so that all tracks are taken to come from the middle of the detector.

The charged particles are bent inside the magnetic field of the detector. They move in a circle of radius R, according to

$$R = \frac{p_{\rm t}}{1.14q}.\tag{3.1}$$

where $R$ is in m, $p_t$ is the transverse momentum in GeV and $q$ the charge of the particle. Since the detector is already applying a filtering step for tracks by means of stub building, we can consider the stubs entering the Hough transform to be high-$p_t$ tracks. Furthermore, we ignore energy losses due to interactions of the particles. Following the approach from [7], we carry out the transformation with respect to a pivot point $r' = r - r_c$, where $r_c = 65$ cm. This optimizes the parameter space. Furthermore, a shift in $\phi$ is introduced as $\phi' = \phi - \phi_c$, enabling us to carry out the calculation in independent $\phi$-sectors. This leads to

$$\phi'_0 = \phi' - \frac{0.57q}{p_t} r', \tag{3.2}$$

where $\phi'_0$ is the track's production angle. The set of possible track parameters for a given detector hit is thus represented by a line in the $r - \phi$ parameter space. The boundaries of the parameter space are fixed. The allowed momentum range is given by the front-end momentum cut, whereas the production angle range is given by the size of the current $\phi$-sector. Lines from different detector hits cross and form clustering points. Those clusters correspond to track candidates. We accept a cluster as a track candidate if it consists of at least 5 hits from different detector layers.
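The filling and thresholding steps described above can be sketched in a few lines of Python. This is a minimal illustration on a regular (rectangular) grid, not the GPU implementation evaluated in this work. The function names, the $q/p_t$ range (`qpt_max`, which would follow from the front-end momentum cut), the sector half-width (`phi_half_width`) and the 32 x 32 binning are assumptions chosen for illustration. As a simplification, each stub fills exactly one cell per $q/p_t$ column, evaluated at the column centre.

```python
import math

def fill_hough(stubs, n_qpt=32, n_phi=32, qpt_max=1.0 / 3.0, phi_half_width=0.1):
    """Fill a rectangular Hough histogram.

    stubs: iterable of (r_prime [m], phi_prime [rad], layer) tuples.
    Returns a 2D list of layer bit masks indexed as [qpt_bin][phi0_bin].
    qpt_max and phi_half_width are illustrative assumptions.
    """
    d_qpt = 2.0 * qpt_max / n_qpt
    d_phi = 2.0 * phi_half_width / n_phi
    masks = [[0] * n_phi for _ in range(n_qpt)]
    for r_prime, phi_prime, layer in stubs:
        for i in range(n_qpt):
            qpt = -qpt_max + (i + 0.5) * d_qpt        # q/p_t at the bin centre
            phi0 = phi_prime - 0.57 * qpt * r_prime    # eq. (3.2)
            j = math.floor((phi0 + phi_half_width) / d_phi)
            if 0 <= j < n_phi:
                masks[i][j] |= 1 << layer              # remember which layer voted
    return masks

def find_candidates(masks, min_layers=5):
    """Return all cells that were hit by at least min_layers distinct layers."""
    return [(i, j)
            for i, row in enumerate(masks)
            for j, mask in enumerate(row)
            if bin(mask).count("1") >= min_layers]
```

Storing a bit mask per cell rather than a plain counter makes the "hits from different detector layers" condition a simple bit count, which mirrors the layer bit mask mentioned in section 4.2.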

#### 3.1 Hexagonal Hough transform

We developed an algorithm that makes use of the GPU's floating point capabilities. It divides the parameter space into regular hexagons. We will describe the basic idea behind the algorithm and point out some of its benefits.

The maximum slope $m_{\text{max}}$ for a track in parameter space is realized when $r'$ is maximized. $r'_{\text{max}}$ is determined by the detector geometry. Since the parameter space boundaries are fixed, the effective maximum slope is given by the number of bins we choose. We can guarantee that a line is assigned to only one hexagon per row, where hexagon centers are atop each other, by demanding

$$m_{\max} < \frac{h}{w} \approx 1.15. \tag{3.3}$$

Here $h = 2R$ is the hexagon's height, and $R$ the outer radius. Its width $w$ corresponds to twice the inner radius. If we arbitrarily choose $w = 1$, then $h = \frac{2}{\sqrt{3}}$. If (3.3) is fulfilled, it suffices to calculate the $\phi$-axis value, according to (3.2), once in the hexagon's center (with respect to the momentum axis) for each row. The valid bin is either directly known, because we are already inside a given hexagon, or found by linear extrapolation. The linear propagator $\delta h$ is based on the ratio of the current and maximum slope. It is given by

$$\delta h = 0.5 \frac{r'}{r'_{\text{max}}} s. \tag{3.4}$$

In this way we define thresholds that decide whether a line is assigned to the upper or the lower hexagon. The principle is illustrated in figure 1.
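For reference, the numerical value in (3.3) follows directly from the geometry of a regular hexagon. With outer radius $R$, the inner radius is $R\cos 30^\circ = \frac{\sqrt{3}}{2}R$, so that

$$\frac{h}{w} = \frac{2R}{2 \cdot \frac{\sqrt{3}}{2} R} = \frac{2}{\sqrt{3}} \approx 1.155,$$

which agrees with the value $h = \frac{2}{\sqrt{3}}$ obtained for the choice $w = 1$.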

#### 3.2 Correctness and implications on the parameter space

Using 32 bins in $\frac{q}{p_t}$ we find the number of bins in $\phi_c$ that guarantees (3.3) to be 23. The method assumes that we get arbitrarily close to the value. This overestimates the allowed parameter space and thus allows for an uncertainty in the measured quantities. Effectively we allow lines that barely miss a given cell to still contribute to the cell. If we use more bins, and allow for higher values of the slope, we achieve the opposite and apply a cut in the parameter space. The latter reduces the number of valid cells around the clustering points, with effects on both efficiency and the fake and duplicate rates. The results of the two approaches are discussed in section 3.3.1.

#### 3.3 Benefits

The algorithm uses the geometric properties of the hexagons, the underlying helix track model and the properties of the tracker to its advantage.

The layout of the grid allows for a maximum of three cells per cluster, as compared to the four cells using rectangles. The neighborhood relations in a hexagonal grid are equidistant and well-defined. We only have to check six neighboring cells as compared to the eight potential neighbors of a rectangle. The greater slope (1.15 as compared to 1) decreases the width of clusters in the momentum axis, leading to fewer duplicate track candidates, as each cluster covers fewer bins. On the other hand, the clusters are more spread out in the production angle axis, which would normally lead to more votes. This effect is mitigated by the alternating layout of the hexagonal grid. The reduced number of fake track candidates leads to a decrease in the workload of the duplicate removal. The overall higher number of bins (approximately a factor of two) leads to more, and more densely packed, seed values for the track parameters. This is due to the hexagon being the two-dimensional polygon that segments the plane most effectively. Generally, the distance to a cell's center is minimized while the area of each cell is smaller. This leads to a finer resolution of the allowed track parameters.

In figure 1 a 2D histogram of the number of cells needed to represent a line is shown. The axes of the histogram range over all allowed input values for $r'$ and $\phi'$. This value is higher for the hexagonal approach shown in the center of the graph, as compared to the regular one shown on the left. The finer rasterization of the line leads to a more truthful representation of the allowed track parameters. The plot on the right is the number of cells used for each possible line, but weighted with the relative size of each cell in parameter space. This illustrates that we use more individual cells but in many cases cover less of the parameter space. Figure 2 shows the filled Hough maps for the regular and hexagonal approach for a qualitative comparison. It is noticeable that the maxima in the hexagonal case are well defined, whereas in the regular one they spread out over two neighboring cells in both cases.

**Figure 1**. Comparison of the regular and hexagonal parameter space coverage with respect to the regular approach (left). The histograms give the number of cells needed to represent a line in the parameter space. The histograms cover all of the allowed input values of $r'$ and $\phi'$ for the regular approach, hexagonal approach and for the hexagonal transformation, weighted by the relative area with respect to the regular approach. Graphical representation of the binning procedure (right). This shows the procedure for determining whether a line enters either the upper or lower hexagon inside the current row. Both lines will be counted in the lower, red hexagon because they are both beneath their threshold, also shown in red. The solid line has maximum slope. Its thresholds (also solid) coincide at the center. The dashed line would miss both hexagons if it were between its thresholds (dashed).

From a computational point of view, the implementation on a GPU benefits from the property of needing only one cell per momentum value, which avoids algorithmic branching. While the needed memory increases by a factor of about two, the number of calculations stays the same, as we do not have to check for a second possible production angle value as is the case for the regular grid.

#### 3.3.1 Tracking efficiency and rates of duplicate and fake candidates

To evaluate the algorithm's performance with respect to the track finding efficiency, a TTBar dataset was analyzed for both pile-up scenarios. As is to be expected, the overall track finding performance of the hexagonal approach is similar to that of the normal version of the algorithm. For an in-depth discussion see [7]. The efficiency is slightly higher, by around 0.35%, for the uncut hexagonal version. The number of produced tracks drops by around 33%. The smaller number of found track candidates manifests itself mainly in a reduction of duplicates. We attribute this behavior to the aforementioned properties. The slightly higher efficiency is due to the allowed uncertainty in the input parameters as discussed in section 3.2. As for the parameter space cut approach using 29 bins, we find a drop in efficiency of about 0.5% compared to the rectangular approach. The number of found track candidates drops in this case by around 55%. This is to be expected due to the nature of the approach. The number of fake candidates drops significantly as compared to the uncut version, which makes this an interesting option for reducing the overall workload on the system.

**Figure 2**. Comparison of the regular and hexagonal binning behavior. The shown hexagonal transformation (left) has 29 bins in $\phi_0$, corresponding to the momentum cut version. Nevertheless, the binned representation of the line is without gaps. For the regular transformation (right) the width of the clusters is bigger, not just in the index space, but also in the parameter space.

#### 4 Implementation of the GPU track trigger

We use an FPGA-based data transfer infrastructure based on the Direct Memory Access (DMA) architecture described in [8]. The FPGA is loaded with our test data in advance and is programmed with custom DMA firmware that enables low latency data transfer from the FPGA directly into the GPU's memory [9, 10]. The data is assumed to be compressed, making use of the detector's finite resolution. Each stub is compressed to a size of 64 bits. We assume a detector segmentation of $32\,\phi \times 9\,\eta$. The stub coordinates are transmitted in accordance with the local coordinate system of the detector segment in which they were detected.
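To make the 64-bit stub format concrete, the following Python sketch packs and unpacks one stub word. The text only fixes the total size of 64 bits; the field names and widths in `FIELDS` (including the `bend` and `spare` fields) are hypothetical and chosen purely for illustration, as are the function names.

```python
import struct

# Hypothetical field layout (name, width in bits); only the 64-bit total
# is taken from the text, the individual fields are assumptions.
FIELDS = (("r", 16), ("phi", 16), ("z", 16), ("layer", 4), ("bend", 4), ("spare", 8))

def pack_stub(**values):
    """Pack integer field values into one little-endian 64-bit word."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit into {width} bits")
        word = (word << width) | value
    return struct.pack("<Q", word)

def unpack_stub(data):
    """Inverse of pack_stub: return a dict of field values."""
    (word,) = struct.unpack("<Q", data)
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields
```

With 8 bytes per stub, the 160 stubs per sector allowed in the benchmarks of section 5 correspond to 1280 bytes, which matches the transfer size quoted there.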



**Figure 3**. Comparison of the regular data transfer scheme, optimized for low latency (left) and the pipelined data transfer scheme, optimized for maximum throughput (right).

#### 4.1 RDMA interconnect and low latency requirements

Traditionally the usage of GPUs introduces a latency penalty that would render them unusable for low latency applications. The overheads associated with traditional memory transactions are typically of the order of tens of microseconds. Additionally, there are overheads involved with starting kernels, due to the allocation of shared memory as well as the scheduling on both the host and device side.

Our setup uses an NVIDIA Tesla K40c (late 2013) GPU for the computations and a Xilinx Virtex-7 XC7VX1140T FPGA as a data source. The interconnect is established using an RDMA data transfer scheme. All memory transactions are initiated from within the GPU program (*kernel*). To avoid kernel launch overheads, our kernel runs in a continuous loop. In each iteration, it initiates a data transfer, waits for the transfer to complete and then performs the Hough transformation. Together with the aforementioned RDMA transfer scheme, this mitigates most of the involved overheads. It also brings with it some restrictions concerning workload balancing (section 4.2) that will need to be addressed in future work.

#### 4.2 Implementation details

The filling of the parameter space histogram used in the Hough transformation is currently done with one thread block per momentum bin. This means we use a total of $n_{\text{bins}}^{p_{\text{t}}}$ thread blocks, each with a number of threads corresponding to the maximum allowed number of stubs. This is a necessity, given the current implementation and hardware restrictions.

Calculating a histogram on a GPU cannot be performed completely in parallel. To update a counter inside a given histogram bin, the current value has to be read from memory, updated and then written back. This leads to *race conditions* if two or more execution units write into the same bin at the same time, which needs to be prevented. The CUDA framework offers so-called *atomic operations*, which guarantee read-modify-write operations without interference from other threads. This has a negative effect on the throughput, as it forces other concurrent threads to wait while memory is being modified, which makes atomic operations costly. To reduce the impact of the enforced waiting times, we limit atomic operations to shared memory, which is significantly faster than global memory. The needed atomic operations can only operate on regions of 32 or 64 bits in size. The used memory consists of a counter for each bin, a bit mask keeping track of the hit layers, and the list of stub indices belonging to the cell. All of the above take up 32 bits each. The needed shared memory is thus 130 kilobytes for the regular transformation and 260 kilobytes for the hexagonal one. This exceeds the amount of available shared memory per thread block for the given card, which only provides a maximum of 48 kilobytes per block. Because of this limitation, we are forced to spread our calculation across multiple thread blocks. This procedure also increases the number of concurrently used threads by a factor corresponding to the number of used thread blocks. In our current implementation, we chose to spread the calculation across 32 blocks. This has a positive effect on the system's latency, but a negative impact on its throughput. There are ways around these problems; however, they are beyond the scope of the current proof-of-concept minimum latency implementation.
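The shared memory budget above can be checked with a short calculation. The sketch below uses only the figures quoted in the text (130 and 260 kilobytes in total, 48 kilobytes per block, 32 blocks). It assumes, for illustration, that the histogram can be divided evenly across blocks; the function names are hypothetical.

```python
import math

SHARED_PER_BLOCK_KB = 48  # per-block limit quoted in the text for the K40c

def blocks_needed(total_kb, per_block_kb=SHARED_PER_BLOCK_KB):
    """Smallest number of thread blocks whose shared memory can hold the histogram."""
    return math.ceil(total_kb / per_block_kb)

def per_block_share_kb(total_kb, n_blocks):
    """Shared memory used per block, assuming an even split (an assumption)."""
    return total_kb / n_blocks

for name, total_kb in (("regular", 130), ("hexagonal", 260)):
    print(name, blocks_needed(total_kb), per_block_share_kb(total_kb, 32))
```

Under these assumptions at least 3 blocks (regular) or 6 blocks (hexagonal) are required by the memory limit alone, so the choice of 32 blocks, one per momentum bin, leaves each block well below the 48 kilobyte limit.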

#### 5 Benchmarks

To assess GPUs as a possible track trigger we benchmark the data transfer and computational performance in terms of their latency. The timings were measured both on the GPU and the FPGA. Inside the running kernel we measure the time at certain points using the GPU clock cycles. Inside the FPGA there is a hardware timer that counts the FPGA clock cycles with a resolution of 4 ns. The former allows us to perform fine-grained measurements of the individual steps, while the latter assures correctness of the results. When the transfer is initiated, the FPGA writes data into the GPU memory. At the same time the GPU starts polling for the arrival of this data by continuously checking a specific memory region for a special completion marker, which the FPGA will write to this location once the data transfer is complete. Afterwards, the GPU immediately starts to process the received data. We call the combined time for the data transfer and the time it takes the kernel to notice the completion marker the *polling time* (table 1). The total computation time includes the transfer from global to shared memory, the decompression of the data, the filling of the histogram as well as the filtering step based on our threshold condition.

The polling step performed by the GPU is done by a single thread, halting all other operations for that time. Therefore, most of the GPU's computing resources remain idle while the GPU waits for the data transfer to finish. For this reason, we implemented a second variant that maximizes throughput at the cost of some latency (figure 3). In this case the GPU takes the transmitted data from global memory and copies it into shared memory. It then immediately requests new data to be written to global memory. While this new data is being transmitted to global memory, the GPU simultaneously starts to process the data that was copied to shared memory. This way, we hide the transfer time behind the computations. We call this the *pipelined* approach. Except for the modified ordering of data transfer and computation, it is identical to the normal approach.
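The difference between the two orderings can be illustrated with a small timing model. This is a simplified sketch, not a description of the measured kernel: it assumes constant transfer, computation and polling times, and in the pipelined case it assumes that the next transfer is requested at the moment the computation starts (the copy from global to shared memory is counted as part of the computation). All function and parameter names are illustrative.

```python
def sequential_finish_times(n_events, transfer_us, compute_us):
    """Each event is first transferred (and polled for), then processed."""
    t, finished = 0.0, []
    for _ in range(n_events):
        t += transfer_us + compute_us
        finished.append(t)
    return finished

def pipelined_finish_times(n_events, transfer_us, compute_us, poll_us):
    """The transfer of the next event overlaps the computation of the current one."""
    t, finished = 0.0, []
    data_ready = transfer_us                 # the first transfer cannot be hidden
    for _ in range(n_events):
        t = max(t, data_ready) + poll_us     # notice the completion marker
        data_ready = t + transfer_us         # request the next transfer right away
        t += compute_us                      # compute while the transfer runs
        finished.append(t)
    return finished
```

In this model the steady-state spacing between results is the sum of transfer and computation time in the sequential case, but only the computation time plus the polling overhead in the pipelined case, as long as the transfer is shorter than the computation.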

The data used in our benchmarks is from a TTBar PU140 data set. The data transmission time shows a very stable behavior and takes on the order of 2 $\mu$s for 1280 bytes of data. The computation time exhibits a larger spread and dominates the total time taken for both algorithms. It is around 3.7 $\mu$s for the normal Hough transform and around 4.9 $\mu$s for the hexagonal version. Checking for newly arrived data still takes around 0.4 $\mu$s. This is due to the relatively large penalty that comes with accessing global memory, which is on the order of 300 GPU clock cycles.

It should be noted that the currently used card does not have sufficient concurrent threads to perform these calculations for a larger number of stubs in parallel. We have decided to allow for 160 stubs per sector in these benchmarks, even though that surpasses the number of available threads by about a factor of two. This allows a measurement to be given where the effects of high occupancy are accounted for.

**Table 1**. Execution times for polling and computation for both algorithms with both measured setups. All times given in  $\mu$ s.

| Sequential           | mean | $\sigma$ | max  | Pipelined   | mean | $\sigma$ | max  |
|----------------------|------|----------|------|-------------|------|----------|------|
| Regular              |      |          |      |             |      |          |      |
| Polling and Transfer | 1.96 | ±0.02    | 2.01 | Polling     | 0.41 | ±0.01    | 0.43 |
| Computation          | 3.73 | ±0.13    | 4.19 | Computation | 3.63 | ±0.10    | 3.82 |
| Total                | 5.84 | ±0.14    | 6.33 | Total       | 4.17 | ±0.07    | 4.37 |
| Hexagonal            |      |          |      |             |      |          |      |
| Polling and Transfer | 1.91 | ±0.11    | 1.99 |             |      |          |      |
| Computation          | 4.94 | ±0.11    | 5.20 |             |      |          |      |
| Total                | 7.07 | ±0.12    | 7.40 |             |      |          |      |
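The entries of table 1 can be cross-checked with a few lines of Python. The sketch below copies the mean values from the table and reports, for each configuration, the sum of the itemized steps, the part of the total that is not itemized, and the event rate that the mean total time would imply. The rate assumes that one sector event is processed back to back per total time, which is an assumption made here for illustration and not a measured throughput.

```python
# Mean values in microseconds, copied from table 1.
TABLE_US = {
    ("regular", "sequential"): {"polling": 1.96, "computation": 3.73, "total": 5.84},
    ("regular", "pipelined"): {"polling": 0.41, "computation": 3.63, "total": 4.17},
    ("hexagonal", "sequential"): {"polling": 1.91, "computation": 4.94, "total": 7.07},
}

def summarize(row):
    """Sum of itemized steps, non-itemized remainder and implied rate for one row."""
    parts = row["polling"] + row["computation"]
    return {
        "sum_of_parts_us": round(parts, 2),
        "remainder_us": round(row["total"] - parts, 2),
        "implied_rate_khz": round(1e3 / row["total"], 1),  # assumes back-to-back events
    }

for key, row in TABLE_US.items():
    print(key, summarize(row))
```

The itemized steps account for all but 0.1 to 0.2 $\mu$s of each total; the table does not break this remainder down further.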

## 6 Conclusion

Our results show that a latency of a few $\mu$s is in fact achievable using a GPU for track finding purposes. The requirements for achieving this sort of result come at the cost of flexibility, especially in terms of load balancing. The work group dimensions need to be fixed prior to performing any calculations. This currently leads to suboptimal behavior in terms of achievable throughput. Overall, the performance in terms of latency is very promising, whereas the throughput is not yet competitive. Using the tested generation of cards, we would have to use an infeasibly large number of GPUs to process all the detector data. Currently, due to restrictions given by the available shared memory, as well as some limitations in the memory layout of the current implementation, we are unable to process more than one sector per GPU. However, we are confident that newer generation cards, as well as a more refined memory layout, will allow us to achieve comparable execution times for a larger number of sectors in the future. In addition, technological advances in GPU interconnects, like NVIDIA's NVLink technology [11], might further improve transfer latency, resulting in overall better turnaround times.

The newly developed algorithm, using floating point calculations, produces better results in terms of duplicate candidates, as well as, given the discussed momentum cut, better results in terms of fake track candidates. This should help reduce the overall workload on both the duplicate removal as well as the subsequent fitting step in the track trigger. In terms of computation time it shows a comparable behavior to the regular approach when implemented on a GPU.

## References

- [1] NVIDIA. CUDA programming guide, 2008.
- [2] G. Boudoul. A level-1 tracking trigger for the CMS upgrade using stacked silicon strip detectors and advanced pattern technologies. Journal of Instrumentation, 8(01):C01024, 2013.
- [3] D. Contardo, M. Klute, J. Mans, L. Silvestris, and J. Butler. Technical Proposal for the Phase-II Upgrade of the CMS Detector. Technical Report CERN-LHCC-2015-010, LHCC-P-008, CMS-TDR-15-02, Geneva, June 2015.
- [4] D. Sabes. L1 track triggering with associative memory for the CMS HL-LHC tracker. Journal of Instrumentation, 9(11):C11014, 2014.
- [5] E. Salvati. A level-1 track trigger for CMS with double stack detectors and long barrel approach. Journal of Instrumentation, 7(08):C08005, 2012.
- [6] L. Skinnari. L1 track triggering at CMS for high luminosity LHC. Journal of Instrumentation, 9(10):C10035, 2014.
- [7] C. Amstutz, F. A. Ball, M. N. Balzer, J. Brooke, L. Calligaris, D. Cieri, E. J. Clement, G. Hall, T. R. Harbaum, K. Harder, P. R. Hobson, G. M. Iles, T. James, K. Manolopoulos, T. Matsushita, A. D. Morton, D. Newbold, S. Paramesvaran, M. Pesaresi, I. D. Reid, A. W. Rose, O. Sander, T. Schuh, C. Shepherd-Themistocleous, A. Shtipliyski, S. P. Summers, A. Tapper, I. Tomalin, K. Uchida, P. Vichoudis, and M. Weber. An FPGA-based track finder for the L1 trigger of the CMS experiment at the high luminosity LHC. In 2016 IEEE-NPSS Real Time Conference (RT). Institute of Electrical and Electronics Engineers (IEEE), June 2016.
- [8] L. Rota, M. Caselle, S. Chilingaryan, A. Kopmann, and M. Weber. A PCIe DMA architecture for multi-gigabyte per second data transmission. IEEE Transactions on Nuclear Science, 62(3):972-976, June 2015.
- [9] L. Rota, M. Vogelgesang, L. E. Ardila Perez, M. Caselle, S. Chilingaryan, T. Dritschler, N. Zilio, A. Kopmann, M. Balzer, and M. Weber. A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology. Journal of Instrumentation, 11(02):P02007, 2016.
- [10] M. Caselle, L. E. Ardila Perez, M. Balzer, S. Chilingaryan, T. Dritschler, A. Kopmann, H. Mohr, L. Rota, M. Vogelgesang, and M. Weber. A high-speed DAQ framework for future high-level trigger and event building clusters. Journal of Instrumentation, 2016.
- [11] D. Foley. NVLink, Pascal and stacked memory: Feeding the appetite for big data. nvidia.com, 2014.