GPU real-time processing in NA62 trigger system 
Introduction
General-purpose computing on GPUs (GPGPU) is nowadays widespread in scientific areas requiring large processing power, such as computational astrophysics, lattice QCD calculations and image reconstruction for medical diagnostics. In High Energy Physics, experiments are using (e.g. ALICE [1]) or are considering using (e.g. ATLAS [2]) GPUs for high-level triggers.
Low-level triggers could also benefit from GPU computing power to build more refined physics-related trigger primitives and to design more selective conditions, reducing the load on the higher levels without sacrificing interesting physics events. Three main aspects must be carefully evaluated in order to establish the real-time capability of this kind of processor: computing power, total processing latency and its stability in time. This paper describes the GPU-based L0 trigger integrated in the experimental setup of the RICH detector of the NA62 experiment in order to reconstruct the ring-shaped hit patterns; we report and discuss results obtained with this system along with the algorithms that will be implemented.
The NA62 experiment, trigger and RICH detector
The NA62 experiment at CERN [3] has the goal of measuring the branching ratio of the ultra-rare decay of the charged kaon into a pion and a neutrino-antineutrino pair. The trigger is a key system for obtaining such a result. In its standard implementation, the FPGAs on the readout boards of each sub-detector participating in the L0 trigger compute simple trigger primitives. The maximum latency allowed for the synchronous L0 trigger was chosen to be rather large in NA62 (up to 1 ms). Such a time budget allows, in principle, the use of more complex but slower trigger implementations. This would have the benefit of increasing the trigger selectivity for the K⁺ → π⁺νν̄ process, and would allow the design of new trigger conditions to collect additional physics processes.
The Ring Imaging Cherenkov (RICH) detector is a key element for particle identification in the NA62 experiment and one of the sub-detectors responsible for the L0 trigger [4]. Its main purpose is to separate pions from muons in the momentum range from 15 GeV/c to 35 GeV/c and to measure the particle arrival time with a resolution better than 100 ps [5]. Thanks to its fast response, it is used in the L0 trigger by providing a multiplicity count. However, much more information is available from the RICH. Because hit multiplicity is only loosely related to physics quantities, evaluating this additional information at the online trigger stage would allow more effective trigger conditions to be built.
GPU computing at L0
In a standard approach to GPU computing, data from a detector reach a Network Interface Card (NIC), which periodically copies them to a dedicated area in the PC RAM. There, a sufficiently large batch of buffered events is usually prepared and then copied to the GPU memory through the PCI Express bus. The PC hosting the GPU card, i.e. the host, has the role of starting the GPU "kernel".
In contrast to the applications for which GPUs were originally designed, in triggering the computing latency is a critical issue. The total latency is essentially composed of two parts: the time needed to bring the data to the GPU processor and the actual processing time. To decrease the transport latency (and its fluctuations) for L0 applications, a direct data transfer protocol from a custom FPGA-based NIC to the GPU has been used. Dedicated pattern-recognition algorithms have been developed to exploit the GPU parallel structure. Both aspects are discussed below.
NaNet is a low-latency NIC with GPUDirect capability developed at the INFN Rome division. Its design derives from the APEnet+ card logic [6] and, in the current implementation, it provides 10 Gb/s communication (NaNet-10) [7]. The NaNet board exploits the GPUDirect peer-to-peer (P2P) and RDMA capabilities of NVIDIA Tesla GPUs, injecting a UDP input data stream from the detector front-end directly into their memory. Compared with the previous version (NaNet-1, with 1 Gb/s links), the NaNet-10 board copes with a higher data rate and can receive data from four TEL62 RICH readout boards [4], while providing on-the-fly data processing for tasks such as data decompression and coalescing of event fragments.
A fast ring-finding algorithm is crucial to allow the use of the RICH in the online trigger selection. The processing time must be small enough to cope with the high input data rate. The parallel structure of GPUs helps to increase the computing throughput, both by allowing several events to be processed concurrently and by exploiting parallel pattern-recognition algorithms. We focused in particular on two standalone multi-ring pattern-recognition algorithms based only on geometrical considerations (as no information from other detectors is available at this level). In the first algorithm, named Histogram, the XY plane is divided into a grid and histograms are filled with the distances between the grid points and the PMT hits. Rings are identified by looking for distance bins whose contents exceed a predefined threshold value. The second algorithm, Almagest [8], is based on Ptolemy's theorem. Both algorithms are used for pattern matching: once the number of rings and the points belonging to them have been found, Crawford's method [9] can be applied to obtain centre coordinates and radii with good spatial resolution.
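The two geometrical ideas above can be illustrated with a minimal sketch in plain Python, running on the CPU and meant only as an illustration, not as the GPU kernels used in the experiment. The function names, the grid, the bin width, the threshold and the toy event are all assumptions made for this example. The first function evaluates the Ptolemy relation for four points taken in cyclic order (the product of the diagonals equals the sum of the products of opposite sides when the points lie on one circle); how Almagest selects and combines hits is not described here and is not reproduced. The second function follows the Histogram description: for every grid point it histograms the distances to the hits and keeps the bins whose content exceeds the threshold.

```python
import math


def ptolemy_residual(a, b, c, d):
    """Return |AC*BD - (AB*CD + BC*DA)| for four points taken in cyclic order.

    By Ptolemy's theorem the value is zero when the points lie on one circle.
    """
    dist = math.dist
    diagonals = dist(a, c) * dist(b, d)
    opposite_sides = dist(a, b) * dist(c, d) + dist(b, c) * dist(d, a)
    return abs(diagonals - opposite_sides)


def histogram_ring_candidates(hits, grid_points, bin_width, threshold):
    """Histogram hit distances from each grid point and keep populated bins.

    Returns (centre_x, centre_y, radius, n_hits) tuples, where radius is the
    centre of a distance bin whose content exceeds `threshold`.
    """
    candidates = []
    for gx, gy in grid_points:
        counts = {}
        for hx, hy in hits:
            radial_bin = int(math.hypot(hx - gx, hy - gy) / bin_width)
            counts[radial_bin] = counts.get(radial_bin, 0) + 1
        for radial_bin, n_hits in counts.items():
            if n_hits > threshold:
                radius = (radial_bin + 0.5) * bin_width
                candidates.append((gx, gy, radius, n_hits))
    return candidates


# Hypothetical toy event: 12 hits on a ring of radius 11.5 centred at (10, -5),
# in arbitrary units, searched on a coarse 5-unit grid.
toy_hits = [(10 + 11.5 * math.cos(k * math.pi / 6),
             -5 + 11.5 * math.sin(k * math.pi / 6)) for k in range(12)]
toy_grid = [(x, y) for x in range(-20, 21, 5) for y in range(-20, 21, 5)]
toy_candidates = histogram_ring_candidates(toy_hits, toy_grid,
                                           bin_width=1.0, threshold=8)
```

In this toy case only the grid point at the true centre collects all hits in a single distance bin. On a GPU the loops over grid points and over events are natural candidates for parallel execution, which is the kind of parallelism referred to above.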
Results
The whole system, equipped with the NaNet-10 board, has been tested during the normal NA62 data taking, with data coming from the four RICH TEL62 readout boards.
The latency is measured using a hardware clock on the FPGA, synchronized with the NA62 general clock through a TTC [10] interface. The resolution of this clock is 25 ns. The clock counter is reset at each start of burst. (A burst is a bunch of particles produced by the protons extracted from the SPS; it lasts about 6 s every 14 s, and the duty cycle depends on the SPS extraction scheme.)
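The arithmetic behind such a measurement can be sketched as follows. This is only an illustration: the function names are hypothetical, and the only numbers taken from the text are the 25 ns resolution and the burst duration of about 6 s.

```python
TICK_NS = 25  # clock resolution quoted in the text


def ticks_to_us(ticks):
    """Convert a number of 25 ns clock ticks to microseconds."""
    return ticks * TICK_NS / 1000.0


def latency_us(arrival_ticks, done_ticks):
    """Latency between two counter readings taken within the same burst."""
    if done_ticks < arrival_ticks:
        raise ValueError("readings must come from the same burst, in time order")
    return ticks_to_us(done_ticks - arrival_ticks)


def ticks_per_burst(burst_seconds=6):
    """Number of ticks accumulated over one burst of the given duration."""
    return round(burst_seconds * 1e9 / TICK_NS)
```

A difference of 20000 ticks corresponds to 500 µs, and a 6 s burst accumulates about 2.4 × 10^8 ticks, a value that would fit in a 32-bit counter (the actual counter width is not stated here).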
Fig. 1 shows the total latency, including all the components, for an entire burst. The width of the latency distribution is mainly due to fluctuations in the number of events buffered before processing: in order to better exploit the parallel structure of the GPU, the data are processed in buffers collected over 350 µs, called CLOP (Circular List of Persistent buffers). The latency for processing a single datagram in the NaNet-10 board, from its reception in the 10GbE MAC to the completion of the DMA operation towards the GPU memory, is less than 2 µs. In 2016, at 1/3 of the nominal intensity, each CLOP contained on average about 150 events (Fig. 2). Nevertheless, the processing time per event, in the present version of the computing kernel, is quite stable below 1 µs (using a single GPU). The interaction with the present central trigger processor (L0TP) needs a further stage of synchronization. The GPU results, encoded in a format compatible with the present NA62 L0 trigger, are sent directly from the GPU memory with a low-latency RDMA, again through the NaNet-10 board.
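The figures quoted above can be combined into a rough consistency check against the 1 ms L0 budget. The sketch below is an assumption-laden estimate, not the measurement itself: it takes the quoted values as upper bounds, assumes that the processing time of a buffer is the number of events times the per-event time, and ignores the L0TP synchronization stage and the sending of the results. All names are hypothetical.

```python
GATHER_WINDOW_US = 350.0    # time over which one buffer is collected
PROC_PER_EVENT_US = 1.0     # upper bound on processing time per event (single GPU)
NANET_DATAGRAM_US = 2.0     # upper bound on NaNet-10 latency per datagram
L0_MAX_LATENCY_US = 1000.0  # maximum L0 trigger latency (1 ms)


def buffer_processing_us(n_events):
    """Assumed processing time of one buffer: events times per-event time."""
    return n_events * PROC_PER_EVENT_US


def latency_bound_us(n_events):
    """Rough bound for an event arriving at the start of the gathering window."""
    return GATHER_WINDOW_US + NANET_DATAGRAM_US + buffer_processing_us(n_events)


def keeps_up(n_events):
    """True if a buffer is processed within the time needed to gather the next one."""
    return buffer_processing_us(n_events) <= GATHER_WINDOW_US
```

With about 150 events per buffer, `latency_bound_us(150)` gives roughly 500 µs, below the 1 ms budget, and `keeps_up(150)` holds under these assumptions. The same functions can be evaluated for other buffer occupancies.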
Conclusions
The GPU system performed as expected in the NA62 2016 run. In the next physics run, the GPU system can be used to provide high-quality primitives to increase the purity of the main triggers and to provide a handle for building selective triggers that enlarge the NA62 trigger menu. Current results for the processing time per event point to the need for further development of the GPU algorithms and/or the use of multiple GPU devices to reach an order-of-magnitude speed-up and cope with the experiment's nominal intensity. Additional details on the results shown here can be found in [11].
