Abstract-In the context of the high-luminosity large hadron collider (HL-LHC) upgrade, this work presents the latest update on the design of the FPGA firmware responsible for particle track reconstruction in the pattern recognition mezzanine (PRM) of the hardware-based tracking for the trigger (HTT) system, a subsystem of the ATLAS experiment trigger and data acquisition system. This computationally demanding task relies heavily on two FPGA features: the digital signal processing (DSP) components embedded in silicon and the performance of the available high bandwidth memory (HBM).
I. INTRODUCTION
THE expected increase in peak luminosity of the high-luminosity large hadron collider to $7.5 \times 10^{34}$ cm$^{-2}$ s$^{-1}$ is driving the ATLAS experiment upgrade strategy for the trigger and data acquisition system [1] towards the implementation of precise hardware-based track reconstruction. The hardware-based tracking for the trigger (HTT) system uses a combination of custom ASICs for pattern recognition and FPGAs to provide the software-based trigger system with access to tracking information, allowing for reduced $p_T$ trigger thresholds for primary lepton selections, while contributing to pile-up mitigation, which is essential for hadronic signatures. The HTT implementation is based on ATCA carrier boards housing different mezzanine cards. The pattern recognition mezzanine (PRM) is a critical component of this system: it mounts the FPGA tasked with the computationally challenging linear track fit and reconstruction. In the PRM FPGA firmware architecture (Fig. 1) two modules take care of this task: the track fitter block (TFB) and the parameter calculator (PC). To achieve the best possible performance these two entities rely heavily on embedded digital signal processing (DSP) blocks and high bandwidth memory (HBM) [3], both of which can be found in the FPGA model under consideration in this document: the Altera Intel Stratix 10 MX 2100.
Throughout the document, the rates presented are to be interpreted as data rates and not as clock rates unless explicitly specified otherwise.
II. FUNCTIONAL DESCRIPTION
In this section the mathematical functions computed by the TFB and PC are described together with a careful breakdown of their arithmetic operations, as introduced in Ref. [4].
A. Track fitter block
The main function of the TFB is to perform a linear fit, which corresponds to calculating a $\chi^2$ value according to the following equation:
where $x_j$ are the full-resolution local coordinates of the cluster, while $S_{ij}$ and $h_i$ are constant values computed through detector simulation. The constants are unique to a specific physical sector of the ATLAS inner detector and are chosen depending on which sector the clusters belong to.
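The equation itself did not survive in this text. As a minimal sketch, assuming the usual linear-fit form $\chi^2 = \sum_i (\sum_j S_{ij} x_j + h_i)^2$ suggested by the symbol definitions above, the computation can be written as follows. The function name, the number of $\chi^2$ components, and all numerical values are hypothetical and chosen only for illustration.

```python
def chi2_linear_fit(x, S, h):
    """Return sum_i (sum_j S[i][j] * x[j] + h[i]) ** 2 (assumed form of Eq. 1)."""
    chi2 = 0.0
    for S_row, h_i in zip(S, h):
        # One multiply-accumulate chain per chi^2 component.
        residual = h_i + sum(s_ij * x_j for s_ij, x_j in zip(S_row, x))
        chi2 += residual * residual
    return chi2


# Toy sector constants (made-up numbers: 2 components, 3 coordinates).
S = [[1.0, 0.0, -1.0],
     [0.5, 2.0, 0.0]]
h = [0.0, -1.0]
x = [3.0, 0.25, 1.0]
print(chi2_linear_fit(x, S, h))
```

Each row of $S$ maps naturally onto a chain of multiply-accumulate operations, which is why the operation counts in this section are expressed in macc.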
| Name | Description |
|---|---|
| $N_\mathrm{coo}$ | Number of coordinates, $9 \le N_\mathrm{coo} \le 12$ |
1) Missing coordinates:
Another activity of the TFB is computing missing coordinates. Because of partial data provided from outside the control of the HTT system, some components of the input coordinate vector $x_j$ may be missing. In these cases an additional set of operations needs to be performed before the $\chi^2$ computation. The missing coordinates are found through the minimisation of the $\chi^2$ presented in Eq. 1, which corresponds to solving the following equation:
where $M$ is the number of missing coordinates $\hat{x}_i$ to compute, while $C$ and $t$ are given by:
The $\hat{x}_i$ components so computed can then be inserted into the vector $x_j$ in place of the missing coordinates. A breakdown of the type of operations necessary to perform the computation described in Eq. 2 and Eq. 3 is given in Tab. III and Tab. IV.
TABLE IV: Number of mathematical operations to compute $\hat{x}_i$ as presented in Eq. 2 and Eq. 3 (general case).
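The minimisation described in this subsection can be sketched as follows, assuming the $\chi^2$ has the form $\sum_k (\sum_j S_{kj} x_j + h_k)^2$. Setting its derivative with respect to each missing coordinate to zero gives an $M \times M$ linear system $C \hat{x} = t$ with $C_{ab} = \sum_k S_{ka} S_{kb}$ and $t_a = -\sum_k S_{ka} (h_k + \sum_{j\ \mathrm{known}} S_{kj} x_j)$. These expressions follow from that assumed form and are not copied from Eq. 2 and Eq. 3; the function names and the small Gaussian-elimination helper are illustrative only.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def fill_missing(x, missing, S, h):
    """Return a copy of x where the indices in `missing` minimise chi^2."""
    known = [j for j in range(len(x)) if j not in missing]
    # Residuals from the known coordinates and the offsets h only.
    r = [h_k + sum(row[j] * x[j] for j in known) for row, h_k in zip(S, h)]
    C = [[sum(row[a] * row[b] for row in S) for b in missing] for a in missing]
    t = [-sum(row[a] * r_k for row, r_k in zip(S, r)) for a in missing]
    x_hat = solve_linear(C, t)
    filled = list(x)
    for j, value in zip(missing, x_hat):
        filled[j] = value
    return filled


# Toy example (made-up constants): coordinate 1 is missing.
S = [[1.0, 0.0, -1.0],
     [0.5, 2.0, 0.0]]
h = [0.0, -1.0]
print(fill_missing([3.0, None, 1.0], [1], S, h))
```

Since $M$ is small (the worst case considered later is $M = 4$), the linear system stays tiny compared with the matrix-vector products that dominate the operation count.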
B. Parameter calculator
The PC is the firmware block that comes right after the TFB (see Fig. 1). It performs similar computational operations, specified by the following formula:
where $x_j$ are the full-resolution cluster local coordinates, while $B_{ij}$ and $q_i$ are constant values unique to the specific sector the coordinates $x_j$ belong to. A breakdown of the type of operations necessary to perform the computation described in Eq. 4 is given in Tab. V.
| Name | Description |
|---|---|
| $N_\mathrm{coo}$ | Number of coordinates, $9 \le N_\mathrm{coo} \le 12$ |
| $p$ | Track parameters result, $\dim(p) = 5$ |
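As a hedged illustration of the parameter calculation, the sketch below assumes the affine form $p_i = \sum_j B_{ij} x_j + q_i$ implied by the symbol definitions (the formula itself is not reproduced in this text). The helper `macc_count` and all names are hypothetical; the count it returns for five parameters and twelve coordinates matches the roughly 60 macc per parameter calculation used in the resource estimates.

```python
def track_parameters(x, B, q):
    """Return p with p[i] = sum_j B[i][j] * x[j] + q[i] (assumed form of Eq. 4)."""
    return [q_i + sum(b_ij * x_j for b_ij, x_j in zip(B_row, x))
            for B_row, q_i in zip(B, q)]


def macc_count(n_parameters, n_coordinates):
    """Multiply-accumulate operations for one full set of parameters."""
    return n_parameters * n_coordinates


# Toy constants (made-up numbers: 2 parameters, 3 coordinates).
B = [[1.0, 0.0, 2.0],
     [0.0, -1.0, 0.5]]
q = [0.5, 1.0]
print(track_parameters([1.0, 2.0, 4.0], B, q))
print(macc_count(5, 12))
```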
C. Data-flow
In the PRM firmware block diagram (Fig. 1) the whole $x_j$ vector of cluster coordinates is encoded inside the signal labelled "Cluster", which is an output of the data organiser and an input of the TFB. The various TFB and PC constants are represented by the signal labelled "Constants" and are fetched from the HBM. A $\chi^2$ is computed in the TFB and compared to a threshold. Only coordinates whose $\chi^2$ satisfies the threshold are then passed to the PC. The acceptance associated with this requirement is known from simulation to be about 20% [1].
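The selection step of the data-flow can be sketched as below. It assumes that "satisfying the threshold" means the $\chi^2$ does not exceed it, which is the usual convention for a fit-quality cut but is not spelled out here. The function name, the dictionary layout, and the $\chi^2$ values are made up for the example.

```python
def select_for_pc(candidates, chi2_threshold):
    """Return the candidates passing the chi^2 cut and the resulting acceptance."""
    passed = [c for c in candidates if c["chi2"] <= chi2_threshold]
    acceptance = len(passed) / len(candidates) if candidates else 0.0
    return passed, acceptance


# Made-up chi^2 values for five candidates of one sector.
chi2_values = [0.8, 12.5, 40.0, 3.1, 95.2]
candidates = [{"id": i, "chi2": v} for i, v in enumerate(chi2_values)]
passed, acceptance = select_for_pc(candidates, chi2_threshold=5.0)
print([c["id"] for c in passed], acceptance)
```

The acceptance is what scales the PC workload relative to the TFB workload in the resource estimates.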
III. DIGITAL SIGNAL PROCESSING PERFORMANCE TEST
To support the early design of the PRM, the performance of the TFB and PC has been assessed by implementing a matrix multiplication matching the characteristics described in Sec. II, in particular targeting the worst-case scenario of $N_\mathrm{coo} = 12$ and $M = 4$. A sketch of the HDL implementation used is shown in Fig. 2, where the DSP is operated in multiply-accumulate mode [5], taking the input values from local memory. This is consistent with the implementation plan of retrieving the constants from the HBM and keeping them in a local buffer for as long as necessary.
[Timing diagram: input sequences x and y (indices 1 to 12, repeating) and the result register.]
The acc control signal of the DSP is set to "low" for the first input and then set to "high" for the following eleven. After the DSP internal latency of five clock cycles the result register starts holding the partial results: $x_1 y_1$, $x_1 y_1 + x_2 y_2$, etc., until at the twelfth iteration it equals $x_1 y_1 + x_2 y_2 + \dots + x_{12} y_{12}$. This last value is then also made available as output, while the second set of inputs has already entered the DSP pipeline without any clock delay, synchronously with the acc control signal being set to "low" to restart the accumulation.
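The behaviour described above can be mimicked with a toy cycle-level model. This is only a sketch under stated assumptions: input pair k is taken to enter the DSP at clock cycle k (counting from zero) and to reach the result register five cycles later; the function name and the cycle-numbering convention are not taken from the firmware.

```python
def macc_pipeline(xs, ys, n_terms=12, latency=5):
    """Toy cycle-level model of one DSP in multiply-accumulate mode.

    Returns a list of (cycle, result_register, is_final) tuples, where
    is_final marks the cycle in which a complete dot product is available.
    """
    trace = []
    acc = 0
    for k, (x, y) in enumerate(zip(xs, ys)):
        restart = k % n_terms == 0          # acc control "low" on the first term
        acc = x * y if restart else acc + x * y
        is_final = k % n_terms == n_terms - 1
        trace.append((k + latency, acc, is_final))
    return trace


# Two back-to-back dot products of length 12, fed without any gap.
xs = list(range(1, 13)) * 2
ys = list(range(1, 13)) + [2] * 12
trace = macc_pipeline(xs, ys)
outputs = [(cycle, value) for cycle, value, is_final in trace if is_final]
print(outputs)
```

In this model the latency only delays the first result: afterwards one complete dot product becomes available every twelve cycles, which is the pipelining argument used to justify enabling all internal DSP registers.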
There is a trade-off between latency and performance; in this case, since the latency is only incurred at start-up thanks to the pipelined design, the DSP is operated with all internal registers enabled to maximise performance. With these settings the firmware simulation was able to run at a clock frequency of 500 MHz on the targeted Stratix 10 FPGA device with -2 speed grade. Given the very simplistic implementation of the test, a safe assumption for the estimates in Sec. IV is to consider a multiply-accumulate (macc) performance of less than half that rate, i.e. to assume $200\,\frac{\mathrm{Mmacc}}{\mathrm{s}\cdot\mathrm{DSP}}$.
IV. RESOURCE ESTIMATES
A driving factor is the performance target for the PRM of being able to compute four billion $\chi^2$ per second (this metric is referred to in formulas as $4\,\frac{\mathrm{Gfit}}{\mathrm{s}}$).
A. Digital signal processing
The estimate of the number of DSPs necessary to carry out the mathematical operations described in Sec. II is based on three factors: (1) how many fits ($\chi^2$ from Eq. 1) and track parameters ($p_i$ from Eq. 4) the PRM is required to compute, (2) which and how many mathematical operations are necessary to compute these quantities, (3) how many of these operations a single DSP can carry out in a given time.

| | Number | Size |
|---|---|---|
| Macc operations to compute one set of $p_i$ | 60 | |
| Constants used to compute one set of $p_i$ | 60 | 1.9 kb |
| Constants stored to compute one set of $p_i$ | 60 | 1.9 kb |
| Total constants stored in HBM | 1.9 M | 61 Mb |
1) Track fitter block:
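The following formula estimates how many DSPs would need to be instantiated for the TFB:
$$4\,\frac{\mathrm{Gfit}}{\mathrm{s}} \times 140\,\frac{\mathrm{macc}}{\mathrm{fit}} \times \frac{1}{200\,\frac{\mathrm{Mmacc}}{\mathrm{s}\cdot\mathrm{DSP}}} = 2800\ \mathrm{DSP}$$
where the value of $140\,\frac{\mathrm{macc}}{\mathrm{fit}}$ is taken from Tab. VI.
2) Parameter calculator:
After the $\chi^2$ computation in the TFB, a decision is made to discard all tracks with a $\chi^2$ value above a certain threshold. The acceptance corresponding to this threshold is known from simulation to be about 20% [1]. As a consequence, considering from Tab. VII that the number of multiply-accumulate operations for one parameter calculation (pc) is about $60\,\frac{\mathrm{macc}}{\mathrm{pc}}$, we get the following:
$$0.2\,\frac{\mathrm{pc}}{\mathrm{fit}} \times 4\,\frac{\mathrm{Gfit}}{\mathrm{s}} \times 60\,\frac{\mathrm{macc}}{\mathrm{pc}} \times \frac{1}{200\,\frac{\mathrm{Mmacc}}{\mathrm{s}\cdot\mathrm{DSP}}} = 240\ \mathrm{DSP}$$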
B. High bandwidth memory
The HBM stores the various constants necessary to compute the $\chi^2$ and the track parameters (Fig. 1). These values are retrieved from the HBM and kept in a local buffer close to the computing core. The necessary bandwidth is therefore estimated by multiplying the rate at which the HBM is accessed by the size of the payload requested.
1) Track fitter block:
Since on average the same set of track fitter constants can be used ten times in a row [1], we get the following:
$$4\,\frac{\mathrm{Gfit}}{\mathrm{s}} \times \frac{560\,\mathrm{B}}{10\,\mathrm{fit}} = 224\,\frac{\mathrm{GB}}{\mathrm{s}}$$
where $4\,\frac{\mathrm{Gfit}}{\mathrm{s}}$ is the performance requirement of the PRM, and 560 B is the payload corresponding to the retrieval of one set of 140 constants in single-precision floating-point format, as expressed in Tab. VI.
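2) Parameter calculator:
In a similar way, for the parameter calculator we get:
$$0.2\,\frac{\mathrm{pc}}{\mathrm{fit}} \times 4\,\frac{\mathrm{Gfit}}{\mathrm{s}} \times 240\,\frac{\mathrm{B}}{\mathrm{pc}} = 192\,\frac{\mathrm{GB}}{\mathrm{s}}$$
where $0.2\,\frac{\mathrm{pc}}{\mathrm{fit}}$ is the ratio between the number of track parameter calculations and track fittings performed in the PRM, $4\,\frac{\mathrm{Gfit}}{\mathrm{s}}$ is the required performance of the PRM, and $240\,\frac{\mathrm{B}}{\mathrm{pc}}$ is the payload corresponding to retrieving the necessary 60 constants as expressed in Tab. VII.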
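The DSP and bandwidth estimates of this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the rates, operation counts, and payload sizes quoted in the text. The two device totals (number of DSP blocks and nominal HBM bandwidth) are not stated in this document and are assumptions introduced here to turn the totals into occupancy fractions.

```python
FIT_RATE = 4e9               # fits per second, PRM target
ACCEPTANCE = 0.2             # parameter calculations per fit
MACC_PER_FIT = 140           # Tab. VI
MACC_PER_PC = 60             # Tab. VII
MACC_RATE_PER_DSP = 200e6    # derated rate assumed after the DSP test
CONSTANTS_PER_FIT = 140      # 560 B payload in single precision
CONSTANTS_PER_PC = 60        # 240 B payload in single precision
BYTES_PER_CONSTANT = 4       # single-precision floating point
FITS_PER_CONSTANT_SET = 10   # average reuse of one TFB constant set

# ASSUMPTIONS (not given in the text): totals for the targeted device.
DSP_AVAILABLE = 3960
HBM_NOMINAL_BANDWIDTH = 512e9  # bytes per second


def dsp_needed(operation_rate, macc_per_operation):
    """DSP blocks needed to sustain the given operation rate."""
    return operation_rate * macc_per_operation / MACC_RATE_PER_DSP


def bandwidth_needed(fetch_rate, n_constants):
    """Bytes per second needed to fetch constant sets at the given rate."""
    return fetch_rate * n_constants * BYTES_PER_CONSTANT


dsp_tfb = dsp_needed(FIT_RATE, MACC_PER_FIT)
dsp_pc = dsp_needed(ACCEPTANCE * FIT_RATE, MACC_PER_PC)
bw_tfb = bandwidth_needed(FIT_RATE / FITS_PER_CONSTANT_SET, CONSTANTS_PER_FIT)
bw_pc = bandwidth_needed(ACCEPTANCE * FIT_RATE, CONSTANTS_PER_PC)

dsp_occupancy = (dsp_tfb + dsp_pc) / DSP_AVAILABLE
hbm_utilisation = (bw_tfb + bw_pc) / HBM_NOMINAL_BANDWIDTH
print(f"DSP: {dsp_tfb + dsp_pc:.0f} blocks, occupancy {dsp_occupancy:.1%}")
print(f"HBM: {(bw_tfb + bw_pc) / 1e9:.0f} GB/s, utilisation {hbm_utilisation:.1%}")
```

With these inputs the sketch gives 2800 + 240 = 3040 DSPs and 224 + 192 = 416 GB/s. Against the assumed device totals this is roughly 77% of the DSP blocks and 81% of the nominal bandwidth, in line with the occupancies quoted in the conclusion. No bandwidth efficiency factor is applied.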
V. PARALLEL DESIGN
Fig. 1 shows the plan for having multiple track fitter cores (TFCs) operating in parallel inside the TFB, and Fig. 5 shows a parallel paradigm which can be used for the implementation of the TFC. The feasibility of this design is contingent on the capability of the HBM to provide the inputs with enough speed and parallelisation. Since there is a trade-off between resource usage and performance, another factor to take into account is the DSP allocation. To guide the development of the TFC, Fig. 6 shows the inverse relationship, within a single TFC, between latency and the DSP resource allocation necessary to achieve the PRM target throughput for different clock frequencies.
VI. CONCLUSION
The DSP and HBM resource usages derived in Sec. IV show, for the DSP, an estimated total occupancy of about 76% with respect to the DSP resources available in the targeted FPGA model, while for the HBM there is negligible memory occupancy and a bandwidth utilisation of about 80% of the nominal value declared by the FPGA vendor.
For the future we plan to continue the development and to provide performance simulation for a first TFC prototype. Regarding the HBM, since the bandwidth plays such a critical role in the feasibility of a parallel computational implementation, further studies will be performed to better understand the HBM access timings from the TFB and PC. As a first step, a test on the HBM similar to the one presented for the DSP would provide the necessary information to derive the bandwidth efficiency, a factor which still needs to be included into the HBM utilisation estimates.
