Abstract-We report on the performance of a prototype of a specialized processor capable of reconstructing charged-particle tracks in a realistic Large Hadron Collider (LHC) detector, at full readout speed and with sub-microsecond latency. The processor is based on an innovative pattern-recognition algorithm, called the "artificial retina", inspired by the vision system of mammals. A prototype system has been designed, simulated, and implemented on readout boards equipped with Altera Stratix III FPGA devices. This is an important step towards the realization of a real-time track reconstruction device capable of processing complex events of high-luminosity LHC experiments at the 40 MHz crossing rate.
I. INTRODUCTION
Experiments at hadron colliders are designed to measure with high precision many physical properties of collision products, but, generally, detailed quantities are available only after offline processing and cannot be used to select events during data acquisition. Information from charged-particle trajectories (tracks) is one of the best tools to discriminate events, but only in the '90s did the Collider Detector at Fermilab (CDF) demonstrate the possibility of reconstructing two-dimensional tracks in real time from clusters of aligned hits in a large drift chamber [1]. The algorithm used by CDF was based on matching hits in the detector with precalculated patterns, and an updated version of the same approach is at the root of other current projects for Large Hadron Collider (LHC) experiments [2], [3]. A new pattern-matching methodology has recently been proposed under the name "artificial retina" algorithm [4], inspired by the quick detection of edges in the mammalian visual cortex. Its aim is to further increase the parallelism of the pattern-matching process and decrease the number of stored patterns, in order to reduce latencies and hardware size. The purpose of this study is to explore the possibility of using this algorithm for detailed track reconstruction at 40 MHz, compatible with the first step of online event selection at the LHC, at a reasonable cost.
In the present work, we report on our effort to develop a hardware prototype based on already-existing FPGA readout boards, to explore the potential of this new approach in a realistic experimental environment, albeit at lower rates. The main purpose is to show the possibility of reconstructing quality tracks using the same hardware resources that are normally required just for reading out the raw hits. Another effort along this line worth mentioning is the study of an implementation of the algorithm within a silicon telescope [5].
II. THE "ARTIFICIAL RETINA" ALGORITHM
The "artificial retina" algorithm, elsewhere referred to as the retina algorithm, was inspired by the first stage of mammal vision. Recent experimental studies show that specific neurons, called receptive fields, receive signals only from specific regions of the retina, in order to reduce the connectivity and save bandwidth. The neurons are tuned to recognize a specific shape, and the response is proportional to how closely the stimulus shape matches the shape to which the neuron is tuned. Generated in parallel, the responses of neurons are then interpolated to create a preview of image edges in about 30 ms, corresponding to about 30 clock cycles. Those concepts of early vision can be used to realize a viable highly-parallel implementation of an "analog" pattern-matching system, where each pattern is assigned a continuous level of "matching", rather than a simple binary response. The mathematical aspects of the algorithm have some similarities with the "Hough transform" [6], [7], a method already applied for finding lines in image processing; however, the main challenge here is the design of the physical layout and the development of an implementation capable of sustaining the event rate at high-luminosity LHC experiments.
To configure the algorithm, we divide the space of track parameters into cells, which mimic the neurons connected to the receptive fields of the retina. The center of each cell corresponds to a specific track in the detector that intersects the layers in spatial points called receptors. A first mapping connects each cell with its receptors, as shown in Fig. 1. For a group of contiguous cells, where variations of track parameters are small, the corresponding receptors in the detector layers belong to a limited area. A second mapping, also shown in Fig. 1, connects clusters of cells to areas of the detector, and this information is recorded in a look-up table (LUT). We use a commercial PC to generate the two mappings starting from simulated tracks.
978-1-4673-9680-6/16/$31.00 ©2016 IEEE
The track reconstruction, to be implemented on high-speed FPGA devices, has three steps. During step 1, also referred to as the switching step, we distribute detector hits only to a reduced number of cells, according to the LUT's created in the configuration phase, as shown in Fig. 2a. In step 2, sketched in Fig. 2b, for every incoming hit we accumulate in each cell a Gaussian weight w that decreases with the distance to the receptor:
$$w = \exp\left(-\frac{d_l^2}{2\sigma^2}\right)$$
where $d_l$ is the distance, on layer l, between the hit and the corresponding receptor, and σ is a parameter of the algorithm that can be adjusted to optimize the sharpness of the response of the receptors. After we process all the hits from the same event, in step 3 we identify tracks by looking for local maxima of the accumulated weight over the grid of cells. For a track resolution similar to offline reconstruction, the grid does not require a high granularity, because significantly better precision can be obtained by computing the centroid of the accumulated weights for the cells surrounding each maximum, as shown in Fig. 2c. Compared to other algorithms, the retina method takes advantage of two levels of parallelization. First, each cell processes in parallel hits from a limited region of the detector, reducing the input bandwidth required by a single cell. Moreover, if we associate time information with each hit and accumulate weights separately for each event, different events can be processed simultaneously. The latter allows data to be fed continuously to each cell, which matters because cells do not receive the same number of hits in every event. Another important feature is the total system bandwidth: the bandwidth increases significantly during the hit distribution, because multiple copies of the same hit can be produced, but shrinks when only the information about local maxima is kept in the last step. The larger bandwidth can be physically managed only because it is limited to one stage of the system.
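As a rough illustration of steps 2 and 3, the following Python sketch accumulates weights in a grid of cells, looks for local maxima, and computes a centroid around a maximum. It is not the firmware described in this paper. The switching step is omitted (every hit is sent to every cell), and the Gaussian form exp(-d²/2σ²), the 8-neighbour maximum test, the `threshold` parameter, and all function names are assumptions made for the example.

```python
import math


def gaussian_weight(d, sigma):
    """Weight of one hit at distance d from a receptor.

    Assumption: a Gaussian of width sigma, exp(-d^2 / (2 sigma^2)).
    """
    return math.exp(-d * d / (2.0 * sigma * sigma))


def accumulate(hits, receptors, sigma):
    """Step 2: sum the weights of all hits of one event in every cell.

    hits      -- list of (layer, x) pairs
    receptors -- dict mapping a cell index (i, j) to its list of
                 receptor coordinates, one per layer
    """
    response = {cell: 0.0 for cell in receptors}
    for layer, x in hits:
        for cell, rec in receptors.items():
            response[cell] += gaussian_weight(x - rec[layer], sigma)
    return response


def local_maxima(response, threshold=0.0):
    """Step 3: cells whose response exceeds all eight neighbours.

    Only cells with a complete set of neighbours are tested.
    `threshold` is a hypothetical noise cut, not taken from the paper.
    """
    found = []
    for (i, j), r in response.items():
        neighbours = [response.get((i + di, j + dj))
                      for di in (-1, 0, 1) for dj in (-1, 0, 1)
                      if (di, dj) != (0, 0)]
        if None in neighbours:
            continue  # cell on the border of the grid
        if r > threshold and all(r > n for n in neighbours):
            found.append((i, j))
    return found


def centroid(response, cell):
    """Weighted mean of the cell indices in the 3x3 block around `cell`."""
    i0, j0 = cell
    block = [(i0 + di, j0 + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]
    total = sum(response[c] for c in block)
    ci = sum(c[0] * response[c] for c in block) / total
    cj = sum(c[1] * response[c] for c in block) / total
    return ci, cj
```

In this sketch the centroid is expressed in units of cell indices, so it can fall between grid points, which is how a coarse grid can still yield a finer parameter estimate.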
III. FIRST PROTOTYPE DESIGN

A. Simulation Studies
In order to perform a detailed study of possible applications to a real HEP detector, we simulate the algorithm using a C++ emulator, the same reported in Ref. [8]. As a geometry model we use a telescope with 6 single-coordinate layers (50 cm long) and no magnetic field [9], as shown in Fig. 3. We parametrize any track going through the telescope with two variables, for which we choose the horizontal coordinates on the first and last layer, referred to below as u and v. The simulation shows that real tracks are likely to run along the z axis, and are therefore located mainly in the region around the diagonal of the (u, v) plane. The simulation also shows that resolution and efficiency similar to offline tracking systems can be achieved using only 3,000 cells to cover this specific region. Therefore, we cover this whole region using matrices of cells, and each matrix is sized to fit on a single chip. The matrices in different chips overlap by one cell, so that local maxima can be found in all the cells. This smaller prototype is also designed to be implemented on a currently-available electronic board that meets our requirements. We also use the simulation to compute the algorithm configuration that fits the chosen board.
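The (u, v) parametrization can be made concrete with a short Python sketch that computes the receptors of a cell and selects cells close to the diagonal. It assumes straight tracks, equally spaced layers, and a band of fixed half-width around the diagonal. The layer spacing, the band shape, and the function names are illustrative assumptions and are not taken from the emulator of Ref. [8].

```python
def receptors(u, v, n_layers=6):
    """Coordinates where a straight track crosses each layer.

    u is the coordinate on the first layer and v on the last one.
    Assumption: layers are equally spaced, so the crossing points
    follow from linear interpolation between u and v.
    """
    return [u + (v - u) * layer / (n_layers - 1) for layer in range(n_layers)]


def diagonal_cells(n, half_width):
    """Indices (i, j) of an n x n grid within `half_width` of the diagonal.

    Assumption: the populated region is modelled as a simple band;
    the real region comes from simulated tracks.
    """
    return [(i, j) for i in range(n) for j in range(n)
            if abs(i - j) <= half_width]
```

For example, `receptors(0.0, 5.0)` returns one crossing point per layer, from 0.0 on the first layer to 5.0 on the last, and `diagonal_cells(5, 1)` keeps 13 of the 25 cells of a 5 x 5 grid.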
B. Board Details
We pick the Tel62 board developed in Pisa for the NA62 experiment [10], which satisfies our target requirements of multiple high-bandwidth FPGA chips on the same board and fast interconnection between the chips. This board includes 4 Stratix III chips for data processing, each with approximately 200k logic elements. Each chip is connected through a high-speed link (10 Gbit/s) to a master chip (another Stratix III) that collects the processed data and controls the other FPGA's. The main clock is 40 MHz, while the processing internal to the FPGA's is done at 160 MHz. Each processing chip is connected to a 2 Gbyte DDR2 RAM through a high-speed bus (40 Gbit/s), to a mezzanine connector that can host various interface cards for I/O (5 Gbit/s), and to the neighbouring FPGA's (2.5 Gbit/s). The master FPGA is also connected to a mezzanine card with Ethernet ports (2.5 Gbit/s). A slow-control interface, implemented on an embedded PC that sits on the board, can be used to monitor all the devices and to perform various standalone tests.
C. Algorithm Implementation
The Tel62 board is designed for a standard DAQ system, where bandwidth typically decreases along the data flow (only selected data move to the next step) and data streams have few connections among them. For implementing our system, however, we need to increase the amount of data and communicate laterally within the retina switch. Therefore, we connect two boards together, reversing the data flow in the first one, where we implement the switch that redirects the hits (Switch board), and use a newly designed interface card to connect the second board, where we implement the processing engines (Engine board), as shown in Fig. 4. This configuration is fully consistent with the implementation of the system proposed in Ref. [8], so our tests are meaningful for applications in the field.
In this configuration, we lack sufficient external bandwidth to feed the system with the large amount of data that it is capable of processing, so for our tests we store a sample of events in the RAM external to the chips. On the output side, instead, the bandwidth is already reduced enough to allow for transferring the track information to an external PC using the Ethernet ports. Data delivered to a switch board can reach only one of the four chips in the corresponding engine board.
In the 6-layer detector used as a model, each layer is divided into 7 modules along the horizontal axis, while strips run along the vertical axis. As detailed below, 4 pairs of Tel62 boards are enough to implement 3,000 cells. Because hits cannot be transferred from one pair of boards to another, data coming from these 42 modules are regrouped into 4 appropriate sets, one for each pair of boards. Most of the tracks in our events are only slightly tilted with respect to the z axis, so we group the modules along the z axis, with some lateral overlap. Assuming that the innermost module is #0, data from modules #0 and #1 for all the six layers are routed to set 0, modules #1 and #2 to set 1, modules #2, #3, and #4 to set 2, and modules #4, #5, and #6 to set 3. Each module is read through a separate fiber, so this routing configuration is physically possible by redirecting the single fibers and splitting them where multiple copies of data are needed. Even if we do not plan to connect the current system to the DAQ of a real detector, we take this issue into account, in order to make our prototype as realistic as possible.
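The routing of modules to sets described above can be written down as a small table. The Python sketch below encodes it and derives which fibers have to be split and how many inputs the four sets receive in total. The names `SETS`, `sets_for_module`, `split_modules`, and `total_set_inputs` are chosen for the example only.

```python
N_LAYERS = 6
N_MODULES = 7  # modules per layer, numbered from the innermost (#0)

# Modules routed to each set (one set per pair of boards), as listed in the text.
SETS = {
    0: (0, 1),
    1: (1, 2),
    2: (2, 3, 4),
    3: (4, 5, 6),
}


def sets_for_module(module):
    """Sets that must receive the data of a given module."""
    return [s for s, modules in SETS.items() if module in modules]


def split_modules():
    """Modules whose fiber must be split because two sets need the data."""
    return [m for m in range(N_MODULES) if len(sets_for_module(m)) > 1]


def total_set_inputs():
    """Number of fiber inputs summed over all sets (all layers included)."""
    return N_LAYERS * sum(len(modules) for modules in SETS.values())
```

With this table, the fibers of modules #1, #2, and #4 are the ones that need splitting, and the 42 module fibers become 60 set inputs.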
We implement the processing engine in a fully pipelined mode, so each engine can receive one hit per clock cycle [11]. All the engines in a chip receive the same hit sequence and, after a special end-event (EE) hit is received, engine values are copied to the following stage, where the search for local maxima starts while the next hit sequence enters. As explained in more detail below, we fit approximately 200 engines on each Stratix III device (EP3SL200F1152); therefore, 16 chips are enough to implement the whole system.
D. Hardware Development Details
We develop the firmware for the various FPGA chips on the board using HDL Designer by Mentor Graphics and the Altera-specific Quartus II software. Most of the firmware is written in VHDL using generic components configurable through parameters, so they can easily be reused for systems on larger devices. The same firmware is loaded into all four chips on the boards, so values for engine receptors and switch LUT's have to be loaded through the slow control and stored in internal RAM's.
The firmware for the switch chips is based on a switch block with 4 inputs and 4 outputs, made of 4 basic switch blocks with 2 inputs and 2 outputs. A basic switch block is composed of two splitters (one input, two outputs) and two mergers (two inputs, one output) connected together. The splitter copies the input data to zero, one, or both outputs, according to a 2-bit LUT value based on the hit coordinate and layer. The merger copies both input data to a single output, holding the data alternately on each input channel if both have data at the same time. While the splitter logic is fairly linear, the merger logic involves multiple operating sequences that need careful design. Additional logic is also required to implement the correct propagation of EE hits: generated by the readout, they arrive separately on each input, and, after the switch step, hit sequences on each output have to contain all the appropriate hits from the event, tailed by a single EE hit. Each of the 4 switch block outputs corresponds to a different chip on the engine board. Because each engine chip is connected only to one switch chip, we properly redirect all the switch outputs between chips, using the lateral connections or going through the master chip. Then, we merge the streams from the redirected switch outputs and move the data to the engine board through the interface card, as shown in Fig. 4.
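A behavioural model helps to see what the splitter and the merger have to do. The Python sketch below is a software illustration only and does not reflect the VHDL design. It assumes that bit 0 of the LUT value enables output 0 and bit 1 enables output 1, that each input stream of an event ends with one EE marker, and that the merger emits its single EE only after both inputs have delivered theirs. The names `EE`, `splitter`, and `merge_event` are hypothetical.

```python
from collections import deque

EE = "EE"  # end-event marker closing the hit sequence of one event


def splitter(hit, lut_value):
    """Copy `hit` to zero, one, or both outputs according to a 2-bit value.

    Assumption: bit 0 enables output 0 and bit 1 enables output 1.
    Returns (out0, out1), with None where nothing is sent.
    """
    out0 = hit if lut_value & 0b01 else None
    out1 = hit if lut_value & 0b10 else None
    return out0, out1


def merge_event(input_a, input_b):
    """Merge the hit streams of ONE event into a single output stream.

    Each input must end with EE. When both inputs have a hit ready, the
    merger alternates between them (the other input is held). The output
    is tailed by a single EE once both inputs have reached their EE.
    """
    a, b = deque(input_a), deque(input_b)
    out = []
    take_a = True  # which input is served first when both are ready
    while a[0] != EE or b[0] != EE:
        a_ready = a[0] != EE
        b_ready = b[0] != EE
        if a_ready and (take_a or not b_ready):
            out.append(a.popleft())
            take_a = False
        else:
            out.append(b.popleft())
            take_a = True
    out.append(EE)
    return out
```

For instance, merging `[1, 2, 3, EE]` with `[10, EE]` yields `[1, 10, 2, 3, EE]`: the two inputs alternate while both have data, and only one EE closes the output.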
The firmware for the processing chips of the engine boards is based on a matrix of engines. For each engine completely surrounded by other ones, we add the logic to check if the accumulated value is greater than that of the first neighbors.
The maximum search requires a large number of interconnections between engines, limiting the number of engines that can fit in our Stratix III FPGA to a 16x15 matrix. We use a priority encoder to transfer the resulting maxima to a FIFO, from where we move them to the master chip. Thanks to the reduced bandwidth achieved at this stage, the master chip can collect the maxima data from all the processing chips and send them out through the Ethernet connectors.
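The readout of maxima from the engine matrix can be illustrated with a short Python model. It counts the engines that are completely surrounded by others and drains a set of maximum flags through a priority encoder into a FIFO. The choice that the lowest index has the highest priority, the use of eight neighbours, and the function names are assumptions made for this sketch.

```python
def interior_engines(rows, cols):
    """Engines completely surrounded by other engines of the matrix.

    Assumption: "surrounded" means all eight neighbours are present.
    """
    return [(r, c) for r in range(1, rows - 1) for c in range(1, cols - 1)]


def priority_encode(flags):
    """Index of the first asserted flag, or None if no flag is set.

    Assumption: the lowest index has the highest priority.
    """
    for index, flag in enumerate(flags):
        if flag:
            return index
    return None


def drain_maxima(flags):
    """Move the flagged maxima into a FIFO, one per iteration."""
    pending = list(flags)
    fifo = []
    index = priority_encode(pending)
    while index is not None:
        fifo.append(index)
        pending[index] = False  # clear the flag that was just served
        index = priority_encode(pending)
    return fifo
```

For a 16x15 matrix, `len(interior_engines(16, 15))` gives 182 of the 240 engines, and the number of iterations needed by `drain_maxima` equals the number of maxima rather than the number of engines.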
IV. FIRST RESULTS
We write the basic firmware for all the chips involved in our system and verify the correct behaviour of the logic using ModelSim by Mentor Graphics. We have not yet integrated the communication module for the external RAM from NA62, but hits can be loaded directly into internal FIFO's of the chips. We load sequences of hits for simple events in all the chips of one switch board, and the hits are correctly dispatched to the proper output channels, together with the EE flags. All the logic for the switching, internal and between chips, runs properly at the maximum clock rate of the board (160 MHz). We successfully test the transmission between the switch and engine boards at 80 MHz, using two channels for each interface card pair. We also load sequences of hits in the chips of one engine board: the hits are processed and the correct maxima data are received by the readout PC through the Ethernet connection. Here, too, the internal logic for the engine runs properly at the 160 MHz clock rate. Assuming realistic events with an average of 70 hits per layer, the engine board can sustain a maximum event rate of 1.8 MHz with a latency smaller than 1 µs.
V. CONCLUSIONS
A first, sizable hardware prototype of a retina tracking system with 3,000 patterns is under advanced development, based on already existing FPGA readout boards. First results show that a track-processing system based on our algorithm is feasible, and essential steps are successfully implemented on the real board at the nominal clock speed. We successfully process hit sequences through the entire chain: the switching network, the board interface transmission, the engine matrix, and the transmission of local maxima through the Ethernet ports. We find that our system is capable of reconstructing tracks at a 1.8 MHz event rate, using boards that had originally been designed for 1 MHz readout-only functionality. Moreover, the additional hardware required to implement the tracking functionality is comparable to what is needed for the readout-only function. The performance mentioned above is expected to scale easily to higher speeds, as much larger and faster FPGA devices than the ones used in our prototype are already available on the commercial market. Therefore, these results represent an important step towards demonstrating the capability of performing complex tracking at the LHC at the full event rate of 40 MHz, which is the final goal of the present R&D activity.
