Abstract-A multi-core FPGA-based 2D-clustering algorithm for real-time image processing is presented. The algorithm uses a moving window technique, adjustable to the cluster size, in order to minimize the FPGA resources required for cluster identification. The window size is generic and application dependent (it is set by the size and shape of the clusters in the input images). A key element of this algorithm is the possibility to instantiate multiple clustering cores that work in parallel on different windows, increasing performance at the cost of more resources on the FPGA device. In addition to this parallelism, the algorithm is executed in a pipeline, which allows the cluster readout to be performed in parallel with the cluster identification and the data pre-processing. The algorithm is developed for the Fast Tracker processor for the trigger upgrade of the ATLAS experiment but is easily adjustable to other image processing applications that require real-time pixel clustering.
I. INTRODUCTION
In recent years image detectors have massively increased in resolution, and the volume of data that requires image processing has multiplied accordingly. This trend applies to a broad spectrum of applications, from simple everyday cameras to the more complicated and computationally demanding biomedical and high-energy physics applications. These applications have different performance requirements, but in all cases preprocessing for data reduction, such as clustering or edge detection, is applied. The algorithm presented here is a general purpose clustering algorithm with generic characteristics, easily adjustable to a variety of applications and performance needs. The algorithm is developed for the Fast Tracker processor (FTK) [1] used for the trigger upgrade of the ATLAS experiment [2].
FTK is an approved ATLAS upgrade that aims to provide a complete list of tracks to the ATLAS HLT (High Level Triggers) at each level-1 accept, at rates up to 100 kHz, with a very small latency of the order of 100 μs. It will receive data from the pixel and microstrip detectors and reconstruct tracks in the whole silicon tracking detector, providing the HLT with high-quality reconstructed track parameters for all tracks with transverse momentum above 1 GeV/c.
The data from all silicon detector read out drivers (RODs) are received by 256 S-Links [3] running at 1.2 Gbit/s, therefore the total input rate will be 308 Gbit/s. It is essential to reduce this large amount of data as much as possible without losing useful information. This is the motivation for using the clustering algorithm. The clustering is implemented on the FTK Input Mezzanine cards (FTK_IM), which are installed on the FTK Data Formatter boards [4]. Each FTK_IM receives 4 S-Links, where hits are transferred as 32-bit words at a 40 MHz rate per S-Link, which is the specification of the system.
The clustering implementation presented in this paper is the 2D-clustering which will be used to identify hit clusters on the ATLAS pixel detector. The ATLAS pixel detector is made from separate pixel modules. The ATLAS pixel module ( [5] , Figure 1 ) has 328x144 pixels which are read out by 16 Front End chips (FE). The FE chips are organized in two rows of 8 chips each. The data from each chip is read out in column pairs. The data coming from the two columns are not sorted by position. The main challenge of the clustering 
II. CLUSTERING IMPLEMENTATION
The clustering algorithm is responsible for identifying groups of contiguous pixels compatible with a cluster and then reducing the hit data to a single set of coordinates: the cluster centroid plus a few bits of cluster shape information. The data from the pixel and microstrip RODs arrive via an S-Link channel. They are received by an S-Link decoder and are forwarded to an S-Link FIFO, which is the data source of the presented clustering implementation (Figure 2).
The clustering implementation is designed in three separate processing modules ( Figure 2 ): a) the hit decoder module, b) the grid clustering module and c) the centroid calculation module.
A. Hit Decoder Module
The hit decoder transforms the incoming data from the ATLAS bytestream format to a format useful to the following processing step, the grid clustering module. It is, in fact, a preprocessing step that selects, formats and organizes the information that is used by the clustering algorithm such as start/end event words (the flag words that mark the beginning and the end of an event in the bytestream), module headers/trailers (the flag words that mark the beginning and the end of hits from one pixel module as well as the module number) and of course the pixel hits. The code is robust against bit errors in the input data. In the rare case when the arriving hit data are not identified by a start event word or a module header the data are dropped. In addition the hit decoder can reintroduce missing control words such as end event words and module trailers.
The most important role of the hit decoder module is to properly align all the incoming data. The ATLAS pixel module FE chips are numbered from the bottom left corner to the upper left corner in a clockwise cycle and they are read out in the same sequence. This means that half of the pixel module data arrive in reverse column order with respect to the other half. The hit decoder module needs to restore the order of the hits, since the clustering algorithm is based on the assumption that they are ordered by increasing column number.
To achieve this, a LIFO is used to store all the hits that arrive from FEs numbered 0 to 7 (Figure 3). When a hit arrives from an FE chip numbered 8 to 15, it is stored in a separate register. The value of the register is compared with the last value stored in the LIFO, and the hit with the smaller column value is propagated to the next processing module. In this way the increasing column sequence is restored. The LIFO size is 512 words, which is sufficient for the currently expected hit occupancy in the pixel modules. In the rare case that the LIFO size is exceeded, the only effect is that some clusters of the corresponding pixel module are split. This condition is recorded in an end-event-word flag. Two small FIFOs (16 words each) are added as input and output buffering stages for synchronization purposes.
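The reordering step can be illustrated with a short software model. This is a minimal sketch, not the firmware: it assumes that hits from FEs 0 to 7 arrive in decreasing column order and hits from FEs 8 to 15 in increasing column order, with columns expressed in module coordinates. The `Hit` type and the function name `restore_column_order` are hypothetical, and the 512-word depth limit and overflow flag are not modelled.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    fe: int   # front-end chip number, 0..15
    col: int  # column in module coordinates (assumed convention)
    row: int  # row in module coordinates


def restore_column_order(hits: Iterable[Hit]) -> List[Hit]:
    """Merge the two FE rows of one pixel module into increasing column order."""
    lifo: List[Hit] = []  # models the hardware LIFO (depth limit not modelled)
    out: List[Hit] = []
    for hit in hits:
        if hit.fe <= 7:
            # First FE row: assumed to arrive in decreasing column order,
            # so the most recently stored hit has the smallest column.
            lifo.append(hit)
        else:
            # Second FE row: the incoming hit plays the role of the register.
            # Release every stored hit whose column does not exceed it.
            while lifo and lifo[-1].col <= hit.col:
                out.append(lifo.pop())
            out.append(hit)
    # End of module data: drain whatever is still stored in the LIFO.
    while lifo:
        out.append(lifo.pop())
    return out


example = [Hit(0, 10, 5), Hit(1, 7, 3), Hit(2, 3, 8),
           Hit(8, 2, 200), Hit(9, 5, 190), Hit(10, 12, 180)]
ordered_cols = [h.col for h in restore_column_order(example)]
```

Under these assumptions the procedure is the merge step of a merge sort: the LIFO reverses the first half so that both inputs are seen in increasing column order.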
B. Grid Clustering Module
The grid clustering module is the one that actually identifies the clusters and it is the most computationally intensive block of the implementation. The module uses a "moving window" technique to minimize computational time per cluster identification as well as needed FPGA resources. The "window" is actually a rectangular grid of pixel cells of generic size. Its size depends on the maximum expected cluster size per application and it must be big enough to fit this cluster size. The "window" is "moving" in the sense that during the several passes of the cluster identification process it is virtually placed in different positions of the pixel module and every time it is filled with data from different areas of the pixel module plane.
Figure 3. The Hit Decoder Module.
At the start of the processing of a pixel module, the window is filled with data around the first received hit. This hit is used as a reference hit and it is placed on the middle row of either column 0 or column 1 of the window, depending on whether it belongs to an even or odd column of the ATLAS pixel module. Because of the double column scrambling of the data, this leaves one column of space for preceding hits that arrive later. The hits are read from the input until the first hit with a column beyond the column range spanned by the "window" arrives. This hit is kept in the input FIFO and processed later. All the hits that belong to the "window" are loaded to the grid, while the hits that do not belong to the window but are within the window column span (above or below it) are stored in a separate circular buffer.
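The window loading step described above can be sketched in software as follows. This is an illustrative model under stated assumptions, not the firmware: hits are assumed to arrive ordered by column pair, a reference hit in an even column is assumed to go to window column 0 and one in an odd column to window column 1, and a plain list stands in for the circular buffer. The function name `load_window` and the constants are hypothetical; the 8x21 window size is the one quoted for the current implementation.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

HitXY = Tuple[int, int]       # (column, row) in absolute module coordinates

WIN_COLS, WIN_ROWS = 8, 21    # window size of the current implementation


def load_window(fifo: Deque[HitXY]):
    """Fill one window from the input FIFO.

    Returns (origin, grid, overflow), where origin is the absolute
    (column, row) of window cell (0, 0).
    """
    ref_col, ref_row = fifo[0]                # first hit is the reference hit
    # Assumed convention: even column -> window column 0, odd -> column 1.
    col0 = ref_col - (ref_col % 2)
    # The reference hit sits on the middle row of the window.
    row0 = ref_row - WIN_ROWS // 2
    grid: Dict[Tuple[int, int], HitXY] = {}   # (win_col, win_row) -> hit
    overflow: List[HitXY] = []                # stands in for the circular buffer
    # Read until the first hit beyond the window column range; that hit
    # is left in the FIFO for a later pass.
    while fifo and fifo[0][0] < col0 + WIN_COLS:
        col, row = fifo.popleft()
        win_col, win_row = col - col0, row - row0
        if 0 <= win_row < WIN_ROWS:
            grid[(win_col, win_row)] = (col, row)
        else:
            overflow.append((col, row))       # same columns, above or below
    return (col0, row0), grid, overflow


fifo = deque([(5, 100), (4, 101), (6, 130), (11, 95), (13, 100)])
origin, grid, overflow = load_window(fifo)
```

In this example the reference hit (5, 100) lies in an odd column, so the window starts at column 4 and the later-arriving hit (4, 101) from the same column pair still fits in window column 0.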
The cluster identification process begins by selecting two grid pixel cells as "seeds" (column 0 and column 1 on the middle row; blue coloured cells in Figure 4a). The "seed" cells that contain a hit when selected change their state to "selected". The "selected" state is propagated on the next clock cycle to all neighboring hits (arrows in Figure 4b, c and d). On the same cycle the hit that was previously selected is read out (black coloured cells in Figure 4b, c and d). When a hit is read out the cell returns to an "empty" state (grey coloured cell in Figure 4d). Using the same process, all the hits that form a cluster are read out. The hit information that is read out of the grid is propagated to the next processing module in coordinates relative to the reference hit. After the cluster hits are all read out, a cluster flag word containing the absolute coordinates of the reference hit is sent to the next module. The hits that remain in the grid and do not belong to the identified cluster are read out in the next processing step and saved in the circular buffer in their absolute coordinates. The hits that are recovered from the grid to be stored in the circular buffer are not in column sequence with the previous hits of the circular buffer.
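The selection and readout process can be modelled as a wavefront propagation over the grid. The sketch below is a simplified software analogue under stated assumptions: neighbors are taken to be the eight surrounding cells (the text does not specify the connectivity), and each loop iteration reads out a whole wavefront, whereas the hardware timing per clock cycle may differ. The function name `read_out_cluster` is hypothetical.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (window column, window row)


def read_out_cluster(hit_cells: Set[Cell], seeds: List[Cell]):
    """Return (cluster cells in readout order, cells left in the grid)."""
    remaining = set(hit_cells)
    # Only seed cells that actually contain a hit become "selected".
    selected = [s for s in seeds if s in remaining]
    for cell in selected:
        remaining.discard(cell)
    readout: List[Cell] = []
    while selected:
        next_selected: List[Cell] = []
        for col, row in selected:
            # Propagate the "selected" state to neighboring hits
            # (8-connectivity is an assumption of this sketch).
            for dc in (-1, 0, 1):
                for dr in (-1, 0, 1):
                    neighbor = (col + dc, row + dr)
                    if neighbor in remaining:
                        remaining.discard(neighbor)
                        next_selected.append(neighbor)
            # The previously selected hit is read out; its cell becomes empty.
            readout.append((col, row))
        selected = next_selected
    return readout, remaining


hits_in_grid = {(0, 10), (1, 10), (1, 11), (2, 12), (5, 3)}
cluster, leftover = read_out_cluster(hits_in_grid, seeds=[(0, 10), (1, 10)])
```

Here the four connected cells form the cluster, while the isolated hit at (5, 3) is left over; in the flow described above it would be written to the circular buffer in absolute coordinates.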
On the next run of the clustering module the grid is loaded with hits from the circular buffer. The leftmost hit stored in the circular buffer is chosen as the new reference hit. This hit value is stored in a separate register, called the "leftmost register", as the circular buffer is being filled. While the circular buffer is read to load the grid, hits that do not belong to the grid need to be saved again in the circular buffer. Extra functionality had to be added to the circular buffer to control simultaneous reading and writing of hits without accessing the same data twice. If, after the circular buffer has been read, there are hits in the input FIFO that belong to the columns of the circular buffer, these hits are read until a hit with a column number outside the grid arrives at the input. A clustering module processes all the data related to one pixel detector module, so the cycle is repeated until a pixel detector module trailer word is received at the clustering input and the circular buffer is empty.
For the current clustering module implementation a "window" of 8x21 pixels is used (8 in the z or η direction and 21 in the r-ϕ direction). The most common cluster size in the ATLAS pixel module is 2x3 pixels. The bigger grid is used to allow identification of the rarer but still existing bigger clusters, or of clusters generated by merging hits from two or more clusters. Clusters bigger than the grid, which means clusters extending from the reference hit beyond one of the grid edges, will be split. Clusters that touch a grid edge will be identified by a flag in the output.
The algorithm is executed in a pipeline, which means that clusters are identified and read out from the clustering module to the centroid module simultaneously. Different numbers of clustering modules can be implemented at the output of the hit decoder to identify clusters in parallel. These modules will work independently on different pixel module data. Therefore, the implementation offers the versatility to trade resource usage for performance.
C. Centroid Calculation Module
The centroid calculation module is the post-processing step in the clustering implementation that performs the data reduction process. It is the module where the cluster data is replaced with one set of coordinates, the centroid coordinates. For each cluster a centroid value is calculated. The centroid is then corrected by a variable that takes into account the absolute pixel position in the detector as well as the charge deposition in each cluster, measured by the Time-over-Threshold (ToT) information from the FE. The ToT value for each hit is stored in the same word as the hit coordinates. While the hits are placed in the clustering window of the grid clustering module, these values are stored in a separate memory (ToT memory), and they are recovered while the cluster hits are read out.
A simplified version of the formulae used to calculate the centroid column (x coordinate) value is shown below:
$$x_{average} = \frac{ColMin + ColMax}{2} + a \cdot qRatio \qquad (1)$$
ColMin and ColMax in equation (1) are the minimum and maximum columns of the cluster. The constant a is a function of the pixel position. In equation (2) the charge imbalance between the two sides of the cluster is calculated (left-right for x and top-bottom for y). qColMax is the sum of the Time-over-Threshold values of ColMax and qColMin is the sum of the same values for ColMin. x_average is the final centroid column value with the applied corrections.
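A small numerical sketch may help to make the quantities defined above concrete. It is built on two assumptions that are not confirmed by the text: the uncorrected centroid is taken as the midpoint of ColMin and ColMax, with the correction added as a times qRatio, and qRatio is taken as the normalised difference of qColMax and qColMin. The actual formulae of the centroid module may differ. The function name `centroid_column` and the sample values are hypothetical.

```python
from typing import Iterable, Tuple


def centroid_column(hits: Iterable[Tuple[int, int]], a: float) -> float:
    """hits: (column, ToT) pairs of one cluster; a: position-dependent constant."""
    hit_list = list(hits)
    col_min = min(col for col, _ in hit_list)
    col_max = max(col for col, _ in hit_list)
    # Summed ToT of the leftmost and rightmost columns of the cluster.
    q_col_min = sum(tot for col, tot in hit_list if col == col_min)
    q_col_max = sum(tot for col, tot in hit_list if col == col_max)
    q_sum = q_col_min + q_col_max
    # ASSUMPTION: normalised charge imbalance between the two cluster sides.
    q_ratio = (q_col_max - q_col_min) / q_sum if q_sum else 0.0
    # ASSUMPTION: midpoint of the column range plus the charge correction.
    return (col_min + col_max) / 2 + a * q_ratio


# Three-column cluster with more charge on the right-hand side.
x_avg = centroid_column([(10, 4), (11, 6), (12, 12)], a=0.5)
# Single-column cluster: no imbalance, the centroid is the column itself.
x_single = centroid_column([(7, 5), (7, 3)], a=0.5)
```

With these assumed definitions the first cluster has a midpoint of 11 and an imbalance of 0.5, so the centroid is shifted towards the side with more charge.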
The post-processing step can change depending on application (e.g. center of mass, median calculation etc).
III. RESULTS
The design will be implemented on a Xilinx Spartan 6 LX150T FPGA [6], where the available space will be shared with the clustering of the microstrips, the S-Link decoder and S-Link FIFO, as well as monitoring circuits. Therefore, it is essential to have a low area occupation on the device. Both the hit decoder and grid clustering modules are in their final verification stages and have successfully passed post-place-and-route simulation. The centroid calculation module is still under design. Both the hit decoder and clustering module use a very small fraction of the available FPGA resources (see Table I), much less than 1% of Slice LUTs for the hit decoder and about 2% for the grid clustering. The BRAMs used (FPGA device memory blocks) are occupied by the LIFO of the hit decoder, the grid clustering circular buffer and ToT memory, and the fraction of BRAM blocks used is less than 0.5% of the total blocks available on the device.
The hit decoder passed post-place-and-route verification with a 10ns clock period and the grid clustering module with a 12ns clock period. The complete hit decoder-clustering flow is being tested using the 12ns clock (~80MHz) of the latter. The performance of the clustering identification process with a realistic hit occupancy is estimated to be of the order of 10 clock cycles per hit. The current implementation is an evolution of a linear clustering algorithm with high cost in terms of FPGA resources [7] .
IV. CONCLUSIONS AND FUTURE WORK
In this paper a 2D-clustering algorithm and its FPGA implementation are presented. The implementation is designed for the ATLAS Fast Tracker Data Formatter but it is generic enough to be used in various applications where clustering is required. The clustering implementation achieves a ~80MHz operational frequency while using ~2% of the FPGA Slice LUTs. The estimated time required to identify the clusters is O(10) clock cycles per hit, therefore a single grid clustering module flow will achieve an ~8MHz response.
To surpass the given specification of a 40MHz response by a factor of 2 (~80MHz), a tenfold parallelization is planned. This implementation will occupy ~20% of the FPGA resources. Each clustering flow will process the data of a single pixel module, so that the flows operate independently of one another. With these parameters the clustering implementation will have a significant performance margin over the requirements while occupying only a fraction of the FPGA area.
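The throughput and resource estimates quoted in this section can be reproduced with a few lines of arithmetic. The sketch below uses only the figures given in the text and assumes ideal linear scaling with the number of clustering cores, which the independence of the flows makes plausible but which is not a measured result. Variable names are illustrative.

```python
CLOCK_PERIOD_NS = 12           # grid clustering clock period (Section III)
CYCLES_PER_HIT = 10            # estimated order of magnitude (Section III)
LUT_FRACTION_PER_CORE = 0.02   # ~2% of Slice LUTs per clustering flow
REQUIRED_RATE_MHZ = 40         # specified word rate per S-Link
N_CORES = 10                   # planned parallelization factor

clock_mhz = 1000 / CLOCK_PERIOD_NS             # about 83 MHz, quoted as ~80MHz
single_core_mhz = clock_mhz / CYCLES_PER_HIT   # about 8 MHz per clustering flow
# ASSUMPTION: throughput and LUT usage scale linearly with the core count.
total_mhz = single_core_mhz * N_CORES
lut_fraction = LUT_FRACTION_PER_CORE * N_CORES
margin = total_mhz / REQUIRED_RATE_MHZ         # about a factor of 2

print(f"single core: {single_core_mhz:.1f} MHz, "
      f"{N_CORES} cores: {total_mhz:.1f} MHz, "
      f"LUTs: {lut_fraction:.0%}, margin: x{margin:.1f}")
```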
