A multi-core FPGA-based clustering algorithm for high-throughput, data-intensive applications is presented. The algorithm is optimized for data with a two-dimensional organization (e.g. image processing or pixel detectors for high-energy physics experiments). It uses a moving window of generic size that can be adjusted to the application's processing requirements (the cluster sizes and shapes that appear in the input data sets). One or more windows (cores) can be used to identify clusters in parallel, so the design can be tuned either to increase performance or to reduce resource usage. In addition to this inherent parallelism, the algorithm is executed in a pipeline, which allows cluster readout to be performed in parallel with cluster identification.
Introduction
Pixel detectors are used in a variety of applications, from simple image capture in everyday cameras to high-resolution imaging for astrophysics and high-energy physics. The data produced are organized in two dimensions, and the performance requirements vary with the application. Data reduction techniques such as clustering and edge detection are used as a preprocessing step, especially for high-throughput, data-intensive applications. The algorithm presented here is a general-purpose clustering algorithm with generic characteristics, which can be easily adapted to different applications.
This algorithm is being developed as part of the Fast Tracker (FTK) trigger upgrade for the ATLAS experiment [1]. The Fast Tracker processor will receive data from the ATLAS pixel and microstrip detectors for all events accepted by the Level 1 trigger, at an event rate of 100 kHz. It will reconstruct all events and provide all tracks with transverse momentum above 1 GeV to the Level 2 trigger within a few tens of microseconds. The clustering module is part of the FTK Input Mezzanine. It will receive data from 256 S-Links running at 1.2 Gbits/s, for a total input rate of about 307 Gbits/s. Each hit is a 32-bit word, and the system specification requires the module to accept hits at a 40 MHz rate. The ATLAS pixel module has 328x144 pixels that are read out by 16 Front End chips (FEs) [2], organized in two rows of 8 chips. The data are read out in a double-column format, which means that data from double columns arrive scrambled. The main challenge of the clustering implementation is to design a module that identifies clusters on the fly.
Clustering Implementation
The clustering algorithm is a data reduction process that identifies the clusters generated by pixel hits in the ATLAS experiment. For each cluster the output data are reduced to a single pair of coordinates, plus a few bits of cluster size information. The data from the pixel and strip readout drivers (RODs) are transmitted to FTK over an S-Link [3] channel. An S-Link decoder receives the data and performs a first data formatting step. The data are then sent to an S-Link FIFO, which is the data source for the clustering implementation.
The presented clustering implementation has three computational stages: a) a hit decoder module, b) a clustering module and c) a centroid calculation module.
Hit Decoder
The hit decoder's main function is to convert the incoming data, which use the ATLAS Pixel bytestream format, into a format suitable for the grid clustering module. The hit decoder drops all unnecessary data and propagates only the data that the clustering algorithm needs (start/end event words, module headers/trailers, hit data). In the rare case of imperfect input data, the module can insert missing control words into the data stream and drop undefined hit data.
Another important role of the hit decoder module is to align the incoming data properly. The grid clustering module needs the data to arrive in column sequence in order to identify the clusters correctly. However, the FE chips in the ATLAS pixel module are numbered from the bottom left corner to the upper left corner and are read out in the same sequence, which means that the first half of the hit data arrives in reverse order with respect to the second half.
To compensate for this, a LIFO stores the hits arriving from FE chips 0-7. When hits start to arrive from FE chips 8-15, each incoming hit is temporarily stored in a register and compared to the hit at the top of the LIFO. The hit with the smaller column number is propagated to the output. In this way the column sequence is restored. The LIFO size is 512 words, which is sufficient for the currently expected hit occupancy. To avoid synchronization issues, two small FIFOs are added, at the input and the output of the design.
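The reordering step described above can be sketched in software as a merge between a stack and an incoming stream. This is only an illustrative model, not the firmware: the `Hit` tuple, the function name `restore_column_order` and the assumption that the first half arrives in descending column order (so that the top of the LIFO always holds its smallest remaining column) are introduced here for the example.

```python
from typing import Iterable, Iterator, List, Tuple

Hit = Tuple[int, int]  # (column, row): a simplified, hypothetical hit record
LIFO_DEPTH = 512       # LIFO size quoted in the text


def restore_column_order(first_half: Iterable[Hit],
                         second_half: Iterable[Hit]) -> Iterator[Hit]:
    """Merge the two halves of a module's hits into ascending column order.

    Assumption: first_half (FE chips 0-7) arrives in descending column order
    and second_half (FE chips 8-15) arrives in ascending column order.
    """
    lifo: List[Hit] = []
    for hit in first_half:
        if len(lifo) >= LIFO_DEPTH:
            raise OverflowError("LIFO full: occupancy exceeds the assumed depth")
        lifo.append(hit)  # push; the end of the list is the top of the stack

    for hit in second_half:
        # Emit stacked hits while the top of the LIFO has the smaller
        # (or equal) column, then emit the incoming hit itself.
        while lifo and lifo[-1][0] <= hit[0]:
            yield lifo.pop()
        yield hit

    # Second half exhausted: flush whatever is left on the stack.
    while lifo:
        yield lifo.pop()


first = [(9, 3), (7, 1), (4, 2), (1, 5)]   # descending columns
second = [(2, 200), (4, 180), (8, 170)]    # ascending columns
merged = list(restore_column_order(first, second))
print([col for col, _ in merged])          # [1, 2, 4, 4, 7, 8, 9]
```

In hardware the comparison is made once per clock between a register and the LIFO output; the inner `while` loop here plays the same role sequentially.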
Clustering Module
The grid clustering module is the most computationally intensive part of the implementation. Its role is to group the detector hits into clusters. As a first step, the algorithm uses a moving window technique to reduce the amount of data needed for each cluster identification. Each window is a grid of clustering cells of generic size. The size of the grid is chosen according to the type and size of the clusters that occur in each application (shape and number of pixels). The state of each cell reflects whether the corresponding pixel is active in the detector and whether it belongs to a cluster. The grid is aligned so that the first active pixel (hit) to arrive is placed on the middle row of either column 0 or column 1 (depending on whether it belongs to an even or odd module column), thus minimizing the pixel data required to identify each cluster. Hits are read from the input until a hit that belongs to a column beyond the clustering window is found. Hits that belong to the same columns as the clustering window but fall outside its row range are stored in a separate circular buffer.
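The window alignment and hit sorting described above can be illustrated with a short software sketch. It assumes column-ordered input, uses the 8x21 grid quoted for the FTK application, and models the circular buffer with a `deque`; the names `load_window`, `WINDOW_COLS`, `WINDOW_ROWS` and `MID_ROW` are hypothetical and chosen only for this example.

```python
from collections import deque
from typing import Iterable, Tuple

Hit = Tuple[int, int]  # (column, row) in absolute module coordinates

WINDOW_COLS = 8               # assumed grid width (columns)
WINDOW_ROWS = 21              # assumed grid height (rows)
MID_ROW = WINDOW_ROWS // 2    # row index on which the first hit is placed


def load_window(hits: Iterable[Hit]):
    """Fill one clustering window from a column-ordered hit stream.

    Returns (origin, window, deferred, pending), or None for an empty stream:
      origin   - absolute (column, row) of window cell (0, 0)
      window   - set of window-local (column, row) cells that are active
      deferred - hits in the window's columns but outside its row range
      pending  - first hit beyond the window's columns, or None
    """
    stream = iter(hits)
    hit = next(stream, None)
    if hit is None:
        return None

    # Even module columns map to window column 0, odd ones to column 1.
    col0 = hit[0] - (hit[0] % 2)
    row0 = hit[1] - MID_ROW

    window = set()
    deferred = deque()   # stands in for the circular buffer
    pending = None
    while hit is not None:
        col, row = hit[0] - col0, hit[1] - row0
        if col >= WINDOW_COLS:
            pending = hit          # beyond the window: stop reading
            break
        if 0 <= row < WINDOW_ROWS:
            window.add((col, row))
        else:
            deferred.append(hit)   # kept in absolute coordinates
        hit = next(stream, None)
    return (col0, row0), window, deferred, pending


result = load_window([(5, 100), (5, 101), (6, 130), (7, 95), (14, 20)])
origin, window, deferred, pending = result
print(origin, sorted(window), list(deferred), pending)
```

For this input the first hit sits in odd column 5, so the window origin is column 4 and the hit lands on cell (1, 10); the hit at row 130 is deferred and the hit in column 14 stops the read.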
When the module stops reading data, the cluster must be read out. To initiate the process, two cells in the clustering window are selected as "seeds" (the cells at coordinates (0, mid-row) and (1, mid-row); Figure 1, A). The "seeds" are set to a "selected" state (grey colour). The "selected" state is propagated to the neighboring hits (arrow; Figure 1, B, C and D), and the previously selected hit is read out to the next processing module (black colour). When a hit has been read out, it returns to an "empty" state (light grey colour; Figure 1, D). Selecting and reading out the cluster are executed in parallel. The hits that do not belong to the cluster are recovered (read out) after the cluster has been read out, and they are written to the circular buffer with their absolute coordinates. After the hits are read out, the clustering window is reloaded with data from the circular buffer, thus "moving the window" to a new reference point. If the data from the circular buffer do not cover all the columns of the clustering window, more data are read from the input. The "seeds" are selected again and a new cluster is read out. The process continues until a module trailer is received and the circular buffer is empty.
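The seed-and-propagate readout can be modelled in software as a flood fill that starts from the two seed cells. The sketch below is sequential, whereas the hardware propagates the "selected" state across the grid in parallel; it also assumes that "neighboring" means the eight surrounding cells, which the text does not specify. The function name `read_out_cluster` and the constant `MID_ROW` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, int]  # window-local (column, row)

MID_ROW = 10  # middle row of an assumed 21-row window


def read_out_cluster(window: Set[Cell],
                     seeds: Iterable[Cell] = ((0, MID_ROW), (1, MID_ROW))):
    """Split the active cells of a window into one cluster and the leftovers.

    Returns (cluster, remaining): cluster is the list of cells reached from
    the seeds, remaining is the set of active cells that were not reached.
    """
    selected: List[Cell] = [s for s in seeds if s in window]  # "selected" state
    remaining = set(window) - set(selected)
    cluster: List[Cell] = []

    while selected:
        col, row = selected.pop()
        cluster.append((col, row))          # cell is read out, becomes "empty"
        # Propagate the "selected" state to every active neighbour
        # (8-connectivity is an assumption of this sketch).
        for dc in (-1, 0, 1):
            for dr in (-1, 0, 1):
                neighbour = (col + dc, row + dr)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    selected.append(neighbour)
    return cluster, remaining


cells = {(1, 10), (1, 11), (2, 12), (5, 3)}
cluster, leftover = read_out_cluster(cells)
print(sorted(cluster), sorted(leftover))
```

Here the seed at (1, 10) pulls in (1, 11) and then (2, 12), while the isolated hit at (5, 3) is left over and would be written back to the circular buffer for a later window.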
This two-dimensional grid takes full advantage of the inherently two-dimensional logic matrix of the FPGA. The algorithm itself is executed in a pipeline, and the readout of the clusters is executed in parallel with the cluster identification. In addition, several windows (cores) can be used to identify clusters in parallel. The algorithm therefore offers the flexibility to trade resource usage for performance.
For the ATLAS Fast Tracker application we plan to use a grid size of 8x21 pixels (21 in the r-phi direction and 8 in the z or η direction). The most common cluster size is two or three pixels. A larger grid is used so that bigger clusters, or clusters that result from the merging of neighboring clusters, can also be identified. Clusters that, starting from the seed cells, propagate beyond the edges of the grid will be split.
Centroid Calculation
The post-processing step of the clustering implementation can be adapted to the application. For the ATLAS FTK processor a centroid calculation is used. The charge imbalance between the two sides of the cluster is calculated, and the absolute pixel position is also taken into account. Together, these two values determine a correction that is applied to the centroid position.
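The exact FTK correction is not specified here, so the following sketch substitutes a plain charge-weighted mean to illustrate how per-pixel charge can shift a cluster position away from its geometric centre. It is a stand-in technique, not the FTK formula; the `(column, row, charge)` hit layout and the function name `weighted_centroid` are assumptions made for the example.

```python
from typing import Iterable, Tuple

ChargedHit = Tuple[int, int, float]  # (column, row, charge), hypothetical layout


def weighted_centroid(hits: Iterable[ChargedHit]) -> Tuple[float, float]:
    """Return the charge-weighted mean (column, row) of a cluster."""
    hits = list(hits)
    total = sum(charge for _, _, charge in hits)
    if total <= 0:
        raise ValueError("cluster must carry positive total charge")
    col = sum(c * q for c, _, q in hits) / total
    row = sum(r * q for _, r, q in hits) / total
    return col, row


# Two-pixel cluster in one column; the upper pixel holds 3x the charge,
# so the centroid moves from the geometric centre (50.5) towards row 51.
print(weighted_centroid([(10, 50, 1.0), (10, 51, 3.0)]))  # (10.0, 50.75)
```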
Results
With a window size of 8x21 pixels, about 1% of the FPGA area (measured as slice register usage) is used on a Spartan 6 LX150T FPGA device. For the same device we have verified the functionality using a post-place-and-route simulation with a 12 ns clock. With a realistic hit occupancy, the processing time is roughly 10 clock cycles per hit. A parallel implementation with 10 cores on the same device should achieve a hit processing rate of about 80 MHz using less than 20% of the available FPGA resources. In the parallel implementation each stream consists of data from different pixel modules, so that each engine remains independent of the others. Data are tagged with their event and module number so that the final stream can be recomposed in the proper event sequence. The current implementation is an evolution of a linear algorithm that had a high cost in terms of FPGA resources [4].
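The throughput estimate follows from the figures quoted above: a 12 ns clock, roughly 10 clock cycles per hit and 10 cores. The short calculation below makes the arithmetic explicit; it assumes ideal scaling with the number of cores, and the function name `hit_rate_mhz` is introduced only for this example.

```python
def hit_rate_mhz(clock_period_ns: float, cycles_per_hit: float, cores: int) -> float:
    """Aggregate hit processing rate in MHz, assuming ideal scaling over cores."""
    clock_mhz = 1000.0 / clock_period_ns      # 12 ns -> about 83.3 MHz
    return clock_mhz / cycles_per_hit * cores


single = hit_rate_mhz(12.0, 10.0, 1)
ten = hit_rate_mhz(12.0, 10.0, 10)
print(f"{single:.1f} MHz per core, {ten:.1f} MHz with 10 cores")
# 8.3 MHz per core, 83.3 MHz with 10 cores
```

A single core at about 8.3 MHz falls short of the 40 MHz input specification, which is why the parallel configuration is needed; ten cores give roughly twice the required rate.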
Conclusions
In this paper a 2D clustering implementation is presented. The implementation targets the ATLAS Fast Tracker, but it is generic enough to be used in general image processing applications. The module achieves an operating frequency of ~80 MHz while occupying about 1% of the FPGA slice registers. An estimated ~10 clock cycles per hit are necessary for cluster identification. A tenfold parallelization will allow an input hit rate of ~80 MHz, surpassing the desired specification of 40 MHz by a factor of 2.
