Abstract-We describe a track segment recognition scheme called the Tiny Triplet Finder (TTF) that groups three hits satisfying a constraint, such as forming a straight line. The TTF performs this $O(n^3)$ function in $O(n)$ time, where $n$ is the number of hits in each detector plane. The word "tiny" reflects the fact that the FPGA resource usage is small. The number of logic elements needed for the TTF is $O(N \log N)$, where $N$ is the number of bins in the coordinate considered, which for large $N$ is significantly smaller than the $O(N^2)$ needed for typical implementations of similar functions. The TTF is also suitable for software implementations as well as many other pattern recognition problems.
I. INTRODUCTION
Track segment finding is an essential process in many trigger systems for high-energy physics experiments. For example, in the Fermilab BTeV [1] trigger system [2]-[4], we need to identify track segments from the coordinates of pixel detector hits from three adjacent detector planes forming a straight-line segment in the non-bend view. For a given track segment crossing three equally spaced planes, the following relationship holds:

$$x_A + x_C = 2x_B$$

where $x_A$, $x_B$ and $x_C$ are the hit coordinates on planes A, B and C in the non-bend view. Such segments consisting of three hits are referred to as "triplets" [2] (see Fig. 1).
In the BTeV detector, the interaction points are distributed along the beam axis in a wide range. There are two free parameters for a track segment even in the non-bend projection. A possible track segment is not formed until at least three hits are aligned.
Therefore the triplet finding process in BTeV differs from the straight track segment finding used in other high-energy physics experiments where the interaction point is known. In that case, there is only one free parameter for a track segment in the non-bend projection. A possible track segment is formed once two hits and the known interaction point are aligned, so the process is significantly simpler than the BTeV triplet finding process.
Fig. 1. ZX is the non-bend view. The dashed line represents a fake triplet.
A straightforward software implementation of the triplet finding function would require $O(n^3)$ execution time, where $n$ is the number of hits per plane, in order to examine all possible combinations of three hits using three layers of nested loops. In hardware implementations, the execution time can be reduced to $O(n)$, the time required to fetch the data. This execution time reduction is accomplished by "unrolling" two layers of loops, which consumes a significant amount of silicon resources in the device. The number of logic elements needed for typical triplet finding implementations is $O(N^2)$, where $N$ is the number of bins that each plane is divided into.
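The three nested loops mentioned above can be written out as a minimal Python sketch. It assumes equally spaced planes, so that the straight-line constraint is $x_A + x_C = 2x_B$ in integer coordinate units; the function name is hypothetical and the code is an illustration only, not part of the BTeV trigger software.

```python
def find_triplets_brute_force(hits_a, hits_b, hits_c):
    """Return every (xa, xb, xc) with xa + xc == 2 * xb.

    Three nested loops examine all combinations, so the cost grows
    as n**3 for n hits per plane.
    """
    triplets = []
    for xa in hits_a:
        for xb in hits_b:
            for xc in hits_c:
                # Assumed constraint: three equally spaced planes, straight track.
                if xa + xc == 2 * xb:
                    triplets.append((xa, xb, xc))
    return triplets
```

This version serves only as a reference against which faster schemes can be checked.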
In this article, we describe a new algorithm that performs the triplet finding function, which we refer to as the Tiny Triplet Finder (TTF) [5]. We also describe sample hardware implementations of the TTF using low-cost FPGA devices. Logic element usage in the TTF implementation is $O(N \log N)$, which is significantly smaller than $O(N^2)$ when $N$ is large. After the triplets are found, the triplets in one part of the detector are matched with triplets found in another part of the detector. The process of matching two data items requires $O(n^2)$ execution time in software, and it can be reduced to $O(n)$ in hardware by unrolling one layer of the loops, at a corresponding cost in logic elements. The "Hash Sorter" [6] or other schemes can be used for this type of application.
II. PRINCIPLE
Consider three equally spaced detector planes in the non-bend view as shown in Fig. 2. We first divide the two outer detector planes, Plane_A and Plane_C, into $N$ bins ($N = 64$ in this example), choosing a bin as the unit of the coordinate in the non-bend view and rounding to an integer. Since the detector plane usually contains far more than 64 pixels in the projection being considered, the binning actually merges several pixels together. For example, to merge 1024 pixels into 64 bins, 16 pixels are merged together. To convert the pixel number into the bin number in this case, simply shift the pixel number right by 4 bits.
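The pixel-to-bin conversion is a plain right shift when both the pixel count and the bin count are powers of two. A minimal sketch, with a hypothetical function name and the 1024-pixel figure taken from the example above:

```python
def pixel_to_bin(pixel, n_pixels=1024, n_bins=64):
    """Map a pixel number to a bin number by dropping low-order bits.

    Assumes n_pixels and n_bins are powers of two with n_pixels >= n_bins.
    """
    pixels_per_bin = n_pixels // n_bins
    shift = pixels_per_bin.bit_length() - 1  # log2 of the pixels merged per bin
    return pixel >> shift
```

With the defaults the shift is 4 bits; passing `n_bins=128` gives a shift of 3 bits.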
In general, there exist $N^2$ possible track combinations, or "roads", in this configuration. A "road" is defined here as a line segment passing through one of the bins in each of Plane_A and Plane_C. Directly implementing all possible combinations using either logic elements or content addressable memories (CAMs) would require a large amount of silicon resources.
In the Tiny Triplet Finder, two register arrays, BitReg_A and BitReg_C, are used to record the hits in the detector planes.
When a hit coordinate from a detector plane is input, one of the 64 bits in the register array corresponding to its position is set. After all hits in Plane_A and Plane_C are recorded, the algorithm cycles through each hit from Plane_B. In the special case when the hit is at the mid-point of Plane_B (see the top of Fig. 2) , there are 64 possible track combinations or roads. Each possibility is checked through bit-wise coincident logic between the bit patterns recorded in BitReg_A and BitReg_C. If a pair of corresponding bits in BitReg_A and BitReg_C are both set, e.g., (0, 63), (1, 62) or (2, 61), etc., the bit-wise logic will output the pattern of the matching pair(s) corresponding to a possible track segment passing through the hits in Plane_A, Plane_C and Plane_B. The bit-wise coincident logic is primarily a bit-wise AND of the patterns in the two registers. In a real implementation, a bit-wise OR with the neighboring bits in one pattern may first be performed to cover boundaries. An example of the bit-wise coincident logic is shown in the bottom of Fig. 2 .
For hits that are not at the mid-point of Plane_B, the bit-wise coincident logic is identical, except that the positions of the bit patterns representing the hits on Plane_A and Plane_C are shifted relative to each other by an amount determined from the coordinate of the hit on Plane_B (see the second and third configurations of Fig. 2). The constraint for the triplet can be rewritten in the following form:

$$x_A = -x_C + 2x_B$$

It can be seen that the relative shift between the bit patterns is $2x_B$. In a practical implementation, the unit of the Plane_B hit coordinate is chosen so that $2x_B$ is an integer from 0 to $2N-1$. (To merge 1024 pixels into 128 bins in our example, simply shift the pixel number right by 3 bits.) It can also be seen that the order of the two bit patterns relative to each other should be reversed due to the negative sign between $x_A$ and $x_C$. Additionally, hits from different tracks as well as noise hits can also satisfy the coincident logic and produce fake tracks. The simplest way to deal with the fake tracks is to encode and output all of them and perform arbitration at a later stage. Users may also use a priority encoder to choose a particular track depending on the physics requirements of the experiment.
In the Tiny Triplet Finder, only $N$ (64 in this example) combinations are implemented in the bit-wise coincident logic instead of $N^2$ combinations. Taking advantage of symmetry, all possible combinations can be achieved by shifting the bit patterns.
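The principle of this section can be sketched in software using Python integers as the 64-bit register arrays. This is an illustration under stated assumptions, not the FPGA design: it assumes the constraint $x_A + x_C = 2x_B$, takes each Plane_B input already expressed as the shift $s = 2x_B$ in bin units, and omits the neighboring-bit OR. All function names are hypothetical.

```python
N = 64                 # bins in Plane_A and Plane_C (value used in the example)
MASK = (1 << N) - 1


def fill_register(hits):
    """Set one bit per hit bin, as BitReg_A and BitReg_C do."""
    reg = 0
    for x in hits:
        reg |= 1 << x
    return reg


def reverse_bits(reg, width=N):
    """Reverse the bit order (needed because x_A and x_C have opposite signs)."""
    out = 0
    for i in range(width):
        if (reg >> i) & 1:
            out |= 1 << (width - 1 - i)
    return out


def tiny_triplet_finder(hits_a, shifts_b, hits_c):
    """Return (xa, s, xc) for every coincidence with xa + xc == s."""
    reg_a = fill_register(hits_a)
    reg_c_rev = reverse_bits(fill_register(hits_c))
    triplets = []
    for s in shifts_b:                  # a single loop over Plane_B hits
        offset = s - (N - 1)            # zero for a mid-point hit (s == 63)
        if offset >= 0:
            a_shifted = reg_a >> offset
        else:
            a_shifted = (reg_a << -offset) & MASK
        p = a_shifted & reg_c_rev       # bit-wise coincidence in one operation
        while p:                        # decode each matching pair
            low = p & -p
            xc = N - 1 - (low.bit_length() - 1)
            triplets.append((s - xc, s, xc))
            p ^= low
    return triplets
```

For a mid-point hit the offset is zero, so bit 0 of the Plane_A pattern faces bit 63 of the Plane_C pattern, reproducing the pairs (0, 63), (1, 62), (2, 61) described above.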
In the time domain, the total execution time is taken up by the following processes:
1) Setting the bit patterns BitReg_A and BitReg_C.
2) Looping over hits in Plane_B, shifting the bit pattern in BitReg_A, performing the bit-wise coincident logic and decoding the matching pair(s) found.
The processes take approximately 2 clock cycles to execute, making them essentially $O(n)$, although small non-linear contributions exist when more than one pair is found by the bit-wise coincident logic.
Generally speaking, the probability of creating fake triplets that cause the non-linear contribution increases as $n$ increases, and decreases as $N$ increases. When the number of bins is sufficiently large and the number of hits is small, the non-linear contribution should be small. This is true for all triplet finding schemes, and the results studied for other schemes remain valid here as well.
Ignoring the small non-linear contributions, we see that the TTF unrolls two layers of loops so that an $O(n^3)$ process can now be executed in $O(n)$ time. The acceleration is accomplished through the use of the bit-wise coincident logic that simultaneously finds all matching hits on Plane_A and Plane_C for each hit on Plane_B in a single operation, making the process time proportional to the number of hits on Plane_B.
III. FPGA IMPLEMENTATIONS OF THE TINY TRIPLET FINDER
The block diagram of the Tiny Triplet Finder implemented in an FPGA device is shown in Fig. 3 .
A. Bit Array Filling
As the hit data from Plane_A and Plane_C are fetched from the input FIFOs, a bit corresponding to each hit is set in the BitReg blocks. The resulting hit patterns are presented at the output ports on buses AQ and CQ. As mentioned earlier, the bit order of CQ is reversed. Meanwhile, the full hit data are stored into memory buffers called "Hash Sorters" [6] for fast retrieval later. For simplicity, one may think of the Hash Sorters as memory areas that are each divided into 64 bins. When a hit sets a bit in the BitReg register array, the full hit data are written into the corresponding bin in the Hash Sorter.
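The filling step can be pictured with the simplified view of the Hash Sorter given above, i.e., a memory area divided into bins. The sketch below uses a dictionary of lists as a stand-in for that memory; the function name and the use of a raw pixel number as the "full hit data" are assumptions made for illustration.

```python
from collections import defaultdict


def fill_plane(pixel_hits, pixels_per_bin=16):
    """Set one register bit per hit and file the full hit under its bin."""
    bit_reg = 0
    hash_sorter = defaultdict(list)      # bin number -> full hit data
    for pixel in pixel_hits:
        bin_number = pixel // pixels_per_bin
        bit_reg |= 1 << bin_number       # the bit later seen on AQ or CQ
        hash_sorter[bin_number].append(pixel)
    return bit_reg, hash_sorter
```

Two hits falling into the same bin set a single bit but remain individually retrievable from the bin.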
B. Looping B Hits and Shifting Bit Pattern
After all hits from Plane_A and Plane_C have been written into the Hash Sorters, the hits from Plane_B are fetched from the input FIFO. The coordinate of each Plane_B hit is used to determine the relative shift distance between the two bit patterns AQ and CQ. The shifter shifts the bit pattern AQ by this amount and presents the shifted pattern at port A2Q. The full hit data from Plane_B are also stored for later retrieval in a buffer which can either be a hash sorter or a regular output FIFO.
The shifter is implemented in a two-stage pipeline to increase operation frequency. Although the shifter requires a relatively large number of logic elements, i.e., $O(N \log N)$, compared to other blocks in this design, it is still much smaller than typical implementations where $O(N^2)$ logic elements are needed.
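The $O(N \log N)$ figure can be understood from a staged (barrel) shifter with one stage per bit of the shift index, each stage acting on all $N$ bits. The following sketch models such a shifter in software; it is an assumed textbook structure with a hypothetical function name, and it does not model the two-stage pipelining of the actual design.

```python
def barrel_shift_left(pattern, amount, width=64):
    """Shift `pattern` left by `amount` using log2(width) conditional stages."""
    mask = (1 << width) - 1
    stage = 0
    while (1 << stage) < width:
        # Stage k shifts by 2**k when bit k of the shift index is set.
        if (amount >> stage) & 1:
            pattern = (pattern << (1 << stage)) & mask
        stage += 1
    return pattern
```

For a 64-bit pattern there are six stages, each of which would correspond to a row of 64 two-input selectors in hardware.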
C. Bit-Wise Coincident Logic
The bit pattern CQ and the shifted pattern of AQ, A2Q, are sent to the "BitLogic" block in which the bit-wise coincident logic is performed. The coincident logic is essentially a bit-wise AND. The OR logic among the neighboring bits in A2Q is included to cover the boundaries. See the bottom of Fig. 2 for an example. Any non-zero bit in the resulting bit pattern P indicates a found triplet. The location of this bit represents the coordinate of the Plane_C hit belonging to the triplet. The coordinate of the Plane_A hit can be derived from this location and the distance of shift.
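A minimal sketch of the BitLogic function follows. Which neighboring bit of A2Q is ORed in is an assumption here (the upper neighbor), as is the function name; the source only states that neighboring bits are ORed to cover boundaries.

```python
def bit_logic(a2q, cq, width=64):
    """Return pattern P: bit-wise AND of CQ with a widened copy of A2Q."""
    mask = (1 << width) - 1
    # Assumed widening: each A2Q bit also covers the next higher bit position.
    widened = (a2q | (a2q << 1)) & mask
    return widened & cq
```

A non-zero bit in the result marks the Plane_C location of a candidate triplet, including candidates that are off by one bin on the Plane_A side.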
D. Priority Sequence Encoder
The locations of the non-zero bits are encoded in the "Priority Sequence Encoder" block which can accommodate situations with more than one triplet. Normally there is only one non-zero bit in pattern P and the encoder outputs the location of the bit. If there are two or more non-zero bits, the encoder changes a signal (EN in Fig. 4 ) to halt earlier pipeline stages, allowing the locations of all the non-zero bits to be reported sequentially.
This block is also implemented as a pipeline. Although it takes 6 clock cycles to encode the non-zero bit(s) in P, the block accepts one P pattern during each clock cycle, as long as the pipeline is not halted.
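The sequential reporting of non-zero bits can be sketched as a loop that extracts one bit location per iteration. The lowest-bit-first order and the function name are assumptions; the pipelining and the EN halt signal are not modeled.

```python
def priority_sequence_encode(p):
    """List the locations of the non-zero bits of P, lowest location first."""
    locations = []
    while p:
        lowest = p & -p                      # isolate the lowest set bit
        locations.append(lowest.bit_length() - 1)
        p &= p - 1                           # clear that bit and continue
    return locations
```

Each additional non-zero bit costs one more iteration, which mirrors the extra cycles during which the earlier pipeline stages are halted.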
IV. TEST DESIGNS AND SILICON RESOURCE USAGE
We have test-compiled the Tiny Triplet Finder with 64 and 128 bins in an Altera Cyclone device (EP1C4) [7]. Results of the full simulation of the Tiny Triplet Finder are shown in Fig. 4. The simulation uses the hit coordinates given in Fig. 1(a) as an example. The coordinates for Plane_B are multiplied by 2 to obtain the shift distance B. All four real triplets in this example are found, in addition to a fake triplet which also satisfies the triplet condition $x_A + x_C = 2x_B$. The fake triplet is represented by the dashed line in Fig. 1(b), and its hits correspond to the values shown in Fig. 4.
The outputs of the Priority Sequence Encoder, KA, KB, and KC, are the bin numbers where the original hit data are stored in the Hash Sorters (or FIFO for Plane_B hits). These numbers are used as addresses to read out the hit data in the corresponding bins to send to later stages for further processing.
If there is more than one hit stored in a bin, the Hash Sorter will output all the hits in the bin so that later stages can make better choices. In this case, the pipeline in earlier stages will be halted, allowing multiple hits to be read out.
Another interesting point shown in this example is that we have found a triplet for which one of the input coordinates is off by 1 bin due to a boundary effect and/or a round-off error. Our bit-wise coincident logic covers this kind of difference. To trace back the original hit in Plane_A at bin 4, the Hash Sorter will check both bin KA and KA-1, i.e., both bin 5 and bin 4 in this example.
The compilation results are shown in Table I for all functional blocks shown within the dashed box in Fig. 3 . Clearly, the Tiny Triplet Finder can easily be accommodated in currently available middle-sized FPGAs.
The resource usage of two other typical implementations is also shown for comparison. The first implementation uses Content Addressable Memories (CAMs), which can be implemented fairly efficiently with Altera Embedded System Blocks (ESBs) [8]. For this case, the silicon usage for the roads has been calculated without considering boundary effects and other supporting logic.
Another implementation uses the Hough transform scheme [9] . The number shown includes only the 2-D histogram, assuming each bin can be implemented with 4 logic cells. Decoder and other supporting logic are not included. Since these two other implementations do not fit in the EP1C4 device, they were accommodated with a 7 times larger EP2A40 APEX II device [8] .
Furthermore, as the bin number increases from 64 to 128, the logic cell usage of the Tiny Triplet Finder increases only by about a factor of 2 while for the other two implementations an increase by a factor of 4 is anticipated.
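The two growth factors can be checked against the scaling laws quoted in this paper, $N \log N$ for the TTF and $N^2$ for road-based implementations. The sketch below evaluates only these idealized laws with hypothetical function names; it does not reproduce the compiled logic cell counts of Table I.

```python
import math


def ttf_scaling(n_bins):
    """Idealized N*log2(N) growth of the shifter-dominated TTF."""
    return n_bins * math.log2(n_bins)


def road_scaling(n_bins):
    """Idealized N**2 growth when every road is implemented."""
    return n_bins * n_bins


ttf_factor = ttf_scaling(128) / ttf_scaling(64)     # (128 * 7) / (64 * 6)
road_factor = road_scaling(128) / road_scaling(64)
```

The idealized TTF factor is 7/3, in line with the "about a factor of 2" observed in compilation, while the road-based factor is exactly 4.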
In Table I, we also list the resource usage of a 64-bit TTF implemented using the distributed RAMs available in FPGA devices from Xilinx [10]. The details are discussed in the next section.
V. IMPLEMENTATION USING DISTRIBUTED RAMS
In this section, a more efficient implementation using distributed RAMs is described.
In today's mainstream FPGA devices, lookup tables are used to implement combinational logic. The lookup tables are small RAMs, usually 16 locations by 1 bit in size. In some device families like Xilinx Virtex-II [10], users are allowed to write and read the RAM, which makes it possible to implement functions more compactly. As an example, the implementation of the bit registers and shifters in the TTF using distributed RAMs is described in this section.
In Fig. 5, the bit maps for a 64-bit TTF are shown. Two register arrays corresponding to Plane_A (left) and Plane_C (right) are formed using 64 distributed 16 × 1-bit RAMs for each array.
While filling hits for Plane_A, each RAM is given a rotational address, i.e., the address of a RAM differs from the address of the next RAM by 1. For any hit in Plane_A, up to 16 RAMs, 7 above and 8 below the hit position (marked with an arrow for each hit in Fig. 5), are enabled to be written. The bit being filled in the RAM at the hit location is bit 7, while the bits filled in the RAMs above and below the hit location are bits 6-0 and bits 8-15, respectively. The net effect is that each hit sets up to 16 bits in the array, as shown in the left bit map. Although up to 16 bits are written for each hit, only one clock cycle (not 16) is needed for the writing process.
While reading the register array for Plane_A, all the RAMs are given the same address (indicated with a vertical bar in Fig. 5), so the hit pattern appears at the outputs of the array. Changing the address shifts the hit pattern. In the register array for Plane_A, relative shifts over a range of 16 units, one per RAM address, can be achieved. Again, the reading of the shifted pattern takes a single clock cycle (not 7 or 8).
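The write-with-rotational-address and read-with-common-address mechanism for Plane_A can be modeled with a list of 64 sixteen-bit words, one word per RAM. In this sketch "above" is taken to mean lower RAM index, which is an assumption about the orientation of Fig. 5, and the function names are hypothetical.

```python
N_RAMS = 64     # one 16 x 1-bit RAM per bin
DEPTH = 16      # locations per RAM
CENTER = 7      # bit written in the RAM at the hit location


def write_hit_a(rams, hit):
    """Write one hit: RAM (hit + d) gets bit (7 + d) for d = -7 .. +8."""
    for d in range(-CENTER, DEPTH - CENTER):
        j = hit + d
        if 0 <= j < N_RAMS:              # RAMs beyond the plane edge do not exist
            rams[j] |= 1 << (CENTER + d)


def read_pattern(rams, addr):
    """Read all RAMs at the same address and assemble the output pattern."""
    pattern = 0
    for j, word in enumerate(rams):
        if (word >> addr) & 1:
            pattern |= 1 << j
    return pattern
```

Reading at address 7 returns the unshifted pattern, and each step in address moves the pattern by one bin, so no separate shifter is needed for this range.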
To cover the entire range of the shift, the register array for Plane_C is implemented similarly. The bit being filled in the RAM at the hit location is bit 7. In the RAMs $8k$ bins above the hit location (where $k = 1$ to $7$), the bits $7-k$ are filled, and similarly in the RAMs $8k$ bins below the hit location, the bits $7+k$ are filled. For each Plane_C hit, a jumping pattern as shown in the right bit map in Fig. 5 is written.
In the reading phase, the RAMs of the register array for Plane_C are all given the same address. The hit pattern appears at the outputs of the array. If the address changes by 1, the hit pattern jumps upward or downward by 8.
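The jump-by-8 behavior of the Plane_C array can be modeled in the same way as a list of 16-bit words. The sketch assumes that the RAM located $8k$ bins away from the hit receives bit $7+k$ (negative $k$ meaning "above", taken here as lower index); this layout is inferred from the description and is not confirmed against Fig. 5. The function names are hypothetical.

```python
N_RAMS = 64
STRIDE = 8      # bins jumped per address step in the Plane_C array
CENTER = 7      # bit written in the RAM at the hit location


def write_hit_c(rams, hit):
    """Write one hit: RAM (hit + 8*k) gets bit (7 + k) for k = -7 .. +8."""
    for k in range(-CENTER, 16 - CENTER):
        j = hit + STRIDE * k
        if 0 <= j < N_RAMS:
            rams[j] |= 1 << (CENTER + k)


def read_pattern(rams, addr):
    """Read all RAMs at the same address and assemble the output pattern."""
    pattern = 0
    for j, word in enumerate(rams):
        if (word >> addr) & 1:
            pattern |= 1 << j
    return pattern
```

Combined with a unit-step array for Plane_A, 16 fine steps and 16 coarse steps of 8 give 256 distinct relative shifts.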
Combining the two arrays allows all of the relative shifts between them to be achieved. In fact, the addresses during read for the two arrays are derived from the coordinate of the Plane_B hit. Plane_B is divided into 128 bins and a hit on Plane_B is represented by a 7-bit integer $B$. The reading addresses $RA$ for Plane_A and $RC$ for Plane_C are constructed as: $RA = B(6)\,\&\,B(2..0)$ and $RC = B(6..3)$, where & denotes bit concatenation. Some examples are tabulated in Table II. It can be seen that when $B(6) = 0$, possible values for RA are 0 to 7, or the left half of the Plane_A bit map, and when $B(6) = 1$, possible values for RA are 8 to 15, or the right half. In other words, B(6) is used as a "page selection bit" for RA. The reader may also note that, given the 4 address lines of each of the two bit maps, only half of the Plane_A bit map would be needed in order to generate the 128 relative shifts addressed by $B$. The reason why redundant bits are booked and B(6) is used as a "page selection bit" for RA is better plane edge coverage for the bit-wise coincident logic. As shown in Fig. 2, when the Plane_B hit is below the mid-point, or $B(6) = 0$, the bin 0 edges of Plane_A and Plane_C must be contained in the coincident logic. When $B(6) = 1$, the other edges, the bin 63 edges of Plane_A and Plane_C, must be contained. It can be shown that without the redundant page in the bit map for Plane_A, which is free in this case, about 8 bits of additional RAMs and bit-wise coincident logic beyond 64 bits would be required to cover the track roads being shifted out of the plane edges. The otherwise unused bits in the RAMs for Plane_A are used to contain the logic within 64 bits (see Fig. 5 and the last column of Table II).
It can be seen that the functions of the BitReg and the Shifter blocks in Fig. 3 are integrated in the two RAM arrays. By eliminating the Shifter block, the silicon resource usage is further reduced as shown in Table I .
As mentioned earlier, the maximum possible range of relative shift between the two patterns is 256 given the address lines of the two arrays (4 for each array), so it is possible to build a 128-bit TTF without using a shifter. (Of course, as mentioned above, about 16 bits of additional RAMs and coincident logic will be needed to cover the plane boundaries.) For a bigger TTF, a shifter will be needed, but the shifting functions provided by the RAM arrays will reduce the size of the shifter significantly. For example, for a 1024-bit TTF, the shift range is 2048, which requires an 11-stage shifter, one stage for each bit of the shift index. With RAM arrays, 8 bits in the shift index can be eliminated, leaving only 3 to be implemented in the shifter.
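The stage counts quoted above follow from simple arithmetic on the shift index width. A minimal sketch with a hypothetical function name, assuming a shift range of twice the number of bins and one shifter stage per remaining index bit:

```python
def shifter_stages(n_bins, ram_address_bits=0):
    """Stages left for the shifter after the RAM arrays absorb some index bits."""
    shift_range = 2 * n_bins
    index_bits = (shift_range - 1).bit_length()   # bits needed for 0 .. range-1
    return max(index_bits - ram_address_bits, 0)
```

For example, `shifter_stages(1024)` gives 11, `shifter_stages(1024, 8)` gives 3, and `shifter_stages(128, 8)` gives 0, i.e., no shifter at all.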
A disadvantage of the implementation discussed in this section is a longer reset time. Since the 16 × 1-bit RAM does not support fast reset, 16 clock cycles are needed to clear the RAM arrays to prepare for a new event, while the register-based implementation takes only one clock cycle.
VI. CONCLUSION
We have described an FPGA implementation of the Tiny Triplet Finder. Since the Tiny Triplet Finder algorithm uses no special logic operations other than shift and bit-wise AND/OR, it is also suitable for software implementation. In most CPU or DSP processors, the TTF algorithm is expected to allow execution time to be reduced from $O(n^3)$ to $O(n)$. Although we used the simplest configuration, i.e., straight tracks in three equally spaced detector planes, as an example in this document, the TTF algorithm can be extended to configurations with non-equally spaced planes and more than three planes. These cases are discussed in [5]. In another document we are currently composing, several applications with curved tracks are studied.
In addition to track segment finding, the TTF algorithm may also be used in hit recognition problems in wire chambers, time of flight counters, and GEM/MICROMEGAS detectors. These applications will be discussed in separate documents.
