Abstract
Introduction
Recent developments in reconfigurable hardware have resulted in significant increases in logic densities in FPGA devices. As a result, many real-time computer vision tasks that are too computationally or bandwidth intensive for general processors can now be performed effectively on reconfigurable hardware. This paper describes a high speed object tracking system that is designed to utilize stereo vision disparities and is implemented in these advanced reconfigurable devices.
Tracking
Object tracking is the process of following a feature or object through consecutive frames to maintain the pose of the object. Maintaining a pose through tracking is beneficial because pure pose determination techniques, in the absence of a prior pose estimate, are typically processor intensive. These techniques cover a large search space and are typically unable to operate at high speeds in real time [8]. However, the finite speed of objects and fast sensor frame rates result in frame coherence, i.e. a similarity in the object's pose between successive frames. Trackers exploit frame coherence to limit the search space to the neighborhood of the previous frame's pose, enabling real-time operation.
Iterative Closest Point Tracking
The Iterative Closest Point (ICP) algorithm is a common 3D registration tool first described by Besl et al. [1]. The algorithm allows two 3D point data sets to be registered using a least squares minimization. Specifically, ICP iteratively determines the transformation between the two data sets that minimizes the distance measure:

$$E(R, T) = \sum_{i} \left\| M_i - (R D_i + T) \right\|^2$$

where M_i is the i-th model point, D_i is the i-th data point, and R and T are the rotation [3 × 3] and translation [3 × 1] matrices, respectively.
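As a minimal sketch, the distance measure can be evaluated for a candidate transform as follows. The sketch assumes the common sum-of-squared-distances form, with the transform applied to the data points and correspondences already established by index; the function names `apply_transform` and `icp_error` are illustrative and do not come from the tracker itself.

```python
def apply_transform(R, T, p):
    """Return R * p + T for a 3x3 rotation R (list of rows) and 3-vectors T, p."""
    return [sum(R[r][c] * p[c] for c in range(3)) + T[r] for r in range(3)]


def icp_error(model_pts, data_pts, R, T):
    """Sum of squared distances between each model point M_i and the
    transformed data point R * D_i + T (pairs are matched by index)."""
    total = 0.0
    for m, d in zip(model_pts, data_pts):
        q = apply_transform(R, T, d)
        total += sum((m[k] - q[k]) ** 2 for k in range(3))
    return total


# Example: a 90 degree rotation about the z axis followed by a shift.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
T = [1.0, 2.0, 3.0]
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
data = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
model = [apply_transform(R, T, d) for d in data]

error_at_truth = icp_error(model, data, R, T)
error_at_identity = icp_error(model, data, IDENTITY, [0.0, 0.0, 0.0])
```

The measure is zero at the true transform and positive elsewhere, which is the quantity each ICP iteration drives down.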
ICP was first extended to perform tracking by Simon et al. [9], by repeatedly applying ICP to consecutive frames of a data sequence, using the previous frame's transform as the current frame's initial estimate. Their tracker was able to track free-form rigid objects at 10 frames per second (fps) using range data with a low resolution of 32 × 32 pixels.
A benefit of ICP-based tracking is that it is robust to outliers, object occlusion, and noise. This robustness is mostly due to the large number of data (D_i) and model (M_i) points that are considered at each iteration. The most expensive computation at each iteration of ICP is the Nearest Neighbor (NN) computation, which determines correspondences between the two data sets. As the number of points to be considered increases, the NN step can become very expensive, requiring upwards of 90% of the processing effort. Various authors have developed methods for reducing the expense of the NN step of ICP utilizing data structures [6, 4], parallel processing [5], and approximations [2].
FPGA
Reprogrammable hardware, such as the FPGA, has been available for the past 30 years, and the amount of configurable logic that it contains has greatly increased over that time period. FPGAs enable a user to design and reconfigure the integrated circuit's (IC) overall hardware function by loading hardware configurations into dedicated on-chip memory.
With such ICs, substantial increases in algorithm performance can be obtained despite the relatively slow clock speeds (∼150 MHz for the Virtex-II Pro), provided that an algorithm can be sufficiently parallelized. Also, the ability to control data bit widths enables more efficient logic resource utilization during processing and provides greater bandwidth compared to general processors (which have fixed data bit widths). Additional benefits of FPGAs are their compact size and weight, durability, low power requirements, and fast prototyping.
FPGA Implementation of ICP
The tracking system in this paper is one of three components that have been designed to reside on a single FPGA device:
1. Stereo Sensor platform is capable of imaging and rectifying a stereo image pair with 640 × 480 pixel resolution at 200 fps.
2. Stereo Extraction platform is capable of analyzing a stereo pair to provide 640 × 480 pixels by 128 disparities at 200 fps using the maximum likelihood stereo correspondence algorithm [7] .
3. Object Tracker platform is capable of processing the disparity images to track an object's pose at 200 fps, utilizing ICP. This component is the focus of this paper.
Hardware ICP
The ICP algorithm has been divided into the following four stages:
• Filtering: reduces the number of points considered by the ICP algorithm at each frame.
• Nearest Neighbor: provides point correspondences between the model and image data.
• Transform Recovery: determines the registration transform between the model and image data.
• Transform Application: applies the transform to register the data to the model, which is then used as feedback for the system.
Throughout these stages, fixed point arithmetic has been used rather than floating point in order to efficiently utilize the FPGA logic (excluding the embedded softcore processor used for matrix decomposition and debugging). Furthermore, the number of both data and model points was maintained as a power of two to allow for efficient averaging without division arithmetic units. The next sections expand on the first three processing stages.
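The division-free averaging can be sketched in software as follows. This is only an illustration of the idea: the 8-bit fractional format (`FRAC_BITS`) and the helper names are assumptions made for the example, since the bit widths used on the FPGA are not stated here.

```python
FRAC_BITS = 8  # hypothetical fixed-point format with 8 fractional bits


def to_fixed(x):
    """Convert a float to a fixed-point integer with FRAC_BITS fractional bits."""
    return int(round(x * (1 << FRAC_BITS)))


def to_float(q):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << FRAC_BITS)


def mean_pow2(values_fixed):
    """Average fixed-point values using a right shift instead of a divider.

    Only valid when the number of values is a power of two, which is why
    the point counts are kept as powers of two.
    """
    n = len(values_fixed)
    if n == 0 or n & (n - 1):
        raise ValueError("point count must be a power of two")
    shift = n.bit_length() - 1  # log2(n)
    return sum(values_fixed) >> shift


samples = [to_fixed(v) for v in (1.0, 2.0, 3.0, 4.0)]
average = to_float(mean_pow2(samples))
```

With 512 data points the shift is nine bits, so a centroid coordinate needs only an accumulator and a fixed wiring offset rather than a division unit.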
Filtering
Filtering the input data can both speed up the ICP processing and prevent outliers from driving the solution to an incorrect result. Dense stereo extraction can generate a large number of data points which need to be considered for tracking. The sensor used here produces 640 × 480 pixels, all of which could potentially contain relevant object disparity data. However, because we are tracking the object, we are able to segment the object from most of the surrounding background. By filtering the data provided by the stereo extraction, we can segment the expected object region to remove outliers, which decreases the number of unnecessary calculations required for object tracking.
The buffered filtering approach used here comprises three steps:
1. 2D ROI: initially segments the object of interest using a fixed region of interest (ROI);
2. Subsampler: continually subsamples the 2D ROI;
3. 3D ROI: verifies the points subsampled from the 2D ROI, placing valid points into a list for further processing.
The 2D ROI uses the previous frame's position to buffer a constant 200 × 200 region in the 2D disparity image plane. Because all points in this region have been buffered, they can be subsampled and verified using the 3D ROI until 512 data points result. Having a fixed (rather than a variable) number of points at each frame simplifies the calculations in further processing steps. In this design, 512 disparities were sampled to generate the 512 data points.
Figure 1 illustrates the buffered filter approach. The rectified left stereo intensity image of the scene before stereo extraction is depicted in a), with the object of interest highlighted in yellow. The disparity image of the scene provided by stereo extraction is shown in b), where the dashed box represents the 2D ROI. In c), the subsampled points from the 2D ROI are shown, with star shaped points (orange) considered valid and circular points (blue) considered invalid. A view of the scene data looking down the sensor's Y axis is depicted in d), showing a frustum with a 3D ROI that further segments the points. Star shaped points inside the 3D ROI are considered valid, while all other (circular) points are invalid. Here, the size of the 3D ROI is kept constant, because there is no perspective scaling in 3D. A fixed number of valid subsampled points that have been accepted are stored in a list for later processing, as depicted in e).
The subsampling process for this filter is depicted in Figure 2 . Here, the depicted 2D ROI is 5 × 5 pixels (shown as pink squares) as opposed to the 200 × 200 region actually used. An iterator with constant incrementing value of three is used for this example. The numbers in the matrix indicate the sampling order with the first two selected samples bordered in black. The first sample (1) is picked arbitrarily, and the iterator proceeds along the row from left to right, with a constant increment of three, until the row overflows. Once this overflow occurs, the iterator moves down three rows, and the overflow amount is then used as a column index in the new row. The method for calculating the second sample (2) is indicated with blue and green numbering. Because of the buffering, this filtering process works with sparse data as long as there is valid data in the ROIs. The sampling is also performed uniformly to ensure a diverse selection of points from the ROI.
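The iterator described above can be sketched as follows, using the 5 × 5 example with an increment of three. Two points are assumptions rather than details taken from the text: the row index is taken to wrap around at the bottom of the ROI, and the increment is taken to be smaller than the ROI width. The names `sample_order` and `collect_valid` and the `is_valid` callback (standing in for the 3D ROI check) are hypothetical.

```python
from itertools import islice


def sample_order(width, height, step, start=(0, 0)):
    """Yield (row, col) positions in the order visited by the strided iterator.

    Assumes step < width, and that rows wrap around (modulo height).
    """
    row, col = start
    while True:
        yield row, col
        col += step
        if col >= width:                 # the row has overflowed
            col -= width                 # overflow amount is the new column
            row = (row + step) % height  # move down `step` rows


def collect_valid(roi, step, is_valid, target):
    """Subsample a buffered 2D ROI until `target` valid points are found
    or every cell of the ROI has been visited once."""
    height, width = len(roi), len(roi[0])
    picked = []
    for row, col in islice(sample_order(width, height, step), width * height):
        if len(picked) == target:
            break
        if is_valid(roi[row][col]):
            picked.append((row, col))
    return picked


order = list(islice(sample_order(5, 5, 3), 25))

# Toy ROI whose cells hold a running index; even values count as "valid".
toy_roi = [[r * 5 + c for c in range(5)] for r in range(5)]
valid_points = collect_valid(toy_roi, 3, lambda v: v % 2 == 0, target=4)
```

For this 5 × 5 case the iterator visits every cell exactly once before repeating, which spreads the samples across the whole ROI; other ROI sizes and increments should be checked for the same property.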
Nearest Neighbor
ICP requires a NN calculation for determining point correspondences so that the transform between the matched point sets can be determined. The NN search is the bottleneck of the ICP algorithm [1] [2] . The NN search can be separated into two main tasks as follows:
1. Compute the point-to-point distance: This distance metric, which is usually the Euclidean distance, can be computed in a pipelined fashion so that one point pair distance is computed every cycle.
2. Search for the closest point: This search uses the distance calculated to determine the minimum distance point pair by comparing to the previously buffered minimum.
Let D_i be the data point whose NN is being sought, M be the set of all N_m model points being searched, and P_i be the resulting point pair, which is composed of D_i and its NN M_j. An expression for the NN search is then:

$$P_i = (D_i, M_j), \qquad j = \arg\min_{k \in \{1, \dots, N_m\}} \left\| D_i - M_k \right\|$$
Sequential Brute Force NN
The Sequential Brute Force Search is an O(N^2) search which sequentially compares every model point to the data point seeking its NN. Figure 3 depicts the brute force search module, which consists of two main components: a distance calculator and a point comparator. The number of data points N_d and model points N_m remain constant; therefore, this search will execute in constant time.
To find the NN to data point D_i, the circuit iterates through all model points M_k ∈ M. The distance is first calculated for each point-to-point pairing and is then compared to the minimal distance found thus far. After all the M_k have been compared, the circuit reports the NN point pair P_i = (D_i, M_j) as its output. Because D_i never changes during the search, it is not passed through the circuit, but is only buffered to be read at the output.
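A software analogue of the two components, a distance calculator followed by a comparator that keeps the running minimum, might look like the sketch below. It compares squared Euclidean distances, which is an assumption made for the example (it selects the same neighbor as the Euclidean distance while avoiding a square root); the function names are illustrative.

```python
def squared_distance(a, b):
    """Distance calculator: squared Euclidean distance between two 3D points."""
    return sum((a[k] - b[k]) ** 2 for k in range(3))


def nearest_neighbor(data_pt, model_pts):
    """Point comparator: scan every model point and keep the running minimum.

    Returns the pair (data_pt, closest model point).
    """
    best_dist = None
    best_pt = None
    for m in model_pts:
        dist = squared_distance(data_pt, m)
        if best_dist is None or dist < best_dist:
            best_dist, best_pt = dist, m
    return data_pt, best_pt


def match_all(data_pts, model_pts):
    """Run the sequential search once per data point: N_d * N_m comparisons."""
    return [nearest_neighbor(d, model_pts) for d in data_pts]


model_pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
data_pts = [(0.9, 0.1, 0.0), (0.1, 1.6, 0.0)]
pairs = match_all(data_pts, model_pts)
```

The inner loop always visits every model point, so the running time depends only on the point counts and not on the data, matching the constant-time behavior noted above.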
For N_d = 512 data points, N_m = 2048 model points (these numbers of data and model points will be maintained throughout this section for comparison), and a pipeline priming delay of p = 4, the search will take N_cyc cycles to complete:

$$N_{cyc} = N_d \cdot N_m + p = 512 \cdot 2048 + 4 = 1{,}048{,}580$$
A circuit running at 125 MHz would be able to perform 14.9 fps given eight iterations of ICP per frame (or 119.21 complete NN calculations per second). The benefit of the brute force approach is that it is simple to implement and is guaranteed to find the closest point. It also uses minimal hardware resources. Therefore, it is ideal for situations in which hardware space is limited and high speed is not a requirement.
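The quoted rates can be checked with a few lines of arithmetic. The sketch assumes that the pipeline produces one point-pair distance per clock cycle, so that a complete search costs N_d · N_m cycles plus the priming delay p; this assumed cycle count reproduces the 119.21 searches per second and 14.9 fps figures.

```python
N_D = 512           # data points
N_M = 2048          # model points
P = 4               # pipeline priming delay in cycles
CLOCK_HZ = 125e6    # circuit clock frequency
ICP_ITERATIONS = 8  # ICP iterations per frame


def sequential_cycles(n_d, n_m, p):
    """Cycles for one complete sequential brute force NN search, assuming
    one distance per cycle plus a one-off priming delay."""
    return n_d * n_m + p


n_cyc = sequential_cycles(N_D, N_M, P)
nn_per_second = CLOCK_HZ / n_cyc
frames_per_second = nn_per_second / ICP_ITERATIONS

print(f"{n_cyc} cycles, {nn_per_second:.2f} NN searches/s, "
      f"{frames_per_second:.1f} fps")
```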
Parallel Brute Force NN
The sequential approach can be extended to a parallel brute force search, with only a small amount of overhead required for the circuit. The circuit consists of many single brute force units combined with an I/O controller to load and store the results. The overhead for the circuit accounts for preloading the data points and writing the resulting point pair correspondences sequentially into memory. The number of cycles needed for one parallel NN search is:
where d = N_d is the number of data memory reads required by the system and 2d accounts for the data reads and pair writes performed before and after each parallel search, respectively. Also, p
Transform Recovery
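The transform recovery determines the registration transform between the model and data points that have been paired by the NN search. Transform recovery has been implemented using a combined hardware-software technique that first reduces the point pairs to a smaller matrix, which is then processed solely in software. The three hardware and software processing steps are as follows:
a) In hardware, calculate the centroids of the paired data and model points;
b) In hardware, calculate the [3 × 3] covariance matrix M that is used to form the [4 × 4] matrix N described by [3]:

$$N = \begin{bmatrix}
S_{xx} + S_{yy} + S_{zz} & S_{yz} - S_{zy} & S_{zx} - S_{xz} & S_{xy} - S_{yx} \\
S_{yz} - S_{zy} & S_{xx} - S_{yy} - S_{zz} & S_{xy} + S_{yx} & S_{zx} + S_{xz} \\
S_{zx} - S_{xz} & S_{xy} + S_{yx} & -S_{xx} + S_{yy} - S_{zz} & S_{yz} + S_{zy} \\
S_{xy} - S_{yx} & S_{zx} + S_{xz} & S_{yz} + S_{zy} & -S_{xx} - S_{yy} + S_{zz}
\end{bmatrix} \tag{6}$$

where S_ab is the element of M in row a and column b;
c) In software, the resulting N matrix of Equation 6 is processed using Singular Value Decomposition (SVD) to recover the transform between model and data points.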
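Steps a) and b) can be sketched in floating point as follows; the decomposition of step c) is left out. The sketch assumes that the covariance is accumulated as the sum of outer products of centered data points with centered model points, and it uses the standard form of the 4 × 4 matrix from [3]. The function names are illustrative, and the FPGA itself uses fixed point arithmetic for these steps.

```python
def centroid(pts):
    """Step a): mean of a list of 3D points."""
    n = len(pts)
    return [sum(p[k] for p in pts) / n for k in range(3)]


def covariance(data_pts, model_pts):
    """Step b), first half: 3x3 matrix of summed outer products of the
    centered data and model points (pairs are matched by index)."""
    cd, cm = centroid(data_pts), centroid(model_pts)
    S = [[0.0] * 3 for _ in range(3)]
    for d, m in zip(data_pts, model_pts):
        for a in range(3):
            for b in range(3):
                S[a][b] += (d[a] - cd[a]) * (m[b] - cm[b])
    return S


def horn_matrix(S):
    """Step b), second half: symmetric 4x4 matrix N built from S."""
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = S
    return [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]


# Identical point sets: the recovered rotation should be the identity.
pts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
S = covariance(pts, pts)
N = horn_matrix(S)
```

The reduction is what makes the hardware-software split practical: however many point pairs are processed, only a handful of matrix entries have to be handed to the softcore processor.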
Streaming
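Rather than centering the points on the data set centroids prior to the computation of M, the centering can be performed afterwards, as follows:

$$M = \sum_{i=1}^{N_d} (D_i - \mu_D)(M_i - \mu_M)^T = \sum_{i=1}^{N_d} D_i M_i^T - N_d\, \mu_D\, \mu_M^T$$

where μ_D and μ_M are the centroids of the paired data and model points.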
Experimental Results
The system was implemented on an Amirix AP1100 Development Board. The AP1100 comprises a single Xilinx Virtex-II Pro (XC2VP100) SRAM FPGA with over 99,000 logic cells, each having one 4-input look-up table (LUT). Approximately 1000 Kb of block random access memory (BRAM) is available along with 1164 IO pins, two PowerPC cores, and 444 18 × 18 bit multipliers.
Because this tracker was designed as part of a concurrently developed three-component project, the stereo sensor and stereo extraction components were unavailable for testing, and were substituted with a consumer sensor and software to provide the disparity data. The Point Grey Research Bumblebee stereo vision sensor was chosen because it provides comparable 640 × 480 disparity images with the 128 disparities required for tracking.
To test the system, three objects of interest, an Angel (190 mm × 160 mm × 250 mm), Big Bird (410 mm × 310 mm × 230 mm), and a Cube (180 mm × 180 mm × 180 mm), were chosen and are illustrated in Figure 4 . All three objects had distinct surface geometries.
Initial tests were conducted in hardware to determine the number of ICP iterations required to reach a good pose estimate with an RMS error of under 10 mm. An object was moved through a known linear path by a robotic arm for 100 frames at a speed of 316 mm per second. See Figure 5 for an image of the test configuration.
Results of this test, the average error and standard deviation for the 100 frames, are plotted in Figure 6 . The RMS error drops to under 10 mm after four ICP iterations. It can also be observed from the graph that the standard deviation of the 100 frame sequence starts to settle after only three ICP iterations. As a result, eight iterations were selected for use throughout the remaining experimentation, to ensure that the ICP has had adequate time to converge.
Path Analysis
A free form path test was conducted to verify that the tracker was able to follow arbitrary paths. The objects were hand-held and moved through a sequence of 1000 frames. The recovered path output for the Angel object is plotted in Figure 7. Figure 8 shows four frames of a free form sequence test with overlays. The images were obtained by sampling the left stereo sensor at various frames. Tracker output overlays were then applied to the images in post processing on a PC. These overlays indicate the tracker's current configuration (the ROIs and the estimated poses) so that its performance can be observed. The outermost blue overlay is the 2D ROI, and the middle green overlay is the 3D ROI. The innermost yellow overlay represents the estimated pose of the object. These free form path tests verified the tracker's ability to follow arbitrary movements for all three objects tested.
The recovered paths of an object's movement were also analyzed quantitatively. The ground truth positions of the object in each frame were unavailable, so only relative paths (paths with no absolute reference frame) were considered. Paths with simple trajectories were chosen so that they could be easily aligned and compared to the extracted paths.
The two quantitative tests are the linear path and rotational path tests. These tests were performed using both the hardware and a software version of the ICP algorithm. The results show that the tracked paths in hardware and software have similar errors for each test. This indicates that the hardware implementation properly replicated the software implementation, which was known to be correct.
Linear Path
The three objects of interest were processed in hardware to determine how well the tracker could follow a 100 frame linear path 632 mm long, with the object moving at an effective rate of 316 mm per second (were the sensor acquiring at 200 fps). After the tracker was used to track the linear path, the estimated path was fit to a line to determine how well the object was tracked during its linear motion. The recovered path of the Angel and the best fit line are graphed in Figure 9. The average Euclidean distance from the ideal path and its standard deviation were calculated, as were the average angle error and its standard deviation. The plots of the linear test results for the two other models appear quite similar, and the results of all three are tabulated in Table 1. The average displacement off the linear path is 1.57 mm, and the average rotation error of the objects over the linear path is 2.66 degrees.
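The displacement statistics can be illustrated with the sketch below, which fits a line to a set of 3D positions and reports the mean and standard deviation of the point-to-line distances. The fitting method is not specified above, so the sketch substitutes a simple choice: the line passes through the centroid along the dominant direction of the scatter matrix, found by power iteration. All names are hypothetical, and the sample data is synthetic rather than taken from the experiments.

```python
import math


def fit_line(points, iterations=100):
    """Return (centroid, unit direction) of a least squares 3D line.

    The direction is the dominant eigenvector of the scatter matrix, found
    by power iteration (assumes the start vector is not orthogonal to it).
    """
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    scatter = [[sum((p[a] - c[a]) * (p[b] - c[b]) for p in points)
                for b in range(3)] for a in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        w = [sum(scatter[a][b] * v[b] for b in range(3)) for a in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return c, v


def distance_to_line(p, c, v):
    """Euclidean distance from point p to the line through c with direction v."""
    r = [p[k] - c[k] for k in range(3)]
    t = sum(r[k] * v[k] for k in range(3))
    return math.sqrt(sum((r[k] - t * v[k]) ** 2 for k in range(3)))


def mean_and_std(values):
    """Mean and population standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, math.sqrt(var)


# Synthetic path along the x axis with points 1 unit either side of it.
path = [(float(i), s, 0.0) for i in range(10) for s in (1.0, -1.0)]
center, direction = fit_line(path)
mean_dist, std_dist = mean_and_std(
    [distance_to_line(p, center, direction) for p in path])
```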
Rotational Path
The rotational path test aimed to determine how well the tracker could follow a rotating object during a sequence of frames, independent of the translational components of motion. The setup for this test was identical to the linear test, except that the robot was rotated through 80 degrees along one joint over a sequence of 100 frames. The rotation was limited to 80 degrees to prevent robot object occlusion. The effective rotational rate was 160 degrees per second, were the tracker operating at 200 fps. Because the object was not rotated exactly around its centroid, the resulting translation of the object was ignored. After the tracker had recovered the path of the object, the path was then processed to determine both the average change in the rotational axis and its standard deviation. These metrics provided an indication of how well the tracker was able to follow the rotation.
The results of the rotational test can be seen in Table 2. The average axis error of only 0.39 degrees indicates accurate tracking of the single axis rotations.
FPGA Efficiency and Utilization
For the hardware tests, the FPGA's clock was operating at 125 MHz, which was the maximum frequency at which the MicroBlaze softcore processor that performed SVD could operate. Table 3 provides the results of speed tests comparing the brute force software, k-d tree software, Ak-d tree software [2], and hardware parallel brute force ICP implementations. The parallel brute force method utilizing 16 parallel units is shown to be over five times faster than the software k-d tree, and 18% faster than the Ak-d tree approximate NN running on a Pentium M 2.2 GHz processor. The software timing does not include the time needed to load the disparity image into memory, only the time needed to run ICP on the disparity data.
With more parallel units, utilizing more resources of a larger FPGA, we could expect the parallel brute force NN to provide even greater speed. Figure 10 shows the expected performance of the parallel brute force search for different numbers of parallel units. The speedup is limited by the number of model points searched, and as the number of parallel units increases, the relative speed enhancement slowly decreases. Even so, if higher NN processing speeds are required, the NN performance can be increased significantly simply by adding parallel units.
Table 4 provides the FPGA component utilization of the object tracking component, as well as of its initial integration with the stereo extraction component (64 disparities). About half of the utilization of the object tracker component is due to the MicroBlaze softcore processor. This softcore processor was necessary for transform recovery using SVD and to act as a debugging interface for the FPGA. After merging with the stereo extraction component with 64 disparities, BRAM utilization became the constraining factor for the place and route tools. Table 4 indicates that 85% of the memory blocks have been utilized.
Table 5 lists the projected power consumption of the implemented circuit. Because of the low power required and the single FPGA design, this system would be beneficial for situations that require minimal power dissipation in a small package.

Conclusion
We have presented an FPGA-based hardware ICP object tracker which is able to track range data at over 200 fps, with eight ICP iterations per frame. The tracker utilizes a single FPGA, making it compact and lightweight, with low power requirements. A pre-filter stage was applied to restrict the data to the expected object location, thereby minimizing outliers. A parallel brute force NN approach with 16 parallel units was used to calculate correspondences, and executed at over 200 fps. Further speed improvements can be gained by increasing the number of parallel NN units, at the cost of consuming additional FPGA resources. The resulting transform is recovered by applying SVD on a softcore processor. Experimentation shows that the tracker is able to track with 1.57 mm positioning error (0.85% of the object's size) for linear translational tests, and 0.39 degrees of rotational axis error for rotational tests.
Further work will optimize the system to decrease hardware memory utilization. The softcore processor for SVD transform recovery could be realized in hardware for additional hardware space savings.
