We present an implementation of the Speeded Up Robust Features (SURF) on a Field Programmable Gate Array (FPGA). The SURF algorithm extracts salient points from image and computes descriptors of their surroundings that are invariant to scale, rotation and illumination changes. The interest point detection and feature descriptor extraction algorithm is often used as the first stage in autonomous robot navigation, object recognition and tracking etc. However, detection and extraction are computationally demanding and therefore can't be used in systems with limited computational power. We took advantage of algorithm's natural parallelism and implemented it's most demanding parts in FPGA logic. Several modifications of the original algorithm have been made to increase it's suitability for FPGA implementation. Experiments show, that the FPGA implementation is comparable in terms of precision, speed and repeatability, but outperforms the CPU and GPU implementation in terms of power consumption. Our implementation is intended to be used in embedded systems which are limited in computational power or as the first stage preprocessing block, which allows the computational resources to focus on higher level algorithms.
I. INTRODUCTION
Most of today's mobile robots utilize computer vision methods to gain information about their surrounding environment. Significant advantages of vision-based sensing are affordable price of cameras, amount of provided data and easy interpretation of these data by humans. The quantity of provided data can be also regarded as a drawback, because real-time processing of image data streams requires computationally strong hardware. However, this drawback diminishes due to Moore's law and therefore machine vision methods have found applications in mobile robotics, where real-time processing is necessary. Vision-based object recognition [1] , reactive navigation [2] , three-dimensional reconstruction [3] , and efficient mapping and localization methods [4] [5] are used in mobile robotics [6] today.
Many of these methods rely on feature extraction algorithms [7] like Scale Invariant Feature Transformation [8] , Gradient Location and Orientation Histogram [9] or Local Energy based Shape Histogram [10] . Although these algorithms are based on different principles and therefore perform differently [9] , their purpose is to provide descriptors of significant image areas, which are immune to illumination and camera position changes. Since detected feature descriptors contain information independent of viewpoint and illumination, they are especially suitable for image-matching tasks. However, these algorithms are computationally demanding and therefore represent a significant bottleneck in many computer vision systems, forcing researchers to focus on improvement of their speed.
One of possible ways to increase speed of feature extraction is to exploit the fact, that image processing can easily be parallelized. Implementations of feature extraction algorithms utilizing GPUs [11] [12] have achieved performance over 30 FPS, which is considered as suitable for realtime applications. Although these implementations are based on an affordable computational hardware, small robotic platforms usually can not afford to carry entire PC system. As a consequence small robots have to use small intelligent cameras, e.g. CMUCam 1 , with on-board image processing. These intelligent cameras can rely on standard microcontrollers, digital signal processors or specialized integrated circuits [13] . A popular choice for embedded machine vision is usage of the Field Programmable Gate Arrays (FPGA), because they effectively utilize parallel nature of the image processing. Several authors [14] [15] [16] [17] reported successful implementation of robotic vision algorithms on a FPGA-based hardware. Among these, some conclude that their SIFT algorithm implementation performs better than on conventional CPUs [17] and some demonstrate their implementation in a real robotic system [16] .
One of the most efficient feature extracting algorithms is the Speeded Up Robust Features (SURF) [18] , which outperforms SIFT implementations on general purpose CPUs in both speed and robustness. Most significant improvement in calculation speed is achieved by use of "Integral Image", which allows fast calculation of box filter responses used throughout the algorithm. The SURF algorithm has been reported to achieve around 2 frames per second on a general purpose CPU.
We present an implementation of this algorithm massively accelerated by the FPGA logic. Considering the FPGA architecture, we have decided that only some parts of the original SURF algorithm are suitable for implementation using FPGA logical blocks, while other parts are implemented on CPU with floating-point arithmetics. Although our implementation follows the original definition as closely as possible, we had to apply optimizations which affect precision of the detector and therefore repeatability and distinguishability of the whole algorithm. Performed experiments show that the repeatability and robustness of the FPGA-SURF have not been severely affected and are comparable to the original and GPU-SURF implementation. However, the FPGA-SURF is faster than the original -it processes approximately 10 images (1024x768 pixels) per second 1 , consumes less than than 10 W and occupies less space than GPU-based system. Higher speed of feature extraction at lower size, weight and power consumption opens 1 Considering around 100 descriptors per image. 
I-A. Paper structure
Next chapter will overview the SURF algorithm and outline its detection and description stages. The following section is concerned with the implementation itself. Afterwards, we present experiments comparing results of FPGA-SURF with the original SURF implementation and show, how the FPGA-SURF can be used as a part of mobile mobile robot navigation system.
II. ALGORITHM
This chapter contains brief description of SURF algorithm, according to definition in [18] . Algorithm consists of four main parts: 1) Integral image generation, 2) Fast-Hessian detector (interest point detection), 3) Descriptor orientation assignment (optional), 4) Descriptor generation.
Integral image is used by all subsequent parts of algorithm to significantly accelerate their speed. Integral image is defined by eq. (1). When using integral image, it is necessary to always read only four pixel values to calculate surface integral of any size from original image. Fox example integral over grayed area in image 2 is equal to Σ =
. This fact is widely used when calculating Gaussian and Haar wavelet filter responses.
SURF uses determinants of Hessian matrices to locate image's significant points. Equation (2) shows original definition of this determinant for general two dimensional function. Fast-Hessian detector modifies this equation in two significant ways: 1) Second order partial derivatives are replaced by convolutions of image with approximated Gaussian kernels second order derivatives. Approximation is done by box filters with coefficients 1, −1, 2, −2 2 . Coefficient 0.9 in eq. (3) is used to compensate this approximation.
2) Gaussian kernels are not parameterized only by position in image, but also by their size. To achieve scale invariance, algorithm searches for significant points on multiple image scale levels. Since scaling whole image is computationally expensive, SURF only scales Gaussian kernels (exactly box filters which approximate them). Scaling is represented by third parameter s in eq. 3, this parameter creates the three dimensional space of determinant results, usually refered to as "scale space". Scale differs (and is quantized) according to octaves and intervals. As we can see in table I box filter size increase doubles between octaves, moreover image sampling step also doubles. When determinants in all preselected scales and intervals are calculated they are filtered by positive threshold value. Afterwards a non-local maxima suppression is performed among 26 closest neighbors in determinant scale space. Determinant local maxima marks position of interest point. However, this position has to be refined by interpolation to sub-pixel precision.
Equation (4) shows second order Taylor polynomial approximation of Hessian in scale space centered around examined local maxima, where x = (x, y, s). Position correction is obtained by seeking for zero value of this polynomial's derivative. Resulting formula for position correction is eq. (5), wherex = (x,ŷ,ŝ). Local maxima is accepted as interest point if none of absolute values of position correction vectorx elements is bigger than 0.5.
Rotation invariance is achieved by assigning each interest point a "dominant direction". If application doesn't require descriptors to be rotation invariant we can simply skip steps described in this paragraph and assign orientation (0, 0). A circle of radius 6s is set around interest point. Inside a circle Haar wavelet responses in x and y direction are calculated from 4s-sized filters with sampling step s. These responses are weighted by Gaussian function with σ = 2.5s centered around interest point. Afterwards they are summed by a sliding sector window of π/3 with approximate step π/18. Longest vector obtained as this window sum is chosen to be descriptor dominant direction.
The SURF descriptor is calculated from square interest point's neighborhood with edge size 20s. This neighborhood is rotated to match interest point dominant direction. This square area is further divided into 16 equal sub-squares with edge size 5s. Inside each of these sub-square areas 25 Haar wavelet filter response pairs are calculated (filter size 2s, sampling step s, one direction along interest point dominant direction -d d , one perpendicular in positive sense -d p ). Responses are weighted by Gaussian function with σ = 3.3s and summed within sub-square into four di-
These vectors are chained (one for each of 16 sub-squares) to form 64 dimensional SURF descriptor. Descriptor elements are further scaled so resulting descriptor is a unit vector.
Output of algorithm for one image gives us set of interest point locations in image's scale space along with set of descriptors describing areas around interest points. To increase matching performance the sign of Laplacian (i.e. trace of the Hessian matrix) is included into the algorithm output. Since interest points are usually found on blob-type structures this sign distinguishes light blobs on dark background from opposite.
III. IMPLEMENTATION
This chapter is dedicated to description of hardware design and software controlling functions. The algorithm to generate the SURF descriptor is written according to specification of original SURF with a few minor optimizations. As mentioned before, for the purpose of the FPGA implementation, only the Fast-Hessian detector part of SURF has been chosen for hardware-only implementation. Created hardware IPblocks provide integral image, determinants for all inspected scale levels and also marks local maxima in calculated determinant scale-space. Generation of the SURF descriptor is handled entirely by software. Descriptor generation by itself is still quite time demanding, that is why we have chosen XC5VFX70 FPGA which incorporates PowerPC-440 CPU core with floating-point co-processor.
Hardware design of the Fast-Hessian detector introduces two major limitations in comparison with other implementations: 1) Determinant calculation is done in an integer arithmetics with limited precision. 2) Hardware block is de-signed for specific number of octaves and scale intervals with limited image size. Current image size limit is 1024 × 1024 pixels and IP-blocks are designed to calculate determinants in 2 octaves and 4 intervals per octave. simple DMA controllers, which allow them to manage data transfers from and to system memory without any CPU interaction. PLB2 bus is designated for these transfers and has accordingly increased capacity. PixelBus is dedicated simple one-way bus which transfers integral image data into Fast-Hessian generator for determinant calculation. Figure 3 also includes some design details, such as bus widths and clock frequencies, which are vital in order to achieve stated performance. fig. 3 ) consists from this calculation core and simple DMA controller to transfer data to and from this block. fig. 3 ) IP-block. This block generates approximated determinants of Hessian matrices and searches for local maxima among them. An overview of FPGA's dedicated blocks, bus widths and frequencies usage is summarized in Figure 5 . Incoming integral image data are stored in large FIFO inside MasterController block. An access to integral image area of 52 × 52 pixels 3 . is need to calculate response of the largest Gaussian filter. Data in this area are accessed many times and in very unsequential manner, that's why MasterController FIFO stores 56 integral image lines and allows access to them through 4 independent busses. This configuration delivers enough bandwidth for integral image data used during determinant calculation. 18 TripleMAC (Multiply-Accumulate) units are connected to these busses to capture integral image data and calculate Gaussian filter responses from them. A block HessianCalc calculates actual determinants from Gaussians. Determinants are written into system memory using simple DMA controller in SSA FHG (in fig. 3 ) IP-block.
III-

III-B. Integral image generator
III-C. Fast-Hessian generator
To achieve higher performance optimization determinants are calculated not one by one but in "determinant blocks". Determinant block consists of all determinants calculated for 2 × 2 pixels image area. Since octave 2 has doubled sampling step and first 2 intervals in octave 2 have same filter sizes as intervals 2 and 4 from octave 1 we have to calculate only 18 determinants in each block (instead of 20). Integral image data are sent out through four output busses in pre-optimized order. This order is stored in "Control and addressing sequence memory". This memory also holds Control and addressing sequence must be optimized in order to achieve stated performance. To carry out this optimization a specialized script, which generates this sequence, has been written. This sequence runs for every 2 × 2 pixels image block and delivers all integral image data needed for calculation of one determinant block.
One TripleMAC block handles calculation of three Gaussian filter responses. Incoming data are scaled by one of 1, −1, 3, −3 coefficients based on input control signals and added to one of three filter responses. After each run of MasterController's sequence a signal indicating end of this sequence is being generated. This signal causes TripleMAC blocks to load results into output buffers and to reset accumulators. On the output side all blocks are connected to one three-state bus which is used to read results from them. To save FPGA's resources TripleMAC performs three cycles of addition/subtraction instead of multiplication by coefficients 3, −3. This fact has to be reflected in control and addressing sequence since TripleMACs don't report MasterController their busy state. Block LocMaxFinder stores at least 3 determinant blocks and compares each determinant with its neighbors to search for local maxima. This block doesn't have sufficient memory resources to store all data needed to examine whole neighborhood of each determinant. Thus it performs nonmaxima suppression only throughout neighbors from two determinant blocks. Each determinant marked by this block as local maxima has to be examined again by software. Determinants marked by this block are stored into FIFO in IP-block SSA FHG, which is readable through SPLB interface and generates interrupts based on its occupancy.
Units MasterController, HessianCalc and LocMaxFinder are controlled by embedded sequences. In case of MasterController this sequence is quite large and needs to be stored in block RAM, other two use FPGA logic, since sequences are short. Generation of control sequences for HessianCalc and LocMaxFinder is fairly simple -these only need to process incoming determinants in preset order. However, MasterController's sequence has severe impact on system's performance and is not trivial to generate. To generate this sequence only very basic algorithm has been written and thus introduces approximately 11% increase in determinant block processing time compared to optimal state 4 .
III-D. Control software
This part of software manages mainly data transfers to and from IP-blocks and is represented by "libSSA2" library. Processing is divided into tasks. Library provides functions to create and process these tasks. Each task represents one image and has designated memory space to store image data, integral image data and determinants. SURF descriptors are being passed to superordinate application through callback. Task processing has four phases: 1) integral image generation, 2) determinant generation, 3) local maxima localization, 4) SURF descriptor generation. To optimize processing speed, all these phases may overlap as well as processed task may overlap. First two are carried out by hardware, library only sets data source and destination addresses in IP-block DMA controllers and waits for interrupts. Third phase is only intermediate state when image is done processing in hardware but there are still some local maxima determinants unread from SSA FHG FIFO. Fourth phase is actually performing all that is left for "software" part of our SURF implementation, i.e. interest point position interpolation and SURF descriptor generation. Since all hardware is wrapped up into two Xilinx Platform Studio IP-blocks, its very simple to put more calculation accelerators on one chip and library supports this option.
IV. RESULTS
Although we have tried to follow the original description as closely as possible, some changes had to be made to optimize algorithm for implementation in the FPGA logic. In current state of development this of course applies only to detection stage. There has been a great concern for the possible degradation of the algorithm due to loss of precision in the detection phase. To evaluate impact on algorithm efficiency, we have performed tests comparing the repeatability of SURF, GPU-SURF and FPGA-SURF. These tests used a dataset of 1500 images captured by a mobile robot moving in a park environment.
IV-A. Measuring distinguishability
By distinguishability, we understand a chance of algorithm to establish a valid correspondence. This is done by establishing correspondencies solely by their descriptor Euclidean distance and checking these against geometric constraints of multiple view geometry [19] .
For every feature f a in an image A, we compute descriptor's Euclidean distance to each of feature descriptors in an image B. Two features from image B, which are most similar to f a (i.e. they have minimal descriptor distance) are selected and therefore form two pairs with f a . If the most similar feature pair has its descriptor distance lower than 75% of the second most matching pair descriptor distance, we proclaim it a correspondence candidate. After we find all correspondence candidates for every feature in the image A, we perform a check based on the epipolar geometry [19] .
The fundamental matrix is established by the eight-point algorithm [19] and RANSAC [20] . For every feature in image A, we find an epipolar line in image B and compute the distance of the corresponding feature from that epipole. If this distance is greater than 3 pixels, the correspondence candidate is discarded, otherwise it is considered valid. The ratio of valid correspondencies to the number of correspondence candidates is a measure of the SURF algorithm distinguishability. The higher ratio means better algorithm performance.
When evaluating algorithm performance, we compute this ratio for every two consecutive images of a particular dataset. The average ratio is then taken as algorithm's distinguishability measure.
Every algorithm has been tested with four versions of the method establishing the correspondence candidates. Because the camera viewpoint doesn't change a lot between two consecutive images in the datasets, we can limit the search for a correspondence to a certain distance (200 pixels). Moreover, we might or might not match the features with same or different Laplacian sign when establishing the correspondences. The table II compares the distinguishability for our, 1500 image dataset. Similar results were obtained on a part of the dataset 2 used by the authors of the original SURF algorithm.
IV-B. Simple FPGA-SURF based navigation system
Finally, to demonstrate the usability of the proposed implementation, we deployed a navigation system with an image preprocessing module based on FPGA-SURF. The system composes of a P3AT robot with an Unibrain Firei601c camera with 1024x768 pixel resolution, TCM2 compass and HP 8710p laptop. This navigation system is first guided through the environment by means of teleoperation. During this drive, the robot searches for SURF features in pictures from the onboard camera. As the robot moves, it tracks the features, estimates their distance and creates a three dimensional map of them. Using this map, the system can estimate its position using currently sensed and mapped features. Therefore, the robot is able to operate in the environment. After the teleoperated drive is finished, the system can localize and navigate the robot through the mapped environment simply by matching currently captured and previously mapped. We have mapped part of the Czech technical university campus and let the robot traverse the learned path. The system has proven to be able to faultlessly navigate the learned path, which was approximately 500 meters long.
V. CONCLUSION
We have presented an implementation of the SURF algorithm massively accelerated by the FPGA logic. Despite several simplifications imposed by the limitations of FPGA hardware, our implementation offers similar distinguishability and robustness as the GPU-SURF implementation. The FPGA-SURF implementation achieves about 10 FPS at HD (1024x768 pixels) resolution, which is a necessity for real-time operation. The total power consumption of this device is less than 10 W, making it suitable for smaller robotic platforms. In the future, we will extend the algorithm by feature matching as a first step towards an embedded device realising visual localization and mapping. Moreover, we would like to port the algorithm to a Xilinx Spartan-6 family of FPGA chips, reducing the costs of the final system. 
ACKNOWLEDGMENTS
