In this paper, a power-efficient, real-time image feature detection system based on the Speeded-Up Robust Features (SURF) algorithm is implemented. We optimized the SURF algorithm and implemented it on the FPGA fabric of a Xilinx ZYNQ-7020 device. Our SURF circuit runs at a clock frequency of up to 100 MHz and processes standard VGA (640 × 480) grayscale images at up to 270 fps. The system is implemented on the ZYNQ platform with a hardware and software co-design approach and runs an embedded Linux system. A GUI application for Linux, designed with Qt and OpenCV, captures video and processes and displays images or video. The system meets the real-time and low-power requirements of embedded devices and has great practical value.
INTRODUCTION
Image feature detection is a popular topic in computer vision research. Features are essential characteristics of an image that effectively reflect its most relevant information. Finding the correspondence of the same scene or object between two images by detecting feature points is an indispensable step in practical computer vision applications.
Recently, FPGAs (Field Programmable Gate Arrays) have been widely used to accelerate image algorithms because of their programmability. We designed a fully pipelined and parallel FPGA architecture for the SURF algorithm, which greatly improves the speed of the algorithm.
There are many algorithms for detecting image feature points. The SIFT (Scale Invariant Feature Transform) algorithm was proposed by David G. Lowe in [1]. SIFT has been shown to offer rotation invariance, scale invariance and high accuracy, but its computational cost is high. The FAST algorithm [2] is known for its computational performance and high repeatability, but it is not the best choice under varying viewing conditions because of its poor robustness. The SURF (Speeded-Up Robust Features) algorithm, proposed in [3], is based on SIFT. SURF retains the advantages of SIFT while significantly reducing the computational complexity by using the integral image concept. In recent years, several studies have implemented the SURF algorithm on FPGAs. Reference [4] implemented the SURF algorithm on the FPGA fabric of a Xilinx Virtex-4 device in 2013; it can process only 64 frames per second (fps) and consumes a lot of logic resources. C. Wilson also implemented a fully pipelined and parallel architecture for the SURF algorithm on the ZYNQ in 2014 [5]. That design processes VGA resolution images at 131 fps, but it made little progress in resource consumption. In 2015, the SURF algorithm was also optimized in [6] and [7]; those designs reached 25 and 50 fps, without outstanding performance. All of the designs mentioned above accelerate the SURF algorithm on FPGA. However, they are difficult to apply in practical engineering because of interface and clock problems.
Weijie Cai, Fei Wang, Zhenghui Xu, Ziqi Li, HIT University, ShenZhen, China
Hence, this paper implements an image feature point detection system with hardware and software co-design on a ZYNQ device. The SURF feature point detection algorithm is optimized and implemented in the programmable logic of the ZYNQ device for acceleration, which makes it applicable to practical engineering.
The ZYNQ-based SURF image feature point detection system implemented in this paper has the following innovations:
1) Implementation of an optimized SURF algorithm on FPGA: logic resource consumption is reduced while processing speed is increased, reaching up to 270 fps for standard VGA resolution images.
2) System design with hardware and software co-design: an embedded Linux operating system is built on the hardware system, which is based on the FPGA-accelerated SURF circuit. The resulting system is real-time, power-efficient and highly portable, and has practical application value.
SURF CIRCUIT DESIGN AND OPTIMIZATION

SURF Circuit Design
Following [5], we designed a parallel, pipelined architecture for the SURF algorithm. Our SURF architecture has four modules: the Integral Image Module, the Integral Image Buffer Module, the Hessian Calculation Module and the Non-Maximal Suppression Module.
A standard VGA resolution grayscale image enters the Integral Image Module as a data stream. The Integral Image Module computes the integral pixel values and passes them to the Integral Image Buffer Module.
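The streaming behaviour described above can be sketched in software. The following is a minimal illustration, not the circuit itself: it assumes pixels arrive in raster order and that one line buffer holds the integral values of the previous row. The function name and buffer layout are hypothetical.

```python
def integral_image_stream(pixels, width):
    """Consume pixels in raster order and yield one integral value per pixel."""
    prev_row = [0] * width  # integral values of the previous row (line buffer)
    cur_row = [0] * width   # integral values of the row being processed
    row_sum = 0             # running sum of the pixels in the current row
    for i, p in enumerate(pixels):
        x = i % width
        if x == 0:
            row_sum = 0     # a new row starts: reset the row accumulator
        row_sum += p
        # integral(x, y) = integral(x, y - 1) + sum of row y up to column x
        cur_row[x] = prev_row[x] + row_sum
        yield cur_row[x]
        if x == width - 1:
            # end of row: the current row becomes the previous row
            prev_row, cur_row = cur_row, prev_row
```

Each output depends only on the incoming pixel, one accumulator and one stored row, which is why the computation suits a pixel-stream pipeline.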
The Integral Image Buffer Module traverses the integral image with a sliding window. The Hessian Calculation Module then calculates the Hessian determinants in parallel from the data in the sliding window. There are 8 scales, corresponding to box filter sizes 9, 15, 21, 27, 33, 39, 45 and 51. The first step is to convolve the integral image with the box filters, which generates 8 Hessian matrices, one per scale; the determinants of these 8 matrices are then calculated. The last stage of the SURF circuit proposed in this paper is non-maximal suppression, which locates the feature points. The Non-Maximal Suppression Module compares the Hessian determinant at the target point with the user-defined threshold and with the 26 Hessian determinants in the three-dimensional neighborhood spanning the three adjacent scales centered on the target point; if it is the maximum, the target point is a feature point.
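The non-maximal suppression test can be illustrated with a short software sketch. It assumes the determinants are stored as a nested list indexed by scale, row and column, and that the target point is not on the border of that volume; the function name and data layout are hypothetical and say nothing about how the circuit stores its data.

```python
def is_feature_point(dets, s, y, x, threshold):
    """Return True if dets[s][y][x] exceeds the threshold and all 26 neighbors.

    dets[s][y][x] holds the Hessian determinant at scale index s, row y,
    column x. The caller must keep (s, y, x) away from the volume border.
    """
    center = dets[s][y][x]
    if center <= threshold:
        return False
    for ds in (-1, 0, 1):          # three adjacent scales
        for dy in (-1, 0, 1):      # three adjacent rows
            for dx in (-1, 0, 1):  # three adjacent columns
                if (ds, dy, dx) == (0, 0, 0):
                    continue       # skip the target point itself
                if dets[s + ds][y + dy][x + dx] >= center:
                    return False
    return True
```

In hardware the 26 comparisons can run in parallel; the nested loops here only enumerate the same neighborhood.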
SURF Circuit Optimization
The Hessian Calculation Module consumes numerous logic resources for building the scale space and for mathematical operations. The optimization of the Hessian Calculation Module is therefore described below.
H.W.J. Belt proposed in [8] that the integral image binary word length $L_{ii}$ satisfies (1),
where $L_i$ represents the binary word length of the input image pixels, and $W$ and $H$ are the width and height of the image.
As we know, the approximate Gaussian second-order derivative can be calculated by (2), where $S$ stands for area. However, (2) requires 3 more addition or subtraction operations than (3), so calculating with (3) saves the logic resources for those 3 operations.
So, $S_{ADEH}$ is the maximal area, and the maximal box filter size is 51 × 51; hence, the minimum binary word length of the data for the Hessian Calculation Module is 18.7 bits. According to (1) and taking the structure of the Xilinx DSP48E1 into account, however, the optimal value is 21 bits.
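The figures in this paragraph can be checked with a short sketch. Equations (1) to (3) are not reproduced in this text, so everything below is an assumption: pixels are taken to be 8 bits wide, $S_{ADEH}$ is taken to be the whole 33 × 51 region covered by the second-order filter at size 51 (three stacked lobes of height 17 and width 33), the word length is read as the base-2 logarithm of the largest possible box sum, and the two derivative forms are the common "three weighted lobes" and "whole region minus three times the middle lobe" expressions. All function names are hypothetical.

```python
import math

def integral_image(img):
    """Integral image padded with a leading zero row and zero column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of a rectangle from four integral image lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def dyy_three_lobes(ii, top, left, lobe, width):
    """Assumed form of (2): three box sums weighted +1, -2, +1."""
    upper = box_sum(ii, top, left, lobe, width)
    middle = box_sum(ii, top + lobe, left, lobe, width)
    lower = box_sum(ii, top + 2 * lobe, left, lobe, width)
    return upper - 2 * middle + lower

def dyy_whole_minus_middle(ii, top, left, lobe, width):
    """Assumed form of (3): whole region minus three times the middle lobe."""
    whole = box_sum(ii, top, left, 3 * lobe, width)
    middle = box_sum(ii, top + lobe, left, lobe, width)
    return whole - 3 * middle

def min_word_length(area, pixel_bits=8):
    """Bits needed for the largest box sum: log2(area * (2**pixel_bits - 1))."""
    return math.log2(area * (2 ** pixel_bits - 1))
```

Under these assumptions the two derivative forms always agree, the second one needs one box sum fewer, and `min_word_length(33 * 51)` evaluates to about 18.7, matching the value quoted above.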
The Hessian matrix determinant is calculated by (4), where the $D$ terms are the second-order Gaussian derivatives normalized with respect to the filter size, and $w$ is a correction factor.
Evaluating (4) directly requires division, which costs numerous logic resources on an FPGA. Fortunately, all the filter sizes are constants, so the division can be replaced by multiplication on the DSP slices, as shown in (6).
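The idea of trading a division by a constant for a multiplication can be sketched as follows. This is an illustration only: it assumes the normalization divides by the filter area (size squared) and uses a hypothetical fixed-point format with 16 fractional bits. Equation (6) is not reproduced in this text, and the constants actually used in the circuit may differ.

```python
FRAC_BITS = 16  # assumed number of fractional bits in the fixed-point reciprocal

FILTER_SIZES = (9, 15, 21, 27, 33, 39, 45, 51)

# One precomputed constant per filter size: round(2**FRAC_BITS / size**2).
# In hardware these would be fixed coefficients fed to the multipliers.
RECIPROCALS = {s: round((1 << FRAC_BITS) / (s * s)) for s in FILTER_SIZES}

def normalize(response, size):
    """Approximate response / size**2 with one multiply and one shift."""
    return (response * RECIPROCALS[size]) >> FRAC_BITS
```

For example, `normalize(8100, 9)` approximates 8100 / 81 = 100 to within one unit. The rounding error grows as the reciprocal shrinks, so a wider fractional part would be needed if the largest filter sizes required higher precision.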
BUILDING THE IMAGE FEATURE POINT DETECTING SYSTEM
The SURF image feature point detection system is implemented with hardware and software co-design. The Xilinx ZYNQ-7000 SoC makes such co-design convenient, so the system described in this paper is implemented on a ZedBoard. The ZYNQ device provides high-performance interfaces between the programmable logic (PL) and the processing system (PS), which enable high-bandwidth communication between the two.
We used the Vivado development suite to design the hardware of the system and created a SURF IP from the SURF circuit described above. The PL contains the SURF IP for accelerating the SURF algorithm, video DMAs (VDMAs) for transferring image data to and from the PS, and other IPs. The PS uses the AXI-Lite protocol to control the IPs in the PL through the AXI_GP port; the VDMAs, the SURF IP and the other IPs are configured through this port. The VDMAs read image data from and write image data to DDR memory through the AXI_HP port. To improve efficiency, we allocated 2 buffers in DDR memory as a ping-pong cache: while one VDMA reads data from one buffer, the other VDMA can write data to the other buffer.
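The ping-pong scheme can be modelled in a few lines of software. The sketch below only illustrates the access pattern and is not the VDMA configuration: the class, its method names and the use of byte arrays in place of DDR regions are all hypothetical.

```python
class PingPongBuffers:
    """Two frame buffers: one is written while the other is read."""

    def __init__(self, frame_size):
        self.buffers = [bytearray(frame_size), bytearray(frame_size)]
        self.write_index = 0  # buffer currently owned by the writer

    def write_frame(self, data):
        """Writer side: fill the buffer that the reader is not using."""
        self.buffers[self.write_index][:] = data

    def read_frame(self):
        """Reader side: return the most recently completed frame."""
        return bytes(self.buffers[1 - self.write_index])

    def swap(self):
        """Exchange roles once the writer has finished a frame."""
        self.write_index = 1 - self.write_index
```

Because reader and writer never touch the same buffer between swaps, neither has to wait for the other to finish a frame.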
We built the embedded Linux operating system on the hardware system described above. A hardware driver layer handles communication between the Linux kernel and the hardware system. Because a graphical user interface makes the system convenient to operate, we developed an application with Qt that can capture video and process and display images or video.
EXPERIMENTAL RESULTS
The SURF image feature point detection system implemented in this paper has 3 functions: detecting image feature points, image matching with the BRIEF descriptor, and detecting video feature points, as shown in Figure 1. In addition, the system supports a user-defined threshold for different occasions or scenes.
Performance Analysis
The SURF IP can process standard VGA resolution images at 270 fps, and its maximum working frequency is 100 MHz, which guarantees the real-time performance of the system.
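The relation between clock frequency and frame rate can be checked with simple arithmetic. The sketch below assumes a pipeline that accepts one pixel per clock cycle, which the text does not state; the function names are hypothetical.

```python
def max_frame_rate(clock_hz, width, height, pixels_per_clock=1):
    """Upper bound on frames per second for a streaming pixel pipeline."""
    return clock_hz * pixels_per_clock / (width * height)

def cycles_per_frame(clock_hz, fps):
    """Clock cycles available for each frame at a given frame rate."""
    return clock_hz / fps
```

Under that assumption, a 100 MHz clock bounds VGA processing at about 325 fps, and the reported 270 fps corresponds to roughly 370,000 cycles per frame, or about 1.2 cycles per pixel.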
We compared the SURF IP implemented in this paper with recently published SURF circuits implemented on the same device. As Table I shows, compared with the designs in [5] and [7], the SURF IP implemented in this paper achieves the best performance, providing the highest frame rate with the lowest resource consumption.
Error Analysis
We compared the Hessian determinants calculated by the SURF IP against the results calculated by the OpenSURF C code, as shown in Figure 1. The test images are the commonly used INRIA Graffiti image set, which can be found at [10].
Figure 1. Hessian determinants error analysis and the result of whole system.
For more than 99.6% of the Hessian determinants, the relative error is less than 1%. A few Hessian determinants have unusually large relative errors because their absolute values are extremely small. However, the SURF threshold is usually larger than these extremely small absolute values. Hence, these Hessian determinants do not affect the accuracy of feature point detection.
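The error metric can be sketched as follows. This is a hypothetical reimplementation of the comparison, not the authors' test bench: it assumes the reference and measured determinants are given as two equally long sequences, and it skips reference values that are exactly zero because the relative error is undefined there.

```python
def fraction_within(reference, measured, tolerance=0.01):
    """Fraction of values whose relative error is below the tolerance."""
    total = 0
    within = 0
    for ref, got in zip(reference, measured):
        if ref == 0:
            continue  # relative error is undefined for a zero reference
        total += 1
        if abs(got - ref) / abs(ref) < tolerance:
            within += 1
    return within / total if total else 0.0
```

A tiny reference value such as 0.001 measured as 0.002 has a relative error of 100% although the absolute difference is negligible, which is the effect described above.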
CONCLUSION
In this paper, a high-performance image feature point detection system based on the SURF IP is implemented on ZYNQ. The SURF IP runs at a higher working frequency while consuming fewer logic resources, and it can detect feature points in standard VGA resolution images at 270 fps. The whole system consumes approximately 2 W according to the Xilinx Power Estimator tool. Hence, the SURF image feature point detection system has the advantages of real-time operation, low power and high portability.
