Image feature extraction and matching is a fundamental but computation-intensive task in machine vision. This paper proposes a novel FPGA-based embedded system to accelerate feature extraction and matching. It implements SURF feature point detection, BRIEF feature descriptor construction, and descriptor matching. For binocular stereo vision, feature matching includes both tracking matching and stereo matching, which together provide feature point correspondences and parallax information. Our system is evaluated on a ZYNQ XC7Z045 FPGA. The results demonstrate that it can process binocular video data at a high frame rate (640 × 480 @ 162 fps). Moreover, extensive tests show that the system is robust to image compression, blurring, and illumination changes.
INTRODUCTION
Machine vision has been widely used in industrial testing, autonomous driving, and even consumer electronics. Feature extraction and feature matching are front-end algorithms for many applications, such as target tracking, image stitching, visual odometry, SLAM, AR and VR. Feature extraction includes both feature detection and feature description. Once a feature point is detected, a descriptor (generally a vector) is built from the information around it. If the distance between a pair of descriptors is small enough, their corresponding feature points are considered "matched". After feature matching, all possible corresponding point pairs between two images have been found, and the homography between the two images can then be determined. Repeatability is a metric for evaluating a given type of feature. It reflects the robustness of that type of feature against image changes caused by camera movement, sensor noise, illumination, and lossy compression. SIFT [1] and SURF [2] are two well-known feature extraction algorithms, owing to their excellent invariance to rotation, affine transformation, and illumination variation. The SURF algorithm [2], a faster variant of SIFT, takes advantage of the integral image and box filters to reduce complexity while retaining similar accuracy. However, SURF is still very time-consuming. Moreover, the descriptors of SIFT and SURF are both complex and incur a large memory overhead. A BRIEF descriptor [3], which takes the form of a binary vector, is fast both to build and to match. Other feature definitions include Harris [4], FAST [5], ORB [6], etc.
Feature extraction and matching are computation-intensive and place heavy demands on storage and memory bandwidth. For instance, SURF detection, BRIEF descriptor building, and BRIEF matching take 717 ms, 180 ms, and 755 ms respectively on a desktop CPU (E8400 @ 3 GHz, 512 × 384) [7]. Therefore, parallel versions of SURF and SIFT have attracted wide attention from researchers. A GPU (GTX480) accelerator for SURF detection and description (no matching) reaches 40 FPS @ 791 × 740 [8]. Another GPU (GeForce 8800M) accelerator performs SURF feature matching (1024 feature points) in only 19 ms [9]. However, the GPU acceleration approach is very power-consuming. Other acceleration approaches based on FPGA or ASIC have been proposed [10] [11] [12] [13]. The design that implements SURF detection and description achieves 72 FPS on a Stratix III @ 1080P [12]. In [14], a complete hardware acceleration system, which includes FAST feature point detection, BRIEF description, and matching, reaches 308 FPS on a Zynq @ 640 × 480. However, FAST feature points are less robust than SURF feature points. Moreover, all of the above works target only monocular applications. The major contributions of this paper are as follows:
1) A binocular system is built on an FPGA. It covers SURF detection, BRIEF description, and matching, and runs at a frame rate of 162 fps @ 640 × 480.
2) The standard AXI protocol is used, so different IPs can be flexibly exchanged or configured. Consequently, the design can be extended to more complex applications.
FEATURE EXTRACTION
SURF description is too complicated to implement in hardware. Consequently, BRIEF description is used as an alternative. In this paper, we choose the combination of SURF feature detection and BRIEF description, whose rationality is discussed in [3]. Of course, this combination sacrifices rotation invariance and part of the scale invariance. However, in many applications, robots such as quad-drones move smoothly, so rotation invariance may not be indispensable. To help readers understand our circuit design, SURF detection and BRIEF description are briefly reviewed.
SURF Detection
A weighting coefficient is used to correct the error caused by the box-filter approximation.
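For reference, the approximated Hessian determinant of SURF [2] is commonly written as shown below. The symbols $D_{xx}$, $D_{yy}$, $D_{xy}$ (box-filter responses) and $w$ follow the notation of [2] and are assumptions here, since they may differ from the notation used in this paper; $w \approx 0.9$ is the value suggested in [2].

```latex
\det(H_{\mathrm{approx}}) = D_{xx} D_{yy} - \left( w \, D_{xy} \right)^{2}, \qquad w \approx 0.9
```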
In the original SURF algorithm [2], a scale space is established to achieve scale invariance. The space is divided into multiple octaves, each of which has 4 scales. To save logic resources, we construct only one octave containing the 8 smallest scales (s = 1.2, 2.0, 2.8, 3.6, 4.4, 5.2, 6.0, 6.4) and omit the interpolation step. The rationality of this method has been discussed in [10]. Non-maximum suppression (NMS) localizes candidate points with locally maximal Hessian determinants, which prevents feature points from being too concentrated. The NMS search scope includes not only the 8 neighbors at the candidate's own scale but also the 18 neighbors at the two adjacent scales. If the Hessian determinant of a candidate point also exceeds a threshold, the point is considered a feature point.
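The 26-neighbor NMS test described above can be sketched in software as follows. This is a minimal illustration, not the circuit itself: the function name `is_feature_point`, the nested-list layout `det[scale][row][col]`, and the choice of a strict comparison against neighbors are assumptions made for the example.

```python
def is_feature_point(det, s, r, c, threshold):
    """Return True if det[s][r][c] passes the threshold and is a strict
    maximum over its 8 same-scale and 18 adjacent-scale neighbors.

    det is a nested list indexed as det[scale][row][col] (assumed layout).
    The caller must ensure (s, r, c) is not on the border of the volume.
    """
    value = det[s][r][c]
    if value <= threshold:
        return False
    for ds in (-1, 0, 1):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (ds, dr, dc) == (0, 0, 0):
                    continue  # skip the candidate itself
                if det[s + ds][r + dr][c + dc] >= value:
                    return False
    return True


# Tiny 3 x 3 x 3 volume with a single peak at the center.
volume = [[[0] * 3 for _ in range(3)] for _ in range(3)]
volume[1][1][1] = 10
print(is_feature_point(volume, 1, 1, 1, threshold=5))
```

In hardware the 26 comparisons would run in parallel on the 3 × 3 windows of three adjacent scales; the loops here only make the search scope explicit.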
BRIEF Description
To create a BRIEF descriptor, two steps are needed: smooth filtering and binary descriptor generation. In the smooth filtering step, a 9 × 9 averaging filter is used instead of the Gaussian filter recommended in [3], in order to reduce the number of multipliers. In the binary descriptor generation step, an N × N window centered at the feature point is selected on the filtered image, and 128 or 256 pairs of sample points in the window are compared. The result of each comparison is calculated by

$$\tau(p; x_1, x_2) = \begin{cases} 1, & p(x_1) < p(x_2) \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

where $p(x)$ is the intensity of the filtered image at sample point $x$.
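The descriptor generation step can be illustrated with a short sketch. It assumes the image has already been smoothed, that the sampling pairs are given as (row, column) offsets from the feature point, and that bit i of the descriptor holds the result of pair i. The function name `brief_descriptor` and the three-pair pattern in the usage lines are hypothetical; the sampling pattern of the actual design is not specified here.

```python
def brief_descriptor(smoothed, row, col, pairs):
    """Build a binary descriptor as a Python int.

    smoothed : 2D list of intensities (already filtered)
    row, col : feature point position
    pairs    : list of ((dr1, dc1), (dr2, dc2)) offsets inside the window
    Bit i is 1 when the first sample of pair i is darker than the second.
    """
    descriptor = 0
    for i, ((dr1, dc1), (dr2, dc2)) in enumerate(pairs):
        first = smoothed[row + dr1][col + dc1]
        second = smoothed[row + dr2][col + dc2]
        if first < second:
            descriptor |= 1 << i
    return descriptor


# Usage: a 5 x 5 horizontal ramp (intensity equals the column index).
ramp = [[c for c in range(5)] for _ in range(5)]
pattern = [((0, -1), (0, 1)), ((0, 1), (0, -1)), ((1, 0), (-1, 0))]
print(bin(brief_descriptor(ramp, 2, 2, pattern)))
```

With 128 pairs the result fits in 128 bits, which matches the descriptor width used by the matching stage.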
THE PROPOSED ARCHITECTURE
Overview
A modern Xilinx Zynq SoC contains a Processor System and Programmable Logic, as shown in Figure 1. Video streams from two cameras are fed into the Image Capture module. Streams in DVP format are converted to the AXI4 protocol and then enter a ping-pong buffer in DDR. The Image Rectification module, similar to the work in [15], extracts stream data from the ping-pong buffer for epipolar rectification and distortion removal. The Feature Extractor implements feature detection and descriptor generation, and outputs the pixel coordinates and descriptors of feature points. For a stereo vision system, the Feature Matcher implements both trace matching and stereo matching. In this paper, the left image is used as the reference. Trace matching builds feature point correspondences between the left image of the current frame and the left image of the previous frame, and stereo matching builds correspondences between the left and right images of the current frame.
Figure 2. Architecture of image filter
The feature extractor module is shown in Figure 3. It contains three instances of the Window (N × N) module of Figure 2, with N equal to 52, 3, and 49 respectively. Image integration is the first step in SURF. In BRIEF description, the Gaussian filter is replaced by an averaging filter, as mentioned in section 2.1. Unlike a Gaussian filter, an averaging filter can be implemented by accessing the data of the integral image, which means that the Averaging Filter Function module can share the data in the Window (52 × 52) with the Hessian Cores. Thus, no additional Window is needed for the Averaging Filter Function. Moreover, thanks to the integral image, the Averaging Filter Function needs only one adder and two subtractors. The Compare module, which has a 1-bit output, implements Eq. (4). The outputs of the 128 Compare modules are combined into a BRIEF descriptor. The Coordinate Generator module contains two counters, one for the row coordinate and one for the column coordinate. The input and output of the feature extractor are designed to follow the AXI-Stream protocol (not shown in Figure 3). AXI-Stream has a handshaking mechanism: when the input data is invalid or the downstream circuit is not ready to receive data, the feature extractor is suspended.
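The claim that a box sum costs one addition and two subtractions once the integral image is available can be checked with a small software model. The sketch below is illustrative only: the function names and the zero-padded integral image layout are assumptions, and the final division by the window area (81 for a 9 × 9 filter) is left out.

```python
def integral_image(img):
    """Return ii with ii[r][c] = sum of img over rows < r and cols < c.

    A zero row and zero column are prepended so that box sums need no
    special case at the image border.
    """
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for r in range(height):
        row_sum = 0
        for c in range(width):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, top, left, size):
    """Sum of the size x size box whose top-left pixel is (top, left).

    Uses four corner reads, one addition and two subtractions.
    """
    bottom, right = top + size, left + size
    return (ii[bottom][right] + ii[top][left]) - ii[top][right] - ii[bottom][left]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ii = integral_image(image)
print(box_sum(ii, 1, 1, 2))  # sum of 5, 6, 8, 9
```

Because the cost is independent of the box size, the same four-corner access pattern serves both the Hessian box filters and the averaging filter.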
Feature Matcher

Multi Buffer module
As depicted in Figure 4, our system is designed as a two-stage pipeline to speed up processing. Matching occurs one frame later than extraction. In a binocular system, feature extraction is conducted on both the left and right images, and trace matching followed by stereo matching has to be conducted. The extraction results of the current left, current right, and previous left frames are stored for subsequent trace and stereo matching. For example, 3T (trace matching for frame 3) needs the results of 2L (previous left) and 3L (current left), and 3S (stereo matching for frame 3) needs the extraction results of 3L (current left) and 3R (current right). Note that trace matching starts only from the 2nd frame.
Figure 4. Schedule of feature extraction and matching
Extraction and matching are performed simultaneously, i.e., extraction results are read and written at the same time. To avoid conflicts between reading and writing, a Multi Buffer module is proposed. As shown in Figure 5, the Multi Buffer is logically a ring buffer that consists of five storage sections. At any moment, three sections are being read and two are being written. For example, at time T3, the extraction results of 1L, 2L and 2R are read from the RP, RL and RR sections (note the legend in Figure 5) to perform 2T and 2S matching; at the same time, the 3L and 3R results are written into the WL and WR sections for future matching at T4 and T5.

Figure 6 shows the block diagram of the Match Executor module. The inputs of the Match Executor module come from the RP, RL and RR storage sections of the Multi Buffer module. The Match Core modules in the Match Executor are divided into a T group and an S group, which are used for tracking matching and stereo matching respectively. The number of Match Core modules can be flexibly configured: the more matching cores, the faster the matching, but the more resources are consumed. The matching process is controlled by a finite state machine (FSM) that will be discussed below.
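The Multi Buffer behavior (five sections, three read and two written in every time slot) can be modeled with a small simulation. The role rotation used here, in which each section cycles through WL, RL, RP, WR, RR, is an assumption chosen because it reproduces the T3 example in the text; the actual rotation order in Figure 5 may differ. All names in the sketch are hypothetical.

```python
# Assumed role cycle: a section is written with a left result (WL), read as the
# current-left input (RL), read as the previous-left input (RP), written with a
# right result (WR), read as the right input (RR), and then reused.
ROLE_CYCLE = ("WL", "RL", "RP", "WR", "RR")


def roles_at(t):
    """Map each role to the index of the section holding it in time slot t."""
    return {ROLE_CYCLE[(t + i) % 5]: i for i in range(5)}


def simulate(num_slots):
    """Return a list of (t, reads), where reads maps RP/RL/RR to their contents."""
    sections = [None] * 5
    log = []
    for t in range(1, num_slots + 1):
        roles = roles_at(t)
        reads = {name: sections[roles[name]] for name in ("RP", "RL", "RR")}
        # Reads and writes always hit different sections, so no conflict arises.
        sections[roles["WL"]] = f"{t}L"
        sections[roles["WR"]] = f"{t}R"
        log.append((t, reads))
    return log


for t, reads in simulate(5):
    print(f"T{t}: {reads}")
```

In this model the reads at T3 are 1L, 2L and 2R, as in the example above, and a left result stays readable for two slots while a right result is needed for only one.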
Match Executor module
Figure 6. Block diagram of the Match Executor
The Match Core, as shown in Figure 7, finds the pair of descriptors with the smallest Hamming distance [3], and therefore their corresponding feature points. The inputs of the Match Core, A and B, are 148 bits wide (128 bits for the descriptor and 20 bits for the coordinate). During matching, the descriptor and coordinate of a feature point (denoted FA) are loaded into A. Only the descriptor portion is used to calculate the Hamming distance; the coordinate portion simply accompanies the descriptor unchanged. Then a group of feature points (denoted FBs) is loaded into B one after another. While the FBs are being loaded, if the Hamming distance between a new FB and FA is less than that of the previously stored FB, the comparator of the Match Core outputs 0 and the Hamming Distance Register is updated. After the process, the point among the FBs that best matches FA has been found, where "best" means the minimum Hamming distance. As depicted in Figure 8, the matching process is controlled by an FSM with four states: LOAD, RUNNING, TRANSPORT and CLEAR. Matching at time T3 is used as an example for illustration.
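The search performed by one Match Core can be sketched in software as a running minimum over Hamming distances. The sketch assumes descriptors are stored as Python integers and each feature is a (descriptor, coordinate) tuple; the function names and the initial register value of 129 (one more than the largest possible 128-bit distance) are choices made for this example, not details of the circuit.

```python
def hamming_distance(a, b):
    """Number of differing bits between two descriptors stored as ints."""
    return bin(a ^ b).count("1")


def match_core(fa, fbs, descriptor_bits=128):
    """Return (best_coordinate, best_distance) for feature fa among fbs.

    fa  : (descriptor, coordinate)
    fbs : iterable of (descriptor, coordinate), streamed one at a time
    The coordinate travels with the descriptor and is never used in the
    distance computation.
    """
    fa_descriptor, _ = fa
    best_distance = descriptor_bits + 1  # models the cleared distance register
    best_coordinate = None
    for descriptor, coordinate in fbs:
        distance = hamming_distance(fa_descriptor, descriptor)
        if distance < best_distance:  # update only on a strictly smaller distance
            best_distance = distance
            best_coordinate = coordinate
    return best_coordinate, best_distance


fa = (0b1011, (1, 2))
fbs = [(0b0000, (5, 5)), (0b1010, (7, 8)), (0b1111, (9, 9))]
print(match_core(fa, fbs))
```

Because the update is strict, the first candidate that reaches the minimum distance is kept when several candidates tie.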
As shown in Figure 10 , performance of the proposed algorithm is compared with some other feature extraction algorithms in OpenCV.
The Bikes, boat, wall and ubc in the data set correspond to image blur, zoom+rotation, viewpoint change, and JPEG compression, respectively. For the detail of the data set, please refer to [17] .Except boat, the proposed method performs well on the data set. In the data set boat, there is obviously rotation and scale variations between the two images, the proposed method does not performs well. Figure 9 shows the example of trace matching.
Mismatched point pairs can be removed by RANSAC [18]. Stereo matching produces the parallax for each corresponding pair. As mentioned in section 3.3.2, unlike tracking matching, stereo matching is restricted not only by the threshold but also by the range of the parallax search.
As shown in Figure 10 , the precision of stereo matching almost reaches 100% for all image pairs in the data set [18] , due to the parallel check which eliminates point pairs which do not satisfy parallax epipolar line constraint. 
Performance analysis
Assuming an image resolution of 640 × 480, the resource utilization on a ZYNQ XC7Z045 FPGA is shown in Table 1. Assuming a 100 MHz working frequency, our system can work at 162 fps @ 640 × 480. For comparison, our system and previous works are listed in Table 2. The works in [14, 20-23] support only monocular cameras. Although our system targets binocular vision, the performance metrics of a monocular version of our system are also given. In [20, 21], only feature detection and description are accelerated on the FPGA, while matching is done in software, so their frame rates are low. The works in [13, 22, 23] accelerate feature point extraction, descriptor construction, and matching. The frame rates of [22, 23] are still low. The performance of [14] is similar to that of our system in monocular mode because of its similar pipeline architecture.
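The 162 fps figure is consistent with a simple throughput estimate, under the assumption (not stated in the text) that the extractor consumes one pixel per clock cycle and that the left and right images pass through it one after the other. The function below is a hypothetical helper written only to show the arithmetic.

```python
def max_frame_rate(clock_hz, width, height, images_per_frame, pixels_per_clock=1):
    """Upper bound on frames per second for a streaming pixel pipeline."""
    pixels_per_frame = width * height * images_per_frame
    return clock_hz * pixels_per_clock / pixels_per_frame


binocular = max_frame_rate(100e6, 640, 480, images_per_frame=2)
monocular = max_frame_rate(100e6, 640, 480, images_per_frame=1)
print(f"binocular: {binocular:.1f} fps, monocular: {monocular:.1f} fps")
```

Under these assumptions the binocular bound is about 162.8 fps and the monocular bound is about 325.5 fps, which is in the same range as the 308 fps reported for [14].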
CONCLUSION
In this paper, a system for SURF feature detection, BRIEF descriptor construction, and matching is proposed for binocular vision. It can work at 162 fps @ 640 × 480 on a ZYNQ XC7Z045. The use of standard AXI4 interfaces allows different modules in the system to be exchanged or configured easily. In the future, we will focus on improving accuracy and on combining our system with higher-level applications such as visual odometry or SLAM.
