ABSTRACT This paper proposes a novel hardware design method of scale-invariant feature transform (SIFT) algorithm for implementation on field-programmable gate array (FPGA). To reduce the computing costs, Gaussian kernels are calculated offline for use in Gaussian filters. To eliminate low-contrast points, the inverse of a Hessian matrix is required for hardware implementation, which results in poor performance because dividers are needed. To solve this problem, this paper presents a new mathematical derivation model to implement the low-contrast detection, avoiding the use of any dividers. For the implementation of the normalization module, a large number of dividers are required by traditional methods, which adversely affects the computational efficiency. This paper presents a new architecture using only one divider to implement the normalization function in hardware. Thanks to the parallel processing architecture proposed to design the image pyramid, SIFT detection, and SIFT descriptor, the computational efficiency of the SIFT algorithm is significantly improved. As a result of the proposed design method, the requirement of logic elements in the FPGA hardware is greatly reduced and system frequency is significantly increased. Experimental results show that the proposed hardware architecture outperforms existing techniques in terms of resource usage and computational efficiency for real-time image processing.
I. INTRODUCTION
Feature extraction is a fundamental aspect of many problems in the field of machine vision, including image stitching, object recognition, 3D modeling from multiple images, image tracking, and robotic mapping and navigation. Although many feature-detection algorithms have been proposed over the past years, the scale-invariant feature transform (SIFT) [1] algorithm, which mainly converts image data into high-dimensional feature descriptors before matching two images, is the most stable method of all. Because of its stability and robustness, SIFT not only deals with change in brightness but also satisfactorily addresses the problem of image scaling and rotation. As a result, SIFT is able to maintain feature invariance under orientation or illumination changes, out-performing the commonly used Harris corner detector [2] . Although SIFT is powerful with excellent results, it suffers from heavy computation and large memory usage, which impose a serious constraint upon real-world applications. In order to overcome this drawback, several approaches, including Speed Up Robust Features (SURF) [3] and Principal Components Analysis (PCA) SIFT [4] , have been proposed with the aim of reducing calculation time. As a stable image recognition and descriptor algorithm, SURF is widely applied to object-recognition problems. The biggest difference between SURF and SIFT is that the former uses integral images and 2D discrete wavelet transform; the latter uses an image pyramid and histogram of oriented gradients. PCA-SIFT, on the other hand, reduces the dimensionality of SIFT feature descriptors from 128 to 36 in order to minimize the dimension to speed up feature matching. In 2009, Juan and Gwun analyzed the performance of SIFT, SURF, and PCA-SIFT [5] . The result showed that SURF has the shortest computation time while SIFT has the best matching accuracy, considering the factors of changing scale, image rotation, image blurring, and changing illumination. In particular, various variants of SIFT algorithms, for example affine SIFT [6] , have been proposed in recent years with improved stability and robustness for use in feature matching. As a result, SIFT algorithms are widely used in various applications. Although deep learning approaches [7] have been adopted over the past years to extract region features with success, they inevitably require a lot of training examples for the model to learn properly. As complexities and uncertainties in the environment increase, feature extraction might not be successful if features in the scene are not well trained. Furthermore, network complexities of deep learning architecture also result in high costs particularly for hardware implementation [8] , [9] . In view of the above, the SIFT algorithm would be the best choice of all if the problem of computation time can be overcome. To address this problem, much research has been conducted in recent years to implement the SIFT algorithm on various experimental platforms, including multi-core processors, graphics processing unit (GPU), and field-programmable gate array (FPGA), with a goal of speeding up the computational performance to achieve real-time image recognition. Among them, Bonate et al. presented a software and hardware co-design approach to implement SIFT on FPGA [10] . In addition, Yao et al. [11] presented a method of improving the SIFT algorithm for implementation on FPGA, where the dimensionality of SIFT feature descriptors is reduced from 128 to 72. As a result, its computation through hardware implementation of FPGA is nearly real-time. In 2014, Wang et al. [12] presented an embedded SOC architecture for detection and matching by hardware. The result showed that it can process 60 images per second. Jiang et al. [13] also presented a hardware structure to implement the detection and matching of SIFT by hardware. The experimental results demonstrated that about 150 images can be processed per second. All the above discussions suggested that SIFT can be successfully implemented on hardware platforms, not only maintaining the success rate of matching but also achieving the objective of real-time computation.
As an attempt to further improve the computational efficiency of hardware implementation on FPGA, we present a novel hardware design method for SIFT to accelerate the computational efficiency of its major modules, including image pyramid, SIFT detection, and SIFT descriptor. Because the hardware design is based on pipeline architecture, the execution speed is significantly improved. It is worth mentioning that Gaussian smoothing of an image requires an exponential function that it is hard to implement and requires a lot of logic elements in the hardware. With the use of offline calculation of the Gaussian kernel proposed in this paper, the number of logic elements required in the hardware can be greatly reduced to accelerate system frequency. To eliminate low-contrast points, inverse matrix operations are required in the hardware by traditional methods, which results in low performance because dividers are needed for the calculation. To solve this problem, this paper presents a new mathematical derivation model to implement low-contrast detection, avoiding the use of any dividers. Similarly, for the implementation of the normalization module, a large number of dividers are required by traditional methods. This paper presents a new architecture using only one divider, instead of 128, to implement the normalization function in hardware. As a result, computational efficiency is greatly improved. Thanks to the parallel processing architecture proposed in this paper to design the image pyramid, SIFT detection, and SIFT descriptor, computational efficiency of the entire hardware system is significantly improved. As a result of the proposed design method, utilization of logic elements required in the FPGA hardware is greatly reduced and system frequency is significantly increased.
The paper is organized as follows. A comprehensive review of hardware implementation methodologies of SIFT and their experimental results is presented in Section 2. Preliminaries of the SIFT algorithm are introduced in Section 3. The proposed hardware design methodology for SIFT is described in Section 4. Experimental results are presented in Section 5. Section 6 concludes this paper.
II. RELATED WORK
This section discusses existing techniques that implement SIFT algorithm and their experimental results.
Vourvoulankis et al. [14] proposed an FPGA-based pipeline architecture for implementing SIFT. At a resolution of 640 × 480, up to 70 fps can be processed, allowing the system to achieve real-time operation. Aniruddha et al. [15] also obtained a speed of around 55 fps for 640 × 480 images by implementing a parallel structure of SIFT on a GPU. Unfortunately, their works can only be used in situations in which the descriptor does not rotate or the system has a small change in angle. Yao et al. [11] proposed an optimized SIFT feature detection architecture for image matching. The number of descriptor vectors is reduced from 128 to 72, simplifying the matching operation. According to the experimental results, this approach successfully reduced the detection time to 31 ms, at the expense of very high resource utilization. Mizuno et al. [16] , [17] proposed a SIFT implementation based on a FPGA architecture to quickly separate the regions of interest. This approach significantly reduced the required on-chip memory resources and supported two modes of operation. In the high-speed mode, a processing speed of 56 fps was achieved, while the high-precision mode could reach 32 fps; both of these results were achieved in VGA resolution. The main drawbacks of this architecture is the use of external Static Random-Access Memory (SRAM) to store input images and a large number of required Digital signal processing (DSP) blocks. Chang et al. [18] proposed a SIFT detection method that can reduce the usage of chip resources, where the SIFT keypoint detection time can be reduced to 11 ms. Unfortunately, it is applicable only to VGA resolution. Zhong et al. [19] used a combination of FPGA VOLUME 6, 2018 and a DSP processor to implement SIFT, where the SIFT detector is implemented on FPGA, while the SIFT descriptor is implemented by DSP. This system takes 10 ms to detect features in an image, and each descriptor consumes 80 µs. This hybrid architecture, however, is limited to low-resolution input images and the calculation of the descriptor is too slow. Huang et al. [20] proposed an ASIC-based SIFT architecture, where parallel operations were used to transfer the input data to SRAM, which can reduce the number of transistors and chip area. According to the experimental results, they succeeded in detecting the feature points in VGA resolution in 3.4 ms. With the use of the finite-state machine (FSM), the calculation time for each descriptor is about 0.0331 ms. Suzuki and Ikenaga [21] , [22] proposed a scheme in which the difference of Gaussian (DOG) is replaced by a Harris corner detector. The detected keypoints were used as a descriptor for SIFT. However, they failed to implement an image pyramid and therefore the difference in scale-space has not been considered in SIFT detection, which inevitably decreased the image matching performance. Kim and Lee [23] proposed a framework to maintain resource utilization at a lower level. They mainly split the input image and used an external SRAM to store the data to reduce the on-chip memory requirements. The calculation time of each descriptor is about 60 µs, which is the main bottleneck of the architecture. Chiu et al. [24] proposed a parallel layer of SIFT architecture. They used an integral image to reduce the computational complexity in Gaussian blurring. The calculation of descriptors, such as trigonometric functions and segmentation, used a custom multi-cycle component implemented on ASIC chips. When the number of feature points is less than 2000, it can handle 30-fps full-HD images. This means that it can compute the feature descriptor within 16 µs. Wang et al. [12] proposed an architecture that combines a detector, a descriptor, and a feature matcher, where SIFT is used for keypoint detection. The proposed system used an image with a resolution of 1280 × 720 pixels to achieve feature detection and matching at 60 fps. However, they sacrificed some of the robustness of the algorithm in order to speed up the execution. Jiang et al. [13] proposed an architecture for realtime SIFT extraction based on parallel and pipeline structure. They also used two ping-pong RAM buffers to preserve the location, gradient, and orientation of the features. The authors claimed that the proposed scheme has almost the same matching performance as the original algorithm. However, descriptor calculation is still the bottleneck of the entire algorithm.
III. PRELIMINARIES OF SIFT ALGORITHM
As a method of detecting and describing local features in images, the SIFT algorithm was proposed by David G. Lowe in 2004 [1] . Fig. 1 shows a flow chart of the SIFT algorithm, where three basic steps, namely Image Pyramid, SIFT Detection, and SIFT Descriptor, are required to implement the SIFT algorithm. Detailed descriptions for each function are given as follows: 
A. IMAGE PYRAMID
The image pyramid uses a cascade Gaussian filtering approach that creates continuous images at different scales. As shown in Fig. 2 , six Gaussian images and five DOG (difference of Gaussian) images in every octave are generated to construct an image pyramid. First of all, each Gaussian filter uses a different sigma to calculate a Gaussian kernel in (1) . Then a convolution operation is performed for the initial image and the Gaussian kernel to obtain a Gaussian image in (2) . By subtracting the two continuous Gaussian images, a DOG image is obtained by (3) .
(1)
B. SIFT DETECTION
As soon as the image pyramid is built, we proceed to the step of keypoint detection, including extrema detection, eliminating low-contrast, and edge response keypoints. Extrema detection needs to use 3D DOG space, that is, three continuous DOG images of the same octave. As shown in Fig. 3 , the red point is a target point and the blue points are the nearest neighbors of 3D DOG space. If the value of the red point is the maximum or minimum in the nearest neighbor region, it is a candidate point. By doing so, candidate points can be determined in the previous step. However, some of them are unstable points. As a result, low-contrast keypoints and edge response points have to be eliminated in the next stage. First, the low-contrast detection uses a Taylor series to expand the 3D DOG space as shown in (4). Then we calculate the derivative of (4) to find the extrema value by making the derivative of (4) equal to zero. By proper derivations, we suppose the location of the extrema value according to (5) . Substituting a of Equation (5) into x in Equation (4), we obtain (6). If |D (a)| is smaller than 0.03, this candidate point has to be deleted.
To eliminate the edge response, we first build a Hessian matrix in (7) and then follow the steps to calculate the trace (Tr(.)) and determinant (Det(.)) of the Hessian matrix by (8) and (9), respectively. Let r be the ratio between the biggest eigenvalue and the smallest one, so that α = rβ. We can use the solutions of (8) and (9) to derive (10) . To detect an edge point, we only need to check (11) , where r is usually 10. We can distinguish a non-edge point if the inequality in (11) is satisfied.
Each keypoint has an orientation assignment to allow for rotation invariant. To calculate the correct orientation, we use the area adjacent to the gradient value of the direction of the statistics. The magnitude and direction calculations for the gradient are derived for every pixel in a neighboring region around the keypoint in the Gaussian-blurred image G. We count the magnitude of each direction in the entire Gaussian image and use the maximum magnitude of the direction as the direction of the keypoint, as shown by the red dot in Fig. 4 . Equation (12) is used to calculate the gradient magnitude m around the red dot. Equation (13) is used to calculate the gradient direction θ (x, y). Each sample in the neighboring window added to a histogram bin is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a value of σ that is 1.5 times that of the scale of the keypoint. The peaks in this histogram correspond to dominant orientations, as shown in Fig. 4 . In order to increase the success rate of matching, the second largest peak (about 15% of the feature points) has been chosen as the secondary direction. Thus, matching stability can be notably improved.
m(x, y)
C. SIFT DESCRIPTOR
In this paper, a gradient histogram is used to increase the matching success rate of feature descriptors. Before using the gradient histogram, we have to use (14) to rotate the main direction of the feature point mask to 0 degree. The 16 × 16 mask is divided into 4 × 4 regions, and the gradient of the eight directions in each region is counted, as shown in Fig. 5 . Each region has a gradient value of eight directions, so each feature descriptor can be described as a 128-dimensions vector.
IV. PROPOSED METHODOLOGY
This section describes the proposed hardware design of SIFT. Major functional blocks of the SIFT hardware circuit VOLUME 6, 2018 include Image Pyramid, SIFT Detection, and SIFT Descriptor, as shown in Fig. 6 . The Image Pyramid Block uses parallel processing to calculate Gaussian filter images and DOG images. After the DOG images have been processed using hardware, the feature points are taken and stored in the first-in-first-out (FIFO). The feature points need to wait for the SIFT descriptor block to complete. The SIFT Descriptor then calculates the feature descriptor vector for each pixel. If the FIFO read value is 1, then the position and the descriptor vector of the feature point are stored. We use the pipeline architecture for the overall system to improve computational efficiency. Detailed descriptions for each hardware functional modules are given as follows.
A. IMAGE PYRAMID BLOCK
This paper designs a Gaussian filter module based on a parallel architecture to produce a Gaussian image, as shown in Fig. 7 , where MAC# is a multiply and accumulate module. The Gaussian pyramid in this paper contains four consecutive Gaussian images, so four kinds of Gaussian mask parameters are used and stored in Gaussian Mask Selection, where iGaussian_num is used to select the corresponding Gaussian mask. To simplify the operations, we offline calculate four kinds Gaussian mask values and enlarge 1024 times of their values by shifting 10 bits left. After the operation is complete, the output value is then shifted right by 10 bits.
B. SIFT DETECTION BLOCK Fig. 8 shows the pipeline hardware of the SIFT detection block. There are three main modules in this block, including 1) extrema detection, 2) unstable point detection, and 3) the pipeline hold circuit module. The following paragraphs describe the hardware design and implementation for each module.
1) EXTREMA DETECTION MODULE
The extrema detection module is used to determine whether the pixel is a maximum or a minimum of all neighboring points. This module performs both maximum detection and minimum detection with 26 neighboring points in parallel, as shown in Fig. 9 .
2) HESSIAN MATRIX MODULE
Equation (15), as shown at the top of the next page, is a Hessian matrix and each variable in the matrix is described in (16) to (21), as shown at the top of the next page, where Dxy, Dxs, and Dys are divided by four by shifting 2 bits right. Fig. 10 illustrates the hardware realization of the Hessian matrix module.
3) DIFFERENTIAL MATRIX MODULE
Equation (22) is the partial differential matrix of the DOG images, and Fig. 11 shows the hardware realization of the Differential Matrix Module.
4) INVERSE HESSIAN MATRIX MODULE
To obtain the inverse of the Hessian matrix, we use (23) to calculate the adjoint matrix. The determinant of the matrix is 
performance, we calculate and output the adjoint matrix and the determinant to the next module. 
5) LOW CONTRAST DETECTION MODULE
Equations (25) and (26) are the original formula to determine the low contrast keypoints, where the inverse of matrix is required. Taking square of (25), we obtain (27). To compute the inverse matrix, it is necessary to use the divider, which inevitably consumes a lot of hardware resources and reduces the processing speed. To avoid this problem, we firstly output adj(A) and det(A) of the Inverse Hessian Matrix and then we express the inverse matrix in (26) using the adjoint matrix and determinant to obtain Eq. (28).
Substituting Equation (28) into (27), we obtain an approximate inequality (29). Finally, we can easily implement (29) by hardware to determine the success or failure of low-contrast detection. The hardware module is shown in Fig. 12 .
1024
6) EDGE-DETECTION MODULE
The function of this module is to determine whether a feature point is an edge response keypoint. We use (33) and (34) to calculate the trace and determinant of the Hessian matrix, so that (35) can be used to decide whether it is an edge feature point, where the pixel is not an edge feature point if (35) is true. Fig. 13 shows the hardware diagram of the Edge Detection Module, where r is equal to 10.
C. SIFT DESCRIPTOR BLOCK
This paragraph introduces the hardware architecture of the SIFT feature descriptor as shown in Fig. 14 , including three functional modules: (1) Image Gradient, (2) Histogram, and (3) Normalization.
1) IMAGE GRADIENT MODULE
To calculate the gradient and direction, we need the square root and tan −1 functions. Therefore, we design a CORDIC (Coordinate Rotation Digital Computer) circuit to deal with them, as shown in Fig. 15 , where K is a constant value
2) HISTOGRAM MODULE
For feature point matching, we use the histogram of oriented gradient (HOG) to calculate the dominant direction and descriptor. To implement the gradient statistical histogram module, we divide the 16 × 16 mask size into 16 sub-regions and calculate the gradient histogram in the each sub-region, as shown in Fig. 16 . Sixteen statistic eight-bin modules are therefore to simultaneously compute the gradient in eight directions, as shown in Fig 17. Each Statistic_bin module determines which sub-region the angle belongs to and accumulates the gradient value, as shown in Fig. 18 .
3) NORMALIZATION MODULE
The normalization operations are performed in (36) and (37), where W is the feature point vector, and W = (w 1 , w 2 , . . . , w 128 ). L is a normalization vector and L = (l 1 , l 2 , . . . , l 128 ). For the implementation of the hardware normalization, a large number of multipliers and dividers are required. However, the use of a large number of dividers undermines system performance and demands a large number of logic elements. Thus, we rewrite (39) as (40), where we can only use one divider in addition to multipliers and shifters to implement the normalization hardware. If s is 270, the value of n (n = 8 in this case) can be found according to (38) and m value can be calculated from (41) afterwards. The architecture of the normalization hardware module is shown in Fig. 19 . 
V. EXPERIMENTAL RESULTS
In this section, we evaluate the performance of all modules in the proposed hardware and compare the computation time achieved by the proposed FPGA hardware, NIOS II, and a PC. The computation platform is a personal computer with an Intel Core i7-3770, a 3.4-GHz CPU, and 8 GB of RAM. The simulation environment is Nios II and Visual Studio 2010 Express with OpenCV library under Win 7. The hardware used is an Altera DE2i-150 board of FPGA with Cyclone IV GX: EP4CGX150DF31C7, and the system frequency is 50 MHz. Table 1 shows the system performances of all modules in the proposed hardware system, including logic elements (LE), memory bits, and nine-bit multiplier. Table 2 shows the computation time of the Gaussian filter for two different image sizes on different platforms, including FPGA, NIOS II, and PC. The mask size of the Gaussian filter is 7 × 7, and the Gaussian sigma is 1.6. We also calculate the differences in grey value of 0, 1, and 2 between the original image and the image processed by the proposed Gaussian filter, as shown in Table 3 . The size of the given images is 800×480. Note that three pixels around the edges are ignored in Table 3 . The results indicate that the proposed method has achieved almost the same performance level as that by software. The reason for the image error of the Gaussian filter is that we only use a finite number of 10 bits for shifting to solve the floating point problem. Based on Table 3 , the Peak Signal-to-Noise Ratio (PSNR) can be calculated as 43.6694 db. The original image is shown in Fig. 20 , while Fig. 21 shows the Gaussian image using a Gaussian operation to process the original image by FPGA. Table 4 shows the computation time of the image pyramid executed on three platforms. We can see that the Gaussian VOLUME 6, 2018 filter (Table 2 ) and the image pyramid (Table 4) have the same computation time by hardware, because all Gaussian images and DOG images can be processed in parallel.
A. COMPUTATION TIME OF GAUSSIAN IMAGE

B. COMPUTATION TIME OF IMAGE PYRAMID
Note that the computation time for the image pyramid is not affected by the proposed FPGA design, in comparison to the Gaussian filter, thanks to parallel computation of the proposed hardware architecture, while the computation time of the image pyramid by PC is six times longer than that by the Gaussian filter.
C. COMPUTATION TIME OF SIFT DETECTION
The result of SIFT detection by FPGA hardware is shown in Fig. 22 , in which the red points represent the keypoints. Table 5 shows the computation time of SIFT detection for two different image sizes. From Tables 4 and 5 , we can find that the larger the image size, the more processing time the proposed FPGA hardware can save. TABLE 6 . compares hardware resource usage and maximum operation frequency between the traditional inverse matrix method and the proposed method. The traditional method, which uses an inverse matrix to implement the hardware, requires nine dividers to perform the operations. In the proposed method, we only use the inequality to determine the low-contrast keypoints. Thus, we can achieve a satisfactory performance, as shown in Table 6 .
D. COMPUTATION TIME OF SIFT DETECTION
Note that the total LE required is much smaller and FMAX is significantly increased, because no dividers are used.
E. IMAGE GRADIENT ERROR CALCULATION
We enter four quadrants of X and Y values into the CORDIC module implemented by hardware and obtain an approximation value for √ X 2 + Y 2 and tan −1 Y X in Eq. (12) and (13) . Note that the error is satisfactorily small for practical usage. Table 7 shows the calculation results of CORDIC, and Table 8 shows the hardware resource usage of CORDIC.
To evaluate the benefits and drawbacks of the proposed method, we compare twelve state-of-the-art hardware implementations of SIFT in terms of their performance, namely detection time, descriptor calculation time and frame rate, and their hardware resource usage (e.g. gates, registers, multiplexers and RAM). The results are shown in Table 9 and Table 10 , respectively. We can only compare performance and hardware resource usage with [14] , not only because their work achieved an exceptional performance reported in the literature, but also because the hardware platform and image size used are the same. As shown in Table 9 , detection time and descriptor calculation time required by the proposed method are less than those by [14] and therefore the frame rate of the proposed architecture can be as high as 150 fps. As far as hardware resource usage is concerned, the proposed architecture greatly reduces usage of LUTs/ gates, which leaves more room to make the SIFT hardware more applicable for realworld applications, as shown in Table 10 .
F. MATCHING ACCURACY
As soon as the hardware modules are constructed, experiments can be conducted to validate the matching accuracy based on the obtained SIFT descriptors by the proposed hardware design method. Taking image sets in 10 different scenes, where the images bear different degrees of scaling, rotation, and view angles in each image set, we obtain a satisfactory average matching accuracy of about 97.84% for the image sets. This suffices to show the robustness and effectiveness of the obtained SIFT descriptors. As illustrative examples, Figs. 23-25 show the matching results of 3 image sets, where matching accuracy is 99.29%, 99.03%, and 96.15%, respectively.
VI. CONCLUSION
The SIFT algorithm uses scale-space images to achieve feature point detection and description with good results to changes in scale or rotation. Thus, matching performance of SIFT is more stable and robust compared with other image recognition algorithms. However, the steps to implement the SIFT algorithm are complex and the processing time is much too long. It is therefore a non-trivial task to achieve a real-time realization for SIFT algorithm using only a single PC. In this paper, a hardware design method for SIFT on FPGA is proposed, using a pipeline and parallel process to speed up the SIFT algorithm. Several new methods have also been proposed in this paper to raise the system frequency of the SIFT hardware. For example, we offline calculate the Gaussian mask values to reduce computing costs. The use of dividers to calculate the inverse matrix in the low-contrast detection module is avoided. In the normalization module, the use of dividers is also avoided to save resources. As a result, utilization of logic elements required in the FPGA hardware is greatly reduced and system frequency is significantly increased. From the experimental results, the proposed FPGA-based architecture for SIFT compares favorably with the state-of-the-art implementations to achieve real-time image processing.
WEI-ZHENG PAN received the B.S. degree from Tamkang University in 2014 and the M.S. degree from National Taiwan Normal University, Taipei, Taiwan, in 2016, all in electrical engineering. He is currently a Research Assistant with the Computational Intelligence Lab, National Taiwan Normal University, Taiwan, where he is involved in the area of facial expression recognition and action recognition using artificial intelligence approaches. He has been studying the problems of image recognition for over 3 years. His expertise includes image recognition and SLAM algorithms. 
CHEN-CHIEN JAMES HSU
