Abstract Smart cameras integrate processing close to the image sensor, so they can deliver high-level information to a host computer or high-level decision process. One of the most common processing tasks is visual feature extraction, since many vision-based use cases rely on it. Unfortunately, in most cases, feature detection algorithms are either not robust or do not reach real-time processing. To address these limitations, we propose a feature detection algorithm that is robust enough to deliver stable features under any type of indoor/outdoor scenario. This is achieved by applying a non-textured corner filter combined with a subpixel refinement. Furthermore, an FPGA architecture is proposed. This architecture allows compact system design, real-time processing for Full HD images (it can process up to 44 frames/91,238,400 pixels per second for Full HD images), and high efficiency for smart camera implementations (hardware resources similar to previous formulations without subpixel refinement and without non-textured corner filter). Regarding accuracy/robustness, experimental results for several real-world scenes are encouraging and show the feasibility of our algorithmic approach.
Introduction
Smart cameras are image/video acquisition devices with self-contained image processing algorithms that simplify the formulation of a particular application. For instance, algorithms for smart video surveillance could detect and track pedestrians [23], while for a robotic application the algorithms could be edge and feature detection [2]. In recent years, advances in embedded vision systems, such as progress in microprocessor power and FPGA technology, have allowed the creation of compact, low-cost smart cameras and have increased the performance of smart camera applications, as shown in [6, 7, 8, 11]. In current embedded vision applications, smart cameras represent a promising onboard solution in different application domains: motion detection [25], object detection/tracking [31, 32], inspection and surveillance [16], human behavior recognition [20], etc. In any case, the flexibility of the application domain relies on the large variety of image processing algorithms that can be implemented inside the camera. Feature extraction algorithms are widely used in smart cameras, since extracted features represent medium-level abstractions of the images that can be used as a rough reference for scene understanding. There are two types of features that can be extracted from an image. Global features describe the image as a whole; they can be interpreted as a particular property of the image. Local features, on the other hand, aim to detect key points/feature points within the image. In the smart camera context, there is a tendency to use local features (edges, blobs, corners) as the only visual features extracted by the algorithms inside the camera. Several distributed vision systems, such as object tracking [29, 44], virtual reality [34] and human 3D pose reconstruction [43, 51], have applied smart camera networks in which every node provides local image features. In these configurations, cooperation between nodes delivers high-level information to a host computer/robot.
Visual features for smart cameras
Local feature detection is an image processing operation that aims to deliver medium-level abstractions from an image, and it is often used as the initial step of computer vision algorithms. In previous work, several local visual feature detection algorithms were proposed. Algorithms such as Canny or Sobel [10] deliver image edges that are often used in applications like object detection [13], image labeling [50], image segmentation [27] and stereo vision [41]. Other algorithms perform corner detection, such as Shi and Tomasi [36], Harris and Stephens [21] and FAST [35], and they are the cornerstone of several computer vision applications such as 3D reconstruction [39], camera calibration [49], Structure from Motion (SfM) [40] and Simultaneous Localization and Mapping (SLAM). Nowadays desktop computers can process most corner detection algorithms in real time. Unfortunately, in some cases (mobile applications, autonomous robotics and compact smart vision systems) such approaches can be inefficient, since they require relatively high computational resources, so their power consumption and size may not be compatible with an embedded system. One solution to this problem is the use of dedicated hardware such as Field Programmable Gate Arrays (FPGAs). FPGAs are devices with low power consumption and small size (suitable for embedded/mobile applications). In addition, FPGAs are structured as a customizable circuit where image processing operations can be performed in parallel using a dataflow formalism. In previous work, several FPGA-based smart cameras have integrated corner detection algorithms [2, 6, 7] inside the camera fabric; as a result, these cameras can simplify the formulation of applications like 3D reconstruction, SfM, object tracking and camera calibration [39, 40, 49].
This is because the first step of these algorithms is visual feature extraction. If the sensor (smart camera) delivers images and extracted features simultaneously, the problem becomes partially solved, i.e., in all cases the first step of the algorithmic formulation is already solved.
Performance of corner detection algorithms
Previous corner detection algorithms such as Shi-Tomasi [36] or Harris and Stephens [21] provide good performance for datasets and/or for geometrical scenes (building images, text images and calibration patterns). Several computer vision applications have used these corner detection algorithms successfully [39, 40, 49], and several smart cameras have included them in their self-contained algorithms [2, 6]. Unfortunately, in several applications the corner detection algorithms are not compatible with highly textured regions [22] or cannot perform real-time processing on Full HD images [4, 24, 33]. We can mention the three most important limitations affecting current corner detection algorithms:
1. Low performance under complex textured regions. One limitation occurs when the input images have complex textured regions such as tree foliage or flowerbeds (Fig. 1a). In these regions, most algorithms detect features that have low temporal stability, i.e., their illumination or orientation changes over time and they are difficult to track. This problem is well documented; for example, several works study the scope/performance of SLAM systems under different feature extraction algorithms [18, 30, 47]. In several applications (3D reconstruction, SfM, SLAM), a frequently used solution is to retain only the features with high dominance, since it is assumed that the retained features have high stability and are easy to track. In practice, the stability of highly dominant features is not guaranteed, since retained features can still be located within complex textured regions, as shown in Fig. 1b. In addition, retaining features with high dominance implies the use of high threshold values, but in several cases these values retain a low number of features, and in SfM/SLAM applications these few features often do not provide sufficient information for camera pose estimation; more details about this problem can be found in [17].
2. Location accuracy is low. Most current feature extraction algorithms determine whether a candidate pixel p is a corner or not; if the pixel p is a corner, the location of p is taken as the location of the corner. In practice, these locations introduce an imprecision, since the real position of a corner point can lie between two or more pixels (subpixel location). In the case of camera calibration and 3D reconstruction, these imprecisions have a high impact on the global performance [37], as shown in Fig. 2. One solution is to add subpixel refinement as a post-processing step, as shown in [48]. However, this increases the computational requirements and processing time.
3. Low performance for embedded applications. Nowadays computers can process several corner detection algorithms in real time. Unfortunately, in embedded applications such as mobile applications, autonomous robotics or compact smart vision systems, the use of computers is difficult due to their high power consumption and size. Several FPGA-based corner detectors have been proposed [4, 5, 12]; in all cases, the FPGA architectures focused on efficient utilization of hardware resources. In [4], an FPGA implementation of the Harris corner algorithm based on a sliding processing window is presented. The purpose of the sliding window is to avoid storing intermediate results of processing stages in the external FPGA memory and to avoid the use of large line buffers, typically implemented with BRAM blocks. Therefore, the entire processing pipeline benefits from data locality. In [12], the "repetitive feature" extraction procedures were exploited in order to develop a fully parallel FPGA architecture. Other works have focused on the FAST-N formulation, where direct hardware parallelization implies low hardware resource demand, compact system design and real-time processing with low-grade FPGAs [9]. Those benefits have been used in applications such as analysis of traffic images [14] or mobile robotics [26].
Our work focuses on robust corner detection algorithms suitable for embedded applications. Thus, we propose a corner detection algorithm with high spatiotemporal robustness to complex textured regions. The keystone of this algorithm is a non-textured corner filter combined with a subpixel refinement. The algorithm is fully compatible with a hardware implementation, and an FPGA architecture suitable for real-time embedded applications is proposed. Unlike previous works, this new formulation reuses the information processed by the corner detection algorithm. Subpixel refinement and non-textured corner filtering are therefore part of the corner detection formulation; they cannot be considered post-processing steps.
Robust/accurate feature extraction algorithm suitable for smart cameras
Our work is based on the Shi-Tomasi formulation [36] . The Shi-Tomasi algorithm is a good trade-off between performance for real-world scenarios and high-speed processing. As explained above, new feature extraction, subpixel refinement and a non-textured corner filter are combined to increase the performance and robustness of the original Shi-Tomasi corner extraction algorithm. In Fig. 3 an overview of our algorithm is shown.
The preprocessing module
Given a grayscale image $I(i, j)$, horizontal and vertical gradients are given by:
$$G_x(i, j) = |I(i-1, j) - I(i+1, j)|, \qquad G_y(i, j) = |I(i, j-1) - I(i, j+1)|.$$
Absolute values are used in the gradient formulation to avoid signed variables and reduce hardware resource utilization. Of course, this modification changes the performance of the original Shi-Tomasi algorithm; however, the loss in performance is minimal in comparison with the decrease in hardware resources. The difference between the formulation using signed values and the formulation using absolute values is around 1% for the corner response image. From the gradients, matrices A, B, C (x, y and xy gradient derivatives) are defined as:
$$A(i, j) = G_x(i, j) \cdot G_x(i, j), \qquad B(i, j) = G_y(i, j) \cdot G_y(i, j), \qquad C(i, j) = G_x(i, j) \cdot G_y(i, j).$$
A Gaussian filter is applied to the A, B, C matrices to reduce noise and to remove fine-scale structures that affect the performance of the corner response. In order to reach high performance for embedded applications, the convolution steps of our feature extraction algorithm take inspiration from our previous work [3] where, in order to obtain a straightforward FPGA implementation, a fixed kernel with values simplified/rounded to a fixed-point binary representation was proposed. In this work, we propose a fixed binary kernel as shown in Eq. 1. This kernel approximates a 5 × 5 Gaussian kernel with $\sigma = 5/3$ and simplifies the multiplications in the FPGA implementation by replacing them with shift register operations. This decreases the hardware resources required by the FPGA implementation, facilitates a parallel-pipeline design and costs little accuracy compared with the original Gaussian kernel. The average difference between the original Gaussian kernel and the modified kernel is 0.0112 (1.12%), so it is reasonable to assume that results using the modified kernel are very close to the results using the original.
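The preprocessing steps above can be sketched in software as follows. This is a minimal illustration, not the FPGA implementation: it assumes the image is stored as a NumPy array indexed as [i, j], uses a floating-point 5 × 5 Gaussian with sigma = 5/3 in place of the fixed binary kernel of Eq. 1 (which is not reproduced here), and leaves border gradients at zero. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_5x5(sigma=5.0 / 3.0):
    """Floating-point 5x5 Gaussian (assumption: stands in for the binary kernel of Eq. 1)."""
    ax = np.arange(-2, 3, dtype=np.float64)
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def convolve_same(img, kernel):
    """Naive same-size 2-D convolution with zero padding (the kernel is symmetric)."""
    r = kernel.shape[0] // 2
    padded = np.pad(img, r, mode="constant")
    out = np.zeros(img.shape, dtype=np.float64)
    for di in range(kernel.shape[0]):
        for dj in range(kernel.shape[1]):
            out += kernel[di, dj] * padded[di:di + img.shape[0], dj:dj + img.shape[1]]
    return out

def preprocess(image):
    """Return the smoothed A, B, C matrices for a grayscale image."""
    I = image.astype(np.float64)
    gx = np.zeros_like(I)
    gy = np.zeros_like(I)
    # G_x(i, j) = |I(i-1, j) - I(i+1, j)|, G_y(i, j) = |I(i, j-1) - I(i, j+1)|
    gx[1:-1, :] = np.abs(I[:-2, :] - I[2:, :])
    gy[:, 1:-1] = np.abs(I[:, :-2] - I[:, 2:])
    k = gaussian_kernel_5x5()
    A = convolve_same(gx * gx, k)
    B = convolve_same(gy * gy, k)
    C = convolve_same(gx * gy, k)
    return A, B, C
```

For an image that varies only along i, such as a ramp, B and C come out as zero and A is constant away from the borders, which is a quick sanity check of the index convention.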
Corner detection
The original Shi and Tomasi corner response (Eq. 2) provides a high response value for corners and a low response otherwise, as illustrated in Fig. 4b. In order to determine whether a pixel P is a corner or not, the maximum values of the corner response could be retained. Of course, many pixels around each corner are also detected, even after filtering with a threshold $\sigma$. These pixels are false feature candidates and have low temporal stability.
$$D(i, j) = \big(A(i, j) + B(i, j)\big) - \sqrt{\big(A(i, j) - B(i, j)\big)^2 + 4\,C(i, j)^2} \qquad (2)$$
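As a small numerical illustration of the corner response, the sketch below evaluates D from A, B and C and checks it against the eigenvalues of the 2 × 2 matrix [[A, C], [C, B]]: algebraically, D equals twice the smaller eigenvalue of that matrix. The function name and the sample values are hypothetical.

```python
import numpy as np

def corner_response(A, B, C):
    """D = (A + B) - sqrt((A - B)^2 + 4*C^2); works on scalars or arrays."""
    return (A + B) - np.sqrt((A - B) ** 2 + 4.0 * C ** 2)

# Illustrative values only (not taken from the paper).
A, B, C = 5.0, 2.0, 1.5
D = corner_response(A, B, C)

# eigvalsh returns eigenvalues in ascending order, so index 0 is the smaller one.
lam_min = np.linalg.eigvalsh(np.array([[A, C], [C, B]]))[0]
```

Because D is a fixed multiple of the smaller eigenvalue, thresholding D ranks pixels in the same order as thresholding that eigenvalue.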
One way to remove the false feature candidates is to apply a non-maxima suppression step. In our case, we consider that an appropriate FPGA-based non-maxima suppression step could be defined as follows:
Non-textured corner filtering
A robust corner is a point/pixel located at the intersection of two or more edges. Unfortunately, textured regions respond strongly to corner detectors (like Shi and Tomasi) but do not represent robust corner features. We therefore propose a non-textured corner filter based on a triple surrounding patch around each candidate corner. We assume that a robust corner has to be associated with a geometric shape in which all pixels have a similar corner response. The corner response must then be similar across all the patches; otherwise, the detected corner is an isolated point within a complex textured region.
Subpixel refinement module
In previous work, one of the most used approaches takes the grayscale values of the input image and applies a Gaussian/quadratic fitting over the extracted corners in order to refine the previously computed location [19, 38, 48]. Although Gaussian/quadratic fitting using grayscale values achieves relatively high performance, a solution more suitable for FPGA architectures is a mathematical formulation that uses the same input as the corner extraction step. In this case, the feature extraction and the subpixel refinement can be computed in parallel. In addition, using the same input allows the same buffer to be shared, which can decrease hardware resource usage.
Subpixel location using the least squares fitting
Considering a group of observed data $(x_1, y_1, z_1), (x_2, y_2, z_2), \ldots, (x_n, y_n, z_n)$, where x, y are the image pixel location and z is the corner metric response (Eq. 2), any fitting function f(x, y) should fulfill the standard least squares equation:
where x, y are the independent variables, z is the dependent variable, $n \geq k$, and k is the number of independent parameters in the function f(x, y), which is also the minimum number of samples required [45]. The fitting technique can be generalized from a best-fit line to a best-fit polynomial: if a low-order polynomial is employed, the fitting accuracy tends to be poor, while high-order polynomials may lead to unstable fitting results. In this work, we select a quadratic polynomial function ($k = 6$, i.e., $a_0, a_1, \ldots, a_5$) as shown in Eq. 10 to fit a parabolic surface, and then obtain the parameters of this function through least squares adjustment.
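For intuition, the sketch below fits a six-parameter quadratic surface to a 3 × 3 patch of corner responses with ordinary least squares and reads the subpixel offset from the stationary point of the fitted surface. It is a generic illustration of quadratic least squares fitting, not the direct formulation proposed later in this section: the ordering of the terms, the [i, j] indexing with x along the first axis, and the function names are all assumptions.

```python
import numpy as np

def fit_quadratic_3x3(patch):
    """Least squares fit of z = a0 + a1*x + a2*y + a3*x^2 + a4*x*y + a5*y^2
    to a 3x3 patch indexed as patch[x + 1, y + 1] with x, y in {-1, 0, 1}."""
    xs, ys = np.mgrid[-1:2, -1:2]
    x = xs.ravel().astype(np.float64)
    y = ys.ravel().astype(np.float64)
    M = np.column_stack([np.ones(9), x, y, x * x, x * y, y * y])
    coeffs, *_ = np.linalg.lstsq(M, patch.ravel().astype(np.float64), rcond=None)
    return coeffs

def subpixel_offset(coeffs):
    """Stationary point of the fitted surface (gradient equal to zero).
    Raises numpy.linalg.LinAlgError if the surface is degenerate (e.g. a flat patch)."""
    a0, a1, a2, a3, a4, a5 = coeffs
    H = np.array([[2.0 * a3, a4], [a4, 2.0 * a5]])
    g = np.array([a1, a2])
    dx, dy = np.linalg.solve(H, -g)
    return dx, dy
```

With nine samples and six parameters the system is overdetermined (n ≥ k), which is the situation the least squares formulation above describes.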
Considering that the generalized adjustment model could be expressed as follows:
where $X_{k,1}$ denotes the k independent parameters, $L_{n,1}$ are the n samples, and $d_{n,1}$ is the constant item in the expression (in this case $d_{n,1} = 0$). In such a scenario the problem is to set an appropriate weight determination for the independent variables ($B_{n,k}$) [45]. One approach to this problem is to set a Gaussian weight distribution as the initial solution and then iterate using an adjustment criterion that refines the first approximation [48]. In our case, we propose a direct weight determination for the independent variables using the Vandermonde matrix. The Vandermonde matrix has previously been used in polynomial interpolation procedures with promising results [42], so in this work we assume that similar performance can be reached in our application domain. Using the Vandermonde matrix as the weight determination for the independent variables makes it possible to avoid the iterative procedures required in previous work and therefore simplifies the hardware implementation. That is, given the weight determination for the independent variables $B_{n,k}$ obtained via the Vandermonde matrix, and considering $X_{k,1}$ as corner responses from an image, it is possible to compute the subpixel position within a parabolic surface as $L_{n,1} = B_{n,k} \cdot X_{k,1} + 0$.
The proposed approach
A 3 × 3 template window, as shown in Fig. 7, and Eq. 10 are used to carry out the least squares fitting. The sample values $X_{k,1}$ ($k = 0, 1, \ldots, n$) are corner metric responses, $n = 9$ is the number of sample values, and we set $X_{k,1} = (s_0, s_1, \ldots, s_n)^T$, where $s_0, s_1, \ldots, s_5$ are the six independent parameters in the least squares fitting. Our algorithm selects any six corner responses at reasonably small distances from the center point as independent parameters; for practical purposes we used S0, S3, S5, S7, S2 and S6. The parameters in the fitting function can then be calculated as $L_{n,1} = B_{n,k} \cdot X_{k,1}$, where $B_{n,k}$ is the weight determination for the independent variables and is defined as follows. Given the independent parameters {S0, S3, S5, S7, S2, S6}, the x, y displacements with respect to the center S4 can be defined as $x = \{-1, -1, 1, 0, 1, -1\}$, $y = \{-1, 0, 0, 1, -1, 1\}$. Then, two different Vandermonde matrices (Eqs. 11 and 12) are computed. For practical purposes we use the vander MATLAB function to compute the $V_x$, $V_y$ matrices. Finally, we define $B_{n,k}$ as $(V_x * V_y)/S_n$, because the matrix multiplication of $V_x$ and $V_y$ provides the weight response for the x, y axes using a single matrix [42]. $S_n$ is the number of observations used in the interpolation process, in this case {S0, S3, S5, S7, S2, S6}.
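The construction of the weight matrix can be reproduced with NumPy, whose numpy.vander uses the same decreasing-power column order as the MATLAB vander function by default. The sketch below is only an illustration under stated assumptions: it takes S_n = 6 (the six selected observations), treats V_x and V_y as the full 6 × 6 Vandermonde matrices of the displacement vectors, and uses a true matrix product; Eqs. 11 and 12 themselves are not reproduced here.

```python
import numpy as np

# Displacements of S0, S3, S5, S7, S2, S6 with respect to the center S4.
x = np.array([-1, -1, 1, 0, 1, -1], dtype=np.float64)
y = np.array([-1, 0, 0, 1, -1, 1], dtype=np.float64)

# Row r of np.vander(v) is [v[r]^5, v[r]^4, ..., v[r]^1, 1].
Vx = np.vander(x)
Vy = np.vander(y)

S_n = 6  # assumption: number of observations used in the interpolation
B = (Vx @ Vy) / S_n  # 6x6 weight matrix, as in B = (V_x * V_y) / S_n
```

Since the displacements are fixed, B is a constant matrix that can be precomputed offline, which is consistent with the aim of avoiding iterative procedures in hardware.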
After calculating the sample values in the least squares fitting ($L_{n,1}$), we apply the formulation presented in [48]; therefore, the decimal part of the extracted features can be calculated as:
So the subpixel location of the extracted features is:
where $x_{sp}$, $y_{sp}$ are the refined subpixel locations. This process is illustrated in Fig. 8.
Output construction module
The final step is the subpixel location of the corners retained after the non-textured filtering. Thus, when $E(i, j) = 1$ AND $F(i, j) = 1$, the coordinates of each corner are computed by:
FPGA architecture for the feature extraction algorithm
An overview of the developed FPGA architecture is illustrated in Fig. 9. The architecture is composed of four hardware processing elements: image preprocessing, subpixel refinement, corner detection and non-textured corner filter. The cores of the FPGA architecture are circular buffers attached to the local processors; they hold local sections of the image and allow local parallel data access for parallel processing. In general, input images are processed as a stream. First, the architecture stores parts of the incoming frames in circular buffers that hold image rows temporarily, as a cache, and deliver parallel data to the image preprocessing module. In the image preprocessing module, the architecture computes the vertical and horizontal gradients. Then it computes the A(i, j), B(i, j), C(i, j) variables. Circular buffers deliver image pixels for the smoothing operations, and reconfigurable convolution units (see [3]) compute the smoothing operation. Finally, the FPGA architecture computes the corner response metric, $D(i, j) = \big(A(i, j) + B(i, j)\big) - \sqrt{\big(A(i, j) - B(i, j)\big)^2 + 4\,C(i, j)^2}$; for that, we adapted the architecture developed by Yamin Li and Wanming Chu [28]. This architecture uses a shift register mechanism and compares the more significant/less significant bits; it computes the square root with low hardware resources and therefore provides a convenient square root framework, suitable for our algorithmic formulation. Using the D(i, j) computed by the image preprocessing module, three modules carry out the corner detection, the subpixel refinement and the non-textured corner filter in parallel.
Finally, the output construction module delivers the refined positions for the corners retained after the non-textured filtering.
Fig. 7 Template window for least squares fitting
The circular buffers
In [3] we introduced a circular buffer scheme in which input data from the previous N rows of an image are stored in memory buffers (block RAMs/BRAMs) until an N × N neighborhood has been scanned along subsequent rows. In this work, we follow a similar approach to achieve high data reuse and a high level of parallelism. Our algorithm is therefore processed in modules where all image patches can be read in parallel. First, a shift mechanism (control unit) manages the read/write addresses of N + 1 BRAMs; in this formulation, N BRAMs are in read mode and one BRAM is in write mode in each clock cycle. Data inside the read-mode BRAMs can then be accessed in parallel, and each pixel within an N × N region is delivered in parallel (N × N buffer), as shown in Fig. 10a. For more details see [3].
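A purely sequential software model can help to visualize the N + 1 row-buffer rotation described above. The sketch below is an assumption-laden illustration, not the hardware design of [3]: the class and method names are hypothetical, one NumPy row stands in for each BRAM, and the parallel reads of the hardware are modeled as a single array slice.

```python
import numpy as np

class CircularRowBuffer:
    """Software model of N + 1 row memories: N readable rows plus one row being written."""

    def __init__(self, n, width):
        self.n = n
        self.rows = np.zeros((n + 1, width), dtype=np.int64)
        self.write_idx = 0  # slot currently in write mode
        self.count = 0      # number of rows pushed so far

    def push_row(self, row):
        """Write one image row, then rotate the write slot."""
        self.rows[self.write_idx] = row
        self.write_idx = (self.write_idx + 1) % (self.n + 1)
        self.count += 1

    def window(self, col):
        """Return the N x N neighborhood starting at column col,
        built from the N most recent rows (oldest row first)."""
        if self.count < self.n:
            raise ValueError("fewer than N rows stored so far")
        order = [(self.write_idx - self.n + k) % (self.n + 1) for k in range(self.n)]
        return self.rows[order, col:col + self.n]

# Usage: stream a small test image row by row and read a 3 x 3 window.
image = np.arange(30).reshape(6, 5)
buf = CircularRowBuffer(3, 5)
for r in range(4):
    buf.push_row(image[r])
patch = buf.window(1)  # rows 1..3, columns 1..3 of the test image
```

The extra slot is what lets a new row be written while the previous N rows remain readable, mirroring the one-write, N-read arrangement of the BRAMs.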
FPGA architecture for feature extraction
In Fig. 11, the FPGA architecture for feature extraction is shown. First, a circular buffer delivers parallel data for a 3 × 3 processing window, i.e., for every pixel in the input image, the circular buffer delivers the nine pixels of the 3 × 3 patch centered on it in parallel. Using the pixels within the patch, the processor computes the non-maxima suppression step, and then, based on the values obtained, a thresholding operation determines whether the patch center is a corner or not.
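The window-based detection step can be illustrated with a simple software sketch. The exact suppression rule of the paper is not reproduced in this text, so the rule below is an assumption: the center of each 3 × 3 window is kept when it is at least as large as every value in the window and also exceeds the threshold. The function name, the use of an absolute threshold and the name E for the binary output map are likewise assumptions.

```python
import numpy as np

def detect_corners(D, threshold):
    """3x3 non-maxima suppression followed by thresholding.
    Returns a binary map E with 1 at retained corner candidates."""
    h, w = D.shape
    E = np.zeros((h, w), dtype=np.uint8)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = D[i - 1:i + 2, j - 1:j + 2]
            centre = D[i, j]
            # Keep the centre only if it is the window maximum and above threshold.
            # Note: equal values on a plateau are all kept by this simple rule.
            if centre > threshold and centre >= patch.max():
                E[i, j] = 1
    return E
```

In hardware the nine comparisons of each window would run in parallel on the values delivered by the circular buffer; the nested loops here only model the same decision per pixel.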
FPGA architecture for non-textured corner filtering
First, a circular buffer delivers parallel data for the non-textured corner filtering; in this case, all the pixels within a 9 × 9 image patch are delivered in parallel, and the patch center is the location of the pixel being processed. Then, using the pixels within the patch, the processor computes the three patches that surround the tentative corner in parallel, as shown in Fig. 12. Next, our FPGA architecture carries out the comparisons between these patches and, finally, this module estimates the geometric robustness of the corners. Using this geometric robustness and a threshold value provided by the user ($\sigma_2$), the module retains only corners with high geometric robustness.
FPGA architecture for subpixel refinement
For the subpixel refinement, we implement the proposed mathematical formulation using a parallel-pipeline approach, as shown in Fig. 13. First, a circular buffer delivers parallel data for the subpixel computation, i.e., the circular buffer delivers all pixels within a 3 × 3 image patch in parallel, and the patch center is the location of the pixel being processed. Using the pixels within the patch, the processor computes the sample values ($a_1$ to $a_6$) in parallel. Then, based on the sample values, the FPGA architecture computes the decimal part $(\Delta_x, \Delta_y)$ for all possible corner points in the input image in parallel. In order to reduce hardware resources consumption, a lookup
Output construction
Using the outputs E(i, j), F(i, j) and $x_{sp}(i, j)$, $y_{sp}(i, j)$, the output construction module applies logic comparisons between registers. Then, it computes the subpixel coordinates of the corner points retained after the non-textured filter. In practice, these subpixel coordinates can be used by any real-world application: SfM, SLAM, camera calibration, etc.
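The selection logic of the output construction step can be sketched as follows. This is an illustration under assumptions: E and F are taken to be binary maps from the corner detection and non-textured filter modules, dx and dy are per-pixel decimal parts, and the refined coordinate is assumed to be the integer location plus the decimal part, with x paired with index i and y with index j. The function and argument names are hypothetical.

```python
import numpy as np

def build_output(E, F, dx, dy):
    """Return an (n, 2) array of subpixel coordinates for pixels where E == 1 AND F == 1."""
    keep = (E == 1) & (F == 1)
    ii, jj = np.nonzero(keep)
    # Assumed refinement: integer location plus decimal part.
    x_sp = ii + dx[ii, jj]
    y_sp = jj + dy[ii, jj]
    return np.column_stack([x_sp, y_sp])
```

If no pixel passes both tests, the function returns an empty array of shape (0, 2), so downstream code can iterate over the result without a special case.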
Results and discussion
The developed FPGA architecture was implemented on an Altera Cyclone IV EP4CGX150CF23C8 FPGA. All modules were designed using Quartus II Web Edition version 10.1SP1 and validated via post-synthesis simulations performed in ModelSim Altera. For all tests, we set $\sigma_1 = 0.1$, $\sigma_2 = 0.1$, since these values provided a high number of "good" features (it is possible to obtain more than 10,000 features per frame, and these features have high temporal stability; therefore, they are easy to track) under a large set of different indoor/outdoor scenarios.
In practice, we recommend these values as a trade-off between a high number of detected features and high temporal stability. Lower values of $\sigma_1$, $\sigma_2$ can detect more features; however, temporal stability may decrease. On the other hand, higher values of $\sigma_1$, $\sigma_2$ should provide more temporal stability, but the number of detected features decreases.
Fig. 11 FPGA architecture for the feature extraction. In a first instance, a suppression step over the corner response is computed. Then, a thresholding ($\sigma_1$) is applied in order to select the "good" corner features
Fig. 12 FPGA architecture for the non-textured corner filter. Three patches that surround any possible corner are computed. Then, comparisons between these patches are computed. Finally, by using the patch comparisons, the geometric robustness of the corners is estimated
Fig. 13 FPGA architecture for the subpixel refinement. First, six independent parameters for a least squares fitting are computed. Then least squares fitting refines the integer locations
Performance compared with previous work
The full hardware resource consumption of the architecture is shown in Table 1. Our algorithm formulation allows for a compact system design: it requires 4% of the total logic elements. For memory bits, our architecture uses 8% of the total resources, which represents 34 block RAMs consumed mainly by the circular buffers. This hardware utilization makes it possible to target a smaller FPGA device and, therefore, a small FPGA-based smart camera suitable for real-time embedded applications. In Table 2 we compare the hardware resource utilization of our FPGA architecture with that of previous FPGA-based feature extraction algorithms. For the FAST algorithm, there are several works [9, 14, 26] whose FPGA implementations take advantage of the mathematical formulation of the FAST algorithm. For all tests, we compared the Harris and Shi-Tomasi formulations directly, because the Shi-Tomasi corner detector is based entirely on the Harris corner detector. In general, one modification of the corner response function makes the Shi-Tomasi corner detector more robust under illumination changes (which is useful for tracking the features). Unfortunately, this modification uses one square root, which limits the hardware implementation. In this work, we compute square roots by adapting the algorithm presented in [28], and we introduce an FPGA-based implementation of the Shi-Tomasi feature extraction algorithm. In general, FAST-based approaches do not require block RAM cores, since the original FAST formulation allows a straightforward pipeline reformulation. Therefore, their hardware resource demand is low compared with our approach (a Shi-Tomasi-based approach) and low compared with previous Harris/Shi-Tomasi-based approaches [4, 12, 24, 33]. FPGA architectures based on the original Harris-Stephens/Shi-Tomasi formulations are nevertheless motivated by the low robustness of the features detected by the FAST algorithm.
In general, features detected by FAST-based approaches have high noise sensitivity and very low performance in complex textured regions. Compared with previous Harris-Stephens/Shi-Tomasi-based algorithms, our algorithm formulation, which replaces quotients by lookup tables and uses reconfigurable convolution units [3], allows lower hardware consumption than [24] and [33]. In addition, our algorithm has hardware requirements similar to the most efficient Harris-Stephens implementation [4]; only [4] has lower hardware requirements than our algorithm, but without subpixel computation and without the robustness of our approach. In Table 3, the processing speed of the proposed feature extraction algorithm for different image resolutions is shown. For that, we synthesized different versions of our FPGA architecture (Fig. 9); in these versions, we modified the circular buffers in order to work with all tested image resolutions. Then, we carried out post-synthesis simulations in ModelSim Altera. In all cases, our FPGA architecture allows for real-time processing. When compared with previous work (Table 4), our algorithm provides the highest processing speed for Full HD images, outperforming several previous works [9, 14, 24, 26], and for HD images our algorithm reaches a processing speed similar to the most efficient Harris-based approach [4].
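The pixel throughput quoted for Full HD can be checked with a few lines of arithmetic. The sketch below assumes that Full HD means 1920 × 1080 pixels per frame; the function name is hypothetical.

```python
def pixel_throughput(width, height, frames_per_second):
    """Pixels processed per second for a given resolution and frame rate."""
    return width * height * frames_per_second

# Full HD (assumed 1920 x 1080) at 44 frames per second.
full_hd = pixel_throughput(1920, 1080, 44)
print(f"{full_hd:,} pixels per second")
```

The result, 91,238,400 pixels per second, matches the throughput figure reported for Full HD images.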
Performance compared with the original Shi-Tomasi algorithm
In order to validate the accuracy of our subpixel refinement step, we create a dataset as shown in Fig. 14. We extract feature points from frame 1 and then track them using the approach presented in [1]. This approach delivers high accuracy in terms of feature tracking (more accurate than the most used algorithms such as SIFT and SURF), but its algorithmic formulation is highly exhaustive and an FPGA architecture is necessary in order to reach real-time processing. Finally, in order to measure the repeatability, we measure the relation between the number of input and output features in the feature tracking algorithm. That is, the feature tracking algorithm assumes that all input features can be tracked; therefore, the number of input features should equal the number of output features. In practice this is not true, and the correlation function in the feature tracking algorithm only retains features with high temporal robustness (see Eq. 8 in [1]), i.e., it measures the robustness of the features detected by any feature extraction algorithm.
In Fig. 15, we show in graphical form the scope and performance of the non-textured corner filter. For low threshold values, the original Shi-Tomasi formulation delivers a high number of features that are difficult to track, as shown in the left column of Fig. 15. On the other hand, high threshold values often deliver highly robust features, as illustrated in the central column of Fig. 15. However, the number of features is low, and in applications such as 3D reconstruction a low number of features implies sparse 3D reconstructions that make it difficult to understand the environment. For SfM and SLAM, a low number of features often makes it difficult to estimate the camera pose. Using our feature extraction algorithm, the non-textured corner filter retains a high number of features (right column of Fig. 15), even for input images with complex texture, as shown in Fig. 17.
Finally, for practical real-world applications, in Figs. 16 and 17 we show the performance for a 3D reconstruction application. In both cases squares are the features detected by the feature extractor module, while circles are the features retained after the non-textured corner filter. The retained features were tracked across different viewpoints from the same scene. Then, we compute the corresponding 3D reconstruction following the formulation presented in [2] . In Fig. 16 we show the performance for indoor scenarios. As shown in Fig. 16d the subpixel refinement module increases the accuracy of the 3D reconstruction. In Fig. 17 we show the performance for outdoor scenarios. In this case, low threshold values in the original Shi-Tomasi formulation deliver high number of features that are difficult to track, while high threshold values deliver low number of features and the 3D reconstruction is sparse. By using our formulation, it is possible to retain a higher number of features that are easy to track and therefore, deliver semi-dense 3D reconstruction under input images with complex textured regions.
Appendix A: Pseudo code for the proposed algorithm
Appendix B: Conclusions
In this article, we have introduced a new feature extraction algorithm suitable for smart camera implementation. Our algorithm is robust enough to deliver a high number of robust features for image sequences with many complex textured regions, and at the same time it delivers high performance for real-time embedded applications. We have proposed a non-textured corner filter that retains a high number of robust features for images with complex textured regions, and we have proposed a subpixel refinement. Both increase the performance and scope of the original Shi-Tomasi corner detection algorithm. We proposed an FPGA architecture that allows real-time processing and compact system design, and we validated it via post-synthesis simulations. Our results are encouraging and show the feasibility of our algorithmic approach. The FPGA architecture reuses the corner response values used in the corner detection module and computes the subpixel refinement and the non-textured filtering modules in parallel. This enables efficient hardware resource utilization: lower than several previous formulations without subpixel refinement and without non-textured corner filter, and similar to the most efficient FPGA-based Harris corner detection. Finally, our FPGA architecture delivers high-speed processing (it can process up to 44 frames/91,238,400 pixels per second for Full HD images), higher than most previous work and similar to the most efficient FPGA-based Harris corner detection reported. Since many vision algorithms rely on finding and tracking features, we consider that this work can be useful in several real-time image processing applications such as Structure from Motion and Simultaneous Localization and Mapping. As work in progress, we are implementing the developed FPGA architecture in the DreamCam, a robust/flexible smart camera [6].
