Scale Invariant Feature Transform (SIFT) algorithm is a widely used computer vision algorithm that detects and extracts local feature descriptors from images. SIFT is computationally intensive, making it infeasible for single threaded implementation to extract local feature descriptors for high-resolution images in real time. In this paper, an approach to parallelization of the SIFT algorithm is demonstrated using NVIDIA's Graphics Processing Unit (GPU). The parallelization design for SIFT on GPUs is divided into two stages, a) Algorithm design-generic design strategies which focuses on data and b) Implementation design-architecture specific design strategies which focuses on optimally using GPU resources for maximum occupancy. Increasing memory latency hiding, eliminating branches and data blocking achieve a significant decrease in average computational time. Furthermore, it is observed via Paraver tools that our approach to parallelization while optimizing for maximum occupancy allows GPU to execute memory bound SIFT algorithm at optimal levels.
Introduction
Image matching is a fundamental aspect needed to solve computer or machine vision problems, including object or scene recognition, 3D structure modeling, stereo image correspondence, motion tracking, etc. Objects in images have features that can be extracted and used for comparing across images. A large number of such features for various objects can be extracted from images with effi-cient algorithms. An example of such an algorithm is Scale Invariant Feature Transform (SIFT) [1] , which detects and extracts local feature descriptors from images. The features extracted using the SIFT algorithm are invariant to rotation, scaling and illumination and hence applicable to scene modeling, recognition and tracking [2] . A simple sequential implementation of SIFT on lower resolution images is shown to utilize huge memory with high computation times [3] , making the use of SIFT in real-time applications infeasible.
High-resolution image processing is used in various application domains like remote sensing, traffic monitoring, unmanned air vehicles, mars expeditions, etc.
SIFT has been shown to provide good tracking capability in all the above domains [4] [5] [6] [7] . However, the large computational burden prevents the use of SIFT in real-time applications in the above domains. Feng et al., in their multi-core parallelization work [8] have demonstrated the potential of parallelizing the SIFT algorithm. A number of researchers [3] [9] [10]- [15] [35] [36] [37] [38] [39] have attempted to parallelize the SIFT algorithm to make it applicable for real-time. In particular, the research effort by Yao et al., has demonstrated a parallelized SIFT algorithm [13] with the capability to process an image of resolution 640 × 480 pixels in 31 ms. The 31 ms processing time would be sufficient to process video of frame rate 24 frames per second with a resolution of 640 × 480.
However, the current application domains require processing of images with resolutions ranging from 1280 × 720 (720 p) to 8192 × 4608 (8 k) known as highresolution images in real-time. Recent researches [16] [17] [18] [19] [20] have attempted to parallelize SIFT for high-resolution images. However, all of these attempts either have high execution time or fail to provide reproducible implementation, making current approaches of parallelizing the SIFT algorithm incompatible for real-time applications.
In this paper, we discuss a two phase: a) algorithm design and b) implementation design, parallelization strategies to obtain a better and efficient parallelization of SIFT, which lowers the compute time enabling real-time processing of high-resolution images. While, the algorithm design phase strategies demonstrate parallelization of the SIFT algorithm, the implementation design phase strategies define the values of machine parameters, such as blocks and threads, to implement the parallelized SIFT algorithm on to the underlying GPU architecture. The two-phase parallelization approach facilitates porting of parallelized SIFT algorithm to different GPU architectures. The research work presented in this paper is based on the preliminary research presented by authors in [21] . In preliminary research, the parallelization of only the first stage of the SIFT algorithm was presented. This paper presents the complete parallel implementation of SIFT suitable for real-time applications.
The paper is organized as follows: In Section 2, we provide a brief description of the SIFT algorithm based on the research work by Lowe in [1] . Section 3, presents the mathematical steps in the SIFT algorithm. In Section 4, an intro-duction to the GPU architecture and its constraints are presented. In Section 5, we provide the two phase parallelization strategies and demonstrate the parallelization of SIFT algorithm. Section 6 provides the performance of the parallelized SIFT algorithm. We discuss the future work and conclude the paper in Section 7.
Description of SIFT
SIFT algorithm discussed in the research work by Lowe has four stages to obtain the features of an image [1] . The four stages of the SIFT algorithm are shown in b) Extrema detection: The pixels of a DoG image are the difference in intensity values of consecutive blurred images. Keypoints are pixels identified as local maxima or minima of the DoG images across octaves. This is obtained by comparing a pixel in the DoG image to its neighboring pixels at the current and adjacent scales [1] . If the pixel is greater than or lesser than all its neighbors, it is considered as the local maxima or minima respectively for the given octave level.
If the local minima or maxima for the first octave, continues to be a local minima or maxima for all octaves, then it is selected as a candidate keypoint.
2) Keypoint localization: The selected candidate keypoints (CKs) are filtered to eliminate unstable CKs. CKs having low contrast or being a part of an edge in an image are considered as unstable keypoints [1] . A 3D quadratic function fitting is performed on the selected CKs to evaluate its contrast with respect to neighboring points [22] . By setting a threshold for evaluation, low contrast CKs are eliminated [1] . A second-order derivative of the Hessian matrix [1] is used to eliminate the CKs that form part of an edge.
3) Orientation assignment: The CKs remaining after the localization step are considered as keypoints. One or more orientations are assigned to each keypoint based on local image gradient directions, i.e., the contribution of each neighboring pixel is weighted by a gradient magnitude as discussed in [1] . The peaks of the histogram of orientations indicate dominant orientations. Once a keypoint orientation has been selected, the feature descriptor is computed as a set of orientation histograms represented in the form of a vector [1] .
Steps in SIFT
A step-by-step mathematical discussion of SIFT, as seen in [1] , is provided below. It should be noted that we are reiterating the description of SIFT from [1] 
where, , is obtained as follows:
where, , x y in o xy I and so xy L is the co-ordinate of a pixel in the image.
Since the image values are discrete numbers, the convolution is performed as shown below: 
This operation can be merged as shown in Equation (5) CKs that include points having low contrast or are poorly localized along an edge, and hence are considered unstable [1] . To eliminate low contrast CKs, an evaluation of CKs' contrast with respect to its neighboring points needs to be determined. This is done by fitting a 3D quadratic function to the CK as described in [1] [22] . The 3D quadratic function is given by:
where,
Since the image values are discrete numbers, the derivatives have to be computed numerically. Equations (7)- (9) provide the finite difference method to calculate the partial derivatives
x y σ are cyclic, and hence can be swapped in Equations (7)- (9) to obtain other derivatives: Figure 3 . Extrema detection example for a pixel. 
or so so so so xy 
The contrast ( ) ( ) 
The CKs are considered low contrast if they satisfy the inequality shown in Equation (12) and are discarded.
( ) 0.03
g) Edge keypoint elimination: The DoG has a strong edge response, i.e., points on an edge is detected as an extrema. This is due to the principal curvature values for DoG tend to be higher along an edge. However, the principal curvature value for DoG along the perpendicular direction of an edge is low in magnitude.
This property is used to eliminate the CKs that are a part of the edge. The eigen values of the Hessian matrix ( ) H computed on D, as shown in Equation (13) , are proportional to the principal curvature values.
Therefore, the ratio ( ) r of the maximum eigen value ( ) α to the minimum eigen value ( ) β can be used to determine whether a CK belongs to an edge.
Using the relationship between trace ( ) Tr and determinant ( ) (14), (15) , the candidate keypoint must satisfy the constraint as seen in Equation (16) to qualify as a keypoint.
( )
Tr H α β = + (14) ( ) Det H αβ = (15)( ) ( ) ( ) 2 2 1
Det
Tr H r H r
As described in [1] , r = 10 serves as a good value to eliminate edge points.
h) Orientation assignment: The keypoints can be described relative to local orientation of its neighboring pixels rendering them invariant to image rotation.
The orientation is assigned by calculating two parameters a) gradient magnitude
is assigned as follows:
A descriptor is generated using this orientation by constructing a histogram of orientations of sample points around the keypoints. Each sample point added to Table 1 indicate that our parallelization should focus more on steps (a) -(c). Table 1 .
Step-wise computations for SSED for octave level 1.
Step 
Compute Unified Device Architecture
General-Purpose Computing on GPU (GPGPU) is a new parallelization domain to accelerate scientific and engineering computations in vision algorithms [23] .
NVIDIA's CUDA is the earliest and most widely adopted programming model for GPGPU [24] . Since the beginning of parallelization on GPGPU, NVIDIA has introduced several GPU architectures to improve the performance of parallelized
programs. An overview of the three common features of GPU architecture namely thread, memory and execution hierarchies [25] , are provided below.
GPU thread hierarchy is characterized by three parts namely the grid, block can execute concurrently on a SM is based on the number of warps that can concurrently execute on the SM which, in turn, is based on the amount of maximum warps supported by the SM, and the amount of shared and register memory required by a block. As the memory requirement per block increases, the number of warps that can concurrently execute on a SM decreases. An SM is said to be at maximum efficiency if the number of warps concurrently executing is equal to maximum warps supported by the SM.
Like any other machine, GPUs also have constraints. Each SM, B, T and w have its own constraints. For example, each SM has a constraint on maximum active blocks (currently executing blocks), maximum block allocation, maximum active warps, maximum executing threads, etc. Similar constraints apply to B, T and w.
Constraints, for the purpose of this paper, are represented symbolically as a function shown below:
where α represents the variable to which the constraint applies, β represents the scope of the α variable, γ represents the type of restriction imposed on α and δ is the output that provides the numerical value of the restriction. γ can take the following values listed in the Table 2 below.   Table 3 summarizes all the thread and execution hierarchy based constraint functions: Similar to SM, B, T and w, the memory is also subjected to constraints. Table 4 summarizes all the memory hierarchy based constraint functions:
Parallelization
There have been a number of GPU parallelization strategies that have been proposed over the last few years. A few relevant to SIFT are presented in [26] - [34] .
While [26] [27] focus on parallelization strategies for stencil codes similar to SIFT, [28] - [34] focus on implementation specific strategies for parallelization.
None of the articles provide algorithm design strategies, as we propose in this paper, to utilize the benefits of GPU architecture for parallelization.
The parallelization strategies, proposed in this paper, are divided into two phases for separating SIFT and GPU related parallelization strategies. The algorithm design phase strategies focus on parallelizing SIFT based on data size, data usage, and data organization. The implementation design phase strategies focus on how the parallelized SIFT design can achieve maximum execution efficiency on a GPU. The remainder of this section focuses on the description of the two phases and their application on SIFT.
The algorithm design phase strategies involve parallelization techniques based on three guidelines. a) Reduce: As seen in [35] , the bottleneck of algorithm execution is always found to be the memory accesses-read and write operations. The memory operations require higher time reducing the performance of the algorithm. Hence, the higher the number of memory operations, the lower the performance of the algorithm. However, not all algorithms' performance is bound by memory. This is because algorithms may have a lot more computations than memory operations.
In order to differentiate memory bound algorithms, a ratio, termed as computational intensity (CI), is defined as the number of floating point computations performed for each memory operation. This provides a measure to understand the impact of memory operations on algorithms. It is given by The CI determines if an algorithm is memory or compute bound. If the ratio is less than 1, then it is said to be memory bound. For CI 0  , the algorithm tends to become compute bound.
The reduce guideline states that for a memory bound algorithm, the number of memory operations should be reduced to increase CI. In other words, for every floating point operation, the number of memory operations should be mi- SIFT algorithm is observed to have both branching and scattered data access.
For example, while step (d) in section III requires twenty six comparisons, leading to branching, step (d) scatters data elements for an increase in octave level.
These guidelines are not rigid rules but rather provide a methodology to parallelize programs. The SSED parallelization, steps (a) -(e) in section III has been parallelized in our previous work-in-progress publication [21] . However, the adoption of guidelines has improved upon the parallelization strategies as shown below:
a) SSED provides a pyramid structure as shown previously in Figure 2 . The pyramid structure emphasizes the dependency that exists between each octave. If the pixels in the lower most octave are considered as granularity of parallelization, it provides huge parallelization for threads to work. However, it also increases dependency across threads for each increment in octave level. This violates the guidelines by increasing memory operations, reloading similar memory elements and having scattered accesses. It has been observed through our simulations that this significantly deteriorates performance, especially for ultra-high resolution images. Adhering to our guidelines, we propose top-down approach for parallelization of SSED; we group data based on the top DoG image in the highest octave. To determine if 21 
s o xy D
at the highest octave and scale is an extrema, it has to be compared to eight neighboring pixels of the same scale, and eighteen neighboring pixels of adjacent scales within the same octave level.
Adopting reduce guideline, all scales within an octave can be computed simultaneously. Therefore SSED would require loading nine elements-eight neighboring pixels and the Consider a pixel a N ∈ which needs to be compared with twenty six other pixels represented as x a , where 1, 2, , 26 x =  and x a N ∈ . To check whether a is maxima or minima, the two factors given in Equations (7), (8) are computed:
Next, If α > 0 and β = 0 then a is the local maxima and if β > 0 and α = 0 then a is the local minima. c) Since the extrema detection can yield either a selection of candidate keypoint or rejection of pixels, the execution has to proceed only with CKs. This leads to load imbalance which can significantly impact performance on a SIMT architecture which executes at warp granularity. Therefore, to avoid load imbalance between threads, it is recommended to synchronize the execution after obtaining CKs. Since there are no dependencies observed in steps (e) -(h) described in section III, the computation at hand is embarrassingly parallel with respect to each keypoint. Moreover, each keypoint requires no more than 8 neighboring pixels for all computations through steps (e) -(h) per scale per octave leading to a high CI.
As mentioned earlier, this algorithm design phase facilitates for parallelization of SIFT algorithm across architectures. The second phase of parallelization involves implementation design parallelization strategies. There are several different approaches to designing parallelization strategies for GPU. Two popular approaches, as seen in [26] - [34] , are a) Maximizing occupancy and b) Maximizing memory bandwidth. Our approach for implementation design phase is maximizing occupancy as opposed to maximizing memory bandwidth. This is because our approach to eliminating memory bottlenecks at algorithm design phase complements a maximizing occupancy in implementation phase.
There are few research publications, such as [37] , which conclude that occupancy is not a measure of effective utilization of GPU resources. The work conducted in [37] shows that the author pursued maximizing occupancy alone, without considering other dependent resources that are affected by increasing occupancy. Moreover, the results in [37] show that good performance can be achieved at lower occupancy. However, no proof has been provided to show increasing occupancy lowers performance either in [37] or any other publications.
The usual approach to maximizing occupancy on GPUs is to profile the code, and then add modifications such as increase the number of threads, or decrease the memory being used. Our approach to maximizing occupancy differs with others because we adopt a mathematical approach to maximize occupancy. Unlike the usual approach, the mathematical model connects occupancy with warps, shared memory, register memory, blocks and threads creating a multi-dependency model. Therefore, we do not set occupancy to maximum value and then compute the other dependencies; rather we try to calculate maximum achievable occupancy while tuning other dependencies. The idea is to combine optimization techniques to the underlying architecture mathematically, so as to obtain a near optimal implementation. 
Equation (25) 
Due to hardware constraints, shared memory has allocation granularity, i.e., the allocation of shared memory is always an integral multiple of ( )
The ceil function rounds the first argument to the immediate next multiple of the second argument. In Equation (27) , the allocation of shared memory per block is rounded to allocation granularity using ceil function.
Similarly, ( )
is the number of warps that can be allocated based on the registers needed by the algorithm and its availability per SM. It is calcu-lated as shown below:
As derived from NVIDIA documentation and seen in Equations (28), (29), the number of warps allocated based on registers depends on the registers allocated per thread, number of warps allocated per block and allocation granularity of register memory.
In the equations listed to calculate ( ) 
The two guidelines in the implementation design phase provide the exact thread and block numbers that maximizes performance on GPU. Since the calculations depend on the GPU used, image resolution and the number of keypoints generated, its calculation for the entire SIFT is not demonstrated in the paper. However, calculations of thread and block numbers for SSED are demonstrated for a 720 p resolution greyscale image on C2075 Tesla architecture GPU. 
The code implementation showed the requirement of 8 registers per thread, each of size 32 bytes. Hence,
Using Equations (32), (34), (35) along with C2075 architectural specifications, we can calculate Equations (25), (27), (28) . Plugging in the results back to Equation (24) and Equation (23), it is seen that ( ) 
Performance
The research work of Heymann et al., on parallelization of the SIFT algorithm [3] has shown that the SSED stage computations are the major computational and memory bottlenecks in the SIFT algorithm. Feng et al. has shown that 40% -60% of the total computational burden [8] is encountered in performing the SSED. Our approach of selecting thread numbers for optimizing occupancy needs to be validated. Moreover, as mentioned before, it is challenging to find a parallel Lowe's SIFT algorithm benchmarked for an identical GPU. Therefore, using parallel profiling tools, the effectiveness of PSIFT is demonstrated by monitoring the occupancy and cache misses on the GPU for the duration of execution of PSIFT.
Phase 1
In order to compare the performance of our parallel implementation PSSED, we In Figure 6 , the MSSED ACT and the corresponding standard deviation (numbers on the graph) for each image resolution are shown.
The ACT for MSSED is less than a second for images with resolution less than 1080 P. Even at this low resolution, only two frames can be processed in a second using the MSSED implementation. This high value of ACT makes the MSSED implementation unsuitable for real-time applications. The MSSED ACT is comparable to that reported in the research work [3] by Heymann et al. In [3] , the ACT for a 640 × 480 resolution image was 312 msecs compared to 401 msecs for a 1280 × 720 resolution image. Next, as the image resolution increases the ACT increases tremendously. The ACT of MSSED can be reduced further by using multiple cores of the Intel Xeon processor. It can be observed in Figure 6 ; the standard deviation is very less.
In Figure 7 , the MPSSED ACT and the corresponding standard deviation for each image resolution are shown. Based on the ACT, the MPSSED is capable of processing 720 p and 1080 p images in real time.
In Figure 8 , the PSSED ACT and the corresponding standard deviation for each image resolution are shown.
With PSSED implementation, the ACT is less than 16 msecs even with the 8 K-resolution image. This is comparable ACT of the convolution of 9 K-resolution images using GPUs, described in [40] to be 10 msecs. The low ACT would allow processing of the 8 K resolution images at 60 fps making the PSSED implementation suitable for real-time applications. In the research work [12] it has been shown with homogenous multi-core DSP implementation a processing rate of 45 fps of 640 × 480 resolution images is achievable.
In Figure 9 , the speedup of PSSED over MSSED and MPSSED are shown. The speedup is found to vary from 592× to 829× and 78× to 143× over MSSED and MPSSED respectively with increasing image resolutions. It is seen that the speedup increases initially and reaches saturation. This is due to the finite amount of memory and cores available on the GPU for concurrent execution. The optimized CUDA C PSSED implementation performance is still significantly better compared to the Matlab based GPU implementation of the SSED.
The test image used for comparing performance is shown in Figure 10 , and the corresponding final DoG image by PSSED is shown in Figure 11 . Identical DoG images are created by the MSSED, MPSSED and PSSED.
Phase 2
The ACT of PSIFT algorithm and its speed up w.r.t. sequential (SWS) Matlab SIFT (SSIFT) is provided in Figure 12 . The ACT for a 720 p image is nearly 2.5 msecs making SIFT a feasible algorithm for real time applications. However, the ACT increases non-linearly with an increase in image resolution. The number of Figure 13 shows the impact of keypoints on PSIFT's performance for 720 p resolution images. The keypoints, expressed in percentage in Figure 13 , are taken as a fraction of the total pixels of an image. As seen in Figure 13 , the performance of PSIFT scales linearly for an increase in keypoints. Once beyond a threshold, 8% in case of Figure 13 , the GPU has all SMs occupied. Hence, execution of slices of code is delayed till SMs are available. This is observed in the non-linear rise in ACT from 8% -16% in Figure 13 . Figure 13 also displays the SWS, which are consistent up to 8%, and decrease thereafter. As mentioned before, phase 2 provides performance results of PSIFT tuned for faster execution. This performance boost is achieved by lowering the accuracy of the SIFT algorithm. Figure 14 shows the accuracy of PSIFT in terms of the percentage of matching keypoints generated in comparison to SSIFT, which is adopted from Lowe's algorithm. Figure 14 shows steep decrease in accuracy with an increase in performance. However, 89% keypoints matching SSIFT for 8 k images is an acceptable number of keypoints.
Phase 3
Since the accuracy of results are lower in phase 2, it is necessary to compare PSIFT optimized for ACT and PSIFT tuned for 100% accuracy. Figure 15 provides the SWS for the two approaches. It is evident that tuning PSIFT for accuracy does not scale well for higher resolution. Figure 16 provides the SIFT descriptors generated for image. The figure shows a number of keypoints are generated as an output for the SIFT algorithm.
It matches with the keypoints generated by SSIFT, verifying the accuracy.
Phase 4
In this phase, our approach to select the number of threads is analyzed. The optimal thread number for SSED is 256 threads. Figure 17 compares of performance of SIFT with optimal threads with threads lower and higher than optimal values. It is seen that our approach provides the best performance of the five cases. Lower threads fail to hide memory latency and hence have worse performance. Higher threads hide memory latency as much as optimal case, but for higher data sizes requires more memory than optimal case. Hence performance deteriorates as seen in Figure 17 .
With no parameters for comparison, it is difficult to compare the effectiveness of parallelization approaches without having the same algorithm, device, and dataset. BSC has developed parallel program profiling tools that help analyze parallelization strategies and performance of the program. Two such tools are Extrae and Folding. Extrae is for collecting raw performance data from execution of software. Extrae allows parallel program developers to instrument the code to collect statistics on PAPI [39] counters to track resources within a GPU.
However, doing so for a single thread does not provide insight for parallelization strategies. Therefore, the data is collected for 1000 sample runs of SIFT for 128 threads spread across different SMs. The counters are read periodically, and data is generated based on the counter. The folding tool helps merge the 1000 samples across 128 threads to provide an average view of execution on GPU. The smoother the curve folding generates, the better the reproducibility of the program. For PSIFT, PAPI counters for MIPS and LCM were monitored on a K20X
GPU card. Readings were obtained from the counter for every 1 usecs, and the results are presented in Figure 18 . It is seen that for the first 30% of the execution Figure 18 . PSIFT GPU resource utilization.
time, the MIPS are high while LCM is lower than 2% except for the initial load.
This marks the SSED phase where data is pre-fetched by the compiler reducing LCM and increasing MIPS. During the data rearrangement phase, the MIPS are cut down by 50%. The LCM during this phase is observed to be between 0% -15% indicating scattered memory access. This phase lasts till 85% of the execution time, followed by the keypoint localization and descriptor generation, which are similar to SSED. The plot shows that the GPU MIPS is high throughout the execution of the algorithm, indicating that our parallelization strategy is effective.
Conclusions and Future Work
This paper demonstrates a new two-phase parallelization approach used to implement the SIFT algorithm. Our approach to the two-phase parallelization has an algorithm design phase-where parallelization strategies are based on data size, usage and organization, and an implementation design phase-where parallelization strategies are based on achieving maximum execution efficiency on a GPU. The algorithm design phase involves reduction in data usage and computations, reusing image data points to maximize cache blocking and, rearranging data to facilitate faster memory operations. The implementation design phase maximizes occupancy by adopting a mathematical model that connects occupancy with warps, shared memory, register memory, blocks, and threads. This model is then solved to get a range of GPU parameters such as number of blocks and threads.
The performance of the parallelized SIFT thus obtained is presented, compared and analyzed. The following observations were made: a) PSIFT performs better than the sequential and parallel Matlab implementations.
b) Accuracy of the SIFT algorithm, in terms of number of keypoints generated, can be traded to obtain lower execution time.
c) PSIFT can process high-resolution images under 25 msecs when optimized for low execution time.
d) The number of keypoints in a given image has a significant impact on the execution time of PSIFT.
e) The implementation design phase strategies provided the optimal number of threads and blocks for PSIFT.
f) The two-phase parallelization strategies did show effective use of GPU resources.
Our approach to parallelization facilitates SIFT algorithm to be used for processing high-resolution images in real time. The future work would be to provide a multi-GPU implementation.
