A study of the scale-invariant feature transform on a parallel pipeline by Vinukonda, Phaneendra
Louisiana State University
LSU Digital Commons
LSU Master's Theses Graduate School
2011
A study of the scale-invariant feature transform on a
parallel pipeline
Phaneendra Vinukonda
Louisiana State University and Agricultural and Mechanical College
Recommended Citation
Vinukonda, Phaneendra, "A study of the scale-invariant feature transform on a parallel pipeline" (2011). LSU Master's Theses. 2721.
https://digitalcommons.lsu.edu/gradschool_theses/2721
A STUDY OF THE SCALE-INVARIANT FEATURE TRANSFORM ON A PARALLEL PIPELINE
Thesis
Submitted to the Faculty of the
Louisiana State University and
Agricultural and Mechanical College
in partial fulfillment of the
requirements for the degree of
Master of Science in Electrical Engineering
in
The Department of Electrical and Computer Engineering
by
Phaneendra Vinukonda
B.TECH., JNTU University, 2007
May 2011
Acknowledgements
I am indebted to my major advisor Dr. Ramachandran Vaidyanathan for his exemplary
patience, guidance and support. During my stay here at LSU, he taught me the skills
of problem solving, provided me with some good motivations, helped me through the
difficulties I have gone through on the way towards this degree. It was with his kind
support, I overcame all the obstacles on my way towards completing my thesis.
I would also like to thank my committee members, Dr. S. Rai and Dr. Gunturk for
their valuable suggestions and kind support. Furthermore, I thank the Dept. of Electrical
and Computer Engineering for making me concentrate on my research without any other
deviations.
I wish to express my earnest gratitude to my parents, who believed in me and have been
with me through all the rough times. I also want to thank my brother V. Hareendra, my entire
family and friends for their affection, support and compassion.
Table of Contents
Acknowledgements
Table of Contents
List of Figures
List of Tables
Abstract
1 Introduction
2 Scale Invariant Feature Transform
  2.1 Flow of Data in SIFT
  2.2 Scale-Space Extrema Detection
  2.3 Keypoint Detection
  2.4 Orientation Assignment
  2.5 Keypoint Descriptor Generation
3 Experimental Study
  3.1 Images Used in Study
  3.2 Time Taken by Different Phases
  3.3 Feature Fractions
4 Tile Ordering
  4.1 Tile Notation
  4.2 Row Major Ordering
  4.3 Diagonal Ordering
5 The Computation Pipeline
  5.1 Notation
    5.1.1 Stage Start Times
  5.2 Memory
  5.3 Input Protocol
  5.4 Architecture
  5.5 Pipelining Multiple Images
6 Input Data Flow Requirements
  6.1 Tile-Plus-Neighborhood Protocol
    6.1.1 Row Major Tile Ordering
    6.1.2 Diagonal Method Tile Ordering
  6.2 Tile-Only Protocol
    6.2.1 Row Major Tile Ordering - Tile Only
    6.2.2 Diagonal Tile Ordering - Tile Only
7 Single-Chip Uniprocessor
  7.1 Running Time on a 3-Stage Pipeline
  7.2 Time Complexity of Stages
  7.3 Memory Requirement on a 3-Stage Pipeline
    7.3.1 Row Major Ordering
  7.4 Diagonal Ordering
  7.5 Tile Only Input
    7.5.1 Row Major Ordering
    7.5.2 Diagonal Ordering
8 Single-Chip, Multicore Processor
  8.1 The Hierarchical Multi-Level-Caching (HM) Model
  8.2 Mapping Tile Data to Cores
  8.3 Computation Stage S1 in the HM Model
    8.3.1 Accessing Local Data
    8.3.2 Accessing Neighborhood Pixels
    8.3.3 Running SIFT on the Subtile
    8.3.4 Total Time
  8.4 Memory Requirement
9 Two-Chip, Multicore Processor
  9.1 Time Complexities of the Stages
10 Conclusion and Future Work
Bibliography
Vita
List of Figures
2.1 Major phases of the SIFT algorithm
2.2 The internal stages of Scale-Space Extrema Detection
2.3 An example of applying a 5 × 5 Gaussian window on a point
2.4 Scale, octaves and difference of Gaussians
2.5 Extrema detection on octave j
2.6 Improved SIFT algorithm
2.7 Keypoint descriptor generation
3.1 Pictures considered
3.2 Percentage time for major stages of SIFT
3.3 Percentage time of SIFT major phases over different pictures
3.4 The absolute times of major SIFT phases
3.5 Gaussian blurring lines
3.6 Average time for Gaussian blurring and difference of Gaussian phases
3.7 Times taken for scale-space extrema detection, orientation assignment and keypoint descriptor generation phases
3.8 Nominal number of extrema, keypoints and features
3.9 The value of α for all the image resolutions of picture number 26
3.10 The value of β for all the image resolutions of picture number 26
3.11 The value of γ for all the image resolutions of picture number 26
3.12 The value of α for all the image resolutions of pictures numbered 1-25
3.13 The value of β for all the image resolutions of pictures numbered 1-25
3.14 The value of γ for all the image resolutions of pictures numbered 1-25
3.15 The value of α across all the images averaged over their resolutions
3.16 The value of β across all the images averaged over their resolutions
3.17 The value of γ across all the images averaged over their resolutions
4.1 Coordinate representation of tiles in a tile array
4.2 Row-major ordering
4.3 Diagonal ordering
5.1 A c-chip pipeline
5.2 A chip in the pipeline model
5.3 Stages in the pipeline model
6.1 A tile and its neighborhood
6.2 Tile and its neighborhood in the context of entire image
6.3 First tile of the row-major tile ordering
6.4 Tile in row 0 and column 0 < c < z
6.5 Tile in column 0 and row 0 < r < z
6.6 Tile in row 0 and column z
6.7 Tile in column 0 and row z
6.8 Tile in row 0 < r < z and column 0 < c < z
6.9 Tile in column z and row 0 < r < z
6.10 Tile in row z and column 0 < c < z
6.11 Tile in row z and column z
6.12 First tile of the diagonal tile ordering
6.13 Tiles in row 0 and column 0 < c < z
6.14 Tiles in column 0 and row 0 < r < z
6.15 Tile in row 0 and column z
6.16 Tile in column 0 and row z
6.17 Tile in row 0 < r < z and column 0 < c < z
6.18 Tile in column z and row 0 < r < z
6.19 Tile in row z and column 0 < c < z
6.20 Tile in row z and column z
6.21 Tile only input for row major ordering
6.22 Tile only input for diagonal ordering
7.1 The 3-stage pipeline model
7.2 Total data received
7.3 Regions of total data received
7.4 Data that is not needed further
7.5 Memory requirement for Diagonal Ordering
7.6 L = ⟨L_0, L_1, L_2, L_3, L_4⟩
7.7 x-border of L
7.8 Area details of x-border of L
7.9 The memory requirement for Diagonal Ordering
7.10 The memory requirement for Diagonal Ordering
8.1 Hierarchical Multi-Level-Cache (HM) Model
8.2 Accessing local data
8.3 Accessing the neighborhood data
9.1 A two-chip pipeline
10.1 Splitting the data at Stage S0 for Stage S1 and Stage S3
List of Tables
2.1 The time complexities of different stages
3.1 The time values in the Table
7.1 The number of SIFT operations for n^2-pixel tiles
8.1 The number of SIFT operations required for a subtile
9.1 The number of SIFT operations on a Uniprocessor Chip
9.2 The number of SIFT operations on a P-core Chip
Abstract
In this thesis we study the running of the Scale Invariant Feature Transform (SIFT)
algorithm on a pipelined computational platform. The SIFT algorithm is one of the most
widely used methods for image feature extraction.
We develop a tile-based template for running SIFT that facilitates the analysis while
abstracting away lower-level details. We formalize the computational pipeline and the
time to execute any algorithm on it based on the relative times taken by the pipeline
stages. In the context of the SIFT algorithm, this reduces the time to that of running the
entire image through a bottlenecked stage and the time to run either the first or last tile
through the remaining stages. Through an experimental study of the SIFT algorithm on
a broad collection of test images, we determined image feature fraction values that relate
the sizes of the image extracts as the computation proceeds through the stages of the
SIFT algorithm.
We show that for a single chip uniprocessor pipeline, the computational stage is the
bottleneck [...]; here x is the neighborhood
of the tile, p_i, p_o are the numbers of input and output pins of the chip, α, β, γ are the
feature fractions, and Γ_0, Γ_1, Γ_2 are the input, compute and output clocks. The three terms in
the expression represent the time complexities of the input, compute and output stages. The
input and output stages can be slowed down substantially without appreciable degradation
of the overall performance. This slowdown can be traded off for lower power and higher
signal quality.
For multicore chips, we show that for an N × N image on a P-core chip, the overall time
complexity [...] In
addition to the quantities described earlier, w is the window size used for the Gaussian
blurring. Overall we establish that without improvements in the input bandwidth, the [...]
Chapter 1
Introduction
The speed of processors has increased exponentially in modern systems but the rate at
which data enters and exits a processor has not kept up with this increase. This is because
while technological improvements have been able to keep pace with Moore’s law for many
decades, the physical size of input/output pins of a chip cannot be reduced beyond a point
due to mechanical stability reasons. Jordon [12] presents a comparative chart of number
of transistors and pins in Intel chips over the last 20 years. While the number of transistors
has gone up by a factor of 20000, the number of pins has increased only by a factor of 30
during the same period. Three-dimensional stacking [2] [5] that allows better connectivity
within chips and optical input/output [26] are promising future possibilities.
Currently, applications requiring high input/output bandwidth are mostly executed on a
single chip environment. This is not because there is not enough computational need to
spread the algorithm across chips, but because it is not economical for the data to leave a
chip and go to the next. As a result, one cannot exploit the benefits of high parallelism
that result in better speeds and more sustainable use of power. Applications that deal with
image and video processing (particularly those with real time constraints) require large
input/output bandwidth. It is one of these applications that we study in this thesis.
We study the scale invariant feature transform (SIFT) algorithm [13] that extracts features
of an image in a manner that is stable over image translation, rotation, scaling, illumination
and camera viewpoint. We have selected SIFT as it is one of the most widely used
algorithms for object recognition and has been employed in many applications such as
face/object recognition [13] [14] [15], robot localization and mapping [16], 3D-scene modelling,
and action recognition [3]. SIFT accepts an N × N image as input and produces a
set of features. The input bandwidth of N^2 pixels can be very high for large values of N
(modern household cameras produce images in which N^2 exceeds 10M, so N ≥ 1000 is not
unreasonable).
In this thesis we develop a c-chip (or (2c+1)-stage) pipeline as a basic platform on which we
execute the SIFT algorithm for an N × N image. It should be pointed out that this work
focuses on running SIFT on the pipeline platform, rather than improving the performance of
the original algorithm of Lowe [13]. The choice of a pipeline platform suits the structure
of SIFT and many real time applications that stream in image data. More specifically, we
develop general results for this pipeline and apply them to single- and two-chip (3- and 5-stage)
pipelines. Both uniprocessor and multicore chips are considered. We use the single-chip
uniprocessor (3-stage) pipeline model as a base and then study the case where the processing
platform is a multicore chip. We extend this to a two-chip (5-stage) pipeline model, where
each chip could be uniprocessor or multicore. To analyze SIFT in a manner that abstracts
lower-level details, we introduce a decomposition of the N × N image into smaller n × n
tiles. We consider two orders in which these tiles are fed as input to the pipeline. The
order makes a difference in the processing time for certain cases. We derive expressions for
the time and memory complexities for SIFT on a pipeline model. We also study several
images each at 20 different resolutions, using the SIFT implementation of Hess [19]. This
helps us further refine some of the constraints in the time complexity of SIFT.
This thesis has contributions in many directions. We show that for the uniprocessor case,
the input/output bandwidth is not critical, as the computation is the bottleneck. However
as we move to a multicore platform, the input/output bandwidth becomes a bottleneck,
particularly when the pipeline (sequence of chips) is deeper. Specifically, we show that the
time to run SIFT on an image on a uniprocessor pipeline is essentially the time to compute
SIFT for the entire image. The only additional contribution due to the input and output
stages is the time to input the first tile and the time to output the features of the last
tile (these are very small fractions of the image). As we move to the multicore platform,
the time to run SIFT essentially equals the time to input the image. The additional times
due to the compute and output stages are only the times for computation and feature output
of the last tile (again a very small fraction of the image). In the uniprocessor case, the
computation time is the bottleneck and other stages (input and output) idle much of the
time. As the number of cores increases, the input is the bottleneck and cores begin to idle.
We also develop general results for the running time on a c-chip pipeline which may be of
independent interest.
The results we develop in this work also point to directions in which the performance of
SIFT can be improved when run on the pipeline model. In the single-chip uniprocessor
case, where the input is not a bottleneck, one could slow the input clock rate to save power
or transmit additional bits to improve the bit error rate without affecting the overall time.
In the multicore case, the full computational power of multicore chips cannot be used unless
the input bandwidth is improved.
Recent research on SIFT has followed many directions including parallel implementation
and optimization, application areas and modifications to the SIFT technique of Lowe [13].
Sinha et al. [22] and Heymann et al. [8] proposed SIFT implementation on GPUs. Wen
et al. [23] proposed a CUDA based implementation for a GPU framework and analyzed its
parallelism. Ko et al. [11] analyzed the performance and cost of SIFT for visual classification
and discussed tradeoffs among system parameters that affect the energy, accuracy and
latency. Nasir et al. [7] proposed a method that improves SIFT. Lin et al. [27] proposed
a tracking method using SIFT for recording the trajectory of the human motion in an
image sequence. Mikolajczyk and Schmid proposed gradient location-orientation histogram
(GLOH) [17] as an extension of the SIFT descriptor designed to increase its robustness
and distinctiveness. Zhong et al. [24] presented an improvement on the basic SIFT that
is geared toward palmprint recognition. Shekar et al. [21] proposed improved descriptor
representation in the face recognition context. Zhang et al. [28] proposed two parallel
SIFT algorithms and presented some optimization techniques to improve the performance
on multicore systems. However their results focus less on the analysis for a general model
such as the HM model [4] that we consider. We are not aware of any work similar to this
one that theoretically analyzes the performance of SIFT on a general pipeline platform
including uniprocessor and multicore chips.
The remainder of this thesis is organized as follows. In Chapter 2, we explain the major
phases of the SIFT algorithm and analyze it to determine the number of operations needed
to perform these phases. In Chapter 3, we describe experiments involving running SIFT on
different images to further refine our theoretical analysis of the running time of SIFT. We
also introduce the concept of feature fractions in this chapter, which play an important part
in the subsequent analysis. In Chapter 4, the two tile ordering methods are introduced (row
major and diagonal). In Chapter 5, we describe the pipeline model. Chapter 6 is devoted
to an analysis of the input complexity of SIFT. In Chapter 7, we derive the expressions for
time and memory to run SIFT on a single-chip uniprocessor pipeline model. We extend
this in Chapter 8, to a single-chip multicore pipeline. In Chapter 9, a two-chip model is
considered where each chip can be single or multicore. Finally in Chapter 10, we summarize
the work and identify directions for future research.
Chapter 2
Scale Invariant Feature Transform
In this chapter we briefly discuss the Scale Invariant Feature Transform (SIFT) algorithm
introduced by Lowe [13]. This algorithm is one of the most widely used for image
feature extraction. SIFT extracts image features that are stable over image translation,
rotation and scaling and somewhat invariant to changes in the illumination and camera
viewpoint. The SIFT algorithm has four major phases (as illustrated in Figure 2.1): (a)
Extrema Detection, (b) Keypoint Detection, (c) Orientation Assignment, and (d) Keypoint
Descriptor Generation. The first phase, Extrema Detection, examines the image under
various scales and octaves (explained in detail later) to isolate points of the picture that
are different from their surroundings. These points, called extrema, are potential candidates
for image features.
The next phase, Keypoint Detection, starts with the extrema and selects some of these
points to be keypoints, which form a whittled-down set of feature candidates. This refinement
rejects extrema that are caused by edges of the picture and by low contrast points.
The third phase, Orientation Assignment, converts each keypoint and its neighborhood
into a set of vectors by computing a magnitude and a direction for them. It also identifies
other keypoints that may have been missed in the first two phases; this is done on the basis
of a point having a significant magnitude without being an extremum. The algorithm now
has identified a final set of keypoints.
The last phase, Keypoint Descriptor Generation, takes a collection of vectors in the neighborhood
of each keypoint and consolidates this information into a set of eight vectors called [...]
Figure 2.1: Major phases of the SIFT algorithm
2.1 Flow of Data in SIFT
In this section, we describe the nominal number of data items traversing each phase of
the SIFT algorithm. This is used later to determine the algorithm complexity and in the
experimental study (see Chapter 3). Recall that the Extrema Detection and Keypoint
Detection phases reduce the number of feature candidates. The Orientation Assignment
phase potentially adds points to this number of feature candidates.
The input to the SIFT algorithm is a set of N^2 pixels of an N × N image. Only a small
fraction of these pixels typically turn out to be extrema. Let 0 < α < 1 be this fraction.
So αN^2 extrema will move on to the next phase, Keypoint Detection. Only a small fraction
of these extrema will qualify as keypoints. Let 0 < β < 1 be this fraction. So nominally
there are αβN^2 keypoints at this stage. Orientation Assignment reexamines all the N^2
points in the image to check if any points of significant magnitude have been missed. If so,
they are added to the set of keypoints. Let a fraction γ of the image pixels qualify to be
these added keypoints. That is, γN^2 new keypoints are added. The Keypoint Descriptor
Generation phase converts these points into vectors which are then turned into features. The
number of feature descriptors output by the SIFT algorithm is nominally (αβ + γ)N^2 for an
N × N image. We call the quantities α, β and γ feature fractions.
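The nominal data flow described above can be tabulated with a short sketch. The function name and the example values of α, β and γ below are hypothetical and chosen only for illustration; they are not the measured feature fractions reported in Chapter 3.

```python
def nominal_counts(n, alpha, beta, gamma):
    """Nominal number of items leaving each SIFT phase for an n x n image.

    alpha: fraction of the n*n pixels that are extrema
    beta:  fraction of the extrema that qualify as keypoints
    gamma: fraction of the n*n pixels added during Orientation Assignment
    """
    pixels = n * n
    extrema = alpha * pixels            # after Extrema Detection
    keypoints = alpha * beta * pixels   # after Keypoint Detection
    added = gamma * pixels              # added by Orientation Assignment
    return {
        "pixels": pixels,
        "extrema": extrema,
        "keypoints": keypoints,
        "added_keypoints": added,
        "features": keypoints + added,  # (alpha*beta + gamma) * n^2
    }


if __name__ == "__main__":
    # Hypothetical fractions, for illustration only.
    for name, value in nominal_counts(1000, alpha=0.01, beta=0.5, gamma=0.001).items():
        print(f"{name}: {value:.0f}")
```

The point of the sketch is only that each phase passes on a small fraction of what it receives, except for Orientation Assignment, which adds the γN^2 term.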
While α, β and γ will depend on the picture in question, we will consider nominal values
averaged over many pictures to guide this work. This part of the study is described in
Chapter 3. In the remaining sections of this chapter, we describe the four major phases of
SIFT in detail.
2.2 Scale-Space Extrema Detection
This is the first phase of the SIFT algorithm. Here the algorithm identifies the points that
are stable with respect to image rotation and translation and that are minimally affected
by noise and small distortions. Detecting these points can be accomplished by searching
for stable features across all possible scales (defined below). Figure 2.2 shows the internal
stages of the Extrema Detection phase. The algorithm computes “scale,” “difference of
Gaussians,” and “extrema” over several “octaves.” We now discuss these ideas, before
explaining the order in which they are computed.





Scale: Let G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)} be
the discrete two-dimensional Gaussian function. Then the scale of the image I is defined
as L(σ) = { G(x, y, σ) ∗ I(x, y) : 0 ≤ x, y < N }, where ∗ is the two-dimensional convolution
operation and I(x, y) is the pixel at row x and column y of image I.
In general, the kth scale of the image, for k ≥ 1, is defined as L(kσ) = { G(x, y, kσ) ∗ I(x, y) :
0 ≤ x, y < N }. For each image point I(x, y), the scale is computed by applying a scalar
product between the neighborhood of I(x, y) and a w × w Gaussian weighted window placed over that
point. For example, suppose that point I(x, y) is the central point i_{0,0} of the 5 × 5 window
shown in Figure 2.3(a). The figure also shows the pixel values of the 5 × 5 neighborhood
of this point. If σ = √2 and w = 5, then the Gaussian filter G(u, v) can be shown to be
the one in Figure 2.3(b). Applying this filter to the central point amounts to computing the quantity
G(i_{0,0}) = 0.001 i_{-2,2} + 0.003 i_{-1,2} + \cdots + 0.145 i_{0,0} + \cdots + 0.003 i_{1,-2} + 0.001 i_{2,-2}
In general, for a w × w window with odd w, the image points located around the point
I(x, y) are I(x + u, y + v) where −(w−1)/2 ≤ u, v ≤ (w−1)/2. Here the scale of I(x, y) is
\sum_{u=-(w-1)/2}^{(w-1)/2} \sum_{v=-(w-1)/2}^{(w-1)/2} G(u, v) I(x+u, y+v)
Figure 2.2: The internal stages of Scale-Space Extrema Detection
Thus, computing the scale for each point I(x, y) requires w^2 multiplications and w^2 − 1 additions,
which has Θ(w^2) complexity. For the entire N × N image the complexity for this step is
Θ(w^2 N^2).
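The weighted-window computation above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: normalizing the window so that its weights sum to 1 and treating pixels outside the image as 0 are both assumptions, and the function names are hypothetical.

```python
import math


def gaussian_window(w, sigma):
    """Return a w x w Gaussian window G(u, v), indexed as window[u + h][v + h].

    Assumption: the weights are normalized to sum to 1.
    """
    if w % 2 != 1:
        raise ValueError("window size w must be odd")
    h = (w - 1) // 2
    raw = [[math.exp(-(u * u + v * v) / (2.0 * sigma * sigma))
            for v in range(-h, h + 1)]
           for u in range(-h, h + 1)]
    total = sum(sum(row) for row in raw)
    return [[g / total for g in row] for row in raw]


def scale_at(image, x, y, window):
    """Sum of G(u, v) * I(x + u, y + v) over the w x w window centered at (x, y).

    Assumption: pixels that fall outside the N x N image contribute 0.
    An interior point costs w*w multiplications, matching the Theta(w^2) bound.
    """
    n = len(image)
    h = (len(window) - 1) // 2
    acc = 0.0
    for u in range(-h, h + 1):
        for v in range(-h, h + 1):
            r, c = x + u, y + v
            if 0 <= r < n and 0 <= c < n:
                acc += window[u + h][v + h] * image[r][c]
    return acc


if __name__ == "__main__":
    flat = [[10.0] * 7 for _ in range(7)]
    win = gaussian_window(5, math.sqrt(2.0))
    print(round(scale_at(flat, 3, 3, win), 6))
```

Applying `scale_at` to every pixel of an N × N image gives the Θ(w^2 N^2) cost stated above.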
Let s ≥ 1 be an integer and k = 2^{1/s}. The SIFT algorithm repeatedly computes the scale of
the image as described below.
Let σ_0 be the initial value of σ in the Gaussian filter. Define σ_i = k^i σ_0 for 0 ≤ i < s + 3.
Let L^0_0 = I be the original image (the superscript is explained later). For image element
I(x, y), define L^0_{i+1}(x, y) = G(x, y, σ_i) ∗ L^0_i(x, y) where 0 ≤ i < s + 3. In this fashion the [...]
i_{-2,2}    i_{-1,2}    i_{0,2}    i_{1,2}    i_{2,2}
i_{-2,1}    i_{-1,1}    i_{0,1}    i_{1,1}    i_{2,1}
i_{-2,0}    i_{-1,0}    i_{0,0}    i_{1,0}    i_{2,0}
i_{-2,-1}   i_{-1,-1}   i_{0,-1}   i_{1,-1}   i_{2,-1}
i_{-2,-2}   i_{-1,-2}   i_{0,-2}   i_{1,-2}   i_{2,-2}
(a) Pixel Values
0.001   0.003   0.004   0.003   0.001
0.003   0.008   0.113   0.008   0.003
0.004   0.113   0.145   0.113   0.004
0.003   0.008   0.113   0.008   0.003
0.001   0.003   0.004   0.003   0.001
(b) Gaussian window
Figure 2.3: An example of applying a 5 × 5 Gaussian window on a point
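The scale parameters defined just before Figure 2.3 can be listed with a short sketch, reading the definition as σ_i = k^i σ_0 with k = 2^{1/s} and 0 ≤ i < s + 3. The function name and the example values of σ_0 and s are hypothetical and used only for illustration.

```python
def sigma_schedule(sigma0, s):
    """Return [sigma_0, ..., sigma_{s+2}] with sigma_i = k**i * sigma0 and k = 2**(1/s)."""
    if s < 1:
        raise ValueError("s must be an integer >= 1")
    k = 2.0 ** (1.0 / s)
    return [sigma0 * k ** i for i in range(s + 3)]


if __name__ == "__main__":
    # Illustrative values only.
    for i, sigma in enumerate(sigma_schedule(sigma0=1.6, s=2)):
        print(i, round(sigma, 4))
```

Because k^s = 2, the value of σ doubles after every s steps of the schedule.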












Octaves: The sequence of scales in Equation (2.1) is called an octave. As discussed above,
we computed L^0_{s+1} as part of the first octave. This is a blurred version of the original
image I. The next step requires a reduction in image resolution. The resolution of an
image can be reduced¹ by a factor of 2 in each dimension by sampling every other pixel of
the image in a checkerboard pattern. Let L^1_0 be L^0_{s+1} reduced in resolution by a factor of
2 (the superscript j here denotes Octave 1 for L^1_0 and Octave 0 for L^0_{s+1}).
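The resolution reduction between octaves can be sketched as below. This is one simple reading of "sampling every other pixel" (keep the pixels in even rows and even columns); the function name is hypothetical, and averaging over 2 × 2 blocks would be an alternative.

```python
def downsample_by_two(image):
    """Halve the resolution in each dimension by keeping every other row and column."""
    return [row[::2] for row in image[::2]]


if __name__ == "__main__":
    # A 4 x 4 image whose pixel at (r, c) holds the value 4*r + c.
    image = [[4 * r + c for c in range(4)] for r in range(4)]
    print(downsample_by_two(image))
```

Under this reading, the first image of octave j is obtained by applying the function once to the selected scale of octave j − 1.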












If there are ŝ octaves, then in general for 0 < j ≤ ŝ − 1, L^j_0 is L^{j-1}_{s+1} reduced in resolution [...]
























The time complexity for computing all (s + 3) scales of the image over one octave is [...]
operations. Therefore the overall complexity of this phase is Θ(N^2 w^2 s). Considering constants
and unit time for all operations, this phase requires approximately 4N^2 w^2 s time.
¹ Image resolution can be reduced in other ways, for example by averaging over a 2 × 2 pixel set.
Difference of Gaussians: At this point, we have (s + 3) scales L^j_i over all ŝ octaves,
where 0 ≤ i < s + 3 and 0 ≤ j < ŝ. For any fixed octave j and 0 ≤ i < s + 2, define the
difference of Gaussians
D^j_i = L^j_{i+1} - L^j_i
where the difference is for each pair of corresponding pixels L^j_{i+1}(x, y) and L^j_i(x, y).
Figure 2.4 illustrates these ideas.
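The pixel-wise difference between adjacent scales of one octave can be sketched as follows. The sign convention (higher scale minus lower scale) and the function name are assumptions made for illustration; images are represented as nested lists.

```python
def difference_of_gaussians(scales):
    """Given the scales [L_0, ..., L_m] of one octave, return [D_0, ..., D_{m-1}]
    with D_i(x, y) = L_{i+1}(x, y) - L_i(x, y) for every pixel.
    """
    return [
        [[hi_px - lo_px for lo_px, hi_px in zip(lo_row, hi_row)]
         for lo_row, hi_row in zip(lo, hi)]
        for lo, hi in zip(scales, scales[1:])
    ]


if __name__ == "__main__":
    scales = [
        [[1, 2], [3, 4]],
        [[2, 4], [6, 8]],
        [[5, 5], [5, 5]],
    ]
    print(difference_of_gaussians(scales))
```

With s + 3 scales per octave, the function returns s + 2 difference images, one subtraction per pixel each.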












[...] so computing D^j_i requires finding N^2 / 2^j differences. Thus, computing all s + 2 difference of [...]
units of time. The complexity to compute [...] = Θ(sN^2). With normally used values
of SIFT parameters, the number of operations is approximately 4N^2 s.




[...] D^j_{i+1} of difference of
Gaussian images in an octave j. For each octave j where 0 ≤ j < ŝ and for 1 < i < [...]
D^j_{i+1} in three adjacent layers (as shown in
Figure 2.5). Now element D^j_i(x, y) (shown in Figure 2.5 as a dark square) has 26 neighboring
difference of Gaussian elements (shown as gray squares in Figure 2.5). Element D^j_i(x, y)
is an extremum iff it is strictly larger (in pixel value) than all of the neighboring elements
or it is strictly smaller than all of the neighboring elements.
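The extremum test described above can be sketched as below. This is a minimal illustration for interior points only (the element must have all 26 neighbors); the nested-list representation `dog[i][x][y]` and the function names are hypothetical.

```python
def neighbors_26(dog, i, x, y):
    """Values of the 26 neighbors of dog[i][x][y] in layers i-1, i and i+1."""
    return [
        dog[i + di][x + dx][y + dy]
        for di in (-1, 0, 1)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (di, dx, dy) != (0, 0, 0)
    ]


def is_extremum(dog, i, x, y):
    """True iff dog[i][x][y] is strictly larger than all 26 neighbors
    or strictly smaller than all 26 neighbors.
    """
    center = dog[i][x][y]
    values = neighbors_26(dog, i, x, y)
    return all(center > v for v in values) or all(center < v for v in values)


if __name__ == "__main__":
    dog = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
    dog[1][1][1] = 1.0
    print(is_extremum(dog, 1, 1, 1))
```

Each call makes at most 26 comparisons per direction, so the test takes constant time per element.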
Detecting whether Dji (x, y) is an extremum takes at most 26 comparisons each requiring


























Figure 2.4: Scale, octaves and difference of Gaussians
can be shown to be approximately $104sN^2$. As discussed earlier we nominally have $\alpha N^2$ extrema at this point of the algorithm.
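The 26-neighbor extremum test described in this section can be sketched as below. The sketch assumes the difference of Gaussian images of one octave are stored as a list indexed by scale, each image being a list of rows, and that the tested element is not on a border; the function name is hypothetical.

```python
def is_extremum(dog, i, y, x):
    """Check whether dog[i][y][x] is a strict extremum among its 26 neighbors.

    The neighbors are the 8 surrounding elements in scale i and the 9
    elements each in scales i - 1 and i + 1. The caller must ensure that
    (i, y, x) is an interior position.
    """
    center = dog[i][y][x]
    neighbors = [dog[i + di][y + dy][x + dx]
                 for di in (-1, 0, 1)
                 for dy in (-1, 0, 1)
                 for dx in (-1, 0, 1)
                 if (di, dy, dx) != (0, 0, 0)]
    strictly_larger = all(center > v for v in neighbors)
    strictly_smaller = all(center < v for v in neighbors)
    return strictly_larger or strictly_smaller


def make_cube(center_value):
    """Build a 3 x 3 x 3 block of zeros with the given center value."""
    cube = [[[0] * 3 for _ in range(3)] for _ in range(3)]
    cube[1][1][1] = center_value
    return cube
```

In the worst case the test reads all 26 neighbors, which matches the bound of at most 26 comparisons per element quoted above.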
Before we proceed to the next phase (Keypoint Detection), we touch upon how the Scale-space Extrema Detection phase is executed. As we noted in this section, the algorithm determines scales, difference of Gaussians, and extrema (see Section 2.2) independently for ŝ octaves. Thus, it is possible to have an outer loop over octaves and determine the other quantities one after the other within this outer loop (see Figure 2.2). However, each of these (scales, difference of Gaussians, extrema) has relatively local dependencies. That is, to determine the scale of a point, one needs to know only the w × w neighborhood of the point. To determine the difference of Gaussian, we only











Figure 2.5: Extrema detection on octave j
26 difference of Gaussians points spread over three scales around it. Thus, it is possible to execute these operations over octaves in many different ways. The original algorithm of Lowe [13] uses the structure shown in Figure 2.2. The program we used [19] for this work uses the modified flow of the algorithm [28], as shown in Figure 2.6.
2.3 Keypoint Detection
Recall that the algorithm first determines $\alpha N^2$ extrema and then further distills them into $\alpha\beta N^2$ keypoints of the image. In this section, we discuss the selection of keypoints from extrema. The Scale-Space Extrema Detection phase of the algorithm identifies $\alpha N^2$ potential candidates for keypoints. Some of these candidates may lie along an edge of the image or may correspond to points of low contrast. These are generally not useful as features as they are unstable over image variation [13]. Hence these points are rejected. For rejecting low contrast points, each extremum is examined using a method that involves solving a system of 3 × 3 linear equations, and so it takes constant time. To detect the extrema on edges, a 2 × 2 matrix is generated and simple computations are performed on it (including finding the determinant and the trace of the 2 × 2 matrix, all requiring Θ(1) time) to generate a ratio of principal curvatures. This quantity is simply compared with a threshold value to decide whether an extremum is to be rejected or not. Thus, this phase runs in $\Theta(\alpha s N^2)$ time over all octaves. Taking constants into account, this phase takes approximately $100 s \alpha N^2$ operations.
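The edge test described above can be illustrated with the following sketch. It follows the trace and determinant test of Lowe [13], with the 2 × 2 matrix estimated by finite differences on one difference of Gaussian image. The threshold `r = 10.0` is the value suggested in Lowe's paper and is an assumption here, as are the function name and the finite-difference stencil; the implementation used in this thesis may differ in detail.

```python
def passes_edge_test(d, y, x, r=10.0):
    """Return True if the point (y, x) of image `d` is NOT edge-like.

    The 2 x 2 matrix [[dxx, dxy], [dxy, dyy]] is built from finite
    differences. The point is kept when trace^2 / det is below
    (r + 1)^2 / r, i.e. when the ratio of principal curvatures is small.
    A constant number of operations is used per point.
    """
    dxx = d[y][x + 1] + d[y][x - 1] - 2.0 * d[y][x]
    dyy = d[y + 1][x] + d[y - 1][x] - 2.0 * d[y][x]
    dxy = (d[y + 1][x + 1] - d[y + 1][x - 1]
           - d[y - 1][x + 1] + d[y - 1][x - 1]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Curvatures of opposite sign or a flat direction: reject.
        return False
    return trace * trace / det < (r + 1.0) ** 2 / r


blob = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # isolated peak
ridge = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]  # vertical edge-like ridge
```

The isolated peak has equal curvature in both directions and is kept, while the ridge has no curvature along its length and is rejected.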
After the elimination of extrema points, the points that remain are called keypoints. We











Figure 2.6: Restructured SIFT algorithm flow of data
2.4 Orientation Assignment
The nominal number of keypoints at the start of this phase is $\alpha\beta N^2$. This phase adds to the set of keypoints (those that may be missed in the previous phases) on the basis of their magnitude and orientation. The magnitude $m^j_i(x, y)$ and orientation $\theta^j_i(x, y)$ for each point $L^j_i(x, y)$ can be calculated as follows:

$$m^j_i(x, y) = \sqrt{\left(L^j_i(x+1, y) - L^j_i(x-1, y)\right)^2 + \left(L^j_i(x, y+1) - L^j_i(x, y-1)\right)^2}$$

$$\theta^j_i(x, y) = \tan^{-1}\frac{L^j_i(x, y+1) - L^j_i(x, y-1)}{L^j_i(x+1, y) - L^j_i(x-1, y)}$$
Non-keypoint points whose magnitudes are close to the peak magnitude are added as new keypoints. The number of points examined is $N^2 - \alpha\beta N^2 \cong N^2$, as α and β are small fractions. Of these, a fraction γ are added back. Thus, the total number of keypoints at the end of this phase is $\alpha\beta N^2 + \gamma(N^2 - \alpha\beta N^2) = \alpha\beta N^2(1 - \gamma) + \gamma N^2 \cong N^2(\alpha\beta + \gamma)$, again because γ is a small fraction. Clearly the computation of $m^j_i(x, y)$ and $\theta^j_i(x, y)$ can be done in constant time. The overall complexity for all points over all octaves is $\Theta(sN^2)$. Considering the constants, the number of operations is approximately $48sN^2$.
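The magnitude and orientation computation of this section can be sketched as follows. The sketch assumes the scale image is indexed as `L[x][y]`, and it uses `math.atan2` in place of a plain inverse tangent so that the quadrant is resolved and a zero horizontal difference does not cause a division by zero. That substitution is a choice made for this example, not something stated in the text.

```python
import math


def gradient(L, x, y):
    """Return (magnitude, orientation) at interior point (x, y) of image L.

    Central differences are taken along each axis, as in the formulas for
    m(x, y) and theta(x, y). The orientation is in radians.
    """
    dx = L[x + 1][y] - L[x - 1][y]
    dy = L[x][y + 1] - L[x][y - 1]
    magnitude = math.sqrt(dx * dx + dy * dy)
    orientation = math.atan2(dy, dx)
    return magnitude, orientation


# A 3 x 3 image chosen so that dx = 3 and dy = 4 at the center.
L = [[0, 0, 0],
     [0, 0, 4],
     [0, 3, 0]]
m, theta = gradient(L, 1, 1)
```

Both quantities need a fixed number of arithmetic operations per point, consistent with the constant-time claim above.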
2.5 Keypoint Descriptor Generation
In this phase, the algorithm computes a descriptor for each keypoint identified so far. The descriptor is a collection of information in a 2x × 2x neighborhood of the keypoint (the work of Lowe [13] considers a 16 × 16 neighborhood, which we generalize to 2x × 2x). The following tasks are undertaken for each keypoint.
• The magnitudes of all the points in the neighborhood are smoothed by a normalized Gaussian filter with σ = x. This requires $\Theta(x^2)$ multiplications for each point.
• The neighborhood is divided into 4 × 4 regions. In each region the vectors (magnitude and direction of points) are histogrammed into 8 buckets covering 360° using trilinear interpolation [13]. Again this requires $\Theta(x^2)$ time for the neighborhood.
• The feature is computed from these descriptors in the neighborhood by computing a normal of the descriptors in the neighborhood.





with an 8 bucket histogram of vectors. Thus, the feature is $\log_2 \frac{8x^2}{4} = 2\log x + 1$ bits long.
As the time complexity is $\Theta(x^2)$ for each keypoint identified so far, the overall time complexity for all the keypoints is $\Theta(x^2(\alpha\beta + \gamma)N^2)$. Considering the constants², the number of operations is approximately $1520x^2(\alpha\beta + \gamma)N^2$.
The overall time complexity of the SIFT algorithm is determined from the complexities of
the phases discussed so far. This is shown in Table 2.1.
² For this phase the constants were obtained through the experiments described in Chapter 3.
Figure 2.7: Keypoint descriptor generation
Table 2.1: The time complexity and the number of operations required by the different phases of the SIFT algorithm for $N^2$ pixels

Phase                            Complexity                        Number of operations
Gaussian Blurring                $\Theta(N^2 w^2 s)$               $4N^2 w^2 s$
Difference of Gaussian           $\Theta(sN^2)$                    $4N^2 s$
Scale-space Extrema Detection    $\Theta(sN^2)$                    $104sN^2$
Keypoint Detection               $\Theta(\alpha sN^2)$             $100s\alpha N^2$
Orientation Assignment           $\Theta(sN^2(1 - \alpha\beta))$   $48sN^2$
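To make the operation counts of this chapter concrete, the following sketch evaluates them for one set of parameters. The formulas are those given in the text (the descriptor count is the one stated in Section 2.5). The sample values of w, s and x are hypothetical choices for illustration; x = 8 corresponds to the 16 × 16 neighborhood of Lowe [13], and the values of α, β and γ are the nominal ones reported in Section 3.3.

```python
def sift_operation_counts(N, w, s, alpha, beta, gamma, x):
    """Approximate operation counts per phase for an N x N image."""
    n2 = N * N
    return {
        "Gaussian Blurring": 4 * n2 * w * w * s,
        "Difference of Gaussian": 4 * n2 * s,
        "Scale-space Extrema Detection": 104 * s * n2,
        "Keypoint Detection": 100 * s * alpha * n2,
        "Orientation Assignment": 48 * s * n2,
        "Keypoint Descriptor Generation":
            1520 * x * x * (alpha * beta + gamma) * n2,
    }


# Hypothetical parameter values, used only to exercise the formulas.
counts = sift_operation_counts(N=1000, w=5, s=3,
                               alpha=0.006, beta=0.35, gamma=0.0004, x=8)
for phase, ops in counts.items():
    print(f"{phase:32s} {ops:.3e}")
```

With these sample values the Gaussian blurring and descriptor generation counts are the two largest, although the exact balance depends on the chosen w, s and x.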




The purpose of this chapter is to experimentally study the time requirement for different phases of the SIFT algorithm and to detect, if possible, any trend in the values of feature fractions α, β and γ (see Section 2.1). We performed this study using a SIFT implementation by Hess [19]. In Section 3.1, we describe the pictures used in our study; in Section 3.2 we discuss the time taken by the different phases of the SIFT algorithm; and in Section 3.3 we describe the feature fraction values used in this work.
3.1 Images Used in Study
We selected a range of pictures (see Figure 3.1) to test the SIFT algorithm. These pictures were obtained from the Internet [30]–[54].
Images fall in the following categories.
1. Airplanes (four pictures)
2. Spheres (four pictures)
3. Portraits (four pictures each of Einstein and Gandhi, and three pictures of Sandra
Day O’Connor)
4. Vehicles (three pictures)
5. Palm-like trees (four pictures)
The pictures represent a variety of themes with enough elements within each category to facilitate detection of patterns. Each picture was converted into a gray scale portable network
graphics (PNG) image with 2000×2000 pixels. We also considered 19 reduced resolutions of
each original picture. These resolutions are at 1900×1900, 1800×1800, · · · , 100×100.
Thus we have 26 “pictures,” each at 20 resolutions, totalling 520 “images” in the study. We
use the term “image” to represent any of these 520 elements. The term “picture” is used
for the 26 sets, each of 20 images of varying resolutions. Figure 3.1 shows a representative
picture of each of these sets.
3.2 Time Taken by Different Phases
The SIFT implementation [19] that we used in this work is a large piece of code (over 2500 lines) with several calls to routines, including OpenCV [18] functions whose code is not explicitly available in the program. This makes it virtually impossible to trace through all parts of the code to determine their execution times.
Our first aim is to identify the major parts of the code that fit a sequential pipeline (see
Chapter 5). We do not wish to separately consider functions that are called by multiple
parts of the program. We proceed as follows.
(a) We introduce a “level counter” that is initialized to zero and incremented every time a function is called. The counter is decremented each time a function returns to the calling function. Functions with multiple levels of invocation can immediately be excluded from our consideration.
(b) Functions with very small execution times need not be considered separately. These
times (even with multiple calls) can typically be rolled into the lines of their parent
routines.
(c) With these two filters we identified the functions shown in Table 3.1 that need separate
consideration.
(d) For each routine in Table 3.1, we modified the SIFT code to record the starting clock
time and ending clock time. The time taken by each routine is calculated by finding
a difference of these times.
(e) Of the routines in Table 3.1, create_init_img is excluded because it is a preprocessing stage that converts a color image to a gray scale image in PNG format. In our experiments we start with such an image. Routine release_pyr, which is related to displaying the output on the screen, is also excluded. Of the remaining routines, calc_feature_scales has a much smaller execution time than the others and
Figure 3.1: Pictures considered
Table 3.1: The time values in the table correspond to the average values of Image 9 (Gandhi 4) over all its resolutions.

Routine Name           Level   Time     Description
create_init_img        2       0.116    Convert color to grayscale image
build_gauss_pyr        2       0.871    Create scales (see Section 2.2 Page 6)
build_dog_pyr          2       0.180    Difference of Gaussians (see Page 9)
scale_space_extrema    2       1.571    Extrema and Keypoint detection (see Section 2.3)
calc_feature_scales    2       0.016    Part of Orientation assignment
calc_feature_oris      2       2.204    Orientation assignment (see Section 2.4)
compute_descriptors    2       11.341   Computing descriptors (see Section 2.5)
release_pyr            2       0.008    Related to Display
sift_features          1       16.603   SIFT routine
Main Routine           0       16.646   Start of main
functionally can be rolled together with calc_feature_oris. Therefore we use only calc_feature_oris with the understanding that this also includes calc_feature_scales.
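The level counter of step (a) and the start and end clock readings of step (d) can be sketched as below. The SIFT code instrumented in this study is written in C; this Python version is only an illustration of the bookkeeping, and the decorator, the `state` and `records` containers and the two stub routines are hypothetical.

```python
import functools
import time

state = {"level": 0}   # the "level counter", initialized to zero
records = []           # (routine name, call level, elapsed seconds)


def instrumented(func):
    """Wrap a routine so that its call level and run time are recorded."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        state["level"] += 1                 # incremented on every call
        level = state["level"]
        start = time.perf_counter()         # starting clock time
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start   # end minus start
            records.append((func.__name__, level, elapsed))
            state["level"] -= 1             # decremented on every return
    return wrapper


@instrumented
def build_gauss_pyr_stub():
    # Stand-in for a level 2 routine; it just does some arithmetic.
    return sum(range(1000))


@instrumented
def sift_features_stub():
    # Stand-in for the level 1 routine that calls the level 2 routines.
    return build_gauss_pyr_stub()


sift_features_stub()   # called from the uninstrumented main level (level 0)
levels = {name: level for name, level, _ in records}
```

A routine that shows up with more than one level in `records` is called from several depths and would be excluded under step (a).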
We now have the following routines that have a significant time contribution in the overall time of the program. The mnemonics in parentheses are used in graphics to identify these routines.
1. build_gauss_pyr (Gauss) [Gaussian blurring]
2. build_dog_pyr (DoG) [Difference of Gaussians]
3. scale_space_extrema (SSext) [Scale space extrema and keypoint detection]
4. calc_feature_oris (Orien) [Orientation assignment]
5. compute_descriptors (Descr) [Keypoint descriptor generation]
The above stages are in close correspondence with the main phases of the SIFT algorithm
described in Lowe [13].
We now examine the behavior of these major phases. Figure 3.2 shows the percentage of
the total time taken by the major phases of the SIFT algorithm. The Gaussian phase and
Computing Descriptor phase need most of the time (around 70%). Figures 3.3, 3.4 show the
normalized and absolute times taken by the different phases of the SIFT algorithm across all
the images. The absolute values are averaged over their resolutions. The Gaussian blurring
phase and Keypoint descriptor generation phase still take a large amount of time (around
Figure 3.2: Percentage time for major stages of SIFT (averaged over all images)
70%). However there is a large variation in the time taken by Gaussian blurring vis-a-vis
the Computing Descriptor phase. This variation also does not seem to be correlated with
picture categories.
In this thesis we consider that the Gaussian blurring phase and Keypoint descriptor gen-
eration phase are the most time consuming stages (as bases for decoupling the algorithm
across chips). These stages also exhibit high parallelism to facilitate efficient execution on
multicore chips. In fact, for the two-chip pipeline mode we assume that Chip 1 consists of the phases Gaussian blurring to Keypoint Detection, and Chip 2 consists of the Orientation Assignment and Keypoint descriptor generation phases.
We now examine how individual phases perform across different pictures.
Figure 3.5 shows the time taken by the Gaussian stage for all 2000× 2000 images (highest
resolution), 1000 × 1000 images (intermediate resolution) and 100 × 100 images (lowest
resolution). The average time taken by the Gaussian stage across these 20 resolutions is
also shown. Notice the strong correlation between times taken for different resolutions.
They all indicate a similar pattern. Thus the average over all resolutions of a fixed image
is a reasonable indicator for that picture. We will henceforth consider only the average of
Figure 3.3: Percentage time of SIFT major phases over different pictures (averaged over
all resolutions)
all image resolutions of each picture.
As Figure 3.5 shows, the amount of time taken by the Gaussian blurring stage across all images is almost constant. This is because all the images have the same number of pixels and no distinguishing features have been detected yet. The difference of Gaussians stage also shows similar behavior (see Figure 3.6).
Figure 3.7 shows the Scale-space extrema and keypoint detection phase, Orientation assignment phase and keypoint descriptor generation phase times across pictures. While these times vary widely across pictures, they are correlated within the images of a picture. This is because the times for these stages depend on the number of features in the image.
In this thesis we will consider 1- and 2-chip pipelines for running SIFT (see Chapters 7,
8). As noted earlier for the 2-chip case we will broadly divide the algorithm between the
Gaussian blurring phase and keypoint descriptor generation phase. The Gaussian blurring
phase takes nearly constant time across pictures, while the keypoint descriptor generation
phase does not. To even out these variations, we place the difference of Gaussians phase and
scale-space extrema and keypoint detection phases with the Gaussian blurring phase and
Orientation assignment phase with the keypoint descriptor generation phase. If additional
chips are to be used in the pipeline (future work) then a similar approach can be used to
further subdivide the Gaussian blurring phase and keypoint descriptor generation phase.
Figure 3.4: The absolute times of major SIFT phases (averaged over all image resolutions
of a picture)
3.3 Feature Fractions
Figure 3.8 shows, for each resolution size, the number of extrema detected, the number of keypoints and the number of features, averaged over all pictures. Clearly, the number of extrema is more than the number of keypoints, which, in turn, is slightly less than the number of features. This brings us to the idea of feature fractions of an image (see Section 2.1).
To recap, an $N^2$ pixel image produces $\alpha N^2$ extrema, $\alpha\beta N^2$ keypoints and $(\alpha\beta + \gamma)N^2$ features. Figures 3.9, 3.10, 3.11 show the values of α, β and γ for Picture number 26 (all 20 image resolutions of the picture). In these graphs the numbers 1-20 on the X-axis represent the 20 resolutions, with 1 as the 2000 × 2000 image (largest resolution) and 20 as the 100 × 100 image (smallest resolution). That is, a smaller number on the X-axis represents a larger resolution. Notice that α, β and γ are nearly constant for large resolutions¹ of the
¹ We give more importance to larger images because parallel SIFT is more useful for large images that require higher speed to keep up with real time constraints.
Figure 3.5: The time taken by the Gaussian stage for 2000× 2000 (the highest resolution),
1000 × 1000 (intermediate resolution) and 100 × 100 (lowest resolution) and the average
value over all these resolutions
pictures (left end of the X-axis). Also α, β and γ are all small fractions. These observations hold in general over all these pictures. Figures 3.12, 3.13, 3.14 show the values of α, β and γ for Pictures 1–25. These graphs are the same as those in Figures 3.9, 3.10, 3.11 except that the axes are not labeled. Figures 3.15, 3.16, 3.17 show the average values of α, β and γ.
From these we will assume nominal values of α, β and γ around 0.6%, 35% and 0.04% (see Figures 3.15, 3.16, 3.17) for every image considered in this work. Among the pictures considered, Picture 16 did not resemble the remaining pictures, so we excluded it from the determination of feature fraction values.
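As a quick check of what these nominal fractions mean in absolute terms, the sketch below applies them to a 2000 × 2000 image using the expressions $\alpha N^2$, $\alpha\beta N^2$ and $(\alpha\beta + \gamma)N^2$ from Section 2.1. The function name is hypothetical.

```python
def feature_counts(N, alpha=0.006, beta=0.35, gamma=0.0004):
    """Nominal numbers of extrema, keypoints and features for N x N pixels."""
    pixels = N * N
    extrema = alpha * pixels
    keypoints = alpha * beta * pixels
    features = (alpha * beta + gamma) * pixels
    return extrema, keypoints, features


extrema, keypoints, features = feature_counts(2000)
```

For N = 2000 this gives about 24,000 extrema, 8,400 keypoints and 10,000 features, so the number of keypoints is indeed somewhat smaller than the number of features, in line with Figure 3.8.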
Figure 3.6: Average time for Gaussian blurring and difference of Gaussian phases
Figure 3.7: Times taken for scale-space extrema detection, orientation assignment and
keypoint descriptor generation phases
Figure 3.8: Nominal number of extrema, keypoints and features
Figure 3.9: The value of α for all the image resolutions of picture number 26
Figure 3.10: The value of β for all the image resolutions of picture number 26
Figure 3.11: The value of γ for all the image resolutions of picture number 26
Figure 3.12: The value of α for all the image resolutions of pictures numbered 1-25
Figure 3.13: The value of β for all the image resolutions of pictures numbered 1-25
Figure 3.14: The value of γ for all the image resolutions of pictures numbered 1-25
Figure 3.15: The value of α across all the images averaged over their resolutions
Figure 3.16: The value of β across all the images averaged over their resolutions




As mentioned earlier, we assume the SIFT algorithm to process $n^2$-pixel tiles of the image in each iteration. We can process the given image of size N × N in one iteration (if n = N) or, at the other extreme, we can process one pixel at a time (if n = 1). In general 1 ≪ n ≪ N. The given image is decomposed into subimages each with $n^2$ pixels. These




tiles. These $\xi^2$ tiles are provided as input to the computation pipeline (discussed in Chapter 6) one at a time to Stage $S_0$ of the pipeline.
The order in which the image tiles are sent as input to the pipeline is called the tile ordering. This tile ordering plays a major role in the performance of the algorithm where there are restrictions on the input format. In this chapter, we introduce two tile orderings used in the later chapters to analyze the performance of the SIFT algorithm, namely Row Major Ordering and Diagonal Ordering.
The remainder of this chapter is organized as follows. In Section 4.1, we describe the notation for a tile. In Sections 4.2 and 4.3, we describe the two tile orderings.
4.1 Tile Notation
In this section we define some notation for a tile and its numbering in a tile ordering. Each tile is an n × n square array of $n^2$ distinct pixels from the given image. For a given N × N image, there will be $\xi = \frac{N}{n}$ rows and columns of tiles. We refer to this ξ × ξ array of tiles as the tile array. The tiles in this array are represented with two coordinates (r, c) denoting the row r and column c in the tile array. Figure 4.1 shows the tile coordinates in a 5 × 5 tile array. For 0 ≤ r, c < ξ, let $\tau_{r,c}$ represent the tile at position (r, c). Let ⊚ be the given ordering, where ⊚ ∈ {R, D} (denoting row major and diagonal ordering); then the kth tile in this ordering is denoted by $\tau^{⊚}_k$. If this happens to be the tile in position (r, c), then it is also denoted by $\tau^{⊚}_{r,c}$. This quantity k associated with tile $\tau_{r,c}$ with respect to ordering ⊚ is called the rank of $\tau_{r,c}$. Where the ordering ⊚ is clear or unimportant, we will write $\tau^{⊚}_{r,c}$ or $\tau^{⊚}_k$ simply as $\tau_{r,c}$ or $\tau_k$.












Figure 4.1: Coordinate representation of tiles in a tile array
4.2 Row Major Ordering
As mentioned earlier, this ordering deals with square tiles of size n × n. Here the order of the tiles is by rows and the direction is from left to right within a row. Figure 4.2 shows this order for a 5 × 5 tile array.
Lemma 1. For any 0 ≤ r, c < ξ, the rank of tile $\tau_{r,c}$ in row major order is $\mathrm{rank}(\tau_{r,c})^R = r\xi + c$.
Proof: Any given tile $\tau_{r,c}$ can only be enumerated after all tiles before it are enumerated. The number of tiles enumerated before tile $\tau_{r,c}$ is rξ (in the rows before row r) plus c (in row r). Thus k = rξ + c.
Note: Here $\tau_{r,c} = \tau^R_k$ where $k = \mathrm{rank}(\tau_{r,c})^R = r\xi + c$.
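Lemma 1 translates directly into code. The sketch below gives the rank of a tile in row major order and the inverse mapping from a rank back to tile coordinates; the function names are hypothetical.

```python
def rank_row_major(r, c, xi):
    """Rank of tile (r, c) in a xi x xi tile array, per Lemma 1."""
    return r * xi + c


def tile_at_rank_row_major(k, xi):
    """Inverse of rank_row_major: the (r, c) position of the k-th tile."""
    return divmod(k, xi)


xi = 5
ranks = [rank_row_major(r, c, xi) for r in range(xi) for c in range(xi)]
```

For the 5 × 5 tile array of Figure 4.2, the tile at (2, 3) has rank 2 · 5 + 3 = 13.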







Figure 4.2: Row-major ordering
4.3 Diagonal Ordering
Here the order of tiles is by the value of r + c. Consider two tiles $\tau_{r_1,c_1}$ and $\tau_{r_2,c_2}$. If $(r_1 + c_1) < (r_2 + c_2)$ then $\tau_{r_1,c_1}$ is enumerated before $\tau_{r_2,c_2}$. If $(r_1 + c_1) = (r_2 + c_2)$, then $\tau_{r_1,c_1}$ is enumerated before $\tau_{r_2,c_2}$ if and only if $r_1 < r_2$. Figure 4.3 shows this order for a 5 × 5 tile array. The following lemma can be proved in a manner similar to Lemma 1; its proof is omitted for brevity.
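The ordering rule above can be written as a sort on the pair (r + c, r). The sketch below also includes a closed-form rank that was derived for this example by counting the tiles on earlier diagonals; it is offered as an illustration and is checked against the sorted enumeration, but it is not claimed to be the exact form used in the lemma. All function names are hypothetical.

```python
def diagonal_order(xi):
    """List the tiles of a xi x xi tile array in diagonal order."""
    tiles = [(r, c) for r in range(xi) for c in range(xi)]
    # Primary key: r + c; ties are broken by the smaller row index r.
    return sorted(tiles, key=lambda t: (t[0] + t[1], t[0]))


def rank_diagonal(r, c, xi):
    """Closed-form rank of tile (r, c) in diagonal order (derived here)."""
    d = r + c
    if d < xi:
        # Diagonals 0 .. d-1 hold 1 + 2 + ... + d tiles; rows start at 0.
        return d * (d + 1) // 2 + r
    # Diagonals d .. 2*xi-2 hold m + (m-1) + ... + 1 tiles, m = 2*xi-1-d.
    m = 2 * xi - 1 - d
    first_row = d - xi + 1          # smallest row index on diagonal d
    return xi * xi - m * (m + 1) // 2 + (r - first_row)


order = diagonal_order(5)
```

For a 5 × 5 tile array the enumeration begins (0, 0), (0, 1), (1, 0), (0, 2), (1, 1), (2, 0) and ends at (4, 4).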























Figure 4.3: Diagonal ordering













if r + c ≥ ξ




The computational platform on which we study SIFT is a pipeline of chips. The fact
that the SIFT algorithm consists of sequential stages and that many real-time applications
of SIFT stream in data make a pipeline a suitable model for SIFT. Each chip in the
model has three internal stages, each stage feeding its partial result to the next. So the
computational model can be viewed as a pipeline of stages, rather than a pipeline of chips.
Let $C = \{C_0, C_1, C_2, \cdots, C_{c-1}\}$ be a set of c chips (see Figure 5.1). For each 1 ≤ i < c − 1, Chip $C_i$ receives its input from Chip $C_{i-1}$ and sends its output to Chip $C_{i+1}$. Chip $C_0$ receives the input of the algorithm and Chip $C_{c-1}$ produces the algorithm’s final output (see Figure 5.1).




Figure 5.1: A c-chip pipeline
As noted above, each chip is assumed to consist of three internal stages, input, compute and output, that can work simultaneously (see Figure 5.2). The c-chip model is a $(2c + 1)$-stage pipeline. The output stage of Chip $C_i$ must be compatible with the input stage of Chip $C_{i+1}$ in terms of bandwidth, pins, etc. We will consider these two stages (namely
Figure 5.2: A chip in the pipeline model
output stage of Chip $C_i$ and input stage of Chip $C_{i+1}$) to be identical in the pipeline. To keep the notation clear and consistent on this $(2c + 1)$-stage pipeline, we will denote these stages as $S_0, S_1, S_2, \cdots, S_{2c}$ where $S_{2i}$ denotes the input stage of Chip $C_i$ and the output
















Figure 5.3: Stages in the pipeline model
The input stage of Chip $C_i$ receives the input from the output stage of Chip $C_{i-1}$ and holds this in local memory until the compute stage of Chip $C_i$ requires it. The output stage of Chip $C_i$ receives the results of the computation from the compute stage.
Now we describe how the SIFT algorithm runs on the c-chip pipeline. Recall that the given image of size N × N is decomposed into $n^2$-pixel tiles. These tiles are fed into the pipeline in some order as described in Chapter 4. For a given tile order ⊚, let $\tau^{⊚}_k$ (or simply $\tau_k$) be the kth tile input to the pipeline.
Let us trace the traversal of the tiles through the pipeline in this ordering. For ease of explanation, assume each tile traverses each stage in 1 unit of time, starting from t = 1. At time t = 1 the first tile $\tau_k$ enters Chip $C_0$ through Stage $S_0$. Additional information necessary to process the tile also enters Stage $S_0$. For reasons that will become apparent later, let this tile entering be denoted by $\tau_{k,0}$. At time t = 2, Stage $S_1$ receives $\tau_k$ from $S_0$ and processes it. We call this input to Stage $S_1$ $\tau_{k,1}$. After processing it, Stage $S_1$ produces an output $\tau_{k,2}$ that Stage $S_2$ receives at time t = 3. In general, at time t = 2i + 1, Stage $S_{2i}$ (an input/output stage) receives tile $\tau_{k,2i}$. Note that $\tau_{k,2i}$ may be just an extract of the original image tile $\tau_k$.
5.1 Notation
In this section we formalize the notation used earlier to describe “tiles” at each stage or chip in the pipeline model. A given tile $\tau_k$ goes through the $(2c + 1)$-stage pipeline. The notation $\tau_{k,\ell}$ represents the data input to Stage $S_\ell$ of the pipeline. Tile $\tau_{k,\ell+1}$ represents the result of Stage $S_\ell$, which is fed to Stage $S_{\ell+1}$ (if it exists), and so on. For tile $\tau_k$, $\tau_{k,0}$ is the image input to the pipeline at Stage $S_0$ and $\tau_{k,2c}$ is the final output of the pipeline at Stage $S_{2c}$.
Consider any chip $C_i$, with internal stages $S_{2i}$, $S_{2i+1}$, $S_{2i+2}$. Here the stages $S_{2i}$ and $S_{2i+2}$ are communication stages (input and output stages) and $S_{2i+1}$ is the compute stage. Stage $S_{2i}$ brings in all the data needed for $S_{2i+1}$ to start its computation on any given tile $\tau_{k,2i+1}$, and $S_{2i+2}$ carries out all the data produced by $S_{2i+1}$. The data needed to perform the computation on tile $\tau_{k,2i+1}$ at Stage $S_{2i+1}$ is not necessarily the same as $\tau_{k,2i+1}$ and may have been brought in entirely or partially at an earlier time. We formalize this notion below.
To process tile $\tau_k$, the compute stage $S_{2i+1}$ of chip $C_i$ requires some additional data related to the tile $\tau_{k,2i+1}$. These bits have to be present in the chip $C_i$ before stage $S_{2i+1}$ can process $\tau_{k,2i+1}$. However, the bits could have been brought in earlier to Chip $C_i$ (wholly or in part). The time when these bits arrive at Chip $C_i$ is decided by the input protocol, which is discussed in Section 5.3.
5.1.1 Stage Start Times
In this section we derive an expression for the earliest starting time of each stage $S_\ell$ in processing tile $\tau_k$. This allows us to derive an expression for the overall time of the algorithm. For any stage $S_\ell$ (0 ≤ ℓ < 2c) and any tile $\tau_k$, let $T_{k,\ell}$ be the earliest time when stage $S_\ell$ can work on $\tau_{k,\ell}$. Let $t_{k,\ell}$ denote the time needed for stage $S_\ell$ to perform its action on tile $\tau_{k,\ell}$.
The earliest time at which stage $S_\ell$ can start processing tile $\tau_{k,\ell}$ depends on two quantities.
1. The time for stage $S_\ell$ to finish processing the previous tile $\tau_{k-1}$.
2. The time for stage $S_{\ell-1}$ to complete processing the current tile $\tau_k$.
The following theorem captures the above observation.
Theorem 3. If ℓ is any stage in the pipeline where 0 ≤ ℓ < 2c, then the earliest time when stage $S_\ell$ can start on tile $\tau_k$ is
$$T_{k,\ell} = \max\{T_{k,\ell-1} + t_{k,\ell-1},\; T_{k-1,\ell} + t_{k-1,\ell}\}$$
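The recurrence of Theorem 3 can be evaluated directly with a small table. The sketch below assumes the per-tile stage times are given as a matrix `t[k][l]`, takes the start of the first tile at the first stage as time 0, and drops the term that refers to a nonexistent previous stage or previous tile. The function names are hypothetical.

```python
def stage_start_times(t):
    """Earliest start times T[k][l] from the recurrence of Theorem 3.

    t[k][l] is the time stage l needs for tile k. A stage can start on a
    tile once the previous stage has finished that tile and the stage
    itself has finished the previous tile.
    """
    num_tiles, num_stages = len(t), len(t[0])
    T = [[0] * num_stages for _ in range(num_tiles)]
    for k in range(num_tiles):
        for l in range(num_stages):
            after_prev_stage = T[k][l - 1] + t[k][l - 1] if l > 0 else 0
            after_prev_tile = T[k - 1][l] + t[k - 1][l] if k > 0 else 0
            T[k][l] = max(after_prev_stage, after_prev_tile)
    return T


def total_time(t):
    """Time at which the last stage finishes the last tile."""
    T = stage_start_times(t)
    return T[-1][-1] + t[-1][-1]


# Three tiles through three stages; the middle stage is the slowest.
t = [[1, 3, 1] for _ in range(3)]
T = stage_start_times(t)
```

In this example the slow middle stage never waits for input after the first tile, so the total time is its time for all three tiles plus the fill and drain times of the other two stages.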
The following lemma, representing a well known standard result, finds use later in this chapter.
Lemma 4. The recurrence $a_0 = b_0$ and $a_n = a_{n-1} + b_n$, for n > 0, has the solution $a_n = b_0 + b_1 + b_2 + \cdots + b_n$.
Proof: We proceed by induction on n. For n = 0, $a_0 = b_0$. Assuming the lemma holds for n, consider $a_{n+1}$: $a_{n+1} = a_n + b_{n+1} = (b_0 + b_1 + \cdots + b_n) + b_{n+1}$.





Proof: Since there is no stage before stage $S_0$, it does not have to wait on any other stage to receive the data of a tile. Therefore from Theorem 3, $T_{k,0} = T_{k-1,0} + t_{k-1,0}$. With











Proof: Since there is no tile before tile $\tau_0$, a stage $S_\ell$ is free when $\tau_{0,\ell}$ arrives at $S_\ell$. Then from Theorem 3, $T_{0,\ell} = T_{0,\ell-1} + t_{0,\ell-1}$. With $T_{0,0} = 0$ (the count of time begins here), this recurrence has the solution $T_{0,\ell} = \sum_{v=0}^{\ell-1} t_{0,v}$ (from Lemma 4).
We now discuss the total time needed to process all tiles.
Total Time: Recall that the given image of size N × N is decomposed into ξ rows and ξ columns of n × n tiles. The total time to process a given image is equal to the time taken by the pipeline to process all tiles in the ξ × ξ tile array. Denote the $\xi^2$ tiles in the array by $\tau_0, \tau_1, \cdots, \tau_{\xi^2-1}$, where the indices $0, 1, \cdots, \xi^2 - 1$ reflect the rank of the tile in the tile ordering (see Sections 4.2, 4.3). This, coupled with the fact that a stage $S_\ell$ cannot process tile $\tau_{k+1,\ell}$ until it has finished with tile $\tau_{k,\ell}$, gives the following expression for the overall time $T$ for processing all tiles through all stages: $T = T_{\xi^2-1,2c} + t_{\xi^2-1,2c}$.
Definition 1. Consider a $(2c + 1)$-stage pipeline. For any fixed ℓ, where 0 ≤ ℓ ≤ 2c, Stage $S_\ell$ is a comparable stage iff for all $0 \le k < \xi^2$ and for every 0 ≤ ℓ′ ≤ 2c, either $t_{k,\ell} \le t_{k-\ell'+\ell,\ell'}$ or $t_{k,\ell} \ge t_{k-\ell'+\ell,\ell'}$.
In a stage that is not comparable, for some ℓ′ we could have $t_{k,\ell} < t_{k-\ell'+\ell,\ell'}$ while for a different ℓ′ we have $t_{k,\ell} > t_{k-\ell'+\ell,\ell'}$.
Definition 2. A $(2c + 1)$-stage pipeline is a totally ordered pipeline iff for all 0 ≤ ℓ ≤ 2c, Stage $S_\ell$ is a comparable stage.
In our application, each stage performs a specific activity on different tiles. The time for a stage is generally a function of the task it performs, as all tiles are of the same size. Therefore we may expect our pipeline to be totally ordered.
In a totally ordered pipeline, for all $0 \le k < \xi^2$, each pair of stages $S_\ell$ and $S_{\ell'}$ satisfies either $t_{k,\ell} \le t_{k-\ell'+\ell,\ell'}$ or $t_{k,\ell} \ge t_{k-\ell'+\ell,\ell'}$. We use the notation $S_\ell \preceq S_{\ell'}$ to denote that $t_{k,\ell} \le t_{k-\ell'+\ell,\ell'}$ and the notation $S_\ell \succeq S_{\ell'}$ to denote that $t_{k,\ell} \ge t_{k-\ell'+\ell,\ell'}$. Intuitively, $S_\ell \preceq S_{\ell'}$ implies that $S_\ell$ will not hold $S_{\ell'}$ up due to the time it takes to process a tile.
Lemma 7. If $S_0 \preceq S_1$, then $T_{k+1,0} \le T_{k,1}$.
Proof: We proceed by induction on k ≥ 0. For k = 0, we consider $T_{1,0}$ and $T_{0,1}$. From Theorem 3 and Lemma 5 we know that $T_{1,0} = T_{0,1} = t_{0,0}$. This implies $T_{1,0} \le T_{0,1}$. This proves the base case.
Assuming the lemma to hold for all $0 \le k \le \xi^2 - 2$, consider $T_{k+2,0}$ and $T_{k+1,1}$.
$$\begin{aligned}
T_{k+1,1} &= \max\{T_{k+1,0} + t_{k+1,0},\; T_{k,1} + t_{k,1}\} && \text{from Theorem 3} \\
&= \max\{T_{k+2,0},\; T_{k,1} + t_{k,1}\} && \text{from Lemma 5} \\
&\ge T_{k+2,0}
\end{aligned}$$
We now generalize the above lemma to stages that are not necessarily neighbors.
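Lemma 8. For any 0 ≤ ℓ ≤ 2c, if $S_0, S_1, \cdots, S_{\ell-1} \preceq S_\ell$ then, for all $0 \le k \le \xi^2 - 1$, $T_{k+1,\ell} \le T_{k,\ell+1}$.
Proof: We proceed by induction on ℓ ≥ 0. For ℓ = 0 consider $T_{k+1,0}$ and $T_{k,1}$. We know that $T_{k+1,0} \le T_{k,1}$ (from Lemma 7).
Assuming the current lemma to hold for any ℓ′ < ℓ, consider $T_{k+1,\ell}$ and $T_{k,\ell+1}$.
$$\begin{aligned}
T_{k+1,\ell} &= \max\{T_{k+1,\ell-1} + t_{k+1,\ell-1},\; T_{k,\ell} + t_{k,\ell}\} && \text{from Theorem 3} \\
&= T_{k,\ell} + t_{k,\ell} && \text{as } T_{k,\ell} \ge T_{k+1,\ell-1} \text{ (induction hypothesis)} \\
& && \text{and } t_{k,\ell} \ge t_{k+1,\ell-1} \text{ as } S_{\ell-1} \preceq S_\ell
\end{aligned}$$
That is,
$$T_{k+1,\ell} = T_{k,\ell} + t_{k,\ell} \qquad (5.1)$$
Now,
$$\begin{aligned}
T_{k,\ell+1} &= \max\{T_{k,\ell} + t_{k,\ell},\; T_{k-1,\ell+1} + t_{k-1,\ell+1}\} && \text{from Theorem 3} \\
&= \max\{T_{k+1,\ell},\; T_{k-1,\ell+1} + t_{k-1,\ell+1}\} && \text{from Equation (5.1)} \\
&\ge T_{k+1,\ell}
\end{aligned}$$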
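Although the following lemma seems obvious from the above lemma, it does not follow directly. So we prove it below.
Lemma 9. For any 0 ≤ ℓ ≤ 2c, if $S_0, S_1, \cdots, S_{\ell-1} \succeq S_\ell$ then, for all $0 \le k \le \xi^2 - 1$, $T_{k+1,\ell} \ge T_{k,\ell+1}$.
Proof: We proceed by induction on ℓ ≥ 0. For ℓ = 0 consider $T_{k+1,0}$ and $T_{k,1}$. We know that $T_{k+1,0} \ge T_{k,1}$ (from Lemma 7).
Assuming the lemma to hold for any ℓ′ < ℓ, consider $T_{k+1,\ell}$ and $T_{k,\ell+1}$.
$$\begin{aligned}
T_{k,\ell+1} &= \max\{T_{k,\ell} + t_{k,\ell},\; T_{k-1,\ell+1} + t_{k-1,\ell+1}\} && \text{from Theorem 3} \\
&= T_{k,\ell} + t_{k,\ell} && \text{as } T_{k,\ell} \ge T_{k-1,\ell+1} \text{ (induction hypothesis)} \\
& && \text{and } t_{k,\ell} \ge t_{k-1,\ell+1} \text{ (as given in the lemma, as } S_\ell \succeq S_{\ell-1})
\end{aligned}$$
That is,
$$T_{k,\ell+1} = T_{k,\ell} + t_{k,\ell} \qquad (5.2)$$
Now,
$$\begin{aligned}
T_{k+1,\ell} &= \max\{T_{k+1,\ell-1} + t_{k+1,\ell-1},\; T_{k,\ell} + t_{k,\ell}\} && \text{from Theorem 3} \\
&= \max\{T_{k+1,\ell-1} + t_{k+1,\ell-1},\; T_{k,\ell+1}\} && \text{from Equation (5.2)} \\
&\ge T_{k,\ell+1}
\end{aligned}$$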
In a $(2c + 1)$-stage totally ordered pipeline there exists a stage $S_m$ such that $S_m \succeq S_\ell$ for all 0 ≤ ℓ ≤ 2c. This stage is called the maximal stage. It will never have to wait on any of its previous stages. The following lemma formalizes this statement.
Lemma 10. If $S_m$ is a maximal stage, then for all $0 \le k < \xi^2$, $T_{k,m} = T_{k-1,m} + t_{k-1,m}$.
Proof: We know that
$$\begin{aligned}
T_{k,m} &= \max\{T_{k-1,m} + t_{k-1,m},\; T_{k,m-1} + t_{k,m-1}\} && \text{from Theorem 3} \\
&= T_{k-1,m} + t_{k-1,m} && \text{as } T_{k-1,m} \ge T_{k,m-1} \text{ from Lemma 8}
\end{aligned}$$








Proof: From Lemma 10, $T_{k,m} = T_{k-1,m} + t_{k-1,m}$. Solving this recurrence, we get $T_{k,m} = T_{0,m} + \sum_{u=0}^{k-1} t_{u,m}$ (from Lemma 4). We know that $T_{0,m} = \sum_{v=0}^{m-1}$








Let $T_M$ denote the time for stage $S_m$ to process all $\xi^2$ tiles. Let $T$ denote the total time for the pipeline to process all tiles.
Theorem 12. In a (2c+ 1)-stage totally ordered pipeline the total time to run all ξ2 tiles
is T = TM + TE, where TM is the time for the maximal stage to process all ξ
2 tiles and TE
is the time taken by the stages Sm+1, Sm+2, · · · , S2c to process the last tile τξ2−1.
Proof: After time $T_M$, all $\xi^2$ tiles have been processed by the stage $S_m$. For any
stage $S_\ell$, having processed tile $\tau_{\xi^2-1}$ implies that all tiles $\tau_0, \tau_1, \cdots, \tau_{\xi^2-1}$ have been
processed by that stage. Therefore once stages $S_{m+1}, S_{m+2}, \cdots, S_{2c}$ have each processed tile
$\tau_{\xi^2-1}$, all tiles have been processed. Therefore $T = T_M + T_E$.
Time $T_E$ depends on the times taken by the stages $S_{m+1}, S_{m+2}, \cdots, S_{2c}$. Since all these
stages have to process the last tile $\tau_{\xi^2-1}$, clearly $T_E \ge \sum_{v=m+1}^{2c} t_{\xi^2-1,v}$. The theorem below
identifies a case where this lower bound on $T_E$ is achieved.
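The recurrence of Theorem 3 and the decomposition T = TM + TE can be checked numerically on a small case. The following Python sketch is illustrative only: the function names, the boundary condition (terms with a negative index are dropped, so the first tile enters at time 0) and the constant per-stage times are assumptions made for the example, not part of the analysis above.

```python
def start_times(t):
    """Earliest start times T[k][l] from the recurrence
    T[k][l] = max(T[k][l-1] + t[k][l-1], T[k-1][l] + t[k-1][l]).

    t[k][l] is the time stage l spends on tile k. Terms whose index would be
    negative are dropped (an assumed boundary condition), so T[0][0] = 0.
    """
    tiles, stages = len(t), len(t[0])
    T = [[0] * stages for _ in range(tiles)]
    for k in range(tiles):
        for l in range(stages):
            waits = []
            if l > 0:  # tile k must first leave the previous stage
                waits.append(T[k][l - 1] + t[k][l - 1])
            if k > 0:  # stage l must first finish the previous tile
                waits.append(T[k - 1][l] + t[k - 1][l])
            T[k][l] = max(waits, default=0)
    return T


def total_time(t):
    """Time at which the last stage finishes the last tile."""
    T = start_times(t)
    return T[-1][-1] + t[-1][-1]


# Hypothetical 3-stage pipeline with 4 tiles; stage 1 is the slowest stage.
t = [[1, 5, 2] for _ in range(4)]
T = start_times(t)
m = 1                              # index of the maximal stage in this example
T_M = T[-1][m] + t[-1][m]          # stage m has finished every tile
T_E = sum(t[-1][m + 1:])           # later stages process only the last tile
print(total_time(t), T_M, T_E)     # prints: 23 21 2
```

In this example the stages after the maximal stage never make the last tile wait, so TE equals its lower bound and TM + TE matches the simulated total.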




Proof: We proceed by induction on ℓ ≤ ℓ′. For ℓ = ℓ′, consider Tk,ℓ′ = Tk,l. This proves
the base case. Assuming the lemma to hold for 0 < ℓ < ℓ′ consider ℓ − 1. so Sℓ−1  Sℓ.
By Theorem 3, Tk,ℓ = max{Tk,ℓ−1 + tk,ℓ−1, Tk−1,ℓ + tk−1,ℓ}. Since Sℓ−1  Sℓ, we have
tk,ℓ−1 ≥ tk−1,ℓ and by Lemma 9, Tk,ℓ−1 ≥ Tk−1,ℓ.
Therefore
Tk,ℓ = Tk,ℓ−1 + tk,ℓ−1. (5.3)
By the induction hypothesis, $T_{k,\ell'} = \sum_{v=\ell}^{\ell'-1} t_{k,v} + T_{k,\ell}$. Substituting for $T_{k,\ell}$ (from Equa-









Theorem 14. In a (2c+ 1)-stage totally ordered pipeline with maximal stage Sm and for



































tξ2−1,v from Lemma 11
5.2 Memory
In this section we study the space requirement for running SIFT on the pipeline. Recall
that a chip has three stages: the input, compute and output stages (see Figure 5.2). The input
stage receives the data from the output stage of the previous chip (if any) and holds this
information until the compute stage is ready for it. The compute stage processes this
information, possibly employing some scratch-pad memory. After computing, it
passes the result to the output stage and moves on to the next tile. Ignoring the scratch-pad
memory, the memory requirement of the chip can be estimated from the requirements
of the input and output stages.
Let φk,ℓ be the set of bits received when tile τk,ℓ is being processed by Stage Sℓ. Let ηk,ℓ be




computing tile τk,ℓ, the Stage Sℓ discards some bits say ψk,ℓ which are not required further.







Memory requirement of Stage Sℓ = max {Mk,ℓ} where 0 ≤ k < ξ2 − 1
5.3 Input Protocol
In this section we describe the two input protocols considered in this thesis.
Tile-plus-neighborhood: As mentioned earlier, in order to process a tile in the computation
stage S1, some additional data is needed in the form of an x-neighborhood of the
tile, that is, a (2x + n) × (2x + n) array of pixels with an extra width of x around the
n × n tile. In the tile-plus-neighborhood protocol this additional neighborhood information
is also sent to Stage S0 of the pipeline model.
Tile-only: In this protocol only the tiles are sent to Stage S0 of the pipeline model.
However, the amount of data that needs to be sent to Stage S0 for the very first tile
in the pipeline is the sum of all the tiles that have to be present in the chip before it can start
its computation on the first tile. So S1 may not be able to start on a tile immediately, because
the neighborhood data may not yet be available.
5.4 Architecture
In this section we briefly describe the two architectures we considered for the compute stage
of each chip.
Uniprocessor: For the uniprocessor model of computation, the standard RAM (Random
Access Machine) model is assumed. In this model a single processor on a chip executes one
instruction at a time.
Multiprocessor: For the multiprocessor model of computation, the Hierarchical Multi-level
Multicore model (HM model; see Section 8.1) is assumed. This model admits a memory hierarchy
and thus abstracts away small details without ignoring expensive memory accesses.
5.5 Pipelining Multiple Images
In this section we extend the ideas discussed so far to the processing of multiple images in
the pipeline model. So far we have considered pipelining the tiles of a single image. As we
will see in Chapter 6, some stages finish their tasks before others. We therefore consider
the processing of a stream of M images, each of size N × N. This is not significantly
different from tiling and feeding a single image, except that memory requirements do
not cross image boundaries. If M images go through the
pipeline one after the other, then the overall time taken by the pipeline to process these
images completely is given by $T_M = T_{M\xi^2-1,\,2c} + t_{M\xi^2-1,\,2c}$.
Chapter 6
Input Data Flow Requirements
In this chapter we discuss the flow of the input data required for chip C0 (the first
chip) to start its work on a tile. The subsequent flow of data through the pipeline is governed
by this. In order to process a tile of size n × n, the algorithm requires some additional
data related to the tile being processed. We discuss the number of bits
of data (tile + neighborhood) that must be delivered at stage S0 of the pipeline model
at each round of the algorithm. As noted in Section 5.3, we consider two protocols for
the flow of data: the tile-plus-neighborhood and tile-only protocols. In
Section 6.1, we discuss the input data requirements for the tile-plus-neighborhood protocol
and in Section 6.2, we deal with the input data of the tile-only protocol.
As noted earlier, the tile has $n^2$ pixels and the image is of size N × N. The neighborhood
of the tile is defined in terms of the quantity x as follows.
Let τ be the set of pixels in the tile. For any pixel π ∈ τ, let nbr(π) be the set of pixels
located within a distance x of π in all directions. That is, for pixel (i, j),
nbr(i, j) = {(k, ℓ) : i − x ≤ k ≤ i + x and j − x ≤ ℓ ≤ j + x} − {(i, j)}.
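As a small illustration of this definition, the sketch below builds nbr(i, j) as a Python set and counts the pixels in the band around a tile. The helper `tile_nbr` assumes that the neighborhood of a tile is the union of the neighborhoods of its pixels with the tile itself removed, and it ignores clipping at the image boundary; both the names and these simplifications are assumptions for the example.

```python
def nbr(i, j, x):
    """Pixels (k, l) with |k - i| <= x and |l - j| <= x, excluding (i, j)."""
    return {(k, l)
            for k in range(i - x, i + x + 1)
            for l in range(j - x, j + x + 1)} - {(i, j)}


def tile_nbr(top, left, n, x):
    """Assumed definition: union of nbr over the tile's pixels, minus the tile.

    The n x n tile has its top-left pixel at (top, left). Image boundaries
    are not taken into account in this sketch.
    """
    tile = {(top + a, left + b) for a in range(n) for b in range(n)}
    band = set()
    for (i, j) in tile:
        band |= nbr(i, j, x)
    return band - tile


print(len(nbr(5, 5, 2)))             # (2x+1)^2 - 1 = 24
print(len(tile_nbr(10, 10, 3, 2)))   # (n+2x)^2 - n^2 = 40
```

Away from the image boundary the band therefore contains (n + 2x)² − n² pixels, which is consistent with a tile and its neighborhood together occupying (n + 2x)² pixels.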








The neighborhood nbr(τ) represents the extra pixels (over and beyond τ) needed for pro-
cessing tile τ . For any tile τ , nbr(τ) is small square band of width x around τ (see Fig-






Figure 6.1: A tile and its neighborhood





and r = x mod n
6.1 Tile-Plus-Neighborhood Protocol
In this section we discuss the input data required by Stage S0 of the pipeline to process the
tile τk under the tile-plus-neighborhood protocol (see Section 5.3). In this protocol the data
brought in is the tile and the neighborhood required to process the tile.
In the following sections we describe the input data required by the two tile orderings
described in Sections 4.2 and 4.3.
Before we proceed to the different tile orderings, we illustrate a typical setting in the
algorithm. This will allow us to define a convention that will be followed. If τk is to be
processed, then all the tiles τℓ with 0 ≤ ℓ < k have already been processed, and to do
that their neighborhoods must have been brought in (the red or darkly shaded
area in Figure 6.2). This leaves just the portion shaded in yellow to be brought in during the
current round k. The singly hatched portion (region 〈C〉) represents the neighborhood of tile τk (region
〈A〉), which is doubly hatched. The previous tile in the image and its neighborhood, which
have been brought in already, are shown with a dotted line (region 〈B〉). Regions 〈F〉 and 〈E〉
show the neighborhood of the previous tile. This basic theme has many variations depending
on the position of the tile in the image and the value of x. For example if x ≫ n, then
the neighbourhood of several tiles at the right and bottom ends of the image would have










Figure 6.2: Tile and its neighborhood in the context of entire image
Note that (almost) every tile and its neighborhood consist of $(n+2x)^2$ pixels. The quantity
we address below is not the number of pixels needed to process a tile (which
is $(n+2x)^2$ pixels); rather, it is the number of pixels that need to be brought in at the start
of the round that processes the tile. We use the term pixels “input” (rather than “needed”)
for a tile to be processed. For a given tile ordering ⊚, suppose the tile in row i and column j
of the tile array is processed kth in order (that is, $\tau^{⊚}_{i,j} = \tau^{⊚}_{k}$). We will refer to the bits input
for the tile $\tau_{i,j}$, with the ordering ⊚ and hence k implied by the context.
6.1.1 Row Major Tile Ordering
In this section, we describe the data required at the stage S0 of the pipeline model for
the row major tile ordering described in Section 4.2. This is explained through a series of
lemmas that address the different cases depending on the position (i, j) of tile τi,j.
Lemma 15. For i = 0 and j = 0, the number of pixels input for tile $\tau_{i,j}$ is $(n+x)^2$.
Proof: See Figure 6.3. This is the first tile of the image. Clearly no data has been received
yet, so all the $(n+x)^2$ pixels in the tile and its neighborhood have to be brought in.
Figure 6.3: First tile of the row-major tile ordering





. Then consider tile τ(i, z). The last column of pixels of this tile
is column zn − 1 of the N × N image. The last column of the neighbourhood of τ(i, z) is





















n < N − x+ 1 < N.
This means that while tile τi,z is within the image, its neighborhood includes the last
column of pixels in the image. That is, τi,z is the last tile in row i for which an input is
required. The input requirement for the tiles after tile τi,z (τi,z+1, τi,z+2, · · · ) is zero.
Lemma 16. For r = 0 and 0 < c < z, the number of pixels input for the tile τr,c is
n(n+ x).
Proof: From Figure 6.4, the number of pixels that must be input is shown by the yellow
portion and is equal to n(n + x).
Figure 6.4: Tile in row 0 and column 0 < c < z (dimension labels: n and n+x)
Lemma 17. For 0 < r < z and c = 0, the number of pixels input for the tile τr,c is
n(n+ x).
Proof: From Figure 6.5, the number of pixels that must be input is shown by the yellow
portion and is equal to n(n + x).
Figure 6.5: Tile in column 0 and row 0 < r < z (dimension labels: n and n+x)
Lemma 18. For r = 0 and c = z, the number of pixels input for the tile τr,c is
(n− r)(n+ x).
Proof: From Figure 6.6, the part of the neighborhood of the tile that lies outside the image does
not exist. The data that needs to be brought in is shown by the region in yellow.
The area of the yellow region = length of 〈1, 3〉 × length of 〈3, 4〉. We have length of 〈3, 4〉 =












+ x − N = n − (N − q + x−N)) =
n− (x− q) = n− r.
Thus the area of the yellow region = (n− r)(n+ x)









Figure 6.6: Tile in row 0 and column z
Proof: From Figure 6.7, along the lines of the proof of Lemma 18.
Lemma 20. For 0 < r, c < z, the number of pixels input for the tile $\tau_{r,c}$ is $n^2$.
Proof: From Figure 6.8, the number of pixels that must be input is shown as the
yellow portion and is equal to $n^2$.
Lemma 21. For 0 < r < z and c = z, the number of pixels input for the tile τr,c is
n(n− r).
Proof: From Figure 6.9, the number of pixels that should be inputed is shown in yellow







Figure 6.7: Tile in column 0 and row z
Lemma 22. For 0 < c < z and r = z , the number of pixels input for the tile τr,c is
n(n− r).
Proof: From Figure 6.10, the data that needs to be brought in is represented by the region
shown in yellow and is equal to n(n − r).
Lemma 23. For r, c = z, the number of pixels input for the tile $\tau_{r,c}$ is $(n-r)^2$.
Proof: From Figure 6.11, the number of pixels that must be input is shown by the yellow
portion and is equal to $(n-r)^2$.




Figure 6.8: Tile in row 0 < r < z and column 0 < c < z
Proof:
Here the number of pixels that are needed for τr,c have already been brought in at an earlier
time.
The following theorem summarizes the above results




Figure 6.9: Tile in column z and row 0 < r < z





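\[
\begin{cases}
(n+x)^2 & \text{if } r = 0 \text{ and } c = 0;\\
n \cdot (n+x) & \text{if } r = 0 \text{ and } c < z, \text{ or } c = 0 \text{ and } r < z;\\
n^2 & \text{if } 0 < r, c < z;\\
(n-r)(n+x) & \text{if } r = 0 \text{ and } c = z, \text{ or } r = z \text{ and } c = 0;\\
n \cdot (n-r) & \text{if } r = z \text{ and } c < z, \text{ or } r < z \text{ and } c = z;\\
(n-r)^2 & \text{if } r = z \text{ and } c = z.
\end{cases}
\]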
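The case analysis above can be written compactly as a product of one per-axis width for the row and one for the column. The sketch below is a hedged illustration: the function names are hypothetical, the tile row index is called `row` to keep it apart from the remainder r = x mod n, tiles beyond index z are taken to need no input, and the value z = ξ − 1 − ⌊x/n⌋ is an assumption chosen for the example because the definition of z is not restated here.

```python
def span(idx, n, x, z):
    """Pixels newly input along one axis for tile index idx."""
    rem = x % n              # the remainder r = x mod n of the text
    if idx == 0:
        return n + x         # first tile together with its neighborhood
    if idx < z:
        return n             # interior tile: one new strip of width n
    if idx == z:
        return n - rem       # last tile that still needs input
    return 0                 # everything was received in earlier rounds


def pixels_input(row, col, n, x, z):
    """Pixels input for the tile in (row, col) as a product of axis widths."""
    return span(row, n, x, z) * span(col, n, x, z)


# Hypothetical sizes: N = 32, n = 4, x = 6, so the tile array is 8 x 8.
N, n, x = 32, 4, 6
xi = N // n
z = xi - 1 - x // n          # assumed value of z for this example
total = sum(pixels_input(r, c, n, x, z)
            for r in range(xi) for c in range(xi))
print(total == N * N)        # every pixel is input exactly once
```

With the assumed z, the per-tile counts sum to N², that is, each pixel of the image is brought in exactly once, which is a useful sanity check on the case table.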




Figure 6.10: Tile row z and in column 0 < c < z
6.1.2 Diagonal Method Tile Ordering
In this section we describe the data required at Stage S0 of the pipeline model for the
diagonal tile ordering discussed in Section 4.3. In this tile ordering, the tiles are ordered
by the value of (i + j). For two tiles $\tau_{i_1,j_1}$ and $\tau_{i_2,j_2}$ with $(i_1 + j_1) = (i_2 + j_2)$,
the tile with the lower row index is enumerated first. That is, if $i_1 < i_2$ then the rank of $\tau_{i_1,j_1}$ is
less than the rank of $\tau_{i_2,j_2}$.
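The ordering rule just described (sort by i + j, break ties by the row index i) can be sketched in a few lines of Python. The function name is hypothetical and the sketch covers a square ξ × ξ tile array only.

```python
def diagonal_order(xi):
    """Tiles (i, j) of a xi x xi tile array in diagonal order.

    Tiles are ordered by i + j; within one diagonal the tile with the
    smaller row index i comes first.
    """
    tiles = [(i, j) for i in range(xi) for j in range(xi)]
    return sorted(tiles, key=lambda tile: (tile[0] + tile[1], tile[0]))


for rank, tile in enumerate(diagonal_order(3)):
    print(rank, tile)
```

For a 3 × 3 array this enumerates (0,0), (0,1), (1,0), (0,2), (1,1), (2,0), (1,2), (2,1), (2,2), so the rank k of a tile is simply its position in the returned list.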




Figure 6.11: Tile in row z and column z
–24.
Lemma 26. For r = 0 and c = 0, the number of pixels input for tile $\tau_{r,c}$ is $(n+x)^2$.
Proof: This is the first tile of the image. Clearly no data has been received yet, so all
the $(n+x)^2$ pixels in the tile and its neighborhood have to be brought in. See Figure 6.12.
Lemma 27. For r = 0 and 0 < c < z, the number of pixels input for the tile τr,c is
n(n+ x).
Proof: See Figure 6.13





Figure 6.12: First tile of the diagonal tile ordering
Proof: See Figure 6.14
Lemma 29. For r = 0 and c = z, the number of pixels input for the tile τr,c is
(n− r)(n+ x).
Proof: From Figure 6.15, The area of yellow region = length of 〈1, 3〉 * length of 〈3, 4〉.
We have length of 〈3, 4〉= n + x. The length of 〈1, 3〉 = [length of 〈1, 5〉 - length of 〈3, 5〉]











+ x − N=n −
(N − q + x−N))=n− (x− q)=n− r.




Figure 6.13: Tiles in row 0 and column 0 < c < z
Lemma 30. For r = z and c = 0, the number of pixels input for the tile τr,c is
(n− r)(n+ x).
Proof: See Figure 6.16.
Lemma 31. For 0 < r, c < z, the number of pixels input for the tile $\tau_{r,c}$ is $n^2$.
Proof: See Figure 6.17.
Lemma 32. For 0 < r < z and c = z, the number of pixels input for the tile τr,c is
n(n− r).




Figure 6.14: Tiles in column 0 and row 0 < r < z
Lemma 33. For 0 < c < z and r = z , the number of pixels input for the tile τr,c is
n(n− r).
Proof: From Figure 6.19.
Lemma 34. For r, c = z, the number of pixels input for the tile τr,c is (n− r)2.
Proof: See Figure 6.20.
Lemma 35. For r, c > z, the number of pixels input for the tile τr,c is 0.
Proof: In this case the pixels that would be sent as input in the current round
have already been brought in at an earlier time. So the number of pixels input in this round
is zero.









Figure 6.15: Tile in row 0 and column z
Theorem 36. The number of pixels input for the tile τi,j at stage S0 of the pipeline in





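\[
\begin{cases}
(n+x)^2 & \text{if } r = 0 \text{ and } c = 0;\\
n \cdot (n+x) & \text{if } r = 0 \text{ and } c < z, \text{ or } c = 0 \text{ and } r < z;\\
n^2 & \text{if } 0 < r, c < z;\\
(n-r)(n+x) & \text{if } r = 0 \text{ and } c = z, \text{ or } r = z \text{ and } c = 0;\\
n \cdot (n-r) & \text{if } r = z \text{ and } c < z, \text{ or } r < z \text{ and } c = z;\\
(n-r)^2 & \text{if } r = z \text{ and } c = z.
\end{cases}
\]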









Figure 6.16: Tile in column 0 and row z
6.2 Tile-Only Protocol
So far we assumed that the input to Stage S0 is in the form of an n × n tile together with its
neighborhood of width x, forming a (2x + n) × (2x + n) array of pixels. Without
this neighborhood a tile cannot be processed. What would happen if the input received
was only the tile (without its neighborhood)? This situation could happen when n is
very small relative to x. We call such an input a “tile-only” input, as opposed to the
“tile-plus-neighborhood” input.
Now we consider the time tk,0 if tile-only inputs are used. Before we proceed, we clarify
the meanings of the quantities tk,ℓ and Tk,ℓ. We defined Tk,ℓ as the earliest starting time
of Stage Sℓ to work on tile τk and tk,ℓ to be the time taken by Stage Sℓ to process tile τk.
We also derived the equation $T_{k,\ell} = \max\{T_{k,\ell-1} + t_{k,\ell-1},\; T_{k-1,\ell} + t_{k-1,\ell}\}$. In the context




Figure 6.17: Tile in row 0 < r < z and column 0 < c < z
the equation. The equation implies that if Sℓ is free, then Sℓ should be able to start on τk
immediately after Tk,ℓ−1+ tk,ℓ−1 time. For the tile-only input if tk,ℓ is the time taken to get
any tile τk to Stage S0, then the above meaning of Tk,1 would not hold. That is, Sℓ may not
be able to start on τk just because S0 has received τk. Stage S0 must also receive all tiles
that includes the neighborhood of τk, before S1 can start on τk. Thus tk,ℓ is interpreted as
the time needed for Stage Sℓ to bring all the information needed for Stage Sℓ+1 to start





Figure 6.18: Tile in column z and row 0 < r < z
6.2.1 Row Major Tile Ordering - Tile Only
Figure 6.21 shows tile τ0 (or tile τ0,0) and a neighborhood of width x around it. All pixels of
the neighborhood will be received only after tile τ12 has been received at Stage S0. So
Stage S1 cannot start on tile τ0 until tile τ12 has been received completely. In






) has been received completely. This means that $\tau_{0,0}$’s computation will
not start until tile $\tau_{\lceil x/n \rceil, \lceil x/n \rceil} = \tau_{\delta,\delta} = \tau_{\delta(\xi+1)}$ is received. Until this time Stage S1 idles. So
the amount of data that needs to be received by Stage S0 for Stage S1 to process tile $\tau_0$
equals the size of all tiles from $\tau_0$ to $\tau_{\delta(\xi+1)}$. As each tile is of size $n^2$ pixels, the amount of
data needed to process tile $\tau_0$ is $[\delta(\xi+1)+1]n^2$. However the amount of data needed to be
received by Stage S0 to process the tile $\tau_k$ for all k > 0 is only $n^2$ pixels. This is because the
neighborhood required to process these tiles has already been brought in during the previous
iterations. In the example of Figure 6.21, 15 tiles (shaded) need to be brought in before S1
can process tile $\tau_0$.
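The start-up cost of the tile-only protocol under row-major ordering follows directly from the expression [δ(ξ + 1) + 1]n² with δ = ⌈x/n⌉. The following sketch evaluates it; the function names are hypothetical, and the sample sizes (n = 4, x = 8, ξ = 6) are assumptions chosen so that δ = 2, which reproduces the count of 15 tiles quoted for the example above.

```python
def delta(n, x):
    """Number of tile widths covered by the neighborhood: ceil(x / n)."""
    return -(-x // n)        # integer ceiling without floating point


def startup_tiles_row_major(n, x, xi):
    """Tiles that must arrive before S1 can start on tile 0 (row-major)."""
    d = delta(n, x)
    return d * (xi + 1) + 1


def pixels_input_tile_only(k, n, x, xi):
    """Pixels attributed to round k: the start-up burst, then one tile each."""
    if k == 0:
        return startup_tiles_row_major(n, x, xi) * n * n
    return n * n


print(startup_tiles_row_major(4, 8, 6))      # 15 tiles for delta = 2, xi = 6
print(pixels_input_tile_only(0, 4, 8, 6))    # 15 * 16 = 240 pixels
print(pixels_input_tile_only(1, 4, 8, 6))    # 16 pixels
```

Because ξ = N/n appears in the start-up term, the burst grows with the image width, which is the dependence on N discussed for this ordering.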




Figure 6.19: Tile in row z and column 0 < c < z





\[
\begin{cases}
[\delta(\xi+1)+1]\,n^2 & \text{if } r, c = 0 \text{ or } k = 0;\\
n^2 & \text{if } r, c > 0 \text{ or } k > 0.
\end{cases}
\]




Figure 6.20: Tile in row z and column z
6.2.2 Diagonal Tile Ordering - Tile Only
Figure 6.21 shows tile τ0 (or tile τ0,0) and a neighborhood of width x around it. All pixels of the
neighborhood will be received only after the tile τ12 has been received at Stage S0. So
Stage S1 cannot start work on tile τ0 until tile τ12 has been received completely.
In general tile $\tau_{i,j}$’s computation will not start on Stage S1 until tile $\tau_{\lceil x/n \rceil, \lceil x/n \rceil} = \tau_{\delta,\delta}$





) has been received completely. As each tile is of size $n^2$ pixels,
the amount of data needed to process tile $\tau_0$ is $[\delta(2\delta+1)+\delta+1]n^2$. For the example
shown in Figure 6.22 the 12 shaded tiles need to be received by S1 before it can
process tile $\tau_0$. However the amount of data needed to be received by Stage S0 to process
the tile $\tau_{i,j}$ for all i = 0 and j > 0 is only $(2\delta+1)n^2$ pixels. The amount of data needed to be
received by Stage S0 to process the remaining tiles is $n^2$. This is because the neighborhood
[Figure: 5 × 6 tile array with tiles numbered 0 to 29 in row-major order; dimension labels: n+x, n+x]
Figure 6.21: Tile only input for row major ordering
required to process these tiles has already been brought in, in the previous iterations.
Theorem 38. The number of pixels input for the tile τr,c = τk at Stage S0 of the pipeline





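\[
\begin{cases}
[\delta(2\delta+1)+\delta+1]\,n^2 & \text{if } r, c = 0 \text{ or } k = 0;\\
(2\delta+1)\,n^2 & \text{if } r, c > 0 \text{ or } k > 0;\\
n^2 & \text{if } r, c > 0;\\
0 & \text{if } k > (\xi^2-1) - [\delta(2\delta+1)+\delta].
\end{cases}
\]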
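The start-up counts for the two orderings can be cross-checked by ranking tile (δ, δ) explicitly in each order. The sketch below is illustrative only: the function and key names are hypothetical, and it assumes a square ξ × ξ tile array with ξ > 2δ so that no diagonal up to the one containing (δ, δ) is cut short by the array boundary.

```python
def startup_tiles(xi, d, key):
    """Tiles received up to and including tile (d, d) under a given order."""
    order = sorted(((i, j) for i in range(xi) for j in range(xi)), key=key)
    return order.index((d, d)) + 1


def row_major_key(tile):
    return (tile[0], tile[1])


def diagonal_key(tile):
    # Order by i + j, then by the row index i, as in Section 6.1.2.
    return (tile[0] + tile[1], tile[0])


d = 2
for xi in (6, 12):
    print(xi,
          startup_tiles(xi, d, row_major_key),   # d*(xi+1)+1
          startup_tiles(xi, d, diagonal_key))    # d*(2d+1)+d+1
```

For δ = 2 the row-major count grows from 15 to 27 as ξ goes from 6 to 12, while the diagonal count stays at 13, matching the closed forms δ(ξ + 1) + 1 and δ(2δ + 1) + δ + 1 under the stated assumption.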
Notice that $t^{d}_{0,0}$ here is independent of N. In contrast, $t_{0,0} \ge xN$ for the row major
ordering. Considering that among the times $t_{k,0}$, $t_{0,0}$ is the only one that matters for the
computationally bottlenecked case (see Chapter 8), the diagonal ordering makes a big

































In this chapter we study the performance of the SIFT algorithm in terms of its time
complexity and memory requirement on a single-chip processing pipeline (see Figure 7.1),
where the computing platform consists of a single CPU. This is the simplest case and its
study lays the foundation for the single-chip multicore case (Chapter 8) and the two-chip
pipeline (Chapter 9). It also provides a basis to deal with an n-chip pipeline. As discussed






Figure 7.1: The 3-stage pipeline model (stages S0, S1, S2)
As discussed in Chapter 5, the single-chip model results in a 3-stage pipeline consisting
of the input stage, compute stage and output stage. These stages are denoted by S0, S1
and S2 according to the notation discussed in Section 5.1. The tiles enter the pipeline at
Stage S0, the processor performs its computation in Stage S1 and the output is delivered
through Stage S2.
7.1 Running Time on a 3-Stage Pipeline
Recall that the given image of size N × N is decomposed into a ξ × ξ tile array (where
ξ = N/n), so there are $\xi^2$ tiles numbered as 0, 1, · · · , $\xi^2 - 1$ in the image. In Section 5.1.1,
we stated that in a (2c+1)-stage totally ordered pipeline with maximal stage Sm and for











Now we use this to derive the expression for the overall time of the pipeline. For 0 ≤ k < ξ2,
the time to process tile τk in stage Sℓ is tk,ℓ (tk,0, tk,1 and tk,2 in the 3-stage pipeline). The
overall running time depends on where the maximal stage is located in the pipeline.


























+ tξ2−1,1 + tξ2−1,2
Case 3 S0  S1  S2: In this case the maximal stage is stage S1. Again from Equa-
tion (7.1)








Case 4 S0  S1  S2: Here either S0 or S2 is maximal. So either Case 1 or Case 2 applies.
We now consider the time for the stages.
7.2 Time Complexity of Stages
In this section we derive the time complexities of the stages in the 3-stage pipeline, namely
the input, compute and output stages. As mentioned in Section 5.1.1, the time taken
by Stage Sℓ to complete its processing of tile τk is denoted by tk,ℓ.
Input Stage Time Complexity: The input stage time complexity is the time (tk,0)
taken by Stage S0 to make the received data for tile τk available to the next stage S1. In
Chapter 6, we discussed details of the amount of input data received at stage S0 for the
two tile orderings. Let pi be the number of input pins to the chip at stage S0, let b be the
number of bits in each pixel of the image and let Γ0 be the clock rate for Stage S0. Let
⊚ be the tile ordering used (where ⊚ ∈ {R, D}) and let there be $\eta^{⊚}_{k}$ pixels coming into
S0 at iteration k. Then, the time taken by Stage S0 to receive the input data and make it







For the worst case $\eta^{⊚}_{k} = (n+x)^2$ (see Theorem 25). So $t_{k,0} \le$
⌈




Compute Stage Time Complexity The compute stage complexity is the time taken
by Stage S1 to extract features from the tile received from the previous stage and deliver
features to the next output stage. This time for tile τk is denoted by tk,1. The computations
performed by Stage S1 includes the phases Gaussian blurring, difference of Gaussians,
extrema detection, potential keypoints detection and generation of keypoint descriptors
(explained in Sections 2.2 – 2.5). Table 2.1 shows the time complexities and the number of
operations required by each phase of the SIFT algorithm for an N2-pixel image. Applying
these to an n2-pixel tile and then substituting the values of α = 0.6%, β = 35% and
γ = 0.04% values (from Section 3.3) gives the time complexity (number of operations) of
the computation stage as shown in Table 7.1.
Therefore with Γ1 as the clock rate for the computation stage,
\begin{align*}
t_{k,1} &= (4n^2w^2s + 4n^2s + 104sn^2 + 0.6sn^2 + 48sn^2 + 12.03x^2n^2) \cdot b \cdot \Gamma_1\\
&\approx (4w^2s + 157s + 12.03x^2)\,n^2 \cdot b \cdot \Gamma_1.
\end{align*}
The $12.03x^2n^2$ term is the dominating term.
Table 7.1: The number of SIFT operations for an n²-pixel tile

Phase                              Number of operations
---------------------------------  --------------------
Gaussian Blurring                  4n²w²s
Difference of Gaussian             4n²s
Scale-space Extrema Detection      104sn²
Keypoint Detection                 0.6sn²
Orientation Assignment             48sn²
Keypoint Descriptor Generation     12.03x²n²
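To make the relative weight of the phases in Table 7.1 concrete, the following sketch evaluates each entry for given n, w, s and x. The function and dictionary key names are hypothetical; the coefficients are copied from the table, and the sample values w = 3, s = 2, x = 8 are the nominal parameter values used later in this chapter.

```python
import math


def sift_ops_per_tile(n, w, s, x):
    """Operation counts per phase for one n x n tile (entries of Table 7.1)."""
    return {
        "gaussian_blurring": 4 * n**2 * w**2 * s,
        "difference_of_gaussian": 4 * n**2 * s,
        "scale_space_extrema_detection": 104 * s * n**2,
        "keypoint_detection": 0.6 * s * n**2,
        "orientation_assignment": 48 * s * n**2,
        "keypoint_descriptor_generation": 12.03 * x**2 * n**2,
    }


ops = sift_ops_per_tile(n=1, w=3, s=2, x=8)    # n = 1 gives per-pixel counts
total = sum(ops.values())
dominant = max(ops, key=ops.get)
print(round(total, 2), dominant)
```

With these values the per-pixel total comes to roughly 1155 operations, close to the figure of about 1154 used in the running-time estimate, and the descriptor generation phase contributes about two thirds of it.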
Output Stage Time Complexity The output stage time complexity, tk,2 (for τk) is the
time taken by Stage S2 to output the tile features. The size of each feature is (2 log x+ 1)
bits. If Γ2 represents clock rate of the output stage, po denotes the number of output pins of
the chip, then the time taken by stage S2 is nominally
⌈








In general b = 32 bits , x ∼= 8, w ∼= 3, s ∼= 2 as used in Lowe’s [13] algorithm. Here we




















$n^2 \cdot b \cdot \Gamma_1 \cong 1154\,n^2 \cdot 32 \cdot \Gamma_1 = 36928\,n^2 \cdot \Gamma_1$
tk,2 =
⌈












So the case of S0  S1  S2 applies. The time complexity of running the SIFT algorithm
on a 1-chip (3-stage) uniprocessor pipeline is (from Equation (7.1))
















. The last term is independent of n, so to
reduce the overall time n can be selected to be as small as possible. While maintaining the
relationship S0  S1  S2, we could also adjust pi, Γ0 and po, Γ2. Clearly pi, po ≥ 1; they
are likely to be much larger, since modern chips are available with around 1000 pins.
For example, suppose $N^2 = 10^6$ for a 1000 × 1000 image. The total time the algorithm spends
on the stages is about $\frac{10^6 \times 32 \cdot \Gamma_0}{320} = 100000\,\Gamma_0$ for Stage S0, about $1154 \times 10^6\,\Gamma_1$
for Stage S1 and about $1500\,\Gamma_2$ for Stage S2. By slowing Γ0 and Γ2 relative to
Γ1, power savings are possible. So one could make Γ0 11540 times slower than Γ1 and Γ2
$769 \times 10^3$ times slower than Γ1 without changing the maximal stage. Of course this would
increase the times $t_{0,0}$ and $t_{\xi^2-1,2}$.
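The slack factors quoted in this example follow from simple ratios of the per-stage totals. The sketch below redoes the arithmetic; the variable names are hypothetical, and the value of 320 input pins is read off the denominator in the expression for Stage S0 rather than stated independently.

```python
def stage_totals(num_pixels, bits_per_pixel, input_pins,
                 compute_ops_per_pixel, output_cycles):
    """Per-stage totals, each in units of that stage's own clock period."""
    input_cycles = num_pixels * bits_per_pixel // input_pins
    compute_cycles = compute_ops_per_pixel * num_pixels
    return input_cycles, compute_cycles, output_cycles


# Values taken from the example: N^2 = 10^6, b = 32, 320 input pins (assumed
# from the denominator), about 1154 units per pixel in S1, about 1500 in S2.
s0, s1, s2 = stage_totals(10**6, 32, 320, 1154, 1500)

slack_input = s1 // s0     # how much slower the input clock could run
slack_output = s1 // s2    # how much slower the output clock could run
print(s0, s1, s2)
print(slack_input, slack_output)
```

This gives 100000, 1154 × 10⁶ and 1500 for the three stages, and slack factors of 11540 and about 769 × 10³, in agreement with the figures in the text.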
This may be significant if the input and output stages work over a noisy (say wireless)
channel. The clock rates Γ0, Γ2 can be adjusted to be substantially smaller than Γ1. This
decreased clock rate reduces the bit error rate (BER) of the input. The reduced clock also
reduces power. One could also increase the redundancy in the input to further reduce the BER.
Many intermediate choices are possible as well, whose benefits do not come at the expense of
speed.
In Section 5.5, we discussed the processing of multiple images in the pipeline model. Even if X
images are processed one after the other (that is, $X\xi^2$ tiles), S1 is still the maximal stage








Theorem 39. In a 3-stage totally ordered pipeline, for all 0 ≤ k ≤ ξ2 − 1, S1 is being the




$\dfrac{(n+x)^2 \cdot b \cdot \Gamma_0}{p_i} + \dfrac{\log x \cdot \Gamma_2}{p_o} + XN^2 \cdot b \cdot \Gamma_1$ )
7.3 Memory Requirement on a 3-Stage Pipeline
The amount of memory required in the chip is discussed in Section 5.2. Since the output
of SIFT is considerably smaller than the input, the memory requirement is driven by the
input stage. So here the memory requirement is the difference between the amount of data
received so far and the data that is not required further to process the remaining tiles. We
use the notation Mk for the memory requirement, Uk for the data that has been received so far,
and Vk for the data that is not needed further (see Section 5.2). Now we consider the two
input orderings to determine the memory requirement.
7.3.1 Row Major Ordering
Figure 7.2 shows a tile (bold square) and its neighborhood (hatched). The total area of
the image that has been brought in so far is shaded (pink or red). Of this, the portion
shaded light (pink) (Area B) is required for the current tile and future tiles. The portion
shaded dark (red) (Area A) may be discarded. The memory requirement at this point of
the algorithm is Area B = total shaded area − Area A. The determination of the size of
this area is done with the help of Figures 7.3, 7.4.






Figure 7.2: Total data received
the current tile τk = τi,j (on tile array). The figure shows the four regions labeled 1, 2, 3, 4.
Let 〈ℓk〉 denote the area of the region ℓ for tile k, where the context is clear, we will omit
















































































Figure 7.3: Regions of total data received
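processed, $U_k$, is the sum of the areas of regions 1, 2, 3, 4. That is,
\begin{align*}
U_k &= \langle 1 \rangle + \langle 2 \rangle + \langle 3 \rangle + \langle 4 \rangle = inN + (j+1)n \cdot n + (ax + bx) + nx\\
&= inN + jn^2 + n^2 + Nx + nx \qquad \text{since } a + b = N.
\end{align*}
So we have
\[
U_k = inN + jn^2 + n^2 + Nx + nx. \tag{7.3}
\]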
Figure 7.4 shows the amount of data that is not needed further to process the remaining
tiles after the tile τk = τi,j. This is equal to the sum of the areas of regions 〈1〉, 〈2〉.
Vk = 〈1〉+ 〈2〉 = (in− x)N + (jn− x)n




























Figure 7.4: Data that is not needed further
Thus we have
\[
V_k = inN - Nx + jn^2 - nx. \tag{7.4}
\]
The amount of memory needed to process the remaining tiles is given by Equation (7.3) − Equation (7.4). Thus
\[
M_k = U_k - V_k = n^2 + 2nx + 2Nx.
\]
Notice that $M_k$ is independent of k. Except for the very beginning and end of the image,
the algorithm will require $M_k = n^2 + 2nx + 2Nx$ of memory to store pixels to be used at
a later time.
Theorem 40. The amount of memory required to run the SIFT algorithm on a 1-chip
(3-stage) uniprocessor pipeline for row major ordering is n2 + 2nx+ 2Nx.
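The independence of the memory requirement from the tile position can be confirmed by evaluating Uk and Vk directly. In the sketch below the function names are hypothetical; `received` follows Equation (7.3) and `not_needed` follows the region sum Vk = (in − x)N + (jn − x)n given above, and the sample sizes are assumptions for illustration.

```python
def received(i, j, n, N, x):
    """U_k: data received so far when tile (i, j) is processed, Eq. (7.3)."""
    return i * n * N + j * n * n + n * n + N * x + n * x


def not_needed(i, j, n, N, x):
    """V_k: data no longer needed, V_k = (i*n - x)*N + (j*n - x)*n."""
    return (i * n - x) * N + (j * n - x) * n


def memory_row_major(i, j, n, N, x):
    """M_k = U_k - V_k for the row-major ordering."""
    return received(i, j, n, N, x) - not_needed(i, j, n, N, x)


n, N, x = 16, 1024, 8      # hypothetical tile size, image width, neighborhood
for (i, j) in [(3, 5), (10, 2), (40, 40)]:
    print(memory_row_major(i, j, n, N, x), n * n + 2 * n * x + 2 * N * x)
```

Each line prints the same value twice (16896 for these sizes), showing that the difference reduces to n² + 2nx + 2Nx regardless of i and j away from the image boundary.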
7.4 Diagonal Ordering
In this section we study the amount of memory required in the chip. It is the difference
between the amount of data received so far and the data that is not required further
to process the remaining tiles. In Figure 7.5, the total area of the image that has been
brought in so far is shaded (pink or red). Of this, the portion shaded light (pink) (Area
B) is required for the current tile and future tiles. The portion shaded dark (red) (Area
A) may be discarded. The memory requirement at this point of the algorithm is Area B =
total shaded area − Area A. The evaluation of the size of this memory needed is calculated




Figure 7.5: Memory requirement for Diagonal Ordering
Figure 7.9 shows the total amount of data that has been received so far while processing
the current tile τk = τi,j (on the tile array). The figure shows the two regions labeled 1, 2. Let
〈ℓk〉 denote the area of the region ℓ for tile k; where the context is clear, we will omit k and
simply call this area 〈ℓ〉.
We now define some additional quantities and develop some intermediate results.
Definition 3. A series in rectilinear space is a contiguous set of adjacent horizontal and
vertical lines. If L0, L1, L2, · · · , Lu−1 are adjacent lines, the rectilinear series formed by these lines
is denoted by L, where L = 〈L0, L1, L2, · · · , Lu−1〉.





Figure 7.6: L = 〈L0, L1, L2, L3, L4〉
Definition 4. An x-border of a series L in rectilinear space is the shape defined by two






Figure 7.7: x-border of L
Lemma 41. Let L = 〈L0, L1, L2 · · ·Lu−1〉 be a series in rectilinear space whose total length
is ℓ. Then for sufficiently small x > 0, the area of an x-border of L is 2xℓ.
Proof: We proceed by induction on u. For u = 1, there is only one line segment, whose
length is ℓ, and the area of the x-border is 2x × ℓ = 2xℓ. Assuming the lemma holds for
some u ≥ 1, consider a series of u + 1 line segments, that is, L = 〈L0, L1, L2, · · · , Lu−1, Lu〉. Let the series
〈L0, L1, L2, · · · , Lu−1〉 have length ℓ′ and the series 〈Lu〉 have length ℓ′′, so that
ℓ = ℓ′ + ℓ′′.
The area of the x-border of the series 〈L0, L1, L2, · · · , Lu−1〉 is 2xℓ′ (from the induction hypothesis).





























































Figure 7.8: Area details of x-border of L
2xℓ′′ (from Figure 7.8). Overall area of the x-border of line segment
L = 〈L0, L1, L2 · · ·Lu−1, Lu〉 = 2xℓ′ + 2xℓ′′ = 2x(ℓ′ + ℓ′′) = 2xℓ.
For tile τi,j (located anywhere in the tile array), the memory requirement (which is proportional
to Area 2 in Figure 7.10) depends on 2(i + j)n. The value of (i + j) is maximum when
the tile τi,j is located on the diagonal of the tile array, and here (i + j) = 2(ξ − 1). The
amount of memory required is also maximum for the tiles that are located on the primary
diagonal of the tile array, because the amount of data brought in is high for these
tiles when compared to the remaining tiles. Thus we use this value of (i + j) = 2(ξ − 1) in
the determination of the memory requirement for the diagonal tile ordering.
Uk = 〈1〉+ 〈2〉 = (k + 1)n2 + 2ℓx from Lemma 41

















































































































Figure 7.9: The memory requirement for Diagonal Ordering
Thus
Total amount of data received = (k + 1)n2 + 2 (ξ − 1)nx (7.5)
In Figure 7.10, the amount of data that is not needed further Vk is shown by the region 〈1〉.
\[
V_k = \{\langle 1 \rangle + \langle 2 \rangle + \langle 3 \rangle\} - \{\langle 2 \rangle + \langle 3 \rangle\} = (k+1)n^2 - \{2(\xi-1)nx + n^2\}.
\]
Thus
\[
V_k = (k+1)n^2 - 2(\xi-1)nx - n^2. \tag{7.6}
\]














































































































Figure 7.10: The memory requirement for Diagonal Ordering
Equation (7.4).
$M_k = U_k - V_k = (k+1)n^2 + 2(\xi-1)nx - \left((k+1)n^2 - 2(\xi-1)nx - n^2\right)$






nx+ n2 = 4Nx− 4nx+ n2
Theorem 42. The amount of memory required to run the SIFT algorithm on a 1-chip
(3-stage) uniprocessor pipeline for the diagonal ordering is 4Nx− 4nx+ n2.
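The memory bound for the diagonal ordering can be checked in the same way, using Equation (7.5) for the data received and Equation (7.6) for the data no longer needed, together with ξ = N/n. The function names and sample sizes below are assumptions for illustration; the row-major figure n² + 2nx + 2Nx is included only for comparison.

```python
def memory_diagonal(k, n, N, x):
    """M_k = U_k - V_k with U_k from Eq. (7.5) and V_k from Eq. (7.6)."""
    xi = N // n                                    # tiles per row, xi = N / n
    received = (k + 1) * n * n + 2 * (xi - 1) * n * x
    not_needed = (k + 1) * n * n - 2 * (xi - 1) * n * x - n * n
    return received - not_needed


def memory_row_major_bound(n, N, x):
    """Row-major requirement n^2 + 2nx + 2Nx, for comparison."""
    return n * n + 2 * n * x + 2 * N * x


n, N, x = 16, 1024, 8      # hypothetical sizes; n divides N
for k in (0, 100, 4000):
    print(memory_diagonal(k, n, N, x), 4 * N * x - 4 * n * x + n * n)
print(memory_row_major_bound(n, N, x))
```

The difference is 4Nx − 4nx + n² for every k (32512 for these sizes), compared with 16896 for the row-major ordering, so under these formulas the diagonal ordering needs roughly twice the memory when N is much larger than n.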
7.5 Tile Only Input
In this chapter we so far considered the tile-plus-neighborhood input protocol, where the
input sends the tile and its neighborhood one after the other. We now examine the time T
to run the SIFT on an image on a 1-chip uniprocessor pipeline for tile-only input protocol.
7.5.1 Row Major Ordering
As mentioned earlier in Section 6.2.1, the number of input pixels coming into
Stage S0 of the pipeline for $\tau_0$ is $\eta^{R}_{k} = [\delta(\xi+1)+1]n^2$. From Section 7.2, the input time








= Θ(xN). Clearly, the value of t0,0
depends on the value of N (as ξ = N/n), whereas the compute stage complexity is proportional
to $n^2$. So t0,0 ≥ tk,1.
Even though t0,0 ≥ tk,1, it is clear that once S1 starts processing τ0, S1 never has to
stop. So t0,0 = Θ(xN), which could be significantly larger than tk,1 = Θ($n^2$).
However from Theorem 37, for tile-only input tk,0 ≤ tk,1 for all k > 0. The condition for
S0  S1 requires that for all k ≥ 0, Tk,1 + tk,1 ≥ Tk+1,0 + tk+1,0. Thus t0,0 is not required to
be ≤ tk,1. That is, S0  S1 and Theorem 14 holds. Therefore we have
Theorem 43. For all 0 ≤ k ≤ ξ2 − 1, the total time to run the SIFT algorithm on a
1-chip (3-stage) pipeline on an image of size N ×N using row major ordering and tile-only








Remark: Notice that the tile-only contribution of time of Stage S0 is due to t0,0. Because
t0,0 in this case is large there is a significant degradation in time even though tk,0 is small
for all k > 0.
7.5.2 Diagonal Ordering
As mentioned earlier in Section 6.2.2, the number of input pixels coming into
Stage S0 of the pipeline for τ0 is $\eta^D_k = [\delta(2\delta+1)+\delta+1]n^2$. From Section 7.2, the input
time is $t_{0,0} = \Theta(x^2)$. Here the value of $t_{0,0}$ does not depend on the value of N.
Here S0  S1, so the time complexity of the entire algorithm is better than for the row
major ordering case.
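The difference between the two orderings can be seen by evaluating the two pixel counts for the first tile. The following is a sketch only: it assumes δ = x/n (δ is defined earlier in the thesis and not restated here), and the function names and sample sizes are hypothetical.

```python
from fractions import Fraction


def first_tile_pixels_row_major(N, n, x):
    """[delta(xi + 1) + 1] n^2, with delta = x/n assumed and xi = N/n."""
    delta = Fraction(x, n)
    xi = Fraction(N, n)
    return (delta * (xi + 1) + 1) * n**2


def first_tile_pixels_diagonal(n, x):
    """[delta(2 delta + 1) + delta + 1] n^2, with delta = x/n assumed."""
    delta = Fraction(x, n)
    return (delta * (2 * delta + 1) + delta + 1) * n**2


# Hypothetical sizes: 64 x 64 tiles, neighborhood x = 8.
print(first_tile_pixels_row_major(1024, 64, 8))  # grows with N
print(first_tile_pixels_row_major(2048, 64, 8))
print(first_tile_pixels_diagonal(64, 8))         # independent of N
```

Under the δ = x/n assumption the row major count expands to xN + xn + n^2, while the diagonal count expands to 2x^2 + 2xn + n^2, which contains no N term.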
Theorem 44. For all 0 ≤ k ≤ ξ2−1, the total time to run the SIFT algorithm on a 1-chip
(3-stage) pipeline on an image of size N ×N using the diagonal ordering and the tile-only












In this chapter we study the performance of the SIFT algorithm, again on a 1-chip (3-stage)
pipeline, but this time with a multicore computing platform. We model the multicore
platform using the hierarchical multi-level-caching model (HM model) [4].
In the next section we briefly describe the HM model. In Section 8.2 we discuss the mapping
of tiles to the cores and in Section 8.3 we derive the expressions for the time to run SIFT on a
multicore platform.
8.1 The Hierarchical Multi-Level-Caching (HM) Model
Modern multicore architectures use multiple processor cores that share a memory hierarchy
(for example, the Intel Xeon Multiprocessor [9]). The Hierarchical Multi-Level-Caching
model (HM model) [4] captures this structure in a general architecture that abstracts away
details of particular chips and interconnects, but without ignoring memory access costs.
Let the chip have P > 1 processor cores numbered 0, 1, 2, · · · , P − 1. Let the chip
contain h levels of caches, numbered 1, 2, · · · , h (see Figure 8.1). For 1 ≤ i ≤ h, an
Li-cache refers to a Level-i cache. While a full description of the cache structure appears in
Chowdhury et al. [4], we detail only the relevant parts here. The caches are arranged in a
tree-like hierarchy. Let the number of $L_{i-1}$-caches connected to an Li-cache be $s_i$; if i = 1
then the $L_{i-1}$-cache is replaced by a processor core. As in Chowdhury et al., we assume
that $s_1 = 1$; that is, each processor core has its own private L1-cache.























Figure 8.1: Hierarchical Multi-Level-Cache (HM) Model
The last-level Lh-cache interfaces to the input/output system outside the chip. For any i < h, the number
of Li-cache modules is $R_i = s_h s_{h-1} \cdots s_{i+1}$. The shadow of a cache is the number of
processor cores under it in the hierarchy. There is $s_1 = 1$ core under the shadow of each
L1-cache, $s_1 s_2$ cores under the shadow of each L2-cache, $s_1 s_2 s_3$ cores under the shadow of
each L3-cache, and so on. Let $Q_i = s_1 s_2 \cdots s_i$ denote the size of the shadow of an Li-cache. In this
notation $R_i Q_i = P$, the number of cores. Let each Li-cache have a line size of $B_i$ pixels;
we express this in pixels to facilitate our analysis. The actual line size can be obtained by
multiplying $B_i$ by the size of a pixel. If $s_i \le B_i / B_{i-1}$ for all levels i, we will call the hierarchy an inclusive
large-line cache; that is, a line from each of the $s_i$ subcaches at the $L_{i-1}$ level will fit in a single
Li-cache line. If $s_i = s$ for all i, we will call the hierarchy uniform. In subsequent discussion
we will consider hierarchies that are inclusive large-line and uniform.
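The quantities $Q_i$, $R_i$ and the inclusive large-line condition can be computed directly from the fan-outs $s_i$ and line sizes $B_i$. The sketch below is illustrative: the list layout, the function names and the example hierarchy are assumptions, and the large-line test is applied only for i ≥ 2 because $B_0$ is not defined in the text.

```python
from math import prod


def shadow_sizes(s):
    """Q_i = s_1 s_2 ... s_i for i = 1..h, where s[i - 1] holds s_i."""
    return [prod(s[:i]) for i in range(1, len(s) + 1)]


def module_counts(s):
    """R_i = s_{i+1} ... s_h for i = 1..h (R_h is the empty product, 1)."""
    return [prod(s[i:]) for i in range(1, len(s) + 1)]


def is_inclusive_large_line(s, B):
    """Check s_i <= B_i / B_{i-1} for i = 2..h, where B[i - 1] holds B_i."""
    return all(s[i] * B[i - 1] <= B[i] for i in range(1, len(s)))


# Hypothetical 3-level hierarchy with P = 8 cores.
s = [1, 2, 4]      # s_1 = 1: each core has a private L1-cache
B = [8, 16, 64]    # line sizes in pixels
Q, R = shadow_sizes(s), module_counts(s)
assert all(q * r == prod(s) for q, r in zip(Q, R))  # R_i Q_i = P at every level
```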
Each processor core u (0 ≤ u < P ) accesses data as follows. It first looks for the data in
its L1-cache. If the data is found, then there is no penalty on the access. Otherwise, there
is an L1 miss. This causes an L2 access. An L2 miss causes an L3 access and so on. For
our application (as we will show later), we consider accesses with misses all the way to the
Lh-cache. When an Li miss triggers an Li+1 access, this access competes with all other Li
misses under that Li+1 module. These accesses to the Li+1 module are all sequential, so
the complexity of these accesses is the number of accesses to the Li+1 module. However
different Li+1-cache modules can be accessed in parallel. Thus the cache complexity can be
viewed as the maximum number of accesses to any Li-cache module, summed over all levels.
More specifically, if $A_i$ is the maximum number of accesses to an Li-cache module, then
the cache complexity is $\sum_{i=1}^{h} A_i$. We also note that the number of accesses to an Li-cache
equals the number of misses at all $s_i$ subcaches at level i − 1.
8.2 Mapping Tile Data to Cores
As described in Section 8.1, the last level Lh-cache interfaces to outside the chip. Thus
in our 3-stage pipeline, the input stage brings in pixels to the Lh-cache. These pixels are
ultimately mapped to particular processor cores. In this section we describe this mapping.
Let n/P ≥ 1 be an integer, where n × n is the tile size. Divide the n × n tile into P subtiles,
each of size n × n/P. The uth subtile is mapped to core u (where 0 ≤ u < P) (see Figures 8.2,




= u. Since n
P






= u. We will refer to data mapped to core u as local data. As data arrives
in the Lh-cache, it is first moved to the L1-caches of the appropriate cores (recall
that each core has a private L1-cache). Subsequent references to data that is local to a
different core will cause cache misses.
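The mapping of subtiles to cores can be sketched as follows. This is an assumption-laden illustration: it takes the n × n/P subtiles to be vertical strips of n/P consecutive columns, ordered left to right, which is one layout consistent with the description above; the function names are hypothetical.

```python
def core_of_column(c, n, P):
    """Core that holds column c of an n x n tile as local data (strip layout assumed)."""
    width = n // P          # columns per subtile; n/P is assumed to be an integer
    return c // width


def local_columns(u, n, P):
    """Columns of the tile that are local to core u under the same assumption."""
    width = n // P
    return range(u * width, (u + 1) * width)


# Hypothetical example: a 64 x 64 tile on P = 8 cores gives 8-column strips.
assert core_of_column(8, 64, 8) == 1
assert list(local_columns(3, 64, 8)) == list(range(24, 32))
```

Every column belongs to exactly one core, so a reference by core u to a column outside `local_columns(u, n, P)` is the kind of access that causes a cache miss.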
In the SIFT algorithm, the amount of data that comes into the cache, the times when it
comes in and the computation for which it is needed are all known in advance. Therefore,
we assume that data that is no longer required by the algorithm is automatically swapped
out to make room for other data. Thus, each cache has a miss only the first time a
piece of data is needed.
8.3 Computation Stage S1 in the HM Model
As each n × n tile is input to the multicore chip, it enters at the highest level cache, the
Lh-cache. The first step of the computation stage is for each core to access its local pixels.






At this point each L1-cache holds the local data of a subtile corresponding to a core.
But before the core can apply SIFT on its local data, it needs the x neighborhood of its
local data and some of this may be local to other cores. The second step is to get this
neighborhood data. After this step each core has all data needed to independently process
its local subtile.
We now describe these phases below.
8.3.1 Accessing Local Data
We examine how each of the P processors accesses its local data from the n × n tile located in
the Lh-cache. Each core has $n^2/P$ pixels of local data. Let the L1-cache have a cache-line
length of $B_1$. Then the local data of a core needs $\lceil n^2/(P B_1) \rceil$ cache lines of storage
at the L1 level. Since we do not have
any constraints on how cache lines are organized in the hierarchy, we may assume that





, Li−1 lines. That is, data for lower level caches are
compactly placed within the upper level cache lines as needed. This structure is possible
as local data is exclusive and the hierarchy is a tree. The total number of accesses due























































Figure 8.2: Accessing local data









































. This is in the form stated in the Lemma.






























































This completes the proof.










. Recall that $P = s_1 s_2 \cdots s_h$, $Q_i = s_1 s_2 s_3 \cdots s_i$,
$R_i = s_{i+1} s_{i+2} \cdots s_h$ and $Q_i R_i = P$.























Suppose the hierarchy is inclusive large-line. Then the ratio of cache lines in successive









≤ 1 and $Q_i / B_i \le 1$. If $s_i = s$, a constant for all i, then $\sum_{j=2}^{i} s_j \le is$ and
$s^h = P$, so $h = \log_s P$. So now we have
$A_i \le A_1 B_1 + is$ (8.2)
We now derive the time TL to access local data. Let an Li-cache require αi access time.
Then the time required for Ai accesses is TLi = αiAi. Let αm = max{αi : 1 ≤ i ≤ h};











αiAi ≤ (A1B1)αmh+ αms
h∑
j=2




log2 P + logP
)
. Substitut-











































Notice that since each subtile is of size $n^2/P$, the overhead is only O(log P).
8.3.2 Accessing Neighborhood Pixels
Consider core u again. In processing its subtile (shaded dark (red) in Figure 8.3), core u
requires the neighborhood data (shaded light (yellow) in Figure 8.3). In the worst case
the entire neighborhood has to be accessed. However, the pixels shaded dark (red) are local
and already available to the core. The core needs to bring in only the pixels shaded
light (yellow). Clearly, the access is symmetric on either side of u (that is, access
is from core u ± v for some v). So we consider only one side. The total data to be accessed










. That is, for one half of the lightly shaded (yellow) region in Figure 8.3, core u gets $n^2/P$
pixels from each of cores u ± 1, u ± 2, · · · , u ± (σ − 1) and ρ = (x mod n/P) pixels from
core u ± σ. This data is in the L1-caches of these cores. Recall that $B_1$ is the size of each
L1-cache line. Then core










= (σ − 1)D1 + E1 (say)
lines. Notice that if x is large, then the E1 term can be replaced by D1 without significant





. Clearly this is the number of misses as well. So
these accesses go to an L2-cache. Each L2-cache has $s_2$ L1-caches attached to
it and each L2-cache line holds $R_2$ lines of L1-caches. These $A_1$ accesses are to data in σ
L1-caches. In fact, these accesses mirror the accesses of Section 8.3.1, except that
$A_1$ is different.
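The set of cores that supply neighborhood data on one side of core u can be enumerated as in the sketch below. It is a hedged illustration: the definition of σ is not recoverable from the text, so σ = ⌊x/(n/P)⌋ + 1 is assumed here (σ − 1 whole subtiles plus a partial one of ρ columns), widths are counted in columns, and the function name is hypothetical.

```python
def one_side_neighbors(x, n, P):
    """Return (sigma, rho, needs) for a neighborhood of x columns on one side.

    needs is a list of (v, columns) pairs: core u + v (or u - v) supplies that
    many columns of its subtile. sigma = x // (n/P) + 1 is an assumption.
    """
    width = n // P                      # columns per subtile
    full, rho = divmod(x, width)        # whole subtiles covered, leftover columns
    sigma = full + 1
    needs = [(v, width) for v in range(1, sigma)]
    if rho:
        needs.append((sigma, rho))      # partial subtile from core u +/- sigma
    return sigma, rho, needs


# Hypothetical example: n = 64, P = 16 (4-column subtiles), x = 10.
sigma, rho, needs = one_side_neighbors(10, 64, 16)
assert sum(cols for _, cols in needs) == 10   # the x columns are fully covered
```

When n/P is small relative to x, many cores contribute, which is the situation in which the text replaces the $E_1$ term by $D_1$.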











. Simplifying this with the assumption of inclusive large-line





































































+B1 + si (8.4)
We now derive the time $T_N$ to access neighborhood data. Recall that we have only considered the
subtiles on one side of core u. Let an Li-cache require $\alpha_i$ access time and let
$\alpha_m = \max\{\alpha_i : 1 \le i \le h\}$. Then the time required for $A_i$ accesses is $T_{N_i} = 2\alpha_i A_i$.
The time to access all the neighborhood data is
$T_N = 2\sum_{i=1}^{h} \alpha_i A_i$


























































$\log P + B_1 \log P + \alpha_m s\left(\log^2 P + \log P\right)$



























Again the overhead is O(logP ).
8.3.3 Running SIFT on the Subtile
The compute stage complexity is the time taken by Stage S1 to extract features from
the tile received from the previous stage and deliver features to the next (output) stage. This
time for tile τk is denoted by $T_C$. Table 2.1 shows the time complexities and the number
of operations required by each phase of the SIFT algorithm for $N^2$ pixels. Applying these
to the n × n/P pixel tiles and then substituting the values α = 0.6%, β = 35% and
γ = 0.04% (from Section 3.3) gives the time complexity (number of operations) of
the computation stage as shown in Table 8.1. The fact that a subtile is not a square does
not affect the time complexity.
Table 8.1: The number of operations required by phases of SIFT for n × n/P pixel tiles


















Keypoint Descriptor Generation    $12.03x^2 n^2/P$



































The total time taken by Stage S1 of the pipeline to process tile τk is tk,1 = TL + TN + TC .



































The quantity x is fixed by the algorithm and P is fixed by the multicore chip. Select n












Theorem 46. For all 1 < i ≤ h, and for an inclusive large-line and uniform hierarchy, the











We now consider two cases.




















= O(P + Px) = O(Px) = O(nx).






logP = O(nx). So with n = max{x, P},
tk,1 =O
(















which is optimal considering
that a uniprocessor solution to the problem takes $O(n^2x^2)$ time.
Let us examine $x / \log(nx)$. In the algorithm of Lowe [13], x = 8; for higher feature accuracy
x may increase. For x = 8, $x / \log(nx) \le 1$ implies that 8 ≤ 3 + log n, or $n \ge 2^5$. That is,
P ≤ 32 ≤ n cores can be supported without loss of efficiency. If we increase x to 9, then
$P \le \lfloor 2^9/9 \rfloor = 56 \le n$, so a very modest increase in x allows a much larger increase in P. If
x = 16, the next logical higher value of x, then $P \le 2^{12} \le n$. Thus the proposed method
can scale to quite large values of P.
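The three thresholds quoted above all come from solving $x / \log_2(nx) \le 1$ for n, which gives $n \ge 2^x / x$. A minimal sketch, assuming base-2 logarithms as the worked values imply (the function names are hypothetical):

```python
from math import log2


def core_threshold(x):
    """Value 2^x / x obtained by solving x / log2(n x) <= 1 for n."""
    return 2**x / x


def efficiency_ratio(x, n):
    """The quantity x / log2(n x) examined in the text."""
    return x / log2(n * x)


for x in (8, 9, 16):
    print(x, int(core_threshold(x)))
```

For x = 8 the ratio is exactly 1 at n = 32, since $\log_2(32 \cdot 8) = 8$.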
8.3.4 Total Time





, whereas $t_{k,0}$ (when nonzero) is $\Omega(n^2)$. Even though
Stage S0 has periods where nothing needs to be brought in, it can continue to bring in further
data as it is the bottleneck. Thus S0  S1  S2 holds. So Case 2 of Section 7.1 holds. We
now have the following result for the total time.
Theorem 47. For any number of pictures X ≥ 1, the time required to run the SIFT
algorithm on a P -core chip is T = T0+T1+T2 where T0 is the time taken by the first stage
to input all X pictures and T1, T2 are the times to process the last tile of the last picture
in the compute and output stages.
This theorem clearly illustrates the importance of large input bandwidth, without which
the parallelism of a multicore environment cannot be fully used.
8.4 Memory Requirement
We assumed that the multicore chip swaps out any pixel data that is not required for
future use. Therefore at level Lh the memory requirement will be $M_k$, the same as in the
uniprocessor case. Since each subsequent level can operate by holding no more than what
its parent cache holds (as we have an inclusive cache), and since memory usage is fully
predictable, a level Li-cache can also swap out anything it does not need. Thus the total
memory at any level Li is $M_k$. Since there are $R_i$ cache modules in each level i, each
Li-cache module has size $M_k / R_i$.
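The per-module sizes follow by dividing $M_k$ by the number of modules at each level. The sketch below is illustrative only: it uses $M_k = Nx$ as a stand-in for the Θ(Nx) row major requirement (constants dropped), and the hierarchy and function name are hypothetical.

```python
from math import prod


def module_sizes(M_k, s):
    """Size M_k / R_i of each Li-cache module, where R_i = s_{i+1} ... s_h."""
    h = len(s)
    R = [prod(s[i:]) for i in range(1, h + 1)]
    return [M_k / R_i for R_i in R]


# Hypothetical values: N = 1024, x = 8, and a 3-level hierarchy with P = 8 cores.
N, x = 1024, 8
s = [1, 2, 4]
sizes = module_sizes(N * x, s)
print(sizes)                 # per-module size at levels L1, L2, L3
print(len(s) * N * x)        # total over all h levels, matching O(hNx)
```

Each level holds $M_k$ in total, so summing over the h levels gives the O(hNx) total stated for the whole hierarchy.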
From Theorem 40, the amount of memory required to run the SIFT algorithm on a 1-chip
(3-stage) uniprocessor pipeline for row major ordering is Mk = Θ(Nx). So the amount of






Theorem 48. The size of each module of Li-cache required to run the SIFT algorithm on






. The total memory needed is O(hNx).





In this chapter we study the performance of the SIFT algorithm on a two-chip processing
pipeline (see Figure 9.1), where each chip can be single core or multicore. We model the
multicore platform using the hierarchical multi-level-caching model (HM model) [4], the




Figure 9.1: A two-chip pipeline (stages S0, S1, S2, S3, S4)
As discussed in Chapter 5, the two-chip model results in a 5-stage pipeline. These stages
are denoted by S0, S1, S2, S3 and S4 according to the notation discussed in Section 5.1.
The tiles enter the pipeline at Stage S0 and the final output is delivered through Stage S4.
The input and output stages are discussed in Section 7.2. The computation of SIFT on
an image is divided between the two chips. The first chip computes the phases of Gaussian
blurring, difference of Gaussians, scale-space extrema detection and keypoint detection in Stage S1.
The second chip computes the phases of orientation assignment and keypoint descriptor
generation in Stage S3. The parameters for Stage S2 are the same as those of Stage S0. The
relation between these five stages is assumed to be S0 ≅ S2 and S0, S1, S2, S3  S4.
Table 9.1: The number of SIFT operations required for an $n^2$ pixel tile on a uniprocessor chip
Phase                             Number of operations
Gaussian Blurring                 $4n^2w^2s$
Difference of Gaussian            $4n^2s$
Scale-space Extrema Detection     $104sn^2$
Keypoint Detection                $0.6sn^2$
Orientation Assignment            $48sn^2$
Keypoint Descriptor Generation    $12.03x^2n^2$
Table 9.2: The number of SIFT operations required for an n × n/P pixel tile on a P-core chip


















Keypoint Descriptor Generation    $12.03x^2 n^2/P$
Tables 9.1 and 9.2 show the number of operations required by the different phases of SIFT
to process an image of size n × n for the uniprocessor case and n × n/P for the multicore
case (see also Sections 2.5, 7.2 and 8.3.3). Of the phases shown in the tables, Stage S1
of Chip C1 corresponds to Gaussian blurring, difference of Gaussians, scale-space extrema
detection and keypoint detection; Stage S3 of Chip C2 handles the remaining phases.
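The split of work between the two chips can be tabulated from the operation counts of Table 9.1. The sketch below is an illustration under stated assumptions: the values of n, w, s and x are hypothetical, and dividing every phase by P for the multicore case is an assumption based on the caption of Table 9.2 and its one surviving row.

```python
def sift_ops(n, w, s, x, P=1):
    """Operation counts per phase for an n x n/P tile (Table 9.1 when P = 1)."""
    area = n * n / P
    return {
        "gaussian_blurring": 4 * area * w**2 * s,
        "difference_of_gaussian": 4 * area * s,
        "scale_space_extrema": 104 * s * area,
        "keypoint_detection": 0.6 * s * area,
        "orientation_assignment": 48 * s * area,
        "keypoint_descriptor": 12.03 * x**2 * area,
    }


CHIP1_PHASES = ("gaussian_blurring", "difference_of_gaussian",
                "scale_space_extrema", "keypoint_detection")    # Stage S1
CHIP2_PHASES = ("orientation_assignment", "keypoint_descriptor")  # Stage S3


def chip_loads(n, w, s, x, P=1):
    """Total operations assigned to Chip 1 (Stage S1) and Chip 2 (Stage S3)."""
    ops = sift_ops(n, w, s, x, P)
    return (sum(ops[p] for p in CHIP1_PHASES),
            sum(ops[p] for p in CHIP2_PHASES))


# Hypothetical parameters: 32 x 32 tile, w = 5, s = 3 scales, x = 8.
print(chip_loads(32, 5, 3, 8))
```

For these sample values the second chip carries the larger load, driven by the $12.03x^2n^2$ descriptor term.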
9.1 Time Complexities of the Stages
The time complexities of the input and output stages for the 2-chip pipeline model are the same
as those of the single-chip uniprocessor and single-chip multiprocessor pipelines
discussed in Section 7.2.
The time taken by the compute stages S1 and S3 depends on whether they use uniprocessor
or multicore chips. For i ∈ {1, 3} and α ∈ {U, M} (for uniprocessor or multicore), let $t^{\alpha}_{k,i}$
denote the time for Stage Si to process tile τk. Observe that S3 works on an n × n tile with
an x neighborhood. In the same way, S1 works on an n × n tile with a w neighborhood.
Thus if $t^{\alpha}_{k,3} = f(n, x)$ then $t^{\alpha}_{k,1} = f(n, w)$. A uniprocessor takes $t^{U}_{k,3} = O(n^2x^2)$ time,
so $t^{U}_{k,1} = O(n^2w^2)$. In Section 8.3, we showed that for an inclusive large-line and uniform























which is smaller than the input complexity $\Theta(n^2 + x^2)$.
However, the constants associated with $t^{U/M}_{k,1}$ are quite small. If x < n, then most input iterations
would be non-empty and S0  S1, S3. So Stage S0 would be the maximal stage. Therefore
the time taken by the 2-chip pipeline to run SIFT on the image is
T = input time for all tiles + time to move the last tile through the rest of the pipeline.
Again this stresses the importance of an increased input bandwidth.
If Stage S3 is a uniprocessor stage, then clearly its $O(n^2x^2)$ complexity would dominate.
So the maximal stage in the pipeline would be Stage S3. The time T taken by the pipeline
to process an image is the time taken by Stage S3 to process the entire image, plus the time
taken by Stages S0, S1 and S2 for the first tile and the time taken by Stage S4 for the last tile.
Note that since Stage S3 represents only part of the SIFT computation, the time is still an
improvement over the single-chip uniprocessor case.
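Both timing statements (Stage S0 maximal, or Stage S3 maximal) can be reproduced with a small pipeline simulation. The sketch below is a simplification and not the thesis's model: it assumes every tile takes the same time in a given stage and that a stage starts a tile as soon as the tile is available and the stage is free; the stage times and tile count are hypothetical.

```python
def pipeline_finish_time(stage_times, num_tiles):
    """Time at which the last tile leaves the last stage of a linear pipeline."""
    finish = [0] * len(stage_times)   # finish[l]: when stage l completed its latest tile
    for _ in range(num_tiles):
        ready = 0                     # when the current tile leaves the previous stage
        for l, t in enumerate(stage_times):
            start = max(ready, finish[l])
            finish[l] = start + t
            ready = finish[l]
    return finish[-1]


K = 10  # hypothetical number of tiles

# Case 1: input stage S0 is maximal (times for S0..S4).
t = [5, 2, 5, 3, 1]
assert pipeline_finish_time(t, K) == K * t[0] + sum(t[1:])

# Case 2: uniprocessor Stage S3 is maximal.
t = [5, 2, 5, 9, 1]
assert pipeline_finish_time(t, K) == sum(t[:3]) + K * t[3] + t[4]
```

In the first case the total is the input time for all tiles plus the time for the last tile to pass through the remaining stages; in the second it is the first tile's time through S0, S1 and S2, then Stage S3 for every tile, then Stage S4 for the last tile.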







. Assuming the times for Stages S1 and S3 match, we will still be
bottlenecked by the input stages S0 and S2.
In summary, for this split of computing between the two chips, it appears to be better to
use fewer resources for Chip 1 than for Chip 2, based on the times of these stages (see Tables 9.1
and 9.2). However, the bottleneck is still the input stage.
Chapter 10
Conclusion and Future Work
In this thesis, we developed a template for running SIFT in terms of tiles that facilitates
its analysis without getting bogged down in input/output details. We developed a c-chip
((2c + 1)-stage) pipeline model and derived general expressions for the time required to
run SIFT on the model. We considered uniprocessor and multicore computing platforms
and analyzed SIFT on the single-chip (3-stage) pipeline model and the two-chip (5-stage)
pipeline model, for all combinations of single-core and multicore chips. Two tile orderings
(row major and diagonal) were considered as well.
In the single-chip uniprocessor pipeline model (consisting of stages S0, S1, S2) the time to
run SIFT is essentially the time to perform the computation on the complete image on
stage S1. The times due to the input stage S0 and output stage S2 are less relevant, being
restricted to that needed by the input stage S0 to make data of the first tile available to
the Stage S1, and time needed by the output stage S2 to output the features of last tile;
these times are much smaller than the time to run SIFT on the entire image (that stage











; where $\Gamma_\ell$ is the clock rate of
a stage $S_\ell$, $p_i$ and $p_o$ are the numbers of input and output pins, and α, β, γ are the feature
fractions. Here x is a measure of the size of the neighborhood of a keypoint defined by SIFT.
In this complexity, the middle term, which depends on N, is the largest. The remaining terms
depend on x and n, which are much smaller ($N/n$




thousands). We showed that the overall time complexity will not increase significantly
as long as $\frac{(n+x)^2\Gamma_0}{p_i},\ (\alpha\beta+\gamma)n^2\Gamma_2 = \Theta(N^2x^2\Gamma_1)$. This makes it possible to reduce
$\Gamma_0$ and $\Gamma_2$, thereby reducing the power and/or improving the data transmission quality for
input/output.
As we move to a multicore computational platform (modeled using the HM model), the
input stage of the pipeline becomes the bottleneck as the computation time at the Stage S1






cores can be employed fully, but increasing x slightly from 8 to 9 increases the
range of cores to 56. With the parameter values used by Lowe [13] about 60 cores can be
used. The time taken by the pipeline to run SIFT on an N × N image here equals the time
taken by Stage S0 to input the entire image plus the time taken by Stages S1 and S2 to perform
computation and feature output for the last tile. The input is the bottleneck and without
better input bandwidth, the power of multicore processing cannot be utilized fully.
In the two-chip (5-stage) model pipeline where each chip can be a single core or multicore,
the exact relationship between the stages depends on the chip (multicore or uniprocessor)
used. In any case, once again Stage S0 is the bottleneck. In summary, we have established
that without major improvements in input bandwidth, algorithms such as SIFT cannot
fully utilize the power of multicore technology.
We also derived the expressions for the amount of memory needed to run SIFT. In general
the amount of space is Θ(Nx), which is roughly $\sqrt{\text{image size}}$. There are many other directions
for possible future research. In the derivation of the total time to run SIFT, the maximal
stage is a key stage to identify. For the case where Sm  Sm+1  Sm+2  Sm+3 · · ·  S2c we
developed a time bound that we feel can be further tightened. However, we do not expect
it to significantly alter the results and conclusions reached in this thesis. The 3-stage and
5-stage pipeline models are studied in this thesis. To analyze the (2c + 1)-stage pipeline,
a detailed study of SIFT is needed to identify the computation at the stages themselves
and to establish relationships between these stages. In the 5-stage pipeline, Stage S3 needs
input data from Stage S2 that is the same as that of Stage S0. But Stage S1 does not require
this entire data. Data transfer as shown in Figure 10.1 could be employed.
Figure 10.1: Splitting the data at Stage S0 for Stage S1 and Stage S3 (stages S0, S1, S2, S3, S4)
The number of features of an image depends on the level of detail in the image and its
contents. In applications such as surveillance, the camera points to a fixed point and all
images are likely to have common features. This may also be the case for video data where
successive frames are correlated. One could consider mechanisms to exploit this, possibly
on a model that configures itself to suit the common features.
Bibliography
[1] L. Arge, M. T. Goodrich, M. Nelson and N. Sitchinava, “Fundamental Parallel
Algorithms for Private-Cache Chip Multiprocessors,” In Proc. 20th ACM Symposium on
Parallelism in Algorithms and Architectures, pp. 197-206, 2008.
[2] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh,
D. McCauley, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar,
J. Shen, and C. Webb, “Die Stacking (3D) Microarchitecture,” Proc. 39th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO’06), 2006.
[3] M. Brown, D. G. Lowe, “Recognising Panoramas,” Proc. 9th IEEE International
Conference on Computer Vision, pp. 1218–1225.
[4] R. A. Chowdhury, F. Silvestri, B. Blakeley and V. Ramachandran, “Oblivious
Algorithms for Multicores and Network Processors,” University of Texas Computer
Science Technical Report TR-09-19, July 2009.
[5] C. Seiculescu, S. Murali, L. Benini and G. De Micheli, “3D Network on Chip Topology
Synthesis: Designing Custom Topologies for Chip Stacks,” 3D Integration for NoC-
based SoC Architectures, Integrated Circuits and Systems, 2009.
[6] K. Chakraborthy, P. M. Wells and G. S. Sohi “Computation Spreading: Employing
Hardware Migration to Specialize CMP Cores On-the-fly,” Proc. Architectural Sup-
port for Programming Languages and Operating Systems (ASPLOS’06), Oct. 2006.
[7] H. Nasir, V. Stankovic, and S. Marshall, “Image Registration For Super Resolu-
tion Using Scale Invariant Feature Transform, Belief Propagation and Random Sam-
pling Consensus,” 18th European Signal Processing Conference (EUSIPCO-2010),
Aug. 2010.
[8] S. Heymann, K. Müller, A. Smolic, B. Froehlich, and T. Wiegand, “SIFT
Implementation and Optimization for General-Purpose GPU,” In Proc. of the WSCG’07, Jan.
2007.
[9] Intel 64 and IA-32 Architectures Software Developer’s Manual, vol. 1: Basic
Architecture.
[10] J.-Y. Kim, S. Oh, S. Lee, M. Kim, J. Oh, and H.-J. Yoo, “An Attention Controlled
Multi-core Architecture for Energy Efficient Object Recognition,” Signal Process.
Image Commun, 2010.
[11] T. Ko, Z. M. Charbiwala, S. Ahmadian, M. Rahimi, M. B. Srivastava, S. Soatto, and
D. Estrin, “ Exploring Tradeoffs in Accuracy, Energy and Latency of Scale Invariant
Feature Transform in Wireless Camera Networks,” IEEE, 2007.
[12] M. C. Jordon, “A Configurable Decode for Pin-Limited Applications,” MS Thesis,
Electrical and Computer Engineering, Louisiana State University, Dec. 2006.
[13] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” Interna-
tional Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004.
[14] D. G. Lowe, “Object Recognition from local scale-invariant keypoints,” In proc. of
International Conference on Computer Vision, pp. 1150-1157, 1999.
[15] D. G. Lowe, “Local feature view clustering for 3D object recognition,” IEEE
Conference on Computer Vision and Pattern Recognition, pp. 682-688, 2001.
[16] D. G. Lowe, J. Little,“Vision-based mobile robot localization and mapping us-
ing scale-invariant features,” Proceedings of the IEEE International Conference on
Robotics and Automation (ICRA). 2, pp. 2051, doi:10.1109/ROBOT.2001.932909.
http://citeseer.ist.psu.edu/425735.html.
[17] K. Mikolajczyk, “Detection of local features invariant to affine transformations,” Ph.D.
thesis, Institut National Polytechnique de Grenoble, France.
[18] “OpenCV 2.1 C++ Reference,” http://opencv.willowgarage.com/documentation/cpp/index.html.
[19] Hess - SIFT Library, http://blogs.oregonstate.edu/hess/code/sift/.
[20] S. Cho, P.-C. Yew, and G. Lee, “A High-Bandwidth Memory Pipeline for Wide Issue
Processors,” IEEECS, 2001.
[21] B. H. Shekar, M. Sharmila Kumari, L. M. Mesteskiy and N. Dyshkant, “FLD-SIFT:
Class Based Scale Invariant Feature Transform for Accurate Classification of Faces,”
CNC 2011, CCIS 142, pp. 15–21, 2011.
[22] S. N. Sinha, J.-M. Frahm, M. Pollefeys, and Y. Genc, “Feature Tracking and Matching
in Video Using Programmable Graphics Hardware,” Machine Vision and Applications,
Mar. 2007.
[23] T. Wen, X. Fan, W. H. Yuan, and Z. Bo, “Fast Scale Invariant Feature Transform
Algorithm Based on CUDA,” DOI CNKI:SUN:JSJC.0.2010-08-079.
[24] Z. Qu and Z. Wang, “The Improved Algorithm of Scale Invariant Feature Transform
on Palmprint Recognition,” Advanced Materials Research, vol. 186, pp. 565–569,
Jan. 2011.
[25] A.P. Witkin, “Scale-Space Filtering,” International Joint Conference on Artificial
Intelligence, pp. 1019-1022, 1983.
[26] I. A. Young, E. Mohammed, J. T. S. Liao, A. M. Kern, S. Palermo, B. A. Block, M. R.
Reshotko, and P. L.D. Chang, “Optical I/O Technology for Tera-Scale Computing,”
Digital Object Identifier.
[27] Y.-s. Lin, S.-M. Chang, J. C.Tsai, T. K.Shih and H.-H. Hsu, “Motion Analysis Via
Feature Point Tracking Technology,” MMM’11 Proceedings of the 17th International
Conference on Advances in Multimedia Modeling, vol. Part II.
[28] Q. Zhang, Y. Chen, Y. Zhang and Y. Xu, “SIFT Implementation and Optimization
for Multi-Core Systems,” Proc. of the IEEE, 2008.
[29] Piper Seneca RP-C 2123.jpg
http://www.griffinaviation.in/Images/Piper%20Seneca%20I%20RP-C%202123.JPG




(last retrieved date, 04-24-2011).
[31] Old warden.jpg
http://upload.wikimedia.org/wikipedia/commons/c/c1/504 at Old Warden.jpg
(last retrieved date, 08-24-2010).
[32] AirOne Flight Academy Gallery
http://www.aironeflightacademy.com/files/gallery images/IMG 0101.JPG
(last retrieved date, 08-11-2010).
[33] Orlando FL, EPCOT, sphere.jpg
http://www.pleaseyourself.ca/Orlano%20FL,%20EPCOT,%20sphere.jpg
(last retrieved date, 04-24-2011).
[34] Opera Sphere.jpg
http://www.cameronballoons.co.uk/images/gallery/7/Opera Sphere.jpg




(last retrieved date, 04-24-2011).
[36] orange-sphere.jpg
http://www.psdgraphics.com/file/orange-sphere.jpg
(last retrieved date, 04-24-2011).
[37] einstein2.jpg
http://elementaryteacher.files.wordpress.com/2010/01/einstein2.jpg
(last retrieved date, 08-11-2010)
[38] File:Albert Einstein 1947a.jpg
http://commons.wikimedia.org/wiki/File:Albert Einstein 1947a.jpg
(last retrieved date, 04-24-2011).
[39] Einstein and Thatcham Mindwork’s Webblog
http://mindworksblog.com/2008/07/31/einstein-and-thatcham/
(last retrieved date, 04-24-2011).
[40] Einstein Potrait2.jpg
http://www.neoformix.com/2008/EinsteinWordPortrait.html
(last retrieved date, 08-11-2010)
[41] Mohandas Karamchand GANDHI cille85
http://cille85.wordpress.com/2009/04/16/mohandas-karamchand-gandhi/
(last retrieved date, 04-24-2011).
[42] index of ngo images 3362122.jpg
http://aryangroupofhospital.com/ngo/images/3362122.jpg
(last retrieved date, 04-24-2011).
[43] Mohandas Karamchand GANDHI cille85
http://cille85.wordpress.com/2009/04/16/mohandas-karamchand-gandhi/
(last retrieved date, 04-24-2011).
[44] Little known Facts About the Great Mahatma
http://www.randomthoughtz.com/archives/1007
(last retrieved date, 04-24-2011).
[45] Palm-trees1.jpg
http://haveanopinion.files.wordpress.com/2008/04/palm-trees1.jpg




(last retrieved date, 04-24-2011).
[47] palm-trees.jpg
http://withoutwords.files.wordpress.com/2007/11/palm-trees.jpg
(last retrieved date, 04-24-2011).
[48] frindswsunbehindreallife2.jpg
http://www.tikibarcentral.com/frondswsunbehindreallife2.jpg
(last retrieved date, 04-24-2011).
[49] 81552687.jpg
http://content.answcdn.com/main/content/img/ getty/8/7/81552687.jpg




(last retrieved date, 08-11-2010).
[51] 74166268.jpg
http://content.answcdn.com/main/content/img/getty/6/8/74166268.jpg
(last retrieved date, 04-24-2011).
[52] Buffmotorsports Images http://buffmotorsports.com/images/
img 0290 dm7g.jpg
(last retrieved date, 08-11-2010)
[53] Index of /classifieds/uploadt/au/vehicles http://www.adpost.com/classifieds/
uploadt/au/vehicles/au vehicles.7104.1.jpg
(last retrieved date, 08-11-2010).
[54] www.fleetowner.com http://blog.fleetowner.com/trucks at work/wpcontent/uploads/
2010/08/fo2011chevysilveradonm.jpg
(last retrieved date, 08-11-2010).
Vita
Phaneendra Vinukonda was born in April, 1987, in Warangal, Andhra Pradesh, India. He
graduated with his Bachelor of Technology in Electrical and Electronics Engineering from
Jawaharlal Nehru Technological University, Hyderabad, India, in the year 2007. He is
presently enrolled in the master’s program in electrical and computer engineering at Louisiana
State University and is expected to graduate in May 2011. His research interests include
networking, interconnection networks and computer architecture.
