Brigham Young University

BYU ScholarsArchive
Theses and Dissertations
2019-07-01

Optimization and Hardware Implementation of SYBA-An Efficient
Feature Descriptor
Samuel Gaylin Fuller
Brigham Young University

Follow this and additional works at: https://scholarsarchive.byu.edu/etd

BYU ScholarsArchive Citation
Fuller, Samuel Gaylin, "Optimization and Hardware Implementation of SYBA-An Efficient Feature
Descriptor" (2019). Theses and Dissertations. 7520.
https://scholarsarchive.byu.edu/etd/7520

This Thesis is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion
in Theses and Dissertations by an authorized administrator of BYU ScholarsArchive. For more information, please
contact ellen_amatangelo@byu.edu.

Optimization and Hardware Implementation of SYBA—
An Efficient Feature Descriptor

Samuel Gaylin Fuller

A thesis submitted to the faculty of
Brigham Young University
in partial fulfillment of the requirements for the degree of
Master of Science

Dah Jye Lee, Chair
James K. Archibald
Randal W. Beard

Department of Electrical and Computer Engineering
Brigham Young University

Copyright © 2019 Samuel Gaylin Fuller
All Rights Reserved

ABSTRACT
Optimization and Hardware Implementation of SYBA—
An Efficient Feature Descriptor
Samuel Gaylin Fuller
Department of Electrical and Computer Engineering, BYU
Master of Science
Feature detection, description and matching are crucial steps in many computer vision algorithms. These rely on feature descriptors to be able to match image features across sets of images.
This paper discusses a hardware implementation and various optimizations of our lab’s previous
work on the SYnthetic BAsis feature descriptor (SYBA). Previous work has shown that SYBA can
offer superior performance to other binary descriptors, such as BRIEF. This hardware implementation on an FPGA is a high-throughput, low-latency solution, which is critical for applications
such as high-speed object detection and tracking, stereo vision, visual odometry, structure from
motion, and optical flow. Finally, we compare our solution to other hardware methods. We believe
that our implementation of SYBA as a feature descriptor in hardware offers superior image feature
matching performance and uses fewer resources than most binary feature descriptor implementations.

Keywords: SYBA, FPGA, Feature, Descriptor, Binary

ACKNOWLEDGMENTS
I would first like to express my gratitude and thanks to my advisor Dr. Dah Jye Lee for
his support, guidance, friendship and inspiration during my time at the Robotic Vision Lab. Dr.
Lee gave me an opportunity to join his lab starting as an undergraduate student and I’ve had the
opportunity to work on many different educational, fun and exciting projects for him over the
years. It is due to his help and support that I am able to complete this Master’s thesis and degree.
I would also like to thank the rest of my committee members, Dr. James Archibald and
Dr. Randal Beard for serving on my committee, for their support in producing this thesis, and for
also being excellent teachers for me in various classes at BYU. I am sure that they will continue to
inspire and teach other students as they have done for me.
I would also like to thank my many friends at the Robotic Vision Lab, staff members,
and fellow graduate students in the Department of Electrical and Computer Engineering. There
have been many who have encouraged me and given me new insights into new directions for my
research. I would also like to thank my fellow friend and student, Alex McCown, with whom I
have been a class partner, lab partner, research partner, and business partner along with many other
pursuits.
I would especially like to thank my wife, Lauren Fuller, who has given me great love and
support in completing my Master’s degree and research. She has continually inspired me to do and
be better and has always believed in me. She would frequently encourage me to work harder and
to push myself to achieve the best possible.
Finally, I would also like to thank my parents and family for all of their love and support
throughout my educational pursuits. I feel blessed to have goodly parents who have set excellent
examples in family and academic pursuits.

TABLE OF CONTENTS
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Introduction
1.2 Review of Existing Algorithms for Feature Detection, Description and Matching
1.3 Review of FPGA Architectures for Feature Detection and Description
Chapter 2 Algorithm for SYBA Feature Description
2.1 Introduction to SYBA
2.2 Compressed Sensing Theory
2.3 SYBA Feature Description
2.4 SYBA Feature Matching
2.5 Flow of Algorithm
Chapter 3 SYBA in Hardware
3.1 Line Buffers
3.2 Feature Detection - FAST
3.3 Feature Description with SYBA
3.4 Feature Matching
3.5 Analysis of Speed and Resource Utilization of Solution
3.6 Graphical Flow of SYBA Hardware
Chapter 4 SYBA Optimizations
4.1 Accuracy Test and Image Sequences Used
4.2 Changing the Number of SBIs
4.3 Changing the Method of Binarization
4.4 Impact of SYBA Optimizations on Hardware Usage
Chapter 5 Accelerating SYBA Feature Matching with HLS
5.1 Feature Matching SYBA in Software
5.2 HLS Feature Matching With SYBA - Basic Block
5.3 HLS Feature Matching With SYBA - Advanced Block
5.3.1 Demo Application
Chapter 6 Comparison of Results and Conclusion
6.1 Results and Discussion
6.2 Conclusion
References

LIST OF TABLES
3.1 Resource utilization for the FAST detector and SYBA descriptor
3.2 Frame rates for different image sizes at 100 MHz clock
4.1 Changing the number of SBIs vs. descriptor length and accuracy
4.2 Implementation type vs. hardware usage
5.1 Runtime results for SW vs. initial HLS configuration
6.1 Comparison of hardware resource utilization to others' results

LIST OF FIGURES
1.1 Recall vs. Precision curve from previous results
2.1 A 30x30 sample synthetic basis image and a 5x5 synthetic basis image
2.2 SBIs comparison example
2.3 The flowchart of the SYBA algorithm
3.1 Numato STYX FPGA Board with a ZYNQ XC7Z020 SoC
3.2 Omnivision 5642 Hardware Camera with Parallel Output Interface
3.3 Using shift registers as line buffers
3.4 Using shift registers and BRAM as line buffers
3.5 Example of FAST feature detector
3.6 Hardware Diagram of Parallel Access and Comparison for FAST Feature Detector
3.7 Graphical Flow of Efficient Averaging Finder
3.8 Graphical Flow of SYBA Hardware
4.1 Image sequences from left to right: Bikes (Image blurring), Graffiti (Viewpoint change), Leuven (Illumination change)
4.2 Image sequences from left to right: Trees (Image blurring), UBC/JPEG Compression (image compression artifacts), Wall (Viewpoint change)
4.3 Number of SBIs vs. Recognition Rate for Trees Image Sequence
4.4 Number of SBIs vs. Recognition Rate for Bikes Image Sequence
4.5 Number of SBIs vs. Recognition Rate for Wall Image Sequence
4.6 Number of SBIs vs. Recognition Rate for UBC Image Sequence
4.7 Number of SBIs vs. Recognition Rate for Leuven Image Sequence
4.8 Number of SBIs vs. Recognition Rate for Graffiti Image Sequence
4.9 Number of SBIs vs. Average Recognition Rate Across All Image Sequences
4.10 Binarization Method vs. Recognition Rate for Trees Test Sequence
4.11 Binarization Method vs. Recognition Rate for Bikes Test Sequence
4.12 Binarization Method vs. Recognition Rate for UBC Test Sequence
4.13 Binarization Method vs. Recognition Rate for Leuven Test Sequence
4.14 Binarization Method vs. Recognition Rate for Wall Test Sequence
4.15 Binarization Method vs. Recognition Rate for Graffiti Test Sequence
4.16 Binarization Method vs. Average Recognition Rate
4.17 Different SYBA Methods and BRIEF-32 vs. Feature Recognition Rate
5.1 Graphical Flow of Advanced HLS Block
5.2 Graphical Flow of the Demo Application
5.3 Numato Styx Board with Omnivision 5642 Camera
5.4 Captured image from HDMI output showing SYBA Feature Matching

CHAPTER 1. INTRODUCTION

1.1 Introduction
Image processing for humans involves using sight and then mentally breaking down what is
seen to give it meaning. The human visual system can easily distinguish and recognize individual
components and quickly make deductions about them based upon prior experience. For computer
vision engineers, the task is to teach a computer to extract some desired meaning or information
from an image or series of images as well. These algorithms often involve computationally intensive tasks such as object identification [1], [2], localization and pose estimation [3], optical
flow [4], [5], super resolution [6], visual odometry [7], [8], target tracking [9], 3D reconstruction [10] and many others. The primary step in most of these algorithms is to find features in
an image and then match them to corresponding features in another image. This generally follows
a familiar three-step process: feature detection, feature description, and feature matching.
Feature detection is the process of identifying so-called “interesting” parts of an image. An
image feature is generally something that is trackable, distinguishable, and hopefully unique. Thus,
image features normally consist of corners, blobs, non-straight edges, or other ridges. Feature
detection is also referred to as key-point detection. After feature detection, a feature description
is generated around a feature region of interest or FRI. Feature description seeks to describe the
FRI in a unique enough way to increase the probability that the features across images will match
correctly. Ideally, a feature description is resistant to image deformations and changes such as
noise, illumination, perspective, rotation, scale, blurriness, JPEG or other compression artifacts,
and more. There are many different methods to perform feature description, and they generally offer
trade-offs between accuracy and run-time performance. Each method must be efficient enough to
allow for a large number of features to be compared in a relatively short amount of time for real-time applications. This is even more difficult for low-power and low-resource embedded systems,
which are becoming increasingly prevalent in computer vision applications. The implementation
of these algorithms on platforms such as field programmable gate arrays (FPGAs) and embedded
processors deserves special consideration.
This paper explores and expands upon a feature description and comparison algorithm called the
synthetic basis imaging descriptor (SYBA). SYBA was designed with the purpose of being hardware-friendly by eliminating the need for operators that are generally costly to implement on FPGAs or
other hardware platforms. SYBA does not make use of multiplication, square roots, or trigonometric operations, which are commonly used for other feature descriptors. SYBA can be implemented
with only basic hardware functions such as adders, comparators, and simple logic gates (primarily
NOR). In addition to being more hardware friendly, SYBA was designed to be efficient in general
by reducing the complexity of feature descriptions and computations as well as reducing the storage
costs for storing the descriptions [11]. In this thesis, the following accomplishments are detailed:
the first-ever hardware implementation of SYBA, key optimizations to SYBA to make it smaller
and faster, hardware optimizations to be competitive with other hardware feature descriptors, and
a real-time SYBA feature matching demo built on a ZYNQ XC7Z020 FPGA.

1.2 Review of Existing Algorithms for Feature Detection, Description and Matching
The most widely used and prominent algorithms for feature detection, description and
matching are the Scale-Invariant Feature Transform (SIFT) [12] and the Speeded-Up Robust Features (SURF) algorithms [13]. SIFT is well known and uses a feature description based on the orientations and magnitudes of intensity gradients. It works very well on intensity images and provides
feature descriptors that are invariant to both rotation and scaling. This increase in robustness and
complexity comes at the cost of higher computation and storage requirements, which renders it unsuitable for many resource-constrained applications. SURF is more commonly used, as it relies on
integral images, which help cut down its execution time. Even so, it has a relatively large storage
requirement (256 bytes per descriptor) and still has a high computation cost.
More recently, binary feature descriptors have been developed which have more compact
sizes and lower computational requirements. These generally compute the descriptor with pixel
level intensity comparisons. Among the most common of these are Binary Robust Independent
Elementary Features (BRIEF) [14], Binary Robust Invariant Scalable Keypoints (BRISK) [15],
Oriented FAST and Rotated BRIEF (ORB) [16], and Fast REtinA Keypoint (FREAK) [17]. These
descriptors trade reliability and robustness for computational speed. BRIEF consists of a binary
string which is the result of multiple intensity comparisons within a single image at random but
predetermined locations. A newer version of BRIEF called rBRIEF has been developed by Rublee
et al. [16]. rBRIEF uses learned pixel pairs rather than random locations to reduce the correlation
among the binary tests. Like BRIEF, rBRIEF requires only 32 bytes to represent a feature point.
BRISK relies on configurable concentric circles with more points on the outer rings for its sampling
patterns from which it also computes brightness comparisons. It computes the orientation of the
keypoint by using local gradients between the sampling pairs within the pattern which allows
BRISK to be rotation invariant. Overall this requires significantly more computation (including
division and multiplication) and slightly more storage space than BRIEF.
To address this issue, ORB was developed to keep BRIEF's low computational complexity while adding rotational invariance. ORB uses a set of 256 learned pixel pairs and only
requires 32 bytes to represent a feature point. The ORB descriptor also includes orientation information, which helps make it rotation invariant. While ORB has this major advantage over BRIEF
for rotated images, BRIEF tends to outperform ORB in other cases [18]. Finally, FREAK uses a
sampling pattern that is inspired by a retinal sampling grid, which means that there is a higher density of points around the center of the sampling pattern. This is similar to the human visual system,
where peripheral vision is blurry compared to what is seen when looking straight ahead. FREAK
uses an orientation assignment similar to BRISK’s, which allows it to be rotationally invariant.
Created at Brigham Young University’s Robotic Vision Lab, SYBA offers an alternative to
other binary feature descriptors. SYBA is also a binary feature descriptor; however, it is formed
in an entirely different way, and is inspired by compressed sensing theory, which uses synthetic
basis functions to uniquely encode or reconstruct a signal. SYBA works by performing a number
of similarity tests between a feature region of interest (FRI) and a predetermined number of synthetic
basis images (SBIs) [19]. By only storing the similarity of the FRI to each SBI, the overall storage
size is dramatically reduced. This also makes comparison when searching for feature matches
easier. In short, the SYBA descriptor is designed to provide real-time vision applications high
feature matching accuracy with computational simplicity, relatively low resource requirements,
and a hardware friendly design. SYBA has previously been compared with two well-known binary
descriptors, BRIEF-32 and rBRIEF, and has been shown to produce better feature matching results
[19]. Figure 1.1 shows a recall vs. precision curve using threshold-based similarity matching
(sliding the Hamming distance from minimum to maximum). Precision is defined as
\[ \mathrm{Precision} = \frac{T_p}{T_p + F_p}, \tag{1.1} \]
and recall is defined as
\[ \mathrm{Recall} = \frac{T_p}{T_p + F_n}, \tag{1.2} \]
where $T_p$ is the number of true positive matches, $F_p$ is the number of false positive matches, and
$F_n$ is the number of false negative matches.
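As a small illustration of Equations (1.1) and (1.2), the following Python sketch computes precision and recall from match counts. The function name `precision_recall`, the convention of returning 0.0 for an empty denominator, and the counts in the usage line are assumptions made for this example; they are not results from this work.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive
    and false-negative match counts."""
    # Assumed convention: an empty denominator yields 0.0 instead of an error.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(tp=80, fp=20, fn=10)
```

Sweeping the matching threshold and recomputing these two values at each setting would trace out a recall vs. precision curve of the kind shown in Figure 1.1.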
It can be seen that SYBA significantly outperformed both BRIEF algorithms for high recall
values. This work was performed on the academically common Oxford dataset, Karlsruhe dataset,
and KITTI vision benchmark dataset. Previous work also includes several applications of SYBA
based on a software implementation [11], including visual odometry drift reduction [20].

Figure 1.1: Recall vs. Precision curve from previous results

1.3 Review of FPGA Architectures for Feature Detection and Description
In this section a review is presented on some of the methods used to do feature detection
and description on field-programmable gate arrays. The use of programmable logic instead of
software to do image processing enables custom hardware circuitry to accelerate the process. Many
of the basic functions for computer vision are inherently parallelizable including: thresholding,
convolution, color space conversion, feature detection, feature description, and feature matching.
Pixels are generally fed into a hardware system by means of a pixel clock and associated data.
Hardware designs have been made that do image processing with pixel data coming in directly
from a camera (as in this thesis), or with images from HDMI, Ethernet, and off-chip DDR memory. The steps
for image processing with features in hardware follow the familiar process of feature detection
followed by feature description. Whether or not feature matching is included in the implementation
depends on the application. The first choice to be made is what feature detector should be used. Some
FPGA-based systems have employed state of the art feature detection such as SIFT as in [21]
and [22]. As mentioned in the previous section, SIFT requires the use of several functions that
are not very hardware friendly. This causes these implementations to have relatively high resource
usage and also requires the use of digital signal processing (DSP) slices. In order to reduce resource
usage, several implementations have made use of the Harris Corner feature detector as in [23]
and [24]. Others have elected to reduce resource usage further using the Features from Accelerated
Segment Test (FAST) feature detector as in [25] and [26]. Generally, since FAST is a simpler
feature detector it requires fewer resources to implement. Following this a feature descriptor must
be chosen, with many FPGA implementations using BRIEF including [25] and [26]. The usage of
FAST and BRIEF on FPGAs is more common because they are generally very hardware friendly.
One of the primary drawbacks of these binary feature detectors and descriptors is a lack of scale
and rotation invariance, although small changes are tolerable. Thus these will not have the same
accuracy as implementations that make use of SIFT or SURF but will have much lower resource
usage.
In comparison to BRIEF, SYBA has been shown to have better feature matching performance, as shown in Figure 1.1. More detailed comparison results are available from previous
work [19]. Like the other binary methods described, SYBA is not scale or rotation invariant, but
will tolerate small changes. In this thesis we show that SYBA can also have significantly lower resource usage in FPGAs, while maintaining its feature matching accuracy. This makes it a superior
implementation to other common binary feature descriptor implementations in hardware.

In summary, the contributions of this thesis are the following: the first ever hardware design and implementation of SYBA, several creative improvements to the SYBA algorithm to make
it more hardware friendly, another hardware design and implementation based upon the improvements made to SYBA, and a demo application based upon High Level Synthesis (HLS) that accelerates SYBA feature matching in hardware.

CHAPTER 2. ALGORITHM FOR SYBA FEATURE DESCRIPTION

2.1 Introduction to SYBA
Most mainstream feature description algorithms, such as SIFT and SURF, work well enough
to generate correct feature matches between images in the majority of applications. The drawback
to these feature descriptors and others is that they rely upon large and complex operations and
require large amounts of storage. These include operations such as multiplication, square root, and
other floating point operations. These can be too costly for resource-constrained embedded systems to perform at a high throughput. In order to address these issues, binary feature descriptors
such as BRIEF, ORB, and BRISK have been developed. These algorithms are unique in that they
mostly rely upon simple comparisons and other operations to generate a descriptor. This greatly
reduces the storage space required for these descriptors, and also enables real-time applications on
resource-constrained environments. In order to compete with these feature descriptors, Alok Desai
invented the Synthetic Basis (SYBA) algorithm. The focus of SYBA is to use Synthetic Basis Images (SBIs) which are overlaid onto the Feature Region of Interest (FRI) and then compared. This
comparison generates a number of hits which are then compared and matched to find an image
correspondence. This idea was inspired by the concept of compressed sensing theory to uniquely
describe the image region and generate the binary descriptor. SYBA also utilizes a unique matching scheme to produce accurate feature matches. Previous research has shown that SYBA can
compete with and beat other binary feature descriptors. Similar to some binary descriptors such
as BRIEF, SYBA performance suffers with large amounts of image rotation and scaling. Previous
work performed by Lindsey Raven [11] has addressed this shortcoming with Robust SYBA
(rSYBA), which implements FRI rotation and scaling in a manner similar to SIFT. While
this does remove much of the hardware friendliness of SYBA, it was found to have good results in
comparison to SURF and SIFT. Since the purpose of this thesis is to compare a hardware-friendly
SYBA implementation to other simple binary descriptors’ hardware implementations, it is SYBA
that is implemented in hardware, not rSYBA.

2.2 Compressed Sensing Theory
Compressed sensing theory is used to encode and decode signals efficiently and can reduce
bandwidth and storage requirements. This is a clearly advantageous feature for systems that are
resource-limited. Compressed sensing is capable of uniquely describing signals with synthetic
basis functions, and therefore has a suitable application for feature description. The basic idea of
synthetic basis functions for compressed sensing is to use random patterns as a guess. For image
feature description, these random patterns are simply patterns of black and white pixels which are
the SBIs. The black pixels are the points which are sampled and compared against the original
image. The maximum number of different random patterns that is required is
given by the following equation, as described in [27],
\[ M = K \ln\left(\frac{N}{K}\right), \tag{2.1} \]
where $N$ is the number of pixels in the image feature region, $K$ is the number of random guesses
per pattern, and $M$ is the total number of random patterns that are required to accurately encode
the signal. M is smallest when K = N/2, meaning that SBIs will be 50% black and 50% white.
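To make Equation (2.1) concrete, the short Python sketch below evaluates it for the two region sizes used in this chapter. The function names are hypothetical. Rounding K up for an odd pixel count and rounding M up to the next integer are assumptions of this example, chosen because they agree with the 312 SBIs (30x30) and 9 SBIs (5x5) quoted in Section 2.3.

```python
import math


def patterns_required(n_pixels: int, k_samples: int) -> float:
    """Evaluate Equation (2.1): M = K * ln(N / K)."""
    return k_samples * math.log(n_pixels / k_samples)


def half_black(n_pixels: int) -> int:
    """Number of black (sampled) pixels: half the region, rounded up (assumed)."""
    return (n_pixels + 1) // 2


for side in (30, 5):
    n = side * side
    k = half_black(n)
    m = patterns_required(n, k)
    # Assumption: the number of SBIs is M rounded up to a whole pattern.
    print(f"{side}x{side}: N={n}, K={k}, M={m:.2f}, SBIs={math.ceil(m)}")
```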

Figure 2.1: A 30x30 sample synthetic basis image and a 5x5 synthetic basis image

2.3 SYBA Feature Description
The SYBA algorithm is used only for describing and matching FRIs, and is not used to
detect features in an image. Therefore a feature detector is needed in conjunction with SYBA to
identify FRIs. In the original SYBA work, SURF was used to find feature point locations. This
was done since most other literature also used SURF as the feature detector for software based
implementations. This meant that the detected feature points would be the same, and thus offers
a fair comparison to other papers. For feature description and detection systems that are built
in hardware, simpler detectors are commonly used. The much more hardware-friendly feature
detector Features from Accelerated Segment Test (FAST) is commonly used in these systems,
and so it is also used in this work and detailed in the hardware implementation section. Once the
feature point has been detected, the FRI must then be binarized based upon the average intensity of
the image region. This calculation allows SYBA to be illumination invariant. To do this an average
intensity (g) of the FRI can be calculated as given by Equation (2.2),
\[ g = \frac{\sum_{x,y} I(x, y)}{p}, \tag{2.2} \]
where $I(x, y)$ is the intensity of each pixel at location $(x, y)$ and $p$ is the number of pixels. The
binary FRI (BFRI) is then generated using the average intensity $g$:
\[ \mathrm{BFRI}(x, y) = \begin{cases} 1, & I(x, y) > g \\ 0, & \text{otherwise} \end{cases} \tag{2.3} \]
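The binarization in Equations (2.2) and (2.3) can be sketched in a few lines of Python. This is a software illustration only, not the hardware design described later in this thesis; the function name `binarize_fri` and the tiny 2x2 input are hypothetical, and a floating-point mean is used purely for readability.

```python
from typing import List


def binarize_fri(fri: List[List[int]]) -> List[List[int]]:
    """Binarize a feature region of interest around its mean intensity.

    A pixel becomes 1 when it is strictly brighter than the mean g
    (Equation 2.3) and 0 otherwise.
    """
    pixels = [value for row in fri for value in row]
    g = sum(pixels) / len(pixels)  # Equation (2.2): average intensity
    return [[1 if value > g else 0 for value in row] for row in fri]


# Hypothetical 2x2 region: the mean is 70, so only the 200 pixel maps to 1.
bfri = binarize_fri([[10, 200], [30, 40]])
```

Because the threshold is the region's own mean, adding a constant brightness offset to every pixel leaves the binary result unchanged.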

Since the SBI is an image of the same size as the BFRI, the images can be overlaid for easy
comparison. As described above, the number of black pixels should be set to half the region size
(rounding up in case of odd dimensions). The images are then compared to generate a similarity
value. If both pixels are black then this is counted as a hit; any other combination is not a hit.
With black pixels encoded as 0, this is equivalent in hardware to a NOR gate. The number of corresponding hits is then counted
for each SBI. These are represented as unsigned numbers which are then concatenated together
to form the descriptor. The length of the descriptor, without considering the pixel coordinates, is
given by Equation (2.4),
\[ L = N_s \times N_b \times R, \tag{2.4} \]
where $L$ is the length of the descriptor, $N_s$ is the number of SBIs, $N_b$ is the number of bits needed to
represent the maximum number of hits for that SBI, and $R$ is the number of subregions (only used
in SYBA 5x5). For SYBA 30x30 this means that each of the 312 SBIs can have a maximum of 450
hits. This number of hits can be represented as an unsigned number with 9 bits. This means that
the final descriptor length will be 312 × 9 × 1 = 2808 bits. SYBA 30x30 has great image feature
matching performance, but it comes at a very high cost computationally. In order to be competitive
with other simple feature descriptors, SYBA 5x5 was developed. It uses far fewer operations and
still offers very good image feature matching performance, as shown in Figure 1.1. For SYBA
5x5 only 9 SBIs are needed, which can each have a maximum of 13 hits. This number of hits can
be represented as an unsigned number with 4 bits. This means that the final descriptor length will be
36 × 9 × 4 = 1296 bits.
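The hit-counting step for SYBA 5x5 can be sketched in Python as follows. This is a minimal software illustration under stated assumptions, not the implementation from this work: the SBIs here are random patterns from a seeded generator (the actual SBI patterns are not reproduced), black is encoded as 0, the hit counts are ordered subregion by subregion, and all function names are hypothetical.

```python
import random
from typing import List

Grid = List[List[int]]  # 0 = black pixel, 1 = white pixel


def make_random_sbi(size: int, rng: random.Random) -> Grid:
    """Hypothetical SBI generator: half of the pixels (rounded up) are black."""
    total = size * size
    black = (total + 1) // 2
    cells = [0] * black + [1] * (total - black)
    rng.shuffle(cells)
    return [cells[r * size:(r + 1) * size] for r in range(size)]


def count_hits(region: Grid, sbi: Grid) -> int:
    """Count positions where both pixels are black, i.e. NOR(pixel, basis) is 1."""
    return sum(
        1
        for region_row, sbi_row in zip(region, sbi)
        for pixel, basis in zip(region_row, sbi_row)
        if not (pixel or basis)
    )


def describe(bfri: Grid, sbis: List[Grid], sub: int = 5) -> List[int]:
    """Return one hit count per (subregion, SBI) pair for a square binary FRI."""
    size = len(bfri)
    hits: List[int] = []
    for top in range(0, size, sub):
        for left in range(0, size, sub):
            region = [row[left:left + sub] for row in bfri[top:top + sub]]
            hits.extend(count_hits(region, sbi) for sbi in sbis)
    return hits


rng = random.Random(0)  # fixed seed so the illustrative patterns are repeatable
sbis = [make_random_sbi(5, rng) for _ in range(9)]
all_black = [[0] * 30 for _ in range(30)]
hits = describe(all_black, sbis)  # 36 subregions x 9 SBIs = 324 counts
```

An all-black region is the extreme case: every SBI reports its maximum of 13 hits, which is why 4 bits per count are sufficient.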

Figure 2.2: A 30x30 FRI is divided into 36 5x5 subregions. Nine 5x5 SBIs are subjected to NOR
logic to find coinciding black pixels. The sum is taken of the output to find the descriptor value; in
this example the sum is 8.

It is worth noting that this feature description length is less than the number of bits needed
for more advanced feature descriptors such as SIFT and SURF, which use 256 bytes or 2048 bits
[21]. When compared with BRIEF, another binary descriptor, it depends on the version of BRIEF
that is used. BRIEF-16 uses 128 bits, BRIEF-32 uses 256 bits, and BRIEF-64 uses 512 bits. Thus,
the SYBA 5x5 descriptor has a larger descriptor size than BRIEF. Nevertheless, SYBA has found
value as a competitive feature descriptor, as A. Desai showed that despite being larger, it could
produce better feature matching results. As part of the research in this thesis, it was found that
the SYBA descriptor length could be significantly reduced with only minimal impact to accuracy.
The results and discussion of this are contained in Chapter 4. For the sake of easy comparison to
the numbers listed here, it was found that SYBA 5x5 with only 3 SBIs offers comparable image
feature matching performance and reduces calculations and descriptor length to be only 1/3 of the
original size. So, the SYBA 5x5 descriptor when using only 3 SBIs needs only 432 bits.
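The descriptor sizes quoted in this section all follow from Equation (2.4). The Python sketch below recomputes them; the helper names are hypothetical, and deriving the bit width from the maximum hit count with `int.bit_length` is an assumption of this example that matches the 9-bit and 4-bit widths stated above.

```python
def bits_for(max_hits: int) -> int:
    """Bits needed to store an unsigned hit count in the range 0..max_hits."""
    return max(1, max_hits.bit_length())


def descriptor_length(num_sbis: int, max_hits: int, subregions: int = 1) -> int:
    """Equation (2.4): L = Ns x Nb x R, in bits."""
    return num_sbis * bits_for(max_hits) * subregions


configurations = {
    "SYBA 30x30, 312 SBIs": descriptor_length(312, 450),
    "SYBA 5x5, 9 SBIs": descriptor_length(9, 13, subregions=36),
    "SYBA 5x5, 3 SBIs": descriptor_length(3, 13, subregions=36),
}
for name, bits in configurations.items():
    print(f"{name}: {bits} bits")
```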

2.4 SYBA Feature Matching
The first step in any feature matching algorithm is to determine the distance between two
feature descriptions. Note that this distance does not refer to coordinate distance between the
feature points in the image space, but rather refers to the similarity between the descriptions themselves. Euclidean distances are often used as comparison metrics for this, but this requires complex
operations such as multiplication and square root operators. More basic distance operations include
the Hamming distance and the L1 norm. The Hamming distance requires only an XOR operation
and adders so it is very computationally simple. The L1 norm requires only adders and an absolute
value operation, which can be implemented simply from other basic hardware operations. The L2
norm requires adders, an absolute value operation, a square multiplier and a square root operator.
SYBA makes use of the L1 norm, as it is principally interested in the difference between the number of hits for each SBI and requires only simple hardware. The equation for the L1 norm is given
in Equation (2.5),
d = \sum_{i=1}^{n} |x_i - y_i|    (2.5)

where x_i and y_i are the numbers of hits (unsigned values) in the two descriptors for each of the
n comparisons, and n is the total number of SBIs times the number of sub regions. Here is an
example of the SYBA distance calculation following the L1 norm.
x       = (5 4 3 8 7 6 3 5 6 9 5 4 3 8 7 1 6 2)
y       = (6 8 5 3 6 6 3 5 6 8 3 9 6 5 6 1 4 1)
|x - y| = (1 4 2 5 1 0 0 0 0 1 2 5 3 3 1 0 2 1)

\sum_{i} |x_i - y_i| = 31    (2.6)
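The L1 distance of Equation (2.5) can be sketched in a few lines of Python. This is only an illustration: the function name `l1_distance` is hypothetical, and the two vectors are the hit counts read from the worked example in Equation (2.6).

```python
def l1_distance(x, y):
    """Sum of absolute differences between two equal-length hit-count vectors."""
    if len(x) != len(y):
        raise ValueError("descriptors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


# Hit counts taken from the worked example in Equation (2.6).
x = [5, 4, 3, 8, 7, 6, 3, 5, 6, 9, 5, 4, 3, 8, 7, 1, 6, 2]
y = [6, 8, 5, 3, 6, 6, 3, 5, 6, 8, 3, 9, 6, 5, 6, 1, 4, 1]

distance = l1_distance(x, y)
print(distance)  # 31, matching Equation (2.6)
```

Only subtraction, absolute value, and addition are used, which is why the same computation maps to simple hardware.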

Now that the method for calculating distances between two feature descriptions has been
described, the algorithm for feature matching can be explained. First, the point-to-point
correspondence is determined by comparing each descriptor in the first image to each descriptor in
the second image using the L1 norm as shown. The remaining process is divided into two steps:
a two-pass search and a global minimum requirement. The first pass of the first step finds, for
each feature in the first image, the feature point in the second image which has the minimum
distance. The second pass guarantees that the pair is uniquely matched: it finds, for each feature
in the second image, the feature point in the first image which has the minimum distance. In order
for any pair to form a match, the minimum distance pair found in the first pass must also be the
minimum distance pair found in the second pass. To put this more simply, two features must be
each other's best match in order to be matched. If this is not the case, then the pair is not
considered a match. This is also called cross checking. The
remaining unmatched feature points are sent to the next matching step. The second step is to apply
a global minimum requirement. This finds the minimum distance values between all remaining
feature points and looks for one-to-one matches between these feature points, which are considered
matches as well. This process can be repeated with the remaining unmatched feature points until
no minimum can be found or all distances exceed the global minimum distance threshold. The
smaller the distance, the more similar the two features are. The global minimum
distance threshold can easily be adjusted to reject feature matches with a distance that is too large.
A larger threshold will return more but lower quality matches, whereas a smaller
threshold will return fewer matches of better quality.
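The two matching steps can be sketched as follows. This is one possible reading of the procedure, not the thesis implementation: the names `match_features` and `max_distance` are hypothetical, the global step is written as a greedy loop over the leftover features, and applying the threshold to the cross-checked pairs as well is an assumption.

```python
def l1_distance(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


def match_features(desc1, desc2, max_distance):
    """Return (index1, index2, distance) tuples for matched descriptors."""
    matches = []
    if not desc1 or not desc2:
        return matches
    n1, n2 = len(desc1), len(desc2)
    dist = [[l1_distance(a, b) for b in desc2] for a in desc1]

    # Step 1: two-pass search (cross checking).
    best_in_2 = [min(range(n2), key=lambda j: dist[i][j]) for i in range(n1)]
    best_in_1 = [min(range(n1), key=lambda i: dist[i][j]) for j in range(n2)]
    free1, free2 = set(range(n1)), set(range(n2))
    for i, j in enumerate(best_in_2):
        if best_in_1[j] == i and dist[i][j] <= max_distance:
            matches.append((i, j, dist[i][j]))
            free1.discard(i)
            free2.discard(j)

    # Step 2: global minimum requirement on the leftover features (assumed greedy).
    while free1 and free2:
        i, j = min(((a, b) for a in sorted(free1) for b in sorted(free2)),
                   key=lambda pair: dist[pair[0]][pair[1]])
        if dist[i][j] > max_distance:
            break  # every remaining distance exceeds the threshold
        matches.append((i, j, dist[i][j]))
        free1.discard(i)
        free2.discard(j)
    return matches


desc1 = [[0, 0], [5, 5], [9, 1]]
desc2 = [[5, 6], [0, 1], [20, 20]]
strict = match_features(desc1, desc2, max_distance=10)
loose = match_features(desc1, desc2, max_distance=30)
```

With the small threshold only the two mutually best pairs survive; raising the threshold lets the global step add the remaining, lower quality pair.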

2.5

Flow of Algorithm
This section provides a high level overview of the SYBA algorithm. First, feature
detection is performed on the image and used to generate a feature list. A 30x30 FRI is
then generated around each of the feature points. An average intensity is calculated, which is used
to generate the binary FRI. This is then compared to a selected number of SBIs to compute the
SYBA similarity measure (SSM). The number of hits is counted from the SSM and represents a
unique feature descriptor. A visualization of this algorithm is shown in Figure 2.3. The L1 norm is
then used to compare the descriptors, and the best matched features are found as described using a
two-pass matching strategy with a global minimum requirement.

Figure 2.3: The flowchart of the SYBA algorithm.

CHAPTER 3.

SYBA IN HARDWARE

In this chapter the hardware implementation of the SYBA algorithm is presented and
discussed. The goal is to create a pipelined implementation that can detect, describe, and match
feature points in real time using the SYBA algorithm. The hardware design has been written in
VHDL and implemented on an FPGA board with a ZYNQ XC7Z020 SoC. The
FPGA board used is the Numato Styx, which can be seen in Figure 3.1. The design
assumes that a new pixel arrives every clock cycle. This matches most video and imaging
standards and makes it compatible with a variety of formats such as HDMI and VGA. In addition
to the aforementioned video transmission formats, many hardware cameras also rely upon a pixel
clock to transmit images. In this implementation, an Omnivision OV5642 camera has been
used, which transmits pixels into the hardware design with a reference pixel clock. The camera is
shown in Figure 3.2.

Figure 3.1: Numato STYX FPGA Board with a ZYNQ XC7Z020 SoC

Figure 3.2: Omnivision 5642 Hardware Camera with Parallel Output Interface

3.1

Line Buffers
As pixels enter the hardware design, they are converted into an 8-bit grayscale value and
entered into a series of line buffers. These are necessary to be able to access regions of pixels
simultaneously in a pipelined design. Since SYBA uses a 30x30 FRI, 30 line buffers are needed
to be able to access 30 rows simultaneously. Two different hardware designs have been tested and
implemented for these. The first design uses shift registers for line buffers, and the second design
uses BRAMs for line buffers with a small shift register window.
The shift register design was implemented first and was initially easier to design. Essentially the
idea is to declare a large array of 30 ∗ Xres 8-bit values, where Xres is the horizontal resolution
of the image, and shift new pixel values in with the
pixel clock. So if the image is VGA resolution at 640x480, the shift register array is of
size 19,200 x 8 bits. Then as new pixels enter the hardware design, earlier pixels are shifted
along as shown in Figure 3.3.
Figure 3.3: Using shift registers as line buffers.

New pixels are shifted into n0 and pass up through all of the line buffers. To access pixels
in the same row, a simple offset into the array from the incoming pixel is needed. To access pixels
in previous rows, an offset corresponding to Xres is used. Setting up the line buffers in this manner
allows easy parallel access to any of the needed pixels in the FRI, and can also be used for any
other accesses needed, such as for feature detection. This design has several
disadvantages. The first is that in FPGA implementations it creates relatively high LUT usage.
LUTs can be configured as shift registers in FPGAs, so for a typical LUT6 design a single LUT
can hold 2^6 = 64 values. So in the above example at VGA resolution, to create 30 line buffers a
total of 19,200 ∗ 8/64 = 2400 LUTs are needed at a bare minimum. Additional LUTs are inserted
to be able to access the needed pixels, which would be 900 in this case for the FRI. This LUT usage
grows as the image size grows; real-time image processing on 1080p images would require 3x as
many LUTs, for example. A second, minor disadvantage of this method is that it places much higher
strain on CAD tools, causing synthesis and implementation to take much longer.
A better design is to take advantage of BRAMs within the FPGA to store incoming pixels.
This method dramatically reduces LUT usage within the FPGA, along with greatly reduced strain
on synthesis and implementation tools. The problem is that BRAMs can have a maximum of
2 ports, a far cry from the minimum 900 parallel accesses needed for SYBA. The solution is to
create shift registers only in the window where parallel access is needed, and place the remainder of
the line into a BRAM as shown in Figure 3.4. Wres is the width of the window in pixel coordinates,
and y is the height of the window - 1.

Figure 3.4: Using shift registers and BRAM as line buffers.

The key to making this design work is to correctly set the read and write addresses of the
BRAMs in order to emulate a line buffer. A simple inspection reveals that these need only operate
as simple queues or FIFOs. To achieve this using the native ports of a BRAM, the following steps
are needed. First, initialize all memory in the BRAM to zero. Second, increment the write
address on every clock and roll over when the value is Xres − Wres − 1. Third, set the read address to
always be the write address + 1, accounting for rollover.
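The three addressing steps can be checked with a small behavioural model. This is a sketch rather than the VHDL used in the design: the class name `BramLineDelay` is hypothetical, and the registered read port with one cycle of latency is an assumption about how the BRAM is configured.

```python
class BramLineDelay:
    """Behavioural model of one BRAM used as a fixed-length pixel delay."""

    def __init__(self, depth):
        if depth < 2:
            raise ValueError("depth must be at least 2")
        self.depth = depth          # corresponds to Xres - Wres in the text
        self.mem = [0] * depth      # step 1: memory initialised to zero
        self.waddr = 0
        self.dout = 0               # registered read output (assumed 1-cycle latency)

    def clock(self, din):
        """Advance one pixel clock and return the pixel leaving the BRAM."""
        raddr = (self.waddr + 1) % self.depth       # step 3: read = write + 1
        out = self.dout                             # value visible this cycle
        self.dout = self.mem[raddr]                 # synchronous read of old data
        self.mem[self.waddr] = din                  # write the incoming pixel
        self.waddr = (self.waddr + 1) % self.depth  # step 2: roll over after depth - 1
        return out


delay = BramLineDelay(depth=5)
pixels = list(range(1, 13))
outputs = [delay.clock(p) for p in pixels]
print(outputs)  # five zeros, then the input stream delayed by five clocks
```

Under these assumptions each pixel reappears exactly `depth` clocks after it was written, so the BRAM plus the Wres-wide shift register window together delay a pixel by one full image line.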
In this design, shift registers are only used in the sections of the line buffers where parallel
access is needed. Using the example above, a total of 900 ∗ 8/64 = 113 LUTs are needed at a
bare minimum. This represents a LUT reduction of about 95%. The tradeoff of course is that
a BRAM block is needed for every line. In practice, this is a better tradeoff as the number of
BRAMs used is relatively small. Another benefit of this design is that LUT usage does not scale as
the resolution increases, since the window size remains the same. Actual implementation resource
usage differences between the two line buffer methods can be seen towards the end of the chapter
in Table 3.1.
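The minimum LUT counts quoted for the two line buffer styles can be reproduced with simple arithmetic. The sketch below only restates the estimate from the text; the function name is hypothetical, and the figure of 64 stored bits per LUT is taken from the text as given.

```python
import math


def shift_register_luts(num_pixels, bits_per_pixel=8, bits_per_lut=64):
    """Minimum LUTs needed to hold num_pixels in LUT-based shift registers."""
    return math.ceil(num_pixels * bits_per_pixel / bits_per_lut)


full_lines = shift_register_luts(30 * 640)   # 30 complete VGA lines
window_only = shift_register_luts(30 * 30)   # only the 30x30 window
reduction = 1 - window_only / full_lines

print(full_lines, window_only, round(reduction * 100))  # 2400 113 95
```

The window-only count does not depend on the image width, which is why LUT usage stays constant as the resolution increases.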

3.2

Feature Detection - FAST
As discussed previously, the first step is to detect feature regions of interest and generate
a list of features. Any feature detector could be used for this purpose, and in this case Features
from Accelerated Segment Test (FAST) has been chosen. FAST is a straightforward and
hardware-friendly algorithm which was originally proposed by Rosten and Drummond [28] for identifying
feature points in an image. It can be implemented in an FPGA with low resource utilization. It
has also been used in various other hardware implementations with BRIEF as a feature descriptor,
so it also provides a consistent basis for comparison.
FAST, like any other feature detector, identifies whether certain points of an
image are of interest or not. These are generally centered corner points, and FAST is no exception.
FAST works by selecting a candidate center pixel P and a threshold T. The intensity of this pixel,
I_p, is compared with a circle of 16 pixels around the candidate center pixel P. This occurs within a
window region of 7x7 pixels as shown in Figure 3.5. P is considered to be a corner if there exists a
set of n contiguous pixels in the circle which are all brighter than I_p + T or all darker than I_p − T.
In this implementation n = 9 has been chosen. The value n can be adjusted to be more or
less exclusive; for example, many implementations of FAST use n = 12. In this work 9 was chosen
to be less exclusive and allow more feature points to pass through.

Figure 3.5: Example of FAST feature detector

In most software implementations this detection is made faster by only looking at smaller
subsets of the pixel ring to determine if a pixel can be rejected more quickly. In our hardware
implementation all comparisons are done in a single parallel step, so looking at subsets of the pixel
ring is not necessary. To access these pixel values the same line buffers which store the 30x30 FRI
window region are used and the location of the 7x7 region within the FRI is adjusted off of the
center point to account for any delays in the pipelining and line buffers. This method eliminates
the need to store feature point coordinate locations and makes the hardware solution more efficient.
The result after performing the comparisons is two 16-bit values B and D as shown in Figure 3.6.
It is then a simple logic function to find if there exist n = 9 contiguous bits which are all HIGH in
either B or D. If this is the case then the pixel at the center of the 7x7 region is considered to be a
corner point.
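The contiguity test on the two 16-bit values can be sketched in software as a rotate-and-AND reduction, which mirrors the kind of parallel logic a hardware implementation would use. The function names are hypothetical, and the ring pixels are assumed to be supplied in circular order around the center.

```python
RING_SIZE = 16


def rotate_left(value, shift, width=RING_SIZE):
    """Circular left rotation of a width-bit value."""
    mask = (1 << width) - 1
    value &= mask
    shift %= width
    return ((value << shift) | (value >> (width - shift))) & mask


def has_contiguous_run(bits, n=9, width=RING_SIZE):
    """True if bits contains n circularly contiguous HIGH bits."""
    acc = bits & ((1 << width) - 1)
    for k in range(1, n):
        acc &= rotate_left(bits, k, width)
    return acc != 0


def fast_is_corner(center, ring, threshold, n=9):
    """Segment test: build B (brighter) and D (darker), then test both."""
    brighter = darker = 0
    for i, p in enumerate(ring):
        if p > center + threshold:
            brighter |= 1 << i
        if p < center - threshold:
            darker |= 1 << i
    return has_contiguous_run(brighter, n) or has_contiguous_run(darker, n)


nine_bright = fast_is_corner(100, [200] * 9 + [100] * 7, threshold=20)
eight_bright = fast_is_corner(100, [200] * 8 + [100] * 8, threshold=20)
wrapped = [200 if (i >= 12 or i <= 4) else 100 for i in range(16)]
wrap_bright = fast_is_corner(100, wrapped, threshold=20)
nine_dark = fast_is_corner(100, [50] * 9 + [100] * 7, threshold=20)
```

A bit of the accumulated value stays HIGH only if it and the n − 1 bits before it on the ring are all HIGH, so a non-zero result means a run of length n exists, including runs that wrap around the end of the ring.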
The final step needed for feature detection is to add non-maximal suppression. Often many
feature points are found in relatively close proximity to each other, so non-maximal suppression
is needed to keep only the best features. In this work we are using a 7x7 window to achieve
non-maximal suppression. This means that at most one feature point should be in any 7x7
region of the image space. This is achieved with the following steps. First, a score function V is
calculated for all the detected feature points. If the pixel is not a detected feature point then V = 0.
Otherwise, V is the sum of absolute differences between the center pixel P and the surrounding
16 pixels. Then, this score value is output from the detector and stored in another 7 line buffers
as new score values are entered. This allows a 7x7 area to be accessed in parallel. Finally,
the center pixel is compared to the other 48 pixels in the region, and if there is any pixel with a
score lower than the center pixel, then these pixels’ scores are suppressed and set to 0.
The output from the final line of these line buffers of scores is what determines if the point
was a good feature point. This means that the point was detected by FAST and therefore passed all
of the thresholds necessary, and that it was also the best feature point in its 7x7 region.
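A software sketch of the score and the 7x7 suppression is given below. It is a frame-based approximation of the streaming hardware, written under stated assumptions: the function names are hypothetical, a feature is kept only if its score is strictly greater than every other score in its window, and windows are clipped at the image border.

```python
def fast_score(center, ring, is_corner):
    """Score V: sum of absolute differences to the 16 ring pixels, or 0."""
    if not is_corner:
        return 0
    return sum(abs(p - center) for p in ring)


def non_max_suppress(scores, window=7):
    """Keep a score only where it is the strict maximum of its window."""
    height, width = len(scores), len(scores[0])
    r = window // 2
    kept = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            v = scores[y][x]
            if v == 0:
                continue  # not a detected feature point
            neighbours = [
                scores[j][i]
                for j in range(max(0, y - r), min(height, y + r + 1))
                for i in range(max(0, x - r), min(width, x + r + 1))
                if (j, i) != (y, x)
            ]
            if all(v > n for n in neighbours):
                kept[y][x] = v
    return kept


scores = [[0] * 9 for _ in range(9)]
scores[4][4] = 50   # weaker feature, two pixels from a stronger one
scores[4][6] = 80   # strongest feature in its neighbourhood
scores[0][0] = 10   # isolated feature, outside both 7x7 windows
kept = non_max_suppress(scores)
```

In this example the score of 50 is suppressed by the nearby score of 80, while the isolated feature survives because no other feature falls inside its window.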

3.3

Feature Description with SYBA
Figure 3.6: Parallel Access and Comparison for FAST Feature Detector. Pc is the center pixel. Pn
registers are the ring of pixels. Bn registers indicate if the pixel is brighter than the center plus the
threshold. Dn registers indicate if the pixel is darker than the center minus the threshold.

Once a region has been determined to have a good feature, a 30x30 feature region of interest
(FRI) is centered around the feature point. The FRI needs to be binarized, and this is done according
to Equation (2.2), which generates the binarized FRI (BFRI) by comparing each pixel value to
the average intensity of the FRI. While the approach implied by the equation, looping through all
pixel coordinates, can work well in software, there is a much more efficient way to do this in a pipelined
hardware approach. Since each individual pixel must be passed through the FPGA fabric, we can
take advantage of this fact to calculate the average and perform the binarization more efficiently.
This method takes advantage of the fact that we are already storing the incoming pixels
in 30 line buffers. The optimization is as follows. First, as a new pixel comes into the image, its
column total is found by adding this pixel value to the other 29 pixels directly above it
from the line buffers. The column average is then found by dividing the column total by 30. These
column averages are stored in a shift register that holds 30 of them as unsigned numbers.
The area total for all 900 pixels in the 30x30 area is then updated by adding the incoming column
average and subtracting the outgoing column average, and the area average is found by dividing by
30 again. This reduces the number of additions per pixel from 899 to 29+1 additions and 1
subtraction. This is shown in Figure
3.7.
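The running update can be illustrated with a short sketch. It is an approximation of the hardware behaviour, not the design itself: the function name is hypothetical, each input column is assumed to already contain the new pixel together with the pixels above it from the line buffers, and exact integer division stands in for the shift-based division used in hardware.

```python
from collections import deque


def sliding_area_average(columns, size=30):
    """Yield the area average after each new column enters the window."""
    column_averages = deque([0] * size, maxlen=size)  # shift register of averages
    running_total = 0
    averages = []
    for column in columns:
        column_average = sum(column) // size          # size - 1 additions, one divide
        # Add the incoming column average and subtract the outgoing one.
        running_total += column_average - column_averages[0]
        column_averages.append(column_average)        # oldest entry drops out
        averages.append(running_total // size)
    return averages


# Small 3x3 example so the arithmetic is easy to follow by hand.
small = sliding_area_average([[3, 3, 3], [6, 6, 6], [9, 9, 9], [12, 12, 12]], size=3)
print(small)  # [1, 3, 6, 9]

# A constant image settles to its pixel value once the window is full.
flat = sliding_area_average([[90] * 30 for _ in range(40)])
```

Once the window is full, each new pixel costs one column sum plus one addition and one subtraction, regardless of how many pixels the window covers.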
In order to efficiently perform the two divisions by 30 to find the averages, a traditional
hardware divider can be replaced with shifts and additions since the divisor is constant. A
quick analysis shows that 1/30 ≈ 0.03333. This can be approximated as 1/32 +
1/512 ≈ 0.03320. Performing the division in this manner only costs a single adder, since performing shifts in hardware is as simple as rearranging the bits that are being used.
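The shift-and-add division can be checked numerically. In the sketch below the function name `approx_div30` is hypothetical; right shifts by 5 and 9 implement the factors 1/32 and 1/512, and the range tested is the largest possible column total for 8-bit pixels (30 × 255).

```python
def approx_div30(value):
    """Approximate value / 30 as value/32 + value/512 using shifts."""
    return (value >> 5) + (value >> 9)


constant = 1 / 32 + 1 / 512          # 0.033203125
largest_total = 30 * 255             # 7650, the largest 8-bit column total

# Largest difference from exact integer division over the whole input range.
worst_error = max(v // 30 - approx_div30(v) for v in range(largest_total + 1))
print(constant, approx_div30(largest_total), worst_error)
```

Over this range the approximation never overshoots and undershoots the exact integer quotient by at most 2 grey levels, partly from the constant being about 0.4% low and partly from truncation in the shifts.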
Now that the area average has been calculated, binarizing the entire FRI is straightforward.
All 900 pixels in the FRI are compared to the area average as in Equation (2.3). Since these
comparisons are all performed every clock cycle, 900 8-bit comparators are required. In Chapter 4 a
modification to SYBA is explored to reduce the number of resources needed.
Since the resulting binary FRI (BFRI) is relatively small (900 bits=113 bytes), it is stored
in registers and then sent to a module to calculate the unique SYBA similarity measure (SSM).
As explained in the previous chapter, this is used to measure the similarity between the FRI and
the synthetic basis images (SBIs). The SSM is made by overlaying each SBI with the BFRI and
counting the number of matching black pixels, which is easily implemented in hardware as a series
of nor gates and adders. In the original work of SYBA, two versions exist which are denoted as
SYBA 5x5 and SYBA 30x30. SYBA 30x30 compares the entire 30x30 BFRI with 312 30x30
SBIs to form the SSM, whereas SYBA 5x5 breaks the 30x30 BFRI into 36 5x5 subsections and
compares each of these to 9 5x5 SBIs. In general, SYBA 30x30 offers slightly better performance
at the cost of more memory and resources. In this implementation SYBA 5x5 was chosen as being
more balanced and practical for a hardware implementation. This means the total descriptor length
is 36x9x4 = 1296 bits = 162 bytes as given by Equation (2.4). This is for the 36 sub regions, 9 SBIs
per sub region, and 4 bits per SBI. At this point the descriptor is finalized and ready to be saved or
matched.
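The hit counting for SYBA 5x5 can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, black pixels are represented by 0, the three 5x5 patterns are made-up placeholders rather than the SBIs used by SYBA, and the ordering of counts within the descriptor is chosen for convenience.

```python
def syba_5x5_descriptor(bfri, sbis, sub=5):
    """Count coinciding black (0) pixels for every sub region and SBI."""
    size = len(bfri)
    descriptor = []
    for top in range(0, size, sub):
        for left in range(0, size, sub):
            for sbi in sbis:
                hits = sum(
                    1
                    for r in range(sub)
                    for c in range(sub)
                    # NOR of the two bits: a hit only where both are black.
                    if bfri[top + r][left + c] == 0 and sbi[r][c] == 0
                )
                descriptor.append(hits)
    return descriptor


# Placeholder patterns for illustration only.
checker = [[(r + c) % 2 for c in range(5)] for r in range(5)]       # 13 black pixels
all_white = [[1] * 5 for _ in range(5)]                              # no black pixels
top_row = [[0 if r == 0 else 1 for _ in range(5)] for r in range(5)]  # 5 black pixels
sbis = [checker, all_white, top_row]

black_fri = [[0] * 30 for _ in range(30)]
descriptor = syba_5x5_descriptor(black_fri, sbis)
print(len(descriptor), descriptor[:3])  # 108 [13, 0, 5]
```

With 3 patterns the descriptor holds 36 × 3 = 108 counts, which at 4 bits per count gives the 432 bits quoted for the reduced descriptor; with 9 patterns it holds 324 counts, or 1296 bits.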

Figure 3.7: Graphical Flow of Efficient Averaging Finder

3.4

Feature Matching
Feature matching can be achieved with a variety of different methods and algorithms. Often
the method of choice is based upon the application, be it stereo vision, visual odometry, 3D
reconstruction, optical flow, or any other computer vision application. Due to the variety of
potential user applications, and since SYBA is a feature descriptor only, the decision was made to
transition the feature descriptions to the software side of the ZYNQ. The choice to transition from
hardware to software at this point is also supported by (1) the vastly reduced data and bandwidth
at this point, and (2) the ease of programming and modularity with software.
Compared to the raw incoming images, just transferring the feature coordinates and descriptors reduces the data rate by more than 2 orders of magnitude, to a level that is achievable
for embedded processors. Writing software is also typically much more rapid and modular than
designing hardware. Programmers are able to leverage many common libraries such as OpenCV to
perform common algorithms such as optical flow, stereo matching, and visual odometry. This also
allows for easy connection to other peripherals and makes the device much more practical which
enables the device as a whole to easily interface to other systems. There are different benefits to
doing much of the SYBA feature matching in hardware as well. A discussion on a hardware based
solution for SYBA feature matching is presented in Chapter 5.
In order to hand off the SYBA descriptors from the programmable logic (PL) to the processor
system (PS), an AXI-Stream FIFO is instantiated. This block receives data on a 32-bit bus and
is synchronized with valid, ready, and last handshakes. In order to send the 1,296-bit descriptor, a
small state machine shifts out the correct 32-bit piece of the description on each transfer. Once the
description has been sent over the AXI-Stream, a final 32-bit value represents the pixel coordinate
location of the feature. The last handshaking signal is asserted to indicate the end of the feature
descriptor transmission. On the software side the AXI-Stream FIFO is simply a memory-mapped
component, and the received data is simply saved into DDR memory. From there, feature distances are
calculated with the L1 norm as in Equation (2.5). With feature coordinates and distances in the
DDR, it is relatively simple to use SYBA in a variety of applications.
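The word count of one transfer can be illustrated by packing a descriptor into 32-bit words. The layout below is purely an assumption made for illustration, since the text does not specify it: eight 4-bit counts per word with the first count in the lowest bits, zero padding in the last descriptor word, and a final word holding x in the low 16 bits and y in the high 16 bits. The function name is hypothetical.

```python
def pack_descriptor(hits, x, y):
    """Pack 4-bit hit counts and a coordinate into a list of 32-bit words."""
    words = []
    for start in range(0, len(hits), 8):
        word = 0
        for k, count in enumerate(hits[start:start + 8]):
            if not 0 <= count <= 15:
                raise ValueError("each hit count must fit in 4 bits")
            word |= count << (4 * k)
        words.append(word)
    # Final word: pixel coordinate of the feature (assumed layout).
    words.append(((y & 0xFFFF) << 16) | (x & 0xFFFF))
    return words


# 36 sub regions x 9 SBIs = 324 counts = 1296 bits.
words = pack_descriptor([1] * 324, x=3, y=2)
print(len(words))  # 42: 41 descriptor words plus 1 coordinate word
```

Because 1296 is not a multiple of 32, the descriptor occupies 40 full words and one half-filled word under this layout, so the last handshaking signal would accompany the 42nd word.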

3.5

Analysis of Speed and Resource Utilization of Solution
Table 3.1: SYBA resource utilization for the FAST detector and SYBA descriptor.
Implementation Type            LUT      FF       BRAM
Shift Register Line Buffers    15447    10630    0
Block RAM Line Buffers         10049    9674     18

The design presented in this chapter is a pipelined hardware design that is capable of processing
a new pixel every clock cycle. This has been implemented on a Xilinx Zynq XC7Z020
board and meets timing with the default clock of 100 MHz. This makes calculating the maximum
frame rate for a given image size a straightforward process as given by Equation (3.1),

F_{fps} = f_{clk} / T_p    (3.1)

where F_{fps} is the frame rate, f_{clk} is the clock frequency, and T_p is the total number of pixels in the
frame.
This maximum frame-rate analysis only considers the hardware aspect of the design, which
in this case is the feature detector and feature descriptor. The assumption is made that the software
side of the chip is able to keep up with any necessary feature matching algorithms. For more
demanding feature matching tasks an example hardware circuit demo with feature matching for
SYBA is described in Chapter 5.
Careful analysis of the timing circuitry could allow for faster clock speeds and thus higher
potential maximum frame rates. Higher frame rates can also be achieved with higher end FPGAs,
which are built on newer transistor processes and achieve higher clock speeds. The speed
of image processing is normally highly important for embedded systems that require computer
vision. Often an embedded system is required because the system will be used in
an environment that has power, weight, and size constraints, such as on UAVs. Since these systems
are generally in motion, a higher image processing speed is desired to be able to react more quickly.
The resource utilization for just the feature detector and descriptor is given in Table 3.1.
The table also demonstrates how large an impact the choice of line buffer style can have on
the FPGA's resource utilization. In Chapter 4, various optimizations are discussed which improve
the resource utilization. Even without any major optimizations, the result of creating a hardware
implementation of SYBA is an accurate, capable, fast, low power and low weight solution for
embedded vision systems.

Table 3.2: Frame rates for different image sizes at 100 MHz clock
Image Resolution    Frame Rate at 100 MHz Clock (frames per second)
640x480             326
1280x720            109
1920x1080           48
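The entries of Table 3.2 follow directly from Equation (3.1). The short sketch below recomputes them; the function name is hypothetical, and rounding to the nearest whole frame is an assumption that happens to reproduce the tabulated values.

```python
def max_frame_rate(clock_hz, width, height):
    """Equation (3.1): one pixel per clock, so rate = f_clk / T_p."""
    return clock_hz / (width * height)


resolutions = [(640, 480), (1280, 720), (1920, 1080)]
rates = {res: round(max_frame_rate(100e6, *res)) for res in resolutions}
print(rates)
```

Since the pipeline accepts one pixel per clock, the frame rate scales inversely with the pixel count and linearly with any increase in clock frequency.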

3.6

Graphical Flow of SYBA Hardware
The purpose of this section is to provide a summary of the hardware described in this chapter.
Figure 3.8 gives an overview of the hardware described. In this visual an Omnivision
OV5642 provides pixel data to the hardware system. Pixels then enter line buffers, which are
made via shift registers in this example. These line buffers are used for both the FAST detector
and the binarization region. Within the FAST detector the continuity test and score calculation are
performed, and the scores are put into additional line buffers.
These line buffers are used to suppress non-maximal feature points, and the output
from the non-maximal suppression indicates whether a feature is at this point. After some delay (which
is determined by the resolution of the image), the SYBA descriptor block takes in the values
from the binarized FRI. This is then compared with the SBIs, and the hits are counted. The pixel
coordinates are concatenated to the end of the descriptor, and this enters a FIFO for descriptor
values and coordinates. A state machine reads through these and converts them into a 32-bit
AXI-Stream protocol. These arrive at the ZYNQ processing system, which then handles the L1 norm
feature matching.

Figure 3.8: Graphical Flow of SYBA Hardware

CHAPTER 4.

SYBA OPTIMIZATIONS

This chapter discusses the optimizations made to the SYBA algorithm to make it more efficient
in general and also more hardware efficient. In some cases the changes make little or no
difference to the descriptor calculation, and therefore have minimal impact on the quality of the
descriptor, but can have a large impact on the resource utilization for the FPGA. Other changes that do
alter the descriptor calculation significantly are discussed here as well. When significant changes
have been made to the feature descriptor, accuracy tests are run again to understand
how the change has impacted the feature descriptor's performance.

4.1

Accuracy Test and Image Sequences Used
Accuracy tests were performed on the common Oxford image dataset
(www.robots.ox.ac.uk/~vgg/research/affine/). These image sequences are frequently used in academic
literature to test a variety of feature descriptors, including BRIEF, SYBA and others. They
are used here to test the modifications made to the SYBA descriptor, and to help understand how
the optimizations impact the descriptor. The image sequences include disturbances such as
blurring, lighting variation, viewpoint changes, and image compression.
Figure 4.1: Image sequences from left to right: Bikes (Image blurring), Graffiti (Viewpoint
change), Leuven (Illumination change)

Figure 4.2: Image sequences from left to right: Trees (Image blurring), UBC/JPEG Compression
(image compression artifacts), Wall (Viewpoint change)

In order to determine accuracy measurements, OpenCV was used with its Python libraries
to accelerate the process. The following steps were used to test the image sequences. First, features
were found in image I_1 using Features from Accelerated Segment Test (FAST) as the feature
detector. Detection thresholds were adjusted so that the number of detected features
in each image was about 1,000. The original SYBA 5x5 descriptor is then calculated
for each detected feature. In order to match features to the second image, the dataset provides a
homography H which can be used to find the pixel coordinates of the same point in the second
image. The homography is used as in Equation (4.1),

p_2 = H p_1    (4.1)

where p_1 is the point in the first image and p_2 is the point in the second image. The SYBA description
algorithm is then used around the calculated point p_2 to describe the image feature. The features
are then matched with OpenCV’s Brute-Force matcher.
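Applying Equation (4.1) to pixel coordinates involves homogeneous coordinates, which the following sketch makes explicit. The function name is hypothetical and the matrices are toy values rather than homographies from the dataset; H is assumed to be a 3x3 matrix stored as nested lists.

```python
def apply_homography(H, point):
    """Map (x, y) in the first image to (x', y') in the second image."""
    x, y = point
    # p_2 = H p_1 with p_1 = (x, y, 1) in homogeneous coordinates.
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity")
    return (u / w, v / w)  # divide by w to return to pixel coordinates


shift = [[1, 0, 5], [0, 1, -2], [0, 0, 1]]              # pure translation
scaled = [[2 * value for value in row] for row in shift]  # same map, scaled by 2

moved = apply_homography(shift, (10, 20))
moved_scaled = apply_homography(scaled, (10, 20))
print(moved, moved_scaled)  # (15.0, 18.0) in both cases
```

The division by the third component is what makes a homography defined only up to scale, so the two matrices above describe the same mapping.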
Distance is calculated using the L1 norm following the SYBA specification. This Brute-Force
matcher is very simple: it takes each feature from the first image and compares it with every
feature from the second image. The closest feature is then returned as a match. To
increase the accuracy, cross checking is enabled, which ensures that for every feature match from
image 1 to a feature in image 2, the same pair also has the minimum distance when matching
from image 2 back to image 1. Matches which exceed a global minimum distance threshold are then
filtered out. This follows the same process outlined in the matching strategy in the original SYBA
description. Enabling cross checking naturally decreases the total number of matches, as some
matches are discarded when the minimum distance pairs do not align. The accuracy rate is defined
as the ratio of the number of correct matches Nc to the total number of matches found Nt. The first
image in every image sequence is always used as the base image, and the following images are
compared against it.

4.2

Changing the Number of SBIs
One of the largest discoveries in optimizing SYBA was that reducing the number of
SBIs does not change the feature matching performance very much. This of course is an
optimization that can be applied at both the software and hardware level. While run time results for the
software implementation are not included, anecdotally it is worth mentioning that the run times
were significantly faster in the OpenCV-Python implementation with fewer SBIs, as would be
expected. This makes sense since there are fewer comparisons being done between SBIs and FRIs.
Additionally, the descriptor becomes smaller, so calculating the L1 norm is faster,
as is the entire feature matching process.

Figure 4.3: Number of SBIs vs. Recognition Rate for Trees Image Sequence

Results using the Oxford image sequences are discussed below. It can be seen that on most
image sequence pairs reducing the number of SBIs has very minimal impact until fewer than 3 SBIs
are used. Image pairs that suffer in particular when fewer than 3 SBIs are used include the Wall
1-4, Wall 1-5, Wall 1-6, Bikes 1-6, and UBC 1-6 image pairs.
In order to better quantify the overall recognition rate as the number of SBIs is reduced, the
average recognition rate is found over every image pair and sequence. This is shown in Figure 4.9.
It can be seen that with the full number of SBIs (n=9) the average recognition rate is 76.12%; when
the total number of SBIs is reduced to n=3, the overall recognition rate only drops to 75.98%.
Even when using only a single SBI the overall recognition rate remains at an impressive 74.44%.
What makes this a particularly intriguing optimization is that the descriptor length and
calculation complexity depend linearly upon the number of SBIs as shown in Table 4.1. This roughly
corresponds to the hardware complexity for the descriptor, and so is well worth the optimization.
Figure 4.4: Number of SBIs vs. Recognition Rate for Bikes Image Sequence

Figure 4.5: Number of SBIs vs. Recognition Rate for Wall Image Sequence

Figure 4.6: Number of SBIs vs. Recognition Rate for UBC Image Sequence

Figure 4.7: Number of SBIs vs. Recognition Rate for Leuven Image Sequence

Figure 4.8: Number of SBIs vs. Recognition Rate for Graffiti Image Sequence

Figure 4.9: Number of SBIs vs. Average Recognition Rate Across All Image Sequences

Table 4.1: This table demonstrates a linear dependency on Number of SBIs for pixel comparisons and descriptor length, and a non-linear dependency for accuracy.
Number of SBIs    Single Pixel Comparisons    Descriptor Length (bits)    Accuracy Test
9                 4212                        1296                        76.12%
8                 3744                        1152                        76.18%
7                 3276                        1008                        76.11%
6                 2808                        864                         76.02%
5                 2340                        720                         76.02%
4                 1872                        576                         75.97%
3                 1404                        432                         75.98%
2                 936                         288                         75.34%
1                 468                         144                         74.44%

Given these findings, it is the author’s opinion that the best number of SBIs to use is 3.
A small decrease in accuracy of less than one quarter of one percent is well worth it, as logic
utilization within the descriptor is reduced to about 1/3 of the amount required for the full
complement of SBIs. The descriptor itself also becomes 1/3 of its previous length as given by
Equation (2.4). It has also been noted that matching in software is likewise performed about 3x as
quickly. Further reducing the number of SBIs was considered, but rejected based on the relatively
larger drop in recognition rate not only in the averages, but also in certain revealing image pairs as
mentioned earlier in the section. This is one of the key optimizations that has allowed SYBA to be
competitive in hardware with other small binary feature descriptors such as BRIEF. The reduction
in resource utilization can be seen in Table 4.2.
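The linear columns of Table 4.1 can be regenerated from two per-SBI constants. In the sketch below the function name is hypothetical; the figure of 468 single pixel comparisons per SBI is taken from the table (4212 / 9), and the descriptor length uses 36 sub regions with 4 bits per count.

```python
COMPARISONS_PER_SBI = 468   # from Table 4.1: 4212 comparisons / 9 SBIs


def descriptor_cost(num_sbis, sub_regions=36, bits_per_count=4):
    """Pixel comparisons and descriptor length for a given number of SBIs."""
    return {
        "comparisons": COMPARISONS_PER_SBI * num_sbis,
        "bits": sub_regions * num_sbis * bits_per_count,
    }


full = descriptor_cost(9)
reduced = descriptor_cost(3)
print(full, reduced)
```

Going from 9 to 3 SBIs cuts both quantities to exactly one third, while the measured accuracy in the table drops by only 0.14 percentage points.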

4.3

Changing the Method of Binarization
As mentioned previously, the feature region of interest (FRI) must be binarized around the
feature point. In the original SYBA work this is done by calculating the average intensity of the
FRI as in Equation (2.2).
This optimization comes in the way the binarized image is made from the comparison to
the average value. In the original SYBA work, all 900 pixels are compared to the average value.
In a pipelined hardware design which has a new incoming pixel every clock cycle, this would mean
that 900 8-bit comparators are needed to generate the binarized FRI. In order to reduce the number
of comparators needed, a new approach to the binarization is presented. The idea is to
pass a kernel of size k × k over the image and find the average value within that kernel. This can
be represented by Equation (4.2),
avg_{xy} = \frac{1}{k^2} \sum_{i=x-k/2}^{x+k/2} \sum_{j=y-k/2}^{y+k/2} p_{ij}    (4.2)

where k is the kernel size, p is the grayscale value of a pixel, and x and y are the pixel coordinates
in the image space.
The pixel in the center of the kernel is then compared to the average value and binarized.
This pixel is put onto 30 line buffers which only hold 1-bit values. The resulting image space is
given by Equation (4.3), where BIMG represents the binarized image space.

BIMG(x, y) = \begin{cases} 1 & p_{xy} > avg_{xy} \\ 0 & \text{otherwise} \end{cases}    (4.3)

A 30x30 window into these line buffers is then used to grab the binarized FRI centered
around the detected feature point. From an FPGA hardware perspective, this method has the
advantage of only requiring a single comparator at the cost of using additional BRAMs for the
binary line buffers. This is an engineering tradeoff, but it is well worth it, as can be seen
in the resource usage comparison in Table 4.2.
From a computer vision perspective, this represents a significant change to the way the image FRI is binarized. Rather than taking the average of just the FRI and comparing all pixels
to that average, each individual pixel is now compared to the average of a 30x30 region centered
on that pixel. This means that for pixels near the edges of the FRI, a significant part of the area used in the average calculation lies outside the FRI. In theory this could be good, as it gives the pixel
additional “context” from other parts of the image space, but it could also be harmful, as the
average no longer reflects the entire FRI. Since this represents a significant departure
from the method used to binarize the FRIs in the original SYBA implementation, additional study
is needed and is presented below.
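The difference between the two binarization strategies can be illustrated in software. The following is a minimal NumPy sketch, not the test framework used in this thesis. The function names, the edge-replicating border handling, and the placement of the k × k window for even k (it starts k/2 pixels above and to the left of the center pixel) are assumptions made for illustration.

```python
import numpy as np


def binarize_fri_original(fri):
    """Original SYBA method: compare every FRI pixel to the mean of the whole FRI."""
    fri = np.asarray(fri, dtype=np.float64)
    return (fri > fri.mean()).astype(np.uint8)


def local_mean(img, k):
    """Mean of the k x k neighborhood around every pixel (Equation 4.2).

    Assumptions: borders replicate the edge pixels, and the window covers
    rows x - k//2 .. x - k//2 + k - 1 (likewise for columns).
    """
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    before, after = k // 2, k - 1 - k // 2
    padded = np.pad(img, ((before, after), (before, after)), mode="edge")
    # Integral image with a leading row and column of zeros.
    ii = np.zeros((h + k, w + k), dtype=np.float64)
    ii[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    sums = ii[k:k + h, k:k + w] - ii[:h, k:k + w] - ii[k:k + h, :w] + ii[:h, :w]
    return sums / (k * k)


def binarize_image_kernel(img, k=30):
    """Kernel method: 1 where a pixel exceeds its own local mean (Equation 4.3)."""
    img = np.asarray(img, dtype=np.float64)
    return (img > local_mean(img, k)).astype(np.uint8)


def extract_binary_fri(bimg, x, y, size=30):
    """Cut a size x size window around feature (x, y).

    Assumes the feature lies far enough from the image border.
    """
    half = size // 2
    return bimg[y - half:y - half + size, x - half:x - half + size]
```

In this sketch the whole image is binarized once and the FRI is simply a window into the result, which mirrors the single-comparator, 1-bit line buffer arrangement described above.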
As was done for changing the number of SBIs, the change is implemented and tested in
Python and OpenCV using the framework described above. In each of the following graphs (Figures
4.10 to 4.16), the first bar shows the image pair recognition rate for the original SYBA 5x5 implementation
with 9 SBIs. The second bar shows the image pair recognition rate for SYBA 5x5
with 3 SBIs using the original binarization method. These values are given for reference and are

Figure 4.10: Binarization Method vs. Recognition Rate for Trees Test Sequence

identical to those in the preceding graphs. The subsequent bars in each graph show the image recognition
rate when a kernel is used to determine the value of the center pixel and the image is binarized
using that method. The kernel size is denoted by k. These results are also generated with only 3
SBIs, since the accuracy impact of going to 3 SBIs is so minimal.
The results of this optimization have a very interesting impact on several of the image
sequences. As can be seen in Figures 4.10 and 4.11, binarizing the image with a kernel yields very
positive results for the Trees and Bikes image sequences. For a kernel size of k = 30, accuracy
improvements are seen on image pair Trees 1-4 of 3.8%, Trees 1-5 of 5.9%, and Trees 1-6 of
20.6%. A similar pattern is observed for the more difficult pairs seen in Figure 4.11. For a kernel
size of k = 30, accuracy improvements are seen on image pairs Bikes 1-4 of 3.3%, Bikes 1-5 of
8%, and Bikes 1-6 of 11.1%. These two image sequences primarily have blur distortions, with
only relatively minor changes to the viewpoint. The Leuven image sequence, which primarily has
illumination changes, sees very little change for k = 30, with all image recognition rates within 1%,
as seen in Figure 4.13. This is similar to the UBC JPEG compression test sequence, which also
differs by less than 1% for the same kernel size, as seen in Figure 4.12.


Figure 4.11: Binarization Method vs. Recognition Rate for Bikes Test Sequence

Figure 4.12: Binarization Method vs. Recognition Rate for UBC Test Sequence


Figure 4.13: Binarization Method vs. Recognition Rate for Leuven Test Sequence

This optimization did yield a negative result for image feature recognition rates when there
are large changes in the viewpoint of the image. For the Wall image sequence in Figure 4.14 and
for a kernel size of k = 30, image pair Wall 1-4 has an accuracy decrease of 5.6%, Wall 1-5 of
8.5%, and Wall 1-6 of 3.2%. For these image pairs it is very clear that increasing the kernel size
helps the feature match recognition rate, so it is plausible that a kernel larger than k = 30 would yield better
results. The Graffiti image sequence in Figure 4.15, which also has large viewpoint changes, also
shows a decrease in accuracy. For Graf 1-2 the decrease in accuracy is 4.1%, and for Graf 1-3 the
decrease is 6.2%.
Analyzing these image sequences shows that, in general, for images with only minor perturbations to viewpoint along with blurring, lighting changes, or JPEG compression, the kernel binarization
method yields either better or unchanged matching performance. For images with major viewpoint
variations, this binarization method results in poorer matching performance. An overall increase
for the entire dataset is noted in Figure 4.16. This of course means relatively little, as the averages would quickly change based upon the types of images in a different dataset. A more telling
graphic is included in Figure 4.17, which shows the improvement or deterioration for each image


Figure 4.14: Binarization Method vs. Recognition Rate for Wall Test Sequence

Figure 4.15: Binarization Method vs. Recognition Rate for Graffiti Test Sequence


Figure 4.16: Binarization Method vs. Average Recognition Rate

sequence. Although SYBA has already been extensively compared to BRIEF-32 and other descriptors in [19], another comparison is offered in this figure. Matching distance thresholds are adjusted
so that each descriptor has a roughly equal number of correspondences for each image pair. This
shows that SYBA still performs favorably in comparison to BRIEF-32, even with the changes and
optimizations and with a different, hardware-friendly feature detector.

4.4 Impact of SYBA Optimizations on Hardware Usage

In this section an overview of how these changes are implemented in the physical hardware
is given, along with how these changes have impacted the resource utilization. The change
in hardware to use only 3 SBIs rather than 9 SBIs is straightforward. Rather than instantiating 9
SBI comparison modules in VHDL, only 3 SBI comparison modules are needed. Then the descriptor lengths are updated accordingly. While the change is very simple, the impact on resource
utilization is high. In Table 4.2 the resource usage reduction can be seen. Compared to the original implementation, just reducing the number of SBIs reduces the number of LUTs by 41% and the
number of FFs by about 6% (from 9674 to 9115 in Table 4.2).

Figure 4.17: Different SYBA Methods and BRIEF-32 vs. Feature Recognition Rate

Table 4.2: Implementation type vs. Hardware Usage

Implementation of SYBA 5x5         SBIs   Binarization Method   LUT     FF     BRAM 36K
Original                           9      Original              10049   9674   18
Binarization Optimized             9      Kernel Optimized      5944    4169   19.5
SBIs Optimized                     3      Original              5955    9115   18
SBIs and Binarization Optimized    3      Kernel Optimized      2966    3419   19.5

The second change is a little bit more involved. As described above, an additional 30 line
buffers are needed in order to store the output result from comparison of the center pixel to the
average of the 30x30 kernel. These line buffers use fewer BRAM resources since they are only 1
bit wide instead of 8 bits. The location of the FAST feature detector within the 30x30 FRI is further
adjusted to account for the extra processing delay. The resource reduction for this method is also
substantial. Compared to the original solution, LUT usage is decreased by 41% and FF usage is
decreased by 57%. This comes at the cost of using another 1.5 BRAM blocks, each 36 Kb in size.


When both of these optimizations are applied to the design, the overall hardware reduction
is substantial. LUT usage is decreased by 70% and FF usage is decreased by 64%. This still
comes at the cost of the additional 1.5 BRAM blocks. These optimizations have been shown to
not only reduce hardware usage, but also either improve or maintain feature matching accuracy
on images with blurring, lighting, or compression deformations. This has allowed the SYBA
hardware implementation to use fewer resources than comparable hardware accelerators of other
feature descriptors while maintaining its high feature matching accuracy.
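The relative savings can be recomputed directly from the counts in Table 4.2. The short script below is only an illustrative check; the dictionary keys and function names are hypothetical, and the numbers are copied from the table.

```python
# Resource counts copied from Table 4.2: (LUT, FF, BRAM 36K).
TABLE_4_2 = {
    "original": (10049, 9674, 18.0),
    "binarization_optimized": (5944, 4169, 19.5),
    "sbis_optimized": (5955, 9115, 18.0),
    "sbis_and_binarization_optimized": (2966, 3419, 19.5),
}


def percent_reduction(baseline, value):
    """Reduction of `value` relative to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


def summarize(table, baseline_key="original"):
    """Compare every row of the table against the baseline row."""
    base_lut, base_ff, base_bram = table[baseline_key]
    summary = {}
    for name, (lut, ff, bram) in table.items():
        summary[name] = {
            "lut_reduction_pct": percent_reduction(base_lut, lut),
            "ff_reduction_pct": percent_reduction(base_ff, ff),
            "extra_bram": bram - base_bram,
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize(TABLE_4_2).items():
        print(f"{name}: LUT -{row['lut_reduction_pct']:.1f}%, "
              f"FF -{row['ff_reduction_pct']:.1f}%, BRAM +{row['extra_bram']}")
```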


CHAPTER 5. ACCELERATING SYBA FEATURE MATCHING WITH HLS

Many algorithms such as visual odometry, 3D reconstruction, and optical flow rely upon
the ability to match features. Increasingly resource constrained systems are being asked to perform
these tasks in real time. Take for example an unmanned aerial vehicle (UAV), which places
tight power and weight constraints on a system. In order to achieve real-time feature matching
on a resource constrained system, accelerating feature matching is necessary. Feature matching
can quickly become a resource consuming task, especially if the task is to perform global feature
matching. Suppose that for any given image there are N features detected. That means that in order
to globally match features between any 2 images, there must be at least $\frac{N_1 \times N_2}{2}$ feature distance
calculations made. This is due to the fact that the distance from any image feature in image
1 to another image feature in image 2 only needs to be calculated once. Suppose further that
certain image sequences have feature detection thresholding set so that there are approximately
N1 = N2 = 200 features detected and the video is being captured at 60 frames per second. That
would mean that approximately 1,200,000 feature distances must be calculated per second. If these
were to be performed sequentially, each feature comparison would have to be completed in no
more than about 0.83 microseconds. As part of accelerating and optimizing SYBA for hardware usage,
feature matching is an area that can be parallelized to have higher throughput and enable real time
applications.
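The arithmetic behind this time budget can be written out as a short worked example. The function names below are hypothetical; the feature counts, frame rate, and the N1 × N2 / 2 count are taken from the text above.

```python
def distance_calcs_per_second(n1, n2, fps):
    """Distance calculations per second, using the N1 * N2 / 2 count from the text."""
    return n1 * n2 / 2 * fps


def time_budget_us(calcs_per_second):
    """Time available for each calculation, in microseconds, if done sequentially."""
    return 1e6 / calcs_per_second


if __name__ == "__main__":
    rate = distance_calcs_per_second(n1=200, n2=200, fps=60)
    print(f"{rate:,.0f} distance calculations per second")
    print(f"{time_budget_us(rate):.3f} microseconds per calculation")
```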
In this chapter, the hardware accelerator is written using High-Level
Synthesis (HLS). The reason for this is to show the relative ease of working with SYBA features
in an HLS environment, and to allow quick and easy modifications by future users. Recall that
the distance between two SYBA features is quickly computed as the L1 norm between the two
features as in Equation (2.5). It is readily apparent that all the subtractions and absolute values
can be done in parallel, as each part of the summation loop is independent from the others. For
obtaining the overall sum from the summation loop, an adder tree can be built into hardware. This


will be able to perform the addition much more quickly by instantiating as many adders as necessary in
hardware, as opposed to performing the additions one by one.
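The structure of this computation can be modeled in software. The sketch below is a behavioral illustration only, not the HLS source. It assumes a descriptor can be treated as a fixed-length vector of small integers, and the function names are hypothetical. The pairwise reduction mirrors an adder tree: each pass of the loop corresponds to one level of adders that would operate in parallel in hardware.

```python
import numpy as np


def adder_tree_sum(values):
    """Sum values by pairwise reduction, mirroring a hardware adder tree.

    Returns (total, depth), where depth is the number of adder levels.
    """
    level = [int(v) for v in values]
    depth = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-length levels with a zero input
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        depth += 1
    return (level[0] if level else 0), depth


def l1_distance(desc_a, desc_b):
    """L1 norm between two descriptors given as integer vectors."""
    a = np.asarray(desc_a, dtype=np.int64)
    b = np.asarray(desc_b, dtype=np.int64)
    abs_diffs = np.abs(a - b)  # every element is independent of the others
    total, _depth = adder_tree_sum(abs_diffs)
    return total
```

For a vector of n elements the tree needs about log2(n) levels rather than n - 1 sequential additions, which is the source of the speedup described above.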
Up to this point, the merits of parallelizing the distance calculation between a single pair of feature points
have been discussed. Upon inspection of the overall feature matching algorithm, it can be quickly
seen that all the feature distance calculations are independent of one another, meaning that the
distance between multiple features can be quickly calculated in parallel. There is of course a limit
to the amount of parallelization that can be done, with the tradeoff in general being speedup vs.
hardware needed. The remainder of this chapter will discuss the tradeoffs between performing
feature matching in software, and two different hardware implementations. All measurements are
performed on a ZYNQ XC7Z020.

5.1 Feature Matching SYBA in Software

Performing feature matching in software has the primary advantage of being relatively easy
to do. Calculating the distance can be quickly done in several lines of code, and finding the
minimum distance is an easy extension from this. Software also has the advantage of being easier
to modify for future users. There are, however, several disadvantages to using software:
relatively lower performance, the use of valuable CPU time, and a lack of parallelism. In
this implementation the software runs on the ARM core of the ZYNQ, which is clocked at
667 MHz and has DDR3 clocked at 533 MHz. The time measured to calculate the distance
between two features was 7.77 microseconds. This would yield a theoretical throughput of 128,700
feature matches per second. From the example at the start of the chapter, it is clear that this will
not provide the necessary speed, unless additional optimizations are made or the feature matching
is not a true global search.
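As an illustration of how little code a software matcher requires, the following is a minimal NumPy sketch of a global (brute-force) nearest-neighbor search using the L1 distance. It is not the code that was timed on the ARM core. The function name is hypothetical, descriptors are assumed to be integer vectors of equal length, and the uniqueness matching strategy of [19] is not modeled.

```python
import numpy as np


def match_features(descs_1, descs_2):
    """For each descriptor in image 1, find the closest descriptor in image 2.

    Returns (best_index, best_distance), one entry per feature in image 1.
    """
    d1 = np.asarray(descs_1, dtype=np.int64)
    d2 = np.asarray(descs_2, dtype=np.int64)
    # dist[i, j] = L1 distance between feature i of image 1 and feature j of image 2.
    dist = np.abs(d1[:, None, :] - d2[None, :, :]).sum(axis=2)
    best = dist.argmin(axis=1)
    return best, dist[np.arange(len(d1)), best]
```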

5.2 HLS Feature Matching With SYBA - Basic Block

A basic attempt to match SYBA features with HLS is made as follows. Since the feature
descriptions are stored in the DDR, an HLS block was created with a master AXI port that can
access the two independent descriptions. The offsets are set through AXI-Lite slave registers in the
Xilinx SDK. While this approach does work, there is only about a 2x speedup, as seen in Table
5.1, and more speed and performance is desired. The primary bottleneck with this approach is
having to access each of the features through the DDR interface. Too much time is spent creating
read requests, and the amount of data in a single feature is so small that DDR bursting hardly
offers any benefit. If the amount of data to be transferred were larger, a DDR burst read would have
a more significant impact, as the address is only sent once for the entire block of data.

Table 5.1: Runtime results for SW vs. initial HLS configuration

                          Software Implementation   High Level Synthesis Implementation
Runtime for 2 features    7.77 us                   3.624 us
MAX Matches/sec           128,700                   275,900
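The maximum match rates in Table 5.1 follow from the measured runtimes, since one comparison is completed per runtime interval. A brief check, with hypothetical names and the two runtimes taken from the table:

```python
SW_RUNTIME_US = 7.77    # measured software runtime for two features
HLS_RUNTIME_US = 3.624  # measured runtime of the basic HLS block


def max_matches_per_second(runtime_us):
    """Upper bound on sequential feature comparisons per second."""
    return 1e6 / runtime_us


speedup = SW_RUNTIME_US / HLS_RUNTIME_US

if __name__ == "__main__":
    print(f"software: {max_matches_per_second(SW_RUNTIME_US):,.0f} matches/sec")
    print(f"basic HLS: {max_matches_per_second(HLS_RUNTIME_US):,.0f} matches/sec")
    print(f"speedup: {speedup:.2f}x")
```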

5.3 HLS Feature Matching With SYBA - Advanced Block

To help overcome these shortcomings, an HLS block was designed that is able to compare
multiple feature descriptions at a time. The first change was to store features in FPGA BRAMs.
These can be instantiated in parallel to allow high performance access to multiple parts of memory
at the same time. The second change was to match these stored features against new features as they are detected and described in the FPGA. The new features are passed into the HLS block as an AXI stream, with the “tUser”
signal indicating the first feature description in each frame. Finally, new feature descriptors along with their matches are saved to the DDR via a large DDR burst transaction. The
interfaces to this block are shown in Figure 5.1.
This approach allowed for several advantages compared to the previous approach. Taking
in the descriptors directly as an input allows for the feature similarity measure calculation to begin
without having to pass through the CPU or DDR. Comparing to features that are prestored in the
BRAMs allows for the similarity measure calculation to operate in a fast, efficient, and parallel
manner. This also increases the amount of data and calculations visible to the HLS block at a time,
which allows for further performance improvements. Storing the new feature descriptions in DDR
allows them to be easily accessed and saved for future feature matches. Storing the matches in the
DDR allows for the software to have quick access to the matches to be used for whatever application it needs. To change where the feature descriptions and matches are being stored, modifications
are made via AXI-Slave registers through the SDK.


Figure 5.1: Graphical Flow of Advanced HLS Block

A match has 5 pieces of information, all 5 of which are stored as unsigned 32-bit integers.
The 5 pieces of every match are: (1) the first feature’s x location, (2) the first feature’s y location, (3)
the second feature’s x location, (4) the second feature’s y location, and (5) the similarity measure
calculated by the HLS block to the most similar feature. Since these are already stored in DDR,
this allows for software to quickly and efficiently access feature matches for any application. The
performance of this new block is much greater than either of the previous approaches. Since the
HLS block was easily pipelined, I found that it was able to calculate the distance between the
incoming feature and any features stored in the BRAMs in a single clock cycle. While this would
yield a theoretical throughput on a 100MHz clock of 100 million feature distance calculations per
second, in reality this number will be lower. This is due to the need to save the matches and feature
descriptions in the DDR, the timing of which is unpredictable and varies with DDR latency, bandwidth, and
bus access rights. (Other components, such as the CPU, also access the DDR, so the block must
wait for it to become available.) It is worth noting that a block that simply compared the new feature to a set of
features in BRAMs, and then output the distance as an AXI stream should be able to approach the
theoretical limit of 100 million feature distance calculations per second.
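The behavior of this block can be modeled in software to make the data flow concrete. The sketch below is a behavioral model only and not the HLS source. The field names, the class name, and the choice of the incoming feature as the "first" feature and the stored feature as the "second" are assumptions; descriptors are again treated as integer vectors compared with the L1 norm.

```python
import numpy as np

# One match record: five unsigned 32-bit integers, as described in the text.
# Field names are hypothetical.
MATCH_DTYPE = np.dtype([
    ("x1", np.uint32), ("y1", np.uint32),  # incoming feature (assumed "first")
    ("x2", np.uint32), ("y2", np.uint32),  # stored feature (assumed "second")
    ("similarity", np.uint32),             # distance to the most similar stored feature
])


class StoredFeatureMatcher:
    """Matches streamed features against a prestored set (the BRAM contents)."""

    def __init__(self, stored_xy, stored_descs):
        self.stored_xy = np.asarray(stored_xy, dtype=np.int64)
        self.stored_descs = np.asarray(stored_descs, dtype=np.int64)

    def match(self, x, y, desc):
        """Return one match record as a tuple of five integers."""
        desc = np.asarray(desc, dtype=np.int64)
        # All stored features are compared at once; in hardware this is parallel.
        dists = np.abs(self.stored_descs - desc).sum(axis=1)
        j = int(dists.argmin())
        sx, sy = self.stored_xy[j]
        return (int(x), int(y), int(sx), int(sy), int(dists[j]))

    def process_stream(self, stream):
        """Group records by frame; `tuser` marks the first feature of a frame."""
        frames = []
        for tuser, x, y, desc in stream:
            if tuser or not frames:
                frames.append([])
            frames[-1].append(self.match(x, y, desc))
        return [np.array(records, dtype=MATCH_DTYPE) for records in frames]
```

Each frame's records form one contiguous array of 20-byte entries, which corresponds to the single large burst write to DDR described above.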


5.3.1 Demo Application

In order to show the effectiveness of the HLS block, a demo application was created. Using
the HLS block, a picture of the BYU logo was placed in front of the camera and the detected feature
points and their SYBA descriptions were saved into the DDR. This information was then hardcoded
back into the HLS block to be partitioned and saved into BRAMs. In this demo application
the feature descriptions of 33 points on the BYU logo were hardcoded into the HLS block. The
block then matches the incoming feature points to those saved in the HLS block, and returns the
match as described above. After re-synthesis, the block streamed feature matches to the
BYU logo to the DDR in real time.
A small application was written in the SDK to look through these matches in the DDR, then
draw a box onto the framebuffer around the BYU logo. This works by comparing the similarity
measure to a distance threshold. If the similarity measure is within the distance threshold then it
is considered a match. The average x and y locations of all matches are then found and a box is
drawn around this point. For this demo application design the resource utilization of just the HLS
block is as follows: as reported by Vivado HLS, this design is estimated to use 15 BRAMs, 0 DSP
slices, 4125 flip-flops (FFs), and 10553 LUTs. Once the design has been implemented on the ZYNQ
XC7Z020, however, Vivado reports the utilization as 7 BRAMs, 0 DSP slices, 1060 FFs, and 2651
LUTs. The design meets timing and, judging by the HDMI output, works well: it
successfully tracks the BYU logo, as shown in Figure 5.4.
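The box-drawing logic described above can be sketched in a few lines. This is a hedged illustration rather than the SDK application itself: the function name, the fixed `box_half_size`, the use of the incoming feature coordinates for the average, and the interpretation of "within the threshold" as less than or equal to it are all assumptions.

```python
def locate_object(matches, distance_threshold, box_half_size=50):
    """Return (left, top, right, bottom) of a box around the matched object.

    `matches` is an iterable of (x1, y1, x2, y2, similarity) records.
    Returns None if no record is within the distance threshold.
    The box is not clamped to the frame boundaries in this sketch.
    """
    good = [(x1, y1) for (x1, y1, _x2, _y2, similarity) in matches
            if similarity <= distance_threshold]
    if not good:
        return None
    # Average location of all accepted matches.
    cx = sum(x for x, _ in good) // len(good)
    cy = sum(y for _, y in good) // len(good)
    return (cx - box_half_size, cy - box_half_size,
            cx + box_half_size, cy + box_half_size)
```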
There are few published results of other studies that have used HLS to create feature matching accelerators in hardware. There are more results from traditional hardware implementations of
feature matchers, but even these are difficult to compare against since they are comparing different
feature descriptors and not using SYBA. Due to the significant differences between what is being
matched, the method used to create the hardware, along with the overall lack of results that also
include detailed resource usage, feature matching results from other papers are not included.
A graphical flow of the hardware portion of the demo application is offered in Figure 5.2.
This is an updated version of Figure 3.8 to reflect doing feature matching in the hardware. Finally,
a picture of the FPGA board with the camera and mount is given in Figure 5.3.


Figure 5.2: Graphical Flow of the Demo Application. External I/O such as HDMI, I2C for camera
registers, etc, is omitted for simplicity.

Figure 5.3: Numato Styx Board with Omnivision 5642 Camera and HDMI Output for Demo Application

Figure 5.4: Captured image from HDMI output showing SYBA Feature Matching


CHAPTER 6. COMPARISON OF RESULTS AND CONCLUSION

6.1 Results and Discussion

In this section the work of our hardware implementation is compared to other feature descriptor
implementations on FPGAs. In order to find a good comparison of feature descriptors,
designs are needed which have used the same feature detector but have different feature descriptors. Since comparisons to other implementations in hardware are sensitive to the device type and
the toolchain used, comparisons are also sought against implementations on the same hardware.
This can be seen in Table 6.1. All of these designs use the same hardware chip,
the Xilinx ZYNQ XC7Z020, the same clock frequency of 100 MHz, and the same feature detector, FAST. Resource utilizations included in the table show the resource usage for only the logic
needed for the feature detector and feature descriptor portions of the design. This is to remove any
comparison inconsistencies that would arise from implementation differences outside of these
parts of the logic. For example, some designs use raw camera input, some use HDMI input, and some use
images from DDR, so the logic required to grab the images from different sources should not be
included in a comparison.
Table 6.1: Resource utilizations of various feature detector and descriptor implementations
on FPGAs. These all have several of the same attributes. 1) They all use FAST as the
feature detector. 2) They are all using the same hardware, a ZYNQ XC7Z020. 3)
They all have the same clock speed, 100 MHz.

Implementation                  LUT     FF     BRAM KB   DSP   FPS 640x480
FAST + Original SYBA (This)     10049   9674   81        0     326
FAST + Optimized SYBA (This)    2966    3419   88        0     326
FAST + BRIEF [25]               4118    9543   140       0     326
FAST + BRIEF [26]               4257    3187   72        0     55.5
FAST + BRIEF [29]               14398   2093   50        0     147
FAST + BRISK [29]               25575   7115   50        0     147

It can be seen that the optimized version of SYBA has the lowest LUT usage of any of the
included designs, regardless of design speed. The only other design found that runs as quickly as
this design (326 FPS at 640x480, 100 MHz clock, 1 pixel per clock) requires 39% more LUTs,
179% more FFs, and 59% more BRAMs. Other designs run at a much slower speed, and are
therefore not fully pipelined. Some of the resource utilizations in these designs are lower, while
others, such as LUTs, are still higher. Given their much lower achievable speeds, these
designs were able to decrease resource usage. Considering all of the comparisons, SYBA is clearly
capable of delivering great feature matching results, while also being very hardware efficient, even
when compared to other low-resource utilization binary descriptors.
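The relative figures in this comparison can be recomputed from Table 6.1, and the frame rate follows from a fully pipelined design consuming one pixel per clock. The snippet below is an illustrative check with hypothetical names. The frame-rate formula assumes that blanking intervals are ignored.

```python
# Rows copied from Table 6.1 for the two fully pipelined designs.
OPTIMIZED_SYBA = {"lut": 2966, "ff": 3419, "bram_kb": 88}
BRIEF_REF_25 = {"lut": 4118, "ff": 9543, "bram_kb": 140}


def percent_more(other, reference):
    """How much larger `other` is than `reference`, in percent."""
    return 100.0 * (other - reference) / reference


def pipelined_fps(clock_hz, width, height, pixels_per_clock=1):
    """Frame rate of a design that consumes a fixed number of pixels per clock."""
    return clock_hz * pixels_per_clock / (width * height)


if __name__ == "__main__":
    for key in ("lut", "ff", "bram_kb"):
        extra = percent_more(BRIEF_REF_25[key], OPTIMIZED_SYBA[key])
        print(f"{key}: FAST + BRIEF [25] uses {extra:.0f}% more")
    print(f"{pipelined_fps(100e6, 640, 480):.1f} FPS at 640x480")
```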

6.2 Conclusion

The focus of this thesis is on the improvement of one of the most important aspects of
computer vision: image feature description. This thesis has improved upon a feature description
algorithm called SYBA. The thesis began by introducing the SYBA algorithm developed by Alok
Desai several years ago. The hardware implementation was then discussed,
along with justification of several design choices. The following chapter discussed two
novel ideas to improve SYBA to make it more efficient. The first idea was to reduce the number of
SBIs for comparison, which greatly reduced resource usage and had minimal impact on accuracy.
The second idea was to change the binarization method from comparing all pixels in the FRI to
an average value, to comparing only the center pixel of a kernel to the average value. This also
significantly reduced hardware usage, and even improved feature matching accuracy for certain
types of image pairs and deformations. The next chapter focused on an HLS implementation to do
SYBA feature matching and provided a demonstration of its performance. This chapter also
showed the overall hardware design working on an actual system.
The result of putting SYBA into hardware and optimizing it for better resource utilization is
very good. It has been shown to use significantly fewer resources than competing feature descriptors
while running at very high frame rates. SYBA has also been previously shown to have better feature
matching accuracy than these descriptors as in [19]. The combination of these two facts clearly
demonstrates that the SYBA descriptor in hardware has great image feature matching performance
at low resource utilization.

A short summary of the accomplishments and contributions of this thesis is now given.
First, this is the first ever hardware design and implementation of SYBA. Without changing the
SYBA algorithm, an efficient pipelined design was created. The hardware design was made in
VHDL and implemented on a ZYNQ XC7Z020. The second contribution was several
creative improvements to the SYBA algorithm to make it more hardware friendly. These included
reducing the number of SBIs with only a minimal impact on accuracy and changing the method of
FRI binarization by using a kernel. A testing setup to measure accuracy with these changes was
created, and measurements from it were examined. The third contribution was to make another
hardware design and implementation with these optimizations. Compared to the first hardware
design, LUT usage was decreased by 70% and FF usage was decreased by 64%. The fourth contribution was to make a demo application based upon High Level Synthesis (HLS) that accelerates
SYBA feature matching in hardware. This demo application can serve as a starting point for future
projects to leverage SYBA in hardware for fast and efficient computer vision applications.


REFERENCES

[1] H. C. Garcia, J. R. Villalobos, and G. C. Runger, “An automated feature selection method
for visual inspection systems,” IEEE Transactions on Automation Science and Engineering,
vol. 3, no. 4, pp. 394–406, 2006.
[2] P. Piccinini, A. Prati, and R. Cucchiara, “Real-time object detection and localization with
SIFT-based clustering,” Image and Vision Computing, vol. 30, no. 8, pp. 573–587, 2012.
[3] D. Glasner, M. Galun, S. Alpert, R. Basri, and G. Shakhnarovich, “Viewpoint-aware object detection
and continuous pose estimation,” Image and Vision Computing, vol. 30, no. 12, pp. 923–933,
2012.
[4] Z. Wei, D. J. Lee, and B. E. Nelson, “A hardware-friendly adaptive tensor based optical flow
algorithm,” in International Symposium on Visual Computing. Springer, 2007, pp. 43–51.
[5] A. Doshi and A. G. Bors, “Smoothing of optical flow using robustified diffusion kernels,”
Image and Vision Computing, vol. 28, no. 12, pp. 1575–1589, 2010.
[6] H. Huang and N. Wu, “Fast facial image super-resolution via local linear transformations for
resource-limited applications,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 10, pp. 1363–1377, 2011.
[7] J. Civera, O. G. Grasa, A. J. Davison, and J. Montiel, “1-point RANSAC for extended Kalman
filtering: Application to real-time structure from motion and visual odometry,” Journal of
Field Robotics, vol. 27, no. 5, pp. 609–631, 2010.
[8] H. Badino, A. Yamamoto, and T. Kanade, “Visual odometry by multi-frame feature integration,” in Proceedings of the IEEE International Conference on Computer Vision Workshops,
2013, pp. 222–229.
[9] P. C. Niedfeldt and R. W. Beard, “Multiple target tracking using recursive RANSAC,” in 2014
American Control Conference. IEEE, 2014, pp. 3393–3398.
[10] M. Pollefeys, D. Nistér, J. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels,
D. Gallup, S. Kim, P. Merrell et al., “Detailed real-time urban 3d reconstruction from video,”
International Journal of Computer Vision, vol. 78, no. 2-3, pp. 143–167, 2008.
[11] L. Raven, D. J. Lee, and A. Desai, “Robust synthetic basis feature descriptor,” in 2017 IEEE
International Conference on Image Processing (ICIP), Sept 2017, pp. 1542–1546.
[12] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.


[13] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust features,” in European
Conference on Computer Vision. Springer, 2006, pp. 404–417.
[14] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary robust independent elementary features,” in European Conference on Computer Vision. Springer, 2010, pp. 778–792.
[15] S. Leutenegger, M. Chli, and R. Siegwart, “BRISK: Binary robust invariant scalable keypoints,” in 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp.
2548–2555.
[16] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski, “ORB: An efficient alternative to
SIFT or SURF.” in ICCV, vol. 11, no. 1. Citeseer, 2011, p. 2.
[17] A. Alahi, R. Ortiz, and P. Vandergheynst, “FREAK: Fast retina keypoint,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 510–517.
[18] J. Heinly, E. Dunn, and J. Frahm, “Comparative evaluation of binary features,” in European
Conference on Computer Vision. Springer, 2012, pp. 759–773.
[19] A. Desai, D. J. Lee, and D. Ventura, “An efficient feature descriptor based
on synthetic basis functions and uniqueness matching strategy,” Computer Vision
and Image Understanding, vol. 142, pp. 37–49, 2016. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S1077314215002003
[20] A. Desai and D. J. Lee, “Visual odometry drift reduction using SYBA descriptor and feature
transformation,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 7, pp.
1839–1851, July 2016.
[21] F. Huang, S. Huang, J. Ker, and Y. Chen, “High-performance SIFT hardware accelerator for
real-time image feature extraction,” IEEE Transactions on Circuits and Systems for Video
Technology, vol. 22, no. 3, pp. 340–351, 2011.
[22] V. Bonato, E. Marques, and G. A. Constantinides, “A parallel hardware architecture for scale
and rotation invariant feature detection,” IEEE Transactions on Circuits and Systems for Video
Technology, vol. 18, no. 12, pp. 1703–1712, 2008.
[23] A. Amaricai, C. Gavriliu, and O. Boncalo, “An FPGA sliding window-based architecture
harris corner detector,” in 2014 24th International Conference on Field Programmable Logic
and Applications (FPL), Sept 2014, pp. 1–4.
[24] M. F. Aydogdu, M. F. Demirci, and C. Kasnakoglu, “Pipelining harris corner detection with
a tiny FPGA for a mobile robot,” in 2013 IEEE International Conference on Robotics and
Biomimetics (ROBIO). IEEE, 2013, pp. 2177–2184.
[25] M. Fularz, M. Kraft, A. Schmidt, and A. Kasiski, “A high-performance FPGA-Based image
feature detector and matcher based on the FAST and BRIEF algorithms,” International
Journal of Advanced Robotic Systems, vol. 12, no. 10, p. 141, 2015. [Online]. Available:
https://doi.org/10.5772/61434


[26] H. Heo, J. Lee, K. Lee, and C. Lee, “FPGA based implementation of FAST and BRIEF
algorithm for object recognition,” in 2013 IEEE International Conference of IEEE Region 10
(TENCON 2013), Oct 2013, pp. 1–4.
[27] H. Anderson, “Both lazy and efficient: Compressed sensing and applications,” Sandia National Laboratories, Albuquerque, NM, pp. 2013–7521, 2013.
[28] E. Rosten and T. Drummond, “Machine learning for high-speed corner detection,” in European Conference on Computer Vision. Springer, 2006, pp. 430–443.
[29] O. Ulusel, C. Picardo, C. B. Harris, S. Reda, and R. I. Bahar, “Hardware acceleration of
feature detection and description algorithms on low-power embedded platforms,” in 2016
26th International Conference on Field Programmable Logic and Applications (FPL). IEEE,
2016, pp. 1–9.


