Low-Latency Image Fusion Implementation using FPGA by Yeoh, Chan
Low-Latency Image Fusion
Implementation using
FPGA
A Thesis
Submitted to the Faculty
of
Drexel University
by
Chan Yeoh
in partial fulfillment of the
requirements for the degree of
Master of Science in Computer Engineering
February 26, 2018
© Copyright 2018
Chan Yeoh. All Rights Reserved
Acknowledgements
I would like to express my sincere gratitude to my advisor, Prof. Nagvajara,
for his patience, dedication, and support in helping me complete this thesis.
His guidance helped me not only in writing the thesis but also in
presenting it.
Apart from my advisor, I would also like to thank my thesis committee: for
providing valuable advice and hard questions.
In addition to my advisor and the committee, I would also like to thank my
friends for going through the years at Drexel University with me, allowing me
to speak my thoughts, and giving me encouragement and advice. I
would like to thank my girlfriend especially for ...
Last but not least, I would also like to thank my family for supporting me
in completing my thesis during tough times. Without my parents giving birth
to me, I would not be where I am today.
Table of Contents
Abstract
Chapter 1: Introduction
FPGA
Xilinx FPGA Architecture
Video Processing
HDMI Input Signal
Image Fusion Theory
Signal Processing Theory
Harris Corner Detection
Laplacian Pyramid
Chapter 2: Design and Implementation
Convolution
Non-separable Convolution Design
Separable Convolution Design
Advantages of Separable Kernel
Results
Modifications of Separable Kernel
Harris Corner Detection
Laplacian Pyramid
Xilinx Architecture: AXI-Lite
Xilinx Architecture: AXI-Master
Xilinx Architecture: AXI-Stream
Software
Misc
Cores used in Design
Chapter 3: Performance
Hardware Complexity
Latency
PSNR Values
Chapter 4: Demo Images
Chapter 5: Conclusion
Applications
Future Work
Bibliography
List of Figures
1 Image of an FPGA
2 Sample Block Diagram of HDMI Input Image
3 Shows the AXI Master Signals
4 Shows the AXI Stream Signals
5 HDMI Sync Diagram Timing
6 HDMI Blanking Diagram Timing
7 Clock Diagram for YCbCr
8 Comparison of RGB and Luminance
9 Image illustrating Corner Feature
10 Image illustrating building the Laplacian Pyramid
11 Workflow of Traditional Convolution
12 Workflow of Separable Convolution
13 Performance of Separable vs. Non-separable kernel
14 Clock Calculation of Maximum Winner
15 Harris Corner Block Diagram
16 Gaussian Pyramid Illustration
17 Laplacian Pyramid Reconstruction
18 Laplacian Pyramid Block Diagram
19 Clock Diagram of AXI Lite Read
20 Clock Diagram of AXI Lite Write
21 Clock Diagram of AXI Master Write
22 Clock Diagram of AXI Master Write
23 First Frame Sync Core
24 Derivative Image Core
25 Squaring Core
26 Multiplication Core
27 Harris Corner Core
28 Local Maximum Neighbor Core
29 Frame Signal Buffering Core
30 Feature Point Storage Core
31 Threshold Value Core
32 RGB 444 to YCbCr 444 Core
33 Frame Shifting Core
34 AXI-Stream for Syncing two frames Core
35 Laplacian Pyramid Decomposition Core
36 Laplacian Pyramid Merging Core
37 Laplacian Pyramid Reconstruction Core
38 Video Timing Core
39 Video In to AXI-Stream Core
40 HDMI Input Core
41 AXI-Stream to Video out Core
42 HDMI out Core
43 Final Block Diagram Core
44 Sample Image of Emboss Filter
45 Sample Image of Sobel Filter in X and Y Direction
46 Harris Corner (No Threshold)
47 Harris Corner (Threshold: 50)
48 Harris Corner (Threshold: 1000)
49 A regular Laplacian Merge
50 A regular Laplacian Merge with ghost
51 A regular Laplacian Merge with Bad Merge
52 A Laplacian Merge with various techniques
53 A Laplacian Merge with a webcam
List of Tables
1 Results of Non-separable Convolution
2 Results of Separable Convolution
3 PSNR Values of Image Fusion Manual Calibration
4 PSNR Values of Image Fusion Closest Neighbor Calibration
Abstract
Image fusion combines the input of multiple cameras and may enhance the
details of a given scene. For example, a regular RGB image combined with
thermal imaging can visualize the heat in a scene, and fusing overlapping
views can widen the field of view, as in a panorama or the human visual field.
In current applications, image fusion is not done in real time: users have to
wait for a period of time before the images are combined. This project uses
an FPGA (Field Programmable Gate Array) to perform image fusion with a low
latency of about 17 ms, which is not noticeable to the human eye. The paper
also presents novel approaches to minimizing hardware usage and uses
Vivado’s architecture to communicate with the processor to fuse two 1080p
videos. The output shows a ghosting effect because of the lack of a
homography transformation and the frame-to-frame noise coming from the
webcam.

Chapter 1: Introduction
Image fusion is an important aspect of robotics. A robot that captures a
scene the way the human eye does needs to be equipped with two cameras
in order to have a larger field of view and to detect depth in real time. Another
application is image multiplexing. [1] One classic example is color
night vision. Short-wavelength infrared light can produce some form
of color for the human visual system. Long-wavelength infrared light, on the
other hand, captures heat emitted by an object, which is not visible to
the human visual system. Fusing these images may provide more information
about a given scene.
FPGA
The FPGA (Field Programmable Gate Array) is an integrated circuit (see
Fig. 1) that can be programmed using a hardware description language (HDL).
The HDL design reconfigures the logic gate arrays and logic blocks to match the
description. These logic gate arrays and blocks include Digital Signal Processing
(DSP) blocks, flip-flops (FF), LUTs (look-up tables), LUTRAM (look-up table
random access memory) and BRAM (block random access memory). [2]
A modern FPGA can run at clock speeds of up to 500 MHz. For
video processing, the pixel clock of a 1080p video at 60 Hz is 148.5 MHz, which
is well within the range the FPGA can process. FPGAs have advantages in latency
and throughput compared to the CPU (central processing unit) and GPU
(graphics processing unit). In video processing, a CPU or GPU typically has to
store the whole frame in memory before processing, which prevents processing
from starting while pixels are still arriving. FPGAs, on the other hand, can
process the video input as soon as the first pixel arrives (see Fig. 2). The
input sequence can be pictured as a queue. When the first row is filled with
1920 pixels, the next row starts to fill, and so on, until all 1080 rows are
completed. FPGAs can also run computations in parallel as long as the parallel
computations are in sync with one another. Parallelism on CPUs and GPUs is
possible by using threads, but it is limited by the number of cores of the
processor.
Figure 1: An example of an FPGA board, the ZC702, manufactured by
Xilinx. The image includes labels for the various input and
output clocks. [3]
Figure 2: Illustrates the row-by-row input from HDMI as each pixel is
received by the FPGA, assuming 1920 pixels per row. Each column,
c1, c2, · · · cn, is filled by pixels p1, p2, · · · p1920. When pixel 3
is inserted, it is placed at the tail of c3. This data structure can
also be pictured as a queue, since the first pixel in is the first to
be processed and the first pixel out when processing is done.
Although an ASIC (Application-Specific Integrated Circuit) can run at
a faster clock speed than an FPGA, the production time for an ASIC
is much longer. FPGAs, on the other hand, can be reconfigured quickly
to detect potential flaws in the design. Hence, it is common practice to
prototype a hardware design on an FPGA, ensuring that there are no flaws, before
manufacturing it as an ASIC.
FPGAs also have the advantage of parallelism, that is, computing several
operations in the same clock cycle, for example the multiplication and the
division of two different pairs of variables at the same time. On CPUs and
GPUs, the multiplication and division are converted to individual instructions,
and the number of threads that can run simultaneously depends on the
processor. [4]
Xilinx FPGA Architecture
Vivado, developed by Xilinx, allows the programmable logic in the
FPGA to communicate with the processor and vice versa. To do so, the
AXI4-Full interface is used. The concept of AXI4-Full (see Fig. 3) is to
specify an address space of length N for read/write access. Rather
than specifying an address and then sending the data for every word, it sends
N data words within one handshake. This approach allows N data words to be
stored in N+1 clock cycles as opposed to 2N clock cycles (one clock cycle for
specifying the address and the other to transmit/read the data). [5]
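The cycle counts above can be checked with a short calculation. The sketch below is an idealized model only: it assumes one cycle per address phase and one per data word, and ignores wait states and the response phase. The function names are hypothetical and are not part of the design.

```python
def cycles_single_word(n_words: int) -> int:
    """Idealized cost when every word needs its own address cycle and data cycle."""
    return 2 * n_words


def cycles_burst(n_words: int) -> int:
    """Idealized cost of one address cycle followed by n_words data cycles."""
    return n_words + 1


for n in (1, 16, 256):
    print(n, cycles_single_word(n), cycles_burst(n))
```

For a single word both approaches cost two cycles; the saving only appears as N grows.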
AXI4-Stream (see Fig. 4) is used to communicate with Vivado’s built-in
programmable logic cores. One application of AXI4-Stream is synchronizing
an external clock with the FPGA’s clock. This is useful for
storing video data coming from an external source, with a different clock speed,
to the processor’s memory. [5]
Figure 3: The list of all the AXI4 master ports of a custom IP. The ports
that start with m00_axi_ar*** deal with the read address. The ports
that start with m00_axi_aw*** deal with the write address. The ports
that start with m00_axi_w*** deal with write data. The ports that
start with m00_axi_r*** deal with read data. Finally, the ports that
start with m00_axi_b*** carry the write response code from the slave.
Figure 4: The list of all the AXI-Stream ports of a custom core.
M00_axis_tdata is the data. M00_axis_tlast marks the last data
word. M00_axis_tuser marks the start of a frame. M00_axis_tready is the
ready signal driven by the slave.
Video Processing
There are two common sets of video signals when dealing with HDMI input and
output: (1) horizontal sync, vertical sync, active and data, and (2) horizontal
blank, vertical blank, active, and data. [6] The first set is timed by
the front porch and back porch in both the horizontal and
vertical directions (see Fig. 5). Both sync signals go low at the end
of the front porch and go high at the start of the back porch. The active
signal is high during display time, indicating that the video data is valid,
and low during blanking time, indicating that the video data is invalid.
The second set of signals uses the blanking time to drive the horizontal
blank and vertical blank signals. During display time, the active signal
is high and both the horizontal and vertical blank signals are
low. The horizontal blank signal goes high when the horizontal count
exceeds the width of the active frame. Similarly, the vertical blank signal goes
high when the vertical count exceeds the height of the active frame (see Fig. 6).
Figure 5: Shows the timing diagram of the h_sync and v_sync signals.
The display time shows the region where the active signal is
high. The blanking time shows the region where the active
signal is low. [7]
Figure 6: Illustrates the h_blank and v_blank signals of
the frame. The horizontal blank signal is high during the horizontal
blanking time, and the vertical blank signal is high during the
vertical blanking time. The display time indicates the region where the
active signal is high, and the blanking time indicates the region where
the active signal is low.
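The relationship between the counters and the blank, sync and active signals can be sketched in software. The sketch below assumes the standard 1080p60 timing (2200 by 1125 total counts); these porch and sync widths are not taken from the text above, and the names `AxisTiming` and `video_signals` are hypothetical. With these values the pixel clock works out to 148.5 MHz.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisTiming:
    """Pixel (or line) counts for one axis of a video frame."""
    active: int
    front_porch: int
    sync: int
    back_porch: int

    @property
    def total(self) -> int:
        return self.active + self.front_porch + self.sync + self.back_porch


# Assumed 1080p60 timing values; not taken from the thesis text.
H = AxisTiming(active=1920, front_porch=88, sync=44, back_porch=148)
V = AxisTiming(active=1080, front_porch=4, sync=5, back_porch=36)


def video_signals(x: int, y: int) -> dict:
    """Signals for horizontal count x and vertical count y (both start at 0)."""
    h_blank = x >= H.active
    v_blank = y >= V.active
    h_sync_start = H.active + H.front_porch
    v_sync_start = V.active + V.front_porch
    return {
        "active": not (h_blank or v_blank),
        "h_blank": h_blank,
        "v_blank": v_blank,
        # True while the sync pulse is asserted; electrical polarity is not modeled.
        "h_sync": h_sync_start <= x < h_sync_start + H.sync,
        "v_sync": v_sync_start <= y < v_sync_start + V.sync,
    }


pixel_clock_hz = H.total * V.total * 60
print(H.total, V.total, pixel_clock_hz)
```

Both signal sets are derived from the same two counters, which is why a design can convert between them without buffering any pixels.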
HDMI Input Signal
Typical HDMI input formats are RGB, YUV and YCbCr, [9] RGB being
the most commonly used. Although RGB is commonly used for computer
vision, this project uses YCbCr: luminance, chroma blue and chroma red.
The main advantage of YCbCr is the luminance channel. When the YCbCr
channels are separated (see Fig. 8), the luminance channel preserves
most of the edges captured in the original color image, while
chroma blue and chroma red do not. [10] The same analysis can be
done for the RGB channels. When the red, green and blue channels are
separated and viewed individually (see Fig. 8), some of the edges that are seen in
the original color image are lost. Therefore, YCbCr is preferred mainly because
of the luminance channel.
The luminance channel captures the most information, while chroma blue
and chroma red do not. Hence, subsampling both the chroma blue and
chroma red channels does not result in a color image that is unpleasant to
the human eye. Each channel is 8 bits, for a total of 24 bits per pixel.
With the chroma channels subsampled, transmitting a pixel
requires only 16 bits (see Fig. 7). This format is also called YCbCr 4:2:2. [9]
The luminance is paired with chroma blue when the width count
is odd and with chroma red when the width count is even.
Figure 7: The diagram shows the typical input sequence of the
video data for YCbCr. In the video data signal, luminance is never
dropped, but chroma blue and chroma red are subsampled and alternate
every other clock cycle. [8]
Figure 8: (Top) shows the luminance, chroma blue and chroma red
channels. (Bottom) shows the channels when the image is split
into its red, green and blue values. [16]
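The 24-bit to 16-bit reduction can be illustrated with a small sketch. It assumes the full-range BT.601 conversion coefficients and a width count that starts at 1, neither of which is stated above, and it simply drops the unused chroma sample of each pixel. The function names are hypothetical.

```python
def rgb_to_ycbcr(r: int, g: int, b: int) -> tuple:
    """Convert 8-bit RGB to 8-bit YCbCr (assumed full-range BT.601 coefficients)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b

    def clamp(value: float) -> int:
        return max(0, min(255, round(value)))

    return clamp(y), clamp(cb), clamp(cr)


def pack_422(row: list) -> list:
    """Turn a row of (Y, Cb, Cr) pixels into (Y, chroma) pairs of 16 bits each.

    Odd width counts carry Cb and even width counts carry Cr; the count is
    assumed to start at 1.
    """
    packed = []
    for count, (y, cb, cr) in enumerate(row, start=1):
        packed.append((y, cb if count % 2 == 1 else cr))
    return packed


row = [rgb_to_ycbcr(128, 128, 128), rgb_to_ycbcr(200, 30, 30)]
print(pack_422(row))
```

Every pixel keeps its own luminance sample, so edge information survives the packing even though half of the chroma samples are discarded.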
Image Fusion Theory
Image fusion consists of four basic steps:
feature detection, feature matching, transform model estimation and image
resampling. [11] The algorithm used for feature detection is Harris corner
detection. Corners are good feature points because they are repeatable and
distinct. Feature matching is tested with manual calibration, nearest neighbor
and descriptor matching. The transform model contains only the x and y
translation due to the hardware limitations of the FPGA. Finally, image
resampling is done with the Laplacian pyramid.
Signal Processing Theory
The project uses Harris corner detection to detect essential feature points
from two camera sources. The feature points are then stored in a
programmable logic block. The C code reads the data from the programmable
logic block and calculates the optimal displacement between the two frames.
Finally, the fusion is done using the Laplacian pyramid and the calculated
displacement.
Harris Corner Detection
Harris corner detection is designed to infer possible corner points
in an image. These corner points can later be used for image recognition or
image stitching. It also runs in real time, so the human visual system
does not perceive any lag.
A corner is defined as an intersection between two edges that are close to
perpendicular (90°). An edge is formed where there is a large change in
intensity. Therefore, the Harris corner detector looks for changes in intensity
within a window function. The first step is to compute the derivative in the X
direction and the Y direction. Doing so eliminates flat regions and linear
edges. A linear edge can have a strong derivative in either the X or the Y
direction, but not in both. [12] A corner shows a strong derivative
in both the X and Y directions (see Fig. 9). The final step is to blur the
derivative products to reduce noise and compute the detector response by
comparing the determinant and trace of the window function. To further reduce
noisy features for stitching, a threshold and the local neighborhood maximum
are used.
Laplacian Pyramid
The Laplacian pyramid is used for image fusion. The first step is to form
the Gaussian pyramid, which low-pass filters the image to reduce
noise at each layer. The first layer is the original image. The second layer
is the first layer blurred and subsampled. The image is blurred before
subsampling because of the Nyquist theorem, which states that a signal can be
recovered without loss only if it is sampled at more than twice its highest
frequency. Subsampling lowers the sampling rate, so blurring first removes the
high-frequency content that would otherwise alias, in addition to removing
high-frequency noise. The final step is to take the
difference between each layer and the layer above it. When images are stitched
together, the seam between the two images is most likely discontinuous due to
a rapid change in intensity. However, the Gaussian pyramid has removed
high-frequency noise at each layer and kept the low-frequency content. As a
result, when the pyramids are blended together, the discontinuity is less
noticeable to the human eye.
Figure 9: The derivative images of various input image patches: a
linear edge, a flat region and a corner. Flat regions
do not show a strong response in either derivative. The linear edge
shows a strong response in only one of the derivatives; in this
image it is the X derivative, with no response in the
Y derivative. The corner shows a strong response in both the X and
Y derivatives. [13]
The Laplacian pyramid algorithm first creates a Gaussian pyramid.
The base of the Gaussian pyramid is the original image. To produce the next
layer, the base layer is blurred by an N by N Gaussian kernel and then
subsampled. [14] The third layer treats the second layer as the base layer and
repeats the process. After the Gaussian pyramid is produced, the
difference between adjacent layers is taken. For example, to produce the first
layer of the Laplacian pyramid, the first and second layers of the Gaussian
pyramid are obtained. Next, the second layer of the Gaussian pyramid is upsampled.
Finally, the difference between the first layer of the Gaussian pyramid and the
upsampled second layer of the Gaussian pyramid is the first
layer of the Laplacian pyramid (see Fig. 10).
Figure 10: Illustrates the construction of the Laplacian pyramid. The first
step is to construct the Gaussian pyramid, which blurs and subsamples
the image from the original image G_0 up to G_n. Each layer is then
expanded and the difference is taken to obtain the Laplacian
pyramid, as shown on the left of the equation. [15]
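The decomposition and its inverse can be sketched on a one-dimensional signal. This is an illustration only: it assumes a 3-tap [1, 2, 1]/4 blur with edge replication and upsampling by sample repetition, whereas the design in this thesis works on two-dimensional frames with an N by N Gaussian kernel. All function names are hypothetical.

```python
def blur(signal: list) -> list:
    """3-tap [1, 2, 1] / 4 low-pass filter with edge replication."""
    n = len(signal)
    return [
        (signal[max(i - 1, 0)] + 2 * signal[i] + signal[min(i + 1, n - 1)]) / 4
        for i in range(n)
    ]


def reduce_layer(signal: list) -> list:
    """Blur, then keep every second sample."""
    return blur(signal)[::2]


def expand_layer(signal: list, length: int) -> list:
    """Upsample to `length` by repeating samples, then blur."""
    upsampled = [signal[min(i // 2, len(signal) - 1)] for i in range(length)]
    return blur(upsampled)


def build_laplacian(signal: list, levels: int):
    """Return (laplacian_layers, smallest_gaussian_layer)."""
    gaussian = [list(signal)]
    for _ in range(levels):
        gaussian.append(reduce_layer(gaussian[-1]))
    laplacian = [
        [a - b for a, b in zip(g, expand_layer(g_next, len(g)))]
        for g, g_next in zip(gaussian, gaussian[1:])
    ]
    return laplacian, gaussian[-1]


def reconstruct(laplacian: list, top: list) -> list:
    """Expand from the smallest layer and add each Laplacian layer back."""
    current = top
    for layer in reversed(laplacian):
        expanded = expand_layer(current, len(layer))
        current = [d + e for d, e in zip(layer, expanded)]
    return current


signal = [3.0, 8.0, 1.0, 9.0, 4.0, 4.0, 7.0, 2.0]
layers, top = build_laplacian(signal, levels=2)
print(reconstruct(layers, top))
```

Because each Laplacian layer stores exactly what the expand step cannot recover, adding the layers back reproduces the input up to floating-point rounding, whichever blur kernel is chosen.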
Chapter 2: Design and Implementation
Convolution
The convolution operation is calculated with the equation shown below, written
here in its standard discrete two-dimensional form, where g denotes the output
image:
$$g(x, y) = (f * h)(x, y) = \sum_{i} \sum_{j} f(x - i,\, y - j)\, h(i, j)$$
The convolution operation is essentially a weighted average taken with
a kernel that is shifted across the image. In order to produce an output pixel,
the neighboring pixels of the original pixel, f, and the weighted kernel, h, are
needed. The weighted average of f and h is then calculated. The process is
repeated for all of the pixels in the image.
Non-separable Convolution Design
The traditional hardware design for the convolution operation is built
for non-separable kernels. For a 3 by 3 kernel, three rows are stored in BRAM.
As soon as all three rows are filled, each of the three BRAM blocks
pushes out the first item in its queue, which is stored in a flip-flop. On the
next clock cycle, the second pixel is pushed out and stored in a flip-flop.
Finally, on the third clock cycle, all three BRAM blocks have pushed out three
pixels, [17] for a total of nine pixel values. The convolution operation is
then calculated (see Fig. 11).
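The row-buffering scheme above can be modeled in software. The sketch below replaces the BRAM rows and flip-flops with a single queue that holds the most recent two rows plus three pixels, and emits a 3 by 3 window as soon as one is complete. It is a behavioral model under those assumptions, not the hardware design itself, and it computes a plain weighted sum without flipping the kernel. The names `stream_windows` and `filter_stream` are hypothetical.

```python
from collections import deque


def stream_windows(pixels, width: int, k: int = 3):
    """Yield (row, col, window) for each complete k x k window.

    `pixels` arrive in raster order; (row, col) is the window center.
    """
    depth = (k - 1) * width + k          # oldest pixel still needed by a window
    buffer = deque(maxlen=depth)
    for index, pixel in enumerate(pixels):
        buffer.append(pixel)
        row, col = divmod(index, width)
        if row >= k - 1 and col >= k - 1:
            window = [
                [buffer[-1 - dr * width - dc] for dc in range(k - 1, -1, -1)]
                for dr in range(k - 1, -1, -1)
            ]
            yield row - k // 2, col - k // 2, window


def filter_stream(pixels, width: int, kernel: list):
    """Weighted sum of each streamed window with `kernel`."""
    k = len(kernel)
    for row, col, window in stream_windows(pixels, width, k):
        total = sum(kernel[i][j] * window[i][j] for i in range(k) for j in range(k))
        yield row, col, total


image = list(range(16))                   # a 4 x 4 frame in raster order
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(list(filter_stream(image, width=4, kernel=box)))
```

The first result only becomes available after two full rows and three pixels have arrived, which is the software counterpart of waiting for the row buffers to fill.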
Separable Convolution Design
Separable kernels are kernels that can be split into horizontal and vertical
components. The outer product of the horizontal and vertical components
is the same as the original kernel. Since convolution is associative,
the equation can be written as (I ∗ h) ∗ v or I ∗ (h ∗ v). The design
for non-separable kernels performs the operation as I ∗ (h ∗ v), which
requires more additions and multiplications. Rewriting the
equation as (I ∗ h) ∗ v results in far fewer additions and multiplications.
The implementation (see Fig. 12) of a separable kernel is similar to that
of the non-separable kernel with some slight changes. Before being
pushed into the rows of BRAM, the data is stored in a shift register to
compute the horizontal component. For example, a 3 by 3 kernel first stores
three pixels in the shift register and convolves them with the horizontal kernel.
This hardware is equivalent to (I ∗ h). When three of the BRAM blocks are filled
up, the first pixel of each row is pushed out and convolved with the vertical
component. For a 3 by 3 kernel, the separable kernel requires a total of 6
multiplications and 4 additions. The non-separable kernel, on the other hand,
requires a total of 9 multiplications and 8 additions. The hardware complexity
grows linearly with the kernel width for a separable kernel and quadratically
for a non-separable kernel.
Figure 11: This image illustrates the convolution with a traditional
kernel. The pixels first go into the queue for processing. Flip-flops
then hold the data pushed out of the queue, and the weighted sum of the
results is calculated.
Figure 12: Illustrates the process of a separable 3 by 3 convolution
kernel. The first 3 pixels are processed
in the horizontal direction and pushed into the queue. Once 3
rows of the queue are buffered, they are sent out for vertical processing.
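The equivalence of the two orderings can be checked numerically. The sketch below filters a small integer image once with a horizontal pass followed by a vertical pass, and once with the full kernel formed by the outer product of the two components. The [1, 2, 1] taps and the test image are arbitrary assumptions, the function names are hypothetical, and the kernel flip is omitted because the taps are symmetric.

```python
def filter_rows(image: list, taps: list) -> list:
    """Horizontal pass: weighted sum along each row (valid region only)."""
    n = len(taps)
    return [
        [sum(taps[j] * row[c + j] for j in range(n)) for c in range(len(row) - n + 1)]
        for row in image
    ]


def filter_cols(image: list, taps: list) -> list:
    """Vertical pass: weighted sum along each column (valid region only)."""
    n = len(taps)
    return [
        [sum(taps[i] * image[r + i][c] for i in range(n)) for c in range(len(image[0]))]
        for r in range(len(image) - n + 1)
    ]


def filter_2d(image: list, kernel: list) -> list:
    """Direct pass with the full n x n kernel (valid region only)."""
    n = len(kernel)
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j] for i in range(n) for j in range(n))
            for c in range(len(image[0]) - n + 1)
        ]
        for r in range(len(image) - n + 1)
    ]


h = [1, 2, 1]                                        # horizontal component
v = [1, 2, 1]                                        # vertical component
kernel = [[vi * hj for hj in h] for vi in v]         # outer product of v and h
image = [[(3 * r + 5 * c) % 7 for c in range(6)] for r in range(5)]

separable = filter_cols(filter_rows(image, h), v)    # (I * h) * v
direct = filter_2d(image, kernel)                    # I * (h * v)
print(separable == direct)
```

Per output pixel the two-pass version needs 3 + 3 multiplications while the direct version needs 9, matching the counts given above for a 3 by 3 kernel.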
Advantages of Separable Kernel
A comparison of the clock latency, flip-flops, and look-up tables is made
between the non-separable kernel and the separable kernel. The measurements are
collected through Vivado’s utilization tool. To keep the measurements
consistent for both the separable kernel and the non-separable kernel, each
multiplication and addition takes only one clock cycle, and each read and write
takes one clock cycle.
Table 1: Number of LUTs, LUTRAM, flip-flops, pixel delay, and
number of additions and multiplications gathered from the Vivado 2015.4
utilization and pixel measurement for the separable kernel.
                        3×3    5×5    7×7    9×9
LUT                     281    426    532    723
LUTRAM                    1      1      1      1
Flip Flop               257    364    419    513
Pixel Delay            2213   4415   6615   8817
# of Additions            4      8     12     16
# of Multiplications      6     10     12     18
From the gathered statistics (see Tables 1 and 2) and visualization (see Fig.
13), the LUT and flip-flop counts grow roughly quadratically for
non-separable kernels and linearly for separable kernels. The number of additions
can be modeled as $2(n-1)$ for separable kernels and $2(n^2-1)$ for non-separable
kernels. The number of multiplications can be modeled as $2n$ for separable
kernels and $n^2$ for non-separable kernels. The pixel delay can be modeled as
$2\lceil \log_2 n \rceil + C$ for separable kernels and
$\lceil \log_2 n^2 \rceil + C - 1$ for non-separable kernels, where C
represents the constant width of an image.
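The models above are easy to evaluate for a given kernel width n. In the sketch below the function names are hypothetical, and the value C = 2209 is not stated in the text: it is simply the constant for which both delay models reproduce the tabulated 3 by 3 pixel delays.

```python
from math import ceil, log2


def separable_counts(n: int) -> dict:
    """Modeled operation counts for an n x n separable kernel."""
    return {"additions": 2 * (n - 1), "multiplications": 2 * n}


def non_separable_counts(n: int) -> dict:
    """Modeled operation counts for an n x n non-separable kernel."""
    return {"additions": 2 * (n * n - 1), "multiplications": n * n}


def pixel_delay(n: int, c: int, separable: bool) -> int:
    """Modeled pixel delay; c is the constant term C of the model."""
    if separable:
        return 2 * ceil(log2(n)) + c
    return ceil(log2(n * n)) + c - 1


for n in (3, 5, 7, 9):
    print(n, separable_counts(n), non_separable_counts(n))

# C = 2209 is inferred from the 3 x 3 rows of the tables, not given in the text.
print(pixel_delay(3, 2209, separable=True), pixel_delay(3, 2209, separable=False))
```

The delay terms grow only logarithmically in n, so the two designs stay within a cycle or two of each other while their operation counts diverge.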
Figure 13: This shows the measurements for the separable and
non-separable kernels. In most cases the non-separable kernel
grows quadratically in terms of hardware
size, while the separable kernel grows linearly. This gives a
significant reduction in the hardware needed to fit in the FPGA. In addition,
the latency of both designs is very similar, with a difference of at
most 1 or 2 clock cycles.
Table 2: Number of LUTs, LUTRAM, flip-flops, pixel delay, and
number of additions and multiplications gathered from the Vivado 2015.4
utilization and pixel measurement for the non-separable kernel.
                        3×3    5×5    7×7    9×9
LUT                     359   1006   1604   3269
LUTRAM                    9      9     12     13
Flip Flop               351    676    882   1481
Pixel Delay            2212   4413   6614   8815
# of Additions           16     48     96    160
# of Multiplications      9     25     49     81
Results
Modifications of Separable Kernel
The original separable convolution architecture can be modified
to find the maximum neighbor. Finding the maximum neighbor is a crucial step for
algorithms like Harris corner detection. Instead of computing the weighted
average, the maximum value is computed. Each component within the kernel
is viewed as a leaf node. Each leaf node is compared with its neighbor, and
the winner proceeds to the next level. The process is repeated
until only one node is left (see Fig. 14). This tournament is repeated for both
the horizontal and vertical components.
Figure 14: Shows the process of finding the largest value within a
neighborhood. Instead of using a linear scan to find the
maximum within a neighborhood, a tournament
reduces the number of clock cycles from N-1 to log2(N).
Figure 15: Harris Corner Block Diagram.
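The tournament for finding the maximum can be sketched as a pairwise reduction. The sketch below treats each pass of the loop as one level of the tree, which would correspond to one clock cycle if every comparison within a level runs in parallel; that mapping to clock cycles is an assumption, and the function name is hypothetical.

```python
def tournament_max(values: list) -> tuple:
    """Return (maximum, levels) using pairwise comparisons level by level.

    `values` must not be empty.
    """
    level = list(values)
    levels = 0
    while len(level) > 1:
        winners = [max(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            winners.append(level[-1])     # the unpaired node advances unchanged
        level = winners
        levels += 1
    return level[0], levels


print(tournament_max([3, 9, 2, 7, 5]))
```

Five inputs need three levels, compared with four sequential comparisons in a linear scan; the gap widens quickly for larger neighborhoods.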
Harris Corner Detection
The first step is to compute the derivatives of the image simultaneously,
producing Ix and Iy. The next step is to square both Ix and Iy to produce
Ixx and Iyy. The multiplication of Ix and Iy produces Ixy. Afterwards, a 5 by 5
Gaussian kernel is convolved with Ixx, Iyy, and Ixy individually. [13] A syncing
core is used to ensure that the hardware signals are not out of sync. The Harris
corner core then computes the response value from the determinant and trace of
these blurred products. Finally, the response value must be larger than a given
threshold and be a local maximum in the 5 by 5 neighbourhood to be considered
a valid corner. The block diagram is shown in Fig. 15.
The next few figures show results from the Harris corner detector in
different environments with various threshold values.
Figure 16: The image illustrates the process of building the Gaussian
Pyramid. The data is blurred and subsampled for each resolution.
The blur kernel depends on the user input. [18]
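The processing chain for the corner detector can be sketched in software. The sketch below makes several assumptions that are not taken from this thesis: it uses the commonly cited response det(M) - k * trace(M)^2 with k = 0.04, central differences for the derivatives, and a 3 by 3 box sum and 3 by 3 local maximum in place of the 5 by 5 Gaussian and 5 by 5 neighbourhood, to keep the example short. The function names are hypothetical.

```python
def harris_response(img: list, k: float = 0.04) -> list:
    """Response map for a grayscale image given as a list of rows."""
    h, w = len(img), len(img[0])
    ixx = [[0.0] * w for _ in range(h)]
    iyy = [[0.0] * w for _ in range(h)]
    ixy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix = (img[y][x + 1] - img[y][x - 1]) / 2   # X derivative
            iy = (img[y + 1][x] - img[y - 1][x]) / 2   # Y derivative
            ixx[y][x], iyy[y][x], ixy[y][x] = ix * ix, iy * iy, ix * iy

    def window_sum(plane, y, x):
        # 3 x 3 box sum standing in for the Gaussian blur
        return sum(plane[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1))

    response = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            a = window_sum(ixx, y, x)
            b = window_sum(iyy, y, x)
            c = window_sum(ixy, y, x)
            response[y][x] = (a * b - c * c) - k * (a + b) ** 2
    return response


def find_corners(response: list, threshold: float) -> list:
    """Keep points above the threshold that are a 3 x 3 local maximum."""
    h, w = len(response), len(response[0])
    corners = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            value = response[y][x]
            neighbours = (response[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            if value > threshold and all(value >= n for n in neighbours):
                corners.append((y, x))
    return corners


# Test image: a bright square occupying the lower-right quadrant of an 8 x 8 frame.
image = [[100 if (y >= 4 and x >= 4) else 0 for x in range(8)] for y in range(8)]
response = harris_response(image)
print(find_corners(response, threshold=1e6))
```

On the test image the flat background gives a zero response, the straight edges give a negative response, and only the corner of the square survives the threshold and local-maximum test.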
Laplacian Pyramid
The first step of the Laplacian pyramid is to create the Gaussian pyramid.
The Gaussian pyramid is built by chaining a Gaussian convolution
kernel with a subsampling core to produce the Gaussian output of each layer
(see Fig. 16). Once the Gaussian pyramid is formed, the difference between
two adjacent layers is taken to form the Laplacian pyramid. Finally, the two
frames are merged by shifting one of the frames by a given displacement.
The reconstruction of the pyramid is done by upsampling the smallest layer
and adding it to the layer below it. The result is then upsampled and blurred
again and added to the Laplacian layer below (see Fig. 17). The process
continues until the base layer of the Laplacian pyramid is reached. The base
layer is the fused image. [19]
Figure 17: This illustrates the reconstruction of the
Laplacian image. f2 is the highest layer of the Laplacian pyramid and h1 is
the second-highest layer. f2 is upsampled and blurred to prevent
aliasing. It is then added to h1 to form f1. The same process is
repeated with h0 to obtain the final image f0. [19]
The block diagram (see Fig. 18) shows the workflow for merging two different
frames using the Laplacian pyramid. The first step is to decompose both
frames into their Laplacian pyramids. The second step is to merge the two
pyramids. The merging is done by comparing the pixel count against the given
X and Y translation values. If the pixel count is less than the X and Y
translation values, the core accepts the pixel value of the first frame. If
the pixel count is larger than the X and Y translation values, the core
accepts the pixel value of the second frame. Finally, the image is
reconstructed, and the base of the reconstructed pyramid is used for the HDMI
output.
Figure 18: The workflow of the Laplacian pyramid fusion. Both frames are
first decomposed. The Laplacian layers of the two frames are then
merged, with the value of each pixel assigned based on the X translate
and Y translate values. Finally, the image is reconstructed and sent
to the output.
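The selection rule used during merging can be sketched as follows. The
function name and the list-of-rows layout are hypothetical. The sketch also
assumes that a pixel is taken from the first frame only when both its X and Y
counts are below the translation values, a case the text does not spell out.

```python
def merge_layers(first, second, x_translate, y_translate):
    """Pick each value from `first` or `second` based on the pixel position."""
    merged = []
    for y, (row_a, row_b) in enumerate(zip(first, second)):
        out_row = []
        for x, (a, b) in enumerate(zip(row_a, row_b)):
            # Assumption: both counters must be below the translation values.
            use_first = x < x_translate and y < y_translate
            out_row.append(a if use_first else b)
        merged.append(out_row)
    return merged


first = [[1, 1, 1], [1, 1, 1]]
second = [[2, 2, 2], [2, 2, 2]]
merged = merge_layers(first, second, x_translate=2, y_translate=1)
```

The same rule would be applied to every layer of the two pyramids before
reconstruction.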
Xilinx Architecture: AXI-Lite
AXI-Lite is used for communication between the programmable logic and the
processor in both directions. This thesis uses AXI-Lite to communicate the
threshold value of the Harris corner detection and the displacement value
between the processor and the programmable logic. The appropriate threshold
value varies between environments. One easy way to set the threshold is
through trial and error by human input. The processor then uses AXI-Lite to
send the user input to the programmable logic. This approach is also used for
sending the displacement to the Laplacian Pyramid core and for reading the
essential feature points from the programmable logic.
Figure 19: Timing diagram of an AXI-Lite read. The read address,
0x30000000, is presented first. When both ARVALID and ARREADY are
high, the address has been registered by the slave core. After the
address is registered, the data is sent. The data is valid and can be
read when both RVALID and RREADY are high. [20]
To complete a handshake, AXI-Lite must ensure that the master knows the slave
has accepted the data. Each channel therefore has a valid signal, driven by
the sender, and a ready signal, driven by the receiver. When both signals are
high, the handshake is considered successful (see Fig. 19, 20). The handshake
is done twice, once for the address and once for reading or writing the data
value.
Figure 20: Timing diagram of an AXI-Lite write. AWAddress and AWReady
indicate to the slave that the master wants to write to the given
address. Once both are high, the transaction proceeds to the write
data channel. When the valid and ready signals of that channel are
both high, the data is ready to be written. The write response channel
indicates whether the write was successful. [20]
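The handshake rule can be modeled in a few lines: a transfer takes place on
any clock cycle in which the valid and ready signals of a channel are both
high. The waveforms below are made up for illustration and are not taken from
the figures.

```python
def handshake_cycles(valid, ready):
    """Return the clock cycles on which valid and ready are both high."""
    return [cycle for cycle, (v, r) in enumerate(zip(valid, ready)) if v and r]


# Hypothetical read: the address is accepted on cycle 2, the data on cycle 4.
arvalid = [0, 1, 1, 0, 0, 0]
arready = [0, 0, 1, 0, 0, 0]
rvalid = [0, 0, 0, 0, 1, 0]
rready = [0, 0, 0, 1, 1, 0]

address_cycles = handshake_cycles(arvalid, arready)
data_cycles = handshake_cycles(rvalid, rready)
```

Here the sender holds valid high on cycle 1, but nothing is transferred until
the receiver raises ready on cycle 2.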
Xilinx Architecture: AXI-Master
However, there are cases where AXI-Lite is too slow. It takes one clock cycle
to register the address and another clock cycle to register the data.
Sometimes it is important to store a stream of data, for example the incoming
data from the HDMI input. To store all of the values in fewer clock cycles,
an additional signal is needed: the total length of the incoming data. This
way, storing N data values requires only N+1 clock cycles as opposed to 2N
clock cycles. For example, sending an address of 0x30000000 with a length of 4
implies that address 0x30000000 is used for the first incoming data value,
0x30000004 for the second, 0x30000008 for the third, and 0x3000000C for the
last. [21] This is useful for storing data that requires more storage than
the BRAM available on the FPGA. This approach is used to shift one of the
image frames before fusion.
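The address sequence and the cycle counts from this paragraph can be checked
with a short sketch. The function names are hypothetical, and the 4-byte step
between addresses is inferred from the example addresses above.

```python
def burst_addresses(base, length, bytes_per_value=4):
    """Addresses implied by one base address plus a burst length."""
    return [base + i * bytes_per_value for i in range(length)]


def cycles_single(n):
    """One address cycle plus one data cycle for every value."""
    return 2 * n


def cycles_burst(n):
    """One address cycle, then one data cycle per value."""
    return n + 1


addresses = [hex(a) for a in burst_addresses(0x30000000, 4)]
```

For one 1920-pixel video line, the cycle model gives 1921 cycles for a burst
against 3840 cycles for individual transfers.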
AXI-Master performs the same handshake as AXI-Lite. However, an additional
signal, length, is used. The length signal indicates the total length of the
data being sent (see Fig. 21, 22).
Xilinx Architecture: AXI-Stream
AXI-Stream is the interface standard that Xilinx uses to communicate with its
built-in cores. This project creates an AXI-Stream core to synchronize two
VDMAs (Video Direct Memory Access). This eliminates the possibility of an
out-of-sync frame occurring when pipelining. The difference between
AXI-Stream and AXI-Lite/AXI-Master is that no address is sent. All of the
data is pipelined.
Figure 21: The read transaction of the AXI-Master. It is similar to
AXI-Lite, with the length as an additional signal. If the length is N,
the read data channel sends N data values. When N handshakes have been
made on the RVALID and RREADY signals, the slave sends a tlast signal
indicating the end of the frame. [21]
Figure 22: The write transaction of the AXI-Master, which is similar to
that of AXI-Lite. The length, N, is added to indicate how much data is
sent. On the write data channel, the valid and ready signals are
driven high. When a total of N handshakes have been made, a WLAST
signal is sent to indicate that the value is the last of the
transaction. [21]
Software
The C code uses three algorithms for testing: 1) manual movement, 2) nearest
neighbor, and 3) image descriptors. If an algorithm is able to fuse the two
frames, the PSNR value of its result is then calculated.
The manual movement requires the user to enter the translation that looks
best visually.
The nearest neighbor asks the user for the closest translation and then
calculates the best change of displacement.
The image descriptors algorithm works by reading the pixel values of each
feature point's neighbors from memory. Each feature point contains the X and
Y position of its location. After retrieving the neighbors of a pixel, the
algorithm compares the result with the closest match. The downside of the
algorithm is that the result can be non-deterministic, so it may not find the
optimal result.
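One way to picture the nearest neighbor refinement is a small search around
the user's guess. The sketch below is an assumption about how such a search
could work, not the C code used in this thesis: it scores each candidate
displacement by the mean squared error over the overlapping region and keeps
the lowest. All names and the search radius are hypothetical.

```python
def overlap_mse(first, second, dx, dy):
    """Mean squared error where `second`, shifted by (dx, dy), overlaps `first`."""
    height, width = len(first), len(first[0])
    total, count = 0.0, 0
    for y in range(max(0, dy), min(height, height + dy)):
        for x in range(max(0, dx), min(width, width + dx)):
            diff = first[y][x] - second[y - dy][x - dx]
            total += diff * diff
            count += 1
    return total / count if count else float("inf")


def refine_translation(first, second, guess, radius=1):
    """Search the neighbors of `guess` for the lowest overlap error."""
    gx, gy = guess
    candidates = [(gx + ox, gy + oy)
                  for oy in range(-radius, radius + 1)
                  for ox in range(-radius, radius + 1)]
    return min(candidates, key=lambda d: overlap_mse(first, second, d[0], d[1]))


def pattern(x, y):
    """Synthetic, non-repeating test pattern."""
    return x * x + 3 * y * y + x * y


# The second frame sees the same scene displaced by 3 pixels in X and 2 in Y.
first = [[pattern(x, y) for x in range(8)] for y in range(8)]
second = [[pattern(x + 3, y + 2) for x in range(8)] for y in range(8)]
best = refine_translation(first, second, guess=(2, 2))
```

Starting from a guess that is one pixel off, the search settles on the
displacement at which the overlapping pixels agree.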
Misc
The HDMI input is generated by a Raspberry Pi. The Raspberry Pi is connected
to a camera that generates the input image. The HDMI output is connected to a
TV that supports HDMI to visualize the result.
Cores used in Design
The following figures show the cores that were developed for the project
(see Fig. 23 - 37).
Figure 23: The first-frame syncing core. It ensures that the output
starts at the first active frame before any processing is done in a
pipelined fashion.
Figure 24: The input and output of the core that computes the derivative
images in X and Y. The core is designed for the pipelined output from
the HDMI output. The same design is also used for the Gaussian cores.
Figure 25: The core that is used to square the output results.
Figure 26: The core that is used to multiply two output results.
Figure 27: Shows the core that is used to compute the Harris Corner
Response value.
Figure 28: The local neighbor maximum filter. It takes a threshold
value; the values must be larger than the given threshold.
Figure 29: The signal buffer frame. For the other processing stages,
the important signal is the active signal; the hsync and vsync signals
are not needed. To save BRAM storage, these signals are stored in the
signal buffer frame. This has the potential to reduce the storage
required.
Figure 30: The feature point storage core. It is used to communicate
to the processor the X and Y positions of the feature points in a
given image.
Figure 31: The threshold setter core, through which the processor sets
the threshold value for the Harris corner detection. The threshold
value is then sent to the programmable logic.
Figure 32: This core is used to convert RGB 444 to YCbCr 444. It
is also used to convert YCbCr 444 to YCbCr 422. This is useful to
convert the grayscale image to YCbCr 44 for the HDMI Output.
Figure 33: The core that is used for large translation values of the
frame. The data is stored in the processor memory, since the core is
an AXI-Master. It is then output using AXI-Stream to the other cores
for conversion.
Figure 34: The core that is used to sync the two AXI-Stream frames, so
that the input and output signals are in sync when being output and no
distortion is caused.
Figure 35: The Laplacian decomposition core that is used to break down
the image into its Laplacian pyramid.
Figure 36: The Laplacian addition of the two frames at the various
levels. Pixel counts lower than X Translate and Y Translate accept the
1st frame. Pixel counts larger than X Translate and Y Translate accept
the 2nd frame.
Figure 37: Shows the laplacian reconstruction to form the newly
merged image. This is used for the HDMI Output values.
The following figures show the cores developed by Xilinx that are used in
the project (see Fig. 38 - 42).
Figure 38: The video timing core provided by Xilinx. It is used for
creating the hblank and vblank signals for the output signal.
Figure 39: The AXI-Stream to video output core. It converts the
AXI-Stream input to video output clocking to ensure that the output is
correct.
Figure 40: The HDMI input core, which converts the signal into the
hblank, vblank, active, and data signals that are used for processing.
The core is developed by AVNET.
Figure 41: The Video In to AXI4-Stream core. It converts the hblank,
vblank, active, and data signals to AXI4-Stream, which can be
synchronized with the processor clock in order to communicate with the
processor.
Figure 42: The HDMI output core. It converts the video signals used in
the programmable logic to the HDMI output signal that is then received
by the TV or other HDMI output device.
The block diagram of the whole image fusion design is shown in Fig. 43.
Figure 43: The block diagram of the image fusion design for 1080p video
processing. The result is a low latency that is barely detectable by
human vision.
Chapter 3: Performance
Hardware Complexity
By implementing separable kernels in the algorithm, the Harris corner
detection design was reduced from 104 multiplications to 24 multiplications
and from 204 additions to 44 additions. The design of the Laplacian Pyramid
was reduced from 75 multiplications to 15 multiplications and from 144
additions to 24 additions. The kernel size used throughout the design is 5 by
5. The hardware complexity grows linearly with the number of kernels added,
i.e., each additional 5 x 5 kernel increases the complexity by 5.
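The idea behind a separable kernel can be illustrated in software. A 2-D
kernel that is the outer product of two 1-D tap vectors can be applied as a
row pass followed by a column pass, so each output needs one multiplication
per tap in each pass instead of one per kernel entry. The sketch below uses
assumed binomial taps and hypothetical function names. It only shows that the
two forms give the same result and does not attempt to reproduce the
multiplication and addition totals quoted above.

```python
def filter_rows(image, taps):
    """Filter every row with a 1-D kernel (valid region only)."""
    k = len(taps)
    return [[sum(taps[i] * row[x + i] for i in range(k))
             for x in range(len(row) - k + 1)]
            for row in image]


def transpose(image):
    return [list(col) for col in zip(*image)]


def filter_separable(image, col_taps, row_taps):
    """Row pass followed by column pass."""
    horizontal = filter_rows(image, row_taps)
    return transpose(filter_rows(transpose(horizontal), col_taps))


def filter_direct(image, col_taps, row_taps):
    """Direct 2-D filter using the outer-product kernel."""
    k = len(row_taps)
    kernel = [[c * r for r in row_taps] for c in col_taps]
    height, width = len(image), len(image[0])
    return [[sum(kernel[j][i] * image[y + j][x + i]
                 for j in range(k) for i in range(k))
             for x in range(width - k + 1)]
            for y in range(height - k + 1)]


taps = [1, 4, 6, 4, 1]  # assumed 5-tap weights, unnormalized
image = [[(3 * x + 5 * y + x * y) % 17 for x in range(8)] for y in range(7)]
same = filter_separable(image, taps, taps) == filter_direct(image, taps, taps)
```

With integer taps and integer pixels, the two results match exactly.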
The number of BRAMs used is also reduced by storing data in the processor
memory. Storing a full 1080p image would require 2073600 pixels. Instead, a
1920-pixel BRAM is used to buffer the data that is to be written to memory,
and another 1920-pixel BRAM is used to buffer the output data.
Another reduction comes from using only the luminance channel. Instead of
computing the R, G, and B channels separately and combining them later, only
one channel is used for computation, which further reduces the required
hardware complexity.
Latency
The latency of the algorithm is approximately one frame at a frame rate of
60 Hz, since each frame is stored in the processor memory for one frame
period before being processed. When the vertical blank is triggered high, the
Laplacian fusion and Harris corner detection design blocks are computed. The
latency is therefore approximately 16.7 ms. The average human takes
approximately 250 ms to visually detect that there is a delay. The latency is
therefore well below the human visual threshold, and the lag would not be
sensed by the average human.
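The latency figure follows directly from the frame rate. The short check
below assumes that the latency is exactly one frame period; the function name
is hypothetical.

```python
def frame_period_ms(frame_rate_hz):
    """Duration of one frame in milliseconds."""
    return 1000.0 / frame_rate_hz


latency_ms = frame_period_ms(60)   # one frame at 60 Hz
human_threshold_ms = 250.0         # detection threshold quoted in the text
margin = human_threshold_ms / latency_ms
```

One frame at 60 Hz lasts about 16.7 ms, which is roughly one fifteenth of the
250 ms threshold.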
PSNR Values
To measure how well the frames are fused, the overlapping region of the two
frames is compared using the PSNR value. If the PSNR value is above 30 dB,
the result is considered visually pleasing to the human eye. The PSNR values
are collected and calculated for the manual calibration (see Table 3) and the
nearest neighbor calibration (see Table 4). The results are obtained from
three still images and from a webcam source.
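For reference, the PSNR of two equally sized regions can be computed as in
the sketch below. It assumes 8-bit samples with a peak value of 255 and uses
the standard definition, 10 * log10(peak^2 / MSE). The function name and the
list-of-rows layout are hypothetical and not taken from the C code.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized regions."""
    total, count = 0.0, 0
    for row_r, row_t in zip(reference, test):
        for r, t in zip(row_r, row_t):
            total += (r - t) ** 2
            count += 1
    mse = total / count
    if mse == 0:
        return math.inf  # identical regions
    return 10.0 * math.log10(peak * peak / mse)


region_a = [[100, 100], [100, 100]]
region_b = [[101, 101], [101, 101]]
value = psnr(region_a, region_b)
```

A uniform error of one gray level gives about 48.13 dB, while the largest
possible error on 8-bit data gives 0 dB.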
Table 3: Image fusion PSNR values for manual calibration on Figure 52
and Figure 53
Figure #                   PSNR Value (dB)
Top Left (Fig. 52)         21.67
Middle Left (Fig. 52)      19.84
Bottom Left (Fig. 52)      18.44
Figure 53                  15.55
Table 4: Image fusion PSNR values for nearest neighbor calibration on
Figure 52 and Figure 53
Figure #                   PSNR Value (dB)
Top Left (Fig. 52)         21.95
Middle Left (Fig. 52)      21.56
Bottom Left (Fig. 52)      20.14
Figure 53                  N/A
Chapter 4: Demo Images
The demo images for non-separable convolution kernels are shown in Fig. 44.
The kernel that is used is an emboss filter.
Figure 44: An example image produced by the emboss filter.
The demo images for separable kernels are shown in Fig. 45. The kernels
demonstrate the Sobel filter in the X direction and the Sobel filter in the Y
direction.
Figure 45: The original image (top), the Sobel X image (bottom left),
and the Sobel Y image (bottom right).
The demo images for Harris corner detection are shown in Fig. 46 - Fig. 48
with varying threshold values.
Figure 46: Harris corner detection with no threshold.
Figure 47: Harris corner detection with the threshold set to 50.
Figure 48: Harris corner detection with the threshold set to 1000.
The demo images for the Laplacian pyramid are shown in Fig. 49 - Fig. 53.
Figure 49: A regular Laplacian merge of overlapping images.
Figure 50: A Laplacian merge of two images, one of which is shifted. A
ghosting effect can be seen.
Figure 51: A Laplacian merge of two unrelated images.
Figure 52: A Laplacian merge of two closely related images. The images
on the left use the initial guess; those on the right use nearest
neighbor.
Figure 53: An attempt at merging with a real-time webcam input source.
Chapter 5: Conclusion
The design implementation is a success. The design is able to fuse images
from two HDMI camera sources with a low latency of 17 ms, well below the
human visual threshold. The main issue in the project is the visual result,
in which a ghosting effect can be seen. The ghosting effect comes from the
lack of translation and rotation. It is also noted that in real-time fusion,
noise from the input source makes the frames hard to fuse. More feature
points are also needed for the algorithm to converge; however, the number of
feature points is limited by hardware restrictions.
Applications
The applications of the project include robotic vision, security cameras, and
multiplexing. This design would allow robots to imitate human vision without
lag. It may also allow robots to react faster than human beings.
A second application is security cameras. Merging all the frames together
allows a larger field of view. As a result, a larger scene can be analyzed at
once.
The final application is image multiplexing. Camera sources may come from
various inputs, such as infrared cameras, ultraviolet cameras, and regular
RGB cameras. Merging and interpolating these input sources together may allow
humans to see something that cannot be sensed by the naked eye. This could be
further used in devices like microscopes and in color night vision.
Future Work
Future work on the design includes implementing color. One design mistake was
implementing the Laplacian pyramid on the output stream for pipelining rather
than on AXI-Stream. Using AXI-Stream would reduce the amount of storage
needed and also ensure that the timing of the output pixels is in sync.
Another improvement would be to implement an affine transform. This could be
done with the help of software. The software would calculate the complex math
and move the resulting data into the new address space. This way the input
image can be scaled, rotated, and translated, which would help the fusion
result by reducing ghosting.
A case study of various feature detection algorithms could also be carried
out to pick the best feature selection algorithm for various scenarios. The
optimal feature detection algorithm would have distinct features on each
frame and less noise.
Finally, image stabilization is needed to reduce the noise captured in the
feature points.
Bibliography
[1] Flusser, Jan, Filip Sroubek, & Zitova, Barbara. Image Fusion: Principles,
Methods and Applications. EUSIPCO 2007.
[2] Seranno, J. Introduction to FPGA. CERN.
[3] Xilinx. Xilinx Zynq-7000 All Programmable SoC ZC702 Evaluation Kit.
[4] Nurvitadhi, Eriko, Venkatesh, Ganesh, Sim, Jaewoong, Marr, Debbie,
Huang Randy, Jason Gee Hock Ong, Liew, Tat Yeong, Srivatsan, Krishnan,
Moss, Duncan, Subhaschandra, Suchit, Boudoukh, Guy. Can FPGAs Beat
GPUs in Accelerating Next Generation Deep Neural Networks? Intel
Corporation.
[5] Xilinx. AXI Reference Guide. March 2011.
[6] Xilinx. AXI4-Stream Video IP and System Design Guide. October 2016.
[7] MLM. VGA Decoding - Dealing with tolerances. December 2013.
[8] Xilinx. Chroma Resampler v4.0. November 2015.
[9] Xilinx. Video In to AXI4-Stream 4.0. October 2017.
[10] Learned G. Erik. Human Vision Light, Color, etc. Introduction to
Computer Vision.
[11] Flusser, Jan, Filip Sroubek, & Zitova, Barbara. Image Fusion: Principles,
Methods and Applications. EUSIPCO 2007.
[12] Harris, Chris & Stephens, Mike. A Combined Corner and Edge Detector.
Plessey Research Roke Manor. 1988.
[13] Collins, Robert. Lecture 06: Harris Corner Detector. Introduction to
Computer Vision. 2007.
[14] E.H. Adelson, C.H. Anderson, J.R. Bergen, P.J. Burt, J.M. Ogden.
Pyramid Methods in Image Processing. RCA Engineering. Nov/Dec 1984.
[15] Kenneth Kwan. Image Pyramids and Blending. Computational
Photography. 2015.
[16] Microsoft Dev Center. JPEG YCbCr Support. 2018.
[17] Benedetti, Arrigo, Andrea, Prati, Scarabottolo, Nello. Image Convolutions
on FPGAs: the Implementation of a Multi-FPGA FIFO Structure.
EUROMICRO 98.
[18] Cmglee. Pyramid (Image Processing). August 2015. Wikipedia.
[19] Stanford Exploration Project. The Laplacian Pyramid. January 2002.
[20] Griffin, Rich. Designing a Custom AXI-lite Slave Peripheral. July 2014.
Silica EMEA.
[21] Griffin, Rich. Designing a Custom AXI Master using Bus Functional
Models (BFMs). May 2015. Silica EMEA.
