Block-Matching Optical Flow for Dynamic Vision
Sensors: Algorithm and FPGA Implementation
Min Liu and Tobi Delbruck
Institute of Neuroinformatics
University of Zurich and ETH Zurich, Zurich, Switzerland
Email: minliu@ini.uzh.ch
Abstract—Rapid and low power computation of optical flow
(OF) is potentially useful in robotics. The dynamic vision sensor
(DVS) event camera produces quick and sparse output, and has
high dynamic range, but conventional OF algorithms are frame-
based and cannot be directly used with event-based cameras.
Previous DVS OF methods do not work well with dense textured
input and are not designed for implementation in logic circuits. This
paper proposes a new block-matching based DVS OF algorithm
which is inspired by motion estimation methods used for MPEG
video compression. The algorithm was implemented both in
software and on FPGA. For each event, it computes the motion
direction as one of 9 directions. The speed of the motion is set
by the sample interval. Results show that the Average Angular
Error can be improved by 30% compared with previous methods.
The OF can be calculated on FPGA with 50 MHz clock in 0.2 us
per event (11 clock cycles), 20 times faster than a Java software
implementation running on a desktop PC. Sample data show
that the method works on scenes dominated by edges, sparse
features, and dense texture.
I. INTRODUCTION
Optical Flow (OF) estimation has always been a core
topic in computer vision; it is widely used in segmentation,
3D reconstruction and navigation. It was first studied in the
context of neuroscience to understand motion perception in
insects and mammals. In computer vision, OF describes the
motion field induced by camera movement through space. Two
well known inexpensive optical flow algorithms are the Lucas-
Kanade [1] and Horn-Schunck [2] methods. The core of many
OF methods is a search over possible flows to select the most
likely one at each image or feature location. This search with
dense image blocks is expensive and difficult to calculate on
an embedded platform in real time.
The DVS is data-driven rather than regular-sample driven.
A regular-sample driven camera sends its output data at a
fixed interval and is therefore called a frame-based camera.
The DVS output, in contrast, is driven by brightness changes
rather than by a fixed sample interval, so new OF methods need
to be designed. Benosman et al. [3] proposed a time-surface
method, which combines the 2D events and timestamps into
a 3D space and obtains OF by local plane fitting. Benosman
et al. [4] proposed a Lucas-Kanade gradient-based method that
collects short 2D histograms of events and solves the
brightness-constancy constraint on them. In 2015, Conradt [5] proposed
a real-time DVS optical flow algorithm implementation on
an ARM 7 microcontroller. Barranco et al. [6] proposed a more
expensive phase-based method for high-frequency texture regions.
Rueckauer and Delbruck [7] re-implemented several of these
methods in the Java framework jAER [8] and compared them with
the earliest jAER method based on time-of-flight of oriented
edges. Their conclusion was that all methods offered comparable
accuracy for sharp and sparse edges, but all failed on textured or low
spatial frequency inputs, because the underlying assumptions
(e.g. smooth gradients or isolated edges) are violated. That
paper also introduced the use of an integrated camera inertial
measurement unit (IMU) to obtain ground truth global optical
flow from camera rotation and published a benchmark dataset
from a 240x180 pixel DVS camera, which we use here. Most
of the existing work is based on PC software algorithms [3]
[4] [6] [7]. Though [5] is based on an embedded system and
can work in real-time, it was only characterized for camera
rotations, not camera translation through space, and its use
of direct time of flight of events makes it unlikely to work
well with dense textured scenes and likely to suffer from
aperture problems for edges.
In video technology, OF is called motion estimation (ME)
and is widely used in exploiting the temporal redundancy
of video sequences for video compression standards, such as
MPEG-4 and H.263 [9]. The pipeline for ME includes block
matching. Block matching means that rectangular blocks of
pixels are matched between frames to find the best match.
Block matching is computationally expensive, which is why it
is now widely implemented in dedicated logic circuits. An
example of a logic ME implementation based on block matching
is presented in [9].
Our paper proposes an event-based block matching algorithm
to calculate OF on FPGA.
The paper is organized as follows: Section II introduces
our system architecture and algorithm, Section III shows
experimental results, and Section IV concludes the paper.
II. PROPOSED METHOD
The output of DVS is a stream of brightness change events.
Each event has a microsecond timestamp, a pixel address, and
a binary polarity describing the sign of the brightness change.
Each event signifies a change in brightness of about 15%
since the last event from the pixel. In this work, events are
accumulated into time slice frames as binary images ignoring
the polarity, since our aim is for minimum logic and memory
size. Here we will refer to these bitmap frames as slices.
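The accumulation of events into a slice can be illustrated with a minimal sketch. It assumes each event is a (timestamp, x, y, polarity) tuple; this layout and the name accumulate_slice are chosen only for illustration and are not taken from jAER or the FPGA design.

```python
WIDTH, HEIGHT = 240, 180  # pixel array size of the DVS used for the benchmark data


def accumulate_slice(events, width=WIDTH, height=HEIGHT):
    """Return a binary bitmap with a 1 at every pixel that produced an event.

    `events` is an iterable of (timestamp_us, x, y, polarity) tuples.
    The tuple layout is an assumption made for this sketch.
    """
    bitmap = [[0] * width for _ in range(height)]
    for _timestamp_us, x, y, _polarity in events:
        # Polarity is ignored, and repeated events at one pixel do not add up.
        bitmap[y][x] = 1
    return bitmap


# Two events at pixel (10, 20) with opposite polarity, one event at (11, 20).
events = [(1000, 10, 20, 1), (1040, 10, 20, 0), (1100, 11, 20, 1)]
slice_t = accumulate_slice(events)
set_pixels = sum(map(sum, slice_t))  # number of pixels set in the slice
```

Because the slice is a bitmap rather than an event count, it needs only one bit per pixel.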
arXiv:1706.05415v1 [cs.CV] 16 Jun 2017
Figure 1. System Architecture. [Block diagram; labeled elements: Host PC,
Monitor-Sequencer Board, and a Spartan 6 FPGA containing the Finite State
Machine, Rotation Control Logic, and Slice RAMs; signal labels: clk, send,
receive, data, Enable.]
A block is a square centered around the incoming event’s
location. Matching is based on a distance metric. In our work,
we implemented Hamming Distance (HD) as the distance
metric. HD is the count of the number of differing bits. For
bitmaps, HD is exactly the same as the better-known Sum-of-
Absolute-Differences (SAD).
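The equivalence of the two metrics on bitmaps follows from the fact that, for single bits, the absolute difference equals the exclusive OR. Written out, with a_i and b_i denoting the n bits of the two blocks (symbols introduced here only for illustration):

```latex
\mathrm{SAD}(A, B) = \sum_{i=1}^{n} \lvert a_i - b_i \rvert
                   = \sum_{i=1}^{n} (a_i \oplus b_i)
                   = \mathrm{HD}(A, B),
\qquad a_i, b_i \in \{0, 1\}
```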
The software implementation is open source. It is called
PatchMatchFlow [10] in jAER.
A. System Evaluation Architecture
The hardware evaluation system is divided into two parts,
one for data sequencing and monitoring and the other for the
algorithm implementation. For the first part, we use a monitor-
sequencer board [13] designed by the Univ. of Seville. The
sequencer converts the event-based benchmark dataset [7] into
real-time hardware events sent to the OF FPGA. During OF
calculation, the monitor collects the OF events and sends them
over USB to jAER for rendering and analysis. In this way
we can compare software and hardware processing of the OF
algorithm. In this work, we only used prerecorded data to
allow systematic comparison between software and hardware
implementations.
The OF architecture (Fig 1) contains three main modules:
the finite state machine (FSM), the block random access
memories (RAMs), and the rotation control logic. The architecture
of the FSM is shown in Fig 2. The FSM consists of three
parts: data receiving module, OF calculation module, and data
sending module. The data sending and data receiving modules
communicate with the monitor-sequencer. The OF module is
described in Section II-B.
Three 240x180-pixel DVS event bitmap slices are stored
in RAM. These slices are like binary image frames from
conventional cameras but in the case of DVS we can arbitrarily
select the slice interval. One is the current collecting slice
starting at time t and the other two are the past two slices
starting at times t-d and t-2d. d is the slice duration. At
intervals of d, the rotation control logic rotates the three
slices. The t slice accumulates new data. It starts out empty
and gradually accumulates events, so it cannot be used for
matching to past slices. The two past slices are used for OF,
but the OF computation is done at the location of each event
stored into the t slice, and thus is driven by these events. Slices
are stored in block RAM on the FPGA. The total size of the
RAM is 240x180x3 bits, i.e. three bitmaps each matching the DVS
pixel array size. It is generated with a Xilinx IP core.
Figure 2. Finite state machine. [State diagram; labels recovered from the
diagram: IDLE, start, Read Data, Check, Extract Events, Read Blocks, SAD/HD,
Get Minimum, Send data, Timeout Check, RAM Rotation; transition labels:
req = 0, req = 1, ack = 0, ack = 1, Yes, No.]
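The slice bookkeeping described in this subsection can be sketched in software as follows. This is a simplified model and not the FPGA logic: the class name SliceBuffer is hypothetical, and rotation is triggered here by event timestamps, which is an assumption of the sketch.

```python
class SliceBuffer:
    """Three binary slices: the collecting slice t and the past slices t-d, t-2d."""

    def __init__(self, width, height, duration_us):
        self.width = width
        self.height = height
        self.duration_us = duration_us  # slice duration d
        self.slices = [self._empty() for _ in range(3)]  # order: [t, t-d, t-2d]
        self.slice_start_us = None

    def _empty(self):
        return [[0] * self.width for _ in range(self.height)]

    def add_event(self, timestamp_us, x, y):
        if self.slice_start_us is None:
            self.slice_start_us = timestamp_us
        # Rotate once for every slice interval d that has elapsed: the oldest
        # slice is dropped and a new, empty collecting slice is started.
        while timestamp_us - self.slice_start_us >= self.duration_us:
            self.slices = [self._empty(), self.slices[0], self.slices[1]]
            self.slice_start_us += self.duration_us
        self.slices[0][y][x] = 1

    def past_slices(self):
        """Return the (t-d, t-2d) slices, the only ones used for matching."""
        return self.slices[1], self.slices[2]


buf = SliceBuffer(width=8, height=8, duration_us=1000)
buf.add_event(0, 1, 1)     # lands in the first slice
buf.add_event(1200, 2, 1)  # one rotation, lands in the second slice
buf.add_event(2500, 3, 1)  # another rotation, lands in the third slice
slice_td, slice_t2d = buf.past_slices()
```

After the third event, the t-d slice holds the event at (2, 1) and the t-2d slice holds the event at (1, 1).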
B. Optical Flow algorithm
When an event arrives, a single reference block from slice
t-d and 9 blocks from slice t-2d are sent to the HD module
to calculate the distances. In the current implementation, the
block contains 9x9 pixels. For the t-d slice, we use only one
center block as the reference. The algorithm finds the most
similar block on the t-2d slice. According to the brightness-
constancy assumption of OF, we should see a similar block in
the t-2d slice for the block that best matches the actual OF.
We search over the 8 blocks centered on the 8 neighbors of the
current event address and one block centered on the reference
and choose the one with minimum distance.
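The search over the 9 candidate blocks can be sketched as below. The function names, the zero padding at the image border, the tie-breaking order, and the sign convention for the reported flow are assumptions of this sketch; they are not taken from PatchMatchFlow or the FPGA design.

```python
# (dx, dy) offsets of the 9 candidate block centers: the reference position
# and its 8 neighbors.
OFFSETS = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def get_block(bitmap, cx, cy, radius):
    """Return the block bits in row-major order; pixels outside the array are 0."""
    height, width = len(bitmap), len(bitmap[0])
    return [
        bitmap[y][x] if 0 <= x < width and 0 <= y < height else 0
        for y in range(cy - radius, cy + radius + 1)
        for x in range(cx - radius, cx + radius + 1)
    ]


def hamming(a, b):
    """Count the positions where the two bit sequences differ."""
    return sum(p != q for p, q in zip(a, b))


def match_event(slice_td, slice_t2d, x, y, radius=4):
    """Return (best_offset, flow, distance) for an event at (x, y).

    radius=4 gives the 9x9 block used in the paper. The best offset is where
    the block was found in the t-2d slice; the flow is taken as its negative,
    i.e. the motion from t-2d to t-d (an assumed convention).
    """
    ref = get_block(slice_td, x, y, radius)
    scored = [
        (hamming(ref, get_block(slice_t2d, x + dx, y + dy, radius)), (dx, dy))
        for dx, dy in OFFSETS
    ]
    distance, (dx, dy) = min(scored, key=lambda item: item[0])
    return (dx, dy), (-dx, -dy), distance


def make_slice(points, size=20):
    bitmap = [[0] * size for _ in range(size)]
    for px, py in points:
        bitmap[py][px] = 1
    return bitmap


# A small pattern that moves one pixel to the right between t-2d and t-d.
old_points = [(8, 9), (9, 10), (10, 8), (11, 11)]
new_points = [(px + 1, py) for px, py in old_points]
offset, flow, distance = match_event(
    make_slice(new_points), make_slice(old_points), x=10, y=10
)
```

For this pattern the best match is the block one pixel to the left in the t-2d slice, so the reported flow points to the right with a distance of zero.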
1) Hamming Distance: The implementation of one HD
block is shown in Fig 3. A total of 81 XOR logic gates
receive input from corresponding pixels on the slices. The
XOR outputs are summed to compute the HD.
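A software analogue of the XOR-and-sum circuit is sketched below, assuming each 9x9 block is packed into a single 81-bit integer. The packing order and the function names are illustrative only.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one integer; bit i holds bits[i]."""
    word = 0
    for i, bit in enumerate(bits):
        word |= (bit & 1) << i
    return word


def hamming_packed(a, b):
    """XOR marks the differing bits; counting the ones gives the HD."""
    return bin(a ^ b).count("1")


block_td = [0] * 81
block_t2d = [0] * 81
for i in (0, 40, 80):
    block_td[i] = 1
for i in (0, 41):
    block_t2d[i] = 1

# The blocks differ at bit positions 40, 41, and 80.
hd = hamming_packed(pack_bits(block_td), pack_bits(block_t2d))
```

In hardware the 81 XOR gates operate concurrently, whereas this sketch relies on the integer XOR and a bit count to play the same role.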
2) Minimum Distance Computation: The last step of the
algorithm is to find the minimum distance candidate. Part of
the novel minimum circuit is shown in Fig 4. It is a parallel
implementation that outputs the index of the minimum distance
direction. For instance, if we need to find the minimum
among 5 values HD0-4 (outputs of the circuit in Fig 3), the circuit
can be divided into 5 parts. The first part in Fig 4 compares HD0 with
all the other values and outputs a count of how many of HD1-4
are smaller than HD0. The other 4 parts are implemented in the
same way and all those parts are computed concurrently. At
the end, the part whose sum is zero is the minimum candidate.
Thus the minimum distance candidate is determined in one
clock cycle.
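The compare-and-count scheme can be modeled sequentially as follows. The handling of ties (returning the lowest index among equal minima) is an assumption of this sketch, since the text does not specify it.

```python
def min_index_by_counting(distances):
    """Return the index of the minimum using the compare-and-count scheme."""
    n = len(distances)
    # counts[i] is the number of other inputs strictly smaller than input i.
    # In hardware, all of these comparisons are evaluated concurrently.
    counts = [
        sum(distances[j] < distances[i] for j in range(n) if j != i)
        for i in range(n)
    ]
    # A count of zero means no other input is smaller, so this is a minimum.
    return counts.index(0)


best = min_index_by_counting([7, 3, 9, 3, 5])
```

At least one count is always zero, so the lookup cannot fail for a non-empty input.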
Figure 3. Hamming Distance implementation for one 9x9 block match.
There are 9 of these circuits for the 9 flow directions. [Circuit diagram:
corresponding bits block_{t-d}[i] and block_{t-2d}[i], i = 0..80, feed XOR
gates whose outputs x1..xn are summed to give HD.]
Figure 4. Sort algorithm implementation block for HD0, simplified for 5
inputs rather than 9. There are 9 of these blocks. [Circuit diagram: HD0 is
compared (≥) with each of HD1 to HD4, and the comparator outputs are summed
to give the number of inputs smaller than HD0.]
III. EXPERIMENTAL RESULTS
We used the Xilinx Spartan 6 family chip xc6slx150t to
implement our algorithm. It has 184304 Flip-Flops and 92152
LUTs and 4MB block memory. The implemented OF design
occupies 0.9% of the Flip-Flops, 5% of the LUTs and 5% of
the block RAM. For the test dataset, we use the event-based
optical flow benchmark dataset in [7] which also provides the
evaluation method and the ground truth.
We tested three sample datasets. All of them are real DVS
data: the box translation, pavement, and gravel corresponding
to edge, sparse points, and dense texture respectively. The
boxes scene has a box in the foreground and clutter in
the background and the camera pans to the left, producing
rightwards global translation mostly of extended edges. In the
pavement dataset, the camera was down-looking and carried
by hand; the flow points downwards and to the right. Im-
perfections in the pavement cause sparse features. The gravel
dataset is recorded outside and has dense texture; movement
is eastward.
The block-matching OF results are shown in Fig 5. It can
be seen that in each scene, most vectors point correctly east
for box translation, southeast for the pavement scene, and east
for the gravel scene. Errors are mostly caused by DVS noise
or aperture ambiguity for the extended edges.
(a) Boxes translation
(b) Pavement on grass
(c) Gravel
Figure 5. OF Results. The arrows are the flow vectors and their length
represents the speed (determined by the slice duration d). DVS On events
are green and Off events are red. The color wheel indicates the flow vector
direction color. The 2D gray scale histogram above each color wheel shows
the distribution of flow event directions (here we use 9 direction bins) in the
time slice. The brightest bin indicates the most likely direction of the global
motion. (a) is the boxes scene from [7] with d = 40ms. (b) is pavement
recorded by a down-looking DVS; d = 10ms. (c) is a gravel area with
d = 3ms. For clarity, downsampling was used to compute 1 flow event
every 100 DVS events.
A. Accuracy analysis
Rueckauer and Delbruck [7] proposed two ways to calculate event-based OF
accuracy, based on similar metrics used for conventional OF. One is
called Average Endpoint Error (AEE) and the other is Average
Angular Error (AAE). AAE measures error in the direction
of estimated flow and AEE includes speed error. These two
methods are already implemented in jAER [8]. They use IMU
data from a pure camera rotation along with lens focal length
as the ground truth. Since the output data of the sequencer
lacks IMU data, we measured the OF accuracy using the PC
implementation. The algorithm pipeline is identical on FPGA and
PC, so this choice does not influence the accuracy. The result
is also compared with [7]. We chose two variants each of the
event-based Lucas-Kanade (LK) and Local Plane (LP) algorithms. The errors
from all the algorithms are shown in Table I. PMhd represents
the block matching algorithm with HD metric.

Table I. OF algorithms' accuracy on the transBoxes dataset
Algorithm   (a) AAE        (b) AEE
PMhd        42.68±33.82    17.86±6.31
LKsg        30.30±44.35    24.72±26.11
LKbd        98.92±42.24    37.00±15.18
LPorig      77.18±33.73    93.02±107.02
LPsg        47.52±54.44    98.32±82.5

Figure 6. The relationship between the block radius and AAE. [Plot titled
"AAE as a function of the block radius"; x-axis: Block Radius (0 to 12);
y-axis: Average Angular Error (30 to 50).]
As shown in Table I, the block matching algorithm has
the best accuracy for AEE and second-best for AAE, partly
from an appropriate choice of the sample rate that matches the
dataset motion.
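For reference, the two error measures can be sketched as below. The sketch assumes that the angular error is the angle between the 2D estimated and ground-truth flow vectors and that both measures are plain averages over all flow events; the exact definitions used in [7] and jAER may differ in detail.

```python
import math


def endpoint_error(est, gt):
    """Euclidean distance between estimated and ground-truth flow vectors."""
    return math.hypot(est[0] - gt[0], est[1] - gt[1])


def angular_error_deg(est, gt):
    """Unsigned angle in degrees between the two 2D flow vectors."""
    dot = est[0] * gt[0] + est[1] * gt[1]
    cross = est[0] * gt[1] - est[1] * gt[0]
    return abs(math.degrees(math.atan2(cross, dot)))


def average_errors(estimates, ground_truth):
    """Return (AEE, AAE) averaged over paired flow vectors."""
    pairs = list(zip(estimates, ground_truth))
    aee = sum(endpoint_error(e, g) for e, g in pairs) / len(pairs)
    aae = sum(angular_error_deg(e, g) for e, g in pairs) / len(pairs)
    return aee, aae


# One perfect estimate and one estimate that is rotated by 90 degrees.
aee, aae = average_errors(
    estimates=[(1.0, 0.0), (0.0, 2.0)],
    ground_truth=[(1.0, 0.0), (2.0, 0.0)],
)
```

Since the proposed method reports one of 9 directions with a speed fixed by the slice duration, the AEE depends on how well d matches the scene motion.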
Fig 6 shows the relationship between the block radius and
AAE. It indicates that bigger block dimension leads to better
accuracy. However, larger blocks consume more logic and
reduce spatial resolution of the flow. The comparison between
PC and FPGA implementation complexity is discussed next,
in III-B.
B. Time complexity analysis
The time complexity of the software grows quadratically
with the block dimension, while that of the FPGA grows only
linearly. The processing time of the algorithm contains three parts:
reading data from three slices, HD calculation, and looking for the
minimum. Both the FPGA implementation and the software
implementation on PC take linear time to read data from
RAM, since multiple data cannot be read from one RAM
simultaneously. However, the latter two parts take constant
time (2 clock cycles) on FPGA but quadratic time on PC. In
summary, the processing time on FPGA is (block dimension + 2)
cycles. In this paper, the FPGA runs at 50MHz and the
block dimension is 9. Thus the whole algorithm takes only
11 cycles, i.e. 220ns (0.22us), per event. On PC, it takes 4.5us per event
for (admittedly non-optimized) jAER to run the algorithm. The
implementation on FPGA is thus about 20 times faster than that on the
PC. The current implementation uses single port RAM and
could be further sped up by using multiple banks.
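The per-event latency and the speedup follow directly from the figures above. The symbols below are introduced only for this worked calculation:

```latex
T_{\mathrm{FPGA}} = \frac{N_{\mathrm{read}} + N_{\mathrm{HD+min}}}{f_{\mathrm{clk}}}
                  = \frac{9 + 2}{50\,\mathrm{MHz}}
                  = 11 \times 20\,\mathrm{ns}
                  = 220\,\mathrm{ns},
\qquad
\frac{T_{\mathrm{PC}}}{T_{\mathrm{FPGA}}}
                  = \frac{4.5\,\mu\mathrm{s}}{0.22\,\mu\mathrm{s}}
                  \approx 20.5
```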
IV. CONCLUSION
In this paper, we proposed a new method to estimate the
event-based optical flow on FPGA in real time. The software
computational cost of Hamming Distance increases quadratically
as the block dimension increases. In FPGA, however, all the bits
in the block can be processed at the same time, which leads
to a constant time for all block sizes. This greatly reduces
the overall computation time for the FPGA implementation,
which is 20 times faster than the software implementation.
In the current implementation, every single incoming event
is processed (allowing an input event rate of up to 5 Meps
to be handled using a modest FPGA clock of only 50 MHz).
However, processing every event is not required, as illustrated
in Fig. 5(c), where OF computation is downsampled, but the
DVS events still indicate locations to estimate the flow.
There are three possible improvements. First, the current
implementation estimates only direction of flow and not speed.
Measuring speed requires additional search distances and there
are well-known algorithms for efficient search [14]. Secondly,
other distance metrics should be explored because event se-
quences collected onto the slices usually have different length
due to noise and HD is somewhat ambiguous [15]. Finally,
we will implement feedback control on the slice duration to
better exploit the unique feature of DVS event output that it
can be processed at any desired sample rate.
ACKNOWLEDGMENT
Funded by Swiss National Center of Competence in Re-
search Robotics (NCCR Robotics). We thank H. Liu and
the Architecture and Computer Technology group, Univ. of
Seville, Spain for support with testing.
REFERENCES
[1] Baker S, Matthews I. Lucas-Kanade 20 years on: A unifying
framework[J]. International journal of computer vision, 2004, 56(3): 221-255.
[2] Horn B K P, Schunck B G. Determining optical flow[J]. Artificial
intelligence, 1981, 17(1-3): 185-203.
[3] Benosman R, Clercq C, Lagorce X, et al. Event-based visual flow[J].
IEEE transactions on neural networks and learning systems, 2014, 25(2):
407-417.
[4] Benosman R, Ieng S-H, Clercq C, et al. Asynchronous frameless
event-based optical flow[J]. Neural networks, 2012, 27: 32-37.
[5] Conradt J. On-board real-time optic-flow for miniature event-based
vision sensors[C]//2015 IEEE International Conference on Robotics and
Biomimetics (ROBIO). IEEE, 2015: 1858-1863.
[6] Barranco F, Fermuller C, Aloimonos Y. Bio-inspired motion estimation
with event-driven sensors[C]. International Work-Conference on Artificial
Neural Networks. Springer International Publishing, 2015: 309-321.
[7] Rueckauer B, Delbruck T. Evaluation of event-based algorithms for
optical flow with ground-truth from inertial measurement sensor[J].
Frontiers in neuroscience, 2016, 10.
[8] https://jaerproject.net.
[9] Agha S, Dwayer V M. Algorithms and VLSI Architectures for MPEG-4
Motion Estimation[J]. Electronic systems and control Division Research,
2003: 24-27.
[10] Class ch.unizh.ini.jaer.projects.minliu.PatchMatchFlow Source code.
[11] Wong S, Vassiliadis S, Cotofana S. A sum of absolute differences
implementation in FPGA hardware[C], Euromicro Conference, 2002.
Proceedings. 28th. IEEE, 2002: 183-188.
[12] Lichtsteiner P, Posch C, Delbruck T. A 128x128 120 dB 15 us latency
asynchronous temporal contrast vision sensor[J]. IEEE journal of solid-
state circuits, 2008, 43(2): 566-576.
[13] Berner R, Delbruck T, Civit-Balcells A, et al. A 5 Meps $100 USB2.0
address-event monitor-sequencer interface[C].//2007 IEEE International
Symposium on Circuits and Systems. IEEE, 2007: 2451-2454.
[14] Barjatya A. Block matching algorithms for motion estimation[J]. IEEE
Transactions Evolution Computation, 2004, 8(3): 225-239.
[15] Zhang L, Zhang Y, Tang J, et al. Binary code ranking with weighted
hamming distance[C]//Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. 2013: 1586-1593.
