A Survey of FPGA-Based Robotic Computing by Wan, Zishen et al.
A Survey of FPGA-Based Robotic Computing
Zishen Wan*1,2, Bo Yu*3, Thomas Yuang Li3, Jie Tang4, Yuhao Zhu5, Yu Wang6, Arijit Raychowdhury1,
and Shaoshan Liu3
1School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA
2John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138 USA
3PerceptIn Inc, Fremont, CA 94539 USA
4School of Computer Science and Engineering, South China University of Technology, Guangzhou, Guangdong, China
5Department of Computer Science, University of Rochester, Rochester, NY 14627 USA
6Department of Electronic Engineering, Tsinghua University, Beijing, China
Recent research on robotics has shown significant improvement, spanning algorithms, mechanics, and hardware architectures.
Robots, including manipulators, legged robots, drones, and autonomous vehicles, are now widely applied in diverse scenarios.
However, the high computation and data complexity of robotic algorithms poses great challenges to their application. On the one
hand, CPU platforms are flexible enough to handle multiple robotic tasks, and GPU platforms offer higher computational capacity and
easy-to-use development frameworks, so both have been widely adopted in several applications. On the other hand, FPGA-based robotic
accelerators are becoming increasingly competitive alternatives, especially in latency-critical and power-limited scenarios. With
specially designed hardware logic and algorithm kernels, FPGA-based accelerators can surpass CPUs and GPUs in performance
and energy efficiency. In this paper, we give an overview of previous work on FPGA-based robotic accelerators covering different
stages of the robotic system pipeline. An analysis of software and hardware optimization techniques and main technical issues is
presented, along with some commercial and space applications, to serve as a guide for future work.
Index Terms—Robotics, Autonomous Machines, Computer Architecture, FPGA, Space Exploration.
I. INTRODUCTION
Over the last decade, we have seen significant progress
in the development of robotics, spanning algorithms,
mechanics, and hardware platforms. Various robotic systems, such as
manipulators, legged robots, unmanned aerial vehicles, and
self-driving cars, have been designed for search and rescue [1], [2],
exploration [3], [4], package delivery [5], entertainment [6],
[7], and many other applications and scenarios. These robots are
on the rise and beginning to demonstrate their full potential. Take drones,
a type of aerial robot, as an example: the number of drones
grew by 2.83x between 2015 and 2019 according to the
U.S. Federal Aviation Administration (FAA) report [8]. The
registered number reached 1.32 million in 2019, and the
FAA expects it to reach 1.59 million by 2024.
However, robotic systems are highly complex [9], [10].
They tightly integrate many technologies and algorithms,
including sensing, perception, mapping, localization, decision
making, and control. This complexity poses many challenges
for the design of robotic edge computing systems [11], [12]. On
the one hand, the robotic system needs to process an enormous
amount of data in real time. The incoming data often comes
from multiple sensors and is highly heterogeneous. However,
the robotic system usually has limited on-board resources,
such as memory storage, bandwidth, and compute capabilities,
making it hard to meet the real-time requirements. On the other
hand, state-of-the-art robotic systems usually have
strict power constraints on the edge and cannot support the
amount of computation required for tasks such
as 3D sensing, localization, navigation, and path planning.
* These authors contributed equally to this work.
Corresponding author: Shaoshan Liu (email: shaoshan.liu@perceptin.io).
Therefore, the computation and storage complexity, as well
as the real-time and power constraints of the robotic system,
hinder its wide application in latency-critical or power-limited
scenarios [13].
It is therefore essential to choose a proper compute platform
for the robotic system. CPUs and GPUs are two widely
used commercial compute platforms. A CPU is designed to
handle a wide range of tasks quickly and is often used to
develop novel algorithms. A typical CPU can achieve
10-100 GFLOPS with an energy efficiency below 1 GOP/J [14]. In
contrast, a GPU is designed with thousands of processor cores
running simultaneously, which enables massive parallelism. A
typical GPU can deliver up to 10 TOPS and
is a good candidate for high-performance scenarios.
Recently, benefiting in part from the better accessibility provided
by CUDA/OpenCL, GPUs have been predominantly used in
many robotic applications. However, conventional CPUs and
GPUs usually consume 10 W to 100 W of power, which is
orders of magnitude higher than what is available on a
resource-limited robotic system.
Besides CPUs and GPUs, FPGAs are attracting attention
and becoming a candidate platform for energy-efficient
processing of robotic tasks. FPGAs require little power and are
often built into small systems with less memory. They can
massively parallelize computations and exploit the
properties of perception (e.g., stereo matching), localization
(e.g., SLAM), and planning (e.g., graph search) kernels to
remove additional logic and simplify the implementation. Taking
hardware characteristics into account, several algorithms have been
proposed that run in a hardware-friendly way and
achieve performance similar to that of their software versions. Therefore, FPGAs
can meet real-time requirements while achieving high
energy efficiency compared to CPUs and GPUs.
arXiv:2009.06034v1 [cs.RO] 13 Sep 2020
Unlike ASICs, FPGA technology provides
the flexibility of on-site programming and re-programming
without going through re-fabrication with a modified design.
Partial Reconfiguration (PR) takes this flexibility one step fur-
ther, allowing the modification of an operating FPGA design
by loading a partial configuration file. Using PR, part of the
FPGA can be reconfigured at runtime without compromising
the integrity of the applications running on those parts of
the device that are not being reconfigured. As a result, PR
can allow different robotic applications to time-share part of
an FPGA, leading to energy and performance efficiency, and
making FPGA a suitable computing platform for dynamic and
complex robotic workloads.
FPGAs have been successfully utilized in commercial
autonomous vehicles. In particular, over the past three years,
PerceptIn has built and commercialized autonomous vehicles for
micromobility, and PerceptIn’s products have been deployed
in China, the US, Japan, and Switzerland. In this paper, we review
how PerceptIn developed its computing system by relying
heavily on FPGAs, which perform not only heterogeneous
sensor synchronization but also the acceleration of software
components on the critical path. In addition, FPGAs are
used heavily in space robotic applications, because they offer
unprecedented flexibility and significantly reduce the design
cycle and development cost. In this paper, we also delve into
space-grade FPGAs for robotic applications.
The rest of this paper is organized as follows: Section II
introduces the basic workloads of the robotic system. Sections III,
IV, and V review the various perception, localization, and
motion planning algorithms and their implementations on
FPGA platforms. In Section VI, we discuss FPGA partial
reconfiguration techniques. Sections VII and VIII present
robotics FPGA applications in commercial and space areas.
Section IX concludes the paper.
II. OVERVIEW OF ROBOTICS WORKLOADS
A. Overview
Robotics is not one technology but rather an integration
of many technologies. As shown in Fig. 1, the stack of the
robotic system consists of three major components: application
workloads, including sensing, perception, localization, motion
planning, and control; a software edge subsystem, including
operating system and runtime layer; and computing hardware,
including both microcontrollers and companion computers.
We focus on the robotic application workloads in this sec-
tion. The application subsystem contains multiple algorithms
that are used by the robot to extract meaningful information
from raw sensor data to understand the environment and
dynamically make decisions about its actions.
B. Sensing
The sensing stage is responsible for extracting meaningful
information from the raw sensor data. To enable intelligent
actions and improve reliability, the robot platform usually
supports a wide range of sensors. The number and type of
sensors are heavily dependent on the specifications of the
workload and the capability of the on-board compute platform.
The sensors can include the following:
Cameras. Cameras are usually used for object recognition
and object tracking, such as lane detection in autonomous
vehicles and obstacle detection in drones. RGB-D cameras
can also be used to determine object distances and
positions. Taking an autonomous vehicle as an example, the current
system usually mounts eight or more 1080p cameras around
the vehicle to detect, recognize, and track objects in different
directions, which can greatly improve safety. These
cameras usually run at 60 Hz and, combined, produce multiple
gigabytes of raw data per second.
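The "multiple gigabytes per second" figure can be checked with a back-of-the-envelope calculation. The sketch below assumes uncompressed 8-bit RGB frames (3 bytes per pixel), which the text does not state; other pixel formats or on-sensor compression would change the result.

```python
# Rough raw data rate of a multi-camera rig.
# Camera count, resolution, and frame rate follow the text;
# BYTES_PER_PIXEL is an assumption (uncompressed 8-bit RGB).
NUM_CAMERAS = 8
WIDTH, HEIGHT = 1920, 1080  # 1080p
FPS = 60
BYTES_PER_PIXEL = 3


def raw_data_rate_bytes(num_cameras, width, height, fps, bytes_per_pixel):
    """Return the combined raw data rate in bytes per second."""
    return num_cameras * width * height * fps * bytes_per_pixel


rate = raw_data_rate_bytes(NUM_CAMERAS, WIDTH, HEIGHT, FPS, BYTES_PER_PIXEL)
print(f"{rate / 1e9:.2f} GB/s")
```

Under these assumptions, eight cameras produce roughly 3 GB of raw pixels per second, consistent with the order of magnitude given above.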
GNSS/IMU. The global navigation satellite system (GNSS)
and inertial measurement unit (IMU) help the robot
localize itself by reporting both inertial updates and an estimate
of the global location at a high rate. Each has its own
advantages and drawbacks. GNSS can enable fairly accurate
localization, but it runs at only 10 Hz and thus cannot provide
real-time updates. By contrast, both the accelerometer and the
gyroscope in an IMU can run at 100-200 Hz, which satisfies the
real-time requirement. However, an IMU suffers from bias drift
over time and from perturbation by thermo-mechanical noise,
which may degrade the accuracy of the position
estimates. By combining GNSS and IMU, we can get accurate
and real-time updates for robots.
LiDAR. Light detection and ranging (LiDAR) is used for
evaluating distance by illuminating the obstacles with laser
pulses and measuring the reflection time. These pulses, along
with other recorded data, can generate precise
three-dimensional information about the surrounding characteristics.
LiDAR plays an important role in localization, obstacle
detection, and avoidance.
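The range measurement itself follows from the round-trip time of a laser pulse: the distance is half the round-trip time multiplied by the speed of light. A minimal sketch, in which the function name and the example time are illustrative only:

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0  # speed of light in vacuum, m/s


def tof_distance_m(round_trip_time_s):
    """Distance to the reflecting surface from a pulse's round-trip time.

    The pulse travels to the obstacle and back, hence the division by two.
    """
    return SPEED_OF_LIGHT_M_S * round_trip_time_s / 2.0


# A reflection received 200 ns after emission is about 30 m away.
print(f"{tof_distance_m(200e-9):.2f} m")
```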
Radar and Sonar. Radio Detection and Ranging
(Radar) and Sound Navigation and Ranging (Sonar) systems are
used to determine the distance to and speed of a certain object,
and usually serve as the last line of defense for avoiding
obstacles. Taking an autonomous vehicle as an example, when nearby
obstacles are detected and there is a danger of collision, the
vehicle applies the brakes or turns to avoid them. Compared
to LiDAR, Radar and Sonar systems are cheaper and smaller, and
their raw data is usually fed to the control processor directly
without going through the main compute pipeline, which makes
it possible to implement urgent functions such as swerving or
applying the brakes.
C. Perception
The sensor data is then fed into the perception layer to sense
the static and dynamic objects as well as build a reliable and
detailed representation of the robot’s environment by using
computer vision techniques (including deep learning).
Fig. 1: The stack of the robotic system. [Figure labels: Sensing (GPS/IMU, LiDAR, Camera, Radar/Sonar); Perception; Decision; Object Detection; Object Tracking; Mapping; Localization; Path Planning; Action Prediction; Obstacle Avoidance; Feedback Control; Operating System; Hardware Platform.]
The perception layer is responsible for object detection,
segmentation, and tracking. There are obstacles, lane dividers, and
other objects to detect. Traditionally, a detection pipeline starts
with image pre-processing, followed by a region of interest
detector and finally a classifier that outputs detected objects.
In 2005, Dalal and Triggs [15] proposed an algorithm based on
histogram of oriented gradients (HOG) and support vector machine
(SVM) to model both the appearance and shape of the object
under various conditions. The goal of segmentation is to give
the robot a structured understanding of its environment.
Semantic segmentation is usually formulated as a graph labeling
problem with vertices of the graph being pixels or super-pixels.
Inference algorithms on graphical models such as conditional
random field (CRF) [16], [17] are used. The goal of tracking
is to estimate the trajectory of moving obstacles. Tracking can
be formulated as a sequential Bayesian filtering problem by
recursively running the prediction step and correction step.
Tracking can also be formulated as tracking-by-detection
handled with a Markov decision process (MDP) [18], where an
object detector is applied to consecutive frames and detected
objects are linked across frames.
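The linking step of tracking-by-detection can be illustrated with a deliberately simple association rule. The sketch below is not the MDP formulation of [18]; it swaps in a greedy intersection-over-union (IoU) matcher, and the function names, box format, and threshold are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def link_detections(prev_tracks, detections, iou_threshold=0.3):
    """Greedily link current-frame detections to existing tracks.

    prev_tracks: dict mapping track id -> box from the previous frame.
    detections:  list of boxes detected in the current frame.
    Returns a dict mapping track id -> box for the current frame.
    Unmatched detections start new tracks; unmatched tracks are dropped.
    """
    # Score every (track, detection) pair and visit the best overlaps first.
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in prev_tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    tracks, used_tracks, used_dets = {}, set(), set()
    for score, tid, j in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap too little to be the same object
        if tid in used_tracks or j in used_dets:
            continue
        tracks[tid] = detections[j]
        used_tracks.add(tid)
        used_dets.add(j)
    # Any detection left over becomes a new track with a fresh id.
    next_id = max(prev_tracks, default=-1) + 1
    for j, det in enumerate(detections):
        if j not in used_dets:
            tracks[next_id] = det
            next_id += 1
    return tracks
```

A practical tracker would also keep unmatched tracks alive for a few frames and use a motion model to predict where each box should appear; both are omitted here for brevity.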
In recent years, deep neural networks (DNN), also known
as deep learning, have greatly affected the field of computer
vision and made significant progress in solving robot
perception problems. Most state-of-the-art algorithms now apply a
type of neural network based on the convolution operation. Fast
R-CNN [19], Faster R-CNN [20], SSD [21], YOLO [22],
and YOLO9000 [23] have been used to obtain much better speed
and accuracy in object detection. Most CNN-based semantic
segmentation work is based on Fully Convolutional Networks
(FCN) [24], and there is some recent work on the spatial pyramid
pooling network [25] and the pyramid scene parsing network
(PSPNet) [26] to combine global image-level information
with locally extracted features. By using auxiliary natural
images, a stacked autoencoder model can be trained offline to
learn generic image features and then applied to online object
tracking [27].
D. Localization
The localization layer is responsible for aggregating data
from various sensors to locate the robot in the environment
model.
GNSS/IMU systems are used for localization. GNSS
consists of several satellite systems, such as GPS, Galileo, and
BeiDou, which can provide accurate localization results but
with a slow update rate. In comparison, an IMU can provide
fast updates with less accurate rotation and acceleration
results. A mathematical filter, such as a Kalman filter, can be
used to combine the advantages of the two and minimize the
localization error and latency. However, relying on this system alone has
some problems: the signal may bounce off obstacles, which
introduces more noise, and it may fail to work in closed environments.
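As an illustration of such a filter, the sketch below fuses high-rate IMU acceleration with low-rate GNSS position fixes in a one-dimensional Kalman filter. It is a toy model under stated assumptions: a single axis, a constant-acceleration motion model, and hypothetical noise variances; the 100 Hz and 10 Hz rates mirror the figures quoted in Section II-B.

```python
import numpy as np


def predict(x, P, accel, dt, accel_var):
    """Propagate state [position, velocity] using one IMU acceleration sample."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity transition
    B = np.array([0.5 * dt * dt, dt])       # effect of acceleration on the state
    x = F @ x + B * accel
    Q = np.outer(B, B) * accel_var          # process noise from IMU noise
    P = F @ P @ F.T + Q
    return x, P


def correct(x, P, gnss_pos, gnss_var):
    """Correct the state with one GNSS position fix."""
    H = np.array([[1.0, 0.0]])              # GNSS observes position only
    S = H @ P @ H.T + gnss_var              # innovation covariance, shape (1, 1)
    K = P @ H.T / S                         # Kalman gain, shape (2, 1)
    x = x + (K * (gnss_pos - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P


def fuse(accels, gnss, dt=0.01, gnss_every=10, accel_var=0.05, gnss_var=1.0):
    """Run prediction at the IMU rate and correction at the GNSS rate.

    accels: IMU acceleration samples, one every dt seconds (100 Hz here).
    gnss:   GNSS positions, one every gnss_every IMU samples (10 Hz here).
    The variances are illustrative values, not calibrated sensor figures.
    """
    x, P = np.zeros(2), np.eye(2)
    for k, a in enumerate(accels, start=1):
        x, P = predict(x, P, a, dt, accel_var)
        if k % gnss_every == 0:
            x, P = correct(x, P, gnss[k // gnss_every - 1], gnss_var)
    return x, P
```

Between GNSS fixes the estimate is driven purely by the IMU, so its uncertainty grows; each fix pulls the position back and shrinks the covariance, which is the complementary behavior described above.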
LiDAR and High-Definition (HD) maps are used for
localization. LiDAR can generate point clouds and provide a shape
description of the environment, although it is hard to differentiate
individual points. An HD map has a higher resolution than a
digital map and makes the route familiar to the robot;
the key is to fuse different sensor information to minimize the
errors in each grid cell. Once the HD map is built, a particle
filter method can be applied to localize the robot in real time
by correlating against LiDAR measurements. However, LiDAR
performance may be severely affected by weather conditions
(e.g., rain, snow), which introduces localization errors.
Cameras are used for localization as well. The pipeline
of vision-based localization is simplified as follows: 1) by
triangulating stereo image pairs, a disparity map is obtained
and used to derive depth information for each point; 2) by
matching salient features between successive stereo image
frames in order to establish correlations between feature points
in different frames, the motion between the past two frames
is estimated; and 3) by comparing the salient features against
those in the known map, the current position of the robot is
derived [28].
Apart from these techniques, a sensor fusion strategy is also
often used to combine multiple sensors for
localization, which can improve the reliability and robustness of
the robot [29]–[31].
E. Planning and Control
The planning and control layer is responsible for generating
trajectory plans and passing the control commands based on
the origin and destination of the robot. Broadly, prediction
and routing modules are also included here, and their
outputs are fed into the downstream planning and control layers
as input. The prediction module is responsible for predicting
the future behavior of surrounding objects identified by the
perception layer, and the routing module can be a lane-level
router based on lane segmentation of the HD maps for
autonomous vehicles.
Planning and control layers usually include behavioral
decision, motion planning, and feedback control. The mission
of the behavioral decision module is to make effective and
safe decisions by leveraging all available input data sources.
Bayesian models are becoming more and more popular and
have been applied in recent works [32], [33]. Among the
Bayesian models, the Markov Decision Process (MDP) and the
Partially Observable Markov Decision Process (POMDP) are
widely applied to model robot behavior. The task
of motion planning is to generate a trajectory and send it
to the feedback control for execution. The planned
trajectory is usually represented as a sequence of
trajectory points, each of which contains
attributes such as location, time, and speed. Low-dimensional
motion planning problems can be solved with grid-based
algorithms (such as Dijkstra [34] or A* [35]) or geometric
algorithms. High-dimensional motion planning problems can
be handled with sampling-based algorithms, such as
Rapidly-exploring Random Tree (RRT) [36] and Probabilistic Roadmap
(PRM) [37], which can avoid the problem of local minima.
Reward-based algorithms, such as the Markov decision process
(MDP), can also generate the optimal path by maximizing
cumulative future rewards. The goal of feedback control is to
track the planned trajectory by continuously correcting the
difference between the actual pose and the pose on
the predefined trajectory. The most
typical and widely used algorithm in robot feedback control
is the proportional-integral-derivative (PID) controller.
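A minimal discrete-time PID controller is sketched below to make the feedback loop concrete. The gains, the time step, and the toy plant (a position whose velocity equals the command) are hypothetical values chosen for illustration, not parameters taken from any cited system.

```python
class PID:
    """Textbook discrete PID controller."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement, dt):
        """Return the control command for the current tracking error."""
        error = setpoint - measurement
        self.integral += error * dt
        # No derivative on the first call, since there is no previous error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps=2000, dt=0.01, target=1.0):
    """Drive a toy plant (velocity equals command) toward the target pose."""
    pid = PID(kp=2.0, ki=1.0, kd=0.05)  # illustrative gains
    position = 0.0
    for _ in range(steps):
        command = pid.update(target, position, dt)
        position += command * dt
    return position


print(f"final position: {simulate():.4f}")
```

In a real robot the measurement would come from the localization layer and the setpoint from the next planned trajectory point.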
While optimization-based approaches enjoy mainstream
appeal in solving motion planning and control problems,
learning-based approaches [38]–[42] are becoming
increasingly popular with recent developments in artificial
intelligence. Learning-based methods, such as reinforcement
learning, can naturally make full use of historical data and
iteratively interact with the environment through actions to deal
with complex scenarios. Some approaches model behavioral-level
decisions via reinforcement learning [40], [42], while others
work directly on the motion planning trajectory output
or even on the feedback control signals [39]. Q-learning [43],
Actor-Critic learning [44], and policy gradient [37] are some
popular algorithms in reinforcement learning.
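As a concrete instance of the reinforcement-learning loop, the sketch below runs tabular Q-learning on a toy one-dimensional corridor. The environment, reward, and hyperparameters are hypothetical; only the update rule Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)) is the standard Q-learning rule. Because Q-learning is off-policy, the sketch explores with uniformly random actions.

```python
import random

N_STATES, GOAL = 5, 4   # corridor cells 0..4, reward on reaching cell 4
LEFT, RIGHT = 0, 1


def step(state, action):
    """Toy environment: move one cell; reaching GOAL yields reward 1."""
    nxt = max(state - 1, 0) if action == LEFT else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn a Q-table from randomly explored episodes."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice((LEFT, RIGHT))  # random behavior policy
            nxt, reward, done = step(state, action)
            # Bootstrap from the best next action unless the episode ended.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


q_table = train()
greedy_policy = ["left" if q[LEFT] > q[RIGHT] else "right"
                 for q in q_table[:GOAL]]
print(greedy_policy)
```

The greedy policy read from the learned table moves toward the goal from every cell, which is the behavior the accumulated interaction data is meant to produce.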
III. PERCEPTION ON FPGA
A. Overview
Perception is related to many robotic applications where
sensory data and artificial intelligence techniques are involved.
Examples of such applications include stereo matching, object
detection, scene understanding, semantic classification, etc.
The recent developments in machine learning, especially deep
learning, have exposed robotic perception systems to more
tasks. In this section, we will focus on the recent algorithms
and FPGA implementations in the stereo vision system, which
is one of the key components in the robotic perception stage.
Real-time and robust stereo vision systems are increasingly
popular and widely used in many perception applications,
e.g., robot navigation, obstacle avoidance [45], and scene
reconstruction [46]–[48]. The purpose of a stereo vision system
is to obtain 3D structure information of the scene using
stereoscopic ranging techniques. The system usually has two
cameras that capture images of the same scene from two points
of view. The disparities between corresponding
pixels in the two stereo images are searched using stereo matching
algorithms. The depth information can then be calculated from
the inverse of this disparity.
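For a rectified pinhole stereo pair, the inverse relation is Z = f * B / d, where f is the focal length in pixels, B the baseline between the two cameras, and d the disparity in pixels. The sketch below assumes this rectified model; the numeric values in the usage line are illustrative only.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth in meters for one pixel of a rectified stereo pair.

    A disparity of zero corresponds to a point at infinity.
    """
    if disparity_px <= 0:
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical rig: 700 px focal length, 12 cm baseline.
for d in (8, 16, 32, 64):
    print(d, "px ->", round(depth_from_disparity(d, 700.0, 0.12), 3), "m")
```

Since depth is inversely proportional to disparity, nearby objects span many disparity levels, which is why the disparity search range is a key design parameter in the FPGA implementations surveyed below.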
Throughout the whole pipeline, stereo matching is the
most time-consuming stage and the bottleneck. Stereo matching
algorithms can be mainly classified into two categories:
local algorithms [49]–[57] and global algorithms [58]–[62].
Local methods compute the disparities by only processing
and matching the pixels around the points of interest within
windows. They are fast and computationally cheap, and the
lack of pixel dependencies makes them suitable for parallel
acceleration. However, they may suffer in textureless areas
and occluded regions, which may result in incorrect disparity
estimates.
In contrast, global methods compute the disparities by
matching against all other pixels and minimizing a global cost
function. They can achieve much higher accuracy than local
methods. However, they tend to come at a high computation
cost and require many more resources due to their large
and irregular memory accesses as well as the sequential nature
of the algorithms, and are thus not suitable for real-time and low-power
applications. Many research works on stereo systems focus
on improving the speed and accuracy of stereo matching
algorithms, and some of the implementations are summarized
in Tab. I.
B. Local Stereo Matching on FPGA
Local algorithms are usually based on correlation, where
the process involves finding matching pixels in the left and
right image patches by aggregating costs within a specific
region. There are many ways to compute and aggregate costs, such as
the sum of absolute differences (SAD), the sum of squared
differences (SSD), normalized cross-correlation (NCC), and
census transform (CT), and many previous implementations
are based on these methods. Jin et al. [63] develop a
real-time stereo vision system based on the census rank transformation
matching cost for 640×480 resolution images. Zhang et al. [64]
propose a real-time high-definition stereo matching design
on FPGA based on mini-census transform and cross-based
cost aggregation, which achieves 60 fps for 1024×768 pixel
stereo images. The implementation of Honegger et al. [65]
achieves 127 fps at 376×240 pixel resolution with 32 disparity
levels based on block matching. Jin et al. [66] further achieve
507.9 fps for 640×480 resolution images by applying fast
local consistent dense stereo functions and cost aggregation.
Several works [67], [68] use the high-level synthesis (HLS)
approach to map local stereo matching algorithms onto FPGAs
to achieve acceleration. These works can perform real-time
processing; however, they cannot produce disparity maps of
sufficient quality for high-definition images.
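To make the window-based cost concrete, the sketch below computes a winner-take-all disparity map with SAD in NumPy. It is a software reference under simplifying assumptions (rectified grayscale images, a square window, edge replication at the borders), not a description of any of the cited FPGA designs; the function and parameter names are hypothetical.

```python
import numpy as np


def sad_disparity(left, right, max_disp, half_window=1):
    """Winner-take-all disparity map using SAD over a square window.

    left, right: rectified grayscale images of identical shape (H, W).
    max_disp:    number of disparity levels to test (0 .. max_disp - 1).
    """
    h, w = left.shape
    left = left.astype(np.int32)
    right = right.astype(np.int32)
    k = 2 * half_window + 1
    # Positions where a disparity cannot be evaluated keep a huge cost.
    costs = np.full((max_disp, h, w), np.iinfo(np.int32).max, dtype=np.int64)
    for d in range(max_disp):
        # Pixel x in the left image is compared with pixel x - d in the right.
        diff = np.abs(left[:, d:] - right[:, :w - d])
        padded = np.pad(diff, half_window, mode="edge")
        # Aggregate over the k x k window by summing shifted copies.
        agg = sum(padded[dy:dy + h, dx:dx + w - d]
                  for dy in range(k) for dx in range(k))
        costs[d, :, d:] = agg
    # Each pixel picks the disparity with the lowest aggregated cost.
    return np.argmin(costs, axis=0)
```

Every pixel and every disparity level is evaluated independently, which is the lack of data dependencies that makes local methods attractive for parallel hardware.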
C. Global Stereo Matching on FPGA
Global algorithms can provide state-of-the-art accuracy
and disparity map quality; however, they usually rely on
highly computation-intensive optimization techniques
or massive convolutional neural networks, which makes them
difficult to deploy on resource-limited embedded
systems for real-time applications. Nevertheless, some works have
attempted to implement global algorithms on FPGAs for better
performance. Park et al. [69] present a trellis-based stereo
matching system on FPGA with a low error rate that achieves
30 fps at 320×240 resolution with 128 disparity levels.
Sabihuddin et al. [70] implement a dynamic programming
maximum likelihood (DPML) based hardware architecture for
dense binocular disparity estimation and achieve 63.54 fps at
640×480 pixel resolution with 128 disparity levels. The design
in Jin et al. [71] uses a tree-structured dynamic programming
method, and achieves 58.7 fps at 640×480 resolution as well
as a low error rate. Recently, some other adaptations of global
approaches for FPGA implementation have been proposed,
such as cross-trees [60], dynamic programming for DNA
sequence alignment [61], and graph cuts [72], all of
which achieve real-time processing.

Algorithm | Reference | Frame Rate (fps) | Image Resolution | Disparity | MDE/s | FPGA Platform | Year
----------|-----------|------------------|------------------|-----------|-------|---------------|-----
Local Stereo Matching | Jin et al. [63] | 230 | 640 × 480 | 64 | 4522 | Xilinx Virtex-4 XC4VLX200-10 | 2009
Local Stereo Matching | Zhang et al. [64] | 60 | 1024 × 768 | 64 | 3020 | Altera EP3SL150 | 2011
Local Stereo Matching | Honegger et al. [65] | 127 | 376 × 240 | 32 | 367 | Altera Cyclone III EP3C80 | 2012
Local Stereo Matching | Jin et al. [66] | 507.9 | 640 × 480 | 60 | 9362 | Xilinx Virtex-6 | 2014
Global Stereo Matching | Park et al. [69] | 30 | 320 × 240 | 128 | 295 | Xilinx Virtex II pro-100 | 2007
Global Stereo Matching | Sabihuddin et al. [70] | 63.54 | 640 × 480 | 128 | 2498 | Xilinx XC2VP100 | 2008
Global Stereo Matching | Jin et al. [71] | 32 | 640 × 480 | 60 | 590 | Xilinx XC4VLX160 | 2012
Global Stereo Matching | Zha et al. [60] | 30 | 1920 × 1680 | 60 | 5806 | Xilinx Kintex 7 | 2016
Global Stereo Matching | Puglia et al. [61] | 30 | 1024 × 768 | 64 | 1510 | Xilinx Virtex-7 XC7Z020CLG484-1 | 2017
Semi-Global Stereo Matching | Banz et al. [73] | 30 | 640 × 480 | 128 | 1180 | Xilinx Virtex-5 | 2010
Semi-Global Stereo Matching | Wang et al. [74] | 42 | 1600 × 1200 | 128 | 10322 | Altera 5SGSMD5K2 | 2015
Semi-Global Stereo Matching | Cambuim et al. [75] | 127 | 1024 × 768 | 128 | 12784 | Altera Cyclone IV | 2017
Semi-Global Stereo Matching | Rahnama et al. [76] | 72 | 1242 × 375 | 128 | 4292 | Xilinx ZC706 | 2018
Semi-Global Stereo Matching | Cambuim et al. [77] | 25 | 1024 × 768 | 256 | 5033 | Altera Cyclone IV GX, Stratix IV GX | 2019
Semi-Global Stereo Matching | Zhao et al. [78] | 161 | 1242 × 375 | 64 | 4799 | Xilinx Ultrascale+ ZCU102 | 2020
Efficient Large-Scale Stereo Matching | Rahnama et al. [79] | 23.7 | 1242 × 375 | – | – | Xilinx ZC706 | 2018
Efficient Large-Scale Stereo Matching | Rahnama et al. [80] | 50 | 1242 × 375 | – | – | Xilinx ZCU104 | 2019

TABLE I: Comparison of stereo vision systems on FPGA platforms, across local stereo matching, global stereo matching, semi-global stereo
matching (SGM), and efficient large-scale stereo matching (ELAS) algorithms. The results reported for each design are evaluated by frame
rate (fps), image resolution (width × height), disparity levels, million disparity estimations per second (MDE/s), and hardware platform,
where MDE/s = width × height × fps × disparity.
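The MDE/s column of Tab. I can be reproduced directly from the other columns. The short sketch below applies the formula from the table caption to three of the listed designs; the function name is illustrative, and the product is divided by 10^6 to express it in millions.

```python
def mde_per_second(width, height, fps, disparity_levels):
    """Million disparity estimations per second, as defined for Tab. I."""
    return width * height * fps * disparity_levels / 1e6


# (width, height, fps, disparity levels) taken from rows of Tab. I.
designs = {
    "Jin et al. [63]": (640, 480, 230, 64),
    "Wang et al. [74]": (1600, 1200, 42, 128),
    "Zhao et al. [78]": (1242, 375, 161, 64),
}
for name, params in designs.items():
    print(f"{name}: {mde_per_second(*params):.0f} MDE/s")
```

Because the metric multiplies resolution, frame rate, and disparity range, it allows designs evaluated at different operating points to be compared on a single throughput scale.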
D. Semi-Global Matching on FPGA
Semi-global matching (SGM) [81] bridges the gap between
local and global methods, and achieves a notable improvement
in accuracy. SGM calculates the initial matching disparities by
comparing local pixels, and then approximates an image-wide
smoothness constraint with global optimization, which can
obtain more robust disparity maps through this combination.
There are several critical challenges for implementing SGM
on hardware, e.g., data dependence, high complexity, and large
storage, so this is an active research field with recent works
proposing FPGA-friendly variants of SGM [73], [74], [82]–
[85].
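The data dependence mentioned above comes from the path-wise cost recurrence of SGM. In the standard formulation of [81], the cost along a direction r is L_r(p, d) = C(p, d) + min(L_r(p-r, d), L_r(p-r, d-1) + P1, L_r(p-r, d+1) + P1, min_i L_r(p-r, i) + P2) - min_k L_r(p-r, k). The sketch below evaluates this recurrence for a single scanline and a single left-to-right direction; the penalty names P1 and P2 follow that formulation and their values are illustrative, since the text does not specify them.

```python
import numpy as np


def aggregate_left_to_right(cost, p1, p2):
    """Aggregate matching costs along one scanline, left to right.

    cost: array of shape (width, disparities) with per-pixel matching costs.
    p1:   penalty for a disparity change of one level between neighbors.
    p2:   penalty for any larger disparity change.
    """
    width, ndisp = cost.shape
    agg = np.zeros_like(cost, dtype=np.float64)
    agg[0] = cost[0]
    for x in range(1, width):
        prev = agg[x - 1]           # each pixel depends on its predecessor
        prev_min = prev.min()
        from_lower = np.concatenate(([np.inf], prev[:-1])) + p1  # from d - 1
        from_upper = np.concatenate((prev[1:], [np.inf])) + p1   # from d + 1
        jump = np.full(ndisp, prev_min + p2)                     # any other d
        best = np.minimum(np.minimum(prev, jump),
                          np.minimum(from_lower, from_upper))
        # Subtracting prev_min keeps the values from growing along the path.
        agg[x] = cost[x] + best - prev_min
    return agg
```

A full SGM implementation sums such path costs over several directions before selecting the minimum-cost disparity per pixel. The sequential loop over x is the dependency that the FPGA designs below restructure, for example by processing several disparities and image lines in parallel.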
Banz et al. [73] propose a systolic-array based hardware
architecture for SGM disparity estimation along with a
two-dimensional parallelization concept for SGM. This design
achieves 30 fps on 640×480 pixel images with
a 128-disparity range on the Xilinx Virtex-5 FPGA platform.
Wang et al. [74] implement a complete real-time FPGA-based
hardware system that supports absolute difference-census
cost initialization, cross-based cost aggregation, and
semi-global optimization. The system achieves 67 fps at 1024×768
resolution with 96 disparity levels on the Altera Stratix-IV
FPGA platform, and 42 fps at 1600×1200 resolution with 128
disparity levels on the Altera Stratix-V FPGA platform. The
design in Cambuim et al. [75] uses a scalable systolic-array
based architecture for SGM on the Cyclone IV FPGA
platform, and it achieves a 127 fps image delivery rate
at 1024×768 pixel HD resolution with 128 disparity levels.
The key point of this design is the combination of disparity-level
and multi-level parallelism, such as image line processing, to
deal with the data dependency and irregular data access pattern
problems in SGM. Later, to improve the robustness of SGM
and achieve more accurate stereo matching, Cambuim et
al. [77] combine the sampling-insensitive absolute difference
in the pre-processing phase, and propose a novel streaming
architecture to detect noisy and occluded regions in the
post-processing phase. The design is evaluated in a full stereo
vision system using two heterogeneous platforms, DE2i-150
and DE4, and achieves a 25 fps processing rate at 1024×768
HD resolution with 256 disparity levels.
While most existing SGM designs on FPGA are implemented
at the register-transfer level (RTL), some works
leverage the high-level synthesis (HLS) approach. Rahnama
et al. [76] implement an SGM variation on FPGA using HLS,
which achieves 72 fps at 1242×375 pixel resolution with 128
disparity levels. To reduce the design effort and achieve an
appropriate balance among speed, accuracy, and hardware cost,
Zhao et al. [78] recently propose FP-Stereo for automatically
building high-performance SGM pipelines on FPGAs. A series
of optimization techniques are applied in this system to exploit
parallelism and reduce resource consumption. Compared to
GPU designs, it achieves the same accuracy at a competitive
speed while consuming much less energy.
E. Efficient Large-Scale Stereo Matching on FPGA
Another popular stereo matching algorithm that offers a
good trade-off between speed and accuracy is Efficient
Large-Scale Stereo Matching (ELAS) [86], which is currently one
of the fastest and most accurate CPU algorithms with respect to
resolution on the Middlebury dataset [87]. ELAS implements a
slanted plane prior very effectively, while its dense estimation
of depth is completely decomposable over all pixels, which
makes it attractive for parallelization.
Rahnama et al. [79] first implement and evaluate an
FPGA-accelerated adaptation of the ELAS algorithm, which achieves
a frame rate of 47 fps (up to 30× faster than a high-end CPU)
while consuming under 4 W of power. By taking advantage of
different components on the SoC, several elaboration blocks
such as feature extraction and dense matching are executed
on the FPGA, while I/O and other conditional/sequential blocks
are executed on the ARM-core CPU. The authors also present
a strategy for accelerating more complex and computationally
diverse algorithms for low-power and real-time systems by
collaboratively utilizing different compute components. Later,
by leveraging and combining the best features of SGM and
ELAS-based methods, Rahnama et al. [80] propose a
sophisticated stereo approach and achieve an 8.7% error rate on the
challenging KITTI 2015 dataset at over 50 fps, with a power
consumption of only 4.5 W.
F. CNN-based stereo vision system on FPGA
Convolutional neural networks (CNNs) have been
demonstrated to perform very well on many vision tasks such
as image classification, object detection, and semantic
segmentation. Recently, CNNs have also been used in stereo
estimation [88], [89] and stereo matching [90]. CNNs are applied
to determine SGM penalties [91], estimate real-time optical
flow disparity [92], and predict cost volume computation and
aggregation [93].
CNNs have been deployed on FPGA platforms in several
works [94]–[97], with an example of lightweight YOLOv2 for
object detection [98]. Nakahara et al. implement a
pipeline-based architecture for lightweight YOLOv2 with a binarized
CNN on the Xilinx ZCU102 FPGA platform. This design achieves
an object detection speed of 40.81 fps, which is 177.4× faster
than an ARM Cortex-A57 and 27.5× faster than an NVIDIA Pascal
embedded GPU.
IV. LOCALIZATION ON FPGA
A. Overview
For robots, one of the most critical tasks is localization and
mapping. Simultaneous Localization and Mapping (SLAM)
is an advanced robot navigation algorithm for constructing
or updating a map of unknown surroundings while
simultaneously keeping track of the robot’s location. Localization
and mapping are two concurrent tasks and cannot be solved
independently of each other. Localizing a robot requires
a sufficiently detailed map, and constructing or updating a
map requires accurate landmarks or pose estimates from known
positions.
Many SLAM algorithms have been developed in the last
decades to improve accuracy and robustness, and their
implementations come in a diverse set of sizes and shapes.
One end of the spectrum is dense SLAM algorithms [99]–
[102], which can generate high-quality maps of the environment
with complex computations. Dense SLAM algorithms usually
are executed on powerful and high-performance machines to
ensure real-time performance. At the same time, their intensive
computation makes dense SLAM hard to deploy
on edge devices.
The other end of the spectrum is sparse SLAM [103]–
[106], which is computationally light because it selects only a
limited number of landmarks or features. Sparse SLAM algorithms
are feasible on mobile robots, at the cost of lower accuracy
and less usable reconstructions.
To form a compromise in terms of compute intensity and
accuracy quality between these two extremes, a family of
works described as semi-dense SLAM has emerged [107],
[108]. They aim to achieve better computational efficiency
compared to dense methods by only processing a subset of
high-quality sensory information while providing a more dense
and informative map compared to sparse methods. To execute
SLAM efficiently on mobile robots and meet real-time and
power constraint requirements, the efficient software/hardware
architecture implementation in embedded systems, especially
on FPGAs, has been explored in diverse ways in recent years,
and some of them are summarized in Tab. II.
B. Dense SLAM on FPGA
Dense SLAM can construct high-quality and complete
models of the environment, and most dense SLAM algorithms run
on high-end hardware platforms (especially GPUs). Several
works have attempted to implement 3D real-time dense SLAM
algorithms on heterogeneous systems that include an FPGA.
One of the representative real-time dense SLAM algorithms
is KinectFusion [109], which was released by Microsoft in
2011. As a scene reconstruction algorithm, it updates the
global 3D map and tracks the location of depth cameras
within the surrounding environment continuously. KinectFu-
sion is generally composed of three algorithms: ray-casting
algorithm for generating graphics from surface information,
iterative closest point (ICP) algorithm for camera-tracking and
volumetric integration (VI) algorithm for integrating depth
streams into the 3D surface.
Belshaw [99] presents an FPGA implementation of the ICP
algorithm, which achieves over 200 fps tracking speed with
low tracking errors. This design divides the ICP algorithm into
filtering, nearest neighbor, transform recovery and transform
application stages. It leverages fixed-point arithmetic and a
power-of-two number of data points to utilize FPGA logic efficiently.
Williams [100] notices that the nearest neighbor search takes
up the majority of ICP runtime, and then proposes two
hybrid CPU-FPGA architectures to accelerate the bottleneck
of the ICP-SLAM algorithm. The implementation is performed
with Vivado HLS, a high-level synthesis tool from Xilinx,
and achieves a maximum 17.22× speedup over the ARM
software implementation. Hoorick [101] presents an FPGA-
based heterogeneous framework using a similar HLS method
to accelerate the KinectFusion algorithm and explores various
dataflow and data management patterns. Gautier et
al. [102] embed both ICP and VI algorithms on an Altera
Stratix V FPGA by using the OpenCL language and the Altera
OpenCL SDK. This design is a heterogeneous system with an
NVIDIA GTX 760 GPU and an Altera Stratix V FPGA. By
distributing the workloads across the GPU and the FPGA, the
entire system achieves up to 28 fps real-time speed.
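To make the ICP stages named above concrete, the following is a minimal software sketch of the nearest neighbor, transform recovery and transform application steps. It is not the implementation of any cited design: it assumes 2D points, a brute-force neighbor search, floating-point arithmetic and no filtering stage, and all function names are hypothetical.

```python
import math


def nearest_neighbor(p, targets):
    """Brute-force nearest neighbor search over the target point set."""
    return min(targets, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)


def recover_transform(src, dst):
    """Closed-form 2D rigid transform (theta, tx, ty) that aligns the
    paired points src[i] -> dst[i] in the least-squares sense."""
    n = len(src)
    csx = sum(p[0] for p in src) / n
    csy = sum(p[1] for p in src) / n
    cdx = sum(q[0] for q in dst) / n
    cdy = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - csx, py - csy  # centered source point
        bx, by = qx - cdx, qy - cdy  # centered matched point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    tx = cdx - (c * csx - s * csy)
    ty = cdy - (s * csx + c * csy)
    return theta, tx, ty


def apply_transform(points, theta, tx, ty):
    """Rotate by theta, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def icp_2d(source, target, iterations=20):
    """Repeat match -> recover transform -> apply transform."""
    current = list(source)
    for _ in range(iterations):
        matched = [nearest_neighbor(p, target) for p in current]
        theta, tx, ty = recover_transform(current, matched)
        current = apply_transform(current, theta, tx, ty)
    return current
```

The inner `nearest_neighbor` call is executed once per source point per iteration, which is consistent with the observation in [100] that this search dominates ICP runtime and is the natural target for hardware acceleration.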
C. Sparse SLAM on FPGA
Sparse SLAM algorithms usually use a small set of features
for tracking and maintaining a sparse map of surrounding envi-
ronments. These algorithms exhibit lower power consumption
but are limited in localization accuracy.
1) EKF-SLAM
EKF-SLAM [103] is a class of algorithms that utilizes
the extended Kalman Filter (EKF) for SLAM. EKF-SLAM
algorithms are typically feature-based and use the maximum
likelihood algorithm for data association. Several heteroge-
neous architectures using multi-core CPUs, GPUs, DSPs, and
FPGAs are proposed to accelerate the complex computation in
EKF-SLAM algorithms. Bonato et al. [110] present the first
FPGA-based architecture for EKF-SLAM
that is capable of processing 2D maps with up to 1800 features
in real time at a frequency of 14 Hz, compared to 572
features with a Pentium CPU and 131 features with an ARM processor. They
analyze the computational complexity and memory bandwidth
requirements for FPGA-based EKF-SLAM, and then propose
an architecture with a parallel memory access pattern to
accelerate the matrix multiplication. This design is two
orders of magnitude more power-efficient than a general-
purpose processor.
Similarly, Tertei et al. [111] propose an efficient FPGA-SoC
hardware architecture for matrix multiplication with systolic
arrays to accelerate EKF-SLAM algorithms. The setup of this
design is a PLB peripheral to PPC440 hardcore embedded
processor on a Virtex5 FPGA, and it achieves a 7.3× speedup
with a processing frequency of 44 Hz compared to the pure
software implementation. Later, taking into account the sym-
metry in cross-covariance matrix-related computations, Tertei
et al. [112] improve the previous implementation to further
reduce the computational time and on-chip memory storage
with an AXI4 bus peripheral on Zynq-7020 FPGA.
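A back-of-the-envelope sketch shows why memory and matrix arithmetic dominate EKF-SLAM as the map grows. It assumes a textbook 2D formulation that is not taken from the cited designs: a 3-element robot pose and 2 coordinates per point feature, so the covariance matrix has a side length of 3 + 2N for N features. Because a covariance matrix is symmetric, only its upper triangle needs to be stored. The function names are hypothetical.

```python
def ekf_slam_state_dim(num_features, pose_dim=3, feature_dim=2):
    """State vector length: robot pose plus feature_dim values per feature."""
    return pose_dim + feature_dim * num_features


def covariance_entries(num_features, symmetric=False):
    """Number of covariance entries to store, optionally exploiting symmetry."""
    n = ekf_slam_state_dim(num_features)
    return n * (n + 1) // 2 if symmetric else n * n


# Storage grows quadratically with the number of features.
for features in (131, 572, 1800):
    full = covariance_entries(features)
    half = covariance_entries(features, symmetric=True)
    print(features, full, half)
```

Under these assumptions, a map with 1800 features implies a covariance matrix with roughly 13 million entries, or about half that when symmetry is exploited.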
DSP is also leveraged in some works to accelerate EKF-
SLAM algorithms. Vincke et al. [113] present an efficient
implementation of EKF-SLAM on a low-cost heterogeneous
architecture consisting of a single-core ARM processor
with a SIMD coprocessor and a DSP core. The EKF-SLAM
program is partitioned into different functional blocks based
on profiling results. Compared to a non-optimized ARM
implementation, this design achieves a 4.7×
speedup, from 12 fps to 57 fps. In a later work, Vincke
et al. [114] replace the single-core ARM with a dual-core
ARM to optimize the previously non-optimized blocks using the OpenMP
library. This design achieves a 2.75× speedup compared to
the non-optimized implementation.
2) ORB-SLAM
ORB-SLAM [104] is an accurate and widely-used sparse
SLAM algorithm for monocular, stereo, and RGB-D cameras.
Its framework usually consists of five main procedures: feature
extraction, feature matching, pose estimation, pose optimiza-
tion and map updating. Based on the profiling results on
a quad-core ARM v8 mobile SoC, feature extraction is the
most computation-intensive stage in the ORB-SLAM system,
which consumes more than half of CPU resources and energy
budget [115].
The ORB-based feature extraction algorithm usually consists of
two parts, namely Oriented Features from Accelerated Segment
Test (oFAST) [116] based feature detection and Binary
Robust Independent Elementary Features (BRIEF) [117] based feature
descriptor computation. To accelerate this bottleneck, Fang
et al. [115] design and implement a hardware ORB feature
extractor and achieve a good balance between performance
and energy consumption, which outperforms ARM Krait by
51% and Intel Core i5 by 41% in computation latency, and
outperforms ARM Krait by 10% and Intel Core i5 by 83%
in energy consumption. Liu et al. [118] propose an energy-
efficient FPGA implementation eSLAM to accelerate both
feature extraction and feature matching stages. This design
achieves up to 3× and 31× speedup in framerate, as well as up
to 71× and 25× in energy efficiency improvement compared to
Intel i7 and ARM Cortex-A9 CPUs, respectively. This eSLAM
design utilizes a rotationally symmetric ORB descriptor pattern
to make the algorithm more hardware-friendly, resulting
in 39% lower latency compared to [115]. Rescheduling and
parallelizing optimization techniques are also exploited to
improve the computation throughput in eSLAM design.
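As a rough illustration of why the feature matching stage maps well to hardware, BRIEF-style descriptors are bit strings, so two descriptors can be compared with an XOR followed by a bit count (the Hamming distance). The sketch below is a generic brute-force matcher, not the matcher used in eSLAM or [115]. Descriptors are assumed to be stored as Python integers, and the function names and the `max_distance` threshold are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two descriptors stored as integers."""
    return bin(a ^ b).count("1")


def match_descriptors(query, train, max_distance):
    """For each query descriptor, find the nearest train descriptor and
    keep the pair (query_idx, train_idx, distance) if it is close enough."""
    matches = []
    if not train:
        return matches
    for qi, q in enumerate(query):
        best_j = min(range(len(train)), key=lambda j: hamming(q, train[j]))
        best_d = hamming(q, train[best_j])
        if best_d <= max_distance:
            matches.append((qi, best_j, best_d))
    return matches


# Toy 8-bit descriptors; ORB descriptors are 256 bits long.
query = [0b11110000, 0b00001111]
train = [0b00001110, 0b11110001, 0b10101010]
print(match_descriptors(query, train, max_distance=2))
```

Each comparison is independent of the others, so a hardware design can evaluate many XOR and bit-count units in parallel.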
Scale-invariant feature transform (SIFT) and the Harris corner
detector are also commonly used feature extraction methods.
SIFT is invariant to scale, rotation and translation. Gu et al. [106]
implement a SIFT-feature-based SLAM algorithm on FPGA and
accelerate the matrix computation part to achieve speedup.
The Harris corner detector is used to extract corners and features of
an image, and Schulz et al. [119] propose an implementation of
the Harris and Stephens corner detector optimized for an embedded
SoC platform that integrates a multicore ARM processor
with a Zynq-7000 FPGA. Taking into account I/O requirements
and the advantages of parallelization and pipelining, this design
achieves a speedup of 1.77× compared to dual-core ARM
processors.
There are also some ASIC implementations for accelerating
the ORB-SLAM system. Lam et al. [120] present a novel
detector for computing FAST and BRIEF features to save
energy consumption and improve performance. An optimized
adder tree for smoothing operation and an optimized sampling
scheme are proposed to reduce hardware resource usage. An
rBRIEF-based feature extraction approach is then presented
by [121] to further improve the feature matching quality. To
optimize the tracking task in the vSLAM system for
high performance and energy efficiency, Li et al. [122] design a
specialized CMOS-based hardware accelerator that performs
high-quality feature extraction and high-precision descriptor
generation. The design is compatible with ORB-SLAM system
requirements and can be integrated into any SoC architecture.
3) Fast-SLAM
One of the key limitations of EKF-SLAM is its computa-
tional complexity since EKF-SLAM requires time quadratic
in the number of landmarks to incorporate each sensor up-
date. In 2002, Montemerlo et al. [105] propose an efficient
SLAM algorithm called Fast-SLAM. Fast-SLAM decomposes
the SLAM problem into a robot localization problem and a
landmark estimation problem, and recursively estimates the
full posterior distribution over landmark positions and robot
path in time that scales logarithmically with the number of landmarks.
Abouzahir et al. [123] implement Fast-SLAM 2.0 on a CPU-
GPGPU-based SoC architecture. The algorithm is partitioned
into function blocks, and each of them is implemented on the
CPU or GPU accordingly. This optimized and efficient CPU-
GPGPU partitioning enables accurate localization and a 37×
execution speedup compared to a non-optimized implementation
on a single-core CPU. Further, Abouzahir et al. [124] perform
a complete study of the processing time of different SLAM
algorithms on popular embedded devices, and demonstrate
that Fast-SLAM 2.0 allows a compromise between the con-
sistency of localization results and computation time. This
algorithm is then optimized and implemented on GPU and
FPGA using HLS and the parallel computing frameworks OpenCL
and OpenGL. The global processing time
of Fast-SLAM 2.0 on the FPGA implementation achieves a 7.5×
acceleration compared to a high-end GPU. The processing fre-
quency reaches 102 fps and meets the real-time performance
constraints of an operated robot.

Type | Algorithm | Reference | Processing Speed | Hardware Platform | Year
Dense SLAM | Iterative Closest Point (ICP) | Belshaw [99] | 200 fps | Xilinx Virtex-II Pro XC2VP100 | 2008
Dense SLAM | Iterative Closest Point (ICP) | Williams [100] | 2 fps | Xilinx Zynq-7020 SoC | 2017
Dense SLAM | KinectFusion | Hoorick [101] | 242 fps | Xilinx Zynq-7020 SoC | 2019
Dense SLAM | Iterative Closest Point (ICP) and Volumetric Integration (VI) | Gautier et al. [102] | 26-28 fps | GTX 760 GPU + Altera Stratix V FPGA | 2014
Sparse SLAM | EKF-SLAM | Bonato et al. [110] | 14 Hz | Intel EP2S90F1020C4 FPGA | 2009
Sparse SLAM | EKF-SLAM | Tertei et al. [111] | 44.39 Hz | Xilinx Virtex5 XC5VFX70T FPGA | 2014
Sparse SLAM | EKF-SLAM | Tertei et al. [112] | 30 Hz | Xilinx Zynq-7020 FPGA | 2016
Sparse SLAM | ORB-SLAM | Fang et al. [115] | 67 fps | Host CPU + Stratix V FPGA | 2017
Sparse SLAM | ORB-SLAM | Liu et al. [118] | 55.87 fps | Xilinx XCZ7045 SoC | 2019
Sparse SLAM | Fast-SLAM | Abouzahir et al. [123] | 30 fps | Nvidia Tegra K1 SoC | 2016
Sparse SLAM | Fast-SLAM | Abouzahir et al. [124] | 102.14 fps | ARM SoC of the Arria 10 | 2018
Sparse SLAM | VO-SLAM | Gu et al. [106] | 31 fps | Host CPU + Stratix V FPGA | 2015
Semi-Dense SLAM | LSD-SLAM | Boikos et al. [125] | 4 fps | Xilinx Zynq-7020 SoC | 2016
Semi-Dense SLAM | LSD-SLAM | Boikos et al. [126] | 22 fps | Xilinx Zynq-7020 SoC | 2017
Semi-Dense SLAM | LSD-SLAM | Boikos et al. [127] | 60 fps | Xilinx Zynq-706 SoC | 2019
CNN-Based SLAM | SuperPoint | Xu et al. [128] | 20 fps | Xilinx ZCU102 SoC | 2020
CNN-Based SLAM | Decentralized SLAM (DSLAM) | Yu et al. [129] | 125 fps | Xilinx ZCU102 MPSoC + DPU | 2020
CNN-Based SLAM | DSLAM, SuperPoint | Yu et al. [130] | 20 fps | Xilinx ZCU102 MPSoC + ZU9 MPSoC | 2020
Bundle Adjustment | LM Algorithm | Liu et al. [131] | – | Xilinx Zynq SoC | 2020
Bundle Adjustment | Visual Odometry | Sun et al. [132] | – | Xilinx XCZU9EG + ZCU102 MPSoC | 2020
TABLE II: Comparison of localization systems on SoC-FPGA platforms, across dense SLAM, sparse SLAM, semi-dense SLAM, CNN-based
SLAM and bundle adjustment algorithms.
4) VO-SLAM
The visual odometry based SLAM algorithm (VO-SLAM)
also belongs to the Sparse SLAM class with low computa-
tional complexity. Gu et al. [106] implement the VO-SLAM
algorithm on a DE3 board (Altera Stratix III) to perform drift-
free pose estimation, resulting in localization results accurate
to 1-2 cm. A Nios II soft-core is used as the master processor.
The authors design a dedicated matrix accelerator and propose
a hierarchical matrix computing mechanism to support appli-
cation requirements. This design achieves a processing speed
of 31 fps with 30000 global map features, and 10× energy
saving for each frame processing compared to Intel i7 CPU.
D. Semi-dense SLAM on FPGA
Semi-dense SLAM algorithms have emerged to provide a
compromise between sparse SLAM and dense SLAM algorithms,
attempting to achieve both improved efficiency
and dense point clouds. However, they are still usually compu-
tationally intensive and require desktop-scale multicore CPUs
for real-time processing.
Large-Scale Direct Monocular SLAM (LSD-SLAM) is one
of the state-of-the-art and widely-used semi-dense SLAM
algorithms, and it directly operates on image intensities for
both tracking and mapping problems. The camera is tracked
by direct image alignment, while geometry is estimated from
semi-dense depth maps acquired by filtering over multiple
stereo pixel-wise comparisons.
Several works have explored LSD-SLAM FPGA-SoC im-
plementation. Boikos et al. [125] investigate the performance
and acceleration opportunities for LSD-SLAM in the SoC
system. This design achieves an average framerate of more
than 4 fps for a resolution of 320×240 with an estimated
power of less than 1 W, which is a 2× acceleration and more
than 4.3× higher energy efficiency compared to a software version
running on an embedded CPU. The authors also note that the
communication between two accelerators is via DDR since
the produced intermediate data is too large to be fully cached
on the FPGA. Hence, it is important to optimize the memory
architecture (e.g., data movement and caching techniques) to
ensure the scalability and compatibility of the design.
To further improve the performance of [125], Boikos et
al. [126] re-implement the design using a dataflow architecture
and distributed asynchronous blocks to allow the memory
system and the custom hardware pipelines to function at peak
efficiency. This implementation can process and track more
than 22 fps within an embedded power budget and achieves a
5× speedup over [125].
Furthermore, Boikos et al. [127] combine a scalable depth
estimation with direct semi-dense SLAM architecture and
propose a complete accelerator for semi-dense SLAM on
FPGA. This architecture achieves more than 60 fps at a
resolution of 640×480 and an order of magnitude
improvement in power consumption compared to an Intel i7-4770 CPU.
This implementation leverages multi-rate and multi-modal
units to deal with LSD-SLAM’s complex control flow. A new
dataflow paradigm is also proposed where the kernel is linked
with a single consumer and a single producer to achieve high
efficiency.
E. CNN-based SLAM
Recently, CNNs have significantly improved
the perception and localization ability of robots compared
to handcrafted methods. Taking feature extraction, one of the main
SLAM components, as an example, the CNN-based approach
SuperPoint [133] can achieve 10%-30% higher matching
accuracy compared to handcrafted ORB. Other CNN-based
methods, such as DeepDesc [134] and GeM [135],
also present significant improvements in the feature extraction
and descriptor generation stages. However, CNNs have a much
higher computational complexity and require a larger memory
footprint.
Several works have explored deploying CNNs on FPGAs.
Xilinx DPU [136] is one of the state-of-the-art programmable
engines dedicated to CNN, which has a specialized instruction
set and works efficiently across various CNN topologies. Xu et
al. [128] propose a hardware architecture to accelerate CNN-
based feature extraction SuperPoint on the Xilinx ZCU102
platform and achieve 20 fps in a real-time SLAM system. The
key point of this design is an optimized software dataflow to
deal with the extra post-processing operations within CNN-
based feature extraction networks. 8-bit fixed-point numerics
are leveraged in the post-processing operations and CNN
backbone.
Yu et al. [129] build a CNN-based monocular decentralized-
SLAM (DSLAM) on the Xilinx ZCU102 MPSoC platform
with DPU. DSLAM is usually used in multi-robot applications
that can share environment information and locations between
agents. To accelerate the main components in DSLAM, namely
visual odometry (VO) and decentralized place recognition
(DPR), the authors adopt CNN-based Depth-VO-Feat [137]
and NetVLAD [138] to replace handcrafted approaches and
propose a cross-component pipeline scheduling algorithm to
improve the performance.
To enable multi-tasking processing in embedded robots
on CNN accelerators, Yu et al. [130] further propose an
INterruptible CNN accelerator (INCA) with a novel virtual-
instruction-based interrupt method. Feature extraction and
place recognition of DSLAM are deployed and accelerated
on the same CNN accelerator of the embedded FPGA system,
and the interrupt response latency is reduced by 1%.
F. Bundle Adjustment
Besides the hardware implementation of the frontend of
the SLAM system, several works investigate accelerating the
backend of the SLAM system, mainly Bundle Adjustment
(BA). BA is heavily used in robot localization [104], [139],
autonomous driving [140], space exploration missions [141]
and some commercial products [142], where it is usually
employed in the last stage of the processing pipeline to refine
camera trajectories and 3D structures further.
Essentially, BA is a massive joint non-linear optimization
problem that usually consumes a significant amount of power
and processing time in both offline visual reconstruction and
real-time localization applications.
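For concreteness, BA is commonly written as a reprojection-error minimization. The formulation below is the standard textbook form rather than that of any specific cited work, and the symbols are assumptions introduced here: $T_i$ is the pose of camera $i$, $X_j$ is the 3D position of point $j$, $u_{ij}$ is the observed image location of point $j$ in camera $i$, $\pi$ is the camera projection function, and $\mathcal{O}$ is the set of (camera, point) pairs for which an observation exists.

```latex
\min_{\{T_i\},\,\{X_j\}} \; \sum_{(i,j) \in \mathcal{O}} \left\| u_{ij} - \pi(T_i, X_j) \right\|^2
```

The problem is non-linear because $\pi$ involves a rotation and a perspective division, and it is large because every camera pose and every map point is an unknown.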
Several works aim to accelerate BA on multi-core CPUs
or GPUs using parallel or distributed computing techniques.
Jeong et al. [143] exploit efficient memory handling and fast
block-based linear solving, and propose a novel embedded
point iterations method, which substantially improves the BA
performance on CPU. Wu et al. [144] present a multi-core
parallel processing solution running on CPUs and GPUs. The
matrix-vector product is carefully restructured in this design
to reduce memory requirements and compute latency substan-
tially. Eriksson et al. [145] propose a distributed approach
for very large scale global bundle adjustment computation to
achieve BA performance improvement. The authors present a
consensus framework using the proximal splitting method to
reduce the computational cost. Similarly, Zhang et al. [146]
propose a distributed formulation to accelerate the global BA
computation without much distributed computing communica-
tion overhead.
To better deploy BA in embedded systems with strict
power and real-time constraints, recent works explore BA
algorithm acceleration using specialized hardware. The design
in [147] implements both the image frontend and BA backend
of a VIO algorithm on a single chip for nano-drone scale
applications. Liu et al. [131] propose a hardware-software co-
designed BA hardware accelerator and its implementation on
an embedded FPGA-SoC to achieve higher performance and
power efficiency simultaneously. In particular, a co-observation
optimization technique and a hardware-friendly differentiation
method are proposed to accelerate BA operations with opti-
mized usage of memory and computation resources. Sun et
al. [132] present a hardware architecture running local BA
on FPGAs, which works without external memory access and
refines both camera poses and 3D map points simultaneously.
V. PLANNING AND CONTROL ON FPGA
A. Overview
Planning and control are the modules that compute how
the robot should maneuver itself. They usually include behav-
ioral decision, motion planning and feedback control kernels.
Without loss of generality, we focus on the motion planning
algorithms and their FPGA implementations in this section.
As a fundamental problem in the robotic system, motion
planning aims to find the optimal collision-free path from the
current position to a goal position for a robot in complex
surroundings. Generally, motion planning contains three steps,
namely roadmap construction, collision detection and graph
search [37], [148]. Motion planning becomes a relatively
complicated problem when robots work with high
degree-of-freedom (DOF) configurations, since the search space
grows exponentially. Typically, state-of-the-art CPU-
based approaches take a few seconds to find a collision-free
trajectory [149]–[151], making the existing motion planning
algorithms too slow to meet the real-time requirements of
complex robot tasks and environments. Several works have
investigated approaches to speed up motion planning, either
for each stage or for the whole pipeline.
B. Roadmap Construction
In the roadmap construction step, the planner generates
a set of states in the robot’s configuration space and then
connects them with edges to construct a general-purpose
roadmap in the obstacle-free space. Each state represents a
robot’s configuration, and each edge represents a possible
robot movement. Conventional algorithms build the roadmap
by randomly sampling poses from configuration space at
runtime to navigate around the obstacles present at that time.
Several works explore roadmap construction acceleration.
Yershova et al. [152] improve the nearest neighbor search
to accelerate roadmap construction by orders of magnitude
compared to the naive nearest-neighbor searching. Wang et
al. [153] reduce the computation workload by trimming
roadmap edges and keeping the roadmap to a reasonable size
to achieve speedup. Different from online runtime approaches,
Murray et al. [154] completely remove the runtime latency
by conducting the roadmap construction only once at the
design time. A more general and much larger roadmap is
precomputed and allows for fast and successive queries in
complex environments without reprogramming the accelerator
during runtime.
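The roadmap construction step described above can be sketched in a few lines. This is a simplified probabilistic-roadmap-style example under stated assumptions, not the method of any cited work: the configuration space is the unit square, `is_free` is a hypothetical user-supplied predicate for obstacle-free states, each state is connected to its `k` nearest neighbors, and edges are not yet checked for collisions (that is left to the collision detection step).

```python
import random


def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def build_roadmap(num_samples, k, is_free, seed=0):
    """Sample collision-free states and connect each to its k nearest neighbors.

    Returns (nodes, edges), where edges is a set of index pairs (i, j), i < j.
    """
    rng = random.Random(seed)
    nodes = []
    while len(nodes) < num_samples:
        q = (rng.random(), rng.random())  # random configuration in [0, 1)^2
        if is_free(q):
            nodes.append(q)
    edges = set()
    for i, q in enumerate(nodes):
        others = [j for j in range(len(nodes)) if j != i]
        others.sort(key=lambda j: squared_distance(q, nodes[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))  # store undirected edges once
    return nodes, edges


# Hypothetical environment: a square obstacle in the middle of the space.
def is_free(q):
    return not (0.4 < q[0] < 0.6 and 0.4 < q[1] < 0.6)


nodes, edges = build_roadmap(num_samples=50, k=4, is_free=is_free)
print(len(nodes), len(edges))
```

Running this once at design time and reusing the result for every query corresponds to the precomputed-roadmap idea of [154], whereas calling it per query corresponds to the conventional runtime approach.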
C. Collision Detection
In the collision detection step, the planner determines
whether there are potential collisions with the environment or
the robot itself during movement. Collision detection
is the primary challenge in motion planning, and it often
accounts for 90% of the processing time [155].
Several works leverage data-parallel computing
on GPUs to achieve speedup [155]–[157]. For example,
Bialkowski et al. [155] divide the collision detection tasks of the
RRT* algorithm into three parallel dimensions and construct
thread block grids to execute collision computations simulta-
neously. However, GPUs can only provide a constant speedup
factor due to their limited number of cores, which still makes it hard to meet
the real-time requirement.
Recently, [158]–[160] develop high-efficiency custom
hardware implementations based on FPGA systems. Atay
and Bayazit [158] focus on directly accelerating the PRM
algorithm on FPGA by creating functional units to perform
the random sampling and nearest neighbor search as well
as parallelizing triangle-triangle testing. However, this design
cannot be reconfigured at runtime, and its large resource
demands prevent it from supporting a large roadmap. Murray
et al. [159] present a novel microarchitecture for an FPGA-
based accelerator to speed up collision detection by creating
a specialized circuit for each motion in the roadmap. This
solution achieves sub-millisecond speed for motion planning
query and improves the power consumption by more than one
order of magnitude, which is sufficient to enable real-time
robotics applications.
Besides real-time constraint, motion planning algorithms
also have flexibility requirements to make the robots adapt
to dynamic environments. Dadu-P [160] is a scalable
motion planning accelerator that attains both high efficiency
and flexibility, where a motion plan can be solved in around
300 microseconds in a dynamic environment. A hardware-
friendly data structure representing roadmap edges is adopted
to achieve flexibility, and a batched processing as well as a
priority-rating method are proposed to achieve high efficiency.
However, this design incurs a 25× latency increase to make
it retargetable to different robots and scenarios due to the
external memory access. Murray et al. [154] develop a fully
retargetable microarchitecture of a novel collision detection
and graph search accelerator that can perform motion planning
in less than 3 ms with a modest power consumption of 35
W. This design divides the collision detection workflow into
two stages. The collision detection results for the discretized
roadmap are precomputed in the first stage before runtime, and
then the collision detection accelerator streams in the voxels
of obstacles and flags the edges that are in collision at
runtime.
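The two-stage idea can be illustrated in software as follows. This is a sketch under stated assumptions rather than the microarchitecture of [154]: voxels are integer triples, each roadmap edge is an opaque key, and `swept_voxels` is a hypothetical function that returns the voxels a motion passes through. A hardware design would test all edges in parallel, whereas this sketch loops over them.

```python
def precompute_edge_voxels(edges, swept_voxels):
    """Design time: map each roadmap edge to the voxels its motion sweeps."""
    return {edge: frozenset(swept_voxels(edge)) for edge in edges}


def flag_colliding_edges(edge_voxels, obstacle_voxels):
    """Runtime: stream in obstacle voxels and flag every edge whose
    precomputed swept volume contains at least one of them."""
    in_collision = set()
    for voxel in obstacle_voxels:
        for edge, voxels in edge_voxels.items():
            if voxel in voxels:
                in_collision.add(edge)
    return in_collision


# Hypothetical three-edge roadmap between states "s", "m" and "g".
swept = {
    ("s", "m"): {(0, 0, 0), (1, 0, 0)},
    ("m", "g"): {(1, 1, 0), (2, 1, 0)},
    ("s", "g"): {(0, 1, 0), (1, 1, 0)},
}
edge_voxels = precompute_edge_voxels(swept.keys(), lambda e: swept[e])
print(flag_colliding_edges(edge_voxels, [(1, 1, 0)]))
```

The runtime work reduces to set-membership tests because all geometry was resolved before runtime.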
D. Graph Search
After collision detection, the planner tries to find the
shortest safe path from the start position to the target
position based on the obtained collision-free roadmap through
graph search. Several works explore graph search accelera-
tions. Bondhugula et al. [161] employ a parallel FPGA-based
design using a blocked algorithm to solve large instances of
All-Pairs Shortest-Paths (APSP) problem, which achieves a
15× speedup over an optimized CPU-based implementation.
Sridharan et al. [162] present an architecture-efficient solution
based on Dijkstra’s algorithm to accelerate the shortest path
search, and Takei et al. [163] extend this for a high degree
of parallelism and large-scale graph search. Recently, Mur-
ray et al. [154] accelerate graph search with the Bellman-
Ford algorithm. By leveraging a precomputed roadmap and
bounding specific robot quantities, this design enables a more
compact and efficient storage structure, dataflows and a low-
cost interconnection network.
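As a software reference for the graph search step, the sketch below runs the Bellman-Ford algorithm on a roadmap from which colliding edges have already been removed. It is a plain sequential version under stated assumptions (undirected edges with non-negative costs, nodes numbered from 0) and does not reflect the storage structure or interconnect of [154]. The function names are hypothetical.

```python
def bellman_ford(num_nodes, edges, source):
    """Single-source shortest paths. edges is a list of (u, v, cost) tuples,
    each treated as undirected. Returns (dist, pred)."""
    inf = float("inf")
    dist = [inf] * num_nodes
    pred = [None] * num_nodes
    dist[source] = 0.0
    directed = list(edges) + [(v, u, w) for u, v, w in edges]
    for _ in range(num_nodes - 1):
        changed = False
        for u, v, w in directed:
            if dist[u] + w < dist[v]:  # relax edge u -> v
                dist[v] = dist[u] + w
                pred[v] = u
                changed = True
        if not changed:  # converged early
            break
    return dist, pred


def extract_path(pred, source, goal):
    """Walk predecessors back from goal; returns None if goal is unreachable."""
    path = [goal]
    while path[-1] != source:
        prev = pred[path[-1]]
        if prev is None:
            return None
        path.append(prev)
    return path[::-1]


# Hypothetical collision-free roadmap with four states.
free_edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 3.0), (2, 3, 1.0)]
dist, pred = bellman_ford(4, free_edges, source=0)
print(dist[3], extract_path(pred, 0, 3))
```

Every edge relaxation within one pass uses the same simple compare-and-update operation, which is one reason the algorithm is attractive for a regular hardware dataflow.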
VI. PARTIAL RECONFIGURATION
FPGA technology provides the flexibility of on-site pro-
gramming and re-programming without going through re-
fabrication with a modified design. Partial Reconfiguration
(PR) takes this flexibility one step further, allowing the mod-
ification of an operating FPGA design by loading a partial
configuration file, usually a partial BIT file [164]. Using PR,
after a full BIT file configures the FPGA, partial BIT files can
be downloaded to modify reconfigurable regions in the FPGA
without compromising the integrity of the applications running
on those parts of the device that are not being reconfigured.
A major performance bottleneck for PR is the configuration
overhead, which seriously limits the usefulness of PR. To ad-
dress this problem, in [165], the authors propose a combination
of two techniques to minimize the overhead. First, the authors
design and implement fully streaming DMA engines to satu-
rate the configuration throughput. Second, the authors exploit a
simple form of data redundancy to compress the configuration
bitstreams, and implement an intelligent internal configuration
access port (ICAP) controller to perform decompression at
runtime. This design achieves an effective configuration data
transfer throughput of up to 1.2 Gbytes/s, which actually well
surpasses the theoretical upper bound of the data transfer
throughput, 400 Mbytes/s. Specifically, the proposed fully
streaming DMA engines reduce the configuration time from
the range of seconds to the range of milliseconds, a more than
1000-fold improvement. In addition, the proposed compression
scheme achieves up to a 75% reduction in bitstream size and
results in a decompression circuit with negligible hardware
overhead.
Another problem of PR is that it may incur additional energy
consumption. In [166], the authors investigate whether PR
can be used to reduce FPGA energy consumption. The core
idea is that there are a number of independent circuits within
a hardware design, and some can be idle for long periods
of time. Idle circuits still consume power though, especially
through clock oscillation and static leakage. Using PR, one
can replace these circuits during their idle time with others that
consume much less power. Since the reconfiguration process
itself introduces energy overhead, it is unclear whether this
approach actually leads to an overall energy saving or to a
loss. This study identifies the precise conditions under which
partial reconfiguration reduces the total energy consumption,
and proposes solutions to minimize the configuration energy
overhead. In this study, PR is compared against clock gating
to evaluate its effectiveness. The authors apply these tech-
niques to an existing embedded microprocessor design, and
successfully demonstrate that FPGAs can be used to accelerate
application performance while also reducing overall energy
consumption.
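The trade-off can be made tangible with a first-order break-even model. This is a simplification introduced here for illustration and is not the set of conditions derived in [166]: it assumes constant power draw for the idle circuit and for its low-power replacement, and a fixed total energy cost for swapping the circuit out and back in. All names and numbers are hypothetical.

```python
def break_even_idle_time(p_idle_w, p_replacement_w, e_reconfig_j):
    """Minimum idle duration (seconds) for which swapping in a lower-power
    circuit saves energy, given the total reconfiguration energy in joules."""
    saved_power = p_idle_w - p_replacement_w
    if saved_power <= 0:
        return float("inf")  # the replacement never pays off
    return e_reconfig_j / saved_power


def pr_saves_energy(idle_time_s, p_idle_w, p_replacement_w, e_reconfig_j):
    """True if reconfiguring for this idle period reduces total energy."""
    return idle_time_s > break_even_idle_time(p_idle_w, p_replacement_w, e_reconfig_j)


# Hypothetical numbers: 0.5 W idle circuit, 0.1 W replacement, 20 mJ per swap.
print(break_even_idle_time(0.5, 0.1, 0.02))
print(pr_saves_energy(1.0, 0.5, 0.1, 0.02))
```

In this model, short idle periods favor leaving the circuit in place, while long idle periods favor reconfiguration.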
Further, PerceptIn demonstrate in their commercial product
that runtime partial reconfiguration (RPR) is useful for robotic
computing, especially computing for autonomous vehicles,
because many on-vehicle tasks usually have multiple versions
where each is used in a particular scenario [167]. For instance,
in PerceptIn’s design, the localization algorithm relies on
salient features; features in key frames are extracted by a
feature extraction algorithm (based on ORB features [168]),
whereas features in non-key frames are tracked from previous
frames (using optical flow [169]); the latter executes in 10 ms,
50% faster than the former. Spatially sharing the FPGA is not
only area-inefficient, but also power-inefficient as the unused
portion of the FPGA consumes non-trivial static power. In
order to temporally share the FPGA and “hot-swap” different
algorithms, PerceptIn develop a partial reconfiguration engine
(PRE) that dynamically reconfigures part of the FPGA at
runtime. The PRE achieves a 400 MB/sec reconfiguration
throughput (i.e., bitstream programming rate). Both the feature
extraction and tracking bitstreams are less than 4 MB. Thus,
the reconfiguration delay is less than 10 ms (4 MB at 400 MB/sec).
VII. COMMERCIAL APPLICATIONS OF FPGAS IN
AUTONOMOUS VEHICLES
Over the past three years, PerceptIn has built and commer-
cialized autonomous vehicles for micromobility. Our products
have been deployed in China, US, Japan and Switzerland.
We summarize system design constraints, workloads and their
performance characteristics from the real products. A custom
computing system is developed by taking into account the
inherent task-level parallelism, cost, safety and programmability
[167], [170]. The FPGA plays a critical role in our system: it
synchronizes various sensors and accelerates the components
on the critical path.
A. Computing system
Software pipeline. Fig. 2 shows the block diagram of the
processing pipeline in our vehicle, which consists of three
parts: sensing, perception and planning. The sensing module
bridges sensors and computing system. It synchronizes various
sensor samples for the downstream perception module, which
performs two fundamental tasks: 1) locating the vehicle itself
in a global map and 2) understanding the surroundings through
depth estimation and object detection. The planning module
uses the perception results to devise a driveable route, and
then converts the planned path into a sequence of control
commands, which will drive the vehicle along the path. The
control commands are sent to the vehicle’s Engine Control
Unit (ECU) via the CAN bus interface.
Sensing, perception and planning are serialized, so all three
are on the critical path of the end-to-end latency. We pipeline the
three modules to improve the throughput. Within perception,
localization and scene understanding are independent and
can execute in parallel. While there are multiple tasks within
scene understanding, they are mostly independent, with the
only exception that object tracking must be serialized with
object detection. This task-level parallelism influences how the
tasks are mapped to the hardware platform.
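The effect of this dependency structure on latency and throughput can be illustrated with a small model. This is only a sketch: the per-task latencies below are hypothetical placeholders rather than measurements from the vehicle, and only the structure (three pipelined modules, localization parallel to scene understanding, tracking serialized after detection) follows the description above.

```python
# Hypothetical per-task latencies in milliseconds (illustrative values only).
LATENCY_MS = {
    "sensing": 10.0,
    "localization": 25.0,
    "depth_estimation": 30.0,
    "object_detection": 35.0,
    "object_tracking": 5.0,
    "planning": 8.0,
}

def scene_understanding_latency(lat):
    # Depth estimation runs in parallel with detection; tracking must follow detection.
    return max(lat["depth_estimation"],
               lat["object_detection"] + lat["object_tracking"])

def perception_latency(lat):
    # Localization and scene understanding are independent, so the slower one dominates.
    return max(lat["localization"], scene_understanding_latency(lat))

def end_to_end_latency(lat):
    # Sensing, perception and planning are serialized on the critical path.
    return lat["sensing"] + perception_latency(lat) + lat["planning"]

def pipelined_throughput_hz(lat):
    # With the three modules pipelined, the slowest module sets the frame rate.
    slowest = max(lat["sensing"], perception_latency(lat), lat["planning"])
    return 1000.0 / slowest
```

In this model, pipelining improves throughput but leaves the end-to-end latency of a single frame unchanged, which is why the slowest perception task is the natural target for acceleration.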
Algorithm. Our localization module is based on Visual
Inertial Odometry algorithms [171], [172], which fuse camera
images, IMU and GPS samples to estimate the vehicle pose
in the global map. The depth estimation employs traditional
stereo vision algorithms, which calculate depths according to
the principle of triangulation [173]. In particular, our method is
based on the classic ELAS algorithm, which uses hand-crafted
features [174]. While DNN models for depth estimation exist,
they are orders of magnitude more compute-intensive than
non-DNN algorithms [175] while providing only marginal
accuracy improvements for our use cases. We detect objects using
DNN models, such as YOLO [22]. We use the Kernelized
Correlation Filter (KCF) [176] to track detected objects. The
planning algorithm is formulated as Model Predictive Control
(MPC) [177].
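As an illustration of the triangulation principle behind stereo depth estimation, the standard relation for a rectified stereo pair is Z = f·B/d, where f is the focal length in pixels, B the baseline between the two cameras, and d the disparity. The sketch below only converts a disparity to a depth; it does not implement ELAS, and the function name and the calibration values in the usage note are hypothetical.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth (in meters) of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; negative values are invalid.
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px
```

For example, with an assumed focal length of 700 pixels and a 0.12 m baseline, a disparity of 14 pixels corresponds to a depth of about 6 m. Because depth is inversely proportional to disparity, a one-pixel disparity error matters far more for distant objects than for near ones.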
Hardware architecture. Fig. 3 shows the hardware system
designed for our autonomous vehicles. The sensing hardware
consists of stereo cameras, IMU and GPS. In particular, our
system uses stereo cameras for depth estimation. One of
the cameras is also used for semantic tasks such as object
detection. The cameras along with the IMU and the GPS drive
the VIO-based localization task.
Considering the cost, compute requirements and power
budget, our computing platform is composed of a Xilinx Zynq
UltraScale+ FPGA and an on-vehicle PC equipped with an
Intel Coffee Lake CPU and an Nvidia GTX 1060 GPU. The
PC is the main computing platform, while the FPGA plays a
critical role: it bridges the sensors and the PC, and provides
an acceleration platform. To optimize the end-to-end latency,
exploit the task-level parallelism and ease practical development
and deployment, planning and scene understanding are
mapped onto the CPU and the GPU respectively, and sensing
and localization are implemented on the FPGA platform.
Fig. 2: Processing pipeline of PerceptIn’s on-vehicle processing system.
Fig. 3: The computing system in our autonomous vehicle.
B. Sensing on FPGA
We map sensing onto the Zynq FPGA platform. The FPGA
processes sensor data and transfers it to the PC for
subsequent processing. Sensing is mapped to the
FPGA for three reasons. First, embedded FPGA platforms today
are built with rich sensor interfaces (e.g., the standard MIPI Camera
Serial Interface) and sensor pre-processing hardware (e.g.,
ISP). Second, by having the FPGA process sensor data
in situ, we allow accelerators on the FPGA to consume
sensor data directly without involving the power-hungry CPU for data
movement and task coordination. Finally, processing sensor
data on the FPGA naturally leads to a hardware-assisted
mechanism for synchronizing multiple sensors.
Sensor synchronization. Sensor synchronization is critical
to perception algorithms that fuse multiple sensors. Sensor
fusion algorithms assume sensor samples have been well
synchronized. For example, widely adopted datasets, such as
KITTI, provide synchronized data so that researchers could
focus on algorithmic development.
An ideal synchronization ensures that 1) various sensor
samples have a unified timing system, and 2) timestamps of
samples precisely record the time of events triggering the
sensors. GPS synchronization is now widely adopted to unify
various measurements in a global timing domain. Software-based
synchronization associates samples with timestamps at
the application or the driver layer. This approach is inaccurate
because the software processing that precedes the timestamping
stage introduces variable, non-deterministic latency.
Fig. 4: Performance comparison of different platforms running three
perception tasks.
To obtain more precise synchronization, we use a hardware
synchronizer implemented in the FPGA fabric. The hardware
synchronizer triggers the camera sensors and the IMU using
a common timer initialized by the satellite atomic time provided
by the GPS device. It records the triggering time of
each sensor sample, and then packs the timestamp with the
corresponding sensor data. In terms of cost, the synchronizer
is extremely lightweight, using only 1,443 LUTs and
1,587 registers and consuming 5 mW of power.
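The difference between hardware and software timestamping can be sketched with a small behavioral model. This is an illustration under stated assumptions, not the synchronizer's actual design: the timer frequency, the trigger rates, the jitter bound, and all names below are hypothetical.

```python
import random

TIMER_HZ = 1_000_000        # hypothetical 1 MHz common timer
IMU_PERIOD_TICKS = 5_000    # hypothetical 200 Hz IMU trigger
CAMERA_EVERY_N_IMU = 8      # hypothetical: one camera trigger per 8 IMU triggers (25 Hz)

def hardware_timestamps(gps_epoch_ticks, n_imu):
    """Trigger both sensors from one timer and pack the trigger time with each sample."""
    imu, camera = [], []
    for i in range(n_imu):
        t = gps_epoch_ticks + i * IMU_PERIOD_TICKS  # trigger time on the common timer
        imu.append({"sensor": "imu", "t_ticks": t})
        if i % CAMERA_EVERY_N_IMU == 0:
            camera.append({"sensor": "camera", "t_ticks": t})
    return imu, camera

def software_timestamps(samples, max_jitter_ticks, rng):
    """Model driver/application-layer stamping: trigger time plus a variable delay."""
    return [dict(s, t_ticks=s["t_ticks"] + rng.randint(0, max_jitter_ticks))
            for s in samples]

imu, camera = hardware_timestamps(gps_epoch_ticks=0, n_imu=24)
imu_times = {s["t_ticks"] for s in imu}
# Every camera frame has an IMU sample with exactly the same hardware timestamp.
aligned = all(frame["t_ticks"] in imu_times for frame in camera)
```

With hardware stamping, each camera frame can be paired with an IMU sample by exact timestamp equality. With the software model, the same frames carry timestamps that drift by a random amount, so a fusion algorithm would have to interpolate or tolerate the error.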
C. Perception on FPGA
For our autonomous vehicles, the perception tasks include
scene understanding (depth estimation and object detection)
and localization, which are independent. The slower one
dictates the overall perception latency.
We evaluate our perception algorithms on the CPU, GPU
and Zynq FPGA platform. Fig. 4 compares the latency of each
perception task on the FPGA platform with that on the GPU. Due
to its limited resources, the FPGA platform is faster than
the GPU only for localization, which is more lightweight than
the other tasks. We therefore offload localization to the FPGA while
leaving the other perception tasks on the GPU. This partitioning frees
more GPU resources for depth estimation and object detection,
which helps reduce the perception pipeline’s latency.
As with classic SLAM algorithms, our localization algorithm
consists of a front-end and a back-end. The front-end
uses ORB features and descriptors for detecting and tracking
key points [115], [178]. The back-end uses the Levenberg-Marquardt
(LM) algorithm, a non-linear optimization algorithm,
to optimize the positions of 3D key points and the pose
of the camera [131], [179].
The ORB feature extraction/matching and the LM optimizer
are the most time-consuming parts of our SLAM algorithm,
taking up nearly all the execution time. We accelerate
ORB feature extraction/matching and the non-linear optimizer
on the FPGA fabric. The remaining lightweight parts are implemented
on the ARM core of the Zynq platform. We use independent
hardware for each camera to extract features and compute
descriptors. Hamming distance and Sum of Absolute
Differences (SAD) matching are implemented to obtain stable
matching results. Compared with the CPU implementation,
our FPGA implementation achieves a 2.2× speedup and runs at 44
fps.
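The two matching criteria named above are simple to state in software, which helps explain why they map well to hardware (XOR plus population count for Hamming distance, subtract-and-accumulate for SAD). The following is a minimal software sketch, not the FPGA design: descriptors are represented as Python integers, patches as nested lists, and the brute-force matcher and its `max_distance` threshold are illustrative assumptions.

```python
def hamming_distance(d1, d2):
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(d1 ^ d2).count("1")

def sad(patch_a, patch_b):
    """Sum of absolute differences between two equally sized pixel patches."""
    return sum(abs(a - b)
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))

def match_descriptors(left, right, max_distance):
    """Brute-force nearest neighbor in Hamming space.

    Returns (left_index, right_index, distance) for each left descriptor whose
    best match is within max_distance. Assumes `right` is non-empty.
    """
    matches = []
    for i, d_left in enumerate(left):
        j, dist = min(((j, hamming_distance(d_left, d_right))
                       for j, d_right in enumerate(right)),
                      key=lambda m: m[1])
        if dist <= max_distance:
            matches.append((i, j, dist))
    return matches
```

A typical use is to match descriptors between the left and right camera images, then verify candidate pairs with SAD on the surrounding pixel patches to reject ambiguous matches.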
We use the LM algorithm to optimize features and poses over
a fixed-size sliding window. To solve the non-linear optimization
problem, the LM algorithm iteratively uses the Jacobian to
linearize the problem and solves the linear equation at each
iteration. Schur elimination is used to reduce the dimension
of the linear equation, thus reducing the complexity of solving
it. Cholesky factorization is employed to solve
the linear equation. For sliding-window based vSLAM, the
Jacobian and Schur elimination are the most time-consuming
parts. Profiling our algorithm on datasets [180] shows that Schur
and Jacobian computations account for 29.8% and 48.27% of the
total time, respectively. We implemented Schur elimination and Jacobian
updates on the FPGA fabric [131]. Compared with the CPU
implementation, the FPGA achieves 4× and 27× speedups for
Schur and Jacobian, respectively, and saves 76% energy.
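To make the role of Schur elimination and Cholesky factorization concrete, the sketch below solves a toy linearized system H·δ = b whose unknowns are split into pose variables and key-point variables. It is a simplified illustration, not the implementation in [131]: the key-point block is assumed to be diagonal (in practice it is block-diagonal, one small block per key point), the matrices hold made-up values, and the function names are hypothetical.

```python
import math

def cholesky_solve(A, b):
    """Solve A x = b for symmetric positive-definite A via Cholesky (A = L L^T)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    y = [0.0] * n
    for i in range(n):                      # forward substitution: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):            # back substitution: L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def schur_solve(Hpp, Hpl, Hll_diag, bp, bl):
    """Solve [[Hpp, Hpl], [Hpl^T, diag(Hll)]] [dp; dl] = [bp; bl] by eliminating dl."""
    p, l = len(bp), len(bl)
    # Reduced system: S = Hpp - Hpl Hll^-1 Hpl^T, g = bp - Hpl Hll^-1 bl
    S = [[Hpp[i][j] - sum(Hpl[i][k] * Hpl[j][k] / Hll_diag[k] for k in range(l))
          for j in range(p)] for i in range(p)]
    g = [bp[i] - sum(Hpl[i][k] * bl[k] / Hll_diag[k] for k in range(l))
         for i in range(p)]
    dp = cholesky_solve(S, g)               # small dense solve over poses only
    # Back-substitute for the key points: dl = Hll^-1 (bl - Hpl^T dp)
    dl = [(bl[k] - sum(Hpl[i][k] * dp[i] for i in range(p))) / Hll_diag[k]
          for k in range(l)]
    return dp, dl

# Toy system: 2 pose variables and 3 key-point variables (hypothetical values).
Hpp = [[6.0, 1.0], [1.0, 5.0]]
Hpl = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
Hll_diag = [5.0, 6.0, 7.0]
bp, bl = [1.0, 2.0], [3.0, 4.0, 5.0]
dp, dl = schur_solve(Hpp, Hpl, Hll_diag, bp, bl)
```

The point of the elimination is that the Cholesky solve runs on the small pose-sized system S rather than on the full system, while the key-point updates are recovered by cheap per-point back-substitution. Forming S is itself a dense, regular computation, which is consistent with it being one of the dominant costs.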
VIII. APPLICATION OF FPGAS IN SPACE ROBOTICS
In the 1980s, field-programmable gate arrays (FPGAs)
emerged as a result of increasing integration in electronics.
Before FPGAs, glue-logic designs were based on
individual boards with fixed components interconnected via a
shared standard bus, which had various drawbacks, such as
hindering high-volume data processing and higher susceptibility
to radiation-induced errors, in addition to inflexibility.
The use of FPGAs in space applications began in 1992,
as FPGAs offered unprecedented flexibility and significantly
reduced the design cycle and development cost [181].
FPGAs can be categorized by the type of their pro-
grammable interconnection switches: antifuse, SRAM, and
Flash. Each of the three technologies comes with trade-offs.
Antifuse FPGAs are non-volatile and have minimal routing
delay, resulting in higher speed and lower power
consumption. Their drawbacks are a relatively
more complicated fabrication process and being only
one-time programmable. SRAM-based FPGAs are the most
common type employed in space missions. They are field
reprogrammable and use the standard fabrication process that
foundries invest significant effort in optimizing, resulting in a
faster rate of performance increase. However, being SRAM-based,
these FPGAs are volatile and may not hold their configuration if a
power glitch occurs. They also have more substantial routing
delay, require more power, and have a higher susceptibility
to bit errors. Flash-based FPGAs are non-volatile and reprogrammable,
and also have low power consumption and routing
delay. The major drawback is that in-flight reconfiguration is
not recommended for flash-based FPGAs due to the potentially
destructive results if radiation effects occur during the reconfiguration
process [182]. The stability of the stored charge
on the floating gate is also of concern: it depends on
factors such as the operating temperature and the electric fields that
might disturb the charge. As a result, flash-based FPGAs are
not as frequently used in space missions [183].
A. Radiation Tolerance for Space Computing
For electronics intended to operate in space, the harsh
radiation environment is an essential factor to consider.
Radiation has various effects on electronics, but the two most
commonly considered are the total ionizing dose (TID) effect and single
event effects (SEEs). TID results from the accumulation of
ionizing radiation over time, which causes permanent damage
by creating electron-hole pairs in the silicon dioxide layers of
MOS devices. As a result, the performance parameters of the
electronics gradually degrade until the devices eventually fail
to function. Electronics intended for application in space are
tested for the total amount of radiation, measured in krad,
that they can endure before failure. Usually, electronics that can
withstand 100 krad are sufficient for low Earth orbit missions
lasting several years [182].
SEEs occur when high-energy particles from space radiation
strike electronics and leave behind an ionized trail. The results
are various types of SEEs [184], which can be categorized
as either soft errors, which usually do not cause permanent
damage, or hard errors, which often cause permanent damage.
Examples of soft errors include single event upset (SEU) and
single event transient (SET). In an SEU, a radiation particle strikes
a memory element, causing a bit flip. Notably, as
the cell density and clock rate of modern devices increase,
multiple cell upset (MCU), the corruption of two or more memory
cells by a single particle strike, is becoming an increasing
concern. A special type of SEU is the single event functional
interrupt (SEFI), where the upset leads to loss of normal
function of the device by affecting control registers or the
clock. In an SET, a radiation particle passes through a sensitive
node and generates a transient voltage pulse, causing a wrong
logic state at the combinational logic output. Depending on
whether the impact occurs during an active clock edge or
not, the error may or may not propagate. Examples
of hard errors include single event latch-up (SEL), in which
an energetic particle activates a parasitic transistor and causes
a short across the device, and single event burnout (SEB), in
which radiation induces high local power dissipation, leading
to device failure. In these hard error cases, radiation effects
may cause the failure of an entire space mission.
Space-grade FPGAs can withstand considerable levels of
TID and have been designed against most destructive SEEs
[185]. However, SEU susceptibility is pervasive. For the most
part, radiation effects on FPGAs are no different from those
on other CMOS-based ICs. The primary difference stems from
FPGAs’ unique structure, involving programmable interconnections.
Depending on their type, FPGAs have different susceptibility
to SEUs in their configuration. SRAM FPGAs
are designated by NASA as the most susceptible ones due
to their volatile nature. Even after the radiation hardening
process, the configuration of SRAM FPGAs is only designated
as “hardened”, or simply having embedded SEE mitigation
techniques, rather than “hard”, which means close to immune
[182]. Configuration SRAM is not used in the same way as
traditional SRAM. A bit flip in the configuration has an
instantaneous effect without the need for a read-write cycle.
Moreover, instead of producing a single error in the output,
the bit flip alters the user logic directly, changing the device’s
behavior. Scrubbing is needed to repair the SRAM configuration.
Antifuse and flash FPGAs are less susceptible to effects in their
configuration and are designated “hard” against SEEs in their
configuration without applying radiation hardening techniques
[182].
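Why a configuration upset changes the circuit itself, rather than a single data value, can be seen from a toy model of a lookup table (LUT). In this sketch, which is a simplification and not a model of any specific device, a 2-input LUT is described by 4 configuration bits that form its truth table; the names and the chosen bit position are illustrative.

```python
def lut2(config_bits, a, b):
    """Evaluate a 2-input LUT: the 4 configuration bits are its truth table."""
    return (config_bits >> ((b << 1) | a)) & 1

def truth_table(config_bits):
    """Outputs for (a, b) = (0,0), (1,0), (0,1), (1,1), in that order."""
    return [lut2(config_bits, a, b) for b in (0, 1) for a in (0, 1)]

AND_CONFIG = 0b1000                     # output is 1 only when a = 1 and b = 1
upset_config = AND_CONFIG ^ (1 << 0)    # an SEU flips one configuration bit
```

After the single-bit upset, the LUT no longer computes AND: its truth table becomes that of XNOR, and it stays that way for every subsequent input until the configuration is rewritten. This persistence is what distinguishes a configuration upset from a one-off error in user data.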
Design-based SEU/fault mitigation techniques are commonly
used because, in contrast to fabrication-level radiation
hardening techniques, they can be readily applied to commercial
off-the-shelf (COTS) FPGAs. These techniques can
be classified as static or dynamic. Static techniques rely
on fault masking, i.e., tolerating errors without requiring active
fixing. One such example is passive redundancy with voting
mechanisms. Dynamic techniques, in contrast, detect faults
and act to correct them. Common SEU mitigation methods
include [186], [187]:
1) Hardware Redundancy: functional blocks are replicated
to detect/tolerate faults. Triple modular redundancy
(TMR) is perhaps the most widely used mitigation
technique. It can be applied to entire processors or parts
of circuits. At the circuit level, registers are implemented
using three or more flip-flops or latches. Voters then
compare the values and output the majority, reducing the
likelihood of error due to SEU. As internal voters are
also susceptible to SEU, they are sometimes triplicated
as well. For mission-critical applications, global signals
may be triplicated to mitigate SEUs further. TMR can
be implemented with ease with the help of supporting HDLs
[188]. An important limitation of TMR is
that at most one fault can be tolerated per voter stage.
As a result, TMR is often used with other techniques,
such as scrubbing, to prevent error accumulation.
2) Scrubbing: The vast majority of memory cells in reprogrammable
FPGAs contain configuration information.
As discussed earlier, a configuration memory upset may
lead to alteration of the routing network, loss of function, and
other critical effects. Scrubbing, i.e., refreshing and restoring
the configuration memory to a known-good state,
is therefore needed [187]. The reference configuration
is usually stored in radiation-hardened memory
cells either off or on the device. Scrubbing is carried out by a
scrubber, which may be a processor or a configuration controller.
Some advanced SRAM FPGAs, including ones made by Xilinx,
support partial reconfiguration, which allows memory
repairs to be made without interrupting the operation
of the whole device. Scrubbing can be done at the frame
level (partial) or at the device level (full); the latter inevitably
leads to some downtime, and some devices may not be able
to tolerate such an interruption. Blind scrubbing is the
most straightforward implementation: individual
frames are scrubbed periodically without error detection.
Blind scrubbing avoids the complexity of error
detection, but the extra scrubbing may increase vulnerability
to SEUs, as errors may be written into frames during
the scrubbing process. An alternative to blind scrubbing
is readback scrubbing, where scrubbers actively detect
errors in the configuration through an error-correcting code or
a cyclic redundancy check [186]. If an error is found,
the scrubber initiates frame-level scrubbing.
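The two mitigation methods above can be sketched behaviorally in a few lines. This is a software illustration under simplifying assumptions, not a device-accurate model: configuration frames are modeled as byte strings, the golden copy is assumed to be error-free, CRC-32 from Python's `zlib` stands in for whatever check code a real device uses, and all function names are hypothetical.

```python
import zlib

def tmr_vote(a, b, c):
    """Bitwise majority vote over three redundant copies of a register value."""
    return (a & b) | (a & c) | (b & c)

def frame_crc(frame):
    """CRC-32 of one configuration frame (stand-in for the device's check code)."""
    return zlib.crc32(frame)

def readback_scrub(config_frames, golden_frames, golden_crcs):
    """Read back every frame and rewrite only those whose CRC mismatches."""
    repaired = []
    for i, frame in enumerate(config_frames):
        if frame_crc(frame) != golden_crcs[i]:
            config_frames[i] = golden_frames[i]   # frame-level scrub
            repaired.append(i)
    return repaired

def blind_scrub(config_frames, golden_frames):
    """Rewrite every frame without checking for errors first."""
    for i in range(len(config_frames)):
        config_frames[i] = golden_frames[i]

golden = [bytes([i] * 8) for i in range(4)]   # hypothetical reference frames
golden_crcs = [frame_crc(f) for f in golden]
live = list(golden)
upset = bytearray(live[2])
upset[3] ^= 0x10                              # an SEU flips one configuration bit
live[2] = bytes(upset)
repaired = readback_scrub(live, golden, golden_crcs)
```

The voter also illustrates the stated limitation of TMR: `tmr_vote(0b1010, 0b1010, 0b0010)` masks a single upset copy, but if two copies are upset in the same bit the vote returns the wrong value, which is why TMR is usually paired with scrubbing so that errors are repaired before they accumulate.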
Currently, the majority of space-grade FPGAs come from
Xilinx and Microsemi. Xilinx offers the Virtex and
Kintex families. Both are SRAM-based and offer high flexibility.
Microsemi offers the antifuse-based RTAX and the flash-based RTG4 and
RT PolarFire, which have lower susceptibility to SEEs and lower
power consumption. The 20 nm Kintex and 28 nm RT PolarFire are
the latest generations. In the European market, space-grade FPGAs
are offered by Atmel and NanoXplore [189].
Table III shows the specifications of the above devices.
B. FPGAs in Space Missions
For space robotics, processing power is of particular importance,
given the range of information that must be processed accurately
and efficiently. Many current and previous
space missions are packed with sophisticated algorithms that
are mostly static. They serve to increase the efficiency of
data transmission; nevertheless, data processing is done mainly
on the ground. As the travel distance of missions increases,
transmitting all data to the ground and processing it there is no
longer an efficient or even viable option due to transmission
delay. As a result, space robots need to become more adaptable
and autonomous. They will also need to pre-process a
large amount of collected data on board and compress it before sending
it back to Earth [190].
The rapid development of new generation FPGAs may fill
the need in space robotics. FPGAs enable robotic systems
to be reconfigurable in real-time, making the systems more
adaptable by allowing them to respond more efficiently to
changes in environment and data. As a result, autonomous
reconfiguration and performance optimization can be achieved.
Also, FPGAs have a high capability for parallel processing,
which is useful in boosting processing performance. FPGAs
are used in various space robots. Some of the most
prominent examples are the NASA Mars
rovers. Since the first pair of rovers was launched in 2003,
the presence of FPGAs has steadily increased in later
rovers.
1) Mars Exploration Rover Missions
Beginning in the early 2000s, NASA has been using
FPGAs in exploration rover control and lander control. In
Opportunity and Spirit, the two Mars rovers launched in 2003,
two Xilinx Virtex XQVR1000s were used in the motor control
board [191], which operates motors on instruments as well
as rover wheels. In addition, an Actel RT 1280 FPGA was
used in each of the 20 cameras on the rovers to receive and
dispatch hardware commands. The camera electronics consist
of a clock driver that provides timing pulses through the charge-coupled
device (CCD), an IC containing an array of linked or
coupled capacitors. There are also signal chains that amplify
the CCD output and convert it from analog to digital. The
Actel FPGA provides the timing, logic, and control functions
in the CCD signal chain and inserts a camera ID into camera
telemetry to simplify processing [192].
Device | Logic | Memory | DSPs | Technology | Rad. Tolerance
Xilinx Virtex-5QV | 81.9K LUT6 | 12.3 Mb | 320 | 65 nm SRAM | SEE immune up to LET > 100 MeV·cm²/mg and 1 Mrad TID
Xilinx RT Kintex UltraScale | 331K LUT6 | 38 Mb | 2760 | 20 nm SRAM | SEE immune up to LET > 80 MeV·cm²/mg and 100-120 krad TID
Microsemi RTG4 | 150K LE | 5 Mb | 462 | 65 nm Flash | SEE immune up to LET > 37 MeV·cm²/mg and TID > 100 krad
Microsemi RT PolarFire | 481K LE | 33 Mb | 1480 | 28 nm Flash | SEE immune up to LET > 63 MeV·cm²/mg and 300 krad
Microsemi RTAX | 4M gates | 0.5 Mb | 120 | 150 nm antifuse | SEE immune up to LET > 37 MeV·cm²/mg and 300 krad TID
Atmel ATFEE560 | 560K gates | 0.23 Mb | – | 180 nm SRAM | SEL immune up to 95 MeV·cm²/mg and 60 krad TID
NanoXplore NG-LARGE | 137K LUT4 | 9.2 Mb | 384 | 65 nm SRAM | SEL immune up to 60 MeV·cm²/mg and 100 krad TID
TABLE III: Specifications of Space-Grade FPGAs.
Selected electronic parts have to undergo a multi-step flight
consideration process before being used in any space exploration
mission [191], [193]. The first step is the general flight approval,
during which the manufacturers perform additional
space-grade verification tests beyond the normal commercial
evaluation, and NASA meticulously examines the results. Additional
device parameters, such as temperature considerations
and semiconductor characteristics, are verified in these tests.
What follows is flight-specific approval. In this step, NASA
engineers examine the device’s compatibility with the mission,
for instance with respect to the operating environment,
including factors like temperature and radiation. Also included
are a variety of mission-specific situations that the robot may
encounter and the associated risk assessment. The risk standards
vary depending on the specific application of the device,
whether it is mission-critical or not, and the expected mission
lifetime. Finally, parts go through specific design consideration
to ensure that all the design requirements have been met. Parts
are examined for how their designs address issues such as SEL,
SEU, and SEFI. The Xilinx FPGAs used addressed some of the
SEEs through the following methods [192]:
1) Fabrication processes largely prevent SEL
2) TMR reduces SEU frequency
3) Scrubbing allows device recovery from single event
functional interrupts
MER was successful and, despite being designed for only
90 Martian days (1 Martian day = 24.6 hours), continued until
2019. The mitigation techniques also
proved effective, as the observed error rate was very
similar to that predicted [191].
2) Mars Science Laboratory Mission
Launched in 2011, the Mars Science Laboratory (MSL) mission sent
the next rover, Curiosity, to Mars. FPGAs were heavily used in its
key components, mainly for scientific instrument
control, image processing, and communications.
Curiosity has 17 cameras on board: four navigation cameras,
eight hazard cameras, the Mars Hand Lens Imager (MAHLI),
two Mast Cameras, the Mars Descent Imager (MARDI), and
the ChemCam Remote Microscopic Imager [194]. MAHLI,
the Mast Cameras, and MARDI share the same electronics
design. Similar to the system used on MER, an Actel FPGA
provides the timing, logic, and control functions in the CCD
signal chain and transmits pixels to the digital electronics
assembly (DEA), which interfaces the camera heads with the
rover electronics, transmitting commands to the camera heads
and data back to the rover. There is one DEA dedicated to
each of the imagers above. Each has a Virtex-II FPGA
that contains a MicroBlaze soft-processor core. All of the core
functionalities of the DEA, including timing, interface, and
compression, are implemented in the FPGA as logic peripherals
of the MicroBlaze. Specifically, the DEA provides an image
processing pipeline that includes 12-to-8-bit companding
of input pixels, horizontal subframing, and lossless or JPEG
image compression [194]. The DEA flight software runs on the
MicroBlaze and coordinates DEA hardware
functions such as camera movements. It receives and executes
commands transmitted from Earth. The flight
software also implements image acquisition algorithms, including
autofocus and autoexposure, and performs error correction
of flash memory and mechanism control fault protection
[194]. In total, the flight software consists of 10,000 lines of
ANSI C code, all running on the FPGA. Additionally,
FPGAs power the communication boxes (Electra-Lite) that provide
critical communication to Earth from the rover through a
Mars relay network [195]. They are responsible for a variety
of high-speed bulk signal processing.
3) Mars 2020 Mission
Perseverance is NASA’s most recently launched Mars rover. The
presence of FPGAs continued and increased. For the first time in
NASA’s planetary rovers, an FPGA was used in
the autonomous driving system as a coprocessor for algorithm
acceleration.
Perseverance runs the GESTALT (grid-based estimation of
surface traversability applied to local terrain) AutoNav algorithm,
the same as Curiosity [196]. What was added is the FPGA-based
accelerator, called the Vision Compute Element (VCE). During
landing, the VCE provides sufficient computing power for
the Lander Vision System (LVS), which performs the intensive
task of estimating the landing location in 10 seconds by fusing
data from the designed landing location, IMU, and landmark
matches. After landing, the connection between the VCE and the LVS
is severed, and the VCE is repurposed for the GESTALT
driving algorithm. The VCE has three cards plugged into a
PCI backplane: a CPU card with a BAE RAD750 processor,
a Compute Element Power Conditioning Unit (CEPCU), and
a Computer Vision Acceleration Card (CVAC). While the
former two parts were inherited from the MSL mission, the
CVAC is new. It has two FPGAs. One is called the Vision
Processor, a Xilinx Virtex-5QV that contains image processing
modules for matching landmarks to estimate position. The
other is called the Housekeeping FPGA, a Microsemi RTAX
2000 antifuse FPGA that handles tasks such as synchronization
with the spacecraft, power management, and Vision Processor
configuration.
Through more than two decades of use in space, FPGAs
have shown their reliability and applicability for space robotic
missions. Several properties make them good on-board
processors: they have a long and proven record of reliability
in space robotic missions; they have unrivaled adaptability and can even
be reconfigured at run time; their capability for highly
parallel processing allows significant acceleration in executing
many complex algorithms; and the hardware/software co-design
method makes them potentially more power-efficient. They
may finally help us close the two-decade performance gap
between commercial processors and space-grade ASICs. As a
direct result, the achievements that the world has made in fields
such as deep learning and computer vision, which were often
too computationally intensive for space-grade processors,
may become applicable to robots in space in the near
future. The implementation of those new technologies will be
of great benefit for space robots, boosting their autonomy and
capabilities and allowing us to explore farther and faster.
IX. CONCLUSION
In this paper, we review the state-of-the-art FPGA-based
robotic computing accelerator designs and summarize the
optimization techniques they adopt. According to the results shown
in Sections III, IV and V, by co-designing both the software
and hardware, FPGAs can achieve more than 10× better performance
and energy efficiency compared to the CPU and GPU
implementations. We also review the partial reconfiguration
methodology in FPGA implementation to further improve
the design flexibility and reduce the overhead. Finally, by
presenting some recent FPGA-based robotics applications in
commercial and space areas, we demonstrate that FPGA has
excellent potential and is a promising candidate for robotic
computing acceleration due to its high reliability, adaptability
and power efficiency.
The authors believe that FPGAs are the best compute
substrate for robotic applications for several reasons: first,
robotic algorithms are still evolving rapidly, and thus any
ASIC-based accelerators will be months or even years behind
the state-of-the-art algorithms; on the other hand, FPGAs can
be dynamically updated as needed. Second, robotic workloads
are highly diverse, thus it is difficult for any ASIC-based
robotic computing accelerator to reach economies of scale
in the near future; on the other hand, FPGAs are a cost-
effective and energy-effective alternative before one type of
accelerator reaches economies of scale. Third, compared to
SoCs that have reached economies of scale, e.g. mobile SoCs,
FPGAs deliver a significant performance advantage. Fourth,
partial reconfiguration allows multiple robotic workloads to
time-share an FPGA, thus allowing one chip to serve multiple
applications, leading to overall cost and energy reduction.
However, FPGAs are still not the mainstream computing
substrate for robotic workloads, for several reasons: first,
FPGA programming is still much more challenging than regu-
lar software programming, and the supply of FPGA engineers
is still limited. Second, although there has been significant progress in
FPGA High-Level Synthesis (HLS) automation in the past few years,
such as [197], HLS is still not able to produce
optimized code, and IP support for robotic workloads is
still extremely limited. Third, commercial software support
for robotic workloads on FPGAs is still missing. For instance,
there is no official ROS support on any commercial FPGA
platform today. For robotic companies to fully exploit the
power of FPGAs, these problems need to be addressed first,
and the authors use these problems to motivate our future
research work.
REFERENCES
[1] A. Qiantori, A. B. Sutiono, H. Hariyanto, H. Suwa, and T. Ohta, “An
emergency medical communications system by low altitude platform at
the early stages of a natural disaster in indonesia,” Journal of medical
systems, vol. 36, no. 1, pp. 41–52, 2012.
[2] A. Ryan and J. K. Hedrick, “A mode-switching path planner for
uav-assisted search and rescue,” in Proceedings of the 44th IEEE
Conference on Decision and Control, pp. 1471–1476, IEEE, 2005.
[3] N. Smolyanskiy, A. Kamenev, J. Smith, and S. Birchfield, “Toward low-
flying autonomous mav trail navigation using deep neural networks for
environmental awareness,” in 2017 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), pp. 4241–4247, IEEE, 2017.
[4] A. Giusti, J. Guzzi, D. C. Cireşan, F.-L. He, J. P. Rodríguez, F. Fontana,
M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, et al., “A machine
learning approach to visual perception of forest trails for mobile
robots,” IEEE Robotics and Automation Letters, vol. 1, no. 2, pp. 661–
667, 2015.
[5] J. K. Stolaroff, C. Samaras, E. R. O’Neill, A. Lubers, A. S. Mitchell,
and D. Ceperley, “Energy use and life cycle greenhouse gas emissions
of drones for commercial package delivery,” Nature communications,
vol. 9, no. 1, pp. 1–13, 2018.
[6] S. J. Kim, Y. Jeong, S. Park, K. Ryu, and G. Oh, “A survey of
drone use for entertainment and avr (augmented and virtual reality),” in
Augmented Reality and Virtual Reality, pp. 339–352, Springer, 2018.
[7] S. Jung, S. Cho, D. Lee, H. Lee, and D. H. Shim, “A direct visual
servoing-based framework for the 2016 iros autonomous drone racing
challenge,” Journal of Field Robotics, vol. 35, no. 1, pp. 146–166,
2018.
[8] “Fact sheet – the federal aviation administration (faa) aerospace
forecast fiscal years (fy) 2020-2040.”
https://www.faa.gov/news/fact_sheets/news_story.cfm?newsId=24756, 2020.
[9] S. Liu, L. Li, J. Tang, S. Wu, and J.-L. Gaudiot, “Creating autonomous
vehicle systems,” Synthesis Lectures on Computer Science, vol. 6, no. 1,
pp. i–186, 2017.
[10] S. Krishnan, Z. Wan, K. Bhardwaj, P. Whatmough, A. Faust, G.-
Y. Wei, D. Brooks, and V. J. Reddi, “The sky is not the limit: A
visual performance model for cyber-physical co-design in autonomous
machines,” IEEE Computer Architecture Letters, vol. 19, no. 1, pp. 38–
42, 2020.
[11] S. Liu and J.-L. Gaudiot, “Autonomous vehicles lite self-driving
technologies should start small, go slow,” IEEE Spectrum, vol. 57,
no. 3, pp. 36–49, 2020.
[12] S. Liu, L. Liu, J. Tang, B. Yu, Y. Wang, and W. Shi, “Edge computing
for autonomous driving: Opportunities and challenges,” Proceedings of
the IEEE, vol. 107, no. 8, pp. 1697–1716, 2019.
[13] S. Liu, J. Tang, Z. Zhang, and J.-L. Gaudiot, “Computer architectures
for autonomous driving,” Computer, vol. 50, no. 8, pp. 18–25, 2017.
[14] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, “[dl] a survey of
fpga-based neural network inference accelerators,” ACM Transactions
on Reconfigurable Technology and Systems (TRETS), vol. 12, no. 1,
pp. 1–26, 2019.
[15] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’05), vol. 1, pp. 886–893, 2005.
[16] X. He, R. S. Zemel, and M. A. Carreira-Perpinan, “Multiscale
conditional random fields for image labeling,” in Proceedings of the
2004 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, 2004. CVPR 2004., vol. 2, pp. II–II, 2004.
[17] X. He, R. S. Zemel, and D. Ray, “Learning and incorporating top-
down cues in image segmentation,” in Computer Vision – ECCV 2006
(A. Leonardis, H. Bischof, and A. Pinz, eds.), (Berlin, Heidelberg),
pp. 338–351, Springer Berlin Heidelberg, 2006.
[18] Y. Xiang, A. Alahi, and S. Savarese, “Learning to track: Online
multi-object tracking by decision making,” in 2015 IEEE International
Conference on Computer Vision (ICCV), pp. 4705–4713, 2015.
[19] R. Girshick, “Fast r-cnn,” 2015 IEEE International Conference on
Computer Vision (ICCV), Dec 2015.
[20] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards
real-time object detection with region proposal networks,” CoRR,
vol. abs/1506.01497, 2015.
[21] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu,
and A. C. Berg, “SSD: single shot multibox detector,” CoRR,
vol. abs/1512.02325, 2015.
[22] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi,
“You only look once: Unified, real-time object detection,” CoRR,
vol. abs/1506.02640, 2015.
[23] J. Redmon and A. Farhadi, “YOLO9000: better, faster, stronger,” CoRR,
vol. abs/1612.08242, 2016.
[24] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” CoRR, vol. abs/1411.4038, 2014.
[25] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pool-
ing in deep convolutional networks for visual recognition,” CoRR,
vol. abs/1406.4729, 2014.
[26] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
network,” CoRR, vol. abs/1612.01105, 2016.
[27] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr,
“Fully-convolutional siamese networks for object tracking,” CoRR,
vol. abs/1606.09549, 2016.
[28] H. Durrant-Whyte and T. Bailey, “Simultaneous localization and map-
ping: part i,” IEEE Robotics Automation Magazine, vol. 13, no. 2,
pp. 99–110, 2006.
[29] C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark,
J. Dolan, D. Duggins, T. Galatali, C. Geyer, et al., “Autonomous driving
in urban environments: Boss and the urban challenge,” Journal of Field
Robotics, vol. 25, no. 8, pp. 425–466, 2008.
[30] M. Montemerlo, J. Becker, S. Bhat, H. Dahlkamp, D. Dolgov, S. Et-
tinger, D. Haehnel, T. Hilden, G. Hoffmann, B. Huhnke, et al., “Junior:
The stanford entry in the urban challenge,” Journal of Field Robotics,
vol. 25, no. 9, pp. 569–597, 2008.
[31] J. Ziegler, P. Bender, M. Schreiber, H. Lategahn, T. Strauss, C. Stiller,
T. Dang, U. Franke, N. Appenrodt, C. G. Keller, E. Kaus, R. G. Her-
rtwich, C. Rabe, D. Pfeiffer, F. Lindner, F. Stein, F. Erbs, M. Enzweiler,
C. Knöppel, J. Hipp, M. Haueis, M. Trepte, C. Brenk, A. Tamke,
M. Ghanaat, M. Braun, A. Joos, H. Fritz, H. Mock, M. Hein, and
E. Zeeb, “Making bertha drive—an autonomous journey on a historic
route,” IEEE Intelligent Transportation Systems Magazine, vol. 6, no. 2,
pp. 8–20, 2014.
[32] C. Katrakazas, M. Quddus, W.-H. Chen, and L. Deka, “Real-time
motion planning methods for autonomous on-road driving: State-of-
the-art and future research directions,” Transportation Research Part
C: Emerging Technologies, vol. 60, pp. 416–442, 2015.
[33] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, “A survey of
motion planning and control techniques for self-driving urban vehicles,”
IEEE Transactions on intelligent vehicles, vol. 1, no. 1, pp. 33–55,
2016.
[34] Y. Deng, Y. Chen, Y. Zhang, and S. Mahadevan, “Fuzzy dijkstra
algorithm for shortest path problem under uncertain environment,”
Applied Soft Computing, vol. 12, no. 3, pp. 1231–1237, 2012.
[35] P. E. Hart, N. J. Nilsson, and B. Raphael, “A formal basis for the
heuristic determination of minimum cost paths,” IEEE transactions on
Systems Science and Cybernetics, vol. 4, no. 2, pp. 100–107, 1968.
[36] S. M. LaValle and J. J. Kuffner Jr, “Randomized kinodynamic plan-
ning,” The international journal of robotics research, vol. 20, no. 5,
pp. 378–400, 2001.
[37] L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. H. Overmars, “Prob-
abilistic roadmaps for path planning in high-dimensional configuration
spaces,” IEEE transactions on Robotics and Automation, vol. 12, no. 4,
pp. 566–580, 1996.
[38] S. Shalev-Shwartz, N. Ben-Zrihem, A. Cohen, and A. Shashua,
“Long-term planning by short-term prediction,” arXiv preprint
arXiv:1602.01580, 2016.
[39] M. Gómez, R. González, T. Martínez-Marín, D. Meziat, and
S. Sánchez, “Optimal motion planning by reinforcement learning in
autonomous mobile vehicles,” Robotica, vol. 30, no. 2, p. 159, 2012.
[40] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-
agent, reinforcement learning for autonomous driving,” arXiv preprint
arXiv:1610.03295, 2016.
[41] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp,
P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., “End to
end learning for self-driving cars,” arXiv preprint arXiv:1604.07316,
2016.
[42] X. Geng, H. Liang, B. Yu, P. Zhao, L. He, and R. Huang, “A scenario-
adaptive driving behavior prediction approach to urban autonomous
driving,” Applied Sciences, vol. 7, no. 4, p. 426, 2017.
[43] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8,
no. 3-4, pp. 279–292, 1992.
[44] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances
in neural information processing systems, pp. 1008–1014, 2000.
[45] S. L. Hicks, I. Wilson, L. Muhammed, J. Worsfold, S. M. Downes,
and C. Kennard, “A depth-based head-mounted visual display to aid
navigation in partially sighted individuals,” PloS one, vol. 8, no. 7,
p. e67695, 2013.
[46] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and
S. Leutenegger, “Elasticfusion: Real-time dense slam and light source
estimation,” The International Journal of Robotics Research, vol. 35,
no. 14, pp. 1697–1716, 2016.
[47] V. A. Prisacariu, O. Kähler, S. Golodetz, M. Sapienza, T. Cavallari,
P. H. Torr, and D. W. Murray, “Infinitam v3: A framework
for large-scale 3d reconstruction with loop closure,” arXiv preprint
arXiv:1708.00783, 2017.
[48] S. Golodetz, T. Cavallari, N. A. Lord, V. A. Prisacariu, D. W. Murray,
and P. H. Torr, “Collaborative large-scale dense 3d reconstruction with
online inter-agent pose optimisation,” IEEE transactions on visualiza-
tion and computer graphics, vol. 24, no. 11, pp. 2895–2905, 2018.
[49] V. D. Nguyen, D. D. Nguyen, T. T. Nguyen, V. Q. Dinh, and J. W. Jeon,
“Support local pattern and its application to disparity improvement and
texture classification,” IEEE transactions on circuits and systems for
video technology, vol. 24, no. 2, pp. 263–276, 2013.
[50] M. Pérez-Patricio and A. Aguilar-González, “Fpga implementation of
an efficient similarity-based adaptive window algorithm for real-time
stereo matching,” Journal of Real-Time Image Processing, vol. 16,
no. 2, pp. 271–287, 2019.
[51] S. Perri, P. Corsonello, and G. Cocorullo, “Adaptive census transform:
A novel hardware-oriented stereovision algorithm,” Computer Vision
and Image Understanding, vol. 117, no. 1, pp. 29–41, 2013.
[52] D.-W. Yang, L.-C. Chu, C.-W. Chen, J. Wang, and M.-D. Shieh,
“Depth-reliability-based stereo-matching algorithm and its vlsi archi-
tecture design,” IEEE Transactions on Circuits and Systems for Video
Technology, vol. 25, no. 6, pp. 1038–1050, 2014.
[53] A. Aguilar-González and M. Arias-Estrada, “An fpga stereo matching
processor based on the sum of hamming distances,” in International
symposium on applied reconfigurable computing, pp. 66–77, Springer,
2016.
[54] C. Ttofis, C. Kyrkou, and T. Theocharides, “A low-cost real-time
embedded stereo vision system for accurate disparity estimation based
on guided image filtering,” IEEE Transactions on Computers, vol. 65,
no. 9, pp. 2678–2693, 2015.
[55] M. Pérez-Patricio, A. Aguilar-González, M. Arias-Estrada, H.-R.
Hernandez-de Leon, J.-L. Camas-Anzueto, and J. de Jesús Osuna-Coutiño,
“An fpga stereo matching unit based on fuzzy logic,”
Microprocessors and Microsystems, vol. 42, pp. 87–99, 2016.
[56] G. Cocorullo, P. Corsonello, F. Frustaci, and S. Perri, “An efficient
hardware-oriented stereo matching algorithm,” Microprocessors and
Microsystems, vol. 46, pp. 21–33, 2016.
[57] P. M. Santos, J. C. Ferreira, and J. S. Matos, “Scalable hardware
architecture for disparity map computation and object location in real-
time,” Journal of Real-Time Image Processing, vol. 11, no. 3, pp. 473–
485, 2016.
[58] B. McCullagh, “Real-time disparity map computation using the cell
broadband engine,” Journal of Real-Time Image Processing, vol. 7,
no. 2, pp. 87–93, 2012.
[59] L. Li, X. Yu, S. Zhang, X. Zhao, and L. Zhang, “3d cost aggregation
with multiple minimum spanning trees for stereo matching,” Applied
optics, vol. 56, no. 12, pp. 3411–3420, 2017.
[60] D. Zha, X. Jin, and T. Xiang, “A real-time global stereo-matching on
fpga,” Microprocessors and Microsystems, vol. 47, pp. 419–428, 2016.
[61] L. Puglia, M. Vigliar, and G. Raiconi, “Real-time low-power fpga
architecture for stereo vision,” IEEE Transactions on Circuits and
Systems II: Express Briefs, vol. 64, no. 11, pp. 1307–1311, 2017.
[62] A. Kjær-Nielsen, K. Pauwels, J. B. Jessen, M. Van Hulle, N. Krüger,
et al., “A two-level real-time vision machine combining coarse-and
fine-grained parallelism,” Journal of Real-Time Image Processing,
vol. 5, no. 4, pp. 291–304, 2010.
[63] S. Jin, J. Cho, X. Dai Pham, K. M. Lee, S.-K. Park, M. Kim, and
J. W. Jeon, “Fpga design and implementation of a real-time stereo
vision system,” IEEE transactions on circuits and systems for video
technology, vol. 20, no. 1, pp. 15–26, 2009.
[64] L. Zhang, K. Zhang, T. S. Chang, G. Lafruit, G. K. Kuzmanov, and
D. Verkest, “Real-time high-definition stereo matching on fpga,” in
Proceedings of the 19th ACM/SIGDA international symposium on Field
programmable gate arrays, pp. 55–64, 2011.
[65] D. Honegger, P. Greisen, L. Meier, P. Tanskanen, and M. Pollefeys,
“Real-time velocity estimation based on optical flow and disparity
matching,” in 2012 IEEE/RSJ International Conference on Intelligent
Robots and Systems, pp. 5177–5182, IEEE, 2012.
[66] M. Jin and T. Maruyama, “Fast and accurate stereo vision system on
fpga,” ACM Transactions on Reconfigurable Technology and Systems
(TRETS), vol. 7, no. 1, pp. 1–24, 2014.
[67] K. Rupnow, Y. Liang, Y. Li, D. Min, M. Do, and D. Chen, “High level
synthesis of stereo matching: Productivity, performance, and software
constraints,” in 2011 International Conference on Field-Programmable
Technology, pp. 1–8, IEEE, 2011.
[68] K. M. Ali, R. B. Atitallah, N. Fakhfakh, and J.-L. Dekeyser, “Exploring
hls optimizations for efficient stereo matching hardware implementa-
tion,” in International Symposium on Applied Reconfigurable Comput-
ing, pp. 168–176, Springer, 2017.
[69] S. Park and H. Jeong, “Real-time stereo vision fpga chip with low error
rate,” in 2007 International Conference on Multimedia and Ubiquitous
Engineering (MUE’07), pp. 751–756, IEEE, 2007.
[70] S. Sabihuddin, J. Islam, and W. J. MacLean, “Dynamic program-
ming approach to high frame-rate stereo correspondence: A pipelined
architecture implemented on a field programmable gate array,” in
2008 Canadian Conference on Electrical and Computer Engineering,
pp. 001461–001466, IEEE, 2008.
[71] M. Jin and T. Maruyama, “A real-time stereo vision system using a
tree-structured dynamic programming on fpga,” in Proceedings of the
ACM/SIGDA international symposium on Field Programmable Gate
Arrays, pp. 21–24, 2012.
[72] R. Kamasaka, Y. Shibata, and K. Oguri, “An fpga-oriented graph
cut algorithm for accelerating stereo vision,” in 2018 International
Conference on ReConFigurable Computing and FPGAs (ReConFig),
pp. 1–6, IEEE, 2018.
[73] C. Banz, S. Hesselbarth, H. Flatt, H. Blume, and P. Pirsch, “Real-time
stereo vision system using semi-global matching disparity estimation:
Architecture and fpga-implementation,” in 2010 International Confer-
ence on Embedded Computer Systems: Architectures, Modeling and
Simulation, pp. 93–101, IEEE, 2010.
[74] W. Wang, J. Yan, N. Xu, Y. Wang, and F.-H. Hsu, “Real-time high-
quality stereo vision system in fpga,” IEEE Transactions on Circuits
and Systems for Video Technology, vol. 25, no. 10, pp. 1696–1708,
2015.
[75] L. F. Cambuim, J. P. Barbosa, and E. N. Barros, “Hardware module
for low-resource and real-time stereo vision engine using semi-global
matching approach,” in Proceedings of the 30th Symposium on Inte-
grated Circuits and Systems Design: Chip on the Sands, pp. 53–58,
2017.
[76] O. Rahnama, T. Cavallari, S. Golodetz, S. Walker, and P. Torr,
“R3sgm: Real-time raster-respecting semi-global matching for power-
constrained systems,” in 2018 International Conference on Field-
Programmable Technology (FPT), pp. 102–109, IEEE, 2018.
[77] L. F. Cambuim, L. A. Oliveira, E. N. Barros, and A. P. Ferreira, “An
fpga-based real-time occlusion robust stereo vision system using semi-
global matching,” Journal of Real-Time Image Processing, pp. 1–22,
2019.
[78] J. Zhao, T. Liang, L. Feng, W. Ding, S. Sinha, W. Zhang, and
S. Shen, “Fp-stereo: Hardware-efficient stereo vision for embedded
applications,” arXiv preprint arXiv:2006.03250, 2020.
[79] O. Rahnama, D. Frost, O. Miksik, and P. H. Torr, “Real-time dense
stereo matching with elas on fpga-accelerated embedded devices,”
IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2008–2015,
2018.
[80] O. Rahnama, T. Cavallari, S. Golodetz, A. Tonioni, T. Joy, L. Di Ste-
fano, S. Walker, and P. H. Torr, “Real-time highly accurate dense depth
on a power budget using an fpga-cpu hybrid soc,” IEEE Transactions
on Circuits and Systems II: Express Briefs, vol. 66, no. 5, pp. 773–777,
2019.
[81] H. Hirschmuller, “Accurate and efficient stereo processing by semi-
global matching and mutual information,” in 2005 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition
(CVPR’05), vol. 2, pp. 807–814, IEEE, 2005.
[82] D. Honegger, H. Oleynikova, and M. Pollefeys, “Real-time and low
latency embedded computer vision hardware based on a combination
of fpga and mobile cpu,” in 2014 IEEE/RSJ International Conference
on Intelligent Robots and Systems, pp. 4930–4935, IEEE, 2014.
[83] D. Hernandez-Juarez, A. Chacón, A. Espinosa, D. Vázquez, J. C.
Moure, and A. M. López, “Embedded real-time stereo estimation via
semi-global matching on the gpu,” Procedia Computer Science, vol. 80,
pp. 143–153, 2016.
[84] S. Mattoccia and M. Poggi, “A passive rgbd sensor for accurate and
real-time depth sensing self-contained into an fpga,” in Proceedings
of the 9th International Conference on Distributed Smart Cameras,
pp. 146–151, 2015.
[85] S. K. Gehrig, F. Eberli, and T. Meyer, “A real-time low-power stereo
vision engine using semi-global matching,” in International Conference
on Computer Vision Systems, pp. 134–143, Springer, 2009.
[86] A. Geiger, M. Roser, and R. Urtasun, “Efficient large-scale stereo
matching,” in Asian conference on computer vision, pp. 25–38,
Springer, 2010.
[87] H. Hirschmuller and D. Scharstein, “Evaluation of cost functions for
stereo matching,” in 2007 IEEE Conference on Computer Vision and
Pattern Recognition, pp. 1–8, IEEE, 2007.
[88] S. Zagoruyko and N. Komodakis, “Learning to compare image patches
via convolutional neural networks,” in Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pp. 4353–4361,
2015.
[89] J. Žbontar and Y. LeCun, “Stereo matching by training a convolutional
neural network to compare image patches,” The journal of machine
learning research, vol. 17, no. 1, pp. 2287–2318, 2016.
[90] W. Luo, A. G. Schwing, and R. Urtasun, “Efficient deep learning for
stereo matching,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 5695–5703, 2016.
[91] A. Seki and M. Pollefeys, “Sgm-nets: Semi-global matching with
neural networks,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 231–240, 2017.
[92] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy,
and T. Brox, “A large dataset to train convolutional networks for
disparity, optical flow, and scene flow estimation,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 4040–
4048, 2016.
[93] A. Kuzmin, D. Mikushin, and V. Lempitsky, “End-to-end learning of
cost-volume aggregation for real-time dense stereo,” in 2017 IEEE 27th
International Workshop on Machine Learning for Signal Processing
(MLSP), pp. 1–6, IEEE, 2017.
[94] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A high perfor-
mance fpga-based accelerator for large-scale convolutional neural net-
works,” in 2016 26th International Conference on Field Programmable
Logic and Applications (FPL), pp. 1–9, IEEE, 2016.
[95] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang,
N. Xu, S. Song, et al., “Going deeper with embedded fpga platform for
convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pp. 26–
35, 2016.
[96] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang,
and H. Yang, “Angel-eye: A complete design flow for mapping cnn
onto embedded fpga,” IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 37, no. 1, pp. 35–47, 2017.
[97] J. Yu, G. Ge, Y. Hu, X. Ning, J. Qiu, K. Guo, Y. Wang, and H. Yang,
“Instruction driven cross-layer cnn accelerator for fast detection on
fpga,” ACM Transactions on Reconfigurable Technology and Systems
(TRETS), vol. 11, no. 3, pp. 1–23, 2018.
[98] H. Nakahara, H. Yonekawa, T. Fujii, and S. Sato, “A lightweight
yolov2: A binarized cnn with a parallel support vector regression
for an fpga,” in Proceedings of the 2018 ACM/SIGDA International
Symposium on field-programmable gate arrays, pp. 31–40, 2018.
[99] M. S. Belshaw, A high-speed Iterative Closest Point tracker on an
FPGA platform. PhD thesis, 2008.
[100] B. Williams, “Evaluation of a soc for real-time 3d slam,” 2017.
[101] B. Van Hoorick, “Fpga-based simultaneous localization and mapping
(slam) using high-level synthesis,” 2019.
[102] Q. Gautier, A. Shearer, J. Matai, D. Richmond, P. Meng, and R. Kast-
ner, “Real-time 3d reconstruction for fpgas: A case study for evaluating
the performance, area, and programmability trade-offs of the altera
opencl sdk,” in 2014 International Conference on Field-Programmable
Technology (FPT), pp. 326–329, IEEE, 2014.
[103] T. Bailey, J. Nieto, J. Guivant, M. Stevens, and E. Nebot, “Consistency
of the ekf-slam algorithm,” in 2006 IEEE/RSJ International Conference
on Intelligent Robots and Systems, pp. 3562–3568, IEEE, 2006.
[104] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a versatile
and accurate monocular slam system,” IEEE transactions on robotics,
vol. 31, no. 5, pp. 1147–1163, 2015.
[105] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, et al., “Fastslam:
A factored solution to the simultaneous localization and mapping
problem,” AAAI/IAAI, pp. 593–598, 2002.
[106] M. Gu, K. Guo, W. Wang, Y. Wang, and H. Yang, “An fpga-based
real-time simultaneous localization and mapping system,” in 2015
International Conference on Field Programmable Technology (FPT),
pp. 200–203, IEEE, 2015.
[107] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira,
I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous
localization and mapping: Toward the robust-perception age,” IEEE
Transactions on robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[108] J. Engel, J. Sturm, and D. Cremers, “Semi-dense visual odometry
for a monocular camera,” in Proceedings of the IEEE international
conference on computer vision, pp. 1449–1456, 2013.
[109] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim,
A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon,
“Kinectfusion: Real-time dense surface mapping and tracking,” in 2011
10th IEEE International Symposium on Mixed and Augmented Reality,
pp. 127–136, IEEE, 2011.
[110] V. Bonato, E. Marques, and G. A. Constantinides, “A floating-point
extended kalman filter implementation for autonomous mobile robots,”
Journal of Signal Processing Systems, vol. 56, no. 1, pp. 41–50, 2009.
[111] D. T. Tertei, J. Piat, and M. Devy, “Fpga design and implementation
of a matrix multiplier based accelerator for 3d ekf slam,” in 2014
International Conference on ReConFigurable Computing and FPGAs
(ReConFig14), pp. 1–6, IEEE, 2014.
[112] D. T. Tertei, J. Piat, and M. Devy, “Fpga design of ekf block accelerator
for 3d visual slam,” Computers & Electrical Engineering, vol. 55,
pp. 123–137, 2016.
[113] B. Vincke, A. Elouardi, and A. Lambert, “Real time simultaneous
localization and mapping: towards low-cost multiprocessor embedded
systems,” EURASIP Journal on Embedded Systems, vol. 2012, no. 1,
p. 5, 2012.
[114] B. Vincke, A. Elouardi, A. Lambert, and A. Dine, “Simd and openmp
optimization of ekf-slam,” in 2014 International Conference on Multi-
media Computing and Systems (ICMCS), pp. 712–716, IEEE, 2014.
[115] W. Fang, Y. Zhang, B. Yu, and S. Liu, “Fpga-based orb feature
extraction for real-time visual slam,” in 2017 International Conference
on Field Programmable Technology (ICFPT), pp. 275–278, IEEE,
2017.
[116] Y. Biadgie and K.-A. Sohn, “Feature detector using adaptive acceler-
ated segment test,” in 2014 International Conference on Information
Science & Applications (ICISA), pp. 1–4, IEEE, 2014.
[117] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “Brief: Binary robust
independent elementary features,” in European conference on computer
vision, pp. 778–792, Springer, 2010.
[118] R. Liu, J. Yang, Y. Chen, and W. Zhao, “eslam: An energy-efficient
accelerator for real-time orb-slam on fpga platform,” in Proceedings of
the 56th Annual Design Automation Conference 2019, pp. 1–6, 2019.
[119] V. H. Schulz, F. G. Bombardelli, and E. Todt, “A harris corner detector
implementation in soc-fpga for visual slam,” in Robotics, pp. 57–71,
Springer, 2016.
[120] S.-K. Lam, G. Jiang, M. Wu, and B. Cao, “Area-time efficient streaming
architecture for fast and brief detector,” IEEE Transactions on Circuits
and Systems II: Express Briefs, vol. 66, no. 2, pp. 282–286, 2018.
[121] T. H. Pham, P. Tran, and S.-K. Lam, “High-throughput and area-
optimized architecture for rbrief feature extraction,” IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 4, pp. 747–
756, 2018.
[122] R. Li, J. Wu, M. Liu, Z. Chen, S. Zhou, and S. Feng, “Hcveacc: a high-
performance and energy-efficient accelerator for tracking task in vslam
system,” in 2020 Design, Automation & Test in Europe Conference &
Exhibition (DATE), pp. 198–203, IEEE, 2020.
[123] M. Abouzahir, A. Elouardi, S. Bouaziz, R. Latif, and A. Tajer,
“Large-scale monocular fastslam2.0 acceleration on an embedded
heterogeneous architecture,” EURASIP Journal on Advances in Signal
Processing, vol. 2016, no. 1, p. 88, 2016.
[124] M. Abouzahir, A. Elouardi, R. Latif, S. Bouaziz, and A. Tajer,
“Embedding slam algorithms: Has it come of age?,” Robotics and
Autonomous Systems, vol. 100, pp. 14–26, 2018.
[125] K. Boikos and C.-S. Bouganis, “Semi-dense slam on an fpga soc,” in
2016 26th International Conference on Field Programmable Logic and
Applications (FPL), pp. 1–4, IEEE, 2016.
[126] K. Boikos and C.-S. Bouganis, “A high-performance system-on-chip
architecture for direct tracking for slam,” in 2017 27th International
Conference on Field Programmable Logic and Applications (FPL),
pp. 1–7, IEEE, 2017.
[127] K. Boikos and C.-S. Bouganis, “A scalable fpga-based architecture
for depth estimation in slam,” in International Symposium on Applied
Reconfigurable Computing, pp. 181–196, Springer, 2019.
[128] Z. Xu, J. Yu, C. Yu, H. Shen, Y. Wang, and H. Yang, “Cnn-
based feature-point extraction for real-time visual slam on embedded
fpga,” in 2020 IEEE 28th Annual International Symposium on Field-
Programmable Custom Computing Machines (FCCM), pp. 33–37,
IEEE, 2020.
[129] J. Yu, F. Gao, J. Cao, C. Yu, Z. Zhang, Z. Huang, Y. Wang, and H. Yang,
“Cnn-based monocular decentralized slam on embedded fpga,” 2020.
[130] J. Yu, Z. Xu, S. Zeng, C. Yu, J. Qiu, C. Shen, Y. Xu, G. Dai, Y. Wang,
and H. Yang, “Inca: Interruptible cnn accelerator for multi-tasking in
embedded robots,” in 2020 57th ACM/ESDA/IEEE Design Automation
Conference (DAC), IEEE, 2020.
[131] Q. Liu, S. Qin, B. Yu, J. Tang, and S. Liu, “pi-ba: Bundle adjustment
hardware accelerator based on distribution of 3d-point observations,”
IEEE Transactions on Computers, 2020.
[132] R. Sun, P. Liu, J. Xue, S. Yang, J. Qian, and R. Ying, “Bax: A bundle
adjustment accelerator with decoupled access/execute architecture for
visual odometry,” IEEE Access, vol. 8, pp. 75530–75542, 2020.
[133] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-
supervised interest point detection and description,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, pp. 224–236, 2018.
[134] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-
Noguer, “Discriminative learning of deep convolutional feature point
descriptors,” in Proceedings of the IEEE International Conference on
Computer Vision, pp. 118–126, 2015.
[135] F. Radenović, G. Tolias, and O. Chum, “Fine-tuning cnn image retrieval
with no human annotation,” IEEE transactions on pattern analysis and
machine intelligence, vol. 41, no. 7, pp. 1655–1668, 2018.
[136] Xilinx, “Dpu for convolutional neural network.”
[137] H. Zhan, R. Garg, C. Saroj Weerasekera, K. Li, H. Agarwal, and
I. Reid, “Unsupervised learning of monocular depth estimation and
visual odometry with deep feature reconstruction,” in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 340–349, 2018.
[138] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad:
Cnn architecture for weakly supervised place recognition,” in Pro-
ceedings of the IEEE conference on computer vision and pattern
recognition, pp. 5297–5307, 2016.
[139] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam
system for monocular, stereo, and rgb-d cameras,” IEEE Transactions
on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
[140] S. Liu, Engineering Autonomous Vehicles and Robots: The DragonFly
Modular-based Approach. Wiley-IEEE Press, 1st ed., Mar. 2020.
[141] M. Maimone, Y. Cheng, and L. Matthies, “Two years of visual
odometry on the mars exploration rovers,” Journal of Field Robotics,
vol. 24, no. 3, pp. 169–186, 2007.
[142] B. Klingner, D. Martin, and J. Roseborough, “Street view motion-
from-structure-from-motion,” in Proceedings of the IEEE International
Conference on Computer Vision, pp. 953–960, 2013.
[143] Y. Jeong, D. Nister, D. Steedly, R. Szeliski, and I.-S. Kweon, “Pushing
the envelope of modern methods for bundle adjustment,” IEEE trans-
actions on pattern analysis and machine intelligence, vol. 34, no. 8,
pp. 1605–1617, 2011.
[144] C. Wu, S. Agarwal, B. Curless, and S. M. Seitz, “Multicore bundle
adjustment,” in CVPR 2011, pp. 3057–3064, IEEE, 2011.
[145] A. Eriksson, J. Bastian, T.-J. Chin, and M. Isaksson, “A consensus-
based framework for distributed bundle adjustment,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 1754–1762, 2016.
[146] R. Zhang, S. Zhu, T. Fang, and L. Quan, “Distributed very large scale
bundle adjustment by global camera consensus,” in Proceedings of the
IEEE International Conference on Computer Vision, pp. 29–38, 2017.
[147] A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, and V. Sze, “Navion:
A 2-mw fully integrated real-time visual-inertial odometry accelerator
for autonomous navigation of nano drones,” IEEE Journal of Solid-
State Circuits, vol. 54, no. 4, pp. 1106–1119, 2019.
[148] P. Leven and S. Hutchinson, “A framework for real-time path planning
in changing environments,” The International Journal of Robotics
Research, vol. 21, no. 12, pp. 999–1030, 2002.
[149] S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal
motion planning,” The international journal of robotics research,
vol. 30, no. 7, pp. 846–894, 2011.
[150] J. D. Gammell, S. S. Srinivasa, and T. D. Barfoot, “Batch informed
trees (bit*): Sampling-based optimal planning via the heuristically
guided search of implicit random geometric graphs,” in 2015 IEEE
international conference on robotics and automation (ICRA), pp. 3067–
3074, IEEE, 2015.
[151] K. Hauser, “Lazy collision checking in asymptotically-optimal motion
planning,” in 2015 IEEE International Conference on Robotics and
Automation (ICRA), pp. 2951–2957, IEEE, 2015.
[152] A. Yershova and S. M. LaValle, “Improving motion-planning algo-
rithms by efficient nearest-neighbor searching,” IEEE Transactions on
Robotics, vol. 23, no. 1, pp. 151–157, 2007.
[153] W. Wang, D. Balkcom, and A. Chakrabarti, “A fast online spanner
for roadmap construction,” The International Journal of Robotics
Research, vol. 34, no. 11, pp. 1418–1432, 2015.
[154] S. Murray, W. Floyd-Jones, G. Konidaris, and D. J. Sorin, “A pro-
grammable architecture for robot motion planning acceleration,” in
2019 IEEE 30th International Conference on Application-specific Sys-
tems, Architectures and Processors (ASAP), vol. 2160, pp. 185–188,
IEEE, 2019.
[155] J. Bialkowski, S. Karaman, and E. Frazzoli, “Massively parallelizing
the rrt and the rrt*,” in 2011 IEEE/RSJ International Conference on
Intelligent Robots and Systems, pp. 3513–3518, IEEE, 2011.
[156] J. Pan and D. Manocha, “Gpu-based parallel collision detection for
fast motion planning,” The International Journal of Robotics Research,
vol. 31, no. 2, pp. 187–200, 2012.
[157] J. Pan, C. Lauterbach, and D. Manocha, “g-planner: Real-time motion
planning and global navigation using gpus,” in AAAI, 2010.
[158] N. Atay and B. Bayazit, “A motion planning processor on reconfig-
urable hardware,” in Proceedings 2006 IEEE International Conference
on Robotics and Automation, 2006. ICRA 2006., pp. 125–132, IEEE,
2006.
[159] S. Murray, W. Floyd-Jones, Y. Qi, G. Konidaris, and D. J. Sorin, “The
microarchitecture of a real-time robot motion planning accelerator,” in
2016 49th Annual IEEE/ACM International Symposium on Microar-
chitecture (MICRO), pp. 1–12, IEEE, 2016.
[160] S. Lian, Y. Han, X. Chen, Y. Wang, and H. Xiao, “Dadu-p: A scalable
accelerator for robot motion planning in a dynamic environment,” in
2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC),
pp. 1–6, IEEE, 2018.
[161] U. Bondhugula, A. Devulapalli, J. Dinan, J. Fernando, P. Wyckoff,
E. Stahlberg, and P. Sadayappan, “Hardware/software integration for
fpga-based all-pairs shortest-paths,” in 2006 14th Annual IEEE Sympo-
sium on Field-Programmable Custom Computing Machines, pp. 152–
164, IEEE, 2006.
[162] K. Sridharan, T. Priya, and P. R. Kumar, “Hardware architecture
for finding shortest paths,” in TENCON 2009-2009 IEEE Region 10
Conference, pp. 1–5, IEEE, 2009.
[163] Y. Takei, M. Hariyama, and M. Kameyama, “Evaluation of an fpga-
based shortest-path-search accelerator,” in Proceedings of the Interna-
tional Conference on Parallel and Distributed Processing Techniques
and Applications (PDPTA), p. 613, The Steering Committee of The
World Congress in Computer Science, Computer Engineering and
Applied Computing (WorldComp), 2015.
[164] K. Vipin and S. A. Fahmy, “Fpga dynamic and partial reconfiguration:
A survey of architectures, methods, and applications,” ACM Computing
Surveys (CSUR), vol. 51, no. 4, pp. 1–39, 2018.
[165] S. Liu, R. N. Pittman, and A. Forin, “Minimizing partial reconfigu-
ration overhead with fully streaming dma engines and intelligent icap
controller,” in FPGA, p. 292, Citeseer, 2010.
[166] S. Liu, R. N. Pittman, A. Forin, and J.-L. Gaudiot, “Achieving
energy efficiency through runtime partial reconfiguration on reconfig-
urable systems,” ACM Transactions on Embedded Computing Systems
(TECS), vol. 12, no. 3, p. 72, 2013.
[167] B. Yu, W. Hu, L. Xu, J. Tang, S. Liu, and Y. Zhu, “Building the computing
system for autonomous micromobility vehicles: Design constraints
and architectural optimizations,” in 2020 53rd Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO), IEEE, 2020.
[168] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski, “Orb: An
efficient alternative to sift or surf,” in ICCV, vol. 11, p. 2, Citeseer,
2011.
[169] B. D. Lucas and T. Kanade, “An iterative image registration technique
with an application to stereo vision,” in Proceedings of the 7th
International Joint Conference on Artificial Intelligence, 1981.
[170] W. Fang, Y. Zhang, B. Yu, and S. Liu, “Dragonfly+: Fpga-based
quad-camera visual slam system for autonomous vehicles,” Proc. IEEE
HotChips, p. 1, 2018.
[171] T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile monoc-
ular visual-inertial state estimator,” IEEE Transactions on Robotics,
vol. 34, no. 4, pp. 1004–1020, 2018.
[172] K. Sun, K. Mohta, B. Pfrommer, M. Watterson, S. Liu, Y. Mulgaonkar,
C. J. Taylor, and V. Kumar, “Robust stereo visual inertial odometry for
fast autonomous flight,” IEEE Robotics and Automation Letters, vol. 3,
no. 2, pp. 965–972, 2018.
[173] R. Szeliski, Computer Vision: Algorithms and Applications. Texts in
Computer Science, Springer London, 2010.
[174] A. Geiger, M. Roser, and R. Urtasun, “Efficient large-scale stereo
matching,” in Proceedings of the 10th Asian Conference on Computer
Vision, 2010.
[175] Y. Feng, P. Whatmough, and Y. Zhu, “Asv: Accelerated stereo vision
system,” in Proceedings of the 52nd Annual IEEE/ACM International
Symposium on Microarchitecture, MICRO ’52, pp. 643–656, 2019.
[176] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed
tracking with kernelized correlation filters,” IEEE Transactions on
Pattern Analysis & Machine Intelligence, vol. 37, no. 3, pp. 583–596,
2015.
[177] A. Kelly, Mobile Robotics: Mathematics, Models, and Methods. Cam-
bridge University Press, 2013.
[178] J. Tang, B. Yu, S. Liu, Z. Zhang, W. Fang, and Y. Zhang, “pi-soc:
Heterogeneous soc architecture for visual inertial slam applications,”
in 2018 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pp. 8302–8307, 2018.
[179] S. Qin, Q. Liu, B. Yu, and S. Liu, “pi-ba: Bundle adjustment accel-
eration on embedded fpgas with co-observation optimization,” in 27th
IEEE Annual International Symposium on Field-Programmable Custom
Computing Machines, FCCM 2019, San Diego, CA, USA, April 28 -
May 1, 2019, pp. 100–108, IEEE, 2019.
[180] “Bundle adjustment in the large.” https://grail.cs.washington.edu/projects/bal/.
[181] P. L. Mckerracher, R. P. Cain, J. C. Barnett, W. S. Green, and J. D.
Kinnison, “Design and test of field programmable gate arrays in space
applications,” 1992.
[182] M. Berg, “Fpga mitigation strategies for critical applications,” 2019.
[183] D. Sheldon, “Flash-based fpga nepp fy12 summary report.”
[184] R. Gaillard, “Single event effects: Mechanisms and classification,” in
Soft errors in modern electronic systems, pp. 27–54, Springer, 2011.
[185] M. Wirthlin, “Fpgas operating in a radiation environment: lessons
learned from fpgas in space,” Journal of Instrumentation, vol. 8, no. 02,
p. C02020, 2013.
[186] F. Brosser and E. Milh, “Seu mitigation techniques for advanced
reprogrammable fpga in space,” Master’s thesis, 2014.
[187] B. Ahmed and C. Basha, Fault mitigation strategies for reliable FPGA
architectures. PhD thesis, Rennes 1, 2016.
[188] S. Habinc, “Suitability of reprogrammable fpgas in space applications,”
Gaisler Research, Feasibility Report, 2002.
[189] G. Lentaris, K. Maragos, I. Stratakos, L. Papadopoulos, O. Papaniko-
laou, D. Soudris, M. Lourakis, X. Zabulis, D. Gonzalez-Arjona, and
G. Furano, “High-performance embedded computing in space: Evalu-
ation of platforms for vision-based navigation,” Journal of Aerospace
Information Systems, vol. 15, no. 4, pp. 178–192, 2018.
[190] T. Y. Li and S. Liu, “Enabling commercial autonomous robotic space
explorers,” IEEE Potentials, vol. 39, no. 1, pp. 29–36, 2019.
[191] D. Ratter, “Fpgas on mars,” Xcell Journal, vol. 50, pp. 8–11, 2004.
[192] J. F. Bell III, S. Squyres, K. E. Herkenhoff, J. Maki, H. Arneson,
D. Brown, S. Collins, A. Dingizian, S. Elliot, E. Hagerott, et al., “Mars
exploration rover athena panoramic camera (pancam) investigation,”
Journal of Geophysical Research: Planets, vol. 108, no. E12, 2003.
[193] “Space flight system design and environmental test.”
https://www.nasa.gov/sites/default/files/atoms/files/std8070.1.pdf. Accessed: 2020-09-01.
[194] M. C. Malin, M. A. Ravine, M. A. Caplinger, F. Tony Ghaemi, J. A.
Schaffner, J. N. Maki, J. F. Bell III, J. F. Cameron, W. E. Dietrich,
K. S. Edgett, et al., “The mars science laboratory (msl) mast cameras
and descent imager: investigation and instrument descriptions,” Earth
and Space Science, vol. 4, no. 8, pp. 506–539, 2017.
[195] C. D. Edwards, T. C. Jedrey, A. Devereaux, R. DePaula, and M. Dapore,
“The electra proximity link payload for mars relay telecommunications
and navigation,” 2003.
[196] A. Johnson, S. Aaron, J. Chang, Y. Cheng, J. Montgomery, S. Mohan,
S. Schroeder, B. Tweddle, N. Trawny, and J. Zheng, “The lander vision
system for mars 2020 entry descent and landing,” 2017.
[197] “Vivado high-level synthesis.”
https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html.
Accessed: 2020-09-10.
