Semi-dense SLAM on an FPGA SoC by Boikos, K & Bouganis, C-S
Semi-Dense SLAM
on an FPGA SoC
Konstantinos Boikos
Department of Electrical and
Electronic Engineering
Imperial College London
Email: k.boikos14@imperial.ac.uk
Christos-Savvas Bouganis
Department of Electrical and
Electronic Engineering
Imperial College London
Abstract—Moving Simultaneous Localisation and Mapping, or
SLAM, algorithms into the embedded space with efficient, low-
power designs can open the way to emerging new applications
including autonomous robotics and augmented reality. Such
applications require an accurate and information rich recon-
struction of the environment. Current approaches try to achieve
this by either significantly reducing the richness and quality
of the recovered information or by offloading the computation
to a base station, thereby increasing energy consumption and
latency, neither of which is optimal. This paper discusses the
use of a novel platform for embedded SLAM, an FPGA-SoC,
combining an embedded CPU and programmable logic on the
same chip. The use of programmable logic, tightly integrated
with an efficient multicore embedded CPU stands to provide an
effective solution to this problem. The baseline performance was
based on a dual-core ARM microprocessor, running a highly
optimised version of LSD SLAM, a complex, state-of-the-art
Semi-Dense SLAM algorithm. In this work an acceleration of
2× has been achieved for LSD-SLAM, without any compromise
in the quality of the result. Results show an average framerate
of more than 4 frames/second for a resolution of 320x240.
I. INTRODUCTION
Simultaneous localisation and mapping, or SLAM, has been
an important and active research area in the field of mobile
robotics for more than two decades. It consists of integrating
a series of observations of an environment to build a map of
that environment while simultaneously keeping track of the
observer’s position in the generated map. It is a fundamental
problem for any autonomous or semi-autonomous robots that
have to operate in unknown or changing spaces. One such
application is self-driving cars, which have to continuously
monitor their environment for obstacles or other vehicles
while tracking their own position with high accuracy, using
a combination of radar, sonar, and camera sensors. Visual
SLAM is also a critical ingredient of the rapidly growing
field of autonomous Unmanned Aerial Vehicles or UAVs.
Autonomous UAVs, such as quadcopters or fixed-wing aircraft,
that can operate in unknown environments in the real world
can open the way to exciting applications. Existing work
includes using swarms for rapid exploration of large spaces[1],
much more effective search and rescue operations [2] and
precision agriculture, used in conjunction with other computer
vision techniques [3].
UAVs and other robots impose tight constraints on the
power and weight of the electronics on board. Other appli-
cations that require some form of SLAM, such as augmented
reality, are also power and performance constrained.
Research and industry solutions for SLAM in the embedded
space either use onboard processing with extremely
efficient software implementations for low power platforms,
or attempt to offload the computationally demanding tasks
to a base station. In the first case, the usual approach is to
trade information richness and accuracy for performance by
using a lower sensor resolution, tracking a sparse set of points
instead of surfaces or objects, and avoiding denser or large-scale
mapping and other features found in state-of-the-art
SLAM algorithms. These solutions are either designed from
the ground up to be economical in the amount of computation
they perform, for example [4] and [5] or are reduced versions
of existing algorithms with the same goal in mind, for ex-
ample [6]. However, emerging applications in robotics require
unprecedented information richness and accuracy, which these
solutions cannot provide.
Another solution that has been used in the past is to offload
a large part of the computation to a base station, but this
comes with many disadvantages as well. Examples of this
are [7] and [8]. Not only does it limit the range of operation
significantly, SLAM applications would require uninterrupted,
high bandwidth and low latency communication since they
generate a high amount of data that needs to be processed
in real time. They also require constant feedback between the
generated map and the localisation task. These demands would
strain most network solutions, and come with a significant in-
crease in power consumption themselves. Hence, for practical
applications on embedded platforms, computation has to be
on-board, and with a high computation per watt ratio.
This amount of computational demand, not met by software
optimisation, could be potentially solved by new hardware
platforms. In the last decade, major FPGA companies intro-
duced a new type of embedded System-on-Chip (SoC), one
that combines an embedded low-power CPU integrated with
programmable logic on the same chip. This type of heteroge-
neous system, while having low enough power requirements
to fit low-power platforms, stands to substantially improve
performance and bring more advanced and computationally
demanding algorithms to the embedded space.
This work investigates the performance and acceleration
opportunities for a complex, optimised, state-of-the-art Semi-
Dense SLAM algorithm, based on LSD-SLAM by Engel et
al.[9] using an FPGA SoC. The algorithm that was used as a
base is the one the authors provide for research use, and will
be discussed in more detail in section IV.
The main contributions of this work are: the first, to the
best of our knowledge, mapping of the LSD-SLAM algorithm
to an FPGA, and the identification of opportunities and bottle-
necks of heterogeneous FPGA SoCs for such algorithms. The
next two sections will give an overview of related work, and
a background for the rest of the paper.
II. BACKGROUND
SLAM is an inverse problem, in which the goal is to
recover as much information as possible about an environment
using a series of noisy observations, insufficient to produce
a unique solution, simultaneously solving two interdependent
probabilistic problems. As a result of this, most algorithms
are based on complex probabilistic models, that in conjunction
with the amount of data required to construct a 3D-map, result
in high computational requirements.
Because of this, state-of-the-art algorithms in SLAM require
at the very least a modern multicore desktop CPU to run in
real time, and often high-end GPUs for acceleration as well, as
for example in [10] and [11] who aim to achieve a processing
speed of at least 30 frames per second. There is therefore
a large gap currently between SLAM research and SLAM
implementations in embedded platforms, owing to the large
gap in computational capabilities between high-end GPUs and
desktop CPUs and those in mobile devices, or embedded SoCs.
In the problem of SLAM the algorithm consists of Tracking,
Mapping, and sometimes global optimisation. Tracking usually
refers to the frame to frame, real-time estimation of the camera
pose, and feature selection and detection, while Mapping refers
to the use of that pose and features for landmark depth estima-
tion and subsequently the reconstruction of the environment
in two or three dimensions. Global optimisation as a step in
many SLAM implementations refers to the optimisation of a
series of poses and observations globally, to minimise the total
accumulated error, in a way sometimes referred to as Bundle
Adjustment. The state of the art in SLAM uses high quality
frames called keyframes as the basis of bundle adjustment,
which allows the optimisation to be extended to a larger
trajectory using fewer computational resources [12], [13], [14].
SLAM algorithms are described as sparse when they use
a comparatively sparse set of observations for Tracking and
maintain a sparse map of the environment. At the other end,
Dense SLAM algorithms attempt to reconstruct most of the
environment in their visual field, and often use more of their
observations for Tracking as well. Lately, algorithms are
described as Semi-Dense when they do not necessarily deal with
all of the visual information at their disposal, but use many
more observations than typical Sparse SLAM algorithms.
Semi-dense SLAM is an attempt at generating more useful maps
for robotics by including a larger number of points in the
calculated depth map, while retaining some of the efficiency
of only processing a high quality subset of the available visual
information.
III. RELATED WORK
SLAM implementations in the embedded space were tradi-
tionally dominated by sparse and mainly feature-based SLAM
methods. If the platform is power, or weight constrained, the
embedded processor options will provide a performance that
immediately disqualifies a large part of the state-of-the-art
Semi-Dense or Dense SLAM research. Furthermore, some-
times even Sparse SLAM algorithms are too computationally
demanding for some embedded platforms and have to be
further constrained in the amount of processing they do, which
means that not only is the information recovered sparse, it is
sometimes less accurate as well. A variety of sensors have
been used previously, such as a laser scanner [15], a Kinect
sensor [16] or a stereo rig [17] mounted on a quadrocopter
for embedded navigation. These sensors provide absolute scale,
but have limited range and large weight, size and power
consumption when compared to a monocular setup [8], [18],
which is why a lot of the recent research for SLAM in the
embedded space has focused on monocular solutions. Some
recent examples of work with clearly presented performance
figures are summarised in Table I.
Current SLAM algorithms on low-power robotics focus first
on robust Tracking, and the most efficient way, performance-
wise, to achieve this is with a sparse set of high quality
features. This means that work in this field consists of limited
large scale mapping and reconstruction, or none at all, at which
point it is classified as Visual Odometry or VO. An example
of this category is [19], which achieves real-time direct visual
odometry on a quad-core ARM Cortex-A9. The authors use
a sparse, direct image alignment framework for pure odometry.
They report a processing rate of 55 fps, achieved by restricting
the number of keyframes and interest points. The authors
claim to perform accurate odometry in outdoor environments,
something that is supported by the results of their testing,
which show reduced drift and higher accuracy in comparison
to related work by Weiss et al. [weiss2013monocular]. Another
example of real-time VO on a flying robot is [26], where a
1.6GHz Intel Atom based board was used to achieve 40fps
with an inertial-optical flow framework, combining an IMU
and a camera to perform VO on a high-agility MAV (Micro
Aerial Vehicle).
Another sparse SLAM algorithm on a mobile device is [27],
which used few, sparse keyframes and limited bundle
adjustment, while tracking at 120x160. The achieved performance
is not directly comparable to more modern approaches like
[6], but this work was an important first proof of concept for
running a state-of-the-art SLAM method with bundle
adjustment on a smartphone. Multi-threading and SIMD instructions
have been used to improve performance on some embedded
platforms. Schöps et al. [6] ran a modified version of
LSD-SLAM on an Xperia Z1 smartphone with a quad-core ARM
processor. It consists of only the tracking part to perform visual
odometry, and the current depth estimation, without the large
scale mapping features of LSD-SLAM, and achieved 30fps
with hand-optimised functions with NEON assembly and a
reduced resolution of 160x120.
TABLE I
VISUAL NAVIGATION AND MAPPING ON EMBEDDED DEVICES
Reference | Platform | Tracking | Framework | Mapping | Performance
Forster [19] | Nano Quadrotor | Direct | B.A. | VO only | 55fps on quad-core A9
Faessler [20] | AR.Drone | Direct | B.A. | Dense real-time | Offboard (Quadro K2000M)
Schöps [21] | Xperia Z1 | Direct | B.A. | VO only | 30fps on quad-core ARM (Krait)
Vincke [22] | Ground Robot | Feature-based | EKF | Pointcloud | 38fps
Lee [4] | Ground Robot | Line features | Pose-graph Opt. | 2D map of features | 2.5fps on ARM11
Barry [23] & [24] | Fixed-wing Aircraft | Single-depth stereo | 12-state Kalman F. | Pointcloud | 120fps with 2x quad-core ARM
Honegger [25] & [24] | Fixed-wing Aircraft | SGM stereo match. | - | Disparity map | 120fps ARM & FPGA preprocessing
Weiss [26] | MAV | Feature-based | EKF | Pointcloud | 40fps on Atom 1.6GHz
Vincke et al. [5] ran SLAM on a hardware system consisting
of an ARM based board and microcontroller. They built
an EKF framework, with a FAST corner feature detector
front-end, that combines odometry and visual data from a
camera and achieves 38fps, but they are limited in their area of
operation since they suffer from the EKF's characteristic
quadratic, unbounded growth in complexity. Lee et al. [4]
demonstrated another computationally light approach to low-cost
indoor robot localisation, based on line features and a simplistic
2D map. Using this technique their robot can achieve around
2.5fps running on an ARM11 processor at 533MHz, while
concurrently building a map and planning a path with obstacle
avoidance.
Another approach in the literature is using off-board
processing on base stations for dense mapping and global
optimisation in conjunction with lightweight onboard localisation
methods, such as the ones discussed, to provide the best of
both worlds. There are, however, significant drawbacks to this
approach, which include increased power consumption, high
bandwidth requirements, a reduced area of operation and
increased latency. Sturm et al. [7] use a quadrocopter
to create a 3D reconstruction of indoor scenes with an RGB-D
camera and a base station with an Nvidia GTX560 GPU. Engel
et al. [28] perform camera-based navigation on a quadrocopter,
based on PTAM [12] using keyframes and bundle adjustment,
with an EKF for pose estimates, by streaming the sensor data
wirelessly to a laptop on site[29]. PTAM is also used in another
vision based MAV navigation framework [8] which used a
probabilistic framework to fuse data from an IMU with the
visual tracking and mapping, offloading the computationally
demanding parts to a dual core desktop CPU on a base station.
Related to SLAM algorithms, FPGAs have been used to
implement feature detectors based on SURF[30], SIFT [31]
and many others, as well as stereo disparity estimation such
as [32], [33]. A very interesting example of both, where
the FPGA is acting as a coprocessor, is [34], where FPGAs
are considered for Autonomous Planetary Rover Navigation.
However, they are not prevalent in embedded SLAM for a
few reasons. They used to be too heavy and power-hungry,
and they were difficult to program and integrate,
especially for researchers not specialised in hardware design.
However, there have been many improvements on the power
front, and FPGA SoCs that combine light FPGAs with mobile
CPUs on the same chip make integration much easier as well.
Advances in tools like High-level synthesis promise to reduce
the programming difficulty as well.
A recent example of a hardware system combining a
mobile CPU and FPGA for real-time SLAM is [25], where
an FPGA acts as a preprocessor for feature detection and
tracking, and streams its results towards the main memory.
It ran SGM stereo at a resolution of 752x480, and achieved
a performance of 60 frames/second, limited by the maximum
framerate offered by the sensor. Barry et al. [24] test and
compare this work with an embedded CPU based system for
real-time vision on high-speed MAVs. The second system
runs on two ARM-based boards, each with a quad-core ARM
processor, to perform a sparse, fixed-distance, filtering based
stereo algorithm called 'pushbroom stereo' [23]. According
to [24], the sparse pushbroom stereo produces "substantially
more sparse data" than the FPGA's Depth Map, at a quarter of
the power consumption. In the comparison paper, the authors
set both systems up at a lower resolution of 320x240,
since the focus there is a very high framerate, to allow for rapid
obstacle avoidance at high flying speeds. Both systems ran
successfully at 120 frames per second.
The difference, as the authors also note in this comparison,
is that the FPGA-CPU combination is a platform that has so far
been less accessible to the software community compared
to a mobile CPU. However, an FPGA-CPU based paradigm
offers the opportunity for much richer SLAM to run on
embedded low-power systems. As computation requirements
are increasing steadily, and semiconductor scaling slows down,
the trend towards more specialised computing such as FPGA-
based accelerators will accelerate. This could also be assisted
by research in improved methods and tools for utilising such
devices, as well as novel platforms such as FPGA Systems on
chip.
IV. LSD-SLAM
One of the best performing and most well-known systems
in this field is LSD-SLAM, which was published by Engel
et al. in 2014 [9]. In comparison to other state-of-the-art
methods, such as ORB-SLAM [35], it is the only one to achieve
robust, accurate and efficient Direct photometric tracking
simultaneously with semi-dense map reconstruction. A dense
map is a very important feature for robotics, and skipping
the step of feature extraction and matching means it works
robustly in a variety of environments, and does not rely on
specific types of features to work well. This paper is about
accelerating a Semi-Dense SLAM system with an FPGA-CPU
system, and at the time of writing, the reasons above made
LSD-SLAM the best candidate.
As its tracking method, LSD-SLAM performs direct whole-
image alignment to recover the camera pose, and subsequently
performs a filtering depth estimation, to generate depth in-
formation. It also incorporates uncertainties about the depth
estimation in both tracking and mapping. The required infor-
mation is stored in a data structure called a Keyframe, which
contains depth and depth variance information for a subset
of the frame pixels. After a sufficient camera displacement,
when the distance heuristic between the current frame and the
keyframe is satisfied, a new keyframe is generated. Then, the
previous keyframe is retired and is incorporated in a graph of
keyframes, which for LSD-SLAM represents the global map.
Notation: Matrices are denoted as bold, capital letters (R)
and vectors as bold lower case letters (t).
The camera poses are initially represented as 3D Rigid Body
Transformations as in eq. 1, while the pose to pose constraints
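between keyframes are represented with 3D Similarity
Transformations, which have an additional scale factor $s \in \mathbb{R}$:
\[
\mathbf{T} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \in SE(3) \tag{1}
\]
\[
\mathbf{T} = \begin{bmatrix} s\mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \in Sim(3) \tag{2}
\]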
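As an illustration of equations 1 and 2, the following minimal Python sketch builds both kinds of 4x4 matrix and shows how the scale factor accumulates when similarity transforms are chained. The helper names (rot_z, make_transform, matmul, apply, scale_of) are hypothetical and are not taken from the LSD-SLAM source code.

```python
import math

def rot_z(theta):
    """3x3 rotation about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def make_transform(R, t, s=1.0):
    """Build [[s*R, t], [0, 1]]; s = 1 gives an SE(3) element (eq. 1),
    any other positive s gives a Sim(3) element (eq. 2)."""
    T = [[s * R[i][j] for j in range(3)] + [t[i]] for i in range(3)]
    T.append([0.0, 0.0, 0.0, 1.0])
    return T

def matmul(A, B):
    """Plain square matrix product."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def apply(T, p):
    """Apply a 4x4 transform to a 3D point."""
    return [sum(T[i][j] * p[j] for j in range(3)) + T[i][3] for i in range(3)]

def scale_of(T):
    """Recover s from the upper-left block, using det(s*R) = s^3."""
    a, b, c = T[0][:3], T[1][:3], T[2][:3]
    det = (a[0] * (b[1] * c[2] - b[2] * c[1])
           - a[1] * (b[0] * c[2] - b[2] * c[0])
           + a[2] * (b[0] * c[1] - b[1] * c[0]))
    return det ** (1.0 / 3.0)

# A rigid body pose: rotate 90 degrees about z, then translate along x.
pose = make_transform(rot_z(math.pi / 2), [1.0, 0.0, 0.0])
# A pure scaling constraint between two keyframes.
edge = make_transform(rot_z(0.0), [0.0, 0.0, 0.0], s=2.0)
```

Chaining two such edges with matmul(edge, edge) yields a transform whose recovered scale is the product of the two individual scales, which is the quantity the similarity constraints are meant to keep track of.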
A. Tracking in LSD-SLAM
LSD-SLAM uses pixels characterised by a gradient bigger
than a set threshold to perform tracking and mapping. It
operates under the assumption that the pixel areas that are
the most useful are those that contain a high enough intensity
gradient, and therefore some kind of texture or edge. This
could be considered as a feature, with the difference that LSD-
SLAM does not require the intermediate and computationally
demanding step of detecting and comparing features to find
unique matches, and at the same time manages to create a
more dense reconstruction of the environment, offering a good
middle ground between sparse and dense algorithms.
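The pixel selection described above can be sketched as follows. This is a simplified illustration that assumes a central-difference gradient and an arbitrary threshold; the function name and the threshold value are hypothetical and do not reflect the values used in LSD-SLAM.

```python
def select_high_gradient_pixels(image, threshold):
    """Return (row, col) of interior pixels whose intensity gradient
    magnitude exceeds the threshold.

    image is a list of rows of intensities. The gradient is estimated
    with central differences, which is an assumption of this sketch.
    """
    height, width = len(image), len(image[0])
    selected = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = 0.5 * (image[y][x + 1] - image[y][x - 1])
            gy = 0.5 * (image[y + 1][x] - image[y - 1][x])
            # Compare squared magnitudes to avoid a square root per pixel.
            if gx * gx + gy * gy > threshold * threshold:
                selected.append((y, x))
    return selected

# A 5x5 image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 100, 100, 100] for _ in range(5)]
edge_pixels = select_high_gradient_pixels(image, threshold=10)
```

Only the pixels on either side of the step edge are selected; the flat regions contribute nothing, which is how a semi-dense method avoids processing the whole image.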
During optimisation, LSD-SLAM uses the vector $\xi \in \mathfrak{se}(3)$
of the associated Lie algebra directly as a minimal way to
represent the camera pose. The Lie algebra offers a way to apply
mathematical optimisation methods to the poses obtained,
by representing them as elements of a Lie group, which is
a locally Euclidean differentiable manifold. After the first
estimation, each pose will be processed and updated by
an iterative optimisation process, the Levenberg-Marquardt
algorithm [36]. The poses can be converted back to SE(3) with
an exponential map $G = \exp_{\mathfrak{se}(3)}(\xi)$.
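To make the exponential map concrete, the sketch below implements the standard closed-form expression for SE(3) in plain Python. It assumes the six-vector is ordered as (translation part, rotation part); that ordering and the function names are assumptions made for illustration and may differ from the convention used in the LSD-SLAM code.

```python
import math

def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]

def mat3_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def exp_se3(xi):
    """Map xi = (v, w) in se(3) to a 4x4 matrix in SE(3).

    Uses R = I + a*W + b*W^2 and t = (I + b*W + c*W^2) v, with
    a = sin(th)/th, b = (1 - cos(th))/th^2, c = (th - sin(th))/th^3.
    """
    v, w = xi[:3], xi[3:]
    theta = math.sqrt(sum(x * x for x in w))
    if theta < 1e-8:
        # Limits of the coefficients as theta goes to zero.
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    W = hat(w)
    W2 = mat3_mul(W, W)
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    R = [[eye[i][j] + a * W[i][j] + b * W2[i][j] for j in range(3)]
         for i in range(3)]
    V = [[eye[i][j] + b * W[i][j] + c * W2[i][j] for j in range(3)]
         for i in range(3)]
    t = [sum(V[i][j] * v[j] for j in range(3)) for i in range(3)]
    return [R[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]
```

A zero vector maps to the identity, a pure translation vector maps to a pure translation, and a rotation vector of length pi/2 about z maps to a quarter turn about z.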
The pose is recovered by aligning the frame that is tracked
with the current keyframe, to minimise the photometric error.
The function to which Levenberg-Marquardt is applied is the
variance-normalized sum of the photometric error, calculated
by directly comparing pixel intensities. From [9]:
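\[
E_p(\xi_{ji}) = \sum_{p \in \Omega_{D_i}} \left\| \frac{r_p^2(p, \xi_{ji})}{\sigma^2_{r_p(p, \xi_{ji})}} \right\|_\delta \tag{3}
\]
is the error function, in which $r_p$ is the residual, calculated for
the subset of pixels $p$ which contain a depth value $D_i$:
\[
r_p(p, \xi_{ji}) := I_i(p) - I_j(\omega(p, D_i(p), \xi_{ji})) \tag{4}
\]
and $\sigma^2_{r_p}$ is the residual's variance, which is computed using
covariance propagation and utilizing the inverse depth variance
$V_i$ under a Gaussian noise assumption:
\[
\sigma^2_{r_p(p, \xi_{ji})} = 2\sigma_I^2 + \left( \frac{\partial r_p(p, \xi_{ji})}{\partial D_i(p)} \right)^2 V_i(p) \tag{5}
\]
The operator $\| \cdot \|_\delta$ is the Huber norm:
\[
\| r^2 \|_\delta := \begin{cases} \dfrac{r^2}{2\delta} & \text{if } |r| \le \delta \\ |r| - \dfrac{\delta}{2} & \text{otherwise} \end{cases} \tag{6}
\]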
and is applied to the normalised residual to reduce the effect
of outliers in the optimisation process.
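Equations 3, 5 and 6 can be read as a small computation per reference point. The sketch below is a direct transcription under the assumption that the residuals and the depth derivatives have already been computed; the function names and the sample numbers are hypothetical.

```python
import math

def residual_variance(sigma_i2, dr_dD, depth_variance):
    """Eq. 5: image noise term plus propagated inverse depth variance."""
    return 2.0 * sigma_i2 + dr_dD ** 2 * depth_variance

def huber_norm(r, delta):
    """Eq. 6: quadratic for small residuals, linear for large ones."""
    a = abs(r)
    return r * r / (2.0 * delta) if a <= delta else a - delta / 2.0

def photometric_error(residuals, variances, delta):
    """Eq. 3: sum of Huber norms of the variance-normalised residuals."""
    return sum(huber_norm(r / math.sqrt(v), delta)
               for r, v in zip(residuals, variances))

# Illustrative values only.
var = residual_variance(sigma_i2=1.0, dr_dD=2.0, depth_variance=0.5)
err = photometric_error([2.0], [var], delta=2.0)
```

The two branches of the Huber norm agree at |r| = delta, so the cost is continuous, and beyond that point an outlier contributes only linearly instead of quadratically.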
Levenberg-Marquardt is in essence a modified Gauss-Newton
optimisation method that is better able to deal with
non-linear functions and converges faster by adding a positive
multiple of an identity matrix of the same size to the Hessian
matrix: $\mathbf{H}(x) + \lambda \mathbf{I}$. In this implementation, the Hessian
is substituted with the approximation $\mathbf{J}^T\mathbf{J}$, so the step is
formulated as:
\[
\delta\xi^{(n)} = -(\mathbf{J}^T\mathbf{J} + \lambda \mathbf{I})^{-1} \mathbf{J}^T \mathbf{r}(\xi^{(n)}) \tag{7}
\]
where $\xi \in \mathfrak{se}(3)$ and $n$ is the optimisation step, with:
\[
\mathbf{J} = \left. \frac{\partial \mathbf{r}(\epsilon \circ \xi^{(n)})}{\partial \epsilon} \right|_{\epsilon = 0} \tag{8}
\]
Additionally, in LSD-SLAM a weighting scheme is implemented,
where in each iteration a weight matrix
$\mathbf{W} = \mathbf{W}(\xi^{(n)})$ is computed which down-weights large
residuals. The residual in the iteratively solved error function is
then multiplied by the weight factors, and the update is computed
as
\[
\delta\xi^{(n)} = -(\mathbf{J}^T\mathbf{W}\mathbf{J} + \lambda \mathbf{I})^{-1} \mathbf{J}^T\mathbf{W} \mathbf{r}(\xi^{(n)}) \tag{9}
\]
The goal of the optimisation process is to estimate an update
$\delta\xi^{(n)}$ that converges towards a minimum of the error function
as quickly and accurately as possible. In LSD-SLAM, the
first step is to calculate the residual and weights, and perform
one iteration of the optimisation process. After this step, the
linear system generated is solved for different values of $\lambda$, the
residual and weights are recalculated, and a tentative update
step is generated and applied to generate a new pose. If the
new pose decreases the error then it is applied as follows:
\[
\xi^{(n+1)} = \log_{SE(3)}\left( \exp_{\mathfrak{se}(3)}(\delta\xi^{(n)}) \cdot \exp_{\mathfrak{se}(3)}(\xi^{(n)}) \right) \tag{10}
\]
and we recalculate the Hessian and Jacobian from the new
position. Otherwise the update is rejected, and the system
is solved again for a new $\lambda$. This process is stopped
if a maximum number of iterations is reached, if the error
decreases by an amount smaller than a threshold, or if the
update step is smaller than a set threshold, in an effort to
improve convergence while keeping the number of iterations
as low as possible.
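The accept/reject logic and the three stopping criteria can be sketched on a toy problem. The example below fits a single rate parameter a in y = exp(a x), so the matrix inverse of the weighted step reduces to a scalar division and the update is additive rather than the composition on the manifold used for poses. The Huber-style weights, the factors used to shrink and grow lambda, and all thresholds are assumptions chosen for illustration, not the values used in LSD-SLAM.

```python
import math

def huber_cost(r, delta=1.0):
    a = abs(r)
    return r * r / (2.0 * delta) if a <= delta else a - delta / 2.0

def huber_weight(r, delta=1.0):
    """Assumed weighting: 1 for small residuals, delta/|r| for large ones."""
    a = abs(r)
    return 1.0 if a <= delta else delta / a

def fit_rate(xs, ys, a=0.0, lam=1.0, max_iters=100,
             min_step=1e-10, min_decrease=1e-12):
    def residuals(rate):
        return [math.exp(rate * x) - y for x, y in zip(xs, ys)]

    def cost(rs):
        return sum(huber_cost(r) for r in rs)

    r = residuals(a)
    err = cost(r)
    for _ in range(max_iters):                      # criterion 1: iteration cap
        J = [x * math.exp(a * x) for x in xs]       # d r_i / d a
        w = [huber_weight(v) for v in r]
        JtWJ = sum(wi * j * j for wi, j in zip(w, J))
        JtWr = sum(wi * j * v for wi, j, v in zip(w, J, r))
        step = -JtWr / (JtWJ + lam)                 # scalar form of the weighted step
        if abs(step) < min_step:                    # criterion 3: tiny update
            break
        r_new = residuals(a + step)
        err_new = cost(r_new)
        if err_new < err:
            decrease = err - err_new
            a, r, err = a + step, r_new, err_new    # accept the tentative update
            lam *= 0.2
            if decrease < min_decrease:             # criterion 2: tiny improvement
                break
        else:
            lam *= 4.0                              # reject and solve again
    return a

xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(0.5 * x) for x in xs]
estimate = fit_rate(xs, ys)
```

Starting from a = 0, the first tentative steps overshoot and are rejected, lambda grows until a step reduces the cost, and the estimate then converges towards 0.5.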
The whole process described here is applied over a set of
levels in a pyramid representation, in which the
image is processed at different levels of subsampling, from
the native VGA resolution down to 30x40 pixels at the 4th
level, where the resolution is divided by 16 in each direction.
Each level has its own maximum number of iterations and
thresholds. This aids convergence, as Direct Image
alignment is inherently a non-convex optimisation problem.
This coarse to fine processing, sometimes starting as low as
a resolution of 15x20, was shown to be a very good solution
to increase the convergence radius and accuracy in a variety
of scenarios [9].
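The level sizes quoted above follow from halving the resolution at each level. The short sketch below reproduces them, assuming a 640x480 VGA input and listing each level as (width, height); the function name is hypothetical.

```python
def pyramid_resolutions(width, height, levels):
    """Return [(width, height)] for levels 0..levels, halving per level."""
    return [(width >> level, height >> level) for level in range(levels + 1)]

def pixels_per_level(resolutions):
    """Upper bound on the number of candidate pixels at each level."""
    return [w * h for w, h in resolutions]

vga_pyramid = pyramid_resolutions(640, 480, 5)
pixel_counts = pixels_per_level(vga_pyramid)
```

Level 4 comes out as 40 pixels wide by 30 high (the 30x40 quoted in the text, a division by 16 in each direction) and level 5 as 20 by 15, so the coarsest levels carry a very small fraction of the work of the full-resolution level.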
B. Mapping and Global Optimisations
The tracking process is always calculated with respect to an
existing keyframe, except for the first frame, where one has
to create a keyframe as an initial starting point. This is done
by initialising with a random depth and a very large variance.
It is suggested that this will usually converge to a correct
initial estimation after enough frames. This Keyframe plays an
important role in this algorithm, as it is used for Tracking,
depth estimation and filtering, and to represent a section of
the global map.
In this work, for reasons that will be discussed more in
section V, mapping has not been so far targeted for accelera-
tion. However it is worth mentioning that depth is estimated
and subsequently refined by filtering over many per-pixel
small-baseline stereo comparisons [37]. The data structure to
store these is the keyframe itself, which includes the pixel
values of the original image, and for the subset of the pixels
considered reference points, Depth Map and Depth Variance
information as well. When it is replaced by a new keyframe, it
is incorporated in the global pose to pose graph, with an edge
connecting it to the previous keyframe which corresponds as
discussed to a similarity transform in an effort to keep track
of scale drift.
In that graph, each keyframe is represented as a vertex, with
3D similarity transforms as edges, as in equation 2, adding
scale information as a 7th degree of freedom as a means to
track scale drift between frames. Scale drift is generated by the
inherent scale ambiguity of monocular depth estimation. The
algorithm then uses a graph optimisation method to perform
global pose optimisation in the background, using only pose to
pose constraints to increase computation efficiency. The graph
optimisation framework used is g2o [38], used here as a
library.
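The structure of the keyframe graph can be illustrated with a minimal sketch in which vertices are keyframe identifiers and each edge stores a similarity transform as a (scale, rotation, translation) triple. This is only a data-structure illustration: it does not use g2o, performs no optimisation, and the class names and the composition convention (edge 0 to 1 applied before edge 1 to 2) are assumptions.

```python
from dataclasses import dataclass, field

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

@dataclass
class Sim3:
    s: float   # scale, the 7th degree of freedom
    R: list    # 3x3 rotation
    t: list    # translation, 3 entries

    def compose(self, other):
        """Matrix product [[s1 R1, t1], [0, 1]] * [[s2 R2, t2], [0, 1]]."""
        R = [[sum(self.R[i][k] * other.R[k][j] for k in range(3))
              for j in range(3)] for i in range(3)]
        t = [self.s * sum(self.R[i][k] * other.t[k] for k in range(3))
             + self.t[i] for i in range(3)]
        return Sim3(self.s * other.s, R, t)

@dataclass
class PoseGraph:
    edges: dict = field(default_factory=dict)   # (from_id, to_id) -> Sim3

    def add_edge(self, src, dst, constraint):
        self.edges[(src, dst)] = constraint

    def chain(self, ids):
        """Compose the constraints along consecutive keyframe ids."""
        total = Sim3(1.0, IDENTITY, [0.0, 0.0, 0.0])
        for src, dst in zip(ids, ids[1:]):
            total = total.compose(self.edges[(src, dst)])
        return total

graph = PoseGraph()
graph.add_edge(0, 1, Sim3(1.1, IDENTITY, [1.0, 0.0, 0.0]))
graph.add_edge(1, 2, Sim3(0.9, IDENTITY, [1.0, 0.0, 0.0]))
drift = graph.chain([0, 1, 2])
```

Composing the two edges gives an accumulated scale of 1.1 x 0.9 = 0.99, showing how small per-keyframe scale errors build up along a trajectory unless they are represented explicitly in the edges.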
V. PROFILING LSD-SLAM
The source code used as a base is the one provided by
the authors of the original paper for research purposes. It
consists of highly optimised C++ code, with two versions
of hand-optimised tasks with SIMD instructions, one with
SSE instructions, one with NEON vector instructions. The
algorithm was timed and profiled on the desktop using SSE
extensions, and both the NEON and standard implementations
were tested and timed on a dual core ARM Cortex-A9. A
summary of the most important profiling results is included in
Table II.
TABLE II
PROFILING RESULTS - DESKTOP CPU
Task | Share of Total Time
Frame Tracking | 33%
Depth Estimation | 30%
Frame Input and Creation | 15%
Graph Optimisation | 1-3%
Visualisation and Misc. Tasks | 18%
A. Timing Results
Guided by the profiling results, the performance of key
functions was timed on both the desktop and the embedded
CPU. This was done both to complement the profiling results,
and to provide a point of comparison between an embedded
platform and a modern desktop machine for this kind of
software. Table III lists the mean execution times
for the x86 and the ARM processor respectively.
TABLE III
TIMING RESULTS (MEAN EXECUTION TIME)
Task | Desktop CPU | ARM
Tracking | 12 ms | 440 ms
Global Optimisation | 20 ms | 128 ms
Mapping | 13 ms | 576 ms
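The gap between the two processors is easier to see as ratios. The short calculation below uses only the mean times from Table III; the variable names are illustrative.

```python
# Mean execution times in milliseconds, taken from Table III.
timings_ms = {
    "Tracking": (12, 440),
    "Global Optimisation": (20, 128),
    "Mapping": (13, 576),
}

# Slowdown of the ARM processor relative to the desktop CPU, per task.
slowdown = {task: arm / desktop for task, (desktop, arm) in timings_ms.items()}

# Upper bound on the tracking framerate of the ARM baseline, ignoring all other tasks.
arm_tracking_fps = 1000.0 / timings_ms["Tracking"][1]
```

Tracking and mapping are roughly 37 and 44 times slower on the ARM processor, while global optimisation is only about 6.4 times slower, and tracking alone limits the software baseline to about 2.3 frames per second.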
Results indicate that the ARM processor’s performance is
more significantly impacted by the large amount of memory
transactions in this algorithm. Differences in memory caching
size and strategies between the two processors meant that
communication using large buffers, between different func-
tions in the tracking task, was much more costly for the
ARM processor. On the high-end desktop CPU, there are
three levels of caching, with an L2 Cache of 1024KB and
an L3 Cache of 8MBytes, easily enough for the results of
one tracking iteration, and the passing of buffers between
consecutive function calls. This processor can also afford
more advanced prefetching and prediction strategies that can
better cope with random memory accesses with weak spatial
locality.
However, the ARM processor is a much less complicated
core, and with only 32KB for an L1 Cache, and an L2 Cache
of 512KB, the ARM processor is much less able to cope with
the data sizes and buffers used in an implementation designed
for a desktop CPU, as shown by the timing data for each
function. These results establish the importance of accelerating
the tracking task in order to enhance the performance of
this algorithm on the embedded board. They also show that
SLAM, as is the case with many computer vision algorithms,
is characterised by large data movements, and that memory and
cache strategies should be a high priority in any platform
designed for such algorithms.
VI. ACCELERATOR ARCHITECTURE
In this work, an FPGA SoC system was targeted for the
acceleration of LSD-SLAM. This platform offers a very power
efficient embedded CPU tightly integrated with programmable
logic fabric on the same chip. The FPGA fabric can be used for
dedicated accelerators for computationally demanding tasks,
while the embedded CPU efficiently handles control flow and
other computation in parallel, offering the possibility to achieve
much better performance to watt ratios than embedded
microprocessors alone, in a very low power and weight package.
The SLAM system was implemented on a Zynq-7000 FPGA-SoC,
combining a dual-core ARM CPU with a Xilinx FPGA
on-chip.
Two accelerators were implemented to accelerate the track-
ing task of LSD-SLAM. Firstly, tracking in LSD-SLAM was
shown by profiling and timing to consume a large percentage
of the total computation time. Most importantly, fast and
accurate tracking is crucial in SLAM, and must be established
before accurate mapping can be introduced. Tracking has to
achieve a high-enough framerate for mobile robots, in order
to keep up with the camera movement [39]. Real-time can
be considered to be 30 fps, but faster moving platforms,
such as fixed-wing aircraft, could necessitate double or four
times that number. The accelerators were implemented on the
programmable logic fabric and connected through an AXI-
BUS interface both to the ARM CPU and a dedicated, high-
performance memory port, which issues transactions directly
to the off-chip RAM.
Fig. 1. System Architecture
Fig. 2. Residual and Weight Calculation Unit
A. Control
The ARM CPU takes care of the accelerator control, as well
as copying the input data from software buffers to a dedicated
DDR region. There are in total two accelerators, the Residual
and Weight Calculation unit and a linear system update unit,
calculating the Jacobian and Hessian and generating the linear
system equations. An overview of this system can be seen
in figure 1. The data is copied because
the necessary data is scattered across virtual pages by the
software running on a Linux OS, making it difficult to process
directly in hardware. For each level we copy the tracked frame
image pixels and the reference point information from the
keyframe. After copying these inputs, some constants are set,
along with the hardware pointers for the input data through a
slave interface where the CPU acts as the master. Finally the
accelerator is started, its status is checked and the necessary
outputs are read back through the same interface. The ARM
then handles the control flow for the rest of the function as
well as the linear system solving.
The second accelerator has as its inputs the current values in
the dedicated DDR region, produced by the first accelerator,
hence no further copying is necessary. The communication
between the two goes through the DDR memory because the
residual outputs are not always processed, and the decision is
only made after the first accelerator finishes. The amount of
output data produced is several MBytes and therefore cannot
be cached entirely on the FPGA, making DDR the only option.
B. Residual and Weight Calculation Unit
In the Residual and Weight Calculation Unit there are,
from a high-level viewpoint, three blocks: one input DMA
block, one output DMA block and the main calculation unit,
which performs all of the computation required, as shown in
figure 2. Before the computation begins the entire frame being
tracked is prefetched in a local cache. This was found to be
significantly faster than accessing the required gradients on
the fly from random addresses in the DDR as happened in the
software implementation by Engel et al.
The best approach for a hardware accelerator is to instead
redesign the pipeline to perform pixel prefetching, cache the
pixels locally, and calculate the required gradients on the
fly. Testing showed the pipeline’s throughput was increased
by more than a factor of two in comparison to emulating
the software version, because of the penalty of DDR request
latency. The total data copied from the CPU was reduced as
well, since only one word per pixel had to be copied instead
of four to the dedicated DDR region.
Fig. 3. Residual Calculation Pipeline - Pixel Re-projection
The DDR latency is also why the inputs and outputs are
buffered rather than written sequentially for every iteration.
A good design point was found to be to perform one larger
burst transfer of vectors for 50 reference points (a total of 250
words), perform the computation on them, buffer the results in
a set of seven local buffers, and then perform a succession of
seven burst writes to empty those buffers and write them back
to the DDR memory. Batch sizes larger than 50 were tested
but gave minimal improvement. The number of points is of
course not always divisible by 50, so the remaining points are
processed in a sequential version of the pipeline. It is
approximately two times slower, but in the worst case it runs
10% of the time, and that is at the highest pyramid level with
only around 600 points. Its effect on performance is
negligible, especially for larger problem sizes of
more than 20,000 points. It also has no significant cost in terms
of hardware resources, since most hardware units are re-used
from the block processing pipelines, and there is only a small
overhead for more complex control logic.
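The split between block processing and the sequential remainder can be sketched as follows. This is a minimal illustration rather than the accelerator's control logic; the function names and the example point counts are assumptions chosen for the sketch.

```python
def split_into_bursts(num_points, burst_size=50):
    """Split a point count into full bursts and a sequential remainder.

    Full bursts go through the block-processing pipeline; the leftover
    points (fewer than burst_size) go through the slower sequential path.
    """
    full_bursts, remainder = divmod(num_points, burst_size)
    return full_bursts, remainder


def sequential_fraction(num_points, burst_size=50):
    """Fraction of points handled by the sequential pipeline."""
    if num_points == 0:
        return 0.0
    _, remainder = split_into_bursts(num_points, burst_size)
    return remainder / num_points


# Hypothetical point counts: a small problem and a large one.
small = split_into_bursts(620)      # 12 full bursts, 20 leftover points
large = split_into_bursts(20030)    # 400 full bursts, 30 leftover points
```

Because the remainder is always smaller than one burst, its share of the work shrinks as the number of points grows, which is why the slower path matters little for large problem sizes.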
1) Residual Calculation: The Residual and Weight Calculation
block has essentially three functions: residual calculation,
weight calculation, and updating the keyframe pixel score
at the top level of tracking. During the initialisation phase,
constants are fetched through Direct Memory Access or are
pre-set from the CPU. After the first batch of 50 reference
points is fetched in the input buffer, the first task is to
re-project the reference point from the frame of reference
of the Keyframe, to the one provided by our current pose
estimate. This involves a matrix to vector multiplication and
a vector addition. After this reprojection we arrive at the new
coordinates, x’, y’ and the new depth (z’). The x’ and y’
are divided by the depth according to the pinhole camera
model, to give the coordinates on the camera projection plane.
Subsequently, these coordinates are ‘warped’ to change the
coordinates from the pinhole camera plane to the actual camera
plane, using the values in the Intrinsic Camera Matrix, which
depends on the specific camera and the lens distortion it
introduces. This process is shown in figure 3. The result is
a pair of coordinates in the actual captured image frame,
(u, v).
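The re-projection steps can be written out as a short sketch. It assumes a plain pinhole model in which the intrinsics reduce to focal lengths fx, fy and a principal point cx, cy; these parameter names, the function name and the sample values are illustrative assumptions, not taken from the implementation.

```python
def reproject(point, rotation, translation, fx, fy, cx, cy):
    """Re-project a keyframe point into the current frame's image.

    point:       (x, y, z) in the keyframe's frame of reference
    rotation:    3x3 rotation matrix as a list of rows (current pose estimate)
    translation: (tx, ty, tz) of the current pose estimate
    fx, fy, cx, cy: focal lengths and principal point (assumed intrinsics)
    Returns ((x', y', z'), (u, v)).
    """
    # Matrix-vector multiplication followed by a vector addition.
    xp, yp, zp = (
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    )
    # Pinhole model: divide by depth to reach the projection plane.
    xn, yn = xp / zp, yp / zp
    # Apply the intrinsics to reach pixel coordinates.
    u = fx * xn + cx
    v = fy * yn + cy
    return (xp, yp, zp), (u, v)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
warped, pixel = reproject((0.5, -0.25, 2.0), identity, (0.0, 0.0, 0.0),
                          fx=250.0, fy=250.0, cx=160.0, cy=120.0)
```

With an identity pose the point is unchanged, and only the division by depth and the intrinsics determine where it lands in the image.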
There are a total of 9 multiplications and 6 additions,
pipelined into three hardware float multiply units and two
adders. This is possible since the interval is lower-bounded
at 6 hardware cycles. This limit is partly created by the random
memory accesses in the interpolated element calculation, which
could be relaxed at the cost of resources, but most importantly
it is created by the data dependencies in the
algorithm itself. A float addition on our FPGA takes at least
5 cycles. Since many computation steps in the
iteration loop depend on previous results, the effective limit
of our iteration interval is exactly those 5 hardware cycles.
With this interval in mind, time-sharing a smaller number of
units gives a more efficient design.
After the interpolated element block returns the three results,
dx, dy and intensity, the residual calculation unit then proceeds
to complete the computation. The first step is calculating the
actual residual, which is simply the difference between the
interpolated intensity at the current frame, and the intensity of
the keyframe at the point we re-projected. Then we proceed
to accumulate five sums: the sum and the sum of squares of
the interpolated intensity value ($\sum c$ and $\sum c^2$), the sum and
the sum of squares of the keyframe pixel intensity ($\sum c_2$ and
$\sum c_2^2$), and the sum of a local weight factor, which is used to
weight the previous four sums:
\[ \text{weight} = \begin{cases} \dfrac{5}{|\text{residual}|} & \text{if } |\text{residual}| < 5 \\ 1 & \text{otherwise.} \end{cases} \]
In parallel with the computation, six results are buffered
to be written back. The first three are the re-projected x’, y’
and z’, for the second accelerator to use. The other three are
the residual for this tracked point and the interpolated
gradients dx and dy. All of these results are buffered in a set
of 6 Block RAMs. At this point, a pixel quality heuristic is
calculated as well. This is worth mentioning because this unit
contains another DMA port, which is used at the lower pyramid
level to write to a specified address in a different region of
the DDR memory. This exists to support specific
functionality of the accelerated implementation.
It was found to have a small impact on performance if
enabled, but other factors dominate the communication
bottleneck at this point, and the writes could not be buffered
since they are a series of completely random accesses to a very
large array. At this stage of the pipeline there are an additional
4 multiplier units to perform these calculations. One would
have been enough given our initiation interval; the rest are
present because the same pipeline also performs the weight
calculation. These four multipliers, as well as one of the three
mentioned before, are time-shared with the multiplications
in the weight calculation block. There are also two
float division units, which are likewise shared with the weight
calculation block. Some of this connectivity is not shown in
the figures to preserve clarity of presentation.
2) Interpolated Element Calculation: The coordinates
(u, v) are sent to the element interpolation unit, shown in
figure 4. This unit performs four sets of reads from the pixel
cache, which is stored in a couple of line buffers, to fetch
twelve pixel intensity values. These twelve pixels are used to
generate gradient information for a square region of four
neighbouring pixels, the top left of which is the one with
coordinates (floor(u), floor(v)).
The gradients for a pixel with coordinates (x, y) are
calculated as:
\[ dx = \tfrac{1}{2}\,\bigl(\mathrm{color}(x+1, y) - \mathrm{color}(x-1, y)\bigr) \]
\[ dy = \tfrac{1}{2}\,\bigl(\mathrm{color}(x, y+1) - \mathrm{color}(x, y-1)\bigr) \]
This function interpolates three quantities between these
four neighbouring pixels: the horizontal gradient, the vertical
gradient and the pixel colour. The aim is to give sub-pixel
accuracy to the tracking process. The interpolation is driven
by the input coordinates u and v. These floats
are floored to calculate the memory offset,
and their fractional parts are then used to weight the
interpolation between the 4 pixels.
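A software sketch of this unit is given below. It assumes that the weighting between the four pixels is bilinear and that the image is stored as a list of rows; both, along with the function names, are assumptions made for illustration rather than details of the hardware.

```python
import math


def gradients(image, x, y):
    """Central-difference gradients at integer pixel (x, y).

    image is a list of rows, indexed as image[y][x].
    """
    dx = 0.5 * (image[y][x + 1] - image[y][x - 1])
    dy = 0.5 * (image[y + 1][x] - image[y - 1][x])
    return dx, dy


def interpolate_element(image, u, v):
    """Interpolate (dx, dy, intensity) at sub-pixel position (u, v).

    Assumes bilinear weights and that (u, v) lies far enough from the
    border for all twelve neighbouring pixels to exist.
    """
    x0, y0 = math.floor(u), math.floor(v)   # top-left pixel of the 2x2 block
    fu, fv = u - x0, v - y0                 # fractional parts used as weights
    result = [0.0, 0.0, 0.0]
    for oy, wy in ((0, 1.0 - fv), (1, fv)):
        for ox, wx in ((0, 1.0 - fu), (1, fu)):
            x, y = x0 + ox, y0 + oy
            dx, dy = gradients(image, x, y)
            for k, value in enumerate((dx, dy, image[y][x])):
                result[k] += wx * wy * value
    return tuple(result)


# Synthetic ramp image: intensity = 2*x + 3*y, so dx = 2 and dy = 3 everywhere.
ramp = [[2.0 * x + 3.0 * y for x in range(6)] for y in range(6)]
dx, dy, intensity = interpolate_element(ramp, 2.5, 1.25)
```

The four pixels of the 2x2 block plus the eight neighbours needed for their central differences account for the twelve intensity values read from the cache.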
The operations performed are 6 float subtractions, 12 float
multiplications, and 11 float additions in total. The overall
initiation interval bound of 5 allows us to reuse multipliers
and adders here as well, time-shared. The architecture of this
unit, a bit simplified for clarity, is shown in figure 4. The result
of these calculations is a set of three elements, dx, dy, and an
interpolated intensity C.
Fig. 4. Element Interpolation
3) Weight Calculation: This was originally a separate func-
tion in the code. Since most of its inputs were the buffers
generated by the previous function, it was decided to include
it in the residual calculation unit and perform the computation
simultaneously, on the FPGA, before everything is written to
the DRAM. In total it performs 18 float multiplications, four
divisions, two additions, two subtractions and one
accumulation. Most of its units are time-shared with the residual
calculation block. There is also a float square root unit, used
twice, and an absolute value unit. The result of this function is
a weight factor corresponding to the current residual, which is
buffered in another Block RAM to be read eventually by the
DMA Output Block.
C. Jacobian Update Unit
This second accelerator calculates the derivative values for
every single reference pixel, and adds them to calculate the
Jacobian, as well as the Hessian approximation $H \approx J^T J$.
A weight factor is also included in the calculation.
It essentially reads back the last version of the buffers
produced by the other accelerator when the linear system
solution is actually accepted as one that reduced the error.
It reads the x’, y’, z’, dx, dy, weight and residual values for
each reference pixel that was tracked, and then calculates the
6 values for a Jacobian that is reduced to a row vector, since
our function has a scalar output: $f : \mathbb{R}^6 \to \mathbb{R}$. First, the 6
elements of the Jacobian are calculated as follows:
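\[ J(1) = \frac{dx}{z}, \qquad J(2) = \frac{dy}{z} \]
\[ J(3) = -\frac{x}{z^2}\,dx - \frac{y}{z^2}\,dy \]
\[ J(4) = -\frac{xy}{z^2}\,dx - \frac{z^2 + y^2}{z^2}\,dy \]
\[ J(5) = \frac{z^2 + x^2}{z^2}\,dx + \frac{xy}{z^2}\,dy \]
\[ J(6) = -\frac{y}{z}\,dx + \frac{x}{z}\,dy \]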
Then the Hessian is approximated as the vector by vector
multiplication, with an added weight factor, and the linear
system is generated by accumulating these results for each
reference point i, where $J_i$ denotes the Jacobian row vector
of point i:
for each i:
1) $A = A + J_i^T J_i \, w$
2) $b = b + J_i^T \, \mathrm{residual}(i) \, w$
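The Jacobian elements and the accumulation can be sketched in software as follows. The sketch treats the Jacobian of each point as a row vector, so that the 6x6 update is the outer product of its elements, in line with the Hessian approximation stated earlier. The function names, the input tuple layout and the sample values are assumptions for illustration.

```python
def jacobian_row(x, y, z, dx, dy):
    """Six Jacobian elements for one re-projected point (x, y, z)."""
    inv_z = 1.0 / z
    inv_z2 = inv_z * inv_z
    return [
        dx * inv_z,
        dy * inv_z,
        -x * inv_z2 * dx - y * inv_z2 * dy,
        -x * y * inv_z2 * dx - (z * z + y * y) * inv_z2 * dy,
        (z * z + x * x) * inv_z2 * dx + x * y * inv_z2 * dy,
        -y * inv_z * dx + x * inv_z * dy,
    ]


def accumulate(points):
    """Accumulate A (6x6) and b (6) over an iterable of tracked points.

    Each point is a tuple (x, y, z, dx, dy, weight, residual).
    """
    A = [[0.0] * 6 for _ in range(6)]
    b = [0.0] * 6
    for x, y, z, dx, dy, w, r in points:
        J = jacobian_row(x, y, z, dx, dy)
        for m in range(6):
            b[m] += J[m] * r * w
            for n in range(6):
                A[m][n] += J[m] * J[n] * w
    return A, b


# Made-up sample values, for illustration only.
A, b = accumulate([(0.0, 0.0, 2.0, 4.0, 6.0, 0.5, 3.0)])
```

Each point contributes a symmetric rank-one update to A, so a software implementation could compute only the upper triangle; the sketch fills the whole matrix for clarity.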
This function generates six elements of a vector, with
values independent from each other. In total this accelerator
contains twelve floating-point multipliers, one float divider,
three add/sub units, and three dedicated float adders, to achieve
the maximum throughput. The initiation interval was set at
seven hardware cycles, limited partly by the accumulation at
the end for the Jacobian vector, which introduces an inter-cycle
dependency, and more importantly by the necessary
DMA access to read the set of seven inputs per iteration.
No matter how fast the hardware can process the incoming
elements, a DMA port can fetch at most a single word
per hardware cycle. It therefore imposes a hard limit on the
achievable acceleration.
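The bound imposed by the DMA port amounts to a short calculation, sketched below. The helper names are illustrative, and the 100 MHz clock is the FPGA frequency reported in the evaluation section.

```python
import math


def min_initiation_interval(words_per_iteration, words_per_cycle=1):
    """Lower bound on the initiation interval imposed by a single DMA port."""
    return math.ceil(words_per_iteration / words_per_cycle)


def points_per_second(clock_hz, initiation_interval):
    """Peak pipeline throughput for a given clock and initiation interval."""
    return clock_hz / initiation_interval


ii = min_initiation_interval(7)            # seven input words per point
peak = points_per_second(100e6, ii)        # 100 MHz FPGA clock
```

Under these assumptions the unit cannot start a new point more often than once every seven cycles, whatever arithmetic resources are added.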
VII. EVALUATION
The SLAM System was implemented and tested on a
Zedboard, with a Zynq-7020 SoC. This includes a dual-core
Arm Cortex-A9 processor running at 667 MHz with a Xilinx
FPGA on the same chip and an off-chip memory, a DDR3
512MB RAM. The programmable logic used has a total of
85K Logic Cells, 4.9 Mb of BRAM and 220 DSPs available.
Vivado HLS was used to implement the two accelerators which
were then imported in Vivado to interface and synthesise the
entire system.
Running on the board is an Ubuntu-based Linux OS running
on the ARM Cortex-A9. The first step was to port the LSD-
SLAM algorithm to that system. Next, it was run as pure
software, to get timing information about the frame rate and
all the different portions of the tracking function separately.
To interface the software with the hardware accelerator a user-
space driver was written, at which point both the software and
the hardware version were run in parallel to compare their
behaviour for identical inputs. After it was ascertained that
for many thousands of executions the results were the same in
both the hardware and the software version, and there was no
loss in quality or precision, the hardware based version was
run on its own and timed.
The FPGA was run at a clock frequency of 100 MHz. At
that frequency the system achieved on average a total frame
processing time, including the cost of memory copy and some
bookkeeping on the software side, of around 218 ms per frame,
which corresponds to a framerate of more than 4.5 frames per
second for our main test scene, tracking at a resolution of
320x240. This performance was achieved with an estimated
dynamic power consumption of 0.71 W for the FPGA, less
than half the estimated consumption of the CPU, which was
1.53 W at 50% load, for a total of approximately 2.25 W
for both processors, including the static power consumption
of the system.
Results did vary slightly between test sets, because the
actual run-time of the algorithm depends partially on the
complexity of the scene tracked. However the performance
ratio between hardware and software remained at
approximately 2×. The average performance results are given in
figure 5, where the lower bars represent a particularly complex
sequence, with a higher ratio of feature points to total pixels.
[Figure 5: bar chart, Frames / Sec. Main sequence: ARM A9 2.27, ARM A9 with NEON 2.6, Hardware Accelerated 4.55. Complex sequence: ARM A9 1.9, ARM A9 with NEON 2.2, Hardware Accelerated 3.84.]
Fig. 5. Total Acceleration - Average framerate for two different sequences
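The reported figures can be cross-checked with a few lines of arithmetic. The sketch below takes the plain ARM A9 bars of figure 5 as the software baseline, which is an assumption about how the figure should be read; the function names are illustrative.

```python
def frames_per_second(frame_time_ms):
    """Convert a per-frame processing time in milliseconds to a framerate."""
    return 1000.0 / frame_time_ms


def speedup(accelerated_fps, baseline_fps):
    """Ratio of the accelerated framerate to the baseline framerate."""
    return accelerated_fps / baseline_fps


fps = frames_per_second(218.0)          # 218 ms per frame on the main scene
main_speedup = speedup(4.55, 2.27)      # main sequence, values from figure 5
complex_speedup = speedup(3.84, 1.9)    # complex sequence, values from figure 5
```

Both ratios come out close to two, in line with the acceleration quoted in the text.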
This is a promising result, and the design is constrained only
by memory bandwidth, not by computation performance.
For the FPGA that was used, the resources consumed were
around 55% for the DSPs and 45% for the FFs, as seen in
table IV, with a design not aggressively optimised for area.
However, even the current design cannot run at its
maximum throughput, which would be an order of magnitude
higher than the performance actually obtained. The measurements
for data copying are shown for each level in figure 6.
Interfacing custom hardware with software that uses large
dynamically allocated buffers incurs a penalty due to the cost
of transferring a large amount of data back and forth. These
buffers are also stored in virtual pages, scattered in different
places in the physical system memory. The copy of input data
to a dedicated area of the DDR memory takes a significant
percentage of the total execution time, up to 25%, depending
on the execution level. As it stands, we achieve more than
4.5 fps even with the data copy penalty in software. If an
algorithm redesign allowed a full streaming communication
paradigm to be used, an order of magnitude of improvement
could be achieved with the proposed architecture.
[Figure 6: bar chart of memory transfer cost per pyramid level (Level 4 to Level 1); vertical axis in μSeconds, 0 to 40000.]
Fig. 6. Memory Transfer Cost in microSeconds
[Figure 7: bar chart per pyramid level (Level 4 to Level 1); vertical axis in μSeconds, 0 to 50000. Series: Residual And Weights FPGA, Linear System Update FPGA, Residual And Weights Software, Linear System Update Software.]
Fig. 7. Function Acceleration
TABLE IV
FPGA RESOURCES
        Estimated                               Actual
        LS. Update (J)    Res. & Weight Unit
DSPs    48                78                    119
BRAM    0                 73                    36.5
FFs     14630             32328                 47570
LUTs    24902             41383                 40298
The per-function performance of the hardware accelerator
scales linearly with the number of points, with a small
extra penalty on the final level due to the additional random
reads and writes necessary to support the original software
implementation. This is shown in figure 7. It is important to
note that the residual and weight calculation unit is called
approximately three to four times as often as the linear
system update, while the memory copy happens only once
per level. In table IV the resource usage results are shown
as reported by Vivado post-implementation. The accelerator
was not aggressively optimised for resource use at this point,
as there was limited gain to be had from that.
The pipeline for the residual and weight calculation has
an interval of 6 cycles, which translates into a throughput
of approximately 16.6 million points per second. However, to
achieve that throughput the input data rate would
have to be about 300 MB/s and, crucially, the output rate
would have to be more than 420 MB/s, spread across 7
memory locations. After several hardware architectures were
designed and tested on the development board, it turned out
that a good design point is to perform one larger burst transfer
for 50 reference points (a total of 250 words or 1000 bytes),
perform the computation on them, and cache the results locally
in buffers of 50 words each, allowing the pipeline to function
as fast as possible, and then write the results back to the DDR
in a series of 7 successive burst writes. The accelerator
updating the linear system has a similar bottleneck from
memory performance and latency. The technical reference
manual indicates that ideally the high performance DMA port
on the SoC can sustain bursts of 255 words, transferring one
word per cycle at a maximum frequency of 150 MHz, or 600 MB/s
sustained, so supplying a steady stream of data
would mean approaching the theoretical limits of the port.
This demonstrates again the high communication demands
characteristic of SLAM algorithms.
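A back-of-the-envelope version of these bandwidth figures is sketched below. It assumes 4-byte words (250 words are stated to be 1000 bytes), five input words and seven output words per point, and a 100 MHz clock; the function name is illustrative. The estimate comes out somewhat above the rounded rates quoted in the text, so it should be read as an order-of-magnitude check only.

```python
BYTES_PER_WORD = 4  # 250 words are stated to be 1000 bytes


def bandwidth_mb_per_s(clock_hz, interval_cycles, words_per_point):
    """Data rate in MB/s needed to keep a pipeline fed at its peak rate."""
    points_per_s = clock_hz / interval_cycles
    return points_per_s * words_per_point * BYTES_PER_WORD / 1e6


input_rate = bandwidth_mb_per_s(100e6, 6, 5)    # 5 input words per point
output_rate = bandwidth_mb_per_s(100e6, 6, 7)   # 7 output words per point
port_peak = 150e6 * BYTES_PER_WORD / 1e6        # one word per cycle at 150 MHz
```

Under these assumptions the output stream alone would take up most of the 600 MB/s that a single port can sustain in the ideal case.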
VIII. CONCLUSION
We have presented an implementation of LSD-SLAM, a
state-of-the-art semi-dense SLAM algorithm, on an FPGA
System on Chip, achieving a 2× acceleration compared to a
pure software version running on an embedded CPU. The
implementation is capable of running LSD-SLAM at more than
4 fps at a resolution of 320x240, limited by memory latency
and bandwidth, with an estimated power consumption of less
than 2.5 Watts for the whole chip. This work has demonstrated
that an FPGA SoC is capable of bringing advanced and more
dense SLAM algorithms to embedded low-power devices, with
increased performance and a greater performance-per-watt ratio
than other solutions.
Future work with FPGA SoCs should begin with hard-
ware/software co-design because of the nature of these al-
gorithms. Semi-dense and Dense SLAM are characterised by
high bandwidth requirements. In addition, most implementa-
tions are optimised for a CPU and feature complex control
flow, a large amount of random memory accesses and a series
of sequential functions sharing the work instead of one specific
bottleneck. A hardware design should place a lot of importance
on its memory architecture, including caching techniques and
data movement, to ensure scalability, and compatibility with
dense or higher-resolution algorithms. A redesign of software
implementations to allow a streaming interface with the FPGA,
with a large guaranteed bandwidth and the CPU free to
perform higher-level optimisation and control is ideal.
REFERENCES
[1] M. Saska, J. Chudoba, L. Precil, J. Thomas, G. Loianno, A. Tresnak,
V. Vonasek, and V. Kumar, “Autonomous deployment of swarms of
micro-aerial vehicles in cooperative surveillance,” in Unmanned Aircraft
Systems (ICUAS), 2014 International Conference on. IEEE, 2014, pp.
584–595.
[2] S. Waharte and N. Trigoni, “Supporting search and rescue operations
with uavs,” in Emerging Security Technologies (EST), 2010 International
Conference on. IEEE, 2010, pp. 142–147.
[3] C. Zhang and J. M. Kovacs, “The application of small unmanned
aerial systems for precision agriculture: a review,” Precision agriculture,
vol. 13, no. 6, pp. 693–712, 2012.
[4] S. Lee, S. Lee, and J. J. Yoon, “Illumination-invariant localization
based on upward looking scenes for low-cost indoor robots,” Advanced
Robotics, vol. 26, no. 13, pp. 1443–1469, 2012.
[5] B. Vincke, A. Elouardi, and A. Lambert, “Real time simultaneous
localization and mapping: towards low-cost multiprocessor embedded
systems,” EURASIP Journal on Embedded Systems, vol. 2012, no. 1,
pp. 1–14, 2012.
[6] T. Schöps, J. Engel, and D. Cremers, “Semi-dense visual odometry for
ar on a smartphone,” in Mixed and Augmented Reality (ISMAR), 2014
IEEE International Symposium on. IEEE, 2014, pp. 145–150.
[7] J. Sturm, E. Bylow, C. Kerl, F. Kahl, and D. Cremers, “Dense tracking
and mapping with a quadrocopter,” in UAV-g 2013, 2013.
[8] M. Blösch, S. Weiss, D. Scaramuzza, and R. Siegwart, “Vision based
mav navigation in unknown and unstructured environments,” in Robotics
and automation (ICRA), 2010 IEEE international conference on. IEEE,
2010, pp. 21–28.
[9] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale
direct monocular SLAM,” in European Conference on Computer Vision
(ECCV), September 2014.
[10] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “Dtam: Dense
tracking and mapping in real-time,” in Computer Vision (ICCV), 2011
IEEE International Conference on. IEEE, 2011, pp. 2320–2327.
[11] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, and
J. McDonald, “Real-time large-scale dense rgb-d slam with volumetric
fusion,” The International Journal of Robotics Research, vol. 34, no.
4-5, pp. 598–626, 2015.
[12] G. Klein and D. Murray, “Parallel tracking and mapping for small AR
workspaces,” in Proc. Sixth IEEE and ACM International Symposium
on Mixed and Augmented Reality (ISMAR’07), Nara, Japan, November
2007.
[13] J. Lim, J.-M. Frahm, and M. Pollefeys, “Online environment mapping,”
in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE
Conference on. IEEE, 2011, pp. 3489–3496.
[14] H. Strasdat, J. Montiel, and A. J. Davison, “Scale drift-aware large scale
monocular slam.” in Robotics: Science and Systems, vol. 2, 2010, p. 5.
[15] S. Grzonka, G. Grisetti, and W. Burgard, “Towards a navigation system
for autonomous indoor flying,” in Robotics and Automation, 2009.
ICRA’09. IEEE International Conference on. IEEE, 2009, pp. 2878–
2883.
[16] A. S. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox,
and N. Roy, “Visual odometry and mapping for autonomous flight using
an rgb-d camera,” in International Symposium on Robotics Research
(ISRR), 2011, pp. 1–16.
[17] M. Achtelik, A. Bachrach, R. He, S. Prentice, and N. Roy, “Stereo vision
and laser odometry for autonomous helicopters in gps-denied indoor
environments,” in SPIE Defense, Security, and Sensing. International
Society for Optics and Photonics, 2009, pp. 733 219–733 219.
[18] M. Achtelik, M. Achtelik, S. Weiss, and R. Siegwart, “Onboard imu
and monocular vision based control for mavs in unknown in-and
outdoor environments,” in Robotics and automation (ICRA), 2011 IEEE
international conference on. IEEE, 2011, pp. 3056–3063.
[19] C. Forster, M. Pizzoli, and D. Scaramuzza, “Svo: Fast semi-direct
monocular visual odometry,” in Robotics and Automation (ICRA), 2014
IEEE International Conference on. IEEE, 2014, pp. 15–22.
[20] M. Faessler, F. Fontana, C. Forster, E. Mueggler, M. Pizzoli, and
D. Scaramuzza, “Autonomous, vision-based flight and live dense 3d
mapping with a quadrotor micro aerial vehicle,” Journal of Field
Robotics, 2015.
[21] T. Schöps, J. Engel, and D. Cremers, “Semi-dense visual odometry for
ar on a smartphone,” in Mixed and Augmented Reality (ISMAR), 2014
IEEE International Symposium on. IEEE, 2014, pp. 145–150.
[22] B. Vincke, A. Elouardi, A. Lambert, and A. Merigot, “Efficient imple-
mentation of ekf-slam on a multi-core embedded system,” in IECON
2012-38th Annual Conference on IEEE Industrial Electronics Society.
IEEE, 2012, pp. 3049–3054.
[23] A. J. Barry and R. Tedrake, “Pushbroom stereo for high-speed navigation
in cluttered environments,” in Robotics and Automation (ICRA), 2015
IEEE International Conference on. IEEE, 2015, pp. 3046–3052.
[24] A. J. Barry, H. Oleynikova, D. Honegger, M. Pollefeys, and R. Tedrake,
“Fast onboard stereo vision for uavs,” in Vision-based Control and Nav-
igation of Small Lightweight UAV Workshop, International Conference
On Intelligent Robots and Systems (IROS), 2015.
[25] D. Honegger, H. Oleynikova, and M. Pollefeys, “Real-time and low
latency embedded computer vision hardware based on a combination of
fpga and mobile cpu,” in Intelligent Robots and Systems (IROS 2014),
2014 IEEE/RSJ International Conference on. IEEE, 2014, pp. 4930–
4935.
[26] S. Weiss, M. W. Achtelik, S. Lynen, M. Chli, and R. Siegwart, “Real-
time onboard visual-inertial state estimation and self-calibration of mavs
in unknown environments,” in Robotics and Automation (ICRA), 2012
IEEE International Conference on. IEEE, 2012, pp. 957–964.
[27] G. Klein and D. Murray, “Parallel tracking and mapping on a camera
phone,” in Mixed and Augmented Reality, 2009. ISMAR 2009. 8th IEEE
International Symposium on. IEEE, 2009, pp. 83–86.
[28] J. Engel, J. Sturm, and D. Cremers, “Camera-based navigation of a
low-cost quadrocopter,” in Intelligent Robots and Systems (IROS), 2012
IEEE/RSJ International Conference on. IEEE, 2012, pp. 2815–2821.
[29] ——, “Scale-aware navigation of a low-cost quadrocopter with a monoc-
ular camera,” Robotics and Autonomous Systems, vol. 62, no. 11, pp.
1646–1656, 2014.
[30] D. Bouris, A. Nikitakis, and I. Papaefstathiou, “Fast and efficient
fpga-based feature detection employing the surf algorithm,” in Field-
Programmable Custom Computing Machines (FCCM), 2010 18th IEEE
Annual International Symposium on. IEEE, 2010, pp. 3–10.
[31] L. Yao, H. Feng, Y. Zhu, Z. Jiang, D. Zhao, and W. Feng, “An architec-
ture of optimised sift feature detection for an fpga implementation of an
image matcher,” in Field-Programmable Technology, 2009. FPT 2009.
International Conference on. IEEE, 2009, pp. 30–37.
[32] P. Greisen, S. Heinzle, M. Gross, and A. P. Burg, “An fpga-based
processing pipeline for high-definition stereo video,” EURASIP Journal
on Image and Video Processing, vol. 2011, no. 1, pp. 1–13, 2011.
[33] C. Banz, S. Hesselbarth, H. Flatt, H. Blume, and P. Pirsch, “Real-time
stereo vision system using semi-global matching disparity estimation:
Architecture and fpga-implementation,” in Embedded Computer Systems
(SAMOS), 2010 International Conference on. IEEE, 2010, pp. 93–101.
[34] T. M. Howard, A. Morfopoulos, J. Morrison, Y. Kuwata, C. Villalpando,
L. Matthies, and M. McHenry, “Enabling continuous planetary rover
navigation through fpga stereo and visual odometry,” in Aerospace
Conference, 2012 IEEE. IEEE, 2012, pp. 1–9.
[35] R. Mur-Artal, J. Montiel, and J. D. Tardos, “Orb-slam: a versatile
and accurate monocular slam system,” Robotics, IEEE Transactions on,
vol. 31, no. 5, pp. 1147–1163, 2015.
[36] J. J. Moré, “The levenberg-marquardt algorithm: implementation and
theory,” in Numerical analysis. Springer, 1978, pp. 105–116.
[37] J. Engel, J. Sturm, and D. Cremers, “Semi-dense visual odometry
for a monocular camera,” in Proceedings of the IEEE International
Conference on Computer Vision, 2013, pp. 1449–1456.
[38] G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “g2o: A general
framework for graph optimization,” in IEEE International Conference
on Robotics and Automation, 2011.
[39] A. Handa, R. A. Newcombe, A. Angeli, and A. J. Davison, “Real-time
camera tracking: When is high frame-rate best?” in Computer Vision–
ECCV 2012. Springer, 2012, pp. 222–235.
