9 research outputs found
NeRF-VINS: A Real-time Neural Radiance Field Map-based Visual-Inertial Navigation System
Achieving accurate, efficient, and consistent localization within an a priori
environment map remains a fundamental challenge in robotics and computer
vision. Conventional map-based keyframe localization often suffers from
sub-optimal viewpoints due to limited field of view (FOV), thus degrading its
performance. To address this issue, in this paper, we design a real-time
tightly-coupled Neural Radiance Fields (NeRF)-aided visual-inertial navigation
system (VINS), termed NeRF-VINS. By effectively leveraging NeRF's potential to
synthesize novel views, essential for addressing limited viewpoints, the
proposed NeRF-VINS optimally fuses IMU and monocular image measurements along
with synthetically rendered images within an efficient filter-based framework.
This tightly coupled integration enables 3D motion tracking with bounded error.
We extensively compare the proposed NeRF-VINS against state-of-the-art
methods that use prior map information and show that it achieves superior
performance. We also demonstrate that the proposed method performs real-time
estimation at 15 Hz on a resource-constrained Jetson AGX Orin embedded
platform while maintaining high accuracy.
Comment: 6 pages, 7 figures
Direct Sparse Visual-Inertial Odometry using Dynamic Marginalization
We present VI-DSO, a novel approach for visual-inertial odometry, which
jointly estimates camera poses and sparse scene geometry by minimizing
photometric and IMU measurement errors in a combined energy functional. The
visual part of the system performs a bundle-adjustment like optimization on a
sparse set of points, but unlike key-point based systems it directly minimizes
a photometric error. This makes it possible for the system to track not only
corners, but any pixels with large enough intensity gradients. IMU information
is accumulated between several frames using measurement preintegration, and is
inserted into the optimization as an additional constraint between keyframes.
We explicitly include scale and gravity direction into our model and jointly
optimize them together with other variables such as poses. As the scale is
often not immediately observable using IMU data this allows us to initialize
our visual-inertial system with an arbitrary scale instead of having to delay
the initialization until everything is observable. We perform partial
marginalization of old variables so that updates can be computed in a
reasonable time. In order to keep the system consistent we propose a novel
strategy which we call "dynamic marginalization". This technique allows us to
use partial marginalization even in cases where the initial scale estimate is
far from the optimum. We evaluate our method on the challenging EuRoC dataset,
showing that VI-DSO outperforms the state of the art.
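As an illustrative sketch (the notation below is assumed for exposition, not taken from the paper), the combined energy functional described above can be written as a sum of photometric and inertial terms:

```latex
E_{\text{total}} \;=\; \sum_{i \in \mathcal{F}} \sum_{\mathbf{p} \in \mathcal{P}_i} E_{\text{photo}}(\mathbf{p}, i)
\;+\; \lambda \sum_{(i,j) \in \mathcal{K}} E_{\text{inertial}}(s_i, s_j),
```

where E_photo is the photometric error of a sparse point p observed in frame i, and E_inertial penalizes the discrepancy between the preintegrated IMU prediction and the estimated relative state of consecutive keyframes i and j. Here the state s would include pose, velocity, and IMU biases, with the global scale and gravity direction jointly optimized alongside them, as the abstract describes.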
ViPR: Visual-Odometry-aided Pose Regression for 6DoF Camera Localization
Visual Odometry (VO) accumulates a positional drift in long-term robot
navigation tasks. Although Convolutional Neural Networks (CNNs) improve VO in
various aspects, VO still suffers from moving obstacles, discontinuous
observation of features, and poor textures or visual information. While recent
approaches estimate a 6DoF pose either directly from (a series of) images or by
merging depth maps with optical flow (OF), research that combines absolute pose
regression with OF is limited. We propose ViPR, a novel modular architecture
for long-term 6DoF VO that leverages temporal information and synergies between
absolute pose estimates (from PoseNet-like modules) and relative pose estimates
(from FlowNet-based modules) by combining both through recurrent layers.
Experiments on known datasets and on our own Industry dataset show that our
modular design outperforms state of the art in long-term navigation tasks.
Comment: Conf. on Computer Vision and Pattern Recognition (CVPR): Joint
Workshop on Long-Term Visual Localization, Visual Odometry and Geometric and
Learning-based SLAM 2020
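The synergy ViPR exploits, that relative pose estimates are accurate short-term but drift while absolute pose regression is noisy but drift-free, can be illustrated with a toy 1-D experiment. ViPR fuses the two streams with recurrent layers; the simple complementary filter below (the function name and the blending factor `alpha` are assumptions for illustration) shows the same intuition, not the paper's architecture.

```python
import numpy as np

def fuse(absolute, relative_steps, alpha=0.05):
    """Blend dead-reckoned relative motion with absolute pose fixes."""
    fused = [absolute[0]]
    for abs_pose, step in zip(absolute[1:], relative_steps):
        predicted = fused[-1] + step                  # dead reckoning from last fused pose
        fused.append((1 - alpha) * predicted + alpha * abs_pose)
    return np.array(fused)

rng = np.random.default_rng(0)
t = np.arange(200, dtype=float)                       # ground-truth 1-D positions
absolute = t + rng.normal(0.0, 2.0, 200)              # noisy but drift-free (pose regression)
steps = np.diff(t) + 0.02                             # small per-step bias -> accumulating drift
fused = fuse(absolute, steps)
```

Pure integration of the relative steps drifts by about 0.02 per step (roughly 4 units over 200 steps), while the fused trajectory stays bounded near the ground truth.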
Near-field Perception for Low-Speed Vehicle Automation using Surround-view Fisheye Cameras
Cameras are the primary sensor in automated driving systems. They provide
high information density and are optimal for detecting road infrastructure cues
laid out for human vision. Surround-view camera systems typically comprise
four fisheye cameras with a 190°+ field of view covering the entire
360° around the vehicle, focused on near-field sensing. They are the
principal sensors for low-speed, high accuracy, and close-range sensing
applications, such as automated parking, traffic jam assistance, and low-speed
emergency braking. In this work, we provide a detailed survey of such vision
systems, setting up the survey in the context of an architecture that can be
decomposed into four modular components namely Recognition, Reconstruction,
Relocalization, and Reorganization. We jointly call this the 4R Architecture.
We discuss how each component accomplishes a specific aspect and provide a
positional argument that they can be synergized to form a complete perception
system for low-speed automation. We support this argument by presenting results
from previous works and by presenting architecture proposals for such a system.
Qualitative results are presented in the video at https://youtu.be/ae8bCOF77uY.
Comment: Accepted for publication at IEEE Transactions on Intelligent
Transportation Systems
Keyframe-based visual-inertial online SLAM with relocalization
Complementing images with inertial measurements has become one of the most
popular approaches to achieve highly accurate and robust real-time camera pose
tracking. In this paper, we present a keyframe-based approach to
visual-inertial simultaneous localization and mapping (SLAM) for monocular and
stereo cameras. Our visual-inertial SLAM system is based on a real-time capable
visual-inertial odometry method that provides locally consistent trajectory and
map estimates. We achieve global consistency in the estimate through online
loop-closing and non-linear optimization. Furthermore, our system supports
relocalization in a map that has been previously obtained and allows for
continued SLAM operation. We evaluate our approach in terms of accuracy,
relocalization capability and run-time efficiency on public indoor benchmark
datasets and on newly recorded outdoor sequences. We demonstrate
state-of-the-art performance of our system compared to a visual-inertial
odometry method and baseline visual SLAM approaches in recovering the
trajectory of the camera.
Visual-Inertial Sensor Fusion Models and Algorithms for Context-Aware Indoor Navigation
Positioning in navigation systems is predominantly performed by Global Navigation Satellite Systems (GNSSs). However, while GNSS-enabled devices have become commonplace for outdoor navigation, their use for indoor navigation is hindered by GNSS signal degradation or blockage. Developing alternative positioning approaches and techniques for navigation systems is therefore an ongoing research topic. In this dissertation, I present a new approach that addresses three major navigational problems: indoor positioning, obstacle detection, and keyframe detection. The proposed approach utilizes the inertial and visual sensors available on smartphones and focuses on developing: a framework for monocular visual-inertial odometry (VIO) that positions a person or object using sensor fusion and deep learning in tandem; an unsupervised algorithm that detects obstacles from a sequence of visual data; and a supervised, context-aware keyframe detection method.
The underlying technique for monocular VIO is a recurrent convolutional neural network that computes six-degree-of-freedom (6DoF) pose in an end-to-end fashion, combined with an extended Kalman filter module that fine-tunes the scale parameter based on inertial observations and manages errors. I compare the results of my featureless technique with those of conventional feature-based VIO techniques and with manually scaled results. The comparison shows that while the proposed framework improves accuracy over other featureless methods, feature-based methods still outperform it.
The approach for obstacle detection is based on processing two consecutive images to detect obstacles. Experiments comparing the results of my approach with those of two other widely used algorithms show that my algorithm performs better, achieving 82% precision compared with 69%. To determine an appropriate frame-extraction rate from the video stream, I analyzed the camera's movement patterns and inferred the user's context to generate a model associating movement anomalies with a suitable frame-extraction rate. The output of this model was used to determine the rate of keyframe extraction in visual odometry (VO). I defined and computed the effective frames for VO and experimented with this approach for context-aware keyframe detection. The results show that the number of frames processed is reduced when inertial data is used to infer the appropriate frames.
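The scale fine-tuning step described above, an extended Kalman filter refining a metric scale from inertial observations, reduces in its simplest form to a scalar filter. The sketch below is a minimal illustration under that assumption; the class name, noise parameters, and measurement model (inertial displacement ≈ scale × visual displacement) are assumptions, not the dissertation's code.

```python
import numpy as np

class ScaleEKF:
    """Scalar EKF refining the metric scale of unscaled visual translations
    using inertial (accelerometer-derived) displacement estimates."""

    def __init__(self, scale0=1.0, var0=1.0, process_var=1e-4, meas_var=0.05):
        self.s = scale0          # current scale estimate
        self.P = var0            # estimate variance
        self.Q = process_var     # scale modeled as a slow random walk
        self.R = meas_var        # inertial displacement measurement noise

    def update(self, visual_disp, inertial_disp):
        self.P += self.Q                              # predict step
        H = visual_disp                               # measurement Jacobian
        y = inertial_disp - self.s * visual_disp      # innovation
        S = H * self.P * H + self.R
        K = self.P * H / S                            # Kalman gain
        self.s += K * y
        self.P *= (1.0 - K * H)
        return self.s

# Usage: unscaled VO reports 0.5 units while IMU integration says ~1.0 m,
# so the true scale is 2.0; the filter converges to it despite noise.
rng = np.random.default_rng(0)
ekf = ScaleEKF()
for _ in range(50):
    ekf.update(0.5, 1.0 + rng.normal(0.0, 0.01))
```

The random-walk process noise keeps the gain from collapsing to zero, so the filter can track a slowly changing scale rather than freezing on its first estimate.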
Structureless Camera Motion Estimation of Unordered Omnidirectional Images
This work provides a novel camera motion estimation pipeline for large collections of unordered omnidirectional images. To keep the pipeline as general and flexible as possible, cameras are modelled as unit spheres, allowing any central camera type to be incorporated. For each camera, an unprojection lookup called a P2S-map (Pixel-to-Sphere map) is generated from the intrinsics, mapping pixels to their corresponding positions on the unit sphere. Consequently, the camera geometry becomes independent of the underlying projection model. The pipeline also generates P2S-maps from world map projections with fewer distortion effects, as known from cartography. Using P2S-maps from camera calibration and world map projection allows omnidirectional camera images to be converted to an appropriate world map projection, so that standard feature extraction and matching algorithms can be applied for data association. The proposed estimation pipeline combines the flexibility of SfM (Structure from Motion), which handles unordered image collections, with the efficiency of PGO (Pose Graph Optimization), which is used as the back-end in graph-based Visual SLAM (Simultaneous Localization and Mapping) approaches to optimize camera poses from large image sequences. SfM uses BA (Bundle Adjustment) to jointly optimize camera poses (motion) and 3d feature locations (structure), which becomes computationally expensive for large-scale scenarios. By contrast, PGO solves for camera poses (motion) from measured transformations between cameras, keeping the optimization manageable. The proposed estimation algorithm combines both worlds. It obtains up-to-scale transformations between image pairs using two-view constraints, which are jointly scaled using trifocal constraints. A pose graph is generated from the scaled two-view transformations and solved by PGO to obtain camera motion efficiently, even for large image collections.
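The P2S-map idea, a per-pixel lookup table holding the unit-sphere direction each pixel unprojects to, can be sketched compactly. In the illustration below an ideal equirectangular model stands in for the camera intrinsics; the function and variable names are assumptions for exposition, not the thesis implementation.

```python
import numpy as np

def build_p2s_map(width, height):
    """Return an (H, W, 3) lookup table of unit vectors, one per pixel,
    for an ideal equirectangular projection."""
    u = (np.arange(width) + 0.5) / width              # normalized column in [0, 1)
    v = (np.arange(height) + 0.5) / height            # normalized row in [0, 1)
    lon = (u - 0.5) * 2.0 * np.pi                     # longitude in (-pi, pi)
    lat = (0.5 - v) * np.pi                           # latitude in (-pi/2, pi/2)
    lon, lat = np.meshgrid(lon, lat)
    x = np.cos(lat) * np.sin(lon)                     # unproject to the unit sphere
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)

p2s = build_p2s_map(64, 32)
# Every entry lies on the unit sphere, so a table lookup replaces per-pixel
# projection math and makes the geometry independent of the camera model.
norms = np.linalg.norm(p2s, axis=-1)
```

Swapping in a different camera model (fisheye, unified, polynomial) only changes how the table is filled; all downstream geometry consumes unit-sphere directions unchanged, which is the model-independence the abstract describes.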
The obtained results can be used as input data to provide initial pose estimates for further 3d reconstruction purposes, e.g. to build a sparse structure from feature correspondences in an SfM or SLAM framework with further refinement via BA.
The pipeline also incorporates fixed extrinsic constraints from multi-camera setups as well as depth information provided by RGBD sensors. The entire camera motion estimation pipeline does not need to generate a sparse 3d structure of the captured environment and is thus called SCME (Structureless Camera Motion Estimation).
1 Introduction
1.1 Motivation
1.1.1 Increasing Interest of Image-Based 3D Reconstruction
1.1.2 Underground Environments as Challenging Scenario
1.1.3 Improved Mobile Camera Systems for Full Omnidirectional Imaging
1.2 Issues
1.2.1 Directional versus Omnidirectional Image Acquisition
1.2.2 Structure from Motion versus Visual Simultaneous Localization and Mapping
1.3 Contribution
1.4 Structure of this Work
2 Related Work
2.1 Visual Simultaneous Localization and Mapping
2.1.1 Visual Odometry
2.1.2 Pose Graph Optimization
2.2 Structure from Motion
2.2.1 Bundle Adjustment
2.2.2 Structureless Bundle Adjustment
2.3 Corresponding Issues
2.4 Proposed Reconstruction Pipeline
3 Cameras and Pixel-to-Sphere Mappings with P2S-Maps
3.1 Types
3.2 Models
3.2.1 Unified Camera Model
3.2.2 Polynomial Camera Model
3.2.3 Spherical Camera Model
3.3 P2S-Maps - Mapping onto Unit Sphere via Lookup Table
3.3.1 Lookup Table as Color Image
3.3.2 Lookup Interpolation
3.3.3 Depth Data Conversion
4 Calibration
4.1 Overview of Proposed Calibration Pipeline
4.2 Target Detection
4.3 Intrinsic Calibration
4.3.1 Selected Examples
4.4 Extrinsic Calibration
4.4.1 3D-2D Pose Estimation
4.4.2 2D-2D Pose Estimation
4.4.3 Pose Optimization
4.4.4 Uncertainty Estimation
4.4.5 Pose Graph Representation
4.4.6 Bundle Adjustment
4.4.7 Selected Examples
5 Full Omnidirectional Image Projections
5.1 Panoramic Image Stitching
5.2 World Map Projections
5.3 World Map Projection Generator for P2S-Maps
5.4 Conversion between Projections based on P2S-Maps
5.4.1 Proposed Workflow
5.4.2 Data Storage Format
5.4.3 Real World Example
6 Relations between Two Camera Spheres
6.1 Forward and Backward Projection
6.2 Triangulation
6.2.1 Linear Least Squares Method
6.2.2 Alternative Midpoint Method
6.3 Epipolar Geometry
6.4 Transformation Recovery from Essential Matrix
6.4.1 Cheirality
6.4.2 Standard Procedure
6.4.3 Simplified Procedure
6.4.4 Improved Procedure
6.5 Two-View Estimation
6.5.1 Evaluation Strategy
6.5.2 Error Metric
6.5.3 Evaluation of Estimation Algorithms
6.5.4 Concluding Remarks
6.6 Two-View Optimization
6.6.1 Epipolar-Based Error Distances
6.6.2 Projection-Based Error Distances
6.6.3 Comparison between Error Distances
6.7 Two-View Translation Scaling
6.7.1 Linear Least Squares Estimation
6.7.2 Non-Linear Least Squares Optimization
6.7.3 Comparison between Initial and Optimized Scaling Factor
6.8 Homography to Identify Degeneracies
6.8.1 Homography for Spherical Cameras
6.8.2 Homography Estimation
6.8.3 Homography Optimization
6.8.4 Homography and Pure Rotation
6.8.5 Homography in Epipolar Geometry
7 Relations between Three Camera Spheres
7.1 Three View Geometry
7.2 Crossing Epipolar Planes Geometry
7.3 Trifocal Geometry
7.4 Relation between Trifocal, Three-View and Crossing Epipolar Planes
7.5 Translation Ratio between Up-To-Scale Two-View Transformations
7.5.1 Structureless Determination Approaches
7.5.2 Structure-Based Determination Approaches
7.5.3 Comparison between Proposed Approaches
8 Pose Graphs
8.1 Optimization Principle
8.2 Solvers
8.2.1 Additional Graph Solvers
8.2.2 False Loop Closure Detection
8.3 Pose Graph Generation
8.3.1 Generation of Synthetic Pose Graph Data
8.3.2 Optimization of Synthetic Pose Graph Data
9 Structureless Camera Motion Estimation
9.1 SCME Pipeline
9.2 Determination of Two-View Translation Scale Factors
9.3 Integration of Depth Data
9.4 Integration of Extrinsic Camera Constraints
10 Camera Motion Estimation Results
10.1 Directional Camera Images
10.2 Omnidirectional Camera Images
11 Conclusion
11.1 Summary
11.2 Outlook and Future Work
Appendices
A.1 Additional Extrinsic Calibration Results
A.2 Linear Least Squares Scaling
A.3 Proof Rank Deficiency
A.4 Alternative Derivation Midpoint Method
A.5 Simplification of Depth Calculation
A.6 Relation between Epipolar and Circumferential Constraint
A.7 Covariance Estimation
A.8 Uncertainty Estimation from Epipolar Geometry
A.9 Two-View Scaling Factor Estimation: Uncertainty Estimation
A.10 Two-View Scaling Factor Optimization: Uncertainty Estimation
A.11 Depth from Adjoining Two-View Geometries
A.12 Alternative Three-View Derivation
A.12.1 Second Derivation Approach
A.12.2 Third Derivation Approach
A.13 Relation between Trifocal Geometry and Alternative Midpoint Method
A.14 Additional Pose Graph Generation Examples
A.15 Pose Graph Solver Settings
A.16 Additional Pose Graph Optimization Examples
Bibliography