SelfOdom: Self-supervised Egomotion and Depth Learning via Bi-directional Coarse-to-Fine Scale Recovery
Accurately perceiving location and scene is crucial for autonomous driving
and mobile robots. Recent advances in deep learning have made it possible to
learn egomotion and depth from monocular images in a self-supervised manner,
without requiring highly precise labels to train the networks. However,
monocular vision methods suffer from a limitation known as scale ambiguity,
which restricts their application when absolute scale is necessary. To address
this, we propose SelfOdom, a self-supervised dual-network framework that can
robustly and consistently learn to generate pose and depth estimates at global
scale from monocular images. In particular, we introduce a novel coarse-to-fine
training strategy that enables the metric scale to be recovered in a two-stage
process. Furthermore, SelfOdom is flexible: using an attention-based fusion
module, it can incorporate inertial data alongside images, improving its
robustness in challenging scenarios. Our model excels in both normal and challenging
lighting conditions, including difficult night scenes. Extensive experiments on
public datasets have demonstrated that SelfOdom outperforms representative
traditional and learning-based visual odometry (VO) and visual-inertial odometry (VIO) models.
Comment: 14 pages, 8 figures, in submission.
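The abstract does not specify how the attention-based fusion module combines the two modalities; below is a minimal illustrative sketch of one common design, in which per-channel attention gates re-weight concatenated visual and inertial features before pose regression. All module names, dimensions, and shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Soft-attention fusion of visual and inertial features (sketch only).

    Channel-wise gates in [0, 1] are predicted from the concatenated
    features and used to re-weight each modality before fusion.
    """

    def __init__(self, visual_dim=512, inertial_dim=256):
        super().__init__()
        fused = visual_dim + inertial_dim
        self.gate = nn.Sequential(
            nn.Linear(fused, fused),
            nn.ReLU(inplace=True),
            nn.Linear(fused, fused),
            nn.Sigmoid(),  # gate values in [0, 1]
        )

    def forward(self, visual_feat, inertial_feat):
        # visual_feat: (B, visual_dim), inertial_feat: (B, inertial_dim)
        concat = torch.cat([visual_feat, inertial_feat], dim=-1)
        return concat * self.gate(concat)  # re-weighted fused feature

# Usage: fuse per-frame features before a pose regressor (shapes assumed).
fusion = AttentionFusion()
v = torch.randn(4, 512)   # visual encoder output
i = torch.randn(4, 256)   # IMU encoder output
fused = fusion(v, i)      # (4, 768)
```

A gating design like this lets the network downweight the visual branch in degraded conditions (e.g., night scenes) and lean on the inertial branch instead, which matches the robustness behavior the abstract describes.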
FSNet: Redesign Self-Supervised MonoDepth for Full-Scale Depth Prediction for Autonomous Driving
Predicting accurate depth with monocular images is important for low-cost
robotic applications and autonomous driving. This study proposes a
comprehensive self-supervised framework for accurate scale-aware depth
prediction on autonomous driving scenes utilizing inter-frame poses obtained
from inertial measurements. In particular, we introduce a Full-Scale depth
prediction network named FSNet. FSNet contains four important improvements over
existing self-supervised models: (1) a multichannel output representation for
stable training of depth prediction in driving scenarios, (2) an
optical-flow-based mask designed for dynamic object removal, (3) a
self-distillation training strategy to augment the training process, and (4) an
optimization-based post-processing algorithm in test time, fusing the results
from visual odometry. With this framework, robots and vehicles with only one
well-calibrated camera can collect sequences of training image frames and
camera poses, and infer accurate 3D depths of the environment without extra
labeling work or 3D data. Extensive experiments on the KITTI dataset, KITTI-360
dataset and the nuScenes dataset demonstrate the potential of FSNet. More
visualizations are presented in \url{https://sites.google.com/view/fsnet/home}.
Comment: 12 pages; conditionally accepted by IEEE T-AS.
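Self-supervised frameworks of this kind are built on a photometric reprojection objective: the predicted depth and the inter-frame pose (here supplied by inertial measurements, which is what makes the depth scale-aware) warp a source frame into the target view, and the training loss penalizes the photometric difference. The sketch below shows this standard machinery, not FSNet's exact formulation; all tensor shapes and the clamping constant are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_to_target(depth_t, K, K_inv, T_t2s, src_img):
    """Warp a source image into the target view (standard self-supervised
    monodepth reprojection; FSNet's exact formulation may differ).

    depth_t: (B, 1, H, W) predicted target-view depth
    K, K_inv: (B, 3, 3) camera intrinsics and their inverse
    T_t2s:   (B, 4, 4) target-to-source pose, e.g. from inertial odometry
    src_img: (B, 3, H, W) source frame
    """
    B, _, H, W = depth_t.shape
    device, dtype = depth_t.device, depth_t.dtype
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=dtype),
        torch.arange(W, device=device, dtype=dtype),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)
    pix = pix.reshape(1, 3, -1).expand(B, -1, -1)          # homogeneous pixels
    # Back-project to 3D, transform into the source frame, re-project.
    cam = (K_inv @ pix) * depth_t.reshape(B, 1, -1)
    ones = torch.ones(B, 1, H * W, device=device, dtype=dtype)
    src_cam = (T_t2s @ torch.cat([cam, ones], dim=1))[:, :3]
    src_pix = K @ src_cam
    src_pix = src_pix[:, :2] / src_pix[:, 2:].clamp(min=1e-6)
    # Normalize coordinates to [-1, 1] for grid_sample.
    u = 2.0 * src_pix[:, 0] / (W - 1) - 1.0
    v = 2.0 * src_pix[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

# Photometric loss: L1 between the target frame and the warped source, e.g.
# photo_loss = (target_img - warp_to_target(...)).abs().mean()
```

Because T_t2s carries metric scale from the inertial measurements, a depth network trained against this loss is pushed toward metrically scaled predictions rather than scale-ambiguous ones.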
DEUX: Active Exploration for Learning Unsupervised Depth Perception
Depth perception models are typically trained on non-interactive datasets
with predefined camera trajectories. However, this often introduces systematic
biases into the learning process correlated to specific camera paths chosen
during data acquisition. In this paper, we investigate how the way data is
collected affects learning depth completion, from a robot navigation
perspective, by leveraging 3D interactive environments. First, we evaluate four depth
completion models trained on data collected using conventional navigation
techniques. Our key insight is that existing exploration paradigms do not
necessarily provide task-specific data points to achieve competent unsupervised
depth completion learning. We then find that data collected with respect to
photometric reconstruction has a direct positive influence on model
performance. As a result, we develop an active, task-informed, depth
uncertainty-based motion planning approach for learning depth completion, which
we call DEpth Uncertainty-guided eXploration (DEUX). Training with data
collected by our approach improves depth completion by an average of more than
18% across four depth completion models compared to existing exploration
methods on the MP3D test set. We show that our approach further improves
zero-shot generalization, while offering new insights into integrating robot
learning-based depth estimation.
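The abstract describes a depth-uncertainty-based planner; its greedy core idea can be sketched as scoring candidate viewpoints by the model's predicted depth uncertainty and steering toward the most uncertain one. The helper `uncertainty_fn` below is hypothetical (e.g., the variance of an ensemble of depth-completion predictions); this is not the paper's actual planner.

```python
import numpy as np

def select_next_view(candidate_views, uncertainty_fn):
    """Greedy uncertainty-guided view selection (illustrative sketch only).

    candidate_views: list of candidate camera poses reachable by the robot
    uncertainty_fn:  maps a pose to a per-pixel depth-uncertainty map,
                     e.g. variance across an ensemble of depth predictions
    """
    scores = [float(np.mean(uncertainty_fn(pose))) for pose in candidate_views]
    # Move toward the view where the current model is least certain, so the
    # newly collected frames target the model's weaknesses.
    return candidate_views[int(np.argmax(scores))]
```

The contrast with conventional navigation-driven collection is that the trajectory is chosen by the learner's needs rather than by a task-agnostic coverage or path-planning objective.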
2D-3D Pose Tracking with Multi-View Constraints
Camera localization in 3D LiDAR maps has gained increasing attention due to
its promising ability to handle complex scenarios, surpassing the limitations
of visual-only localization methods. However, existing methods mostly focus on
bridging the cross-modal gap, estimating camera poses frame by frame without
considering the relationship between adjacent frames, which makes pose
tracking unstable. To alleviate this, we propose to couple the 2D-3D
correspondences between adjacent frames using the 2D-2D feature matching,
establishing the multi-view geometrical constraints for simultaneously
estimating multiple camera poses. Specifically, we propose a new 2D-3D pose
tracking framework, which consists of a front-end hybrid flow estimation network
for consecutive frames and a back-end pose optimization module. We further
design a cross-modal consistency-based loss to incorporate the multi-view
constraints during the training and inference process. We evaluate our proposed
framework on the KITTI and Argoverse datasets. Experimental results demonstrate
its superior performance compared to existing frame-by-frame 2D-3D pose
tracking methods and state-of-the-art vision-only pose tracking algorithms.
More online pose tracking videos are available at
\url{https://youtu.be/yfBRdg7gw5M}.
Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible.
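The back-end described above jointly optimizes multiple adjacent camera poses instead of solving each frame independently. A simplified sketch of such a joint refinement is shown below: each frame contributes reprojection residuals from its 2D-3D correspondences, and both poses are solved in one least-squares problem. The paper's optimization additionally couples the frames through 2D-2D matches and a cross-modal consistency loss, which this sketch omits; all function names and data layouts here are assumptions.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(K, rvec, tvec, pts3d):
    """Project 3D map points using a rotation-vector / translation pose."""
    cam = Rotation.from_rotvec(rvec).apply(pts3d) + tvec
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3]

def refine_two_frames(K, poses0, corr):
    """Jointly refine two adjacent camera poses (illustrative back-end).

    poses0: list of two (rvec, tvec) initial pose estimates
    corr:   list of two (pts3d, pts2d) correspondence sets, one per frame,
            e.g. produced by a front-end 2D-3D matching network
    """
    x0 = np.concatenate([np.concatenate(p) for p in poses0])

    def residuals(x):
        res = []
        for i, (pts3d, pts2d) in enumerate(corr):
            rvec = x[6 * i:6 * i + 3]
            tvec = x[6 * i + 3:6 * i + 6]
            # Stack per-frame reprojection errors into one residual vector.
            res.append((project(K, rvec, tvec, pts3d) - pts2d).ravel())
        return np.concatenate(res)

    sol = least_squares(residuals, x0)
    return [(sol.x[6 * i:6 * i + 3], sol.x[6 * i + 3:6 * i + 6])
            for i in range(2)]
```

Solving the poses together is what stabilizes tracking: an outlier-heavy frame is constrained by its neighbor rather than being estimated in isolation.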