DeFeat-Net: General Monocular Depth via Simultaneous Unsupervised Representation Learning
In the current monocular depth research, the dominant approach is to employ
unsupervised training on large datasets, driven by warped photometric
consistency. Such approaches lack robustness and are unable to generalize to
challenging domains such as nighttime scenes or adverse weather conditions
where assumptions about photometric consistency break down.
We propose DeFeat-Net (Depth & Feature network), an approach to
simultaneously learn a cross-domain dense feature representation, alongside a
robust depth-estimation framework based on warped feature consistency. The
resulting feature representation is learned in an unsupervised manner with no
explicit ground-truth correspondences required.
We show that within a single domain, our technique is comparable to both the
current state of the art in monocular depth estimation and supervised feature
representation learning. However, by simultaneously learning features, depth
and motion, our technique is able to generalize to challenging domains,
allowing DeFeat-Net to outperform the current state of the art with around a 10%
reduction in all error measures on more challenging sequences such as nighttime
driving.
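The warped feature-consistency objective can be pictured as a standard reprojection warp applied to learned feature maps rather than raw pixels: back-project the target frame with its predicted depth, transform the points into the source frame, resample the source features there, and penalize disagreement with the target features. The following is a minimal PyTorch sketch of that idea under common conventions (per-frame intrinsics `K`, a target-to-source pose `T_tgt_to_src`, bilinear resampling via `grid_sample`); it is an illustration of the general technique, not DeFeat-Net's released code, and the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def warp_features(src_feat, tgt_depth, T_tgt_to_src, K):
    """Warp source-frame features into the target frame via reprojection.

    src_feat:     (B, C, H, W) feature map from the source frame
    tgt_depth:    (B, 1, H, W) predicted depth for the target frame
    T_tgt_to_src: (B, 4, 4) SE(3) transform taking target-frame points to the source frame
    K:            (B, 3, 3) camera intrinsics
    """
    B, C, H, W = src_feat.shape
    device = src_feat.device

    # Pixel grid in homogeneous coordinates, shape (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1).expand(B, -1, -1)

    # Back-project to target-frame 3D points: X = depth * K^-1 * pix
    cam_pts = torch.linalg.inv(K) @ pix * tgt_depth.view(B, 1, -1)

    # Transform into the source camera and project with K
    cam_pts_h = torch.cat([cam_pts, torch.ones(B, 1, H * W, device=device)], dim=1)
    src_pts = (T_tgt_to_src @ cam_pts_h)[:, :3]
    proj = K @ src_pts
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)

    # Normalise to [-1, 1] and bilinearly sample the source features
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_feat, grid, padding_mode="border", align_corners=True)

def feature_consistency_loss(src_feat, tgt_feat, tgt_depth, T_tgt_to_src, K):
    # L1 distance between target features and features warped in from the source frame
    warped = warp_features(src_feat, tgt_depth, T_tgt_to_src, K)
    return (warped - tgt_feat).abs().mean()
```

Replacing the photometric image in this loss with a learned feature map is what makes the consistency signal usable when raw pixel brightness is unreliable, e.g. at night.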
There and Back Again: Self-supervised Multispectral Correspondence Estimation
Across a wide range of applications, from autonomous vehicles to medical
imaging, multi-spectral images provide an opportunity to extract additional
information not present in color images. One of the most important steps in
making this information readily available is the accurate estimation of dense
correspondences between different spectra.
Due to the nature of cross-spectral images, most correspondence solving
techniques for the visual domain are simply not applicable. Furthermore, most
cross-spectral techniques utilize spectra-specific characteristics to perform
the alignment. In this work, we aim to address the dense correspondence
estimation problem in a way that generalizes to more than one spectrum. We do
this by introducing a novel cycle-consistency metric that allows us to
self-supervise. This, combined with our spectra-agnostic loss functions, allows
us to train the same network across multiple spectra.
We demonstrate our approach on the challenging task of dense RGB-FIR
correspondence estimation. We also show the performance of our unmodified
network on the cases of RGB-NIR and RGB-RGB, where we achieve higher accuracy
than similar self-supervised approaches. Our work shows that cross-spectral
correspondence estimation can be solved in a common framework that learns to
generalize alignment across spectra.
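A cycle-consistency ("there and back again") signal of this kind can be written compactly: compose the predicted A-to-B correspondence field with the B-to-A field and penalize any pixel that fails to return to its starting position. Below is a minimal PyTorch sketch of that construction for dense flow fields; the exact metric and weighting used in the paper may differ, and the helper names are made up for illustration.

```python
import torch
import torch.nn.functional as F

def backward_warp(flow_src_to_tgt, field):
    """Sample `field` (defined on the target image) at pixel + flow locations,
    pulling target-frame values back onto the source grid."""
    B, _, H, W = flow_src_to_tgt.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(flow_src_to_tgt.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow_src_to_tgt                            # (B, 2, H, W)
    u = 2.0 * coords[:, 0] / (W - 1) - 1.0
    v = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1)                                      # (B, H, W, 2)
    return F.grid_sample(field, grid, padding_mode="border", align_corners=True)

def cycle_consistency_loss(flow_ab, flow_ba):
    """Composing the A->B flow with the B->A flow should bring every pixel
    back to where it started; the residual of that round trip is the loss."""
    flow_ba_at_b = backward_warp(flow_ab, flow_ba)   # B->A flow, sampled where A->B points
    round_trip = flow_ab + flow_ba_at_b              # ideally zero everywhere
    return round_trip.abs().mean()
```

Because nothing in this loss refers to pixel intensities of a particular spectrum, the same objective can supervise RGB-FIR, RGB-NIR, or RGB-RGB pairs alike.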
Single-Image Depth Prediction Makes Feature Matching Easier
Good local features improve the robustness of many 3D re-localization and
multi-view reconstruction pipelines. The problem is that viewing angle and
distance severely impact the recognizability of a local feature. Attempts to
improve appearance invariance by choosing better local feature points or by
leveraging outside information have come with prerequisites that made some of
them impractical. In this paper, we propose a surprisingly effective
enhancement to local feature extraction, which improves matching. We show that
CNN-based depths inferred from single RGB images are quite helpful, despite
their flaws. They allow us to pre-warp images and rectify perspective
distortions, to significantly enhance SIFT and BRISK features, enabling more
good matches, even when cameras are looking at the same scene but in opposite
directions.
Comment: 14 pages, 7 figures, accepted for publication at the European Conference on Computer Vision (ECCV) 202
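The core recipe, warping each image toward a fronto-parallel view of its dominant surface before extracting and matching classical features, can be sketched with OpenCV. The snippet below assumes the plane normals (`normal_a`, `normal_b`) have already been estimated from a single-image depth network and uses a pure-rotation homography H = K R K^{-1}; it is a simplified stand-in for the paper's warping scheme, not a reproduction of it, and the function names are hypothetical.

```python
import cv2
import numpy as np

def rectifying_homography(K, normal):
    """Homography that virtually rotates the camera so a plane with the given
    unit normal (camera coordinates) becomes fronto-parallel: H = K R K^{-1}."""
    n = normal / np.linalg.norm(normal)
    target = np.array([0.0, 0.0, -1.0])            # normal pointing back at the camera
    v = np.cross(n, target)
    c = float(np.dot(n, target))
    if np.linalg.norm(v) < 1e-8:                   # already (anti)parallel: skip rotation
        R = np.eye(3)
    else:
        vx = np.array([[0, -v[2], v[1]],
                       [v[2], 0, -v[0]],
                       [-v[1], v[0], 0]])
        R = np.eye(3) + vx + vx @ vx / (1.0 + c)   # rotation taking n onto target
    return K @ R @ np.linalg.inv(K)

def match_on_rectified(img_a, img_b, K, normal_a, normal_b):
    """Warp both images to fronto-parallel views of their dominant planes
    (normals assumed to come from a monocular depth network), then match SIFT."""
    Ha = rectifying_homography(K, normal_a)
    Hb = rectifying_homography(K, normal_b)
    h, w = img_a.shape[:2]
    rect_a = cv2.warpPerspective(img_a, Ha, (w, h))
    rect_b = cv2.warpPerspective(img_b, Hb, (w, h))

    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(rect_a, None)
    kp_b, des_b = sift.detectAndCompute(rect_b, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n2 in matcher.knnMatch(des_a, des_b, k=2)
            if m.distance < 0.8 * n2.distance]     # Lowe ratio test
    return kp_a, kp_b, good
```

The point of the rectification is that SIFT and BRISK descriptors are far more repeatable once the perspective distortion caused by oblique viewing angles has been removed.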
(LC)²: LiDAR-Camera Loop Constraints For Cross-Modal Place Recognition
Localization has been a challenging task for autonomous navigation. A loop
detection algorithm must overcome environmental changes for the place
recognition and re-localization of robots. Therefore, deep learning has been
extensively studied for the consistent transformation of measurements into
localization descriptors. Street view images are easily accessible; however,
images are vulnerable to appearance changes. LiDAR can robustly provide precise
structural information. However, constructing a point cloud database is
expensive, and point clouds exist only in limited places. Different from
previous works that train networks to produce shared embedding directly between
the 2D image and 3D point cloud, we transform both data into 2.5D depth images
for matching. In this work, we propose a novel cross-matching method, called
(LC)², for achieving LiDAR localization without a prior point cloud map. To
this end, LiDAR measurements are expressed in the form of range images before
matching them to reduce the modality discrepancy. Subsequently, the network is
trained to extract localization descriptors from disparity and range images.
Next, the best matches are employed as a loop factor in a pose graph. Using
public datasets that include multiple sessions in significantly different
lighting conditions, we demonstrated that LiDAR-based navigation systems could
be optimized from image databases and vice versa.
Comment: 8 pages, 11 figures, Accepted to IEEE Robotics and Automation Letters (RA-L)
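The first step, expressing LiDAR measurements as range images so that they can be compared against image-derived depth, is typically done with a spherical projection. A minimal NumPy sketch follows; the sensor field-of-view values and the image resolution are illustrative assumptions (typical of a 64-beam sensor), not the paper's settings.

```python
import numpy as np

def lidar_to_range_image(points, h=64, w=1024, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) LiDAR point cloud into an (h, w) range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    fov = fov_up - fov_down

    yaw = np.arctan2(y, x)                                      # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))

    u = 0.5 * (1.0 - yaw / np.pi) * w                           # column index
    v = (1.0 - (pitch - fov_down) / fov) * h                    # row index (top = highest beam)
    u = np.clip(np.floor(u), 0, w - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, h - 1).astype(np.int32)

    range_img = np.zeros((h, w), dtype=np.float32)
    order = np.argsort(-r)                                      # write nearest points last
    range_img[v[order], u[order]] = r[order]
    return range_img
```

Once both modalities are 2.5D images (range images from LiDAR, disparity images from the camera), a single descriptor network can be trained on both, which is what makes the cross-modal loop constraints possible.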
SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation
Depth estimation from images serves as the fundamental step of 3D perception
for autonomous driving and is an economical alternative to expensive depth
sensors like LiDAR. Temporal photometric constraints enable
self-supervised depth estimation without labels, further facilitating its
application. However, most existing methods predict the depth solely based on
each monocular image and ignore the correlations among multiple surrounding
cameras, which are typically available for modern self-driving vehicles. In
this paper, we propose SurroundDepth, a method to incorporate the information
from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views
and propose a cross-view transformer to effectively fuse the information from
multiple views. We apply cross-view self-attention to efficiently enable the
global interactions between multi-camera feature maps. Different from
self-supervised monocular depth estimation, we are able to predict real-world
scales given multi-camera extrinsic matrices. To achieve this goal, we adopt
the two-frame structure-from-motion to extract scale-aware pseudo depths to
pretrain the models. Further, instead of predicting the ego-motion of each
individual camera, we estimate a universal ego-motion of the vehicle and
transfer it to each view to achieve multi-view ego-motion consistency. In
experiments, our method achieves state-of-the-art performance on the
challenging multi-camera depth estimation datasets DDAD and nuScenes.
Comment: Accepted to CoRL 2022. Project page: https://surrounddepth.ivg-research.xyz Code: https://github.com/weiyithu/SurroundDept
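The cross-view self-attention idea, letting tokens from all surrounding cameras interact in a single attention operation, can be sketched in a few lines of PyTorch. This is an illustrative module rather than the SurroundDepth architecture; the channel size, head count, and residual/normalization layout are assumptions.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Tokens from all surrounding cameras attend to each other jointly,
    enabling global interactions between multi-camera feature maps."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feats):
        # feats: (B, N_cam, C, H, W) feature maps from N_cam surrounding cameras
        b, n, c, h, w = feats.shape
        tokens = feats.flatten(3).permute(0, 1, 3, 2).reshape(b, n * h * w, c)
        attended, _ = self.attn(tokens, tokens, tokens)   # global cross-view interaction
        tokens = self.norm(tokens + attended)             # residual connection + norm
        return tokens.reshape(b, n, h, w, c).permute(0, 1, 4, 2, 3)

# usage: fuse feature maps from 6 surround cameras (64 channels, 12x20 resolution)
fused = CrossViewAttention(64)(torch.randn(2, 6, 64, 12, 20))
```

Applying the attention at a coarse feature resolution keeps the token count manageable while still letting every camera see every other camera.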
A Simple Baseline for Supervised Surround-view Depth Estimation
Depth estimation has been widely studied and serves as the fundamental step
of 3D perception for autonomous driving. Though significant progress has been
made for monocular depth estimation in the past decades, these attempts are
mainly conducted on the KITTI benchmark with only front-view cameras, which
ignores the correlations across surround-view cameras. In this paper, we
propose S3Depth, a Simple Baseline for Supervised Surround-view Depth
Estimation, to jointly predict the depth maps across multiple surrounding
cameras. Specifically, we employ a global-to-local feature extraction module
which combines CNN with transformer layers for enriched representations.
Further, the Adjacent-view Attention mechanism is proposed to enable
intra-view and inter-view feature propagation. The former is achieved by a
self-attention module within each view, while the latter is realized by an
adjacent attention module, which computes attention across multiple cameras to
exchange the multi-scale representations across surround-view feature maps.
Extensive experiments show that our method achieves superior performance over
existing state-of-the-art methods on both the DDAD and nuScenes datasets.
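To make the intra-view / inter-view split concrete, the sketch below first applies self-attention within each camera and then cross-attends each view to its two neighbours in the surround rig. It is one interpretation of the described mechanism, not S3Depth's implementation; the neighbour choice, layer ordering, and sizes are assumptions.

```python
import torch
import torch.nn as nn

class AdjacentViewAttention(nn.Module):
    """Intra-view self-attention followed by cross-attention to adjacent cameras."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, feats):
        # feats: (B, N_cam, C, H, W), cameras ordered around the vehicle
        b, n, c, h, w = feats.shape
        tokens = feats.flatten(3).permute(0, 1, 3, 2)          # (B, N, H*W, C)
        out = []
        for i in range(n):
            q = tokens[:, i]                                   # (B, H*W, C)
            # intra-view propagation: self-attention within camera i
            q = self.norm1(q + self.self_attn(q, q, q)[0])
            # inter-view propagation: attend to the left and right neighbouring views
            kv = torch.cat([tokens[:, (i - 1) % n], tokens[:, (i + 1) % n]], dim=1)
            q = self.norm2(q + self.cross_attn(q, kv, kv)[0])
            out.append(q)
        out = torch.stack(out, dim=1)                          # (B, N, H*W, C)
        return out.permute(0, 1, 3, 2).reshape(b, n, c, h, w)
```

Restricting the inter-view exchange to adjacent cameras keeps the attention cost linear in the number of views while still covering the overlapping image regions where cross-camera cues matter.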
EP2P-Loc: End-to-End 3D Point to 2D Pixel Localization for Large-Scale Visual Localization
Visual localization is the task of estimating a 6-DoF camera pose of a query
image within a provided 3D reference map. Thanks to recent advances in various
3D sensors, 3D point clouds are becoming a more accurate and affordable option
for building the reference map, but research to match the points of 3D point
clouds with pixels in 2D images for visual localization remains challenging.
Existing approaches that jointly learn 2D-3D feature matching suffer from low
inliers due to representational differences between the two modalities, and the
methods that sidestep this problem by casting matching as classification suffer from poor
refinement. In this work, we propose EP2P-Loc, a novel large-scale visual
localization method that mitigates such appearance discrepancy and enables
end-to-end training for pose estimation. To increase the number of inliers, we
propose a simple algorithm to remove invisible 3D points in the image, and find
all 2D-3D correspondences without keypoint detection. To reduce memory usage
and search complexity, we take a coarse-to-fine approach where we extract
patch-level features from 2D images, then perform 2D patch classification on
each 3D point, and obtain the exact corresponding 2D pixel coordinates through
positional encoding. Finally, for the first time in this task, we employ a
differentiable PnP for end-to-end training. In the experiments on newly curated
large-scale indoor and outdoor benchmarks based on 2D-3D-S and KITTI, we show
that our method achieves state-of-the-art performance compared to existing
visual localization and image-to-point cloud registration methods.
Comment: Accepted to ICCV 202
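The invisible-point removal step can be illustrated with a simple project-and-z-buffer test: keep only the 3D points that land inside the image and are the nearest point hitting their pixel. The NumPy sketch below shows that idea under standard pinhole-camera conventions; the paper's actual algorithm may differ, and the function name is hypothetical.

```python
import numpy as np

def visible_point_indices(points_world, R, t, K, h, w):
    """Indices of 3D points that project inside an (h, w) image and survive a
    per-pixel z-buffer occlusion test, plus their 2D projections."""
    pts_cam = points_world @ R.T + t                 # world -> camera coordinates
    z = pts_cam[:, 2]
    proj = pts_cam @ K.T
    uv = proj[:, :2] / np.maximum(proj[:, 2:3], 1e-8)
    u = np.round(uv[:, 0]).astype(np.int64)
    v = np.round(uv[:, 1]).astype(np.int64)

    in_view = (z > 0.1) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(in_view)

    # z-buffer: remember the closest depth observed at every pixel
    zbuf = np.full((h, w), np.inf)
    np.minimum.at(zbuf, (v[idx], u[idx]), z[idx])

    # a point is visible if it is (numerically) the closest one at its pixel
    visible = idx[z[idx] <= zbuf[v[idx], u[idx]] + 1e-6]
    return visible, uv[visible]
```

Pruning occluded points before matching is what keeps the 2D-3D correspondence set dominated by true inliers, which in turn makes the downstream differentiable PnP stage well conditioned.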