Self-Supervised Monocular Depth Hints
Monocular depth estimators can be trained with various forms of
self-supervision from binocular-stereo data to circumvent the need for
high-quality laser scans or other ground-truth data. The disadvantage, however,
is that the photometric reprojection losses used with self-supervised learning
typically have multiple local minima. These plausible-looking alternatives to
ground truth can restrict what a regression network learns, causing it to
predict depth maps of limited quality. As one prominent example, depth
discontinuities around thin structures are often incorrectly estimated by
current state-of-the-art methods.
Here, we study the problem of ambiguous reprojections in depth prediction
from stereo-based self-supervision, and introduce Depth Hints to alleviate
their effects. Depth Hints are complementary depth suggestions obtained from
simple off-the-shelf stereo algorithms. These hints enhance an existing
photometric loss function, and are used to guide a network to learn better
weights. They require no additional data, and are assumed to be right only
sometimes. We show that using our Depth Hints gives a substantial boost when
training several leading self-supervised-from-stereo models, not just our own.
Further, combined with other good practices, we produce state-of-the-art depth
predictions on the KITTI benchmark.
Comment: Accepted to ICCV 201
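The per-pixel hint selection the abstract describes can be sketched as follows: trust the stereo hint only where it explains the image better than the network's current prediction, and add supervision toward it there. This is a minimal illustration; the function name and the log-L1 supervised term are simplifications of the paper's exact formulation.

```python
import numpy as np

def depth_hints_loss(photo_pred, photo_hint, depth_pred, depth_hint):
    """Per-pixel loss selection in the spirit of Depth Hints (sketch).

    photo_pred : photometric reprojection loss using the network's depth
    photo_hint : photometric reprojection loss using the stereo hint's depth
    depth_pred, depth_hint : the two depth maps themselves
    All arrays share the same HxW shape.
    """
    # The hint is "assumed to be right only sometimes": use it only where it
    # yields a lower reprojection loss than the current prediction.
    hint_better = photo_hint < photo_pred
    # Supervised log-L1 term pulling the prediction toward the hint.
    sup = np.abs(np.log(depth_pred) - np.log(depth_hint))
    # Everywhere: keep the usual photometric loss; add supervision where hinted.
    per_pixel = photo_pred + np.where(hint_better, sup, 0.0)
    return per_pixel.mean()
```

Because the hint only ever adds a term where it already wins the photometric comparison, a wrong hint cannot degrade pixels where the network is doing better.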
Enabling monocular depth perception at the very edge
Depth estimation is crucial in several computer vision applications, and a recent trend aims at inferring this cue from a single camera through computationally demanding CNNs, precluding their practical deployment in application contexts characterized by low-power constraints. To this end, we develop a tiny network tailored to microcontrollers, processing low-resolution images to obtain a coarse depth map of the observed scene. Our solution enables depth perception with minimal power requirements (a few hundred mW), accurately enough to pave the way for several high-level applications at the edge
Single-Image Depth Prediction Makes Feature Matching Easier
Good local features improve the robustness of many 3D re-localization and
multi-view reconstruction pipelines. The problem is that viewing angle and
distance severely impact the recognizability of a local feature. Attempts to
improve appearance invariance by choosing better local feature points or by
leveraging outside information, have come with pre-requisites that made some of
them impractical. In this paper, we propose a surprisingly effective
enhancement to local feature extraction, which improves matching. We show that
CNN-based depths inferred from single RGB images are quite helpful, despite
their flaws. They allow us to pre-warp images and rectify perspective
distortions, to significantly enhance SIFT and BRISK features, enabling more
good matches, even when cameras are looking at the same scene but in opposite
directions.
Comment: 14 pages, 7 figures, accepted for publication at the European
Conference on Computer Vision (ECCV) 202
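The pre-warping idea can be illustrated with a toy depth-guided rectification: back-project the depth map under a pinhole model with intrinsics K, fit the dominant scene plane, and build the pure-rotation homography that makes it fronto-parallel before extracting SIFT/BRISK features. This is a hypothetical sketch; the paper's actual warping scheme may differ in detail.

```python
import numpy as np

def fronto_parallel_homography(depth, K):
    """Illustrative depth-guided rectification: a homography that rotates
    the camera to face the dominant plane of the scene (sketch only)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project pixels to 3D camera coordinates.
    pts = np.stack([(u - K[0, 2]) / K[0, 0] * depth,
                    (v - K[1, 2]) / K[1, 1] * depth,
                    depth], axis=-1).reshape(-1, 3)
    # Fit a plane by least squares: its normal is the direction of least
    # variance of the centered point cloud.
    centered = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    n = vt[-1]
    if n[2] < 0:
        n = -n                      # make the normal face the camera
    # Rodrigues rotation aligning n with the optical axis (assumes n != -z).
    z = np.array([0.0, 0.0, 1.0])
    vax = np.cross(n, z)
    c = float(n @ z)
    vx = np.array([[0.0, -vax[2], vax[1]],
                   [vax[2], 0.0, -vax[0]],
                   [-vax[1], vax[0], 0.0]])
    R = np.eye(3) + vx + vx @ vx / (1.0 + c)
    # Pure-rotation homography in pixel coordinates.
    return K @ R @ np.linalg.inv(K)
```

For a fronto-parallel scene (constant depth) the fitted normal already points along the optical axis, so the homography reduces to the identity, as expected.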
2T-UNET: A Two-Tower UNet with Depth Clues for Robust Stereo Depth Estimation
Stereo correspondence matching is an essential part of the multi-step stereo
depth estimation process. This paper revisits the depth estimation problem,
avoiding the explicit stereo matching step using a simple two-tower
convolutional neural network. The proposed algorithm is named 2T-UNet.
The idea behind 2T-UNet is to replace cost volume construction with twin
convolution towers, which are allowed to have different weights. Additionally,
the inputs to the twin encoders in 2T-UNet differ from those of existing stereo
methods. Generally, a stereo network takes a left and right image pair as input
to determine the scene geometry. In the 2T-UNet model, however, the right
stereo image is taken as one input, and the left stereo image, along with its
monocular depth clue, is taken as the other. Depth clues provide complementary
suggestions that help enhance the quality of the predicted scene geometry.
2T-UNet surpasses state-of-the-art monocular and stereo depth estimation
methods on the challenging Scene Flow dataset, both quantitatively and
qualitatively. The architecture performs remarkably well on complex natural
scenes, highlighting its usefulness for various real-time applications.
Pretrained weights and code will be made readily available
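The asymmetry between the two tower inputs can be sketched as simple channel assembly (shapes only; the network layers themselves are omitted, and preprocessing details in the paper may differ):

```python
import numpy as np

def build_tower_inputs(left_rgb, right_rgb, mono_depth_left):
    """Assemble the two asymmetric 2T-UNet encoder inputs (sketch).

    left_rgb, right_rgb : (H, W, 3) stereo images
    mono_depth_left     : (H, W) monocular depth clue for the left view
    """
    # One tower sees the left image stacked with its monocular depth clue.
    left_input = np.concatenate(
        [left_rgb, mono_depth_left[..., None]], axis=-1)   # (H, W, 4)
    # The other tower sees the right image alone.
    right_input = right_rgb                                # (H, W, 3)
    return left_input, right_input
```

Since the two inputs have different channel counts, the twin encoders cannot share weights, which is consistent with the towers being allowed different weights.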
FG-Depth: Flow-Guided Unsupervised Monocular Depth Estimation
Many works have demonstrated the great potential of unsupervised monocular
depth estimation, owing to its low annotation cost and accuracy comparable to
supervised methods. To further improve performance, recent works mainly focus
on designing more complex network structures and exploiting extra supervised
information, e.g., semantic segmentation. These methods optimize their models
by exploiting, to varying degrees, the reconstruction relationship between the
target and reference images. However, previous works have shown that this
image reconstruction optimization is prone to getting trapped in local minima.
In this paper, our core idea is to guide the optimization with prior knowledge
from a pretrained Flow-Net. We show that the bottleneck of unsupervised
monocular depth estimation can be broken with our simple but effective
framework, named FG-Depth. In particular, we propose (i) a flow distillation
loss to replace the typical photometric loss, which limits the capacity of the
model, and (ii) a prior-flow-based mask to remove invalid pixels that
introduce noise into the training loss. Extensive experiments demonstrate the
effectiveness of each component, and our approach achieves state-of-the-art
results on both the KITTI and NYU-Depth-v2 datasets.
Comment: Accepted by ICRA202
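The two proposed ingredients can be sketched together: an L1 distillation loss toward a pretrained flow network's output, masked by a validity check on the prior flow. In practice the student flow would be the rigid flow induced by the predicted depth and camera pose; the threshold and masking rule below are hypothetical, not the paper's exact ones.

```python
import numpy as np

def flow_distillation_loss(flow_student, flow_teacher, max_mag=100.0):
    """Masked flow distillation loss in the spirit of FG-Depth (sketch).

    flow_student : (H, W, 2) rigid flow from predicted depth and pose
    flow_teacher : (H, W, 2) flow from a pretrained flow network
    """
    # (ii) prior-flow mask: drop pixels whose teacher flow is implausible
    # (e.g. huge magnitudes from occlusions or moving objects).
    mag = np.linalg.norm(flow_teacher, axis=-1)
    valid = mag < max_mag
    # (i) L1 distillation toward the teacher flow, averaged over valid pixels.
    l1 = np.abs(flow_student - flow_teacher).sum(axis=-1)
    return (l1 * valid).sum() / np.maximum(valid.sum(), 1)
```

Replacing the photometric loss with this distillation term sidesteps the local minima of image reconstruction, since the teacher flow provides a dense, unambiguous regression target.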
Detaching and Boosting: Dual Engine for Scale-Invariant Self-Supervised Monocular Depth Estimation
Monocular depth estimation (MDE) in the self-supervised scenario has emerged
as a promising method, as it avoids the need for ground-truth depth. Despite
continuous efforts, MDE remains sensitive to scale changes, especially when
all the training samples come from a single camera; it deteriorates further
when camera movement causes heavy coupling between the predicted depth and the
scale change. In this paper, we present a scale-invariant approach for
self-supervised MDE, in which scale-sensitive features (SSFs) are detached
while scale-invariant features (SIFs) are boosted. Specifically, a simple but
effective data augmentation that imitates the camera zooming process is
proposed to detach SSFs, making the model robust to scale changes. In
addition, a dynamic cross-attention module is designed to boost SIFs by
adaptively fusing multi-scale cross-attention features. Extensive experiments
on the KITTI dataset demonstrate that the detaching and boosting strategies
are mutually complementary in MDE, and our approach achieves new
state-of-the-art performance, improving the absolute relative error over
existing works from 0.097 to 0.090. The code will be made public soon.
Comment: Accepted by IEEE Robotics and Automation Letters (RA-L)
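The zoom-imitating augmentation can be sketched as a center crop of 1/f of the frame followed by resizing back to the original resolution; a toy nearest-neighbour version follows (the paper's implementation details may differ):

```python
import numpy as np

def zoom_augment(img, f):
    """Imitate a camera zoom-in by factor f > 1 (sketch).

    Crops the central 1/f of the frame and resizes it back to the original
    resolution with nearest-neighbour sampling.
    """
    h, w = img.shape[:2]
    ch, cw = int(h / f), int(w / f)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    crop = img[y0:y0 + ch, x0:x0 + cw]
    # Nearest-neighbour resize back to (h, w).
    yi = (np.arange(h) * ch / h).astype(int)
    xi = (np.arange(w) * cw / w).astype(int)
    return crop[yi][:, xi]
```

Training on pairs of original and zoomed views exposes the network to the same scene at different apparent scales, which is what lets the scale-sensitive features be detached.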
Benchmarking the Extraction of 3D Geometry from UAV Images with Deep Learning Methods
3D reconstruction from single and multi-view stereo images is still an open research topic, despite the high number of solutions proposed in recent decades. The surge of deep learning methods has stimulated the development of new approaches using monocular (MDE, Monocular Depth Estimation), stereoscopic and Multi-View Stereo (MVS) 3D reconstruction, showing promising results, often comparable to or even better than those of traditional methods. The more recent development of NeRF (Neural Radiance Fields) has further triggered interest in this kind of solution. Most of the proposed approaches, however, focus on terrestrial applications (e.g., autonomous driving or the 3D reconstruction of small artefacts), while airborne and UAV acquisitions are often overlooked. The recent introduction of new datasets, such as UseGeo, has therefore given the opportunity to assess how state-of-the-art MDE, MVS and NeRF 3D reconstruction algorithms perform on airborne UAV images, allowing their comparison with LiDAR ground truth. This paper presents the results achieved by two MDE, two MVS and two NeRF approaches leveraging deep learning, trained and tested on the UseGeo dataset. This work enables a comparison with ground truth, showing the current state of the art of these solutions and providing useful indications for their future development and improvement
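Comparisons against LiDAR ground truth typically rely on standard depth-evaluation metrics such as absolute relative error, RMSE and the δ < 1.25 accuracy; a generic sketch follows (not the UseGeo benchmark's exact protocol):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Common MDE evaluation against (e.g. LiDAR) ground truth (sketch).

    Invalid ground-truth pixels (zeros) are excluded, since LiDAR ground
    truth is typically sparse.
    """
    mask = gt > 0
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)               # absolute relative error
    rmse = np.sqrt(np.mean((p - g) ** 2))              # root mean squared error
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)  # ratio-accuracy delta_1
    return abs_rel, rmse, delta1
```

The sparse-validity mask matters: dense predictions must be compared only where the LiDAR actually returned a measurement, otherwise empty pixels dominate the averages.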