Learning Depth from Monocular Videos using Direct Methods
The ability to predict depth from a single image - using recent advances in
CNNs - is of increasing interest to the vision community. Unsupervised
learning strategies are particularly appealing, as they can utilize much
larger and more varied monocular video datasets during training without the need for
ground truth depth or stereo. In previous works, separate pose and depth CNN
predictors had to be determined such that their joint outputs minimized the
photometric error. Inspired by recent advances in direct visual odometry (DVO),
we argue that the depth CNN predictor can be learned without a pose CNN
predictor. Further, we demonstrate empirically that incorporating a
differentiable implementation of DVO, along with a novel depth normalization
strategy, substantially improves performance over state-of-the-art methods that
use monocular videos for training.
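The photometric objective these self-supervised methods minimize can be sketched as follows: back-project each target pixel with the predicted depth, transform it by the relative pose, reproject it into the source frame, and penalize the per-pixel appearance difference. This is a minimal NumPy sketch under assumed pinhole geometry; the function names and the plain L1 penalty are illustrative (real systems combine SSIM with L1 and sample the source image bilinearly).

```python
import numpy as np

def reproject(depth, K, T):
    """Project every target pixel into the source view.

    depth : (H, W) predicted depth for the target frame
    K     : (3, 3) camera intrinsics
    T     : (4, 4) relative pose (target -> source)
    Returns (H, W, 2) source-image pixel coordinates.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates, one column per pixel (row-major order).
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    # Back-project to 3D camera coordinates using the predicted depth.
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    # Transform into the source camera and project with the intrinsics.
    src = K @ (T @ cam_h)[:3]
    return (src[:2] / src[2]).T.reshape(H, W, 2)

def photometric_l1(target, source_sampled):
    # Per-pixel L1 photometric error between the target image and the
    # source image sampled at the reprojected coordinates.
    return np.abs(target - source_sampled).mean()
```

With an identity pose and constant depth, each pixel reprojects onto itself, so the warped source equals the source and the photometric error of a static scene is zero; training drives the depth (and pose) estimates toward values that make this hold for real frame pairs.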
Self-Supervised Monocular Depth Hints
Monocular depth estimators can be trained with various forms of
self-supervision from binocular-stereo data to circumvent the need for
high-quality laser scans or other ground-truth data. The disadvantage, however,
is that the photometric reprojection losses used with self-supervised learning
typically have multiple local minima. These plausible-looking alternatives to
ground truth can restrict what a regression network learns, causing it to
predict depth maps of limited quality. As one prominent example, depth
discontinuities around thin structures are often incorrectly estimated by
current state-of-the-art methods.
Here, we study the problem of ambiguous reprojections in depth prediction
from stereo-based self-supervision, and introduce Depth Hints to alleviate
their effects. Depth Hints are complementary depth suggestions obtained from
simple off-the-shelf stereo algorithms. These hints enhance an existing
photometric loss function, and are used to guide a network to learn better
weights. They require no additional data, and are assumed to be right only
sometimes. We show that using our Depth Hints gives a substantial boost when
training several leading self-supervised-from-stereo models, not just our own.
Further, combined with other good practices, we produce state-of-the-art depth
predictions on the KITTI benchmark.
Comment: Accepted to ICCV 2019
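The core idea of the hint-guided loss can be sketched per pixel: warp once with the network's depth and once with the stereo hint; wherever the hint reprojects better, trust it and add a supervised depth-regression term there. This is an illustrative NumPy sketch of the idea, not the authors' exact implementation; the log-L1 supervised term and the variable names are assumptions.

```python
import numpy as np

def depth_hints_loss(loss_pred, loss_hint, pred_depth, hint_depth):
    """Hint-guided loss (sketch).

    loss_pred : (H, W) photometric loss when warping with the predicted depth
    loss_hint : (H, W) photometric loss when warping with the stereo hint depth
    """
    # The hint is "right" at a pixel only when it reprojects better.
    trust_hint = loss_hint < loss_pred
    # Supervised regression toward the hint, in log-depth space.
    supervised = np.abs(np.log(pred_depth) - np.log(hint_depth))
    extra = np.where(trust_hint, supervised, 0.0)
    return loss_pred.mean() + extra.mean()
```

Because the supervision is gated per pixel by the photometric comparison, hints that are wrong at a given pixel are simply ignored there, which is how the method tolerates hints that are "right only sometimes".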
UAMD-Net: A Unified Adaptive Multimodal Neural Network for Dense Depth Completion
Depth prediction is a critical problem in robotics applications, especially
autonomous driving. Generally, depth prediction based on binocular stereo
matching and fusion of monocular image and laser point cloud are two mainstream
methods. However, the former usually suffers from overfitting when building the
cost volume, while the latter generalizes poorly due to the lack of geometric
constraints. To solve these problems, we propose a novel multimodal neural
network, UAMD-Net, for dense depth completion based on the fusion of binocular
stereo matching and the weak constraint provided by sparse point clouds.
Specifically, the sparse point clouds are converted to a sparse depth map and
fed, together with the binocular images, to the multimodal feature encoder
(MFE), constructing a cross-modal cost volume. This volume is then further
processed by the multimodal feature aggregator (MFA) and the depth regression
layer. Furthermore, the
existing multimodal methods ignore the problem of modal dependence, that is,
the network fails when one of its modal inputs is missing or corrupted.
Therefore, we propose a new training strategy called Modal-dropout, which
enables the network to be trained adaptively with multiple modal inputs and to
perform inference with only a subset of them. Benefiting from the flexible
network structure and
adaptive training method, our proposed network can realize unified training
under various modal input conditions. Comprehensive experiments conducted on
KITTI depth completion benchmark demonstrate that our method produces robust
results and outperforms other state-of-the-art methods.
Comment: 11 pages, 4 figures
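A modal-dropout training strategy of the kind the abstract describes can be sketched as follows: on each training step, randomly zero one modality's input (but never all of them), so the network learns to produce depth from whichever modalities are present at inference time. This is a minimal sketch of the general idea, not the paper's implementation; the function name, drop probability, and zero-fill choice are assumptions.

```python
import numpy as np

def modal_dropout(rgb, sparse_depth, p_drop=0.3, rng=None):
    """Randomly suppress one input modality during training (sketch).

    With probability p_drop the sparse-depth input is zeroed; otherwise,
    with probability p_drop the RGB input is zeroed. At most one modality
    is dropped per call, so the network always sees some valid input.
    """
    if rng is None:
        rng = np.random.default_rng()
    drop_depth = rng.random() < p_drop
    drop_rgb = (not drop_depth) and rng.random() < p_drop
    if drop_depth:
        sparse_depth = np.zeros_like(sparse_depth)  # simulate missing LiDAR
    if drop_rgb:
        rgb = np.zeros_like(rgb)                    # simulate missing camera
    return rgb, sparse_depth
```

At inference, the same network can then be fed only the modalities actually available (e.g. camera only, or camera plus LiDAR) without retraining, which is the "unified training, specific inference" behavior the abstract claims.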