A Large RGB-D Dataset for Semi-supervised Monocular Depth Estimation
Current self-supervised methods for monocular depth estimation are largely
based on deeply nested convolutional networks that leverage stereo image pairs
or monocular sequences during a training phase. However, they often exhibit
inaccurate results around occluded regions and depth boundaries. In this paper,
we present a simple yet effective approach for monocular depth estimation using
stereo image pairs. We propose a student-teacher strategy in which a shallow
student network is trained with auxiliary information obtained from a deeper
and more accurate teacher network. Specifically, we
first train the stereo teacher network by fully utilizing the binocular
perception of 3-D geometry and then use the depth predictions of the teacher
network to train the student network for monocular depth inference. This
enables us to exploit all available depth data from massive unlabeled stereo
pairs. We also propose a data-ensemble strategy that merges the multiple depth
predictions of the teacher network, improving the training samples by
collecting non-trivial knowledge beyond a single prediction. To
refine the inaccurate depth estimates used when training the student network,
we further propose a stereo confidence-guided regression loss that handles
unreliable pseudo depth values in occlusions, texture-less regions, and
repetitive patterns. To complement existing datasets of outdoor
driving scenes, we built a novel large-scale dataset consisting of one million
outdoor stereo images taken using hand-held stereo cameras. Finally, we
show that the monocular depth estimation network provides feature
representations that are suitable for high-level vision tasks. The experimental
results for various outdoor scenarios demonstrate the effectiveness and
flexibility of our approach, which outperforms state-of-the-art methods.
Comment: https://dimlrgbd.github.io
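As an illustration of the stereo confidence-guided regression loss described above, here is a minimal PyTorch sketch. It assumes a per-pixel confidence map in [0, 1] produced alongside the teacher's pseudo depths; the function name and weighting scheme are hypothetical, not the paper's exact formulation.

```python
# Hypothetical sketch: pseudo-depth regression down-weighted by stereo
# confidence, so unreliable pixels (occlusions, texture-less regions,
# repetitive patterns) contribute less to the student's training signal.
import torch

def confidence_guided_loss(student_depth, teacher_depth, confidence, eps=1e-6):
    """L1 distance to the teacher's pseudo depth, weighted per pixel by the
    stereo confidence map (values in [0, 1])."""
    per_pixel = torch.abs(student_depth - teacher_depth)
    weighted = confidence * per_pixel
    return weighted.sum() / (confidence.sum() + eps)

# The pseudo depths would come from the (ensembled) stereo teacher network.
student_depth = torch.rand(4, 1, 240, 320)   # student prediction
teacher_depth = torch.rand(4, 1, 240, 320)   # teacher pseudo ground truth
confidence    = torch.rand(4, 1, 240, 320)   # stereo confidence map
loss = confidence_guided_loss(student_depth, teacher_depth, confidence)
```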
GAL: A Global-Attributes Assisted Labeling System for Outdoor Scenes
An approach that extracts global attributes from outdoor images to facilitate
geometric layout labeling is investigated in this work. The proposed
Global-attributes Assisted Labeling (GAL) system exploits both local features
and global attributes. First, by following a classical method, we use local
features to provide initial labels for all super-pixels. Then, we develop a set
of techniques to extract global attributes from 2D outdoor images. They include
sky lines, ground lines, vanishing lines, etc. Finally, we propose the GAL
system that integrates global attributes in the conditional random field (CRF)
framework to improve initial labels so as to offer a more robust labeling
result. The performance of the proposed GAL system is demonstrated and
benchmarked against several state-of-the-art algorithms on a popular outdoor
scene layout dataset.
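To make the idea of combining local unary labels with global attributes concrete, here is a toy CRF-style energy in Python. The labels, penalty value, and exhaustive minimization are illustrative assumptions, not GAL's actual potentials or inference.

```python
# Toy sketch: local unary costs plus a global-attribute term that penalizes
# "sky" labels below a detected sky line and "ground" labels above it.
import itertools
import numpy as np

LABELS = ["sky", "ground", "vertical"]   # simplified geometric classes

def energy(assignment, unary, rows, sky_line_row, penalty=10.0):
    e = sum(unary[i][LABELS.index(l)] for i, l in enumerate(assignment))
    for l, row in zip(assignment, rows):
        if (l == "sky" and row > sky_line_row) or \
           (l == "ground" and row < sky_line_row):
            e += penalty                 # label inconsistent with the sky line
    return e

# Three super-pixels with initial (local-feature) costs per label.
unary = np.array([[0.2, 1.0, 0.9],      # super-pixel 0 prefers "sky"
                  [0.9, 0.3, 0.8],      # prefers "ground"
                  [0.8, 0.9, 0.1]])     # prefers "vertical"
rows = [10, 200, 120]                   # vertical positions in the image
best = min(itertools.product(LABELS, repeat=3),
           key=lambda a: energy(a, unary, rows, sky_line_row=80))
print(best)                             # exhaustive search for a toy example
```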
Monocular Depth Estimation with Augmented Ordinal Depth Relationships
Most existing algorithms for depth estimation from single monocular images
need large quantities of metric ground-truth depths for supervised learning. We
show that relative depth can be an informative cue for metric depth estimation
and can be easily obtained from vast stereo videos. Acquiring metric depths
from stereo videos is sometimes impracticable due to the absence of camera
parameters. In this paper, we propose to improve the performance of metric
depth estimation with relative depths collected from stereo movie videos using
an existing stereo matching algorithm. We introduce a new "Relative Depth in
Stereo" (RDIS) dataset densely labelled with relative depths. We first pretrain
a ResNet model on our RDIS dataset. Then we finetune the model on RGB-D
datasets with metric ground-truth depths. During our finetuning, we formulate
depth estimation as a classification task. This re-formulation scheme enables
us to obtain the confidence of a depth prediction in the form of a probability
distribution. With this confidence, we propose an information gain loss to make
use of the predictions that are close to ground-truth. We evaluate our approach
on both indoor and outdoor benchmark RGB-D datasets and achieve
state-of-the-art performance.
Comment: 10 pages
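A hedged sketch of what an information gain loss over discretized depth bins could look like: instead of one-hot cross-entropy, bins near the ground-truth bin also receive credit through a Gaussian soft target. The soft-target shape and the alpha parameter are assumptions for illustration, not the paper's exact loss.

```python
# Hypothetical information-gain style loss: a Gaussian soft target gives
# partial credit to predictions that fall near the ground-truth depth bin.
import torch
import torch.nn.functional as F

def information_gain_loss(logits, gt_bins, num_bins, alpha=0.2):
    """logits: (N, num_bins) bin scores; gt_bins: (N,) ground-truth bin ids."""
    bins = torch.arange(num_bins, dtype=torch.float32)            # (num_bins,)
    dist2 = (bins.unsqueeze(0) - gt_bins.unsqueeze(1).float()) ** 2
    soft_target = torch.exp(-alpha * dist2)                       # (N, num_bins)
    soft_target = soft_target / soft_target.sum(dim=1, keepdim=True)
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_target * log_probs).sum(dim=1).mean()

logits = torch.randn(8, 50)              # 8 pixels, 50 depth bins
gt = torch.randint(0, 50, (8,))
loss = information_gain_loss(logits, gt, num_bins=50)
```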
Estimating Depth from Monocular Images as Classification Using Deep Fully Convolutional Residual Networks
Depth estimation from single monocular images is a key component of scene
understanding and has benefited largely from deep convolutional neural networks
(CNNs) recently. In this article, we take advantage of recent deep residual
networks and propose a simple yet effective approach to this problem. We
formulate depth estimation as a pixel-wise classification task. Specifically,
we first discretize the continuous depth values into multiple bins and label
the bins according to their depth range. Then we train fully convolutional deep
residual networks to predict the depth label of each pixel. Performing discrete
depth label classification instead of continuous depth value regression allows
us to predict confidence in the form of a probability distribution. We further
apply fully-connected conditional random fields (CRFs) as a post-processing
step to enforce local smoothness interactions, which improves the results. We
evaluate our approach on both indoor and outdoor datasets and achieve
state-of-the-art performance.
Comment: Accepted to IEEE Transactions on Circuits and Systems for Video Technology
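As a sketch of the discretization step, the following quantizes metric depths into bins (log-space spacing is a common choice; the paper's exact scheme may differ) and trains per-pixel classification with standard cross-entropy.

```python
# Sketch: quantize continuous depths into log-spaced bins and train a
# fully convolutional network with per-pixel cross-entropy over the bins.
import math
import torch
import torch.nn.functional as F

def depth_to_bins(depth, d_min=0.5, d_max=80.0, num_bins=50):
    """Map metric depths to discrete labels in [0, num_bins)."""
    t = (torch.log(depth.clamp(d_min, d_max)) - math.log(d_min)) \
        / (math.log(d_max) - math.log(d_min))
    return (t * (num_bins - 1)).round().long()

depth_gt = torch.rand(2, 240, 320) * 79.5 + 0.5   # fake metric depths
labels = depth_to_bins(depth_gt)                  # (2, 240, 320) bin ids
logits = torch.randn(2, 50, 240, 320)             # per-pixel bin scores
loss = F.cross_entropy(logits, labels)            # dense classification loss
```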
Monocular Depth Estimation with Hierarchical Fusion of Dilated CNNs and Soft-Weighted-Sum Inference
Monocular depth estimation is a challenging task in complex compositions
depicting multiple objects of diverse scales. Despite the recent great progress
driven by deep convolutional neural networks (CNNs), state-of-the-art monocular
depth estimation methods still fall short in handling such challenging
real-world scenarios. In this paper, we propose a deep end-to-end learning
framework to tackle these challenges, which learns the direct mapping from a
color image to the corresponding depth map. First, we represent monocular depth
estimation as a multi-category dense labeling task, in contrast to the
regression-based formulation. In this way, we can build upon the recent
progress in dense labeling such as semantic segmentation. Second, we fuse
different side-outputs from our front-end dilated convolutional neural network
in a hierarchical way to exploit the multi-scale depth cues for depth
estimation, which is critical for achieving scale-aware depth estimation. Third,
we propose to utilize soft-weighted-sum inference instead of the hard-max
inference, transforming discretized depth scores into continuous depth values.
Thus, we reduce the influence of quantization error and improve the robustness
of our method. Extensive experiments on the NYU Depth V2 and KITTI datasets
show the superiority of our method compared with current state-of-the-art
methods. Furthermore, experiments on the NYU V2 dataset reveal that our model
is able to learn the probability distribution of depth.
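The soft-weighted-sum inference has a compact form: the continuous depth is the probability-weighted sum of bin centers rather than the arg-max bin. A minimal sketch, with linearly spaced bin centers assumed for illustration:

```python
# Soft-weighted-sum inference: expectation over the per-pixel bin
# distribution instead of a hard arg-max, reducing quantization error.
import torch
import torch.nn.functional as F

def soft_weighted_sum(logits, bin_centers):
    """logits: (N, K, H, W) bin scores; bin_centers: (K,) depth per bin."""
    probs = F.softmax(logits, dim=1)
    return (probs * bin_centers.view(1, -1, 1, 1)).sum(dim=1)   # (N, H, W)

centers = torch.linspace(0.5, 80.0, 50)      # assumed linear bin centers
logits = torch.randn(1, 50, 240, 320)
depth_soft = soft_weighted_sum(logits, centers)
depth_hard = centers[logits.argmax(dim=1)]   # hard-max baseline for contrast
```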
OmniDepth: Dense Depth Estimation for Indoors Spherical Panoramas
Work on depth estimation has so far focused only on projective images,
ignoring 360 content, which is now increasingly and more easily produced.
We show that monocular depth estimation models trained on traditional images
produce sub-optimal results on omnidirectional images, showcasing the need for
training directly on 360 datasets, which, however, are hard to acquire. In this
work, we circumvent the challenges associated with acquiring high quality 360
datasets with ground truth depth annotations, by re-using recently released
large scale 3D datasets and re-purposing them to 360 via rendering. This
dataset, which is considerably larger than similar projective datasets, is
publicly offered to the community to enable future research in this direction.
We use this dataset to learn in an end-to-end fashion the task of depth
estimation from 360 images. We show promising results in our synthesized data
as well as in unseen realistic images.
Comment: Pre-print to appear in ECCV18
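The core of re-purposing 3D datasets to 360 via rendering is the equirectangular camera model: every output pixel corresponds to a ray direction on the unit sphere. A minimal sketch of that mapping (the surrounding ray-casting pipeline is assumed, not shown):

```python
# Sketch of the equirectangular camera model used for 360 rendering:
# each panorama pixel maps to a unit ray direction on the sphere.
import numpy as np

def equirect_ray_directions(height, width):
    """(H, W, 3) unit ray directions; longitude spans [-pi, pi] across
    columns, latitude spans [pi/2, -pi/2] down rows."""
    lon = (np.arange(width) + 0.5) / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(height) + 0.5) / height * np.pi
    lon, lat = np.meshgrid(lon, lat)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)

rays = equirect_ray_directions(256, 512)   # one ray per panorama pixel
```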
Visualization of Convolutional Neural Networks for Monocular Depth Estimation
Recently, convolutional neural networks (CNNs) have shown great success on
the task of monocular depth estimation. A fundamental yet unanswered question
is how CNNs can infer depth from a single image. Toward answering this
question, we visualize the inference of a CNN by identifying the pixels of an
input image that are relevant to depth estimation. We formulate this as an
optimization problem of identifying the smallest number of image pixels from
which the CNN can estimate a depth map with the minimum difference from the
estimate from the entire image. To cope with the difficulty of optimizing
through a deep CNN, we propose using another network to predict those relevant
image pixels in a forward computation. In our experiments, we first show the
effectiveness of this approach, and then apply it to different depth estimation
networks on indoor and outdoor scene datasets. The results provide several
findings that help in exploring the above question.
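A hedged sketch of the underlying objective: a predicted mask should select as few pixels as possible while letting the frozen depth network reproduce its full-image estimate. The stand-in network, norms, and lambda weight are assumptions for illustration.

```python
import torch
import torch.nn as nn

depth_net = nn.Conv2d(3, 1, 3, padding=1)     # stand-in for a trained depth CNN
for p in depth_net.parameters():
    p.requires_grad_(False)                   # the depth network stays frozen

def relevance_loss(image, mask, lam=0.1):
    full = depth_net(image)                   # estimate from the whole image
    sparse = depth_net(image * mask)          # estimate from selected pixels
    fidelity = torch.abs(sparse - full).mean()
    sparsity = mask.abs().mean()              # push toward few selected pixels
    return fidelity + lam * sparsity

image = torch.rand(1, 3, 64, 64)
mask = torch.rand(1, 1, 64, 64, requires_grad=True)  # predicted by a network in practice
loss = relevance_loss(image, mask)
loss.backward()                               # gradients reach only the mask
```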
A Compromise Principle in Deep Monocular Depth Estimation
Monocular depth estimation, which plays a key role in understanding 3D scene
geometry, is fundamentally an ill-posed problem. Existing methods based on deep
convolutional neural networks (DCNNs) have addressed this problem by training
convolutional networks to estimate continuous depth maps from monocular images.
However, we find that training a network to predict a high spatial resolution
continuous depth map often suffers from poor local solutions. In this paper, we
hypothesize that achieving a compromise between spatial and depth resolutions
can improve network training. Based on this "compromise principle", we propose
a regression-classification cascaded network (RCCN), which consists of a
regression branch predicting a low spatial resolution continuous depth map and
a classification branch predicting a high spatial resolution discrete depth
map. The two branches form a cascaded structure allowing the classification and
regression branches to benefit from each other. By leveraging large-scale raw
training datasets and some data augmentation strategies, our network achieves
top or state-of-the-art results on the NYU Depth V2, KITTI, and Make3D
benchmarks.
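A toy sketch of the cascaded two-branch head: a low-resolution continuous regression branch whose output conditions a high-resolution discrete classification branch. Layer sizes and the cascade wiring are illustrative assumptions, not RCCN's actual architecture.

```python
# Toy two-branch "compromise" head: low-resolution continuous regression
# cascaded into high-resolution discrete classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCCNHead(nn.Module):
    def __init__(self, in_ch=64, num_bins=50):
        super().__init__()
        # regression branch: continuous depth at half spatial resolution
        self.reg = nn.Sequential(nn.AvgPool2d(2),
                                 nn.Conv2d(in_ch, 1, 3, padding=1))
        # classification branch: discrete bins at full resolution,
        # conditioned on the upsampled regression output (the cascade link)
        self.cls = nn.Conv2d(in_ch + 1, num_bins, 3, padding=1)

    def forward(self, feats):
        coarse = self.reg(feats)                            # (N, 1, H/2, W/2)
        up = F.interpolate(coarse, scale_factor=2.0)        # back to (H, W)
        logits = self.cls(torch.cat([feats, up], dim=1))    # (N, K, H, W)
        return coarse, logits

head = RCCNHead()
coarse, logits = head(torch.rand(1, 64, 120, 160))
```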
FishNet: A Camera Localizer using Deep Recurrent Networks
This paper proposes a robust localization system that employs deep learning
for better scene representation, and enhances the accuracy of 6-DOF camera pose
estimation. Inspired by the fact that global scene structure can be revealed by
a wide field of view, we leverage the large overlap of a fisheye camera between
adjacent frames, and the powerful high-level feature representations of deep
learning. Our main contribution is the novel network architecture that extracts
both temporal and spatial information using a Recurrent Neural Network.
Specifically, we propose a novel pose regularization term combined with LSTM.
This leads to smoother pose estimation, especially for large outdoor scenery.
Promising experimental results on three benchmark datasets demonstrate the
effectiveness of the proposed approach.
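A minimal sketch of the temporal part: per-frame CNN features pass through an LSTM before pose regression, and a smoothness regularizer penalizes jumps between consecutive pose predictions. The feature size, pose parametrization, and regularizer form are assumptions, not the paper's exact term.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=256, hidden_size=128, batch_first=True)
fc = nn.Linear(128, 7)                 # 3-D translation + unit quaternion

feats = torch.rand(1, 10, 256)         # per-frame CNN features (10 frames)
h, _ = lstm(feats)                     # temporal aggregation over the clip
poses = fc(h)                          # (1, 10, 7): one pose per frame

# Adjacent fisheye frames overlap heavily, so consecutive poses should
# change smoothly; penalize large frame-to-frame pose differences.
pose_reg = (poses[:, 1:] - poses[:, :-1]).pow(2).mean()
```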
VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry
Semantic understanding and localization are fundamental enablers of robot
autonomy that have for the most part been tackled as disjoint problems. While
deep learning has enabled recent breakthroughs across a wide spectrum of scene
understanding tasks, its applicability to state estimation tasks has been
limited due to the direct formulation that renders it incapable of encoding
scene-specific constraints. In this work, we propose the VLocNet++ architecture
that employs a multitask learning approach to exploit the inter-task
relationship between learning semantics, regressing the 6-DoF global pose, and
estimating odometry, for the mutual benefit of each of these tasks. Our network overcomes
the aforementioned limitation by simultaneously embedding geometric and
semantic knowledge of the world into the pose regression network. We propose a
novel adaptive weighted fusion layer to aggregate motion-specific temporal
information and to fuse semantic features into the localization stream based on
region activations. Furthermore, we propose a self-supervised warping technique
that uses the relative motion to warp intermediate network representations in
the segmentation stream for learning consistent semantics. Finally, we
introduce a first-of-its-kind urban outdoor localization dataset with pixel-level
semantic labels and multiple loops for training deep networks. Extensive
experiments on the challenging Microsoft 7-Scenes benchmark and our DeepLoc
dataset demonstrate that our approach exceeds the state of the art,
outperforming local feature-based methods while simultaneously performing
multiple tasks and exhibiting substantial robustness in challenging scenarios.
Comment: Demo and dataset available at http://deeploc.cs.uni-freiburg.de
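A toy sketch of an adaptive weighted fusion layer: fusion weights are predicted from the activations themselves, gating semantic features into the localization stream per region. This is a simplification for illustration, not VLocNet++'s exact layer.

```python
import torch
import torch.nn as nn

class AdaptiveWeightedFusion(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # predict per-pixel fusion weights from both streams' activations
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, ch, 1), nn.Sigmoid())

    def forward(self, loc_feats, sem_feats):
        w = self.gate(torch.cat([loc_feats, sem_feats], dim=1))
        return loc_feats + w * sem_feats   # semantics gated into localization

fuse = AdaptiveWeightedFusion(64)
out = fuse(torch.rand(1, 64, 30, 40), torch.rand(1, 64, 30, 40))
```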