Monocular Depth Estimators: Vulnerabilities and Attacks
Recent advances in neural networks have led to reliable monocular depth
estimation. Monocular depth estimation techniques have the upper hand over
traditional depth estimation techniques because they need only one image during
inference. Depth estimation is one of the essential tasks in robotics, and
monocular depth estimation has a wide variety of safety-critical applications,
such as self-driving cars and surgical devices. Thus, the robustness of such
techniques is crucial. Recent works have shown that deep
neural networks are highly vulnerable to adversarial samples for tasks like
classification, detection and segmentation. These adversarial samples can
completely ruin the output of the system, making its credibility in real-time
deployment questionable. In this paper, we investigate the robustness of
state-of-the-art monocular depth estimation networks against adversarial
attacks. Our experiments show that tiny perturbations on an image that are
invisible to the naked eye (perturbation attack) and corruption of less than about
1% of an image (patch attack) can affect the depth estimation drastically. We
introduce a novel deep feature annihilation loss that corrupts the hidden
feature space representation, forcing the decoder of the network to output poor
depth maps. White-box and black-box tests confirm the effectiveness of
the proposed attack. We also perform adversarial example transferability tests,
mainly cross-data transferability
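To illustrate what a small, bounded perturbation attack looks like in general, here is a minimal sketch of a one-step sign-gradient (FGSM-style) perturbation. It is not the deep feature annihilation loss described in the abstract. The toy linear "depth model", the function names, and the squared-error loss are all assumptions made for illustration.

```python
def predict_depth(weights, pixels):
    """Toy stand-in for a depth network: one scalar depth from a linear model."""
    return sum(w * p for w, p in zip(weights, pixels))


def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def squared_error(weights, pixels, true_depth):
    return (predict_depth(weights, pixels) - true_depth) ** 2


def fgsm_perturb(weights, pixels, true_depth, eps):
    """One sign-gradient step that increases the squared depth error.

    Each pixel moves by at most eps and stays inside [0, 1].
    """
    error = predict_depth(weights, pixels) - true_depth
    # Analytic gradient of (prediction - true_depth)^2 with respect to each pixel.
    grad = [2.0 * error * w for w in weights]
    return [min(1.0, max(0.0, p + eps * sign(g))) for p, g in zip(pixels, grad)]


# Hypothetical values chosen only to exercise the functions.
weights = [0.5, -0.25, 1.0, 0.75]
pixels = [0.2, 0.6, 0.4, 0.8]
adversarial = fgsm_perturb(weights, pixels, true_depth=1.0, eps=0.01)
```

With these values every pixel changes by at most 0.01, yet the squared error of the toy model grows, which is the basic mechanism a perturbation attack exploits.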
Attention-based Context Aggregation Network for Monocular Depth Estimation
Depth estimation is a traditional computer vision task, which plays a crucial
role in understanding 3D scene geometry. Recently, methods based on
deep convolutional neural networks have achieved promising
results in the monocular depth estimation field. Specifically, the framework
that combines the multi-scale features extracted by the dilated-convolution-based
block (atrous spatial pyramid pooling, ASPP) has brought significant
improvement in the dense labeling task. However, the discretized and predefined
dilation rates cannot capture the continuous context information that differs
across diverse scenes, and they easily introduce grid artifacts in depth estimation.
In this paper, we propose an attention-based context aggregation network (ACAN)
to tackle these difficulties. Based on the self-attention model, ACAN
adaptively learns the task-specific similarities between pixels to model the
context information. First, we recast monocular depth estimation as a dense
labeling multi-class classification problem. Then we propose a soft ordinal
inference to transform the predicted probabilities to continuous depth values,
which can reduce the discretization error (about 1% decrease in RMSE). Second,
the proposed ACAN aggregates both the image-level and pixel-level context
information for depth estimation, where the former expresses the statistical
characteristic of the whole image and the latter extracts the long-range
spatial dependencies for each pixel. Third, to further reduce the
inconsistency between the RGB image and depth map, we construct an attention
loss to minimize their information entropy. We evaluate on public monocular
depth-estimation benchmark datasets (including NYU Depth V2 and KITTI). The
experiments demonstrate the superiority of our proposed ACAN, which achieves
results competitive with the state of the art.
Comment: 12 pages, 10 figures
Monocular Depth Estimation with Hierarchical Fusion of Dilated CNNs and Soft-Weighted-Sum Inference
Monocular depth estimation is a challenging task in complex compositions
depicting multiple objects of diverse scales. Despite the great recent progress
brought by deep convolutional neural networks (CNNs), state-of-the-art
monocular depth estimation methods still fall short in handling such challenging
real-world scenarios. In this paper, we propose a deep end-to-end learning
framework to tackle these challenges, which learns the direct mapping from a
color image to the corresponding depth map. First, we represent monocular depth
estimation as a multi-category dense labeling task, in contrast to the
regression-based formulation. In this way, we can build upon the recent
progress in dense labeling such as semantic segmentation. Second, we fuse
different side-outputs from our front-end dilated convolutional neural network
in a hierarchical way to exploit the multi-scale depth cues for depth
estimation, which is critical to achieving scale-aware depth estimation. Third,
we propose to use soft-weighted-sum inference instead of hard-max
inference, transforming the discretized depth scores into continuous depth values.
Thus, we reduce the influence of quantization error and improve the robustness
of our method. Extensive experiments on the NYU Depth V2 and KITTI datasets
show the superiority of our method compared with current state-of-the-art
methods. Furthermore, experiments on the NYU V2 dataset reveal that our model
is able to learn the probability distribution of depth
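The difference between hard-max and soft-weighted-sum inference can be sketched as follows. This is an illustrative sketch only: it assumes each depth class is represented by a bin center in metres and that per-pixel class scores are turned into probabilities with a softmax. The paper may define its bins differently (for example in log space), and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hard_max_depth(scores, bin_centers):
    """Pick the center of the single highest-scoring depth bin."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return bin_centers[best]


def soft_weighted_sum_depth(scores, bin_centers):
    """Probability-weighted average of the bin centers (a continuous depth)."""
    probs = softmax(scores)
    return sum(p * c for p, c in zip(probs, bin_centers))


# Hypothetical pixel whose scores are split evenly between the 2 m and 3 m bins.
scores = [0.0, 2.0, 2.0, 0.0]
bin_centers = [1.0, 2.0, 3.0, 4.0]
hard = hard_max_depth(scores, bin_centers)
soft = soft_weighted_sum_depth(scores, bin_centers)
```

Here hard-max inference snaps to one bin center, while the weighted sum lands between the two equally likely bins, which is how the soft variant reduces quantization error.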
A Large RGB-D Dataset for Semi-supervised Monocular Depth Estimation
Current self-supervised methods for monocular depth estimation are largely
based on deeply nested convolutional networks that leverage stereo image pairs
or monocular sequences during a training phase. However, they often exhibit
inaccurate results around occluded regions and depth boundaries. In this paper,
we present a simple yet effective approach for monocular depth estimation using
stereo image pairs. We propose a student-teacher strategy in
which a shallow student network is trained with the auxiliary information
obtained from a deeper and more accurate teacher network. Specifically, we
first train the stereo teacher network by fully utilizing the binocular
perception of 3-D geometry and then use the depth predictions of the teacher
network to train the student network for monocular depth inference. This
enables us to exploit all available depth data from massive unlabeled stereo
pairs. We propose a strategy that involves the use of a data ensemble to merge
the multiple depth predictions of the teacher network to improve the training
samples by collecting non-trivial knowledge beyond a single prediction. To
refine the inaccurate depth estimates that are used when training the student
network, we further propose a stereo confidence-guided regression loss that
handles the unreliable pseudo depth values in occlusions, texture-less regions,
and repetitive patterns. To complement the existing dataset comprising outdoor
driving scenes, we built a novel large-scale dataset consisting of one million
outdoor stereo images taken using hand-held stereo cameras. Finally, we
demonstrate that the monocular depth estimation network provides feature
representations that are suitable for high-level vision tasks. The experimental
results for various outdoor scenarios demonstrate the effectiveness and
flexibility of our approach, which outperforms state-of-the-art approaches.
Comment: https://dimlrgbd.github.io
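The abstract does not give the form of the confidence-guided regression loss. One plausible reading, shown here purely as an assumption, is an L1 regression loss in which each pixel's error against the teacher's pseudo depth is weighted by a confidence value in [0, 1]. The function name and normalization are hypothetical.

```python
def confidence_weighted_l1(student_depth, pseudo_depth, confidence, eps=1e-8):
    """Confidence-weighted mean absolute error against teacher pseudo labels.

    Pixels with confidence 0 (e.g. occluded regions) contribute nothing;
    eps avoids division by zero when every confidence is 0.
    """
    weighted_error = sum(
        c * abs(s - t)
        for s, t, c in zip(student_depth, pseudo_depth, confidence)
    )
    return weighted_error / (sum(confidence) + eps)


# Hypothetical three-pixel example: the middle pseudo label is unreliable.
student = [1.0, 2.0, 3.0]
pseudo = [1.0, 4.0, 3.0]
masked_loss = confidence_weighted_l1(student, pseudo, [1.0, 0.0, 1.0])
plain_loss = confidence_weighted_l1(student, pseudo, [1.0, 1.0, 1.0])
```

With the unreliable pixel masked out, the student is not penalized for disagreeing with a pseudo label that the teacher itself is unsure about.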
A Compromise Principle in Deep Monocular Depth Estimation
Monocular depth estimation, which plays a key role in understanding 3D scene
geometry, is fundamentally an ill-posed problem. Existing methods based on deep
convolutional neural networks (DCNNs) have examined this problem by learning
convolutional networks to estimate continuous depth maps from monocular images.
However, we find that training a network to predict a high spatial resolution
continuous depth map often suffers from poor local solutions. In this paper, we
hypothesize that achieving a compromise between spatial and depth resolutions
can improve network training. Based on this "compromise principle", we propose
a regression-classification cascaded network (RCCN), which consists of a
regression branch predicting a low spatial resolution continuous depth map and
a classification branch predicting a high spatial resolution discrete depth
map. The two branches form a cascaded structure allowing the classification and
regression branches to benefit from each other. By leveraging large-scale raw
training datasets and some data augmentation strategies, our network achieves
top or state-of-the-art results on the NYU Depth V2, KITTI, and Make3D
benchmarks
Monocular 3D Object Detection via Geometric Reasoning on Keypoints
Monocular 3D object detection is well-known to be a challenging vision task
due to the loss of depth information; attempts to recover depth using separate
image-only approaches lead to unstable and noisy depth estimates, harming 3D
detections. In this paper, we propose a novel keypoint-based approach for 3D
object detection and localization from a single RGB image. We build our
multi-branch model around 2D keypoint detection in images and complement it
with a conceptually simple geometric reasoning method. Our network operates
end to end, simultaneously and interdependently estimating 2D
characteristics, such as 2D bounding boxes, keypoints, and orientation, along
with full 3D pose in the scene. We fuse the outputs of distinct branches,
applying a reprojection consistency loss during training. The experimental
evaluation on the challenging KITTI dataset benchmark demonstrates that our
network achieves state-of-the-art results among other monocular 3D detectors
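A reprojection consistency term can be sketched under a standard pinhole camera model: project the estimated 3D keypoints into the image and measure their distance to the detected 2D keypoints. The abstract does not specify the exact loss, so the mean Euclidean distance, the function names, and the intrinsics below are assumptions.

```python
import math


def project_point(point_3d, fx, fy, cx, cy):
    """Project a camera-frame 3D point to pixel coordinates (pinhole model)."""
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_loss(points_3d, keypoints_2d, fx, fy, cx, cy):
    """Mean pixel distance between projected 3D points and 2D keypoints."""
    total = 0.0
    for point, (u, v) in zip(points_3d, keypoints_2d):
        pu, pv = project_point(point, fx, fy, cx, cy)
        total += math.hypot(pu - u, pv - v)
    return total / len(points_3d)


# Hypothetical intrinsics and keypoints, chosen only for illustration.
intrinsics = dict(fx=700.0, fy=700.0, cx=600.0, cy=180.0)
points_3d = [(0.0, 0.0, 10.0), (1.0, -0.5, 10.0)]
keypoints_2d = [(600.0, 180.0), (673.0, 149.0)]
loss = reprojection_loss(points_3d, keypoints_2d, **intrinsics)
```

The first point projects exactly onto its keypoint and the second is off by a few pixels, so the loss is small but nonzero; a term of this kind would be minimized when the 2D and 3D branches agree.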
OmniDepth: Dense Depth Estimation for Indoors Spherical Panoramas
Work on depth estimation has so far focused only on projective
images, ignoring 360 content, which is now produced more often and more easily.
We show that monocular depth estimation models trained on traditional images
produce sub-optimal results on omnidirectional images, showcasing the need for
training directly on 360 datasets, which, however, are hard to acquire. In this
work, we circumvent the challenges associated with acquiring high quality 360
datasets with ground truth depth annotations, by re-using recently released
large scale 3D datasets and re-purposing them to 360 via rendering. This
dataset, which is considerably larger than similar projective datasets, is
publicly offered to the community to enable future research in this direction.
We use this dataset to learn in an end-to-end fashion the task of depth
estimation from 360 images. We show promising results on our synthesized data
as well as on unseen realistic images.
Comment: Pre-print to appear in ECCV1
Monocular Depth Estimation with Augmented Ordinal Depth Relationships
Most existing algorithms for depth estimation from single monocular images
need large quantities of metric groundtruth depths for supervised learning. We
show that relative depth can be an informative cue for metric depth estimation
and can be easily obtained from vast stereo videos. Acquiring metric depths
from stereo videos is sometimes impracticable due to the absence of camera
parameters. In this paper, we propose to improve the performance of metric
depth estimation with relative depths collected from stereo movie videos using
an existing stereo matching algorithm. We introduce a new "Relative Depth in
Stereo" (RDIS) dataset densely labelled with relative depths. We first pretrain
a ResNet model on our RDIS dataset. Then we finetune the model on RGB-D
datasets with metric ground-truth depths. During our finetuning, we formulate
depth estimation as a classification task. This reformulation enables
us to obtain the confidence of a depth prediction in the form of a probability
distribution. With this confidence, we propose an information gain loss to make
use of the predictions that are close to ground-truth. We evaluate our approach
on both indoor and outdoor benchmark RGB-D datasets and achieve
state-of-the-art performance.
Comment: 10 pages
Estimating Depth from Monocular Images as Classification Using Deep Fully Convolutional Residual Networks
Depth estimation from single monocular images is a key component of scene
understanding and has recently benefited largely from deep convolutional neural
networks (CNNs). In this article, we take advantage of the recent deep residual
networks and propose a simple yet effective approach to this problem. We
formulate depth estimation as a pixel-wise classification task. Specifically,
we first discretize the continuous depth values into multiple bins and label
the bins according to their depth range. Then we train fully convolutional deep
residual networks to predict the depth label of each pixel. Performing discrete
depth label classification instead of continuous depth value regression allows
us to predict a confidence in the form of a probability distribution. We further
apply fully-connected conditional random fields (CRF) as a post-processing step
to enforce local smoothness interactions, which improves the results. We
evaluate our approach on both indoor and outdoor datasets and achieve
state-of-the-art performance.
Comment: Accepted to IEEE Transactions on Circuits and Systems for Video
Technology
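The discretization step described in this abstract can be sketched as below. The abstract does not say how the bins are spaced, so the log-spaced bins, the depth range, and the function names are assumptions; uniform spacing in metres would be an equally valid reading.

```python
import math


def depth_to_bin(depth, d_min, d_max, num_bins):
    """Map a continuous depth to a class label using log-spaced bins."""
    depth = min(max(depth, d_min), d_max)  # clamp into the valid range
    span = math.log(d_max) - math.log(d_min)
    fraction = (math.log(depth) - math.log(d_min)) / span
    # depth == d_max gives fraction 1.0, which belongs to the last bin.
    return min(int(fraction * num_bins), num_bins - 1)


def bin_to_depth(index, d_min, d_max, num_bins):
    """Return the log-space center of a bin as a representative depth."""
    step = (math.log(d_max) - math.log(d_min)) / num_bins
    return math.exp(math.log(d_min) + (index + 0.5) * step)


# Hypothetical range of 1 m to 16 m split into 4 bins:
# [1, 2), [2, 4), [4, 8), [8, 16].
label = depth_to_bin(3.0, 1.0, 16.0, 4)
center = bin_to_depth(label, 1.0, 16.0, 4)
```

A network trained on such labels outputs one score per bin for every pixel, and the bin centers provide the depth value associated with each predicted class.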
Dual CNN Models for Unsupervised Monocular Depth Estimation
Unsupervised depth estimation is a recent trend that uses
binocular stereo images to remove the need for ground-truth depth maps. In unsupervised
depth computation, the disparity images are generated by training the CNN with
an image reconstruction loss. In this paper, a dual CNN based model with 6 losses (DNM6) is
presented for unsupervised depth estimation, with an
individual CNN for each view to generate the corresponding disparity map. The
proposed dual CNN model is also extended to 12 losses (DNM12) by utilizing
the cross disparities. The presented DNM6 and DNM12 models are evaluated
on the KITTI driving and Cityscapes urban datasets and compared with recent
state-of-the-art results in unsupervised depth estimation. The code is available
at:
https://github.com/ishmav16/Dual-CNN-Models-for-Unsupervised-Monocular-Depth-Estimation
Comment: Accepted in 8th Pattern Recognition and Machine Intelligence
Conference (PReMI) 201
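An image reconstruction loss of the kind mentioned in this abstract can be sketched on a single scanline: the left view is rebuilt by sampling the right view at positions shifted by the predicted disparity, and the photometric error is measured. This is a simplified sketch, not the DNM6 or DNM12 loss. It assumes rectified stereo with the convention that left pixel x matches right pixel x - d, and the function names are hypothetical.

```python
def sample_linear(row, x):
    """Linearly interpolate a 1D intensity row at a fractional position."""
    x = min(max(x, 0.0), len(row) - 1.0)  # clamp to the image border
    x0 = int(x)
    x1 = min(x0 + 1, len(row) - 1)
    t = x - x0
    return (1.0 - t) * row[x0] + t * row[x1]


def reconstruct_left_row(right_row, left_disparity):
    """Rebuild the left scanline by sampling the right one at x - d(x)."""
    return [sample_linear(right_row, x - d) for x, d in enumerate(left_disparity)]


def l1_reconstruction_loss(left_row, right_row, left_disparity):
    """Mean absolute photometric error between the left row and its rebuild."""
    rebuilt = reconstruct_left_row(right_row, left_disparity)
    return sum(abs(a - b) for a, b in zip(left_row, rebuilt)) / len(left_row)


# Hypothetical scanlines: the left view is the right view shifted by one pixel.
right_row = [10.0, 20.0, 30.0, 40.0, 50.0]
left_row = [10.0, 10.0, 20.0, 30.0, 40.0]
good_loss = l1_reconstruction_loss(left_row, right_row, [1.0] * 5)
bad_loss = l1_reconstruction_loss(left_row, right_row, [0.0] * 5)
```

The correct disparity gives zero loss while a wrong one does not, which is why minimizing this error can train a disparity network without ground-truth depth.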