
    Monocular Depth Estimators: Vulnerabilities and Attacks

    Recent advancements in neural networks have led to reliable monocular depth estimation. Monocular depth estimation techniques have the upper hand over traditional depth estimation techniques as they only need one image during inference. Depth estimation is one of the essential tasks in robotics, and monocular depth estimation has a wide variety of safety-critical applications, such as self-driving cars and surgical devices. Thus, the robustness of such techniques is crucial. Recent works have shown that these deep neural networks are highly vulnerable to adversarial samples for tasks like classification, detection and segmentation. These adversarial samples can completely ruin the output of the system, making their credibility in real-time deployment questionable. In this paper, we investigate the robustness of state-of-the-art monocular depth estimation networks against adversarial attacks. Our experiments show that tiny perturbations on an image that are invisible to the naked eye (perturbation attack) and corruption of less than about 1% of an image (patch attack) can affect the depth estimation drastically. We introduce a novel deep feature annihilation loss that corrupts the hidden feature space representation, forcing the decoder of the network to output poor depth maps. White-box and black-box tests confirm the effectiveness of the proposed attack. We also perform adversarial example transferability tests, mainly cross-data transferability.
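
    As a rough illustration of the perturbation attack described above, the following PGD-style loop drives an encoder's hidden features toward zero. The encoder/decoder split, the L2 form of the annihilation loss, and the step sizes are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def feature_annihilation_attack(encoder, image, steps=40, eps=4/255, alpha=1/255):
    """PGD-style perturbation: push the encoder's hidden feature map toward
    zero so the decoder can only produce a degraded depth map.
    `encoder` is assumed to return a single feature tensor; eps bounds the
    perturbation so it stays invisible to the naked eye (both assumptions)."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        feats = encoder(image + delta)
        loss = feats.pow(2).mean()               # energy of the hidden representation
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # descend: annihilate features
            delta.clamp_(-eps, eps)              # keep the perturbation tiny
            delta.grad.zero_()
    return (image + delta).detach()
```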

    Attention-based Context Aggregation Network for Monocular Depth Estimation

    Depth estimation is a traditional computer vision task, which plays a crucial role in understanding 3D scene geometry. Recently, methods based on deep convolutional neural networks have achieved promising results in the monocular depth estimation field. Specifically, frameworks that combine multi-scale features extracted by dilated convolution based blocks (atrous spatial pyramid pooling, ASPP) have achieved significant improvement in dense labeling tasks. However, the discretized and predefined dilation rates cannot capture the continuous context information that differs in diverse scenes, and easily introduce grid artifacts in depth estimation. In this paper, we propose an attention-based context aggregation network (ACAN) to tackle these difficulties. Based on the self-attention model, ACAN adaptively learns the task-specific similarities between pixels to model the context information. First, we recast monocular depth estimation as a dense labeling multi-class classification problem. Then we propose a soft ordinal inference to transform the predicted probabilities into continuous depth values, which can reduce the discretization error (about 1% decrease in RMSE). Second, the proposed ACAN aggregates both image-level and pixel-level context information for depth estimation, where the former expresses the statistical characteristics of the whole image and the latter extracts long-range spatial dependencies for each pixel. Third, to further reduce the inconsistency between the RGB image and the depth map, we construct an attention loss to minimize their information entropy. We evaluate on public monocular depth estimation benchmark datasets (including NYU Depth V2 and KITTI). The experiments demonstrate the superiority of our proposed ACAN, which achieves results competitive with the state of the art. Comment: 12 pages, 10 figures
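
    The soft ordinal inference step lends itself to a short sketch: decode per-bin class probabilities into continuous depth by taking their expectation over bin centers rather than a hard arg-max. The log-uniform bin spacing and depth range below are assumptions.

```python
import math
import torch

def soft_ordinal_inference(logits, d_min=0.7, d_max=10.0):
    """Convert per-pixel scores over K discrete depth bins into continuous
    depth via the probability-weighted mean of the bin centers.
    logits: (B, K, H, W) -> depth: (B, H, W)."""
    K = logits.shape[1]
    centers = torch.exp(torch.linspace(math.log(d_min), math.log(d_max), K,
                                       device=logits.device))   # (K,)
    probs = logits.softmax(dim=1)
    return (probs * centers.view(1, K, 1, 1)).sum(dim=1)        # expectation
```

    Because the expectation interpolates between neighboring bin centers, the decoded depth is no longer restricted to the discrete bin values, which is what reduces the discretization error.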

    Monocular Depth Estimation with Hierarchical Fusion of Dilated CNNs and Soft-Weighted-Sum Inference

    Monocular depth estimation is a challenging task in complex compositions depicting multiple objects of diverse scales. Despite the recent great progress thanks to deep convolutional neural networks (CNNs), state-of-the-art monocular depth estimation methods still fall short in handling such real-world challenging scenarios. In this paper, we propose a deep end-to-end learning framework to tackle these challenges, which learns the direct mapping from a color image to the corresponding depth map. First, we represent monocular depth estimation as a multi-category dense labeling task, in contrast to the regression-based formulation. In this way, we can build upon the recent progress in dense labeling tasks such as semantic segmentation. Second, we fuse different side-outputs from our front-end dilated convolutional neural network in a hierarchical way to exploit multi-scale depth cues, which is critical to achieving scale-aware depth estimation. Third, we propose to utilize soft-weighted-sum inference instead of hard-max inference, transforming discretized depth scores into continuous depth values. Thus, we reduce the influence of quantization error and improve the robustness of our method. Extensive experiments on the NYU Depth V2 and KITTI datasets show the superiority of our method compared with current state-of-the-art methods. Furthermore, experiments on the NYU Depth V2 dataset reveal that our model is able to learn the probability distribution of depth.
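
    A minimal sketch of the hierarchical fusion idea follows: side-outputs from conv blocks of growing dilation are merged deep-to-shallow, so large-receptive-field cues guide the finer ones. The channel sizes, 1x1 fusion, and block layout are illustrative assumptions, not the paper's architecture; decoding the per-bin scores mirrors the soft inference sketch above.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Hierarchical deep-to-shallow fusion of dilated side-outputs."""
    def __init__(self, channels=64, num_bins=50):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 4, 8))
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 * channels, channels, 1) for _ in range(3))
        self.head = nn.Conv2d(channels, num_bins, 1)   # per-bin depth scores

    def forward(self, x):
        sides = []
        for block in self.blocks:                      # growing dilation rates
            x = torch.relu(block(x))
            sides.append(x)
        out = sides[-1]
        for fuse, side in zip(self.fuse, reversed(sides[:-1])):
            out = torch.relu(fuse(torch.cat([out, side], dim=1)))
        return self.head(out)
```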

    A Large RGB-D Dataset for Semi-supervised Monocular Depth Estimation

    Current self-supervised methods for monocular depth estimation are largely based on deeply nested convolutional networks that leverage stereo image pairs or monocular sequences during training. However, they often exhibit inaccurate results around occluded regions and depth boundaries. In this paper, we present a simple yet effective approach for monocular depth estimation using stereo image pairs. We propose a student-teacher strategy in which a shallow student network is trained with auxiliary information obtained from a deeper and more accurate teacher network. Specifically, we first train the stereo teacher network by fully utilizing the binocular perception of 3D geometry, and then use the depth predictions of the teacher network to train the student network for monocular depth inference. This enables us to exploit all available depth data from massive unlabeled stereo pairs. We propose a data-ensemble strategy that merges the multiple depth predictions of the teacher network to improve the training samples by collecting non-trivial knowledge beyond a single prediction. To refine the inaccurate pseudo depth used when training the student network, we further propose a stereo confidence-guided regression loss that handles unreliable pseudo depth values in occlusions, textureless regions, and repetitive patterns. To complement existing datasets comprising outdoor driving scenes, we built a novel large-scale dataset consisting of one million outdoor stereo images taken with hand-held stereo cameras. Finally, we demonstrate that the monocular depth estimation network provides feature representations that are suitable for high-level vision tasks. The experimental results for various outdoor scenarios demonstrate the effectiveness and flexibility of our approach, which outperforms state-of-the-art approaches. Comment: https://dimlrgbd.github.io
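
    A confidence-guided regression loss of the kind described above might look like the sketch below: an L1 penalty against the teacher's pseudo depth, gated and weighted by a stereo confidence map. The threshold value and L1 form are assumptions.

```python
import torch

def confidence_guided_loss(student_depth, teacher_depth, confidence, tau=0.8):
    """L1 loss against the teacher's pseudo depth, weighted by a stereo
    confidence map in [0, 1] and gated at threshold tau so that unreliable
    pixels (occlusions, textureless or repetitive regions) are ignored."""
    weight = confidence * (confidence > tau).float()  # hard gate + soft weight
    diff = (student_depth - teacher_depth).abs()
    return (weight * diff).sum() / weight.sum().clamp(min=1e-6)
```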

    A Compromise Principle in Deep Monocular Depth Estimation

    Monocular depth estimation, which plays a key role in understanding 3D scene geometry, is fundamentally an ill-posed problem. Existing methods based on deep convolutional neural networks (DCNNs) tackle this problem by learning convolutional networks that estimate continuous depth maps from monocular images. However, we find that training a network to predict a high spatial resolution continuous depth map often suffers from poor local solutions. In this paper, we hypothesize that achieving a compromise between spatial and depth resolutions can improve network training. Based on this "compromise principle", we propose a regression-classification cascaded network (RCCN), which consists of a regression branch predicting a low spatial resolution continuous depth map and a classification branch predicting a high spatial resolution discrete depth map. The two branches form a cascaded structure that allows the classification and regression branches to benefit from each other. By leveraging large-scale raw training datasets and several data augmentation strategies, our network achieves top or state-of-the-art results on the NYU Depth V2, KITTI, and Make3D benchmarks.
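
    The cascade structure can be sketched as a two-branch head on a shared feature map, with the regressed coarse depth fed into the classification branch. Layer sizes and the exact coupling below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadedHead(nn.Module):
    """Sketch of a regression-classification cascade: a low-resolution
    continuous regression branch feeds a high-resolution discrete branch."""
    def __init__(self, channels=64, num_bins=50):
        super().__init__()
        self.reg = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 1))                   # coarse continuous depth
        self.cls = nn.Conv2d(channels + 1, num_bins, 1)  # fine discrete depth

    def forward(self, feats):
        coarse = self.reg(feats)                         # (B, 1, H/2, W/2)
        up = F.interpolate(coarse, size=feats.shape[-2:], mode='bilinear',
                           align_corners=False)
        logits = self.cls(torch.cat([feats, up], dim=1)) # cls conditioned on reg
        return coarse, logits
```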

    Monocular 3D Object Detection via Geometric Reasoning on Keypoints

    Monocular 3D object detection is well known to be a challenging vision task due to the loss of depth information; attempts to recover depth using separate image-only approaches lead to unstable and noisy depth estimates, harming 3D detections. In this paper, we propose a novel keypoint-based approach for 3D object detection and localization from a single RGB image. We build our multi-branch model around 2D keypoint detection in images and complement it with a conceptually simple geometric reasoning method. Our network operates in an end-to-end manner, simultaneously and interdependently estimating 2D characteristics, such as 2D bounding boxes, keypoints, and orientation, along with the full 3D pose in the scene. We fuse the outputs of the distinct branches, applying a reprojection consistency loss during training. The experimental evaluation on the challenging KITTI benchmark demonstrates that our network achieves state-of-the-art results among monocular 3D detectors.
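
    A reprojection consistency loss of this flavor can be sketched in a few lines: project the predicted 3D keypoints through the camera intrinsics and penalize deviation from the 2D keypoint branch. The smooth-L1 penalty is an assumption.

```python
import torch
import torch.nn.functional as F

def reprojection_consistency_loss(points_3d, keypoints_2d, K):
    """Project predicted 3D keypoints (camera coordinates, (N, 3)) through
    the intrinsics K (3, 3) and penalize deviation from the 2D keypoint
    branch's predictions (N, 2), tying the branches together in training."""
    proj = points_3d @ K.T                           # homogeneous (N, 3)
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)  # perspective divide
    return F.smooth_l1_loss(uv, keypoints_2d)
```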

    OmniDepth: Dense Depth Estimation for Indoors Spherical Panoramas

    Recent work on depth estimation has so far focused only on projective images, ignoring 360° content, which is now increasingly and more easily produced. We show that monocular depth estimation models trained on traditional images produce sub-optimal results on omnidirectional images, showcasing the need for training directly on 360° datasets, which, however, are hard to acquire. In this work, we circumvent the challenges associated with acquiring high-quality 360° datasets with ground-truth depth annotations by re-using recently released large-scale 3D datasets and re-purposing them to 360° via rendering. This dataset, which is considerably larger than similar projective datasets, is publicly offered to the community to enable future research in this direction. We use this dataset to learn the task of depth estimation from 360° images in an end-to-end fashion. We show promising results on our synthesized data as well as on unseen realistic images. Comment: Pre-print to appear in ECCV18
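
    The geometry underlying both the 360° rendering and depth decoding is the equirectangular pixel-to-ray mapping, sketched below. The axis convention (y up, z forward) is an assumption.

```python
import math
import torch

def equirect_rays(height, width):
    """Unit ray direction for every pixel of an equirectangular panorama:
    columns span longitude [-pi, pi), rows span latitude [pi/2, -pi/2].
    Useful both for rendering 3D scenes to 360 images and for lifting a
    360 depth map back to a point cloud."""
    lon = (torch.arange(width) + 0.5) / width * 2 * math.pi - math.pi
    lat = math.pi / 2 - (torch.arange(height) + 0.5) / height * math.pi
    lat, lon = torch.meshgrid(lat, lon, indexing='ij')
    return torch.stack([lat.cos() * lon.sin(),    # x (right)
                        lat.sin(),                # y (up)
                        lat.cos() * lon.cos()],   # z (forward)
                       dim=-1)                    # (H, W, 3)
```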

    Monocular Depth Estimation with Augmented Ordinal Depth Relationships

    Most existing algorithms for depth estimation from single monocular images need large quantities of metric ground-truth depths for supervised learning. We show that relative depth can be an informative cue for metric depth estimation and can be easily obtained from vast stereo videos. Acquiring metric depths from stereo videos is sometimes impracticable due to the absence of camera parameters. In this paper, we propose to improve the performance of metric depth estimation with relative depths collected from stereo movie videos using an existing stereo matching algorithm. We introduce a new "Relative Depth in Stereo" (RDIS) dataset densely labelled with relative depths. We first pretrain a ResNet model on our RDIS dataset. Then we finetune the model on RGB-D datasets with metric ground-truth depths. During finetuning, we formulate depth estimation as a classification task. This reformulation enables us to obtain the confidence of a depth prediction in the form of a probability distribution. With this confidence, we propose an information gain loss to make use of predictions that are close to the ground truth. We evaluate our approach on both indoor and outdoor benchmark RGB-D datasets and achieve state-of-the-art performance. Comment: 10 pages
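
    One plausible reading of an information gain loss is a cross-entropy against a soft target that spreads mass over bins near the ground-truth bin, so near-miss predictions still contribute; the Gaussian spread below is an assumption, not the paper's exact weighting.

```python
import torch

def information_gain_loss(logits, gt_bins, sigma=2.0):
    """Cross-entropy against a Gaussian-smoothed target centered at the
    ground-truth depth bin. logits: (N, K); gt_bins: (N,) integer labels."""
    K = logits.shape[1]
    ks = torch.arange(K, device=logits.device, dtype=torch.float)     # (K,)
    target = torch.exp(-(ks.unsqueeze(0) - gt_bins.float().unsqueeze(1)) ** 2
                       / (2 * sigma ** 2))
    target = target / target.sum(dim=1, keepdim=True)                 # normalize
    return -(target * logits.log_softmax(dim=1)).sum(dim=1).mean()
```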

    Estimating Depth from Monocular Images as Classification Using Deep Fully Convolutional Residual Networks

    Depth estimation from single monocular images is a key component of scene understanding and has benefited largely from deep convolutional neural networks (CNNs) recently. In this article, we take advantage of recent deep residual networks and propose a simple yet effective approach to this problem. We formulate depth estimation as a pixel-wise classification task. Specifically, we first discretize the continuous depth values into multiple bins and label the bins according to their depth range. Then we train fully convolutional deep residual networks to predict the depth label of each pixel. Performing discrete depth label classification instead of continuous depth value regression allows us to predict a confidence in the form of a probability distribution. We further apply fully-connected conditional random fields (CRFs) as a post-processing step to enforce local smoothness interactions, which improves the results. We evaluate our approach on both indoor and outdoor datasets and achieve state-of-the-art performance. Comment: Accepted to IEEE Transactions on Circuits and Systems for Video Technology
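
    The discretization step can be sketched directly: map continuous depth to log-spaced bin labels for cross-entropy training, keeping the bin-center table for decoding. The bin count and depth range below are assumptions (indoor, NYU-like scale).

```python
import math
import torch

def depth_to_bins(depth, num_bins=50, d_min=0.7, d_max=10.0):
    """Discretize continuous depth into log-spaced bins: returns per-pixel
    class labels plus the bin-center lookup table used at decode time."""
    lo, hi = math.log(d_min), math.log(d_max)
    log_d = torch.log(depth.clamp(d_min, d_max))
    labels = ((log_d - lo) / (hi - lo) * num_bins).long().clamp(0, num_bins - 1)
    centers = torch.exp(lo + (torch.arange(num_bins) + 0.5) / num_bins * (hi - lo))
    return labels, centers
```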

    Dual CNN Models for Unsupervised Monocular Depth Estimation

    Unsupervised depth estimation is a recent trend that utilizes binocular stereo images to remove the need for ground-truth depth maps. In unsupervised depth computation, disparity images are generated by training a CNN with an image reconstruction loss. In this paper, a dual CNN based model is presented for unsupervised depth estimation with 6 losses (DNM6), with an individual CNN for each view to generate the corresponding disparity map. The proposed dual CNN model is also extended with 12 losses (DNM12) by utilizing cross disparities. The presented DNM6 and DNM12 models are evaluated on the KITTI driving and Cityscapes urban datasets and compared with recent state-of-the-art results in unsupervised depth estimation. The code is available at: https://github.com/ishmav16/Dual-CNN-Models-for-Unsupervised-Monocular-Depth-Estimation. Comment: Accepted at the 8th Pattern Recognition and Machine Intelligence Conference (PReMI) 2019
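
    The image reconstruction loss at the heart of such unsupervised training can be sketched as a disparity-based warp: reconstruct the left view by sampling the right image at disparity-shifted coordinates, then compare photometrically. The disparity sign convention is an assumption.

```python
import torch
import torch.nn.functional as F

def reconstruct_left(right_img, disparity):
    """Warp the right stereo image with the predicted left-view disparity
    (in pixels, shape (B, 1, H, W)) to reconstruct the left image; an L1
    photometric loss on the result trains the network without depth labels."""
    B, _, H, W = right_img.shape
    xs = torch.linspace(-1, 1, W, device=right_img.device)
    ys = torch.linspace(-1, 1, H, device=right_img.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing='ij')        # (H, W)
    # shift sampling coords left by disparity, normalized to [-1, 1]
    grid_x = grid_x.unsqueeze(0) - 2 * disparity.squeeze(1) / (W - 1)
    grid_y = grid_y.unsqueeze(0).expand(B, H, W)
    grid = torch.stack([grid_x, grid_y], dim=-1)                  # (B, H, W, 2)
    return F.grid_sample(right_img, grid, align_corners=True)
```

    An L1 loss such as `(reconstruct_left(right, disp) - left).abs().mean()` then serves as the training signal in place of ground-truth depth.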