
    A Large RGB-D Dataset for Semi-supervised Monocular Depth Estimation

    Current self-supervised methods for monocular depth estimation are largely based on deeply nested convolutional networks that leverage stereo image pairs or monocular sequences during a training phase. However, they often exhibit inaccurate results around occluded regions and depth boundaries. In this paper, we present a simple yet effective approach for monocular depth estimation using stereo image pairs. We propose a student-teacher strategy in which a shallow student network is trained with auxiliary information obtained from a deeper and more accurate teacher network. Specifically, we first train the stereo teacher network by fully utilizing the binocular perception of 3-D geometry and then use the depth predictions of the teacher network to train the student network for monocular depth inference. This enables us to exploit all available depth data from massive unlabeled stereo pairs. We propose a data-ensemble strategy that merges the multiple depth predictions of the teacher network to improve the training samples by collecting non-trivial knowledge beyond a single prediction. To refine the inaccurate depth estimates used when training the student network, we further propose a stereo-confidence-guided regression loss that handles unreliable pseudo depth values in occlusions, texture-less regions, and repetitive patterns. To complement the existing dataset comprising outdoor driving scenes, we built a novel large-scale dataset consisting of one million outdoor stereo images taken with hand-held stereo cameras. Finally, we demonstrate that the monocular depth estimation network provides feature representations that are suitable for high-level vision tasks. The experimental results for various outdoor scenarios demonstrate the effectiveness and flexibility of our approach, which outperforms state-of-the-art approaches. Comment: https://dimlrgbd.github.io
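    As a rough illustration of the stereo-confidence-guided regression idea described in this abstract, the sketch below (hypothetical code, not the authors' implementation; the function name, threshold, and tensor shapes are assumptions) weights a per-pixel L1 loss between the student prediction and the teacher's pseudo depth by a stereo confidence map, so occluded and texture-less pixels contribute less.

```python
import torch

def confidence_guided_l1(pred_depth, pseudo_depth, stereo_conf, conf_thresh=0.5):
    """Hypothetical sketch: down-weight unreliable pseudo-depth pixels
    (occlusions, texture-less or repetitive regions) using a stereo
    confidence map in [0, 1]."""
    # zero out pixels below a confidence threshold, keep soft weights elsewhere
    weight = stereo_conf * (stereo_conf > conf_thresh).float()
    per_pixel = torch.abs(pred_depth - pseudo_depth)
    return (weight * per_pixel).sum() / (weight.sum() + 1e-6)
```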

    GAL: A Global-Attributes Assisted Labeling System for Outdoor Scenes

    An approach that extracts global attributes from outdoor images to facilitate geometric layout labeling is investigated in this work. The proposed Global-attributes Assisted Labeling (GAL) system exploits both local features and global attributes. First, following a classical method, we use local features to provide initial labels for all super-pixels. Then, we develop a set of techniques to extract global attributes from 2D outdoor images, including sky lines, ground lines, and vanishing lines. Finally, we propose the GAL system that integrates global attributes in the conditional random field (CRF) framework to improve the initial labels and offer a more robust labeling result. The performance of the proposed GAL system is demonstrated and benchmarked against several state-of-the-art algorithms on a popular outdoor scene layout dataset.

    Monocular Depth Estimation with Augmented Ordinal Depth Relationships

    Most existing algorithms for depth estimation from single monocular images need large quantities of metric ground-truth depths for supervised learning. We show that relative depth can be an informative cue for metric depth estimation and can be easily obtained from vast stereo videos. Acquiring metric depths from stereo videos is sometimes impracticable due to the absence of camera parameters. In this paper, we propose to improve the performance of metric depth estimation with relative depths collected from stereo movie videos using an existing stereo matching algorithm. We introduce a new "Relative Depth in Stereo" (RDIS) dataset densely labelled with relative depths. We first pretrain a ResNet model on our RDIS dataset. Then we finetune the model on RGB-D datasets with metric ground-truth depths. During finetuning, we formulate depth estimation as a classification task. This re-formulation scheme enables us to obtain the confidence of a depth prediction in the form of a probability distribution. With this confidence, we propose an information gain loss to make use of the predictions that are close to the ground truth. We evaluate our approach on both indoor and outdoor benchmark RGB-D datasets and achieve state-of-the-art performance. Comment: 10 pages
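    The information gain loss is only described at a high level in this abstract; below is a minimal sketch of one common way such a loss can be realized (the exponential kernel over depth bins and the alpha hyper-parameter are assumptions, not the paper's exact formulation), where bins close to the ground-truth bin receive soft credit instead of a one-hot target.

```python
import torch
import torch.nn.functional as F

def information_gain_loss(logits, gt_bin, num_bins, alpha=0.2):
    """Hypothetical sketch: cross-entropy against a soft target that
    rewards predictions near the ground-truth depth bin."""
    bins = torch.arange(num_bins, dtype=torch.float32, device=logits.device)
    # soft target: exp(-alpha * (bin - gt_bin)^2), normalized over bins
    soft_target = torch.exp(-alpha * (bins[None, :] - gt_bin[:, None].float()) ** 2)
    soft_target = soft_target / soft_target.sum(dim=1, keepdim=True)
    log_prob = F.log_softmax(logits, dim=1)          # logits: (N, num_bins)
    return -(soft_target * log_prob).sum(dim=1).mean()
```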

    Estimating Depth from Monocular Images as Classification Using Deep Fully Convolutional Residual Networks

    Depth estimation from single monocular images is a key component of scene understanding and has recently benefited greatly from deep convolutional neural networks (CNNs). In this article, we take advantage of recent deep residual networks and propose a simple yet effective approach to this problem. We formulate depth estimation as a pixel-wise classification task. Specifically, we first discretize the continuous depth values into multiple bins and label the bins according to their depth range. Then we train fully convolutional deep residual networks to predict the depth label of each pixel. Performing discrete depth label classification instead of continuous depth value regression allows us to predict a confidence in the form of a probability distribution. We further apply fully-connected conditional random fields (CRF) as a post-processing step to enforce local smoothness interactions, which improves the results. We evaluate our approach on both indoor and outdoor datasets and achieve state-of-the-art performance. Comment: Accepted to IEEE Transactions on Circuits and Systems for Video Technology
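    A minimal sketch of the discretization step follows, assuming log-spaced bins (the depth range and bin count here are illustrative, not taken from the paper): each continuous depth value is mapped to one of a fixed number of class labels.

```python
import numpy as np

def depth_to_label(depth, d_min=0.5, d_max=10.0, num_bins=80):
    """Hypothetical sketch: map continuous depth to discrete class labels
    using bin edges spaced uniformly in log-depth, so nearby depths get
    finer bins."""
    log_edges = np.linspace(np.log(d_min), np.log(d_max), num_bins + 1)
    d = np.clip(depth, d_min, d_max)
    labels = np.digitize(np.log(d), log_edges) - 1
    return np.clip(labels, 0, num_bins - 1)
```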

    Monocular Depth Estimation with Hierarchical Fusion of Dilated CNNs and Soft-Weighted-Sum Inference

    Monocular depth estimation is a challenging task in complex compositions depicting multiple objects of diverse scales. Despite recent great progress thanks to deep convolutional neural networks (CNNs), state-of-the-art monocular depth estimation methods still fall short in handling such real-world challenging scenarios. In this paper, we propose a deep end-to-end learning framework to tackle these challenges, which learns the direct mapping from a color image to the corresponding depth map. First, we represent monocular depth estimation as a multi-category dense labeling task, in contrast to the regression-based formulation. In this way, we can build upon the recent progress in dense labeling such as semantic segmentation. Second, we fuse different side-outputs from our front-end dilated convolutional neural network in a hierarchical way to exploit multi-scale depth cues, which is critical to achieve scale-aware depth estimation. Third, we propose to use soft-weighted-sum inference instead of hard-max inference, transforming the discretized depth score into a continuous depth value. Thus, we reduce the influence of quantization error and improve the robustness of our method. Extensive experiments on the NYU Depth V2 and KITTI datasets show the superiority of our method compared with current state-of-the-art methods. Furthermore, experiments on the NYU V2 dataset reveal that our model is able to learn the probability distribution of depth.
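    Soft-weighted-sum inference can be read as taking the expectation of the bin centers under the softmax distribution rather than the arg-max bin; a small sketch is given below (the tensor shapes are assumptions for illustration).

```python
import torch

def soft_weighted_sum(logits, bin_centers):
    """Hypothetical sketch: predicted depth is the expectation of the bin
    centers under the softmax distribution, reducing quantization error
    compared with hard-max inference."""
    prob = torch.softmax(logits, dim=1)            # (N, num_bins, H, W)
    centers = bin_centers.view(1, -1, 1, 1)        # (1, num_bins, 1, 1)
    return (prob * centers).sum(dim=1)             # (N, H, W) continuous depth
```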

    OmniDepth: Dense Depth Estimation for Indoors Spherical Panoramas

    Recent work on depth estimation has so far focused only on projective images, ignoring 360 content, which is now increasingly and more easily produced. We show that monocular depth estimation models trained on traditional images produce sub-optimal results on omnidirectional images, showcasing the need for training directly on 360 datasets, which, however, are hard to acquire. In this work, we circumvent the challenges associated with acquiring high-quality 360 datasets with ground-truth depth annotations by re-using recently released large-scale 3D datasets and re-purposing them to 360 via rendering. This dataset, which is considerably larger than similar projective datasets, is publicly offered to the community to enable future research in this direction. We use this dataset to learn, in an end-to-end fashion, the task of depth estimation from 360 images. We show promising results on our synthesized data as well as on unseen realistic images. Comment: Pre-print to appear in ECCV18

    Visualization of Convolutional Neural Networks for Monocular Depth Estimation

    Recently, convolutional neural networks (CNNs) have shown great success on the task of monocular depth estimation. A fundamental yet unanswered question is how CNNs can infer depth from a single image. Toward answering this question, we consider visualizing the inference of a CNN by identifying the pixels of an input image that are relevant to depth estimation. We formulate this as an optimization problem of identifying the smallest number of image pixels from which the CNN can estimate a depth map with the minimum difference from the estimate obtained from the entire image. To cope with the difficulty of optimizing through a deep CNN, we propose to use another network to predict those relevant image pixels in a forward computation. In our experiments, we first show the effectiveness of this approach, and then apply it to different depth estimation networks on indoor and outdoor scene datasets. The results provide several findings that help explore the above question.
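    A hedged sketch of the mask objective implied by this formulation is shown below (the function, weighting, and shapes are assumptions, not the authors' code): the masked image should yield nearly the same depth map as the full image while the mask stays sparse, so the surviving pixels are the ones the CNN relies on.

```python
import torch

def relevance_mask_loss(depth_net, image, mask, sparsity_weight=0.1):
    """Hypothetical sketch: fidelity between depth from the masked image and
    depth from the full image, plus an L1 sparsity penalty on the mask."""
    with torch.no_grad():
        full_depth = depth_net(image)              # reference prediction
    masked_depth = depth_net(image * mask)         # prediction from kept pixels
    fidelity = torch.abs(masked_depth - full_depth).mean()
    sparsity = mask.abs().mean()                   # penalize keeping many pixels
    return fidelity + sparsity_weight * sparsity
```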

    A Compromise Principle in Deep Monocular Depth Estimation

    Monocular depth estimation, which plays a key role in understanding 3D scene geometry, is fundamentally an ill-posed problem. Existing methods based on deep convolutional neural networks (DCNNs) have examined this problem by learning convolutional networks to estimate continuous depth maps from monocular images. However, we find that training a network to predict a high-spatial-resolution continuous depth map often suffers from poor local solutions. In this paper, we hypothesize that achieving a compromise between spatial and depth resolutions can improve network training. Based on this "compromise principle", we propose a regression-classification cascaded network (RCCN), which consists of a regression branch predicting a low-spatial-resolution continuous depth map and a classification branch predicting a high-spatial-resolution discrete depth map. The two branches form a cascaded structure that allows the classification and regression branches to benefit from each other. By leveraging large-scale raw training datasets and several data augmentation strategies, our network achieves top or state-of-the-art results on the NYU Depth V2, KITTI, and Make3D benchmarks.
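    A minimal sketch of how the two-branch supervision could be combined is given below (illustrative only; the specific loss terms and weighting are assumptions, not the published RCCN objective).

```python
import torch.nn.functional as F

def rccn_style_loss(reg_pred_lowres, depth_lowres,
                    cls_logits_highres, depth_labels_highres, w_cls=1.0):
    """Hypothetical sketch: supervise a regression branch with continuous
    depth at low spatial resolution and a classification branch with
    discretized depth labels at high spatial resolution."""
    reg_loss = F.l1_loss(reg_pred_lowres, depth_lowres)
    cls_loss = F.cross_entropy(cls_logits_highres, depth_labels_highres)
    return reg_loss + w_cls * cls_loss
```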

    FishNet: A Camera Localizer using Deep Recurrent Networks

    This paper proposes a robust localization system that employs deep learning for better scene representation and enhances the accuracy of 6-DOF camera pose estimation. Inspired by the fact that global scene structure can be revealed by a wide field of view, we leverage the large overlap of a fisheye camera between adjacent frames and the powerful high-level feature representations of deep learning. Our main contribution is a novel network architecture that extracts both temporal and spatial information using a recurrent neural network. Specifically, we propose a novel pose regularization term combined with LSTM. This leads to smoother pose estimation, especially for large outdoor scenery. Promising experimental results on three benchmark datasets demonstrate the effectiveness of the proposed approach.
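    The pose regularization term is not specified in detail in this abstract; one plausible reading, sketched below under that assumption, penalizes large jumps between consecutive 6-DOF pose predictions so that the recurrent estimates form a smooth trajectory.

```python
def pose_smoothness_penalty(poses, weight=0.1):
    """Hypothetical sketch: penalize differences between consecutive 6-DOF
    pose predictions in a sequence to encourage smooth trajectories."""
    # poses: (T, 6) tensor of [x, y, z, roll, pitch, yaw] predictions
    diffs = poses[1:] - poses[:-1]
    return weight * (diffs ** 2).sum(dim=1).mean()
```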

    VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry

    Semantic understanding and localization are fundamental enablers of robot autonomy that have for the most part been tackled as disjoint problems. While deep learning has enabled recent breakthroughs across a wide spectrum of scene understanding tasks, its applicability to state estimation tasks has been limited due to the direct formulation that renders it incapable of encoding scene-specific constraints. In this work, we propose the VLocNet++ architecture, which employs a multitask learning approach to exploit the inter-task relationship between learning semantics, regressing 6-DoF global pose, and estimating odometry, for the mutual benefit of each of these tasks. Our network overcomes the aforementioned limitation by simultaneously embedding geometric and semantic knowledge of the world into the pose regression network. We propose a novel adaptive weighted fusion layer to aggregate motion-specific temporal information and to fuse semantic features into the localization stream based on region activations. Furthermore, we propose a self-supervised warping technique that uses the relative motion to warp intermediate network representations in the segmentation stream for learning consistent semantics. Finally, we introduce a first-of-its-kind urban outdoor localization dataset with pixel-level semantic labels and multiple loops for training deep networks. Extensive experiments on the challenging Microsoft 7-Scenes benchmark and our DeepLoc dataset demonstrate that our approach exceeds the state of the art, outperforming local feature-based methods while simultaneously performing multiple tasks and exhibiting substantial robustness in challenging scenarios. Comment: Demo and dataset available at http://deeploc.cs.uni-freiburg.de
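    A hypothetical sketch of what an adaptive weighted fusion layer could look like is given below (the channel counts and gating design are assumptions, not the published architecture): element-wise weights predicted from the concatenated features blend the current stream with features from another stream, such as a previous frame or the segmentation branch.

```python
import torch
import torch.nn as nn

class AdaptiveWeightedFusion(nn.Module):
    """Hypothetical sketch: predict element-wise gating weights from the
    concatenated feature maps and use them to blend two feature streams."""
    def __init__(self, channels):
        super().__init__()
        self.weight_pred = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, current_feat, other_feat):
        # w in (0, 1) per element; convex combination of the two streams
        w = self.weight_pred(torch.cat([current_feat, other_feat], dim=1))
        return w * current_feat + (1.0 - w) * other_feat
```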