10 research outputs found
Learning Stereo from Single Images
Supervised deep networks are among the best methods for finding
correspondences in stereo image pairs. Like all supervised approaches, these
networks require ground truth data during training. However, collecting large
quantities of accurate dense correspondence data is very challenging. We
propose that it is unnecessary to have such a high reliance on ground truth
depths or even corresponding stereo pairs. Inspired by recent progress in
monocular depth estimation, we generate plausible disparity maps from single
images. In turn, we use those flawed disparity maps in a carefully designed
pipeline to generate stereo training pairs. Training in this manner makes it
possible to convert any collection of single RGB images into stereo training
data. This results in a significant reduction in human effort, with no need to
collect real depths or to hand-design synthetic data. We can consequently train
a stereo matching network from scratch on datasets like COCO, which were
previously hard to exploit for stereo. Through extensive experiments we show
that our approach outperforms stereo networks trained with standard synthetic
datasets, when evaluated on KITTI, ETH3D, and Middlebury.Comment: Accepted as an oral presentation at ECCV 202
FCDSN-DC: An Accurate and Lightweight Convolutional Neural Network for Stereo Estimation with Depth Completion
We propose an accurate and lightweight convolutional neural network for stereo estimation with depth completion. We name this method fully-convolutional deformable similarity network with depth completion (FCDSN-DC). This method extends FC-DCNN by improving the feature extractor, adding a network structure for training highly accurate similarity functions and a network structure for filling inconsistent disparity estimates. The whole method consists of three parts. The first part consists of fully-convolutional densely connected layers that computes expressive features of rectified image pairs. The second part of our network learns highly accurate similarity functions between this learned features. It consists of densely-connected convolution layers with a deformable convolution block at the end to further improve the accuracy of the results. After this step
an initial disparity map is created and the left-right consistency check is performed in order to remove inconsistent points. The last part of the network then uses this input together with the corresponding left RGB image in order to train a network that fills in the missing measurements. Consistent depth estimations are gathered around invalid points and are parsed together with the RGB points into a shallow CNN network structure in order to recover the missing values. We evaluate our method on challenging real world indoor and outdoor scenes, in particular Middlebury, KITTI and ETH3D were it produces competitive results. We furthermore show that this method generalizes well and is well suited for many applications without the need of further training. The code of our full framework is available at: https://github.com/thedodo/FCDSN-D
On the Synergies between Machine Learning and Binocular Stereo for Depth Estimation from Images: a Survey
Stereo matching is one of the longest-standing problems in computer vision
with close to 40 years of studies and research. Throughout the years the
paradigm has shifted from local, pixel-level decision to various forms of
discrete and continuous optimization to data-driven, learning-based methods.
Recently, the rise of machine learning and the rapid proliferation of deep
learning enhanced stereo matching with new exciting trends and applications
unthinkable until a few years ago. Interestingly, the relationship between
these two worlds is two-way. While machine, and especially deep, learning
advanced the state-of-the-art in stereo matching, stereo itself enabled new
ground-breaking methodologies such as self-supervised monocular depth
estimation based on deep networks. In this paper, we review recent research in
the field of learning-based depth estimation from single and binocular images
highlighting the synergies, the successes achieved so far and the open
challenges the community is going to face in the immediate future.Comment: Accepted to TPAMI. Paper version of our CVPR 2019 tutorial:
"Learning-based depth estimation from stereo and monocular images: successes,
limitations and future challenges"
(https://sites.google.com/view/cvpr-2019-depth-from-image/home
Vision-based Self-Supervised Depth Perception and Motion Control for Mobile Robots
The advances in robotics have enabled many different opportunities to deploy a mobile robot in various settings. However, many current mobile robots are equipped with a sensor suite with multiple types of sensors. This expensive sensor suite and the computationally complex program to fully utilize these sensors may limit the large-scale deployment of these robots. The recent development of computer vision has enabled the possibility to complete various robotic tasks with simply camera systems. This thesis focuses on two problems related to vision-based mobile robots: depth perception and motion control.
Commercially available stereo cameras relying on traditional stereo matching algorithms are widely used in robotic applications to obtain depth information. Although their raw (predicted) disparity maps may contain incorrect estimates, they can still provide useful prior information towards more accurate predictions. We propose a data-driven pipeline to incorporate the raw disparity to predict high-quality disparity maps. The pipeline first utilizes a confidence generation component to identify raw disparity inaccuracies. Then a deep neural network, which consists of a feature extraction module, a confidence guided raw disparity fusion module, and a hierarchical occlusion-aware disparity refinement module, computes the final disparity estimates and their corresponding occlusion masks. The pipeline can be trained in a self-supervised manner, removing the need of expensive ground truth training labels. Experimental results on public datasets show that the pipeline has competitive accuracy with real-time processing rate. The pipeline is also tested with images captured by commercial stereo cameras to demonstrate its effectiveness in improving their raw disparity estimates.
After the stereo matching pipeline predicts the disparity maps, they are used by a proposed disparity-based direct visual servoing controller to compute the commanded velocity to move a mobile robot towards its target pose. Many previous visual servoing methods rely on complex and error-prone feature extraction and matching steps. The proposed visual servoing framework follows the direct visual servoing approach which does not require any extraction or matching process. Hence, its performance is not affected by the potential errors introduced by these steps. Furthermore, the predicted occlusion masks are also incorporated in the controller to address the occlusion problem inherited from a stereo camera setup. The performance of the proposed control strategy is verified by extensive simulations and experiments