Learning monocular depth estimation with unsupervised trinocular assumptions
Obtaining accurate depth measurements out of a single image represents a
fascinating solution to 3D sensing. CNNs led to considerable improvements in
this field, and recent trends replaced the need for ground-truth labels with
geometry-guided image reconstruction signals enabling unsupervised training.
Currently, for this purpose, state-of-the-art techniques rely on images
acquired with a binocular stereo rig to predict inverse depth (i.e., disparity)
according to the aforementioned supervision principle. However, these methods
suffer from well-known artifacts, e.g. near occlusions and at the left image
border, inherited from the stereo setup. Therefore, in this paper, we tackle these
issues by moving to a trinocular domain for training. Assuming the central
image as the reference, we train a CNN to infer disparity representations
pairing it with the frames on its left and right. This strategy yields depth
maps unaffected by typical stereo artifacts. Moreover, since trinocular
datasets are seldom available, we introduce a novel interleaved training
procedure that enforces the trinocular assumption using standard binocular
datasets. Exhaustive experimental results on the KITTI dataset
confirm that our proposal outperforms state-of-the-art methods for unsupervised
monocular depth estimation trained on binocular stereo pairs as well as any
known methods relying on other cues.
Comment: 14 pages, 7 figures, 4 tables. Accepted to 3DV 201
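The geometry-guided image-reconstruction signal mentioned above can be caricatured in a few lines: a view is synthesized from its stereo partner via the predicted disparity, and the photometric error between the synthesized and real views supervises the network. The sketch below (NumPy, nearest-neighbour sampling, no occlusion handling) is an illustrative simplification, not the paper's implementation; real methods use differentiable bilinear sampling and richer losses such as SSIM.

```python
import numpy as np

def warp_with_disparity(right, disparity):
    """Reconstruct the left view from the right image: for rectified stereo,
    left[y, x] corresponds to right[y, x - d(y, x)].  Nearest-neighbour
    sampling for clarity; real pipelines use differentiable bilinear warping."""
    h, w = right.shape
    xs = np.arange(w)
    warped = np.empty_like(right)
    for y in range(h):
        src = np.clip(xs - disparity[y].astype(int), 0, w - 1)
        warped[y] = right[y, src]
    return warped

def photometric_loss(left, right, disparity):
    """Mean absolute reconstruction error: the unsupervised training signal."""
    return np.abs(left - warp_with_disparity(right, disparity)).mean()
```

A perfect disparity prediction drives this loss to zero on textured, unoccluded regions; the artifacts near occlusions and image borders discussed above arise exactly where no valid correspondence exists for the warp.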
Footprints and Free Space from a Single Color Image
Understanding the shape of a scene from a single color image is a formidable
computer vision task. However, most methods aim to predict the geometry of
surfaces that are visible to the camera, which is of limited use when planning
paths for robots or augmented reality agents. Such agents can only move when
grounded on a traversable surface, which we define as the set of surface
classes that humans can walk over, such as grass, footpaths and pavement. Models which
predict beyond the line of sight often parameterize the scene with voxels or
meshes, which can be expensive to use in machine learning frameworks.
We introduce a model to predict the geometry of both visible and occluded
traversable surfaces, given a single RGB image as input. We learn from stereo
video sequences, using camera poses, per-frame depth and semantic segmentation
to form training data, which is used to supervise an image-to-image network. We
train models from the KITTI driving dataset, the indoor Matterport dataset, and
from our own casually captured stereo footage. We find that a surprisingly low
bar for spatial coverage of training scenes is required. We validate our
algorithm against a range of strong baselines, and include an assessment of our
predictions for a path-planning task.
Comment: Accepted to CVPR 2020 as an oral presentation
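The label-generation step described above, combining per-frame semantic segmentation across a registered video sequence, can be sketched minimally as follows. This is a hypothetical illustration under stated assumptions: the per-frame masks are assumed to be already warped into the reference view using the known camera poses and depths, and the class list and function names are placeholders, not the authors' code.

```python
import numpy as np

# Walkable classes, per the paper's definition of traversable surfaces.
TRAVERSABLE = {"grass", "footpath", "pavement"}

def traversable_mask(semantics, class_names):
    """Binary mask of pixels whose semantic label is a walkable class.
    `semantics` holds per-pixel class indices into `class_names`."""
    ids = [i for i, name in enumerate(class_names) if name in TRAVERSABLE]
    return np.isin(semantics, ids)

def aggregate_footprints(masks):
    """Union per-frame traversable masks (assumed already registered to the
    reference view) to approximate visible + occluded traversable ground:
    surfaces seen in later frames but hidden in the reference frame."""
    out = np.zeros_like(masks[0], dtype=bool)
    for m in masks:
        out |= m
    return out
```

The union over frames is what lets the training target extend beyond the reference camera's line of sight, which is precisely the supervision an image-to-image network needs to predict occluded traversable surfaces.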