Learning Single-Image Depth from Videos using Quality Assessment Networks
Depth estimation from a single image in the wild remains a challenging
problem. One main obstacle is the lack of high-quality training data for images
in the wild. In this paper we propose a method to automatically generate such
data through Structure-from-Motion (SfM) on Internet videos. The core of this
method is a Quality Assessment Network that identifies high-quality
reconstructions obtained from SfM. Using this method, we collect single-view
depth training data from a large number of YouTube videos and construct a new
dataset called YouTube3D. Experiments show that YouTube3D is useful in training
depth estimation networks and advances the state of the art of single-view
depth estimation in the wild.
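As a rough illustration (not the paper's actual architecture), a quality assessment step could be a small network that maps per-reconstruction statistics to a score used to filter the SfM output before it becomes training data. Every feature, layer size, and threshold below is an assumption:

    import torch
    import torch.nn as nn

    class QualityAssessmentNet(nn.Module):
        # Maps per-reconstruction statistics (hypothetical features such as
        # reprojection-error summaries or track lengths) to a score in [0, 1].
        def __init__(self, num_features=16):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(num_features, 64), nn.ReLU(),
                nn.Linear(64, 64), nn.ReLU(),
                nn.Linear(64, 1), nn.Sigmoid(),
            )

        def forward(self, feats):
            return self.mlp(feats).squeeze(-1)

    net = QualityAssessmentNet()
    feats = torch.randn(8, 16)      # 8 candidate SfM reconstructions
    keep = net(feats) > 0.5         # retain only high-quality reconstructions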
Face Normals "in-the- wild" using Fully Convolutional Networks
In this work we pursue a data-driven approach to the problem of estimating surface normals from a single intensity image, focusing in particular on human faces. We introduce new methods to exploit the currently available facial databases for dataset construction and tailor a deep convolutional neural network to the task of estimating facial surface normals in-the-wild. We train a fully convolutional network that can accurately recover facial normals from images spanning a challenging variety of expressions and facial poses. We compare against state-of-the-art face Shape-from-Shading and 3D reconstruction techniques and show that the proposed network recovers substantially more accurate and realistic normals. Furthermore, in contrast to other existing face-specific surface recovery methods, we do not require an explicit alignment step, owing to the fully convolutional nature of our network.
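A minimal sketch of what such a fully convolutional normal estimator might look like in PyTorch; the layer sizes are assumptions, not the paper's architecture. The key property is that the output is a per-pixel unit normal at the input resolution, which is what removes the need for an explicit alignment step:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NormalFCN(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.head = nn.Conv2d(64, 3, 3, padding=1)

        def forward(self, img):
            h, w = img.shape[-2:]
            n = self.head(self.encoder(img))
            n = F.interpolate(n, size=(h, w), mode="bilinear",
                              align_corners=False)
            return F.normalize(n, dim=1)    # per-pixel unit normals

    normals = NormalFCN()(torch.randn(1, 3, 128, 128))  # (1, 3, 128, 128)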
Surface Normal Estimation of Tilted Images via Spatial Rectifier
In this paper, we present a spatial rectifier to estimate surface normals of
tilted images. Tilted images are of particular interest as more visual data are
captured by arbitrarily oriented sensors such as body-/robot-mounted cameras.
Existing approaches exhibit limited performance when predicting surface normals
because they were trained on gravity-aligned images. Our two main hypotheses
are: (1) visual scene layout is indicative of the gravity direction; and (2)
not all surfaces are equally well represented by a learned estimator due to the
structured distribution of the training data; thus, for each tilted image there
exists a transformation under which the image is more responsive to the learned
estimator than under others. We design a spatial rectifier that is learned to
transform the surface normal distribution of a tilted image to the rectified
one that matches the gravity-aligned training data distribution. Along with the
spatial rectifier, we propose a novel truncated angular loss that offers a
stronger gradient at smaller angular errors and robustness to outliers. The
resulting estimator outperforms the state-of-the-art methods including data
augmentation baselines not only on ScanNet and NYUv2 but also on a new dataset
called Tilt-RGBD that includes considerable roll and pitch camera motion.
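One plausible reading of such a truncated angular loss, sketched in PyTorch; the exact truncation rule in the paper may differ. The arccos term has a steep gradient as the angular error approaches zero, and clamping the angle caps the penalty for outliers:

    import torch

    def truncated_angular_loss(pred, gt, max_angle=2.0):
        # pred, gt: (N, 3) unit normal vectors; max_angle in radians (assumed).
        cos = (pred * gt).sum(dim=-1).clamp(-1 + 1e-6, 1 - 1e-6)
        angle = torch.acos(cos)           # gradient grows as the error shrinks
        return angle.clamp(max=max_angle).mean()    # truncation caps outliers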
Single-Image Depth Prediction Makes Feature Matching Easier
Good local features improve the robustness of many 3D re-localization and
multi-view reconstruction pipelines. The problem is that viewing angle and
distance severely impact the recognizability of a local feature. Attempts to
improve appearance invariance by choosing better local feature points or by
leveraging outside information have come with pre-requisites that made some of
them impractical. In this paper, we propose a surprisingly effective
enhancement to local feature extraction, which improves matching. We show that
CNN-based depths inferred from single RGB images are quite helpful, despite
their flaws. They allow us to pre-warp images and rectify perspective
distortions, to significantly enhance SIFT and BRISK features, enabling more
good matches, even when cameras are looking at the same scene but in opposite
directions.
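As a hedged sketch of the general idea (not the authors' pipeline), one could approximate a dominant scene plane from the predicted depth, warp the image toward a fronto-parallel view with a rotation homography, and then extract SIFT features on the rectified image. The crude plane fit and the known intrinsics K are assumptions:

    import cv2
    import numpy as np

    def rectify_then_sift(img, depth, K):
        # img: 8-bit grayscale image; depth: (H, W) predicted depth map;
        # K: 3x3 camera intrinsics (assumed known).
        h, w = depth.shape
        # Very rough plane normal from mean depth gradients (assumes one
        # dominant plane; real pipelines handle surfaces per region).
        gy, gx = np.gradient(depth)
        n = np.array([-gx.mean(), -gy.mean(), 1.0])
        n /= np.linalg.norm(n)
        # Rodrigues-style rotation aligning the plane normal with the
        # optical axis, so the plane becomes fronto-parallel.
        z = np.array([0.0, 0.0, 1.0])
        v, c = np.cross(n, z), float(n @ z)
        vx = np.array([[0, -v[2], v[1]],
                       [v[2], 0, -v[0]],
                       [-v[1], v[0], 0]])
        R = np.eye(3) + vx + vx @ vx / (1.0 + c)
        H = K @ R @ np.linalg.inv(K)    # homography of a pure camera rotation
        warped = cv2.warpPerspective(img, H, (w, h))
        return cv2.SIFT_create().detectAndCompute(warped, None)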