28,876 research outputs found
Consistency Guided Scene Flow Estimation
Consistency Guided Scene Flow Estimation (CGSF) is a self-supervised
framework for the joint reconstruction of 3D scene structure and motion from
stereo video. The model takes two temporal stereo pairs as input, and predicts
disparity and scene flow. The model self-adapts at test time by iteratively
refining its predictions. The refinement process is guided by a consistency
loss, which combines stereo and temporal photo-consistency with a geometric
term that couples disparity and 3D motion. To handle inherent modeling error in
the consistency loss (e.g. Lambertian assumptions) and for better
generalization, we further introduce a learned output refinement network,
which takes the initial predictions, the loss, and the gradient as input, and
efficiently predicts a correlated output update. In multiple experiments,
including ablation studies, we show that the proposed model can reliably
predict disparity and scene flow in challenging imagery, achieves better
generalization than the state-of-the-art, and adapts quickly and robustly to
unseen domains.
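To make the photo-consistency idea concrete, here is a minimal NumPy sketch of the stereo term: the right image is warped toward the left view with the predicted disparity and compared photometrically. The grayscale setup, bilinear sampling, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Sample the right image at x - d(x) for every left pixel (horizontal
    bilinear warp); `right` and `disparity` are H x W float arrays."""
    h, w = disparity.shape
    xs = np.arange(w)[None, :] - disparity        # source columns in the right image
    xs = np.clip(xs, 0, w - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.clip(x0 + 1, 0, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1 - frac) * right[rows, x0] + frac * right[rows, x1]

def stereo_photo_consistency(left, right, disparity):
    """Mean absolute photometric error between the left image and the
    disparity-warped right image (the stereo term of a consistency loss)."""
    return np.abs(left - warp_right_to_left(right, disparity)).mean()
```

A temporal term can be built the same way by warping across time with the predicted scene flow instead of across the baseline with disparity.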
GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose
We propose GeoNet, a jointly unsupervised learning framework for monocular
depth, optical flow and ego-motion estimation from videos. The three components
are coupled by the nature of 3D scene geometry, jointly learned by our
framework in an end-to-end manner. Specifically, geometric relationships are
extracted over the predictions of individual modules and then combined as an
image reconstruction loss, reasoning about static and dynamic scene parts
separately. Furthermore, we propose an adaptive geometric consistency loss to
increase robustness towards outliers and non-Lambertian regions, which resolves
occlusions and texture ambiguities effectively. Experimentation on the KITTI
driving dataset reveals that our scheme achieves state-of-the-art results in
all three tasks, performing better than previous unsupervised methods
and comparably with supervised ones.
Comment: Accepted to CVPR 2018; code will be made available at https://github.com/yzcjtr/GeoNet
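The geometric coupling between depth, ego-motion, and flow that GeoNet exploits reduces to the rigid flow induced by camera motion alone. The NumPy sketch below is our illustration, with assumed variable names: back-project each pixel with its depth, apply the relative pose, and reproject.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Optical flow induced purely by camera motion: back-project pixels with
    depth, transform by the relative pose (R, t), reproject with intrinsics K.
    `depth` is H x W; K and R are 3 x 3; t has shape (3,)."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x N homogeneous pixels
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                   # 3-D points in camera frame
    proj = K @ (R @ cam + t[:, None])                                     # reproject into second view
    uv = (proj[:2] / proj[2:]).T.reshape(h, w, 2)
    return uv - np.stack([xs, ys], axis=-1).astype(float)                 # flow = pixel displacement
```

Comparing this rigid flow against a flow network's output is one way to reason about static and dynamic scene parts separately.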
Monocular Depth Estimation: A Survey
Monocular depth estimation is often described as an ill-posed and inherently
ambiguous problem. Estimating depth from 2D images is a crucial step in scene
reconstruction, 3D object recognition, segmentation, and detection. The problem
can be framed as: given a single RGB image as input, predict a dense depth map,
i.e. a depth value for each pixel. This problem is made harder by the fact that most scenes have
large texture and structural variations, object occlusions, and rich geometric
detailing. All these factors contribute to difficulty in accurate depth
estimation. In this paper, we review five papers that attempt to solve the
depth estimation problem with various approaches, including supervised,
weakly-supervised, and unsupervised learning. We then compare these
papers and examine how each improves on the others. Finally, we
explore potential improvements that could help solve this problem better.
Comment: 8 pages, 1 figure, 4 tables
Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding
Learning to estimate 3D geometry in a single frame and optical flow from
consecutive frames by watching unlabeled videos via deep convolutional network
has made significant progress recently. Current state-of-the-art (SoTA) methods
treat the two tasks independently. One typical assumption of existing depth
estimation methods is that scenes contain no independently moving objects,
while object motion can be easily modeled using optical flow. In this paper,
we propose to address the two tasks as a whole, i.e. to jointly understand
per-pixel 3D geometry and motion. This eliminates the need for a static-scene
assumption and enforces the inherent geometric consistency during the
learning process, yielding significantly improved results for both tasks. We
call our method "Every Pixel Counts++" or "EPC++". Specifically, during
training, given two consecutive frames from a video, we adopt three parallel
networks to predict the camera motion (MotionNet), dense depth map (DepthNet),
and per-pixel optical flow between two frames (OptFlowNet) respectively. The
three types of information are fed into a holistic 3D motion parser (HMP), and
the per-pixel 3D motions of the rigid background and of moving objects are
disentangled and recovered. Comprehensive experiments were conducted on
datasets with different scenes, including driving scenario (KITTI 2012 and
KITTI 2015 datasets), mixed outdoor/indoor scenes (Make3D) and synthetic
animation (MPI Sintel dataset). Performance on the five tasks of depth
estimation, optical flow estimation, odometry, moving object segmentation and
scene flow estimation shows that our approach outperforms other SoTA methods.
Code will be available at: https://github.com/chenxuluo/EPC.
Comment: Chenxu Luo, Zhenheng Yang, and Peng Wang contributed equally; TPAMI submission
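A minimal sketch of the disentangling step behind the holistic 3D motion parser: subtract the camera-induced rigid flow (see the `rigid_flow` sketch above) from the full flow and treat large residuals as independently moving objects. The threshold and names are illustrative assumptions.

```python
import numpy as np

def decompose_motion(flow, rigid, thresh=1.0):
    """Split per-pixel 2-D motion into a camera-induced part and an object
    residual; `flow` and `rigid` are H x W x 2 arrays in pixels."""
    residual = flow - rigid                                # motion the camera cannot explain
    moving = np.linalg.norm(residual, axis=-1) > thresh    # mask of independently moving pixels
    return residual, moving
```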
SceneFlowFields++: Multi-frame Matching, Visibility Prediction, and Robust Interpolation for Scene Flow Estimation
State-of-the-art scene flow algorithms pursue the conflicting targets of
accuracy, run time, and robustness. With the successful concept of pixel-wise
matching and sparse-to-dense interpolation, we push the limits of scene flow
estimation. Avoiding strong assumptions on the domain or the problem yields a
more robust algorithm. The algorithm is fast because we avoid explicit
regularization during matching, which allows efficient computation. Using
image information from multiple time steps and explicit visibility prediction
based on previous results, we achieve competitive performance on different
datasets. Our contributions and results are evaluated in comparative
experiments. Overall, we present an accurate scene flow algorithm that is
faster and more generic than any individual benchmark leader.
Comment: arXiv admin note: text overlap with arXiv:1710.1009
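The sparse-to-dense interpolation step can be illustrated with plain scipy interpolation; the paper relies on a robust, edge-aware interpolator, so the snippet below is only a stand-in under that caveat.

```python
import numpy as np
from scipy.interpolate import griddata

def sparse_to_dense(coords, values, shape):
    """Densify sparse matches: `coords` is an N x 2 array of (row, col)
    positions, `values` the N matched quantities (e.g. one flow component),
    `shape` the target (H, W)."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    dense = griddata(coords, values, (ys, xs), method='linear')
    holes = np.isnan(dense)      # linear interpolation is undefined outside the convex hull
    dense[holes] = griddata(coords, values, (ys, xs), method='nearest')[holes]
    return dense
```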
Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras
We present a novel method for simultaneous learning of depth, egomotion,
object motion, and camera intrinsics from monocular videos, using only
consistency across neighboring video frames as the supervision signal. Similarly to
prior work, our method learns by applying differentiable warping to frames and
comparing the result to adjacent ones, but it provides several improvements: We
address occlusions geometrically and differentiably, directly using the depth
maps as predicted during training. We introduce randomized layer normalization,
a novel and powerful regularizer, and we account for object motion relative to the
scene. To the best of our knowledge, our work is the first to learn the camera
intrinsic parameters, including lens distortion, from video in an unsupervised
manner, thereby allowing us to extract accurate depth and motion from arbitrary
videos of unknown origin at scale. We evaluate our results on the Cityscapes,
KITTI and EuRoC datasets, establishing new state of the art on depth prediction
and odometry, and demonstrate qualitatively that depth prediction can be
learned from a collection of YouTube videos.
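As a hedged sketch of the randomized layer normalization idea, the snippet below jitters the per-layer statistics with multiplicative Gaussian noise at training time; the exact noise placement and scale in the paper may differ, so treat both as assumptions.

```python
import numpy as np

def randomized_layer_norm(x, training=True, sigma=0.5, eps=1e-5):
    """Layer normalization over the last axis whose mean and variance are
    jittered by multiplicative Gaussian noise during training (regularizer)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    if training:
        mean = mean * np.random.normal(1.0, sigma, size=mean.shape)
        var = var * np.random.normal(1.0, sigma, size=var.shape)
    return (x - mean) / np.sqrt(np.abs(var) + eps)  # abs guards against noisy negative variance
```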
DispSegNet: Leveraging Semantics for End-to-End Learning of Disparity Estimation from Stereo Imagery
Recent work has shown that convolutional neural networks (CNNs) can be
applied successfully in disparity estimation, but these methods still suffer
from errors in low-texture regions, occlusions, and reflections.
Concurrently, deep learning for semantic segmentation has shown great progress
in recent years. In this paper, we design a CNN architecture that combines
these two tasks to improve the quality and accuracy of disparity estimation
with the help of semantic segmentation. Specifically, we propose a network
structure in which these two tasks are highly coupled. One key novelty of this
approach is the two-stage refinement process. Initial disparity estimates are
refined with an embedding learned from the semantic segmentation branch of the
network. The proposed model is trained using an unsupervised approach, in which
images from one camera of the stereo pair are warped and compared against images
from the other camera. Another key advantage of the proposed approach is that a
single network is capable of outputting disparity estimates and semantic
labels. These outputs are of great use in autonomous vehicle operation, where
real-time constraints are key; such performance improvements increase the
viability of driving applications. Experiments on the KITTI and Cityscapes datasets
show that our model can achieve state-of-the-art results and that leveraging
the embedding learned from semantic segmentation improves the performance of
disparity estimation.
Comment: Added more description of the model architecture; added more discussion in Section IV-C; fixed a typo in a formula
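The two-stage refinement can be sketched as a small residual CNN that consumes the initial disparity together with the segmentation embedding; the PyTorch module below is our illustration, with assumed channel widths and depth.

```python
import torch
import torch.nn as nn

class DisparityRefiner(nn.Module):
    """Regress a residual correction to an initial disparity map from the
    disparity concatenated with a semantic embedding."""
    def __init__(self, emb_channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1 + emb_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, disparity, embedding):
        # disparity: B x 1 x H x W, embedding: B x emb_channels x H x W
        return disparity + self.net(torch.cat([disparity, embedding], dim=1))
```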
Learning monocular depth estimation infusing traditional stereo knowledge
Depth estimation from a single image represents a fascinating, yet
challenging problem with countless applications. Recent works have shown that this
task can be learned without direct supervision from ground-truth labels by
leveraging image synthesis on sequences or stereo pairs. Focusing on this
second case, in this paper we leverage stereo matching in order to improve
monocular depth estimation. To this aim we propose monoResMatch, a novel deep
architecture designed to infer depth from a single input image by synthesizing
features from a different point of view, horizontally aligned with the input
image, and performing stereo matching between the two. In contrast to previous
works sharing this rationale, our network is the first trained end-to-end from
scratch. Moreover, we show how obtaining proxy ground truth annotations through
traditional stereo algorithms, such as Semi-Global Matching, enables more
accurate monocular depth estimation while avoiding the need for expensive
depth labels, keeping the approach self-supervised. Exhaustive experimental
results prove how the synergy between i) the proposed monoResMatch architecture
and ii) proxy supervision attains state-of-the-art results in self-supervised
monocular depth estimation. The code is publicly available at
https://github.com/fabiotosi92/monoResMatch-Tensorflow.
Comment: Accepted at CVPR 2019. Code available at https://github.com/fabiotosi92/monoResMatch-Tensorflow
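The proxy-supervision idea reduces to adding a masked regression term against the traditional stereo output on top of the self-supervised loss; a minimal sketch with an assumed weighting follows.

```python
import numpy as np

def proxy_supervised_loss(pred_disp, sgm_disp, valid, photo_loss, w_proxy=0.1):
    """Self-supervised photometric loss plus an L1 term against disparities
    from a classical matcher (e.g. Semi-Global Matching), applied only where
    `valid` marks confident SGM estimates; the weight is an assumption."""
    proxy = np.abs(pred_disp - sgm_disp)[valid].mean()
    return photo_loss + w_proxy * proxy
```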
SIGNet: Semantic Instance Aided Unsupervised 3D Geometry Perception
Unsupervised learning for geometric perception (depth, optical flow, etc.) is
of great interest to autonomous systems. Recent works on unsupervised learning
have made considerable progress on perceiving geometry; however, they usually
ignore the coherence of objects and perform poorly in dark
and noisy environments. In contrast, supervised learning algorithms, while
robust, require large labeled geometric datasets. This paper introduces SIGNet,
a novel framework that provides robust geometry perception without requiring
geometrically informative labels. Specifically, SIGNet integrates semantic
information to make depth and flow predictions consistent with objects and
robust to low lighting conditions. SIGNet is shown to improve upon the
state-of-the-art unsupervised learning for depth prediction by 30% (in squared
relative error). In particular, SIGNet improves the dynamic object class
performance by 39% in depth prediction and 29% in flow prediction. Our code
will be made available at https://github.com/mengyuest/SIGNet
Comment: To appear at CVPR 2019
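One simple way to make predictions "consistent with objects" in the spirit of SIGNet is to penalize flow variance inside each instance mask; the exact loss below is our assumption, not the paper's.

```python
import numpy as np

def instance_flow_consistency(flow, instance_mask):
    """Sum of per-instance flow variances; `flow` is H x W x 2 and
    `instance_mask` is an H x W integer map with 0 for background."""
    loss = 0.0
    for inst in np.unique(instance_mask):
        if inst == 0:
            continue
        region = flow[instance_mask == inst]   # N x 2 flow vectors inside the object
        loss += region.var(axis=0).sum()       # coherent objects keep this small
    return loss
```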
ResDepth: Learned Residual Stereo Reconstruction
We propose an embarrassingly simple but very effective scheme for
high-quality dense stereo reconstruction: (i) generate an approximate
reconstruction with your favourite stereo matcher; (ii) rewarp the input images
with that approximate model; (iii) with the initial reconstruction and the
warped images as input, train a deep network to enhance the reconstruction by
regressing a residual correction; and (iv) if desired, iterate the refinement
with the new, improved reconstruction. The strategy of learning only the residual
greatly simplifies the learning problem. A standard U-Net without bells and
whistles is enough to reconstruct even small surface details, like dormers and
roof substructures in satellite images. We also investigate residual
reconstruction with less information and find that even a single image is
enough to greatly improve an approximate reconstruction. Our full model reduces
the mean absolute error of state-of-the-art stereo reconstruction systems by
>50%, both in our target domain of satellite stereo and on stereo pairs from
the ETH3D benchmark.
Comment: Updated supplementary material
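The four-step recipe maps directly onto a short refinement loop; `rewarp` and `refine_net` below are placeholders for the stereo rewarping step and the trained residual network, so this is a sketch of the control flow rather than the released implementation.

```python
def iterative_residual_refinement(recon, images, rewarp, refine_net, steps=2):
    """(i) start from an approximate reconstruction `recon`; then repeatedly
    (ii) rewarp the input images with the current model, (iii) regress and add
    a residual correction, and (iv) iterate with the improved reconstruction."""
    for _ in range(steps):
        warped = rewarp(images, recon)             # (ii) resample inputs with the current model
        recon = recon + refine_net(recon, warped)  # (iii) residual update
    return recon                                   # (iv) refined reconstruction
```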