MSDC-Net: Multi-Scale Dense and Contextual Networks for Automated Disparity Map for Stereo Matching
Disparity prediction from stereo images is essential to computer vision
applications including autonomous driving, 3D model reconstruction, and object
detection. To predict an accurate disparity map, we propose a novel deep learning
architecture, called MSDC-Net, for estimating the disparity map from a rectified
pair of stereo images. MSDC-Net contains two modules: a multi-scale fusion 2D
convolution module and a multi-scale residual 3D convolution module. The
multi-scale fusion 2D convolution module exploits potential multi-scale features,
extracting and fusing features at different scales with a Dense-Net. The
multi-scale residual 3D convolution module learns geometry context at different
scales from the cost volume aggregated by the multi-scale fusion 2D convolution
module. Experimental results on the Scene Flow and KITTI datasets demonstrate
that MSDC-Net significantly outperforms other approaches in the non-occluded
region.
Comment: Accepted at ICIGP 2019
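MSDC-Net's own modules are not reproduced here, but the cost volume its 3D convolutions consume is a standard construction in deep stereo matching: left features are paired with right features shifted across candidate disparities. A minimal PyTorch sketch of that generic construction, with all names illustrative:

    import torch

    def build_cost_volume(left_feat, right_feat, max_disp):
        # left_feat, right_feat: (B, C, H, W) feature maps from the 2D stage.
        # Returns a (B, 2C, max_disp, H, W) volume: for candidate disparity d,
        # left features at column x are paired with right features at x - d.
        B, C, H, W = left_feat.shape
        volume = left_feat.new_zeros(B, 2 * C, max_disp, H, W)
        for d in range(max_disp):
            if d == 0:
                volume[:, :C, d] = left_feat
                volume[:, C:, d] = right_feat
            else:
                volume[:, :C, d, :, d:] = left_feat[:, :, :, d:]
                volume[:, C:, d, :, d:] = right_feat[:, :, :, :-d]
        return volume

A 3D convolution module can then score each (d, y, x) cell of this volume to regress the disparity map.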
Learning Dense Stereo Matching for Digital Surface Models from Satellite Imagery
Digital Surface Model generation from satellite imagery is a difficult task
that has been largely overlooked by the deep learning community. Stereo
reconstruction techniques developed for terrestrial systems, including
self-driving cars, do not translate well to satellite imagery, where image pairs
vary considerably. In this work we present a neural network tailored for Digital
Surface Model generation, a ground-truthing and training scheme that maximizes
available hardware, and a comparison to existing methods. The
resulting models are smooth, preserve boundaries, and enable further
processing. This represents one of the first attempts at leveraging deep
learning in this domain.
Deep Rigid Instance Scene Flow
In this paper we tackle the problem of scene flow estimation in the context
of self-driving. We leverage deep learning techniques as well as strong priors,
since in our application domain the motion of the scene can be decomposed into
the motion of the robot and the 3D motion of the actors in the scene. We
formulate the problem as energy minimization in a deep structured model, which
can be solved efficiently on the GPU by unrolling a Gauss-Newton solver. Our
experiments in the challenging KITTI scene flow dataset show that we outperform
the state-of-the-art by a very large margin, while being 800 times faster.
Comment: CVPR 2019. Ranked 1st on the KITTI scene flow benchmark; 800 times
faster than prior art
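The paper's energy terms are not given in the abstract, but one unrolled Gauss-Newton update has the familiar closed form x <- x - (J^T J + lambda*I)^-1 J^T r. A generic PyTorch sketch of a single damped step, with the residual function left abstract; the names and the damping term are illustrative, not taken from the paper:

    import torch

    def gauss_newton_step(x, residual_fn, damping=1e-4):
        # One damped Gauss-Newton step minimizing 0.5 * ||residual_fn(x)||^2:
        #   x <- x - (J^T J + damping * I)^{-1} J^T r
        r = residual_fn(x)                                      # (M,)
        J = torch.autograd.functional.jacobian(residual_fn, x)  # (M, N)
        H = J.T @ J + damping * torch.eye(x.numel(), dtype=x.dtype,
                                          device=x.device)
        g = J.T @ r
        return x - torch.linalg.solve(H, g)

Unrolling a fixed number of such steps keeps every operation differentiable, which is what lets a solver like this sit inside a deep structured model.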
KeyPose: Multi-View 3D Labeling and Keypoint Estimation for Transparent Objects
Estimating the 3D pose of desktop objects is crucial for applications such as
robotic manipulation. Many existing approaches to this problem require a depth
map of the object for both training and prediction, which restricts them to
opaque, Lambertian objects that produce good returns in an RGBD sensor. In this
paper we forgo using a depth sensor in favor of raw stereo input. We address
two problems: first, we establish an easy method for capturing and labeling 3D
keypoints on desktop objects with an RGB camera; and second, we develop a deep
neural network, called KeyPose, that learns to accurately predict object
poses using 3D keypoints from stereo input, and works even for transparent
objects. To evaluate the performance of our method, we create a dataset of 15
clear objects in five classes, with 48K 3D-keypoint labeled images. We train
both instance and category models, and show generalization to new textures,
poses, and objects. KeyPose surpasses state-of-the-art performance in 3D pose
estimation on this dataset by factors of 1.5 to 3.5, even in cases where the
competing method is provided with ground-truth depth. Stereo input is essential
for this performance as it improves results compared to using monocular input
by a factor of 2. We will release a public version of the data capture and
labeling pipeline, the transparent object database, and the KeyPose models and
evaluation code. Project website: https://sites.google.com/corp/view/keypose.
Comment: CVPR 2020
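The network itself is not shown here, but the geometry that makes stereo input so effective for 3D keypoints is plain triangulation: in a rectified pair with focal length f and baseline B, a keypoint observed with horizontal disparity d lies at depth Z = fB/d. A self-contained illustration with made-up calibration values:

    def keypoint_3d(u_left, u_right, v, fx, fy, cx, cy, baseline):
        # Rectified stereo: depth from disparity, then back-project to 3D.
        d = u_left - u_right           # disparity in pixels (assumed > 0)
        Z = fx * baseline / d          # depth along the optical axis
        X = (u_left - cx) * Z / fx
        Y = (v - cy) * Z / fy
        return X, Y, Z

    # 600 px focal length, 6 cm baseline, 12 px disparity -> Z = 3.0 m
    print(keypoint_3d(332.0, 320.0, 250.0, 600.0, 600.0, 320.0, 240.0, 0.06))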
Learning Depth with Convolutional Spatial Propagation Network
Depth prediction is one of the fundamental problems in computer vision. In
this paper, we propose a simple yet effective convolutional spatial propagation
network (CSPN) to learn the affinity matrix for various depth estimation tasks.
Specifically, it is an efficient linear propagation model, in which the
propagation is performed in the manner of a recurrent convolutional operation,
and the affinity among neighboring pixels is learned through a deep
convolutional neural network (CNN). This module can be appended to the output
of any state-of-the-art (SOTA) depth estimation network to improve its
performance. In practice, we further extend CSPN in two aspects: 1) taking a
sparse depth map as additional input, which is useful for the task of depth
completion; and 2) analogous to the commonly used 3D convolution operation in
CNNs, proposing 3D CSPN to handle features with one additional dimension, which
is effective in the task of stereo matching using a 3D cost volume. For the
task of sparse-to-dense prediction, a.k.a. depth completion, we evaluated the
proposed CSPN in conjunction with existing algorithms on the popular NYU v2 and
KITTI datasets, and show that our algorithms not only produce higher-quality
results (e.g., a further 30% reduction in depth error), but also run faster
(e.g., 2 to 5x) than the previous SOTA spatial propagation network. We also
evaluated our stereo matching algorithm on the Scene Flow and KITTI Stereo
datasets, and rank 1st on both the KITTI Stereo 2012 and 2015 benchmarks, which
demonstrates the effectiveness of the proposed module. The code for CSPN will
be released at https://github.com/XinJCheng/CSPN.
Comment: v1.2: added some experiments; v1.1: fixed some mistakes; v1: 17 pages,
12 figures. arXiv admin note: substantial text overlap with arXiv:1808.0015
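As a rough illustration of the linear propagation described above, the sketch below performs one recurrent step over a 3x3 neighborhood in PyTorch. The affinity tensor would come from the CNN; the normalization (neighbor weights scaled so they combine stably with a residual center weight) follows the spirit of the paper rather than its exact formulation:

    import torch
    import torch.nn.functional as F

    def cspn_step(h, affinity):
        # h: (B, 1, H, W) current depth map; affinity: (B, 8, H, W) learned
        # weights for the 8 neighbors of each pixel.
        norm = affinity.abs().sum(dim=1, keepdim=True).clamp(min=1.0)
        w = affinity / norm                        # bounded neighbor weights
        center = 1.0 - w.sum(dim=1, keepdim=True)  # residual center weight
        pad = F.pad(h, (1, 1, 1, 1))
        out = center * h
        k = 0
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy == 0 and dx == 0:
                    continue
                out = out + w[:, k:k+1] * pad[:, :, 1+dy:1+dy+h.shape[2],
                                              1+dx:1+dx+h.shape[3]]
                k += 1
        return out

Running this step N times propagates information N pixels outward, which is how a sparse depth input can be diffused across the whole image.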
Self-Supervised Learning for Stereo Matching with Self-Improving Ability
Existing deep-learning based dense stereo matching methods often rely on
ground-truth disparity maps as the training signals, which are however not
always available in many situations. In this paper, we design a simple
convolutional neural network architecture that is able to learn to compute
dense disparity maps directly from the stereo inputs. Training is performed in
an end-to-end fashion without the need of ground-truth disparity maps. The idea
is to use image warping error (instead of disparity-map residuals) as the loss
function to drive the learning process, aiming to find a depth-map that
minimizes the warping error. While this is a simple concept well-known in
stereo matching, to make it work in a deep-learning framework, many non-trivial
challenges must be overcome, and in this work we provide effective solutions.
Our network is self-adaptive to different unseen imagery as well as to
different camera settings. Experiments on KITTI and Middlebury stereo benchmark
datasets show that our method outperforms many state-of-the-art stereo matching
methods by a clear margin, while at the same time being significantly faster.
Comment: 13 pages, 11 figures
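A minimal PyTorch sketch of the warping-error idea: the predicted disparity resamples the right image toward the left view, and the photometric residual supplies the training signal. The occlusion handling and smoothness terms a working system needs are omitted:

    import torch
    import torch.nn.functional as F

    def photometric_loss(left, right, disparity):
        # left, right: (B, 3, H, W) rectified pair; disparity: (B, 1, H, W).
        B, _, H, W = left.shape
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W),
                                indexing="ij")
        xs = xs.to(left) - disparity.squeeze(1)   # sample right image at x - d
        ys = ys.to(left).expand(B, H, W)
        grid = torch.stack((2 * xs / (W - 1) - 1,  # normalize to [-1, 1]
                            2 * ys / (H - 1) - 1), dim=-1)
        warped = F.grid_sample(right, grid, align_corners=True)
        return (warped - left).abs().mean()       # L1 reconstruction error

Because the loss needs only the stereo pair itself, no ground-truth disparity ever enters training, which is what makes the self-improving behavior possible.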
OmniDepth: Dense Depth Estimation for Indoors Spherical Panoramas
Recent work on depth estimation has thus far focused only on projective images,
ignoring 360 content, which is now increasingly and more easily produced.
We show that monocular depth estimation models trained on traditional images
produce sub-optimal results on omnidirectional images, showcasing the need for
training directly on 360 datasets, which, however, are hard to acquire. In this
work, we circumvent the challenges associated with acquiring high quality 360
datasets with ground truth depth annotations, by re-using recently released
large scale 3D datasets and re-purposing them to 360 via rendering. This
dataset, which is considerably larger than similar projective datasets, is
publicly offered to the community to enable future research in this direction.
We use this dataset to learn in an end-to-end fashion the task of depth
estimation from 360 images. We show promising results in our synthesized data
as well as in unseen realistic images.
Comment: Pre-print to appear in ECCV18
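The rendering step is not detailed in the abstract, but the target projection is easy to state: an equirectangular 360 image indexes pixels by the longitude and latitude of the viewing ray. A sketch of the direction-to-pixel mapping under one common convention (y up, z forward; conventions vary between datasets):

    import math

    def direction_to_equirect(x, y, z, width, height):
        # Map a unit viewing direction to equirectangular pixel coordinates.
        lon = math.atan2(x, z)                   # longitude in [-pi, pi]
        lat = math.asin(max(-1.0, min(1.0, y)))  # latitude in [-pi/2, pi/2]
        u = (lon / (2 * math.pi) + 0.5) * width
        v = (0.5 - lat / math.pi) * height
        return u, v

Rendering a 3D scene from a single center of projection through this mapping yields a 360 image; sampling scene depth along each ray gives the corresponding ground truth.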
Real-time 3D Traffic Cone Detection for Autonomous Driving
Considerable progress has been made in semantic scene understanding of road
scenes with monocular cameras. It is, however, mainly related to certain
classes such as cars and pedestrians. This work investigates traffic cones, an
object class crucial for traffic control in the context of autonomous vehicles.
3D object detection using images from a monocular camera is intrinsically an
ill-posed problem. In this work, we leverage the unique structure of traffic
cones and propose a pipelined approach to the problem. Specifically, we first
detect cones in images with a tailored 2D object detector; then, the spatial
arrangement of keypoints on a traffic cone is detected by our deep structural
regression network, where the fact that the cross-ratio is projection-invariant
is leveraged for network regularization; finally, the 3D position of the cones
is recovered by the classical Perspective-n-Point algorithm. Extensive experiments
show that our approach can accurately detect traffic cones and estimate their
position in the 3D world in real time. The proposed method is also deployed on
a real-time, critical system. It runs efficiently on the low-power Jetson TX2,
providing accurate 3D position estimates, allowing a race-car to map and drive
autonomously on an unseen track indicated by traffic cones. With the help of
robust and accurate perception, our race-car won both Formula Student
Competitions held in Italy and Germany in 2018, cruising at a top speed of 54
km/h. A visualization of the complete pipeline, mapping, and navigation can be
found on our project page.
Comment: IEEE Intelligent Vehicles Symposium (IV'19). arXiv admin note: text
overlap with arXiv:1809.1054
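The last stage described above, recovering a cone's 3D position from its detected keypoints, maps directly onto a standard Perspective-n-Point solver such as OpenCV's. A sketch with an entirely hypothetical keypoint layout; the calibration matrix K and distortion coefficients are assumed known:

    import numpy as np
    import cv2

    # 3D keypoints on the cone in its own frame (metres; layout is made up)
    MODEL_POINTS = np.array([
        [0.000, 0.0, 0.325],                        # tip
        [-0.114, 0.0, 0.000], [0.114, 0.0, 0.000],  # base corners
        [-0.070, 0.0, 0.150], [0.070, 0.0, 0.150],  # lower band
        [-0.040, 0.0, 0.250], [0.040, 0.0, 0.250],  # upper band
    ], dtype=np.float64)

    def cone_position(image_points, K, dist_coeffs):
        # image_points: (7, 2) keypoints predicted by the regression network.
        ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS, image_points, K,
                                      dist_coeffs,
                                      flags=cv2.SOLVEPNP_ITERATIVE)
        return tvec if ok else None  # cone position in the camera frame

Because all cones of a class share one 3D model, a single keypoint table like this serves every detection.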
DeepV2D: Video to Depth with Differentiable Structure from Motion
We propose DeepV2D, an end-to-end deep learning architecture for predicting
depth from video. DeepV2D combines the representation ability of neural
networks with the geometric principles governing image formation. We compose a
collection of classical geometric algorithms, which are converted into
trainable modules and combined into an end-to-end differentiable architecture.
DeepV2D interleaves two stages: motion estimation and depth estimation. During
inference, motion and depth estimation are alternated and converge to accurate
depth. Code is available at https://github.com/princeton-vl/DeepV2D.
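As a schematic picture of that alternation (both networks and the initialization are left abstract; the names below are hypothetical, not DeepV2D's API):

    def video_to_depth(frames, motion_net, depth_net, depth_init, iters=4):
        # Alternate the two stages until the estimates settle.
        depth, poses = depth_init, None
        for _ in range(iters):
            poses = motion_net(frames, depth)  # update motion given depth
            depth = depth_net(frames, poses)   # update depth given motion
        return depth, poses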
Deep Fundamental Matrix Estimation without Correspondences
Estimating fundamental matrices is a classic problem in computer vision.
Traditional methods rely heavily on the correctness of estimated key-point
correspondences, which can be noisy and unreliable. As a result, it is
difficult for these methods to handle image pairs with large occlusion or
significantly different camera poses. In this paper, we propose novel neural
network architectures to estimate fundamental matrices in an end-to-end manner
without relying on point correspondences. New modules and layers are introduced
in order to preserve mathematical properties of the fundamental matrix as a
homogeneous rank-2 matrix with seven degrees of freedom. We analyze performance
of the proposed models using various metrics on the KITTI dataset, and show
that they achieve competitive performance with traditional methods without the
need for extracting correspondences.
Comment: ECCV 2018, Geometry Meets Deep Learning Workshop
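The property those layers preserve, a homogeneous rank-2 matrix with seven degrees of freedom, can be enforced on any 3x3 estimate by zeroing the smallest singular value and fixing the overall scale; a short NumPy sketch of that classical projection:

    import numpy as np

    def enforce_rank2(F):
        # Project a 3x3 estimate onto valid fundamental matrices:
        # rank 2 (det F = 0) and unit scale, leaving 9 - 1 - 1 = 7 DOF.
        U, S, Vt = np.linalg.svd(F)
        S[2] = 0.0                       # zero the smallest singular value
        F2 = U @ np.diag(S) @ Vt
        return F2 / np.linalg.norm(F2)   # F is only defined up to scale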