8,438 research outputs found
Multi-View Deep Learning for Consistent Semantic Mapping with RGB-D Cameras
Visual scene understanding is an important capability that enables robots to
purposefully act in their environment. In this paper, we propose a novel
approach to object-class segmentation from multiple RGB-D views using deep
learning. We train a deep neural network to predict object-class semantics that
is consistent from several view points in a semi-supervised way. At test time,
the semantics predictions of our network can be fused more consistently in
semantic keyframe maps than predictions of a network trained on individual
views. We base our network architecture on a recent single-view deep learning
approach to RGB and depth fusion for semantic object-class segmentation and
enhance it with multi-scale loss minimization. We obtain the camera trajectory
using RGB-D SLAM and warp the predictions of RGB-D images into ground-truth
annotated frames in order to enforce multi-view consistency during training. At
test time, predictions from multiple views are fused into keyframes. We propose
and analyze several methods for enforcing multi-view consistency during
training and testing. We evaluate the benefit of multi-view consistency
training and demonstrate that pooling of deep features and fusion over multiple
views outperforms single-view baselines on the NYUDv2 benchmark for semantic
segmentation. Our end-to-end trained network achieves state-of-the-art
performance on the NYUDv2 dataset in single-view segmentation as well as
multi-view semantic fusion.Comment: the 2017 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS 2017
Flowing ConvNets for Human Pose Estimation in Videos
The objective of this work is human pose estimation in videos, where multiple
frames are available. We investigate a ConvNet architecture that is able to
benefit from temporal context by combining information across the multiple
frames using optical flow.
To this end we propose a network architecture with the following novelties:
(i) a deeper network than previously investigated for regressing heatmaps; (ii)
spatial fusion layers that learn an implicit spatial model; (iii) optical flow
is used to align heatmap predictions from neighbouring frames; and (iv) a final
parametric pooling layer which learns to combine the aligned heatmaps into a
pooled confidence map.
We show that this architecture outperforms a number of others, including one
that uses optical flow solely at the input layers, one that regresses joint
coordinates directly, and one that predicts heatmaps without spatial fusion.
The new architecture outperforms the state of the art by a large margin on
three video pose estimation datasets, including the very challenging Poses in
the Wild dataset, and outperforms other deep methods that don't use a graphical
model on the single-image FLIC benchmark (and also Chen & Yuille and Tompson et
al. in the high precision region).Comment: ICCV'1
Deep Eyes: Binocular Depth-from-Focus on Focal Stack Pairs
Human visual system relies on both binocular stereo cues and monocular
focusness cues to gain effective 3D perception. In computer vision, the two
problems are traditionally solved in separate tracks. In this paper, we present
a unified learning-based technique that simultaneously uses both types of cues
for depth inference. Specifically, we use a pair of focal stacks as input to
emulate human perception. We first construct a comprehensive focal stack
training dataset synthesized by depth-guided light field rendering. We then
construct three individual networks: a Focus-Net to extract depth from a single
focal stack, a EDoF-Net to obtain the extended depth of field (EDoF) image from
the focal stack, and a Stereo-Net to conduct stereo matching. We show how to
integrate them into a unified BDfF-Net to obtain high-quality depth maps.
Comprehensive experiments show that our approach outperforms the
state-of-the-art in both accuracy and speed and effectively emulates human
vision systems
- …