4,389 research outputs found
DA-RNN: Semantic Mapping with Data Associated Recurrent Neural Networks
3D scene understanding is important for robots to interact with the 3D world
in a meaningful way. Most previous works on 3D scene understanding focus on
recognizing geometrical or semantic properties of the scene independently. In
this work, we introduce Data Associated Recurrent Neural Networks (DA-RNNs), a
novel framework for joint 3D scene mapping and semantic labeling. DA-RNNs use a
new recurrent neural network architecture for semantic labeling on RGB-D
videos. The output of the network is integrated with mapping techniques such as
KinectFusion in order to inject semantic information into the reconstructed 3D
scene. Experiments conducted on a real world dataset and a synthetic dataset
with RGB-D videos demonstrate the ability of our method in semantic 3D scene
mapping.Comment: Published in RSS 201
A Taxonomy of Deep Convolutional Neural Nets for Computer Vision
Traditional architectures for solving computer vision problems and the degree
of success they enjoyed have been heavily reliant on hand-crafted features.
However, of late, deep learning techniques have offered a compelling
alternative -- that of automatically learning problem-specific features. With
this new paradigm, every problem in computer vision is now being re-examined
from a deep learning perspective. Therefore, it has become important to
understand what kind of deep networks are suitable for a given problem.
Although general surveys of this fast-moving paradigm (i.e. deep-networks)
exist, a survey specific to computer vision is missing. We specifically
consider one form of deep networks widely used in computer vision -
convolutional neural networks (CNNs). We start with "AlexNet" as our base CNN
and then examine the broad variations proposed over time to suit different
applications. We hope that our recipe-style survey will serve as a guide,
particularly for novice practitioners intending to use deep-learning techniques
for computer vision.Comment: Published in Frontiers in Robotics and AI (http://goo.gl/6691Bm
Non-local Neural Networks
Both convolutional and recurrent operations are building blocks that process
one local neighborhood at a time. In this paper, we present non-local
operations as a generic family of building blocks for capturing long-range
dependencies. Inspired by the classical non-local means method in computer
vision, our non-local operation computes the response at a position as a
weighted sum of the features at all positions. This building block can be
plugged into many computer vision architectures. On the task of video
classification, even without any bells and whistles, our non-local models can
compete or outperform current competition winners on both Kinetics and Charades
datasets. In static image recognition, our non-local models improve object
detection/segmentation and pose estimation on the COCO suite of tasks. Code is
available at https://github.com/facebookresearch/video-nonlocal-net .Comment: CVPR 2018, code is available at:
https://github.com/facebookresearch/video-nonlocal-ne
Predicting Deeper into the Future of Semantic Segmentation
The ability to predict and therefore to anticipate the future is an important
attribute of intelligence. It is also of utmost importance in real-time
systems, e.g. in robotics or autonomous driving, which depend on visual scene
understanding for decision making. While prediction of the raw RGB pixel values
in future video frames has been studied in previous work, here we introduce the
novel task of predicting semantic segmentations of future frames. Given a
sequence of video frames, our goal is to predict segmentation maps of not yet
observed video frames that lie up to a second or further in the future. We
develop an autoregressive convolutional neural network that learns to
iteratively generate multiple frames. Our results on the Cityscapes dataset
show that directly predicting future segmentations is substantially better than
predicting and then segmenting future RGB frames. Prediction results up to half
a second in the future are visually convincing and are much more accurate than
those of a baseline based on warping semantic segmentations using optical flow.Comment: Accepted to ICCV 2017. Supplementary material available on the
authors' webpage
- …