Volume-based Semantic Labeling with Signed Distance Functions
Research on Semantic Segmentation and on SLAM (Simultaneous Localization and
Mapping) has followed separate tracks.
Here, we link the two tightly by describing a category label fusion
technique that allows for embedding semantic information into the dense map
created by a volume-based SLAM algorithm such as KinectFusion. Accordingly, our
approach is the first to provide a semantically labeled dense reconstruction of
the environment from a stream of RGB-D images. We validate our proposal using a
publicly available semantically annotated RGB-D dataset and a) employing ground
truth labels, b) corrupting such annotations with synthetic noise, c) deploying
a state-of-the-art semantic segmentation algorithm based on Convolutional
Neural Networks.
Comment: Submitted to PSIVT201
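As a rough illustration of what per-voxel label fusion can look like, the sketch below accumulates a confidence-weighted vote per category in each voxel and reads out the category with the most evidence. This is a generic evidence-accumulation scheme and not necessarily the fusion rule used in the paper; the names `VoxelLabels`, `integrate`, and `best_label` are hypothetical.

```python
from collections import defaultdict


class VoxelLabels:
    """Per-voxel category evidence (hypothetical helper, not from the paper)."""

    def __init__(self):
        # Maps category name -> accumulated confidence-weighted votes.
        self.votes = defaultdict(float)

    def integrate(self, label, confidence=1.0):
        """Add one per-frame observation of `label` for this voxel."""
        self.votes[label] += confidence

    def best_label(self):
        """Return the category with the most accumulated evidence, or None."""
        if not self.votes:
            return None
        return max(self.votes, key=self.votes.get)


# One voxel observed in four frames; the third frame carries a wrong label.
voxel = VoxelLabels()
for label, conf in [("chair", 0.9), ("chair", 0.8), ("table", 0.6), ("chair", 0.7)]:
    voxel.integrate(label, conf)

fused = voxel.best_label()
```

Accumulating evidence over many frames is what lets a single mislabeled frame be outvoted, which is the situation the noisy-annotation experiment in the abstract probes.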
ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans
We introduce ScanComplete, a novel data-driven approach for taking an
incomplete 3D scan of a scene as input and predicting a complete 3D model along
with per-voxel semantic labels. The key contribution of our method is its
ability to handle large scenes with varying spatial extent, managing the cubic
growth in data size as scene size increases. To this end, we devise a
fully-convolutional generative 3D CNN model whose filter kernels are invariant
to the overall scene size. The model can be trained on scene subvolumes but
deployed on arbitrarily large scenes at test time. In addition, we propose a
coarse-to-fine inference strategy in order to produce high-resolution output
while also leveraging large input context sizes. In an extensive series of
experiments, we carefully evaluate different model design choices, considering
both deterministic and probabilistic models for completion and semantic
inference. Our results show that we outperform other methods not only in the
size of the environments handled and processing efficiency, but also with
regard to completion quality and semantic segmentation performance by a
significant margin.
Comment: Video: https://youtu.be/5s5s8iH0NF
Semantic 3D Reconstruction with Finite Element Bases
We propose a novel framework for the discretisation of multi-label problems
on arbitrary, continuous domains. Our work bridges the gap between general FEM
discretisations, and labeling problems that arise in a variety of computer
vision tasks, including for instance those derived from the generalised Potts
model. Starting from the popular formulation of labeling as a convex relaxation
by functional lifting, we show that FEM discretisation is valid for the most
general case, where the regulariser is anisotropic and non-metric. While our
findings are generic and applicable to different vision problems, we
demonstrate their practical implementation in the context of semantic 3D
reconstruction, where such regularisers have proved particularly beneficial.
The proposed FEM approach leads to a smaller memory footprint as well as faster
computation, and it constitutes a very simple way to enable variable, adaptive
resolution within the same model.
Recurrent Pixel Embedding for Instance Grouping
We introduce a differentiable, end-to-end trainable framework for solving
pixel-level grouping problems, such as instance segmentation, consisting of two
novel components. First, we regress pixels into a hyper-spherical embedding
space so that pixels from the same group have high cosine similarity while
those from different groups have similarity below a specified margin. We
analyze the choice of embedding dimension and margin, relating them to
theoretical results on the problem of distributing points uniformly on the
sphere. Second, to group instances, we utilize a variant of mean-shift
clustering, implemented as a recurrent neural network parameterized by kernel
bandwidth. This recurrent grouping module is differentiable, enjoys convergent
dynamics and probabilistic interpretability. Backpropagating the group-weighted
loss through this module allows learning to focus on only correcting embedding
errors that won't be resolved during subsequent clustering. Our framework,
while conceptually simple and theoretically abundant, is also practically
effective and computationally efficient. We demonstrate substantial
improvements over state-of-the-art instance segmentation for object proposal
generation, and show the benefits of the grouping loss for
classification tasks such as boundary detection and semantic segmentation.
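The two components described in this abstract can be sketched on toy data: embeddings live on the unit sphere, similarity is the cosine, and grouping repeatedly moves each point to the kernel-weighted mean of all points. The kernel `exp(kappa * cosine)` and the names `mean_shift_step` and `kappa` are assumptions of this sketch, with `kappa` standing in for the kernel bandwidth parameter; the paper's recurrent module may differ in its details.

```python
import math


def normalize(v):
    """Project a vector onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(u, v):
    """Cosine similarity of two unit vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


def mean_shift_step(points, kappa):
    """One mean-shift update on the unit sphere.

    Each point moves to the normalized, kernel-weighted mean of all points.
    The kernel exp(kappa * cosine) is an assumption of this sketch.
    """
    updated = []
    for p in points:
        weights = [math.exp(kappa * cosine(p, q)) for q in points]
        mean = [sum(w * q[d] for w, q in zip(weights, points))
                for d in range(len(p))]
        updated.append(normalize(mean))
    return updated


def on_circle(deg):
    """Unit vector in 2-D at the given angle in degrees."""
    rad = math.radians(deg)
    return [math.cos(rad), math.sin(rad)]


# Two toy groups of 2-D embeddings, centered near 0 and 120 degrees.
points = [on_circle(a) for a in (-10, 0, 10, 110, 120, 130)]
for _ in range(5):
    points = mean_shift_step(points, kappa=10.0)

same_group = cosine(points[0], points[2])
other_group = cosine(points[0], points[3])
```

After a few iterations the points within each group collapse toward a common direction while the two groups stay well separated. Because every step is built from differentiable operations, the same update could be unrolled and trained through, which is the property the abstract relies on.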
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
Estimating the 6D pose of known objects is important for robots to interact
with the real world. The problem is challenging due to the variety of objects
as well as the complexity of a scene caused by clutter and occlusions between
objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network
for 6D object pose estimation. PoseCNN estimates the 3D translation of an
object by localizing its center in the image and predicting its distance from
the camera. The 3D rotation of the object is estimated by regressing to a
quaternion representation. We also introduce a novel loss function that enables
PoseCNN to handle symmetric objects. In addition, we contribute a large scale
video dataset for 6D object pose estimation named the YCB-Video dataset. Our
dataset provides accurate 6D poses of 21 objects from the YCB dataset observed
in 92 videos with 133,827 frames. We conduct extensive experiments on our
YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is
highly robust to occlusions, can handle symmetric objects, and provides accurate
pose estimation using only color images as input. When using depth data to
further refine the poses, our approach achieves state-of-the-art results on the
challenging OccludedLINEMOD dataset. Our code and dataset are available at
https://rse-lab.cs.washington.edu/projects/posecnn/.
Comment: Accepted to RSS 201
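The pose parameterization described in this abstract can be illustrated with a short sketch: given the object center in the image and its distance from the camera, the 3D translation follows by back-projection, and a quaternion converts to a rotation matrix. The sketch assumes a pinhole camera with focal lengths `fx`, `fy` and principal point `(px, py)`, none of which are stated in the abstract; the function names and numeric values are hypothetical.

```python
import math


def backproject_center(cx, cy, tz, fx, fy, px, py):
    """Recover (Tx, Ty, Tz) from the 2-D object center and its depth.

    Assumes a pinhole camera: cx = fx * Tx / Tz + px, and likewise for cy.
    """
    tx = (cx - px) * tz / fx
    ty = (cy - py) * tz / fy
    return (tx, ty, tz)


def quaternion_to_matrix(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    The quaternion is normalized first, since a regressed value
    need not have unit length.
    """
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


# Hypothetical values: center 100 px right of the principal point, 2 m away.
translation = backproject_center(cx=420.0, cy=240.0, tz=2.0,
                                 fx=500.0, fy=500.0, px=320.0, py=240.0)

# A 90-degree rotation about the camera z-axis.
half = math.pi / 4
rotation = quaternion_to_matrix((math.cos(half), 0.0, 0.0, math.sin(half)))
```

Splitting the translation into an image-space center and a depth means the two quantities a network predicts are tied directly to what is visible in the image, while the camera intrinsics handle the conversion to metric coordinates.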
OctNetFusion: Learning Depth Fusion from Data
In this paper, we present a learning based approach to depth fusion, i.e.,
dense 3D reconstruction from multiple depth images. The most common approach to
depth fusion is based on averaging truncated signed distance functions, which
was originally proposed by Curless and Levoy in 1996. While this method is
simple and provides great results, it is not able to reconstruct (partially)
occluded surfaces and requires a large number of frames to filter out sensor noise
and outliers. Motivated by the availability of large 3D model repositories and
recent advances in deep learning, we present a novel 3D CNN architecture that
learns to predict an implicit surface representation from the input depth maps.
Our learning based method significantly outperforms the traditional volumetric
fusion approach in terms of noise reduction and outlier suppression. By
learning the structure of real world 3D objects and scenes, our approach is
further able to reconstruct occluded regions and to fill in gaps in the
reconstruction. We demonstrate that our learning based approach outperforms
both vanilla TSDF fusion as well as TV-L1 fusion on the task of volumetric
fusion. Further, we demonstrate state-of-the-art 3D shape completion results.
Comment: 3DV 2017, https://github.com/griegler/octnetfusio
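The averaging baseline mentioned in this abstract can be sketched for a single voxel: each new signed distance measurement is truncated and folded into a weighted running average. This is a minimal sketch of the classical scheme, not the learned method of the paper; the function names, the truncation band of 0.2, and the unit per-frame weight are assumptions.

```python
def truncate(sdf, trunc):
    """Clamp a signed distance to the band [-trunc, trunc]."""
    return max(-trunc, min(trunc, sdf))


def fuse(d_old, w_old, d_new, w_new=1.0):
    """Weighted running average of signed distances for one voxel.

    Returns the updated (distance, weight) pair. If the total weight
    is not positive, the previous state is returned unchanged.
    """
    w = w_old + w_new
    if w <= 0:
        return d_old, w_old
    d = (w_old * d_old + w_new * d_new) / w
    return d, w


# One voxel observed in four depth frames; the last reading is an outlier
# that truncation limits to 0.2 before it enters the average.
distance, weight = 0.0, 0.0
for measurement in (0.10, 0.14, 0.06, 0.50):
    distance, weight = fuse(distance, weight, truncate(measurement, trunc=0.2))
```

The outlier still pulls the average away from the true surface distance, and only further frames dilute it, which illustrates why the abstract notes that averaging needs many frames to suppress noise and outliers.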
Semantic Cross-View Matching
Matching cross-view images is challenging because appearance and viewpoint
differ significantly between the views. While low-level features based on
gradient orientations or filter responses can vary drastically with such
changes in viewpoint, the semantic information in an image remains invariant
in this respect. Consequently, semantically labeled regions can
be used for performing cross-view matching. In this paper, we therefore explore
this idea and propose an automatic method for detecting and representing the
semantic information of an RGB image with the goal of performing cross-view
matching with a (non-RGB) geographic information system (GIS). A segmented
image forms the input to our system with segments assigned to semantic concepts
such as traffic signs, lakes, roads, foliage, etc. We design a descriptor to
robustly capture both the presence of semantic concepts and the spatial layout
of those segments. Pairwise distances between the descriptors extracted from
the GIS map and the query image are then used to generate a shortlist of the
most promising locations with similar semantic concepts in a consistent spatial
layout. An experimental evaluation with challenging query images and a large
urban area shows promising results.