Deep Depth Completion of a Single RGB-D Image
The goal of our work is to complete the depth channel of an RGB-D image.
Commodity-grade depth cameras often fail to sense depth for shiny, bright,
transparent, and distant surfaces. To address this problem, we train a deep
network that takes an RGB image as input and predicts dense surface normals and
occlusion boundaries. Those predictions are then combined with raw depth
observations provided by the RGB-D camera to solve for depths for all pixels,
including those missing in the original observation. This method was chosen
over others (e.g., inpainting depths directly) as the result of extensive
experiments with a new depth completion benchmark dataset, where holes are
filled in training data through the rendering of surface reconstructions
created from multiview RGB-D scans. Experiments with different network inputs,
depth representations, loss functions, optimization methods, inpainting
methods, and deep depth estimation networks show that our proposed approach
provides better depth completions than these alternatives.
Comment: Accepted by CVPR 2018 (Spotlight). Project webpage: http://deepcompletion.cs.princeton.edu/ This version includes supplementary materials which provide more implementation details, quantitative evaluation, and qualitative results. Due to the file size limit, please check the project website for the high-res paper.
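For readers who want to see the shape of the final optimization step, the following is a toy Python/SciPy sketch of a global depth solve in this spirit: a data term on observed pixels plus a smoothness term that is down-weighted across predicted occlusion boundaries. The paper's full objective also incorporates the predicted surface normals; the weights and the complete_depth function here are illustrative assumptions, not the authors' implementation.

    import numpy as np
    from scipy.sparse import coo_matrix
    from scipy.sparse.linalg import lsqr

    def complete_depth(raw_depth, boundary_prob, lam_data=1.0, lam_smooth=4.0):
        """Toy global depth solve for a small image.

        Keeps a data term on pixels where the sensor returned depth and a
        smoothness term between 4-neighbours, down-weighted across predicted
        occlusion boundaries.  raw_depth: (H, W), 0 where depth is missing;
        boundary_prob: (H, W) in [0, 1].  Weights are illustrative.
        """
        H, W = raw_depth.shape
        pid = lambda y, x: y * W + x
        rows, cols, vals, rhs = [], [], [], []
        r = 0
        for y in range(H):
            for x in range(W):
                d = raw_depth[y, x]
                if d > 0:  # data term: stay close to the observed depth
                    rows.append(r); cols.append(pid(y, x)); vals.append(lam_data)
                    rhs.append(lam_data * d)
                    r += 1
                for dy, dx in ((0, 1), (1, 0)):  # smoothness with right/bottom neighbour
                    ny, nx = y + dy, x + dx
                    if ny < H and nx < W:
                        w = lam_smooth * (1.0 - max(boundary_prob[y, x], boundary_prob[ny, nx]))
                        rows += [r, r]; cols += [pid(y, x), pid(ny, nx)]
                        vals += [w, -w]; rhs.append(0.0)
                        r += 1
        A = coo_matrix((vals, (rows, cols)), shape=(r, H * W))
        return lsqr(A, np.asarray(rhs))[0].reshape(H, W)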
Fine-To-Coarse Global Registration of RGB-D Scans
RGB-D scanning of indoor environments is important for many applications,
including real estate, interior design, and virtual reality. However, it is
still challenging to register RGB-D images from a hand-held camera over a long
video sequence into a globally consistent 3D model. Current methods often can
lose tracking or drift and thus fail to reconstruct salient structures in large
environments (e.g., parallel walls in different rooms). To address this
problem, we propose a "fine-to-coarse" global registration algorithm that
leverages robust registrations at finer scales to seed detection and
enforcement of new correspondence and structural constraints at coarser scales.
To test global registration algorithms, we provide a benchmark with 10,401
manually-clicked point correspondences in 25 scenes from the SUN3D dataset.
During experiments with this benchmark, we find that our fine-to-coarse
algorithm registers long RGB-D sequences better than previous methods.
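As a rough illustration of the fine-to-coarse idea, here is a Python control-flow outline (not the authors' code): local alignments are computed within windows that double in size each iteration, and each pass re-solves all poses and detects new structural constraints to enforce at the next, coarser scale. The three callables are assumed interfaces standing in for the paper's components.

    def fine_to_coarse_register(frames, align_segment, detect_constraints, solve_poses):
        """Control-flow outline only.

        frames: list of RGB-D frames.
        align_segment(frames, segment, poses) -> list of pairwise constraints
        detect_constraints(frames, poses)     -> list of structural constraints
        solve_poses(n_frames, constraints)    -> list of camera-to-world poses
        """
        n = len(frames)
        poses = [None] * n
        constraints = []
        window = 2                      # start with short, reliable local alignments
        while window <= n:
            # Register frames within each window at the current (finer) scale.
            for start in range(0, n, window):
                segment = list(range(start, min(start + window, n)))
                constraints += align_segment(frames, segment, poses)
            # Re-solve all poses jointly, then use the improved registration to
            # detect new structural constraints (e.g. parallel or coplanar walls)
            # that will be enforced at the next, coarser scale.
            poses = solve_poses(n, constraints)
            constraints += detect_constraints(frames, poses)
            window *= 2
        return poses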
Interactive 3D Modeling with a Generative Adversarial Network
This paper proposes the idea of using a generative adversarial network (GAN)
to assist a novice user in designing real-world shapes with a simple interface.
The user edits a voxel grid with a painting interface (like Minecraft). Yet, at
any time, he/she can execute a SNAP command, which projects the current voxel
grid onto a latent shape manifold with a learned projection operator and then
generates a similar, but more realistic, shape using a learned generator
network. Then the user can edit the resulting shape and snap again until he/she
is satisfied with the result. The main advantage of this approach is that the
projection and generation operators assist novice users to create 3D models
characteristic of a background distribution of object shapes, but without
having to specify all the details. The core new research idea is to use a GAN
to support this application. 3D GANs have previously been used for shape
generation, interpolation, and completion, but never for interactive modeling.
The new challenge for this application is to learn a projection operator that
takes an arbitrary 3D voxel model and produces a latent vector on the shape
manifold from which a similar and realistic shape can be generated. We develop
algorithms for this and other steps of the SNAP processing pipeline and
integrate them into a simple modeling tool. Experiments with these algorithms
and tool suggest that GANs provide a promising approach to computer-assisted
interactive modeling.
Comment: Published at International Conference on 3D Vision 2017 (http://irc.cs.sdu.edu.cn/3dv/index.html).
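A minimal PyTorch sketch of the SNAP operation is shown below: a projection network maps the edited voxel grid to a latent vector, and a generator decodes it back to a cleaner shape. The 32^3 resolution, 128-dimensional latent space, and layer sizes are assumptions for illustration, not the paper's architecture.

    import torch
    import torch.nn as nn

    LATENT = 128  # assumed latent dimension

    class Projector(nn.Module):
        """Maps an edited 32^3 voxel grid to a point on the learned shape manifold."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv3d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Flatten(),
                nn.Linear(64 * 8 * 8 * 8, LATENT),
            )
        def forward(self, x):  # x: (B, 1, 32, 32, 32)
            return self.net(x)

    class Generator(nn.Module):
        """Decodes a latent vector back to a (more realistic) voxel grid."""
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(LATENT, 64 * 8 * 8 * 8)
            self.net = nn.Sequential(
                nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
            )
        def forward(self, z):
            h = self.fc(z).view(-1, 64, 8, 8, 8)
            return self.net(h)

    def snap(voxels, projector, generator, threshold=0.5):
        """SNAP: project the user's edit onto the manifold, regenerate, binarize."""
        with torch.no_grad():
            z = projector(voxels)
            return (generator(z) > threshold).float()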
FrameNet: Learning Local Canonical Frames of 3D Surfaces from a Single RGB Image
In this work, we introduce the novel problem of identifying dense canonical
3D coordinate frames from a single RGB image. We observe that each pixel in an
image corresponds to a surface in the underlying 3D geometry, where a canonical
frame can be identified, represented by three orthogonal axes: one along its
normal direction and two in its tangent plane. We propose an algorithm to
predict these axes from RGB. Our first insight is that canonical frames
computed automatically with recently introduced direction field synthesis
methods can provide training data for the task. Our second insight is that
networks designed for surface normal prediction provide better results when
trained jointly to predict canonical frames, and even better when trained to
also predict 2D projections of canonical frames. We conjecture this is because
projections of canonical tangent directions often align with local gradients in
images, and because those directions are tightly linked to 3D canonical frames
through projective geometry and orthogonality constraints. In our experiments,
we find that our method predicts 3D canonical frames that can be used in
applications ranging from surface normal estimation to feature matching and
augmented reality.
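To illustrate the joint training signal, here is a hedged PyTorch sketch of a per-pixel canonical-frame loss: a cosine loss on the normal axis plus one on the two tangent axes. The channel layout, the weights, and the omission of the 2D-projection term mentioned above are simplifying assumptions.

    import torch
    import torch.nn.functional as F

    def frame_loss(pred, gt, w_normal=1.0, w_tangent=1.0):
        """Joint loss for dense canonical-frame prediction.

        pred, gt: (B, 9, H, W) tensors holding three 3-vectors per pixel --
        the surface normal and two tangent directions.  The loss is a negative
        cosine similarity per axis; weights are illustrative.
        """
        B, _, H, W = pred.shape
        p = F.normalize(pred.view(B, 3, 3, H, W), dim=2)      # (B, axis, xyz, H, W)
        g = F.normalize(gt.view(B, 3, 3, H, W), dim=2)
        cos = (p * g).sum(dim=2)                              # (B, 3, H, W)
        normal_term = (1.0 - cos[:, 0]).mean()                # axis 0: surface normal
        tangent_term = (2.0 - cos[:, 1:].sum(dim=1)).mean()   # axes 1-2: tangents
        return w_normal * normal_term + w_tangent * tangent_term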
Neural Illumination: Lighting Prediction for Indoor Environments
This paper addresses the task of estimating the light arriving from all
directions to a 3D point observed at a selected pixel in an RGB image. This
task is challenging because it requires predicting a mapping from a partial
scene observation by a camera to a complete illumination map for a selected
position, which depends on the 3D location of the selection, the distribution
of unobserved light sources, the occlusions caused by scene geometry, etc.
Previous methods attempt to learn this complex mapping directly using a single
black-box neural network, which often fails to estimate high-frequency lighting
details for scenes with complicated 3D geometry. Instead, we propose "Neural
Illumination," a new approach that decomposes illumination prediction into
several simpler differentiable sub-tasks: 1) geometry estimation, 2) scene
completion, and 3) LDR-to-HDR estimation. The advantage of this approach is
that the sub-tasks are relatively easy to learn and can be trained with direct
supervision, while the whole pipeline is fully differentiable and can be
fine-tuned with end-to-end supervision. Experiments show that our approach
performs significantly better quantitatively and qualitatively than prior work.
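The decomposition can be pictured as a chain of differentiable modules, as in the PyTorch sketch below. The sub-network interfaces and the warping helper are assumed placeholders rather than the paper's exact architecture.

    import torch.nn as nn

    class NeuralIlluminationSketch(nn.Module):
        """Chained sub-networks trained end-to-end (interfaces are assumptions)."""

        def __init__(self, geometry_net, completion_net, ldr_to_hdr_net, warp_fn):
            super().__init__()
            self.geometry = geometry_net      # RGB observation -> dense scene geometry
            self.completion = completion_net  # partial panorama -> complete LDR panorama
            self.ldr_to_hdr = ldr_to_hdr_net  # LDR panorama -> HDR illumination map
            self.warp_fn = warp_fn            # differentiable reprojection helper

        def forward(self, rgb, pixel):
            geom = self.geometry(rgb)
            # Reproject the observation into a panorama centred at the 3D point
            # behind the selected pixel, using the estimated geometry.
            partial_pano = self.warp_fn(rgb, geom, pixel)
            ldr_pano = self.completion(partial_pano)
            return self.ldr_to_hdr(ldr_pano)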
TossingBot: Learning to Throw Arbitrary Objects with Residual Physics
We investigate whether a robot arm can learn to pick and throw arbitrary
objects into selected boxes quickly and accurately. Throwing has the potential
to increase the physical reachability and picking speed of a robot arm.
However, precisely throwing arbitrary objects in unstructured settings presents
many challenges: from acquiring reliable pre-throw conditions (e.g. initial
pose of object in manipulator) to handling varying object-centric properties
(e.g. mass distribution, friction, shape) and dynamics (e.g. aerodynamics). In
this work, we propose an end-to-end formulation that jointly learns to infer
control parameters for grasping and throwing motion primitives from visual
observations (images of arbitrary objects in a bin) through trial and error.
Within this formulation, we investigate the synergies between grasping and
throwing (i.e., learning grasps that enable more accurate throws) and between
simulation and deep learning (i.e., using deep networks to predict residuals on
top of control parameters predicted by a physics simulator). The resulting
system, TossingBot, is able to grasp and throw arbitrary objects into boxes
located outside its maximum reach range at 500+ mean picks per hour (600+
grasps per hour with 85% throwing accuracy); and generalizes to new objects and
target locations. Videos are available at https://tossingbot.cs.princeton.edu
Comment: Summary Video: https://youtu.be/f5Zn2Up2RjQ Project webpage: https://tossingbot.cs.princeton.edu
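The residual-physics idea reduces, at its core, to adding a learned correction to an analytic estimate. The NumPy sketch below uses a textbook ballistic formula as a stand-in for the paper's physics-based controller; the fixed release angle and the example residual value are illustrative assumptions.

    import numpy as np

    G = 9.81  # gravitational acceleration, m/s^2

    def ballistic_release_speed(dist, height_diff, angle=np.deg2rad(45.0)):
        """Closed-form release speed for a point mass thrown at `angle` that must
        travel `dist` metres horizontally while rising/falling `height_diff` metres.
        A simplified stand-in for a physics-based controller."""
        denom = 2.0 * np.cos(angle) ** 2 * (dist * np.tan(angle) - height_diff)
        return dist * np.sqrt(G / denom)

    def residual_physics_speed(dist, height_diff, learned_residual):
        """Residual physics: the network predicts only a small correction (m/s)
        on top of the analytic estimate, which is easier to learn than the full
        throwing velocity."""
        return ballistic_release_speed(dist, height_diff) + learned_residual

    # Example: a 1.5 m throw to a box 0.2 m below the release point, with a
    # +0.12 m/s residual predicted from visual observations of the grasped object.
    print(residual_physics_speed(1.5, -0.2, 0.12))  # ~3.7 m/s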
LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop
While there has been remarkable progress in the performance of visual
recognition algorithms, the state-of-the-art models tend to be exceptionally
data-hungry. Large labeled training datasets, expensive and tedious to produce,
are required to optimize millions of parameters in deep network models. Lagging
behind the growth in model capacity, the available datasets are quickly
becoming outdated in terms of size and density. To circumvent this bottleneck,
we propose to amplify human effort through a partially automated labeling
scheme, leveraging deep learning with humans in the loop. Starting from a large
set of candidate images for each category, we iteratively sample a subset, ask
people to label them, classify the others with a trained model, split the set
into positives, negatives, and unlabeled based on the classification
confidence, and then iterate with the unlabeled set. To assess the
effectiveness of this cascading procedure and enable further progress in visual
recognition research, we construct a new image dataset, LSUN. It contains
around one million labeled images for each of 10 scene categories and 20 object
categories. We experiment with training popular convolutional networks and find
that they achieve substantial performance gains when trained on this dataset.
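The cascading labeling procedure can be summarized with a short Python sketch of the loop; the batch size, confidence thresholds, and the ask_human/train_classifier callables are assumed interfaces, not the actual LSUN pipeline parameters.

    import numpy as np

    def human_in_the_loop_labeling(features, ask_human, train_classifier,
                                   batch_size=1000, hi=0.95, lo=0.05, rounds=5):
        """Sketch of the cascading labeling loop (thresholds/batch size illustrative).

        features: (N, D) array of image features.
        ask_human: callable mapping indices -> {0, 1} crowd labels.
        train_classifier: callable (X, y) -> model with predict_proba(X).
        """
        n = len(features)
        labels = np.full(n, -1, dtype=int)      # -1 = still unlabeled
        unlabeled = np.arange(n)
        for _ in range(rounds):
            if len(unlabeled) == 0:
                break
            # 1) Sample a subset of the remaining pool and have people label it.
            batch = np.random.choice(unlabeled, min(batch_size, len(unlabeled)), replace=False)
            labels[batch] = ask_human(batch)
            # 2) Train on everything labeled so far, score the rest.
            seen = labels != -1
            model = train_classifier(features[seen], labels[seen])
            rest = np.where(~seen)[0]
            p = model.predict_proba(features[rest])[:, 1]
            # 3) Accept confident predictions as positives/negatives; iterate on the rest.
            labels[rest[p >= hi]] = 1
            labels[rest[p <= lo]] = 0
            unlabeled = rest[(p > lo) & (p < hi)]
        return labels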
Learning Synergies between Pushing and Grasping with Self-supervised Deep Reinforcement Learning
Skilled robotic manipulation benefits from complex synergies between
non-prehensile (e.g. pushing) and prehensile (e.g. grasping) actions: pushing
can help rearrange cluttered objects to make space for arms and fingers;
likewise, grasping can help displace objects to make pushing movements more
precise and collision-free. In this work, we demonstrate that it is possible to
discover and learn these synergies from scratch through model-free deep
reinforcement learning. Our method involves training two fully convolutional
networks that map from visual observations to actions: one infers the utility
of pushes for a dense pixel-wise sampling of end effector orientations and
locations, while the other does the same for grasping. Both networks are
trained jointly in a Q-learning framework and are entirely self-supervised by
trial and error, where rewards are provided from successful grasps. In this
way, our policy learns pushing motions that enable future grasps, while
learning grasps that can leverage past pushes. During picking experiments in
both simulation and real-world scenarios, we find that our system quickly
learns complex behaviors amid challenging cases of clutter, and achieves better
grasping success rates and picking efficiencies than baseline alternatives
after only a few hours of training. We further demonstrate that our method is
capable of generalizing to novel objects. Qualitative results (videos), code,
pre-trained models, and simulation environments are available at
http://vpg.cs.princeton.edu
Comment: To appear at the International Conference On Intelligent Robots and Systems (IROS) 2018. Project webpage: http://vpg.cs.princeton.edu Summary video: https://youtu.be/-OkyX7Zlhi
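At inference time, the action selection described above amounts to a greedy argmax over the two dense Q-maps. The PyTorch sketch below assumes each network directly outputs one Q-value per pixel per rotation, a simplification of the rotated-input scheme used in the paper.

    import numpy as np
    import torch

    def select_action(push_net, grasp_net, heightmap):
        """Greedy selection over dense pixel-wise Q-maps.

        Assumes each net is fully convolutional and maps a (1, C, H, W)
        heightmap to (1, R, H, W) Q-values, one per end-effector rotation
        and pixel location.
        """
        with torch.no_grad():
            q_push = push_net(heightmap)    # (1, R, H, W)
            q_grasp = grasp_net(heightmap)  # (1, R, H, W)
        q_all = torch.stack([q_push, q_grasp])               # (2, 1, R, H, W)
        flat = torch.argmax(q_all).item()
        prim, _, rot, y, x = np.unravel_index(flat, tuple(q_all.shape))
        # Execute the chosen primitive at pixel (y, x) of the heightmap with
        # the end effector rotated to orientation index `rot`.
        return ("push", "grasp")[prim], int(rot), int(y), int(x)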
Structure-Aware Shape Synthesis
We propose a new procedure to guide training of a data-driven shape
generative model using a structure-aware loss function. Complex 3D shapes often
can be summarized using a coarsely defined structure that is consistent and
robust across a variety of observations. However, existing synthesis techniques
do not account for structure during training, and thus often generate
implausible and structurally unrealistic shapes. During training, we impose
structural constraints in order to encourage consistency and structure across the
entire manifold. We propose a novel methodology for training 3D generative
models that incorporates structural information into an end-to-end training
pipeline.
Comment: Accepted to 3DV 2018.
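As an illustration of how a structural term can enter training, the sketch below adds a structure loss to a standard voxel reconstruction loss; the landmark-based structure representation and the weighting are assumptions, not the paper's formulation.

    import torch.nn.functional as F

    def structure_aware_loss(pred_voxels, gt_voxels, pred_structure, gt_structure,
                             w_struct=0.5):
        """Reconstruction loss plus a structural term (weights illustrative).

        pred_voxels/gt_voxels: (B, 1, D, D, D) occupancy probabilities/targets.
        pred_structure/gt_structure: (B, K, 3) coarse structural landmarks
        (e.g. part or keypoint locations) -- an assumed stand-in for a
        coarsely defined shape structure.
        """
        recon = F.binary_cross_entropy(pred_voxels, gt_voxels)
        struct = F.mse_loss(pred_structure, gt_structure)
        return recon + w_struct * struct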