1,524 research outputs found
GANerated Hands for Real-time 3D Hand Tracking from Monocular RGB
We address the highly challenging problem of real-time 3D hand tracking based
on a monocular RGB-only sequence. Our tracking method combines a convolutional
neural network with a kinematic 3D hand model, such that it generalizes well to
unseen data, is robust to occlusions and varying camera viewpoints, and leads
to anatomically plausible as well as temporally smooth hand motions. For
training our CNN we propose a novel approach for the synthetic generation of
training data that is based on a geometrically consistent image-to-image
translation network. To be more specific, we use a neural network that
translates synthetic images to "real" images, such that the so-generated images
follow the same statistical distribution as real-world hand images. For
training this translation network we combine an adversarial loss and a
cycle-consistency loss with a geometric consistency loss in order to preserve
geometric properties (such as hand pose) during translation. We demonstrate
that our hand tracking system outperforms the current state-of-the-art on
challenging RGB-only footage.
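As a hedged illustration (the abstract names the loss terms but not their weighting, so the λ coefficients below are assumptions), the translation network's training objective combines the three losses as

$$
\mathcal{L}_{\text{trans}} = \mathcal{L}_{\text{GAN}}(G, D) + \lambda_{\text{cyc}}\,\mathcal{L}_{\text{cyc}}(G, F) + \lambda_{\text{geo}}\,\mathcal{L}_{\text{geo}}(G),
$$

where $G$ translates synthetic hand images to the real domain, $F$ translates back, the adversarial term matches the statistics of real-world hand images, the cycle term enforces $F(G(x)) \approx x$, and the geometric term penalizes changes to geometric properties such as hand pose between $x$ and $G(x)$.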
Frustum PointNets for 3D Object Detection from RGB-D Data
In this work, we study 3D object detection from RGB-D data in both indoor and
outdoor scenes. While previous methods focus on images or 3D voxels, often
obscuring natural 3D patterns and invariances of 3D data, we directly operate
on raw point clouds by popping up RGB-D scans. However, a key challenge of this
approach is how to efficiently localize objects in point clouds of large-scale
scenes (region proposal). Instead of solely relying on 3D proposals, our method
leverages both mature 2D object detectors and advanced 3D deep learning for
object localization, achieving efficiency as well as high recall for even small
objects. Benefiting from learning directly on raw point clouds, our method is
also able to precisely estimate 3D bounding boxes even under strong occlusion
or with very sparse points. Evaluated on KITTI and SUN RGB-D 3D detection
benchmarks, our method outperforms the state of the art by remarkable margins
while having real-time capability.
Comment: 15 pages, 12 figures, 14 tables
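To make the frustum proposal step concrete, the sketch below lifts a 2D detection into a 3D frustum by keeping the points that project inside the box. This is a minimal numpy illustration with assumed names, not the authors' code.

```python
import numpy as np

def crop_frustum(points, K, box2d):
    """Keep camera-frame 3D points whose projection lies inside a 2D box.

    points: (N, 3) point cloud popped up from an RGB-D scan,
            in camera coordinates (x right, y down, z forward).
    K:      (3, 3) pinhole camera intrinsic matrix.
    box2d:  (xmin, ymin, xmax, ymax) box from a mature 2D detector.
    """
    pts = points[points[:, 2] > 0]      # keep points in front of the camera
    proj = pts @ K.T                    # homogeneous image coordinates
    u = proj[:, 0] / proj[:, 2]         # pixel column
    v = proj[:, 1] / proj[:, 2]         # pixel row
    xmin, ymin, xmax, ymax = box2d
    inside = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax)
    return pts[inside]                  # frustum point cloud
```

In Frustum PointNets, the points inside each frustum are then passed to PointNet-based networks for 3D instance segmentation and amodal bounding-box regression; the sketch covers only the region-proposal cropping.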
2D+3D Indoor Scene Understanding from a Single Monocular Image
Scene understanding, as a broad field encompassing many
subtopics, has gained great interest in recent years. Among these
subtopics, indoor scene understanding, which has its own specific
attributes and challenges compared to outdoor scene understanding,
has drawn a lot of attention. It has potential applications in a
wide variety of domains, such as robotic navigation, object
grasping for personal robotics, and augmented reality. To our
knowledge, existing research on indoor scenes typically makes use
of depth sensors, such as the Kinect, which are, however, not
always available.
In this thesis, we focused on addressing the indoor scene
understanding tasks in a general case, where only a monocular
color image of the scene is available. Specifically, we first
studied the problem of estimating a detailed depth map from a
monocular image. Then, benefiting from deep-learning-based depth
estimation, we tackled the higher-level tasks of 3D box proposal
generation, and scene parsing with instance segmentation,
semantic labeling and support relationship inference from a
monocular image. Our research on indoor scene understanding
provides a comprehensive scene interpretation at various
perspectives and scales.
For monocular image depth estimation, previous approaches are
limited in that they reason about depth only locally and at a
single scale, and do not exploit the important information carried
by geometric scene structures. Here, we developed a novel
graphical model that reasons about detailed depth while leveraging
geometric scene structures at multiple scales.
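As an illustrative sketch only (the abstract does not state the exact potentials), such a model is commonly posed as an energy over the depth map $\mathbf{d}$, with unary terms from local depth predictions and pairwise terms encoding geometric structure at each scale $s$:

$$
E(\mathbf{d}) = \sum_{i} \phi_i(d_i) + \sum_{s} \lambda_s \sum_{(i,j)\in\mathcal{E}_s} \psi^{(s)}_{ij}(d_i, d_j),
$$

where $\mathcal{E}_s$ connects pixels or regions at scale $s$ and the weights $\lambda_s$ are assumed hyperparameters.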
For 3D box proposals, to the best of our knowledge, our approach
constitutes the first attempt to reason about class-independent
3D box proposals from a single monocular image. To this end, we
developed a novel integrated, differentiable framework that
estimates depth, extracts a volumetric scene representation and
generates 3D proposals. At the core of this framework lies a
novel residual, differentiable truncated signed distance function
module, which is able to handle the relatively low accuracy of
the predicted depth map.
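For intuition about the volumetric representation, here is a minimal numpy sketch of a projective truncated signed distance function computed from a predicted depth map. It is an assumption-level illustration; the thesis' residual, differentiable TSDF module is not reproduced here.

```python
import numpy as np

def projective_tsdf(voxels, depth, K, trunc=0.1):
    """Projective TSDF of voxel centers against a (predicted) depth map.

    voxels: (N, 3) voxel centers in the camera frame.
    depth:  (H, W) depth map in metres, e.g. predicted from a monocular image.
    K:      (3, 3) pinhole camera intrinsics.
    trunc:  truncation distance in metres (an assumed hyperparameter).
    """
    H, W = depth.shape
    tsdf = np.full(len(voxels), np.nan)      # NaN marks unobserved voxels
    z = voxels[:, 2]
    front = z > 1e-6                         # only voxels in front of the camera
    proj = voxels[front] @ K.T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)   # pixel column
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)   # pixel row
    in_img = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    idx = np.flatnonzero(front)[in_img]
    # Signed distance along the viewing ray: observed depth minus voxel depth.
    sdf = depth[v[in_img], u[in_img]] - z[idx]
    tsdf[idx] = np.clip(sdf / trunc, -1.0, 1.0)
    return tsdf
```

Truncating the signed distance to [-1, 1] bounds the influence of errors in the input depth, which is part of what makes a TSDF volume attractive when the depth map is only approximately correct.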
For scene parsing, we tackled its three subtasks of instance
segmentation, semantic labeling, and support relationship
inference on instances. Existing work typically reasons about
these individual subtasks independently. Here, we leveraged the
fact that they bear strong connections, which can facilitate
addressing these subtasks if modeled properly. To this end, we
developed an integrated graphical model that reasons about the
mutual relationships of the above subtasks.
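Purely as an illustrative sketch (the abstract does not spell out the model), an integrated formulation of this kind can be written as a joint energy over the instance segmentation $S$, semantic labels $L$, and support relations $R$:

$$
E(S, L, R) = E_{\text{seg}}(S) + E_{\text{sem}}(L) + E_{\text{sup}}(R) + E_{\text{joint}}(S, L, R),
$$

where the coupling term $E_{\text{joint}}$ ties the subtasks together so that, for example, a region labeled as a cup is encouraged to be supported by a region labeled as a table.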
In summary, in this thesis we introduced novel and effective
methodologies for each of three indoor scene understanding tasks,
i.e., depth estimation, 3D box proposal generation, and scene
parsing, and exploited the dependence of the latter two tasks on
depth estimates. Evaluation on several benchmark datasets
demonstrated the effectiveness of our algorithms and the benefits
of using depth estimates for higher-level tasks.