GANerated Hands for Real-time 3D Hand Tracking from Monocular RGB
We address the highly challenging problem of real-time 3D hand tracking based
on a monocular RGB-only sequence. Our tracking method combines a convolutional
neural network with a kinematic 3D hand model, such that it generalizes well to
unseen data, is robust to occlusions and varying camera viewpoints, and leads
to anatomically plausible as well as temporally smooth hand motions. For
training our CNN we propose a novel approach for the synthetic generation of
training data that is based on a geometrically consistent image-to-image
translation network. Specifically, we use a neural network that translates
synthetic images to "real" images, such that the generated images follow the
same statistical distribution as real-world hand images. For
training this translation network we combine an adversarial loss and a
cycle-consistency loss with a geometric consistency loss in order to preserve
geometric properties (such as hand pose) during translation. We demonstrate
that our hand tracking system outperforms the current state of the art on
challenging RGB-only footage.
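The combined translation objective described above (adversarial, cycle-consistency, and geometric-consistency terms) can be sketched with toy stand-ins. The function names, the linear generator/inverse pair, the silhouette-based geometry extractor, and the loss weights below are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: x_syn is a synthetic image, G translates syn -> "real",
# F translates "real" -> syn, D scores realism, and P extracts a crude
# hand-geometry map (here a silhouette) for the geometric term.
x_syn = rng.random((8, 8))
G = lambda x: x * 0.9 + 0.05            # generator (illustrative linear map)
F = lambda x: (x - 0.05) / 0.9          # reverse translator (exact inverse here)
D = lambda x: float(x.mean())           # discriminator score, roughly in [0, 1]
P = lambda x: (x > 0.5).astype(float)   # geometry proxy (binary silhouette)

x_fake = G(x_syn)

# Adversarial term: the generator wants D(x_fake) close to 1 (LSGAN-style).
l_adv = (D(x_fake) - 1.0) ** 2
# Cycle-consistency: translating there and back should recover the input.
l_cyc = np.abs(F(x_fake) - x_syn).mean()
# Geometric consistency: hand geometry should survive translation.
l_geo = np.abs(P(x_fake) - P(x_syn)).mean()

loss = l_adv + 10.0 * l_cyc + 1.0 * l_geo  # weights are placeholders
```

Because `F` exactly inverts `G` in this toy setup, the cycle term is near zero; in practice both translators are learned and the weights balance the three terms.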
Markerless structure-based multi-sensor calibration for free viewpoint video capture
Free-viewpoint capture technologies have recently started demonstrating impressive results. Being able to capture
human performances in full 3D is a very promising technology for a variety of applications. However, the setup
of the capturing infrastructure is usually expensive and requires trained personnel. In this work we focus on one
practical aspect of setting up a free-viewpoint capturing system, the spatial alignment of the sensors. Our work aims
at simplifying the external calibration process that typically requires significant human intervention and technical
knowledge. Our method uses an easy-to-assemble structure and, unlike similar
works, does not rely on markers or features. Instead, we exploit a priori
knowledge of the structure's geometry to establish correspondences between the
minimally overlapping viewpoints typically found in free-viewpoint capture
setups. These correspondences establish an initial sparse alignment that is
then densely optimized. At the same time, our pipeline improves robustness to
assembly errors, allowing non-technical users to calibrate multi-sensor
setups. Our results showcase the feasibility of our approach, which can make
the tedious calibration process easier and less error-prone.
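The sparse initial alignment described above can be sketched with a standard rigid-alignment step (the Kabsch algorithm) once correspondences between the known structure geometry and a sensor's observations exist. The point set, the simulated sensor pose, and the omission of the dense refinement stage below are all illustrative assumptions:

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid alignment (R, t) such that dst ~= src @ R.T + t,
    computed from known 3D point correspondences via SVD."""
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t

# Known structure geometry in its own frame (illustrative, non-degenerate).
structure = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]])

# Simulated observation of the structure from one sensor's viewpoint.
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.],
                   [np.sin(theta),  np.cos(theta), 0.],
                   [0.,             0.,            1.]])
t_true = np.array([0.5, -0.2, 1.0])
observed = structure @ R_true.T + t_true

R_est, t_est = kabsch(structure, observed)
```

With noise-free correspondences the recovered pose matches the simulated one exactly; a dense optimization stage, as in the pipeline above, would refine this estimate under real sensor noise.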
VIBE: Video Inference for Human Body Pose and Shape Estimation
Human motion is fundamental to understanding behavior. Despite progress on
single-image 3D pose and shape estimation, existing video-based
state-of-the-art methods fail to produce accurate and natural motion sequences
due to a lack of ground-truth 3D motion data for training. To address this
problem, we propose Video Inference for Body Pose and Shape Estimation (VIBE),
which makes use of an existing large-scale motion capture dataset (AMASS)
together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty
is an adversarial learning framework that leverages AMASS to discriminate
between real human motions and those produced by our temporal pose and shape
regression networks. We define a temporal network architecture and show that
adversarial training, at the sequence level, produces kinematically plausible
motion sequences without in-the-wild ground-truth 3D labels. We perform
extensive experimentation to analyze the importance of motion and demonstrate
the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving
state-of-the-art performance. Code and pretrained models are available at
https://github.com/mkocabas/VIBE. (CVPR 2020 camera ready.)
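The sequence-level adversarial objective described in the abstract can be sketched with toy stand-ins. The linear per-frame feature, mean pooling, and LSGAN-style losses below substitute for VIBE's recurrent motion discriminator and are assumptions for illustration, not the released code:

```python
import numpy as np

rng = np.random.default_rng(1)

def motion_discriminator(seq, w):
    """Toy sequence-level discriminator: pools per-frame features of a
    (T x D) pose sequence into a single realism score in (0, 1). A linear
    feature stands in for a learned recurrent discriminator."""
    feats = np.tanh(seq @ w)                 # (T,) per-frame feature
    return 1.0 / (1.0 + np.exp(-feats.mean()))

T, D = 16, 24
w = rng.normal(size=D)
real_motion = rng.normal(size=(T, D))  # stand-in for a mocap (AMASS) sequence
pred_motion = rng.normal(size=(T, D))  # stand-in for the regressor's output

s_real = motion_discriminator(real_motion, w)
s_fake = motion_discriminator(pred_motion, w)

# LSGAN-style objectives: the discriminator pushes real sequences toward 1
# and regressed ones toward 0; the regressor pushes its output toward 1.
d_loss = (s_real - 1.0) ** 2 + s_fake ** 2
g_loss = (s_fake - 1.0) ** 2
```

The point of scoring whole sequences rather than single frames is that implausible motion (jitter, foot skate) only shows up over time, which is what lets the unpaired mocap data supervise the video regressor.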