VIBE: Video Inference for Human Body Pose and Shape Estimation
Human motion is fundamental to understanding behavior. Despite progress on
single-image 3D pose and shape estimation, existing video-based
state-of-the-art methods fail to produce accurate and natural motion sequences
due to a lack of ground-truth 3D motion data for training. To address this
problem, we propose Video Inference for Body Pose and Shape Estimation (VIBE),
which makes use of an existing large-scale motion capture dataset (AMASS)
together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty
is an adversarial learning framework that leverages AMASS to discriminate
between real human motions and those produced by our temporal pose and shape
regression networks. We define a temporal network architecture and show that
adversarial training, at the sequence level, produces kinematically plausible
motion sequences without in-the-wild ground-truth 3D labels. We perform
extensive experimentation to analyze the importance of motion and demonstrate
the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving
state-of-the-art performance. Code and pretrained models are available at
https://github.com/mkocabas/VIBE.
Comment: CVPR-2020 camera ready.
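The adversarial idea above can be sketched as a pair of sequence-level losses: a discriminator scores whole motion sequences (real ones from AMASS versus ones produced by the temporal regressor), and the regressor is trained to make its sequences score as "real". The least-squares (LSGAN-style) objective and the variable names below are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def adversarial_losses(d_real, d_fake):
    """Sequence-level adversarial losses (LSGAN-style, an illustrative choice).

    d_real: discriminator scores for real AMASS motion sequences
    d_fake: discriminator scores for sequences from the temporal regressor
    Returns (discriminator loss, regressor/generator loss).
    """
    # Discriminator: push real-sequence scores toward 1, fake toward 0.
    d_loss = np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)
    # Regressor: produce sequences the discriminator scores as real.
    g_loss = np.mean((d_fake - 1.0) ** 2)
    return d_loss, g_loss
```

Because the discriminator sees entire sequences rather than single frames, its gradient penalizes implausible motion (jitter, foot skating), which is how the abstract's "kinematically plausible motion sequences without in-the-wild ground-truth 3D labels" claim is realized.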
Motion-DVAE: Unsupervised learning for fast human motion denoising
Pose and motion priors are crucial for recovering realistic and accurate
human motion from noisy observations. Substantial progress has been made on
pose and shape estimation from images, and recent works showed impressive
results using priors to refine frame-wise predictions. However, many motion
priors model only transitions between consecutive poses and are used in
time-consuming optimization procedures, which is problematic for applications
requiring real-time motion capture. We introduce Motion-DVAE, a motion prior
that captures the short-term dependencies of human motion. As part of
the dynamical variational autoencoder (DVAE) models family, Motion-DVAE
combines the generative capability of VAE models and the temporal modeling of
recurrent architectures. Together with Motion-DVAE, we introduce an
unsupervised learned denoising method unifying regression- and
optimization-based approaches in a single framework for real-time 3D human pose
estimation. Experiments show that the proposed approach achieves performance
competitive with state-of-the-art methods while being much faster.
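The DVAE combination described above (VAE-style latent inference plus recurrent temporal modeling) can be sketched as a per-step loop: a recurrent state summarizes past poses, an encoder maps the state and the noisy observation to a latent code, and a decoder emits a cleaned pose. The dimensions, wiring, and random weights below are illustrative assumptions, not Motion-DVAE's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_sequence(noisy_poses, hidden_dim=16, latent_dim=8):
    """Toy DVAE-style recurrent denoiser over a pose sequence (T, D).

    At each step: recurrence -> latent inference -> decode. Weights are
    random (untrained); only the structure is meant to be instructive.
    """
    T, D = noisy_poses.shape
    W_h = rng.normal(0, 0.1, (hidden_dim, hidden_dim + D))
    W_mu = rng.normal(0, 0.1, (latent_dim, hidden_dim + D))
    W_dec = rng.normal(0, 0.1, (D, latent_dim + hidden_dim))
    h = np.zeros(hidden_dim)
    out = np.zeros_like(noisy_poses)
    for t in range(T):
        x = noisy_poses[t]
        # Recurrent state: the short-term temporal prior over past poses.
        h = np.tanh(W_h @ np.concatenate([h, x]))
        # Encoder: posterior mean of the latent (use the mean at test time).
        mu = W_mu @ np.concatenate([h, x])
        # Decoder: cleaned pose from latent code and temporal state.
        out[t] = W_dec @ np.concatenate([mu, h])
    return out
```

Because denoising is a single forward pass per frame rather than an iterative optimization, this structure is what makes the real-time claim in the abstract plausible.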
Semi-supervised Dense Keypoints Using Unlabeled Multiview Images
This paper presents a new end-to-end semi-supervised framework to learn a
dense keypoint detector using unlabeled multiview images. A key challenge lies
in finding the exact correspondences between the dense keypoints in multiple
views since the inverse of the keypoint mapping can be neither analytically
derived nor differentiated. This prevents the direct application of existing
multiview supervision approaches for sparse keypoints, which rely on exact
correspondences. To address this challenge, we derive a new probabilistic
epipolar constraint that encodes two desired properties. (1) Soft
correspondence: we define a matchability score that measures the likelihood of
a point matching the corresponding point in the other image, thus relaxing the
requirement of exact correspondences. (2) Geometric consistency: every point
in the continuous correspondence fields must collectively satisfy multiview
consistency. We formulate the probabilistic epipolar constraint using a
weighted average of epipolar errors through the matchability, thereby
generalizing the point-to-point geometric error to the field-to-field geometric
error. This generalization facilitates learning a geometrically coherent dense
keypoint detection model by utilizing a large number of unlabeled multiview
images. Additionally, to prevent degenerate cases, we employ a
distillation-based regularization using a pretrained model. Finally, we
design a new neural network architecture, made of twin networks, that
effectively minimizes the probabilistic epipolar errors of all possible
correspondences between two view images by building affinity matrices. Our
method shows superior performance compared to existing methods, including
non-differentiable bootstrapping in terms of keypoint accuracy, multiview
consistency, and 3D reconstruction accuracy.
Comment: Published as a conference paper at NeurIPS 202
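The weighted-average formulation above can be sketched directly: for each point in view 1, compute its epipolar line in view 2 via the fundamental matrix, measure the point-to-line distance to every candidate point in view 2, and average those distances weighted by a matchability distribution (a row of the affinity matrix). The softmax weighting below is an illustrative stand-in for the paper's learned matchability:

```python
import numpy as np

def weighted_epipolar_error(pts1, pts2, F, logits):
    """Field-to-field epipolar error via soft correspondence.

    pts1: (N1, 2) points in view 1;  pts2: (N2, 2) points in view 2
    F: (3, 3) fundamental matrix mapping view-1 points to view-2 lines
    logits: (N1, N2) affinity scores, softmaxed into matchability weights
    """
    ones1 = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous coords
    ones2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    lines = ones1 @ F.T                                 # epipolar lines in view 2
    # Distance of every view-2 point to every view-1 epipolar line.
    num = np.abs(lines @ ones2.T)                       # (N1, N2)
    den = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    dist = num / den
    # Matchability: softmax over view-2 candidates (illustrative choice).
    w = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return float((w * dist).sum(axis=1).mean())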
Sim2real transfer learning for 3D human pose estimation: motion to the rescue
Synthetic visual data can provide practically infinite diversity and rich
labels, while avoiding ethical issues with privacy and bias. However, for many
tasks, current models trained on synthetic data generalize poorly to real data.
The task of 3D human pose estimation is a particularly interesting example of
this sim2real problem, because learning-based approaches perform reasonably
well given real training data, yet labeled 3D poses are extremely difficult to
obtain in the wild, limiting scalability. In this paper, we show that standard
neural-network approaches, which perform poorly when trained on synthetic RGB
images, can perform well when the data is pre-processed to extract cues about
the person's motion, notably as optical flow and the motion of 2D keypoints.
Therefore, our results suggest that motion can be a simple way to bridge a
sim2real gap when video is available. We evaluate on the 3D Poses in the Wild
dataset, the most challenging modern benchmark for 3D pose estimation, where we
show full 3D mesh recovery that is on par with state-of-the-art methods trained
on real 3D sequences, despite training only on synthetic humans from the
SURREAL dataset.Comment: Accepted at NeurIPS 201
- …