Learning to Reconstruct People in Clothing from a Single RGB Camera
We present a learning-based model to infer the personalized 3D shape of people from a few frames (1-8) of a monocular video in which the person is moving, in less than 10 seconds with a reconstruction accuracy of 5 mm. Our model learns to predict the parameters of a statistical body model and instance displacements that add clothing and hair to the shape. The model achieves fast and accurate predictions based on two key design choices. First, by predicting shape in a canonical T-pose space, the network learns to encode the images of the person into pose-invariant latent codes, where the information is fused. Second, based on the observation that feed-forward predictions are fast but do not always align with the input images, we predict using both bottom-up and top-down streams (one per view), allowing information to flow in both directions. Learning relies only on synthetic 3D data. Once learned, the model can take a variable number of frames as input, and is able to reconstruct shapes even from a single image with an accuracy of 6 mm. Results on 3 different datasets demonstrate the efficacy and accuracy of our approach.
Structure from Articulated Motion: Accurate and Stable Monocular 3D Reconstruction without Training Data
Recovery of articulated 3D structure from 2D observations is a challenging
computer vision problem with many applications. Current learning-based
approaches achieve state-of-the-art accuracy on public benchmarks but are
restricted to specific types of objects and motions covered by the training
datasets. Model-based approaches do not rely on training data but show lower
accuracy on these datasets. In this paper, we introduce a model-based method
called Structure from Articulated Motion (SfAM), which can recover multiple
object and motion types without training on extensive data collections. At the
same time, it performs on par with learning-based state-of-the-art approaches
on public benchmarks and outperforms previous non-rigid structure from motion
(NRSfM) methods. SfAM is built upon a general-purpose NRSfM technique while
integrating a soft spatio-temporal constraint on the bone lengths. We use an
alternating optimization strategy to recover optimal geometry (i.e., bone
proportions) together with 3D joint positions by enforcing bone-length
consistency over a series of frames. SfAM is highly robust to noisy 2D
annotations, generalizes to arbitrary objects and does not rely on training
data, which is shown in extensive experiments on public benchmarks and real
video sequences. We believe that it brings a new perspective on the domain of
monocular 3D recovery of articulated structures, including human motion
capture.
Comment: 21 pages, 8 figures, 2 tables
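The bone-length constraint described in the abstract above can be illustrated with a small sketch. This is not the SfAM implementation: the penalty below (variance of each bone's length over time) is only one plausible form of a soft consistency term, and the names `bone_lengths`, `bone_length_penalty`, and the toy skeleton are hypothetical.

```python
import numpy as np

def bone_lengths(joints, bones):
    """Per-frame bone lengths, shape (T, B).

    joints: array of shape (T, J, 3) with 3D joint positions per frame.
    bones:  list of (parent, child) joint index pairs (assumed skeleton).
    """
    parents = [p for p, _ in bones]
    children = [c for _, c in bones]
    return np.linalg.norm(joints[:, children] - joints[:, parents], axis=-1)

def bone_length_penalty(joints, bones):
    """Mean squared deviation of each bone's length from its temporal mean."""
    lengths = bone_lengths(joints, bones)
    mean_len = lengths.mean(axis=0, keepdims=True)
    return float(np.mean((lengths - mean_len) ** 2))

# Toy two-bone chain that only translates over four frames.
rest = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 2.0]])
rigid = np.stack([rest + t for t in range(4)])
bones = [(0, 1), (1, 2)]

# Same sequence, but the second bone is stretched in one frame.
stretched = rigid.copy()
stretched[2, 2, 2] += 1.0

rigid_penalty = bone_length_penalty(rigid, bones)
stretched_penalty = bone_length_penalty(stretched, bones)
```

A rigidly moving skeleton yields a zero penalty, while stretching a bone in a single frame makes it positive. In an alternating scheme such a term would be minimized jointly with a reprojection error; that coupling is omitted here.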
MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction
In this work we propose a novel model-based deep convolutional autoencoder
that addresses the highly challenging problem of reconstructing a 3D human face
from a single in-the-wild color image. To this end, we combine a convolutional
encoder network with an expert-designed generative model that serves as
decoder. The core innovation is our new differentiable parametric decoder that
encapsulates image formation analytically based on a generative model. Our
decoder takes as input a code vector with exactly defined semantic meaning that
encodes detailed face pose, shape, expression, skin reflectance and scene
illumination. Due to this new way of combining CNN-based with model-based face
reconstruction, the CNN-based encoder learns to extract semantically meaningful
parameters from a single monocular input image. For the first time, a CNN
encoder and an expert-designed generative model can be trained end-to-end in an
unsupervised manner, which renders training on very large (unlabeled) real
world data feasible. The obtained reconstructions compare favorably to current
state-of-the-art approaches in terms of quality and richness of representation.
Comment: International Conference on Computer Vision (ICCV) 2017 (Oral), 13
pages
Exploiting temporal information for 3D pose estimation
In this work, we address the problem of 3D human pose estimation from a
sequence of 2D human poses. Although the recent success of deep networks has
led many state-of-the-art methods for 3D pose estimation to train deep networks
end-to-end to predict from images directly, the top-performing approaches have
shown the effectiveness of dividing the task of 3D pose estimation into two
steps: using a state-of-the-art 2D pose estimator to estimate the 2D pose from
images and then mapping them into 3D space. They also showed that a
low-dimensional representation like 2D locations of a set of joints can be
discriminative enough to estimate 3D pose with high accuracy. However,
estimation of 3D pose for individual frames leads to temporally incoherent
estimates due to independent error in each frame causing jitter. Therefore, in
this work we utilize the temporal information across a sequence of 2D joint
locations to estimate a sequence of 3D poses. We designed a
sequence-to-sequence network composed of layer-normalized LSTM units with
shortcut connections connecting the input to the output on the decoder side and
imposed a temporal smoothness constraint during training. We found that the
knowledge of temporal consistency improves the best reported result on the
Human3.6M dataset by approximately and helps our network to recover
temporally consistent 3D poses over a sequence of images even when the 2D pose
detector fails.
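A temporal smoothness term of the kind mentioned in the abstract above might look like the following sketch. The abstract does not state the exact form of the constraint, so the first-order difference penalty here is an assumption, and `temporal_smoothness` and the synthetic sequences are hypothetical.

```python
import numpy as np

def temporal_smoothness(poses):
    """Mean squared difference between consecutive frames.

    poses: array of shape (T, J, 3) holding a sequence of 3D poses.
    """
    diffs = np.diff(poses, axis=0)  # shape (T - 1, J, 3)
    return float(np.mean(diffs ** 2))

T, J = 50, 17
t = np.linspace(0.0, 1.0, T)

# Every joint coordinate drifts slowly and linearly over time.
smooth = np.tile(t[:, None, None], (1, J, 3))

# The same motion with independent per-frame noise, i.e. jitter.
rng = np.random.default_rng(0)
jittery = smooth + rng.normal(scale=0.05, size=smooth.shape)

smooth_cost = temporal_smoothness(smooth)
jittery_cost = temporal_smoothness(jittery)
```

Independent per-frame errors raise the penalty well above that of the underlying smooth motion, which is why adding such a term to a training loss discourages jitter.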
Unsupervised 3D Pose Estimation with Geometric Self-Supervision
We present an unsupervised learning approach to recover 3D human pose from 2D
skeletal joints extracted from a single image. Our method does not require any
multi-view image data, 3D skeletons, correspondences between 2D-3D points, or
use previously learned 3D priors during training. A lifting network accepts 2D
landmarks as inputs and generates a corresponding 3D skeleton estimate. During
training, the recovered 3D skeleton is reprojected on random camera viewpoints
to generate new "synthetic" 2D poses. By lifting the synthetic 2D poses back to
3D and re-projecting them in the original camera view, we can define a
self-consistency loss both in 3D and in 2D. The training can thus be
self-supervised by exploiting the geometric self-consistency of the
lift-reproject-lift process. We show that self-consistency alone is not
sufficient to generate realistic skeletons; however, adding a 2D pose
discriminator enables the lifter to output valid 3D poses. Additionally, to
learn from 2D poses "in the wild", we train an unsupervised 2D domain adapter
network to allow for an expansion of 2D data. This improves results and
demonstrates the usefulness of 2D pose data for unsupervised 3D lifting.
Results on the Human3.6M dataset for 3D human pose estimation demonstrate that
our approach improves upon the previous unsupervised methods by 30% and
outperforms many weakly supervised approaches that explicitly use 3D data.
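The lift-reproject-lift cycle described in the abstract above can be sketched as follows. This is a simplified illustration, not the paper's method: it assumes orthographic projection and a rotation about the vertical axis, and it uses a deliberately naive stand-in lifter (`flat_lifter`, which assigns zero depth) in place of a trained network. All function names are hypothetical.

```python
import numpy as np

def rotation_y(angle):
    """Rotation matrix about the vertical (y) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(points3d):
    """Orthographic projection (an assumption): drop the depth coordinate."""
    return points3d[:, :2]

def flat_lifter(points2d):
    """Stand-in for a learned lifter: assigns zero depth to every joint."""
    depth = np.zeros((len(points2d), 1))
    return np.concatenate([points2d, depth], axis=1)

def self_consistency_losses(pose2d, lift, rotation):
    """Return (3D loss, 2D loss) for one lift-reproject-lift cycle."""
    lifted = lift(pose2d)                # 2D -> 3D
    rotated = lifted @ rotation.T        # move to a new camera view
    synthetic2d = project(rotated)       # synthetic 2D pose
    relifted = lift(synthetic2d)         # lift the synthetic pose
    loss3d = float(np.mean((relifted - rotated) ** 2))
    back = relifted @ rotation           # inverse rotation, original view
    loss2d = float(np.mean((project(back) - pose2d) ** 2))
    return loss3d, loss2d

pose2d = np.array([[0.0, 0.0], [1.0, 0.5], [-1.0, 1.0]])
same_view = self_consistency_losses(pose2d, flat_lifter, rotation_y(0.0))
new_view = self_consistency_losses(pose2d, flat_lifter, rotation_y(np.pi / 3))
```

With no change of view the cycle is trivially consistent, while a rotated view exposes the flat lifter's missing depth through nonzero 3D and 2D losses. In training, the rotation would be sampled at random and the losses minimized with respect to the lifter's parameters.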