1,791 research outputs found
MonoPerfCap: Human Performance Capture from Monocular Video
We present the first marker-less approach for temporally coherent 3D
performance capture of a human with general clothing from monocular video. Our
approach reconstructs articulated human skeleton motion as well as medium-scale
non-rigid surface deformations in general scenes. Human performance capture is
a challenging problem due to the large range of articulation, potentially fast
motion, and considerable non-rigid deformations, even from multi-view data.
Reconstruction from monocular video alone is drastically more challenging,
since strong occlusions and the inherent depth ambiguity lead to a highly
ill-posed reconstruction problem. We tackle these challenges by a novel
approach that employs sparse 2D and 3D human pose detections from a
convolutional neural network using a batch-based pose estimation strategy.
Joint recovery of per-batch motion allows to resolve the ambiguities of the
monocular reconstruction problem based on a low dimensional trajectory
subspace. In addition, we propose refinement of the surface geometry based on
fully automatically extracted silhouettes to enable medium-scale non-rigid
alignment. We demonstrate state-of-the-art performance capture results that
enable exciting applications such as video editing and free viewpoint video,
previously infeasible from monocular video. Our qualitative and quantitative
evaluation demonstrates that our approach significantly outperforms previous
monocular methods in terms of accuracy, robustness and scene complexity that
can be handled.Comment: Accepted to ACM TOG 2018, to be presented on SIGGRAPH 201
Learning to Reconstruct People in Clothing from a Single RGB Camera
We present a learning-based model to infer the personalized 3D shape of people from a few frames (1-8) of a monocular video in which the person is moving, in less than 10 seconds with a reconstruction accuracy of 5mm. Our model learns to predict the parameters of a statistical body model and instance displacements that add clothing and hair to the shape. The model achieves fast and accurate predictions based on two key design choices. First, by predicting shape in a canonical T-pose space, the network learns to encode the images of the person into pose-invariant latent codes, where the information is fused. Second, based on the observation that feed-forward predictions are fast but do not always align with the input images, we predict using both, bottom-up and top-down streams (one per view) allowing information to flow in both directions. Learning relies only on synthetic 3D data. Once learned, the model can take a variable number of frames as input, and is able to reconstruct shapes even from a single image with an accuracy of 6mm. Results on 3 different datasets demonstrate the efficacy and accuracy of our approach
Structure from Recurrent Motion: From Rigidity to Recurrency
This paper proposes a new method for Non-Rigid Structure-from-Motion (NRSfM)
from a long monocular video sequence observing a non-rigid object performing
recurrent and possibly repetitive dynamic action. Departing from the
traditional idea of using linear low-order or lowrank shape model for the task
of NRSfM, our method exploits the property of shape recurrency (i.e., many
deforming shapes tend to repeat themselves in time). We show that recurrency is
in fact a generalized rigidity. Based on this, we reduce NRSfM problems to
rigid ones provided that certain recurrency condition is satisfied. Given such
a reduction, standard rigid-SfM techniques are directly applicable (without any
change) to the reconstruction of non-rigid dynamic shapes. To implement this
idea as a practical approach, this paper develops efficient algorithms for
automatic recurrency detection, as well as camera view clustering via a
rigidity-check. Experiments on both simulated sequences and real data
demonstrate the effectiveness of the method. Since this paper offers a novel
perspective on rethinking structure-from-motion, we hope it will inspire other
new problems in the field.Comment: To appear in CVPR 201
PERGAMO: Personalized 3D Garments from Monocular Video
Clothing plays a fundamental role in digital humans. Current approaches to
animate 3D garments are mostly based on realistic physics simulation, however,
they typically suffer from two main issues: high computational run-time cost,
which hinders their development; and simulation-to-real gap, which impedes the
synthesis of specific real-world cloth samples. To circumvent both issues we
propose PERGAMO, a data-driven approach to learn a deformable model for 3D
garments from monocular images. To this end, we first introduce a novel method
to reconstruct the 3D geometry of garments from a single image, and use it to
build a dataset of clothing from monocular videos. We use these 3D
reconstructions to train a regression model that accurately predicts how the
garment deforms as a function of the underlying body pose. We show that our
method is capable of producing garment animations that match the real-world
behaviour, and generalizes to unseen body motions extracted from motion capture
dataset.Comment: Published at Computer Graphics Forum (Proc. of ACM/SIGGRAPH SCA),
2022. Project website http://mslab.es/projects/PERGAMO
FML: Face Model Learning from Videos
Monocular image-based 3D reconstruction of faces is a long-standing problem
in computer vision. Since image data is a 2D projection of a 3D face, the
resulting depth ambiguity makes the problem ill-posed. Most existing methods
rely on data-driven priors that are built from limited 3D face scans. In
contrast, we propose multi-frame video-based self-supervised training of a deep
network that (i) learns a face identity model both in shape and appearance
while (ii) jointly learning to reconstruct 3D faces. Our face model is learned
using only corpora of in-the-wild video clips collected from the Internet. This
virtually endless source of training data enables learning of a highly general
3D face model. In order to achieve this, we propose a novel multi-frame
consistency loss that ensures consistent shape and appearance across multiple
frames of a subject's face, thus minimizing depth ambiguity. At test time we
can use an arbitrary number of frames, so that we can perform both monocular as
well as multi-frame reconstruction.Comment: CVPR 2019 (Oral). Video: https://www.youtube.com/watch?v=SG2BwxCw0lQ,
Project Page: https://gvv.mpi-inf.mpg.de/projects/FML19
Self-supervised Multi-level Face Model Learning for Monocular Reconstruction at over 250 Hz
The reconstruction of dense 3D models of face geometry and appearance from a
single image is highly challenging and ill-posed. To constrain the problem,
many approaches rely on strong priors, such as parametric face models learned
from limited 3D scan data. However, prior models restrict generalization of the
true diversity in facial geometry, skin reflectance and illumination. To
alleviate this problem, we present the first approach that jointly learns 1) a
regressor for face shape, expression, reflectance and illumination on the basis
of 2) a concurrently learned parametric face model. Our multi-level face model
combines the advantage of 3D Morphable Models for regularization with the
out-of-space generalization of a learned corrective space. We train end-to-end
on in-the-wild images without dense annotations by fusing a convolutional
encoder with a differentiable expert-designed renderer and a self-supervised
training loss, both defined at multiple detail levels. Our approach compares
favorably to the state-of-the-art in terms of reconstruction quality, better
generalizes to real world faces, and runs at over 250 Hz.Comment: CVPR 2018 (Oral). Project webpage:
https://gvv.mpi-inf.mpg.de/projects/FML
- …