A Groupwise Multilinear Correspondence Optimization for 3D Faces
Multilinear face models are widely used to model the space of human faces with expressions. For databases of 3D human faces of different identities performing multiple expressions, these statistical shape models decouple identity and expression variations. To compute a high-quality multilinear face model, the quality of the registration of the database of 3D face scans used for training is essential. Conversely, a multilinear face model can be used as an effective prior to register 3D face scans, which are typically noisy and incomplete. Inspired by the minimum description length approach, we propose the first method to jointly optimize a multilinear model and the registration of the 3D scans used for training. Given an initial registration, our approach fully automatically improves the registration by optimizing an objective function that measures the compactness of the multilinear model, resulting in a sparse model. We choose a continuous representation for each face shape that allows us to use a quasi-Newton method in parameter space for optimization. We show that our approach is computationally significantly more efficient and leads to correspondences of higher quality than existing methods based on linear statistical models. This allows us to evaluate our approach on large standard 3D face databases and in the presence of noisy initializations.
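The compactness objective described above can be illustrated with a small numerical sketch. The snippet below computes an MDL-style energy from the singular values of the identity- and expression-mode unfoldings of a registered data tensor; the shapes and the exact energy are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of an MDL-style compactness energy for a multilinear
# (Tucker) face model; shapes and the exact energy are assumptions.
import numpy as np

def mode_unfold(tensor, mode):
    """Unfold a 3-way tensor along the given mode into a matrix."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def compactness_energy(data, eps=1e-8):
    """Sum of log singular values of the identity and expression unfoldings.

    data: array of shape (3 * n_vertices, n_identities, n_expressions),
          mean-centered registered face shapes.
    Smaller values indicate a more compact (sparser) multilinear model.
    """
    energy = 0.0
    for mode in (1, 2):  # identity mode and expression mode
        unfolding = mode_unfold(data, mode)
        sing = np.linalg.svd(unfolding, compute_uv=False)
        energy += np.sum(np.log(sing + eps))
    return energy

# Example with random stand-in data: 100 vertices, 10 identities, 5 expressions.
data = np.random.randn(300, 10, 5)
data -= data.mean(axis=(1, 2), keepdims=True)
print(compactness_energy(data))
```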
Generating 3D faces using Convolutional Mesh Autoencoders
Learned 3D representations of human faces are useful for computer vision
problems such as 3D face tracking and reconstruction from images, as well as
graphics applications such as character generation and animation. Traditional
models learn a latent representation of a face using linear subspaces or
higher-order tensor generalizations. Due to this linearity, they cannot
capture extreme deformations and non-linear expressions. To address this, we
introduce a versatile model that learns a non-linear representation of a face
using spectral convolutions on a mesh surface. We introduce mesh sampling
operations that enable a hierarchical mesh representation that captures
non-linear variations in shape and expression at multiple scales within the
model. In a variational setting, our model samples diverse realistic 3D faces
from a multivariate Gaussian distribution. Our training data consists of 20,466
meshes of extreme expressions captured over 12 different subjects. Despite
limited training data, our trained model outperforms state-of-the-art face
models with 50% lower reconstruction error, while using 75% fewer parameters.
We also show that replacing the expression space of an existing
state-of-the-art face model with our autoencoder achieves a lower
reconstruction error. Our data, model and code are available at
http://github.com/anuragranj/com
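A core building block of such mesh autoencoders is a spectral convolution on the mesh graph. The layer below is a minimal Chebyshev-polynomial convolution sketch in PyTorch, assuming a precomputed mesh Laplacian rescaled so its eigenvalues lie in [-1, 1]; it is illustrative, not the released CoMA code.

```python
# Minimal sketch of a Chebyshev spectral convolution on mesh vertices,
# in the spirit of the layers used by convolutional mesh autoencoders.
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    def __init__(self, in_channels, out_channels, K):
        super().__init__()
        # One weight matrix per Chebyshev polynomial order.
        self.weight = nn.Parameter(torch.randn(K, in_channels, out_channels) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_channels))

    def forward(self, x, laplacian):
        # x: (n_vertices, in_channels)
        # laplacian: (n_vertices, n_vertices) mesh Laplacian, assumed rescaled
        # so its eigenvalues lie in [-1, 1].
        Tx_prev = x
        out = Tx_prev @ self.weight[0]
        if self.weight.shape[0] > 1:
            Tx = laplacian @ x
            out = out + Tx @ self.weight[1]
            for k in range(2, self.weight.shape[0]):
                Tx_next = 2 * (laplacian @ Tx) - Tx_prev  # Chebyshev recurrence
                out = out + Tx_next @ self.weight[k]
                Tx_prev, Tx = Tx, Tx_next
        return out + self.bias
```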
Instant Volumetric Head Avatars
We present Instant Volumetric Head Avatars (INSTA), a novel approach for
reconstructing photo-realistic digital avatars instantaneously. INSTA models a
dynamic neural radiance field based on neural graphics primitives embedded
around a parametric face model. Our pipeline is trained on a single monocular
RGB portrait video that observes the subject under different expressions and
views. While state-of-the-art methods take up to several days to train an
avatar, our method can reconstruct a digital avatar in less than 10 minutes on
modern GPU hardware, which is orders of magnitude faster than previous
solutions. In addition, it allows for the interactive rendering of novel poses
and expressions. By leveraging the geometry prior of the underlying parametric
face model, we demonstrate that INSTA extrapolates to unseen poses. In
quantitative and qualitative studies on various subjects, INSTA outperforms
state-of-the-art methods regarding rendering quality and training time.
Comment: Website: https://zielon.github.io/insta/ Video: https://youtu.be/HOgaeWTih7
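Dynamic neural radiance fields such as the one described above ultimately render pixels with the standard volume-rendering quadrature. The sketch below shows only that compositing step; the radiance field itself (hash-grid features deformed around the parametric face model) is left abstract, and the function interface is an assumption.

```python
# Generic volume-rendering compositing step used by NeRF-style pipelines;
# shapes and interface are assumptions for illustration.
import torch

def composite(rgb, sigma, t_vals):
    """Alpha-composite per-sample color/density along each ray.

    rgb:    (n_rays, n_samples, 3) predicted colors
    sigma:  (n_rays, n_samples) predicted densities
    t_vals: (n_rays, n_samples) sample depths along the ray
    """
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)            # opacity per sample
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)  # accumulated transmittance
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alpha * trans
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)     # (n_rays, 3) pixel colors
```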
Capture, Learning, and Synthesis of 3D Speaking Styles
Audio-driven 3D facial animation has been widely explored, but achieving
realistic, human-like performance is still unsolved. This is due to the lack of
available 3D datasets, models, and standard evaluation metrics. To address
this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans
captured at 60 fps and synchronized audio from 12 speakers. We then train a
neural network on our dataset that factors identity from facial motion. The
learned model, VOCA (Voice Operated Character Animation), takes any speech
signal as input - even speech in languages other than English - and
realistically animates a wide range of adult faces. Conditioning on subject
labels during training allows the model to learn a variety of realistic
speaking styles. VOCA also provides animator controls to alter speaking style,
identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball
rotations) during animation. To our knowledge, VOCA is the only realistic 3D
facial animation model that is readily applicable to unseen subjects without
retargeting. This makes VOCA suitable for tasks like in-game video, virtual
reality avatars, or any scenario in which the speaker, speech, or language is
not known in advance. We make the dataset and model available for research
purposes at http://voca.is.tue.mpg.de. Comment: To appear in CVPR 2019.
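At a high level, a VOCA-style model maps a window of audio features plus a subject label to per-vertex offsets added to a neutral template mesh. The sketch below illustrates that mapping with placeholder layer sizes and names; it is not the released model architecture.

```python
# Hedged sketch of a speech-to-offsets regressor with subject conditioning;
# layer sizes and the class name are placeholders.
import torch
import torch.nn as nn

class SpeechToOffsets(nn.Module):
    def __init__(self, audio_dim, n_subjects, n_vertices, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + n_subjects, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_vertices * 3),
        )

    def forward(self, audio_feat, subject_onehot, template):
        # audio_feat: (B, audio_dim); subject_onehot: (B, n_subjects)
        # template:   (n_vertices, 3) neutral face mesh
        x = torch.cat([audio_feat, subject_onehot], dim=-1)
        offsets = self.net(x).view(-1, template.shape[0], 3)
        return template.unsqueeze(0) + offsets  # animated vertices per frame
```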
Instant Multi-View Head Capture through Learnable Registration
Existing methods for capturing datasets of 3D heads in dense semantic
correspondence are slow, and commonly address the problem in two separate
steps: multi-view stereo (MVS) reconstruction followed by non-rigid
registration. To simplify this process, we introduce TEMPEH (Towards Estimation
of 3D Meshes from Performances of Expressive Heads) to directly infer 3D heads
in dense correspondence from calibrated multi-view images. Registering datasets
of 3D scans typically requires manual parameter tuning to find the right
balance between accurately fitting the scan surfaces and being robust to
scanning noise and outliers. Instead, we propose to jointly register a 3D head
dataset while training TEMPEH. Specifically, during training we minimize a
geometric loss commonly used for surface registration, effectively leveraging
TEMPEH as a regularizer. Our multi-view head inference builds on a volumetric
feature representation that samples and fuses features from each view using
camera calibration information. To account for partial occlusions and a large
capture volume that enables head movements, we use view- and surface-aware
feature fusion, and a spatial transformer-based head localization module,
respectively. We use raw MVS scans as supervision during training, but, once
trained, TEMPEH directly predicts 3D heads in dense correspondence without
requiring scans. Predicting one head takes about 0.3 seconds with a median
reconstruction error of 0.26 mm, 64% lower than the current state-of-the-art.
This enables the efficient capture of large datasets containing multiple people
and diverse facial motions. Code, model, and data are publicly available at
https://tempeh.is.tue.mpg.de. Comment: Conference on Computer Vision and Pattern Recognition (CVPR) 2023.
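The calibrated volumetric feature representation mentioned above can be illustrated by projecting 3D grid points into a view and sampling its feature map. The helper below shows that projection-and-sampling step for a single view; the view- and surface-aware fusion and the head localization module are omitted, and the interface is an assumption.

```python
# Illustrative single-view feature sampling for a calibrated volumetric grid;
# interface and shapes are assumptions, not TEMPEH's implementation.
import torch
import torch.nn.functional as F

def sample_view_features(points, feat, K, Rt):
    """points: (N, 3) world-space grid points
       feat:   (C, H, W) feature map of one view
       K:      (3, 3) intrinsics; Rt: (3, 4) extrinsics [R | t]."""
    p_h = torch.cat([points, torch.ones_like(points[:, :1])], dim=-1)  # (N, 4)
    cam = (Rt @ p_h.T).T                                               # (N, 3)
    pix = (K @ cam.T).T
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)                     # (N, 2) pixels
    H, W = feat.shape[1:]
    grid = torch.stack([2 * pix[:, 0] / (W - 1) - 1,                   # normalize to [-1, 1]
                        2 * pix[:, 1] / (H - 1) - 1], dim=-1)
    grid = grid.view(1, -1, 1, 2)
    sampled = F.grid_sample(feat.unsqueeze(0), grid, align_corners=True)
    return sampled.view(feat.shape[0], -1).T                           # (N, C) per-point features
```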
Statistical Shape Spaces for 3D Data: A Review
Methods and systems for capturing 3D geometry are becoming increasingly commonplace, and with them comes a plethora of 3D data. Much of this data is unfortunately corrupted by noise, missing data, occlusions, or other outliers. However, when we are interested in the shape of a particular class of objects, such as human faces or bodies, we can use machine learning techniques, applied to clean, registered databases of these shapes, to make sense of raw 3D point clouds or other data. This has applications ranging from virtual change rooms to motion and gait analysis to surgical planning, depending on the type of shape. In this chapter, we give an overview of these techniques, a brief review of the literature, and a comparative evaluation of two such shape spaces for human faces.
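A linear (PCA) shape space of the kind reviewed here can be built and applied in a few lines. The sketch below constructs a PCA model from vectorized, registered meshes and projects a noisy shape onto it; the data shapes are placeholders for illustration.

```python
# Minimal PCA shape-space sketch; data shapes are illustrative assumptions.
import numpy as np

def build_pca_model(shapes, n_components):
    """shapes: (n_samples, 3 * n_vertices) registered, vectorized meshes."""
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:n_components]                  # principal shape directions
    return mean, basis

def project(shape, mean, basis):
    """Reconstruct a (possibly noisy) shape in the learned shape space."""
    coeffs = basis @ (shape - mean)
    return mean + basis.T @ coeffs

shapes = np.random.randn(50, 300)              # 50 faces, 100 vertices each
mean, basis = build_pca_model(shapes, n_components=10)
clean = project(shapes[0] + 0.1 * np.random.randn(300), mean, basis)
```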
Expressive Body Capture: 3D Hands, Face, and Body from a Single Image
To facilitate the analysis of human actions, interactions and emotions, we
compute a 3D model of human body pose, hand pose, and facial expression from a
single monocular image. To achieve this, we use thousands of 3D scans to train
a new, unified, 3D model of the human body, SMPL-X, that extends SMPL with
fully articulated hands and an expressive face. Learning to regress the
parameters of SMPL-X directly from images is challenging without paired images
and 3D ground truth. Consequently, we follow the approach of SMPLify, which
estimates 2D features and then optimizes model parameters to fit the features.
We improve on SMPLify in several significant ways: (1) we detect 2D features
corresponding to the face, hands, and feet and fit the full SMPL-X model to
these; (2) we train a new neural network pose prior using a large MoCap
dataset; (3) we define a new interpenetration penalty that is both fast and
accurate; (4) we automatically detect gender and the appropriate body models
(male, female, or neutral); (5) our PyTorch implementation achieves a speedup
of more than 8x over Chumpy. We use the new method, SMPLify-X, to fit SMPL-X to
both controlled images and images in the wild. We evaluate 3D accuracy on a new
curated dataset comprising 100 images with pseudo ground-truth. This is a step
towards automatic expressive human capture from monocular RGB data. The models,
code, and data are available for research purposes at
https://smpl-x.is.tue.mpg.de. Comment: To appear in CVPR 2019.
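The SMPLify-style fitting described above boils down to minimizing a weighted 2D reprojection error plus priors over the model parameters. The loop below is a schematic of that idea with placeholder `body_model` and `project` callables standing in for SMPL-X and a camera; it is not the released SMPLify-X code.

```python
# Schematic optimization loop for fitting a body model to 2D keypoints;
# `body_model`, `project`, and the attribute names are placeholders.
import torch

def fit(body_model, project, keypoints_2d, conf, n_iters=200):
    pose = torch.zeros(1, body_model.n_pose, requires_grad=True)
    betas = torch.zeros(1, body_model.n_betas, requires_grad=True)
    opt = torch.optim.Adam([pose, betas], lr=0.01)
    for _ in range(n_iters):
        opt.zero_grad()
        joints_3d = body_model(pose=pose, betas=betas)        # (1, J, 3)
        joints_2d = project(joints_3d)                        # (1, J, 2)
        reproj = (conf * (joints_2d - keypoints_2d).pow(2).sum(-1)).mean()
        prior = pose.pow(2).mean() + betas.pow(2).mean()      # stand-in priors
        loss = reproj + 1e-2 * prior
        loss.backward()
        opt.step()
    return pose.detach(), betas.detach()
```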
Towards Racially Unbiased Skin Tone Estimation via Scene Disambiguation
Virtual facial avatars will play an increasingly important role in immersive
communication, games and the metaverse, and it is therefore critical that they
be inclusive. This requires accurate recovery of the appearance, represented by
albedo, regardless of age, sex, or ethnicity. While significant progress has
been made on estimating 3D facial geometry, albedo estimation has received less
attention. The task is fundamentally ambiguous because the observed color is a
function of albedo and lighting, both of which are unknown. We find that
current methods are biased towards light skin tones due to (1) strongly biased
priors that prefer lighter pigmentation and (2) algorithmic solutions that
disregard the light/albedo ambiguity. To address this, we propose a new
evaluation dataset (FAIR) and an algorithm (TRUST) to improve albedo estimation
and, hence, fairness. Specifically, we create the first facial albedo
evaluation benchmark where subjects are balanced in terms of skin color, and
measure accuracy using the Individual Typology Angle (ITA) metric. We then
address the light/albedo ambiguity by building on a key observation: the image
of the full scene -- as opposed to a cropped image of the face -- contains
important information about lighting that can be used for disambiguation. TRUST
regresses facial albedo by conditioning both on the face region and a global
illumination signal obtained from the scene image. Our experimental results
show significant improvement compared to state-of-the-art methods on albedo
estimation, both in terms of accuracy and fairness. The evaluation benchmark
and code will be made available for research purposes at
https://trust.is.tue.mpg.de. Comment: Camera-ready version, accepted at ECCV 2022.
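The Individual Typology Angle used as the skin-tone metric above has a standard closed form in CIELAB space, ITA = arctan((L* - 50) / b*) expressed in degrees. The helper below is a straightforward implementation of that formula (using arctan2 so that b* = 0 is handled); the example inputs are illustrative, and this is not the benchmark's exact code.

```python
# Individual Typology Angle from CIELAB lightness L* and yellow-blue b*.
import numpy as np

def individual_typology_angle(L_star, b_star):
    """ITA in degrees; equals arctan((L* - 50) / b*) for positive b*."""
    return np.degrees(np.arctan2(L_star - 50.0, b_star))

# Higher ITA corresponds to lighter skin tones, e.g. (illustrative values):
print(individual_typology_angle(70.0, 15.0))   # lighter tone
print(individual_typology_angle(35.0, 20.0))   # darker tone
```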
SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes
We present SCULPT, a novel 3D generative model for clothed and textured 3D
meshes of humans. Specifically, we devise a deep neural network that learns to
represent the geometry and appearance distribution of clothed human bodies.
Training such a model is challenging, as datasets of textured 3D meshes for
humans are limited in size and accessibility. Our key observation is that there
exist medium-sized 3D scan datasets like CAPE, as well as large-scale 2D image
datasets of clothed humans, and that multiple appearances can be mapped to a single
geometry. To effectively learn from the two data modalities, we propose an
unpaired learning procedure for pose-dependent clothed and textured human
meshes. Specifically, we learn a pose-dependent geometry space from 3D scan
data. We represent this as per-vertex displacements w.r.t. the SMPL model.
Next, we train a geometry-conditioned texture generator in an unsupervised way
using the 2D image data. We use intermediate activations of the learned
geometry model to condition our texture generator. To alleviate entanglement
between pose and clothing type, and pose and clothing appearance, we condition
both the texture and geometry generators with attribute labels such as clothing
types for the geometry, and clothing colors for the texture generator. We
automatically generate these conditioning labels for the 2D images using
the visual question answering model BLIP and CLIP. We validate our method on
the SCULPT dataset, and compare to state-of-the-art 3D generative models for
clothed human bodies. We will release the codebase for research purposes.
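The per-vertex displacement representation mentioned above amounts to adding learned offsets to the SMPL surface. The snippet below is a minimal sketch of that composition, with `smpl` as a placeholder for an SMPL implementation; it is not SCULPT's code.

```python
# Sketch of composing clothed geometry as offsets over posed SMPL vertices;
# the `smpl` callable is a placeholder.
import torch

def clothed_vertices(smpl, pose, betas, displacements):
    """pose/betas: SMPL parameters; displacements: (n_vertices, 3) offsets
    produced by a geometry generator, defined w.r.t. the SMPL surface."""
    body_verts = smpl(pose=pose, betas=betas)   # (n_vertices, 3) posed body
    return body_verts + displacements           # clothed mesh vertices
```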
GIF: Generative Interpretable Faces
Photo-realistic visualization and animation of expressive human faces have
been a long-standing challenge. 3D face modeling methods provide parametric
control but generate unrealistic images; generative 2D
models like GANs (Generative Adversarial Networks), on the other hand, output photo-realistic face
images but lack explicit control. Recent methods gain partial control, either
by attempting to disentangle different factors in an unsupervised manner, or by
adding control post hoc to a pre-trained model. Unconditional GANs, however,
may entangle factors that are hard to undo later. We condition our generative
model on pre-defined control parameters to encourage disentanglement in the
generation process. Specifically, we condition StyleGAN2 on FLAME, a generative
3D face model. While conditioning on FLAME parameters yields unsatisfactory
results, we find that conditioning on rendered FLAME geometry and photometric
details works well. This gives us a generative 2D face model named GIF
(Generative Interpretable Faces) that offers FLAME's parametric control. Here,
interpretable refers to the semantic meaning of different parameters. Given
FLAME parameters for shape, pose, and expression, parameters for appearance and
lighting, and an additional style vector, GIF outputs photo-realistic face
images. We perform an AMT based perceptual study to quantitatively and
qualitatively evaluate how well GIF follows its conditioning. The code, data,
and trained model are publicly available for research purposes at
http://gif.is.tue.mpg.de. Comment: International Conference on 3D Vision (3DV) 2020.
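Conditioning an image generator on rendered FLAME geometry and photometric detail maps, rather than on raw parameters, can be sketched as encoding the rendered maps into a conditioning vector that is fed to the generator alongside the style vector. The module below is such a sketch with placeholder channel counts and a generic `generator`; it is not GIF's released architecture.

```python
# Sketch of geometry-render conditioning for an image generator;
# channel counts, `render` inputs, and the generator interface are placeholders.
import torch
import torch.nn as nn

class ConditionedGenerator(nn.Module):
    def __init__(self, generator, cond_channels=6, embed_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(            # encode rendered conditioning maps
            nn.Conv2d(cond_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.generator = generator               # e.g. a StyleGAN2-like network

    def forward(self, geometry_render, texture_render, style):
        # geometry_render, texture_render: (B, 3, H, W) rendered FLAME maps
        cond = torch.cat([geometry_render, texture_render], dim=1)   # (B, 6, H, W)
        cond_vec = self.encoder(cond).flatten(1)                     # (B, embed_dim)
        return self.generator(torch.cat([cond_vec, style], dim=1))
```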