812 research outputs found
HeadOn: Real-time Reenactment of Human Portrait Videos
We propose HeadOn, the first real-time source-to-target reenactment approach
for complete human portrait videos that enables transfer of torso and head
motion, face expression, and eye gaze. Given a short RGB-D video of the target
actor, we automatically construct a personalized geometry proxy that embeds a
parametric head, eye, and kinematic torso model. A novel real-time reenactment
algorithm employs this proxy to photo-realistically map the captured motion
from the source actor to the target actor. On top of the coarse geometric
proxy, we propose a video-based rendering technique that composites the
modified target portrait video via view- and pose-dependent texturing, and
creates photo-realistic imagery of the target actor under novel torso and head
poses, facial expressions, and gaze directions. To this end, we propose a
robust tracking of the face and torso of the source actor. We extensively
evaluate our approach and show significant improvements in enabling much
greater flexibility in creating realistic reenacted output videos.Comment: Video: https://www.youtube.com/watch?v=7Dg49wv2c_g Presented at
Siggraph'1
Learning to Reconstruct People in Clothing from a Single RGB Camera
We present a learning-based model to infer the personalized 3D shape of people from a few frames (1-8) of a monocular video in which the person is moving, in less than 10 seconds with a reconstruction accuracy of 5mm. Our model learns to predict the parameters of a statistical body model and instance displacements that add clothing and hair to the shape. The model achieves fast and accurate predictions based on two key design choices. First, by predicting shape in a canonical T-pose space, the network learns to encode the images of the person into pose-invariant latent codes, where the information is fused. Second, based on the observation that feed-forward predictions are fast but do not always align with the input images, we predict using both, bottom-up and top-down streams (one per view) allowing information to flow in both directions. Learning relies only on synthetic 3D data. Once learned, the model can take a variable number of frames as input, and is able to reconstruct shapes even from a single image with an accuracy of 6mm. Results on 3 different datasets demonstrate the efficacy and accuracy of our approach
Relightable and Animatable Neural Avatar from Sparse-View Video
This paper tackles the challenge of creating relightable and animatable
neural avatars from sparse-view (or even monocular) videos of dynamic humans
under unknown illumination. Compared to studio environments, this setting is
more practical and accessible but poses an extremely challenging ill-posed
problem. Previous neural human reconstruction methods are able to reconstruct
animatable avatars from sparse views using deformed Signed Distance Fields
(SDF) but cannot recover material parameters for relighting. While
differentiable inverse rendering-based methods have succeeded in material
recovery of static objects, it is not straightforward to extend them to dynamic
humans as it is computationally intensive to compute pixel-surface intersection
and light visibility on deformed SDFs for inverse rendering. To solve this
challenge, we propose a Hierarchical Distance Query (HDQ) algorithm to
approximate the world space distances under arbitrary human poses.
Specifically, we estimate coarse distances based on a parametric human model
and compute fine distances by exploiting the local deformation invariance of
SDF. Based on the HDQ algorithm, we leverage sphere tracing to efficiently
estimate the surface intersection and light visibility. This allows us to
develop the first system to recover animatable and relightable neural avatars
from sparse view (or monocular) inputs. Experiments demonstrate that our
approach is able to produce superior results compared to state-of-the-art
methods. Our code will be released for reproducibility.Comment: Project page: https://zju3dv.github.io/relightable_avata
EgoFace: Egocentric Face Performance Capture and Videorealistic Reenactment
Face performance capture and reenactment techniques use multiple cameras and sensors, positioned at a distance from the face or mounted on heavy wearable devices. This limits their applications in mobile and outdoor environments. We present EgoFace, a radically new lightweight setup for face performance capture and front-view videorealistic reenactment using a single egocentric RGB camera. Our lightweight setup allows operations in uncontrolled environments, and lends itself to telepresence applications such as video-conferencing from dynamic environments. The input image is projected into a low dimensional latent space of the facial expression parameters. Through careful adversarial training of the parameter-space synthetic rendering, a videorealistic animation is produced. Our problem is challenging as the human visual system is sensitive to the smallest face irregularities that could occur in the final results. This sensitivity is even stronger for video results. Our solution is trained in a pre-processing stage, through a supervised manner without manual annotations. EgoFace captures a wide variety of facial expressions, including mouth movements and asymmetrical expressions. It works under varying illuminations, background, movements, handles people from different ethnicities and can operate in real time
Hi4D: 4D Instance Segmentation of Close Human Interaction
We propose Hi4D, a method and dataset for the automatic analysis of
physically close human-human interaction under prolonged contact. Robustly
disentangling several in-contact subjects is a challenging task due to
occlusions and complex shapes. Hence, existing multi-view systems typically
fuse 3D surfaces of close subjects into a single, connected mesh. To address
this issue we leverage i) individually fitted neural implicit avatars; ii) an
alternating optimization scheme that refines pose and surface through periods
of close proximity; and iii) thus segment the fused raw scans into individual
instances. From these instances we compile Hi4D dataset of 4D textured scans of
20 subject pairs, 100 sequences, and a total of more than 11K frames. Hi4D
contains rich interaction-centric annotations in 2D and 3D alongside accurately
registered parametric body models. We define varied human pose and shape
estimation tasks on this dataset and provide results from state-of-the-art
methods on these benchmarks.Comment: Project page: https://yifeiyin04.github.io/Hi4D
Implicit Neural Head Synthesis via Controllable Local Deformation Fields
High-quality reconstruction of controllable 3D head avatars from 2D videos is
highly desirable for virtual human applications in movies, games, and
telepresence. Neural implicit fields provide a powerful representation to model
3D head avatars with personalized shape, expressions, and facial parts, e.g.,
hair and mouth interior, that go beyond the linear 3D morphable model (3DMM).
However, existing methods do not model faces with fine-scale facial features,
or local control of facial parts that extrapolate asymmetric expressions from
monocular videos. Further, most condition only on 3DMM parameters with poor(er)
locality, and resolve local features with a global neural field. We build on
part-based implicit shape models that decompose a global deformation field into
local ones. Our novel formulation models multiple implicit deformation fields
with local semantic rig-like control via 3DMM-based parameters, and
representative facial landmarks. Further, we propose a local control loss and
attention mask mechanism that promote sparsity of each learned deformation
field. Our formulation renders sharper locally controllable nonlinear
deformations than previous implicit monocular approaches, especially mouth
interior, asymmetric expressions, and facial details.Comment: Accepted at CVPR 202
Capture, Learning, and Synthesis of 3D Speaking Styles
Audio-driven 3D facial animation has been widely explored, but achieving
realistic, human-like performance is still unsolved. This is due to the lack of
available 3D datasets, models, and standard evaluation metrics. To address
this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans
captured at 60 fps and synchronized audio from 12 speakers. We then train a
neural network on our dataset that factors identity from facial motion. The
learned model, VOCA (Voice Operated Character Animation) takes any speech
signal as input - even speech in languages other than English - and
realistically animates a wide range of adult faces. Conditioning on subject
labels during training allows the model to learn a variety of realistic
speaking styles. VOCA also provides animator controls to alter speaking style,
identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball
rotations) during animation. To our knowledge, VOCA is the only realistic 3D
facial animation model that is readily applicable to unseen subjects without
retargeting. This makes VOCA suitable for tasks like in-game video, virtual
reality avatars, or any scenario in which the speaker, speech, or language is
not known in advance. We make the dataset and model available for research
purposes at http://voca.is.tue.mpg.de.Comment: To appear in CVPR 201
- …