One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2
While recent research has progressively overcome the low-resolution
constraint of one-shot face video re-enactment with the help of StyleGAN's
high-fidelity portrait generation, these approaches rely on at least one of the
following: explicit 2D/3D priors, optical-flow-based warping as motion
descriptors, off-the-shelf encoders, etc., which constrain their performance
(e.g., inconsistent predictions, inability to capture fine facial details and
accessories, poor generalization, artifacts). We propose an end-to-end
framework for simultaneously supporting face attribute edits, facial motions
and deformations, and facial identity control for video generation. It employs
a hybrid latent-space that encodes a given frame into a pair of latents: an
Identity latent and a Facial deformation latent, which respectively reside in
the $W+$ and $SS$ (StyleSpace) spaces of StyleGAN2, thereby incorporating the
impressive editability-distortion trade-off of $W+$ and the high
disentanglement properties of $SS$. These hybrid latents are fed to the
StyleGAN2 generator to achieve high-fidelity face video re-enactment at
$1024^2$. Furthermore, the model supports the generation of
realistic re-enactment videos with other latent-based semantic edits (e.g.,
beard, age, make-up, etc.). Qualitative and quantitative analyses performed
against state-of-the-art methods demonstrate the superiority of the proposed
approach.
Comment: The project page is located at
https://trevineoorloff.github.io/FaceVideoReenactment_HybridLatents.io
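To make the pipeline concrete, below is a minimal PyTorch sketch of the hybrid-latent idea. It is not the authors' implementation: the encoder backbones, the `StubGenerator`, and all dimensions are illustrative assumptions; only the split into a one-shot Identity latent in $W+$ and a per-frame Facial deformation latent in StyleSpace comes from the abstract.

```python
# Hedged sketch of the hybrid-latent re-enactment pipeline; module internals
# and dimensions are placeholders, NOT the authors' implementation.
import torch
import torch.nn as nn

W_PLUS_SHAPE = (18, 512)  # W+ for a 1024x1024 StyleGAN2: 18 style layers x 512
SS_DIM = 9088             # illustrative StyleSpace size for a 1024x1024 generator

class IdentityEncoder(nn.Module):
    """Hypothetical encoder: source frame -> Identity latent in W+."""
    def __init__(self, in_dim=3 * 256 * 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, W_PLUS_SHAPE[0] * W_PLUS_SHAPE[1]),
        )

    def forward(self, frame):
        return self.net(frame).view(-1, *W_PLUS_SHAPE)

class DeformationEncoder(nn.Module):
    """Hypothetical encoder: driving frame -> Facial deformation latent in StyleSpace."""
    def __init__(self, in_dim=3 * 256 * 256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, SS_DIM))

    def forward(self, frame):
        return self.net(frame)

class StubGenerator(nn.Module):
    """Stand-in for a pretrained StyleGAN2 generator consuming both latents."""
    def forward(self, w_id, s_deform):
        return torch.zeros(w_id.shape[0], 3, 1024, 1024)  # placeholder image

def reenact(generator, id_enc, fd_enc, source_frame, driving_frames):
    """One-shot re-enactment: identity from a single source frame,
    motion/deformation from each driving frame."""
    w_id = id_enc(source_frame)  # computed once (one-shot identity)
    return [generator(w_id, fd_enc(f)) for f in driving_frames]

# Usage with random tensors standing in for real frames:
frames = [torch.randn(1, 3, 256, 256) for _ in range(3)]
out = reenact(StubGenerator(), IdentityEncoder(), DeformationEncoder(),
              frames[0], frames)
```

The key design point reflected here is that the identity latent is encoded exactly once, so per-frame cost reduces to one deformation encoding plus one generator pass.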
Expressive Talking Head Video Encoding in StyleGAN2 Latent-Space
While the recent advances in research on video reenactment have yielded
promising results, the approaches fall short in capturing the fine, detailed,
and expressive facial features (e.g., lip-pressing, mouth puckering, mouth
gaping, and wrinkles), which are crucial in generating realistic animated face
videos. To this end, we propose an end-to-end expressive face video encoding
approach that facilitates data-efficient high-quality video re-synthesis by
optimizing low-dimensional edits of a single Identity-latent. The approach
builds on StyleGAN2 image inversion and multi-stage non-linear latent-space
editing to generate videos that closely match the input videos. While
existing StyleGAN latent-based editing techniques focus on simply generating
plausible edits of static images, we automate the latent-space editing to
capture the fine expressive facial deformations in a sequence of frames using
an encoding that resides in the Style-latent-space (StyleSpace) of StyleGAN2.
The encoding thus obtained can be superimposed on a single Identity-latent to
facilitate re-enactment of face videos at $1024^2$. The proposed framework
economically captures face identity, head-pose, and complex expressive facial
motions at fine levels, and thereby bypasses training, person modeling,
dependence on landmarks/keypoints, and low-resolution synthesis, which tend to
hamper most re-enactment approaches. The approach is designed for maximum data
efficiency: a single latent and 35 parameters per frame enable
high-fidelity video rendering. This pipeline can also be used for puppeteering
(i.e., motion transfer).
Comment: The project page is located at
https://trevineoorloff.github.io/ExpressiveFaceVideoEncoding.io
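To illustrate the stated data budget (one Identity-latent plus 35 parameters per frame), here is a hedged sketch. The `render_video` helper, the direction bank, and all shapes are hypothetical, and the sketch simplifies the Identity-latent to a StyleSpace vector so that superimposing the per-frame edit is a single addition; only the single-latent-plus-35-parameters budget comes from the abstract.

```python
# Hypothetical sketch of the single-latent + 35-parameters-per-frame encoding;
# the direction bank, shapes, and generator interface are assumptions.
import torch

NUM_PARAMS = 35  # per-frame parameter count stated in the abstract
SS_DIM = 9088    # illustrative StyleSpace size for a 1024x1024 StyleGAN2

def render_video(generator, s_identity, per_frame_params, directions):
    """
    s_identity:       (SS_DIM,)            single identity latent (simplified to StyleSpace)
    per_frame_params: (T, NUM_PARAMS)      35 scalars per frame
    directions:       (NUM_PARAMS, SS_DIM) fixed bank of StyleSpace edit directions
    """
    frames = []
    for p in per_frame_params:
        s = s_identity + p @ directions  # superimpose the low-dimensional edit
        frames.append(generator(s))
    return frames

# Usage with a stand-in generator (a pretrained StyleGAN2 in the real pipeline):
stub_gen = lambda s: torch.zeros(3, 1024, 1024)
video = render_video(stub_gen, torch.zeros(SS_DIM),
                     torch.zeros(120, NUM_PARAMS),  # 120-frame clip
                     torch.randn(NUM_PARAMS, SS_DIM))
```

Under these assumptions, puppeteering (motion transfer) amounts to reusing one subject's `per_frame_params` with another subject's `s_identity`.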