While recent research has progressively overcome the low-resolution
constraint of one-shot face video re-enactment with the help of StyleGAN's
high-fidelity portrait generation, these approaches rely on at least one of the
following: explicit 2D/3D priors, optical-flow-based warping as motion
descriptors, off-the-shelf encoders, etc., which constrain their performance
(e.g., inconsistent predictions, inability to capture fine facial details and
accessories, poor generalization, artifacts). We propose an end-to-end
framework for simultaneously supporting face attribute edits, facial motions
and deformations, and facial identity control for video generation. It employs
a hybrid latent-space that encodes a given frame into a pair of latents:
Identity latent, $\mathcal{W}_{ID}$, and Facial deformation latent,
$\mathcal{S}_F$, which respectively reside in the $W+$ and $SS$ spaces of
StyleGAN2, thereby combining the impressive editability-distortion
trade-off of $W+$ with the high disentanglement of $SS$. These hybrid
latents drive the StyleGAN2 generator to achieve high-fidelity face video
re-enactment at $1024^2$. Furthermore, the model supports the generation of
realistic re-enactment videos with other latent-based semantic edits (e.g.,
beard, age, make-up). Qualitative and quantitative analyses performed
against state-of-the-art methods demonstrate the superiority of the proposed
approach.

Comment: The project page is located at
https://trevineoorloff.github.io/FaceVideoReenactment_HybridLatents.io
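For concreteness, below is a minimal PyTorch sketch of how such a hybrid-latent pipeline could be wired together. The module name HybridLatentEncoder, the latent dimensionalities, and the generator's style_offsets interface are illustrative assumptions rather than the authors' released code; the sketch only shows how the identity latent of a source frame and the deformation latent of a driving frame could be recombined for one-shot re-enactment.

```python
# Minimal sketch (PyTorch) of the hybrid-latent re-enactment pipeline.
# HybridLatentEncoder, the latent sizes, and the generator's `style_offsets`
# keyword are assumptions for illustration, not the authors' released API.
import torch
import torch.nn as nn

NUM_WS, W_DIM = 18, 512  # W+ of a 1024x1024 StyleGAN2: 18 style vectors of 512-d
S_DIM = 9088             # assumed flattened StyleSpace (SS) dimensionality


class HybridLatentEncoder(nn.Module):
    """Encodes a frame into an identity latent (W+) and a facial
    deformation latent (SS), mirroring the paper's two-latent design."""

    def __init__(self):
        super().__init__()
        # Placeholder heads; the real encoders are trained end-to-end.
        self.identity_head = nn.Sequential(nn.Flatten(), nn.LazyLinear(NUM_WS * W_DIM))
        self.deform_head = nn.Sequential(nn.Flatten(), nn.LazyLinear(S_DIM))

    def forward(self, frame):  # frame: (B, 3, H, W)
        w_id = self.identity_head(frame).view(-1, NUM_WS, W_DIM)  # W+ identity latent
        s_f = self.deform_head(frame)                             # SS deformation latent
        return w_id, s_f


def reenact(generator, encoder, source_frame, driving_frame):
    """One-shot re-enactment: identity from the source frame, facial
    deformation from the driving frame."""
    w_id, _ = encoder(source_frame)   # keep the source identity
    _, s_f = encoder(driving_frame)   # transfer motion/deformation from the driver
    # The generator is assumed to consume a W+ code plus StyleSpace offsets.
    return generator(w_id, style_offsets=s_f)


# Shape-level usage with a stub generator standing in for StyleGAN2:
encoder = HybridLatentEncoder()
stub_generator = lambda w, style_offsets: torch.randn(w.shape[0], 3, 1024, 1024)
src, drv = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
print(reenact(stub_generator, encoder, src, drv).shape)  # (1, 3, 1024, 1024)
```

Since StyleGAN attribute edits (age, beard, etc.) are typically applied as offsets in the $W+$ space, this two-stream separation also suggests how the semantic edits described in the abstract could coexist with motion transfer: editing $\mathcal{W}_{ID}$ changes attributes while $\mathcal{S}_F$ continues to drive the facial deformation.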