MegaPortraits: One-shot Megapixel Neural Head Avatars
In this work, we advance the neural head avatar technology to the megapixel
resolution while focusing on the particularly challenging task of cross-driving
synthesis, i.e., when the appearance of the driving image is substantially
different from the animated source image. We propose a set of new neural
architectures and training methods that can leverage both medium-resolution
video data and high-resolution image data to achieve the desired levels of
rendered image quality and generalization to novel views and motion. We
demonstrate that the suggested architectures and methods produce convincing
high-resolution neural avatars, outperforming competitors in the
cross-driving scenario. Lastly, we show how a trained high-resolution neural
avatar model can be distilled into a lightweight student model that runs in
real time and locks the identities of neural avatars to several dozen
pre-defined source images. Real-time operation and identity lock are essential
for many practical applications of head avatar systems.
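The distillation step maps naturally onto standard teacher-student training. Below is a minimal sketch under stated assumptions: a frozen teacher, a small student whose identity slots form a fixed embedding table (the "identity lock"), and a plain L1 objective. All module names, sizes, and the loss choice are hypothetical illustrations, not the authors' implementation.

```python
# Minimal teacher-student distillation sketch with an identity-locked student.
# Hypothetical stand-in modules; not the MegaPortraits code.
import torch
import torch.nn as nn

class StudentGenerator(nn.Module):
    """Tiny generator conditioned on an identity index and a motion code."""
    def __init__(self, num_identities: int, motion_dim: int = 64):
        super().__init__()
        # One embedding slot per pre-defined source image: the student can
        # only render these identities, which "locks" them in.
        self.id_embed = nn.Embedding(num_identities, 64)
        self.net = nn.Sequential(
            nn.Linear(64 + motion_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * 32 * 32),  # tiny output purely for illustration
        )

    def forward(self, identity_idx, motion_code):
        h = torch.cat([self.id_embed(identity_idx), motion_code], dim=-1)
        return self.net(h).view(-1, 3, 32, 32)

def distill_step(teacher, student, optimizer, identity_idx, motion_code):
    with torch.no_grad():                       # teacher stays frozen
        target = teacher(identity_idx, motion_code)
    loss = nn.functional.l1_loss(student(identity_idx, motion_code), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

teacher = StudentGenerator(num_identities=30)   # stand-in for the big teacher
student = StudentGenerator(num_identities=30)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
print(distill_step(teacher, student, opt, torch.tensor([3]), torch.randn(1, 64)))
```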
CVTHead: One-shot Controllable Head Avatar with Vertex-feature Transformer
Reconstructing personalized animatable head avatars has significant
implications in the fields of AR/VR. Existing methods for achieving explicit
face control of 3D Morphable Models (3DMM) typically rely on multi-view images
or videos of a single subject, making the reconstruction process complex.
Additionally, the traditional rendering pipeline is time-consuming, limiting
real-time animation possibilities. In this paper, we introduce CVTHead, a novel
approach that generates controllable neural head avatars from a single
reference image using point-based neural rendering. CVTHead treats the sparse
vertices of the head mesh as a point set and employs the proposed
Vertex-feature Transformer to learn local feature descriptors for each vertex.
This enables the modeling of long-range dependencies among all the vertices.
Experimental results on the VoxCeleb dataset demonstrate that CVTHead achieves
comparable performance to state-of-the-art graphics-based methods. Moreover, it
enables efficient rendering of novel human heads with various expressions, head
poses, and camera views. These attributes can be explicitly controlled using
the coefficients of 3DMMs, facilitating versatile and realistic animation in
real-time scenarios.
Comment: WACV202
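The core "vertices as tokens" idea is easy to sketch: each 3DMM mesh vertex becomes one transformer token, so self-attention models long-range dependencies across the whole head. Feature sizes, depth, and the FLAME-like vertex count below are assumptions for illustration, not CVTHead's configuration.

```python
# Sketch: mesh vertices as transformer tokens, yielding per-vertex descriptors.
import torch
import torch.nn as nn

class VertexFeatureTransformer(nn.Module):
    def __init__(self, feat_dim: int = 128, num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        self.input_proj = nn.Linear(3, feat_dim)  # lift xyz coordinates to features
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, vertices):                  # (batch, num_vertices, 3)
        tokens = self.input_proj(vertices)
        # Every vertex attends to every other vertex: long-range dependencies.
        return self.encoder(tokens)               # (batch, num_vertices, feat_dim)

verts = torch.randn(1, 5023, 3)  # e.g. a FLAME-like sparse vertex set (assumed)
print(VertexFeatureTransformer()(verts).shape)  # torch.Size([1, 5023, 128])
```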
ToonTalker: Cross-Domain Face Reenactment
We target cross-domain face reenactment in this paper, i.e., driving a
cartoon image with the video of a real person and vice versa. Recently, many
works have focused on one-shot talking face generation to drive a portrait with
a real video, i.e., within-domain reenactment. Straightforwardly applying those
methods to cross-domain animation will cause inaccurate expression transfer,
blur effects, and even apparent artifacts due to the domain shift between
cartoon and real faces. Only a few works attempt to address cross-domain face
reenactment. The most related work, AnimeCeleb, requires constructing a dataset
of pose-vector and cartoon-image pairs by animating 3D characters, which makes
it inapplicable when no paired data is available. In this paper, we
propose a novel method for cross-domain reenactment without paired data.
Specifically, we propose a transformer-based framework to align the motions
from different domains into a common latent space where motion transfer is
conducted via latent code addition. Two domain-specific motion encoders and two
learnable motion base memories are used to capture domain properties. A source
query transformer and a driving query transformer project domain-specific
motion into the canonical space. The edited motion is projected back to the
domain of the source with a transformer. Moreover, since no paired data is
provided, we propose a novel cross-domain training scheme using data from two
domains with the designed analogy constraint. Besides, we contribute a cartoon
dataset in Disney style. Extensive evaluations demonstrate the superiority of
our method over competing methods.
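The central mechanism, domain-specific encoders feeding a shared motion space where transfer is plain latent addition, can be sketched as follows. The encoder architecture and the simple additive rule are simplifying assumptions; the paper's query transformers and motion base memories are omitted here.

```python
# Sketch: two domain-specific encoders, one shared motion space, transfer by
# latent code addition. Stand-in modules, not ToonTalker's networks.
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """One per domain (cartoon / real); maps a frame to a motion code."""
    def __init__(self, code_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, code_dim),
        )

    def forward(self, img):
        return self.backbone(img)

cartoon_enc, real_enc = MotionEncoder(), MotionEncoder()
source = torch.randn(1, 3, 256, 256)    # cartoon source frame
driving = torch.randn(1, 3, 256, 256)   # real driving frame

# Transfer in the common space: add the driving motion code to the source code.
z_edit = cartoon_enc(source) + real_enc(driving)
print(z_edit.shape)  # torch.Size([1, 256]); a decoder would render this motion
```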
Neural Voice Puppetry: Audio-driven Facial Reenactment
We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis. Given an audio sequence of a source person or digital assistant, we generate a photo-realistic output video of a target person that is in sync with the audio of the source input. This audio-driven facial reenactment is driven by a deep neural network that employs a latent 3D face model space. Through the underlying 3D representation, the model inherently learns temporal stability, while we leverage neural rendering to generate photo-realistic output frames. Our approach generalizes across different people, allowing us to synthesize videos of a target actor with the voice of any unknown source actor, or even synthetic voices generated by standard text-to-speech approaches. Neural Voice Puppetry has a variety of use cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head. We demonstrate the capabilities of our method in a series of audio- and text-based puppetry examples. Our method is not only more general than existing works, since it is agnostic to the input person, but it also shows superior visual and lip-sync quality compared to photo-realistic audio- and video-driven reenactment techniques.
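The audio-to-expression stage can be illustrated with a small temporal network that maps per-frame audio features to expression coefficients of a latent 3D face model; convolving over time is what buys the temporal stability mentioned above. The feature dimensions (DeepSpeech-style inputs, a 76-dim expression space) are assumptions for illustration, not the paper's exact setup.

```python
# Sketch: audio features -> 3D-face-model expression coefficients over time.
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    def __init__(self, audio_dim: int = 29, expr_dim: int = 76):
        super().__init__()
        # Convolving over a short temporal window encourages temporally
        # stable expression coefficients (no frame-to-frame jitter).
        self.temporal = nn.Conv1d(audio_dim, 64, kernel_size=5, padding=2)
        self.head = nn.Linear(64, expr_dim)

    def forward(self, audio_feats):               # (batch, audio_dim, frames)
        h = torch.relu(self.temporal(audio_feats))
        return self.head(h.transpose(1, 2))       # (batch, frames, expr_dim)

feats = torch.randn(1, 29, 100)  # per-frame DeepSpeech-like features (assumed)
print(AudioToExpression()(feats).shape)  # torch.Size([1, 100, 76])
```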
Generating 3D faces using Convolutional Mesh Autoencoders
Learned 3D representations of human faces are useful for computer vision
problems such as 3D face tracking and reconstruction from images, as well as
graphics applications such as character generation and animation. Traditional
models learn a latent representation of a face using linear subspaces or
higher-order tensor generalizations. Due to this linearity, they cannot
capture extreme deformations and non-linear expressions. To address this, we
introduce a versatile model that learns a non-linear representation of a face
using spectral convolutions on a mesh surface. We introduce mesh sampling
operations that enable a hierarchical mesh representation that captures
non-linear variations in shape and expression at multiple scales within the
model. In a variational setting, our model samples diverse realistic 3D faces
from a multivariate Gaussian distribution. Our training data consists of 20,466
meshes of extreme expressions captured over 12 different subjects. Despite
limited training data, our trained model outperforms state-of-the-art face
models with 50% lower reconstruction error, while using 75% fewer parameters.
We also show that replacing the expression space of an existing
state-of-the-art face model with our autoencoder achieves a lower
reconstruction error. Our data, model, and code are available at
http://github.com/anuragranj/com
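The building block, spectral convolution on a mesh graph, follows the standard Chebyshev formulation: features are filtered by a K-term polynomial of the rescaled mesh Laplacian. This is the generic textbook operator, sketched under assumed shapes, not the authors' released implementation.

```python
# Sketch: Chebyshev spectral convolution over mesh vertices.
# L_hat is the rescaled mesh Laplacian; K is the polynomial order.
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, K: int = 6):
        super().__init__()
        self.weight = nn.Parameter(0.1 * torch.randn(K, in_ch, out_ch))

    def forward(self, x, L_hat):       # x: (V, in_ch), L_hat: (V, V)
        Tx_prev, Tx = x, L_hat @ x     # T0(x) = x, T1(x) = L_hat x
        out = Tx_prev @ self.weight[0] + Tx @ self.weight[1]
        for k in range(2, self.weight.shape[0]):
            Tx_next = 2 * (L_hat @ Tx) - Tx_prev  # T_k = 2 L T_{k-1} - T_{k-2}
            out = out + Tx_next @ self.weight[k]
            Tx_prev, Tx = Tx, Tx_next
        return out

V = 100
L_hat = 0.1 * torch.eye(V)             # stand-in for a real mesh Laplacian
y = ChebConv(3, 16)(torch.randn(V, 3), L_hat)
print(y.shape)  # torch.Size([100, 16])
```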
One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2
While recent research has progressively overcome the low-resolution
constraint of one-shot face video re-enactment with the help of StyleGAN's
high-fidelity portrait generation, these approaches rely on at least one of the
following: explicit 2D/3D priors, optical flow based warping as motion
descriptors, off-the-shelf encoders, etc., which constrain their performance
(e.g., inconsistent predictions, inability to capture fine facial details and
accessories, poor generalization, artifacts). We propose an end-to-end
framework for simultaneously supporting face attribute edits, facial motions
and deformations, and facial identity control for video generation. It employs
a hybrid latent space that encodes a given frame into a pair of latents: an
identity latent, $\mathcal{W}_{ID}$, and a facial deformation latent,
$\mathcal{S}_F$, which reside in the $W+$ and $SS$ spaces of StyleGAN2,
respectively. This combines the impressive editability-distortion trade-off of
$W+$ with the high disentanglement properties of $SS$. These hybrid latents
drive the StyleGAN2 generator to achieve high-fidelity face video re-enactment
at $1024^2$. Furthermore, the model supports the generation of
realistic re-enactment videos with other latent-based semantic edits (e.g.,
beard, age, make-up, etc.). Qualitative and quantitative analyses performed
against state-of-the-art methods demonstrate the superiority of the proposed
approach.
Comment: The project page is located at https://trevineoorloff.github.io/FaceVideoReenactment_HybridLatents.io
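The hybrid-latent encoding can be sketched as one backbone with two heads: a $W+$-shaped identity latent and a style-space-shaped deformation latent, fed jointly to a StyleGAN2-style generator. The backbone, head sizes, and the simple source/driving recombination below are illustrative assumptions, not the paper's networks.

```python
# Sketch: encode a frame into a W+-like identity latent and an SS-like
# deformation latent. Stand-in encoder; not the paper's architecture.
import torch
import torch.nn as nn

class HybridEncoder(nn.Module):
    def __init__(self, wplus_shape=(18, 512), ss_dim=9088):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.to_wplus = nn.Linear(32, wplus_shape[0] * wplus_shape[1])
        self.to_ss = nn.Linear(32, ss_dim)
        self.wplus_shape = wplus_shape

    def forward(self, frame):
        h = self.backbone(frame)
        w_id = self.to_wplus(h).view(-1, *self.wplus_shape)  # identity (W+-like)
        s_fd = self.to_ss(h)                                 # deformation (SS-like)
        return w_id, s_fd

enc = HybridEncoder()
w_src, _ = enc(torch.randn(1, 3, 256, 256))  # identity from the source frame
_, s_drv = enc(torch.randn(1, 3, 256, 256))  # deformation from the driving frame
# A StyleGAN2 generator would then be modulated jointly by w_src and s_drv.
```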
Single Source One Shot Reenactment using Weighted Motion from Paired Feature Points
Image reenactment is a task where the target object in the source image imitates the motion represented in the driving image. One of the most common reenactment tasks is face image animation. The major challenge in current face reenactment approaches is distinguishing facial motion from identity. For this reason, previous models struggle to produce high-quality animations when the driving and source identities differ (cross-person reenactment). We propose a new (face) reenactment model that learns shape-independent motion features in a self-supervised setup. The motion is represented using a set of paired feature points extracted from the source and driving images simultaneously. The model generalises to multiple reenactment tasks, including faces and non-face objects, using only a single source image. Extensive experiments show that the model faithfully transfers the driving motion to the source while keeping the source identity intact.
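The paired-feature-point representation admits a compact sketch: a shared detector yields matching keypoints on the source and driving frames, and confidence-weighted displacements summarize the motion to transfer. The weighting scheme and the reduction to a single aggregate translation are simplifications for illustration.

```python
# Sketch: motion as confidence-weighted displacements of paired feature points.
import torch

def weighted_motion(src_pts, drv_pts, weights):
    """src_pts, drv_pts: (N, 2) paired keypoints; weights: (N,) confidences."""
    disp = drv_pts - src_pts                    # per-pair motion vectors
    w = weights / weights.sum()                 # normalized confidences
    global_motion = (w[:, None] * disp).sum(0)  # weighted aggregate motion
    return disp, global_motion

src = torch.rand(10, 2)                # points detected on the source image
drv = src + 0.05 * torch.randn(10, 2)  # the same points on the driving frame
disp, g = weighted_motion(src, drv, torch.ones(10))
print(g)  # a full model would build a dense warp field from disp
```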
A Generalist FaceX via Learning Unified Facial Representation
This work presents the FaceX framework, a novel facial generalist model capable
of handling diverse facial tasks simultaneously. To achieve this goal, we
initially formulate a unified facial representation for a broad spectrum of
facial editing tasks, which macroscopically decomposes a face into fundamental
identity, intra-personal variation, and environmental factors. Based on this,
we introduce Facial Omni-Representation Decomposing (FORD) for seamless
manipulation of various facial components, microscopically decomposing the core
aspects of most facial editing tasks. Furthermore, by leveraging the prior of a
pretrained StableDiffusion (SD) to enhance generation quality and accelerate
training, we design Facial Omni-Representation Steering (FORS) to first
assemble unified facial representations and then effectively steer the SD-aware
generation process by the efficient Facial Representation Controller (FRC).
Our versatile FaceX achieves competitive
performance compared to elaborate task-specific models on popular facial
editing tasks. Full code and models will be available at
https://github.com/diffusion-facex/FaceX.
Comment: Project page: https://diffusion-facex.github.io
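The macroscopic decomposition behind FORD, one face split into identity, intra-personal variation, and environmental factors, can be sketched with three heads over a shared input; recombining factors across faces then expresses tasks like face swapping. The heads and dimensions below are illustrative assumptions, not FaceX's released code.

```python
# Sketch: decompose a face into identity / variation / environment factors.
import torch
import torch.nn as nn

def _head(dim: int = 128) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))

class FacialDecomposer(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.identity = _head(dim)      # who the person is
        self.variation = _head(dim)     # expression, pose, hairstyle, ...
        self.environment = _head(dim)   # lighting, background, ...

    def forward(self, face):
        return self.identity(face), self.variation(face), self.environment(face)

dec = FacialDecomposer()
id_a, var_a, env_a = dec(torch.randn(1, 3, 256, 256))
id_b, _, _ = dec(torch.randn(1, 3, 256, 256))
# Face swapping would recombine id_b with var_a and env_a before steering
# the diffusion-based generator with the assembled representation.
```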