Improving Facial Analysis and Performance Driven Animation through Disentangling Identity and Expression
We present techniques for improving performance driven facial animation,
emotion recognition, and facial key-point or landmark prediction using learned
identity invariant representations. Established approaches to these problems
can work well if sufficient examples and labels for a particular identity are
available and factors of variation are highly controlled. However, labeled
examples of facial expressions, emotions and key-points for new individuals are
difficult and costly to obtain. In this paper we improve the ability of
techniques to generalize to new and unseen individuals by explicitly modeling
previously seen variations related to identity and expression. We use a
weakly-supervised approach in which identity labels are used to learn the
different factors of variation linked to identity separately from factors
related to expression. We show how probabilistic modeling of these sources of
variation allows one to learn identity-invariant representations for
expressions which can then be used to identity-normalize various procedures for
facial expression analysis and animation control. We also show how to extend
the widely used techniques of active appearance models and constrained local
models through replacing the underlying point distribution models which are
typically constructed using principal component analysis with
identity-expression factorized representations. We present a wide variety of
experiments in which we consistently improve performance on emotion
recognition, markerless performance-driven facial animation and facial
key-point tracking.
Comment: to appear in Image and Vision Computing Journal (IMAVIS)
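For reference, the PCA-based point distribution model that the abstract above proposes to replace represents a landmark shape as a mean shape plus a linear combination of principal modes. The following is a minimal NumPy sketch of that baseline only, not the authors' identity-expression factorized model. The function names and the array layout (one flattened landmark shape per row) are assumptions made for illustration.

```python
import numpy as np

def fit_pdm(shapes, num_modes):
    """Fit a PCA point distribution model.

    shapes: array of shape (num_samples, num_coords), one flattened shape per row.
    Returns the mean shape and the first num_modes principal directions (rows).
    """
    mean_shape = shapes.mean(axis=0)
    centered = shapes - mean_shape
    # Rows of vt are the principal directions of the centered training shapes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_shape, vt[:num_modes]

def project(shape, mean_shape, modes):
    """Shape parameters b such that shape is approximately mean + modes.T @ b."""
    return modes @ (shape - mean_shape)

def reconstruct(params, mean_shape, modes):
    """Rebuild a shape from its parameters."""
    return mean_shape + modes.T @ params
```

When every mode is retained, projecting a shape and reconstructing it returns the input; keeping fewer modes gives a lower-dimensional approximation.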
ICface: Interpretable and Controllable Face Reenactment Using GANs
This paper presents a generic face animator that is able to control the pose
and expressions of a given face image. The animation is driven by human
interpretable control signals consisting of head pose angles and the Action
Unit (AU) values. The control information can be obtained from multiple sources
including external driving videos and manual controls. Due to the interpretable
nature of the driving signal, one can easily mix the information between
multiple sources (e.g. pose from one image and expression from another) and
apply selective post-production editing. The proposed face animator is
implemented as a two-stage neural network model that is learned in a
self-supervised manner using a large video collection. The proposed
Interpretable and Controllable face reenactment network (ICface) is compared to
the state-of-the-art neural network-based face animation techniques in multiple
tasks. The results indicate that ICface produces better visual quality while
being more versatile than most of the comparison methods. The introduced model
could provide a lightweight and easy to use tool for a multitude of advanced
image and video editing tasks.
Comment: Accepted in WACV-202
That's What I Said: Fully-Controllable Talking Face Generation
The goal of this paper is to synthesise talking faces with controllable
facial motions. To achieve this goal, we propose two key ideas. The first is to
establish a canonical space where every face has the same motion patterns but
different identities. The second is to navigate a multimodal motion space that
only represents motion-related features while eliminating identity information.
To disentangle identity and motion, we introduce an orthogonality constraint
between the two different latent spaces. From this, our method can generate
natural-looking talking faces with fully controllable facial attributes and
accurate lip synchronisation. Extensive experiments demonstrate that our method
achieves state-of-the-art results in terms of both visual quality and lip-sync
score. To the best of our knowledge, we are the first to develop a talking face
generation framework that can accurately manifest full target facial motions
including lip, head pose, and eye movements in the generated video without any
additional supervision beyond RGB video with audio.
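The abstract above does not state the exact form of its orthogonality constraint. One common way to encourage two latent codes to be orthogonal is to penalise their squared cosine similarity, sketched below with NumPy. The function name, the requirement that both codes share a dimension, and the choice of squared cosine are assumptions for illustration, not the authors' formulation.

```python
import numpy as np

def orthogonality_penalty(identity_codes, motion_codes, eps=1e-8):
    """Mean squared cosine similarity between paired latent codes.

    identity_codes, motion_codes: arrays of shape (batch, dim).
    The value is near 0 when each pair is orthogonal and near 1 when parallel.
    """
    id_norm = np.linalg.norm(identity_codes, axis=1, keepdims=True) + eps
    mo_norm = np.linalg.norm(motion_codes, axis=1, keepdims=True) + eps
    cosine = np.sum((identity_codes / id_norm) * (motion_codes / mo_norm), axis=1)
    return float(np.mean(cosine ** 2))
```

In a training setup, a term like this would be added to the main loss with a weighting factor, which is a hyperparameter not given in the abstract.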
FML: Face Model Learning from Videos
Monocular image-based 3D reconstruction of faces is a long-standing problem
in computer vision. Since image data is a 2D projection of a 3D face, the
resulting depth ambiguity makes the problem ill-posed. Most existing methods
rely on data-driven priors that are built from limited 3D face scans. In
contrast, we propose multi-frame video-based self-supervised training of a deep
network that (i) learns a face identity model both in shape and appearance
while (ii) jointly learning to reconstruct 3D faces. Our face model is learned
using only corpora of in-the-wild video clips collected from the Internet. This
virtually endless source of training data enables learning of a highly general
3D face model. In order to achieve this, we propose a novel multi-frame
consistency loss that ensures consistent shape and appearance across multiple
frames of a subject's face, thus minimizing depth ambiguity. At test time we
can use an arbitrary number of frames, so that we can perform both monocular as
well as multi-frame reconstruction.
Comment: CVPR 2019 (Oral). Video: https://www.youtube.com/watch?v=SG2BwxCw0lQ,
Project Page: https://gvv.mpi-inf.mpg.de/projects/FML19
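The idea behind a multi-frame consistency loss can be illustrated with a simple stand-in: if a network predicts an identity code for each frame of the same subject, those codes should agree. The NumPy sketch below penalises their spread around the per-clip mean. This is an assumed simplification for illustration only; the abstract does not specify the actual loss, and the function name and input layout are hypothetical.

```python
import numpy as np

def multiframe_consistency(per_frame_codes):
    """Average squared distance of per-frame identity codes from their mean.

    per_frame_codes: array of shape (num_frames, dim), all from one subject.
    Returns 0.0 when every frame yields the same code.
    """
    mean_code = per_frame_codes.mean(axis=0, keepdims=True)
    deviations = per_frame_codes - mean_code
    return float(np.mean(np.sum(deviations ** 2, axis=1)))
```

Because the function accepts any number of frames, a single frame trivially gives zero loss, which is consistent with a model that can also be run on one image at test time.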
4D Facial Expression Diffusion Model
Facial expression generation is one of the most challenging and long-sought
aspects of character animation, with many interesting applications. The task
has traditionally relied heavily on digital craftspersons and remains
underexplored. In this paper, we introduce a generative framework
for generating 3D facial expression sequences (i.e. 4D faces) that can be
conditioned on different inputs to animate an arbitrary 3D face mesh. It is
composed of two tasks: (1) Learning the generative model that is trained over a
set of 3D landmark sequences, and (2) Generating 3D mesh sequences of an input
facial mesh driven by the generated landmark sequences. The generative model is
based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved
remarkable success in generative tasks of other domains. While it can be
trained unconditionally, its reverse process can still be conditioned by
various condition signals. This allows us to efficiently develop several
downstream tasks involving various conditional generation, by using expression
labels, text, partial sequences, or simply a facial geometry. To obtain the
full mesh deformation, we then develop a landmark-guided encoder-decoder to
apply the geometrical deformation embedded in landmarks on a given facial mesh.
Experiments show that our model learns to generate realistic, high-quality
expressions from a dataset of relatively small size alone, improving over
the state-of-the-art methods. Videos and qualitative comparisons with other
methods can be found at https://github.com/ZOUKaifeng/4DFM. Code and models
will be made available upon acceptance.
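For readers unfamiliar with DDPMs, the forward (noising) process has a closed form: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, where alpha_bar_t is the cumulative product of (1 - beta). The NumPy sketch below applies it to a landmark sequence. The linear beta schedule, the step count, and the (frames, landmarks, 3) layout are assumptions for illustration and are not taken from the abstract.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from q(x_t | x_0) in one step.

    x0: clean landmark sequence, e.g. shape (frames, landmarks, 3).
    t: integer diffusion step index into alpha_bar.
    Returns the noised sequence and the Gaussian noise that was added.
    """
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise
```

A denoising network is typically trained to predict the returned noise from x_t and t; conditioning signals such as labels or text would enter at that stage.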
OSM-Net: One-to-Many One-shot Talking Head Generation with Spontaneous Head Motions
One-shot talking head generation has no explicit head movement reference,
so it is difficult to generate talking heads with head motions. Some existing
works only edit the mouth area and generate still talking heads, leading to
unrealistic talking head performance. Other works construct a one-to-one
mapping between the audio signal and head motion sequences, which introduces
ambiguous correspondences because people can move their heads differently when
speaking the same content. Such a mapping fails to model this diversity and
produces either nearly static or exaggerated head motions, which look
unnatural. One-shot talking head generation is therefore an ill-posed
one-to-many problem, since people present diverse head motions when speaking.
Based on this observation, we propose OSM-Net, a one-to-many one-shot talking
head generation network with natural head motions. OSM-Net constructs a motion
space that contains rich and varied clip-level head motion features. Each basis
of the space represents a
feature of meaningful head motion in a clip rather than just a frame, thus
providing more coherent and natural motion changes in talking heads. The
driving audio is mapped into the motion space, around which various motion
features can be sampled within a reasonable range to achieve the one-to-many
mapping. In addition, a landmark constraint and a time-window feature input
improve the accuracy of expression feature extraction and video generation.
Extensive experiments show that OSM-Net generates more natural and realistic
head motions under a reasonable one-to-many mapping paradigm than other methods.
Comment: Paper Under Review
AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction
In this work, we present a multimodal solution to the problem of 4D face
reconstruction from monocular videos. 3D face reconstruction from 2D images is
an under-constrained problem due to the ambiguity of depth. State-of-the-art
methods try to solve this problem by leveraging visual information from a
single image or video, whereas 3D mesh animation approaches rely more on audio.
However, in most cases (e.g. AR/VR applications), videos include both visual
and speech information. We propose AVFace, which incorporates both modalities and
accurately reconstructs the 4D facial and lip motion of any speaker, without
requiring any 3D ground truth for training. A coarse stage estimates the
per-frame parameters of a 3D morphable model, followed by a lip refinement, and
then a fine stage recovers facial geometric details. Due to the temporal audio
and video information captured by transformer-based modules, our method is
robust in cases when either modality is insufficient (e.g. face occlusions).
Extensive qualitative and quantitative evaluation demonstrates the superiority
of our method over the current state-of-the-art.
SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
The challenge of talking face generation from speech lies in aligning two
different modalities, audio and video, so that the mouth region
corresponds to input audio. Previous methods either exploit audio-visual
representation learning or leverage intermediate structural information such as
landmarks and 3D models. However, they struggle to synthesize fine details of
the lips varying at the phoneme level as they do not sufficiently provide
visual information of the lips at the video synthesis step. To overcome this
limitation, our work proposes Audio-Lip Memory that brings in visual
information of the mouth region corresponding to input audio and enforces
fine-grained audio-visual coherence. It stores lip motion features from
sequential ground truth images in the value memory and aligns them with
corresponding audio features so that they can be retrieved using audio input at
inference time. Therefore, using the retrieved lip motion features as visual
hints, it can easily correlate audio with visual dynamics in the synthesis
step. By analyzing the memory, we demonstrate that unique lip features are
stored in each memory slot at the phoneme level, capturing subtle lip motion
based on memory addressing. In addition, we introduce visual-visual
synchronization loss which can enhance lip-syncing performance when used along
with audio-visual synchronization loss in our model. Extensive experiments are
performed to verify that our method generates high-quality video with mouth
shapes that best align with the input audio, outperforming previous
state-of-the-art methods.
Comment: Accepted at AAAI 2022 (Oral)
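The retrieval step described in the abstract above, in which stored lip features are recalled from an audio input, resembles a key-value memory read. The NumPy sketch below shows one generic form of such a read: cosine-similarity addressing followed by a softmax-weighted sum over value slots. The function name, the temperature parameter, and the use of cosine similarity are assumptions for illustration and may differ from the authors' design.

```python
import numpy as np

def retrieve_lip_feature(audio_query, key_memory, value_memory,
                         temperature=0.1, eps=1e-8):
    """Read a lip feature from a key-value memory using an audio query.

    audio_query: shape (key_dim,).
    key_memory: shape (num_slots, key_dim), audio-side addressing keys.
    value_memory: shape (num_slots, value_dim), stored lip motion features.
    Returns the retrieved feature and the addressing weights over slots.
    """
    query = audio_query / (np.linalg.norm(audio_query) + eps)
    keys = key_memory / (np.linalg.norm(key_memory, axis=1, keepdims=True) + eps)
    scores = keys @ query / temperature
    scores = scores - scores.max()  # shift for a numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ value_memory, weights
```

A lower temperature makes the addressing sharper, so the read approaches a lookup of the single best-matching slot.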