26 research outputs found
ShapeEditer: a StyleGAN Encoder for Face Swapping
In this paper, we propose a novel encoder, called ShapeEditor, for
high-resolution, realistic and high-fidelity face swapping. First, to ensure
sufficient clarity and authenticity, our key idea is to use a pretrained,
high-quality face image generator, StyleGAN, as the backbone. Second, we design
ShapeEditor as a two-step encoder so that the swapped face integrates the
identity and attributes of the input faces: in the first step, we extract the
identity vector of the source image and the attribute vector of the target
image; in the second step, we map the concatenation of the identity and
attribute vectors into the latent space of StyleGAN. To learn this mapping, we
propose a set of self-supervised loss functions with which the training data do
not need to be labeled manually. Extensive experiments on the test dataset show
that our results not only have a clear advantage in clarity and authenticity
over other state-of-the-art methods, but also reflect a sufficient integration
of identity and attributes.
Comment: 13 pages, 3 figures
DeepWrinkles: Accurate and Realistic Clothing Modeling
We present a novel method to generate accurate and realistic clothing
deformation from real data capture. Previous methods for realistic cloth
modeling mainly rely on intensive computation of physics-based simulation (with
numerous heuristic parameters), while models reconstructed from visual
observations typically suffer from a lack of geometric detail. Here, we propose
an original framework consisting of two modules that work jointly to represent
global shape deformation as well as surface details with high fidelity. Global
shape deformations are recovered from a subspace model learned from 3D data of
clothed people in motion, while high frequency details are added to normal maps
created using a conditional Generative Adversarial Network whose architecture
is designed to enforce realism and temporal consistency. This leads to
unprecedented high-quality rendering of clothing deformation sequences, where
fine wrinkles from (real) high resolution observations can be recovered. In
addition, as the model is learned independently from body shape and pose, the
framework is suitable for applications that require retargeting (e.g., body
animation). Our experiments show original high quality results with a flexible
model. We claim an entirely data-driven approach to realistic cloth wrinkle
generation is possible.
Comment: 18 pages, 12 figures, 15th European Conference on Computer Vision (ECCV) 2018, Oral Presentation
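As a rough illustration of the first module (not the paper's implementation), the snippet below approximates the learned deformation subspace with plain PCA over registered mesh vertices; the numbers of frames, vertices, and components are arbitrary assumptions, and the conditional GAN that adds fine wrinkles on normal maps is only referenced in a comment.
```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_vertices = 200, 2000
# Flattened (x, y, z) vertex positions of a registered garment mesh per frame.
meshes = rng.normal(size=(n_frames, n_vertices * 3))

mean = meshes.mean(axis=0)
# PCA via SVD: principal directions of global garment deformation.
U, S, Vt = np.linalg.svd(meshes - mean, full_matrices=False)
k = 50                      # size of the deformation subspace
basis = Vt[:k]              # (k, n_vertices * 3)

def encode(mesh_flat):
    """Project a mesh into the k-dimensional deformation subspace."""
    return (mesh_flat - mean) @ basis.T

def decode(coeffs):
    """Recover the smooth global deformation; high-frequency wrinkles would be
    added afterwards on normal maps by the conditional GAN module."""
    return mean + coeffs @ basis

coeffs = encode(meshes[0])
recon = decode(coeffs)
print(coeffs.shape, np.abs(recon - meshes[0]).mean())
```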
Dynamics Transfer GAN: Generating Video by Transferring Arbitrary Temporal Dynamics from a Source Video to a Single Target Image
In this paper, we propose Dynamics Transfer GAN, a new method for generating
video sequences based on generative adversarial learning. The spatial
constructs of a generated video sequence are acquired from the target image.
The dynamics of the generated video sequence are imported from a source video
sequence, with arbitrary motion, and imposed onto the target image. To preserve
the spatial construct of the target image, the appearance of the source video
sequence is suppressed and only the dynamics are obtained before being imposed
onto the target image. This is achieved using the proposed appearance-suppressed
dynamics feature. Moreover, the spatial and temporal consistency
of the generated video sequence is verified via two discriminator networks.
One discriminator validates the appearance fidelity of the generated frames,
while the other validates the dynamic consistency of the generated video
sequence. Experiments have been conducted to verify the quality of the video
sequences generated by the proposed method. The results verified that Dynamics
Transfer GAN successfully transferred arbitrary dynamics of the source video
sequence onto a target image when generating the output video sequence. The
experimental results also showed that Dynamics Transfer GAN maintained the
spatial constructs (appearance) of the target image while generating spatially
and temporally consistent video sequences.
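A minimal sketch of the two-discriminator setup mentioned above follows; it is illustrative only (the actual architectures, resolutions, and clip lengths are assumptions): one network scores single frames for appearance, the other scores whole clips with 3D convolutions for temporal consistency.
```python
import torch
import torch.nn as nn

class FrameDiscriminator(nn.Module):
    """Scores the realism of a single generated frame (spatial consistency)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
        )
    def forward(self, frame):            # (B, 3, H, W)
        return self.net(frame)

class VideoDiscriminator(nn.Module):
    """Scores a short clip with 3D convolutions (temporal consistency)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, (3, 4, 4), stride=(1, 2, 2), padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(32, 64, (3, 4, 4), stride=(1, 2, 2), padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, 1),
        )
    def forward(self, clip):             # (B, 3, T, H, W)
        return self.net(clip)

clip = torch.randn(2, 3, 8, 64, 64)      # a generated 8-frame clip
frame_score = FrameDiscriminator()(clip[:, :, 0])
clip_score = VideoDiscriminator()(clip)
print(frame_score.shape, clip_score.shape)  # torch.Size([2, 1]) twice
```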
X2Face: A network for controlling face generation by using images, audio, and pose codes
The objective of this paper is a neural network model that controls the pose
and expression of a given face, using another face or modality (e.g. audio).
This model can then be used for lightweight, sophisticated video and image
editing.
We make the following three contributions. First, we introduce a network,
X2Face, that can control a source face (specified by one or more frames) using
another face in a driving frame to produce a generated frame with the identity
of the source frame but the pose and expression of the face in the driving
frame. Second, we propose a method for training the network in a fully
self-supervised manner using a large collection of video data. Third, we show that the
generation process can be driven by other modalities, such as audio or pose
codes, without any further training of the network.
The generation results for driving a face with another face are compared to
state-of-the-art self-supervised/supervised methods. We show that our approach
is more robust than other methods, as it makes fewer assumptions about the
input data. We also show examples of using our framework for video face
editing.
Comment: To appear in ECCV 2018. Accompanying video: http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/x2face.htm
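The snippet below is an assumption-laden simplification (not the released X2Face code) of the core idea: one network embeds the source frame into an "embedded face", and another network predicts, from the driving frame, a per-pixel sampling grid that warps the embedded face into the output frame; the tiny encoder-decoders and resolutions are placeholders.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet(nn.Module):
    """Stand-in encoder-decoder used for both sub-networks."""
    def __init__(self, out_channels):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                                 nn.Conv2d(32, out_channels, 3, padding=1))
    def forward(self, x):
        return self.dec(self.enc(x))

embedding_net = TinyUNet(out_channels=3)   # source frame -> embedded face
driving_net = TinyUNet(out_channels=2)     # driving frame -> sampling field (x, y)

source = torch.randn(1, 3, 128, 128)
driving = torch.randn(1, 3, 128, 128)

embedded_face = embedding_net(source)
flow = driving_net(driving).permute(0, 2, 3, 1)           # (B, H, W, 2) in [-1, 1] coords
generated = F.grid_sample(embedded_face, torch.tanh(flow),
                          align_corners=False)             # identity from source, pose from driving
print(generated.shape)  # torch.Size([1, 3, 128, 128])
```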
Synthesized Texture Quality Assessment via Multi-scale Spatial and Statistical Texture Attributes of Image and Gradient Magnitude Coefficients
Perceptual quality assessment for synthesized textures is a challenging task.
In this paper, we propose a training-free reduced-reference (RR) objective
quality assessment method that quantifies the perceived quality of synthesized
textures. The proposed reduced-reference synthesized texture quality assessment
metric is based on measuring the spatial and statistical attributes of the
texture image using both image- and gradient-based wavelet coefficients at
multiple scales. Performance evaluations on two synthesized texture databases
demonstrate that our proposed RR synthesized texture quality metric
significantly outperforms both full-reference and RR state-of-the-art quality
metrics in predicting the perceived visual quality of the synthesized textures.
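A rough sketch of the general recipe (not the proposed metric itself) is shown below: statistics of multi-scale wavelet subbands are computed on both the image and its gradient magnitude, and the reduced feature vectors of the reference and synthesized textures are compared. The choice of wavelet, decomposition levels, and statistics here is an arbitrary assumption.
```python
import numpy as np
import pywt

def subband_stats(img, wavelet="db2", levels=3):
    """Mean absolute value and standard deviation of each detail subband per scale."""
    feats = []
    coeffs = pywt.wavedec2(img, wavelet, level=levels)
    for detail in coeffs[1:]:                 # skip the approximation band
        for band in detail:                   # horizontal, vertical, diagonal
            feats += [np.abs(band).mean(), band.std()]
    return np.array(feats)

def rr_features(img):
    gy, gx = np.gradient(img.astype(np.float64))
    grad_mag = np.hypot(gx, gy)
    # Spatial/statistical attributes from both the image and its gradient magnitude.
    return np.concatenate([subband_stats(img), subband_stats(grad_mag)])

def rr_distance(reference, synthesized):
    """Smaller distance = synthesized texture statistics closer to the reference."""
    return float(np.linalg.norm(rr_features(reference) - rr_features(synthesized)))

rng = np.random.default_rng(0)
ref = rng.random((128, 128))
syn = ref + 0.1 * rng.random((128, 128))
print(rr_distance(ref, syn), rr_distance(ref, rng.random((128, 128))))
```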
Task-agnostic Temporally Consistent Facial Video Editing
Recent research has witnessed advances in facial image editing tasks. For
video editing, however, previous methods either simply apply transformations
frame by frame or utilize multiple frames in a concatenated or iterative
fashion, which leads to noticeable visual flickers. In addition, these methods
are confined to dealing with one specific task at a time without any
extensibility. In this paper, we propose a task-agnostic temporally consistent
facial video editing framework. Based on a 3D reconstruction model, our
framework is designed to handle several editing tasks in a more unified and
disentangled manner. The core design includes a dynamic training sample
selection mechanism and a novel 3D temporal loss constraint that fully exploits
both image and video datasets and enforces temporal consistency. Compared with
the state-of-the-art facial image editing methods, our framework generates
video portraits that are more photo-realistic and temporally smooth.
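The paper's 3D temporal loss is more involved than what fits here, but one plausible, heavily simplified form of temporal regularization is sketched below purely for illustration: penalizing abrupt frame-to-frame changes in predicted 3D face coefficients and rendered frames. The tensor shapes and the 0.1 weight are assumptions.
```python
import torch

def temporal_smoothness(seq):
    """L2 penalty on first-order differences along the time axis (dim 1)."""
    return (seq[:, 1:] - seq[:, :-1]).pow(2).mean()

# Hypothetical per-frame outputs of an editing network for an 8-frame clip.
coeffs_3d = torch.randn(2, 8, 257, requires_grad=True)   # e.g. 3DMM-style coefficients
frames = torch.randn(2, 8, 3, 128, 128, requires_grad=True)

loss = temporal_smoothness(coeffs_3d) + 0.1 * temporal_smoothness(frames)
loss.backward()
print(float(loss))
```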
Harmonizing Maximum Likelihood with GANs for Multimodal Conditional Generation
Recent advances in conditional image generation tasks, such as image-to-image
translation and image inpainting, are largely attributed to the success of
conditional GAN models, which are often optimized by the joint use of the GAN
loss with the reconstruction loss. However, we reveal that this training recipe
shared by almost all existing methods causes one critical side effect: lack of
diversity in output samples. In order to accomplish both training stability and
multimodal output generation, we propose novel training schemes with a new set
of losses named moment reconstruction losses that simply replace the
reconstruction loss. We show that our approach is applicable to any conditional
generation tasks by performing thorough experiments on image-to-image
translation, super-resolution, and image inpainting using the Cityscapes and CelebA
datasets. Quantitative evaluations also confirm that our methods achieve great
diversity in outputs while retaining or even improving the visual fidelity of
generated samples.
Comment: Accepted as a conference paper at ICLR 2019
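Below is a simplified sketch of the idea as read from the abstract (not the paper's exact losses): rather than forcing every output to match the target pixel-wise, which collapses diversity, several samples are drawn per condition and only their moments are fit to the target. The Gaussian form and the sample count are assumptions.
```python
import torch

def moment_reconstruction_loss(samples, target, eps=1e-6):
    """samples: (K, B, C, H, W) generator outputs for the same condition,
    target: (B, C, H, W) ground truth."""
    mean = samples.mean(dim=0)
    var = samples.var(dim=0, unbiased=False) + eps
    # Negative Gaussian log-likelihood of the target under the sample moments:
    # individual samples remain free to vary, so multimodality is not penalized.
    return 0.5 * ((target - mean).pow(2) / var + var.log()).mean()

K, B = 4, 2
samples = torch.randn(K, B, 3, 64, 64, requires_grad=True)
target = torch.randn(B, 3, 64, 64)
loss = moment_reconstruction_loss(samples, target)
loss.backward()
print(float(loss))
```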
Towards Disentangled Representations for Human Retargeting by Multi-view Learning
We study the problem of learning disentangled representations for data across
multiple domains and its applications in human retargeting. Our goal is to map
an input image to an identity-invariant latent representation that captures
intrinsic factors such as expressions and poses. To this end, we present a
novel multi-view learning approach that leverages various data sources such as
images, keypoints, and poses. Our model consists of multiple id-conditioned
VAEs for different views of the data. During training, we encourage the latent
embeddings to be consistent across these views. Our observation is that
auxiliary data like keypoints and poses contain critical, id-agnostic semantic
information, and it is easier to train a disentangling CVAE on these simpler
views to separate such semantics from other id-specific attributes. We show
that training multi-view CVAEs and encouraging latent consistency guides the
image encoding to preserve the semantics of expressions and poses, leading to
improved disentangled representations and better human retargeting results.
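A minimal structural sketch of the training signal described above follows (architectures, dimensions, and the toy encoders are assumptions, and the full objective would also include per-view reconstruction and KL terms): id-conditioned encoders for two views, image and keypoints, whose latent codes are pushed to agree.
```python
import torch
import torch.nn as nn

class IdConditionedEncoder(nn.Module):
    def __init__(self, in_dim, id_dim=16, z_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim + id_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * z_dim))   # mean and log-variance
    def forward(self, x, id_embedding):
        mu, logvar = self.net(torch.cat([x, id_embedding], dim=1)).chunk(2, dim=1)
        return mu, logvar

image_enc = IdConditionedEncoder(in_dim=3 * 64 * 64)
kpt_enc = IdConditionedEncoder(in_dim=68 * 2)       # e.g. 68 facial keypoints

img = torch.randn(4, 3 * 64 * 64)
kpts = torch.randn(4, 68 * 2)
id_emb = torch.randn(4, 16)

mu_img, _ = image_enc(img, id_emb)
mu_kpt, _ = kpt_enc(kpts, id_emb)
# Latent-consistency term: the id-agnostic code should not depend on the view.
consistency_loss = (mu_img - mu_kpt).pow(2).mean()
print(float(consistency_loss))
```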
Head2Head: Video-based Neural Head Synthesis
In this paper, we propose a novel machine learning architecture for facial
reenactment. In particular, contrary to the model-based approaches or recent
frame-based methods that use Deep Convolutional Neural Networks (DCNNs) to
generate individual frames, we propose a novel method that (a) exploits the
special structure of facial motion (paying particular attention to mouth
motion) and (b) enforces temporal consistency. We demonstrate that the proposed
method can transfer facial expressions, pose and gaze of a source actor to a
target video in a photo-realistic fashion more accurately than state-of-the-art
methods.
Comment: To be published in the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)
Personalized Face Modeling for Improved Face Reconstruction and Motion Retargeting
Traditional methods for image-based 3D face reconstruction and facial motion
retargeting fit a 3D morphable model (3DMM) to the face, which has limited
modeling capacity and fails to generalize well to in-the-wild data. The use of
deformation transfer or a multilinear tensor as a personalized 3DMM for
blendshape interpolation does not address the fact that facial expressions
result in different local and global skin deformations in different persons.
Moreover, existing methods learn a single albedo per user, which is not enough
to capture the expression-specific skin reflectance variations. We propose an
end-to-end framework that jointly learns a personalized face model per user and
per-frame facial motion parameters from a large corpus of in-the-wild videos of
user expressions. Specifically, we learn user-specific expression blendshapes
and dynamic (expression-specific) albedo maps by predicting personalized
corrections on top of a 3DMM prior. We introduce novel constraints to ensure
that the corrected blendshapes retain their semantic meanings and the
reconstructed geometry is disentangled from the albedo. Experimental results
show that our personalization accurately captures fine-grained facial dynamics
in a wide range of conditions and efficiently decouples the learned face model
from facial motion, resulting in more accurate face reconstruction and facial
motion retargeting compared to state-of-the-art methods.
Comment: ECCV 2020 (spotlight), webpage: https://homes.cs.washington.edu/~bindita/personalizedfacemodeling.htm
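The structural sketch below illustrates the decomposition described in the abstract (all tensors, dimensions, and magnitudes are illustrative assumptions, not the paper's implementation): person-specific corrections are applied on top of a 3DMM prior, so geometry uses corrected expression blendshapes and the albedo receives an expression-dependent (dynamic) correction.
```python
import torch

n_vertices, n_exp = 5000, 51

# Generic 3DMM prior (would normally come from a morphable face model).
neutral_prior = torch.randn(n_vertices, 3)
blendshapes_prior = torch.randn(n_exp, n_vertices, 3)
albedo_prior = torch.rand(n_vertices, 3)

# Learned, user-specific corrections (outputs of a personalization network).
neutral_corr = 0.01 * torch.randn(n_vertices, 3)
blendshape_corr = 0.01 * torch.randn(n_exp, n_vertices, 3)

def reconstruct(exp_weights, albedo_corr):
    """Per-frame geometry and albedo for one user.
    exp_weights: (n_exp,) facial motion parameters for this frame.
    albedo_corr: (n_vertices, 3) expression-specific (dynamic) albedo correction."""
    blendshapes = blendshapes_prior + blendshape_corr          # personalized blendshapes
    geometry = (neutral_prior + neutral_corr
                + torch.einsum("e,evc->vc", exp_weights, blendshapes))
    albedo = (albedo_prior + albedo_corr).clamp(0, 1)          # dynamic albedo
    return geometry, albedo

geometry, albedo = reconstruct(torch.rand(n_exp), 0.01 * torch.randn(n_vertices, 3))
print(geometry.shape, albedo.shape)   # torch.Size([5000, 3]) torch.Size([5000, 3])
```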