G3AN: Disentangling Appearance and Motion for Video Generation
Creating realistic human videos entails the challenge of simultaneously
generating both appearance and motion. To tackle this challenge, we introduce
G3AN, a novel spatio-temporal generative model that seeks to capture the
distribution of high-dimensional video data and to model appearance and motion
in a disentangled manner. The latter is achieved by
decomposing appearance and motion in a three-stream Generator, where the main
stream aims to model spatio-temporal consistency, whereas the two auxiliary
streams augment the main stream with multi-scale appearance and motion
features, respectively. An extensive quantitative and qualitative analysis
shows that our model systematically and significantly outperforms
state-of-the-art methods on the facial expression datasets MUG and UvA-NEMO,
as well as on the human action datasets Weizmann and UCF101. Additional
analysis
on the learned latent representations confirms the successful decomposition of
appearance and motion. Source code and pre-trained models are publicly
available.
Comment: CVPR 2020, project link https://wyhsirius.github.io/G3AN
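To make the three-stream decomposition above concrete, here is a minimal
PyTorch sketch: an appearance stream broadcast over time, a motion stream
broadcast over space, and a main stream that fuses the two. All module names,
layer choices, and tensor sizes are illustrative assumptions, not the authors'
G3AN architecture.

```python
import torch
import torch.nn as nn

class ThreeStreamGenerator(nn.Module):
    """Toy three-stream generator: a main spatio-temporal stream fused from
    an appearance (spatial) stream and a motion (temporal) stream."""
    def __init__(self, za_dim=64, zm_dim=64, ch=32, frames=8, size=16):
        super().__init__()
        self.frames, self.size, self.ch = frames, size, ch
        # appearance stream: spatial features from an appearance code z_a
        self.appearance = nn.Linear(za_dim, ch * size * size)
        # motion stream: per-frame temporal features from motion codes z_m
        self.motion = nn.GRU(zm_dim, ch, batch_first=True)
        # main stream fuses both, enforcing spatio-temporal consistency
        self.main = nn.Conv3d(2 * ch, ch, kernel_size=3, padding=1)
        self.to_rgb = nn.Conv3d(ch, 3, kernel_size=3, padding=1)

    def forward(self, z_a, z_m_seq):
        b = z_a.shape[0]
        # appearance features broadcast over time: (B, C, T, H, W)
        app = self.appearance(z_a).view(b, self.ch, 1, self.size, self.size)
        app = app.expand(-1, -1, self.frames, -1, -1)
        # motion features broadcast over space: (B, C, T, H, W)
        mot, _ = self.motion(z_m_seq)                   # (B, T, C)
        mot = mot.permute(0, 2, 1)[..., None, None]
        mot = mot.expand(-1, -1, -1, self.size, self.size)
        # main stream consumes the concatenated auxiliary streams
        h = torch.relu(self.main(torch.cat([app, mot], dim=1)))
        return torch.tanh(self.to_rgb(h))               # (B, 3, T, H, W)

g = ThreeStreamGenerator()
video = g(torch.randn(2, 64), torch.randn(2, 8, 64))    # one z_a, eight z_m
print(video.shape)  # torch.Size([2, 3, 8, 16, 16])
```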
Reinforced Disentanglement for Face Swapping without Skip Connection
SOTA face swap models still suffer from either the target identity (i.e., face
shape) leaking into the final result or the target non-identity attributes
(e.g., background, hair) failing to be fully preserved. We show that this
insufficient disentanglement is caused by two flawed designs commonly adopted
in prior models: (1) relying on a single compressed encoder to represent both
the semantic-level non-identity facial attributes (e.g., pose) and the
pixel-level non-facial region details, two requirements that are contradictory
to satisfy with one representation; (2) relying heavily on long skip
connections between the encoder and the final generator, which leak a certain
amount of target face identity into the result. To fix these issues, we
introduce a new
face swap framework called 'WSC-swap' that gets rid of skip connections and
uses two target encoders to respectively capture the pixel-level non-facial
region attributes and the semantic non-identity attributes in the face region.
To further reinforce the disentanglement learning for the target encoders, we
employ both an identity removal loss via adversarial training (i.e., a GAN)
and a non-identity preservation loss via prior 3DMM models like [11]. Extensive
experiments on both FaceForensics++ and CelebA-HQ show that our results
significantly outperform previous works on a rich set of metrics, including one
novel metric for measuring identity consistency that was completely neglected
before.
Comment: Accepted by ICCV 2023
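The following is a hedged sketch of the two-encoder design and the two
disentanglement losses described above; every module, loss formulation, and
name here is an assumption for illustration, not the WSC-swap implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoEncoderSwap(nn.Module):
    """Toy two-encoder swap network without encoder-generator skip connections."""
    def __init__(self, dim=128):
        super().__init__()
        # encoder 1: pixel-level non-facial region details (background, hair)
        self.enc_pixel = nn.Sequential(nn.Conv2d(3, dim, 4, 2, 1), nn.ReLU())
        # encoder 2: semantic-level non-identity attributes (pose, expression)
        self.enc_sem = nn.Sequential(nn.Conv2d(3, dim, 4, 2, 1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # the generator sees only the fused codes -- no skip connections
        self.gen = nn.ConvTranspose2d(dim, 3, 4, 2, 1)

    def forward(self, target, source_id):
        f_pixel = self.enc_pixel(target)              # (B, C, H/2, W/2)
        f_sem = self.enc_sem(target)                  # (B, C)
        fused = f_pixel + (f_sem + source_id)[..., None, None]
        return torch.tanh(self.gen(fused))

def disentangle_losses(id_logits, pose_pred, pose_gt):
    # identity removal: push an identity classifier's prediction on the
    # semantic code toward uniform (one simple adversarial formulation)
    id_removal = -F.log_softmax(id_logits, dim=1).mean()
    # non-identity preservation: match pose coefficients from a 3DMM fit
    preservation = F.l1_loss(pose_pred, pose_gt)
    return id_removal + preservation

m = TwoEncoderSwap()
out = m(torch.randn(2, 3, 64, 64), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 3, 64, 64])
```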
Video Prediction by Efficient Transformers
Video prediction is a challenging computer vision task that has a wide range
of applications. In this work, we present a new family of Transformer-based
models for video prediction. Firstly, an efficient local spatial-temporal
separation attention mechanism is proposed to reduce the complexity of standard
Transformers. Then, a full autoregressive model, a partial autoregressive
model, and a non-autoregressive model are developed on top of the new
efficient Transformer. The partial autoregressive model performs similarly to
the full autoregressive model but with faster inference. The non-autoregressive
model accelerates inference further and mitigates the quality degradation of
its autoregressive counterparts, but it requires additional parameters and an
extra loss function for learning. Using the same attention mechanism, we
conduct a comprehensive study comparing the three proposed video prediction
variants. Experiments show that the proposed
video prediction models are competitive with more complex state-of-the-art
convolutional-LSTM based models. The source code is available at
https://github.com/XiYe20/VPTR.
Comment: Accepted by Image and Vision Computing. arXiv admin note: text
overlap with arXiv:2203.1583
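As an illustration of spatial-temporal separation attention, the sketch below
replaces full attention over all T*H*W tokens with spatial attention within
each frame followed by temporal attention at each spatial location, reducing
cost from O((T*HW)^2) to roughly O(T*(HW)^2 + HW*T^2). This is the generic
factorized form, not necessarily the exact VPTR block (its "local" windowing
is omitted), and all names are assumptions.

```python
import torch
import torch.nn as nn

class SeparatedSTAttention(nn.Module):
    """Factorized spatial-then-temporal attention over a (B, T, N, C) tensor."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (B, T, N, C), N = H*W
        b, t, n, c = x.shape
        # spatial attention: each frame attends over its own N tokens
        s = x.reshape(b * t, n, c)
        s, _ = self.spatial(s, s, s)
        x = x + s.reshape(b, t, n, c)
        # temporal attention: each spatial location attends over its T frames
        p = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        p, _ = self.temporal(p, p, p)
        return x + p.reshape(b, n, t, c).permute(0, 2, 1, 3)

attn = SeparatedSTAttention()
out = attn(torch.randn(2, 8, 49, 64))   # 8 frames of 7x7 tokens
print(out.shape)                        # torch.Size([2, 8, 49, 64])
```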
LatentKeypointGAN: Controlling GANs via Latent Keypoints
Generative adversarial networks (GANs) have attained photo-realistic quality
in image generation. However, how to best control the image content remains an
open challenge. We introduce LatentKeypointGAN, a two-stage GAN which is
trained end-to-end on the classical GAN objective with internal conditioning on
a set of spatial keypoints. These keypoints and their associated appearance
embeddings respectively control the position and style of the generated
objects and their parts. A major difficulty that we address with suitable
network architectures and training schemes is disentangling the image into
spatial and appearance factors without domain knowledge or supervision
signals. We
demonstrate that LatentKeypointGAN provides an interpretable latent space that
can be used to re-arrange the generated images by re-positioning and exchanging
keypoint embeddings, such as generating portraits by combining the eyes, nose,
and mouth from different images. In addition, the explicit generation of
keypoints and matching images enables a new, GAN-based method for unsupervised
keypoint detection.
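To illustrate how keypoints with appearance embeddings can condition a
generator, here is a small sketch that turns keypoint positions into Gaussian
heatmaps and weights each heatmap by that keypoint's embedding: re-positioning
a keypoint moves a part, while exchanging embeddings swaps its style. Function
and parameter names are assumptions, not the LatentKeypointGAN API.

```python
import torch

def keypoint_feature_map(keypoints, embeddings, size=32, sigma=0.1):
    """keypoints: (B, K, 2) in [-1, 1]; embeddings: (B, K, C) -> (B, C, size, size)."""
    ys = torch.linspace(-1.0, 1.0, size)
    gy, gx = torch.meshgrid(ys, ys, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1)                      # (H, W, 2)
    # squared distance from every pixel to every keypoint
    d2 = ((grid[None, None] - keypoints[:, :, None, None]) ** 2).sum(-1)
    heat = torch.exp(-d2 / (2 * sigma ** 2))                  # (B, K, H, W)
    # weight each keypoint's heatmap by its appearance embedding, sum over K
    return torch.einsum("bkhw,bkc->bchw", heat, embeddings)

# feed this map into a generator's early layers as the spatial conditioning
feat = keypoint_feature_map(torch.rand(2, 5, 2) * 2 - 1, torch.randn(2, 5, 16))
print(feat.shape)  # torch.Size([2, 16, 32, 32])
```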