Disentangled Cycle Consistency for Highly-realistic Virtual Try-On
Image virtual try-on replaces the clothes on a person image with a desired
in-shop clothes image. It is challenging because the person and the in-shop
clothes are unpaired. Existing methods formulate virtual try-on as either
in-painting or cycle consistency. Both of these two formulations encourage the
generation networks to reconstruct the input image in a self-supervised manner.
However, existing methods do not differentiate clothing and non-clothing
regions. Such straightforward generation impedes virtual try-on quality because
the image contents are heavily coupled. In this paper, we propose a Disentangled
Cycle-consistency Try-On Network (DCTON). The DCTON is able to produce
highly-realistic try-on images by disentangling important components of virtual
try-on including clothes warping, skin synthesis, and image composition. To
this end, DCTON can be naturally trained in a self-supervised manner following
cycle consistency learning. Extensive experiments on challenging benchmarks
show that DCTON outperforms state-of-the-art approaches favorably.
Comment: Accepted by CVPR 2021
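The self-supervised cycle-consistency objective described in this abstract can be illustrated with a minimal sketch. The `try_on` callable and image sizes below are hypothetical placeholders, not DCTON's actual warping/skin/composition modules; the sketch only shows the generic cycle-reconstruction idea.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(try_on, person, clothes_a, clothes_b):
    """Transfer clothes_b onto the person, then transfer the original
    clothes_a back and require reconstruction of the input image."""
    fake = try_on(person, clothes_b)    # person re-dressed in clothes_b
    recon = try_on(fake, clothes_a)     # cycle back to the original outfit
    return F.l1_loss(recon, person)

# toy stand-in generator so the sketch runs end to end
try_on = lambda person, clothes: 0.5 * (person + clothes)
person = torch.randn(2, 3, 256, 192)
clothes_a = torch.randn(2, 3, 256, 192)
clothes_b = torch.randn(2, 3, 256, 192)
loss = cycle_consistency_loss(try_on, person, clothes_a, clothes_b)
```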
Person-in-Context Synthesis with Compositional Structural Space
Despite significant progress, controlled generation of complex images with
interacting people remains difficult. Existing layout generation methods fall
short of synthesizing realistic person instances, while pose-guided generation
approaches focus on a single person and assume simple or known backgrounds. To
tackle these limitations, we propose a new problem, \textbf{Persons in Context
Synthesis}, which aims to synthesize diverse person instance(s) in consistent
contexts, with user control over both. The context is specified by the bounding
box object layout, which lacks shape information, while the pose of the person(s)
is given by sparsely annotated keypoints. To handle the stark difference in input
structures, we propose two separate neural branches to attentively composite
the respective (context/person) inputs into a shared ``compositional structural
space'', which encodes shape, location and appearance information for both
context and person structures in a disentangled manner. This structural space
is then decoded to the image space using a multi-level feature modulation
strategy, and learned in a self-supervised manner from image collections and
their corresponding inputs. Extensive experiments on two large-scale datasets
(COCO-Stuff \cite{caesar2018cvpr} and Visual Genome \cite{krishna2017visual})
demonstrate that our framework outperforms state-of-the-art methods w.r.t.
synthesis quality.
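The "multi-level feature modulation strategy" mentioned above is, at its core, feature-wise scaling and shifting conditioned on a structural map. Below is a minimal, assumed sketch of one such modulation level in PyTorch; the layer names, channel counts, and nearest-neighbour resizing are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedBlock(nn.Module):
    """One decoder level: image features are scaled and shifted by
    parameters predicted from the shared structural space (a sketch,
    not the authors' exact layer)."""
    def __init__(self, feat_ch, struct_ch):
        super().__init__()
        self.gamma = nn.Conv2d(struct_ch, feat_ch, 3, padding=1)
        self.beta = nn.Conv2d(struct_ch, feat_ch, 3, padding=1)

    def forward(self, feat, struct):
        # resize the structural map to the current feature resolution
        struct = F.interpolate(struct, size=feat.shape[-2:], mode='nearest')
        return feat * (1 + self.gamma(struct)) + self.beta(struct)

block = ModulatedBlock(feat_ch=64, struct_ch=16)
feat = torch.randn(1, 64, 32, 32)
struct = torch.randn(1, 16, 64, 64)     # shared compositional structural space
out = block(feat, struct)               # (1, 64, 32, 32)
```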
Face Identity Disentanglement via Latent Space Mapping
Learning disentangled representations of data is a fundamental problem in
artificial intelligence. Specifically, disentangled latent representations
allow generative models to control and compose the disentangled factors in the
synthesis process. Current methods, however, require extensive supervision and
training, or instead, noticeably compromise quality. In this paper, we present
a method that learns how to represent data in a disentangled way, with minimal
supervision, manifested solely using available pre-trained networks. Our key
insight is to decouple the processes of disentanglement and synthesis, by
employing a leading pre-trained unconditional image generator, such as
StyleGAN. By learning to map into its latent space, we leverage both its
state-of-the-art generative quality and its rich and expressive latent
space, without the burden of training it. We demonstrate our approach on the
complex and high-dimensional domain of human heads. We evaluate our method
qualitatively and quantitatively, and exhibit its success with
de-identification operations and with temporal identity coherency in image
sequences. Through this extensive experimentation, we show that our method
successfully disentangles identity from other facial attributes, surpassing
existing methods, even though they require more training and supervision.
Comment: 17 pages, 10 figures
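The decoupling of disentanglement from synthesis amounts to training only small mappers into the latent space of a frozen, pre-trained generator. The sketch below assumes flat 64x64 inputs and a 512-dimensional latent; the encoder shapes and the `fuse` layer are hypothetical, and the frozen StyleGAN itself is omitted.

```python
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    """Maps an identity image and an attribute image to a single latent
    code for a frozen pre-trained generator; only this mapper is trained."""
    def __init__(self, latent_dim=512):
        super().__init__()
        self.id_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, latent_dim))
        self.attr_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, latent_dim))
        self.fuse = nn.Linear(2 * latent_dim, latent_dim)

    def forward(self, id_img, attr_img):
        z = torch.cat([self.id_enc(id_img), self.attr_enc(attr_img)], dim=1)
        return self.fuse(z)   # decoded by the frozen generator (omitted here)

mapper = LatentMapper()
w = mapper(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))   # (2, 512)
```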
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
Talking face generation aims to synthesize a sequence of face images that
correspond to a clip of speech. This is a challenging task because face
appearance variation and semantics of speech are coupled together in the subtle
movements of the talking face regions. Existing works either construct
subject-specific face appearance models or model the transformation between
lip motion and speech. In this work, we integrate both aspects and enable
arbitrary-subject talking face generation by learning disentangled audio-visual
representation. We find that the talking face sequence is actually a
composition of both subject-related information and speech-related information.
These two spaces are then explicitly disentangled through a novel
associative-and-adversarial training process. This disentangled representation
has an advantage where both audio and video can serve as inputs for generation.
Extensive experiments show that the proposed approach generates realistic
talking face sequences on arbitrary subjects with much clearer lip motion
patterns than previous work. We also demonstrate the learned audio-visual
representation is extremely useful for the tasks of automatic lip reading and
audio-video retrieval.
Comment: AAAI Conference on Artificial Intelligence (AAAI 2019) Oral
Presentation. Code, models, and video results are available on our webpage:
https://liuziwei7.github.io/projects/TalkingFace.htm
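The adversarial half of the "associative-and-adversarial" training can be sketched as an identity classifier that tries to recover the subject from the speech-content embedding while the encoder learns to defeat it. The module sizes and two-loss formulation below are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A classifier tries to recover the subject identity from the speech-content
# embedding, while the encoder is trained to make that impossible.
content_enc = nn.Linear(128, 64)      # stands in for the audio-visual encoder
id_classifier = nn.Linear(64, 10)     # adversary over 10 training subjects

feat = torch.randn(8, 128)
subject = torch.randint(0, 10, (8,))

content = content_enc(feat)
d_loss = F.cross_entropy(id_classifier(content.detach()), subject)  # train adversary
g_loss = -F.cross_entropy(id_classifier(content), subject)          # encoder fools it
```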
Illumination-Adaptive Person Re-identification
Most person re-identification (ReID) approaches assume that person images are
captured under relatively similar illumination conditions. In reality,
long-term person retrieval is common, and person images are often captured
under different illumination conditions at different times across a day. In
this situation, the performances of existing ReID models often degrade
dramatically. This paper addresses the ReID problem with illumination
variations and names it as {\em Illumination-Adaptive Person Re-identification
(IA-ReID)}. We propose an Illumination-Identity Disentanglement (IID) network
to disentangle illumination of varying intensity while preserving individuals'
identity information. To demonstrate the illumination issue and to evaluate our
model, we construct two large-scale simulated datasets with a wide range of
illumination variations. Experimental results on the simulated datasets and
real-world images demonstrate the effectiveness of the proposed framework.
Comment: Accepted by TM
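One way to read the illumination-identity disentanglement idea is as a feature split with two supervision signals: an identity classifier on one part and an illumination-level classifier on the other, with only the identity part used for retrieval. The dimensions, class counts, and hard split below are illustrative assumptions, not the IID network's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Split a backbone feature into identity and illumination parts, each
# supervised by its own classifier (a generic sketch, sizes assumed).
backbone_feat = torch.randn(16, 512)
id_feat, illum_feat = backbone_feat[:, :256], backbone_feat[:, 256:]

id_head = nn.Linear(256, 751)        # e.g. number of training identities
illum_head = nn.Linear(256, 5)       # e.g. discretized illumination levels

id_labels = torch.randint(0, 751, (16,))
illum_labels = torch.randint(0, 5, (16,))
loss = F.cross_entropy(id_head(id_feat), id_labels) \
     + F.cross_entropy(illum_head(illum_feat), illum_labels)
```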
Unsupervised Part-Based Disentangling of Object Shape and Appearance
Large intra-class variation is the result of changes in multiple object
characteristics. Images, however, only show the superposition of different
variable factors such as appearance or shape. Therefore, learning to
disentangle and represent these different characteristics poses a great
challenge, especially in the unsupervised case. Moreover, large object
articulation calls for a flexible part-based model. We present an unsupervised
approach for disentangling appearance and shape by learning parts consistently
over all instances of a category. Our model for learning an object
representation is trained by simultaneously exploiting invariance and
equivariance constraints between synthetically transformed images. Since no
part annotation or prior information on an object class is required, the
approach is applicable to arbitrary classes. We evaluate our approach on a wide
range of object categories and diverse tasks including pose prediction,
disentangled image synthesis, and video-to-video translation. The approach
outperforms the state-of-the-art on unsupervised keypoint prediction and
compares favorably even against supervised approaches on the task of shape and
appearance transfer.
Comment: CVPR 2019 Oral
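The invariance and equivariance constraints between synthetically transformed images can be written down compactly: part (shape) maps should move with the transformation, appearance codes should not. In the sketch below, `shape_net`, `app_net`, and the affine warp are toy placeholders standing in for the paper's part-based model.

```python
import torch
import torch.nn.functional as F

shape_net = lambda x: torch.sigmoid(x.mean(1, keepdim=True))   # (B,1,H,W) part map
app_net = lambda x: x.mean(dim=(2, 3))                          # (B,C) appearance code

def warp(img, theta):
    grid = F.affine_grid(theta, img.size(), align_corners=False)
    return F.grid_sample(img, grid, align_corners=False)

img = torch.randn(4, 3, 64, 64)
theta = torch.tensor([[[1.0, 0.0, 0.1], [0.0, 1.0, 0.1]]]).repeat(4, 1, 1)  # small shift

warped = warp(img, theta)
equi_loss = F.l1_loss(shape_net(warped), warp(shape_net(img), theta))  # shape moves with the image
inv_loss = F.l1_loss(app_net(warped), app_net(img))                    # appearance does not
```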
Image Generation from Layout
Despite significant recent progress on generative models, controlled
generation of images depicting multiple and complex object layouts is still a
difficult problem. Among the core challenges are the diversity of appearance a
given object may possess and, as a result, the exponentially large set of images consistent
with a specified layout. To address these challenges, we propose a novel
approach for layout-based image generation; we call it Layout2Im. Given the
coarse spatial layout (bounding boxes + object categories), our model can
generate a set of realistic images which have the correct objects in the
desired locations. The representation of each object is disentangled into a
specified/certain part (category) and an unspecified/uncertain part
(appearance). The category is encoded using a word embedding and the appearance
is distilled into a low-dimensional vector sampled from a normal distribution.
Individual object representations are composed using a convolutional LSTM to
obtain an encoding of the complete layout, which is then decoded to an
image. Several loss terms are introduced to encourage accurate and diverse
generation. The proposed Layout2Im model significantly outperforms the previous
state of the art, boosting the best reported inception score by 24.66% and
28.57% on the very challenging COCO-Stuff and Visual Genome datasets,
respectively. Extensive experiments also demonstrate our method's ability to
generate complex and diverse images with multiple objects.
Comment: Accepted to CVPR 2019 (Oral)
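The disentangled object representation described above is easy to sketch: a category word embedding concatenated with an appearance vector sampled from a normal distribution, then fused across objects. The paper composes spatial feature maps with a convolutional LSTM; the plain LSTM over flat object codes below is a simplification, and all sizes are assumed.

```python
import torch
import torch.nn as nn

num_categories, emb_dim, app_dim = 171, 64, 64
cat_emb = nn.Embedding(num_categories, emb_dim)
fuser = nn.LSTM(input_size=emb_dim + app_dim, hidden_size=128, batch_first=True)

categories = torch.randint(0, num_categories, (2, 5))   # 2 layouts, 5 objects each
appearance = torch.randn(2, 5, app_dim)                 # sampled appearance codes
obj_codes = torch.cat([cat_emb(categories), appearance], dim=-1)

layout_code, _ = fuser(obj_codes)        # (2, 5, 128) per-object layout encoding
layout_summary = layout_code[:, -1]      # would be fed to the image decoder
```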
Unsupervised Learning of Disentangled Representations from Video
We present a new model, DrNET, that learns disentangled image representations
from video. Our approach leverages the temporal coherence of video and a novel
adversarial loss to learn a representation that factorizes each frame into a
stationary part and a temporally varying component. The disentangled
representation can be used for a range of tasks. For example, applying a
standard LSTM to the time-varying components enables prediction of future frames.
We evaluate our approach on a range of synthetic and real videos, demonstrating
the ability to coherently generate hundreds of steps into the future.
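The stationary/time-varying factorization makes future-frame prediction straightforward: keep the content code fixed and roll the pose code forward with an LSTM. The sketch below uses placeholder dimensions and omits the encoders and decoder, so it illustrates only the prediction step, not DrNET itself.

```python
import torch
import torch.nn as nn

content_dim, pose_dim = 128, 16
content = torch.randn(1, content_dim)                # stationary part of the clip
pose_seq = torch.randn(1, 10, pose_dim)              # time-varying part, 10 frames

predictor = nn.LSTM(pose_dim, pose_dim, batch_first=True)
future_pose, _ = predictor(pose_seq)                 # (1, 10, pose_dim)

decoder_input = torch.cat([content, future_pose[:, -1]], dim=1)   # next-frame code
```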
ELEGANT: Exchanging Latent Encodings with GAN for Transferring Multiple Face Attributes
Recent studies on face attribute transfer have achieved great success. Many
models are able to transfer face attributes given an input image. However,
they suffer from three limitations: (1) incapability of generating images by
exemplars; (2) being unable to transfer multiple face attributes
simultaneously; (3) low quality of generated images, such as low resolution or
artifacts. To address these limitations, we propose a novel model which
receives two images of opposite attributes as inputs. Our model can transfer
exactly the same type of attributes from one image to another by exchanging
certain parts of their encodings. All the attributes are encoded in a
disentangled manner in the latent space, which enables us to manipulate several
attributes simultaneously. Besides, our model learns the residual images so as
to facilitate training on higher resolution images. With the help of
multi-scale discriminators for adversarial training, it can even generate
high-quality images with finer details and fewer artifacts. We demonstrate the
effectiveness of our model on overcoming the above three limitations by
comparing with other methods on the CelebA face database. A PyTorch
implementation is available at https://github.com/Prinsphield/ELEGANT.
Comment: GitHub: https://github.com/Prinsphield/ELEGANT
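The exchange operation at the heart of the model is simple to illustrate under the assumption that each attribute occupies a disjoint slice of the latent code: transferring attribute k between two faces then means swapping that slice. The slicing scheme and sizes below are assumptions for illustration, not taken from the released code.

```python
import torch

def swap_attribute(z_a, z_b, k, slice_len):
    """Swap the k-th attribute slice between two latent codes."""
    z_a, z_b = z_a.clone(), z_b.clone()
    s = slice(k * slice_len, (k + 1) * slice_len)
    z_a[:, s], z_b[:, s] = z_b[:, s].clone(), z_a[:, s].clone()
    return z_a, z_b

z_smiling = torch.randn(1, 8 * 32)       # 8 attributes, 32 dims each (assumed)
z_neutral = torch.randn(1, 8 * 32)
z_a_new, z_b_new = swap_attribute(z_smiling, z_neutral, k=2, slice_len=32)
```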
An Efficient Integration of Disentangled Attended Expression and Identity Features for Facial Expression Transfer and Synthesis
In this paper, we present an Attention-based Identity Preserving Generative
Adversarial Network (AIP-GAN) to overcome the identity leakage problem from a
source image to a generated face image, an issue that is encountered in a
cross-subject facial expression transfer and synthesis process. Our key insight
is that the identity preserving network should be able to disentangle and
compose shape, appearance, and expression information for efficient facial
expression transfer and synthesis. Specifically, the expression encoder of our
AIP-GAN disentangles the expression information from the input source image by
predicting its facial landmarks using our supervised spatial and channel-wise
attention module. Similarly, the disentangled expression-agnostic identity
features are extracted from the input target image by inferring its combined
intrinsic-shape and appearance image employing our self-supervised spatial and
channel-wise attention module. To leverage the expression and identity
information encoded by the intermediate layers of both of our encoders, we
combine these features with the features learned by the intermediate layers of
our decoder using a cross-encoder bilinear pooling operation. Experimental
results show the promising performance of our AIP-GAN-based technique.
Comment: 10 pages, excluding references
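The "cross-encoder bilinear pooling operation" used to combine expression and identity features can be approximated generically as an outer product of the two feature vectors followed by a linear projection. The layer below is such a generic bilinear-pooling sketch with assumed dimensions, not AIP-GAN's exact operator.

```python
import torch
import torch.nn as nn

class BilinearFusion(nn.Module):
    """Generic bilinear pooling: outer product of two feature vectors,
    flattened and projected to a fused representation."""
    def __init__(self, expr_dim, id_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(expr_dim * id_dim, out_dim)

    def forward(self, expr_feat, id_feat):
        outer = torch.einsum('be,bi->bei', expr_feat, id_feat)   # (B, expr, id)
        return self.proj(outer.flatten(1))

fuse = BilinearFusion(expr_dim=32, id_dim=32, out_dim=128)
fused = fuse(torch.randn(4, 32), torch.randn(4, 32))             # (4, 128)
```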