Image-to-image Transformation with Auxiliary Condition
The performance of image recognition models such as human pose detectors, when trained with
simulated images, usually degrades because of the divergence between real and simulated data.
To make the distribution of simulated images close to that of real ones, several works apply
GAN-based image-to-image transformation methods, e.g., SimGAN and CycleGAN. However, these
methods are often not sensitive enough to variations in the pose and shape of subjects,
especially when the training data are imbalanced, e.g., when particular poses and shapes are
underrepresented. To overcome this problem, we propose to introduce the label information of
subjects, e.g., the pose and type of objects, into the training of CycleGAN, leading it to
obtain label-wise transformation models. We evaluate our proposed method, called
Label-CycleGAN, through experiments on digit image transformation from SVHN to MNIST and on
surveillance camera image transformation from simulated to real images.
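A minimal sketch of the label-conditioning idea: the CycleGAN generators and discriminators additionally receive the subject label (e.g., pose or object type), so transformation behaviour can differ per label. The PyTorch names below (CondGenerator, cycle_step, G_ab, D_b) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondGenerator(nn.Module):
    """Toy generator that receives the subject label as extra input channels."""
    def __init__(self, n_labels, ch=64):
        super().__init__()
        self.label_embed = nn.Embedding(n_labels, 16)
        self.net = nn.Sequential(  # stand-in backbone; a real model would be an encoder-decoder
            nn.Conv2d(3 + 16, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x, label):
        e = self.label_embed(label)                           # (B, 16)
        e = e[:, :, None, None].expand(-1, -1, *x.shape[2:])  # broadcast label code over the image
        return self.net(torch.cat([x, e], dim=1))

def cycle_step(G_ab, G_ba, D_b, x_a, label, lambda_cyc=10.0):
    """One label-conditioned A->B->A half-cycle; the discriminator also sees the label."""
    fake_b = G_ab(x_a, label)
    rec_a = G_ba(fake_b, label)
    logits = D_b(fake_b, label)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    cyc = F.l1_loss(rec_a, x_a)
    return adv + lambda_cyc * cyc
```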
Towards Fine-grained Human Pose Transfer with Detail Replenishing Network
Human pose transfer (HPT) is an emerging research topic with huge potential
in fashion design, media production, online advertising and virtual reality.
For these applications, the visual realism of fine-grained appearance details
is crucial for production quality and user engagement. However, existing HPT
methods often suffer from three fundamental issues: detail deficiency, content
ambiguity and style inconsistency, which severely degrade the visual quality
and realism of generated images. Aiming towards real-world applications, we
develop a more challenging yet practical HPT setting, termed Fine-grained
Human Pose Transfer (FHPT), with a higher focus on semantic fidelity and detail
replenishment. Concretely, we analyze the potential design flaws of existing
methods via an illustrative example, and establish the core FHPT methodology by
combining the ideas of content synthesis and feature transfer in a
mutually-guided fashion. Thereafter, we substantiate the proposed methodology
with a Detail Replenishing Network (DRN) and a corresponding coarse-to-fine
model training scheme. Moreover, we build up a complete suite of fine-grained
evaluation protocols to address the challenges of FHPT in a comprehensive
manner, including semantic analysis, structural detection and perceptual
quality assessment. Extensive experiments on the DeepFashion benchmark dataset
have verified the power of the proposed framework against state-of-the-art works, with a
12%-14% gain on top-10 retrieval recall, 5% higher joint localization accuracy, and a nearly
40% gain on face identity preservation. Moreover, the evaluation results offer further
insights into the subject matter, which could inspire many promising future works along this direction.
Comment: IEEE TIP submission
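As a rough illustration of combining content synthesis and feature transfer in a mutually guided fashion, the sketch below gates each feature branch by the other, so synthesized content can fill regions that feature transfer misses and vice versa. Module and variable names (MutualGuidedFusion, f_syn, f_transfer) are assumptions, not the paper's DRN modules.

```python
import torch
import torch.nn as nn

class MutualGuidedFusion(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # each branch predicts a gate from the concatenation of both branches
        self.gate_syn = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.Sigmoid())
        self.gate_tr = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.Sigmoid())

    def forward(self, f_syn, f_transfer):
        both = torch.cat([f_syn, f_transfer], dim=1)
        # synthesized content covers regions the transfer branch cannot, and vice versa
        return self.gate_syn(both) * f_syn + self.gate_tr(both) * f_transfer
```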
Two-Stream Appearance Transfer Network for Person Image Generation
Pose guided person image generation means to generate a photo-realistic
person image conditioned on an input person image and a desired pose. This task
requires spatial manipulation of the source image according to the target pose.
However, the generative adversarial networks (GANs) widely used for image
generation and translation rely on spatially local and translation equivariant
operators, i.e., convolution, pooling and unpooling, which cannot handle large
image deformation. This paper introduces a novel two-stream appearance transfer
network (2s-ATN) to address this challenge. It is a multi-stage architecture
consisting of a source stream and a target stream. Each stage features an
appearance transfer module and several two-stream feature fusion modules. The
former finds the dense correspondence between the two-stream feature maps and
then transfers the appearance information from the source stream to the target
stream. The latter exchange local information between the two streams and
supplement the non-local appearance transfer. Both quantitative and qualitative
results indicate the proposed 2s-ATN can effectively handle large spatial
deformation and occlusion while retaining the appearance details. It
outperforms prior state-of-the-art methods on two widely used benchmarks.
Comment: 9 pages, 5 figures
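The appearance transfer module described above can be approximated by a generic attention-style dense correspondence between the source and target feature maps; the sketch below is such an approximation, not the authors' exact operator.

```python
import torch
import torch.nn.functional as F

def appearance_transfer(f_src, f_tgt, tau=0.01):
    """f_src, f_tgt: (B, C, H, W). Returns source appearance rearranged to the target layout."""
    B, C, H, W = f_src.shape
    src = F.normalize(f_src.flatten(2), dim=1)                # (B, C, HW)
    tgt = F.normalize(f_tgt.flatten(2), dim=1)                # (B, C, HW)
    corr = torch.bmm(tgt.transpose(1, 2), src)                # (B, HW_tgt, HW_src) cosine similarity
    attn = F.softmax(corr / tau, dim=-1)                      # dense soft correspondence
    out = torch.bmm(f_src.flatten(2), attn.transpose(1, 2))   # gather source appearance per target location
    return out.view(B, C, H, W)
```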
Intrinsic Temporal Regularization for High-resolution Human Video Synthesis
Temporal consistency is crucial for extending image processing pipelines to
the video domain, which is often enforced with flow-based warping error over
adjacent frames. Yet for human video synthesis, such a scheme is less reliable
due to the misalignment between source and target video as well as the
difficulty in accurate flow estimation. In this paper, we propose an effective
intrinsic temporal regularization scheme to mitigate these issues, where an
intrinsic confidence map is estimated via the frame generator to regulate
motion estimation via temporal loss modulation. This creates a shortcut for
back-propagating temporal loss gradients directly to the front-end motion
estimator, thus improving training stability and temporal coherence in output
videos. We apply our intrinsic temporal regularization to a single-image generator, leading
to a powerful "INTERnet" capable of generating high-resolution human action videos with
temporally coherent, realistic visual details. Extensive experiments demonstrate the
superiority of the proposed INTERnet over several competitive baselines.
Comment: 10 pages, work done during an internship at Alibaba DAMO Academy
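A minimal sketch of a confidence-modulated temporal loss along the lines described above: a per-pixel confidence map down-weights the flow-warping error, so temporal-loss gradients can reach the motion estimator without being dominated by unreliable flow regions. The function names, warping scheme, and L1 form of the loss are assumptions.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B,C,H,W) with optical flow (B,2,H,W) via grid_sample."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=img.device),
                            torch.arange(W, device=img.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float()[None] + flow      # absolute sampling positions
    grid_x = 2.0 * grid[:, 0] / (W - 1) - 1.0                     # normalize to [-1, 1]
    grid_y = 2.0 * grid[:, 1] / (H - 1) - 1.0
    return F.grid_sample(img, torch.stack((grid_x, grid_y), dim=-1), align_corners=True)

def intrinsic_temporal_loss(frame_t, frame_tm1, flow, confidence):
    """confidence in [0,1], predicted alongside frame_t; down-weights unreliable flow."""
    warped = warp(frame_tm1, flow)
    return (confidence * (frame_t - warped).abs()).mean()
```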
Disentangled Cycle Consistency for Highly-realistic Virtual Try-On
Image virtual try-on replaces the clothes on a person image with a desired
in-shop clothes image. It is challenging because the person and the in-shop
clothes are unpaired. Existing methods formulate virtual try-on as either
in-painting or cycle consistency. Both of these two formulations encourage the
generation networks to reconstruct the input image in a self-supervised manner.
However, existing methods do not differentiate between clothing and non-clothing regions.
Such straightforward generation impedes virtual try-on quality because the image contents
are heavily coupled. In this paper, we propose a Disentangled
Cycle-consistency Try-On Network (DCTON). The DCTON is able to produce
highly-realistic try-on images by disentangling important components of virtual
try-on including clothes warping, skin synthesis, and image composition. To
this end, DCTON can be naturally trained in a self-supervised manner following
cycle consistency learning. Extensive experiments on challenging benchmarks
show that DCTON performs favorably against state-of-the-art approaches.
Comment: Accepted by CVPR 2021
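A highly simplified sketch of cycle-consistent try-on training with clothing and non-clothing regions supervised separately, in the spirit of the disentanglement described above; tryon_net and the mask-based losses are placeholders rather than DCTON's actual sub-networks.

```python
import torch.nn.functional as F

def dcton_cycle_step(tryon_net, person, clothes_a, clothes_b, cloth_mask):
    # forward pass: dress the person in the new garment
    person_in_b = tryon_net(person, clothes_b)
    # cycle pass: putting the original garment back should recover the input image
    person_rec = tryon_net(person_in_b, clothes_a)
    # disentangled reconstruction: clothing and non-clothing regions supervised separately
    loss_cloth = F.l1_loss(person_rec * cloth_mask, person * cloth_mask)
    loss_body = F.l1_loss(person_rec * (1 - cloth_mask), person * (1 - cloth_mask))
    return loss_cloth + loss_body
```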
Single-Shot Freestyle Dance Reenactment
The task of motion transfer between a source dancer and a target person is a
special case of the pose transfer problem, in which the target person changes
their pose in accordance with the motions of the dancer.
In this work, we propose a novel method that can reanimate a single image using
arbitrary video sequences unseen during training. The method combines three
networks: (i) a segmentation-mapping network, (ii) a realistic frame-rendering
network, and (iii) a face refinement network. By separating this task into
three stages, we are able to attain a novel sequence of realistic frames,
capturing natural motion and appearance. Our method obtains significantly
better visual quality than previous methods and is able to animate diverse body
types and appearances, which are captured in challenging poses, as shown in the
experiments and the supplementary video.
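The three-stage decomposition can be read as a simple per-frame pipeline, sketched below with assumed function names (seg_map_net, render_net, face_net).

```python
def reenact_frame(seg_map_net, render_net, face_net, source_image, driving_pose):
    # (i) map the driving pose to a segmentation layout of the target person
    target_seg = seg_map_net(source_image, driving_pose)
    # (ii) render a realistic frame of the target person in that layout
    frame = render_net(source_image, target_seg)
    # (iii) refine the face region to preserve identity
    return face_net(frame, source_image)
```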
Unbalanced Feature Transport for Exemplar-based Image Translation
Despite the great success of GANs in image translation with different conditional
inputs such as semantic segmentation and edge maps, generating
high-fidelity realistic images with reference styles remains a grand challenge
in conditional image-to-image translation. This paper presents a general image
translation framework that incorporates optimal transport for feature alignment
between conditional inputs and style exemplars in image translation. The
introduction of optimal transport mitigates the constraint of many-to-one
feature matching significantly while building up accurate semantic
correspondences between conditional inputs and exemplars. We design a novel
unbalanced optimal transport to address the transport between features with
deviational distributions, which exist widely between conditional inputs and
exemplars. In addition, we design a semantic-activation normalization scheme
that injects style features of exemplars into the image translation process
successfully. Extensive experiments over multiple image translation tasks show
that our method achieves superior image translation qualitatively and
quantitatively as compared with the state of the art.
Comment: Accepted to CVPR 2021
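For context, the sketch below shows a textbook entropy-regularized unbalanced optimal transport solver (Sinkhorn-style scaling with softly relaxed marginals), which is the general kind of matching used here for aligning conditional-input and exemplar features; it is not the authors' solver, and all names are illustrative.

```python
import torch

def unbalanced_sinkhorn(cost, a, b, eps=0.05, rho=1.0, n_iter=100):
    """cost: (n, m) pairwise feature distances; a, b: (n,), (m,) source/target masses."""
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    u = torch.ones_like(a)
    v = torch.ones_like(b)
    power = rho / (rho + eps)                   # soft marginal constraints (KL penalty rho)
    for _ in range(n_iter):
        u = (a / (K @ v + 1e-16)) ** power
        v = (b / (K.T @ u + 1e-16)) ** power
    return u[:, None] * K * v[None, :]          # transport plan of shape (n, m)
```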
Toward Accurate and Realistic Outfits Visualization with Attention to Details
Virtual try-on methods aim to generate images of fashion models wearing
arbitrary combinations of garments. This is a challenging task because the
generated image must appear realistic and accurately display the interaction
between garments. Prior works produce images that are filled with artifacts and
fail to capture important visual details necessary for commercial applications.
We propose Outfit Visualization Net (OVNet) to capture these important details
(e.g. buttons, shading, textures, realistic hemlines, and interactions between
garments) and produce high-quality multi-garment virtual try-on images.
OVNet consists of 1) a semantic layout generator and 2) an image generation
pipeline using multiple coordinated warps. We train the warper to output
multiple warps using a cascade loss, which refines each successive warp to
focus on poorly generated regions of a previous warp and yields consistent
improvements in detail. In addition, we introduce a method for matching outfits
with the most suitable model, which yields significant improvements for both our method
and other previous try-on methods. Through quantitative and qualitative
analysis, we demonstrate our method generates substantially higher-quality
studio images compared to prior works for multi-garment outfits. An interactive
interface powered by this method has been deployed on fashion e-commerce
websites and received overwhelmingly positive feedback.
Comment: Accepted to CVPR 2021. Live demo here: https://revery.ai/demo.htm
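A rough sketch of a cascade loss in the spirit described above: each successive warp is penalized more heavily on regions the previous warp reconstructed poorly. The weighting scheme and names are assumptions.

```python
import torch

def cascade_loss(warps, target):
    """warps: list of (B,C,H,W) warped garment images; target: ground-truth garment image."""
    loss = 0.0
    weight = torch.ones_like(target[:, :1])             # start with uniform per-pixel weighting
    for w in warps:
        err = (w - target).abs()
        loss = loss + (weight * err).mean()
        # the next warp focuses on the regions this warp got wrong
        weight = err.mean(dim=1, keepdim=True).detach()
    return loss
```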
Unselfie: Translating Selfies to Neutral-pose Portraits in the Wild
Due to the ubiquity of smartphones, it is popular to take photos of one's
self, or "selfies." Such photos are convenient to take, because they do not
require specialized equipment or a third-party photographer. However, in
selfies, constraints such as human arm length often make the body pose look
unnatural. To address this issue, we introduce Unselfie, a novel
photographic transformation that automatically translates a selfie into a
neutral-pose portrait. To achieve this, we first collect an unpaired dataset,
and introduce a way to synthesize paired training data for self-supervised
learning. Then, to unselfie a photo, we propose a new three-stage
pipeline, where we first find a target neutral pose, inpaint the body texture,
and finally refine and composite the person on the background. To obtain a
suitable target neutral pose, we propose a novel nearest pose search module
that makes the reposing task easier and enables the generation of multiple
neutral-pose results among which users can choose the best one they like.
Qualitative and quantitative evaluations show the superiority of our pipeline
over alternatives.
Comment: To appear in ECCV 2020
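The nearest pose search module can be illustrated with a plain nearest-neighbour lookup over pose keypoints, as sketched below; the distance measure, centering step, and variable names are assumptions.

```python
import numpy as np

def nearest_neutral_poses(selfie_pose, neutral_pose_pool, k=5):
    """selfie_pose: (J, 2) keypoints; neutral_pose_pool: (N, J, 2). Returns k pool indices."""
    # center each pose on its mean keypoint so the comparison ignores global position
    sp = selfie_pose - selfie_pose.mean(axis=0, keepdims=True)
    pool = neutral_pose_pool - neutral_pose_pool.mean(axis=1, keepdims=True)
    dists = np.linalg.norm(pool - sp[None], axis=(1, 2))
    return np.argsort(dists)[:k]                        # candidates the user can choose from
```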
Pose-Guided Human Animation from a Single Image in the Wild
We present a new pose transfer method for synthesizing a human animation from
a single image of a person controlled by a sequence of body poses. Existing
pose transfer methods exhibit significant visual artifacts when applied to a
novel scene, resulting in temporal inconsistency and failures in preserving the
identity and textures of the person. To address these limitations, we design a
compositional neural network that predicts the silhouette, garment labels, and
textures. Each modular network is explicitly dedicated to a subtask that can be
learned from synthetic data. At inference time, we utilize the trained
network to produce a unified representation of appearance and its labels in UV
coordinates, which remains constant across poses. The unified representation
provides incomplete yet strong guidance for generating the appearance in
response to the pose change. We use the trained network to complete the
appearance and render it with the background. With these strategies, we are
able to synthesize human animations that can preserve the identity and
appearance of the person in a temporally coherent way without any fine-tuning
of the network on the testing scene. Experiments show that our method
outperforms the state of the art in terms of synthesis quality, temporal
coherence, and generalization ability.
Comment: 14 pages including Appendix
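A minimal sketch of rendering appearance from a pose-invariant UV representation: a texture map built once in UV space is looked up with the per-pixel UV coordinates of each new pose (a generic DensePose-style lookup; the names and the bilinear sampling choice are assumptions).

```python
import torch
import torch.nn.functional as F

def render_from_uv(texture, uv_coords):
    """texture: (B, C, Ht, Wt) appearance in UV space; uv_coords: (B, H, W, 2) in [0, 1]."""
    grid = uv_coords * 2.0 - 1.0                 # grid_sample expects coordinates in [-1, 1]
    return F.grid_sample(texture, grid, align_corners=True)
```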