Unsupervised Person Image Generation with Semantic Parsing Transformation
In this paper, we address unsupervised pose-guided person image generation,
which is known to be challenging due to non-rigid deformation. Unlike previous
methods that learn a hard direct mapping between human bodies, we propose a
new pathway that decomposes the hard mapping into two more accessible subtasks,
namely, semantic parsing transformation and appearance generation. Firstly, a
semantic generative network is proposed to transform between semantic parsing
maps, in order to simplify the non-rigid deformation learning. Secondly, an
appearance generative network learns to synthesize semantic-aware textures.
Thirdly, we demonstrate that training our framework in an end-to-end manner
further refines the semantic maps and final results accordingly. Our method is
generalizable to other semantic-aware person image generation tasks, e.g.,
clothing texture transfer and controlled image manipulation. Experimental
results demonstrate the superiority of our method on DeepFashion and
Market-1501 datasets, especially in preserving clothing attributes and body shapes.
Comment: Accepted to CVPR 2019 (Oral). Our project is available at https://github.com/SijieSong/person_generation_sp
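
The two-stage decomposition can be pictured as two generators applied in sequence, trained jointly so that the appearance objective can refine the predicted parsing maps. Below is a minimal PyTorch-style sketch under assumed shapes (20 parsing classes, 18 pose keypoint heatmaps); the module names and layer choices are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class SemanticGenerator(nn.Module):
    """Stage 1: source parsing map + target pose -> target parsing map."""
    def __init__(self, n_parse=20, n_pose=18):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(n_parse + n_pose, 64),
            conv_block(64, 64),
            nn.Conv2d(64, n_parse, 1),
        )

    def forward(self, src_parse, tgt_pose):
        x = torch.cat([src_parse, tgt_pose], dim=1)
        return self.net(x).softmax(dim=1)  # soft per-pixel class scores

class AppearanceGenerator(nn.Module):
    """Stage 2: source image + predicted target parsing -> target image."""
    def __init__(self, n_parse=20):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(3 + n_parse, 64),
            conv_block(64, 64),
            nn.Conv2d(64, 3, 1),
        )

    def forward(self, src_img, tgt_parse):
        x = torch.cat([src_img, tgt_parse], dim=1)
        return torch.tanh(self.net(x))

# End-to-end: gradients flow through the soft parsing prediction,
# so the appearance loss can refine the semantic maps as well.
sem_g, app_g = SemanticGenerator(), AppearanceGenerator()
src_img = torch.randn(1, 3, 128, 128)
src_parse = torch.randn(1, 20, 128, 128)
tgt_pose = torch.randn(1, 18, 128, 128)
fake = app_g(src_img, sem_g(src_parse, tgt_pose))
```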
Soft-Gated Warping-GAN for Pose-Guided Person Image Synthesis
Despite remarkable advances in image synthesis research, existing works often
fail in manipulating images under the context of large geometric
transformations. Synthesizing person images conditioned on arbitrary poses is
one of the most representative examples where the generation quality largely
relies on the capability of identifying and modeling arbitrary transformations
on different body parts. Current generative models are often built on local
convolutions and overlook the key challenges (e.g., heavy occlusions, different
views or dramatic appearance changes) when distinct geometric changes happen
for each part, caused by arbitrary pose manipulations. This paper aims to
resolve these challenges induced by geometric variability and spatial
displacements via a new Soft-Gated Warping Generative Adversarial Network
(Warping-GAN), which is composed of two stages: 1) it first synthesizes a
target part segmentation map given a target pose, which depicts the
region-level spatial layouts for guiding image synthesis with higher-level
structure constraints; 2) the Warping-GAN equipped with a soft-gated
warping-block learns feature-level mapping to render textures from the original
image into the generated segmentation map. Warping-GAN is capable of
controlling different transformation degrees given distinct target poses.
Moreover, the proposed warping-block is light-weight and flexible enough to be
injected into any networks. Human perceptual studies and quantitative
evaluations demonstrate the superiority of our Warping-GAN that significantly
outperforms all existing methods on two large datasets.
Comment: 17 pages, 14 figures
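
One way to read the soft-gated warping-block is: predict a sampling offset and a per-pixel gate from the source features and the target layout, warp the source features, and let the gate control how much warped content is used. The sketch below is an assumption-level illustration in PyTorch, not the paper's exact block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftGatedWarpingBlock(nn.Module):
    """Warps source features toward the target layout and blends the
    warped and original features with a learned per-pixel soft gate."""
    def __init__(self, channels):
        super().__init__()
        # A plain conv predicts a 2-channel sampling offset and a
        # 1-channel gate; the real block's predictor is not specified here.
        self.flow = nn.Conv2d(channels * 2, 2, 3, padding=1)
        self.gate = nn.Conv2d(channels * 2, 1, 3, padding=1)

    def forward(self, src_feat, tgt_feat):
        x = torch.cat([src_feat, tgt_feat], dim=1)
        offset = self.flow(x).permute(0, 2, 3, 1)        # (N, H, W, 2)
        n, h, w, _ = offset.shape
        # Identity sampling grid in [-1, 1], shifted by the offset.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).expand(n, h, w, 2)
        warped = F.grid_sample(src_feat, grid + offset, align_corners=True)
        g = torch.sigmoid(self.gate(x))                   # gate in (0, 1)
        # The gate modulates the transformation degree per target pose.
        return g * warped + (1 - g) * src_feat

block = SoftGatedWarpingBlock(64)
out = block(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```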
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
This paper presents the futuristic challenges discussed in the
cvpaper.challenge. In 2015 and 2016, we thoroughly studied more than 1,600
papers from several conferences and journals, including CVPR, ICCV, ECCV,
NIPS, PAMI, and IJCV.
FaceShapeGene: A Disentangled Shape Representation for Flexible Face Image Editing
Existing methods for face image manipulation generally focus on editing the
expression, changing some predefined attributes, or applying different filters.
However, users lack the flexibility of controlling the shapes of different
semantic facial parts in the generated face. In this paper, we propose an
approach to compute a disentangled shape representation for a face image,
namely the FaceShapeGene. The proposed FaceShapeGene encodes the shape
information of each semantic facial part separately into a 1D latent vector. On
the basis of the FaceShapeGene, a novel part-wise face image editing system is
developed, which contains a shape-remix network and a conditional label-to-face
transformer. The shape-remix network can freely recombine the part-wise latent
vectors from different individuals, producing a remixed face shape in the form
of a label map, which contains the facial characteristics of multiple subjects.
The conditional label-to-face transformer, which is trained in an unsupervised
cyclic manner, performs part-wise face editing while preserving the original
identity of the subject. Experimental results on several tasks demonstrate that
the proposed FaceShapeGene representation correctly disentangles the shape
features of different semantic parts. Comparisons to existing methods
demonstrate the superiority of the proposed method in accomplishing novel face
editing tasks.
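
A rough sketch of the part-wise encoding and the shape-remix step is given below. The part list, latent size, and encoder layout are assumptions for illustration; the decoder back to a label map and the label-to-face transformer are omitted:

```python
import torch
import torch.nn as nn

PARTS = ["hair", "brows", "eyes", "nose", "mouth", "jawline"]  # assumed parts

class PartEncoder(nn.Module):
    """Encodes one semantic part's binary mask into a 1D shape latent."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=4), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 4, stride=4), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))

    def forward(self, mask):
        return self.net(mask)

encoders = nn.ModuleDict({p: PartEncoder() for p in PARTS})

def face_shape_gene(part_masks):
    """part_masks: dict of part name -> (N, 1, H, W) mask."""
    return {p: encoders[p](m) for p, m in part_masks.items()}

def remix(gene_a, gene_b, take_from_b=("nose", "mouth")):
    """Part-wise recombination: keep A's latents, swap in B's chosen parts."""
    return {p: (gene_b[p] if p in take_from_b else gene_a[p]) for p in PARTS}

masks_a = {p: torch.rand(1, 1, 128, 128) for p in PARTS}
masks_b = {p: torch.rand(1, 1, 128, 128) for p in PARTS}
remixed = remix(face_shape_gene(masks_a), face_shape_gene(masks_b))
```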
Down to the Last Detail: Virtual Try-on with Detail Carving
Virtual try-on under arbitrary poses has attracted substantial research
attention due to its broad potential applications. However, existing methods
can hardly
preserve the details in clothing texture and facial identity (face, hair) while
fitting novel clothes and poses onto a person. In this paper, we propose a
novel multi-stage framework to synthesize person images in which rich details
in salient regions are well preserved. Specifically, the framework decomposes
the generation into spatial alignment followed by coarse-to-fine generation.
To better preserve the details in salient areas such
as clothing and facial areas, we propose a Tree-Block (tree dilated fusion
block) to harness multi-scale features in the generator networks. With
end-to-end training of multiple stages, the whole framework can be jointly
optimized for results with significantly better visual fidelity and richer
details. Extensive experiments on standard datasets demonstrate that our
proposed framework achieves the state-of-the-art performance, especially in
preserving the visual details in clothing texture and facial identity. Our
implementation will be publicly available soon.
Comment: Our implementation is available at https://github.com/AIprogrammer/Down-to-the-Last-Detail-Virtual-Try-on-with-Detail-Carvin
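
The abstract names a Tree-Block (tree dilated fusion block) without detailing its layout. One plausible reading, sketched below purely as an assumption, runs parallel dilated-convolution branches to gather multi-scale context and fuses them pairwise, tree-style, into a residual update:

```python
import torch
import torch.nn as nn

class TreeDilatedFusionBlock(nn.Module):
    """Four dilated branches (leaves) fused pairwise into a root, then
    added back residually. Branch count and dilation rates are assumed."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations])
        # Pairwise fusion: 4 leaves -> 2 inner nodes -> 1 root.
        self.fuse = nn.ModuleList(
            [nn.Conv2d(channels * 2, channels, 1) for _ in range(3)])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        leaves = [self.act(b(x)) for b in self.branches]
        left = self.act(self.fuse[0](torch.cat(leaves[0:2], dim=1)))
        right = self.act(self.fuse[1](torch.cat(leaves[2:4], dim=1)))
        root = self.fuse[2](torch.cat([left, right], dim=1))
        return x + root  # residual path preserves fine detail

block = TreeDilatedFusionBlock(64)
out = block(torch.randn(1, 64, 64, 64))  # spatial size is preserved
```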
Context-Aware Synthesis and Placement of Object Instances
Learning to insert an object instance into an image in a semantically
coherent manner is a challenging and interesting problem. Solving it requires
(a) determining a location to place an object in the scene and (b) determining
its appearance at the location. Such an object insertion model can potentially
facilitate numerous image editing and scene parsing applications. In this
paper, we propose an end-to-end trainable neural network for the task of
inserting an object instance mask of a specified class into the semantic label
map of an image. Our network consists of two generative modules where one
determines where the inserted object mask should be (i.e., location and scale)
and the other determines what the object mask shape (and pose) should look
like. The two modules are connected together via a spatial transformation
network and jointly trained. We devise a learning procedure that leverages
both supervised and unsupervised data, and we show that our model can insert
an object at diverse locations with various appearances. We conduct extensive
experimental validation, with comparisons to strong baselines, to verify the
effectiveness of the proposed network.
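
The division of labour between the two modules can be sketched as follows: a "where" network regresses a translation and scale from the label map, and a differentiable affine warp (the spatial transformation network's role) pastes a canonical object mask at that location. Shapes, the class count, and the scale range are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WhereModule(nn.Module):
    """Predicts where the object goes: translation (tx, ty) and scale s."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_classes, 16, 4, stride=4), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 3))

    def forward(self, label_map):
        p = self.net(label_map)
        tx, ty = torch.tanh(p[:, 0]), torch.tanh(p[:, 1])   # in (-1, 1)
        s = 0.1 + 0.4 * torch.sigmoid(p[:, 2])              # in (0.1, 0.5)
        return tx, ty, s

def place(mask, tx, ty, s):
    """Differentiable placement: theta maps output coords back to the
    canonical mask, i.e. input = (output - t) / s."""
    n = mask.size(0)
    theta = torch.zeros(n, 2, 3)
    theta[:, 0, 0] = 1.0 / s
    theta[:, 1, 1] = 1.0 / s
    theta[:, 0, 2] = -tx / s
    theta[:, 1, 2] = -ty / s
    grid = F.affine_grid(theta, mask.size(), align_corners=False)
    return F.grid_sample(mask, grid, align_corners=False)

label_map = torch.rand(1, 10, 128, 128)
obj_mask = torch.rand(1, 1, 128, 128)    # stand-in for the "what" module
tx, ty, s = WhereModule()(label_map)
placed = place(obj_mask, tx, ty, s)      # gradients reach both modules
```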
cvpaper.challenge in 2015 - A review of CVPR2015 and DeepSurvey
The "cvpaper.challenge" is a group composed of members from AIST, Tokyo Denki
Univ. (TDU), and Univ. of Tsukuba that aims to systematically summarize papers
on computer vision, pattern recognition, and related fields. For this
particular review, we focused on reading all 602 conference papers presented
at CVPR2015, the premier annual computer vision event held in June 2015, in
order to grasp the trends in the field. Further, we propose "DeepSurvey" as a
mechanism embodying the entire process, from reading all the papers, through
the generation of ideas, to the writing of a paper.
Comment: Survey Paper
Towards Fine-grained Human Pose Transfer with Detail Replenishing Network
Human pose transfer (HPT) is an emerging research topic with huge potential
in fashion design, media production, online advertising and virtual reality.
For these applications, the visual realism of fine-grained appearance details
is crucial for production quality and user engagement. However, existing HPT
methods often suffer from three fundamental issues: detail deficiency, content
ambiguity and style inconsistency, which severely degrade the visual quality
and realism of generated images. Aiming towards real-world applications, we
develop a more challenging yet practical HPT setting, termed Fine-grained
Human Pose Transfer (FHPT), with a higher focus on semantic fidelity and detail
replenishment. Concretely, we analyze the potential design flaws of existing
methods via an illustrative example, and establish the core FHPT methodology by
combining the ideas of content synthesis and feature transfer in a
mutually-guided fashion. Thereafter, we substantiate the proposed methodology
with a Detail Replenishing Network (DRN) and a corresponding coarse-to-fine
model training scheme. Moreover, we build up a complete suite of fine-grained
evaluation protocols to address the challenges of FHPT in a comprehensive
manner, including semantic analysis, structural detection and perceptual
quality assessment. Extensive experiments on the DeepFashion benchmark dataset
have verified the power of the proposed method against state-of-the-art works,
with a 12%-14% gain in top-10 retrieval recall, 5% higher joint localization
accuracy, and a nearly 40% gain in face identity preservation. Moreover, the
evaluation results offer further insights into the subject matter, which could
inspire many promising future works along this direction.
Comment: IEEE TIP submission
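
The mutually-guided combination of content synthesis and feature transfer could be realized, for example, by letting each stream gate the other before fusion; the synthesis stream indicates where details belong, while the transfer stream supplies what they look like. This sketch illustrates that reading only, not the DRN architecture itself:

```python
import torch
import torch.nn as nn

class MutualGuidanceFusion(nn.Module):
    """Fuses a content-synthesis stream (coarse but complete) with a
    feature-transfer stream (detailed but incomplete); each stream
    produces a sigmoid gate for the other. Purely illustrative."""
    def __init__(self, channels):
        super().__init__()
        self.gate_s = nn.Conv2d(channels, channels, 1)  # guidance from synthesis
        self.gate_t = nn.Conv2d(channels, channels, 1)  # guidance from transfer
        self.proj = nn.Conv2d(channels * 2, channels, 1)

    def forward(self, f_synth, f_transfer):
        t = f_transfer * torch.sigmoid(self.gate_s(f_synth))
        s = f_synth * torch.sigmoid(self.gate_t(f_transfer))
        return self.proj(torch.cat([s, t], dim=1))

fuse = MutualGuidanceFusion(64)
out = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```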
Uncovering Temporal Context for Video Question and Answering
In this work, we introduce Video Question Answering in the temporal domain to
infer the past, describe the present and predict the future. We present an
encoder-decoder approach using Recurrent Neural Networks to learn temporal
structures of videos and introduce a dual-channel ranking loss to answer
multiple-choice questions. We explore approaches for finer understanding of
video content using the "fill-in-the-blank" question form, and we collect
109,895 video clips with a total duration of over 1,000 hours from the TACoS,
MPII-MD, and MEDTest 14 datasets, with the corresponding 390,744 questions
generated from annotations. Extensive experiments demonstrate that our
approach significantly outperforms the compared baselines.
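
A dual-channel ranking loss for multiple choice can be written as two hinge terms, one per scoring channel, each requiring the correct answer to beat every distractor by a margin. Treating the two channels as visual and textual scores is an assumption of this sketch:

```python
import torch

def dual_channel_ranking_loss(pos_v, neg_v, pos_t, neg_t, margin=1.0):
    """pos_*: (N,) scores of the correct answer per question;
    neg_*: (N, K) scores of K distractors. Each channel contributes a
    hinge penalty whenever a distractor comes within `margin`."""
    loss_v = (margin - pos_v.unsqueeze(1) + neg_v).clamp(min=0).mean()
    loss_t = (margin - pos_t.unsqueeze(1) + neg_t).clamp(min=0).mean()
    return loss_v + loss_t

# 8 questions, 3 distractors each (multiple choice with 4 options):
pos_v, pos_t = torch.randn(8), torch.randn(8)
neg_v, neg_t = torch.randn(8, 3), torch.randn(8, 3)
print(dual_channel_ranking_loss(pos_v, neg_v, pos_t, neg_t))
```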
Superquadrics Revisited: Learning 3D Shape Parsing beyond Cuboids
Abstracting complex 3D shapes with parsimonious part-based representations
has been a long-standing goal in computer vision. This paper presents a
learning-based solution to this problem which goes beyond the traditional 3D
cuboid representation by exploiting superquadrics as atomic elements. We
demonstrate that superquadrics lead to more expressive 3D scene parses while
being easier to learn than 3D cuboid representations. Moreover, we provide an
analytical solution to the Chamfer loss, which avoids the need for
computationally expensive reinforcement learning or iterative prediction. Our
model learns to
parse 3D objects into consistent superquadric representations without
supervision. Results on various ShapeNet categories as well as the SURREAL
human body dataset demonstrate the flexibility of our model in capturing fine
details and complex poses that could not have been modelled using cuboids.
Comment: CVPR 2019 Camera Ready. Project page: https://github.com/paschalidoud/superquadric_parsin
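
For reference, a superquadric surface can be sampled from its standard parametric form, and its parameters fitted by minimizing a Chamfer distance. The paper derives an analytical treatment of this loss; the sketch below uses the plain point-sampled version, which is still fully differentiable in the size and shape parameters:

```python
import math
import torch

def fexp(x, p):
    """Signed exponentiation sign(x) * |x|**p used by superquadrics."""
    return torch.sign(x) * torch.abs(x).clamp(min=1e-6) ** p

def superquadric_points(size, eps, n=64):
    """Sample surface points: size = (a1, a2, a3), eps = (e1, e2).
    eps near (1, 1) gives an ellipsoid; small eps approaches a cuboid."""
    eta = torch.linspace(-math.pi / 2, math.pi / 2, n)
    omega = torch.linspace(-math.pi, math.pi, n)
    eta, omega = torch.meshgrid(eta, omega, indexing="ij")
    x = size[0] * fexp(torch.cos(eta), eps[0]) * fexp(torch.cos(omega), eps[1])
    y = size[1] * fexp(torch.cos(eta), eps[0]) * fexp(torch.sin(omega), eps[1])
    z = size[2] * fexp(torch.sin(eta), eps[0])
    return torch.stack([x, y, z], dim=-1).reshape(-1, 3)

def chamfer(a, b):
    """Bidirectional Chamfer distance between point sets a (N,3), b (M,3)."""
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

size = torch.tensor([1.0, 0.6, 0.4], requires_grad=True)
eps = torch.tensor([0.3, 0.8], requires_grad=True)
target = torch.randn(500, 3)            # stand-in for an object's points
loss = chamfer(superquadric_points(size, eps), target)
loss.backward()                          # gradients w.r.t. size and eps
```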