Pose-Guided Human Parsing with Deep Learned Features
Parsing the human body into semantic regions is crucial to human-centric
analysis. In this paper, we propose a segment-based parsing pipeline that
exploits human pose information, i.e., the joint locations of a human model,
which improves the part proposal, accelerates the inference and regularizes the
parsing process at the same time. Specifically, we first generate part segment
proposals with respect to human joints predicted by a deep model, then part-
specific ranking models are trained for segment selection using both pose-based
features and deep-learned part potential features. Finally, the best ensemble
of the proposed part segments is inferred through an And-Or graph.
We evaluate our approach on the popular Penn-Fudan pedestrian parsing
dataset, and demonstrate the effectiveness of using the pose information for
each stage of the parsing pipeline. Finally, we show that our approach yields
superior part segmentation accuracy compared to state-of-the-art methods.
Comment: 12 pages, 10 figures; a shortened version of this paper was accepted
by AAAI 201
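To make the segment-selection stage concrete, here is a minimal PyTorch
sketch of a part-specific ranking model that scores candidate segments from
concatenated pose-based and deep part-potential features. All names
(PartRanker, the feature dimensions) are illustrative assumptions, not the
authors' code.

# Minimal sketch of part-specific segment ranking (hypothetical names).
import torch
import torch.nn as nn

class PartRanker(nn.Module):
    """Scores candidate segments for one body part from concatenated
    pose-based features and deep part-potential features."""
    def __init__(self, pose_dim: int, deep_dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim + deep_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, pose_feat, deep_feat):
        # pose_feat: (N, pose_dim), e.g. segment offsets to predicted joints
        # deep_feat: (N, deep_dim), e.g. pooled CNN part-potential responses
        return self.mlp(torch.cat([pose_feat, deep_feat], dim=1)).squeeze(1)

# One ranker per part; the highest-scoring proposals per part would then be
# assembled by the And-Or-graph inference described in the abstract.
ranker = PartRanker(pose_dim=16, deep_dim=128)
scores = ranker(torch.randn(50, 16), torch.randn(50, 128))  # 50 proposals
best = scores.topk(5).indices  # keep the top-5 segments for this part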
Soft-Gated Warping-GAN for Pose-Guided Person Image Synthesis
Despite remarkable advances in image synthesis research, existing works often
fail to manipulate images under large geometric
transformations. Synthesizing person images conditioned on arbitrary poses is
one of the most representative examples where the generation quality largely
relies on the capability of identifying and modeling arbitrary transformations
on different body parts. Current generative models are often built on local
convolutions and overlook the key challenges (e.g., heavy occlusions, different
views or dramatic appearance changes) when distinct geometric changes happen
for each part, caused by arbitrary pose manipulations. This paper aims to
resolve these challenges induced by geometric variability and spatial
displacements via a new Soft-Gated Warping Generative Adversarial Network
(Warping-GAN), which is composed of two stages: 1) it first synthesizes a
target part segmentation map given a target pose, which depicts the
region-level spatial layouts for guiding image synthesis with higher-level
structure constraints; 2) the Warping-GAN equipped with a soft-gated
warping-block learns feature-level mapping to render textures from the original
image into the generated segmentation map. Warping-GAN is capable of
controlling different transformation degrees given distinct target poses.
Moreover, the proposed warping-block is light-weight and flexible enough to be
injected into any networks. Human perceptual studies and quantitative
evaluations demonstrate the superiority of our Warping-GAN that significantly
outperforms all existing methods on two large datasets.
Comment: 17 pages, 14 figures
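As a rough illustration of the warping-block idea, the sketch below (assumed
names and layer choices, not the released model) predicts a per-pixel flow
field and a soft gate from the target segmentation map, warps the source
features, and blends warped and original features through the gate.

# Sketch of a soft-gated warping block (assumed names, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftGatedWarp(nn.Module):
    def __init__(self, seg_ch: int):
        super().__init__()
        self.flow = nn.Conv2d(seg_ch, 2, kernel_size=3, padding=1)  # offsets
        self.gate = nn.Conv2d(seg_ch, 1, kernel_size=3, padding=1)  # gate

    def forward(self, src_feat, tgt_seg):
        b, _, h, w = src_feat.shape
        # Identity sampling grid in [-1, 1].
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, h, w, 2)
        flow = self.flow(tgt_seg).permute(0, 2, 3, 1)       # (B, H, W, 2)
        warped = F.grid_sample(src_feat, grid + flow, align_corners=True)
        g = torch.sigmoid(self.gate(tgt_seg))                # (B, 1, H, W)
        return g * warped + (1 - g) * src_feat               # soft-gated blend

block = SoftGatedWarp(seg_ch=20)
out = block(torch.randn(2, 64, 32, 32), torch.randn(2, 20, 32, 32))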
Face Image Reflection Removal
Face images captured through glass are usually contaminated by
reflections. The non-transmitted reflections make the reflection removal more
challenging than for general scenes, because important facial features are
completely occluded. In this paper, we propose and solve the face image
reflection removal problem. We remove non-transmitted reflections by
incorporating inpainting ideas into a guided reflection removal framework and
recover facial features by considering various face-specific priors. We use a
newly collected face reflection image dataset to train our model and compare
with state-of-the-art methods. The proposed method shows advantages in
estimating reflection-free face images for improving face recognition.
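A toy Python/OpenCV sketch of the inpainting idea follows; the
brightness-threshold reflection mask is purely an assumption for
illustration, whereas the paper learns its model from the collected dataset.

# Toy sketch: inpaint candidate non-transmitted reflection regions.
import cv2
import numpy as np

def remove_strong_reflections(img_bgr: np.ndarray) -> np.ndarray:
    # img_bgr: uint8 BGR face image captured through glass.
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    # Mark near-saturated pixels as reflection candidates (heuristic).
    mask = (gray > 240).astype(np.uint8) * 255
    mask = cv2.dilate(mask, np.ones((5, 5), np.uint8))
    # Fill the occluded regions from the surrounding facial content.
    return cv2.inpaint(img_bgr, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)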
Graphonomy: Universal Human Parsing via Graph Transfer Learning
Prior highly-tuned human parsing models tend to fit each dataset in a
specific domain or at a particular label granularity, and can hardly be adapted
to other human parsing tasks without extensive re-training. In this paper, we
aim to learn a single universal human parsing model that can tackle all kinds
of human parsing needs by unifying label annotations from different domains or
at various levels of granularity. This poses many fundamental learning
challenges, e.g., discovering underlying semantic structures among different
label granularities, performing proper transfer learning across different image
domains, and identifying and utilizing label redundancies across related tasks.
To address these challenges, we propose a new universal human parsing agent,
named "Graphonomy", which incorporates hierarchical graph transfer learning
upon the conventional parsing network to encode the underlying label semantic
structures and propagate relevant semantic information. In particular,
Graphonomy first learns and propagates compact high-level graph representation
among the labels within one dataset via Intra-Graph Reasoning, and then
transfers semantic information across multiple datasets via Inter-Graph
Transfer. Various graph transfer dependencies (e.g., similarity, linguistic
knowledge) between different datasets are analyzed and encoded to enhance graph
transfer capability. By distilling universal semantic graph representation to
each specific task, Graphonomy is able to predict all levels of parsing labels
in one system without piling up the complexity. Experimental results show
Graphonomy effectively achieves the state-of-the-art results on three human
parsing benchmarks as well as advantageous universal human parsing performance.
Comment: Accepted to CVPR 2019. The code is available at
https://github.com/Gaoyiminggithub/Graphonom
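The Intra-Graph Reasoning step can be pictured as graph convolution over
per-label node features. The sketch below (hypothetical shapes and names; see
the paper and repository for the actual model) performs one propagation step
with a learnable label adjacency.

# Minimal sketch of graph reasoning over label nodes (hypothetical names).
import torch
import torch.nn as nn

class IntraGraphReasoning(nn.Module):
    """One graph-convolution step over per-label node features:
    X' = ReLU(normalize(A) @ X @ W), with A a learnable label adjacency."""
    def __init__(self, num_labels: int, dim: int):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(num_labels))  # label-relation graph
        self.weight = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                   # x: (B, num_labels, dim)
        a = torch.softmax(self.adj, dim=-1)  # row-normalize the adjacency
        return torch.relu(self.weight(a @ x))

# Inter-Graph Transfer would map node features between two such graphs
# (e.g. coarse and fine label sets) through a transfer matrix.
gcn = IntraGraphReasoning(num_labels=20, dim=256)
nodes = gcn(torch.randn(4, 20, 256))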
Multi-Scale Body-Part Mask Guided Attention for Person Re-identification
Person re-identification is becoming an increasingly important task due to its
wide range of applications. In practice, person re-identification remains
challenging due to variations in person pose, lighting, occlusion,
misalignment, background clutter, etc. In this paper, we propose a multi-scale
body-part mask guided attention network (MMGA), which jointly learns whole-body
and body-part attention to help extract global and local features
simultaneously. In MMGA, body-part masks are used to guide the training of
corresponding attention. Experiments show that our proposed method can reduce
the negative influence of variations in person pose, misalignment, and background
clutter. Our method achieves rank-1/mAP of 95.0%/87.2% on the Market1501
dataset, 89.5%/78.1% on the DukeMTMC-reID dataset, outperforming current
state-of-the-art methods.
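A minimal sketch of mask-guided attention under assumed names: a predicted
attention map re-weights the feature map and is supervised by the body-part
mask during training. This is an illustration of the idea, not the MMGA code.

# Sketch of mask-guided attention (layer names are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskGuidedAttention(nn.Module):
    def __init__(self, in_ch: int):
        super().__init__()
        self.attn = nn.Conv2d(in_ch, 1, kernel_size=1)

    def forward(self, feat, part_mask=None):
        a = torch.sigmoid(self.attn(feat))    # (B, 1, H, W) attention map
        out = feat * a                        # attended features
        # During training, the body-part mask guides the attention map.
        loss = (F.binary_cross_entropy(a, part_mask)
                if part_mask is not None else None)
        return out, loss

m = MaskGuidedAttention(in_ch=256)
feat = torch.randn(2, 256, 24, 8)
mask = (torch.rand(2, 1, 24, 8) > 0.5).float()
out, attn_loss = m(feat, mask)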
Look into Person: Joint Body Parsing & Pose Estimation Network and A New Benchmark
Human parsing and pose estimation have recently received considerable
interest due to their substantial application potential. However, the existing
datasets contain limited numbers of images and annotations and lack variety in
human appearance and coverage of challenging cases in unconstrained
environments. In this paper, we introduce a new benchmark named "Look into
Person (LIP)" that provides a significant advancement in terms of scalability,
diversity, and difficulty, which are crucial for future developments in
human-centric analysis. This comprehensive dataset contains over 50,000
elaborately annotated images with 19 semantic part labels and 16 body joints,
which are captured from a broad range of viewpoints, occlusions, and background
complexities. Using these rich annotations, we perform detailed analyses of the
leading human parsing and pose estimation approaches, thereby obtaining
insights into the successes and failures of these methods. To further explore
and take advantage of the semantic correlation of these two tasks, we propose a
novel joint human parsing and pose estimation network to explore efficient
context modeling, which can simultaneously predict parsing and pose with
extremely high quality. Furthermore, we simplify the network to solve human
parsing by exploring a novel self-supervised structure-sensitive learning
approach, which imposes human pose structures into the parsing results without
resorting to extra supervision. The dataset, code and models are available at
http://www.sysu-hcp.net/lip/.
Comment: We propose the most comprehensive dataset around the world for
human-centric analysis! (Accepted by T-PAMI 2018.) The dataset, code, and
models are available at http://www.sysu-hcp.net/lip/ . arXiv admin note:
substantial text overlap with arXiv:1703.0544
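The self-supervised structure-sensitive idea can be sketched as follows. This
is a simplification with assumed names: the paper generates pose heatmaps
from parsing results, whereas here raw part centers derived from predicted
and ground-truth parsing maps simply re-weight the parsing loss.

# Simplified sketch of a structure-sensitive parsing loss.
import torch
import torch.nn.functional as F

def part_centers(label_map, num_parts):
    # label_map: (H, W) integer parsing labels -> (num_parts, 2) centers.
    h, w = label_map.shape
    ys = torch.arange(h).float().view(h, 1).expand(h, w)
    xs = torch.arange(w).float().view(1, w).expand(h, w)
    centers = []
    for p in range(num_parts):
        m = (label_map == p).float()
        n = m.sum().clamp(min=1.0)
        centers.append(torch.stack([(ys * m).sum() / n, (xs * m).sum() / n]))
    return torch.stack(centers)

def structure_sensitive_loss(logits, target, num_parts):
    # logits: (1, C, H, W); target: (H, W) long.
    ce = F.cross_entropy(logits, target.unsqueeze(0))
    pred = logits.argmax(dim=1)[0]
    dist = (part_centers(pred, num_parts)
            - part_centers(target, num_parts)).norm(dim=1)
    # Larger pose-structure deviation -> larger weight on the parsing loss.
    return ce * (1.0 + dist.mean() / max(logits.shape[-2:]))

loss = structure_sensitive_loss(
    torch.randn(1, 20, 64, 64), torch.randint(0, 20, (64, 64)), num_parts=20)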
A High-Efficiency Framework for Constructing Large-Scale Face Parsing Benchmark
Face parsing, which is to assign a semantic label to each pixel in face
images, has recently attracted increasing interest due to its huge application
potential. Although many face-related fields (e.g., face recognition and face
detection) have been well studied for many years, the existing datasets for
face parsing are still severely limited in terms of the scale and quality,
e.g., the widely used Helen dataset only contains 2,330 images. This is mainly
because pixel-level annotation is costly and time-consuming work,
especially for facial parts without clear boundaries. The lack of accurately
annotated datasets has become a major obstacle to progress on the face parsing
task. A feasible remedy is to utilize dense facial landmarks to guide the
parsing annotation. However, annotating dense landmarks on human faces
encounters the same issues as parsing annotation. To overcome the above
problems, in this paper, we develop a high-efficiency framework for face
parsing annotation, which considerably simplifies and speeds up the parsing
annotation via two consecutive modules. Benefiting from the proposed framework, we
construct a new Dense Landmark Guided Face Parsing (LaPa) benchmark. It
consists of 22,000 face images with large variations in expression, pose,
occlusion, etc. Each image is provided with an accurate annotation of an
11-category pixel-level label map along with the coordinates of 106-point
landmarks. To the best of our knowledge, it is currently the largest public
dataset for face parsing. To make full use of our LaPa dataset with abundant
face shape and boundary priors, we propose a simple yet effective
Boundary-Sensitive Parsing Network (BSPNet). Our network serves as a baseline
model on the proposed LaPa dataset and, at the same time, achieves
state-of-the-art performance on the Helen dataset without resorting to extra
face alignment.
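As one plausible reading of "boundary-sensitive", the sketch below (an
illustration of the idea, not the released BSPNet) up-weights the
cross-entropy loss near label boundaries extracted from the ground-truth map,
exploiting the boundary priors that LaPa provides.

# Sketch of a boundary-sensitive cross-entropy weighting.
import torch
import torch.nn.functional as F

def boundary_weight(target, width=2):
    # target: (B, H, W) integer label map -> (B, H, W) per-pixel weights.
    t = target.float().unsqueeze(1)
    # A pixel is near a boundary if any neighbor carries a different label.
    local_max = F.max_pool2d(t, 2 * width + 1, stride=1, padding=width)
    local_min = -F.max_pool2d(-t, 2 * width + 1, stride=1, padding=width)
    boundary = (local_max != local_min).float().squeeze(1)
    return 1.0 + 4.0 * boundary          # e.g. 5x weight near boundaries

def boundary_sensitive_ce(logits, target):
    ce = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
    return (ce * boundary_weight(target)).mean()

loss = boundary_sensitive_ce(
    torch.randn(2, 11, 48, 48), torch.randint(0, 11, (2, 48, 48)))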
ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language
Person search by natural language aims at retrieving a specific person in a
large-scale image pool that matches the given textual descriptions. While most
of the current methods treat the task as a holistic visual and textual feature
matching one, we approach it from an attribute-aligning perspective that allows
grounding specific attribute phrases to the corresponding visual regions. We
achieve a performance boost through robust feature learning in which the
referred identity can be accurately grounded by multiple visual attribute
cues. To be concrete, our Visual-Textual Attribute Alignment
model (dubbed as ViTAA) learns to disentangle the feature space of a person
into subspaces corresponding to attributes using a lightweight auxiliary
attribute segmentation branch. It then aligns these visual features with the
textual attributes parsed from the sentences by using a novel contrastive
learning loss. Upon that, we validate our ViTAA framework through extensive
experiments on tasks of person search by natural language and by
attribute-phrase queries, on which our system achieves state-of-the-art
performance. Code will be publicly available upon publication.
Comment: ECCV 2020, 18 pages, 6 figures
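The contrastive alignment step might look like the following sketch; the
symmetric InfoNCE form and the temperature value are assumptions, since the
paper defines its own loss. Matched visual/textual attribute features are
pulled together and mismatched pairs pushed apart.

# Sketch of contrastive visual-textual attribute alignment (simplified).
import torch
import torch.nn.functional as F

def attribute_contrastive_loss(vis, txt, temperature=0.07):
    # vis, txt: (N, D) features where row i of vis matches row i of txt.
    vis = F.normalize(vis, dim=1)
    txt = F.normalize(txt, dim=1)
    logits = vis @ txt.t() / temperature        # (N, N) similarity matrix
    labels = torch.arange(vis.size(0))
    # Symmetric InfoNCE over both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

loss = attribute_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))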
Multi-Granularity Reasoning for Social Relation Recognition from Images
Discovering social relations in images can make machines better interpret the
behavior of human beings. However, automatically recognizing social relations
in images is a challenging task due to the significant gap between the domains
of visual content and social relation. Existing studies separately process
various features such as facial expressions, body appearance, and contextual
objects, and thus cannot comprehensively capture the multi-granularity
semantics, such as scenes, regional cues of persons, and interactions among
persons and objects. To bridge the domain gap, we propose a Multi-Granularity
Reasoning framework for social relation recognition from images. The global
knowledge and mid-level details are learned from the whole scene and the
regions of persons and objects, respectively. Most importantly, we explore the
fine-grained pose keypoints of persons to discover the interactions among
persons and objects. Specifically, the pose-guided Person-Object Graph and
Person-Pose Graph are proposed to model the actions from persons to objects and
the interactions between paired persons, respectively. Based on the graphs,
social relation reasoning is performed by graph convolutional networks.
Finally, the global features and reasoned knowledge are integrated as a
comprehensive representation for social relation recognition. Extensive
experiments on two public datasets show the effectiveness of the proposed
framework.
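A minimal sketch of the graph-reasoning step follows; building the adjacency
from pose-keypoint proximity is an assumed simplification of the paper's
Person-Object and Person-Pose graphs.

# Sketch of graph convolution over person/object nodes (assumed names).
import torch
import torch.nn as nn

class RelationGCN(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Linear(dim, dim)

    def forward(self, nodes, adj):
        # nodes: (N, dim) person/object features; adj: (N, N) edge weights,
        # e.g. larger for objects close to a person's hand keypoints.
        a = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.weight(a @ nodes))

gcn = RelationGCN(dim=128)
nodes = torch.randn(6, 128)                # 2 persons + 4 contextual objects
adj = torch.rand(6, 6)                     # pose-guided edge weights
relation_feat = gcn(nodes, adj).mean(dim=0)  # pooled for relation prediction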
Anti-Confusing: Region-Aware Network for Human Pose Estimation
In this work, we propose a novel framework named Region-Aware Network
(RANet), which learns the ability to resist confusion in cases of heavy
occlusion, nearby persons, and symmetric appearance, for human pose estimation.
Specifically, the proposed method addresses three key aspects, i.e., data
augmentation, feature learning and prediction fusion, respectively. First, we
propose Parsing-based Data Augmentation (PDA) to generate abundant data that
synthesizes confusing textures. Second, we not only propose a Feature Pyramid
Stem (FPS) to learn stronger low-level features in the lower stages, but also
incorporate an Effective Region Extraction (ERE) module to extract better
target-specific features. Third, we introduce Cascade Voting Fusion (CVF) to
explicitly exclude the inferior predictions and fuse the rest effective
predictions for the final pose estimation. Extensive experimental results on
two popular benchmarks, i.e., MPII and LSP, demonstrate the effectiveness of our
method against state-of-the-art competitors. In particular, our method
achieves significant improvements on easily confusable joints.
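The fusion step could be sketched as below; the confidence threshold and the
weighted averaging scheme are assumptions about how "inferior predictions are
excluded and the rest fused".

# Sketch of a cascade voting fusion over candidate joint predictions.
import torch

def cascade_voting_fusion(coords, conf, thresh=0.5):
    # coords: (K, J, 2) candidate (x, y) per joint from K predictions,
    # conf:   (K, J) confidence of each candidate.
    keep = (conf >= thresh).float()                      # drop inferior ones
    keep = keep + (keep.sum(dim=0, keepdim=True) == 0)   # fallback: keep all
    w = keep * conf
    w = w / w.sum(dim=0, keepdim=True).clamp(min=1e-6)
    return (coords * w.unsqueeze(-1)).sum(dim=0)         # (J, 2) fused joints

fused = cascade_voting_fusion(torch.rand(3, 16, 2), torch.rand(3, 16))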