Zoom Better to See Clearer: Human and Object Parsing with Hierarchical Auto-Zoom Net
Parsing articulated objects, e.g. humans and animals, into semantic parts
(e.g. body, head and arms, etc.) from natural images is a challenging and
fundamental problem for computer vision. A big difficulty is the large
variability of scale and location for objects and their corresponding parts.
Even limited mistakes in estimating scale and location will degrade the parsing
output and cause errors in boundary details. To tackle these difficulties, we
propose a "Hierarchical Auto-Zoom Net" (HAZN) for object part parsing which
adapts to the local scales of objects and parts. HAZN is a sequence of two
"Auto-Zoom Net" (AZNs), each employing fully convolutional networks that
perform two tasks: (1) predict the locations and scales of object instances
(the first AZN) or their parts (the second AZN); (2) estimate the part scores
for predicted object instance or part regions. Our model can adaptively "zoom"
(resize) predicted image regions into their proper scales to refine the
parsing.
We conduct extensive experiments over the PASCAL part datasets on humans,
horses, and cows. For humans, our approach significantly outperforms the
state of the art by 5% mIOU and is especially better at segmenting small
instances and small parts. We obtain similar improvements for parsing cows and
horses over alternative methods. In summary, our strategy of first zooming into
objects and then zooming into parts is very effective. It also enables us to
process different regions of the image at different scales adaptively so that,
for example, we do not need to waste computational resources scaling the entire
image.
Comment: A shortened version has been submitted to ECCV 2016.
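A minimal sketch of the zoom-and-refine idea described in the abstract, assuming hypothetical scale/location and part-scoring networks (`scale_net` and `part_net` are placeholders for illustration, not the authors' released code):

```python
import torch
import torch.nn.functional as F

def auto_zoom_parse(image, scale_net, part_net, target_size=385):
    """One Auto-Zoom stage: predict a region and its scale, zoom, then re-score parts.

    image: (1, 3, H, W) tensor; scale_net and part_net are assumed FCN-style modules
    returning a (cx, cy, w, h) box and per-pixel part scores, respectively.
    """
    # 1) Predict the location and scale of the object (or part) region.
    cx, cy, w, h = scale_net(image)                    # hypothetical interface
    x0, y0 = max(int(cx - w / 2), 0), max(int(cy - h / 2), 0)
    crop = image[:, :, y0:y0 + int(h), x0:x0 + int(w)]

    # 2) "Zoom": resize the predicted region to the scale the part network expects.
    zoomed = F.interpolate(crop, size=(target_size, target_size),
                           mode='bilinear', align_corners=False)

    # 3) Re-estimate part scores at the proper scale and map them back to the crop size.
    part_scores = part_net(zoomed)                     # (1, num_parts, target_size, target_size)
    part_scores = F.interpolate(part_scores, size=(int(h), int(w)),
                                mode='bilinear', align_corners=False)
    return (x0, y0), part_scores
```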
Look into Person: Self-supervised Structure-sensitive Learning and A New Benchmark for Human Parsing
Human parsing has recently attracted a lot of research interest due to its
huge application potential. However, existing datasets contain a limited number
of images and annotations, and lack variety in human appearances and coverage
of challenging cases in unconstrained environments. In this paper, we
introduce a new benchmark "Look into Person (LIP)" that makes a significant
advance in terms of scalability, diversity and difficulty, a contribution that
we feel is crucial for future developments in human-centric analysis. This
comprehensive dataset contains over 50,000 elaborately annotated images with 19
semantic part labels, which are captured from a wider range of viewpoints,
occlusions and background complexity. Given these rich annotations we perform
detailed analyses of the leading human parsing approaches, gaining insights
into the successes and failures of these methods. Furthermore, in contrast to the
existing efforts on improving the feature discriminative capability, we solve
human parsing by exploring a novel self-supervised structure-sensitive learning
approach, which imposes human pose structures into parsing results without
resorting to extra supervision (i.e., no need for specifically labeling human
joints in model training). Our self-supervised learning framework can be
injected into any advanced neural networks to help incorporate rich high-level
knowledge regarding human joints from a global perspective and improve the
parsing results. Extensive evaluations on our LIP and the public
PASCAL-Person-Part dataset demonstrate the superiority of our method.
Comment: Accepted to appear in CVPR 2017.
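One way to read the self-supervised structure-sensitive idea is to derive pseudo joint positions from part masks (for example, soft centroids of each part) and penalise parsing results whose implied joint structure deviates from that of the ground-truth masks. The sketch below follows that reading only; the centroid-based joints and the weighting scheme are assumptions, not the paper's exact formulation:

```python
import torch

def mask_centroids(part_probs):
    """part_probs: (B, P, H, W) soft part masks -> (B, P, 2) normalized soft centroids."""
    B, P, H, W = part_probs.shape
    ys = torch.linspace(0, 1, H, device=part_probs.device).view(1, 1, H, 1)
    xs = torch.linspace(0, 1, W, device=part_probs.device).view(1, 1, 1, W)
    mass = part_probs.sum(dim=(2, 3)).clamp(min=1e-6)
    cy = (part_probs * ys).sum(dim=(2, 3)) / mass
    cx = (part_probs * xs).sum(dim=(2, 3)) / mass
    return torch.stack([cx, cy], dim=-1)

def structure_sensitive_loss(pred_probs, gt_onehot, per_sample_ce):
    """Weight the usual parsing loss by the discrepancy between predicted and
    ground-truth pseudo-joint structures (no extra joint annotations required).

    per_sample_ce: (B,) parsing cross-entropy per image.
    """
    joints_pred = mask_centroids(pred_probs)
    joints_gt = mask_centroids(gt_onehot.float())
    structure_gap = (joints_pred - joints_gt).norm(dim=-1).mean(dim=-1)  # (B,)
    return ((1.0 + structure_gap) * per_sample_ce).mean()
```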
Soft-Gated Warping-GAN for Pose-Guided Person Image Synthesis
Despite remarkable advances in image synthesis research, existing works often
fail in manipulating images under the context of large geometric
transformations. Synthesizing person images conditioned on arbitrary poses is
one of the most representative examples where the generation quality largely
relies on the capability of identifying and modeling arbitrary transformations
on different body parts. Current generative models are often built on local
convolutions and overlook the key challenges (e.g. heavy occlusions, different
views or dramatic appearance changes) when distinct geometric changes happen
for each part, caused by arbitrary pose manipulations. This paper aims to
resolve these challenges induced by geometric variability and spatial
displacements via a new Soft-Gated Warping Generative Adversarial Network
(Warping-GAN), which is composed of two stages: 1) it first synthesizes a
target part segmentation map given a target pose, which depicts the
region-level spatial layouts for guiding image synthesis with higher-level
structure constraints; 2) the Warping-GAN equipped with a soft-gated
warping-block learns feature-level mapping to render textures from the original
image into the generated segmentation map. Warping-GAN is capable of
controlling different transformation degrees given distinct target poses.
Moreover, the proposed warping-block is light-weight and flexible enough to be
injected into any networks. Human perceptual studies and quantitative
evaluations demonstrate the superiority of our Warping-GAN that significantly
outperforms all existing methods on two large datasets.
Comment: 17 pages, 14 figures.
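A minimal sketch of a soft-gated warping block under common assumptions: a learned offset field warps source features toward the target layout, and a sigmoid gate softly blends warped and unwarped features per location. Module names and interfaces are illustrative, not the released Warping-GAN code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftGatedWarpingBlock(nn.Module):
    """Warp source features with a predicted sampling grid, then gate softly."""
    def __init__(self, channels):
        super().__init__()
        # Predict a 2-channel offset field and a 1-channel gate from the
        # concatenation of source features and target-segmentation features.
        self.offset = nn.Conv2d(channels * 2, 2, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(channels * 2, 1, kernel_size=3, padding=1)

    def forward(self, src_feat, tgt_feat):
        B, C, H, W = src_feat.shape
        x = torch.cat([src_feat, tgt_feat], dim=1)

        # Identity sampling grid in [-1, 1], plus a learned offset.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=x.device),
                                torch.linspace(-1, 1, W, device=x.device),
                                indexing='ij')
        base_grid = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)
        offset = self.offset(x).permute(0, 2, 3, 1)           # (B, H, W, 2)
        warped = F.grid_sample(src_feat, base_grid + offset, align_corners=False)

        # Soft gate in [0, 1] controls how much warping is applied per location.
        g = torch.sigmoid(self.gate(x))
        return g * warped + (1 - g) * src_feat
```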
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
This paper gives the futuristic challenges discussed in the cvpaper.challenge.
In 2015 and 2016, we thoroughly studied 1,600+ papers in several
conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
Attribute And-Or Grammar for Joint Parsing of Human Attributes, Part and Pose
This paper presents an attribute and-or grammar (A-AOG) model for jointly
inferring human body pose and human attributes in a parse graph with attributes
augmented to nodes in the hierarchical representation. In contrast to other
popular methods in the current literature that train separate classifiers for
poses and individual attributes, our method explicitly represents the
decomposition and articulation of body parts, and accounts for the correlations
between poses and attributes. The A-AOG model is an amalgamation of three
traditional grammar formulations: (i) Phrase structure grammar representing the
hierarchical decomposition of the human body from whole to parts; (ii)
Dependency grammar modeling the geometric articulation by a kinematic graph of
the body pose; and (iii) Attribute grammar accounting for the compatibility
relations between different parts in the hierarchy so that their appearances
follow a consistent style. The parse graph outputs human detection, pose
estimation, and attribute prediction simultaneously, which are intuitive and
interpretable. We conduct experiments on two tasks on two datasets, and
experimental results demonstrate the advantage of joint modeling in comparison
with computing poses and attributes independently. Furthermore, our model
obtains better performance than existing methods on both the pose estimation and
attribute prediction tasks.
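As a rough illustration of the joint representation only, a parse-graph node can carry a part's geometry, its attribute evidence, its phrase-structure children, and its kinematic neighbours. This is a data-structure sketch under those assumptions, not the paper's grammar or inference algorithm:

```python
from dataclasses import dataclass, field

@dataclass
class PartNode:
    """One node of an attributed parse graph for a human body part."""
    name: str                                                  # e.g. "torso", "head", "upper-arm"
    box: tuple                                                 # (x, y, w, h) geometry of the part
    attributes: dict = field(default_factory=dict)             # e.g. {"glasses": 0.1, "long-hair": 0.8}
    children: list = field(default_factory=list)               # phrase-structure decomposition
    kinematic_neighbors: list = field(default_factory=list)    # dependency-grammar (articulation) edges

def collect_attributes(node, out=None):
    """Aggregate attribute evidence over the whole-to-parts hierarchy."""
    out = {} if out is None else out
    for key, score in node.attributes.items():
        out[key] = max(out.get(key, 0.0), score)   # simple max-pooling over parts
    for child in node.children:
        collect_attributes(child, out)
    return out
```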
The ApolloScape Open Dataset for Autonomous Driving and its Application
Autonomous driving has attracted tremendous attention especially in the past
few years. The key techniques for a self-driving car include solving tasks like
3D map construction, self-localization, parsing the driving road and
understanding objects, which enable vehicles to reason and act. However,
large-scale datasets for training and system evaluation are still a bottleneck for
developing robust perception models. In this paper, we present the ApolloScape
dataset [1] and its applications for autonomous driving. Compared with existing
public datasets from real scenes, e.g. KITTI [2] or Cityscapes [3], ApolloScape
contains much larger and richer labelling, including a holistic semantic dense
point cloud for each site, stereo imagery, per-pixel semantic labelling, lanemark
labelling, instance segmentation, 3D car instances, and highly accurate locations
for every frame in various driving videos from multiple sites, cities and daytimes.
For each task, it contains at least 15x more images than SOTA datasets. To label
such a complete dataset, we develop various tools and algorithms tailored to each
task to accelerate the labelling process, such as 3D-2D segment labeling tools,
active labelling in videos, etc. Based on ApolloScape, we are able to develop
algorithms that jointly consider the learning and inference of multiple tasks. In
this paper, we provide a sensor fusion
scheme integrating camera videos, consumer-grade motion sensors (GPS/IMU), and
a 3D semantic map in order to achieve robust self-localization and semantic
segmentation for autonomous driving. We show that practically, sensor fusion
and joint learning of multiple tasks are beneficial to achieve a more robust
and accurate system. We expect that our dataset and the proposed algorithms can
support and motivate researchers toward further development of multi-sensor fusion
and multi-task learning in the field of computer vision.
Comment: Version 4: Accepted by TPAMI. Version 3: 17 pages, 10 tables, 11
figures, added the application (DeLS-3D) based on the ApolloScape Dataset.
Version 2: 7 pages, 6 figures, added comparison with the BDD100K dataset.
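The sensor-fusion scheme is only summarised in the abstract; a very rough sketch of the general idea (combining a GPS/IMU motion prior with a camera-derived pose estimate, e.g. from matching against the 3D semantic map) might look like the following. The information-filter-style update and the interfaces are assumptions for illustration, not the paper's method:

```python
import numpy as np

def fuse_pose(prior_pose, prior_cov, cam_pose, cam_cov):
    """Fuse a GPS/IMU-predicted pose with a camera-derived pose.

    Poses are 6-vectors (x, y, z, roll, pitch, yaw); covariances are 6x6.
    Angles are treated naively here (no wrap-around handling) to keep the sketch short.
    """
    prior_info = np.linalg.inv(prior_cov)
    cam_info = np.linalg.inv(cam_cov)
    fused_cov = np.linalg.inv(prior_info + cam_info)
    fused_pose = fused_cov @ (prior_info @ prior_pose + cam_info @ cam_pose)
    return fused_pose, fused_cov
```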
Joint Multi-Person Pose Estimation and Semantic Part Segmentation
Human pose estimation and semantic part segmentation are two complementary
tasks in computer vision. In this paper, we propose to solve the two tasks
jointly for natural multi-person images, in which the estimated pose provides
an object-level shape prior to regularize part segments, while the part-level
segments constrain the variation of pose locations. Specifically, we first
train two fully convolutional neural networks (FCNs), namely Pose FCN and Part
FCN, to provide initial estimation of pose joint potential and semantic part
potential. Then, to refine pose joint location, the two types of potentials are
fused with a fully-connected conditional random field (FCRF), where a novel
segment-joint smoothness term is used to encourage semantic and spatial
consistency between parts and joints. To refine part segments, the refined pose
and the original part potential are integrated through a Part FCN, where the
skeleton feature from pose serves as additional regularization cues for part
segments. Finally, to reduce the complexity of the FCRF, we induce human
detection boxes and infer the graph inside each box, making the inference forty
times faster.
Since there's no dataset that contains both part segments and pose labels, we
extend the PASCAL VOC part dataset with human pose joints and perform extensive
experiments to compare our method against several most recent strategies. We
show that on this dataset our algorithm surpasses competing methods by a large
margin in both tasks.
Comment: This paper has been accepted by CVPR 2017.
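To make the fusion step concrete, one simple reading is to re-score each candidate joint location by combining its Pose FCN potential with a segment-joint consistency term that checks whether the joint lands on its semantically compatible part. The function below is a toy re-scoring sketch under that assumption, not the paper's FCRF inference:

```python
import torch

def rescore_joints(joint_potential, part_potential, joint_to_part, alpha=0.5):
    """joint_potential: (J, H, W) pose scores; part_potential: (P, H, W) part scores.

    joint_to_part: list of length J mapping each joint to its compatible part index.
    Returns fused joint scores that encourage semantic/spatial consistency.
    """
    fused = []
    for j, p in enumerate(joint_to_part):
        # Simplified segment-joint smoothness: a joint is more likely where
        # its compatible part has high probability.
        fused.append(joint_potential[j] + alpha * part_potential[p])
    return torch.stack(fused)          # (J, H, W)
```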
High-Resolution Representations for Labeling Pixels and Regions
High-resolution representation learning plays an essential role in many
vision problems, e.g., pose estimation and semantic segmentation. The
high-resolution network (HRNet)~\cite{SunXLW19}, recently developed for human
pose estimation, maintains high-resolution representations through the whole
process by connecting high-to-low resolution convolutions in \emph{parallel}
and produces strong high-resolution representations by repeatedly conducting
fusions across parallel convolutions.
In this paper, we conduct a further study on high-resolution representations
by introducing a simple yet effective modification and apply it to a wide range
of vision tasks. We augment the high-resolution representation by aggregating
the (upsampled) representations from all the parallel convolutions rather than
only the representation from the high-resolution convolution as done
in~\cite{SunXLW19}. This simple modification leads to stronger representations,
evidenced by superior results. We show top results in semantic segmentation on
Cityscapes, LIP, and PASCAL Context, and facial landmark detection on AFLW,
COFW, 300W, and WFLW. In addition, we build a multi-level representation from
the high-resolution representation and apply it to the Faster R-CNN object
detection framework and the extended frameworks. The proposed approach achieves
superior results to existing single-model networks on COCO object detection.
The code and models are publicly available at
\url{https://github.com/HRNet}.
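The described modification is small enough to sketch directly: instead of keeping only the highest-resolution branch output, upsample every parallel branch to the highest resolution and concatenate along the channel dimension. Shapes and names below are illustrative, assuming feature maps ordered from highest to lowest resolution:

```python
import torch
import torch.nn.functional as F

def aggregate_hr_representation(branch_feats):
    """branch_feats: list of (B, C_i, H_i, W_i) feature maps from parallel branches,
    ordered from highest resolution (index 0) to lowest. Returns one high-resolution
    map formed by upsampling all branches and concatenating along channels."""
    target_hw = branch_feats[0].shape[-2:]
    upsampled = [branch_feats[0]] + [
        F.interpolate(f, size=target_hw, mode='bilinear', align_corners=False)
        for f in branch_feats[1:]
    ]
    return torch.cat(upsampled, dim=1)
```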
UniHCP: A Unified Model for Human-Centric Perceptions
Human-centric perceptions (e.g., pose estimation, human parsing, pedestrian
detection, person re-identification, etc.) play a key role in industrial
applications of visual models. While specific human-centric tasks have their
own relevant semantic aspect to focus on, they also share the same underlying
semantic structure of the human body. However, few works have attempted to
exploit such homogeneity and design a general-purpose model for human-centric
tasks. In this work, we revisit a broad range of human-centric tasks and unify
them in a minimalist manner. We propose UniHCP, a Unified Model for
Human-Centric Perceptions, which unifies a wide range of human-centric tasks in
a simplified end-to-end manner with the plain vision transformer architecture.
With large-scale joint training on 33 human-centric datasets, UniHCP can
outperform strong baselines on several in-domain and downstream tasks by direct
evaluation. When adapted to a specific task, UniHCP achieves new SOTAs on a
wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing,
86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID,
and 85.8 JI on CrowdHuman for pedestrian detection, performing better than
specialized models tailored for each task.
Comment: Accepted for publication at the IEEE/CVF Conference on Computer
Vision and Pattern Recognition 2023 (CVPR 2023).
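At the level of detail given in the abstract, the unification amounts to one shared plain-transformer encoder serving several task-specific output heads trained jointly across datasets. The sketch below reflects only that reading; how UniHCP actually shares queries and decoders across tasks is not reproduced here:

```python
import torch.nn as nn

class SharedHumanCentricModel(nn.Module):
    """One shared encoder, several lightweight task heads (illustrative only)."""
    def __init__(self, encoder, embed_dim, num_parsing_classes, num_attributes, reid_dim):
        super().__init__()
        self.encoder = encoder                      # e.g. a plain ViT backbone returning (B, N, D) tokens
        self.parsing_head = nn.Linear(embed_dim, num_parsing_classes)
        self.attribute_head = nn.Linear(embed_dim, num_attributes)
        self.reid_head = nn.Linear(embed_dim, reid_dim)

    def forward(self, images, task):
        tokens = self.encoder(images)               # (B, N, embed_dim)
        if task == "parsing":
            return self.parsing_head(tokens)        # per-token part logits
        if task == "attribute":
            return self.attribute_head(tokens.mean(dim=1))   # image-level attribute logits
        if task == "reid":
            return self.reid_head(tokens.mean(dim=1))        # embedding for retrieval
        raise ValueError(f"unknown task: {task}")
```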
Cross-Domain Complementary Learning Using Pose for Multi-Person Part Segmentation
Supervised deep learning with pixel-wise training labels has achieved great success
on multi-person part segmentation. However, data labeling at the pixel level is
very expensive. To address this problem, researchers have explored using
synthetic data to avoid manual labeling. Although it is easy to generate
labels for synthetic data, the results are much worse compared to those using
real data and manual labeling. The degradation of the performance is mainly due
to the domain gap, i.e., the discrepancy of the pixel value statistics between
real and synthetic data. In this paper, we observe that real and synthetic
humans both have a skeleton (pose) representation. We find that the skeleton
can effectively bridge the synthetic and real domains during training. Our
proposed approach takes advantage of the rich and realistic variations of the
real data and the easily obtainable labels of the synthetic data to learn
multi-person part segmentation on real images without any human-annotated
labels. Through experiments, we show that without any human labeling, our
method performs comparably to several state-of-the-art approaches which require
human labeling on Pascal-Person-Parts and COCO-DensePose datasets. On the other
hand, if part labels are also available for the real images during training, our
method outperforms the supervised state-of-the-art methods by a large margin.
We further demonstrate the generalizability of our method on predicting novel
keypoints in real images where no real data labels are available for the novel
keypoints detection. Code and pre-trained models are available at
https://github.com/kevinlin311tw/CDCL-human-part-segmentation
Comment: To appear in IEEE Transactions on Circuits and Systems for Video
Technology; Presented at ICCV 2019 Demonstration.
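One way to picture the training recipe is a shared backbone with two heads: a pose head supervised on both domains (skeletons are available for real and synthetic humans), and a part-segmentation head supervised only on synthetic labels. The loop below is a hedged sketch of such a scheme, not the released CDCL training code:

```python
import torch

def training_step(backbone, pose_head, part_head, real_batch, syn_batch,
                  pose_loss_fn, part_loss_fn, optimizer):
    """real_batch: (images, pose_labels); syn_batch: (images, pose_labels, part_labels)."""
    optimizer.zero_grad()

    # Pose supervision on both domains: the shared skeleton representation
    # is what bridges the synthetic and real data.
    real_imgs, real_pose = real_batch
    syn_imgs, syn_pose, syn_parts = syn_batch
    real_feat, syn_feat = backbone(real_imgs), backbone(syn_imgs)
    loss = pose_loss_fn(pose_head(real_feat), real_pose) \
         + pose_loss_fn(pose_head(syn_feat), syn_pose)

    # Part-segmentation supervision only where labels are free: the synthetic domain.
    loss = loss + part_loss_fn(part_head(syn_feat), syn_parts)

    loss.backward()
    optimizer.step()
    return loss.item()
```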