Keypoint Based Weakly Supervised Human Parsing
Fully convolutional networks (FCN) have achieved great success in human
parsing in recent years. In conventional human parsing tasks, pixel-level
labeling is required for guiding the training, which usually involves enormous
human labeling efforts. To ease the labeling efforts, we propose a novel weakly
supervised human parsing method which only requires simple object keypoint
annotations for learning. We develop an iterative learning method to generate
pseudo part segmentation masks from keypoint labels. With these pseudo masks,
we train an FCN to output pixel-level human parsing predictions.
Furthermore, we develop a correlation network to perform joint prediction of
part and object segmentation masks and improve the segmentation performance.
Experimental results show that our weakly supervised method achieves very
competitive human parsing results. Although our method uses only simple
keypoint annotations for learning, it achieves performance comparable to
fully supervised methods that use expensive pixel-level annotations.
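
Since the abstract only sketches the pseudo-mask generation at a high level, here is a minimal illustrative sketch of deriving seed part masks from keypoints. The nearest-keypoint assignment rule, radius, and array shapes are assumptions for illustration, not the authors' exact iterative procedure.

```python
# A minimal sketch: seed part masks from keypoints, assuming one keypoint
# per part. The paper's actual iterative refinement is more involved.
import numpy as np

def pseudo_masks_from_keypoints(keypoints, shape, radius=20):
    """keypoints: (P, 2) array of (y, x) per part; shape: (H, W).
    Returns an (H, W) label map: 0 = background, 1..P = part ids."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Distance from every pixel to every part keypoint: (P, H, W).
    d = np.sqrt((ys[None] - keypoints[:, 0, None, None]) ** 2 +
                (xs[None] - keypoints[:, 1, None, None]) ** 2)
    labels = d.argmin(axis=0) + 1          # nearest part id per pixel
    labels[d.min(axis=0) > radius] = 0     # far pixels stay background
    return labels

# In the iterative scheme, an FCN trained on these seeds would re-predict
# masks, which are filtered by the keypoints and used for retraining.
masks = pseudo_masks_from_keypoints(np.array([[30, 40], [80, 60]]), (128, 128))
```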
MONET: Multiview Semi-supervised Keypoint Detection via Epipolar Divergence
This paper presents MONET, an end-to-end semi-supervised learning framework
for a keypoint detector using multiview image streams. In particular, we
consider general subjects such as non-human species where attaining a large
scale annotated dataset is challenging. While multiview geometry can be used to
self-supervise the unlabeled data, integrating the geometry into learning a
keypoint detector is challenging due to representation mismatch. We address
this mismatch by formulating a new differentiable representation of the
epipolar constraint called epipolar divergence, a generalized distance from
the epipolar lines to the corresponding keypoint distribution. Epipolar
divergence characterizes when the keypoint distributions from two views
produce zero reprojection error. We design a twin network that minimizes the
epipolar divergence through stereo rectification, which significantly
alleviates computational complexity and sampling aliasing during training. We
demonstrate that
our framework can localize customized keypoints of diverse species, e.g.,
humans, dogs, and monkeys.
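
As a concrete illustration of the epipolar-divergence idea, here is a minimal numpy sketch, assuming a known fundamental matrix F mapping view 1 to view 2 and a normalized keypoint heatmap in view 2. The unrectified point-to-line formulation and names are illustrative; the paper minimizes this after stereo rectification for efficiency.

```python
# A minimal sketch of an epipolar-divergence-style loss.
import numpy as np

def epipolar_divergence(kp1_hom, heatmap2, F):
    """kp1_hom: (3,) homogeneous keypoint in view 1.
    heatmap2: (H, W) keypoint distribution in view 2 (sums to 1).
    F: (3, 3) fundamental matrix from view 1 to view 2.
    Returns the expected point-to-epipolar-line distance in view 2."""
    a, b, c = F @ kp1_hom                    # epipolar line ax + by + c = 0
    H, W = heatmap2.shape
    ys, xs = np.mgrid[0:H, 0:W]
    dist = np.abs(a * xs + b * ys + c) / np.sqrt(a ** 2 + b ** 2)
    return (heatmap2 * dist).sum()           # zero iff mass lies on the line
```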
Cross-Domain Complementary Learning Using Pose for Multi-Person Part Segmentation
Supervised deep learning with pixel-wise training labels has achieved great
success on multi-person part segmentation. However, data labeling at the pixel
level is very expensive. To address this problem, researchers have explored
using synthetic data to avoid manual labeling. Although it is easy to generate
labels for synthetic data, the results are much worse compared to those using
real data and manual labeling. The degradation of the performance is mainly due
to the domain gap, i.e., the discrepancy of the pixel value statistics between
real and synthetic data. In this paper, we observe that real and synthetic
humans both have a skeleton (pose) representation. We find that these skeletons
can effectively bridge the synthetic and real domains during training. Our
proposed approach takes advantage of the rich and realistic variations of the
real data and the easily obtainable labels of the synthetic data to learn
multi-person part segmentation on real images without any human-annotated
labels. Through experiments, we show that without any human labeling, our
method performs comparably to several state-of-the-art approaches which require
human labeling on Pascal-Person-Parts and COCO-DensePose datasets. On the other
hand, if part labels are also available in the real images during training, our
method outperforms the supervised state-of-the-art methods by a large margin.
We further demonstrate the generalizability of our method by predicting novel
keypoints in real images where no real labels are available for the novel
keypoints. Code and pre-trained models are available at
https://github.com/kevinlin311tw/CDCL-human-part-segmentation
Comment: To appear in IEEE Transactions on Circuits and Systems for Video
Technology; presented at ICCV 2019 Demonstration.
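
A minimal sketch of the cross-domain training idea: one shared backbone with a part head supervised on synthetic labels and a pose head supervised on real keypoint heatmaps. The tiny stand-in modules, class counts, and unit loss weighting are assumptions for illustration, not the paper's architecture.

```python
# A minimal sketch of cross-domain complementary training (illustrative).
import torch
import torch.nn as nn

backbone = nn.Conv2d(3, 64, 3, padding=1)    # stand-in for a real encoder
part_head = nn.Conv2d(64, 7, 1)              # 7 part classes (assumed)
pose_head = nn.Conv2d(64, 17, 1)             # 17 keypoint heatmaps (assumed)
opt = torch.optim.Adam([*backbone.parameters(),
                        *part_head.parameters(), *pose_head.parameters()])

def step(syn_img, syn_parts, real_img, real_heatmaps):
    # Part loss on synthetic images, pose loss on real images; the skeleton
    # supervision is what bridges the two domains.
    f_syn, f_real = backbone(syn_img), backbone(real_img)
    loss = (nn.functional.cross_entropy(part_head(f_syn), syn_parts) +
            nn.functional.mse_loss(pose_head(f_real), real_heatmaps))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Dummy batch purely to show the shapes involved.
loss = step(torch.randn(2, 3, 64, 64), torch.randint(0, 7, (2, 64, 64)),
            torch.randn(2, 3, 64, 64), torch.rand(2, 17, 64, 64))
```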
Weakly and Semi Supervised Human Body Part Parsing via Pose-Guided Knowledge Transfer
Human body part parsing, or human semantic part segmentation, is fundamental
to many computer vision tasks. In conventional semantic segmentation methods,
the ground truth segmentations are provided, and fully convolutional networks
(FCN) are trained in an end-to-end scheme. Although these methods have
demonstrated impressive results, their performance highly depends on the
quantity and quality of training data. In this paper, we present a novel method
to generate synthetic human part segmentation data using easily-obtained human
keypoint annotations. Our key idea is to exploit the anatomical similarity
among humans to transfer the parsing results of one person to another person with
similar pose. Using these estimated results as additional training data, our
semi-supervised model outperforms its strongly supervised counterpart by 6 mIoU
on the PASCAL-Person-Part dataset, and we achieve state-of-the-art human
parsing results. Our approach is general and can be readily extended to other
object/animal parsing tasks, assuming their anatomical similarity can be
annotated with keypoints. The proposed model and accompanying source code are
available at https://github.com/MVIG-SJTU/WSHP
Comment: CVPR'18 spotlight
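
To make the pose-guided transfer concrete, here is a minimal sketch of retrieving the labeled person whose normalized pose best matches a query pose; its part mask would then be warped into a pseudo label. The centering/scale normalization and plain Euclidean distance are simplified assumptions, not the paper's full pipeline.

```python
# A minimal sketch of pose-guided nearest-neighbour retrieval.
import numpy as np

def normalize_pose(kp):
    """kp: (J, 2) joint coordinates. Center on the mean joint, unit scale."""
    kp = kp - kp.mean(axis=0)
    return kp / (np.linalg.norm(kp) + 1e-8)

def nearest_pose(query_kp, labeled_kps):
    """labeled_kps: (N, J, 2). Returns the index of the most similar pose,
    whose part segmentation would be warped onto the query person."""
    q = normalize_pose(query_kp)
    dists = [np.linalg.norm(q - normalize_pose(kp)) for kp in labeled_kps]
    return int(np.argmin(dists))
```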
Learning Correspondence from the Cycle-Consistency of Time
We introduce a self-supervised method for learning visual correspondence from
unlabeled video. The main idea is to use cycle-consistency in time as a free
supervisory signal for learning visual representations from scratch. At
training time, our model learns a feature map representation to be useful for
performing cycle-consistent tracking. At test time, we use the acquired
representation to find nearest neighbors across space and time. We demonstrate
the generalizability of the representation -- without finetuning -- across a
range of visual correspondence tasks, including video object segmentation,
keypoint tracking, and optical flow. Our approach outperforms previous
self-supervised methods and performs competitively with strongly supervised
methods.
Comment: CVPR 2019 Oral. Project page: http://ajabri.github.io/timecycl
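
As a toy illustration of the cycle-consistency signal, the sketch below tracks a patch forward through a few frames by nearest-neighbour feature matching and then back, and measures whether the cycle returns to its start. A learned version would make the matching soft and differentiable and penalize spatial drift; this hard-matching variant is an assumption for clarity.

```python
# A toy cycle-consistency check on patch features (illustrative only).
import numpy as np

def track(query, frame_feats):
    """query: (D,) feature; frame_feats: (P, D). Returns best-match index."""
    return int(np.argmax(frame_feats @ query))

def cycle_error(start_idx, frames):
    """frames: list of (P, D) patch-feature arrays for consecutive frames."""
    path = frames + frames[-2::-1]          # forward in time, then backward
    idx = start_idx
    for prev, nxt in zip(path[:-1], path[1:]):
        idx = track(prev[idx], nxt)         # re-query with the matched patch
    return abs(idx - start_idx)             # index mismatch; 0 = cycle closes

# Random features will generally not close the cycle; training the feature
# extractor to minimize this error is what aligns patches across time.
rng = np.random.default_rng(0)
frames = [rng.normal(size=(50, 8)) for _ in range(3)]
print(cycle_error(7, frames))
```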
Attribute And-Or Grammar for Joint Parsing of Human Attributes, Part and Pose
This paper presents an attribute and-or grammar (A-AOG) model for jointly
inferring human body pose and human attributes in a parse graph with attributes
augmented to nodes in the hierarchical representation. In contrast to other
popular methods in the current literature that train separate classifiers for
poses and individual attributes, our method explicitly represents the
decomposition and articulation of body parts, and accounts for the correlations
between poses and attributes. The A-AOG model is an amalgamation of three
traditional grammar formulations: (i) Phrase structure grammar representing the
hierarchical decomposition of the human body from whole to parts; (ii)
Dependency grammar modeling the geometric articulation by a kinematic graph of
the body pose; and (iii) Attribute grammar accounting for the compatibility
relations between different parts in the hierarchy so that their appearances
follow a consistent style. The parse graph outputs human detection, pose
estimation, and attribute prediction simultaneously, which are intuitive and
interpretable. We conduct experiments on two tasks across two datasets, and
experimental results demonstrate the advantage of joint modeling in comparison
with computing poses and attributes independently. Furthermore, our model
obtains better performance over existing methods for both pose estimation and
attribute prediction tasks.
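
To make the parse-graph representation concrete, here is a minimal sketch of a node combining two of the grammar views: phrase-structure children and attribute beliefs, with kinematic edges between parts left implicit. The fields and example values are hypothetical, not the A-AOG's actual formulation.

```python
# A minimal sketch of a parse-graph node in the A-AOG spirit.
from dataclasses import dataclass, field

@dataclass
class ParseNode:
    part: str                                        # e.g. "torso"
    children: list = field(default_factory=list)     # whole-to-part decomposition
    attributes: dict = field(default_factory=dict)   # attribute beliefs

# A parse graph would output detection, pose, and attributes jointly.
body = ParseNode("body", children=[
    ParseNode("head", attributes={"wears-hat": 0.1}),
    ParseNode("torso", attributes={"long-sleeve": 0.7}),
])
```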
Deep Supervision with Intermediate Concepts
Recent data-driven approaches to scene interpretation predominantly pose
inference as an end-to-end black-box mapping, commonly performed by a
Convolutional Neural Network (CNN). However, decades of work on perceptual
organization in both human and machine vision suggest that there are often
intermediate representations that are intrinsic to an inference task, and which
provide essential structure to improve generalization. In this work, we explore
an approach for injecting prior domain structure into neural network training
by supervising hidden layers of a CNN with intermediate concepts that normally
are not observed in practice. We formulate a probabilistic framework which
formalizes these notions and predicts improved generalization via this deep
supervision method. One advantage of this approach is that we are able to train
only from synthetic CAD renderings of cluttered scenes, where concept values
can be extracted, but apply the results to real images. Our implementation
achieves state-of-the-art performance on 2D/3D keypoint localization and
image classification on real-image benchmarks, including KITTI, PASCAL VOC,
PASCAL3D+, IKEA, and CIFAR100. We provide additional evidence that our approach
outperforms alternative forms of supervision, such as multi-task networks.
Comment: Submitted to TPAMI, first revision. arXiv admin note: text overlap
with arXiv:1612.0269
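
A minimal PyTorch sketch of deep supervision with intermediate concepts: an auxiliary head on an early block is trained against concept labels (available in synthetic renderings) alongside the main task loss. The block sizes, pooling, and number of concepts are assumptions for illustration.

```python
# A minimal sketch of deep supervision via an intermediate concept head.
import torch
import torch.nn as nn

class DeeplySupervisedNet(nn.Module):
    def __init__(self, n_concepts=4, n_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.concept_head = nn.Linear(16, n_concepts)  # supervises block1
        self.main_head = nn.Linear(32, n_classes)

    def forward(self, x):
        h1 = self.block1(x)
        h2 = self.block2(h1)
        concepts = self.concept_head(h1.mean(dim=(2, 3)))  # pooled features
        logits = self.main_head(h2.mean(dim=(2, 3)))
        return concepts, logits

net = DeeplySupervisedNet()
concepts, logits = net(torch.randn(2, 3, 32, 32))
# Total loss would sum the two terms:
# loss = concept_loss(concepts, concept_gt) + task_loss(logits, label_gt)
```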
OriNet: A Fully Convolutional Network for 3D Human Pose Estimation
In this paper, we propose a fully convolutional network for 3D human pose
estimation from monocular images. We use limb orientations as a new way to
represent 3D poses and bind the orientation together with the bounding box of
each limb region to better associate images and predictions. The 3D
orientations are modeled jointly with 2D keypoint detections. Without
additional constraints, this simple method can achieve good results on several
large-scale benchmarks. Further experiments show that our method can generalize
well to novel scenes and is robust to inaccurate bounding boxes.
Comment: BMVC 2018. Code available at https://github.com/chenxuluo/OriNet-dem
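
To illustrate the limb-orientation representation, here is a minimal sketch that converts 3D joints into unit limb vectors and back, given bone lengths and a root position. The toy kinematic chain and names are assumptions, not OriNet's skeleton definition.

```python
# A minimal sketch of the limb-orientation pose representation.
import numpy as np

LIMBS = [(0, 1), (1, 2), (2, 3)]            # (parent, child) joint indices

def joints_to_orientations(joints):
    """joints: (J, 3) 3D joint positions -> (L, 3) unit limb vectors."""
    vecs = np.array([joints[c] - joints[p] for p, c in LIMBS])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def orientations_to_joints(orients, lengths, root):
    """Invert the mapping given bone lengths and a root joint position."""
    joints = {LIMBS[0][0]: np.asarray(root, dtype=float)}
    for (p, c), u, l in zip(LIMBS, orients, lengths):
        joints[c] = joints[p] + l * u
    return joints

# Round trip: a pose is fully recovered from orientations + bone lengths.
j = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)
rebuilt = orientations_to_joints(joints_to_orientations(j), [1, 1, 1], j[0])
```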
Look into Person: Joint Body Parsing & Pose Estimation Network and A New Benchmark
Human parsing and pose estimation have recently received considerable
interest due to their substantial application potential. However, the existing
datasets have limited numbers of images and annotations and lack a variety of
human appearances and coverage of challenging cases in unconstrained
environments. In this paper, we introduce a new benchmark named "Look into
Person (LIP)" that provides a significant advancement in terms of scalability,
diversity, and difficulty, which are crucial for future developments in
human-centric analysis. This comprehensive dataset contains over 50,000
elaborately annotated images with 19 semantic part labels and 16 body joints,
which are captured across a broad range of viewpoints, occlusions, and background
complexities. Using these rich annotations, we perform detailed analyses of the
leading human parsing and pose estimation approaches, thereby obtaining
insights into the successes and failures of these methods. To further explore
and take advantage of the semantic correlation of these two tasks, we propose a
novel joint human parsing and pose estimation network to explore efficient
context modeling, which can simultaneously predict parsing and pose with
extremely high quality. Furthermore, we simplify the network to solve human
parsing by exploring a novel self-supervised structure-sensitive learning
approach, which imposes human pose structures into the parsing results without
resorting to extra supervision. The dataset, code and models are available at
http://www.sysu-hcp.net/lip/.
Comment: We propose the most comprehensive dataset in the world for
human-centric analysis! (Accepted by T-PAMI 2018.) The dataset, code, and
models are available at http://www.sysu-hcp.net/lip/. arXiv admin note:
substantial text overlap with arXiv:1703.0544
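
As a rough illustration of the self-supervised structure-sensitive idea, the sketch below derives a soft joint (centroid) from each part mask and penalizes deviation between the pose structures implied by predicted and ground-truth parsings. The one-to-one part-to-joint mapping is a simplifying assumption, not the paper's exact loss.

```python
# A minimal sketch of a structure-sensitive parsing loss.
import numpy as np

def part_centroids(prob_maps):
    """prob_maps: (P, H, W) soft part masks -> (P, 2) soft centroids."""
    P, H, W = prob_maps.shape
    ys, xs = np.mgrid[0:H, 0:W]
    w = prob_maps / (prob_maps.sum(axis=(1, 2), keepdims=True) + 1e-8)
    return np.stack([(w * ys).sum(axis=(1, 2)),
                     (w * xs).sum(axis=(1, 2))], axis=1)

def structure_loss(pred_maps, gt_maps):
    """Distance between pose structures implied by the two parsings."""
    return np.linalg.norm(part_centroids(pred_maps) -
                          part_centroids(gt_maps), axis=1).mean()
```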
Deep Functional Dictionaries: Learning Consistent Semantic Structures on 3D Models from Functions
Various 3D semantic attributes such as segmentation masks, geometric
features, keypoints, and materials can be encoded as per-point probe functions
on 3D geometries. Given a collection of related 3D shapes, we consider how to
jointly analyze such probe functions over different shapes, and how to discover
common latent structures using a neural network, even in the absence of any
correspondence information. Our network is trained on point cloud
representations of shape geometry and associated semantic functions on that
point cloud. These functions express a shared semantic understanding of the
shapes but are not coordinated in any way. For example, in a segmentation task,
the functions can be indicator functions of arbitrary sets of shape parts, with
the particular combination involved not known to the network. Our network is
able to produce a small dictionary of basis functions for each shape, a
dictionary whose span includes the semantic functions provided for that shape.
Even though our shapes have independent discretizations and no functional
correspondences are provided, the network is able to generate latent bases, in
a consistent order, that reflect the shared semantic structure among the
shapes. We demonstrate the effectiveness of our technique in various
segmentation and keypoint selection applications.
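
To make the dictionary objective concrete, here is a minimal sketch: given a predicted basis A for a shape and an observed probe function f, the loss is the least-squares residual of reconstructing f within span(A). The closed-form solve and the toy sizes are illustrative assumptions, not the paper's training setup.

```python
# A minimal sketch of the functional-dictionary reconstruction loss.
import numpy as np

def dictionary_loss(A, f):
    """A: (N, K) predicted basis over N points; f: (N,) probe function.
    Returns the squared residual of the best reconstruction A @ c."""
    c, *_ = np.linalg.lstsq(A, f, rcond=None)   # argmin_c ||A c - f||^2
    return float(np.sum((A @ c - f) ** 2))

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
f = A @ rng.normal(size=10)                     # f lies in span(A)
assert dictionary_loss(A, f) < 1e-12            # residual vanishes
```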