Body Segmentation Using Multi-task Learning
Body segmentation is an important step in many computer vision problems
involving human images and one of the key components affecting the
performance of all downstream tasks. Several prior works have approached this
problem using a multi-task model that exploits correlations between different
tasks to improve segmentation performance. Based on the success of such
solutions, we present in this paper a novel multi-task model for human
segmentation/parsing that involves three tasks, i.e., (i) keypoint-based
skeleton estimation, (ii) dense pose prediction, and (iii) human-body
segmentation. The main idea behind the proposed Segmentation--Pose--DensePose
model (or SPD for short) is to learn a better segmentation model by sharing
knowledge across different, yet related tasks. SPD is based on a shared deep
neural network backbone that branches off into three task-specific model heads
and is learned using a multi-task optimization objective. The performance of
the model is analysed through rigorous experiments on the LIP and ATR datasets
and in comparison to a recent (state-of-the-art) multi-task body-segmentation
model. Comprehensive ablation studies are also presented. Our experimental
results show that the proposed multi-task (segmentation) model is highly
competitive and that the introduction of additional tasks contributes towards a
higher overall segmentation performance.
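The shared-backbone-plus-heads design described above can be illustrated with a minimal numpy sketch. This is not the authors' SPD implementation: the linear backbone, the head dimensions (a COCO-style 17-keypoint skeleton, 24 DensePose part channels, 20 LIP segmentation classes), and the MSE stand-in losses are all assumptions made for illustration; the real model uses a deep convolutional backbone and task-appropriate losses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the abstract does not specify the architecture,
# so a single linear layer stands in for the deep shared backbone.
IN_DIM = 128
FEAT_DIM = 64
N_KEYPOINTS = 17      # assumed COCO-style skeleton
N_DENSEPOSE = 24      # assumed DensePose body-part channels
N_SEG_CLASSES = 20    # LIP defines 20 labels (19 parts + background)

W_backbone = rng.standard_normal((IN_DIM, FEAT_DIM)) * 0.01
W_pose = rng.standard_normal((FEAT_DIM, N_KEYPOINTS)) * 0.01
W_dense = rng.standard_normal((FEAT_DIM, N_DENSEPOSE)) * 0.01
W_seg = rng.standard_normal((FEAT_DIM, N_SEG_CLASSES)) * 0.01


def forward(x):
    """Shared backbone features branch off into three task-specific heads."""
    feats = np.maximum(x @ W_backbone, 0.0)  # shared representation (ReLU)
    return {
        "pose": feats @ W_pose,
        "densepose": feats @ W_dense,
        "segmentation": feats @ W_seg,
    }


def multitask_loss(outputs, targets, weights=(1.0, 1.0, 1.0)):
    """Multi-task objective: weighted sum of per-task losses (MSE stand-ins)."""
    return sum(
        w * np.mean((outputs[k] - targets[k]) ** 2)
        for w, k in zip(weights, ("pose", "densepose", "segmentation"))
    )
```

Because all three heads read the same `feats`, gradients from the auxiliary pose and dense-pose losses shape the shared representation that the segmentation head consumes, which is the knowledge-sharing mechanism the abstract credits for the segmentation gains.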
Matching-CNN Meets KNN: Quasi-Parametric Human Parsing
Both parametric and non-parametric approaches have demonstrated encouraging
performance in the human parsing task, namely segmenting a human image into
several semantic regions (e.g., hat, bag, left arm, face). In this work, we aim
to develop a new solution with the advantages of both methodologies, namely
supervision from annotated data and the flexibility to use newly annotated
(possibly uncommon) images, and present a quasi-parametric human parsing model.
Under the classic K Nearest Neighbor (KNN)-based nonparametric framework, the
parametric Matching Convolutional Neural Network (M-CNN) is proposed to predict
the matching confidence and displacements of the best matched region in the
testing image for a particular semantic region in one KNN image. Given a
testing image, we first retrieve its KNN images from the
annotated/manually-parsed human image corpus. Then each semantic region in each
KNN image is matched with confidence to the testing image using M-CNN, and the
matched regions from all KNN images are further fused, followed by a superpixel
smoothing procedure to obtain the final human parsing result. The M-CNN
differs from the classic CNN in that the tailored cross image matching filters
are introduced to characterize the matching between the testing image and the
semantic region of a KNN image. The cross image matching filters are defined at
different convolutional layers, each aiming to capture a particular range of
displacements. Comprehensive evaluations over a large dataset with 7,700
annotated human images demonstrate the significant performance gain of the
quasi-parametric model over state-of-the-art methods for the human parsing
task.

Comment: This manuscript is the accepted version for CVPR 201
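The non-parametric half of the pipeline (retrieve KNN images, then fuse the confidence-weighted matched regions) can be sketched as follows. This is a simplified illustration, not the paper's implementation: the image-level features, the Euclidean retrieval metric, and the pixel-wise confidence voting used for fusion are assumptions standing in for the actual retrieval scheme, M-CNN matching, and superpixel smoothing.

```python
import numpy as np


def retrieve_knn(query_feat, corpus_feats, k=3):
    """Return indices of the K nearest annotated corpus images.

    Distance in a hypothetical image-feature space stands in for whatever
    retrieval representation the actual system uses.
    """
    dists = np.linalg.norm(corpus_feats - query_feat, axis=1)
    return np.argsort(dists)[:k]


def fuse_matches(matches, n_labels, shape):
    """Confidence-weighted fusion of semantic regions matched from KNN images.

    `matches` is a list of (label, confidence, mask) triples, one per region
    transferred into the testing image (in the paper these come from M-CNN,
    which predicts both the matching confidence and the displacement).
    The label with the highest accumulated confidence wins at each pixel.
    """
    votes = np.zeros((n_labels,) + shape)
    for label, conf, mask in matches:
        votes[label] += conf * mask
    return votes.argmax(axis=0)
```

For example, a high-confidence "hat" region transferred from one KNN image overrides a lower-confidence "face" region from another wherever the two masks overlap, which is how evidence from multiple retrieved images is reconciled before smoothing.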