Multi-Scale Dual-Branch Fully Convolutional Network for Hand Parsing
Recently, fully convolutional neural networks (FCNs) have shown strong
performance in image parsing, including scene parsing and object parsing.
Different from generic object parsing tasks, hand parsing is more challenging
due to the hand's small size, complex structure, heavy self-occlusion, and
ambiguous texture. In this paper, we propose a novel parsing framework,
Multi-Scale Dual-Branch Fully Convolutional Network (MSDB-FCN), for hand
parsing tasks. Our network employs a Dual-Branch architecture to extract
features of the hand region, paying attention to the hand itself. These features are
used to generate multi-scale features with a pyramid pooling strategy. In order
to better encode multi-scale features, we design a Deconvolution and Bilinear
Interpolation Block (DB-Block) for upsampling and merging the features of
different scales. To address data imbalance, a common problem in hand parsing
as in many other computer vision tasks, we propose Multi-Class Balanced Focal
Loss, a generalization of Focal Loss to multi-class classification. Extensive
experiments on the RHD-PARSING dataset demonstrate that MSDB-FCN achieves
state-of-the-art performance for hand parsing.
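
The abstract does not spell out the exact form of the Multi-Class Balanced Focal Loss; a minimal PyTorch sketch of one common multi-class generalization of Focal Loss, with an assumed per-class weight vector alpha and focusing parameter gamma, might look like this:

```python
import torch
import torch.nn.functional as F

def multi_class_balanced_focal_loss(logits, targets, alpha, gamma=2.0):
    # logits: (N, C, H, W) raw scores; targets: (N, H, W) int64 class ids;
    # alpha: (C,) per-class balancing weights. All names are illustrative.
    log_p = F.log_softmax(logits, dim=1)
    log_p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    p_t = log_p_t.exp()
    a_t = alpha[targets]                      # per-pixel balancing weight
    # Focal modulation down-weights easy, well-classified pixels.
    return (-a_t * (1.0 - p_t) ** gamma * log_p_t).mean()
```

Rare classes would receive larger alpha entries, so such a loss counters easy-example dominance (via gamma) and class imbalance (via alpha) at the same time.
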
C-DLinkNet: considering multi-level semantic features for human parsing
Human parsing is a fine-grained branch of semantic segmentation that
identifies the constituent parts of the human body. Its central challenge is to
extract semantic features effective against deformation and multi-scale
variation. In this work, we propose an end-to-end model called C-DLinkNet,
based on LinkNet, which adds a new module named the Smooth Module to combine
multi-level features in the decoder. C-DLinkNet produces parsing performance
competitive with state-of-the-art methods while using smaller input sizes and
no additional information, achieving mIoU = 53.05 on the validation set of the
LIP dataset.
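
As a rough illustration only, a decoder-side fusion block in the spirit of the described Smooth Module could upsample the deeper feature, concatenate it with the shallower one, and smooth the result with a convolution; the module and parameter names below are assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothModule(nn.Module):
    """Fuses a coarse decoder feature with a finer one (hypothetical sketch)."""
    def __init__(self, deep_ch, shallow_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(deep_ch + shallow_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, deep, shallow):
        # Upsample the coarse feature to the fine resolution, then fuse.
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([deep, shallow], dim=1))
```
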
CaseNet: Content-Adaptive Scale Interaction Networks for Scene Parsing
Objects in an image exhibit diverse scales. Adaptive receptive fields are
expected to capture a suitable range of context for accurate pixel-level
semantic prediction on objects of diverse sizes. Recently, atrous convolution
with different dilation rates has been used to generate multi-scale features
through several branches, and these features are fused for prediction.
However, there is a lack of explicit interaction among the branches
to adaptively make full use of the contexts. In this paper, we propose a
Content-Adaptive Scale Interaction Network (CaseNet) to exploit the multi-scale
features for scene parsing. We build the CaseNet based on the classic Atrous
Spatial Pyramid Pooling (ASPP) module, followed by the proposed contextual
scale interaction (CSI) module, and the scale adaptation (SA) module.
Specifically, first, for each spatial position, we enable context interaction
among different scales through scale-aware non-local operations across the
scales, i.e., the CSI module, which facilitates the generation of flexible mixed
receptive fields, instead of a traditional flat one. Second, the scale
adaptation module (SA) explicitly and softly selects the suitable scale for
each spatial position and each channel. Ablation studies demonstrate the
effectiveness of the proposed modules. We achieve state-of-the-art performance
on three scene parsing benchmarks: Cityscapes, ADE20K, and LIP.
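
As a hedged sketch of soft scale selection in the spirit of the SA module: given K branch features, a lightweight head predicts per-position, per-channel weights that are normalized across scales with a softmax. Module and parameter names are assumptions:

```python
import torch
import torch.nn as nn

class ScaleAdaptation(nn.Module):
    def __init__(self, channels, num_scales):
        super().__init__()
        self.num_scales = num_scales
        # One weight per channel and scale at every spatial position.
        self.gate = nn.Conv2d(channels * num_scales, channels * num_scales, 1)

    def forward(self, feats):                 # feats: list of K (N, C, H, W)
        x = torch.cat(feats, dim=1)           # (N, K*C, H, W)
        n, _, h, w = x.shape
        gates = self.gate(x).view(n, self.num_scales, -1, h, w)
        gates = gates.softmax(dim=1)          # normalize across the K scales
        x = x.view(n, self.num_scales, -1, h, w)
        return (gates * x).sum(dim=1)         # (N, C, H, W)
```
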
High-Resolution Representations for Labeling Pixels and Regions
High-resolution representation learning plays an essential role in many
vision problems, e.g., pose estimation and semantic segmentation. The
high-resolution network (HRNet) [SunXLW19], recently developed for human
pose estimation, maintains high-resolution representations through the whole
process by connecting high-to-low resolution convolutions in parallel
and produces strong high-resolution representations by repeatedly conducting
fusions across parallel convolutions.
In this paper, we conduct a further study on high-resolution representations
by introducing a simple yet effective modification and apply it to a wide range
of vision tasks. We augment the high-resolution representation by aggregating
the (upsampled) representations from all the parallel convolutions rather than
only the representation from the high-resolution convolution as done
in [SunXLW19]. This simple modification leads to stronger representations,
evidenced by superior results. We show top results in semantic segmentation on
Cityscapes, LIP, and PASCAL Context, and facial landmark detection on AFLW,
COFW, 300W, and WFLW. In addition, we build a multi-level representation from
the high-resolution representation and apply it to the Faster R-CNN object
detection framework and the extended frameworks. The proposed approach achieves
superior results to existing single-model networks on COCO object detection.
The code and models are publicly available at https://github.com/HRNet.
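
A minimal sketch of the described aggregation, assuming PyTorch-style tensors and an illustrative function name: every parallel branch is bilinearly upsampled to the highest resolution and the results are concatenated, rather than keeping only the highest-resolution branch:

```python
import torch
import torch.nn.functional as F

def aggregate_hrnet_branches(branches):
    """branches: list of (N, C_i, H_i, W_i) maps, ordered high to low resolution."""
    target = branches[0].shape[-2:]
    upsampled = [branches[0]] + [
        F.interpolate(b, size=target, mode='bilinear', align_corners=False)
        for b in branches[1:]
    ]
    return torch.cat(upsampled, dim=1)        # (N, sum(C_i), H_0, W_0)
```
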
Instance Adaptive Self-Training for Unsupervised Domain Adaptation
The divergence between labeled training data and unlabeled testing data is a
significant challenge for recent deep learning models. Unsupervised domain
adaptation (UDA) attempts to solve such a problem. Recent works show that
self-training is a powerful approach to UDA. However, existing methods have
difficulty in balancing scalability and performance. In this paper, we propose
an instance adaptive self-training framework for UDA on the task of semantic
segmentation. To effectively improve the quality of pseudo-labels, we develop a
novel pseudo-label generation strategy with an instance adaptive selector.
Besides, we propose the region-guided regularization to smooth the pseudo-label
region and sharpen the non-pseudo-label region. Our method is concise and
efficient, and can easily be generalized to other unsupervised domain
adaptation methods. Experiments on 'GTA5 to Cityscapes' and 'SYNTHIA to
Cityscapes' demonstrate the superior performance of our approach compared with
state-of-the-art methods. (ECCV 2020)
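
The abstract does not specify the instance adaptive selector's exact rule, so as a stand-in the sketch below shows plain per-class confidence thresholding, a common ingredient of pseudo-label generation in self-training; the threshold tensor and all names are assumptions:

```python
import torch

def generate_pseudo_labels(probs, class_thresholds, ignore_index=255):
    """probs: (N, C, H, W) softmax outputs; class_thresholds: (C,) tensor."""
    conf, label = probs.max(dim=1)            # confidence and argmax class
    thresh = class_thresholds[label]          # per-pixel, class-dependent threshold
    label[conf < thresh] = ignore_index       # drop low-confidence pixels
    return label
```
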
GSTO: Gated Scale-Transfer Operation for Multi-Scale Feature Learning in Pixel Labeling
Existing CNN-based methods for pixel labeling heavily depend on multi-scale
features to meet the requirements of both semantic comprehension and detail
preservation. State-of-the-art pixel labeling neural networks widely exploit
conventional scale-transfer operations, i.e., up-sampling and down-sampling, to
learn multi-scale features. In this work, we find that these operations lead to
scale-confused features and suboptimal performance because they are spatially
invariant and directly transmit all feature information across scales without
spatial selection. To address this issue, we propose the Gated
Scale-Transfer Operation (GSTO) to properly transfer spatially filtered features
to another scale. Specifically, GSTO can work either with or without extra
supervision. Unsupervised GSTO is learned from the feature itself while the
supervised one is guided by the supervised probability matrix. Both forms of
GSTO are lightweight and plug-and-play, which can be flexibly integrated into
networks or modules for learning better multi-scale features. In particular, by
plugging GSTO into HRNet, we get a more powerful backbone (namely GSTO-HRNet)
for pixel labeling, and it achieves new state-of-the-art results on the COCO
benchmark for human pose estimation and other benchmarks for semantic
segmentation including Cityscapes, LIP and Pascal Context, with negligible
extra computational cost. Moreover, experimental results demonstrate that GSTO
can also significantly boost the performance of multi-scale feature aggregation
modules like PPM and ASPP. Code will be made available at
https://github.com/VDIGPKU/GSTO.
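
As a rough sketch of the unsupervised form of GSTO described above, the module below predicts a spatial gate from the feature itself and applies it before conventional bilinear resampling; the gating design and names are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class GatedScaleTransfer(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # A single-channel sigmoid gate for spatial selection.
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x, out_size):
        x = x * self.gate(x)                  # spatially filter the features
        return F.interpolate(x, size=out_size,
                             mode='bilinear', align_corners=False)
```
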
Correlating Edge, Pose with Parsing
According to existing studies, human body edges and pose are two factors
beneficial to human parsing. The effectiveness of each of these high-level
features (edge and pose) has been confirmed by concatenating them with the
parsing features. Driven by these insights, this paper studies how human
semantic boundaries and keypoint locations can jointly improve human parsing.
Compared with the existing practice of feature concatenation, we find that
uncovering the correlation among the three factors is a superior way of
leveraging the pivotal contextual cues provided by edges and poses. To capture
such correlations, we propose a Correlation Parsing Machine (CorrPM) employing
a heterogeneous non-local block to discover the spatial affinity among feature
maps from the edge, pose and parsing. The proposed CorrPM allows us to report
new state-of-the-art accuracy on three human parsing datasets. Importantly,
comparative studies confirm the advantages of feature correlation over
concatenation. (CVPR 2020)
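
A simplified sketch of attention over heterogeneous features, loosely in the spirit of the described block: parsing features attend over the concatenation of edge, pose, and parsing features. This is generic non-local attention, not the paper's exact design, and all names are assumptions:

```python
import torch
import torch.nn as nn

class HeterogeneousNonLocal(nn.Module):
    def __init__(self, channels, inter_channels):
        super().__init__()
        self.q = nn.Conv2d(channels, inter_channels, 1)       # query from parsing
        self.k = nn.Conv2d(3 * channels, inter_channels, 1)   # key from all cues
        self.v = nn.Conv2d(3 * channels, channels, 1)         # value from all cues

    def forward(self, parsing, edge, pose):
        n, c, h, w = parsing.shape
        ctx = torch.cat([parsing, edge, pose], dim=1)
        q = self.q(parsing).flatten(2).transpose(1, 2)        # (N, HW, C')
        k = self.k(ctx).flatten(2)                            # (N, C', HW)
        v = self.v(ctx).flatten(2).transpose(1, 2)            # (N, HW, C)
        attn = torch.softmax(q @ k / k.shape[1] ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(n, c, h, w)
        return parsing + out                                  # residual fusion
```
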
Hierarchical Human Parsing with Typed Part-Relation Reasoning
Human parsing aims at pixel-wise human semantic understanding. As human bodies
are inherently hierarchically structured, how to model human structures is the
central theme of this task. Focusing on this, we seek to simultaneously exploit
the representational capacity of deep graph networks and the hierarchical human
structures. In particular, we make the following two contributions. First, three
kinds of part relations, i.e., decomposition, composition, and dependency, are,
for the first time, completely and precisely described by three distinct
relation networks. This is in stark contrast to previous parsers, which only
focus on a portion of the relations and adopt a type-agnostic relation modeling
strategy. More expressive relation information can be captured by explicitly
constraining the parameters of the relation networks to satisfy the specific
characteristics of different relations. Second, previous parsers largely ignore
the need for an approximation algorithm over the loopy human hierarchy, while
we instead adopt an iterative reasoning process, assimilating generic
message-passing networks with their edge-typed, convolutional counterparts.
With these efforts, our parser lays the foundation for more sophisticated and
flexible reasoning over human relation patterns. Comprehensive experiments on
five datasets demonstrate that our parser sets a new state of the art on each.
(Accepted to CVPR 2020. Code: https://github.com/hlzhu09/Hierarchical-Human-Parsing)
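
As an illustrative sketch of edge-typed message passing over a part hierarchy, the code below gives each relation type (decomposition, composition, dependency) its own learned transform and lets each node aggregate typed messages from its neighbors; the graph encoding and all names are assumptions:

```python
import torch
import torch.nn as nn

class TypedMessagePassing(nn.Module):
    def __init__(self, dim, relation_types=('decomposition', 'composition', 'dependency')):
        super().__init__()
        # One message transform per relation type.
        self.msg = nn.ModuleDict({t: nn.Linear(dim, dim) for t in relation_types})
        self.update = nn.GRUCell(dim, dim)

    def forward(self, node_feats, edges):
        """node_feats: (V, D); edges: list of (src, dst, relation_type) tuples."""
        agg = torch.zeros_like(node_feats)
        for src, dst, typ in edges:
            agg[dst] = agg[dst] + self.msg[typ](node_feats[src])
        return self.update(agg, node_feats)   # one iteration of reasoning
```

Calling such a module repeatedly would approximate iterative inference over the loopy hierarchy.
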
Learning Compositional Neural Information Fusion for Human Parsing
This work proposes to combine neural networks with the compositional
hierarchy of human bodies for efficient and complete human parsing. We
formulate the approach as a neural information fusion framework. Our model
assembles the information from three inference processes over the hierarchy:
direct inference (directly predicting each part of a human body using image
information), bottom-up inference (assembling knowledge from constituent
parts), and top-down inference (leveraging context from parent nodes). The
bottom-up and top-down inferences explicitly model the compositional and
decompositional relations in human bodies, respectively. In addition, the
fusion of multi-source information is conditioned on the inputs, i.e., by
estimating and considering the confidence of the sources. The whole model is
end-to-end differentiable, explicitly modeling information flows and
structures. Our approach is extensively evaluated on four popular datasets,
outperforming the state of the art in all cases, with a fast processing speed
of 23 fps. Our code and results have been released to ease future research in
this direction. (ICCV 2019. Website:
https://github.com/ZzzjzzZ/CompositionalHumanParsing)
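
A minimal sketch of confidence-conditioned fusion of the three inference sources, assuming each yields a feature map of the same shape: a small head estimates a per-source confidence from the inputs and the fused feature is their softmax-weighted sum. All names describe one plausible realization, not the released code:

```python
import torch
import torch.nn as nn

class ConfidenceFusion(nn.Module):
    def __init__(self, channels, num_sources=3):
        super().__init__()
        # Estimates one confidence map per source, conditioned on the inputs.
        self.conf = nn.Conv2d(num_sources * channels, num_sources, 1)

    def forward(self, sources):               # list of 3 (N, C, H, W) maps
        x = torch.stack(sources, dim=1)       # (N, 3, C, H, W)
        w = self.conf(torch.cat(sources, dim=1))       # (N, 3, H, W)
        w = w.softmax(dim=1).unsqueeze(2)     # normalize across sources
        return (w * x).sum(dim=1)             # (N, C, H, W)
```
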
AGRNet: Adaptive Graph Representation Learning and Reasoning for Face Parsing
Face parsing assigns a pixel-wise label to each facial component and has
drawn much attention recently. Previous methods have shown success in
face parsing but overlook the correlation among facial components.
In fact, the component-wise relationship is a critical clue for
discriminating ambiguous pixels in the facial area. To address this issue, we
propose adaptive graph representation learning and reasoning over facial
components, aiming to learn representative vertices that describe each
component, exploit the component-wise relationship and thereby produce accurate
parsing results against ambiguity. In particular, we devise an adaptive and
differentiable graph abstraction method to represent the components on a graph
via pixel-to-vertex projection under the initial condition of a predicted
parsing map, where pixel features within a certain facial region are aggregated
onto a vertex. Further, we explicitly incorporate the image edge as a prior in
the model, which helps to discriminate edge and non-edge pixels during the
projection, thus leading to refined parsing results along the edges. Then, our
model learns and reasons over the relations among components by propagating
information across vertices on the graph. Finally, the refined vertex features
are projected back to pixel grids for the prediction of the final parsing map.
To train our model, we propose a discriminative loss to penalize small
distances between vertices in the feature space, which leads to distinct
vertices with strong semantics. Experimental results show the superior
performance of the proposed model on multiple face parsing datasets, along with
validation on the human parsing task demonstrating the generalizability of
our model.
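
As a hedged sketch of two pieces described above: a soft pixel-to-vertex projection driven by a predicted parsing map, and a discriminative loss that pushes vertex features apart. The margin value and all names are assumptions:

```python
import torch
import torch.nn.functional as F

def project_pixels_to_vertices(feats, parsing_logits):
    """feats: (N, C, H, W); parsing_logits: (N, V, H, W) -> vertices (N, V, C)."""
    assign = parsing_logits.flatten(2).softmax(dim=-1)   # soft pixel assignment
    return assign @ feats.flatten(2).transpose(1, 2)     # weighted pooling per vertex

def discriminative_vertex_loss(vertices, margin=1.0):
    """Penalize vertex pairs closer than `margin` in feature space."""
    dist = torch.cdist(vertices, vertices)               # (N, V, V) pairwise distances
    v = vertices.shape[1]
    off_diag = ~torch.eye(v, dtype=torch.bool, device=vertices.device)
    return F.relu(margin - dist[:, off_diag]).mean()
```
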