19 research outputs found
Self-Correctable and Adaptable Inference for Generalizable Human Pose Estimation
A central challenge in human pose estimation, as well as in many other
machine learning and prediction tasks, is generalization: the learned
network cannot characterize its prediction error, generate feedback
information from the test sample, or correct the prediction error on the
fly for each individual test sample, which degrades generalization
performance. In this work, we introduce a
self-correctable and adaptable inference (SCAI) method to address the
generalization challenge of network prediction and use human pose estimation as
an example to demonstrate its effectiveness and performance. We learn a
correction network that corrects the prediction result conditioned on a fitness
feedback error. This feedback error is generated by a learned fitness feedback
network which maps the prediction result to the original input domain and
compares it against the original input. Interestingly, we find that this
self-referential feedback error is highly correlated with the actual prediction
error. This strong correlation suggests that we can use this error as feedback
to guide the correction process. It can also be used as a loss function to
quickly adapt and optimize the correction network during inference.
Our extensive experimental results on human pose estimation demonstrate that
the proposed SCAI method is able to significantly improve the generalization
capability and performance of human pose estimation.
Comment: Accepted by CVPR 202
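The inference loop described above, in which the self-referential feedback error both guides the correction and serves as a test-time loss, can be sketched in a toy linear setting. Everything here is an illustrative assumption (the linear maps, the additive correction, the exact-inverse feedback network), not the authors' implementation:

```python
import numpy as np

# Toy SCAI-style test-time adaptation. The "fitness feedback network"
# maps a prediction back to the input domain; its reconstruction error
# is minimized at inference time, with no ground-truth labels used.

rng = np.random.default_rng(0)

# Toy "input" and an imperfect base prediction of the "pose".
x = rng.normal(size=8)
A = np.eye(8) + 0.1 * rng.normal(size=(8, 8))  # stand-in input->pose map
pose_true = A @ x
pose_pred = pose_true + rng.normal(scale=0.5, size=8)  # noisy prediction

# Here the feedback network is simply the known inverse map, so it
# reconstructs the input from a pose; in SCAI this map is learned.
A_inv = np.linalg.inv(A)

# Adapt an additive correction c by gradient descent on the
# self-referential feedback (reconstruction) error.
c = np.zeros(8)
lr = 0.1
for _ in range(300):
    grad = 2 * A_inv.T @ (A_inv @ (pose_pred + c) - x)
    c -= lr * grad

err_before = np.linalg.norm(pose_pred - pose_true)
err_after = np.linalg.norm(pose_pred + c - pose_true)
```

In this toy case the feedback error is exactly minimized when the corrected pose matches the true pose, which mirrors the paper's observation that the feedback error is strongly correlated with the actual prediction error.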
Learning Enhanced Resolution-wise features for Human Pose Estimation
Recently, multi-resolution networks (such as Hourglass, CPN, and HRNet)
have achieved strong performance on pose estimation by combining feature
maps of various resolutions. In this paper, we propose a Resolution-wise
Attention Module (RAM) and Gradual Pyramid Refinement (GPR), to learn enhanced
resolution-wise feature maps for precise pose estimation. Specifically, RAM
learns a group of weights to represent the different importance of feature maps
across resolutions, and the GPR gradually merges every two feature maps from
low to high resolutions to regress final human keypoint heatmaps. With the
enhanced resolution-wise features learned by the CNN, we obtain more accurate
human keypoint locations. The efficacy of the proposed modules is demonstrated
on the MS-COCO dataset, achieving state-of-the-art performance with an average
precision of 77.7 on the COCO val2017 set and 77.0 on the test-dev2017 set
without using an extra human keypoint training dataset.
Comment: Published at ICIP 202
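The two modules described above can be sketched in a toy numpy setting: a RAM-like step assigns each resolution a softmax-normalized importance weight, and a GPR-like step merges the weighted maps pairwise from low to high resolution. All shapes, the weight values, and the nearest-neighbor upsampling are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy feature maps from three resolutions: (channels, H, W).
feats = [rng.normal(size=(4, s, s)) for s in (8, 16, 32)]

# RAM-like step: one learnable logit per resolution, softmax-normalized,
# so each resolution's feature map gets its own importance weight.
logits = np.array([0.5, 1.0, 2.0])          # stand-in learned parameters
weights = np.exp(logits) / np.exp(logits).sum()
weighted = [w * f for w, f in zip(weights, feats)]

def upsample2x(f):
    # Nearest-neighbor 2x upsampling along both spatial axes.
    return f.repeat(2, axis=1).repeat(2, axis=2)

# GPR-like step: gradually merge every two feature maps from low to
# high resolution; the result would feed the final heatmap regressor.
merged = weighted[0]
for f in weighted[1:]:
    merged = upsample2x(merged) + f
```

The final `merged` tensor has the highest spatial resolution, matching the idea of regressing keypoint heatmaps from the fully merged pyramid.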
F2Net: Learning to Focus on the Foreground for Unsupervised Video Object Segmentation
Although deep learning based methods have achieved great progress in
unsupervised video object segmentation, difficult scenarios (e.g., visual
similarity, occlusions, and appearance changes) are still not well handled. To
alleviate these issues, we propose a novel Focus on Foreground Network (F2Net),
which delves into intra- and inter-frame details of the foreground objects and
thus effectively improves segmentation performance. Specifically, our
proposed network consists of three main parts: Siamese Encoder Module, Center
Guiding Appearance Diffusion Module, and Dynamic Information Fusion Module.
First, we use a siamese encoder to extract the feature representations of
paired frames (the reference frame and the current frame). Then, a Center Guiding
Appearance Diffusion Module is designed to capture the inter-frame feature
(dense correspondences between reference frame and current frame), intra-frame
feature (dense correspondences in current frame), and original semantic feature
of the current frame. Specifically, we establish a Center Prediction Branch to
predict the center location of the foreground object in the current frame and
leverage this center point as a spatial guidance prior to enhance the
inter-frame and intra-frame feature extraction, so that the feature
representation focuses more strongly on the foreground objects. Finally, we
propose a Dynamic Information Fusion Module that automatically selects the
relatively important features from the three aforementioned feature levels.
Extensive experiments on the DAVIS2016, YouTube-Objects, and FBMS datasets show
that our proposed F2Net achieves state-of-the-art performance with significant
improvement.
Comment: Accepted by AAAI202
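The center-guidance idea above, turning a predicted foreground center into a spatial prior that re-weights per-pixel features, can be sketched as follows. The Gaussian prior, all shapes, and the assumed center location are illustrative, not F2Net's actual module:

```python
import numpy as np

# Toy center-guided feature enhancement: a predicted foreground center
# yields a Gaussian spatial prior that re-weights current-frame features
# before any inter-/intra-frame matching.

rng = np.random.default_rng(2)
H = W = 16
feat = rng.normal(size=(8, H, W))   # current-frame features (C, H, W)

cy, cx = 10, 6                      # assumed predicted center location
ys, xs = np.mgrid[0:H, 0:W]
sigma = 3.0
prior = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

guided = feat * prior               # prior broadcasts over channels
```

Pixels near the predicted center keep their features almost unchanged, while distant (likely background) pixels are suppressed, which is one simple way to make the representation focus on the foreground object.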