Relation-Based Associative Joint Location for Human Pose Estimation in Videos
Video-based human pose estimation (HPE) is a vital yet challenging task.
While deep learning methods have made significant progress on HPE, most
approaches detect each joint independently, discarding the structural
information of the pose. In this paper, unlike prior methods, we propose a
Relation-based Pose Semantics Transfer Network (RPSTN) to locate joints
associatively. Specifically, we design a lightweight joint relation extractor
(JRE) to model the pose structural features and associatively generate heatmaps
for joints by heuristically modeling the relation between any two joints
instead of building each joint heatmap independently. In this way, the
proposed JRE module captures the spatial configuration of human poses through
pairwise joint relationships. Moreover, considering the temporal
semantic continuity of videos, the pose semantic information in the current
frame is beneficial for guiding the location of joints in the next frame.
Therefore, we use the idea of knowledge reuse to propagate the pose semantic
information between consecutive frames. In this way, the proposed RPSTN
captures the temporal dynamics of poses. On the one hand, the JRE module can
infer invisible joints spatially, from their relationship to the visible
joints. On the other hand, in the temporal domain, the proposed model can
transfer pose semantic features from non-occluded frames to occluded frames to
locate occluded joints. Therefore, our method is robust
to occlusion and achieves state-of-the-art results on two challenging
datasets, demonstrating its effectiveness for video-based human pose
estimation. We will release the code and models publicly.
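
The pairwise relation modeling described above lends itself to a compact illustration. Below is a minimal sketch, assuming per-joint heatmap channels and a dot-product similarity as the joint relation; the shapes and the residual fusion are illustrative assumptions, not the authors' JRE implementation:

```python
import torch
import torch.nn as nn


class PairwiseJointRelation(nn.Module):
    """Refines each joint heatmap using its relation to every other joint."""

    def __init__(self, num_joints: int):
        super().__init__()
        # 1x1 convolution to fuse the relation-weighted context back in.
        self.fuse = nn.Conv2d(num_joints, num_joints, kernel_size=1)

    def forward(self, heatmaps: torch.Tensor) -> torch.Tensor:
        # heatmaps: (B, J, H, W), one channel per joint.
        b, j, h, w = heatmaps.shape
        flat = heatmaps.view(b, j, h * w)                   # (B, J, HW)
        # Relation between any two joints as normalized feature similarity.
        relation = torch.softmax(
            flat @ flat.transpose(1, 2) / (h * w) ** 0.5, dim=-1)  # (B, J, J)
        # Each joint aggregates evidence from all joints it relates to, so an
        # occluded joint can borrow support from the visible ones.
        context = (relation @ flat).view(b, j, h, w)
        return heatmaps + self.fuse(context)                # residual refinement


hm = torch.randn(2, 17, 64, 48)              # e.g. 17 COCO joints
print(PairwiseJointRelation(17)(hm).shape)   # torch.Size([2, 17, 64, 48])
```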
Spatial-Temporal Decoupling Contrastive Learning for Skeleton-based Human Action Recognition
Skeleton-based action recognition is a central task in human-computer
interaction. However, most previous methods suffer from two issues: (i)
semantic ambiguity arising from spatial-temporal information mixture; and (ii)
overlooking the explicit exploitation of the latent data distributions (i.e.,
the intra-class variations and inter-class relations), thereby leading to
sub-optimal solutions for the skeleton encoders. To mitigate this, we propose a
spatial-temporal decoupling contrastive learning (STD-CL) framework to obtain
discriminative and semantically distinct representations from the sequences,
which can be incorporated into various previous skeleton encoders and can be
removed at test time. Specifically, we decouple the global features into
spatial-specific and temporal-specific features to reduce the spatial-temporal
coupling of features. Furthermore, to explicitly exploit the latent data
distributions, we apply contrastive learning to the attentive features,
modeling cross-sequence semantic relations by pulling together features from
positive pairs and pushing apart those from negative pairs. Extensive
experiments show that STD-CL with four different skeleton encoders (HCN, 2S-AGCN,
CTR-GCN, and Hyperformer) achieves solid improvements on NTU60, NTU120, and
NW-UCLA benchmarks. The code will be released soon.
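
As a rough illustration of the decoupling-plus-contrast recipe, the following sketch assumes an encoder output of shape (batch, channels, time, joints), pools out one axis to obtain each stream, and applies an InfoNCE-style loss per stream; the pooling choices and loss form are assumptions, not the paper's exact design:

```python
import torch
import torch.nn.functional as F


def decouple(f: torch.Tensor):
    # f: (B, C, T, V) -- channels, time steps, joints (vertices).
    spatial = f.mean(dim=2)    # pool over time   -> (B, C, V): spatial-specific
    temporal = f.mean(dim=3)   # pool over joints -> (B, C, T): temporal-specific
    return spatial.flatten(1), temporal.flatten(1)


def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1):
    """Pull two views of the same sequence together, push other pairs apart."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau            # (B, B) cross-sequence similarity
    labels = torch.arange(z1.size(0))     # positives sit on the diagonal
    return F.cross_entropy(logits, labels)


# Encoder outputs for two augmented views of a batch (random stand-ins here).
f1, f2 = torch.randn(8, 64, 20, 25), torch.randn(8, 64, 20, 25)
s1, t1 = decouple(f1)
s2, t2 = decouple(f2)
loss = info_nce(s1, s2) + info_nce(t1, t2)  # contrast each stream separately
print(loss.item())
```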
Topology-aware MLP for Skeleton-based Action Recognition
Graph convolution networks (GCNs) have achieved remarkable performance in
skeleton-based action recognition. However, previous GCN-based methods rely
excessively on elaborate human body priors and construct complex feature
aggregation mechanisms, which limits the generalizability of the networks.
To solve these problems, we propose a novel Spatial Topology Gating Unit
(STGU), which is an MLP-based variant without extra priors, to capture the
co-occurrence topology features that encode the spatial dependency across all
joints. In STGU, to model sample-specific and fully independent point-wise
topology attention, a new gate-based feature interaction mechanism is
introduced that activates the features point-to-point with an attention map
generated from the input. Based on the STGU, we propose the first
topology-aware MLP-based model, Ta-MLP, for skeleton-based action recognition.
In comparison with previous methods on three large-scale datasets,
Ta-MLP achieves competitive performance. In addition, Ta-MLP reduces the
parameter count by up to 62.5% while maintaining favorable results. Compared with previous
state-of-the-art (SOTA) approaches, Ta-MLP pushes the frontier of real-time
action recognition. The code will be available at
https://github.com/BUPTSJZhang/Ta-MLP.
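
The gate-based interaction can be sketched along the lines of gMLP's spatial gating unit, which the description resembles. In the hypothetical module below, half the channels generate a per-joint gate from the input itself; the shapes and the channel split are illustrative assumptions, not the Ta-MLP implementation:

```python
import torch
import torch.nn as nn


class SpatialTopologyGate(nn.Module):
    """Gate-based point-wise interaction across joints, without body priors."""

    def __init__(self, channels: int, num_joints: int):
        super().__init__()
        self.norm = nn.LayerNorm(channels // 2)
        # A projection over the joint axis; combined with the multiplicative
        # gate below, the resulting attention is generated from the input
        # itself and is therefore sample-specific.
        self.joint_proj = nn.Linear(num_joints, num_joints)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, V, C) with V joints and C channels.
        content, gate = x.chunk(2, dim=-1)           # split channels in half
        gate = self.norm(gate).transpose(1, 2)       # (B, C/2, V)
        gate = self.joint_proj(gate).transpose(1, 2)
        # Point-to-point activation: each joint feature is modulated by an
        # attention value derived from all joints of this very sample.
        return content * gate                        # (B, V, C/2)


x = torch.randn(4, 25, 128)                   # 25 joints, 128 channels
print(SpatialTopologyGate(128, 25)(x).shape)  # torch.Size([4, 25, 64])
```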
BiHRNet: A Binary High-Resolution Network for Human Pose Estimation
Human Pose Estimation (HPE) plays a crucial role in computer vision
applications. However, it is difficult to deploy state-of-the-art models on
resource-limited devices due to their high computational cost. In this work, a
binary human pose estimator named BiHRNet (Binary HRNet) is proposed, whose
weights and activations are expressed as ±1. BiHRNet
retains the keypoint extraction ability of HRNet, while using fewer computing
resources by adopting binary neural networks (BNNs). In order to reduce the
accuracy drop caused by network binarization, two categories of techniques are
proposed in this work. To optimize the training of the binary pose estimator,
we propose a new loss function combining KL divergence loss with
AWing loss, which lets the binary network learn a more comprehensive output
distribution from its real-valued counterpart and thereby reduces the
information loss caused by binarization. To design more binarization-friendly
structures, we
propose a new information reconstruction bottleneck called IR Bottleneck to
retain more information in the initial stage of the network. In addition, we
propose a multi-scale basic block called MS-Block for information retention.
Our network has a lower computational cost with only a small precision drop.
Experimental results demonstrate that BiHRNet achieves a PCKh of 87.9 on the
MPII dataset, which outperforms all binary pose estimation networks. On the
challenging COCO dataset, the proposed method enables the binary neural
network to achieve 70.8 mAP, which is better than most tested lightweight
full-precision networks.
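
A minimal sketch of such a combined objective is given below, assuming the heatmaps are softened into per-joint distributions for the KL term and supervised with the published Adaptive Wing formula; the temperature, weighting, and hyper-parameters are illustrative choices, not the paper's reported settings:

```python
import torch
import torch.nn.functional as F


def awing(pred, target, alpha=2.1, omega=14.0, eps=1.0, theta=0.5):
    """Adaptive Wing loss (Wang et al., 2019) for heatmap regression."""
    d = (target - pred).abs()
    p = alpha - target                      # exponent adapts to the label value
    A = omega / (1 + (theta / eps) ** p) * p * (theta / eps) ** (p - 1) / eps
    C = theta * A - omega * torch.log1p((theta / eps) ** p)
    return torch.where(d < theta,
                       omega * torch.log1p((d / eps) ** p),   # small errors
                       A * d - C).mean()                      # large errors


def kd_kl(student_hm, teacher_hm, tau=4.0):
    """KL divergence between temperature-softened per-joint heatmap distributions."""
    s = F.log_softmax(student_hm.flatten(2) / tau, dim=-1)
    t = F.softmax(teacher_hm.flatten(2) / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau * tau


student = torch.rand(2, 16, 64, 64)   # binary network output (MPII: 16 joints)
teacher = torch.rand(2, 16, 64, 64)   # real-valued counterpart output
gt = torch.rand(2, 16, 64, 64)        # ground-truth heatmaps
loss = awing(student, gt) + 0.1 * kd_kl(student, teacher)  # assumed weighting
```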
Learning Human Kinematics by Modeling Temporal Correlations between Joints for Video-based Human Pose Estimation
Estimating human poses from videos is critical in human-computer interaction.
By precisely estimating human poses, the robot can provide an appropriate
response to the human. Most existing approaches use optical flow, RNNs, or
CNNs to extract temporal features from videos. Despite their positive results,
most of these methods simply aggregate features along the temporal dimension,
ignoring the temporal correlations between joints. In
contrast to previous methods, we propose a plug-and-play kinematics modeling
module (KMM) based on the domain-cross attention mechanism to model the
temporal correlation between joints across different frames explicitly.
Specifically, the proposed KMM models the temporal correlation between any two
joints by calculating their temporal similarity. In this way, KMM can learn the
motion cues of each joint. Using the motion cues (temporal domain) and
historical positions of joints (spatial domain), KMM can infer the initial
positions of joints in the current frame in advance. In addition, we present a
kinematics modeling network (KIMNet) based on the KMM for obtaining the final
positions of joints by combining pose features and initial positions of joints.
By explicitly modeling temporal correlations between joints, KIMNet can infer
currently occluded joints from all joints at the previous moment.
Furthermore, the KMM is implemented with an attention mechanism, which allows
it to maintain the high resolution of features. Therefore, it can transfer rich
historical pose information to the current frame, which provides effective pose
information for locating occluded joints. Our approach achieves
state-of-the-art results on two standard video-based pose estimation
benchmarks. Moreover, the proposed KIMNet shows robustness to occlusion,
demonstrating the effectiveness of the proposed method.
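
The cross-frame attention idea can be illustrated with a single-head sketch, assuming per-joint feature vectors for two consecutive frames; the pooling to per-joint vectors and the residual initialization are assumptions for illustration, not the exact KMM:

```python
import torch
import torch.nn as nn


class CrossFrameJointAttention(nn.Module):
    """Single-head cross-attention from current-frame to previous-frame joints."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries: joints in the current frame
        self.k = nn.Linear(dim, dim)   # keys: joints in the previous frame
        self.v = nn.Linear(dim, dim)   # values: historical pose semantics

    def forward(self, curr: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
        # curr, prev: (B, J, C) per-joint feature vectors of two frames.
        q, k, v = self.q(curr), self.k(prev), self.v(prev)
        # Temporal similarity between any joint now and any joint before;
        # this acts as the motion cue linking a joint to its past positions.
        attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        # Propagate historical pose information forward: an occluded joint at
        # time t can draw on all (visible) joints at time t-1.
        return curr + attn @ v


curr, prev = torch.randn(2, 17, 256), torch.randn(2, 17, 256)
print(CrossFrameJointAttention(256)(curr, prev).shape)  # torch.Size([2, 17, 256])
```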
Physics-constrained Attack against Convolution-based Human Motion Prediction
Human motion prediction has achieved impressive performance with the help of
convolution-based neural networks. However, there is currently no work
evaluating the potential risk to human motion prediction from adversarial
attacks, which in this setting face two obstacles: preserving naturalness and
adapting to the scale of the data. To solve these problems, we propose a new
adversarial attack method that generates the worst-case
perturbation by maximizing the human motion predictor's prediction error with
physical constraints. Specifically, we introduce a novel adaptable scheme that
adapts the attack to the scale of the target pose, and two physical
constraints to enhance the naturalness of the adversarial example. The
evaluation experiments on three datasets show that the prediction errors of all
target models are significantly enlarged, which means that current convolution-based
human motion prediction models are vulnerable to the proposed attack. Based on
the experimental results, we provide insights on how to enhance the adversarial
robustness of the human motion predictor and how to improve the adversarial
attack against human motion prediction.
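
As a rough illustration, the sketch below runs iterated gradient ascent on the predictor's error, scales the perturbation budget to the target pose, and applies a simple temporal smoothing as a stand-in physical constraint; the L-infinity budget, the scale heuristic, and the smoothing step are all assumptions standing in for the paper's adaptable scheme and physical constraints:

```python
import torch
import torch.nn.functional as F


def attack(model, x, y, steps=10, rel_eps=0.01):
    """Maximize the predictor's error under a pose-scaled L-infinity budget."""
    # x: (B, T, J, 3) observed motion; y: ground-truth future motion.
    eps = rel_eps * float(x.abs().amax())    # adapt the budget to the pose scale
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.mse_loss(model(x + delta), y)
        loss.backward()                      # ascend on the prediction error
        with torch.no_grad():
            delta += (eps / steps) * delta.grad.sign()
            delta.clamp_(-eps, eps)          # scale (budget) constraint
            # Stand-in naturalness constraint: smooth the perturbation along
            # time so it does not inject physically implausible jitter.
            delta[:, 1:-1] = (delta[:, :-2] + delta[:, 1:-1] + delta[:, 2:]) / 3
        delta.grad.zero_()
    return (x + delta).detach()


# Toy predictor standing in for a convolution-based motion prediction model.
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(10 * 22 * 3, 10 * 22 * 3),
    torch.nn.Unflatten(1, (10, 22, 3)),
)
x, y = torch.randn(4, 10, 22, 3), torch.randn(4, 10, 22, 3)
x_adv = attack(model, x, y)                  # adversarial observed motion
```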