Relation-Based Associative Joint Location for Human Pose Estimation in Videos
Video-based human pose estimation (HPE) is a vital yet challenging task.
While deep learning methods have made significant progress on HPE, most
approaches detect each joint independently, discarding the structural
information of the pose. In this paper, unlike prior methods, we propose a
Relation-based Pose Semantics Transfer Network (RPSTN) to locate joints
associatively. Specifically, we design a lightweight joint relation extractor
(JRE) to model the pose structural features and associatively generate heatmaps
for joints by modeling the relation between any two joints heuristically
instead of building each joint heatmap independently. In effect, the proposed
JRE module models the spatial configuration of human poses through the
relationship between any two joints. Moreover, considering the temporal
semantic continuity of videos, the pose semantic information in the current
frame is beneficial for guiding the location of joints in the next frame.
Therefore, we use the idea of knowledge reuse to propagate the pose semantic
information between consecutive frames. In this way, the proposed RPSTN
captures the temporal dynamics of poses. On the one hand, the JRE module can
infer invisible joints according to their spatial relationship with visible
joints. On the other hand, in the temporal dimension, the proposed model can
transfer the pose semantic features from the non-occluded frame to
the occluded frame to locate occluded joints. Therefore, our method is robust
to occlusion and achieves state-of-the-art results on two challenging
datasets, which demonstrates its effectiveness for video-based human pose
estimation. We will release the code and models publicly.
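As an illustration, the pairwise-relation idea behind a JRE-style module can be sketched in a few lines of NumPy; the feature shapes, the dot-product similarity, and the heatmap-mixing step below are assumptions for exposition, not the paper's exact design:

```python
import numpy as np

def joint_relation_heatmaps(joint_feats, initial_heatmaps):
    """Refine per-joint heatmaps by mixing them according to pairwise
    joint-relation scores (a hypothetical simplification of the JRE idea)."""
    d = joint_feats.shape[1]
    # Pairwise relation scores via a scaled dot product, softmax-normalized
    # so each joint's incoming relation weights sum to one.
    sim = joint_feats @ joint_feats.T / np.sqrt(d)          # (J, J)
    sim = np.exp(sim - sim.max(axis=1, keepdims=True))
    rel = sim / sim.sum(axis=1, keepdims=True)
    # Each refined heatmap is a relation-weighted mixture of all heatmaps,
    # so an occluded joint can borrow evidence from related visible joints.
    return np.einsum('ij,jhw->ihw', rel, initial_heatmaps)

rng = np.random.default_rng(0)
feats = rng.normal(size=(17, 32))        # 17 COCO-style joints, 32-d features
maps = rng.random(size=(17, 64, 48))     # independently predicted heatmaps
out = joint_relation_heatmaps(feats, maps)
```

Because every refined heatmap is a convex combination of the originals, a joint whose own detection is weak still receives spatially consistent evidence from strongly related joints.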
Spatial-Temporal Decoupling Contrastive Learning for Skeleton-based Human Action Recognition
Skeleton-based action recognition is a central task in human-computer
interaction. However, most previous methods suffer from two issues: (i)
semantic ambiguity arising from spatial-temporal information mixture; and (ii)
overlooking the explicit exploitation of the latent data distributions (i.e.,
the intra-class variations and inter-class relations), thereby leading to
sub-optimal solutions for the skeleton encoders. To mitigate this, we propose a
spatial-temporal decoupling contrastive learning (STD-CL) framework to obtain
discriminative and semantically distinct representations from the sequences,
which can be incorporated into various previous skeleton encoders and can be
removed when testing. Specifically, we decouple the global features into
spatial-specific and temporal-specific features to reduce the spatial-temporal
coupling of features. Furthermore, to explicitly exploit the latent data
distributions, we apply the attentive features to contrastive learning, which
models the cross-sequence semantic relations by pulling together the features
from the positive pairs and pushing away the negative pairs. Extensive
experiments show that STD-CL with four different skeleton encoders (HCN, 2S-AGCN,
CTR-GCN, and Hyperformer) achieves solid improvements on NTU60, NTU120, and
NW-UCLA benchmarks. The code will be released soon.
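A minimal sketch of the two ingredients follows, assuming mean-pooling for the spatial/temporal decoupling and a standard InfoNCE objective; the paper's attentive features and exact loss formulation may differ:

```python
import numpy as np

def decouple(features):
    """Split a (T, V, C) skeleton feature map into spatial-specific and
    temporal-specific vectors by pooling over the opposite axis (a
    hypothetical simplification of STD-CL's decoupling step)."""
    spatial = features.mean(axis=0).reshape(-1)   # pool over frames -> joint-specific
    temporal = features.mean(axis=1).reshape(-1)  # pool over joints -> frame-specific
    return spatial, temporal

def info_nce(anchor, positive, negatives, tau=0.1):
    """Standard InfoNCE loss: pull the positive pair together in cosine
    similarity and push the negatives away."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(anchor, positive)]
                      + [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

rng = np.random.default_rng(0)
seq = rng.normal(size=(20, 25, 8))             # T frames, V joints, C channels
aug = seq + 0.05 * rng.normal(size=seq.shape)  # augmented view = positive pair
s_a, _ = decouple(seq)
s_p, _ = decouple(aug)
negs = [decouple(rng.normal(size=seq.shape))[0] for _ in range(8)]
loss = info_nce(s_a, s_p, negs)
```

Since the contrastive branch only shapes the encoder's representation, it can be dropped at test time, exactly as the abstract describes.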
BiHRNet: A Binary high-resolution network for Human Pose Estimation
Human Pose Estimation (HPE) plays a crucial role in computer vision
applications. However, it is difficult to deploy state-of-the-art models on
resource-limited devices due to the high computational costs of the networks. In
this work, a binary human pose estimator named BiHRNet (Binary HRNet) is
proposed, whose weights and activations are expressed as ±1. BiHRNet
retains the keypoint extraction ability of HRNet, while using fewer computing
resources by adopting a binary neural network (BNN). In order to reduce the
accuracy drop caused by network binarization, two categories of techniques are
proposed in this work. To optimize the training of the binary pose estimator,
we propose a new loss function combining KL divergence loss with AWing loss,
which helps the binary network learn a more comprehensive output distribution
from its real-valued counterpart, reducing the information loss caused by
binarization. To design more binarization-friendly structures, we
propose a new information reconstruction bottleneck called IR Bottleneck to
retain more information in the initial stage of the network. In addition, we
also propose a multi-scale basic block called MS-Block for information
retention. Our network incurs much lower computation cost with only a small
precision drop.
Experimental results demonstrate that BiHRNet achieves a PCKh of 87.9 on the
MPII dataset, which outperforms all binary pose estimation networks. On the
challenging COCO dataset, the proposed method enables the binary neural
network to achieve 70.8 mAP, which is better than most tested lightweight
full-precision networks.
Comment: 12 pages, 6 figures
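The two ingredients — sign binarization of the weights and KL-based distillation from a real-valued counterpart — can be sketched as below; the per-filter scaling and heatmap normalization are generic BNN/distillation choices, not necessarily BiHRNet's exact formulation (the AWing term is omitted):

```python
import numpy as np

def binarize(w):
    """XNOR-Net-style binarization: sign of the weights times a per-filter
    scaling factor recovered from the real-valued magnitudes."""
    alpha = np.abs(w).mean(axis=tuple(range(1, w.ndim)), keepdims=True)
    return alpha * np.where(w >= 0, 1.0, -1.0)

def kl_distill(student_heatmap, teacher_heatmap, eps=1e-8):
    """KL divergence between heatmaps normalized to distributions: the
    distillation part of the combined loss described above."""
    p = teacher_heatmap.ravel(); p = p / p.sum()
    q = student_heatmap.ravel(); q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 3, 3, 3))   # conv weights, output channels first
wb = binarize(w)
h = rng.random(size=(64, 48))       # a toy heatmap
```

In deployment, each filter stores one bit per weight plus a single float scale, which is where the computation and memory savings come from.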
Topology-aware MLP for Skeleton-based Action Recognition
Graph convolution networks (GCNs) have achieved remarkable performance in
skeleton-based action recognition. However, existing GCN-based methods
rely excessively on elaborate human body priors and construct complex
feature aggregation mechanisms, which limits the generalizability of networks.
To solve these problems, we propose a novel Spatial Topology Gating Unit
(STGU), which is an MLP-based variant without extra priors, to capture the
co-occurrence topology features that encode the spatial dependency across all
joints. In STGU, to model the sample-specific and completely independent
point-wise topology attention, a new gate-based feature interaction mechanism
is introduced to activate the features point-to-point by the attention map
generated from the input. Based on the STGU, in this work, we propose the first
topology-aware MLP-based model, Ta-MLP, for skeleton-based action recognition.
In comparison with existing methods on three large-scale datasets,
Ta-MLP achieves competitive performance. In addition, Ta-MLP reduces the
parameters by up to 62.5% with favorable results. Compared with previous
state-of-the-art (SOTA) approaches, Ta-MLP pushes the frontier of real-time
action recognition. The code will be available at
https://github.com/BUPTSJZhang/Ta-MLP.
Comment: 10 pages, 9 figures
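A gate-based feature interaction of the kind described can be sketched as follows; the projection shapes and the sigmoid gate are assumptions for illustration, not the real STGU:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_topology_gating(x, w_gate, w_value):
    """Hypothetical sketch of a gate-based feature interaction: an attention
    map generated from the input itself activates the projected features
    point-to-point, yielding a sample-specific topology with no body prior."""
    gate = sigmoid(x @ w_gate)   # (V, C') attention map from the input
    value = x @ w_value          # (V, C') projected features
    return gate * value          # element-wise (point-to-point) activation

rng = np.random.default_rng(0)
x = rng.normal(size=(25, 16))    # V=25 joints, C=16 channels
w_g = rng.normal(size=(16, 16))
w_v = rng.normal(size=(16, 16))
out = spatial_topology_gating(x, w_g, w_v)
```

Because the gate is computed from the input rather than from a fixed adjacency, the effective topology differs per sample, which is what removes the dependence on hand-designed skeleton graphs.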
Learning Human Kinematics by Modeling Temporal Correlations between Joints for Video-based Human Pose Estimation
Estimating human poses from videos is critical in human-computer interaction.
By precisely estimating human poses, the robot can provide an appropriate
response to the human. Most existing approaches use optical flow, RNNs, or
CNNs to extract temporal features from videos. Despite the positive results of
these attempts, most of them simply integrate features along
the temporal dimension, ignoring temporal correlations between joints. In
contrast to previous methods, we propose a plug-and-play kinematics modeling
module (KMM) based on the domain-cross attention mechanism to model the
temporal correlation between joints across different frames explicitly.
Specifically, the proposed KMM models the temporal correlation between any two
joints by calculating their temporal similarity. In this way, KMM can learn the
motion cues of each joint. Using the motion cues (temporal domain) and
historical positions of joints (spatial domain), KMM can infer the initial
positions of joints in the current frame in advance. In addition, we present a
kinematics modeling network (KIMNet) based on the KMM for obtaining the final
positions of joints by combining pose features and initial positions of joints.
By explicitly modeling temporal correlations between joints, KIMNet can infer
currently occluded joints from all joints at the previous moment.
Furthermore, the KMM is achieved through an attention mechanism, which allows
it to maintain the high resolution of features. Therefore, it can transfer rich
historical pose information to the current frame, which provides effective pose
information for locating occluded joints. Our approach achieves
state-of-the-art results on two standard video-based pose estimation
benchmarks. Moreover, the proposed KIMNet shows some robustness to occlusion,
demonstrating the effectiveness of the proposed method.
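The idea of weighting historical joint positions by temporal similarity can be sketched as a small cross-attention step; the shapes and the scaled dot-product similarity are assumptions, not the exact KMM:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kinematics_attention(motion_feats, prev_feats, prev_positions):
    """Hypothetical KMM-style cross-attention: the temporal similarity
    between each joint's motion cue (temporal domain) and every joint's
    features in the previous frame (spatial domain) weights the historical
    positions to give initial positions for the current frame."""
    d = motion_feats.shape[1]
    sim = motion_feats @ prev_feats.T / np.sqrt(d)   # (J, J) similarity
    attn = softmax(sim, axis=1)
    return attn @ prev_positions                     # (J, 2) initial positions

rng = np.random.default_rng(0)
J, C = 17, 32
init = kinematics_attention(rng.normal(size=(J, C)),   # current motion cues
                            rng.normal(size=(J, C)),   # previous-frame features
                            rng.uniform(0, 100, size=(J, 2)))
```

Since attention works on full-resolution feature maps rather than pooled vectors, this formulation is consistent with the abstract's point about preserving high-resolution pose information.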
Physics-constrained Attack against Convolution-based Human Motion Prediction
Human motion prediction has achieved impressive performance with the help of
convolution-based neural networks. However, there is currently no work
evaluating the potential risk of human motion prediction under adversarial
attacks. Attacking human motion prediction poses challenges in terms of
naturalness and data scale. To address these problems, we propose a new
adversarial attack method that generates the worst-case
perturbation by maximizing the human motion predictor's prediction error with
physical constraints. Specifically, we introduce a novel adaptable scheme that
facilitates the attack to suit the scale of the target pose and two physical
constraints to enhance the naturalness of the adversarial example. Evaluation
experiments on three datasets show that the prediction errors of all
target models are enlarged significantly, which means current convolution-based
human motion prediction models are vulnerable to the proposed attack. Based on
the experimental results, we provide insights on how to enhance the adversarial
robustness of the human motion predictor and how to improve the adversarial
attack against human motion prediction.
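The overall attack loop — maximize the predictor's error under a perturbation budget adapted to the pose scale — can be sketched with a toy linear predictor; everything here (the predictor, the constants, the l-inf bound standing in for the physical constraints) is an assumption for illustration:

```python
import numpy as np

def attack(x, w, y_true, steps=20, alpha=0.02, eps=0.1):
    """PGD-style attack maximizing a toy linear motion predictor's error
    under an l-inf bound scaled to the pose magnitude; the paper's actual
    naturalness constraints and predictors are more elaborate."""
    scale = np.abs(x).max()             # adapt the budget to the pose scale
    delta = np.zeros_like(x)
    for _ in range(steps):
        err = (x + delta) @ w - y_true  # prediction residual
        grad = err @ w.T                # gradient of 0.5*||err||^2 w.r.t. delta
        delta = np.clip(delta + alpha * scale * np.sign(grad),
                        -eps * scale, eps * scale)
    return x + delta

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 30))                  # flattened observed poses
w = rng.normal(size=(30, 12)) / np.sqrt(30)   # toy linear predictor
y = rng.normal(size=(1, 12))                  # ground-truth future poses
x_adv = attack(x, w, y)
clean_err = float(np.mean((x @ w - y) ** 2))
adv_err = float(np.mean((x_adv @ w - y) ** 2))
```

Because the squared error is convex in the perturbation, each sign-gradient step cannot decrease it, so the attacked error exceeds the clean error while the perturbation stays inside the scale-adapted bound.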
Generalized quadrature spatial modulation for STAR-RIS aided NOMA networks
The simultaneously transmitting and reflecting reconfigurable intelligent
surface (STAR-RIS) is regarded as a promising paradigm for enhancing the
connectivity and reliability of non-orthogonal multiple access (NOMA) networks.
However, the transmission performance of STAR-RIS enhanced NOMA networks is
severely limited by the inter-user interference (IUI) in multi-user detection.
To mitigate this drawback, we propose a generalized quadrature spatial
modulation (GQSM) aided STAR-RIS in conjunction with the NOMA scheme, termed
STAR-RIS-NOMA-GQSM, to improve the performance of the corresponding NGMA
network. In STAR-RIS-NOMA-GQSM, the information bits for all users in the
transmission and reflection zones are transmitted via orthogonal signal domains
to eliminate the IUI and thus greatly improve the system performance. The
low-complexity detection and upper-bounded bit error rate (BER) of
STAR-RIS-NOMA-GQSM are both studied to evaluate its feasibility and
performance. Moreover, by further utilizing index modulation (IM), we propose
an enhanced STAR-RIS-NOMA-GQSM scheme, termed E-STAR-RIS-NOMA-GQSM, to enhance
the transmission rate by dynamically adjusting reflection patterns in both
transmission and reflection zones. Simulation results show that the proposed
original and enhanced schemes significantly outperform conventional
STAR-RIS-NOMA and also confirm the precision of the theoretical analysis of the
upper-bounded BER.
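The quadrature-spatial-modulation principle — letting antenna indices carry part of the information, with the in-phase and quadrature components radiated separately — can be sketched with a toy mapper; this is plain single-user QSM with QPSK, not the paper's generalized scheme over STAR-RIS transmission/reflection zones:

```python
import numpy as np

def qsm_map(bits, n_tx=4):
    """Toy quadrature spatial modulation mapper: one bit group selects the
    transmit antenna for the in-phase (I) part, another group the antenna
    for the quadrature (Q) part, and the last two bits pick a QPSK symbol,
    so the antenna indices themselves convey information."""
    k = int(np.log2(n_tx))
    i_idx = int(''.join(map(str, bits[:k])), 2)       # antenna for I component
    q_idx = int(''.join(map(str, bits[k:2 * k])), 2)  # antenna for Q component
    qpsk = {(0, 0): 1 + 1j, (0, 1): 1 - 1j,
            (1, 0): -1 + 1j, (1, 1): -1 - 1j}
    s = qpsk[tuple(bits[2 * k:2 * k + 2])] / np.sqrt(2)
    x = np.zeros(n_tx, dtype=complex)
    x[i_idx] += s.real        # I part radiated from one antenna
    x[q_idx] += 1j * s.imag   # Q part possibly from a different antenna
    return x

x = qsm_map([0, 1, 1, 0, 0, 0], n_tx=4)
```

With 4 antennas this toy mapper carries 2 + 2 + 2 = 6 bits per channel use; the I and Q signal domains are orthogonal, which is the property the abstract exploits to separate users in the two zones.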
Micro-expression recognition based on depth map to point cloud
Micro-expressions (MEs) are nonverbal facial expressions that reveal the covert
emotions of individuals, making the micro-expression recognition task receive
widespread attention. However, the micro-expression recognition task is
challenging due to the subtle facial motion and its brevity in duration. Many 2D
image-based methods have been developed in recent years to recognize MEs
effectively, but these approaches are restricted by facial texture information
and are susceptible to environmental factors, such as lighting. Conversely,
depth information can effectively represent motion information related to
facial structure changes and is not affected by lighting. Motion information
derived from facial structures can describe motion features that pixel textures
cannot delineate. We propose a network for micro-expression recognition based
on facial depth information, and our experiments demonstrate the crucial
role of depth maps in the micro-expression recognition task. Initially, we
transform the depth map into a point cloud and obtain the motion information
for each point by aligning the onset frame with the apex frame and
performing a differential operation. Subsequently, we adjust the input
dimensions of all point cloud motion features and use them as inputs to
multiple point cloud networks to assess the efficacy of this representation.
PointNet++ was chosen as the final model for micro-expression recognition due
to its superior performance. Our experiments show that the proposed method
significantly outperforms existing deep learning methods, including the
baseline, on the dataset that includes depth information.
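The depth-to-point-cloud step with per-point motion by differencing can be sketched via pinhole back-projection; the camera intrinsics below are assumed values, and the paper's preprocessing may differ in detail:

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a depth map to a 3-D point cloud with the pinhole camera
    model: each pixel (u, v) with depth z maps to
    ((u - cx) * z / fx, (v - cy) * z / fy, z)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def point_motion(onset_pc, apex_pc):
    """Per-point motion vectors by differencing the aligned onset and apex
    clouds, mirroring the differential operation described above."""
    return apex_pc - onset_pc

# Toy 4x4 depth maps standing in for aligned onset/apex frames.
onset = depth_to_pointcloud(np.full((4, 4), 1.0), 500.0, 500.0, 2.0, 2.0)
apex = depth_to_pointcloud(np.full((4, 4), 1.1), 500.0, 500.0, 2.0, 2.0)
motion = point_motion(onset, apex)
```

The resulting (N, 3) motion field depends only on geometry, not on pixel intensities, which is why it is insensitive to lighting changes as the abstract argues.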
