What and Where: Modeling Skeletons from Semantic and Spatial Perspectives for Action Recognition
Skeleton data, which consists of only the 2D/3D coordinates of the human
joints, has been widely studied for human action recognition. Existing methods
take the semantics as prior knowledge to group human joints and draw
correlations according to their spatial locations, which we call the semantic
perspective for skeleton modeling. In this paper, in contrast to previous
approaches, we propose to model skeletons from a novel spatial perspective,
from which the model takes the spatial location as prior knowledge to group
human joints and mines the discriminative patterns of local areas in a
hierarchical manner. The two perspectives are orthogonal and complementary;
by fusing them in a unified framework, our method achieves a more
comprehensive understanding of the skeleton data. In addition, we customize a
dedicated network for each perspective. From the semantic perspective, we
propose a Transformer-like network that excels at modeling joint
correlations, and present three effective techniques to adapt it to skeleton
data. From the spatial perspective, we transform the skeleton data into a
sparse format for efficient feature extraction and present two types of sparse
convolutional networks for sparse skeleton modeling. Extensive experiments are
conducted on three challenging datasets for skeleton-based human action/gesture
recognition, namely, NTU-60, NTU-120 and SHREC, where our method achieves
state-of-the-art performance.
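
The spatial-perspective branch rests on first converting a skeleton sequence into a sparse tensor format that sparse convolutional networks can consume. The sketch below is a minimal illustration of one plausible conversion in PyTorch, not the authors' code: the voxel size, the (t, x, y, z) coordinate layout, and the use of within-voxel offsets as features are all assumptions made for the example.

```python
import torch

def skeleton_to_sparse(joints: torch.Tensor, voxel_size: float = 0.05):
    """Quantize joints of shape (T, V, 3) into a sparse (coords, feats) pair.

    coords: (N, 4) integer tensor of (t, x, y, z) voxel indices.
    feats:  (N, 3) continuous offsets of each joint within its voxel.
    """
    T, V, _ = joints.shape
    # Shift so all coordinates are non-negative before quantization.
    joints = joints - joints.amin(dim=(0, 1), keepdim=True)
    vox = torch.floor(joints / voxel_size).long()              # (T, V, 3)
    t_idx = torch.arange(T).view(T, 1, 1).expand(T, V, 1)      # frame index
    coords = torch.cat([t_idx, vox], dim=-1).reshape(-1, 4)    # (T*V, 4)
    feats = (joints - vox.float() * voxel_size).reshape(-1, 3)
    # Sparse-conv inputs require unique coordinates; keep the first joint
    # that lands in each occupied voxel.
    coords, inverse = torch.unique(coords, dim=0, return_inverse=True)
    n = feats.size(0)
    first = torch.full((coords.size(0),), n, dtype=torch.long)
    first.scatter_reduce_(0, inverse, torch.arange(n), reduce="amin")
    return coords, feats[first]

# Example: a 64-frame sequence with 25 joints.
coords, feats = skeleton_to_sparse(torch.randn(64, 25, 3))
print(coords.shape, feats.shape)  # (N, 4) and (N, 3), with N <= 64*25
```

The resulting (coords, feats) pair matches the input format commonly expected by sparse convolution libraries; the paper's actual feature design may differ.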
Decoupled Spatial-Temporal Attention Network for Skeleton-Based Action Recognition
Dynamic skeletal data, represented as the 2D/3D coordinates of human joints,
has been widely studied for human action recognition due to its high-level
semantic information and environmental robustness. However, previous methods
heavily rely on designing hand-crafted traversal rules or graph topologies to
draw dependencies between the joints, which are limited in performance and
generalizability. In this work, we present a novel decoupled spatial-temporal
attention network (DSTA-Net) for skeleton-based action recognition. It
consists solely of attention blocks, allowing it to model spatial-temporal
dependencies between joints without requiring knowledge of their positions
or mutual connections. Specifically, to meet the particular requirements of the
skeletal data, three techniques are proposed for building attention blocks,
namely, spatial-temporal attention decoupling, decoupled position encoding and
spatial global regularization. In addition, on the data side, we introduce a
skeletal data decoupling technique to emphasize the specific characteristics of
space/time and different motion scales, resulting in a more comprehensive
understanding of human actions. To test the effectiveness of the proposed
method, extensive experiments are conducted on four challenging datasets for
skeleton-based gesture and action recognition, namely, SHREC, DHG, NTU-60 and
NTU-120, where DSTA-Net achieves state-of-the-art performance on all of them.
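
The central design of DSTA-Net, factoring attention into separate spatial and temporal blocks, can be illustrated with a short sketch. The PyTorch module below is a minimal rendition of that decoupling, not the released DSTA-Net code; the head count, the residual/LayerNorm arrangement, and the block ordering are assumptions, and the paper's decoupled position encoding and spatial global regularization are omitted.

```python
import torch
import torch.nn as nn

class AxisAttention(nn.Module):
    """Multi-head self-attention over one axis of a (B, T, V, C) tensor."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor, axis: str) -> torch.Tensor:
        B, T, V, C = x.shape
        if axis == "spatial":     # attend across joints, per frame
            seq = x.reshape(B * T, V, C)
        elif axis == "temporal":  # attend across frames, per joint
            seq = x.permute(0, 2, 1, 3).reshape(B * V, T, C)
        else:
            raise ValueError(axis)
        out, _ = self.attn(seq, seq, seq)
        seq = self.norm(seq + out)  # residual connection + LayerNorm
        if axis == "spatial":
            return seq.reshape(B, T, V, C)
        return seq.reshape(B, V, T, C).permute(0, 2, 1, 3)

class DecoupledSTBlock(nn.Module):
    """One decoupled block: spatial attention, then temporal attention."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.spatial = AxisAttention(channels, heads)
        self.temporal = AxisAttention(channels, heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.spatial(x, "spatial")
        return self.temporal(x, "temporal")

# Usage: a batch of 2 sequences, 64 frames, 25 joints, 128 channels.
x = torch.randn(2, 64, 25, 128)
print(DecoupledSTBlock(128)(x).shape)  # torch.Size([2, 64, 25, 128])
```

Restricting each attention block to a single axis keeps the cost at O(V^2 + T^2) per position rather than O((TV)^2) for joint attention over all time-joint pairs, which is the usual motivation for this kind of decoupling.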