Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning
Self-supervised learning has proved effective for skeleton-based human action
understanding, which is an important yet challenging topic. Previous works
mainly rely on contrastive learning or masked motion modeling paradigms to
model skeleton relations. However, these methods cannot effectively handle
sequence-level and joint-level representation learning at the same time. As a
result, the learned representations fail to generalize to
different downstream tasks. Moreover, combining these two paradigms in a naive
manner leaves the synergy between them untapped and can lead to interference in
training. To address these problems, we propose Prompted Contrast with Masked
Motion Modeling, PCM, for versatile 3D action representation
learning. Our method integrates the contrastive learning and masked prediction
tasks in a mutually beneficial manner, which substantially boosts the
generalization capacity for various downstream tasks. Specifically, masked
prediction provides novel training views for contrastive learning, which in
turn guides the masked prediction training with high-level semantic
information. Moreover, we propose a dual-prompted multi-task pretraining
strategy, which further improves model representations by reducing the
interference caused by learning the two different pretext tasks. Extensive
experiments on five downstream tasks under three large-scale datasets are
conducted, demonstrating the superior generalization capacity of PCM
compared to state-of-the-art works. Our project is publicly available at:
https://jhang2020.github.io/Projects/PCM3/PCM3.html
Comment: Accepted by ACM Multimedia 202
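The core idea of the abstract above — masked prediction supplying a novel training view for contrastive learning — can be sketched minimally. Everything here (the toy encoder, masking ratio, loss shape) is a hypothetical stand-in, not the paper's actual PCM model:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_joints(seq, ratio=0.4):
    """Randomly zero a fraction of joints; returns the masked sequence
    and the boolean mask (True = masked). The masked sequence doubles
    as an extra 'view' for contrastive learning."""
    mask = rng.random(seq.shape[:2]) < ratio          # (T, J)
    masked = seq.copy()
    masked[mask] = 0.0
    return masked, mask

def encode(seq, W):
    """Toy encoder: flatten joints per frame, linear map, mean-pool over time."""
    feats = seq.reshape(seq.shape[0], -1) @ W         # (T, D)
    return feats.mean(axis=0)                         # (D,)

def info_nce(z1, z2, negatives, tau=0.1):
    """Contrastive loss pulling z1 toward z2 and away from negatives."""
    def sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([sim(z1, z2)] + [sim(z1, n) for n in negatives]) / tau
    logits -= logits.max()
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

T, J, C, D = 16, 25, 3, 32                            # frames, joints, channels, dim
W = rng.normal(0, 0.1, (J * C, D))
seq = rng.normal(size=(T, J, C))

masked_seq, mask = mask_joints(seq)                   # masked view for contrast
z_full, z_masked = encode(seq, W), encode(masked_seq, W)
negatives = [encode(rng.normal(size=(T, J, C)), W) for _ in range(8)]

loss = info_nce(z_full, z_masked, negatives)
```

In the paper's full setup a reconstruction loss on the masked joints would be trained jointly with this contrastive term; the sketch only shows the direction of information flow described in the abstract.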
Hypergraph Transformer for Skeleton-based Action Recognition
Skeleton-based action recognition aims to predict human actions given human
joint coordinates with skeletal interconnections. To model such off-grid data
points and their co-occurrences, Transformer-based formulations would be a
natural choice. However, Transformers still lag behind state-of-the-art methods
using graph convolutional networks (GCNs). Transformers assume that the input
is permutation-invariant and homogeneous (partially alleviated by positional
encoding), which ignores an important characteristic of skeleton data, i.e.,
bone connectivity. Furthermore, each type of body joint has a clear physical
meaning in human motion, i.e., motion retains an intrinsic relationship
regardless of the joint coordinates, which is not explored in Transformers. In
fact, certain re-occurring groups of body joints are often involved in specific
actions, such as the subconscious hand movement for keeping balance. Vanilla
attention is incapable of describing such underlying relations that are
persistent and beyond pair-wise. In this work, we aim to exploit these unique
aspects of skeleton data to close the performance gap between Transformers and
GCNs. Specifically, we propose a new self-attention (SA) extension, named
Hypergraph Self-Attention (HyperSA), to incorporate inherently higher-order
relations into the model. The K-hop relative positional embeddings are also
employed to take bone connectivity into account. We name the resulting model
Hyperformer, and it achieves comparable or better performance w.r.t. accuracy
and efficiency than state-of-the-art GCN architectures on NTU RGB+D, NTU RGB+D
120, and Northwestern-UCLA datasets. On the largest NTU RGB+D 120 dataset, the
significantly improved performance reached by our Hyperformer demonstrates the
underestimated potential of Transformer models in this field.
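The higher-order relations described above can be illustrated with a minimal sketch: self-attention whose scores are biased by hypergraph membership, so joints in the same group (e.g. one arm) attend to each other more strongly. The additive bias, toy hyperedges, and weight shapes are assumptions for illustration, not the paper's HyperSA formulation:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hypergraph_self_attention(X, H, Wq, Wk, Wv):
    """Self-attention with a hypergraph bias.

    X: (J, C) joint features; H: (J, E) incidence matrix, H[j, e] = 1
    iff joint j belongs to hyperedge e (e.g. a limb group). Joints
    sharing a hyperedge receive an additive attention bias.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise attention term
    hyper_bias = H @ H.T                      # > 0 iff two joints share an edge
    return softmax(scores + hyper_bias) @ V

J, C, D, E = 6, 8, 16, 2
X = rng.normal(size=(J, C))
# two toy hyperedges: joints {0, 1, 2} and {3, 4, 5}
H = np.zeros((J, E)); H[:3, 0] = 1; H[3:, 1] = 1
Wq, Wk, Wv = (rng.normal(0, 0.3, (C, D)) for _ in range(3))
out = hypergraph_self_attention(X, H, Wq, Wk, Wv)   # (J, D) joint features
```

The point of the bias term is that it encodes group membership beyond pair-wise relations, which vanilla attention scores cannot express on their own.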
Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition
Recognizing interactive actions plays an important role in human-robot
interaction and collaboration. Previous methods use late fusion and
co-attention mechanisms to capture interactive relations, but they have limited
learning capability or are inefficient in adapting to more interacting
entities. By assuming that the priors of each entity are already known, they
also lack evaluation in a more general setting that addresses the diversity of
subjects. To
address these problems, we propose an Interactive Spatiotemporal Token
Attention Network (ISTA-Net), which simultaneously models spatial, temporal,
and interactive relations. Specifically, our network contains a tokenizer that
partitions Interactive Spatiotemporal Tokens (ISTs), a unified way to
represent the motions of multiple diverse entities. By extending the entity
dimension, ISTs provide better interactive representations. To jointly learn
along three dimensions in ISTs, multi-head self-attention blocks integrated
with 3D convolutions are designed to capture inter-token correlations. When
modeling correlations, a strict entity ordering is usually irrelevant for
recognizing interactive actions. To this end, Entity Rearrangement is proposed
to remove the ordering of interchangeable entities in ISTs. Extensive
experiments on four datasets verify the effectiveness of ISTA-Net by
outperforming state-of-the-art methods. Our code is publicly available at
https://github.com/Necolizer/ISTA-Net
Comment: IROS 2023 Camera-ready version. Project website:
https://necolizer.github.io/ISTA-Net
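The Entity Rearrangement idea above — making the model indifferent to the order of interchangeable entities — can be sketched by canonicalizing the entity axis before attention. The sort key and tensor layout are hypothetical stand-ins, not the paper's actual mechanism:

```python
import numpy as np

rng = np.random.default_rng(2)

def rearrange_entities(tokens):
    """Canonicalize the entity axis so interchangeable entities yield the
    same token tensor regardless of input order.

    tokens: (N, T, J, C) — N entities, T frames, J joints, C channels.
    Entities are sorted by a permutation-invariant scalar key (here the
    mean activation), so any input ordering maps to one canonical order.
    """
    keys = tokens.mean(axis=(1, 2, 3))          # one key per entity
    order = np.argsort(keys, kind="stable")
    return tokens[order]

N, T, J, C = 2, 4, 5, 3
tokens = rng.normal(size=(N, T, J, C))
swapped = tokens[::-1]                          # swap the two entities

a = rearrange_entities(tokens)
b = rearrange_entities(swapped)
assert np.allclose(a, b)                        # entity order no longer matters
```

Any downstream attention applied to the canonicalized tokens is then invariant to which person was labeled "entity 0" in the input, which is the property the abstract attributes to Entity Rearrangement.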