Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition
As a fundamental aspect of human life, two-person interactions contain
meaningful information about people's activities, relationships, and social
settings. Human action recognition underpins many smart
applications in which personal privacy is a strong concern. However, recognizing
two-person interactions poses more challenges due to increased body occlusion
and overlap compared to single-person actions. In this paper, we propose a
point cloud-based network named Two-stream Multi-level Dynamic Point
Transformer for two-person interaction recognition. Our model addresses the
challenge of recognizing two-person interactions by incorporating local-region
spatial information, appearance information, and motion information. To achieve
this, we introduce a frame selection method, Interval Frame
Sampling (IFS), which efficiently samples frames from videos, capturing more
discriminative information in a relatively short processing time. Subsequently,
a frame features learning module and a two-stream multi-level feature
aggregation module extract global and partial features from the sampled frames,
effectively representing the local-region spatial information, appearance
information, and motion information related to the interactions. Finally, we
apply a transformer to perform self-attention on the learned features for the
final classification. Extensive experiments are conducted on two large-scale
datasets, the interaction subsets of NTU RGB+D 60 and NTU RGB+D 120. The
results show that our network outperforms state-of-the-art approaches across
all standard evaluation settings.
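The abstract names Interval Frame Sampling (IFS) but does not spell out its procedure. A minimal sketch of a common interval-based scheme that matches the description (split the video into equal intervals and draw one frame from each, so coverage is uniform and cost is low) might look as follows; the function name, signature, and random-per-interval choice are illustrative assumptions, not the authors' exact method:

```python
import random

def interval_frame_sampling(num_frames, num_samples, seed=None):
    """Hypothetical sketch of interval-based frame sampling: split the
    frame index range [0, num_frames) into num_samples equal intervals
    and draw one frame index from each, so the samples cover the whole
    video. Assumes num_frames >= num_samples."""
    rng = random.Random(seed)
    # Interval boundaries, e.g. num_frames=100, num_samples=10 -> 0,10,...,100
    bounds = [round(i * num_frames / num_samples) for i in range(num_samples + 1)]
    picks = []
    for lo, hi in zip(bounds, bounds[1:]):
        # One random frame index per interval; indices come out sorted
        # because the intervals are disjoint and increasing.
        picks.append(rng.randrange(lo, hi))
    return picks
```

Sampling one frame per interval rather than, say, the first `num_samples` frames keeps temporal coverage of the whole clip, which is consistent with the claim of capturing more discriminative information at low cost.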