An Improvement of Deep Learner-Based Human Activity Recognition with the Aid of Graph Convolution Features
Many researchers now focus on Human Action Recognition (HAR) based on deep-learning features derived from body joints and their trajectories in videos. Among many schemes, Joints and Trajectory-pooled 3D-Deep Geometric Positional Attention-based Hierarchical Bidirectional Recurrent convolutional Descriptors (JTDGPAHBRD) can provide a video descriptor by learning geometric features and trajectories of the body joints. However, the spatial-temporal dynamics of the different geometric features of the skeleton structure were not explored in depth. To solve this problem, this article develops a Graph Convolutional Network (GCN) on top of the JTDGPAHBRD to create a video descriptor for HAR. The GCN can obtain complementary information, such as higher-level spatial-temporal features, between consecutive frames to enhance end-to-end learning. In addition, to improve feature representation ability, a search space with several adaptive graph components is created. Then, a sampling- and computation-efficient evolution scheme is applied to explore this space. Moreover, the resultant GCN provides the temporal dynamics of the skeleton pattern, which are fused with the geometric features of the skeleton body joints and the trajectory coordinates from the JTDGPAHBRD to create a more effective video descriptor for HAR. Finally, extensive experiments show that the JTDGPAHBRD-GCN model outperforms existing HAR models on the Penn Action Dataset (PAD).
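The core idea above is fusing GCN-learned spatio-temporal skeleton features with the descriptor's geometric and trajectory features. A minimal PyTorch sketch of that fusion pattern follows; all module names, dimensions, and the mean-pool-then-concatenate fusion are illustrative assumptions, not the authors' implementation.

```python
# Sketch: graph convolution over the skeleton's joint graph, pooled over time
# and joints, concatenated with a precomputed geometric/trajectory descriptor.
import torch
import torch.nn as nn

class SkeletonGCNLayer(nn.Module):
    """One graph convolution per frame: X' = ReLU(A_hat @ X @ W)."""
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        self.register_buffer("adj", adj)            # (J, J) normalized adjacency
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                           # x: (B, T, J, C)
        x = torch.einsum("ij,btjc->btic", self.adj, x)  # aggregate joint neighbors
        return torch.relu(self.linear(x))

class FusedDescriptor(nn.Module):
    def __init__(self, adj, desc_dim, hid=64):
        super().__init__()
        self.gcn1 = SkeletonGCNLayer(3, hid, adj)
        self.gcn2 = SkeletonGCNLayer(hid, hid, adj)
        self.fuse = nn.Linear(hid + desc_dim, 128)

    def forward(self, joints, geo_desc):
        # joints: (B, T, J, 3) xyz coordinates; geo_desc: (B, desc_dim)
        h = self.gcn2(self.gcn1(joints))             # (B, T, J, hid)
        h = h.mean(dim=(1, 2))                       # pool over time and joints
        return self.fuse(torch.cat([h, geo_desc], dim=-1))
```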
Spatial-temporal Transformer-guided Diffusion based Data Augmentation for Efficient Skeleton-based Action Recognition
Recently, skeleton-based human action recognition has become a hot research
topic because the compact representation of human skeletons brings fresh
momentum to this research domain. As a result, researchers have begun to
notice the importance of analyzing human actions by extracting skeleton
information from RGB or other sensors. Leveraging the rapid development of
deep learning (DL), a significant number of skeleton-based human action
recognition approaches have recently been presented with carefully designed
DL structures. However, a well-trained DL model always demands high-quality,
sufficient data, which is hard to obtain without considerable expense and
human labor. In this paper, we introduce a novel data augmentation method
for skeleton-based action recognition tasks that can effectively generate
high-quality, diverse sequential actions. To obtain natural and realistic
action sequences, we propose denoising diffusion probabilistic models
(DDPMs) that generate a series of synthetic action sequences, with the
generation process precisely guided by a spatial-temporal transformer
(ST-Trans). Experimental results show that our method outperforms
state-of-the-art (SOTA) motion generation approaches on different naturality
and diversity metrics, and that its high-quality synthetic data can be
effectively deployed to existing action recognition models with significant
performance improvement.
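To make the generation pipeline concrete, here is a minimal PyTorch sketch of a DDPM whose noise predictor is a transformer over the action sequence. The real ST-Trans uses dedicated spatial and temporal attention; a plain nn.TransformerEncoder stands in here, so every name and hyperparameter below is an assumption, not the paper's design.

```python
# Sketch: standard DDPM ancestral sampling (Ho et al., 2020) with a
# transformer-based noise predictor over (B, T, feat_dim) action sequences.
import torch
import torch.nn as nn

class SeqDenoiser(nn.Module):
    def __init__(self, feat_dim, d_model=128, n_steps=1000):
        super().__init__()
        self.inp = nn.Linear(feat_dim, d_model)
        self.t_embed = nn.Embedding(n_steps, d_model)   # diffusion-step embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, feat_dim)

    def forward(self, x_t, t):                          # x_t: (B, T, feat_dim)
        h = self.inp(x_t) + self.t_embed(t)[:, None, :] # broadcast step over time
        return self.out(self.encoder(h))                # predicted noise eps

@torch.no_grad()
def sample(model, shape, betas):
    """Reverse diffusion: start from noise, denoise step by step."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                              # pure Gaussian noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])    # posterior mean
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                            # synthetic action sequence
```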
Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition
Capturing the dependencies between joints is critical in skeleton-based
action recognition. The Transformer shows great potential for modeling the
correlation of important joints. However, existing Transformer-based methods
cannot capture the correlation of different joints between frames, even
though this correlation is very useful, since different body parts (such as
the arms and legs in a long jump) move together across adjacent frames. To
address this problem, a novel spatio-temporal tuples Transformer (STTFormer)
method is proposed. The skeleton sequence is divided into several parts, and
the consecutive frames contained in each part are encoded. A spatio-temporal
tuples self-attention module is then proposed to capture the relationship of
different joints in consecutive frames. In addition, a feature aggregation
module is introduced between non-adjacent frames to enhance the ability to
distinguish similar actions. Compared with state-of-the-art methods, our
method achieves better performance on two large-scale datasets.
Comment: 14 pages, 5 figures
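The tuple mechanism can be sketched in a few lines of PyTorch: split the T-frame sequence into non-overlapping tuples of n consecutive frames and run self-attention jointly over every (frame, joint) token inside a tuple, so different joints in different frames can attend to each other. The class and hyperparameters below are illustrative, not the paper's exact design.

```python
# Sketch: self-attention over all joints across n consecutive frames per tuple.
import torch
import torch.nn as nn

class TupleSelfAttention(nn.Module):
    def __init__(self, n_frames_per_tuple, feat_dim, d_model=64, nhead=4):
        super().__init__()
        self.n = n_frames_per_tuple
        self.proj = nn.Linear(feat_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, x):                    # x: (B, T, J, C), T divisible by n
        B, T, J, C = x.shape
        # Tokens of one tuple = all joints of its n consecutive frames.
        x = x.reshape(B * (T // self.n), self.n * J, C)
        h = self.proj(x)
        h, _ = self.attn(h, h, h)            # joints attend across tuple frames
        return h.reshape(B, T, J, -1)        # back to sequence layout
```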
EqMotion: Equivariant Multi-agent Motion Prediction with Invariant Interaction Reasoning
Learning to predict agent motions with relationship reasoning is important
for many applications. In motion prediction tasks, maintaining motion
equivariance under Euclidean geometric transformations and invariance of
agent interactions is a critical and fundamental principle. However, such
equivariance and invariance properties are overlooked by most existing
methods. To fill this gap, we propose EqMotion, an efficient equivariant
motion prediction model with invariant interaction reasoning. To achieve
motion equivariance, we propose an equivariant geometric feature learning
module that learns a Euclidean-transformable feature through dedicated
designs of equivariant operations. To reason about agents' interactions, we
propose an invariant interaction reasoning module that achieves more stable
interaction modeling. To further promote more comprehensive motion features,
we propose an invariant pattern feature learning module that learns an
invariant pattern feature, which cooperates with the equivariant geometric
feature to enhance network expressiveness. We conduct experiments on four
distinct scenarios: particle dynamics, molecule dynamics, human skeleton
motion prediction, and pedestrian trajectory prediction. Experimental
results show that our method is not only generally applicable but also
achieves state-of-the-art prediction performance on all four tasks,
improving by 24.0/30.1/8.6/9.2%. Code is available at
https://github.com/MediaBrain-SJTU/EqMotion.
Comment: Accepted to CVPR 2023
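The equivariance principle the abstract invokes can be demonstrated with a toy PyTorch layer: coordinate updates built from relative positions, weighted by rotation-invariant pairwise distances, are themselves equivariant to Euclidean transformations. This is an EGNN-style illustration of the principle, not EqMotion's actual architecture; all names are assumptions.

```python
# Sketch: an equivariant coordinate update, plus a rotation sanity check.
import torch
import torch.nn as nn

class EquivariantLayer(nn.Module):
    def __init__(self, hid=32):
        super().__init__()
        # Edge weight computed from the invariant squared distance only.
        self.edge_mlp = nn.Sequential(nn.Linear(1, hid), nn.SiLU(), nn.Linear(hid, 1))

    def forward(self, x):                         # x: (N, 3) agent positions
        rel = x[:, None, :] - x[None, :, :]       # (N, N, 3) relative vectors
        d2 = (rel ** 2).sum(-1, keepdim=True)     # (N, N, 1) invariant feature
        w = self.edge_mlp(d2)                     # per-edge scalar weights
        return x + (w * rel).sum(dim=1)           # equivariant position update

# Rotating the input rotates the output identically.
layer = EquivariantLayer()
x = torch.randn(5, 3)
q, _ = torch.linalg.qr(torch.randn(3, 3))        # random orthogonal matrix
print(torch.allclose(layer(x) @ q, layer(x @ q), atol=1e-5))  # True
```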