STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action
  Recognition

de Melo, Celso M.; Hauptmann, Alexander; Huang, Po-Yao; Liang, Junwei; Zhu, Xiaoyu

Authors: Celso M. de Melo, Alexander Hauptmann, Po-Yao Huang, Junwei Liang, Xiaoyu Zhu
Publication date: 31 March 2023

Abstract

We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques that take multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The model uses a hierarchical transformer with intra-frame offset attention and inter-frame self-attention. The attention mechanism allows the model to freely attend between any two vertex patches to learn non-local relationships in the spatial-temporal domain. Masked vertex modeling and future frame prediction are used as two self-supervised tasks to fully activate the bi-directional and auto-regressive attention in our hierarchical transformer. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is available at https://github.com/zgzxy001/STMT.

Comment: CVPR 202

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2303.18177

Last time updated on 08/04/2023