RGB-D action and gesture recognition remains an interesting topic in
human-centered scene understanding, primarily due to the multiple granularities
and large variation in human motion. Although many RGB-D based action and
gesture recognition approaches have demonstrated remarkable results by
utilizing highly integrated spatio-temporal representations across multiple
modalities (i.e., RGB and depth data), they still encounter several challenges.
Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion
differences between local clips under different modalities. Secondly, the
intricate nature of highly integrated spatio-temporal modeling can lead to
optimization difficulties. Thirdly, redundant information adds complexity
and further complicates entangled spatio-temporal modeling. To address
these issues, we propose an innovative heuristic architecture called
Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture
recognition. The proposed MFST model comprises a 3D Central Difference
Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal
stages. The CDC-Stem enriches fine-grained temporal perception, and the
multiple hierarchical spatio-temporal stages construct dimension-independent
higher-order semantic primitives. Specifically, the CDC-Stem module captures
bottom-level spatio-temporal features and passes them successively to the
following factorized spatio-temporal stages, which capture hierarchical spatial
and temporal features through the Multi-Scale Convolution and Transformer
(MSC-Trans) hybrid block and Weight-shared Multi-Scale Transformer (WMS-Trans)
block. The seamless integration of these innovative designs results in a robust
spatio-temporal representation that outperforms state-of-the-art approaches on
RGB-D action and gesture recognition datasets.

Comment: ACM MM'2
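
As a rough illustration of the central difference idea behind the CDC-Stem, the sketch below applies a central difference convolution along a single temporal axis in plain Python. It is not the authors' implementation: the abstract does not give the formulation, so this assumes the commonly used form in which a vanilla convolution is combined with a term that subtracts the center sample scaled by the kernel sum. The function name `central_difference_conv1d`, the blending weight `theta`, and the reduction from 3D to 1D are all assumptions made for illustration.

```python
def central_difference_conv1d(x, w, theta=0.7):
    """Temporal central difference convolution on a 1D sequence.

    Hypothetical 1D simplification (not the paper's 3D CDC-Stem).
    Assumed form: y[t] = sum_i w[i] * x[t - r + i] - theta * x[t] * sum_i w[i]
    theta = 0 recovers a vanilla convolution; theta = 1 convolves pure
    differences between each neighbor and the center sample.
    """
    k = len(w)
    if k % 2 == 0:
        raise ValueError("kernel length must be odd so a center sample exists")
    r = k // 2                      # kernel radius
    w_sum = sum(w)                  # used by the central difference term
    out = []
    # "valid" convolution: only positions where the kernel fits entirely
    for t in range(r, len(x) - r):
        vanilla = sum(w[i] * x[t - r + i] for i in range(k))
        out.append(vanilla - theta * x[t] * w_sum)
    return out


if __name__ == "__main__":
    static_clip = [2.0, 2.0, 2.0, 2.0]   # no motion along time
    kernel = [1.0, 2.0, 1.0]
    # With theta = 1 a static signal produces no response,
    # while theta = 0 behaves like an ordinary convolution.
    print(central_difference_conv1d(static_clip, kernel, theta=1.0))
    print(central_difference_conv1d(static_clip, kernel, theta=0.0))
```

Under this assumed form, a larger `theta` suppresses the response to content that stays constant across frames and emphasizes frame-to-frame change, which is one plausible reading of how a central difference stem could enrich fine-grained temporal perception.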