Cross-Modal Learning with 3D Deformable Attention for Action Recognition
An important challenge in vision-based action recognition is the embedding of
spatiotemporal features from two or more heterogeneous modalities into a single
feature. In this study, we propose a new 3D deformable transformer for action
recognition with adaptive spatiotemporal receptive fields and a cross-modal
learning scheme. The 3D deformable transformer consists of three attention
modules: 3D deformability, local joint stride, and temporal stride attention.
The two cross-modal tokens are fed into the 3D deformable attention module to
create a cross-attention token that reflects their spatiotemporal correlation.
Local joint stride attention is then applied to spatially combine attention and
pose tokens. Temporal stride attention reduces the number of input tokens along
the temporal axis and supports temporal representation learning without the
simultaneous use of all tokens. The deformable transformer iterates L times and
combines the last cross-modal token for classification. The proposed 3D
deformable transformer was tested on the NTU60, NTU120, FineGYM, and Penn
Action datasets, and achieved results better than or comparable to pre-trained
state-of-the-art methods, even without a pre-training process. In addition, by
visualizing the important joints and correlations during action recognition
through spatial joint and temporal stride attention, we show the potential for
explainable action recognition.
Comment: 10 pages, 8 figures
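
To make the token-reduction idea above concrete, here is a minimal PyTorch sketch of strided-key attention: queries keep every temporal token while keys and values are subsampled by a stride, so the module never has to relate all tokens to all tokens at once. The module name, tensor shapes, and the use of standard multi-head attention are illustrative assumptions, not the paper's 3D deformable attention.

```python
import torch
import torch.nn as nn

class TemporalStrideAttention(nn.Module):
    """Attend from every temporal token to a strided subset of key/value tokens."""

    def __init__(self, dim: int, num_heads: int = 8, stride: int = 2):
        super().__init__()
        self.stride = stride
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim) temporal tokens; keys/values keep every
        # `stride`-th frame, shrinking the attention map from T*T to
        # T*(T/stride) entries.
        kv = x[:, ::self.stride, :]
        out, _ = self.attn(x, kv, kv)
        return out

# e.g. 64 temporal tokens attending over 32 strided key frames (hypothetical sizes)
tokens = torch.randn(4, 64, 256)
y = TemporalStrideAttention(dim=256, stride=2)(tokens)  # -> (4, 64, 256)
```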
Multi-Dimensional Refinement Graph Convolutional Network with Robust Decouple Loss for Fine-Grained Skeleton-Based Action Recognition
Graph convolutional networks have been widely used in skeleton-based action
recognition. However, existing approaches are limited in fine-grained action
recognition because inter-class data are highly similar. Moreover, noisy data
from pose extraction increases the difficulty of fine-grained recognition. In
this work, we propose a flexible attention block called Channel-Variable
Spatial-Temporal Attention (CVSTA) to enhance the discriminative power of
spatial-temporal joints and obtain a more compact intra-class feature
distribution. Based on CVSTA, we construct a Multi-Dimensional Refinement Graph
Convolutional Network (MDR-GCN), which improves the discrimination among
channel-, joint- and frame-level features for fine-grained actions.
Furthermore, we propose a Robust Decouple Loss (RDL), which significantly
boosts the effect of CVSTA and reduces the impact of noise. The proposed
method, combining MDR-GCN with RDL, outperforms known state-of-the-art
skeleton-based approaches on the fine-grained datasets FineGym99 and FSD-10,
and also on the coarse-grained NTU-RGB+D X-view benchmark.
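
As a rough illustration of channel-variable attention over skeleton features, the sketch below gates each (frame, joint) location with a weight computed per channel group, so different channel groups can emphasize different joints and frames. The grouping scheme, the sigmoid gating, and all shapes are assumptions for illustration; CVSTA itself may be defined differently.

```python
import torch
import torch.nn as nn

class ChannelGroupedSTAttention(nn.Module):
    """Per-channel-group gates over the (frame, joint) grid of skeleton features."""

    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        # one attention logit per channel group at every (frame, joint) location
        self.score = nn.Conv2d(channels, groups, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, V) features over T frames and V joints
        gate = torch.sigmoid(self.score(x))                          # (N, G, T, V)
        gate = gate.repeat_interleave(x.size(1) // self.groups, dim=1)
        return x * gate  # channel groups emphasise different joints/frames

x = torch.randn(2, 64, 32, 25)        # e.g. 25 NTU joints over 32 frames
y = ChannelGroupedSTAttention(64)(x)  # -> (2, 64, 32, 25)
```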
Action recognition from RGB-D data
In recent years, action recognition based on RGB-D data has attracted increasing attention. Unlike traditional 2D action recognition, RGB-D data contains extra depth and skeleton modalities, each with its own characteristics. This thesis presents seven novel methods that take advantage of the three modalities for action recognition.
First, effective handcrafted features are designed, and a frequent pattern mining method is employed to mine the most discriminative, representative and non-redundant features for skeleton-based action recognition. Second, to take advantage of powerful Convolutional Neural Networks (ConvNets), it is proposed to represent the spatio-temporal information carried in 3D skeleton sequences as three 2D images, by encoding the joint trajectories and their dynamics into color distributions in the images; ConvNets are then adopted to learn discriminative features for human action recognition. Third, for depth-based action recognition, three data augmentation strategies are proposed so that ConvNets can be applied to small training datasets. Fourth, to take full advantage of the 3D structural information offered by the depth modality and its insensitivity to illumination variations, three simple, compact yet effective image-based representations are proposed, with ConvNets adopted for feature extraction and classification. However, both of the previous two methods are sensitive to noise and cannot differentiate fine-grained actions well. Fifth, to deal with this issue, it is proposed to represent a depth map sequence as three pairs of structured dynamic images, at the body, part and joint levels respectively, through bidirectional rank pooling. The structured dynamic images preserve spatial-temporal information, enhance structural information across both body parts/joints and different temporal scales, and take advantage of ConvNets for action recognition. Sixth, it is proposed to extract and use scene flow for action recognition from RGB and depth data. Last, to exploit the joint information in multi-modal features arising from heterogeneous sources (RGB, depth), it is proposed to cooperatively train a single ConvNet (referred to as c-ConvNet) on both RGB and depth features, and to deeply aggregate the two modalities to achieve robust action recognition.
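
For the fifth method, a minimal sketch of bidirectional rank pooling may help. It uses the widely known linear approximation of rank pooling (as in dynamic images) to collapse a depth-map sequence into a forward and a backward dynamic image; the coefficient formula and the omission of the body/part/joint structuring are simplifying assumptions, not the thesis's exact procedure.

```python
import numpy as np

def approx_rank_pool(frames: np.ndarray) -> np.ndarray:
    """Collapse (T, H, W) depth frames into one (H, W) dynamic image."""
    T = frames.shape[0]
    t = np.arange(1, T + 1, dtype=np.float64)
    alpha = 2.0 * t - T - 1.0          # later frames receive larger weights
    return np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))

def bidirectional_dynamic_images(frames: np.ndarray):
    # forward pooling encodes how the sequence evolves in time; pooling the
    # reversed sequence yields the paired backward dynamic image
    return approx_rank_pool(frames), approx_rank_pool(frames[::-1])

depth = np.random.rand(16, 240, 320)            # hypothetical 16-frame depth clip
fwd, bwd = bidirectional_dynamic_images(depth)  # two (240, 320) images
```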
VPN: Learning Video-Pose Embedding for Activities of Daily Living
In this paper, we focus on the spatio-temporal aspect of recognizing
Activities of Daily Living (ADL). ADL have two specific properties: (i) subtle
spatio-temporal patterns and (ii) similar visual patterns varying over time.
Consequently, ADL may look very similar, and distinguishing them often
necessitates looking at their fine-grained details. Because recent
spatio-temporal 3D ConvNets are too rigid to capture the subtle visual patterns
across an action, we propose a novel Video-Pose Network: VPN. The two key
components of this VPN are a spatial embedding and an attention network. The
spatial embedding projects the 3D poses and RGB cues into a common semantic
space. This enables the action recognition framework to learn better
spatio-temporal features exploiting both modalities. In order to discriminate
similar actions, the attention network provides two functionalities: (i) an
end-to-end learnable pose backbone exploiting the topology of the human body,
and (ii) a coupler that provides joint spatio-temporal attention weights across
a video. Experiments show that VPN outperforms the state-of-the-art results for
action classification on a large-scale human activity dataset, NTU-RGB+D 120;
its subset NTU-RGB+D 60; a challenging real-world human activity dataset,
Toyota Smarthome; and a small-scale human-object interaction dataset,
Northwestern UCLA.
Comment: Accepted in ECCV 2020
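
A minimal sketch of the two VPN components described above: linear projections embed the RGB and pose features into a common semantic space, and an additive-attention coupler turns the pose feature into spatio-temporal weights over the video feature map. All dimensions, the flattening of the 3D ConvNet output, and the additive coupling are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class VideoPoseEmbedding(nn.Module):
    """Project RGB and pose features into one space; let pose attend over video."""

    def __init__(self, rgb_dim: int, pose_dim: int, embed_dim: int):
        super().__init__()
        self.rgb_proj = nn.Linear(rgb_dim, embed_dim)   # spatial embedding
        self.pose_proj = nn.Linear(pose_dim, embed_dim)
        self.coupler = nn.Linear(embed_dim, 1)          # joint attention logits

    def forward(self, rgb: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        # rgb:  (N, T*H*W, rgb_dim) flattened 3D-ConvNet feature map
        # pose: (N, pose_dim) output of a learnable pose backbone
        r = self.rgb_proj(rgb)                          # common semantic space
        p = self.pose_proj(pose).unsqueeze(1)           # (N, 1, embed_dim)
        attn = torch.softmax(self.coupler(torch.tanh(r + p)), dim=1)
        return (attn * r).sum(dim=1)                    # attended video feature

rgb = torch.randn(2, 4 * 7 * 7, 512)   # hypothetical 3D-ConvNet output
pose = torch.randn(2, 256)
feat = VideoPoseEmbedding(512, 256, 128)(rgb, pose)  # -> (2, 128)
```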
Learning Graph Convolutional Network for Skeleton-based Human Action Recognition by Neural Searching
Human action recognition from skeleton data, fueled by the Graph
Convolutional Network (GCN), has attracted much attention due to its powerful
capability of modeling non-Euclidean structured data. However, many existing
GCN methods use a pre-defined graph that is fixed through the entire network,
which can lose implicit joint correlations. Besides, the mainstream spectral
GCN is approximated by a first-order hop, so higher-order connections are not
well involved. Therefore, huge efforts are required to explore a better GCN
architecture. To address these problems, we turn to Neural Architecture Search
(NAS) and propose the first automatically designed GCN for skeleton-based
action recognition. Specifically, we enrich the search space by providing
multiple dynamic graph modules after fully exploring the spatial-temporal
correlations between nodes. Besides, we introduce multiple-hop modules,
expecting to break the limitation of representational capacity caused by the
first-order approximation. Moreover, a sampling- and memory-efficient evolution
strategy is proposed to search for an optimal architecture for this task. The
resulting architecture proves the effectiveness of the higher-order
approximation and of the dynamic graph modeling mechanism with temporal
interactions, which had barely been discussed before. To evaluate the
performance of the searched model, we conduct extensive experiments on two very
large-scale datasets, and the results show that our model achieves
state-of-the-art results.
Comment: Accepted by AAAI 2020
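
To illustrate what a multiple-hop module buys over the first-order approximation, here is a sketch of a graph convolution that mixes powers of the joint adjacency, sum_k A^k X W_k, so joints several hops apart contribute directly. The fixed hop count, the toy normalization, and all shapes are assumptions; in the paper such modules (and the dynamic graphs) are selected by the evolutionary search.

```python
import torch
import torch.nn as nn

class MultiHopGCN(nn.Module):
    """Graph convolution mixing adjacency powers: sum_k A^k X W_k."""

    def __init__(self, in_dim: int, out_dim: int, adj: torch.Tensor, hops: int = 3):
        super().__init__()
        # precompute A^0 .. A^hops of the normalized (V, V) joint adjacency
        powers = [torch.eye(adj.size(0))]
        for _ in range(hops):
            powers.append(powers[-1] @ adj)
        self.register_buffer("powers", torch.stack(powers))  # (hops+1, V, V)
        self.weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(hops + 1)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, V, in_dim) per-joint features
        return sum(w(self.powers[k] @ x) for k, w in enumerate(self.weights))

adj = torch.rand(25, 25)
adj = adj / adj.sum(-1, keepdim=True)                    # toy normalized graph
out = MultiHopGCN(64, 128, adj)(torch.randn(2, 25, 64))  # -> (2, 25, 128)
```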