Skeleton Focused Human Activity Recognition in RGB Video
The data-driven approach that learns an optimal representation of vision
features like skeleton frames or RGB videos is currently a dominant paradigm
for activity recognition. While great improvements have been achieved from
existing single modal approaches with increasingly larger datasets, the fusion
of various data modalities at the feature level has seldom been attempted. In
this paper, we propose a multimodal feature fusion model that utilizes both
skeleton and RGB modalities to infer human activity. The objective is to
improve activity recognition accuracy by effectively utilizing the mutually
complementary information among different data modalities. For the skeleton
modality, we propose to use a graph convolutional subnetwork to learn the
skeleton representation. For the RGB modality, we use the
spatial-temporal region of interest from RGB videos and take the attention
features from the skeleton modality to guide the learning process. The model
could be either individually or uniformly trained by the back-propagation
algorithm in an end-to-end manner. On the NTU-RGB+D
and Northwestern-UCLA Multiview datasets, the model achieves state-of-the-art performance,
which indicates that the proposed skeleton-driven attention mechanism for the
RGB modality increases the mutual communication between different data
modalities and brings more discriminative features for inferring human
activities.
Comment: 8 pages
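A minimal sketch of the kind of neighborhood aggregation a graph convolutional subnetwork performs on a skeleton. It assumes the common symmetric normalization D^-1/2 (A + I) D^-1/2 and omits learned weights; the abstract does not state which normalization the authors use, and the names `normalized_adjacency` and `graph_conv` and the 3-joint chain are hypothetical.

```python
import math


def normalized_adjacency(num_joints, bones):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected skeleton graph."""
    # Start from the identity so that every joint keeps its own features.
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in bones:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]


def graph_conv(adj, features):
    """Mix each joint's features with those of its neighbours (no learned weights)."""
    dim = len(features[0])
    return [[sum(adj[i][k] * features[k][d] for k in range(len(features)))
             for d in range(dim)]
            for i in range(len(adj))]


# Toy 3-joint chain 0 - 1 - 2 with 2-dimensional features per joint.
adj = normalized_adjacency(3, [(0, 1), (1, 2)])
out = graph_conv(adj, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In a trained network each such aggregation would be followed by a learned linear map and a nonlinearity; those are left out here to keep the aggregation step visible.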
Focusing and Diffusion: Bidirectional Attentive Graph Convolutional Networks for Skeleton-based Action Recognition
A collection of approaches based on graph convolutional networks have proven
successful in skeleton-based action recognition by exploring neighborhood
information and dense dependencies between intra-frame joints. However, these
approaches usually ignore the spatial-temporal global context as well as the
local relation between inter-frame and intra-frame. In this paper, we propose a
focusing and diffusion mechanism to enhance graph convolutional networks by
paying attention to the kinematic dependence of articulated human pose in a
frame and their implicit dependencies over frames. In the focusing process, we
introduce an attention module to learn a latent node over the intra-frame
joints to convey spatial contextual information. In this way, the sparse
connections between joints in a frame can be well captured, while the global
context over the entire sequence is further captured by these hidden nodes with
a bidirectional LSTM. In the diffusing process, the learned spatial-temporal
contextual information is passed back to the spatial joints, leading to a
bidirectional attentive graph convolutional network (BAGCN) that can facilitate
skeleton-based action recognition. Extensive experiments on the challenging NTU
RGB+D and Skeleton-Kinetics benchmarks demonstrate the efficacy of our
approach.
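The focusing step can be pictured as attention pooling: a latent node is formed as a weighted sum of the joint features in one frame. The sketch below assumes softmax weights over per-joint scores that are simply given; how BAGCN actually computes its scores is not described in the abstract, and `softmax` and `latent_node` are hypothetical helper names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def latent_node(joint_features, scores):
    """Attention-weighted sum of per-joint feature vectors within one frame."""
    weights = softmax(scores)
    dim = len(joint_features[0])
    return [sum(w * f[d] for w, f in zip(weights, joint_features))
            for d in range(dim)]


frame = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
uniform = latent_node(frame, [0.0, 0.0, 0.0])    # equal scores give the mean
peaked = latent_node(frame, [100.0, 0.0, 0.0])   # one dominant joint
```

The sequence of latent nodes, one per frame, is what a bidirectional LSTM would then consume to model context over the whole sequence.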
Effective Human Activity Recognition Based on Small Datasets
Most recent work on vision-based human activity recognition (HAR) focuses on
designing complex deep learning models for the task. In so doing, there is a
requirement for large datasets to be collected. As acquiring and processing
large training datasets are usually very expensive, the problem of how dataset
size can be reduced without affecting recognition accuracy has to be tackled.
To do so, we propose a HAR method that consists of three steps: (i) data
transformation involving the generation of new features by transforming
the raw data, (ii) feature extraction involving the learning of a classifier
based on the AdaBoost algorithm and the use of training data consisting of the
transformed features, and (iii) parameter determination and pattern recognition
involving the determination of parameters based on the features generated in
(ii) and the use of the parameters as training data for deep learning
algorithms to be used to recognize human activities. Compared to existing
approaches, the proposed approach has the advantage of being
simple and robust. It has been tested with a number of
experiments performed on a relatively small real dataset. The experimental
results indicate that using the proposed method, human activities can be more
accurately recognized even with a smaller training data size.
Comment: 7 pages
Feedback Graph Convolutional Network for Skeleton-based Action Recognition
Skeleton-based action recognition has attracted considerable attention in
computer vision since skeleton data is more robust to dynamic circumstances
and complicated backgrounds than other modalities. Recently, many researchers
have used the Graph Convolutional Network (GCN) to model spatial-temporal
features of skeleton sequences by an end-to-end optimization. However,
conventional GCNs are feedforward networks in which low-level layers cannot
access semantic information in the high-level layers. In this paper,
we propose a novel network, named Feedback Graph Convolutional Network (FGCN).
This is the first work that introduces the feedback mechanism into GCNs and
action recognition. Compared with conventional GCNs, FGCN has the following
advantages: (1) a multi-stage temporal sampling strategy is designed to extract
spatial-temporal features for action recognition in a coarse-to-fine
progressive process; (2) A dense connections based Feedback Graph Convolutional
Block (FGCB) is proposed to introduce feedback connections into the GCNs. It
transmits the high-level semantic features to the low-level layers and flows
temporal information stage by stage to progressively model global
spatial-temporal features for action recognition; (3) The FGCN model provides
early predictions. In the early stages, the model receives partial information
about actions. Naturally, its predictions are relatively coarse. The coarse
predictions are treated as the prior to guide the feature learning of later
stages for an accurate prediction. Extensive experiments on the datasets,
NTU-RGB+D, NTU-RGB+D120 and Northwestern-UCLA, demonstrate that the proposed
FGCN is effective for action recognition. It achieves state-of-the-art
performance on the three datasets.
Comment: 18 pages, 5 figures
On the spatial attention in Spatio-Temporal Graph Convolutional Networks for skeleton-based human action recognition
Graph convolutional networks (GCNs) have achieved promising performance in
skeleton-based human action recognition by modeling a sequence of skeletons as
a spatio-temporal graph. Most of the recently proposed GCN-based methods
improve the performance by learning the graph structure at each layer of the
network using spatial attention applied to a predefined graph adjacency
matrix that is optimized jointly with the model's parameters in an end-to-end
manner. In this paper, we analyze the spatial attention used in spatio-temporal
GCN layers and propose a symmetric spatial attention for better reflecting the
symmetric property of the relative positions of the human body joints when
executing actions. We also highlight the connection of spatio-temporal GCN
layers employing additive spatial attention to bilinear layers, and we propose
the spatio-temporal bilinear network (ST-BLN) which does not require the use of
predefined adjacency matrices and allows for more flexible design of the model.
Experimental results show that the three models lead to effectively the same
performance. Moreover, by exploiting the flexibility provided by the proposed
ST-BLN, one can increase the efficiency of the model.
Comment: 7 pages, 5 figures
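To make the idea of a symmetric spatial attention concrete, the sketch below symmetrizes an unconstrained attention matrix by averaging it with its transpose and then adds it to a predefined adjacency matrix. This is only one simple way to impose symmetry and is an assumption, not necessarily the construction used in the paper; `symmetrize` and `additive_attention` are hypothetical names.

```python
def symmetrize(m):
    """Average a square matrix with its transpose so that s[i][j] == s[j][i]."""
    n = len(m)
    return [[0.5 * (m[i][j] + m[j][i]) for j in range(n)] for i in range(n)]


def additive_attention(adj, attn):
    """Add a symmetrized attention matrix to a predefined adjacency matrix."""
    sym = symmetrize(attn)
    n = len(adj)
    return [[adj[i][j] + sym[i][j] for j in range(n)] for i in range(n)]


adj = [[0.0, 1.0], [1.0, 0.0]]      # two joints connected by one bone
attn = [[0.2, 0.4], [0.0, 0.1]]     # unconstrained (asymmetric) attention
graph = additive_attention(adj, attn)
```

With symmetric attention, the influence of joint i on joint j equals that of j on i, which mirrors the symmetric relative positions of body joints mentioned in the abstract.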
Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition
Skeleton-based human action recognition has attracted great interest thanks
to the easy accessibility of the human skeleton data. Recently, there has been a
trend of using very deep feedforward neural networks to model the 3D
coordinates of joints without considering the computational efficiency. In this
paper, we propose a simple yet effective semantics-guided neural network (SGN)
for skeleton-based action recognition. We explicitly introduce the high level
semantics of joints (joint type and frame index) into the network to enhance
the feature representation capability. In addition, we exploit the relationship
of joints hierarchically through two modules, i.e., a joint-level module for
modeling the correlations of joints in the same frame and a frame-level module
for modeling the dependencies of frames by taking the joints in the same frame
as a whole. A strong baseline is proposed to facilitate the study of this
field. With an order of magnitude smaller model size than most previous works,
SGN achieves the state-of-the-art performance on the NTU60, NTU120, and SYSU
datasets. The source code is available at https://github.com/microsoft/SGN.
Comment: Accepted by CVPR2020.
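As a rough illustration of introducing joint type and frame index into the input, the sketch below appends one-hot codes for both to each joint's 3D coordinates. This is a simplification under stated assumptions: the abstract only says the semantics are introduced explicitly, and the actual SGN code may embed and combine them differently. `one_hot` and `add_semantics` are hypothetical names.

```python
def one_hot(index, size):
    """Return a one-hot list of the given size with a 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def add_semantics(sequence):
    """Append joint-type and frame-index one-hot codes to each joint.

    sequence[t][j] is the (x, y, z) coordinate of joint j at frame t.
    """
    num_frames = len(sequence)
    num_joints = len(sequence[0])
    return [[list(sequence[t][j]) + one_hot(j, num_joints) + one_hot(t, num_frames)
             for j in range(num_joints)]
            for t in range(num_frames)]


# Toy sequence: 2 frames, 3 joints.
seq = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0), (1.0, 1.1, 0.0)],
]
enriched = add_semantics(seq)
```

Each joint now carries 3 + 3 + 2 = 8 values, so a downstream layer can tell which joint and which frame a coordinate belongs to without relying on its position in the tensor.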
Symbiotic Graph Neural Networks for 3D Skeleton-based Human Action Recognition and Motion Prediction
3D skeleton-based action recognition and motion prediction are two essential
problems of human activity understanding. Many previous works 1) studied
the two tasks separately, neglecting their internal correlations, and 2) did not
capture sufficient relations inside the body. To address these issues, we
propose a symbiotic model to handle the two tasks jointly, and we propose two
scales of graphs to explicitly capture relations among body-joints and
body-parts. Together, we propose symbiotic graph neural networks, which contain
a backbone, an action-recognition head, and a motion-prediction head. The two heads
are trained jointly and enhance each other. For the backbone, we propose
multi-branch multi-scale graph convolution networks to extract spatial and
temporal features. The multi-scale graph convolution networks are based on
joint-scale and part-scale graphs. The joint-scale graphs contain actional
graphs, capturing action-based relations, and structural graphs, capturing
physical constraints. The part-scale graphs integrate body-joints to form
specific parts, representing high-level relations. Moreover, dual bone-based
graphs and networks are proposed to learn complementary features. We conduct
extensive experiments for skeleton-based action recognition and motion
prediction with four datasets, NTU-RGB+D, Kinetics, Human3.6M, and CMU Mocap.
Experiments show that our symbiotic graph neural networks achieve better
performance on both tasks compared to the state-of-the-art methods.
Comment: submitted to IEEE-TPAM
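Two of the ingredients named above, bone-based features and part-scale graphs, can be sketched in a few lines. The sketch assumes a bone is the offset of a joint from its parent and that a part is represented by the mean of its joints; both are common conventions rather than details confirmed by the abstract, and the 4-joint skeleton, `bone_vectors`, and `pool_parts` are hypothetical.

```python
def bone_vectors(joints, parents):
    """Bone feature of joint j: its position minus its parent's position.

    The root is its own parent, so its bone vector is all zeros.
    """
    return [[c - p for c, p in zip(joints[j], joints[parents[j]])]
            for j in range(len(joints))]


def pool_parts(joints, parts):
    """Part-scale node: mean position of the joints assigned to each part."""
    dim = len(joints[0])
    return [[sum(joints[j][d] for j in part) / len(part) for d in range(dim)]
            for part in parts]


# Toy skeleton: root (0), neck (1), left hand (2), right hand (3).
joints = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [-1.0, 1.0, 0.0]]
parents = [0, 0, 1, 1]
bones = bone_vectors(joints, parents)
part_nodes = pool_parts(joints, [[0, 1], [2, 3]])  # torso, arms
```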
Temporal Attention-Augmented Graph Convolutional Network for Efficient Skeleton-Based Human Action Recognition
Graph convolutional networks (GCNs) have been very successful in modeling
non-Euclidean data structures, like sequences of body skeletons forming actions
modeled as spatio-temporal graphs. Most GCN-based action recognition methods
use deep feed-forward networks with high computational complexity to process
all skeletons in an action. This leads to a high number of floating point
operations (ranging from 16G to 100G FLOPs) to process a single sample, making
their adoption in computation-restricted application scenarios infeasible. In
this paper, we propose a temporal attention module (TAM) for increasing the
efficiency in skeleton-based action recognition by selecting the most
informative skeletons of an action at the early layers of the network. We
incorporate the TAM in a light-weight GCN topology to further reduce the
overall number of computations. Experimental results on two benchmark datasets
show that the proposed method outperforms the baseline GCN-based method by a
large margin while requiring 2.9 times fewer computations. Moreover, it
performs on par with the state of the art with up to 9.6 times fewer
computations.
Comment: 8 pages, 4 figures, International Conference on Pattern Recognition
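The efficiency gain from selecting informative skeletons early can be illustrated with a top-k frame selection. In this sketch the per-frame attention scores are simply given; in the paper they would come from the learned TAM, whose internals the abstract does not describe. `select_frames` is a hypothetical helper.

```python
def select_frames(frames, scores, k):
    """Keep the k highest-scoring frames, preserving their temporal order."""
    top = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:k]
    keep = sorted(top)  # restore chronological order
    return keep, [frames[t] for t in keep]


frames = ["f0", "f1", "f2", "f3"]   # stand-ins for per-frame skeletons
scores = [0.1, 0.9, 0.3, 0.8]       # assumed attention scores, one per frame
kept_idx, kept_frames = select_frames(frames, scores, k=2)
```

Later layers then process k frames instead of the full sequence, which is where the reduction in computation comes from.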
Infrared and 3D skeleton feature fusion for RGB-D action recognition
A challenge of skeleton-based action recognition is the difficulty of
classifying actions with similar motions and object-related actions. Visual cues
from other streams help in that regard. RGB data are sensitive to illumination
conditions, thus unusable in the dark. To alleviate this issue and still
benefit from a visual stream, we propose a modular network (FUSION) combining
skeleton and infrared data. A 2D convolutional neural network (CNN) is used as
a pose module to extract features from skeleton data. A 3D CNN is used as an
infrared module to extract visual cues from videos. Both feature vectors are
then concatenated and exploited conjointly using a multilayer perceptron (MLP).
Skeleton data also condition the infrared videos, providing a crop around the
performing subjects and thus virtually focusing the attention of the infrared
module. Ablation studies show that using networks pre-trained on other
large-scale datasets as our modules, together with data augmentation, yields
considerable improvements in action classification accuracy. The strong contribution of
our cropping strategy is also demonstrated. We evaluate our method on the NTU
RGB+D dataset, the largest dataset for human action recognition from depth
cameras, and report state-of-the-art performance.
Comment: 11 pages, 5 figures, submitted to IEEE Access
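The skeleton-conditioned crop can be sketched as a bounding box around all 2D joint positions of a clip, padded by a margin and clipped to the frame. The margin value, the example frame size, and the function name `crop_box` are assumptions for illustration; the abstract does not give the exact cropping rule.

```python
def crop_box(joints_2d, width, height, margin=20):
    """Bounding box (left, top, right, bottom) around all joints of a clip.

    joints_2d[t] is a list of (x, y) pixel positions for frame t.
    The box is padded by `margin` pixels and clipped to the image bounds.
    """
    xs = [x for frame in joints_2d for x, _ in frame]
    ys = [y for frame in joints_2d for _, y in frame]
    left = max(0, int(min(xs)) - margin)
    top = max(0, int(min(ys)) - margin)
    right = min(width, int(max(xs)) + margin)
    bottom = min(height, int(max(ys)) + margin)
    return left, top, right, bottom


clip = [[(100, 50), (120, 200)], [(90, 60), (130, 210)]]
box = crop_box(clip, width=512, height=424)
edge_box = crop_box([[(5, 5), (500, 420)]], width=512, height=424)
```

Using one box for the whole clip keeps the crop stable across frames, so the video module sees the subject without frame-to-frame jitter.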
SpatioTemporal Focus for Skeleton-based Action Recognition
Graph convolutional networks (GCNs) are widely adopted in skeleton-based
action recognition due to their powerful ability to model data topology. We
argue that the performance of recently proposed skeleton-based action recognition
methods is limited by the following factors. First, the predefined graph
structures are shared throughout the network, lacking the flexibility and
capacity to model the multi-grain semantic information. Second, the relations
among the global joints are not fully exploited by the local graph convolution,
which may lose the implicit joint relevance. For instance, actions such as
running and waving are performed by the co-movement of body parts and joints,
e.g., legs and arms, which are nevertheless far apart in terms of physical connection.
Inspired by the recent attention mechanism, we propose a multi-grain contextual
focus module, termed MCF, to capture the action associated relation information
from the body joints and parts. As a result, more explainable representations
for different skeleton action sequences can be obtained by MCF. In this study,
we follow the common practice of densely sampling the input skeleton
sequences, which brings much redundancy since the number of
instances has nothing to do with the actions. To reduce the redundancy, a temporal
discrimination focus module, termed TDF, is developed to capture the local
sensitive points of the temporal dynamics. MCF and TDF are integrated into the
standard GCN network to form a unified architecture, named STF-Net. It is noted
that STF-Net provides the capability to capture robust movement patterns from
these skeleton topology structures, based on multi-grain context aggregation
and temporal dependency. Extensive experimental results show that our STF-Net
achieves state-of-the-art results on three challenging benchmarks:
NTU RGB+D 60, NTU RGB+D 120, and Kinetics-skeleton.
Comment: Submitted to TCSV