An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition
Skeleton-based action recognition is an important task that requires the
adequate understanding of movement characteristics of a human action from the
given skeleton sequence. Recent studies have shown that exploring spatial and
temporal features of the skeleton sequence is vital for this task.
Nevertheless, how to effectively extract discriminative spatial and temporal
features is still a challenging problem. In this paper, we propose a novel
Attention Enhanced Graph Convolutional LSTM Network (AGC-LSTM) for human action
recognition from skeleton data. The proposed AGC-LSTM can not only capture
discriminative features in spatial configuration and temporal dynamics but also
explore the co-occurrence relationship between spatial and temporal domains. We
also present a temporal hierarchical architecture to increase the temporal
receptive field of the top AGC-LSTM layer, which boosts the ability to learn
the high-level semantic representation and significantly reduces the
computation cost. Furthermore, to select discriminative spatial information,
the attention mechanism is employed to enhance information of key joints in
each AGC-LSTM layer. Experimental results on two datasets are provided: NTU
RGB+D dataset and Northwestern-UCLA dataset. The comparison results demonstrate
the effectiveness of our approach and show that it outperforms the
state-of-the-art methods on both datasets.
Comment: Accepted by CVPR201
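The joint-attention idea described above can be sketched independently of any LSTM cell: score each joint, normalize the scores with a softmax, and enhance the features of high-scoring joints. This is only a minimal illustration, not the paper's AGC-LSTM layer; `joint_attention` and its scoring vector `w` are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def joint_attention(features, w):
    """Score each joint, softmax the scores over joints, and re-weight
    joint features so key joints are enhanced. The residual (1 + attn)
    re-weighting keeps information from the remaining joints.
    features: (num_joints, dim) hidden states for one frame.
    w: (dim,) scoring vector (a stand-in for learned parameters)."""
    scores = features @ w                      # (num_joints,)
    scores = np.exp(scores - scores.max())
    attn = scores / scores.sum()               # softmax over joints
    return features * (1.0 + attn[:, None]), attn

feats = np.random.default_rng(0).normal(size=(25, 8))  # 25 joints (NTU layout)
out, attn = joint_attention(feats, np.ones(8))
```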
Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition
Skeleton-based human action recognition has attracted great interest thanks
to the easy accessibility of the human skeleton data. Recently, there is a
trend of using very deep feedforward neural networks to model the 3D
coordinates of joints without considering the computational efficiency. In this
paper, we propose a simple yet effective semantics-guided neural network (SGN)
for skeleton-based action recognition. We explicitly introduce the high level
semantics of joints (joint type and frame index) into the network to enhance
the feature representation capability. In addition, we exploit the relationship
of joints hierarchically through two modules, i.e., a joint-level module for
modeling the correlations of joints in the same frame and a frame-level module
for modeling the dependencies of frames by taking the joints in the same frame
as a whole. A strong baseline is proposed to facilitate the study of this
field. With an order of magnitude smaller model size than most previous works,
SGN achieves the state-of-the-art performance on the NTU60, NTU120, and SYSU
datasets. The source code is available at https://github.com/microsoft/SGN.
Comment: Accepted by CVPR2020.
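The semantic augmentation of joints can be sketched as appending joint-type and frame-index codes to each joint's 3D coordinates. SGN itself learns dense embeddings for these semantics; the one-hot version below is only an illustration of the idea.

```python
import numpy as np

def add_semantics(coords, num_joints, num_frames):
    """Append one-hot joint-type and frame-index codes to each joint.
    coords: (num_frames, num_joints, 3) array of 3D joint positions.
    Returns (num_frames, num_joints, 3 + num_joints + num_frames)."""
    T, J, _ = coords.shape
    joint_type = np.eye(num_joints)                     # which joint it is
    frame_idx = np.eye(num_frames)                      # which frame it is in
    jt = np.broadcast_to(joint_type, (T, J, J))
    fi = np.broadcast_to(frame_idx[:, None, :], (T, J, T))
    return np.concatenate([coords, jt, fi], axis=-1)

x = np.zeros((4, 25, 3))        # 4 frames, 25 joints
y = add_semantics(x, 25, 4)
```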
Skeleton Focused Human Activity Recognition in RGB Video
The data-driven approach that learns an optimal representation of vision
features like skeleton frames or RGB videos is currently a dominant paradigm
for activity recognition. While great improvements have been achieved from
existing single modal approaches with increasingly larger datasets, the fusion
of various data modalities at the feature level has seldom been attempted. In
this paper, we propose a multimodal feature fusion model that utilizes both
skeleton and RGB modalities to infer human activity. The objective is to
improve the activity recognition accuracy by effectively utilizing the
mutually complementary information among different data modalities. For the
skeleton modality, we propose to use a graph convolutional subnetwork to learn
the skeleton representation. For the RGB modality, we use the spatial-temporal
region of interest from RGB videos and take the attention features from the
skeleton modality to guide the learning process. The model can be trained
either individually or uniformly by the back-propagation algorithm in an
end-to-end manner. The experimental results for the NTU-RGB+D
and Northwestern-UCLA Multiview datasets show state-of-the-art performance,
which indicates that the proposed skeleton-driven attention mechanism for the
RGB modality increases the mutual communication between different data
modalities and brings more discriminative features for inferring human
activities.
Comment: 8 page
Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition
In skeleton-based action recognition, graph convolutional networks (GCNs),
which model the human body skeletons as spatiotemporal graphs, have achieved
remarkable performance. However, in existing GCN-based methods, the topology of
the graph is set manually, and it is fixed over all layers and input samples.
This may not be optimal for the hierarchical GCN and diverse samples in action
recognition tasks. In addition, the second-order information (the lengths and
directions of bones) of the skeleton data, which is naturally more informative
and discriminative for action recognition, is rarely investigated in existing
methods. In this work, we propose a novel two-stream adaptive graph
convolutional network (2s-AGCN) for skeleton-based action recognition. The
topology of the graph in our model can be either uniformly or individually
learned by the BP algorithm in an end-to-end manner. This data-driven method
increases the flexibility of the model for graph construction and brings more
generality to adapt to various data samples. Moreover, a two-stream framework
is proposed to model both the first-order and the second-order information
simultaneously, which shows notable improvement for the recognition accuracy.
Extensive experiments on two large-scale datasets, NTU-RGBD and
Kinetics-Skeleton, demonstrate that the performance of our model exceeds the
state-of-the-art by a significant margin.
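The second-order (bone) information described above can be sketched directly: a bone is the vector from a joint's parent to the joint, and the two streams are fused by averaging their class scores. This is only a toy reconstruction of the data preparation and late fusion, not the 2s-AGCN network itself.

```python
import numpy as np

def bone_stream(joints, parents):
    """Second-order (bone) information: each bone is the vector from a
    joint's parent to the joint. joints: (J, 3); parents[j] is the
    parent index (the root points to itself, giving a zero bone)."""
    return joints - joints[parents]

def two_stream_score(logits_joint, logits_bone):
    """Late fusion of the joint and bone streams: average their
    softmax class scores."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()
    return 0.5 * (softmax(logits_joint) + softmax(logits_bone))

parents = np.array([0, 0, 1, 2])            # tiny 4-joint chain
j = np.array([[0., 0, 0], [0, 1, 0], [0, 2, 0], [1, 2, 0]])
bones = bone_stream(j, parents)
fused = two_stream_score(np.array([2., 0, 0, 0]), np.array([0., 2, 0, 0]))
```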
DeepSSM: Deep State-Space Model for 3D Human Motion Prediction
Predicting future human motion plays a significant role in human-machine
interactions for a variety of real-life applications. In this paper, we build a
deep state-space model, DeepSSM, to predict future human motion. Specifically,
we formulate the human motion system as the state-space model of a dynamic
system and model the motion system by the state-space theory, offering a
unified formulation for diverse human motion systems. Moreover, a novel deep
network is designed to build this system, enabling us to utilize both the
advantages of deep network and state-space model. The deep network jointly
models the process of both the state-state transition and the state-observation
transition of the human motion system, and multiple future poses can be
generated via the state-observation transition of the model recursively. To
improve the modeling ability of the system, a unique loss function, ATPL
(Attention Temporal Prediction Loss), is introduced to optimize the model,
encouraging the system to achieve more accurate predictions by paying
increasing attention to the early time-steps. The experiments on two benchmark
datasets (i.e., Human3.6M and 3DPW) confirm that our method achieves
state-of-the-art performance with improved effectiveness. The code will be
available if the paper is accepted.
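The recursion at the heart of a state-space formulation can be sketched with a linear toy model: a state-state transition s_{t+1} = A s_t and a state-observation transition pose_t = C s_t, applied recursively to generate multiple future poses. DeepSSM replaces A and C with deep networks; the linear version below only illustrates the recursion.

```python
import numpy as np

def rollout_poses(s0, A, C, steps):
    """Recursively apply the state transition and read out a pose at
    each step. s0: (d_state,); A: (d_state, d_state); C: (d_pose, d_state)."""
    poses, s = [], s0
    for _ in range(steps):
        s = A @ s                 # state-state transition
        poses.append(C @ s)       # state-observation transition
    return np.stack(poses)        # (steps, d_pose)

rng = np.random.default_rng(0)
d_state, d_pose = 6, 3
P = rollout_poses(rng.normal(size=d_state),
                  0.9 * np.eye(d_state),      # a stable toy dynamics
                  rng.normal(size=(d_pose, d_state)),
                  steps=5)
```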
Temporal Graph Modeling for Skeleton-based Action Recognition
Graph Convolutional Networks (GCNs), which model skeleton data as graphs,
have obtained remarkable performance for skeleton-based action recognition.
In particular, the temporal dynamics of a skeleton sequence convey significant
information for the recognition task. To model temporal dynamics, GCN-based
methods only stack multi-layer 1D local convolutions that extract temporal
relations between adjacent time steps. As many local convolutions are
repeated, key temporal information between non-adjacent time steps may be lost
through information dilution. It therefore remains unclear how these methods
can fully explore the temporal dynamics of a skeleton sequence. In
this paper, we propose a Temporal Enhanced Graph Convolutional Network (TE-GCN)
to tackle this limitation. The proposed TE-GCN constructs a temporal relation
graph to capture complex temporal dynamics. Specifically, the constructed
temporal relation graph explicitly builds connections between semantically
related temporal features to model temporal relations between both adjacent and
non-adjacent time steps. Meanwhile, to further explore temporal dynamics, a
multi-head mechanism is designed to investigate multiple kinds of temporal
relations. Extensive experiments are performed on two widely used large-scale
datasets, NTU-60 RGB+D and NTU-120 RGB+D. The experimental results show that
the proposed model achieves state-of-the-art performance through its
contribution to temporal modeling for action recognition.
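A temporal relation graph over both adjacent and non-adjacent time steps can be sketched by connecting semantically similar steps, here via a thresholded cosine similarity, and aggregating each step's feature over its temporal neighbours. The threshold `tau` and the mean aggregation are illustrative choices, not the paper's construction.

```python
import numpy as np

def temporal_relation_graph(feats, tau=0.5):
    """feats: (T, d) per-time-step features. Build a graph whose edges
    link time steps with cosine similarity above tau (adjacent or not),
    then average each step's feature over its connected steps."""
    norm = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = norm @ norm.T                         # (T, T) cosine similarity
    adj = (sim > tau).astype(float)             # diagonal gives self-loops
    deg = adj.sum(axis=1, keepdims=True)
    return (adj / deg) @ feats                  # mean over temporal neighbours

f = np.random.default_rng(1).normal(size=(10, 8))
out = temporal_relation_graph(f)
```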
Feedback Graph Convolutional Network for Skeleton-based Action Recognition
Skeleton-based action recognition has attracted considerable attention in
computer vision since skeleton data is more robust to the dynamic circumstance
and complicated background than other modalities. Recently, many researchers
have used the Graph Convolutional Network (GCN) to model spatial-temporal
features of skeleton sequences by an end-to-end optimization. However,
conventional GCNs are feedforward networks, in which low-level layers cannot
access semantic information from the high-level layers. In this paper,
we propose a novel network, named Feedback Graph Convolutional Network (FGCN).
This is the first work that introduces the feedback mechanism into GCNs and
action recognition. Compared with conventional GCNs, FGCN has the following
advantages: (1) a multi-stage temporal sampling strategy is designed to extract
spatial-temporal features for action recognition in a coarse-to-fine
progressive process; (2) A dense connections based Feedback Graph Convolutional
Block (FGCB) is proposed to introduce feedback connections into the GCNs. It
transmits the high-level semantic features to the low-level layers and flows
temporal information stage by stage to progressively model global
spatial-temporal features for action recognition; (3) The FGCN model provides
early predictions. In the early stages, the model receives partial information
about actions. Naturally, its predictions are relatively coarse. The coarse
predictions are treated as a prior to guide the feature learning of later
stages toward an accurate prediction. Extensive experiments on the datasets,
NTU-RGB+D, NTU-RGB+D120 and Northwestern-UCLA, demonstrate that the proposed
FGCN is effective for action recognition. It achieves the state-of-the-art
performance on the three datasets.
Comment: 18 pages, 5 figure
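The feedback mechanism with stage-wise early predictions can be sketched with a two-layer toy network: at each stage the previous stage's high-level feature is mixed back into the low-level computation, and a prediction is emitted. All names and weight shapes below are hypothetical; this is not the FGCB block itself.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def feedback_forward(x, W_low, W_high, W_fb, W_cls, stages=3):
    """Run several stages over input x. Each stage feeds the previous
    stage's high-level feature back into the low-level layer via W_fb,
    then emits a prediction, so early outputs are coarse and later
    ones progressively refined."""
    high = np.zeros(W_high.shape[0])           # no feedback before stage 1
    preds = []
    for _ in range(stages):
        low = relu(W_low @ x + W_fb @ high)    # feedback connection
        high = relu(W_high @ low)
        preds.append(W_cls @ high)             # early prediction this stage
    return preds

rng = np.random.default_rng(0)
d_in, h, H, C = 12, 8, 6, 4
preds = feedback_forward(rng.normal(size=d_in),
                         rng.normal(size=(h, d_in)),
                         rng.normal(size=(H, h)),
                         rng.normal(size=(h, H)),
                         rng.normal(size=(C, H)))
```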
Infrared and 3D skeleton feature fusion for RGB-D action recognition
A challenge of skeleton-based action recognition is the difficulty to
classify actions with similar motions and object-related actions. Visual clues
from other streams help in that regard. RGB data are sensitive to illumination
conditions, and thus unusable in the dark. To alleviate this issue and still
benefit from a visual stream, we propose a modular network (FUSION) combining
skeleton and infrared data. A 2D convolutional neural network (CNN) is used as
a pose module to extract features from skeleton data. A 3D CNN is used as an
infrared module to extract visual cues from videos. Both feature vectors are
then concatenated and exploited conjointly using a multilayer perceptron (MLP).
Skeleton data also condition the infrared videos, providing a crop around the
performing subjects and thus virtually focusing the attention of the infrared
module. Ablation studies show that using networks pre-trained on other
large-scale datasets as our modules, together with data augmentation, yields
considerable improvements in action classification accuracy. The strong
contribution of
our cropping strategy is also demonstrated. We evaluate our method on the NTU
RGB+D dataset, the largest dataset for human action recognition from depth
cameras, and report state-of-the-art performance.
Comment: 11 pages, 5 figures, submitted to IEEE Acces
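The skeleton-conditioned cropping idea can be sketched as computing a bounding box from the subject's 2D joint positions and cropping the visual frame around it, which virtually focuses the visual stream on the performer. The margin and function name below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def skeleton_crop(frame, joints_2d, margin=10):
    """Crop an image around the subject using its 2D joint positions.
    frame: (H, W[, C]) image; joints_2d: (J, 2) pixel coordinates (x, y)."""
    H, W = frame.shape[:2]
    x0 = max(int(joints_2d[:, 0].min()) - margin, 0)
    x1 = min(int(joints_2d[:, 0].max()) + margin, W)
    y0 = max(int(joints_2d[:, 1].min()) - margin, 0)
    y1 = min(int(joints_2d[:, 1].max()) + margin, H)
    return frame[y0:y1, x0:x1]

img = np.zeros((240, 320))                       # a blank 240x320 frame
joints = np.array([[100, 50], [150, 200], [120, 120]])
crop = skeleton_crop(img, joints)
```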
Deep manifold-to-manifold transforming network for action recognition
Symmetric positive definite (SPD) matrices (e.g., covariances, graph
Laplacians, etc.) are widely used to model relationships in the spatial or
temporal domain. However, SPD matrices lie on Riemannian manifolds. In this
paper, we propose an end-to-end deep
manifold-to-manifold transforming network (DMT-Net) which can make SPD matrices
flow from one Riemannian manifold to another more discriminative one. To learn
discriminative SPD features characterizing both spatial and temporal
dependencies, we specifically develop three novel layers on manifolds: (i) the
local SPD convolutional layer, (ii) the non-linear SPD activation layer, and
(iii) the Riemannian-preserved recursive layer. The SPD property is preserved
through all layers without any requirement of singular value decomposition
(SVD), which is often used in existing methods at an expensive computational
cost. Furthermore, a diagonalizing SPD layer is designed to efficiently
calculate the final metric for the classification task. To evaluate our
proposed method, we conduct extensive experiments on the task of action
recognition, where input signals are popularly modeled as SPD matrices. The
experimental results demonstrate that our DMT-Net is highly competitive with
the state-of-the-art methods.
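SPD-preserving maps of the kind such networks rely on can be illustrated with a congruence transform: if X is SPD and W has full row rank, then W X W^T (plus a small diagonal regularizer) is again SPD, with no SVD required. This is a common building block in SPD networks generally, shown only to illustrate SPD preservation, not one of DMT-Net's specific layers.

```python
import numpy as np

def spd_transform(X, W, eps=1e-6):
    """Congruence transform W X W^T + eps*I. For SPD X and full-row-rank
    W, the output is again SPD, so the layer stays on the manifold."""
    return W @ X @ W.T + eps * np.eye(W.shape[0])

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
X = A @ A.T + np.eye(6)            # an SPD input (e.g., a covariance matrix)
W = rng.normal(size=(4, 6))        # maps to a lower-dimensional SPD matrix
Y = spd_transform(X, W)
```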
HAN: An Efficient Hierarchical Self-Attention Network for Skeleton-Based Gesture Recognition
Previous methods for skeleton-based gesture recognition mostly arrange the
skeleton sequence into a pseudo picture or spatial-temporal graph and apply
deep Convolutional Neural Network (CNN) or Graph Convolutional Network (GCN)
for feature extraction. Although achieving superior results, these methods have
inherent limitations in dynamically capturing local features of interactive
hand parts, and the computing efficiency still remains a serious issue. In this
work, the self-attention mechanism is introduced to alleviate this problem.
Considering the hierarchical structure of hand joints, we propose an efficient
hierarchical self-attention network (HAN) for skeleton-based gesture
recognition, which is based on pure self-attention without any CNN, RNN or GCN
operators. Specifically, the joint self-attention module is used to capture
spatial features of the fingers, while the finger self-attention module is
designed to aggregate features of the whole hand. In terms of temporal
features, the
temporal self-attention module is utilized to capture the temporal dynamics of
the fingers and the entire hand. Finally, these features are fused by the
fusion self-attention module for gesture classification. Experiments show that
our method achieves competitive results on three gesture recognition datasets
with much lower computational complexity.
Comment: Under peer review for TCSV
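The shared building block of such a pure-attention hierarchy is plain scaled dot-product self-attention, applied first over the joints of one finger, then over fingers, then over time. A minimal single-head sketch (queries, keys, and values all equal to the input, with no learned projections) is shown below; HAN's actual modules add learned parameters on top of this.

```python
import numpy as np

def self_attention(X):
    """Single-head scaled dot-product self-attention with Q = K = V = X.
    X: (n, d) token features, e.g. the joints of one finger."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                          # (n, n)
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)     # row-wise softmax
    return attn @ X

joints = np.random.default_rng(2).normal(size=(5, 16))  # 5 joints of a finger
out = self_attention(joints)
```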