Effective Human Activity Recognition Based on Small Datasets
Most recent work on vision-based human activity recognition (HAR) focuses on
designing complex deep learning models for the task, which in turn requires
collecting large datasets. As acquiring and processing large training datasets
is usually very expensive, the problem of how dataset size can be reduced
without affecting recognition accuracy has to be tackled. To do so, we propose
a HAR method that consists of three steps: (i) data
transformation involving the generation of new features based on transforming
of raw data, (ii) feature extraction involving the learning of a classifier
based on the AdaBoost algorithm and the use of training data consisting of the
transformed features, and (iii) parameter determination and pattern recognition
involving the determination of parameters based on the features generated in
(ii) and the use of the parameters as training data for deep learning
algorithms used to recognize human activities. Compared to existing
approaches, the proposed approach has the advantage of being simple and
robust. It has been tested in a number of experiments on a relatively small
real dataset. The experimental results indicate that, using the proposed
method, human activities can be recognized more accurately even with a smaller
training dataset.
Comment: 7 pages
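As a rough illustration of the three-step pipeline above, the sketch below chains a hand-rolled feature transform, an AdaBoost classifier, and a small deep model. The transform, data shapes, and class count are placeholder assumptions, not the paper's actual design.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

def transform_raw(X):
    # Step (i): generate new features from the raw data; simple per-window
    # statistics stand in for the paper's transformation.
    return np.hstack([X.mean(axis=1, keepdims=True),
                      X.std(axis=1, keepdims=True),
                      X.min(axis=1, keepdims=True),
                      X.max(axis=1, keepdims=True)])

X_raw = np.random.randn(200, 50)        # 200 windows of 50 raw samples (toy data)
y = np.random.randint(0, 5, size=200)   # 5 activity classes (assumed)

X_feat = transform_raw(X_raw)

# Step (ii): learn an AdaBoost classifier on the transformed features.
ada = AdaBoostClassifier(n_estimators=50).fit(X_feat, y)

# Step (iii): use AdaBoost's per-class decision scores as the training
# representation for a small deep model.
X_scores = ada.decision_function(X_feat)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X_scores, y)
print(mlp.score(X_scores, y))
```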
Towards Understanding the Adversarial Vulnerability of Skeleton-based Action Recognition
Skeleton-based action recognition has attracted increasing attention due to
its strong adaptability to dynamic circumstances and potential for broad
applications such as autonomous and anonymous surveillance. With the help of
deep learning techniques, it has also witnessed substantial progress and
currently achieves around 90% accuracy in benign environments. On the other
hand, research on the vulnerability of skeleton-based action recognition under
different adversarial settings remains scant, which may raise security concerns
about deploying such techniques into real-world systems. However, filling this
research gap is challenging due to the unique physical constraints of skeletons
and human actions. In this paper, we attempt to conduct a thorough study
towards understanding the adversarial vulnerability of skeleton-based action
recognition. We first formulate the generation of adversarial skeleton actions as a
constrained optimization problem by representing or approximating the
physiological and physical constraints with mathematical formulations. Since
the primal optimization problem with equality constraints is intractable, we
propose to solve it by optimizing its unconstrained dual problem using ADMM. We
then specify an efficient plug-in defense, inspired by recent theories and
empirical observations, against the adversarial skeleton actions. Extensive
evaluations demonstrate the effectiveness of the attack and defense method
under different settings.
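To make the attack setting concrete, here is a heavily simplified sketch of perturbing a skeleton action under a bone-length constraint. The paper solves the constrained problem via its dual with ADMM; this sketch substitutes a plain penalty-plus-signed-gradient loop, and all shapes, indices, and hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def bone_lengths(x, bones):
    # x: (T, J, 3) joint positions; bones: list of (parent, child) index pairs
    return torch.stack([(x[:, c] - x[:, p]).norm(dim=-1) for p, c in bones], dim=-1)

def skeleton_attack(model, x, label, bones, step=0.01, iters=20, lam=10.0, budget=0.05):
    ref = bone_lengths(x, bones).detach()          # lengths to preserve
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        adv = x + delta
        # Maximize the classification loss while penalizing any change in
        # bone lengths (a stand-in for the paper's equality constraints).
        loss = -F.cross_entropy(model(adv[None]), label[None]) \
               + lam * (bone_lengths(adv, bones) - ref).pow(2).mean()
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()      # descend the combined loss
            delta.clamp_(-budget, budget)          # keep the perturbation small
            delta.grad.zero_()
    return (x + delta).detach()
```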
Multi Scale Temporal Graph Networks For Skeleton-based Action Recognition
Graph convolutional networks (GCNs) can effectively capture the features of
related nodes and improve model performance, and increasing attention is being
paid to employing GCNs in skeleton-based action recognition. However, existing
GCN-based methods have two problems. First, the consistency of temporal and
spatial features is ignored, since features are extracted node by node and
frame by frame. To obtain spatiotemporal features simultaneously, we design a
generic representation of skeleton sequences for action recognition and
propose a novel model called Temporal Graph Networks (TGN). Second, the
adjacency matrix of the graph describing the relations between joints mostly
depends on the physical connections between joints. To describe the relations
between joints in the skeleton graph more appropriately, we propose a
multi-scale graph strategy, adopting a full-scale graph, a part-scale graph,
and a core-scale graph to capture the local features of each joint and the
contour features of important joints. Experiments were carried out on two
large datasets, and the results show that TGN with our graph strategy
outperforms state-of-the-art methods.
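A minimal sketch of what a full-/part-/core-scale graph strategy could look like in code; the joint count, bone list, and part groupings below are illustrative assumptions, not the paper's exact partition.

```python
import numpy as np

J = 25                                    # NTU-style joint count (assumed)
full_A = np.eye(J)                        # full scale: physical skeleton edges
for i, j in [(0, 1), (1, 20), (20, 2), (2, 3)]:   # a few example bones only
    full_A[i, j] = full_A[j, i] = 1

# Part scale: pool joints into body parts (illustrative grouping).
parts = [[0, 1, 20, 2, 3],                        # torso and head
         [4, 5, 6, 7], [8, 9, 10, 11],            # arms
         [12, 13, 14, 15], [16, 17, 18, 19]]      # legs
part_pool = np.zeros((len(parts), J))
for p, idx in enumerate(parts):
    part_pool[p, idx] = 1.0 / len(idx)            # average within each part

# Core scale: keep only a few structurally important joints.
core = [20, 4, 8, 12, 16]                         # spine and limb roots (assumed)
core_pool = np.zeros((len(core), J))
for c, j in enumerate(core):
    core_pool[c, j] = 1.0

# A model can run GCN layers on full_A and on the pooled part/core graphs,
# then fuse the three feature streams.
```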
Progressive Spatio-Temporal Graph Convolutional Network for Skeleton-Based Human Action Recognition
Graph convolutional networks (GCNs) have been very successful in
skeleton-based human action recognition where the sequence of skeletons is
modeled as a graph. However, most of the GCN-based methods in this area train a
deep feed-forward network with a fixed topology that leads to high
computational complexity and restricts their application in low computation
scenarios. In this paper, we propose a method to automatically find a compact
and problem-specific topology for spatio-temporal graph convolutional networks
in a progressive manner. Experimental results on two widely used datasets for
skeleton-based human action recognition indicate that the proposed method has
competitive or even better classification performance compared to the
state-of-the-art methods, with much lower computational complexity.
Comment: 5 pages, 2 figures
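One generic way to realize a progressively compacted topology is an iterative prune-and-fine-tune loop over a learned adjacency matrix, sketched below. This is an assumed stand-in for illustration, not the authors' exact search procedure.

```python
import torch

def prune_adjacency(A, keep_ratio):
    # Zero out the smallest-magnitude edges, keeping a `keep_ratio` fraction.
    flat = A.abs().flatten()
    k = int(keep_ratio * flat.numel())
    thresh = flat.kthvalue(flat.numel() - k).values
    return A * (A.abs() > thresh).float()

A = torch.rand(25, 25, requires_grad=True)   # learnable adjacency (assumed)
for keep in [0.75, 0.5, 0.25]:               # progressively sparser stages
    with torch.no_grad():
        A.copy_(prune_adjacency(A, keep))
    # ... fine-tune the spatio-temporal GCN with the sparser topology here ...
```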
Improving Skeleton-based Action Recognition with Robust Spatial and Temporal Features
Recently, skeleton-based action recognition has made significant progress in
the computer vision community. Most state-of-the-art algorithms are based on
Graph Convolutional Networks (GCN) and target improving the network structure
of the backbone GCN layers. In this paper, we propose a novel mechanism to
learn more robust discriminative features in space and time. More
specifically, we add a Discriminative Feature Learning (DFL) branch to the
last layers of the network to extract discriminative spatial and temporal
features that help regularize the learning. We also formally advocate the use
of Direction-Invariant Features (DIF) as input to the neural networks. We show
that action recognition accuracy can be improved when these robust features
are learned and used. We compare our results with those of ST-GCN and related
methods on four datasets: NTU-RGBD60, NTU-RGBD120, SYSU 3DHOI and
Skeleton-Kinetics.
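To illustrate one plausible instantiation of direction-invariant input features, the sketch below re-expresses each frame's joints in a body-centered coordinate frame built from the hips and spine, removing global translation and facing direction. The joint indices follow an assumed NTU-style layout; the paper's exact DIF definition may differ.

```python
import numpy as np

def direction_invariant(x, hip_l=12, hip_r=16, spine=20):
    # x: (T, J, 3) joint positions
    origin = (x[:, hip_l] + x[:, hip_r]) / 2            # body center per frame
    xaxis = x[:, hip_r] - x[:, hip_l]                   # left-right axis
    xaxis /= np.linalg.norm(xaxis, axis=-1, keepdims=True)
    up = x[:, spine] - origin                           # rough up direction
    zaxis = np.cross(xaxis, up)
    zaxis /= np.linalg.norm(zaxis, axis=-1, keepdims=True)
    yaxis = np.cross(zaxis, xaxis)                      # completes the frame
    R = np.stack([xaxis, yaxis, zaxis], axis=1)         # (T, 3, 3) rotations
    # Rotate centered joints into the body frame, per frame and per joint.
    return np.einsum('tij,tvj->tvi', R, x - origin[:, None])
```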
On the spatial attention in Spatio-Temporal Graph Convolutional Networks for skeleton-based human action recognition
Graph convolutional networks (GCNs) achieved promising performance in
skeleton-based human action recognition by modeling a sequence of skeletons as
a spatio-temporal graph. Most of the recently proposed GCN-based methods
improve the performance by learning the graph structure at each layer of the
network using a spatial attention applied on a predefined graph adjacency
matrix that is optimized jointly with the model's parameters in an end-to-end
manner. In this paper, we analyze the spatial attention used in spatio-temporal
GCN layers and propose a symmetric spatial attention for better reflecting the
symmetric property of the relative positions of the human body joints when
executing actions. We also highlight the connection of spatio-temporal GCN
layers employing additive spatial attention to bilinear layers, and we propose
the spatio-temporal bilinear network (ST-BLN), which does not require the use of
predefined adjacency matrices and allows for a more flexible design of the model.
Experimental results show that the three models lead to effectively the same
performance. Moreover, by exploiting the flexibility provided by the proposed
ST-BLN, one can increase the efficiency of the model.
Comment: 7 pages, 5 figures
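A minimal sketch of symmetric spatial attention: the learned matrix is symmetrized so joint i attends to joint j exactly as j attends to i, mirroring the symmetry of relative joint positions, and is added to the predefined adjacency. Dimensions and the placement within the layer are assumptions.

```python
import torch
import torch.nn as nn

class SymmetricSpatialAttention(nn.Module):
    def __init__(self, joints=25):
        super().__init__()
        self.M = nn.Parameter(torch.zeros(joints, joints))  # learned attention

    def forward(self, x, A):
        # x: (N, C, T, V) features; A: (V, V) predefined adjacency
        attn = (self.M + self.M.t()) / 2      # symmetric by construction
        return torch.einsum('nctv,vw->nctw', x, A + attn)
```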
What and Where: Modeling Skeletons from Semantic and Spatial Perspectives for Action Recognition
Skeleton data, which consists of only the 2D/3D coordinates of the human
joints, has been widely studied for human action recognition. Existing methods
take the semantics as prior knowledge to group human joints and draw
correlations according to their spatial locations, which we call the semantic
perspective for skeleton modeling. In this paper, in contrast to previous
approaches, we propose to model skeletons from a novel spatial perspective,
from which the model takes the spatial location as prior knowledge to group
human joints and mines the discriminative patterns of local areas in a
hierarchical manner. The two perspectives are orthogonal and complementary to
each other; and by fusing them in a unified framework, our method achieves a
more comprehensive understanding of the skeleton data. In addition, we
customize two networks for the two perspectives. From the semantic
perspective, we propose a Transformer-like network that is adept at modeling
joint correlations, and present three effective techniques to adapt it for skeleton
data. From the spatial perspective, we transform the skeleton data into the
sparse format for efficient feature extraction and present two types of sparse
convolutional networks for sparse skeleton modeling. Extensive experiments are
conducted on three challenging datasets for skeleton-based human action/gesture
recognition, namely, NTU-60, NTU-120 and SHREC, where our method achieves
state-of-the-art performance
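The two-perspective idea can be sketched as two parallel streams whose features are fused before the classifier, as below. Both sub-networks are simplified stand-ins (a tiny Transformer encoder and an ordinary 1D convolution rather than the paper's sparse convolutional networks), and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TwoPerspectiveNet(nn.Module):
    def __init__(self, joints=25, dim=64, classes=60):
        super().__init__()
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        # Semantic stream: model correlations among joints within a frame.
        self.semantic = nn.Sequential(nn.Linear(3, dim),
                                      nn.TransformerEncoder(enc, num_layers=2))
        # Spatial stream: local pattern extraction (ordinary conv as stand-in).
        self.spatial = nn.Conv1d(3 * joints, dim, kernel_size=3, padding=1)
        self.head = nn.Linear(2 * dim, classes)

    def forward(self, x):                      # x: (N, T, V, 3)
        N, T, V, _ = x.shape
        sem = self.semantic(x.reshape(N * T, V, 3)).mean(dim=1)
        sem = sem.reshape(N, T, -1).mean(dim=1)                 # (N, dim)
        spa = self.spatial(x.reshape(N, T, V * 3).transpose(1, 2)).mean(dim=-1)
        return self.head(torch.cat([sem, spa], dim=-1))         # fused logits
```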
Temporal Extension Module for Skeleton-Based Action Recognition
We present a module that extends the temporal graph of a graph convolutional
network (GCN) for action recognition with a sequence of skeletons. Existing
methods attempt to represent a more appropriate spatial graph within each
frame, but disregard optimization of the temporal graph across frames. In this
work, we focus on adding extra inter-frame edges to multiple neighboring
vertices and extracting additional features based
on the extended temporal graph. Our module is a simple yet effective method to
extract correlated features of multiple joints in human movement. Moreover, our
module aids in further performance improvements, along with other GCN methods
that optimize only the spatial graph. We conduct extensive experiments on two
large datasets, NTU RGB+D and Kinetics-Skeleton, and demonstrate that our
module is effective for several existing models and our final model achieves
competitive or state-of-the-art performance.
Comment: 7 pages, 4 figures, preprint
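The extended temporal graph can be illustrated by building a block adjacency over T frames in which a joint connects not only to itself but also to its spatial neighbors in the adjacent frame; the construction below is an assumed simplification of the module.

```python
import numpy as np

def extended_temporal_adjacency(A_spatial, T):
    # A_spatial: (V, V) intra-frame adjacency (with self-loops)
    V = A_spatial.shape[0]
    A = np.zeros((T * V, T * V))
    for t in range(T):
        A[t*V:(t+1)*V, t*V:(t+1)*V] = A_spatial        # intra-frame edges
        if t + 1 < T:
            # Inter-frame edges to the same joint AND its spatial neighbors,
            # so temporal aggregation sees correlated joints across time.
            A[t*V:(t+1)*V, (t+1)*V:(t+2)*V] = A_spatial
            A[(t+1)*V:(t+2)*V, t*V:(t+1)*V] = A_spatial
    return A
```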
SpatioTemporal Focus for Skeleton-based Action Recognition
Graph convolutional networks (GCNs) are widely adopted in skeleton-based
action recognition due to their powerful ability to model data topology. We
argue that the performance of recently proposed skeleton-based action recognition
methods is limited by the following factors. First, the predefined graph
structures are shared throughout the network, lacking the flexibility and
capacity to model the multi-grain semantic information. Second, the relations
among the global joints are not fully exploited by the graph local convolution,
which may lose implicit joint relevance. For instance, actions such as
running and waving are performed by the co-movement of body parts and joints,
e.g., legs and arms, which are located far apart in the physical skeleton.
Inspired by recent attention mechanisms, we propose a multi-grain contextual
focus module, termed MCF, to capture the action associated relation information
from the body joints and parts. As a result, more explainable representations
for different skeleton action sequences can be obtained by MCF. In this study,
we follow the common practice of densely sampling the input skeleton
sequences, which introduces much redundancy, since the number of sampled
frames is unrelated to the action itself. To reduce this redundancy, a temporal
discrimination focus module, termed TDF, is developed to capture the local
sensitive points of the temporal dynamics. MCF and TDF are integrated into the
standard GCN network to form a unified architecture, named STF-Net. It is noted
that STF-Net provides the capability to capture robust movement patterns from
these skeleton topology structures, based on multi-grain context aggregation
and temporal dependency. Extensive experimental results show that our STF-Net
achieves state-of-the-art results on three challenging benchmarks:
NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton.
Comment: Submitted to TCSVT
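A hedged sketch of the temporal-focus idea behind TDF: score each frame with a small attention head and use the scores as weights for temporal pooling, so locally salient frames dominate the representation. The module below is an assumed simplification, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TemporalFocus(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # One score per frame, computed from a local temporal neighborhood.
        self.score = nn.Conv1d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):            # x: (N, C, T) frame-level features
        w = torch.softmax(self.score(x), dim=-1)   # (N, 1, T) frame weights
        return (x * w).sum(dim=-1)                 # weighted temporal pooling
```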
Skeleton Aware Multi-modal Sign Language Recognition
Sign language is used by deaf or speech-impaired people to communicate, and it
requires great effort to master. Sign Language Recognition (SLR) aims to bridge
the gap between sign language users and others by recognizing words from given videos.
It is an important yet challenging task since sign language is performed with
fast and complex movement of hand gestures, body posture, and even facial
expressions. Recently, skeleton-based action recognition has attracted increasing
attention due to its independence of subject and background variation.
Furthermore, it can be a strong complement to RGB/D modalities to boost the
overall recognition rate. However, skeleton-based SLR is still under
exploration due to the lack of annotations on hand keypoints. Some efforts have
been made to use hand detectors with pose estimators to extract hand keypoints
and learn to recognize sign language via a Recurrent Neural Network, but none
of them outperforms RGB-based methods. To this end, we propose a novel Skeleton
Aware Multi-modal SLR framework (SAM-SLR) to further improve the recognition
rate. Specifically, we propose a Sign Language Graph Convolution Network
(SL-GCN) to model the embedded dynamics and propose a novel Separable
Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. Our
skeleton-based method achieves a higher recognition rate compared with all
other single modalities. Moreover, our proposed SAM-SLR framework can further
enhance the performance by assembling our skeleton-based method with other RGB
and depth modalities. As a result, SAM-SLR achieves the highest performance in
both RGB (98.42%) and RGB-D (98.53%) tracks in 2021 Looking at People Large
Scale Signer Independent Isolated SLR Challenge. Our code is available at
https://github.com/jackyjsy/CVPR21Chal-SLR
Comment: This submission is a preprint version of our work SAM-SLR, which ranked 1st at the CVPR 2021 Challenge on Large Scale Signer Independent Isolated Sign Language Recognition
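The final multi-modal assembly step can be illustrated as a weighted late fusion of per-modality logits; the modality names, class count, and weights below are placeholders, not the challenge-winning configuration.

```python
import numpy as np

def fuse(logits_by_modality, weights):
    # logits_by_modality: dict of name -> (N, num_classes) arrays
    fused = sum(weights[m] * logits_by_modality[m] for m in logits_by_modality)
    return fused.argmax(axis=1)

preds = fuse({'skeleton': np.random.randn(4, 226),   # toy logits, 226 classes assumed
              'rgb': np.random.randn(4, 226)},
             weights={'skeleton': 1.0, 'rgb': 0.9})  # placeholder weights
print(preds)
```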