NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
Research on depth-based human activity analysis has achieved outstanding
performance and demonstrated the effectiveness of 3D representations for action
recognition. The existing depth-based and RGB+D-based action recognition
benchmarks have a number of limitations, including a lack of large-scale
training samples, a realistic number of distinct class categories, diversity in
camera views, varied environmental conditions, and variety of human subjects.
In this work, we introduce a large-scale dataset for RGB+D human action
recognition, which is collected from 106 distinct subjects and contains more
than 114 thousand video samples and 8 million frames. This dataset contains 120
different action classes including daily, mutual, and health-related
activities. We evaluate the performance of a series of existing 3D activity
analysis methods on this dataset, and show the advantage of applying deep
learning methods for 3D-based human action recognition. Furthermore, we
investigate a novel one-shot 3D activity recognition problem on our dataset,
and a simple yet effective Action-Part Semantic Relevance-aware (APSR)
framework is proposed for this task, which yields promising results for
recognition of the novel action classes. We believe the introduction of this
large-scale dataset will enable the community to apply, adapt, and develop
various data-hungry learning techniques for depth-based and RGB+D-based human
activity understanding. [The dataset is available at:
http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp]
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Spatial Temporal Transformer Network for Skeleton-based Action Recognition
Skeleton-based human action recognition has attracted great interest in
recent years, as skeleton data has been shown to be robust to
illumination changes, body scales, dynamic camera views, and complex
backgrounds. Nevertheless, an effective encoding of the latent information
underlying the 3D skeleton is still an open problem. In this work, we propose a
novel Spatial-Temporal Transformer network (ST-TR) which models dependencies
between joints using the Transformer self-attention operator. In our ST-TR
model, a Spatial Self-Attention module (SSA) is used to understand intra-frame
interactions between different body parts, and a Temporal Self-Attention module
(TSA) to model inter-frame correlations. The two are combined in a two-stream
network which outperforms state-of-the-art models using the same input data on
both NTU-RGB+D 60 and NTU-RGB+D 120.
Comment: Accepted as ICPRW2020 (FBE2020, Workshop on Facial and Body
Expressions, micro-expressions and behavior recognition) 8 pages, 2 figures.
arXiv admin note: substantial text overlap with arXiv:2008.0740
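The spatial and temporal self-attention described in the abstract above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses a single attention head, random projection matrices, and hypothetical sizes (30 frames, 25 joints, 3D coordinates, 8 output channels), all of which are assumptions made for illustration. The same scaled dot-product operator is applied once across the joints of each frame (spatial) and once across the frames of each joint (temporal).

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the second-to-last axis of x.

    x:   (..., n, in_dim) array; attention is computed among the n items.
    w_*: (in_dim, d) projection matrices (random here, learned in practice).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax rows sum to 1
    return weights @ v

rng = np.random.default_rng(0)
T, V, C, D = 30, 25, 3, 8  # frames, joints, input channels, output channels (assumed)
seq = rng.normal(size=(T, V, C))
w_q, w_k, w_v = (rng.normal(size=(C, D)) for _ in range(3))

# Spatial attention: joints attend to each other within every frame.
spatial = self_attention(seq, w_q, w_k, w_v)  # shape (T, V, D)

# Temporal attention: swap axes so each joint attends across frames, then swap back.
temporal = np.swapaxes(
    self_attention(np.swapaxes(seq, 0, 1), w_q, w_k, w_v), 0, 1
)  # shape (T, V, D)
```

In this sketch the two outputs share the same projection matrices only to keep the example short; a two-stream model would learn separate parameters for each stream.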
AnimGAN: A Spatiotemporally-Conditioned Generative Adversarial Network for Character Animation
Producing realistic character animations is one of the essential tasks in
human-AI interactions. Since an animation is a sequence of humanoid poses, the
task can be treated as a sequence generation problem with spatiotemporal
smoothness and realism constraints. Additionally, we wish to control the
behavior of AI agents by telling them what to do and, more specifically, how to
do it. We propose a spatiotemporally-conditioned GAN that generates a sequence
that is similar to a given sequence in terms of semantics and spatiotemporal
dynamics. Using an LSTM-based generator and a graph ConvNet discriminator, this
system is trained end-to-end on a large gathered dataset of gestures,
expressions, and actions. Experiments show that, compared to a traditional
conditional GAN, our method creates plausible, realistic, and semantically
relevant humanoid animation sequences that match user expectations.
Comment: Submitted to ICIP 202
Hierarchical long short-term memory for action recognition based on 3D skeleton joints from Kinect sensor
Action recognition is used in a wide range of applications such as human-computer interaction, intelligent video surveillance systems, video summarization, and robotics. Recognizing actions is important for intelligent agents to understand, learn from, and interact with the environment. Recent technology for acquiring RGB+D and 3D skeleton data, together with the development of deep learning models, has significantly increased the performance of action recognition models. In this research, a hierarchical Long Short-Term Memory (LSTM) model is proposed to recognize actions based on 3D skeleton joints from a Kinect sensor. The model takes each coordinate axis of the 3D skeleton joints and groups the joints in that axis into parts, namely spine, left and right arm, left and right hand, and left and right leg. To fit the hierarchically structured LSTM layers, the parts are concatenated into spine, arms, hands, and legs, and then concatenated into the body. The model then combines the per-axis bodies into a single final body, which is fed to the final layer to classify the action. Performance is measured using cross-view and cross-subject evaluation, achieving accuracies of 0.854 and 0.837, respectively, on 10 action classes of the NTU RGB+D dataset.
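The part grouping and hierarchical concatenation described in the abstract above can be sketched as plain array operations. This is an illustration of the data flow only: the LSTM layers that sit at each level of the hierarchy are omitted, the joint indices in `PARTS` are hypothetical (the abstract does not list them), and combining the three per-axis bodies by concatenation is an assumption.

```python
import numpy as np

# Hypothetical joint-index groups for a 25-joint skeleton (not from the paper).
PARTS = {
    "spine": [0, 1, 2, 3, 20],
    "left_arm": [4, 5, 6],
    "right_arm": [8, 9, 10],
    "left_hand": [7, 21, 22],
    "right_hand": [11, 23, 24],
    "left_leg": [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}

def group_axis(axis_seq):
    """Level 1: split one coordinate axis, shape (T, 25), into the seven parts."""
    return {name: axis_seq[:, idx] for name, idx in PARTS.items()}

def merge_sides(parts):
    """Level 2: concatenate left/right pairs into arms, hands, and legs."""
    def pair(name):
        return np.concatenate([parts["left_" + name], parts["right_" + name]], axis=1)
    return {"spine": parts["spine"], "arms": pair("arm"),
            "hands": pair("hand"), "legs": pair("leg")}

def to_body(merged):
    """Level 3: concatenate spine, arms, hands, and legs into one body per axis."""
    return np.concatenate([merged[k] for k in ("spine", "arms", "hands", "legs")], axis=1)

rng = np.random.default_rng(0)
T = 30                                   # number of frames (assumed)
skeleton = rng.normal(size=(T, 25, 3))   # (frames, joints, xyz)

# Build one body per coordinate axis, then combine them into a single final body.
bodies = [to_body(merge_sides(group_axis(skeleton[:, :, a]))) for a in range(3)]
final_body = np.concatenate(bodies, axis=1)  # shape (T, 75)
```

In a full model, each dictionary entry would pass through its own LSTM before being concatenated at the next level, and `final_body` would feed the classification layer.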