NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
Research on depth-based human activity analysis has achieved outstanding
performance and demonstrated the effectiveness of 3D representations for action
recognition. However, existing depth-based and RGB+D-based action recognition
benchmarks have a number of limitations, including the lack of large-scale
training samples, a realistic number of distinct class categories, diversity in
camera views, varied environmental conditions, and variety of human subjects.
In this work, we introduce a large-scale dataset for RGB+D human action
recognition, which is collected from 106 distinct subjects and contains more
than 114 thousand video samples and 8 million frames. This dataset contains 120
different action classes including daily, mutual, and health-related
activities. We evaluate the performance of a series of existing 3D activity
analysis methods on this dataset, and show the advantage of applying deep
learning methods for 3D-based human action recognition. Furthermore, we
investigate a novel one-shot 3D activity recognition problem on our dataset,
and a simple yet effective Action-Part Semantic Relevance-aware (APSR)
framework is proposed for this task, which yields promising results for
recognition of the novel action classes. We believe the introduction of this
large-scale dataset will enable the community to apply, adapt, and develop
various data-hungry learning techniques for depth-based and RGB+D-based human
activity understanding. The dataset is available at:
http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
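As a rough illustration of the one-shot evaluation protocol described above (not of the APSR framework itself, whose action-part semantic relevance weighting is not detailed in the abstract), the sketch below assigns each query clip the label of its most similar novel-class exemplar in some pre-computed embedding space. The embedding network, tensor shapes, and cosine-similarity matching are assumptions made for the sake of the example.

```python
import torch
import torch.nn.functional as F

def one_shot_classify(query_emb, exemplar_embs, exemplar_labels):
    """Assign each query clip the label of its nearest novel-class exemplar.

    query_emb:       (B, D) embeddings of test clips from the novel classes
    exemplar_embs:   (K, D) one embedding per novel class (its single exemplar)
    exemplar_labels: list of K class ids, aligned with exemplar_embs
    """
    # Cosine similarity between every query and every class exemplar.
    q = F.normalize(query_emb, dim=1)
    e = F.normalize(exemplar_embs, dim=1)
    sims = q @ e.t()                      # (B, K) similarity matrix
    nearest = sims.argmax(dim=1)          # index of the most similar exemplar
    return [exemplar_labels[i] for i in nearest.tolist()]
```

Any embedding network trained on the auxiliary (seen) classes could supply `query_emb` and `exemplar_embs`; the one-shot split only fixes which classes contribute exemplars versus queries.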
Action-Attending Graphic Neural Network
The motion analysis of human skeletons is crucial for human action
recognition, which is one of the most active topics in computer vision. In this
paper, we propose a fully end-to-end action-attending graphic neural network
(AGNN) for skeleton-based action recognition, in which each irregular
skeleton is structured as an undirected attribute graph. To extract high-level
semantic representation from skeletons, we perform the local spectral graph
filtering on the constructed attribute graphs like the standard image
convolution operation. Considering that not all joints are informative for
action analysis, we design an action-attending layer to detect salient action
units (AUs) by adaptively weighting skeletal joints. Here the filtering
responses are parameterized into a weighting function that is invariant to the
order of input nodes. To further encode continuous motion variations, the deep
features learnt from skeletal graphs are gathered along consecutive temporal
slices and then fed into a recurrent gated network. Finally, the spectral graph
filtering, action-attending, and recurrent temporal encoding are integrated
and trained jointly, for robust action recognition as well as for the
interpretability of human actions. To evaluate our AGNN, we conduct
extensive experiments on four benchmark skeleton-based action datasets,
including the large-scale challenging NTU RGB+D dataset. The experimental
results demonstrate that our network achieves state-of-the-art performance.
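The pipeline sketched in this abstract (spectral graph filtering over the skeleton graph, an action-attending layer that adaptively weights joints, and a recurrent gated network over time) can be outlined roughly as follows. This is a minimal sketch: a simple GCN-style propagation rule stands in for the paper's local spectral filtering, a per-frame softmax attention over joints stands in for its order-invariant weighting function, and the layer sizes and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SkeletonAGNNSketch(nn.Module):
    """Illustrative pipeline: graph filtering -> joint attention -> GRU.

    Hypothetical sizes: J joints per skeleton, C input channels (e.g. 3D
    coordinates), `hidden` feature channels. `adj` is a normalized J x J
    adjacency matrix of the skeleton graph (self-loops included).
    """

    def __init__(self, adj, in_channels=3, hidden=64, num_classes=120):
        super().__init__()
        self.register_buffer("adj", adj)            # (J, J) normalized adjacency
        self.graph_fc = nn.Linear(in_channels, hidden)
        self.attn = nn.Linear(hidden, 1)            # per-joint salience score
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (batch, time, joints, channels)
        h = self.graph_fc(x)                                    # per-joint transform
        h = torch.einsum("ij,btjc->btic", self.adj, h).relu()   # neighbourhood aggregation
        w = torch.softmax(self.attn(h), dim=2)                  # attention over joints
        frame_feat = (w * h).sum(dim=2)                         # (B, T, hidden) weighted pooling
        _, last = self.gru(frame_feat)                          # temporal encoding
        return self.classifier(last[-1])                        # class logits
```

The design choice worth noting is that the joint weights are produced from the filtered responses themselves, so the same module can highlight different "action units" for different actions.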
Semantic Embedding Space for Zero-Shot Action Recognition
The number of categories for action recognition is growing rapidly. It is
thus becoming increasingly hard to collect sufficient training data to learn
conventional models for each category. This issue may be ameliorated by the
increasingly popular 'zero-shot learning' (ZSL) paradigm. In this framework a
mapping is constructed between visual features and a human interpretable
semantic description of each category, allowing categories to be recognised in
the absence of any training data. Existing ZSL studies focus primarily on image
data, and attribute-based semantic representations. In this paper, we address
zero-shot recognition in contemporary video action recognition tasks, using
semantic word vector space as the common space to embed videos and category
labels. This is more challenging because the mapping between the semantic space
and the space-time features of videos containing complex actions is harder to
learn. We demonstrate that a simple self-training and data
augmentation strategy can significantly improve the efficacy of this mapping.
Experiments on human action datasets including HMDB51 and UCF101 demonstrate
that our approach achieves the state-of-the-art zero-shot action recognition
performance.
Comment: 5 pages
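A minimal sketch of the zero-shot pipeline described above: visual features are regressed into a semantic word-vector space, each unseen class is represented by the word embedding of its name, and classification reduces to nearest-neighbour matching. The regression matrix, dimensions, and the simple prototype self-training loop below are assumptions standing in for the paper's learnt mapping and its self-training and data-augmentation strategy, not a faithful reimplementation.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(video_feats, W, class_word_vecs, class_names):
    """Nearest-neighbour zero-shot classification in word-vector space.

    video_feats:     (B, Dv) visual features of test videos
    W:               (Dv, Dw) regression mapping learnt on *seen* classes
    class_word_vecs: (K, Dw) word embeddings of the *unseen* class names
    """
    projected = F.normalize(video_feats @ W, dim=1)   # project into semantic space
    prototypes = F.normalize(class_word_vecs, dim=1)
    sims = projected @ prototypes.t()                 # cosine similarity to each class
    return [class_names[i] for i in sims.argmax(dim=1).tolist()]

def self_train_prototypes(projected, prototypes, iters=5):
    """Illustrative self-training step: move each class prototype toward the
    mean of the test projections currently assigned to it."""
    for _ in range(iters):
        assign = (projected @ prototypes.t()).argmax(dim=1)
        for k in range(prototypes.size(0)):
            members = projected[assign == k]
            if len(members) > 0:
                prototypes[k] = F.normalize(members.mean(dim=0), dim=0)
    return prototypes
```

The intuition behind the self-training step is that the seen-class regression places unseen-class videos near, but not exactly on, their name embeddings; re-centring the prototypes on the projected test data compensates for this domain shift.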
