Graph Distillation for Action Detection with Privileged Modalities
We propose a technique that tackles action detection in multimodal videos
under a realistic and challenging condition in which only limited training data
and partially observed modalities are available. Common methods in transfer
learning do not take advantage of the extra modalities potentially available in
the source domain. On the other hand, previous work on multimodal learning only
focuses on a single domain or task and does not handle the modality discrepancy
between training and testing. In this work, we propose a method termed graph
distillation that incorporates rich privileged information from a large-scale
multimodal dataset in the source domain, and improves the learning in the
target domain where training data and modalities are scarce. We evaluate our
approach on action classification and detection tasks in multimodal videos, and
show that our model outperforms the state-of-the-art by a large margin on the
NTU RGB+D and PKU-MMD benchmarks. The code is released at
http://alan.vision/eccv18_graph/.
Comment: ECCV 2018
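The core idea of the abstract — letting privileged source modalities guide a weaker target modality through weighted graph edges — can be sketched as a distillation loss that sums KL terms from each "teacher" modality, scaled by a learned edge weight. The formulation below is a minimal illustrative sketch, not the authors' implementation; all function names, the temperature value, and the edge weights are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution (temperature-scaled)."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def graph_distillation_loss(logits_by_modality, target, edge_weights):
    """Hypothetical distillation loss for the target modality: a weighted
    sum of KL terms, one per privileged source modality, scaled by the
    graph edge weight from that source to the target."""
    student = softmax(logits_by_modality[target])
    loss = 0.0
    for src, weight in edge_weights.items():
        if src == target:
            continue
        # Softened teacher distribution, as in standard distillation.
        teacher = softmax(logits_by_modality[src], temperature=2.0)
        loss += weight * kl_divergence(teacher, student)
    return loss

# Example: an RGB student distilled from depth and skeleton teachers.
logits = {
    "rgb":      [1.0, 0.2, -0.5],
    "depth":    [1.2, 0.1, -0.4],
    "skeleton": [0.9, 0.3, -0.6],
}
edge_weights = {"depth": 0.7, "skeleton": 0.3}
loss = graph_distillation_loss(logits, "rgb", edge_weights)
```

In the paper the edge weights themselves are learned, so modalities that agree with the ground truth on a given example contribute more to the distillation signal; the fixed weights above only stand in for that mechanism.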
NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
Research on depth-based human activity analysis achieved outstanding
performance and demonstrated the effectiveness of 3D representation for action
recognition. The existing depth-based and RGB+D-based action recognition
benchmarks have a number of limitations, including the lack of large-scale
training samples, realistic number of distinct class categories, diversity in
camera views, varied environmental conditions, and variety of human subjects.
In this work, we introduce a large-scale dataset for RGB+D human action
recognition, which is collected from 106 distinct subjects and contains more
than 114 thousand video samples and 8 million frames. This dataset contains 120
different action classes including daily, mutual, and health-related
activities. We evaluate the performance of a series of existing 3D activity
analysis methods on this dataset, and show the advantage of applying deep
learning methods for 3D-based human action recognition. Furthermore, we
investigate a novel one-shot 3D activity recognition problem on our dataset,
and a simple yet effective Action-Part Semantic Relevance-aware (APSR)
framework is proposed for this task, which yields promising results for
recognition of the novel action classes. We believe the introduction of this
large-scale dataset will enable the community to apply, adapt, and develop
various data-hungry learning techniques for depth-based and RGB+D-based human
activity understanding. [The dataset is available at:
http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp]
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Real-time human action recognition using raw depth video-based recurrent neural networks
This work proposes and compares two different approaches for real-time human action recognition (HAR) from raw depth video sequences. Both proposals are based on the convolutional long short-term memory (ConvLSTM) unit, with differences in the architecture and the long-term learning. The former uses a video-length-adaptive input data generator (stateless), whereas the latter exploits the stateful ability of general recurrent neural networks, applied here to the particular case of HAR. This stateful property allows the model to accumulate discriminative patterns from previous frames without compromising computer memory. Furthermore, since the proposal uses only depth information, HAR is carried out while preserving the privacy of people in the scene, since their identities cannot be recognized. Both neural networks have been trained and tested using the large-scale NTU RGB+D dataset. Experimental results show that the proposed models achieve competitive recognition accuracies with lower computational cost than state-of-the-art methods, and prove that, in the particular case of videos, the rarely used stateful mode of recurrent neural networks significantly improves the accuracy obtained with the standard mode. The recognition accuracies obtained are 75.26% (CS) and 75.45% (CV) for the stateless model, with an average time consumption of 0.21 s per video, and 80.43% (CS) and 79.91% (CV) with 0.89 s for the stateful one.
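The stateless/stateful distinction the abstract describes can be illustrated with a toy recurrent cell: in stateless mode the hidden state is reset at every chunk of frames, while in stateful mode the final state of one chunk seeds the next, so earlier context survives across chunk boundaries. This is a minimal sketch under the assumption of a scalar toy cell standing in for the paper's ConvLSTM; all names are illustrative.

```python
import math

def rnn_step(state, frame, w_state=0.5, w_input=0.5):
    """One step of a toy recurrent cell: blend previous state with the
    current input. Stands in for the ConvLSTM cell used in the paper."""
    return math.tanh(w_state * state + w_input * frame)

def run_stateless(chunks):
    """Stateless mode: the hidden state is reset for every chunk, so
    long-term context across chunks is lost."""
    outputs = []
    for chunk in chunks:
        state = 0.0  # reset at each chunk boundary
        for frame in chunk:
            state = rnn_step(state, frame)
        outputs.append(state)
    return outputs

def run_stateful(chunks):
    """Stateful mode: the final state of one chunk seeds the next,
    accumulating patterns without holding all frames in memory."""
    outputs, state = [], 0.0
    for chunk in chunks:
        for frame in chunk:
            state = rnn_step(state, frame)
        outputs.append(state)
    return outputs

# Identical later chunks: stateless mode cannot tell them apart, while
# stateful mode still carries information from the first chunk.
chunks = [[0.9, 0.8], [0.1, 0.0], [0.1, 0.0]]
stateless_out = run_stateless(chunks)
stateful_out = run_stateful(chunks)
```

The contrast explains the accuracy gap the abstract reports: with video, frames far apart can still be discriminative, and only the stateful mode keeps a path for that information to flow.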
SAR-NAS: Skeleton-based Action Recognition via Neural Architecture Searching
This paper presents a study of automatic design of neural network
architectures for skeleton-based action recognition. Specifically, we encode a
skeleton-based action instance into a tensor and carefully define a set of
operations to build two types of network cells: normal cells and reduction
cells. The recently developed DARTS (Differentiable Architecture Search) is
adopted to search for an effective network architecture that is built upon the
two types of cells. All operations are 2D based in order to reduce the overall
computation and search space. Experiments on the challenging NTU RGB+D and
Kinetics datasets have verified that most of the networks developed to date
for skeleton-based action recognition are likely not compact and efficient. The
proposed method provides an approach to search for such a compact network that
is able to achieve comparable or even better performance than the
state-of-the-art methods.
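The DARTS search the abstract adopts relaxes the discrete choice of operation on each cell edge into a softmax-weighted mixture, so architecture parameters can be optimised by gradient descent and then discretised. The sketch below shows that continuous relaxation with toy 1-D stand-ins for the 2D operations the paper searches over; the operation set and names are assumptions for illustration, not the paper's search space.

```python
import math

def softmax(xs):
    """Softmax over a list of architecture parameters (alphas)."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Candidate operations for one edge of a cell (toy stand-ins for the
# convolution/pooling choices searched in normal and reduction cells).
OPS = {
    "identity": lambda x: x,
    "double":   lambda x: 2.0 * x,
    "zero":     lambda x: 0.0,
}

def mixed_op(x, alphas):
    """DARTS continuous relaxation: the edge outputs a softmax-weighted
    sum of all candidate ops; the alphas are the architecture
    parameters optimised jointly with the network weights."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, OPS.values()))

def discretize(alphas):
    """After search, keep only the op with the largest weight."""
    names = list(OPS)
    return names[max(range(len(alphas)), key=lambda i: alphas[i])]

# With equal alphas every op contributes equally; after training,
# discretize() picks the dominant op for the final compact network.
equal_mix = mixed_op(1.0, [0.0, 0.0, 0.0])
chosen = discretize([0.1, 2.0, -1.0])
```

Restricting the candidate set to 2D operations, as the paper does, shrinks both the search space and the cost of evaluating each mixed edge.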
A Comparative Review of Recent Kinect-based Action Recognition Algorithms
Video-based human action recognition is currently one of the most active
research areas in computer vision. Various research studies indicate that the
performance of action recognition is highly dependent on the type of features
being extracted and how the actions are represented. Since the release of the
Kinect camera, a large number of Kinect-based human action recognition
techniques have been proposed in the literature. However, there still does not
exist a thorough comparison of these Kinect-based techniques under the grouping
of feature types, such as handcrafted versus deep learning features and
depth-based versus skeleton-based features. In this paper, we analyze and
compare ten recent Kinect-based algorithms for both cross-subject action
recognition and cross-view action recognition using six benchmark datasets. In
addition, we have implemented and improved some of these techniques and
included their variants in the comparison. Our experiments show that the
majority of methods perform better on cross-subject action recognition than
cross-view action recognition, that skeleton-based features are more robust for
cross-view recognition than depth-based features, and that deep learning
features are suitable for large datasets.
Comment: Accepted by the IEEE Transactions on Image Processing