5,429 research outputs found
Depth CNNs for RGB-D scene recognition: learning from scratch better than transferring from RGB-CNNs
Scene recognition with RGB images has been extensively studied and has
reached very remarkable recognition levels, thanks to convolutional neural
networks (CNN) and large scene datasets. In contrast, current RGB-D scene data
is much more limited, so often leverages RGB large datasets, by transferring
pretrained RGB CNN models and fine-tuning with the target RGB-D dataset.
However, we show that this approach has the limitation of hardly reaching
bottom layers, which is key to learn modality-specific features. In contrast,
we focus on the bottom layers, and propose an alternative strategy to learn
depth features combining local weakly supervised training from patches followed
by global fine tuning with images. This strategy is capable of learning very
discriminative depth-specific features with limited depth images, without
resorting to Places-CNN. In addition we propose a modified CNN architecture to
further match the complexity of the model and the amount of data available. For
RGB-D scene recognition, depth and RGB features are combined by projecting them
in a common space and further leaning a multilayer classifier, which is jointly
optimized in an end-to-end network. Our framework achieves state-of-the-art
accuracy on NYU2 and SUN RGB-D in both depth only and combined RGB-D data.Comment: AAAI Conference on Artificial Intelligence 201
NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
Research on depth-based human activity analysis achieved outstanding
performance and demonstrated the effectiveness of 3D representation for action
recognition. The existing depth-based and RGB+D-based action recognition
benchmarks have a number of limitations, including the lack of large-scale
training samples, realistic number of distinct class categories, diversity in
camera views, varied environmental conditions, and variety of human subjects.
In this work, we introduce a large-scale dataset for RGB+D human action
recognition, which is collected from 106 distinct subjects and contains more
than 114 thousand video samples and 8 million frames. This dataset contains 120
different action classes including daily, mutual, and health-related
activities. We evaluate the performance of a series of existing 3D activity
analysis methods on this dataset, and show the advantage of applying deep
learning methods for 3D-based human action recognition. Furthermore, we
investigate a novel one-shot 3D activity recognition problem on our dataset,
and a simple yet effective Action-Part Semantic Relevance-aware (APSR)
framework is proposed for this task, which yields promising results for
recognition of the novel action classes. We believe the introduction of this
large-scale dataset will enable the community to apply, adapt, and develop
various data-hungry learning techniques for depth-based and RGB+D-based human
activity understanding. [The dataset is available at:
http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp]Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI
Multi-View Region Adaptive Multi-temporal DMM and RGB Action Recognition
Human action recognition remains an important yet challenging task. This work
proposes a novel action recognition system. It uses a novel Multiple View
Region Adaptive Multi-resolution in time Depth Motion Map (MV-RAMDMM)
formulation combined with appearance information. Multiple stream 3D
Convolutional Neural Networks (CNNs) are trained on the different views and
time resolutions of the region adaptive Depth Motion Maps. Multiple views are
synthesised to enhance the view invariance. The region adaptive weights, based
on localised motion, accentuate and differentiate parts of actions possessing
faster motion. Dedicated 3D CNN streams for multi-time resolution appearance
information (RGB) are also included. These help to identify and differentiate
between small object interactions. A pre-trained 3D-CNN is used here with
fine-tuning for each stream along with multiple class Support Vector Machines
(SVM)s. Average score fusion is used on the output. The developed approach is
capable of recognising both human action and human-object interaction. Three
public domain datasets including: MSR 3D Action,Northwestern UCLA multi-view
actions and MSR 3D daily activity are used to evaluate the proposed solution.
The experimental results demonstrate the robustness of this approach compared
with state-of-the-art algorithms.Comment: 14 pages, 6 figures, 13 tables. Submitte
- …