Feature Learning for RGB-D Data
RGB-D data has proven to be a very useful representation for solving fundamental computer vision problems. It combines the advantages of color images, which provide appearance information about an object, with those of depth images, which are largely invariant to variations in color, illumination, rotation angle and scale. With the advent of the low-cost Microsoft Kinect sensor, initially aimed at gaming and later widely adopted in computer vision, high-quality RGB-D data can be acquired easily. RGB-D images and video can facilitate a wide range of application areas, such as computer vision, robotics, construction and medical imaging. Nevertheless, how to fuse RGB information and depth information remains an open problem in computer vision: simply concatenating the two modalities is not enough, and more powerful fusion algorithms are still needed. In this thesis, to explore more of the advantages of RGB-D data, we use several popular RGB-D datasets for evaluating deep feature learning algorithms, hyper-parameter optimization, local multi-modal feature learning, RGB-D data fusion, and recognizing RGB information from RGB-D images:

i) With the success of deep neural networks in computer vision, deep features from fused RGB-D data can be shown to achieve better results than RGB data alone. However, different deep learning algorithms perform differently across RGB-D datasets. Through large-scale experiments that comprehensively evaluate deep feature learning models for RGB-D image and video classification, we conclude that RGB-D fusion methods using CNNs consistently outperform the other methods evaluated (DBNs, SDAE and LSTM). On the other hand, since LSTMs can learn from experience to classify, process and predict time series, they achieved better performance than DBN and SDAE on video classification tasks.

ii) Hyper-parameter optimization can help researchers quickly choose an initial set of hyper-parameters for a new classification task, thus reducing the number of trials over the hyper-parameter space. We present a simple and efficient framework that improves the efficiency and accuracy of hyper-parameter optimization by taking into account the classification complexity of a particular dataset. We verify this framework on three real-world RGB-D datasets. Analysis of the experiments confirms that our framework provides deeper insight into the relationship between a dataset's classification task and hyper-parameter optimization, and thus quickly yields an accurate initial set of hyper-parameters for a new classification task.

iii) We propose a new local multi-modal feature learning framework, based on convolutional neural networks (CNNs), for RGB-D scene classification. This method effectively captures much of the local structure in RGB-D scene images and automatically learns a fusion strategy for the object-level recognition step, instead of simply training a classifier on top of features extracted from each modality (see the two-stream sketch after this abstract). Experiments on two popular datasets thoroughly test the performance of our method and show that our local multi-modal CNNs greatly outperform state-of-the-art approaches. Our method has the potential to improve RGB-D scene understanding. An extended evaluation shows that CNNs trained on a scene-centric dataset achieve an improvement on scene benchmarks compared with a network trained on an object-centric dataset.

iv) We propose a novel method for RGB-D data fusion. We project raw RGB-D data into a complex space and then jointly extract features from the fused RGB-D images (a minimal sketch of this idea also follows this abstract). Besides three observations about fusion methods, the experimental results show that our method achieves performance competitive with classical SIFT.

v) We propose a novel method, adaptive Visual-Depth Embedding (aVDE), which first learns a compact shared latent space between the two representations of the labeled RGB and depth modalities in the source domain. This shared latent space then helps transfer the depth information to the unlabeled target dataset. Finally, aVDE matches features and reweights instances jointly across the shared latent space and the projected target domain to obtain an adaptive classifier. The method exploits the additional depth information in the source domain while simultaneously reducing the domain mismatch between the source and target domains. On two real-world image datasets, the experimental results show that the proposed method significantly outperforms state-of-the-art methods.
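As flagged in item iii) above, the following is a minimal, hypothetical sketch of a local multi-modal CNN in that spirit: two small convolutional streams over RGB and depth patches feed a fusion layer that is trained jointly with the streams, rather than a classifier bolted onto independently extracted features. All layer sizes, the patch size and the class count are illustrative assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

class LocalMultiModalCNN(nn.Module):
    """Toy two-stream CNN with a jointly learned fusion layer (illustrative only)."""
    def __init__(self, num_classes: int):
        super().__init__()
        def stream():
            # Small convolutional stream applied to local patches of one modality.
            return nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
        self.rgb_stream = stream()
        self.depth_stream = stream()  # depth assumed encoded as 3 channels (e.g. HHA)
        # Fusion is learned end-to-end with both streams instead of being
        # trained separately on frozen, pre-extracted features.
        self.fusion = nn.Sequential(
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, rgb_patch, depth_patch):
        f_rgb = self.rgb_stream(rgb_patch).flatten(1)        # (B, 64)
        f_depth = self.depth_stream(depth_patch).flatten(1)  # (B, 64)
        return self.fusion(torch.cat([f_rgb, f_depth], dim=1))

model = LocalMultiModalCNN(num_classes=19)  # class count is a placeholder
logits = model(torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32))
```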
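Item iv) does not spell out the complex-space mapping in this abstract. The toy sketch below assumes one plausible reading, with grayscale intensity as the real part and normalized depth as the imaginary part of a fused complex image, so that magnitude and phase serve as jointly extracted features; the function and its normalization are hypothetical.

```python
import numpy as np

def fuse_rgbd_complex(rgb: np.ndarray, depth: np.ndarray):
    """rgb: (H, W, 3) uint8; depth: (H, W) raw sensor depth. Returns fused complex image."""
    intensity = rgb.mean(axis=2) / 255.0              # real part in [0, 1]
    d = depth.astype(np.float64)
    d = (d - d.min()) / max(d.max() - d.min(), 1e-9)  # imaginary part in [0, 1]
    z = intensity + 1j * d                            # one complex value per pixel
    return z, np.abs(z), np.angle(z)                  # fused image, magnitude, phase

rgb = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
depth = np.random.randint(500, 5000, (480, 640)).astype(np.float64)
z, magnitude, phase = fuse_rgbd_complex(rgb, depth)
```

Descriptors such as SIFT could then be computed on the magnitude and phase channels, which is one way a representation like this could be compared against classical SIFT on raw RGB.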
Depth CNNs for RGB-D scene recognition: learning from scratch better than transferring from RGB-CNNs
Scene recognition with RGB images has been extensively studied and has
reached remarkable recognition levels, thanks to convolutional neural
networks (CNNs) and large scene datasets. In contrast, current RGB-D scene data
is much more limited, so methods often leverage large RGB datasets by transferring
pretrained RGB CNN models and fine-tuning on the target RGB-D dataset.
However, we show that this approach has the limitation of hardly reaching the
bottom layers, which are key to learning modality-specific features. In contrast,
we focus on the bottom layers and propose an alternative strategy for learning
depth features: local weakly supervised training on patches followed
by global fine-tuning on images (sketched below). This strategy can learn very
discriminative depth-specific features from limited depth images, without
resorting to Places-CNN. In addition, we propose a modified CNN architecture to
better match the complexity of the model to the amount of available data. For
RGB-D scene recognition, depth and RGB features are combined by projecting them
into a common space and further learning a multilayer classifier, which is jointly
optimized in an end-to-end network. Our framework achieves state-of-the-art
accuracy on NYU2 and SUN RGB-D with both depth-only and combined RGB-D data.

Comment: AAAI Conference on Artificial Intelligence 2017
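A hedged sketch of the two-stage strategy described above: weakly supervised training of the bottom, modality-specific convolutional layers on depth patches that simply inherit their image's scene label, followed by global fine-tuning of the same layers on full images. The architecture, patch size, class count and learning rates are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

conv = nn.Sequential(  # bottom, modality-specific layers shared by both stages
    nn.Conv2d(1, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
)
patch_head = nn.Linear(64, 19)  # stage-1 classifier over patches
image_head = nn.Linear(64, 19)  # stage-2 classifier over full images

def random_patches(images, patch=32, n=8):
    """Crop n random patches per image; each keeps the image label (weak supervision)."""
    B, _, H, W = images.shape
    ys = torch.randint(0, H - patch, (B, n))
    xs = torch.randint(0, W - patch, (B, n))
    return torch.stack([images[b, :, ys[b, i]:ys[b, i] + patch, xs[b, i]:xs[b, i] + patch]
                        for b in range(B) for i in range(n)])

depth_imgs = torch.randn(4, 1, 128, 128)  # stand-in depth images
labels = torch.randint(0, 19, (4,))       # stand-in scene labels

# Stage 1: local weakly supervised training on patches (labels repeated per patch).
opt1 = torch.optim.SGD(list(conv.parameters()) + list(patch_head.parameters()), lr=0.01)
patches = random_patches(depth_imgs)
loss1 = nn.functional.cross_entropy(patch_head(conv(patches).flatten(1)),
                                    labels.repeat_interleave(8))
opt1.zero_grad(); loss1.backward(); opt1.step()

# Stage 2: global fine-tuning of the same bottom layers on full images.
opt2 = torch.optim.SGD(list(conv.parameters()) + list(image_head.parameters()), lr=0.001)
loss2 = nn.functional.cross_entropy(image_head(conv(depth_imgs).flatten(1)), labels)
opt2.zero_grad(); loss2.backward(); opt2.step()
```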
Deep Affordance-grounded Sensorimotor Object Recognition
It is well-established by cognitive neuroscience that human perception of
objects constitutes a complex process, where object appearance information is
combined with evidence about the so-called object "affordances", namely the
types of actions that humans typically perform when interacting with them. This
fact has recently motivated the "sensorimotor" approach to the challenging task
of automatic object recognition, where both information sources are fused to
improve robustness. In this work, the aforementioned paradigm is adopted,
surpassing current limitations of sensorimotor object recognition research.
Specifically, the deep learning paradigm is introduced to the problem for the
first time, developing a number of novel neuro-biologically and
neuro-physiologically inspired architectures that utilize state-of-the-art
neural networks for fusing the available information sources in multiple ways.
The proposed methods are evaluated using a large RGB-D corpus, which is
specifically collected for the task of sensorimotor object recognition and is
made publicly available. Experimental results demonstrate the utility of
affordance information for object recognition, with its inclusion yielding up to
a 29% relative error reduction.

Comment: 9 pages, 7 figures, dataset link included, accepted to CVPR 2017
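The abstract does not detail the fusion architectures, so the sketch below is purely illustrative: score-level ("late") fusion of an appearance stream and an affordance stream via a learned mixing weight, with made-up feature dimensions. The paper's neuro-biologically inspired schemes are more elaborate and are not reproduced here.

```python
import torch
import torch.nn as nn

# Hypothetical per-stream classifiers over precomputed features.
appearance_net = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
affordance_net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
alpha = nn.Parameter(torch.tensor(0.5))  # learned balance between the two streams

def fused_scores(appearance_feat, affordance_feat):
    # Convex combination of the per-stream class scores (late fusion).
    a = torch.sigmoid(alpha)
    return a * appearance_net(appearance_feat) + (1 - a) * affordance_net(affordance_feat)

scores = fused_scores(torch.randn(2, 512), torch.randn(2, 128))
```

For scale, a 29% relative error reduction means, for example, a baseline error of 10% dropping to about 7.1%.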