RGB-D datasets using Microsoft Kinect or similar sensors: a survey
RGB-D data has proven to be a very useful representation of indoor scenes for solving fundamental computer vision problems. It combines the advantages of the color image, which provides appearance information about an object, with those of the depth image, which is immune to variations in color, illumination, rotation angle and scale. With the invention of the low-cost Microsoft Kinect sensor, which was initially used for gaming and later became a popular device for computer vision, high-quality RGB-D data can be acquired easily. In recent years, more and more RGB-D image/video datasets dedicated to various applications have become available, and they are of great importance for benchmarking the state of the art. In this paper, we systematically survey popular RGB-D datasets for different applications, including object recognition, scene classification, hand gesture recognition, 3D simultaneous localization and mapping, and pose estimation. We provide insights into the characteristics of each important dataset and compare the popularity and difficulty of those datasets. Overall, the main goal of this survey is to give a comprehensive description of the available RGB-D datasets and thus to guide researchers in selecting suitable datasets for evaluating their algorithms.
Multimodal Observation and Interpretation of Subjects Engaged in Problem Solving
In this paper we present the first results of a pilot experiment in the
capture and interpretation of multimodal signals of human experts engaged in
solving challenging chess problems. Our goal is to investigate the extent to
which observations of eye-gaze, posture, emotion and other physiological
signals can be used to model the cognitive state of subjects, and to explore
the integration of multiple sensor modalities to improve the reliability of
detection of human displays of awareness and emotion. We observed chess players
engaged in problems of increasing difficulty while recording their behavior.
Such recordings can be used to estimate a participant's awareness of the
current situation and to predict ability to respond effectively to challenging
situations. Results show that a multimodal approach is more accurate than a
unimodal one. By combining body posture, visual attention and emotion, the
multimodal approach can reach up to 93% accuracy when determining a player's
chess expertise, while the unimodal approach reaches 86%. Finally, this experiment
validates the use of our equipment as a general and reproducible tool for the
study of participants engaged in screen-based interaction and/or problem
solving.
Feature Learning for RGB-D Data
RGB-D data has turned out to be a very useful representation for solving fundamental computer
vision problems. It combines the advantages of the color image, which provides appearance
information about an object, with those of the depth image, which is immune to variations in color,
illumination, rotation angle and scale. With the invention of the low-cost Microsoft Kinect
sensor, which was initially used for gaming and later became a popular device for computer
vision, high-quality RGB-D data can be acquired easily. RGB-D images and videos can benefit
a wide range of application areas, such as computer vision, robotics, construction and medical
imaging. However, how to fuse RGB information and depth information remains an open
problem in computer vision. Simply concatenating RGB data and depth data is not
enough, and more powerful fusion algorithms are still needed. In this thesis, to
explore further advantages of RGB-D data, we use several popular RGB-D datasets for
the evaluation of deep feature learning algorithms, hyper-parameter optimization,
local multi-modal feature learning, RGB-D data
fusion and recognizing RGB information from RGB-D images: i) With the success of Deep
Neural Networks in computer vision, deep features from fused RGB-D data have been shown to
yield better results than RGB data alone. However, different deep learning algorithms
perform differently on different RGB-D datasets. Through large-scale experiments that
comprehensively evaluate the performance of deep feature learning models for RGB-D image/video
classification, we conclude that RGB-D fusion methods using CNNs
consistently outperform the other selected methods (DBNs, SDAE and LSTM). On the other hand, since
LSTM can learn from experience to classify, process and predict time series, it achieves
better performance than DBN and SDAE in video classification tasks. ii) Hyper-parameter
optimization can help researchers quickly choose an initial set of hyper-parameters for a new
classification task, thus reducing the number of trials needed to explore the hyper-parameter
space. We present a simple and efficient framework for improving the efficiency and accuracy
of hyper-parameter optimization by considering the classification complexity of a
particular dataset. We verify this framework on three real-world RGB-D datasets. Our
analysis of the experiments confirms that the framework provides deeper insights
into the relationship between dataset classification tasks and hyper-parameter optimization,
and thus allows an accurate initial set of hyper-parameters to be chosen quickly for a new classification
task. iii) We propose a new Convolutional Neural Network (CNN)-based local
multi-modal feature learning framework for RGB-D scene classification. This method can
effectively capture much of the local structure from the RGB-D scene images and automatically
learn a fusion strategy for the object-level recognition step instead of simply training a
classifier on top of features extracted from both modalities. Experiments conducted on
two popular datasets to thoroughly test the performance of our method show that our
method with local multi-modal CNNs greatly outperforms state-of-the-art approaches. Our
method has the potential to improve RGB-D scene understanding. An extended evaluation
shows that CNNs trained on a scene-centric dataset achieve an improvement
on scene benchmarks compared to a network trained on an object-centric dataset.
iv) We propose a novel method for RGB-D data fusion. We project raw RGB-D data into
a complex space and then jointly extract features from the fused RGB-D images. Besides
three observations about the fusion methods, the experimental results also show that our
method achieves competitive performance against the classical SIFT. v) We propose a novel
method called adaptive Visual-Depth Embedding (aVDE), which first learns a compact shared
latent space between two representations of labeled RGB and depth modalities in the source
domain. The shared latent space then helps transfer the depth information to
the unlabeled target dataset. Finally, aVDE matches features and reweights instances jointly
across the shared latent space and the projected target domain for an adaptive classifier. This
method can utilize the additional depth information in the source domain and simultaneously
reduce the domain mismatch between the source and target domains. On two real-world
image datasets, the experimental results illustrate that the proposed method significantly
outperforms the state-of-the-art methods.
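
Item iv) of the abstract above describes projecting raw RGB-D data into a complex space before joint feature extraction, but it does not state the encoding. The following is only a minimal sketch of one plausible reading, assuming the intensity goes into the real part and the depth into the imaginary part of each pixel. The function names, the 8-bit intensity range and the `depth_max` parameter are hypothetical and are not taken from the thesis.

```python
import cmath


def to_complex_image(gray, depth, depth_max):
    """Fuse an intensity image and a depth map into one complex-valued image.

    Assumed encoding (not from the abstract): real part = intensity scaled to
    [0, 1], imaginary part = depth scaled to [0, 1] by depth_max.
    """
    if depth_max <= 0:
        raise ValueError("depth_max must be positive")
    if len(gray) != len(depth):
        raise ValueError("gray and depth must have the same number of rows")
    fused = []
    for gray_row, depth_row in zip(gray, depth):
        if len(gray_row) != len(depth_row):
            raise ValueError("rows must have the same length")
        fused.append([complex(g / 255.0, d / depth_max)
                      for g, d in zip(gray_row, depth_row)])
    return fused


def magnitude_phase(fused):
    """Return per-pixel (magnitude, phase) pairs, two simple joint descriptors."""
    return [[(abs(z), cmath.phase(z)) for z in row] for row in fused]


# Toy 1x2 example with hypothetical values: 8-bit intensity, depth in millimetres.
gray = [[255, 0]]
depth = [[0, 4000]]
fused = to_complex_image(gray, depth, depth_max=4000.0)
features = magnitude_phase(fused)
```

In this sketch the magnitude mixes both modalities, while the phase reflects the balance between depth and intensity at each pixel, so any descriptor computed on the complex image sees the two modalities jointly rather than as concatenated channels.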
MobileRGBD, An Open Benchmark Corpus for mobile RGB-D Related Algorithms
Since the commercialization of low-cost RGB-D sensors such as the Kinect, more and more indoor robots have been equipped with this kind of sensor to perform tasks such as people tracking or gesture recognition. Nevertheless, as far as we know from the literature, studies do not consider the limits of the sensors in terms of motion speed, position of the sensor on the robot, etc. In this work, we provide a corpus dedicated to benchmarking low-level RGB-D algorithms. The originality of our approach is the use of dummies to play static users in the environment. This idea lets us vary other variables that can impact algorithm performance: linear/angular speed of the robot, trajectory of the robot, RGB-D sensor height and vertical angle of view, number and relative position of dummies, and furniture position. This paper first describes the experimental platform used to perform the acquisitions and the environment setup required to reproduce the dataset. Then, a precise description of all available data is given. Because this corpus contains many configurations, it will allow researchers to investigate how these variables impact the results of their algorithms.