3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks
Human activity understanding with 3D/depth sensors has received increasing
attention in multimedia processing and interactions. This work aims to
develop a novel deep model for automatic activity recognition from RGB-D
videos. We represent each human activity as an ensemble of cubic-like video
segments, and learn to discover the temporal structures for a category of
activities, i.e. how the activities are to be decomposed for
classification. Our model can be regarded as a structured deep architecture, as
it extends the convolutional neural networks (CNNs) by incorporating structure
alternatives. Specifically, we build a network consisting of 3D convolutions
and max-pooling operators over the video segments, and introduce latent
variables in each convolutional layer to manipulate the activation of neurons.
Our model thus advances existing approaches in two aspects: (i) it acts
directly on the raw inputs (grayscale-depth data) to conduct recognition
instead of relying on hand-crafted features, and (ii) the model structure can
be dynamically adjusted to account for the temporal variations of human
activities, i.e. the network configuration is allowed to be partially activated
during inference. For model training, we propose an EM-type optimization method
that iteratively (i) discovers the latent structure by determining the
decomposed actions for each training example, and (ii) learns the network
parameters by using the back-propagation algorithm. Our approach is validated
in challenging scenarios, and outperforms state-of-the-art methods. A large
human activity database of RGB-D videos is also presented.
Comment: This manuscript has 10 pages with 9 figures; a preliminary
version was published at the ACM MM'14 conference.
Recovering 6D Object Pose: A Review and Multi-modal Analysis
A large number of studies analyse object detection and pose estimation at
the visual level in 2D, discussing the effects of challenges such as occlusion,
clutter, and texture on the performance of methods that work in the RGB
modality. Interpreting the depth data, the study in this paper
presents thorough multi-modal analyses. It discusses the above-mentioned
challenges for full 6D object pose estimation in RGB-D images, comparing the
performance of several 6D detectors in order to answer the following
questions: What is the current position of the computer vision community for
maintaining "automation" in robotic manipulation? What next steps should the
community take for improving "autonomy" in robotics while handling objects? Our
findings include: (i) reasonably accurate results are obtained on
textured objects at varying viewpoints with cluttered backgrounds. (ii) Heavy
occlusion and clutter severely affect the detectors, and
similar-looking distractors are the biggest challenge in recovering instances'
6D pose. (iii) Template-based methods and random forest-based learning algorithms
underlie object detection and 6D pose estimation. The recent paradigm is to learn
deep discriminative feature representations and to adopt CNNs taking RGB images
as input. (iv) Depending on the availability of large-scale 6D annotated depth
datasets, feature representations can be learnt on these datasets, and then the
learnt representations can be customized for the 6D problem.
An original framework for understanding human actions and body language by using deep neural networks
The evolution of both Computer Vision (CV) and Artificial Neural Networks (ANNs) has allowed the development of efficient automatic systems for the analysis of people's behaviour.
By studying hand movements it is possible to recognize gestures, often used by people to communicate information in a non-verbal way.
These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively.
The processing of body movements, meanwhile, plays a key role in the action recognition and affective computing fields. The former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements;
both are essential tasks in many computer vision applications, including event recognition and video surveillance.
In this Ph.D. thesis, an original framework for understanding actions and body language is presented. The framework is composed of three main modules: in the first one, a method based on Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) is proposed for the recognition of sign language and semaphoric hand gestures; the second module presents a solution based on 2D skeletons and two-branch stacked LSTM-RNNs for action recognition in video sequences; finally, in the last module, a solution for basic non-acted emotion recognition using 3D skeletons and Deep Neural Networks (DNNs) is provided.
The performance of LSTM-RNNs is explored in depth, owing to their ability to model the long-term contextual information of temporal sequences, which makes them suitable for analysing body movements.
All the modules were tested on challenging, well-known datasets from the state of the art, showing remarkable results compared to current methods in the literature.