An original framework for understanding human actions and body language by using deep neural networks
The evolution of both fields of Computer Vision (CV) and Artificial Neural Networks (ANNs) has allowed the development of efficient automatic systems for the analysis of people's behaviour.
By studying hand movements it is possible to recognize gestures, often used by people to communicate information in a non-verbal way.
These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively.
The processing of body movements, in turn, plays a key role in the action recognition and affective computing fields. The former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements;
both are essential tasks in many computer vision applications, including event recognition and video surveillance.
In this Ph.D. thesis, an original framework for understanding actions and body language is presented. The framework is composed of three main modules: in the first one, a method based on Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) is proposed for the recognition of sign language and semaphoric hand gestures; the second module presents a solution based on 2D skeletons and two-branch stacked LSTM-RNNs for action recognition in video sequences; finally, in the last module, a solution for basic non-acted emotion recognition using 3D skeletons and Deep Neural Networks (DNNs) is provided.
The performance of LSTM-RNNs is explored in depth, because their ability to model the long-term contextual information of temporal sequences makes them suitable for analysing body movements.
All the modules were tested on challenging datasets that are well known in the state of the art, showing remarkable results compared with current methods in the literature.
3D Object Reconstruction from Hand-Object Interactions
Recent advances have enabled 3d object reconstruction approaches using a
single off-the-shelf RGB-D camera. Although these approaches are successful for
a wide range of object classes, they rely on stable and distinctive geometric
or texture features. Many objects like mechanical parts, toys, household or
decorative articles, however, are textureless and characterized by minimalistic
shapes that are simple and symmetric. Existing in-hand scanning systems and 3d
reconstruction techniques fail for such symmetric objects in the absence of
highly distinctive features. In this work, we show that extracting 3d hand
motion for in-hand scanning effectively facilitates the reconstruction of even
featureless and highly symmetric objects and we present an approach that fuses
the rich additional information of hands into a 3d reconstruction pipeline,
significantly contributing to the state-of-the-art of in-hand scanning.
Comment: International Conference on Computer Vision (ICCV) 2015,
http://files.is.tue.mpg.de/dtzionas/In-Hand-Scannin
T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects
We introduce T-LESS, a new public dataset for estimating the 6D pose, i.e.
translation and rotation, of texture-less rigid objects. The dataset features
thirty industry-relevant objects with no significant texture and no
discriminative color or reflectance properties. The objects exhibit symmetries
and mutual similarities in shape and/or size. Compared to other datasets, a
unique property is that some of the objects are parts of others. The dataset
includes training and test images that were captured with three synchronized
sensors, specifically a structured-light and a time-of-flight RGB-D sensor and
a high-resolution RGB camera. There are approximately 39K training and 10K test
images from each sensor. Additionally, two types of 3D models are provided for
each object, i.e. a manually created CAD model and a semi-automatically
reconstructed one. Training images depict individual objects against a black
background. Test images originate from twenty test scenes having varying
complexity, which increases from simple scenes with several isolated objects to
very challenging ones with multiple instances of several objects and with a
high amount of clutter and occlusion. The images were captured from a
systematically sampled view sphere around the object/scene, and are annotated
with accurate ground truth 6D poses of all modeled objects. Initial evaluation
results indicate that the state of the art in 6D object pose estimation has
ample room for improvement, especially in difficult cases with significant
occlusion. The T-LESS dataset is available online at cmp.felk.cvut.cz/t-less.
Comment: WACV 201
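As a rough illustration of what a 6D pose (translation and rotation) means for a rigid object, the following minimal sketch applies a pose to a single model point. The helper names `rotation_z` and `apply_pose` are hypothetical and are not part of the T-LESS dataset or its toolkit; the rotation is restricted to the z axis only to keep the example short.

```python
import math

def rotation_z(angle_rad):
    """Return a 3x3 rotation matrix for a rotation about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def apply_pose(rotation, translation, point):
    """Map a model-space point to the camera frame: p' = R * p + t.

    rotation is a 3x3 matrix (3 degrees of freedom), translation is a
    3-vector (3 degrees of freedom), giving the 6 degrees of freedom.
    """
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]

# Hypothetical pose: quarter turn about z, then a shift of 1 unit along z.
posed_point = apply_pose(rotation_z(math.pi / 2), [0.0, 0.0, 1.0], [1.0, 0.0, 0.0])
```

For a symmetric object, several different rotations map the model onto itself, which is one reason pose estimation for such objects is ambiguous.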
Feature Mapping for Learning Fast and Accurate 3D Pose Inference from Synthetic Images
We propose a simple and efficient method for exploiting synthetic images when
training a Deep Network to predict a 3D pose from an image. The ability of
using synthetic images for training a Deep Network is extremely valuable as it
is easy to create a virtually infinite training set made of such images, while
capturing and annotating real images can be very cumbersome. However, synthetic
images do not resemble real images exactly, and using them for training can
result in suboptimal performance. It was recently shown that for exemplar-based
approaches, it is possible to learn a mapping from the exemplar representations
of real images to the exemplar representations of synthetic images. In this
paper, we show that this approach is more general, and that a network can also
be applied after the mapping to infer a 3D pose: At run time, given a real
image of the target object, we first compute the features for the image, map
them to the feature space of synthetic images, and finally use the resulting
features as input to another network which predicts the 3D pose. Since this
network can be trained very effectively by using synthetic images, it performs
very well in practice, and inference is faster and more accurate than with an
exemplar-based approach. We demonstrate our approach on the LINEMOD dataset for
3D object pose estimation from color images, and the NYU dataset for 3D hand
pose estimation from depth maps. We show that it allows us to outperform the
state-of-the-art on both datasets.
Comment: CVPR 201
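The run-time data flow described in this abstract (compute features of the real image, map them to the feature space of synthetic images, then predict the pose) can be sketched as a composition of three stages. This is only an illustration: `infer_pose` and the three toy stand-in functions are hypothetical placeholders, not the networks used by the authors.

```python
def infer_pose(image, extract_features, map_to_synthetic, predict_pose):
    """Run the three stages in order and return the predicted pose."""
    real_features = extract_features(image)           # features of the real image
    synthetic_like = map_to_synthetic(real_features)  # mapped to synthetic feature space
    return predict_pose(synthetic_like)               # predictor trained on synthetic data

# Toy stand-ins (assumptions for illustration only).
def toy_extract(image):
    # One feature per image row: the row mean.
    return [sum(row) / len(row) for row in image]

def toy_map(features):
    # Pretend real features carry a constant offset relative to synthetic ones.
    return [f - 0.5 for f in features]

def toy_predict(features):
    # Pretend the "pose" is just a scaled copy of the features.
    return [2.0 * f for f in features]

toy_image = [[1.0, 2.0], [3.0, 4.0]]
toy_pose = infer_pose(toy_image, toy_extract, toy_map, toy_predict)
```

Keeping the mapping as a separate stage means the pose predictor itself only ever sees synthetic-style features, which is what allows it to be trained on synthetic images alone.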
ModDrop: adaptive multi-modal gesture recognition
We present a method for gesture detection and localisation based on
multi-scale and multi-modal deep learning. Each visual modality captures
spatial information at a particular spatial scale (such as motion of the upper
body or a hand), and the whole system operates at three temporal scales. Key to
our technique is a training strategy which exploits: i) careful initialization
of individual modalities; and ii) gradual fusion involving random dropping of
separate channels (dubbed ModDrop) for learning cross-modality correlations
while preserving uniqueness of each modality-specific representation. We
present experiments on the ChaLearn 2014 Looking at People Challenge gesture
recognition track, in which we placed first out of 17 teams. Fusing multiple
modalities at several spatial and temporal scales leads to a significant
increase in recognition rates, allowing the model to compensate for errors of
the individual classifiers as well as noise in the separate channels.
Furthermore, the proposed ModDrop training technique ensures robustness of the
classifier to missing signals in one or several channels to produce meaningful
predictions from any number of available modalities. In addition, we
demonstrate the applicability of the proposed fusion scheme to modalities of
arbitrary nature by experiments on the same dataset augmented with audio.
Comment: 14 pages, 7 figure
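The idea of randomly dropping whole modality channels during fusion training can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `moddrop` and `fuse`, the drop probability, the modality names, and the rule that at least one modality is always kept are all choices made for this example.

```python
import random

def moddrop(features, drop_prob, rng):
    """Zero out each modality's feature vector independently with probability drop_prob.

    features maps a modality name to a list of floats.
    Assumption of this sketch: at least one modality is always kept.
    """
    names = list(features)
    dropped = {name for name in names if rng.random() < drop_prob}
    if len(dropped) == len(names):
        # Everything was selected for dropping: restore one modality at random.
        dropped.discard(rng.choice(names))
    return {
        name: [0.0] * len(vec) if name in dropped else list(vec)
        for name, vec in features.items()
    }

def fuse(features):
    """Concatenate modality features in a fixed (sorted) order for a shared layer."""
    return [x for name in sorted(features) for x in features[name]]

rng = random.Random(0)
sample = {"depth": [0.2, 0.7], "rgb": [0.5, 0.1], "skeleton": [0.9, 0.3, 0.4]}
fused = fuse(moddrop(sample, drop_prob=0.5, rng=rng))
```

Because dropped modalities are zeroed rather than removed, the fused vector keeps a fixed length, so the same shared layer can be used whether or not a channel is missing.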
Articulated Clinician Detection Using 3D Pictorial Structures on RGB-D Data
Reliable human pose estimation (HPE) is essential to many clinical
applications, such as surgical workflow analysis, radiation safety monitoring
and human-robot cooperation. Proposed methods for the operating room (OR) rely
either on foreground estimation using a multi-camera system, which is a
challenge in real ORs due to color similarities and frequent illumination
changes, or on wearable sensors or markers, which are invasive and therefore
difficult to introduce in the room. Instead, we propose a novel approach based
on Pictorial Structures (PS) and on RGB-D data, which can be easily deployed in
real ORs. We extend the PS framework in two ways. First, we build robust and
discriminative part detectors using both color and depth images. We also
present a novel descriptor for depth images, called histogram of depth
differences (HDD). Second, we extend PS to 3D by proposing 3D pairwise
constraints and a new method that makes exact inference tractable. Our approach
is evaluated for pose estimation and clinician detection on a challenging RGB-D
dataset recorded in a busy operating room during live surgeries. We conduct a
series of experiments to study the different part detectors in conjunction with
the various 2D or 3D pairwise constraints. Our comparisons demonstrate that 3D
PS with RGB-D part detectors significantly improves the results in a visually
challenging operating environment.
Comment: The supplementary video is available at https://youtu.be/iabbGSqRSg
Ego-Downward and Ambient Video based Person Location Association
Localization and tracking with an ego-centric camera is much needed
for urban navigation and indoor assistive systems when GPS is unavailable or
not accurate enough. Traditional hand-designed feature tracking and
estimation approaches fail when no visible features are present. Several
recent works explore the use of context features for localization. However,
all of these suffer severe accuracy loss when given no visual context
information. As a possible solution to this problem, this paper
proposes a camera system with both an ego-downward view and a static
third-person view to perform localization and tracking with a learning-based
approach. We also propose a novel action and motion verification model for
cross-view verification and localization. We performed comparative experiments on
our collected dataset, which considers the same dressing, gender, and background
diversity. Results indicate that the proposed model can achieve an
improvement in accuracy. Finally, we tested the model on
multi-people scenarios and obtained an average accuracy