17,520 research outputs found
Large-scale Isolated Gesture Recognition Using Convolutional Neural Networks
This paper proposes three simple, compact yet effective representations of
depth sequences, referred to respectively as Dynamic Depth Images (DDI),
Dynamic Depth Normal Images (DDNI) and Dynamic Depth Motion Normal Images
(DDMNI). These dynamic images are constructed from a sequence of depth maps
using bidirectional rank pooling to effectively capture the spatial-temporal
information. Such image-based representations enable us to fine-tune the
existing ConvNets models trained on image data for classification of depth
sequences, without introducing large parameters to learn. Upon the proposed
representations, a convolutional Neural networks (ConvNets) based method is
developed for gesture recognition and evaluated on the Large-scale Isolated
Gesture Recognition at the ChaLearn Looking at People (LAP) challenge 2016. The
method achieved 55.57\% classification accuracy and ranked place in
this challenge but was very close to the best performance even though we only
used depth data.Comment: arXiv admin note: text overlap with arXiv:1608.0633
Relaxed Spatio-Temporal Deep Feature Aggregation for Real-Fake Expression Prediction
Frame-level visual features are generally aggregated in time with the
techniques such as LSTM, Fisher Vectors, NetVLAD etc. to produce a robust
video-level representation. We here introduce a learnable aggregation technique
whose primary objective is to retain short-time temporal structure between
frame-level features and their spatial interdependencies in the representation.
Also, it can be easily adapted to the cases where there have very scarce
training samples. We evaluate the method on a real-fake expression prediction
dataset to demonstrate its superiority. Our method obtains 65% score on the
test dataset in the official MAP evaluation and there is only one misclassified
decision with the best reported result in the Chalearn Challenge (i.e. 66:7%) .
Lastly, we believe that this method can be extended to different problems such
as action/event recognition in future.Comment: Submitted to International Conference on Computer Vision Workshop
Automatic Analysis of Facial Expressions Based on Deep Covariance Trajectories
In this paper, we propose a new approach for facial expression recognition
using deep covariance descriptors. The solution is based on the idea of
encoding local and global Deep Convolutional Neural Network (DCNN) features
extracted from still images, in compact local and global covariance
descriptors. The space geometry of the covariance matrices is that of Symmetric
Positive Definite (SPD) matrices. By conducting the classification of static
facial expressions using Support Vector Machine (SVM) with a valid Gaussian
kernel on the SPD manifold, we show that deep covariance descriptors are more
effective than the standard classification with fully connected layers and
softmax. Besides, we propose a completely new and original solution to model
the temporal dynamic of facial expressions as deep trajectories on the SPD
manifold. As an extension of the classification pipeline of covariance
descriptors, we apply SVM with valid positive definite kernels derived from
global alignment for deep covariance trajectories classification. By performing
extensive experiments on the Oulu-CASIA, CK+, and SFEW datasets, we show that
both the proposed static and dynamic approaches achieve state-of-the-art
performance for facial expression recognition outperforming many recent
approaches.Comment: A preliminary version of this work appeared in "Otberdout N, Kacem A,
Daoudi M, Ballihi L, Berretti S. Deep Covariance Descriptors for Facial
Expression Recognition, in British Machine Vision Conference 2018, BMVC 2018,
Northumbria University, Newcastle, UK, September 3-6, 2018. ; 2018 :159."
arXiv admin note: substantial text overlap with arXiv:1805.0386
Analyzing Input and Output Representations for Speech-Driven Gesture Generation
This paper presents a novel framework for automatic speech-driven gesture
generation, applicable to human-agent interaction including both virtual agents
and robots. Specifically, we extend recent deep-learning-based, data-driven
methods for speech-driven gesture generation by incorporating representation
learning. Our model takes speech as input and produces gestures as output, in
the form of a sequence of 3D coordinates. Our approach consists of two steps.
First, we learn a lower-dimensional representation of human motion using a
denoising autoencoder neural network, consisting of a motion encoder MotionE
and a motion decoder MotionD. The learned representation preserves the most
important aspects of the human pose variation while removing less relevant
variation. Second, we train a novel encoder network SpeechE to map from speech
to a corresponding motion representation with reduced dimensionality. At test
time, the speech encoder and the motion decoder networks are combined: SpeechE
predicts motion representations based on a given speech signal and MotionD then
decodes these representations to produce motion sequences. We evaluate
different representation sizes in order to find the most effective
dimensionality for the representation. We also evaluate the effects of using
different speech features as input to the model. We find that mel-frequency
cepstral coefficients (MFCCs), alone or combined with prosodic features,
perform the best. The results of a subsequent user study confirm the benefits
of the representation learning.Comment: Accepted at IVA '19. Shorter version published at AAMAS '19. The code
is available at
https://github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencode
Large-scale Continuous Gesture Recognition Using Convolutional Neural Networks
This paper addresses the problem of continuous gesture recognition from
sequences of depth maps using convolutional neutral networks (ConvNets). The
proposed method first segments individual gestures from a depth sequence based
on quantity of movement (QOM). For each segmented gesture, an Improved Depth
Motion Map (IDMM), which converts the depth sequence into one image, is
constructed and fed to a ConvNet for recognition. The IDMM effectively encodes
both spatial and temporal information and allows the fine-tuning with existing
ConvNet models for classification without introducing millions of parameters to
learn. The proposed method is evaluated on the Large-scale Continuous Gesture
Recognition of the ChaLearn Looking at People (LAP) challenge 2016. It achieved
the performance of 0.2655 (Mean Jaccard Index) and ranked place in
this challenge
- …