Automatic Analysis of Facial Expressions Based on Deep Covariance Trajectories
In this paper, we propose a new approach for facial expression recognition
using deep covariance descriptors. The solution is based on the idea of
encoding local and global Deep Convolutional Neural Network (DCNN) features
extracted from still images, in compact local and global covariance
descriptors. The space geometry of the covariance matrices is that of Symmetric
Positive Definite (SPD) matrices. By conducting the classification of static
facial expressions using Support Vector Machine (SVM) with a valid Gaussian
kernel on the SPD manifold, we show that deep covariance descriptors are more
effective than the standard classification with fully connected layers and
softmax. In addition, we propose a novel solution to model the temporal
dynamics of facial expressions as deep trajectories on the SPD
manifold. As an extension of the classification pipeline of covariance
descriptors, we apply SVM with valid positive definite kernels derived from
global alignment for deep covariance trajectories classification. By performing
extensive experiments on the Oulu-CASIA, CK+, and SFEW datasets, we show that
both the proposed static and dynamic approaches achieve state-of-the-art
performance for facial expression recognition, outperforming many recent
approaches.
Comment: A preliminary version of this work appeared in "Otberdout N, Kacem A,
Daoudi M, Ballihi L, Berretti S. Deep Covariance Descriptors for Facial
Expression Recognition, in British Machine Vision Conference 2018, BMVC 2018,
Northumbria University, Newcastle, UK, September 3-6, 2018. ; 2018 :159."
arXiv admin note: substantial text overlap with arXiv:1805.0386
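As a minimal sketch of the static pipeline described above, the snippet below builds a covariance descriptor from a stack of DCNN feature vectors and evaluates a log-Euclidean Gaussian kernel, one standard choice of valid Gaussian kernel on the SPD manifold; the regularization constant and the bandwidth `gamma` are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def covariance_descriptor(features):
    # features: (n_samples, d) DCNN feature vectors for one region/image.
    # Returns a d x d SPD covariance matrix; the small ridge term is an
    # illustrative regularizer to guarantee strict positive definiteness.
    c = np.cov(features, rowvar=False)
    return c + 1e-6 * np.eye(c.shape[0])

def log_euclidean_gaussian_kernel(c1, c2, gamma=0.1):
    # Valid Gaussian kernel on the SPD manifold via the matrix logarithm:
    # k(C1, C2) = exp(-gamma * ||logm(C1) - logm(C2)||_F^2)
    def logm_spd(c):
        w, v = np.linalg.eigh(c)          # eigendecomposition of an SPD matrix
        return (v * np.log(w)) @ v.T      # V diag(log w) V^T
    d = logm_spd(c1) - logm_spd(c2)
    return np.exp(-gamma * np.sum(d * d))
```

In practice, such a kernel can be plugged into a kernel SVM by precomputing the Gram matrix over all training descriptors.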
Gram Matrices Formulation of Body Shape Motion: An Application for Depression Severity Assessment
We propose an automatic method to measure depression severity from body movement dynamics in participants undergoing treatment for depression. Participants were recorded in clinical interviews (Hamilton Rating Scale for Depression, HRSD) at seven-week intervals over a period of 21 weeks. A Gram matrix formulation was used to represent body shape and trajectories from each video interview. Kinematic features were then extracted and encoded into a video-level representation using Gaussian Mixture Models (GMM) and Fisher vector encoding. A multi-class SVM was finally used to classify the encoded body movement dynamics into three levels of depression severity: moderate to severely depressed, mildly depressed, and remitted. Accuracy was highest for moderate to severe depression (68%), followed by mild depression (56%) and remission (37.93%). The obtained results suggest that automatic detection of depression severity from body movement is feasible.
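The Gram matrix representation used above has a simple core property: after centering, the matrix Z Z^T of landmark coordinates is invariant to rotations and translations of the body configuration. A minimal sketch (landmark layout and dimensions are illustrative):

```python
import numpy as np

def gram_matrix(landmarks):
    # landmarks: (n_points, d) body landmark coordinates for one frame.
    # Centering removes translation; the Gram matrix Z Z^T removes rotation,
    # since (Z Q)(Z Q)^T = Z Z^T for any rotation matrix Q.
    z = landmarks - landmarks.mean(axis=0)
    return z @ z.T
```

A sequence of such matrices, one per frame, yields the trajectory representation that downstream kinematic features are computed from.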
Temporal Alignment of Human Motion Data: A Geometric Point of View
Temporal alignment is an inherent task in most applications dealing with
videos: action recognition, motion transfer, virtual trainers, rehabilitation,
etc. In this paper we dive into the understanding of this task from a geometric
point of view: in particular, we show that the basic properties that are
expected from a temporal alignment procedure imply that the set of aligned
motions to a template form a slice to a principal fiber bundle for the group of
temporal reparameterizations. A temporal alignment procedure provides a
reparameterization invariant projection onto this particular slice. This
geometric presentation allows to elaborate a consistency check for testing the
accuracy of any temporal alignment procedure. We give examples of alignment
procedures from the literature applied to motions of tennis players. Most of
them use dynamic programming to compute the best correspondence between two
motions relative to a given cost function. This step is computationally
expensive (of complexity O(nm), where n and m are the numbers of frames).
Moreover, most methods use features that are invariant under translations and
rotations in 3D space, whereas most actions are only invariant under
translation along and rotation around the vertical axis, where the vertical
axis is aligned with the gravitational field. The discarded information
contained in the vertical direction is crucial for accurate synchronization of
motions. We propose to incorporate keyframe correspondences into the dynamic
programming algorithm based on coarse information extracted from the vertical
variations, in our case from the elevation of the arm holding the racket. The
resulting temporal alignment procedures are not only more accurate but also
computationally more efficient.
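The keyframe idea above can be sketched as follows: run classic O(nm) dynamic time warping within each segment delimited by corresponding keyframes, so the anchors are enforced and the quadratic cost applies only to shorter pieces. The segment-splitting scheme below is an illustrative simplification, not the paper's exact algorithm.

```python
import numpy as np

def dtw_cost(a, b):
    # Classic O(n*m) dynamic-programming alignment cost between two
    # 1-D feature sequences, with unit-step moves and absolute-difference cost.
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = abs(a[i - 1] - b[j - 1])
            d[i, j] = c + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

def anchored_dtw_cost(a, b, anchors_a, anchors_b):
    # Keyframe correspondences (e.g. peaks of the racket arm's elevation)
    # split the alignment into independent segments; each segment is aligned
    # separately, enforcing the anchors and reducing total cost.
    cuts_a = [0, *anchors_a, len(a)]
    cuts_b = [0, *anchors_b, len(b)]
    return sum(dtw_cost(a[i0:i1], b[j0:j1])
               for (i0, i1), (j0, j1)
               in zip(zip(cuts_a, cuts_a[1:]), zip(cuts_b, cuts_b[1:])))
```

If the keyframes cut both sequences into segments of comparable length k, each dynamic-programming table is roughly k x k, which is where the efficiency gain comes from.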
Representation Learning With Convolutional Neural Networks
Deep learning methods have achieved great success in Computer Vision and Natural Language Processing. Much of this rapidly developing field is concerned with how to learn meaningful and effective representations of data, because the performance of machine learning approaches depends heavily on the choice and quality of the data representation, and different representations can entangle or hide the explanatory factors of variation behind the data.
In this dissertation, we focus on representation learning with deep neural networks for different data formats including text, 3D polygon shapes, and brain fiber tracts.
First, we propose a topic-based word representation learning approach for text classification. The proposed approach takes the global semantic relationships between words over the whole corpus into consideration and encodes these relationships into distributed vector representations with the continuous Skip-gram model. The learned representations, which capture a large number of precise syntactic and semantic word relationships, are used as input to Convolutional Neural Networks for classification. Our experimental results show the effectiveness of the proposed method on indexing of biomedical articles, behavior code annotation of clinical text fragments, and classification of newsgroups.
Second, we present a 3D polygon shape representation learning framework for shape segmentation. We propose the Directionally Convolutional Network (DCN), which extends convolution operations from images to the polygon mesh surface with a rotation-invariance property. Based on the proposed DCN, we learn effective shape representations from raw geometric features and then classify each face of a given polygon mesh into predefined semantic parts. Through extensive experiments, we demonstrate that our framework outperforms the current state of the art.
Third, we propose to learn effective and meaningful representations for brain fiber tracts using deep learning frameworks. We handle the highly unbalanced dataset by introducing an asymmetrical loss function that weights easily classified samples differently from hard ones, so the training loss avoids being dominated by the easy samples and training proceeds more efficiently. In addition, we learn more effective and meaningful representations by introducing deeper networks and metric learning approaches. Furthermore, we improve the interpretability of our framework by introducing an attention mechanism. Our experimental results show that our proposed framework significantly outperforms the current gold standard on a real-world dataset.
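The asymmetrical loss for unbalanced data described above is reminiscent of the focal loss, which down-weights easy samples via a modulating factor. The sketch below is a minimal focal-style loss under that assumption; the `gamma` and `alpha` values are illustrative, not the dissertation's.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    # Focal-style asymmetric binary loss: the (1 - p_t)^gamma factor shrinks
    # the contribution of easy (well-classified) samples so they do not
    # dominate training on a highly unbalanced dataset.
    # p: predicted probability of the positive class; y: label in {0, 1}.
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With gamma = 0 and alpha = 0.5 this reduces (up to a constant factor) to ordinary binary cross-entropy.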
Second-order Temporal Pooling for Action Recognition
Deep learning models for video-based action recognition usually generate
features for short clips (consisting of a few frames); such clip-level features
are aggregated to video-level representations by computing statistics on these
features. Typically zero-th (max) or the first-order (average) statistics are
used. In this paper, we explore the benefits of using second-order statistics.
Specifically, we propose a novel end-to-end learnable feature aggregation
scheme, dubbed temporal correlation pooling that generates an action descriptor
for a video sequence by capturing the similarities between the temporal
evolution of clip-level CNN features computed across the video. Such a
descriptor, while being computationally cheap, also naturally encodes the
co-activations of multiple CNN features, thereby providing a richer
characterization of actions than their first-order counterparts. We also
propose higher-order extensions of this scheme by computing correlations after
embedding the CNN features in a reproducing kernel Hilbert space. We provide
experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained
datasets such as MPII Cooking activities and JHMDB, as well as the recent
Kinetics-600. Our results demonstrate the advantages of higher-order pooling
schemes, which, when combined with hand-crafted features (as is standard
practice), achieve state-of-the-art accuracy.
Comment: Accepted in the International Journal of Computer Vision (IJCV).
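The second-order pooling idea can be illustrated as follows: given clip-level CNN features stacked over time, compute the feature covariance over the temporal axis, which encodes co-activations of feature dimensions, and keep its upper triangle as the video descriptor. This is a generic second-order pooling sketch, not the paper's exact temporal correlation pooling layer.

```python
import numpy as np

def second_order_temporal_pooling(clip_features):
    # clip_features: (T, d) clip-level CNN features for one video.
    # The d x d second-order statistic captures co-activations of feature
    # dimensions over time; by symmetry, the upper triangle suffices as
    # the fixed-length action descriptor.
    x = clip_features - clip_features.mean(axis=0)
    c = (x.T @ x) / max(len(clip_features) - 1, 1)
    iu = np.triu_indices(c.shape[0])
    return c[iu]
```

Note the descriptor length is d(d+1)/2 regardless of the number of clips T, which is what makes it a video-level aggregation.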
Automatic Estimation of Self-Reported Pain by Interpretable Representations of Motion Dynamics
We propose an automatic method for pain intensity measurement from video. For
each video, pain intensity was estimated from the dynamics of facial movement
captured by 66 facial points. A Gram matrix formulation was used for facial-point
trajectory representations on the Riemannian manifold of symmetric positive
semi-definite matrices of fixed rank. Curve fitting and temporal alignment were
then used to smooth the extracted trajectories. A Support Vector Regression
model was then trained to encode the extracted trajectories into ten pain
intensity levels consistent with the Visual Analogue Scale for pain intensity
measurement. The proposed approach was evaluated using the UNBC McMaster
Shoulder Pain Archive and was compared to the state-of-the-art on the same
data. Using both 5-fold cross-validation and leave-one-subject-out
cross-validation, our results are competitive with respect to state-of-the-art
methods.
Comment: Accepted at the ICPR 2020 conference.