
    Automatic Analysis of Facial Expressions Based on Deep Covariance Trajectories

    In this paper, we propose a new approach for facial expression recognition using deep covariance descriptors. The solution is based on the idea of encoding local and global Deep Convolutional Neural Network (DCNN) features extracted from still images into compact local and global covariance descriptors. The space geometry of covariance matrices is that of Symmetric Positive Definite (SPD) matrices. By classifying static facial expressions with a Support Vector Machine (SVM) equipped with a valid Gaussian kernel on the SPD manifold, we show that deep covariance descriptors are more effective than the standard classification with fully connected layers and softmax. In addition, we propose a new solution to model the temporal dynamics of facial expressions as deep trajectories on the SPD manifold. Extending the classification pipeline of covariance descriptors, we apply SVM with valid positive definite kernels derived from global alignment to classify deep covariance trajectories. Extensive experiments on the Oulu-CASIA, CK+, and SFEW datasets show that both the proposed static and dynamic approaches achieve state-of-the-art performance for facial expression recognition, outperforming many recent approaches. Comment: A preliminary version of this work appeared in "Otberdout N, Kacem A, Daoudi M, Ballihi L, Berretti S. Deep Covariance Descriptors for Facial Expression Recognition, in British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018; 2018:159." arXiv admin note: substantial text overlap with arXiv:1805.0386
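    As a rough illustration of the static pipeline, the sketch below (NumPy/scikit-learn; variable names such as train_descs are hypothetical) builds a regularized covariance descriptor from a set of local DCNN feature vectors, and a Gaussian kernel based on the log-Euclidean metric, one standard way to obtain a valid positive definite kernel on the SPD manifold. Whether this matches the paper's exact kernel choice is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

def covariance_descriptor(features, eps=1e-6):
    """Covariance of local DCNN feature vectors (n_vectors x dim),
    regularized so the result is strictly positive definite."""
    cov = np.cov(features, rowvar=False)
    return cov + eps * np.eye(cov.shape[0])

def spd_log(S):
    """Matrix logarithm of an SPD matrix via eigen-decomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def log_euclidean_gaussian_gram(descriptors, gamma=0.01):
    """k(A, B) = exp(-gamma * ||log A - log B||_F^2): a valid Gaussian
    kernel on the SPD manifold under the log-Euclidean metric."""
    logs = [spd_log(S) for S in descriptors]
    n = len(logs)
    K = np.eye(n)  # diagonal is exp(0) = 1
    for i in range(n):
        for j in range(i + 1, n):
            d2 = np.sum((logs[i] - logs[j]) ** 2)
            K[i, j] = K[j, i] = np.exp(-gamma * d2)
    return K

# Hypothetical usage with a precomputed-kernel SVM:
# clf = SVC(kernel='precomputed').fit(log_euclidean_gaussian_gram(train_descs), labels)
```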

    Gram Matrices Formulation of Body Shape Motion: An Application for Depression Severity Assessment

    We propose an automatic method to measure depression severity from body movement dynamics in participants undergoing treatment for depression. Participants were recorded in clinical interviews (Hamilton Rating Scale for Depression, HRSD) at seven-week intervals over a period of 21 weeks. A Gram matrix formulation was used to represent body shape and its trajectories in each video interview. Kinematic features were then extracted and encoded into a video-level representation using Gaussian Mixture Models (GMM) and Fisher vector encoding. A multi-class SVM was finally used to classify the encoded body movement dynamics into three levels of depression severity: moderately to severely depressed, mildly depressed, and remitted. Accuracy was highest for moderate to severe depression (68%), followed by mild depression (56%) and remission (37.93%). These results suggest that automatic detection of depression severity from body movement is feasible.
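    The Gram matrix representation itself is simple to state: for each frame, center the landmark configuration X and take G = X X^T, which discards translation and rotation while retaining shape. A minimal sketch (NumPy; array shapes and names are assumptions):

```python
import numpy as np

def gram_matrix(landmarks):
    """Gram matrix G = X X^T of a centered landmark configuration.
    landmarks: (n_points, dim) array of body keypoints for one frame.
    G is positive semi-definite of rank <= dim, and is unchanged by
    rotations of the configuration since (X R)(X R)^T = X X^T."""
    X = landmarks - landmarks.mean(axis=0)  # remove translation
    return X @ X.T

# A video interview then becomes a trajectory of Gram matrices,
# one per frame, from which kinematic features can be extracted:
# trajectory = [gram_matrix(f) for f in video_landmarks]
```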

    Temporal Alignment of Human Motion Data: A Geometric Point of View

    Temporal alignment is an inherent task in most applications dealing with videos: action recognition, motion transfer, virtual trainers, rehabilitation, etc. In this paper we study this task from a geometric point of view: in particular, we show that the basic properties expected from a temporal alignment procedure imply that the set of motions aligned to a template forms a slice of a principal fiber bundle for the group of temporal reparameterizations. A temporal alignment procedure provides a reparameterization-invariant projection onto this particular slice. This geometric viewpoint allows us to devise a consistency check for testing the accuracy of any temporal alignment procedure. We give examples of alignment procedures from the literature applied to motions of tennis players. Most of them use dynamic programming to compute the best correspondence between two motions relative to a given cost function. This step is computationally expensive, of complexity O(NM), where N and M are the numbers of frames. Moreover, most methods use features that are invariant under translations and rotations in R^3, whereas most actions are only invariant under translation along, and rotation around, the vertical axis, i.e. the axis aligned with the gravitational field. The discarded information contained in the vertical direction is crucial for accurate synchronization of motions. We propose to incorporate keyframe correspondences into the dynamic programming algorithm based on coarse information extracted from the vertical variations, in our case from the elevation of the arm holding the racket. The resulting temporal alignment procedures are not only more accurate, but also computationally more efficient.
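    A minimal sketch of the idea (Python; not the authors' exact algorithm): classic O(NM) dynamic-programming alignment, plus a variant that forces the warping path through given keyframe correspondences, which both constrains the solution and reduces the work to the sum of the per-segment products.

```python
import numpy as np

def dtw(a, b, cost):
    """Classic O(N*M) dynamic-programming alignment cost of two sequences."""
    N, M = len(a), len(b)
    if N == 0 or M == 0:
        return 0.0
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            c = cost(a[i - 1], b[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[N, M]

def dtw_with_keyframes(a, b, cost, keyframe_pairs):
    """Force the warping path through matched keyframes (i_k, j_k),
    e.g. extrema of the elevation of the arm holding the racket,
    and align each segment independently."""
    anchors = [(0, 0)] + sorted(keyframe_pairs) + [(len(a), len(b))]
    total = 0.0
    for (i0, j0), (i1, j1) in zip(anchors, anchors[1:]):
        total += dtw(a[i0:i1], b[j0:j1], cost)
    return total

# Example cost between per-frame feature vectors:
# cost = lambda x, y: np.linalg.norm(x - y)
```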

    Representation Learning With Convolutional Neural Networks

    Deep learning methods have achieved great success in Computer Vision and Natural Language Processing. Much recent work in deep learning is concerned with how to learn meaningful and effective representations of data, because the performance of machine learning approaches depends heavily on the choice and quality of the data representation, and different kinds of representation entangle and hide, to different degrees, the explanatory factors of variation behind the data. In this dissertation, we focus on representation learning with deep neural networks for different data formats, including text, 3D polygon shapes, and brain fiber tracts. First, we propose a topic-based word representation learning approach for text classification. The proposed approach takes the global semantic relationships between words over the whole corpus into consideration and encodes them into distributed vector representations with the continuous Skip-gram model. The learned representations, which capture a large number of precise syntactic and semantic word relationships, are taken as input to Convolutional Neural Networks for classification. Our experimental results show the effectiveness of the proposed method on indexing of biomedical articles, behavior code annotation of clinical text fragments, and classification of newsgroups. Second, we present a 3D polygon shape representation learning framework for shape segmentation. We propose the Directionally Convolutional Network (DCN), which extends convolution operations from images to the polygon mesh surface with a rotation-invariant property. Based on the proposed DCN, we learn effective shape representations from raw geometric features and then classify each face of a given polygon mesh into predefined semantic parts. Through extensive experiments, we demonstrate that our framework outperforms the current state of the art. Third, we propose to learn effective and meaningful representations for brain fiber tracts using deep learning frameworks. We handle the highly unbalanced dataset by introducing an asymmetrical loss function that weights easily and hard classified samples differently, so that the training loss is not dominated by the easy samples and training is more efficient. In addition, we learn more effective and meaningful representations by introducing deeper networks and metric learning approaches. Furthermore, we improve the interpretability of our framework by introducing an attention mechanism. Our experimental results show that our proposed framework significantly outperforms the current gold standard on a real-world dataset.
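    The dissertation abstract does not spell out the asymmetrical loss; the sketch below (PyTorch) follows the focal-loss idea of down-weighting easily classified samples, which matches the stated goal of keeping easy samples from dominating the training loss. The function name and the gamma value are assumptions.

```python
import torch
import torch.nn.functional as F

def asymmetric_focal_style_loss(logits, targets, gamma=2.0):
    """Down-weights easily classified samples (in the spirit of the
    focal loss) so that training on a highly unbalanced dataset is
    not dominated by the easy majority class.
    logits: (batch, n_classes); targets: (batch,) long class indices."""
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()                 # predicted prob. of the true class
    weight = (1.0 - pt) ** gamma      # ~0 for confident correct samples
    return -(weight * log_pt).mean()
```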

    Second-order Temporal Pooling for Action Recognition

    Deep learning models for video-based action recognition usually generate features for short clips (consisting of a few frames); such clip-level features are aggregated into video-level representations by computing statistics on them. Typically, zeroth-order (max) or first-order (average) statistics are used. In this paper, we explore the benefits of using second-order statistics. Specifically, we propose a novel end-to-end learnable feature aggregation scheme, dubbed temporal correlation pooling, that generates an action descriptor for a video sequence by capturing the similarities between the temporal evolution of clip-level CNN features computed across the video. Such a descriptor, while being computationally cheap, also naturally encodes the co-activations of multiple CNN features, thereby providing a richer characterization of actions than its first-order counterparts. We also propose higher-order extensions of this scheme by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space. We provide experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained datasets such as MPII Cooking Activities and JHMDB, as well as the recent Kinetics-600. Our results demonstrate the advantages of higher-order pooling schemes, which, when combined with hand-crafted features (as is standard practice), achieve state-of-the-art accuracy. Comment: Accepted in the International Journal of Computer Vision (IJCV)
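    As a rough sketch of what second-order pooling computes (NumPy; this illustrates the general idea, not the paper's exact learnable layer): given one feature vector per clip, form the correlation matrix of feature dimensions across time and half-vectorize it into a fixed-length video descriptor.

```python
import numpy as np

def second_order_temporal_pool(clip_features, eps=1e-8):
    """Aggregate clip-level CNN features (T x d) into a single
    second-order video descriptor: the d x d correlation matrix of
    feature dimensions across time, half-vectorized to fixed length."""
    F = clip_features - clip_features.mean(axis=0, keepdims=True)
    C = (F.T @ F) / max(len(clip_features) - 1, 1)  # covariance
    s = np.sqrt(np.diag(C)) + eps
    R = C / np.outer(s, s)                          # correlations (co-activations)
    return R[np.triu_indices_from(R)]               # fixed-length descriptor
```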

    Automatic Estimation of Self-Reported Pain by Interpretable Representations of Motion Dynamics

    We propose an automatic method for pain intensity measurement from video. For each video, pain intensity was measured from the dynamics of facial movement, captured by 66 facial landmarks. A Gram matrix formulation was used to represent the landmark trajectories on the Riemannian manifold of symmetric positive semi-definite matrices of fixed rank. Curve fitting and temporal alignment were then used to smooth the extracted trajectories. A Support Vector Regression model was then trained to map the extracted trajectories to ten pain intensity levels consistent with the Visual Analogue Scale for pain intensity measurement. The proposed approach was evaluated on the UNBC-McMaster Shoulder Pain Archive and compared to the state of the art on the same data. Under both 5-fold cross-validation and leave-one-subject-out cross-validation, our results are competitive with state-of-the-art methods. Comment: Accepted at the ICPR 2020 conference.
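    A speculative sketch of the regression stage (NumPy/scikit-learn): the trajectory encoding and all names below are hypothetical placeholders, since the paper's exact trajectory features differ.

```python
import numpy as np
from sklearn.svm import SVR

def trajectory_descriptor(gram_trajectory):
    """Hypothetical fixed-length encoding of a smoothed, temporally
    aligned Gram-matrix trajectory: per-entry mean and standard
    deviation over time."""
    T = np.stack(gram_trajectory)  # (n_frames, n, n)
    return np.concatenate([T.mean(axis=0).ravel(), T.std(axis=0).ravel()])

# One descriptor per video, regressed to the self-reported VAS score:
# X = np.stack([trajectory_descriptor(t) for t in train_trajectories])
# reg = SVR(kernel='rbf', C=10.0).fit(X, train_vas)
# levels = np.clip(np.rint(reg.predict(X_test)), 0, 10)  # ten pain levels
```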