10,000 research outputs found
LEARNet Dynamic Imaging Network for Micro Expression Recognition
Unlike prevalent facial expressions, micro expressions have subtle,
involuntary muscle movements which are short-lived in nature. These minute
muscle movements reflect true emotions of a person. Due to the short duration
and low intensity, these micro-expressions are very difficult to perceive and
interpret correctly. In this paper, we propose the dynamic representation of
micro-expressions to preserve facial movement information of a video in a
single frame. We also propose a Lateral Accretive Hybrid Network (LEARNet) to
capture micro-level features of an expression in the facial region. The LEARNet
refines the salient expression features in accretive manner by incorporating
accretion layers (AL) in the network. The response of the AL holds the hybrid
feature maps generated by prior laterally connected convolution layers.
Moreover, LEARNet architecture incorporates the cross decoupled relationship
between convolution layers which helps in preserving the tiny but influential
facial muscle change information. The visual responses of the proposed LEARNet
depict the effectiveness of the system by preserving both high- and micro-level
edge features of facial expression. The effectiveness of the proposed LEARNet
is evaluated on four benchmark datasets: CASME-I, CASME-II, CAS(ME)^2 and SMIC.
The experimental results after investigation show a significant improvement of
4.03%, 1.90%, 1.79% and 2.82% as compared with ResNet on CASME-I, CASME-II,
CAS(ME)^2 and SMIC datasets respectively.Comment: Dynamic imaging, accretion, lateral, micro expression recognitio
ICface: Interpretable and Controllable Face Reenactment Using GANs
This paper presents a generic face animator that is able to control the pose
and expressions of a given face image. The animation is driven by human
interpretable control signals consisting of head pose angles and the Action
Unit (AU) values. The control information can be obtained from multiple sources
including external driving videos and manual controls. Due to the interpretable
nature of the driving signal, one can easily mix the information between
multiple sources (e.g. pose from one image and expression from another) and
apply selective post-production editing. The proposed face animator is
implemented as a two-stage neural network model that is learned in a
self-supervised manner using a large video collection. The proposed
Interpretable and Controllable face reenactment network (ICface) is compared to
the state-of-the-art neural network-based face animation techniques in multiple
tasks. The results indicate that ICface produces better visual quality while
being more versatile than most of the comparison methods. The introduced model
could provide a lightweight and easy to use tool for a multitude of advanced
image and video editing tasks.Comment: Accepted in WACV-202
Automatic Analysis of Facial Expressions Based on Deep Covariance Trajectories
In this paper, we propose a new approach for facial expression recognition
using deep covariance descriptors. The solution is based on the idea of
encoding local and global Deep Convolutional Neural Network (DCNN) features
extracted from still images, in compact local and global covariance
descriptors. The space geometry of the covariance matrices is that of Symmetric
Positive Definite (SPD) matrices. By conducting the classification of static
facial expressions using Support Vector Machine (SVM) with a valid Gaussian
kernel on the SPD manifold, we show that deep covariance descriptors are more
effective than the standard classification with fully connected layers and
softmax. Besides, we propose a completely new and original solution to model
the temporal dynamic of facial expressions as deep trajectories on the SPD
manifold. As an extension of the classification pipeline of covariance
descriptors, we apply SVM with valid positive definite kernels derived from
global alignment for deep covariance trajectories classification. By performing
extensive experiments on the Oulu-CASIA, CK+, and SFEW datasets, we show that
both the proposed static and dynamic approaches achieve state-of-the-art
performance for facial expression recognition outperforming many recent
approaches.Comment: A preliminary version of this work appeared in "Otberdout N, Kacem A,
Daoudi M, Ballihi L, Berretti S. Deep Covariance Descriptors for Facial
Expression Recognition, in British Machine Vision Conference 2018, BMVC 2018,
Northumbria University, Newcastle, UK, September 3-6, 2018. ; 2018 :159."
arXiv admin note: substantial text overlap with arXiv:1805.0386
Robust Facial Expression Recognition with Convolutional Visual Transformers
Facial Expression Recognition (FER) in the wild is extremely challenging due
to occlusions, variant head poses, face deformation and motion blur under
unconstrained conditions. Although substantial progresses have been made in
automatic FER in the past few decades, previous studies are mainly designed for
lab-controlled FER. Real-world occlusions, variant head poses and other issues
definitely increase the difficulty of FER on account of these
information-deficient regions and complex backgrounds. Different from previous
pure CNNs based methods, we argue that it is feasible and practical to
translate facial images into sequences of visual words and perform expression
recognition from a global perspective. Therefore, we propose Convolutional
Visual Transformers to tackle FER in the wild by two main steps. First, we
propose an attentional selective fusion (ASF) for leveraging the feature maps
generated by two-branch CNNs. The ASF captures discriminative information by
fusing multiple features with global-local attention. The fused feature maps
are then flattened and projected into sequences of visual words. Second,
inspired by the success of Transformers in natural language processing, we
propose to model relationships between these visual words with global
self-attention. The proposed method are evaluated on three public in-the-wild
facial expression datasets (RAF-DB, FERPlus and AffectNet). Under the same
settings, extensive experiments demonstrate that our method shows superior
performance over other methods, setting new state of the art on RAF-DB with
88.14%, FERPlus with 88.81% and AffectNet with 61.85%. We also conduct
cross-dataset evaluation on CK+ show the generalization capability of the
proposed method
An original framework for understanding human actions and body language by using deep neural networks
The evolution of both fields of Computer Vision (CV) and Artificial Neural Networks (ANNs) has allowed the development of efficient automatic systems for the analysis of people's behaviour.
By studying hand movements it is possible to recognize gestures, often used by people to communicate information in a non-verbal way.
These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively.
While the processing of body movements play a key role in the action recognition and affective computing fields. The former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements;
both are essential tasks in many computer vision applications, including event recognition, and video surveillance.
In this Ph.D. thesis, an original framework for understanding Actions and body language is presented. The framework is composed of three main modules: in the first one, a Long Short Term Memory Recurrent Neural Networks (LSTM-RNNs) based method for the Recognition of Sign Language and Semaphoric Hand Gestures is proposed; the second module presents a solution based on 2D skeleton and two-branch stacked LSTM-RNNs for action recognition in video sequences; finally, in the last module, a solution for basic non-acted emotion recognition by using 3D skeleton and Deep Neural Networks (DNNs) is provided.
The performances of RNN-LSTMs are explored in depth, due to their ability to model the long term contextual information of temporal sequences, making them suitable for analysing body movements.
All the modules were tested by using challenging datasets, well known in the state of the art, showing remarkable results compared to the current literature methods
Vision Transformer with Attentive Pooling for Robust Facial Expression Recognition
Facial Expression Recognition (FER) in the wild is an extremely challenging
task. Recently, some Vision Transformers (ViT) have been explored for FER, but
most of them perform inferiorly compared to Convolutional Neural Networks
(CNN). This is mainly because the new proposed modules are difficult to
converge well from scratch due to lacking inductive bias and easy to focus on
the occlusion and noisy areas. TransFER, a representative transformer-based
method for FER, alleviates this with multi-branch attention dropping but brings
excessive computations. On the contrary, we present two attentive pooling (AP)
modules to pool noisy features directly. The AP modules include Attentive Patch
Pooling (APP) and Attentive Token Pooling (ATP). They aim to guide the model to
emphasize the most discriminative features while reducing the impacts of less
relevant features. The proposed APP is employed to select the most informative
patches on CNN features, and ATP discards unimportant tokens in ViT. Being
simple to implement and without learnable parameters, the APP and ATP
intuitively reduce the computational cost while boosting the performance by
ONLY pursuing the most discriminative features. Qualitative results demonstrate
the motivations and effectiveness of our attentive poolings. Besides,
quantitative results on six in-the-wild datasets outperform other
state-of-the-art methods.Comment: Codes will be public on https://github.com/youqingxiaozhua/APVi
- …