CNN Based 3D Facial Expression Recognition Using Masking And Landmark Features
Automatically recognizing facial expressions is an important part of human-machine interaction. In this paper, we first review previous studies on both 2D and 3D facial expression recognition, and then summarize the key research questions to be solved in the future. Finally, we propose a 3D facial expression recognition (FER) algorithm using convolutional neural networks (CNNs) and landmark features/masks, which is invariant to pose and illumination variations because it uses only 3D geometric facial models without any texture information. The proposed method has been tested on two public 3D facial expression databases: BU-4DFE and BU-3DFE. The results show that the CNN model benefits from the masking, and that the combination of landmark and CNN features further improves 3D FER accuracy.
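The landmark-based masking described above can be sketched minimally as follows. This is an illustration only, assuming a 3D geometric map rendered as a 2D depth image with projected landmark positions; the function names, shapes, and radius are hypothetical, not taken from the paper.

```python
import numpy as np

def landmark_mask(shape, landmarks, radius=3):
    """Build a binary mask keeping only regions around facial landmarks.

    `landmarks` is a list of (row, col) positions on the depth map.
    """
    mask = np.zeros(shape, dtype=np.float32)
    rows, cols = np.indices(shape)
    for r, c in landmarks:
        mask[(rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2] = 1.0
    return mask

def masked_input(depth_map, landmarks, radius=3):
    """Mask a geometric map so a CNN focuses on expression-relevant areas."""
    return depth_map * landmark_mask(depth_map.shape, landmarks, radius)
```

In this sketch, the masked depth map would be fed to the CNN, and landmark coordinates could additionally be concatenated with the CNN features before classification.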
Set Operation Aided Network For Action Units Detection
Since deep model-based methods contain a large number of parameters, training such models usually requires many fully AU-annotated facial images. This holds for the number of frames in two widely used datasets, BP4D [31] and DISFA [18], but those frames were captured from a small number of subjects (41 and 27, respectively). This is problematic: because each subject produces highly consistent facial muscle movements, adding more frames per subject only adds more nearby points in the feature space, so the classifier does not benefit from the extra frames. Data augmentation methods can alleviate the problem to a certain degree, but they fail to augment new subjects. We propose a novel Set Operation Aided Network (SO-Net) for action unit detection. Specifically, new features and the corresponding labels are generated by applying set operations in both the feature and label spaces. A generated feature can be treated as the representation of a hypothetical image, so we implicitly obtain training examples beyond what was originally observed in the dataset. The deep model is therefore forced to learn subject-independent features and generalizes to unseen subjects. SO-Net is end-to-end trainable and can be flexibly plugged into any CNN model during training. We evaluate the proposed method on two public datasets, BP4D and DISFA. The experiments show state-of-the-art performance, demonstrating the effectiveness of the proposed method.
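One plausible reading of "set operations in both the feature and label spaces" is element-wise union/intersection on multi-label AU vectors, with features combined the same way. The sketch below is a guess at that mechanism for illustration only; the paper's actual operations may differ.

```python
import numpy as np

def set_op_augment(f1, y1, f2, y2):
    """Generate hypothetical new (feature, label) training pairs.

    Union (element-wise max / OR) and intersection (element-wise min / AND)
    are applied consistently to features and multi-label AU vectors, so each
    synthetic feature keeps a coherent label.
    """
    union_f, union_y = np.maximum(f1, f2), np.maximum(y1, y2)
    inter_f, inter_y = np.minimum(f1, f2), np.minimum(y1, y2)
    return (union_f, union_y), (inter_f, inter_y)
```

Mixing pairs drawn from different subjects would, under this reading, push the model toward subject-independent features.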
Facial Expression Recognition By De-expression Residue Learning
A facial expression is a combination of an expressive component and a neutral component of a person's face. In this paper, we propose to recognize facial expressions by extracting the expressive component through a de-expression learning procedure, called De-expression Residue Learning (DeRL). First, a generative model is trained using a conditional GAN (cGAN) to generate the corresponding neutral face image for any input face image. We call this procedure de-expression because the expressive information is filtered out by the generative model; however, that information is still recorded in its intermediate layers. Unlike previous works that use pixel-level or feature-level differences between the input and the neutral face for facial expression classification, our method learns the deposition (or residue) that remains in the intermediate layers of the generative model. This residue is essential, as it contains the expressive component deposited in the generative model by any input facial expression image. Seven public facial expression databases are employed in our experiments. With two databases (BU-4DFE and BP4D-spontaneous) used for pre-training, the DeRL method has been evaluated on five databases: CK+, Oulu-CASIA, MMI, BU-3DFE, and BP4D+. The experimental results demonstrate the superior performance of the proposed method.
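The residue-learning pipeline can be sketched schematically: run the input through the de-expression generator, collect each intermediate activation, and classify from those residues. In this minimal stand-in, `generator_layers` and `classifier_heads` are plain callables representing trained networks; all names are illustrative, not from the paper.

```python
import numpy as np

def derl_features(image, generator_layers, classifier_heads):
    """Collect 'residue' features from the intermediate layers of a
    de-expression generator, then fuse per-layer classifier outputs.
    """
    x, residues = image, []
    for layer in generator_layers:
        x = layer(x)          # forward through the generative model
        residues.append(x)    # expressive component deposited here
    # a head per residue; their outputs are summed into fused logits
    return sum(head(r) for r, head in zip(residues, classifier_heads))
```

The final layer's output would be the regenerated neutral face; classification relies on the intermediate residues rather than on a pixel- or feature-level difference.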
Identity-adaptive Facial Expression Recognition Through Expression Regeneration Using Conditional Generative Adversarial Networks
Subject variation is a challenging issue for facial expression recognition, especially when handling unseen subjects with small-scale labeled facial expression databases. Although transfer learning has been widely used to tackle the problem, performance still degrades on new data. In this paper, we present a novel approach (called IA-gen) that alleviates subject variation by regenerating expressions from any input facial image. First, we train conditional generative models to generate the six prototypic facial expressions from any given query face image while keeping the identity-related information unchanged. Generative Adversarial Networks are employed to train the conditional generative models, each of which is designed to generate one of the prototypic facial expression images. Second, a regular CNN (FER-Net) is fine-tuned for expression classification. After the corresponding prototypic facial expressions are regenerated from a facial image, we take the output of the last FC layer of FER-Net as the feature for both the input image and the generated images. The input image is then classified according to the minimum distance between it and the generated expression images in the feature space. Our proposed method not only alleviates the influence of inter-subject variation but is also flexible enough to integrate with any other FER CNN for person-independent facial expression recognition. Our method has been evaluated on the CK+, Oulu-CASIA, BU-3DFE, and BU-4DFE databases, and the results demonstrate its effectiveness.
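The final classification step described above is a nearest-prototype rule in feature space, which is simple enough to sketch directly. Shapes and names here are illustrative; the paper's feature dimensionality and distance are taken on trust as Euclidean.

```python
import numpy as np

def classify_by_regeneration(input_feat, generated_feats):
    """Assign the expression whose regenerated image is nearest in feature space.

    input_feat:      (D,) feature of the query image (e.g. FER-Net's last FC layer)
    generated_feats: (6, D) one feature per regenerated prototypic expression
    """
    dists = np.linalg.norm(generated_feats - input_feat, axis=1)
    return int(np.argmin(dists))
```

Because the same identity appears in the query and all six regenerated images, the distance is dominated by the expressive component rather than by subject identity.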
Exploiting Semantic Embedding And Visual Feature For Facial Action Unit Detection
Recent studies on detecting facial action units (AUs) have utilized auxiliary information (i.e., facial landmarks, relationships among AUs and expressions, web facial images, etc.) to improve AU detection performance. So far, however, no semantic information about AUs has been explored for this task. AU semantic descriptions in fact provide much more information than the binary AU labels alone, so we propose to exploit Semantic Embedding and Visual features (SEV-Net) for AU detection. More specifically, AU semantic embeddings are obtained through both Intra-AU and Inter-AU attention modules, where the Intra-AU attention module captures the relations among words within each sentence describing an individual AU, and the Inter-AU attention module focuses on the relations among those sentences. The learned AU semantic embeddings are then used as guidance for generating attention maps through a cross-modality attention network, and the generated cross-modality attention maps are further used as weights for the aggregated feature. Our proposed method is unique in being the first to exploit AU semantic features in this way. The approach has been evaluated on three public AU-coded facial expression databases and achieves superior performance to state-of-the-art peer methods.
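The cross-modality attention step can be sketched as text embeddings acting as queries over a flattened spatial feature map, producing one attention map (and one pooled feature) per AU. This is a generic attention sketch under assumed shapes, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modality_attention(visual, text_emb):
    """Weight spatial visual features with AU semantic embeddings.

    visual:   (H*W, D) flattened feature map
    text_emb: (num_AUs, D) learned AU semantic embeddings
    Returns one attention-pooled (D,) feature per AU.
    """
    attn = softmax(text_emb @ visual.T, axis=-1)  # (num_AUs, H*W) maps
    return attn @ visual                          # (num_AUs, D) weighted features
```

Each row of `attn` sums to one, so each AU's feature is a convex combination of spatial locations highlighted by that AU's description.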
Multi-modality Empowered Network For Facial Action Unit Detection
This paper presents a new thermal empowered multi-task network (TEMT-Net) to improve facial action unit detection. Our primary goal is to leverage the situation in which the training set contains multi-modality data while the application scenario has only one modality. Thermal images are robust to illumination and face color, and in the proposed multi-task framework we utilize data from both modalities. Action unit detection and facial landmark detection are correlated tasks. To exploit the advantages of, and the correlations among, the different modalities and tasks, we propose a novel thermal empowered multi-task deep neural network that learns action unit detection, facial landmark detection, and thermal image reconstruction simultaneously. The thermal image generator and the facial landmark detector regularize the features learned from the shared input color images. Extensive experiments are conducted on the BP4D and MMSE databases, with comparison to state-of-the-art methods. The experiments show that the multi-modality framework improves AU detection significantly.
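A joint objective for the three tasks might combine a binary cross-entropy term for AU detection, a regression term for landmarks, and a reconstruction term for the thermal image. The loss forms and weights below are illustrative assumptions, not the paper's reported objective.

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for multi-label AU detection."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def multitask_loss(au_pred, au_true, lm_pred, lm_true, th_pred, th_true,
                   w_lm=0.5, w_th=0.5):
    """Joint objective: AU detection (BCE) + landmark regression (MSE)
    + thermal reconstruction (L1). Weights w_lm, w_th are illustrative."""
    return (bce(au_pred, au_true)
            + w_lm * np.mean((lm_pred - lm_true) ** 2)
            + w_th * np.mean(np.abs(th_pred - th_true)))
```

Only the shared color-image encoder would be needed at test time; the thermal and landmark branches serve as training-time regularizers.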
Weakly-Supervised Text-driven Contrastive Learning for Facial Behavior Understanding
Contrastive learning has shown promising potential for learning robust representations from unlabeled data. However, constructing effective positive-negative pairs for contrastive learning on facial behavior datasets remains challenging: such pairs inevitably encode subject-ID information, and randomly constructed pairs may push similar facial images apart because of the limited number of subjects in facial behavior datasets. To address this issue, we propose to utilize activity descriptions, coarse-grained information provided in some datasets, which can supply high-level semantic information about the image sequences but is often neglected in previous studies. More specifically, we introduce a two-stage Contrastive Learning with Text-Embedded framework for Facial behavior understanding (CLEF). The first stage is a weakly-supervised contrastive learning method that learns representations from positive-negative pairs constructed using the coarse-grained activity information. The second stage trains the recognition of facial expressions or facial action units by maximizing the similarity between images and their corresponding text label names. The proposed CLEF achieves state-of-the-art performance on three in-the-lab datasets for AU recognition and three in-the-wild datasets for facial expression recognition.
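The first-stage idea (samples sharing an activity label are positives, everything else is a negative) can be sketched as a simplified supervised-contrastive loss on normalized features. Temperature and loss form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def weakly_supervised_contrastive(feats, activity_ids, temp=0.1):
    """Simplified SupCon-style loss using coarse activity labels.

    feats:        (N, D) features; activity_ids: length-N activity labels.
    Samples with the same activity id are positives; all others negatives.
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T / temp
    n, loss = len(feats), 0.0
    for i in range(n):
        pos = [j for j in range(n) if j != i and activity_ids[j] == activity_ids[i]]
        if not pos:
            continue
        denom = sum(np.exp(sim[i, j]) for j in range(n) if j != i)
        loss -= np.mean([np.log(np.exp(sim[i, j]) / denom) for j in pos])
    return loss / n
```

With correct activity grouping the loss is lower than with shuffled groups, which is what drives the representation toward activity-level (rather than subject-level) structure.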
An EEG-Based Multi-Modal Emotion Database With Both Posed And Authentic Facial Actions For Emotion Analysis
Emotion is an experience associated with a particular pattern of physiological activity along with different physiological, behavioral, and cognitive changes. One behavioral change is facial expression, which has been studied extensively over the past few decades. Facial behavior varies with a person's emotion according to differences in culture, personality, age, context, and environment. In recent years, physiological activities have also been used to study emotional responses; a typical signal is the electroencephalogram (EEG), which measures brain activity. Most existing EEG-based emotion analysis has overlooked the role of facial expression changes, and there exists little research on the relationship between facial behavior and brain signals due to the lack of datasets measuring both EEG and facial action signals simultaneously. To address this problem, we developed a new database by collecting facial expressions, action units, and EEG simultaneously. We recorded the EEGs and face videos of both posed facial actions and spontaneous expressions from 29 participants of different ages, genders, and ethnic backgrounds. Differing from existing approaches, we designed a protocol to capture EEG signals by explicitly evoking participants' individual action units, and we investigated the relation between the EEG signals and facial action units. As a baseline, the database has been evaluated through experiments on both posed and spontaneous emotion recognition with images alone, EEG alone, and EEG fused with images. The database will be released to the research community to advance the state of the art in automatic emotion recognition.
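One common baseline for "EEG fused with images" is decision-level (late) fusion of per-modality class probabilities. The weighted-average rule below is a generic sketch; the paper's actual fusion scheme and weight are not specified here.

```python
import numpy as np

def late_fusion(img_probs, eeg_probs, w=0.5):
    """Decision-level fusion baseline.

    img_probs, eeg_probs: (C,) class-probability vectors from the two
    modality-specific classifiers; w weights the image modality.
    Returns the fused class index.
    """
    fused = w * img_probs + (1 - w) * eeg_probs
    return int(np.argmax(fused))
```

Feature-level fusion (concatenating modality features before a joint classifier) is the usual alternative when both modalities are available at training and test time.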
The 2nd 3D Face Alignment In The Wild Challenge (3DFAW-video): Dense Reconstruction From Video
3D face alignment approaches have strong advantages over 2D approaches with respect to representational power and robustness to illumination and pose. Over the past few years, a number of research groups have made rapid advances in dense 3D alignment from 2D video and obtained impressive results, yet how these various methods compare is relatively unknown. Previous benchmarks addressed sparse 3D alignment and single-image 3D reconstruction; no commonly accepted evaluation protocol exists for dense 3D face reconstruction from video with which to compare them. The 2nd 3D Face Alignment in the Wild from Videos (3DFAW-Video) Challenge extends the previous 3DFAW 2016 competition to the estimation of dense 3D facial structure from video. It presents a new large corpus of profile-to-profile face videos recorded under different imaging conditions and annotated with corresponding high-resolution 3D ground-truth meshes. In this paper we outline the evaluation protocol, the data used, and the results. 3DFAW-Video is to be held in conjunction with the 2019 International Conference on Computer Vision in Seoul, Korea.
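Dense-reconstruction benchmarks typically score a mean per-vertex distance between predicted and ground-truth meshes after rigid alignment and correspondence. The function below is a simplified stand-in for such a metric, not the challenge's actual evaluation protocol.

```python
import numpy as np

def dense_recon_error(pred, gt):
    """Mean per-vertex Euclidean error between two corresponded meshes.

    pred, gt: (V, 3) vertex arrays, assumed already rigidly aligned and
    in one-to-one correspondence.
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))
```

In practice the error is usually normalized (e.g. by inter-ocular distance or bounding-box size) so that scores are comparable across subjects and sequences.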