Shallow Triple Stream Three-dimensional CNN (STSTNet) for Micro-expression Recognition
In recent years, the state of the art in facial micro-expression recognition
has been significantly advanced by deep neural networks. The robustness of
deep learning has yielded promising performance beyond that of traditional
handcrafted approaches. Most works in the literature have emphasized increasing
the depth of networks and employing highly complex objective functions to learn
more features. In this paper, we design a Shallow Triple Stream
Three-dimensional CNN (STSTNet) that is computationally light whilst capable of
extracting discriminative high level features and details of micro-expressions.
The network learns from three optical flow features (i.e., optical strain,
horizontal and vertical optical flow fields) computed based on the onset and
apex frames of each video. Our experimental results demonstrate the
effectiveness of the proposed STSTNet, which obtained an unweighted average
recall rate of 0.7605 and unweighted F1-score of 0.7353 on the composite
database consisting of 442 samples from the SMIC, CASME II and SAMM databases.
Comment: 5 pages, 1 figure, Accepted and published in IEEE FG 201
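The three input streams above can be illustrated with a short sketch: given the horizontal and vertical optical flow fields between the onset and apex frames, the optical strain magnitude is derived from the spatial derivatives of the flow. This is a minimal NumPy illustration of the standard definition used in the micro-expression literature, not the authors' released code.

```python
import numpy as np

def optical_strain(u, v):
    """Optical strain magnitude from horizontal (u) and vertical (v)
    optical flow fields, as commonly defined in the micro-expression
    literature (a sketch, not the authors' exact implementation)."""
    # Spatial derivatives of the flow components (axis 0 = y, axis 1 = x).
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    # Normal and shear strain components.
    e_xx = du_dx
    e_yy = dv_dy
    e_xy = 0.5 * (du_dy + dv_dx)
    # Per-pixel strain magnitude.
    return np.sqrt(e_xx**2 + e_yy**2 + 2.0 * e_xy**2)
```

The strain map, together with the raw u and v fields, would then form the three streams fed to the network.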
CNN based facial aesthetics analysis through dynamic robust losses and ensemble regression
In recent years, estimating the beauty of faces has attracted growing interest in the fields of computer vision and machine
learning. This is due to the emergence of face beauty datasets (such as SCUT-FBP, SCUT-FBP5500 and KDEF-PT) and
the prevalence of deep learning methods in many tasks. The goal of this work is to leverage the advances in Deep
Learning architectures to provide stable and accurate face beauty estimation from static face images. To this end, our
proposed approach makes three main contributions. First, to capture the complicated high-level features associated with the facial
beauty prediction (FBP) problem using more than one pre-trained Convolutional Neural Network (CNN) model, we propose an architecture with
two backbones (2B-IncRex). Second, in addition to 2B-IncRex, we introduce a parabolic dynamic law to control the behavior
of the robust loss parameters during training. These robust losses are ParamSmoothL1, Huber, and Tukey. As a third
contribution, we propose an ensemble regression based on five regressors, namely Resnext-50, Inception-v3 and three
regressors based on our proposed 2B-IncRex architecture. These models are trained with the following dynamic loss
functions: Dynamic ParamSmoothL1, Dynamic Tukey, Dynamic ParamSmoothL1, Dynamic Huber, and Dynamic Tukey,
respectively. To evaluate the performance of our approach, we used two datasets: SCUT-FBP5500 and KDEF-PT. The
dataset SCUT-FBP5500 contains two evaluation scenarios provided by the database developers: a 60-40% split and five-fold
cross-validation. Our approach outperforms state-of-the-art methods on several metrics in both evaluation scenarios of
SCUT-FBP5500. Moreover, experiments on the KDEF-PT dataset demonstrate the efficiency of our approach for estimating
facial beauty using transfer learning, despite the presence of facial expressions and limited data. These comparisons highlight
the effectiveness of the proposed solutions for FBP. They also show that the proposed Dynamic robust losses lead to more
flexible and accurate estimators.
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
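The abstract states only that the law controlling the robust-loss parameters is parabolic; the schedule below is therefore a hypothetical illustration (function name, coefficients and endpoints are all assumptions), paired with the standard Huber loss it would modulate.

```python
import numpy as np

def parabolic_schedule(epoch, n_epochs, beta_start=1.0, beta_end=0.1):
    """Hypothetical parabolic law for a robust-loss parameter: the
    abstract only says the law is parabolic, so the exact shape and
    endpoints here are assumptions made for illustration."""
    t = epoch / max(n_epochs - 1, 1)
    # Quadratic decay from beta_start at epoch 0 to beta_end at the end.
    return beta_end + (beta_start - beta_end) * (1.0 - t) ** 2

def huber_loss(residual, beta):
    """Standard Huber loss: quadratic for |r| <= beta, linear beyond."""
    r = np.abs(residual)
    return np.where(r <= beta, 0.5 * r**2, beta * (r - 0.5 * beta))
```

During training, the scheduled value would be passed as `beta` at each epoch, tightening the loss's robustness threshold as training progresses.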
Deep face tracking and parsing in the wild
Face analysis has been a long-standing research direction in the field of computer vision and pattern recognition. A complete face analysis system involves solving several tasks including face detection, face tracking, face parsing, and face recognition. Recently, the performance of methods in all tasks has significantly improved thanks to the employment of Deep Convolutional Neural Networks (DCNNs).
However, existing face analysis algorithms mainly focus on facial images captured in constrained laboratory environments, and their performance on real-world images remains less explored. Compared with the lab environment, in-the-wild settings involve greater diversity in face sizes, poses, facial expressions, background clutter, lighting conditions and imaging quality.
This thesis investigates two fundamental tasks in face analysis under in-the-wild settings: face tracking and face parsing. Both tasks serve as important prerequisites for downstream face analysis applications. However, in-the-wild datasets remain scarce in both fields, and models have not been rigorously evaluated in such settings. In this thesis, we aim to bridge this gap by providing in-the-wild data, evaluating existing methods in these settings, and developing accurate, robust and efficient deep learning-based methods for the two tasks.
For face tracking in the wild, we introduce the first in-the-wild face tracking dataset, MobiFace, which consists of 80 videos captured by mobile phones during mobile live-streaming. The environment of the live-streaming performance is fully unconstrained, and the interactions between users and mobile phones are natural and spontaneous. Next, we evaluate existing tracking methods, including generic object trackers and dedicated face trackers. The results show that MobiFace presents unique challenges for face tracking in the wild that cannot be readily solved by existing methods. Finally, we present a DCNN-based framework, FT-RCNN, that significantly outperforms other methods in face tracking in the wild.
For face parsing in the wild, we introduce the first large-scale in-the-wild face dataset, iBugMask, which contains 21,866 training images and 1,000 testing images. Unlike existing datasets, the images in iBugMask are captured in fully unconstrained environments and have not been cropped or preprocessed in any way. Manually annotated per-pixel labels for eleven facial regions are provided for each target face. Next, we benchmark existing parsing methods, and the results show that iBugMask is extremely challenging for all of them.
By rigorous benchmarking, we observe that the pre-processing of facial images with bounding boxes in face parsing in the wild introduces bias. When cropping the face with a bounding box, a cropping margin has to be hand-picked. If face alignment is used, fiducial landmarks are required and a predefined alignment template has to be selected. These additional hyper-parameters have to be carefully considered and can have a significant impact on the face parsing performance.
To solve this, we propose the Region-of-Interest (RoI) Tanh-polar transform, which warps the whole image to a fixed-size representation. Moreover, the RoI Tanh-polar transform is differentiable and allows for rotation equivariance in DCNNs. We show that, when coupled with a simple Fully Convolutional Network, our RoI Tanh-polar Transformer Network achieves state-of-the-art results on face parsing in the wild.
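The core idea of the transform can be sketched as a sampling grid: each output pixel indexes an angle and a tanh-compressed radius around the RoI centre, so pixels near the face keep most of the resolution while the full image still fits in a fixed-size target. The function below is a simplified NumPy illustration of this coordinate mapping, not the authors' implementation.

```python
import numpy as np

def tanh_polar_grid(roi, out_h, out_w):
    """Sketch of an RoI Tanh-polar sampling grid. `roi` is
    (cx, cy, w, h): RoI centre and size. Each output row is a
    tanh-compressed radius, each column an angle; the returned
    (xs, ys) arrays give the source coordinates to sample from.
    A simplified illustration of the idea, not the authors' code."""
    cx, cy, w, h = roi
    angles = np.linspace(0.0, 2 * np.pi, out_w, endpoint=False)
    # Target radius rho in [0, 1); artanh expands it so that
    # rho = tanh(r / R): the RoI interior keeps most resolution
    # while the rest of the image is compressed into the border.
    rho = np.linspace(0.0, 0.99, out_h)
    r = np.arctanh(rho)                     # source radius in RoI units
    R, A = np.meshgrid(r, angles, indexing="ij")
    xs = cx + (w / 2.0) * R * np.cos(A)     # source x coordinates
    ys = cy + (h / 2.0) * R * np.sin(A)     # source y coordinates
    return xs, ys
```

In a full pipeline these grids would drive a differentiable sampling operation, which is what makes the transform trainable end-to-end.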
This thesis contributes towards in-the-wild face tracking and face parsing by providing novel datasets and proposing effective frameworks. Both tasks can benefit real-world downstream applications such as facial age estimation, facial expression recognition and lip-reading. The proposed RoI Tanh-polar transform also provides a new perspective on how to preprocess face images and make DCNNs truly end-to-end for real-world face analysis applications.
Towards a Robust Thermal-Visible Heterogeneous Face Recognition Approach Based on a Cycle Generative Adversarial Network
Security is a sensitive area that concerns authorities around the world due to the emerging terrorism phenomenon. Contactless biometric technologies such as face recognition have grown in interest for their capacity to identify probe subjects without any human interaction. Since traditional face recognition systems use visible-spectrum sensors, their performance decreases rapidly when certain visible imaging phenomena occur, mainly illumination changes. Unlike the visible spectrum, infrared spectra are invariant to light changes, which makes them an alternative solution for face recognition. In infrared, however, textural information is lost. In this paper, we aim to benefit from both the visible and thermal spectra by proposing a new heterogeneous face recognition approach. This approach includes four scientific contributions. The first is the annotation of a thermal face database, which has been shared with the scientific community via GitHub. The second is a multi-sensor face detector model based on the latest YOLO v3 architecture, able to detect faces captured in visible and thermal images simultaneously. The third contribution takes up the challenge of reducing the modality gap between the visible and thermal spectra by applying a new CycleGAN structure, called TV-CycleGAN, which synthesizes visible-like face images from thermal face images. This new thermal-visible synthesis method covers all extreme poses and facial expressions in color space. To show the efficacy and robustness of the proposed TV-CycleGAN, experiments have been conducted on three challenging benchmark databases covering different real-world scenarios: TUFTS and its aligned version, NVIE and PUJ. The qualitative evaluation shows that our method generates more realistic faces, and the quantitative one demonstrates that the proposed TV-CycleGAN gives the best improvement in face recognition rates.
Thus, while direct matching from thermal to visible images yields a recognition rate of 47.06% on the TUFTS database, the proposed TV-CycleGAN achieves an accuracy of 57.56% on the same database. It contributes rate enhancements of 29.16% and 15.71% for the NVIE and PUJ databases, respectively, and reaches an accuracy enhancement of 18.5% for the aligned TUFTS database. It also outperforms some recent state-of-the-art methods in terms of F1-score, AUC/EER and other evaluation metrics. Furthermore, the visible face images synthesized by the TV-CycleGAN method are very promising for thermal facial landmark detection, which constitutes the fourth contribution of this paper.
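The cycle-consistency objective that CycleGAN-style models such as TV-CycleGAN rely on can be sketched in a few lines: mapping thermal to visible and back (and vice versa) should reconstruct the input. The toy function below illustrates only this term; generator internals and the adversarial losses are omitted, and all names are assumptions.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two arrays."""
    return np.mean(np.abs(a - b))

def cycle_consistency_loss(G_tv, G_vt, thermal, visible, lam=10.0):
    """Cycle-consistency term of a CycleGAN-style model:
    G_tv maps thermal -> visible, G_vt maps visible -> thermal.
    Composing the two in either order should recover the input.
    Here the generators are arbitrary array-to-array callables."""
    loss_t = l1(G_vt(G_tv(thermal)), thermal)   # thermal -> visible -> thermal
    loss_v = l1(G_tv(G_vt(visible)), visible)   # visible -> thermal -> visible
    return lam * (loss_t + loss_v)
```

With a pair of toy generators that are exact inverses (e.g. `lambda x: x + 1` and `lambda x: x - 1`) the loss is zero, which is the behaviour the full adversarial training pushes the real generators towards.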
Evaluation of Data Augmentation Techniques for Facial Expression Recognition Systems
Most Facial Expression Recognition (FER) systems rely on machine learning approaches that require large databases for effective training. As these are not easily available, a good solution is to augment the databases with appropriate data augmentation (DA) techniques, which are typically based on either geometric transformations or oversampling augmentations (e.g., generative adversarial networks (GANs)). However, it is not always easy to understand which DA technique is more suitable for FER systems, because most state-of-the-art experiments use different settings, which makes the impact of DA techniques incomparable across studies. To advance in this respect, in this paper we evaluate and compare the impact of well-established DA techniques on the emotion recognition accuracy of a FER system based on the well-known VGG16 convolutional neural network (CNN). In particular, we consider both geometric transformations and a GAN to increase the amount of training images. We performed cross-database evaluations: training with the "augmented" KDEF database and testing with two different databases (CK+ and ExpW). The best results were obtained by combining horizontal reflection, translation and a GAN, bringing an accuracy increase of approximately 30%. This outperforms alternative approaches, except for one technique, which, however, relied on a considerably larger database.
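The two geometric augmentations found most effective above, horizontal reflection and translation, can be sketched as follows; the GAN-based branch is omitted, and this is an illustrative NumPy version rather than the paper's pipeline (the translation uses wrap-around as a simplification of zero-padded shifting).

```python
import numpy as np

def augment(image, rng):
    """Minimal sketch of two geometric augmentations evaluated in the
    paper: horizontal reflection and a small random translation.
    `image` is an H x W (or H x W x C) array; `rng` is a NumPy
    Generator. The GAN-based oversampling branch is not shown."""
    out = image
    if rng.random() < 0.5:                      # horizontal reflection
        out = out[:, ::-1]
    dy, dx = rng.integers(-4, 5, size=2)        # shift of up to 4 pixels
    out = np.roll(out, (dy, dx), axis=(0, 1))   # translation (wrap-around)
    return out
```

In practice each training image would pass through this function once per epoch, multiplying the effective dataset size.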
PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer
Remote photoplethysmography (rPPG), which aims at measuring heart activities
and physiological signals from facial video without any contact, has great
potential in many applications (e.g., remote healthcare and affective
computing). Recent deep learning approaches focus on mining subtle rPPG clues
using convolutional neural networks with limited spatio-temporal receptive
fields, which neglect the long-range spatio-temporal perception and interaction
for rPPG modeling. In this paper, we propose the PhysFormer, an end-to-end
video transformer based architecture, to adaptively aggregate both local and
global spatio-temporal features for rPPG representation enhancement. As key
modules in PhysFormer, the temporal difference transformers first enhance the
quasi-periodic rPPG features with temporal difference guided global attention,
and then refine the local spatio-temporal representation against interference.
Furthermore, we also propose the label distribution learning and a curriculum
learning inspired dynamic constraint in frequency domain, which provide
elaborate supervisions for PhysFormer and alleviate overfitting. Comprehensive
experiments are performed on four benchmark datasets to show our superior
performance on both intra- and cross-dataset testing. One highlight is that,
unlike most transformer networks, which need pretraining on large-scale
datasets, the proposed PhysFormer can be easily trained from scratch on rPPG datasets,
which makes it promising as a novel transformer baseline for the rPPG
community. The codes will be released at
https://github.com/ZitongYu/PhysFormer.
Comment: Accepted by CVPR202
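The temporal-difference idea that guides PhysFormer's attention can be illustrated with plain frame differencing: subtracting neighbouring frames suppresses static appearance and preserves the subtle quasi-periodic changes rPPG relies on. The snippet below is a sketch of the guiding principle only, not the PhysFormer module; the normalised variant is a common rPPG preprocessing step included for context.

```python
import numpy as np

def temporal_difference(video):
    """Forward temporal difference of a video tensor of shape (T, H, W):
    static appearance cancels out, leaving only temporal changes
    (e.g. subtle pulse-induced colour variations). Output: (T-1, H, W)."""
    return video[1:] - video[:-1]

def normalized_frame_difference(video, eps=1e-6):
    """Normalised frame difference, a common rPPG preprocessing step
    that reduces dependence on absolute illumination level."""
    num = video[1:] - video[:-1]
    den = video[1:] + video[:-1] + eps
    return num / den
```

A perfectly static video yields an all-zero response under both operators, which is exactly the invariance that makes differencing useful for isolating physiological signals.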
Facial Expression Recognition Based on Deep Learning Convolution Neural Network: A Review
Facial emotion processing is one of the most important activities in affective computing, human-computer interaction, machine vision, video game testing, and consumer research. Facial expressions are a form of nonverbal communication, as they reveal a person's inner feelings and emotions. Facial Expression Recognition (FER) has recently received extensive attention, as facial expressions are considered the fastest medium for communicating any kind of information. Facial expression recognition gives a better understanding of a person's thoughts or views, and analyzing them with the currently trending deep learning methods raises the accuracy rate sharply compared to traditional state-of-the-art systems. This article provides a brief overview of the different FER fields of application and the publicly accessible databases used in FER, and surveys the latest work on FER using Convolution Neural Network (CNN) algorithms. Finally, it is observed that all reviewed approaches reached good results, especially in terms of accuracy, though with different rates and on different datasets, which impacts the results.