
    Shallow Triple Stream Three-dimensional CNN (STSTNet) for Micro-expression Recognition

    In recent years, the state of the art in facial micro-expression recognition has been significantly advanced by deep neural networks. The robustness of deep learning has yielded promising performance beyond that of traditional handcrafted approaches. Most works in the literature emphasize increasing the depth of networks and employing highly complex objective functions to learn more features. In this paper, we design a Shallow Triple Stream Three-dimensional CNN (STSTNet) that is computationally light whilst capable of extracting discriminative high-level features and details of micro-expressions. The network learns from three optical flow features (i.e., optical strain, horizontal and vertical optical flow fields) computed from the onset and apex frames of each video. Our experimental results demonstrate the effectiveness of the proposed STSTNet, which obtained an unweighted average recall of 0.7605 and an unweighted F1-score of 0.7353 on the composite database consisting of 442 samples from the SMIC, CASME II and SAMM databases.
    Comment: 5 pages, 1 figure, accepted and published in IEEE FG 2019
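
    The pipeline starts from two frames per clip. As a minimal sketch of that preprocessing step, the snippet below derives the three input channels (horizontal flow, vertical flow, optical strain) from the onset and apex frames; it assumes OpenCV's Farneback estimator as a stand-in for whichever optical flow method the authors used, and a common definition of optical strain as the magnitude of the symmetric part of the flow gradient.

```python
import cv2
import numpy as np

def ststnet_inputs(onset_gray: np.ndarray, apex_gray: np.ndarray) -> np.ndarray:
    """Build an (H, W, 3) stack [horizontal flow, vertical flow, strain]
    from two uint8 grayscale frames (onset and apex of a micro-expression)."""
    flow = cv2.calcOpticalFlowFarneback(
        onset_gray, apex_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    u, v = flow[..., 0], flow[..., 1]
    # Optical strain: magnitude of the symmetric part of the flow gradient.
    du_dy, du_dx = np.gradient(u)   # np.gradient returns (d/dy, d/dx)
    dv_dy, dv_dx = np.gradient(v)
    exy = 0.5 * (du_dy + dv_dx)
    strain = np.sqrt(du_dx**2 + dv_dy**2 + 2.0 * exy**2)
    return np.dstack([u, v, strain]).astype(np.float32)
```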

    Video-based infant discomfort detection


    CNN based facial aesthetics analysis through dynamic robust losses and ensemble regression

    In recent years, estimating the beauty of faces has attracted growing interest in the fields of computer vision and machine learning. This is due to the emergence of face beauty datasets (such as SCUT-FBP, SCUT-FBP5500 and KDEF-PT) and the prevalence of deep learning methods in many tasks. The goal of this work is to leverage advances in deep learning architectures to provide stable and accurate face beauty estimation from static face images. To this end, our proposed approach makes three main contributions. To deal with the complicated high-level features associated with the FBP problem using more than one pre-trained Convolutional Neural Network (CNN) model, we propose an architecture with two backbones (2B-IncRex). In addition to 2B-IncRex, we introduce a parabolic dynamic law to control the behavior of the robust loss parameters during training; these robust losses are ParamSmoothL1, Huber, and Tukey. As a third contribution, we propose an ensemble regression based on five regressors, namely ResNeXt-50, Inception-v3 and three regressors based on our proposed 2B-IncRex architecture. These models are trained with the following dynamic loss functions: Dynamic ParamSmoothL1, Dynamic Tukey, Dynamic ParamSmoothL1, Dynamic Huber, and Dynamic Tukey, respectively. To evaluate the performance of our approach, we used two datasets: SCUT-FBP5500 and KDEF-PT. The SCUT-FBP5500 dataset provides two evaluation scenarios defined by the database developers: a 60-40% split and five-fold cross-validation. Our approach outperforms state-of-the-art methods on several metrics in both evaluation scenarios of SCUT-FBP5500. Moreover, experiments on the KDEF-PT dataset demonstrate the efficiency of our approach for estimating facial beauty using transfer learning, despite the presence of facial expressions and limited data. These comparisons highlight the effectiveness of the proposed solutions for FBP. They also show that the proposed dynamic robust losses lead to more flexible and accurate estimators.
    Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
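
    The abstract does not spell out the parabolic dynamic law, so the PyTorch sketch below is only an illustration of the idea: a Huber loss whose threshold parameter is annealed quadratically over training. The schedule, its endpoints and the function names are assumptions, not the paper's formula.

```python
import torch

def parabolic_delta(epoch: int, total_epochs: int,
                    delta_max: float = 1.0, delta_min: float = 0.1) -> float:
    """Assumed parabolic schedule: delta decays quadratically over training."""
    t = epoch / max(total_epochs - 1, 1)        # normalized progress in [0, 1]
    return delta_min + (delta_max - delta_min) * (1.0 - t) ** 2

def dynamic_huber(pred: torch.Tensor, target: torch.Tensor,
                  delta: float) -> torch.Tensor:
    """Huber loss whose robustness threshold changes as training progresses."""
    err = (pred - target).abs()
    quadratic = 0.5 * err ** 2                  # inlier region
    linear = delta * (err - 0.5 * delta)        # outlier region, linear growth
    return torch.where(err <= delta, quadratic, linear).mean()
```

    In a training loop one would call `dynamic_huber(pred, target, parabolic_delta(epoch, epochs))`, so that outliers are weighted differently as training progresses.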

    Deep face tracking and parsing in the wild

    Face analysis has been a long-standing research direction in the field of computer vision and pattern recognition. A complete face analysis system involves solving several tasks including face detection, face tracking, face parsing, and face recognition. Recently, the performance of methods in all of these tasks has improved significantly thanks to the employment of Deep Convolutional Neural Networks (DCNNs). However, existing face analysis algorithms mainly focus on facial images captured in constrained laboratory environments, and their performance on real-world images has remained less explored. Compared with the lab environment, in-the-wild settings involve greater diversity in face sizes, poses, facial expressions, background clutter, lighting conditions and imaging quality. This thesis investigates two fundamental tasks in face analysis under in-the-wild settings: face tracking and face parsing. Both tasks serve as important prerequisites for downstream face analysis applications. However, in-the-wild datasets remain scarce in both fields and models have not been rigorously evaluated in such settings. In this thesis, we aim to bridge the gap caused by this lack of in-the-wild data, evaluate existing methods in these settings, and develop accurate, robust and efficient deep learning-based methods for the two tasks.
    For face tracking in the wild, we introduce the first in-the-wild face tracking dataset, MobiFace, which consists of 80 videos captured by mobile phones during mobile live-streaming. The environment of the live-streaming performance is fully unconstrained and the interactions between users and mobile phones are natural and spontaneous. Next, we evaluate existing tracking methods, including generic object trackers and dedicated face trackers. The results show that MobiFace presents unique challenges for face tracking in the wild that cannot be readily solved by existing methods. Finally, we present a DCNN-based framework, FT-RCNN, that significantly outperforms other methods in face tracking in the wild.
    For face parsing in the wild, we introduce the first large-scale in-the-wild face dataset, iBugMask, which contains 21,866 training images and 1,000 testing images. Unlike existing datasets, the images in iBugMask are captured in fully unconstrained environments and are not cropped or preprocessed in any way. Manually annotated per-pixel labels for eleven facial regions are provided for each target face. Next, we benchmark existing parsing methods, and the results show that iBugMask is extremely challenging for all of them. Through rigorous benchmarking, we observe that preprocessing facial images with bounding boxes introduces bias in face parsing in the wild. When cropping the face with a bounding box, a cropping margin has to be hand-picked. If face alignment is used, fiducial landmarks are required and a predefined alignment template has to be selected. These additional hyper-parameters have to be carefully considered and can have a significant impact on face parsing performance. To solve this, we propose the Region-of-Interest (RoI) Tanh-polar transform, which warps the whole image to a fixed-size representation. Moreover, the RoI Tanh-polar transform is differentiable and allows for rotation equivariance in DCNNs. We show that, when coupled with a simple Fully Convolutional Network, our RoI Tanh-polar Transformer Network achieves state-of-the-art results on face parsing in the wild.
    This thesis contributes towards in-the-wild face tracking and face parsing by providing novel datasets and proposing effective frameworks. Both tasks can benefit real-world downstream applications such as facial age estimation, facial expression recognition and lip-reading. The proposed RoI Tanh-polar transform also provides a new perspective on how to preprocess face images and make DCNNs truly end-to-end for real-world face analysis applications.
    Open Access
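
    To make the RoI Tanh-polar idea concrete, here is a minimal NumPy/OpenCV sketch of a tanh-polar-style warp: the output's radial axis is tanh-compressed around a circular face RoI, so the face fills most of the fixed-size output while the rest of the image is squeezed toward the border. This is an illustration of the concept under assumed conventions (circular RoI, inverse mapping rho = tanh(r / radius)), not the thesis' exact transform.

```python
import cv2
import numpy as np

def roi_tanh_polar_warp(image: np.ndarray, roi, out_h: int = 256,
                        out_w: int = 256) -> np.ndarray:
    """roi = (cx, cy, radius); output rows index normalized radius,
    columns index angle, as in a polar unrolling of the image."""
    cx, cy, radius = roi
    rho = (np.arange(out_h) + 0.5) / out_h                 # target radius in (0, 1)
    theta = (np.arange(out_w) + 0.5) / out_w * 2 * np.pi   # target angle
    # Invert rho = tanh(r / radius): source radius grows like artanh,
    # so the whole image (not just the RoI) is covered by the output grid.
    r_src = radius * np.arctanh(np.clip(rho, 0.0, 1.0 - 1e-6))
    map_x = (cx + np.outer(r_src, np.cos(theta))).astype(np.float32)
    map_y = (cy + np.outer(r_src, np.sin(theta))).astype(np.float32)
    return cv2.remap(image, map_x, map_y, cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_CONSTANT)
```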

    Towards a Robust Thermal-Visible Heterogeneous Face Recognition Approach Based on a Cycle Generative Adversarial Network

    Security is a sensitive area that concerns authorities around the world due to the emerging terrorism phenomenon. Contactless biometric technologies such as face recognition have attracted growing interest for their capacity to identify probe subjects without any human interaction. Since traditional face recognition systems use visible-spectrum sensors, their performance decreases rapidly when visible imaging phenomena occur, mainly illumination changes. Unlike the visible spectrum, infrared spectra are invariant to light changes, which makes them an alternative solution for face recognition. However, in infrared imaging, textural information is lost. In this paper, we aim to benefit from both the visible and thermal spectra by proposing a new heterogeneous face recognition approach. This approach includes four scientific contributions. The first is the annotation of a thermal face database, which has been shared with the scientific community via GitHub. The second is a multi-sensor face detector model based on the recent YOLO v3 architecture, able to simultaneously detect faces captured in visible and thermal images. The third contribution takes up the challenge of reducing the modality gap between the visible and thermal spectra by applying a new CycleGAN structure, called TV-CycleGAN, which synthesizes visible-like face images from thermal face images. This new thermal-visible synthesis method covers extreme poses and facial expressions in color space. To show the efficacy and robustness of the proposed TV-CycleGAN, experiments have been conducted on three challenging benchmark databases covering different real-world scenarios: TUFTS (and its aligned version), NVIE and PUJ. The qualitative evaluation shows that our method generates more realistic faces. The quantitative evaluation demonstrates that the proposed TV-CycleGAN yields the best improvement in face recognition rates. Whereas direct matching from thermal to visible images achieves a recognition rate of 47.06% on the TUFTS database, the proposed TV-CycleGAN reaches an accuracy of 57.56% on the same database. It also contributes rate enhancements of 29.16% and 15.71% on the NVIE and PUJ databases, respectively, and an accuracy enhancement of 18.5% on the aligned TUFTS database. It also outperforms some recent state-of-the-art methods in terms of F1-score, AUC/EER and other evaluation metrics. Furthermore, the visible face images synthesized with the TV-CycleGAN method are very promising for thermal facial landmark detection, which constitutes the fourth contribution of this paper.
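
    The core mechanism that lets a CycleGAN such as TV-CycleGAN train without paired thermal-visible images is the cycle-consistency objective. The PyTorch sketch below shows that objective in isolation; the generator modules are placeholders passed in by the caller, and the weight `lam` is the conventional CycleGAN default rather than a value reported by the paper.

```python
import torch
import torch.nn as nn

def cycle_consistency_loss(G_t2v: nn.Module, G_v2t: nn.Module,
                           thermal: torch.Tensor, visible: torch.Tensor,
                           lam: float = 10.0) -> torch.Tensor:
    """L1 round-trip loss: thermal -> visible -> thermal must reconstruct
    the input, and likewise for visible -> thermal -> visible."""
    l1 = nn.L1Loss()
    loss_t = l1(G_v2t(G_t2v(thermal)), thermal)   # thermal cycle
    loss_v = l1(G_t2v(G_v2t(visible)), visible)   # visible cycle
    return lam * (loss_t + loss_v)
```

    In a full CycleGAN, this term is added to the adversarial losses of the two domain discriminators.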

    Evaluation of Data Augmentation Techniques for Facial Expression Recognition Systems

    Most Facial Expression Recognition (FER) systems rely on machine learning approaches that require large databases for effective training. As these are not easily available, a good solution is to augment the databases with appropriate data augmentation (DA) techniques, which are typically based on either geometric transformations or oversampling augmentations (e.g., generative adversarial networks (GANs)). However, it is not always easy to tell which DA technique is more convenient for FER systems, because most state-of-the-art experiments use different settings, which makes the impact of DA techniques not directly comparable. To advance in this respect, in this paper we evaluate and compare the impact of well-established DA techniques on the emotion recognition accuracy of a FER system based on the well-known VGG16 convolutional neural network (CNN). In particular, we consider both geometric transformations and a GAN to increase the number of training images. We performed cross-database evaluations: training with the "augmented" KDEF database and testing with two different databases (CK+ and ExpW). The best results were obtained by combining horizontal reflection, translation and GAN-based augmentation, bringing an accuracy increase of approximately 30%. This outperforms alternative approaches, except for one technique that, however, relies on a considerably larger database.
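
    For the geometric branch of the comparison, a torchvision pipeline along the following lines would implement the two best-performing transformations (horizontal reflection and translation); the translation magnitude and flip probability below are illustrative choices, not the paper's settings, and the GAN-based oversampling branch is omitted.

```python
from torchvision import transforms

# Geometric DA for FER training images; 224x224 matches VGG16's input size.
train_augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                    # horizontal reflection
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # up to +/-10% shift
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```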

    PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer

    Remote photoplethysmography (rPPG), which aims to measure heart activity and physiological signals from facial video without any contact, has great potential in many applications (e.g., remote healthcare and affective computing). Recent deep learning approaches focus on mining subtle rPPG clues using convolutional neural networks with limited spatio-temporal receptive fields, which neglect the long-range spatio-temporal perception and interaction needed for rPPG modeling. In this paper, we propose PhysFormer, an end-to-end video-transformer-based architecture that adaptively aggregates both local and global spatio-temporal features for rPPG representation enhancement. As key modules in PhysFormer, the temporal difference transformers first enhance the quasi-periodic rPPG features with temporal-difference-guided global attention, and then refine the local spatio-temporal representation against interference. Furthermore, we propose label distribution learning and a curriculum-learning-inspired dynamic constraint in the frequency domain, which provide elaborate supervision for PhysFormer and alleviate overfitting. Comprehensive experiments are performed on four benchmark datasets to show our superior performance in both intra- and cross-dataset testing. One highlight is that, unlike most transformer networks that need pretraining on large-scale datasets, the proposed PhysFormer can easily be trained from scratch on rPPG datasets, which makes it promising as a novel transformer baseline for the rPPG community. The code will be released at https://github.com/ZitongYu/PhysFormer.
    Comment: Accepted by CVPR 2022
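
    To illustrate the temporal-difference idea at the heart of PhysFormer's attention, the sketch below computes frame-to-frame feature differences along the time axis of a video feature tensor, which is where the subtle quasi-periodic rPPG signal lives. It is a simplified stand-in, not the paper's exact temporal difference convolution.

```python
import torch

def temporal_difference(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, channels, time, height, width) video features.
    Returns forward temporal differences, zero-padded at the last frame
    so the time dimension is preserved."""
    diff = x[:, :, 1:] - x[:, :, :-1]      # change between consecutive frames
    pad = torch.zeros_like(x[:, :, :1])    # keep the original time length
    return torch.cat([diff, pad], dim=2)
```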

    Facial Expression Recognition Based on Deep Learning Convolution Neural Network: A Review

    Facial emotion processing is one of the most important activities in affective computing, human-computer interaction, machine vision, video game testing, and consumer research. Facial expressions are a form of nonverbal communication, as they reveal a person's inner feelings and emotions. Facial Expression Recognition (FER) has recently received extensive attention, as facial expressions are considered the fastest medium for communicating any kind of information. Facial expression recognition gives a better understanding of a person's thoughts or views, and the currently trending deep learning methods analyze them with accuracy rates that have risen sharply compared to traditional state-of-the-art systems. This article provides a brief overview of the different FER fields of application and the publicly accessible databases used in FER, and surveys the latest reviews of FER using Convolutional Neural Network (CNN) algorithms. Finally, it is observed that all the reviewed approaches reached good results, especially in terms of accuracy, though with different rates and on different datasets, which affects the comparability of the results.