A Personalized Affective Memory Neural Model for Improving Emotion Recognition
Recent models of emotion recognition strongly rely on supervised deep
learning solutions for the distinction of general emotion expressions. However,
they are not reliable for online, personalized facial expression
recognition, e.g., person-specific affective understanding. In this paper,
we present a neural model based on a conditional adversarial autoencoder to
learn how to represent and edit general emotion expressions. We then propose
Grow-When-Required networks as personalized affective memories to learn
individualized aspects of emotion expressions. Our model achieves
state-of-the-art performance on emotion recognition when evaluated on
in-the-wild datasets. Furthermore, our experiments include ablation
studies and neural visualizations in order to explain the behavior of our
model.
Comment: Accepted by the International Conference on Machine Learning 2019 (ICML 2019).
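A minimal sketch of the conditioning idea, assuming a toy fully-connected autoencoder rather than the authors' ICML architecture (which adds an adversarial discriminator and the GWR memories):

```python
# Toy sketch (not the authors' architecture): the decoder is conditioned on
# an emotion label, so the latent code can be reused to "edit" expressions.
import torch
import torch.nn as nn

class CondAE(nn.Module):
    def __init__(self, img_dim=64 * 64, z_dim=128, n_emotions=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(img_dim, 512), nn.ReLU(), nn.Linear(512, z_dim))
        # Decoder input: latent code concatenated with a one-hot emotion label.
        self.decoder = nn.Sequential(
            nn.Linear(z_dim + n_emotions, 512), nn.ReLU(),
            nn.Linear(512, img_dim), nn.Sigmoid())

    def forward(self, x, emotion_onehot):
        z = self.encoder(x.flatten(1))
        return self.decoder(torch.cat([z, emotion_onehot], dim=1))

x = torch.rand(4, 64 * 64)                    # toy batch of face images
y = torch.eye(7)[torch.tensor([0, 1, 2, 3])]  # one-hot emotion labels
recon = CondAE()(x, y)                        # same face, chosen expression
```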
A Deeper Look at Facial Expression Dataset Bias
Datasets play an important role in the progress of facial expression
recognition algorithms, but they may suffer from obvious biases caused by
different cultures and collection conditions. To look deeper into this bias, we
first conduct comprehensive experiments on dataset recognition and cross-dataset
generalization tasks, and for the first time explore the intrinsic causes of
the dataset discrepancy. The results quantitatively verify that current
datasets have a strong built-in bias, and the corresponding analyses indicate that
the conditional probability distributions between source and target datasets
are different. However, previous studies are mainly based on shallow
features with limited discriminative ability under the assumption that the
conditional distribution remains unchanged across domains. To address these
issues, we further propose a novel deep Emotion-Conditional Adaption Network
(ECAN) to learn domain-invariant and discriminative feature representations,
which can match both the marginal and the conditional distributions across
domains simultaneously. In addition, the largely ignored expression class
distribution bias is also addressed by a learnable re-weighting parameter, so
that the training and testing domains can share a similar class distribution.
Extensive cross-database experiments on both lab-controlled datasets (CK+,
JAFFE, MMI and Oulu-CASIA) and real-world databases (AffectNet, FER2013, RAF-DB
2.0 and SFEW 2.0) demonstrate that our ECAN can yield competitive performances
across various facial expression transfer tasks and outperform the
state-of-the-art methods.
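The distribution-matching idea can be sketched as follows; this is a hedged illustration with a linear-kernel MMD and pseudo-labels for the target domain, not the published ECAN implementation:

```python
# Align marginal feature distributions with an MMD term and conditional ones
# with per-class MMD terms; a learnable weight re-balances class distributions.
import torch

def mmd(a, b):
    """Linear-kernel MMD between two feature batches."""
    return (a.mean(0) - b.mean(0)).pow(2).sum()

def ecan_style_loss(src_feat, src_lbl, tgt_feat, tgt_pseudo_lbl, class_w):
    loss = mmd(src_feat, tgt_feat)                 # marginal alignment
    for c in src_lbl.unique():
        s, t = src_feat[src_lbl == c], tgt_feat[tgt_pseudo_lbl == c]
        if len(s) and len(t):                      # conditional alignment,
            loss = loss + class_w[c] * mmd(s, t)   # re-weighted per class
    return loss

src, tgt = torch.randn(32, 256), torch.randn(32, 256)
sl, tl = torch.randint(0, 7, (32,)), torch.randint(0, 7, (32,))
w = torch.ones(7, requires_grad=True)              # learnable re-weighting
print(ecan_style_loss(src, sl, tgt, tl, w))
```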
Pain Intensity Estimation by a Self-Taught Selection of Histograms of Topographical Features
Pain assessment through observational pain scales is necessary for special
categories of patients such as neonates, patients with dementia, critically ill
patients, etc. The recently introduced Prkachin-Solomon score allows pain
assessment directly from facial images opening the path for multiple assistive
applications. In this paper, we introduce the Histograms of Topographical (HoT)
features, which are a generalization of the topographical primal sketch, for
the description of the face parts contributing to the mentioned score. We
propose a semi-supervised, clustering-oriented self-taught learning procedure
developed on the emotion oriented Cohn-Kanade database. We use this procedure
to improve the discrimination between different pain intensity levels and the
generalization with respect to the monitored persons, while testing on the
UNBC-McMaster Shoulder Pain database.
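A rough sketch of the topographical labeling behind such features, assuming a simple per-pixel Hessian-eigenvalue classification (the paper's exact HoT descriptor differs):

```python
# Label each pixel by the signs of its Hessian eigenvalues (peak, pit,
# saddle, ridge, valley, flat) and histogram the labels over a face region.
import numpy as np

def topographic_histogram(patch, eps=1e-3):
    gy, gx = np.gradient(patch.astype(float))
    gyy, gyx = np.gradient(gy)
    gxy, gxx = np.gradient(gx)
    # Eigenvalues of the 2x2 Hessian [[gxx, gxy], [gyx, gyy]] per pixel.
    tr, det = gxx + gyy, gxx * gyy - gxy * gyx
    disc = np.sqrt(np.maximum(tr**2 / 4 - det, 0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc          # l1 >= l2 everywhere
    labels = np.select(
        [(l1 < -eps) & (l2 < -eps),                # peak
         (l1 > eps) & (l2 > eps),                  # pit
         (l1 > eps) & (l2 < -eps),                 # saddle
         (np.abs(l1) <= eps) & (l2 < -eps),        # ridge
         (l1 > eps) & (np.abs(l2) <= eps)],        # valley
        [0, 1, 2, 3, 4], default=5)                # 5 = flat
    return np.bincount(labels.ravel(), minlength=6)

print(topographic_histogram(np.random.rand(32, 32)))
```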
Cross-modal Supervision for Learning Active Speaker Detection in Video
In this paper, we show how to use audio to supervise the learning of active
speaker detection in video. Voice Activity Detection (VAD) guides the learning
of the vision-based classifier in a weakly supervised manner. The classifier
uses spatio-temporal features to encode upper body motion: facial expressions
and gesticulations associated with speaking. We further improve a generic model
for active speaker detection by learning person-specific models. Finally, we
demonstrate the online adaptation of generic models learnt on one dataset, to
previously unseen people in a new dataset, again using audio (VAD) for weak
supervision. The use of temporal continuity overcomes the lack of clean
training data. We are the first to present an active speaker detection system
that learns on one audio-visual dataset and automatically adapts to speakers in
a new dataset. This work can be seen as an example of how the availability of
multi-modal data allows us to learn a model without the need for supervision,
by transferring knowledge from one modality to another.
Comment: 16 pages.
FReeNet: Multi-Identity Face Reenactment
This paper presents a novel multi-identity face reenactment framework, named
FReeNet, to transfer facial expressions from an arbitrary source face to a
target face with a shared model. The proposed FReeNet consists of two parts:
Unified Landmark Converter (ULC) and Geometry-aware Generator (GAG). The ULC
adopts an encoder-decoder architecture to efficiently convert expressions in a
latent landmark space, which significantly narrows the gap of the face contour
between source and target identities. The GAG leverages the converted landmark
to reenact the photorealistic image with a reference image of the target
person. Moreover, a new triplet perceptual loss is proposed to force the GAG
module to learn appearance and geometry information simultaneously, which also
enriches facial details of the reenacted images. Further experiments
demonstrate the superiority of our approach for generating photorealistic and
expression-alike faces, as well as the flexibility for transferring facial
expressions between identities.
Comment: Added more experiments; revised the paper carefully.
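The triplet perceptual loss can be illustrated as below; the feature layers and margin are assumptions, not the paper's exact configuration:

```python
# Pull the generated image toward the target reference in VGG feature space
# while pushing it away from the source image.
import torch
import torch.nn as nn
from torchvision.models import vgg16

vgg_feat = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_feat.parameters():
    p.requires_grad_(False)                        # frozen perceptual backbone

triplet = nn.TripletMarginLoss(margin=1.0)

def triplet_perceptual_loss(generated, target_ref, source_img):
    f = lambda x: vgg_feat(x).flatten(1)
    # anchor = generated, positive = target reference, negative = source image
    return triplet(f(generated), f(target_ref), f(source_img))

g, t, s = (torch.rand(2, 3, 224, 224) for _ in range(3))
print(triplet_perceptual_loss(g, t, s))
```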
Unsupervised Eyeglasses Removal in the Wild
Eyeglasses removal is challenging: the model must handle different kinds of
eyeglasses, e.g., rimless glasses, full-rim glasses, and sunglasses, and recover
plausible eyes. Due to the large visual variation, conventional methods
lack scalability. Most existing works focus on the frontal face images in the
controlled environment, such as the laboratory, and need to design specific
systems for different eyeglass types. To address the limitation, we propose a
unified eyeglass removal model called Eyeglasses Removal Generative Adversarial
Network (ERGAN), which could handle different types of glasses in the wild. The
proposed method does not depend on the dense annotation of eyeglasses location
but benefits from large-scale face images with weak annotations.
Specifically, we study the two relevant tasks simultaneously, i.e., removing
and wearing eyeglasses. Given two facial images with and without eyeglasses,
the proposed model learns to swap the eye area in two faces. The generation
mechanism focuses on the eye area and avoids the difficulty of generating a
new face. In the experiment, we show the proposed method achieves a competitive
removal quality in terms of realism and diversity. Furthermore, we evaluate
ERGAN on several subsequent tasks, such as face verification and facial
expression recognition. The experiment shows that our method could serve as a
pre-processing method for these tasks.
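A toy sketch of the code-swapping mechanism (ERGAN itself is adversarially trained with convolutional encoders; the linear layers here are placeholders):

```python
# Each face is split into an eye-area code and a rest-of-face code; the eye
# codes of the two faces are exchanged before decoding.
import torch
import torch.nn as nn

enc_face = nn.Linear(64 * 64, 128)   # encodes everything but the eye area
enc_eyes = nn.Linear(64 * 64, 32)    # encodes the eye area
dec = nn.Linear(128 + 32, 64 * 64)   # reconstructs a full face image

def swap_eyes(face_a, face_b):
    fa, fb = enc_face(face_a), enc_face(face_b)
    ea, eb = enc_eyes(face_a), enc_eyes(face_b)
    # b's eye code on a's face (removes/adds glasses), and vice versa
    return dec(torch.cat([fa, eb], 1)), dec(torch.cat([fb, ea], 1))

with_glasses = torch.rand(1, 64 * 64)
without_glasses = torch.rand(1, 64 * 64)
a_no_glasses, b_with_glasses = swap_eyes(with_glasses, without_glasses)
```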
Probabilistic Attribute Tree in Convolutional Neural Networks for Facial Expression Recognition
In this paper, we proposed a novel Probabilistic Attribute Tree-CNN (PAT-CNN)
to explicitly deal with the large intra-class variations caused by
identity-related attributes, e.g., age, race, and gender. Specifically, a novel
PAT module with an associated PAT loss was proposed to learn features in a
hierarchical tree structure organized according to attributes, where the final
features are less affected by the attributes. Then, expression-related features
are extracted from leaf nodes. Samples are probabilistically assigned to tree
nodes at different levels such that expression-related features can be learned
from all samples weighted by probabilities. We further proposed a
semi-supervised strategy to learn the PAT-CNN from limited attribute-annotated
samples to make the best use of available data. Experimental results on five
facial expression datasets have demonstrated that the proposed PAT-CNN
outperforms the baseline models by explicitly modeling attributes. More
impressively, the PAT-CNN using a single model achieves the best performance
for faces in the wild on the SFEW dataset, compared with the state-of-the-art
methods using an ensemble of hundreds of CNNs.
Comment: 10 pages.
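The probability-weighted learning idea can be sketched as follows, with an assumed gating head and linear branches in place of the paper's tree-structured CNN:

```python
# A gating head softly assigns each sample to attribute branches; each
# branch's expression loss is weighted by that assignment probability, so
# all samples contribute to all branches.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_branches, n_expr = 3, 7            # e.g., three age groups, seven expressions
backbone = nn.Linear(512, 256)
gate = nn.Linear(256, n_branches)    # soft assignment to attribute nodes
branches = nn.ModuleList(nn.Linear(256, n_expr) for _ in range(n_branches))

def pat_style_loss(x, expr_labels):
    h = torch.relu(backbone(x))
    p = F.softmax(gate(h), dim=1)                    # node probabilities
    loss = 0.0
    for b, head in enumerate(branches):
        per_sample = F.cross_entropy(head(h), expr_labels, reduction="none")
        loss = loss + (p[:, b] * per_sample).mean()  # probability-weighted
    return loss

print(pat_style_loss(torch.randn(8, 512), torch.randint(0, n_expr, (8,))))
```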
Gaussian Process Domain Experts for Model Adaptation in Facial Behavior Analysis
We present a novel approach for supervised domain adaptation that is based
upon the probabilistic framework of Gaussian processes (GPs). Specifically, we
introduce domain-specific GPs as local experts for facial expression
classification from face images. The adaptation of the classifier is
facilitated in a probabilistic fashion by conditioning the target expert on
multiple source experts. Furthermore, in contrast to existing adaptation
approaches, we also learn a target expert solely from the available target data.
Then, a single and confident classifier is obtained by combining the
predictions from multiple experts based on their confidence. Learning of the
model is efficient and requires no retraining/reweighting of the source
classifiers. We evaluate the proposed approach on two publicly available
datasets for multi-class (MultiPIE) and multi-label (DISFA) facial expression
classification. To this end, we perform adaptation of two contextual factors:
'where' (view) and 'who' (subject). We show in our experiments that the
proposed approach consistently outperforms both source and target classifiers,
while using as few as 30 target examples. It also outperforms the
state-of-the-art approaches for supervised domain adaptation.
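The confidence-based combination can be illustrated with a standard product-of-Gaussians rule; the paper's conditioning of the target expert on source experts is richer than this sketch:

```python
# Each GP expert returns a predictive mean and variance; experts with lower
# variance (higher confidence) dominate the combined prediction.
import numpy as np

def combine_experts(means, variances):
    means, variances = np.asarray(means), np.asarray(variances)
    precision = 1.0 / variances
    var = 1.0 / precision.sum()
    return var * (precision * means).sum(), var

# Two source experts and one target expert predicting an expression score:
mu, var = combine_experts(means=[0.8, 0.6, 0.9], variances=[0.10, 0.50, 0.05])
print(mu, var)   # pulled toward 0.9, the most confident expert's prediction
```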
Person Identification with Visual Summary for a Safe Access to a Smart Home
SafeAccess is an integrated system designed to provide easier and safer
access to a smart home for people with or without disabilities. The system is
designed to enhance safety and promote the independence of people with
disabilities (i.e., the visually impaired). The key functionality of the system
includes detecting and identifying humans and generating contextual
visual summaries from the real-time video streams obtained from cameras
placed in strategic locations around the house. In addition, the system
classifies humans into groups (e.g., friends/families/caregivers versus
intruders/burglars/unknowns). These features allow the user to grant or deny
remote access to the premises or to call emergency services.
focus on designing a prototype system for the smart home and building a robust
recognition engine that meets the system criteria and addresses speed,
accuracy, deployment and environmental challenges under a wide variety of
practical and real-life situations. To interact with the system, we implemented
a dialog-enabled interface to create a personalized profile using face images
or videos of friends/families/caregivers. To improve computational efficiency, we
apply change detection to filter out frames and use Faster-RCNN to detect the
human presence and extract faces using Multitask Cascaded Convolutional
Networks (MTCNN). Subsequently, we apply LBP/FaceNet to identify a person and
groups by matching extracted faces with the profile. SafeAccess sends a visual
summary to the users via MMS containing the person's name if a match is found
(or "Unknown" otherwise), a scene image, a facial description, and contextual
information.
SafeAccess identifies friends/families/caregivers versus intruders/unknowns with
an average F-score of 0.97 and generates a visual summary from 10 classes with
an average accuracy of 98.01%.
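One way to prototype the face-matching stage with the off-the-shelf facenet-pytorch package; the threshold and enrollment flow are illustrative, and the deployed SafeAccess engine additionally includes change detection, Faster-RCNN person detection, and MMS delivery:

```python
# Detect a face, embed it, and match against enrolled profile embeddings.
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1
from PIL import Image

mtcnn = MTCNN()                                           # face detector
embedder = InceptionResnetV1(pretrained="vggface2").eval()

def identify(img_path, profile, threshold=0.7):
    """Return the best-matching enrolled name, or 'Unknown'."""
    face = mtcnn(Image.open(img_path))                    # cropped face tensor
    if face is None:
        return "no face detected"
    emb = embedder(face.unsqueeze(0)).detach()
    sims = {name: torch.cosine_similarity(emb, ref).item()
            for name, ref in profile.items()}
    name, best = max(sims.items(), key=lambda kv: kv[1])
    return name if best >= threshold else "Unknown"

# profile = {"alice": embedder(mtcnn(Image.open("alice.jpg")).unsqueeze(0))}
# print(identify("frame_0421.jpg", profile))
```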
Inverting face embeddings with convolutional neural networks
Deep neural networks have dramatically advanced the state of the art for many
areas of machine learning. Recently they have been shown to have a remarkable
ability to generate highly complex visual artifacts such as images and text
rather than simply recognize them.
In this work we use neural networks to effectively invert low-dimensional
face embeddings while producing realistic-looking, consistent images. Our
contribution is twofold: first, we show that a gradient-ascent-style approach
can be used to reproduce consistent images with the help of a guiding image.
Second, we demonstrate that we can train a separate neural network to
effectively solve the minimization problem in one pass, and generate images in
real-time. We then evaluate the loss imposed by using a neural network instead
of gradient descent by comparing the final values of the minimized loss
function.
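A minimal sketch of the gradient-based inversion step, with a stand-in embedding network and an assumed guide-image regularizer weight:

```python
# Optimize the pixels so the image's embedding matches the target embedding,
# while a guiding image keeps the result on the face manifold.
import torch

embed_net = torch.nn.Sequential(torch.nn.Flatten(),        # stand-in embedder
                                torch.nn.Linear(3 * 64 * 64, 128))
target_emb = torch.randn(1, 128)                            # embedding to invert
guide = torch.rand(1, 3, 64, 64)                            # guiding image
img = guide.clone().requires_grad_(True)                    # start from guide
opt = torch.optim.Adam([img], lr=0.05)

for _ in range(200):
    loss = ((embed_net(img) - target_emb) ** 2).mean() \
         + 0.1 * ((img - guide) ** 2).mean()                # stay near guide
    opt.zero_grad(); loss.backward(); opt.step()
    img.data.clamp_(0, 1)                                   # keep valid pixels
```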