Differentiating Between Spontaneous and Posed Facial Expression using Inception V4
Master's thesis, Information and Communication Technology IKT590, University of Agder, 2018.
This thesis proposes a way to simplify and make solutions for spontaneous and
posed facial expression analysis more efficient. Traditional approaches have
relied on hand-crafted features and two image frames to differentiate between
spontaneous and posed facial expressions. The solution aims to be as flexible
as possible and introduces two models to differentiate between posed and
spontaneous facial expressions.
We introduce Inception V4 as an algorithm to solve this task. The results
indicate that Inception V4 may be too deep and unable to differentiate between
spontaneous and posed facial expressions accurately. A shallow CNN model is
also introduced. The shallow CNN model performs better than the Inception V4
model. Neither of the two comes close to state-of-the-art results. This may
indicate that, to differentiate between spontaneous and posed facial expressions,
the difference between the onset and apex frames of an expression is needed as
input. This thesis also suggests an alternative algorithm based on our findings.
For further work, an algorithm which is not as deep as Inception V4 is needed.
However, by using parts of the Inception V4 architecture, we may be able to
capture facial features better.
The task of differentiating between spontaneous emotion and posed emotion
has also been investigated; however, the results do not show great promise. The
task does not have any state-of-the-art results to compare our approach with. Our
models, although lacking in performance, do seem able to capture relevant facial
features from the dataset.
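The thesis abstract does not spell out the layer configuration of the shallow CNN here, so the following is only a minimal sketch of what such a posed-vs-spontaneous binary classifier could look like; the framework (PyTorch), the 128x128 RGB input size and the three-block layout are assumptions, not the thesis' actual architecture.

```python
# Minimal sketch of a shallow CNN for posed-vs-spontaneous classification.
# Assumptions (not from the thesis): PyTorch, 128x128 RGB input frames,
# three conv blocks, and a two-class output.
import torch
import torch.nn as nn

class ShallowExpressionCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 128 -> 64
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = ShallowExpressionCNN()
logits = model(torch.randn(4, 3, 128, 128))  # batch of 4 dummy frames
print(logits.shape)  # torch.Size([4, 2])
```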
Automatic analysis of facial actions: a survey
As one of the most comprehensive and objective ways to describe facial expressions, the Facial Action Coding System (FACS) has recently received significant attention. Over the past 30 years, extensive research has been conducted by psychologists and neuroscientists on various aspects of facial expression analysis using FACS. Automating FACS coding would make this research faster and more widely applicable, opening up new avenues to understanding how we communicate through facial expressions. Such an automated process can also potentially increase the reliability, precision and temporal resolution of coding. This paper provides a comprehensive survey of research into machine analysis of facial actions. We systematically review all components of such systems: pre-processing, feature extraction and machine coding of facial actions. In addition, the existing FACS-coded facial expression databases are summarised. Finally, challenges that have to be addressed to make automatic facial action analysis applicable in real-life situations are extensively discussed. There are two underlying motivations for us to write this survey paper: the first is to provide an up-to-date review of the existing literature, and the second is to offer some insights into the future of machine recognition of facial actions: what are the challenges and opportunities that researchers in the field face?
A novel database of Children's Spontaneous Facial Expressions (LIRIS-CSE)
Computing environments are moving towards human-centered designs instead of
computer-centered designs, and humans tend to communicate a wealth of information
through affective states or expressions. Traditional Human-Computer Interaction
(HCI) based systems ignore the bulk of the information communicated through those
affective states and cater only for the user's intentional input. Generally, for
evaluating and benchmarking different facial expression analysis algorithms,
standardized databases are needed to enable a meaningful comparison. In the
absence of comparative tests on such standardized databases it is difficult to
find the relative strengths and weaknesses of different facial expression
recognition algorithms. In this article we present a novel video database for
Children's Spontaneous facial Expressions (LIRIS-CSE). The proposed video database
contains six basic spontaneous facial expressions shown by 12 ethnically
diverse children between the ages of 6 and 12 years, with a mean age of 7.3 years.
To the best of our knowledge, this database is the first of its kind, as it records
and shows spontaneous facial expressions of children. Previously, there were few
databases of children's expressions, and all of them show posed or exaggerated
expressions, which differ from spontaneous or natural expressions. Thus,
this database will be a milestone for human behavior researchers. This database
will be an excellent resource for the vision community for benchmarking and
comparing results. In this article, we also propose a framework for
automatic expression recognition based on a convolutional neural network (CNN)
architecture with a transfer learning approach. The proposed architecture achieved an
average classification accuracy of 75% on our proposed database, LIRIS-CSE.
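The article summary above only states that a CNN with transfer learning reaches about 75% average accuracy; the backbone, input resolution and training details are not given here. The sketch below is therefore an assumed setup, fine-tuning a torchvision ResNet18 head for the six basic expressions, with dummy tensors standing in for LIRIS-CSE frames.

```python
# Hedged sketch of transfer learning for six basic expressions.
# Assumptions (not from the article): ResNet18 backbone, 224x224 inputs,
# only the final layer retrained; LIRIS-CSE loading is left as a stub.
import torch
import torch.nn as nn
from torchvision import models

NUM_EXPRESSIONS = 6  # the six basic spontaneous expressions

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():          # freeze the pretrained backbone
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_EXPRESSIONS)  # new trainable head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on dummy data standing in for LIRIS-CSE frames.
frames = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_EXPRESSIONS, (8,))
optimizer.zero_grad()
loss = criterion(model(frames), labels)
loss.backward()
optimizer.step()
```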
Machine learning for automatic analysis of affective behaviour
The automated analysis of affect has been gaining rapidly increasing attention by researchers over the past two decades, as it constitutes a fundamental step towards achieving next-generation computing technologies and integrating them into everyday life (e.g. via affect-aware, user-adaptive interfaces, medical imaging, health assessment, ambient intelligence etc.). The work presented in this thesis focuses on several fundamental problems manifesting in the course towards the achievement of reliable, accurate and robust affect sensing systems. In more detail, the motivation behind this work lies in recent developments in the field, namely (i) the creation of large, audiovisual databases for affect analysis in the so-called "Big Data" era, along with (ii) the need to deploy systems under demanding, real-world conditions. These developments led to the requirement for the analysis of emotion expressions continuously in time, instead of merely processing static images, thus unveiling the wide range of temporal dynamics related to human behaviour to researchers. The latter entails another deviation from the traditional line of research in the field: instead of focusing on predicting posed, discrete basic emotions (happiness, surprise etc.), it became necessary to focus on spontaneous, naturalistic expressions captured under settings more proximal to real-world conditions, utilising more expressive emotion descriptions than a set of discrete labels. To this end, the main motivation of this thesis is to deal with challenges arising from the adoption of continuous dimensional emotion descriptions under naturalistic scenarios, considered to capture a much wider spectrum of expressive variability than basic emotions, and most importantly model emotional states which are commonly expressed by humans in their everyday life. In the first part of this thesis, we attempt to demystify the quite unexplored problem of predicting continuous emotional dimensions. This work is amongst the first to explore the problem of predicting emotion dimensions via multi-modal fusion, utilising facial expressions, auditory cues and shoulder gestures. A major contribution of the work presented in this thesis lies in proposing the utilisation of various relationships exhibited by emotion dimensions in order to improve the prediction accuracy of machine learning methods - an idea which has been taken on by other researchers in the field since. In order to experimentally evaluate this, we extend methods such as the Long Short-Term Memory Neural Networks (LSTM), the Relevance Vector Machine (RVM) and Canonical Correlation Analysis (CCA) in order to exploit output relationships in learning. As it is shown, this increases the accuracy of machine learning models applied to this task.
The annotation of continuous dimensional emotions is a tedious task, highly prone to the influence of various types of noise. Performed real-time by several annotators (usually experts), the annotation process can be heavily biased by factors such as subjective interpretations of the emotional states observed, the inherent ambiguity of labels related to human behaviour, the varying reaction lags exhibited by each annotator as well as other factors such as input device noise and annotation errors. In effect, the annotations manifest a strong spatio-temporal annotator-specific bias. Failing to properly deal with annotation bias and noise leads to an inaccurate ground truth, and therefore to ill-generalisable machine learning models. This deems the proper fusion of multiple annotations, and the inference of a clean, corrected version of the "ground truth", as one of the most significant challenges in the area. A highly important contribution of this thesis lies in the introduction of Dynamic Probabilistic Canonical Correlation Analysis (DPCCA), a method aimed at fusing noisy continuous annotations. By adopting a private-shared space model, we isolate the individual characteristics that are annotator-specific and not shared, while most importantly we model the common, underlying annotation which is shared by annotators (i.e., the derived ground truth). By further learning temporal dynamics and incorporating a time-warping process, we are able to derive a clean version of the ground truth given multiple annotations, eliminating temporal discrepancies and other nuisances.
The integration of the temporal alignment process within the proposed private-shared space model deems DPCCA suitable for the problem of temporally aligning human behaviour; that is, given temporally unsynchronised sequences (e.g., videos of two persons smiling), the goal is to generate the temporally synchronised sequences (e.g., the smile apex should co-occur in the videos). Temporal alignment is an important problem for many applications where multiple datasets need to be aligned in time. Furthermore, it is particularly suitable for the analysis of facial expressions, where the activation of facial muscles (Action Units) typically follows a set of predefined temporal phases. A highly challenging scenario is when the observations are perturbed by gross, non-Gaussian noise (e.g., occlusions), as is often the case when analysing data acquired under real-world conditions. To account for non-Gaussian noise, a robust variant of Canonical Correlation Analysis (RCCA) for robust fusion and temporal alignment is proposed. The model captures the shared, low-rank subspace of the observations, isolating the gross noise in a sparse noise term. RCCA is amongst the first robust variants of CCA proposed in literature, and as we show in related experiments outperforms other, state-of-the-art methods for related tasks such as the fusion of multiple modalities under gross noise.
Beyond private-shared space models, Component Analysis (CA) is an integral component of most computer vision systems, particularly in terms of reducing the usually high-dimensional input spaces in a meaningful manner pertaining to the task-at-hand (e.g., prediction, clustering). A final, significant contribution of this thesis lies in proposing the first unifying framework for probabilistic component analysis. The proposed framework covers most well-known CA methods, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Locality Preserving Projections (LPP) and Slow Feature Analysis (SFA), providing further theoretical insights into the workings of CA. Moreover, the proposed framework is highly flexible, enabling novel CA methods to be generated by simply manipulating the connectivity of latent variables (i.e. the latent neighbourhood). As shown experimentally, methods derived via the proposed framework outperform other equivalents in several problems related to affect sensing and facial expression analysis, while providing advantages such as reduced complexity and explicit variance modelling.
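DPCCA itself is not reproduced in this summary. As a deliberately simplified stand-in for the annotation-fusion idea, the sketch below uses classical CCA from scikit-learn to project two noisy, lagged annotator traces onto a shared component; the synthetic valence signal, noise levels and lag are all illustrative assumptions.

```python
# Simplified stand-in for annotation fusion: classical CCA (not DPCCA) projects
# two noisy annotator traces onto their maximally correlated shared component.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
true_valence = np.sin(t)                                    # hypothetical latent "ground truth"
annot_a = np.column_stack([true_valence + 0.3 * rng.standard_normal(500)])
annot_b = np.column_stack([np.roll(true_valence, 15) + 0.3 * rng.standard_normal(500)])  # lagged annotator

cca = CCA(n_components=1)
shared_a, shared_b = cca.fit_transform(annot_a, annot_b)
fused = (shared_a + shared_b).ravel() / 2                   # crude shared estimate
# High |correlation| with the latent signal (sign and scale are arbitrary).
print(np.corrcoef(fused, true_valence)[0, 1])
```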
Artificial Intelligence Tools for Facial Expression Analysis.
Inner emotions show visibly upon the human face and are understood as a basic guide to an individual's inner world. It is, therefore, possible to determine a person's attitudes and the effects of others' behaviour on their deeper feelings through examining facial expressions. In real-world applications, machines that interact with people need strong facial expression recognition. This recognition is seen to hold advantages for varied applications in affective computing, advanced human-computer interaction, security, stress and depression analysis, robotic systems, and machine learning. This thesis starts by proposing a benchmark of dynamic versus static methods for facial Action Unit (AU) detection. AU activation refers to a set of local, individual facial muscle actions that occur in unison to constitute a natural facial expression event. Detecting AUs automatically can provide explicit benefits since it considers both static and dynamic facial features. For this research, AU activation detection was conducted by extracting features (static and dynamic) from both hand-crafted and deep learning representations of each static image of a video. This confirmed the superior ability of a pretrained model, which yields a leap in performance. Next, temporal modelling was investigated to detect the underlying temporal variation phases using supervised and unsupervised methods from dynamic sequences. During these processes, the importance of stacking dynamic on top of static features was discovered when encoding deep features for learning temporal information and combining the spatial and temporal schemes simultaneously. This study also found that fusing spatial and temporal features gives more long-term temporal pattern information. Moreover, we hypothesised that using an unsupervised method would enable the extraction of invariant information from dynamic textures. Recently, fresh cutting-edge developments have been created by approaches based on Generative Adversarial Networks (GANs). In the second section of this thesis, we propose a model based on the adoption of an unsupervised DCGAN for facial feature extraction and classification to achieve the following: the creation of facial expression images under different arbitrary poses (frontal, multi-view, and in the wild), and the recognition of emotion categories and AUs, in an attempt to resolve the problem of recognising the static seven classes of emotion in the wild. Thorough cross-database experimentation demonstrates that this approach can improve generalisation results. Additionally, we showed that the features learnt by the DCGAN process are poorly suited to encoding facial expressions when observed under multiple views, or when trained from a limited number of positive examples. Finally, this research focuses on disentangling identity from expression for facial expression recognition. A novel technique was implemented for emotion recognition from a single monocular image. A large-scale dataset (Face vid) was created from facial image videos which were rich in variations and distribution of facial dynamics, appearance, identities, expressions, and 3D poses. This dataset was used to train a DCNN (ResNet) to regress the expression parameters of a 3D Morphable Model jointly with a back-end classifier.
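The exact DCGAN configuration used in the thesis is not given in this summary. The sketch below only illustrates the general pattern of reusing a DCGAN-style discriminator's convolutional trunk as an unsupervised feature extractor for a downstream expression or AU classifier; the layer sizes, 64x64 input resolution and pooling choice are assumptions.

```python
# Illustrative DCGAN-style discriminator whose convolutional trunk is reused as an
# unsupervised feature extractor. Layer sizes, input resolution (64x64) and the
# downstream use of the pooled features are assumptions, not the thesis' model.
import torch
import torch.nn as nn

class DCGANDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),                          # 64 -> 32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),   # 32 -> 16
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),  # 16 -> 8
        )
        self.real_fake = nn.Sequential(nn.Conv2d(256, 1, 8), nn.Flatten())  # adversarial head

    def forward(self, x):
        return self.real_fake(self.trunk(x))

    def features(self, x):
        # Pooled trunk activations, usable as inputs to an expression/AU classifier.
        return self.trunk(x).mean(dim=(2, 3))

disc = DCGANDiscriminator()
faces = torch.randn(4, 3, 64, 64)
print(disc(faces).shape, disc.features(faces).shape)  # torch.Size([4, 1]) torch.Size([4, 256])
```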
Towards spatial and temporal analysis of facial expressions in 3D data
Facial expressions are one of the most important means for communication of emotions and meaning. They are used to clarify and give emphasis, to express intentions, and form a crucial part of any human interaction. The ability to automatically recognise and analyse expressions could therefore prove to be vital in human behaviour understanding, which has applications in a number of areas such as psychology, medicine and security.
3D and 4D (3D+time) facial expression analysis is an expanding field, providing the ability to deal with problems inherent to 2D images, such as out-of-plane motion, head pose, and lighting and illumination issues. Analysis of data of this kind requires extending successful approaches applied to the 2D problem, as well as the development of new techniques. The introduction of recent new databases containing appropriate expression data, recorded in 3D or 4D, has allowed research
into this exciting area for the first time.
This thesis develops a number of techniques, both in 2D and 3D, that build towards a complete system for analysis of 4D expressions. Suitable feature types, designed by employing binary pattern methods, are developed for analysis of 3D facial geometry data. The full dynamics of 4D expressions are modelled, through a system reliant on motion-based features, to demonstrate how the different components of the expression (neutral-onset-apex-offset) can be distinguished and harnessed. Further, the spatial structure of expressions is harnessed to improve expression component intensity estimation in 2D videos. Finally, it is discussed how this latter step could be extended to 3D facial expression analysis, and also combined with temporal analysis. Thus, it is demonstrated that both spatial and temporal information, when combined with appropriate 3D features, is critical in analysis of 4D expression data.
Optimal anticipatory control as a theory of motor preparation
Supported by a decade of primate electrophysiological experiments, the prevailing theory of neural motor control holds that movement generation is accomplished by a preparatory process that progressively steers the state of the motor cortex into a movement-specific optimal subspace prior to movement onset. The state of the cortex then evolves from these optimal subspaces, producing patterns of neural activity that serve as control inputs to the musculature. This theory, however, does not address the following questions: what characterizes the optimal subspace and what are the neural mechanisms that underlie the preparatory process? We address these questions with a circuit model of movement preparation and control. Specifically, we propose that preparation can be achieved by optimal feedback control (OFC) of the cortical state via a thalamo-cortical loop. Under OFC, the state of the cortex is selectively controlled along state-space directions that have future motor consequences, and not in other inconsequential ones. We show that OFC enables fast movement preparation and explains the observed orthogonality between preparatory and movement-related monkey motor cortex activity. This illustrates the importance of constraining new theories of neural function with experimental data. However, as recording technologies continue to improve, a key challenge is to extract meaningful insights from increasingly large-scale neural recordings. Latent variable models (LVMs) are powerful tools for addressing this challenge due to their ability to identify the low-dimensional latent variables that best explain these large data sets. One shortcoming of most LVMs, however, is that they assume a Euclidean latent space, while many kinematic variables, such as head rotations and the configuration of an arm, are naturally described by variables that live on non-Euclidean latent spaces (e.g., SO(3) and tori). To address this shortcoming, we propose the Manifold Gaussian Process Latent Variable Model, a method for simultaneously inferring nonparametric tuning curves and latent variables on non-Euclidean latent spaces. We show that our method is able to correctly infer the latent ring topology of the fly and mouse head direction circuits. This work was supported by a Trinity-Henry Barlow scholarship and a scholarship from the Ministry of Education, ROC Taiwan.
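The circuit model itself is not spelled out in this summary. As a toy illustration of optimal feedback control steering a state towards a target, the sketch below solves a small discrete-time finite-horizon LQR problem by backward Riccati recursion; the 2-D dynamics, cost weights and horizon are invented for illustration and are not the thesis' thalamo-cortical model.

```python
# Toy illustration of optimal feedback control (finite-horizon discrete-time LQR):
# backward Riccati recursion yields feedback gains that steer the state x towards
# the origin, standing in for a movement-specific "optimal subspace".
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 0.95]])  # state dynamics (illustrative)
B = np.array([[0.0], [0.1]])              # control input matrix
Q = np.eye(2)                             # state cost
R = np.array([[0.01]])                    # control cost
T = 50                                    # horizon

# Backward Riccati recursion for the time-varying feedback gains K_t.
P = Q.copy()
gains = []
for _ in range(T):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)
    gains.append(K)
gains.reverse()  # chronological order t = 0 .. T-1

# Roll out the closed loop from an arbitrary initial preparatory state.
x = np.array([[1.0], [-0.5]])
for K in gains:
    u = -K @ x           # optimal feedback control input
    x = A @ x + B @ u
print(np.linalg.norm(x))  # state driven close to the target (origin)
```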
Inferring Facial and Body Language
Machine analysis of human facial and body language is a challenging topic in computer
vision, impacting on important applications such as human-computer interaction and visual
surveillance. In this thesis, we present research building towards computational frameworks
capable of automatically understanding facial expression and behavioural body language.
The thesis work commences with a thorough examination of issues surrounding facial
representation based on Local Binary Patterns (LBP). Extensive experiments with different
machine learning techniques demonstrate that LBP features are efficient and effective for
person-independent facial expression recognition, even in low-resolution settings. We then
present and evaluate a conditional mutual information based algorithm to efficiently learn the
most discriminative LBP features, and show the best recognition performance is obtained by
using SVM classifiers with the selected LBP features. However, the recognition is performed
on static images without exploiting temporal behaviors of facial expression.
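The selected-feature and classifier details vary across the chapter summarised above; as a minimal illustration of the LBP-plus-SVM pipeline it describes, the sketch below computes uniform LBP histograms over a grid of face regions with scikit-image and feeds them to a linear SVM. The 4x4 grid, LBP parameters (P=8, R=1) and the random placeholder images are assumptions.

```python
# Minimal sketch of the LBP + SVM pipeline: uniform LBP histograms computed per
# face region are concatenated and classified with an SVM. The 4x4 grid, LBP
# parameters (P=8, R=1) and the random "images" are placeholders.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

P, R = 8, 1
N_BINS = P + 2   # number of uniform LBP codes for P=8
GRID = 4          # 4x4 grid of face regions

def lbp_histogram_features(image: np.ndarray) -> np.ndarray:
    codes = local_binary_pattern(image, P, R, method="uniform")
    h, w = image.shape
    feats = []
    for i in range(GRID):
        for j in range(GRID):
            block = codes[i * h // GRID:(i + 1) * h // GRID,
                          j * w // GRID:(j + 1) * w // GRID]
            hist, _ = np.histogram(block, bins=N_BINS, range=(0, N_BINS), density=True)
            feats.append(hist)
    return np.concatenate(feats)

# Placeholder data standing in for aligned grey-scale face crops and expression labels.
rng = np.random.default_rng(0)
images = (rng.random((20, 64, 64)) * 255).astype(np.uint8)
labels = rng.integers(0, 6, 20)   # six basic expressions
X = np.stack([lbp_histogram_features(img) for img in images])
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(X[:3]))
```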
Subsequently we present a method to capture and represent temporal dynamics of facial
expression by discovering the underlying low-dimensional manifold. Locality Preserving Projections
(LPP) is exploited to learn the expression manifold in the LBP based appearance
feature space. By deriving a universal discriminant expression subspace using a supervised
LPP, we can effectively align manifolds of different subjects on a generalised expression manifold.
Different linear subspace methods are comprehensively evaluated in expression subspace
learning. We formulate and evaluate a Bayesian framework for dynamic facial expression
recognition employing the derived manifold representation. However, the manifold representation
only addresses temporal correlations of the whole face image and does not consider
spatio-temporal correlations among different facial regions. We then employ Canonical Correlation Analysis (CCA) to capture correlations among face
parts. To overcome the inherent limitations of classical CCA for image data, we introduce
and formalise a novel Matrix-based CCA (MCCA), which can better measure correlations in
2D image data. We show this technique can provide superior performance in regression and
recognition tasks, whilst requiring significantly fewer canonical factors. All the above work
focuses on facial expressions. However, the face is usually perceived not as an isolated object
but as an integrated part of the whole body, and the visual channel combining facial and
bodily expressions is most informative.
Finally we investigate two understudied problems in body language analysis, gait-based
gender discrimination and affective body gesture recognition. To effectively combine face
and body cues, CCA is adopted to establish the relationship between the two modalities, and
derive a semantic joint feature space for the feature-level fusion. Experiments on large data
sets demonstrate that our multimodal systems achieve superior performance in gender
discrimination and affective state analysis.
Supported by a research studentship of Queen Mary, the International Travel Grant of the Royal Academy of Engineering, and the Royal Society International Joint Project.
Bayesian inference in neural circuits and synapses
Bayesian inference describes how to reason optimally under uncertainty. As the brain faces considerable uncertainty, it may be possible to understand aspects of neural computation using Bayesian inference. In this thesis, I address several questions within this broad theme. First, I show that confidence reports may, in some circumstances, be Bayes optimal, by taking a "doubly Bayesian" strategy: computing the Bayesian model evidence for several different models of participants' behaviour, one of which is itself Bayesian. Second, I address a related question concerning features of the probability distributions realised by neural activity. In particular, it has been shown that neural activity obeys Zipf's law, as do many other statistical distributions. We show the emergence of Zipf's law is in fact unsurprising, as it emerges from the existence of an underlying latent variable: firing rate. Third, I show that synaptic plasticity can be formulated as a Bayesian inference problem, and I give neural evidence in support of this proposition, based on the hypothesis that neurons sample from the resulting posterior distributions. Fourth, I consider how oscillatory excitatory-inhibitory circuits might perform inference by relating these circuits to a highly effective method for probabilistic inference: Hamiltonian Monte Carlo.
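The mapping from excitatory-inhibitory circuits to the sampler is not reproduced in this summary; the sketch below is only a minimal leapfrog Hamiltonian Monte Carlo sampler for a 2-D Gaussian, showing the momentum-driven, oscillation-like trajectories the analogy builds on. The target distribution, step size and trajectory length are chosen purely for illustration.

```python
# Minimal Hamiltonian Monte Carlo with leapfrog integration, targeting a 2-D
# Gaussian. Step size, trajectory length and the target are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
target_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
prec = np.linalg.inv(target_cov)

def neg_log_prob(x):            # potential energy U(x)
    return 0.5 * x @ prec @ x

def grad_neg_log_prob(x):
    return prec @ x

def hmc_step(x, step_size=0.1, n_leapfrog=20):
    p = rng.standard_normal(x.shape)                       # resample momentum
    x_new, p_new = x.copy(), p.copy()
    p_new -= 0.5 * step_size * grad_neg_log_prob(x_new)    # half step for momentum
    for _ in range(n_leapfrog - 1):
        x_new += step_size * p_new
        p_new -= step_size * grad_neg_log_prob(x_new)
    x_new += step_size * p_new
    p_new -= 0.5 * step_size * grad_neg_log_prob(x_new)    # final half step
    # Metropolis accept/reject on the total energy H = U + K.
    h_old = neg_log_prob(x) + 0.5 * p @ p
    h_new = neg_log_prob(x_new) + 0.5 * p_new @ p_new
    return x_new if np.log(rng.random()) < h_old - h_new else x

samples = [np.zeros(2)]
for _ in range(2000):
    samples.append(hmc_step(samples[-1]))
print(np.cov(np.array(samples[500:]).T))  # should approach target_cov
```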