585 research outputs found

    Verification of emotion recognition from facial expression

    Analysis of facial expressions is an active topic of research with many potential applications, since the human face plays a significant role in conveying a person’s mental state. Because of its practical value, researchers from fields such as psychology, finance, marketing, and engineering have developed significant interest in this area. Hence, there is more need than ever for an intelligent tool that can be employed in emotional Human-Computer Interfaces (HCI), analyzing facial expressions as a better alternative to traditional devices such as the keyboard and mouse. The face is a window into the human mind. The examination of mental states explores a person’s internal cognitive states. A facial emotion recognition system has the potential to read people’s minds and interpret their emotional thoughts to the world. Existing efforts have achieved high recognition accuracy for facial emotions on benchmark databases containing posed facial emotions. However, such systems are not qualified to interpret a person’s true feelings, even when the posed emotions are recognized. The difference between posed and spontaneous facial emotions has been identified and studied in the literature. One of the most interesting challenges in the field of HCI is to make computers more human-like for more intelligent user interfaces. In this dissertation, a Regional Hidden Markov Model (RHMM) based facial emotion recognition system is proposed. In this system, the facial features are extracted from three face regions: the eyebrows, eyes and mouth. These regions convey relevant information regarding facial emotions. As a marked departure from prior work, RHMMs are trained for the states of these three distinct face regions, rather than for the entire face, for each facial emotion type. In the recognition step, regional features are extracted from test video sequences. These features are processed according to the corresponding RHMMs to learn the probabilities for the states of the three face regions. The combination of states is used to identify the estimated emotion type of a given frame in a video sequence. An experimental framework is established to validate the results of such a system. As a new classifier, the RHMM emphasizes the states of three facial regions rather than the entire face. The dissertation proposes a method of forming observation sequences that represent the changes of states of facial regions, for both training RHMMs and recognition. The proposed method is applicable to various forms of video clips, including real-time video. The proposed system shows a human-like capability to infer people’s mental states from the moderate-level spontaneous facial emotions conveyed in daily life, in contrast to posed facial emotions. Moreover, the research associated with the proposed facial emotion recognition system is extended into the domains of finance and biomedical engineering. A CEO’s facial expression of fear was found to be a strong, positive predictor of the firm’s stock price in the market. In addition, the experimental results demonstrated the similarity between spontaneous facial reactions to stimuli and the inner affective states reflected in brain activity. The results revealed the effectiveness of combining facial features with features extracted from brain activity signals for multi-signal correlation analysis and affective state classification.

    Computationally efficient deformable 3D object tracking with a monocular RGB camera

    182 p. Monocular RGB cameras are present in most settings and devices, including embedded environments like robots, cars and home automation. Most of these environments have in common a significant presence of human operators with whom the system has to interact. This context provides the motivation to use the captured monocular images to improve the understanding of the operator and the surrounding scene for more accurate results and applications. However, monocular images do not have depth information, which is a crucial element in understanding the 3D scene correctly. Estimating the three-dimensional information of an object in the scene using a single two-dimensional image is already a challenge. The challenge grows if the object is deformable (e.g., a human body or a human face) and there is a need to track its movements and interactions in the scene. Several methods attempt to solve this task, including modern regression methods based on deep neural networks. However, despite their strong results, most are computationally demanding and therefore unsuitable for several environments. Computational efficiency is a critical feature for computationally constrained setups like the embedded or onboard systems present in robotics and automotive applications, among others. This study proposes computationally efficient methodologies to reconstruct and track three-dimensional deformable objects, such as human faces and human bodies, using a single monocular RGB camera. To model the deformability of faces and bodies, it considers two types of deformations: non-rigid deformations for face tracking, and rigid multi-body deformations for body pose tracking. Furthermore, it studies their performance on computationally restricted devices like smartphones and onboard systems used in the automotive industry. The information extracted from such devices gives valuable insight into human behaviour, a crucial element in improving human-machine interaction. We tested the proposed approaches in different challenging application fields like onboard driver monitoring systems, human behaviour analysis from monocular videos, and human face tracking on embedded devices.

    FAMOS: a framework for investigating the use of face features to identify spontaneous emotions

    © 2017, Springer-Verlag London Ltd., part of Springer Nature. Emotion-based analysis has raised a lot of interest, particularly in areas such as forensics, medicine, music, psychology, and human-machine interfaces. Following this trend, facial analysis (either automatic or human-based) is the most commonly investigated subject, since this type of data can easily be collected and is well accepted in the literature as a metric for inferring emotional states. Despite this popularity, several constraints found in real-world scenarios (e.g. lighting, complex backgrounds, facial hair and so on) make it very challenging to obtain affective information from the face automatically and accurately. This work presents a framework which aims to analyse emotional experiences through spontaneous facial expressions. The method consists of a new four-dimensional model, called FAMOS, to describe emotional experiences in terms of appraisal, facial expressions, mood, and subjective experiences, using a semi-automatic facial expression analyser as ground truth for describing the facial actions. In addition, we present an experiment using a new protocol proposed to obtain spontaneous emotional reactions. The results suggest that the initial emotional state described by the participants was different from that described after exposure to the eliciting stimulus, showing that the stimuli used were capable of inducing the expected emotional states in most individuals. Moreover, our results indicate that spontaneous facial reactions to emotions are very different from those in prototypic expressions, especially in terms of expressiveness.

    Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos

    The recent state of the art on monocular 3D face reconstruction from image data has made some impressive advancements, thanks to the advent of Deep Learning. However, it has mostly focused on input coming from a single RGB image, overlooking the following important factors: a) Nowadays, the vast majority of facial image data of interest do not originate from single images but rather from videos, which contain rich dynamic information. b) Furthermore, these videos typically capture individuals in some form of verbal communication (public talks, teleconferences, audiovisual human-computer interactions, interviews, monologues/dialogues in movies, etc). When existing 3D face reconstruction methods are applied in such videos, the artifacts in the reconstruction of the shape and motion of the mouth area are often severe, since they do not match well with the speech audio. To overcome the aforementioned limitations, we present the first method for visual speech-aware perceptual reconstruction of 3D mouth expressions. We do this by proposing a "lipread" loss, which guides the fitting process so that the elicited perception from the 3D reconstructed talking head resembles that of the original video footage. We demonstrate that, interestingly, the lipread loss is better suited for 3D reconstruction of mouth movements compared to traditional landmark losses, and even direct 3D supervision. Furthermore, the devised method does not rely on any text transcriptions or corresponding audio, rendering it ideal for training in unlabeled datasets. We verify the efficiency of our method through exhaustive objective evaluations on three large-scale datasets, as well as subjective evaluation with two web-based user studies

    Personalized face and gesture analysis using hierarchical neural networks

    The video-based computational analyses of human face and gesture signals encompass a myriad of challenging research problems involving computer vision, machine learning and human computer interaction. In this thesis, we focus on the following challenges: a) the classification of hand and body gestures along with the temporal localization of their occurrence in a continuous stream, b) the recognition of facial expressivity levels in people with Parkinson's Disease using multimodal feature representations, c) the prediction of student learning outcomes in intelligent tutoring systems using affect signals, and d) the personalization of machine learning models, which can adapt to subject and group-specific nuances in facial and gestural behavior. Specifically, we first conduct a quantitative comparison of two approaches to the problem of segmenting and classifying gestures on two benchmark gesture datasets: a method that simultaneously segments and classifies gestures versus a cascaded method that performs the tasks sequentially. Second, we introduce a framework that computationally predicts an accurate score for facial expressivity and validate it on a dataset of interview videos of people with Parkinson's disease. Third, based on a unique dataset of videos of students interacting with MathSpring, an intelligent tutoring system, collected by our collaborative research team, we build models to predict learning outcomes from their facial affect signals. Finally, we propose a novel solution to a relatively unexplored area in automatic face and gesture analysis research: personalization of models to individuals and groups. We develop hierarchical Bayesian neural networks to overcome the challenges posed by group or subject-specific variations in face and gesture signals. 
We successfully validate our formulation on the problems of personalized subject-specific gesture classification, context-specific facial expressivity recognition and student-specific learning outcome prediction. We demonstrate the flexibility of our hierarchical framework by validating the utility of both fully connected and recurrent neural architectures

    Artificial Intelligence Tools for Facial Expression Analysis.

    Inner emotions show visibly on the human face and are understood as a basic guide to an individual’s inner world. It is, therefore, possible to determine a person’s attitudes and the effects of others’ behaviour on their deeper feelings by examining facial expressions. In real-world applications, machines that interact with people need strong facial expression recognition. This recognition holds advantages for varied applications in affective computing, advanced human-computer interaction, security, stress and depression analysis, robotic systems, and machine learning. This thesis starts by proposing a benchmark of dynamic versus static methods for facial Action Unit (AU) detection. An AU activation is a set of local individual facial muscle movements that occur in unison, constituting a natural facial expression event. Detecting AUs automatically can provide explicit benefits since it considers both static and dynamic facial features. For this research, AU occurrence detection was conducted by extracting static and dynamic features, using both hand-crafted and deep learning representations, from each static image of a video. This confirmed the superior performance of a pretrained model. Next, temporal modelling was investigated to detect the underlying temporal variation phases in dynamic sequences using supervised and unsupervised methods. During these processes, stacking dynamic features on top of static ones was found to be important in encoding deep features for learning temporal information when combining the spatial and temporal schemes simultaneously. This study also found that fusing both spatial and temporal features gives more long-term temporal pattern information. Moreover, we hypothesised that using an unsupervised method would enable the leaching of invariant information from dynamic textures. Recently, approaches based on Generative Adversarial Networks (GANs) have produced cutting-edge developments. In the second section of this thesis, we propose a model based on an unsupervised DCGAN for facial feature extraction and classification, with two aims: the creation of facial expression images under different arbitrary poses (frontal, multi-view, and in the wild), and the recognition of emotion categories and AUs, in an attempt to resolve the problem of recognising the seven static classes of emotion in the wild. Thorough cross-database experimentation demonstrates that this approach can improve the generalization results. Additionally, we showed that the features learnt by the DCGAN process are poorly suited to encoding facial expressions when observed under multiple views, or when trained from a limited number of positive examples. Finally, this research focuses on disentangling identity from expression for facial expression recognition. A novel technique was implemented for emotion recognition from a single monocular image. A large-scale dataset (Face vid) was created from facial videos rich in variations and distribution of facial dynamics, appearance, identities, expressions, and 3D poses. This dataset was used to train a DCNN (ResNet) to regress the expression parameters of a 3D Morphable Model jointly with a back-end classifier.

    Sensing, interpreting, and anticipating human social behaviour in the real world

    Low-level nonverbal social signals like glances, utterances, facial expressions and body language are central to human communicative situations and have been shown to be connected to important high-level constructs, such as emotions, turn-taking, rapport, or leadership. A prerequisite for the creation of social machines that are able to support humans in e.g. education, psychotherapy, or human resources is the ability to automatically sense, interpret, and anticipate human nonverbal behaviour. While promising results have been shown in controlled settings, automatically analysing unconstrained situations, e.g. in daily-life settings, remains challenging. Furthermore, anticipation of nonverbal behaviour in social situations is still largely unexplored. The goal of this thesis is to move closer to the vision of social machines in the real world. It makes fundamental contributions along the three dimensions of sensing, interpreting and anticipating nonverbal behaviour in social interactions. First, robust recognition of low-level nonverbal behaviour lays the groundwork for all further analysis steps. Advancing human visual behaviour sensing is especially relevant as the current state of the art is still not satisfactory in many daily-life situations. While many social interactions take place in groups, current methods for unsupervised eye contact detection can only handle dyadic interactions. We propose a novel unsupervised method for multi-person eye contact detection by exploiting the connection between gaze and speaking turns. Furthermore, we make use of mobile device engagement to address the problem of calibration drift that occurs in daily-life usage of mobile eye trackers. Second, we improve the interpretation of social signals in terms of higher level social behaviours. In particular, we propose the first dataset and method for emotion recognition from bodily expressions of freely moving, unaugmented dyads. 
Furthermore, we are the first to study low rapport detection in group interactions, as well as investigating a cross-dataset evaluation setting for the emergent leadership detection task. Third, human visual behaviour is special because it functions as a social signal and also determines what a person is seeing at a given moment in time. Being able to anticipate human gaze opens up the possibility for machines to more seamlessly share attention with humans, or to intervene in a timely manner if humans are about to overlook important aspects of the environment. We are the first to propose methods for the anticipation of eye contact in dyadic conversations, as well as in the context of mobile device interactions during daily life, thereby paving the way for interfaces that are able to proactively intervene and support interacting humans.
    Gaze, facial expressions, body language, and prosody play a central role as nonverbal signals in human communication. Numerous studies have linked them to important concepts such as emotions, turn-taking, leadership, and the quality of the relationship between two people. For machines to support people effectively during their daily social lives, automatic methods for sensing, interpreting, and anticipating nonverbal behaviour are necessary. Although previous research has produced encouraging results in controlled studies, the automatic analysis of nonverbal behaviour in less controlled situations remains a challenge. Moreover, there are hardly any studies on the anticipation of nonverbal behaviour in social situations. The goal of this thesis is to bring the vision of automatically understanding social situations a step closer to reality. This thesis makes important contributions to the automatic recognition of human gaze behaviour in everyday situations. Although many social interactions take place in groups, unsupervised methods for eye contact detection have so far existed only for dyadic interactions. We present a new approach to eye contact detection in groups that requires no manual annotations, exploiting the statistical relationship between gaze and speaking behaviour. Daily activities are a challenge for mobile eye-tracking devices, because shifts of these devices can degrade their calibration. In this thesis, we use user behaviour on mobile devices to correct the effect of such shifts. Beyond sensing, this thesis also improves the interpretation of social signals. We publish the first dataset and the first method for emotion recognition in dyadic interactions without the use of specialised equipment. In addition, we present the first study on the automatic detection of low rapport in group interactions, and conduct the first cross-dataset evaluation for the detection of emergent leadership. To conclude the thesis, we present the first approaches to anticipating gaze behaviour in social interactions. Gaze behaviour has the special property that it serves both as a social signal and to direct visual perception. The ability to anticipate gaze behaviour therefore allows machines both to fit more seamlessly into social interactions and to warn people when they are at risk of overlooking important aspects of the environment. We present methods for anticipating gaze behaviour in the context of interaction with mobile devices during daily activities, as well as during dyadic interactions via video calls.

    3D facial performance capture from monocular RGB video.

    3D facial performance capture is an essential technique for animation production in feature films, video gaming, human-computer interaction, VR/AR asset creation and digital heritage, all of which have a huge impact on our daily life. Traditionally, dedicated hardware such as depth sensors, laser scanners and camera arrays has been developed to acquire depth information for this purpose. However, such sophisticated instruments can only be operated by trained professionals. In recent years, the widespread availability of mobile devices, and the increased interest of casual untrained users in applications such as image and video editing and virtual facial model creation, have sparked interest in 3D facial reconstruction from 2D RGB input. Due to depth ambiguity and facial appearance variation, 3D facial performance capture and modelling from 2D images are inherently ill-posed problems. However, with strong prior knowledge of the human face, it is possible to accurately infer the true 3D facial shape and performance from multiple observations captured from different viewing angles. Various 3D-from-2D methods have been proposed and proven to work well in controlled environments. Nevertheless, there are still many unexplored issues in uncontrolled in-the-wild environments. To achieve the same level of performance as in controlled environments, new techniques are required to handle interfering factors in uncontrolled environments, such as varying illumination, partial occlusion and facial variation not captured by prior knowledge. This thesis addresses existing challenges and proposes novel methods involving 2D landmark detection, 3D facial reconstruction and 3D performance tracking, which are validated through theoretical research and experimental studies. 3D facial performance tracking is a multidisciplinary problem involving many areas such as computer vision, computer graphics and machine learning. To deal with the large variations within a single image, we present new machine learning techniques for facial landmark detection, based on our observation of facial features in challenging scenarios, to increase robustness. To take advantage of the evidence aggregated from multiple observations, we present new robust and efficient optimisation techniques that impose consistency constraints that help filter out outliers. To exploit person-specific model generation and the temporal and spatial coherence in continuous video input, we present new methods to improve performance via optimisation. In order to track 3D facial performance, the fundamental prerequisite for good results is an accurate underlying 3D model of the actor. In this thesis, we present new methods targeted at 3D facial geometry reconstruction, which are more efficient than existing generic 3D geometry reconstruction methods. Evaluation and validation results from substantial experiments show that the methods proposed in this thesis outperform state-of-the-art methods and enable us to generate high-quality results with fewer constraints.