
    Reversible Graph Neural Network-based Reaction Distribution Learning for Multiple Appropriate Facial Reactions Generation

    Generating facial reactions in a human-human dyadic interaction is complex and highly dependent on the context, since more than one facial reaction can be appropriate for the speaker's behaviour. This has challenged existing machine learning (ML) methods, whose training strategies force models to reproduce a specific (not multiple) facial reaction from each input speaker behaviour. This paper proposes the first multiple appropriate facial reaction generation framework that re-formulates the one-to-many mapping facial reaction generation problem as a one-to-one mapping problem. This means that we approach the problem by generating a distribution over the listener's appropriate facial reactions instead of multiple different appropriate facial reactions, i.e., 'many' appropriate facial reaction labels are summarised as 'one' distribution label during training. Our model consists of a perceptual processor, a cognitive processor, and a motor processor. The motor processor is implemented with a novel Reversible Multi-dimensional Edge Graph Neural Network (REGNN). This allows us to obtain a distribution of appropriate real facial reactions during the training process, enabling the cognitive processor to be trained to predict the appropriate facial reaction distribution. At the inference stage, the REGNN decodes an appropriate facial reaction by using this distribution as input. Experimental results demonstrate that our approach outperforms existing models in generating more appropriate, realistic, and synchronized facial reactions. The improved performance is largely attributed to the proposed appropriate facial reaction distribution learning strategy and the use of a REGNN. The code is available at https://github.com/TongXu-05/REGNN-Multiple-Appropriate-Facial-Reaction-Generation
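
    As a rough illustration of the distribution-label idea (not the authors' REGNN implementation), the sketch below summarises several appropriate listener reactions for one speaker behaviour into a simple per-frame Gaussian, trains a predictor against that summary, and samples a reaction from it at inference. The tensor shapes and the Gaussian choice are assumptions made for the example.

```python
import torch

def summarise_reactions_as_distribution(reactions: torch.Tensor):
    """Collapse 'many' appropriate reaction labels into 'one' distribution label.

    reactions: (N, T, D) tensor -- N appropriate listener reactions recorded for
    one speaker behaviour, each a T-frame sequence of D facial descriptors.
    Returns a per-frame mean and standard deviation, i.e. a simple Gaussian
    summary used here purely for illustration.
    """
    mu = reactions.mean(dim=0)            # (T, D)
    sigma = reactions.std(dim=0) + 1e-6   # (T, D); avoid zero variance
    return mu, sigma

def distribution_loss(pred_mu, pred_sigma, mu, sigma):
    """Train a 'cognitive processor'-style predictor to output the distribution label."""
    return torch.nn.functional.mse_loss(pred_mu, mu) + \
           torch.nn.functional.mse_loss(pred_sigma, sigma)

def decode_reaction(pred_mu, pred_sigma):
    """Draw one appropriate reaction from the predicted distribution.
    In the paper this decoding is performed by the reversible REGNN,
    not by naive Gaussian sampling as done here."""
    return pred_mu + pred_sigma * torch.randn_like(pred_sigma)
```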

    An Actor-Centric Approach to Facial Animation Control by Neural Networks For Non-Player Characters in Video Games

    Game developers increasingly consider the degree to which character animation emulates facial expressions found in cinema. Employing animators and actors to produce cinematic facial animation by mixing motion capture and hand-crafted animation is labor intensive and therefore expensive. Emotion corpora and neural network controllers have shown promise toward developing autonomous animation that does not rely on motion capture. Previous research and practice in the disciplines of Computer Science, Psychology, and the Performing Arts have provided frameworks on which to build a workflow toward creating an emotion AI system that can animate the facial mesh of a 3D non-player character by deploying a combination of related theories and methods. However, past investigations and their resulting production methods largely ignore the emotion generation systems that have evolved in the performing arts for more than a century. We find very little research that embraces the intellectual process of trained actors as complex collaborators from which to understand and model the training of a neural network for character animation. This investigation demonstrates a workflow design that integrates knowledge from the performing arts and the affective branches of the social and biological sciences. Our workflow proceeds from developing and annotating a fictional scenario with actors, to producing a video emotion corpus, to designing, training, and validating a neural network, to analyzing the emotion data annotation of the corpus and neural network, and finally to determining the resemblant behavior of its autonomous animation control of a 3D character facial mesh. The resulting workflow includes a method for developing a neural network architecture whose initial efficacy as a facial emotion expression simulator has been tested and validated as substantially resemblant to the character behavior developed by a human actor.
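
    For readers unfamiliar with how such a neural controller might drive a character's facial mesh, the following minimal sketch maps a vector of annotated emotion intensities to blendshape weights. The network shape, the seven-emotion input, and the 52-blendshape output are illustrative assumptions rather than details taken from this work.

```python
import torch
import torch.nn as nn

class EmotionToBlendshapeController(nn.Module):
    """Illustrative controller: maps annotated emotion intensities (e.g. derived
    from an actor-based corpus) to blendshape weights driving a facial mesh."""

    def __init__(self, n_emotions: int = 7, n_blendshapes: int = 52):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_emotions, 64),
            nn.ReLU(),
            nn.Linear(64, n_blendshapes),
            nn.Sigmoid(),  # blendshape weights constrained to [0, 1]
        )

    def forward(self, emotion_vector: torch.Tensor) -> torch.Tensor:
        return self.net(emotion_vector)

# Hypothetical usage: a mostly "fearful" emotion vector produces 52 mesh weights.
controller = EmotionToBlendshapeController()
weights = controller(torch.tensor([[0.1, 0.0, 0.7, 0.0, 0.2, 0.0, 0.0]]))
print(weights.shape)  # torch.Size([1, 52])
```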

    REACT2023: The First Multiple Appropriate Facial Reaction Generation Challenge

    The Multiple Appropriate Facial Reaction Generation Challenge (REACT2023) is the first competition event focused on evaluating multimedia processing and machine learning techniques for generating human-appropriate facial reactions in various dyadic interaction scenarios, with all participants competing strictly under the same conditions. The goal of the challenge is to provide the first benchmark test set for multi-modal information processing and to foster collaboration among the audio, visual, and audio-visual behaviour analysis and behaviour generation (a.k.a. generative AI) communities, in order to compare the relative merits of approaches to automatic appropriate facial reaction generation under different spontaneous dyadic interaction conditions. This paper presents: (i) the novelties, contributions and guidelines of the REACT2023 challenge; (ii) the dataset utilized in the challenge; and (iii) the performance of the baseline systems on the two proposed sub-challenges: Offline Multiple Appropriate Facial Reaction Generation and Online Multiple Appropriate Facial Reaction Generation, respectively. The challenge baseline code is publicly available at https://github.com/reactmultimodalchallenge/baseline-react2023.

    Conditional Adversarial Synthesis of 3D Facial Action Units

    Employing deep learning-based approaches for fine-grained facial expression analysis, such as those involving the estimation of Action Unit (AU) intensities, is difficult due to the lack of a large-scale dataset of real faces with sufficiently diverse AU labels for training. In this paper, we consider how AU-level facial image synthesis can be used to substantially augment such a dataset. We propose an AU synthesis framework that combines the well-known 3D Morphable Model (3DMM), which intrinsically disentangles expression parameters from other face attributes, with models that adversarially generate 3DMM expression parameters conditioned on given target AU labels, in contrast to the more conventional approach of generating facial images directly. In this way, we are able to synthesize new combinations of expression parameters and facial images from desired AU labels. Extensive quantitative and qualitative results on the benchmark DISFA dataset demonstrate the effectiveness of our method for 3DMM facial expression parameter synthesis and data augmentation for deep learning-based AU intensity estimation.
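
    A minimal sketch of the kind of conditional adversarial setup described here (labels, noise size, and layer widths are assumptions, not the paper's architecture): a generator maps target AU intensities plus noise to 3DMM expression parameters, and a discriminator scores parameter/label pairs.

```python
import torch
import torch.nn as nn

# Assumed dimensions for illustration: 12 AUs, 29 expression parameters, 16-d noise.
N_AUS, N_EXPR, N_NOISE = 12, 29, 16

# Generator: target AU intensities + noise -> 3DMM expression parameters.
generator = nn.Sequential(
    nn.Linear(N_AUS + N_NOISE, 128), nn.ReLU(),
    nn.Linear(128, N_EXPR),
)

# Discriminator: judges whether (expression parameters, AU labels) pairs look real.
discriminator = nn.Sequential(
    nn.Linear(N_EXPR + N_AUS, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1),
)

def synthesise_expression(au_labels: torch.Tensor) -> torch.Tensor:
    """Generate 3DMM expression parameters for the desired AU intensities;
    the parameters would then drive a 3DMM to render augmented face images."""
    noise = torch.randn(au_labels.size(0), N_NOISE)
    return generator(torch.cat([au_labels, noise], dim=-1))
```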

    Adversarial Training in Affective Computing and Sentiment Analysis: Recent Advances and Perspectives

    Over the past few years, adversarial training has become an extremely active research topic and has been successfully applied to various Artificial Intelligence (AI) domains. Given its potential as a crucial technique for the development of the next generation of emotional AI systems, we herein provide a comprehensive overview of the application of adversarial training to affective computing and sentiment analysis. Various representative adversarial training algorithms are explained and discussed accordingly, aimed at tackling diverse challenges associated with emotional AI systems. Further, we highlight a range of potential future research directions. We expect that this overview will help facilitate the development of adversarial training for affective computing and sentiment analysis in both the academic and industrial communities.
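
    As one concrete, representative example of the algorithms such an overview covers (not a method from this paper), the sketch below performs a single FGSM-style adversarial training step on the input embeddings of a sentiment classifier: the clean loss gives the perturbation direction, and the model is then trained on clean and perturbed inputs jointly.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, embeddings, labels, optimizer, epsilon=0.01):
    """One FGSM-style adversarial training step on input embeddings.
    `model` is assumed to map (B, ..., D) embeddings to (B, C) class logits."""
    # 1. Clean forward pass and gradient w.r.t. the inputs (FGSM direction).
    embeddings = embeddings.clone().requires_grad_(True)
    clean_loss = F.cross_entropy(model(embeddings), labels)
    grad = torch.autograd.grad(clean_loss, embeddings, retain_graph=True)[0]

    # 2. Perturb the inputs in the gradient-sign direction.
    adv_embeddings = (embeddings + epsilon * grad.sign()).detach()

    # 3. Train on the clean and adversarial batches jointly.
    total_loss = clean_loss + F.cross_entropy(model(adv_embeddings), labels)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```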

    Virtual humans and Photorealism: The effect of photorealism of interactive virtual humans in clinical virtual environment on affective responses

    The ability of realistic vs. stylized representations of virtual characters to elicit emotions in users has been an open question for researchers and artists alike. We designed and performed a between-subjects experiment using a medical virtual reality simulation to study the differences in the emotions aroused in participants while interacting with realistic and stylized virtual characters. The experiment included three conditions, each of which presented a different representation of the virtual character: photo-realistic, non-photorealistic cartoon-shaded, and non-photorealistic charcoal-sketch. The simulation used for the experiment, called the Rapid Response Training System, was developed to train nurses to identify symptoms of rapid deterioration in patients. The emotional impact of interacting with the simulation on the participants was measured via both subjective and objective metrics. Quantitative objective measures were gathered using Electrodermal Activity (EDA) sensors, and quantitative subjective measures included the Differential Emotion Survey (DES IV), the Positive and Negative Affect Schedule (PANAS), and the co-presence or social presence questionnaire. The emotional state of the participants was analyzed across four distinct time steps during which the medical condition of the virtual patient deteriorated, and was contrasted with a baseline affective state. The data from the EDA sensors indicated that the mean level of arousal was highest in the charcoal-sketch condition and lowest in the realistic condition, with responses in the cartoon-shaded condition falling in the middle. Mean arousal responses also appeared consistent across all time steps in both the cartoon-shaded and charcoal-sketch conditions, while the mean arousal response of participants in the realistic condition showed a significant drop from time step 1 to time step 2, corresponding to the deterioration of the virtual patient. Mean scores on the DES survey seem to suggest that participants in the realistic condition exhibited a higher emotional response than participants in both non-realistic conditions. Within the non-realistic conditions, participants in the cartoon-shaded condition seemed to exhibit a higher emotional response than those in the charcoal-sketch condition.

    Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

    We consider the task of animating 3D facial geometry from a speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from the speech signal to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse distribution of 3D facial motions that accompany speech in the real world. Importantly, the relationship between speech and facial motion is one-to-many, containing both inter-speaker and intra-speaker variations and necessitating a probabilistic approach. In this paper, we identify and address key challenges that have so far limited the development of probabilistic models: the lack of datasets and metrics that are suitable for training and evaluating them, as well as the difficulty of designing a model that generates diverse results while remaining faithful to a strong conditioning signal such as speech. We first propose large-scale benchmark datasets and metrics suitable for probabilistic modeling. Then, we demonstrate a probabilistic model that achieves both diversity and fidelity to speech, outperforming other methods across the proposed benchmarks. Finally, we showcase useful applications of probabilistic models trained on these large-scale datasets: we can generate diverse speech-driven 3D facial motion that matches unseen speaker styles extracted from reference clips, and our synthetic meshes can be used to improve the performance of downstream audio-visual models.
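
    To make the one-to-many formulation concrete, here is a generic probabilistic decoder sketch (not the paper's model): the same speech features combined with different latent samples yield different plausible motion sequences. All dimensions, including the vertex count and feature sizes, are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpeechConditionedDecoder(nn.Module):
    """Decode a 3D face motion sequence from speech features plus a latent
    sample, so different latents give different plausible motions for the
    same speech input."""

    def __init__(self, speech_dim=256, latent_dim=32, n_vertices=5023):
        super().__init__()
        self.gru = nn.GRU(speech_dim + latent_dim, 512, batch_first=True)
        self.out = nn.Linear(512, n_vertices * 3)  # per-frame vertex offsets

    def forward(self, speech_feats, z):
        # speech_feats: (B, T, speech_dim); z: (B, latent_dim)
        z_seq = z.unsqueeze(1).expand(-1, speech_feats.size(1), -1)
        h, _ = self.gru(torch.cat([speech_feats, z_seq], dim=-1))
        return self.out(h)  # (B, T, n_vertices * 3)

# Hypothetical usage: three latent samples give three diverse motion sequences
# for the same 100-frame speech feature sequence.
decoder = SpeechConditionedDecoder()
speech = torch.randn(1, 100, 256)
motions = [decoder(speech, torch.randn(1, 32)) for _ in range(3)]
```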

    FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion

    Speech-driven 3D facial animation synthesis has been a challenging task both in industry and research. Recent methods mostly focus on deterministic deep learning approaches, meaning that, given a speech input, the output is always the same. However, in reality, the non-verbal facial cues that reside throughout the face are non-deterministic in nature. In addition, the majority of approaches focus on 3D vertex-based datasets, and methods that are compatible with existing facial animation pipelines with rigged characters are scarce. To eliminate these issues, we present FaceDiffuser, a non-deterministic deep learning model for generating speech-driven facial animations, trained with both 3D vertex-based and blendshape-based datasets. Our method is based on the diffusion technique and uses the pre-trained large speech representation model HuBERT to encode the audio input. To the best of our knowledge, we are the first to employ the diffusion method for the task of speech-driven 3D facial animation synthesis. We have run extensive objective and subjective analyses and show that our approach achieves better or comparable results in comparison to the state-of-the-art methods. We also introduce a new in-house dataset that is based on a blendshape-based rigged character. We recommend watching the accompanying supplementary video. The code and the dataset will be publicly available. Comment: Pre-print of the paper accepted at ACM SIGGRAPH MIG 202
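
    A minimal sketch of diffusion-style training for this task, assuming precomputed HuBERT features (768-d for the base model) and a placeholder GRU denoiser rather than the actual FaceDiffuser architecture: noise is added to ground-truth blendshape sequences and the network learns to predict that noise conditioned on the audio features and the diffusion timestep.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)          # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    """Placeholder denoiser: predicts the added noise from the noisy motion,
    the audio features, and the (normalised) timestep."""
    def __init__(self, motion_dim=52, audio_dim=768):
        super().__init__()
        self.net = nn.GRU(motion_dim + audio_dim + 1, 256, batch_first=True)
        self.out = nn.Linear(256, motion_dim)

    def forward(self, noisy_motion, audio_feats, t):
        t_feat = t.float().view(-1, 1, 1).expand(-1, noisy_motion.size(1), 1) / T_STEPS
        h, _ = self.net(torch.cat([noisy_motion, audio_feats, t_feat], dim=-1))
        return self.out(h)

def training_step(denoiser, motion, audio_feats):
    """motion: (B, T, 52) blendshape sequence; audio_feats: (B, T, 768) HuBERT features."""
    t = torch.randint(0, T_STEPS, (motion.size(0),))
    a = alphas_cumprod[t].view(-1, 1, 1)
    noise = torch.randn_like(motion)
    noisy = a.sqrt() * motion + (1 - a).sqrt() * noise        # forward (noising) process
    return F.mse_loss(denoiser(noisy, audio_feats, t), noise)  # learn to predict the noise
```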

    Emotion recognition in simulated social interactions

    Social context plays an important role in everyday emotional interactions, and others' faces often provide contextual cues in social situations. Investigating this complex social process is a challenge that can be addressed with the use of computer-generated facial expressions. In the current research, we use synthesized facial expressions to investigate the influence of socioaffective inferential mechanisms on the recognition of social emotions. Participants judged blends of facial expressions of shame-sadness, or of anger-disgust, in a target avatar face presented at the center of a screen while a contextual avatar face expressed an emotion (disgust, contempt, sadness) or remained neutral. The dynamics of the facial expressions and the head/gaze movements of the two avatars were manipulated in order to create an interaction in which the two avatars shared eye gaze only in the social interaction condition. Results of Experiment 1 revealed that when the avatars engaged in social interaction, target expression blends of shame and sadness were perceived as expressing more shame if the contextual face expressed disgust, and more sadness when the contextual face expressed sadness. Interestingly, perceptions of shame were not enhanced when the contextual face expressed contempt. The latter finding is probably attributable to the low recognition rates for the expression of contempt observed in Experiment 2.