78 research outputs found
Reversible Graph Neural Network-based Reaction Distribution Learning for Multiple Appropriate Facial Reactions Generation
Generating facial reactions in a human-human dyadic interaction is complex
and highly context-dependent, since more than one facial reaction can be
appropriate for the speaker's behaviour. This has challenged existing machine
learning (ML) methods, whose training strategies force models to reproduce a
specific (rather than one of multiple) facial reaction for each input speaker behaviour. This
paper proposes the first multiple appropriate facial reaction generation
framework that re-formulates the one-to-many mapping facial reaction generation
problem as a one-to-one mapping problem. This means that we approach this
problem by considering the generation of a distribution of the listener's
appropriate facial reactions instead of multiple different appropriate facial
reactions, i.e., 'many' appropriate facial reaction labels are summarised as
'one' distribution label during training. Our model consists of a perceptual
processor, a cognitive processor, and a motor processor. The motor processor is
implemented with a novel Reversible Multi-dimensional Edge Graph Neural Network
(REGNN). This allows us to obtain a distribution of appropriate real facial
reactions during the training process, enabling the cognitive processor to be
trained to predict the appropriate facial reaction distribution. At the
inference stage, the REGNN decodes an appropriate facial reaction by using this
distribution as input. Experimental results demonstrate that our approach
outperforms existing models in generating more appropriate, realistic, and
synchronized facial reactions. The improved performance is largely attributed
to the proposed appropriate facial reaction distribution learning strategy and
the use of a REGNN. The code is available at
https://github.com/TongXu-05/REGNN-Multiple-Appropriate-Facial-Reaction-Generation
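The abstract's core move, summarising "many" appropriate reaction labels as "one" distribution label, can be sketched in a few lines. In the sketch below a diagonal Gaussian stands in for the distribution the REGNN actually learns; the feature dimensions and values are hypothetical.

```python
import numpy as np

def summarise_reactions(reactions):
    """Summarise 'many' appropriate facial reaction labels as 'one'
    distribution label: a per-dimension Gaussian over reaction features.
    `reactions` has shape (n_reactions, n_features)."""
    reactions = np.asarray(reactions, dtype=float)
    mu = reactions.mean(axis=0)
    sigma = reactions.std(axis=0)
    return mu, sigma

def sample_reaction(mu, sigma, seed=None):
    """Decode one appropriate reaction by sampling the distribution
    (the paper's REGNN plays this role; a Gaussian stands in here)."""
    rng = np.random.default_rng(seed)
    return rng.normal(mu, sigma)

# Three appropriate reactions to the same speaker behaviour, each a
# 4-dimensional facial feature vector (hypothetical values).
reactions = [[0.2, 0.8, 0.1, 0.5],
             [0.3, 0.7, 0.2, 0.4],
             [0.1, 0.9, 0.0, 0.6]]
mu, sigma = summarise_reactions(reactions)
print(mu)  # one distribution label replaces three reaction labels
```

The one-to-many training problem thus becomes one-to-one: the model only ever has to predict the single distribution label.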
An Actor-Centric Approach to Facial Animation Control by Neural Networks For Non-Player Characters in Video Games
Game developers increasingly consider the degree to which character animation emulates facial expressions found in cinema. Employing animators and actors to produce cinematic facial animation by mixing motion capture and hand-crafted animation is labor-intensive and therefore expensive. Emotion corpora and neural network controllers have shown promise toward developing autonomous animation that does not rely on motion capture. Previous research and practice in the disciplines of Computer Science, Psychology, and the Performing Arts have provided frameworks on which to build a workflow toward creating an emotion AI system that can animate the facial mesh of a 3D non-player character using a combination of related theories and methods. However, past investigations and their resulting production methods largely ignore the emotion generation systems that have evolved in the performing arts for more than a century. We find very little research that embraces the intellectual process of trained actors as complex collaborators from which to understand and model the training of a neural network for character animation. This investigation demonstrates a workflow design that integrates knowledge from the performing arts and the affective branches of the social and biological sciences. Our workflow proceeds from developing and annotating a fictional scenario with actors, to producing a video emotion corpus, to designing, training, and validating a neural network, to analyzing the emotion data annotation of the corpus and neural network, and finally to determining resemblant behavior in its autonomous animation control of a 3D character facial mesh. The resulting workflow includes a method for the development of a neural network architecture whose initial efficacy as a facial emotion expression simulator has been tested and validated as substantially resemblant to the character behavior developed by a human actor.
REACT2023: The First Multiple Appropriate Facial Reaction Generation Challenge
The Multiple Appropriate Facial Reaction Generation Challenge (REACT2023) is the first competition event focused on evaluating multimedia processing and machine learning techniques for generating human-appropriate facial reactions in various dyadic interaction scenarios, with all participants competing strictly under the same conditions. The goal of the challenge is to provide the first benchmark test set for multi-modal information processing and to foster collaboration among the audio, visual, and audio-visual behaviour analysis and behaviour generation (a.k.a. generative AI) communities, to compare the relative merits of the approaches to automatic appropriate facial reaction generation under different spontaneous dyadic interaction conditions. This paper presents: (i) the novelties, contributions and guidelines of the REACT2023 challenge; (ii) the dataset utilized in the challenge; and (iii) the performance of the baseline systems on the two proposed sub-challenges: Offline Multiple Appropriate Facial Reaction Generation and Online Multiple Appropriate Facial Reaction Generation, respectively. The challenge baseline code is publicly available at https://github.com/reactmultimodalchallenge/baseline-react2023.
Conditional Adversarial Synthesis of 3D Facial Action Units
Employing deep learning-based approaches for fine-grained facial expression
analysis, such as those involving the estimation of Action Unit (AU)
intensities, is difficult due to the lack of a large-scale dataset of real
faces with sufficiently diverse AU labels for training. In this paper, we
consider how AU-level facial image synthesis can be used to substantially
augment such a dataset. We propose an AU synthesis framework that combines the
well-known 3D Morphable Model (3DMM), which intrinsically disentangles
expression parameters from other face attributes, with models that
adversarially generate 3DMM expression parameters conditioned on given target
AU labels, in contrast to the more conventional approach of generating facial
images directly. In this way, we are able to synthesize new combinations of
expression parameters and facial images from desired AU labels. Extensive
quantitative and qualitative results on the benchmark DISFA dataset demonstrate
the effectiveness of our method on 3DMM facial expression parameter synthesis
and data augmentation for deep learning-based AU intensity estimation.
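To illustrate the conditioning mechanism the abstract describes, generating 3DMM expression parameters from target AU labels rather than generating images directly, here is a minimal sketch. The untrained random weights stand in for an adversarially trained generator, and all dimensions (noise size, 12 AUs, 29 expression parameters) are hypothetical choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conditional generator: maps (noise, target AU intensities) to
# 3DMM expression parameters. The random weights below stand in for a
# trained adversarial model; all dimensions are hypothetical.
NOISE_DIM, N_AUS, N_EXPR_PARAMS = 8, 12, 29

W = rng.standard_normal((NOISE_DIM + N_AUS, N_EXPR_PARAMS)) * 0.1

def generate_expression_params(au_labels):
    """Condition generation on the desired AU intensity vector by
    concatenating it with the noise input (cGAN-style conditioning)."""
    z = rng.standard_normal(NOISE_DIM)
    x = np.concatenate([z, au_labels])
    return np.tanh(x @ W)  # expression parameters squashed into [-1, 1]

au_target = np.zeros(N_AUS)
au_target[4] = 1.0  # e.g. request maximal intensity for one AU
params = generate_expression_params(au_target)
print(params.shape)  # one synthetic 3DMM expression vector
```

Because 3DMM expression parameters are disentangled from identity, the same sampled parameters can then be rendered onto many face identities to augment the training set.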
Adversarial Training in Affective Computing and Sentiment Analysis: Recent Advances and Perspectives
Over the past few years, adversarial training has become an extremely active
research topic and has been successfully applied to various Artificial
Intelligence (AI) domains. As a potentially crucial technique for the
development of the next generation of emotional AI systems, we herein provide a
comprehensive overview of the application of adversarial training to affective
computing and sentiment analysis. Various representative adversarial training
algorithms are explained and discussed accordingly, aimed at tackling diverse
challenges associated with emotional AI systems. Further, we highlight a range
of potential future research directions. We expect that this overview will help
facilitate the development of adversarial training for affective computing and
sentiment analysis in both the academic and industrial communities.
Virtual humans and Photorealism: The effect of photorealism of interactive virtual humans in clinical virtual environment on affective responses
The ability of realistic vs. stylized representations of virtual characters to elicit emotions in users has been an open question for researchers and artists alike. We designed and performed a between-subjects experiment using a medical virtual reality simulation to study the differences in the emotions aroused in participants while interacting with realistic and stylized virtual characters. The experiment included three conditions, each of which presented a different representation of the virtual character: photo-realistic, non-photorealistic cartoon-shaded, and non-photorealistic charcoal-sketch. The simulation used for the experiment, called the Rapid Response Training System, was developed to train nurses to identify symptoms of rapid deterioration in patients. The emotional impact of interacting with the simulation on the participants was measured via both subjective and objective metrics. Quantitative objective measures were gathered using Electrodermal Activity (EDA) skin sensors, and quantitative subjective measures included the Differential Emotion Survey (DES IV), the Positive and Negative Affect Schedule (PANAS), and a co-presence (social presence) questionnaire. The emotional state of the participants was analyzed across four distinct time steps during which the medical condition of the virtual patient deteriorated, and was contrasted with a baseline affective state. The data from the EDA sensors indicated that the mean level of arousal was highest in the charcoal-sketch condition and lowest in the realistic condition, with the cartoon-shaded condition in between. Mean arousal responses also seemed to be consistent in both the cartoon-shaded and charcoal-sketch conditions across all time steps, while the mean arousal response of participants in the realistic condition showed a significant drop from time step 1 to time step 2, corresponding to the deterioration of the virtual patient.
Mean scores of participants in the DES survey seem to suggest that participants in the realistic condition exhibited a higher emotional response than participants in either non-realistic condition. Within the non-realistic conditions, participants in the cartoon-shaded condition seemed to exhibit a higher emotional response than those in the charcoal-sketch condition.
Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications
We consider the task of animating 3D facial geometry from a speech signal.
Existing works are primarily deterministic, focusing on learning a one-to-one
mapping from speech signal to 3D face meshes on small datasets with limited
speakers. While these models can achieve high-quality lip articulation for
speakers in the training set, they are unable to capture the full and diverse
distribution of 3D facial motions that accompany speech in the real world.
Importantly, the relationship between speech and facial motion is one-to-many,
containing both inter-speaker and intra-speaker variations and necessitating a
probabilistic approach. In this paper, we identify and address key challenges
that have so far limited the development of probabilistic models: lack of
datasets and metrics that are suitable for training and evaluating them, as
well as the difficulty of designing a model that generates diverse results
while remaining faithful to a conditioning signal as strong as speech. We first
propose large-scale benchmark datasets and metrics suitable for probabilistic
modeling. Then, we demonstrate a probabilistic model that achieves both
diversity and fidelity to speech, outperforming other methods across the
proposed benchmarks. Finally, we showcase useful applications of probabilistic
models trained on these large-scale datasets: we can generate diverse
speech-driven 3D facial motion that matches unseen speaker styles extracted
from reference clips; and our synthetic meshes can be used to improve the
performance of downstream audio-visual models.
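A small sketch of the kind of metrics such probabilistic benchmarks need: one score for diversity across samples drawn for the same utterance, and one for fidelity to a reference capture. These are generic formulations for illustration, not necessarily the metrics proposed in the paper, and all array sizes are toy values.

```python
import numpy as np

def diversity(samples):
    """Average pairwise L2 distance between sampled motion sequences --
    a common style of diversity metric for probabilistic generators."""
    n = len(samples)
    d = [np.linalg.norm(samples[i] - samples[j])
         for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(d))

def fidelity(sample, reference):
    """Mean per-vertex error against a ground-truth capture (lower is
    better); a deterministic model optimises only this kind of score."""
    return float(np.mean(np.linalg.norm(sample - reference, axis=-1)))

rng = np.random.default_rng(0)
# Five sampled motions for one utterance: (frames, vertices, 3), toy sizes.
samples = [rng.standard_normal((10, 4, 3)) for _ in range(5)]
reference = rng.standard_normal((10, 4, 3))

print(diversity(samples), fidelity(samples[0], reference))
```

The tension the abstract identifies is visible here: a model can trivially maximise diversity with noise or maximise fidelity by collapsing to one output, so a useful benchmark must report both.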
FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion
Speech-driven 3D facial animation synthesis has been a challenging task both
in industry and research. Recent methods mostly rely on deterministic deep
learning, meaning that for a given speech input the output is always the
same. However, in reality, the non-verbal facial cues that reside throughout
the face are non-deterministic in nature. In addition, the majority of
approaches focus on 3D vertex-based datasets, and methods compatible
with existing facial animation pipelines using rigged characters are scarce. To
address these issues, we present FaceDiffuser, a non-deterministic deep
learning model for generating speech-driven facial animations, trained on
both 3D vertex- and blendshape-based datasets. Our method is based on the
diffusion technique and uses the pre-trained large speech representation model
HuBERT to encode the audio input. To the best of our knowledge, we are the
first to employ the diffusion method for the task of speech-driven 3D facial
animation synthesis. We have run extensive objective and subjective analyses
and show that our approach achieves better or comparable results in comparison
to the state-of-the-art methods. We also introduce a new in-house dataset that
is based on a blendshape-based rigged character. We recommend watching the
accompanying supplementary video. The code and the dataset will be publicly
available. Comment: Pre-print of the paper accepted at ACM SIGGRAPH MIG 202
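The diffusion recipe the abstract names can be sketched at its core: a fixed forward process that gradually noises a clean animation frame, and a training objective that asks a network to predict that noise given the noisy frame, the timestep, and audio features. The schedule, dimensions, and the zero "predictor" below are placeholders for illustration, not FaceDiffuser's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100  # number of diffusion steps (hypothetical)
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)    # cumulative signal retention

def q_sample(x0, t):
    """Forward process: noise a clean animation frame x0 to step t
    in closed form, returning the noisy frame and the noise used."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps
    return xt, eps

# A clean blendshape frame (hypothetical 52-coefficient rig).
x0 = rng.uniform(0, 1, size=52)
xt, eps = q_sample(x0, t=50)

# Training would minimise || eps_hat - eps ||^2, where eps_hat is the
# network's noise prediction given (xt, t, audio features). A zero
# vector stands in for the untrained predictor here.
loss = np.mean((np.zeros_like(eps) - eps) ** 2)
print(xt.shape)
```

At inference, repeatedly denoising from pure noise yields a different plausible animation each run, which is what makes the approach non-deterministic.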
Emotion recognition in simulated social interactions
Social context plays an important role in everyday emotional interactions, and others' faces often provide contextual cues in social situations. Investigating this complex social process is a challenge that can be addressed with the use of computer-generated facial expressions. In the current research, we use synthesized facial expressions to investigate the influence of socioaffective inferential mechanisms on the recognition of social emotions. Participants judged blends of facial expressions of shame-sadness, or of anger-disgust, in a target avatar face presented at the center of a screen while a contextual avatar face expressed an emotion (disgust, contempt, sadness) or remained neutral. The dynamics of the facial expressions and the head/gaze movements of the two avatars were manipulated in order to create an interaction in which the two avatars shared eye gaze only in the social interaction condition. Results of Experiment 1 revealed that when the avatars engaged in social interaction, target expression blends of shame and sadness were perceived as expressing more shame if the contextual face expressed disgust, and more sadness when the contextual face expressed sadness. Interestingly, perceptions of shame were not enhanced when the contextual face expressed contempt. The latter finding is probably attributable to the low recognition rates for the expression of contempt observed in Experiment 2.