5,893 research outputs found

    DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models

    Diffusion models have shown remarkable success in a variety of downstream generative tasks, yet remain under-explored in the important and challenging task of expressive talking head generation. In this work, we propose a DreamTalk framework to fill this gap, which employs meticulous design to unlock the potential of diffusion models in generating expressive talking heads. Specifically, DreamTalk consists of three crucial components: a denoising network, a style-aware lip expert, and a style predictor. The diffusion-based denoising network is able to consistently synthesize high-quality audio-driven face motions across diverse expressions. To enhance the expressiveness and accuracy of lip motions, we introduce a style-aware lip expert that can guide lip-sync while being mindful of the speaking styles. To eliminate the need for expression reference video or text, an extra diffusion-based style predictor is utilized to predict the target expression directly from the audio. By this means, DreamTalk can harness powerful diffusion models to generate expressive faces effectively and reduce the reliance on expensive style references. Experimental results demonstrate that DreamTalk is capable of generating photo-realistic talking faces with diverse speaking styles and achieving accurate lip motions, surpassing existing state-of-the-art counterparts. Comment: Project Page: https://dreamtalk-project.github.i

    Neural Cognition and Affective Computing on Cyber Language

    Characterized by its customary symbol system and simple and vivid expression patterns, cyber language acts not only as a tool for convenient communication but also as a carrier of abundant emotions, and it attracts considerable attention in public opinion analysis, internet marketing, service feedback monitoring, and social emergency management. Based on our multidisciplinary research, this paper presents a classification of the emotional symbols in cyber language, analyzes the cognitive characteristics of different symbols, and puts forward a mechanism model to show the dominant neural activities in that process. Through a comparative study of Chinese, English, and Spanish, which are used by the largest populations in the world, this paper discusses the expressive patterns of emotions in international cyber languages and proposes an intelligent method for affective computing on cyber language in a unified PAD (Pleasure-Arousal-Dominance) emotional space.

    Continuous Analysis of Affect from Voice and Face

    Human affective behavior is multimodal, continuous and complex. Despite major advances within the affective computing research field, modeling, analyzing, interpreting and responding to human affective behavior remains a challenge for automated systems, as affect and emotions are complex constructs with fuzzy boundaries and with substantial individual differences in expression and experience [7]. Therefore, affective and behavioral computing researchers have recently invested increased effort in exploring how to best model, analyze and interpret the subtlety, complexity and continuity (represented along a continuum, e.g., from −1 to +1) of affective behavior in terms of latent dimensions (e.g., arousal, power and valence) and appraisals, rather than in terms of a small number of discrete emotion categories (e.g., happiness and sadness). This chapter aims to (i) give a brief overview of the existing efforts and the major accomplishments in modeling and analysis of emotional expressions in dimensional and continuous space while focusing on open issues and new challenges in the field, and (ii) introduce a representative approach for multimodal continuous analysis of affect from voice and face, and provide experimental results using the audiovisual Sensitive Artificial Listener (SAL) Database of natural interactions. The chapter concludes by posing a number of questions that highlight the significant issues in the field, and by extracting potential answers to these questions from the relevant literature. The chapter is organized as follows. Section 10.2 describes theories of emotion, Sect. 10.3 provides details on the affect dimensions employed in the literature as well as how emotions are perceived from visual, audio and physiological modalities. Section 10.4 summarizes how current technology has been developed, in terms of data acquisition and annotation, and automatic analysis of affect in continuous space by bringing forth a number of issues that need to be taken into account when applying a dimensional approach to emotion recognition, namely, determining the duration of emotions for automatic analysis, modeling the intensity of emotions, determining the baseline, dealing with high inter-subject expression variation, defining optimal strategies for fusion of multiple cues and modalities, and identifying appropriate machine learning techniques and evaluation measures. Section 10.5 presents our representative system that fuses vocal and facial expression cues for dimensional and continuous prediction of emotions in valence and arousal space by employing bidirectional Long Short-Term Memory neural networks (BLSTM-NN), and introduces an output-associative fusion framework that incorporates correlations between the emotion dimensions to further improve continuous affect prediction. Section 10.6 concludes the chapter.

    eMuu : an embodied emotional character for the ambient intelligent home


    Spanish Expressive Voices: corpus for emotion research in Spanish

    A new emotional multimedia database has been recorded and aligned. The database comprises speech and video recordings of one actor and one actress simulating a neutral state and the Big Six emotions: happiness, sadness, anger, surprise, fear and disgust. Due to its careful design and its size (more than 100 minutes per emotion), the recorded database allows comprehensive studies on emotional speech synthesis, prosodic modelling, speech conversion, far-field speech recognition and speech and video-based emotion identification. The database has been automatically labelled for prosodic purposes (5% was manually revised). The whole database has been validated through objective and perceptual tests, achieving a validation score as high as 89%.

    Exploiting the robot kinematic redundancy for emotion conveyance to humans as a lower priority task

    Current approaches do not allow robots to execute a task and simultaneously convey emotions to users using their body motions. This paper explores the capabilities of the Jacobian null space of a humanoid robot to convey emotions. A task priority formulation has been implemented on a Pepper robot which allows the specification of a primary task (waving gesture, transportation of an object, etc.) and exploits the kinematic redundancy of the robot to convey emotions to humans as a lower priority task. The emotions, defined by Mehrabian as points in the pleasure–arousal–dominance space, generate intermediate motion features (jerkiness, activity and gaze) that carry the emotional information. A map from these features to the joints of the robot is presented. A user study has been conducted in which emotional motions have been shown to 30 participants. The results show that happiness and sadness are very well conveyed to the user, calm is moderately well conveyed, and fear is not well conveyed. An analysis of the dependencies between the motion features and the emotions perceived by the participants shows that activity correlates positively with arousal, jerkiness is not perceived by the user, and gaze conveys dominance when activity is low. The results indicate a strong influence of the most energetic motions of the emotional task and point out new directions for further research. Overall, the results show that the null space approach can be regarded as a promising means to convey emotions as a lower priority task. Postprint (author's final draft)
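
The idea of running an emotional motion as a lower priority task in the Jacobian null space, as summarized in the abstract above, can be illustrated with the textbook null-space projection for redundant manipulators. This is a minimal sketch and not the paper's own implementation: the function name `task_priority_velocities`, the 2x4 Jacobian, and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def task_priority_velocities(J, x_dot, q_dot_secondary):
    """Joint velocities that track the primary task exactly (when J has
    full row rank) and spend the remaining redundancy on a secondary motion."""
    J_pinv = np.linalg.pinv(J)              # Moore-Penrose pseudoinverse of the task Jacobian
    N = np.eye(J.shape[1]) - J_pinv @ J     # projector onto the null space of J
    return J_pinv @ x_dot + N @ q_dot_secondary

# Hypothetical example: a 2-D primary task on a 4-joint arm.
J = np.array([[1.0, 0.5, 0.2, 0.0],
              [0.0, 1.0, 0.5, 0.2]])
x_dot = np.array([0.1, -0.05])                # desired primary-task velocity (assumed values)
q_emotion = np.array([0.3, -0.2, 0.1, 0.4])   # stand-in for an "emotional" joint motion

q_dot = task_priority_velocities(J, x_dot, q_emotion)

# The secondary motion is filtered through N, so it cannot disturb the primary task.
primary_task_preserved = np.allclose(J @ q_dot, x_dot)
```

Because the secondary term is multiplied by the null-space projector, `J @ q_dot` still equals `x_dot`; only the component of the emotional motion that does not interfere with the primary task is executed.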