On combining the facial movements of a talking head
We present work on Obie, an embodied conversational agent framework. An embodied conversational agent, or talking head, consists of three main components. The graphical part consists of a face model and a facial muscle model. Besides the graphical part, we have implemented an emotion model and a mapping from emotions to facial expressions. The animation part of the framework focuses on temporally combining different facial movements. In this paper we propose a scheme for combining facial movements on a 3D talking head.
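The two mechanisms this abstract names, an emotion-to-expression mapping and the temporal combination of facial movements, can be illustrated with a short sketch. The muscle names, emotion table, and linear envelope below are illustrative assumptions; Obie's actual emotion and muscle models are not specified here.

```python
import numpy as np

# Hypothetical mapping from emotions to facial muscle activation targets;
# the real Obie emotion model and muscle parameters are not public here.
EMOTION_TO_MUSCLES = {
    "joy":     {"zygomatic_major": 0.8, "orbicularis_oculi": 0.4},
    "sadness": {"frontalis_inner": 0.6, "depressor_anguli": 0.7},
}

def blend_movements(movements, t):
    """Temporally combine overlapping facial movements at time t.

    Each movement is (start, end, {muscle: peak}); contributions are
    ramped in and out linearly, summed per muscle, then clipped to [0, 1].
    """
    combined = {}
    for start, end, targets in movements:
        if not (start <= t <= end):
            continue
        # Simple attack/decay envelope as a stand-in for Obie's scheme.
        ramp = min(t - start, end - t, 1.0)
        for muscle, peak in targets.items():
            combined[muscle] = combined.get(muscle, 0.0) + peak * ramp
    return {m: float(np.clip(v, 0.0, 1.0)) for m, v in combined.items()}

movements = [(0.0, 2.0, EMOTION_TO_MUSCLES["joy"]),
             (1.5, 3.5, EMOTION_TO_MUSCLES["sadness"])]
print(blend_movements(movements, 1.8))  # overlap region: both emotions active
```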
Text-based Editing of Talking-head Video
Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression, and scene illumination per frame. To edit a video, the user only has to edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation into a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full-sentence synthesis.
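The segment-selection step can be illustrated with a toy sketch: cover the phoneme sequence of the edited transcript with matching runs from the annotated input corpus. The greedy matcher and character "phonemes" below are assumptions for illustration; the paper solves this with a more elaborate optimization before stitching and neural rendering.

```python
def select_segments(edit_seq, corpus):
    """Cover edit_seq with (corpus_start, length) segments from corpus."""
    segments, i = [], 0
    while i < len(edit_seq):
        best_start, best_len = -1, 0
        for j in range(len(corpus)):
            # Length of the common run starting at edit_seq[i] / corpus[j].
            k = 0
            while (i + k < len(edit_seq) and j + k < len(corpus)
                   and corpus[j + k] == edit_seq[i + k]):
                k += 1
            if k > best_len:
                best_start, best_len = j, k
        if best_len == 0:
            raise ValueError(f"no corpus match for {edit_seq[i]!r}")
        segments.append((best_start, best_len))
        i += best_len
    return segments

# Characters stand in for per-frame phoneme labels.
print(select_segments(list("held talk"), list("hello world talking head")))
# -> [(0, 3), (10, 6)]: "hel" from "hello", "d talk" from "world talking"
```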
Book review: Mob Rule Learning
Mob Rule Learning: Camps, Unconferences and Trashing the Talking Head. By Michelle Boule. Medford, NJ: CyberAge Books, 2011, paperback, ISBN 978-0-910965-92-7, 230 pages.
Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation
Audio-driven talking-head synthesis is a popular research topic for virtual
human-related applications. However, the inflexibility and inefficiency of
existing methods, which necessitate expensive end-to-end training to transfer
emotions from guidance videos to talking-head predictions, are significant
limitations. In this work, we propose the Emotional Adaptation for Audio-driven
Talking-head (EAT) method, which transforms emotion-agnostic talking-head
models into emotion-controllable ones in a cost-effective and efficient manner
through parameter-efficient adaptations. Our approach utilizes a pretrained
emotion-agnostic talking-head transformer and introduces three lightweight
adaptations (the Deep Emotional Prompts, Emotional Deformation Network, and
Emotional Adaptation Module) from different perspectives to enable precise and
realistic emotion controls. Our experiments demonstrate that our approach
achieves state-of-the-art performance on widely-used benchmarks, including LRW
and MEAD. Additionally, our parameter-efficient adaptations exhibit remarkable
generalization ability, even in scenarios where emotional training videos are
scarce or nonexistent. Project website: https://yuangan.github.io/eat/
Comment: Accepted to ICCV 2023.
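The core cost-saving idea, freezing a pretrained emotion-agnostic backbone and training only small emotion-specific modules, can be sketched as follows. The bottleneck adapter, learnable prompt, and dimensions below are illustrative stand-ins for the paper's Deep Emotional Prompts and adaptation modules, not their actual architectures.

```python
import torch
import torch.nn as nn

class EmotionAdapter(nn.Module):
    """Lightweight bottleneck adapter added to a frozen backbone."""
    def __init__(self, dim, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as identity (zero residual)
        nn.init.zeros_(self.up.bias)

    def forward(self, h, emotion_prompt):
        # Inject a learned emotional prompt before the bottleneck.
        return h + self.up(torch.relu(self.down(h + emotion_prompt)))

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2)
for p in backbone.parameters():      # freeze: only adapters receive gradients
    p.requires_grad_(False)

adapter = EmotionAdapter(256)
prompt = nn.Parameter(torch.zeros(1, 1, 256))  # per-emotion learnable prompt

x = torch.randn(2, 50, 256)                    # audio-derived token sequence
out = adapter(backbone(x), prompt)
print(out.shape, sum(p.numel() for p in adapter.parameters()))
```

Only the adapter and prompt parameters are trained, which is why such a scheme can transfer to new emotions even when emotional training videos are scarce.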
Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks
Given an audio clip and a reference face image, the goal of talking head generation is to generate a high-fidelity talking head video. Although audio-driven methods for generating talking head videos have made progress in the past, most of them focus only on lip-audio synchronization and lack the ability to reproduce the facial expressions of the target person. To this end, we propose a talking head generation model consisting of a Memory-Sharing Emotion Feature extractor (MSEF) and an Attention-Augmented Translator based on U-net (AATU). Firstly, MSEF extracts implicit emotional auxiliary features from audio to estimate more accurate emotional face landmarks. Secondly, AATU acts as a translator between the estimated landmarks and the photo-realistic video frames. Extensive qualitative and quantitative experiments have shown the superiority of the proposed method over previous works. Code will be made publicly available.
DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation
Talking head synthesis is a promising approach for the video production industry. Recently, much effort has been devoted in this research area to improving the generation quality or enhancing the model generalization. However, few works are able to address both issues simultaneously, which is essential for practical applications. To this end, in this paper, we turn our attention to the emerging powerful Latent Diffusion Models, and model talking head generation as an audio-driven temporally coherent denoising process (DiffTalk). More specifically, instead of employing audio signals as the single driving factor, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis. In this way, the proposed DiffTalk is capable of producing high-quality talking head videos in synchronization with the source audio, and more importantly, it can be naturally generalized across different identities without any further fine-tuning. Additionally, our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost. Extensive experiments show that the proposed DiffTalk efficiently synthesizes high-fidelity audio-driven talking head videos for generalized novel identities. For more video results, please refer to https://sstzal.github.io/DiffTalk/.
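The conditioning idea can be sketched as one training step of a conditional denoiser: the model predicts noise from a noisy latent together with audio, reference-image, and landmark conditions. The MLP denoiser, dimensions, and noise schedule below are placeholder assumptions; DiffTalk operates in an image latent space with a U-Net denoiser.

```python
import torch
import torch.nn as nn

latent_dim, cond_dim, T = 64, 3 * 32, 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(
    nn.Linear(latent_dim + cond_dim + 1, 256), nn.SiLU(),
    nn.Linear(256, latent_dim))

z0 = torch.randn(8, latent_dim)               # clean latent of a video frame
audio, ref, lms = (torch.randn(8, 32) for _ in range(3))
cond = torch.cat([audio, ref, lms], dim=-1)   # personality-aware conditions

t = torch.randint(0, T, (8,))
noise = torch.randn_like(z0)
a = alphas_bar[t].unsqueeze(-1)
zt = a.sqrt() * z0 + (1 - a).sqrt() * noise   # forward diffusion q(z_t | z_0)

pred = denoiser(torch.cat([zt, cond, t.float().unsqueeze(-1) / T], dim=-1))
loss = nn.functional.mse_loss(pred, noise)    # standard epsilon objective
loss.backward()
print(loss.item())
```

Because the reference image and landmarks enter only as conditions, swapping them at inference is what lets such a model generalize to unseen identities without fine-tuning.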
Talking Head
My sister used to raise guinea hens, and her hens were consistently cantankerous and difficult. I thought she would enjoy this painting of a particularly salty guinea hen.
Multimodal Interaction in a Haptic Environment
In this paper we investigate the introduction of haptics into a multimodal tutoring environment. In this environment a haptic device is used to control a virtual piece of sterile cotton and a virtual injection needle. Speech input and output are provided to interact with a virtual tutor, available as a talking head, and a virtual patient. We introduce the haptic tasks and describe how different agents in the multi-agent system are made responsible for them, as sketched below. Notes are provided on how we introduce an affective model into the tutor agent.
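A minimal sketch of routing haptic and dialogue tasks to responsible agents, assuming a simple dispatch table; the agent names and task set are hypothetical, and the paper's actual multi-agent architecture is not detailed here.

```python
class Agent:
    """Hypothetical agent that owns a subset of tasks in the environment."""
    def __init__(self, name):
        self.name = name

    def handle(self, task, payload):
        print(f"[{self.name}] handling {task}: {payload}")

# Illustrative assignment of tasks to agents; not the paper's actual design.
TASK_ROUTING = {
    "sterilize_skin": Agent("haptics-cotton"),
    "insert_needle":  Agent("haptics-needle"),
    "speak_to_tutor": Agent("tutor-talking-head"),
    "patient_react":  Agent("virtual-patient"),
}

def dispatch(task, payload):
    agent = TASK_ROUTING.get(task)
    if agent is None:
        raise KeyError(f"no agent responsible for task {task!r}")
    agent.handle(task, payload)

dispatch("insert_needle", {"force_n": 0.8, "angle_deg": 15})
dispatch("speak_to_tutor", "Did I sterilize the area correctly?")
```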