Text-based Editing of Talking-head Video
Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e., no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression, and scene illumination per frame. To edit a video, the user only has to edit the transcript; an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation into a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full-sentence synthesis.
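To make the segment-selection step concrete, below is a minimal Python sketch of choosing annotated corpus segments to cover an edited phoneme sequence. All names (`Segment`, `select_segments`) are hypothetical stand-ins: the paper's actual optimization scores candidate subsequences jointly for visual smoothness, whereas this toy version simply matches greedily on phoneme labels.

```python
# Minimal sketch of transcript-driven segment selection, assuming the input
# video has already been annotated per frame. Hypothetical names throughout;
# the paper's optimization is considerably more involved.

from dataclasses import dataclass

@dataclass
class Segment:
    phonemes: tuple[str, ...]   # phoneme labels covered by this segment
    start: int                  # first frame index in the source video
    end: int                    # last frame index (inclusive)

def select_segments(corpus: list[Segment], edited: list[str]) -> list[Segment]:
    """Greedily cover the edited phoneme sequence with corpus segments,
    preferring the longest matching segment at each position."""
    chosen: list[Segment] = []
    i = 0
    while i < len(edited):
        best = None
        for seg in corpus:
            n = len(seg.phonemes)
            if tuple(edited[i:i + n]) == seg.phonemes:
                if best is None or n > len(best.phonemes):
                    best = seg
        if best is None:
            raise ValueError(f"no corpus segment covers phoneme {edited[i]!r}")
        chosen.append(best)
        i += len(best.phonemes)
    return chosen

corpus = [
    Segment(("HH", "AH"), 0, 11),
    Segment(("L", "OW"), 12, 25),
    Segment(("W", "ER", "L", "D"), 40, 70),
]
print(select_segments(corpus, ["HH", "AH", "L", "OW"]))
```

The selected segments' per-frame parameters (pose, expression, illumination) would then be blended and handed to the rendering stage, rather than concatenating raw video frames, which is what avoids visible jump cuts.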
AI-generated Content for Various Data Modalities: A Survey
AI-generated content (AIGC) methods aim to produce text, images, videos, 3D assets, and other media using AI algorithms. Due to its wide range of applications and the demonstrated potential of recent works, AIGC has been attracting considerable attention, and AIGC methods have been developed for various data modalities, such as image, video, text, 3D shape (as voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human avatar (body and head), 3D motion, and audio, each presenting different characteristics and challenges. Furthermore, there have also been many significant developments in cross-modality AIGC methods, where generative methods receive conditioning input in one modality and produce outputs in another. Examples include going from various modalities to image, video, 3D shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar), and audio modalities. In this paper, we provide a comprehensive review of AIGC methods across different data modalities, including both single-modality and cross-modality methods, highlighting the various challenges, representative works, and recent technical directions in each setting. We also survey representative datasets across the modalities and present comparative results. Finally, we discuss open challenges and potential future research directions.
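The cross-modality pattern the survey describes, a generator conditioned on one modality that emits another, can be captured in a small typed interface. The sketch below is purely illustrative and not drawn from any surveyed system; the `CrossModalGenerator` protocol and the toy `TextToImage` class are assumptions.

```python
# Illustrative sketch of the cross-modality AIGC pattern: conditioning input
# in one modality, output in another. Names are hypothetical.

from typing import Protocol, TypeVar

Cond = TypeVar("Cond", contravariant=True)
Out = TypeVar("Out", covariant=True)

class CrossModalGenerator(Protocol[Cond, Out]):
    def generate(self, condition: Cond) -> Out: ...

class TextToImage:
    """Toy text-to-image stand-in: returns a solid 'image' sized by prompt length."""
    def generate(self, condition: str) -> list[list[int]]:
        size = max(1, min(len(condition), 8))
        return [[255] * size for _ in range(size)]

def run(gen: CrossModalGenerator[str, list[list[int]]], prompt: str) -> None:
    img = gen.generate(prompt)
    print(f"{len(img)}x{len(img[0])} image from prompt {prompt!r}")

run(TextToImage(), "a cat")
```

Real systems differ mainly in what sits behind `generate` (diffusion models, autoregressive transformers, GANs), but the conditioning interface is the common shape across the modality pairs the survey enumerates.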
Synthesizing Photorealistic Virtual Humans Through Cross-modal Disentanglement
Over the last few decades, many aspects of human life have been enhanced with virtual domains, from the advent of digital assistants such as Amazon's Alexa and Apple's Siri to the latest metaverse efforts of the rebranded Meta. These trends underscore the importance of generating photorealistic visual depictions of humans, which has led to the rapid growth of so-called deepfake and talking-head generation methods in recent years. Despite their impressive results and popularity, these methods usually fall short on qualitative aspects such as texture quality, lip synchronization, and resolution, and on practical aspects such as the ability to run in real time. To allow virtual human avatars to be used in practical scenarios, we propose an end-to-end framework for synthesizing high-quality virtual human faces capable of speaking with accurate lip motion, with a special emphasis on performance. We introduce a novel network utilizing visemes as an intermediate audio representation, and a novel data augmentation strategy employing a hierarchical image synthesis approach that allows disentanglement of the different modalities used to control the global head motion. Our method runs in real time and delivers superior results compared to the current state of the art.
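To illustrate the viseme-as-intermediate-representation idea, here is a minimal sketch of the audio-to-mouth-shape path. The phoneme-to-viseme table and both stage functions are hypothetical stand-ins for the paper's learned networks; visemes form a much smaller alphabet than phonemes because many sounds share one mouth shape.

```python
# Minimal sketch: audio -> phonemes -> visemes, with visemes as the compact
# intermediate representation that drives the mouth. All mappings are
# hypothetical stand-ins for learned components.

PHONEME_TO_VISEME = {
    "P": "BMP", "B": "BMP", "M": "BMP",   # bilabials share one mouth shape
    "F": "FV",  "V": "FV",
    "AA": "open", "IY": "wide", "UW": "round",
}

def audio_to_phonemes(audio_frames: list[float]) -> list[str]:
    """Stand-in for a learned acoustic model: thresholds frame energy."""
    return ["AA" if f > 0.5 else "M" for f in audio_frames]

def phonemes_to_visemes(phonemes: list[str]) -> list[str]:
    """Collapse phonemes onto the smaller viseme alphabet."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

frames = [0.9, 0.8, 0.2, 0.1, 0.7]
print(phonemes_to_visemes(audio_to_phonemes(frames)))
# -> ['open', 'open', 'BMP', 'BMP', 'open']
```

Because the renderer consumes only this viseme stream while head pose is supplied separately, the two control signals stay disentangled, which is what the abstract's hierarchical synthesis strategy exploits.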