Generating realistic talking faces is a complex and widely studied task
with numerous applications. In this paper, we present DiffTalker, a novel model
designed to generate lifelike talking faces through audio and landmark
co-driving. DiffTalker addresses the challenge of applying diffusion models,
which are traditionally trained on text-image pairs, directly to audio-driven
control. DiffTalker consists of two agent networks: a
transformer-based landmark completion network for geometric accuracy and a
diffusion-based face generation network for texture details. Landmarks play a
pivotal role in establishing a seamless connection between the audio and image
domains, facilitating the incorporation of knowledge from pre-trained diffusion
models. This approach efficiently produces articulate talking
faces. Experimental results demonstrate DiffTalker's superior performance in
producing clear and geometrically accurate talking faces, all without the need
for additional alignment between audio and image features.

Comment: submitted to ICASSP 202
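To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of the two agent networks as the abstract describes them: a transformer that completes facial landmarks from audio features, and a landmark-conditioned denoiser standing in for the diffusion face generator. Every class name, tensor shape, and interface here (LandmarkCompletionNet, LandmarkConditionedDenoiser, 80-dim audio features, 68 landmarks) is an illustrative assumption, not the paper's implementation.

```python
# Minimal sketch of the audio-and-landmark co-driven pipeline.
# All names, dimensions, and interfaces are assumptions for illustration.
import torch
import torch.nn as nn

class LandmarkCompletionNet(nn.Module):
    """Transformer that completes full facial landmarks from audio
    features plus a set of reference (identity) landmarks."""
    def __init__(self, audio_dim=80, d_model=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.lmk_proj = nn.Linear(2, d_model)   # each landmark is (x, y)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, 2)       # regress completed (x, y)

    def forward(self, audio_feat, ref_lmks):
        # audio_feat: (B, T, audio_dim); ref_lmks: (B, N, 2)
        tokens = torch.cat([self.audio_proj(audio_feat),
                            self.lmk_proj(ref_lmks)], dim=1)
        h = self.encoder(tokens)
        n = ref_lmks.shape[1]
        return self.head(h[:, -n:, :])          # (B, N, 2) completed landmarks

class LandmarkConditionedDenoiser(nn.Module):
    """Toy stand-in for the diffusion face generator: conditions on a
    rasterized landmark map by channel concatenation. A real system would
    use a pre-trained U-Net and embed the diffusion timestep t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, x_t, lmk_map, t):
        # t is ignored in this toy version; see class docstring.
        return self.net(torch.cat([x_t, lmk_map], dim=1))

# Smoke test with random tensors.
audio = torch.randn(1, 20, 80)       # e.g. 20 frames of 80-dim mel features
ref = torch.rand(1, 68, 2)           # 68 reference landmarks in [0, 1]
lmks = LandmarkCompletionNet()(audio, ref)          # (1, 68, 2)
noise_pred = LandmarkConditionedDenoiser()(
    torch.randn(1, 3, 64, 64), torch.zeros(1, 1, 64, 64), t=0)
print(lmks.shape, noise_pred.shape)
```

The design choice sketched here matches the abstract's claim that landmarks bridge the audio and image domains: audio only ever influences landmark coordinates, and the (stand-in) diffusion model sees landmarks rather than raw audio, so no extra audio-image feature alignment is needed.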