Zero-shot talking avatar generation aims to synthesize natural talking
videos from speech and a single portrait image. Previous methods have relied on
domain-specific heuristics such as warping-based motion representation and 3D
Morphable Models, which limit the naturalness and diversity of the generated
avatars. In this work, we introduce GAIA (Generative AI for Avatar), which
eliminates such domain priors from talking avatar generation. Motivated by the
observation that speech drives only the motion of the avatar, while the
avatar's appearance and the background typically remain unchanged
throughout the video, we divide our approach into two stages: 1)
disentangling each frame into motion and appearance representations; 2)
generating motion sequences conditioned on the speech and the reference portrait
image. We collect a large-scale, high-quality talking avatar dataset and train
the model on it at different scales (up to 2B parameters). Experimental
results verify the superiority, scalability, and flexibility of GAIA: 1) the
resulting model outperforms previous baselines in terms of naturalness,
diversity, lip-sync quality, and visual quality; 2) the framework is scalable,
since larger models yield better results; 3) it is general and enables
different applications such as controllable talking avatar generation and
text-instructed avatar generation.

ICLR 2024. Project page: https://microsoft.github.io/GAIA
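To make the two-stage formulation above concrete, below is a minimal, purely illustrative Python sketch of the inference flow. All names (AppearanceMotionAutoencoder, SpeechToMotionGenerator, generate_talking_video), tensor shapes, and latent sizes are hypothetical placeholders rather than the authors' released code, and both models are stubbed out; the sketch only mirrors the stated design, in which appearance is encoded once from the portrait while speech drives the per-frame motion codes.

```python
import numpy as np


class AppearanceMotionAutoencoder:
    """Stage 1 (illustrative): disentangle a frame into appearance and motion latents."""

    def encode(self, frame: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
        # Placeholder: a trained encoder would return (appearance_latent, motion_latent).
        return np.zeros(512), np.zeros(128)

    def decode(self, appearance: np.ndarray, motion: np.ndarray) -> np.ndarray:
        # Placeholder: render one frame from a fixed appearance code and one motion code.
        return np.zeros((256, 256, 3), dtype=np.uint8)


class SpeechToMotionGenerator:
    """Stage 2 (illustrative): sample a motion-latent sequence conditioned on speech."""

    def sample(self, speech_features: np.ndarray, reference_motion: np.ndarray) -> np.ndarray:
        # Placeholder: a generative model would sample one motion latent per output
        # frame, conditioned on the speech features and the reference pose.
        num_frames = len(speech_features)
        return np.zeros((num_frames, 128))


def generate_talking_video(portrait: np.ndarray, speech_features: np.ndarray) -> list[np.ndarray]:
    """Zero-shot inference sketch: one portrait image + speech features -> video frames."""
    autoencoder = AppearanceMotionAutoencoder()
    motion_model = SpeechToMotionGenerator()

    # Appearance comes from the single reference portrait and stays fixed for the whole video.
    appearance, reference_motion = autoencoder.encode(portrait)

    # Motion is driven by the speech, conditioned additionally on the reference pose.
    motion_sequence = motion_model.sample(speech_features, reference_motion)

    # Re-compose each frame from the fixed appearance and the generated motion codes.
    return [autoencoder.decode(appearance, motion) for motion in motion_sequence]


if __name__ == "__main__":
    portrait = np.zeros((256, 256, 3), dtype=np.uint8)  # the single reference image
    speech = np.zeros((100, 80))                        # e.g. 100 frames of speech features
    frames = generate_talking_video(portrait, speech)
    print(f"generated {len(frames)} frames")
```

The point the sketch reflects is the disentanglement assumption stated in the abstract: the appearance code is extracted once and held constant, while only the motion codes vary over time under the speech condition.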