Generating stylized talking heads with diverse head motions is crucial for achieving natural-looking videos, yet it remains challenging. Previous works either adopt a regression-based method to capture the speaking style, resulting in a coarse style averaged over all training data, or employ a single universal network to synthesize videos in different styles, which causes suboptimal performance. To address these issues, we propose a novel dynamic-weight method, namely Say Anything with Any Style (SAAS), which queries the discrete style representation via a generative model with a learned style codebook.
Specifically, we develop a multi-task VQ-VAE that incorporates three closely related tasks to learn a style codebook serving as a prior for style extraction. This discrete prior, together with the generative model, improves the precision and robustness of extracting the speaking style from a given style clip.
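As a rough illustration of how a learned style codebook turns a continuous style feature into a discrete representation, the following sketch (not the authors' implementation; all names and dimensions such as num_codes and style_dim are assumptions) snaps an encoder output to its nearest codebook entry, as in a standard VQ-VAE:

```python
import torch
import torch.nn as nn

class StyleCodebook(nn.Module):
    """Minimal VQ-style lookup: quantize a continuous style feature to a learned codebook."""
    def __init__(self, num_codes: int = 256, style_dim: int = 128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, style_dim)

    def forward(self, style_feat: torch.Tensor) -> torch.Tensor:
        # style_feat: (batch, style_dim) continuous feature from a style encoder
        dist = torch.cdist(style_feat, self.codebook.weight)  # (batch, num_codes)
        idx = dist.argmin(dim=-1)                             # index of nearest code
        quantized = self.codebook(idx)                        # discrete style prior
        # straight-through estimator so gradients still reach the encoder
        return style_feat + (quantized - style_feat).detach()
```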
Utilizing the extracted style, a residual architecture comprising a canonical branch and a style-specific branch is employed to predict mouth shapes conditioned on any driving audio while transferring the speaking style from the source to any desired one. To adapt to different speaking styles, we steer clear of a single universal network and instead design an elaborate HyperStyle module that produces style-specific weight offsets for the style branch.
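The sketch below illustrates the general idea of a hypernetwork producing per-style weight offsets for a style branch that is combined residually with a canonical branch; the single linear layer and all dimensions are simplifying assumptions rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

class StyleHyperBranch(nn.Module):
    """Canonical branch + style branch whose weights are offset by a hypernetwork."""
    def __init__(self, audio_dim: int = 256, style_dim: int = 128, out_dim: int = 64):
        super().__init__()
        self.canonical = nn.Linear(audio_dim, out_dim)            # style-agnostic branch
        self.base_weight = nn.Parameter(torch.zeros(out_dim, audio_dim))
        # hypernetwork: maps a style code to a weight offset for the style branch
        self.hyper = nn.Linear(style_dim, out_dim * audio_dim)

    def forward(self, audio_feat: torch.Tensor, style_code: torch.Tensor) -> torch.Tensor:
        # audio_feat: (batch, audio_dim); style_code: (batch, style_dim)
        offset = self.hyper(style_code).view(-1, *self.base_weight.shape)
        # per-sample styled weights = shared base weights + predicted offset
        styled = torch.einsum('boi,bi->bo', self.base_weight + offset, audio_feat)
        return self.canonical(audio_feat) + styled                # residual combination
```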
Furthermore, we construct a pose generator and a pose codebook that stores quantized pose representations, allowing us to sample diverse head motions aligned with both the audio and the extracted style.
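As a hedged illustration of sampling head motion from a pose codebook, the sketch below conditions a simple generator on concatenated audio and style features and samples a codebook index rather than taking the argmax, which is what yields diverse motions; the MLP generator, the 6-DoF pose decoder, and all sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class PoseSampler(nn.Module):
    """Sample a head pose by drawing an index from a learned pose codebook."""
    def __init__(self, num_codes: int = 512, code_dim: int = 64,
                 cond_dim: int = 384, pose_dim: int = 6):
        super().__init__()
        self.pose_codebook = nn.Embedding(num_codes, code_dim)
        self.generator = nn.Sequential(nn.Linear(cond_dim, 256), nn.ReLU(),
                                       nn.Linear(256, num_codes))
        self.decoder = nn.Linear(code_dim, pose_dim)              # code -> head pose

    @torch.no_grad()
    def sample(self, audio_feat: torch.Tensor, style_code: torch.Tensor,
               temperature: float = 1.0) -> torch.Tensor:
        cond = torch.cat([audio_feat, style_code], dim=-1)        # (batch, cond_dim)
        logits = self.generator(cond) / temperature
        idx = torch.multinomial(logits.softmax(dim=-1), 1).squeeze(-1)
        return self.decoder(self.pose_codebook(idx))              # sampled head pose
```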
Experiments demonstrate that our approach surpasses state-of-the-art methods in terms of both lip synchronization and stylized expression. Moreover, we extend SAAS to video-driven style editing and achieve satisfactory performance.