Gestures are non-verbal but important behaviors accompanying people's speech.
While previous methods can generate gestures synchronized with speech rhythm,
the generated gestures generally lack the semantic context of the speech.
Although semantic gestures occur infrequently in human speech, they are key
for the audience to understand the speech context in a more immersive
environment. Hence, we introduce LivelySpeaker, a
framework that realizes semantics-aware co-speech gesture generation and offers
several control handles. In particular, our method decouples the task into two
stages: script-based gesture generation and audio-guided rhythm refinement.
Specifically, the script-based gesture generation leverages pre-trained
CLIP text embeddings as guidance for generating gestures that are highly
semantically aligned with the script. Then, we devise a simple but effective
diffusion-based gesture generation backbone built purely from MLPs, which is
conditioned only on audio signals and learns to gesticulate with realistic
motions. We use this powerful prior to align the rhythm of the script-guided
gestures with the audio signals, notably in a zero-shot setting. Our novel two-stage
generation framework also enables several applications, such as changing the
gesticulation style, editing the co-speech gestures via textual prompting, and
controlling the semantic awareness and rhythm alignment with guided diffusion.
Extensive experiments demonstrate the advantages of the proposed framework over
competing methods. In addition, our core diffusion-based generative model also
achieves state-of-the-art performance on two benchmarks. The code and model
will be released to facilitate future research.

Comment: Accepted by ICCV 202