Co-speech gesture generation is crucial for automatic digital avatar
animation. However, existing methods suffer from issues such as unstable
training and temporal inconsistency, particularly in generating high-fidelity
and comprehensive gestures. Additionally, these methods lack effective control
over speaker identity and temporal editing of the generated gestures. Focusing
on capturing temporal latent information and enabling practical control, we
propose a Controllable Co-speech Gesture Generation framework, named C2G2.
Specifically, we propose a two-stage temporal dependency enhancement strategy
motivated by latent diffusion models. We further introduce two key features to
C2G2, namely a speaker-specific decoder to generate speaker-related real-length
skeletons and a repainting strategy for flexible gesture generation/editing.
Extensive experiments on benchmark gesture datasets verify the effectiveness of
our proposed C2G2 compared with several state-of-the-art baselines. The link of
the project demo page can be found at https://c2g2-gesture.github.io/c2_gestureComment: 12 pages, 6 figures, 7 table