Most text-driven human motion generation methods employ sequential modeling
approaches, e.g., transformer, to extract sentence-level text representations
automatically and implicitly for human motion synthesis. However, these compact
text representations may overemphasize the action names at the expense of other
important properties and lack fine-grained details to guide the synthesis of
subtly distinct motion. In this paper, we propose hierarchical semantic graphs
for fine-grained control over motion generation. Specifically, we disentangle
motion descriptions into hierarchical semantic graphs including three levels of
motions, actions, and specifics. Such global-to-local structures facilitate a
comprehensive understanding of motion description and fine-grained control of
motion generation. Correspondingly, to leverage the coarse-to-fine topology of
hierarchical semantic graphs, we decompose the text-to-motion diffusion process
into three semantic levels, which correspond to capturing the overall motion,
local actions, and action specifics. Extensive experiments on two benchmark
human motion datasets, including HumanML3D and KIT, with superior performances,
justify the efficacy of our method. More encouragingly, by modifying the edge
weights of hierarchical semantic graphs, our method can continuously refine the
generated motion, which may have a far-reaching impact on the community. Code
and pre-training weights are available at
https://github.com/jpthu17/GraphMotion.Comment: Accepted by NeurIPS 202