Dancing to music has been an innate human ability since ancient times. In
machine learning research, however, synthesizing dance movements from music
remains a challenging problem. Recently, researchers have synthesized human motion
sequences with autoregressive models such as recurrent neural networks (RNNs). Such an
approach often generates short sequences due to an accumulation of prediction
errors that are fed back into the neural network. This problem becomes even
more severe in long motion sequence generation. Moreover, the consistency
between dance and music in terms of style, rhythm, and beat has yet to be taken
into account during modeling. In this paper, we formalize music-driven
dance generation as a sequence-to-sequence learning problem and devise a novel
seq2seq architecture to efficiently process long sequences of music features
and capture the fine-grained correspondence between music and dance.
Furthermore, to alleviate the error accumulation of autoregressive models in long
motion sequence generation, we propose a novel curriculum learning strategy that
gently shifts the training process from a fully guided teacher-forcing scheme,
which uses the previous ground-truth movements, towards a less guided autoregressive
scheme that mostly uses the generated movements instead (see the sketch below).
Extensive experiments show
that our approach significantly outperforms existing state-of-the-art methods on
both automatic metrics and human evaluation. We also provide a demo video in the
supplementary material to demonstrate the superior performance of our proposed
approach.
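As an illustration of the kind of curriculum described above, the following is a minimal PyTorch-style sketch, not the paper's actual implementation. It assumes a simple linear per-epoch schedule that gradually increases the probability of feeding the decoder its own predicted movement instead of the ground truth; all names (ToyDanceDecoder, autoregressive_prob), dimensions, and the scheduling rule are hypothetical placeholders.

import random
import torch
import torch.nn as nn

# Hypothetical curriculum schedule: the probability of feeding back the model's
# own prediction (instead of the ground truth) grows as training progresses.
# The linear ramp and all names below are illustrative, not the paper's exact rule.
def autoregressive_prob(epoch, total_epochs):
    return min(1.0, epoch / max(1, total_epochs - 1))

class ToyDanceDecoder(nn.Module):
    # Toy autoregressive decoder: predicts the next pose frame from the previous
    # pose and a per-step music feature (dimensions are placeholders).
    def __init__(self, pose_dim=63, music_dim=32, hidden_dim=128):
        super().__init__()
        self.cell = nn.GRUCell(pose_dim + music_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, music_feats, gt_poses, p_autoregressive):
        # music_feats: (T, B, music_dim); gt_poses: (T, B, pose_dim)
        T, B, _ = gt_poses.shape
        h = gt_poses.new_zeros(B, self.cell.hidden_size)
        prev = gt_poses[0]
        preds = []
        for t in range(T - 1):
            h = self.cell(torch.cat([prev, music_feats[t]], dim=-1), h)
            pred = self.out(h)
            preds.append(pred)
            # Curriculum step: with probability p_autoregressive, feed the model's
            # own output back in; otherwise use teacher forcing (ground truth).
            prev = pred if random.random() < p_autoregressive else gt_poses[t + 1]
        return torch.stack(preds)  # predictions for frames 1..T-1

# Minimal training loop showing where the per-epoch schedule enters.
decoder = ToyDanceDecoder()
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)
criterion = nn.MSELoss()
total_epochs = 50
for epoch in range(total_epochs):
    p = autoregressive_prob(epoch, total_epochs)
    music = torch.randn(120, 8, 32)   # dummy music features (T, B, music_dim)
    poses = torch.randn(120, 8, 63)   # dummy ground-truth poses (T, B, pose_dim)
    pred = decoder(music, poses, p)
    loss = criterion(pred, poses[1:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In this sketch the curriculum only changes which input is fed back at each decoding step; the actual architecture, schedule, and losses would follow the paper's design.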