MAGMA: Music Aligned Generative Motion Autodecoder
Mapping music to dance is a challenging problem that requires spatial and
temporal coherence along with continual synchronization with the music's
progression. Taking inspiration from large language models, we introduce a
2-step approach for generating dance using a Vector Quantized-Variational
Autoencoder (VQ-VAE) to distill motion into primitives and train a Transformer
decoder to learn the correct sequencing of these primitives. We also evaluate
the importance of music representations by comparing naive music feature
extraction using Librosa to deep audio representations generated by
state-of-the-art audio compression algorithms. Additionally, we train
variations of the motion generator using relative and absolute positional
encodings to determine their effect on motion quality when generating
arbitrarily long sequences. Our proposed approach achieves
state-of-the-art results in music-to-motion generation benchmarks and enables
the real-time generation of considerably longer motion sequences, the ability
to chain multiple motion sequences seamlessly, and easy customization of motion
sequences to meet style requirements.
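The abstract outlines the two-stage pipeline but no implementation details, so the following is only a minimal sketch of the idea: a VQ-VAE codebook lookup that distills motion into discrete primitives, and an autoregressive Transformer decoder that learns their sequencing while cross-attending to per-frame music features (Librosa MFCCs and chroma shown as the naive baseline). All module names, dimensions, and hyperparameters are illustrative assumptions, not the authors' code.

```python
import numpy as np
import torch
import torch.nn as nn

def librosa_music_features(path, hop=512):
    # Hypothetical "naive" baseline from the abstract: per-frame MFCCs and
    # chroma stacked into a (frames, 32) feature matrix.
    import librosa
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop)          # (20, F)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop) # (12, F)
    return np.concatenate([mfcc, chroma]).T                          # (F, 32)

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient:
    the step that distills continuous motion features into primitives."""
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                  # z: (B, T, dim)
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2.0 * z @ self.codebook.weight.T
             + self.codebook.weight.pow(2).sum(-1))        # (B, T, codes)
        idx = d.argmin(-1)                                 # primitive ids
        zq = self.codebook(idx)
        return z + (zq - z).detach(), idx                  # straight-through

class MotionPrior(nn.Module):
    """Autoregressive Transformer decoder that learns the sequencing of
    motion primitives, cross-attending to projected music features."""
    def __init__(self, num_codes=512, dim=256, music_dim=32):
        super().__init__()
        self.tok = nn.Embedding(num_codes, dim)
        self.music_proj = nn.Linear(music_dim, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, codes, music):   # codes: (B, T); music: (B, F, music_dim)
        x = self.tok(codes)
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), 1)
        h = self.decoder(x, self.music_proj(music), tgt_mask=causal)
        return self.head(h)            # logits over the next primitive
```

At inference, primitives would be sampled token by token and mapped back to poses by the VQ-VAE decoder, which is what makes arbitrarily long and seamlessly chainable sequences possible in this kind of design.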
SemanticBoost: Elevating Motion Generation with Augmented Textual Cues
Current techniques face difficulties in generating motions from intricate
semantic descriptions, primarily due to insufficient semantic annotations in
datasets and weak contextual understanding. To address these issues, we present
SemanticBoost, a novel framework that tackles both challenges simultaneously.
Our framework comprises a Semantic Enhancement module and a Context-Attuned
Motion Denoiser (CAMD). The Semantic Enhancement module extracts supplementary
semantics from motion data, enriching the dataset's textual description and
ensuring precise alignment between text and motion data without depending on
large language models. The CAMD module, in turn, provides an
all-encompassing solution for generating high-quality, semantically consistent
motion sequences by effectively capturing context information and aligning the
generated motion with the given textual descriptions. Distinct from existing
methods, our approach can synthesize accurate orientational movements, combined
motions based on specific body part descriptions, and motions generated from
complex, extended sentences. Our experimental results demonstrate that
SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based
techniques, achieving cutting-edge performance on the HumanML3D dataset while
maintaining realistic and smooth motion generation quality.
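The abstract does not expose CAMD's internals, so the sketch below shows only the generic shape of a text-conditioned motion diffusion model: a Transformer that predicts the noise added to a motion sequence while cross-attending to token embeddings from any pretrained sentence encoder. All names and dimensions are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TextConditionedDenoiser(nn.Module):
    """Generic diffusion denoiser for motion: predicts the noise eps that
    was added to a motion sequence x_t, conditioned on text embeddings."""
    def __init__(self, motion_dim=263, dim=256, text_dim=512):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim, dim)
        self.step_embed = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.text_proj = nn.Linear(text_dim, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=4)
        self.out = nn.Linear(dim, motion_dim)

    def forward(self, x_t, t, text_tokens):
        # x_t: (B, T, motion_dim) noisy motion; t: (B,) diffusion step;
        # text_tokens: (B, L, text_dim) from any pretrained text encoder.
        h = self.in_proj(x_t) + self.step_embed(t.float()[:, None, None])
        return self.out(self.blocks(h, self.text_proj(text_tokens)))

# One DDPM-style training step: corrupt clean motion x0, regress the noise.
def training_step(model, x0, text_tokens, alphas_cumprod):
    t = torch.randint(0, len(alphas_cumprod), (x0.size(0),))
    a = alphas_cumprod[t][:, None, None]
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    return nn.functional.mse_loss(model(x_t, t, text_tokens), eps)
```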
Robust Motion In-betweening
In this work we present a novel, robust transition generation technique that
can serve as a new tool for 3D animators, based on adversarial recurrent neural
networks. The system synthesizes high-quality motions that use
temporally-sparse keyframes as animation constraints. This is reminiscent of
the job of in-betweening in traditional animation pipelines, in which an
animator draws motion frames between provided keyframes. We first show that a
state-of-the-art motion prediction model cannot be easily converted into a
robust transition generator simply by adding conditioning information about
future keyframes. To solve this problem, we then propose two novel additive
embedding modifiers that are applied at each timestep to latent representations
encoded inside the network's architecture. One modifier is a time-to-arrival
embedding that allows variations of the transition length with a single model.
The other is a scheduled target noise vector that allows the system to be
robust to target distortions and to sample different transitions given fixed
keyframes. To qualitatively evaluate our method, we present a custom
MotionBuilder plugin that uses our trained model to perform in-betweening in
production scenarios. To quantitatively evaluate performance on transitions and
generalizations to longer time horizons, we present well-defined in-betweening
benchmarks on a subset of the widely used Human3.6M dataset and on LaFAN1, a
novel high-quality motion capture dataset that is more appropriate for
transition generation. We are releasing this new dataset along with this work,
with accompanying code for reproducing our baseline results.Comment: Published at SIGGRAPH 202