15 research outputs found
Act As You Wish: Fine-Grained Control of Motion Diffusion Model with Hierarchical Semantic Graphs
Most text-driven human motion generation methods employ sequential modeling
approaches, e.g., transformers, to extract sentence-level text representations
automatically and implicitly for human motion synthesis. However, these compact
text representations may overemphasize the action names at the expense of other
important properties and lack fine-grained details to guide the synthesis of
subtly distinct motion. In this paper, we propose hierarchical semantic graphs
for fine-grained control over motion generation. Specifically, we disentangle
motion descriptions into hierarchical semantic graphs comprising three levels:
motions, actions, and specifics. Such a global-to-local structure facilitates a
comprehensive understanding of motion description and fine-grained control of
motion generation. Correspondingly, to leverage the coarse-to-fine topology of
hierarchical semantic graphs, we decompose the text-to-motion diffusion process
into three semantic levels, which correspond to capturing the overall motion,
local actions, and action specifics. Extensive experiments on two benchmark
human motion datasets, HumanML3D and KIT, demonstrate the efficacy of our
method with superior performance. More encouragingly, by modifying the edge
weights of hierarchical semantic graphs, our method can continuously refine the
generated motion, which may have a far-reaching impact on the community. Code
and pre-trained weights are available at https://github.com/jpthu17/GraphMotion.
Comment: Accepted by NeurIPS 2023.
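For concreteness, the three-level disentanglement can be pictured as a small weighted graph. The sketch below is only an illustration, with hypothetical names (`GraphNode`, `level`, per-edge `weight`) not taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class GraphNode:
    text: str                                       # text span this node covers
    level: str                                      # "motion" | "action" | "specific"
    children: list = field(default_factory=list)    # (child, edge_weight) pairs

    def add_child(self, child, weight=1.0):
        self.children.append((child, weight))

# "A person walks forward and waves the right hand."
motion = GraphNode("a person walks forward and waves", "motion")
walk = GraphNode("walks forward", "action")
wave = GraphNode("waves", "action")
motion.add_child(walk, weight=1.0)
motion.add_child(wave, weight=1.0)
wave.add_child(GraphNode("right hand", "specific"), weight=0.8)

# Lowering an edge weight (e.g., wave -> "right hand") would correspond to
# weakening that detail's influence on the generated motion.
```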
GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion Generation
In this paper, we propose a novel cascaded diffusion-based generative
framework for text-driven human motion synthesis, which exploits a strategy
named GradUally Enriching SyntheSis (GUESS). The strategy sets up generation
objectives by grouping semantically related body joints of a detailed skeleton
and replacing each joint group with a single body-part node. This operation
recursively abstracts a human pose into coarser and coarser skeletons at
multiple granularity levels. As the abstraction level increases, the human
motion becomes more concise and stable, which significantly benefits the
cross-modal motion synthesis
task. The whole text-driven human motion synthesis problem is then divided into
multiple abstraction levels and solved with a multi-stage generation framework
with a cascaded latent diffusion model: an initial generator first generates
the coarsest human motion guess from a given text description; then, a series
of successive generators gradually enrich the motion details based on the
textual description and the previously synthesized results. Notably, we further
integrate GUESS with the proposed dynamic multi-condition fusion mechanism to
dynamically balance the cooperative effects of the given textual condition and
synthesized coarse motion prompt in different generation stages. Extensive
experiments on large-scale datasets verify that GUESS outperforms existing
state-of-the-art methods by large margins in terms of accuracy, realism, and
diversity. Code is available at https://github.com/Xuehao-Gao/GUESS.
Comment: Accepted by IEEE Transactions on Visualization and Computer Graphics (2024).
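To illustrate the joint-grouping abstraction, a detailed skeleton can be collapsed into body-part nodes by averaging the 3D positions of each group. The grouping below is a plausible assumption for a 22-joint skeleton, not the paper's exact partition:

```python
import numpy as np

# Hypothetical grouping of a 22-joint skeleton into 6 body-part nodes.
GROUPS = {
    "torso":     [0, 3, 6, 9],
    "head":      [12, 15],
    "left_arm":  [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg":  [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}

def abstract_pose(joints: np.ndarray) -> np.ndarray:
    """Collapse a (J, 3) joint array into a (num_groups, 3) body-part array."""
    return np.stack([joints[idx].mean(axis=0) for idx in GROUPS.values()])

pose = np.random.randn(22, 3)      # one detailed pose
coarse = abstract_pose(pose)       # 6 body-part nodes
print(coarse.shape)                # (6, 3)
# Applying the same idea again (e.g., merging limbs into upper/lower body)
# yields an even coarser skeleton, giving the multi-granularity hierarchy.
```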
EE-TTS: Emphatic Expressive TTS with Linguistic Information
While current TTS systems perform well in synthesizing high-quality speech,
producing highly expressive speech remains a challenge. Emphasis, as a critical
factor in determining the expressiveness of speech, has attracted increasing
attention. Previous works usually enhance emphasis by adding intermediate
features, but they cannot guarantee the overall expressiveness of
the speech. To resolve this matter, we propose Emphatic Expressive TTS
(EE-TTS), which leverages multi-level linguistic information from syntax and
semantics. EE-TTS contains an emphasis predictor that can identify appropriate
emphasis positions from text and a conditioned acoustic model to synthesize
expressive speech with emphasis and linguistic information. Experimental
results indicate that EE-TTS outperforms the baseline with MOS improvements of
0.49 and 0.67 in expressiveness and naturalness, respectively. EE-TTS also shows
strong generalization across different datasets according to AB test results.
Comment: Accepted by Interspeech 2023.
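As a rough sketch of what an emphasis predictor might look like, assuming (the abstract does not specify this) that it is framed as token-level binary classification over text enriched with syntactic and semantic features:

```python
import torch
import torch.nn as nn

class EmphasisPredictor(nn.Module):
    """Toy token-level emphasis classifier: 1 = emphasize this token."""
    def __init__(self, vocab_size=10000, ling_dim=16, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden + ling_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, token_ids, ling_feats):
        # token_ids: (B, T); ling_feats: (B, T, ling_dim) syntax/semantic features
        x = torch.cat([self.embed(token_ids), ling_feats], dim=-1)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.head(h)).squeeze(-1)   # (B, T) emphasis probs

tokens = torch.randint(0, 10000, (2, 12))
feats = torch.randn(2, 12, 16)
probs = EmphasisPredictor()(tokens, feats)               # per-token emphasis
```

The predicted emphasis positions would then condition the acoustic model alongside the linguistic features.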
ASM: Adaptive Skinning Model for High-Quality 3D Face Modeling
The research fields of parametric face models and 3D face reconstruction have
been extensively studied. However, a critical question remains unanswered: how
to tailor the face model for specific reconstruction settings. We argue that
reconstruction with multi-view uncalibrated images demands a new model with
stronger capacity. Our study shifts attention from data-dependent 3D Morphable
Models (3DMM) to an understudied human-designed skinning model. We propose
Adaptive Skinning Model (ASM), which redefines the skinning model with more
compact and fully tunable parameters. With extensive experiments, we
demonstrate that ASM achieves significantly higher capacity than 3DMM, with the
additional advantages of a smaller model size and easy adaptation to new
topologies. We achieve state-of-the-art performance with ASM for multi-view
reconstruction on the Florence MICC Coop benchmark. Our quantitative analysis
demonstrates the importance of a high-capacity model for fully exploiting
abundant information from multi-view input in reconstruction. Furthermore, our
model with physical-semantic parameters can be directly utilized for real-world
applications, such as in-game avatar creation. As a result, our work opens up
new research directions for parametric face models and facilitates future
research on multi-view reconstruction.
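For background, a generic skinning model deforms a mesh by blending per-bone rigid transforms with per-vertex weights; ASM redefines how these quantities are parameterized. The snippet below sketches standard linear blend skinning only, not ASM's specific parameterization:

```python
import numpy as np

def linear_blend_skinning(verts, weights, transforms):
    """Standard LBS: v_i' = sum_k w_ik * (R_k @ v_i + t_k).

    verts:      (V, 3) rest-pose vertices
    weights:    (V, K) per-vertex bone weights, rows sum to 1
    transforms: (K, 3, 4) per-bone rigid transforms [R | t]
    """
    homo = np.concatenate([verts, np.ones((verts.shape[0], 1))], axis=1)  # (V, 4)
    per_bone = np.einsum('kij,vj->vki', transforms, homo)                 # (V, K, 3)
    return np.einsum('vk,vki->vi', weights, per_bone)                     # (V, 3)

V, K = 5023, 24
deformed = linear_blend_skinning(
    np.random.randn(V, 3),               # rest-pose vertices
    np.full((V, K), 1.0 / K),            # uniform weights, for illustration
    np.tile(np.eye(3, 4), (K, 1, 1)),    # identity transforms [I | 0]
)
```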
Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video
Synthesizing realistic videos from a given speech signal is still an open
challenge. Previous works have been plagued by issues such as inaccurate lip
shape generation and poor image quality. The key reason is that only the
motion and appearance of limited facial areas (e.g., the lip area) are mainly
driven by the input speech. Therefore, directly learning a mapping function from speech
to the entire head image is prone to ambiguity, particularly when using a short
video for training. We thus propose a decomposition-synthesis-composition
framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive
and speech-insensitive motion/appearance to facilitate effective learning from
limited training data, resulting in the generation of natural-looking videos.
First, given a fixed head pose (i.e., canonical space), we present a
speech-driven implicit model for lip image generation which concentrates on
learning speech-sensitive motion and appearance. Next, to model the major
speech-insensitive motion (i.e., head movement), we introduce a geometry-aware
mutual explicit mapping (GAMEM) module that establishes geometric mappings
between different head poses. This allows us to paste lip images generated in
the canonical space onto head images with arbitrary poses and synthesize
talking videos with natural head movements. In addition, a Blend-Net and a
contrastive sync loss are introduced to enhance the overall synthesis
performance. Quantitative and qualitative results on three benchmarks
demonstrate that our model can be trained on a video of just a few minutes in
length and achieves state-of-the-art performance in both visual quality and
speech-visual synchronization. Code: https://github.com/CVMI-Lab/Speech2Lip
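The composition step can be pictured as pasting a lip crop generated in the canonical space back onto a posed head frame through a geometric mapping. The toy sketch below uses a simple bounding-box paste as a stand-in for the GAMEM mapping and Blend-Net refinement; shapes and names are illustrative assumptions:

```python
import numpy as np

def compose_frame(head_img, lip_crop, lip_box):
    """Paste a generated lip crop into a posed head frame.

    head_img: (H, W, 3) frame with the desired head pose
    lip_crop: (h, w, 3) lip image synthesized in the canonical space
    lip_box:  (top, left) location of the lip region in head_img, e.g. as
              predicted by a geometry-aware mapping between poses
    """
    out = head_img.copy()
    top, left = lip_box
    h, w, _ = lip_crop.shape
    out[top:top + h, left:left + w] = lip_crop
    return out

frame = np.zeros((256, 256, 3), dtype=np.uint8)
lips = np.full((48, 96, 3), 255, dtype=np.uint8)
composed = compose_frame(frame, lips, lip_box=(180, 80))
```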
NOFA: NeRF-based One-shot Facial Avatar Reconstruction
3D facial avatar reconstruction has been a significant research topic in
computer graphics and computer vision, where photo-realistic rendering and
flexible controls over poses and expressions are necessary for many related
applications. Recently, its performance has been greatly improved with the
development of neural radiance fields (NeRF). However, most existing NeRF-based
facial avatars focus on subject-specific reconstruction and reenactment,
requiring multi-shot images containing different views of the specific subject
for training, and the learned model cannot generalize to new identities,
limiting its further applications. In this work, we propose a one-shot 3D
facial avatar reconstruction framework that only requires a single source image
to reconstruct a high-fidelity 3D facial avatar. For the challenges of lacking
generalization ability and missing multi-view information, we leverage the
generative prior of a 3D GAN and develop an efficient encoder-decoder network to
reconstruct the canonical neural volume of the source image, and further
propose a compensation network to complement facial details. To enable
fine-grained control over facial dynamics, we propose a deformation field to
warp the canonical volume into driven expressions. Through extensive
experimental comparisons, we achieve superior synthesis results compared to
several state-of-the-art methods.
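A deformation field of the kind described above can be sketched as a small MLP that predicts an offset for each canonical 3D point conditioned on an expression code. Layer sizes and dimensions below are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Maps (canonical point, expression code) -> warped 3D point."""
    def __init__(self, expr_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + expr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, points, expr):
        # points: (N, 3) samples in the canonical volume, expr: (expr_dim,)
        expr = expr.unsqueeze(0).expand(points.shape[0], -1)
        return points + self.mlp(torch.cat([points, expr], dim=-1))

pts = torch.rand(1024, 3)
code = torch.randn(64)
driven_pts = DeformationField()(pts, code)   # points warped into the driven expression
```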
Understanding Medical Conversations with Scattered Keyword Attention and Weak Supervision from Responses
In this work, we consider the medical slot filling problem, i.e., the challenging task of converting medical queries into structured representations. We analyze the effectiveness of two sources of information: scattered keywords in user utterances and weak supervision from responses. We approach medical slot filling as a multi-label classification problem with a label-embedding attentive model that pays more attention to scattered medical keywords, and we learn the classification models with weak supervision from responses. To evaluate the approach, we annotate a medical slot filling dataset and collect a large-scale unlabeled dataset. The experiments demonstrate that both sources of information are effective in improving the task.
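A label-embedding attentive classifier of this kind can be sketched as follows: each slot label owns an embedding that attends over token representations, and the attended summary is scored per label. Hidden sizes, the scoring function, and all names are assumptions for illustration, not the paper's exact model:

```python
import torch
import torch.nn as nn

class LabelAttentionClassifier(nn.Module):
    """Multi-label slot classifier with per-label attention over tokens."""
    def __init__(self, vocab_size=8000, num_labels=30, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.label_emb = nn.Parameter(torch.randn(num_labels, dim))
        self.score = nn.Linear(dim, 1)

    def forward(self, token_ids):
        tokens = self.embed(token_ids)                             # (B, T, D)
        attn = torch.softmax(
            torch.einsum('btd,ld->blt', tokens, self.label_emb), dim=-1)
        per_label = torch.einsum('blt,btd->bld', attn, tokens)     # (B, L, D)
        return torch.sigmoid(self.score(per_label)).squeeze(-1)    # (B, L) probs

probs = LabelAttentionClassifier()(torch.randint(0, 8000, (4, 20)))
# Weak supervision from responses would supply (noisy) multi-hot targets
# for a binary cross-entropy loss over these probabilities.
```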
Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model
Text-guided motion synthesis aims to generate 3D human motion that not only precisely reflects the textual description but also reveals as much motion detail as possible. Pioneering methods explore the diffusion model for text-to-motion synthesis and achieve significant improvements. However, these methods conduct diffusion processes either on the raw data distribution or in a low-dimensional latent space, and thus typically suffer from modality inconsistency or a lack of detail. To tackle this problem, we propose a novel Basic-to-Advanced Hierarchical Diffusion Model, named B2A-HDM, that collaboratively exploits low-dimensional and high-dimensional diffusion models for high-quality, detailed motion synthesis. Specifically, the basic diffusion model in the low-dimensional latent space provides an intermediate denoising result that is consistent with the textual description, while the advanced diffusion model in the high-dimensional latent space focuses on the subsequent detail-enhancing denoising process. Besides, we introduce a multi-denoiser framework for the advanced diffusion model to ease the learning of the high-dimensional model and to fully explore the generative potential of the diffusion model. Quantitative and qualitative experimental results on two text-to-motion benchmarks (HumanML3D and KIT-ML) demonstrate that B2A-HDM outperforms existing state-of-the-art methods in terms of fidelity, modality consistency, and diversity.
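At the level of the sampling loop, the basic-to-advanced split might be sketched as follows; the function names, the way the multi-denoiser stage partitions timesteps, and the conditioning interface are assumptions, not details given in the abstract:

```python
import torch

def b2a_sample(text_emb, basic_denoiser, advanced_denoisers,
               low_dim=64, high_dim=512, steps=50):
    """Two-stage sampling sketch: basic (low-dim) then advanced (high-dim)."""
    # Stage 1: basic diffusion in the low-dimensional latent space yields a
    # text-consistent intermediate result.
    z_low = torch.randn(1, low_dim)
    for t in reversed(range(steps)):
        z_low = basic_denoiser(z_low, t, text_emb)

    # Stage 2: advanced diffusion in the high-dimensional latent space,
    # conditioned on the text and the basic stage's result. We assume each
    # denoiser handles one segment of the timestep range.
    z_high = torch.randn(1, high_dim)
    seg = steps // len(advanced_denoisers)
    for t in reversed(range(steps)):
        denoiser = advanced_denoisers[min(t // seg, len(advanced_denoisers) - 1)]
        z_high = denoiser(z_high, t, text_emb, z_low)
    return z_high

# Placeholder denoisers with the assumed call signature, for illustration only.
basic = lambda z, t, c: 0.99 * z
advanced = [lambda z, t, c, prior: 0.99 * z for _ in range(3)]
out = b2a_sample(torch.randn(1, 256), basic, advanced)
```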
Maximum Entropy Population-Based Training for Zero-Shot Human-AI Coordination
We study the problem of training a Reinforcement Learning (RL) agent that can collaborate with humans without using human data. Although such agents can be obtained through self-play training, they can suffer significantly from distributional shift when paired with unencountered partners, such as humans. In this paper, we propose Maximum Entropy Population-based training (MEP) to mitigate such distributional shift. In MEP, agents in the population are trained with our derived Population Entropy bonus to promote both the pairwise diversity between agents and the individual diversity of the agents themselves. After obtaining this diversified population, a common best agent is trained by pairing with agents in this population via prioritized sampling, where the prioritization is dynamically adjusted based on the training progress. We demonstrate the effectiveness of our method MEP in comparison with Self-Play PPO (SP), Population-Based Training (PBT), Trajectory Diversity (TrajeDi), and Fictitious Co-Play (FCP) in both matrix game and Overcooked game environments, with partners being human proxy models and real humans. A supplementary video showing experimental results is available at https://youtu.be/Xh-FKD0AAKE
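Under a natural reading of the abstract, the Population Entropy bonus can be sketched as the entropy of the population's average action distribution at a state; the array shapes and estimator below are assumptions for illustration:

```python
import numpy as np

def population_entropy_bonus(action_probs):
    """Entropy of the population's mean policy at a state.

    action_probs: (n_agents, n_actions) action distributions of each agent
                  in the population for the same state.
    """
    mean_policy = action_probs.mean(axis=0)
    return -np.sum(mean_policy * np.log(mean_policy + 1e-8))

# Two agents that act identically earn a smaller bonus than two that differ,
# so maximizing the bonus pushes the population toward diverse behavior.
same = np.array([[0.9, 0.1], [0.9, 0.1]])
diverse = np.array([[0.9, 0.1], [0.1, 0.9]])
print(population_entropy_bonus(same) < population_entropy_bonus(diverse))  # True
```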