Human Motion Generation: A Survey
Human motion generation aims to generate natural human pose sequences and
shows immense potential for real-world applications. Substantial progress has
been made recently in motion data collection technologies and generation
methods, laying the foundation for increasing interest in human motion
generation. Most research within this field focuses on generating human motions
based on conditional signals, such as text, audio, and scene contexts. While
significant advancements have been made in recent years, the task continues to
pose challenges due to the intricate nature of human motion and its implicit
relationship with conditional signals. In this survey, we present a
comprehensive literature review of human motion generation, which, to the best
of our knowledge, is the first of its kind in this field. We begin by
introducing the background of human motion and generative models, followed by
an examination of representative methods for three mainstream sub-tasks:
text-conditioned, audio-conditioned, and scene-conditioned human motion
generation. Additionally, we provide an overview of common datasets and
evaluation metrics. Lastly, we discuss open problems and outline potential
future research directions. We hope that this survey could provide the
community with a comprehensive glimpse of this rapidly evolving field and
inspire novel ideas that address the outstanding challenges.
Comment: 20 pages, 5 figures
A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
Gestures that accompany speech are an essential part of natural and efficient
embodied human communication. The automatic generation of such co-speech
gestures is a long-standing problem in computer animation and is considered an
enabling technology in film, games, virtual social spaces, and for interaction
with social robots. The problem is made challenging by the idiosyncratic and
non-periodic nature of human co-speech gesture motion, and by the great
diversity of communicative functions that gestures encompass. Gesture
generation has seen surging interest recently, owing to the emergence of more
and larger datasets of human gesture motion, combined with strides in
deep-learning-based generative models that benefit from the growing
availability of data. This review article summarizes co-speech gesture
generation research, with a particular focus on deep generative models. First,
we articulate the theory describing human gesticulation and how it complements
speech. Next, we briefly discuss rule-based and classical statistical gesture
synthesis, before delving into deep learning approaches. We employ the choice
of input modalities as an organizing principle, examining systems that generate
gestures from audio, text, and non-linguistic input. We also chronicle the
evolution of the related training data sets in terms of size, diversity, motion
quality, and collection method. Finally, we identify key research challenges in
gesture generation, including data availability and quality; producing
human-like motion; grounding the gesture in the co-occurring speech in
interaction with other speakers, and in the environment; performing gesture
evaluation; and integrating gesture synthesis into applications. We
highlight recent approaches to tackling the various key challenges, as well as
the limitations of these approaches, and point toward areas of future
development.
Comment: Accepted for EUROGRAPHICS 2023
AI-generated Content for Various Data Modalities: A Survey
AI-generated content (AIGC) methods aim to produce text, images, videos, 3D
assets, and other media using AI algorithms. Due to its wide range of
applications and the demonstrated potential of recent works, AIGC developments
have recently attracted significant attention, and AIGC methods have been
developed for various data modalities, such as image, video, text, 3D shape (as
voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human
avatar (body and head), 3D motion, and audio -- each presenting different
characteristics and challenges. Furthermore, there have also been many
significant developments in cross-modality AIGC methods, where generative
methods can receive conditioning input in one modality and produce outputs in
another. Examples include going from various modalities to image, video, 3D
shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar),
and audio modalities. In this paper, we provide a comprehensive review of AIGC
methods across different data modalities, including both single-modality and
cross-modality methods, highlighting the various challenges, representative
works, and recent technical directions in each setting. We also survey the
representative datasets throughout the modalities, and present comparative
results for various modalities. Moreover, we discuss the challenges and
potential future research directions.
Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model
The generation of co-speech gestures for digital humans is an emerging area
in the field of virtual human creation. Prior research has made progress by
using acoustic and semantic information as input and adopting classification
methods to identify a speaker's identity and emotion for driving co-speech
gesture generation. However, this endeavour still faces significant
challenges. These challenges go beyond the intricate interplay between
co-speech gestures, speech acoustics, and semantics; they also encompass the
complexities associated with personality, emotion, and other subtle but
important factors. This paper introduces "diffmotion-v2," a
speech-conditional, diffusion-based, non-autoregressive transformer generative
model built on the pre-trained WavLM model. It can produce individual,
stylized full-body co-speech gestures using only raw speech audio, eliminating
the need for complex multimodal processing and manual annotation. Firstly,
considering that speech audio not only
contains acoustic and semantic features but also conveys personality traits,
emotions, and more subtle information related to accompanying gestures, we
pioneer the adaptation of WavLM, a large-scale pre-trained model, to extract
low-level and high-level audio information. Secondly, we introduce an adaptive
layer norm architecture in the transformer layers to learn the relationship
between speech information and accompanying gestures. Extensive subjective
evaluations on the Trinity, ZEGGS, and BEAT datasets confirm the effectiveness
of WavLM and the model's ability to synthesize natural co-speech gestures in
various styles.
Comment: 10 pages, 5 figures, 1 table
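The paper's two key ingredients, pre-trained speech features and adaptive
layer norm conditioning in a transformer block, can be sketched roughly as
follows. This is a minimal illustration assuming PyTorch and the Hugging Face
`WavLMModel`; the layer sizes, the pooling step, and the `AdaLNBlock` helper
are hypothetical stand-ins, not the authors' implementation:

```python
import torch
import torch.nn as nn
from transformers import WavLMModel  # pre-trained speech encoder

class AdaLNBlock(nn.Module):
    """Transformer block whose LayerNorm scale/shift are predicted
    from a speech-conditioning vector (adaptive layer norm)."""
    def __init__(self, dim: int, cond_dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # cond: (batch, cond_dim) pooled speech features
        scale, shift = self.to_scale_shift(cond).unsqueeze(1).chunk(2, dim=-1)
        h = self.norm(x) * (1 + scale) + shift          # adaptive LN
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm(x) * (1 + scale) + shift)

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base")
wavlm.eval()
waveform = torch.randn(1, 16000)                        # 1 s of 16 kHz audio
with torch.no_grad():
    feats = wavlm(waveform).last_hidden_state           # (1, frames, 768)
cond = feats.mean(dim=1)                                # crude pooled conditioning
block = AdaLNBlock(dim=256, cond_dim=768)
gesture_tokens = torch.randn(1, 60, 256)                # stand-in pose sequence
out = block(gesture_tokens, cond)                       # (1, 60, 256)
```

In the actual system such a block would sit inside the diffusion denoiser;
here it only demonstrates how a single conditioning vector can modulate every
normalization in the gesture decoder.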
A Survey on Generative Diffusion Model
Deep learning shows excellent potential in generation tasks thanks to deep
latent representations. Generative models are classes of models that can
generate observations randomly with respect to certain implied parameters.
Recently, the diffusion model has become a rising class of generative models
owing to its powerful generation ability, and great achievements have already
been reached. Beyond computer vision, speech generation, bioinformatics, and
natural language processing, more applications remain to be explored in this
field. However, the diffusion model has genuine drawbacks: a slow generation
process, restriction to single data types, low likelihood, and an inability to
perform dimension reduction, which have motivated many enhanced works. This
survey summarizes the field of the diffusion model. We first state the main
problem with two landmark works -- DDPM and DSM -- and a unifying landmark
work, Score SDE. Then, we present improved techniques for existing problems in
the diffusion-based model field, including model speed-up, data structure
diversification, likelihood optimization, and dimension reduction. Regarding
existing models, we also provide a benchmark of FID score, IS, and NLL at
specific NFE (number of function evaluations). Moreover, applications of
diffusion models are introduced, including computer vision, sequence modeling,
audio, and AI for science. Finally, we summarize the field together with its
limitations and further directions. A summary of existing, well-classified
methods is available at our
GitHub: https://github.com/chq1155/A-Survey-on-Generative-Diffusion-Model
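For readers unfamiliar with the landmark formulation the survey starts from,
the DDPM training objective reduces to predicting the noise injected at a
random timestep. The sketch below is illustrative only, with a toy
two-dimensional data distribution and an arbitrary MLP noise predictor in
place of a real denoising network:

```python
import torch
import torch.nn as nn

T = 1000                                              # diffusion steps
betas = torch.linspace(1e-4, 0.02, T)                 # DDPM noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)         # cumulative alpha_bar_t

eps_model = nn.Sequential(                            # toy noise predictor
    nn.Linear(2 + 1, 128), nn.SiLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(eps_model.parameters(), lr=1e-3)

def ddpm_loss(x0: torch.Tensor) -> torch.Tensor:
    """Simple DDPM objective: L = E ||eps - eps_theta(x_t, t)||^2,
    where x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    t = torch.randint(0, T, (x0.shape[0],))
    a = alpha_bar[t].unsqueeze(1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps        # forward noising
    t_in = (t.float() / T).unsqueeze(1)               # crude timestep embedding
    return ((eps - eps_model(torch.cat([x_t, t_in], dim=1))) ** 2).mean()

x0 = torch.randn(64, 2)                               # stand-in data batch
opt.zero_grad()
loss = ddpm_loss(x0)
loss.backward()
opt.step()
```

The slow generation the survey highlights comes from sampling, which must
invert this noising chain step by step, typically hundreds of network
evaluations (the NFE figure used in the survey's benchmark).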
Dior-CVAE: Diffusion Priors in Variational Dialog Generation
Conditional variational autoencoders (CVAEs) have been used recently for
diverse response generation, by introducing latent variables to represent the
relationship between a dialog context and its potential responses. However, the
diversity of the generated responses brought by a CVAE model is limited due to
the oversimplified assumption of the isotropic Gaussian prior. We propose,
Dior-CVAE, a hierarchical CVAE model with an informative prior produced by a
diffusion model. Dior-CVAE derives a series of layer-wise latent variables
using attention mechanism and infusing them into decoder layers accordingly. We
propose memory dropout in the latent infusion to alleviate posterior collapse.
The prior distribution of the latent variables is parameterized by a diffusion
model to introduce a multimodal distribution. Overall, experiments on two
popular open-domain dialog datasets indicate the advantages of our approach
over previous Transformer-based variational dialog models in dialog response
generation. We publicly release the code for reproducing Dior-CVAE and all
baselines at
https://github.com/SkyFishMoon/Latent-Diffusion-Response-Generation
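As a rough illustration of the latent-infusion idea, the sketch below lets a
decoder layer cross-attend to layer-wise latent vectors and randomly zeroes
whole latents during training. This is a hypothetical rendering of "memory
dropout" under the assumptions stated in the comments; the real Dior-CVAE
implementation is in the linked repository:

```python
import torch
import torch.nn as nn

class LatentInfusionLayer(nn.Module):
    """Decoder layer that cross-attends to layer-wise latent variables.
    Memory dropout: each latent vector is zeroed with probability p_drop
    during training, so the decoder cannot ignore the latents wholesale
    (one plausible way to discourage posterior collapse)."""
    def __init__(self, dim: int, heads: int = 8, p_drop: float = 0.2):
        super().__init__()
        self.p_drop = p_drop
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden: torch.Tensor, latents: torch.Tensor) -> torch.Tensor:
        # hidden:  (batch, seq, dim)   decoder hidden states
        # latents: (batch, n_lat, dim) latent variables for this layer
        if self.training:
            keep = (torch.rand(latents.shape[:2], device=latents.device)
                    > self.p_drop).float().unsqueeze(-1)
            latents = latents * keep                   # drop whole latent vectors
        infused, _ = self.cross_attn(hidden, latents, latents)
        return self.norm(hidden + infused)

layer = LatentInfusionLayer(dim=256)
out = layer(torch.randn(2, 10, 256), torch.randn(2, 4, 256))  # (2, 10, 256)
```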
An investigation of speaker independent phrase break models in End-to-End TTS systems
This paper presents our work on phrase break prediction in the context of
end-to-end TTS systems, motivated by the following questions: (i) Is there any
utility in incorporating an explicit phrasing model in an end-to-end TTS
system?, and (ii) How do you evaluate the effectiveness of a phrasing model in
an end-to-end TTS system? In particular, the utility and effectiveness of
phrase break prediction models are evaluated in the context of children's
story synthesis, using listener comprehension. We show by means of perceptual
listening evaluations that there is a clear preference for stories synthesized
after predicting the location of phrase breaks using a trained phrasing model,
over stories directly synthesized without predicting the location of phrase
breaks.
Comment: Submitted for review to IEEE Access
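The general mechanism of an explicit phrasing front-end, tagging each word
boundary as break/no-break and inserting break markers before synthesis, can
be sketched as follows. This is a hypothetical toy pipeline, not the paper's
model: the tagger architecture, vocabulary, and `<break>` marker are all
assumptions for illustration:

```python
import torch
import torch.nn as nn

class PhraseBreakTagger(nn.Module):
    """Toy sequence tagger: for each word position, predict whether a
    phrase break follows (label 1) or not (label 0)."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * dim, 2)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.emb(word_ids))
        return self.head(h)                     # (batch, words, 2) logits

def insert_breaks(words: list[str], preds: torch.Tensor) -> str:
    """Insert a break marker after each word tagged 1, yielding the
    phrased text handed to the TTS front-end."""
    out = []
    for w, p in zip(words, preds.tolist()):
        out.append(w)
        if p == 1:
            out.append("<break>")
    return " ".join(out)

tagger = PhraseBreakTagger(vocab_size=10000)
ids = torch.randint(0, 10000, (1, 6))           # stand-in word IDs
preds = tagger(ids).argmax(-1).squeeze(0)
print(insert_breaks(["once", "upon", "a", "time", "she", "slept"], preds))
```

A listening test of the kind the paper describes would then compare stories
synthesized from the marked-up text against stories synthesized from the raw
text.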