A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
Gestures that accompany speech are an essential part of natural and efficient
embodied human communication. The automatic generation of such co-speech
gestures is a long-standing problem in computer animation and is considered an
enabling technology in film, games, virtual social spaces, and for interaction
with social robots. The problem is made challenging by the idiosyncratic and
non-periodic nature of human co-speech gesture motion, and by the great
diversity of communicative functions that gestures encompass. Gesture
generation has seen surging interest recently, owing to the emergence of more
and larger datasets of human gesture motion, combined with strides in
deep-learning-based generative models that benefit from the growing
availability of data. This review article summarizes co-speech gesture
generation research, with a particular focus on deep generative models. First,
we articulate the theory describing human gesticulation and how it complements
speech. Next, we briefly discuss rule-based and classical statistical gesture
synthesis, before delving into deep learning approaches. We employ the choice
of input modalities as an organizing principle, examining systems that generate
gestures from audio, text, and non-linguistic input. We also chronicle the
evolution of the related training datasets in terms of size, diversity, motion
quality, and collection method. Finally, we identify key research challenges in
gesture generation, including data availability and quality; producing
human-like motion; grounding gestures in the co-occurring speech, in
interaction with other speakers, and in the environment; performing gesture
evaluation; and integrating gesture synthesis into applications. We
highlight recent approaches to tackling the various key challenges, as well as
the limitations of these approaches, and point toward areas of future
development.
Comment: Accepted for EUROGRAPHICS 202
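As an illustration of the model family this review surveys, below is a minimal sketch of a speech-to-gesture network in PyTorch: a recurrent encoder maps per-frame audio features to joint rotations. All module names, dimensions, and the pose parameterization are illustrative assumptions, not the architecture of any specific system covered by the review.

```python
import torch
import torch.nn as nn

class AudioToGesture(nn.Module):
    """Toy audio-driven gesture generator (illustrative assumption)."""
    def __init__(self, n_audio_feats=80, n_joints=15, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_audio_feats, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        # Predict 3 rotation parameters per joint for every audio frame.
        self.head = nn.Linear(2 * hidden, n_joints * 3)

    def forward(self, audio_feats):         # (batch, frames, n_audio_feats)
        h, _ = self.encoder(audio_feats)    # (batch, frames, 2 * hidden)
        return self.head(h)                 # (batch, frames, n_joints * 3)

model = AudioToGesture()
mel = torch.randn(4, 120, 80)               # 4 clips, 120 log-mel frames each
poses = model(mel)                          # shape (4, 120, 45)
```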
Speech-driven Animation with Meaningful Behaviors
Conversational agents (CAs) play an important role in human-computer
interaction. Creating believable movements for CAs is challenging, since the
movements have to be meaningful and natural, reflecting the coupling between
gestures and speech. Studies in the past have mainly relied on rule-based or
data-driven approaches. Rule-based methods focus on creating meaningful
behaviors conveying the underlying message, but the gestures cannot be easily
synchronized with speech. Data-driven approaches, especially speech-driven
models, can capture the relationship between speech and gestures. However, they
create behaviors disregarding the meaning of the message. This study proposes
to bridge the gap between these two approaches, overcoming their limitations.
The approach builds a dynamic Bayesian network (DBN), where a discrete variable
is added to condition the generated behaviors on an underlying constraint. The study
implements and evaluates the approach with two constraints: discourse functions
and prototypical behaviors. By constraining on the discourse functions (e.g.,
questions), the model learns the characteristic behaviors associated with a
given discourse class, deriving the rules from the data. By constraining on
prototypical behaviors (e.g., head nods), the approach can be embedded in a
rule-based system as a behavior realizer, creating trajectories that are
synchronized in time with speech. The study proposes a DBN structure and a training
approach that (1) models the cause-effect relationship between the constraint
and the gestures, (2) initializes the state configuration models, increasing the
range of the generated behaviors, and (3) captures the differences in the
behaviors across constraints by enforcing sparse transitions between shared and
exclusive states per constraint. Objective and subjective evaluations
demonstrate the benefits of the proposed approach over an unconstrained model.
Comment: 13 pages, 12 figures, 5 tables
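To make the constraint idea concrete, here is a toy reduction of the transition structure described above, with made-up state counts and probabilities: the discrete constraint value selects which transition matrix drives a Markov chain over hidden gesture states, and zero entries enforce the sparse, partly exclusive transitions per constraint. The actual DBN additionally couples speech and gesture observations to these states.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 4  # toy setup: states 0-1 shared, states 2/3 exclusive per constraint

# One row-stochastic transition matrix per constraint value (made-up numbers).
transitions = {
    "question":  np.array([[0.6, 0.2, 0.2, 0.0],
                           [0.3, 0.5, 0.2, 0.0],
                           [0.2, 0.2, 0.6, 0.0],
                           [0.25, 0.25, 0.25, 0.25]]),
    "statement": np.array([[0.6, 0.2, 0.0, 0.2],
                           [0.3, 0.5, 0.0, 0.2],
                           [0.25, 0.25, 0.25, 0.25],
                           [0.2, 0.2, 0.0, 0.6]]),
}

def sample_states(constraint, length, start=0):
    """Sample a hidden-state trajectory under a given discrete constraint."""
    states, s = [start], start
    for _ in range(length - 1):
        s = int(rng.choice(n_states, p=transitions[constraint][s]))
        states.append(s)
    return states

print(sample_states("question", 10))  # e.g. states 0-2 only; state 3 excluded
```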
Dance-to-Music Generation with Encoder-based Textual Inversion of Diffusion Models
The harmonious integration of music with dance movements is pivotal in
vividly conveying the artistic essence of dance. This alignment also
significantly elevates the immersive quality of gaming experiences and
animation productions. While there has been remarkable advancement in creating
high-fidelity music from textual descriptions, current methodologies mainly
concentrate on modulating overarching characteristics such as genre and
emotional tone. They often overlook the nuanced management of temporal rhythm,
which is indispensable in crafting music for dance, since it intricately aligns
the musical beats with the dancers' movements. Recognizing this gap, we propose
an encoder-based textual inversion technique for augmenting text-to-music
models with visual control, facilitating personalized music generation.
Specifically, we develop dual-path rhythm-genre inversion to effectively
integrate the rhythm and genre of a dance motion sequence into the textual
space of a text-to-music model. In contrast to the classical textual inversion
method, which directly updates text embeddings to reconstruct a single target
object, our approach utilizes separate rhythm and genre encoders to obtain text
embeddings for two pseudo-words, adapting to the varying rhythms and genres. To
achieve a more accurate evaluation, we propose improved evaluation metrics for
rhythm alignment. We demonstrate that our approach outperforms state-of-the-art
methods across multiple evaluation metrics. Furthermore, our method seamlessly
adapts to in-the-wild data and effectively integrates with the inherent
text-guided generation capability of the pre-trained model. Samples are
available at \url{https://youtu.be/D7XDwtH1YwE}.
Comment: 9 pages, 3 figures
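A minimal sketch of the dual-path inversion idea, under assumed tensor shapes and a hypothetical splicing interface: two encoders pool the dance motion sequence into embeddings for two pseudo-words, which replace placeholder slots in the prompt embeddings fed to a frozen text-to-music model (random stand-ins below, not the paper's actual components).

```python
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Pools a motion sequence into a single pseudo-word embedding."""
    def __init__(self, n_motion_feats=72, token_dim=768):
        super().__init__()
        self.rnn = nn.GRU(n_motion_feats, token_dim, batch_first=True)

    def forward(self, motion):            # (batch, frames, n_motion_feats)
        _, h = self.rnn(motion)
        return h[-1]                      # (batch, token_dim)

rhythm_enc, genre_enc = MotionEncoder(), MotionEncoder()
motion = torch.randn(2, 240, 72)          # 2 dance clips, 240 pose frames each

# Stand-in for a frozen text encoder's prompt embeddings; slots 1 and 2 are
# assumed to hold the "<rhythm>" and "<genre>" pseudo-word tokens.
prompt_emb = torch.randn(2, 77, 768)
prompt_emb[:, 1] = rhythm_enc(motion)
prompt_emb[:, 2] = genre_enc(motion)
```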
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e., audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 PDF figures
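Since the review singles out log-mel spectra as a dominant feature representation, here is a minimal example of computing them with librosa; the parameter values are common defaults, not prescriptions from the article.

```python
import librosa
import numpy as np

y, sr = librosa.load(librosa.ex("trumpet"))  # bundled demo clip
mel = librosa.feature.melspectrogram(y=y, sr=sr,
                                     n_fft=2048, hop_length=512, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)  # (80, frames), in dB
print(log_mel.shape)
```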
A Review of Evaluation Practices of Gesture Generation in Embodied Conversational Agents
Embodied Conversational Agents (ECAs) take on different forms, including
virtual avatars or physical agents, such as humanoid robots. ECAs are often
designed to produce nonverbal behaviour to complement or enhance their verbal
communication. One form of nonverbal behaviour is co-speech gesturing, which
involves movements that the agent makes with its arms and hands that are paired
with verbal communication. Co-speech gestures for ECAs can be created using
different generation methods, such as rule-based and data-driven processes.
However, reports on gesture generation methods use a variety of evaluation
measures, which hinders comparison. To address this, we conducted a systematic
review on co-speech gesture generation methods for iconic, metaphoric, deictic
or beat gestures, including their evaluation methods. We reviewed 22 studies
that had an ECA with a human-like upper body that used co-speech gesturing in a
social human-agent interaction, including a user study to evaluate its
performance. We found most studies used a within-subject design and relied on a
form of subjective evaluation, but lacked a systematic approach. Overall,
methodological quality was low-to-moderate and few systematic conclusions could
be drawn. We argue that the field requires rigorous and uniform tools for the
evaluation of co-speech gesture systems. We have proposed recommendations for
future empirical evaluation, including standardised phrases and test scenarios
to test generative models. We also provide a research checklist that can be
used to report relevant information for the evaluation of generative models as
well as to evaluate co-speech gesture use.
Comment: 9 pages
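As a sketch of the analysis such a within-subject design entails, the snippet below scores a pairwise preference test between two gesture systems with a two-sided binomial test; the counts are made up for illustration, and the design is one common option rather than the review's prescribed protocol.

```python
from scipy.stats import binomtest

n_trials = 120   # rater-trial pairs comparing system A vs. system B (made up)
prefer_a = 78    # trials in which raters preferred system A (made up)

result = binomtest(prefer_a, n_trials, p=0.5)  # H0: no preference
print(f"preference for A: {prefer_a / n_trials:.2f}, p = {result.pvalue:.4f}")
```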
A Survey of Music Generation in the Context of Interaction
In recent years, machine learning, and in particular generative adversarial
networks (GANs) and attention-based neural networks (transformers), have
been successfully used to compose and generate music, both melodies and
polyphonic pieces. Current research focuses foremost on style replication (e.g.,
generating a Bach-style chorale) or style transfer (e.g., classical to jazz)
based on large amounts of recorded or transcribed music, which in turn also
allows for fairly straightforward "performance" evaluation. However, most of
these models are not suitable for human-machine co-creation through live
interaction, nor is it clear how such models and the resulting creations would be
evaluated. This article presents a thorough review of music representation,
feature analysis, heuristic algorithms, statistical and parametric modelling,
and human and automatic evaluation measures, along with a discussion of which
approaches and models seem most suitable for live interaction
- …