Speech-driven Animation with Meaningful Behaviors
Conversational agents (CAs) play an important role in human computer
interaction. Creating believable movements for CAs is challenging, since the
movements have to be meaningful and natural, reflecting the coupling between
gestures and speech. Studies in the past have mainly relied on rule-based or
data-driven approaches. Rule-based methods focus on creating meaningful
behaviors conveying the underlying message, but the gestures cannot be easily
synchronized with speech. Data-driven approaches, especially speech-driven
models, can capture the relationship between speech and gestures. However, they
create behaviors disregarding the meaning of the message. This study proposes
to bridge the gap between these two approaches overcoming their limitations.
The approach builds a dynamic Bayesian network (DBN), where a discrete variable
is added to condition the generated behaviors on an underlying constraint. The study
implements and evaluates the approach with two constraints: discourse functions
and prototypical behaviors. By constraining on the discourse functions (e.g.,
questions), the model learns the characteristic behaviors associated with a
given discourse class learning the rules from the data. By constraining on
prototypical behaviors (e.g., head nods), the approach can be embedded in a
rule-based system as a behavior realizer creating trajectories that are timely
synchronized with speech. The study proposes a DBN structure and a training
approach that (1) models the cause-effect relationship between the constraint
and the gestures, (2) initializes the state configuration models, increasing the
range of the generated behaviors, and (3) captures the differences in the
behaviors across constraints by enforcing sparse transitions between shared and
exclusive states per constraint. Objective and subjective evaluations
demonstrate the benefits of the proposed approach over an unconstrained model.
Comment: 13 pages, 12 figures, 5 tables
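To make the constrained-transition idea above concrete, the following is a minimal NumPy sketch, not the authors' implementation: it builds a transition structure with a block of shared hidden states plus a block of exclusive states per constraint, and masks out the other constraints' exclusive states so that transitions stay sparse and constraint-specific. The state counts and constraint names are illustrative assumptions.

import numpy as np

N_SHARED = 4                               # states reachable under every constraint
N_EXCLUSIVE = 3                            # extra states reserved per constraint
CONSTRAINTS = ["question", "head_nod"]     # hypothetical constraint values

n_total = N_SHARED + N_EXCLUSIVE * len(CONSTRAINTS)

def transition_mask(constraint_idx):
    # Shared states plus this constraint's own exclusive block are allowed;
    # the exclusive states of all other constraints are masked out.
    allowed = np.zeros(n_total, dtype=bool)
    allowed[:N_SHARED] = True
    start = N_SHARED + constraint_idx * N_EXCLUSIVE
    allowed[start:start + N_EXCLUSIVE] = True
    return np.outer(allowed, allowed)

def random_transitions(constraint_idx, rng):
    # Row-stochastic transition matrix over the allowed states; rows of
    # states unreachable under this constraint remain (numerically) zero.
    mask = transition_mask(constraint_idx)
    A = rng.random((n_total, n_total)) * mask
    return A / (A.sum(axis=1, keepdims=True) + 1e-12)

rng = np.random.default_rng(0)
A_question = random_transitions(0, rng)
print(A_question.shape)                    # (10, 10)
print(A_question[:N_SHARED].sum(axis=1))   # shared-state rows sum to 1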
Analyzing Input and Output Representations for Speech-Driven Gesture Generation
This paper presents a novel framework for automatic speech-driven gesture
generation, applicable to human-agent interaction including both virtual agents
and robots. Specifically, we extend recent deep-learning-based, data-driven
methods for speech-driven gesture generation by incorporating representation
learning. Our model takes speech as input and produces gestures as output, in
the form of a sequence of 3D coordinates. Our approach consists of two steps.
First, we learn a lower-dimensional representation of human motion using a
denoising autoencoder neural network, consisting of a motion encoder MotionE
and a motion decoder MotionD. The learned representation preserves the most
important aspects of the human pose variation while removing less relevant
variation. Second, we train a novel encoder network SpeechE to map from speech
to a corresponding motion representation with reduced dimensionality. At test
time, the speech encoder and the motion decoder networks are combined: SpeechE
predicts motion representations based on a given speech signal and MotionD then
decodes these representations to produce motion sequences. We evaluate
different representation sizes in order to find the most effective
dimensionality for the representation. We also evaluate the effects of using
different speech features as input to the model. We find that mel-frequency
cepstral coefficients (MFCCs), alone or combined with prosodic features,
perform the best. The results of a subsequent user study confirm the benefits
of the representation learning.
Comment: Accepted at IVA '19. Shorter version published at AAMAS '19. The code
is available at
https://github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencode
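As a rough illustration of this two-step pipeline, the PyTorch sketch below first trains a denoising autoencoder on motion (MotionE/MotionD) and then regresses speech features onto the learned representation with SpeechE, chaining SpeechE and MotionD at test time. The frame-wise linear layers, feature dimensionalities, and placeholder tensors are simplifying assumptions rather than the published architecture.

import torch
import torch.nn as nn

POSE_DIM, SPEECH_DIM, REPR_DIM = 45, 26, 32    # illustrative sizes

MotionE = nn.Sequential(nn.Linear(POSE_DIM, 128), nn.ReLU(), nn.Linear(128, REPR_DIM))
MotionD = nn.Sequential(nn.Linear(REPR_DIM, 128), nn.ReLU(), nn.Linear(128, POSE_DIM))
SpeechE = nn.Sequential(nn.Linear(SPEECH_DIM, 128), nn.ReLU(), nn.Linear(128, REPR_DIM))

mse = nn.MSELoss()
poses = torch.randn(64, POSE_DIM)              # placeholder motion frames
speech = torch.randn(64, SPEECH_DIM)           # placeholder speech features (e.g. MFCCs)

# Step 1: denoising autoencoder on motion (noise is added to the input only).
noisy = poses + 0.1 * torch.randn_like(poses)
recon_loss = mse(MotionD(MotionE(noisy)), poses)

# Step 2: train SpeechE to predict the (frozen) motion representation.
with torch.no_grad():
    target_repr = MotionE(poses)
speech_loss = mse(SpeechE(speech), target_repr)

# Test time: chain the speech encoder with the motion decoder.
generated_motion = MotionD(SpeechE(speech))
print(generated_motion.shape)                  # torch.Size([64, 45])

In the paper the encoders operate on motion and speech sequences; treating frames independently here just keeps the sketch short.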
A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
Gestures that accompany speech are an essential part of natural and efficient
embodied human communication. The automatic generation of such co-speech
gestures is a long-standing problem in computer animation and is considered an
enabling technology in film, games, virtual social spaces, and for interaction
with social robots. The problem is made challenging by the idiosyncratic and
non-periodic nature of human co-speech gesture motion, and by the great
diversity of communicative functions that gestures encompass. Gesture
generation has seen surging interest recently, owing to the emergence of more
and larger datasets of human gesture motion, combined with strides in
deep-learning-based generative models that benefit from the growing
availability of data. This review article summarizes co-speech gesture
generation research, with a particular focus on deep generative models. First,
we articulate the theory describing human gesticulation and how it complements
speech. Next, we briefly discuss rule-based and classical statistical gesture
synthesis, before delving into deep learning approaches. We employ the choice
of input modalities as an organizing principle, examining systems that generate
gestures from audio, text, and non-linguistic input. We also chronicle the
evolution of the related training data sets in terms of size, diversity, motion
quality, and collection method. Finally, we identify key research challenges in
gesture generation, including data availability and quality; producing
human-like motion; grounding the gesture in the co-occurring speech in
interaction with other speakers, and in the environment; performing gesture
evaluation; and integration of gesture synthesis into applications. We
highlight recent approaches to tackling the various key challenges, as well as
the limitations of these approaches, and point toward areas of future
development.
Comment: Accepted for EUROGRAPHICS 202
TranSTYLer: Multimodal Behavioral Style Transfer for Facial and Body Gestures Generation
This paper addresses the challenge of transferring the behavior expressivity
style of one virtual agent to another while preserving the shape of the behaviors,
as they carry communicative meaning. Behavior expressivity style is viewed here as
the qualitative properties of behaviors. We propose TranSTYLer, a multimodal
transformer based model that synthesizes the multimodal behaviors of a source
speaker with the style of a target speaker. We assume that behavior
expressivity style is encoded across various modalities of communication,
including text, speech, body gestures, and facial expressions. The model
employs a style and content disentanglement schema to ensure that the
transferred style does not interfere with the meaning conveyed by the source
behaviors. Our approach eliminates the need for style labels and allows the
generalization to styles that have not been seen during the training phase. We
train our model on the PATS corpus, which we extended to include dialog acts
and 2D facial landmarks. Objective and subjective evaluations show that our
model outperforms state-of-the-art models in style transfer for both seen and
unseen styles during training. To tackle the issues of style and content
leakage that may arise, we propose a methodology to assess the degree to which
the behaviors and gestures associated with the target style are successfully
transferred, while ensuring the preservation of those related to the source
content.
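A hedged sketch of the style/content fusion described above: a transformer encoder over content tokens from the source speaker, with a style code pooled from the target speaker's behavior added to every token. The module names, dimensions, and fusion-by-addition choice are assumptions made for illustration, not the TranSTYLer implementation.

import torch
import torch.nn as nn

D = 128

class StyleContentTransformer(nn.Module):
    def __init__(self, content_dim=64, style_dim=64, out_dim=96):
        super().__init__()
        self.content_proj = nn.Linear(content_dim, D)
        self.style_proj = nn.Linear(style_dim, D)
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, out_dim)      # predicted face/body behavior frames

    def forward(self, content_seq, style_seq):
        style = self.style_proj(style_seq).mean(dim=1, keepdim=True)   # pooled style code
        tokens = self.content_proj(content_seq) + style                # inject style into content
        return self.head(self.encoder(tokens))

model = StyleContentTransformer()
source_content = torch.randn(2, 50, 64)    # content features of the source speaker
target_style = torch.randn(2, 80, 64)      # behavior frames of the target-style speaker
print(model(source_content, target_style).shape)   # torch.Size([2, 50, 96])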
LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation
Gestures are non-verbal but important behaviors accompanying people's speech.
While previous methods are able to generate speech rhythm-synchronized
gestures, the semantic context of the speech is generally lacking in the
gesticulations. Although semantic gestures do not occur very regularly in human
speech, they are key for the audience to understand the speech context in a
more immersive way. Hence, we introduce LivelySpeaker, a
framework that realizes semantics-aware co-speech gesture generation and offers
several control handles. In particular, our method decouples the task into two
stages: script-based gesture generation and audio-guided rhythm refinement.
Specifically, the script-based gesture generation leverages the pre-trained
CLIP text embeddings as the guidance for generating gestures that are highly
semantically aligned with the script. Then, we devise a simple but effective
diffusion-based gesture generation backbone built purely from MLPs, which is
conditioned only on audio signals and learns to gesticulate with realistic
motion. We utilize this powerful prior to bring the script-guided gestures into
rhythm with the audio signals, notably in a zero-shot setting. Our novel two-stage
generation framework also enables several applications, such as changing the
gesticulation style, editing the co-speech gestures via textual prompting, and
controlling the semantic awareness and rhythm alignment with guided diffusion.
Extensive experiments demonstrate the advantages of the proposed framework over
competing methods. In addition, our core diffusion-based generative model also
achieves state-of-the-art performance on two benchmarks. The code and model
will be released to facilitate future research.
Comment: Accepted by ICCV 202
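The audio-conditioned, MLP-only diffusion backbone can be sketched as below: a pure-MLP denoiser predicts the noise added to gesture frames, conditioned on aligned audio features and a diffusion timestep. The noise schedule, feature sizes, timestep encoding, and epsilon-prediction objective are standard DDPM-style assumptions, not the authors' exact model.

import torch
import torch.nn as nn

GESTURE_DIM, AUDIO_DIM, T_STEPS = 48, 32, 1000

class MLPDenoiser(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(GESTURE_DIM + AUDIO_DIM + 1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, GESTURE_DIM),
        )

    def forward(self, noisy_gesture, audio_feat, t):
        t_emb = t.float().unsqueeze(-1) / T_STEPS    # crude scalar timestep encoding
        return self.net(torch.cat([noisy_gesture, audio_feat, t_emb], dim=-1))

denoiser = MLPDenoiser()
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(16, GESTURE_DIM)                    # clean gesture frames (placeholder)
audio = torch.randn(16, AUDIO_DIM)                   # aligned audio features (placeholder)
t = torch.randint(0, T_STEPS, (16,))
noise = torch.randn_like(x0)
x_t = alpha_bar[t].sqrt().unsqueeze(-1) * x0 + (1 - alpha_bar[t]).sqrt().unsqueeze(-1) * noise

loss = nn.functional.mse_loss(denoiser(x_t, audio, t), noise)   # epsilon-prediction loss
print(float(loss))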
ACT2G: Attention-based Contrastive Learning for Text-to-Gesture Generation
The recent increase in remote work, online meetings, and tele-operation tasks has
made people realize that gestures for avatars and communication robots are more
important than previously thought. Gesture is one of the key factors in achieving
smooth and natural communication between humans and AI systems and has been
intensively researched. Current gesture generation methods are mostly based on
deep neural networks that take text, audio, and other information as input;
however, they generate gestures mainly from audio, producing so-called beat
gestures. Although beat gestures account for more than 70% of actual human
gestures, content-based gestures sometimes play an important role in making
avatars more realistic and human-like. In this paper, we propose attention-based
contrastive learning for text-to-gesture generation (ACT2G), in which the
generated gestures represent the content of the text by estimating an attention
weight for each word in the input text. Since the text and gesture features
computed with the attention weights are mapped to the same latent space by
contrastive learning, once a text is given as input, the network outputs a
feature vector that can be used to generate gestures related to its content. A
user study confirmed that the gestures generated by ACT2G were rated better than
those of existing methods. In addition, we demonstrated that a wide variety of
gestures can be generated from the same text when creators adjust the attention
weights.
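A rough sketch of the attention-weighted text-gesture alignment the abstract describes: word features are pooled with learned attention weights and pulled toward the paired gesture embedding with an InfoNCE-style loss, so paired text and gesture land close together in the shared latent space. The encoders, dimensions, and temperature are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

WORD_DIM, GESTURE_DIM, LATENT = 300, 48, 64

word_attn = nn.Linear(WORD_DIM, 1)                    # one attention score per word
text_proj = nn.Linear(WORD_DIM, LATENT)
gesture_enc = nn.Sequential(nn.Linear(GESTURE_DIM, 128), nn.ReLU(), nn.Linear(128, LATENT))

def encode_text(words):                               # words: (batch, n_words, WORD_DIM)
    attn = torch.softmax(word_attn(words), dim=1)     # (batch, n_words, 1)
    pooled = (attn * words).sum(dim=1)                # attention-weighted sum over words
    return F.normalize(text_proj(pooled), dim=-1)

def encode_gesture(gestures):                         # gestures: (batch, GESTURE_DIM)
    return F.normalize(gesture_enc(gestures), dim=-1)

def contrastive_loss(text_z, gesture_z, temperature=0.07):
    logits = text_z @ gesture_z.t() / temperature     # paired items lie on the diagonal
    labels = torch.arange(text_z.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

words = torch.randn(8, 12, WORD_DIM)
gestures = torch.randn(8, GESTURE_DIM)
print(float(contrastive_loss(encode_text(words), encode_gesture(gestures))))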
Novel Multimodal Feedback Techniques for In-Car Mid-Air Gesture Interaction
This paper presents an investigation into the effects of different feedback modalities on mid-air gesture interaction for infotainment systems in cars. Car crashes and near-crash events are most commonly caused by driver distraction. Mid-air interaction is a way of reducing driver distraction by reducing the visual demand of infotainment systems. Despite a range of available modalities, feedback in mid-air gesture systems is generally provided through visual displays. We conducted a simulated driving study to investigate how different types of multimodal feedback can support in-air gestures. The effects of different feedback modalities on eye-gaze behaviour and on the driving and gesturing tasks are considered. We found that feedback modality influenced gesturing behaviour. However, drivers corrected falsely executed gestures more often in non-visual conditions. Our findings show that non-visual feedback can reduce visual distraction significantly.