56,539 research outputs found
EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation
Generating vivid and diverse 3D co-speech gestures is crucial for various
applications in animating virtual avatars. While most existing methods can
generate gestures from audio directly, they usually overlook that emotion is
one of the key factors of authentic co-speech gesture generation. In this work,
we propose EmotionGesture, a novel framework for synthesizing vivid and diverse
emotional co-speech 3D gestures from audio. Considering emotion is often
entangled with the rhythmic beat in speech audio, we first develop an
Emotion-Beat Mining module (EBM) to extract the emotion and audio beat features
as well as model their correlation via a transcript-based visual-rhythm
alignment. Then, we propose an initial pose based Spatial-Temporal Prompter
(STP) to generate future gestures from the given initial poses. STP effectively
models the spatial-temporal correlations between the initial poses and the
future gestures, thus producing the spatial-temporal coherent pose prompt. Once
we obtain pose prompts, emotion, and audio beat features, we will generate 3D
co-speech gestures through a transformer architecture. However, considering the
poses of existing datasets often contain jittering effects, this would lead to
generating unstable gestures. To address this issue, we propose an effective
objective function, dubbed Motion-Smooth Loss. Specifically, we model motion
offset to compensate for jittering ground-truth by forcing gestures to be
smooth. Last, we present an emotion-conditioned VAE to sample emotion features,
enabling us to generate diverse emotional results. Extensive experiments
demonstrate that our framework outperforms the state-of-the-art, achieving
vivid and diverse emotional co-speech 3D gestures.Comment: Under revie
Should beat gestures be learned or designed? A benchmarking user study
In this paper, we present a user study on gener-ated beat gestures for humanoid agents. It has been shownthat Human-Robot Interaction can be improved by includingcommunicative non-verbal behavior, such as arm gestures. Beatgestures are one of the four types of arm gestures, and are knownto be used for emphasizing parts of speech. In our user study,we compare beat gestures learned from training data with hand-crafted beat gestures. The first kind of gestures are generatedby a machine learning model trained on speech audio andhuman upper body poses. We compared this approach with threehand-coded beat gestures methods: designed beat gestures, timedbeat gestures, and noisy gestures. Forty-one subjects participatedin our user study, and a ranking was derived from pairedcomparisons using the Bradley Terry Luce model. We found thatfor beat gestures, the gestures from the machine learning modelare preferred, followed by algorithmically generated gestures.This emphasizes the promise of machine learning for generating communicative actions.QC 20190815</p
Generating Holistic 3D Human Motion from Speech
This work addresses the problem of generating 3D holistic body motions from
human speech. Given a speech recording, we synthesize sequences of 3D body
poses, hand gestures, and facial expressions that are realistic and diverse. To
achieve this, we first build a high-quality dataset of 3D holistic body meshes
with synchronous speech. We then define a novel speech-to-motion generation
framework in which the face, body, and hands are modeled separately. The
separated modeling stems from the fact that face articulation strongly
correlates with human speech, while body poses and hand gestures are less
correlated. Specifically, we employ an autoencoder for face motions, and a
compositional vector-quantized variational autoencoder (VQ-VAE) for the body
and hand motions. The compositional VQ-VAE is key to generating diverse
results. Additionally, we propose a cross-conditional autoregressive model that
generates body poses and hand gestures, leading to coherent and realistic
motions. Extensive experiments and user studies demonstrate that our proposed
approach achieves state-of-the-art performance both qualitatively and
quantitatively. Our novel dataset and code will be released for research
purposes at https://talkshow.is.tue.mpg.de.Comment: Project Webpage: https://talkshow.is.tue.mpg.de; CVPR202
AQ-GT: a Temporally Aligned and Quantized GRU-Transformer for Co-Speech Gesture Synthesis
The generation of realistic and contextually relevant co-speech gestures is a
challenging yet increasingly important task in the creation of multimodal
artificial agents. Prior methods focused on learning a direct correspondence
between co-speech gesture representations and produced motions, which created
seemingly natural but often unconvincing gestures during human assessment. We
present an approach to pre-train partial gesture sequences using a generative
adversarial network with a quantization pipeline. The resulting codebook
vectors serve as both input and output in our framework, forming the basis for
the generation and reconstruction of gestures. By learning the mapping of a
latent space representation as opposed to directly mapping it to a vector
representation, this framework facilitates the generation of highly realistic
and expressive gestures that closely replicate human movement and behavior,
while simultaneously avoiding artifacts in the generation process. We evaluate
our approach by comparing it with established methods for generating co-speech
gestures as well as with existing datasets of human behavior. We also perform
an ablation study to assess our findings. The results show that our approach
outperforms the current state of the art by a clear margin and is partially
indistinguishable from human gesturing. We make our data pipeline and the
generation framework publicly available
C2G2: Controllable Co-speech Gesture Generation with Latent Diffusion Model
Co-speech gesture generation is crucial for automatic digital avatar
animation. However, existing methods suffer from issues such as unstable
training and temporal inconsistency, particularly in generating high-fidelity
and comprehensive gestures. Additionally, these methods lack effective control
over speaker identity and temporal editing of the generated gestures. Focusing
on capturing temporal latent information and applying practical controlling, we
propose a Controllable Co-speech Gesture Generation framework, named C2G2.
Specifically, we propose a two-stage temporal dependency enhancement strategy
motivated by latent diffusion models. We further introduce two key features to
C2G2, namely a speaker-specific decoder to generate speaker-related real-length
skeletons and a repainting strategy for flexible gesture generation/editing.
Extensive experiments on benchmark gesture datasets verify the effectiveness of
our proposed C2G2 compared with several state-of-the-art baselines. The link of
the project demo page can be found at https://c2g2-gesture.github.io/c2_gestureComment: 12 pages, 6 figures, 7 table
Recommended from our members
Reference and Gestures in Dialogue Generation: Three Studies with Embodied Conversational Agents
This paper reports on three studies into social presence cues which were carried out in the context of the NECA (Net-environment for Embodied Emotional Conversational Agents) project and the EPOCH network. The first study concerns the generation of referring expressions. We adopted an existing algorithm for generating referring expressions such that it could run according to an egocentric and a neutral strategy. In an evaluation study, we found that the two strategies were correlated with the perceived friendliness of the speaker. In the second and the third study, we evaluated the gestures that were generated by the NECA system. In this paper, we briefly summarize the most salient results of these two studies. They concern the effect of gestures on perceived quality of speech and information retention
Analyzing Input and Output Representations for Speech-Driven Gesture Generation
This paper presents a novel framework for automatic speech-driven gesture
generation, applicable to human-agent interaction including both virtual agents
and robots. Specifically, we extend recent deep-learning-based, data-driven
methods for speech-driven gesture generation by incorporating representation
learning. Our model takes speech as input and produces gestures as output, in
the form of a sequence of 3D coordinates. Our approach consists of two steps.
First, we learn a lower-dimensional representation of human motion using a
denoising autoencoder neural network, consisting of a motion encoder MotionE
and a motion decoder MotionD. The learned representation preserves the most
important aspects of the human pose variation while removing less relevant
variation. Second, we train a novel encoder network SpeechE to map from speech
to a corresponding motion representation with reduced dimensionality. At test
time, the speech encoder and the motion decoder networks are combined: SpeechE
predicts motion representations based on a given speech signal and MotionD then
decodes these representations to produce motion sequences. We evaluate
different representation sizes in order to find the most effective
dimensionality for the representation. We also evaluate the effects of using
different speech features as input to the model. We find that mel-frequency
cepstral coefficients (MFCCs), alone or combined with prosodic features,
perform the best. The results of a subsequent user study confirm the benefits
of the representation learning.Comment: Accepted at IVA '19. Shorter version published at AAMAS '19. The code
is available at
https://github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencode
- …