HEMVIP: Human Evaluation of Multiple Videos in Parallel
In many research areas, for example motion and gesture generation, objective
measures alone do not provide an accurate impression of key stimulus traits
such as perceived quality or appropriateness. The gold standard is instead to
evaluate these aspects through user studies, especially subjective evaluations
of video stimuli. Common evaluation paradigms either present individual stimuli
to be scored on Likert-type scales, or ask users to compare and rate videos in
a pairwise fashion. However, the time and resources required for such
evaluations scale poorly as the number of conditions to be compared increases.
Building on standards used for evaluating the quality of multimedia codecs,
this paper instead introduces a framework for granular rating of multiple
comparable videos in parallel. This methodology essentially analyses all
condition pairs at once. Our contributions are 1) a proposed framework, called
HEMVIP, for parallel and granular evaluation of multiple video stimuli and 2) a
validation study confirming that results obtained using the tool are in close
agreement with results of prior studies using conventional multiple pairwise
comparisons. Comment: 8 pages, 2 figures
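As a rough illustration of the idea of rating multiple conditions in parallel and then analysing all condition pairs at once, the sketch below derives within-participant pairwise comparisons from a matrix of parallel 0-100 ratings. The condition names, rating matrix, and analysis are illustrative assumptions, not the HEMVIP tool's actual data or code.

import numpy as np

rng = np.random.default_rng(0)
conditions = ["natural", "model_a", "model_b", "mismatched"]
# ratings[i, j] = slider score (0-100) that participant i gave condition j
ratings = rng.integers(0, 101, size=(30, len(conditions)))

n = len(conditions)
for a in range(n):
    for b in range(a + 1, n):
        diff = ratings[:, a] - ratings[:, b]   # within-participant differences
        wins = np.mean(diff > 0)               # fraction preferring condition a
        print(f"{conditions[a]} vs {conditions[b]}: "
              f"mean diff {diff.mean():+.1f}, preferred {wins:.0%} of the time")

Because every participant scores all conditions on the same screen, a single parallel rating session yields data for every one of the n*(n-1)/2 condition pairs.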
Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents
Embodied agents, in the form of virtual agents or social robots, are rapidly
becoming more widespread. In human-human interactions, humans use nonverbal
behaviours to convey their attitudes, feelings, and intentions. Therefore, this
capability is also required for embodied agents in order to enhance the quality
and effectiveness of their interactions with humans. In this paper, we propose
a novel framework that can generate sequences of joint angles from the speech
text and speech audio utterances. Based on a conditional Generative Adversarial
Network (GAN), our proposed neural network model learns the relationships
between the co-speech gestures and both semantic and acoustic features from the
speech input. In order to train our neural network model, we employ a public
dataset containing co-speech gestures with corresponding speech audio
utterances, which were captured from a single male native English speaker. The
results from both objective and subjective evaluations demonstrate the efficacy
of our gesture-generation framework for robots and embodied agents. Comment: RO-MAN'23, 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), August 2023, Busan, South Korea
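A minimal sketch of the kind of conditional GAN described here: a generator maps frame-aligned acoustic and text features to a sequence of joint angles, and a discriminator judges gesture sequences conditioned on the same speech input. All dimensions, layer sizes, and module names are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn

SPEECH_DIM, TEXT_DIM, JOINT_DIM, SEQ_LEN = 64, 300, 27, 100

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(SPEECH_DIM + TEXT_DIM, 256, batch_first=True)
        self.out = nn.Linear(256, JOINT_DIM)  # joint angles per frame

    def forward(self, speech, text):
        h, _ = self.rnn(torch.cat([speech, text], dim=-1))
        return self.out(h)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(JOINT_DIM + SPEECH_DIM + TEXT_DIM, 256, batch_first=True)
        self.out = nn.Linear(256, 1)  # real/fake score for the whole sequence

    def forward(self, gesture, speech, text):
        h, _ = self.rnn(torch.cat([gesture, speech, text], dim=-1))
        return self.out(h[:, -1])

speech = torch.randn(8, SEQ_LEN, SPEECH_DIM)   # acoustic features
text = torch.randn(8, SEQ_LEN, TEXT_DIM)       # word embeddings aligned to frames
fake = Generator()(speech, text)
score = Discriminator()(fake, speech, text)    # speech-conditioned plausibility score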
Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model
The generation of co-speech gestures for digital humans is an emerging area in the field of virtual human creation. Prior research has made progress by using acoustic and semantic information as input and adopting classification methods to identify the speaker's identity and emotion for driving co-speech gesture generation. However, this endeavour still faces significant challenges. These challenges go beyond the intricate interplay between co-speech gestures, speech acoustics, and semantics; they also encompass the complexities associated with personality, emotion, and other obscure but important factors. This paper introduces "diffmotion-v2", a speech-conditioned, diffusion-based, non-autoregressive transformer generative model built on the WavLM pre-trained model. It can produce individual and stylized full-body co-speech gestures using only raw speech audio, eliminating the need for complex multimodal processing and manual annotation. First, considering that speech audio not only contains acoustic and semantic features but also conveys personality traits, emotions, and other subtle information related to accompanying gestures, we pioneer the adaptation of WavLM, a large-scale pre-trained model, to extract low-level and high-level audio information. Second, we introduce an adaptive layer-norm architecture in the transformer layers to learn the relationship between speech information and accompanying gestures. Extensive subjective evaluation experiments on the Trinity, ZEGGS, and BEAT datasets confirm that WavLM and the model can synthesize natural co-speech gestures in a variety of styles. Comment: 10 pages, 5 figures, 1 table
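To make the adaptive layer-norm idea concrete, here is a hedged sketch of a transformer block whose LayerNorm scale and shift are predicted from a speech embedding (e.g. pooled WavLM features). The sizes, the AdaLN/AdaLNBlock names, and the block layout are assumptions, not the diffmotion-v2 implementation.

import torch
import torch.nn as nn

class AdaLN(nn.Module):
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        # Predict per-channel scale and shift from the speech condition
        scale, shift = self.to_scale_shift(cond).unsqueeze(1).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift

class AdaLNBlock(nn.Module):
    def __init__(self, dim=256, cond_dim=768, heads=4):
        super().__init__()
        self.norm1, self.norm2 = AdaLN(dim, cond_dim), AdaLN(dim, cond_dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, cond):
        h = self.norm1(x, cond)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x, cond))

motion = torch.randn(2, 120, 256)   # noisy gesture tokens at one diffusion step
speech = torch.randn(2, 768)        # pooled speech embedding (WavLM-sized)
out = AdaLNBlock()(motion, speech)

Conditioning through the normalization layers lets the same block modulate every frame of the gesture sequence with speaker- and style-related information carried by the audio embedding.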
TranSTYLer: Multimodal Behavioral Style Transfer for Facial and Body Gestures Generation
This paper addresses the challenge of transferring the behavior expressivity style of one virtual agent to another while preserving the shape of the behaviors, as it carries communicative meaning. Behavior expressivity style is viewed here as the qualitative properties of behaviors. We propose TranSTYLer, a multimodal transformer-based model that synthesizes the multimodal behaviors of a source speaker with the style of a target speaker. We assume that behavior expressivity style is encoded across various modalities of communication, including text, speech, body gestures, and facial expressions. The model employs a style and content disentanglement scheme to ensure that the transferred style does not interfere with the meaning conveyed by the source behaviors. Our approach eliminates the need for style labels and allows generalization to styles not seen during training. We train our model on the PATS corpus, which we extended to include dialog acts and 2D facial landmarks. Objective and subjective evaluations show that our model outperforms state-of-the-art models in style transfer for styles both seen and unseen during training. To tackle the issues of style and content leakage that may arise, we propose a methodology to assess the degree to which behaviors and gestures associated with the target style are successfully transferred, while ensuring the preservation of those related to the source content.
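The sketch below illustrates the general style/content disentanglement pattern: a per-frame content code is extracted from the source speaker's multimodal sequence, a single style code is pooled from the target speaker's sequence, and a decoder combines the two. The module names, dimensions, and encoders are placeholder assumptions, not the TranSTYLer architecture.

import torch
import torch.nn as nn

FEAT_DIM, CONTENT_DIM, STYLE_DIM, OUT_DIM = 128, 64, 16, 54

class StyleTransfer(nn.Module):
    def __init__(self):
        super().__init__()
        self.content_enc = nn.GRU(FEAT_DIM, CONTENT_DIM, batch_first=True)
        self.style_proj = nn.Linear(FEAT_DIM, STYLE_DIM)
        self.decoder = nn.GRU(CONTENT_DIM + STYLE_DIM, OUT_DIM, batch_first=True)

    def forward(self, source, target):
        content, _ = self.content_enc(source)           # per-frame content code
        style = self.style_proj(target).mean(dim=1)     # one style code per sequence
        style = style.unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.decoder(torch.cat([content, style], dim=-1))
        return out                                      # source content, target style

source = torch.randn(4, 200, FEAT_DIM)   # source speaker's multimodal features
target = torch.randn(4, 200, FEAT_DIM)   # target speaker's multimodal features
stylized = StyleTransfer()(source, target)

Keeping the content code frame-level while collapsing the style code to a single vector is one common way to discourage the style branch from leaking content.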
ACT2G: Attention-based Contrastive Learning for Text-to-Gesture Generation
The recent increase in remote work, online meetings, and tele-operation tasks has made people realize that gestures for avatars and communication robots are more important than previously thought. Gesture is one of the key factors in achieving smooth and natural communication between humans and AI systems and has been intensively researched. Current gesture-generation methods are mostly based on deep neural networks that take text, audio, and other information as input; however, they generate gestures mainly from audio, producing so-called beat gestures. Although beat gestures account for more than 70% of actual human gestures, content-based gestures sometimes play an important role in making avatars more realistic and human-like. In this paper, we propose attention-based contrastive learning for text-to-gesture generation (ACT2G), where generated gestures represent the content of the text by estimating an attention weight for each word in the input text. Since text and gesture features computed with the attention weights are mapped to the same latent space by contrastive learning, once text is given as input, the network outputs a feature vector that can be used to generate gestures related to the content. A user study confirmed that the gestures generated by ACT2G were rated better than those from existing methods. In addition, it was demonstrated that a wide variety of gestures can be generated from the same text by letting creators change the attention weights.
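A hedged sketch of the contrastive objective implied here: paired text and gesture embeddings are pulled together in a shared latent space with a symmetric InfoNCE loss (as popularized by CLIP-style training). The encoders, embedding size, and temperature are assumptions, not the ACT2G implementation.

import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, gesture_emb, temperature=0.07):
    # Normalize so the dot product is cosine similarity
    t = F.normalize(text_emb, dim=-1)
    g = F.normalize(gesture_emb, dim=-1)
    logits = t @ g.T / temperature              # (B, B) similarity matrix
    targets = torch.arange(len(t))              # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

text_emb = torch.randn(32, 128)     # attention-weighted text features
gesture_emb = torch.randn(32, 128)  # encoded gesture features
loss = contrastive_loss(text_emb, gesture_emb)

Once trained, a text embedding lands near the gesture embeddings of matching content, so it can be decoded into gestures that reflect what is being said rather than only the speech rhythm.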
Learning Speech-driven 3D Conversational Gestures from Video
We propose the first approach to automatically and jointly synthesize both synchronous 3D conversational body and hand gestures and 3D face and head animations of a virtual character from speech input. Our algorithm
uses a CNN architecture that leverages the inherent correlation between facial
expression and hand gestures. Synthesis of conversational body gestures is a
multi-modal problem since many similar gestures can plausibly accompany the
same input speech. To synthesize plausible body gestures in this setting, we
train a Generative Adversarial Network (GAN) based model that measures the
plausibility of the generated sequences of 3D body motion when paired with the
input audio features. We also contribute a new way to create a large corpus of
more than 33 hours of annotated body, hand, and face data from in-the-wild
videos of talking people. To this end, we apply state-of-the-art monocular
approaches for 3D body and hand pose estimation as well as dense 3D face
performance capture to the video corpus. In this way, we can train on orders of
magnitude more data than previous algorithms that resort to complex in-studio
motion capture solutions, and thereby train more expressive synthesis
algorithms. Our experiments and user study show the state-of-the-art quality of
our speech-synthesized full 3D character animations.
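As a rough illustration of the adversarial "plausibility" objective described above, the sketch below scores generated 3D motion paired with the input audio features and trains the generator to make such pairs look plausible, alongside a simple reconstruction term. The networks, feature sizes, and loss weights are stand-ins, not the paper's actual models.

import torch
import torch.nn as nn
import torch.nn.functional as F

AUDIO_DIM, MOTION_DIM, T = 64, 63, 120

class G(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(AUDIO_DIM, 128, batch_first=True)
        self.out = nn.Linear(128, MOTION_DIM)
    def forward(self, audio):
        h, _ = self.rnn(audio)
        return self.out(h)                     # per-frame 3D body motion

class D(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(AUDIO_DIM + MOTION_DIM, 128, 5, padding=2),
                                 nn.ReLU(), nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                 nn.Linear(128, 1))
    def forward(self, audio, motion):
        # Score how plausible this motion is for this audio
        return self.net(torch.cat([audio, motion], dim=-1).transpose(1, 2))

gen, disc = G(), D()
audio = torch.randn(4, T, AUDIO_DIM)
real = torch.randn(4, T, MOTION_DIM)
fake = gen(audio)
ones, zeros = torch.ones(4, 1), torch.zeros(4, 1)
d_loss = F.binary_cross_entropy_with_logits(disc(audio, real), ones) + \
         F.binary_cross_entropy_with_logits(disc(audio, fake.detach()), zeros)
g_loss = F.binary_cross_entropy_with_logits(disc(audio, fake), ones) + F.l1_loss(fake, real)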
Understanding the Predictability of Gesture Parameters from Speech and their Perceptual Importance
Gesture behavior is a natural part of human conversation. Much work has
focused on removing the need for tedious hand-animation to create embodied
conversational agents by designing speech-driven gesture generators. However,
these generators often work in a black-box manner, assuming a general
relationship between input speech and output motion. As their success remains
limited, we investigate in more detail how speech may relate to different
aspects of gesture motion. We determine a number of parameters characterizing
gesture, such as speed and gesture size, and explore their relationship to the
speech signal in a two-fold manner. First, we train multiple recurrent networks
to predict the gesture parameters from speech to understand how well gesture
attributes can be modeled from speech alone. We find that gesture parameters
can be partially predicted from speech, with some parameters, such as path length, predicted more accurately than others, such as velocity. Second, we
design a perceptual study to assess the importance of each gesture parameter
for producing motion that people perceive as appropriate for the speech.
Results show that a degradation in any parameter was viewed negatively, but some changes, such as hand shape, were more impactful than others. A video summary can be found at https://youtu.be/aw6-_5kmLjY. Comment: To be published in the Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents (IVA '20).
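A minimal sketch of the first experiment's setup: a recurrent regressor maps a window of speech features to a small vector of gesture parameters such as speed and path length. The feature dimension, the particular parameters, and the network size are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

SPEECH_DIM, N_PARAMS = 40, 4   # e.g. [speed, path_length, size, height]

class ParamPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(SPEECH_DIM, 128, num_layers=2, batch_first=True)
        self.head = nn.Linear(128, N_PARAMS)

    def forward(self, speech):
        _, h = self.rnn(speech)          # summarize the speech window
        return self.head(h[-1])          # one parameter vector per gesture

speech = torch.randn(16, 80, SPEECH_DIM)   # batch of speech-feature windows
params = ParamPredictor()(speech)          # predicted gesture parameters
loss = nn.functional.mse_loss(params, torch.randn(16, N_PARAMS))

Per-parameter prediction error from a model like this is one way to quantify which gesture attributes are recoverable from speech alone and which are not.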