Search CORE

19 research outputs found

HEMVIP: Human Evaluation of Multiple Videos in Parallel

Author: Henter Gustav Eje
Jonell Patrik
Kucherenko Taras
Wolfert Pieter
Yoon Youngwoo
Publication venue
Publication date: 01/01/2021
Field of study

In many research areas, for example motion and gesture generation, objective measures alone do not provide an accurate impression of key stimulus traits such as perceived quality or appropriateness. The gold standard is instead to evaluate these aspects through user studies, especially subjective evaluations of video stimuli. Common evaluation paradigms either present individual stimuli to be scored on Likert-type scales, or ask users to compare and rate videos in a pairwise fashion. However, the time and resources required for such evaluations scale poorly as the number of conditions to be compared increases. Building on standards used for evaluating the quality of multimedia codecs, this paper instead introduces a framework for granular rating of multiple comparable videos in parallel. This methodology essentially analyses all condition pairs at once. Our contributions are 1) a proposed framework, called HEMVIP, for parallel and granular evaluation of multiple video stimuli and 2) a validation study confirming that results obtained using the tool are in close agreement with results of prior studies using conventional multiple pairwise comparisons.Comment: 8 pages, 2 figure

arXiv.org e-Print Archive

Publikationer från KTH

Ghent University Academic Bibliography

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents

Author: Johal Wafa
Liu Carson Yu
Mohammadi Gelareh
Song Yang
Publication venue
Publication date: 17/09/2023
Field of study

Embodied agents, in the form of virtual agents or social robots, are rapidly becoming more widespread. In human-human interactions, humans use nonverbal behaviours to convey their attitudes, feelings, and intentions. Therefore, this capability is also required for embodied agents in order to enhance the quality and effectiveness of their interactions with humans. In this paper, we propose a novel framework that can generate sequences of joint angles from the speech text and speech audio utterances. Based on a conditional Generative Adversarial Network (GAN), our proposed neural network model learns the relationships between the co-speech gestures and both semantic and acoustic features from the speech input. In order to train our neural network model, we employ a public dataset containing co-speech gestures with corresponding speech audio utterances, which were captured from a single male native English speaker. The results from both objective and subjective evaluations demonstrate the efficacy of our gesture-generation framework for Robots and Embodied Agents.Comment: RO-MAN'23, 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), August 2023, Busan, South Kore

arXiv.org e-Print Archive

Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model

Author: Gao Fuxing
Ji Naye
Li Shunman
Wang Zhaohan
Zhang Fan
Zhao Siyuan
Publication venue
Publication date: 13/08/2023
Field of study

The generation of co-speech gestures for digital humans is an emerging area in the field of virtual human creation. Prior research has made progress by using acoustic and semantic information as input and adopting classify method to identify the person's ID and emotion for driving co-speech gesture generation. However, this endeavour still faces significant challenges. These challenges go beyond the intricate interplay between co-speech gestures, speech acoustic, and semantics; they also encompass the complexities associated with personality, emotion, and other obscure but important factors. This paper introduces "diffmotion-v2," a speech-conditional diffusion-based and non-autoregressive transformer-based generative model with WavLM pre-trained model. It can produce individual and stylized full-body co-speech gestures only using raw speech audio, eliminating the need for complex multimodal processing and manually annotated. Firstly, considering that speech audio not only contains acoustic and semantic features but also conveys personality traits, emotions, and more subtle information related to accompanying gestures, we pioneer the adaptation of WavLM, a large-scale pre-trained model, to extract low-level and high-level audio information. Secondly, we introduce an adaptive layer norm architecture in the transformer-based layer to learn the relationship between speech information and accompanying gestures. Extensive subjective evaluation experiments are conducted on the Trinity, ZEGGS, and BEAT datasets to confirm the WavLM and the model's ability to synthesize natural co-speech gestures with various styles.Comment: 10 pages, 5 figures, 1 tabl

arXiv.org e-Print Archive

TranSTYLer: Multimodal Behavioral Style Transfer for Facial and Body Gestures Generation

Author: Fares Mireille
Obin Nicolas
Pelachaud Catherine
Publication venue
Publication date: 08/08/2023
Field of study

This paper addresses the challenge of transferring the behavior expressivity style of a virtual agent to another one while preserving behaviors shape as they carry communicative meaning. Behavior expressivity style is viewed here as the qualitative properties of behaviors. We propose TranSTYLer, a multimodal transformer based model that synthesizes the multimodal behaviors of a source speaker with the style of a target speaker. We assume that behavior expressivity style is encoded across various modalities of communication, including text, speech, body gestures, and facial expressions. The model employs a style and content disentanglement schema to ensure that the transferred style does not interfere with the meaning conveyed by the source behaviors. Our approach eliminates the need for style labels and allows the generalization to styles that have not been seen during the training phase. We train our model on the PATS corpus, which we extended to include dialog acts and 2D facial landmarks. Objective and subjective evaluations show that our model outperforms state of the art models in style transfer for both seen and unseen styles during training. To tackle the issues of style and content leakage that may arise, we propose a methodology to assess the degree to which behavior and gestures associated with the target style are successfully transferred, while ensuring the preservation of the ones related to the source content

arXiv.org e-Print Archive

ACT2G: Attention-based Contrastive Learning for Text-to-Gesture Generation

Author: Ikeuchi Katsushi
Kawasaki Hiroshi
Nakashima Yuta
Teshima Hitoshi
Thomas Diego
Wake Naoki
Publication venue
Publication date: 28/09/2023
Field of study

Recent increase of remote-work, online meeting and tele-operation task makes people find that gesture for avatars and communication robots is more important than we have thought. It is one of the key factors to achieve smooth and natural communication between humans and AI systems and has been intensively researched. Current gesture generation methods are mostly based on deep neural network using text, audio and other information as the input, however, they generate gestures mainly based on audio, which is called a beat gesture. Although the ratio of the beat gesture is more than 70% of actual human gestures, content based gestures sometimes play an important role to make avatars more realistic and human-like. In this paper, we propose a attention-based contrastive learning for text-to-gesture (ACT2G), where generated gestures represent content of the text by estimating attention weight for each word from the input text. In the method, since text and gesture features calculated by the attention weight are mapped to the same latent space by contrastive learning, once text is given as input, the network outputs a feature vector which can be used to generate gestures related to the content. User study confirmed that the gestures generated by ACT2G were better than existing methods. In addition, it was demonstrated that wide variation of gestures were generated from the same text by changing attention weights by creators

arXiv.org e-Print Archive

A Motion Matching-Based Framework for Controllable Gesture Synthesis from Speech

Author: Abdullah A.
Elgharib M.
Habibie I.
Neff M.
Nyatsanga S.
Sarkar K.
Theobalt C.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2022
Field of study

MPG.PuRe

Learning Speech-driven 3D Conversational Gestures from Video

Author: Elgharib Mohamed
Habibie Ikhsanul
Liu Lingjie
Mehta Dushyant
Pons-Moll Gerard
Seidel Hans-Peter
Theobalt Christian
Xu Weipeng
Publication venue
Publication date: 01/01/2021
Field of study

We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesis of conversational body gestures is a multi-modal problem since many similar gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this setting, we train a Generative Adversarial Network (GAN) based model that measures the plausibility of the generated sequences of 3D body motion when paired with the input audio features. We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people. To this end, we apply state-of-the-art monocular approaches for 3D body and hand pose estimation as well as dense 3D face performance capture to the video corpus. In this way, we can train on orders of magnitude more data than previous algorithms that resort to complex in-studio motion capture solutions, and thereby train more expressive synthesis algorithms. Our experiments and user study show the state-of-the-art quality of our speech-synthesized full 3D character animations

arXiv.org e-Print Archive

MPG.PuRe

Understanding the Predictability of Gesture Parameters from Speech and their Perceptual Importance

Author: Hartmann Björn
Kucherenko Taras
Shapiro Ari
Smith Harrison Jesse
Stacy Marsella Chiu
Thiebaux Marcus
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 02/10/2020
Field of study

Gesture behavior is a natural part of human conversation. Much work has focused on removing the need for tedious hand-animation to create embodied conversational agents by designing speech-driven gesture generators. However, these generators often work in a black-box manner, assuming a general relationship between input speech and output motion. As their success remains limited, we investigate in more detail how speech may relate to different aspects of gesture motion. We determine a number of parameters characterizing gesture, such as speed and gesture size, and explore their relationship to the speech signal in a two-fold manner. First, we train multiple recurrent networks to predict the gesture parameters from speech to understand how well gesture attributes can be modeled from speech alone. We find that gesture parameters can be partially predicted from speech, and some parameters, such as path length, being predicted more accurately than others, like velocity. Second, we design a perceptual study to assess the importance of each gesture parameter for producing motion that people perceive as appropriate for the speech. Results show that a degradation in any parameter was viewed negatively, but some changes, such as hand shape, are more impactful than others. A video summarization can be found at https://youtu.be/aw6-_5kmLjY.Comment: To be published in the Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents (IVA 20

arXiv.org e-Print Archive

Crossref