7 research outputs found

    AQ-GT: a Temporally Aligned and Quantized GRU-Transformer for Co-Speech Gesture Synthesis

    Full text link
    The generation of realistic and contextually relevant co-speech gestures is a challenging yet increasingly important task in the creation of multimodal artificial agents. Prior methods focused on learning a direct correspondence between co-speech gesture representations and produced motions, which created seemingly natural but often unconvincing gestures during human assessment. We present an approach to pre-train partial gesture sequences using a generative adversarial network with a quantization pipeline. The resulting codebook vectors serve as both input and output in our framework, forming the basis for the generation and reconstruction of gestures. By learning the mapping of a latent space representation as opposed to directly mapping it to a vector representation, this framework facilitates the generation of highly realistic and expressive gestures that closely replicate human movement and behavior, while simultaneously avoiding artifacts in the generation process. We evaluate our approach by comparing it with established methods for generating co-speech gestures as well as with existing datasets of human behavior. We also perform an ablation study to assess our findings. The results show that our approach outperforms the current state of the art by a clear margin and is partially indistinguishable from human gesturing. We make our data pipeline and the generation framework publicly available.
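
    The quantization pipeline described in the abstract rests on a learned codebook: continuous gesture features are snapped to their nearest codebook vector, and generation then operates on codebook entries rather than raw motion vectors. The PyTorch sketch below illustrates only that nearest-neighbour lookup with a straight-through gradient (VQ-VAE style); the class name, codebook size, and feature dimension are illustrative assumptions and are not taken from the released AQ-GT code.

    # Minimal vector-quantization sketch (illustrative; names and dimensions are
    # assumptions, not taken from the AQ-GT release).
    import torch
    import torch.nn as nn


    class GestureCodebook(nn.Module):
        """Maps continuous gesture features to their nearest codebook vector."""

        def __init__(self, num_codes: int = 512, dim: int = 128, beta: float = 0.25):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)
            self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
            self.beta = beta  # weight of the commitment term

        def forward(self, z):
            # z: (batch, time, dim) continuous encoder output
            flat = z.reshape(-1, z.shape[-1])                  # (B*T, dim)
            dist = torch.cdist(flat, self.codebook.weight)     # distances to all codes
            idx = dist.argmin(dim=-1)                          # nearest code per frame
            z_q = self.codebook(idx).view_as(z)                # quantized features

            # Codebook + commitment losses (VQ-VAE style)
            loss = ((z_q - z.detach()) ** 2).mean() + self.beta * ((z_q.detach() - z) ** 2).mean()

            # Straight-through estimator: gradients flow back to the encoder through z
            z_q = z + (z_q - z).detach()
            return z_q, idx.view(z.shape[:-1]), loss


    if __name__ == "__main__":
        vq = GestureCodebook()
        features = torch.randn(2, 34, 128)      # e.g. 2 clips, 34 frames each
        quantized, codes, vq_loss = vq(features)
        print(quantized.shape, codes.shape, vq_loss.item())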

    Adaptive Gesture Generation for Goal-directed Interaction Support

    No full text
    Voß H. Adaptive Gesture Generation for Goal-directed Interaction Support. In: Doctoral Symposium at the 22nd ACM International Conference on Intelligent Virtual Agents (IVA). 2022.
    Representational gestures are an integral part of human interaction; they help to shape the conversation and facilitate the recall of essential information. While this comes naturally in human interaction, artificial systems still have difficulty creating meaningful representational gestures that convey additional information. In this project we will develop an adaptive non-verbal gesture system that enables the generation of goal-directed gestures for collaborative tasks. As part of the development, we are exploring how much non-verbal expressiveness and fluidity a system needs to exhibit for its gestures to be acknowledged as representational gestures. In addition, we want to understand whether, and how much, a human-agent interaction benefits from the use of representational gestures, and especially whether this differs from a human-human context. After finishing the main components of the system, we will test the influence that intention detection and memory retention have on the information perceived in artificially created gestures. A special focus will be on erroneous information and the resulting deterioration of the interaction. We are currently working on adaptive non-verbal behaviour generation and have already developed an approach for detecting dis-/agreement and confusion events.

    Augmented Co-Speech Gesture Generation: Including Form and Meaning Features to Guide Learning-Based Gesture Synthesis

    No full text
    Voß H, Kopp S. Augmented Co-Speech Gesture Generation: Including Form and Meaning Features to Guide Learning-Based Gesture Synthesis. In: ACM International Conference on Intelligent Virtual Agents (IVA '23), September 19–22, 2023, Würzburg, Germany.
    Due to their significance in human communication, the automatic generation of co-speech gestures in artificial embodied agents has received a lot of attention. Although modern deep learning approaches can generate realistic-looking conversational gestures from spoken language, they often lack the ability to convey meaningful information and generate contextually appropriate gestures. This paper presents an augmented approach to the generation of co-speech gestures that additionally takes into account given form and meaning features for the gestures. Our framework effectively acquires this information from a small corpus with rich semantic annotations and a larger corpus without such information. We provide an analysis of the effects of distinctive feature targets, and we report on a human rater evaluation study demonstrating that our framework achieves semantic coherence and person perception on the same level as human ground-truth behavior. We make our data pipeline and the generation framework publicly available.
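
    The paper's training code is not reproduced here, but one plausible way to learn from a small corpus with semantic annotations and a larger corpus without them is to mask the form/meaning-feature loss for unannotated samples. The sketch below shows only that masking idea; the loss terms, weighting, and tensor shapes are assumptions for illustration, not the authors' implementation.

    # Illustrative sketch of mixing annotated and unannotated batches: the semantic
    # feature loss is applied only where annotations exist. Names and weights are
    # assumptions, not the authors' code.
    import torch
    import torch.nn.functional as F


    def combined_loss(pred_motion, target_motion, pred_features, target_features,
                      has_annotation, feature_weight: float = 0.1):
        """Motion reconstruction loss plus a masked form/meaning-feature loss.

        has_annotation: (batch,) boolean mask, True for samples from the annotated corpus.
        """
        motion_loss = F.l1_loss(pred_motion, target_motion)

        if has_annotation.any():
            feature_loss = F.binary_cross_entropy_with_logits(
                pred_features[has_annotation], target_features[has_annotation])
        else:
            feature_loss = pred_motion.new_zeros(())  # no annotated samples in this batch

        return motion_loss + feature_weight * feature_loss


    if __name__ == "__main__":
        b, t, d, f = 4, 30, 57, 8   # batch, frames, pose dims, feature labels (assumed)
        loss = combined_loss(
            pred_motion=torch.randn(b, t, d),
            target_motion=torch.randn(b, t, d),
            pred_features=torch.randn(b, f),
            target_features=torch.randint(0, 2, (b, f)).float(),
            has_annotation=torch.tensor([True, False, True, False]),
        )
        print(loss.item())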

    AQ-GT: a Temporally Aligned and Quantized GRU-Transformer for Co-Speech Gesture Synthesis

    No full text
    Voß H, Kopp S. AQ-GT: a Temporally Aligned and Quantized GRU-Transformer for Co-Speech Gesture Synthesis. arXiv:2305.01241. 2023.
    The generation of realistic and contextually relevant co-speech gestures is a challenging yet increasingly important task in the creation of multimodal artificial agents. Prior methods focused on learning a direct correspondence between co-speech gesture representations and produced motions, which created seemingly natural but often unconvincing gestures during human assessment. We present an approach to pre-train partial gesture sequences using a generative adversarial network with a quantization pipeline. The resulting codebook vectors serve as both input and output in our framework, forming the basis for the generation and reconstruction of gestures. By learning the mapping of a latent space representation as opposed to directly mapping it to a vector representation, this framework facilitates the generation of highly realistic and expressive gestures that closely replicate human movement and behavior, while simultaneously avoiding artifacts in the generation process. We evaluate our approach by comparing it with established methods for generating co-speech gestures as well as with existing datasets of human behavior. We also perform an ablation study to assess our findings. The results show that our approach outperforms the current state of the art by a clear margin and is partially indistinguishable from human gesturing. We make our data pipeline and the generation framework publicly available.

    FEIN-Z: Autoregressive Behavior Cloning for Speech-Driven Gesture Generation

    No full text
    Harz L, Voß H, Kopp S. FEIN-Z: Autoregressive Behavior Cloning for Speech-Driven Gesture Generation. In: Proceedings of the 25th International Conference on Multimodal Interaction (ICMI '23). New York, NY, USA: ACM; 2023: 763–771.

    Addressing Data Scarcity in Multimodal User State Recognition by Combining Semi-Supervised and Supervised Learning

    No full text
    Voß H, Wersing H, Kopp S. Addressing Data Scarcity in Multimodal User State Recognition by Combining Semi-Supervised and Supervised Learning. In: Hammal Z, ed. Companion Publication of the 2021 International Conference on Multimodal Interaction. New York, NY: Association for Computing Machinery; 2021: 317-323.
    Detecting mental states of human users is crucial for the development of cooperative and intelligent robots, as it enables the robot to understand the user's intentions and desires. Despite their importance, it is difficult to obtain a large amount of high-quality data for training automatic recognition algorithms, as the time and effort required to collect and label such data is prohibitively high. In this paper we present a multimodal machine learning approach for detecting dis-/agreement and confusion states in a human-robot interaction environment, using just a small amount of manually annotated data. We collect a data set by conducting a human-robot interaction study and develop a novel preprocessing pipeline for our machine learning approach. By combining semi-supervised and supervised architectures, we are able to achieve an average F1-score of 81.1% for dis-/agreement detection with a small amount of labeled data and a large unlabeled data set, while simultaneously increasing the robustness of the model compared to the supervised approach.
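
    The abstract does not detail the semi-supervised component, so the following is only a hypothetical sketch of one common way to combine a small labeled set with a large unlabeled pool: confidence-thresholded pseudo-labelling on top of an ordinary supervised loss. The model, threshold, and class count are illustrative assumptions, not the paper's architecture.

    # Hypothetical pseudo-labelling sketch; the paper's actual semi-supervised setup
    # may differ. This only illustrates mixing labeled and unlabeled data.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    def training_step(model, optimizer, labeled_x, labeled_y, unlabeled_x,
                      threshold: float = 0.95, unlabeled_weight: float = 0.5):
        model.train()
        optimizer.zero_grad()

        # Supervised loss on the small labeled batch
        sup_loss = F.cross_entropy(model(labeled_x), labeled_y)

        # Pseudo-labels for confident unlabeled samples
        with torch.no_grad():
            probs = F.softmax(model(unlabeled_x), dim=-1)
            conf, pseudo_y = probs.max(dim=-1)
            keep = conf >= threshold

        if keep.any():
            unsup_loss = F.cross_entropy(model(unlabeled_x[keep]), pseudo_y[keep])
        else:
            unsup_loss = sup_loss.new_zeros(())

        loss = sup_loss + unlabeled_weight * unsup_loss
        loss.backward()
        optimizer.step()
        return loss.item()


    if __name__ == "__main__":
        model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 3))  # 3 states assumed
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        print(training_step(model, opt,
                            labeled_x=torch.randn(8, 64), labeled_y=torch.randint(0, 3, (8,)),
                            unlabeled_x=torch.randn(32, 64)))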

    FEIN-Z: Autoregressive Behavior Cloning for Speech-Driven Gesture Generation

    No full text
    Harz L, Voß H, Kopp S. FEIN-Z: Autoregressive Behavior Cloning for Speech-Driven Gesture Generation. In: André E, Chetouani M, Vaufreydaz D, et al., eds. Proceedings of the 25th International Conference on Multimodal Interaction (ICMI '23). New York, NY, USA: ACM; 2023: 763–771.
    Human communication relies on multiple modalities such as verbal expressions, facial cues, and bodily gestures. Developing computational approaches to process and generate these multimodal signals is critical for seamless human-agent interaction. A particular challenge is the generation of co-speech gestures due to the large variability and number of gestures that can accompany a verbal utterance, leading to a one-to-many mapping problem. This paper presents an approach based on a Feature Extraction Infusion Network (FEIN-Z) that adopts insights from robot imitation learning and applies them to co-speech gesture generation. Building on the BC-Z architecture, our framework combines transformer architectures and Wasserstein generative adversarial networks. We describe the FEIN-Z methodology and evaluation results obtained within the GENEA Challenge 2023, demonstrating good results and significant improvements in human-likeness over the GENEA baseline. We discuss potential areas for improvement, such as refining input segmentation, employing more fine-grained control networks, and exploring alternative inference methods.
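
    FEIN-Z reportedly combines transformer architectures with Wasserstein generative adversarial networks. As a rough illustration of the adversarial part only, the sketch below shows basic WGAN critic and generator losses over gesture sequences; the recurrent critic, tensor dimensions, and the omission of speech conditioning and a gradient penalty are simplifying assumptions, not the paper's architecture.

    # Minimal Wasserstein-GAN loss sketch for gesture sequences (illustrative only;
    # FEIN-Z's actual networks and conditioning are not reproduced here).
    import torch
    import torch.nn as nn


    class SeqCritic(nn.Module):
        """Scores a gesture sequence; higher means more real-looking motion."""

        def __init__(self, dim: int = 57, hidden: int = 128):
            super().__init__()
            self.gru = nn.GRU(dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, motion):                 # motion: (batch, time, dim)
            _, h = self.gru(motion)
            return self.head(h[-1]).squeeze(-1)    # (batch,)


    def wgan_losses(critic, real_motion, fake_motion):
        # Critic tries to separate real from generated motion ...
        critic_loss = critic(fake_motion.detach()).mean() - critic(real_motion).mean()
        # ... while the generator tries to raise the critic's score on its samples.
        generator_loss = -critic(fake_motion).mean()
        return critic_loss, generator_loss


    if __name__ == "__main__":
        critic = SeqCritic()
        real = torch.randn(4, 30, 57)
        fake = torch.randn(4, 30, 57)
        c_loss, g_loss = wgan_losses(critic, real, fake)
        print(c_loss.item(), g_loss.item())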