5,240 research outputs found

    Analyzing Input and Output Representations for Speech-Driven Gesture Generation

    Full text link
    This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences. We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning.Comment: Accepted at IVA '19. Shorter version published at AAMAS '19. The code is available at https://github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencode

    Interactive Robot Learning of Gestures, Language and Affordances

    Full text link
    A growing field in robotics and Artificial Intelligence (AI) research is human-robot collaboration, whose target is to enable effective teamwork between humans and robots. However, in many situations human teams are still superior to human-robot teams, primarily because human teams can easily agree on a common goal with language, and the individual members observe each other effectively, leveraging their shared motor repertoire and sensorimotor resources. This paper shows that for cognitive robots it is possible, and indeed fruitful, to combine knowledge acquired from interacting with elements of the environment (affordance exploration) with the probabilistic observation of another agent's actions. We propose a model that unites (i) learning robot affordances and word descriptions with (ii) statistical recognition of human gestures with vision sensors. We discuss theoretical motivations, possible implementations, and we show initial results which highlight that, after having acquired knowledge of its surrounding environment, a humanoid robot can generalize this knowledge to the case when it observes another agent (human partner) performing the same motor actions previously executed during training.Comment: code available at https://github.com/gsaponaro/glu-gesture

    A Review of Evaluation Practices of Gesture Generation in Embodied Conversational Agents

    Full text link
    Embodied Conversational Agents (ECA) take on different forms, including virtual avatars or physical agents, such as a humanoid robot. ECAs are often designed to produce nonverbal behaviour to complement or enhance its verbal communication. One form of nonverbal behaviour is co-speech gesturing, which involves movements that the agent makes with its arms and hands that is paired with verbal communication. Co-speech gestures for ECAs can be created using different generation methods, such as rule-based and data-driven processes. However, reports on gesture generation methods use a variety of evaluation measures, which hinders comparison. To address this, we conducted a systematic review on co-speech gesture generation methods for iconic, metaphoric, deictic or beat gestures, including their evaluation methods. We reviewed 22 studies that had an ECA with a human-like upper body that used co-speech gesturing in a social human-agent interaction, including a user study to evaluate its performance. We found most studies used a within-subject design and relied on a form of subjective evaluation, but lacked a systematic approach. Overall, methodological quality was low-to-moderate and few systematic conclusions could be drawn. We argue that the field requires rigorous and uniform tools for the evaluation of co-speech gesture systems. We have proposed recommendations for future empirical evaluation, including standardised phrases and test scenarios to test generative models. We have proposed a research checklist that can be used to report relevant information for the evaluation of generative models as well as to evaluate co-speech gesture use.Comment: 9 page

    A Comprehensive Review of Data-Driven Co-Speech Gesture Generation

    Full text link
    Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic generation of such co-speech gestures is a long-standing problem in computer animation and is considered an enabling technology in film, games, virtual social spaces, and for interaction with social robots. The problem is made challenging by the idiosyncratic and non-periodic nature of human co-speech gesture motion, and by the great diversity of communicative functions that gestures encompass. Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion, combined with strides in deep-learning-based generative models, that benefit from the growing availability of data. This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models. First, we articulate the theory describing human gesticulation and how it complements speech. Next, we briefly discuss rule-based and classical statistical gesture synthesis, before delving into deep learning approaches. We employ the choice of input modalities as an organizing principle, examining systems that generate gestures from audio, text, and non-linguistic input. We also chronicle the evolution of the related training data sets in terms of size, diversity, motion quality, and collection method. Finally, we identify key research challenges in gesture generation, including data availability and quality; producing human-like motion; grounding the gesture in the co-occurring speech in interaction with other speakers, and in the environment; performing gesture evaluation; and integration of gesture synthesis into applications. We highlight recent approaches to tackling the various key challenges, as well as the limitations of these approaches, and point toward areas of future development.Comment: Accepted for EUROGRAPHICS 202

    Prosody-Based Adaptive Metaphoric Head and Arm Gestures Synthesis in Human Robot Interaction

    Get PDF
    International audienceIn human-human interaction, the process of communication can be established through three modalities: verbal, non-verbal (i.e., gestures), and/or para-verbal (i.e., prosody). The linguistic literature shows that the para-verbal and non-verbal cues are naturally aligned and synchronized, however the natural mechanism of this synchronization is still unexplored. The difficulty encountered during the coordination between prosody and metaphoric head-arm gestures concerns the conveyed meaning , the way of performing gestures with respect to prosodic characteristics, their relative temporal arrangement, and their coordinated organization in the phrasal structure of utterance. In this research, we focus on the mechanism of mapping between head-arm gestures and speech prosodic characteristics in order to generate an adaptive robot behavior to the interacting human's emotional state. Prosody patterns and the motion curves of head-arm gestures are aligned separately into parallel Hidden Markov Models (HMM). The mapping between speech and head-arm gestures is based on the Coupled Hidden Markov Models (CHMM), which could be seen as a multi-stream collection of HMM, characterizing the segmented prosody and head-arm gestures' data. An emotional state based audio-video database has been created for the validation of this study. The obtained results show the effectiveness of the proposed methodology
    • 

    corecore