Analyzing Input and Output Representations for Speech-Driven Gesture Generation
This paper presents a novel framework for automatic speech-driven gesture
generation, applicable to human-agent interaction including both virtual agents
and robots. Specifically, we extend recent deep-learning-based, data-driven
methods for speech-driven gesture generation by incorporating representation
learning. Our model takes speech as input and produces gestures as output, in
the form of a sequence of 3D coordinates. Our approach consists of two steps.
First, we learn a lower-dimensional representation of human motion using a
denoising autoencoder neural network, consisting of a motion encoder MotionE
and a motion decoder MotionD. The learned representation preserves the most
important aspects of the human pose variation while removing less relevant
variation. Second, we train a novel encoder network SpeechE to map from speech
to a corresponding motion representation with reduced dimensionality. At test
time, the speech encoder and the motion decoder networks are combined: SpeechE
predicts motion representations based on a given speech signal and MotionD then
decodes these representations to produce motion sequences. We evaluate
different representation sizes in order to find the most effective
dimensionality for the representation. We also evaluate the effects of using
different speech features as input to the model. We find that mel-frequency
cepstral coefficients (MFCCs), alone or combined with prosodic features,
perform the best. The results of a subsequent user study confirm the benefits
of the representation learning.
Comment: Accepted at IVA '19. Shorter version published at AAMAS '19. The code
is available at
https://github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencode
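The two-step design described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the layer sizes, the single-layer networks, the Gaussian corruption, and the helper names (`make_layer`, `apply_layer`, `corrupt`, `autoencoder_loss`, `speech_loss`, `generate_gestures`) are assumptions made for illustration, and the weights are random rather than trained. Only the names MotionE, MotionD, and SpeechE and the way they are combined at test time come from the abstract.

```python
import math
import random

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
SPEECH_DIM = 26   # one frame of speech features, e.g. MFCCs
POSE_DIM = 45     # one pose, e.g. 15 joints x 3D coordinates
LATENT_DIM = 8    # reduced-dimensionality motion representation


def make_layer(n_in, n_out, rng):
    """Return a random, untrained weight matrix with n_out rows and n_in columns."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(weights, vector, activation=math.tanh):
    """Linear map (no bias) followed by an elementwise activation."""
    return [activation(sum(w * x for w, x in zip(row, vector))) for row in weights]


rng = random.Random(0)
motion_e_w = make_layer(POSE_DIM, LATENT_DIM, rng)    # MotionE: pose -> representation
motion_d_w = make_layer(LATENT_DIM, POSE_DIM, rng)    # MotionD: representation -> pose
speech_e_w = make_layer(SPEECH_DIM, LATENT_DIM, rng)  # SpeechE: speech -> representation


def motion_e(pose):
    return apply_layer(motion_e_w, pose)


def motion_d(z):
    # Identity activation so decoded coordinates are not squashed into [-1, 1].
    return apply_layer(motion_d_w, z, activation=lambda v: v)


def speech_e(frame):
    return apply_layer(speech_e_w, frame)


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def corrupt(pose, noise_rng, sigma=0.05):
    """Assumed corruption: additive Gaussian noise on the autoencoder input."""
    return [x + noise_rng.gauss(0.0, sigma) for x in pose]


def autoencoder_loss(pose, noise_rng):
    """Step 1 objective: reconstruct the clean pose from a corrupted copy."""
    return mse(motion_d(motion_e(corrupt(pose, noise_rng))), pose)


def speech_loss(frame, pose):
    """Step 2 objective: SpeechE output should match MotionE's representation."""
    return mse(speech_e(frame), motion_e(pose))


def generate_gestures(speech_frames):
    """Test time: SpeechE predicts representations, MotionD decodes them."""
    return [motion_d(speech_e(frame)) for frame in speech_frames]
```

In this sketch, step 1 would minimize `autoencoder_loss` over motion data, step 2 would minimize `speech_loss` over paired speech and motion frames with MotionE held fixed, and `generate_gestures` chains the two trained halves without ever calling MotionE.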
Interactive Robot Learning of Gestures, Language and Affordances
A growing field in robotics and Artificial Intelligence (AI) research is
human-robot collaboration, whose target is to enable effective teamwork between
humans and robots. However, in many situations human teams are still superior
to human-robot teams, primarily because human teams can easily agree on a
common goal with language, and the individual members observe each other
effectively, leveraging their shared motor repertoire and sensorimotor
resources. This paper shows that for cognitive robots it is possible, and
indeed fruitful, to combine knowledge acquired from interacting with elements
of the environment (affordance exploration) with the probabilistic observation
of another agent's actions.
We propose a model that unites (i) learning robot affordances and word
descriptions with (ii) statistical recognition of human gestures with vision
sensors. We discuss theoretical motivations, possible implementations, and we
show initial results which highlight that, after having acquired knowledge of
its surrounding environment, a humanoid robot can generalize this knowledge to
the case when it observes another agent (human partner) performing the same
motor actions previously executed during training.
Comment: code available at https://github.com/gsaponaro/glu-gesture
A Review of Evaluation Practices of Gesture Generation in Embodied Conversational Agents
Embodied Conversational Agents (ECA) take on different forms, including
virtual avatars or physical agents, such as a humanoid robot. ECAs are often
designed to produce nonverbal behaviour to complement or enhance their verbal
communication. One form of nonverbal behaviour is co-speech gesturing, which
involves movements that the agent makes with its arms and hands that are paired
with verbal communication. Co-speech gestures for ECAs can be created using
different generation methods, such as rule-based and data-driven processes.
However, reports on gesture generation methods use a variety of evaluation
measures, which hinders comparison. To address this, we conducted a systematic
review on co-speech gesture generation methods for iconic, metaphoric, deictic
or beat gestures, including their evaluation methods. We reviewed 22 studies
that had an ECA with a human-like upper body that used co-speech gesturing in a
social human-agent interaction, including a user study to evaluate its
performance. We found most studies used a within-subject design and relied on a
form of subjective evaluation, but lacked a systematic approach. Overall,
methodological quality was low-to-moderate and few systematic conclusions could
be drawn. We argue that the field requires rigorous and uniform tools for the
evaluation of co-speech gesture systems. We have proposed recommendations for
future empirical evaluation, including standardised phrases and test scenarios
to test generative models. We have proposed a research checklist that can be
used to report relevant information for the evaluation of generative models as
well as to evaluate co-speech gesture use.
Comment: 9 page
A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
Gestures that accompany speech are an essential part of natural and efficient
embodied human communication. The automatic generation of such co-speech
gestures is a long-standing problem in computer animation and is considered an
enabling technology in film, games, virtual social spaces, and for interaction
with social robots. The problem is made challenging by the idiosyncratic and
non-periodic nature of human co-speech gesture motion, and by the great
diversity of communicative functions that gestures encompass. Gesture
generation has seen surging interest recently, owing to the emergence of more
and larger datasets of human gesture motion, combined with strides in
deep-learning-based generative models that benefit from the growing
availability of data. This review article summarizes co-speech gesture
generation research, with a particular focus on deep generative models. First,
we articulate the theory describing human gesticulation and how it complements
speech. Next, we briefly discuss rule-based and classical statistical gesture
synthesis, before delving into deep learning approaches. We employ the choice
of input modalities as an organizing principle, examining systems that generate
gestures from audio, text, and non-linguistic input. We also chronicle the
evolution of the related training data sets in terms of size, diversity, motion
quality, and collection method. Finally, we identify key research challenges in
gesture generation, including data availability and quality; producing
human-like motion; grounding the gesture in the co-occurring speech, in
interaction with other speakers, and in the environment; performing gesture
evaluation; and integration of gesture synthesis into applications. We
highlight recent approaches to tackling the various key challenges, as well as
the limitations of these approaches, and point toward areas of future
development.
Comment: Accepted for EUROGRAPHICS 202
Prosody-Based Adaptive Metaphoric Head and Arm Gestures Synthesis in Human Robot Interaction
In human-human interaction, communication can be established through three modalities: verbal, non-verbal (i.e., gestures), and/or para-verbal (i.e., prosody). The linguistic literature shows that para-verbal and non-verbal cues are naturally aligned and synchronized; however, the natural mechanism of this synchronization is still unexplored. The difficulty in coordinating prosody and metaphoric head-arm gestures concerns the conveyed meaning, the way gestures are performed with respect to prosodic characteristics, their relative temporal arrangement, and their coordinated organization in the phrasal structure of the utterance. In this research, we focus on the mechanism of mapping between head-arm gestures and speech prosodic characteristics in order to generate robot behavior that adapts to the interacting human's emotional state. Prosody patterns and the motion curves of head-arm gestures are aligned separately into parallel Hidden Markov Models (HMMs). The mapping between speech and head-arm gestures is based on Coupled Hidden Markov Models (CHMMs), which can be seen as a multi-stream collection of HMMs characterizing the segmented prosody and head-arm gesture data. An audio-video database based on emotional states has been created for the validation of this study. The obtained results show the effectiveness of the proposed methodology.