Understanding the Predictability of Gesture Parameters from Speech and their Perceptual Importance
Gesture behavior is a natural part of human conversation. Much work has
focused on removing the need for tedious hand-animation to create embodied
conversational agents by designing speech-driven gesture generators. However,
these generators often work in a black-box manner, assuming a general
relationship between input speech and output motion. As their success remains
limited, we investigate in more detail how speech may relate to different
aspects of gesture motion. We determine a number of parameters characterizing
gesture, such as speed and gesture size, and explore their relationship to the
speech signal in a two-fold manner. First, we train multiple recurrent networks
to predict the gesture parameters from speech to understand how well gesture
attributes can be modeled from speech alone. We find that gesture parameters
can be partially predicted from speech, with some parameters, such as path
length, predicted more accurately than others, like velocity. Second, we
design a perceptual study to assess the importance of each gesture parameter
for producing motion that people perceive as appropriate for the speech.
Results show that a degradation in any parameter was viewed negatively, but
some changes, such as hand shape, are more impactful than others. A video
summary can be found at https://youtu.be/aw6-_5kmLjY.
Comment: To be published in the Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents (IVA '20).
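As a minimal sketch of the kind of gesture parameters the abstract names (path length and velocity), the snippet below computes both from a wrist trajectory. The frame rate and the array layout (frames × 3 coordinates) are assumptions for illustration, not details taken from the paper.

```python
# Sketch: computing two gesture parameters (path length, mean velocity)
# from a 3-D wrist trajectory. fps and array layout are assumed.
import numpy as np

def gesture_parameters(wrist_positions, fps=30.0):
    """Return (path_length, mean_velocity) for one gesture segment.

    wrist_positions: array of shape (n_frames, 3) with 3-D positions.
    """
    pos = np.asarray(wrist_positions, dtype=float)
    steps = np.linalg.norm(np.diff(pos, axis=0), axis=1)  # per-frame displacement
    path_length = steps.sum()                             # total distance travelled
    duration = (len(pos) - 1) / fps                       # elapsed time in seconds
    mean_velocity = path_length / duration if duration > 0 else 0.0
    return path_length, mean_velocity
```

For a straight-line trajectory covering 0.3 units over one second, this returns a path length of 0.3 and a mean velocity of 0.3 units per second.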
Temporally Guided Music-to-Body-Movement Generation
This paper presents a neural network model that generates a virtual violinist's
3-D skeleton movements from music audio. Improving on the conventional
recurrent neural network models used to generate 2-D skeleton data in previous
works, the proposed model incorporates an encoder-decoder architecture, as well
as the self-attention mechanism to model the complicated dynamics in body
movement sequences. To facilitate the optimization of the self-attention model,
beat tracking is applied to determine effective sizes and boundaries of the
training examples. The decoder is accompanied by a refining network and a
bowing attack inference mechanism to emphasize the right-hand behavior and
bowing attack timing. Both objective and subjective evaluations reveal that the
proposed model outperforms the state-of-the-art methods. To the best of our
knowledge, this work represents the first attempt to generate a 3-D violinist's
body movements while considering key features of musical body movement.
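The self-attention mechanism the abstract mentions can be sketched as scaled dot-product attention over a movement sequence. The single head, dimensions, and plain NumPy implementation here are illustrative choices, not the paper's actual architecture or settings.

```python
# Sketch: scaled dot-product self-attention, the mechanism added on top
# of the encoder-decoder. Dimensions and single head are illustrative.
import numpy as np

def self_attention(x, wq, wk, wv):
    """x: (seq_len, d_model); wq/wk/wv: (d_model, d_k) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # weighted sum of values
```

Beat tracking, as described above, would then bound `seq_len` so that each training example spans a musically meaningful segment, keeping the quadratic attention cost manageable.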
Generating Gestures from Speech
This article describes a first version of a system for translating speech into Spanish Sign Language. The proposed system is made up of four modules: speech recognition, semantic analysis, gesture sequence generation, and gesture playing. For speech recognition and semantic analysis, we use modules developed by IBM and the University of Colorado, respectively. The gesture sequence generation module takes the semantic concepts obtained in the semantic analysis and associates them with Spanish Sign Language gestures. This association is carried out based on a set of generation rules. For gesture animation, we have developed an animated character and a strategy for reducing the effort of gesture generation. This strategy consists of having the system automatically generate all the agent positions needed for the gesture animation. In this process, the system uses a few main agent positions (2-3 per second) and some interpolation strategies, both created beforehand by the service developer.
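The two ideas above, rule-based mapping from semantic concepts to gesture sequences and interpolation between sparse key positions, can be sketched as follows. The concept names, gesture labels, and joint values are invented for illustration; the real system's rules and key positions come from the service developer.

```python
# Sketch of the rule-based pipeline: concepts -> gestures, then linear
# interpolation between sparse key poses (2-3 per second in the paper).
CONCEPT_TO_GESTURES = {          # hypothetical generation rules
    "greeting": ["HELLO"],
    "departure": ["GOODBYE", "THANKS"],
}

def gestures_for(concepts):
    """Map semantic concepts to a flat gesture sequence via the rule table."""
    return [g for c in concepts for g in CONCEPT_TO_GESTURES.get(c, [])]

def interpolate_positions(key_positions, frames_between):
    """Linearly interpolate agent positions between sparse key poses."""
    out = []
    for a, b in zip(key_positions, key_positions[1:]):
        for i in range(frames_between):
            t = i / frames_between
            out.append(tuple(pa + t * (pb - pa) for pa, pb in zip(a, b)))
    out.append(key_positions[-1])  # close with the final key pose
    return out
```

The design point is that the developer only authors the sparse key positions and rules; the dense per-frame poses needed for smooth animation are generated automatically.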