9,557 research outputs found

    Analyzing Input and Output Representations for Speech-Driven Gesture Generation

    This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal, and MotionD then decodes these representations to produce motion sequences. We evaluate different representation sizes to find the most effective dimensionality. We also evaluate the effects of using different speech features as input to the model, and find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform best. The results of a subsequent user study confirm the benefits of the representation learning.
    Comment: Accepted at IVA '19. Shorter version published at AAMAS '19. The code is available at https://github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencode
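    As a rough illustration of the two-step pipeline this abstract describes, the sketch below wires a denoising-autoencoder-style motion model (MotionE/MotionD) to a speech encoder (SpeechE), assuming PyTorch; all layer sizes and architectures here are invented stand-ins, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class MotionAutoencoder(nn.Module):
    """Step 1: autoencode pose frames into a lower-dimensional representation."""
    def __init__(self, pose_dim=45, repr_dim=32):  # sizes are illustrative only
        super().__init__()
        self.encode = nn.Sequential(  # MotionE
            nn.Linear(pose_dim, 128), nn.ReLU(), nn.Linear(128, repr_dim))
        self.decode = nn.Sequential(  # MotionD
            nn.Linear(repr_dim, 128), nn.ReLU(), nn.Linear(128, pose_dim))

    def forward(self, noisy_pose):
        # Denoising objective: reconstruct a clean pose from a noisy input.
        return self.decode(self.encode(noisy_pose))

class SpeechEncoder(nn.Module):
    """Step 2: map speech features (e.g. MFCCs) to motion representations."""
    def __init__(self, speech_dim=26, repr_dim=32):
        super().__init__()
        self.net = nn.Sequential(  # SpeechE
            nn.Linear(speech_dim, 128), nn.ReLU(), nn.Linear(128, repr_dim))

    def forward(self, speech_feats):
        return self.net(speech_feats)

# At test time the two are chained: SpeechE predicts representations from
# speech, and MotionD decodes them into a sequence of 3D joint coordinates.
ae, speech_enc = MotionAutoencoder(), SpeechEncoder()
mfccs = torch.randn(100, 26)             # 100 frames of 26-dim speech features
gestures = ae.decode(speech_enc(mfccs))  # -> (100, 45) pose sequence
```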

    Early Turn-taking Prediction with Spiking Neural Networks for Human Robot Collaboration

    Turn-taking is essential to the structure of human teamwork. Humans are typically aware of team members' intention to keep or relinquish their turn before a turn switch, when responsibility for working on a shared task is shifted. Future co-robots are expected to provide the same competence. To that end, this paper proposes the Cognitive Turn-taking Model (CTTM), which leverages cognitive models (i.e., spiking neural networks) to achieve early turn-taking prediction. The CTTM framework can process multimodal human communication cues (both implicit and explicit) and predict human turn-taking intentions at an early stage. The proposed framework is tested on a simulated surgical procedure, in which a robotic scrub nurse predicts the surgeon's turn-taking intention. The CTTM framework was found to outperform state-of-the-art turn-taking prediction algorithms by a large margin. It also outperforms humans when presented with partial observations of communication cues (i.e., less than 40% of the full action). This early prediction capability enables robots to initiate turn-taking actions at an early stage, which facilitates collaboration and increases overall efficiency.
    Comment: Submitted to IEEE International Conference on Robotics and Automation (ICRA) 201
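    The abstract does not give the CTTM's internals, but the core idea of early prediction from partial observations can be sketched with a toy leaky evidence accumulator (loosely reminiscent of a leaky integrate-and-fire neuron). The cue features, weights, and threshold below are all invented for illustration; this is not the paper's model.

```python
import numpy as np

def early_turn_prediction(cue_frames, weights, threshold=3.0, decay=0.9):
    """Integrate weighted multimodal cues over time and commit to a
    turn-switch prediction as soon as accumulated evidence crosses a
    threshold, i.e. before the full action has been observed.

    cue_frames: (T, D) array, one row of D cue features per timestep.
    weights:    (D,) relative cue importance (learned in a real system).
    Returns (predicted_turn_switch, fraction_of_action_observed).
    """
    evidence = 0.0
    for t, frame in enumerate(cue_frames):
        evidence = decay * evidence + weights @ frame  # leaky integration
        if evidence > threshold:                       # "fire": commit early
            return True, (t + 1) / len(cue_frames)
    return False, 1.0

# Synthetic example: 50 frames of 4 cues (e.g. gaze, pause, gesture, energy).
rng = np.random.default_rng(0)
cues = rng.random((50, 4))
pred, seen = early_turn_prediction(cues, np.array([0.5, 1.0, 0.8, 0.3]))
print(pred, f"decision after observing {seen:.0%} of the action")
```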

    P-model Alternative to the T-model

    Standard linguistic analysis of syntax uses the T-model. This model requires the ordering D-structure >> S-structure >> LF, where D-structure is the deep structure, S-structure is the surface structure, and LF is logical form. Between each of these representations there is movement, which alters the order of the constituent words; movement is achieved using the principles and parameters of syntactic theory. Psychological analysis of sentence production is usually either serial or connectionist. Serial psychological models do not immediately accommodate the T-model, so a new model, called the P-model, is introduced here. The P-model differs from previous linguistic and psychological models. It is argued that the LF representation should be replaced by a variant of Frege's three qualities (sense, reference, and force), called the Frege representation or F-representation. In the F-representation the order of elements is not necessarily the same as that in LF, and it is suggested that the correct ordering is F-representation >> D-structure >> S-structure. This ordering appears to lead to a more natural view of sentence production and processing. Within this framework, movement originates as the outcome of emphasis applied to the sentence. The requirement that the F-representation precede the D-structure calls for a specification of the particular principles and parameters that govern movement of words between representations. In general this would imply that there is a preferred or optimal ordering of the symbolic string in the F-representation. The standard ordering is retained because the general way of producing such an optimal ordering is unclear. In this case it is possible to produce an analysis of movement between LF and D-structure similar to the usual analysis of movement between S-structure and LF. It is suggested that a maximal amount of information about a language's grammar and lexicon is stored, because of the necessity of analyzing corrupted data.

    Modifications and Frequency Occurrence of Gestures in NS - NS and NNS - NS Dyads

    In this study, I investigate cross-linguistic differences and similarities in speech-associated gesture in NS (Native Speaker) - NS and NNS (Nonnative Speaker) - NS dyads during narrative telling. Gesture production by Indonesian native speakers communicating in Indonesian (L1) and in English (L2) was coded and assessed based on McNeill's model of overall gesture units. The Indonesian speakers' gesture modification when interacting in English was measured by the size of their gestures. The results indicate that Indonesian native speakers gesture more when they communicate in English and modify their gestures by making them bigger and therefore more noticeable to their interlocutors. They use gestures as a communication strategy to help interlocutors comprehend their ideas.

    The role of gesture delay in coda /r/ weakening: an articulatory, auditory and acoustic study

    The cross-linguistic tendency of coda consonants to weaken, vocalize, or be deleted is shown to have a phonetic basis, resulting from gesture reduction or from variation in gesture timing. This study investigates the effects of the timing of the anterior tongue gesture for coda /r/ on acoustics and perceived strength of rhoticity, making use of two sociolects of Central Scotland (working- and middle-class) where coda /r/ is weakening and strengthening, respectively. Previous articulatory analysis revealed a strong tendency for these sociolects to use different coda /r/ tongue configurations: working- and middle-class speakers tend to use tip/front raised and bunched variants, respectively; however, this finding does not explain working-class /r/ weakening. A correlational analysis in the current study showed a robust relationship between anterior lingual gesture timing, F3, and percept of rhoticity. A linear mixed effects regression analysis showed that both speaker social class and linguistic factors (word structure and the checked/unchecked status of the prerhotic vowel) had significant effects on tongue gesture timing and formant values. This study provides further evidence that gesture delay can be a phonetic mechanism for coda rhotic weakening and apparent loss, but social class emerges as the dominant factor driving lingual gesture timing variation.
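    For readers who want to reproduce this kind of analysis, the sketch below fits a linear mixed effects regression of gesture timing on social class and the two linguistic factors, with a per-speaker random intercept, assuming pandas and statsmodels; the file and column names (coda_r_tokens.csv, gesture_delay, f3, and so on) are hypothetical, not the study's actual dataset.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-token articulatory and acoustic measurements.
df = pd.read_csv("coda_r_tokens.csv")

# Correlational step: gesture timing vs. F3 vs. perceived rhoticity.
print(df[["gesture_delay", "f3", "rhoticity_rating"]].corr())

# Mixed model: fixed effects for social class, word structure, and the
# checked/unchecked status of the prerhotic vowel; random intercept per
# speaker, since tokens are nested within speakers.
timing_model = smf.mixedlm(
    "gesture_delay ~ social_class + word_structure + vowel_status",
    data=df,
    groups=df["speaker"],
).fit()
print(timing_model.summary())

# A parallel model for F3 tests whether the same factors shape the acoustics.
f3_model = smf.mixedlm(
    "f3 ~ social_class + word_structure + vowel_status",
    data=df,
    groups=df["speaker"],
).fit()
print(f3_model.summary())
```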

    Responding to gratitude in elicited oral interaction. A taxonomy of communicative options

    This study explores responses to gratitude as expressed in elicited oral interaction (mimetic-pretending open role-plays) produced by native speakers of American English. It first overviews the literature on this topic. It then presents a taxonomy of the head acts and supporting moves of the responses to gratitude instantiated in the corpus under examination, considering their strategies and formulations. Finally, it reports on their frequency of occurrence and combinatorial options across communicative situations that differ in the social distance and power relationships between the interactants. The findings partly confirm what is reported in the literature, but also reveal the flexibility and adaptability of these reacting speech acts to the variable contexts in which they may be instantiated. On the one hand, the responses to gratitude identified tend to be encoded as simple utterances, and occasionally as complex combinations of head acts and/or supporting moves; also, their head acts show a preference for a small set of strategies and formulation types, while their supporting moves are much more varied in content and form, and thus situation-specific. On the other hand, the frequency of occurrence of the responses to gratitude, their dispersion across situations, and the range of their attested strategies and formulations are not in line with those reported in previous studies. I argue that these partly divergent findings are attributable to the different data collection and categorization procedures adopted, and the different communicative situations considered, across studies. Overall, the study suggests that responses to gratitude are a set of communicative events with fuzzy boundaries, containing core (i.e., more prototypical) and peripheral (i.e., less prototypical) exemplars; that although routinized in function, they are not completely conventionalized in their strategic or surface realizations; that alternative research approaches may provide complementary insights into these reacting speech acts; and that a higher degree of comparability across studies may be ensured if explicit pragmatic and semantic parameters are adopted in the classification of their shared object of study.