How Do Gestures Influence Thinking and Speaking? The Gesture-for-Conceptualization Hypothesis.
Analyzing Input and Output Representations for Speech-Driven Gesture Generation
This paper presents a novel framework for automatic speech-driven gesture
generation, applicable to human-agent interaction including both virtual agents
and robots. Specifically, we extend recent deep-learning-based, data-driven
methods for speech-driven gesture generation by incorporating representation
learning. Our model takes speech as input and produces gestures as output, in
the form of a sequence of 3D coordinates. Our approach consists of two steps.
First, we learn a lower-dimensional representation of human motion using a
denoising autoencoder neural network, consisting of a motion encoder MotionE
and a motion decoder MotionD. The learned representation preserves the most
important aspects of the human pose variation while removing less relevant
variation. Second, we train a novel encoder network SpeechE to map from speech
to a corresponding motion representation with reduced dimensionality. At test
time, the speech encoder and the motion decoder networks are combined: SpeechE
predicts motion representations based on a given speech signal and MotionD then
decodes these representations to produce motion sequences. We evaluate
different representation sizes in order to find the most effective
dimensionality for the representation. We also evaluate the effects of using
different speech features as input to the model. We find that mel-frequency
cepstral coefficients (MFCCs), alone or combined with prosodic features,
perform the best. The results of a subsequent user study confirm the benefits
of the representation learning.
Comment: Accepted at IVA '19. A shorter version was published at AAMAS '19. The code is available at https://github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencode
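The two-step pipeline described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the three networks (MotionE, MotionD, SpeechE) are replaced by single random linear maps, and the dimensions are invented, so only the wiring of the architecture is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not taken from the paper): 26 MFCC-based
# speech features per frame, 45 pose coordinates (15 joints x 3D),
# and a learned representation of size 8.
SPEECH_DIM, POSE_DIM, REPR_DIM = 26, 45, 8

# Stand-ins for the trained networks: in the paper these are deep
# nets; here each is one linear map so the pipeline is runnable.
MotionE = rng.normal(size=(REPR_DIM, POSE_DIM)) / np.sqrt(POSE_DIM)
MotionD = rng.normal(size=(POSE_DIM, REPR_DIM)) / np.sqrt(REPR_DIM)
SpeechE = rng.normal(size=(REPR_DIM, SPEECH_DIM)) / np.sqrt(SPEECH_DIM)

def encode_motion(poses):
    """Step 1a: compress pose frames to the low-dim representation."""
    return poses @ MotionE.T

def decode_motion(reprs):
    """Step 1b: reconstruct pose frames from the representation."""
    return reprs @ MotionD.T

def speech_to_repr(speech):
    """Step 2: map speech features to motion representations."""
    return speech @ SpeechE.T

def generate_gestures(speech):
    """Test time: chain SpeechE and MotionD, going speech -> 3D poses."""
    return decode_motion(speech_to_repr(speech))

speech_frames = rng.normal(size=(100, SPEECH_DIM))  # 100 frames of features
gestures = generate_gestures(speech_frames)
print(gestures.shape)  # (100, 45): one 45-dim pose vector per frame
```

The key design point survives even in this toy: the autoencoder (MotionE/MotionD) is trained on motion alone, so SpeechE only has to hit a small, denoised target space rather than raw 3D coordinates.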
Early Turn-taking Prediction with Spiking Neural Networks for Human Robot Collaboration
Turn-taking is essential to the structure of human teamwork. Humans are
typically aware of team members' intention to keep or relinquish their turn
before a turn switch, where the responsibility of working on a shared task is
shifted. Future co-robots are also expected to provide such competence. To that
end, this paper proposes the Cognitive Turn-taking Model (CTTM), which
leverages cognitive models (i.e., Spiking Neural Network) to achieve early
turn-taking prediction. The CTTM framework can process multimodal human
communication cues (both implicit and explicit) and predict human turn-taking
intentions in an early stage. The proposed framework is tested on a simulated
surgical procedure, where a robotic scrub nurse predicts the surgeon's
turn-taking intention. It was found that the proposed CTTM framework
outperforms the state-of-the-art turn-taking prediction algorithms by a large
margin. It also outperforms humans when presented with partial observations of
communication cues (i.e., less than 40% of full actions). This early prediction
capability enables robots to initiate turn-taking actions at an early stage,
which facilitates collaboration and increases overall efficiency.
Comment: Submitted to the IEEE International Conference on Robotics and Automation (ICRA) 201
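The early-prediction idea can be illustrated with a toy evidence accumulator. The CTTM uses a full spiking neural network; here a single leaky integrate-and-fire (LIF) unit stands in for it, and the cue stream is invented: incoming cue "spikes" charge a membrane potential that leaks over time, and a turn switch is predicted as soon as the potential crosses threshold, possibly long before the observed action finishes.

```python
def predict_turn_switch(cue_spikes, leak=0.9, threshold=2.5):
    """Return the first timestep at which a turn switch is predicted,
    or None. cue_spikes[t] is the summed strength of the implicit and
    explicit communication cues observed at timestep t."""
    potential = 0.0
    for t, spike in enumerate(cue_spikes):
        potential = leak * potential + spike  # leaky integration
        if potential >= threshold:
            return t  # fire early, before the full action is seen
    return None

# Invented cue stream for a 10-step action: weak background cues,
# then strong hand-over cues starting at step 3.
cues = [0.2, 0.1, 0.3, 1.2, 1.5, 1.4, 0.9, 0.5, 0.2, 0.1]
t = predict_turn_switch(cues)
print(t, len(cues))  # the decision fires well before all 10 steps are seen
```

The leak term is what makes the decision temporal rather than a simple running sum: sustained strong cues drive the potential up, while sparse weak cues decay away and never trigger a prediction.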
P-model Alternative to the T-model
Standard linguistic analysis of syntax uses the T-model. This model
requires the ordering: D-structure → S-structure → LF,
where D-structure is the deep structure,
S-structure is the surface structure, and LF is logical form.
Between each of these representations there is movement which alters
the order of the constituent words; movement is achieved using the principles
and parameters of syntactic theory. Psychological analysis of sentence
production is usually either serial or connectionist. Psychological serial
models do not accommodate the T-model directly, so a new model,
called the P-model, is introduced here. The P-model is different from previous
linguistic and psychological models. Here it is argued that the LF
representation should be replaced by a variant
of Frege's three qualities (sense, reference, and force),
called the Frege representation or F-representation.
In the F-representation the order of elements is not necessarily the same as
that in LF and it is suggested that the correct ordering is:
F-representation → D-structure → S-structure.
This ordering appears to lead to a more natural
view of sentence production and processing. Within this framework movement
originates as the outcome of emphasis applied to the sentence. The
requirement that the F-representation precedes the D-structure needs a picture
of the particular principles and parameters which pertain to movement of words
between representations. In general this would imply that there is a
preferred or optimal ordering of the symbolic string in the F-representation.
The standard ordering is retained because the general way of producing
such an optimal ordering is unclear. In this case it is possible to produce
an analysis of movement between LF and D-structure similar to the usual
analysis of movement between S-structure and LF.
It is suggested that a maximal amount of information about
a language's grammar and lexicon is stored,
because of the necessity of analyzing corrupted data.
Modifications and Frequency of Occurrence of Gestures in NS - NS and NNS - NS Dyads
In this study, I investigate cross-linguistic differences and similarities in speech-associated gestures in NS (Native Speaker) - NS and NNS (Nonnative Speaker) - NS dyads when they are telling a narrative. The gesture production of Indonesian native speakers communicating in Indonesian (L1) and in English (L2) was coded and assessed based on McNeill's model of overall gesture units. The Indonesian speakers' gesture modification when interacting in English was measured by the size of the gestures. The results indicate that Indonesian native speakers gesture more when they communicate in English and modify their gestures by making them bigger and therefore more noticeable to their interlocutors. They use gestures as a communication strategy to help interlocutors comprehend their ideas.
The role of gesture delay in coda /r/ weakening: an articulatory, auditory and acoustic study
The cross-linguistic tendency of coda consonants to weaken, vocalize, or be deleted is shown to
have a phonetic basis, resulting from gesture reduction, or variation in gesture timing. This study
investigates the effects of the timing of the anterior tongue gesture for coda /r/ on acoustics and
perceived strength of rhoticity, making use of two sociolects of Central Scotland (working- and
middle-class) where coda /r/ is weakening and strengthening, respectively. Previous articulatory
analysis revealed a strong tendency for these sociolects to use different coda /r/ tongue configurations: working-
and middle-class speakers tend to use tip/front raised and bunched variants,
respectively; however, this finding does not explain working-class /r/ weakening. A correlational
analysis in the current study showed a robust relationship between anterior lingual gesture timing,
F3, and percept of rhoticity. A linear mixed effects regression analysis showed that both speaker
social class and linguistic factors (word structure and the checked/unchecked status of the prerhotic
vowel) had significant effects on tongue gesture timing and formant values. This study provides further
evidence that gesture delay can be a phonetic mechanism for coda rhotic weakening and apparent
loss, but social class emerges as the dominant factor driving lingual gesture timing variation.
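The correlational step reported above relates later anterior-gesture timing to higher F3 (and hence weaker-sounding rhoticity). A minimal sketch of that computation, using invented numbers rather than the study's measurements:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented data: delay of the tongue-tip gesture peak (ms, relative to
# the vowel) paired with F3 (Hz). A low F3 is the acoustic hallmark of
# a strong /r/, so rising F3 with rising delay means weakening.
delay_ms = [10, 25, 40, 60, 80, 95, 120]
f3_hz = [1900, 2050, 2100, 2300, 2450, 2500, 2700]
r = pearson_r(delay_ms, f3_hz)
print(round(r, 2))  # strongly positive: more delay, higher F3, weaker /r/
```

The study's actual analysis goes further (linear mixed effects regression with social class, word structure, and vowel status as predictors), but this captures the core delay-F3 relationship it builds on.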
Responding to gratitude in elicited oral interaction. A taxonomy of communicative options
This study explores responses to gratitude as expressed in elicited oral interaction (mimetic-pretending open role-plays) produced by native speakers of American English. It first overviews the literature on this topic. It then presents a taxonomy of the head acts and supporting moves of the responses to gratitude instantiated in the corpus under examination, which considers their strategies and formulations. Finally, it reports on their frequency of occurrence and combinatorial options across communicative situations differing in terms of the social distance and power relationships between the interactants. The findings partly confirm what is reported in the literature, but partly reveal the flexibility and adaptability of these reacting speech acts to the variable contexts in which they may be instantiated. On the one hand, the responses to gratitude identified tend to be encoded as simple utterances, and occasionally as complex combinations of head acts and/or supporting moves; also, their head acts show a preference for a small set of strategies and formulation types, while their supporting moves are much more varied in content and form, and thus situation-specific. On the other hand, the frequency of occurrence of the responses to gratitude, their dispersion across situations, and the range of their attested strategies and formulations are not in line with those reported in previous studies. I argue that these partly divergent findings can be attributed to the different data collection and categorization procedures adopted, and the different communicative situations considered, across studies. Overall, the study suggests that: responses to gratitude are a set of communicative events with fuzzy boundaries, which contains core (i.e. more prototypical) and peripheral (i.e. less prototypical) exemplars; although routinized in function, responses to gratitude are not completely conventionalized in their strategic or surface realizations; alternative research approaches may provide complementary insights into these reacting speech acts; and a higher degree of comparability across studies may be ensured if explicit pragmatic and semantic parameters are adopted in the classification of their shared object of study.
- âŠ