Understanding virtual speakers
This thesis addresses how verbal comprehension is affected by seeing the speaker, in particular when the speaker is an animated virtual speaker. Two visually co-present people – one talking and the other listening, trying to comprehend what is said – constitute a central and critical scenario whether one is interested in human cognition, communication or learning. Papers I & II focus on how comprehension is affected by seeing a virtual speaker displaying visual speech cues (lip and head movements accompanying speech). The results indicate a positive effect in the presence of background babble noise but no effect in its absence. The results presented in paper II also indicate that seeing the virtual speaker is at least as effective as seeing a real speaker, and that the exploitation of visual speech cues from a virtual speaker may require some adaptation but is not affected by subjective perception of the virtual speaker's social traits. Papers III & IV focus on the effect of the temporal coordination of speech and gesture on memory encoding of speech, and on the feasibility of a novel methodology to address this question. The objective of the methodology is the precise manipulation of individual gestures within naturalistic speech and gesture sequences recorded by motion capture and reproduced by virtual speakers. Results in paper III indicate that such temporal manipulations can be realized without the animation being perceived as unnatural, as long as the shifted (manipulated) gestural movements temporally overlap with some speech (rather than a pause or hesitation). Results of paper IV show that words accompanied by associated gestures, either in their original synchrony or arriving earlier, were more likely to be recalled. This mirrors the temporal coordination patterns that are common in natural speech-gesture production. Paper V explores how factual topics are comprehended and approached metacognitively when presented in different media, including a video of an animated virtual speaker with synthesized speech. The study made use of an interface where differences in information transience and navigation options are minimized between the media. Results indicate improved comprehension and a somewhat stronger tendency to repeat material when also seeing, compared to only listening to, the virtual speaker. Instances of navigation behaviours were, however, overall scarce, and only tentative conclusions could be drawn regarding differences in metacognitive approaches between media. Paper VI presents a virtual replication of a choice blindness experimental paradigm. The results show that the level of detail of the presentation of a virtual environment and a speaker may affect self-reported presence as well as the level of trust exhibited towards the speaker. The relevance of these findings is discussed with regard to how comprehension is affected by visible speakers in general and virtual speakers specifically, as well as possible consequences for the design and implementation of virtual speakers in educational applications and as research instruments.
Recall and perceived naturalness of asynchronous speech and gesture
Part of the justification for an integrated view of speech and gestures is their temporal coordination. Gestures generally coincide with or precede, but rarely follow, their lexical affiliates (McNeill, 1992). How synchrony impacts listeners remains less explored, despite potential relevance for video communication and virtual conversational agents. ERP studies suggest that temporal alignment affects how words and gestures are integrated (Obermeier & Gunter, 2015; Habets et al., 2011). Explicit perception of asynchrony is less sensitive, and shifts longer than 1 s can be tolerated (Kirchhof, 2014). However, gestures that are preceded by their lexical affiliates deviate from the pattern expected from regular exposure to speech, which might implicitly affect listeners. We investigated whether the asymmetry of timing observed in production is reflected in differential effects of gestures shifted in either direction on whether listeners perceive the speaker's behaviour as natural (Exp1) and/or on their processing and subsequent recall of words (Exp2). Using motion capture to animate virtual speakers (giving explanations) allowed us to shift specific gesture strokes within longer segments while preserving synchronized lip movements. For 16 short segments we produced videos in 3 conditions defined by the timing of a target gesture stroke relative to a target word: either overlapping (SYNC), shifted 500 ms earlier (G-BEFORE), or shifted 500 ms later (G-AFTER). We classified the verbal content overlapping with shifted strokes into the (unequally frequent) categories "congruent", "incongruent" or "filled/unfilled pauses". In Exp1, 32 participants saw a composition of 4 videos from each of the 3 mentioned conditions plus a variation of SYNC with distorted pitch during a few non-target words (AUDIO). After each video the participants rated their impression of whether it was based on a capture of natural movement or was artificially generated (by an undefined algorithm). We transformed each participant's responses to the range between 0 (most artificial) and 1 (most natural). Results revealed no significant differences between conditions. However, comparing the ratings between the categories of overlap revealed that strokes shifted to "filled/unfilled pauses" were rated as more artificial. In Exp2, 79 participants saw all 16 videos in one of four conditions: SYNC, G-BEFORE and G-AFTER were contrasted with a condition in which target gestures were seamlessly extinguished. Following each video and a distraction task, participants attempted to repeat what they heard in the video. Results revealed impaired recall of target words with extinguished or delayed gestures. In summary, asynchronous gestures were not perceived as less natural as long as they overlapped with words. Synchronous and preceding, but not following, gestures facilitated recall, as expected if the processing of speech and gestures (involved in this particular task) is tuned to temporal patterns common in natural speech.
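The per-participant rescaling of naturalness ratings described above can be illustrated with a minimal sketch; the file name, column names, and data layout below are hypothetical stand-ins, not taken from the study materials.

```python
import pandas as pd

# Hypothetical tidy data: one row per rated video, with columns:
# participant, condition (SYNC / G-BEFORE / G-AFTER / AUDIO), rating
ratings = pd.read_csv("exp1_naturalness_ratings.csv")

def rescale_01(x: pd.Series) -> pd.Series:
    """Min-max rescale one participant's ratings to 0 (most artificial) .. 1 (most natural)."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else pd.Series(0.5, index=x.index)

ratings["rating_01"] = ratings.groupby("participant")["rating"].transform(rescale_01)

# Mean rescaled naturalness per condition
print(ratings.groupby("condition")["rating_01"].mean())
```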
Stronger activation of spatial representations of words by presentation of gestures in absence of spatial task
Speech and gestures express meaning together, connecting lexical and visuo-spatial representations in the speaker's mind. Several studies have demonstrated uptake of spatial information from gestures; however, this has usually been shown following explicit instructions to recreate or recall spatial features, deviating from typical listening situations. It remains unclear whether exposure to gestures activates spatial representations of verbal content without an explicitly spatial task. We conducted an online experiment where one group of participants (n=33) watched three videos of native Swedish speakers describing rooms in an audio-visual (AV) condition. Another group (n=32) only heard the same three descriptions without video (audio-only, A). During the presentations participants were naïve to the subsequent task, which consisted of making either lexical (is this a word?), semantic (is this concrete?), or spatial (would this fit inside a room?) decisions on a series of 29 nouns, including seven mentioned in the stimuli (accompanied by gestures in the AV condition). Assuming that stronger activation of spatial representations facilitates the spatial decision task, we predicted that the AV condition would generate faster reaction times (on the included nouns) compared to A in the spatial task, but not in the other tasks. Our analysis of the recorded reaction times confirmed our predictions, suggesting a stronger activation of spatial representations of words presented with gesture in the absence of an explicit task demanding spatial processing. The results tally with previous findings suggesting that speech-gesture integration is modulated by content and task, thus contributing to our understanding of the role gestures play in spatial processing in natural verbal communication.
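As an illustration of the predicted interaction (faster reaction times in the AV condition only for the spatial task), a minimal sketch of how such a pattern could be inspected is shown below; the data layout and file name are assumptions for illustration, not the study's actual analysis pipeline.

```python
import pandas as pd

# Hypothetical tidy data: one row per decision trial, with columns:
# participant, modality ("AV" or "A"), task ("lexical", "semantic", "spatial"),
# noun_in_stimuli (bool), rt_ms (reaction time in milliseconds)
rt = pd.read_csv("decision_task_rts.csv")

# Restrict to the nouns that occurred in the descriptions
included = rt[rt["noun_in_stimuli"]]

# Predicted pattern: AV < A for the spatial task, no difference for the other tasks
summary = (
    included.groupby(["task", "modality"])["rt_ms"]
    .mean()
    .unstack("modality")
)
summary["AV_minus_A"] = summary["AV"] - summary["A"]
print(summary)
```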
Motion iconicity in prosody
Evidence suggests that the non-verbal aspects of human speech may be rich in iconicity. Here, we report results from two experiments aimed at testing whether perception of increasing and declining f0 can be iconically mapped onto motion events. We presented a sample of mixed-nationality participants (N = 118) with sets of two videos, where one pictured upward movement and the other downward movement. A disyllabic nonsense word prosodically resynthesized as increasing or declining in f0 was presented simultaneously with each video in a pair, and participants were tasked with guessing which of the two videos the word described. Results indicate that prosody is iconically associated with motion, such that motion-prosody congruent pairings were more readily selected than incongruent pairings (p < 0.033). However, the effect observed in our sample was primarily driven by selections of words with declining f0. A follow-up experiment with native Turkish-speaking participants (N = 92) tested for the effect of a language-specific metaphor for auditory pitch. Results showed no significant association between prosody and motion. Limitations of the experiment, and some implications for the motor theory of speech perception and "gestural origins" theories of language evolution, are discussed.
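The abstract does not specify how the stimuli were resynthesized; the sketch below shows one way to impose a rising or declining f0 contour on a recording, using Praat's overlap-add manipulation via the Parselmouth library. The file names, pitch analysis range, and the 120–220 Hz endpoints are illustrative assumptions, not the study's actual parameters.

```python
import parselmouth
from parselmouth.praat import call

# Hypothetical recording of the disyllabic nonsense word
snd = parselmouth.Sound("nonsense_word.wav")

manipulation = call(snd, "To Manipulation", 0.01, 75, 600)
pitch_tier = call(manipulation, "Extract pitch tier")

# Replace the original contour with a linearly rising one (120 -> 220 Hz);
# swap the two values to produce the declining-f0 version.
call(pitch_tier, "Remove points between", 0, snd.duration)
call(pitch_tier, "Add point", 0.0, 120)
call(pitch_tier, "Add point", snd.duration, 220)
call([pitch_tier, manipulation], "Replace pitch tier")

rising = call(manipulation, "Get resynthesis (overlap-add)")
call(rising, "Save as WAV file", "nonsense_rising.wav")
```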
Motion capture-based animated characters for the study of speech–gesture integration
Digitally animated characters are promising tools in research studying how we integrate information from speech and visual sources such as gestures, because they allow specific gesture features to be manipulated in isolation. We present an approach combining motion capture and 3D-animated characters that allows us to manipulate natural individual gesture strokes for experimental purposes, for example to temporally shift and present gestures in ecologically valid sequences. We exemplify how such stimuli can be used in an experiment investigating implicit detection of speech–gesture (a)synchrony, and discuss the general applicability of the workflow for research in this domain.
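As a simplified illustration of the kind of temporal manipulation the workflow enables, the sketch below re-times the keyframes of a single gesture stroke within an animation track. The data structures are hypothetical stand-ins for whatever the motion-capture/animation pipeline exports, and a production pipeline would also blend the transitions at the seams.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Keyframe:
    t: float                # time in seconds
    pose: Dict[str, float]  # joint rotations at this frame (placeholder)

def shift_stroke(keys: List[Keyframe],
                 stroke: Tuple[float, float],
                 delta: float) -> List[Keyframe]:
    """Return a copy of the track in which keyframes inside the stroke interval
    are moved by delta seconds (negative = earlier), leaving the rest untouched."""
    start, end = stroke
    shifted = [
        Keyframe(k.t + delta, k.pose) if start <= k.t <= end else Keyframe(k.t, k.pose)
        for k in keys
    ]
    return sorted(shifted, key=lambda k: k.t)

# Example: move a stroke spanning 2.0-2.6 s to start 500 ms earlier
track = [Keyframe(t / 10, {}) for t in range(0, 40)]
earlier = shift_stroke(track, stroke=(2.0, 2.6), delta=-0.5)
```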
Lessons Learned from a Study on Distractions in Virtual Learning Environments: Reliability, Ecological Validity, and an Elusive Social Component
Virtual Reality has long been proposed to combine the reliability of controlled laboratory settings with the ecological validity of real life. While technological development steadily pushes towards ever more realistic renderings of the real world, the elusiveness of social and emotional elements gradually becomes more evident. This is not least true for behavioral studies in rich sociocultural contexts. This article examines the outcomes of a study on distractions taking place in a socially rich context: the classroom. The study made use of a Virtual Reality environment simulating a junior high school lesson, where the Distraction condition consisted of peers watching non-relevant content on their laptops. In the control condition these laptops were closed. No significant distraction effects were found on either learning or behavior. Given the strong support in the literature for such effects, the study design, including technical aspects, is scrutinized and discussed. We specifically highlight the difficulty of simulating a social relationship between the participant and agents in VR, which in this case made the distraction stimulus significantly weaker. It is argued that the distraction effect of nearby peers' laptop use relies (partly) on shared attention with social agents with an established social relation and common interests.
Compensation for a large gesture-speech asynchrony in instructional videos
We investigated the pragmatic effects of gesture-speech lag by asking participants to reconstruct formations of geometric shapes based on instructional films in four conditions: sync, video lag or audio lag (±1,500 ms), and audio only. All three video groups rated the task as less difficult than the audio-only group did, and performed better. The scores were slightly lower when sound preceded gestures (video lag), but not when gestures preceded sound (audio lag). Participants thus compensated for delays of 1.5 seconds in either direction, apparently without making a conscious effort. This greatly exceeds the previously reported time window for automatic multimodal integration.
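One straightforward way to produce such lag conditions is to offset one track of a synchronized recording against the other with ffmpeg; the sketch below wraps the command in Python, with hypothetical file names, and is not necessarily how the study's stimuli were actually prepared.

```python
import subprocess

def delay_track(src: str, dst: str, delay_s: float, track: str) -> None:
    """Write a copy of src in which the chosen track ("audio" or "video")
    starts delay_s seconds later than the other one (delay_s > 0)."""
    delayed, other = ("1:a", "0:v") if track == "audio" else ("1:v", "0:a")
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", src,                    # input 0: supplies the undelayed stream
            "-itsoffset", str(delay_s),   # timestamp offset applied to input 1
            "-i", src,                    # input 1: supplies the delayed stream
            "-map", other, "-map", delayed,
            "-c", "copy",
            dst,
        ],
        check=True,
    )

# ±1,500 ms conditions: audio lag (gestures precede sound) and video lag (sound precedes gestures)
delay_track("instruction_sync.mp4", "instruction_audio_lag.mp4", 1.5, track="audio")
delay_track("instruction_sync.mp4", "instruction_video_lag.mp4", 1.5, track="video")
```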