
    Joint attention in spoken human-robot interaction

    Gaze during situated language production and comprehension is tightly coupled with the unfolding speech stream: speakers look at entities before mentioning them (Griffin, 2001; Meyer et al., 1998), while listeners look at objects as they are mentioned (Tanenhaus et al., 1995). Thus, a speaker's gaze to mentioned objects in a shared environment provides the listener with a cue to the speaker's focus of visual attention and potentially to an intended referent. The coordination of interlocutors' visual attention, in order to learn about the partner's goals and intentions, has been called joint attention (Moore and Dunham, 1995; Emery, 2000). By revealing the speaker's communicative intentions, such attentional cues complement spoken language, facilitating grounding and sometimes disambiguating references (Hanna and Brennan, 2007). Previous research has shown that people readily attribute intentional states to non-humans as well, such as animals, computers, or robots (Nass and Moon, 2000). Assuming that people indeed ascribe intentional states to a robot, joint attention may be a relevant component of human-robot interaction as well. The objective of this thesis was to investigate the hypothesis that people jointly attend to objects looked at by a speaking robot and that human listeners use this visual information to infer the robot's communicative intentions. Five eye-tracking experiments in a spoken human-robot interaction setting were conducted and provide supporting evidence for this hypothesis. In these experiments, participants' eye movements and responses were recorded while they viewed videos of a robot that described and looked at objects in a scene. The congruency and alignment of robot gaze and the spoken references were manipulated in order to establish the relevance of such gaze cues for participants' utterance comprehension. Results suggest that people follow robot gaze to objects and infer referential intentions from it, causing both facilitation and disruption of reference resolution, depending on the match or mismatch between inferred intentions and the actual utterance. Specifically, we showed in Experiments 1-3 that people assign attentional and intentional states to a robot, interpreting its gaze as a cue to intended referents. This interpretation determined how people grounded spoken references in the scene, influencing overall utterance comprehension as well as the production of verbal corrections in response to false robot utterances. In Experiments 4 and 5, we further manipulated the temporal synchronization and linear alignment of robot gaze and speech and found that substantial temporal shifts of gaze relative to speech did not affect utterance comprehension, while the order of visual and spoken referential cues did. These results show that people interpret gaze cues in the order they occur and expect the retrieved referential intentions to be realized accordingly. Thus, our findings converge on the conclusion that people establish joint attention with a robot.
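
    To make the congruency manipulation concrete, the sketch below shows the kind of fixation analysis such an eye-tracking study might use: the proportion of fixation time on the mentioned object in a window after noun onset, split by whether robot gaze was congruent with the spoken reference. The data structures, field names, and one-second window are illustrative assumptions, not the thesis' actual pipeline.

        # Minimal sketch (assumed data format, not the thesis' analysis code):
        # proportion of fixation time on the mentioned object after noun onset,
        # split by gaze-speech congruency.
        from dataclasses import dataclass, field
        from collections import defaultdict

        @dataclass
        class Fixation:
            start_ms: int      # fixation onset relative to trial start
            end_ms: int        # fixation offset
            object_id: str     # which scene object was fixated

        @dataclass
        class Trial:
            congruent: bool        # did robot gaze match the spoken referent?
            noun_onset_ms: int     # onset of the referring noun in the utterance
            referent_id: str       # object actually mentioned
            fixations: list = field(default_factory=list)

        def target_fixation_rate(trials, window_ms=1000):
            """Share of fixation time spent on the referent in the post-noun window."""
            totals = defaultdict(lambda: [0.0, 0.0])   # condition -> [on-target ms, all ms]
            for t in trials:
                w_start, w_end = t.noun_onset_ms, t.noun_onset_ms + window_ms
                key = "congruent" if t.congruent else "incongruent"
                for f in t.fixations:
                    overlap = max(0, min(f.end_ms, w_end) - max(f.start_ms, w_start))
                    totals[key][1] += overlap
                    if f.object_id == t.referent_id:
                        totals[key][0] += overlap
            return {k: on / total for k, (on, total) in totals.items() if total > 0}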

    Gesture and Speech in Interaction - 4th edition (GESPIN 4)

    The fourth edition of Gesture and Speech in Interaction (GESPIN) was held in Nantes, France. With more than 40 papers, these proceedings show just what a flourishing field of enquiry gesture studies continues to be. The keynote speeches of the conference addressed three different aspects of multimodal interaction: gesture and grammar, gesture acquisition, and gesture and social interaction. In a talk entitled Qualities of event construal in speech and gesture: Aspect and tense, Alan Cienki presented an ongoing research project on narratives in French, German and Russian, a project that focuses especially on the verbal and gestural expression of grammatical tense and aspect in narratives in the three languages. Jean-Marc Colletta's talk, entitled Gesture and Language Development: towards a unified theoretical framework, described the joint acquisition and development of speech and early conventional and representational gestures. In Grammar, deixis, and multimodality between code-manifestation and code-integration or why Kendon's Continuum should be transformed into a gestural circle, Ellen Fricke proposed a revisited grammar of noun phrases that integrates gestures as part of the semiotic and typological codes of individual languages. From a pragmatic and cognitive perspective, Judith Holler explored the use of gaze and hand gestures as means of organizing turns at talk as well as establishing common ground in a presentation entitled On the pragmatics of multi-modal face-to-face communication: Gesture, speech and gaze in the coordination of mental states and social interaction. Among the talks and posters presented at the conference, the vast majority of topics related, quite naturally, to gesture and speech in interaction - understood both in terms of mapping of units in different semiotic modes and of the use of gesture and speech in social interaction. Several presentations explored the effects of impairments (such as diseases or the natural ageing process) on gesture and speech. The communicative relevance of gesture and speech and audience design in natural interactions, as well as in more controlled settings like television debates and reports, was another topic addressed during the conference. Some participants also presented research on first and second language learning, while others discussed the relationship between gesture and intonation. While most participants presented research on gesture and speech from an observer's perspective, be it in semiotics or pragmatics, some nevertheless focused on another important aspect: the cognitive processes involved in language production and perception. Last but not least, participants also presented talks and posters on the computational analysis of gestures, whether involving external devices (e.g. mocap, Kinect) or concerning the use of specially designed computer software for the post-treatment of gestural data. Importantly, new links were made between semiotics and mocap data.

    Incrementality and flexibility in sentence production


    Accessibility of referent information influences sentence planning : An eye-tracking study

    Acknowledgments: We thank Phoebe Ye and Gouming Martens for help with data collection for Experiments 1 and 2, respectively. This research was supported by an ERC Starting Grant (206198) from the European Research Council to YC.

    Turn-Taking in Human Communicative Interaction

    The core use of language is in face-to-face conversation. This is characterized by rapid turn-taking. This turn-taking poses a number of central puzzles for the psychology of language. Consider, for example, that in large corpora the gap between turns is on the order of 100 to 300 ms, but the latencies involved in language production require minimally 600 ms (for a single word) to 1500 ms (for a simple sentence). This implies that participants in conversation are predicting the ends of the incoming turn and preparing in advance. But how is this done? What aspects of this prediction are done when? What happens when the prediction is wrong? What stops participants from coming in too early? If the system is running on prediction, why is there consistently a mode of 100 to 300 ms in response time? The timing puzzle raises further puzzles: it seems that comprehension must run in parallel with the preparation for production, but it has been presumed that there are strict cognitive limitations on more than one central process running at a time. How is this bottleneck overcome? Far from being 'easy' as some psychologists have suggested, conversation may be one of the most demanding cognitive tasks in our everyday lives. Further questions naturally arise: how do children learn to master this demanding task, and what is the developmental trajectory in this domain? Research shows that aspects of turn-taking such as its timing are remarkably stable across languages and cultures, but the word order of languages varies enormously. How then does prediction of the incoming turn work when the verb (often the informational nugget in a clause) is at the end? Conversely, how can production work fast enough in languages that have the verb at the beginning, thereby requiring early planning of the whole clause? What happens when one changes modality, as in sign languages: with the loss of channel constraints, is turn-taking much freer? And what about face-to-face communication amongst hearing individuals: do gestures, gaze, and other body behaviors facilitate turn-taking? One can also ask the phylogenetic question: how did such a system evolve? There seem to be parallels (analogies) in duetting bird species, and in a variety of monkey species, but there is little evidence of anything like this among the great apes. All this constitutes a neglected set of problems at the heart of the psychology of language and of the language sciences. This research topic welcomes contributions from right across the board, for example from psycholinguists, developmental psychologists, students of dialogue and conversation analysis, linguists interested in the use of language, phoneticians, corpus analysts and comparative ethologists or psychologists. We welcome contributions of all sorts, for example original research papers, opinion pieces, and reviews of work in subfields that may not be fully understood in other subfields.
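
    The arithmetic behind the timing argument can be spelled out in a few lines; the script below simply combines the gap and latency figures quoted above to show how far ahead of the turn end production planning would have to begin.

        # Back-of-the-envelope version of the timing argument (illustrative only):
        # if inter-turn gaps are much shorter than production latencies, planning
        # must begin before the incoming turn has ended.
        def planning_head_start(gap_ms, production_latency_ms):
            """How long before the end of the incoming turn planning must begin."""
            return production_latency_ms - gap_ms

        for gap in (100, 300):            # typical inter-turn gaps reported in corpora
            for latency in (600, 1500):   # single word vs. simple sentence
                print(f"gap {gap:>4} ms, latency {latency:>4} ms -> "
                      f"planning starts {planning_head_start(gap, latency)} ms before turn end")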

    Augmenting Situated Spoken Language Interaction with Listener Gaze

    Collaborative task solving in a shared environment requires referential success. Human speakers follow the listener’s behavior in order to monitor language comprehension (Clark, 1996). Furthermore, a natural language generation (NLG) system can exploit listener gaze to realize an effective interaction strategy by responding to it with verbal feedback in virtual environments (Garoufi, Staudte, Koller, & Crocker, 2016). We augment situated spoken language interaction with listener gaze and investigate its role in human-human and human-machine interactions. Firstly, we evaluate its impact on the prediction of reference resolution using a multimodal corpus collected in virtual environments. Secondly, we explore whether and how a human speaker uses listener gaze in an indoor guidance task, while spontaneously referring to real-world objects in a real environment. Thirdly, we consider an object identification task for assembly under system instruction. We developed a multimodal interactive system and two NLG systems that integrate listener gaze in the generation mechanisms. The NLG system “Feedback” reacts to gaze with verbal feedback, either underspecified or contrastive. The NLG system “Installments” uses gaze to incrementally refer to an object in the form of installments. Our results showed that gaze features improved the accuracy of automatic prediction of reference resolution. Further, we found that human speakers are very good at producing referring expressions, and showing listener gaze did not improve performance, but elicited more negative feedback. In contrast, we showed that an NLG system that exploits listener gaze benefits the listener’s understanding. Specifically, combining a short, ambiguous instruction with contrastive feedback resulted in faster interactions compared to underspecified feedback, and even outperformed following long, unambiguous instructions. Moreover, alternating the underspecified and contrastive responses in an interleaved manner led to better engagement with the system and efficient information uptake, and resulted in equally good performance. Somewhat surprisingly, when gaze was incorporated more indirectly in the generation procedure and used to trigger installments, the non-interactive approach that outputs an instruction all at once was more effective. However, if the spatial expression was mentioned first, referring in gaze-driven installments was as efficient as following an exhaustive instruction. In sum, we provide a proof of concept that listener gaze can effectively be used in situated human-machine interaction. An assistance system using gaze cues is more attentive and adapts to listener behavior to ensure communicative success.
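
    As a rough illustration of the decision logic of such a gaze-responsive NLG component, consider the sketch below; the function, object names, and feedback strings are hypothetical stand-ins, not the actual "Feedback" system described here.

        # Hypothetical gaze-driven feedback policy in the spirit of the system above
        # (names, strings, and structure are assumptions, not the real implementation).
        def feedback(fixated_object, target_object, distractor_feature=None):
            """Choose verbal feedback based on what the listener is currently fixating."""
            if fixated_object is None:
                return None                                      # no stable fixation yet
            if fixated_object == target_object:
                return "Yes, that one."                          # confirmatory feedback
            if distractor_feature:
                return f"No, not the {distractor_feature} one."  # contrastive feedback
            return "No, not that one."                           # underspecified feedback

        # Example: the listener fixates a red screw while the target is the blue screw.
        print(feedback("screw_red", "screw_blue", distractor_feature="red"))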

    The Neural Mechanisms Supporting Structure and Inter-Brain Connectivity In Natural Conversation

    Conversation is the height of human communication and social interaction, yet little is known about the neural mechanisms supporting it. To date, there have been no ecologically valid neuroimaging studies of conversation, and for good reason. Until recently, imaging techniques were hindered by artifact related to speech production. Now that we can circumvent this problem, I attempt to uncover the neural correlates of multiple aspects of conversation, including coordinating speaker change, the effect of conversation type (e.g. cooperative or argumentative) on inter-brain coupling, and the relationship between this coupling and social coherence. Pairs of individuals underwent simultaneous fMRI brain scans while they engaged in a series of unscripted conversations, for a total of 40 pairs (80 individuals). The first two studies in this dissertation lay a foundation by outlining brain regions supporting comprehension and production in both narrative and conversation - two aspects of discourse-level communication. The subsequent studies focus on two unique features of conversation: alternating turns-at-talk and establishing inter-brain coherence through speech. The results show that at the moment of speaker change, both people are engaging attentional and mentalizing systems - which likely support orienting toward implicit cues signaling speaker change as well as anticipating the other person's intention to either begin or end his turn. Four networks were identified that are significantly predicted by a novel measure of social coherence; they include the posterior parietal cortex, medial prefrontal cortex, and right angular gyrus. Taken together, the findings reveal that natural conversation relies on multiple cognitive networks besides language to coordinate or enhance social interaction.
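
    One simple way to quantify inter-brain coupling of the kind described here is to correlate the two participants' region-of-interest time series; the sketch below is an assumption-laden illustration, not the dissertation's actual analysis pipeline.

        # Illustrative only: inter-brain coupling for one region of interest, measured
        # as the Pearson correlation between two speakers' BOLD time series.
        import numpy as np

        def interbrain_coupling(bold_a, bold_b, lag_tr=0):
            """Correlate two ROI time series, optionally lagging speaker B's signal."""
            a = np.asarray(bold_a, dtype=float)
            b = np.asarray(bold_b, dtype=float)
            if lag_tr > 0:                 # shift speaker B forward by lag_tr volumes
                a, b = a[:-lag_tr], b[lag_tr:]
            a = (a - a.mean()) / a.std()
            b = (b - b.mean()) / b.std()
            return float(np.mean(a * b))

        # Example with simulated data: two noisy copies of a shared underlying signal.
        rng = np.random.default_rng(0)
        shared = rng.standard_normal(200)
        print(interbrain_coupling(shared + rng.standard_normal(200),
                                  shared + rng.standard_normal(200)))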

    Visual attention during conversation: an investigation using real-world stimuli.

    This research investigates how people visually attend to each other in realistic settings. In particular, I explore how people move their eyes to attend to speakers during social situations. I examine which signalling cues are crucial to social interactions and how they work in conjunction to enable successful conversation in humans. A further aim of this research is to compare eye movements made by live participants with those made by third-party observers. Using a range of techniques, the research demonstrates the benefit of combining audio and visual cues to follow a conversation, shows how viewing the speakers' eyes and their spatial location facilitates this, and examines social attention in those with traits of disorders. A key finding of the thesis is the similarity between the eye movements of live participants and those of third-party observers. Overall, the thesis offers a comprehensive account of which factors attract visual attention to speakers and facilitate conversation following.
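
    One common way to operationalise "conversation following" of this kind is the proportion of viewing time spent on whoever is currently speaking; the sketch below illustrates that measure under an assumed data format and is not the thesis' exact method.

        # Hypothetical measure of conversation following: share of viewing time spent
        # on the current speaker, given speech intervals and gaze-target intervals.
        def overlap(a_start, a_end, b_start, b_end):
            return max(0, min(a_end, b_end) - max(a_start, b_start))

        def gaze_on_speaker_proportion(speech_intervals, gaze_intervals):
            """speech_intervals: (speaker, start_ms, end_ms);
            gaze_intervals: (looked_at_person, start_ms, end_ms)."""
            on_speaker = 0
            total = sum(end - start for _, start, end in gaze_intervals)
            for speaker, s_start, s_end in speech_intervals:
                for target, g_start, g_end in gaze_intervals:
                    if target == speaker:
                        on_speaker += overlap(s_start, s_end, g_start, g_end)
            return on_speaker / total if total else 0.0

        # Example: A speaks 0-2000 ms, B speaks 2000-4000 ms; the viewer looks at A, then B.
        print(gaze_on_speaker_proportion([("A", 0, 2000), ("B", 2000, 4000)],
                                         [("A", 0, 1500), ("B", 1500, 4000)]))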

    Telops for language learning: Japanese language learners’ perceptions of authentic Japanese variety shows and implications for their use in the classroom

    Research on the use of leisure-oriented media products in foreign language learning is not a novelty. Building further on insights into the effects of audiovisual input on learners, recent studies have started to explore online learning behaviour. This research employed an exploratory design to examine the perceptions of a Japanese variety show with intralingual text, known as telops, by Japanese Language Learners (JLLs) and native Japanese speakers through a multimodal transcript, eye-tracking technology, questionnaires, and field notes. Two main objectives underlie this study: (1) to gain insights into participants’ multimodal perceptions and attitudes towards the use of such authentic material for language learning, and (2) to gain a better understanding of the distribution of participants’ visual attention between stimuli. Data from 43 JLLs and five native Japanese speakers were analysed. The JLLs were organised into a pre-exchange, exchange and post-exchange group, while the native Japanese speakers functioned as the reference group. A thematic analysis was conducted on the open-ended questionnaire responses, and Areas Of Interest (AOIs) were grouped to generate fixation data. The themes suggest that all learner groups feel that telops help them link the stimuli in the television programme, although some difficulty was experienced with the amount and pace of telops in the pre-exchange and exchange groups. The eye-tracking results show that faces and telops gather the most visual attention from all participant groups. Less clear-cut trends in visual attention are detected when AOIs on telops are grouped according to the degree to which they resemble the corresponding dialogue. This thesis concludes with suggestions as to how such authentic material can complement Japanese language learning.
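
    The AOI-based part of the analysis can be illustrated with a short sketch that aggregates fixation durations by AOI group and participant group; the field names and figures below are invented for illustration and do not reproduce the study's data.

        # Illustrative sketch (assumed input format, made-up numbers): total fixation
        # duration per AOI group, expressed as a share of each participant group's viewing.
        from collections import defaultdict

        def fixation_share_by_aoi(fixations):
            """fixations: iterable of (participant_group, aoi_group, duration_ms)."""
            totals = defaultdict(lambda: defaultdict(float))
            for participant_group, aoi_group, duration_ms in fixations:
                totals[participant_group][aoi_group] += duration_ms
            return {group: {aoi: dur / sum(per_aoi.values()) for aoi, dur in per_aoi.items()}
                    for group, per_aoi in totals.items()}

        # Example: exchange-group learners vs. the native-speaker reference group.
        data = [("exchange", "face", 4200), ("exchange", "telop", 3100), ("exchange", "other", 700),
                ("native", "face", 5200), ("native", "telop", 2300), ("native", "other", 500)]
        print(fixation_share_by_aoi(data))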