
    Shared Perception in Human-Robot Interaction

    Interaction can be seen as a composition of perspectives: the integration of perceptions, intentions, and actions on the environment that two or more agents share. For an interaction to be effective, each agent must be prone to “sharedness”: being situated in a common environment, able to read what others express about their perspective, and ready to adjust one’s own perspective accordingly. In this sense, effective interaction is supported by perceiving the environment jointly with others, a capability that in this research is called Shared Perception. Nonetheless, perception is a complex process in which the observer receives sensory inputs from the external world and interprets them based on its own previous experiences, predictions, and intentions. In addition, social interaction itself contributes to shaping what is perceived: others’ attention, perspective, actions, and internal states may also be incorporated into perception. Thus, Shared Perception reflects the observer's ability to integrate these three sources of information: the environment, the self, and other agents. If Shared Perception is essential among humans, it is equally crucial for interaction with robots, which need social and cognitive abilities to interact with humans naturally and successfully. This research deals with Shared Perception within the context of Social Human-Robot Interaction (HRI) and involves an interdisciplinary approach. The two general axes of the thesis are the investigation of human perception while interacting with robots and the modeling of robots’ perception while interacting with humans. These two directions are outlined through three specific Research Objectives, whose achievement represents the contribution of this work. i) The formulation of a theoretical framework of Shared Perception in HRI, valid for interpreting and developing different socio-perceptual mechanisms and abilities.
ii) The investigation of Shared Perception in humans, focusing on the perceptual mechanism of Context Dependency and therefore exploring how social interaction affects the use of previous experience in human spatial perception. iii) The implementation of a deep-learning model for Addressee Estimation to foster robots’ socio-perceptual skills through the awareness of others’ behavior, as suggested in the Shared Perception framework. To achieve the first Research Objective, several human socio-perceptual mechanisms are presented and interpreted in a unified account. This exposition draws parallels between mechanisms elicited by interaction with humans and with humanoid robots, and aims to build a framework valid for investigating human perception in the context of HRI. Based on the thought of D. Davidson and conceived as the integration of information coming from the environment, the self, and other agents, the idea of "triangulation" expresses the critical dynamics of Shared Perception. It is also proposed as the functional structure to support the implementation of socio-perceptual skills in robots. This general framework serves as a reference to fulfill the other two Research Objectives, which explore specific aspects of Shared Perception. Regarding the second Research Objective, the human perceptual mechanism of Context Dependency is investigated, for the first time, within social interaction. Human perception is based on unconscious inference, where sensory inputs are integrated with prior information. This phenomenon helps in facing the uncertainty of the external world with predictions built upon previous experience. To investigate the effect of social interaction on this mechanism, the iCub robot has been used as an experimental tool to create an interactive scenario with a controlled setting.
A user study based on psychophysical methods, Bayesian modeling, and a neural network analysis of human results demonstrated that social interaction influenced Context Dependency: when interacting with a social agent, humans rely less on their internal models and more on external stimuli. These results are framed in Shared Perception and contribute to revealing the integration dynamics of the three sources of Shared Perception. The others’ presence and social behavior (other agents) affect the balance between sensory inputs (environment) and personal history (self) in favor of the information shared with others, that is, the environment. The third Research Objective consists of tackling the Addressee Estimation problem, i.e., understanding to whom a speaker is talking, to improve the iCub's social behavior in multi-party interactions. Addressee Estimation can be considered a Shared Perception ability because it is achieved by using sensory information from the environment, internal representations of the agents’ positions, and, more importantly, the understanding of others’ behavior. An architecture for Addressee Estimation is thus designed considering the integration process of Shared Perception (environment, self, other agents) and partially implemented with respect to the third element: the awareness of others’ behavior. To achieve this, a hybrid deep-learning (CNN+LSTM) model is developed to estimate the speaker-robot relative placement of the addressee based on the non-verbal behavior of the speaker. Addressee Estimation abilities based on Shared Perception dynamics are aimed at improving multi-party HRI. Making robots aware of other agents’ behavior towards the environment is the first crucial step for incorporating such information into the robot’s perception and modeling Shared Perception.
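
    The Context Dependency finding in the abstract above (less reliance on internal models, more on external stimuli) can be illustrated with a textbook Gaussian prior-times-likelihood estimate. This is only a minimal sketch under stated assumptions, not the thesis's actual Bayesian model: the function name `posterior_estimate`, the stimulus values, and all standard deviations are hypothetical.

```python
def posterior_estimate(stimulus, prior_mean, sigma_sensory, sigma_prior):
    """Posterior mean for a Gaussian prior combined with a Gaussian likelihood.

    Returns (estimate, w_prior), where w_prior is the weight given to the
    prior (the "internal model") and 1 - w_prior the weight of the stimulus.
    """
    w_prior = sigma_sensory**2 / (sigma_sensory**2 + sigma_prior**2)
    estimate = w_prior * prior_mean + (1.0 - w_prior) * stimulus
    return estimate, w_prior


# Hypothetical values: a stimulus length and the mean of previously seen lengths.
stimulus = 14.0
prior_mean = 10.0

# Assumption: a narrow prior stands for strong reliance on previous experience.
est_narrow, w_narrow = posterior_estimate(stimulus, prior_mean,
                                          sigma_sensory=2.0, sigma_prior=2.0)

# Assumption: a wider prior stands for weaker reliance on previous experience,
# which is one way to express the effect reported for the social condition.
est_wide, w_wide = posterior_estimate(stimulus, prior_mean,
                                      sigma_sensory=2.0, sigma_prior=4.0)

print(f"narrow prior: estimate={est_narrow:.2f}, prior weight={w_narrow:.2f}")
print(f"wide prior:   estimate={est_wide:.2f}, prior weight={w_wide:.2f}")
```

    With the wider prior, the estimate moves away from the prior mean and towards the stimulus, which is the direction of the effect the abstract describes.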

    To Whom are You Talking? A Deep Learning Model to Endow Social Robots with Addressee Estimation Skills

    Communicating shapes our social world. For a robot to be considered social and consequently be integrated into our social environment, it is fundamental that it understands some of the dynamics that rule human-human communication. In this work, we tackle the problem of Addressee Estimation, the ability to understand an utterance's addressee, by interpreting and exploiting non-verbal bodily cues from the speaker. We do so by implementing a hybrid deep learning model composed of convolutional layers and LSTM cells, taking as input images portraying the face of the speaker and 2D vectors of the speaker's body posture. Our implementation choices were guided by the aim of developing a model that could be deployed on social robots and be efficient in ecological scenarios. We demonstrate that our model is able to solve the Addressee Estimation problem in terms of addressee localisation in space, from a robot ego-centric point of view. Comment: Accepted version of a paper published at the 2023 International Joint Conference on Neural Networks (IJCNN). Please find the published version and info to cite the paper at https://doi.org/10.1109/IJCNN54540.2023.10191452 . 10 pages, 8 Figures, 3 Tables

    Automatic Context-Driven Inference of Engagement in HMI: A Survey

    An integral part of seamless human-human communication is engagement, the process by which two or more participants establish, maintain, and end their perceived connection. Therefore, to develop successful human-centered human-machine interaction applications, automatic engagement inference is one of the tasks required to achieve engaging interactions between humans and machines, and to make machines attuned to their users, hence enhancing user satisfaction and technology acceptance. Several factors contribute to engagement state inference, including the interaction context and the interactants' behaviours and identity. Indeed, engagement is a multi-faceted and multi-modal construct that requires high accuracy in the analysis and interpretation of contextual, verbal and non-verbal cues. Thus, the development of an automated and intelligent system that accomplishes this task has proven challenging so far. This paper presents a comprehensive survey of previous work on engagement inference for human-machine interaction, covering interdisciplinary definitions, engagement components and factors, publicly available datasets, ground truth assessment, and the most commonly used features and methods, serving as a guide for the development of future human-machine interaction interfaces with reliable context-aware engagement inference capability. An in-depth review across embodied and disembodied interaction modes, and an emphasis on the interaction context in which engagement perception modules are integrated, set the presented survey apart from existing surveys.

    Combining dynamic head pose-gaze mapping with the robot conversational state for attention recognition in human-robot interactions

    The ability to recognize the visual focus of attention (VFOA, i.e. what or whom a person is looking at) of people is important for robots or conversational agents interacting with multiple people, since it plays a key role in turn-taking, engagement or intention monitoring. As eye gaze estimation is often impossible to achieve, most systems currently rely on head pose as an approximation, creating ambiguities since the same head pose can be used to look at different VFOA targets. To address this challenge, we propose a dynamic Bayesian model for VFOA recognition from head pose, where we make two main contributions. First, taking inspiration from behavioral models describing the relationships between the body, head and gaze orientations involved in gaze shifts, we propose novel gaze models that dynamically and more accurately predict the expected head orientation used for looking in a given gaze target direction. This is a neglected aspect of previous works but essential for recognition. Secondly, we propose to exploit the robot conversational state (when it speaks, objects to which it refers) as context to set appropriate priors on candidate VFOA targets and reduce the inherent VFOA ambiguities. Experiments on a public dataset where the humanoid robot NAO plays the role of an art guide and quiz master demonstrate the benefit of the two contributions.
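
    The two ideas in the abstract above, a predicted head orientation per gaze target and context-dependent priors, can be sketched with a static Bayes rule. This is an illustrative simplification, not the paper's dynamic model: the fixed head-to-gaze ratio `kappa`, the Gaussian width `sigma`, the target directions, and the prior values are all assumptions made for the example.

```python
import math


def gaussian(x, mean, sigma):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def vfoa_posterior(head_pan, target_dirs, priors, kappa=0.6, sigma=12.0):
    """Posterior over VFOA targets given an observed head pan angle (degrees).

    Assumption: the head covers only a fixed fraction `kappa` of the gaze
    direction, so the expected head pan for a target is kappa * gaze_dir.
    """
    scores = {}
    for name, gaze_dir in target_dirs.items():
        expected_head = kappa * gaze_dir
        scores[name] = priors[name] * gaussian(head_pan, expected_head, sigma)
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


# Hypothetical scene: gaze directions (degrees) of the candidate targets.
targets = {"robot": 0.0, "painting": 40.0, "partner": -50.0}

# Without context every target is equally likely a priori.
uniform = {name: 1.0 / len(targets) for name in targets}
# Assumed prior while the robot is referring to the painting.
referring = {"robot": 0.2, "painting": 0.6, "partner": 0.2}

head_pan = 12.0  # observed head pan, ambiguous between robot and painting
p_uniform = vfoa_posterior(head_pan, targets, uniform)
p_context = vfoa_posterior(head_pan, targets, referring)

print({k: round(v, 3) for k, v in p_uniform.items()})
print({k: round(v, 3) for k, v in p_context.items()})
```

    In this toy scene the observed head pan lies midway between the expected head orientations for the robot and the painting, so the head pose alone cannot separate them; the conversational prior resolves the tie in favour of the object being referred to.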

    Socially aware conversational agents

    Tracking and modeling focus of attention in meetings [online]

    Abstract: This thesis addresses the problem of tracking the focus of attention of people. In particular, a system to track the focus of attention of participants in meetings is developed. Obtaining knowledge about a person's focus of attention is an important step towards a better understanding of what people do, how and with what or whom they interact, or to what they refer. In meetings, focus of attention can be used to disambiguate the addressees of speech acts, to analyze interaction and for indexing of meeting transcripts. Tracking a user's focus of attention also greatly contributes to the improvement of human-computer interfaces, since it can be used to build interfaces and environments that become aware of what the user is paying attention to or with what or whom he or she is interacting. The direction in which people look, i.e., their gaze, is closely related to their focus of attention. In this thesis, we estimate a subject's focus of attention based on his or her head orientation. While the direction in which someone looks is determined by head orientation and eye gaze, relevant literature suggests that head orientation alone is a sufficient cue for the detection of someone's direction of attention during social interaction. We present experimental results from a user study and from several recorded meetings that support this hypothesis. We have developed a Bayesian approach to model at whom or what someone is looking based on his or her head orientation. To estimate head orientations in meetings, the participants' faces are automatically tracked in the view of a panoramic camera, and neural networks are used to estimate their head orientations from pre-processed images of their faces. Using this approach, the focus of attention target of subjects could be correctly identified during 73% of the time in a number of evaluation meetings with four participants.
In addition, we have investigated whether a person's focus of attention can be predicted from other cues. Our results show that focus of attention is correlated with who is speaking in a meeting and that it is possible to predict a person's focus of attention based on the information of who is talking or was talking before a given moment. We have trained neural networks to predict at whom a person is looking, based on information about who was speaking. Using this approach, we were able to predict who is looking at whom with 63% accuracy on the evaluation meetings using only information about who was speaking. We show that by using both head orientation and speaker information to estimate a person's focus, the accuracy of focus detection can be improved compared to using just one of the modalities for focus estimation. To demonstrate the generality of our approach, we have built a prototype system to demonstrate focus-aware interaction with a household robot and other smart appliances in a room, using the developed components for focus of attention tracking. In the demonstration environment, a subject could interact with a simulated household robot, a speech-enabled VCR or with other people in the room, and the recipient of the subject's speech was disambiguated based on the user's direction of attention.
Summary: This work deals with the automatic determination and tracking of the focus of attention of people in meetings. Determining people's focus of attention is very important for understanding and automatically analyzing meeting records. For example, it can be used to find out who addressed whom at a given point in time, or who was listening to whom. The automatic determination of the focus of attention can furthermore be used to improve human-machine interfaces. An important cue to the direction in which a person directs his or her attention is the person's head orientation. Therefore, a method for determining the head orientations of people was developed. For this purpose, artificial neural networks were used, which receive pre-processed images of a person's head as input and compute an estimate of the head orientation as output. With the trained networks, a mean error of nine to ten degrees for the estimation of horizontal and vertical head orientation was achieved on image data of new persons, i.e. persons whose images were not contained in the training set. Furthermore, a probabilistic approach for determining attention targets is presented. A Bayesian approach is used to determine the a posteriori probabilities of different attention targets, given the observed head orientations of a person. The developed approaches were evaluated on several meetings with four to five participants. A further contribution of this work is the investigation of the extent to which the gaze direction of meeting participants can be predicted based on who is currently speaking. A method was developed that uses neural networks to estimate a person's focus based on a short history of speaker constellations. We show that by combining the image-based and the speaker-based estimates of the focus of attention, a clearly improved estimate can be achieved. Overall, this work presents for the first time a system for automatically tracking the attention of people in a meeting room. The developed approaches and methods can also be used to determine the attention of people in other domains, in particular for controlling computerized, interactive environments. This is shown with an example application.
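
    The abstract above reports that combining head-orientation-based and speaker-based estimates improves focus detection, without stating the combination rule here. One simple possibility is a weighted sum of the two per-target probability estimates, sketched below. The rule, the weight `alpha`, the function name `fuse`, and all probability values are assumptions for illustration and are not taken from the thesis.

```python
def fuse(p_head, p_speaker, alpha=0.5):
    """Weighted-sum fusion of two probability estimates over the same targets.

    alpha is the (assumed) weight given to the speaker-based estimate;
    1 - alpha goes to the head-orientation-based estimate.
    """
    fused = {t: (1.0 - alpha) * p_head[t] + alpha * p_speaker[t] for t in p_head}
    total = sum(fused.values())  # renormalise to guard against rounding drift
    return {t: v / total for t, v in fused.items()}


# Hypothetical estimates for one observer in a four-person meeting
# (the three other participants are the candidate focus targets).
p_head = {"A": 0.45, "B": 0.40, "C": 0.15}     # from head orientation
p_speaker = {"A": 0.20, "B": 0.70, "C": 0.10}  # from who is/was speaking

fused = fuse(p_head, p_speaker, alpha=0.4)
head_only_choice = max(p_head, key=p_head.get)
fused_choice = max(fused, key=fused.get)

print("head only:", head_only_choice)
print("fused:    ", fused_choice, {t: round(v, 2) for t, v in fused.items()})
```

    In this made-up case the head orientation slightly favours participant A, while the speaker information shifts the fused decision to participant B, showing how a second modality can resolve a near-tie.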