
    Robust Modeling of Epistemic Mental States

    This work identifies and advances several research challenges in the analysis of facial features and their temporal dynamics in relation to epistemic mental states in dyadic conversations. The epistemic states considered are Agreement, Concentration, Thoughtful, Certain, and Interest. In this paper, we perform a number of statistical analyses and simulations to identify the relationship between facial features and epistemic states. Non-linear relations are found to be more prevalent, and temporal features derived from the original facial features show a strong correlation with intensity changes. We then propose a novel prediction framework that takes facial features and their non-linear relation scores as input and predicts the different epistemic states in videos. Prediction of the epistemic states is further boosted when the classification of emotion-change regions (rising, falling, or steady-state) is incorporated with the temporal features. The proposed predictive models predict the epistemic states with significantly improved accuracy: the correlation coefficient (CoERR) is 0.827 for Agreement, 0.901 for Concentration, 0.794 for Thoughtful, 0.854 for Certain, and 0.913 for Interest. Comment: Accepted for publication in Multimedia Tools and Applications, Special Issue: Socio-Affective Technologies.
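
    The abstract does not spell out how the framework is implemented, but the general idea of deriving temporal features from a facial-feature track and combining them with rising/falling/steady change-region labels can be sketched as below. This is a minimal illustration only: the window size, threshold, toy data, and the random-forest regressor are assumptions of this sketch, not the authors' design.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        def temporal_features(track, window=5):
            """Illustrative temporal features for one facial-feature time series:
            per-frame deltas and a smoothed local slope."""
            deltas = np.gradient(track)
            slope = np.convolve(deltas, np.ones(window) / window, mode="same")
            return deltas, slope

        def change_regions(slope, eps=1e-2):
            """Label each frame as rising (+1), falling (-1), or steady (0)."""
            return np.where(slope > eps, 1, np.where(slope < -eps, -1, 0))

        # Toy data: one facial feature over 100 frames and a hypothetical state intensity.
        rng = np.random.default_rng(0)
        track = np.cumsum(rng.normal(0, 0.1, 100))
        deltas, slope = temporal_features(track)
        regions = change_regions(slope)

        X = np.column_stack([track, deltas, slope, regions])    # per-frame feature vector
        y = 0.5 * track + rng.normal(0, 0.05, 100)              # toy target intensity
        model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
        print("train R^2:", round(model.score(X, y), 3))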

    Virtual Meeting Rooms: From Observation to Simulation

    Much working time is spent in meetings and, as a consequence, meetings have become the subject of multidisciplinary research. Virtual Meeting Rooms (VMRs) are 3D virtual replicas of meeting rooms, in which various modalities such as speech, gaze, distance, gestures and facial expressions can be controlled. This allows VMRs to be used to improve remote meeting participation, to visualize multimedia data, and as an instrument for research into social interaction in meetings. This paper describes how these three uses can be realized in a VMR. We describe the process from observation through annotation to simulation, as well as a model that describes the relations between the annotated features of verbal and non-verbal conversational behavior. As an example of social perception research in the VMR, we describe an experiment that assesses human observers' accuracy in judging head orientation.

    Sensing, interpreting, and anticipating human social behaviour in the real world

    Low-level nonverbal social signals like glances, utterances, facial expressions and body language are central to human communication and have been shown to be connected to important high-level constructs such as emotions, turn-taking, rapport, or leadership. A prerequisite for creating social machines that are able to support humans in, e.g., education, psychotherapy, or human resources is the ability to automatically sense, interpret, and anticipate human nonverbal behaviour. While promising results have been shown in controlled settings, automatically analysing unconstrained situations, e.g. in daily-life settings, remains challenging. Furthermore, the anticipation of nonverbal behaviour in social situations is still largely unexplored. The goal of this thesis is to move closer to the vision of social machines in the real world. It makes fundamental contributions along the three dimensions of sensing, interpreting and anticipating nonverbal behaviour in social interactions.

    First, robust recognition of low-level nonverbal behaviour lays the groundwork for all further analysis steps. Advancing human visual behaviour sensing is especially relevant, as the current state of the art is still not satisfactory in many daily-life situations. While many social interactions take place in groups, current methods for unsupervised eye contact detection can only handle dyadic interactions. We propose a novel unsupervised method for multi-person eye contact detection that exploits the connection between gaze and speaking turns. Furthermore, we make use of mobile device engagement to address the calibration drift that occurs in daily-life usage of mobile eye trackers.

    Second, we improve the interpretation of social signals in terms of higher-level social behaviours. In particular, we propose the first dataset and method for emotion recognition from bodily expressions of freely moving, unaugmented dyads. Furthermore, we are the first to study low rapport detection in group interactions, and we investigate a cross-dataset evaluation setting for the emergent leadership detection task.

    Third, human visual behaviour is special because it functions as a social signal and also determines what a person is seeing at a given moment in time. Being able to anticipate human gaze opens up the possibility for machines to share attention with humans more seamlessly, or to intervene in a timely manner if humans are about to overlook important aspects of the environment. We are the first to propose methods for anticipating eye contact in dyadic conversations, as well as in the context of mobile device interactions during daily life, thereby paving the way for interfaces that are able to proactively intervene and support interacting humans.
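
    The abstract only names the key idea behind the unsupervised multi-person eye contact detector, namely the connection between gaze and speaking turns. A very rough sketch of how speaking turns could provide weak eye contact labels is given below; the seating-direction layout, angular tolerance, and toy values are assumptions of this illustration, not the method described in the thesis.

        import numpy as np

        def weak_eye_contact_labels(gaze_yaw, speaker_id, n_people, tol_deg=10.0):
            """Treat frames in which the observer's gaze yaw points towards the
            current speaker (within tol_deg) as weak 'eye contact' positives."""
            seat_dirs = np.linspace(-60, 60, n_people)   # hypothetical seating directions (degrees)
            target_dir = seat_dirs[speaker_id]
            return np.abs(np.asarray(gaze_yaw) - target_dir) < tol_deg

        # Toy usage: five frames of estimated gaze yaw while person 2 (of 3) is speaking.
        gaze_yaw = [58.0, 2.0, 61.0, 59.5, -55.0]
        print(weak_eye_contact_labels(gaze_yaw, speaker_id=2, n_people=3))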

    Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition

    This paper presents a self-supervised method for the visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement acoustic detection of the active speaker, thus improving the system's robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations, and is thus consistent with cognitive development. Instead, the method uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method on a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting, whereas in a speaker-independent setting the method yields significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions. Comment: 10 pages, IEEE Transactions on Cognitive and Developmental Systems.
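
    The paper's core training scheme, as described in the abstract, is to let the auditory modality supervise learning in the visual domain. A minimal sketch of that kind of cross-modal self-supervision is shown below; the energy-threshold voice activity detector, the synthetic features, and the logistic-regression classifier are placeholders chosen for illustration, not the architecture used in the paper.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def audio_pseudo_labels(frame_energy, threshold=None):
            """Per-frame speaking/non-speaking pseudo-labels from audio energy;
            a crude threshold stands in for a real voice activity detector."""
            if threshold is None:
                threshold = frame_energy.mean()
            return (frame_energy > threshold).astype(int)

        # Toy data: 200 frames of 8-dim visual face features and matching audio energy.
        rng = np.random.default_rng(1)
        speaking = rng.integers(0, 2, 200)                         # hidden ground truth
        visual = rng.normal(0, 1, (200, 8)) + speaking[:, None] * 0.8
        energy = speaking * 1.0 + rng.normal(0, 0.3, 200)

        y_pseudo = audio_pseudo_labels(energy)                     # labels come from audio only
        clf = LogisticRegression(max_iter=1000).fit(visual, y_pseudo)
        print("agreement with hidden truth:", (clf.predict(visual) == speaking).mean())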

    Predicting gaze direction from head pose yaw and pitch

    Socially assistive robots (SARs) must be able to interpret non-verbal communication from a human. A person's gaze direction tells an observer where that person's visual attention is directed. It is therefore useful if a robot can interpret gaze direction, so that it can assess whether a person is looking at it or at some object in the environment. Gazing is a combination of head and eye movement, but detecting eye orientation from a distance is difficult in real-life environments. Instead, a robot can measure the head pose and infer the gaze direction from it. In this paper, we show that both the yaw and pitch of a human's gaze can be inferred from the measured yaw and pitch of the human's head pose using simple linear equations.
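
    The abstract reports that gaze yaw and pitch follow from head-pose yaw and pitch through simple linear equations, but does not list the coefficients. The sketch below fits such a per-axis linear mapping from hypothetical calibration data; the numbers are made up for illustration and are not values reported in the paper.

        import numpy as np

        # Hypothetical calibration pairs of (yaw, pitch) in degrees: head pose vs. true gaze.
        head = np.array([[-30.0, -5.0], [-10.0, 0.0], [0.0, 2.0], [15.0, 5.0], [30.0, 8.0]])
        gaze = np.array([[-42.0, -7.0], [-14.0, 0.5], [0.5, 3.0], [21.0, 7.0], [43.0, 11.0]])

        # Fit gaze = a * head + b independently for yaw and pitch (least squares).
        coeffs = [np.polyfit(head[:, i], gaze[:, i], deg=1) for i in range(2)]

        def predict_gaze(head_yaw, head_pitch):
            """Linear head-pose-to-gaze mapping using the toy coefficients fitted above."""
            (a_yaw, b_yaw), (a_pitch, b_pitch) = coeffs
            return a_yaw * head_yaw + b_yaw, a_pitch * head_pitch + b_pitch

        print(predict_gaze(20.0, 4.0))   # estimated (gaze yaw, gaze pitch) in degrees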

    The Benefits and the Costs of Using Auditory Warning Messages in Dynamic Decision Making Settings

    The failure to notice critical changes in both visual and auditory scenes may have important consequences for performance in complex dynamic environments, especially those related to security such as aviation, surveillance during major events, and command and control of emergency response. Previous work has shown that a significant number of situation changes remain undetected by operators in such environments. In the current study, we examined the impact of using auditory warning messages to support the detection of critical situation changes and, more broadly, the decision making required by the environment. Twenty-two participants performed a radar operator task involving multiple subtasks while detecting critical task-related events that were cued by a specific type of audio message. Results showed that about 22% of the critical changes remained undetected by participants, a percentage similar to that found in previous work using visual cues to support change detection. However, we found that audio messages tended to bias threat evaluation, leading participants to perceive objects as more threatening than they actually were. These findings reveal both benefits and costs of using audio messages to support change detection in complex dynamic environments.

    Managing heterogeneous cues in social contexts. A holistic approach for social interactions analysis

    Social interaction refers to any reciprocal action between two or more individuals in which information is shared without any mediating technology. This interaction is a significant part of an individual's socialization and of the experience gained throughout one's lifetime, and it is of interest to several disciplines (sociology, psychology, medicine, etc.). In the context of tests and observational studies, multiple mechanisms are used to study these interactions, such as questionnaires, direct observation and analysis of events by human operators, or a posteriori observation and analysis of recorded events by specialists (psychologists, sociologists, doctors, etc.). However, such mechanisms are expensive in terms of processing time, require a high level of attention to analyze several cues simultaneously, depend on the operator (the analysis is subjective), and can only target one facet of the interaction. To address these issues, the social interaction analysis process needs to be automated, bridging the gap between human-based and machine-based analysis.

    We therefore propose a holistic approach that integrates multimodal heterogeneous cues and contextual information (complementary "exogenous" data) dynamically and optionally, depending on their availability. Such an approach allows several cues to be analyzed in parallel, whereas a human can focus on only one. The analysis can be further enriched with data related to the context of the scene (location, date, type of music, event description, etc.) or to the individuals (name, age, gender, data extracted from their social networks, etc.). This contextual information enriches the modeling of the extracted metadata and gives it a more "semantic" dimension. Managing this heterogeneity is an essential step towards implementing a holistic approach. Automating in vivo capture and observation using non-intrusive devices and without predefined scenarios raises issues related to data (i) privacy and security, (ii) heterogeneity, and (iii) volume. Hence, within the holistic approach, we propose (1) a comprehensive privacy-preserving data model that decouples metadata extraction from the social interaction analysis methods; (2) a geometric, non-intrusive eye contact detection method; and (3) a deep model for French food classification that extracts information from video content. The proposed approach manages heterogeneous cues from different modalities as multi-layer sources (visual signals, voice signals, contextual information) at different time scales and with different combinations between layers, representing the cues as time series. The approach is designed to operate without intrusive devices, in order to capture real behaviors and achieve naturalistic observation. We have deployed the proposed approach on the OVALIE platform, which aims to study eating behaviors in different real-life contexts and is located at University Toulouse-Jean Jaurès, France.
    • 

    corecore