
    Tracking Gaze and Visual Focus of Attention of People Involved in Social Interaction

    The visual focus of attention (VFOA) has been recognized as a prominent conversational cue. We are interested in estimating and tracking the VFOAs associated with multi-party social interactions. We note that in situations of this type the participants either look at each other or at an object of interest, so their eyes are not always visible. Consequently, gaze and VFOA estimation cannot be based on eye detection and tracking. We propose a method that exploits the correlation between eye gaze and head movements. Both VFOA and gaze are modeled as latent variables in a Bayesian switching state-space model. The proposed formulation leads to a tractable learning procedure and to an efficient algorithm that simultaneously tracks gaze and visual focus. The method is tested and benchmarked using two publicly available datasets that contain typical multi-party human-robot and human-human interactions.
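The core filtering idea can be illustrated with a minimal sketch (the 1-D angular geometry, parameter values and function names are illustrative assumptions, not the authors' implementation): a discrete VFOA target is tracked with an HMM forward pass, where each candidate target implies a head direction and the observation likelihood is Gaussian around it.

```python
import numpy as np

def vfoa_forward_filter(head_obs, target_dirs, stay_prob=0.9, sigma=5.0):
    """HMM forward filter over discrete VFOA targets.

    head_obs:    observed head directions (degrees), shape (T,)
    target_dirs: head direction implied by each VFOA target, shape (K,)
    stay_prob:   probability of keeping the same focus between frames
    sigma:       std. dev. of head direction around the attended target
    """
    K = len(target_dirs)
    trans = np.full((K, K), (1.0 - stay_prob) / (K - 1))
    np.fill_diagonal(trans, stay_prob)
    belief = np.full(K, 1.0 / K)  # uniform prior over targets
    beliefs = []
    for h in head_obs:
        # observation likelihood: head pose is a noisy cue for the gaze target
        lik = np.exp(-0.5 * ((h - target_dirs) / sigma) ** 2)
        belief = lik * (trans.T @ belief)  # predict, then update
        belief /= belief.sum()
        beliefs.append(belief.copy())
    return np.array(beliefs)
```

With head observations hovering near one target's direction, the posterior quickly concentrates on that target even though the eyes are never observed directly.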

    Graphical models for social behavior modeling in face-to face interaction

    The goal of this paper is to model the coverbal behavior of a subject involved in face-to-face social interactions. To this end, we present a multimodal behavioral model based on a Dynamic Bayesian Network (DBN). The model was inferred from multimodal data of interacting dyads in a specific scenario designed to foster mutual attention and multimodal deixis of objects and places in a collaborative task. The challenge for this behavioral model is to generate coverbal actions (gaze, hand gestures) for the subject given his verbal productions, the current phase of the interaction and the perceived actions of the partner. In our work, the structure of the DBN was learned from data, which revealed an interesting causality graph describing precisely how verbal and coverbal human behaviors are coordinated during the studied interactions. Using this structure, the DBN exhibits better performance than classical baseline models such as Hidden Markov Models (HMMs) and Hidden Semi-Markov Models (HSMMs). We outperform the baselines in both measures of performance, i.e. interaction unit recognition and behavior generation. The DBN also reproduces more faithfully the coordination patterns between modalities observed in the ground truth compared to the baseline models.
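The generation step described above can be sketched as sampling from a conditional probability table: given the interaction phase and the perceived partner action, draw a coverbal action. The phases, actions and probabilities below are invented for illustration; the paper learns both the network structure and its parameters from dyadic data.

```python
import random

# Hypothetical CPT: P(coverbal action | interaction phase, partner action).
CPT = {
    ("deixis", "looks_at_object"):  {"point": 0.6, "gaze_object": 0.3, "none": 0.1},
    ("deixis", "looks_at_subject"): {"point": 0.3, "gaze_partner": 0.5, "none": 0.2},
    ("chat",   "looks_at_subject"): {"gaze_partner": 0.7, "gesture": 0.1, "none": 0.2},
}

def sample_coverbal(phase, partner_action, rng=None):
    """Sample one coverbal action from the conditional table."""
    rng = rng or random.Random(0)
    dist = CPT[(phase, partner_action)]
    actions, probs = zip(*dist.items())
    return rng.choices(actions, weights=probs, k=1)[0]

def most_likely_coverbal(phase, partner_action):
    """Deterministic variant: pick the highest-probability action."""
    dist = CPT[(phase, partner_action)]
    return max(dist, key=dist.get)
```

A learned DBN would condition on more parents (e.g. the subject's own verbal production), but the generation mechanism is the same table lookup followed by sampling.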

    SALSA: A Novel Dataset for Multimodal Group Behavior Analysis

    Studying free-standing conversational groups (FCGs) in unstructured social settings (e.g., cocktail parties) is gratifying due to the wealth of information available at the group (mining social networks) and individual (recognizing native behavioral and personality traits) levels. However, analyzing social scenes involving FCGs is also highly challenging due to the difficulty of extracting behavioral cues such as target locations, speaking activity and head/body pose under crowdedness and extreme occlusions. To this end, we propose SALSA, a novel dataset facilitating multimodal and Synergetic sociAL Scene Analysis, and make two main contributions to research on automated social interaction analysis: (1) SALSA records social interactions among 18 participants in a natural, indoor environment for over 60 minutes, under poster presentation and cocktail party contexts presenting difficulties in the form of low-resolution images, lighting variations, numerous occlusions, reverberations and interfering sound sources; (2) to alleviate these problems, we facilitate multimodal analysis by recording the social interplay using four static surveillance cameras and sociometric badges worn by each participant, each comprising a microphone, accelerometer, Bluetooth and infrared sensors. In addition to raw data, we also provide annotations concerning individuals' personality as well as their position, head and body orientation, and F-formation information over the entire event duration. Through extensive experiments with state-of-the-art approaches, we show (a) the limitations of current methods and (b) how the recorded multiple cues synergetically aid automatic analysis of social interactions. SALSA is available at http://tev.fbk.eu/salsa.

    How to look next? A data-driven approach for scanpath prediction

    By and large, current visual attention models mostly rely, when considering static stimuli, on the following procedure. Given an image, a saliency map is computed, which, in turn, might serve the purpose of predicting a sequence of gaze shifts, namely a scanpath instantiating the dynamics of visual attention deployment. The temporal pattern of attention unfolding is thus confined to the scanpath generation stage, whilst salience is conceived as a static map, at best conflating a number of factors (bottom-up information, top-down information, spatial biases, etc.). In this note we propose a novel sequential scheme consisting of three processing stages relying on a center-bias model, a context/layout model, and an object-based model, respectively. Each stage contributes, at different times, to the sequential sampling of the final scanpath. We compare the method against classic scanpath generation that exploits a state-of-the-art static saliency model. Results show that accounting for the structure of the temporal unfolding leads to gaze dynamics closer to human gaze behaviour.
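The sequential sampling loop can be sketched as follows. As a simplifying assumption (not the authors' exact combination rule), the three maps are multiplied into one priority map, fixations are drawn from it in turn, and inhibition of return suppresses a disc around each chosen location before the next draw.

```python
import numpy as np

def scanpath(center_bias, context_map, object_map, n_fix=5, ior_radius=2, seed=0):
    """Sequentially sample fixations from a staged combination of maps.

    Each argument is a 2-D non-negative array of the same shape; their
    product defines the sampling priority at each step.
    """
    rng = np.random.default_rng(seed)
    prio = (center_bias * context_map * object_map).astype(float)
    h, w = prio.shape
    ys, xs = np.mgrid[0:h, 0:w]
    fixations = []
    for _ in range(n_fix):
        p = prio.ravel() / prio.ravel().sum()
        idx = rng.choice(h * w, p=p)
        y, x = divmod(int(idx), w)
        fixations.append((y, x))
        # inhibition of return: suppress a disc around the chosen fixation
        prio[(ys - y) ** 2 + (xs - x) ** 2 <= ior_radius ** 2] = 1e-12
    return fixations
```

Because sampling is stochastic rather than winner-take-all, repeated runs with different seeds produce the scanpath variability observed across human observers.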

    A transparent framework towards the context-sensitive recognition of conversational engagement

    Modelling and recognising affective and mental user states is a pressing topic in multiple research fields. This work suggests an approach towards adequate recognition of such states by combining state-of-the-art behaviour recognition classifiers in a transparent and explainable modelling framework that also allows contextual aspects to be considered in the inference process. More precisely, in this paper we exemplify the idea of our framework with the recognition of conversational engagement in bi-directional conversations. We introduce a multi-modal annotation scheme for conversational engagement. We further introduce our hybrid approach that combines the accuracy of state-of-the-art machine learning techniques, such as deep learning, with the capabilities of Bayesian Networks, which are inherently interpretable and feature an important aspect that modern approaches are lacking: causal inference. In an evaluation on a large multi-modal corpus of bi-directional conversations, we show that this hybrid approach can even outperform state-of-the-art black-box approaches by considering context information and causal relations.
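One common way to wire a black-box classifier into a Bayesian network is as "virtual evidence": the classifier's soft output enters the network as a likelihood over the latent state rather than as a hard observation. The sketch below is a hypothetical two-node reduction of that idea (the context prior and cue likelihoods are invented for illustration, not the paper's network).

```python
def engagement_posterior(context_prior, cue_likelihoods):
    """Posterior P(engaged | context, cues) in a minimal naive-Bayes form.

    context_prior:   P(engaged) implied by context (e.g. dialogue phase)
    cue_likelihoods: list of (P(cue | engaged), P(cue | not engaged)) pairs,
                     e.g. soft outputs of per-cue neural classifiers
    """
    p_e, p_ne = context_prior, 1.0 - context_prior
    for l_e, l_ne in cue_likelihoods:
        p_e *= l_e    # evidence supporting engagement
        p_ne *= l_ne  # evidence supporting disengagement
    return p_e / (p_e + p_ne)
```

The transparency claim rests on exactly this structure: every factor in the product is inspectable, so one can trace which cue or context term drove a given engagement decision.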

    Object and feature based modelling of attention in meeting and surveillance videos

    MPhil
    The aim of the thesis is to create and validate models of visual attention. To this end, a novel unsupervised object detection and tracking framework has been developed by the author. It is demonstrated on people, faces and moving objects, and the output is integrated into the modelling of visual attention. The proposed approach integrates several types of modules for initialisation, target estimation and validation. Tracking is first used to introduce high-level features, by extending a popular model based on low-level features [1]. Two automatic models of visual attention are further implemented: one based on winner-take-all and inhibition of return as the mechanisms of selection over a saliency model combining high- and low-level features; another based only on high-level object tracking results and statistical properties of the collected eye-traces, with the possibility of activating inhibition of return as an additional mechanism. The parameters of the tracking framework are thoroughly investigated and its success demonstrated. Eye-tracking experiments show that high-level features are much better at explaining the allocation of attention by the subjects in the study. Low-level features alone do correlate significantly with the real allocation of attention; however, combining them with high-level features in fact lowers the correlation score compared to using high-level features alone. Further, findings in the collected eye-traces are studied with a qualitative method, mainly to identify directions for future research in the area. Similarities and dissimilarities between the automatic models of attention and the collected eye-traces are discussed.
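The winner-take-all plus inhibition-of-return selection mechanism mentioned above has a standard minimal form (a textbook sketch, not the thesis's implementation): repeatedly pick the most salient location, then suppress a disc around it so the next winner lies elsewhere.

```python
import numpy as np

def wta_scan(saliency, n_fix=3, ior_radius=1):
    """Winner-take-all fixation selection with inhibition of return."""
    s = saliency.astype(float).copy()
    h, w = s.shape
    ys, xs = np.mgrid[0:h, 0:w]
    fixations = []
    for _ in range(n_fix):
        y, x = np.unravel_index(np.argmax(s), s.shape)  # winner takes all
        fixations.append((int(y), int(x)))
        # inhibition of return: rule out a disc around the winner
        s[(ys - y) ** 2 + (xs - x) ** 2 <= ior_radius ** 2] = -np.inf
    return fixations
```

In the thesis's first model the saliency map would combine high- and low-level features before this selection loop runs; the second model replaces the map with statistics derived from object tracks and eye-traces.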