
    Visual Focus of Attention Estimation from Head Pose Posterior Probability Distributions

    We address the problem of recognizing the visual focus of attention (VFOA) of meeting participants from their head pose and contextual cues. The main contribution of the paper is the use of a head pose posterior distribution as a representation of the head pose information contained in the image data. This posterior encodes the probabilities of the different head poses given the image data, and therefore constitutes a richer representation of the data than the mean or the mode of this distribution, which all previous work has used. These observations are exploited in a joint interaction model of all meeting participants' pose observations, VFOAs, and speaking status, together with environmental contextual cues. Numerical experiments on a public database of 4 meetings of 22 min on average show that this change of representation yields a 5.4% gain over the standard approach of using head pose as the observation.
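To make the contrast in the abstract above concrete, the following is a minimal sketch of why a full head pose posterior can carry more information than its mode. It is not the paper's joint interaction model: the one-dimensional pan grid, the Gaussian per-target pose models, the target names, and both function names are assumptions made for illustration only.

```python
import math

def gaussian(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def vfoa_from_pose_posterior(pose_grid, pose_posterior, targets, prior):
    """Posterior over VFOA targets given a full head pose posterior.

    pose_grid      : discretized pan angles in degrees (assumed 1-D).
    pose_posterior : p(pose | image) on that grid, summing to 1.
    targets        : {name: (mean_pan, std_pan)}, hypothetical pose models.
    prior          : {name: P(target)}.
    """
    scores = {}
    for name, (mean, std) in targets.items():
        # Expected likelihood: sum over poses h of p(h | image) * p(h | target).
        lik = sum(p * gaussian(h, mean, std)
                  for h, p in zip(pose_grid, pose_posterior))
        scores[name] = prior[name] * lik
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

def vfoa_from_pose_mode(pose_grid, pose_posterior, targets, prior):
    """Baseline: keep only the most probable pose and discard the rest."""
    mode = max(zip(pose_grid, pose_posterior), key=lambda hp: hp[1])[0]
    one_hot = [1.0 if h == mode else 0.0 for h in pose_grid]
    return vfoa_from_pose_posterior(pose_grid, one_hot, targets, prior)

# Illustrative, made-up numbers: a bimodal pose posterior.
pose_grid = [-60.0, -30.0, 0.0, 30.0, 60.0]
pose_posterior = [0.05, 0.40, 0.10, 0.42, 0.03]
targets = {
    "left_person": (-35.0, 15.0),
    "table": (0.0, 15.0),
    "right_person": (35.0, 15.0),
}
prior = {name: 1.0 / len(targets) for name in targets}

full = vfoa_from_pose_posterior(pose_grid, pose_posterior, targets, prior)
mode_only = vfoa_from_pose_mode(pose_grid, pose_posterior, targets, prior)
print("full posterior:", full)
print("mode only     :", mode_only)
```

With a bimodal pose posterior like the one above, the mode-only baseline commits almost entirely to one target, while the full-posterior version keeps the competing target plausible so that contextual cues could still disambiguate it.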

    Tracking Gaze and Visual Focus of Attention of People Involved in Social Interaction

    The visual focus of attention (VFOA) has been recognized as a prominent conversational cue. We are interested in estimating and tracking the VFOAs associated with multi-party social interactions. We note that in this type of situation the participants either look at each other or at an object of interest; therefore their eyes are not always visible. Consequently, neither gaze nor VFOA estimation can be based on eye detection and tracking. We propose a method that exploits the correlation between eye gaze and head movements. Both VFOA and gaze are modeled as latent variables in a Bayesian switching state-space model. The proposed formulation leads to a tractable learning procedure and to an efficient algorithm that simultaneously tracks gaze and visual focus. The method is tested and benchmarked using two publicly available datasets that contain typical multi-party human-robot and human-human interactions. Comment: 15 pages, 8 figures, 6 tables

    Robust Head-Pose Estimation Based on Partially-Latent Mixture of Linear Regressions

    Head-pose estimation has many applications, such as social event analysis, human-robot and human-computer interaction, driving assistance, and so forth. Head-pose estimation is challenging because it must cope with changing illumination conditions, variability in face orientation and appearance, partial occlusions of facial landmarks, and bounding-box-to-face alignment errors. We propose to use a mixture of linear regressions with partially-latent output. This regression method learns to map high-dimensional feature vectors (extracted from bounding boxes of faces) onto the joint space of head-pose angles and bounding-box shifts, such that they are robustly predicted in the presence of unobservable phenomena. We describe in detail the mapping method, which combines the merits of unsupervised manifold learning techniques and of mixtures of regressions. We validate our method on three publicly available datasets and thoroughly benchmark four variants of the proposed algorithm against several state-of-the-art head-pose estimation methods. Comment: 12 pages, 5 figures, 3 tables

    Tracking and modeling focus of attention in meetings [online]

    Abstract: This thesis addresses the problem of tracking the focus of attention of people. In particular, a system to track the focus of attention of participants in meetings is developed. Obtaining knowledge about a person's focus of attention is an important step towards a better understanding of what people do, how and with what or whom they interact, or to what they refer. In meetings, focus of attention can be used to disambiguate the addressees of speech acts, to analyze interaction, and to index meeting transcripts. Tracking a user's focus of attention also greatly contributes to the improvement of human-computer interfaces, since it can be used to build interfaces and environments that become aware of what the user is paying attention to or with what or whom he is interacting. The direction in which people look, i.e., their gaze, is closely related to their focus of attention. In this thesis, we estimate a subject's focus of attention based on his or her head orientation. While the direction in which someone looks is determined by head orientation and eye gaze, relevant literature suggests that head orientation alone is a sufficient cue for the detection of someone's direction of attention during social interaction. We present experimental results from a user study and from several recorded meetings that support this hypothesis. We have developed a Bayesian approach to model at whom or what someone is looking based on his or her head orientation. To estimate head orientations in meetings, the participants' faces are automatically tracked in the view of a panoramic camera, and neural networks are used to estimate their head orientations from preprocessed images of their faces. Using this approach, the focus of attention target of subjects could be correctly identified 73% of the time in a number of evaluation meetings with four participants.
In addition, we have investigated whether a person's focus of attention can be predicted from other cues. Our results show that focus of attention is correlated with who is speaking in a meeting and that it is possible to predict a person's focus of attention based on the information of who is talking or was talking before a given moment. We have trained neural networks to predict at whom a person is looking, based on information about who was speaking. Using this approach we were able to predict who is looking at whom with 63% accuracy on the evaluation meetings using only information about who was speaking. We show that by using both head orientation and speaker information to estimate a person's focus, the accuracy of focus detection can be improved compared to using just one of the modalities. To demonstrate the generality of our approach, we have built a prototype system that demonstrates focus-aware interaction with a household robot and other smart appliances in a room, using the components developed for focus of attention tracking. In the demonstration environment, a subject could interact with a simulated household robot, a speech-enabled VCR, or other people in the room, and the recipient of the subject's speech was disambiguated based on the user's direction of attention.
Summary: This thesis deals with automatically determining and tracking the focus of attention of people in meetings. Determining people's focus of attention is very important for understanding and automatically analyzing meeting records. For example, it can reveal who addressed whom at a given time, or who was listening to whom. Automatic determination of the focus of attention can also be used to improve human-machine interfaces.
An important cue to the direction in which a person directs their attention is the person's head orientation. A method for estimating people's head orientations was therefore developed. It uses artificial neural networks that take preprocessed images of a person's head as input and output an estimate of the head orientation. On image data of new people, i.e., people whose images were not in the training set, the trained networks achieved a mean error of nine to ten degrees in estimating horizontal and vertical head orientation. In addition, a probabilistic approach to determining attention targets is presented. It uses a Bayesian approach to compute the posterior probabilities of different attention targets given a person's observed head orientations. The approaches were evaluated on several meetings with four to five participants. A further contribution of this thesis is an investigation of how well the gaze direction of meeting participants can be predicted from who is currently speaking. A method was developed that uses neural networks to estimate a person's focus from a short history of speaker constellations. We show that combining the image-based and speaker-based estimates of the focus of attention yields a markedly improved estimate. Overall, this thesis presents, for the first time, a system for automatically tracking the attention of people in a meeting room. The approaches and methods developed can also be used to determine people's attention in other settings, in particular for controlling computerized interactive environments. This is demonstrated with an example application.
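The thesis abstract above describes computing posterior probabilities of attention targets from head orientation and then combining them with a speaker-based prediction. The sketch below illustrates one way this could look. It is an assumption-laden illustration, not the thesis's method: the Gaussian class-conditional models, the product fusion rule (which assumes the two cues are conditionally independent given the target), the target names, and all numbers are hypothetical.

```python
import math

def head_posterior(pan, targets, prior):
    """P(target | pan) by Bayes' rule with Gaussian class-conditionals.

    targets : {name: (mean_pan, std_pan)} in degrees (hypothetical models).
    prior   : {name: P(target)}.
    """
    joint = {}
    for name, (mean, std) in targets.items():
        z = (pan - mean) / std
        lik = math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
        joint[name] = prior[name] * lik
    total = sum(joint.values())
    return {name: v / total for name, v in joint.items()}

def fuse(p_head, p_speech, prior):
    """Combine two per-cue posteriors into one.

    Assumes the cues are conditionally independent given the target, so
    P(t | head, speech) is proportional to P(t | head) * P(t | speech) / P(t).
    """
    fused = {name: p_head[name] * p_speech[name] / prior[name] for name in prior}
    total = sum(fused.values())
    return {name: v / total for name, v in fused.items()}

# Illustrative setup: three other participants seen at different pan angles.
targets = {
    "person_A": (-45.0, 20.0),
    "person_B": (0.0, 20.0),
    "person_C": (45.0, 20.0),
}
prior = {name: 1.0 / len(targets) for name in targets}

# An observed pan of -20 degrees is ambiguous between person_A and person_B.
p_head = head_posterior(-20.0, targets, prior)

# Hypothetical speaker-based prediction, e.g. from a model that has seen
# person_A talking during the last few frames.
p_speech = {"person_A": 0.6, "person_B": 0.2, "person_C": 0.2}

p_fused = fuse(p_head, p_speech, prior)
print("head only:", max(p_head, key=p_head.get), p_head)
print("fused    :", max(p_fused, key=p_fused.get), p_fused)
```

In this made-up example the head cue alone slightly favors one target, while adding the speaker cue shifts the decision to the person who is talking, which is the kind of complementary effect the abstract reports.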

    A Bayesian hierarchy for robust gaze estimation in human–robot interaction

    In this text, we present a probabilistic solution for robust gaze estimation in the context of human–robot interaction. Gaze estimation, in the sense of continuously assessing the gaze direction of an interlocutor so as to determine his/her focus of visual attention, is important in several computer vision applications, such as the development of non-intrusive gaze-tracking equipment for psychophysical experiments in neuroscience, specialised telecommunication devices, video surveillance, human–computer interfaces (HCI) and artificial cognitive systems for human–robot interaction (HRI), our application of interest. We have developed a robust solution based on a probabilistic approach that inherently deals with the uncertainty of sensor models and, in particular, with uncertainty arising from distance, incomplete data and scene dynamics. This solution comprises a hierarchical formulation in the form of a mixture model that loosely follows how geometrical cues provided by facial features are believed to be used by the human perceptual system for gaze estimation. A quantitative analysis of the proposed framework's performance was undertaken through a thorough set of experimental sessions. Results show that the framework meets the demanding requirements of HRI applications, namely by exhibiting correctness, robustness and adaptiveness.

    EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis

    Data clustering has received a lot of attention, and numerous methods, algorithms and software packages are available. Among these techniques, parametric finite-mixture models play a central role due to their interesting mathematical properties and to the existence of maximum-likelihood estimators based on expectation-maximization (EM). In this paper we propose a new mixture model that associates a weight with each observed point. We introduce the weighted-data Gaussian mixture and we derive two EM algorithms. The first one considers a fixed weight for each observation. The second one treats each weight as a random variable following a gamma distribution. We propose a model selection method based on a minimum message length criterion, provide a weight initialization strategy, and validate the proposed algorithms by comparing them with several state-of-the-art parametric and non-parametric clustering techniques. We also demonstrate the effectiveness and robustness of the proposed clustering technique in the presence of heterogeneous data, namely audio-visual scene analysis. Comment: 14 pages, 4 figures, 4 tables
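As a rough illustration of the fixed-weight case mentioned in the abstract above, the sketch below runs EM on a one-dimensional mixture in which each observation x_i carries a known weight w_i. It assumes a particular weighting, namely that the weight scales the precision, so each component density is N(x_i; mu_k, var_k / w_i); the update equations follow from maximizing the expected log-likelihood under that assumption. The function names, the variance floor, and the toy data are hypothetical, and the gamma-distributed weights and model selection from the paper are not covered.

```python
import math

def weighted_gauss(x, mean, var, w):
    """N(x; mean, var / w): a larger weight w shrinks the variance."""
    v = var / w
    return math.exp(-0.5 * (x - mean) ** 2 / v) / math.sqrt(2.0 * math.pi * v)

def em_weighted_gmm(xs, ws, means, variances, pis, n_iter=50):
    """EM for a 1-D Gaussian mixture with a fixed weight per observation."""
    n, k = len(xs), len(means)
    means, variances, pis = list(means), list(variances), list(pis)
    for _ in range(n_iter):
        # E-step: responsibilities under the weight-scaled component densities.
        resp = []
        for x, w in zip(xs, ws):
            row = [pis[j] * weighted_gauss(x, means[j], variances[j], w)
                   for j in range(k)]
            s = sum(row)
            resp.append([r / s for r in row])
        # M-step: closed-form updates; weights enter the mean and variance.
        for j in range(k):
            nj = sum(resp[i][j] for i in range(n))
            wj = sum(ws[i] * resp[i][j] for i in range(n))
            means[j] = sum(ws[i] * resp[i][j] * xs[i] for i in range(n)) / wj
            sq = sum(ws[i] * resp[i][j] * (xs[i] - means[j]) ** 2
                     for i in range(n))
            variances[j] = max(sq / nj, 1e-6)  # floor avoids a collapsed component
            pis[j] = nj / n
    return means, variances, pis

# Toy data: two tight clusters plus one unreliable point with a small weight.
xs = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1, 2.5]
ws = [1.0] * 8 + [0.05]

means, variances, pis = em_weighted_gmm(
    xs, ws, means=[1.0, 4.0], variances=[1.0, 1.0], pis=[0.5, 0.5])
print("means:", means)
print("variances:", variances)
print("mixing proportions:", pis)
```

Because the point at 2.5 has a small weight, it contributes little to either component's mean, so the estimated means stay close to the two cluster centers.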

    Mass Displacement Networks

    Despite the large improvements in performance attained by using deep learning in computer vision, one can often further improve results with some additional post-processing that exploits the geometric nature of the underlying task. This commonly involves displacing the posterior distribution of a CNN in a way that makes it more appropriate for the task at hand, e.g. better aligned with local image features, or more compact. In this work we integrate this geometric post-processing within a deep architecture, introducing a differentiable and probabilistically sound counterpart to the common geometric voting technique used for evidence accumulation in vision. We refer to the resulting neural models as Mass Displacement Networks (MDNs), and apply them to human pose estimation in two distinct setups: (a) landmark localization, where we collapse a distribution to a point, allowing for precise localization of body keypoints, and (b) communication across body parts, where we transfer evidence from one part to the other, allowing for a globally consistent pose estimate. We evaluate on large-scale pose estimation benchmarks, such as the MPII Human Pose and COCO datasets, and report systematic improvements when compared to strong baselines. Comment: 12 pages, 4 figures