155,377 research outputs found

    Object and feature based modelling of attention in meeting and surveillance videos

    Get PDF
    MPhilThe aim of the thesis is to create and validate models of visual attention. To this extent, a novel unsupervised object detection and tracking framework has been developed by the author. It is demonstrated on people, faces and moving objects and the output is integrated in modelling of visual attention. The proposed approach integrates several types of modules in initialisation, target estimation and validation. Tracking is rst used to introduce high-level features, by extending a popular model based on low-level features[1]. Two automatic models of visual attention are further implemented. One based on winner take it all and inhibition of return as the mech- anisms of selection on a saliency model with high- and low-level features combined. Another which is based only on high-level object tracking results and statistic proper- ties from the collected eye-traces, with the possibility of activating inhibition of return as an additional mechanism. The parameters of the tracking framework thoroughly investigated and its success demonstrated. Eye-tracking experiments show that high- level features are much better at explaining the allocation of attention by the subjects in the study. Low-level features alone do correlate signi cantly with real allocation of attention. However, in fact it lowers the correlation score when combined with high-level features in comparison to using high-level features alone. Further, ndings in collected eye-traces are studied with qualitative method, mainly to discover direc- tions in future research in the area. Similarities and dissimilarities between automatic models of attention and collected eye-traces are discusse

    Tracking Gaze and Visual Focus of Attention of People Involved in Social Interaction

    Get PDF
    The visual focus of attention (VFOA) has been recognized as a prominent conversational cue. We are interested in estimating and tracking the VFOAs associated with multi-party social interactions. We note that in this type of situations the participants either look at each other or at an object of interest; therefore their eyes are not always visible. Consequently both gaze and VFOA estimation cannot be based on eye detection and tracking. We propose a method that exploits the correlation between eye gaze and head movements. Both VFOA and gaze are modeled as latent variables in a Bayesian switching state-space model. The proposed formulation leads to a tractable learning procedure and to an efficient algorithm that simultaneously tracks gaze and visual focus. The method is tested and benchmarked using two publicly available datasets that contain typical multi-party human-robot and human-human interactions.Comment: 15 pages, 8 figures, 6 table

    Deep Attention Models for Human Tracking Using RGBD

    Get PDF
    Visual tracking performance has long been limited by the lack of better appearance models. These models fail either where they tend to change rapidly, like in motion-based tracking, or where accurate information of the object may not be available, like in color camouflage (where background and foreground colors are similar). This paper proposes a robust, adaptive appearance model which works accurately in situations of color camouflage, even in the presence of complex natural objects. The proposed model includes depth as an additional feature in a hierarchical modular neural framework for online object tracking. The model adapts to the confusing appearance by identifying the stable property of depth between the target and the surrounding object(s). The depth complements the existing RGB features in scenarios when RGB features fail to adapt, hence becoming unstable over a long duration of time. The parameters of the model are learned efficiently in the Deep network, which consists of three modules: (1) The spatial attention layer, which discards the majority of the background by selecting a region containing the object of interest; (2) the appearance attention layer, which extracts appearance and spatial information about the tracked object; and (3) the state estimation layer, which enables the framework to predict future object appearance and location. Three different models were trained and tested to analyze the effect of depth along with RGB information. Also, a model is proposed to utilize only depth as a standalone input for tracking purposes. The proposed models were also evaluated in real-time using KinectV2 and showed very promising results. The results of our proposed network structures and their comparison with the state-of-the-art RGB tracking model demonstrate that adding depth significantly improves the accuracy of tracking in a more challenging environment (i.e., cluttered and camouflaged environments). Furthermore, the results of depth-based models showed that depth data can provide enough information for accurate tracking, even without RGB information

    3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D Point Clouds

    Full text link
    We propose a method for joint detection and tracking of multiple objects in 3D point clouds, a task conventionally treated as a two-step process comprising object detection followed by data association. Our method embeds both steps into a single end-to-end trainable network eliminating the dependency on external object detectors. Our model exploits temporal information employing multiple frames to detect objects and track them in a single network, thereby making it a utilitarian formulation for real-world scenarios. Computing affinity matrix by employing features similarity across consecutive point cloud scans forms an integral part of visual tracking. We propose an attention-based refinement module to refine the affinity matrix by suppressing erroneous correspondences. The module is designed to capture the global context in affinity matrix by employing self-attention within each affinity matrix and cross-attention across a pair of affinity matrices. Unlike competing approaches, our network does not require complex post-processing algorithms, and processes raw LiDAR frames to directly output tracking results. We demonstrate the effectiveness of our method on the three tracking benchmarks: JRDB, Waymo, and KITTI. Experimental evaluations indicate the ability of our model to generalize well across datasets

    What were we all looking at? Identifying objects of collective visual attention

    Get PDF
    The file attached to this record is the authors final peer reviewed version. The publisher's final version can be found by following the DOI link below.We aim to identify the salient objects in an image by applying a model of visual attention. We automate the process by predicting those objects in an image that are most likely to be the focus of someone’s visual attention. Concretely, we first generate fixation maps from the eye tracking data, which express the ground truth of people’s visual attention for each training image. Then, we extract the high-level features based on the bag-of-visual-words image representation as input attributes along with the fixation maps to train a support vector regression model. With this model, we can predict a new query image’s saliency. Our experiments show that the model is capable of providing a good estimate for human visual attention in test images sets with one salient object and multiple salient objects. In this way, we seek to reduce the redundant information within the scene, and thus provide a more accurate depiction of the scene

    Visual Dialogue State Tracking for Question Generation

    Full text link
    GuessWhat?! is a visual dialogue task between a guesser and an oracle. The guesser aims to locate an object supposed by the oracle oneself in an image by asking a sequence of Yes/No questions. Asking proper questions with the progress of dialogue is vital for achieving successful final guess. As a result, the progress of dialogue should be properly represented and tracked. Previous models for question generation pay less attention on the representation and tracking of dialogue states, and therefore are prone to asking low quality questions such as repeated questions. This paper proposes visual dialogue state tracking (VDST) based method for question generation. A visual dialogue state is defined as the distribution on objects in the image as well as representations of objects. Representations of objects are updated with the change of the distribution on objects. An object-difference based attention is used to decode new question. The distribution on objects is updated by comparing the question-answer pair and objects. Experimental results on GuessWhat?! dataset show that our model significantly outperforms existing methods and achieves new state-of-the-art performance. It is also noticeable that our model reduces the rate of repeated questions from more than 50% to 21.9% compared with previous state-of-the-art methods.Comment: 8 pages, 4 figures, Accept-Oral by AAAI-202

    Gaze Guidance, Task-Based Eye Movement Prediction, and Real-World Task Inference using Eye Tracking

    Get PDF
    The ability to predict and guide viewer attention has important applications in computer graphics, image understanding, object detection, visual search and training. Human eye movements provide insight into the cognitive processes involved in task performance and there has been extensive research on what factors guide viewer attention in a scene. It has been shown, for example, that saliency in the image, scene context, and task at hand play significant roles in guiding attention. This dissertation presents and discusses research on visual attention with specific focus on the use of subtle visual cues to guide viewer gaze and the development of algorithms to predict the distribution of gaze about a scene. Specific contributions of this work include: a framework for gaze guidance to enable problem solving and spatial learning, a novel algorithm for task-based eye movement prediction, and a system for real-world task inference using eye tracking. A gaze guidance approach is presented that combines eye tracking with subtle image-space modulations to guide viewer gaze about a scene. Several experiments were conducted using this approach to examine its impact on short-term spatial information recall, task sequencing, training, and password recollection. A model of human visual attention prediction that uses saliency maps, scene feature maps and task-based eye movements to predict regions of interest was also developed. This model was used to automatically select target regions for active gaze guidance to improve search task performance. Finally, we develop a framework for inferring real-world tasks using image features and eye movement data. Overall, this dissertation naturally leads to an overarching framework, that combines all three contributions to provide a continuous feedback system to improve performance on repeated visual search tasks. This research has important applications in data visualization, problem solving, training, and online education
    • …
    corecore