Object and feature based modelling of attention in meeting and surveillance videos
MPhil. The aim of the thesis is to create and validate models of visual attention. To
this end, a novel unsupervised object detection and tracking framework has been
developed by the author. It is demonstrated on people, faces and moving objects,
and its output is integrated into the modelling of visual attention. The proposed approach
integrates several types of modules for initialisation, target estimation and validation.
Tracking is first used to introduce high-level features, by extending a popular model
based on low-level features[1]. Two automatic models of visual attention are further
implemented. The first uses winner-take-all and inhibition of return as the mechanisms
of selection on a saliency model that combines high- and low-level features.
The second is based only on high-level object tracking results and statistical properties
of the collected eye-traces, with the possibility of activating inhibition of return
as an additional mechanism. The parameters of the tracking framework are thoroughly
investigated and its success is demonstrated. Eye-tracking experiments show that high-level
features are much better at explaining the allocation of attention by the subjects
in the study. Low-level features alone do correlate significantly with the real allocation
of attention. However, combining them with high-level features lowers the correlation
score compared with using high-level features alone. Further, findings
in the collected eye-traces are studied with qualitative methods, mainly to discover
directions for future research in the area. Similarities and dissimilarities between the
automatic models of attention and the collected eye-traces are discussed.
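The selection mechanism named in the abstract above, winner-take-all followed by inhibition of return, can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `scan_path`, the square suppression window and the `radius` parameter are assumptions made for this example, and the saliency map is a plain list of lists.

```python
def scan_path(saliency, n_fixations, radius=1):
    """Return a list of (row, col) fixations chosen from a saliency map.

    Hypothetical helper: each step picks the most salient cell
    (winner-take-all) and then suppresses a square neighbourhood around it
    (inhibition of return) so the next step moves elsewhere.
    """
    # Work on a copy so the caller's map is left untouched.
    s = [row[:] for row in saliency]
    height, width = len(s), len(s[0])
    path = []
    for _ in range(n_fixations):
        # Winner-take-all: the location with the highest saliency wins.
        r, c = max(
            ((i, j) for i in range(height) for j in range(width)),
            key=lambda p: s[p[0]][p[1]],
        )
        path.append((r, c))
        # Inhibition of return: rule out the winner's neighbourhood.
        for i in range(max(0, r - radius), min(height, r + radius + 1)):
            for j in range(max(0, c - radius), min(width, c + radius + 1)):
                s[i][j] = float("-inf")
    return path


example_map = [
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.3, 0.1, 0.8],
    [0.5, 0.0, 0.0, 0.1],
]
fixations = scan_path(example_map, n_fixations=3, radius=1)
```

Without the suppression step the same peak would win every iteration, which is why the two mechanisms are usually paired.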
Tracking Gaze and Visual Focus of Attention of People Involved in Social Interaction
The visual focus of attention (VFOA) has been recognized as a prominent
conversational cue. We are interested in estimating and tracking the VFOAs
associated with multi-party social interactions. We note that in this type of
situation the participants either look at each other or at an object of
interest; therefore their eyes are not always visible. Consequently, neither gaze
nor VFOA estimation can be based on eye detection and tracking. We propose a
method that exploits the correlation between eye gaze and head movements. Both
VFOA and gaze are modeled as latent variables in a Bayesian switching
state-space model. The proposed formulation leads to a tractable learning
procedure and to an efficient algorithm that simultaneously tracks gaze and
visual focus. The method is tested and benchmarked using two publicly available
datasets that contain typical multi-party human-robot and human-human
interactions. Comment: 15 pages, 8 figures, 6 tables
Deep Attention Models for Human Tracking Using RGBD
Visual tracking performance has long been limited by the lack of better appearance models. These models fail either where appearance changes rapidly, as in motion-based tracking, or where accurate information about the object may not be available, as in color camouflage (where background and foreground colors are similar). This paper proposes a robust, adaptive appearance model which works accurately in situations of color camouflage, even in the presence of complex natural objects. The proposed model includes depth as an additional feature in a hierarchical modular neural framework for online object tracking. The model adapts to the confusing appearance by identifying the stable property of depth between the target and the surrounding object(s). Depth complements the existing RGB features in scenarios where the RGB features fail to adapt and hence become unstable over long durations. The parameters of the model are learned efficiently in the deep network, which consists of three modules: (1) the spatial attention layer, which discards the majority of the background by selecting a region containing the object of interest; (2) the appearance attention layer, which extracts appearance and spatial information about the tracked object; and (3) the state estimation layer, which enables the framework to predict future object appearance and location. Three different models were trained and tested to analyze the effect of depth along with RGB information. In addition, a model is proposed that uses only depth as a standalone input for tracking. The proposed models were also evaluated in real time using KinectV2 and showed very promising results. The results of our proposed network structures and their comparison with the state-of-the-art RGB tracking model demonstrate that adding depth significantly improves the accuracy of tracking in more challenging environments (i.e., cluttered and camouflaged environments).
Furthermore, the results of depth-based models showed that depth data can provide enough information for accurate tracking, even without RGB information.
3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D Point Clouds
We propose a method for joint detection and tracking of multiple objects in
3D point clouds, a task conventionally treated as a two-step process comprising
object detection followed by data association. Our method embeds both steps
into a single end-to-end trainable network eliminating the dependency on
external object detectors. Our model exploits temporal information employing
multiple frames to detect objects and track them in a single network, thereby
making it a utilitarian formulation for real-world scenarios. Computing an
affinity matrix from feature similarity across consecutive point cloud
scans forms an integral part of visual tracking. We propose an attention-based
refinement module to refine the affinity matrix by suppressing erroneous
correspondences. The module is designed to capture the global context in the
affinity matrix by employing self-attention within each affinity matrix and
cross-attention across a pair of affinity matrices. Unlike competing
approaches, our network does not require complex post-processing algorithms,
and processes raw LiDAR frames to directly output tracking results. We
demonstrate the effectiveness of our method on three tracking benchmarks:
JRDB, Waymo, and KITTI. Experimental evaluations indicate the ability of our
model to generalize well across datasets.
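To make the notion of an affinity matrix concrete, the sketch below scores every object in one scan against every object in the next using cosine similarity of their feature vectors, then sharpens each row with a softmax. This is only an illustration under stated assumptions: the function names, the choice of cosine similarity and the `temperature` value are hypothetical, and the learned self-attention and cross-attention refinement described in the abstract is not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def affinity_matrix(prev_feats, curr_feats):
    """Entry [i][j] scores previous object i against current object j."""
    return [[cosine(p, c) for c in curr_feats] for p in prev_feats]


def row_softmax(matrix, temperature=0.1):
    """Turn each row into a distribution; low temperature sharpens it."""
    result = []
    for row in matrix:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp((v - peak) / temperature) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


# Two objects whose order is swapped between consecutive scans.
prev_feats = [[1.0, 0.0], [0.0, 1.0]]
curr_feats = [[0.0, 1.0], [1.0, 0.0]]
affinity = affinity_matrix(prev_feats, curr_feats)
probs = row_softmax(affinity)
# Associate each previous object with its most likely current object.
matches = [max(range(len(row)), key=row.__getitem__) for row in probs]
```

A per-row argmax like this can assign two previous objects to the same current object; suppressing such erroneous correspondences is the kind of problem a refinement step is meant to address.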
What were we all looking at? Identifying objects of collective visual attention
The file attached to this record is the authors' final peer-reviewed version. The publisher's final version can be found by following the DOI link below. We aim to identify the salient objects in an image by applying a model of visual attention. We automate the process by predicting those objects in an image that are most likely to be the focus of someone’s visual attention. Concretely, we first generate fixation maps from the eye tracking data, which express the ground truth of people’s visual attention for each training image. Then, we extract high-level features based on the bag-of-visual-words image representation and use them as input attributes, along with the fixation maps, to train a support vector regression model. With this model, we can predict the saliency of a new query image. Our experiments show that the model is capable of providing a good estimate of human visual attention in test image sets with one salient object and with multiple salient objects. In this way, we seek to reduce the redundant information within the scene, and thus provide a more accurate depiction of the scene.
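One common way to turn eye tracking data into a fixation map is to place a Gaussian at each fixation and sum the results. The sketch below does that on a small grid. It is an assumption about the general technique, not the authors' procedure: the function name `fixation_map`, the isotropic Gaussian and the `sigma` parameter are chosen for this example only.

```python
import math


def fixation_map(fixations, height, width, sigma=1.0):
    """Sum one Gaussian per (row, col) fixation and scale the peak to 1."""
    grid = [[0.0] * width for _ in range(height)]
    for fix_r, fix_c in fixations:
        for r in range(height):
            for c in range(width):
                dist_sq = (r - fix_r) ** 2 + (c - fix_c) ** 2
                grid[r][c] += math.exp(-dist_sq / (2.0 * sigma ** 2))
    peak = max(max(row) for row in grid)
    if peak > 0:
        # Normalise so maps from different images are comparable.
        grid = [[v / peak for v in row] for row in grid]
    return grid


single = fixation_map([(1, 1)], height=3, width=3)
```

A map like this could then serve as the regression target for each training image, with the image features as inputs.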
Visual Dialogue State Tracking for Question Generation
GuessWhat?! is a visual dialogue task between a guesser and an oracle. The
guesser aims to locate an object in an image that the oracle has chosen, by
asking a sequence of Yes/No questions. Asking proper questions as the
dialogue progresses is vital for achieving a successful final guess. As a
result, the progress of the dialogue should be properly represented and tracked.
Previous models for question generation pay little attention to the
representation and tracking of dialogue states, and are therefore prone to
asking low-quality questions such as repeated questions. This paper proposes a
visual dialogue state tracking (VDST) based method for question generation. A
visual dialogue state is defined as the distribution on objects in the image as
well as representations of objects. Representations of objects are updated with
the change of the distribution on objects. An object-difference based attention
is used to decode a new question. The distribution on objects is updated by
comparing the question-answer pair and objects. Experimental results on the
GuessWhat?! dataset show that our model significantly outperforms existing
methods and achieves new state-of-the-art performance. It is also notable
that our model reduces the rate of repeated questions from more than 50% to
21.9% compared with previous state-of-the-art methods. Comment: 8 pages, 4 figures, Accept-Oral by AAAI-202
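The idea of tracking a distribution on objects and updating it after each question-answer pair can be sketched with a simple multiplicative update. This is a hand-written rule for illustration, not the learned update used by VDST: the function name `update_belief`, the boolean `consistent` flags and the `strength` parameter are all assumptions.

```python
def update_belief(belief, consistent, strength=0.9):
    """Reweight a distribution on objects after one question-answer pair.

    belief     -- current probabilities, one per candidate object
    consistent -- True where the object agrees with the answer (hypothetical
                  input; a real system would infer this from the dialogue)
    strength   -- how strongly an answer is trusted, between 0 and 1
    """
    weights = [strength if ok else 1.0 - strength for ok in consistent]
    unnormalised = [b * w for b, w in zip(belief, weights)]
    total = sum(unnormalised)
    if total == 0:
        # Nothing survives the update; keep the previous belief.
        return list(belief)
    return [u / total for u in unnormalised]


# Four candidate objects, uniform prior; the answer fits objects 0 and 3.
belief = update_belief([0.25, 0.25, 0.25, 0.25], [True, False, False, True])
```

Keeping such a state explicit is what lets a question generator avoid asking again about objects that earlier answers have already ruled out.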
Gaze Guidance, Task-Based Eye Movement Prediction, and Real-World Task Inference using Eye Tracking
The ability to predict and guide viewer attention has important applications in computer graphics, image understanding, object detection, visual search and training. Human eye movements provide insight into the cognitive processes involved in task performance and there has been extensive research on what factors guide viewer attention in a scene. It has been shown, for example, that saliency in the image, scene context, and task at hand play significant roles in guiding attention.
This dissertation presents and discusses research on visual attention with specific focus on the use of subtle visual cues to guide viewer gaze and the development of algorithms to predict the distribution of gaze about a scene. Specific contributions of this work include: a framework for gaze guidance to enable problem solving and spatial learning, a novel algorithm for task-based eye movement prediction, and a system for real-world task inference using eye tracking.
A gaze guidance approach is presented that combines eye tracking with subtle image-space modulations to guide viewer gaze about a scene. Several experiments were conducted using this approach to examine its impact on short-term spatial information recall, task sequencing, training, and password recollection. A model of human visual attention prediction that uses saliency maps, scene feature maps and task-based eye movements to predict regions of interest was also developed. This model was used to automatically select target regions for active gaze guidance to improve search task performance. Finally, we develop a framework for inferring real-world tasks using image features and eye movement data.
Overall, this dissertation leads to an overarching framework that combines all three contributions to provide a continuous feedback system for improving performance on repeated visual search tasks. This research has important applications in data visualization, problem solving, training, and online education.