Goal Directed Visual Search Based on Color Cues: Co-operative Effects of Top-Down & Bottom-Up Visual Attention
Focus of attention plays an important role in perception of the visual environment. Certain objects stand out in the scene irrespective of observers' goals. This form of attention capture, in which stimulus feature saliency captures our attention, is of a bottom-up nature. Often, prior knowledge about objects and scenes can influence our attention. This form of attention capture, which is influenced by higher-level knowledge about the objects, is called top-down attention. Top-down attention acts as a feedback mechanism for the feed-forward bottom-up attention. Visual search is the result of a combined effort of the top-down (cognitive cue) system and the bottom-up (low-level feature saliency) system. In my thesis I investigate the process of goal-directed visual search based on color cues, i.e., searching for objects of a certain color. The computational model generates saliency maps that predict the locations of interest during a visual search. A comparison between the model-generated saliency maps and the results of psychophysical human eye-tracking experiments was conducted. The analysis provides a measure of how well the human eye movements correspond with the predicted locations of the saliency maps. Eye-tracking equipment in the Visual Perceptual Laboratory in the Center for Imaging Science was used to conduct the experiments.
Predicting visual fixations on video based on low-level visual features
To what extent can a computational model of bottom-up visual attention predict what an observer is looking at? What is the contribution of low-level visual features to attention deployment? To answer these questions, a new spatio-temporal computational model is proposed. This model incorporates several visual features; therefore, a fusion algorithm is required to combine the different saliency maps (achromatic, chromatic, and temporal). To quantitatively assess the model's performance, eye movements were recorded while naive observers viewed natural dynamic scenes. Four complementary metrics have been used. In addition, predictions from the proposed model are compared to the predictions from a state-of-the-art model [Itti's model (Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259)] and from three non-biologically-plausible models (uniform, flicker, and centered models). Regardless of the metric used, the proposed model shows significant improvement over the selected benchmarking models (except the centered model). Conclusions are drawn regarding both the influence of low-level visual features over time and the central bias in an eye-tracking experiment.
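The fusion step described above, which combines achromatic, chromatic, and temporal saliency maps into a single master map, can be illustrated with a minimal sketch. The snippet below simply normalizes each feature map and takes a weighted sum; the equal weights, function names, and the normalize-and-sum rule are assumptions made for the example, not the paper's actual fusion algorithm.

import numpy as np

def normalize_map(s):
    # Rescale a saliency map to [0, 1]; a constant map becomes all zeros.
    s = s.astype(np.float64)
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

def fuse_saliency_maps(achromatic, chromatic, temporal, weights=(1.0, 1.0, 1.0)):
    # Generic weighted-sum fusion of per-feature saliency maps into a master map.
    maps = [normalize_map(m) for m in (achromatic, chromatic, temporal)]
    fused = sum(w * m for w, m in zip(weights, maps))
    return normalize_map(fused)

# Toy usage with random 60x80 maps standing in for real feature maps.
h, w = 60, 80
master = fuse_saliency_maps(np.random.rand(h, w),
                            np.random.rand(h, w),
                            np.random.rand(h, w))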
Predicting Visual Attention and Distraction During Visual Search Using Convolutional Neural Networks
Most studies in computational modeling of visual attention address task-free
observation of images. Free-viewing saliency, however, covers only a limited
range of daily-life scenarios: most visual activities are goal-oriented and
demand a great amount of top-down attention control. Visual search in
particular demands more top-down control of attention than free viewing. In this paper, we
present two approaches to model visual attention and distraction of observers
during visual search. Our first approach adapts a light-weight free-viewing
saliency model to predict eye fixation density maps of human observers over
pixels of search images, using a two-stream convolutional encoder-decoder
network, trained and evaluated on the COCO-Search18 dataset. This method predicts
which locations are more distracting when searching for a particular target.
Our network achieves good results on standard saliency metrics (AUC-Judd=0.95,
AUC-Borji=0.85, sAUC=0.84, NSS=4.64, KLD=0.93, CC=0.72, SIM=0.54, and IG=2.59).
Our second approach is object-based and predicts the distractor and target
objects during visual search. Distractors are all objects except the target
that observers fixate on during search. This method uses a Mask-RCNN
segmentation network pre-trained on MS-COCO and fine-tuned on the COCO-Search18
dataset. We release our segmentation annotations of targets and distractors in
COCO-Search18 for three target categories: bottle, bowl, and car. The average
scores over the three categories are: F1-score=0.64, mAP(IoU=0.5)=0.57, and
mAR(IoU=0.5)=0.73. Our implementation code in TensorFlow is publicly available
at https://github.com/ManooshSamiei/Distraction-Visual-Search.
Comment: 33 pages, 24 figures, 12 tables; this is a pre-print manuscript
currently under review in the Journal of Vision.
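For reference, two of the saliency metrics reported above, NSS and CC, have simple standard definitions that can be computed directly from a predicted map and recorded fixations. The sketch below follows those standard definitions; it is not the authors' evaluation code, and the array layouts and the small epsilon are assumptions for the example.

import numpy as np

def nss(saliency_map, fixation_map):
    # Normalized Scanpath Saliency: mean of the z-scored predicted map
    # at the pixels where human fixations landed (fixation_map is binary).
    s = saliency_map.astype(np.float64)
    s = (s - s.mean()) / (s.std() + 1e-12)
    return float(s[fixation_map.astype(bool)].mean())

def cc(saliency_map, fixation_density):
    # Pearson linear correlation between the predicted map and a
    # continuous fixation density map.
    a = saliency_map.ravel().astype(np.float64)
    b = fixation_density.ravel().astype(np.float64)
    return float(np.corrcoef(a, b)[0, 1])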
Pseudo-Saliency for Human Gaze Simulation
Understanding and modeling human vision is an endeavor which can be, and has been, approached from multiple disciplines. Saliency prediction is a subdomain of computer vision which tries to predict the human eye movements made during either guided or free viewing of static images. In the context of simulation and animation, vision is often also modeled for the purposes of realistic and reactive autonomous agents. Such models usually focus on plausible gaze movements of the eyes and head, and are less concerned with scene understanding through visual stimuli. Bringing techniques and knowledge over from computer vision into simulated virtual humans requires a methodology for generating saliency maps. Traditional saliency models are ill-suited for this because of their large computational costs and the lack of control inherent in most deep-network-based models. The primary contribution of this thesis is a proposed model for generating pseudo-saliency maps for virtual characters, Parametric Saliency Maps (PSM). This parametric model calculates saliency as a weighted combination of seven factors selected from the saliency and attention literature. Experiments show that the model is expressive enough to mimic results from state-of-the-art saliency models to a high degree of similarity, while being extraordinarily cheap to compute by virtue of running in the graphics processing pipeline of a simulation. As a secondary contribution, two models are proposed for saliency-driven gaze control. These models are expressive and present novel approaches for controlling the gaze of a virtual character using only visual saliency maps as input.
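Because the PSM model is described as a weighted combination of seven factors, its core computation can be sketched very compactly. The factor names below are placeholders (the thesis's actual seven factors may differ), and the dictionary inputs and final normalization step are assumptions made for the example.

import numpy as np

# Placeholder factor names; the seven factors used in the thesis may differ.
FACTORS = ["luminance_contrast", "color_contrast", "motion", "depth",
           "eccentricity", "object_importance", "task_relevance"]

def parametric_saliency(factor_maps, weights):
    # Pseudo-saliency as a weighted sum of per-factor maps.
    # factor_maps: dict of factor name -> HxW array in [0, 1]
    # weights:     dict of factor name -> scalar weight
    h, w = next(iter(factor_maps.values())).shape
    saliency = np.zeros((h, w))
    for name in FACTORS:
        saliency += weights.get(name, 0.0) * factor_maps[name]
    if saliency.max() > 0:          # renormalize so the map stays in [0, 1]
        saliency /= saliency.max()
    return saliency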
VST++: Efficient and Stronger Visual Saliency Transformer
While previous CNN-based models have exhibited promising results for salient
object detection (SOD), their ability to explore global long-range dependencies
is restricted. Our previous work, the Visual Saliency Transformer (VST),
addressed this constraint from a transformer-based sequence-to-sequence
perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task
transformer decoder that concurrently predicts saliency and boundary outcomes
in a pure transformer architecture. Moreover, we introduced a novel token
upsampling method called reverse T2T for predicting a high-resolution saliency
map effortlessly within transformer-based structures. Building upon the VST
model, we further propose an efficient and stronger VST version in this work,
i.e., VST++. To mitigate the computational cost of the VST model, we propose a
Select-Integrate Attention (SIA) module, which partitions the foreground into
fine-grained segments and aggregates background information into a single
coarse-grained token. To incorporate 3D depth information with low cost, we
design a novel depth position encoding method tailored for depth maps.
Furthermore, we introduce a token-supervised prediction loss to provide
straightforward guidance for the task-related tokens. We evaluate our VST++
model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD
benchmark datasets. Experimental results show that our model outperforms
existing methods while achieving a 25% reduction in computational costs without
significant performance compromise. The demonstrated strong generalization
ability, enhanced performance, and heightened efficiency of our VST++ model
highlight its potential.
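The Select-Integrate Attention idea described above, keeping fine-grained foreground tokens while collapsing the background into a single coarse token, can be sketched conceptually as follows. This is only a toy single-head attention in NumPy under that reading of the abstract; it is not the paper's implementation, and the foreground mask and mean-pooling choices are assumptions.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_integrate_attention(queries, tokens, fg_mask):
    # queries: (Nq, D) query vectors; tokens: (N, D) patch tokens;
    # fg_mask: (N,) boolean foreground mask from a coarse saliency estimate.
    fg = tokens[fg_mask]                                # keep foreground tokens as-is
    bg = tokens[~fg_mask].mean(axis=0, keepdims=True)   # one coarse background token
    keys = np.concatenate([fg, bg], axis=0)
    attn = softmax(queries @ keys.T / np.sqrt(queries.shape[1]))
    return attn @ keys                                  # attended output per query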
Multiscale Discriminant Saliency for Visual Attention
Bottom-up saliency, an early stage of human visual attention, can be
considered a binary classification problem between center and surround
classes. The discriminant power of features for this classification is measured
as the mutual information between the features and the two class distributions.
Because the estimated discrepancy between the two feature classes depends
strongly on the scale level considered, multi-scale structure and discriminant
power are integrated by employing discrete wavelet features and a hidden Markov
tree (HMT). From the wavelet coefficients and HMT parameters, quad-tree-like
label structures are constructed and used for maximum a posteriori (MAP)
estimation of the hidden class variables at the corresponding dyadic
sub-squares. The saliency value of each dyadic square at each scale level is
then computed from the discriminant power principle and the MAP estimates.
Finally, the final saliency map is integrated across multiple scales by an
information-maximization rule. Both standard quantitative metrics, such as NSS,
LCC, and AUC, and qualitative assessments are used to evaluate the proposed
multiscale discriminant saliency method (MDIS) against the well-known
information-based saliency method AIM on the Bruce database with eye-tracking
data. Simulation results are presented and analyzed to verify the validity of
MDIS as well as to point out its disadvantages for further research directions.
Comment: 16 pages, ICCSA 2013 - BIOCA session
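The discriminant power used above, the mutual information between a feature and the center/surround class labels, can be estimated with a simple histogram-based computation. The sketch below does exactly that for a scalar feature; the bin count, the Gaussian toy data, and the function name are assumptions for illustration, not the paper's wavelet/HMT pipeline.

import numpy as np

def mutual_information(feature_values, class_labels, bins=16):
    # I(F; C) between a scalar feature and a binary center/surround label,
    # estimated from a joint histogram (result in bits).
    edges = np.histogram_bin_edges(feature_values, bins)
    f = np.digitize(feature_values, edges)
    joint = np.zeros((bins + 2, 2))
    for fi, ci in zip(f, class_labels):
        joint[fi, ci] += 1
    joint /= joint.sum()
    pf = joint.sum(axis=1, keepdims=True)
    pc = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pf @ pc)[nz])).sum())

# Toy example: feature responses in a center window vs. its surround.
center = np.random.normal(2.0, 1.0, 500)      # stands in for wavelet coefficients
surround = np.random.normal(0.0, 1.0, 2000)
features = np.concatenate([center, surround])
labels = np.concatenate([np.ones(500, int), np.zeros(2000, int)])
print(mutual_information(features, labels))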