Computational models of attention
This chapter reviews recent computational models of visual attention. We
begin with models for the bottom-up or stimulus-driven guidance of attention to
salient visual items, which we examine in seven different broad categories. We
then examine more complex models which address the top-down or goal-oriented
guidance of attention towards items that are more relevant to the task at hand.
Computational models: Bottom-up and top-down aspects
Computational models of visual attention have become popular over the past
decade, primarily, we believe, for two reasons: first, models make testable
predictions that can be explored by experimentalists as well as theoreticians;
second, models have practical and technological applications of interest to the
applied science and engineering communities. In this chapter, we take a
critical look at recent attention modeling efforts. We focus on {\em
computational models of attention} as defined by Tsotsos \& Rothenstein
\shortcite{Tsotsos_Rothenstein11}: Models which can process any visual stimulus
(typically, an image or video clip), which can possibly also be given some task
definition, and which make predictions that can be compared to human or animal
behavioral or physiological responses elicited by the same stimulus and task.
Thus, we here place less emphasis on abstract models, phenomenological models,
purely data-driven fitting or extrapolation models, or models specifically
designed for a single task or for a restricted class of stimuli. For
theoretical models, we refer the reader to a number of previous reviews that
address attention theories and models more generally
\cite{Itti_Koch01nrn,Paletta_etal05,Frintrop_etal10,Rothenstein_Tsotsos08,Gottlieb_Balan10,Toet11,Borji_Itti12pami}.
Fully automatic extraction of salient objects from videos in near real-time
Automatic video segmentation plays an important role in a wide range of
computer vision and image processing applications. Recently, various methods
have been proposed for this purpose. The problem is that most of these methods
are far from real-time processing, even for low-resolution videos, because of
their complex procedures. To address this, we propose a new and quite fast
method for automatic video segmentation based on 1) efficient optimization of
Markov random fields, in time polynomial in the number of pixels, via graph
cuts, 2) automatic, computationally efficient, yet stable derivation of
segmentation priors using visual saliency and a sequential update mechanism,
and 3) an implementation strategy based on stream processing with graphics
processing units (GPUs). Test results indicate that our method extracts
appropriate regions from videos as precisely as, and much faster than, previous
semi-automatic methods, even though no supervision is incorporated.
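To make the pipeline concrete, here is a minimal sketch of the saliency-prior-plus-graph-cut idea, assuming OpenCV with the contrib modules (for cv2.saliency) and using GrabCut as a stand-in MRF/min-cut solver; the thresholds are illustrative assumptions, and this is not the authors' GPU implementation:

```python
# Sketch (not the authors' method): derive a segmentation prior from a
# saliency map, then refine it with a graph-cut segmenter. Assumes
# opencv-contrib-python for cv2.saliency; thresholds are arbitrary choices.
import cv2
import numpy as np

def segment_salient_object(frame_bgr):
    # 1) Saliency map as an automatic, cheap segmentation prior.
    detector = cv2.saliency.StaticSaliencySpectralResidual_create()
    _, sal = detector.computeSaliency(frame_bgr)
    sal = (sal * 255).astype(np.uint8)

    # 2) Turn the prior into a GrabCut trimap: confident labels at the
    #    saliency extremes, "probable" labels everywhere else.
    mask = np.full(sal.shape, cv2.GC_PR_BGD, np.uint8)
    mask[sal > 160] = cv2.GC_PR_FGD
    mask[sal > 220] = cv2.GC_FGD
    mask[sal < 30] = cv2.GC_BGD

    # 3) Graph-cut refinement: GrabCut solves an MRF energy with min-cut.
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(frame_bgr, mask, None, bgd, fgd, 3, cv2.GC_INIT_WITH_MASK)
    return ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)
```

A sequential variant in the spirit of the abstract would warm-start each frame's mask from the previous frame's result instead of rebuilding it from the saliency map alone.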
Unsupervised Video Analysis Based on a Spatiotemporal Saliency Detector
Visual saliency, which predicts regions in the field of view that draw the
most visual attention, has attracted a lot of interest from researchers. It has
already been used in several vision tasks, e.g., image classification, object
detection, and foreground segmentation. Recently, spectrum-analysis-based
visual saliency approaches, in which the phase information of the image is
used to construct the saliency map, have attracted a lot of interest due to
their simplicity and good performance. In this paper, we propose a new
approach for
detecting spatiotemporal visual saliency based on the phase spectrum of the
videos, which is easy to implement and computationally efficient. With the
proposed algorithm, we also study how spatiotemporal saliency can be used in
two important vision tasks: abnormality detection and spatiotemporal interest
point detection. The proposed algorithm is evaluated on several commonly used
datasets in comparison with state-of-the-art methods from the literature. The
experiments demonstrate the effectiveness of the proposed approach to
spatiotemporal visual saliency detection and its application to the above
vision tasks.
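The single-image core of the phase-spectrum idea is compact enough to sketch. The snippet below is a generic PFT-style illustration in NumPy/SciPy with an assumed smoothing width, not the paper's spatiotemporal algorithm:

```python
# Phase-only reconstruction saliency (PFT-style sketch): keep the phase of
# the Fourier transform, discard the amplitude, and reconstruct. Repetitive
# structure cancels out; what survives is the "surprising" part of the image.
import numpy as np
from scipy.ndimage import gaussian_filter

def phase_spectrum_saliency(gray, sigma=3.0):
    f = np.fft.fft2(gray.astype(np.float64))
    phase_only = np.exp(1j * np.angle(f))           # unit amplitude, same phase
    energy = np.abs(np.fft.ifft2(phase_only)) ** 2  # squared reconstruction
    sal = gaussian_filter(energy, sigma)            # smooth into a saliency map
    return sal / sal.max()
```

A common spatiotemporal extension of this idea (as in PQFT-style methods) packs intensity, color, and frame-difference motion channels into a quaternion image and takes the phase of its quaternion Fourier transform.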
Modeling Bottom-Up and Top-Down Attention with a Neurodynamic Model of V1
Previous studies suggested that lateral interactions of V1 cells are
responsible, among other visual effects, for bottom-up visual attention
(alternatively named visual salience or saliency). Our objective is to mimic
these connections with a neurodynamic network of firing-rate neurons in order
to predict visual attention. Early visual subcortical processes (i.e. retinal
and thalamic) are functionally simulated. An implementation of the cortical
magnification function is included to define the retinotopic projections
towards V1, processing neuronal activity for each distinct view during scene
observation. Novel computational definitions of top-down inhibition (in terms
of inhibition of return and selection mechanisms) are also proposed to predict
attention in free-viewing and visual search tasks. Results show that our model
outperforms other biologically-inspired models of saliency prediction while
predicting visual saccade sequences with the same model. We also show how
temporal and spatial characteristics of inhibition of return can improve
prediction of saccades, as well as how distinct search strategies (in terms of
feature-selective or category-specific inhibition) can predict attention in
distinct image contexts.
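As a rough illustration of the shared ingredient of such neurodynamic models, the toy firing-rate network below implements short-range excitation and longer-range lateral inhibition in one dimension; every parameter (time constant, kernel widths, gains) is an assumption for illustration, and this is not the paper's V1 model:

```python
# Toy firing-rate network with difference-of-Gaussians lateral weights:
# nearby units excite each other, distant units inhibit, so activity
# concentrates on locally distinctive parts of the input.
import numpy as np

def simulate_firing_rates(stimulus, steps=500, dt=0.05, tau=1.0):
    n = stimulus.size
    d = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    W = (1.5 * np.exp(-d**2 / (2 * 2.0**2))     # short-range excitation
         - 0.8 * np.exp(-d**2 / (2 * 8.0**2)))  # longer-range inhibition
    r = np.zeros(n)
    for _ in range(steps):
        drive = W @ r + stimulus
        # Rectified, saturating rate dynamics keep activity bounded.
        r += dt / tau * (-r + np.tanh(np.maximum(drive, 0.0)))
    return r  # steady-state activity, read as a 1-D "saliency" profile
```

An isolated bump on a flat background ends up enhanced relative to the same bump embedded among many similar neighbors, which is the qualitative signature of lateral-interaction saliency.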
A stochastic model of human visual attention with a dynamic Bayesian network
Recent studies in the field of human vision science suggest that the human
responses to the stimuli on a visual display are non-deterministic: different
people may attend to different locations on the same visual input at the same
time. Based
on this knowledge, we propose a new stochastic model of visual attention by
introducing a dynamic Bayesian network to predict the likelihood of where
humans typically focus on a video scene. The proposed model is composed of a
dynamic Bayesian network with 4 layers. Our model provides a framework that
simulates and combines the visual saliency response and the cognitive state of
a person to estimate the most probable attended regions. Sample-based inference
with a Markov chain Monte Carlo-based particle filter and stream processing with
multi-core processors enable us to estimate human visual attention in near real
time. Experimental results have demonstrated that our model performs
significantly better in predicting human visual attention compared to the
previous deterministic models.
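As an illustration of the sample-based inference step, the snippet below is a generic bootstrap particle filter that tracks a distribution over the attended (row, col) location, using each frame's saliency map as the likelihood; the random-walk motion model and noise scale are assumptions, and the paper's four-layer dynamic Bayesian network is considerably richer:

```python
# Generic bootstrap particle filter over an attended image location:
# predict with a random walk, weight by saliency, resample, estimate.
import numpy as np

rng = np.random.default_rng(0)

def track_attention(saliency_frames, n_particles=500, motion_std=5.0):
    h, w = saliency_frames[0].shape
    particles = rng.uniform([0, 0], [h, w], size=(n_particles, 2))
    estimates = []
    for sal in saliency_frames:
        # Predict: random-walk motion model in image coordinates.
        particles += rng.normal(0.0, motion_std, particles.shape)
        particles[:, 0] = particles[:, 0].clip(0, h - 1)
        particles[:, 1] = particles[:, 1].clip(0, w - 1)
        # Update: weight each particle by the saliency at its position.
        ij = particles.astype(int)
        weights = sal[ij[:, 0], ij[:, 1]] + 1e-12
        weights /= weights.sum()
        # Posterior mean as the attended-location estimate, then resample.
        estimates.append(weights @ particles)
        particles = particles[rng.choice(n_particles, n_particles, p=weights)]
    return np.array(estimates)  # one (row, col) estimate per frame
```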
A Neurodynamic Model of Saliency Prediction in V1
Lateral connections in the primary visual cortex (V1) have long been
hypothesized to be responsible for several visual processing mechanisms such as
brightness induction, chromatic induction, visual discomfort and bottom-up
visual attention (also named saliency). Many computational models have been
developed to independently predict these and other visual processes, but no
computational model has been able to reproduce all of them simultaneously. In
this work we show that a biologically plausible computational model of lateral
interactions of V1 is able to simultaneously predict saliency and all the
aforementioned visual processes. The architecture of our model (NSWAM) is based
on Pennachio's neurodynamic model of lateral connections in V1. It is defined as
a
network of firing rate neurons, sensitive to visual features such as
brightness, color, orientation and scale. We tested NSWAM saliency predictions
using images from several eye tracking datasets. We show that the prediction
accuracy of our architecture, measured with shuffled metrics, is similar to that
of other state-of-the-art computational methods, particularly on synthetic
images (CAT2000-Pattern & SID4VAM), which mainly contain low-level features.
Moreover, we outperform other biologically-inspired saliency models that are
specifically designed to exclusively reproduce saliency. Hence, we show that
our biologically plausible model of lateral connections can simultaneously
explain different visual processes present in V1 (without applying any type of
training or optimization, and keeping the same parametrization for all the
visual processes). This can be useful for the definition of a unified
architecture of the primary visual cortex.
Dynamical optical flow of saliency maps for predicting visual attention
Saliency maps are used to understand human attention and visual fixation.
However, while very well established for static images, there is no general
agreement on how to compute a saliency map of dynamic scenes. In this paper we
propose a mathematically rigorous approach to this problem, incorporating the
static saliency maps of each video frame into the calculation of the optical
flow. Taking static saliency maps into account when calculating the optical
flow allows the aperture problem to be overcome. Our approach is able to explain
human fixation behavior in situations which pose challenges to standard
approaches, such as when a fixated object disappears behind an occlusion and
reappears after several frames. In addition, we quantitatively compare our
model against alternative solutions using a large eye tracking data set.
Together, our results suggest that assessing optical flow information across a
series of saliency maps gives a highly accurate and useful account of human
overt attention in dynamic scenes.
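As a quick way to see the core idea (not the authors' mathematically rigorous formulation), one can run a dense optical-flow estimator directly on consecutive saliency maps, so the estimated flow follows salient structure rather than raw luminance; the sketch below uses OpenCV's Farneback method with assumed parameter values:

```python
# Dense optical flow computed between consecutive saliency maps instead of
# raw frames, so the flow field tracks the motion of salient structure.
import cv2
import numpy as np

def saliency_flow(prev_sal, next_sal):
    # Farneback expects single-channel 8-bit images.
    prev_u8 = (255 * prev_sal / max(prev_sal.max(), 1e-9)).astype(np.uint8)
    next_u8 = (255 * next_sal / max(next_sal.max(), 1e-9)).astype(np.uint8)
    flow = cv2.calcOpticalFlowFarneback(
        prev_u8, next_u8, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow  # flow[y, x] = (dx, dy) displacement of salient structure
```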
DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction
This paper studies audio-visual deep saliency prediction. It introduces a
conceptually simple and effective Deep Audio-Visual Embedding for dynamic
saliency prediction, dubbed ``DAVE'', in conjunction with our efforts towards
building an Audio-Visual Eye-tracking corpus named ``AVE''. Despite the strong
relation between auditory and visual cues in guiding gaze during perception,
video saliency models consider only visual cues and neglect the
auditory information that is ubiquitous in dynamic scenes. Here, we investigate
the applicability of audio cues in conjunction with visual ones in predicting
saliency maps using deep neural networks. To this end, the proposed model is
intentionally designed to be simple. Two baseline models are developed on the
same architecture which consists of an encoder-decoder. The encoder projects
the input into a feature space followed by a decoder that infers saliency. We
conduct an extensive analysis on different modalities and various aspects of
multi-modal dynamic saliency prediction. Our results suggest that (1) audio is
a strong contributing cue for saliency prediction, (2) a salient visible sound
source is the natural cause of the superiority of our Audio-Visual model,
(3) richer feature representations of the input space lead to more powerful
predictions even in the absence of more sophisticated saliency decoders, and (4)
our Audio-Visual model improves on 53.54\% of the frames predicted by the best
Visual model (our baseline). Our endeavour demonstrates that audio is an
important cue that boosts dynamic video saliency prediction and helps models to
approach human performance. The code is available at
https://github.com/hrtavakoli/DAV
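As a sketch of the encoder-decoder pattern described above, the PyTorch module below encodes a short video clip and an audio spectrogram separately, fuses the two embeddings, and decodes a saliency map; the layer sizes and fusion scheme are placeholders, not DAVE's published architecture:

```python
# Illustrative two-stream audio-visual saliency network (placeholder sizes).
import torch
import torch.nn as nn

class TinyAudioVisualSaliency(nn.Module):
    def __init__(self):
        super().__init__()
        # Video encoder: clip (B, 3, T, H, W) -> spatial feature map.
        self.video_enc = nn.Sequential(
            nn.Conv3d(3, 32, (3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, (3, 5, 5), stride=(2, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, 28, 28)),  # collapse the time axis
        )
        # Audio encoder: log-mel spectrogram (B, 1, F, T) -> embedding.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (B, 64)
        )
        # Decoder: fused features -> upsampled single-channel saliency map.
        self.decoder = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, video, audio):
        v = self.video_enc(video).squeeze(2)            # (B, 64, 28, 28)
        a = self.audio_enc(audio)                       # (B, 64)
        a = a[:, :, None, None].expand(-1, -1, 28, 28)  # tile over space
        fused = torch.cat([v, a], dim=1)                # (B, 128, 28, 28)
        return torch.sigmoid(self.decoder(fused))       # (B, 1, 112, 112)
```

Broadcasting the audio embedding over the spatial grid is one simple fusion choice; concatenation before the decoder lets the model learn where audio should modulate visual saliency.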
Saliency Prediction in the Deep Learning Era: Successes, Limitations, and Future Challenges
Visual saliency models have enjoyed a big leap in performance in recent
years, thanks to advances in deep learning and large scale annotated data.
Despite enormous effort and huge breakthroughs, however, models still fall
short of reaching human-level accuracy. In this work, I explore the landscape
of the field, emphasizing new deep saliency models, benchmarks, and datasets.
A large number of image and video saliency models are reviewed and compared
over two image benchmarks and two large scale video datasets. Further, I
identify factors that contribute to the gap between models and humans and
discuss remaining issues that need to be addressed to build the next generation
of more powerful saliency models. Some specific questions that are addressed
include: in what ways current models fail, how to remedy them, what can be
learned from cognitive studies of attention, how explicit saliency judgments
relate to fixations, how to conduct fair model comparisons, and what the
emerging applications of saliency models are.