4,468 research outputs found

    Computational models of attention

    Full text link
    This chapter reviews recent computational models of visual attention. We begin with models for the bottom-up or stimulus-driven guidance of attention to salient visual items, which we examine in seven different broad categories. We then examine more complex models which address the top-down or goal-oriented guidance of attention towards items that are more relevant to the task at hand

    Computational models: Bottom-up and top-down aspects

    Full text link
    Computational models of visual attention have become popular over the past decade, we believe primarily for two reasons: First, models make testable predictions that can be explored by experimentalists as well as theoreticians, second, models have practical and technological applications of interest to the applied science and engineering communities. In this chapter, we take a critical look at recent attention modeling efforts. We focus on {\em computational models of attention} as defined by Tsotsos \& Rothenstein \shortcite{Tsotsos_Rothenstein11}: Models which can process any visual stimulus (typically, an image or video clip), which can possibly also be given some task definition, and which make predictions that can be compared to human or animal behavioral or physiological responses elicited by the same stimulus and task. Thus, we here place less emphasis on abstract models, phenomenological models, purely data-driven fitting or extrapolation models, or models specifically designed for a single task or for a restricted class of stimuli. For theoretical models, we refer the reader to a number of previous reviews that address attention theories and models more generally \cite{Itti_Koch01nrn,Paletta_etal05,Frintrop_etal10,Rothenstein_Tsotsos08,Gottlieb_Balan10,Toet11,Borji_Itti12pami}

    Fully automatic extraction of salient objects from videos in near real-time

    Full text link
    Automatic video segmentation plays an important role in a wide range of computer vision and image processing applications. Recently, various methods have been proposed for this purpose. The problem is that most of these methods are far from real-time processing even for low-resolution videos due to the complex procedures. To this end, we propose a new and quite fast method for automatic video segmentation with the help of 1) efficient optimization of Markov random fields with polynomial time of number of pixels by introducing graph cuts, 2) automatic, computationally efficient but stable derivation of segmentation priors using visual saliency and sequential update mechanism, and 3) an implementation strategy in the principle of stream processing with graphics processor units (GPUs). Test results indicates that our method extracts appropriate regions from videos as precisely as and much faster than previous semi-automatic methods even though any supervisions have not been incorporated.Comment: submitted to Special Issue on High Performance Computation on Hardware Accelerators, the Computer Journa

    Unsupervised Video Analysis Based on a Spatiotemporal Saliency Detector

    Full text link
    Visual saliency, which predicts regions in the field of view that draw the most visual attention, has attracted a lot of interest from researchers. It has already been used in several vision tasks, e.g., image classification, object detection, foreground segmentation. Recently, the spectrum analysis based visual saliency approach has attracted a lot of interest due to its simplicity and good performance, where the phase information of the image is used to construct the saliency map. In this paper, we propose a new approach for detecting spatiotemporal visual saliency based on the phase spectrum of the videos, which is easy to implement and computationally efficient. With the proposed algorithm, we also study how the spatiotemporal saliency can be used in two important vision task, abnormality detection and spatiotemporal interest point detection. The proposed algorithm is evaluated on several commonly used datasets with comparison to the state-of-art methods from the literature. The experiments demonstrate the effectiveness of the proposed approach to spatiotemporal visual saliency detection and its application to the above vision tasksComment: 21 page

    Modeling Bottom-Up and Top-Down Attention with a Neurodynamic Model of V1

    Full text link
    Previous studies suggested that lateral interactions of V1 cells are responsible, among other visual effects, of bottom-up visual attention (alternatively named visual salience or saliency). Our objective is to mimic these connections with a neurodynamic network of firing-rate neurons in order to predict visual attention. Early visual subcortical processes (i.e. retinal and thalamic) are functionally simulated. An implementation of the cortical magnification function is included to define the retinotopical projections towards V1, processing neuronal activity for each distinct view during scene observation. Novel computational definitions of top-down inhibition (in terms of inhibition of return and selection mechanisms), are also proposed to predict attention in Free-Viewing and Visual Search tasks. Results show that our model outpeforms other biologically-inpired models of saliency prediction while predicting visual saccade sequences with the same model. We also show how temporal and spatial characteristics of inhibition of return can improve prediction of saccades, as well as how distinct search strategies (in terms of feature-selective or category-specific inhibition) can predict attention at distinct image contexts.Comment: 27 pages, 19 figure

    A stochastic model of human visual attention with a dynamic Bayesian network

    Full text link
    Recent studies in the field of human vision science suggest that the human responses to the stimuli on a visual display are non-deterministic. People may attend to different locations on the same visual input at the same time. Based on this knowledge, we propose a new stochastic model of visual attention by introducing a dynamic Bayesian network to predict the likelihood of where humans typically focus on a video scene. The proposed model is composed of a dynamic Bayesian network with 4 layers. Our model provides a framework that simulates and combines the visual saliency response and the cognitive state of a person to estimate the most probable attended regions. Sample-based inference with Markov chain Monte-Carlo based particle filter and stream processing with multi-core processors enable us to estimate human visual attention in near real time. Experimental results have demonstrated that our model performs significantly better in predicting human visual attention compared to the previous deterministic models.Comment: 24 pages, single-column, 13 figures excluding portlaits, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence

    A Neurodynamic model of Saliency prediction in V1

    Full text link
    Lateral connections in the primary visual cortex (V1) have long been hypothesized to be responsible of several visual processing mechanisms such as brightness induction, chromatic induction, visual discomfort and bottom-up visual attention (also named saliency). Many computational models have been developed to independently predict these and other visual processes, but no computational model has been able to reproduce all of them simultaneously. In this work we show that a biologically plausible computational model of lateral interactions of V1 is able to simultaneously predict saliency and all the aforementioned visual processes. Our model's (NSWAM) architecture is based on Pennachio's neurodynamic model of lateral connections of V1. It is defined as a network of firing rate neurons, sensitive to visual features such as brightness, color, orientation and scale. We tested NSWAM saliency predictions using images from several eye tracking datasets. We show that accuracy of predictions, using shuffled metrics, obtained by our architecture is similar to other state-of-the-art computational methods, particularly with synthetic images (CAT2000-Pattern & SID4VAM) which mainly contain low level features. Moreover, we outperform other biologically-inspired saliency models that are specifically designed to exclusively reproduce saliency. Hence, we show that our biologically plausible model of lateral connections can simultaneously explain different visual proceses present in V1 (without applying any type of training or optimization and keeping the same parametrization for all the visual processes). This can be useful for the definition of a unified architecture of the primary visual cortex.Comment: 17 pages, 17 figures, 6 table

    Dynamical optical flow of saliency maps for predicting visual attention

    Full text link
    Saliency maps are used to understand human attention and visual fixation. However, while very well established for static images, there is no general agreement on how to compute a saliency map of dynamic scenes. In this paper we propose a mathematically rigorous approach to this prob- lem, including static saliency maps of each video frame for the calculation of the optical flow. Taking into account static saliency maps for calculating the optical flow allows for overcoming the aperture problem. Our ap- proach is able to explain human fixation behavior in situations which pose challenges to standard approaches, such as when a fixated object disappears behind an occlusion and reappears after several frames. In addition, we quantitatively compare our model against alternative solutions using a large eye tracking data set. Together, our results suggest that assessing optical flow information across a series of saliency maps gives a highly accurate and useful account of human overt attention in dynamic scenes

    DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction

    Full text link
    This paper studies audio-visual deep saliency prediction. It introduces a conceptually simple and effective Deep Audio-Visual Embedding for dynamic saliency prediction dubbed ``DAVE" in conjunction with our efforts towards building an Audio-Visual Eye-tracking corpus named ``AVE". Despite existing a strong relation between auditory and visual cues for guiding gaze during perception, video saliency models only consider visual cues and neglect the auditory information that is ubiquitous in dynamic scenes. Here, we investigate the applicability of audio cues in conjunction with visual ones in predicting saliency maps using deep neural networks. To this end, the proposed model is intentionally designed to be simple. Two baseline models are developed on the same architecture which consists of an encoder-decoder. The encoder projects the input into a feature space followed by a decoder that infers saliency. We conduct an extensive analysis on different modalities and various aspects of multi-model dynamic saliency prediction. Our results suggest that (1) audio is a strong contributing cue for saliency prediction, (2) salient visible sound-source is the natural cause of the superiority of our Audio-Visual model, (3) richer feature representations for the input space leads to more powerful predictions even in absence of more sophisticated saliency decoders, and (4) Audio-Visual model improves over 53.54\% of the frames predicted by the best Visual model (our baseline). Our endeavour demonstrates that audio is an important cue that boosts dynamic video saliency prediction and helps models to approach human performance. The code is available at https://github.com/hrtavakoli/DAV

    Saliency Prediction in the Deep Learning Era: Successes, Limitations, and Future Challenges

    Full text link
    Visual saliency models have enjoyed a big leap in performance in recent years, thanks to advances in deep learning and large scale annotated data. Despite enormous effort and huge breakthroughs, however, models still fall short in reaching human-level accuracy. In this work, I explore the landscape of the field emphasizing on new deep saliency models, benchmarks, and datasets. A large number of image and video saliency models are reviewed and compared over two image benchmarks and two large scale video datasets. Further, I identify factors that contribute to the gap between models and humans and discuss remaining issues that need to be addressed to build the next generation of more powerful saliency models. Some specific questions that are addressed include: in what ways current models fail, how to remedy them, what can be learned from cognitive studies of attention, how explicit saliency judgments relate to fixations, how to conduct fair model comparison, and what are the emerging applications of saliency models
    • …
    corecore