12 research outputs found

    An integrated model of visual attention using shape-based features

    Apart from helping shed light on human perceptual mechanisms, modeling visual attention has important applications in computer vision: it has been shown to be useful in priming object detection, pruning interest points, quantifying visual clutter, and predicting human eye movements. Prior work has relied either on purely bottom-up approaches or on top-down schemes using simple low-level features. In this paper, we outline a top-down visual attention model based on shape-based features. The same shape-based representation is used for both the objects and the scenes that contain them. The spatial priors imposed by the scene and the feature priors imposed by the target object are combined in a Bayesian framework to generate a task-dependent saliency map. We show that our approach can predict the location of objects as well as match eye movements (92% overlap with human observers). We also show that the proposed approach performs better than existing bottom-up and top-down computational models.
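    As a rough illustration of the combination the abstract describes, here is a minimal NumPy sketch (not the authors' code; `task_saliency`, the toy likelihood, and the prior are all assumptions) that multiplies a per-location feature likelihood by a scene-imposed spatial prior and normalizes the result into a saliency map:

    ```python
    import numpy as np

    def task_saliency(feature_likelihood, spatial_prior):
        """Pointwise Bayesian combination: p(target at x | image) is
        proportional to p(features at x | target) * p(x | scene)."""
        posterior = feature_likelihood * spatial_prior
        return posterior / posterior.sum()  # normalize into a saliency map

    # Toy 4x4 scene: the scene prior favors the upper rows.
    rng = np.random.default_rng(0)
    likelihood = rng.random((4, 4))                      # stand-in for shape-feature match scores
    prior = np.linspace(1.0, 0.1, 4)[:, None] * np.ones((4, 4))
    saliency = task_saliency(likelihood, prior)
    print(np.unravel_index(saliency.argmax(), saliency.shape))  # most salient location
    ```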

    A Bayesian inference theory of attention: neuroscience and algorithms

    The past four decades of research in visual neuroscience have generated a large and disparate body of literature on the role of attention [Itti et al., 2005]. Although several models have been developed to describe specific properties of attention, a theoretical framework that explains the computational role of attention and is consistent with all known effects is still needed. Recently, several authors have suggested that visual perception can be interpreted as a Bayesian inference process [Rao et al., 2002, Knill and Richards, 1996, Lee and Mumford, 2003]. Within this framework, top-down priors conveyed via cortical feedback help disambiguate noisy bottom-up sensory input signals. Building on earlier work by Rao [2005], we show that this Bayesian inference proposal can be extended to explain the role of attention and predict its main properties: namely, to facilitate the recognition of objects in clutter. Visual recognition proceeds by estimating the posterior probabilities for objects and their locations within an image via an exchange of messages between ventral and parietal areas of the visual cortex. Within this framework, spatial attention is used to reduce the uncertainty in feature information, while feature-based attention is used to reduce the uncertainty in location information; in conjunction, they are used to recognize objects in clutter. Here, we find that several key attentional phenomena such as pop-out, multiplicative modulation and change in contrast response emerge naturally as properties of the network. We explain the idea in three stages. First, we develop a simplified model of attention in the brain, identifying the primary areas involved and their interconnections. Second, we propose a Bayesian network in which each node has direct neural correlates within our simplified biological model. Finally, we elucidate the properties of the resulting model, showing that its predictions are consistent with physiological and behavioral evidence.
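    A toy sketch of the "what and where" inference the abstract outlines, assuming a discrete grid of object identities and locations (the function name and the 2x2 example are illustrative, not from the paper). Spatial attention enters as a peaked location prior; feature-based attention enters as a peaked object prior:

    ```python
    import numpy as np

    def what_and_where(likelihood, object_prior, location_prior):
        """Joint posterior over object identity O and location L, then the
        two marginals: p(O | image) ('what') and p(L | image) ('where').
        likelihood[o, l] stands for p(image | O=o, L=l)."""
        joint = likelihood * object_prior[:, None] * location_prior[None, :]
        joint /= joint.sum()
        return joint.sum(axis=1), joint.sum(axis=0)

    # Two objects, two locations; attending to location 0 (peaked location
    # prior) sharpens the object ('what') posterior, mirroring spatial attention.
    likelihood = np.array([[0.6, 0.2],
                           [0.3, 0.9]])
    p_what, p_where = what_and_where(likelihood,
                                     object_prior=np.array([0.5, 0.5]),
                                     location_prior=np.array([0.9, 0.1]))
    print(p_what, p_where)
    ```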

    Discriminant Saliency, the Detection of Suspicious Coincidences, and Applications to Visual Recognition


    Integrated Learning of Saliency, Complex Features, and Object Detectors from Cluttered Scenes


    What and where: a Bayesian inference theory of visual attention

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 107-116). In the theoretical framework described in this thesis, attention is part of the inference process that solves the visual recognition problem of what is where. The theory proposes a computational role for attention and leads to a model that predicts some of its main properties at the level of psychophysics and physiology. In our approach, the main goal of the visual system is to infer the identity and the position of objects in visual scenes: spatial attention emerges as a strategy to reduce the uncertainty in shape information, while feature-based attention reduces the uncertainty in spatial information. Featural and spatial attention represent two distinct modes of a computational process solving the problem of recognizing and localizing objects, especially in difficult recognition tasks such as cluttered natural scenes. We describe a specific computational model and relate it to the known functional anatomy of attention. We show that several well-known attentional phenomena, including bottom-up pop-out effects, multiplicative modulation of neuronal tuning curves, and shifts in contrast responses, emerge naturally as predictions of the model. We also show that the Bayesian model is a good predictor of human eye fixations (considered as a proxy for shifts of attention) in natural scenes. Finally, we demonstrate that the same model, used to modulate information in an existing feedforward model of the ventral stream, improves its object recognition performance in clutter. by Sharat Chikkerur, Ph.D.
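    One concrete piece of the thesis abstract, multiplicative attentional modulation of a feedforward feature hierarchy, might be sketched as follows (an assumed interface, not the thesis implementation; `attend`, the array shapes, and the peak-gain normalization are illustrative choices):

    ```python
    import numpy as np

    def attend(feature_maps, location_posterior):
        """Gate feedforward feature maps multiplicatively by a spatial
        posterior, so responses at attended locations are preserved and
        responses elsewhere are suppressed.
        feature_maps: (channels, H, W); location_posterior: (H, W)."""
        gain = location_posterior / location_posterior.max()  # peak gain of 1
        return feature_maps * gain[None, :, :]

    # Example: suppress everything outside an attended 2x2 window.
    feats = np.ones((3, 4, 4))
    posterior = np.zeros((4, 4))
    posterior[1:3, 1:3] = 0.25
    modulated = attend(feats, posterior)
    print(modulated[0])  # nonzero only inside the attended window
    ```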

    Learning Saliency for Human Action Recognition

    PhD thesis. When we look at a visual stimulus, certain areas stand out from the neighbouring areas and immediately grab our attention. A map that identifies such areas is called a visual saliency map. As humans can easily recognize actions when watching videos, having their saliency maps available might be beneficial for a fully automated action recognition system. In this thesis we look into ways of learning to predict visual saliency and ways to use the learned saliency for action recognition. In the first phase, as opposed to approaches that use manually designed features for saliency prediction, we propose a few multilayer architectures for learning saliency features. First, we learn first-layer features in a two-layer architecture using an unsupervised learning algorithm. Second, we learn second-layer features in a two-layer architecture using supervision from recorded human gaze fixations. Third, we use a deep architecture that learns features at all layers using only supervision from recorded human gaze fixations. We show that the saliency prediction results we obtain are better than those obtained by approaches that use manually designed features. We also show that supervision at higher layers yields better saliency prediction results, i.e. the second approach outperforms the first, and the third outperforms the second. In the second phase we focus on how saliency can be used to localize areas for action classification. In contrast to manually designed action features such as HOG/HOF, we learn the features using a fully supervised deep learning architecture. We show that our features, combined with the saliency predicted in the first phase, outperform manually designed features. We further develop an SVM framework that uses the predicted saliency and the learned action features to both localize (in terms of bounding boxes) and classify the actions. Saliency prediction enters as an additional cost in the SVM training and testing procedure when inferring the bounding box locations. We show that adding the saliency cost yields better action recognition results than leaving it out, and the improvement is larger when the cost is added in both training and testing rather than in testing alone.
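    The abstract's idea of adding a saliency cost when inferring bounding boxes might look roughly like this (a hypothetical scoring rule; `score_box`, `lam`, and the mean-saliency bonus are assumptions rather than the thesis formulation):

    ```python
    import numpy as np

    def score_box(w, box_features, box, saliency_map, lam=1.0):
        """Score a candidate bounding box as the linear SVM action score
        plus a saliency bonus: the mean predicted saliency inside the box."""
        x0, y0, x1, y1 = box
        svm_score = float(w @ box_features)                # learned action features for this box
        saliency_bonus = saliency_map[y0:y1, x0:x1].mean()
        return svm_score + lam * saliency_bonus

    # Pick the best of several candidate boxes for one frame.
    rng = np.random.default_rng(1)
    w = rng.standard_normal(8)
    saliency = rng.random((60, 80))
    boxes = [(0, 0, 40, 30), (20, 10, 70, 50)]
    feats = [rng.standard_normal(8) for _ in boxes]
    best = max(zip(boxes, feats), key=lambda bf: score_box(w, bf[1], bf[0], saliency))
    print(best[0])
    ```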