Deep Visual Attention Prediction
In this work, we aim to predict human eye fixations in free-viewing scenes using an end-to-end deep learning architecture. Although Convolutional Neural Networks (CNNs) have substantially improved human attention prediction, CNN-based attention models can still benefit from more efficient use of multi-scale features. Our visual attention network is designed to capture hierarchical saliency information, from deep, coarse layers with global saliency information to shallow, fine layers with local saliency responses. The model is built on a skip-layer network structure that predicts human attention from multiple convolutional layers with various receptive fields, and the final saliency prediction is obtained by combining these global and local predictions. The model is trained in a deeply supervised manner, with supervision fed directly into multiple intermediate layers, rather than providing supervision only at the output layer and propagating it back to earlier layers, as in previous approaches. Our model thus incorporates multi-level saliency predictions within a single network, which significantly reduces the redundancy of earlier approaches that learn multiple network streams with different input scales. Extensive experimental analysis on various challenging benchmark datasets demonstrates that our method yields state-of-the-art performance with competitive inference time.

Comment: W. Wang and J. Shen. Deep visual attention prediction. IEEE TIP, 27(5):2368-2378, 2018. Code and results can be found at https://github.com/wenguanwang/deepattentio
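To make the skip-layer, deep-supervision idea concrete, here is a minimal PyTorch sketch (not the authors' released code; the layer sizes and module names are illustrative). Each convolutional stage emits its own side saliency map at a different receptive field, and the loss supervises every side output directly as well as the fused prediction.

# Minimal sketch of skip-layer saliency prediction with deep supervision.
# All shapes and stage counts are assumptions, not the paper's exact network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipLayerSaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical backbone stages with increasing receptive fields.
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # One 1x1 readout per stage produces a side saliency map.
        self.read1 = nn.Conv2d(32, 1, 1)
        self.read2 = nn.Conv2d(64, 1, 1)
        self.read3 = nn.Conv2d(128, 1, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        # Upsample each side output back to the input resolution.
        side = [F.interpolate(r(f), size=(h, w), mode='bilinear', align_corners=False)
                for r, f in ((self.read1, f1), (self.read2, f2), (self.read3, f3))]
        fused = torch.sigmoid(sum(side) / len(side))  # combine global and local predictions
        return fused, [torch.sigmoid(s) for s in side]

def deeply_supervised_loss(fused, sides, target):
    # Supervision is applied to the fused map *and* every side output,
    # instead of only at the final output layer.
    loss = F.binary_cross_entropy(fused, target)
    for s in sides:
        loss = loss + F.binary_cross_entropy(s, target)
    return loss

The essential point is the loss function: each intermediate prediction receives its own gradient signal directly, rather than only through backpropagation from the fused output.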
Recurrent Models of Visual Attention
Applying convolutional neural networks to large images is computationally
expensive because the amount of computation scales linearly with the number of
image pixels. We present a novel recurrent neural network model that is capable
of extracting information from an image or video by adaptively selecting a
sequence of regions or locations and only processing the selected regions at
high resolution. Like convolutional neural networks, the proposed model has a
degree of translation invariance built-in, but the amount of computation it
performs can be controlled independently of the input image size. While the
model is non-differentiable, it can be trained using reinforcement learning
methods to learn task-specific policies. We evaluate our model on several image
classification tasks, where it significantly outperforms a convolutional neural
network baseline on cluttered images, and on a dynamic visual control problem,
where it learns to track a simple object without an explicit training signal
for doing so.
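The following is a minimal sketch of such a recurrent attention loop (assumed details, not the paper's exact architecture): at each step the model crops a small glimpse at the current location, updates a recurrent state, and samples the next location from a learned Gaussian policy. Single-channel images are assumed.

# Sketch of a recurrent visual attention model; hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

GLIMPSE = 8  # side length (pixels) of the high-resolution patch processed per step

def take_glimpse(images, loc):
    # Crop a GLIMPSE x GLIMPSE patch centred at `loc` (coordinates in [-1, 1])
    # with grid_sample, so per-step cost is independent of the image size.
    n = images.size(0)
    half = GLIMPSE / images.size(-1)  # patch half-extent in normalized coordinates
    lin = torch.linspace(-1, 1, GLIMPSE, device=images.device) * half
    gy, gx = torch.meshgrid(lin, lin, indexing='ij')
    grid = torch.stack([gx, gy], dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    return F.grid_sample(images, grid + loc.view(n, 1, 1, 2), align_corners=False)

class RecurrentAttention(nn.Module):
    def __init__(self, hidden=256, n_classes=10):
        super().__init__()
        self.hidden = hidden
        self.glimpse_fc = nn.Linear(GLIMPSE * GLIMPSE + 2, hidden)
        self.core = nn.GRUCell(hidden, hidden)
        self.loc_head = nn.Linear(hidden, 2)      # mean of the next-location Gaussian
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, images, n_steps=6):
        n = images.size(0)
        h = images.new_zeros(n, self.hidden)
        loc = images.new_zeros(n, 2)              # start at the image centre
        log_probs = []
        for _ in range(n_steps):
            g = take_glimpse(images, loc).flatten(1)
            h = self.core(torch.relu(self.glimpse_fc(torch.cat([g, loc], 1))), h)
            mean = torch.tanh(self.loc_head(h))
            dist = torch.distributions.Normal(mean, 0.15)
            sample = dist.sample()                # the non-differentiable step
            log_probs.append(dist.log_prob(sample).sum(1))
            loc = sample.clamp(-1, 1)
        return self.classifier(h), torch.stack(log_probs)

Because location sampling is non-differentiable, training would combine a supervised classification loss with a REINFORCE term that weights the summed location log-probabilities by a task reward (e.g. 1 for a correct prediction), which is how the policy learns where to look.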
Control of Selective Visual Attention: Modeling the "Where" Pathway
Intermediate and higher vision processes require selection of a subset of the available sensory information before further processing. Usually, this selection is implemented in the form of a spatially circumscribed region of the visual field, the so-called "focus of attention", which scans the visual scene depending on the input and on the attentional state of the subject. We here present a model for the control of the focus of attention in primates, based on a saliency map. This mechanism is not only expected to model the functionality of biological vision but also to be essential for the understanding of complex scenes in machine vision.
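A minimal sketch of saliency-map-driven scanning in this spirit (simplified and assumed, not the paper's implementation): the focus of attention jumps to the most salient location via winner-take-all selection, which is then suppressed (inhibition of return) so the focus moves on to the next-most-salient region.

# Winner-take-all scanning over a 2-D saliency map with inhibition of return.
import numpy as np

def scan_focus_of_attention(saliency, n_shifts=5, inhibition_radius=10):
    """Return a sequence of attended (row, col) locations on a saliency map."""
    s = saliency.astype(float).copy()
    ys, xs = np.mgrid[:s.shape[0], :s.shape[1]]
    fixations = []
    for _ in range(n_shifts):
        r, c = np.unravel_index(np.argmax(s), s.shape)  # winner-take-all
        fixations.append((r, c))
        # Inhibition of return: zero out a disc around the attended location.
        s[(ys - r) ** 2 + (xs - c) ** 2 <= inhibition_radius ** 2] = 0.0
    return fixations

# Example: a toy saliency map with two salient blobs.
rng = np.random.default_rng(0)
smap = rng.random((64, 64)) * 0.1
smap[10:14, 40:44] += 1.0   # strong salient region, attended first
smap[50:53, 8:11] += 0.7    # weaker region, attended second
print(scan_focus_of_attention(smap, n_shifts=3))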
Visual attention models for scene text recognition
In this paper we propose an approach to lexicon-free recognition of text in scene images. Our approach relies on an LSTM-based soft visual attention model learned from convolutional features. A set of feature vectors is derived from an intermediate convolutional layer, each corresponding to a different area of the image, which permits encoding spatial information into the image representation. In this way, the framework is able to learn how to selectively focus on different parts of the image. At every time step the recognizer emits one character using a weighted combination of the convolutional feature vectors according to the learned attention model. Training can be done end-to-end using only word-level annotations. In addition, we show that modifying the beam search algorithm to integrate an explicit language model leads to significantly better recognition results. We validate the performance of our approach on the standard SVT and ICDAR'03 scene text datasets, showing state-of-the-art performance in unconstrained text recognition.
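One decoding step of this kind of soft attention can be sketched as follows (assumed shapes and names, not the paper's implementation): attention weights over the spatial feature vectors produce a context vector, from which the LSTM state is updated and the next character predicted.

# One step of LSTM-based soft attention over convolutional feature vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, n_chars=37):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden, 1)  # additive attention scorer
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.out = nn.Linear(hidden, n_chars)

    def forward(self, feats, h, c):
        # feats: (batch, locations, feat_dim), one vector per spatial position
        # of an intermediate conv layer; h, c: LSTM state (batch, hidden).
        q = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        alpha = F.softmax(self.score(torch.cat([feats, q], -1)).squeeze(-1), dim=1)
        context = (alpha.unsqueeze(-1) * feats).sum(1)  # weighted combination
        h, c = self.cell(context, (h, c))
        return self.out(h), alpha, h, c                 # character logits + weights

# One step on dummy data: a 6x20 conv map flattened to 120 spatial locations.
step = SoftAttentionDecoderStep()
feats = torch.randn(2, 120, 512)
h = c = torch.zeros(2, 256)
logits, alpha, h, c = step(feats, h, c)

Running such a step once per output character, with the previous hidden state conditioning the attention weights, is what lets the recognizer sweep its focus across the word image as it emits characters.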
Dyslexia and the assessment of visual attention
Visual stream segregation has been proposed as a method of measuring visual attention in dyslexia. Another task proposed for this purpose is the line-motion illusion. Both tasks, it is observed, can be carried out with spatially distributed stimuli. This, however, appears inconsistent with these tasks being linked specifically to attentional processes, since that would require them to focus cognitive resources spatially. Also, both line-motion and visual stream segregation involve the perception of movement, raising the possibility that what these tasks actually measure is not attention but some aspect of motion perception.
