Modelling eye movements and visual attention in synchronous visual and linguistic processing
This thesis focuses on modelling visual attention in tasks in which vision interacts
with language and other sources of contextual information. The work is based on
insights provided by experimental studies in visual cognition and psycholinguistics,
particularly cross-modal processing.
We present a series of models of eye movements in situated language comprehension
capable of generating human-like scan-paths. Moreover, we investigate the existence
of high-level structure in the scan-paths and the applicability of tools from Natural
Language Processing to the analysis of this structure.
We show that scan paths carry interesting information that is currently neglected
in both experimental and modelling studies. This information, studied at a level beyond
simple statistical measures such as proportion of looks, can be used to extract
knowledge of more complicated patterns of behaviour, and to build models capable of
simulating human behaviour in the presence of linguistic material.
We also revisit the classical saliency model and its extensions, in particular the
Contextual Guidance Model of Torralba et al. (2006), and extend it with memory of
target positions in visual search. We show that models of contextual guidance should
contain components responsible for short-term learning and memorisation. We also
investigate the applicability of this type of model to the prediction of human
behaviour in tasks with incremental stimuli, as in situated language comprehension.
Finally, we investigate the issue of objectness and object saliency, including their
effects on eye movements and human responses to experimental tasks. In a simple
experiment we show that an object-based notion of saliency predicts fixation locations
better than the pixel-based saliency formulated by Itti et al. (1998). In addition,
we show that object-based saliency fits into current theories such as cognitive
relevance and can be used to build unified models of cross-referential visual and
linguistic processing.
This thesis forms a foundation for a more detailed study of scan-paths within an
object-based framework such as the Cognitive Relevance Framework (Henderson et al.,
2007, 2009) by providing models capable of explaining human behaviour, and by
delivering tools and methodologies to predict which objects will be attended to
during synchronous visual and linguistic processing.
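As a toy illustration of treating scan-paths like linguistic sequences, one can encode
each scan-path as the sequence of fixated-object labels and count n-grams over it. This
bare-bones Python sketch only shows the kind of sequence-level analysis alluded to above;
the labels and the n-gram statistic are illustrative assumptions, not the thesis's methods.

from collections import Counter

def scanpath_ngrams(scanpath, n=2):
    """Count n-grams over a scan-path encoded as a sequence of fixated-object labels."""
    return Counter(tuple(scanpath[i:i + n]) for i in range(len(scanpath) - n + 1))

# Hypothetical trial: objects fixated in order while a sentence is heard.
trial = ["man", "hat", "man", "ball", "hat", "man"]
print(scanpath_ngrams(trial, n=2).most_common(2))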
A Dilated Inception Network for Visual Saliency Prediction
Recently, with the advent of deep convolutional neural networks (DCNNs),
improvements in visual saliency prediction research have been impressive. One
promising direction for further improvement is to fully characterize
the multi-scale factors that influence saliency with a computationally friendly
module in DCNN architectures. In this work, we propose an end-to-end dilated
inception network (DINet) for visual saliency prediction. It captures
multi-scale contextual features effectively with very limited extra parameters.
Instead of utilizing parallel standard convolutions with different kernel sizes,
as in the existing inception module, our proposed dilated inception module (DIM)
uses parallel dilated convolutions with different dilation rates, which
significantly reduces the computational load while enriching the diversity of
receptive fields in feature maps. Moreover, the performance of our saliency
model is further improved by using a set of linear normalization-based
probability distribution distance metrics as loss functions. As such, we can
formulate saliency prediction as a probability distribution prediction task for
global saliency inference instead of a typical pixel-wise regression problem.
Experimental results on several challenging saliency benchmark datasets
demonstrate that our DINet with proposed loss functions can achieve
state-of-the-art performance with shorter inference time.
Comment: Accepted by IEEE Transactions on Multimedia. The source code is
available at https://github.com/ysyscool/DINe
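The dilated inception idea can be illustrated with a minimal PyTorch sketch: parallel
3x3 convolutions that differ only in dilation rate are applied to the same input and
concatenated. The channel sizes and dilation rates below are illustrative assumptions,
not the published DINet configuration.

import torch
import torch.nn as nn

class DilatedInceptionModule(nn.Module):
    """Parallel 3x3 convolutions that differ only in dilation rate; concatenating
    their outputs enlarges and diversifies the receptive field cheaply."""
    def __init__(self, in_channels, branch_channels, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                      padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example: a 64-channel encoder feature map passed through three dilated branches.
feats = torch.randn(1, 64, 40, 40)
out = DilatedInceptionModule(64, 32)(feats)   # -> (1, 96, 40, 40)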
Focusing computational visual attention in multi-modal human-robot interaction
Identifying verbally and non-verbally referred-to objects is an important aspect of human-robot interaction. Most importantly, it is essential to achieve a joint focus of attention and, thus, a natural interaction behavior. In this contribution, we introduce a saliency-based model that reflects how multi-modal referring acts influence the visual search, i.e. the task of finding a specific object in a scene. Therefore, we combine positional information obtained from pointing gestures with contextual knowledge about the visual appearance of the referred-to object obtained from language. The available information is then integrated into a biologically-motivated saliency model that forms the basis for visual search. We prove the feasibility of the proposed approach by presenting the results of an experimental evaluation.
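The integration step can be sketched as follows, assuming the pointing gesture is
converted into a 2D Gaussian spatial prior and the spoken description into weights over
appearance feature maps; the pointwise combination shown here is an illustrative
assumption, not the paper's exact formulation.

import numpy as np

def gaussian_prior(h, w, cx, cy, sigma):
    """Spatial prior derived from a pointing gesture, centred on the pointed-at pixel."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def referential_saliency(bottom_up, feature_maps, language_weights, point_xy, sigma=20.0):
    """Modulate bottom-up saliency with language-derived feature weights and a
    gesture-derived spatial prior; all maps share the same height and width."""
    top_down = sum(w * fm for w, fm in zip(language_weights, feature_maps))
    h, w = bottom_up.shape
    prior = gaussian_prior(h, w, point_xy[0], point_xy[1], sigma)
    s = bottom_up * top_down * prior
    return s / (s.max() + 1e-8)

# Hypothetical inputs: a bottom-up saliency map, two colour feature maps ("red", "green"),
# weights derived from the utterance "the red cup", and a pointing ray hitting (60, 40).
bu = np.random.rand(80, 120)
fmaps = [np.random.rand(80, 120), np.random.rand(80, 120)]
combined = referential_saliency(bu, fmaps, [1.0, 0.0], point_xy=(60, 40))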
PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection
Contexts play an important role in the saliency detection task. However,
given a context region, not all contextual information is helpful for the final
task. In this paper, we propose a novel pixel-wise contextual attention
network, i.e., the PiCANet, to learn to selectively attend to informative
context locations for each pixel. Specifically, for each pixel, it can generate
an attention map in which each attention weight corresponds to the contextual
relevance at each context location. An attended contextual feature can then be
constructed by selectively aggregating the contextual information. We formulate
the proposed PiCANet in both global and local forms to attend to global and
local contexts, respectively. Both models are fully differentiable and can be
embedded into CNNs for joint training. We also incorporate the proposed models
with the U-Net architecture to detect salient objects. Extensive experiments
show that the proposed PiCANets can consistently improve saliency detection
performance. The global and local PiCANets facilitate learning global contrast
and homogeneousness, respectively. As a result, our saliency model can detect
salient objects more accurately and uniformly, thus performing favorably
against state-of-the-art methods.
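The attend-and-aggregate step of the global form can be sketched in PyTorch as below.
The published PiCANet generates the per-pixel attention differently (with recurrent
layers in its global form), so the 1x1 convolution and the fixed 14x14 feature map used
here are simplifying assumptions that only illustrate the aggregation idea.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalPixelContextAttention(nn.Module):
    """For every pixel, predict attention weights over all spatial locations and
    aggregate the whole feature map accordingly (simplified global form)."""
    def __init__(self, channels, size=14):
        super().__init__()
        # A 1x1 convolution scores, at each pixel, every one of the size*size context locations.
        self.score = nn.Conv2d(channels, size * size, kernel_size=1)

    def forward(self, x):                                # x: (B, C, size, size)
        b, c, h, w = x.shape
        scores = self.score(x).view(b, h * w, h * w)     # scores[b, k, p]: location k scored at pixel p
        attn = F.softmax(scores, dim=1)                  # normalise over context locations k
        ctx = x.view(b, c, h * w)                        # (B, C, K)
        attended = torch.bmm(ctx, attn)                  # weighted sum over k for every pixel p
        return attended.view(b, c, h, w)

# Example on a hypothetical 14x14 decoder feature map.
feats = torch.randn(2, 32, 14, 14)
out = GlobalPixelContextAttention(32)(feats)             # -> (2, 32, 14, 14)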
Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition
Recognizing novel sub-categories with scarce samples is an essential and
challenging research topic in computer vision. Existing literature addresses
this challenge by employing local-based representation approaches, which may
not sufficiently facilitate meaningful object-specific semantic understanding,
leading to a reliance on apparent background correlations. Moreover, they
primarily rely on high-dimensional local descriptors to construct a complex
embedding space, potentially limiting generalization. To address the above
challenges, this article proposes a novel model called RSaG for few-shot
fine-grained visual recognition. RSaG introduces additional saliency-aware
supervision via saliency detection to guide the model toward focusing on the
intrinsic discriminative regions. Specifically, RSaG utilizes the saliency
detection model to emphasize the critical regions of each sub-category,
providing additional object-specific information for fine-grained prediction.
RSaG transfers such information with two symmetric branches in a mutual
learning paradigm. Furthermore, RSaG exploits inter-regional relationships to
enhance the informativeness of the representation and subsequently summarizes
the highlighted details into contextual embeddings to facilitate effective
transfer, enabling quick generalization to novel sub-categories. The proposed
approach is empirically evaluated on three widely used benchmarks,
demonstrating its superior performance.
Comment: Under Review
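The two-branch mutual-learning idea can be sketched as a loss function, assuming one
branch sees the raw image and the other a saliency-masked view; the symmetric KL term
and its weight are illustrative assumptions rather than the actual RSaG objective.

import torch
import torch.nn.functional as F

def saliency_guided_mutual_loss(logits_raw, logits_salient, labels, alpha=1.0):
    """Cross-entropy on both branches plus a symmetric KL term that lets the
    raw-image branch and the saliency-masked branch teach each other."""
    ce = F.cross_entropy(logits_raw, labels) + F.cross_entropy(logits_salient, labels)
    log_p = F.log_softmax(logits_raw, dim=1)
    log_q = F.log_softmax(logits_salient, dim=1)
    kl = F.kl_div(log_p, log_q, reduction="batchmean", log_target=True) + \
         F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)
    return ce + alpha * kl

# Example with hypothetical 5-way classification logits from the two branches.
raw, sal = torch.randn(8, 5), torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
loss = saliency_guided_mutual_loss(raw, sal, labels)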
Cortical Dynamics of Contextually-Cued Attentive Visual Learning and Search: Spatial and Object Evidence Accumulation
How do humans use predictive contextual information to facilitate visual search? How are consistently paired scenic objects and positions learned and used to more efficiently guide search in familiar scenes? For example, a certain combination of objects can define a context for a kitchen and trigger a more efficient search for a typical object, such as a sink, in that context. A neural model, ARTSCENE Search, is developed to illustrate the neural mechanisms of such memory-based contextual learning and guidance, and to explain challenging behavioral data on positive/negative, spatial/object, and local/distant global cueing effects during visual search. The model proposes how global scene layout at a first glance rapidly forms a hypothesis about the target location. This hypothesis is then incrementally refined by enhancing target-like objects in space as a scene is scanned with saccadic eye movements. The model clarifies the functional roles of neuroanatomical, neurophysiological, and neuroimaging data in visual search for a desired goal object. In particular, the model simulates the interactive dynamics of spatial and object contextual cueing in the cortical What and Where streams starting from early visual areas through medial temporal lobe to prefrontal cortex. After learning, model dorsolateral prefrontal cortical cells (area 46) prime possible target locations in posterior parietal cortex based on goal-modulated percepts of spatial scene gist represented in parahippocampal cortex, whereas model ventral prefrontal cortical cells (area 47/12) prime possible target object representations in inferior temporal cortex based on the history of viewed objects represented in perirhinal cortex. The model hereby predicts how the cortical What and Where streams cooperate during scene perception, learning, and memory to accumulate evidence over time to drive efficient visual search of familiar scenes.
CELEST, an NSF Science of Learning Center (SBE-0354378); SyNAPSE program of Defense Advanced Research Projects Agency (HR0011-09-3-0001, HR0011-09-C-0011)
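The incremental refinement of the target-location hypothesis can be caricatured as a
simple belief update over candidate locations. This is a schematic Python sketch of the
idea only, with made-up labels and similarity scores, not the ARTSCENE Search equations.

import numpy as np

def refine_target_hypothesis(context_prior, fixations, target_similarity):
    """Start from a scene-context prior over candidate locations and sharpen it
    as each fixated object is compared with the sought target's appearance."""
    belief = np.asarray(context_prior, dtype=float)
    belief = belief / belief.sum()
    for loc, obj in fixations:                            # saccade sequence: (location index, object label)
        evidence = np.full_like(belief, 0.1)              # weak baseline evidence everywhere
        evidence[loc] = target_similarity.get(obj, 0.1)   # enhance target-like objects
        belief = belief * evidence
        belief = belief / belief.sum()
    return belief

# Hypothetical kitchen scene: the gist-based prior already favours location 2.
prior = [0.1, 0.2, 0.5, 0.2]
fixes = [(0, "fridge"), (2, "sink")]
print(refine_target_hypothesis(prior, fixes, {"sink": 0.9, "fridge": 0.2}))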
Recurrent Attentional Networks for Saliency Detection
Convolutional-deconvolution networks can be adopted to perform end-to-end
saliency detection. However, they do not work well with objects of multiple scales.
To overcome such a limitation, in this work, we propose a recurrent attentional
convolutional-deconvolution network (RACDNN). Using spatial transformer and
recurrent network units, RACDNN is able to iteratively attend to selected image
sub-regions to perform saliency refinement progressively. Besides tackling the
scale problem, RACDNN can also learn context-aware features from past
iterations to enhance saliency refinement in future iterations. Experiments on
several challenging saliency detection datasets validate the effectiveness of
RACDNN, and show that RACDNN outperforms state-of-the-art saliency detection
methods.
Comment: CVPR 201
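The spatial-transformer-plus-recurrence idea can be sketched in PyTorch as below: a
recurrent cell over pooled features proposes an affine crop, and the cropped sub-region
would then be re-fed to the saliency decoder for progressive refinement. The sizes, the
LSTM input, and the single step shown here are illustrative assumptions rather than the
RACDNN architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

def attend_subregion(feats, theta, out_size=32):
    """Spatial-transformer crop: theta is a (B, 2, 3) affine matrix selecting the
    sub-region to refine at the current iteration."""
    grid = F.affine_grid(theta, (feats.size(0), feats.size(1), out_size, out_size),
                         align_corners=False)
    return F.grid_sample(feats, grid, align_corners=False)

# One illustrative refinement step on a hypothetical 64-channel feature map.
feats = torch.randn(1, 64, 64, 64)
cell = nn.LSTMCell(64, 64)
to_theta = nn.Linear(64, 6)
h, c = cell(feats.mean(dim=(2, 3)))           # pooled features drive the recurrence
theta = to_theta(h).view(-1, 2, 3)            # predicted attention (affine) parameters
patch = attend_subregion(feats, theta)        # -> (1, 64, 32, 32) attended sub-region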