Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition
Systems based on bag-of-words models, built from image features collected at maxima
of sparse interest point operators, have been used successfully for both
visual object and action recognition tasks in computer vision. While the sparse,
interest-point based approach to recognition is not inconsistent with visual
processing in biological systems that operate in `saccade and fixate' regimes,
the methodology and emphasis in the human and the computer vision communities
remains sharply distinct. Here, we make three contributions aiming to bridge
this gap. First, we complement existing state-of-the-art, large-scale, annotated
dynamic computer vision datasets like Hollywood-2 and UCF Sports with human
eye movements collected under the ecological constraints of the visual action
recognition task. To our knowledge, these are the first large human eye-tracking
datasets for video to be collected and made publicly available
(vision.imar.ro/eyetracking; 497,107 frames, each viewed by 16 subjects), unique
in terms of their (a) large scale and computer vision relevance, (b) dynamic
video stimuli, and (c) task control, as opposed to free viewing. Second, we
introduce novel sequential consistency and alignment measures, which underline
the remarkable stability of patterns of visual search among subjects. Third, we
leverage the significant amount of collected data in order to pursue studies
and build automatic, end-to-end trainable computer vision systems based on
human eye movements. Our studies not only shed light on the differences between
computer vision spatio-temporal interest-point sampling strategies and
human fixations, and on their impact on visual recognition performance,
but also demonstrate that human fixations can be accurately predicted
and, when used in an end-to-end automatic system that leverages
advanced computer vision practice, can lead to state-of-the-art results.
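As a loose illustration of the kind of inter-subject consistency analysis mentioned above (the paper's own sequential consistency and alignment measures are not reproduced here), the sketch below scores how well one subject's fixations on a frame are predicted by a fixation map built from the remaining subjects, using a simple AUC-style measure; all function names, parameters, and the leave-one-subject-out setup are hypothetical.

```python
# Hypothetical sketch: leave-one-subject-out fixation consistency on a single frame.
# Fixations are assumed to be (x, y) pixel coordinates per subject; names are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.metrics import roc_auc_score

def fixation_map(fixations, height, width, sigma=25.0):
    """Blur point fixations into a continuous, saliency-like map."""
    fmap = np.zeros((height, width), dtype=np.float32)
    for x, y in fixations:
        fmap[int(round(y)), int(round(x))] = 1.0
    fmap = gaussian_filter(fmap, sigma=sigma)
    return fmap / (fmap.max() + 1e-8)

def inter_subject_auc(subject_fixations, height, width, n_negatives=1000, seed=0):
    """Average AUC of each subject's fixations under a map built from the other subjects."""
    rng = np.random.default_rng(seed)
    scores = []
    for i, fixes in enumerate(subject_fixations):
        others = [f for j, s in enumerate(subject_fixations) if j != i for f in s]
        smap = fixation_map(others, height, width)
        pos = [smap[int(round(y)), int(round(x))] for x, y in fixes]
        neg = smap[rng.integers(0, height, n_negatives),
                   rng.integers(0, width, n_negatives)].tolist()
        labels = [1] * len(pos) + [0] * len(neg)
        scores.append(roc_auc_score(labels, pos + neg))
    return float(np.mean(scores))

# Toy usage: three "subjects", each with a few fixations on a 480x640 frame.
subjects = [[(100, 200), (320, 240)], [(110, 210), (300, 250)], [(105, 205)]]
print(inter_subject_auc(subjects, height=480, width=640))
```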
Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction
Computational saliency models for still images have gained significant
popularity in recent years. Saliency prediction from videos, on the other hand,
has received relatively little interest from the community. Motivated by this,
in this work, we study the use of deep learning for dynamic saliency prediction
and propose the so-called spatio-temporal saliency networks. The key to our
models is the architecture of two-stream networks where we investigate
different fusion mechanisms to integrate spatial and temporal information. We
evaluate our models on the DIEM and UCF-Sports datasets and present highly
competitive results against the existing state-of-the-art models. We also carry
out some experiments on a number of still images from the MIT300 dataset by
exploiting the optical flow maps predicted from these images. Our results show
that considering inherent motion information in this way can be helpful for
static saliency estimation.
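To make the two-stream idea concrete, here is a minimal PyTorch sketch of a spatial stream over RGB frames and a temporal stream over optical flow, fused by channel concatenation before a prediction head. This is only an illustration of the general scheme, not the paper's architecture; all layer sizes and names are assumptions.

```python
# Hypothetical two-stream saliency sketch in PyTorch; architecture details are assumed.
import torch
import torch.nn as nn

def small_encoder(in_channels):
    # Tiny convolutional encoder standing in for a real backbone.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    )

class TwoStreamSaliency(nn.Module):
    def __init__(self):
        super().__init__()
        self.spatial = small_encoder(3)   # RGB frame
        self.temporal = small_encoder(2)  # optical flow (dx, dy)
        # "Late" fusion: concatenate stream features, then a 1x1 prediction head.
        self.fuse = nn.Sequential(
            nn.Conv2d(128, 64, 1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.Sigmoid(),
        )

    def forward(self, frame, flow):
        feats = torch.cat([self.spatial(frame), self.temporal(flow)], dim=1)
        return self.fuse(feats)  # per-pixel saliency map in [0, 1]

# Example usage with random tensors shaped like a 128x128 frame and its flow field.
model = TwoStreamSaliency()
saliency = model(torch.rand(1, 3, 128, 128), torch.rand(1, 2, 128, 128))
print(saliency.shape)  # torch.Size([1, 1, 128, 128])
```

Other fusion mechanisms (element-wise sum, or fusing at earlier layers) would simply replace the concatenation step above.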
Learning to Attend Relevant Regions in Videos from Eye Fixations
Regions that attract attention in video frames carry a large part of
the semantics of each frame. This information is helpful in many applications,
not only for entertainment (such as automatically generating commentary or
tourist guides) but also for robotic control, for example a robot holding a
laparoscope during laparoscopic surgery. However, it is not always
straightforward to define and locate such semantic regions in videos. In this
work, we attempt to address the problem of attending to relevant regions in
videos by leveraging eye fixation labels with an RNN-based visual attention
model. Our experimental results suggest that this approach holds good potential
to learn to attend to semantic regions in videos, while its performance also
heavily relies on the quality of the eye fixation labels.
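As a rough, hypothetical illustration of what an RNN-based attention model supervised with fixation labels can look like (not the paper's actual formulation), the sketch below encodes each frame, updates a recurrent state, and regresses a normalized fixation location per time step; every name and layer choice is an assumption.

```python
# Hypothetical sketch of a recurrent attention model supervised with fixation labels.
import torch
import torch.nn as nn

class FixationRNN(nn.Module):
    """Predicts a sequence of normalized (x, y) fixation locations for a video clip."""
    def __init__(self, feat_dim=256, hidden_dim=128):
        super().__init__()
        self.frame_encoder = nn.Sequential(  # stand-in for a CNN feature extractor
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(3 * 4 * 4, feat_dim), nn.ReLU(),
        )
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)
        self.loc_head = nn.Linear(hidden_dim, 2)  # (x, y) in [0, 1]

    def forward(self, frames):
        # frames: (T, 3, H, W) for one clip
        h = torch.zeros(1, self.rnn.hidden_size)
        locations = []
        for t in range(frames.shape[0]):
            feat = self.frame_encoder(frames[t:t + 1])
            h = self.rnn(feat, h)
            locations.append(torch.sigmoid(self.loc_head(h)))
        return torch.stack(locations, dim=0)  # (T, 1, 2)

# Training signal: regress predicted locations onto recorded eye fixations.
model = FixationRNN()
clip = torch.rand(8, 3, 64, 64)       # 8 frames
gt_fixations = torch.rand(8, 1, 2)    # normalized fixation coordinates
loss = nn.functional.mse_loss(model(clip), gt_fixations)
loss.backward()
```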
A probabilistic tour of visual attention and gaze shift computational models
In this paper, a number of problems are considered that relate to the
modelling of eye guidance under visual attention in a natural setting. From a
broad discussion of a variety of available models, cast in probabilistic
terms, it appears that current approaches in computational vision are still
far from achieving the goal of an active observer relying upon eye guidance to
accomplish real-world tasks. We argue that this challenging goal not only
requires embedding, in a principled way, the problem of eye guidance within the
action/perception loop, but also facing the inextricable link tying together
visual attention, emotion and executive control, insofar as recent
neurobiological findings are taken into account.
Saliency Prediction in the Deep Learning Era: Successes, Limitations, and Future Challenges
Visual saliency models have enjoyed a big leap in performance in recent
years, thanks to advances in deep learning and large scale annotated data.
Despite enormous effort and huge breakthroughs, however, models still fall
short in reaching human-level accuracy. In this work, I explore the landscape
of the field, emphasizing new deep saliency models, benchmarks, and datasets.
A large number of image and video saliency models are reviewed and compared
over two image benchmarks and two large scale video datasets. Further, I
identify factors that contribute to the gap between models and humans and
discuss remaining issues that need to be addressed to build the next generation
of more powerful saliency models. Some specific questions that are addressed
include: in what ways current models fail, how to remedy them, what can be
learned from cognitive studies of attention, how explicit saliency judgments
relate to fixations, how to conduct fair model comparison, and what are the
emerging applications of saliency models.
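Since fair model comparison is one of the questions raised above, it may help to recall one of the standard fixation-based evaluation measures used throughout this literature, the Normalized Scanpath Saliency (NSS). The snippet below is a minimal illustration of the metric, not any benchmark's reference implementation.

```python
# Minimal illustration of the Normalized Scanpath Saliency (NSS) metric.
import numpy as np

def nss(saliency_map, fixation_mask):
    """Mean of the z-scored saliency map at fixated locations.

    saliency_map: 2D array of predicted saliency values.
    fixation_mask: 2D boolean array, True at ground-truth fixation pixels.
    """
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return float(s[fixation_mask].mean())

# Example: a random prediction scored against two fixated pixels.
pred = np.random.rand(48, 64)
fix = np.zeros((48, 64), dtype=bool)
fix[10, 20] = fix[30, 40] = True
print(nss(pred, fix))
```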
SG-FCN: A Motion and Memory-Based Deep Learning Model for Video Saliency Detection
Data-driven saliency detection has attracted strong interest as a result of
applying convolutional neural networks to the detection of eye fixations.
Although a number of image-based salient object and fixation detection models
have been proposed, video fixation detection still requires more exploration.
Different from image analysis, motion and temporal information is a crucial
factor affecting human attention when viewing video sequences. Although
existing models based on local contrast and low-level features have been
extensively researched, they fail to simultaneously consider inter-frame
motion and temporal information across neighboring video frames, leading to
unsatisfactory performance when handling complex scenes. To this end, we
propose a novel and efficient video eye fixation detection model to improve the
saliency detection performance. By simulating the memory mechanism and visual
attention mechanism of human beings when watching a video, we propose a
step-gained fully convolutional network by combining the memory information on
the time axis with the motion information on the space axis while storing the
saliency information of the current frame. The model is obtained through
hierarchical training, which ensures the accuracy of the detection. Extensive
experiments in comparison with 11 state-of-the-art methods are carried out, and
the results show that our proposed model outperforms all 11 methods across a
number of publicly available datasets.
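A loose sketch of the general idea of combining current-frame appearance, inter-frame motion, and the previous frame's saliency in one fully convolutional model is given below. It is not the SG-FCN architecture itself; all layer and input choices are assumptions.

```python
# Hypothetical sketch: a fully convolutional video-saliency model conditioned on the
# current frame, an inter-frame motion map, and the previous frame's saliency map.
import torch
import torch.nn as nn

class VideoSaliencyFCN(nn.Module):
    def __init__(self):
        super().__init__()
        # Inputs: 3 RGB channels + 1 motion-magnitude channel + 1 previous-saliency channel.
        self.net = nn.Sequential(
            nn.Conv2d(5, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),
        )

    def forward(self, frame, motion, prev_saliency):
        x = torch.cat([frame, motion, prev_saliency], dim=1)
        return self.net(x)

# Example usage over a short clip, feeding each prediction back in as "memory".
model = VideoSaliencyFCN()
prev = torch.zeros(1, 1, 96, 96)
for _ in range(4):
    frame, motion = torch.rand(1, 3, 96, 96), torch.rand(1, 1, 96, 96)
    prev = model(frame, motion, prev)
print(prev.shape)  # torch.Size([1, 1, 96, 96])
```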
Learning Gaze Transitions from Depth to Improve Video Saliency Estimation
In this paper we introduce a novel Depth-Aware Video Saliency approach to
predict human focus of attention when viewing RGBD videos on regular 2D
screens. We train a generative convolutional neural network which predicts a
saliency map for a frame, given the fixation map of the previous frame.
Saliency estimation in this scenario is highly important since in the near
future 3D video content will be easily acquired and yet hard to display. This
can be explained, on the one hand, by the dramatic improvement of 3D-capable
acquisition equipment. On the other hand, despite the considerable progress in
3D display technologies, most of the 3D displays are still expensive and
require wearing special glasses. To evaluate the performance of our approach,
we present a new comprehensive database of eye-fixation ground-truth for RGBD
videos. Our experiments indicate that integrating depth into video saliency
calculation is beneficial. We demonstrate that our approach outperforms
state-of-the-art methods for video saliency, achieving a 15% relative
improvement.
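To illustrate how a model of this kind can be supervised, the sketch below predicts the current frame's saliency from the RGB frame, its depth map, and the previous frame's fixation map, trained with a pixel-wise binary cross-entropy against the ground-truth fixation map. This is an assumption-laden toy setup, not the paper's generative network.

```python
# Minimal training-step sketch (assumptions, not the paper's network): saliency from
# RGB + depth + previous fixation map, supervised with pixel-wise BCE.
import torch
import torch.nn as nn

predictor = nn.Sequential(                 # RGB (3) + depth (1) + previous fixation map (1)
    nn.Conv2d(5, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 1),                   # logits; sigmoid is applied inside the loss
)
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

rgb = torch.rand(2, 3, 64, 64)             # toy batch of two frames
depth = torch.rand(2, 1, 64, 64)
prev_fix = torch.rand(2, 1, 64, 64)        # previous frame's (blurred) fixation map
gt_fix = (torch.rand(2, 1, 64, 64) > 0.95).float()  # toy ground-truth fixation map

logits = predictor(torch.cat([rgb, depth, prev_fix], dim=1))
loss = criterion(logits, gt_fix)
loss.backward()
optimizer.step()
```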
TurkerGaze: Crowdsourcing Saliency with Webcam based Eye Tracking
Traditional eye tracking requires specialized hardware, which means
collecting gaze data from many observers is expensive, tedious and slow.
Therefore, existing saliency prediction datasets are orders of magnitude
smaller than typical datasets for other vision recognition tasks. The small
size of these datasets limits the potential for training data intensive
algorithms, and causes overfitting in benchmark evaluation. To address this
deficiency, this paper introduces a webcam-based gaze tracking system that
supports large-scale, crowdsourced eye tracking deployed on Amazon Mechanical
Turk (AMTurk). By a combination of careful algorithm and gaming protocol
design, our system obtains eye tracking data for saliency prediction comparable
to data gathered in a traditional lab setting, with relatively lower cost and
less effort on the part of the researchers. Using this tool, we build a
saliency dataset for a large number of natural images. We will open-source our
tool and provide a web server where researchers can upload their images to get
eye tracking results from AMTurk.
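As a loose, hypothetical illustration of the kind of per-subject calibration that webcam gaze estimation relies on (not TurkerGaze's actual algorithm), the sketch below fits a ridge regression from simple eye-region features to on-screen gaze coordinates, using samples collected while the subject looks at known calibration targets; the feature choice and all names are assumptions.

```python
# Hypothetical calibration sketch for webcam gaze estimation: map eye-patch features to
# screen coordinates with ridge regression fitted on known calibration points.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy calibration data: 40 samples of a flattened 6x12 grayscale eye patch (the "feature"),
# each paired with the known (x, y) screen position of the calibration target.
eye_features = rng.random((40, 6 * 12))
target_xy = rng.random((40, 2))

model = Ridge(alpha=1.0).fit(eye_features, target_xy)

# At test time, map a new eye patch to an estimated gaze point on the screen.
new_patch = rng.random((1, 6 * 12))
print(model.predict(new_patch))   # estimated normalized (x, y) gaze coordinates
```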
Benchmark 3D eye-tracking dataset for visual saliency prediction on stereoscopic 3D video
Visual Attention Models (VAMs) predict the regions of an image or video
that are most likely to attract human attention. Although saliency
detection is well explored for 2D image and video content, only a few
attempts have been made to design 3D saliency prediction models. Newly proposed
3D visual attention models have to be validated over large-scale video saliency
prediction datasets that also contain eye-tracking data.
There are several publicly available eye-tracking datasets for 2D image and
video content. In the case of 3D, however, there is still a need for
large-scale video saliency datasets for the research community to validate
different 3D-VAMs. In this paper, we introduce a large-scale dataset containing
eye-tracking data collected from 24 subjects free-viewing 61 stereoscopic 3D
videos (and their 2D versions). We
evaluate the performance of the existing saliency detection methods over the
proposed dataset. In addition, we created an online benchmark for validating
the performance of existing 2D and 3D visual attention models and for
facilitating the addition of new VAMs to the benchmark. Our benchmark currently
contains 50 different VAMs.
Predicting Goal-directed Human Attention Using Inverse Reinforcement Learning
Being able to predict human gaze behavior has obvious importance for
behavioral vision and for computer vision applications. Most models have mainly
focused on predicting free-viewing behavior using saliency maps, but these
predictions do not generalize to goal-directed behavior, such as when a person
searches for a visual target object. We propose the first inverse reinforcement
learning (IRL) model to learn the internal reward function and policy used by
humans during visual search. The viewer's internal belief states were modeled
as dynamic contextual belief maps of object locations. These maps were learned
by IRL and then used to predict behavioral scanpaths for multiple target
categories. To train and evaluate our IRL model we created COCO-Search18, which
is now the largest dataset of high-quality search fixations in existence.
COCO-Search18 has 10 participants searching for each of 18 target-object
categories in 6202 images, making about 300,000 goal-directed fixations. When
trained and evaluated on COCO-Search18, the IRL model outperformed baseline
models in predicting search fixation scanpaths, both in terms of similarity to
human search behavior and search efficiency. Finally, reward maps recovered by
the IRL model reveal distinctive target-dependent patterns of object
prioritization, which we interpret as a learned object context.
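As a heavily simplified, hypothetical sketch of the ingredients described above (not the paper's actual IRL training procedure), the snippet below pairs a small reward network defined over per-cell belief-map features with a greedy policy that chooses the next fixation as the grid cell with the highest predicted reward; every name, the grid discretization, and the inhibition-of-return rule are assumptions.

```python
# Heavily simplified, hypothetical sketch: a learned reward over belief maps drives a
# greedy scanpath policy on a coarse grid. This is NOT the paper's IRL method.
import torch
import torch.nn as nn

GRID = 10  # fixations are chosen on a GRID x GRID discretization of the image

class RewardNet(nn.Module):
    """Maps per-cell belief features (e.g., object-location beliefs) to a scalar reward."""
    def __init__(self, n_features=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, belief_maps):
        # belief_maps: (GRID, GRID, n_features) -> (GRID, GRID) reward map
        return self.mlp(belief_maps).squeeze(-1)

def greedy_scanpath(reward_net, belief_maps, n_fixations=6):
    """Pick fixations one by one at the currently highest-reward, not-yet-visited cell."""
    rewards = reward_net(belief_maps).detach().clone()
    path = []
    for _ in range(n_fixations):
        idx = torch.argmax(rewards)
        r, c = divmod(int(idx), GRID)
        path.append((r, c))
        rewards[r, c] = float("-inf")  # simple inhibition of return
    return path

beliefs = torch.rand(GRID, GRID, 8)     # toy contextual belief maps
print(greedy_scanpath(RewardNet(), beliefs))
```

In an actual IRL setting the reward network would be trained so that scanpaths generated by the policy become indistinguishable from recorded human search fixations; the greedy selection above only illustrates how a learned reward map can drive fixation choice.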