Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition
Systems based on bag-of-words models built from image features collected at maxima of sparse interest point operators have been used successfully for both visual object and action recognition tasks in computer vision. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in `saccade and fixate' regimes, the methodology and emphasis in the human and the computer vision communities remain sharply distinct. Here, we make three contributions aiming to bridge this gap. First, we complement existing state-of-the-art, large-scale, annotated dynamic computer vision datasets like Hollywood-2 and UCF Sports with human eye movements collected under the ecological constraints of the visual action recognition task. To our knowledge these are the first large human eye tracking datasets for video to be collected and made publicly available (vision.imar.ro/eyetracking; 497,107 frames, each viewed by 16 subjects), unique in terms of their (a) large scale and computer vision relevance, (b) dynamic video stimuli, and (c) task control, as opposed to free-viewing. Second, we
introduce novel sequential consistency and alignment measures, which underline
the remarkable stability of patterns of visual search among subjects. Third, we
leverage the significant amount of collected data in order to pursue studies
and build automatic, end-to-end trainable computer vision systems based on
human eye movements. Our studies not only shed light on the differences between computer vision spatio-temporal interest point image sampling strategies and human fixations, and on their impact on visual recognition performance, but also demonstrate that human fixations can be accurately predicted and, when used in an end-to-end automatic system that leverages advanced computer vision practice, can lead to state-of-the-art results.
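As a rough illustration of how an interest-point sampling strategy can be compared against human fixations on a single frame, here is a minimal sketch; it is not the authors' sequential consistency or alignment measures, and the smoothing width and random-baseline size are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(fixations, shape, sigma=15.0):
    """Blur discrete fixation points (row, col) into a continuous fixation map."""
    fmap = np.zeros(shape, dtype=np.float64)
    for r, c in fixations:
        fmap[int(r), int(c)] += 1.0
    fmap = gaussian_filter(fmap, sigma)
    return fmap / (fmap.max() + 1e-12)

def auc_at_samples(fmap, sample_points, n_random=1000, seed=0):
    """AUC-style score: do interest-point samples land on high fixation-map
    values more often than uniformly random locations do?"""
    rng = np.random.default_rng(seed)
    pos = np.array([fmap[int(r), int(c)] for r, c in sample_points])
    neg = fmap[rng.integers(0, fmap.shape[0], n_random),
               rng.integers(0, fmap.shape[1], n_random)]
    diffs = pos[:, None] - neg[None, :]
    return (diffs > 0).mean() + 0.5 * (diffs == 0).mean()
```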
Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction
Computational saliency models for still images have gained significant
popularity in recent years. Saliency prediction from videos, on the other hand,
has received relatively little interest from the community. Motivated by this,
in this work, we study the use of deep learning for dynamic saliency prediction
and propose the so-called spatio-temporal saliency networks. The key to our
models is a two-stream network architecture in which we investigate different fusion mechanisms to integrate spatial and temporal information. We
evaluate our models on the DIEM and UCF-Sports datasets and present highly
competitive results against the existing state-of-the-art models. We also carry
out some experiments on a number of still images from the MIT300 dataset by
exploiting the optical flow maps predicted from these images. Our results show
that considering inherent motion information in this way can be helpful for
static saliency estimation.
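As a loose sketch of the two-stream idea described above (not the paper's actual architecture, which builds on deeper pretrained networks; all layer sizes and the 1x1 late-fusion choice are assumptions), in PyTorch:

```python
import torch
import torch.nn as nn

class Stream(nn.Module):
    """Small encoder used for either the spatial (RGB) or temporal (flow) stream."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class TwoStreamSaliency(nn.Module):
    """Late fusion of the two streams into a single saliency map."""
    def __init__(self):
        super().__init__()
        self.spatial = Stream(in_channels=3)   # RGB frame
        self.temporal = Stream(in_channels=2)  # x/y optical-flow components
        self.fuse = nn.Conv2d(64, 1, kernel_size=1)

    def forward(self, rgb, flow):
        fused = torch.cat([self.spatial(rgb), self.temporal(flow)], dim=1)
        return torch.sigmoid(self.fuse(fused))

# Example: saliency = TwoStreamSaliency()(torch.rand(1, 3, 224, 224),
#                                          torch.rand(1, 2, 224, 224))
```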
OpenEDS2020: Open Eyes Dataset
We present the second edition of the OpenEDS dataset, OpenEDS2020, a novel
dataset of eye-image sequences captured at a frame rate of 100 Hz under
controlled illumination, using a virtual-reality head-mounted display mounted
with two synchronized eye-facing cameras. The dataset, which is anonymized to
remove any personally identifiable information on participants, consists of 80
participants of varied appearance performing several gaze-elicited tasks, and
is divided into two subsets: 1) Gaze Prediction Dataset, with up to 66,560
sequences containing 550,400 eye-images and respective gaze vectors, created to
foster research in spatio-temporal gaze estimation and prediction approaches;
and 2) Eye Segmentation Dataset, consisting of 200 sequences sampled at 5 Hz,
with up to 29,500 images, of which 5% contain a semantic segmentation label,
devised to encourage the use of temporal information to propagate labels to
contiguous frames. Baseline experiments, one for each task, have been evaluated on OpenEDS2020, yielding an average angular error of 5.37 degrees when predicting gaze 1 to 5 frames into the future, and a mean intersection-over-union score of 84.1% for semantic segmentation. As with its predecessor, the OpenEDS dataset, we anticipate that this new dataset will continue to create opportunities for researchers in the eye tracking, machine learning and computer vision communities to advance the state of the art for virtual reality applications. The dataset is available for download upon request at http://research.fb.com/programs/openeds-2020-challenge/.
Comment: Description of the dataset used in the OpenEDS2020 challenge: https://research.fb.com/programs/openeds-2020-challenge
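For reference, the average angular error reported for the gaze-prediction baseline is typically computed along these lines; this is a generic sketch, not the challenge's evaluation code:

```python
import numpy as np

def mean_angular_error_deg(pred, gt):
    """Mean angle, in degrees, between predicted and ground-truth 3D gaze vectors.

    pred, gt: arrays of shape (N, 3), one gaze direction per sample.
    """
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```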
Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM
Over the past few years, deep neural networks (DNNs) have exhibited great
success in predicting the saliency of images. However, there are few works that
apply DNNs to predict the saliency of generic videos. In this paper, we propose
a novel DNN-based video saliency prediction method. Specifically, we establish
a large-scale eye-tracking database of videos (LEDOV), which provides
sufficient data to train the DNN models for predicting video saliency. Through
the statistical analysis of our LEDOV database, we find that human attention is
normally attracted by objects, particularly moving objects or the moving parts
of objects. Accordingly, we propose an object-to-motion convolutional neural network (OM-CNN) to learn spatio-temporal features for predicting intra-frame saliency by exploiting information about both objectness and object motion. We further find from our database that there exists a temporal
correlation of human attention with a smooth saliency transition across video
frames. Therefore, we develop a two-layer convolutional long short-term memory
(2C-LSTM) network in our DNN-based method, using the extracted features of
OM-CNN as the input. Consequently, the inter-frame saliency maps of videos can
be generated, which consider the transition of attention across video frames.
Finally, the experimental results show that our method advances the
state-of-the-art in video saliency prediction.
Comment: Lai Jiang, Mai Xu, Tie Liu, Minglang Qiao, and Zulin Wang. DeepVS: A Deep Learning Based Video Saliency Prediction Approach. The European Conference on Computer Vision (ECCV), September 201
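A minimal sketch of one convolutional LSTM cell of the kind stacked in the 2C-LSTM; the channel counts, kernel size, and the way OM-CNN features are fed in are assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """A convolutional LSTM cell: all gates are computed with convolutions so the
    hidden state and memory keep their spatial layout across frames."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c

# Example: per-frame spatial features (stand-ins for OM-CNN outputs) are fed in
# sequence, and the hidden state h would be decoded into each frame's saliency map.
cell = ConvLSTMCell(in_ch=64, hid_ch=32)
h = torch.zeros(1, 32, 56, 56)
c = torch.zeros(1, 32, 56, 56)
for _ in range(5):
    h, c = cell(torch.rand(1, 64, 56, 56), (h, c))
```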
Fast non parametric entropy estimation for spatial-temporal saliency method
This paper formulates bottom-up visual saliency as center surround
conditional entropy and presents a fast and efficient technique for the
computation of such a saliency map. It is shown that the new saliency formulation is consistent with self-information based saliency, decision-theoretic saliency, and the Bayesian definition of surprise, but also faces the same significant computational challenge of estimating probability density
in very high dimensional spaces with limited samples. We have developed a fast
and efficient nonparametric method to make the practical implementation of
these types of saliency maps possible. By aligning pixels from the center and
surround regions and treating their location coordinates as random variables,
we use a k-d partitioning method to efficiently estimate the center surround
conditional entropy. We present experimental results on two publicly available
eye tracking still image databases and show that the new technique is
competitive with state-of-the-art bottom-up computational saliency methods. We have also extended the technique to compute the spatiotemporal visual saliency of video, and we evaluate the resulting bottom-up spatiotemporal saliency against eye tracking data on a video taken onboard a moving vehicle, with the driver's eye tracked by a head-mounted eye-tracker.
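The paper's k-d partitioning estimator is not reproduced here; as a rough stand-in, the sketch below uses the related Kozachenko-Leonenko k-nearest-neighbour estimator of differential entropy, which likewise avoids explicit density estimation in high-dimensional feature spaces:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(samples, k=4):
    """Kozachenko-Leonenko k-NN estimate of differential entropy (in nats).

    samples: array of shape (N, d), e.g. feature vectors drawn from the center
    or surround region of a pixel neighbourhood.
    """
    samples = np.asarray(samples, dtype=np.float64)
    n, d = samples.shape
    # distance to the k-th nearest neighbour, excluding the point itself
    r_k = cKDTree(samples).query(samples, k=k + 1)[0][:, -1]
    log_unit_ball = (d / 2.0) * np.log(np.pi) - gammaln(d / 2.0 + 1.0)
    return digamma(n) - digamma(k) + log_unit_ball + d * np.mean(np.log(r_k + 1e-12))
```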
Dynamical optical flow of saliency maps for predicting visual attention
Saliency maps are used to understand human attention and visual fixation.
However, while very well established for static images, there is no general
agreement on how to compute a saliency map of dynamic scenes. In this paper we propose a mathematically rigorous approach to this problem, which incorporates the static saliency maps of each video frame into the calculation of the optical flow. Taking static saliency maps into account when calculating the optical flow allows the aperture problem to be overcome. Our approach is able to explain
human fixation behavior in situations which pose challenges to standard
approaches, such as when a fixated object disappears behind an occlusion and
reappears after several frames. In addition, we quantitatively compare our
model against alternative solutions using a large eye tracking data set.
Together, our results suggest that assessing optical flow information across a
series of saliency maps gives a highly accurate and useful account of human
overt attention in dynamic scenes.
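The general idea of tracking saliency across frames can be sketched as follows; this uses OpenCV's Farneback dense flow applied directly to consecutive saliency maps, which is only a crude stand-in for the paper's mathematically rigorous formulation:

```python
import cv2
import numpy as np

def saliency_flow(sal_prev, sal_next):
    """Dense optical flow between the saliency maps of frames t and t+1.

    sal_prev, sal_next: 2D float arrays in [0, 1]. Returns an (H, W, 2) array
    of per-pixel displacements of salient structure.
    """
    prev8 = (np.clip(sal_prev, 0, 1) * 255).astype(np.uint8)
    next8 = (np.clip(sal_next, 0, 1) * 255).astype(np.uint8)
    # Farneback parameters: pyr_scale, levels, winsize, iterations,
    # poly_n, poly_sigma, flags
    return cv2.calcOpticalFlowFarneback(prev8, next8, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```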
Deep Learning for Saliency Prediction in Natural Video
The purpose of this paper is the detection of salient areas in natural video using new deep learning techniques. Salient patches in video frames are predicted first, and visual fixation maps are then built upon them.
We design the deep architecture on the basis of CaffeNet implemented with Caffe
toolkit. We show that by changing the way data are selected for the optimisation of network parameters, we can reduce computation cost by a factor of up to 12. We extend deep learning approaches for saliency prediction in still images, which use RGB values, to the specificity of video by exploiting the sensitivity of the human visual system to residual motion. Furthermore, we complement primary colour pixel values with contrast features proposed in classical visual attention prediction models. The experiments are conducted on two publicly available datasets. The first is the IRCCYN video database, containing 31 videos with a total of 7300 frames and the eye fixations of 37 subjects. The second is HOLLYWOOD2, which provides 2517 movie clips with the eye fixations of 19 subjects. On the IRCCYN dataset, the accuracy obtained is 89.51%. On the HOLLYWOOD2 dataset, results in the prediction of patch saliency show an improvement of up to 2% compared with using RGB values only, with a resulting accuracy of 76.6%. The AUC metric comparing predicted saliency maps with visual fixation maps shows an increase of up to 16% on a sample of video clips from this dataset.
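A crude sketch of the residual-motion cue described above: estimate dense flow between consecutive frames, remove a global (camera) motion estimate, and keep the per-pixel remainder; using the median displacement as the global-motion estimate is an assumption, not the paper's procedure:

```python
import cv2
import numpy as np

def residual_motion(frame_prev, frame_next):
    """Per-pixel motion magnitude remaining after removing a global (camera)
    motion estimate, here taken as the median displacement over the frame."""
    g_prev = cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY)
    g_next = cv2.cvtColor(frame_next, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g_prev, g_next, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    global_motion = np.median(flow.reshape(-1, 2), axis=0)
    return np.linalg.norm(flow - global_motion, axis=2)
```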
Learning Gaze Transitions from Depth to Improve Video Saliency Estimation
In this paper we introduce a novel Depth-Aware Video Saliency approach to
predict human focus of attention when viewing RGBD videos on regular 2D
screens. We train a generative convolutional neural network which predicts a
saliency map for a frame, given the fixation map of the previous frame.
Saliency estimation in this scenario is highly important since in the near
future 3D video content will be easily acquired and yet hard to display. This
can be explained, on the one hand, by the dramatic improvement of 3D-capable
acquisition equipment. On the other hand, despite the considerable progress in
3D display technologies, most of the 3D displays are still expensive and
require wearing special glasses. To evaluate the performance of our approach,
we present a new comprehensive database of eye-fixation ground-truth for RGBD
videos. Our experiments indicate that integrating depth into video saliency
calculation is beneficial. We demonstrate that our approach outperforms
state-of-the-art methods for video saliency, achieving a 15% relative improvement.
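The input/output structure the abstract describes can be sketched roughly as below: a convolutional network mapping the current RGB-D frame plus the previous frame's fixation map to a saliency map. Layer widths and the plain feed-forward design are assumptions; the paper's generative network is not reproduced here:

```python
import torch
import torch.nn as nn

class DepthAwareSaliencyNet(nn.Module):
    """Maps the current RGB frame (3), its depth map (1), and the previous
    frame's fixation map (1) to a saliency map for the current frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(5, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=1),
        )

    def forward(self, rgb, depth, prev_fixation):
        x = torch.cat([rgb, depth, prev_fixation], dim=1)
        return torch.sigmoid(self.net(x))
```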
Computational models: Bottom-up and top-down aspects
Computational models of visual attention have become popular over the past decade, we believe primarily for two reasons: first, models make testable predictions that can be explored by experimentalists as well as theoreticians; second, models have practical and technological applications of interest to the
applied science and engineering communities. In this chapter, we take a
critical look at recent attention modeling efforts. We focus on {\em
computational models of attention} as defined by Tsotsos \& Rothenstein
\shortcite{Tsotsos_Rothenstein11}: Models which can process any visual stimulus
(typically, an image or video clip), which can possibly also be given some task
definition, and which make predictions that can be compared to human or animal
behavioral or physiological responses elicited by the same stimulus and task.
Thus, we here place less emphasis on abstract models, phenomenological models,
purely data-driven fitting or extrapolation models, or models specifically
designed for a single task or for a restricted class of stimuli. For
theoretical models, we refer the reader to a number of previous reviews that
address attention theories and models more generally
\cite{Itti_Koch01nrn,Paletta_etal05,Frintrop_etal10,Rothenstein_Tsotsos08,Gottlieb_Balan10,Toet11,Borji_Itti12pami}.
Bottom-up Attention, Models of
In this review, we examine recent progress in saliency prediction and propose several avenues for future research. In spite of tremendous efforts and huge progress, there is still room for improvement in terms of finer-grained analysis of deep saliency models, evaluation measures, datasets, annotation methods, cognitive studies, and new applications. This chapter will appear in the Encyclopedia of Computational Neuroscience.
Comment: arXiv admin note: substantial text overlap with arXiv:1810.0371