Deep Learning for Saliency Prediction in Natural Video
The purpose of this paper is the detection of salient areas in natural video
using recent deep learning techniques. Salient patches in video frames are
predicted first, and the predicted visual fixation maps are then built upon
them. We design the deep architecture on the basis of CaffeNet, implemented
with the Caffe toolkit. We show that by changing the way data are selected for
the optimisation of network parameters, we can reduce the computation cost by
up to 12 times. We extend deep learning approaches for saliency prediction in
still images, which use RGB values, to the specificities of video by
exploiting the sensitivity of the human visual system to residual motion.
Furthermore, we complement the primary colour pixel values with contrast
features proposed in classical visual attention prediction models. The
experiments are conducted on two publicly available datasets. The first is the
IRCCYN video database, containing 31 videos with an overall amount of 7300
frames and eye fixations of 37 subjects. The second is HOLLYWOOD2, which
provides 2517 movie clips with the eye fixations of 19 subjects. On the IRCCYN
dataset, the obtained accuracy is 89.51%. On the HOLLYWOOD2 dataset, results
in the prediction of patch saliency show an improvement of up to 2% with
regard to using RGB only, with a resulting accuracy of 76.6%. The AUC metric,
comparing predicted saliency maps with visual fixation maps, shows an increase
of up to 16% on a sample of video clips from this dataset.
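As an illustration of the patch-based input described above, the following minimal Python sketch stacks RGB values with a residual-motion channel per patch. The motion proxy (frame difference with its global median removed, standing in for camera-motion compensation) and the patch size are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): per-patch inputs that stack
# RGB with a residual-motion channel, as described in the abstract.
import numpy as np

def residual_motion(prev_frame, frame):
    """Crude residual-motion map: temporal difference with the global
    (camera-induced) component removed via the median. An assumption."""
    diff = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32)).mean(axis=2)
    return np.clip(diff - np.median(diff), 0.0, None)

def extract_patches(frame, prev_frame, patch=32):
    """Yield (patch, top-left) pairs where each patch stacks the 3 RGB
    channels with 1 residual-motion channel -> shape (patch, patch, 4)."""
    motion = residual_motion(prev_frame, frame)
    h, w = motion.shape
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            rgb = frame[y:y+patch, x:x+patch].astype(np.float32) / 255.0
            mot = motion[y:y+patch, x:x+patch, None] / (motion.max() + 1e-8)
            yield np.concatenate([rgb, mot], axis=2), (y, x)

# Usage on two random stand-in "frames":
prev_f = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
cur_f = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
patches = list(extract_patches(cur_f, prev_f))
print(len(patches), patches[0][0].shape)  # 70 patches of shape (32, 32, 4)
```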
Invariance Analysis of Saliency Models versus Human Gaze During Scene Free Viewing
Most current studies on human gaze and saliency modeling have used
high-quality stimuli. In the real world, however, captured images undergo
various types of distortions across the whole acquisition, transmission, and
display chain. Such distortions include motion blur, lighting variations and
rotation. Despite a few efforts, the influence of these ubiquitous distortions
on visual attention and saliency models has not been systematically
investigated. In this paper, we first create a large-scale database including
eye movements of 10 observers over 1900 images degraded by 19 types of
distortions. Second, by analyzing eye movements and saliency models, we find
that: a) observers look at different locations over distorted versus original
images, and b) the performance of saliency models is drastically hindered over
distorted images, with the maximum performance drop occurring for Rotation and
Shearing distortions. Finally, we investigate the effectiveness of different
distortions when serving as data augmentation transformations, as sketched
below. Experimental results verify that useful data augmentation
transformations, which preserve the human gaze of reference images, can
improve deep saliency models against distortions, while invalid
transformations, which severely change human gaze, will degrade the
performance.
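The following toy Python sketch illustrates the augmentation idea: apply a distortion to the image while reusing the original fixation map as the training label. The abstract reports geometric distortions (rotation, shearing) as the most disruptive, so the sketch uses photometric ones; treating those specific distortions and parameters as gaze-preserving is an assumption, not the paper's protocol.

```python
# Hedged sketch of gaze-preserving data augmentation: distort the image,
# keep the original saliency label. Distortion choices are illustrative.
import numpy as np

def augment_gaze_preserving(image, saliency_map, rng):
    """Return a distorted image paired with the *unchanged* saliency map.
    Geometric distortions (rotation/shearing) are excluded, since the
    abstract reports they change gaze the most."""
    img = image.astype(np.float32)
    if rng.random() < 0.5:                       # additive Gaussian noise
        img += rng.normal(0.0, 10.0, img.shape)
    else:                                        # global lighting variation
        img *= rng.uniform(0.6, 1.4)
    img = np.clip(img, 0, 255).astype(np.uint8)
    return img, saliency_map                     # label is reused as-is

rng = np.random.default_rng(0)
image = rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)
sal = rng.random((224, 224)).astype(np.float32)
distorted, label = augment_gaze_preserving(image, sal, rng)
print(distorted.shape, np.allclose(label, sal))  # (224, 224, 3) True
```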
Saliency detection based on structural dissimilarity induced by image quality assessment model
The distinctiveness of image regions is widely used as the cue of saliency.
Generally, the distinctiveness is computed according to the absolute difference
of features. However, according to image quality assessment (IQA) studies, the
human visual system is highly sensitive to structural changes rather than
absolute differences. Accordingly, we propose computing the structural
dissimilarity between image patches as the distinctiveness measure for
saliency detection. Similar to IQA models, the structural dissimilarity is
computed based on the correlation of structural features. The global
structural dissimilarity of a patch to all the other patches represents the
saliency of the patch. We incorporate two widely used structural features,
namely local contrast and gradient magnitude, into the structural
dissimilarity computation of the proposed model. Without any postprocessing,
the proposed model based on the correlation of either of the two structural
features outperforms 11 state-of-the-art saliency models on three saliency
databases.
Comment: For associated source code, see https://github.com/yangli-xjtu/SD
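A minimal sketch of the idea, under the simplest reading of the abstract: gradient magnitude as the per-patch structural feature, dissimilarity as one minus the Pearson correlation of patch features, and a patch's saliency as its mean dissimilarity to all other patches. The authors' exact formulation is in the linked repository; everything below is an illustrative assumption.

```python
# Structural-dissimilarity saliency sketch (gradient-magnitude variant).
import numpy as np

def gradient_magnitude(gray):
    gy, gx = np.gradient(gray.astype(np.float32))
    return np.sqrt(gx**2 + gy**2)

def structural_saliency(gray, patch=16):
    gm = gradient_magnitude(gray)
    h, w = gm.shape
    feats, coords = [], []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            feats.append(gm[y:y+patch, x:x+patch].ravel())
            coords.append((y, x))
    # Pearson correlation between every pair of patch feature vectors
    # (assumes non-constant patches; corrcoef is undefined for flat ones)
    corr = np.corrcoef(np.stack(feats))
    dissim = 1.0 - corr                       # structural dissimilarity
    scores = dissim.mean(axis=1)              # global dissimilarity per patch
    sal_map = np.zeros_like(gm)
    for s, (y, x) in zip(scores, coords):
        sal_map[y:y+patch, x:x+patch] = s
    return sal_map

gray = np.random.rand(128, 128)
print(structural_saliency(gray).shape)  # (128, 128)
```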
A Locally Weighted Fixation Density-Based Metric for Assessing the Quality of Visual Saliency Predictions
With the increased focus on visual attention (VA) in the last decade, a large
number of computational visual saliency methods have been developed over the
past few years. These models are traditionally evaluated by using performance
evaluation metrics that quantify the match between predicted saliency and
fixation data obtained from eye-tracking experiments on human observers. Though
a considerable number of such metrics have been proposed in the literature,
they exhibit notable problems. In this work, we discuss the shortcomings of
existing metrics through illustrative examples and propose a new metric that
uses local weights based on fixation density to overcome these flaws. To
compare the performance of our proposed metric at assessing the quality of
saliency prediction with other existing metrics, we construct a ground-truth
subjective database in which saliency maps obtained from 17 different VA models
are evaluated by 16 human observers on a 5-point categorical scale in terms of
their visual resemblance with corresponding ground-truth fixation density maps
obtained from eye-tracking data. The metrics are evaluated by correlating
metric scores with the human subjective ratings. The correlation results show
that the proposed evaluation metric outperforms all other popular existing
metrics. Additionally, the constructed database and corresponding subjective
ratings provide insight into which existing and future metrics better estimate
the quality of saliency prediction, and they can be used as a benchmark.
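The abstract does not give the exact weighting scheme, but the general idea of a locally weighted metric can be sketched as follows: per-pixel disagreement between the predicted saliency map and the fixation density map is weighted by that density, so errors at strongly fixated locations count more. All details below are illustrative assumptions, not the proposed metric.

```python
# Toy locally weighted agreement score between a predicted saliency map
# and a ground-truth fixation density map. Higher is better.
import numpy as np

def locally_weighted_score(pred, fix_density, eps=1e-8):
    """Both inputs are 2D maps; fix_density doubles as the local weight."""
    p = (pred - pred.min()) / (pred.max() - pred.min() + eps)
    w = fix_density / (fix_density.sum() + eps)   # weights sum to 1
    err = (p - fix_density) ** 2                  # per-pixel squared error
    return 1.0 - float(np.sum(w * err))           # density-weighted agreement

rng = np.random.default_rng(1)
pred = rng.random((64, 64))
density = rng.random((64, 64))
print(round(locally_weighted_score(pred, density), 4))
```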
TurkerGaze: Crowdsourcing Saliency with Webcam based Eye Tracking
Traditional eye tracking requires specialized hardware, which means
collecting gaze data from many observers is expensive, tedious and slow.
Therefore, existing saliency prediction datasets are orders of magnitude
smaller than typical datasets for other vision recognition tasks. The small
size of these datasets limits the potential for training data-intensive
algorithms and causes overfitting in benchmark evaluation. To address this
deficiency, this paper introduces a webcam-based gaze tracking system that
supports large-scale, crowdsourced eye tracking deployed on Amazon Mechanical
Turk (AMTurk). Through a combination of careful algorithm and gaming-protocol
design, our system obtains eye tracking data for saliency prediction
comparable to data gathered in a traditional lab setting, at lower cost and
with less effort on the part of the researchers. Using this tool, we build a
saliency dataset for a large number of natural images. We will open-source our
tool and provide a web server where researchers can upload their images to get
eye tracking results from AMTurk.
Comment: 9 pages, 14 figures
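Webcam gaze trackers typically include a calibration step that regresses from eye-appearance features to screen coordinates; the toy sketch below shows such a step with closed-form ridge regression. It is a generic illustration, not TurkerGaze's actual pipeline, and the feature extraction is faked with random vectors.

```python
# Toy gaze-calibration step: ridge-regress eye features -> screen (x, y).
import numpy as np

def fit_gaze_regressor(features, targets, lam=1e-2):
    """Closed-form ridge regression: W = (X^T X + lam*I)^-1 X^T Y."""
    X = np.hstack([features, np.ones((len(features), 1))])  # bias column
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)

def predict_gaze(W, feature):
    return np.append(feature, 1.0) @ W

rng = np.random.default_rng(2)
feats = rng.random((40, 128))           # 40 calibration samples, 128-d each
screen_xy = rng.random((40, 2)) * [1920, 1080]
W = fit_gaze_regressor(feats, screen_xy)
print(predict_gaze(W, feats[0]).round(1))  # close to the first target
```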
Benchmark 3D eye-tracking dataset for visual saliency prediction on stereoscopic 3D video
Visual Attention Models (VAMs) predict the regions of an image or video that
are most likely to attract human attention. Although saliency detection is
well explored for 2D image and video content, only a few attempts have been
made to design 3D saliency prediction models. Newly proposed 3D visual
attention models have to be validated over large-scale video saliency
prediction datasets that also contain eye-tracking information. There are
several publicly available eye-tracking datasets for 2D image and video
content. In the case of 3D, however, the research community still needs
large-scale video saliency datasets for validating different 3D-VAMs. In this
paper, we introduce a large-scale dataset containing eye-tracking data
collected from 24 subjects who participated in a free-viewing test over 61
stereoscopic 3D videos (and their 2D versions). We evaluate the performance of
existing saliency detection methods over the proposed dataset. In addition, we
have created an online benchmark for validating the performance of existing 2D
and 3D visual attention models and for facilitating the addition of new VAMs
to the benchmark. Our benchmark currently contains 50 different VAMs.
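Benchmarks of this kind commonly score a model by comparing its saliency map against recorded fixations with an AUC variant; the sketch below implements one such variant (fixated pixels as positives, all remaining pixels as negatives). Whether this matches the benchmark's exact implementation is an assumption.

```python
# AUC-Judd style scoring of a saliency map against fixation locations.
import numpy as np

def saliency_auc(sal_map, fixations, n_thresh=100):
    """Fixated pixels are positives, the rest negatives; sweep thresholds
    over the saliency map and integrate the resulting ROC curve."""
    rows, cols = zip(*fixations)
    pos = sal_map[rows, cols]                 # saliency at fixations
    neg_mask = np.ones(sal_map.shape, dtype=bool)
    neg_mask[rows, cols] = False
    neg = sal_map[neg_mask]
    ts = np.linspace(sal_map.min(), sal_map.max(), n_thresh)
    tpr = np.array([(pos >= t).mean() for t in ts])
    fpr = np.array([(neg >= t).mean() for t in ts])
    order = np.argsort(fpr)                   # trapezoidal integration
    fpr, tpr = fpr[order], tpr[order]
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

rng = np.random.default_rng(3)
sal = rng.random((48, 64))
fixations = [(10, 20), (30, 40), (5, 5)]
print(round(saliency_auc(sal, fixations), 3))  # near 0.5 for a random map
```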
The Effect of Distortions on the Prediction of Visual Attention
Existing saliency models have been designed and evaluated for predicting the
saliency in distortion-free images. However, in practice, the image quality is
affected by a host of factors at several stages of the image processing
pipeline such as acquisition, compression and transmission. Several studies
have explored the effect of distortion on human visual attention; however, none
of them have considered the performance of visual saliency models in the
presence of distortion. Furthermore, given that one potential application of
visual saliency prediction is to aid pooling of objective visual quality
metrics, it is important to compare the performance of existing saliency models
on distorted images. In this paper, we evaluate several state-of-the-art visual
attention models over different databases consisting of distorted images with
various types of distortions such as blur, noise and compression with varying
levels of distortion severity. This paper also introduces new improved
performance evaluation metrics that are shown to overcome shortcomings in
existing performance metrics. We find that the performance of most models
improves at moderate and high levels of distortion as compared to the near
distortion-free case. In addition, model performance is also found to decrease
with an increase in image complexity.
Comment: 14 pages, 2 columns, 14 figures
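As context for the pooling application mentioned above, the following sketch shows the simplest form of saliency-weighted pooling of a per-pixel objective quality map (e.g., a local SSIM map). It illustrates the general idea only, not a method from the paper.

```python
# Pool a per-pixel quality map into one score, weighting by saliency.
import numpy as np

def saliency_weighted_pooling(quality_map, sal_map, eps=1e-8):
    """Weighted average that emphasizes regions predicted to draw gaze."""
    w = sal_map / (sal_map.sum() + eps)       # normalized saliency weights
    return float(np.sum(w * quality_map))

rng = np.random.default_rng(4)
quality = rng.random((32, 32))                # stand-in for a local SSIM map
sal = rng.random((32, 32))
print(round(saliency_weighted_pooling(quality, sal), 4))
```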
Learning Gaze Transitions from Depth to Improve Video Saliency Estimation
In this paper we introduce a novel Depth-Aware Video Saliency approach to
predict human focus of attention when viewing RGBD videos on regular 2D
screens. We train a generative convolutional neural network which predicts a
saliency map for a frame, given the fixation map of the previous frame.
Saliency estimation in this scenario is highly important since in the near
future 3D video content will be easily acquired and yet hard to display. This
can be explained, on the one hand, by the dramatic improvement of 3D-capable
acquisition equipment. On the other hand, despite the considerable progress in
3D display technologies, most of the 3D displays are still expensive and
require wearing special glasses. To evaluate the performance of our approach,
we present a new comprehensive database of eye-fixation ground-truth for RGBD
videos. Our experiments indicate that integrating depth into video saliency
calculation is beneficial. We demonstrate that our approach outperforms
state-of-the-art methods for video saliency, achieving a 15% relative
improvement.
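The input/output interface described in the abstract can be sketched in PyTorch as follows: an RGB-D frame plus the previous frame's fixation map go in, a saliency map comes out. The tiny architecture is a placeholder assumption, not the authors' generative network.

```python
# Placeholder depth-aware saliency net matching the abstract's interface.
import torch
import torch.nn as nn

class DepthAwareSaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # 5 input channels: R, G, B, depth, previous-frame fixation map
        self.net = nn.Sequential(
            nn.Conv2d(5, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid(),  # per-pixel saliency in [0,1]
        )

    def forward(self, rgb, depth, prev_fixation):
        x = torch.cat([rgb, depth, prev_fixation], dim=1)
        return self.net(x)

net = DepthAwareSaliencyNet()
rgb = torch.rand(1, 3, 120, 160)
depth = torch.rand(1, 1, 120, 160)
prev_fix = torch.rand(1, 1, 120, 160)
print(net(rgb, depth, prev_fix).shape)  # torch.Size([1, 1, 120, 160])
```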
Fixation Data Analysis for High Resolution Satellite Images
The presented study is an eye tracking experiment for high-resolution
satellite (HRS) images. The reported experiment explores the Area Of Interest
(AOI) based analysis of eye fixation data for complex HRS images. The study
reflects the need for reference data for bottom-up saliency-based segmentation
and the difficulty of analyzing eye tracking data for complex satellite
images. The fixation data analysis aims at creating reference data for
bottom-up saliency-based segmentation of high-resolution satellite images. The
analytical outcome of this experimental study provides a solution for
AOI-based analysis of fixation data in the complex environment of satellite
images, along with recommendations for reference data construction, which is
already an ongoing effort.
Comment: An extended version has been submitted to the SPIE-2018 conference
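A simple way to realize AOI-based fixation analysis, assuming axis-aligned rectangular AOIs (the study's actual AOI shapes may differ), is to assign each fixation to the AOI containing it and report per-AOI counts and shares:

```python
# Toy AOI-based fixation statistics; AOI names and geometry are hypothetical.
def aoi_statistics(fixations, aois):
    """fixations: list of (x, y); aois: dict name -> (x0, y0, x1, y1)."""
    counts = {name: 0 for name in aois}
    for x, y in fixations:
        for name, (x0, y0, x1, y1) in aois.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                counts[name] += 1
                break                        # assign to first matching AOI
    total = max(len(fixations), 1)
    return {n: (c, c / total) for n, c in counts.items()}

aois = {"runway": (100, 200, 400, 260), "buildings": (500, 50, 700, 300)}
fixations = [(150, 230), (390, 255), (600, 120), (20, 20)]
print(aoi_statistics(fixations, aois))
# {'runway': (2, 0.5), 'buildings': (1, 0.25)}
```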
Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM
Over the past few years, deep neural networks (DNNs) have exhibited great
success in predicting the saliency of images. However, there are few works that
apply DNNs to predict the saliency of generic videos. In this paper, we propose
a novel DNN-based video saliency prediction method. Specifically, we establish
a large-scale eye-tracking database of videos (LEDOV), which provides
sufficient data to train the DNN models for predicting video saliency. Through
the statistical analysis of our LEDOV database, we find that human attention is
normally attracted by objects, particularly moving objects or the moving parts
of objects. Accordingly, we propose an object-to-motion convolutional neural
network (OM-CNN) to learn spatio-temporal features for predicting the
intra-frame saliency by exploiting information about both objectness and
object motion. We further find from our database that human attention is
temporally correlated, with smooth saliency transitions across video
frames. Therefore, we develop a two-layer convolutional long short-term memory
(2C-LSTM) network in our DNN-based method, using the extracted features of
OM-CNN as the input. Consequently, the inter-frame saliency maps of videos can
be generated, which consider the transition of attention across video frames.
Finally, the experimental results show that our method advances the
state-of-the-art in video saliency prediction.
Comment: Jiang, Lai; Xu, Mai; Liu, Tie; Qiao, Minglang; Wang, Zulin. DeepVS: A
Deep Learning Based Video Saliency Prediction Approach. The European
Conference on Computer Vision (ECCV), September 2018.
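The two-layer convolutional LSTM can be sketched with a generic textbook ConvLSTM cell; the code below stacks two such cells over a short feature sequence (standing in for OM-CNN outputs). It is a standard formulation of the recurrent unit, not the paper's exact network or hyperparameters.

```python
# Generic ConvLSTM cell, stacked twice as in a two-layer convolutional LSTM.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # one convolution produces all four gates at once
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g                 # cell state carries attention memory
        h = o * c.tanh()
        return h, (h, c)

# Two stacked cells processing a 5-frame sequence of 32-channel features:
cells = [ConvLSTMCell(32, 32), ConvLSTMCell(32, 32)]
h = [torch.zeros(1, 32, 28, 28) for _ in cells]
c = [torch.zeros(1, 32, 28, 28) for _ in cells]
for _ in range(5):
    x = torch.rand(1, 32, 28, 28)         # stand-in for OM-CNN features
    for j, cell in enumerate(cells):
        x, (h[j], c[j]) = cell(x, (h[j], c[j]))
print(x.shape)  # torch.Size([1, 32, 28, 28])
```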