Benchmark 3D eye-tracking dataset for visual saliency prediction on stereoscopic 3D video
Visual Attention Models (VAMs) predict the regions of an image or video that are most likely to attract human attention. Although saliency detection is well explored for 2D image and video content, only a few attempts have been made to design 3D saliency prediction models. Newly proposed 3D visual attention models have to be validated on large-scale video saliency datasets that include eye-tracking data.
There are several publicly available eye-tracking datasets for 2D image and
video content. In the case of 3D, however, the research community still needs large-scale video saliency datasets for validating different 3D-VAMs. In this paper, we introduce a large-scale dataset containing eye-tracking data collected from 24 subjects who watched 61 stereoscopic 3D videos (and their 2D versions) in a free-viewing test. We
evaluate the performance of the existing saliency detection methods over the
proposed dataset. In addition, we created an online benchmark for validating the performance of existing 2D and 3D visual attention models and for facilitating the addition of new VAMs. The benchmark currently contains 50 different VAMs.
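For context, validating a VAM against such eye-tracking data typically reduces to scoring each predicted saliency map against the recorded fixations. Below is a minimal sketch of one widely used metric, AUC-Judd, in Python; the function name and the NumPy/scikit-learn dependencies are illustrative, not part of the paper.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def auc_judd(saliency_map, fixation_map):
        # saliency_map: 2D float array of predicted saliency values.
        # fixation_map: 2D binary array, nonzero where a subject fixated.
        scores = saliency_map.ravel().astype(np.float64)
        fixated = fixation_map.ravel() > 0
        # Fixated pixels are positives, all remaining pixels negatives;
        # the saliency values are scored as a binary classifier.
        return roc_auc_score(fixated, scores)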
Cube Padding for Weakly-Supervised Saliency Prediction in 360° Videos
Automatic saliency prediction in 360° videos is critical for viewpoint guidance applications (e.g., Facebook 360 Guide). We propose a spatial-temporal network which is (1) trained in a weakly-supervised manner and (2) tailor-made for the 360° viewing sphere. Note that most existing methods are less scalable
since they rely on annotated saliency maps for training. Most importantly, they convert the 360° sphere to 2D images (e.g., a single equirectangular image or multiple separate Normal Field-of-View (NFoV) images), which introduces distortion and image boundaries. In contrast, we propose a simple and effective
Cube Padding (CP) technique as follows. First, we render the 360° view on the six faces of a cube using perspective projection, which introduces very little distortion. Then, we concatenate all six faces while utilizing the connectivity between faces on the cube for image padding (i.e., Cube Padding) in convolution, pooling, and convolutional LSTM layers. In this way, CP introduces
no image boundary while being applicable to almost all Convolutional Neural
Network (CNN) structures. To evaluate our method, we propose Wild-360, a new
360° video saliency dataset, containing challenging videos with saliency
heatmap annotations. In experiments, our method outperforms baseline methods in
both speed and quality.
Comment: CVPR 201
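To make the padding idea concrete, here is a minimal PyTorch sketch of the core operation for the four side faces of the cube, which form a horizontal ring; for simplicity it assumes face orientations where no rotation is needed, whereas the full method also pads top and bottom edges from the top/bottom faces, which requires rotating the copied strips. This is an illustration, not the authors' code.

    import torch

    def ring_cube_padding(faces, pad):
        # faces: (B, 4, C, H, W), the four side faces in ring order
        # (front, right, back, left), so that the right edge of face i
        # touches the left edge of face (i + 1) % 4.
        left_nb = torch.roll(faces, shifts=1, dims=1)    # face i-1 for each i
        right_nb = torch.roll(faces, shifts=-1, dims=1)  # face i+1 for each i
        left_strip = left_nb[..., -pad:]    # columns just across the left edge
        right_strip = right_nb[..., :pad]   # columns just across the right edge
        # Pad with real neighbouring pixels instead of zeros, so the next
        # convolution sees no artificial image boundary between faces.
        return torch.cat([left_strip, faces, right_strip], dim=-1)

Applied before each convolution (with that layer's own padding disabled along the padded axis), this kind of operation is what lets CP drop into standard convolution, pooling, and ConvLSTM layers.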
Saliency Prediction in the Deep Learning Era: Successes, Limitations, and Future Challenges
Visual saliency models have enjoyed a big leap in performance in recent years, thanks to advances in deep learning and large-scale annotated data. Despite enormous effort and huge breakthroughs, however, models still fall short of reaching human-level accuracy. In this work, I explore the landscape of the field, emphasizing new deep saliency models, benchmarks, and datasets.
A large number of image and video saliency models are reviewed and compared over two image benchmarks and two large-scale video datasets. Further, I
identify factors that contribute to the gap between models and humans and
discuss remaining issues that need to be addressed to build the next generation
of more powerful saliency models. Some specific questions that are addressed
include: in what ways current models fail, how to remedy them, what can be
learned from cognitive studies of attention, how explicit saliency judgments
relate to fixations, how to conduct fair model comparison, and what the emerging applications of saliency models are.
Salient Object Detection in the Deep Learning Era: An In-Depth Survey
As an essential problem in computer vision, salient object detection (SOD)
has attracted an increasing amount of research attention over the years. Recent
advances in SOD are predominantly led by deep learning-based solutions (named
deep SOD). To enable in-depth understanding of deep SOD, in this paper, we
provide a comprehensive survey covering various aspects, ranging from algorithm
taxonomy to unsolved issues. In particular, we first review deep SOD algorithms
from different perspectives, including network architecture, level of
supervision, learning paradigm, and object-/instance-level detection. Following
that, we summarize and analyze existing SOD datasets and evaluation metrics.
Then, we benchmark a large group of representative SOD models, and provide
detailed analyses of the comparison results. Moreover, we study the performance
of SOD algorithms under different attribute settings, which has not been
thoroughly explored previously, by constructing a novel SOD dataset with rich
attribute annotations covering various salient object types, challenging
factors, and scene categories. We further analyze, for the first time in the
field, the robustness of SOD models to random input perturbations and
adversarial attacks. We also look into the generalization and difficulty of
existing SOD datasets. Finally, we discuss several open issues of SOD and
outline future research directions.
Comment: Published in IEEE TPAMI. All the saliency prediction maps, our constructed dataset with annotations, and code for evaluation are publicly available at https://github.com/wenguanwang/SODsurvey.
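As an illustration of the robustness analysis mentioned above, one way to probe a model is to measure how its score degrades as random noise is added to the input. A rough Python sketch follows; model, metric, and the data lists are placeholders, not the survey's actual evaluation code.

    import numpy as np

    def robustness_curve(model, images, masks, metric, sigmas=(0, 4, 8, 16)):
        # For each noise level, corrupt the inputs and re-score the model.
        results = []
        for sigma in sigmas:
            noisy = [np.clip(im + np.random.normal(0, sigma, im.shape), 0, 255)
                     for im in images]
            preds = [model(im) for im in noisy]
            results.append((sigma,
                            np.mean([metric(p, m) for p, m in zip(preds, masks)])))
        return results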
Deep Learning for Saliency Prediction in Natural Video
The purpose of this paper is the detection of salient areas in natural video using new deep learning techniques. Salient patches in video frames are predicted first; the visual fixation maps are then built upon these predictions. We design the deep architecture on the basis of CaffeNet, implemented with the Caffe toolkit. We show that by changing the way data are selected for optimising the network parameters, we can reduce the computation cost by up to a factor of 12. We extend deep learning approaches for saliency prediction in still images, which use RGB values, to video by exploiting the sensitivity of the human visual system to residual motion. Furthermore, we supplement the primary colour pixel values with contrast features proposed in classical visual attention prediction models. The experiments are conducted on two publicly available datasets. The first is the IRCCYN video database, containing 31 videos (7,300 frames in total) with eye fixations from 37 subjects. The second is HOLLYWOOD2, which provides 2,517 movie clips with eye fixations from 19 subjects. On the IRCCYN dataset, the accuracy obtained is 89.51%. On the HOLLYWOOD2 dataset, patch saliency prediction improves by up to 2% compared with using RGB values alone, yielding an accuracy of 76.6%. The AUC metric, comparing predicted saliency maps with visual fixation maps, increases by up to 16% on a sample of video clips from this dataset.
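To illustrate the residual-motion cue exploited here, a common recipe is to estimate dense optical flow between consecutive frames, subtract an estimate of the global (camera) motion, and keep the magnitude of what remains. A rough OpenCV sketch, with illustrative parameter values that are not taken from the paper:

    import cv2
    import numpy as np

    def residual_motion_map(prev_gray, gray):
        # Dense optical flow between two consecutive grayscale frames.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Crude global-motion estimate: the median flow vector.
        camera = np.median(flow.reshape(-1, 2), axis=0)
        # Residual motion = local motion minus estimated camera motion.
        return np.linalg.norm(flow - camera, axis=-1)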
Learning Gaze Transitions from Depth to Improve Video Saliency Estimation
In this paper we introduce a novel Depth-Aware Video Saliency approach to
predict human focus of attention when viewing RGBD videos on regular 2D
screens. We train a generative convolutional neural network which predicts a
saliency map for a frame, given the fixation map of the previous frame.
Saliency estimation in this scenario is highly important since in the near
future 3D video content will be easily acquired and yet hard to display. This
can be explained, on the one hand, by the dramatic improvement of 3D-capable
acquisition equipment. On the other hand, despite the considerable progress in
3D display technologies, most of the 3D displays are still expensive and
require wearing special glasses. To evaluate the performance of our approach,
we present a new comprehensive database of eye-fixation ground-truth for RGBD
videos. Our experiments indicate that integrating depth into video saliency
calculation is beneficial. We demonstrate that our approach outperforms state-of-the-art methods for video saliency, achieving a 15% relative improvement.
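A minimal sketch of how such a depth-aware, temporally conditioned predictor might assemble its input (the layer sizes and names are illustrative, not the authors' architecture):

    import torch
    import torch.nn as nn

    class DepthAwareSaliencyNet(nn.Module):
        # Input channels: RGB (3) + depth (1) + previous-frame fixation map (1).
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(5, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 1, 1), nn.Sigmoid(),  # per-pixel saliency
            )

        def forward(self, rgb, depth, prev_fixation):
            x = torch.cat([rgb, depth, prev_fixation], dim=1)
            return self.net(x)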
Enhancing Salient Object Segmentation Through Attention
Segmenting salient objects in an image is an important vision task with
ubiquitous applications. The problem becomes more challenging in the presence
of a cluttered and textured background, low resolution and/or low contrast
images. Even though existing algorithms perform well in segmenting most of the object(s) of interest, they often end up segmenting false positives as well: background regions that resemble salient objects. In this work, we tackle this
problem by iteratively attending to image patches in a recurrent fashion and
subsequently enhancing the predicted segmentation mask. Saliency features are estimated independently for every image patch and are then combined using an aggregation strategy based on a Convolutional Gated Recurrent Unit (ConvGRU)
network. The proposed approach works in an end-to-end manner, removing
background noise and false positives incrementally. Through extensive
evaluation on various benchmark datasets, we show superior performance to the
existing approaches without any post-processing.
Comment: CVPRW - Deep Vision 201
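For reference, a ConvGRU replaces the matrix products of a standard GRU with convolutions so that the hidden state stays spatial. A minimal PyTorch cell is sketched below; it is a generic formulation, not the authors' implementation.

    import torch
    import torch.nn as nn

    class ConvGRUCell(nn.Module):
        def __init__(self, in_ch, hid_ch, k=3):
            super().__init__()
            self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
            self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

        def forward(self, x, h):
            # Update (z) and reset (r) gates, computed convolutionally.
            z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
            h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
            return (1 - z) * h + z * h_tilde  # updated spatial hidden state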
Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction
Computational saliency models for still images have gained significant
popularity in recent years. Saliency prediction from videos, on the other hand,
has received relatively little interest from the community. Motivated by this,
in this work, we study the use of deep learning for dynamic saliency prediction
and propose the so-called spatio-temporal saliency networks. The key to our
models is the architecture of two-stream networks where we investigate
different fusion mechanisms to integrate spatial and temporal information. We
evaluate our models on the DIEM and UCF-Sports datasets and present highly
competitive results against the existing state-of-the-art models. We also carry
out some experiments on a number of still images from the MIT300 dataset by
exploiting the optical flow maps predicted from these images. Our results show
that considering inherent motion information in this way can be helpful for
static saliency estimation.
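The fusion question studied here can be made concrete with a toy layer that merges spatial (appearance) and temporal (motion) feature maps. The sketch below shows two simple options, element-wise summation and learned 1x1-convolution fusion; names and sizes are illustrative, not the paper's networks.

    import torch
    import torch.nn as nn

    class TwoStreamFusion(nn.Module):
        def __init__(self, ch, mode="conv"):
            super().__init__()
            self.mode = mode
            # A 1x1 convolution lets the network learn how to weight
            # the two streams channel by channel.
            self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)

        def forward(self, spatial_feat, temporal_feat):
            if self.mode == "sum":  # direct element-wise fusion
                return spatial_feat + temporal_feat
            return self.fuse(torch.cat([spatial_feat, temporal_feat], dim=1))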
Predicting Head Movement in Panoramic Video: A Deep Reinforcement Learning Approach
Panoramic video provides an immersive and interactive experience by enabling
humans to control the field of view (FoV) through head movement (HM). Thus, HM
plays a key role in modeling human attention on panoramic video. This paper
establishes a database collecting subjects' HM in panoramic video sequences.
From this database, we find that the HM data are highly consistent across
subjects. Furthermore, we find that deep reinforcement learning (DRL) can be
applied to predict HM positions, via maximizing the reward of imitating human
HM scanpaths through the agent's actions. Based on our findings, we propose a
DRL-based HM prediction (DHP) approach with offline and online versions, called
offline-DHP and online-DHP. In offline-DHP, multiple DRL workflows are run to
determine potential HM positions at each panoramic frame. Then, a heat map of
the potential HM positions, named the HM map, is generated as the output of
offline-DHP. In online-DHP, the next HM position of one subject is estimated
given the currently observed HM position, which is achieved by developing a DRL
algorithm upon the learned offline-DHP model. Finally, the experiments validate
that our approach is effective in both offline and online prediction of HM
positions for panoramic video, and that the learned offline-DHP model can
improve the performance of online-DHP.
Comment: 15 pages, 10 figures, published in TPAMI 201
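As a toy illustration of "maximizing the reward of imitating human HM scanpaths", a per-step reward can score how well the agent's chosen viewing direction matches the recorded human one; the actual reward used in the paper may be defined differently.

    import numpy as np

    def imitation_reward(agent_dir, human_dir):
        # agent_dir, human_dir: 3D unit vectors giving viewing directions
        # on the panoramic sphere. The reward is highest when the agent's
        # head movement points where the human subject actually looked.
        return float(np.dot(agent_dir, human_dir))  # cosine similarity in [-1, 1]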
FaceSpoof Buster: a Presentation Attack Detector Based on Intrinsic Image Properties and Deep Learning
Nowadays, the adoption of face recognition in biometric authentication systems is common, mainly because the face is one of the most accessible biometric modalities. Techniques that attempt to bypass these kinds of systems using a forged biometric sample, such as a printed photograph or a recorded video of a genuine access, are known as presentation attacks, but may also be referred to in the literature as face spoofing. Presentation attack detection (PAD) is a crucial step for preventing this kind of unauthorized access to restricted areas and/or devices. In this paper, we propose a novel approach which relies on a combination of intrinsic image properties and deep neural networks to detect presentation attack attempts. Our method explores depth, salience, and illumination maps, combined with a pre-trained Convolutional Neural Network, to produce robust and discriminative features. Each of these properties is classified individually and, at the end of the process, the outputs are combined by a meta-learning classifier, which achieves outstanding results on the most popular PAD datasets. Results show that the proposed method surpasses state-of-the-art results in an inter-dataset protocol, which is considered the most challenging in the literature.
Comment: 7 pages, 1 figure, 7 tables
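The combination step described above is essentially stacking: one classifier per intrinsic-property map, with a meta-learner on top of their outputs. A generic scikit-learn sketch follows; note that in the paper each base classifier sees features from a different property map, whereas this simplified sketch feeds all of them one shared feature matrix.

    from sklearn.ensemble import StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    # One base classifier per intrinsic property (depth, salience,
    # illumination); the meta-learner fuses their predictions into the
    # final live-vs-attack decision.
    estimators = [
        ("depth", SVC(probability=True)),
        ("salience", SVC(probability=True)),
        ("illumination", SVC(probability=True)),
    ]
    detector = StackingClassifier(estimators=estimators,
                                  final_estimator=LogisticRegression())
    # Usage: detector.fit(X_train, y_train); detector.predict(X_test)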