Deep Learning for Saliency Prediction in Natural Video
The purpose of this paper is the detection of salient areas in natural video
using new deep learning techniques. Salient patches in video frames are
predicted first. Then the predicted visual fixation maps are built upon them.
We design the deep architecture on the basis of CaffeNet, implemented with the
Caffe toolkit. We show that by changing the way data are selected for
optimisation of network parameters, we can reduce computation cost by up to 12
times. We extend deep learning approaches for saliency prediction in still
images with RGB values to the specificity of video, using the sensitivity of
the human visual system to residual motion. Furthermore, we complement primary
colour pixel values with contrast features proposed in classical visual
attention prediction models. The
experiments are conducted on two publicly available datasets. The first is the
IRCCYN video database, containing 31 videos with a total of 7300
frames and eye fixations of 37 subjects. The second is HOLLYWOOD2, which provides
2517 movie clips with the eye fixations of 19 subjects. On the IRCCYN dataset, the
accuracy obtained is 89.51%. On the HOLLYWOOD2 dataset, results in the prediction of
patch saliency show an improvement of up to 2% over the use of RGB values only,
with a resulting accuracy of 76.6%. The AUC metric comparing
predicted saliency maps with visual fixation maps shows an increase of up to 16%
on a sample of video clips from this dataset.
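As a rough illustration of how an AUC score can compare a predicted saliency map with a fixation map, the sketch below treats fixated pixels as positives and all other pixels as negatives. The abstract does not say which AUC variant the authors use, so this pairwise formulation, the function name `saliency_auc`, and the toy maps are assumptions made for illustration only.

```python
def saliency_auc(saliency, fixations):
    """AUC of a saliency map against a binary fixation map (illustrative variant).

    saliency  : 2D list of floats, predicted saliency per pixel
    fixations : 2D list of 0/1, 1 where a human fixation landed
    """
    # Split saliency scores into fixated (positive) and non-fixated (negative) pixels.
    pos = [s for srow, frow in zip(saliency, fixations)
           for s, f in zip(srow, frow) if f]
    neg = [s for srow, frow in zip(saliency, fixations)
           for s, f in zip(srow, frow) if not f]
    if not pos or not neg:
        raise ValueError("need at least one fixated and one non-fixated pixel")

    # AUC equals the probability that a fixated pixel outscores a
    # non-fixated one; ties count as half a win.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy 2x2 example: both fixated pixels receive the highest saliency.
predicted = [[0.9, 0.1],
             [0.2, 0.8]]
fixated = [[1, 0],
           [0, 1]]
print(saliency_auc(predicted, fixated))  # 1.0
```

A score of 0.5 corresponds to chance level and 1.0 to a map that ranks every fixated pixel above every non-fixated one. The pairwise loop is quadratic in the number of pixels, so a rank-based computation would be preferable for full-resolution frames.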
Saliency Prediction in the Deep Learning Era: Successes, Limitations, and Future Challenges
Visual saliency models have enjoyed a big leap in performance in recent
years, thanks to advances in deep learning and large scale annotated data.
Despite enormous effort and huge breakthroughs, however, models still fall
short in reaching human-level accuracy. In this work, I explore the landscape
of the field, emphasizing new deep saliency models, benchmarks, and datasets.
A large number of image and video saliency models are reviewed and compared
over two image benchmarks and two large scale video datasets. Further, I
identify factors that contribute to the gap between models and humans and
discuss remaining issues that need to be addressed to build the next generation
of more powerful saliency models. Some specific questions that are addressed
include: in what ways current models fail, how to remedy them, what can be
learned from cognitive studies of attention, how explicit saliency judgments
relate to fixations, how to conduct fair model comparison, and what are the
emerging applications of saliency models.
Predicting Head Movement in Panoramic Video: A Deep Reinforcement Learning Approach
Panoramic video provides immersive and interactive experience by enabling
humans to control the field of view (FoV) through head movement (HM). Thus, HM
plays a key role in modeling human attention on panoramic video. This paper
establishes a database collecting subjects' HM in panoramic video sequences.
From this database, we find that the HM data are highly consistent across
subjects. Furthermore, we find that deep reinforcement learning (DRL) can be
applied to predict HM positions, via maximizing the reward of imitating human
HM scanpaths through the agent's actions. Based on our findings, we propose a
DRL-based HM prediction (DHP) approach with offline and online versions, called
offline-DHP and online-DHP. In offline-DHP, multiple DRL workflows are run to
determine potential HM positions at each panoramic frame. Then, a heat map of
the potential HM positions, named the HM map, is generated as the output of
offline-DHP. In online-DHP, the next HM position of one subject is estimated
given the currently observed HM position, which is achieved by developing a DRL
algorithm upon the learned offline-DHP model. Finally, the experiments validate
that our approach is effective in both offline and online prediction of HM
positions for panoramic video, and that the learned offline-DHP model can
improve the performance of online-DHP.
Comment: 15 pages, 10 figures, published on TPAMI 201
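The idea of turning a set of potential HM positions into a heat map can be sketched as follows. This is not the authors' DHP implementation: the Gaussian spreading, the `sigma` value, the pixel-grid representation, and the function name `hm_heat_map` are all assumptions chosen to keep the example small. The only panorama-specific detail is that horizontal distance wraps around, since the left and right edges of a panoramic frame meet.

```python
import math


def hm_heat_map(positions, width, height, sigma=2.0):
    """Accumulate a Gaussian blob at each (x, y) position on a width x height grid.

    Horizontal distance wraps around, as longitude does in a panoramic frame.
    The result is scaled so that its maximum value is 1.0.
    """
    heat = [[0.0] * width for _ in range(height)]
    for px, py in positions:
        for y in range(height):
            dy = y - py
            for x in range(width):
                dx = abs(x - px)
                dx = min(dx, width - dx)  # wrap around the left/right seam
                heat[y][x] += math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2))
    peak = max(max(row) for row in heat)
    if peak == 0:
        return heat  # no positions given: return the all-zero map
    return [[v / peak for v in row] for row in heat]


# Hypothetical positions proposed for one frame, in grid coordinates.
candidates = [(0, 2), (1, 2), (6, 3)]
heat = hm_heat_map(candidates, width=8, height=5)
for row in heat:
    print(" ".join(f"{v:.2f}" for v in row))
```

Because of the wrap-around, a position at the left edge also raises the values at the right edge, which a plain 2D Gaussian blur would miss.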
Bottom-up Attention, Models of
In this review, we examine the recent progress in saliency prediction and
propose several avenues for future research. In spite of tremendous efforts
and huge progress, there is still room for improvement in terms of finer-grained
analysis of deep saliency models, evaluation measures, datasets, annotation
methods, cognitive studies, and new applications. This chapter will appear in
Encyclopedia of Computational Neuroscience.
Comment: arXiv admin note: substantial text overlap with arXiv:1810.0371
Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction
Computational saliency models for still images have gained significant
popularity in recent years. Saliency prediction from videos, on the other hand,
has received relatively little interest from the community. Motivated by this,
in this work, we study the use of deep learning for dynamic saliency prediction
and propose the so-called spatio-temporal saliency networks. The key to our
models is the architecture of two-stream networks where we investigate
different fusion mechanisms to integrate spatial and temporal information. We
evaluate our models on the DIEM and UCF-Sports datasets and present highly
competitive results against the existing state-of-the-art models. We also carry
out some experiments on a number of still images from the MIT300 dataset by
exploiting the optical flow maps predicted from these images. Our results show
that considering inherent motion information in this way can be helpful for
static saliency estimation.
Predicting Gaze in Egocentric Video by Learning Task-dependent Attention Transition
We present a new computational model for gaze prediction in egocentric videos
by exploring patterns in temporal shift of gaze fixations (attention
transition) that are dependent on egocentric manipulation tasks. Our assumption
is that the high-level context of how a task is completed in a certain way has
a strong influence on attention transition and should be modeled for gaze
prediction in natural dynamic scenes. Specifically, we propose a hybrid model
based on deep neural networks which integrates task-dependent attention
transition with bottom-up saliency prediction. In particular, the
task-dependent attention transition is learned with a recurrent neural network
to exploit the temporal context of gaze fixations, e.g. looking at a cup after
moving gaze away from a grasped bottle. Experiments on public egocentric
activity datasets show that our model significantly outperforms
state-of-the-art gaze prediction methods and is able to learn meaningful
transition of human attention.
Comment: Accepted as oral presentation in ECCV 201
Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM
Over the past few years, deep neural networks (DNNs) have exhibited great
success in predicting the saliency of images. However, there are few works that
apply DNNs to predict the saliency of generic videos. In this paper, we propose
a novel DNN-based video saliency prediction method. Specifically, we establish
a large-scale eye-tracking database of videos (LEDOV), which provides
sufficient data to train the DNN models for predicting video saliency. Through
the statistical analysis of our LEDOV database, we find that human attention is
normally attracted by objects, particularly moving objects or the moving parts
of objects. Accordingly, we propose an object-to-motion convolutional neural
network (OM-CNN) to learn spatio-temporal features for predicting the
intra-frame saliency via exploring the information of both objectness and
object motion. We further find from our database that there exists a temporal
correlation of human attention with a smooth saliency transition across video
frames. Therefore, we develop a two-layer convolutional long short-term memory
(2C-LSTM) network in our DNN-based method, using the extracted features of
OM-CNN as the input. Consequently, the inter-frame saliency maps of videos can
be generated, which consider the transition of attention across video frames.
Finally, the experimental results show that our method advances the
state-of-the-art in video saliency prediction.
Comment: Jiang, Lai and Xu, Mai and Liu, Tie and Qiao, Minglang and Wang,
Zulin; DeepVS: A Deep Learning Based Video Saliency Prediction Approach; The
European Conference on Computer Vision (ECCV); September 201
Deep Visual Attention Prediction
In this work, we aim to predict human eye fixation with view-free scenes
based on an end-to-end deep learning architecture. Although Convolutional
Neural Networks (CNNs) have substantially improved human attention
prediction, CNN-based attention models can still be improved by
efficiently leveraging multi-scale features. Our visual attention network is
proposed to capture hierarchical saliency information from deep, coarse layers
with global saliency information to shallow, fine layers with local saliency
response. Our model is based on a skip-layer network structure, which predicts
human attention from multiple convolutional layers with various receptive
fields. Final saliency prediction is achieved via the cooperation of those
global and local predictions. Our model is learned in a deep supervision
manner, where supervision is directly fed into multi-level layers, instead of
previous approaches of providing supervision only at the output layer and
propagating this supervision back to earlier layers. Our model thus
incorporates multi-level saliency predictions within a single network, which
significantly decreases the redundancy of previous approaches of learning
multiple network streams with different input scales. Extensive experimental
analysis on various challenging benchmark datasets demonstrates that our method
yields state-of-the-art performance with competitive inference time.
Comment: W. Wang and J. Shen. Deep visual attention prediction. IEEE TIP,
27(5):2368-2378, 2018. Code and results can be found at
https://github.com/wenguanwang/deepattentio
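The deep supervision idea described in this abstract, where a training signal is attached to several layers rather than only to the final output, can be sketched as a sum of per-layer losses. The abstract does not name the loss function, so binary cross-entropy, the uniform default weights, and the assumption that every side output has already been resized to the target resolution are illustrative choices, as are the names `bce` and `deep_supervision_loss`.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two 2D maps with values in [0, 1]."""
    total, count = 0.0, 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def deep_supervision_loss(side_outputs, target, weights=None):
    """Weighted sum of the losses of every side output against the same target.

    side_outputs : list of 2D maps, one per supervised layer (same size as target)
    weights      : optional per-layer weights; defaults to 1.0 for each layer
    """
    weights = weights or [1.0] * len(side_outputs)
    return sum(w * bce(out, target) for w, out in zip(weights, side_outputs))


# Toy example: a coarse, uncertain prediction and a fine, confident one.
target = [[1.0, 0.0]]
coarse = [[0.5, 0.5]]
fine = [[0.9, 0.1]]
print(deep_supervision_loss([coarse, fine], target))
```

Each supervised layer contributes its own term to the total, so its error is penalised directly instead of only through gradients propagated back from the output layer.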
Salient Object Detection in the Deep Learning Era: An In-Depth Survey
As an essential problem in computer vision, salient object detection (SOD)
has attracted an increasing amount of research attention over the years. Recent
advances in SOD are predominantly led by deep learning-based solutions (named
deep SOD). To enable in-depth understanding of deep SOD, in this paper, we
provide a comprehensive survey covering various aspects, ranging from algorithm
taxonomy to unsolved issues. In particular, we first review deep SOD algorithms
from different perspectives, including network architecture, level of
supervision, learning paradigm, and object-/instance-level detection. Following
that, we summarize and analyze existing SOD datasets and evaluation metrics.
Then, we benchmark a large group of representative SOD models, and provide
detailed analyses of the comparison results. Moreover, we study the performance
of SOD algorithms under different attribute settings, which has not been
thoroughly explored previously, by constructing a novel SOD dataset with rich
attribute annotations covering various salient object types, challenging
factors, and scene categories. We further analyze, for the first time in the
field, the robustness of SOD models to random input perturbations and
adversarial attacks. We also look into the generalization and difficulty of
existing SOD datasets. Finally, we discuss several open issues of SOD and
outline future research directions.
Comment: Published on IEEE TPAMI. All the saliency prediction maps, our
constructed dataset with annotations, and code for evaluation are publicly
available at https://github.com/wenguanwang/SODsurvey
SG-FCN: A Motion and Memory-Based Deep Learning Model for Video Saliency Detection
Data-driven saliency detection has attracted strong interest as a result of
applying convolutional neural networks to the detection of eye fixations.
Although a number of image-based salient object and fixation detection models
have been proposed, video fixation detection still requires more exploration.
Unlike in image analysis, motion and temporal information are crucial
factors affecting human attention when viewing video sequences. Although
existing models based on local contrast and low-level features have been
extensively researched, they fail to simultaneously consider interframe
motion and temporal information across neighboring video frames, leading to
unsatisfactory performance when handling complex scenes. To this end, we
propose a novel and efficient video eye fixation detection model to improve the
saliency detection performance. By simulating the memory mechanism and visual
attention mechanism of human beings when watching a video, we propose a
step-gained fully convolutional network by combining the memory information on
the time axis with the motion information on the space axis while storing the
saliency information of the current frame. The model is obtained through
hierarchical training, which ensures the accuracy of the detection. Extensive
experiments in comparison with 11 state-of-the-art methods are carried out, and
the results show that our proposed model outperforms all 11 methods across a
number of publicly available datasets.