16,203 research outputs found
Salient Object Detection in Video using Deep Non-Local Neural Networks
Detection of salient objects in image and video is of great importance in
many computer vision applications. Although the state of the art in saliency
detection for still images has changed substantially over the last few years,
there have been few improvements in video saliency
detection. This paper investigates the use of recently introduced non-local
neural networks in video salient object detection. Non-local neural networks
are applied to capture global dependencies and hence determine the salient
objects. The effect of non-local operations is studied separately on static and
dynamic saliency detection in order to exploit both appearance and motion
features. A novel deep non-local neural network architecture is introduced for
video salient object detection and tested on two well-known datasets, DAVIS and
FBMS. The experimental results show that the proposed algorithm outperforms
state-of-the-art video saliency detection methods.
Comment: Submitted to Journal of Visual Communication and Image Representation
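To make the idea of "capturing global dependencies" concrete, the following is a minimal sketch of a generic non-local operation, in which every position is updated with a similarity-weighted sum over all other positions. It is not the architecture proposed in the paper: the function name `non_local` is hypothetical, and the learned embeddings that such a block would normally apply are replaced here by identity mappings (an assumption made to keep the example short).

```python
import math


def non_local(features):
    """Simplified non-local operation over a list of feature vectors.

    features: list of N positions, each a list of C floats.
    Returns a list of the same shape, where each output is
    x_i + sum_j softmax_j(x_i . x_j) * x_j.
    Assumption: identity embeddings stand in for learned projections.
    """
    outputs = []
    for x_i in features:
        # Dot-product similarity between position i and every position j.
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in features]
        # Softmax over all positions (subtract the max for stability).
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum over all positions: the "global" context for i.
        context = [
            sum(w * x_j[c] for w, x_j in zip(weights, features))
            for c in range(len(x_i))
        ]
        # Residual connection keeps the original feature.
        outputs.append([a + b for a, b in zip(x_i, context)])
    return outputs
```

Because the weights are computed over all positions at once, a distant but similar region influences the output as much as a neighboring one, which is the property that distinguishes this operation from a local convolution.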
Review of Visual Saliency Detection with Comprehensive Information
Visual saliency detection models simulate the human visual system to perceive
the scene, and have been widely used in many vision tasks. With the development
of acquisition technology, more comprehensive information, such as depth cue,
inter-image correspondence, or temporal relationship, is available to extend
image saliency detection to RGBD saliency detection, co-saliency detection, or
video saliency detection. The RGBD saliency detection model focuses on extracting
the salient regions from RGBD images by combining the depth information.
The co-saliency detection model introduces the inter-image correspondence
constraint to discover the common salient object in an image group. The goal of
the video saliency detection model is to locate the motion-related salient object
in video sequences by considering the motion cue and spatiotemporal
constraint jointly. In this paper, we review different types of saliency
detection algorithms, summarize the important issues of the existing methods,
and discuss the existing problems and future work. Moreover, the evaluation
datasets and quantitative measurements are briefly introduced, and the
experimental analysis and discussion are conducted to provide a holistic
overview of different saliency detection methods.
Comment: 18 pages, 11 figures, 7 tables, Accepted by IEEE Transactions on
Circuits and Systems for Video Technology 2018, https://rmcong.github.io
SG-FCN: A Motion and Memory-Based Deep Learning Model for Video Saliency Detection
Data-driven saliency detection has attracted strong interest as a result of
applying convolutional neural networks to the detection of eye fixations.
Although a number of image-based salient object and fixation detection models
have been proposed, video fixation detection still requires more exploration.
Unlike in image analysis, motion and temporal information are crucial
factors affecting human attention when viewing video sequences. Although
existing models based on local contrast and low-level features have been
extensively researched, they fail to simultaneously consider interframe
motion and temporal information across neighboring video frames, leading to
unsatisfactory performance when handling complex scenes. To this end, we
propose a novel and efficient video eye fixation detection model to improve the
saliency detection performance. By simulating the memory mechanism and visual
attention mechanism of human beings when watching a video, we propose a
step-gained fully convolutional network by combining the memory information on
the time axis with the motion information on the space axis while storing the
saliency information of the current frame. The model is obtained through
hierarchical training, which ensures the accuracy of the detection. Extensive
experiments in comparison with 11 state-of-the-art methods are carried out, and
the results show that our proposed model outperforms all 11 methods across a
number of publicly available datasets.
Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM
Over the past few years, deep neural networks (DNNs) have exhibited great
success in predicting the saliency of images. However, there are few works that
apply DNNs to predict the saliency of generic videos. In this paper, we propose
a novel DNN-based video saliency prediction method. Specifically, we establish
a large-scale eye-tracking database of videos (LEDOV), which provides
sufficient data to train the DNN models for predicting video saliency. Through
the statistical analysis of our LEDOV database, we find that human attention is
normally attracted by objects, particularly moving objects or the moving parts
of objects. Accordingly, we propose an object-to-motion convolutional neural
network (OM-CNN) to learn spatio-temporal features for predicting the
intra-frame saliency via exploring the information of both objectness and
object motion. We further find from our database that there exists a temporal
correlation of human attention with a smooth saliency transition across video
frames. Therefore, we develop a two-layer convolutional long short-term memory
(2C-LSTM) network in our DNN-based method, using the extracted features of
OM-CNN as the input. Consequently, the inter-frame saliency maps of videos can
be generated, which consider the transition of attention across video frames.
Finally, the experimental results show that our method advances the
state-of-the-art in video saliency prediction.
Comment: Jiang, Lai and Xu, Mai and Liu, Tie and Qiao, Minglang and Wang,
Zulin; DeepVS: A Deep Learning Based Video Saliency Prediction Approach; The
European Conference on Computer Vision (ECCV); September 201
Computational models of attention
This chapter reviews recent computational models of visual attention. We
begin with models for the bottom-up or stimulus-driven guidance of attention to
salient visual items, which we examine in seven different broad categories. We
then examine more complex models which address the top-down or goal-oriented
guidance of attention towards items that are more relevant to the task at hand.
Global and Local Sensitivity Guided Key Salient Object Re-augmentation for Video Saliency Detection
Existing static deep learning-based saliency methods do not consider the
weighting and highlighting of features extracted from different layers;
all features contribute equally to the final saliency decision-making.
Such methods always detect all "potentially significant regions" evenly and
are unable to highlight the key salient object, resulting in detection failure in
dynamic scenes. In this paper, based on the fact that salient areas in videos
are relatively small and concentrated, we propose a key salient object
re-augmentation method (KSORA) using top-down semantic knowledge and bottom-up
feature guidance to improve detection accuracy in video scenes. KSORA includes
two sub-modules (WFE and KOS): WFE performs local salient feature selection
using a bottom-up strategy, while KOS ranks each object globally using
top-down statistical knowledge, and chooses the most critical object area for
local enhancement. The proposed KSORA can not only strengthen the saliency
value of the local key salient object but also ensure global saliency
consistency. Results on three benchmark datasets suggest that our model has the
capability of improving the detection accuracy on complex scenes. The
performance of KSORA, which runs at 17 FPS on modern GPUs, has
been verified by comparison with ten other state-of-the-art algorithms.
Comment: 6 figures, 10 pages
Cube Padding for Weakly-Supervised Saliency Prediction in 360° Videos
Automatic saliency prediction in 360° videos is critical for viewpoint
guidance applications (e.g., Facebook 360 Guide). We propose a spatial-temporal
network which is (1) trained with weak supervision and (2) tailor-made for
the 360° viewing sphere. Note that most existing methods are less scalable
since they rely on annotated saliency maps for training. Most importantly, they
convert the 360° sphere to 2D images (e.g., a single equirectangular image or
multiple separate Normal Field-of-View (NFoV) images) which introduces
distortion and image boundaries. In contrast, we propose a simple and effective
Cube Padding (CP) technique as follows. Firstly, we render the 360° view
on six faces of a cube using perspective projection. Thus, it introduces very
little distortion. Then, we concatenate all six faces while utilizing the
connectivity between faces on the cube for image padding (i.e., Cube Padding)
in convolution, pooling, and convolutional LSTM layers. In this way, CP introduces
no image boundary while being applicable to almost all Convolutional Neural
Network (CNN) structures. To evaluate our method, we propose Wild-360, a new
360° video saliency dataset, containing challenging videos with saliency
heatmap annotations. In experiments, our method outperforms baseline methods in
both speed and quality.
Comment: CVPR 201
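The padding idea described in the abstract above can be illustrated with a deliberately reduced sketch. It covers only the four lateral cube faces, which form a ring, so each face can borrow border columns from its left and right neighbors without any rotation. The top and bottom faces, which need rotated borders, are omitted. The function name `pad_lateral_faces` and the ring ordering of the faces are assumptions made for illustration; this is not the authors' implementation.

```python
def pad_lateral_faces(faces, pad):
    """Pad each lateral cube face horizontally with neighbor columns.

    faces: list of faces in ring order (assumed: front, right, back,
           left), each a list of rows, each row a list of values.
    pad:   number of columns borrowed from each neighbor; must not
           exceed the face width.
    Top and bottom faces are intentionally left out of this sketch.
    """
    n = len(faces)
    padded = []
    for i, face in enumerate(faces):
        left_nb = faces[(i - 1) % n]   # face touching the left edge
        right_nb = faces[(i + 1) % n]  # face touching the right edge
        rows = []
        for r, row in enumerate(face):
            width = len(left_nb[r])
            # Rightmost columns of the left neighbor, the face itself,
            # then the leftmost columns of the right neighbor.
            rows.append(left_nb[r][width - pad:] + row + right_nb[r][:pad])
        padded.append(rows)
    return padded
```

With this kind of padding, a convolution applied to one face sees real image content from the adjacent face at its border instead of zeros, which is why no artificial image boundary appears along the ring.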
Deep Visual Attention Prediction
In this work, we aim to predict human eye fixation with view-free scenes
based on an end-to-end deep learning architecture. Although Convolutional
Neural Networks (CNNs) have substantially improved human attention
prediction, CNN-based attention models still need to be improved by
efficiently leveraging multi-scale features. Our visual attention network is
proposed to capture hierarchical saliency information from deep, coarse layers
with global saliency information to shallow, fine layers with local saliency
response. Our model is based on a skip-layer network structure, which predicts
human attention from multiple convolutional layers with various receptive
fields. Final saliency prediction is achieved via the cooperation of those
global and local predictions. Our model is learned in a deep supervision
manner, where supervision is directly fed into multi-level layers, instead of
previous approaches of providing supervision only at the output layer and
propagating this supervision back to earlier layers. Our model thus
incorporates multi-level saliency predictions within a single network, which
significantly decreases the redundancy of previous approaches of learning
multiple network streams with different input scales. Extensive experimental
analysis on various challenging benchmark datasets demonstrates that our method
yields state-of-the-art performance with competitive inference time.
Comment: W. Wang and J. Shen. Deep visual attention prediction. IEEE TIP,
27(5):2368-2378, 2018. Code and results can be found in
https://github.com/wenguanwang/deepattentio
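The deep supervision scheme described in this abstract, where each level's prediction is compared with the ground truth directly, can be sketched as a sum of per-level losses. The function names `bce` and `deeply_supervised_loss`, the choice of binary cross-entropy, and the optional per-level weights are assumptions for illustration; the abstract does not state which loss the authors use. The sketch also assumes that every level's prediction has already been resized to the resolution of the target map.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two flat lists in [0, 1]."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)


def deeply_supervised_loss(level_preds, target, weights=None):
    """Sum a per-level loss so each level gets direct supervision.

    level_preds: list of flattened saliency maps, one per level.
    target:      flattened ground-truth map of the same length.
    weights:     optional per-level weights (hypothetical parameter).
    """
    if weights is None:
        weights = [1.0] * len(level_preds)
    return sum(w * bce(p, target) for w, p in zip(weights, level_preds))
```

In contrast, supervising only the output layer would amount to keeping a single term of this sum, leaving the intermediate levels to receive their training signal only through backpropagation.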
Computational models: Bottom-up and top-down aspects
Computational models of visual attention have become popular over the past
decade, we believe primarily for two reasons: first, models make testable
predictions that can be explored by experimentalists as well as theoreticians;
second, models have practical and technological applications of interest to the
applied science and engineering communities. In this chapter, we take a
critical look at recent attention modeling efforts. We focus on
computational models of attention as defined by Tsotsos & Rothenstein
\shortcite{Tsotsos_Rothenstein11}: models which can process any visual stimulus
(typically, an image or video clip), which can possibly also be given some task
definition, and which make predictions that can be compared to human or animal
behavioral or physiological responses elicited by the same stimulus and task.
Thus, we here place less emphasis on abstract models, phenomenological models,
purely data-driven fitting or extrapolation models, or models specifically
designed for a single task or for a restricted class of stimuli. For
theoretical models, we refer the reader to a number of previous reviews that
address attention theories and models more generally
\cite{Itti_Koch01nrn,Paletta_etal05,Frintrop_etal10,Rothenstein_Tsotsos08,Gottlieb_Balan10,Toet11,Borji_Itti12pami}
Salient Object Detection in the Deep Learning Era: An In-Depth Survey
As an essential problem in computer vision, salient object detection (SOD)
has attracted an increasing amount of research attention over the years. Recent
advances in SOD are predominantly led by deep learning-based solutions (named
deep SOD). To enable in-depth understanding of deep SOD, in this paper, we
provide a comprehensive survey covering various aspects, ranging from algorithm
taxonomy to unsolved issues. In particular, we first review deep SOD algorithms
from different perspectives, including network architecture, level of
supervision, learning paradigm, and object-/instance-level detection. Following
that, we summarize and analyze existing SOD datasets and evaluation metrics.
Then, we benchmark a large group of representative SOD models, and provide
detailed analyses of the comparison results. Moreover, we study the performance
of SOD algorithms under different attribute settings, which has not been
thoroughly explored previously, by constructing a novel SOD dataset with rich
attribute annotations covering various salient object types, challenging
factors, and scene categories. We further analyze, for the first time in the
field, the robustness of SOD models to random input perturbations and
adversarial attacks. We also look into the generalization and difficulty of
existing SOD datasets. Finally, we discuss several open issues of SOD and
outline future research directions.
Comment: Published in IEEE TPAMI. All the saliency prediction maps, our
constructed dataset with annotations, and code for evaluation are publicly
available at https://github.com/wenguanwang/SODsurvey