Unsupervised Video Analysis Based on a Spatiotemporal Saliency Detector
Visual saliency, which predicts regions in the field of view that draw the
most visual attention, has attracted a lot of interest from researchers. It has
already been used in several vision tasks, e.g., image classification, object
detection, and foreground segmentation. Recently, the spectrum-analysis-based
approach to visual saliency has gained popularity for its simplicity and good
performance; it uses the phase information of the image to
construct the saliency map. In this paper, we propose a new approach for
detecting spatiotemporal visual saliency based on the phase spectrum of the
videos, which is easy to implement and computationally efficient. With the
proposed algorithm, we also study how the spatiotemporal saliency can be used
in two important vision tasks, abnormality detection and spatiotemporal interest
point detection. The proposed algorithm is evaluated on several commonly used
datasets in comparison with state-of-the-art methods from the literature. The
experiments demonstrate the effectiveness of the proposed approach to
spatiotemporal visual saliency detection and its application to the above
vision tasks.
Comment: 21 pages
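For intuition, here is a minimal sketch of the phase-only spectrum idea behind such detectors (reconstruct a frame from the phase of its Fourier transform, then smooth the squared magnitude of the reconstruction); the function name and smoothing parameter are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def phase_saliency(frame: np.ndarray, sigma: float = 3.0) -> np.ndarray:
    """Saliency map from the phase-only spectrum of a grayscale frame."""
    f = np.fft.fft2(frame.astype(np.float64))
    phase_only = np.exp(1j * np.angle(f))  # keep phase, discard amplitude
    recon = np.fft.ifft2(phase_only)
    sal = np.abs(recon) ** 2               # energy of the reconstruction
    return gaussian_filter(sal, sigma)     # smooth to obtain the final map
```

A spatiotemporal variant could apply the same phase-only reconstruction to a short stack of frames with np.fft.fftn.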
Region-Based Multiscale Spatiotemporal Saliency for Video
Detecting salient objects from a video requires exploiting both spatial and
temporal knowledge included in the video. We propose a novel region-based
multiscale spatiotemporal saliency detection method for videos, where static
features and dynamic features computed from the low and middle levels are
combined together. Our method utilizes such combined features spatially over
each frame and, at the same time, temporally across frames using consistency
between consecutive frames. Saliency cues in our method are analyzed through a
multiscale segmentation model and fused across scale levels, enabling regions
to be explored efficiently. An adaptive temporal window based on motion
information is also developed to combine saliency values of consecutive frames
in order to keep temporal consistency across frames. Performance evaluation on
several popular benchmark datasets validates that our method outperforms
existing state-of-the-art methods.
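As a rough illustration of motion-adaptive temporal fusion, consider the sketch below; the window rule and decay weights are assumptions for the sketch, not the paper's exact formulation:

```python
import numpy as np

def temporal_fuse(saliency: list[np.ndarray], motion_mag: list[float],
                  t: int, max_win: int = 5) -> np.ndarray:
    """Fuse per-frame saliency maps around frame t. The window shrinks
    when motion is strong (low frame-to-frame coherence) and grows when
    the scene is static; weights decay with temporal distance."""
    win = max(1, int(max_win / (1.0 + motion_mag[t])))           # adaptive window
    lo, hi = max(0, t - win), min(len(saliency) - 1, t + win)
    weights = np.array([np.exp(-abs(k - t)) for k in range(lo, hi + 1)])
    stack = np.stack(saliency[lo:hi + 1])
    return np.tensordot(weights / weights.sum(), stack, axes=1)  # weighted mean
```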
Graph-Theoretic Spatiotemporal Context Modeling for Video Saliency Detection
As an important and challenging problem in computer vision, video saliency
detection is typically cast as a spatiotemporal context modeling problem over
consecutive frames. As a result, a key issue in video saliency detection is how
to effectively capture the intrinsic properties of atomic video structures as
well as their associated contextual interactions along the spatial and temporal
dimensions. Motivated by this observation, we propose a graph-theoretic video
saliency detection approach based on adaptive video structure discovery, which
is carried out within a spatiotemporal atomic graph. Through graph-based
manifold propagation, the proposed approach is capable of effectively modeling
the semantically contextual interactions among atomic video structures for
saliency detection while preserving spatial smoothness and temporal
consistency. Experiments demonstrate the effectiveness of the proposed approach
on several benchmark datasets.
Comment: ICIP 201
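Graph-based manifold propagation is often instantiated as manifold ranking; a minimal sketch, assuming a precomputed affinity matrix W over the atomic video structures (the paper's adaptive graph construction differs in detail):

```python
import numpy as np

def manifold_propagate(W: np.ndarray, y: np.ndarray, alpha: float = 0.99) -> np.ndarray:
    """Closed-form manifold ranking: f = (I - alpha * S)^{-1} y, where S is
    the symmetrically normalized affinity matrix and y holds seed saliency."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))   # guard isolated nodes
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^{-1/2} W D^{-1/2}
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * S, y)
```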
Saliency-Guided Perceptual Grouping Using Motion Cues in Region-Based Artificial Visual Attention
Region-based artificial attention constitutes a framework for bio-inspired
attentional processes on an intermediate abstraction level for the use in
computer vision and mobile robotics. Segmentation algorithms produce regions of
coherently colored pixels. These serve as proto-objects on which the
attentional processes determine image portions of relevance. A single
region---which does not necessarily represent a full object---constitutes the focus
of attention. For many post-attentional tasks, however, such as identifying or
tracking objects, single segments are not sufficient. Here, we present a
saliency-guided approach that groups regions that potentially belong to the
same object based on proximity and similarity of motion. We compare our results
to object selection by thresholding saliency maps and to a further
attention-guided strategy.
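A minimal sketch of grouping proto-object regions by proximity and motion similarity, here via union-find; the thresholds and helper name are illustrative assumptions, not the authors' algorithm:

```python
import numpy as np

def group_regions(centroids: np.ndarray, motions: np.ndarray,
                  max_dist: float = 40.0, max_motion_diff: float = 2.0) -> list[int]:
    """Merge regions whose centroids are close and whose motion vectors
    agree; returns a tentative object label per region."""
    parent = list(range(len(centroids)))

    def find(i):                      # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            near = np.linalg.norm(centroids[i] - centroids[j]) < max_dist
            alike = np.linalg.norm(motions[i] - motions[j]) < max_motion_diff
            if near and alike:
                parent[find(i)] = find(j)  # union: same tentative object
    return [find(i) for i in range(len(centroids))]
```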
Computational models of attention
This chapter reviews recent computational models of visual attention. We
begin with models for the bottom-up or stimulus-driven guidance of attention to
salient visual items, which we examine in seven different broad categories. We
then examine more complex models which address the top-down or goal-oriented
guidance of attention towards items that are more relevant to the task at hand.
Review of Visual Saliency Detection with Comprehensive Information
Visual saliency detection models simulate how the human visual system perceives
a scene and have been widely used in many vision tasks. With advances in
acquisition technology, more comprehensive information, such as depth cues,
inter-image correspondence, or temporal relationship, is available to extend
image saliency detection to RGBD saliency detection, co-saliency detection, or
video saliency detection. RGBD saliency detection models focus on extracting
salient regions from RGBD images by incorporating depth information.
Co-saliency detection models introduce an inter-image correspondence
constraint to discover the common salient object in an image group. The goal of
video saliency detection models is to locate motion-related salient objects
in video sequences, considering the motion cue and the spatiotemporal
constraint jointly. In this paper, we review different types of saliency
detection algorithms, summarize the important issues of the existing methods,
and discuss open problems and future work. Moreover, the evaluation
datasets and quantitative measurements are briefly introduced, and the
experimental analysis and discussion are conducted to provide a holistic
overview of different saliency detection methods.
Comment: 18 pages, 11 figures, 7 tables. Accepted by IEEE Transactions on Circuits and Systems for Video Technology, 2018. https://rmcong.github.io
Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM
Over the past few years, deep neural networks (DNNs) have exhibited great
success in predicting the saliency of images. However, there are few works that
apply DNNs to predict the saliency of generic videos. In this paper, we propose
a novel DNN-based video saliency prediction method. Specifically, we establish
a large-scale eye-tracking database of videos (LEDOV), which provides
sufficient data to train the DNN models for predicting video saliency. Through
the statistical analysis of our LEDOV database, we find that human attention is
normally attracted by objects, particularly moving objects or the moving parts
of objects. Accordingly, we propose an object-to-motion convolutional neural
network (OM-CNN) to learn spatio-temporal features for predicting the
intra-frame saliency by exploiting information about both objectness and
object motion. We further find from our database that there exists a temporal
correlation of human attention with a smooth saliency transition across video
frames. Therefore, we develop a two-layer convolutional long short-term memory
(2C-LSTM) network in our DNN-based method, using the extracted features of
OM-CNN as the input. Consequently, the inter-frame saliency maps of videos can
be generated, which consider the transition of attention across video frames.
Finally, the experimental results show that our method advances the
state-of-the-art in video saliency prediction.
Comment: Jiang, Lai; Xu, Mai; Liu, Tie; Qiao, Minglang; Wang, Zulin. DeepVS: A Deep Learning Based Video Saliency Prediction Approach. The European Conference on Computer Vision (ECCV), September 201
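The temporal half of such a pipeline can be illustrated with a convolutional LSTM cell; a minimal PyTorch sketch (stacking two of these cells over per-frame features approximates the 2C-LSTM idea, but this is not the authors' DeepVS code):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One convolutional LSTM cell over feature maps; two stacked cells
    give a two-layer ConvLSTM in the spirit of 2C-LSTM."""
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        # A single convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # update cell memory
        h = o * torch.tanh(c)           # emit hidden feature map
        return h, c
```

At inference, per-frame features (e.g., from the spatial CNN) would be fed in frame by frame while (h, c) carries attention context across time.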
Spatiotemporal Knowledge Distillation for Efficient Estimation of Aerial Video Saliency
The performance of video saliency estimation techniques has achieved
significant advances along with the rapid development of Convolutional Neural
Networks (CNNs). However, devices like cameras and drones may have limited
computational capability and storage space so that the direct deployment of
complex deep saliency models becomes infeasible. To address this problem, this
paper proposes a dynamic saliency estimation approach for aerial videos via
spatiotemporal knowledge distillation. In this approach, five components are
involved, including two teachers, two students and the desired spatiotemporal
model. The knowledge of spatial and temporal saliency is first separately
transferred from the two complex and redundant teachers to their simple and
compact students, and the input scenes are also degraded from high-resolution
to low-resolution to remove probable data redundancy and thus greatly speed
up the feature extraction process. After that, the desired spatiotemporal model
is further trained by distilling and encoding the spatial and temporal saliency
knowledge of two students into a unified network. In this manner, the
inter-model redundancy can be further removed for the effective estimation of
dynamic saliency on aerial videos. Experimental results show that the proposed
approach outperforms ten state-of-the-art models in estimating visual saliency
on aerial videos, while running at up to 28,738 FPS on the GPU platform.
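Conceptually, the objective combines teacher-to-student transfer with student-to-unified-model encoding; a toy PyTorch sketch, where the MSE losses and weights are assumptions for illustration:

```python
import torch.nn.functional as F

def distill_loss(student_spatial, student_temporal, unified,
                 teacher_spatial, teacher_temporal):
    """Toy spatiotemporal distillation: each student mimics its teacher's
    saliency map, and the unified model mimics both students. All inputs
    are saliency tensors of shape (B, 1, H, W)."""
    l_s = F.mse_loss(student_spatial, teacher_spatial.detach())
    l_t = F.mse_loss(student_temporal, teacher_temporal.detach())
    l_u = (F.mse_loss(unified, student_spatial.detach())
           + F.mse_loss(unified, student_temporal.detach()))
    return l_s + l_t + 0.5 * l_u   # weights are illustrative
```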
Recurrent Mixture Density Network for Spatiotemporal Visual Attention
In many computer vision tasks, the relevant information to solve the problem
at hand is mixed with irrelevant, distracting information. This has motivated
researchers to design attentional models that can dynamically focus on parts of
images or videos that are salient, e.g., by down-weighting irrelevant pixels.
In this work, we propose a spatiotemporal attentional model that learns where
to look in a video directly from human fixation data. We model visual attention
with a mixture of Gaussians at each frame. This distribution is used to express
the probability of saliency for each pixel. Time consistency in videos is
modeled hierarchically by: 1) deep 3D convolutional features to represent
spatial and short-term time relations and 2) a long short-term memory network
on top that aggregates the clip-level representation of sequential clips and
therefore expands the temporal domain from a few frames to seconds. The
parameters of the proposed model are optimized via maximum likelihood
estimation using human fixations as training data, without knowledge of the
action in each video. Our experiments on Hollywood2 show state-of-the-art
performance on saliency prediction for video. We also show that our attentional
model trained on Hollywood2 generalizes well to UCF101 and can be leveraged
to improve action classification accuracy on both datasets.
Comment: ICLR 201
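The training criterion can be written as the negative log-likelihood of recorded fixations under a per-frame Gaussian mixture; a simplified PyTorch sketch assuming isotropic components (the paper's parameterization may differ):

```python
import math
import torch

def mixture_nll(pi, mu, sigma, fixations):
    """NLL of fixation points under a mixture of isotropic 2-D Gaussians.
    pi: (B, K) mixture weights; mu: (B, K, 2) means; sigma: (B, K)
    standard deviations; fixations: (B, 2) gaze points."""
    sq = ((fixations.unsqueeze(1) - mu) ** 2).sum(-1)        # (B, K)
    log_comp = -torch.log(2 * math.pi * sigma ** 2) - sq / (2 * sigma ** 2)
    log_mix = torch.logsumexp(torch.log(pi + 1e-12) + log_comp, dim=1)
    return -log_mix.mean()
```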
Salient Object Detection in Video using Deep Non-Local Neural Networks
Detection of salient objects in image and video is of great importance in
many computer vision applications. Although the state of the art in saliency
detection for still images has advanced substantially over the last few years,
there have been few improvements in video saliency
detection. This paper investigates the use of recently introduced non-local
neural networks in video salient object detection. Non-local neural networks
are applied to capture global dependencies and hence determine the salient
objects. The effect of non-local operations is studied separately on static and
dynamic saliency detection in order to exploit both appearance and motion
features. A novel deep non-local neural network architecture is introduced for
video salient object detection and tested on two well-known datasets, DAVIS and
FBMS. The experimental results show that the proposed algorithm outperforms
state-of-the-art video saliency detection methods.
Comment: Submitted to the Journal of Visual Communication and Image Representation
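For reference, a minimal embedded-Gaussian non-local block in PyTorch, following the general recipe of Wang et al.'s non-local neural networks; the layer widths and residual form are standard choices, not details taken from this paper:

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Every spatial position attends to every other, capturing the
    global dependencies used to separate salient objects."""
    def __init__(self, ch: int):
        super().__init__()
        self.theta = nn.Conv2d(ch, ch // 2, 1)
        self.phi = nn.Conv2d(ch, ch // 2, 1)
        self.g = nn.Conv2d(ch, ch // 2, 1)
        self.out = nn.Conv2d(ch // 2, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C/2)
        k = self.phi(x).flatten(2)                     # (B, C/2, HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C/2)
        attn = torch.softmax(q @ k, dim=-1)            # pairwise affinities
        y = (attn @ v).transpose(1, 2).reshape(b, c // 2, h, w)
        return x + self.out(y)                         # residual fusion
```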