Video Salient Object Detection Using Spatiotemporal Deep Features
This paper presents a method for detecting salient objects in videos that
takes full account of temporal as well as spatial information. Following
recent reports on the advantage of deep features over
conventional hand-crafted features, we propose a new set of SpatioTemporal Deep
(STD) features that utilize local and global contexts over frames. We also
propose a new SpatioTemporal Conditional Random Field (STCRF) to compute saliency
from STD features. STCRF is our extension of CRF to the temporal domain and
describes the relationships among neighboring regions both in a frame and over
frames. STCRF leads to temporally consistent saliency maps over frames,
contributing to the accurate detection of salient objects' boundaries and noise
reduction during detection. Our proposed method first segments an input video
into multiple scales and then computes a saliency map at each scale level using
STD features with STCRF. The final saliency map is computed by fusing saliency
maps at different scale levels. Our experiments, using publicly available
benchmark datasets, confirm that the proposed method significantly outperforms
state-of-the-art methods. We also applied our saliency computation to the video
object segmentation task, showing that our method outperforms existing video
object segmentation methods.
Comment: accepted at TI
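As a rough illustration of the last two steps of this pipeline, below is a
minimal sketch of fusing per-scale saliency maps and then enforcing temporal
consistency. The mean fusion rule and the recursive blending factor are
illustrative assumptions; the paper's STCRF performs learned inference over a
spatiotemporal graph, which this heuristic does not reproduce.

    import numpy as np

    def fuse_multiscale_saliency(maps_per_scale, temporal_weight=0.5):
        # maps_per_scale: list over scale levels of arrays shaped (T, H, W),
        #     with saliency values in [0, 1].
        # temporal_weight: pull toward the previous frame's result, a crude
        #     stand-in for STCRF's temporal pairwise terms (assumed value).

        # Average across scale levels (the abstract does not specify the
        # fusion rule, so the mean is used here for illustration).
        fused = np.mean(np.stack(maps_per_scale, axis=0), axis=0)  # (T, H, W)

        # Recursive smoothing along time suppresses frame-to-frame flicker.
        out = fused.copy()
        for t in range(1, out.shape[0]):
            out[t] = (1 - temporal_weight) * fused[t] + temporal_weight * out[t - 1]
        return out
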
Review of Visual Saliency Detection with Comprehensive Information
Visual saliency detection models simulate the human visual system's
perception of a scene and have been widely used in many vision tasks. With
the development of acquisition technology, more comprehensive information,
such as depth cues, inter-image correspondences, or temporal relationships,
has become available to extend
image saliency detection to RGBD saliency detection, co-saliency detection, or
video saliency detection. The RGBD saliency detection model focuses on
extracting salient regions from RGBD images by incorporating depth
information. The co-saliency detection model introduces an inter-image
correspondence constraint to discover the common salient object in an image
group. The goal of the video saliency detection model is to locate the
motion-related salient object in video sequences, considering the motion cue
and spatiotemporal constraints jointly. In this paper, we review different types of saliency
detection algorithms, summarize the important issues of the existing methods,
and discuss existing problems and future work. Moreover, the evaluation
datasets and quantitative measurements are briefly introduced, and an
experimental analysis and discussion are conducted to provide a holistic
overview of different saliency detection methods.
Comment: 18 pages, 11 figures, 7 tables, Accepted by IEEE Transactions on
Circuits and Systems for Video Technology 2018, https://rmcong.github.io
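The quantitative measurements mentioned above are, throughout this
literature, most commonly the mean absolute error and the F-measure with an
adaptive threshold. A minimal sketch of both, assuming the usual conventions
(a threshold of twice the mean saliency, beta^2 = 0.3); these are community
defaults, not anything specific to this survey.

    import numpy as np

    def mae(saliency, gt):
        # Mean absolute error between a saliency map in [0, 1] and a
        # binary ground-truth mask.
        return np.abs(saliency.astype(np.float64) - gt.astype(np.float64)).mean()

    def f_measure(saliency, gt, beta2=0.3):
        # Adaptive threshold: twice the mean saliency, capped at 1.
        thresh = min(2.0 * saliency.mean(), 1.0)
        pred = saliency >= thresh
        tp = np.logical_and(pred, gt > 0).sum()
        precision = tp / (pred.sum() + 1e-8)
        recall = tp / ((gt > 0).sum() + 1e-8)
        return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
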
Salient Object Detection in Video using Deep Non-Local Neural Networks
Detection of salient objects in image and video is of great importance in
many computer vision applications. Although the state of the art in saliency
detection for still images has advanced substantially over the last few
years, there have been few improvements in video saliency
detection. This paper investigates the use of recently introduced non-local
neural networks in video salient object detection. Non-local neural networks
are applied to capture global dependencies and hence determine the salient
objects. The effect of non-local operations is studied separately on static and
dynamic saliency detection in order to exploit both appearance and motion
features. A novel deep non-local neural network architecture is introduced for
video salient object detection and tested on two well-known datasets, DAVIS and
FBMS. The experimental results show that the proposed algorithm outperforms
state-of-the-art video saliency detection methods.
Comment: Submitted to Journal of Visual Communication and Image Representatio
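For reference, here is a minimal embedded-Gaussian non-local block in the
style of Wang et al. (2018), the operation this paper applies: every spatial
position attends to every other, which is how the global dependencies are
captured. The halved intermediate width and the residual form follow the
original non-local formulation; the paper's full static/dynamic two-branch
architecture is not reproduced here.

    import torch
    import torch.nn as nn

    class NonLocalBlock2D(nn.Module):
        def __init__(self, channels):
            super().__init__()
            inter = channels // 2
            self.theta = nn.Conv2d(channels, inter, 1)  # query transform
            self.phi = nn.Conv2d(channels, inter, 1)    # key transform
            self.g = nn.Conv2d(channels, inter, 1)      # value transform
            self.out = nn.Conv2d(inter, channels, 1)    # restore channels

        def forward(self, x):
            n, c, h, w = x.shape
            q = self.theta(x).flatten(2).transpose(1, 2)  # (N, HW, C')
            k = self.phi(x).flatten(2)                    # (N, C', HW)
            v = self.g(x).flatten(2).transpose(1, 2)      # (N, HW, C')
            attn = torch.softmax(q @ k, dim=-1)           # (N, HW, HW) affinities
            y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)
            return x + self.out(y)                        # residual connection
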
Graph-Theoretic Spatiotemporal Context Modeling for Video Saliency Detection
As an important and challenging problem in computer vision, video saliency
detection is typically cast as a spatiotemporal context modeling problem over
consecutive frames. As a result, a key issue in video saliency detection is how
to effectively capture the intrinsic properties of atomic video structures as
well as their associated contextual interactions along the spatial and temporal
dimensions. Motivated by this observation, we propose a graph-theoretic video
saliency detection approach based on adaptive video structure discovery, which
is carried out within a spatiotemporal atomic graph. Through graph-based
manifold propagation, the proposed approach is capable of effectively modeling
the semantically contextual interactions among atomic video structures for
saliency detection while preserving spatial smoothness and temporal
consistency. Experiments demonstrate the effectiveness of the proposed approach
over several benchmark datasets.
Comment: ICIP 201
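Graph-based manifold propagation of this kind typically has the closed form
f* = (I - alpha*S)^(-1) y over a normalized affinity matrix (Zhou et al.,
2004). A minimal sketch over atomic regions; the affinity construction and
seed scores here are placeholders, and the paper's adaptive structure
discovery is not reproduced.

    import numpy as np

    def manifold_propagation(W, seeds, alpha=0.99):
        # W: (n, n) symmetric affinities among atomic regions, e.g. from
        #    appearance similarity within frames plus temporal links across
        #    frames (construction assumed, not from the paper).
        # seeds: (n,) initial saliency scores y.
        d = W.sum(axis=1)
        d_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
        S = d_inv_sqrt @ W @ d_inv_sqrt            # symmetric normalization
        n = W.shape[0]
        f = np.linalg.solve(np.eye(n) - alpha * S, seeds)
        return (f - f.min()) / (f.max() - f.min() + 1e-12)  # rescale to [0, 1]
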
Learning Discriminative Motion Features Through Detection
Despite huge success in the image domain, modern detection models such as
Faster R-CNN have not been used nearly as much for video analysis. This is
arguably due to the fact that detection models are designed to operate on
single frames and as a result do not have a mechanism for learning motion
representations directly from video. We propose a learning procedure that
allows detection models such as Faster R-CNN to learn motion features directly
from the RGB video data while being optimized with respect to a pose estimation
task. Given a pair of video frames, Frame A and Frame B, we force our model
to predict human pose in Frame A using the features from Frame B. We do so by
leveraging deformable convolutions across space and time. Our network learns to
spatially sample features from Frame B in order to maximize pose detection
accuracy in Frame A. This naturally encourages our network to learn motion
offsets encoding the spatial correspondences between the two frames. We refer
to these motion offsets as DiMoFs (Discriminative Motion Features).
In our experiments we show that our training scheme helps learn effective
motion cues, which can be used to estimate and localize salient human motion.
Furthermore, we demonstrate that as a byproduct, our model also learns features
that lead to improved pose detection in still-images, and better keypoint
tracking. Finally, we show how to leverage our learned model for the tasks of
spatiotemporal action localization and fine-grained action recognition.
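A minimal sketch of the cross-frame deformable sampling idea, using
torchvision's deform_conv2d: offsets predicted from both frames steer where
features are sampled in Frame B. The offset-predictor design, the weight
initialization, and the module name are illustrative assumptions; in the
paper the offsets are trained end-to-end against a pose loss on Frame A,
which this fragment omits.

    import torch
    import torch.nn as nn
    from torchvision.ops import deform_conv2d

    class CrossFrameDeformSampler(nn.Module):  # hypothetical name
        def __init__(self, channels, k=3):
            super().__init__()
            self.k = k
            # Two offsets (x, y) per kernel tap, predicted from the
            # concatenated features of both frames.
            self.offset_pred = nn.Conv2d(2 * channels, 2 * k * k, 3, padding=1)
            self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.01)

        def forward(self, feat_a, feat_b):
            offsets = self.offset_pred(torch.cat([feat_a, feat_b], dim=1))
            # Deformable convolution samples Frame B's features at the
            # predicted offsets; useful offsets must encode A-to-B motion.
            return deform_conv2d(feat_b, offsets, self.weight, padding=self.k // 2)
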
Global and Local Sensitivity Guided Key Salient Object Re-augmentation for Video Saliency Detection
Existing still-image deep learning based saliency methods do not consider
weighting or highlighting the features extracted from different layers; all
features contribute equally to the final saliency decision. Such methods
evenly detect all "potentially significant regions" and are unable to
highlight the key salient object, resulting in detection failures in dynamic
scenes. In this paper, based on the observation that salient areas in videos
are relatively small and concentrated, we propose a key salient object
re-augmentation method (KSORA), using top-down semantic knowledge and
bottom-up feature guidance, to improve detection accuracy in video scenes.
KSORA includes
two sub-modules (WFE and KOS): WFE processes local salient feature selection
using a bottom-up strategy, while KOS ranks each object globally using
top-down statistical knowledge, and chooses the most critical object area for
local enhancement. The proposed KSORA can not only strengthen the saliency
value of the local key salient object but also ensure global saliency
consistency. Results on three benchmark datasets suggest that our model has the
capability of improving detection accuracy on complex scenes. The strong
performance of KSORA, with a speed of 17 FPS on modern GPUs, has been
verified by comparisons with ten other state-of-the-art algorithms.
Comment: 6 figures, 10 page
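A minimal sketch of the unequal-contribution idea that motivates WFE:
softmax-weighted fusion of multi-layer feature maps, so that layers no longer
contribute equally to the saliency decision. This is a generic stand-in, not
the paper's WFE or KOS module.

    import torch
    import torch.nn as nn

    class WeightedFeatureFusion(nn.Module):  # hypothetical module
        def __init__(self, num_layers):
            super().__init__()
            # One learnable logit per layer; the softmax makes layers
            # compete, unlike the equal-contribution baseline.
            self.logits = nn.Parameter(torch.zeros(num_layers))

        def forward(self, feats):
            # feats: list of (N, C, H, W) maps resized to a common shape.
            w = torch.softmax(self.logits, dim=0)
            return sum(wi * f for wi, f in zip(w, feats))
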
SG-FCN: A Motion and Memory-Based Deep Learning Model for Video Saliency Detection
Data-driven saliency detection has attracted strong interest as a result of
applying convolutional neural networks to the detection of eye fixations.
Although a number of image-based salient object and fixation detection models
have been proposed, video fixation detection still requires more exploration.
Unlike image analysis, motion and temporal information are crucial factors
affecting human attention when viewing video sequences. Although
existing models based on local contrast and low-level features have been
extensively researched, they fail to simultaneously consider interframe
motion and temporal information across neighboring video frames, leading to
unsatisfactory performance when handling complex scenes. To this end, we
propose a novel and efficient video eye fixation detection model to improve the
saliency detection performance. By simulating the memory mechanism and visual
attention mechanism of human beings when watching a video, we propose a
step-gained fully convolutional network by combining the memory information on
the time axis with the motion information on the space axis while storing the
saliency information of the current frame. The model is obtained through
hierarchical training, which ensures the accuracy of the detection. Extensive
experiments in comparison with 11 state-of-the-art methods are carried out, and
the results show that our proposed model outperforms all 11 methods across a
number of publicly available datasets.
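A minimal sketch of the core fusion: memory on the time axis (the previous
frame's saliency) combined with a motion cue on the space axis (here
approximated by frame differencing). The layer layout and the
frame-difference motion proxy are assumptions; SG-FCN's step-gained
architecture and hierarchical training are not reproduced.

    import torch
    import torch.nn as nn

    class MemoryMotionFusion(nn.Module):  # hypothetical module
        def __init__(self, channels):
            super().__init__()
            # +1 input channel for the previous saliency map (memory),
            # +1 for the frame-difference motion map.
            self.fuse = nn.Sequential(
                nn.Conv2d(channels + 2, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, 1, 1), nn.Sigmoid(),
            )

        def forward(self, feat_t, prev_saliency, frame_t, frame_prev):
            # All inputs are assumed to share the same spatial resolution.
            motion = (frame_t - frame_prev).abs().mean(dim=1, keepdim=True)
            x = torch.cat([feat_t, prev_saliency, motion], dim=1)
            return self.fuse(x)  # saliency map for the current frame
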
A Review of Co-saliency Detection Technique: Fundamentals, Applications, and Challenges
Co-saliency detection is a newly emerging and rapidly growing research area
in the computer vision community. As a novel branch of visual saliency, co-saliency
detection refers to the discovery of common and salient foregrounds from two or
more relevant images, and can be widely used in many computer vision tasks. The
existing co-saliency detection algorithms mainly consist of three components:
extracting effective features to represent the image regions, exploring the
informative cues or factors to characterize co-saliency, and designing
effective computational frameworks to formulate co-saliency. Although numerous
methods have been developed, the literature is still lacking a deep review and
evaluation of co-saliency detection techniques. In this paper, we aim at
providing a comprehensive review of the fundamentals, challenges, and
applications of co-saliency detection. Specifically, we provide an overview of
some related computer vision works, review the history of co-saliency
detection, summarize and categorize the major algorithms in this research area,
discuss some open issues in this area, present the potential applications of
co-saliency detection, and finally point out some unsolved challenges and
promising future work. We expect this review to be beneficial to both new
and senior researchers in this field, and to give insights to researchers in
other related areas regarding the utility of co-saliency detection
algorithms.
Comment: 28 pages, 12 figures, 3 table
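Of the three components above, the inter-image correspondence cue is the most
distinctive. A toy sketch that scores each region by how strongly it recurs
across the other images of the group; the L2-normalized region features and
best-match pooling are illustrative choices, not a method from the survey.

    import numpy as np

    def co_saliency_scores(region_feats):
        # region_feats: list over images of (n_i, d) L2-normalized
        # region feature arrays; the group must contain >= 2 images.
        scores = []
        for i, feats in enumerate(region_feats):
            sims = []
            for j, other in enumerate(region_feats):
                if i == j:
                    continue
                # Best cosine match for each region in every other image.
                sims.append((feats @ other.T).max(axis=1))
            # Regions that recur across the group score high.
            scores.append(np.mean(sims, axis=0))
        return scores
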
Recurrent Mixture Density Network for Spatiotemporal Visual Attention
In many computer vision tasks, the relevant information to solve the problem
at hand is mixed with irrelevant, distracting information. This has motivated
researchers to design attentional models that can dynamically focus on parts of
images or videos that are salient, e.g., by down-weighting irrelevant pixels.
In this work, we propose a spatiotemporal attentional model that learns where
to look in a video directly from human fixation data. We model visual attention
with a mixture of Gaussians at each frame. This distribution is used to express
the probability of saliency for each pixel. Time consistency in videos is
modeled hierarchically by: 1) deep 3D convolutional features to represent
spatial and short-term time relations and 2) a long short-term memory network
on top that aggregates the clip-level representation of sequential clips and
therefore expands the temporal domain from a few frames to seconds. The
parameters of the proposed model are optimized via maximum likelihood
estimation using human fixations as training data, without knowledge of the
action in each video. Our experiments on Hollywood2 show state-of-the-art
performance on saliency prediction for video. We also show that our attentional
model trained on Hollywood2 generalizes well to UCF101 and can be leveraged
to improve action classification accuracy on both datasets.
Comment: ICLR 201
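The maximum-likelihood objective for such a mixture density head is the
negative log-likelihood of recorded fixations under the predicted per-frame
Gaussian mixture. A minimal sketch assuming diagonal covariances and
already-normalized mixing weights; the paper's exact parameterization may
differ.

    import math
    import torch

    def gmm_fixation_nll(pi, mu, sigma, fixations):
        # pi: (N, K) mixing weights, mu: (N, K, 2) means,
        # sigma: (N, K, 2) positive std-devs, fixations: (N, 2) gaze points.
        x = fixations.unsqueeze(1)            # (N, 1, 2)
        z = (x - mu) / sigma                  # standardized residuals (N, K, 2)
        # log N(x; mu, diag(sigma^2)) for a 2-D Gaussian.
        log_norm = -0.5 * (z ** 2).sum(-1) - sigma.log().sum(-1) \
                   - math.log(2 * math.pi)
        log_mix = torch.logsumexp(pi.log() + log_norm, dim=1)  # (N,)
        return -log_mix.mean()
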
Stories in the Eye: Contextual Visual Interactions for Efficient Video to Language Translation
Integrating higher-level visual and linguistic interpretations is at the
heart of human intelligence. While automatic visual category recognition in
images is approaching human performance, high-level understanding in the
dynamic spatiotemporal domain of videos and its translation into natural
language are still far from solved. Whereas most works on vision-to-text
translation
use pre-learned or pre-established computational linguistic models, in this
paper we present an approach that uses vision alone to efficiently learn to
translate video content into language. We discover, in simple form, the
story played by main actors, while using only visual cues for representing
objects and their interactions. Our method learns, in a hierarchical manner,
higher-level representations for recognizing the subjects, actions, and
objects involved, their relevant contextual background, and their
interactions with one another over time. Our approach has three stages:
first, we consider features of the individual entities at the local level of
appearance; second, we consider the relationships between these objects and
actions and their video background; and third, we feed their spatiotemporal
relations into classifiers at the highest level of interpretation.
Thus, our approach finds a coherent linguistic description of a video in the
form of a subject, verb, and object, based on the roles they play in the
overall visual story, learned directly from training data without using a
known language model. We test the efficiency of our approach on a
large-scale dataset containing YouTube clips taken in the wild and
demonstrate state-of-the-art performance, often superior to that of current
approaches that use more complex, pre-learned linguistic knowledge.
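As a toy illustration of the final stage, here is one way to pick a
(subject, verb, object) triple by combining per-entity scores with a
spatiotemporal compatibility term. The additive scoring rule and all names
are hypothetical, not the paper's formulation.

    import numpy as np

    def pick_svo(subject_scores, verb_scores, object_scores, interaction):
        # interaction: (n_s, n_v, n_o) compatibility scores from the
        # highest-level spatiotemporal-relation classifiers (assumed shape).
        total = (subject_scores[:, None, None]
                 + verb_scores[None, :, None]
                 + object_scores[None, None, :]
                 + interaction)
        # Indices of the best-scoring (subject, verb, object) triple.
        return np.unravel_index(np.argmax(total), total.shape)
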