Salient Object Detection in Video using Deep Non-Local Neural Networks
Detection of salient objects in images and video is of great importance in many computer vision applications. Although the state of the art in saliency detection for still images has advanced substantially over the last few years, there have been comparatively few improvements in video saliency detection. This paper investigates the use of recently introduced non-local neural networks in video salient object detection. Non-local neural networks are applied to capture global dependencies and hence determine the salient objects. The effect of non-local operations is studied separately on static and dynamic saliency detection in order to exploit both appearance and motion features. A novel deep non-local neural network architecture is introduced for video salient object detection and tested on two well-known datasets, DAVIS and FBMS. The experimental results show that the proposed algorithm outperforms state-of-the-art video saliency detection methods.
Comment: Submitted to Journal of Visual Communication and Image Representation
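
As background for the operation this abstract relies on, the sketch below shows a minimal 2D non-local block in the embedded-Gaussian form of Wang et al.'s non-local neural networks: every spatial position attends to every other position, which is how global dependencies are captured. It is an illustration of the generic operation, not the architecture proposed in this paper; the class name, layer sizes, and reduction factor are assumptions.

```python
# Illustrative sketch only: a generic 2D non-local block, not this paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock2D(nn.Module):
    """Embedded-Gaussian non-local block: every position attends to all others."""
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2            # assumed reduction factor
        self.theta = nn.Conv2d(channels, reduced, kernel_size=1)
        self.phi = nn.Conv2d(channels, reduced, kernel_size=1)
        self.g = nn.Conv2d(channels, reduced, kernel_size=1)
        self.out = nn.Conv2d(reduced, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, c')
        k = self.phi(x).flatten(2)                     # (b, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, c')
        attn = F.softmax(torch.bmm(q, k), dim=-1)      # pairwise global dependencies
        y = torch.bmm(attn, v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection

feat = torch.randn(1, 64, 32, 32)
print(NonLocalBlock2D(64)(feat).shape)  # torch.Size([1, 64, 32, 32])
```
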
Excitation Backprop for RNNs
Deep models are state-of-the-art for many vision tasks including video action
recognition and video captioning. Models are trained to caption or classify
activity in videos, but little is known about the evidence used to make such
decisions. Grounding decisions made by deep networks has been studied in
spatial visual content, giving more insight into model predictions for images.
However, such studies are relatively lacking for models of spatiotemporal
visual content - videos. In this work, we devise a formulation that
simultaneously grounds evidence in space and time, in a single pass, using
top-down saliency. We visualize the spatiotemporal cues that contribute to a
deep model's classification/captioning output using the model's internal
representation. Based on these spatiotemporal cues, we are able to localize
segments within a video that correspond to a specific action, or to a phrase from a caption, without explicitly optimizing or training for these tasks.
Comment: CVPR 2018 Camera Ready Version
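
For context on the top-down saliency used here, excitation backprop redistributes a prediction's "winning probability" backwards through the network, using only excitatory (positive-weight) connections and non-negative activations. The NumPy sketch below shows that rule for a single fully connected layer; it is a simplified illustration, not the authors' spatiotemporal formulation, and the function name and shapes are assumptions.

```python
# Hedged sketch of the excitation backprop rule for one fully connected layer.
import numpy as np

def excitation_backprop_fc(child_act, weights, parent_prob):
    """child_act: (n_child,), weights: (n_parent, n_child), parent_prob: (n_parent,)."""
    w_pos = np.maximum(weights, 0.0)                  # keep excitatory connections only
    contrib = w_pos * child_act[None, :]              # a_j * w_ij^+ for each parent i
    norm = contrib.sum(axis=1, keepdims=True) + 1e-12
    return (contrib / norm * parent_prob[:, None]).sum(axis=0)  # marginal prob per child

rng = np.random.default_rng(0)
act = np.maximum(rng.normal(size=128), 0)             # ReLU-style activations
w = rng.normal(size=(10, 128))
p = np.zeros(10); p[3] = 1.0                          # start from one target class
print(excitation_backprop_fc(act, w, p).sum())        # probability mass is conserved (~1.0)
```
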
Benchmark 3D eye-tracking dataset for visual saliency prediction on stereoscopic 3D video
Visual Attention Models (VAMs) predict the regions of an image or video that are most likely to attract human attention. Although saliency detection is well explored for 2D image and video content, only a few attempts have been made to design 3D saliency prediction models. Newly proposed 3D visual attention models have to be validated over large-scale video saliency prediction datasets that also contain eye-tracking information. Several eye-tracking datasets are publicly available for 2D image and video content. For 3D, however, the research community still needs large-scale video saliency datasets for validating different 3D-VAMs. In this paper, we introduce a large-scale dataset containing eye-tracking data collected from 24 subjects in a free-viewing test over 61 stereoscopic 3D videos (and their 2D versions). We evaluate the performance of existing saliency detection methods over the proposed dataset. In addition, we created an online benchmark for validating the performance of existing 2D and 3D visual attention models and for facilitating the addition of new VAMs to the benchmark. Our benchmark currently contains 50 different VAMs.
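
Benchmarks of this kind typically score a predicted saliency map against recorded fixations with standard metrics. The sketch below shows two widely used ones, Normalized Scanpath Saliency (NSS) and the linear correlation coefficient (CC); the metric set actually used by this benchmark may differ, and the function names are assumptions.

```python
# Hedged sketch of two common saliency evaluation metrics (NSS and CC).
import numpy as np

def nss(saliency, fixations):
    """Normalized Scanpath Saliency: mean of the z-scored map at fixated pixels."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
    return s[fixations.astype(bool)].mean()

def cc(saliency, fixation_density):
    """Pearson linear correlation between saliency map and fixation density map."""
    a = saliency - saliency.mean()
    b = fixation_density - fixation_density.mean()
    return (a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + 1e-12)
```
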
Unsupervised Video Analysis Based on a Spatiotemporal Saliency Detector
Visual saliency, which predicts the regions in the field of view that draw the most visual attention, has attracted a lot of interest from researchers. It has already been used in several vision tasks, e.g., image classification, object detection, and foreground segmentation. Recently, spectrum-analysis-based visual saliency approaches, in which the phase information of the image is used to construct the saliency map, have gained popularity due to their simplicity and good performance. In this paper, we propose a new approach for detecting spatiotemporal visual saliency based on the phase spectrum of videos, which is easy to implement and computationally efficient. With the proposed algorithm, we also study how spatiotemporal saliency can be used in two important vision tasks, abnormality detection and spatiotemporal interest point detection. The proposed algorithm is evaluated on several commonly used datasets in comparison with state-of-the-art methods from the literature. The experiments demonstrate the effectiveness of the proposed approach to spatiotemporal visual saliency detection and its application to the above vision tasks.
Comment: 21 pages
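
To make the spectral idea concrete, the sketch below computes a single-frame phase-spectrum saliency map: the amplitude spectrum is discarded, only the phase is kept, and the squared magnitude of the inverse transform is smoothed into a saliency map. This is a simplified, purely spatial illustration; the paper's spatiotemporal extension over video is not reproduced here, and the function name and smoothing parameter are assumptions.

```python
# Minimal sketch of single-frame phase-spectrum saliency (spatial only).
import numpy as np
from scipy.ndimage import gaussian_filter

def phase_saliency(gray_frame, sigma=3.0):
    """gray_frame: 2D float array. Keep only the phase of the spectrum."""
    spectrum = np.fft.fft2(gray_frame)
    phase_only = np.exp(1j * np.angle(spectrum))      # discard amplitude, keep phase
    recon = np.abs(np.fft.ifft2(phase_only)) ** 2     # reconstruct and square
    sal = gaussian_filter(recon, sigma)               # smooth into a saliency map
    return sal / (sal.max() + 1e-12)
```
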
Digging Deeper into Egocentric Gaze Prediction
This paper digs deeper into factors that influence egocentric gaze. Instead
of training deep models for this purpose in a blind manner, we propose to
inspect factors that contribute to gaze guidance during daily tasks. Bottom-up
saliency and optical flow are assessed versus strong spatial prior baselines.
Task-specific cues such as vanishing point, manipulation point, and hand
regions are analyzed as representatives of top-down information. We also look
into the contribution of these factors by investigating a simple recurrent
neural model for egocentric gaze prediction. First, deep features are
extracted for all input video frames. Then, a gated recurrent unit is employed
to integrate information over time and to predict the next fixation. We also
propose an integrated model that combines the recurrent model with several
top-down and bottom-up cues. Extensive experiments over multiple datasets
reveal that (1) spatial biases are strong in egocentric videos, (2) bottom-up
saliency models perform poorly in predicting gaze and underperform spatial
biases, (3) deep features perform better compared to traditional features, (4)
as opposed to hand regions, the manipulation point is a strong influential cue
for gaze prediction, (5) combining the proposed recurrent model with bottom-up
cues, vanishing points and, in particular, the manipulation point results in the
best gaze prediction accuracy over egocentric videos, (6) the knowledge
transfer works best for cases where the tasks or sequences are similar, and (7)
task and activity recognition can benefit from gaze prediction. Our findings
suggest that (1) there should be more emphasis on hand-object interaction and
(2) the egocentric vision community should consider larger datasets including
diverse stimuli and more subjects.
Comment: presented at WACV 201
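
A minimal sketch of the recurrent model described above (per-frame deep features integrated by a gated recurrent unit that predicts the next fixation) is given below; the class name, feature dimension, hidden size, and output parameterization are assumptions rather than the authors' configuration.

```python
# Illustrative sketch of GRU-based gaze prediction from per-frame deep features.
import torch
import torch.nn as nn

class GazeGRU(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 2), nn.Sigmoid())  # (x, y) in [0, 1]

    def forward(self, frame_feats):
        """frame_feats: (batch, time, feat_dim) features from a pretrained CNN."""
        out, _ = self.gru(frame_feats)                 # integrate information over time
        return self.head(out[:, -1])                   # predict the next fixation

model = GazeGRU()
print(model(torch.randn(4, 16, 2048)).shape)  # torch.Size([4, 2])
```
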
Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking
In this paper, we develop a new approach to visual object tracking based on spatially supervised recurrent convolutional neural networks. Our recurrent
convolutional network exploits the history of locations as well as the
distinctive visual features learned by the deep neural networks. Inspired by
recent bounding box regression methods for object detection, we study the
regression capability of Long Short-Term Memory (LSTM) in the temporal domain,
and propose to concatenate high-level visual features produced by convolutional
networks with region information. In contrast to existing deep learning based
trackers that use binary classification for region candidates, we use
regression for direct prediction of the tracking locations both at the
convolutional layer and at the recurrent unit. Our extensive experimental
results and performance comparison with state-of-the-art tracking methods on
challenging benchmark video tracking datasets show that our tracker is more
accurate and robust while maintaining low computational cost. For most test
video sequences, our method achieves the best tracking performance, often outperforming the second best by a large margin.
Comment: 10 pages, 9 figures, conference
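
As a rough sketch of the core idea, regressing box coordinates directly from convolutional features concatenated with region information inside an LSTM, consider the snippet below; the class name, dimensions, and box parameterization are placeholders, not the paper's setup.

```python
# Hedged sketch: LSTM regression of tracking boxes from fused features.
import torch
import torch.nn as nn

class RecurrentBoxRegressor(nn.Module):
    def __init__(self, feat_dim=4096, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + 4, hidden, batch_first=True)
        self.box_head = nn.Linear(hidden, 4)            # (cx, cy, w, h), normalized

    def forward(self, cnn_feats, prev_boxes):
        """cnn_feats: (b, t, feat_dim); prev_boxes: (b, t, 4) region information."""
        x = torch.cat([cnn_feats, prev_boxes], dim=-1)  # fuse appearance and location
        out, _ = self.lstm(x)
        return self.box_head(out)                       # direct box regression per frame

tracker = RecurrentBoxRegressor()
print(tracker(torch.randn(2, 8, 4096), torch.rand(2, 8, 4)).shape)  # (2, 8, 4)
```
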
Graph-Theoretic Spatiotemporal Context Modeling for Video Saliency Detection
As an important and challenging problem in computer vision, video saliency
detection is typically cast as a spatiotemporal context modeling problem over
consecutive frames. As a result, a key issue in video saliency detection is how
to effectively capture the intrinsic properties of atomic video structures as
well as their associated contextual interactions along the spatial and temporal
dimensions. Motivated by this observation, we propose a graph-theoretic video
saliency detection approach based on adaptive video structure discovery, which
is carried out within a spatiotemporal atomic graph. Through graph-based
manifold propagation, the proposed approach is capable of effectively modeling
the semantically contextual interactions among atomic video structures for
saliency detection while preserving spatial smoothness and temporal
consistency. Experiments demonstrate the effectiveness of the proposed approach
over several benchmark datasets.
Comment: ICIP 201
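
Graph-based manifold propagation of this kind usually spreads seed saliency scores over a normalized affinity graph of atomic video units. The NumPy sketch below shows the standard closed-form propagation; the graph construction, seed selection, and parameter value are simplified assumptions, not the paper's procedure.

```python
# Hedged sketch of closed-form manifold propagation over an affinity graph.
import numpy as np

def manifold_propagation(W, seeds, alpha=0.99):
    """W: (n, n) non-negative affinity matrix; seeds: (n,) initial saliency scores."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    S = D_inv_sqrt @ W @ D_inv_sqrt                    # symmetrically normalized affinities
    n = W.shape[0]
    f = np.linalg.solve(np.eye(n) - alpha * S, seeds)  # propagate seed scores over the graph
    return (f - f.min()) / (f.max() - f.min() + 1e-12)
```
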
Spatiotemporal Knowledge Distillation for Efficient Estimation of Aerial Video Saliency
Video saliency estimation techniques have advanced significantly along with the rapid development of Convolutional Neural Networks (CNNs). However, devices like cameras and drones may have limited
computational capability and storage space so that the direct deployment of
complex deep saliency models becomes infeasible. To address this problem, this
paper proposes a dynamic saliency estimation approach for aerial videos via
spatiotemporal knowledge distillation. In this approach, five components are
involved, including two teachers, two students and the desired spatiotemporal
model. The knowledge of spatial and temporal saliency is first separately
transferred from the two complex and redundant teachers to their simple and
compact students, and the input scenes are also degraded from high-resolution
to low-resolution to remove the probable data redundancy so as to greatly speed
up the feature extraction process. After that, the desired spatiotemporal model
is further trained by distilling and encoding the spatial and temporal saliency
knowledge of two students into a unified network. In this manner, the
inter-model redundancy can be further removed for the effective estimation of
dynamic saliency on aerial videos. Experimental results show that the proposed
approach outperforms ten state-of-the-art models in estimating visual saliency
on aerial videos, while its speed reaches up to 28,738 FPS on the GPU platform.
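
A hedged sketch of a single spatial or temporal distillation step is shown below: a compact student is trained to reproduce a larger teacher's saliency maps while operating on degraded, low-resolution input. The model interfaces, resolution, and loss are assumptions chosen for illustration, not the authors' training recipe.

```python
# Illustrative sketch of one teacher-to-student distillation step on degraded input.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, frames, optimizer, low_res=(112, 112)):
    """frames: (b, c, h, w) high-resolution input frames."""
    with torch.no_grad():
        target = teacher(frames)                        # teacher saliency maps
    small = F.interpolate(frames, size=low_res, mode='bilinear', align_corners=False)
    pred = student(small)                               # student works on low-res input
    pred = F.interpolate(pred, size=target.shape[-2:], mode='bilinear', align_corners=False)
    loss = F.mse_loss(pred, target)                     # match the teacher's output
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
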
Spatiotemporal CNN for Video Object Segmentation
In this paper, we present a unified, end-to-end trainable spatiotemporal CNN model for video object segmentation (VOS), which consists of two branches, i.e., the temporal coherence
branch and the spatial segmentation branch. Specifically, the temporal
coherence branch, pretrained in an adversarial fashion on unlabeled video
data, is designed to capture the dynamic appearance and motion cues of video
sequences to guide object segmentation. The spatial segmentation branch focuses
on segmenting objects accurately based on the learned appearance and motion
cues. To obtain accurate segmentation results, we design a coarse-to-fine
process to sequentially apply a designed attention module on multi-scale
feature maps, and concatenate them to produce the final prediction. In this
way, the spatial segmentation branch is enforced to gradually concentrate on
object regions. These two branches are jointly fine-tuned on video segmentation
sequences in an end-to-end manner. Several experiments are carried out on three
challenging datasets (i.e., DAVIS-2016, DAVIS-2017 and Youtube-Object) to show
that our method achieves favorable performance against state-of-the-art methods. Code is available at https://github.com/longyin880815/STCNN.
Comment: 10 pages, 3 figures, 6 tables, CVPR 201
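
To illustrate the coarse-to-fine attention design mentioned above, the sketch below applies a simple spatial attention module to each of several multi-scale feature maps and concatenates the upsampled results; this is one plausible realization assumed for illustration, not the released STCNN code.

```python
# Hedged sketch: spatial attention over multi-scale features, then concatenation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineFusion(nn.Module):
    def __init__(self, channel_list):
        super().__init__()
        self.attn = nn.ModuleList(nn.Conv2d(c, 1, kernel_size=1) for c in channel_list)

    def forward(self, features, out_size):
        """features: list of (b, c_i, h_i, w_i) maps, ordered coarse to fine."""
        fused = []
        for f, conv in zip(features, self.attn):
            att = f * torch.sigmoid(conv(f))            # reweight positions by attention
            fused.append(F.interpolate(att, size=out_size, mode='bilinear',
                                       align_corners=False))
        return torch.cat(fused, dim=1)                  # concatenated for final prediction
```
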
A Survey on Content-Aware Video Analysis for Sports
Sports data analysis is becoming increasingly large-scale, diversified, and
shared, but difficulty persists in rapidly accessing the most crucial
information. Previous surveys have focused on the methodologies of sports video
analysis from the spatiotemporal viewpoint instead of a content-based
viewpoint, and few of these studies have considered semantics. This study
develops a deeper interpretation of content-aware sports video analysis by
examining the insight offered by research into the structure of content under
different scenarios. On the basis of this insight, we provide an overview of
the themes particularly relevant to the research on content-aware systems for
broadcast sports. Specifically, we focus on the video content analysis
techniques applied in sportscasts over the past decade from the perspectives of
fundamentals and general review, a content hierarchical model, and trends and
challenges. Content-aware analysis methods are discussed with respect to
object-, event-, and context-oriented groups. In each group, the gap between
sensation and content excitement must be bridged using proper strategies. In
this regard, a content-aware approach is required to determine user demands.
Finally, the paper summarizes the future trends and challenges for sports video
analysis. We believe that our findings can advance the field of research on
content-aware video analysis for broadcast sports.
Comment: Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)