MAIN: Multi-Attention Instance Network for Video Segmentation
Instance-level video segmentation requires a solid integration of spatial and
temporal information. However, current methods rely mostly on domain-specific
information (online learning) to produce accurate instance-level segmentations.
We propose a novel approach that relies exclusively on the integration of
generic spatio-temporal attention cues. Our strategy, named Multi-Attention
Instance Network (MAIN), overcomes challenging segmentation scenarios over
arbitrary videos without modelling sequence- or instance-specific knowledge. We
design MAIN to segment multiple instances in a single forward pass, and
optimize it with a novel loss function that favors class agnostic predictions
and assigns instance-specific penalties. We achieve state-of-the-art
performance on the challenging YouTube-VOS dataset and benchmark, improving the
unseen Jaccard and F-Metric by 6.8% and 12.7%, respectively, while operating
in real time (30.3 FPS).
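As a rough illustration of the kind of objective described (the paper's exact formulation is not given here), a class-agnostic, instance-penalized loss can be sketched as a per-instance weighted binary cross-entropy; the weighting scheme below is an assumption, not the authors' method:

```python
import torch
import torch.nn.functional as F

def instance_loss(pred_masks, gt_masks, instance_weights):
    """pred_masks: (N, H, W) logits, one channel per instance (class agnostic);
    gt_masks: (N, H, W) binary targets; instance_weights: (N,) hypothetical
    per-instance penalties (e.g. inverse instance size)."""
    # Class-agnostic: plain binary cross-entropy, no category labels involved.
    per_pixel = F.binary_cross_entropy_with_logits(
        pred_masks, gt_masks.float(), reduction='none')
    per_instance = per_pixel.mean(dim=(1, 2))          # (N,)
    # Instance-specific penalty: scale each instance's loss by its weight.
    return (instance_weights * per_instance).mean()
```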
Dynamical optical flow of saliency maps for predicting visual attention
Saliency maps are used to understand human attention and visual fixation.
However, while very well established for static images, there is no general
agreement on how to compute a saliency map of dynamic scenes. In this paper we
propose a mathematically rigorous approach to this problem, which incorporates
the static saliency maps of individual video frames into the calculation of
the optical flow. Taking static saliency maps into account when calculating
the optical flow makes it possible to overcome the aperture problem. Our
approach is able to explain
human fixation behavior in situations which pose challenges to standard
approaches, such as when a fixated object disappears behind an occlusion and
reappears after several frames. In addition, we quantitatively compare our
model against alternative solutions using a large eye tracking data set.
Together, our results suggest that assessing optical flow information across a
series of saliency maps gives a highly accurate and useful account of human
overt attention in dynamic scenes.
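A minimal sketch of the pipeline the abstract describes, computing dense optical flow between per-frame static saliency maps; it assumes the opencv-contrib saliency module and Farneback flow as stand-ins for the paper's mathematically rigorous formulation:

```python
import cv2
import numpy as np

# Static saliency backend (opencv-contrib); a stand-in for the paper's model.
saliency = cv2.saliency.StaticSaliencySpectralResidual_create()

def saliency_map(frame_bgr):
    """Static saliency of a single frame, float32 in [0, 1]."""
    ok, smap = saliency.computeSaliency(frame_bgr)
    assert ok
    return smap.astype(np.float32)

def saliency_flow(frame_prev, frame_next):
    """Dense optical flow computed between two saliency maps instead of raw
    intensity images, following the idea in the abstract above."""
    s0 = (saliency_map(frame_prev) * 255).astype(np.uint8)
    s1 = (saliency_map(frame_next) * 255).astype(np.uint8)
    # Farneback dense flow; returns an (H, W, 2) displacement field.
    return cv2.calcOpticalFlowFarneback(s0, s1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```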
Explicit Spatiotemporal Joint Relation Learning for Tracking Human Pose
We present a method for human pose tracking that is based on learning
spatiotemporal relationships among joints. Beyond generating the heatmap of a
joint in a given frame, our system also learns to predict the offset of the
joint from a neighboring joint in the frame. Additionally, it is trained to
predict the displacement of the joint from its position in the previous frame,
in a manner that can account for possibly changing joint appearance, unlike
optical flow. These relational cues in the spatial domain and temporal domain
are inferred in a robust manner by attending only to relevant areas in the
video frames. By explicitly learning and exploiting these joint relationships,
our system achieves state-of-the-art performance on standard benchmarks for
various pose tracking tasks including 3D body pose tracking in RGB video, 3D
hand pose tracking in depth sequences, and 3D hand gesture tracking in RGB
video.
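To make the three cues concrete, here is a hypothetical decoding step that fuses a joint's heatmap peak, its offset from a neighboring joint, and its displacement from the previous frame; the simple averaging is illustrative, not the paper's learned, attention-based fusion:

```python
import numpy as np

def decode_joint(heatmap, offset_from_neighbor, neighbor_xy, prev_xy, displacement):
    """Fuse the three cues named in the abstract (all coordinates in pixels):
    heatmap: (H, W) confidence map for this joint;
    offset_from_neighbor: (2,) predicted offset from a neighboring joint;
    neighbor_xy, prev_xy: (2,) positions of the neighbor and of this joint
    in the previous frame; displacement: (2,) predicted temporal shift."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    cand_heat = np.array([x, y], dtype=np.float32)      # heatmap peak
    cand_spatial = neighbor_xy + offset_from_neighbor   # spatial relation cue
    cand_temporal = prev_xy + displacement              # temporal relation cue
    # Simple average as a stand-in for a learned fusion of the three cues.
    return (cand_heat + cand_spatial + cand_temporal) / 3.0
```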
Self-supervised Learning for Video Correspondence Flow
The objective of this paper is self-supervised learning of feature embeddings
that are suitable for matching correspondences along the videos, which we term
correspondence flow. By leveraging the natural spatial-temporal coherence in
videos, we propose to train a "pointer" that reconstructs a target frame by
copying pixels from a reference frame.
We make the following contributions: First, we introduce a simple information
bottleneck that forces the model to learn robust features for correspondence
matching, and prevent it from learning trivial solutions, \eg matching based on
low-level colour information. Second, to tackle the challenges from tracker
drifting, due to complex object deformations, illumination changes and
occlusions, we propose to train a recursive model over long temporal windows
with scheduled sampling and cycle consistency. Third, we achieve
state-of-the-art performance on DAVIS 2017 video segmentation and JHMDB
keypoint tracking tasks, outperforming all previous self-supervised learning
approaches by a significant margin. Fourth, in order to shed light on the
potential of self-supervised learning on the task of video correspondence flow,
we probe the upper bound by training on additional data, i.e., more diverse
videos, further demonstrating significant improvements on video segmentation.
Comment: BMVC 2019 (Oral Presentation)
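A minimal PyTorch sketch of the "pointer" mechanism described above: the target frame is reconstructed as an attention-weighted copy of reference-frame pixels, with affinities computed from learned feature embeddings. The temperature value and tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def pointer_reconstruct(feat_ref, feat_tgt, pixels_ref, temperature=0.07):
    """feat_ref, feat_tgt: (B, C, H, W) frame embeddings;
    pixels_ref: (B, 3, H, W) reference-frame colours to copy from."""
    B, C, H, W = feat_ref.shape
    f_ref = feat_ref.flatten(2)                        # (B, C, HW)
    f_tgt = feat_tgt.flatten(2)                        # (B, C, HW)
    # Affinity of every target location with every reference location.
    affinity = torch.einsum('bci,bcj->bij', f_tgt, f_ref) / temperature
    attn = F.softmax(affinity, dim=-1)                 # soft "pointer"
    colours = pixels_ref.flatten(2)                    # (B, 3, HW)
    recon = torch.einsum('bij,bcj->bci', attn, colours)
    return recon.view(B, 3, H, W)

# Training signal: a reconstruction loss against the true target frame,
# e.g. F.l1_loss(pointer_reconstruct(fr, ft, px_ref), px_tgt).
```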
Recurrent Mixture Density Network for Spatiotemporal Visual Attention
In many computer vision tasks, the relevant information for solving the problem
at hand is mixed with irrelevant, distracting information. This has motivated
researchers to design attentional models that can dynamically focus on parts of
images or videos that are salient, e.g., by down-weighting irrelevant pixels.
In this work, we propose a spatiotemporal attentional model that learns where
to look in a video directly from human fixation data. We model visual attention
with a mixture of Gaussians at each frame. This distribution is used to express
the probability of saliency for each pixel. Time consistency in videos is
modeled hierarchically by: 1) deep 3D convolutional features to represent
spatial and short-term time relations and 2) a long short-term memory network
on top that aggregates the clip-level representations of sequential clips and
therefore expands the temporal domain from a few frames to seconds. The
parameters of the proposed model are optimized via maximum likelihood
estimation using human fixations as training data, without knowledge of the
action in each video. Our experiments on Hollywood2 show state-of-the-art
performance on saliency prediction for video. We also show that our attentional
model trained on Hollywood2 generalizes well to UCF101 and it can be leveraged
to improve action classification accuracy on both datasets.
Comment: ICLR 2017
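The training objective can be sketched as the negative log-likelihood of fixation points under the predicted per-frame Gaussian mixture; the diagonal-covariance parameterization below is an assumption:

```python
import math
import torch

def fixation_nll(pi, mu, sigma, fixations):
    """pi: (B, K) mixture weights (already softmaxed); mu: (B, K, 2) means;
    sigma: (B, K, 2) per-axis standard deviations; fixations: (B, 2) points."""
    diff = fixations.unsqueeze(1) - mu                 # (B, K, 2)
    # Log-density of a 2-D Gaussian with diagonal covariance, per component.
    log_comp = (-0.5 * ((diff / sigma) ** 2).sum(-1)
                - sigma.log().sum(-1)
                - math.log(2 * math.pi))
    # Log-sum-exp over components for numerical stability, then average NLL.
    log_prob = torch.logsumexp(pi.log() + log_comp, dim=1)
    return -log_prob.mean()
```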
Global and Local Sensitivity Guided Key Salient Object Re-augmentation for Video Saliency Detection
Existing deep learning based saliency methods for still images do not consider
the weighting and highlighting of features extracted from different layers;
all features contribute equally to the final saliency decision. Such methods
tend to detect all "potentially significant regions" evenly and are unable to
highlight the key salient object, resulting in detection failures in dynamic
scenes. In this paper, based on the fact that salient areas in videos
are relatively small and concentrated, we propose a key salient object
re-augmentation method (KSORA) that uses top-down semantic knowledge and
bottom-up feature guidance to improve detection accuracy in video scenes. KSORA includes
two sub-modules (WFE and KOS): WFE performs local salient feature selection
using a bottom-up strategy, while KOS ranks each object globally using
top-down statistical knowledge and chooses the most critical object area for
local enhancement. The proposed KSORA can not only strengthen the saliency
value of the local key salient object but also ensure global saliency
consistency. Results on three benchmark datasets suggest that our model
improves detection accuracy on complex scenes. The strong performance of
KSORA, at a speed of 17 FPS on modern GPUs, has been verified through
comparisons with ten other state-of-the-art algorithms.
Comment: 6 figures, 10 pages
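A toy sketch of the re-augmentation step described above: candidate object regions are ranked by a global score (standing in for KOS's top-down statistical knowledge) and the top-ranked region's saliency is locally boosted. The boost factor is illustrative:

```python
import numpy as np

def reaugment(saliency, regions, scores, boost=1.5):
    """saliency: (H, W) map in [0, 1]; regions: list of (x0, y0, x1, y1) boxes;
    scores: one global importance score per region (higher = more critical)."""
    # Pick the top-ranked object region and strengthen its saliency locally.
    x0, y0, x1, y1 = regions[int(np.argmax(scores))]
    out = saliency.copy()
    out[y0:y1, x0:x1] = np.clip(out[y0:y1, x0:x1] * boost, 0.0, 1.0)
    return out
```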
SG-FCN: A Motion and Memory-Based Deep Learning Model for Video Saliency Detection
Data-driven saliency detection has attracted strong interest as a result of
applying convolutional neural networks to the detection of eye fixations.
Although a number of image-based salient object and fixation detection models
have been proposed, video fixation detection still requires more exploration.
Different from image analysis, motion and temporal information is a crucial
factor affecting human attention when viewing video sequences. Although
existing models based on local contrast and low-level features have been
extensively researched, they fail to simultaneously consider inter-frame
motion and temporal information across neighboring video frames, leading to
unsatisfactory performance when handling complex scenes. To this end, we
propose a novel and efficient video eye fixation detection model to improve the
saliency detection performance. By simulating the memory and visual attention
mechanisms of humans watching a video, we propose a step-gained fully
convolutional network that combines memory information along the time axis
with motion information along the space axis while storing the saliency
information of the current frame. The model is obtained through
hierarchical training, which ensures the accuracy of the detection. Extensive
experiments in comparison with 11 state-of-the-art methods are carried out, and
the results show that our proposed model outperforms all 11 methods across a
number of publicly available datasets.
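A hedged sketch of the fusion idea: the current frame's static saliency is combined with a motion cue and a memory cue (the previous prediction) through a small learned mixer; the 1x1-convolution gating below is a placeholder for the paper's step-gained combination:

```python
import torch
import torch.nn as nn

class StepGainedFusion(nn.Module):
    """Mixes a static saliency cue, a motion cue, and a memory cue (the
    previous frame's prediction) into the current frame's saliency map."""
    def __init__(self, channels=1):
        super().__init__()
        # A 1x1 convolution learns how to weight the three cues per pixel.
        self.mix = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, static_sal, motion_cue, prev_sal):
        fused = torch.cat([static_sal, motion_cue, prev_sal], dim=1)
        return torch.sigmoid(self.mix(fused))
```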
Real time expert system for anomaly detection of aerators based on computer vision technology and existing surveillance cameras
Aerators are essential and crucial auxiliary devices in intensive aquaculture,
especially in industrial aquaculture in China. Traditional methods cannot
accurately detect abnormal aerator conditions in a timely manner. Surveillance
cameras are widely used as visual perception modules of the Internet of
Things, so using these already widespread cameras for real-time anomaly
detection of aerators is a cost-free and easy-to-promote approach.
However, it is difficult to develop such an expert system due to some technical
and applied challenges, e.g., illumination, occlusion, complex background, etc.
To tackle these aforementioned challenges, we propose a real-time expert system
based on computer vision technology and existing surveillance cameras for
anomaly detection of aerators, which consists of two modules, i.e., object
region detection and working state detection. First, it is difficult to detect
the working state for some small object regions in whole images, and the time
complexity of global feature comparison is also high, so we present an object
region detection method based on the region proposal idea. Moreover, we propose
a novel algorithm called reference frame Kanade-Lucas-Tomasi (RF-KLT) algorithm
for motion feature extraction in fixed regions. Then, we present a dimension
reduction method of time series for establishing a feature dataset with obvious
boundaries between classes. Finally, we use machine learning algorithms to
build the feature classifier. The experimental results in both the actual video
dataset and the augmented video dataset show that the accuracy for detecting
object regions and the working state of aerators is 100% and 99.9%,
respectively, and the detection speed is 77-333 frames per second (FPS),
depending on the type of surveillance camera.
Comment: 17 figures
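A minimal sketch of the RF-KLT idea as described: corners are detected once in a fixed reference frame and re-matched in each new frame within a fixed region, and the mean displacement magnitude serves as a scalar motion feature. Detector and tracker parameters are illustrative:

```python
import cv2
import numpy as np

def rf_klt_motion(ref_gray, cur_gray, region):
    """Mean KLT displacement of reference-frame corners inside a fixed region.
    ref_gray, cur_gray: 8-bit grayscale frames; region: (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region
    ref_roi, cur_roi = ref_gray[y0:y1, x0:x1], cur_gray[y0:y1, x0:x1]
    # Corners come from the *reference* frame only (the "RF" in RF-KLT).
    pts = cv2.goodFeaturesToTrack(ref_roi, maxCorners=100,
                                  qualityLevel=0.01, minDistance=5)
    if pts is None:
        return 0.0
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(ref_roi, cur_roi, pts, None)
    good = status.ravel() == 1
    if not good.any():
        return 0.0
    # Mean displacement magnitude as the scalar motion feature for this region.
    return float(np.linalg.norm(nxt[good] - pts[good], axis=-1).mean())
```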
Goal-oriented Object Importance Estimation in On-road Driving Videos
We formulate a new problem, Object Importance Estimation (OIE), in on-road
driving videos, where road users are considered important objects if they
influence the control decisions of the ego-vehicle's driver. The
importance of a road user depends on both its visual dynamics, e.g.,
appearance, motion and location, in the driving scene and the driving goal,
e.g., the planned path, of the ego vehicle. We propose a novel framework
that incorporates both visual model and goal representation to conduct OIE. To
evaluate our framework, we collect an on-road driving dataset at traffic
intersections in the real world and conduct human-labeled annotation of the
important objects. Experimental results show that our goal-oriented method
outperforms baselines, with the largest improvements in the left-turn and
right-turn scenarios. Furthermore, we explore the possibility of using object
importance for driving control prediction and demonstrate that binary brake
prediction can be improved with object importance information.
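As a rough illustration of goal-conditioned scoring (not the authors' architecture), each road user's visual feature can be scored jointly with a representation of the planned path; the network sizes and names below are hypothetical:

```python
import torch
import torch.nn as nn

class ImportanceScorer(nn.Module):
    """Scores each road user from its visual feature plus a goal feature
    (e.g. an encoding of the planned path). Sizes are illustrative."""
    def __init__(self, visual_dim=256, goal_dim=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(visual_dim + goal_dim, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, object_feats, goal_feat):
        # object_feats: (N, visual_dim); goal_feat: (goal_dim,) shared by all.
        goal = goal_feat.expand(object_feats.size(0), -1)
        return torch.sigmoid(self.head(torch.cat([object_feats, goal], dim=1)))
```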
Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction
Computational saliency models for still images have gained significant
popularity in recent years. Saliency prediction from videos, on the other hand,
has received relatively little interest from the community. Motivated by this,
in this work, we study the use of deep learning for dynamic saliency prediction
and propose the so-called spatio-temporal saliency networks. The key to our
models is the architecture of two-stream networks where we investigate
different fusion mechanisms to integrate spatial and temporal information. We
evaluate our models on the DIEM and UCF-Sports datasets and present highly
competitive results against the existing state-of-the-art models. We also carry
out some experiments on a number of still images from the MIT300 dataset by
exploiting the optical flow maps predicted from these images. Our results show
that considering inherent motion information in this way can be helpful for
static saliency estimation.
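A minimal sketch of a two-stream saliency network with one simple (late) fusion variant; the paper investigates several fusion mechanisms, and the backbone_fn constructor here is a hypothetical placeholder for a map-producing backbone:

```python
import torch
import torch.nn as nn

class TwoStreamSaliency(nn.Module):
    """Spatial stream consumes RGB frames; temporal stream consumes optical
    flow; a 1x1 convolution late-fuses the two saliency maps."""
    def __init__(self, backbone_fn):
        super().__init__()
        self.spatial = backbone_fn()    # hypothetical saliency-map backbone
        self.temporal = backbone_fn()
        self.fuse = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, rgb, flow):
        s = self.spatial(rgb)           # (B, 1, H, W) spatial saliency
        t = self.temporal(flow)         # (B, 1, H, W) temporal saliency
        return torch.sigmoid(self.fuse(torch.cat([s, t], dim=1)))
```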