Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM
Over the past few years, deep neural networks (DNNs) have exhibited great
success in predicting the saliency of images. However, there are few works that
apply DNNs to predict the saliency of generic videos. In this paper, we propose
a novel DNN-based video saliency prediction method. Specifically, we establish
a large-scale eye-tracking database of videos (LEDOV), which provides
sufficient data to train the DNN models for predicting video saliency. Through
the statistical analysis of our LEDOV database, we find that human attention is
normally attracted by objects, particularly moving objects or the moving parts
of objects. Accordingly, we propose an object-to-motion convolutional neural
network (OM-CNN) to learn spatio-temporal features for predicting the
intra-frame saliency by exploring the information of both objectness and
object motion. We further find from our database that there exists a temporal
correlation of human attention with a smooth saliency transition across video
frames. Therefore, we develop a two-layer convolutional long short-term memory
(2C-LSTM) network in our DNN-based method, using the extracted features of
OM-CNN as the input. Consequently, the inter-frame saliency maps of videos can
be generated, which consider the transition of attention across video frames.
Finally, the experimental results show that our method advances the
state-of-the-art in video saliency prediction.
Comment: Jiang, Lai and Xu, Mai and Liu, Tie and Qiao, Minglang and Wang, Zulin; DeepVS: A Deep Learning Based Video Saliency Prediction Approach; The European Conference on Computer Vision (ECCV); September 2018
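To make the 2C-LSTM idea concrete: a convolutional LSTM replaces the fully
connected transforms of a vanilla LSTM with convolutions, so the hidden state
stays a spatial feature map. Below is a minimal sketch of one such cell in
PyTorch; the channel sizes and two-cell stacking are illustrative, not the
authors' configuration.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: all four gates are produced by one
    convolution over the concatenated input and hidden state."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# Two stacked cells applied frame by frame, e.g. to OM-CNN feature maps
# (64 input channels here is an assumption, not the paper's number):
cell1, cell2 = ConvLSTMCell(64, 32), ConvLSTMCell(32, 32)
x = torch.randn(1, 64, 28, 28)
h1 = c1 = torch.zeros(1, 32, 28, 28)
h1, c1 = cell1(x, (h1, c1))
h2, c2 = cell2(h1, (torch.zeros_like(h1), torch.zeros_like(h1)))
```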
SG-FCN: A Motion and Memory-Based Deep Learning Model for Video Saliency Detection
Data-driven saliency detection has attracted strong interest as a result of
applying convolutional neural networks to the detection of eye fixations.
Although a number of image-based salient object and fixation detection models
have been proposed, video fixation detection still requires more exploration.
Different from image analysis, motion and temporal information is a crucial
factor affecting human attention when viewing video sequences. Although
existing models based on local contrast and low-level features have been
extensively researched, they fail to simultaneously consider inter-frame
motion and temporal information across neighboring video frames, leading to
unsatisfactory performance when handling complex scenes. To this end, we
propose a novel and efficient video eye fixation detection model to improve the
saliency detection performance. By simulating the memory mechanism and visual
attention mechanism of human beings when watching a video, we propose a
step-gained fully convolutional network that combines the memory information on
the time axis with the motion information on the space axis while storing the
saliency information of the current frame. The model is obtained through
hierarchical training, which ensures the accuracy of the detection. Extensive
experiments in comparison with 11 state-of-the-art methods are carried out, and
the results show that our proposed model outperforms all 11 methods across a
number of publicly available datasets.
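A minimal sketch of the fusion idea, assuming PyTorch: a fully convolutional
head that stacks the current frame, a crude motion cue (frame difference), and
the previous frame's saliency map as input channels. The layer sizes and the
frame-difference motion proxy are placeholders, not the published SG-FCN
architecture.

```python
import torch
import torch.nn as nn

class StepGainedFCNSketch(nn.Module):
    """Illustrative FCN head fusing appearance, a motion proxy, and the
    stored saliency of the previous frame into a per-pixel saliency map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 3 + 1, 32, 3, padding=1), nn.ReLU(),  # RGB + motion + memory
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),                  # saliency in [0, 1]
        )

    def forward(self, frame, prev_frame, prev_saliency):
        motion = frame - prev_frame       # crude stand-in for a motion feature
        x = torch.cat([frame, motion, prev_saliency], dim=1)
        return self.net(x)
```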
You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions
The body pose of a person wearing a camera is of great interest for
applications in augmented reality, healthcare, and robotics, yet much of the
person's body is out of view for a typical wearable camera. We propose a
learning-based approach to estimate the camera wearer's 3D body pose from
egocentric video sequences. Our key insight is to leverage interactions with
another person, whose body pose we can directly observe, as a signal
inherently linked to the body pose of the first-person subject. We show that
since interactions between individuals often induce a well-ordered series of
back-and-forth responses, it is possible to learn a temporal model of the
interlinked poses even though one party is largely out of view. We demonstrate
our idea on a variety of domains with dyadic interaction and show the
substantial impact on egocentric body pose estimation, which improves the state
of the art. Video results are available at
http://vision.cs.utexas.edu/projects/you2me
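One way to picture the temporal model of interlinked poses is a recurrent
regressor from the visible interactee's pose sequence to the wearer's 3D
joints. The sketch below assumes PyTorch; the joint counts, hidden size, and
plain-LSTM choice are illustrative assumptions, not the authors' design.

```python
import torch
import torch.nn as nn

class DyadicPoseLSTM(nn.Module):
    """Sketch: regress the (unseen) camera wearer's 3D joints from the
    visible second person's 2D pose sequence. Sizes are hypothetical."""
    def __init__(self, n_obs_joints=25, n_ego_joints=17, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_obs_joints * 2, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_ego_joints * 3)

    def forward(self, interactee_pose_seq):      # (B, T, n_obs_joints * 2)
        h, _ = self.lstm(interactee_pose_seq)
        return self.head(h)                      # (B, T, n_ego_joints * 3)
```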
A spatiotemporal model with visual attention for video classification
High-level understanding of sequential visual input is important for safe and
stable autonomy, especially in localization and object detection. While
traditional object classification and tracking approaches are specifically
designed to handle variations in rotation and scale, current state-of-the-art
approaches based on deep learning achieve better performance. This paper
focuses on developing a spatiotemporal model to handle videos containing moving
objects with rotation and scale changes. Built on models that combine
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to
classify sequential data, this work investigates the effectiveness of
incorporating attention modules in the CNN stage for video classification. The
superiority of the proposed spatiotemporal model is demonstrated on the Moving
MNIST dataset augmented with rotation and scaling.
Comment: Accepted by Robotics: Science and Systems 2017 Workshop on Articulated Model Tracking
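A minimal sketch of the kind of attention module that can sit in the CNN
stage, assuming PyTorch: soft spatial attention that pools a feature map into
one attended vector per frame before it reaches the recurrent classifier. The
single-convolution scoring function is a simplification, not necessarily the
paper's module.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Soft spatial attention: score every location, softmax over space,
    then return the attention-weighted average of the feature map."""
    def __init__(self, ch):
        super().__init__()
        self.score = nn.Conv2d(ch, 1, 1)     # one attention logit per location

    def forward(self, feats):                 # (B, C, H, W)
        b, c, h, w = feats.shape
        a = torch.softmax(self.score(feats).view(b, 1, h * w), dim=-1)
        return (feats.view(b, c, h * w) * a).sum(-1)   # (B, C) per frame
```

The attended per-frame vectors can then be fed to an RNN (e.g. a GRU) for
sequence classification.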
Combined Static and Motion Features for Deep-Networks Based Activity Recognition in Videos
Activity recognition in videos, whether in a deep-learning setting or
otherwise, uses both static and pre-computed motion components. How best to
combine the two components while keeping the burden on the deep network low
remains uninvestigated. Moreover, it is not clear how much each component
contributes, or how to control that contribution.
In this work, we use a combination of CNN-generated static features and motion
features in the form of motion tubes. We propose three schemas for combining
static and motion components: based on a variance ratio, principal components,
and Cholesky decomposition. The Cholesky decomposition based method allows the
control of contributions. The ratio given by variance analysis of static and
motion features matches well with the experimentally optimal ratio used in the
Cholesky decomposition based method. The resulting activity recognition system
is better than or on par with the existing state of the art when tested on
three popular datasets. The findings also enable us to characterize a dataset
with respect to its richness in motion information.
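One common way to realize Cholesky-based control of contributions: take the
mixing weights from the Cholesky factor of a 2x2 target correlation matrix, so
the combined feature keeps a chosen correlation with the static stream. This
NumPy sketch is an illustrative reading of the technique, not the paper's
exact formulation; it assumes both feature vectors are normalized to unit
variance.

```python
import numpy as np

def cholesky_combine(static_f, motion_f, rho):
    """Mix two unit-variance feature vectors so the result keeps a target
    correlation `rho` with the static stream. The weights are the second
    row of the Cholesky factor of [[1, rho], [rho, 1]]. Illustrative only."""
    L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    w_static, w_motion = L[1, 0], L[1, 1]     # equals (rho, sqrt(1 - rho**2))
    return w_static * static_f + w_motion * motion_f

# rho near 1 emphasizes static appearance; rho near 0 emphasizes motion.
z = cholesky_combine(np.random.randn(512), np.random.randn(512), rho=0.7)
```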
Future Localization from an Egocentric Depth Image
This paper presents a method for future localization: to predict a set of
plausible trajectories of ego-motion given a depth image. We predict paths
avoiding obstacles, between objects, even paths turning around a corner into
space behind objects. As a byproduct of the predicted trajectories of
ego-motion, we discover in the image the empty space occluded by foreground
objects. We use no image-based features such as semantic labeling/segmentation
or object detection/recognition for this algorithm. Inspired by proxemics, we
represent the space around a person using an EgoSpace map, akin to an
illustrated tourist map, that measures the likelihood of occlusion in the
egocentric coordinate system. A future trajectory of ego-motion is modeled by a
linear combination of compact trajectory bases allowing us to constrain the
predicted trajectory. We learn the relationship between the EgoSpace map and
trajectory from the EgoMotion dataset, which provides in-situ measurements of the
future trajectory. A cost function that takes into account partial occlusion
due to foreground objects is minimized to predict a trajectory. This cost
function generates a trajectory that passes through the occluded space, which
allows us to discover the empty space behind the foreground objects. We
quantitatively evaluate our method to show predictive validity and apply it to
various real-world scenes including walking, shopping, and social interactions.
Comment: 9 pages
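To illustrate the compact trajectory bases: a predicted path can be written as
a linear combination of a few low-frequency basis functions, which constrains
it to smooth ego-motion. The DCT-style basis and the dimensions in this NumPy
sketch are assumptions, not the actual bases learned from the EgoMotion
dataset.

```python
import numpy as np

def trajectory_from_bases(coeffs, horizon=60, n_bases=5):
    """Reconstruct a 2D ground-plane path as a linear combination of a few
    low-frequency cosine (DCT-II style) bases. Basis choice is illustrative."""
    t = np.arange(horizon)
    B = np.stack([np.cos(np.pi * k * (t + 0.5) / horizon)
                  for k in range(n_bases)], axis=1)    # (horizon, n_bases)
    return B @ coeffs                                   # (horizon, 2) path

# A trajectory is then fully described by a small (n_bases, 2) coefficient
# matrix, which is what the cost minimization needs to search over.
path = trajectory_from_bases(np.random.randn(5, 2) * 0.5)
```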
Relational Long Short-Term Memory for Video Action Recognition
Spatial and temporal relationships between objects in videos, both short-range
and long-range, are key cues for recognizing actions. It is a challenging
problem to model them jointly. In this paper, we first present a new variant of
Long Short-Term Memory, namely Relational LSTM, to address the challenge of
relation reasoning across space and time between objects. In our Relational
LSTM module, we utilize a non-local operation similar in spirit to the recently
proposed non-local network to substitute the fully connected operation in the
vanilla LSTM. By doing this, our Relational LSTM is capable of capturing long-
and short-range spatio-temporal relations between objects in videos in a
principled way. Then, we propose a two-branch neural architecture consisting of
the Relational LSTM module as the non-local branch and a spatio-temporal
pooling based local branch. The local branch is utilized for capturing local
spatial appearance and/or short-term motion features. The two branches are
concatenated to learn video-level features from snippet-level ones which are
then used for classification. Experimental results on UCF-101 and HMDB-51
datasets show that our model achieves state-of-the-art results among LSTM-based
methods, while obtaining comparable performance with other state-of-the-art
methods (which use schemes that are not directly comparable). Further, on the
more complex large-scale Charades dataset, we obtain a large 3.2% gain over
state-of-the-art methods, verifying the effectiveness of our method in complex
video understanding.
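For concreteness, the non-local operation the paper builds on computes
attention between all pairs of spatial positions. Below is a minimal
embedded-Gaussian variant in PyTorch, standing in for the fully connected
transforms it replaces inside the LSTM; this standalone block is a
simplification of how the paper actually embeds it.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Minimal embedded-Gaussian non-local operation: every spatial
    position attends to every other position, with a residual connection."""
    def __init__(self, ch, inner=None):
        super().__init__()
        inner = inner or ch // 2
        self.theta = nn.Conv2d(ch, inner, 1)
        self.phi = nn.Conv2d(ch, inner, 1)
        self.g = nn.Conv2d(ch, inner, 1)
        self.out = nn.Conv2d(inner, ch, 1)

    def forward(self, x):                              # (B, C, H, W)
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.phi(x).flatten(2)                     # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)            # (B, HW, HW) pairwise
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection
```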
Multiple Object Tracking: A Literature Review
Multiple Object Tracking (MOT) is an important computer vision problem which
has gained increasing attention due to its academic and commercial potential.
Although different kinds of approaches have been proposed to tackle this
problem, it still remains challenging due to factors like abrupt appearance
changes and severe object occlusions. In this work, we contribute the first
comprehensive and most up-to-date review of this problem. We inspect the recent
advances in various aspects and propose some interesting directions for future
research. To the best of our knowledge, there has not been any extensive review
on this topic in the community. We endeavor to provide a thorough review on the
development of this problem in recent decades. The main contributions of this
review are fourfold: 1) Key aspects of a multiple object tracking system,
including formulation, categorization, key principles, and evaluation, are
discussed. 2) Instead of enumerating individual works, we discuss existing
approaches according to various aspects, in each of which methods are divided
into different groups and each group is discussed in detail for the principles,
advances and drawbacks. 3) We examine experiments of existing publications and
summarize results on popular datasets to provide quantitative comparisons. We
also point to some interesting discoveries by analyzing these results. 4) We
provide a discussion of open issues in MOT research, as well as some
interesting directions that could become fruitful lines of future research.
Saliency Prediction in the Deep Learning Era: Successes, Limitations, and Future Challenges
Visual saliency models have enjoyed a big leap in performance in recent
years, thanks to advances in deep learning and large scale annotated data.
Despite enormous effort and huge breakthroughs, however, models still fall
short of human-level accuracy. In this work, I explore the landscape of the
field, emphasizing new deep saliency models, benchmarks, and datasets.
A large number of image and video saliency models are reviewed and compared
over two image benchmarks and two large scale video datasets. Further, I
identify factors that contribute to the gap between models and humans and
discuss remaining issues that need to be addressed to build the next generation
of more powerful saliency models. Some specific questions that are addressed
include: in what ways current models fail, how to remedy them, what can be
learned from cognitive studies of attention, how explicit saliency judgments
relate to fixations, how to conduct fair model comparisons, and what the
emerging applications of saliency models are.
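One of the questions above, fair model comparison, turns on the choice of
metric; Normalized Scanpath Saliency (NSS) is a common fixation-based metric
for benchmarking saliency models. A minimal NumPy sketch follows; the epsilon
guard is an implementation detail, not taken from any particular benchmark.

```python
import numpy as np

def nss(sal_map, fixations):
    """Normalized Scanpath Saliency: mean of the z-scored saliency map at
    human fixation locations. `fixations` is a binary map of the same shape;
    higher NSS means the prediction lines up better with where people looked."""
    s = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    return s[fixations.astype(bool)].mean()
```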
The Open World of Micro-Videos
Micro-videos are six-second videos popular on social media networks with
several unique properties. Firstly, because of the authoring process, they
contain significantly more diversity and narrative structure than existing
collections of video "snippets". Secondly, because they are often captured by
hand-held mobile cameras, they contain specialized viewpoints including
third-person, egocentric, and self-facing views seldom seen in traditional
produced video. Thirdly, due to their continuous production and publication
on social networks, aggregate micro-video content contains interesting
open-world dynamics that reflect the temporal evolution of tag topics. These
aspects make micro-videos an appealing well of visual data for developing
large-scale models for video understanding. We analyze a novel dataset of
micro-videos labeled with 58 thousand tags. To analyze this data, we introduce
viewpoint-specific and temporally-evolving models for video understanding,
defined over state-of-the-art motion and deep visual features. We conclude that
our dataset opens up new research opportunities for large-scale video analysis,
novel viewpoints, and open-world dynamics.