Temporal Dynamic Graph LSTM for Action-driven Video Object Detection
In this paper, we investigate a weakly-supervised object detection framework.
Most existing frameworks focus on using static images to learn object
detectors. However, these detectors often fail to generalize to videos because
of domain shift. Therefore, we investigate learning these
detectors directly from boring videos of daily activities. Instead of using
bounding boxes, we explore the use of action descriptions as supervision since
they are relatively easy to gather. A common issue, however, is that objects of
interest that are not involved in human actions are often absent from the global
action descriptions, a problem known as the "missing label" issue. To tackle this problem, we
propose a novel temporal dynamic graph Long Short-Term Memory network (TD-Graph
LSTM). TD-Graph LSTM enables global temporal reasoning by constructing a
dynamic graph that is based on temporal correlations of object proposals and
spans the entire video. The missing label issue for each individual frame can
thus be significantly alleviated by transferring knowledge across correlated
object proposals in the whole video. Extensive evaluations on a large-scale
daily-life action dataset (i.e., Charades) demonstrate the superiority of our
proposed method. We also release object bounding-box annotations for more than
5,000 frames in Charades. We believe this annotated data can also benefit other
research on video-based object recognition in the future.
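As a rough, illustrative sketch of the mechanism this abstract describes (not the authors' implementation), the snippet below links object-proposal features across adjacent frames by pairwise similarity and carries a recurrent state through the video with a standard LSTM cell. The class and function names, feature dimensions, and the frame-level mean pooling are all assumptions made for the example; the actual TD-Graph LSTM reasons at the level of individual proposals across the whole video.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalGraphLSTMSketch(nn.Module):
    """Toy sketch: message passing over temporally correlated proposals,
    followed by a shared LSTM cell carrying state across frames."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.msg = nn.Linear(feat_dim, feat_dim)
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)

    def forward(self, frames):
        # frames: list of tensors, each (num_proposals_t, feat_dim)
        h = c = None
        outputs, prev = [], None
        for feats in frames:
            if prev is not None:
                # cosine similarity defines the dynamic edges to the previous frame
                sim = F.normalize(feats, dim=1) @ F.normalize(prev, dim=1).t()
                weights = F.softmax(sim, dim=1)            # (N_t, N_{t-1})
                feats = feats + self.msg(weights @ prev)   # aggregate correlated proposals
            pooled = feats.mean(dim=0, keepdim=True)       # frame-level summary (assumption)
            h, c = self.cell(pooled) if h is None else self.cell(pooled, (h, c))
            outputs.append(h)
            prev = feats
        return torch.stack(outputs)                        # (T, 1, hidden_dim)


# usage: three frames with varying numbers of proposals
model = TemporalGraphLSTMSketch(feat_dim=256, hidden_dim=128)
frames = [torch.randn(n, 256) for n in (5, 7, 6)]
print(model(frames).shape)  # torch.Size([3, 1, 128])
```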
Skeleton Focused Human Activity Recognition in RGB Video
The data-driven approach that learns an optimal representation of vision
features like skeleton frames or RGB videos is currently a dominant paradigm
for activity recognition. While great improvements have been achieved from
existing single modal approaches with increasingly larger datasets, the fusion
of various data modalities at the feature level has seldom been attempted. In
this paper, we propose a multimodal feature fusion model that utilizes both
skeleton and RGB modalities to infer human activity. The objective is to
improve the activity recognition accuracy by effectively utilizing the mutually
complementary information among different data modalities. For the skeleton
modality, we propose to use a graph convolutional subnetwork to learn the
skeleton representation. For the RGB modality, we use the spatio-temporal
regions of interest from the RGB videos and take the attention features from
the skeleton modality to guide the learning process. The model
can be trained either individually or jointly by the back-propagation
algorithm in an end-to-end manner. The model achieves state-of-the-art
performance on the NTU RGB+D and Northwestern-UCLA Multiview datasets,
which indicates that the proposed skeleton-driven attention mechanism for the
RGB modality increases the mutual communication between different data
modalities and brings more discriminative features for inferring human
activities.
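A minimal sketch of the skeleton-guided attention idea, under the assumption that skeleton joints and precomputed RGB region features are already available; the linear skeleton branch stands in for the graph convolutional subnetwork, and all dimensions, names, and the late-fusion classifier are illustrative rather than the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkeletonGuidedFusionSketch(nn.Module):
    """Toy sketch: a skeleton branch produces an attention query that
    re-weights spatial RGB region features before the two branches are fused."""

    def __init__(self, joint_dim=3, rgb_dim=512, hidden=256, num_classes=60):
        super().__init__()
        self.skel_fc = nn.Linear(joint_dim, hidden)   # stand-in for a graph conv subnetwork
        self.query = nn.Linear(hidden, rgb_dim)
        self.classifier = nn.Linear(hidden + rgb_dim, num_classes)

    def forward(self, joints, rgb_feats):
        # joints: (B, num_joints, joint_dim); rgb_feats: (B, num_regions, rgb_dim)
        skel = F.relu(self.skel_fc(joints)).mean(dim=1)              # (B, hidden)
        q = self.query(skel).unsqueeze(1)                            # (B, 1, rgb_dim)
        attn = F.softmax((q * rgb_feats).sum(-1), dim=-1)            # (B, num_regions)
        rgb = (attn.unsqueeze(-1) * rgb_feats).sum(dim=1)            # attended RGB feature
        return self.classifier(torch.cat([skel, rgb], dim=-1))       # fused prediction


# usage with invented shapes (25 joints, 49 spatial regions)
logits = SkeletonGuidedFusionSketch()(torch.randn(2, 25, 3), torch.randn(2, 49, 512))
print(logits.shape)  # torch.Size([2, 60])
```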
Video Description: A Survey of Methods, Datasets and Evaluation Metrics
Video description is the automatic generation of natural language sentences
that describe the contents of a given video. It has applications in human-robot
interaction, helping the visually impaired and video subtitling. The past few
years have seen a surge of research in this area due to the unprecedented
success of deep learning in computer vision and natural language processing.
Numerous methods, datasets and evaluation metrics have been proposed in the
literature, calling for a comprehensive survey to focus research efforts in
this flourishing new direction. This paper fills the gap by surveying
state-of-the-art approaches with a focus on deep learning models;
comparing benchmark datasets in terms of their domains, number of classes, and
repository size; and identifying the pros and cons of various evaluation
metrics like SPICE, CIDEr, ROUGE, BLEU, METEOR, and WMD. Classical video
description approaches combined subject, object and verb detection with
template-based language models to generate sentences. However, the release of
large datasets revealed that these methods cannot cope with the diversity of
unconstrained open-domain videos. Classical approaches were followed by a very
short era of statistical methods, which were soon replaced with deep learning,
the current state of the art in video description. Our survey shows that
despite the fast-paced developments, video description research is still in its
infancy for the following reasons. Analyzing video description models is
challenging because it is difficult to ascertain how much the visual features
and the adopted language model each contribute to the accuracy or errors of
the final description. Existing datasets contain neither adequate visual
diversity nor sufficient complexity of linguistic structures. Finally, current evaluation
metrics ...
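Of the metrics named above, BLEU is the simplest to reproduce; a minimal sentence-level example using NLTK is shown below. The captions are invented, and published benchmarks typically use corpus-level scoring together with METEOR, CIDEr, and the other metrics discussed in the survey.

```python
# Hedged example of sentence-level BLEU, one of the metrics discussed above.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is slicing a tomato on a cutting board".split(),
    "someone cuts a tomato in the kitchen".split(),
]
hypothesis = "a man cuts a tomato on a board".split()

score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```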
Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM
Over the past few years, deep neural networks (DNNs) have exhibited great
success in predicting the saliency of images. However, there are few works that
apply DNNs to predict the saliency of generic videos. In this paper, we propose
a novel DNN-based video saliency prediction method. Specifically, we establish
a large-scale eye-tracking database of videos (LEDOV), which provides
sufficient data to train the DNN models for predicting video saliency. Through
the statistical analysis of our LEDOV database, we find that human attention is
normally attracted by objects, particularly moving objects or the moving parts
of objects. Accordingly, we propose an object-to-motion convolutional neural
network (OM-CNN) to learn spatio-temporal features for predicting the
intra-frame saliency by exploring the information of both objectness and
object motion. We further find from our database that there exists a temporal
correlation of human attention with a smooth saliency transition across video
frames. Therefore, we develop a two-layer convolutional long short-term memory
(2C-LSTM) network in our DNN-based method, using the extracted features of
OM-CNN as the input. Consequently, the inter-frame saliency maps of videos can
be generated, which consider the transition of attention across video frames.
Finally, the experimental results show that our method advances the
state-of-the-art in video saliency prediction.
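The two-layer convolutional LSTM is the component easiest to sketch in isolation. The cell below is a generic ConvLSTM rather than the authors' 2C-LSTM code, and the stacked two-cell loop, feature shapes, and 1x1 read-out convolution are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ConvLSTMCellSketch(nn.Module):
    """Minimal convolutional LSTM cell: gates are computed with a convolution
    so spatial structure is preserved across time."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c


# usage: run two stacked cells over per-frame feature maps (stand-ins for
# OM-CNN features) and read out one saliency map per frame
frames = torch.randn(8, 32, 28, 28)                     # (T, C, H, W)
cell1, cell2 = ConvLSTMCellSketch(32, 16), ConvLSTMCellSketch(16, 16)
readout = nn.Conv2d(16, 1, 1)
h1 = c1 = h2 = c2 = torch.zeros(1, 16, 28, 28)
saliency = []
for x in frames:
    h1, c1 = cell1(x.unsqueeze(0), (h1, c1))
    h2, c2 = cell2(h1, (h2, c2))
    saliency.append(torch.sigmoid(readout(h2)))
print(torch.cat(saliency).shape)                        # torch.Size([8, 1, 28, 28])
```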
PVSS: A Progressive Vehicle Search System for Video Surveillance Networks
This paper focuses on the task of searching for a specific vehicle that has
appeared in a surveillance network. Existing methods usually assume the
vehicle images are well cropped from the surveillance videos, then use visual
attributes, like colors and types, or license plate numbers to match the target
vehicle in the image set. However, a complete vehicle search system should
consider the problems of vehicle detection, representation, indexing, storage,
matching, and so on. In addition, attribute-based search cannot accurately find the
same vehicle due to intra-instance changes in different cameras and the
extremely uncertain environment. Moreover, the license plates may be
misrecognized in surveillance scenes due to the low resolution and noise. In
this paper, a Progressive Vehicle Search System, named PVSS, is designed to
solve the above problems. PVSS consists of three modules: the crawler,
the indexer, and the searcher. The vehicle crawler aims to detect and track
vehicles in surveillance videos and transfer the captured vehicle images,
metadata and contextual information to the server or cloud. Then multi-grained
attributes, such as the visual features and license plate fingerprints, are
extracted and indexed by the vehicle indexer. At last, a query triplet with an
input vehicle image, the time range, and the spatial scope is taken as the
input by the vehicle searcher. The target vehicle is then retrieved from the
database through a progressive search process. Extensive experiments on the
public dataset from a real surveillance network validate the effectiveness of PVSS.
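Purely for illustration, the toy code below shows how a progressive search of this kind can be organised: filter candidate records by time range and camera scope first, then rank the survivors by appearance similarity. Every field and function name here is invented for the example and is not the PVSS data model.

```python
from dataclasses import dataclass
from datetime import datetime
import numpy as np

@dataclass
class VehicleRecord:                # what an indexer might store per captured vehicle
    camera_id: str
    timestamp: datetime
    appearance: np.ndarray          # visual feature vector
    plate_fingerprint: str          # coarse plate descriptor, not exact OCR

def progressive_search(query_feat, time_range, camera_scope, records, top_k=5):
    """Toy progressive search: filter by time and space first, then rank the
    survivors by appearance (cosine) similarity."""
    start, end = time_range
    candidates = [r for r in records
                  if start <= r.timestamp <= end and r.camera_id in camera_scope]

    def similarity(r):
        return float(np.dot(query_feat, r.appearance) /
                     (np.linalg.norm(query_feat) * np.linalg.norm(r.appearance) + 1e-8))

    return sorted(candidates, key=similarity, reverse=True)[:top_k]
```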
Temporal Saliency Adaptation in Egocentric Videos
This work adapts a deep neural model for image saliency prediction to the
temporal domain of egocentric video. We compute the saliency map for each video
frame, first with an off-the-shelf model trained on static images, and second
by adding convolutional or conv-LSTM layers trained with a dataset for
video saliency prediction. We study each configuration on EgoMon, a new dataset
made of seven egocentric videos recorded by three subjects in both free-viewing
and task-driven setups. Our results indicate that the temporal adaptation is
beneficial when the viewer is not moving and observing the scene from a narrow
field of view. Encouraged by this observation, we compute and publish the
saliency maps for the EPIC Kitchens dataset, in which viewers are cooking.
Source code and models available at
https://imatge-upc.github.io/saliency-2018-videosalgan/
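To make the "add a temporal layer on top of a static model" configuration concrete, here is a minimal sketch that refines frozen per-frame saliency maps with a small 3D convolution over a temporal window. The paper's conv-LSTM variant is more involved; the layer shapes and sigmoid read-out below are assumptions.

```python
import torch
import torch.nn as nn

class TemporalRefinerSketch(nn.Module):
    """Toy sketch of the simpler configuration: per-frame saliency maps from a
    frozen static model are refined by a convolution over a short temporal window."""

    def __init__(self, window=3):
        super().__init__()
        self.refine = nn.Conv3d(1, 1, kernel_size=(window, 3, 3),
                                padding=(window // 2, 1, 1))

    def forward(self, static_maps):
        # static_maps: (B, T, H, W) saliency from an image-trained model
        x = static_maps.unsqueeze(1)          # add channel dim -> (B, 1, T, H, W)
        return torch.sigmoid(self.refine(x)).squeeze(1)


refiner = TemporalRefinerSketch()
video_saliency = refiner(torch.rand(1, 16, 64, 64))
print(video_saliency.shape)  # torch.Size([1, 16, 64, 64])
```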
Situation-Aware Pedestrian Trajectory Prediction with Spatio-Temporal Attention Model
Pedestrian trajectory prediction is essential for collision avoidance in
autonomous driving and robot navigation. However, predicting a pedestrian's
trajectory in crowded environments is non-trivial as it is influenced by other
pedestrians' motion and static structures that are present in the scene. Such
human-human and human-space interactions lead to non-linearities in the
trajectories. In this paper, we present a new spatio-temporal graph-based Long
Short-Term Memory (LSTM) network for predicting pedestrian trajectories in
crowded environments, which takes into account the interaction with static
(physical objects) and dynamic (other pedestrians) elements in the scene. Our
results are based on two widely-used datasets to demonstrate that the proposed
method outperforms the state-of-the-art approaches in human trajectory
prediction. In particular, our method leads to a reduction in Average
Displacement Error (ADE) and Final Displacement Error (FDE) of up to 55% and
61%, respectively, over state-of-the-art approaches.
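ADE and FDE have simple closed forms; the snippet below computes both for a single predicted trajectory (the sample coordinates are made up).

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and Final Displacement Error for one predicted trajectory.
    pred, gt: arrays of shape (T, 2) holding (x, y) positions per time step."""
    dists = np.linalg.norm(pred - gt, axis=1)   # per-step Euclidean error
    return dists.mean(), dists[-1]

pred = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.3]])
gt   = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
ade, fde = ade_fde(pred, gt)
print(f"ADE={ade:.3f}, FDE={fde:.3f}")
```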
Representation Learning on Visual-Symbolic Graphs for Video Understanding
Events in natural videos typically arise from spatio-temporal interactions
between actors and objects and involve multiple co-occurring activities and
object classes. To capture this rich visual and semantic context, we propose
using two graphs: (1) an attributed spatio-temporal visual graph whose nodes
correspond to actors and objects and whose edges encode different types of
interactions, and (2) a symbolic graph that models semantic relationships. We
further propose a graph neural network for refining the representations of
actors, objects and their interactions on the resulting hybrid graph. Our model
goes beyond current approaches that assume nodes and edges are of the same
type, operate on graphs with fixed edge weights and do not use a symbolic
graph. In particular, our framework: a) has specialized attention-based message
functions for different node and edge types; b) uses visual edge features; c)
integrates visual evidence with label relationships; and d) performs global
reasoning in the semantic space. Experiments on challenging video understanding
tasks, such as temporal action localization on the Charades dataset, show that
the proposed method leads to state-of-the-art performance.
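As a rough sketch of typed, attention-based message passing (not the authors' model), the code below gives each edge type its own message function, weights incoming messages with a learned scalar, and refines node states with a GRU cell; the edge-type ids, dimensions, and update rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TypedAttentionMessageSketch(nn.Module):
    """Toy sketch: one round of message passing where each edge type has its
    own message function and messages are combined with learned weights."""

    def __init__(self, dim, num_edge_types):
        super().__init__()
        self.msg = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_edge_types)])
        self.attn = nn.Linear(2 * dim, 1)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, nodes, edges):
        # nodes: (N, dim); edges: list of (src, dst, edge_type)
        agg = torch.zeros_like(nodes)
        weight_sum = torch.zeros(nodes.size(0), 1)
        for src, dst, etype in edges:
            m = self.msg[etype](nodes[src])                               # typed message
            a = torch.sigmoid(self.attn(torch.cat([nodes[dst], nodes[src]])))
            agg[dst] = agg[dst] + a * m
            weight_sum[dst] = weight_sum[dst] + a
        agg = agg / weight_sum.clamp(min=1e-6)                            # weighted average
        return self.update(agg, nodes)                                    # refined node states


nodes = torch.randn(4, 32)                        # e.g. 2 actors + 2 objects
edges = [(0, 1, 0), (1, 0, 0), (2, 1, 1), (3, 0, 1)]
print(TypedAttentionMessageSketch(32, num_edge_types=2)(nodes, edges).shape)
```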
Human Action Recognition and Prediction: A Survey
Derived from rapid advances in computer vision and machine learning, video
analysis tasks have been moving from inferring the present state to predicting
the future state. Vision-based action recognition and prediction from videos
are such tasks, where action recognition is to infer human actions (present
state) based upon complete action executions, and action prediction is to
predict human actions (future state) based upon incomplete action executions.
These two
tasks have become particularly prevalent topics recently because of their
rapidly emerging real-world applications, such as visual surveillance,
autonomous driving, entertainment, and video retrieval. Many efforts have been
devoted over the last few decades to building a robust and effective framework
for action recognition and prediction. In this paper, we survey the
state-of-the-art techniques in action recognition and prediction. Existing
models, popular algorithms, technical difficulties, popular action databases,
evaluation protocols, and promising future directions are also discussed
systematically.
Salient Object Detection in Video using Deep Non-Local Neural Networks
Detection of salient objects in image and video is of great importance in
many computer vision applications. Although the state of the art in saliency
detection for still images has advanced substantially over the last few years,
there have been few improvements in video saliency detection. This paper
investigates the use of recently introduced non-local
neural networks in video salient object detection. Non-local neural networks
are applied to capture global dependencies and hence determine the salient
objects. The effect of non-local operations is studied separately on static and
dynamic saliency detection in order to exploit both appearance and motion
features. A novel deep non-local neural network architecture is introduced for
video salient object detection and tested on two well-known datasets, DAVIS and
FBMS. The experimental results show that the proposed algorithm outperforms
state-of-the-art video saliency detection methods.
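The non-local operation itself follows Wang et al. (CVPR 2018); a minimal embedded-Gaussian non-local block for 2D feature maps looks roughly like the sketch below. The channel sizes and residual read-out are the standard formulation, not necessarily this paper's exact configuration.

```python
import torch
import torch.nn as nn

class NonLocalBlockSketch(nn.Module):
    """Minimal embedded-Gaussian non-local block: every position attends to
    every other position, capturing global dependencies."""

    def __init__(self, in_ch, inter_ch=None):
        super().__init__()
        inter_ch = inter_ch or in_ch // 2
        self.theta = nn.Conv2d(in_ch, inter_ch, 1)
        self.phi = nn.Conv2d(in_ch, inter_ch, 1)
        self.g = nn.Conv2d(in_ch, inter_ch, 1)
        self.out = nn.Conv2d(inter_ch, in_ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)     # (B, HW, C')
        k = self.phi(x).flatten(2)                       # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)         # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)              # pairwise affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                           # residual connection


feat = torch.randn(1, 64, 28, 28)
print(NonLocalBlockSketch(64)(feat).shape)  # torch.Size([1, 64, 28, 28])
```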