Online Action Detection
In online action detection, the goal is to detect the start of an action in a
video stream as soon as it happens. For instance, if a child is chasing a ball,
an autonomous car should recognize what is going on and respond immediately.
This is a very challenging problem for four reasons. First, only partial
actions are observed. Second, there is a large variability in negative data.
Third, the start of the action is unknown, so it is unclear over what time
window the information should be integrated. Finally, in real-world data, large
within-class variability exists. This problem has been addressed before, but
only to some extent. Our contributions to online action detection are
threefold. First, we introduce a realistic dataset composed of 27 episodes from
6 popular TV series. The dataset spans over 16 hours of footage annotated with
30 action classes, totaling 6,231 action instances. Second, we analyze and
compare various baseline methods, showing this is a challenging problem for
which none of the methods provides a good solution. Third, we analyze the
change in performance when there is a variation in viewpoint, occlusion,
truncation, etc. We introduce an evaluation protocol for fair comparison. The
dataset, the baselines and the models will all be made publicly available to
encourage (much needed) further research on online action detection on
realistic data.
Comment: Project page: http://homes.esat.kuleuven.be/~rdegeest/OnlineActionDetection.htm
Improving Sequential Determinantal Point Processes for Supervised Video Summarization
It is now easier than ever to produce videos. While the
ubiquitous video data is a great source for information discovery and
extraction, the computational challenges are unparalleled. Automatically
summarizing the videos has become a substantial need for browsing, searching,
and indexing visual content. This paper is in the vein of supervised video
summarization using sequential determinantal point process (SeqDPP), which
models diversity with a probabilistic distribution. We improve this model in
two ways. In terms of learning, we propose a large-margin algorithm to address the
exposure bias problem in SeqDPP. In terms of modeling, we design a new
probabilistic distribution such that, when it is integrated into SeqDPP, the
resulting model accepts user input about the expected length of the summary.
Moreover, we also significantly extend a popular video summarization dataset by
1) more egocentric videos, 2) dense user annotations, and 3) a refined
evaluation scheme. We conduct extensive experiments on this dataset (about 60
hours of videos in total) and compare our approach to several competitive
baselines.
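The SeqDPP model above builds on determinantal point processes, which assign higher probability to diverse subsets of items. A minimal numeric sketch of that diversity property (the kernel and item features below are illustrative, not the paper's learned model):

```python
import numpy as np

# Minimal sketch of a determinantal point process (DPP), the building
# block of SeqDPP; kernel L and item features are illustrative only.
def dpp_prob(L, subset):
    """P(Y) is proportional to det(L_Y); the normalizer is det(L + I)."""
    L_sub = L[np.ix_(subset, subset)]
    return np.linalg.det(L_sub) / np.linalg.det(L + np.eye(L.shape[0]))

# Toy kernel: items 0 and 1 are near-duplicates, item 2 is distinct.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
L = feats @ feats.T + 1e-6 * np.eye(3)

# A diverse subset gets higher probability than a redundant one.
p_diverse = dpp_prob(L, [0, 2])
p_redundant = dpp_prob(L, [0, 1])
assert p_diverse > p_redundant
```

The determinant shrinks toward zero when rows of the kernel submatrix are nearly collinear, which is exactly why redundant (similar) frames are penalized in summarization.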
Predicting Human Interaction via Relative Attention Model
Predicting human interaction is challenging as the on-going activity has to
be inferred based on a partially observed video. Essentially, a good algorithm
should effectively model the mutual influence between the two interacting
subjects. Also, only a small region in the scene is discriminative for
identifying the on-going interaction. In this work, we propose a relative
attention model to explicitly address these difficulties. Built on a
tri-coupled deep recurrent structure representing both interacting subjects and
global interaction status, the proposed network collects spatio-temporal
information from each subject, rectified with global interaction information,
yielding effective interaction representation. Moreover, the proposed network
also unifies an attention module to assign higher importance to the regions
which are relevant to the on-going action. Extensive experiments have been
conducted on two public datasets, and the results demonstrate that the proposed
relative attention network successfully predicts informative regions between
interacting subjects, which in turn yields superior human interaction
prediction accuracy.
Comment: To appear in IJCAI 201
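The attention module described above assigns higher importance to discriminative regions. A hedged sketch of the generic mechanism, softmax-weighted pooling of region features; the features and query below are random stand-ins, not the paper's tri-coupled recurrent network:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hedged sketch of spatial attention: score each region against a query
# state, normalize to weights, and pool. All inputs are illustrative
# stand-ins for learned CNN/RNN activations.
def attend(regions, query):
    """Return a weighted sum of region features and the weights."""
    scores = regions @ query          # relevance of each region
    weights = softmax(scores)         # importance distribution
    return weights @ regions, weights

rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))     # 4 regions, 8-dim features each
query = rng.normal(size=8)
context, weights = attend(regions, query)
assert abs(weights.sum() - 1.0) < 1e-9
```

Regions whose features align with the current interaction state receive larger weights, so the pooled representation emphasizes the discriminative part of the scene.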
Unified Embedding and Metric Learning for Zero-Exemplar Event Detection
Event detection in unconstrained videos is conceived as a content-based video
retrieval with two modalities: textual and visual. Given a text describing a
novel event, the goal is to rank related videos accordingly. This task is
zero-exemplar: no video examples of the novel event are given.
Related works train a bank of concept detectors on external data sources.
These detectors predict confidence scores for test videos, which are ranked and
retrieved accordingly. In contrast, we learn a joint space in which the visual
and textual representations are embedded. The space casts a novel event as a
probability distribution over pre-defined events. It also learns to measure the
distance between an event and its related videos.
between an event and its related videos.
Our model is trained end-to-end on the publicly available EventNet dataset.
When applied to the TRECVID Multimedia Event Detection dataset, it outperforms
the state-of-the-art by a considerable margin.
Comment: IEEE CVPR 201
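Retrieval in a joint embedding space, as described above, amounts to ranking videos by their similarity to the text embedding of the novel event. A minimal sketch under assumed encoders (the random vectors below stand in for learned text and video embeddings):

```python
import numpy as np

# Hedged sketch of zero-exemplar retrieval in a joint space: embeddings
# are random stand-ins for the outputs of learned text/video encoders.
def rank_videos(text_emb, video_embs):
    """Rank videos by cosine similarity to the event's text embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = v @ t
    return np.argsort(-scores)        # indices, most related video first

rng = np.random.default_rng(0)
text_emb = rng.normal(size=16)        # embedded event description
video_embs = rng.normal(size=(5, 16)) # 5 embedded candidate videos
order = rank_videos(text_emb, video_embs)
assert order.shape == (5,)
```

Because both modalities live in one space, a text query for an unseen event can be scored directly against videos, with no per-event detector training.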