Single Shot Temporal Action Detection
Temporal action detection is a very important yet challenging problem, since
videos in real applications are usually long, untrimmed and contain multiple
action instances. This problem requires not only recognizing action categories
but also detecting start time and end time of each action instance. Many
state-of-the-art methods adopt the "detection by classification" framework:
first generate proposals, and then classify them. The main drawback of this
framework is that the boundaries of action instance proposals have been fixed
during the classification step. To address this issue, we propose a novel
Single Shot Action Detector (SSAD) network based on 1D temporal convolutional
layers to skip the proposal generation step by directly detecting action
instances in untrimmed video. Because no existing model can be directly
adopted, we empirically search for the SSAD network architecture that works
most effectively for temporal action detection. Moreover, we investigate input
feature types and fusion strategies to further improve detection accuracy. We
conduct extensive
experiments on two challenging datasets: THUMOS 2014 and MEXaction2. With the
Intersection-over-Union threshold set to 0.5 during evaluation, SSAD
significantly outperforms other state-of-the-art systems, increasing mAP from
19.0% to 24.6% on THUMOS 2014 and from 7.4% to 11.0% on MEXaction2.
Comment: ACM Multimedia 201
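The Intersection-over-Union criterion mentioned in this abstract can be illustrated with a minimal sketch. It assumes each action instance is a (start, end) interval in seconds; the function names `temporal_iou` and `is_true_positive` are hypothetical and not taken from the paper, and the full mAP computation (ranking by score, one match per ground-truth instance) is omitted.

```python
def temporal_iou(pred, gt):
    """Intersection-over-Union of two (start, end) time intervals."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap length is zero when the intervals are disjoint.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """Assumed matching rule: a detection counts if its IoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(is_true_positive((0.0, 10.0), (2.0, 10.0)))
```

Under this rule, a detection covering seconds 0 to 10 does not match a ground-truth instance spanning seconds 5 to 15 at threshold 0.5, because the overlap is only one third of the union.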
UntrimmedNets for Weakly Supervised Action Recognition and Detection
Current action recognition methods heavily rely on trimmed videos for model
training. However, it is expensive and time-consuming to acquire a large-scale
trimmed video dataset. This paper presents a new weakly supervised
architecture, called UntrimmedNet, which is able to directly learn action
recognition models from untrimmed videos without the requirement of temporal
annotations of action instances. Our UntrimmedNet couples two important
components, the classification module and the selection module, to learn the
action models and reason about the temporal duration of action instances,
respectively. These two components are implemented with feed-forward networks,
and UntrimmedNet is therefore an end-to-end trainable architecture. We exploit
the learned models for weakly supervised action recognition (WSR) and
detection (WSD) on the untrimmed video datasets of THUMOS14 and ActivityNet.
Although our UntrimmedNet only employs weak supervision, our method achieves
performance superior or comparable to that of strongly supervised approaches
on these two datasets.
Comment: camera-ready version to appear in CVPR201
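One way to picture how a classification module and a selection module could be coupled under video-level labels is attention-weighted pooling over clips. The sketch below is an assumption for illustration only: the abstract does not state how the two modules are combined, and the names `softmax` and `video_level_scores` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def video_level_scores(clip_class_scores, clip_selection_scores):
    """Pool per-clip class scores into one video-level score per class.

    Assumption: the selection scores are normalised with a softmax over
    clips and used as weights, so clips judged more relevant contribute more.
    """
    weights = softmax(clip_selection_scores)
    num_classes = len(clip_class_scores[0])
    return [
        sum(w * clip[c] for w, clip in zip(weights, clip_class_scores))
        for c in range(num_classes)
    ]


if __name__ == "__main__":
    clips = [[1.0, 0.0], [0.0, 1.0]]
    print(video_level_scores(clips, [0.0, 0.0]))  # equal weights: plain average
```

Because only the pooled score is compared with the video-level label, no per-clip temporal annotation is needed in this sketch; the learned weights can then be read as a rough indication of where the action occurs.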
Learning Latent Super-Events to Detect Multiple Activities in Videos
In this paper, we introduce the concept of learning latent super-events from
activity videos, and present how it benefits activity detection in continuous
videos. We define a super-event as a set of multiple events occurring together
in videos with a particular temporal organization; it is the opposite concept
of sub-events. Real-world videos contain multiple activities and are rarely
segmented (e.g., surveillance videos), and learning latent super-events allows
the model to capture how the events are temporally related in videos. We design
temporal structure filters that enable the model to focus on particular
sub-intervals of the videos, and use them together with a soft attention
mechanism to learn representations of latent super-events. Super-event
representations are combined with per-frame or per-segment CNNs to provide
frame-level annotations. Our approach is designed to be fully differentiable,
enabling end-to-end learning of latent super-event representations jointly with
the activity detector using them. Our experiments with multiple public video
datasets confirm that the proposed concept of latent super-event learning
significantly benefits activity detection, advancing the state of the art.
Comment: CVPR 201
Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
Every moment counts in action recognition. A comprehensive understanding of
human activity in video requires labeling every frame according to the actions
occurring, placing multiple labels densely over a video sequence. To study this
problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new
dataset of dense labels over unconstrained internet videos. Modeling multiple,
dense labels benefits from temporal relations within and across classes. We
define a novel variant of long short-term memory (LSTM) deep networks for
modeling these temporal relations via multiple input and output connections. We
show that this model improves action labeling accuracy and further enables
deeper understanding tasks ranging from structured retrieval to action
prediction.
Comment: To appear in IJC
Action Tubelet Detector for Spatio-Temporal Action Localization
Current state-of-the-art approaches for spatio-temporal action localization
rely on detections at the frame level that are then linked or tracked across
time. In this paper, we leverage the temporal continuity of videos instead of
operating at the frame level. We propose the ACtion Tubelet detector
(ACT-detector) that takes as input a sequence of frames and outputs tubelets,
i.e., sequences of bounding boxes with associated scores. Just as
state-of-the-art object detectors rely on anchor boxes, our ACT-detector is
based on anchor cuboids. We build upon the SSD framework. Convolutional
features are extracted for each frame, while scores and regressions are based
on the temporal stacking of these features, thus exploiting information from a
sequence. Our experimental results show that leveraging sequences of frames
significantly improves detection performance over using individual frames. The
gain of our tubelet detector can be explained by both more accurate scores and
more precise localization. Our ACT-detector outperforms the state-of-the-art
methods for frame-mAP and video-mAP on the J-HMDB and UCF-101 datasets, in
particular at high overlap thresholds.
Comment: 9 page
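To make the idea of a tubelet and of overlap thresholds concrete, here is a minimal sketch. It assumes a tubelet is a list of per-frame boxes in (x1, y1, x2, y2) form and that the overlap of two equal-length tubelets is the mean of their per-frame box IoUs. That definition and the names `box_iou` and `tubelet_overlap` are illustrative assumptions, not details given in the abstract.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def tubelet_overlap(tubelet_a, tubelet_b):
    """Assumed overlap measure: mean per-frame IoU over aligned frames."""
    if not tubelet_a or len(tubelet_a) != len(tubelet_b):
        raise ValueError("tubelets must be non-empty and cover the same frames")
    per_frame = [box_iou(a, b) for a, b in zip(tubelet_a, tubelet_b)]
    return sum(per_frame) / len(per_frame)


if __name__ == "__main__":
    anchor = [(0, 0, 10, 10), (0, 0, 10, 10)]
    target = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(tubelet_overlap(anchor, target))
```

With this measure, a tubelet that matches the target in one of two frames and misses it entirely in the other scores one half, which shows why high overlap thresholds demand accurate localization in every frame of the sequence.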
Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions
We aim for zero-shot localization and classification of human actions in
video. Whereas traditional approaches rely on global attribute or object
classification scores for their zero-shot knowledge transfer, our main
contribution is a spatial-aware object embedding. To arrive at spatial
awareness, we build our embedding on top of freely available actor and object
detectors. Relevance of objects is determined in a word embedding space and
further enforced with estimated spatial preferences. Besides local object
awareness, we also embed global object awareness into our embedding to maximize
actor and object interaction. Finally, we exploit the object positions and
sizes in the spatial-aware embedding to demonstrate a new spatio-temporal
action retrieval scenario with composite queries. Action localization and
classification experiments on four contemporary action video datasets support
our proposal. Apart from state-of-the-art results in the zero-shot localization
and classification settings, our spatial-aware embedding is even competitive
with recent supervised action localization alternatives.
Comment: ICC