135,635 research outputs found
UntrimmedNets for Weakly Supervised Action Recognition and Detection
Current action recognition methods heavily rely on trimmed videos for model
training. However, it is expensive and time-consuming to acquire a large-scale
trimmed video dataset. This paper presents a new weakly supervised
architecture, called UntrimmedNet, which is able to directly learn action
recognition models from untrimmed videos without the requirement of temporal
annotations of action instances. Our UntrimmedNet couples two important
components, the classification module and the selection module, to learn the
action models and reason about the temporal duration of action instances,
respectively. These two components are implemented with feed-forward networks,
and UntrimmedNet is therefore an end-to-end trainable architecture. We exploit
the learned models for action recognition (WSR) and detection (WSD) on the
untrimmed video datasets of THUMOS14 and ActivityNet. Although our UntrimmedNet
only employs weak supervision, our method achieves performance superior or
comparable to that of those strongly supervised approaches on these two
datasets.Comment: camera-ready version to appear in CVPR201
Spatio-temporal Video Re-localization by Warp LSTM
The need for efficiently finding the video content a user wants is increasing
because of the erupting of user-generated videos on the Web. Existing
keyword-based or content-based video retrieval methods usually determine what
occurs in a video but not when and where. In this paper, we make an answer to
the question of when and where by formulating a new task, namely
spatio-temporal video re-localization. Specifically, given a query video and a
reference video, spatio-temporal video re-localization aims to localize
tubelets in the reference video such that the tubelets semantically correspond
to the query. To accurately localize the desired tubelets in the reference
video, we propose a novel warp LSTM network, which propagates the
spatio-temporal information for a long period and thereby captures the
corresponding long-term dependencies. Another issue for spatio-temporal video
re-localization is the lack of properly labeled video datasets. Therefore, we
reorganize the videos in the AVA dataset to form a new dataset for
spatio-temporal video re-localization research. Extensive experimental results
show that the proposed model achieves superior performances over the designed
baselines on the spatio-temporal video re-localization task
- …