Weakly Supervised Action Localization by Sparse Temporal Pooling Network
We propose a weakly supervised temporal action localization algorithm on
untrimmed videos using convolutional neural networks. Our algorithm learns from
video-level class labels and predicts temporal intervals of human actions with
no requirement of temporal localization annotations. We design our network to
identify a sparse subset of key segments associated with target actions in a
video using an attention module and fuse the key segments through adaptive
temporal pooling. Our loss function comprises two terms that minimize the
video-level action classification error and enforce the sparsity of the segment
selection. At inference time, we extract and score temporal proposals using
temporal class activations and class-agnostic attentions to estimate the time
intervals that correspond to target actions. The proposed algorithm attains
state-of-the-art results on the THUMOS14 dataset and outstanding performance on
ActivityNet1.3 even with its weak supervision.
Comment: Accepted to CVPR 2018
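The core mechanism described above (a class-agnostic attention over segments, adaptive temporal pooling, and a two-term loss) is compact enough to sketch in code. The following is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the names `SparseTemporalPooling`, `stpn_loss`, and the weight `beta` are hypothetical, and the real model additionally builds temporal class activation maps for proposal scoring at inference time.

```python
import torch
import torch.nn as nn

class SparseTemporalPooling(nn.Module):
    """Attention-weighted temporal pooling of per-segment features."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        # Class-agnostic attention: one weight in [0, 1] per segment.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, segments):
        # segments: (batch, T, feat_dim)
        attn = self.attention(segments)                      # (batch, T, 1)
        pooled = (attn * segments).sum(1) / attn.sum(1).clamp(min=1e-6)
        return self.classifier(pooled), attn.squeeze(-1)     # logits, weights

def stpn_loss(logits, attn, labels, beta=1e-4):
    """Video-level classification error plus an L1 sparsity term."""
    cls = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    return cls + beta * attn.abs().mean()

# Toy usage: 2 videos, 40 segments each, 1024-d features, 20 classes.
model = SparseTemporalPooling(1024, 20)
feats = torch.randn(2, 40, 1024)
labels = torch.zeros(2, 20)
labels[0, 3] = labels[1, 7] = 1.0
logits, attn = model(feats)
stpn_loss(logits, attn, labels).backward()
```

Multiplying each segment feature by a sigmoid attention value and penalizing the attention with an L1 term pushes most weights toward zero, so only a sparse subset of key segments drives the video-level prediction.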
Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images
We address the problem of fine-grained action localization from temporally
untrimmed web videos. We assume that only weak video-level annotations are
available for training. The goal is to use these weak labels to identify
temporal segments corresponding to the actions, and learn models that
generalize to unconstrained web videos. We find that web images queried by
action names serve as well-localized highlights for many actions, but are
noisily labeled. To solve this problem, we propose a simple yet effective
method that takes weak video labels and noisy image labels as input, and
generates localized action frames as output. This is achieved by cross-domain
transfer between video frames and web images, using pre-trained deep
convolutional neural networks. We then use the localized action frames to train
action recognition models with long short-term memory networks. We collect
FGA-240, a fine-grained sports action dataset of more than 130,000 YouTube
videos covering 240 fine-grained actions under 85 sports activities. Convincing
results are shown on the FGA-240 dataset, as well as on the THUMOS 2014
localization dataset with untrimmed training videos.
Comment: Camera-ready version for ACM Multimedia 2015
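As a rough illustration of the cross-domain filtering step, the sketch below trains a classifier on (noisy) web-image features and uses it to score the frames of a weakly labeled video, keeping the most confident frames as localized action examples. It is a simplified stand-in under stated assumptions: `localize_action_frames` and `top_k` are hypothetical names, the random features stand in for pre-trained CNN activations, and the paper's actual transfer procedure is more involved.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def localize_action_frames(web_feats, web_labels, frame_feats, video_label,
                           top_k=16):
    """Return indices of the video frames most likely to show `video_label`."""
    # Fit a classifier on (noisily labeled) web-image features.
    clf = LogisticRegression(max_iter=1000).fit(web_feats, web_labels)
    # Score every frame with the probability of the video-level class.
    class_idx = list(clf.classes_).index(video_label)
    probs = clf.predict_proba(frame_feats)[:, class_idx]
    return np.argsort(probs)[::-1][:top_k]   # top-k most confident frames

# Toy usage with random "CNN features": 200 web images, 3 classes, 500 frames.
rng = np.random.default_rng(0)
web_feats = rng.normal(size=(200, 128))
web_labels = rng.integers(0, 3, size=200)
frame_feats = rng.normal(size=(500, 128))
print(localize_action_frames(web_feats, web_labels, frame_feats, video_label=1))
```

The frames selected this way can then serve as the positive training examples for the downstream LSTM recognition model.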
Action Search: Spotting Actions in Videos and Its Application to Temporal Action Localization
State-of-the-art temporal action detectors inefficiently search the entire
video for specific actions. Despite the encouraging progress these methods
achieve, it is crucial to design automated approaches that explore only the
parts of the video most relevant to the actions being searched for. To
address this need, we propose the new problem of action spotting in video,
which we define as finding a specific action in a video while observing a small
portion of that video. Inspired by the observation that humans are extremely
efficient and accurate in spotting and finding action instances in video, we
propose Action Search, a novel Recurrent Neural Network approach that mimics
the way humans spot actions. Moreover, to address the absence of data recording
the behavior of human annotators, we put forward the Human Searches dataset,
which compiles the search sequences employed by human annotators spotting
actions in the AVA and THUMOS14 datasets. We consider temporal action
localization as an application of the action spotting problem. Experiments on
the THUMOS14 dataset reveal that our model not only explores the video
efficiently (observing on average 17.3% of the video) but also finds human
activities accurately, reaching 30.8% mAP.
Comment: Accepted to ECCV 2018
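To make the idea concrete, here is a minimal sketch of a search loop in the spirit of Action Search: an LSTM consumes the feature at the currently observed timestamp and emits the next timestamp to visit. Everything here (`SearchRNN`, the fixed observation budget, starting at the video midpoint) is an illustrative assumption rather than the published architecture, which is trained on the Human Searches sequences.

```python
import torch
import torch.nn as nn

class SearchRNN(nn.Module):
    """LSTM that predicts the next timestamp to observe while spotting an action."""

    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.hidden = hidden
        # Input: frame feature at the current position, plus the position itself.
        self.lstm = nn.LSTMCell(feat_dim + 1, hidden)
        self.next_pos = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, get_feature, steps: int = 10):
        # get_feature(t) returns a (1, feat_dim) feature at normalized time t.
        pos = torch.tensor([[0.5]])                 # start at the video midpoint
        h = torch.zeros(1, self.hidden)
        c = torch.zeros(1, self.hidden)
        visited = []
        for _ in range(steps):                      # fixed observation budget
            x = torch.cat([get_feature(pos.item()), pos], dim=1)
            h, c = self.lstm(x, (h, c))
            pos = self.next_pos(h)                  # next timestamp in [0, 1]
            visited.append(pos.item())
        return visited                              # the model's search sequence

# Toy usage: random per-frame descriptors for a 100-frame video.
frames = torch.randn(100, 64)
lookup = lambda t: frames[int(t * 99)].unsqueeze(0)
print(SearchRNN(feat_dim=64)(lookup, steps=5))
```

Because the network decides where to look next from what it has seen so far, it can stop after observing a small fraction of the frames instead of scanning the whole video.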