Weakly-supervised Temporal Action Localization by Uncertainty Modeling
Weakly-supervised temporal action localization aims to learn to detect the
temporal intervals of action classes with only video-level labels. To this end,
it is crucial to separate frames of action classes from the background frames
(i.e., frames not belonging to any action classes). In this paper, we present a
new perspective on background frames where they are modeled as
out-of-distribution samples owing to their inconsistency. Then, background
frames can be detected by estimating the probability of each frame being
out-of-distribution, known as uncertainty, but it is infeasible to directly
learn uncertainty without frame-level labels. To realize uncertainty learning
in the weakly-supervised setting, we leverage the multiple instance
learning formulation. Moreover, we introduce a background entropy loss
to better discriminate background frames by encouraging their in-distribution
(action) probabilities to be uniformly distributed over all action classes.
Experimental results show that our uncertainty modeling is effective at
alleviating the interference of background frames and brings a large
performance gain without bells and whistles. We demonstrate that our model
significantly outperforms state-of-the-art methods on the benchmarks, THUMOS'14
and ActivityNet (1.2 & 1.3). Our code is available at
https://github.com/Pilhyeon/WTAL-Uncertainty-Modeling.
Comment: Accepted by the 35th AAAI Conference on Artificial Intelligence (AAAI 2021).
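As a rough illustration of the two ingredients described above, the sketch below shows a top-k multiple-instance-learning aggregation of frame scores and a background entropy loss that pushes background frames toward a uniform distribution over action classes. The tensor shapes, the top-k pooling choice, and all names are our assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def video_level_scores(frame_logits: torch.Tensor, k: int) -> torch.Tensor:
    """Multiple instance learning aggregation: average the top-k frame logits per class
    to obtain video-level class scores that can be supervised with video-level labels."""
    topk, _ = frame_logits.topk(k, dim=0)          # (k, C)
    return topk.mean(dim=0)                        # (C,)

def background_entropy_loss(frame_logits: torch.Tensor,
                            background_mask: torch.Tensor) -> torch.Tensor:
    """Encourage frames treated as background to spread their in-distribution (action)
    probability mass uniformly over all C action classes, i.e. maximize their entropy."""
    probs = F.softmax(frame_logits, dim=-1)        # (T, C) per-frame action probabilities
    bg_probs = probs[background_mask]              # probabilities of background frames only
    if bg_probs.numel() == 0:
        return frame_logits.new_zeros(())
    entropy = -(bg_probs * torch.log(bg_probs + 1e-8)).sum(dim=-1)
    return -entropy.mean()                         # minimizing this maximizes entropy
```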
Localizing the Common Action Among a Few Videos
This paper strives to localize the temporal extent of an action in a long
untrimmed video. Where existing work leverages many examples with their start,
their ending, and/or the class of the action during training time, we propose
few-shot common action localization. The start and end of an action in a long
untrimmed video are determined based on just a handful of trimmed video
examples containing the same action, without knowing their common class label.
To address this task, we introduce a new 3D convolutional network architecture
able to align representations from the support videos with the relevant query
video segments. The network contains: (i) a mutual enhancement module
to simultaneously complement the representation of the few trimmed support
videos and the untrimmed query video; (ii) a progressive alignment
module that iteratively fuses the support videos into the query branch; and
(iii) a pairwise matching module to weigh the importance of different
support videos. Evaluation of few-shot common action localization in untrimmed
videos containing a single or multiple action instances demonstrates the
effectiveness and general applicability of our proposal.
Comment: ECCV 2020.
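To make the role of the pairwise matching module more concrete, here is a minimal sketch in which each support video receives a fusion weight from its similarity to the query; the pooled features, cosine similarity, and softmax normalization are our assumptions rather than the paper's actual design.

```python
import torch
import torch.nn.functional as F

def support_video_weights(support_feats: torch.Tensor,
                          query_feat: torch.Tensor) -> torch.Tensor:
    """Score each of the N support videos by its similarity to the query video and
    normalize the scores into fusion weights (higher weight = more relevant support).

    support_feats: (N, D) pooled features of the trimmed support videos.
    query_feat:    (D,)   pooled feature of the untrimmed query video.
    """
    sims = F.cosine_similarity(support_feats, query_feat.unsqueeze(0), dim=-1)  # (N,)
    return F.softmax(sims, dim=0)                                               # (N,), sums to 1

# Example: fuse the support videos into a single prototype for matching against query segments.
# prototype = (support_video_weights(s, q).unsqueeze(-1) * s).sum(dim=0)        # (D,)
```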
Adversarial Background-Aware Loss for Weakly-supervised Temporal Activity Localization
Temporally localizing activities within untrimmed videos has been extensively
studied in recent years. Despite recent advances, existing methods for
weakly-supervised temporal activity localization struggle to recognize when an
activity is not occurring. To address this issue, we propose a novel method
named A2CL-PT. Two triplets of the feature space are considered in our
approach: one triplet is used to learn discriminative features for each
activity class, and the other one is used to distinguish the features where no
activity occurs (i.e. background features) from activity-related features for
each video. To further improve the performance, we build our network using two
parallel branches which operate in an adversarial way: the first branch
localizes the most salient activities of a video and the second one finds other
supplementary activities from non-localized parts of the video. Extensive
experiments performed on THUMOS14 and ActivityNet datasets demonstrate that our
proposed method is effective. Specifically, the average mAP of IoU thresholds
from 0.1 to 0.9 on the THUMOS14 dataset is significantly improved from 27.9% to
30.0%.
Comment: ECCV 2020 camera ready (Supplementary material: on ECVA soon).
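For intuition, the following is a minimal sketch of how two triplet objectives of the kind described above could be combined; the anchor/positive/negative pairings and all names are illustrative assumptions, not the A2CL-PT formulation.

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)

def two_triplet_objective(class_center: torch.Tensor,
                          same_class_feats: torch.Tensor,
                          other_class_feats: torch.Tensor,
                          video_center: torch.Tensor,
                          activity_feats: torch.Tensor,
                          background_feats: torch.Tensor) -> torch.Tensor:
    """Combine two triplets (all tensors are (N, D) batches of feature vectors):
    1) discriminative triplet: pull features of a class toward its center and away
       from features of other classes;
    2) background triplet: pull activity features toward a per-video center and away
       from background features of the same video."""
    loss_discriminative = triplet(class_center, same_class_feats, other_class_feats)
    loss_background = triplet(video_center, activity_feats, background_feats)
    return loss_discriminative + loss_background
```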
D2-Net: Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations
This work proposes a weakly-supervised temporal action localization
framework, called D2-Net, which strives to temporally localize actions using
video-level supervision. Our main contribution is the introduction of a novel
loss formulation, which jointly enhances the discriminability of latent
embeddings and robustness of the output temporal class activations with respect
to foreground-background noise caused by weak supervision. The proposed
formulation comprises a discriminative and a denoising loss term for enhancing
temporal action localization. The discriminative term incorporates a
classification loss and utilizes a top-down attention mechanism to enhance the
separability of latent foreground-background embeddings. The denoising loss
term explicitly addresses the foreground-background noise in class activations
by simultaneously maximizing intra-video and inter-video mutual information
using a bottom-up attention mechanism. As a result, activations in the
foreground regions are emphasized whereas those in the background regions are
suppressed, thereby leading to more robust predictions. Comprehensive
experiments are performed on two benchmarks: THUMOS14 and ActivityNet1.2. Our
D2-Net performs favorably in comparison to the existing methods on both
datasets, achieving gains as high as 3.6% in terms of mean average precision on
THUMOS14.
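As a loose sketch of the discriminative term only, the snippet below pools frame embeddings with an attention distribution (standing in for the top-down attention) and applies a video-level classification loss; the shapes, names, and BCE loss are assumptions, and the denoising / mutual-information term is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def discriminative_term(frame_embeddings: torch.Tensor,
                        attention_logits: torch.Tensor,
                        classifier: nn.Linear,
                        video_labels: torch.Tensor) -> torch.Tensor:
    """Attention-pooled video-level classification:
    frame_embeddings: (T, D), attention_logits: (T,), video_labels: (C,) multi-hot."""
    attn = torch.softmax(attention_logits, dim=0)                          # (T,) foreground attention
    video_embedding = (attn.unsqueeze(-1) * frame_embeddings).sum(dim=0)   # (D,) pooled embedding
    logits = classifier(video_embedding)                                   # (C,) video-level logits
    return F.binary_cross_entropy_with_logits(logits, video_labels)
```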
Forcing the Whole Video as Background: An Adversarial Learning Strategy for Weakly Temporal Action Localization
With video-level labels, weakly supervised temporal action localization
(WTAL) applies a localization-by-classification paradigm to detect and classify
actions in untrimmed videos. Owing to the nature of the classification objective,
class-specific background snippets are inevitably mis-activated to improve the
discriminability of the classifier in WTAL. To alleviate the disturbance of
background, existing methods try to enlarge the discrepancy between action and
background through modeling background snippets with pseudo-snippet-level
annotations, which rely heavily on hand-crafted assumptions. In contrast to
previous works, we present an adversarial learning strategy that overcomes the
limitations of mining pseudo background snippets. Concretely, the background
classification loss forces the whole video to be regarded as the background by
a background gradient reinforcement strategy, confusing the recognition model.
Conversely, the foreground (action) loss guides the model to focus on action
snippets under such conditions. As a result, competition between the two
classification losses drives the model to boost its ability for action
modeling. Simultaneously, a novel temporal enhancement network is designed to
help the model construct temporal relations among affinity snippets based
on the proposed strategy, further improving the performance of action
localization. Finally, extensive experiments conducted on THUMOS14 and
ActivityNet1.2 demonstrate the effectiveness of the proposed method.
Comment: 9 pages, 5 figures, conference.
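To ground the idea of the two competing losses, here is a toy version in which the mean-pooled whole video is pushed toward a dedicated background class while an attention-pooled version is pushed toward the video's action labels; the background class index, pooling choices, and BCE losses are our assumptions, not the paper's exact strategy (in particular, the gradient reinforcement step is omitted).

```python
import torch
import torch.nn.functional as F

def adversarial_wtal_losses(snippet_logits: torch.Tensor,
                            attention_logits: torch.Tensor,
                            video_labels: torch.Tensor,
                            background_index: int) -> torch.Tensor:
    """Two competing objectives over (T, C+1) snippet logits with a background class:
    - background loss: the mean-pooled whole video is labeled as background;
    - foreground loss: the attention-pooled video is labeled with its action classes
      (video_labels is a (C+1,) multi-hot vector with the background entry set to 0)."""
    # Background branch: treat the whole video as background.
    whole_video = snippet_logits.mean(dim=0)                       # (C+1,)
    bg_target = torch.zeros_like(whole_video)
    bg_target[background_index] = 1.0
    loss_bg = F.binary_cross_entropy_with_logits(whole_video, bg_target)

    # Foreground branch: focus on action snippets via attention pooling.
    attn = torch.softmax(attention_logits, dim=0)                  # (T,)
    fg_video = (attn.unsqueeze(-1) * snippet_logits).sum(dim=0)    # (C+1,)
    loss_fg = F.binary_cross_entropy_with_logits(fg_video, video_labels)

    return loss_bg + loss_fg
```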