Forcing the Whole Video as Background: An Adversarial Learning Strategy for Weakly Temporal Action Localization
With video-level labels, weakly supervised temporal action localization
(WTAL) applies a localization-by-classification paradigm to detect and classify
actions in untrimmed videos. Because the paradigm is driven by classification,
class-specific background snippets are inevitably mis-activated to improve the
discriminability of the classifier. To alleviate this background disturbance,
existing methods try to enlarge the discrepancy between action and background
by modeling background snippets with pseudo snippet-level annotations, which
rely heavily on hand-crafted assumptions. Distinct from
previous works, we present an adversarial learning strategy to break the
limitation of mining pseudo background snippets. Concretely, the background
classification loss forces the whole video to be regarded as the background by
a background gradient reinforcement strategy, confusing the recognition model.
Conversely, the foreground (action) loss guides the model to focus on action
snippets under such conditions. As a result, competition between the two
classification losses drives the model to boost its ability for action
modeling. Simultaneously, a novel temporal enhancement network is designed to
help the model build temporal relations among affinity snippets under the
proposed strategy, further improving action
localization. Finally, extensive experiments conducted on THUMOS14 and
ActivityNet1.2 demonstrate the effectiveness of the proposed method.
Comment: 9 pages, 5 figures, conference
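The competition between the two losses can be pictured with a short sketch. Below is a minimal PyTorch illustration, not the paper's exact formulation: the top-k pooling, the sigmoid video-level classifier, and the `bg_weight` balancing term are all assumptions made for this example, and the paper's background gradient reinforcement strategy is reduced here to a plain cross-entropy push toward a background class.

```python
# Minimal sketch of the competing foreground/background objectives.
# All names and the top-k pooling choice are illustrative assumptions.
import torch
import torch.nn.functional as F

def adversarial_wtal_losses(logits, video_label, k=8, bg_weight=0.5):
    """logits: (T, C+1) snippet scores; the last column is a background class.
    video_label: (C,) multi-hot video-level action labels (float)."""
    T, num_cls = logits.shape
    C = num_cls - 1

    # Foreground loss: top-k pooled action scores should match the video label.
    topk = torch.topk(logits[:, :C], k=min(k, T), dim=0).values.mean(dim=0)
    fg_loss = F.binary_cross_entropy(torch.sigmoid(topk), video_label)

    # Background loss: push *every* snippet toward the background class, so
    # only snippets the foreground loss fights for remain action-like.
    bg_target = torch.full((T,), C, dtype=torch.long)
    bg_loss = F.cross_entropy(logits, bg_target)

    return fg_loss + bg_weight * bg_loss
```

The two terms pull in opposite directions on every snippet; at equilibrium, high action scores survive only where they are needed to explain the video-level label.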
Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization
Weakly-supervised temporal action localization aims to localize and recognize
actions in untrimmed videos with only video-level category labels during
training. Without instance-level annotations, most existing methods follow the
Segment-based Multiple Instance Learning (S-MIL) framework, where the
predictions of segments are supervised by the labels of videos. However, the
objective for acquiring segment-level scores during training is not consistent
with the target for acquiring proposal-level scores during testing, leading to
suboptimal results. To deal with this problem, we propose a novel
Proposal-based Multiple Instance Learning (P-MIL) framework that directly
classifies the candidate proposals in both the training and testing stages,
which includes three key designs: 1) a surrounding contrastive feature
extraction module to suppress the discriminative short proposals by considering
the surrounding contrastive information, 2) a proposal completeness evaluation
module to inhibit the low-quality proposals with the guidance of the
completeness pseudo labels, and 3) an instance-level rank consistency loss to
achieve robust detection by leveraging the complementarity of RGB and FLOW
modalities. Extensive experimental results on two challenging benchmarks
including THUMOS14 and ActivityNet demonstrate the superior performance of our
method.
Comment: Accepted by CVPR 2023. Code is available at
https://github.com/RenHuan1999/CVPR2023_P-MIL
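Of the three designs, the rank consistency loss is the easiest to make concrete. The sketch below is an illustration under stated assumptions, not the paper's code: it aligns the proposal-score rankings of the two streams with a symmetric KL divergence over softmax-normalized scores.

```python
# Hedged sketch of an instance-level rank consistency term between the
# RGB and FLOW streams; the symmetric-KL form is our assumption.
import torch
import torch.nn.functional as F

def rank_consistency_loss(rgb_scores, flow_scores):
    """rgb_scores, flow_scores: (N,) scores of the N candidate proposals
    of one video for one action class."""
    log_p = F.log_softmax(rgb_scores, dim=0)
    log_q = F.log_softmax(flow_scores, dim=0)
    # Symmetric KL keeps the two modalities' proposal rankings aligned.
    return 0.5 * (F.kl_div(log_p, log_q.exp(), reduction="sum")
                  + F.kl_div(log_q, log_p.exp(), reduction="sum"))
```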
Sub-action Prototype Learning for Point-level Weakly-supervised Temporal Action Localization
Point-level weakly-supervised temporal action localization (PWTAL) aims to
localize actions with only a single timestamp annotation for each action
instance. Existing methods tend to mine dense pseudo labels to alleviate the
label sparsity, but overlook the potential sub-action temporal structures,
resulting in inferior performance. To tackle this problem, we propose a novel
sub-action prototype learning framework (SPL-Loc) which comprises Sub-action
Prototype Clustering (SPC) and Ordered Prototype Alignment (OPA). SPC
adaptively extracts representative sub-action prototypes that can perceive
the temporal scale and spatial content variation of action instances. OPA
selects relevant prototypes to provide completeness cues for pseudo-label
generation by applying a temporal alignment loss. As a result, pseudo labels
are derived from alignment results to improve action boundary prediction.
Extensive experiments on three popular benchmarks demonstrate that the proposed
SPL-Loc significantly outperforms existing state-of-the-art PWTAL methods.
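As a rough illustration of the SPC idea, the sketch below clusters the snippet features of one action instance into temporally ordered prototypes. Plain k-means and the fixed prototype count are simplifying assumptions; the paper's clustering adapts to temporal scale and content variation.

```python
# Toy sketch of sub-action prototype extraction: cluster snippet features
# and order the cluster centers by their members' mean timestamp.
import numpy as np
from sklearn.cluster import KMeans

def sub_action_prototypes(features, k=3):
    """features: (T, D) snippet features of one action instance."""
    T = features.shape[0]
    km = KMeans(n_clusters=min(k, T), n_init=10).fit(features)
    prototypes, positions = [], []
    for c in range(km.n_clusters):
        members = np.where(km.labels_ == c)[0]
        prototypes.append(features[members].mean(axis=0))
        positions.append(members.mean())  # mean temporal position
    order = np.argsort(positions)
    return np.stack(prototypes)[order]    # (K, D), in temporal order
```

OPA would then align these ordered prototypes against candidate snippets, so that the alignment result can serve as a completeness cue for pseudo-label generation.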
D2-Net: Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations
This work proposes a weakly-supervised temporal action localization
framework, called D2-Net, which strives to temporally localize actions using
video-level supervision. Our main contribution is the introduction of a novel
loss formulation, which jointly enhances the discriminability of latent
embeddings and robustness of the output temporal class activations with respect
to foreground-background noise caused by weak supervision. The proposed
formulation comprises a discriminative and a denoising loss term for enhancing
temporal action localization. The discriminative term incorporates a
classification loss and utilizes a top-down attention mechanism to enhance the
separability of latent foreground-background embeddings. The denoising loss
term explicitly addresses the foreground-background noise in class activations
by simultaneously maximizing intra-video and inter-video mutual information
using a bottom-up attention mechanism. As a result, activations in the
foreground regions are emphasized whereas those in the background regions are
suppressed, thereby leading to more robust predictions. Comprehensive
experiments are performed on two benchmarks: THUMOS14 and ActivityNet1.2. Our
D2-Net performs favorably in comparison to the existing methods on both
datasets, achieving gains as high as 3.6% in terms of mean average precision on
THUMOS14.
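To make the discriminative term tangible, here is a small sketch under explicit assumptions: foreground and background centroids are pooled with an attention weight and pushed apart with a cosine margin. The margin form and the centroid pooling are choices made for illustration, not D2-Net's exact loss.

```python
# Hedged sketch of a foreground/background separability term.
import torch
import torch.nn.functional as F

def separability_loss(embeddings, attention, margin=0.5):
    """embeddings: (T, D) latent snippet embeddings.
    attention: (T,) in [0, 1]; higher means more likely foreground."""
    w_fg = attention / (attention.sum() + 1e-6)
    w_bg = (1 - attention) / ((1 - attention).sum() + 1e-6)
    fg = (w_fg.unsqueeze(1) * embeddings).sum(dim=0)  # foreground centroid
    bg = (w_bg.unsqueeze(1) * embeddings).sum(dim=0)  # background centroid
    sim = F.cosine_similarity(fg, bg, dim=0)
    # Penalize centroids whose cosine similarity exceeds 1 - margin.
    return F.relu(sim - (1.0 - margin))
```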
Slow Motion Matters: A Slow Motion Enhanced Network for Weakly Supervised Temporal Action Localization
Weakly supervised temporal action localization (WTAL) aims to localize
actions in untrimmed videos with only weak supervision information (e.g.
video-level labels). Most existing models handle all input videos with a fixed
temporal scale. However, such models are not sensitive to actions whose pace
differs from the "normal" speed, especially slow-motion action instances,
which complete their movements much more slowly than their normal-speed
counterparts. This gives rise to the slow-motion blurring issue: it is hard
to extract salient slow-motion information from videos processed at "normal"
speed. In this paper, we propose a novel framework termed Slow Motion
Enhanced Network (SMEN) to improve the ability of a WTAL network by
compensating for its insensitivity to slow-motion action segments. The proposed SMEN
comprises a Mining module and a Localization module. The mining module
generates masks to mine slow-motion-related features by utilizing the
relationships between the normal motion and slow motion; while the localization
module leverages the mined slow-motion features as complementary information to
improve the temporal action localization results. Our framework can be easily
integrated into existing WTAL networks, making them more sensitive to
slow-motion actions. Extensive experiments on three benchmarks demonstrate
the strong performance of the proposed framework.
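One way to picture the mining module, with heavy simplification: subsample the snippet features to "speed up" the video and keep the snippets whose action scores rise at the faster rate. The stride-2 subsampling and score-difference threshold below are illustrative assumptions, not SMEN's actual mask generator.

```python
# Hedged sketch of slow-motion mining via temporal subsampling.
import torch

def slow_motion_mask(features, actionness_fn, stride=2, thresh=0.1):
    """features: (T, D) snippet features.
    actionness_fn: maps (N, D) features to (N,) actionness logits."""
    normal = torch.sigmoid(actionness_fn(features))           # 1x-speed scores
    fast = torch.sigmoid(actionness_fn(features[::stride]))   # "sped-up" video
    # Stretch the fast-branch scores back to length T for comparison.
    fast_up = torch.repeat_interleave(fast, stride)[: features.shape[0]]
    # Snippets that look more action-like when sped up are slow-motion cues.
    return (fast_up - normal > thresh).float()                # (T,) binary mask
```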
Weakly-supervised Temporal Action Localization by Uncertainty Modeling
Weakly-supervised temporal action localization aims to learn to detect
temporal intervals of action classes with only video-level labels. To this end,
it is crucial to separate frames of action classes from the background frames
(i.e., frames not belonging to any action classes). In this paper, we present a
new perspective on background frames, modeling them as out-of-distribution
samples owing to their inconsistency. Then, background
frames can be detected by estimating the probability of each frame being
out-of-distribution, known as uncertainty, but it is infeasible to directly
learn uncertainty without frame-level labels. To realize uncertainty
learning in the weakly-supervised setting, we leverage the multiple instance
learning formulation. Moreover, we further introduce a background entropy loss
to better discriminate background frames by encouraging their in-distribution
(action) probabilities to be uniformly distributed over all action classes.
Experimental results show that our uncertainty modeling is effective at
alleviating the interference of background frames and brings a large
performance gain without bells and whistles. We demonstrate that our model
significantly outperforms state-of-the-art methods on the benchmarks, THUMOS'14
and ActivityNet (1.2 & 1.3). Our code is available at
https://github.com/Pilhyeon/WTAL-Uncertainty-Modeling.
Comment: Accepted by the 35th AAAI Conference on Artificial Intelligence
(AAAI 2021).
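The background entropy loss is the most directly codable piece of this abstract. Below is a minimal sketch, assuming background frames have already been selected via the estimated uncertainty (that selection step is outside this snippet).

```python
# Minimal sketch of a background entropy loss: push background frames'
# action-class distribution toward uniform by maximizing its entropy.
import torch
import torch.nn.functional as F

def background_entropy_loss(bg_logits):
    """bg_logits: (Nb, C) action-class logits of presumed background frames."""
    probs = F.softmax(bg_logits, dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)  # (Nb,)
    return -entropy.mean()  # maximizing entropy == minimizing its negative
```

A uniform action distribution means no single action class can dominate a background frame, which is exactly the discrimination between background and action frames that the loss is after.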