Weakly-supervised action localization aims to recognize and localize action
instances in untrimmed videos with only video-level labels. Most existing
models rely on multiple instance learning (MIL), where the predictions of
unlabeled instances are supervised by classifying labeled bags. MIL-based
methods are relatively well studied and achieve strong performance on
classification, but not on localization. Generally, they locate temporal regions
via video-level classification, but overlook the temporal variations of
feature semantics. To address this problem, we propose a novel attention-based
hierarchically-structured latent model to learn the temporal variations of
feature semantics. Specifically, our model comprises two components: the first is
an unsupervised change-point detection module that detects change points by
learning latent representations of video features in a temporal hierarchy
based on their rates of change, and the second is an attention-based
classification module that selects the change points of the foreground as the
action boundaries. To evaluate the effectiveness of our model, we conduct extensive
experiments on two benchmark datasets, THUMOS-14 and ActivityNet-v1.3. The
experiments show that our method outperforms current state-of-the-art methods,
and even achieves performance comparable to that of fully-supervised methods.

Comment: Accepted to ICCV 2023. arXiv admin note: text overlap with
arXiv:2203.15187, arXiv:2003.12424, arXiv:2104.02967 by other authors
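
The following is a minimal, hypothetical sketch, not the paper's implementation, of the generic idea sketched in the abstract: candidate change points are taken where per-snippet features change sharply (a crude stand-in for the learned temporal hierarchy), and only those falling in foreground regions, as scored by a placeholder attention signal, are kept as boundaries. All function names, thresholds, and the synthetic features are illustrative assumptions.

```python
# Hypothetical illustration only: the paper's learned latent hierarchy and
# attention-based classifier are replaced by simple placeholders.
import numpy as np

def detect_change_points(features: np.ndarray, threshold: float) -> np.ndarray:
    """Return snippet indices where consecutive feature vectors change sharply."""
    # Rate of change between neighbouring snippets (L2 distance of differences).
    rates = np.linalg.norm(np.diff(features, axis=0), axis=1)
    return np.where(rates > threshold)[0] + 1

def select_boundaries(change_points: np.ndarray, fg_attention: np.ndarray) -> np.ndarray:
    """Keep only change points whose (placeholder) attention score marks foreground."""
    return np.array([t for t in change_points if fg_attention[t] > 0.5])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = 0.1 * rng.normal(size=(100, 16))   # stand-in for per-snippet features
    feats[40:70] += 3.0                        # simulate an action segment
    attn = np.zeros(100)
    attn[40:70] = 1.0                          # stand-in for foreground attention
    cps = detect_change_points(feats, threshold=2.0)
    print("candidate change points:", cps)
    print("selected boundaries:", select_boundaries(cps, attn))
```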