D2-Net: Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations
This work proposes a weakly-supervised temporal action localization
framework, called D2-Net, which strives to temporally localize actions using
video-level supervision. Our main contribution is the introduction of a novel
loss formulation, which jointly enhances the discriminability of latent
embeddings and robustness of the output temporal class activations with respect
to foreground-background noise caused by weak supervision. The proposed
formulation comprises a discriminative and a denoising loss term for enhancing
temporal action localization. The discriminative term incorporates a
classification loss and utilizes a top-down attention mechanism to enhance the
separability of latent foreground-background embeddings. The denoising loss
term explicitly addresses the foreground-background noise in class activations
by simultaneously maximizing intra-video and inter-video mutual information
using a bottom-up attention mechanism. As a result, activations in the
foreground regions are emphasized whereas those in the background regions are
suppressed, thereby leading to more robust predictions. Comprehensive
experiments are performed on two benchmarks: THUMOS14 and ActivityNet1.2. Our
D2-Net performs favorably in comparison to the existing methods on both
datasets, achieving gains as high as 3.6% in terms of mean average precision on
THUMOS14.
Forcing the Whole Video as Background: An Adversarial Learning Strategy for Weakly Temporal Action Localization
With video-level labels, weakly supervised temporal action localization
(WTAL) applies a localization-by-classification paradigm to detect and classify
the action in untrimmed videos. Due to the characteristic of classification,
class-specific background snippets are inevitably mis-activated to improve the
discriminability of the classifier in WTAL. To alleviate the disturbance of
background, existing methods try to enlarge the discrepancy between action and
background through modeling background snippets with pseudo-snippet-level
annotations, which rely largely on hand-crafted assumptions. Unlike the
previous works, we present an adversarial learning strategy to break the
limitation of mining pseudo background snippets. Concretely, the background
classification loss forces the whole video to be regarded as the background by
a background gradient reinforcement strategy, confusing the recognition model.
Conversely, the foreground (action) loss guides the model to focus on action
snippets under such conditions. As a result, competition between the two
classification losses drives the model to boost its ability for action
modeling. Simultaneously, a novel temporal enhancement network is designed to
help the model construct temporal relations among affinity snippets based
on the proposed strategy, further improving the performance of action
localization. Finally, extensive experiments conducted on THUMOS14 and
ActivityNet1.2 demonstrate the effectiveness of the proposed method.
Comment: 9 pages, 5 figures, conferenc
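The localization-by-classification paradigm mentioned in the abstract above can be illustrated with a minimal sketch. This is a generic illustration, not the method of any paper listed here: it assumes a per-snippet class activation sequence for a single action class, uses top-k mean pooling to obtain a video-level score (the quantity that video-level labels can supervise), and thresholds the same sequence to obtain segments. The names `video_score` and `localize`, the value of `k`, and the threshold are all hypothetical choices made for the example.

```python
def video_score(cas, k):
    """Top-k mean pooling: average the k highest snippet scores of one class.

    Assumes `cas` is a non-empty list of floats and k >= 1.
    """
    top = sorted(cas, reverse=True)[:k]
    return sum(top) / len(top)


def localize(cas, threshold):
    """Group consecutive snippets with score >= threshold into segments.

    Returns (start, end) snippet-index pairs, with `end` exclusive.
    """
    segments, start = [], None
    for t, score in enumerate(cas):
        if score >= threshold and start is None:
            start = t  # a segment opens here
        elif score < threshold and start is not None:
            segments.append((start, t))  # the segment closes before t
            start = None
    if start is not None:
        segments.append((start, len(cas)))  # segment runs to the last snippet
    return segments


# Hypothetical activation sequence for one class over eight snippets.
cas = [0.1, 0.2, 0.9, 0.8, 0.1, 0.7, 0.75, 0.6]
print(video_score(cas, k=3))
print(localize(cas, threshold=0.5))
```

In this toy setting, background snippets that score above the threshold would be reported as action, which is the mis-activation problem the abstract describes.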
AdaFocus: Towards End-to-end Weakly Supervised Learning for Long-Video Action Understanding
Developing end-to-end models for long-video action understanding tasks
presents significant computational and memory challenges. Existing works
generally build models on long-video features extracted by off-the-shelf action
recognition models, which are trained on short-video datasets in different
domains, making the extracted features suffer from domain discrepancy. To avoid
this, action recognition models can be end-to-end trained on clips, which are
trimmed from long videos and labeled using action interval annotations. Such
fully supervised annotations are expensive to collect. Thus, a weakly
supervised method is needed for long-video action understanding at scale. Under
the weak supervision setting, action labels are provided for the whole video
without precise start and end times of the action clip. To this end, we propose
an AdaFocus framework. AdaFocus estimates the spike-actionness and temporal
positions of actions, enabling it to adaptively focus on action clips that
facilitate better training without the need for precise annotations.
Experiments on three long-video datasets show its effectiveness. Remarkably, on
two of the datasets, models trained with AdaFocus under weak supervision outperform
those trained under full supervision. Furthermore, we form a weakly supervised
feature extraction pipeline with our AdaFocus, which enables significant
improvements on three long-video action understanding tasks.
Deep Learning for Dense Interpretation of Video: Survey of Various Approach, Challenges, Datasets and Metrics
Video interpretation has garnered considerable attention in computer vision and natural language processing fields due to the rapid expansion of video data and the increasing demand for various applications such as intelligent video search, automated video subtitling, and assistance for visually impaired individuals. However, video interpretation presents greater challenges due to the inclusion of both temporal and spatial information within the video. While deep learning models for images, text, and audio have made significant progress, efforts have recently been focused on developing deep networks for video interpretation. A thorough evaluation of current research is necessary to provide insights for future endeavors, considering the myriad techniques, datasets, features, and evaluation criteria available in the video domain. This study offers a survey of recent advancements in deep learning for dense video interpretation, addressing various datasets and the challenges they present, as well as key features in video interpretation. Additionally, it provides a comprehensive overview of the latest deep learning models in video interpretation, which have been instrumental in activity identification and video description or captioning. The paper compares the performance of several deep learning models in this field based on specific metrics. Finally, the study summarizes future trends and directions in video interpretation.
Weakly-supervised Micro- and Macro-expression Spotting Based on Multi-level Consistency
Most micro- and macro-expression spotting methods in untrimmed videos suffer
from the burden of video-wise collection and frame-wise annotation.
Weakly-supervised expression spotting (WES) based on video-level labels can
potentially mitigate the complexity of frame-level annotation while achieving
fine-grained frame-level spotting. However, we argue that existing
weakly-supervised methods are based on multiple instance learning (MIL)
involving inter-modality, inter-sample, and inter-task gaps. The inter-sample
gap is primarily from the sample distribution and duration. Therefore, we
propose a novel and simple WES framework, MC-WES, using multi-consistency
collaborative mechanisms that include modal-level saliency, video-level
distribution, label-level duration and segment-level feature consistency
strategies to implement fine frame-level spotting with only video-level labels
to alleviate the above gaps and merge prior knowledge. The modal-level saliency
consistency strategy focuses on capturing key correlations between raw images
and optical flow. The video-level distribution consistency strategy utilizes
the difference of sparsity in temporal distribution. The label-level duration
consistency strategy exploits the difference in the duration of facial muscles.
The segment-level feature consistency strategy emphasizes that features under
the same labels maintain similarity. Experimental results on three challenging
datasets -- CAS(ME)^2, CAS(ME)^3, and SAMM-LV -- demonstrate that MC-WES is
comparable to state-of-the-art fully-supervised methods.