Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event Localization
Audio-Visual Event Localization (AVEL) is the task of temporally localizing
and classifying \emph{audio-visual events}, i.e., events simultaneously visible
and audible in a video. In this paper, we solve AVEL in a weakly-supervised
setting, where only video-level event labels (their presence/absence, but not
their locations in time) are available as supervision for training. Our idea is
to use a base model to estimate labels on the training data at a finer temporal
resolution than at the video level and re-train the model with these labels.
Specifically, we determine the subset of labels for each \emph{slice} of frames in a
training video by (i) replacing the frames outside the slice with those from a
second video having no overlap in video-level labels, and (ii) feeding this
synthetic video into the base model to extract labels for just the slice in
question. To handle the out-of-distribution nature of our synthetic videos, we
propose an auxiliary objective for the base model that induces more reliable
predictions of the localized event labels as desired. Our three-stage pipeline
outperforms several existing AVEL methods with no architectural changes and
improves performance on a related weakly-supervised task as well.
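To make the slice-wise refinement concrete, the following is a minimal Python sketch of the splicing idea described above, assuming a frozen base model that maps a stack of frames to video-level event probabilities. The function name, the 0.5 threshold, and the tensor shapes are illustrative assumptions, not details taken from the paper.

import torch

def refine_slice_labels(video, donor, base_model, slice_len, threshold=0.5):
    """video, donor: (T, C, H, W) frame tensors of two videos whose video-level
    labels do not overlap. Returns a (num_slices, num_classes) 0/1 matrix of
    refined slice-level labels."""
    T = min(video.shape[0], donor.shape[0])  # assume the donor covers the full length
    base_model.eval()
    refined = []
    with torch.no_grad():
        for start in range(0, T, slice_len):
            end = min(start + slice_len, T)
            # Keep only the current slice of the target video and fill the rest
            # with frames from the donor video (which shares no event labels).
            synthetic = donor[:T].clone()
            synthetic[start:end] = video[start:end]
            # Any event predicted for this synthetic video cannot come from the
            # donor frames, so it is attributed to the kept slice.
            probs = torch.sigmoid(base_model(synthetic.unsqueeze(0)))[0]
            refined.append((probs > threshold).float())
    return torch.stack(refined)

The refined slice-level labels would then replace the video-level labels when the model is re-trained, as in the pipeline described above.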
Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization
Weakly supervised temporal action localization (WSTAL) aims to localize
actions in untrimmed videos using video-level labels. Despite recent advances,
existing approaches mainly follow a localization-by-classification pipeline,
generally processing each segment individually, thereby exploiting only limited
contextual information. As a result, the model lacks a comprehensive
understanding (e.g., of appearance and temporal structure) of various action
patterns, leading to ambiguity in both classification learning and temporal
localization. Our work addresses this from a novel perspective, by exploring
and exploiting the cross-video contextual knowledge within the dataset to
recover the dataset-level semantic structure of action instances via weak
labels only, thereby indirectly improving the holistic understanding of
fine-grained action patterns and alleviating the aforementioned ambiguities.
Specifically, an end-to-end framework is proposed, including a Robust
Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge
Summarization and Aggregation (GKSA) module. First, the RMGCL module explores
the contrast and consistency of cross-video action features, assisting in
learning a more structured and compact embedding space, thus reducing ambiguity
in classification learning. Further, the GKSA module is used to efficiently
summarize and propagate the cross-video representative action knowledge in a
learnable manner to promote a holistic understanding of action patterns, which in
turn allows the generation of high-confidence pseudo-labels for self-learning,
thus alleviating ambiguity in temporal localization. Extensive experiments on
THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method
outperforms the state-of-the-art methods, and can be easily plugged into other
WSTAL methods.
Comment: Submitted to TCSVT. 14 pages and 7 figures.
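As a rough illustration of the memory-guided contrastive idea behind RMGCL, the sketch below keeps a per-class memory of action prototypes that is momentum-updated across videos and drives an InfoNCE-style loss. The momentum value, temperature, and loss form are assumptions made for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

class ClassMemory:
    def __init__(self, num_classes, dim, momentum=0.9):
        self.bank = torch.zeros(num_classes, dim)  # one prototype per action class
        self.momentum = momentum

    @torch.no_grad()
    def update(self, feats, labels):
        # Momentum-update each class prototype with features from the current
        # batch, so the memory accumulates action knowledge across videos.
        for c in labels.unique():
            mean_feat = feats[labels == c].mean(dim=0)
            self.bank[c] = self.momentum * self.bank[c] + (1 - self.momentum) * mean_feat

def memory_contrastive_loss(feats, labels, memory, temperature=0.07):
    # Pull each snippet feature toward its own class prototype and push it away
    # from the prototypes of all other classes (InfoNCE over the memory bank).
    feats = F.normalize(feats, dim=1)
    protos = F.normalize(memory.bank, dim=1)
    logits = feats @ protos.t() / temperature  # (N, num_classes)
    return F.cross_entropy(logits, labels)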
Progression-Guided Temporal Action Detection in Videos
We present a novel framework, Action Progression Network (APN), for temporal
action detection (TAD) in videos. The framework locates actions in videos by
detecting the action evolution process. To encode the action evolution, we
quantify a complete action process into 101 ordered stages (0\%, 1\%, ...,
100\%), referred to as action progressions. We then train a neural network to
recognize the action progressions. The framework detects action boundaries by
detecting complete action processes in the videos, e.g., a video segment whose
detected action progressions closely follow the sequence 0\%, 1\%, ..., 100\%.
The framework offers three major advantages: (1) Our neural networks are
trained end-to-end, contrasting conventional methods that optimize modules
separately; (2) The APN is trained using action frames exclusively, enabling
models to be trained on action classification datasets and making them robust to
videos whose temporal background styles differ from those seen in training; (3) Our
framework effectively avoids detecting incomplete actions and excels in
detecting long-lasting actions due to the fine-grained and explicit encoding of
the temporal structure of actions. Leveraging these advantages, the APN
achieves competitive performance and significantly surpasses its counterparts
in detecting long-lasting actions. With an IoU threshold of 0.5, the APN
achieves a mean Average Precision (mAP) of 58.3\% on the THUMOS14 dataset and
98.9\% mAP on the DFMAD70 dataset.
Comment: Under Review. Code available at https://github.com/makecent/AP
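Under the 101-stage encoding described above, the sketch below shows one way per-frame progression predictions could be decoded into complete-action detections by scoring candidate segments against the ideal 0\% to 100\% ramp. The brute-force search and correlation-based score are illustrative choices, not the APN's actual decoding procedure.

import numpy as np

def detect_complete_actions(progressions, min_len=16, score_thresh=0.9):
    """progressions: 1-D array of per-frame predicted stages in [0, 100].
    Returns (start, end, score) tuples for candidate complete-action segments."""
    detections = []
    T = len(progressions)
    for start in range(T):                       # brute-force segment search
        for end in range(start + min_len, T + 1):
            seg = progressions[start:end]
            if np.std(seg) == 0:
                continue                          # constant segment cannot be a ramp
            ideal = np.linspace(0, 100, num=len(seg))  # ideal 0% -> 100% ramp
            # A segment is kept as a complete action if its progressions track
            # the ramp; Pearson correlation serves as a simple matching score.
            score = np.corrcoef(seg, ideal)[0, 1]
            if score > score_thresh and seg[0] < 10 and seg[-1] > 90:
                detections.append((start, end, float(score)))
    return detections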
HR-Pro: Point-supervised Temporal Action Localization via Hierarchical Reliability Propagation
Point-supervised Temporal Action Localization (PSTAL) is an emerging research
direction for label-efficient learning. However, current methods mainly focus
on optimizing the network either at the snippet-level or the instance-level,
neglecting the inherent reliability of point annotations at both levels. In
this paper, we propose a Hierarchical Reliability Propagation (HR-Pro)
framework, which consists of two reliability-aware stages: Snippet-level
Discrimination Learning and Instance-level Completeness Learning, both of which
explore the efficient propagation of high-confidence cues from point annotations.
For snippet-level learning, we introduce an online-updated memory to store
reliable snippet prototypes for each class. We then employ a Reliability-aware
Attention Block to capture both intra-video and inter-video dependencies of
snippets, resulting in more discriminative and robust snippet representations.
For instance-level learning, we propose a point-based proposal generation
approach as a means of connecting snippets and instances, which produces
high-confidence proposals for further optimization at the instance level.
Through multi-level reliability-aware learning, we obtain more reliable
confidence scores and more accurate temporal boundaries of predicted proposals.
Our HR-Pro achieves state-of-the-art performance on multiple challenging
benchmarks, including an impressive average mAP of 60.3% on THUMOS14. Notably,
our HR-Pro largely surpasses all previous point-supervised methods, and even
outperforms several competitive fully supervised methods. Code will be
available at https://github.com/pipixin321/HR-Pro.
Comment: 12 pages, 8 figures.
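As a sketch of the snippet-level stage, the class below keeps an online-updated memory of reliable snippet prototypes, treating point-annotated snippets as always reliable and other snippets as reliable only when the model is confident. The momentum refresh and confidence gate are assumptions for illustration rather than HR-Pro's actual update rule.

import torch

class SnippetPrototypeMemory:
    def __init__(self, num_classes, dim, momentum=0.99, conf_thresh=0.8):
        self.protos = torch.zeros(num_classes, dim)  # one reliable prototype per class
        self.momentum = momentum
        self.conf_thresh = conf_thresh

    @torch.no_grad()
    def update(self, snippet_feats, class_scores, point_labels=None):
        """snippet_feats: (T, D) features; class_scores: (T, C) class probabilities;
        point_labels: optional (T,) sparse point annotations (-1 = unlabeled)."""
        conf, pred = class_scores.max(dim=1)
        for c in range(self.protos.shape[0]):
            # Point-annotated snippets propagate their high-confidence cues
            # directly; unlabeled snippets must be confidently predicted first.
            reliable = (pred == c) & (conf > self.conf_thresh)
            if point_labels is not None:
                reliable |= (point_labels == c)
            if reliable.any():
                mean_feat = snippet_feats[reliable].mean(dim=0)
                self.protos[c] = self.momentum * self.protos[c] + (1 - self.momentum) * mean_feat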
Boundary Discretization and Reliable Classification Network for Temporal Action Detection
Temporal action detection aims to recognize the action category and determine
the starting and ending time of each action instance in untrimmed videos. The
mixed methods have achieved remarkable performance by simply merging
anchor-based and anchor-free approaches. However, there are still two crucial
issues in the mixed framework: (1) Brute-force merging and handcrafted anchor
design affect the performance and practical application of the mixed methods.
(2) A large number of false positives in action category predictions further
impact the detection performance. In this paper, we propose a novel Boundary
Discretization and Reliable Classification Network (BDRC-Net) that addresses
the above issues by introducing boundary discretization and reliable
classification modules. Specifically, the boundary discretization module (BDM)
elegantly merges anchor-based and anchor-free approaches in the form of
boundary discretization, avoiding the handcrafted anchor design required by
traditional mixed methods. Furthermore, the reliable classification module
(RCM) predicts reliable action categories to reduce false positives in action
category predictions. Extensive experiments conducted on different benchmarks
demonstrate that our proposed method achieves favorable performance compared
with the state-of-the-art. For example, BDRC-Net achieves an average mAP of 68.6%
on THUMOS'14, outperforming the previous best by 1.5%. The code will be
released at https://github.com/zhenyingfang/BDRC-Net.
Comment: 12 pages. Source code: https://github.com/zhenyingfang/BDRC-Net
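Because the abstract does not spell out the discretization itself, the following is only a speculative sketch of one way a boundary can be "discretized": predict a distribution over a fixed set of candidate offsets and recover the continuous boundary as its expectation. The bin layout and decoding rule are assumptions, not BDRC-Net's actual design.

import torch
import torch.nn.functional as F

def decode_discretized_boundary(bin_logits, bin_offsets):
    """bin_logits: (N, num_bins) logits over discrete boundary offsets.
    bin_offsets: (num_bins,) candidate offsets (in snippets) for each bin.
    Returns (N,) continuous offsets as the probability-weighted average."""
    probs = F.softmax(bin_logits, dim=-1)
    return probs @ bin_offsets

# Example: 9 candidate offsets spanning 0 to 64 snippets, decoded for 2 proposals.
offsets = torch.linspace(0, 64, steps=9)
logits = torch.randn(2, 9)
print(decode_discretized_boundary(logits, offsets))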
Weakly-supervised Micro- and Macro-expression Spotting Based on Multi-level Consistency
Most micro- and macro-expression spotting methods in untrimmed videos suffer
from the burden of video-wise collection and frame-wise annotation.
Weakly-supervised expression spotting (WES) based on video-level labels can
potentially mitigate the complexity of frame-level annotation while achieving
fine-grained frame-level spotting. However, we argue that existing
weakly-supervised methods are based on multiple instance learning (MIL)
involving inter-modality, inter-sample, and inter-task gaps. The inter-sample
gap arises primarily from differences in sample distribution and duration.
Therefore, we propose a novel and simple WES framework, MC-WES, which uses
multi-consistency collaborative mechanisms, comprising modal-level saliency,
video-level distribution, label-level duration, and segment-level feature
consistency strategies, to achieve fine frame-level spotting with only
video-level labels, alleviating the above gaps and incorporating prior
knowledge. The modal-level saliency
consistency strategy focuses on capturing key correlations between raw images
and optical flow. The video-level distribution consistency strategy utilizes
the difference of sparsity in temporal distribution. The label-level duration
consistency strategy exploits the difference in the duration of facial muscle movements.
The segment-level feature consistency strategy emphasizes that features under
the same labels maintain similarity. Experimental results on three challenging
datasets -- CAS(ME)$^2$, CAS(ME)$^3$, and SAMM-LV -- demonstrate that MC-WES is
comparable to state-of-the-art fully-supervised methods.
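To illustrate the segment-level feature consistency strategy (features under the same labels should maintain similarity), here is a minimal cosine-similarity consistency loss. The pairing scheme and loss form are assumptions made for illustration and are not claimed to match MC-WES.

import torch
import torch.nn.functional as F

def segment_consistency_loss(seg_feats, seg_labels):
    """seg_feats: (N, D) segment features; seg_labels: (N,) expression labels.
    Penalizes dissimilarity between every pair of segments sharing a label."""
    feats = F.normalize(seg_feats, dim=1)
    sim = feats @ feats.t()                    # pairwise cosine similarity
    same = (seg_labels.unsqueeze(0) == seg_labels.unsqueeze(1)).float()
    same.fill_diagonal_(0)                     # ignore self-pairs
    if same.sum() == 0:
        return seg_feats.new_tensor(0.0)       # no same-label pairs in this batch
    return ((1 - sim) * same).sum() / same.sum()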