Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
Audio-visual learning has been a major pillar of multi-modal machine
learning, where the community mostly focused on its modality-aligned setting,
i.e., the audio and visual modalities are both assumed to signal the prediction
target. With the Look, Listen, and Parse dataset (LLP), we investigate the
under-explored unaligned setting, where the goal is to recognize audio and
visual events in a video with only weak labels observed. Such weak video-level
labels only indicate which events occur, without specifying the modality in which
they are perceived (audio, visual, or both). To enhance learning in this challenging
setting, we incorporate large-scale contrastively pre-trained models as the
modality teachers. A simple, effective, and generic method, termed Visual-Audio
Label Elaboration (VALOR), is introduced to harvest modality labels for the
training events. Empirical studies show that the harvested labels significantly
improve an attentional baseline by 8.0 in average F-score (Type@AV).
Surprisingly, we found that modality-independent teachers outperform their
modality-fused counterparts, since they are insulated from noise in the other,
potentially unaligned modality. Moreover, our best model achieves the new
state-of-the-art on all metrics of LLP by a substantial margin (+5.4 F-score
for Type@AV). VALOR is further generalized to Audio-Visual Event Localization
and achieves the new state-of-the-art as well. Code is available at:
https://github.com/Franklin905/VALOR
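To make the harvesting step concrete, here is a minimal sketch under stated assumptions: the visual and audio teachers are CLIP-like and CLAP-like models exposed through a hypothetical `similarity(inputs, texts)` interface, and the threshold `tau` is arbitrary. This is an illustration of the idea, not the authors' VALOR implementation.

```python
import torch

def harvest_modality_labels(frames, audio, weak_label_ids, class_names,
                            visual_teacher, audio_teacher, tau=0.25):
    """Attribute video-level weak labels to the audio and/or visual modality.

    frames: [T, 3, H, W] video frames; audio: [T, ...] per-segment audio features;
    weak_label_ids: indices of events known to occur somewhere in the video.
    visual_teacher / audio_teacher are assumed to expose a hypothetical
    `similarity(inputs, texts)` method returning cosine similarities [T, K].
    """
    texts = [f"a video of {class_names[c]}" for c in weak_label_ids]

    with torch.no_grad():
        # Each teacher sees only its own modality, so its scores cannot be
        # corrupted by noise in the other, potentially unaligned modality.
        v_sim = visual_teacher.similarity(frames, texts)   # [T, K]
        a_sim = audio_teacher.similarity(audio, texts)     # [T, K]

    # Keep a weak label for a modality (and segment) only when that modality's
    # teacher supports it; the result serves as the harvested training labels.
    visual_labels = v_sim > tau
    audio_labels = a_sim > tau
    return visual_labels, audio_labels
```

Keeping the two teachers separate mirrors the finding above: a fused teacher would let a noisy, unaligned modality contaminate the scores of the clean one.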
Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective
We focus on the weakly-supervised audio-visual video parsing task (AVVP),
which aims to identify and locate all the events in audio/visual modalities.
Previous works concentrate only on overall video-level label denoising across
modalities, but overlook the segment-level label noise, where adjacent video
segments (i.e., 1-second video clips) may contain different events. However,
recognizing events in each segment is challenging because its label could be any
combination of events that occur in the video. To address this issue, we
consider tackling AVVP from the language perspective, since language could
freely describe how various events appear in each segment beyond fixed labels.
Specifically, we design language prompts to describe all cases of event
appearance for each video. Then, the similarity between language prompts and
segments is calculated, where the event of the most similar prompt is regarded
as the segment-level label. In addition, to deal with the mislabeled segments,
we propose to perform dynamic re-weighting on the unreliable segments to adjust
their labels. Experiments show that our simple yet effective approach
outperforms state-of-the-art methods by a large margin.
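A rough sketch of this prompt-based segment labeling, assuming a CLIP-style `text_encoder` that maps prompts to embeddings and precomputed per-segment features; the prompt wording, the subset enumeration, and the reliability margin used for re-weighting are illustrative assumptions rather than the paper's exact recipe.

```python
from itertools import combinations

import torch
import torch.nn.functional as F

def segment_labels_from_prompts(seg_feats, video_events, class_names, text_encoder):
    """seg_feats: [T, D] per-segment embeddings; video_events: indices of the
    weak (video-level) events; text_encoder: hypothetical, returns [P, D]."""
    # A segment may contain any combination of the video-level events (or none),
    # so enumerate one language prompt per possible subset.
    subsets = [()] + [s for r in range(1, len(video_events) + 1)
                      for s in combinations(video_events, r)]
    prompts = ["a segment with no event" if not s
               else "a segment with " + " and ".join(class_names[c] for c in s)
               for s in subsets]

    with torch.no_grad():
        text_feats = text_encoder(prompts)                                    # [P, D]
    sim = F.normalize(seg_feats, dim=-1) @ F.normalize(text_feats, dim=-1).T  # [T, P]

    # The events of each segment's most similar prompt become its label.
    best = sim.argmax(dim=-1)
    labels = [set(subsets[i]) for i in best.tolist()]

    # Segments whose top two prompts score nearly the same are unreliable and
    # can be down-weighted (or relabeled) during training.
    top2 = sim.topk(2, dim=-1).values
    reliability = (top2[:, 0] - top2[:, 1]).clamp(min=0)
    return labels, reliability
```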
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Vision transformers (ViTs) have achieved impressive results on various
computer vision tasks in the last several years. In this work, we study the
capability of frozen ViTs, pretrained only on visual data, to generalize to
audio-visual data without finetuning any of their original parameters. To do so,
we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained
ViTs to audio-visual tasks by injecting a small number of trainable parameters
into every layer of a frozen ViT. To efficiently fuse visual and audio cues,
our LAVISH adapter uses a small set of latent tokens, which form an attention
bottleneck, thus eliminating the quadratic cost of standard cross-attention.
Compared to the existing modality-specific audio-visual methods, our approach
achieves competitive or even better performance on various audio-visual tasks
while using fewer tunable parameters and without relying on costly audio
pretraining or external audio encoders. Our code is available at
https://genjib.github.io/project_page/LAVISH/ (CVPR 2023)
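The attention-bottleneck idea can be sketched as a small PyTorch module. The version below is only a sketch in the spirit of the LAVISH adapter: the number of latent tokens, the two-stage attention, and the residual injection are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class LatentBottleneckAdapter(nn.Module):
    """Fuses audio and visual tokens through a few trainable latent tokens."""

    def __init__(self, dim=768, num_latents=8, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.collect = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.distribute = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens, audio_tokens):
        """visual_tokens: [B, Nv, D], audio_tokens: [B, Na, D] from a frozen ViT layer."""
        B = visual_tokens.size(0)
        lat = self.latents.unsqueeze(0).expand(B, -1, -1)        # [B, L, D], L << Nv + Na
        both = torch.cat([visual_tokens, audio_tokens], dim=1)   # [B, Nv + Na, D]

        # 1) A handful of latents gather cross-modal context: cost O(L * (Nv + Na))
        #    instead of the O(Nv * Na) of direct cross-attention.
        lat, _ = self.collect(lat, both, both)

        # 2) Each modality's tokens read the fused context back from the latents.
        v_out, _ = self.distribute(visual_tokens, lat, lat)
        a_out, _ = self.distribute(audio_tokens, lat, lat)

        # Residual injection leaves the frozen ViT features intact; only the
        # adapter's small number of parameters receive gradients.
        return visual_tokens + v_out, audio_tokens + a_out
```

Because only the latents and the two attention blocks are trainable, the frozen ViT's weights never change, which is what keeps the approach parameter-efficient.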
Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event Localization
Audio-Visual Event Localization (AVEL) is the task of temporally localizing
and classifying audio-visual events, i.e., events simultaneously visible
and audible in a video. In this paper, we solve AVEL in a weakly-supervised
setting, where only video-level event labels (their presence/absence, but not
their locations in time) are available as supervision for training. Our idea is
to use a base model to estimate labels on the training data at a finer temporal
resolution than at the video level and re-train the model with these labels.
Specifically, we determine the subset of labels for each slice of frames in a
training video by (i) replacing the frames outside the slice with those from a
second video having no overlap in video-level labels, and (ii) feeding this
synthetic video into the base model to extract labels for just the slice in
question. To handle the out-of-distribution nature of our synthetic videos, we
propose an auxiliary objective for the base model that induces more reliable
predictions of the localized event labels as desired. Our three-stage pipeline
outperforms several existing AVEL methods with no architectural changes and
improves performance on a related weakly-supervised task as well.
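The core refinement step lends itself to a short sketch. Here, `base_model` is assumed to be a multi-label classifier returning per-class logits for a whole video, the two videos are assumed to have equal length, and the threshold is arbitrary; none of this is the paper's exact code.

```python
import torch

def refine_slice_labels(video_a, labels_a, video_b, labels_b, base_model,
                        slice_start, slice_end, thresh=0.5):
    """video_a / video_b: [T, 3, H, W] frame tensors; labels_a / labels_b: sets of
    video-level event class indices with no overlap."""
    assert not (labels_a & labels_b), "donor video must share no video-level labels"

    # Keep only the slice from video A; fill all other frames from video B.
    synthetic = video_b.clone()
    synthetic[slice_start:slice_end] = video_a[slice_start:slice_end]

    with torch.no_grad():
        probs = base_model(synthetic.unsqueeze(0)).sigmoid().squeeze(0)  # [num_classes]

    # Any of A's labels still predicted on the synthetic video must be grounded
    # in the slice itself; these become the slice-level labels for re-training.
    predicted = {c for c, p in enumerate(probs.tolist()) if p > thresh}
    return predicted & labels_a
```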
- …