Weakly-Supervised Audio-Visual Segmentation
Audio-visual segmentation is a challenging task that aims to predict
pixel-level masks for sound sources in a video. Previous work relied on
comprehensive, manually designed architectures trained with large numbers of
pixel-wise accurate masks as supervision. However, these pixel-level masks are
expensive and not always available. In this work, we aim to simplify the
supervision to instance-level annotations, i.e., weakly-supervised
audio-visual segmentation.
We present WS-AVS, a novel weakly-supervised audio-visual segmentation
framework that learns multi-scale audio-visual alignment through multi-scale
multiple-instance contrastive learning. Extensive experiments on AVSBench
demonstrate the effectiveness of WS-AVS for weakly-supervised audio-visual
segmentation in both single-source and multi-source scenarios.
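To make the training signal concrete, below is a minimal sketch of a multiple-instance contrastive loss between an audio embedding and pixel-level visual features at a single scale. All tensor shapes, function names, and the max-pooling choice are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def mil_contrastive_loss(audio_emb, visual_feats, temperature=0.07):
        """audio_emb: (B, D); visual_feats: (B, D, H, W) from one backbone scale."""
        B = audio_emb.size(0)
        audio = F.normalize(audio_emb, dim=-1)                    # (B, D)
        pixels = F.normalize(visual_feats.flatten(2), dim=1)      # (B, D, H*W)
        # Similarity of every audio clip to every pixel of every frame.
        sim = torch.einsum('bd,cdn->bcn', audio, pixels)          # (B, B, H*W)
        # Multiple-instance pooling: a frame matches an audio clip if its
        # best-matching pixel does, so take the max over spatial positions.
        logits = sim.max(dim=-1).values / temperature             # (B, B)
        targets = torch.arange(B, device=logits.device)
        # Symmetric InfoNCE over audio-to-visual and visual-to-audio directions.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # A multi-scale variant would average this loss over feature maps taken
    # from several stages of the visual backbone.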
A Closer Look at Weakly-Supervised Audio-Visual Source Localization
Audio-visual source localization is a challenging task that aims to predict
the location of visual sound sources in a video. Since collecting ground-truth
annotations of sounding objects can be costly, a plethora of weakly-supervised
localization methods that can learn from datasets with no bounding-box
annotations have been proposed in recent years, by leveraging the natural
co-occurrence of audio and visual signals. Despite significant interest,
popular evaluation protocols have two major flaws. First, they allow for the
use of a fully annotated dataset to perform early stopping, thus significantly
increasing the annotation effort required for training. Second, current
evaluation metrics assume the presence of sound sources at all times. This is
of course an unrealistic assumption, and thus better metrics are necessary to
capture the model's performance on (negative) samples with no visible sound
sources. To accomplish this, we extend the test set of popular benchmarks,
Flickr SoundNet and VGG-Sound Sources, in order to include negative samples,
and measure performance using metrics that balance localization accuracy and
recall. Using the new protocol, we conducted an extensive evaluation of prior
methods, and found that most are not capable of identifying negatives and
suffer from significant overfitting (relying heavily on early stopping for
best results). We also propose a new approach for visual
sound source localization that addresses both these problems. In particular, we
found that, through extreme visual dropout and the use of momentum encoders,
the proposed approach combats overfitting effectively and establishes new
state-of-the-art performance on both Flickr SoundNet and VGG-Sound Sources.
Code and pre-trained models are available at https://github.com/stoneMo/SLAVC.
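As a rough illustration of the two ingredients highlighted above, the sketch below wraps a visual encoder with an EMA ("momentum") copy and applies aggressive dropout to its spatial features. The class name, hyper-parameters, and structure are placeholders rather than the released SLAVC code.

    import copy
    import torch
    import torch.nn as nn

    class MomentumWrapper(nn.Module):
        def __init__(self, encoder, momentum=0.999, drop_rate=0.9):
            super().__init__()
            self.encoder = encoder
            self.momentum_encoder = copy.deepcopy(encoder)
            for p in self.momentum_encoder.parameters():
                p.requires_grad = False
            self.momentum = momentum
            # "Extreme" dropout on the spatial feature map (rate chosen arbitrarily here).
            self.dropout = nn.Dropout2d(drop_rate)

        @torch.no_grad()
        def update_momentum_encoder(self):
            # Exponential moving average of the online encoder's weights.
            for p, mp in zip(self.encoder.parameters(),
                             self.momentum_encoder.parameters()):
                mp.data.mul_(self.momentum).add_(p.data, alpha=1 - self.momentum)

        def forward(self, images):
            online = self.dropout(self.encoder(images))     # heavily dropped view
            with torch.no_grad():
                target = self.momentum_encoder(images)      # stable target view
            return online, target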
Tree of Uncertain Thoughts Reasoning for Large Language Models
While the recently introduced Tree of Thoughts (ToT) enables Large Language
Models (LLMs) to reason through foresight and backtracking for global
decision-making, it overlooks the inherent local uncertainties at intermediate
decision points, or "thoughts".
These local uncertainties, intrinsic to LLMs given their potential for diverse
responses, remain a significant concern in the reasoning process. Addressing
this pivotal gap, we introduce the Tree of Uncertain Thoughts (TouT) - a
reasoning framework tailored for LLMs. Our TouT effectively leverages Monte
Carlo Dropout to quantify uncertainty scores associated with LLMs' diverse
local responses at these intermediate steps. By marrying this local uncertainty
quantification with global search algorithms, TouT enhances the model's
precision in response generation. We substantiate our approach with rigorous
experiments on two demanding planning tasks: Game of 24 and Mini Crosswords.
The empirical evidence underscores TouT's superiority over both ToT and
chain-of-thought prompting methods.
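The sketch below illustrates, under stated assumptions, how a ToT-style search could weight intermediate thoughts by an uncertainty-aware score: each candidate is evaluated several times with a stochastic evaluator (standing in for Monte Carlo Dropout over the LLM's value estimates), and its mean value is penalized by the spread of those estimates. The evaluate callable, the penalty weight, and the beam size are hypothetical.

    import statistics
    from typing import Callable, List

    def uncertain_thought_score(thought: str,
                                evaluate: Callable[[str], float],
                                n_samples: int = 8,
                                risk_weight: float = 1.0) -> float:
        """Return the mean value of a candidate thought minus an uncertainty penalty."""
        values: List[float] = [evaluate(thought) for _ in range(n_samples)]
        mean = statistics.mean(values)
        std = statistics.stdev(values) if n_samples > 1 else 0.0
        return mean - risk_weight * std

    def select_thoughts(candidates: List[str],
                        evaluate: Callable[[str], float],
                        beam_size: int = 3) -> List[str]:
        """Keep the top-scoring thoughts for the next level of the search tree."""
        scored = sorted(candidates,
                        key=lambda t: uncertain_thought_score(t, evaluate),
                        reverse=True)
        return scored[:beam_size]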
Class-Incremental Grouping Network for Continual Audio-Visual Learning
Continual learning is a challenging problem in which models need to be
trained on non-stationary data across sequential tasks for class-incremental
learning. While previous methods have focused on using either regularization or
rehearsal-based frameworks to alleviate catastrophic forgetting in image
classification, they are limited to a single modality and cannot learn compact
class-aware cross-modal representations for continual audio-visual learning. To
address this gap, we propose a novel class-incremental grouping network (CIGN)
that can learn category-wise semantic features to achieve continual
audio-visual learning. Our CIGN leverages learnable audio-visual class tokens
and audio-visual grouping to continually aggregate class-aware features.
Additionally, it utilizes class-token distillation and continual grouping to
prevent forgetting of parameters learned from previous tasks, thereby improving
the model's ability to capture discriminative audio-visual categories. We
conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and
VGG-Sound Sources benchmarks. Our experimental results demonstrate that the
CIGN achieves state-of-the-art audio-visual class-incremental learning
performance. Code is available at https://github.com/stoneMo/CIGN. (ICCV 2023)
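As a hedged sketch of the class-token distillation idea, the loss below pulls the current model's class tokens toward a frozen copy saved after the previous task, using a relational (similarity-profile) form of distillation. Shapes, the temperature, and the exact formulation are assumptions for illustration, not the CIGN code.

    import torch
    import torch.nn.functional as F

    def class_token_distillation(current_tokens, previous_tokens, temperature=1.0):
        """current_tokens, previous_tokens: (num_old_classes, D) learnable class tokens."""
        cur = F.normalize(current_tokens, dim=-1)
        prev = F.normalize(previous_tokens.detach(), dim=-1)      # frozen teacher
        # Match each current token's similarity profile over all old tokens
        # to that of the previous-task model (a soft, relational distillation).
        cur_logits = cur @ prev.t() / temperature
        prev_logits = prev @ prev.t() / temperature
        return F.kl_div(F.log_softmax(cur_logits, dim=-1),
                        F.softmax(prev_logits, dim=-1),
                        reduction='batchmean')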
Text-to-Audio Generation Synchronized with Videos
In recent times, the focus on text-to-audio (TTA) generation has intensified,
as researchers strive to synthesize audio from textual descriptions. However,
most existing methods, though leveraging latent diffusion models to learn the
correlation between audio and text embeddings, fall short when it comes to
maintaining seamless synchronization between the generated audio and the
corresponding video. This often results in discernible audio-visual
mismatches. To bridge
this gap, we introduce a groundbreaking benchmark for Text-to-Audio generation
that aligns with Videos, named T2AV-Bench. This benchmark distinguishes itself
with three novel metrics dedicated to evaluating visual alignment and temporal
consistency. To complement this, we also present a simple yet effective
video-aligned TTA generation model, namely T2AV. Moving beyond traditional
methods, T2AV refines the latent diffusion approach by integrating
visual-aligned text embeddings as its conditional foundation. It employs a
temporal multi-head attention transformer to extract and understand temporal
nuances from video data, a feat amplified by our Audio-Visual ControlNet that
adeptly merges temporal visual representations with text embeddings. Further
enhancing this integration, we weave in a contrastive learning objective,
designed to ensure that the visual-aligned text embeddings resonate closely
with the audio features. Extensive evaluations on the AudioCaps and T2AV-Bench
demonstrate that our T2AV sets a new standard for video-aligned TTA generation
in ensuring visual alignment and temporal consistency.
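The contrastive objective mentioned above can be pictured with the generic InfoNCE-style sketch below, which pulls visual-aligned text embeddings toward their paired audio features within a batch. The function name and temperature are illustrative; this is not the T2AV implementation.

    import torch
    import torch.nn.functional as F

    def text_audio_contrastive_loss(text_emb, audio_emb, temperature=0.07):
        """text_emb, audio_emb: (B, D) pooled embeddings of paired clips."""
        text_emb = F.normalize(text_emb, dim=-1)
        audio_emb = F.normalize(audio_emb, dim=-1)
        logits = text_emb @ audio_emb.t() / temperature      # (B, B) similarity matrix
        targets = torch.arange(text_emb.size(0), device=text_emb.device)
        # Symmetric loss: each text should retrieve its audio and vice versa.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))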
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Visual and linguistic pre-training aims to learn vision and language
representations together, which can be transferred to visual-linguistic
downstream tasks. However, there exists semantic confusion between language and
vision during the pre-training stage. Moreover, current pre-trained models tend
to require substantial computational resources for fine-tuning when transferred
to downstream tasks. In this work, we present a simple but effective approach for
learning Contrastive and Adaptive representations of Vision and Language,
namely CAVL. Specifically, we introduce a pair-wise contrastive loss to learn
alignments between the whole sentence and each image in the same batch during
the pre-training process. At the fine-tuning stage, we introduce two
lightweight adaptation networks to reduce trainable parameters and increase
training speed, saving computational resources. We evaluate our CAVL on six
main downstream tasks, including Visual Question Answering (VQA), Visual
Commonsense Reasoning (VCR), Natural Language for Visual Reasoning (NLVR),
Region-to-Phrase Grounding (RPG), Text-to-Image Retrieval (TIR), and Zero-shot
Text-to-Image Retrieval (ZS-TIR). Compared to baselines, we achieve superior
performance and reduce the fine-tuning time by a large margin (in particular,
76.17%). Extensive experiments and ablation studies demonstrate the efficiency
of the contrastive pre-training and adaptive fine-tuning proposed in our CAVL.
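A lightweight adaptation network in the spirit described above can be sketched as a small bottleneck module inserted into a frozen pre-trained block, so that only a few parameters are updated at fine-tuning time. The dimensions and module design below are illustrative assumptions, not the authors' exact adapters.

    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        def __init__(self, hidden_dim=768, bottleneck_dim=64):
            super().__init__()
            self.down = nn.Linear(hidden_dim, bottleneck_dim)   # project down
            self.act = nn.GELU()
            self.up = nn.Linear(bottleneck_dim, hidden_dim)     # project back up

        def forward(self, x):
            # Residual connection keeps the frozen backbone's features intact.
            return x + self.up(self.act(self.down(x)))

    # At fine-tuning time, backbone parameters are frozen and only the adapters
    # (and the task head) receive gradients, which cuts training cost sharply.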
Exploring Data Augmentations on Self-/Semi-/Fully-Supervised Pre-trained Models
Data augmentation has become a standard component of vision pre-trained
models to capture the invariance between augmented views. In practice,
augmentation techniques that mask regions of a sample with zero/mean values or
patches from other samples are commonly employed in pre-trained models with
self-/semi-/fully-supervised contrastive losses. However, the underlying
mechanism behind the effectiveness of these augmentation techniques remains
poorly explored. To investigate this, we conduct an empirical study to
quantify how data augmentation affects performance. Concretely, we apply four
types of data augmentation, namely Random Erasing, CutOut, CutMix, and MixUp,
to a series of self-/semi-/fully-supervised pre-trained models. We
report their performance on vision tasks such as image classification, object
detection, instance segmentation, and semantic segmentation. We then explicitly
evaluate the invariance and diversity of the feature embedding. We observe
that: 1) Masking regions of the images decreases the invariance of the learned
feature embedding while providing considerably greater diversity. 2) Manual
annotations do not change the invariance or diversity of the learned feature
embedding. 3) MixUp improves diversity significantly, with only a marginal
decrease in invariance.
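For reference, the sketch below shows commonly used forms of two of the studied augmentations, MixUp and CutOut; the exact hyper-parameters and implementations in the study may differ.

    import torch

    def mixup(images, labels, alpha=1.0):
        """Convexly combine a batch with a shuffled copy of itself."""
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        perm = torch.randperm(images.size(0))
        mixed = lam * images + (1 - lam) * images[perm]
        # Train with a lam-weighted loss on both label sets.
        return mixed, labels, labels[perm], lam

    def cutout(images, size=16):
        """Zero out a random square patch in each image (on a copy of the batch)."""
        images = images.clone()
        _, _, h, w = images.shape
        for img in images:
            cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
            y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, h)
            x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, w)
            img[:, y0:y1, x0:x1] = 0.0
        return images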
LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning
Visual Prompt Tuning (VPT) techniques have gained prominence for their
capacity to adapt pre-trained Vision Transformers (ViTs) to downstream visual
tasks using specialized learnable tokens termed prompts. Contemporary VPT
methodologies, especially when employed with self-supervised vision
transformers, often default to the introduction of new learnable prompts or
gated prompt tokens predominantly sourced from the model's previous block. A
pivotal oversight in such approaches is their failure to harness the potential
of long-range previous blocks as sources of prompts within each self-supervised
ViT. To bridge this crucial gap, we introduce Long-term Spatial Prompt Tuning
(LSPT) - a revolutionary approach to visual representation learning. Drawing
inspiration from the intricacies of the human brain, LSPT ingeniously
incorporates long-term gated prompts. This feature serves as temporal coding,
curbing the risk of forgetting parameters acquired from earlier blocks. In
addition, LSPT brings patch tokens into play as spatial coding, strategically
designed to continually accumulate class-aware features and thereby strengthen
the model's ability to distinguish and identify visual categories. To validate
the efficacy of our proposed method,
we engaged in rigorous experimentation across 5 FGVC and 19 VTAB-1K benchmarks.
Our empirical findings underscore the superiority of LSPT, showcasing its
ability to set new benchmarks in visual prompt tuning performance.
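A rough sketch of the long-term gated prompt idea is given below: the prompts fed into a block are a gated mixture of newly learned prompts and prompts carried over from several earlier blocks, rather than only the immediately preceding one. The module name, gating form, and dimensions are illustrative assumptions, not the LSPT implementation.

    import torch
    import torch.nn as nn

    class LongTermPromptGate(nn.Module):
        def __init__(self, num_prompts=10, dim=768, history=4):
            super().__init__()
            self.new_prompts = nn.Parameter(torch.zeros(num_prompts, dim))
            # One learnable gate per remembered block, plus one for the new prompts.
            self.gates = nn.Parameter(torch.zeros(history + 1))
            self.history = history

        def forward(self, past_prompts):
            """past_prompts: list of (num_prompts, dim) prompt tensors from earlier blocks."""
            kept = past_prompts[-self.history:]
            weights = torch.softmax(self.gates[:len(kept) + 1], dim=0)
            mixed = weights[0] * self.new_prompts
            for w, p in zip(weights[1:], kept):
                mixed = mixed + w * p
            # The mixed prompts are prepended to the block's token sequence.
            return mixed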