ImbSAM: A Closer Look at Sharpness-Aware Minimization in Class-Imbalanced Recognition
Class imbalance is a common challenge in real-world recognition tasks, where
the majority of classes have few samples, also known as tail classes. We
address this challenge from the perspective of generalization and empirically
find that the promising Sharpness-Aware Minimization (SAM) fails to address
generalization issues under the class-imbalanced setting. Through investigating
this specific type of task, we identify that its generalization bottleneck
primarily lies in the severe overfitting for tail classes with limited training
data. To overcome this bottleneck, we leverage class priors to restrict the
generalization scope of the class-agnostic SAM and propose a class-aware
smoothness optimization algorithm named Imbalanced-SAM (ImbSAM). With the
guidance of class priors, our ImbSAM specifically improves generalization
targeting tail classes. We also verify the efficacy of ImbSAM on two
prototypical applications of class-imbalanced recognition: long-tailed
classification and semi-supervised anomaly detection, where our ImbSAM
demonstrates remarkable performance improvements for tail classes and anomalies.
Our code implementation is available at
https://github.com/cool-xuan/Imbalanced_SAM.
Comment: Accepted by International Conference on Computer Vision (ICCV) 202
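The core idea above can be sketched numerically: apply the SAM perturbation only when computing gradients on tail-class samples, while head-class samples receive a plain gradient step. This is a minimal toy sketch on a linear least-squares model, assuming a simple head/tail split; the names and the exact update rule are illustrative, not the authors' implementation.

```python
import numpy as np

def grad_loss(w, X, y):
    # gradient of the mean squared error 0.5 * ||Xw - y||^2 / n
    return X.T @ (X @ w - y) / len(y)

def imbsam_step(w, X, y, is_tail, rho=0.05, lr=0.1):
    # Head classes: plain gradient, no sharpness-aware perturbation.
    g_head = grad_loss(w, X[~is_tail], y[~is_tail])
    # Tail classes: SAM-style gradient evaluated at the perturbed point
    # w + rho * g / ||g||, seeking flat minima for tail data only.
    g_tail = grad_loss(w, X[is_tail], y[is_tail])
    eps = rho * g_tail / (np.linalg.norm(g_tail) + 1e-12)
    g_tail_sam = grad_loss(w + eps, X[is_tail], y[is_tail])
    return w - lr * (g_head + g_tail_sam)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true
is_tail = np.arange(100) < 10   # pretend the first 10 samples are tail classes
w = np.zeros(5)
for _ in range(200):
    w = imbsam_step(w, X, y, is_tail)
```

On this toy problem both gradient terms vanish at the true solution, so the class-aware perturbation changes the path to the optimum, not the optimum itself.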
BatchNorm-based Weakly Supervised Video Anomaly Detection
In weakly supervised video anomaly detection (WVAD), where only video-level
labels indicating the presence or absence of abnormal events are available, the
primary challenge arises from the inherent ambiguity in temporal annotations of
abnormal occurrences. Inspired by the statistical insight that temporal
features of abnormal events often exhibit outlier characteristics, we propose a
novel method, BN-WVAD, which incorporates BatchNorm into WVAD. In the proposed
BN-WVAD, we leverage the Divergence of Feature from Mean vector (DFM) of
BatchNorm as a reliable abnormality criterion to discern potential abnormal
snippets in abnormal videos. The proposed DFM criterion is also discriminative
for anomaly recognition and more resilient to label noise, serving as the
additional anomaly score to amend the prediction of the anomaly classifier that
is susceptible to noisy labels. Moreover, a batch-level selection strategy is
devised to filter more abnormal snippets in videos where more abnormal events
occur. The proposed BN-WVAD model demonstrates state-of-the-art performance on
UCF-Crime with an AUC of 87.24%, and on XD-Violence, where AP reaches up to
84.93%. Our code implementation is accessible at
https://github.com/cool-xuan/BN-WVAD
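The DFM criterion described above can be illustrated as a distance of each snippet feature from the BatchNorm running statistics. The sketch below assumes a diagonal-covariance (Mahalanobis-style) form and a simple top-k batch-level selection; the exact formulation in BN-WVAD may differ.

```python
import numpy as np

def dfm_score(features, bn_mean, bn_var, eps=1e-5):
    # Divergence of Feature from Mean: normalized deviation of each snippet
    # feature from the BatchNorm mean, under a diagonal variance.
    z = (features - bn_mean) / np.sqrt(bn_var + eps)
    return np.linalg.norm(z, axis=-1)

def select_abnormal(scores, k):
    # batch-level selection: keep the top-k snippets by DFM score
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(1)
normal = rng.normal(0.0, 1.0, size=(18, 16))    # typical snippet features
outliers = rng.normal(4.0, 1.0, size=(2, 16))   # outlier (abnormal) snippets
feats = np.vstack([normal, outliers])
scores = dfm_score(feats, feats.mean(0), feats.var(0))
picked = select_abnormal(scores, k=2)
```

Because abnormal snippets behave as statistical outliers, their normalized deviation from the batch mean is large, which is exactly the insight the paper builds on.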
DePT: Decoupled Prompt Tuning
This work breaks through the Base-New Tradeoff (BNT) dilemma in prompt tuning,
i.e., the better the tuned model generalizes to the base (or target) task, the
worse it generalizes to new tasks, and vice versa. Specifically, through an
in-depth analysis of the learned features of the base and new tasks, we observe
that the BNT stems from a channel bias issue, i.e., the vast majority of
feature channels are occupied by base-specific knowledge, resulting in the
collapse of task-shared knowledge important to new tasks. To address this, we
propose the Decoupled Prompt Tuning (DePT) framework, which decouples
base-specific knowledge from feature channels into an isolated feature space
during prompt tuning, so as to maximally preserve task-shared knowledge in the
original feature space for achieving better zero-shot generalization on new
tasks. Importantly, our DePT is orthogonal to existing prompt tuning methods,
hence it can improve all of them. Extensive experiments on 11 datasets show the
strong flexibility and effectiveness of DePT. Our code and pretrained models
are available at https://github.com/Koorye/DePT.
Comment: 13 page
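The decoupling idea can be sketched conceptually: base-task learning operates in an isolated feature space produced by a separate linear head, so the original channel space, where task-shared knowledge lives, is left intact for zero-shot transfer. This is only a schematic of the idea, not the paper's architecture; all names and shapes below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
features = rng.normal(size=(4, d))     # backbone features (shared space)

W_iso = rng.normal(size=(d, d))        # decoupling head -> isolated space
base_feats = features @ W_iso          # consumed only by the base-task head

# Zero-shot scoring on new tasks bypasses W_iso and stays in the original
# feature space, where task-shared knowledge is preserved.
new_task_protos = rng.normal(size=(3, d))  # e.g. class prototypes from text
zero_shot_logits = features @ new_task_protos.T
```

The point of the design is that gradients from the base task flow into `W_iso` rather than reshaping the shared channels, which is why the framework is orthogonal to the choice of prompt tuning method.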
MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation
Zero-shot Text-to-Video synthesis generates videos based on prompts without
any videos. Without motion information from videos, motion priors implied in
prompts are vital guidance. For example, the prompt "airplane landing on the
runway" indicates motion priors that the "airplane" moves downwards while the
"runway" stays static. Whereas the motion priors are not fully exploited in
previous approaches, thus leading to two nontrivial issues: 1) the motion
variation pattern remains unaltered and prompt-agnostic for disregarding motion
priors; 2) the motion control of different objects is inaccurate and entangled
without considering the independent motion priors of different objects. To
tackle the two issues, we propose a prompt-adaptive and disentangled motion
control strategy coined as MotionZero, which derives motion priors from prompts
of different objects by Large-Language-Models and accordingly applies motion
control of different objects to corresponding regions in disentanglement.
Furthermore, to facilitate videos with varying degrees of motion amplitude, we
propose a Motion-Aware Attention scheme which adjusts attention among frames by
motion amplitude. Extensive experiments demonstrate that our strategy could
correctly control the motion of different objects and support versatile
applications, including zero-shot video editing.
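The "airplane landing" example can be made concrete with a toy sketch: per-object motion priors (which the paper derives from the prompt via a Large Language Model; here they are hard-coded as an assumption) are applied independently to each object's region, frame by frame. Region boxes and velocities are purely illustrative.

```python
import numpy as np

motion_priors = {           # e.g. parsed from "airplane landing on the runway"
    "airplane": (0, 2),     # moves downward: dy = +2 per frame
    "runway":   (0, 0),     # stays static
}
regions = {
    "airplane": np.array([10, 10, 20, 20]),   # x0, y0, x1, y1 box
    "runway":   np.array([0, 40, 64, 64]),
}

def move_regions(regions, priors, n_frames):
    # Disentangled control: each object's region is shifted by its own
    # motion prior, independently of the other objects.
    frames = []
    for t in range(n_frames):
        frames.append({
            name: box + t * np.array([priors[name][0], priors[name][1]] * 2)
            for name, box in regions.items()
        })
    return frames

frames = move_regions(regions, motion_priors, n_frames=3)
```

This captures only the region-level bookkeeping; the actual method additionally modulates cross-frame attention by motion amplitude.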
DETA: Denoised Task Adaptation for Few-Shot Learning
Test-time task adaptation in few-shot learning aims to adapt a pre-trained
task-agnostic model to capture task-specific knowledge of the test task,
relying only on a few labeled support samples. Previous approaches generally focus on
developing advanced algorithms to achieve the goal, while neglecting the
inherent problems of the given support samples. In fact, with only a handful of
samples available, the adverse effect of either the image noise (a.k.a.
X-noise) or the label noise (a.k.a. Y-noise) from support samples can be
severely amplified. To address this challenge, in this work we propose DEnoised
Task Adaptation (DETA), a first, unified image- and label-denoising framework
orthogonal to existing task adaptation approaches. Without extra supervision,
DETA filters out task-irrelevant, noisy representations by taking advantage of
both global visual information and local region details of support samples. On
the challenging Meta-Dataset, DETA consistently improves the performance of a
broad spectrum of baseline methods applied on various pre-trained models.
Notably, by tackling the overlooked image noise in Meta-Dataset, DETA
establishes new state-of-the-art results. Code is released at
https://github.com/nobody-1617/DETA.
Comment: 10 pages, 5 figure
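The denoising intuition can be sketched loosely: down-weight support samples whose global features and local-region features disagree with their class prototype, treating them as carriers of X- or Y-noise. This is an illustrative simplification, not DETA's actual algorithm; the weighting scheme and toy features are assumptions.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def denoise_weights(global_feats, region_feats, labels):
    # class prototypes from global features of the support set
    protos = {c: global_feats[labels == c].mean(0) for c in set(labels)}
    weights = []
    for g, regions, c in zip(global_feats, region_feats, labels):
        # agreement of the image AND its local regions with the prototype;
        # noisy samples agree poorly at both scales
        sims = [cosine(g, protos[c])] + [cosine(r, protos[c]) for r in regions]
        weights.append(np.mean(sims))
    w = np.array(weights)
    return w / w.sum()

rng = np.random.default_rng(3)
clean = rng.normal(loc=1.0, size=(2, 8))
noisy = rng.normal(loc=-1.0, size=(1, 8))   # corrupted / mislabeled support
g = np.vstack([clean, noisy])
regions = g[:, None, :].repeat(2, axis=1)   # 2 region crops per image (toy)
w = denoise_weights(g, regions, np.array([0, 0, 0]))
```

Combining both scales is the key point: an image with clean global appearance but noisy regions (or vice versa) still receives a reduced weight.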
CUCL: Codebook for Unsupervised Continual Learning
The focus of this study is on Unsupervised Continual Learning (UCL), as it
presents an alternative to Supervised Continual Learning which needs
high-quality manual labeled data. The experiments under the UCL paradigm
indicate a phenomenon where the results on the first few tasks are suboptimal.
This phenomenon can render the model inappropriate for practical applications.
To address this issue, after analyzing the phenomenon and identifying the lack
of diversity as a vital factor, we propose a method named Codebook for
Unsupervised Continual Learning (CUCL) which promotes the model to learn
discriminative features to complete the class boundary. Specifically, we first
introduce Product Quantization to inject diversity into the representation
and apply a cross quantized contrastive loss between the original
representation and the quantized one to capture discriminative information.
Then, based on the quantizer, we propose an effective Codebook Rehearsal to
address catastrophic forgetting. This study involves conducting extensive
experiments on CIFAR100, TinyImageNet, and MiniImageNet benchmark datasets. Our
method significantly boosts the performance of both supervised and unsupervised
methods. For instance, on TinyImageNet, our method yields relative
improvements of 12.76% and 7% over SimSiam and BYOL, respectively.
Comment: MM '23: Proceedings of the 31st ACM International Conference on
Multimedia
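Product Quantization, the building block named above, can be shown in a few lines: each representation is split into sub-vectors and each sub-vector is snapped to its nearest codeword. The cross quantized contrastive loss would then contrast a representation with its quantized counterpart. Codebook contents and sizes below are random, purely for illustration.

```python
import numpy as np

def product_quantize(x, codebooks):
    # x: (d,); codebooks: (n_sub, n_codes, d_sub) with d = n_sub * d_sub
    n_sub, n_codes, d_sub = codebooks.shape
    subs = x.reshape(n_sub, d_sub)
    out = np.empty_like(subs)
    for i, s in enumerate(subs):
        dists = np.linalg.norm(codebooks[i] - s, axis=1)
        out[i] = codebooks[i][np.argmin(dists)]   # nearest codeword
    return out.reshape(-1)

rng = np.random.default_rng(4)
codebooks = rng.normal(size=(4, 16, 8))   # 4 sub-spaces, 16 codes each
x = rng.normal(size=32)
q = product_quantize(x, codebooks)
```

Because each sub-space quantizes independently, the codebook represents combinatorially many distinct vectors (16^4 here), which is the diversity the method exploits; storing codeword indices also makes the Codebook Rehearsal cheap.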
From Global to Local: Multi-scale Out-of-distribution Detection
Out-of-distribution (OOD) detection aims to detect "unknown" data whose
labels have not been seen during the in-distribution (ID) training process.
Recent progress in representation learning gives rise to distance-based OOD
detection that recognizes inputs as ID/OOD according to their relative
distances to the training data of ID classes. Previous approaches calculate
pairwise distances relying only on global image representations, which can be
sub-optimal as the inevitable background clutter and intra-class variation may
drive image-level representations from the same ID class far apart in a given
representation space. In this work, we overcome this challenge by proposing
Multi-scale OOD DEtection (MODE), a first framework leveraging both global
visual information and local region details of images to maximally benefit OOD
detection. Specifically, we first find that existing models pretrained by
off-the-shelf cross-entropy or contrastive losses are incompetent to capture
valuable local representations for MODE, due to the scale-discrepancy between
the ID training and OOD detection processes. To mitigate this issue and
encourage locally discriminative representations in ID training, we propose
Attention-based Local PropAgation (ALPA), a trainable objective that exploits a
cross-attention mechanism to align and highlight the local regions of the
target objects for pairwise examples. During test-time OOD detection, a
Cross-Scale Decision (CSD) function is further devised on the most
discriminative multi-scale representations to distinguish ID/OOD data more
faithfully. We demonstrate the effectiveness and flexibility of MODE on several
benchmarks -- on average, MODE outperforms the previous state-of-the-art by up
to 19.24% in FPR and 2.77% in AUROC. Code is available at
https://github.com/JimZAI/MODE-OOD.
Comment: 13 page
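The multi-scale scoring can be sketched in simplified form: a test input is scored by its distance to ID prototypes at both the global scale and the best-matching local region, and the smaller distance decides. The real Cross-Scale Decision function is more involved; this toy version and all feature values are assumptions.

```python
import numpy as np

def ood_score(global_feat, region_feats, id_protos):
    # distance of the global feature to the nearest ID prototype
    d_global = min(np.linalg.norm(global_feat - p) for p in id_protos)
    # distance of the best-matching local region to the nearest prototype
    d_local = min(np.linalg.norm(r - p)
                  for r in region_feats for p in id_protos)
    # simplified cross-scale decision: trust the smaller (more confident) scale
    return min(d_global, d_local)

rng = np.random.default_rng(5)
id_protos = rng.normal(size=(3, 8))
id_x = id_protos[0] + 0.1 * rng.normal(size=8)   # near an ID class
ood_x = id_protos[0] + 5.0 * np.ones(8)          # far from all ID classes
regions = lambda x: x[None, :] + 0.05 * rng.normal(size=(2, 8))
s_id = ood_score(id_x, regions(id_x), id_protos)
s_ood = ood_score(ood_x, regions(ood_x), id_protos)
```

Taking the minimum over scales addresses the failure mode the abstract describes: background clutter can inflate the global distance for an ID input, but a well-matched local region keeps its score low.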
An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization
Recently, diffusion models have achieved remarkable success in generating
tasks, including image and audio generation. However, like other generative
models, diffusion models are prone to privacy issues. In this paper, we propose
an efficient query-based membership inference attack (MIA), namely Proximal
Initialization Attack (PIA), which utilizes the ground-truth trajectory
obtained by the noise initialized at t=0 and the predicted point to infer
memberships.
Experimental results indicate that the proposed method can achieve competitive
performance with only two queries on both discrete-time and continuous-time
diffusion models. Moreover, previous works on the privacy of diffusion models
have focused on vision tasks without considering audio tasks. Therefore, we
also explore the robustness of diffusion models to MIA in the text-to-speech
(TTS) task, which is an audio generation task. To the best of our knowledge,
this work is the first to study the robustness of diffusion models to MIA in
the TTS task. Experimental results indicate that models with mel-spectrogram
(image-like) output are vulnerable to MIA, while models with audio output are
relatively robust to MIA. {Code is available at
\url{https://github.com/kong13661/PIA}}
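The two-query attack flow can be shown schematically: query the model once at the clean input to obtain a "ground-truth" noise, diffuse the input with that noise, query again, and score membership by the prediction error. The "model" here is a stand-in closure and the schedule value is illustrative; only the overall flow follows the attack described above.

```python
import numpy as np

def pia_score(x0, eps_model, t=0.5, alpha_bar=0.7):
    # Query 1: the model's prediction at the clean input serves as the
    # ground-truth noise for building the trajectory.
    eps = eps_model(x0, 0.0)
    # Diffuse x0 to time t with that noise (standard DDPM forward form).
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps
    # Query 2: prediction error at x_t; members tend to score lower.
    return np.linalg.norm(eps_model(x_t, t) - eps)

# Toy stand-ins: a model that is consistent on members (memorized) and
# inconsistent on non-members.
member_model = lambda x, t: np.zeros_like(x)
rng = np.random.default_rng(6)
nonmember_model = lambda x, t: rng.normal(size=x.shape)
x0 = rng.normal(size=16)
s_member = pia_score(x0, member_model)
s_nonmember = pia_score(x0, nonmember_model)
```

Thresholding this score yields the membership decision; the efficiency claim in the abstract comes from needing only the two model queries shown.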