Breaking the Spurious Causality of Conditional Generation via Fairness Intervention with Corrective Sampling
To capture the relationship between samples and labels, conditional
generative models often inherit spurious correlations from the training
dataset. This can result in label-conditional distributions that are imbalanced
with respect to another latent attribute. To mitigate this issue, which we call
spurious causality of conditional generation, we propose a general two-step
strategy. (a) Fairness Intervention (FI): emphasize the minority samples that
are hard to generate due to the spurious correlation in the training dataset.
(b) Corrective Sampling (CS): explicitly filter the generated samples and
ensure that they follow the desired latent attribute distribution. We have
designed the fairness intervention to work for various degrees of supervision
on the spurious attribute, including unsupervised, weakly-supervised, and
semi-supervised scenarios. Our experimental results demonstrate that FICS can
effectively resolve spurious causality of conditional generation across various
datasets.
Comment: TMLR 202
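The two-step recipe described above can be illustrated with a minimal sketch.
The inverse-frequency weighting for FI and the quota-based filtering for CS
below are assumptions chosen for illustration, not the paper's exact
formulation; the weights would typically be used to reweight minority samples
in the generator's training loss.

```python
# Minimal sketch of the FI + CS idea (illustrative assumptions, not FICS itself).
import numpy as np

def fairness_intervention_weights(latent_attrs):
    """(a) Fairness Intervention: upweight minority-attribute samples."""
    values, counts = np.unique(latent_attrs, return_counts=True)
    freq = dict(zip(values, counts / counts.sum()))
    # Inverse-frequency weights emphasize samples that are rare under the
    # spurious attribute (an illustrative choice, not the paper's exact rule).
    return np.array([1.0 / freq[a] for a in latent_attrs])

def corrective_sampling(samples, attrs, target_dist, n_keep):
    """(b) Corrective Sampling: keep generated samples so that the kept set
    follows the desired latent-attribute distribution."""
    quota = {a: int(round(p * n_keep)) for a, p in target_dist.items()}
    kept = []
    for x, a in zip(samples, attrs):
        if quota.get(a, 0) > 0:
            kept.append(x)
            quota[a] -= 1
        if len(kept) == n_keep:
            break
    return kept
```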
S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions
Vision-language models, such as contrastive language-image pre-training
(CLIP), have demonstrated impressive results in natural image domains. However,
these models often struggle when applied to specialized domains like remote
sensing, and adapting to such domains is challenging due to the limited number
of image-text pairs available for training. To address this, we propose S-CLIP,
a semi-supervised learning method for training CLIP that utilizes additional
unpaired images. S-CLIP employs two pseudo-labeling strategies specifically
designed for contrastive learning and the language modality. The caption-level
pseudo-label is given by a combination of captions of paired images, obtained
by solving an optimal transport problem between unpaired and paired images. The
keyword-level pseudo-label is a keyword from the caption of the nearest paired
image, used as supervision through partial label learning, which assumes a
candidate set of labels instead of a single exact label. By combining these
objectives, S-CLIP significantly enhances the training of CLIP using only a few
image-text pairs, as demonstrated in various specialist domains, including
remote sensing, fashion, scientific figures, and comics. For instance, S-CLIP
improves CLIP by 10% for zero-shot classification and 4% for image-text
retrieval on the remote sensing benchmark, matching the performance of
supervised CLIP while using three times fewer image-text pairs.
Comment: NeurIPS 202
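A rough sketch of the caption-level pseudo-labeling idea follows: unpaired
images are soft-assigned to the captions of paired images via an
entropy-regularized optimal transport plan. The Sinkhorn solver, the
cosine-distance cost, and the uniform marginals are illustrative assumptions,
not S-CLIP's exact objective.

```python
import numpy as np

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropy-regularized OT with uniform marginals (illustrative solver)."""
    n, m = cost.shape
    K = np.exp(-cost / eps)
    r, c = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    v = np.full(m, 1.0 / m)
    for _ in range(n_iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]  # transport plan

def caption_pseudo_labels(unpaired_emb, paired_emb):
    """Soft assignment of each unpaired image to the paired images; each row
    mixes the corresponding paired captions into a caption-level pseudo-label."""
    # Cosine distance as the OT cost, assuming L2-normalized embeddings.
    cost = 1.0 - unpaired_emb @ paired_emb.T
    plan = sinkhorn(cost)
    return plan / plan.sum(axis=1, keepdims=True)
```

Each row of the returned matrix can then serve as mixing weights over the
paired captions when forming the contrastive target for an unpaired image.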
MASKER: Masked Keyword Regularization for Reliable Text Classification
Pre-trained language models have achieved state-of-the-art accuracies on
various text classification tasks, e.g., sentiment analysis, natural language
inference, and semantic textual similarity. However, the reliability of the
fine-tuned text classifiers is an often overlooked performance criterion. For
instance, one may desire a model that can detect out-of-distribution (OOD)
samples (drawn far from training distribution) or be robust against domain
shifts. We claim that one central obstacle to reliability is the model's
over-reliance on a limited number of keywords rather than the whole context.
In particular, we find that (a) OOD samples often contain
in-distribution keywords, while (b) cross-domain samples may not always contain
keywords; over-relying on the keywords can be problematic for both cases. In
light of this observation, we propose a simple yet effective fine-tuning
method, coined masked keyword regularization (MASKER), that facilitates
context-based prediction. MASKER regularizes the model to reconstruct the
keywords from the rest of the words and make low-confidence predictions without
enough context. When applied to various pre-trained language models (e.g.,
BERT, RoBERTa, and ALBERT), we demonstrate that MASKER improves OOD detection
and cross-domain generalization without degrading classification accuracy. Code
is available at https://github.com/alinlab/MASKER.
Comment: AAAI 2021. First two authors contributed equally.
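The description above suggests combining three terms: standard classification,
keyword reconstruction from the surrounding context, and a low-confidence
penalty on keyword-only inputs. The sketch below is a hedged reading of that
combination; the loss weights, the uniform-distribution regularizer, and the
function and argument names are assumptions, and the official repository
linked above is the authoritative reference.

```python
import torch.nn.functional as F

def masker_style_loss(cls_logits, labels, recon_logits, keyword_ids,
                      keyword_only_logits, lam_recon=0.1, lam_ent=0.1):
    # Standard classification loss on the full input.
    loss_cls = F.cross_entropy(cls_logits, labels)
    # Masked keyword reconstruction: predict masked keyword tokens from context.
    loss_recon = F.cross_entropy(recon_logits, keyword_ids)
    # Encourage low-confidence predictions when only keywords are visible:
    # push the predictive distribution toward uniform (KL to uniform, up to
    # an additive constant).
    log_p = F.log_softmax(keyword_only_logits, dim=-1)
    loss_ent = -log_p.mean()
    return loss_cls + lam_recon * loss_recon + lam_ent * loss_ent
```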
Discovering and Mitigating Visual Biases through Keyword Explanation
Addressing biases in computer vision models is crucial for real-world AI
deployments. However, mitigating visual biases is challenging due to their
unexplainable nature, often identified indirectly through visualization or
sample statistics, which necessitates additional human supervision for
interpretation. To tackle this issue, we propose the Bias-to-Text (B2T)
framework, which interprets visual biases as keywords. Specifically, we extract
common keywords from the captions of mispredicted images to identify potential
biases in the model. We then validate these keywords by measuring their
similarity to the mispredicted images using a vision-language scoring model.
Explaining visual biases as keywords offers several advantages, such as
clear group naming for bias discovery and a natural extension to debiasing
using these group names. Our experiments demonstrate that B2T can identify
known biases, such as gender bias in CelebA, background bias in Waterbirds, and
distribution shifts in ImageNet-R/C. Additionally, B2T uncovers novel biases in
larger datasets, such as Dollar Street and ImageNet. For example, we discovered
a contextual bias between "bee" and "flower" in ImageNet. We also highlight
various applications of B2T keywords, including debiased training, CLIP
prompting, and model comparison.
Comment: CVPR 2024. First two authors contributed equally.
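A minimal sketch of the keyword-mining and validation steps might look as
follows. The frequency-ratio scoring, the threshold value, and the `vl_score`
callable (a hypothetical stand-in for a CLIP-like image-text scoring model)
are assumptions for illustration, not the B2T implementation.

```python
from collections import Counter

def bias_keywords(mispredicted_captions, all_captions, top_k=20):
    """Keywords over-represented in the captions of mispredicted images."""
    mis = Counter(w for c in mispredicted_captions for w in c.lower().split())
    ref = Counter(w for c in all_captions for w in c.lower().split())
    # Simple frequency-ratio score (illustrative; B2T's scoring may differ).
    scores = {w: mis[w] / (1 + ref[w]) for w in mis}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def validate_keywords(keywords, mispredicted_images, vl_score, threshold=0.25):
    """Keep keywords that a vision-language model rates as similar to the
    mispredicted images; `vl_score(image, text)` is a hypothetical stand-in."""
    kept = []
    for kw in keywords:
        sim = sum(vl_score(img, kw) for img in mispredicted_images)
        if sim / len(mispredicted_images) > threshold:
            kept.append(kw)
    return kept
```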
Diffusion Probabilistic Models for Structured Node Classification
This paper studies structured node classification on graphs, where the
predictions should consider dependencies between the node labels. In
particular, we focus on solving the problem for partially labeled graphs where
it is essential to incorporate the information in the known label for
predicting the unknown labels. To address this issue, we propose a novel
framework leveraging the diffusion probabilistic model for structured node
classification (DPM-SNC). At the heart of our framework is the extraordinary
capability of DPM-SNC to (a) learn a joint distribution over the labels with an
expressive reverse diffusion process and (b) make predictions conditioned on
the known labels via manifold-constrained sampling. Since DPMs lack
training algorithms for partially labeled data, we design a novel training
algorithm that maximizes a new variational lower bound. We also
theoretically analyze how DPMs benefit node classification by enhancing the
expressive power of GNNs, proposing AGG-WL, a test strictly more powerful
than the classic 1-WL test. We extensively verify the superiority of
our DPM-SNC in diverse scenarios, which include not only the transductive
setting on partially labeled graphs but also the inductive setting and
unlabeled graphs.
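A schematic of predicting unknown labels conditioned on known ones during
reverse diffusion is sketched below. The `denoiser` interface and the simple
overwriting of observed rows at every step are assumptions for illustration;
the paper's manifold-constrained sampling is more involved.

```python
import torch

def reverse_diffusion_with_known_labels(denoiser, y_known, known_mask,
                                        num_nodes, num_classes, n_steps=100):
    """Sample label assignments for all nodes while clamping observed ones.

    y_known: (num_nodes, num_classes) one-hot labels (arbitrary for unknown rows)
    known_mask: (num_nodes,) boolean mask of observed nodes
    """
    y = torch.randn(num_nodes, num_classes)   # start from pure noise
    for t in reversed(range(n_steps)):
        y = denoiser(y, t)                     # one reverse-diffusion step
        # Condition on the known labels by overwriting the observed rows.
        y[known_mask] = y_known[known_mask]
    return y.argmax(dim=-1)                    # hard labels for every node
```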
Gesture-to-gesture translation in the wild via category-independent conditional maps
Recent works have shown Generative Adversarial Networks (GANs) to be particularly effective in image-to-image translation. However, in tasks such as body pose and hand gesture translation, existing methods usually require precise annotations, e.g., key-points or skeletons, which are time-consuming to draw. In this work, we propose a novel GAN architecture that decouples the required annotations into a category label, which specifies the gesture type, and a simple-to-draw, category-independent conditional map, which expresses the location, rotation, and size of the hand gesture. Our architecture synthesizes the target gesture while preserving the background context, thus effectively dealing with gesture translation in the wild. To this end, we use an attention module and a rolling guidance approach, which loops the generated images back into the network and produces higher-quality images than competing works. Thus, our GAN learns to generate new images from simple annotations without requiring key-points or skeleton labels. Results on two public datasets show that our method outperforms state-of-the-art approaches both quantitatively and qualitatively. To the best of our knowledge, no prior work has addressed gesture-to-gesture translation in the wild while requiring only user-friendly annotations.
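The rolling guidance approach can be sketched as a short feedback loop. The
generator signature (source image, conditional map, category label) and the
number of refinement passes are assumptions inferred from the abstract, not
the authors' architecture.

```python
def rolling_guidance_generate(generator, src_image, cond_map, category, n_loops=3):
    """Refine the generated gesture by feeding the output back as input."""
    out = generator(src_image, cond_map, category)
    for _ in range(n_loops - 1):
        # Loop the generated image back into the network (rolling guidance).
        out = generator(out, cond_map, category)
    return out
```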