Learning Graph Embeddings for Compositional Zero-shot Learning
In compositional zero-shot learning, the goal is to recognize unseen
compositions (e.g. old dog) of visual primitives observed in the training set,
i.e. states (e.g. old, cute) and objects (e.g. car, dog). This is challenging
because, for example, the same state can alter the visual appearance of a dog
very differently from that of a car. As a solution, we propose a novel graph
formulation called Compositional Graph Embedding (CGE) that learns image
features, compositional classifiers, and latent representations of visual
primitives in an end-to-end manner. The key to our approach is exploiting the
dependency between states, objects, and their compositions within a graph
structure to enforce the relevant knowledge transfer from seen to unseen
compositions. By learning a joint compatibility that encodes semantics between
concepts, our model allows for generalization to unseen compositions without
relying on an external knowledge base like WordNet. We show that in the
challenging generalized compositional zero-shot setting our CGE significantly
outperforms the state of the art on MIT-States and UT-Zappos. We also propose a
new benchmark for this task based on the recent GQA dataset.
Comment: Accepted in IEEE CVPR 2021
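To make the graph formulation more concrete, below is a minimal PyTorch sketch of the idea described above: state, object, and composition nodes are embedded jointly by propagating over a shared graph, and images are scored against the resulting composition embeddings. All module names, dimensions, and the toy adjacency are illustrative assumptions, not the authors' released CGE code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCompositionalGraphEmbedding(nn.Module):
    """Toy sketch: propagate embeddings of states, objects, and compositions
    over a shared graph, then score images against the resulting composition
    embeddings. Names and sizes are illustrative, not the CGE release."""

    def __init__(self, num_nodes, adjacency, word_dim=300, embed_dim=512):
        super().__init__()
        # Symmetrically normalised adjacency (with self-loops) over all
        # state, object, and composition nodes.
        a_hat = adjacency + torch.eye(num_nodes)
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        self.register_buffer(
            "norm_adj", d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :])
        self.node_init = nn.Parameter(torch.randn(num_nodes, word_dim))
        self.gcn1 = nn.Linear(word_dim, embed_dim)
        self.gcn2 = nn.Linear(embed_dim, embed_dim)

    def node_embeddings(self):
        # Two rounds of propagation mix information between primitives and
        # the compositions they participate in.
        h = F.relu(self.gcn1(self.norm_adj @ self.node_init))
        return self.gcn2(self.norm_adj @ h)                     # (num_nodes, D)

    def forward(self, image_features, composition_node_ids):
        # Joint compatibility: dot product between image features and the
        # graph-derived classifiers of the candidate compositions.
        comp = self.node_embeddings()[composition_node_ids]     # (C, D)
        return image_features @ comp.t()                        # (B, C) logits

# Toy usage: 2 states + 2 objects + 4 compositions = 8 nodes, dense adjacency.
adj = torch.ones(8, 8)
model = ToyCompositionalGraphEmbedding(num_nodes=8, adjacency=adj)
logits = model(torch.randn(4, 512), torch.arange(4, 8))
loss = F.cross_entropy(logits, torch.randint(0, 4, (4,)))
```

Because unseen compositions share state and object nodes with seen ones, the graph propagation is what lets knowledge transfer to compositions that never appear in training.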
Learning to Prompt with Text Only Supervision for Vision-Language Models
Foundational vision-language models such as CLIP are becoming a new paradigm
in vision, due to their excellent generalization abilities. However, adapting
these models for downstream tasks while maintaining their generalization
remains a challenge. In the literature, one branch of methods adapts CLIP by
learning prompts using visual information. While effective, most of these works
require labeled data, which is not practical, and often struggle to generalize
to new datasets due to over-fitting on the source data. An alternative
approach resorts to training-free methods by generating class descriptions from
large language models (LLMs) and performing prompt ensembling. However, these
methods often generate class-specific prompts that cannot be transferred to
other classes, and they incur higher costs by generating LLM descriptions for each
class separately. In this work, we propose to combine the strengths of
both streams of methods by learning prompts using only text data derived from
LLMs. As supervised training of prompts is not trivial due to the absence of
images, we develop a training approach that allows prompts to extract rich
contextual knowledge from LLM data. Moreover, since the LLM contextual data is
mapped into the learned prompts, they can be transferred zero-shot to new
classes and datasets, potentially cutting the LLM prompt engineering cost. To
the best of our knowledge, this is the first work that learns generalized
prompts using text-only data. We perform extensive evaluations on 4 benchmarks
where our method improves over prior ensembling works while being competitive
with those utilizing labeled images. Our code and pre-trained models are
available at https://github.com/muzairkhattak/ProText.
Comment: Project Page: https://muzairkhattak.github.io/ProText
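As a rough illustration of prompt learning with text-only supervision, the sketch below trains learnable context vectors so that prompted class-name embeddings match frozen embeddings of LLM-generated class descriptions. The tiny encoder, shapes, and training loop are stand-in assumptions, not the released ProText implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTextEncoder(nn.Module):
    """Stand-in for a frozen CLIP-style text tower operating on token embeddings."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_embeds):                      # (B, T, dim)
        h, _ = self.attn(token_embeds, token_embeds, token_embeds)
        return F.normalize(self.proj(h.mean(dim=1)), dim=-1)

dim, n_ctx, n_classes = 512, 4, 10
encoder = ToyTextEncoder(dim).eval()
for p in encoder.parameters():                            # text encoder stays frozen
    p.requires_grad_(False)

# Learnable context vectors shared across classes (the "prompt").
prompts = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
optimizer = torch.optim.AdamW([prompts], lr=1e-3)

# Placeholder token embeddings for class names and for LLM-generated class
# descriptions; in practice both would come from the tokenizer / embedding table.
class_tokens = torch.randn(n_classes, 8, dim)
with torch.no_grad():
    description_targets = encoder(torch.randn(n_classes, 32, dim))

for step in range(100):
    # Prepend the learned context to each class name and encode.
    prompted = torch.cat([prompts.expand(n_classes, -1, -1), class_tokens], dim=1)
    class_embeds = encoder(prompted)
    # Pull each prompted class embedding toward its LLM-description embedding.
    loss = (1 - (class_embeds * description_targets).sum(dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the learned context is shared rather than class-specific, the same prompt vectors can be reused with the names of new classes at test time.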
SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance
In semi-supervised semantic segmentation, a model is trained with a limited
number of labeled images along with a large corpus of unlabeled images to
reduce the high annotation effort. While previous methods are able to learn
good segmentation boundaries, they are prone to confusing classes with similar
visual appearance due to the limited supervision. On the other hand,
vision-language models (VLMs) are able to learn diverse semantic knowledge from
image-caption datasets but produce noisy segmentation due to the image-level
training. In SemiVL, we propose to integrate rich priors from VLM pre-training
into semi-supervised semantic segmentation to learn better semantic decision
boundaries. To adapt the VLM from global to local reasoning, we introduce a
spatial fine-tuning strategy for label-efficient learning. Further, we design a
language-guided decoder to jointly reason over vision and language. Finally, we
propose to handle inherent ambiguities in class labels by providing the model
with language guidance in the form of class definitions. We evaluate SemiVL on
4 semantic segmentation datasets, where it significantly outperforms previous
semi-supervised methods. For instance, SemiVL improves the state-of-the-art by
+13.5 mIoU on COCO with 232 annotated images and by +6.1 mIoU on Pascal VOC
with 92 labels. Project page: https://github.com/google-research/semiv
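The following sketch illustrates the kind of language guidance the abstract describes: per-pixel logits obtained by scoring dense visual features against class text embeddings, combined here with a generic confidence-thresholded pseudo-labeling loss for the unlabeled images. Shapes, the threshold, and the loss composition are simplifying assumptions rather than SemiVL's exact recipe.

```python
import torch
import torch.nn.functional as F

def language_guided_logits(pixel_features, class_text_embeds, temperature=0.07):
    """Per-pixel logits from cosine similarity between dense visual features
    (B, D, H, W) and class text embeddings (C, D) -> (B, C, H, W)."""
    pix = F.normalize(pixel_features, dim=1)
    txt = F.normalize(class_text_embeds, dim=1)
    return torch.einsum("bdhw,cd->bchw", pix, txt) / temperature

def semi_supervised_loss(logits_labeled, labels, logits_unlabeled, conf_thresh=0.95):
    """Cross-entropy on labeled pixels plus cross-entropy on confident
    pseudo-labels for unlabeled images (a generic pseudo-labeling scheme)."""
    sup = F.cross_entropy(logits_labeled, labels, ignore_index=255)
    probs = F.softmax(logits_unlabeled.detach(), dim=1)
    conf, pseudo = probs.max(dim=1)
    pseudo[conf < conf_thresh] = 255                  # drop low-confidence pixels
    if (pseudo != 255).any():
        unsup = F.cross_entropy(logits_unlabeled, pseudo, ignore_index=255)
    else:
        unsup = sup * 0                               # nothing confident yet
    return sup + unsup

# Toy usage: 2 images, 512-dim dense features, 21 classes (e.g. Pascal VOC).
feats = torch.randn(2, 512, 32, 32)
text_embeds = torch.randn(21, 512)
logits = language_guided_logits(feats, text_embeds)   # (2, 21, 32, 32)
loss = semi_supervised_loss(logits, torch.randint(0, 21, (2, 32, 32)), logits)
```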
SILC: Improving Vision Language Pretraining with Self-Distillation
Image-Text pretraining on web-scale image caption datasets has become the
default recipe for open vocabulary classification and retrieval models thanks
to the success of CLIP and its variants. Several works have also used CLIP
features for dense prediction tasks and have shown the emergence of open-set
abilities. However, the contrastive objective used by these models only focuses
on image-text alignment and does not incentivise image feature learning for
dense prediction tasks. In this work, we introduce SILC, a novel framework for
vision language pretraining. SILC improves image-text contrastive learning with
the simple addition of local-to-global correspondence learning by
self-distillation. We show that distilling local image features from an
exponential moving average (EMA) teacher model significantly improves model
performance on dense prediction tasks like detection and segmentation, while
also providing improvements on image-level tasks such as classification and
retrieval. SILC models set a new state of the art for zero-shot
classification, few-shot classification, image and text retrieval, zero-shot
segmentation, and open vocabulary segmentation. We further show that SILC
features greatly benefit open vocabulary detection, captioning and visual
question answering.
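A minimal sketch of the two objectives mentioned above, assuming toy encoders: an image-text contrastive loss plus local-to-global self-distillation from an exponential moving average (EMA) teacher. The crop handling, temperatures, and momentum value are illustrative, not SILC's actual training configuration.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Standard CLIP-style image-text contrastive loss."""
    img_emb, txt_emb = F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def distillation_loss(student_logits, teacher_logits, temp_s=0.1, temp_t=0.04):
    """Student (local view) matches the teacher's (global view) distribution."""
    teacher_probs = F.softmax(teacher_logits / temp_t, dim=-1).detach()
    return -(teacher_probs * F.log_softmax(student_logits / temp_s, dim=-1)).sum(-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Teacher weights track an exponential moving average of the student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1 - momentum)

# Toy encoders and a single training step on random features.
student = nn.Linear(768, 256)
teacher = copy.deepcopy(student)
text_encoder = nn.Linear(512, 256)

global_views = torch.randn(8, 768)     # features of full images
local_views = torch.randn(8, 768)      # features of local crops of the same images
captions = torch.randn(8, 512)

loss = contrastive_loss(student(global_views), text_encoder(captions)) \
     + distillation_loss(student(local_views), teacher(global_views))
loss.backward()
ema_update(teacher, student)
```

The self-distillation term is what rewards consistent local features, which the plain contrastive objective alone does not encourage.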
I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification
Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name. However, word embeddings extracted from pre-trained language models do not necessarily capture visual similarities, resulting in poor zero-shot performance. In this work, we argue that online textual documents, e.g., Wikipedia, contain rich visual descriptions of object classes and can therefore be used as powerful unsupervised side information for ZSL. To this end, we propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents by aligning both modalities in a shared embedding space. In order to distill discriminative visual words from noisy documents, we introduce a new cross-modal attention module that learns fine-grained interactions between image patches and document words. Consequently, our I2DFormer not only learns highly discriminative document embeddings that capture visual similarities but also gains the ability to localize visually relevant words in image regions. Quantitatively, we demonstrate that our I2DFormer significantly outperforms previous unsupervised semantic embeddings under both zero-shot and generalized zero-shot learning settings on three public datasets. Qualitatively, we show that our method leads to highly interpretable results where document words can be grounded in the image regions. Code available at https://github.com/ferjad/I2DFormer.
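To illustrate the cross-modal attention idea, the sketch below lets image patch tokens attend over document word tokens and pools the result into an image-document compatibility score; the attention weights indicate which words ground which patches. The module, dimensions, and pooling are hypothetical simplifications, not the released I2DFormer code.

```python
import torch
import torch.nn as nn

class ToyImageToDocumentAttention(nn.Module):
    """Image patch tokens attend over document word tokens; the attended
    representation is pooled into an image-document compatibility score."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, patch_tokens, word_tokens):
        # patch_tokens: (B, P, dim); word_tokens: (B, W, dim)
        attended, attn_weights = self.cross_attn(
            query=patch_tokens, key=word_tokens, value=word_tokens)
        score = self.score_head(attended.mean(dim=1)).squeeze(-1)   # (B,)
        # attn_weights (B, P, W) indicate which document words ground which patches.
        return score, attn_weights

model = ToyImageToDocumentAttention()
scores, attn = model(torch.randn(4, 196, 256), torch.randn(4, 120, 256))
# For zero-shot classification, an image would be scored against the document
# of every class and assigned to the class with the highest compatibility.
```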
Reliable fidelity and diversity metrics for generative models
Devising indicative evaluation metrics for the image generation task remains an open problem. The most widely used metric for measuring the similarity between real and generated images has been the Fréchet Inception Distance (FID) score. Because it does not differentiate the fidelity and diversity aspects of the generated images, recent papers have introduced variants of precision and recall metrics to diagnose those properties separately. In this paper, we show that even the latest versions of the precision and recall metrics are not reliable yet; for example, they fail to detect the match between two identical distributions, they are not robust against outliers, and the evaluation hyperparameters are selected arbitrarily. We propose density and coverage metrics that solve the above issues. We analytically and experimentally show that density and coverage provide more interpretable and reliable signals for practitioners than the existing metrics.
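A brute-force NumPy sketch of the density and coverage metrics described above: each real sample defines a ball whose radius is the distance to its k-th nearest real neighbour; density counts (normalized by k) how often fake samples fall inside these balls, and coverage is the fraction of real balls containing at least one fake sample. The k value and sample sizes below are arbitrary choices for illustration, and this quadratic implementation is only suitable for small sample sets.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between all rows of a (N, D) and b (M, D)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def density_and_coverage(real, fake, k=5):
    """Brute-force density and coverage over real (N, D) and fake (M, D) features."""
    real_dists = pairwise_distances(real, real)
    # Radius of each real sample's ball: distance to its k-th nearest real
    # neighbour (index k after sorting, since index 0 is the point itself).
    radii = np.sort(real_dists, axis=1)[:, k]
    cross = pairwise_distances(real, fake)            # (N, M)
    inside = cross <= radii[:, None]                  # fake j inside ball of real i
    density = inside.sum() / (k * fake.shape[0])
    coverage = inside.any(axis=1).mean()
    return density, coverage

# Toy check: identical distributions should give density and coverage near 1.
real = np.random.randn(500, 16)
fake = np.random.randn(500, 16)
print(density_and_coverage(real, fake, k=5))
```

Unlike improved precision, density is not capped at 1 and so can flag mode-dropping and over-concentration, while coverage stays robust to fake outliers because the balls are built around real samples only.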