Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders
Many approaches in generalized zero-shot learning rely on cross-modal mapping
between the image feature space and the class embedding space. As labeled
images are expensive, one direction is to augment the dataset by generating
either images or image features. However, the former misses fine-grained
details and the latter requires learning a mapping associated with class
embeddings. In this work, we take feature generation one step further and
propose a model where a shared latent space of image features and class
embeddings is learned by modality-specific aligned variational autoencoders.
This leaves us with the required discriminative information about the images and
classes in the latent features, on which we train a softmax classifier. The key
to our approach is that we align the distributions learned from images and from
side-information to construct latent features that contain the essential
multi-modal information associated with unseen classes. We evaluate our learned
latent features on several benchmark datasets, i.e. CUB, SUN, AWA1 and AWA2,
and establish a new state of the art on generalized zero-shot as well as on
few-shot learning. Moreover, our results on ImageNet with various zero-shot
splits show that our latent features generalize well in large-scale settings.
Comment: Accepted at CVPR 2019
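To make the mechanism concrete, below is a minimal PyTorch sketch of the aligned-VAE idea this abstract describes: two modality-specific VAEs whose latents are tied by cross-reconstruction and a distribution-alignment term. The layer sizes, loss weights, and the simple 2-Wasserstein-style alignment distance are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VAE(nn.Module):
        """A small modality-specific VAE (encoder/decoder sizes are illustrative)."""
        def __init__(self, in_dim, latent_dim=64, hidden=256):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            self.mu = nn.Linear(hidden, latent_dim)
            self.logvar = nn.Linear(hidden, latent_dim)
            self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, in_dim))

        def encode(self, x):
            h = self.enc(x)
            return self.mu(h), self.logvar(h)

        def reparameterize(self, mu, logvar):
            std = torch.exp(0.5 * logvar)
            return mu + std * torch.randn_like(std)

    def vae_loss(x, recon, mu, logvar):
        rec = F.mse_loss(recon, x)
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + kld

    def aligned_vae_loss(img_vae, cls_vae, img_feat, cls_emb, cross_w=1.0, align_w=1.0):
        """Within-modality VAE losses + cross-reconstruction + distribution alignment."""
        mu_i, lv_i = img_vae.encode(img_feat)
        mu_c, lv_c = cls_vae.encode(cls_emb)
        z_i = img_vae.reparameterize(mu_i, lv_i)
        z_c = cls_vae.reparameterize(mu_c, lv_c)

        # Each modality reconstructs itself from its own latent.
        loss = vae_loss(img_feat, img_vae.dec(z_i), mu_i, lv_i)
        loss = loss + vae_loss(cls_emb, cls_vae.dec(z_c), mu_c, lv_c)

        # Cross-reconstruction: decode each modality from the *other* latent.
        loss = loss + cross_w * (F.mse_loss(cls_vae.dec(z_i), cls_emb)
                                 + F.mse_loss(img_vae.dec(z_c), img_feat))

        # Distribution alignment: pull the two latent Gaussians together.
        std_i, std_c = torch.exp(0.5 * lv_i), torch.exp(0.5 * lv_c)
        align = (mu_i - mu_c).pow(2).sum(1) + (std_i - std_c).pow(2).sum(1)
        return loss + align_w * align.sqrt().mean()

    # Toy usage with made-up dimensions (e.g. ResNet image features, attribute vectors).
    img_vae, cls_vae = VAE(2048), VAE(312)
    img_feat, cls_emb = torch.randn(32, 2048), torch.randn(32, 312)
    loss = aligned_vae_loss(img_vae, cls_vae, img_feat, cls_emb)
    loss.backward()

After training, latent features encoded from seen-class images and latents sampled from unseen-class embeddings would together form the training set for the softmax classifier the abstract mentions.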
Improving Zero-Shot Generalization for CLIP with Synthesized Prompts
With the growing interest in pretrained vision-language models like CLIP,
recent research has focused on adapting these models to downstream tasks.
Despite achieving promising results, most existing methods require labeled data
for all classes, which may not hold in real-world applications due to the long
tail and Zipf's law. For example, some classes may lack labeled data entirely,
such as emerging concepts. To address this problem, we propose a plug-and-play
generative approach called SyntHesIzed Prompts (SHIP) to improve existing
fine-tuning methods.
Specifically, we follow the variational autoencoder framework to introduce a generator that
reconstructs the visual features by inputting the synthesized prompts and the
corresponding class names to the textual encoder of CLIP. In this manner, we
easily obtain the synthesized features for the remaining label-only classes.
Thereafter, we fine-tune CLIP with off-the-shelf methods by combining labeled
and synthesized features. Extensive experiments on base-to-new generalization,
cross-dataset transfer learning, and generalized zero-shot learning demonstrate
the superiority of our approach. The code is available at
https://github.com/mrflogs/SHIP.
Comment: Accepted by ICCV 2023
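As a rough illustration of the generate-then-fine-tune recipe sketched in this abstract, the following PyTorch snippet shows a VAE-style prompt generator whose synthesized prompts, together with class-name embeddings, are passed through a frozen text encoder to reconstruct visual features. The frozen encoder here is a simple stand-in for CLIP's text encoder, and all dimensions, the linear VAE, and the reconstruction loss are assumptions rather than the paper's exact design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PromptGenerator(nn.Module):
        """VAE-style generator: encodes a visual feature into a latent code and
        decodes it into a synthesized prompt vector (sizes are illustrative)."""
        def __init__(self, feat_dim=512, prompt_dim=512, latent_dim=64):
            super().__init__()
            self.enc = nn.Linear(feat_dim, 2 * latent_dim)   # -> (mu, logvar)
            self.dec = nn.Linear(latent_dim, prompt_dim)     # latent -> prompt

        def forward(self, feat):
            mu, logvar = self.enc(feat).chunk(2, dim=-1)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            return self.dec(z), mu, logvar

        def sample(self, n):
            # Draw prompts from the prior for classes with no labeled images.
            return self.dec(torch.randn(n, self.dec.in_features))

    # Stand-in for CLIP's frozen text encoder: it maps a prompt vector plus a
    # class-name embedding into the shared feature space.
    frozen_text_encoder = nn.Sequential(nn.Linear(512 + 512, 512)).requires_grad_(False)

    def generator_loss(gen, img_feat, class_name_emb):
        prompt, mu, logvar = gen(img_feat)
        recon = frozen_text_encoder(torch.cat([prompt, class_name_emb], dim=-1))
        rec = F.mse_loss(recon, img_feat)                    # reconstruct visual features
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + kld

    # Training step on base classes that do have labeled images (toy tensors).
    gen = PromptGenerator()
    img_feat = torch.randn(16, 512)    # CLIP-style image features of labeled samples
    name_emb = torch.randn(16, 512)    # embeddings of the corresponding class names
    generator_loss(gen, img_feat, name_emb).backward()

    # For label-only (new) classes: sample prompts, pair them with class-name
    # embeddings, and obtain synthesized features for fine-tuning alongside real ones.
    new_name_emb = torch.randn(5, 512)
    synth_feat = frozen_text_encoder(torch.cat([gen.sample(5), new_name_emb], dim=-1))

The real and synthesized features could then be mixed to fine-tune CLIP with any off-the-shelf adaptation method, which is the combination step the abstract describes.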