41,895 research outputs found
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
Generative training has been demonstrated to be powerful for building
visual-language models. However, on zero-shot discriminative benchmarks, there
is still a performance gap between models trained with generative and
discriminative objectives. In this paper, we aim to narrow this gap by
improving the efficacy of generative training on classification tasks, without
any finetuning processes or additional modules.
Specifically, we focus on narrowing the gap between the generative captioner
and the CLIP classifier. We begin by analysing the predictions made by the
captioner and classifier and observe that the caption generation inherits the
distribution bias from the language model trained with pure text modality,
making it less grounded on the visual signal. To tackle this problem, we
redesign the scoring objective for the captioner to alleviate the
distributional bias and focus on measuring the gain of information brought by
the visual inputs. We further design a generative training objective to match
the evaluation objective. We name our model trained and evaluated from the
novel procedures as Information Gain (IG) captioner. We pretrain the models on
the public Laion-5B dataset and perform a series of discriminative evaluations.
For the zero-shot classification on ImageNet, IG captioner achieves
improvements over the standard captioner, achieving comparable performances
with the CLIP classifier. IG captioner also demonstrated strong performance on
zero-shot image-text retrieval tasks on MSCOCO and Flickr30K. We hope this
paper inspires further research towards unifying generative and discriminative
training procedures for visual-language models
- …