Vision-language models such as CLIP learn a generic text-image embedding from
large-scale training data. A vision-language model can be adapted to a new
classification task through few-shot prompt tuning. We find that such a prompt
tuning process is highly robust to label noises. This intrigues us to study the
key reasons contributing to the robustness of the prompt tuning paradigm. We
conducted extensive experiments to explore this property and find the key
factors are: 1) the fixed classname tokens provide a strong regularization to
the optimization of the model, reducing gradients induced by the noisy samples;
2) the powerful pre-trained image-text embedding that is learned from diverse
and generic web data provides strong prior knowledge for image classification.
Further, we demonstrate that noisy zero-shot predictions from CLIP can be used
to tune its own prompt, significantly enhancing prediction accuracy in the
unsupervised setting. The code is available at https://github.com/CEWu/PTNL.Comment: Accepted by ICCV202