This paper investigates the challenges of applying vision-language models
(VLMs) to zero-shot visual recognition tasks in an open-world setting, with a
focus on contrastive vision-language models such as CLIP. We first examine the
performance of VLMs on concepts at different levels of granularity. We propose
a protocol for fairly evaluating this performance discrepancy under two
experimental setups and find that VLMs are better at recognizing fine-grained
concepts.
Furthermore, we find that the similarity scores produced by VLMs do not
strictly reflect the correctness of the textual input given a visual input. We
propose an evaluation protocol to test our hypothesis that these scores can be
biased towards more informative descriptions, and that the nature of similarity
computed between embeddings makes it difficult for VLMs to distinguish correct
descriptions from similar but incorrect ones. Our study highlights the
challenges of
using VLMs in open-world settings and suggests directions for future research
to improve their zero-shot capabilities.
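
For context, here is a minimal sketch of how the image-text similarity scores discussed above are typically obtained from a contrastive VLM. It assumes the HuggingFace transformers CLIP API and the public openai/clip-vit-base-patch32 checkpoint; the paper's exact evaluation setup may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint (an assumption; any contrastive VLM works similarly).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A dummy image stands in for a real input; candidate descriptions vary in
# granularity and informativeness, mirroring the settings studied in the paper.
image = Image.new("RGB", (224, 224))
texts = ["a photo of a dog", "a photo of a golden retriever"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities between the image embedding
# and each text embedding; a softmax turns them into relative preferences.
# Note that these scores rank candidates relative to one another and do not by
# themselves certify that any single description is correct.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```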