This paper investigates the challenges of applying vision-language models
(VLMs) to zero-shot visual recognition tasks in an open-world setting, with a
focus on contrastive vision-language models such as CLIP. We first examine the
performance of VLMs on concepts at different levels of granularity. We propose
a protocol for fairly evaluating this performance discrepancy under two
experimental setups and find that VLMs are better at recognizing fine-grained
concepts.
Furthermore, we find that the similarity scores produced by VLMs do not
strictly reflect the correctness of the textual input given a visual input. We
propose an evaluation protocol to test our hypothesis that these scores can be
biased towards more informative descriptions, and that the nature of similarity
computed between embeddings makes it difficult for VLMs to distinguish correct
descriptions from similar but incorrect ones. Our study highlights the
challenges of
using VLMs in open-world settings and suggests directions for future research
to improve their zero-shot capabilities.
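
For context, here is a minimal sketch of how the image-text similarity scores discussed above are typically obtained from a contrastive VLM. It assumes the HuggingFace transformers CLIP API and the public openai/clip-vit-base-patch32 checkpoint; the paper's exact evaluation setup may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint (an assumption; any contrastive VLM works similarly).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A dummy image stands in for a real input; candidate descriptions vary in
# granularity and informativeness, mirroring the settings studied in the paper.
image = Image.new("RGB", (224, 224))
texts = ["a photo of a dog", "a photo of a golden retriever"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities between the image embedding
# and each text embedding; a softmax turns them into relative preferences.
# Note that these scores rank candidates relative to one another and do not by
# themselves certify that any single description is correct.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```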