Contrastive Language-Image Pretraining (CLIP) has demonstrated great
zero-shot performance for image-text matching because of its holistic use of
natural language supervision that covers large-scale, open-world visual
concepts. However, it is still challenging to adapt CLIP to compositional image
and text matching -- a more challenging image and text matching task requiring
the model's understanding of compositional word concepts and visual components.
Towards better compositional generalization in zero-shot image and text
matching, in this paper, we study the problem from a causal perspective: the
erroneous semantics of individual entities are essentially confounders that
cause the matching failure. Therefore, we propose a novel training-free
compositional CLIP model (ComCLIP). ComCLIP disentangles input images into
subject, object, and action sub-images and composes CLIP's vision encoder and
text encoder to perform evolving matching over compositional text embeddings and
sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations
introduced by the pretrained CLIP models and dynamically assess the
contribution of each entity when performing image and text matching.
Experiments on compositional image-text matching (SVO and ComVG) and general
image-text retrieval (Flickr8K) demonstrate the effectiveness of our
plug-and-play method, which boosts the zero-shot inference ability of CLIP even
without further training or fine-tuning of CLIP.
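To make the fused matching step more concrete, below is a minimal, hypothetical sketch using the Hugging Face transformers CLIP interface. It assumes the subject, object, and action sub-images have already been extracted upstream (e.g., by a scene-graph or segmentation model, which is not shown), weights each sub-image embedding by its agreement with the corresponding parsed entity word, folds the weighted sub-images into the global image embedding, and scores the result against the full sentence. The exact disentanglement and weighting used by ComCLIP may differ; the function name and weighting scheme here are illustrative only.

```python
# Hypothetical sketch: fuse CLIP sub-image embeddings with the global image
# embedding, weighting each sub-image by its match to the parsed entity word.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def comclip_style_score(full_image, sub_images, sentence, entity_words):
    """sub_images: subject/object/action crops (assumed to come from an
    upstream disentanglement step, not shown here).
    entity_words: the parsed subject/predicate/object words of `sentence`."""
    with torch.no_grad():
        # Encode the full image and each disentangled sub-image.
        img_inputs = processor(images=[full_image] + sub_images,
                               return_tensors="pt")
        img_emb = model.get_image_features(**img_inputs)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        global_emb, sub_embs = img_emb[0], img_emb[1:]

        # Encode the full sentence and its entity words.
        txt_inputs = processor(text=[sentence] + entity_words,
                               return_tensors="pt", padding=True)
        txt_emb = model.get_text_features(**txt_inputs)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        sent_emb, word_embs = txt_emb[0], txt_emb[1:]

        # Weight each sub-image by how well it matches its entity word,
        # then fold the weighted sub-images back into the global embedding.
        weights = torch.softmax((sub_embs * word_embs).sum(-1), dim=0)
        fused = global_emb + (weights.unsqueeze(-1) * sub_embs).sum(0)
        fused = fused / fused.norm()

        # Final image-text score: cosine similarity with the full sentence.
        return (fused @ sent_emb).item()
```

Because the weighting is computed at inference time from frozen CLIP embeddings, a sketch like this stays training-free: candidate captions (or candidate images) can simply be ranked by the returned score.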