ComCLIP: Training-Free Compositional Image and Text Matching

Abstract

Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for image-text matching because of its holistic use of natural language supervision that covers large-scale, open-world visual concepts. However, it remains challenging to adapt CLIP to compositional image and text matching -- a more difficult image and text matching task that requires the model to understand compositional word concepts and visual components. Towards better compositional generalization in zero-shot image and text matching, in this paper we study the problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause matching failures. We therefore propose a novel training-free compositional CLIP model (ComCLIP). ComCLIP disentangles input images into subject, object, and action sub-images and composes CLIP's vision encoder and text encoder to perform evolving matching over compositional text embeddings and sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations introduced by the pretrained CLIP model and dynamically assess the contribution of each entity when performing image and text matching. Experiments on compositional image-text matching on SVO and ComVG and on general image-text retrieval on Flickr8K demonstrate the effectiveness of our plug-and-play method, which boosts the zero-shot inference ability of CLIP without further training or fine-tuning.
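
The abstract does not specify the exact matching procedure, but the following is a minimal sketch of the general idea -- scoring a sentence against a full image plus pre-extracted subject/object/action crops, weighting each crop by its own agreement with the sentence -- using the public Hugging Face CLIP API. The checkpoint name, the softmax weighting, and the upstream step that produces the sub-images (e.g., a segmentation or dense-captioning model) are all assumptions for illustration, not the authors' exact method.

```python
# Illustrative sketch only: compose a global CLIP image embedding with
# entity sub-image embeddings, each weighted by its similarity to the text.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def composed_match_score(full_image: Image.Image,
                         sub_images: list[Image.Image],
                         sentence: str) -> float:
    """Score how well `sentence` matches `full_image`, letting each sub-image
    (subject / object / action crop) contribute in proportion to its own
    agreement with the sentence embedding."""
    # Encode the sentence once and L2-normalize.
    text_inputs = processor(text=[sentence], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # Encode the full image and all sub-images in one batch.
    images = [full_image] + sub_images
    image_inputs = processor(images=images, return_tensors="pt")
    image_embs = model.get_image_features(**image_inputs)
    image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)

    global_emb, sub_embs = image_embs[:1], image_embs[1:]

    # Weight each sub-image by how strongly it agrees with the sentence,
    # then fold the weighted sub-embeddings into the global image embedding.
    weights = torch.softmax(sub_embs @ text_emb.T, dim=0)        # (n_subs, 1)
    composed = global_emb + (weights * sub_embs).sum(dim=0, keepdim=True)
    composed = composed / composed.norm(dim=-1, keepdim=True)

    # Final image-text score is the cosine similarity of the composed embedding.
    return (composed @ text_emb.T).item()
```

Because this only re-weights frozen CLIP embeddings at inference time, it requires no training or fine-tuning, matching the plug-and-play spirit described in the abstract.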
