Pretrained large-scale vision-language models such as CLIP have demonstrated
excellent generalizability across a wide range of downstream tasks. However, they are
sensitive to variations in the input text prompts and require careful selection of prompt
templates to achieve satisfactory performance. Recently, various methods have
been proposed to dynamically learn prompts as the textual inputs, avoiding
laborious hand-crafted prompt engineering during the fine-tuning
process. We observe that these methods are suboptimal in two respects. First, the
prompts of the vision and language branches in these methods are usually
separate or only uni-directionally correlated. Thus, the prompts of the two branches
are not fully correlated and may not provide sufficient guidance to align the
representations of both branches. Second, most previous
methods achieve better performance on seen classes but suffer
performance degradation on unseen classes compared to CLIP. This is because
the essential generic knowledge learned in the pretraining stage is partly
forgotten in the fine-tuning process. In this paper, we propose Co-Articulated
Multi-Modal Learning (COMMA) to address the above limitations. Specifically, our
method generates the prompts of each branch by taking the prompts of both branches
into account, which enhances the alignment between the representations of the two branches.
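To make this concrete, a minimal sketch of such co-articulated prompt generation is given below; it is only an illustration of the idea rather than the exact COMMA architecture, and the module name, dimensions, and fusion layer are hypothetical.

import torch
import torch.nn as nn

class CoArticulatedPrompts(nn.Module):
    # Illustrative sketch: the vision prompts are generated by fusing the
    # learnable prompts of both branches, so neither branch is tuned in isolation.
    def __init__(self, n_prompts=4, txt_dim=512, vis_dim=768):
        super().__init__()
        # Learnable prompts for the language and vision branches.
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, txt_dim) * 0.02)
        self.vision_prompts = nn.Parameter(torch.randn(n_prompts, vis_dim) * 0.02)
        # Lightweight projection that mixes information from both branches.
        self.fuse = nn.Linear(txt_dim + vis_dim, vis_dim)

    def forward(self):
        # Condition the vision prompts on both sets of prompts before they are
        # prepended to the input tokens of the vision encoder.
        joint = torch.cat([self.text_prompts, self.vision_prompts], dim=-1)
        vision_prompts = self.vision_prompts + self.fuse(joint)
        return self.text_prompts, vision_prompts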
Besides, to alleviate forgetting of this essential knowledge, we minimize the feature discrepancy between the
learned prompts and the embeddings of hand-crafted prompts from the pre-trained
CLIP in the late transformer layers.
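A minimal sketch of such a regularizer is shown below, assuming the features computed with the learned prompts and with hand-crafted prompts (e.g., "a photo of a {class}") are collected from a few late transformer layers; the function name, feature shapes, and the cosine-distance form are assumptions for illustration, not the exact COMMA objective.

import torch.nn.functional as F

def knowledge_preservation_loss(learned_feats, frozen_feats):
    # learned_feats / frozen_feats: lists of features from selected late layers,
    # computed with the learned prompts and with hand-crafted prompts, respectively.
    loss = 0.0
    for f_learned, f_frozen in zip(learned_feats, frozen_feats):
        # The pre-trained branch serves as a fixed anchor, so its gradients are stopped.
        loss = loss + (1.0 - F.cosine_similarity(f_learned, f_frozen.detach(), dim=-1)).mean()
    return loss / max(len(learned_feats), 1)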
We evaluate our method on three representative tasks: generalization to novel classes, transfer to new target datasets,
and robustness to unseen domain shifts. Experimental results demonstrate the superiority of
our method, which yields a favorable performance boost on all tasks with high
efficiency.

Comment: Accepted to AAAI 2024. Code is available at
https://github.com/hulianyuyy/COMM