Deeply Coupled Cross-Modal Prompt Learning
Recent multimodal foundation models (e.g., CLIP) excel at zero-shot
generalization. Prompt tuning, which transfers knowledge from foundation
models to downstream tasks, has gained significant attention recently.
Existing prompt-tuning methods in cross-modal learning, however, either
focus solely on the language branch or learn vision-language interaction
through a shallow mechanism. To address this, we propose a Deeply
coupled Cross-modal Prompt learning (DCP) method based on CLIP. DCP flexibly
accommodates the interplay between vision and language with a Cross-Modal
Prompt Attention (CMPA) mechanism, which lets the two branches exchange
their representations progressively and strongly through a well-connected
multi-head attention module. We then conduct comprehensive few-shot learning
experiments on 11 image classification datasets and also analyze robustness to
domain shift. The experimental analysis demonstrates the strong few-shot
generalization and domain adaptation capacity of DCP. The code can be found at
https://github.com/GingL/CMPA.
Comment: Accepted by ACL 2023 Findings
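
The abstract describes CMPA only at a high level, so the following is a minimal sketch of the general idea of bidirectional multi-head cross-attention between two sets of prompt vectors, not the authors' implementation (see the linked repository for that). The function names `cross_attention` and `couple_prompts` are hypothetical. The sketch also assumes, for simplicity, that both branches share one embedding width, that there are no learned projection matrices, and that the attended result is added back as a residual.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values, num_heads):
    """Multi-head attention of `queries` over `keys_values`.

    Assumption: no learned Q/K/V projections; each head simply works on
    its own slice of the embedding dimension.
    """
    dim = len(queries[0])
    assert dim % num_heads == 0, "embedding width must divide evenly into heads"
    head_dim = dim // num_heads
    attended = []
    for q in queries:
        row = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            # Scaled dot-product scores of this query against every key.
            scores = [
                sum(a * b for a, b in zip(q[lo:hi], kv[lo:hi])) / math.sqrt(head_dim)
                for kv in keys_values
            ]
            weights = softmax(scores)
            # Weighted sum of the value slices for this head.
            for d in range(lo, hi):
                row.append(sum(w * kv[d] for w, kv in zip(weights, keys_values)))
        attended.append(row)
    return attended


def couple_prompts(text_prompts, vision_prompts, num_heads=2):
    """Let each branch's prompts attend to the other's (hypothetical helper).

    Assumption: the attended result is added as a residual update.
    """
    text_update = cross_attention(text_prompts, vision_prompts, num_heads)
    vision_update = cross_attention(vision_prompts, text_prompts, num_heads)
    new_text = [[p + u for p, u in zip(prompt, update)]
                for prompt, update in zip(text_prompts, text_update)]
    new_vision = [[p + u for p, u in zip(prompt, update)]
                  for prompt, update in zip(vision_prompts, vision_update)]
    return new_text, new_vision
```

Calling `couple_prompts` once per transformer layer, feeding each layer's output prompts into the next, is one way to read the "progressive" exchange mentioned above. That per-layer schedule is likewise an assumption of this sketch.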