Although current prompt learning methods have been successfully designed to
reuse large pre-trained models without fine-tuning their large numbers of
parameters, they still have limitations to be addressed: they ignore the
adverse impact of meaningless patches in every image, and they do not
simultaneously consider in-sample generalization and out-of-sample
generalization. In this paper, we propose an adaptive multi-modality prompt
learning method to address these issues. To do so, we employ existing text
prompt learning and propose a new image prompt learning method.
The image prompt learning achieves both in-sample and out-of-sample
generalization by first masking meaningless patches and then padding them
with learnable parameters and information from texts. Moreover, the two kinds
of prompts provide auxiliary information to each other, further strengthening
both kinds of generalization. Experimental results on real datasets
demonstrate that our method outperforms SOTA methods on different downstream
tasks.
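To make the masking-and-padding idea concrete, the following PyTorch sketch shows one plausible way an image prompt could mask low-scoring ("meaningless") patches and fill those positions with learnable parameters plus text-derived information. The module name, the learned patch-scoring head, the keep ratio, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): mask low-scoring patches and pad
# them with learnable parameters combined with projected text features.
import torch
import torch.nn as nn


class ImagePromptSketch(nn.Module):
    def __init__(self, num_patches=196, dim=768, text_dim=512, keep_ratio=0.7):
        super().__init__()
        self.keep_ratio = keep_ratio                      # assumed keep ratio
        # Learnable padding parameters, one vector per patch position.
        self.pad_embed = nn.Parameter(torch.zeros(num_patches, dim))
        # Projects pooled text/prompt features into the image patch space.
        self.text_proj = nn.Linear(text_dim, dim)
        # Simple learned score deciding which patches are "meaningful" (assumption).
        self.score = nn.Linear(dim, 1)

    def forward(self, patch_tokens, text_feat):
        # patch_tokens: (B, N, dim) image patch embeddings
        # text_feat:    (B, text_dim) pooled text features
        B, N, _ = patch_tokens.shape
        scores = self.score(patch_tokens).squeeze(-1)     # (B, N)
        k = int(self.keep_ratio * N)
        keep_idx = scores.topk(k, dim=1).indices          # keep the top-k patches
        keep_mask = torch.zeros(B, N, dtype=torch.bool,
                                device=patch_tokens.device)
        keep_mask.scatter_(1, keep_idx, True)

        # Fill masked (low-score) positions with learnable padding + text info.
        fill = self.pad_embed.unsqueeze(0) + self.text_proj(text_feat).unsqueeze(1)
        return torch.where(keep_mask.unsqueeze(-1), patch_tokens, fill)


# Usage: prompted_patches = ImagePromptSketch()(patch_tokens, text_feat)
```

The key design point this sketch illustrates is that the padded values depend on both learnable parameters and the text branch, which is how the image prompt can receive auxiliary information from the text prompt.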