Recent advances in multimodal learning have resulted in powerful
vision-language models, whose representations are generalizable across a
variety of downstream tasks. Recently, their generalization ability has been
further extended by incorporating trainable prompts, borrowed from the natural
language processing literature. While such prompt learning techniques have
shown impressive results, we observe that these prompts are trained on
global image features, which limits them in two respects. First, by using
global features, these prompts may focus less on the discriminative
foreground of the image, resulting in poor generalization to various
out-of-distribution test cases. Second, existing work weights all prompts
equally, whereas intuitively, prompts should be reweighted according to the
semantics of the image. We address these limitations with our proposed Contextual
Prompt Learning (CoPL) framework, capable of aligning the prompts to the
localized features of the image. Our key innovations over earlier works include
using local image features as part of the prompt learning process, and more
crucially, learning to weight these prompts based on local features that are
appropriate for the task at hand. This gives us dynamic prompts that are both
aligned to local image features and aware of local contextual
relationships. Our extensive experiments on a variety of standard and
few-shot datasets show that our method substantially improves
performance compared to current state-of-the-art methods. We also
demonstrate both few-shot and out-of-distribution performance to establish the
utility of learning dynamic prompts that are aligned to local image features.

Comment: Accepted at AAAI 202
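
As a rough illustration of the idea of weighting prompts by local image features, the sketch below scores each prompt vector against a set of patch-level features and turns the scores into weights. It is not the CoPL implementation: the function names (prompt_weights, combine_prompts), the use of mean cosine similarity as the score, and the softmax normalization are all assumptions made for this example, and the toy vectors stand in for real encoder outputs.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def prompt_weights(local_feats, prompts):
    # Hypothetical scoring rule (an assumption, not the paper's formula):
    # average cosine similarity between a prompt and every local patch feature,
    # followed by a softmax across prompts.
    patches = [l2_normalize(f) for f in local_feats]
    normed_prompts = [l2_normalize(p) for p in prompts]
    scores = []
    for p in normed_prompts:
        sims = [sum(a * b for a, b in zip(f, p)) for f in patches]
        scores.append(sum(sims) / len(sims))
    return softmax(scores)

def combine_prompts(prompts, weights):
    # Weighted sum of the prompt vectors, giving one image-dependent context vector.
    dim = len(prompts[0])
    return [sum(w * p[d] for w, p in zip(weights, prompts)) for d in range(dim)]

# Toy data: two patch features and two prompt vectors in a 3-dimensional space.
local_feats = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

weights = prompt_weights(local_feats, prompts)
context = combine_prompts(prompts, weights)
```

In this toy setup the first prompt points in nearly the same direction as both patches, so it receives the larger weight; with uniform weights, as in the equal-weighting setting the abstract criticizes, both prompts would contribute equally regardless of image content.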