We focus on domain and class generalization problems in analyzing optical
remote sensing images, using the large-scale pre-trained vision-language model
(VLM), CLIP. While contrastively trained VLMs show impressive zero-shot
generalization performance, their effectiveness is limited when the training
and test data span diverse domains. Existing prompt learning
techniques overlook the importance of incorporating domain and content
information into the prompts, which leads to a drop in performance when
handling such multi-domain data. To address these challenges, we propose a
solution that ensures domain-invariant prompt learning while enhancing the
expressiveness of visual features.
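The abstract does not spell out how domain and content cues enter the prompts, so the following is a purely illustrative PyTorch sketch of one CoOp-style way this could be realized; the module names, token layout, and sizes are all assumptions rather than the authors' implementation. Shared learnable context vectors are concatenated with domain and content tokens projected from the image features and with the class-name embedding, yielding an image-conditioned prompt.

```python
import torch
import torch.nn as nn

ctx_len, embed_dim = 4, 512                    # illustrative sizes, not from the paper

class DomainContentPrompt(nn.Module):
    """Shared learnable context plus domain/content tokens projected from image features."""
    def __init__(self):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(ctx_len, embed_dim) * 0.02)  # learnable context
        self.domain_proj = nn.Linear(embed_dim, embed_dim)   # hypothetical domain-token head
        self.content_proj = nn.Linear(embed_dim, embed_dim)  # hypothetical content-token head

    def forward(self, img_feat: torch.Tensor, class_emb: torch.Tensor) -> torch.Tensor:
        """img_feat: (batch, embed_dim); class_emb: (embed_dim,) for one class name."""
        b = img_feat.size(0)
        dom = self.domain_proj(img_feat).unsqueeze(1)        # (b, 1, d) domain token
        con = self.content_proj(img_feat).unsqueeze(1)       # (b, 1, d) content token
        ctx = self.ctx.unsqueeze(0).expand(b, -1, -1)        # (b, ctx_len, d)
        cls = class_emb.view(1, 1, -1).expand(b, 1, -1)      # (b, 1, d) class token
        # prompt layout: [context | domain | content | class]
        return torch.cat([ctx, dom, con, cls], dim=1)        # (b, ctx_len + 3, d)

prompts = DomainContentPrompt()(torch.randn(8, embed_dim), torch.randn(embed_dim))
```

Such prompt token embeddings would then be consumed by the text encoder in place of a fixed template like "a photo of a {class}".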
We observe that CLIP's vision encoder struggles to identify contextual image information, particularly when image
patches are jumbled up. This issue is especially severe in optical remote
sensing images, where land-cover classes exhibit well-defined contextual
appearances. To this end, we introduce C-SAW, a method that complements CLIP
with a self-supervised loss in the visual space and a novel prompt learning
technique that emphasizes both visual domain and content-specific features.
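Since the text above does not give the exact form of the self-supervised loss, here is a minimal sketch of one plausible patch-jumbling objective: patch embeddings are randomly permuted, and a lightweight head must recover each patch's original grid position, which forces the visual features to encode spatial context. The head, grid size, and permutation-classification formulation are assumptions; the paper's actual loss may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_patches, dim = 49, 512                   # assumed 7x7 patch grid and feature width
pos_head = nn.Linear(dim, num_patches)       # hypothetical head: predicts original index

def patch_jumble_loss(patch_tokens: torch.Tensor) -> torch.Tensor:
    """patch_tokens: (batch, num_patches, dim) output of the vision encoder."""
    b, n, _ = patch_tokens.shape
    perm = torch.randperm(n)                  # random shuffle shared across the batch
    shuffled = patch_tokens[:, perm, :]       # jumble the patch order
    logits = pos_head(shuffled)               # (b, n, num_patches)
    target = perm.unsqueeze(0).expand(b, -1)  # original position of each shuffled patch
    return F.cross_entropy(logits.reshape(-1, n), target.reshape(-1))

tokens = torch.randn(4, num_patches, dim)     # stand-in for real encoder output
loss = patch_jumble_loss(tokens)
```

In practice such a self-supervised term would be added to the contrastive objective sketched below.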
We keep the CLIP backbone frozen and introduce a small set of projectors for both the CLIP encoders to train C-SAW contrastively.
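As a rough illustration of this setup (a sketch under assumptions, not the released code: the projector architecture, the feature width of 512, and the temperature of 0.07 are all guesses), small MLP projectors can be trained with a symmetric image-text contrastive loss while the CLIP encoders themselves receive no gradients:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Small trainable head on top of a frozen CLIP encoder output."""
    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)   # unit-norm features

def clip_style_loss(img: torch.Tensor, txt: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of aligned image-text pairs."""
    logits = img @ txt.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

img_proj, txt_proj = Projector(), Projector()
opt = torch.optim.Adam(list(img_proj.parameters())
                       + list(txt_proj.parameters()), lr=1e-4)

with torch.no_grad():                       # frozen CLIP: no encoder gradients
    img_feat = torch.randn(8, 512)          # stand-in for clip.encode_image(...)
    txt_feat = torch.randn(8, 512)          # stand-in for clip.encode_text(...)

loss = clip_style_loss(img_proj(img_feat), txt_proj(txt_feat))
loss.backward()                             # gradients flow only into the projectors
opt.step()
```

Keeping the backbone frozen means only the projector parameters are updated, which keeps the number of trainable parameters small.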
Experimental results demonstrate the superiority of C-SAW across multiple remote sensing benchmarks and different generalization tasks.

Comment: Accepted in ACM ICVGIP 202