Large-scale foundation models (e.g., CLIP) have shown promising zero-shot
generalization performance on downstream tasks by leveraging carefully designed
language prompts. However, despite their success, most prompt learning
techniques tend to underperform in the presence of domain shift. Our study
addresses this problem and, to improve CLIP's generalization ability across
domains, proposes \textsc{StyLIP}, a novel approach for Domain Generalization
(DG) based on a domain-agnostic prompt learning strategy. In the absence of
explicit domain knowledge, we aim to disentangle the visual style and content
information extracted by the pre-trained CLIP within the prompts, so that they
can be effortlessly adapted to novel domains at inference time. Furthermore, we
consider a set of style projectors to learn the prompt tokens directly from
these multi-scale style features, and the generated prompt embeddings are later
fused with the multi-scale visual features learned through a content projector.
The projectors are trained contrastively while keeping CLIP's vision and text
encoders frozen. We present extensive experiments in five different DG settings on
multiple benchmarks, demonstrating that \textsc{StyLIP} consistently
outperforms the relevant state-of-the-art methods.
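
To make the projector design concrete, the following is a minimal illustrative
sketch, not the authors' implementation: it assumes that per-scale feature mean
and standard deviation serve as the style statistics, that one linear style
projector per scale emits a single prompt token, and that a content projector
fuses the pooled multi-scale features. All module names (\texttt{StyleProjector},
\texttt{ContentProjector}), dimensions, and pooling choices are placeholders.

\begin{verbatim}
# Illustrative sketch only (assumptions, not the authors' code): per-scale
# feature mean/std act as "style", one linear projector per scale produces a
# prompt token, and a content projector fuses the pooled multi-scale features.
import torch
import torch.nn as nn

class StyleProjector(nn.Module):
    """Maps style statistics (mean, std) of one visual feature scale to a
    prompt token in the text-encoder embedding space (assumed design)."""
    def __init__(self, feat_dim, embed_dim):
        super().__init__()
        self.proj = nn.Linear(2 * feat_dim, embed_dim)

    def forward(self, feats):
        # feats: (batch, tokens, feat_dim) from one layer of the frozen image encoder
        mu, sigma = feats.mean(dim=1), feats.std(dim=1)
        return self.proj(torch.cat([mu, sigma], dim=-1))

class ContentProjector(nn.Module):
    """Fuses pooled multi-scale visual features into a content embedding."""
    def __init__(self, feat_dims, embed_dim):
        super().__init__()
        self.proj = nn.Linear(sum(feat_dims), embed_dim)

    def forward(self, multi_scale_feats):
        pooled = [f.mean(dim=1) for f in multi_scale_feats]
        return self.proj(torch.cat(pooled, dim=-1))

# Dummy multi-scale features; dimensions are placeholders, not CLIP's real sizes.
feat_dims, embed_dim, batch = [256, 512, 768], 512, 4
feats = [torch.randn(batch, 50, d) for d in feat_dims]

style_projectors = nn.ModuleList(StyleProjector(d, embed_dim) for d in feat_dims)
content_projector = ContentProjector(feat_dims, embed_dim)

prompt_tokens = torch.stack([p(f) for p, f in zip(style_projectors, feats)], dim=1)
content = content_projector(feats).unsqueeze(1)
prompt = torch.cat([prompt_tokens, content], dim=1)  # (batch, num_scales + 1, embed_dim)
# In training, `prompt` would be passed with class-name tokens through CLIP's
# frozen text encoder and the projectors optimised with a contrastive
# image-text loss; only the projectors receive gradients.
print(prompt.shape)
\end{verbatim}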