Latent image representations arising from vision-language models have proved
immensely useful for a variety of downstream tasks. However, their utility is
limited by their entanglement with respect to different visual attributes. For
instance, recent work has shown that CLIP image representations are often
biased toward specific visual properties (such as objects or actions) in an
unpredictable manner. In this paper, we propose to separate representations of
the different visual modalities in CLIP's joint vision-language space by
leveraging the association between parts of speech and specific visual modes of
variation (e.g. nouns relate to objects, adjectives describe appearance). This
is achieved by formulating an appropriate component analysis model that learns
subspaces capturing variability corresponding to a specific part of speech,
while jointly minimising variability with respect to the rest. Such a subspace yields
disentangled representations of the different visual properties of an image or
text in closed form while respecting the underlying geometry of the manifold on
which the representations lie. Moreover, we show that the proposed model
facilitates learning subspaces corresponding to specific visual
appearances (e.g. artists' painting styles), which enables the selective
removal of entire visual themes from CLIP-based text-to-image synthesis. We
validate the model both qualitatively, by visualising the subspace projections
with a text-to-image model and by preventing the imitation of artists' styles,
and quantitatively, through class invariance metrics and improvements to
baseline zero-shot classification.

Comment: Accepted at NeurIPS 202
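As a rough illustration of the kind of component analysis described above, the sketch below learns a subspace that captures variability associated with one part of speech while suppressing variability associated with the rest, via a generalised eigenvalue problem solved in closed form. The objective, the function name `pos_subspace`, and the use of `scipy.linalg.eigh` are assumptions made for illustration only; the paper's actual formulation (including how it respects the spherical geometry of CLIP embeddings) may differ.

```python
# Hedged sketch: one plausible way to learn a part-of-speech-specific subspace
# from CLIP embeddings. Names and objective are illustrative, not the paper's.
import numpy as np
from scipy.linalg import eigh


def pos_subspace(Z_target, Z_rest, k=8, eps=1e-4):
    """Return an orthonormal basis whose directions capture variability in
    Z_target (embeddings varying only in one part of speech, e.g. adjectives)
    while suppressing variability in Z_rest (all other parts of speech).

    Z_target, Z_rest: (n, d) arrays of L2-normalised CLIP embeddings.
    """
    # Scatter (covariance) matrices for each group of embeddings.
    S_t = np.cov(Z_target, rowvar=False)
    S_r = np.cov(Z_rest, rowvar=False)
    # Maximise variance for the target attribute relative to the rest:
    # generalised eigenproblem S_t w = lam (S_r + eps I) w, solved in closed form.
    vals, vecs = eigh(S_t, S_r + eps * np.eye(S_t.shape[0]))
    # eigh returns eigenvalues in ascending order; keep the top-k directions
    # and orthonormalise them so the projection below is well defined.
    W = vecs[:, -k:]
    Q, _ = np.linalg.qr(W)
    return Q


# Usage (illustrative): project an image embedding onto the learned subspace.
# Q = pos_subspace(Z_adjectives, Z_other_pos, k=16)
# z_app = (z_img @ Q) @ Q.T   # component attributed to appearance words
```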