Multi-modal embeddings encode images, sounds, texts, videos, etc. into a
single embedding space, aligning representations across modalities (e.g.,
associating an image of a dog with a barking sound). We show that multi-modal
embeddings can be vulnerable to an attack we call "adversarial illusions."
Given an image or a sound, an adversary can perturb it so as to make its
embedding close to an arbitrary, adversary-chosen input in another modality.
This enables the adversary to align any image and any sound with any text.
Adversarial illusions exploit proximity in the embedding space and are thus
agnostic to downstream tasks. Using ImageBind embeddings, we demonstrate how
adversarially aligned inputs, generated without knowledge of specific
downstream tasks, mislead image generation, text generation, and zero-shot
classification.
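
To make the attack concrete, the following is a minimal sketch of the core idea: perturb an input within a small L-infinity budget so that its embedding maximizes cosine similarity with an adversary-chosen target embedding from another modality. The names here are illustrative assumptions, not the paper's code: `image_encoder` stands in for the image branch of a multi-modal model such as ImageBind, and `target_emb` stands in for the embedding of the adversary-chosen text or audio.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in encoder; in practice this would be a pretrained multi-modal encoder
# (e.g., ImageBind's image branch) producing embeddings in the shared space.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))

def adversarial_illusion(image, target_emb, eps=8 / 255, step=1 / 255, iters=100):
    """PGD-style perturbation of `image` (inside an L-inf ball of radius eps)
    that pulls its embedding toward `target_emb`."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        emb = image_encoder(image + delta)
        # Minimize negative cosine similarity, i.e., maximize alignment
        # between the perturbed input and the adversary-chosen target.
        loss = -F.cosine_similarity(emb, target_emb).mean()
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()                 # signed gradient step
            delta.clamp_(-eps, eps)                           # stay inside the eps-ball
            delta.copy_((image + delta).clamp(0, 1) - image)  # keep pixels valid
        delta.grad.zero_()
    return (image + delta).detach()

# Usage sketch: align a random "image" with an arbitrary target embedding
# (e.g., the embedding of adversary-chosen text).
image = torch.rand(1, 3, 224, 224)
target_emb = torch.randn(1, 512)
adv_image = adversarial_illusion(image, target_emb)
```

Because the objective is defined purely by proximity in the embedding space, the same perturbed input can then be fed to any downstream model that consumes these embeddings, which is what makes the attack task-agnostic.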