Traditional machine learning (ML) models usually rely on large-scale labeled
datasets to achieve strong performance. However, such labeled datasets are
often challenging and expensive to obtain. Moreover, the predefined categories
limit the model's ability to generalize to unseen visual concepts, since
additional labeled data is required for every new concept. In contrast,
recently emerged multimodal models, which combine visual and linguistic
modalities, learn visual concepts from raw text. This is a promising way to
address the above problems: the training dataset can be constructed from
easy-to-collect image-text pairs, and the raw texts cover a nearly unlimited
range of categories through their semantics. However, learning from a
large-scale unlabeled dataset also exposes
the model to the risk of poisoning attacks, whereby the adversary aims to
perturb the training dataset to trigger malicious behaviors in the resulting
model. Previous work mainly focuses on the visual modality. In this paper, we
instead focus on answering two questions: (1) Is the linguistic modality also
vulnerable to poisoning attacks? and (2) Which modality is more vulnerable? To
answer the two questions, we conduct three types of poisoning attacks against
CLIP, the most representative multimodal contrastive learning framework.
Extensive evaluations on different datasets and model architectures show that
all three attacks can perform well on the linguistic modality with only a
relatively low poisoning rate and a limited number of training epochs. We also
observe that the poisoning effect differs across modalities, i.e., the attacks
achieve lower MinRank in the visual modality and higher Hit@K (when K is small)
in the linguistic modality. To mitigate the attacks, we propose both
pre-training and
post-training defenses. We empirically show that both defenses can
significantly reduce the attack performance while preserving the model's
utility.
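The abstract reports attack strength in terms of MinRank and Hit@K. As a rough
illustration only (not the paper's exact evaluation code), the sketch below
assumes a retrieval-style setup: MinRank is taken as the average 1-based rank
of the attacker's target candidate when candidates are sorted by similarity to
each query, and Hit@K as the fraction of queries whose target appears among the
top K candidates. The variable names (`similarities`, `target_idx`) and the
random data are hypothetical.

```python
import numpy as np

def min_rank(similarities: np.ndarray, target_idx: np.ndarray) -> float:
    """Assumed definition: for each query, sort candidates by similarity
    (descending) and record the 1-based rank of the target candidate;
    report the average over queries. Lower suggests a stronger poisoning effect."""
    ranks = []
    for sims, tgt in zip(similarities, target_idx):
        order = np.argsort(-sims)                  # best-scoring candidate first
        ranks.append(int(np.where(order == tgt)[0][0]) + 1)
    return float(np.mean(ranks))

def hit_at_k(similarities: np.ndarray, target_idx: np.ndarray, k: int) -> float:
    """Assumed definition: fraction of queries whose target candidate lands in
    the top-k results. Higher suggests a stronger poisoning effect."""
    hits = 0
    for sims, tgt in zip(similarities, target_idx):
        topk = np.argsort(-sims)[:k]
        hits += int(tgt in topk)
    return hits / len(similarities)

# Hypothetical usage: `similarities` holds query-to-candidate cosine
# similarities from a CLIP-style encoder (here, 100 queries x 1000 candidates);
# `target_idx` gives the attacker's intended candidate for each query.
rng = np.random.default_rng(0)
similarities = rng.normal(size=(100, 1000))
target_idx = rng.integers(0, 1000, size=100)
print(min_rank(similarities, target_idx), hit_at_k(similarities, target_idx, k=10))
```

Under these assumed definitions, lower MinRank and higher Hit@K both indicate a
stronger poisoning effect, which matches how the abstract contrasts the visual
and linguistic modalities.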