Hateful memes have emerged as a significant concern on the Internet. These
memes, which are a combination of image and text, often convey messages vastly
different from their individual meanings. Thus, detecting hateful memes
requires the system to jointly understand the visual and textual modalities.
However, our investigation reveals that the embedding space of existing
CLIP-based systems lacks sensitivity to subtle differences in memes that are
vital for correct hatefulness classification. To address this issue, we propose
constructing a hatefulness-aware embedding space through retrieval-guided
contrastive training. Specifically, we add an auxiliary loss that utilizes hard
negative and pseudo-gold samples to train the embedding space. Our approach
achieves state-of-the-art performance on the HatefulMemes dataset with an AUROC
of 86.7. Notably, our approach outperforms much larger fine-tuned Large
Multimodal Models like Flamingo and LLaVA. Finally, we demonstrate a
retrieval-based hateful memes detection system that classifies a meme using
labelled examples retrieved from a database, including data unseen during
training. This allows developers to update the detection system by simply
adding new data to the database without retraining, a desirable feature for
real services in the constantly evolving landscape of hateful memes on the
Internet.
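The auxiliary loss described above can be sketched as an InfoNCE-style contrastive objective. This is a minimal illustrative sketch, not the paper's exact formulation: it assumes the pseudo-gold sample acts as the positive and the retrieved hard negatives fill the remaining logit slots; the function name and temperature value are ours.

```python
import numpy as np

def retrieval_guided_contrastive_loss(anchor, pseudo_gold, hard_negatives,
                                      temperature=0.07):
    """Illustrative InfoNCE-style auxiliary loss (sketch, not the exact method).

    anchor:         (d,) embedding of the target meme
    pseudo_gold:    (d,) embedding of a retrieved same-label sample (positive)
    hard_negatives: (k, d) embeddings of retrieved opposite-label samples
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = l2norm(anchor), l2norm(pseudo_gold), l2norm(hard_negatives)
    # Cosine similarities, positive at index 0, scaled by temperature
    logits = np.concatenate([[a @ p], n @ a]) / temperature  # shape (k+1,)
    # Numerically stable log-softmax; the loss shrinks as the pseudo-gold
    # outranks the hard negatives in similarity to the anchor
    m = logits.max()
    log_probs = logits - m - np.log(np.sum(np.exp(logits - m)))
    return -log_probs[0]
```

Training with this term pushes a meme's embedding toward same-label neighbours and away from near-duplicate memes of the opposite label, which is exactly the sensitivity the CLIP space was found to lack.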
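The retrieval-based classification step can likewise be sketched as a k-nearest-neighbour vote over a database of labelled meme embeddings. This is a simplified assumption of how such a system operates; the function names and the majority-vote rule are illustrative, not the paper's exact decision rule.

```python
import numpy as np

def knn_classify(query_emb, db_embs, db_labels, k=5):
    """Classify a meme by majority vote over its k nearest database entries.

    query_emb: (d,) L2-normalised embedding of the query meme
    db_embs:   (n, d) L2-normalised embeddings of labelled memes
    db_labels: (n,) integer labels, e.g. 0 = benign, 1 = hateful
    """
    sims = db_embs @ query_emb        # cosine similarity to every entry, (n,)
    top_k = np.argsort(-sims)[:k]     # indices of the k most similar memes
    votes = db_labels[top_k]
    return int(votes.mean() >= 0.5)   # majority vote
```

Because the decision depends only on the database contents at query time, adding rows to `db_embs` and `db_labels` immediately updates the system's behaviour, with no retraining.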