Prior work on ideology prediction has largely focused on single modalities,
i.e., text or images. In this work, we introduce the task of multimodal
ideology prediction, where a model predicts ideological leaning on a binary or
five-point scale, given a text-image pair with political content. We first
collect five new large-scale datasets with English documents and images along
with their ideological leanings, covering news articles from a wide range of US
mainstream media and social media posts from Reddit and Twitter. We conduct
in-depth analyses of news articles and reveal differences in image content and
usage across the political spectrum. Furthermore, we perform extensive
experiments and ablation studies, demonstrating the effectiveness of targeted
pretraining objectives on different model components. Our best-performing
model, a late-fusion architecture pretrained with a triplet objective over
multimodal content, outperforms the state-of-the-art text-only model by almost
4% and a strong multimodal baseline with no pretraining by over 3%.

Comment: EMNLP 202
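
To make the idea of a triplet objective over late-fused multimodal content concrete, the following is a minimal sketch, not the authors' implementation. It assumes that late fusion is a plain concatenation of separately encoded text and image vectors and that the triplet loss uses Euclidean distance with a fixed margin; the abstract specifies neither. The names `late_fusion`, `euclidean`, and `triplet_loss` are hypothetical.

```python
import math


def late_fusion(text_vec, image_vec):
    """Fuse two modality embeddings by concatenation (assumed fusion rule)."""
    return list(text_vec) + list(image_vec)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive closer than the negative.

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy embeddings (made up for illustration): the anchor and positive stand for
# documents with the same leaning, the negative for a different leaning.
anchor = late_fusion([1.0, 0.0], [0.0, 1.0])
positive = late_fusion([1.0, 0.0], [0.0, 1.0])
negative = late_fusion([0.0, 1.0], [1.0, 0.0])

well_separated = triplet_loss(anchor, positive, negative)  # 0.0
violated = triplet_loss(anchor, negative, positive)        # 3.0
```

In this toy case the loss vanishes when the same-leaning pair is closer than the different-leaning pair by at least the margin, and it grows when the roles are swapped, which is the behavior a pretraining objective of this kind would penalize.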