Multimedia content has become ubiquitous on social media platforms, leading
to the rise of multimodal misinformation (MM) and the urgent need for effective
strategies to detect and prevent its spread. In recent years, the challenge of
multimodal misinformation detection (MMD) has garnered significant attention
from researchers; efforts have mainly involved the creation of annotated, weakly
annotated, or synthetically generated training datasets, along with the
development of various deep learning MMD models. However, the problem of
unimodal bias in MMD benchmarks -- where biased or unimodal methods outperform
their multimodal counterparts on an inherently multimodal task -- has been
overlooked. In this study, we systematically investigate and identify the
presence of unimodal bias in widely-used MMD benchmarks (VMU-Twitter, COSMOS),
raising concerns about their suitability for reliable evaluation.
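As a concrete illustration of this diagnostic, the following minimal sketch
(not the paper's released code) compares an image-only, a text-only, and a
multimodal classifier on the same image-text pairs; the feature arrays and
labels are hypothetical stand-ins for precomputed embeddings and veracity
annotations. If a unimodal score matches or exceeds the multimodal one, the
benchmark rewards unimodal shortcuts rather than crossmodal reasoning.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
img = rng.normal(size=(n, 512))  # stand-in image embeddings
txt = rng.normal(size=(n, 512))  # stand-in text embeddings
y = rng.integers(0, 2, size=n)   # stand-in labels: 0=truthful, 1=misinfo

def fit_and_score(features, labels):
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

acc_img = fit_and_score(img, y)                   # image-only baseline
acc_txt = fit_and_score(txt, y)                   # text-only baseline
acc_mm = fit_and_score(np.hstack([img, txt]), y)  # multimodal (concatenation)
print(f"image-only={acc_img:.3f} text-only={acc_txt:.3f} multimodal={acc_mm:.3f}")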
To address this issue, we introduce the "VERification of Image-TExt pairs"
(VERITE) benchmark for MMD, which incorporates real-world data, excludes
"asymmetric multimodal misinformation", and utilizes "modality balancing".
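As a toy sketch of the modality-balancing idea (an assumed simplification,
not the exact VERITE construction): starting from truthful pairs, every
image reappears with a foreign caption and every caption with a foreign
image, so each individual image or caption occurs under both labels and a
single-modality classifier gains nothing from item identity alone.

def modality_balanced_pairs(pairs):
    # pairs: list of (image_id, caption_id) truthful pairs
    balanced = []
    n = len(pairs)
    for i, (img, cap) in enumerate(pairs):
        foreign_img, foreign_cap = pairs[(i + 1) % n]  # a different pair
        balanced.append((img, cap, 0))          # truthful pair
        balanced.append((img, foreign_cap, 1))  # true image, false caption
        balanced.append((foreign_img, cap, 1))  # false image, true caption
    return balanced

print(modality_balanced_pairs([("img_a", "cap_a"), ("img_b", "cap_b")]))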
We conduct an extensive comparative study with a Transformer-based
architecture that demonstrates the ability of VERITE to effectively address
unimodal bias, rendering it a robust evaluation framework for MMD.
Furthermore, we introduce a new method --
termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating
realistic synthetic training data that preserve crossmodal relations between
legitimate images and false human-written captions.
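The following minimal sketch captures one plausible form of such hard
misalignment (an illustrative assumption, not the released CHASMA
implementation): each legitimate image is paired with the most similar
caption originally written for a different image, so the false caption
remains topically related to the image. Embeddings are assumed to be
L2-normalized crossmodal features (e.g., from CLIP).

import numpy as np

def hard_misaligned_captions(img_emb, txt_emb):
    # img_emb, txt_emb: (n, d) normalized embeddings of n truthful pairs;
    # returns, for each image i, the index of its hardest foreign caption.
    sim = img_emb @ txt_emb.T       # cosine similarities
    np.fill_diagonal(sim, -np.inf)  # exclude each image's true caption
    return sim.argmax(axis=1)       # most similar *foreign* caption

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
img_emb = normalize(rng.normal(size=(5, 8)))
txt_emb = normalize(rng.normal(size=(5, 8)))
# (image i, caption j) pairs to be labeled as misinformation for training
print(list(enumerate(hard_misaligned_captions(img_emb, txt_emb))))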
By leveraging CHASMA in the training process, we observe consistent and
notable improvements in predictive performance on VERITE, including a 9.2%
increase in accuracy. We release our code
at: https://github.com/stevejpapad/image-text-verificatio