Multimodal semantic understanding often has to deal with uncertainty: an
obtained message may refer to multiple targets. Such uncertainty, both inter-
and intra-modal, makes interpretation difficult. Little work has studied how
to model this uncertainty,
particularly in pre-training on unlabeled datasets and fine-tuning on
task-specific downstream datasets. In this paper, we project the
representations of all modalities as probabilistic distributions via a
Probability Distribution Encoder (PDE) that uses sequence-level
interactions. Compared with existing deterministic methods, such uncertainty
modeling can convey richer multimodal semantic information and more complex
relationships. Furthermore, we integrate uncertainty modeling with popular
pre-training frameworks and propose suitable pre-training tasks:
Distribution-based Vision-Language Contrastive learning (D-VLC),
Distribution-based Masked Language Modeling (D-MLM), and Distribution-based
Image-Text Matching (D-ITM). The fine-tuned models are applied to challenging
downstream tasks, including image-text retrieval, visual question answering,
visual reasoning, and visual entailment, and achieve state-of-the-art results.

Comment: CVPR 2023 accep