Asking questions about visual environments is a crucial way for intelligent
agents to understand rich multi-faceted scenes, raising the importance of
Visual Question Generation (VQG) systems. Apart from being grounded in the
image, existing VQG systems can use textual constraints, such as expected
answers or knowledge triplets, to generate focused questions. These constraints
allow VQG systems to specify the question content or leverage external
commonsense knowledge that cannot be obtained from the image content alone.
However, generating focused questions using textual constraints while enforcing
high relevance to the image content remains a challenge, as VQG systems often
ignore one or both forms of grounding. In this work, we propose Contrastive
Visual Question Generation (ConVQG), a method using a dual contrastive
objective to discriminate questions generated using both modalities from those
based on a single one. Experiments on both knowledge-aware and standard VQG
benchmarks demonstrate that ConVQG outperforms state-of-the-art methods and
generates image-grounded, text-guided, and knowledge-rich questions. Our human
evaluation results also show a preference for ConVQG questions over
non-contrastive baselines.

Comment: AAAI 2024. Project page at https://limirs.github.io/ConVQ