The ability to ask questions is a powerful tool for gathering information to
learn about the world and resolve ambiguities. In this paper, we
explore the novel problem of generating discriminative questions to help
disambiguate visual instances. Our work can be seen as both a complement and a
new extension to the rich body of research on image captioning and question
answering. We introduce the first large-scale dataset, with over 10,000
carefully annotated image-question tuples, to facilitate benchmarking. In
particular, each tuple consists of a pair of images together with an average of
4.6 discriminative questions (as positive samples) and 5.9 non-discriminative
questions (as negative samples). In addition, we present an effective method for
visual discriminative question generation. The method can be trained in a
weakly supervised manner, without discriminative image-question tuples, using
only existing visual question answering datasets. Promising results are shown
against representative baselines through quantitative evaluations and user
studies.