
    On the Diagnosis and Generalization of Compositional Visual Reasoning

    Computer vision is not only about recognizing visual signals, but also about reasoning over perceived visual elements. This ability, termed visual reasoning, is typically studied through multimodal tasks such as visual question answering and image captioning. Thanks to recent developments in multimodal vision-and-language research, we are closer to achieving visual reasoning than ever. However, further effort is required to build visual reasoning systems that are robust, interpretable, and generalizable. In this dissertation, I present my efforts towards visual reasoning through both model diagnosis and model enhancement. In the first part, I diagnose existing visual question answering models, both end-to-end and compositional, and show the advantage of the latter. In the second part, I dive deeper into compositional models, proposing techniques that improve their performance on real-world images. In the third part, I generalize visual reasoning to a different task, image captioning, introducing a new setting that requires strong reasoning to summarize and compare groups of images. With this dissertation, I showcase the advantages and disadvantages of compositional visual reasoning methods, which should be pursued in conjunction with non-compositional end-to-end models.
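
    As a minimal sketch of what "compositional" means in this context: the question is parsed into a program of small, reusable modules that are executed step by step over the scene, so every intermediate result can be inspected. The module names, scene format, and parsed program below are illustrative assumptions, not the dissertation's actual implementation.

    ```python
    from typing import Callable, Dict, List

    Scene = List[dict]  # each object: {"color": ..., "shape": ...} (assumed format)

    # Hypothetical module inventory; real systems define or learn many more.
    MODULES: Dict[str, Callable] = {
        "filter_color": lambda objs, c: [o for o in objs if o["color"] == c],
        "filter_shape": lambda objs, s: [o for o in objs if o["shape"] == s],
        "count":        lambda objs: len(objs),
    }

    def execute(program: List[tuple], scene: Scene):
        """Run a linear program of (module_name, *args) steps over the scene;
        each intermediate state is a plain value that can be inspected."""
        state = scene
        for name, *args in program:
            state = MODULES[name](state, *args)
        return state

    scene = [{"color": "red", "shape": "cube"},
             {"color": "red", "shape": "sphere"},
             {"color": "blue", "shape": "cube"}]

    # "How many red cubes are there?" as parsed by an assumed question parser:
    program = [("filter_color", "red"), ("filter_shape", "cube"), ("count",)]
    print(execute(program, scene))  # -> 1
    ```

    The interpretability advantage claimed for compositional models falls out of this structure: after each module runs, the intermediate object set can be checked directly, which is not possible inside a monolithic end-to-end network.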

    3D-Aware Visual Question Answering about Parts, Poses and Occlusions

    Despite rapid progress in visual question answering (VQA), existing datasets and models mainly focus on testing reasoning in 2D. However, it is important that VQA models also understand the 3D structure of visual scenes, for example to support tasks like navigation or manipulation. This includes an understanding of objects' 3D poses, their parts, and occlusions. In this work, we introduce the task of 3D-aware VQA, which focuses on challenging questions that require compositional reasoning over the 3D structure of visual scenes. We address 3D-aware VQA from both the dataset and the model perspective. First, we introduce Super-CLEVR-3D, a compositional reasoning dataset that contains questions about object parts, their 3D poses, and occlusions. Second, we propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas: probabilistic neural symbolic program execution for reasoning, and deep neural networks with 3D generative representations of objects for robust visual recognition. Our experimental results show that PO3D-VQA outperforms existing methods significantly, but we still observe a substantial performance gap compared to 2D VQA benchmarks, indicating that 3D-aware VQA remains an important open research area. (Comment: Accepted by NeurIPS 2023)
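
    As a rough illustration of the "probabilistic" part of probabilistic neural symbolic execution, the sketch below carries a per-object probability vector through each program step instead of a hard object set, so uncertain 3D predictions (pose, parts, occlusion) degrade the answer gracefully rather than abruptly. All scores and function names are hypothetical placeholders; this is not the PO3D-VQA implementation.

    ```python
    import numpy as np

    def filter_attr(attention: np.ndarray, attr_scores: np.ndarray) -> np.ndarray:
        """Soft filter: scale the current attention by per-object attribute
        probabilities (e.g., P(object is a car) from a recognition network)."""
        return attention * attr_scores

    def query_pose(attention: np.ndarray, pose_scores: np.ndarray) -> float:
        """Attention-weighted probability that the attended object has the
        queried 3D pose (pose_scores stand in for the output of a 3D-aware
        recognition model)."""
        return float(np.sum(attention * pose_scores) / (np.sum(attention) + 1e-8))

    # Three objects in the scene; all scores below are made-up placeholders.
    attention   = np.ones(3)                   # uniform initial attention
    is_car      = np.array([0.9, 0.1, 0.8])    # P(object is a car)
    facing_left = np.array([0.7, 0.5, 0.2])    # P(object faces left)

    # "Is the car facing left?" as a two-step soft program:
    attention = filter_attr(attention, is_car)
    print(round(query_pose(attention, facing_left), 3))  # ~0.467
    ```

    Because every step stays probabilistic, a noisy pose estimate lowers the answer's confidence rather than flipping a hard set-membership decision, which is the behavior motivating the probabilistic formulation.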