
Visual Reasoning and Image Understanding: A Question Answering Approach

Humans have an amazing visual perception that allows them to comprehend what the eyes see. At the core of human visual perception lies the ability to interpret visual information and link it with linguistic cues from natural language. Visual reasoning and image understanding are the result of superior visual perception, whereby one can comprehend visual and linguistic information and navigate these two domains seamlessly. The premise of Visual Question Answering (VQA) is to challenge an Artificial Intelligence (AI) agent by asking it to predict an answer to a natural language question about an image. In doing so, it evaluates the agent's ability in the three major components of visual reasoning: first, simultaneous extraction of visual features from the image and semantic features from the question; second, joint processing of the multimodal (visual and semantic) features; and third, learning to recognise the regions of the image that are important for answering the question.

In this thesis, we investigate how an AI agent can achieve human-like visual reasoning and image understanding, with superior visual perception and the ability to link linguistic cues with visual information, when tasked with VQA. Based on the observation that humans tend to ask questions about everyday objects and their attributes in the context of the image, we developed a Reciprocal Attention Fusion (RAF) model, the first of its kind, in which the agent learns to simultaneously identify salient image regions of arbitrary shape and size, as well as rectangular object bounding boxes, for answering the question. We demonstrated that, by combining these multilevel visual features and learning image- and object-level attention maps, our model learns to identify the important visual cues for answering the question, achieving state-of-the-art performance on several large-scale VQA datasets.

Further, we hypothesised that, to achieve even better reasoning, a VQA model needs to attend to all objects, not only those deemed important by the question-driven attention mechanism. We developed a Question Agnostic Attention (QAA) model that forces any VQA model to consider all objects in the image alongside its learned attention representation, which in turn results in better generalisation across different high-level reasoning tasks (e.g. counting, relative position), supporting our hypothesis.

Furthermore, humans learn to identify relationships between objects and describe them with semantic labels (e.g. "in front of", "sitting") to gain a holistic understanding of the image. We developed a semantic parser that generates linguistic features from subject-relationship-predicate triplets, and proposed a VQA model that incorporates this relationship parser on top of an existing reasoning mechanism. In this way, we are able to guide the VQA model to convert visual relationships into linguistic features, much as humans do, and use them to generate answers that require far higher-level reasoning than merely identifying objects.

In summary, this thesis endeavours to improve the visual perception of visual-linguistic AI agents by imitating the human reasoning and image understanding process. It investigates how AI agents can incorporate different levels of visual attention, learn to use high-level linguistic cues such as relationship labels, make use of transfer learning to reason about the unknown, and also provide design recommendations for building such systems.
We hope our efforts will help the community build better visual-linguistic AI agents that can comprehend what the camera sees.
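
To make the multilevel-attention idea more concrete, below is a minimal PyTorch sketch of question-guided attention over two streams of visual features (image-level grid features and object-level bounding-box features), fused into a single representation for answer prediction. The module names, dimensions, concatenation-based attention scoring, and classifier head are illustrative assumptions chosen for exposition; they are not the RAF architecture described in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoLevelAttentionFusion(nn.Module):
    """Illustrative sketch: question-guided attention over image-level grid
    features and object-level bounding-box features, fused for answering."""

    def __init__(self, q_dim=1024, v_dim=2048, hidden=512, n_answers=3000):
        super().__init__()
        # One attention scorer per visual stream (illustrative choice).
        self.img_att = nn.Linear(q_dim + v_dim, 1)
        self.obj_att = nn.Linear(q_dim + v_dim, 1)
        # Joint embedding of the question and both attended visual vectors.
        self.fuse = nn.Linear(q_dim + 2 * v_dim, hidden)
        self.classify = nn.Linear(hidden, n_answers)

    def attend(self, scorer, q, feats):
        # q: (batch, q_dim); feats: (batch, n_regions, v_dim)
        q_tiled = q.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = scorer(torch.cat([q_tiled, feats], dim=-1)).squeeze(-1)
        weights = F.softmax(scores, dim=1)                  # attention map
        return (weights.unsqueeze(-1) * feats).sum(dim=1)   # weighted sum

    def forward(self, q, grid_feats, obj_feats):
        v_img = self.attend(self.img_att, q, grid_feats)    # image-level cue
        v_obj = self.attend(self.obj_att, q, obj_feats)     # object-level cue
        joint = torch.relu(self.fuse(torch.cat([q, v_img, v_obj], dim=-1)))
        return self.classify(joint)                         # answer logits
```

The key point the sketch illustrates is that each visual stream gets its own question-conditioned attention map, and the two attended summaries are combined with the question embedding before classification.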
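
In the same spirit, the question-agnostic idea can be sketched as a small augmentation module that pools over all detected objects regardless of the question and combines that summary with the base model's question-driven attended feature. The mean-pooling choice, projections, and dimensions below are assumptions made purely for illustration, not the QAA design itself.

```python
import torch
import torch.nn as nn


class QuestionAgnosticAugment(nn.Module):
    """Illustrative sketch: combine a question-driven attended feature with a
    question-agnostic summary pooled over *all* detected objects, so objects
    ignored by the question-driven attention still contribute to the answer."""

    def __init__(self, v_dim=2048, hidden=512, n_answers=3000):
        super().__init__()
        self.attended_proj = nn.Linear(v_dim, hidden)
        self.agnostic_proj = nn.Linear(v_dim, hidden)
        self.classify = nn.Linear(hidden, n_answers)

    def forward(self, attended_feat, obj_feats):
        # attended_feat: (batch, v_dim), from the base model's attention
        # obj_feats:     (batch, n_objects, v_dim), all detected objects
        agnostic = obj_feats.mean(dim=1)   # question-agnostic object summary
        joint = torch.relu(self.attended_proj(attended_feat)
                           + self.agnostic_proj(agnostic))
        return self.classify(joint)        # answer logits
```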