
    Recent, rapid advancement in visual question answering architecture: a review

    Understanding visual question answering is going to be crucial for numerous human activities, yet it poses major challenges at the heart of the artificial intelligence endeavor. This paper presents an update on the rapid advances in visual question answering over images that have occurred in the last couple of years. A tremendous amount of research on improving visual question answering system architecture has been published recently, underscoring the importance of multimodal architectures. The present article builds on the review by Manmadhan et al. (2020), which outlined several benefits of visual question answering, and covers the subsequent updates in the field. Comment: 11 pages

    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning

    Current visual question answering (VQA) tasks mainly consider answering human-annotated questions about natural images. However, beyond natural images, semantically rich abstract diagrams remain understudied in visual understanding and reasoning research. In this work, we introduce a new challenge of Icon Question Answering (IconQA) with the goal of answering a question in an icon image context. We release IconQA, a large-scale dataset that consists of 107,439 questions and three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. The IconQA dataset is inspired by real-world diagram word problems that highlight the importance of abstract diagram understanding and comprehensive cognitive reasoning. Thus, IconQA requires not only perception skills like object recognition and text understanding, but also diverse cognitive reasoning skills, such as geometric reasoning, commonsense reasoning, and arithmetic reasoning. To help potential IconQA models learn semantic representations for icon images, we further release an icon dataset, Icon645, which contains 645,687 colored icons across 377 classes. We conduct extensive user studies and blind experiments and reproduce a wide range of advanced VQA methods to benchmark the IconQA task. We also develop a strong IconQA baseline, Patch-TRM, which applies a pyramid cross-modal Transformer with input diagram embeddings pre-trained on the icon dataset. IconQA and Icon645 are available at https://iconqa.github.io. Comment: Corrected typos. Accepted to NeurIPS 2021, 27 pages, 18 figures. Data and code are available at https://iconqa.github.io
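    The Patch-TRM baseline is described only at a high level above, so the following minimal PyTorch sketch is included to illustrate the general idea of a cross-modal Transformer that jointly attends over diagram patch embeddings and question tokens. The class name, dimensions, and layer choices are illustrative assumptions and are not taken from the IconQA code.

    import torch
    import torch.nn as nn

    class CrossModalVQASketch(nn.Module):
        """Illustrative cross-modal Transformer for diagram question answering.

        Diagram patch features (e.g. from an encoder pre-trained on icons) and
        question tokens are projected into a shared space, concatenated, and
        processed by a Transformer encoder; a classifier over the pooled output
        scores candidate answers. Names and sizes are assumptions for illustration.
        """

        def __init__(self, patch_dim=256, vocab=10000, d_model=256,
                     n_heads=4, n_layers=2, n_answers=500):
            super().__init__()
            self.patch_proj = nn.Linear(patch_dim, d_model)      # patch features -> shared space
            self.token_embed = nn.Embedding(vocab, d_model)      # question tokens -> shared space
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.classifier = nn.Linear(d_model, n_answers)

        def forward(self, patch_feats, question_ids):
            # patch_feats: (B, num_patches, patch_dim); question_ids: (B, seq_len)
            vis = self.patch_proj(patch_feats)
            txt = self.token_embed(question_ids)
            fused = self.encoder(torch.cat([vis, txt], dim=1))   # joint self-attention over both modalities
            return self.classifier(fused.mean(dim=1))            # mean-pool, then answer scores

    # Toy usage with random tensors: 2 diagrams, 36 patches each, 12-token questions.
    model = CrossModalVQASketch()
    print(model(torch.randn(2, 36, 256), torch.randint(0, 10000, (2, 12))).shape)  # torch.Size([2, 500])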

    Visual Reasoning and Image Understanding: A Question Answering Approach

    Humans have remarkable visual perception, which allows them to comprehend what the eyes see. At the core of human visual perception lies the ability to translate visual information and link it with linguistic cues from natural language. Visual reasoning and image understanding result from superior visual perception, where one can comprehend visual and linguistic information and navigate these two domains seamlessly. The premise of Visual Question Answering (VQA) is to challenge an Artificial Intelligence (AI) agent by asking it to predict an answer to a natural language question about an image. Doing so evaluates its ability in the three major components of visual reasoning: first, simultaneous extraction of visual features from the image and semantic features from the question; second, joint processing of the multimodal (visual and semantic) features; and third, learning to recognize regions in the image that are important for answering the question. In this thesis, we investigate how an AI agent can achieve human-like visual reasoning and image understanding with superior visual perception, and link linguistic cues with visual information when tasked with Visual Question Answering (VQA). Based on the observation that humans tend to ask questions about everyday objects and their attributes in the context of the image, we developed a Reciprocal Attention Fusion (RAF) model, the first of its kind, in which the AI agent learns to simultaneously identify salient image regions of arbitrary shape and size, together with rectangular object bounding boxes, for answering the question. We demonstrated that by combining these multilevel visual features and learning to identify image- and object-level attention maps, our model learns to identify the visual cues important for answering the question, achieving state-of-the-art performance on several large-scale VQA datasets. Further, we hypothesized that for even better reasoning, a VQA model needs to attend to all objects alongside the objects deemed important by the question-driven attention mechanism. We developed a Question Agnostic Attention (QAA) model that forces any VQA model to consider all objects in the image along with their learned attention representations, which in turn yields better generalisation across different high-level reasoning tasks (e.g. counting, relative position), supporting our hypothesis. Furthermore, humans learn to identify relationships between objects and describe them with semantic labels (e.g. in front of, sitting on) to gain a holistic understanding of the image. We developed a semantic parser that generates linguistic features from subject-relationship-predicate triplets, and proposed a VQA model that incorporates this relationship parser on top of an existing reasoning mechanism. In this way we can guide the VQA model to convert visual relationships into linguistic features, much like humans, and use them to generate an answer that requires deeper reasoning than merely identifying objects. In summary, this thesis endeavours to improve the visual perception of visual-linguistic AI agents by imitating the human reasoning and image understanding process. It investigates how AI agents can incorporate different levels of visual attention, learn to use high-level linguistic cues such as relationship labels, and make use of transfer learning to reason about the unknown, and it provides design recommendations for building such systems. We hope our effort can help the community build better visual-linguistic AI agents that can comprehend what the camera sees.
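    As a rough illustration of the two-stream, question-guided attention idea behind RAF (image-level regions combined with object bounding-box features), the PyTorch sketch below attends over two sets of visual features with a shared question vector and fuses the attended results for answer classification. The module name, dimensions, and fusion choices are assumptions made for illustration and are not the thesis implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoStreamQuestionAttention(nn.Module):
        """Illustrative question-guided attention over two visual streams:
        grid (image-level) features and object (bounding-box) features.
        Names and sizes are assumptions for illustration."""

        def __init__(self, q_dim=512, v_dim=512, hidden=512, n_answers=1000):
            super().__init__()
            self.grid_att = nn.Linear(q_dim + v_dim, 1)   # attention logits over grid regions
            self.obj_att = nn.Linear(q_dim + v_dim, 1)    # attention logits over object boxes
            self.fuse = nn.Sequential(
                nn.Linear(q_dim + 2 * v_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_answers))

        def _attend(self, q, feats, scorer):
            # q: (B, q_dim); feats: (B, N, v_dim) -> attention-weighted sum over N
            q_tiled = q.unsqueeze(1).expand(-1, feats.size(1), -1)
            weights = F.softmax(scorer(torch.cat([q_tiled, feats], dim=-1)), dim=1)
            return (weights * feats).sum(dim=1)

        def forward(self, q, grid_feats, obj_feats):
            v_grid = self._attend(q, grid_feats, self.grid_att)
            v_obj = self._attend(q, obj_feats, self.obj_att)
            return self.fuse(torch.cat([q, v_grid, v_obj], dim=-1))

    # Toy usage: 2 questions, 49 grid cells, 36 object boxes, all 512-dimensional.
    model = TwoStreamQuestionAttention()
    print(model(torch.randn(2, 512), torch.randn(2, 49, 512), torch.randn(2, 36, 512)).shape)  # torch.Size([2, 1000])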

    A Comprehensive Review and Open Challenges on Visual Question Answering Models

    Thanks to recent developments in artificial intelligence, users can now actively interact with images and pose different questions about them, expecting a response in the form of a natural language answer. The study discusses a variety of datasets that can be used to examine applications of visual question answering (VQA), along with their advantages and disadvantages. Four forms of VQA models are examined in depth in this article: simple joint embedding-based models, attention-based models, knowledge-incorporated models, and domain-specific VQA models. We also critically assess the drawbacks and future possibilities of current state-of-the-art (SoTA), end-to-end VQA models. Finally, we present directions and guidelines for the further development of VQA models.
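    For readers unfamiliar with the "simple joint embedding-based" family surveyed here, the following minimal PyTorch sketch shows one common formulation: a GRU question encoding and a pre-trained CNN image feature are projected into a shared space, fused by element-wise product, and scored against an answer vocabulary. The names and dimensions are illustrative assumptions rather than any specific model from the review.

    import torch
    import torch.nn as nn

    class JointEmbeddingVQA(nn.Module):
        """Illustrative joint-embedding VQA baseline; names and sizes are assumptions."""

        def __init__(self, img_dim=2048, vocab=10000, emb=300, hidden=1024, n_answers=1000):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.gru = nn.GRU(emb, hidden, batch_first=True)
            self.img_proj = nn.Linear(img_dim, hidden)
            self.classifier = nn.Sequential(
                nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_answers))

        def forward(self, img_feat, question_ids):
            # img_feat: (B, img_dim), e.g. pooled CNN features; question_ids: (B, seq_len)
            _, h = self.gru(self.embed(question_ids))   # final hidden state encodes the question
            q = h.squeeze(0)                            # (B, hidden)
            v = torch.tanh(self.img_proj(img_feat))     # (B, hidden)
            return self.classifier(q * v)               # element-wise fusion, then answer scores

    # Toy usage: 2 images with 2048-d features and 14-token questions.
    model = JointEmbeddingVQA()
    print(model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 14))).shape)  # torch.Size([2, 1000])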