
    A Simple Baseline for Knowledge-Based Visual Question Answering

    Full text link
    This paper addresses the problem of Knowledge-Based Visual Question Answering (KB-VQA). Recent works have emphasized the importance of incorporating both explicit knowledge (through external databases) and implicit knowledge (through LLMs) to effectively answer questions that require external knowledge. A common limitation of such approaches is that they consist of relatively complicated pipelines and often rely heavily on access to the GPT-3 API. Our main contribution in this paper is a much simpler and readily reproducible pipeline which, in a nutshell, is based on efficient in-context learning: prompting LLaMA (1 and 2) with question-informative captions as contextual information. Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and yet achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to understand important aspects of our method. Our code is publicly available at https://github.com/alexandrosXe/ASimple-Baseline-For-Knowledge-Based-VQA
    Comment: Accepted at EMNLP 2023 (camera-ready version)
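The caption-based in-context learning this abstract describes can be pictured as assembling a few-shot prompt: each in-context example pairs a question-informative caption with its question and gold answer, and the test item is left open for the LLM to complete. The template, the example shots, and the `build_prompt` helper below are illustrative assumptions, not the authors' exact prompt format:

```python
# Hedged sketch of caption-based few-shot prompting for KB-VQA.
# The "Context/Question/Answer" template is an assumption for illustration.

def build_prompt(shots, test_caption, test_question):
    """Assemble an in-context prompt: each shot pairs a question-informative
    caption with its question and answer; the test item is left open."""
    blocks = []
    for caption, question, answer in shots:
        blocks.append(f"Context: {caption}\nQuestion: {question}\nAnswer: {answer}")
    # Test item: same template, but the answer is left for the LLM to fill in.
    blocks.append(f"Context: {test_caption}\nQuestion: {test_question}\nAnswer:")
    return "\n\n".join(blocks)

shots = [
    ("A man surfing a large wave.", "What sport is shown?", "surfing"),
    ("A red double-decker bus on a city street.", "What kind of vehicle is this?", "bus"),
]
prompt = build_prompt(shots, "A plate of sushi with chopsticks.",
                      "What cuisine is this?")
```

In the actual pipeline, this string would be the input to LLaMA, whose completion after the final "Answer:" is taken as the predicted answer.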

    A Convolutional Neural Network Based Approach For Visual Question Answering

    Get PDF
    Computer Vision is a scientific discipline concerned with developing an algorithmic basis for intelligent systems that analyze, understand, and extract useful information from visual data. This visual data can be plain images, video sequences, views from multiple cameras, etc. Natural Language Processing (NLP) is the ability of machines to read and understand human languages. Visual Question Answering (VQA) is a multi-discipline Artificial Intelligence (AI) research problem that combines Natural Language Processing (NLP), Computer Vision (CV), and Knowledge Reasoning (KR). Given an image and a question about the image in natural language, the algorithm has to output an accurate natural language answer. Since the questions are open-ended, the system requires a very detailed understanding of the image and its context, along with a broad set of AI capabilities – object detection, activity recognition, and knowledge-based reasoning. Since the release of the VQA dataset in 2014, numerous datasets and algorithms for VQA have been put forward. In this work, we propose a new baseline for the problem of visual question answering. Our model uses a deep residual network (ResNet) to compute the image features and ByteNet to compute question embeddings. A soft attention mechanism is used to focus on the most relevant image features, and a classifier is used to generate probabilities over an answer set. We implemented the solution in TensorFlow, an open-source deep-learning platform developed by Google. Prior to using ResNet and ByteNet, we tried VGG16 for extracting image features and long short-term memory units (LSTM) for extracting question features; we observed that ResNet and ByteNet yielded improved accuracy compared to VGG16 and LSTM. We evaluate our model on three major image question answering datasets: DAQUAR-ALL, COCO-QA, and the VQA dataset. Our model, despite having a relatively simple architecture, achieves 64.6% accuracy on the VQA 1.0 dataset and 59.7% accuracy on the VQA 2.0 dataset.
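The soft-attention step of the pipeline described above can be sketched as follows: region-level image features are scored against the question embedding, a softmax turns the scores into attention weights, and the attention-pooled image feature is fed to a linear classifier over the answer set. The dimensions, the dot-product scoring, and the function names below are illustrative assumptions, not the paper's exact ResNet/ByteNet architecture:

```python
import numpy as np

# Hedged sketch: soft attention pools region features using a question
# embedding, then a linear layer yields probabilities over an answer set.

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def attend_and_classify(regions, question, W):
    """regions: (R, D) image-region features; question: (D,) embedding;
    W: (D, A) classifier weights over A candidate answers."""
    scores = regions @ question   # relevance of each region to the question
    alpha = softmax(scores)       # soft attention weights, sum to 1
    pooled = alpha @ regions      # attention-weighted image feature, shape (D,)
    logits = pooled @ W
    return softmax(logits)        # probability distribution over the answer set

rng = np.random.default_rng(0)
probs = attend_and_classify(rng.normal(size=(5, 8)),   # 5 regions, dim 8
                            rng.normal(size=8),        # question embedding
                            rng.normal(size=(8, 10)))  # 10 candidate answers
```

In the real model these features would come from ResNet and ByteNet, and the attention and classifier weights would be learned end to end rather than drawn at random.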

    VIDEO-MEDIATED QUESTION-AND-ANSWER METHOD TO IMPROVE THE SPEAKING ACTIVITY OF CHILDREN WITH AUTISM

    Get PDF
    Abstract: Children with autism have disorders in developing speech skills. They tend to show limited speaking ability, a monotone tone of voice, and a tendency to parrot, often repeating newly heard words without intending to communicate, and they struggle to begin a conversation. The research question was whether speaking activity could be increased through the application of a video-mediated question-and-answer method for a child with autism at SDN Tandes Kidul 1 Surabaya. The purpose of this research was to analyze the influence of applying the video-mediated question-and-answer method on the speaking activity of a child with autism at SDN Tandes Kidul 1 Surabaya. The design used was Single Subject Research (SSR), and the subject was a child with autism at SDN Tandes Kidul 1 Surabaya. Data were collected through observation and analyzed using visual analysis within conditions and visual analysis between conditions. The results indicated that in the baseline phase, the child's speech frequency in stating simple sentences and answering questions during question-and-answer activities was about 12-18. After the video-mediated question-and-answer method was applied, by asking questions after playing videos about daily activities, the child's speech frequency increased to 24-30. The visual analysis within conditions indicated a change for the better, and the visual analysis between conditions indicated a positive influence of the intervention on the target behavior. It was therefore concluded that the video-mediated question-and-answer method positively influenced the speaking activity of children with autism.
    Keywords: video-mediated question-and-answer method, speaking activity

    Solving Visual Madlibs with Multiple Cues

    Get PDF
    This paper focuses on answering fill-in-the-blank style multiple choice questions from the Visual Madlibs dataset. Previous approaches to Visual Question Answering (VQA) have mainly used generic image features from networks trained on the ImageNet dataset, despite the wide scope of questions. In contrast, our approach employs features derived from networks trained for the specialized tasks of scene classification, person activity prediction, and person and object attribute prediction. We also present a method for selecting sub-regions of an image that are relevant for evaluating the appropriateness of a putative answer. Visual features are computed both from the whole image and from local regions, while sentences are mapped to a common space using a simple normalized canonical correlation analysis (CCA) model. Our results show a significant improvement over the previous state of the art, and indicate that answering different question types benefits from examining a variety of image cues and carefully choosing informative image sub-regions.
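The CCA step described above — mapping visual features and sentence embeddings into a common space, then scoring candidate answers there — can be sketched with a minimal linear CCA fit. The whiten-then-SVD construction and every name below are a standard textbook formulation assumed for illustration, not the paper's exact normalized-CCA implementation:

```python
import numpy as np

# Hedged sketch: fit linear projections that map two views (visual features X,
# sentence embeddings Y) into a shared space, then rank candidate answers by
# cosine similarity to the image in that space.

def fit_cca(X, Y, k, reg=1e-3):
    """Return projection matrices (Px, Py) onto the top-k canonical directions."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Cx = X.T @ X / len(X) + reg * np.eye(X.shape[1])  # regularized covariances
    Cy = Y.T @ Y / len(Y) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)                            # cross-covariance
    # Whitening transforms: for Cx = L L^T, X @ inv(L).T has identity covariance.
    Wx = np.linalg.inv(np.linalg.cholesky(Cx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Cy)).T
    # Top-k singular directions of the whitened cross-covariance.
    U, _, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    return Wx @ U[:, :k], Wy @ Vt.T[:, :k]

def score_answers(image_feat, answer_feats, Px, Py):
    """Cosine similarity of each candidate answer to the image in CCA space."""
    a = image_feat @ Px
    B = answer_feats @ Py
    a = a / np.linalg.norm(a)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return B @ a

# Synthetic two-view data sharing a latent factor Z, for illustration only.
rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 4))
X = Z @ rng.normal(size=(4, 6)) + 0.1 * rng.normal(size=(200, 6))  # "visual"
Y = Z @ rng.normal(size=(4, 5)) + 0.1 * rng.normal(size=(200, 5))  # "textual"
Px, Py = fit_cca(X, Y, k=3)
sims = score_answers(X[0], Y[:4], Px, Py)  # score 4 candidate answers
```

In the paper's setting, a multiple-choice question would be answered by picking the candidate with the highest similarity; the whole-image and sub-region features would each contribute their own such score.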