A Simple Baseline for Knowledge-Based Visual Question Answering
This paper is on the problem of Knowledge-Based Visual Question Answering
(KB-VQA). Recent works have emphasized the significance of incorporating both
explicit (through external databases) and implicit (through LLMs) knowledge to
answer questions requiring external knowledge effectively. A common limitation
of such approaches is that they consist of relatively complicated pipelines and
often rely heavily on access to the GPT-3 API. Our main contribution in this paper
is to propose a much simpler and readily reproducible pipeline which, in a
nutshell, is based on efficient in-context learning by prompting LLaMA (1 and
2) using question-informative captions as contextual information. Contrary to
recent approaches, our method is training-free, does not require access to
external databases or APIs, and yet achieves state-of-the-art accuracy on the
OK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to
understand important aspects of our method. Our code is publicly available at
https://github.com/alexandrosXe/ASimple-Baseline-For-Knowledge-Based-VQA
Comment: Accepted at EMNLP 2023 (camera-ready version)
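The prompting strategy this abstract describes, few-shot examples followed by question-informative captions for the test image, amounts to a simple prompt-assembly step. The function name, field names, and prompt format below are illustrative assumptions rather than the paper's actual implementation; the assembled string would then be fed to a LLaMA model for in-context answer generation.

```python
def build_kb_vqa_prompt(captions, examples, question):
    """Assemble a few-shot in-context learning prompt: solved QA examples
    first, then question-informative captions for the test image, then the
    test question with the answer left for the LLM to complete."""
    blocks = []
    for ex in examples:  # each example: dict with caption, question, answer
        blocks.append(
            f"Context: {ex['caption']}\n"
            f"Question: {ex['question']}\n"
            f"Answer: {ex['answer']}"
        )
    context = " ".join(captions)  # captions selected as informative for the question
    blocks.append(f"Context: {context}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)
```

The returned string ends with an open `Answer:` slot, so greedy decoding from the language model yields the predicted answer directly.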
A Convolutional Neural Network Based Approach For Visual Question Answering
Computer Vision is a scientific discipline which involves the development of an algorithmic basis for the construction of intelligent systems that aim at the analysis, understanding and extraction of useful information from visual data. This visual data can be plain images, video sequences, views from multiple cameras, etc. Natural Language Processing (NLP) is the ability of machines to read and understand human languages. Visual Question Answering (VQA) is a multi-discipline Artificial Intelligence (AI) research problem which combines Natural Language Processing (NLP), Computer Vision (CV), and Knowledge Reasoning (KR). Given an image and a question related to the image in natural language, the algorithm has to output an accurate natural language answer. Since the questions are open-ended, the system requires a very detailed understanding of the image and its context, and a broad set of AI capabilities – object detection, activity recognition and knowledge-based reasoning. Since the release of the VQA dataset in 2014, numerous datasets and algorithms for VQA have been put forward. In this work, we propose a new baseline for the problem of visual question answering. Our model uses a deep residual network (ResNet) to compute the image features and ByteNet to compute question embeddings. A soft attention mechanism is used to focus on the most relevant image features, and a classifier is used to generate probabilities over an answer set. We implemented the solution in TensorFlow, an open-source deep-learning platform developed by Google. Prior to using the deep residual network (ResNet) and ByteNet, we tried using VGG16 for extracting image features and long short-term memory units (LSTM) for extracting question features. We observed that using ResNet and ByteNet resulted in improved accuracy compared to using VGG16 and LSTM. We evaluate our model on three major image question answering datasets: DAQUAR-ALL, COCO-QA and the VQA dataset.
Our model, despite having a relatively simple architecture, achieves 64.6% accuracy on the VQA 1.0 dataset and 59.7% accuracy on the VQA 2.0 dataset.
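The soft-attention-plus-classifier stage described in this abstract can be sketched numerically. The following is a minimal NumPy illustration under assumed shapes; the bilinear attention form and the weight matrices are assumptions for the sake of the example, not the authors' exact architecture, and the ResNet/ByteNet feature extractors are stubbed out as given arrays.

```python
import numpy as np

def soft_attention_answer(region_feats, q_emb, W_att, W_cls):
    """Soft attention over regional image features, conditioned on the
    question embedding, followed by a softmax classifier over answers.
    region_feats: (K, d) regional CNN features; q_emb: (d,) question
    embedding; W_att: (d, d) bilinear attention weights (assumed form);
    W_cls: (n_answers, 2d) classifier weights."""
    scores = region_feats @ W_att @ q_emb            # relevance of each region, (K,)
    scores -= scores.max()                           # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # attention weights, sum to 1
    attended = alpha @ region_feats                  # question-weighted image feature, (d,)
    logits = W_cls @ np.concatenate([attended, q_emb])
    logits -= logits.max()
    return np.exp(logits) / np.exp(logits).sum()     # probabilities over the answer set
```

The predicted answer is the argmax over the returned probability vector.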
A Video-Mediated Question-and-Answer Method to Increase the Speaking Activity of Children with Autism
Abstract: Children with autism have disorders in developing speech skills. They tend to show limited speech, a monotone voice, and a tendency to parrot, often repeating newly heard words without intending to communicate, and they cannot begin a conversation. The research question was whether speaking activity could be increased through the application of a video-mediated question-and-answer method for children with autism at SDN Tandes Kidul 1 Surabaya. The purpose of this research was to analyze the influence of the video-mediated question-and-answer method on increasing the speaking activity of children with autism at SDN Tandes Kidul 1 Surabaya. The design used was Single Subject Research (SSR), and the subject was a child with autism at SDN Tandes Kidul 1 Surabaya. Data were collected through observation and analyzed with visual analysis within conditions and between conditions. The results indicated that in the baseline phase, the child's speech frequency in stating simple sentences and answering questions during the question-and-answer activity was about 12-18. After the video-mediated question-and-answer method was applied, by asking questions after playing videos about daily activities, the child's speech frequency increased to 24-30. Visual analysis within conditions indicated a positive change, and visual analysis between conditions indicated a positive influence of the intervention on the target behavior. It was therefore concluded that the video-mediated question-and-answer method positively influenced the speaking activity of children with autism. Keywords: video-mediated question-and-answer method, speaking activity
Solving Visual Madlibs with Multiple Cues
This paper focuses on answering fill-in-the-blank style multiple choice
questions from the Visual Madlibs dataset. Previous approaches to Visual
Question Answering (VQA) have mainly used generic image features from networks
trained on the ImageNet dataset, despite the wide scope of questions. In
contrast, our approach employs features derived from networks trained for
specialized tasks of scene classification, person activity prediction, and
person and object attribute prediction. We also present a method for selecting
sub-regions of an image that are relevant for evaluating the appropriateness of
a putative answer. Visual features are computed both from the whole image and
from local regions, while sentences are mapped to a common space using a simple
normalized canonical correlation analysis (CCA) model. Our results show a
significant improvement over the previous state of the art, and indicate that
answering different question types benefits from examining a variety of image
cues and carefully choosing informative image sub-regions.
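The CCA-based matching step described in this abstract can be sketched as follows. Assuming projection matrices already fit by (normalized) CCA on paired image and sentence features, candidate answers are scored by cosine similarity in the shared space; all names and shapes here are illustrative assumptions, not the authors' code.

```python
import numpy as np

def score_candidate_answers(img_feat, answer_embs, W_img, W_txt):
    """Project an image feature and candidate answer embeddings into a
    shared CCA space and score each answer by cosine similarity there.
    W_img (d_img x k) and W_txt (d_txt x k) are assumed to come from a
    CCA fit on paired image/sentence training data."""
    u = W_img.T @ img_feat                            # image in shared space, (k,)
    u = u / np.linalg.norm(u)
    V = answer_embs @ W_txt                           # answers in shared space, (n, k)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return V @ u                                      # cosine similarity per candidate
```

The multiple-choice answer is then the argmax over the returned scores; with both whole-image and region features, the per-cue scores could simply be summed before taking the argmax.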