52,935 research outputs found

    A Convolutional Neural Network Based Approach For Visual Question Answering

    Get PDF
    Computer Vision is a scientific discipline that involves the development of an algorithmic basis for building intelligent systems aimed at the analysis, understanding and extraction of useful information from visual data. This visual data can be plain images, video sequences, views from multiple cameras, etc. Natural Language Processing (NLP) is the ability of machines to read and understand human languages. Visual Question Answering (VQA) is a multi-discipline Artificial Intelligence (AI) research problem that combines Natural Language Processing (NLP), Computer Vision (CV), and Knowledge Reasoning (KR). Given an image and a question about the image in natural language, the algorithm has to output an accurate natural language answer. Since the questions are open-ended, the system requires a very detailed understanding of the image and its context, as well as a broad set of AI capabilities: object detection, activity recognition and knowledge-based reasoning. Since the release of the VQA dataset in 2014, numerous datasets and algorithms for VQA have been put forward. In this work, we propose a new baseline for the problem of visual question answering. Our model uses a deep residual network (ResNet) to compute the image features and ByteNet to compute question embeddings. A soft attention mechanism is used to focus on the most relevant image features, and a classifier is used to generate probabilities over an answer set. We implemented the solution in TensorFlow, an open-source deep-learning platform developed by Google. Prior to using ResNet and ByteNet, we tried using VGG16 for extracting image features and long short-term memory units (LSTM) for extracting question features. We observed that using ResNet and ByteNet resulted in improved accuracy compared to using VGG16 and LSTM. We evaluate our model on three major image question answering datasets: DAQUAR-ALL, COCO-QA and the VQA dataset. Our model, despite having a relatively simple architecture, achieves 64.6% accuracy on the VQA 1.0 dataset and 59.7% accuracy on the VQA 2.0 dataset.
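
    As a rough illustration of the pipeline described in this abstract (regional image features, a question embedding, soft attention, and a classifier over a fixed answer set), here is a minimal NumPy sketch. The feature dimensions, the bilinear attention scoring, the element-wise fusion, and the single-layer classifier are assumptions for clarity, not the authors' implementation.

```python
# Minimal sketch of a soft-attention VQA baseline (illustrative, not the paper's code).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_attention_vqa(image_feats, question_emb, W_att, W_cls):
    """image_feats: (R, D) regional image features (e.g. ResNet grid cells).
    question_emb: (D,) question embedding (e.g. from ByteNet).
    Returns a probability distribution over a fixed answer set."""
    # Attention scores: how relevant each region is to the question.
    scores = image_feats @ W_att @ question_emb          # (R,)
    weights = softmax(scores)                            # (R,)
    # Attended image feature: weighted sum of the regions.
    attended = weights @ image_feats                     # (D,)
    # Fuse modalities (element-wise product here) and classify.
    fused = attended * question_emb                      # (D,)
    return softmax(fused @ W_cls)                        # (num_answers,)

# Toy usage with random features.
rng = np.random.default_rng(0)
R, D, A = 196, 512, 1000                                 # regions, feature dim, answers
probs = soft_attention_vqa(rng.normal(size=(R, D)),
                           rng.normal(size=D),
                           rng.normal(size=(D, D)) * 0.01,
                           rng.normal(size=(D, A)) * 0.01)
print(probs.argmax(), probs.max())
```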

    ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese

    Full text link
    In recent years, Visual Question Answering (VQA) has gained significant attention for its diverse applications, including intelligent car assistance, aiding visually impaired individuals, and document image information retrieval using natural language queries. VQA requires effective integration of information from questions and images to generate accurate answers. Neural models for VQA have made remarkable progress on large-scale datasets, with a primary focus on resource-rich languages like English. To address this gap, we introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese while mitigating biases. The dataset comprises over 26,000 images and 30,000 question-answer pairs (QAs), with each question annotated to specify the type of reasoning involved. Leveraging this dataset, we conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations. Furthermore, we present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on questions. The architecture effectively employs transformers to enable simultaneous reasoning over textual and visual data, merging both modalities at an early model stage. The experimental findings demonstrate that our proposed model achieves state-of-the-art performance across four evaluation metrics. The accompanying code and dataset have been made publicly accessible at https://github.com/kvt0012/ViCLEVR. This provision seeks to stimulate advancements within the research community, fostering the development of more multimodal fusion algorithms specifically tailored to address the nuances of low-resource languages, exemplified by Vietnamese. Comment: A pre-print version, submitted to a journal.
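
    The early-fusion idea mentioned above (merging text and visual tokens before joint transformer reasoning) can be sketched as follows in PyTorch. All layer widths, the two-layer encoder, the mean pooling and the classifier head are illustrative assumptions and do not reproduce the PhoVIT architecture.

```python
# Illustrative early multimodal fusion: project both token streams to a shared
# width, concatenate, and run a single transformer encoder over the joint sequence.
import torch
import torch.nn as nn

class EarlyFusionVQA(nn.Module):
    def __init__(self, text_dim=300, vis_dim=2048, d_model=256, num_answers=1000):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, d_model)
        self.vis_proj = nn.Linear(vis_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, text_tokens, vis_tokens):
        # text_tokens: (B, T, text_dim), vis_tokens: (B, R, vis_dim)
        fused = torch.cat([self.text_proj(text_tokens),
                           self.vis_proj(vis_tokens)], dim=1)   # (B, T+R, d_model)
        encoded = self.encoder(fused)                           # joint reasoning over both modalities
        pooled = encoded.mean(dim=1)                            # simple mean pooling
        return self.classifier(pooled)                          # answer logits

# Toy forward pass.
model = EarlyFusionVQA()
logits = model(torch.randn(2, 12, 300), torch.randn(2, 36, 2048))
print(logits.shape)  # torch.Size([2, 1000])
```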

    TallyQA: Answering Complex Counting Questions

    Full text link
    Most counting questions in visual question answering (VQA) datasets are simple and require no more than object detection. Here, we study algorithms for complex counting questions that involve relationships between objects, attribute identification, reasoning, and more. To do this, we created TallyQA, the world's largest dataset for open-ended counting. We propose a new algorithm for counting that uses relation networks with region proposals. Our method lets relation networks be used efficiently with high-resolution imagery. It yields state-of-the-art results compared to baseline and recent systems on both TallyQA and the HowMany-QA benchmark. Comment: To appear in AAAI 2019. To download the dataset, please go to http://www.manojacharya.com/
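
    To make the "relation network over region proposals" idea concrete, here is a minimal NumPy sketch. The pairwise network and readout are reduced to single linear maps, and the question conditioning, dimensions, and count range are assumptions for illustration, not the paper's model.

```python
# Illustrative relation network for counting: every pair of region proposals is
# scored together with the question, and the pooled pair representation is read
# out into scores over possible counts.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relation_count(regions, question, Wg, Wf):
    """regions: (N, D) features of N region proposals.
    question: (D,) question embedding.
    Returns unnormalized scores over possible counts (e.g. 0..15)."""
    pair_sum = 0.0
    for i in range(len(regions)):
        for j in range(len(regions)):
            # Pairs let the model capture relations such as "dogs *next to* benches".
            pair = np.concatenate([regions[i], regions[j], question])  # (3D,)
            pair_sum = pair_sum + relu(pair @ Wg)                      # (H,)
    return pair_sum @ Wf                                               # (num_counts,)

rng = np.random.default_rng(1)
N, D, H, C = 8, 64, 128, 16
scores = relation_count(rng.normal(size=(N, D)), rng.normal(size=D),
                        rng.normal(size=(3 * D, H)) * 0.05,
                        rng.normal(size=(H, C)) * 0.05)
print(int(scores.argmax()))  # predicted count
```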

    Being Negative but Constructively: Lessons Learnt from Creating Better Visual Question Answering Datasets

    Full text link
    Visual question answering (Visual QA) has attracted a lot of attention lately, seen essentially as a form of (visual) Turing test that artificial intelligence should strive to achieve. In this paper, we study a crucial component of this task: how can we design good datasets for the task? We focus on the design of multiple-choice based datasets where the learner has to select the right answer from a set of candidates including the target (i.e., the correct one) and the decoys (i.e., the incorrect ones). Through careful analysis of the results attained by state-of-the-art learning models and human annotators on existing datasets, we show that the design of the decoy answers has a significant impact on how and what the learning models learn from the datasets. In particular, the resulting learner can ignore the visual information, the question, or both while still doing well on the task. Inspired by this, we propose automatic procedures to remedy such design deficiencies. We apply the procedures to reconstruct decoy answers for two popular Visual QA datasets as well as to create a new Visual QA dataset from the Visual Genome project, resulting in the largest dataset for this task. Extensive empirical studies show that the design deficiencies have been alleviated in the remedied datasets and that performance on them is likely a more faithful indicator of the differences among learning models. The datasets are released and publicly available via http://www.teds.usc.edu/website_vqa/. Comment: Accepted for Oral Presentation at NAACL-HLT 2018
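
    As a hedged illustration of what an automatic decoy-construction procedure could look like, the Python sketch below draws decoys from answers that other images received for the same question, so they are plausible given the question alone but wrong for the image at hand. This is only one possible instantiation of the idea and is not the exact procedure used in the paper.

```python
# Toy decoy construction: decoys are answers to the same question on other images.
import random
from collections import defaultdict

def build_decoys(qa_triples, num_decoys=3, seed=0):
    """qa_triples: list of (image_id, question, answer).
    Returns {(image_id, question): [decoy answers]}."""
    rng = random.Random(seed)
    answers_by_question = defaultdict(set)
    for _, question, answer in qa_triples:
        answers_by_question[question].add(answer)

    decoys = {}
    for image_id, question, target in qa_triples:
        # Candidates are plausible for the question but wrong for this image.
        candidates = [a for a in answers_by_question[question] if a != target]
        rng.shuffle(candidates)
        decoys[(image_id, question)] = candidates[:num_decoys]
    return decoys

triples = [("img1", "what color is the cat?", "black"),
           ("img2", "what color is the cat?", "white"),
           ("img3", "what color is the cat?", "orange")]
print(build_decoys(triples))
```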

    Visual Question Answering with Memory-Augmented Networks

    Full text link
    In this paper, we exploit a memory-augmented neural network to predict accurate answers to visual questions, even when those answers occur rarely in the training set. The memory network incorporates both internal and external memory blocks and selectively pays attention to each training exemplar. We show that memory-augmented neural networks are able to maintain a relatively long-term memory of scarce training exemplars, which is important for visual question answering due to the heavy-tailed distribution of answers in a general VQA setting. Experimental results on two large-scale benchmark datasets show the favorable performance of the proposed algorithm in comparison with the state of the art. Comment: CVPR 2018
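
    A minimal NumPy sketch of the external-memory read described here: the current question-image feature attends over stored training exemplars and the retrieved content augments the representation before classification. The memory layout, the cosine-similarity addressing, and the concatenation are illustrative assumptions, not the authors' exact design.

```python
# Illustrative external-memory read with attention over stored exemplars.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def memory_read(query, memory_keys, memory_values):
    """query: (D,) fused question-image feature.
    memory_keys/values: (M, D) features of stored (possibly rare) exemplars."""
    # Cosine similarity between the query and every memory slot.
    sims = (memory_keys @ query) / (
        np.linalg.norm(memory_keys, axis=1) * np.linalg.norm(query) + 1e-8)
    weights = softmax(sims)                    # attention over exemplars
    retrieved = weights @ memory_values        # (D,) blended memory content
    return np.concatenate([query, retrieved])  # augmented representation

rng = np.random.default_rng(2)
D, M = 128, 500
augmented = memory_read(rng.normal(size=D),
                        rng.normal(size=(M, D)),
                        rng.normal(size=(M, D)))
print(augmented.shape)  # (256,)
```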

    Grounding semantics in robots for Visual Question Answering

    Get PDF
    In this thesis I describe an operational implementation of an object detection and description system, incorporate it into an end-to-end Visual Question Answering system, and evaluate it on two visual question answering datasets for compositional language and elementary visual reasoning.