Question Type Guided Attention in Visual Question Answering
Visual Question Answering (VQA) requires integrating feature maps with drastically different structures and focusing on the correct regions. Image descriptors have structure at multiple spatial scales, while lexical inputs inherently follow a temporal sequence and naturally cluster into semantically different question types. Many previous works use complex models to extract feature representations but neglect high-level summary information, such as question type, during learning. In this work, we propose Question Type-guided Attention (QTA), which uses question-type information to dynamically balance between bottom-up and top-down visual features, respectively extracted from ResNet and Faster R-CNN networks. We experiment with multiple VQA architectures and extensive input ablation studies on the TDIUC dataset, and show that QTA systematically improves performance by more than 5% over the state of the art across multiple question-type categories such as "Activity Recognition", "Utility", and "Counting". By adding QTA to the state-of-the-art MCB model, we achieve a 3% improvement in overall accuracy. Finally, we propose a multi-task extension that predicts question types, generalizing QTA to applications that lack question-type labels, with minimal performance loss.
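The balancing mechanism described above can be illustrated with a minimal sketch: a question-type vector produces a gate that mixes the two visual feature streams. All names, dimensions, and weights here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of question-type-guided gating between two visual
# feature streams (e.g. one from ResNet, one from Faster R-CNN).
# Weights and dimensions are made up for illustration.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def qta_gate(type_onehot, type_weights):
    """Map a one-hot question-type vector to a scalar gate in (0, 1)."""
    score = sum(t * w for t, w in zip(type_onehot, type_weights))
    return sigmoid(score)

def fuse_features(feats_a, feats_b, gate):
    """Convex combination of the two visual feature vectors."""
    return [gate * a + (1.0 - gate) * b for a, b in zip(feats_a, feats_b)]

# Toy usage: a "Counting" question leans heavily on one feature stream.
type_onehot = [0, 1, 0]        # e.g. ["Activity", "Counting", "Utility"]
type_weights = [-2.0, 3.0, 0.5]  # hypothetical learned per-type parameters
g = qta_gate(type_onehot, type_weights)
fused = fuse_features([1.0, 0.0], [0.0, 1.0], g)
```

In a trained model the gate parameters would be learned jointly with the rest of the network; this sketch only shows the shape of the computation.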
Recent, rapid advancement in visual question answering architecture: a review
Understanding visual question answering is going to be crucial for numerous
human activities. However, it presents major challenges at the heart of the
artificial intelligence endeavor. This paper presents an update on the rapid
advancements in visual question answering using images that have occurred in
the last couple of years. A tremendous amount of research on improving visual question answering system architectures has been published recently, showing the importance of multimodal architectures. Several benefits of visual question answering are discussed in the review paper by Manmadhan et al. (2020), on which the present article builds, including subsequent updates in the field.
Application of multimodal machine learning to visual question answering
Master’s Degree in ICT Research and Innovation (i2-ICT)

Due to the great advances in Natural Language Processing and Computer Vision in recent years with neural networks and attention mechanisms, great interest in VQA has been awakened, and it has begun to be considered the “Visual Turing Test” for modern AI systems, since it involves answering a question about an image, where the system has to learn to understand and reason about the image and question shown. One of the main reasons for this great interest is the large number of potential applications that these systems enable, such as medical applications for diagnosis from an image, assistants for blind people, e-learning applications, etc.

In this Master’s thesis, a study of the state of the art of VQA is carried out, investigating both existing techniques and datasets. A development is then undertaken in order to reproduce state-of-the-art results with the latest VQA models, with the aim of applying them and experimenting on new datasets.

Experiments were first carried out with the MoViE+MCAN model [1] [2] (winner of the 2020 VQA Challenge); after observing its non-viability due to resource issues, we switched to the LXMERT model [3], which is pre-trained on 5 subtasks and allows fine-tuning on several downstream tasks, in this specific case the VQA task on the VQA v2.0 dataset [4].

As the main result of this thesis, we show experimentally that LXMERT, starting from the pre-trained model provided by its GitHub repository [5], provides results similar to MoViE+MCAN (the best known method for VQA) on the most recent and demanding benchmarks, with fewer resources.
Image captioning and visual question answering with external knowledge
The fields of computer vision and natural language processing have made significant advances in visual question answering (VQA) and image captioning. However, a limitation of models in use today is that they typically perform poorly when the task requires common sense or external knowledge. Motivated by this observation, this work explores the benefits of multi-source external knowledge for these two tasks. Three kinds of external knowledge are evaluated: knowledge bases, reverse image search, and image search by text. This work demonstrates the advantage of these external knowledge sources via experiments on two image captioning datasets (COCO-Captions and VizWiz-Captions) and three visual question answering datasets (VQAv2, VizWiz-VQA, and OK-VQA).