Question Type Guided Attention in Visual Question Answering
Visual Question Answering (VQA) requires integrating feature maps with drastically different structures and focusing on the correct regions. Image descriptors have structure at multiple spatial scales, while lexical inputs inherently follow a temporal sequence and naturally cluster into semantically different question types. Many previous works use complex models to extract feature representations but neglect high-level summary information, such as question type, during learning. In this work, we propose Question Type-guided Attention (QTA), which uses question-type information to dynamically balance between bottom-up and top-down visual features, respectively extracted from ResNet and Faster R-CNN networks. We experiment with multiple VQA architectures and extensive input ablation studies on the TDIUC dataset, and show that QTA systematically improves performance by more than 5% over the state of the art across multiple question-type categories such as "Activity Recognition", "Utility", and "Counting". By adding QTA to the state-of-the-art MCB model, we achieve a 3% improvement in overall accuracy. Finally, we propose a multi-task extension that predicts question types, generalizing QTA to applications that lack question-type labels, with minimal performance loss.
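The balancing mechanism described above can be illustrated with a minimal sketch: a question-type vector produces a gate that mixes the two visual feature streams. All names, dimensions, and weights here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of question-type-guided gating between two visual
# feature streams (e.g. one from ResNet, one from Faster R-CNN).
# Weights and dimensions are made up for illustration.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def qta_gate(type_onehot, type_weights):
    """Map a one-hot question-type vector to a scalar gate in (0, 1)."""
    score = sum(t * w for t, w in zip(type_onehot, type_weights))
    return sigmoid(score)

def fuse_features(feats_a, feats_b, gate):
    """Convex combination of the two visual feature vectors."""
    return [gate * a + (1.0 - gate) * b for a, b in zip(feats_a, feats_b)]

# Toy usage: a "Counting" question leans heavily on one feature stream.
type_onehot = [0, 1, 0]        # e.g. ["Activity", "Counting", "Utility"]
type_weights = [-2.0, 3.0, 0.5]  # hypothetical learned per-type parameters
g = qta_gate(type_onehot, type_weights)
fused = fuse_features([1.0, 0.0], [0.0, 1.0], g)
```

In a trained model the gate parameters would be learned jointly with the rest of the network; this sketch only shows the shape of the computation.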
Recent, rapid advancement in visual question answering architecture: a review
Understanding visual question answering is going to be crucial for numerous
human activities. However, it presents major challenges at the heart of the
artificial intelligence endeavor. This paper presents an update on the rapid
advancements in visual question answering using images that have occurred in
the last couple of years. A tremendous amount of research on improving visual question answering system architectures has been published recently, showing the importance of multimodal architectures. Several benefits of visual question answering are discussed in the review paper by Manmadhan et al. (2020), on which the present article builds, including subsequent updates in the field.
Application of multimodal machine learning to visual question answering
Master’s Degree in ICT Research and Innovation (i2-ICT)

Due to the great advances in Natural Language Processing and Computer Vision in recent years with neural networks and attention mechanisms, great interest in VQA has been awakened, and it has begun to be considered the “Visual Turing Test” for modern AI systems, since it involves answering a question about an image, where the system has to learn to understand and reason about the image and question shown. One of the main reasons for this great interest is the large number of potential applications that these systems enable, such as medical applications for diagnosis from an image, assistants for blind people, e-learning applications, etc.

In this Master’s thesis, a study of the state of the art of VQA is carried out, investigating both existing techniques and datasets. A development is then undertaken in order to reproduce state-of-the-art results with the latest VQA models, with the aim of applying them and experimenting on new datasets.

Experiments were first carried out with the MoViE+MCAN model [1] [2] (winner of the 2020 VQA Challenge); after observing its non-viability due to resource issues, we switched to the LXMERT model [3], which is pre-trained on 5 subtasks and allows fine-tuning on several downstream tasks, in this specific case the VQA task on the VQA v2.0 dataset [4].

As the main result of this thesis, we show experimentally that LXMERT, starting from the pre-trained model provided by its GitHub repository [5], provides results similar to MoViE+MCAN (the best known method for VQA) on the most recent and demanding benchmarks, with fewer resources.
Image captioning and visual question answering with external knowledge
The fields of computer vision and natural language processing have made significant advances in visual question answering (VQA) and image captioning. However, a limitation of models in use today is that they typically perform poorly when the task requires common sense or external knowledge. Motivated by this observation, this work explores the benefits of multi-source external knowledge for these two tasks. Three kinds of external knowledge are evaluated: knowledge bases, reverse image search, and image search by text. This work demonstrates the advantage of these external knowledge sources via experiments on two image captioning datasets (COCO-Captions and VizWiz-Captions) and three visual question answering datasets (VQAv2, VizWiz-VQA, and OK-VQA).