Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering
We study visual question answering in a setting where the answer has to be
mined from a pool of relevant and irrelevant images given as a context. For
such a setting, a model must first retrieve relevant images from the pool and
answer the question from these retrieved images. We refer to this problem as
retrieval-based visual question answering (RETVQA for short). RETVQA is
distinctly different from, and more challenging than, the traditionally
studied Visual Question Answering (VQA), where a given question has to be
answered with a single relevant image in context. Towards solving the RETVQA
task, we propose a unified Multi Image BART (MI-BART) that takes a question
together with the images retrieved by our relevance encoder and generates
free-form, fluent answers. Further, we
introduce the largest dataset in this space, namely RETVQA, with the
following salient features: it requires multi-image, retrieval-based VQA; it
poses metadata-independent questions over a pool of heterogeneous images; and
it expects a mix of classification-oriented and open-ended generative
answers. Our proposed framework achieves an accuracy of 76.5% and a fluency
of 79.3% on the proposed RETVQA dataset, and also outperforms
state-of-the-art methods by 4.9% and 11.8% on the accuracy and fluency
metrics, respectively, on the image segment of the publicly available WebQA
dataset.
Comment: Accepted to IJCAI 202
Visual Question Answering: A Survey of Methods and Datasets
Visual Question Answering (VQA) is a challenging task that has received
increasing attention from both the computer vision and the natural language
processing communities. Given an image and a question in natural language, it
requires reasoning over visual elements of the image and general knowledge to
infer the correct answer. In the first part of this survey, we examine the
state of the art by comparing modern approaches to the problem. We classify
methods by their mechanism to connect the visual and textual modalities. In
particular, we examine the common approach of combining convolutional and
recurrent neural networks to map images and questions to a common feature
space. We also discuss memory-augmented and modular architectures that
interface with structured knowledge bases. In the second part of this survey,
we review the datasets available for training and evaluating VQA systems. The
various datasets contain questions at different levels of complexity, which
require different capabilities and types of reasoning. We examine in depth the
question/answer pairs from the Visual Genome project, and evaluate the
relevance of the structured annotations of images with scene graphs for VQA.
Finally, we discuss promising future directions for the field, in particular
the connection to structured knowledge bases and the use of natural language
processing models.
Comment: 25 pages
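As a concrete anchor for the "combining convolutional and recurrent neural networks" family the survey describes, here is a minimal sketch of that classic baseline: a CNN embeds the image, an LSTM embeds the question, and the two are fused in a common feature space before answer classification. All dimensions and the tiny CNN are illustrative assumptions, not taken from any specific surveyed paper.

```python
import torch
import torch.nn as nn

class CnnLstmVqa(nn.Module):
    """Classic CNN+RNN VQA baseline: image and question are mapped into a
    shared feature space and fused by element-wise product. The tiny CNN
    stands in for a pretrained backbone such as VGG or ResNet."""
    def __init__(self, vocab_size, num_answers, dim=512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, image, question_ids):
        v = self.cnn(image)                      # (B, dim) image feature
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = h[-1]                                # (B, dim) question feature
        return self.classifier(v * q)            # fuse, then classify

model = CnnLstmVqa(vocab_size=1000, num_answers=10)
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # torch.Size([2, 10])
```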
Generic Attention-model Explainability by Weighted Relevance Accumulation
Attention-based transformer models have achieved remarkable progress in
multi-modal tasks, such as visual question answering. The explainability of
attention-based methods has recently attracted wide interest, as such methods
can explain how token relevancy evolves inside the network by accumulating it
across attention layers. Current methods simply accumulate the token
relevancy before and after each attention process with equal weight.
However, the importance of token values usually differs during relevance
accumulation. In this paper, we propose a weighted relevancy strategy that
takes the importance of token values into consideration, reducing the
distortion introduced by equal accumulation. To evaluate our method, we propose a
unified CLIP-based two-stage model, named CLIPmapper, which processes
vision-and-language tasks through a CLIP encoder and a following mapper.
CLIPmapper combines self-attention, cross-attention, single-modality, and
cross-modality attention, making it well suited for evaluating our generic
explainability method. Extensive perturbation tests on visual question
answering and image captioning validate that our explainability method
outperforms existing methods.
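The abstract's core idea, replacing equal accumulation with a weighted one, can be made concrete with a short sketch. The accumulation rule below follows the widely used gradient-weighted relevancy propagation of Chefer et al. for self-attention; the per-token `value_weights` argument is an assumption about where the paper's value-importance weighting would enter, since the abstract does not give the exact formula.

```python
import torch

def accumulate_relevance(attn_maps, grads, value_weights=None):
    """Propagate token relevancy across self-attention layers.
    attn_maps, grads: lists of (heads, n, n) attention maps and their
    gradients; value_weights: optional list of (n,) per-token weights,
    e.g. derived from value-vector norms (an assumption)."""
    n = attn_maps[0].shape[-1]
    R = torch.eye(n)  # each token starts fully relevant to itself
    for layer, (A, G) in enumerate(zip(attn_maps, grads)):
        A_bar = (G * A).clamp(min=0).mean(dim=0)  # head-averaged, gradient-weighted
        update = A_bar @ R
        if value_weights is not None:             # weighted, not equal, accumulation
            update = value_weights[layer].unsqueeze(-1) * update
        R = R + update
    return R
```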
Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from external knowledge bases to answer visually-grounded
questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong
framework to tackle KB-VQA, first retrieves related documents with Dense
Passage Retrieval (DPR) and then uses them to answer questions. This paper
proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which
significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major
limitations in RA-VQA's retriever: (1) image representations obtained via
image-to-text transforms can be incomplete and inaccurate, and (2) relevance
scores between queries and documents are computed with one-dimensional
embeddings, which can be insensitive to finer-grained relevance. FLMR
overcomes these limitations by obtaining image representations that
complement those from the image-to-text transforms, using a vision model
aligned with an existing text-based retriever through a simple alignment
network. FLMR also encodes images and questions with multi-dimensional
embeddings to capture finer-grained relevance between queries and documents.
FLMR significantly improves the original RA-VQA retriever's PRRecall@5 by
approximately 8%.
Finally, we equipped RA-VQA with two state-of-the-art large
multi-modal/language models to achieve a strong VQA score on the OK-VQA
dataset.
Comment: To appear at NeurIPS 2023. This is the camera-ready version. We
fixed some numbers and added more experiments to address reviewers' comments.
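FLMR's "multi-dimensional embeddings" refer to late interaction: each query and document is a matrix of token embeddings, and relevance is scored token by token rather than through one pooled vector. A minimal ColBERT-style MaxSim scorer, the mechanism this family of retrievers builds on, looks like this (the toy tensors are illustrative):

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_emb, doc_emb):
    """ColBERT-style late interaction: for each query token, take its best
    match among document tokens, then sum. query_emb: (nq, d),
    doc_emb: (nd, d), both with L2-normalized rows."""
    sim = query_emb @ doc_emb.T         # (nq, nd) token-level similarities
    return sim.max(dim=1).values.sum()  # MaxSim per query token, summed

# Toy usage: token-level matching preserves fine-grained relevance that a
# single pooled (one-dimensional) embedding would blur together.
q = F.normalize(torch.randn(4, 8), dim=-1)
d = F.normalize(torch.randn(12, 8), dim=-1)
print(late_interaction_score(q, d))
```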