Weakly Supervised Visual Question Answer Generation
Growing interest in conversational agents is promoting two-way human-computer
communication, and asking and answering visual questions has become an active
area of research in AI. Thus, the generation of visual question-answer pair(s)
becomes an important and challenging task. To address this issue, we propose a
weakly-supervised visual question answer generation method that generates
relevant question-answer pair(s) for a given input image and
associated caption. Most of the prior works are supervised and depend on the
annotated question-answer datasets. In our work, we present a weakly supervised
method that synthetically generates question-answer pairs procedurally from
visual information and captions. The proposed method first extracts a list of
answer words, then performs nearest question generation, using the caption and
an answer word to produce a synthetic question. Next, the relevant question
generator converts the nearest question into a relevant natural-language
question via dependency parsing and in-order tree traversal; finally, we
fine-tune a ViLBERT model with the generated question-answer pair(s). We
perform an exhaustive experimental analysis on the VQA dataset and find that
our model significantly outperforms SOTA methods on BLEU scores. We also
report results against baseline models along with an ablation study.
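As a rough illustration of the kind of pipeline this abstract describes, the sketch below extracts candidate answer words from a caption and masks each one with a wh-word to form a synthetic question. The noun-based answer extraction, the wh-word table, and all function names are illustrative assumptions, a crude stand-in for the paper's dependency-parsing and tree-traversal modules rather than the authors' implementation (requires spaCy with the en_core_web_sm model).

```python
import spacy

# Assumes `python -m spacy download en_core_web_sm` has been run.
nlp = spacy.load("en_core_web_sm")

# Hypothetical mapping from entity type to question word.
WH_WORD = {"PERSON": "who", "GPE": "where", "LOC": "where", "CARDINAL": "how many"}

def extract_answer_words(caption):
    """Step 1: collect candidate answer words (here simply nouns) from the caption."""
    return [tok.text for tok in nlp(caption) if tok.pos_ in ("NOUN", "PROPN")]

def nearest_question(caption, answer):
    """Step 2: form a 'nearest question' by replacing the answer word
    in the caption with a wh-word chosen from its entity type."""
    doc = nlp(caption)
    ent_type = {tok.text: tok.ent_type_ for tok in doc}
    wh = WH_WORD.get(ent_type.get(answer, ""), "what")
    words = [wh if tok.text == answer else tok.text for tok in doc]
    return " ".join(words).rstrip(" .") + "?"

caption = "A man is riding a horse on the beach"
for ans in extract_answer_words(caption):
    print(nearest_question(caption, ans), "->", ans)
```

A full relevant-question generator would then rewrite these masked templates into fluent questions via dependency parsing and in-order tree traversal before fine-tuning ViLBERT on the resulting pairs.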
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
This work presents DocPedia, a novel large multimodal model (LMM) for
versatile OCR-free document understanding, capable of parsing images up to
2,560×2,560 resolution. Unlike existing works, which either struggle with
high-resolution documents or abandon the large language model and are thus
constrained in vision or language ability, our DocPedia directly processes
visual input in the frequency domain rather than the pixel space. This unique
characteristic
enables DocPedia to capture a greater amount of visual and textual information
using a limited number of visual tokens. To consistently enhance both
perception and comprehension abilities of our model, we develop a dual-stage
training strategy and enrich instructions/annotations of all training tasks
covering multiple document types. Extensive quantitative and qualitative
experiments conducted on various publicly available benchmarks confirm the
mutual benefits of jointly learning perception and comprehension tasks. The
results provide further evidence of the effectiveness and superior performance
of our DocPedia over other methods.
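The core idea, processing visual input in the frequency domain so that a few tokens summarize many pixels, can be illustrated with a JPEG-style blockwise DCT. The 8×8 block size, the row-major coefficient truncation, and the function name below are assumptions for illustration, not DocPedia's actual encoder.

```python
import numpy as np
from scipy.fft import dctn

def to_frequency_tokens(img, block=8, keep=16):
    """Tile a grayscale image into block x block patches, apply a 2-D DCT
    to each patch, and keep only the first `keep` coefficients per patch
    (a crude proxy for low-frequency selection)."""
    h, w = img.shape
    h, w = h - h % block, w - w % block                 # crop to a multiple of the block size
    tiles = (img[:h, :w]
             .reshape(h // block, block, w // block, block)
             .transpose(0, 2, 1, 3))                    # (rows, cols, block, block)
    coeffs = dctn(tiles, axes=(-2, -1), norm="ortho")   # 2-D DCT per tile
    return coeffs.reshape(-1, block * block)[:, :keep]  # one compact token per tile

img = np.random.rand(2560, 2560)    # stand-in for a 2,560x2,560 document page
tokens = to_frequency_tokens(img)
print(tokens.shape)                 # (102400, 16): 4x fewer values than raw pixels
```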
Generative Visual Question Answering
Multi-modal tasks involving vision and language in deep learning continue to
rise in popularity and are leading to the development of newer models that can
generalize beyond the extent of their training data. Current models lack
temporal generalization, the ability to adapt to changes in future data. This
paper discusses a viable approach to creating an advanced Visual
Question Answering (VQA) model which can produce successful results on temporal
generalization. We propose a new dataset, GenVQA, utilizing images and
captions from the VQAv2 and MS-COCO datasets to generate new images through
Stable Diffusion. This augmented dataset is then used to test a combination of
seven baseline and cutting-edge VQA models. Performance evaluation focuses on
questions mirroring the original VQAv2 dataset, with the answers having been
adjusted to the new images. This paper's purpose is to investigate the
robustness of several successful VQA models to assess their performance on
future data distributions. Model architectures are analyzed to identify common
stylistic choices that improve generalization under temporal distribution
shifts. This research highlights the importance of creating a large-scale,
future-shifted dataset. Such data can enhance the robustness of VQA models,
allowing their successors to better adapt to temporal distribution shifts.
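A minimal sketch of the augmentation step described above, regenerating images from MS-COCO-style captions with Stable Diffusion via the diffusers library, might look like the following. The checkpoint name, inference settings, and example captions are illustrative assumptions; the paper's exact generation recipe is not specified here.

```python
import torch
from diffusers import StableDiffusionPipeline

# Checkpoint name is an assumption; any Stable Diffusion checkpoint works here.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

captions = [
    "A man is riding a horse on the beach",    # stand-ins for VQAv2/MS-COCO captions
    "Two children playing frisbee in a park",
]

# Generate one "future-shifted" image per caption; the original VQAv2
# question-answer pairs are then adjusted against these new images.
for i, caption in enumerate(captions):
    image = pipe(caption, num_inference_steps=30).images[0]
    image.save(f"genvqa_{i:05d}.png")
```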
Understanding Video Scenes through Text: Insights from Text-based Video Question Answering
Researchers have extensively studied the field of vision and language,
discovering that both visual and textual content is crucial for understanding
scenes effectively. Particularly, comprehending text in videos holds great
significance, requiring both scene text understanding and temporal reasoning.
This paper focuses on exploring two recently introduced datasets, NewsVideoQA
and M4-ViteVQA, which aim to address video question answering based on textual
content. The NewsVideoQA dataset contains question-answer pairs related to the
text in news videos, while M4-ViteVQA comprises question-answer pairs from
diverse categories like vlogging, traveling, and shopping. We provide an
analysis of the formulation of these datasets on various levels, exploring the
degree of visual understanding and multi-frame comprehension required for
answering the questions. Additionally, the study includes experimentation with
BERT-QA, a text-only model, which demonstrates comparable performance to the
original methods on both datasets, indicating the shortcomings in the
formulation of these datasets. Furthermore, we also look into the domain
adaptation aspect by examining the effectiveness of training on M4-ViteVQA and
evaluating on NewsVideoQA and vice-versa, thereby shedding light on the
challenges and potential benefits of out-of-domain training.
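The text-only baseline discussed above can be approximated with an extractive question-answering model applied to OCR-recognized text alone, ignoring the visual signal entirely. The model checkpoint and the OCR string below are illustrative assumptions (requires the transformers library).

```python
from transformers import pipeline

# Checkpoint is an assumed SQuAD-tuned BERT; any extractive QA model works.
qa = pipeline("question-answering", model="deepset/bert-base-cased-squad2")

# Stand-in for text spotted across the frames of a news video.
ocr_context = "BREAKING NEWS Heavy rain floods downtown Chennai on Monday"
question = "Which city was flooded?"

result = qa(question=question, context=ocr_context)
print(result["answer"], result["score"])
```

That a model like this, with no access to the frames themselves, can match the original methods is exactly the formulation gap the abstract points out.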
What Large Language Models Bring to Text-rich VQA?
Text-rich VQA, namely Visual Question Answering based on text recognition in
the images, is a cross-modal task that requires both image comprehension and
text recognition. In this work, we focus on investigating the advantages and
bottlenecks of LLM-based approaches in addressing this problem. To address the
above concern, we separate the vision and language modules, where we leverage
external OCR models to recognize texts in the image and Large Language Models
(LLMs) to answer the question given the texts. The whole framework is
training-free, benefiting from the in-context ability of LLMs. This pipeline
achieves superior
performance compared to the majority of existing Multimodal Large Language
Models (MLLMs) on four text-rich VQA datasets. Besides, based on the ablation
study, we find that the LLM brings stronger comprehension ability and may
introduce helpful knowledge for the VQA problem. The bottleneck for LLMs in
addressing text-rich VQA problems may primarily lie in the visual part. We
also combine the OCR module with MLLMs and are pleased to find that this
combination works as well. It is worth noting that not all MLLMs can
comprehend the OCR information, which provides insights into how to train an
MLLM that preserves the abilities of the LLM.
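A minimal sketch of the decoupled, training-free pipeline described above: an external OCR model recognizes the text, and an LLM answers from it in context. The OCR engine (pytesseract), the model name, and the prompt template are illustrative assumptions, not the paper's exact setup.

```python
import pytesseract
from PIL import Image
from openai import OpenAI

def ocr_then_ask(image_path, question):
    # Step 1: recognize text in the image with an off-the-shelf OCR engine.
    text = pytesseract.image_to_string(Image.open(image_path))
    # Step 2: let the LLM answer from the recognized text alone
    # (in-context, no training involved).
    prompt = (f"The following text was recognized in an image:\n{text}\n"
              f"Question: {question}\nAnswer briefly.")
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; swap for any chat LLM
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ocr_then_ask("poster.png", "What date is the event?"))
```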
The Impact of Explanations on AI Competency Prediction in VQA
Explainability is one of the key elements for building trust in AI systems.
Among numerous attempts to make AI explainable, quantifying the effect of
explanations remains a challenge in conducting human-AI collaborative tasks.
Aside from the ability to predict the overall behavior of AI, in many
applications, users need to understand an AI agent's competency in different
aspects of the task domain. In this paper, we evaluate the impact of
explanations on the user's mental model of AI agent competency within the task
of visual question answering (VQA). We quantify users' understanding of
competency, based on the correlation between the actual system performance and
user rankings. We introduce an explainable VQA system that uses spatial and
object features and is powered by the BERT language model. Each group of users
sees only one kind of explanation to rank the competencies of the VQA model.
The proposed model is evaluated through between-subject experiments to probe
explanations' impact on the user's perception of competency. The comparison
between two VQA models shows that BERT-based explanations and the use of
object features improve the user's prediction of the model's competencies.
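The competency-understanding measure described above, the correlation between actual per-aspect system performance and user rankings, can be computed with, for example, Spearman's rank correlation. The aspect names and numbers below are illustrative, not the paper's data.

```python
from scipy.stats import spearmanr

# Hypothetical per-aspect accuracy of the VQA model
# (e.g., counting, color, spatial reasoning, object recognition).
actual_accuracy = [0.42, 0.71, 0.55, 0.88]

# One user's ranking of the model's competency on the same aspects
# (1 = judged weakest, 4 = judged strongest).
user_ranking = [1, 3, 2, 4]

rho, p = spearmanr(actual_accuracy, user_ranking)
print(f"rank correlation = {rho:.2f} (p = {p:.3f})")  # 1.00 here: a perfect mental model
```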