Adapting Visual Question Answering Models for Enhancing Multimodal Community Q&A Platforms
Question categorization and expert retrieval methods have been crucial for
information organization and accessibility in community question answering
(CQA) platforms. Research in this area, however, has dealt with only the text
modality. With the increasing multimodal nature of web content, we focus on
extending these methods for CQA questions accompanied by images. Specifically,
we leverage the success of representation learning for text and images in the
visual question answering (VQA) domain, and adapt the underlying concept and
architecture for automated category classification and expert retrieval on
image-based questions posted on Yahoo! Chiebukuro, the Japanese counterpart of
Yahoo! Answers.
To the best of our knowledge, this is the first work to tackle the
multimodality challenge in CQA, and to adapt VQA models for tasks on a more
ecologically valid source of visual questions. Our analysis of the differences
between visual QA and community QA data drives our proposal of novel
augmentations of an attention method tailored for CQA, and the use of auxiliary
tasks for learning better grounding features. Our final model markedly
outperforms the text-only and VQA model baselines for both tasks of
classification and expert retrieval on real-world multimodal CQA data.
Comment: Submitted for review at CIKM 201
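As a rough illustration of the kind of text-guided attention fusion this abstract alludes to, the sketch below classifies an image-question pair into CQA categories by attending over image region features with the question embedding. The module names, dimensions, and the specific attention form are assumptions for illustration, not the authors' actual architecture.

```python
# Minimal sketch (not the paper's exact model): question-guided attention over
# image region features, fused with the question embedding for CQA category
# classification. Dimensions and layer choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveCategoryClassifier(nn.Module):
    def __init__(self, text_dim=768, img_dim=2048, hidden=512, n_categories=20):
        super().__init__()
        self.q_proj = nn.Linear(text_dim, hidden)   # project question embedding
        self.v_proj = nn.Linear(img_dim, hidden)    # project image region features
        self.att = nn.Linear(hidden, 1)             # scalar attention score per region
        self.classifier = nn.Linear(hidden * 2, n_categories)

    def forward(self, q_emb, img_regions):
        # q_emb: (B, text_dim); img_regions: (B, R, img_dim)
        q = self.q_proj(q_emb)                               # (B, hidden)
        v = self.v_proj(img_regions)                         # (B, R, hidden)
        scores = self.att(torch.tanh(v + q.unsqueeze(1)))    # (B, R, 1)
        weights = F.softmax(scores, dim=1)                   # attention over regions
        v_att = (weights * v).sum(dim=1)                     # attended image vector
        fused = torch.cat([q, v_att], dim=-1)                # text + image fusion
        return self.classifier(fused)                        # category logits
```

An expert-retrieval head could reuse the same fused representation, scoring candidate answerers instead of categories; that extension is likewise only a plausible reading of the abstract.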
FVQA: Fact-based Visual Question Answering
Visual Question Answering (VQA) has attracted a lot of attention in both
Computer Vision and Natural Language Processing communities, not least because
it offers insight into the relationships between two important sources of
information. Current datasets, and the models built upon them, have focused on
questions which are answerable by direct analysis of the question and image
alone. The set of such questions that require no external information to answer
is interesting, but very limited. It excludes questions which require common
sense, or basic factual knowledge to answer, for example. Here we introduce
FVQA, a VQA dataset which requires, and supports, much deeper reasoning. FVQA
only contains questions which require external information to answer.
We thus extend a conventional visual question answering dataset, which
contains image-question-answer triplets, through additional
image-question-answer-supporting fact tuples. The supporting fact is
represented as a structural triplet, such as <Cat, CapableOf, ClimbingTrees>.
We evaluate several baseline models on the FVQA dataset, and describe a novel
model which is capable of reasoning about an image on the basis of supporting
facts.
Comment: 16 pages
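To make the data layout concrete, here is a hedged sketch of how an image-question-answer-supporting-fact tuple could be represented in code, with the supporting fact stored as a structural triplet. The field names and sample values are illustrative assumptions, not the dataset's actual schema.

```python
# Illustrative sketch only: FVQA-style tuples with the supporting fact kept as
# a (subject, relation, object) structural triplet. Names are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class SupportingFact:
    subject: str    # e.g. "Cat"
    relation: str   # e.g. "CapableOf"
    obj: str        # e.g. "ClimbingTrees"

@dataclass
class FVQAExample:
    image_id: str
    question: str
    answer: str
    fact: SupportingFact

example = FVQAExample(
    image_id="img_000123",
    question="Which animal in this image can climb trees?",
    answer="cat",
    fact=SupportingFact("Cat", "CapableOf", "ClimbingTrees"),
)

def answer_from_fact(ex: FVQAExample) -> str:
    # Trivial stand-in for "reasoning on the basis of supporting facts":
    # ground the answer in the subject of the matched fact.
    return ex.fact.subject.lower()
```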
Multilingual Augmentation for Robust Visual Question Answering in Remote Sensing Images
Visual question answering for remote sensing data (RSVQA), which aims to answer
questions about the content of remotely sensed images, has recently attracted
much attention. However, previous RSVQA work has paid little attention to model
robustness. Since we aim to enhance the reliability of RSVQA models, the key
challenge is learning representations that remain robust to new words and to
differently phrased question templates with the same meaning. With the proposed
multilingually augmented dataset, we obtain additional questions that preserve
the meaning of the original ones. To make better use of this information, in
this study we propose a contrastive learning strategy for training RSVQA models
that are robust to diverse question templates and wordings. Experimental results
demonstrate that the proposed augmented dataset is effective in improving the
robustness of the RSVQA model. In addition, the contrastive learning strategy
performs well on the low resolution (LR) dataset.
Comment: This paper was submitted to the JURSE 2023 conference on November 5,
202
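As a hedged illustration of the kind of contrastive objective described above, the sketch below pulls together embeddings of an original question and its augmented paraphrase while treating the other questions in the batch as negatives. This generic InfoNCE-style formulation is an assumption, not necessarily the paper's exact loss.

```python
# Hedged sketch of a contrastive loss over paraphrased questions: matching
# original/augmented pairs sit on the diagonal of the similarity matrix and
# are treated as the positive class. Temperature value is an assumption.
import torch
import torch.nn.functional as F

def question_contrastive_loss(orig_emb, aug_emb, temperature=0.07):
    # orig_emb, aug_emb: (B, D) embeddings of original / augmented questions
    orig = F.normalize(orig_emb, dim=-1)
    aug = F.normalize(aug_emb, dim=-1)
    logits = orig @ aug.t() / temperature                    # (B, B) similarities
    targets = torch.arange(orig.size(0), device=orig.device)
    return F.cross_entropy(logits, targets)
```

In practice such a term would typically be added to the answer-prediction objective, encouraging the question encoder to produce paraphrase-invariant features.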